Orthographic Variation in Chinese

Jack Halpern
CEO
The CJK Dictionary Institute, Inc.
株式会社日中韓辭典研究所

Index to This Document

Introduction
Simplified and Traditional Chinese
Traditional Chinese Variants
Synonym and Cross-Language Expansion
Mapping Tables and Lexical Databases
Documents for Reference

About the Author

1. Introduction

This report provides an overview of the linguistic and orthographic issues related to the processing of Chinese texts, especially webpages. It aims to enable software developers to acquire a basic knowledge of the complexities of the Chinese script, especially orthographic variation, from the point of view of building linguistic tools such as intelligent information retrieval tools, input method editors, and machine translation systems.

Abbreviations used in this document
SC	Simplified Chinese
TC	Traditional Chinese
STC	"Simplified" Traditional Chinese (explained below)
TTC	"Traditional" Traditional Chinese (explained below)
C2C	Chinese to Chinese (SC-to/from-TC) conversion

The complexity of the Chinese writing system is well known. Some of the linguistic factors that contribute to this include the large number of characters in common use, the complexity of the character forms, the major differences between Traditional Chinese and Simplified Chinese along various dimensions (orthography, phonology, semantics), the presence of numerous orthographic variants in Traditional Chinese, and others.

From an information processing point of view (which is beyond the scope of this report), there are complex issues such as the use of multiple character sets, multiple encodings, the incompatibility between character sets, a plethora of input methods, and more.

Of crucial importance to text processing is the fact that there is a significant amount of orthographic variation. The existence of numerous orthographic variants, especially in Traditional Chinese, and the high degree of sophistication required for performing accurate conversion between SC and TC, pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways.

This report focuses on the major types of orthographic variation in Chinese, and provides a brief analysis of the linguistic issues to be considered by software developers.

2. Simplified and Traditional Chinese

As a result of the large-scale language reforms undertaken in the PRC in the postwar period, thousands of character forms underwent drastic simplifications. Chinese written in these simplified forms is called Simplified Chinese (SC). Taiwan, Hong Kong, and most overseas Chinese communities continue to use the old, complex forms, referred to as Traditional Chinese (TC).

The process of automatically converting SC to/from TC, referred to as C2C conversion, is full of complexities and pitfalls. The conversion can be implemented on three levels of increasing sophistication, briefly described below. For more details, see the author's The Pitfalls and Complexities of Chinese to Chinese Conversion.

2.1 Code Conversion

The easiest, but most unreliable, way to perform C2C conversion is on a codepoint-to-codepoint basis by looking the source up in a mapping table, such as the one shown below. This is referred to as Code Conversion or transcoding.

**Table 1**
Code Conversion
SC	TC1	TC2	TC3	TC4	Remarks
门	門				one-to-one
汤	湯				one-to-one
发	發	髮			one-to-many
暗	暗	闇			one-to-many
干	幹	乾	干	榦	one-to-many

Because of the numerous one-to-many ambiguities (which occur in both the SC-to-TC and the TC-to-SC directions), the rate of conversion failure is significant. Without time-consuming human proofreading, the results of code conversion are unacceptable. For more details, see Code Conversion.
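The failure mode can be illustrated with a minimal sketch, assuming a toy mapping table built only from the rows of Table 1. The names `SC_TO_TC`, `code_convert`, and `ambiguous_chars` are hypothetical and chosen for illustration; a real table would cover thousands of characters.

```python
# Hypothetical one-to-many SC-to-TC table built from the rows of Table 1.
SC_TO_TC = {
    "门": ["門"],
    "汤": ["湯"],
    "发": ["發", "髮"],
    "暗": ["暗", "闇"],
    "干": ["幹", "乾", "干", "榦"],
}


def code_convert(text):
    """Replace each character with its first TC candidate, ignoring context."""
    return "".join(SC_TO_TC.get(ch, [ch])[0] for ch in text)


def ambiguous_chars(text):
    """Return the characters in text that have more than one TC candidate."""
    return [ch for ch in text if len(SC_TO_TC.get(ch, [ch])) > 1]


print(code_convert("门"))       # unambiguous character
print(code_convert("干燥"))     # picks the first candidate for 干, which is wrong here
print(ambiguous_chars("出发"))  # characters that would need human review
```

Because the function always takes the first candidate, 干燥 'dry' comes out as 幹燥 rather than the correct 乾燥, which is exactly the kind of error that requires proofreading.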

2.2 Orthographic Conversion

The next level of sophistication in C2C conversion is referred to as orthographic conversion, because the items being converted are orthographic units, rather than mere codepoints in a character set. That is, they are meaningful linguistic units such as single-character free words, bound morphemes (such as affixes), and multi-character compound words. As shown in Table 1 above, code conversion is ambiguous because of the numerous one-to-many mappings. Successful C2C conversion depends on the context and requires orthographic mapping tables on the word level, as shown below.

**Table 2**
Orthographic Conversion
English	SC	TC1	TC2	Incorrect Candidates	Comments
telephone	电话	電話			unambiguous
we	我们	我們			unambiguous
start-off	出发	出發		出髮, 齣髮, 齣發	one-to-many
dry	干燥	乾燥		干燥, 幹燥, 榦燥	one-to-many
	阴干	陰乾	陰干		depends on context

As can be seen, the ambiguities inherent in code conversion are resolved by using an orthographic mapping table, which avoids false conversions such as shown in the Incorrect Candidates column above. It is important to note that such conversion must be done with the aid of a Chinese Morphological Analyzer (CMA) that can segment the text stream into meaningful units (such as lexemes). For more details, see Orthographic Conversion.
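As a rough sketch of word-level lookup, the example below uses greedy longest-match segmentation against a toy table taken from the unambiguous rows of Table 2. Longest-match is only a simple stand-in for a real Chinese Morphological Analyzer, and the names `WORD_TABLE` and `orthographic_convert` are hypothetical.

```python
# Hypothetical word-level SC-to-TC table built from rows of Table 2.
WORD_TABLE = {
    "电话": "電話",
    "我们": "我們",
    "出发": "出發",
    "干燥": "乾燥",
}
MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def orthographic_convert(text):
    """Convert text word by word using greedy longest-match lookup.

    Characters that are not part of any known word are copied unchanged;
    a fuller system would fall back to a character-level table here.
    """
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += n
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(orthographic_convert("我们出发"))
print(orthographic_convert("干燥"))
```

Because 干燥 is looked up as a whole word, the ambiguous character 干 is never converted in isolation. Entries such as 阴干, whose TC form depends on context, would still need disambiguation beyond a simple table lookup.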

2.3 Lexemic Conversion

A more sophisticated, and far more challenging, approach to C2C conversion is called lexemic conversion, which maps SC and TC words that are semantically, not just orthographically, equivalent. For example, the SC word 信息 (xìnxī) 'information' is converted to the semantically equivalent TC 資訊 (zīxùn). This is similar to the difference between lorry in British English and truck in American English.

There are many lexemic differences between SC and TC words, especially in technical terms and proper nouns. To complicate matters, the correct TC is sometimes locale-dependent, as is shown in the table below.

**Table 4**
Lexemic Conversion
English	SC	Taiwan TC	Hong Kong TC	Other TC	Incorrect (orthographic)	Comments
Software	软件	軟體	軟件		軟件	lexemic
Taxi	出租汽车	計程車	的士	德士	出租汽車	lexemic
Kennedy	肯尼迪	甘迺迪	堅尼地		肯尼迪	lexemic proper noun
Oahu	瓦胡岛	歐胡島			瓦胡島	lexemic proper noun

Studying the above table carefully should make most of the issues clear. For a detailed discussion and more examples, see Lexemic Conversion.
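The locale dependence can be sketched as a lookup keyed by both the SC word and the target locale. The table below contains only the rows shown above; the locale codes `"TW"` and `"HK"` and the names `LEXEMIC_TABLE` and `lexemic_convert` are assumptions made for illustration.

```python
# Hypothetical lexemic table: SC word -> {locale: TC equivalent}.
LEXEMIC_TABLE = {
    "软件": {"TW": "軟體", "HK": "軟件"},
    "出租汽车": {"TW": "計程車", "HK": "的士"},
    "肯尼迪": {"TW": "甘迺迪", "HK": "堅尼地"},
    "瓦胡岛": {"TW": "歐胡島"},
}


def lexemic_convert(word, locale):
    """Return the locale-specific TC equivalent, or None if no entry exists."""
    return LEXEMIC_TABLE.get(word, {}).get(locale)


print(lexemic_convert("软件", "TW"))
print(lexemic_convert("出租汽车", "HK"))
print(lexemic_convert("瓦胡岛", "HK"))  # no Hong Kong form in the table
```

Returning `None` for a missing entry lets the caller decide whether to fall back to orthographic conversion, which is a design choice of this sketch rather than a requirement.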

3. Traditional Chinese Variants

Unlike Simplified Chinese, Traditional Chinese does not have a stable orthography. There are numerous TC variant forms, and much confusion prevails. It is necessary to normalize or expand these variants based on hard-coded mapping tables, such as the ones shown below. For a more detailed discussion, see Variation in Traditional Chinese Orthography.

3.1 TC Variants in Taiwan and Hong Kong

Traditional Chinese dictionaries often disagree on the choice of the standard TC form. There are various reasons for the existence of TC variants:

Some TC forms are not available in the Big Five character set.
Some forms have coexisted historically.
Certain glyphs are unavailable in some fonts.
Simplified character forms are sometimes used, especially in handwriting.

TC variants can be classified into various types, as illustrated in the table below.

**Table 5**
TC Variants
Variant 1	Variant 2	English	Comment
裏	裡	inside	100% interchangeable
敎	教	teach	100% interchangeable
著	着	particle	Variant 2 not in Big5
為	爲	for	Variant 2 not in Big5
沉	沈	sink; surname	partially interchangeable
泄	洩	leak; divulge	partially interchangeable
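Normalization of the fully interchangeable variants can be sketched as follows. The sketch assumes, arbitrarily, that the first member of each pair is the canonical form, and it covers only the two rows marked as 100% interchangeable; the names `VARIANT_GROUPS`, `CANONICAL`, and `normalize` are hypothetical.

```python
# Hypothetical variant groups; the first member is treated as canonical.
# Only the fully interchangeable pairs from Table 5 are included.
VARIANT_GROUPS = [
    ("裏", "裡"),
    ("敎", "教"),
]

# Map every member of a group to the group's canonical form.
CANONICAL = {
    variant: group[0]
    for group in VARIANT_GROUPS
    for variant in group
}


def normalize(text):
    """Replace each known variant character with its canonical form."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)


# Both spellings normalize to the same string, so they match at search time.
print(normalize("這裡") == normalize("這裏"))
```

Partially interchangeable pairs such as 沉 and 沈 are deliberately left out, since replacing them at the character level would corrupt words in which only one of the forms is valid.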

3.2 Mainland vs. Taiwanese Variants

To a limited extent, the traditional forms are still used in the PRC for some classical literature, newspapers for overseas Chinese, etc., based on a standard that maps the SC forms (GB 2312-80) to their corresponding TC forms (GB 12345-90). However, these mappings do not necessarily agree with those widely used in Taiwan. We will refer to the former as Simplified Traditional Chinese (STC), and to the latter as Traditional Traditional Chinese (TTC).

**Table 6**
STC vs. TTC Variants
Pinyin	SC	STC	TTC
xiàn	线	綫	線
bēng	绷	綳	繃
cè	厕	厠	廁
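A minimal sketch of STC-to-TTC character mapping, using only the three rows of the table above, might look like this. It relies on Python's built-in `str.maketrans` and `str.translate`; the names `STC_TO_TTC` and `stc_to_ttc` are hypothetical.

```python
# Hypothetical STC-to-TTC character table built from the rows of Table 6.
STC_TO_TTC = str.maketrans({
    "綫": "線",
    "綳": "繃",
    "厠": "廁",
})


def stc_to_ttc(text):
    """Map PRC-style traditional forms to the forms widely used in Taiwan."""
    return text.translate(STC_TO_TTC)


print(stc_to_ttc("綫"))
print(stc_to_ttc("厠"))
```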

4. Synonym and Cross-Language Expansion

Synonym expansion and cross-language expansion are advanced forms of variant expansion. The details are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants; cross-language expansion generates a picklist of Chinese equivalents to a foreign language source term. A brief example of this is shown in the table below. For a more detailed treatment, see the author's paper Cross-Synonym and Cross-Language Searching in Japanese.

**Table 7**
Synonym and Cross-Language Expansion
Synonyms	时钟	時鐘	時計	錶	鐘	鐘錶
	國家	国家	故乡	故鄉	国土	國土
Cross-language	country	國家	国家
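The two kinds of expansion can be sketched as simple lookups over the rows of the table above. The data structures and the names `SYNONYM_GROUPS`, `CROSS_LANGUAGE`, `expand_synonyms`, and `expand_cross_language` are assumptions for illustration, not a description of any actual product.

```python
# Hypothetical synonym groups and cross-language entries taken from Table 7.
SYNONYM_GROUPS = [
    {"时钟", "時鐘", "時計", "錶", "鐘", "鐘錶"},
    {"國家", "国家", "故乡", "故鄉", "国土", "國土"},
]
CROSS_LANGUAGE = {
    "country": ["國家", "国家"],
}


def expand_synonyms(term):
    """Return a sorted picklist of all terms in the same group as term."""
    for group in SYNONYM_GROUPS:
        if term in group:
            return sorted(group)
    return [term]


def expand_cross_language(term):
    """Return the Chinese equivalents of a foreign-language source term."""
    return CROSS_LANGUAGE.get(term.lower(), [])


print(expand_synonyms("时钟"))
print(expand_cross_language("Country"))
```

A search engine could issue a query for every term in the returned picklist, so that a user searching for one form also retrieves documents containing the others.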

5. Lexical Databases and Conversion Software

In 1996, the CJK Dictionary Institute (CJKI) launched a project to investigate C2C conversion issues in depth, and to build a comprehensive SC↔TC database (now at 1.2 million SC and 1.2 million TC items and growing) whose goal is to achieve near 100% conversion accuracy. We have collaborated with Basis Technology Corporation, a leading provider of CJK software technology, in developing advanced word segmentation technology and a highly accurate C2C conversion engine, which have been released as successful commercial products. These are described in detail at CJK Products.

One of the central components necessary for building tools for processing Chinese orthographic, lexemic and other variants is a database of hard-coded mapping tables fine-tuned to the needs of C2C conversion, variant expansion and normalization. Below is a list of the components of such a database developed by our Institute, most of which have been incorporated into the sophisticated CJK tools developed by Basis Technology.

Traditional Chinese single character variants

Traditional Chinese compound variants

Single character SC-to-TC code-level mapping table

Single character TC-to-SC code-level mapping table

STC-to/from-TTC character mapping table

SC-to/from-TC orthographic mapping table for general vocabulary

SC-to/from-TC orthographic mapping table for proper nouns

SC-to/from-TC lexemic mapping tables

Chinese thesaurus for synonym expansion

Basic Chinese-English dictionary for cross-language expansion

See the author's The Pitfalls and Complexities of Chinese to Chinese Conversion for more details.

6. Documents for Reference

See the following links for more information:

The Pitfalls and Complexities of Chinese to Chinese Conversion

Variation in Traditional Chinese Orthography

About the Author

List of Publications

The CJK Dictionary Institute