This report provides an overview of the linguistic and orthographic issues related to the processing of Chinese texts, especially webpages. It aims to enable software developers to acquire a basic knowledge of the complexities of the Chinese script, especially orthographic variation, from the point of view of building linguistic tools such as intelligent information retrieval tools, input method editors, and machine translation systems.
|STC||"Simplified" Traditional Chinese (explained below)|
|TTC||"Traditional" Traditional Chinese (explained below)|
|C2C||Chinese to Chinese (SC-to/from-TC) conversion|
The complexity of the Chinese writing system is well known. Some of the linguistic factors that contribute to this include the large number of characters in common use, the complexity of the character forms, the major differences between Traditional Chinese and Simplified Chinese along various dimensions (orthography, phonology, semantics), the presence of numerous orthographic variants in Traditional Chinese, and others.
From an information processing point of view (which is beyond the scope of this report), there are complex issues such as the use of multiple character sets, multiple encodings, the incompatibility between character sets, a plethora of input methods, and more.
Of crucial importance to text processing is the fact that there is a significant amount of orthographic variation. The existence of numerous orthographic variants, especially in Traditional Chinese, and the high degree of sophistication required for performing accurate conversion between SC and TC, pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways.
This report focuses on the major types of orthographic variation in Chinese, and provides a brief analysis of the linguistic issues to be considered by software developers.
As a result of the large-scale language reforms undertaken in the PRC in the postwar period, thousands of character forms underwent drastic simplifications. Chinese written in these simplified forms is called Simplified Chinese (SC). Taiwan, Hong Kong, and most overseas Chinese communities continue to use the old, complex forms, referred to as Traditional Chinese (TC).
The process of automatically converting SC to/from TC, referred to as C2C conversion, is full of complexities and pitfalls. The conversion can be implemented on three levels, briefly described below in increasing order of sophistication. For more details, see the author's The Pitfalls and Complexities of Chinese to Chinese Conversion.
The easiest, but most unreliable, way to perform C2C conversion is on a codepoint-to-codepoint basis by looking the source up in a mapping table, such as the one shown below. This is referred to as Code Conversion or transcoding.
Because of the numerous one-to-many ambiguities (which occur in both the SC-to-TC and the TC-to-SC directions), the rate of conversion failure is significant. Without time-consuming human proofreading, the results of code conversion are unacceptable. For more details, see Code Conversion.
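To make the one-to-many problem concrete, here is a minimal Python sketch of code conversion. The table `SC_TO_TC` and the function names are hypothetical, the table holds only a handful of characters taken from the examples in this report, and the order of candidates within each entry is an arbitrary assumption rather than a property of any real mapping standard.

```python
from itertools import product

# Toy SC-to-TC codepoint table (hypothetical; a real table covers thousands
# of characters). The candidate order within each list is arbitrary.
SC_TO_TC = {
    "出": ["出", "齣"],
    "发": ["發", "髮"],
    "干": ["干", "乾", "幹", "榦"],
    "燥": ["燥"],
}


def code_convert(text):
    """Naive transcoding: take the first TC candidate for each codepoint."""
    return "".join(SC_TO_TC.get(ch, [ch])[0] for ch in text)


def all_candidates(text):
    """Enumerate every TC string that the codepoint table allows."""
    options = [SC_TO_TC.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in product(*options)]


print(code_convert("出发"))    # happens to pick the correct form
print(code_convert("干燥"))    # picks 干燥, although the correct TC form is 乾燥
print(all_candidates("出发"))  # one correct form among four candidates
```

Because each character is looked up in isolation, nothing in the table can tell the converter which of the four candidates for 出发 or 干燥 is the right one. That choice requires word-level context.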
The next level of sophistication in C2C conversion is referred to as orthographic conversion, because the items being converted are orthographic units, rather than mere codepoints in a character set. That is, they are meaningful linguistic units such as single-character free words, bound morphemes (such as affixes), and multi-character compound words. As shown in Table 1 above, code conversion is ambiguous because of the numerous one-to-many mappings. Successful C2C conversion depends on the context and requires orthographic mapping tables on the word level, as shown below.
|English||SC||TC||Incorrect Candidates||Comment|
|start-off||出发||出發||出髮 齣髮 齣發||one-to-many|
|dry||干燥||乾燥||干燥 幹燥 榦燥||one-to-many|
|||阴干||陰乾||陰干||depends on context|
As can be seen, the ambiguities inherent in code conversion are resolved by using an orthographic mapping table, which avoids false conversions such as shown in the Incorrect Candidates column above. It is important to note that such conversion must be done with the aid of a Chinese Morphological Analyzer (CMA) that can segment the text stream into meaningful units (such as lexemes). For more details, see Orthographic Conversion.
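The idea can be sketched in Python as follows. This is only an illustration under stated assumptions: `WORD_TABLE` contains just the three word pairs from the table above, `CHAR_TABLE` is a hypothetical character-level fallback, and the greedy longest-match `segment` function is a crude stand-in for a real Chinese Morphological Analyzer, not a description of how any actual CMA works.

```python
# Word-level SC-to-TC table, using the entries from the table above.
WORD_TABLE = {"出发": "出發", "干燥": "乾燥", "阴干": "陰乾"}

# Hypothetical character-level fallback for text the word table does not cover.
CHAR_TABLE = {"阴": "陰", "发": "發"}

MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def segment(text):
    """Greedy longest-match segmentation against WORD_TABLE.

    A stand-in for a real morphological analyzer (assumption).
    """
    units, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Accept a known word, or fall back to a single character.
            if size == 1 or piece in WORD_TABLE:
                units.append(piece)
                i += size
                break
    return units


def orthographic_convert(text):
    """Convert word by word, falling back to character-level mapping."""
    out = []
    for unit in segment(text):
        if unit in WORD_TABLE:
            out.append(WORD_TABLE[unit])
        else:
            out.append("".join(CHAR_TABLE.get(ch, ch) for ch in unit))
    return "".join(out)


print(orthographic_convert("干燥"))  # 乾燥
print(orthographic_convert("阴干"))  # 陰乾
```

Since the lookup key is the whole word, the incorrect candidates never arise. The quality of the result then depends on the coverage of the word table and on how accurately the text is segmented.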
A more sophisticated, and far more challenging, approach to C2C conversion is called lexemic conversion, which maps SC and TC words that are semantically, not just orthographically, equivalent. For example, the SC word 信息 (xìnxī) 'information' is converted to the semantically equivalent TC 資訊 (zīxùn). This is similar to the difference between lorry in British English and truck in American English.
There are many lexemic differences between SC and TC words, especially in technical terms and proper nouns. To complicate matters, the correct TC is sometimes locale-dependent, as is shown in the table below.
|English||SC||Taiwan TC||Hong Kong TC||Other TC||Incorrect|
|Kennedy||肯尼迪||甘迺迪||堅尼地||肯尼迪||lexemic proper noun|
|Oahu||瓦胡岛||歐胡島||瓦胡島||lexemic proper noun|
Studying the above table carefully should make most of the issues clear. For a detailed discussion and more examples, see Lexemic Conversion.
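A locale-aware lexemic lookup might be sketched in Python as below. The table layout, the locale keys `"tw"` and `"hk"`, and the function name are all hypothetical. The entries come from the examples in this section, with the Kennedy forms assigned to locales according to the column order of the table header. Treating 資訊 as a locale-independent default is an assumption made for the sake of the example.

```python
# Hypothetical lexemic table: SC word -> TC equivalents keyed by locale.
LEXEMIC_TABLE = {
    "信息": {"default": "資訊"},              # 'information'
    "肯尼迪": {"tw": "甘迺迪", "hk": "堅尼地"},  # Kennedy
    "瓦胡岛": {"tw": "歐胡島"},               # Oahu
}


def lexemic_convert(word, locale):
    """Return the TC equivalent for the locale, or None if no entry applies.

    A None result signals that the caller should fall back to
    orthographic conversion.
    """
    entry = LEXEMIC_TABLE.get(word)
    if entry is None:
        return None
    return entry.get(locale, entry.get("default"))


print(lexemic_convert("肯尼迪", "tw"))  # 甘迺迪
print(lexemic_convert("肯尼迪", "hk"))  # 堅尼地
print(lexemic_convert("信息", "hk"))    # 資訊
```

Returning `None` rather than the input keeps the lexemic layer separate from the orthographic one, so that words without a lexemic entry are still converted by the word-level tables.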
Unlike Simplified Chinese, Traditional Chinese does not have a stable orthography. There are numerous TC variant forms, and much confusion prevails. It is necessary to normalize or expand these variants based on hard-coded mapping tables, such as the ones shown below. For a more detailed discussion, see Variation in Traditional Chinese Orthography.
Traditional Chinese dictionaries often disagree on the choice of the standard TC form. There are various reasons for the existence of TC variants:
TC variants can be classified into various types, as illustrated in the table below.
|Variant 1||Variant 2||English||Comment|
|著||着||particle||Variant 2 not in Big5|
|為||爲||for||Variant 2 not in Big5|
|沉||沈||sink; surname||partially interchangeable|
|泄||洩||leak; divulge||partially interchangeable|
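As a rough illustration of variant expansion for search, the Python sketch below generates every spelling of a query term from the four variant pairs in the table above. The names `VARIANT_PAIRS`, `VARIANTS`, and `expand_variants` are hypothetical, and the example deliberately ignores the distinction between fully and partially interchangeable pairs.

```python
from itertools import product

# Variant pairs taken from the table above (Variant 1, Variant 2).
VARIANT_PAIRS = [("著", "着"), ("為", "爲"), ("沉", "沈"), ("泄", "洩")]

# Map each character to its variant set, with the character itself first.
VARIANTS = {}
for first, second in VARIANT_PAIRS:
    VARIANTS[first] = [first, second]
    VARIANTS[second] = [second, first]


def expand_variants(term):
    """Return every spelling of term obtained by swapping known variants."""
    options = [VARIANTS.get(ch, [ch]) for ch in term]
    return ["".join(combo) for combo in product(*options)]


print(expand_variants("因為"))  # ['因為', '因爲']
```

For partially interchangeable pairs such as 沉 and 沈, blind expansion over-generates, since 沈 is also a surname. A production table would need to restrict such pairs to the words or senses in which they are actually interchangeable.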
To a limited extent, the traditional forms are still used in the PRC for some classical literature, newspapers for overseas Chinese, etc., based on a standard that maps the SC forms (GB 2312-80) to their corresponding TC forms (GB 12345-90). However, these mappings do not necessarily agree with those widely used in Taiwan. We will refer to the former as Simplified Traditional Chinese (STC), and to the latter as Traditional Traditional Chinese (TTC).
More advanced forms of variant expansion are synonym expansion and cross-language expansion. The details of these are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants; cross-language expansion generates a picklist of Chinese equivalents to a foreign language source term. A brief example of this is shown in the table below. For a more detailed treatment, see the author's paper Cross-Synonym and Cross-Language Searching in Japanese.
In 1996, the CJK Dictionary Institute (CJKI) launched a project to investigate C2C conversion issues in depth, and to build a comprehensive SC↔TC database (now at 1.2 million SC and 1.2 million TC items and growing) whose goal is to achieve near 100% conversion accuracy. We have collaborated with Basis Technology Corporation, a leading provider of CJK software technology, in developing advanced word segmentation technology and a highly accurate C2C conversion engine, which have been released as successful commercial products. These are described in detail at CJK Products.
One of the central components necessary for building tools for processing Chinese orthographic, lexemic and other variants is a database of hard-coded mapping tables fine-tuned to the needs of C2C conversion, variant expansion and normalization. Below is a list of the components of such a database developed by our Institute, most of which have been incorporated into the sophisticated CJK tools developed by Basis Technology.
- Traditional Chinese single character variants
- Traditional Chinese compound variants
- Single character SC-to-TC code-level mapping table
- Single character TC-to-SC code-level mapping table
- STC-to/from-TTC character mapping table
- SC-to/from-TC orthographic mapping table for general vocabulary
- SC-to/from-TC orthographic mapping table for proper nouns
- SC-to/from-TC lexemic mapping tables
- Chinese thesaurus for synonym expansion
- Basic Chinese-English dictionary for cross-language expansion
See the author's The Pitfalls and Complexities of Chinese to Chinese Conversion for more details.