Jack Halpern


   Data Licensing

Comprehensive Database of Chinese Name Variants


The Problem of Name Variants

The number of personal names and their variants is probably in the billions. The number of place names is also large, but they have fewer variants. Identifying names and their variants is a difficult computational linguistic task. Named Entity Recognition (NER) is a hot topic in computational linguistics and plays an important role in many IT applications.

To enhance this technology, CJKI maintains comprehensive databases of several million proper nouns, especially of Japanese names, Chinese names, and Arabic names. This document describes some issues of Chinese name variation and provides samples of our Chinese name variants resources. For reference, see also The Role of Lexical Resources in CJK NLP Applications and Named Entity Contextual Clues.

Currently our databases contain over 1,650,000 Chinese seed names (surnames and given names) and approximately five million romanized variants for these names.

Practical Applications

Identifying, processing and normalizing names and their numerous variants are useful in a variety of applications, including:

  • Anti-money-laundering screening by financial institutions.
  • Security applications such as identifying suspected name variants of terrorists and criminals.
  • Query processing by search engines.
  • Immigration control systems.
  • Improving the accuracy of machine translation.
  • Entity and information extraction.
  • Segmentation and morphological analysis of CJK languages.

Large databases of name variants play a critical role in such applications. CJKI maintains databases of several million names and name variants in all major and most minor romanization systems for Chinese, Japanese and Korean, including the major Chinese dialects, as well as for Arabic and Spanish.
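
As a rough illustration of how a variant database can support such applications, the sketch below builds a small lookup index that maps romanized variants back to their seed names. The sample rows are copied from the data tables in this document; the function names (`match_key`, `build_index`, `lookup`) and the dictionary layout are hypothetical and do not describe CJKI's actual data format. The matching key is deliberately loose: it ignores case, apostrophes and diacritics.

```python
import unicodedata
from collections import defaultdict

# Sample rows copied from the data tables in this document.
# The dictionary layout is a hypothetical illustration only.
SAMPLE_VARIANTS = {
    "北强": ["Beiqiang", "Beiciang", "Beichyang", "Peich'iang", "Peits'iêng"],
    "冰晓": ["Bingxiao", "Bingsiao", "Bingsyau", "Pinghsiao", "Pingsiao"],
    "业经": ["Yejing", "Yehching", "Yipging", "Yipking", "Jipging",
             "Yapkeng", "Ngiapkin"],
}


def match_key(name):
    """Reduce a romanized name to a loose key: letters only, no diacritics, case-folded."""
    decomposed = unicodedata.normalize("NFD", name)
    # Combining marks and apostrophes are not alphabetic, so they are dropped here.
    kept = [ch for ch in decomposed if ch.isalpha()]
    return "".join(kept).casefold()


def build_index(variants_by_name):
    """Map each loose key to the set of seed names it may stand for."""
    index = defaultdict(set)
    for seed, variants in variants_by_name.items():
        for variant in variants:
            index[match_key(variant)].add(seed)
    return index


def lookup(index, query):
    """Return the seed names whose known variants match the query."""
    return sorted(index.get(match_key(query), set()))


index = build_index(SAMPLE_VARIANTS)
print(lookup(index, "PEICH'IANG"))
print(lookup(index, "Peichiang"))
```

A loose key of this kind trades precision for recall: it lets forms with and without apostrophes match each other, but it can also merge names that a stricter comparison would keep apart.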

Chinese Name Variants

Chinese names can be spelled in a bewildering variety of ways. Our databases of Chinese names and non-Chinese proper nouns in both Simplified and Traditional Chinese, including romanized variants, contain nearly two million entries. There are several well-established systems for romanizing/transcribing Chinese, as well as various popular ones and many older ones that have fallen out of use. The principal systems and some of lesser importance are described below:

Major Chinese Transcription Systems
System | Example | Description
Hanzi | 驰骏 | Given name written in Simplified Chinese characters.
Hanyu Pinyin | Chíjùn | Usually referred to simply as pinyin, this is the official and most widely used Mandarin romanization system. It was adopted by the PRC in 1958 and later became the ISO standard ISO 7098:1991.
English | Chijun | Standard English spelling follows Hanyu Pinyin but omits the tone marks.
Wade-Giles | Ch'ihchün | Introduced by Thomas Wade in the 19th century, this was the most widely used system for most of the 20th century, until Hanyu Pinyin became widespread, and it is still important today.
Yale System | Chrjyun | Developed by Yale University in the 1950s and 1960s to facilitate teaching Chinese to Americans, the Yale system is of limited use now but does appear in some dictionaries and textbooks.
Tongyong Pinyin | Chihjyun | The official romanization system adopted by the government of Taiwan in 2000 to replace the MPS II system.
MPS II | Chrjiun | Formerly the official system in Taiwan, where it replaced Gwoyeu Romatzyh, this system never gained much popularity outside of government publications and was replaced in 2000 by Tongyong Pinyin.
Gwoyeu Romatzyh | Chyrjiunn | Formerly the official system in Taiwan, this system uses complex spelling rules to distinguish tones without diacritics. Developed by Y. R. Chao and proclaimed in 1926, it was officially replaced by MPS II in 1986.
Zhuyin Fuhao | ㄔˊㄐㄩㄣˋ | Also called Bopomofo, this is the standard phonemic transcription system used in Taiwan (and formerly in the PRC) for education and input methods. Though not a romanization system, it is given here because it is of major importance in transcribing Chinese.
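
The relationship between the Hanyu Pinyin and English rows above (same spelling, tone marks omitted) can be sketched in a few lines. This is a minimal illustration, not a description of how the CJKI data was produced; the function name `strip_tone_marks` is hypothetical. It removes only the four tone diacritics and, as an assumption, leaves the diaeresis of ü in place, since how ü is rendered in English spellings varies.

```python
import unicodedata

# Combining marks for the four Mandarin tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone_marks(pinyin):
    """Remove tone diacritics from toned pinyin, keeping all other characters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recombines anything left over, such as u + diaeresis into ü.
    return unicodedata.normalize("NFC", kept)


print(strip_tone_marks("Chíjùn"))
```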

Our name variants databases provide comprehensive coverage for the major Chinese romanization systems and their variants. Two other systems also covered by CJKI's variant databases are EFEO, which was developed in the 19th century by the École française d'Extrême-Orient and is still in use in France, and Lessing-Othmer, which is used in Germany and based on German orthography.

Other systems such as MPS II are not currently supported because of their relative rarity (they are no longer official in Taiwan). There are various other systems, such as the ALA-LC system by the Library of Congress, which is essentially identical to Hanyu Pinyin except for the omission of tones. Dozens of other systems, such as Latinxua Sinwenz (拉丁化新文字; also known as "Sin Wenz"), developed by Qu Qiubai in the 1920s, have been used over the last few centuries to romanize or cyrillicize Chinese, but these are of little or no importance in the romanization of Chinese names today.

Data Sample

The table below shows examples of romanized Chinese names in the principal systems covered in our databases. Only the standard form is shown under the column for each system. Variants of each of these systems, such as forms without apostrophes and variant spellings, are given in the "Variants" column.

Chinese Name Variants
Chinese | Pinyin1 | Pinyin2 | Zhuyin | English | Tongyong | Yale | Wade-Giles | Variants
百欣 | bai3-xin1 | bǎixīn | ㄅㄞˇㄒㄧㄣ | Baixin | Baisin | Baisyin | Paihsin | Paisin
 | bai2 | bái | ㄅㄞˊ | Bai | Bai | Bai | Pai |
北强 | bei3-qiang2 | běiqiáng | ㄅㄟˇㄑㄧㄤˊ | Beiqiang | Beiciang | Beichyang | Peich'iang | Peits'iêng
炳章 | bing3-zhang1 | bǐngzhāng | ㄅㄧㄥˇㄓㄤ | Bingzhang | Bingjhang | Bingjang | Pingchang |
宝程 | bao3-cheng2 | bǎochéng | ㄅㄠˇㄔㄥˊ | Baocheng | Baocheng | Baucheng | Paoch'eng | Paocheng
爱华 | ai4-hua2 | àihuá | ㄞˋㄏㄨㄚˊ | Aihua | Aihua | Aihwa | Aihua | Ngaihua
伯芝 | bo2-zhi1 | bózhī | ㄅㄛˊㄓ | Bozhi | Bojhih | Bwojr | Pochih |
长流 | chang2-liu2 | chángliú | ㄔㄤˊㄌㄧㄡˊ | Changliu | Changliou | Changlyou | Ch'angliu |
邦达 | bang1-da2 | bāngdá | ㄅㄤㄉㄚˊ | Bangda | Bangda | Bangda | Pangta |
 | cao2 | cáo | ㄘㄠˊ | Cao | Cao | Tsau | Ts'ao |
冰晓 | bing1-xiao3 | bīngxiǎo | ㄅㄧㄥㄒㄧㄠˇ | Bingxiao | Bingsiao | Bingsyau | Pinghsiao | Pingsiao
百成 | bai3-cheng2 | bǎichéng | ㄅㄞˇㄔㄥˊ | Baicheng | Baicheng | Baicheng | Paich'eng | Paicheng
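
The Pinyin1 (numbered) and Pinyin2 (toned) columns above carry the same information in two notations. The sketch below shows one way to derive the toned form from the numbered form, using the commonly taught placement rule: the mark goes on "a" or "e" if present, on the "o" of "ou", and otherwise on the last vowel. The function names are hypothetical, and the code assumes hyphen-separated syllables that each end in a tone digit, as in the table; it does not handle alternative spellings of ü such as "v" or "u:".

```python
import unicodedata

# Tone digit -> combining mark (macron, acute, caron, grave).
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}
VOWELS = "aeiouü"


def mark_syllable(syllable):
    """Convert one numbered syllable such as 'bai3' to its toned form."""
    if not syllable:
        return ""
    letters, tone = syllable[:-1], syllable[-1]
    if tone not in TONE_MARKS:
        # Neutral tone or no tone digit: drop a trailing digit, change nothing else.
        return letters if tone.isdigit() else syllable
    lower = letters.lower()
    if "a" in lower:
        pos = lower.index("a")
    elif "e" in lower:
        pos = lower.index("e")
    elif "ou" in lower:
        pos = lower.index("o")
    else:
        vowel_positions = [i for i, ch in enumerate(lower) if ch in VOWELS]
        if not vowel_positions:
            return letters
        pos = vowel_positions[-1]
    # Insert the combining mark after the chosen vowel, then compose.
    marked = letters[:pos + 1] + TONE_MARKS[tone] + letters[pos + 1:]
    return unicodedata.normalize("NFC", marked)


def numbered_to_toned(name):
    """Convert a hyphen-separated numbered name such as 'bai3-xin1'."""
    return "".join(mark_syllable(s) for s in name.split("-"))


print(numbered_to_toned("bei3-qiang2"))
```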

Variants Based on Chinese Dialects

Chinese has seven major dialect groups and four minor ones, with at least 95 subdialects. Romanized forms of Chinese names based on the various dialects are often radically different from those based on Mandarin. Our databases of Chinese name variants cover some of the major Chinese dialects, including Cantonese, Hakka and Hokkien, as well as multilingual equivalents including Traditional Chinese, Japanese, Korean and Vietnamese.

Data Sample

Category | Variants | Type
LANGUAGE | 业经 | Simplified Chinese
LANGUAGE | 業經 | Traditional Chinese
LANGUAGE | 業経 | Japanese
LANGUAGE | 예징 | Hanja reading
LANGUAGE | 業經 | Korean Hanja
LANGUAGE | Yejing | MOE (Korean Ministry of Education Romanization)
LANGUAGE | Yejing | NRS (New Romanization System)
LANGUAGE | Yejing | KLS (Korean Language Society Romanization)
LANGUAGE | Yecing | ISO DPRK (Used in North Korea)
LANGUAGE | Yejing | ISO ROK (Used in South Korea)
LANGUAGE | Nghiệp Kinh | Vietnamese
LANGUAGE | Yejing | English
MANDARIN | Yèjīng | Toned Pinyin
MANDARIN | ye4-jing1 | Numbered Pinyin
MANDARIN | Yehching | Wade-Giles
MANDARIN | Yejing | Yale System
MANDARIN | Yejing | Tongyong
MANDARIN | ㄧㄝˋㄐㄧㄥ | Zhuyin
CANTONESE | Yipging | LAU (Sidney Lau)
CANTONESE | Yipking | CPR (Cantonese Popular Reading)
CANTONESE | Yipging | Modified Yale
CANTONESE | Jipging | Jyutping
DIALECTS | Yapkeng | Hokkien
DIALECTS | Ngiapkin | Hakka