Comprehensive Database of Japanese Name Variants
1. The Problem of Name Variants
The number of personal names and their variants in the world is in the billions. The number of place names is also large, but they have fewer variants. Identifying names and their variants is a difficult computational linguistic task. Named Entity Recognition (NER) is a hot topic in computational linguistics and plays an important role in many IT applications.
To enhance this technology, CJKI maintains comprehensive databases of several million proper nouns, especially of Japanese names and Chinese names. This document describes some issues of Japanese name variation and provides samples of our extensive Japanese name variant resources. For reference, see also The Role of Lexical Resources in CJK NLP Applications and Named Entity Contextual Clues.
2. Practical Applications
Identifying, processing and normalizing names and their numerous variants are useful in a variety of applications, including:
- Anti-money-laundering by financial institutions.
- Security applications such as identifying suspected name variants of terrorists and criminals.
- Query processing by search engines.
- Immigration control systems.
- Improving the accuracy of machine translation.
- Entity and information extraction.
- Segmentation and morphological analysis of CJK languages.
Large databases of name variants play a critical role in such applications. CJKI maintains databases of several million names and name variants in all major and most minor romanization systems for Chinese, Japanese and Korean, including the major Chinese dialects, as well as for Arabic and Spanish.
3. The Challenge of Japanese Name Variants
Japanese personal names are extremely numerous. Our Japanese-English Dictionary of Proper Nouns contains about 400,000 unique given names and some 150,000 surnames. If we add to this the numerous romanized variants, we get millions of names.
There are several well-established systems for romanizing Japanese, as well as various popular ones and even hybrid ones in which the same word is written in a mixture of different systems. The principal systems, and the other systems that CJKI databases support, are listed below. The examples are for the names 大津 (おおづ) and 山口 (やまぐち).
|System||Example||Description|
|Hepburn||Ōzu||The most widely used system, in several variations as shown in the table below.|
|Kunrei||Ôzu||The official Japanese government system that has become an ISO standard (ISO 3602).|
|Nippon||Ôdu||The predecessor of the Kunrei system but still in use.|
|Waapuro||Ouzu||Based on popular input methods.|
|English||Ozu||The most common English spelling based on Hepburn with long vowels omitted.|
|Germanic||Jamagutschi||German based romanization.|
|Romance||Yamagutchi||Romance language based romanization.|
|Miscellaneous variants of each system, such as the different flavors of Hepburn.|
CJKI's name variant databases contain millions of entries that cover all the above systems, their variants, and hybrids. Below are samples of these variants and a brief description of why there is so much variation. There are other systems, like the JSL system devised by Eleanor Jorden, and the ALA-LC system, essentially identical to the Revised Hepburn system, which are not shown in the samples below.
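As a rough illustration of how the three principal systems diverge on the same kana, the following minimal Python sketch romanizes the two example names. The table `KANA`, the function `romanize`, and the naive long-vowel rule are assumptions made for this example and cover only the kana needed here; they do not describe how the CJKI databases are built.

```python
# Illustrative per-kana table: kana -> (Hepburn, Kunrei, Nippon) spellings.
# Only the kana needed for the two example names are included.
KANA = {
    "お": ("o", "o", "o"),
    "づ": ("zu", "zu", "du"),
    "や": ("ya", "ya", "ya"),
    "ま": ("ma", "ma", "ma"),
    "ぐ": ("gu", "gu", "gu"),
    "ち": ("chi", "ti", "ti"),
}
SYSTEMS = ("hepburn", "kunrei", "nippon")

# Long /o:/ is marked with a macron in Hepburn and a circumflex in
# Kunrei and Nippon (written as escapes to avoid encoding ambiguity).
LONG_O = {"hepburn": "\u014d", "kunrei": "\u00f4", "nippon": "\u00f4"}


def romanize(kana: str, system: str) -> str:
    """Romanize a kana string in one of the three principal systems."""
    col = SYSTEMS.index(system)
    raw = "".join(KANA[ch][col] for ch in kana)
    # Naive assumption: every "oo" sequence is a single long vowel.
    raw = raw.replace("oo", LONG_O[system])
    return raw.capitalize()


for system in SYSTEMS:
    print(system, romanize("おおづ", system), romanize("やまぐち", system))
```

A real converter would need the full kana inventory, digraphs such as じゃ, and morpheme boundaries to decide when two adjacent vowels really form a long vowel.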
4. Variation in Hepburn Romanization
The English-based Hepburn romanization system was devised by the Reverend James Curtis Hepburn and introduced in his Japanese–English dictionary published in 1867. It is the most widely used system and serves as the de facto standard. Even the Japanese government commonly uses it in place of Kunrei romanization, the official standard.
Contrary to popular belief, the Hepburn system comes in many flavors. The standard, official Hepburn system is called Revised Hepburn, but some of the other variants shown below are just as popular, if not more so. Note that Revised Hepburn is sometimes confusingly called Modified Hepburn, a name that also refers to a less popular system used by some dictionaries and linguists.
|KANJI||YOMI||ENGLISH||Revised Hepburn||Modified Hepburn||Traditional Hepburn||Passport||Waapuro Hepburn||Hepburn Variants|
|天満屋||てんまんや||Tenman'ya, Tenmanya||Tenman'ya, Tenmanya, Tenman-ya||Tenman'ya, Ten̄man̄ya||Tenman'ya||Tenman'ya, Tenmanya, Tenman-ya||Tenmanya|
|山陰房||さんいんぼう||San'inbo, Saninbo||San'inbō, Saninbō, San-inbō||San'inboo, Saninboo, San̄in̄boo||San'imbō, Sanimbō||San'imboh, Sanimboh, San-imboh, San'imbo, Sanimbo, San-imbo||Saninbou||San'inbô, Saninbô, San-inbô, San'imbô, Sanimbô, San-imbô|
|淳一郎||じゅんいちろう||Jun'ichiro, Junichiro||Jun'ichirō, Junichirō, Jun-ichirō||Jun'ichiroo, Junichiroo, Jun̄ichiroo||Jun'ichirō, Junichirō||Jun'ichiroh, Junichiroh, Jun-ichiroh, Jun'ichiro, Junichiro, Jun-ichiro||Junichirou||Jun'ichirô, Junichirô, Jun-ichirô|
5. A Plethora of Romanization Systems
The table below shows examples of romanized names in various official and unofficial systems. Only the standard, official version is shown under the column for each of the three principal systems: Hepburn, Kunrei and Nippon. Variants of each of these systems, such as the different flavors of Hepburn, are all collected in the Variants column. All hybrids are shown in the Hybrids column. The Waapuro system, which has many variants, is not shown in a separate column but is included in the Variants column.
As can be seen from Table 2 and Table 3, variation occurs for various reasons:
- The representation of long vowels, especially /o:/ written as ō, o, ô, ou, or oh.
- Moraic /N/ (ん) sometimes written as m, rather than n, before /b/, /p/, and /m/.
- Apostrophes omitted or replaced by hyphens when /N/ is followed by a vowel or /y/.
- Multiple representations for certain consonants: e.g., じゃ is written as ja, zya or jya.
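The first three sources of variation listed above can be sketched as simple substitution rules applied to a Revised Hepburn spelling. This is a minimal Python sketch under stated assumptions: the function name `variants` and the alternative lists are hypothetical, only long /o:/ is handled, and the consonant alternations (such as ja, zya, jya) are left out.

```python
from itertools import product

# Alternative spellings for long /o:/ (o-macron, plain o, o-circumflex, ...).
LONG_O = ["\u014d", "o", "\u00f4", "ou", "oh", "oo"]
# The apostrophe after moraic n may be kept, dropped, or written as a hyphen.
APOSTROPHE = ["'", "", "-"]


def variants(hepburn: str) -> list[str]:
    """Expand a Revised Hepburn spelling into its rule-based variants."""
    slots = []
    for i, ch in enumerate(hepburn):
        nxt = hepburn[i + 1] if i + 1 < len(hepburn) else ""
        if ch == "\u014d":
            slots.append(LONG_O)
        elif ch == "'":
            slots.append(APOSTROPHE)
        elif ch == "n" and nxt in ("b", "p", "m"):
            # Moraic n is sometimes written m before b, p, and m.
            slots.append(["n", "m"])
        else:
            slots.append([ch])
    # Every combination of one choice per slot is a candidate spelling.
    return sorted({"".join(choice) for choice in product(*slots)})


print(len(variants("San'inb\u014d")))
print(variants("Tenman'ya"))
```

For San'inbō the three rules interact (3 apostrophe choices, 2 spellings of the moraic n, 6 spellings of the long vowel), so the sketch yields 36 spellings from a single input.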
In the real world, each of the various systems has variants, and names are often written by mixing multiple systems. For example, Juniti consists of Jun, the Modified Hepburn for じゅん, and iti, the Kunrei version of いち. We refer to such combinations as hybrids.
|佐藤||さとう||Sato||Satō||Satô||Satô||Satoo, Satou, Satoh|
|大津||おおづ||Ozu||Ōzu||Ôzu||Ôdu||Oozu, Ouzu, Ohzu, Oodu, Oudu, Ohdu, Odu||Ōdu|
|井生||いおう||Io||Iō||Iô||Iô||Ioo, Iou, Ioh|
|伊大地||いおおじ||Ioji||Iōji||Iôzi||Iôzi||Iōzi, Ioozi, Iouzi, Iohzi, Iozi, Iooji, Iouji, Iohji, Iôji|
|青柳塘||あおやぎとう||Aoyagito||Aoyagitō||Aoyagitô||Aoyagitô||Aoyagitoo, Aoyagitou, Aoyagitoh||Aojagito|
|天満屋||てんまんや||Tenman'ya||Tenman'ya||Tenman'ya||Tenman'ya||Temman'ya, Temmanya, Temman-ya, Tenmanya, Tenman-ya||Tenman'ja, Tenmanja, Tenman-ja|
|赤口||あかぐち||Akaguchi||Akaguchi||Akaguti||Akaguti||Acaguci||Akaguci, Acaguchi, Acaguti||Akagutschi||Akagutchi|
|裕子||ゆうこ||Yuko||Yūko||Yûko||Yûko||Yûco, Yūco, Yuuco, Yuco, Yuuko||Juko|
|正月||しょうげつ||Shogetsu||Shōgetsu||Syôgetu||Syôgetu||Syōgetu, Syoogetu, Syougetu, Syohgetu, Syogetu, Shoogetsu, Shougetsu, Shohgetsu, Shôgetsu||Shōgetu, Shoogetu, Shougetu, Shohgetu, Shogetu, Shôgetu, Syôgetsu, Syōgetsu, Syoogetsu, Syougetsu, Syohgetsu, Syogetsu||Schogetsu||Chogetsu|
|山陰房||さんいんぼう||San'inbo||San'inbō||San'inbô||San'inbô||Saninbô, San-inbô, Saninbō, San-inbō, San'inboo, Saninboo, San-inboo, San'inbou, Saninbou, San-inbou, San'inboh, Saninboh, San-inboh, Saninbo, San-inbo, San'imbō, Sanimbō, San-imbō, San'imboo, Sanimboo, San-imboo, San'imbou, Sanimbou, San-imbou, San'imboh, Sanimboh, San-imboh, San'imbo, Sanimbo, San-imbo, San'imbô, Sanimbô, San-imbô|
|四本松||しほんまつ||Shihonmatsu||Shihonmatsu||Sihonmatu||Sihonmatu||Shihommatsu||Shihonmatu, Shihommatu, Sihonmatsu, Sihommatsu, Sihommatu||Schihonmatsu||Chihonmatsu|
6. The Variant Explosion
As mentioned above, Japanese names have so many variants because of such phenomena as the presence or absence of apostrophes and the multiple ways of expressing long vowels and certain consonants. If these factors happen to combine in the same name, the number of permutations explodes. Combined with the many hybrids, the number of variants for a single name can run into the hundreds.
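The arithmetic behind this explosion is a product of independent choices. The sketch below assumes a hypothetical segmentation of じゅんいちろう into four slots, each with a handful of alternative spellings; the alternatives and the resulting count are illustrative only and are not figures taken from CJKI's databases.

```python
from itertools import product
from math import prod

# Hypothetical segment-level alternatives for じゅんいちろう.
SEGMENTS = [
    ["Jun", "Zyun", "Jyun"],                             # じゅん
    ["'", "", "-"],                                      # separator after moraic n
    ["ichi", "iti"],                                     # いち (Hepburn, Kunrei)
    ["r\u014d", "ro", "r\u00f4", "rou", "roh", "roo"],  # ろう
]


def count_variants(segments):
    """Number of spellings when each slot is chosen independently."""
    return prod(len(alternatives) for alternatives in segments)


def expand(segments):
    """List every spelling, including hybrids that mix systems."""
    return ["".join(parts) for parts in product(*segments)]


print(count_variants(SEGMENTS))
print(expand(SEGMENTS)[:5])
```

Because each slot is chosen independently, hybrids such as Junitiro (Hepburn Jun followed by Kunrei iti) fall out of the expansion automatically, and each additional slot multiplies the total rather than adding to it.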
An example is the first name of Japan's former prime minister Jun'ichirō Koizumi (in standard Revised Hepburn). The table below shows the 169 variants of Jun'ichirō, classified roughly by rank, many of which are high-frequency and in widespread use. Although all are legitimate in the sense that they follow the rules of spelling variation for each system, or are hybrids of such variants, some may be rare or nonexistent at a particular time or in a particular corpus. But since such variants can potentially occur at different times in different corpora, they are included in our databases, which aim to provide a full solution for identifying name variants.
(Please also see this .pdf document about the kanji string 淳子 ("Junko"), which shows the complex many-to-many relations between Japanese personal names.)