Orthographic Variation in Japanese


Jack Halpern
CEO
The CJK Dictionary Institute, Inc.
株式会社 日中韓辭典研究所
Created: September 13, 2001



Index to This Document
  1. Introduction
  2. Orthographic Variants
  3. Kun Homophones
  4. Synonym and Cross-Language Expansion
  5. Mapping Tables and Lexical Databases
  6. Documents for Reference


1. Introduction

This report provides an overview of the linguistic and orthographic issues related to the processing of Japanese texts, especially webpages. It has two basic aims:

  1. To enable software developers to acquire a basic knowledge of the special problems resulting from the complexity of the Japanese script, especially orthographic variation.
  2. To help software developers build linguistic tools for processing Japanese, such as intelligent information retrieval tools, input method editors, and machine translation systems.

The existence of orthographic variants and the morphological complexity of Japanese pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways. This report focuses on the major types of orthographic variation in Japanese, and provides a brief analysis of the linguistic issues to be considered by software developers.


2. Orthographic Variants

2.1 Irregular Orthography

Japanese orthography is highly irregular. Because of the large number of orthographic variants and easily confused homophones, the Japanese writing system is an order of magnitude more complex than that of any other major language, including Chinese. A major factor is the complex interaction of the four scripts used to write Japanese, which results in countless words that can be written in a variety of often unpredictable ways. For more information, see Outline of Japanese Writing System (from the author's New Japanese-English Character Dictionary).

Table 1 shows the orthographic variants of the word 取り扱い toriatsukai 'handling'. Study it carefully to get a good grasp of the issues involved.


Table 1
Orthographic Variants of toriatsukai
toriatsukai | Type of variant
取り扱い | "standard" form
取扱い | okurigana variant
取扱 | all kanji
とり扱い | replace kanji with hiragana
取りあつかい | replace kanji with hiragana
とりあつかい | all hiragana


2.2 Okurigana Variants

One of the most important types of orthographic variation in Japanese occurs in kana endings, called 送り仮名 okurigana, that are attached to a kanji base or stem, as shown in Table 2.


Table 2
Okurigana Variants
English | Reading | "Standard" Form | Variant 1 | Variant 2 | Variant 3
publish | kakiarawasu | 書き表す | 書き表わす | 書表わす | 書表す
perform | okonau | 行う | 行なう | |
handling | toriatsukai | 取り扱い | 取扱い | 取扱 |

Okurigana variants are very common. Because usage is often unpredictable, they are a nuisance in any kind of Japanese language processing. When normalizing Japanese orthographic variants, special care must be taken to register all okurigana variants.
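Registering variants for normalization can be pictured as a simple lookup table. The following is a minimal sketch, not the author's implementation: the name `VARIANT_TO_STANDARD` and both functions are hypothetical, and the entries are limited to the forms listed in Tables 1 and 2.

```python
# Hypothetical mapping table: every registered variant points to its
# "standard" form. Entries come from Tables 1 and 2 only; a real table
# would be far larger.
VARIANT_TO_STANDARD = {
    "取扱い": "取り扱い",
    "取扱": "取り扱い",
    "とり扱い": "取り扱い",
    "取りあつかい": "取り扱い",
    "とりあつかい": "取り扱い",
    "書き表わす": "書き表す",
    "書表わす": "書き表す",
    "書表す": "書き表す",
    "行なう": "行う",
}


def normalize(term: str) -> str:
    """Return the standard form of a registered variant, else the term itself."""
    return VARIANT_TO_STANDARD.get(term, term)


def expand(term: str) -> set[str]:
    """Return the standard form plus every registered variant of it."""
    standard = normalize(term)
    variants = {v for v, s in VARIANT_TO_STANDARD.items() if s == standard}
    variants.add(standard)
    return variants
```

Normalization suits indexing (every variant is stored under one key), while expansion suits query time (a search for 行なう also retrieves 行う). Unregistered terms pass through unchanged, which is why coverage of the table matters.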


2.3 Cross-Script Orthographic Variants

Japanese is written in a mixture of four scripts: kanji (Chinese characters), two syllabic scripts called hiragana and katakana, and romaji (the Latin alphabet). Orthographic variation across scripts is as common as it is unpredictable, so that the same word can be written in hiragana, katakana or kanji, or even in a mixture of two scripts. The table below shows the major cross-script variation patterns in Japanese.

Table 3
Cross-Script Orthographic Variants
Type of Variation | Var 1 | Var 2 | Var 3
Kanji vs. hiragana | 大勢 | おおぜい |
Kanji vs. katakana | 硫黄 | イオウ |
Kanji vs. hiragana vs. katakana | 猫 | ねこ | ネコ
Katakana vs. hybrid | ワイシャツ | Yシャツ |
Kanji vs. katakana vs. hybrid | 皮膚 | ヒフ | 皮フ
Kanji vs. hybrid | 彗星 | すい星 |
Hiragana vs. katakana | ぴかぴか | ピカピカ |
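Of the patterns above, only hiragana vs. katakana variation can be handled without a dictionary, because the two syllabaries occupy parallel Unicode blocks offset by 0x60. The sketch below illustrates that one case; the function name is hypothetical. Kanji vs. kana variation cannot be folded this way, since it requires the reading of each word.

```python
def hiragana_to_katakana(text: str) -> str:
    """Fold hiragana to katakana; all other characters pass through unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Hiragana U+3041..U+3096 map to katakana U+30A1..U+30F6.
        if 0x3041 <= cp <= 0x3096:
            out.append(chr(cp + 0x60))
        else:
            out.append(ch)
    return "".join(out)
```

With this folding, ぴかぴか and ピカピカ compare equal, as do ねこ and ネコ. Hybrid forms such as すい星 are only partially folded (to スイ星) and still need a mapping table to be matched with 彗星.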


2.4 Kanji Variants

Though the Japanese writing system underwent major reforms in the postwar period and the character forms have by now been standardized, there is still a significant number of character form variants in common use, especially in proper names. Moreover, some classical works such as the Buddhist scriptures are written in the traditional character forms.


Table 4
Kanji Variants
Type of Variation | English | Reading | Standard | Variant
Abbreviated form | largely | oohaba ni | |
Variant form | 10 years old | jussai | |
Traditional form | development | hattatsu | |
Phonetic substitute | abuse | ranyoo | |


2.5 Kana Variants

Recent years have seen a sharp increase in the use of katakana, a syllabary used mostly to write Western loanwords. Katakana orthography is often irregular, and it is quite common for the same word to be written in multiple ways.

The hiragana syllabary is used mostly to write grammatical elements and some native Japanese words. In recent years there has been a considerable increase in the use of hiragana. Though hiragana orthography is quite regular, there is a certain amount of irregularity.

Some of the major types of kana variation are shown in the table below.


Table 5
Katakana and Hiragana Variants
Type of Variation | English | Reading | Standard Form | Variants
Macron | computer | konpyuuta, konpyuutaa | コンピュータ | コンピューター
Long vowels | maid | meedo | メード | メイド
Multiple kana | team | chiimu, tiimu | チーム | ティーム
Traditional | big | ookii | おおきい | おうきい
づ vs ず | to continue | tsuzuku | つづく | つずく

The above is only a brief introduction to the most important types of kana variation. There are various others, such as an optional middle dot (nakaguro) and small kana variants (クォ vs. クオ) in katakana words, the use of traditional (じ vs. ぢ) and historical (い vs. ゐ) kana, and more.
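Some of the more regular katakana variations can be folded by rule rather than listed one by one. The sketch below is a heuristic illustration only: the function name and the choice of rules (drop the middle dot, enlarge small vowel kana, drop a word-final long vowel mark) are assumptions, and such rules can conflate words that are genuinely different, so the result is best treated as a match key rather than a corrected spelling.

```python
# Small vowel kana mapped to their full-size counterparts (クォ -> クオ).
SMALL_TO_LARGE = str.maketrans("ァィゥェォ", "アイウエオ")


def katakana_match_key(word: str) -> str:
    """Build a loose comparison key for a katakana word (heuristic)."""
    key = word.replace("・", "")          # optional middle dot (nakaguro)
    key = key.translate(SMALL_TO_LARGE)   # small vs. full-size vowel kana
    if len(key) > 1 and key.endswith("ー"):
        key = key[:-1]                    # word-final long vowel mark
    return key
```

Under these assumptions コンピュータ and コンピューター share one key. Variants such as メード vs. メイド or チーム vs. ティーム are not covered by these rules and would still need explicit table entries.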


3. Kun Homophones

An important factor contributing to the complexity of the Japanese writing system is the existence of a large number of homophones (words pronounced the same but written differently), especially kun (native Japanese) homophones. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. Many kun homophones are close or even identical in meaning and thus easily confused; e.g., noboru means 'go up' when written 上る but 'climb' when written 登る, as shown in the table below.


Table 6
Kun Homophones
Reading | Hom. 1 | Hom. 2 | Hom. 3 | Hom. 4 | Meaning
noboru | 上る | 登る | 昇る | | close
yawarakai | 柔らかい | 軟らかい | やわらかい | | identical
hashi | | | | | different
sasu | 差す | 指す | 刺す | さす | different/ambiguous

In processing Japanese texts, a central problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses, while the meanings of some homophones are identical or nearly identical. To make matters worse, the distinctions are sometimes so subtle that many authors ignore the kanji and use hiragana instead.
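One way to act on these distinctions in software is to store each homophone group together with a label for how close its members are in meaning, and to expand a query only across groups whose members are interchangeable. The sketch below assumes a hypothetical `HomophoneGroup` record and uses only the complete rows of Table 6; the label values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HomophoneGroup:
    reading: str
    forms: tuple[str, ...]
    relation: str  # "identical", "close", or "different" (illustrative labels)


# Entries taken from Table 6; the relation labels follow its last column.
GROUPS = [
    HomophoneGroup("noboru", ("上る", "登る", "昇る"), "close"),
    HomophoneGroup("yawarakai", ("柔らかい", "軟らかい", "やわらかい"), "identical"),
    HomophoneGroup("sasu", ("差す", "指す", "刺す", "さす"), "different"),
]


def expand_homophones(term: str, allowed=("identical", "close")) -> list[str]:
    """Return interchangeable written forms of term, or just the term itself."""
    for group in GROUPS:
        if term in group.forms and group.relation in allowed:
            return list(group.forms)
    return [term]
```

Here a search for 登る would also match 上る and 昇る, while 刺す is left alone because its homophones differ in meaning. Whether "close" groups should be expanded is a policy choice, which is why `allowed` is a parameter.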


4. Synonym and Cross-Language Expansion

Advanced forms of variant expansion include synonym expansion and cross-language information retrieval (CLIR), the details of which are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants; cross-language expansion generates a picklist of Japanese equivalents to a foreign language source term. These are shown in the table below, along with some miscellaneous variant types such as abbreviations and loanwords.


Table 7
Synonym and Cross-Language Expansion
Synonyms (kill) | 殺す | 処刑する | 殺害する | 射殺する | 暗殺する | 殺る
Abbreviations | JA | JA全中 | 全国農業協同組合中央会
Loanwords (drink) | ドリンク | 飲み物 | 飲料
Cross-language | happy | 幸福な | 幸せな | 嬉しい | 楽しい | 愉快な

For more information on synonym expansion and CLIR, see my article Cross-Synonym and Cross-Language Searching in Japanese.



5. Mapping Tables and Lexical Databases

A central component of any tool for processing Japanese orthographic and other variants is a database that combines hard-coded mapping tables of orthographic variants with lexical databases for synonym and cross-language expansion, all fine-tuned to the needs of variant expansion and normalization. Below is a list of the components required for building such a database, which have been developed by our team of lexicographers.

  1. A comprehensive database of orthographic variants, with full coverage of okurigana, kanji, and kana variants.
  2. A database of semantically classified homophone groups (mere homophones are not useful).
  3. A database of semantically classified synonym groups consisting of kanji synonyms (Japanese thesaurus).
  4. An English-Japanese dictionary covering general vocabulary and important proper names.
  5. A comprehensive collection of orthographic variation rules for generating and identifying variants not listed in the database.
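To illustrate what a variation rule in item 5 might look like, the sketch below generates okurigana candidates for a compound by optionally dropping the hiragana that follows each kanji. This is a deliberately naive rule written for illustration, not one of the rules referred to above; the function name and the character ranges are assumptions, and the rule overgenerates, so candidates would need to be checked against a lexicon or corpus before use.

```python
import re
from itertools import product

# One kanji (basic CJK block only) followed by an optional hiragana run.
UNIT = re.compile(r"([\u4e00-\u9fff])([\u3041-\u3096]*)")


def okurigana_candidates(word: str) -> set[str]:
    """Generate candidate spellings by optionally omitting okurigana."""
    units = UNIT.findall(word)
    # Bail out if the word is not purely kanji+hiragana units, or if it
    # has fewer than two kanji (dropping okurigana there is rarely valid).
    if len(units) < 2 or "".join(k + h for k, h in units) != word:
        return {word}
    choices = [[k + h, k] if h else [k] for k, h in units]
    return {"".join(parts) for parts in product(*choices)}
```

For 取り扱い this yields 取り扱い, 取扱い, 取扱, and also 取り扱, which does not appear in Table 1. That last candidate shows why rule output is better treated as a set of hypotheses to be validated than as a list of attested variants.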



6. Documents for Reference

See the following links for more information: