The orthographical complexity of Chinese, Japanese and Korean (CJK) poses a special challenge to the developers of computational linguistic tools, especially in the area of intelligent information retrieval. These difficulties are exacerbated by the lack of a standardized orthography in these languages, most notably the highly irregular orthography of Japanese. This paper focuses on the typology of CJK orthographic variation, provides a brief analysis of the linguistic issues, and discusses why lexical databases should play a central role in the disambiguation process.
This paper was presented at the Asian Language Resources workshop of COLING 2002 on August 31, 2002 in Taipei. For the full paper, please see:
English | PDF | www.cjk.org/cjk/reference/cjkvar.pdf |
English | PostScript | www.cjk.org/cjk/reference/cjkvarps.zip |
English | MS Word | www.cjk.org/cjk/reference/cjkvar.doc |
Japanese | MS Word | www.cjk.org/cjk/reference/cjkvar_j.doc |
Simplified Chinese | MS Word | www.cjk.org/cjk/reference/cjkvar_c.doc |
English | HTML | The Challenges of Intelligent Japanese Searching http://www.cjk.org/cjk/joa/joapaper.htm |
Japanese | MS Word | 知的日本語検索の諸課題 http://www.cjk.org/cjk/joa/Searching_j.doc |
Samples Japanese | HTML | 日本語異表記データベース見本 http://www.cjk.org/cjk/joa/joasam_j.htm |
Samples English | HTML | Samples of Japanese variants http://www.cjk.org/cjk/joa/joasamp1.htm http://www.cjk.org/cjk/joa/joasamp2.htm |