Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval

面向中日韩文智能信息检索的基于词典的异形词排歧

Jack Halpern
©2002-2008 The CJK Dictionary Institute, Inc.

Abstract

The orthographic complexity of Chinese, Japanese and Korean (CJK) poses a special challenge to developers of computational linguistic tools, especially in the area of intelligent information retrieval. These difficulties are exacerbated by the lack of a standardized orthography in these languages, most of all by the highly irregular orthography of Japanese. This paper focuses on the typology of CJK orthographic variation, provides a brief analysis of the linguistic issues, and discusses why lexical databases should play a central role in the disambiguation process.
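
As a rough illustration of what a lexicon-based approach to orthographic variation can look like in a retrieval setting, the following minimal Python sketch maps each variant to a canonical form and expands a query term to all of its listed variants. The variant groups are a tiny hand-made sample, and the names VARIANT_GROUPS, build_index, normalize and expand_query are hypothetical; none of this is taken from the paper or from the CJKI databases.

```python
# Hypothetical, hand-made variant lexicon for illustration only.
# The first entry of each group is treated as the canonical form.
VARIANT_GROUPS = [
    # Japanese okurigana and kana variants of "toriatsukai" (handling)
    ["取り扱い", "取扱い", "取扱", "とりあつかい"],
    # Japanese katakana variants of "computer"
    ["コンピュータ", "コンピューター"],
    # Traditional and simplified Chinese forms of the same word
    ["計算機", "计算机"],
]


def build_index(groups):
    """Map every variant form to the full group it belongs to."""
    index = {}
    for group in groups:
        for form in group:
            index[form] = group
    return index


INDEX = build_index(VARIANT_GROUPS)


def normalize(term):
    """Return the canonical form of a term, or the term itself if unlisted."""
    group = INDEX.get(term)
    return group[0] if group else term


def expand_query(term):
    """Return all listed variants of a term, so a search can match any of them."""
    return list(INDEX.get(term, [term]))


if __name__ == "__main__":
    print(normalize("取扱"))
    print(expand_query("コンピューター"))
```

A lookup table like this only covers forms that are explicitly listed, which is the point of the sketch: the quality of normalization and query expansion depends directly on the coverage of the underlying lexicon.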

This paper was presented at the Asian Language Resources workshop of COLING 2002 on August 31, 2002 in Taipei. For the full paper, please see:


The CJK variants paper in various formats and languages

English, PDF: www.cjk.org/cjk/reference/cjkvar.pdf
English, PostScript: www.cjk.org/cjk/reference/cjkvarps.zip
English, MS Word: www.cjk.org/cjk/reference/cjkvar.doc
Japanese, MS Word: www.cjk.org/cjk/reference/cjkvar_j.doc
Simplified Chinese, MS Word: www.cjk.org/cjk/reference/cjkvar_c.doc

In-depth Treatment of Japanese Orthographic Variation

English, HTML: The Challenges of Intelligent Japanese Searching
http://www.cjk.org/cjk/joa/joapaper.htm
Japanese, MS Word: 知的日本語検索の諸課題 (The Challenges of Intelligent Japanese Searching)
http://www.cjk.org/cjk/joa/Searching_j.doc
Samples, Japanese, HTML: 日本語異表記データベース見本 (Samples from the Japanese orthographic variants database)
http://www.cjk.org/cjk/joa/joasam_j.htm
Samples, English, HTML: Samples of Japanese variants
http://www.cjk.org/cjk/joa/joasamp1.htm
http://www.cjk.org/cjk/joa/joasamp2.htm

