Jack Halpern



Chinese Morphological Database

The CJK Dictionary Institute maintains a comprehensive lexical database of about three million Simplified and Traditional Chinese entries covering a broad spectrum of fields, including proper nouns, technical terminology, and general vocabulary. This is a sample of our Chinese Morphological Database, a comprehensive database of Chinese derivational affixes with adjacency attributes.

A derivational affix (DA) is a bound morpheme (though some also function as free forms) prefixed or suffixed to a base to create new words. In traditional morphology, DAs have no lexical meaning of their own and add only grammatical meaning. Here, we also include "lexical affixes" -- compound-forming word elements that have a substantial lexical meaning of their own. Identifying DAs is very useful in NLP, input method editor (IME), and information retrieval applications, because it significantly improves the accuracy of algorithmically identifying the countless lexemes not registered in the lexicon.

An important principle in our criteria for selecting an affix is its ability to combine with a base consisting of two or more characters, as when 迷 'fan' combines with 独轮车 'unicycle' to produce 独轮车迷 'unicycle fan'. If an affix combines only with single-character bases, it is excluded because of the danger of confusing it with two-character compounds in which it does not function as an affix, as in 入迷, or with a coincidental juxtaposition of a free form.
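The two-character minimum on the base can be sketched as a simple filter. This is only an illustration of the principle, not the Institute's selection procedure: the function name `split_suffix` and the one-entry suffix set are hypothetical, and only the examples 独轮车迷 and 入迷 come from the text above.

```python
def split_suffix(word, suffixes, min_base_len=2):
    """Return (base, suffix) if word ends in a listed suffix and the
    remaining base has at least min_base_len characters, else None."""
    for suffix in suffixes:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if len(base) >= min_base_len:
                return base, suffix
    return None


SUFFIXES = {"迷"}  # hypothetical one-entry affix list

print(split_suffix("独轮车迷", SUFFIXES))  # ('独轮车', '迷')
print(split_suffix("入迷", SUFFIXES))      # None: the base 入 is a single character
```

A real system would also consult the lexicon and the POS of the base; the length check alone only removes the two-character cases described above.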

An adjacency attribute is a part of speech (POS) code that indicates the morphological restrictions that apply to adjacent words or DAs when these are actually used in the formation of compound words or affixed lexemes. Adjacency attributes help programs identify DAs with greater reliability, especially in systems that fully support POS-tagging. For more details, see japaffix.htm.

Adjacency Attribute Fields
Type    [A1] Productive derivational suffix -- always bound
        [A2] Productive derivational suffix -- almost always bound
        [A3] Productive derivational suffix -- sometimes bound
        [B1] Historically productive derivational prefix -- always bound
        [B2] Historically productive derivational prefix -- sometimes bound
POS     Part-of-speech code. For details, see chinpos.htm.
Before  POS of the lexeme or base preceding a suffix. For example, "NC" for the suffix 迷 'fan' means that 迷 can be preceded by a common noun, as 独轮车 'unicycle', to produce 独轮车迷 'unicycle fan'.
After   POS of the lexeme or base following a prefix. For example, "NC" for the prefix 半 'semi-' means that 半 can be followed by a common noun, as 文盲 'illiterate', to produce 半文盲 'semi-illiterate'.
Result  POS of the lexeme that results from attaching the prefix or suffix. For example, "NC" for 独轮车迷 'unicycle fan' means that 独轮车迷 is a common noun.
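One way a program might hold these fields and test an adjacency restriction is sketched below. The class `AffixEntry` and the function `can_attach` are hypothetical names, not part of the database. The sketch assumes that POS code "WS" marks a suffix and "WP" a prefix, which is inferred from the data sample rather than stated in the field descriptions. The values for 迷 follow the mi2 row of the sample on the assumption that it is the entry for 迷, and the entry for 半 carries only the "NC" code quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffixEntry:
    affix: str
    pos: str                          # "WS" = suffix, "WP" = prefix (assumed)
    type_code: str                    # e.g. "A1", "A2", "A3"
    before: frozenset = frozenset()   # POS codes allowed before a suffix
    after: frozenset = frozenset()    # POS codes allowed after a prefix
    result: frozenset = frozenset()   # POS codes of the derived lexeme


def can_attach(entry, base_pos):
    """True if a base tagged base_pos may stand next to the affix."""
    allowed = entry.before if entry.pos == "WS" else entry.after
    return base_pos in allowed


MI = AffixEntry("迷", "WS", "A3",
                before=frozenset({"NC", "V"}), result=frozenset({"NC"}))
BAN = AffixEntry("半", "WP", "A2", after=frozenset({"NC"}))

print(can_attach(MI, "NC"))   # True: 独轮车 (NC) + 迷
print(can_attach(MI, "A"))    # False: "A" is not listed under Before
print(can_attach(BAN, "NC"))  # True: 半 + 文盲 (NC)
```

Frozen sets keep each entry immutable and make the membership test independent of the order in which the codes are listed.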

Data Sample

SC ID       SC Affix  TC Affix  POS Code  TYPE Code  Pinyin  Before      After     Result
S0007529A                       WS        A3         xian4   NP                    NP
S0009543A                       WS        A3         tuan2   NC V                  NC
S0010532B                       WS        A3         chu4    NC V                  NC
S0010875Aa                      WS        A3         tou0    NC                    NC
S0015201Ad                      WP        A2         zong3               NC V      NC V
S0034279A                       WS        A3         jie2    NC V NP               NC
S0047893A                       WS        A3         zhen4   NP                    NP
S0061252Aa                      WS        A2         yan2    NC                    NC
S0064269A                       WS        A3         hua4    NC V A D              NC V A
S0070103Ad                      WS        A1         ji1     V NC A                NC
S0072424Ab                      WS        A3         gui3    A NC V                NC
S0078485Aa                      WS        A1         xing2   NC V A NP             NC
S0084233Ah                      WP        A3         hao3                NC A      NC A
S0084666A                       WS        A3         gong1   V NC                  NC
S0098752Aa                      WS        A2         zhe3    NC V A                NC
S0096010Ad                      WS        A3         shou3   NC V                  NC
S0101751Aa                      WS        A2         suo3    NC V NA               NC
S0106449Ab                      WS        A3         xin1    NC V A                NC
S0112789Ab                      WS        A3         xing4   A NC V                NC
S0112870Ag                      WS        A3         sheng1  NC V A                NC
S0123643A                       WS        A3         zu2     NP NC V A             NC
S0121387Ah                      WP        A3         duo1                NC        NC
S0119011A                       WP        A3         da4                 NC V      NC V A D
S0120518Ag                      WP        A1         di4                 NN        NC
S0128279Ad                      WP        A3         chao1               NC A      NC A D
S0138060A                       WS        A3         pai4    NP NC A V             A NC
S0142229Af                      WP        A2         ban4                NC V A    NC V A
S0142513Ad                      WP        A3         fan3                NC A V    NC V A
S0143043A                       WS        A2         fan4    V NC A                NC
S0141475Ad                      WP        A1         wei1                NC NM V   NC
S0144731Ae                      WS        A3         pin3    V NC A                NC
S0148106A                       WS        A3         bu4     NC V                  NC
S0148384Ae                      WP        A2         fu4                 NC        NC
S0157840A                       WS        A3         mi2     NC V                  NC
S0164882Aa                      WS        A2         lv4     V NC A                NC
S0165711Af                      WP        A3         lao3                NC A V    NC
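Records like these could be loaded into a program roughly as follows. The sketch assumes a tab-separated export with the nine columns of the header row, which is an assumption about the delivery format rather than a documented one. The field names and `parse_rows` are hypothetical, and the two affix columns are left empty in the test data.

```python
import csv
import io

# Hypothetical field names, one per column of the sample header.
FIELDS = ["sc_id", "sc_affix", "tc_affix", "pos", "type",
          "pinyin", "before", "after", "result"]


def parse_rows(text):
    """Parse tab-separated rows into dicts; POS lists become tuples."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS,
                            delimiter="\t")
    rows = []
    for row in reader:
        for key in ("before", "after", "result"):
            # An empty or missing cell becomes an empty tuple.
            row[key] = tuple((row[key] or "").split())
        rows.append(row)
    return rows


# Two rows from the sample, re-encoded with tabs (assumed layout).
SAMPLE = (
    "S0157840A\t\t\tWS\tA3\tmi2\tNC V\t\tNC\n"
    "S0148384Ae\t\t\tWP\tA2\tfu4\t\tNC\tNC\n"
)

for row in parse_rows(SAMPLE):
    print(row["sc_id"], row["pos"], row["before"], row["after"], row["result"])
```

Keeping Before, After, and Result as separate tuples preserves the difference between an empty cell and a listed POS code, which a whitespace split of the whole line would lose.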