Japanese Lexical Database (JLD)


Overview and Coverage

Our comprehensive Japanese lexical resources (including bilingual and trilingual dictionaries) currently contain approximately three million entries, covering general vocabulary, technical terminology, proper nouns, company and organization names, and katakana loanwords.

This document describes our Japanese Lexical Database (JLD), a comprehensive database with a rich set of grammatical attributes fine-tuned for NLP applications such as machine translation (MT), information retrieval (IR), morphological analysis, and tokenization. It contains about 300,000 entries covering general vocabulary, both free forms and bound forms. The data is available in any encoding (UTF-8, EUC, Shift-JIS) and any format (plain text, Excel, HTML, etc.).

JLD includes a significant number of affixes, particles, auxiliaries, and conjugation patterns that account for all the inflectional, derivational, and lexical morphology of Japanese, enabling recognition of inflected and derived forms. To make JLD robust for IR, we highly recommend supplementing it with our Japanese Orthographical Database (JOD), described in detail in The Challenges of Intelligent Japanese Searching.

Description of Selected Fields
1 LEXEME Japanese word in standard kana-kanji orthography
2 HIRAGANA Reading in hiragana, including two types of okurigana: full okurigana and inflectional okurigana.
3 POS Part of speech code. See jappos.htm for POS code definitions.
4 SUBPOS Sub-part-of-speech code. See jappos.htm for SUBPOS code definitions.
5 CONJUG Conjugation pattern. See jappos.htm for CONJUG code definitions. More details are available on request.
6 TYPE A subclassification that identifies semantic properties of the headword or supplementary information such as grammatical attributes. See cpostype.htm for TYPE code definitions.
7 MORPH A subclassification that identifies additional morphological properties of the headword. See jappos.htm for MORPH code definitions.
8 VALENCY Binding valency that indicates the degree of binding between a stem/lexeme and an affix. See jappos.htm for VALENCY code definitions and japaffix.htm for a detailed description of the various morphological attributes.
9 RANKING Zero-padded six-digit number indicating a ranking based on frequency statistics.
10 SCRIPT The type of script used to write the headword:
J Japanese (kanji, hiragana, or a mixture of kanji/hiragana/romaji/katakana)
K Pure katakana -- headwords from the katakana words database have a POS value of "NC"
R Pure romaji or Latin characters
11 BEFORE An adjacency attribute that indicates the part of speech (POS) of the lexeme, stem or base preceding a suffix or suffix-like element. For example, "NX" for the compounding suffix 員 means that 員 can be preceded by a common noun or verbal noun, as in 研究員. Only given for suffixes.
12 AFTER An adjacency attribute that indicates the part of speech (POS) of the lexeme following a prefix or prefix-like element. For example, "NC" for the adnominal prefix 元 means that 元 can be followed by a common noun, as in 元総理大臣. Only given for prefixes.
13 COMPPOS The part of speech (POS) of the lexeme resulting from affixing a prefix or suffix. For example, "NC" for the adnominal prefix 元 means that prefixing 元 (to a common noun) results in a common noun (元総理大臣). Only given for affixes.
14 HEPBURN2 Reading in modified Hepburn romanization (macrons replaced by vowel repetition).
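
As an illustration of how the fields above might be consumed, here is a minimal Python sketch that loads one record and applies the BEFORE/COMPPOS adjacency attributes to a suffix. Several things in it are assumptions rather than part of the JLD specification: the tab-delimited layout with empty fields preserved, the names JldEntry, parse_line, POS_GROUPS and attach_suffix, the reading of "VN" as the verbal noun code, and the placement of the values from the がらみ sample row into columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JldEntry:
    """One JLD record, with attributes in the order of the field list."""
    lexeme: str
    hiragana: str
    pos: str
    subpos: str
    conjug: str
    type_: str
    morph: str
    valency: str
    ranking: str   # kept as text so the zero padding survives
    script: str
    before: str
    after: str
    comppos: str
    hepburn2: str

    @property
    def rank(self) -> int:
        """RANKING as an integer ("061089" becomes 61089)."""
        return int(self.ranking)


def parse_line(line: str) -> JldEntry:
    # ASSUMPTION: one tab-separated column per field, empty fields kept.
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 14:
        raise ValueError(f"expected 14 fields, got {len(cols)}")
    return JldEntry(*cols)


# "NX" covers common nouns and verbal nouns (see the BEFORE description).
# ASSUMPTION: "VN" is the verbal noun code; only "NC" is confirmed above.
POS_GROUPS = {"NX": frozenset({"NC", "VN"})}


def attach_suffix(stem_pos: str, suffix: JldEntry) -> Optional[str]:
    """Return the POS of stem+suffix, or None if BEFORE does not allow it."""
    if not suffix.before:
        return None  # BEFORE is only given for suffixes
    allowed = POS_GROUPS.get(suffix.before, frozenset({suffix.before}))
    return suffix.comppos if stem_pos in allowed else None


# Hypothetical tab-delimited form of a suffix row (column placement assumed).
row = "\t".join(["がらみ", "がらみ", "WS", "", "", "", "", "1",
                 "061089", "J", "NC", "", "NC", "garami"])
garami = parse_line(row)
print(garami.rank)                  # 61089
print(attach_suffix("NC", garami))  # NC
print(attach_suffix("V5", garami))  # None
```

Keeping RANKING as text and exposing the integer through a property preserves the six-digit form for round-tripping while still allowing numeric sorting.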

Sample of Japanese Lexical Database
1 2 3 4 5 6 7 8 9 10 11 12 13 14
がぶ飲み がぶのみ VN t 0 033273 J gabunomi
がましげ がましげ FS M 1 061089 J VC AN gamashige
がましさ がましさ WS 1 061089 J VC NC gamashisa
がま口 がまぐち NC 0 041445 J gamaguchi
がらがら がらがら D 0 033273 J garagara
がらがら がらがら VN i 0 033273 J garagara
がらがら蛇 がらがらへび NC 0 061089 J garagarahebi
がらくた がらくた NC 0 017822 J garakuta
がらっと がらっと D 0 041445 J garatto
がらっぱち がらっぱち AN 0 0 061089 J garappachi
がらっぱち がらっぱち NC 0 061089 J garappachi
がらみ がらみ WS 1 061089 J NC NC garami
がわり がわり WS 1 061089 J NC VN gawari
がんがん がんがん D 0 033273 J gangan
がんがん がんがん VN i 0 033273 J gangan
がんじがらめ がんじがらめ NC 0 013474 J ganjigarame
がんとして がんとして D 0 028538 J gantoshite
がん遺伝子 がんいでんし NC 0 013474 J gan'idenshi
がん化 がんか VN 0 028538 J ganka
がんセンター がんせんたー NC 0 025149 J gansenta_
慣れ なれ NC 0 017822 J nare
慣れきる な.れき-る V5 R 0 022662 J narekiru
慣れっこ なれっこ AN 1 0 020741 J narekko
慣れっこ なれっこ NC 0 020741 J narekko
慣れる な.れ-る V1 i 0 002465 J nareru
慣れる なれる WS 1 002465 J VC V1 nareru
慣れ切る なれき-る V5 R 0 033273 J narekiru
慣わし ならわし NC 0 033273 J narawashi
慣わす なら.わ-す V5 S t 0 061089 J narawasu
慣わす ならわす WS 1 061089 J VC V5 narawasu
慣行 かんこう NC 0 007161 J kanko_
慣行犯 かんこうはん NC 0 061089 J kanko_han
慣手段 かんしゅだん NC 0 061089 J kanshudan
慣習 かんしゅう NC 0 007457 J kanshu_
慣習法 かんしゅうほう NC 0 061089 J kanshu_ho_
慣熟 かんじゅく VN i 0 061089 J kanjuku
慣性 かんせい NC 0 013474 J kansei
慣性の法則 かんせいのほうそく U U 061089 J kanseinoho_soku
いき NC 0 061089 J iki
WS 1 061089 J NC NC u
うまれ NC 0 061089 J umare
うまれ WS 1 061089 J NC NP NC umare
うみ NC 0 061089 J umi
NC 0 061089 J ki
WP 1 061089 J NC NC ki
しょう NC 0 061089 J sho_
せい NR 0 003721 J sei
せい WS 1 003721 J NC NC sei
なま NC 0 010656 J nama
なま WP 1 010656 J NC NC nama
なまり NC 0 061089 J namari
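
The readings in the sample rows carry some notation that a consumer may want to normalize: HIRAGANA values such as な.れ-る contain "." and "-" markers, and HEPBURN2 values such as kanko_ show a long vowel as a vowel followed by an underscore. The Python sketch below is one possible way to handle both. It rests on assumptions drawn only from the sample rows: that "-" precedes the inflectional okurigana, that "." is a boundary marker that can be dropped to obtain the plain reading, and that "_" marks a long vowel. The function names are hypothetical.

```python
import re

# Long-vowel letters used when converting "vowel + underscore" to a macron.
_MACRONS = {"a": "ā", "i": "ī", "u": "ū", "e": "ē", "o": "ō"}


def plain_reading(hiragana: str) -> str:
    """Drop the '.' and '-' boundary markers from a HIRAGANA value."""
    return hiragana.replace(".", "").replace("-", "")


def split_inflection(hiragana: str) -> tuple:
    """Split a reading into (stem, inflectional ending).

    ASSUMPTION: the text after '-' is the inflectional okurigana.
    Readings without '-' come back with an empty ending.
    """
    stem, _, ending = hiragana.partition("-")
    return stem.replace(".", ""), ending


def underscore_to_macron(romaji: str) -> str:
    """Rewrite 'vowel + _' as a macron vowel.

    ASSUMPTION: '_' marks a long vowel, as the sample rows suggest.
    """
    return re.sub(r"([aiueo])_", lambda m: _MACRONS[m.group(1)], romaji)


print(plain_reading("な.れ-る"))         # なれる
print(split_inflection("な.れき-る"))    # ('なれき', 'る')
print(underscore_to_macron("kanko_"))   # kankō
```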