LEXICAL FREQUENCY STATISTICS IN JAPANESE



©2001-2008 The CJK Dictionary Institute, Inc.



The CJK Dictionary Institute maintains comprehensive databases of lexical statistics, such as frequency of occurrence, for Japanese and Chinese, based on large corpora. The notion of "frequency" for Japanese lexical items is not straightforward, because it can be measured over characters, readings, or words, and either across the whole corpus or within a group of homophones. This document describes seven kinds of Japanese word and character frequency statistics, each with an example table. In principle, the same kinds of frequency statistics apply to Chinese as well.


Index to This Document
  1. Character ranking
  2. Word ranking
  3. Reading-to-character frequency
  4. Character-to-reading frequency
  5. Frequency of occurrence of character readings
  6. Frequency of occurrence of single characters
  7. Reading-to-word frequency



  1. Character Ranking (FR_KANJI)

    The kanji in the JIS character set are listed in descending order of importance. Importance is close to, but not identical with, frequency of occurrence. A character such as 藤 can be frequent because it is used in names, yet not be "important" for language learners. The same applies to other high-frequency characters that are not in the Jooyoo Kanji List.

    Character Ranking
    Rank Kanji Unicode
    0001 日 65E5
    0002 一 4E00
    0003 十 5341
    0004 二 4E8C
    0005 大 5927
    0006 人 4EBA
    0007 三 4E09
    0008 会 4F1A
    0009 国 56FD
    0010 年 5E74
    0011 中 4E2D
    0012 本 672C
    0013 東 6771
    0014 五 4E94
    0015 時 6642
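
    The Unicode column identifies each ranked character unambiguously, so the character itself can always be derived from it. The following is a minimal Python sketch of that lookup, assuming the rows are available as (rank, code point) pairs; the names ROWS, kanji_from_hex and ranking are illustrative and are not part of the FR_KANJI data format.

```python
# Rows copied from the Character Ranking table: (rank, Unicode code point in hex).
ROWS = [
    ("0001", "65E5"),
    ("0002", "4E00"),
    ("0003", "5341"),
    ("0004", "4E8C"),
    ("0005", "5927"),
]


def kanji_from_hex(code_point: str) -> str:
    """Convert a hexadecimal Unicode code point such as '65E5' to its character."""
    return chr(int(code_point, 16))


# Map each character to its numeric rank (hypothetical in-memory layout).
ranking = {kanji_from_hex(cp): int(rank) for rank, cp in ROWS}

for kanji, rank in ranking.items():
    print(rank, kanji)
```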



  2. Word Ranking (WRD_RANK)

    Japanese lexemes (words, bound morphemes and phrases) are listed in descending order of frequency. The ranking is calculated by summing the occurrences of each word and reading combination in the corpus. If a lexeme has multiple parts of speech, only one of them is shown in the table below.

    Word Ranking
    Ranking Word POS Sub-POS Reading

    000001 O    
    000002 PL 0
    000003 PL 0
    000004 J  
    000005 PL 0
    000006 J  
    000007 D  
    000008 れる V1   れる
    000009 から D   から
    000010 PL 0
    000011 なる V5   なる
    000012 こと PL   こと
    000013 I  
    000014 する FS V する
    000015 いう V5   いう
    000016 PL 0
    000017 られる V1   られる
    000018 ため NC   ため
    000019 もの PL 0 もの
    000020 WS N にん
    000021 日本 ZZ   にほん
    000022 この AA   この
    000023 I  
    000024 O    
    000025 など PL 0 など
    000026 NC   ねん
    000027 その AA   その
    000028 NC   にち
    000029 ない AJ   ない
    000030 だけ PL 0 だけ
    000031 しかし J   しかし
    000032 また D   また
    000033 それ I   それ
    000034 NC   ひと
    000035 出る V1   でる
    000036 せる V1   せる
    000037 NU   えん
    000038 米国 ZZ   べいこく
    000039 NC   なか
    000040 多い AJ   おおい
    000041 できる V1   できる
    000042 場合 NC   ばあい
    000043 まで PL 0 まで
    000044 NC   まえ
    000045 問題 NC   もんだい
    000046 NC   やく
    000047 % NU   ぱーせんと
    000048 ので PL 0 ので
    000049 くる FS V くる
    000050 ても I   ても
    000051 開発 VN   かいはつ
    000052 使う V5   つかう
    000053 事件 NC   じけん
    000054 必要 NC   ひつよう
    000055 企業 NC   きぎょう
    000056 行う V5   おこなう
    000057 ところ NC   ところ
    000058 見る V1   みる
    000059 ながら PL 0 ながら
    000060 思う V5   おもう



  3. Reading-to-Character Frequency (FR_YOMKAN)

    All the characters for a given reading are listed in descending order of frequency (hiragana stands for a kun reading, katakana for an on reading). The value is therefore a relative frequency within a homophone group: as the table shows, kanji with the same reading are ranked by frequency, with the most common ones at the top. This is especially useful for Japanese IME systems.

    01-94 Frequency of JIS208 Level 1 kanji in descending order
    95 Lowest frequency of JIS208 Level 1 character
    96 JIS208 Level 2 character -- exclude from normal picklists


    Reading-to-Character Frequency
    Reading Frequency Kanji

    01
    02
    03
    04
    05
    06
    07
    95
    96
    96
    96
    96
    96
    96
    96
    96
    96
    96
    96
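
    The frequency codes above suggest how an IME might build its candidate list for a reading: sort by code and leave out code 96 unless the user asks for rare characters. The sketch below illustrates this under stated assumptions; the function name picklist, the include_level2 flag and the placeholder candidates "A", "B", "Y", "C", "D" are hypothetical and stand in for real kanji, which are not reproduced here.

```python
LEVEL2_CODE = 96  # JIS208 Level 2 character, per the legend above


def picklist(candidates, include_level2=False):
    """Return the candidates for one reading, most frequent first.

    candidates: iterable of (frequency_code, kanji) pairs for a single reading.
    Entries coded 96 are dropped unless include_level2 is True.
    """
    kept = [
        (code, kanji)
        for code, kanji in candidates
        if include_level2 or code != LEVEL2_CODE
    ]
    # sorted() is stable, so entries sharing a code keep their input order.
    return [kanji for code, kanji in sorted(kept, key=lambda pair: pair[0])]


# Placeholder data: the letters stand in for kanji sharing one reading.
sample = [(96, "C"), (2, "B"), (95, "Y"), (1, "A"), (96, "D")]

print(picklist(sample))                       # ['A', 'B', 'Y']
print(picklist(sample, include_level2=True))  # ['A', 'B', 'Y', 'C', 'D']
```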



  4. Character-to-Reading Frequency (FR_KANYOMI)

    This indicates the relative frequency of each reading associated with a single character. Note that the block of on readings precedes the block of kun readings, even if the on readings are rare.

    Character-to-Reading Frequency
    Kanji Frequency Reading Unicode

    印 01 イン 5370
    印 02 しるし 5370
    印 03 しる.す 5370
    咽 01 イン 54BD
    咽 02 エン 54BD
    咽 03 エツ 54BD
    咽 04 むせ.ぶ 54BD
    咽 05 むせ.る 54BD
    咽 06 のど 54BD
    咽 07 の.む 54BD
    員 01 イン 54E1
    因 01 イン 56E0
    因 02 よ.る 56E0
    因 03 ちな.む 56E0
    姻 01 イン 59FB
    引 01 イン 5F15
    引 02 ひ.く 5F15
    引 03 ひ.ける 5F15
    飲 01 イン 98F2
    飲 02 の.む 98F2
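
    As an illustration of how this table might be consumed, the sketch below groups the readings of each character into on and kun lists while keeping the frequency order. It relies on the convention stated in section 3 (katakana for on readings, hiragana for kun readings) and on the Unicode column to identify the character. The names ROWS, is_on_reading and group_readings are hypothetical and not part of the FR_KANYOMI format.

```python
from collections import defaultdict

# (frequency, reading, Unicode code point) rows copied from the table above.
ROWS = [
    (1, "イン", "5370"),
    (2, "しるし", "5370"),
    (3, "しる.す", "5370"),
    (1, "イン", "56E0"),
    (2, "よ.る", "56E0"),
    (3, "ちな.む", "56E0"),
]


def is_on_reading(reading: str) -> bool:
    """True if the reading starts with a character in the Katakana block."""
    return "\u30a0" <= reading[0] <= "\u30ff"


def group_readings(rows):
    """Return {kanji: {"on": [...], "kun": [...]}} in frequency order."""
    grouped = defaultdict(lambda: {"on": [], "kun": []})
    # Sort by code point, then by frequency, so each list stays ranked.
    for freq, reading, code_point in sorted(rows, key=lambda r: (r[2], r[0])):
        kanji = chr(int(code_point, 16))
        kind = "on" if is_on_reading(reading) else "kun"
        grouped[kanji][kind].append(reading)
    return dict(grouped)


readings = group_readings(ROWS)
print(readings["印"])
```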



  5. Occurrence of Character Readings (RD_OCCUR)

    This is the frequency of occurrence of character readings, calculated by counting the number of occurrences of each reading in the corpus. It is the "real" frequency of occurrence for kanji readings, expressed as a percentage of occurrence in the corpus.

    Occurrence of Character Readings
    Reading Kanji Occurrence (%)

    おおきな 大きな 0.01068846
    おおき・い 大きい 0.00693874
    だい 0.00602759
    おおいに 大いに 0.00564211
    おおきさ 大きさ 0.00112141
    おお 0.00035044
    おおいさ 大いさ 0.00000000
    おっき・い 大きい 0.00000000
    おっきな 大きな 0.00000000
    おおきに 大きに 0.00000000
    たいした 大した 0.00017522
    たいして 大して 0.00017522
    だい・だ 大だ 0.00031540
    だいの 大の 0.00031540
    おおらか・だ 大らかだ 0.00003504



  6. Occurrence of Single Characters (CH_OCCUR)

    This is the frequency of occurrence of single characters, expressed as a percentage of all Japanese words occurring in the corpus. It is the "real" frequency of occurrence for single characters that are used as independent single-character words and/or affixes. The occurrences for the individual readings of a character, such as にん and ひと for 人, can be summed to obtain the total occurrence of that character.

    Occurrence of Single Characters
    Kanji Reading Occurrence (%)

    にん 0.33926233
    ねん 0.22964426
    ひと 0.18184405
    えん 0.16256977
    にち 0.15969615
    なか 0.12626404
    だい 0.08908221
    かく 0.08400081
    べい 0.07576543
    やく 0.07555517
    にち 0.07485429
    まえ 0.06844121
    しゃ 0.06837112
    がわ 0.06525219
    さい 0.06199309
    ない 0.06031097
    わたし 0.06010070
    いま 0.05726213
    くに 0.05687664
    0.05628089
    0.05603558
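
    The summation of per-reading occurrences described above can be sketched as follows. This is a minimal example, assuming that the にん and ひと rows in the table belong to 人, as the text suggests; the names ROWS and total_occurrence are illustrative only. Decimal is used so that the published percentages are added exactly rather than as binary floating-point values.

```python
from collections import defaultdict
from decimal import Decimal

# (kanji, reading, occurrence %) rows; the assignment of both rows to 人
# is an assumption based on the example given in the text.
ROWS = [
    ("人", "にん", "0.33926233"),
    ("人", "ひと", "0.18184405"),
]


def total_occurrence(rows):
    """Sum the occurrence percentages of all readings of each character."""
    totals = defaultdict(Decimal)  # Decimal() starts at 0
    for kanji, _reading, percent in rows:
        totals[kanji] += Decimal(percent)
    return dict(totals)


totals = total_occurrence(ROWS)
print(totals["人"])  # 0.52110638
```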



  7. Reading-to-Word Frequency (RD_WORD)

    Each "word" in Japanese may have one or multiple readings, each of which normally represents a distinct lexeme. This is the relative frequency within a homophone group based on statistics for frequency of word occurrence, and is ideally suited to IME applications.

    Reading-to-Word Frequency
    Reading Frequency (%) Word POS

    しゅさい 01 主催 VN
    しゅさい 02 主宰 VN
    しゅさい 99 主祭 NC
    しゅさいこく 99 主催国 NC
    しゅさいち 99 主催地 NC
    しゅさつ 99 手札 NC
    しゅさんち 01 主産地 NC
    しゅざ 01 首座 NC
    しゅざい 01 取材 VN
    しゅざい 02 主剤 NC
    しゅざい 99 首罪 NC
    しゅざいじん 99 取材陣 NC
    しゅざん 01 珠算 NC
    しゅし 01 趣旨 NC
    しゅし 02 種子 NC
    しゅし 03 主旨 NC
    しゅしゃ 01 手写 VN
    しゅしゃ 99 取捨 VN
    しゅしゃせんたく 01 取捨選択 VN
    しゅしゅ 99 守株 VN
    しゅしゅ 99 種々 NC
    しゅしょ 99 手書 VN
    しゅしょ 99 朱書 VN
    しゅしょう 01 首相 NC
    しゅしょう 02 主将 NC
    しゅしょう 03 主唱 VN
    しゅしょう 99 首将 NC
    しゅしょう 99 手抄 VN
    しゅしょう 99 首唱 VN
    しゅしょう 99 殊勝 AN
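
    For an IME, a table like this would typically be turned into a lookup from reading to ranked candidate words. The sketch below shows one possible way to do so, using rows copied from the table. It assumes that a lower frequency code means a more common word and that code 99 marks the least common entries; the document does not define code 99, so that interpretation, like the names ROWS and build_index, is an assumption.

```python
from collections import defaultdict

# (reading, frequency code, word, POS) rows copied from the table above,
# deliberately shuffled to show that the index restores the ranking.
ROWS = [
    ("しゅし", 3, "主旨", "NC"),
    ("しゅし", 1, "趣旨", "NC"),
    ("しゅし", 2, "種子", "NC"),
    ("しゅざい", 99, "首罪", "NC"),
    ("しゅざい", 1, "取材", "VN"),
    ("しゅざい", 2, "主剤", "NC"),
]


def build_index(rows):
    """Return {reading: [words ordered by ascending frequency code]}."""
    index = defaultdict(list)
    for reading, _code, word, _pos in sorted(rows, key=lambda r: (r[0], r[1])):
        index[reading].append(word)
    return dict(index)


index = build_index(ROWS)
print(index["しゅし"])    # ['趣旨', '種子', '主旨']
print(index["しゅざい"])  # ['取材', '主剤', '首罪']
```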