LEXICAL FREQUENCY STATISTICS IN JAPANESE

The CJK Dictionary Institute maintains comprehensive databases of lexical statistics, such as frequency of occurrence, for Japanese and Chinese, based on large corpora. The concept of "frequency" is not straightforward for Japanese lexical items, because a single character or word can have several readings, and a single reading can correspond to several characters or words. This document describes seven kinds of Japanese word and character frequency statistics, with an example table for each. In principle, the same kinds of frequency statistics apply to Chinese as well.

Index to This Document

Character ranking
Word ranking
Reading-to-character frequency
Character-to-reading frequency
Frequency of occurrence of character readings
Frequency of occurrence of single characters
Reading-to-word frequency

Character Ranking (FR_KANJI)

The kanji in the JIS character set are listed in descending order of importance. Importance is close to, but not identical to, frequency of occurrence. A character such as 藤 may occur frequently because it is used in names, yet not be "important" for language learners. The same applies to high-frequency characters that are not in the Jooyoo Kanji List.

Character Ranking
Rank	Kanji	Unicode
0001	日	65E5
0002	一	4E00
0003	十	5341
0004	二	4E8C
0005	大	5927
0006	人	4EBA
0007	三	4E09
0008	会	4F1A
0009	国	56FD
0010	年	5E74
0011	中	4E2D
0012	本	672C
0013	東	6771
0014	五	4E94
0015	時	6642
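
As a minimal sketch of how a table in this layout might be consumed, the following Python parses a few of the rows above and checks that the Unicode column agrees with the kanji in each row. The function names and the assumption that the file is tab-separated with exactly three columns are illustrative only; they are not part of the FR_KANJI specification.

```python
# Sample rows copied from the Character Ranking table (tab-separated).
ROWS = """\
0001\t日\t65E5
0002\t一\t4E00
0003\t十\t5341
"""


def parse_character_ranking(text):
    """Return a list of (rank, kanji, code_point) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rank, kanji, hex_code = line.split("\t")
        entries.append((int(rank), kanji, int(hex_code, 16)))
    return entries


def mismatched_rows(entries):
    """Rows whose Unicode column does not match the kanji itself."""
    return [entry for entry in entries if ord(entry[1]) != entry[2]]


entries = parse_character_ranking(ROWS)
rank_of = {kanji: rank for rank, kanji, _ in entries}
```

A lookup such as `rank_of["十"]` then returns the rank of a character, and `mismatched_rows(entries)` is empty when every code point matches its kanji.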

Word Ranking (WRD_RANK)

Japanese lexemes (words, bound morphemes and phrases) are listed in descending order of frequency. The frequency is calculated by totaling the occurrences of each word-plus-reading combination in the corpus, so the same word can appear more than once with different readings (for example, 人 as にん and as ひと). If a lexeme has multiple parts of speech, only one of them is shown in the table below.

Word Ranking
Ranking	Word	POS	Sub-POS	Reading
000001	､	O
000002	を	PL	0	を
000003	に	PL	0	に
000004	が	J		が
000005	て	PL	0	て
000006	で	J		で
000007	と	D		と
000008	れる	V1		れる
000009	から	D		から
000010	の	PL	0	の
000011	なる	V5		なる
000012	こと	PL		こと
000013	や	I		や
000014	する	FS	V	する
000015	いう	V5		いう
000016	か	PL	0	か
000017	られる	V1		られる
000018	ため	NC		ため
000019	もの	PL	0	もの
000020	人	WS	N	にん
000021	日本	ZZ		にほん
000022	この	AA		この
000023	へ	I		へ
000024	･	O
000025	など	PL	0	など
000026	年	NC		ねん
000027	その	AA		その
000028	日	NC		にち
000029	ない	AJ		ない
000030	だけ	PL	0	だけ
000031	しかし	J		しかし
000032	また	D		また
000033	それ	I		それ
000034	人	NC		ひと
000035	出る	V1		でる
000036	せる	V1		せる
000037	円	NU		えん
000038	米国	ZZ		べいこく
000039	中	NC		なか
000040	多い	AJ		おおい
000041	できる	V1		できる
000042	場合	NC		ばあい
000043	まで	PL	0	まで
000044	前	NC		まえ
000045	問題	NC		もんだい
000046	約	NC		やく
000047	%	NU		ぱーせんと
000048	ので	PL	0	ので
000049	くる	FS	V	くる
000050	ても	I		ても
000051	開発	VN		かいはつ
000052	使う	V5		つかう
000053	事件	NC		じけん
000054	必要	NC		ひつよう
000055	企業	NC		きぎょう
000056	行う	V5		おこなう
000057	ところ	NC		ところ
000058	見る	V1		みる
000059	ながら	PL	0	ながら
000060	思う	V5		おもう

Reading-to-Character Frequency (FR_YOMKAN)

This table lists all the characters for a given reading in descending order of frequency, that is, the relative frequency within a homophone group. Readings written in hiragana are kun readings; readings written in katakana are on readings. Kanji with the same reading are ranked with the most common ones on top, which is especially useful for Japanese IME systems. The Frequency column uses the following codes:

01-94	Frequency of JIS208 Level 1 kanji in descending order
95	Lowest frequency of JIS208 Level 1 character
96	JIS208 Level 2 character -- exclude from normal picklists

Reading-to-Character Frequency
Reading	Frequency	Kanji
ア	01	阿
ア	02	亜
あ	03	吾
ア	04	窪
ア	05	唖
ア	06	鴉
ア	07	蛙
ア	95	娃
あ	96	呀
あ	96	嗟
ア	96	亞
ア	96	哇
ア	96	錏
ア	96	猗
ア	96	閼
ア	96	鐚
ア	96	堊
ア	96	婀
ア	96	痾
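
The IME use mentioned above can be sketched roughly as follows. This Python builds a candidate picklist for a typed reading from a few of the rows in the table, dropping code 96 entries by default as the code key suggests. Folding katakana to hiragana so that on and kun readings share one lookup key is an assumption of this sketch, as are the names `to_hiragana` and `picklist`.

```python
# A few rows copied from the table above: (reading, frequency code, kanji).
ROWS = [
    ("ア", 1, "阿"), ("ア", 2, "亜"), ("あ", 3, "吾"), ("ア", 4, "窪"),
    ("ア", 95, "娃"), ("あ", 96, "呀"), ("ア", 96, "亞"),
]

LEVEL2_CODE = 96  # "JIS208 Level 2 character -- exclude from normal picklists"


def to_hiragana(text):
    """Fold katakana (U+30A1..U+30F6) to hiragana so both scripts share a key."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch for ch in text
    )


def picklist(typed, rows=ROWS, include_level2=False):
    """Return candidate kanji for a typed reading, most frequent first."""
    key = to_hiragana(typed)
    matches = [
        (code, kanji)
        for reading, code, kanji in rows
        if to_hiragana(reading) == key
        and (include_level2 or code != LEVEL2_CODE)
    ]
    # sorted() is stable, so rows that share a code keep their table order.
    return [kanji for code, kanji in sorted(matches, key=lambda m: m[0])]
```

Calling `picklist("あ")` returns only the Level 1 candidates in frequency order; passing `include_level2=True` appends the code 96 characters after them.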

Character-to-Reading Frequency (FR_KANYOMI)

This indicates the relative frequency of each reading associated with a single character. Note that the on reading block precedes the kun reading block, even if the on readings are rare.

Character-to-Reading Frequency
Kanji	Frequency	Reading	Unicode
印	01	イン	5370
印	02	しるし	5370
印	03	しる.す	5370
咽	01	イン	54BD
咽	02	エン	54BD
咽	03	エツ	54BD
咽	04	むせ.ぶ	54BD
咽	05	むせ.る	54BD
咽	06	のど	54BD
咽	07	の.む	54BD
員	01	イン	54E1
因	01	イン	56E0
因	02	よ.る	56E0
因	03	ちな.む	56E0
姻	01	イン	59FB
引	01	イン	5F15
引	02	ひ.く	5F15
引	03	ひ.ける	5F15
飲	01	イン	98F2
飲	02	の.む	98F2
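
One way to work with this table is to group the readings per character and label each as on or kun from its script, following the convention stated earlier (katakana for on, hiragana for kun). The Python below is a sketch under that assumption; the helper names are hypothetical, and the "." inside readings such as しる.す is kept as-is because this document does not define it.

```python
from collections import defaultdict

# Rows copied from the table above: (kanji, frequency rank, reading).
ROWS = [
    ("印", 1, "イン"), ("印", 2, "しるし"), ("印", 3, "しる.す"),
    ("因", 1, "イン"), ("因", 2, "よ.る"), ("因", 3, "ちな.む"),
    ("員", 1, "イン"),
]


def reading_type(reading):
    """Classify a reading by the script of its first character."""
    first = reading[0]
    if "\u30a1" <= first <= "\u30fa":  # katakana letters
        return "on"
    if "\u3041" <= first <= "\u3096":  # hiragana letters
        return "kun"
    return "unknown"


def readings_by_kanji(rows):
    """Group readings per kanji, ordered by the table's frequency rank."""
    grouped = defaultdict(list)
    for kanji, _rank, reading in sorted(rows, key=lambda r: (r[0], r[1])):
        grouped[kanji].append((reading, reading_type(reading)))
    return dict(grouped)


table = readings_by_kanji(ROWS)
```

The first entry of `table["印"]` is then the highest-ranked reading for that character, which in this table is always an on reading when one exists.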

Occurrence of Character Readings (RD_OCCUR)

This is the frequency of occurrence of character readings, calculated by counting the number of occurrences of each reading in the corpus. It is the "real" frequency of occurrence for kanji readings, expressed as a percentage of occurrence in the corpus.

Occurrence of Character Readings
Reading	Kanji	Occurrence (%)
おおきな	大きな	0.01068846
おおき･い	大きい	0.00693874
だい	大	0.00602759
おおいに	大いに	0.00564211
おおきさ	大きさ	0.00112141
おお	大	0.00035044
おおいさ	大いさ	0.00000000
おっき･い	大きい	0.00000000
おっきな	大きな	0.00000000
おおきに	大きに	0.00000000
たいした	大した	0.00017522
たいして	大して	0.00017522
だい･だ	大だ	0.00031540
だいの	大の	0.00031540
おおらか･だ	大らかだ	0.00003504

Occurrence of Single Characters (CH_OCCUR)

This is the frequency of occurrence of single characters that are used as single-character independent words and/or affixes, expressed as a percentage of all Japanese words occurring in the corpus. It is the "real" frequency of occurrence for such characters. Each reading is counted separately; the occurrences for the readings of one character, such as にん and ひと for 人, can be added to get the character's total occurrence.

Occurrence of Single Characters
Kanji	Reading	Occurrence (%)
人	にん	0.33926233
年	ねん	0.22964426
人	ひと	0.18184405
円	えん	0.16256977
日	にち	0.15969615
中	なか	0.12626404
第	だい	0.08908221
各	かく	0.08400081
米	べい	0.07576543
約	やく	0.07555517
日	にち	0.07485429
前	まえ	0.06844121
社	しゃ	0.06837112
側	がわ	0.06525219
歳	さい	0.06199309
内	ない	0.06031097
私	わたし	0.06010070
今	いま	0.05726213
国	くに	0.05687664
後	ご	0.05628089
氏	し	0.05603558
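
The addition of per-reading occurrences described above can be illustrated with a short Python sketch using four rows from the table. `Decimal` is used so that the eight-digit percentages are added exactly; the function name `total_occurrence` is an illustrative choice, not part of CH_OCCUR.

```python
from collections import defaultdict
from decimal import Decimal

# Rows copied from the table above: (kanji, reading, occurrence in percent).
ROWS = [
    ("人", "にん", "0.33926233"),
    ("年", "ねん", "0.22964426"),
    ("人", "ひと", "0.18184405"),
    ("円", "えん", "0.16256977"),
]


def total_occurrence(rows):
    """Sum the occurrence percentages of all readings of each kanji."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for kanji, _reading, percent in rows:
        totals[kanji] += Decimal(percent)
    return dict(totals)


totals = total_occurrence(ROWS)
```

For these rows, `totals["人"]` is 0.33926233 + 0.18184405 = 0.52110638, which places 人 well ahead of 年 once its two readings are combined.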

Reading-to-Word Frequency (RD_WORD)

Each "word" in Japanese may have one or multiple readings, each of which normally represents a distinct lexeme. This table gives the relative frequency of each word within a homophone group (words that share a reading), based on word occurrence statistics. It is ideally suited to IME applications.

Reading-to-Word Frequency
Reading	Frequency	Word	POS
しゅさい	01	主催	VN
しゅさい	02	主宰	VN
しゅさい	99	主祭	NC
しゅさいこく	99	主催国	NC
しゅさいち	99	主催地	NC
しゅさつ	99	手札	NC
しゅさんち	01	主産地	NC
しゅざ	01	首座	NC
しゅざい	01	取材	VN
しゅざい	02	主剤	NC
しゅざい	99	首罪	NC
しゅざいじん	99	取材陣	NC
しゅざん	01	珠算	NC
しゅし	01	趣旨	NC
しゅし	02	種子	NC
しゅし	03	主旨	NC
しゅしゃ	01	手写	VN
しゅしゃ	99	取捨	VN
しゅしゃせんたく	01	取捨選択	VN
しゅしゅ	99	守株	VN
しゅしゅ	99	種々	NC
しゅしょ	99	手書	VN
しゅしょ	99	朱書	VN
しゅしょう	01	首相	NC
しゅしょう	02	主将	NC
しゅしょう	03	主唱	VN
しゅしょう	99	首将	NC
しゅしょう	99	手抄	VN
しゅしょう	99	首唱	VN
しゅしょう	99	殊勝	AN
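
A rough Python sketch of the IME application follows: it indexes a few rows from the table by reading and returns the words of a homophone group with the lowest frequency codes first. This document does not define code 99, so the sketch assumes only that entries with smaller codes should be offered earlier; `build_index` and `candidates` are hypothetical names.

```python
from collections import defaultdict

# Rows copied from the table above: (reading, frequency code, word, POS).
ROWS = [
    ("しゅさい", 1, "主催", "VN"),
    ("しゅさい", 2, "主宰", "VN"),
    ("しゅさい", 99, "主祭", "NC"),
    ("しゅし", 1, "趣旨", "NC"),
    ("しゅし", 2, "種子", "NC"),
    ("しゅし", 3, "主旨", "NC"),
]


def build_index(rows):
    """Map each reading to its homophone group, sorted by frequency code."""
    index = defaultdict(list)
    for reading, code, word, pos in rows:
        index[reading].append((code, word, pos))
    for group in index.values():
        # list.sort() is stable, so equal codes keep their table order.
        group.sort(key=lambda entry: entry[0])
    return dict(index)


def candidates(index, reading):
    """Return conversion candidates for a reading, most frequent first."""
    return [word for _code, word, _pos in index.get(reading, [])]


index = build_index(ROWS)
```

With this index, `candidates(index, "しゅさい")` offers 主催 before 主宰 and 主祭, and an unknown reading yields an empty list.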
