LEXICAL FREQUENCY STATISTICS IN CHINESE



©2001-2008 The CJK Dictionary Institute, Inc.



The CJK Dictionary Institute maintains comprehensive databases of lexical statistics, such as frequency of occurrence, for Japanese and Chinese, based on large corpora. The concept of "frequency" for Chinese lexical items is not straightforward, since it can be measured in several ways. This document describes several kinds of Chinese character frequency statistics with example tables; other kinds will be added in the near future. Japanese lexical statistics are described in a separate document.

The statistics in this document are reading-to-character frequencies: the characters for a given pinyin reading are listed in descending order of frequency, for both the Big Five and GB-2312 character sets. Each frequency value is therefore relative to the other characters in the same homophone group. The data is presented in two modes: (1) characters sorted by pinyin + tone + frequency, and (2) characters sorted by pinyin + frequency, ignoring tone. This data is especially useful for Chinese IME (input method editor) applications.
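As a rough illustration of how an IME might consume reading-to-character data, the sketch below builds a candidate list per reading, most frequent character first. The five sample rows are copied from Table 1; the names `ROWS` and `build_candidates` are hypothetical and are not part of any CJK Dictionary Institute product.

```python
from collections import defaultdict

# Sample rows copied from Table 1 (Big Five):
# (pinyin + tone, frequency order, Unicode code point in hex).
# They are deliberately out of order to show that sorting is needed.
ROWS = [
    ("ai4", 2, "5509"),
    ("ai4", 1, "611B"),
    ("ai4", 3, "827E"),
    ("ai1", 1, "5509"),
    ("ai1", 2, "54C0"),
]


def build_candidates(rows):
    """Map each reading to its characters, most frequent first."""
    grouped = defaultdict(list)
    for reading, order, code_point in rows:
        grouped[reading].append((order, chr(int(code_point, 16))))
    # Sorting the (order, character) pairs puts order 1 at the front.
    return {
        reading: [char for _, char in sorted(pairs)]
        for reading, pairs in grouped.items()
    }


candidates = build_candidates(ROWS)
for reading in sorted(candidates):
    print(reading, " ".join(candidates[reading]))
```

A real IME would load the full table and would probably combine this ordering with user history; the sketch covers only the lookup step.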

In addition to single-character statistics, we maintain similar statistics for word occurrence. The last table in this document shows one example (as yet unedited): a list of the top 100 words in Simplified Chinese.

Symbols used in the tables
Rank A high frequency
Rank B medium frequency
Rank C low frequency





Table 1: Big Five characters by pinyin + tone + frequency
Pinyin + tone Freq Hanzi Big5 Unicode B5 Rank
a1 01 B0DA 554A A
a1 02 AAFC 963F A
a1 03 B5CB 814C B
a1 04 EBE8 9312 B
a2 01 B0DA 554A A
a2 02 AAFC 963F A
a2 03 DCD3 55C4 B
a3 01 B0DA 554A A
a4 01 B0DA 554A A
a4 02 AAFC 963F A
a0 01 B0DA 554A A
a0 99 AAFC 963F A
ai1 01 ADFC 5509 A
ai1 02 AB73 54C0 A
ai1 03 AB75 54CE A
ai1 04 AE4A 57C3 A
ai1 05 AEC1 6328 A
ai1 06 D5D9 6B38 B
ai1 07 B1BA 6371 A
ai2 01 AEC1 6328 A
ai2 02 C0F9 764C A
ai2 03 B1BA 6371 A
ai2 04 BD4A 769A A
ai2 05 EF63 9A03 B
ai2 06 D4A9 5540 C
ai2 07 D4E4 5A3E C
ai2 07 E1F4 6573 B
ai3 01 B847 77EE A
ai3 02 C4A7 85F9 A
ai3 03 BEBC 566F A
ai3 04 C647 9744 A
ai3 05 D5D9 6B38 B
ai3 06 CB48 6BD0 B
ai3 07 CA64 4F41 B
ai3 08 CEF7 6639 C
ai4 01 B752 611B A
ai4 02 ADFC 5509 A
ai4 03 A6E3 827E A
ai4 04 C3AA 7919 A
ai4 05 C0C7 66D6 A
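Where the Hanzi column is not displayed, the character can be recovered from either code column. The sketch below decodes a Big5 hex code and checks it against the Unicode column for three rows of Table 1. It assumes that Python's built-in `big5` codec agrees with the mapping used to produce the table; `big5_to_char` is a hypothetical helper name.

```python
def big5_to_char(big5_hex: str) -> str:
    """Decode a four-digit Big5 hex code such as 'B0DA' to a character."""
    return bytes.fromhex(big5_hex).decode("big5")


# (Big5, Unicode) pairs copied from Table 1.
PAIRS = [("B0DA", "554A"), ("AAFC", "963F"), ("B752", "611B")]

for big5_hex, unicode_hex in PAIRS:
    char = big5_to_char(big5_hex)
    # Compare the decoded code point with the table's Unicode column.
    matches = f"{ord(char):04X}" == unicode_hex
    print(big5_hex, unicode_hex, char, "match" if matches else "MISMATCH")
```

Codes outside the standard Big5 repertoire may decode differently, or not at all, depending on the codec variant.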





Table 2: Big Five characters by pinyin + frequency
Pinyin Freq Hanzi Big5 Unicode B5 Rank Pinyin + tone
a 01 B0DA 554A A a0
a 02 AAFC 963F A a0
a 03 DCD3 55C4 B a2
a 04 B5CB 814C B a1
a 05 EBE8 9312 B a1
ai 01 B752 611B A ai4
ai 02 ADFC 5509 A ai1
ai 03 AB73 54C0 A ai1
ai 04 A6E3 827E A ai4
ai 05 C3AA 7919 A ai4
ai 06 AB75 54CE A ai1
ai 07 AE4A 57C3 A ai1
ai 08 B847 77EE A ai3
ai 09 AEC1 6328 A ai1
ai 10 C0F9 764C A ai2
ai 11 C0C7 66D6 A ai4
ai 12 B969 9698 A ai4
ai 13 C4A7 85F9 A ai3
ai 14 BEBC 566F A ai4
ai 15 C647 9744 A ai3
ai 16 D5D9 6B38 B ai1
ai 17 B1BA 6371 A ai2
ai 18 CB48 6BD0 B ai3
ai 19 BD4A 769A A ai2
ai 20 C0F5 74A6 A ai4
ai 21 F957 9749 B ai4
ai 22 E954 5B21 B ai4
ai 23 C940 4E42 B ai4
ai 24 EF63 9A03 B ai2
ai 25 D4A9 5540 C ai2
ai 25 EF7C 9D31 B ai4
ai 25 F4CF 8B6A B ai4
ai 26 D8A5 5828 B ai4
ai 27 CA64 4F41 B ai3
ai 28 CEF7 6639 C ai3
ai 28 ED54 6FED B ai4
ai 28 F669 9440 C ai4
ai 29 D4E4 5A3E C ai2
ai 29 E1F4 6573 B ai2
ai 30 E4ED 50FE B ai4





Table 3: GB-2312 characters by pinyin + tone + frequency
Pinyin + tone Freq Hanzi GB Qu-wei Unicode GB Rank
a1 01 0-1601 554A A
a1 02 0-1602 963F A
a1 03 0-7571 814C B
a1 04 0-7925 9515 C
a1 99 0-6325 5416 C
a2 01 0-1601 554A A
a2 02 0-1602 963F A
a2 03 0-6436 55C4 C
a3 01 0-1601 554A A
a4 01 0-1601 554A A
a4 02 0-1602 963F A
a0 01 0-1601 554A A
a0 99 0-1602 963F A
ai1 01 0-1606 5509 A
ai1 02 0-1607 54C0 A
ai1 03 0-1605 54CE B
ai1 04 0-1603 57C3 B
ai1 05 0-1604 6328 A
ai1 06 0-6263 6371 C
ai1 99 0-7945 953F C
ai2 01 0-1604 6328 A
ai2 02 0-1609 764C B
ai2 03 0-6263 6371 C
ai2 04 0-1608 7691 C
ai3 01 0-1611 77EE A
ai3 02 0-1610 853C B
ai3 03 0-6440 55F3 C
ai3 04 0-8616 972D C
ai4 01 0-1614 7231 A
ai4 02 0-1606 5509 A
ai4 03 0-1612 827E B
ai4 04 0-1613 788D A
ai4 05 0-7451 66A7 C
ai4 06 0-1615 9698 B
ai4 07 0-6440 55F3 C
ai4 08 0-7208 7477 C
ai4 09 0-7040 5AD2 C
ai4 99 0-7733 7839 C
ai4 99 0-6441 55CC C
an1 01 0-1618 5B89 A





Table 4: GB-2312 characters by pinyin + frequency
Pinyin Freq Hanzi GB Qu-wei Unicode GB Rank Pinyin + tone
a 01 0-1601 554A A a0
a 02 0-1602 963F A a0
a 03 0-7571 814C B a1
a 04 0-6436 55C4 C a2
a 05 0-7925 9515 C a1
a 99 0-6325 5416 C a1
ai 01 0-1614 7231 A ai4
ai 02 0-1606 5509 A ai1
ai 03 0-1607 54C0 A ai1
ai 04 0-1612 827E B ai4
ai 05 0-1613 788D A ai4
ai 06 0-1605 54CE B ai1
ai 07 0-1603 57C3 B ai1
ai 08 0-1611 77EE A ai3
ai 09 0-1604 6328 A ai1
ai 10 0-1609 764C B ai2
ai 11 0-7451 66A7 C ai4
ai 12 0-1615 9698 B ai4
ai 13 0-1610 853C B ai3
ai 14 0-6440 55F3 C ai4
ai 15 0-8616 972D C ai3
ai 16 0-6263 6371 C ai2
ai 17 0-1608 7691 C ai2
ai 18 0-7208 7477 C ai4
ai 19 0-7040 5AD2 C ai4
ai 99 0-7733 7839 C ai4
ai 99 0-7945 953F C ai1
ai 21 0-6441 55CC C ai4
an 01 0-1618 5B89 A an1
an 02 0-1624 6848 A an4
an 03 0-1620 6309 A an4
an 04 0-1621 6697 A an4
an 05 0-1622 5CB8 A an4
an 06 0-8786 9EEF C an4
an 07 0-1619 4FFA B an3
an 08 0-1623 80FA C an4
an 09 0-1616 978D B an1
an 10 0-5847 8C19 C an1
an 11 0-6654 5EB5 B an1
an 12 0-1617 6C28 B an1
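The qu-wei (row-cell) values can be converted to characters mechanically: in the common EUC-CN encoding of GB-2312, each of the two bytes is 0xA0 plus the row or cell number. The sketch below assumes that the leading "0-" in values such as 0-1601 is a prefix that can be ignored, which the source does not state; `quwei_to_char` is a hypothetical helper name.

```python
def quwei_to_char(quwei: str) -> str:
    """Convert a qu-wei value such as '0-1601' to its GB-2312 character."""
    # Assumption: everything before the hyphen is a prefix that is not
    # part of the row-cell number itself.
    digits = quwei.split("-")[-1]
    qu, wei = int(digits[:2]), int(digits[2:])
    # EUC-CN stores row and cell as 0xA0 + qu and 0xA0 + wei.
    return bytes([0xA0 + qu, 0xA0 + wei]).decode("gb2312")


# (qu-wei, Unicode) pairs copied from Table 4.
for quwei, unicode_hex in [("0-1601", "554A"), ("0-1602", "963F"), ("0-1618", "5B89")]:
    char = quwei_to_char(quwei)
    print(quwei, unicode_hex, char, f"{ord(char):04X}" == unicode_hex)
```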





Table 5: GB-2312 Word Frequency
Rank Hanzi Occurrence Percentage
1 104096 2.2019
2 27946 0.5911
3 25490 0.5392
4 24636 0.5211
5 20385 0.4312
6 16358 0.3460
7 14401 0.3046
8 10874 0.2300
9 10354 0.2190
10 10111 0.2139
11 9471 0.2003
12 9375 0.1983
13 9209 0.1948
14 9060 0.1916
15 8878 0.1878
16 8599 0.1819
17 8216 0.1738
18 7965 0.1685
19 7947 0.1681
20 7791 0.1648
21 6957 0.1472
22 中国 6876 0.1454
23 6652 0.1407
24 6415 0.1357
25 6228 0.1317
26 6061 0.1282
27 5815 0.1230
28 国家 5770 0.1220
29 5721 0.1210
30 5629 0.1191
31 5532 0.1170
32 工作 5412 0.1145
33 5312 0.1124
34 经济 5196 0.1099
35 发展 5149 0.1089
36 5145 0.1088
37 5079 0.1074
38 4935 0.1044
39 4842 0.1024
40 4817 0.1019
41 问题 4617 0.0977
42 4615 0.0976
43 全国 4550 0.0962
44 4481 0.0948
45 4315 0.0913
46 4297 0.0909
47 4276 0.0904
48 4243 0.0897
49 今天 4166 0.0881
50 4120 0.0871
51 4096 0.0866
52 4022 0.0851
53 企业 4012 0.0849
54 3968 0.0839
55 政府 3938 0.0833
56 人民 3892 0.0823
57 3879 0.0821
58 生产 3598 0.0761
59 技术 3542 0.0749
60 3536 0.0748
61 进行 3507 0.0742
62 3483 0.0737
63 3479 0.0736
64 3448 0.0729
65 美国 3441 0.0728
66 3399 0.0719
67 会议 3388 0.0717
68 3367 0.0712
69 3304 0.0699
70 建设 3275 0.0693
71 3252 0.0688
72 去年 3242 0.0686
73 地区 3184 0.0673
74 3162 0.0669
74 3162 0.0669
76 我国 3145 0.0665
77 3094 0.0654
77 他们 3094 0.0654
79 国际 3076 0.0651
80 3033 0.0642
81 3020 0.0639
82 2953 0.0625
83 2911 0.0616
84 亿 2840 0.0601
85 2799 0.0592
86 2783 0.0589
87 我们 2776 0.0587
88 使 2731 0.0578
89 2700 0.0571
90 举行 2672 0.0565
91 2580 0.0546
92 世界 2544 0.0538
93 2536 0.0536
94 2530 0.0535
95 领导 2529 0.0535
96 记者 2448 0.0518
97 组织 2413 0.0510
98 2412 0.0510
99 活动 2395 0.0507
100 2352 0.0498
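
The Occurrence and Percentage columns can be cross-checked against each other. The sketch below assumes, without confirmation from the source, that Percentage is the occurrence count divided by the total number of word tokens in the corpus, times 100. Under that assumption each row yields an estimate of the corpus size. The three sample rows are the entries ranked 22, 28 and 32; `estimate_corpus_size` is a hypothetical helper name.

```python
def estimate_corpus_size(occurrence: int, percentage: float) -> float:
    """Estimate total tokens, assuming percentage = occurrence / total * 100."""
    return occurrence / (percentage / 100.0)


# (word, occurrence, percentage) for the entries ranked 22, 28 and 32.
SAMPLE = [("中国", 6876, 0.1454), ("国家", 5770, 0.1220), ("工作", 5412, 0.1145)]

estimates = [estimate_corpus_size(occ, pct) for _, occ, pct in SAMPLE]
for (word, _, _), estimate in zip(SAMPLE, estimates):
    print(word, round(estimate))
```

If the assumption holds, the estimates should agree closely with one another, differing only by the rounding of the percentages to four decimal places. For these three rows they all come out at roughly 4.73 million tokens.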