CJK ON ONE WHEEL?

by Edward Cherlin

Introduction

Jack Halpern loves two things that he wants you to love too. He loves to ride unicycles, including the thirteen-wheeler shown here, and never misses a chance to teach others to ride or to promote unicycling as a sport. More than anything, Halpern loves languages, especially Japanese and Chinese. As a result, he has devoted many years to learning these languages, compiling Japanese and Chinese dictionaries, and providing large-scale lexical databases to developers of CJK language technology applications.

Halpern came by his first half-dozen languages naturally while growing up. He was born in Germany just after World War II. His family, typical “wandering Jews,” moved to Israel, France, Brazil, and the United States, and he then moved back to Israel. During these wanderings, he acquired a number of languages in addition to his native Yiddish: German, Hebrew, English, and Portuguese. His strictly orthodox Jewish education included Biblical Hebrew and Aramaic.

After the 1967 Arab-Israeli war, he went to live in a kibbutz in Israel, where his encounter with some Japanese volunteers and Chinese characters changed his life forever. The way characters are put together to create new characters fascinated him. His favorite example is the character 木, which is a picture of a tree (ki in Japanese). Putting two trees together makes 林 (hayashi), which means woods. Three trees make 森 (mori), forest.

In 1973, after exhausting the materials available to him for studying Japanese in Israel, Halpern moved with his family to Japan, where he has remained ever since. Over the years his “language collection” expanded to include Chinese, both Traditional and Simplified, and he learned various Japanese dialects including the Kansai (western) and Okinawa dialects.

Halpern has also learned Spanish, Judezmo (aka Ladino), and the international languages Esperanto and Interlingua. More recently, he began to study Arabic intensely, and during a brief vacation in Aruba he literally “picked up the basics” and became hooked on his 14th language, Papiamento, a local Creole based on Spanish and Dutch.

Judaism and Yiddish

Halpern has also put a lot of effort into explaining Judaism and Jews to the Japanese. Japan has, simultaneously, an almost mythic regard for Jewish accomplishments in business, the professions, the arts, and other fields, and a reservoir of totally mythic anti-Semitism, going back centuries before the wartime alliance with Nazi Germany. Japan first heard about Jews in the 16th century, from early contacts with Catholic Spain and Portugal. Besides giving some 600 public talks on these topics, Halpern took on the job of managing the Japanese translation of the Talmud, the great compendium of Jewish law and lore (written mostly in Aramaic). He also founded the Japan Yiddish Club, the only place to learn Yiddish in Japan, and published a Yiddish newsletter, “Der Yapanisher Yid.”

Publications

Halpern has published some twenty books and hundreds of articles and papers, and presented his work at dozens of academic conferences, but he is best known for his dictionaries, produced through the Kanji Dictionary Publishing Society (http://www.kanji.org). The New Japanese-English Character Dictionary exists in three editions: two printed versions, one for Japan and one for the rest of the world, and an electronic version. The Kodansha Kanji Learner's Dictionary is a subset of Halpern’s larger dictionaries, reorganized to help the learner not only to look up characters, but to easily learn their interrelationships. These dictionaries have become standard reference works in Japanese-language education circles worldwide.

The design of Halpern’s dictionaries grew out of his own experience in learning Japanese in Israel. He was puzzled by the apparent randomness of the forms and meanings of kanji, where a single character may have several readings derived from Chinese (on), several readings for Japanese words (kun), and multiple, seemingly unrelated, meanings.

The result is a mass of unorganized material that must just be memorized. Halpern assumed at first that in Japan things were done better, but was disappointed to learn that the teaching methods in Japanese schools required schoolchildren to mechanically memorize a certain set of characters in each grade, along with a variety of compound words containing them.

Halpern solved this problem in his dictionaries by tracing the various meanings of each character back to one or more core meanings, and finding a concise, easy-to-memorize English keyword for each. Understanding the core meaning makes it much easier to remember the various derived meanings.

Consider the core meaning of 留, which has the on pronunciations ryū and ru in Chinese compounds, and the kun readings tomeru and todomeru, among others, in Japanese verbs. Halpern relates all of its uses to the one key concept KEEP:

1. KEEP in place, fix

2. KEEP in custody, detain

3. KEEP for future use, reserve

4. KEEP in mind, pay attention to

An important feature of any kanji dictionary is indexing. No single method of organizing characters is completely satisfactory. It must be possible to look up characters by reading and also by shape. The traditional shape-based method identifies a component of each character, called a radical, then requires counting the strokes in the rest of the character. This method is clumsy and slow.

Halpern spent some seven years analyzing and developing a more efficient pattern-recognition based indexing method for kanji, called SKIP (System of Kanji Indexing by Patterns). SKIP is a very quick and effective indexing method, particularly helpful to learners, and is being widely used by online kanji dictionaries.

SC<>TC Conversion

It is widely assumed that conversion between Traditional Chinese (TC) characters such as 馬 and Simplified Chinese (SC) characters such as 马 is simply a matter of table lookup. After all, the PRC government created its repertoire of Simplified Characters by taking Traditional Characters and defining substitutes for them. What could be the problem?

Well, the problem comes from several sources. Jack Halpern and Jouni Kerman, also of The CJK Dictionary Institute, explain this in a conference paper reproduced at http://www.cjk.org/cjk/c2c/c2centry.htm. For one thing, the PRC simplifications are not entirely consistent. Several TC characters can correspond to the same simplified form. For example, both TC 面 (mian⁴, face) and TC 麵 (mian⁴, noodles) map to SC 面, so going from SC back to TC is a one-to-many, ambiguous relation.
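The ambiguity described above can be made concrete with a few lines of Python. This is only a minimal sketch: the table `SC_TO_TC_CANDIDATES` and the helper names are hypothetical, the two entries are taken from the examples in this article, and nothing here reflects how CJKI's actual data or tools are organized.

```python
# Toy SC -> TC candidate table (hypothetical; not CJKI data).
# Each simplified character maps to every traditional character
# it may stand for.
SC_TO_TC_CANDIDATES = {
    "马": ["馬"],        # one-to-one: plain table lookup is enough
    "面": ["面", "麵"],  # one-to-many: "face" or "noodles"
}


def tc_candidates(sc_char):
    """Return every TC character that sc_char may represent."""
    # Assumption: a character missing from the table is unchanged.
    return SC_TO_TC_CANDIDATES.get(sc_char, [sc_char])


def is_ambiguous(sc_char):
    """True when table lookup alone cannot pick the TC form."""
    return len(tc_candidates(sc_char)) > 1
```

For 马 the lookup settles the matter, but for 面 it returns two candidates, and only the surrounding word or context can decide between them.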

Moreover, vocabulary in the PRC and other Chinese-speaking countries has diverged significantly since the Communist regime took over in 1949. The result of replacing simplified characters with the original traditional characters can be meaningless. The correct TC equivalent for an SC word may not be a matter of character-based code substitution (code conversion) or character-based word substitution, called orthographic conversion (like colour vs. color in British and American English), but often requires meaning-based word translation, called lexemic conversion (like truck vs. lorry). The problem is particularly acute for computer technology and proper nouns. For example, Internet is written 因特网 in SC and 網際網路 in TC, while Osama bin Ladin is written as shown below (see http://www.cjk.org/cjk/c2c/c2centry.htm for a detailed paper).

Osama bin Ladin

Arabic	أسامة بن لادن
SC (Mainland)	奥萨马本拉登
TC (Taiwan)	奧薩瑪賓拉登
TC (Hong Kong)	奧薩瑪賓拉丹
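The three levels of conversion discussed above (code, orthographic, and lexemic) can be sketched as a layered lookup that tries the most specific level first. This is an illustration under stated assumptions, not the method used by CJKI: the three tiny tables and the function names are hypothetical, the input is assumed to be already segmented into words, and the 面包/麵包 (bread) pair is added here only as an example of a word-level orthographic mapping.

```python
# Hypothetical toy tables; real systems need very large lexicons.
LEXEMIC = {"因特网": "網際網路"}       # different word for the same thing
ORTHOGRAPHIC = {"面包": "麵包"}        # same word, different spelling
CODE = {"马": "馬", "网": "網"}        # character-by-character fallback


def code_convert(text):
    """Character-level conversion: replace each character on its own."""
    return "".join(CODE.get(ch, ch) for ch in text)


def sc_to_tc(word):
    """Convert one segmented SC word, trying word-level tables first."""
    if word in LEXEMIC:        # meaning-based word translation
        return LEXEMIC[word]
    if word in ORTHOGRAPHIC:   # character-based word substitution
        return ORTHOGRAPHIC[word]
    return code_convert(word)  # last resort: plain code conversion


print(sc_to_tc("因特网"))      # lexemic table applies
print(code_convert("因特网"))  # character-level result for comparison
```

In this sketch, converting 因特网 character by character yields 因特網, a string that is not the TC word for Internet, whereas the lexemic table returns 網際網路. Ordering the lookups from lexemic to code conversion is one simple design choice; it presumes that a word-level entry, when available, is always preferable to a character-level guess.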

CJK Lexical Data

The foundation for Halpern’s dictionaries and for his larger business is his massive accumulation of lexical data in Chinese, Japanese, and Korean. Halpern licenses portions of this data for use in dictionaries, commercial software products, and free software. These databases are highly regarded for both academic and commercial uses. The China Lexicographic Society at Guangdong University of Foreign Studies has written, “We were deeply impressed by the high linguistic standards of [Halpern’s] work and his profound knowledge of Chinese linguistics and lexicography. We were especially impressed with the technical and linguistic sophistication of the dictionaries and systems the [CJKI] Institute developed for converting between Simplified and Traditional Chinese.”

Licensing for this database is managed by The CJK Dictionary Institute, Inc., which has a website at http://www.cjk.org. CJKI consists of a group of linguists and other experts specializing in CJK lexicography, under Halpern’s direction. Some of the world’s major search engine companies use CJKI’s data to power their Japanese and Chinese morphological analyzers or other applications.

CJKI collaborates on research into Chinese with the Institute of Computational Linguistics at Peking University, the Computer Institute of Beijing Polytechnic University, The China Lexicographic Society and other research institutes in China and Japan. Licensees of the CJKI data are using the information as the basis of numerous other products and services, including online dictionaries, CJK input method editors (IME), translation tools, Japanese and Chinese-language IR applications, and research in linguistics, lexicography, and education.

CJKI’s comprehensive lexical databases have a rich set of attributes to support morphological analyzers. These are sophisticated computational linguistic tools used for analyzing a text to segment it into lexemes (sometimes morphemes), and for other computational procedures such as POS-tagging (identifying syntactic categories), stemming, and indexing. This analysis is fundamental to machine translation and natural language processing. CJKI’s data is used to power the morphological analyzers of major portals and search engines, while its Japanese dictionaries are used by MT and IME products from major Japanese IT companies.

A major focus of CJKI, and one of its greatest strengths, is the development of large-scale databases of proper nouns. CJKI has been engaged in intense efforts to develop the world’s largest Chinese (3 million entries) and Japanese (2.5 million entries) databases of proper nouns, fine-tuned to the needs of NER (named entity recognition), one of the hottest topics in computational linguistics and IR technology today.

Examples of products using CJKI’s dictionaries include the Quicktionary II from Wizcom Technologies (http://www.wizcomtech.com), a pen-shaped scanner with an LCD screen and speaker built in. Users can scan a word or a line of text and immediately read a translation. Versions are available with three different CJKI dictionaries (English to Japanese, SC, and TC). Babylon (http://www.babylon.com/), an online translation tool that provides information about any word with a single click, also includes CJKI’s Japanese, TC, and SC dictionaries. Free trial versions are available for download in any supported language.

Halpern also gave permission for Jim Breen to include SKIP coding in his free KANJIDIC Japanese-English dictionary file, part of the EDICT project at http://www.csse.monash.edu.au/~jwb/japanese.html, which Halpern also helped build. Royalty-free licensing of SKIP may be available from CJKI for use in non-commercial free software. Vacs Corporation, whose Chinese IME is powered by CJKI’s data, offers a popular line of IME software, called VJE-Delta, for the Japanese market. It improves on the Microsoft IME offerings in the various versions of Windows.

Unicycling

After more than a decade in Japan, Halpern took up unicycling, with characteristic passion, and a thoroughness that resulted in him breaking the Guinness world record for 100 miles on a unicycle. He founded the Japan Unicycle Club (JUC) in 1978 and has served as Executive Director of the JUC and its successor, the Japan Unicycle Association (JUA) since that time. He was a founder of the International Unicycling Federation (IUF) in 1984, serving as its first elected President. He also brought competitive unicycling to the People's Republic of China, where he organized the first-ever Great Wall Unicycle Marathon in 1993. An important milestone in Halpern’s drive to get unicycling accepted as an Olympic sport has been its recent appearance in the World Games, and the world championships held every two years that he helps organize (see http://www.unicycling.org/iuf/). The publicity surrounding Halpern’s unicycling has been extremely helpful to him in generating interest and funding for his other projects, especially the dictionaries.

What’s Next?

Halpern and his team have been working intensely in the last several years on a comprehensive English-Chinese Dictionary of Computer and IT Terminology, which will be published by the well-known Shanghai Cishu Chubanshe. A unique feature of this dictionary is that it will include both Simplified Chinese and Traditional Chinese on both the orthographic and lexemic levels, and that it will, it seems, be the first dictionary ever published for the Chinese market whose editor-in-chief is a non-Chinese (samples at http://www.cjk.org/cjk/samples/chincome.htm).

CJKI is now working intensely on expanding its Arabic lexical resources, including a database of Arabic-English personal and place names; a comprehensive database of over 200,000 Arabic name variants (over 100 ways to spell Mu'ammar Qadhafi!), based on authoritative resources and huge corpora (of great interest to security agencies); a database of broken plurals; an application for accurately transcribing to/from Arabic; and more.

Last but not least, there is one thing that Halpern loves even more than CJK and unicycles: what he and his polyglot friends call “God’s mother tongue,” a language for which he has a burning passion and which brings him great joy and excitement, namely Brazilian Portuguese (BP). He has hundreds of books in and on it, and spends endless hours researching its grammar, phonology and dialects. Though CJKI has no customers for it now, Halpern has launched a project to build the world’s largest lexical database of BP, which is now hovering at 250,000 entries of general vocabulary; proper nouns and technical terms will follow.

Driven by his passion for languages, Halpern and his devoted staff plow ahead with one dictionary project after another to feed his insatiable craving for building large-scale lexical databases which bring great benefit to the language technology community by powering IR, MT and various NLP tools and applications.