Jack Halpern loves two things
that he wants you to love too. He loves to ride unicycles, including the
thirteen-wheeler shown here, and never misses a chance to teach others to ride
or to promote unicycling as a sport. More than anything, Halpern loves
languages, especially Japanese and Chinese. As a
result, he has devoted many years to learning these
languages, compiling Japanese and
Chinese dictionaries, and providing large-scale lexical databases to developers of CJK language technology
applications.
Halpern came by his first half-dozen languages naturally while
growing up. He was born in
After the 1967 Arab-Israeli war, he went to
live in a kibbutz in
In 1973, after exhausting the materials available to him for
studying Japanese in Israel, Halpern moved with his family to Japan, where he
has remained ever since. Over the years his “language collection” expanded to
include Chinese, both Traditional and Simplified, and he
learned various Japanese dialects including the Kansai (western) and
Halpern has also learned Spanish, Judezmo (aka Ladino), and the
international languages Esperanto and Interlingua. More recently, he began to
study Arabic intensely, and during a brief vacation in
Aruba he literally “picked up the basics” and became hooked on his 14th
language, Papiamento, a local Creole based on Spanish and Dutch.
Halpern has also put a lot of effort into explaining Judaism and
Jews to Japanese.
Halpern has published some twenty books and hundreds of
articles and papers, and presented his work at dozens of academic conferences,
but he is best known for his dictionaries, produced
through the Kanji Dictionary Publishing Society (http://www.kanji.org).
The New Japanese-English Character Dictionary exists in three editions.
These consist of two printed versions, one for
The design of Halpern’s dictionaries grew out of his own experience
in learning Japanese in
The result is a mass of unorganized material that must just be
memorized. Halpern assumed at first that in
Halpern solved this problem in his dictionaries by tracing the various meanings of each character back to a core meaning(s), and finding a concise, easy-to-memorize English keyword for the character meaning(s). Understanding the core meaning makes it much easier to remember the various derived meanings.
Consider
the core meaning of 留, which has the on pronunciations ryū and ru
in Chinese compounds, and the kun readings tomeru and todomeru, among others, in Japanese verbs. Halpern relates all of its uses to the one key concept KEEP:
1. KEEP in place, fix
2. KEEP in custody, detain
3. KEEP for future use, reserve
4. KEEP in mind, pay attention to
An important feature of any kanji dictionary is indexing. No single method of organizing characters is completely satisfactory. It must be possible to look up characters by reading and also by shape. The traditional shape-based method identifies a component of each character, called a radical, then requires counting the strokes in the rest of the character. This method is clumsy and slow.
Halpern spent some seven years analyzing and developing a more efficient, pattern-recognition-based indexing method for kanji, called SKIP (System of Kanji Indexing by Patterns). SKIP is a very quick and effective indexing method, particularly helpful to learners, and is widely used by online kanji dictionaries.
It is widely
assumed that conversion between Traditional Chinese (TC) characters such as 馬 and Simplified Chinese (SC) characters such as 马
is simply a matter of table lookup. After all, the PRC government created its repertoire of Simplified Characters by taking Traditional
Characters and defining substitutes for them. What could be the
problem?
Well, the
problem comes from several sources. Jack Halpern and Jouni
Kerman, also of The CJK Dictionary Institute,
explain this in a conference paper
reproduced at http://www.cjk.org/cjk/c2c/c2centry.htm.
For one thing, the PRC simplifications are not entirely consistent. Several TC characters can correspond to the same simplified form. For example, both TC 面 (face) and TC 麵 (noodles) map to SC 面, so converting SC back to TC is a one-to-many, ambiguous relation.
Moreover, vocabulary
in the PRC and other Chinese-speaking countries has diverged significantly
since the Communist regime took over in 1949. The result of replacing simplified characters with the original traditional characters
can be meaningless. The correct TC equivalent
for an SC word may not be a matter of character-based code substitution (code conversion) or
character-based word substitution, called orthographic conversion (like colour
vs. color in British and American English), but often
requires meaning-based word translation, called lexemic conversion (like
truck vs. lorry). The
problem is particularly acute for computer technology and proper nouns. For example, Internet is written 因特网 in SC and 網際網路 in TC, while Osama bin Ladin is
written as shown below (see http://www.cjk.org/cjk/c2c/c2centry.htm for a detailed paper).
Osama bin Ladin
Arabic | أسامة بن لادن |
SC (Mainland) | 奥萨马本拉登 |
TC ( | 奧薩瑪賓拉登 |
TC ( | 奧薩瑪賓拉丹 |
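The gap between character-level and word-level conversion described above can be sketched in a few lines of Python. This is a toy illustration, not CJKI's system. The tables hold only the examples cited in this article plus the pair 网 → 網, and the names `CHAR_TABLE`, `WORD_TABLE`, `code_convert`, and `lexemic_convert` are hypothetical.

```python
# Toy SC -> TC tables; the names and contents are illustrative assumptions.
# Each SC character maps to a list of TC candidates (one-to-many).
CHAR_TABLE = {
    "马": ["馬"],
    "网": ["網"],
    "面": ["面", "麵"],  # face / noodles: ambiguous without context
}

# Word-level (lexemic) table: a whole SC word mapped to its TC counterpart.
WORD_TABLE = {
    "因特网": "網際網路",  # "Internet"
}


def code_convert(sc_text):
    """Character-by-character lookup; returns one candidate list per character."""
    return [CHAR_TABLE.get(ch, [ch]) for ch in sc_text]


def lexemic_convert(sc_word):
    """Prefer a word-level entry; otherwise fall back to the first candidates."""
    if sc_word in WORD_TABLE:
        return WORD_TABLE[sc_word]
    return "".join(cands[0] for cands in code_convert(sc_word))


naive = "".join(cands[0] for cands in code_convert("因特网"))
print(naive)                     # result of character substitution alone
print(lexemic_convert("因特网"))  # result of the word-level table
print(code_convert("面"))        # more than one TC candidate
```

Character substitution alone turns 因特网 into 因特網, which is not the TC form cited above. Only the word-level table yields 網際網路. The lookup for 面 returns two candidates, so a real converter would need context to choose between them.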
The foundation for Halpern’s dictionaries and for his larger business is his massive accumulation of lexical data in Chinese, Japanese, and Korean. Halpern licenses portions of this data for use in dictionaries, commercial software products, and free software. These databases are highly regarded for both academic and commercial uses. The China Lexicographic Society at Guangdong University of Foreign Studies has written, “We were deeply impressed by the high linguistic standards of [Halpern’s] work and his profound knowledge of Chinese linguistics and lexicography. We were especially impressed with the technical and linguistic sophistication of the dictionaries and systems the [CJKI] Institute developed for converting between Simplified and Traditional Chinese.”
Licensing for this database is managed by The CJK Dictionary Institute, Inc., which has a website at http://www.cjk.org. CJKI consists of a group of linguists and other experts specializing in CJK lexicography, under Halpern’s direction. Some of the world’s major search engine companies use CJKI’s data to power their Japanese and Chinese morphological analyzers or other applications.
CJKI
collaborates on research into Chinese with the
CJKI’s comprehensive lexical databases have a rich set of attributes to support morphological analyzers. These are sophisticated computational linguistic tools that segment a text into lexemes (sometimes morphemes) and carry out other procedures such as POS tagging (identifying syntactic categories), stemming, and indexing. This analysis is fundamental to machine translation and natural language processing. CJKI’s data is used to power the morphological analyzers of major portals and search engines, while its Japanese dictionaries are used by MT and IME products from major Japanese IT companies.
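To make the role of the lexicon concrete, here is a minimal sketch of dictionary-based segmentation using forward maximum matching, a simple textbook baseline. It is not the method used by CJKI or its customers. The three-word `LEXICON` and the function name `segment` are assumptions made for illustration.

```python
# Toy lexicon (an assumption); a real analyzer draws on a large lexical database.
LEXICON = {"奥萨马", "本拉登", "因特网"}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_LEN):
    """Forward maximum matching: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so unknown text still advances.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


print(segment("奥萨马本拉登上因特网"))
```

Chinese and Japanese text has no spaces between words, so the quality of the result depends directly on lexicon coverage. Here 上 is not in the toy lexicon and falls out as a single-character token.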
A major focus of CJKI, and one of its greatest strengths, is the
development of large-scale databases of proper nouns. CJKI has been engaged in
intense efforts to develop the world’s largest Chinese (3 million entries) and
Japanese (2.5 million entries) databases of proper nouns, fine-tuned to the
needs of NER (named entity recognition), one of the hottest topics in
computational linguistics and IR technology today.
One example of a product using CJKI’s dictionaries is the Quicktionary II from Wizcom Technologies (http://www.wizcomtech.com), a pen-shaped scanner with a built-in LCD screen and speaker. Users can scan a word or a line of text and immediately read a translation. Versions are available with three different CJKI dictionaries (English to Japanese, SC, and TC).
Halpern also gave permission for Jim Breen to include SKIP coding in his free KANJIDIC Japanese-English dictionary file, part of the EDICT project at http://www.csse.monash.edu.au/~jwb/japanese.html, which Halpern also helped build. Royalty-free licensing of SKIP may be available from CJKI for use in non-commercial free software. Vacs Corporation, whose Chinese IME is powered by CJKI’s data, offers a popular line of IME software, called VJE-Delta, for the Japanese market. It improves on the Microsoft IME offerings in the various versions of Windows.
After more than a decade in
Halpern and his team have been working intensely in the last several years on a comprehensive English-Chinese Dictionary of Computer and IT Terminology, which will be published by the well-known Shanghai Cishu Chubanshe. A unique feature of this dictionary is that it will cover both Simplified Chinese and Traditional Chinese on both the orthographic and lexemic levels. It will also, it seems, be the first dictionary ever published for the Chinese market whose editor-in-chief is not Chinese (samples at http://www.cjk.org/cjk/samples/chincome.htm).
CJKI is now working intensely on expanding its Arabic lexical resources, including a database of Arabic-English personal and place names, a comprehensive database of over 200,000 Arabic name variants (over 100 ways to spell Mu'ammar Qadhafi!) based on authoritative resources and huge corpora (of great interest to security agencies), a database of broken plurals, an application for accurately transcribing to/from Arabic, and more.
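One way a name-variant database might be consulted is as a lookup from each attested spelling to a single canonical entry. The sketch below assumes a tiny hand-made table. The variant spellings are common romanizations chosen for illustration and are not entries from CJKI's database. `VARIANTS`, `normalize`, and `canonical_name` are hypothetical names.

```python
import unicodedata

CANONICAL = "Mu'ammar Qadhafi"

# Hypothetical variant table: normalized spelling -> canonical form.
VARIANTS = {
    "mu'ammar qadhafi": CANONICAL,
    "muammar gaddafi": CANONICAL,
    "moammar gadhafi": CANONICAL,
    "muammar al-qaddafi": CANONICAL,
}


def normalize(name):
    """Lowercase, strip combining accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def canonical_name(name):
    """Return the canonical spelling, or None if the variant is unknown."""
    return VARIANTS.get(normalize(name))


print(canonical_name("  Moammar  GADHAFI "))
print(canonical_name("Jack Halpern"))
```

Light normalization only absorbs differences in case, accents, and spacing. Genuinely different romanizations still have to be listed explicitly, which is why such a database grows so large.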
Last but not least, there is one thing that Halpern loves even more than CJK and unicycles: what he and his polyglot friends call “God’s mother tongue,” a language for which he has a burning passion and which brings him great joy and excitement, Brazilian Portuguese (BP). He has hundreds of books in and on it, and spends endless hours researching its grammar, phonology, and dialects. Though CJKI has no customers for it now, Halpern has launched a project to build the world’s largest lexical database of BP, which is now hovering at 250,000 entries of general vocabulary; proper nouns and technical terms will follow.
Driven by his passion for languages, Halpern and his
devoted staff plow ahead with one dictionary project after another to feed his
insatiable craving for building large-scale lexical databases which bring great
benefit to the language technology community by powering IR, MT and various NLP
tools and applications.