Transliteration and Transcription Technology

Introduction

This report provides a brief overview of some linguistic issues related to two text conversion procedures known as transliteration and transcription. A related technology, called transcripting (Chinese-to-Chinese conversion), is described in detail in our paper The Pitfalls and Complexities of Chinese to Chinese Conversion.

The orthographic complexity of some of the major languages using non-Roman scripts, such as Chinese, Japanese, and Arabic, poses formidable challenges to information processing applications. Some factors contributing to this include the large number of characters used in Japanese and Chinese and their complex forms, the lack of vowels in Arabic and other Semitic scripts (known as abjads), and the presence of a large number of orthographic variants. From an information processing point of view there are many complex issues, such as morphological analysis, incompatible character sets, text retrieval, a plethora of input methods, and many others. These are beyond the scope of this brief report, and are mostly covered in papers found on our Articles/papers page.

Definition of Terms

There is much confusion surrounding the terminology related to the general process of representing the characters of one script in those of another (such as writing Japanese or Arabic in the Roman alphabet), which includes various procedures such as transliteration, transcription, romanization, transcribing, and technography. Sometimes the term transliteration or transcription is used as a generic term for all these processes, which is quite misleading since it does not distinguish orthographical transliteration (one-to-one graphemic mapping) from transcription (essentially one-to-many phonemic mapping).

Note that though the terminology used here is "theoretically correct" and is used by linguists, especially grammatologists, these terms are not standardized. In this report, transliteration is used in the strict sense of orthographical transliteration.

Transliteration

6	5	4	3	2	1
ن	د	ا	ل	ن	ب
n	d	'	l	n	b
nun	dal	alif	lam	nun	ba

The aim of transliteration is to represent the script of a source language by using the letters or symbols of another script, usually in accordance with the orthographical conventions of the target language. Let us take Bin Ladin as an example. In Arabic this is written, from right to left, using the six letters shown in the table above.

In Arabic, the independent shapes of some letters (graphemes) undergo form transformations (allographs) depending on their position in the word (see the charts at the end of this report). Thus in actual Arabic script Bin Ladin is written:

بن لادن

The essence of transliteration is that each letter (more precisely, each grapheme) is represented by one character or sometimes multiple characters (digraphs or trigraphs). Thus in the table above b corresponds to the letter ba (ب) and n to the letter nun (ن). In good transliteration systems, there is always full one-to-one correspondence to ensure round-trip conversion.

The word بن is actually pronounced bin, but transliteration does not attempt to represent this. It merely maps source script graphemes to target graphemes and is thus graphemic in nature.
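The graphemic mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table TO_ROMAN covers just the letters of the Bin Ladin example (using the symbols from the table above, with ' standing for alif), and the names TO_ROMAN, TO_ARABIC, and transliterate are hypothetical, not part of any existing tool.

```python
# Toy mapping table for the letters of the Bin Ladin example only.
# A real table would cover the whole script; this one is an assumption
# made for illustration.
TO_ROMAN = {
    "\u0628": "b",  # ba
    "\u0646": "n",  # nun
    "\u0644": "l",  # lam
    "\u0627": "'",  # alif
    "\u062F": "d",  # dal
    " ": " ",       # word space is kept as is
}

# Because the mapping is one-to-one, it can simply be inverted.
TO_ARABIC = {roman: arabic for arabic, roman in TO_ROMAN.items()}
assert len(TO_ARABIC) == len(TO_ROMAN)  # no two letters share a symbol


def transliterate(text, table):
    """Replace each character by its counterpart in the given table."""
    return "".join(table[ch] for ch in text)


arabic = "\u0628\u0646 \u0644\u0627\u062F\u0646"  # the unvoweled spelling
roman = transliterate(arabic, TO_ROMAN)
print(roman)  # bn l'dn

# Round trip: converting back yields the original string exactly.
assert transliterate(roman, TO_ARABIC) == arabic
```

Note that this sketch handles only single-character symbols. A system that uses digraphs or trigraphs would need to split the romanized string into symbols before looking them up.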

Transcription

Transcription is the representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic (character-to-character) correspondence. This can be a phonetic transcription, which uses a phonetic alphabet such as IPA to represent the actual speech sounds of the source language (including allophones), or a phonemic transcription, which uses scientific or conventional orthography to represent the phonemes of the source language (ignoring allophones), such as in the romanization of Arabic or Japanese.

Using our example of Bin Ladin, in Arabic this is actually pronounced [bin ladin], and the transcription Bin Ladin reflects this rather accurately, whereas such variants as Bin Laden and Ben Laden do not. As is well known, Arabic is normally written without vowel signs and thus there is no direct way to know the vowels associated with each consonant. With the vowel signs, Bin Ladin is written as follows:

بِن لاَدِن

Your browser may not render this properly but you should notice several diacritics above and below some letters which indicate vowels. For example, under the letter dal (د) there is a diagonal line that indicates that the consonant + vowel combination is pronounced [di]. As can be seen, phonemic transcription, which represents the phonemes of the source language, is extremely difficult to achieve in Arabic because the vowel information is missing in normal unvoweled Arabic.
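The role of the vowel signs can be illustrated with a small Python sketch. It is a toy under several assumptions: the tables cover only the Bin Ladin example, the voweled string is written with explicit code points with the fatha encoded directly after lam, vowel length is ignored, and the names strip_vowel_signs and transcribe are hypothetical. The only library facility used is unicodedata.combining, which returns a nonzero value for the Arabic vowel signs because they are combining marks.

```python
import unicodedata

FATHA, KASRA = "\u064E", "\u0650"  # vowel signs for [a] and [i]
ALIF = "\u0627"

# Hypothetical toy tables, just large enough for the example.
CONSONANTS = {"\u0628": "b", "\u0646": "n", "\u0644": "l", "\u062F": "d"}
VOWELS = {FATHA: "a", KASRA: "i"}


def strip_vowel_signs(text):
    """Drop all combining marks, giving the normal unvoweled spelling."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def transcribe(voweled):
    """Phonemic transcription of fully voweled text (toy rules only)."""
    out = []
    prev = ""
    for ch in voweled:
        if ch == ALIF and prev == FATHA:
            pass  # alif after fatha only lengthens the vowel; ignored here
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        else:
            out.append(ch)  # spaces and unmapped characters pass through
        prev = ch
    return "".join(out)


voweled = "\u0628\u0650\u0646 \u0644\u064E\u0627\u062F\u0650\u0646"
print(transcribe(voweled))  # bin ladin

# Removing the vowel signs gives the everyday spelling, from which the
# vowels can no longer be read off directly.
unvoweled = strip_vowel_signs(voweled)
assert unvoweled == "\u0628\u0646 \u0644\u0627\u062F\u0646"
```

With the vowel signs present, the conversion is a simple lookup. Once they are stripped, the same consonant skeleton no longer determines the vowels, which is why transcribing unvoweled text requires knowledge beyond a character table.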

Areas of Application

Listed below are some applications of transcription and transliteration technology. There are various other possibilities.

Electronic representation of non-Roman languages on platforms where OS support for such languages is not available. For example, texts in such languages as Arabic, Russian and Pashto can easily be stored in a multilingual database and manipulated as if they were a language like English.

Entering texts in non-Roman languages using an ordinary keyboard with a system that only supports the ASCII character set.

Enabling users unfamiliar with a non-Roman script to read texts in that language by converting them into the Roman alphabet.

Phonemic transcription used for pedagogical purposes, called pedography, enables language learners to study non-Roman languages in the Roman script.

Cross-Language Information Retrieval (CLIR) of proper nouns and other information extraction operations can be performed by entering the transliterated string, which is converted to the source script before the search is performed. For example, one can enter bn l'dn (transliteration, with ' standing for alif as in the table above) or bin ladin (transcription) to search for instances of Bin Ladin in Arabic.
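The retrieval scenario in the last item can be sketched as follows. This is a minimal illustration, not a description of any existing search system: the reverse table TO_ARABIC, the functions to_arabic and search, and the two sample documents are all assumptions made for the example. The query is written bn l'dn, with ' standing for alif as in the transliteration table earlier in this report, so that it maps back to the Arabic spelling letter for letter.

```python
# Hypothetical reverse table (Roman symbol -> Arabic letter) covering
# only the letters needed for this example.
TO_ARABIC = {
    "b": "\u0628",  # ba
    "n": "\u0646",  # nun
    "l": "\u0644",  # lam
    "'": "\u0627",  # alif
    "d": "\u062F",  # dal
    " ": " ",
}


def to_arabic(query):
    """Convert a transliterated query back into Arabic script."""
    return "".join(TO_ARABIC[ch] for ch in query.lower())


def search(query, documents):
    """Return the indexes of the documents containing the converted query."""
    needle = to_arabic(query)
    return [i for i, doc in enumerate(documents) if needle in doc]


# Two made-up "documents" in unvoweled Arabic script.
documents = [
    "\u0628\u0646 \u0644\u0627\u062F\u0646",  # transliterates as bn l'dn
    "\u0644\u0628\u0646\u0627\u0646",         # transliterates as lbn'n
]

print(search("bn l'dn", documents))  # [0]
print(search("bn", documents))       # [0, 1]
```

Searching from a transcription such as bin ladin would be harder than this, since the vowels in the query have no counterpart in unvoweled text and would have to be dropped or matched by rule.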

Good transcription/transliteration software should have the following features:

Capability to convert any valid sequence of characters in the source script to the corresponding characters of the target script.

A good transliteration system allows for unambiguous, round-trip conversion.
A good transcription system maintains one-to-one correspondence on a phoneme basis, but often does not allow round-trip conversion since it is normally fine-tuned to the orthographic conventions of the target script.

Ideally, such systems should be based on standards or orthographic conventions, so that the pronunciation of the target characters is easy to predict. Arbitrary symbols should be avoided.

Both transcription and transliteration technologies have useful roles to play, but transcription is often far more difficult to achieve. Though phonemic transcription does not easily lend itself to round-trip conversion, it is quite useful in a variety of applications since the results are written in a human-friendly conventional orthography. But this does not mean that transliteration is not useful. On the contrary, properly transliterated texts are very easy to manipulate and store in any computer application without OS support.

Conversion Tools and Mapping Tables

The CJK Dictionary Institute has developed a transliteration/transcription tool, provisionally called TRANS. It is a generic tool that works on any language pair and can handle complex orthographies such as Arabic, Hebrew, Japanese, and Chinese. It has numerous features and options for fine-tuning the conversion to specific requirements, and it uses script-specific mapping tables and rule tables (some very complex). We use it for converting Arabic, Russian, Simplified <> Traditional Chinese, and other scripts. In principle, TRANS can not only perform round-trip transliteration with 100% accuracy, but can also perform strict phonemic and phonetic transcription of such complex languages as unvoweled Arabic, Japanese, and Korean.

For the tool to work correctly, it requires precise and complete mapping tables and, in the case of phonemic transcription, complex rules using regular expressions. We have completed some of these tables (such as those for Arabic transliteration and Chinese transcription, among others), and others are under development. Our team of linguists has in-depth knowledge of CJK and Semitic languages in particular, and we are confident in our ability to build mapping tables and rule tables for any language pair.
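To give a feel for how a mapping table and a rule table with regular expressions can work together, here is a minimal Python sketch using Japanese kana. It does not reflect how TRANS is actually implemented. The table KANA, the list RULES, and the function transcribe are hypothetical, and the single rule shown (the small tsu doubles the consonant that follows it) is only one of many context-dependent rules a real system would need.

```python
import re

# Hypothetical mapping table: a handful of kana and their romanization.
KANA = {
    "に": "ni",
    "ほ": "ho",
    "ぽ": "po",
    "ん": "n",
    "き": "ki",
    "て": "te",
}

# Hypothetical rule table: ordered (pattern, replacement) pairs applied
# after the plain table lookup.
RULES = [
    # The small tsu has no sound of its own; it doubles the consonant
    # that follows it.
    (re.compile(r"っ([a-z])"), r"\1\1"),
]


def transcribe(text):
    """Apply the mapping table first, then each rule in order."""
    # Characters missing from the table (such as the small tsu) are left
    # in place so that the rules can see them.
    roman = "".join(KANA.get(ch, ch) for ch in text)
    for pattern, replacement in RULES:
        roman = pattern.sub(replacement, roman)
    return roman


print(transcribe("にほん"))    # nihon
print(transcribe("にっぽん"))  # nippon
print(transcribe("きって"))    # kitte
```

Keeping the table and the rules as data rather than code means that a new language pair needs new tables, not a new program, which is the general idea behind script-specific mapping and rule tables.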

Arabic Script Charts

A chart showing various transcription and transliteration systems for Arabic can be viewed here (.doc file). This is an MS Word document with various special fonts that are difficult to display in HTML and even in Word.

Unicode Arabic range (U+0600 - U+06FF)

The CJK Dictionary Institute
