ARAN: Automatic Romanizer of Arabic Names

Introduction

The process of automatically converting unvocalized Arabic to a Roman script representation, called romanization, and such related operations as adding vowels to unvocalized Arabic, called vocalization, are challenging tasks to which there is no definitive solution. This document describes a system for automatically romanizing Arabic names, called ARAN for Automatic Romanizer of Arabic Names, and some of the relevant linguistic issues. For example, ARAN can romanize a name like قابوس into a large variety of systems, such as /qaabuus/ (phonemic), Qabous (popular), \qAbws\ (Buckwalter), and [qɑːbuːs] (IPA).

Developed by our team of experts on Arabic orthography and phonology, ARAN is a versatile system that performs a full range of computational linguistic tasks required for processing Arabic names. Though the focus is on processing Arabic names, it can for the most part be applied to processing Arabic texts in general. ARAN consists of multiple modules that perform such tasks as phonetic and phonemic transcription, transliteration, name variant generation, vocalization, code conversion and language identification.

Our institute has spared no effort to tackle every aspect of the many tough linguistic challenges by doing meticulous research and analysis, by writing sophisticated algorithms, and by building comprehensive mapping tables. We are confident that ARAN, along with its sister resource NANA (Non-Arabic Name Arabizer), represents the best of romanization and arabization technology today.

Practical Applications

Ultimately, fully disambiguating an Arabic string requires a software tool to "understand" the text based on a semantic and syntactic analysis of the context. Though ARAN does not do that yet, it is nonetheless a highly practical tool for identifying, processing and normalizing names and their numerous variants, which is useful in a variety of real-world applications, such as:

  • Information Retrieval, such as query processing by search engines.
  • Named Entity Recognition and information extraction.
  • Machine Translation, as for transcribing unknown proper nouns.
  • Anti-money laundering and fraud detection by financial institutions.
  • Security applications such as anti-terrorism watch lists and retail fraud.
  • Cyber security applications such as for preventing identity theft.
  • Law enforcement applications including most-wanted lists and deportation lists.

Romanization of Arabic has such uses as:

  • Storage and manipulation of Arabic on platforms that don't support Arabic.
  • Entering Arabic with ordinary keyboards on systems that only support ASCII.
  • Enabling non-Arabic speakers to read Arabic in romanized transcription.
  • Aiding language learners unfamiliar with the Arabic alphabet.
  • Cross-Language Information Retrieval (CLIR) of names by entering romanized strings.

Why is Arabic ambiguous?

The Arabic script is a member of a class of Semitic scripts known as abjads. A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels. This is referred to as unvocalized Arabic (or unvoweled Arabic). Though diacritics, and some consonants, are used to indicate vowels, these are sparsely used. On the whole, unvocalized Arabic is ambiguous, in some cases highly ambiguous, posing significant challenges to Arabic information processing.

For example, the two letters مو \mw\ can theoretically represent 25 legitimate consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc. Humans can normally disambiguate this by context, but for a computer program the task is formidable. An example of an ambiguous unvocalized word is كاتب \kAtb\, which can represent any of the seven vocalized wordforms below:

  1. كَاتِب /kaatib/
  2. كَاتَبَ /kaataba/
  3. كَاتِبٍ /kaatibin/
  4. كَاتِبٌ /kaatibun/
  5. كَاتِبَ /kaatiba/
  6. كَاتِبِ /kaatibi/
  7. كَاتِبُ /kaatibu/
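The ambiguity can be made concrete with a minimal Python sketch. It builds the seven vocalized forms listed above from Unicode code points and shows that removing the vowel signs collapses all of them to the same unvocalized string. The helper name `skeleton` and the dictionary `VOCALIZED_FORMS` are illustrative assumptions, not part of ARAN.

```python
import unicodedata


def skeleton(word: str) -> str:
    """Drop combining marks (Unicode category Mn), which include the Arabic vowel signs."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Letters and diacritics, written as escapes to avoid bidi display problems.
K, A, T, B = "\u0643", "\u0627", "\u062a", "\u0628"   # kaaf, alif, taa', baa'
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
DAMMATAN, KASRATAN = "\u064c", "\u064d"

STEM_I = K + FATHA + A + T + KASRA + B   # kaatib-
STEM_A = K + FATHA + A + T + FATHA + B   # kaatab-

# The seven vocalized word forms from the list above (hypothetical variable name).
VOCALIZED_FORMS = {
    "kaatib": STEM_I,
    "kaataba": STEM_A + FATHA,
    "kaatibin": STEM_I + KASRATAN,
    "kaatibun": STEM_I + DAMMATAN,
    "kaatiba": STEM_I + FATHA,
    "kaatibi": STEM_I + KASRA,
    "kaatibu": STEM_I + DAMMA,
}

# Every form reduces to the same four-letter unvocalized string.
SKELETONS = {skeleton(form) for form in VOCALIZED_FORMS.values()}
print(len(VOCALIZED_FORMS), "vocalized forms ->", len(SKELETONS), "unvocalized string")
```

Going in the other direction, from the single unvocalized string back to the intended vocalized form, is the hard problem that the rest of this document addresses.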

The main reason for this ambiguity is that Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3,487 (out of a theoretical 10,541) vocalized forms, including identical forms of distinct function (called inflectional syncretism) and sense.

Basic Concepts

There is much confusion surrounding such terms as transliteration, transcription, and romanization. It is important to understand these concepts correctly. In the definitions below, the common name Muhammad, written محمد in Arabic script, is used for illustration. More information is available at Transliteration and Transcription Technology.

Romanization

The representation of a language written in a non-Roman script, such as Chinese or Arabic, in the Roman or Latin alphabet. This includes transliteration and the various types of transcription described below.

Transliteration

Transliteration of محمد
Arabic Letter | Contextual Form | Transliteration | Letter Name
م | مـ | m | miim
ح | ـحـ | H | Haa'
م | ـمـ | m | miim
د | ـد | d | daal

A representation of the script of a source language by using the characters of another script. It aims to represent the letters (graphemes), rather than the sounds (phonemes), of the source language, by one (sometimes multiple) characters in an unambiguous way. For example, محمد is transliterated as \mHmd\, with each Arabic letter represented unambiguously by one Roman character, as shown in the table above.

In good transliteration systems there is a one-to-one correspondence that enables round-trip conversion. A widely used system for transliterating Arabic on a letter-by-letter basis is the excellent Buckwalter transliteration.
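As an illustration of one-to-one mapping and round-trip conversion, here is a minimal Python sketch of the core Buckwalter table, covering the basic letters and diacritics in the Unicode ranges U+0621 to U+063A and U+0640 to U+0652. The function names are assumptions made for this example, and the sketch omits the few extra symbols of the full scheme (such as dagger alif and alif wasla); it is not AXAN's implementation.

```python
# Buckwalter symbols, aligned with two contiguous Unicode ranges.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # U+0621 (hamza) .. U+063A (ghayn)
_REST = "_fqklmnhwYyFNKaui~o"             # U+0640 (tatweel) .. U+0652 (sukun)

AR_TO_BW = {chr(0x0621 + i): sym for i, sym in enumerate(_LETTERS)}
AR_TO_BW.update({chr(0x0640 + i): sym for i, sym in enumerate(_REST)})

# The mapping is one-to-one, so it can simply be inverted.
BW_TO_AR = {sym: ar for ar, sym in AR_TO_BW.items()}


def to_buckwalter(text: str) -> str:
    """Replace each Arabic character with its Buckwalter symbol; pass others through."""
    return "".join(AR_TO_BW.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Inverse mapping; only safe for strings that are entirely Buckwalter."""
    return "".join(BW_TO_AR.get(ch, ch) for ch in text)


name = "\u0645\u062d\u0645\u062f"           # the name Muhammad, unvocalized
print(to_buckwalter(name))                   # mHmd
print(from_buckwalter(to_buckwalter(name)) == name)
```

Because every source character has exactly one target symbol and vice versa, no information is lost in either direction. That property is what distinguishes a strict transliteration from the transcription systems described below.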

Note that the term transliteration is often misleadingly used in the sense of transcription, which is very confusing and should be avoided.

Transcription

A representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic (character-to-character) correspondence. There are three kinds of transcription:

1. Phonetic Transcription

A set of symbols is used to represent the actual speech sounds (phones) of the source language, including allophones (predictable variants of a phoneme). The most precise and well known of these is the International Phonetic Alphabet (IPA). For example, محمد is phonetically transcribed as [muħɛ̈mmɛ̈d], a fairly accurate representation of how that name is actually pronounced.

2. Phonemic Transcription

Also called phonological transcription, this is a notation used to represent the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is phonemically transcribed as /muHammad/. The a represents the phoneme /a/, an abstract unit, rather than the actual sound (phone) [ɛ̈].

3. Popular Transcription

A conventionalized orthography, often inconsistent and devised by non-natives (or even by Arabists) with a shallow knowledge of Arabic phonology, that attempts to roughly represent the pronunciation of the original. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.

Vocalization

The process of automatically adding vowels to unvocalized Arabic. For example, the unvocalized محمد \mHmd\ is vocalized as مُحَمَّد \muHam~ad\. Note the four diacritics that were added in the vocalized version. This is difficult to do even for native speakers unless trained in Arabic phonology. For a computer program, the high level of ambiguity makes it extremely challenging.

Arabization

As used here, arabization refers to the process of automatically converting an Arabic or non-Arabic name written in the Latin or CJK native script into Arabic script. For example, Muhammad → محمد, Jack → جاك, and 埼玉 (Saitama) → سايتاما.

Vocalization Modes

Arabic is written mostly in unvocalized script, which is why it is so difficult to transcribe and is the raison d'être for the ARAN system. Vocalized Arabic is found in the Koran, children's books, and didactic materials such as dictionaries. The Koran is fully vocalized (explicit short vowels, gemination, nunation etc.), but in other cases one often encounters partially vocalized or semivocalized texts.

Three Modes of Vocalization
Mode | Arabic | Transcription | Transliteration
Unvocalized | كتب | /kutiba/ | \ktb\
Semivocalized | كُتب | /kutiba/ | \kutb\
Fully vocalized | كُتِبَ | /kutiba/ | \kutiba\

ARAN supports three modes of vocalization: unvocalized, semivocalized, and fully vocalized, as shown in the table above.

Transcribing vocalized and semivocalized Arabic is considerably easier than transcribing unvocalized Arabic. However, it requires a different set of rules. Similarly, vocalizing unvocalized Arabic is just as difficult as transcribing it, but again requires a different set of rules. Each ARAN module has a knowledge base that captures the precise rules for the different vocalization modes.
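Since each mode calls for a different rule set, a processing pipeline first has to decide which mode an input string is in. The following Python sketch shows one rough way to do that by counting which letters carry a diacritic. The function name and the decision rule are assumptions for illustration only: the strict "every letter is marked" test fits forms such as كُتِبَ, but would label a fully vocalized name containing long-vowel letters (which normally carry no mark) as semivocalized, so a real system would need finer rules.

```python
# Arabic vowel signs and related marks: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def vocalization_mode(word: str) -> str:
    """Classify a word by how many of its letters carry at least one diacritic."""
    letters = 0
    marked = 0
    letter_open = False  # True while the current letter has not been marked yet
    for ch in word:
        if ch in DIACRITICS:
            if letter_open:          # count a letter once, even with shadda + vowel
                marked += 1
                letter_open = False
        else:
            letters += 1
            letter_open = True
    if marked == 0:
        return "unvocalized"
    if marked == letters:
        return "fully vocalized"
    return "semivocalized"


# The three spellings of /kutiba/ from the table above.
print(vocalization_mode("\u0643\u062a\u0628"))                    # unvocalized
print(vocalization_mode("\u0643\u064f\u062a\u0628"))              # semivocalized
print(vocalization_mode("\u0643\u064f\u062a\u0650\u0628\u064e"))  # fully vocalized
```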

ARAN: Automatic Romanizer of Arabic Names

Basic Goals and Methodology

ARAN aims to provide a robust solution to the difficult task of romanizing Arabic names, including all the transcription subtypes described above. CJKI is engaged in ongoing research and development efforts to enhance the functionality of the various ARAN modules, especially ATAN, ARAN's core module for generating phonemic and popular transcriptions. The main emphasis is on automatically transcribing unvocalized Arabic names into as many popular romanized variants as possible.

The most difficult challenge, the core problem to which ARAN provides a solution, is to make an intelligent guess at determining the vowels of unvocalized Arabic names and generating a list of likely candidates on the basis of statistical models and in-depth analysis of Arabic orthography. If a name is not found in our comprehensive Database of Arab Names (DAN), variants are generated in various romanization systems by linguistically advanced algorithms using a sophisticated knowledge base that captures the rules of Arabic orthography. DAN now has approximately six and a half million entries.

ARAN Modules

ARAN consists of the following components, described in more detail in the sections below:

  1. ATAN: Automatic Transcriber of Arabic Names
  2. AXAN: Automatic Transliterator of Arabic Names
  3. APAN: Automatic Phoneticizer of Arabic Names
  4. ADAN: Automatic Diacriticizer of Arabic Names
  5. AVAN: Automatic Variant Generator for Arabic Names
  6. AEAN: Automatic Encoder of Arabic Names
  7. AIAN: Automatic Identifier of ASBL Names
  8. ACAN: Automatic Converter of ASBL Names

The table below illustrates the conversion processes performed by the principal ARAN modules using the Arabic name Qaboos (قابوس) as an example. It shows the data input to each module and the resulting output after processing. To get an overview of ARAN's features and capabilities, please study this table carefully.

ARAN Processing of قابوس \qAbws\
Conversion process | ARAN module | Input | Output | Remarks
Phonemic Transcription | ATAN | قابوس | /qaabuus/ | linguistic representation of phonemes
English Transcription | ATAN | قابوس | Qaboos | "Standard" English spelling
Popular Transcriptions | ATAN | قابوس | Qabuus, Qabus, Qabous, Qabooss, Qaaboos, Kaboos, Kabuus, Gabous... | some of the many popular variants
Phonetic Transcription | APAN | قابوس | [qɑːbuːs] | scientific transcription in IPA
Unvocalized Transliteration | AXAN | قابوس | \qAbws\ | Buckwalter transliteration of unvocalized Arabic
Vocalized Transliteration | AXAN | قَابُوس | \qaAbuws\ | Buckwalter transliteration of vocalized Arabic
Diacriticization | ADAN | قابوس | قَابُوس | adding vowels (vocalization) and diacritics to unvocalized Arabic
Arabization | NANA | Qabuus, Qabus, etc. | قابوس | converting non-Arabic to Arabic script

ATAN: Automatic Transcriber of Arabic Names

The Automatic Transcriber of Arabic Names, or ATAN for short, is ARAN's core module for generating phonemic and popular transcriptions of Arabic personal names.

Because of the inconsistent nature of the various popular Arabic romanization systems, there are often many, sometimes dozens or even hundreds, of romanizations for the same name. ATAN supports most of the commonly used systems, and has a flexible architecture that enables the user to configure the system to support user-defined systems.

The table below shows some of the major romanization systems. Though transcription is handled by the ATAN module and transliteration by the AXAN module, examples of both are given below for convenience.

Major Arabic Romanization Systems
Example: given name شولوخ
System | Example | Description
ALA-LC | shwlwkh | Romanization standard of the American Library Association - Library of Congress.
IC | Shulukh | Intelligence Community Standard.
DIN | šūlūḫ | DIN 31635, the DIN standard for Arabic transliteration.
BGN/PCGN | Shūlūkh | The official system adopted by the U.S. Board on Geographic Names (BGN) and the Permanent Committee on Geographical Names (PCGN).
IPA | ʃuːluːx | International Phonetic Alphabet, a scientific system of representing speech sounds.
English | Shoulokh | One of many possible popular transcriptions.
Buckwalter | $wlwx | A strict transliteration system widely used in information processing.

In addition to the systems shown above, there are others not shown here, such as Deutsche Morgenländische Gesellschaft, ISO/R 233, SATTS and many more, which will be supported by the ATAN and AXAN modules.

AXAN: Automatic Transliterator of Arabic Names

The Automatic Transliterator of Arabic Names, or AXAN for short, generates transliterations of Arabic names or any other Arabic text. There are few strict transliteration systems; that is, systems that use unique symbols for each letter and allow for round-trip conversion. The excellent and widely used Buckwalter transliteration system is not only supported by AXAN, but is also used for internal processing in all ARAN databases and algorithms. AXAN can be configured to support other transliteration systems, including Cyrillization, by adding custom mapping tables. Examples are shown in the table in the ATAN section above.

A table comparing romanization systems can be found in the Wikipedia article on the romanization of Arabic.

APAN: Automatic Phoneticizer of Arabic Names

The Automatic Phoneticizer of Arabic Names, or APAN for short, generates phonetic transcriptions of Arabic names in IPA. This represents the actual pronunciation in Modern Standard Arabic (MSA), including distinctions between the major allophones. APAN can be configured to generate transcriptions in various flavors of MSA pronunciation, e.g. the Saudi, Egyptian and Levantine flavors. Here "flavors" refers to regional variations in the pronunciation of MSA across the Arab world, and is not to be confused with Arabic dialects.

For example, the name قابوس Qaboos is transcribed phonetically as [qɑːbuːs]. Note that the phonemic transcription /qaabuus/ generated by ATAN indicates the long vowel a by /aa/ and does not indicate the phonetic details of that vowel other than that it is long, a phonemic distinction. In contrast, the IPA phonetic transcription generated by APAN for this vowel is [ɑː], distinguishing it from its more common realization [æː], since [ɑː] is an allophonic variant of /aa/ that occurs after the uvular stop [q]. Thus the phonemic transcription /aa/ represents a single phoneme, which can be realized phonetically as [æː] or [ɑː].

This is further illustrated by the table below:

Arabic | English | Phonemic | Phonetic (Gulf) | Phonetic (Egyptian) | Phonetic (Levantine)
قابوس | Qaboos | /qaabuus/ | [qɑːbuːs] | [ʔɑːbuːs] | [qɑːbuːs]
جمال | Jamal | /jamaal/ | [dʒɛ̈mɛ̈ːl] | [gɛ̈mɛ̈ːl] | [ʒɛmɛ̈ːl]
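The step from a phonemic to a phonetic transcription can be sketched as a context-sensitive lookup. The minimal Python example below applies the one allophonic rule described above (/aa/ is realized as [ɑː] after /q/, otherwise as [æː]) together with per-flavor consonant overrides read off the table. All table contents and names here are assumptions for illustration; the default vowel symbols follow the prose above ([æː]) rather than the narrower symbols used in the table, and APAN's actual rule base is certainly richer.

```python
# Default phone for each phoneme (illustrative subset only).
DEFAULT_PHONES = {
    "q": "q", "b": "b", "s": "s", "m": "m", "l": "l", "j": "dʒ",
    "a": "æ", "aa": "æː", "uu": "uː",
}

# /aa/ is backed to [ɑː] when it follows one of these phonemes.
BACKING_CONTEXT = {"q"}

# Consonant realizations that differ by regional flavor of MSA, per the table above.
FLAVOR_OVERRIDES = {
    "gulf": {},
    "egyptian": {"q": "ʔ", "j": "g"},
    "levantine": {"j": "ʒ"},
}


def to_ipa(phonemes: list[str], flavor: str = "gulf") -> str:
    """Map a list of phonemes to an IPA string for the given flavor."""
    overrides = FLAVOR_OVERRIDES[flavor]
    phones = []
    for i, phoneme in enumerate(phonemes):
        # The rule looks at the preceding phoneme, not at its regional realization.
        if phoneme == "aa" and i > 0 and phonemes[i - 1] in BACKING_CONTEXT:
            phones.append("ɑː")
        else:
            phones.append(overrides.get(phoneme, DEFAULT_PHONES[phoneme]))
    return "[" + "".join(phones) + "]"


qaabuus = ["q", "aa", "b", "uu", "s"]
print(to_ipa(qaabuus))               # [qɑːbuːs]
print(to_ipa(qaabuus, "egyptian"))   # [ʔɑːbuːs]
```

Keying the rule on the underlying phoneme is a design choice that reproduces the Egyptian row of the table, where the backed vowel is kept even though /q/ itself surfaces as a glottal stop.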

ADAN: Automatic Diacriticizer of Arabic Names

The Automatic Diacriticizer of Arabic Names, or ADAN for short, performs automatic diacriticization; that is, it automatically vocalizes unvocalized or semivocalized Arabic by adding the appropriate vowel signs and other diacritics. For example, the well known name Muhammad, written محمد \mHmd\ in unvocalized Arabic, is converted into the vocalized version مُحَمَّد \muHam~ad\ (/muHammad/) by adding the diacritics damma, fatha and shadda. This is related to, but distinct from, the equally difficult task of automatically generating a romanized phonemic transcription, which is done by the ATAN module.

Below are some examples of the output from the ADAN module.

Sample output from ADAN module
Unvocalized | Vocalized | Transcription | English
محمد | مُحَمَّد | muHammad | Muhammad
إبراهيم | إِبْرَاهِـيم | 'ibrahiim | Abraham
إسحاق | إِسْحَاق | isHaaq | Isaac
الرياض | الرِّيـَاض | arriyaaD | Riyadh
مكة | مَـكـَّة | makkah | Mecca
القاهرة | الْقـَاهِـرَة | alqaahirah | Cairo

AVAN: Automatic Variant Generator for Arabic Names

Romanization Variants

Arabic | Buckwalter Transliteration | Popular Transcription
معـمر | mEmr | Moammar
معـمر | mEmr | Muammar
معـمر | mEmr | Mu'ammar
معـمر | mEmr | Mu`ammar
معـمر | mEmr | Mo'ammar
معـمر | mEmr | Moammar
معـمر | mEmr | Moamer
معـمر | mEmr | Moamar
معـمر | mEmr | Mohamar

The many popular transcriptions of Arabic names result in a very large number of variants. One of the main factors contributing to this is that several Arabic consonants do not exist in European languages. These sounds are difficult to pronounce and are rendered in different ways when romanized. Another factor is the vowels, which are transcribed in a bewildering variety of ways, partially due to dialectal variation. For example, the Arabic vowel /u/ in /usama/ is transcribed in such different ways as Usama, Ousama, Osama and Oosama.
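The combinatorial growth of variants can be sketched in a few lines of Python. The example below expands each phoneme into a list of alternative popular spellings and takes the Cartesian product. The spelling tables are small hypothetical samples assembled from the examples in this document; they are not AVAN's mapping tables, and a production system would also rank or filter the candidates.

```python
from itertools import product


def popular_variants(phonemes: list[str], spellings: dict[str, list[str]]) -> list[str]:
    """Return every combination of the alternative spellings, capitalized as a name."""
    options = [spellings.get(p, [p]) for p in phonemes]  # unlisted phonemes map to themselves
    return ["".join(parts).capitalize() for parts in product(*options)]


# The four renderings of /u/ in /usama/ cited above.
USAMA_SPELLINGS = {"u": ["u", "ou", "o", "oo"]}
usama = popular_variants(["u", "s", "a", "m", "a"], USAMA_SPELLINGS)
print(usama)  # ['Usama', 'Ousama', 'Osama', 'Oosama']

# A hypothetical table for /qaabuus/, chosen so that it covers the popular
# transcriptions of Qaboos listed earlier in this document.
QABOOS_SPELLINGS = {
    "q": ["q", "k", "g"],
    "aa": ["a", "aa"],
    "uu": ["oo", "uu", "u", "ou"],
    "s": ["s", "ss"],
}
qaboos = popular_variants(["q", "aa", "b", "uu", "s"], QABOOS_SPELLINGS)
print(len(qaboos))  # 3 * 2 * 1 * 4 * 2 = 48 candidates
```

Even with only two to four alternatives per sound, a five-phoneme name yields dozens of candidates, which is consistent with the observation that a single name may have dozens or even hundreds of romanizations.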

For more details on romanized variants of Arabic names, see our Database of Arab Names.

Arabic Orthographic Variants

The second kind of variant occurs in the Arabic spelling of the name itself. These variants are of three kinds:

  1. Synonyms are alternative expressions that represent the same name, like امريكا \AmrykA\ (America) vs. الولايات الأمريكية المتحدة \AlwlAyAt Al>mrykyp AlmtHdp\ (United States of America).
  2. Orthographic variants are alternative, non-standard ways to spell a specific variant of a name, like ابو ظبي \Abw Zby\ instead of أبو ظبي \>bw Zby\ for Abu Dhabi, in which the hamza is omitted.
  3. Orthographic errors are frequently occurring, systematic spelling mistakes, like yaa' in ابو ظبي \Abw Zby\ (Abu Dhabi) being replaced by alif maqsuura in ابو ظبى \Abw ZbY\.

Though the difference between variants and errors cannot be rigorously defined (there may be differences of opinion among native speakers as to what constitutes an error), they are both based on deep statistical and linguistic analysis of contemporary Arabic orthography, and provide fairly exhaustive coverage of Arabic orthographic variation. It should also be noted that the standard form, though linguistically correct, is not necessarily the most common one (we have statistics for the occurrence of each form).

Orthographic Variation in Arabic Names
(V = variant, E = error; "-" marks an empty cell)
Standard | Buckwalter | English | Variant | Error | Remarks
أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى, ابو ظبى | V: omit hamza; E: alif maqsura replaces yaa'
الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | V: omit hamza; E: haa' replaces taa' marbuuTa
جدة | jdp | Jeddah | جدّة | جده | V: explicit shadda; E: haa' replaces taa' marbuuTa
الأردن | Al>rdn | Jordan | الاردن | - | V: omit hamza
بالو ألتو | bAlw >ltw | Palo Alto | بالو التو, بالو آلتو | - | V1: omit hamza; V2: madda replaces hamza
الرياض | AlryAD | Riyadh | الرّياض | - | V: explicit shadda
طوكيو | Twkyw | Tokyo | - | توكيو | E: taa' replaces Taa'

For details see our Dictionary of Arabic Place Name Variants.
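For matching purposes, several of the variant and error types in the table above can be neutralized by folding them to a common form. The Python sketch below shows this well-known normalization technique under stated assumptions: the function name and the exact set of folding rules are choices made for this example, not a description of ARAN's variant tables. The folding is lossy and suits comparison only, and it does not cover substitutions such as taa' for Taa' in the Tokyo row.

```python
# Folding rules keyed by code point, for use with str.translate().
FOLD = {
    0x0622: 0x0627,  # alif with madda       -> bare alif
    0x0623: 0x0627,  # alif with hamza above -> bare alif
    0x0625: 0x0627,  # alif with hamza below -> bare alif
    0x0649: 0x064A,  # alif maqsuura         -> yaa'
    0x0629: 0x0647,  # taa' marbuuTa         -> haa'
    0x0640: None,    # tatweel (elongation)  -> removed
}
# Remove the diacritics fathatan (U+064B) through sukun (U+0652), including shadda.
FOLD.update({cp: None for cp in range(0x064B, 0x0653)})


def fold_variants(name: str) -> str:
    """Reduce common orthographic variants and errors to one comparison key."""
    return name.translate(FOLD)


standard = "\u0623\u0628\u0648 \u0638\u0628\u064a"   # Abu Dhabi, standard spelling
error = "\u0627\u0628\u0648 \u0638\u0628\u0649"      # hamza omitted, alif maqsuura
print(fold_variants(standard) == fold_variants(error))
```

Folding is a cheaper alternative to enumerating every variant, but it cannot tell which form is standard or how frequent each form is, which is where a curated variant table remains necessary.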

AEAN: Automatic Encoder of Arabic Names

AEAN is a code conversion module that supports various legacy encodings for Arabic, re-encoding the text into UTF-8 or UTF-16. It supports the following encodings:

  1. ISO 8859-6, the standard 8-bit encoding scheme for Arabic.
  2. The Arabic Mac Code Page, a superset of ISO 8859-6.
  3. Microsoft's Arabic DOS Code Page (ASMO 708), also based on ISO 8859-6.
  4. Microsoft's Arabic Windows code page, based on the ISO 8859-1 (Latin 1) standard.
  5. Arabic Windows 95 Code Page (CP-1256), which adds support for Persian characters.
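Re-encoding legacy text into Unicode can be sketched with Python's built-in codecs, which include `iso8859_6`, `cp1256` and `mac_arabic`. The function name and the label-to-codec dictionary below are assumptions for this example; the sketch covers only encodings that ship with the Python standard library and says nothing about how AEAN itself is implemented.

```python
# Human-readable labels mapped to Python standard-library codec names.
LEGACY_CODECS = {
    "ISO 8859-6": "iso8859_6",
    "Windows-1256": "cp1256",
    "Mac Arabic": "mac_arabic",
}


def reencode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from a legacy Arabic encoding and re-encode them as Unicode."""
    text = data.decode(LEGACY_CODECS[source])
    return text.encode(target)


name = "\u0645\u062d\u0645\u062f"            # the name Muhammad in Arabic script
legacy_bytes = name.encode("iso8859_6")      # simulate legacy input: one byte per letter
utf8_bytes = reencode(legacy_bytes, "ISO 8859-6")

print(len(legacy_bytes), len(utf8_bytes))    # 4 8
print(utf8_bytes.decode("utf-8") == name)
```

The source encoding has to be known or detected beforehand, because the same byte values denote different characters in different code pages.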

AIAN: Automatic Identifier of ASBL Names

This module enables the automatic identification of a language written in the Arabic script. There are dozens of non-Arabic languages that are or have been written in the Arabic script, referred to as Arabic Script Based Languages (ASBL). The most important of these are:

  1. Farsi (official language of Iran)
  2. Pashto (western Pakistan and official language of Afghanistan)
  3. Dari (Afghan dialect of Farsi, official language of Afghanistan)
  4. Urdu (official language of Pakistan)
  5. Kurdish (Turkey, Iraq, Iran, Syria, Armenia, Lebanon)

Others include Shahmukhi (Pakistani version of Punjabi), Kashmiri (India and Pakistan), and Uyghur (northwest China).
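One simple signal for telling these languages apart is the presence of letters that a language added to the basic Arabic alphabet. The Python sketch below scores a string against small sets of such distinctive letters. The letter sets, the labels and the decision order are assumptions for illustration: they cover only a few languages, some letters are shared by languages not listed, and short names often contain no distinctive letter at all, so this is not a description of how AIAN works.

```python
# Letters that are characteristic of one language (illustrative, not exhaustive).
DISTINCTIVE_LETTERS = {
    "Urdu": set("\u0679\u0688\u0691\u06ba\u06d2"),
    "Pashto": set("\u067c\u0681\u0685\u0689\u0693\u0696\u069a\u06bc"),
    "Kurdish": set("\u0695\u06b5\u06c6\u06ce"),
}

# Letters added for Persian (pe, che, zhe, gaf); other languages share them too.
PERSIAN_LETTERS = set("\u067e\u0686\u0698\u06af")


def guess_language(text: str) -> str:
    """Guess the language of Arabic-script text from its distinctive letters."""
    scores = {
        lang: sum(ch in letters for ch in text)
        for lang, letters in DISTINCTIVE_LETTERS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] > 0:
        return best
    if any(ch in PERSIAN_LETTERS for ch in text):
        return "Persian/Dari"
    return "Arabic or undetermined"


print(guess_language("\u0644\u0691\u06a9\u0627"))   # contains the Urdu letter U+0691
print(guess_language("\u067e\u069a\u062a\u0648"))   # contains the Pashto letter U+069A
print(guess_language("\u0645\u062d\u0645\u062f"))   # no distinctive letters
```

A letter-based test of this kind would normally be combined with statistical evidence such as character n-gram frequencies, since many names are spelled with the basic Arabic letters alone.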

ACAN: Automatic Converter of ASBL Names

ARAN will eventually be expanded to romanize to and from the major Arabic Script Based Languages (ASBL), described in the AIAN section above.