Database of Arabic Place Names
DAPNA


©2004-2008 The CJK Dictionary Institute, Inc.


1. Introduction

The CJK Dictionary Institute is engaged in the development and continuous expansion of comprehensive lexical databases for CJK languages and Arabic consisting of approximately eight million entries (see CJK Lexical Resources for details). This document describes our database of Arabic place names. We also maintain a large Database of Arab Names (DAN), with over 2.4 million romanized Arab names and variants.

Though Arabic has become a world language of critical importance, lexical resources, especially for proper nouns, are either scarce or exist only on a small scale. Because place names play an important role in such natural language applications as named entity recognition (NER) and machine translation (MT), we are continuously expanding and revising our Database of Arabic Place Name Variants (DAP), which provides systematic coverage of Arabic orthographic variants and common orthographic errors.

It is important to note that although a handful of machine translation packages and data providers offer Arabic place names, their coverage is poor, their data contains many machine-generated errors, and they do not cover variants. Our project may well be the first attempt to build a comprehensive database of Arabic place names that covers the entire world, is accurate and validated, and is based on state-of-the-art techniques in computational lexicography. Please have a look at the data samples shown below.

2. Why are variants useful?

Identifying, processing and normalizing place names and their numerous variants is useful in a variety of applications, such as:

  1. Improving the accuracy of English-to-Arabic machine translation by providing the standard, correct Arabic form.
  2. Improving the accuracy of Arabic-to-English machine translation by identifying variants and errors in the original Arabic text.
  3. Place name dictionaries for human translators.
  4. Entity and information extraction.
  5. Segmentation and morphological analysis of Arabic texts.
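
Applications 1 and 2 above both come down to mapping a variant or erroneous spelling back to its standard form. The following is a minimal sketch of one way to do that, assuming a simple folding scheme (hamza-bearing alifs, alif maqsuura and taa' marbuuta are folded, diacritics are dropped) that is common in information retrieval. It is not a description of how the database itself is built or queried; the names `fold`, `STANDARD` and `lookup` are hypothetical, and the three entries are taken from the data sample in Section 3.

```python
import unicodedata

# Assumed folding rules (an illustrative convention, not the database's own).
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0649": "\u064A",  # alif maqsuura         -> yaa'
    "\u0629": "\u0647",  # taa' marbuuta         -> haa'
})


def fold(text: str) -> str:
    """Return a matching key: no diacritics, no tatweel, folded letters."""
    text = unicodedata.normalize("NFC", text)
    # Drop combining marks (short vowels, shadda, sukuun) and tatweel.
    text = "".join(
        ch for ch in text
        if not unicodedata.combining(ch) and ch != "\u0640"
    )
    return text.translate(_FOLD)


# Standard Arabic form -> English name (entries from the Section 3 sample).
STANDARD = {
    "أبو ظبي": "Abu Dhabi",
    "الإسكندرية": "Alexandria",
    "جدة": "Jeddah",
}

# Index the standard forms by their folded key.
INDEX = {fold(arabic): (arabic, english) for arabic, english in STANDARD.items()}


def lookup(surface: str):
    """Return (standard Arabic, English) for a surface form, or None."""
    return INDEX.get(fold(surface))


if __name__ == "__main__":
    for surface in ("ابو ظبى", "الاسكندريه", "جدّة"):
        print(surface, "->", lookup(surface))
```

Folding is lossy: two distinct names that differ only in a folded letter would collide on the same key, so a real system would need the explicit variant and error lists rather than rules alone. The sketch also does not handle spacing variants such as a missing word space.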

3. Data Sample

Our database covers both the Arab and non-Arab world, including variants. Only the most common variants are shown in the sample below -- see the next section for more.

Arabic | Buckwalter Transliteration | English | Variant | Error | Country
أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى, ابو ظبى | UAE
الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | Egypt
الجزائر | AljzA}r | Algiers |  | الجزاير | Algeria
برازيليا | brAzylyA | Brasilia | برازيلية | برازيليه | Brazil
القاهرة | AlqAhrp | Cairo |  | القاهره | Egypt
الشرق الاقصى | Al$rq AlAqSY | Far East |  | الشرق الاقصي | N/A
ألمانيا | >mAnyA | Germany | المانيا |  | Germany
الجيزة | Aljyzp | Giza |  | الجيزه | Egypt
حيفا | HyfA | Haifa |  | حيفة | Israel
جدة | jdp | Jeddah | جدّة | جده | Saudi Arabia
القدس | Alqds | Jerusalem |  |  | Israel
المنامة | AlmnAmp | Manama |  | المنامه | Bahrain
مكة | mkp | Mecca |  | مكه | Saudi Arabia
نابلس | nAbls | Nablus |  |  | Palestinian Territory
نانجينغ | nAnjyng | Nanjing |  |  | China
بالو ألتو | bAlw >ltw | Palo Alto | بالو التو, بالو آلتو |  | USA
الرياض | AlryAD | Riyadh | الرّياض |  | Saudi Arabia
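
The transliteration column of the sample uses the Buckwalter scheme, a one-to-one mapping from Arabic characters to ASCII symbols. The following is a minimal sketch of that mapping, assuming the input uses precomposed Unicode code points (for example U+0623 for alif with hamza above); the names `_RUNS`, `BUCKWALTER` and `to_buckwalter` are hypothetical and are not part of the database.

```python
# Buckwalter symbols for runs of consecutive Unicode code points,
# keyed by the first code point of each run.
_RUNS = (
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghayn
    (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukuun
    (0x0670, "`{"),                          # dagger alif, alif wasla
)

# Translation table: Unicode code point -> Buckwalter symbol.
BUCKWALTER = {
    start + offset: symbol
    for start, symbols in _RUNS
    for offset, symbol in enumerate(symbols)
}


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic text; unmapped characters pass through unchanged."""
    return text.translate(BUCKWALTER)


if __name__ == "__main__":
    for name in ("القاهرة", "الجزائر", "أبو ظبي"):
        print(name, "->", to_buckwalter(name))
```

Because the mapping is character by character, spaces and Latin text are left untouched, and orthographic differences such as a missing hamza remain visible in the transliteration (`A` versus `<` or `>`).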

4. Place Name Variants

Orthographic variants and errors of well-known place names are shown in the table below. This sample contains American, Egyptian, Emirati, Chinese and Japanese place names. Data is ordered first by English, and then by the web frequency of the Arabic.

Database of Arabic Place Names (sample)

English | Arabic | Frequency
Abu Dhabi | أبوظبي | 14194320
Abu Dhabi | ابوظبي | 09564310
Abu Dhabi | أبو ظبي | 06035820
Abu Dhabi | ابو ظبي | 02534770
Abu Dhabi | أبوظبى | 00000436
Abu Dhabi | ابوظبى | 00000435
Abu Dhabi | أبو ظبى | 00000121
Abu Dhabi | ابو ظبى | 00000079

Alexandria | الإسكندرية | 04009670
Alexandria | الاسكندرية | 02553390
Alexandria | الأسكندرية | 00605150
Alexandria | الاسكندريه | 00000439
Alexandria | الأسكندريه | 00000073
Alexandria | الإسكندريه | 00000048
Alexandria | الاسكندريا | 00000020
Alexandria | الاسكندريى | 00000008
Alexandria | الإسكندريا | 00000003
Alexandria | الأسكندريا | 00000002

Fukuoka | فوكوكا | 00044800
Fukuoka | فوكووكا | 00002500
Fukuoka | فوكوأوكا | 00001500
Fukuoka | فكوكا | 00000284
Fukuoka | فوكواوكا | 00000277
Fukuoka | فوكؤوكا | 00000227

Kansas City | كانزاس سيتي | 00001060
Kansas City | كانساس سيتي | 00000781
Kansas City | مدينة كانساس | 00000658
Kansas City | كنساس سيتي | 00000479
Kansas City | مدينة كنساس | 00000332
Kansas City | كانسس سيتي | 00000058
Kansas City | مدينة كانزاس | 00000045
Kansas City | كانزس سيتي | 00000033
Kansas City | مدينة كانسس | 00000021
Kansas City | مدينة كنزاس | 00000008
Kansas City | كنزاس سيتي | 00000007

Nanjing | نانجينغ | 00002550
Nanjing | نانجينج | 00000822
Nanjing | نانكينج | 00000122
Nanjing | نانكينغ | 00000040
Nanjing | نانغينغ | 00000005

New Jersey | نيوجيرسي | 00008410
New Jersey | نيوجرسي | 00008030
New Jersey | نيو جيرسي | 00004470
New Jersey | نيو جرسي | 00001190
New Jersey | نيوجرسى | 00000689
New Jersey | نيوجيرسى | 00000542
New Jersey | نيو جيرسى | 00000440
New Jersey | نيو جرسى | 00000100
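
The ordering used in this sample (first by English name, then by descending web frequency) can be expressed as a two-part sort key. The following is a minimal sketch, assuming each record is a tuple of English name, Arabic form and zero-padded frequency string; the names `ROWS`, `ordered` and `most_frequent` are hypothetical, and the five rows are copied from the sample.

```python
from itertools import groupby

# (English, Arabic, zero-padded web frequency), deliberately out of order.
ROWS = [
    ("Nanjing", "نانجينج", "00000822"),
    ("Abu Dhabi", "ابوظبي", "09564310"),
    ("Nanjing", "نانجينغ", "00002550"),
    ("Abu Dhabi", "أبوظبي", "14194320"),
    ("Abu Dhabi", "أبو ظبى", "00000121"),
]


def ordered(rows):
    """Sort by English name, then by frequency from highest to lowest."""
    return sorted(rows, key=lambda row: (row[0], -int(row[2])))


def most_frequent(rows):
    """Map each English name to its most frequent Arabic form."""
    return {
        english: next(group)[1]  # first row of each group has the top count
        for english, group in groupby(ordered(rows), key=lambda row: row[0])
    }


if __name__ == "__main__":
    for english, arabic, frequency in ordered(ROWS):
        print(english, arabic, int(frequency))
    print(most_frequent(ROWS))
```

Note that the most frequent form is not necessarily the standard one: in the sample, the unspaced أبوظبي outranks the spaced form أبو ظبي listed as the main entry in Section 3, so frequency alone cannot replace the editorial flags.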

The table below shows various orthographic variants and common errors for الإسكندرية, the Egyptian city of Alexandria, along with their Google occurrence counts (many other variants involving partial vocalization also exist). Our databases are now being expanded to systematically include all orthographic variants and errors, based on statistical analysis of Arabic orthography as it currently occurs in corpora; they often include the fully vocalized versions as well (see Database of Arabic Proper Nouns for a sample).

Our Arabic place names are carefully proofread to ensure strict adherence to the complex rules of hamza orthography, something which is often ignored outside of publications of the highest editorial standards. The result of this strict editorial policy is that we can provide not only the linguistically correct standard MSA version, but also all common non-standard and incorrect versions as well, carefully flagged to distinguish between them, as shown in the table below.


Some Orthographic Variants and Common Errors for الإسكندرية (Alexandria)

Rank | Type* | Arabic | Buckwalter Transliteration | Frequency | Remarks
1 | N | الاسكندرية | AlAskndryp | 02930000 | Normalized, no hamza
2 | S | الإسكندرية | Al<skndryp | 00690000 | Standard form, with hamza
3 | E | الاسكندريه | AlAskndryh | 00089200 | No hamza, taa' marbuuta replaced by haa'
4 | V | الإسكندريّة | Al<skndry~p | 00000954 | Explicit shadda
5 | E | الإسكندريه | Al<skndryh | 00000897 | Taa' marbuuta replaced by haa'
6 | V | الاسكندريّة | AlAskndry~p | 00000245 | No hamza, shadda explicit
7 | E | الاسكندريا | AlAskndryA | 00000080 | Hamza omitted, taa' marbuuta replaced by alif
8 | V | الإسْكَنْدَريَّة | Al<sokanodary~ap | 00000024 | Fully vocalized
9 | E | الاسكندريّه | AlAskndry~h | 00000012 | No hamza, shadda explicit, taa' marbuuta replaced by haa'
10 | E | الإسكندريا | Al<skndryA | 00000007 | Taa' marbuuta replaced by alif tawiila
11 | E | الإسكندريّه | Al<skndry~h | 00000005 | Taa' marbuuta replaced by haa', shadda explicit

* V = variant; E = error; S = standard; N = normalized

In addition to the above, our database contains many other variants, such as those with partial and full vocalization, covering all actual and potential variants. The full set of Alexandria variants includes 35 entries.
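
To illustrate what "potential variants" can mean in practice, the following is a minimal sketch that derives candidate spellings from a standard form by applying two of the rules named in the Remarks column: omitting hamza on alif, and replacing a final taa' marbuuta with haa' or alif. This is an assumption about how such candidates could be enumerated, not the method used to compile the database; the function names are hypothetical, and shadda and vocalization variants are left out because they require knowing where the marks belong.

```python
import unicodedata

ALIF, ALIF_HAMZA_ABOVE, ALIF_HAMZA_BELOW = "\u0627", "\u0623", "\u0625"
TAA_MARBUUTA, HAA = "\u0629", "\u0647"


def hamza_options(word):
    """Yield (form, remarks): the word as-is, then with hamza omitted."""
    yield word, ()
    bare = word.replace(ALIF_HAMZA_ABOVE, ALIF).replace(ALIF_HAMZA_BELOW, ALIF)
    if bare != word:
        yield bare, ("hamza omitted",)


def ending_options(word):
    """Yield (form, remarks): the word as-is, then with the ending replaced."""
    yield word, ()
    if word.endswith(TAA_MARBUUTA):
        yield word[:-1] + HAA, ("taa' marbuuta replaced by haa'",)
        yield word[:-1] + ALIF, ("taa' marbuuta replaced by alif",)


def candidates(standard):
    """Map each candidate spelling to the rules that produced it."""
    standard = unicodedata.normalize("NFC", standard)
    result = {}
    for form, hamza_remarks in hamza_options(standard):
        for variant, ending_remarks in ending_options(form):
            result[variant] = hamza_remarks + ending_remarks
    return result


if __name__ == "__main__":
    for form, remarks in candidates("الإسكندرية").items():
        print(form, "|", ", ".join(remarks) or "standard form")
```

For الإسكندرية this yields six spellings, which correspond to ranks 1, 2, 3, 5, 7 and 10 in the table above. Rule-generated candidates still need to be checked against corpus frequencies and classified editorially before they can be flagged as variants or errors.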