Database of Arab Names

قاعدة بيانات الأسماء العربية

DAN at a Glance

  • 6.6 million validated Arabic name variants.
  • Based on over 25 million source names.
  • Constantly updated and expanded.
  • Proofread by native Arab editors.
  • Validated against the web and text corpora.
  • Fully vocalized Arabic.
  • Web-based frequency statistics.
  • Support for various romanization systems.
  • Attributes such as type and gender.
  • Support for OFAC SDNs and aliases.
  • Supports various non-English language systems.

Practical Applications

  • Information retrieval and query processing.
  • Entity recognition and extraction.
  • Machine translation.
  • Compliance and risk management.
  • Anti-money laundering and fraud detection.
  • Anti-terror and immigration control.

(Update: June 2012) Special version that includes vocalized and unvocalized Arabic

We are now releasing an enhanced version of DAN that provides not only the romanized Arab names and their variants, but also the corresponding Arabic-script name in both fully vocalized and unvocalized form. It also provides Arabic script variants, frequency statistics, and name type codes.

Now covers approximately 6.6 million names and variants

CJKI's Database of Arab Names (DAN) v3.0 covers approximately 6.6 million entries. It consists of Arabic personal names and name variants mapped to the original Arabic script, together with a large variety of supplementary information. Based on authoritative linguistic resources, DAN was compiled by a team of native Arabic-speaking editors and includes numerous orthographic variants as well as other attributes such as web frequency, name type codes, and normalized forms.

DAN plays an important role in helping software developers, especially of security applications and natural language processing tools, enhance their technology by enabling named entity recognition and extraction, machine translation, variant normalization, and information retrieval of Arabic names.

Constantly expanding coverage

DAN contains a large and constantly growing collection of romanized Arabic name variants mapped to the original Arabic script. It continues to undergo extensive expansion and proofreading.

Over two years have passed since March 2008, when CJKI released DAN v2.0 with coverage of approximately 1.5 million entries. Since then, our team of editors and programmers has been working vigorously on further expansion and validation, and version 3.0 now covers approximately 6.6 million validated entries.

Unique features

In addition to comprehensive coverage, DAN offers unique features such as manual vocalization for every Arabic name, support for various romanization systems, and validation of all romanized variants based on their frequency of occurrence. The database contains a web frequency statistic for each of the millions of variants. Augmenting DAN with frequency data from relevant lexical resources makes it more effective at distinguishing names from non-names: the frequency data can be used to estimate the likelihood that an arbitrary string of romanized Arabic is actually a name.
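To illustrate how a per-variant frequency could feed a name-likelihood estimate, here is a minimal sketch. It is not DAN's actual method or API: the dictionary layout, the `normalize` helper, and the log-scaled `name_score` function are all assumptions made for illustration, and the frequencies are simply copied from the sample table on this page.

```python
import math

# Frequencies copied from the sample table on this page.
# Storing them in a plain dict is an assumption, not DAN's delivery format.
SAMPLE_FREQ = {
    "'Abderrahim": 382000,
    "'Abdurrahim": 172000,
    "'Abdulrahim": 82100,
    "'Abdelrahim": 54200,
    "'Abdul Rahim": 40000,
}

def normalize(s: str) -> str:
    """Hypothetical key normalization: trim, drop a leading apostrophe, casefold."""
    return s.strip().lstrip("'\u2018\u2019\u02bf").casefold()

INDEX = {normalize(k): v for k, v in SAMPLE_FREQ.items()}

def name_score(candidate: str, index=INDEX) -> float:
    """Return a score in [0, 1]: 0 for unknown strings, 1 for the most frequent variant.

    The log scaling is an illustrative choice so that rare but attested
    variants are not drowned out by very common ones.
    """
    freq = index.get(normalize(candidate), 0)
    if freq == 0:
        return 0.0
    top = max(index.values())
    return math.log1p(freq) / math.log1p(top)

if __name__ == "__main__":
    for s in ["Abderrahim", "abdul rahim", "Xyzzy"]:
        print(s, round(name_score(s), 3))
```

A real application would likely combine such a score with context (for example, surrounding tokens) rather than rely on frequency alone.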

DAN also has both vocalized and unvocalized versions of the Arabic names, and sometimes multiple vocalizations for the same name. Full and accurate diacritics are provided, even such relatively rare ones as alif-wasla and dagger alif. This is not only of academic interest, but is also a practical means to ensure that romanized versions of great accuracy and variety can be provided.
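The relationship between vocalized and unvocalized forms can be sketched in code. The example below assumes that an unvocalized form can be approximated by removing combining marks (including dagger alif, U+0670) and replacing alif wasla (U+0671) with plain alif (U+0627). This is only an approximation for illustration; the function name `unvocalize` is hypothetical, and DAN's own unvocalized forms are prepared by editors and may differ.

```python
import unicodedata

ALIF_WASLA = "\u0671"  # ٱ, a letter rather than a combining mark
PLAIN_ALIF = "\u0627"  # ا

def unvocalize(text: str) -> str:
    """Approximate an unvocalized Arabic string (illustrative assumption).

    NFC normalization is applied first so that decomposed hamza forms are
    recomposed into single letters and survive the removal of combining marks.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch == ALIF_WASLA:
            out.append(PLAIN_ALIF)
        elif unicodedata.combining(ch):
            # Short vowels, shadda, sukun, tanwin, and dagger alif all have
            # a nonzero canonical combining class and are dropped here.
            continue
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    vocalized = "\u0645\u064f\u062d\u064e\u0645\u0651\u064e\u062f"  # مُحَمَّد
    print(unvocalize(vocalized))  # prints محمد
```

Going the other way (adding diacritics to unvocalized text) cannot be done mechanically like this, which is why manually vocalized data is valuable.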

DAN is available as a standalone database, or it can be paired with our Database of Arab Names in Arabic (DANA), which contains orthographic variants of the canonical, fully sanitized Arabic name.


Data samples

Variants of عبدالرحيم (Abd Al Raheem)

SUB ID    VARIANT         FREQUENCY
U000261   'Abderrahim     0000382000
U000763   Abderrahim      0000382000
U000425   'Abdurrahim     0000172000
U000928   Abdurrahim      0000172000
U000385   'Abdulrahim     0000082100
U000887   Abdulrahim      0000082100
U000236   'Abdelrahim     0000054200
U000739   Abdelrahim      0000054200
U000359   'Abdul Rahim    0000040000
U000370   'Abdul-Rahim    0000040000

A more extensive sample of over 1,000 variants is also available (.pdf).

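For readers who want to experiment with the sample rows, the following sketch parses one row into its three fields. It assumes only what the sample on this page shows: a sub ID, a romanized variant that may itself contain spaces, and a zero-padded frequency, separated by whitespace. The function name `parse_row` is hypothetical, and the actual delivery format of DAN may differ.

```python
from typing import Tuple

def parse_row(line: str) -> Tuple[str, str, int]:
    """Split a sample row into (sub_id, variant, frequency).

    The first token is taken as the ID and the last as the frequency;
    everything in between is the variant, so "'Abdul Rahim" stays intact.
    """
    parts = line.split()
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    sub_id = parts[0]
    variant = " ".join(parts[1:-1])
    frequency = int(parts[-1])  # int() accepts the zero-padded string
    return sub_id, variant, frequency

if __name__ == "__main__":
    print(parse_row("U000359 'Abdul Rahim 0000040000"))
```
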
A sampling of DAN's variant coverage

STANDARD        ARABIC        VARIANTS
'Abd-al-'Aziz   عبدالعزيز     1,122
'Isam-al-Din    عصام الدين    181
Bin-Jabr        بن جبر        24
Mahmud          محمود         69
Abu-Hurayrah    أبو هريرة     272
Mubarak         مبارك         24
Khalil          خليل          69
Muhammad        محمد          151
Qawmuq          قوموق         45
Yusif           يوسف          56