DAN at a Glance
We are now releasing an enhanced version of DAN that provides not only the romanized Arab names and their variants, but also the corresponding name in Arabic script, in both fully vocalized and unvocalized form. It also provides Arabic script variants, frequency statistics, and name type codes.
CJKI's Database of Arab Names (DAN) v3.0 covers approximately 6.6 million entries, consisting of Arabic personal names and name variants mapped to the original Arabic script, together with a wide range of supplementary information. Based on authoritative linguistic resources, DAN was compiled by a team of native Arabic-speaking editors and includes numerous orthographic variants as well as other attributes such as web frequency, name type codes, and normalized forms.
DAN plays an important role in helping software developers, especially of security applications and natural language processing tools, enhance their technology by enabling named entity recognition and extraction, machine translation, variant normalization, and information retrieval of Arabic names.
DAN contains a large and constantly growing collection of romanized Arabic name variants mapped to the original Arabic script. It continues to undergo extensive expansion and proofreading.
Over two years have passed since March 2008, when CJKI released DAN v2.0 with coverage of approximately 1.5 million entries. Since then our team of editors and programmers has been working vigorously on further expansion and validation, and with version 3.0 DAN now covers over seven million validated entries.
In addition to comprehensive coverage, DAN offers such unique features as manual vocalization for every Arabic name, support for various romanization systems, and the validation of all romanized variants based on their frequency of occurrence. The database contains a web frequency statistic for each of the millions of variants. Augmenting DAN with frequency data from relevant lexical resources makes it more effective at distinguishing names from non-names: the frequency data can be used to estimate the likelihood that an arbitrary string of romanized Arabic is actually a name.
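As a rough illustration of how per-variant frequency data might be used to separate names from non-names, here is a minimal Python sketch. The record layout (romanized variant, Arabic script, web frequency), the sample values, the function names, and the scoring formula are all assumptions made for this example; they do not describe DAN's actual schema, data, or any CJKI tool.

```python
import math

# Hypothetical records: (romanized variant, Arabic script, web frequency).
# These values are invented for illustration and are not taken from DAN.
SAMPLE_ROWS = [
    ("Muhammad", "محمد", 9000),
    ("Mohammed", "محمد", 7000),
    ("Mohamed", "محمد", 5000),
    ("Mhmd", "محمد", 3),
]


def build_index(rows):
    """Map each lowercased romanized variant to (arabic, frequency)."""
    return {variant.lower(): (arabic, freq) for variant, arabic, freq in rows}


def name_score(candidate, index, min_freq=10):
    """Return a score in [0, 1] for how plausibly `candidate` is a name.

    Unknown strings and variants rarer than `min_freq` score 0.0.
    Known variants are scored by log-scaled frequency relative to the
    most frequent variant in the index (an assumed heuristic).
    """
    entry = index.get(candidate.lower())
    if entry is None:
        return 0.0
    _, freq = entry
    if freq < min_freq:
        return 0.0
    max_freq = max(f for _, f in index.values())
    return math.log1p(freq) / math.log1p(max_freq)


index = build_index(SAMPLE_ROWS)
for word in ["Muhammad", "Mohamed", "Mhmd", "table"]:
    print(word, round(name_score(word, index), 3))
```

The log scaling keeps very common variants from swamping moderately common ones, and the `min_freq` cutoff treats very rare spellings as unvalidated; both are design choices of this sketch rather than documented DAN behavior.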
DAN also has both vocalized and unvocalized versions of the Arabic names, and sometimes multiple vocalizations for the same name. Full and accurate diacritics are provided, even such relatively rare ones as alif-wasla and dagger alif. This is not only of academic interest, but is also a practical means to ensure that romanized versions of great accuracy and variety can be provided.
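To make the relationship between vocalized and unvocalized forms concrete, the following Python sketch derives an unvocalized string from a fully vocalized one using only the standard library. It is an assumed, simplified procedure and not how DAN's unvocalized forms were produced: it drops every Unicode nonspacing mark (which includes the short vowels, shadda, sukun, and dagger alif) and replaces alif-wasla with plain alif. The function name `unvocalize` is hypothetical.

```python
import unicodedata

ALEF_WASLA = "\u0671"  # ARABIC LETTER ALEF WASLA
ALEF = "\u0627"        # ARABIC LETTER ALEF


def unvocalize(name):
    """Strip diacritics from a vocalized Arabic string (simplified sketch).

    Assumptions: every nonspacing mark (Unicode category Mn) is a
    diacritic to be removed, and alif-wasla is written as plain alif
    in unvocalized text.
    """
    out = []
    for ch in name:
        if unicodedata.category(ch) == "Mn":
            continue  # fatha, damma, kasra, shadda, sukun, dagger alif, ...
        out.append(ALEF if ch == ALEF_WASLA else ch)
    return "".join(out)


vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"  # vocalized form of محمد
print(unvocalize(vocalized))
```

Going in the other direction, from unvocalized to vocalized, cannot be done mechanically in this way, which is why manually supplied vocalization is valuable for generating accurate romanized variants.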
DAN is available as a standalone database, or it can be paired with our Database of Arab Names in Arabic (DANA), which contains orthographic variants of the canonical, fully sanitized Arabic name.
A more extensive sample of over 1,000 variants is also available (.pdf).