The CJKI Database of Arabic Plurals
The CJKI Database of Arabic Plurals (DAP) is the first truly modern, fully up-to-date database covering both regular and irregular Arabic plurals. Developed by experts over a period of several years, this database includes various grammatical attributes such as part of speech, collectivity codes, gender codes, and full vocalization. CJKI is now making this database available for use in software development, machine translation, and Arabic language education.
The Database of Arabic Plurals has been assembled by a team of experts in Arabic grammar, who researched and checked each entry meticulously to ensure accuracy and to avoid the errors found in other works. In an era in which accurate processing of Arabic text is critical, this database represents a major step forward for natural language processing, machine translation, lexicography, and pedagogy.
Arabic regular plurals, also called sound plurals, are formed with the regular suffixes ُونَ uuna for masculine plurals and َاتْ aat for feminine plurals, as shown below:
For language learners and language processing software alike, the irregular, or broken, plurals present one of the greatest challenges in the Arabic language. These morphologically irregular plurals are distinct in that they are not formed with the regular plural suffixes shown above. Instead, they are formed by modifying the consonant-vowel pattern (CV template) of the singular form, a phenomenon known as "nonconcatenative morphology". Each of these plurals follows one of dozens of complicated morphological patterns, derived from the root of the singular form by changing the vowels and, in some patterns, adding prefixes or suffixes.
English also has irregular plurals such as geese for goose and loaves for loaf, but the problem in Arabic is significantly more challenging since, in Arabic, such plurals are far more numerous and the patterns far more complex. In fact, the vast majority of Arabic plurals are broken. Furthermore, some nouns have distinct plural forms for different senses of the noun. For instance, the word بَيْتٌ baytun has two plural forms: بُيُوت in the sense of 'houses' and أَبْيَات in the sense of 'verses'. To add to this complexity, while many broken plurals are formed by inserting the root into a fixed pattern or template, the formation of some broken plurals requires complex morphophonological changes that cannot be predicted from the singular form, at least not without the aid of sophisticated algorithms and computational models.
Let us examine some of the broken plurals in the table below:
|Arabic||English||Arabic||English||Pattern Singular||Pattern Plural|
The last column shows the broken plural pattern. For example, CuCuC means that the first two consonants of the root, ك /k/ and ت /t/, are each followed by the vowel /u/, whereas the last consonant ب /b/ has no vowel. There are several dozen such broken plural patterns, all irregular and all unpredictable.
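The pattern notation can be illustrated with a minimal Python sketch that fills each C slot of a template with the next consonant of the root. The function name `apply_pattern` and the romanized roots are assumptions made for this illustration and are not part of DAP. Note that the sketch only shows how a known pattern is read; it does not predict which pattern a given noun takes.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot in a CV template with the next root consonant.

    Illustrative helper only (hypothetical, not part of DAP). The root is
    given as a string of romanized consonants, e.g. "ktb".
    """
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants must match")
    consonants = iter(root)
    # Vowel symbols in the template are copied through unchanged.
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


print(apply_pattern("ktb", "CuCuC"))   # kutub
print(apply_pattern("byt", "CuCuuC"))  # buyuut
```

The same root yields the singular when a different template is applied, for example `apply_pattern("ktb", "CiCaaC")` gives `kitaab`.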
Broken Plurals in Information Processing
Since, on the whole, the formation of the broken plural cannot be easily predicted from the singular form, such plurals pose a daunting challenge for machine translation, natural language processing, and language learning. This is immediately clear from the table in the previous section. The lack of regular patterns means that learners must learn each plural individually, while software for processing Arabic texts should ideally have a hard-coded database of broken plurals to determine the plural form from the singular or vice versa.
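A lookup of this kind might be sketched as follows in Python. The sample entries, the romanized forms, and the function names `pluralize` and `singularize` are assumptions chosen for illustration; they do not reflect the actual DAP data format.

```python
from collections import defaultdict

# Hypothetical sample entries (singular, plural, sense); not the DAP format.
ENTRIES = [
    ("kitaab", "kutub", "books"),
    ("bayt", "buyuut", "houses"),
    ("bayt", "abyaat", "verses"),
]


def build_indexes(entries):
    """Build singular->plurals and plural->singular lookup tables."""
    to_plural = defaultdict(list)
    to_singular = {}
    for singular, plural, sense in entries:
        to_plural[singular].append((plural, sense))
        # Simplification: assumes each plural maps back to one singular.
        to_singular[plural] = singular
    return dict(to_plural), to_singular


TO_PLURAL, TO_SINGULAR = build_indexes(ENTRIES)


def pluralize(singular, sense=None):
    """Return the stored plural forms, optionally filtered by sense."""
    candidates = TO_PLURAL.get(singular, [])
    if sense is not None:
        candidates = [c for c in candidates if c[1] == sense]
    return [plural for plural, _ in candidates]


def singularize(plural):
    """Return the singular for a stored plural, or None if unknown."""
    return TO_SINGULAR.get(plural)


print(pluralize("bayt"))            # ['buyuut', 'abyaat']
print(pluralize("bayt", "verses"))  # ['abyaat']
print(singularize("kutub"))         # kitaab
```

Storing the forms explicitly, rather than computing them, is the point of the design: the table is consulted in both directions and nothing is inferred from the shape of the word.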
Lack of consideration for irregular plurals would be fatal even in an English natural language processor, let alone in the more complex case of Arabic. Given the prevalence of Arabic broken plurals, it is critical to have an accurate database of irregular forms in order to process text meaningfully.
Key Features of DAP
Below are the key features that make the CJKI Database of Arabic Plurals (DAP) a highly useful resource:
- highly accurate and up-to-date
- archaic forms excluded
- proofread and validated by experts
- includes both regular and irregular plurals
- shows multiple plurals ordered by frequency
- features comprehensive coverage of core vocabulary
- indicates sense for sense-dependent plural forms
- covers nouns and adjectives
- contains codes for part of speech, gender, and collectivity
- includes fully vocalized, unvocalized, and romanized Arabic
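To make these features concrete, the following Python sketch shows one way a single entry with such attributes could be modeled. All field names, code values, and frequency ranks here are hypothetical and chosen only for illustration; the actual DAP record layout is not described in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PluralForm:
    vocalized: str
    romanized: str
    sense: Optional[str] = None  # set only for sense-dependent plurals
    rank: int = 1                # hypothetical frequency rank, 1 = most frequent


@dataclass
class Entry:
    vocalized: str
    unvocalized: str
    romanized: str
    pos: str                            # part of speech (hypothetical value)
    gender: str                         # gender (hypothetical value)
    collectivity: Optional[str] = None  # left unset; DAP codes not shown here
    plurals: List[PluralForm] = field(default_factory=list)

    def plurals_by_frequency(self) -> List[PluralForm]:
        """Return plural forms ordered from most to least frequent."""
        return sorted(self.plurals, key=lambda p: p.rank)


# Illustrative entry; the ranks are assumed, not taken from DAP.
bayt = Entry(
    vocalized="بَيْتٌ",
    unvocalized="بيت",
    romanized="baytun",
    pos="noun",
    gender="masculine",
    plurals=[
        PluralForm("أَبْيَات", "abyaat", sense="verses", rank=2),
        PluralForm("بُيُوت", "buyuut", sense="houses", rank=1),
    ],
)

for form in bayt.plurals_by_frequency():
    print(form.romanized, form.sense)
```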