Romanized Arabic and Berber detection using prediction by partial matching and dictionary methods - CEA - Commissariat à l’énergie atomique et aux énergies alternatives
Conference paper, Year: 2017

Romanized Arabic and Berber detection using prediction by partial matching and dictionary methods

Abstract

Arabic is a Semitic language whose standard form is written in Arabic script. However, the recent rise of social media and new technologies has contributed considerably to the emergence of a new form of Arabic, namely Arabic written in Latin script, often called Romanized Arabic or Arabizi. While Romanized Arabic is an informal language, Berber (Tamazight) uses Latin script in its standard form, with some orthographic differences depending on the country where it is used. Both languages are under-resourced and are not covered by state-of-the-art language identifiers. In this paper, we present an automatic language identifier for both Romanized Arabic and Romanized Berber. We also describe the linguistic resources we built (a large dataset and lexicons), which cover a wide range of Arabic dialects (Algerian, Egyptian, Gulf, Iraqi, Levantine, Moroccan and Tunisian) as well as the most popular Berber varieties (Kabyle, Tashelhit, Tarifit, Tachawit and Tamzabit). We use Prediction by Partial Matching (PPM) and dictionary-based methods, which reach a macro-average F-measure of 98.74% and 97.60% respectively.
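To make the dictionary-based approach and the reported metric concrete, here is a minimal sketch, not the authors' implementation. It assumes a simple scheme in which a text receives the label of the lexicon that matches the most tokens, and it computes a macro-average F-measure as the unweighted mean of per-class F1 scores. The lexicon entries, the label names, and the functions `identify` and `macro_f1` are hypothetical illustrations and are not taken from the paper's resources.

```python
from collections import Counter

# Toy lexicons for illustration only (assumption: not the paper's lexicons).
LEXICONS = {
    "romanized_arabic": {"salam", "shukran", "habibi", "inshallah"},
    "romanized_berber": {"azul", "tanemmirt", "amek", "tamazight"},
}


def identify(text, lexicons=LEXICONS, default="unknown"):
    """Label a text with the lexicon that matches the most tokens."""
    tokens = text.lower().split()
    hits = Counter()
    for label, words in lexicons.items():
        hits[label] = sum(1 for token in tokens if token in words)
    # On a tie, the label listed first in `lexicons` wins.
    label, count = max(hits.items(), key=lambda item: item[1])
    return label if count > 0 else default


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(gold) | set(predicted))
    if not labels:
        return 0.0
    pairs = list(zip(gold, predicted))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because the macro average weights every class equally, a rarely seen variety counts as much as a frequent one, which matters when the dataset is unbalanced across dialects.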
No file deposited

Dates and versions

cea-01841162, version 1 (17-07-2018)

Identifiers

Cite

W. Adouane, N. Semmar, R. Johansson. Romanized Arabic and Berber detection using prediction by partial matching and dictionary methods. 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), Nov 2016, Agadir, Morocco. ⟨10.1109/AICCSA.2016.7945668⟩. ⟨cea-01841162⟩