Parsing Arabic using deep learning technology
Rahma Maalej 1, Nabil Khoufi 2 and Chafik Aloulou 3
1,3 University of Sfax, ANLP Research Group, MIRACL Lab, Sfax, Tunisia
2 ANLP Research Group, MIRACL Lab, Sfax, Tunisia

                Abstract
                Syntactic parsing is a fundamental step in automatic language analysis, since it is
                the task of determining the syntactic structure of sentences. In this paper, we
                propose to parse Arabic sentences using deep learning techniques. We present our
                methodology and report evaluation results for several deep learning architectures.
                Keywords
                Natural language processing, NLP, syntactic parsing, standard Arabic, deep learning,
                machine learning, LSTM, BILSTM, RNN.

1. Introduction
    In Natural Language Processing (NLP), syntactic parsing is a crucial step for applications such as
machine translation and spelling correction. This stage produces the syntactic structure of a text, which
clarifies the relations between the linguistic units and supports the construction of a semantic
representation. It is a difficult step because of the complexity and richness of the Arabic language.
Moreover, a wrong segmentation of the sentence or a wrong choice of grammatical category for one of
its parts will distort the semantic interpretation. Many works have therefore addressed this task using
different approaches: the linguistic approach, the statistical approach and the hybrid approach [1]. The
linguistic approach relies on lexical knowledge and on precise linguistic rules to resolve the ambiguities
of the words of the sentence and to determine the structure of the sentence [20]. The statistical approach
relies on statistical models. The hybrid approach combines linguistic and statistical methods [12][9].
    Due to the complexity of the Arabic language, the results of Arabic parsing still need to be
improved.
    The objective of this work is to build a statistical syntactic parser for Modern Standard Arabic
(MSA) based on deep learning techniques.

2. Related works
   Several works have used a linguistic approach for MSA syntactic parsing.
    [17] proposed a grammar that is built on concepts instead of words. This method is intended to
cover more of the characteristics of the Arabic language [16]. They built a probabilistic context-free
grammar that relies on schemes from 100,000 Arabic words and obtained an accuracy of 63.35%.

Tunisian Algerian Conference on Applied Computing (TACC 2021), December 18–20, 2021, Tabarka, Tunisia
EMAIL: rahmamaalej1234@gmail.com (R. Maalej) ; nabil.khoufi@fsegs.rnu.tn (N. Khoufi); chafik.aloulou@fsegs.tnu.tn (C. Aloulou)

             ©️ 2020 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)

   [11] proposed an Arabic parser based on a probabilistic context-free grammar (PCFG). Their method
is divided into two phases. The first phase is grammar induction: its objective is to automatically
derive a PCFG from the annotated Arabic Treebank (ATB) corpus. This process consists of two steps.
The first step derives CFG rules from the ATB. The second step assigns a probability to each derived
rule [10]. The second phase uses the induced grammar (the result of the first phase) to perform the
syntactic analysis. They tested this parser on a set of 1650 sentences taken from the ATB. The authors
report an accuracy of 83.59%, a recall of 82.98% and an F-measure of 83.23%.
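
The two induction steps described above can be illustrated with a minimal sketch. It assumes bracketed trees in Penn Treebank style and estimates rule probabilities by relative frequency, which is one common choice; the text does not state which estimator [10] and [11] actually use. The function names and the two transliterated toy trees are hypothetical.

```python
from collections import Counter

def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                              # consume ")"
        return (label, children)

    return read()

def rules(tree):
    """Step 1: yield every CFG rule (lhs, rhs) used in the tree."""
    label, children = tree
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)

def induce_pcfg(trees):
    """Step 2 (assumed estimator): P(lhs -> rhs) = count(rule) / count(lhs)."""
    counts = Counter(r for t in trees for r in rules(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical toy treebank, not taken from the ATB.
treebank = [
    "(S (NP (N walad)) (VP (V qara) (NP (N kitab))))",
    "(S (NP (N bint)) (VP (V namat)))",
]
pcfg = induce_pcfg(parse_tree(t) for t in treebank)
```

In this toy treebank the rule VP -> V NP occurs once out of two VP expansions, so its estimated probability is one half, while S -> NP VP is the only expansion of S.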
    [2] proposed to use an Arabic property grammar to evaluate and enrich the Stanford Parser. Their
approach checks whether the syntactic constraints, also called properties, are satisfied by the
analysis results obtained on the corpus. Moreover, they enriched the plain representation of the
analysis result with syntactic properties. This makes explicit several pieces of implicit information
about the relations between syntactic units. [2] obtained an F-score of 77.62%, with a recall of 70.2%
and a precision of 86.81%.
    Recently, [8] proposed to construct a formal grammar for use in automatic processing applications
of the Arabic language. This thesis work aimed to set up a grammar that describes the syntax and
semantics of Modern Standard Arabic. It presents a complete meta-grammatical description that yields
a syntactically and semantically rich representation of this language. For the evaluation of the
syntactic analysis, the author manually built a test corpus by extracting 1000 syntactically correct
sentences from a Tunisian school book (8th grade level). The reported results are a precision of
82.33%, a recall of 88.10% and an F1-score of 85.11%.
    [4] proposed two approaches. The first detects and corrects syntactic errors based on the
automatic generation of correct sentences [18]. The second aims at the automatic correction of
case-ending errors based on parsing. The author built a corpus of 360 sentences. The system achieved
an accuracy of 92.01%, a recall of 84.83% and an F-score of 88.27%.
    The statistical approach is based on statistical models. We have not found recent works using deep
learning for syntactic parsing of MSA. On the other hand, we found works dealing with French and
English. [6] worked on lexicalized constituency parsing. Their main objective was to adapt an
underlying statistical model to these new representations. They studied three neural architectures of
increasing complexity and showed that a non-linear hidden layer makes it possible to take advantage of
the information given by the embeddings. They obtained an F-measure of 80.7% with the given tags and
78.3% with the predicted tags.
    [5] analyzed the benefits of a pre-trained language model such as BERT (Bidirectional Encoder
Representations from Transformers) for discontinuous constituency parsing of English on the Penn
Treebank (PTB). The reported F-score is 95%.
    [9] implemented a hybrid parser for Standard Arabic that combines statistical parsing with
linguistic parsing. The objective of this parser is to reduce the analysis time, which grows
exponentially with the length of the sentence.
    [19] present an evaluation of the integration of data from a syntactic lexicon, the
Lexicon-Grammar, into a parser. They carried out the evaluation by cross-validation on 10% of the
French Treebank (FTB) and obtained an F-score of 85.32%.

Table 1
Related works study recap
Authors   Testing data                              Results
[17]      100,000 Arabic sentences                  Accuracy = 63.35%
[10]      ATB                                       -
[11]      ATB Treebank (1650 sentences)             Accuracy = 83.59%, Recall = 82.98%,
                                                    F-measure = 83.23%
[2]       ATB, PTB, 100 Arabic phrases taken        F-score = 77.62%, Recall = 70.2%,
          from stories for Arab children            Precision = 86.81%
[8]       Corpus 1: 1000 grammatically correct      Precision = 82.33%, Recall = 88.10%,
          sentences; corpus 2: 250                  F1-score = 85.11%
          ungrammatical sentences
[4]       360 Arabic sentences                      Accuracy = 92%, Recall = 84%,
                                                    F-score = 88.27%
[6]       French corpus SPMRL                       F-score = 80.7% with the given tags,
                                                    78.3% with the predicted tags
[5]       English corpus PTB                        F-score = 95%
[9]       ATB                                       F-score = 83.24%
[19]      French Treebank                           F-score = 85.32%

3. Proposed Approach
   This section presents the general architecture of our proposed approach.
   Our statistical method for parsing Arabic with deep learning techniques is divided into three
steps: preprocessing, training, and validation and testing. The first step preprocesses the data by
extracting the syntactic levels of each sentence. The second step trains our models, and the third
step validates and tests them. The steps of our proposed method are illustrated in the following
figure:


Figure 1: Architecture of the proposed method


3.1.    The preprocessing step
   The preprocessing step consists of preparing data for the training step.

3.1.1. Corpus ATB
   In this work, we used the annotated Penn Arabic Treebank (ATB) corpus [13].
   The ATB corpus contains data extracted from linguistic sources written in Modern Standard Arabic.
It comprises 599 texts taken from the Lebanese newspaper "An Nahar". The texts are unvocalized or
partially vocalized, and segmented. Each word is annotated with several pieces of information such as
its morphological features, its part of speech and its English translation. The corpus also contains
the syntax tree of each sentence [14].
    In addition, the ATB corpus is very rich in information such as gender, number and rationality,
and its annotation scheme is large. The corpus uses a set of 498 annotations to describe the
morphological features, 22 annotations to describe grammatical classes, and 20 further annotations
to describe the semantic relationships between words. Stop words, which are numerous in Standard
Arabic, are also marked with specific annotations. In our work, we use version 3.2, which contains
402,291 words. This version is distributed in several formats: SGM, POS, XML, penntree and
Integrated.
    For the preprocessing of the ATB corpus, we mainly used the two formats "XML" and "Tree" because
they have a hierarchical and readable representation and contain the information that is essential
for our objective and for experimentation.
    We represented each sentence extracted from the ATB corpus at eight syntactic levels, each with a
set of morpho-syntactic labels [3].
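
As an illustration of what a per-level representation can look like, here is a minimal sketch. It rests on an assumption that the text does not confirm: level 0 is taken to be the label directly above each word (its POS tag) and level k the label k nodes further up in the tree, with a placeholder for levels that do not exist. The function name `token_levels`, the padding symbol and the toy tree are hypothetical.

```python
def token_levels(bracketed, n_levels=8, pad="_"):
    """Return [(word, labels)] where labels[k] is the label at level k.

    Assumed convention: level 0 is the label directly above the word and
    each further level is one node higher. Missing levels get `pad`.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    stack = []   # labels of the constituents currently open, root first
    rows = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append(tokens[i + 1])   # the label follows the bracket
            i += 2
        elif tok == ")":
            stack.pop()
            i += 1
        else:                             # a word: read its ancestors bottom-up
            path = stack[::-1][:n_levels]
            rows.append((tok, path + [pad] * (n_levels - len(path))))
            i += 1
    return rows

# Hypothetical transliterated example, not taken from the ATB.
rows = token_levels(
    "(S (NP (NOUN walad)) (VP (VERB qara) (NP (NOUN kitab))))", n_levels=4
)
for word, labels in rows:
    print(word, labels)
```

With this representation, each level becomes a separate sequence labeling problem over the words of the sentence.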
    Features. The features are the information taken from the annotated corpus during the training
step, namely the morphological annotations contained in the syntactic annotations. We extracted the
morpho-syntactic labels from the ATB and reduced the number of labels in order to limit complexity and
to make the labels easier to learn and predict.
    For example, the compound label NP+ADJP is reduced to NP.
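
The label reduction can be sketched as follows. The only rule given in the text is the example NP+ADJP reduced to NP, so keeping the first component of a compound label is an assumption, and `reduce_label` is a hypothetical helper.

```python
def reduce_label(label):
    """Keep only the first component of a compound label (assumed rule)."""
    return label.split("+")[0].strip()

# Hypothetical label inventory before and after reduction.
labels = ["NP+ADJP", "NP", "VP", "NP+ ADJP"]
reduced = sorted({reduce_label(l) for l in labels})
print(reduced)
```

A smaller label set gives each remaining class more training examples, which is the stated motivation for the reduction.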

3.2.    The training step
   Many machine learning algorithms need a vector representation as input. In this work, we used
word embeddings as the input of our neural systems. We represented our data with word2vec.

3.2.1. Continuous Representations of Words
   Word2vec is a popular method for constructing word embeddings. It was introduced by [15]. Two
variants of word2vec have been proposed for learning word embeddings: Skip-gram and CBOW (Continuous
Bag of Words). CBOW is an architecture that predicts the current target word (the central word) from
the source context words (the surrounding words) [15].
   Skip-gram predicts the words of the context given an input word [15].
   In this work, we applied the CBOW representation because it is widely used in NLP applications.
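
The difference between the two variants lies in the direction of prediction, which shows up in the training examples each one is fed. The following sketch only builds those examples and does not train any embedding; the window size and the transliterated sentence are illustrative assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word): the context predicts the centre word."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

def skipgram_pairs(tokens, window=2):
    """Yield (input_word, context_word): the centre word predicts each neighbour."""
    for context, target in cbow_pairs(tokens, window):
        for neighbour in context:
            yield target, neighbour

# Hypothetical three-word sentence.
sentence = ["qara", "al-walad", "al-kitab"]
cbow = list(cbow_pairs(sentence, window=1))
skipgram = list(skipgram_pairs(sentence, window=1))
```

For the middle word, CBOW produces a single example whose input is both neighbours, whereas Skip-gram produces one example per neighbour.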
   After obtaining the word embeddings with word2vec, we start the training by applying the deep
learning techniques LSTM, GRU and BI-LSTM. We compare the results of these models after training and
choose the model with the best result to perform parsing.
   Today, neural networks are the most widely used systems in many machine learning tasks. They are
used, for example, in computer vision (image classification, object detection, segmentation, etc.)
and in natural language processing (machine translation, speech recognition, language models,
etc.).
   In this work, we create one model for each syntactic level. The objective of each model is to
determine the constituents of a sentence and the relations between them.
        • Long Short-Term Memory (LSTM) is a building unit for the layers of a recurrent neural
            network (RNN). An RNN made up of LSTM units is often referred to as an LSTM network.
            A common LSTM unit is composed of a cell, an input gate, an output gate and a forget
            gate. The cell is responsible for "storing" values. Each of the three gates can be
            considered as a "conventional" artificial neuron, as in a multilayer (or feedforward)
            neural network: it computes an activation (using an activation function) of a weighted
            sum. The gates can thus be considered as regulators of the flow of values. There are
            connections between these gates and the cell.
        • Gated Recurrent Unit (GRU) is a gated recurrent network similar to the LSTM network, but
            it has fewer parameters.
        • BILSTM (Bidirectional LSTM) is a sequence processing model that consists of two LSTMs:
            one receives the input in the forward direction and the other in the backward
            direction. BILSTMs increase the amount of information available to the network and
            improve the context available to the algorithm.
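
The three units above can be made concrete with a small sketch in pure Python. It follows the textbook LSTM formulation (three sigmoid gates plus a tanh candidate, each acting on the previous hidden state concatenated with the input) and is not the implementation used in our experiments. All function names, the random initialisation and the sizes are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Weighted sum W.v + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def init_lstm(input_size, hidden_size, rng):
    """One (W, b) pair per gate; each W acts on the concatenation [h_prev, x]."""
    def gate():
        W = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size + input_size)]
             for _ in range(hidden_size)]
        return W, [0.0] * hidden_size
    return {name: gate() for name in ("input", "forget", "output", "cell")}

def lstm_step(params, x, h_prev, c_prev):
    v = h_prev + x
    i = [sigmoid(a) for a in affine(*params["input"], v)]    # input gate
    f = [sigmoid(a) for a in affine(*params["forget"], v)]   # forget gate
    o = [sigmoid(a) for a in affine(*params["output"], v)]   # output gate
    g = [math.tanh(a) for a in affine(*params["cell"], v)]   # candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def run_lstm(params, xs, hidden_size):
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in xs:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs

def run_bilstm(fwd, bwd, xs, hidden_size):
    """Concatenate, per position, a forward pass and a re-aligned backward pass."""
    forward = run_lstm(fwd, xs, hidden_size)
    backward = run_lstm(bwd, xs[::-1], hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def n_params(input_size, hidden_size, n_blocks):
    """n_blocks = 4 for an LSTM, 3 for a GRU (textbook formulation)."""
    return n_blocks * (hidden_size * (hidden_size + input_size) + hidden_size)

rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 toy word vectors
outputs = run_bilstm(init_lstm(3, 4, rng), init_lstm(3, 4, rng), xs, hidden_size=4)
```

Each output vector has twice the hidden size because it joins the forward and backward states for the same word. The parameter count shows why a GRU is lighter: it has three weight blocks where the LSTM has four.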

4. Results
   The evaluation of our method is carried out in a validation and testing step. To this end, we
divided the ATB corpus into two parts, one for training (70%) and one for evaluation (30%), the
latter being split equally between validation (15%) and testing (15%).
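
A split with these proportions can be sketched as follows. Shuffling with a fixed seed before cutting is an assumption made for reproducibility; the text does not say whether the sentences were shuffled. `split_corpus` is a hypothetical helper.

```python
import random

def split_corpus(sentences, train=0.70, validation=0.15, seed=0):
    """Shuffle, then cut into training, validation and test parts."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * validation)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])   # the remainder is the test part

# Integer ids stand in for sentences here.
train_set, val_set, test_set = split_corpus(range(100))
```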
   We used deep learning techniques to train and test our models for our eight levels. We applied
the RNN models LSTM, GRU and BILSTM.
    The results are given in Table 2, which shows the accuracy obtained for each level.

Table 2
Evaluation results
           Levels                          Accuracy
                     LSTM-model            GRU-model             BILSTM-model
           Level0    99.00%                99.08%                99.60%
           Level1    99.18%                99.31%                99.53%
           Level2    98.84%                98.91%                99.17%
           Level3    99.09%                99.15%                99.36%
           Level4    99.36%                99.39%                99.60%
           Level5    99.21%                99.28%                99.49%
           Level6    99.25%                99.32%                99.57%
           Level7    99.26%                99.35%                99.60%

   We obtained good results, which encourage the use of a deep learning model for the syntactic
parsing of Standard Arabic. Comparing the LSTM, GRU and BILSTM models, we find that the BILSTM
models give the best results at every level.
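
The comparison can be checked directly from the figures of Table 2. The sketch below only re-reads the reported accuracies; averaging over the eight levels is our choice of summary here, not a metric reported in the table.

```python
# Accuracies (%) for Level0..Level7, copied from Table 2.
accuracy = {
    "LSTM":   [99.00, 99.18, 98.84, 99.09, 99.36, 99.21, 99.25, 99.26],
    "GRU":    [99.08, 99.31, 98.91, 99.15, 99.39, 99.28, 99.32, 99.35],
    "BILSTM": [99.60, 99.53, 99.17, 99.36, 99.60, 99.49, 99.57, 99.60],
}

# Mean accuracy over the eight levels, and the model with the highest mean.
mean_accuracy = {model: sum(v) / len(v) for model, v in accuracy.items()}
best = max(mean_accuracy, key=mean_accuracy.get)

# Does BILSTM beat both other models at every single level?
best_everywhere = all(
    accuracy["BILSTM"][k] > max(accuracy["LSTM"][k], accuracy["GRU"][k])
    for k in range(8)
)

for model, value in mean_accuracy.items():
    print(f"{model}: {value:.2f}%")
```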

5. Conclusion and Perspectives
   In this paper, we presented our statistical method to perform parsing for Standard Arabic using
deep learning techniques. We built a model for each syntactic level and obtained encouraging results.
As future work, we plan to improve the training step by adding other features besides the POS
features, since we believe that richer features can give better results. We also plan to evaluate
our method on another, external Arabic corpus.

6. References

   [1] C. Aloulou. Une approche multi-agent pour l’analyse de l’arabe: Modélisation de la syntaxe. PhD
thesis, Thèse de doctorat en informatique, Ecole Nationale des Sciences de l ..., 2005.

   [2] R. B. Bahloul, N. Kadri, K. Haddar, and P. Blache. Evaluation and enrichment of stanford parser
using an arabic property grammar. In A. F. Gelbukh, editor, Computational Linguistics and Intelligent
Text Processing - 18th International Conference, CICLing 2017, Budapest, Hungary, April 17-23, 2017,
Revised Selected Papers, Part I, volume 10761 of Lecture Notes in Computer Science, pages 170–182.
Springer, 2017.

   [3] H. B. Barhoumi, Aloulou and Z. (2015). Analyse syntaxique statistique de la langue arabe. 2015.

   [4] M. Chouaib. Contributions à la correction automatique des erreurs syntaxiques dans la
langue Arabe. PhD thesis, Faculté des Sciences Ben M’sik Université Hassan II Casablanca, 2020.

   [5] M. Coavoux. Qu’apporte bert à l’analyse syntaxique en constituants discontinus? une suite de
tests pour évaluer les prédictions de structures syntaxiques discontinues en anglais (what does bert
contribute to discontinuous constituency parsing? a test suite to evaluate discontinuous constituency
structure predictions in english). In Actes de la 6e conférence conjointe Journées d’Études sur la
Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition),
Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues
(RÉCITAL, 22e édition). Volume 2: Traitement Automatique des Langues Naturelles, pages 189–196, 2020.

   [6] M. Coavoux and B. Crabbé. Comparaison d’architectures neuronales pour l’analyse syntaxique
en constituants. In Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles.
Articles longs, pages 291–302, 2015.

   [7] M. Coavoux and B. Crabbé. Prédiction structurée pour l’analyse syntaxique en constituants par
transitions: modèles denses et modèles creux. Traitement Automatique des Langues, 57(1), 2016.

   [8] C. B. Khelil. Construction semi-automatique d’une grammaire d’arbres adjoints pour
l’analyse syntaxico-sémantique de l’arabe. PhD thesis, Université d’Orléans; Université de la
Manouba (Tunisie), 2019.

   [9] N. Khoufi. Une approche hybride pour l’analyse syntaxique de la langue arabe. PhD
thesis, Université de Sfax (Tunisie), 2017.

   [10] N. Khoufi, C. Aloulou, and L. H. Belguith. Arabic probabilistic context free grammar induction
from a treebank. Res. Comput. Sci., 90:77–86, 2015.

   [11] N. Khoufi, C. Aloulou, and L. H. Belguith. A framework for language resource construction and
syntactic analysis: Case of arabic. In International Conference on Intelligent Text Processing and
Computational Linguistics, pages 356–365. Springer, 2016.

   [12] N. Khoufi, C. Aloulou, and L. H. Belguith. Toward hybrid method for parsing modern standard
arabic. In 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial
Intelligence, Networking and Parallel/Distributed Computing (SNPD), pages 451–456. IEEE, 2016.

   [13] M. Maamouri, A. Bies, T. Buckwalter, and W. Mekki. The penn arabic treebank: Building
a large-scale annotated arabic corpus. In NEMLAR conference on Arabic language resources and tools,
volume 27, pages 466–467. Cairo, 2004.

   [14] M. Maamouri, A. Bies, and S. Kulick. Enhancing the arabic treebank: a collaborative
effort toward new annotation guideline. In LREC, pages 3–192. Citeseer, 2008.

   [15] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781, 2013.

   [16] M. A. B. Mohamed, S. Mallat, M. A. Nahdi, and M. Zrigui. Exploring the potential of schemes
in building nlp tools for arabic language. International Arab Journal of Information Technology
(IAJIT), 12(6), 2015.

   [17] M. A. B. Mohamed, S. Zrigui, A. Zouaghi, and M. Zrigui. N-scheme model: An
approach towards reducing arabic language sparseness. In 2015 5th International Conference
on Information & Communication Technology and Accessibility (ICTA), pages 1–5. IEEE, 2015.

   [18] C. Moukrim, A. Tragha, T. Almalki, et al. An innovative approach to autocorrecting
grammatical errors in arabic texts. Journal of King Saud University-Computer and Information Sciences,
2019.

   [19] A. Sigogne, M. Constant, and E. Laporte. Intégration des données d’un lexique syntaxique
dans un analyseur syntaxique probabiliste. arXiv preprint arXiv:1404.1872, 2014.

   [20] E. Wehrli and L. Nerima. The fips multilingual parser. In Language Production, Cognition, and
the Lexicon, pages 473–490. Springer, 2015.