<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Syntactic analysis of Tunisian Arabic</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asma Mekki</string-name>
          <email>asma.elmekki.ec@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inès Zribi</string-name>
          <email>ineszribi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariem Ellouze</string-name>
          <email>mariem.ellouze@planet.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lamia Hadrich Belguith</string-name>
          <email>l.belguith@fsegs.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ANLP Research Group, MIRACL Lab., University of Sfax</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we study the problem of syntactic analysis of Dialectal Arabic (DA). Since corpora are a key resource for the automatic processing of languages, we propose a method for creating a treebank for Tunisian Arabic (TA), the “Tunisian Treebank”, in order to adapt an Arabic parser to TA, which is considered a variant of the Arabic language.</p>
      </abstract>
      <kwd-group>
        <kwd>Dialectal Arabic</kwd>
        <kwd>Syntactic analysis</kwd>
        <kwd>Treebank creation</kwd>
        <kwd>Tunisian Arabic</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The structure of MSA is among its main characteristics: it is ensured by a rich grammar with universally recognized rules, which is not the case for Arabic dialects. Indeed, the syntax of DA is affected by the influence of foreign languages and by code switching between DA and MSA, and even with foreign languages. Word order also differs. Nominal sentences are syntactically composed of a subject and a predicate; for example, the nominal sentence المكان جميل AlmakAn jamylN “the place is beautiful” can be rendered in TA in two ways, either البلاصة مزيانة AlblASaħ mizyanaħ or مزيانة البلاصة mizyanaħ AlblASaħ. In addition, in several nominal groups the preferred word order is inverted between MSA and TA.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <sec id="sec-2-1">
        <title>Parsers for MSA</title>
        <p>Berkeley parser. The Berkeley parser [1] uses a split-merge algorithm to learn a constituent grammar, starting from an X-bar grammar. The splits provide a tight fit to the training data, while the merges improve generalization and control the size of the grammar. This analyzer is available in open source for languages such as English, German, and Chinese, but not for Arabic.</p>
        <p>Stanford parser. The Stanford parser [2], created by Green and Manning, is a grammar-based analyzer. It uses probabilistic context-free grammars to perform syntactic analysis and was trained on the Arabic Treebank [2]. This parser does not include an integrated tokenization tool, so the input corpus must already be tokenized (clitic pronouns, prepositions, conjunctions, etc. segmented). However, it does not require the segmentation of the clitic determiner ال Al “the”. The Stanford parser [2] is available in open source.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Parsers for Dialectal Arabic</title>
        <p>Maamouri et al. [3] presented a syntactic analysis method that requires neither an annotated corpus for DA (except for development and testing) nor a parallel MSA/LEV (Levantine Arabic) corpus. On the other hand, it requires a lexicon linking DA lexemes to MSA lexemes and knowledge of the morphological and syntactic differences between MSA and the dialect.</p>
        <p>Three methods have been proposed for the syntactic analysis of DA [4]: sentence transduction, treebank transduction, and grammar transduction. The basic idea of the sentence transduction method is to translate the words of a DA sentence into one or more MSA words, kept in the form of a lattice. The best path in the lattice is passed to the MSA parser [5]. Finally, the terminal nodes of the resulting analysis structure are replaced with the original words of the LEV dialect. The second method is treebank transduction. Its basic idea is to convert the MSA treebank (ATB) into an approximate treebank for DA using linguistic knowledge of the systematic syntactic, lexical and morphological variations between the two varieties of Arabic. On this new treebank, the parser of [5] is trained and then evaluated on LEV.</p>
        <sec id="sec-2-2-1">
          <title>“Tunisian Treebank” Creation</title>
          <p>parser of [5] is learned and then evaluated on the LEV. Finally, grammatical
transduction method encompasses the other two methods [4]. It uses the synchronous grammar
mechanism to generate tree pairs linking the syntactic structures of the MSA and LEV
sentences. These synchronous grammars can be used to analyze new dialect phrases.
The evaluation of these three methods showed that the transduction of the grammar
gave the best performance. It reduced the error rate by 10.4% and 15.3% respectively,
with and without the use of grammatical category labels.
4</p>
          <p>This section details the steps of our method for creating a treebank for TA, the “Tunisian Treebank”. Figure 1 shows the steps of its creation.</p>
          <p>[Fig. 1. Steps of the “Tunisian Treebank” creation: the Tunisian Constitution goes through a preprocessing step (orthographic normalization, sentence boundary detection, word tokenization); the normalized corpus is parsed with the Arabic Stanford Parser; the resulting treebank is then corrected.]</p>
          <p>Tunisian Constitution. After the events of the Tunisian revolution, a version of the Tunisian constitution was elaborated in TA in order to make it more comprehensible. This corpus is one of the rare examples of an intellectualized dialect written directly in TA. It contains 12k words distributed among 492 sentences. Before starting the syntactic analysis of TA, the “Tunisian constitution” must go through a preprocessing step consisting of orthographic normalization, sentence boundary detection and word tokenization.</p>
          <p>Orthographic Normalization. The resource at our disposal does not follow any orthographic convention. Consequently, the same word is sometimes written in several ways in the corpus, which increases the ambiguity of our task. The constitution must therefore go through a normalization step to follow the orthographic convention of TA, “CODA-TUN” [6], which defines a single orthographic interpretation for each word. It follows the objectives and working principles of CODA [7]. It is thus an internally coherent convention for writing TA which uses the Arabic alphabet and aims to find an optimal balance between maintaining dialectal uniqueness and establishing conventions based on MSA-TA similarity. This normalization facilitates the adaptation of the MSA syntactic parser to TA: the number of modifications and treatments needed during adaptation is reduced because several characteristics (word segmentation rules, derivation, etc.) are shared.</p>
          <p>In this framework, we used the tool developed by [8] to perform the normalization automatically. This step allowed us to unify the orthographic interpretation of each word of the constitution. For example, following the TA spelling convention “CODA-TUN”, words such as br$p “many” and vmp “there are” are transcribed as برشة and ثمة.</p>
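          <p>As an illustration, the normalization step can be sketched as a simple variant-to-CODA-TUN lookup (a minimal sketch with a hypothetical variant table; the actual tool of [8] is rule-based and far more complete):</p>

```python
# Minimal sketch of dictionary-based orthographic normalization.
# The variant table below is hypothetical, for illustration only.
VARIANTS = {
    "برشه": "برشة",  # variant of br$p "many": non-CODA final letter
    "ثمه": "ثمة",    # variant of vmp "there are"
}

def normalize(sentence: str) -> str:
    """Map each token to its single CODA-TUN form when a variant is known."""
    return " ".join(VARIANTS.get(tok, tok) for tok in sentence.split())
```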
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Sentence Boundary Detection</title>
        <p>Sentences of the “Tunisian constitution” are not well segmented. Moreover, a two-page part called التوطئة “preface” contains no punctuation to delimit its sentences, which considerably complicates the syntactic analysis phase. We therefore corrected the segmentation of the sentences while preserving their meaning. To do this, two different experts participated in the manual correction of the segmentation.</p>
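        <p>A naive pre-segmentation pass along these lines could be sketched as follows (an illustrative heuristic, not the experts' procedure; it is of no help for the unpunctuated preface):</p>

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naively split on end punctuation (Latin and Arabic marks).
    The manual expert pass remains necessary, since parts of the corpus
    carry no punctuation at all."""
    parts = re.split(r"[.؟!]+", text)
    return [p.strip() for p in parts if p.strip()]
```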
        <p>At the beginning, the resource consisted of 492 sentences; after the experts’ segmentation correction, the number of sentences increased to 928. The maximum sentence length is 70 words and the minimum is 2 words.</p>
        <p>Words Tokenization. After the normalization of the “Tunisian constitution” according to the “CODA-TUN” convention and the segmentation of its sentences, we proceed to the word tokenization step. The Stanford Parser requires tokenized input, hence the importance of this step. Tokenization consists in defining the boundaries of words and the information about the tokens that compose them (stem and clitics) [9].</p>
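        <p>Clitic tokenization can be illustrated with a minimal sketch (the clitic lists here are hypothetical and incomplete; the determiner Al is deliberately left attached, since the Stanford parser does not require its segmentation):</p>

```python
PROCLITICS = ("و", "ب", "ل")   # w- "and", b- "with", l- "to" (illustrative)
ENCLITICS = ("ها", "هم", "ه")  # -hA, -hm, -h pronouns (illustrative)

def tokenize(word: str) -> list[str]:
    """Split one proclitic and one enclitic off a word, leaving the stem.
    Length guards avoid splitting words that merely start/end with a
    clitic-shaped letter."""
    tokens = []
    for p in PROCLITICS:
        if word.startswith(p) and len(word) > len(p) + 2:
            tokens.append(p + "+")
            word = word[len(p):]
            break
    suffix = None
    for e in ENCLITICS:
        if word.endswith(e) and len(word) > len(e) + 2:
            suffix = "+" + e
            word = word[: -len(e)]
            break
    tokens.append(word)
    if suffix:
        tokens.append(suffix)
    return tokens
```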
        <p>Syntactic parsing: Stanford Parser.</p>
        <p>In order to create a treebank from the “Tunisian constitution”, the syntactic analyzer “Stanford Parser” receives as input the sentences of the normalized constitution. The system, dedicated primarily to MSA, outputs a syntactic tree for each sentence, representing its structure. Nevertheless, this output is not always acceptable, given the errors it may contain. Although the differences between MSA and TA are limited, the output presents errors that we must correct.</p>
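        <p>In practice, this step amounts to running the parser's command-line interface on the tokenized corpus. A sketch of building the invocation (jar path, model path, memory setting and flags are assumptions to check against the documentation of the parser version used):</p>

```python
def parse_command(input_file: str,
                  jar: str = "stanford-parser.jar",
                  model: str = "edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz") -> list[str]:
    """Build the (assumed) command line for parsing an already-tokenized
    Arabic file into one bracketed Penn-style tree per sentence."""
    return [
        "java", "-mx2g", "-cp", jar,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-tokenized",              # input tokenized beforehand (Sect. 4.2)
        "-outputFormat", "penn",   # bracketed constituency trees
        model, input_file,
    ]
```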
        <p>Correction of Treebank.</p>
        <p>The last part of the creation of the “Tunisian Treebank” is dedicated to the correction of the parsing errors made by the “Stanford Parser” system. We mainly find two types of errors: those that arise from structures specific to TA and those involving words not recognized by the system. Tables 1 and 2 below illustrate some examples of errors committed by the parser, together with the reference annotation.</p>
        <p>In the first table, the negation structure of the sentence is not well represented in its syntactic tree. For this example, a parent label highlighting this structure should enclose the two negation particles as well as the verb, which is not the case here. On the contrary, we note that the verb and the second negation particle ش $ are joined with the remaining part of the sentence, represented in the first example of Table 1 by “…”. Moreover, the labels used are erroneous since, in this case, the negation particle ما mA is not an interrogative particle and the negation particle ش $ is not a noun.</p>
        <p>[Table 1. System vs. reference syntactic trees for the example ما ينجم ش mA ynjm $ “He cannot”.]</p>
        <sec id="sec-2-3-3">
          <title>Unrecognized Words</title>
          <p>In Table 2, we present the annotation of two words not recognized by the Stanford parser, together with the reference annotation.</p>
          <p>The “Stanford Parser” typically uses the NNP label to annotate proper nouns as well as unrecognized words. Hence, it has attributed this label to the two examples presented in Table 2.</p>
          <p>We computed statistics on the number of errors per sentence of the corpus; Table 3 presents them.</p>
          <p>The sentences of our corpus are classified according to their number of errors. We find that more than one third of the corpus sentences contain either no error or a single error, usually due to a word not recognized by the system. This favors the idea of adapting the Arabic version of the Stanford Parser to TA.</p>
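          <p>The classification by error count can be reproduced with a small helper (illustrative):</p>

```python
from collections import Counter

def error_profile(error_counts: list[int]) -> dict[str, float]:
    """Share of sentences having 0, 1, or more than 1 parser errors
    (the paper reports that over a third of the 928 sentences fall in
    the 0-1 range)."""
    buckets = Counter("0" if n == 0 else "1" if n == 1 else ">1"
                      for n in error_counts)
    total = len(error_counts)
    return {k: buckets[k] / total for k in ("0", "1", ">1")}
```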
          <p>To correct the generated treebank, we asked two experts to help us annotate TA-specific words and structures. These experts corrected the annotation to ensure the homogeneity of the treebank.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Adaptation of the system</title>
      <p>The syntactic differences between TA and MSA are very limited, which favors the adaptation of a system dedicated to MSA in order to generate the most appropriate model after a training phase. We used the corpus that we created for this training. This phase allows the system to generate a model able to assign each word its label in context and to define the best hierarchical structure to use. The generated model produced results that we then try to improve by tuning the training attributes.</p>
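      <p>Retraining can be launched through the parser's command-line interface. A hedged sketch of building the (assumed) invocation, to be checked against the documentation of the parser version used:</p>

```python
def train_command(treebank_dir: str,
                  out_model: str = "ta-model.ser.gz",
                  jar: str = "stanford-parser.jar") -> list[str]:
    """Build the (assumed) command line for retraining the parser on the
    “Tunisian Treebank”; paths and memory setting are placeholders."""
    return [
        "java", "-mx4g", "-cp", jar,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-train", treebank_dir,
        "-saveToSerializedFile", out_model,
    ]
```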
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In this section, we use cross-validation to estimate the reliability of our model. We opted for this validation method since the treebank we created contains only 928 syntactic trees, a number too small to split into one part for training and another for testing the model.</p>
      <p>Table 4 below shows the recall, precision and F-measure values, computed with 10-fold cross-validation.</p>
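      <p>The 10-fold protocol can be sketched as follows (illustrative index partitioning only; Evalb computes the actual bracketing scores):</p>

```python
def ten_fold_indices(n: int, k: int = 10):
    """Yield (train, test) index lists for k-fold cross-validation,
    as used to evaluate the model on the 928-tree treebank."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```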
      <p>Table 4 gives the results of the Evalb evaluation measure, a Java reimplementation reporting precision, recall and F-measure on the corpus data. The results are computed with the PCFG, a probabilistic context-free grammar in which probabilities are assigned to the rules so that the probabilities of all rules expanding the same non-terminal sum to one. This PCFG is incorporated into the Stanford Parser.</p>
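      <p>The PCFG constraint mentioned above can be checked with a few lines (illustrative):</p>

```python
from collections import defaultdict

def check_pcfg(rules: dict[tuple[str, tuple[str, ...]], float],
               tol: float = 1e-9) -> bool:
    """Verify the PCFG constraint: probabilities of all rules expanding
    the same non-terminal sum to one."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return all(abs(t - 1.0) < tol for t in totals.values())
```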
      <p>Subsequently, we tried to improve the F-measure by parameterizing the attributes, with the results detailed in Table 5. These results were obtained after a training phase on 75% of the treebank; the generated model was tested on the evaluation corpus, i.e. the remaining 25%. The partition of the treebank was not random: we took a quarter of each class of the classification by number of errors per sentence.</p>
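      <p>The described stratified 75/25 partition can be sketched as follows (illustrative; the paper's split was deterministic, whereas this sketch shuffles within each error-count class):</p>

```python
import random

def stratified_split(sentences, error_counts, test_frac=0.25, seed=0):
    """Take a quarter of each error-count class for testing, mirroring
    the 75/25 partition described above."""
    rng = random.Random(seed)
    by_class = {}
    for s, e in zip(sentences, error_counts):
        by_class.setdefault(e, []).append(s)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = max(1, round(len(group) * test_frac))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```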
      <p>No Normalization: controls the normalization of the syntax trees of our treebank. This option was added to standardize the Penn Arabic Treebank (ATB), which was annotated separately from the beginning; since our treebank is based on the annotation produced by the Stanford Parser system, it is better to disable the normalization.</p>
      <p>Use Unknown Word Signatures: uses suffix and capitalization information for unknown words. The values from 6 to 9 are the options dedicated to Arabic.</p>
      <p>Smart Mutation: promotes smarter smoothing for relatively rare words in the corpus.</p>
      <p>Table 6 shows the final recall, precision and F-measure values, computed with 10-fold cross-validation.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We presented a method for creating a treebank for intellectualized TA from the Tunisian constitution. We first preprocessed our corpus: we normalized it following the spelling convention “CODA-TUN”; the constitution then went through a sentence segmentation stage to correct the segmentation of this corpus, which presents very long sentences; we completed the preprocessing by tokenizing the constitution. Subsequently, the preprocessed corpus went through the Arabic version of the syntactic parser “Stanford Parser” to output the “Tunisian Treebank”, which we corrected. Since TA is a variant of standard Arabic, we adapted the syntactic parser to generate a model that we evaluated on our treebank. As a perspective, we plan to implement a tokenization tool for TA, which will facilitate the analysis of this dialect. Similarly, we propose to increase the size of the treebank in order to further improve the results obtained.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Petrov</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Barrett</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Thibaux</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Klein</surname>, <given-names>D.</given-names></string-name>: <article-title>Learning accurate, compact, and interpretable tree annotation</article-title>. <source>Proc. 21st Int. Conf. Comput. Linguist. 44th Annu. Meet. Assoc. Comput. Linguist.</source> pp. <fpage>433</fpage>-<lpage>440</lpage> (<year>2006</year>).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Green</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Manning</surname>, <given-names>C.D.</given-names></string-name>: <article-title>Better Arabic parsing: baselines, evaluations, and analysis</article-title>. <source>COLING '10 Proc. 23rd Int. Conf. Comput. Linguist.</source> pp. <fpage>394</fpage>-<lpage>402</lpage> (<year>2010</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Maamouri</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bies</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Buckwalter</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Mekki</surname>, <given-names>W.</given-names></string-name>: <article-title>The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus</article-title>. <source>NEMLAR Conf. on Arabic Language Resources and Tools</source>. pp. <fpage>102</fpage>-<lpage>109</lpage> (<year>2004</year>).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Rambow</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Chiang</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Diab</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Habash</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Hwa</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Sima'an</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Lacey</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Levy</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Nichols</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Shareef</surname>, <given-names>S.</given-names></string-name>: <article-title>Parsing Arabic Dialects</article-title>. (<year>2006</year>).
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Bikel</surname>, <given-names>D.M.</given-names></string-name>: <article-title>Design of a multi-lingual, parallel-processing statistical parsing engine</article-title>. <source>In: Proceedings of the second international conference on Human Language Technology Research</source>. pp. <fpage>178</fpage>-<lpage>182</lpage> (<year>2002</year>).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Zribi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boujelbane</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masmoudi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ellouze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belguith</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Habash</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>A Conventional Orthography for Tunisian Arabic</article-title>
          .
          <source>In: The Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          . pp.
          <fpage>2355</fpage>
          -
          <lpage>2361</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Habash</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Diab</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Rambow</surname>, <given-names>O.</given-names></string-name>: <article-title>Conventional Orthography for Dialectal Arabic</article-title>. <source>In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)</source>. pp. <fpage>711</fpage>-<lpage>718</lpage>. European Language Resources Association (ELRA), Istanbul, Turkey (<year>2012</year>).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Boujelbane</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Zribi</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Kharroubi</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ellouze</surname>, <given-names>M.</given-names></string-name>: <article-title>An Automatic Process for Tunisian Arabic Orthography Normalization</article-title>. <source>In: Lecture Notes in Computer Science (LNCS), HrTAL 2016</source>. Dubrovnik, Croatia (<year>2016</year>).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Attia</surname>, <given-names>M.</given-names></string-name>: <article-title>Arabic Tokenization System</article-title>. <source>Proc. 5th Work. Important Unresolved Matters</source>. pp. <fpage>65</fpage>-<lpage>72</lpage> (<year>2009</year>).
        </mixed-citation>
      </ref>
      </ref>
    </ref-list>
  </back>
</article>