<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Silver Standard Arabic Corpus for Segmentation and Validation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hussein Awdeh</string-name>
          <email>hussein.awdi85@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adelle Abdallah</string-name>
          <email>adelle.abdallah@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilles Bernard</string-name>
          <email>gilles.bernard@iedparis8.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mazen El Sayed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Hajjar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Technology, Lebanese University</institution>
          <addr-line>Hisbeh Street, Saida</addr-line>
          <country country="LB">Lebanon</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIASD, University of Paris 8</institution>
          <addr-line>2 rue de la Liberté, 93526 Saint-Denis cedex</addr-line>
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>83</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>Arabic Natural Language Processing applications suffer from a shortage of both Arabic corpora and gold standard corpora. A corpus is a collection of written or spoken texts stored on a computer, written either in a single language (monolingual corpus) or in several languages (multilingual corpus). A corpus is one of the most important sources for semantic and syntactic analysis in the domain of natural language processing.</p>
      </abstract>
      <kwd-group>
        <kwd>Arabic Language</kwd>
        <kwd>Arabic Natural Language Processing</kwd>
        <kwd>Validation</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Silver standard corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Arabic has been one of the six official languages of the United
Nations since 1973 and is among the world's six major languages.
It is the language of the Holy Quran and is spoken by more than
273 million people around the world. Arabic has several varieties:
Classical Arabic (the language of the Quran), Modern Standard Arabic
(used in newspapers, books...) and the local dialects (which vary
considerably among countries).</p>
      <p>Despite all the above and the increasing need for Arabic
corpora, the available corpora remain insufficient to support
research in Arabic linguistics. For instance, the majority of
Arabic corpora are limited in sources, types and genres, or are
not freely available.</p>
      <p>Though Arabic is a major world language, it is still
underrepresented in corpus linguistics. To address this lack, one of
the aims of this paper is to build a new free multipurpose Arabic
corpus, which we call the “Silver Multipurpose Arabic Corpus”: a
large corpus collected over many years from multiple sources and
covering all types and genres. It will be freely available to
researchers working on Arabic Natural Language Processing
techniques, mainly for validation and evaluation of unsupervised
methods and for training of supervised methods.</p>
    </sec>
    <sec id="sec-2">
      <title>II. RELATED WORKS</title>
      <p>This part of the paper reviews the existing Arabic corpora,
going through the different types of textual Arabic corpus:</p>
      <list list-type="bullet">
        <list-item><p>Raw text corpora: plain text with no additional information, written in one language (monolingual corpus) or in multiple languages (multilingual corpus).</p></list-item>
        <list-item><p>Annotated corpora: text tagged with linguistic information such as named entities, POS tags, and semantic and syntactic information.</p></list-item>
        <list-item><p>Lexicon: word lists and lexical databases.</p></list-item>
        <list-item><p>Miscellaneous corpora: multipurpose corpora (Q/A, summaries…).</p></list-item>
      </list>
      <p>Our study is limited to textual monolingual corpora,
divided into freely and commercially available corpora.</p>
      <sec id="sec-3-1">
        <title>A. Freely available Arabic corpora</title>
        <p>In this section we cite 14 freely available raw text,
annotated and miscellaneous corpora. They cover mostly the
news domain. We list them according to their size.</p>
        <list list-type="order">
          <list-item><p>Adjir Corpora [<xref ref-type="bibr" rid="ref1">1</xref>]: compiled by Abdelali from Arabic daily newspapers published between 2004 and 2005. It is distributed as multiple text files, each compiled and cleaned, and contains about 113 million words in total.</p></list-item>
          <list-item><p>King Saud University Corpus of Classical Arabic (KSUCCA) [<xref ref-type="bibr" rid="ref2">2</xref>]: compiled by Alrabiah from classical Arabic texts dating from the seventh to the eleventh century and covering genres such as science, religion, literature, sociology, biography and linguistics. Each genre is split into sub-genres, and the corpus contains about fifty million words in total.</p></list-item>
          <list-item><p>Tashkeela Corpus [<xref ref-type="bibr" rid="ref3">3</xref>]: compiled by Zerrouki from freely available texts, largely classical Islamic books (the larger part collected from the Shamela Library), together with Modern Standard Arabic texts crawled from the Internet. The books were retyped and vocalized manually by volunteers to ensure that the words are vocalized. It contains seventy-five million fully vocalized words.</p></list-item>
          <list-item><p>Open Source Arabic Corpora (OSAC) [<xref ref-type="bibr" rid="ref4">4</xref>]: compiled by Moataz Saad from multiple websites such as BBC Arabic, CNN Arabic and several other sources. It includes 22,429 text documents divided into ten categories (Economics, History, Entertainments, Education and Family, Religion, Sports, Health, Astronomy, Law, Stories and cooking recipes), encoded in UTF-8. It contains about twenty-two million words.</p></list-item>
          <list-item><p>Al Watan Corpus [<xref ref-type="bibr" rid="ref5">5</xref>]: compiled by M. Abbas from articles of the Al Watan newspaper in Oman. It contains about 20,000 articles on six topics (categories): Culture, Religion, Economy, Local News, International News and Sports. It contains about 10 million words.</p></list-item>
          <list-item><p>Al Khaleej Corpus [<xref ref-type="bibr" rid="ref6">6</xref>]: compiled by M. Abbas from thousands of articles downloaded from the online newspaper “Akhbar El Khaleej”. This corpus contains more than five hundred articles, which correspond to nearly 3 million words, on international and local news, economy and sports.</p></list-item>
          <list-item><p>King Abdulaziz City for Science and Technology Arabic Corpus (KACST) [<xref ref-type="bibr" rid="ref7">7</xref>]: compiled by Al-Thubaity et al. from a variety of publishing media, such as manuscripts, newspapers, books, magazines and scientific periodicals. The corpus contains seven hundred million words, from the pre-Islamic era to the modern era. It comprises about 869,800 files with 732,780,509 words, of which 7,464,396 are unique.</p></list-item>
          <list-item><p>Kalimat Corpus [<xref ref-type="bibr" rid="ref8">8</xref>]: a miscellaneous corpus compiled by El-Haj and Koulali from the Arabic newspaper Alwatan. It is summarized into 2,057 multi-document system summaries, NER annotated, POS tagged and fully morphologically analyzed. It contains about 20,291 articles with 18,167,183 words, which fall into six categories (culture, economy, international news, local news, religion and sports).</p></list-item>
          <list-item><p>SACS Corpus (Saudi Arabian National Computer Science Conference) [<xref ref-type="bibr" rid="ref9">9</xref>]: created by Abu Salem from the proceedings of the Saudi Arabian National Computer Science Conference. It contains 46,968 words tagged with title, authors, sources and abstract.</p></list-item>
          <list-item><p>Al-Raya Corpus [<xref ref-type="bibr" rid="ref10">10</xref>]: compiled by Hasnah from articles of the Al-Raya newspaper. It contains 187 articles and 219,978 words, of which 30,096 are unique.</p></list-item>
          <list-item><p>Contemporary Arabic Corpus (CAC) [<xref ref-type="bibr" rid="ref11">11</xref>]: compiled by Al-Sulaiti and Atwell from newspapers, emails and websites from 1990 to 2004. It contains 842,684 words and is balanced in topics. It is tagged in XML.</p></list-item>
          <list-item><p>The International Corpus of Arabic (ICA) [<xref ref-type="bibr" rid="ref12">12</xref>]: compiled by Alansary, Nagi and Adly from articles on Arabic newspaper sites, blogs and forums, electronic books, and academic research papers and dissertations. It contains 70,022 articles with 80 million words, of which about 1,272,766 are unique. It includes texts in eleven categories such as strategic, national and social sciences, sports, religion, literature, bibliography and others.</p></list-item>
          <list-item><p>Arabic Modern Standard Corpus [<xref ref-type="bibr" rid="ref13">13</xref>]: compiled by Abdelali from newspaper articles from different Arabic countries. It contains 102,134 articles with about 113 million words.</p></list-item>
          <list-item><p>University of Jordan Arabic Corpus (UJAC) [<xref ref-type="bibr" rid="ref14">14</xref>]: compiled by researchers from the University of Jordan from 15 Arabic newspapers and other resources from 19 Arabic countries. It contains 61,037 articles with 7,522,941 words over 70, 7385 unique words. It is tagged in XML.</p></list-item>
        </list>
      </sec>
      <sec id="sec-3-2">
        <title>B. Commercially available Arabic corpora</title>
        <p>In this section we cite 5 commercially available corpora.
They are monolingual raw text and annotated corpora covering
the news domain. We list them according to their size.</p>
        <list list-type="order">
          <list-item><p>LDC Corpus (Arabic Newswire) [<xref ref-type="bibr" rid="ref15">15</xref>]: compiled by Graff and Walker at the University of Pennsylvania’s LDC from articles of the Agence France-Presse newswire published between 1994 and 2000. It contains 383,872 documents with 76 million tokens over 666,094 unique words.</p></list-item>
          <list-item><p>An-Nahar Newspaper Text Corpus [<xref ref-type="bibr" rid="ref16">16</xref>]: collected from the An-Nahar newspaper from 1995 to 2000 and stored as HyperText Markup Language (HTML) files. It contains 45,000 articles and 24 million words. Each article includes information such as title, type, date, country and page.</p></list-item>
          <list-item><p>Al-Hayat Arabic Corpus [<xref ref-type="bibr" rid="ref17">17</xref>]: collected from the Al-Hayat Arabic newspaper and organized into seven domains in accordance with Al-Hayat’s subject tags: general, car, computer, news, economics, science and sport. The corpus is cleaned by removing mark-up, numbers, special characters and punctuation. It contains about 42,591 articles with 18,639,264 unique words.</p></list-item>
          <list-item><p>Nemlar Corpus [<xref ref-type="bibr" rid="ref18">18</xref>]: produced by the Nemlar project. It contains about 500,000 words collected from 13 different categories such as political news and debate, Islamic text, phrases of common words, broadcast news, business, Arabic literature, general news, interviews, scientific press, sports press, dictionary entry explanations and legal domain text. It is provided in four versions: raw, fully vowelized, with Arabic lexical analysis, and with Arabic POS tags.</p></list-item>
          <list-item><p>Arabic Gigaword Corpus [<xref ref-type="bibr" rid="ref19">19</xref>]: compiled by Graff from four distinct Arabic newswires (Agence France-Presse, Al-Hayat, An-Nahar and Xinhua News Agency). It contains 1,256,719 articles and 391,619 Kwords. The corpus is encoded in UTF-8 and written in SGML.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-9">
      <p>[Table: summary of the cited Arabic corpora (reference, corpus name, size in words, text domains, and availability: F = free, C = commercial). It covers 14 free and 6 commercial corpora, including the Arabic Gigaword Corpus fourth edition and fifth edition (Parker et al.), the latter with 1,077,382,000 words.]</p>
      <p>Despite the Arabic corpora cited above, there is a serious
lack of high-quality Arabic corpora compared with other languages
such as English.</p>
      <p>The existing corpora still have many limitations. Few
corpora are freely available, and freely available tagged corpora
are rarer still. The tagged corpora that exist do not fit our need
for word segmentation evaluation. Moreover, existing corpora are
small in size and limited in source types and genres, and most of
these efforts do not meet the increasing need for a reliable,
updateable and well-structured corpus.</p>
      <p>To overcome these problems, it was necessary to build a
new free, tagged and reliable corpus, named SAC (Silver
Arabic Corpus), for Arabic word segmentation evaluation.</p>
      <p>
        Our goal is to build a gold Arabic corpus that serves as a
language resource for NLP researchers evaluating syntactic
analysis. We began by building our silver Arabic corpus, called
“SAC”. It consists of a collection of texts annotated and enriched
with linguistic information. In our work we used the Arabic
Stanford POS tagger [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], our Arabic grammar-rule-based tool for word
segmentation, and our XML annotator.
      </p>
      <sec id="sec-9-1">
        <title>A. SAC Building steps</title>
        <sec id="sec-9-1-1">
          <title>1- Collecting web documents</title>
          <p>The first version of SAC consists of Arabic articles
collected from the Alwatan newspaper (based on the Kalimat
corpus), corrected and cleaned manually. It covers six topics:
culture, economy, religion, sports, local news and international
news.</p>
          <p>We use the Kalimat corpus as a reference for several
reasons. Like our corpus, it is multipurpose, tagged, compiled
and freely available. However, it was developed for specific
targets (multi-document system summaries, NER annotation and
POS tagging), and its structure differs from the one we need for
word segmentation. In addition, we noticed that some words were
not compiled or tagged. The table below shows the statistics.</p>
        </sec>
        <sec id="sec-9-1-2">
          <title>2- Creation process steps</title>
          <p>The process of creating SAC was applied to the entire
data.</p>
          <p>
            Firstly, we need to know the POS of each word, so we
apply the Stanford POS tagger to the document collection. The
Stanford POS Tagger is a Java implementation of a log-linear
part-of-speech tagger. It is a supervised system that relies on
trained models for many languages, including Arabic. The Arabic
model is based on maximum entropy and was trained on the
Arabic Treebank (parts 1-3). The tagger identifies 33 parts of
speech, such as Noun (NN), Plural Noun (NNS), Proper Noun (NNP),
Verb (VB) and Adjective (JJ), and reached an accuracy of 96.50%
[
            <xref ref-type="bibr" rid="ref29">29</xref>
            ].
          </p>
          <p>Secondly, we clean the POS-tagged documents by
removing repeated words.</p>
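          <p>The cleaning step can be illustrated with a minimal Python sketch. It assumes that the tagger output is plain text made of whitespace-separated word/TAG tokens and that removing repeated words means keeping only the first occurrence of each tagged word. Both points are assumptions made for illustration, and the function name unique_tagged_words is hypothetical; it is not part of our Java tool.</p>
          <preformat><![CDATA[
```python
def unique_tagged_words(tagged_text, sep="/"):
    """Return (word, tag) pairs in first-seen order, without repeats.

    Assumes tokens look like 'word/TAG'; the separator is an assumption.
    """
    seen = set()
    result = []
    for token in tagged_text.split():
        # Split on the last separator so that words containing it survive.
        word, _, tag = token.rpartition(sep)
        if not word:
            # Token without a separator: skip it rather than guess a tag.
            continue
        if (word, tag) not in seen:
            seen.add((word, tag))
            result.append((word, tag))
    return result


pairs = unique_tagged_words("ذهب/VBD الولد/DTNN ذهب/VBD")
```
]]></preformat>
          <p>In this sketch the repeated token is dropped and the remaining pairs keep their original order, so later steps can still process the words in reading order.</p>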
          <p>
            Finally, we apply our rule system to the POS-tagged
corpus to perform word segmentation, using our Java tool and a
set of grammar rules [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ].
          </p>
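          <p>As an illustration only, the following Python sketch applies a small subset of affix rules of the kind used in this paper to a POS-tagged word. The tag names (NN, VBD, VBP), the rule table, the minimum stem length and the function name segment are assumptions made for the example; it is a naive approximation, not the rule set implemented in our Java tool.</p>
          <preformat><![CDATA[
```python
# Hypothetical, simplified rule table keyed by an assumed POS tag.
RULES = {
    "NN": {"prefixes": ("ب", "ل"), "suffixes": ("ان", "ون", "ات")},
    "VBD": {"prefixes": (), "suffixes": ("وا", "ت", "ن")},
    "VBP": {"prefixes": ("ي", "ت"), "suffixes": ("ان", "ون")},
}

# Assumed guard: never strip an affix if fewer than 3 letters would remain.
MIN_STEM = 3


def segment(word, pos):
    """Split a word into (prefixes, stem, suffixes) using the table above."""
    prefixes, suffixes, stem = [], [], word
    rules = RULES.get(pos)
    if rules is None:
        # Unknown tag: leave the word unsegmented.
        return prefixes, stem, suffixes
    for p in rules["prefixes"]:
        if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
            prefixes.append(p)
            stem = stem[len(p):]
            break  # at most one prefix in this sketch
    for s in rules["suffixes"]:
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffixes.append(s)
            stem = stem[:-len(s)]
            break  # at most one suffix in this sketch
    return prefixes, stem, suffixes


result = segment("يذهبون", "VBP")
```
]]></preformat>
          <p>Here result holds the prefix ي, the stem ذهب and the suffix ون. Because the sketch strips affixes by string matching alone, it can over-segment words whose stem happens to begin or end with one of these letters; the minimum stem length is only a crude safeguard.</p>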
          <p>
            We classify the word segmentation rule system, based on
[
            <xref ref-type="bibr" rid="ref27">27</xref>
            ][
            <xref ref-type="bibr" rid="ref31">31</xref>
            ][
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], into three main classes:
          </p>
          <list list-type="bullet">
            <list-item><p>Rules based on punctuation marks.</p></list-item>
            <list-item><p>Rules based on coordinating conjunctions.</p></list-item>
            <list-item><p>Rules based on certain connector words such as (لكن, اذا).</p></list-item>
          </list>
        </sec>
      </sec>
      <sec id="sec-10">
        <title>B. Arabic word segmentation rules [<xref ref-type="bibr" rid="ref27">27</xref>][<xref ref-type="bibr" rid="ref31">31</xref>][<xref ref-type="bibr" rid="ref3">3</xref>]</title>
        <list list-type="order">
          <list-item>
            <p>Prepositions prefixed to the word, such as (ب, ل), are separated from the word during segmentation, as in the example:</p>
            <list list-type="simple">
              <list-item><p>بكتاب = ب كتاب</p></list-item>
            </list>
          </list-item>
          <list-item>
            <p>The definite Lam is separated from the word during segmentation, as in the following example:</p>
            <list list-type="simple">
              <list-item><p>لمعرٌف = ل معرٌف</p></list-item>
            </list>
          </list-item>
          <list-item>
            <p>Attached pronouns are separated from the word during segmentation, as in the following example:</p>
            <list list-type="simple">
              <list-item><p>كتابكما = كتاب كما</p></list-item>
            </list>
          </list-item>
          <list-item>
            <p>For nouns, the dual Alif-Nun (ان المثنى), the plural Waw-Nun (ون الجماعة) and At (ات), but not Ta' al-Ta'nith (ة), are separated from the word during segmentation, as in the following examples:</p>
            <list list-type="simple">
              <list-item><p>معلمة = معلمة</p></list-item>
              <list-item><p>معلمان = معلم ان</p></list-item>
              <list-item><p>معلمتان = معلمة ان</p></list-item>
              <list-item><p>معلمون = معلم ون</p></list-item>
              <list-item><p>معلمات = معلم ات</p></list-item>
            </list>
          </list-item>
          <list-item>
            <p>For verbs conjugated in the past, Ta' al-Ta'nith (تاء التأنيث), Waw al-Jama'a (واو الجماعة) and Nun al-Niswa (نون النسوة) are separated from the word during segmentation, as in the following examples:</p>
            <list list-type="simple">
              <list-item><p>لعبت = لعب ت</p></list-item>
              <list-item><p>ذهبوا = ذهب وا</p></list-item>
              <list-item><p>ذهبن = ذهب ن</p></list-item>
            </list>
          </list-item>
          <list-item>
            <p>For verbs conjugated in the present, Ya (ي), Ta (ت), the dual Alif-Nun (ان المثنى) and the plural Waw-Nun (ون الجماعة) are separated from the word during segmentation, as in the following examples:</p>
            <list list-type="simple">
              <list-item><p>يلعب = ي لعب</p></list-item>
              <list-item><p>تلعب = ت لعب</p></list-item>
              <list-item><p>يذهبان = ي ذهب ان</p></list-item>
              <list-item><p>يذهبون = ي ذهب ون</p></list-item>
            </list>
          </list-item>
        </list>
        <p>Finally, we clean the POS-tagged document by removing
the tag data and the repeated words, then we apply our XML
annotator.</p>
      </sec>
      <sec id="sec-11-1">
        <title>C. XML annotation</title>
        <p>We need the data in a specific format, so the last task
is to add the linguistic information to the compiled corpus. We
developed our own Java-based tool, CACXml, which converts the
corpus into an XML-tagged corpus.</p>
        <p>As a word consists of a sequence of morphemes in the
pattern prefix*-stem-suffix*, we designed our XML structure
to match this sequence of morphemes:</p>
        <preformat>&lt;Seg&gt;
  &lt;Word&gt;w&lt;/Word&gt;
  &lt;Prefixes&gt;
    &lt;Prefix&gt;p1&lt;/Prefix&gt;
    &lt;Prefix&gt;p2&lt;/Prefix&gt;
    ...
  &lt;/Prefixes&gt;
  &lt;Stem&gt;stm&lt;/Stem&gt;
  &lt;Suffixes&gt;
    &lt;Suffix&gt;s1&lt;/Suffix&gt;
    &lt;Suffix&gt;s2&lt;/Suffix&gt;
    ...
  &lt;/Suffixes&gt;
&lt;/Seg&gt;</preformat>
        <p>The developed tagged corpus contains four fields: the
field for the word by itself and three fields for the
morphemes of the words: prefix, stem and suffix.</p>
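        <p>A minimal Python sketch of how one record in this structure could be produced with the standard xml.etree.ElementTree module is given here for illustration. The function name build_seg is hypothetical, and the sketch is not the CACXml tool itself.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET


def build_seg(word, prefixes, stem, suffixes):
    """Build one Seg element: the word, its prefixes, stem and suffixes."""
    seg = ET.Element("Seg")
    ET.SubElement(seg, "Word").text = word
    prefixes_el = ET.SubElement(seg, "Prefixes")
    for prefix in prefixes:
        ET.SubElement(prefixes_el, "Prefix").text = prefix
    ET.SubElement(seg, "Stem").text = stem
    suffixes_el = ET.SubElement(seg, "Suffixes")
    for suffix in suffixes:
        ET.SubElement(suffixes_el, "Suffix").text = suffix
    return seg


seg = build_seg("يذهبون", ["ي"], "ذهب", ["ون"])
# Serialize to a Unicode string so the Arabic text stays readable.
xml_text = ET.tostring(seg, encoding="unicode")
```
]]></preformat>
        <p>Keeping the Prefixes and Suffixes containers even when they are empty gives every record the same four fields, which simplifies reading the corpus back.</p>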
      </sec>
    </sec>
    <sec id="sec-12">
      <title>IV. ANALYSIS AND COMPARISON</title>
      <p>Building a corpus is not limited to collecting data; it
also involves compiling and preparing the data for use in
different natural language processing applications. In the last
decade, Arabic has attracted growing interest from researchers,
and several universities and organizations have built Arabic
corpora, taking multiple factors into account.</p>
      <p>The aim of this paper was a standard Arabic corpus. This
corpus must be large, representative (covering different text
genres), and have a specific new structure usable by Arabic NLP
applications.</p>
      <p>The first factor to consider in building a corpus is its
size. A multipurpose corpus should preferably be large, so that
it covers a wide range of linguistic features. For the same
topics, corpora in other languages are still bigger than Arabic
corpora. The largest freely available corpus is the KACST Corpus
(Al-Thubaity, 2014), with 700 million words and about 1.5 million
articles extracted from ten different sources, covering the
pre-Islamic era to the present day (a period of more than 1,500
years). It contains both classical and modern Arabic. The largest
commercial corpus is the Arabic Gigaword corpus, created by the
LDC and covering a period of ten years. It has 3.3 million
articles and 1.077 billion words. Our corpus has 18,167,183 words
in 20,291 articles.</p>
      <p>The second factor is the range of topics included in the
corpus. Most of the cited corpora cover multiple topics, as does
our corpus, which covers the main topics and is therefore
representative. It is a multi-topic corpus covering categories
such as culture, economy, religion, sports, local news and
international news.</p>
      <p>The third factor is the price of the corpus. Commercial
corpora are difficult for Arabic linguistic researchers to
obtain, so our aim was to build a freely available corpus.</p>
      <p>The last factor is the structure of the corpus. Arabic has
a complex morphology, which is why we need a well-structured
corpus. Most of the available corpora are not tagged, and the
tagged corpora do not meet our needs, so we defined our own
structure for the segmentation of Arabic words. The
representation of SAC in XML format makes it easy to exploit by
any syntactic analysis program.</p>
    </sec>
    <sec id="sec-13">
      <title>V. CONCLUSION</title>
      <p>To address the lack of Arabic tagged corpora, this work
presents the building process of a Silver Standard Corpus (SAC).
We provide SAC for free, including the articles, annotated text,
entities and summaries, to help Arabic NLP researchers evaluate
and validate their unsupervised learning tools and train their
supervised learning tools in the syntactic domain.
The corpus can be downloaded directly from:
https://gitlab.com/Data-Liasd-papers/silver-arabic-corpus.
SAC can be used by researchers as a standard corpus or as a
baseline to test and evaluate their Arabic tools. In addition, we
welcome any amendments to the corpus by other researchers.</p>
      <p>In future work, we expect to enrich SAC with more
syntactic information about words and to use an XML tagger to
build our Gold Arabic Corpus (GAC).</p>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work has been done as a part of the project "Analyse
sémantique de textes arabes utilisant l’ontologie et WordNet"
supported by the Lebanese University.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          , “Adjir Corpora,” http://aracorpus.e3rab.com/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alrabiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salman</surname>
          </string-name>
          and E. Atwell, “King Saud University Corpus of Classical Arabic KSUCCA,”
          <source>In Proceedings of WACL'2 Second Workshop on Arabic Corpus Linguistics</source>
          , Lancaster University, UK. The University of Leeds ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zerrouki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balla</surname>
          </string-name>
          , “Tashkeela:
          <article-title>Novel corpus of Arabic vocalized texts, data for auto-diacritization systems</article-title>
          ,” The National Computer Science Engineering School (ESI), Algiers, Algeria,
          <year>January 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saad</surname>
          </string-name>
          , W. Ashour, “Open Source Arabic Corpora OSAC,” 6th ArchEng International Symposiums,
          <source>6th International Conference on Electrical and Computer Systems (EECS'10)</source>
          ,
          <source>Nov 25-26</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Smaili</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Berkani</surname>
          </string-name>
          , “Al Watan Corpus,”
          <article-title>Evaluation of Topic Identification Methods on Arabic Corpora</article-title>
          .
          <source>JDIM</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>185</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Smaili</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Berkani</surname>
          </string-name>
          , “Al Khaleej Corpus,”
          <article-title>Comparison of topic identification methods for Arabic language</article-title>
          .
          <source>Paper presented at the Proceedings of International Conference on Recent Advances in Natural Language Processing</source>
          ,
          <source>(RANLP.)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Thubaity</surname>
          </string-name>
          , “
          <article-title>King Abdulaziz City for Science and Technology Arabic Corpus KACST,”</article-title>
          <source>In Journal Language Resources and Evaluation</source>
          , vol.
          <volume>49</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>721</fpage>
          -
          <lpage>751</lpage>
          , Springer-Verlag New York,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>El Haj</surname>
          </string-name>
          , R. Koulali, “
          <article-title>Kalimat a Multipurpose Arabic Corpus</article-title>
          ,” at the Second Workshop on Arabic Corpus
          <source>Linguistics (WACL-2)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abu Salem</surname>
          </string-name>
          , “SACS Corpus,” Saudi Arabian National Computer Science Conference.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hasnah</surname>
          </string-name>
          , “Al-Raya Corpus,”
          <article-title>Full Text Processing and Retrieval: Weight Ranking, Text Structuring, and Passage Retrieval for Arabic Documents</article-title>
          .
          <source>Ph.D. Dissertation</source>
          , Illinois Institute of Technology,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Al-Sulaiti</surname>
          </string-name>
          , E. Atwell, “Contemporary Arabic Corpus CAC,” In
          <source>the Proceedings of the CL, Corpus Linguistics Conference</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alansary</surname>
          </string-name>
          , M. Nagi, “The International Corpus of Arabic ICA,”
          <article-title>The International Corpus of Arabic: Compilation, Analysis and Evaluation</article-title>
          . ANLP,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cowie</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Soliman</surname>
          </string-name>
          , “
          <article-title>Arabic Modern Standard Corpus,” In the workshop on computational modeling of lexical acquisition, the split meeting</article-title>
          . Croatia,
          <year>July 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Al-Shargi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yagi</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Obeid</surname>
          </string-name>
          , “University of Jordan Arabic Corpus UJAC,” In the Second Workshop on Arabic Corpus
          <source>Linguistics (WACL-2)</source>
          , UK,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Graff</surname>
          </string-name>
          , K. Walker, “LDC Corpus,”
          <article-title>Arabic newswire part 1. Linguistic Data Consortium, Philadelphia. LDC catalog number LDC2001T55, from</article-title>
          : https://catalog.ldc.upenn.
          <source>edu/LDC2001T55</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] ELRA, “
          <article-title>An-Nahar Newspaper Text Corpus,” European Language Resources Association, ELRA Catalog number ELRA-W0027, from</article-title>
          : http://catalog.elra.info/product_info.
          <source>php?products_id=767</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] University of Essex, “Al-Hayat Arabic Corpus,”
          <article-title>European Language Resources Association, ELRA Catalog number ELRA-W0030, from</article-title>
          : http://catalog.elra.info/product_info.
          <source>php?products_id=632</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] ALP team, “Nemlar Corpus,”
          <article-title>European Language Resources Association, ELRA Catalog number ELRA-W0042, retrieved on</article-title>
          , from: http://catalog.elra.info/product_info.
          <source>php?products_id=873</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Graff</surname>
          </string-name>
          , “Arabic Gigaword,”
          <article-title>Linguistic Data Consortium, Philadelphia, LDC catalog number LDC2003T12</article-title>
          , retrieved on: 10/25/2015, from: https://catalog.ldc.upenn.
          <source>edu/LDC2003T12</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Graff</surname>
          </string-name>
          , “Arabic Gigaword Third Edition,”
          <article-title>Linguistic Data Consortium, Philadelphia, LDC catalog number LDC2007T40, from</article-title>
          : https://catalog.ldc.upenn.
          <source>edu/LDC2007T40</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Graff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Maeda</surname>
          </string-name>
          , “Arabic Gigaword Second Edition,”
          <article-title>Linguistic Data Consortium, Philadelphia. LDC catalog number LDC2006T02</article-title>
          , retrieved on: 10/25/2015, from: https://catalog.ldc.upenn.
          <source>edu/LDC2006T02</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          , “
          <article-title>Critical Survey of the Freely Available Arabic Corpora</article-title>
          ,” In Carnegie Mellon University Qatar Computer Science, Workshop on Free/Open-Source
          <source>Arabic Corpora and Corpora Processing Tools Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Namly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tajmout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bouzoubaa</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Abouenour</surname>
          </string-name>
          , “
          <article-title>A Gold Standard Corpus for Arabic Stemmers Evaluation,” 28th IBIMA Conference</article-title>
          , Seville, Spain,
          <year>Nov 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cowie</surname>
          </string-name>
          , H. Soliman, “
          <article-title>Building a modern standard Arabic corpus</article-title>
          ,” Workshop on Computational Modeling of Lexical Acquisition.
          <source>The Split Meeting, Croatia, 25th to 28th of July</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mostefa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Abualasal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gzawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Asbayou</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Abbes</surname>
          </string-name>
          , “
          <article-title>A Hybrid Arabic Error Correction System,”</article-title>
          <source>The Second Workshop on Arabic Natural Language Processing, Association for Computational Linguistics</source>
          , Beijing, China,
          <year>July 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Zeroual</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lakhouaja</surname>
          </string-name>
          , “
          <article-title>Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go</article-title>
          ,” In K. Shaalan, A. Hassanien, F. Tolba (eds.),
          <source>Intelligent Natural Language Processing: Trends and Applications</source>
          .
          <source>Studies in Computational Intelligence</source>
          , vol
          <volume>740</volume>
          . Springer, Cham,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>H.</given-names>
            <surname>Awdeh</surname>
          </string-name>
          , A. Abdallah, “Guide de segmentation des mots Arabes,” https://gitlab.com/Data-Liasd-papers/guide-de-segmentation,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          and C. Manning, “
          <article-title>Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger</article-title>
          ,”
          <source>In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000)</source>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          , C. Manning and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          , “
          <article-title>Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network</article-title>
          ,”
          <source>In Proceedings of HLT-NAACL</source>
          <year>2003</year>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>259</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mansour</surname>
          </string-name>
          , “
          <article-title>The Absence of Arabic Corpus Linguistics: A Call for Creating an Arabic National Corpus”</article-title>
          ,
          <source>International Journal of Humanities and Social Science</source>
          ,
          <year>June 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Altantawy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Rambow</surname>
          </string-name>
          , I. Saleh, “
          <article-title>Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach Handbook of Natural Language Processing” Second Edition, Center for Computational Learning Systems</article-title>
          , Columbia University, New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Talha</surname>
          </string-name>
          ibnu William, “Les règles de grammaire”, du premier Livre de Médine: Prentice Hall,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>