Exploration of Approaches to Arabic Named Entity Recognition

Husamelddin A.M.N Balla and Sarah Jane Delany

Technological University Dublin, School of Computer Science, Dublin, Ireland
http://www.tudublin.ie
{husamelddin.balla,sarahjane.delany}@tudublin.ie

Abstract. The Named Entity Recognition (NER) task has attracted significant attention in Natural Language Processing (NLP) as it can enhance the performance of many NLP applications. In this paper, we compare English NER with Arabic NER in an experimental way to investigate the impact of using different classifiers and sets of features, including language-independent and language-specific features. We explore the features and classifiers on five different datasets. We compare deep neural network architectures for NER with more traditional machine learning approaches to NER. We discover that most of the techniques and features used for English NER perform well on Arabic NER. Our results highlight the improvements achieved by using language-specific features in Arabic NER.

Keywords: Named Entity Recognition · Machine Learning · Arabic NER

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Named Entity Recognition (NER) is the process of identifying the proper names in text and classifying them as one of a set of predefined categories of interest. There are three universally accepted categories: the names of locations, people and organisations. There are other common categories such as the recognition of time/date expressions, measures (money, percent, weight etc.), email addresses, etc. In addition, there can be domain-specific categories such as the names of medical conditions, drugs, bibliographic references, names of ships, etc. NER is useful for applications such as question answering, information retrieval, information extraction, automatic summarization, machine translation and text mining [1].

Arabic is one of the six official languages used by the United Nations. Approximately 360 million people speak Arabic in more than 25 countries, and Arabic script represents 8.9% of the world's languages [2]. Although there is existing work on Arabic NER, it is still at an early stage compared with English NER [2]. Certain characteristics of the Arabic language pose challenges for the task of NER. Unlike English and other European languages, capitalization does not exist in Arabic script, so employing capitalization as a feature in Arabic NER is not an option; however, translation to English is one way to work around this problem [3]. The Arabic language is morphologically complex: a word may consist of prefixes, a lemma and suffixes in different combinations [4]. This can affect performance in Arabic NER, as features derived from the prefixes and suffixes of words are typically used. Spelling variation is a further challenge: in Arabic, words (including named entities) may be spelt in different ways yet have exactly the same meaning, generating a many-to-one ambiguity [2]. The lack of resources in Arabic presents another challenge for Arabic NER. There are few freely available Arabic datasets and gazetteers, and many of the available ones are not appropriate for Arabic NER tasks because of the absence of NE annotations.

In this paper we explore approaches for NER on Arabic text to determine how the state of the art approaches to NER work on the Arabic language.
We investigate the impact of using different classifiers and sets of features, including both language-independent and language-specific features, testing them on five different datasets. We take English as the second language in our work because English NER is the most developed: recent research on English NER has achieved the best performance in the field and represents the state of the art. We also compare against the more recent deep neural network approaches. The neural network approaches were found to perform better than the traditional machine learning approaches for both Arabic and English NER, although the SVM classifier outperformed the neural network based model on one dataset (AQMAR). Our proposed models for Arabic NER outperformed previously proposed models on two of the three Arabic datasets.

The rest of this paper is organized as follows. Related work is discussed in Section 2; the datasets and proposed models are presented in the methodology, Section 3; experimental results and analysis are given in Section 4; and finally the conclusions are discussed in Section 5.

2 Related Work

2.1 General NER

There are three main approaches to the NER task: rule-based, machine learning and hybrid approaches. Early NER approaches were rule-based, using hand-crafted rules. In rule-based approaches, the rules are designed as regular expressions for pattern matching, generally combined with lookup gazetteers [4]. Rule-based approaches require expert linguists to design rules for the NER task and usually target a single language; therefore, few researchers use rule-based methods to develop NER systems [5]. Although the knowledge-based approach can achieve good results, it requires a very exhaustive lexicon in order to work well, which makes it inefficient: entities that do not exist in the lexicon cannot be recognised [6].
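As a concrete illustration of the rule-based approach, the sketch below combines a regular-expression pattern with a gazetteer lookup. It is a minimal toy example, not a reconstruction of any particular published system; the gazetteer entries and the date pattern are invented for illustration.

```python
import re

# Toy gazetteer; real rule-based systems rely on large curated lists [4].
GAZETTEER = {"Dublin": "LOC", "Reuters": "ORG"}

# A regular expression for one simple date format, e.g. 12/08/1996.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def rule_based_ner(text):
    """Return (surface form, entity type) pairs found by patterns and lookup."""
    entities = [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]
    for token in text.split():
        word = token.strip(".,;:")
        if word in GAZETTEER:
            entities.append((word, GAZETTEER[word]))
    return entities

print(rule_based_ner("Reuters reported from Dublin on 12/08/1996."))
# [('12/08/1996', 'DATE'), ('Reuters', 'ORG'), ('Dublin', 'LOC')]
```

The weakness discussed above is immediately visible: any name absent from the gazetteer and matching no pattern is silently missed.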
In machine learning based approaches, common classifiers used for the NER task include Conditional Random Fields (CRF), Support Vector Machines (SVM), Maximum Entropy (ME), Decision Trees and Hidden Markov Models (HMM). An important factor in the machine learning based approach is the features that are used. Features often used in NER systems include the case of the word (upper or lower), whether the word is a digit or contains a digit, and the part of speech associated with the word. The digit feature is useful in NER as it can be used to recognize dates, percentages, money, etc. [7]. The morphology of a word can be captured by including prefixes or suffixes as features; for example, a word can be recognized as an organization if it ends with "tech", "ex" or "soft" [8]. To extract features, a window is typically passed over the text. For example, [9] used the part-of-speech tags of the two words before and the two words after the current word to recognize named entities. Word length (number of characters) has also been found to be an effective feature for NER [10].

The third approach to NER, the hybrid approach, combines rule-based and machine learning methods to optimize system performance [11]. In this approach, the output of the rule-based system, as tagged text, is used as input to the machine learning system.

Most of the more recently proposed NER systems are based on recurrent neural network (RNN) architectures over character or word embeddings [12]. Word embeddings are representations of words in n-dimensional space, learned in an unsupervised way over large collections of unlabeled data. The first neural network based approach for NER was proposed by [13]. The system used feature vectors created from orthographic features (e.g., capitalization of the first character), lexicons and dictionaries; later, these manually created feature vectors were replaced with word embeddings. Since then, and starting with [14], implementing neural networks for NER systems has become popular. These kinds of models are attractive because they do not require feature engineering effort and are thus more domain independent. Current research has shown that using pre-trained word embeddings is important for neural network based NER, as they are effective and save time and resources [15]. Pre-trained character embeddings are also essential for character-based languages such as Chinese, where one character may represent a word meaning [16].
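In practice, pre-trained embeddings of this kind are consumed as a lookup table from token to dense vector. The sketch below uses the gensim package; the file name is hypothetical, GloVe text files must first be converted to word2vec format (gensim provides glove2word2vec for this), and AraVec is distributed directly as gensim models.

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

# Each word maps to a dense n-dimensional vector (100 dimensions in our setup).
v = vectors["paris"]
print(v.shape)

# Nearest neighbours in the embedding space tend to be semantically related.
print(vectors.most_similar("paris", topn=3))
```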
2.2 Arabic NER

A number of research studies have focused on Arabic Named Entity Recognition (ANER). An early attempt at Arabic NER was proposed by [7] using a rule-based approach consisting of a whitelist representing a dictionary of names, and a grammar in the form of regular expressions, to recognize the named entities. A machine learning based approach was proposed by [18], who developed an Arabic NER system named ANERsys 1.0. The authors built linguistic resources for their experiments, including ANERCorp, the first freely available manually annotated Arabic NER dataset, and ANERgazet, an Arabic gazetteer. Contextual and gazetteer features were used in the first version, and part-of-speech features were added in the second version, which improved the system performance. A hybrid approach combining rule-based and machine learning methods for Arabic NER was proposed by [7]. They used the GATE toolkit (https://gate.ac.uk/sale/tao/split.html) for the rule-based component and a Decision Tree algorithm for the ML-based component. The system used NE tags produced by the rule-based component alongside other language-independent features and Arabic-specific features.

The missing capitalization feature in Arabic is compensated for in some Arabic NER work by using the Buckwalter morphological analyzer [33]. Among the features provided by Buckwalter is one named English-gloss, which provides the English translation of each word in the input Arabic text. Later, a tool named MADA was built on Buckwalter and upgraded to become MADAMIRA [38], which provides up to 19 orthogonal features. We use some of these features, which have been shown to be effective in Arabic NER models [38], in our proposed models; more details are given in the features section.

Similar to English, recent work in Arabic NER focuses on developing neural network based approaches. A neural network based approach for Arabic NER employing a Bi-LSTM and a CRF to predict the named entities has been used [17]; however, their model omits techniques such as character representations and hyperparameter tuning. Another approach, proposed by [40], used an LSTM neural network model combined with a CNN for character-level feature representation. Their model is well designed but also lacks hyperparameter tuning to boost performance. A multi-attention technique has also been used [41], combining word embeddings and character embeddings via an embedding-level attention mechanism; the output is fed into a Bi-LSTM encoder, followed by another self-attention layer to boost performance. They evaluated their model on the ACE, ANERCorp and Twitter datasets. Their model achieved relatively better performance on the ACE dataset, which has a different tagging style (not the CoNLL-2003 style), and relatively lower performance on the Twitter dataset, probably due to its noisy text. Their evaluation setup is very similar to that of our neural network based model; our results are slightly better, most likely because we use different hyperparameter values.

Model learning as well as evaluation requires high quality annotated datasets. Initial benchmark datasets were generally created by labeling news articles with a small number of entity types, e.g. CoNLL-2003 [39] and the ANERCorp dataset [23]. Later, more datasets were created from numerous kinds of text sources, including conversation, Wikipedia articles and social media, such as WNUT-2017 [19]. Arabic datasets are relatively few compared with English and other languages; this represents one of the challenges of Arabic NER. Some of the widely used Arabic datasets are ANERCorp, created by Benajiba [23], and ACE (https://www.ldc.upenn.edu/collaborations/past-projects/ace), a commercial dataset.

3 Methodology

We implemented both traditional machine learning and deep learning models for our experiments and evaluated their performance on English and Arabic datasets.

3.1 Datasets

Five datasets are used in the experimental comparison: two English datasets and three Arabic datasets. To cover a range of dataset characteristics, we chose datasets drawn from different text sources such as newspapers, Wikipedia and social media. Each dataset is split into training, development and test sets, as shown in Table 1 for the English datasets and Table 2 for the Arabic datasets. The development set was used for hyperparameter tuning to avoid overfitting.

English Datasets

CoNLL-2003: This is a benchmark dataset which was introduced in the Conference on Natural Language Learning shared task for named entity recognition [39] and has been used extensively for the NER task. The CoNLL-2003 datasets cover several languages; we focus on the English dataset. The English data was taken from the Reuters Corpus, which consists of Reuters news stories published between August 1996 and August 1999. There are four types of named entities in the dataset: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC).

WNUT-2017: This high-variance dataset was introduced in the Shared Task on Novel and Emerging Entity Recognition 2017 [29]. The named entity tags in this dataset have a wider range, including Person, Location (including GPE (Geo-Political Entity)), Facility (center, station, etc.), Group (including music bands, sports teams and non-corporate organizations), Creative work (song, movie, book, and so on), Corporation and Product (tangible goods, or well-defined services).
The source of this dataset is comments taken from social media websites, including YouTube comments, Stack Overflow responses, Twitter text for major events in 2016-2017, unfiltered Twitter text from 2010, and Reddit comments.

Arabic Datasets

ANERCorp: This is a widely used Arabic corpus that was developed by [23] and has the same format as the CoNLL-2003 dataset. ANERCorp consists of 316 articles chosen from different newspapers for the sake of generalization. The named entities in this corpus are persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC).

AQMAR: This dataset contains 28 hand-annotated Arabic articles collected from Wikipedia, with 74,000 tokens [31]. The format of this dataset is similar to CoNLL-2003.

WikiFANEGold: This dataset, part of a dataset named "gold-standard fine-grained NE corpora", was manually created by [32]. It contains Wikipedia articles which were selected by choosing articles that discuss named entities, aiming for a fair distribution among the classes. In addition, the textual data extracted from the Wikipedia articles was cleaned by removing elements such as headings, lists, and captions on images and tables. The dataset consists of 8 coarse-grained classes and 50 fine-grained classes. The coarse-grained named entities in this corpus are PER: Person, ORG: Organisation, LOC: Location, GPE: Geo-Political, FAC: Facility, VEH: Vehicle, WEA: Weapon and PRO: Product. We use the coarse-grained named entities in our experiments, with a total size of 246,303 tokens.

The gazetteers we used in our experiments are ANERgazet [23] for Arabic NER and the English gazetteers used by [30] for English NER, which contain lists of person, location and organization names.

Table 1. English Datasets

Dataset    | Split       | Tokens | LOC  | PER  | ORG  | MISC | Product | Corp | Creative-Work | Group
CoNLL-2003 | Training    | 203621 | 3.5% | 3.2% | 3.1% | 1.7% | -       | -    | -             | -
CoNLL-2003 | Development | 51362  | 3.6% | 3.6% | 2.6% | 1.8% | -       | -    | -             | -
CoNLL-2003 | Test        | 46435  | 3.6% | 3.5% | 3.6% | 1.5% | -       | -    | -             | -
WNUT-2017  | Training    | 55725  | 0.9% | 1.0% | -    | -    | 0.2%    | 0.4% | 0.2%          | 0.4%
WNUT-2017  | Development | 15734  | 0.5% | 3.0% | -    | -    | 0.7%    | 0.2% | 0.7%          | 0.2%
WNUT-2017  | Test        | 8144   | 1.8% | 5.3% | -    | -    | 1.6%    | 0.8% | 1.7%          | 2.0%

Table 2. Arabic Datasets

Dataset      | Split       | Tokens | LOC   | PER   | ORG   | MISC | GPE   | FAC  | VEH  | WEA  | PRO
WikiFANEGold | Training    | 197043 | 4.0%  | 34.4% | 14.4% | -    | 38.2% | 2.8% | 0.2% | 0.3% | 5.7%
WikiFANEGold | Development | 24625  | 3.9%  | 34.2% | 13.9% | -    | 37.8% | 2.7% | 0.3% | 0.5% | 6.2%
WikiFANEGold | Test        | 24635  | 5.2%  | 33.2% | 13.8% | -    | 37.5% | 2.7% | 0.6% | 0.5% | 5.4%
ANERCorp     | Training    | 12022  | 29.4% | 24.0% | 13.5% | 7.4% | -     | -    | -    | -    | -
ANERCorp     | Development | 3150   | 28.8% | 26.3% | 16.6% | 8.3% | -     | -    | -    | -    | -
ANERCorp     | Test        | 3005   | 29.5% | 24.0% | 13.5% | 7.4% | -     | -    | -    | -    | -
AQMAR        | Training    | 36050  | 2.7%  | 2.1%  | 0.6%  | 3.0% | -     | -    | -    | -    | -
AQMAR        | Development | 9092   | 1.8%  | 2.3%  | 0.5%  | 3.9% | -     | -    | -    | -    | -
AQMAR        | Test        | 9192   | 1.8%  | 2.3%  | 0.6%  | 4.0% | -     | -    | -    | -    | -
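All five corpora follow (or closely resemble) the CoNLL one-token-per-line format, with an empty line separating sentences. The following is a minimal reading and splitting sketch, assuming a whitespace-separated file with the token in the first column and the NE tag in the last; column positions may differ per corpus.

```python
def read_conll(path, token_col=0, tag_col=-1):
    """Read a CoNLL-style file into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                  # blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                cols = line.split()
                current.append((cols[token_col], cols[tag_col]))
    if current:
        sentences.append(current)
    return sentences

def split_80_10_10(sentences):
    """80% training / 10% development / 10% test, as used for the non-benchmark datasets."""
    n = len(sentences)
    a, b = int(0.8 * n), int(0.9 * n)
    return sentences[:a], sentences[a:b], sentences[b:]
```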
3.2 Traditional Machine Learning Based Models

For the traditional machine learning models, we implemented supervised machine learning approaches for the NER task, as supervised learning approaches outperform unsupervised ones [20-23]. A variety of classifiers have been used for NER; in our experiments we used the Conditional Random Fields (CRF), Support Vector Machine (SVM) and Random Forest (RF) algorithms, which have been shown to perform well [10, 21, 22].

3.3 Features

Features for the supervised machine learning approaches were selected based on their performance in other NER research; we used both language-independent and language-specific features. In our proposed models, and in line with previous research, we used the following features, which have been shown to be effective (a sketch of how they are assembled and fed to a classifier follows at the end of this section).

First, the language-independent features:

– The 3-character suffix of the word [20]: word suffix information is helpful in identifying NEs, based on the observation that NEs share some common suffixes.
– The 3-character prefix of the word [20].
– Character length of the word [21]: a Boolean feature capturing whether the current word is longer than three characters, based on the observation that very short words are rarely NEs; the feature is False when the word is three characters or fewer.
– Whether the word contains any digit (0-9) [22]: this feature is helpful in recognizing miscellaneous NEs such as time expressions, measurement expressions and numbers.
– Whether the word contains any punctuation [18].
– Previous NE tag: the predicted named entity tag of the previous token [22].

Second, the language-specific features:

– Whether the word starts with a capital letter (English only).
– List lookup features (gazetteers) [20]: a set of binary features which capture whether the word is present as a specific entity type in the gazetteer (English and Arabic).
– Part of Speech tags [20]: the part of speech tag of the current word and its surrounding words (two before and two after) (English and Arabic).

The following morphological features, generated by the MADAMIRA tool [38], are used for Arabic NER only:

– Aspect: describes the aspect of an Arabic verb. It has four possible values: Command, Imperfective, Perfective and Not applicable.
– Gender: the nominal gender. This feature has three values: Feminine, Masculine and Not applicable.
– Person: indicates the person information. The possible values are: 1st, 2nd, 3rd and Not applicable.
– Voice: the verb voice. The values for this feature include: Active, Passive, Not applicable and Undefined.

The baseline model (for both English and Arabic NER) included the following features: the 3-character suffix of the word; the 3-character prefix of the word; the character length of the word; whether the word contains any digit; whether the word contains any punctuation; the Part of Speech tags; and capitalization (for English NER only).
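As an illustration, the sketch below assembles per-token feature dictionaries of the kind listed above and trains a CRF with the sklearn-crfsuite package, using the tuned values reported in Table 3 (lbfgs, c1 = 0.3, c2 = 0.1). The one-sentence training sample, POS tags and gazetteer are invented for illustration; in our experiments the gazetteer lookup is per entity type, and the Arabic runs add the MADAMIRA morphological features.

```python
import sklearn_crfsuite

def word_features(tokens, pos_tags, i, gazetteer=frozenset()):
    """Language-independent features plus gazetteer and POS-window features."""
    w = tokens[i]
    feats = {
        "suffix3": w[-3:],
        "prefix3": w[:3],
        "longer_than_3": len(w) > 3,
        "has_digit": any(c.isdigit() for c in w),
        "has_punct": any(not c.isalnum() for c in w),
        "in_gazetteer": w in gazetteer,
    }
    for offset in (-2, -1, 0, 1, 2):      # POS of current word and +/- 2 window
        j = i + offset
        feats["pos[%d]" % offset] = pos_tags[j] if 0 <= j < len(tokens) else "PAD"
    return feats

def sent2features(tokens, pos_tags, gazetteer):
    return [word_features(tokens, pos_tags, i, gazetteer) for i in range(len(tokens))]

# Invented one-sentence sample: "Dublin is in Ireland".
train = [((["Dublin", "is", "in", "Ireland"],
           ["NNP", "VBZ", "IN", "NNP"]),
          ["B-LOC", "O", "O", "B-LOC"])]
gaz = frozenset({"Dublin", "Ireland"})

X_train = [sent2features(toks, pos, gaz) for (toks, pos), _ in train]
y_train = [tags for _, tags in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.3, c2=0.1, max_iterations=25000)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```

Note that the previous-NE-tag feature is not built explicitly here: as discussed in the results section, a sequential classifier such as the CRF already models tag transitions.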
3.4 Deep Learning Based Models

In this section, we describe the architecture of the neural network we used, which is adopted from [24]. The model uses an end-to-end approach that does not require language-specific feature engineering or data pre-processing beyond the use of pre-trained word embeddings. In sequence labeling tasks such as named entity recognition, recognition accuracy can be improved by using the sequence around the word under prediction. A Bidirectional Long Short-Term Memory (Bi-LSTM) model can therefore give good performance [24], since it can learn from both past and future input features at each time step: past input features via the forward pass and future input features via the backward pass. According to the state-of-the-art literature, the Bi-LSTM model can be combined with a Conditional Random Field (CRF) layer to enhance model performance [24]. A model of this kind inherits the Bi-LSTM's ability to learn from past and future input features and then performs sentence-level decoding of the most probable tags with the aid of the CRF layer.

In our experiments, we used both word embeddings and character representations as features for our neural network based model (see Fig. 1). To encode character-level information of a word, we used a convolutional neural network (CNN) for the character-level representation. CNNs have been shown to extract morphological information, such as prefixes and suffixes, from the characters of words and encode it into neural representations [25]. We also chose a CNN because we are dealing with the Arabic language: since Arabic is a morphologically rich language, the CNN can identify character-level features (word prefixes and suffixes) that are particularly useful for Arabic NER (see Fig. 2). The combination of the character- and word-level representations is then fed into a Bi-LSTM, with a sequential CRF on top of the Bi-LSTM to jointly decode labels for the entire sentence.

Fig. 1. The general architecture of the neural network based model.

Fig. 2. Character representation using a CNN, which is concatenated with the word vector before being fed into the Bi-LSTM.

Employing word embeddings benefits NLP, especially when dealing with languages that have many rare words and large vocabularies [26], such as Arabic. The nature of the Arabic language, specifically word inflection, generates many lexical variations, which leads to sparseness in Arabic corpora. In our work, for English NER we used 100-dimensional GloVe embeddings, made publicly available by Stanford and trained on a 6-billion-word corpus of Wikipedia and web text [27]. For Arabic NER, we used AraVec, a pre-trained distributed word embedding [28]. AraVec is an open source project which provides free-to-use Arabic word embeddings trained on more than 3 billion words from web pages and Wikipedia.
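A minimal sketch of this architecture in Keras is shown below, using the dimensions and tuned values from Table 3 (100-dimensional word vectors, convolution size 3, CNN dropout 0.25, LSTM state size 200, LSTM dropout 0.50, Nadam). The vocabulary sizes and sequence lengths are placeholders, and the final CRF layer is replaced here by a softmax for brevity, since core Keras has no CRF layer; in the full model the CRF decodes the tag sequence jointly [24].

```python
from tensorflow.keras import layers, Model

MAX_SENT_LEN, MAX_WORD_LEN = 50, 20          # placeholder sequence lengths
WORD_VOCAB, CHAR_VOCAB, N_TAGS = 20000, 100, 9

# Word-level input: in practice initialised with 100-d GloVe / AraVec vectors.
word_in = layers.Input(shape=(MAX_SENT_LEN,), name="words")
word_emb = layers.Embedding(WORD_VOCAB, 100)(word_in)

# Character-level input: a CNN over the characters of each word (Fig. 2).
char_in = layers.Input(shape=(MAX_SENT_LEN, MAX_WORD_LEN), name="chars")
char_emb = layers.TimeDistributed(layers.Embedding(CHAR_VOCAB, 30))(char_in)
char_emb = layers.Dropout(0.25)(char_emb)                  # CNN dropout
char_conv = layers.TimeDistributed(
    layers.Conv1D(30, 3, padding="same", activation="relu"))(char_emb)
char_vec = layers.TimeDistributed(layers.GlobalMaxPooling1D())(char_conv)

# Concatenate word and character representations, feed into the Bi-LSTM (Fig. 1).
x = layers.Concatenate()([word_emb, char_vec])
x = layers.Bidirectional(layers.LSTM(200, return_sequences=True, dropout=0.5))(x)

# In the full model a CRF layer replaces this per-token softmax.
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

model = Model([word_in, char_in], out)
model.compile(optimizer="nadam", loss="categorical_crossentropy")
model.summary()
```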
3.5 Hyperparameter Tuning

In our experiments, we used random search for hyperparameter tuning, as it has been shown to be more efficient than other tuning approaches such as grid search [34]. The hyperparameters identified for the traditional machine learning and the neural network based models are stated in Table 3.

Table 3. Hyperparameter tuning for the traditional ML and the neural network (NN) based models

Classifier | Hyperparameter   | Space                      | Best value
SVM        | Kernel           | linear, poly, rbf, sigmoid | rbf
SVM        | C                | 1e-02 - 1e+03              | 1e+02
SVM        | gamma            | 1e-02 - 1e+03              | 1e+01
RF         | Number of trees  | 200 - 2000                 | 400
RF         | Max features     | auto, sqrt                 | auto
CRF        | Optimizer        | lbfgs, l2sgd, ap, pa, arow | lbfgs
CRF        | Max iterations   | 100 - 25000                | 25000
CRF        | C1               | 0.1 - 1.0                  | 0.3
CRF        | C2               | 0.1 - 1.0                  | 0.1
NN         | Learning rate    | 0.005 - 0.008              | 0.0105
NN         | CNN dropout      | 0.25 - 0.85                | 0.25
NN         | Convolution size | 3 - 7                      | 3
NN         | LSTM dropout     | 0.25 - 0.50                | 0.50
NN         | LSTM state size  | 100 - 500                  | 200
NN         | Optimizer        | SGD, Nadam                 | Nadam
NN         | Epochs           | determined by performance  | 120
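As an illustration of the tuning procedure, the sketch below runs a random search over the SVM space of Table 3 using scikit-learn's RandomizedSearchCV. The synthetic data stands in for the vectorized token features, and the number of sampled configurations (n_iter) is a placeholder.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Search space mirroring the SVM rows of Table 3.
space = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": loguniform(1e-2, 1e3),
    "gamma": loguniform(1e-2, 1e3),
}

# Synthetic stand-in for vectorized token features and NE labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

search = RandomizedSearchCV(SVC(), space, n_iter=50, cv=3,
                            scoring="f1_macro", random_state=0)
search.fit(X, y)
print(search.best_params_)
```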
4 Experimental Results and Discussion

To evaluate our NER models, we used the datasets described above. For the CoNLL-2003 dataset we used the training, development and test splits provided with the benchmark. The remaining datasets were split into 80% for training, 10% for development and 10% for testing. The traditional F-score measure was used to measure performance.

Results for the traditional machine learning approaches are displayed in Table 4. The baseline model for each classifier was the classifier with the language-independent features listed in the features section. Since we are focusing on Arabic NER in this paper, we separate the results into two rows for each Arabic dataset, comparing model performance using two sets of features: the baseline language-independent features, labelled Sub, and the larger feature set, labelled Full, which adds the language-specific features listed in the features section. Baseline results for the three classifiers are displayed in the columns labelled SVM, RF and CRF respectively. Columns labelled -P show the results of adding the previous tag, as this feature has been shown to have a strong effect on model performance; we excluded this feature for the CRF model as it is, in effect, already included, CRF being a sequential classifier. We also experimented with adding gazetteers, labelled -G in the table; the results show that adding gazetteers also boosted model performance. The columns labelled -PG include both the previous tag and the gazetteers.

Table 4. The performance of traditional machine learning based models, where -P means adding the previous predicted named entity tag and -G means adding gazetteers. The best performance on each dataset is marked with *.

Dataset      | Language | Features | SVM   | SVM-P | SVM-G | SVM-PG | RF    | RF-P  | RF-G  | RF-PG | CRF   | CRF-G
CoNLL-2003   | English  | -        | 81.23 | 88.34 | 85.52 | 90.13  | 77.41 | 85.32 | 84.32 | 87.21 | 91.01 | 91.82*
WNUT-2017    | English  | -        | 17.42 | 11.31 | 19.62 | 12.32  | 11.24 | 8.21  | 0.12  | 9.54  | 33.13 | 34.63*
WikiFANEGold | Arabic   | Sub      | 72.35 | 73.74 | 74.86 | 77.57  | 70.02 | 70.89 | 72.23 | 72.95 | 77.74 | 75.63
WikiFANEGold | Arabic   | Full     | 73.42 | 75.26 | 74.94 | 77.86  | 71.85 | 71.36 | 72.52 | 73.62 | 78.85 | 79.24*
ANERCorp     | Arabic   | Sub      | 80.21 | 87.32 | 82.22 | 89.45  | 73.24 | 82.86 | 74.48 | 83.26 | 75.61 | 82.40
ANERCorp     | Arabic   | Full     | 83.14 | 88.95 | 86.11 | 89.81* | 75.20 | 84.34 | 76.26 | 85.18 | 83.09 | 87.51
AQMAR        | Arabic   | Sub      | 73.24 | 74.46 | 73.52 | 74.48  | 72.16 | 73.27 | 73.69 | 73.75 | 74.53 | 74.82
AQMAR        | Arabic   | Full     | 74.44 | 75.93 | 75.21 | 76.98* | 72.51 | 73.34 | 72.28 | 73.96 | 74.93 | 75.89

Table 4 reveals the impact of the previous NE tag and gazetteer features on the performance of each model. On the CoNLL-2003 dataset, the performance of the models improved dramatically when using the previous NE tag as well as gazetteers, particularly for the SVM model. The best performance on the CoNLL-2003 dataset was achieved by the CRF model. As Table 4 shows, performance on the WNUT-2017 dataset is relatively low, due mainly to the noise in the text, which was collected from social networks. Using the previous NE tag or gazetteers on this dataset did not improve performance; instead, performance decreased, probably because the previously predicted NE tag is more likely to be wrong, which affects the overall model performance negatively. The same applies to gazetteers: it is difficult for a gazetteer to provide support for such noisy text. However, the CRF model again proved relatively successful on this dataset.

On the Arabic datasets, using both language-independent and language-specific features, and including the previous predicted NE tag and gazetteers, as the Full feature set, enhanced the general performance. The best performance was achieved by the SVM model on the ANERCorp and AQMAR datasets. The models' performance on the AQMAR dataset was broadly similar to ANERCorp, but performance on the WikiFANEGold dataset was lower, possibly due to the higher number of classes in this dataset; the best performance on it was achieved by the CRF model. The Full feature set performed better than the Sub feature set on all Arabic datasets; however, the effect of adding the previous NE tag (-P) and gazetteers (-G) is larger, as Table 4 shows.

Table 5 shows the performance of the deep learning based models. The first column, labelled Bi-LSTM, gives the performance using only the Bidirectional Long Short-Term Memory model. The second column, labelled Bi-LSTM-CNN, gives the performance of the combination of the convolutional neural network (CNN) for character representation and the Bi-LSTM. The third column, labelled Bi-LSTM-CNN-CRF, gives the performance with the addition of the Conditional Random Fields (CRF) layer.

Table 5. The F1-score performance of the deep learning models. The best performance on each dataset is marked with *.

Dataset      | Language | Bi-LSTM | Bi-LSTM-CNN | Bi-LSTM-CNN-CRF
CoNLL-2003   | English  | 90.24   | 91.61       | 92.57*
WNUT-2017    | English  | 32.74   | 35.65       | 35.93*
WikiFANEGold | Arabic   | 78.12   | 78.9        | 79.48*
ANERCorp     | Arabic   | 87.12   | 89.81       | 89.92*
AQMAR        | Arabic   | 75.12   | 75.52       | 76.46*

In general, the performance of the deep learning based models is higher than that of the traditional machine learning based models. Again, performance on the English datasets is higher than on the Arabic datasets, probably due to the challenges of Arabic NER already discussed. Unlike the traditional machine learning based models, the differences in performance across the deep learning based models are small.

Table 6 shows the highest performing models from the traditional machine learning approaches and the deep learning based approaches across the Arabic datasets. The table also includes the best performance from existing research on these datasets, labelled SOTA.

The current best performance on the WikiFANEGold dataset [35] used Buckwalter transliteration, English gloss, POS and NE tags, alongside window-based and dependency-based representations. Their approach captures global information in the corpus, rather than focusing within the sentence, using a CRF classifier. Both the SVM classifier and the deep learning approach used in this paper outperform this approach.

For the ANERCorp dataset, the current best performance [36] comes from a neural network based model which the authors named Artificial Neural Network (ANN). Their approach has three stages: first, preprocessing the data; second, converting Arabic letters to the Roman alphabet; and finally applying a neural network to classify the data. They split the dataset into 90% for training and 10% for testing. Compared to our models, they achieved better performance, most probably because of the data pre-processing and the conversion of Arabic letters to the Roman alphabet.

For the AQMAR dataset [37], the authors proposed a model that integrates various custom-made techniques, including representation learning (a model using word embeddings and a Bi-LSTM), feature engineering, sequence labeling, and ensemble learning. They train multiple LSTM-CRF models to construct the mapping from representations to predictions and then concatenate their outputs as an ensemble.
Both the SVM-PG and the deep learning Bi-LSTM-CNN-CRF approaches used in this paper outperformed the state of the art on this dataset. Our SVM-PG model gave the best performance, possibly because its performance was boosted by using comprehensive language-specific features in addition to the previous NE tag and gazetteers. Also, in our approaches the hyperparameters were tuned using random search, whereas tuning was neglected in the compared models on both WikiFANEGold and AQMAR.

Table 6. Comparison between the best performance (highest F1-score) of our traditional ML and deep learning based models and the SOTA

Dataset      | SVM-PG | Bi-LSTM-CNN-CRF | SOTA
WikiFANEGold | 77.86  | 79.48           | 73.66 [35]
ANERCorp     | 89.81  | 89.92           | 92.36 [36]
AQMAR        | 76.98  | 76.46           | 75.82 [37]

5 Conclusion

In this paper, we have explored a variety of different approaches to NER on Arabic text, with reference to how these approaches also perform on English text. The exploration involved evaluating different classifiers and features on a number of datasets, selected to be diverse in terms of content source (e.g. news articles, Twitter, etc.). We evaluated both language-specific and language-independent features. We found that adopting the language-specific features and using gazetteers and the previously predicted named entity tag achieves higher performance in traditional machine learning based models. The deep learning based models achieved higher performance on most of the datasets. Our proposed models outperformed the related work on two of the three Arabic datasets. However, performance on the English datasets is higher than on the Arabic datasets because of characteristics of the Arabic language, notably its morphological ambiguity.

References

1. Nadeau, David, and Satoshi Sekine. "A survey of named entity recognition and classification." Lingvisticae Investigationes 30, no. 1 (2007): 3-26.
2. Habash, Nizar Y. "Introduction to Arabic natural language processing." Synthesis Lectures on Human Language Technologies 3, no. 1 (2010): 1-187.
3. Farber, Benjamin, Dayne Freitag, Nizar Habash, and Owen Rambow. "Improving NER in Arabic Using a Morphological Tagger." In LREC. 2008.
4. Rau, Lisa F. "Extracting company names from text." In Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications, vol. 1, pp. 29-32. IEEE, 1991.
5. Gaizauskas, Robert, Takahiro Wakao, Kevin Humphreys, Hamish Cunningham, and Yorick Wilks. "University of Sheffield: Description of the LaSIE system as used for MUC-6." In Proceedings of MUC-6, November 6-8, 1995.
6. Segura Bedmar, Isabel, Paloma Martínez, and María Herrero Zazo. "SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013)." ACL, 2013.
7. Shaalan, Khaled, and Hafsa Raza. "Arabic named entity recognition from diverse text types." In Int Conf on NLP, pp. 440-451. Springer, Berlin, Heidelberg, 2008.
8. Bick, Eckhard. "A Named Entity Recognizer for Danish." In LREC. 2004.
9. Chieu, Hai Leong, and Hwee Tou Ng. "Named entity recognition: a maximum entropy approach using global information." In Proceedings of the 19th Int Conf on Computational Linguistics, Volume 1, pp. 1-7. ACL, 2002.
10. Abdul-Hamid, Ahmed, and Kareem Darwish. "Simplified feature set for Arabic named entity recognition." In Proceedings of the 2010 Named Entities Workshop, pp. 110-115. ACL, 2010.
11.
Petasis, Georgios, Frantz Vichot, Francis Wolinski, Georgios Paliouras, Vangelis Karkaletsis, and Constantine D. Spyropoulos. "Using machine learning to maintain rule-based named-entity recognition and classification systems." In Proceedings of the 39th Annual Meeting of the ACL, pp. 426-433. 2001.
12. Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush. "Character-aware neural language models." In Thirtieth AAAI Conference on Artificial Intelligence. 2016.
13. Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12, no. Aug (2011): 2493-2537.
14. Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning." In Proceedings of the 25th International Conference on Machine Learning, pp. 160-167. ACM, 2008.
15. Habibi, Maryam, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. "Deep learning with word embeddings improves biomedical named entity recognition." Bioinformatics 33, no. 14 (2017): i37-i48.
16. Yin, Rongchao, Quan Wang, Peng Li, Rui Li, and Bin Wang. "Multi-granularity Chinese word embedding." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 981-986. 2016.
17. Sa'a, D. A. Alzboun, Saia Khaled Tawalbeh, Mohammad Al-Smadi, and Yaser Jararweh. "Using bidirectional long short-term memory and conditional random fields for labeling Arabic named entities: A comparative study." In 2018 Fifth Int Conf on Social Networks Analysis, Management and Security (SNAMS), pp. 135-140. IEEE, 2018.
18. Benajiba, Yassine, and Paolo Rosso. "ANERsys 2.0: Conquering the NER Task for the Arabic Language by Combining the Maximum Entropy with POS-tag Information." In IICAI, pp. 1814-1823. 2007.
19. Derczynski, Leon, Eric Nichols, Marieke van Erp, and Nut Limsopatham. "Results of the WNUT2017 shared task on novel and emerging entity recognition." In Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 140-147. 2017.
20. AbdelRahman, Samir, Mohamed Elarnaoty, Marwa Magdy, and Aly Fahmy. "Integrated machine learning techniques for Arabic named entity recognition." IJCSI 7 (2010): 27-36.
21. Abdul-Hamid, Ahmed, and Kareem Darwish. "Simplified feature set for Arabic named entity recognition." In Proceedings of the 2010 Named Entities Workshop, pp. 110-115. Association for Computational Linguistics, 2010.
22. Ekbal, Asif, and Sivaji Bandyopadhyay. "Named entity recognition using support vector machine: A language independent approach." International Journal of Electrical, Computer, and Systems Engineering 4, no. 2 (2010): 155-170.
23. Benajiba, Yassine, Paolo Rosso, and José Miguel Benedíruiz. "ANERsys: An Arabic named entity recognition system based on maximum entropy." In Int Conf on Intelligent Text Processing and Computational Linguistics, pp. 143-153. 2007.
24. Lample, Guillaume, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. "Neural architectures for named entity recognition." arXiv preprint arXiv:1603.01360 (2016).
25. Chiu, Jason PC, and Eric Nichols. "Named entity recognition with bidirectional LSTM-CNNs." Transactions of the ACL 4 (2016): 357-370.
26. Zirikly, Ayah, and Mona Diab. "Named entity recognition for Arabic social media." In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 176-185. 2015.
27. Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
28. Soliman, Abu Bakr, Kareem Eissa, and Samhaa R. El-Beltagy. "AraVec: A set of Arabic word embedding models for use in Arabic NLP." Procedia Computer Science 117 (2017): 256-265.
29. Derczynski, Leon, Eric Nichols, Marieke van Erp, and Nut Limsopatham. "Results of the WNUT2017 shared task on novel and emerging entity recognition." In Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 140-147. 2017.
30. Nadeau, David, Peter D. Turney, and Stan Matwin. "Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity." In Conference of the Canadian Society for Computational Studies of Intelligence, pp. 266-277. Springer, Berlin, Heidelberg, 2006.
31. Mohit, Behrang, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith. "Recall-oriented learning of named entities in Arabic Wikipedia." In Proceedings of the 13th Conf of the European Chapter of the ACL, pp. 162-173. 2012.
32. Alotaibi, Fahd, and Mark Lee. "A hybrid approach to features representation for fine-grained Arabic named entity recognition." In Proceedings of COLING 2014, the 25th Int Conf on Computational Linguistics: Technical Papers, pp. 984-995. 2014.
33. Buckwalter, Tim. "Issues in Arabic orthography and morphology analysis." In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, pp. 31-34. ACL, 2004.
34. Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of Machine Learning Research 13, no. Feb (2012): 281-305.
35. Alotaibi, Fahd, and Mark Lee. "A hybrid approach to features representation for fine-grained Arabic named entity recognition." In Proceedings of COLING 2014, the 25th Int Conf on Computational Linguistics: Technical Papers, pp. 984-995. 2014.
36. Mohammed, Naji F., and Nazlia Omar. "Arabic named entity recognition using artificial neural network." Journal of Computer Science 8, no. 8 (2012): 1285.
37. Liu, Liyuan, Jingbo Shang, and Jiawei Han. "Arabic Named Entity Recognition: What Works and What's Next." In Proceedings of the Fourth Arabic NLP Workshop, pp. 60-67. 2019.
38. Pasha, Arfath, Mohamed Al-Badrashiny, Mona T. Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan Roth. "MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic." In LREC, vol. 14, pp. 1094-1101. 2014.
39. Sang, Erik F., and Fien De Meulder. "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition." arXiv preprint cs/0306050 (2003).
40. Khalifa, Muhammad, and Khaled Shaalan. "Character convolutions for Arabic Named Entity Recognition with Long Short-Term Memory Networks." Computer Speech and Language 58 (2019): 335-346.
41. Khalifa, Muhammad, and Khaled Shaalan. "Character convolutions for Arabic Named Entity Recognition with Long Short-Term Memory Networks." Computer Speech and Language 58 (2019): 335-346.