<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Learning for Irony Detection in Arabic Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Khalifa</string-name>
          <email>muhammad.e.khalifa@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noura Hussein</string-name>
          <email>nourahussein193@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Benha University</institution>
          ,
          <country country="EG">Egypt</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, Cairo University</institution>
          ,
          <country country="EG">Egypt</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our three systems submitted to the Irony Detection in Arabic Tweets Shared Task at the Forum for Information Retrieval Evaluation (FIRE 2019), along with their results. We employ ensemble learning for this task through three different types of ensemble models, namely classical, deep and hybrid (combining both). We extract several types of features from the tweets, including TF-IDF word n-gram features, topic modeling features, bag-of-words features and sentiment features. Our submitted systems took the top three places, with our best system achieving 84.4 F1 points on the test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Irony Detection</kwd>
        <kwd>Ensemble Learning</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Irony is defined as a trope whose meaning differs from what is literally
enunciated [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Failing to detect irony leads to misinterpretation of the intended
message, degrading the performance of Natural Language Understanding (NLU)
systems. Irony detection is challenging, however, since it requires world
knowledge and a more complex understanding of the context [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Detecting irony efficiently can help with various Natural Language Processing
(NLP) tasks such as sentiment analysis, hate speech detection, fake news
detection, and online harassment detection.
      </p>
      <p>
        Taking sentiment analysis as an example, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] shows how the presence of irony
can negatively impact sentiment classification performance. Remarkably, while
sentiment classification on regular tweets could reach an F1 score of up to
71, performance on ironic tweets reached only a maximum of 57.
Thus, accurate sentiment classification requires accurate irony detection so that
the sentiment classifier can acknowledge that the intended sentiment is contrary
to the literal one.
      </p>
      <p>
        Text classification of Arabic tweets typically faces several challenges.
First, there is the difficulty of dealing with Arabic itself, a morphologically
rich language whose characteristics complicate automatic processing
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Second, Arabic tweets are usually replete with unstandardized, dialectal and
transliterated words (e.g., “hello” becomes “هالو”). This leads to the
out-of-vocabulary (OOV) problem, where the learning system may fail to
generalize due to a large number of words unseen during training. Moreover,
beyond dialectal Arabic, tweets can include code-switching between Arabic and
other languages such as English or French. This adds to the difficulty of the
task by introducing more unknown words or phrases whose understanding could be
essential to irony detection.
      </p>
      <p>In this paper, we describe our systems submitted to the shared task of Irony
Detection in Arabic Tweets. Given a tweet, our system should automatically
decide whether it is ironic or not. We employ ensemble learning using both
classical and deep models, and our results show that classical ensembles
outperform deep ensembles on this task. Prior to classification, we extract
various features from the tweets, including Term Frequency-Inverse Document
Frequency (TF-IDF) word n-gram features, topic modeling features and
sentiment-based features. We conduct experiments to assess the importance of
each category of features; our results show that TF-IDF features, the
bag-of-words representation, and count-based features are most significant for
irony detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Systems Description</title>
      <sec id="sec-2-1">
        <title>2.1 Preprocessing</title>
        <p>Before feature extraction, we apply several preprocessing steps to the
tweets in the dataset. Our preprocessing stage mainly comprises text
normalization, such as replacing all instances of ‘ى’ with ‘ي’ and ‘ة’ with
‘ه’. In addition, all forms of Hamza are replaced with ‘ء’ to account for
inconsistent word spellings. We also normalize repeated characters, so that
“لووول”, for instance, becomes “لول”, and strip all diacritics (if any).</p>
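        <p>The normalization steps above can be sketched as follows; the exact character mapping and the repetition threshold are our assumptions, not necessarily the authors' precise rules:</p>

```python
import re
import unicodedata

# Illustrative normalization map for the steps described above;
# the exact Hamza handling is an assumption.
CHAR_MAP = str.maketrans({
    "\u0649": "\u064A",  # Alef Maksura -> Ya
    "\u0629": "\u0647",  # Ta Marbuta -> Ha
    "\u0623": "\u0621",  # Alef + Hamza above -> Hamza
    "\u0625": "\u0621",  # Alef + Hamza below -> Hamza
    "\u0624": "\u0621",  # Waw + Hamza -> Hamza
    "\u0626": "\u0621",  # Ya + Hamza -> Hamza
})

def normalize(text: str) -> str:
    text = text.translate(CHAR_MAP)
    # Collapse elongations of three or more repeated characters to one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Strip diacritics (Unicode combining marks).
    return "".join(c for c in text if unicodedata.category(c) != "Mn")
```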
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Feature Extraction</title>
        <p>
          Given a tweet, we extract five different types of features:
– Word n-gram TF-IDF: we extract TF-IDF-weighted features of word n-grams
with n ∈ [1, 6]. We use only the 50K most frequent n-grams.
– Topic Modeling Features: we run Latent Dirichlet Allocation (LDA) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] on
the training set, setting the number of topics k = 20 and using both unigrams
and bigrams. Each tweet t is then represented by a k-dimensional vector V_t
such that V_t[d] = P(topic = d | text = t).
– Sentiment Features: given a tweet, we average the sentiment scores of its
constituent words. The sentiment scores are taken from the Arabic sentiment
lexicon proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
– Pretrained Word Vectors: we compute a bag-of-words (BOW) representation of
each tweet by averaging the word vectors of its constituent words. We use the
pretrained 300-dimensional Twitter-CBOW word vectors provided by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
– Count-based Features: these features include word and character counts,
word density (number of characters per word), punctuation count, stopword
count and the standard deviation of the word length per tweet.
        </p>
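        <p>As a concrete sketch of the first feature type, the following computes TF-IDF weights over word n-grams with n ∈ [1, 6] in plain Python. The 50K-feature cut-off is omitted for brevity, and the unsmoothed textbook weighting below is an assumption, not necessarily the authors' exact variant:</p>

```python
import math
from collections import Counter

def word_ngrams(tokens, n_max=6):
    """All word n-grams for n in [1, n_max]."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return grams

def tfidf_features(docs, n_max=6):
    """Per-document TF-IDF weights over word n-grams
    (tf = count / total grams, idf = log(N / df))."""
    counts = [Counter(word_ngrams(d.split(), n_max)) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n_docs = len(docs)
    features = []
    for c in counts:
        total = sum(c.values())
        features.append({t: (n / total) * math.log(n_docs / df[t])
                         for t, n in c.items()})
    return features
```

        <p>An n-gram occurring in every tweet receives weight zero, while rare n-grams are up-weighted, which is what makes these features discriminative.</p>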
        <p>Table 1 shows the F1 score obtained by XGBoost on each category of
features separately. TF-IDF and word vectors give the best performance on the
development set.</p>
        <sec id="sec-2-2-1">
          <title>Table 1. Development-set F1 per feature category</title>
          <p>Sentiment: 59.0; Topic Modeling: 60.0; Count Features: 67.0; TF-IDF word n-gram: 81.1; Word2vec BOW: 83.6; All: 85.6</p>
          <p>
            For our first submission, we use an ensemble of three models, namely Gradient
Boosting [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], Random Forest [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] and a Multilayer Perceptron (MLP) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. This
ensemble is trained on all of the previously discussed features. To compute the
final predictions from the ensemble, we use soft voting: we sum the
probabilities of each class across models, and the class with the highest
probability sum is chosen.
          </p>
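          <p>The soft-voting rule reduces to a few lines; this sketch assumes each model outputs a probability distribution over the classes:</p>

```python
def soft_vote(model_probs):
    """Soft voting: sum per-class probabilities across ensemble
    members and return the index of the class with the highest total."""
    n_classes = len(model_probs[0])
    totals = [sum(p[c] for p in model_probs) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)
```

          <p>For example, with three models predicting [0.9, 0.1], [0.4, 0.6] and [0.45, 0.55], the class totals are [1.75, 1.25], so class 0 is chosen even though two of the three models individually preferred class 1.</p>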
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.4 Word-level Bi-LSTM ensemble</title>
        <p>
          Our second submission is an ensemble model based on a word-level bidirectional
LSTM (bi-LSTM) network [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We augment the bi-LSTM with a subset of the
aforementioned features. These features are processed through a feed-forward
network, and the output is concatenated with the output of the last hidden
state of the bi-LSTM. The result is passed through another feed-forward
network and then projected onto a sigmoid unit for classification (see
Figure 1). The subset of additional features includes the TF-IDF, topic
modeling and count-based features. We use an ensemble of 8 such models for
this submission.
        </p>
        <p>
          We combine our first and second systems into our third submission, which is
an ensemble of Gradient Boosting, Random Forest, a Multilayer Perceptron
and 8 bi-LSTMs.
        </p>
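        <p>The classification head of this architecture can be sketched as follows. The weights and dimensions are illustrative placeholders rather than trained parameters, and the second feed-forward layer is folded into a single linear projection for brevity:</p>

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def hybrid_head(lstm_state, feature_repr, weights, bias=0.0):
    """Concatenate the bi-LSTM's last hidden state with the
    feed-forward-processed feature vector, then project the combined
    vector onto a single sigmoid unit for binary classification."""
    combined = list(lstm_state) + list(feature_repr)
    score = sum(w * x for w, x in zip(weights, combined)) + bias
    return sigmoid(score)  # probability of the ironic class
```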
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Experiments and Results</title>
      <sec id="sec-3-1">
        <title>3.1 Dataset</title>
        <p>
          We use the training dataset provided by the Irony Detection in Arabic Tweets
shared task [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The dataset is a collection of 4024 tweets with two classes,
ironic and non-ironic, containing 2091 and 1933 samples, respectively. We do
not use any additional training data. The test set contains 1006 tweets.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Experimental Setup</title>
        <p>For hyperparameter selection, we use a randomly sampled 20% of the training
set as a development set. However, before final submission, we train each system
on the whole training set. Table 2 shows the hyperparameter settings for all
models used.</p>
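        <p>The 20% development split can be reproduced with a sketch like the following; the random seed is an arbitrary choice for illustration:</p>

```python
import random

def train_dev_split(examples, dev_frac=0.2, seed=0):
    """Randomly hold out dev_frac of the examples as a development set."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    cut = int(len(examples) * dev_frac)
    dev = [examples[i] for i in indices[:cut]]
    train = [examples[i] for i in indices[cut:]]
    return train, dev
```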
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Results</title>
        <p>Table 3 shows the results on both development and test sets using both single
and ensemble models. Notably, the single XGBoost model performs best among
all single models on the development set, with an F1 score of 85.6.</p>
        <sec id="sec-3-3-1">
          <title>Table 2. Model hyperparameters</title>
          <p>Random Forest: n_trees=60</p>
          <p>MLP: n_layers=3, layer_sizes=(128, 64, 32)</p>
          <p>XGBoost: n_trees=200, max_depth=10, gamma=0.5</p>
          <p>Bi-LSTM: embeddings_dim=300, embeddings_init=random_normal,
lstm_n_layers=1, lstm_hidden_units=64, feedforward1_units=128,
feedforward2_units=128, feedforward_activation=‘relu’, dropout=0.6</p>
          <p>By combining XGBoost with Random Forest and the MLP, the classical
ensemble achieves the best F1 scores of 86.5 and 84.4 on the development and
test sets, respectively. The hybrid ensemble achieves the next best scores of
86.2 and 83.3, and the Bi-LSTM ensemble comes last with 84.6 and 82.2. Since
the dataset is relatively small, it makes sense for classical models to
outperform deep models.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Table 3. F1 scores per model (Dev / Test)</title>
          <p>Random Forest: – / 80.4; XGBoost: 85.6 / –; MLP: 80.8 / –;
RF + MLP + XGBoost (ensemble): 86.5 / 84.4; Bi-LSTM: 82.6 / –;
8 Bi-LSTMs (ensemble): 84.6 / 82.8; Hybrid ensemble: 86.2 / 83.3</p>
          <p>In this paper, we described our three systems submitted to the Irony
Detection in Arabic Tweets Shared Task at the Forum for Information Retrieval
Evaluation (FIRE 2019). Our submitted systems are classical, deep and hybrid
ensembles that operate on a set of features extracted from each tweet. The
extracted features include TF-IDF word n-gram features, a bag-of-words
representation, sentiment-based features and topic modeling features. Our
results show the classical ensemble outperforming both the deep and hybrid
ensembles with 84.4 F1 points on the test set, achieving first place on the
task leaderboard.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research 3(Jan)</source>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Machine learning 45(1)</source>
          ,
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Farghaly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaalan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Arabic natural language processing: Challenges and solutions</article-title>
          .
          <source>ACM Transactions on Asian Language Information Processing (TALIP) 8</source>
          (
          <issue>4</issue>
          ),
          <volume>14</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          . Annals of statistics pp.
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Artificial neural networks (the multilayer perceptron)-a review of applications in the atmospheric sciences</article-title>
          .
          <source>Atmospheric environment</source>
          <volume>32</volume>
          (
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          ),
          <fpage>2627</fpage>
          -
          <lpage>2636</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ghanem</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karoui</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benamara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Idat@fire2019: Overview of the track on irony detection in arabic tweets</article-title>
          . In: Mehta P.,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            <given-names>M</given-names>
          </string-name>
          . (Eds.)
          <article-title>Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019)</article-title>
          . CEUR Workshop Proceedings. In: CEUR-WS.org, Kolkata, India, December
          <volume>12</volume>
          -
          <fpage>15</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Semeval-2016 task 4: Sentiment analysis in twitter</article-title>
          .
          <source>In: Proceedings of the 10th international workshop on semantic evaluation (semeval-2016)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Soliman</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eissa</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Beltagy</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          :
          <article-title>Aravec: A set of arabic word embedding models for use in arabic nlp</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>117</volume>
          ,
          <fpage>256</fpage>
          -
          <lpage>265</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Van Hee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lefever</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Semeval-2018 task 3: Irony detection in english tweets</article-title>
          .
          <source>In: Proceedings of The 12th International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vo</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Don't count, predict! an automatic approach to learning sentiment lexicons for short text</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <fpage>219</fpage>
          -
          <lpage>224</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>