<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>bigIR at CheckThat! 2020: Multilingual BERT for Ranking Arabic Tweets by Check-worthiness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maram Hasanain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tamer Elsayed</string-name>
<email>telsayed@qu.edu.qa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Engineering Department, Qatar University</institution>
          ,
          <addr-line>Doha</addr-line>
          ,
          <country country="QA">Qatar</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the third-year participation of our bigIR group at Qatar University in the CheckThat! lab at CLEF. This year we participated only in Arabic Task 1, which focuses on detecting check-worthy tweets on a given topic. We submitted four runs using both traditional classification models and a pre-trained language model: multilingual BERT (mBERT). Official results showed that our run using mBERT was the best among all our submitted runs. Furthermore, the bigIR team was ranked third among all eight teams that participated in the lab, with our best run ranked 6th among 28 runs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the huge flood of false information on the Web and social media, verification of all claims that a user faces is becoming infeasible. The situation is even more challenging for professional fact-checkers and journalists, who usually track multiple topics simultaneously, each with many claims. Twitter poses even more challenges, with tweets being limited in length and spreading very quickly. Moreover, there is a huge volume of tweets that might not even contain any factual claims to begin with. This situation motivated work on prioritizing tweets by their importance for verification on a given topic. Task 1 in the CheckThat! lab at CLEF 2020 was designed to support research on that specific problem [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the lab, the problem of tweet check-worthiness estimation targeted by Task 1 was defined as follows: “Predict which tweet from a stream of tweets on a topic should be prioritized for fact-checking.”
      </p>
      <p>
        Although the task was offered for both English and Arabic tweets, the bigIR group at Qatar University decided to participate specifically in the Arabic task, since Arabic is one of the most dominant languages on Twitter [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], yet it is still understudied in the fact-checking domain in general. This is our participation for the third year in a row in the Arabic tasks of the CheckThat! lab [7,13].
      </p>
      <p>In Arabic Task 1, organizers provided participants with two datasets. The training dataset includes three topics, each with 500 Arabic tweets annotated for check-worthiness. The test dataset includes twelve topics, each with 500 Arabic tweets [8]. For each test topic, we were asked to return a list of the 500 tweets for the topic ranked by their check-worthiness. We tackled this problem in two ways. In the first, we use traditional learning-based classifiers with hand-crafted features. In the second, we fine-tune a multilingual BERT (mBERT) pre-trained model [6] with a classification layer. The run using mBERT was the best-performing among all of our submitted runs and was ranked 6th among all 28 runs in the lab for this task. These results demonstrate the effectiveness of pre-trained models (and BERT specifically) for the problem of check-worthiness estimation, which is consistent with very recent studies on the problem, including other submissions to the same task [9,10,12].</p>
      <p>We discuss the approach we followed in detail in Section 2 and briefly present our results in comparison to the top teams in the lab in Section 3. We finally provide some concluding remarks and directions for future work in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>We approach check-worthiness ranking by training different classification models, following two main approaches. In the first, we train several common text classification models with hand-crafted features, hypothesizing that they are good discriminators of claim check-worthiness. In the second, we fine-tune a multilingual BERT model [6]. BERT has shown strong performance in multiple text classification tasks, and very recent applications of BERT to the specific problem at hand showed promising results [9,10]. Details on both approaches are presented in this section.</p>
      <sec id="sec-2-1">
        <title>Traditional Classification</title>
        <p>We start by developing 13 features hand-crafted for this task. These features were selected and inspired by many existing studies on fact-checking and check-worthiness ranking. Under the tweet content and structure category, we select features designed to capture tweet objectivity, its relevance to the topic, and its structure. We preprocess both the tweet and the topic (represented using its description). We apply the following preprocessing steps: stop word and URL removal, expansion of hashtags by removing the # symbol and splitting the hashtag by underscores, elimination of special characters (e.g., $), removal of diacritics, and finally normalization of the Arabic text to consolidate multiple spellings of the same character into a single unified form. The computed features are:</p>
        <p>Jaccard Similarity between the topic and the tweet.</p>
        <p>
          Count of entities identified in a tweet using a multilingual named-entity recognition tool [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          Count of polarity words, including positive words (e.g., “Success”) and negative words (e.g., “Corruption”), identified using a large-scale multilingual sentiment lexicon [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We hypothesize that tweets with no factual claims will include more sentiment rather than objective language.
        </p>
        <p>Count of numbers in a tweet.</p>
        <p>Count of quotes in a tweet.</p>
        <p>Count of unique tokens.</p>
        <p>Average of the word embedding vectors representing each token in the
tweet. The embeddings were extracted from a word embedding model
trained over a very large set of Arabic tweets [11]. For this feature, the
tweet was preprocessed using a preprocessor provided by the model
developers.</p>
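        <p>For concreteness, the following is a minimal sketch of the preprocessing pipeline and the Jaccard similarity feature described above; it is an illustration rather than our exact code, and the normalization mappings and function names are assumptions.</p>
        <preformat>
# Illustrative sketch of the preprocessing steps and the Jaccard feature;
# the character mappings below are common choices, not necessarily the
# exact ones used in our runs.
import re

ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652]')  # fathatan .. sukun

def preprocess(text, stopwords=frozenset()):
    text = re.sub(r'https?://\S+', ' ', text)        # remove URLs
    text = text.replace('#', ' ').replace('_', ' ')  # expand hashtags
    text = re.sub(r'[^\w\s]', ' ', text)             # drop special characters
    text = ARABIC_DIACRITICS.sub('', text)           # remove diacritics
    # normalize multiple spellings of the same character into one form
    text = re.sub('[\u0622\u0623\u0625]', '\u0627', text)  # alef variants
    text = text.replace('\u0649', '\u064A')                # alef maksura
    return [t for t in text.split() if t not in stopwords]

def jaccard(tweet, topic):
    a, b = set(preprocess(tweet)), set(preprocess(topic))
    return len(a.intersection(b)) / len(a.union(b)) if a.union(b) else 0.0
        </preformat>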
        <p>
          As for the classifiers, we use three classical classifiers, namely Logistic Regression, Support Vector Machine (SVM), and Random Forest, with default parameters as provided by the scikit-learn Python package (https://scikit-learn.org/stable/). With leave-one-topic-out cross-validation over the training dataset, we apply a stepwise feature selection algorithm in which we greedily add the feature that results in the best average performance over the folds. Eventually, we found that a combination of only three features achieved the best overall performance for all three classifiers. Performance with these three features was superior to that achieved when using all 13 features. The features are word embeddings, isVerified, and count of quoted statements. We use the prediction probability of the positive class (i.e., how probable it is that the tweet is check-worthy) as the ranking score to rank tweets in descending order per topic. We train the models using the three training topics provided by the task organizers [
          <xref ref-type="bibr" rid="ref3">3,8</xref>
          ].
        </p>
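        <p>A minimal sketch of this training and ranking pipeline, assuming a feature matrix X, labels y, a topic id per tweet, and a test matrix X_test (all placeholders), could look as follows; note that cross_val_score defaults to accuracy, whereas the lab's P@30 would require a custom scorer.</p>
        <preformat>
# Sketch: leave-one-topic-out CV, greedy stepwise feature selection, and
# ranking by positive-class probability (scikit-learn, default parameters).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def greedy_feature_selection(X, y, topics, clf):
    logo = LeaveOneGroupOut()
    selected, best_score, candidates = [], -np.inf, set(range(X.shape[1]))
    while candidates:
        # score each remaining feature when added to the current set
        scored = [(cross_val_score(clf, X[:, selected + [f]], y,
                                   groups=topics, cv=logo).mean(), f)
                  for f in candidates]
        score, feat = max(scored)
        if not score > best_score:   # stop when no feature helps
            break
        selected.append(feat); candidates.remove(feat); best_score = score
    return selected

clf = LogisticRegression()                          # defaults, as in the paper
feats = greedy_feature_selection(X, y, topics, clf)
clf.fit(X[:, feats], y)
scores = clf.predict_proba(X_test[:, feats])[:, 1]  # P(check-worthy)
ranking = np.argsort(-scores)                       # descending, per topic
        </preformat>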
      </sec>
      <sec id="sec-2-2">
        <title>Multilingual BERT</title>
        <p>We fine-tune a multilingual BERT (mBERT) model for the task of check-worthiness ranking. In this model, we represent the input as follows: [CLS] + tweet text + [SEP] + topic text + [SEP], where [CLS] is a special classification embedding and [SEP] is a token indicating a separator between the two sentences. The topic was represented by its title concatenated with its description. In order to use the mBERT model for check-worthiness ranking, we add on top of it a dense layer, followed by an output Softmax classification layer to predict the probability of the two possible classes (whether the tweet is check-worthy or not). We fine-tune the model in full, including all layers of mBERT and the classification layer. The probability of the positive class was used as the check-worthiness score by which we rank tweets in descending order per topic.</p>
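        <p>As a sketch, the input representation above can be produced with the Hugging Face transformers tokenizer, which inserts the [CLS] and [SEP] tokens itself; tweet_text, topic_title, and topic_description are placeholders.</p>
        <preformat>
# Assumes the Hugging Face transformers library.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
topic_text = topic_title + ' ' + topic_description  # title + description
enc = tokenizer(tweet_text, topic_text, max_length=128,
                padding='max_length', truncation=True, return_tensors='pt')
# enc['input_ids'][0] encodes: [CLS] tweet text [SEP] topic text [SEP]
        </preformat>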
        <p>We apply light preprocessing to both the tweet and the topic by removing URLs, expanding hashtags by removing the # symbol and splitting the hashtag by underscores, eliminating special characters (e.g., $), and removing diacritics. For the model architecture specifications, we use the uncased mBERT model with 12 layers and 768 hidden units. The dense layer on top of mBERT has 256 hidden units and a ReLU activation function. We use binary cross-entropy loss for training, and set the maximum sequence length to 128 with a training batch size of 32. The model was trained using the three training topics provided by the organizers.</p>
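        <p>A minimal PyTorch sketch of this architecture follows; the layer sizes match the description above, while the class and variable names are assumptions.</p>
        <preformat>
# Sketch of the fine-tuning setup: mBERT, a 256-unit ReLU dense layer, and
# a softmax output over the two classes. All layers are trainable.
import torch
import torch.nn as nn
from transformers import BertModel

class CheckWorthinessModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-multilingual-uncased')
        self.dense = nn.Linear(768, 256)  # 768 = mBERT hidden size
        self.out = nn.Linear(256, 2)      # check-worthy vs. not

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask)[0]
        h = torch.relu(self.dense(hidden[:, 0]))   # [CLS] representation
        return torch.softmax(self.out(h), dim=-1)  # class probabilities

model = CheckWorthinessModel()
# Training uses cross-entropy loss, batch size 32, max sequence length 128;
# the positive-class probability serves as the ranking score.
        </preformat>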
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We submitted four runs for the task, which match exactly the models described in Section 2. Table 1 shows the best run per team for the top three teams in the task, in addition to our remaining runs and the two baselines provided by the task organizers. As shown in the table, the run using mBERT achieved the best performance among all our runs as measured by precision at rank 30 (P@30), which is the official measure of the task. In fact, our team is ranked third among all eight participating teams, with performance comparable to the second-ranked team. We find the mBERT model is our best-performing model by far, which is consistent with its robust and effective performance across different ranking and classification tasks. We also observe that although all three traditional classifiers used the same features, SVM and Logistic Regression both showed superior performance over Random Forest.</p>
      <p>We note here that our experiments on the problem are preliminary; further experiments are needed to improve and understand the results. For example, we observe that only 30% of the training data is check-worthy. Oversampling the positive class might result in better classification performance. Another future experiment is to consider integrating some of the hand-crafted features with the BERT representation in order to represent a claim with more than its content.</p>
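      <p>As a rough illustration of the oversampling idea, positive examples could simply be duplicated until the classes balance; X_pos, y_pos, X_neg, and y_neg below are assumed class-wise splits of the training data.</p>
      <preformat>
# Hypothetical sketch: random oversampling of the minority (positive) class.
import numpy as np
from sklearn.utils import resample

X_pos_up, y_pos_up = resample(X_pos, y_pos, replace=True,
                              n_samples=len(y_neg),  # match majority size
                              random_state=0)
X_balanced = np.vstack([X_neg, X_pos_up])
y_balanced = np.concatenate([y_neg, y_pos_up])
      </preformat>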
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>Our work showed that a simple neural model using multilingual BERT had competitive performance that is superior to traditional classifiers using many hand-crafted features for the task. However, we still need to conduct further experiments with more elaborate parameter optimization and feature selection to draw more concrete conclusions. In comparison to other teams in the lab, we observe that using a language model pre-trained on Arabic data only can yield better performance, and thus we plan to experiment with such models next. We also hypothesize that including some of the hand-crafted features in the neural model can improve performance, and we plan to test this hypothesis in future work.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was made possible by NPRP grant #NPRP11S-1204-170060 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Al-Rfou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perozzi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skiena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Polyglot-NER:
          <article-title>Massive multilingual named entity recognition</article-title>
          .
          <source>Proceedings of the 2015 SIAM International Conference on Data Mining</source>
          , Vancouver, British Columbia, Canada, April 30 – May 2 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alshaabi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dewhurst</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minot</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arnold</surname>
            ,
            <given-names>M.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          :
          <article-title>The growing echo chamber of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009–2020</article-title>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elsayed</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Da San Martino</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasanain</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suwaileh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haouari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Babulkov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamdan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheikh Ali</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          : Overview of CheckThat! 2020:
          <article-title>Automatic Identification and Verification of Claims in Social Media</article-title>
          .
          <source>LNCS (12260)</source>
          , Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.):
          <source>CLEF 2020 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skiena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Building sentiment lexicons for all major languages</article-title>
          .
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers)</source>
          . pp. 383–389 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Haouari, F., Ali, Z., Elsayed, T.: bigIR at CLEF 2019: Automatic Verification of Arabic Claims over the Web. In: Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum (2019)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Hasanain, M., Haouari, F., Suwaileh, R., Ali, Z., Hamdan, B., Elsayed, T., Barrón-Cedeño, A., Da San Martino, G., Nakov, P.: Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media. In: Cappellato et al. [4]</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Kartal, Y.S., Guvenen, B., Kutlu, M.: Too many claims to fact-check: Prioritizing political claims based on check-worthiness. arXiv preprint arXiv:2004.08166 (2020)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Meng, K., Jimenez, D., Arslan, F., Devasier, J.D., Obembe, D., Li, C.: Gradient-based adversarial training on transformer networks for detecting check-worthy factual claims. arXiv preprint arXiv:2002.07725 (2020)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Soliman, A.B., Eissa, K., El-Beltagy, S.R.: AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science 117, 256–265 (2017). https://doi.org/10.1016/j.procs.2017.10.117</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Williams, E., Rodrigues, P., Novak, V.: Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models. In: Cappellato et al. [4]</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Yasser, K., Kutlu, M., Elsayed, T.: bigIR at CLEF 2018: Detection and Verification of Check-Worthy Political Claims. In: Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum (2018)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>