QMUL-SDS at CheckThat! 2021: Enriching
Pre-Trained Language Models for the Estimation of
Check-Worthiness of Arabic Tweets
Amani S. Abumansour1,2 , Arkaitz Zubiaga1
1 Queen Mary University of London, United Kingdom
2 Taif University, Saudi Arabia


Abstract
This paper describes our submission to the CheckThat! Lab at CLEF 2021, where we participated in Subtask 1A (check-worthy claim detection) in Arabic. We introduce our approach to estimating the check-worthiness of tweets as a ranking task. In our approach, we propose to fine-tune state-of-the-art transformer-based models for Arabic, such as AraBERTv0.2-base, and to leverage additional training data from last year’s shared task (CheckThat! Lab 2020) along with the dataset provided this year. According to the official evaluation, our submission obtained joint 4th position in the competition, in which seven other groups participated.

Keywords
check-worthiness, check-worthy claim detection, fact-checking, Arabic NLP.




1. Introduction
Sifting through large volumes of social media content can become a burden for journalists performing fact-checking, which computational journalism approaches to automated fact-checking can help alleviate [1]. The automated fact-checking pipeline encompasses an important sub-task consisting of claim check-worthiness detection, i.e. given a collection of sentences as input, identifying the most prominent claims ranked by check-worthiness [2]. This sub-task can be treated as a text classification task, consisting of determining check-worthy statements, followed by a ranking step based on the check-worthiness score.
   There has been a body of work on claim (check-worthiness) detection in recent years. Both ClaimBuster [3] and CNC [4] used combinations of traditional classifiers, such as SVM and logistic regression. In addition, others have used neural networks, as is the case of Atanasova et al. [5], who considered contextual information and other features along with a feed-forward neural network (FNN). Neural network models were also used by ClaimRank [6] as well as by participants of recent shared tasks at CheckThat! [7, 8].
   Subsequently, the emergence of Bidirectional Encoder Representations from Transformers (BERT) marked a significant milestone in NLP, as it attained state-of-the-art results when applied to several tasks, including text classification [9]. The effectiveness of pre-trained language models

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
a.s.a.abumansour@qmul.ac.uk (A. S. Abumansour); a.zubiaga@qmul.ac.uk (A. Zubiaga)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
stems from the ability of transformers to create embeddings as an output of the pre-training
process. Since then, many efforts have focused on fine-tuning pre-trained language models.
For example, Hasanain and Elsayed [10] fine-tuned multilingual BERT (mBERT), and others
fine-tuned AraBERT [11, 12] in the CheckThat! Lab 2020 [8].
   In our contribution to Subtask 1A [13] of the CheckThat! Lab at CLEF 2021 [14], we started by fine-tuning the latest version of AraBERT. In addition, we investigated the benefits of incorporating the CT20-AR dataset from last year’s edition of the shared task (CheckThat! 2020), as well as of applying preprocessing functions. In what follows we describe our approach and discuss the results we achieved.


2. Approach
This section describes in greater detail the process we followed to handle the task, and is divided into three parts: datasets, data pre-processing, and ranking methodology. The first part describes the datasets we used in our experiments. The second part describes the pre-processing techniques performed prior to training and testing. The third part describes our method to rank Arabic tweets using AraBERTv0.2-base.

2.1. Datasets
The organisers provided the CT21-AR training dataset, which contains Arabic tweets [13]. The CT21-AR dataset provides labels for each entry (sentence) as to whether or not it is a claim, as well as whether or not it is check-worthy. For the purposes of this task, we solely considered the check-worthiness label, where a value of 1 indicates a “check-worthy” tweet and a 0 indicates a “not check-worthy” tweet.
   We also looked into expanding the training data by leveraging additional datasets from previous editions of the CheckThat! shared task. Specifically, we incorporated the CT20-AR dataset from CLEF 2020 [15] along with the CT21-AR dataset for training. CT20-AR only contains the check-worthiness label, i.e. “check-worthy” (1) and “not check-worthy” (0); labels for “claim” or “not claim” were not provided.
   Both datasets, CT20-AR and CT21-AR, are imbalanced. In particular, 25% of the CT21-AR training data are check-worthy claims, with a slightly higher ratio (27.5%) for CT20-AR.
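   As a rough illustration of how the two training sets can be combined, the merge amounts to concatenating the tweets and their binary check-worthiness labels from both releases. The sketch below is illustrative only; file paths and column names are assumptions for the sketch, not the official release format.

    # Sketch only: merge CT20-AR with CT21-AR, keeping the check-worthiness label.
    # File paths and column names are hypothetical.
    import pandas as pd

    ct21 = pd.read_csv("CT21-AR_train.tsv", sep="\t")
    ct20 = pd.read_csv("CT20-AR_train.tsv", sep="\t")

    columns = ["tweet_text", "check_worthiness"]   # assumed column names
    train = pd.concat([ct21[columns], ct20[columns]], ignore_index=True)

    # Fraction of check-worthy tweets (illustrates the class imbalance noted above)
    print(train["check_worthiness"].mean())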

2.2. Pre-processing
The provided dataset contains tweets written by different users and therefore with variations in
style and writing. This can be seen for example in the presence of emojis, hyperlinks, and other
symbols in some of the tweets. In addition, users might mention other users or type hashtags.
Last year’s CheckThat! participants, such as Novak [11], used AraBERT but did not apply any
preprocessing step in their attempts. Use of preprocessing has however been recommended by
Antoun et al. [16], given that it converts the data into a more standard format prior to fine-tune
AraBERT. Therefore, we considered it would be useful to leverage the preprocessor in the setup
of our experiments.
  Hence, we used AraBERT’s preprocess function1 to perform the following:

     • Substitute all URLs, email addresses, and user mentions with [رابط], [بريد], and [مستخدم], respectively.
     • Eliminate line breaks, HTML markup, repeated characters, extra spaces, and unwanted characters such as emoticons.
     • Handle white spaces between words and digits (non-Arabic or English), and/or combinations of both, as well as before and after brackets.

   Additionally, we applied extra functions to replace digits with [رقم] and to remove punctuation marks that were not handled by the previous function, such as # and _. Afterwards, we tokenised all sentences using the BERT fast tokeniser2.
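   A minimal sketch of this pipeline is shown below. It assumes the ArabertPreprocessor class from the arabert repository (footnote 1); the digit and punctuation handling is a simplified stand-in for our additional functions rather than our exact script.

    # Sketch of the pre-processing described above (simplified illustration).
    import re
    from arabert.preprocess import ArabertPreprocessor
    from transformers import BertTokenizerFast

    prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv02")
    tokenizer = BertTokenizerFast.from_pretrained("aubmindlab/bert-base-arabertv02")

    def preprocess_tweet(text: str) -> str:
        # AraBERT preprocessing: URLs/emails/mentions -> placeholders, removal of
        # HTML markup, repeated characters, emojis, extra spaces, etc.
        text = prep.preprocess(text)
        # Replace digits with the [رقم] placeholder
        text = re.sub(r"\d+", "[رقم]", text)
        # Remove punctuation not handled above, such as # and _
        text = re.sub(r"[#_]", " ", text)
        return " ".join(text.split())

    # Tokenisation with the BERT fast tokeniser
    encoded = tokenizer(preprocess_tweet("مثال تغريدة https://t.co/abc 2021"),
                        truncation=True, max_length=128)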

2.3. Ranking Methodology
Subtask 1A in CheckThat! 2021 is evaluated as a ranking task. Therefore, we used the Hugging Face Transformers library [17] to fine-tune the newly released AraBERTv0.2-base model for sequence classification. The logits of the output layer are then passed through a softmax function to obtain a probability distribution over the predicted output classes; we use the resulting probability of the check-worthy class to rank the sentences by check-worthiness. In this way, we estimate the level of check-worthiness for each tweet in the test set.
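   The following sketch illustrates this ranking step. It is a simplified outline under the setup described above, not our exact training script; the model identifier refers to the public AraBERTv0.2-base checkpoint and the training loop itself is elided.

    # Fine-tune AraBERTv0.2-base for sequence classification and rank tweets by the
    # softmax probability of the check-worthy class.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "aubmindlab/bert-base-arabertv02"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    # ... fine-tune `model` on the pre-processed training tweets (e.g. with
    # transformers.Trainer or a standard PyTorch loop) ...

    def checkworthiness_scores(tweets, batch_size=32):
        """Softmax probability of class 1 (check-worthy) for each tweet."""
        model.eval()
        scores = []
        with torch.no_grad():
            for i in range(0, len(tweets), batch_size):
                batch = tokenizer(tweets[i:i + batch_size], padding=True,
                                  truncation=True, max_length=128,
                                  return_tensors="pt")
                probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1]
                scores.extend(probs.tolist())
        return scores

    # The test tweets are then sorted in descending order of these scores.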


3. Results and Discussion
For the experiments, we developed four models as shown in Table 1. Models vary in two aspects:
(1) whether or not the pre-processing component is used, and (2) whether or not the CT20-AR
dataset is used to expand the training data. These variations allowed us to assess the extent to which each of these two factors could lead to improved performance, and subsequently to choose the model to submit to the competition. In Models 1 and 3, both datasets (CT20-AR and CT21-AR) are utilised for the training phase. The other models (2 and 4) are only trained on the current CT21-AR training set. As for the pre-processing, we adopted it in two models, 1 and 2; hence testing all combinations of pre-processing (yes/no) and use of the extra
CT20-AR dataset (yes/no). We consistently use the ranking methodology described in §2.3
throughout the experiments.
   Experiments on the development set enabled us to choose the optimal model to submit to the competition. We found that both Model 2 and Model 4 were overfitting, with better results for Models 1 and 3, which incorporate the CT20-AR data. Thus, Models 1 and 3 seemed to be the better options, so we further explored the effect of the pre-processing step and found that it led to noticeable improvements in performance. This ultimately led to our decision to submit Model 1 to the shared task.
   Further, Table 2 presents the performance of our four models on the test set. All of our models outperformed the n-gram baseline on all metrics.
   1
       https://github.com/aub-mind/arabert
   2
       https://huggingface.co/transformers/model_doc/bert.html#berttokenizerfast
Table 1
Description of our models using different variants of training data and pre-processing.
                          Model                 Datasets             Preprocessing
                          Model_1       CT20-AR + CT21-AR                 Yes
                          Model_2            CT21-AR                      Yes
                          Model_3       CT20-AR + CT21-AR                 No
                          Model_4            CT21-AR                      No


Table 2
Performance of our models on the test set; Model_1 is our primary (submitted) model.
  Model               MAP      MRR          RP        P@1         P@3     P@5     P@10            P@20      P@30
  Model_1            0.597       0.5     0.603           0     0.667      0.6         0.7         0.65       0.72
  Model_2            0.5997       1      0.5868          1     0.6667     0.8         0.9          0.8        0.7
  Model_3            0.5815    0.3333    0.5868          0     0.3333     0.6         0.8          0.7      0.6667
  Model_4            0.5924       1      0.5868          1     0.6667     0.8         0.9          0.8      0.8333
  ngram-baseline      0.428      0.5        0.409        0        0.667   0.6         0.5         0.45          0.44


Table 3
Official results of subtask 1A for Arabic
 Rank          Model          MAP       MRR         RP       P@1      P@3       P@5     P@10         P@20          P@30
    1       Accenture         0.658       1         0.599      1        1        1           1           0.95          0.84
    2         bigIR           0.615      0.5        0.579      0      0.667     0.6         0.6           0.8          0.74
    3        SCUoL            0.612       1         0.599      1        1        1           1           0.95          0.78
    4       ICompass          0.597     0.333       0.624      0      0.333     0.4         0.4           0.5          0.64
    4      QMUL-SDS           0.597      0.5        0.603      0      0.667     0.6         0.7          0.65          0.72
    5      TOBB ETU           0.575     0.333       0.574      0      0.333     0.4         0.4           0.5          0.68
    6     DamascusTeam        0.571      0.5        0.558    0.667     0.6      0.8         0.7          0.64
    7         ibaris          0.548       1         0.55       1      0.667     0.6         0.5           0.4          0.58
          ngram-baseline      0.428      0.5        0.409     0       0.667     0.6         0.5          0.45          0.44


In terms of mean average precision (MAP) specifically, the score of our primary model is very close to those of the other participating systems; it also obtained comparatively high R-Precision (RP) and P@3 scores. However, we observe that both Model 1 and Model 3 fail to place a check-worthy tweet at the top of the ranking (P@1 = 0), which requires further analysis.
   Overall, mean average precision (MAP) is the official metric used for the competition. Table 3 shows the final results compared to the other participants, where our team ranked in joint 4th position.
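   For reference, a simplified illustration of the metric is given below: average precision (AP) is computed over a ranked list of tweets, and MAP is the mean of AP over topics. This is illustrative only; the official scorer may differ in details such as cut-offs and tie handling.

    # Simplified AP/MAP for ranked check-worthiness lists (gold labels are 0/1,
    # ordered by predicted score). Illustrative only.
    def average_precision(ranked_labels):
        hits, precisions = 0, []
        for rank, label in enumerate(ranked_labels, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / max(hits, 1)

    def mean_average_precision(lists_per_topic):
        return sum(average_precision(l) for l in lists_per_topic) / len(lists_per_topic)

    # Example: hits at ranks 1 and 3 give AP = (1/1 + 2/3) / 2 ≈ 0.83
    print(average_precision([1, 0, 1, 0]))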


4. Conclusion
We have described our submission to the CLEF CheckThat! Lab 2021 subtask 1A (claim check-worthiness detection in Arabic). We proposed two variations to further improve a fine-tuned AraBERT model: performing text pre-processing (yes/no) and incorporating additional training data from the CT20-AR dataset (yes/no). We then rank our predictions by the softmax output, which yields the final ranking. During development, we observed that the model making use of both variants (pre-processing and additional data) led to the best performance, and we hence chose to submit this model to the competition. The improved performance obtained with the pre-processing step reinforces our findings from our participation in CheckThat! 2020 [18], which showed that processing special tokens, such as numeric expressions, can be beneficial for the task.
   Our primary model achieved joint 4th place according to the official evaluation, with a MAP score of 0.597. We make some observations that are left for further exploration in future work. First, we plan to dig into the predictions of our models to investigate extreme cases where P@k equals 0 or 1. Second, we aim to tackle the class imbalance of the datasets in order to improve performance. Lastly, we plan to experiment with other Arabic pre-trained language models in the future.


Acknowledgments
This work was supported by the Engineering and Physical Sciences Research Council (grant
EP/V048597/1). Amani S. Abumansour holds a scholarship from Taif University, Saudi Arabia.


References
 [1] A. Zubiaga, Mining social media for newsgathering: A review, Online Social Networks
     and Media 13 (2019) 100049.
 [2] N. Hassan, F. Arslan, C. Li, M. Tremayne, Toward automated fact-checking: Detecting
     check-worthy factual claims by claimbuster, in: Proceedings of the 23rd ACM SIGKDD
     International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1803–1812.
 [3] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph,
     A. Kulkarni, A. K. Nayak, et al., Claimbuster: The first-ever end-to-end fact-checking
     system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948.
 [4] L. Konstantinovskiy, O. Price, M. Babakar, A. Zubiaga, Toward automated factchecking:
     Developing an annotation schema and benchmark for consistent automated claim detection,
     Digital Threats: Research and Practice 2 (2021). URL: https://doi.org/10.1145/3412869.
     doi:10.1145/3412869.
 [5] P. Atanasova, P. Nakov, L. Màrquez, A. Barrón-Cedeño, G. Karadzhov, T. Mihaylova,
     M. Mohtarami, J. Glass, Automatic fact-checking using context and discourse information,
     J. Data and Information Quality 11 (2019). URL: https://doi.org/10.1145/3297722.
     doi:10.1145/3297722.
 [6] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, P. Nakov, ClaimRank: Detecting
     check-worthy claims in Arabic and English, in: Proceedings of the 2018 Conference of the
     North American Chapter of the Association for Computational Linguistics: Demonstra-
     tions, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 26–30.
     URL: https://www.aclweb.org/anthology/N18-5006. doi:10.18653/v1/N18-5006.
 [7] T. Elsayed, P. Nakov, A. Barrón-Cedeño, M. Hasanain, R. Suwaileh, G. Martino,
     P. Atanasova, Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and
     Verification of Claims, 2019, pp. 301–321. doi:10.1007/978-3-030-28577-7_25.
 [8] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh,
     F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, et al., Overview of CheckThat! 2020:
     Automatic identification and verification of claims in social media, in: Proceedings of
     CLEF, Springer, 2020, pp. 215–236.
 [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/
     anthology/N19-1423. doi:10.18653/v1/N19-1423.
[10] M. Hasanain, T. Elsayed, bigIR at CheckThat! 2020: Multilingual BERT for ranking Arabic
     tweets by check-worthiness, in: Proceedings of CLEF (Working Notes), 2020.
[11] V. Novak, Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims
     using transformer-based models, in: Proceedings of CLEF (Working Notes), 2020.
[12] Y. S. Kartal, M. Kutlu, TOBB ETU at CheckThat! 2020: Prioritizing English and Arabic claims
     based on check-worthiness, in: Proceedings of CLEF (Working Notes), 2020.
[13] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, M. K. Alex Nikolov, F. A. Yavuz
     Selim Kartal, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, T. Elsayed, P. Nakov,
     Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in
     tweets and political debates, in: Working Notes of CLEF 2021—Conference and Labs of
     the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021.
[14] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
     F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K.
     Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat!
     lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in:
     Proceedings of the 12th International Conference of the CLEF Association: Information
     Access Evaluation Meets Multiliguality, Multimodality, and Visualization, CLEF ’2021,
     Bucharest, Romania (online), 2021.
[15] M. Hasanain, F. Haouari, R. Suwaileh, Z. S. Ali, B. Hamdan, T. Elsayed, A. Barrón-Cedeño,
     G. D. S. Martino, P. Nakov, Overview of CheckThat! 2020 Arabic: Automatic identification
     and verification of claims in social media, in: Proceedings of CLEF, 2020.
[16] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language
     understanding, in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and
     Processing Tools, with a Shared Task on Offensive Language Detection, 2020, pp. 9–15.
[17] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz,
     J. Davison, S. Shleifer, et al., Transformers: State-of-the-art natural language processing, in:
     Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing:
     System Demonstrations, 2020, pp. 38–45.
[18] R. Alkhalifa, T. Yoong, E. Kochkina, A. Zubiaga, M. Liakata, QMUL-SDS at CheckThat!
     2020: determining COVID-19 tweet check-worthiness using an enhanced CT-BERT with
     numeric expressions, in: Proceedings of CLEF (Working Notes), 2020.