iCompass at CheckThat! 2022: ARBERT and AraBERT for Arabic Checkworthy Tweet Identification

Bilel Taboubi¹, Mohamed Aziz Ben Nessir¹ and Hatem Haddad¹
¹iCompass, Emeraude Palace, Rue du Lac Windermère, Les Berges du Lac, Tunis 1053

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
bileltaboubi20@gmail.com (B. Taboubi); mohamedaziz.bennessir@etudiant-isi.utm.tn (M. A. B. Nessir); haddad.hatem@gmail.com (H. Haddad)
ORCID: 0000-0003-3599-7229 (H. Haddad)

Abstract
This paper provides a detailed overview of our systems and their results, produced as part of the CLEF 2022 CheckThat! Lab on Fighting the COVID-19 Infodemic and Fake News Detection. The task was carried out using the pre-trained transformer models Arabic BERT, ARBERT, MARBERT, AraBERT, Arabic ALBERT, and BERT base Arabic, fine-tuned for the downstream task at hand: binary classification of Arabic tweets. AraBERT attained the highest F1 score, 0.462, on the test set of Subtask 1A, and ARBERT attained the best F1 score, 0.557, on the test set of Subtask 1C.

Keywords
GRU, ARBERT, AraBERT, Arabic

1. Introduction
The spread of fake news and misinformation keeps growing, almost without limit, as social media platforms gain users and allow anyone to create, join, and share articles and information while posing as a news agency or a well-known figure. This causes serious problems for society, partly because more and more people read only the headlines or highlights of news and assume everything is reliable, instead of carefully analysing whether it may contain distorted or false information. Harmful speech is particularly widespread in online communication due to users' anonymity and the lack of harmful speech detection tools on social media platforms. Consequently, harmful speech detection has drawn growing interest in using Machine/Deep Learning techniques to address this issue [1]. The growth of social media has also led to an uncontrollable amount of information being shared daily, impossible to cover by manual fact-checking sites, so organizations and researchers have begun building automated systems that aim to resolve the mess caused by this misinformation.

This paper focuses on the Arabic Subtasks 1A and 1C of CheckThat!, a lab contest offering various tasks to competitors [2]. This year, the lab offered the following three main tasks: Detecting Check-Worthy Claims (Task 1), Fact Checking Claims (Task 2), and Fake News Detection (Task 3). Task 1 was divided into four subtasks, and each of the remaining tasks contains two subtasks. Both Subtasks 1A and 1C were provided in six different languages (Arabic, Bulgarian, Dutch, English, Spanish, and Turkish). Detecting Check-Worthy Claims (Task 1) presents a supervised text classification problem aiming to classify tweets into categories based on their content; the purpose is to develop an automated system that identifies untrustworthy tweets.

2. Related Work
The winning team of the recent CheckThat! Lab 2020 [3] and 2021 [4] editions proposed a solution using two models, BERT and RoBERTa, followed by a mean-pooling layer, a dropout layer, and finally a classification layer.
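As a rough illustration of that design, the sketch below builds one such encoder-plus-mean-pooling head in TensorFlow/Keras. The checkpoint name, sequence length, and dropout rate are our own assumptions for the example; [3, 4] pair heads of this kind with both BERT and RoBERTa encoders.

```python
import tensorflow as tf
from transformers import TFAutoModel

MODEL_NAME = "bert-base-uncased"  # placeholder checkpoint; [3, 4] do not fix one here
SEQ_LEN = 128                     # assumed sequence length

def build_mean_pool_classifier() -> tf.keras.Model:
    input_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="attention_mask")

    encoder = TFAutoModel.from_pretrained(MODEL_NAME)
    hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

    # Mean-pool over real (non-padding) tokens only.
    mask = tf.cast(attention_mask, tf.float32)[:, :, tf.newaxis]
    pooled = tf.reduce_sum(hidden * mask, axis=1) / tf.maximum(tf.reduce_sum(mask, axis=1), 1e-9)

    x = tf.keras.layers.Dropout(0.1)(pooled)  # dropout rate assumed
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
```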
They also used data augmentation: in particular, they generated synthetic training data for the positive class through lexical substitution, and used machine translation to translate the Arabic data to English and then back to Arabic.

The paper [5] evaluates deep learning approaches to fake news detection using supervised text classification algorithms based on Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT). The dataset was preprocessed as follows: removal of HTML tags, conversion of accented characters to ASCII, expansion of contractions, removal of special characters, noise removal, normalization, stemming, and stop-word removal. All transformer-based models outperformed the basic models, with a difference of 3-4% in accuracy. The best accuracy, 98.41%, was reached using language model pre-training on BERT.

Experiments were reported in IEEE Access [6] using the open-source Fake News Corpus dataset available on GitHub, which has been used for determining the veracity of news articles. Text preprocessing was applied to the news articles to transform the text to UTF-8, remove stop words and punctuation, lemmatize the sentences back to their root forms, and lowercase the text. Several deep learning architectures were applied, such as LSTM, GRU, and CNN, with different word embedding techniques: Word2Vec, FastText, and GloVe.

3. Data Description

3.1. Subtask 1A: Check-worthiness of tweets

3.1.1. Dataset Statistics
The dataset for CLEF Subtask 1A contains 3,439 tweets written in Arabic dialect. It originally came in four parts (train, dev, dev test, and test), but we reassembled it into train, development, and test sets as shown in Table 1. Tweets carry binary labels: 1 for worthy claims (38.6% of the data) and 0 for unworthy claims (the remaining 61.4%).

Table 1
Task 1A dataset statistics.
Type             Train  Dev  Test  Total
Worthy claims      962  100   266   1328
Unworthy claims   1551  135   425   2111

3.2. Subtask 1C: Harmful tweet detection

3.2.1. Dataset Statistics
The training dataset provided for CLEF Subtask 1C (harmful tweet detection) contains about 5k tweets, labelled with two categories, Normal and Harmful. 81% of the tweets are Normal and 19% are Harmful, as shown in Table 2. This dataset was also reassembled.

Table 2
Task 1C dataset statistics.
Type     Train  Dev  Dev Test  Total
Harmful    678   60       189    927
Normal    2946  276       805   4027

The dataset is highly imbalanced, so we downsampled the Normal tweets. We tried multiple combinations and percentages, but downsampling to 65%, that is, making the Normal class roughly 1.5 times the size of the Harmful class, gave the best results.

4. Data preparation
We experimented with various preprocessing techniques, such as removing emojis, normalizing hashtags, removing Latin characters, removing URLs, data normalization, deleting tashkeel and the madda mark from the text, and removing duplicates. The best results were obtained on the raw, unpreprocessed data for both Subtasks 1A and 1C.
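Although the raw text ultimately performed best, a minimal sketch of the kind of cleaning we experimented with, together with the Normal-class downsampling used for Subtask 1C, is given below. The Unicode ranges, the emoji pattern, and the helper names are our own illustrative choices, not the exact pipeline from our experiments.

```python
import random
import re

URL_RE = re.compile(r"https?://\S+")
LATIN_RE = re.compile(r"[A-Za-z]+")
# Arabic diacritics (tashkeel) plus the combining maddah mark (U+0653).
TASHKEEL_RE = re.compile(r"[\u064B-\u0652\u0653]")
# Rough emoji coverage: symbols/pictographs plus the dingbats block.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_tweet(text: str) -> str:
    text = URL_RE.sub("", text)
    text = LATIN_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    text = TASHKEEL_RE.sub("", text)
    text = text.replace("#", " ").replace("_", " ")  # normalize hashtags
    return " ".join(text.split())

def downsample_normals(normal: list, harmful: list, ratio: float = 1.5, seed: int = 0) -> list:
    """Keep roughly `ratio` Normal tweets per Harmful tweet (Subtask 1C)."""
    random.seed(seed)
    kept = random.sample(normal, min(len(normal), int(ratio * len(harmful))))
    return kept + harmful
```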
5. Pre-trained Models
Different pre-trained models were used in order to achieve the best results when fine-tuning them in a multi-task fashion.

5.1. AraBERT
AraBERT (V2) [7] is a BERT-based model for Modern Standard Arabic language understanding, trained on 70M sentences from several public Arabic datasets and news websites. It was fine-tuned on three tasks: Sequence Classification, Named Entity Recognition, and Question Answering, and was reported to achieve state-of-the-art performance even on Arabic dialects after fine-tuning.

5.2. BERT base Arabic
The Arabic BERT model [8] was trained on 8.2 billion words from the Arabic version of OSCAR (filtered from Common Crawl), a recent dump of Arabic Wikipedia, and other Arabic resources, summing up to 95GB of text. The final version of the corpus contains some non-Arabic words inline. The corpus and the vocabulary set are not restricted to MSA; they also contain some dialectal (spoken) Arabic, which boosted the model's performance on data from social media platforms.

5.3. ARBERT
ARBERT [9] is also a BERT-based model, trained on 61GB of Modern Standard Arabic text (6.5B tokens) gathered from books, news articles, crawled data, and Wikipedia.

5.4. MARBERT
MARBERT [9] is a large-scale pre-trained language model using the BERT base architecture. MARBERT is trained on 128GB of tweets from various Arabic dialects, each containing at least 3 Arabic words. With very light preprocessing, the tweets were kept almost in their initial state to retain a faithful representation of naturally occurring text.

6. Results

6.1. Subtask 1A: Check-worthiness of tweets
The pre-trained models AraBERT and BERT base Arabic were trained and fine-tuned with the following architecture:
• Input layer
• BERT model
• A gated recurrent unit with 128 units and 0.3 dropout probability
• A dense layer with 50 units and a ReLU activation function
• A dropout layer with 0.1 probability
• A dense layer with one unit and a Sigmoid activation function

The best results achieved by each pre-trained model on the development set are presented in Table 3; the models were trained on the train set, validated on the development set, and later evaluated on the test set.

Table 3
Task 1A pre-trained model results on the dev set.
Type              F1     Accuracy  Precision  Recall
AraBERT           0.590  0.536     0.453      0.844
BERT base Arabic  0.576  0.672     0.601      0.554

The submitted model was AraBERT, trained for 10 epochs with a 2e-5 learning rate for the Adam optimizer, a sequence length of 150, a batch size of 32, and a binary cross-entropy loss function. The model achieved an F1 score of 0.590 on the dev set and 0.462 on the submission test set, ranking third on the Subtask 1A Arabic leaderboard, as shown in Table 4.

Table 4
Top 3 on the Subtask 1A Arabic leaderboard.
Participants (userid/team-name)  Subtask                        F1 (positive class)
elfsong                          Subtask-1A-Checkworthy-Arabic  0.628
mkutlu                           Subtask-1A-Checkworthy-Arabic  0.495
HatemHaddad                      Subtask-1A-Checkworthy-Arabic  0.462

6.2. Subtask 1C: Harmful tweet detection
All the models were fine-tuned with the following head (a minimal sketch is given below):
• A gated recurrent unit with 256 units and 0.5 dropout
• A gated recurrent unit with 128 units and 0.4 dropout
• A gated recurrent unit with 64 units and 0.3 dropout
• A 1-dimensional convolutional neural network with 64 units and a kernel size of 3
• A 0.3 dropout layer
• A layer concatenating the Global Average Pooling 1D and Global Max Pooling 1D of the previous output
• A 0.05 dropout layer
• A final dense layer with one unit and a Sigmoid activation function

All of the models' results are presented in Table 5.

Table 5
Task 1C pre-trained model results on the dev set.
Type     F1     Accuracy  Precision  Recall
ARBERT   0.775  0.905     0.857      0.707
AraBERT  0.750  0.890     0.867      0.661
MARBERT  0.700  0.885     0.703      0.696

The best results were achieved with ARBERT. The submitted model was trained for a total of 16 epochs: the first 4 epochs were used only to warm up the GRU layers, freezing ARBERT and training them with a learning rate of 1e-4; for the remaining 12 epochs we unfroze ARBERT and used a learning rate of 1e-5. For both phases we used the Adam optimizer, a batch size of 64, and a binary cross-entropy loss function.
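A minimal TensorFlow/Keras sketch of this head and the two-phase schedule follows. The Hugging Face checkpoint name, the sequence length, and the train_ds/val_ds dataset objects (assumed to be batched at 64) are illustrative assumptions rather than details of the actual submission.

```python
import tensorflow as tf
from transformers import TFAutoModel

SEQ_LEN = 128  # assumed; the paper does not state the Subtask 1C sequence length

def build_subtask_1c_model(model_name: str = "UBC-NLP/ARBERT"):
    input_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="attention_mask")

    encoder = TFAutoModel.from_pretrained(model_name)
    x = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

    # Stacked GRUs return sequences so the Conv1D and pooling layers
    # still receive a (batch, seq, features) tensor.
    x = tf.keras.layers.GRU(256, dropout=0.5, return_sequences=True)(x)
    x = tf.keras.layers.GRU(128, dropout=0.4, return_sequences=True)(x)
    x = tf.keras.layers.GRU(64, dropout=0.3, return_sequences=True)(x)
    x = tf.keras.layers.Conv1D(64, kernel_size=3)(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Concatenate()([
        tf.keras.layers.GlobalAveragePooling1D()(x),
        tf.keras.layers.GlobalMaxPooling1D()(x),
    ])
    x = tf.keras.layers.Dropout(0.05)(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model([input_ids, attention_mask], output), encoder

def train_two_phase(model, encoder, train_ds, val_ds):
    # Phase 1: freeze ARBERT and warm up the GRU head (4 epochs, lr 1e-4).
    encoder.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=4)

    # Phase 2: unfreeze ARBERT and fine-tune end to end (12 epochs, lr 1e-5).
    encoder.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=12)
```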
The model achieved an F1 score of 0.557 on the test set and ranked first; the subtask participants are shown in Table 6.

Table 6
Top 3 on the Subtask 1C Arabic leaderboard.
Participants (userid/team-name)  Subtask                    F1 (positive class)
HatemHaddad                      Subtask-1C-Harmful-Arabic  0.557
mkutlu                           Subtask-1C-Harmful-Arabic  0.268
random-baseline                  Subtask-1C-Harmful-Arabic  0.118

7. Discussion

7.1. Subtask 1A: Check-worthiness of tweets
BERT base Arabic and AraBERT were chosen for this subtask based on recent studies. However, AraBERT outperformed BERT base Arabic and reached the best results, since it was trained on a corpus with a larger vocabulary and more than 8.6B words. The F1 scores attained by both models were low, due to the imbalance present in the data and to the semantic similarity between the texts of worthy and unworthy tweets.

7.2. Subtask 1C: Harmful tweet detection
Different language models were used in this work; however, ARBERT achieved the best results. This was the case because it was pre-trained on Modern Standard Arabic text with little to no normalization, and therefore works better for our case. In addition, the data imbalance, further illustrated in Figure 1, decreased the model's performance, causing it to easily overfit on the training dataset.

Figure 1: Subtask 1C harmful speech statistics.

8. Conclusion
In this paper, we demonstrated the performance of gated recurrent units on the subtasks Harmful tweet detection and Check-worthiness of tweets by fine-tuning the pre-trained models ARBERT and AraBERT. Despite the small amount of annotated data, the models achieved satisfactory results. With respect to the models, further work should explore meta-learning, focal loss, and semi-supervised learning. As for the data, further work should focus on exploring other augmentation and resampling strategies, collecting more harmful tweets for Subtask 1C, and extracting features such as account type, number of likes, and number of shares from the tweet links provided within the data, for more distinguishability between worthy and unworthy claims.

References
[1] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 1–10. URL: https://aclanthology.org/W17-1101. doi:10.18653/v1/W17-1101.
[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF 2022, Bologna, Italy, 2022.
[3] E. Williams, P. Rodrigues, V. Novak, Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models, 2020. URL: https://arxiv.org/abs/2009.02431. doi:10.48550/ARXIV.2009.02431.
[4] E. Williams, P. Rodrigues, S. Tran, Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation, 2021. URL: https://arxiv.org/abs/2107.05684. doi:10.48550/ARXIV.2107.05684.
[5] A. Wani, I. Joshi, S. Khandve, V. Wagh, R. Joshi, Evaluating deep learning approaches for COVID-19 fake news detection, in: Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer International Publishing, 2021, pp. 153–163. URL: https://doi.org/10.1007/978-3-030-73696-5_15. doi:10.1007/978-3-030-73696-5_15.
[6] V.-I. Ilie, C.-O. Truică, E. S. Apostol, A. Paschke, Context-aware misinformation detection: A benchmark of deep learning architectures using word embeddings, IEEE Access PP (2021) 1–1. doi:10.1109/ACCESS.2021.3132502.
[7] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, 2020. URL: https://arxiv.org/abs/2003.00104. doi:10.48550/ARXIV.2003.00104.
[8] A. Safaya, M. Abdullatif, D. Yuret, KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online), 2020, pp. 2054–2059. URL: https://aclanthology.org/2020.semeval-1.271. doi:10.18653/v1/2020.semeval-1.271.
[9] M. Abdul-Mageed, A. Elmadany, E. M. B. Nagoudi, ARBERT & MARBERT: Deep bidirectional transformers for Arabic, 2021. URL: https://arxiv.org/abs/2101.01785. doi:10.48550/ARXIV.2101.01785.