Fight for 4230 at CheckThat! 2021: Domain-Specific
Preprocessing and Pretrained Model for Ranking
Claims by Check-Worthiness
Xinrui Zhou*1 , Bohuai Wu*1 and Pascale Fung1
1
 The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong Special Administrative
Region of the People’s Republic of China


                                         Abstract
                                         The widespread dissemination of false news on social media has had negative effects on society. In
                                         this paper, we describe a model submitted to the CLEF-2021 CheckThat! Task 1 - English, which
                                         estimates the check-worthiness of tweets and political debates/speeches. Our official submission
                                         ranked 2nd in subtask 1A with a MAP score of 0.195 and 1st in subtask 1B with a MAP score of 0.402.
                                         The main challenges of Task 1 are the imbalanced data and the non-standard text of tweets. We
                                         performed thorough data preprocessing and mainly focused on combining different pretrained models
                                         with a dropout layer and a dense linear layer. We explored and experimented with many combinations
                                         of data preprocessing techniques and augmentation methods, and we also tried extracting additional
                                         features from metadata and ensembling the best-performing models for further improvement. We
                                         developed a preprocessing procedure for tweets, and our experiments show that domain-specific
                                         preprocessing and pretrained models can significantly improve performance. Finally, for subtask 1A
                                         we submitted the result produced by the BERTweet model with an extra dropout layer and classifier
                                         layer on preprocessed data, and for subtask 1B the RoBERTa model fine-tuned on the
                                         tweets_hate_speech_detection dataset with an extra dropout layer and classifier layer.

                                         Keywords
                                         check-worthiness, data preprocessing, BERTweet, distilRoBERTa




1. Introduction
Social media has become increasingly popular for information seeking and consumption due to
its low cost, easy access and rapid dissemination of information; however, it also facilitates the
release and spread of rumors and false information [1]. Fake news on social media presents unique
characteristics, with large differences in content, format and writing style, which make existing
detection algorithms designed for traditional news media ineffective or inapplicable and thus pose
new challenges to fake news detection [1]. This problem has become more serious and urgent during
the pandemic. As COVID-19 spread, social media played an important role in socializing and also
served as a quick channel to seek and share information about the disease. This enabled an explosion
of unchecked information

                  * These authors contributed equally to this work and should be considered as co-first authors.
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" xzhouaz@connect.ust.hk (X. Zhou*); bwual@connect.ust.hk (B. Wu*); pascale@ece.ust.hk (P. Fung)
~ https://pascale.home.ece.ust.hk/ (P. Fung)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
and the spread of misinformation [2]. More than 400 people in Iran died from drinking toxic
substances because of rumors, widely shared on social media, that high-proof alcohol can cure
COVID-19 [3]. The World Health Organization stated that the massive amount of misleading
information on social media during the pandemic had created an 'infodemic' and severely threatened
public health [4]. Thus, fast false-claim identification has become a crucial and challenging task,
especially during the pandemic.
   Furthermore, false news related to the pandemic is updated and spread quickly on social media,
while manual fact-checking is time-consuming and inefficient. Automated false news detection is
therefore of great significance to reduce the human burden.
   However, even with automated detection, we cannot check every single claim on social networks.
In 2020, over 500 million tweets were sent every day on average [5]. This creates the need to pre-filter
and prioritize what should be passed to the subsequent fact-checking pipeline, namely the task of
check-worthiness estimation. A leading contribution in check-worthiness estimation has been the
CLEF CheckThat! Lab, which has included a check-worthiness estimation task in each of its four
editions. In this paper, we present our approaches to subtask 1A and subtask 1B of the CLEF-2021
CheckThat! Lab in English [6]:
   Subtask 1A - Check-worthiness of tweets: Given a topic and a stream of potentially
related tweets, rank the tweets according to their check-worthiness for the topic.
   Subtask 1B - Check-Worthiness of Debates/Speeches: Given a political debate/speech,
produce a ranked list of its sentences, ordered by their check-worthiness. This is a ranking task.
   To address this challenge, we mainly focus on pretrained models with a dropout layer and a dense
linear layer, and we explore and experiment with many combinations of data preprocessing and
augmentation methods, additional features, and ensembling methods. The contributions of this paper
are twofold: first, we develop a useful automatic preprocessing procedure to effectively process
tweets before analysis; second, we show that domain-specific preprocessing and pretrained models
can significantly improve the performance of filtering out check-worthy claims.


2. Related Work
The ClaimBuster system is one of the earliest end-to-end systems for check-worthiness estimation
and fact-checking [7]. The system is still in operation and can detect claims worth checking in live
discourse, social media and news. It used various supervised learning methods, including the
Multinomial Naive Bayes Classifier (NBC), Support Vector Machines (SVM) and the Random Forest
Classifier (RFC), together with different features such as word embeddings, Part-of-Speech (POS)
tags, entity types, sentiment, length and the 100 most important other features.
  Another online system for check-worthiness detection is ClaimRank, which is trained on actual
annotations and can work for various kinds of text [8].
  In Task 1 of the CLEF-2020 CheckThat! Lab, participants investigated more methods and models
than in the 2019 edition. The models used by the participants include BERT [9, 10, 11, 12],
RoBERTa [13, 14, 15], BiLSTM [16], CNN, Random Forest [17], logistic regression and SVM [9].
Many groups also combined
Table 1
Information of Data in Subtask 1A
                               Dataset      Total   Check-worthy
                                    Train   822         290
                                     Dev    140          60
                                    Test    350          19


other representations, such as FastText, GloVe, dependencies, POS tags and named entities. To deal
with the limited amount of training data, several groups experimented with external data and graph
relations. According to the overview of Task 1 [18], the top-ranked team, Accenture, used RoBERTa
with an extra mean-pooling and dropout layer, which proved more useful than additional data
preprocessing.
   It is worth mentioning that most of the above systems focus on tweets, while we need to deal with
both tweets and speeches/debates. We revisit some of the features of these systems, and we focus on
preprocessing techniques suited to tweets and on different combinations of various models and the
features listed above.


3. Dataset Analysis and Processing
For the tweets in subtask 1A, to better understand the relationship between different features and the
corresponding check-worthiness, we performed exploratory data analysis on the training data. Based
on this analysis, we preprocessed the data to remove useless words and expand abbreviations. To
address the limited size of the training data, we applied data augmentation techniques, such as back
translation, to produce more data.
   We then extracted different features, such as Node2Vec word embeddings, text meta features and
sentence embeddings produced by different models, for training.

3.1. Exploratory Data Analysis
In subtask 1A, the organizers provided datasets of tweets collected from a variety of COVID-19-related
topics. The selected tweets are manually annotated and considered check-worthy if they contain a
verifiable factual claim that needs a professional fact-checker to verify.
   We performed an exploratory analysis of tweet metadata, including word_count, unique_word_count,
stop_word_count, punctuation_count, china_mention_count, url_count, wuhan_mention_count,
mean_word_length, char_count, hashtag_count, @_count and number_count. We found that most of
the meta-features carry information about the target, as they have very different distributions for
check-worthy and non-check-worthy tweets (e.g., stop_word_count and mean_word_length). These
features might be useful in models. It appears that check-worthy tweets are written more formally
and with longer words than non-check-worthy tweets.
Figure 1: Distributions of word_count, unique_word_count, stop_word_count, punctuation_count,
china_mention_count and mean_word_length.


3.2. Data Preprocessing
We applied different techniques to preprocess the raw tweet text and developed several handcrafted
domain-specific dictionaries. Our preprocessing procedure applies the following rules, in order; a
minimal code sketch follows the list:

    • Normalize all punctuation to English
    • Clear entity references
    • Remove all links
    • Clean punctuation and non-ASCII characters except emojis: In the BERTweet model, each
      emoji icon is translated into a text string as a word token using the emoji package1 . Emoji
      icons can reveal the sentence sentiment and are relevant to check-worthiness.
    • Expand shortened quantities: The presence of quantities and numbers can influence the
      check-worthiness label. We replaced tokens such as 6m, 12k and 8wks with expanded
      quantities like 6 million, 12 thousand and 8 weeks.
    • Expand contractions: We used a handcrafted dictionary to expand contractions and
      correct misspellings in contractions, such as y’all and youve.

   1
       Available at https://pypi.org/project/emoji/
    • Unification of all coronavirus synonyms: The dataset contains different forms of
      hashtags that refer to COVID-19, such as #covid2019, #CoronaVirus and #coronavirus19.
      We unified all coronavirus synonyms to the term coronavirus, including different spellings
      such as #korona and #koronawirus. We also expanded all word forms and hashtags that
      contain the word corona, e.g., replacing CoronaOutbreak with coronavirus outbreak.
    • Transform slang: The tweet dataset contains many slang terms that can affect the
      semantic meaning of the sentences. We developed a handcrafted dictionary to transform
      them into forms the models can recognize, such as transforming w/o to without, lmao to
      laughing and RT to retweet.
    • Expand hashtags: Many hashtags are used directly as subjects or objects in tweets, so
      expanding them can better help the model understand the semantics. We used a
      handcrafted dictionary to expand them, e.g., POTUS to the president of the United States.
      We also tried the wordninja package2 , which splits hashtags into separate words based on
      probability. However, it performed poorly on our dataset because it cannot reliably identify
      which words should be processed and can mistakenly split words like "the" into "t" and "he".

3.3. Data Augmentation
The training data is small and can easily lead to overfitting. To increase the amount of data and help
reduce overfitting, we employ several data augmentation methods.

3.3.1. Back Translation
We translate each tweet into a temporary destination language (French or Dutch) and then translate
the result back into English to produce new sentences that express the same meaning in different
words. After back translation, the training data is three times its original size.
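
   The text does not name the translation tool; the sketch below shows one way to implement the
round trip with MarianMT models from the Hugging Face Hub (the Helsinki-NLP/opus-mt-* model
names and the pivot choice are assumptions). Setting pivot="nl" gives the Dutch round trip.

    from transformers import MarianMTModel, MarianTokenizer

    def load(name):
        return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

    def translate(texts, tokenizer, model):
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        return tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

    def back_translate(texts, pivot="fr"):
        # English -> pivot language -> English, yielding paraphrased copies of the tweets.
        tok_fwd, mod_fwd = load(f"Helsinki-NLP/opus-mt-en-{pivot}")
        tok_bwd, mod_bwd = load(f"Helsinki-NLP/opus-mt-{pivot}-en")
        return translate(translate(texts, tok_fwd, mod_fwd), tok_bwd, mod_bwd)

    augmented = back_translate(["Rumors about fake cures spread quickly on social media."])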

3.3.2. Synonym Replacement
We also use WordNet to identify relevant synonyms and randomly select n words in the tweet text to
replace with their synonyms, producing new sentences.
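
   A minimal sketch of this replacement step using NLTK's WordNet interface; the number of
replaced words n and the random selection strategy are assumptions.

    import random
    from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

    def synonym_replacement(text: str, n: int = 2) -> str:
        words = text.split()
        candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        for i in random.sample(candidates, min(n, len(candidates))):
            synonyms = {lemma.name().replace("_", " ")
                        for synset in wordnet.synsets(words[i])
                        for lemma in synset.lemmas()}
            synonyms.discard(words[i])
            if synonyms:
                words[i] = random.choice(sorted(synonyms))  # replace with a random synonym
        return " ".join(words)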

3.4. Feature Extraction
3.4.1. Word2vec Word Embedding (WE)
Word embeddings can capture semantic and syntactic features of words. We therefore represent each
sentence as the average vector of its words, using a word2vec model pretrained on Google News with
a vector size of 300.
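
   A minimal sketch of this averaging scheme with gensim's pretrained Google News vectors; the
gensim-data package name and the out-of-vocabulary handling are assumptions.

    import numpy as np
    import gensim.downloader

    w2v = gensim.downloader.load("word2vec-google-news-300")  # 300-dimensional vectors

    def sentence_vector(text: str) -> np.ndarray:
        # Average the vectors of in-vocabulary words; fall back to a zero vector otherwise.
        vectors = [w2v[w] for w in text.split() if w in w2v]
        return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)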




   2
       Available at https://pypi.org/project/wordninja/
3.4.2. Text Meta Feature (TMF)
Metadata of tweets might be an indicator of check-worthy claims. We used the following information
as features: word count, hashtag count, presence of a link and punctuation count.
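
   A minimal sketch of how these metadata features can be computed from the raw tweet text; the
exact counting rules are assumptions.

    import string

    def text_meta_features(tweet: str) -> dict:
        return {
            "word_count": len(tweet.split()),
            "hashtag_count": tweet.count("#"),
            "has_link": int("http" in tweet),                                   # presence of a link
            "punctuation_count": sum(ch in string.punctuation for ch in tweet),
        }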

3.4.3. Sentence Embedding (SE)
After obtaining the word embeddings from a BERT-based pretrained model, we used the original
word-embedding matrix, the average of all word embeddings, and the concatenation of all word
embeddings as three different sentence embeddings.
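
   A minimal sketch of the three variants using a Hugging Face encoder; the vinai/bertweet-base
checkpoint is used here purely for illustration.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
    encoder = AutoModel.from_pretrained("vinai/bertweet-base")

    def sentence_embeddings(text: str) -> dict:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state.squeeze(0)  # (seq_len, hidden_size)
        return {
            "matrix": hidden,            # original word-embedding matrix
            "mean": hidden.mean(dim=0),  # average over all word embeddings
            "concat": hidden.flatten(),  # concatenation of all word embeddings
        }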


4. Models
In this part, we use the word embeddings and other features with various models, training on the
training data and evaluating on the validation set. We compare different methods to obtain an overall
picture of how different natural language processing approaches perform on check-worthiness tasks.

4.1. BERT-based Classification Model
BERT [19], RoBERTa [20], BERTWWM and BERTweet [21] are the principal models we used for
subtask 1A, and distilRoBERTa [22] for subtask 1B.
  For BERT, we used the BERT-large pretrained model, with 24 transformer layers, a hidden size of
1024 and 16 self-attention heads (336M parameters in total), pretrained on lower-cased English text.
RoBERTa and BERTWWM use the same architecture as BERT-large, with 355M and 336M
parameters respectively.
  BERTweet is the first public large-scale pretrained language model for English tweets, built on the
BERT-base architecture with the RoBERTa pretraining procedure [20]. BERTweet outperforms
previous state-of-the-art models on POS tagging, named-entity recognition and text classification
for English tweets. Therefore, our models mainly used the pretrained BERTweet model released by
VinAI, and our final model for subtask 1A is the BERTweet model trained on preprocessed data.
  DistilRoBERTa is a distilled version of RoBERTa that is faster than the original model. For subtask
1B, we mainly used the distilroberta-finetuned-tweets-hate-speech model released by mrm8488, which
is the distilroberta-base architecture fine-tuned on the tweets-hate-speech dataset.
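
   The shared architecture in these experiments is a pretrained encoder followed by one dropout
layer and one dense classifier layer. A minimal PyTorch sketch is shown below; the pooling strategy
(first token) and the dropout rate are assumptions, since the text does not specify them.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class CheckWorthinessClassifier(nn.Module):
        """Pretrained encoder + one dropout layer + one dense classifier layer."""

        def __init__(self, model_name: str = "vinai/bertweet-base",
                     num_labels: int = 2, dropout: float = 0.1):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.dropout = nn.Dropout(dropout)
            self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            pooled = hidden[:, 0]                       # representation of the first (<s>/[CLS]) token
            return self.classifier(self.dropout(pooled))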

4.2. Ensembling Models
To combine different natural language processing tools, we also tried several ensembling models.
The first combines the four top models with voting or weighted averaging. The second feeds the
sentence embedding obtained from the BERTweet model into an AdaBoost regressor [23] or a logistic
regression model [24]. The third takes the prediction values from the BERT-based
Table 2
Original Baselines
                                     Model               MAP
                                     Random Baseline     0.4795
                                     Ngram Baseline      0.5916


Table 3
BERT-based Models With One Extra Dropout And Classifier Layer.
                     Model                      MAP       P@3     P@5   P@10
                     BERT                       0.7074      1      1     1
                     BERTWWM                    0.8030      1      1     1
                     RoBERTa                    0.7765      1      1     1
                     BERTweet                   0.8136      1      1     1
                     BERTweet w/ preprocessed   0.8753      1      1     1
                     BERTweet w/ augmented      0.8205      1      1     1


classification model together with metadata as features, combines them with the sentence embeddings
into a new sentence representation, and feeds the result into an AdaBoost regressor, a logistic
regression model or an SVM [25].
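
   A minimal sketch of this third variant, where model predictions and metadata are stacked with
sentence embeddings and fed to a linear classifier; the placeholder arrays and their shapes are
illustrative only, standing in for the real features described above.

    import numpy as np
    from sklearn.svm import LinearSVC

    def stack_features(sent_emb, bert_pred, meta):
        # Concatenate sentence embeddings, BERT prediction scores and metadata features.
        return np.hstack([sent_emb, bert_pred.reshape(-1, 1), meta])

    # Placeholder arrays standing in for the real features.
    rng = np.random.default_rng(0)
    sent_emb, bert_pred, meta = rng.normal(size=(100, 768)), rng.random(100), rng.random((100, 4))
    labels = rng.integers(0, 2, size=100)

    ranker = LinearSVC().fit(stack_features(sent_emb, bert_pred, meta), labels)
    scores = ranker.decision_function(stack_features(sent_emb, bert_pred, meta))  # ranking scores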


5. Experiments
In this part, we present the experiments carried out for subtask 1A. The results include precision@K
(K = 3, 5, 10) and Mean Average Precision (MAP), compared across the two official baselines and
the different models, followed by an analysis of the improvements.
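
   For reference, a minimal sketch of these ranking metrics using their standard definitions; the
official evaluation script may differ in details such as tie handling.

    def precision_at_k(ranked_labels, k):
        # ranked_labels: 1 for check-worthy, 0 otherwise, in predicted ranking order.
        return sum(ranked_labels[:k]) / k

    def average_precision(ranked_labels):
        hits, precisions = 0, []
        for rank, rel in enumerate(ranked_labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / max(hits, 1)

    # MAP is the mean of average_precision over topics (1A) or speeches/debates (1B).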

5.1. Experiments for Subtask 1A
For subtask 1A, our experiments can be divided into three parts: a comparison among BERT-based
models with one dropout and one classifier layer, the same BERT-based models trained with the
KFold algorithm, and different ensembling models.

5.1.1. BERT-based Models
In this part, we used the original BERT-large, BERTWWM, RoBERTa-large and BERTweet models,
each with one extra dropout layer and one classifier layer (a dense linear layer). We also trained the
models on the preprocessed and augmented datasets.
   There are two official baselines, a random baseline and an ngram baseline, which use random
guessing and ngram-based prediction respectively. Table 2 shows their MAP results.
Table 4
BERT-based Models With KFold algorithm.
                  Model                        MAP         P@3       P@5       P@10
                  BERT                        0.7234        1         1          1
                  BERTWWM                     0.7752        1         1          1
                  RoBERTa                     0.8005        1         1          1
                  BERTweet                    0.8332        1         1          1
                  BERTweet w/ preprocessed    0.8370        1         1          1


Table 5
Ensembling Models.
                      Model                                  MAP      P@3   P@5   P@10
                      BERTWWM + SE(mean) + AdaBoost          0.7243    1     1     1
                      BERTWWM + SE(concat) + AdaBoost        0.7134    1     1     1
                      Node2vec + LR                          0.6454    1     1     1
                      BERTWWM + Node2vec + LR                0.7055    1     1     1
                      BERTWWM pred + Node2vec + LinearSVC    0.7759    1     1     1
                      Voting                                 0.8547    1     1     1


   For each model in Table 3, we trained for 3 epochs with a batch size of 16 and a maximum sequence
length of 128, using learning rates of 3e-5 and 5e-5. According to these experiments, the BERTweet
model with one dropout layer and one classifier layer achieves the highest MAP.

5.1.2. KFold BERT-based Models
As shown in Table 4, we used StratifiedKFold with 20 splits to train the same BERT-based models.
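
   A minimal sketch of the 20-split setup with scikit-learn; the shuffling, random seed, placeholder
data and the way fold models are combined are assumptions not specified in the text.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # Placeholder data matching the subtask 1A training size in Table 1 (822 tweets, 290 check-worthy).
    texts = np.array([f"tweet {i}" for i in range(822)])
    labels = np.array([1] * 290 + [0] * 532)

    skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        train_texts, val_texts = texts[train_idx], texts[val_idx]
        # Fine-tune a fresh copy of the BERT-based classifier on each training split here,
        # then score the held-out data; how the 20 fold models are combined is not specified.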

5.1.3. Ensembling Models
Table 5 shows several ensembling methods. We fed the BERTWWM sentence embedding (mean or
concatenation) into an AdaBoost regressor. We also added other text meta-features, and node2vec
embeddings were combined with a logistic regression model. Moreover, we treated the prediction
value as an additional feature and fed it, together with the node2vec sentence embedding, into a
LinearSVC. Finally, we combined several models with different voting weights.
Table 6
DistilRoBERTa for Subtask 1B
                           Model                                  MAP
                           distilroberta + dropout + classifier   0.1696


Table 7
Final Models for Subtask 1A and 1B
 Tasks    Subtask 1A                                               Subtask 1B
 Model    BERTweet + Dropout + Classfier w/ preprocessed data      distilroberta + dropout + classifier
 MAP      0.195                                                    0.402
 P@3      0.333                                                    0.833
 P@5      0.400                                                    0.750
 P@10     0.400                                                    0.600


5.2. Experiments for Subtask 1B
For subtask 1B, based on the experiments for subtask 1A, we simply used the distilroberta-finetuned-
tweets-hate-speech model with one dropout layer and one classifier layer. Table 6 shows the average
MAP over 9 different speeches.


6. Results and Discussion
According to Table 3, BERT-based models show a strong ability to handle this classification task.
Among BERT, BERTWWM, RoBERTa and BERTweet, the latter three are more powerful than the
basic BERT model, owing to the improved masking schemes used by RoBERTa and BERTWWM
and to the domain-specific pretraining of BERTweet. The experiments with augmented and
preprocessed data show that both augmentation and, especially, preprocessing help the models better
understand the training data, whereas the Text Meta Features (TMFs) are not very effective.
   Comparing Table 3 and Table 4, KFold training improves most of the BERT-based models,
although the improvement for the BERTweet model with preprocessed data is small. KFold can still
be useful for a limited dataset with feature fine-tuning.
   The comparison between Table 5 and Table 3 shows that one dropout layer with one classifier
layer is more useful than a simple adaptive boosting algorithm or logistic regression. The experiments
also showed that the model with the highest accuracy dominates the other models, so the ensemble
achieves better performance when a larger weight is assigned to the strongest model.
   The final models submitted for subtask 1A and subtask 1B are shown in Table 7.
7. Conclusion and Future Work
In this paper, we present our models and efforts for Task 1 of the CLEF-2021 CheckThat! Lab. For
subtask 1A, we used three main approaches: BERT-based models with an extra dropout layer and
classifier layer, the KFold algorithm, and ensembling models. We adopted various data preprocessing
and augmentation techniques to improve accuracy. The main contributions of this paper are, first,
the development of a useful automatic preprocessing procedure to effectively process tweets before
analysis and, second, showing that domain-specific preprocessing and pretrained models can
significantly improve the performance of filtering out check-worthy claims.
   In the final submission for subtask 1A, our system is the BERTweet model with one dropout layer
and one classifier layer, without the KFold algorithm, trained on the preprocessed training dataset; it
ranked 2nd (out of 9 groups) on the official evaluation metric. For subtask 1B, we used the
distilroberta-finetuned-tweets-hate-speech model followed by one dropout layer and one classifier
layer, which ranked 1st on the official evaluation metric.
   In future work, we plan to experiment with more ensembling techniques as well as additional
features such as sentence sentiment, POS tags, and social characteristics of tweets like the number of
retweets and likes.


Acknowledgments
This research was partially supported by Yu Tiezheng, Dai Wenliang and Ji Ziwei. Zhou Zhuorui
contributed one of the ensembling models and several ideas about the preprocessing.


References
 [1] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data
     mining perspective, ACM SIGKDD explorations newsletter 19 (2017) 22–36.
 [2] S. B. Naeem, R. Bhatti, A. Khan, An exploration of how fake news is taking over social
     media and putting public health at risk, Health Information & Libraries Journal (2020).
 [3] B. Trew, Hundreds dead in iran from drinking methanol amid fake reports it
     cures coronavirus, 2020. URL: https://www.independent.co.uk/news/world/middle-east/
     iran-coronavirus-methanol-drink-cure-deaths-fake-a9429956.html.
 [4] Infodemic,         2020.       URL:       https://www.who.int/health-topics/infodemic/
     the-covid-19-infodemic#tab=tab_1.
 [5] D. Sayce, The number of tweets per day in 2020, 2020. URL: https://www.dsayce.com/
     social-media/tweets-day/.
 [6] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal,
     F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, T. Elsayed, P. Nakov,
     Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in
     tweets and political debates, in: Working Notes of CLEF 2021—Conference and Labs of
     the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021.
 [7] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph,
     A. Kulkarni, A. K. Nayak, et al., Claimbuster: The first-ever end-to-end fact-checking
     system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948.
 [8] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, P. Nakov, Claimrank: Detecting
     check-worthy claims in arabic and english, arXiv preprint arXiv:1804.07587 (2018).
 [9] G. S. Cheema, S. Hakimov, R. Ewerth, Check_square at checkthat! 2020: Claim detec-
     tion in social media via fusion of transformer and syntactic features, arXiv preprint
     arXiv:2007.10534 (2020).
[10] R. Alkhalifa, T. Yoong, E. Kochkina, A. Zubiaga, M. Liakata, Qmul-sds at checkthat! 2020:
     determining covid-19 tweet check-worthiness using an enhanced ct-bert with numeric
     expressions, arXiv preprint arXiv:2008.13160 (2020).
[11] Y. S. Kartal, M. Kutlu, Tobb etu at checkthat! 2020: Prioritizing english and arabic claims
     based on check-worthiness, Cappellato et al.[10] (2020).
[12] C.-G. Cusmuliuc, L.-G. Coca, A. Iftene, Uaics at checkthat! 2020: Fact-checking claim
     prioritization, Cappellato et al.[18] (2020).
[13] E. Williams, P. Rodrigues, V. Novak, Accenture at checkthat! 2020: If you say so: Post-hoc
     fact-checking of claims using transformer-based models, arXiv preprint arXiv:2009.02431
     (2020).
[14] A. Nikolov, G. D. S. Martino, I. Koychev, P. Nakov, Team alex at clef checkthat! 2020:
     Identifying check-worthy tweets with transformer models, arXiv preprint arXiv:2009.02931
     (2020).
[15] T. Sachin Krishan, S. Kayalvizhi, D. Thenmozhi, K. Rishi Vardhan, Ssn nlp at checkthat!
     2020: Tweet check worthiness using transformers, convolutional neural networks and
     support vector machines (2020).
[16] J. Martinez-Rico, L. Araujo, J. Martinez-Romo, Nlp&ir@ uned at checkthat! 2020: A
     preliminary approach for check-worthiness and claim retrieval tasks using neural networks
     and graphs, Cappellato et al.[10] (2020).
[17] T. McDonald, Z. Dong, Y. Zhang, R. Hampson, J. Young, Q. Cao, J. Leidner, M. Stevenson,
     The university of sheffield at checkthat! 2020: Claim identification and verification on
     twitter, Cappellato et al.[10] (2020).
[18] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeno, T. Elsayed, M. Hasanain,
     R. Suwaileh, F. Haouari, G. Da San Martino, et al., Overview of checkthat! 2020 english:
     Automatic identification and verification of claims in social media, Cappellato et al.[10]
     (2020).
[19] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
     transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
     org/abs/1810.04805. arXiv:1810.04805.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint
     arXiv:1907.11692 (2019).
[21] D. Q. Nguyen, T. Vu, A. T. Nguyen, Bertweet: A pre-trained language model for
     english tweets, CoRR abs/2005.10200 (2020). URL: https://arxiv.org/abs/2005.10200.
     arXiv:2005.10200.
[22] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of BERT: smaller,
     faster, cheaper and lighter, CoRR abs/1910.01108 (2019). URL: http://arxiv.org/abs/1910.
     01108. arXiv:1910.01108.
[23] D. P. Solomatine, D. L. Shrestha, Adaboost. rt: a boosting algorithm for regression prob-
     lems, in: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.
     04CH37541), volume 2, IEEE, 2004, pp. 1163–1168.
[24] R. E. Wright, Logistic regression. (1995).
[25] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, IEEE
     Intelligent Systems and their applications 13 (1998) 18–28.