=Paper=
{{Paper
|id=Vol-2696/paper_226
|storemode=property
|title=Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of Claims using Transformer-based Models
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_226.pdf
|volume=Vol-2696
|authors=Evan Williams,Paul Rodrigues,Valerie Novak
|dblpUrl=https://dblp.org/rec/conf/clef/Williams0N20
}}
==Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of Claims using Transformer-based Models==
Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models

Evan Williams1 [0000-0002-0534-9450], Paul Rodrigues1,2 [0000-0002-2151-636X], and Valerie Novak2 [0000-0001-8317-0993]

1 Accenture, 800 N. Glebe Rd., Arlington, 22209, USA
e.m.williams@accenture.com, paul.rodrigues@accenture.com
2 University of Maryland, College Park, MD, USA
vnovak@umd.edu

Abstract. We introduce the strategies used by the Accenture Team for the CLEF2020 CheckThat! Lab, Task 1, on English and Arabic. This shared task evaluated whether a claim in social media text should be professionally fact checked. To a journalist, a statement presented as fact, which would be of interest to a large audience, requires professional fact-checking before dissemination. We utilized BERT and RoBERTa models to identify claims in social media text that a professional fact-checker should review, and to rank these in priority order for the fact-checker. For the English challenge, we fine-tuned a RoBERTa model and added an extra mean pooling layer and a dropout layer to enhance generalizability to unseen text. For the Arabic task, we fine-tuned Arabic-language BERT models and demonstrate the use of back-translation to amplify the minority class and balance the dataset. The work presented here was scored 1st place in the English track, and 1st, 2nd, 3rd, and 4th place in the Arabic track.

Keywords: fact checking, fact identification, Arabic, BERT, RoBERTa

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Natural Language Processing (NLP) has been driving Artificial Intelligence research since the 1950s, but has recently grown in prominence due to the quantity of text that can be utilized as well as new techniques to extract even more value from text. In 2018, a surge of research produced deep learning architectures in NLP which beat the state of the art on a multitude of tasks, such as sentiment analysis, question answering, and semantic similarity, in a variety of languages. Since the innovation of ULMFiT [12], numerous new architectures have been introduced, such as ELMo [17], BERT [9], ERNIE [26], RoBERTa [14], GPT-2 [18], GPT-3 [6], and others, yielding breakthrough innovations and increased performance, nearly month after month. These architectures require massive amounts of training data and can be expensive to train on high-performance computing clusters [25]. However, they facilitate the practice of transfer learning. A base model trained on a large amount of general text data can then be fine-tuned, or customized for a specific problem and domain/genre, using far less annotated data than previous systems required. This use of transfer learning allows us to effectively craft custom cutting-edge models to solve a wide range of classification problems.

While these architectures are often utilized to improve NLP tasks, the application of transformer-based transfer learning approaches is less often demonstrated as a component in decision-support systems which aid the workflow of subject matter experts. We do see these technologies being used in the medical field (e.g. [20]), and anticipate there will be many more applications coming. The CheckThat! Lab poses one such application, which could reduce information burden in the workflow of a journalist.
1.1 CheckThat! Lab

We participated in Task 1 of the 2020 CheckThat! challenge [5]. Organizers distributed collections of tweets in English and in Arabic for training, annotated for topic group, whether the tweet was a claim, and whether the tweet was check-worthy, along with Twitter-provided metadata [24, 10]. Participants in the challenge utilized this data to train a model that could receive a list of novel tweets, classify each for check-worthiness, and rank the group of tweets by how check-worthy they were. Evaluation of the model was performed on a second test dataset provided for each language. These test datasets were held back by the organizers until shortly before the competition end time. Organizers provided them unlabeled, and participants returned the labels and ranking to the organizers. Organizers then compared the ranking produced by each participating group against a withheld labeled and ranked list. Participants were permitted to submit one primary run and up to 3 contrastive runs.

The official metric for Arabic was Precision @ 30 (P@30). Precision @ k is the proportion of check-worthy results among the top k claims in the ranked list. The official metric for English was Mean Average Precision (mAP), the mean of the average precision scores across topics.
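For concreteness, the short sketch below computes Precision@k and average precision for a toy ranked list; mAP is then the mean of the average-precision values across topics. This is our own illustration of the standard definitions, not the organizers' scoring code, and the example labels are invented.

```python
from typing import List

def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k ranked tweets that are check-worthy (1 = check-worthy)."""
    return sum(relevance[:k]) / k

def average_precision(relevance: List[int]) -> float:
    """Mean of Precision@k evaluated at each rank k where a check-worthy tweet appears."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Toy ranking ordered by model confidence (invented labels).
ranking = [1, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_k(ranking, 5))  # 3 check-worthy tweets in the top 5 -> 0.6
print(average_precision(ranking))  # (1/1 + 2/2 + 3/4 + 4/7) / 4 ~= 0.83
```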
Provided Data

Tweets were collected by CheckThat! organizers using keyword watchlists, consisting of usernames, hashtags, or keywords, designed around a variety of topic areas. For English, one topic was provided related to COVID-19, filtered for tweets that mentioned #COVID19, #CoronavirusOutbreak, #Coronavirus, #Corona, #CoronaAlert, #CoronaOutbreak, corona, and COVID-19. This topic was the same in the train, test, and evaluation sets. For Arabic, the training data included three topics: Protests in Lebanon, Emirati cleric Wassim Youssef, and Turkey's intervention in Syria. Testing data included topics such as Deal of the Century, The Houthis in Yemen, COVID-19, Feminists, Events in Libya, The group of resident non-citizens in Kuwait, Algeria, and Boycotting Countries & Promoting Rumors against Qatar. We note that the topics in the train and test datasets differ, with no overlap.

The topic word lists were used by the organizers to collect posts on Twitter. Annotators were presented these posts and were asked to evaluate each for check-worthiness. Check-worthiness was defined as "a tweet that includes a claim that is of interest to a large audience (especially journalists), might have a harmful effect, etc." [8] Tweets were assigned check-worthiness labels after review by two annotators as well as a review by a third expert annotator. Check-worthiness was evaluated on the following three criteria [4]:

– Do you think the claim in the tweet is of interest to the public?
– To what extent do you think the claim can negatively affect the reputation of an entity, country, etc.?
– Do you think journalists will be interested in covering the spread of the claim or the information discussed by the claim?

In examining the labeled training data, we confirmed nuanced differences between tweets that were check-worthy and tweets that were not. For example, the tweet below, which was taken from the English task development data, initially appears to be peddling a false COVID-19 claim. However, the rest of the tweet makes it clear that the author is joking, which is presumably why this tweet was not labeled as being check-worthy.

"ALERT The corona virus can be spread through money. If you have any money at home, put on some gloves, put all the money in to a plastic bag and put it outside the front door tonight. I'm collecting all the plastic bags tonight for safety. Think of your health."

In contrast, the tweet below, which was labeled check-worthy, is spreading harmful COVID-19 misinformation which could dissuade people from getting tested.

"Coronavirus test in US is $3,000. Here in Tokyo it's $50, $166 without State ins. In much of Europe it's free Worse, in much of the US, it's not even available, unreliable. And meanwhile #POTUS recently called Corona one big "hoax." USA: 1st world $$$, 3rd world healthcare."

We had concern that nuanced text like this may be difficult to discriminate and rank accurately.

For a journalist, the task of identifying noteworthy claims for the vetting process may be intuitive. Their knowledge of the material, background in academic training, and experience as a journalist inform their processes and decision-making. Our learner is not coached, trained, or experienced in this area beforehand. It receives the data and annotations provided by the annotators and learns the patterns of language needed to replicate their decision process.

2 Transformer Architectures and Pre-trained Models

2.1 BERT

Bidirectional Encoder Representations from Transformers (BERT) models have fundamentally changed the NLP landscape. The original BERT model's architecture consists of 12 transformer layers stacked on top of one another with a hidden size of 768 and 12 self-attention heads [9]. BERT models are trained by performing unsupervised tasks, namely masked token prediction (Masked LM) and prediction of subsequent sentences (Next Sentence Prediction), on massive amounts of data. BERT utilizes a WordPiece tokenization scheme [22], and was trained on Wikipedia and the BooksCorpus [30]. At the time of release, BERT was state-of-the-art on 11 NLP tasks.

Since the initial release, many pre-trained BERT neural networks have been released. These can be focused on new languages, or differ in size. They can be either smaller and more efficient, or larger and more comprehensive, than the original release [27]. Any of these pre-trained models can serve as a base model for fine-tuning to new datasets and new tasks.

2.2 RoBERTa

RoBERTa, developed by Liu et al. [14], is a derivative of BERT which introduced modifications to the training process. The primary modifications are the provision of more training data, increasing pre-training steps with bigger batches over more data, removing Next Sentence Prediction, training on longer sequences, and dynamically changing the masking pattern applied to the training data [14]. While RoBERTa also requires sub-word tokenization, RoBERTa uses Byte-Pair Encoding (BPE) instead of WordPiece [23]. The roberta-base model was pre-trained on 160GB of text extracted from BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories (a subset of Common Crawl data) [14]. At the time of release, the RoBERTa architecture achieved state-of-the-art results on publicly available benchmark datasets such as GLUE [28], RACE [13], and SQuAD [19]. Like BERT, RoBERTa models come in a variety of sizes, and choosing a model requires a trade-off between computational efficiency and model size. While some new architectures have been released which exceed RoBERTa's performance, RoBERTa remains an accessible framework and continues to be one of the most highly ranked architectures on the SuperGLUE leaderboard (https://super.gluebenchmark.com/leaderboard).
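Both tokenization schemes matter for this work because words absent from a model's fixed vocabulary are broken into subword units, a point we return to in Section 3.2. The sketch below, which assumes the Huggingface transformers tokenizers for bert-base-uncased and roberta-base, illustrates this behaviour on a term that post-dates both models' pre-training; the splits suggested in the comments are indicative only and depend on the tokenizer version.

```python
from transformers import AutoTokenizer

# WordPiece (BERT) vs. byte-level BPE (RoBERTa) on a term coined after pre-training.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

text = "COVID-19 is spreading"
print(bert_tok.tokenize(text))     # subword pieces, e.g. ['co', '##vid', '-', '19', 'is', 'spreading']
print(roberta_tok.tokenize(text))  # BPE pieces; 'COVID' is split because it is not a single vocabulary entry
```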
2.3 AraBERT

AraBERT is an Arabic model developed by Wissam Antoun, Fady Baly, and Hazem Hajj at the American University of Beirut [3]. The aubmindlab/arabert series of models were pre-trained on Arabic documents retrieved from the web, as well as two publicly available corpora: the 1.5 billion word Arabic Corpus, and the 1 billion word Open Source International Arabic News Corpus (OSIAN). No token count was provided for the web-scraped documents [3].

2.4 ArabicBERT

ArabicBERT is an Arabic model developed at Koç University (KUIS) by Ali Safaya, Moutasem Abdullatif, and Deniz Yuret [21]. ArabicBERT was trained on Wikipedia and the OSCAR corpus [16], which utilized web data from Common Crawl. The corpus used to create the pre-trained model totaled 8.5 billion words.

3 Quantitative Analysis

3.1 Label Balance

The datasets for both the English and the Arabic challenges were imbalanced. The English Task 1 release contained a development dataset of 150 tweets and a training dataset of 672 tweets, containing 39% and 34% check-worthy tweets respectively. The Arabic Task 1 training dataset provided 1,500 labeled tweets, 458 of which (31%) were labeled check-worthy. We discuss the provisions we made for the Arabic imbalance later in the paper.

3.2 Vocabulary Analysis

When utilizing pre-trained models, the vocabulary used to create those models plays a critical role. The process of fine-tuning does not allow for the addition of new vocabulary, so these systems fall back to subword units during tokenization. Because we were evaluating a corpus that contained emerging topics (such as COVID-19), and our pre-trained models were created at different points between 2018 and 2020, we wanted to understand what our pre-trained models contained. We hypothesized that the models with the greatest token overlap would perform the best.

English. The token overlap between the English test dataset and RoBERTa's vocabulary file was roughly 850 tokens (54%), with RoBERTa containing about 50K items in its vocabulary. Many tokens missing from the RoBERTa vocabulary were related to the coronavirus topic, including several terms for COVID-19, as well as named entities, emoji, foreign languages in non-Latin script, misspellings, and slang/internet chat language (LMAOOO). No analysis was performed on the BERT vocabulary file.

Arabic. The three Arabic model vocabularies contained 64K WordPieces (aubmindlab/bert-base-arabert), 64K WordPieces (aubmindlab/bert-base-arabertv01), and 32K WordPieces (asafaya/bert-base-arabic). A rough tokenization and cleaning of the tweets in the test dataset resulted in roughly 15K unique tokens. The overlap between each Arabic model vocabulary and the Arabic test dataset was roughly 8.5K tokens, or 56% of the tokens in the test data (aubmindlab/bert-base-arabertv01); 5.5K tokens, or 36% (asafaya/bert-base-arabic); and 3.5K tokens, or 23% (aubmindlab/bert-base-arabert). Some categories of vocabulary found in the test dataset, but missing from the top performing model, included English words or loan words in Arabic script, colloquialisms/slang, misspellings/missing spaces, named entities (names of people and places), emoji, and tokens in Latin script. The asafaya/bert-base-arabic model vocabulary also included many longer WordPieces that were unlikely to be found in the data. Additionally, even though the test dataset contained short vowels, none of the Arabic model vocabularies included any short vowels.
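The overlap figures above come from a rough comparison of this kind: collect the unique surface tokens of the tweet collection and intersect them with a model's vocabulary. The sketch below is an illustrative reconstruction rather than our original analysis script; the file path and the cleaning regexes are placeholders.

```python
import re
from transformers import AutoTokenizer

def rough_tokens(path: str) -> set:
    """Lowercase, strip URLs and most punctuation, and split tweets on whitespace."""
    tokens = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = re.sub(r"https?://\S+", " ", line.lower())
            line = re.sub(r"[^\w#@']+", " ", line)
            tokens.update(line.split())
    return tokens

tweet_vocab = rough_tokens("tweets.txt")  # placeholder path to the tweet texts
model_vocab = set(AutoTokenizer.from_pretrained("roberta-base").get_vocab())

# Note: byte-level BPE vocabularies prefix word-initial tokens with 'Ġ',
# so an exact string intersection is only a rough lower bound on the true overlap.
overlap = tweet_vocab & model_vocab
print(f"{len(overlap)} / {len(tweet_vocab)} tweet tokens "
      f"({len(overlap) / len(tweet_vocab):.0%}) appear in the model vocabulary")
```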
4 Approach and Results

The datasets provided for English and Arabic contained Twitter metadata fields, but we discarded these. Our methodology only utilizes the message text of the tweet as well as the check-worthy field containing a binary label, where the positive class denotes a check-worthy claim. (We tried concatenating the text field with the pre-labeled topicID field, but this did not improve the model's performance, so we chose to exclude topic labels from the model.)

Competition rules required that the tweets most likely to be check-worthy appear at the top of each topic. To generate rankings, we took the positive and negative class scores, generated by a sequence classification head on top of the pooled output of the neural network models (whether BERT, RoBERTa, AraBERT, or ArabicBERT), and passed those scores through a softmax function to normalize the classification outputs. We then subtracted the negative class probability from the positive class probability. This yielded interpretable, normalized scores between -1 and 1, where higher scores reflected our model's confidence that a tweet was check-worthy. We then sorted by this difference of probabilities to produce the ranked tweets submitted to the organizers of the conference.

4.1 English Classification

For our internal evaluations, we split the English training data provided into 80% training and 20% validation sets. We used the development set as provided by the organizers.

We evaluated three baseline models, fine-tuning each for 2 epochs: the original English BERT model [9], a BERT model trained on COVID-19 Twitter data [15], and the original English RoBERTa model [14]. We assumed that the COVID-19 Twitter model would generate the highest accuracy given its deep contextual knowledge of both Twitter data and COVID-19, but of the three models, RoBERTa generated the highest precision and recall for both the positive and negative class. We chose to eliminate the other two models and focus on optimizing RoBERTa. (In hindsight, these two should also have been submitted for formal evaluation.)

In our internal evaluations, we noticed the model overfitting quickly. To help prevent this, we added an extra mean pooling layer and dropout layer to the model. Our pooling layer takes the weights from the last layer, which were overfitting, and averages them with weights from the second-to-last layer. This reduces overfitting by smoothing out some of the weights originally calculated in the final layer. Dropout is a regularization technique that reduces overfitting by randomly omitting (or zeroing out) hidden units from the network during each training step at a probability specified by the user [11]. By adding these two layers to the end of our RoBERTa model, we were able to improve accuracy on our test set and reduce overfitting.

After a grid search, we fine-tuned with 2 epochs, a batch size of 32, and Adam optimization with a learning rate of 1.5e-5. The RoBERTa model was fine-tuned using the Keras API to TensorFlow. Its output was then fed through a softmax function, and the difference between the positive and negative class likelihoods was used to rank tweets within each pre-labeled topic category.
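The following is a minimal sketch, under our stated assumptions (TensorFlow 2.x with the Huggingface TF API of that period, an illustrative sequence length and dropout rate, and placeholder variables train_texts and train_labels), of how a RoBERTa classifier with the added mean pooling and dropout layers can be assembled in Keras and used to produce the softmax-difference ranking. It is not the exact code of our submission.

```python
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFRobertaModel

MAX_LEN = 128  # assumed maximum sequence length

def build_model(dropout_rate: float = 0.1) -> tf.keras.Model:
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

    roberta = TFRobertaModel.from_pretrained("roberta-base", output_hidden_states=True)
    hidden_states = roberta(input_ids, attention_mask=attention_mask).hidden_states

    # Mean pooling: average the final and second-to-last hidden layers,
    # then average over the token dimension (padding included for simplicity).
    avg_layers = tf.keras.layers.Average()([hidden_states[-1], hidden_states[-2]])
    pooled = tf.keras.layers.GlobalAveragePooling1D()(avg_layers)

    dropped = tf.keras.layers.Dropout(dropout_rate)(pooled)
    logits = tf.keras.layers.Dense(2, name="classifier")(dropped)  # [not check-worthy, check-worthy]

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=logits)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1.5e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    return model

# train_texts / train_labels: tweet texts and 0/1 labels loaded from the task files (not shown).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
enc = tokenizer(train_texts, padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="tf")

model = build_model()
model.fit([enc["input_ids"], enc["attention_mask"]], np.array(train_labels),
          epochs=2, batch_size=32)

# Ranking: softmax the logits and sort by P(check-worthy) - P(not check-worthy), a score in [-1, 1].
probs = tf.nn.softmax(model.predict([enc["input_ids"], enc["attention_mask"]]), axis=-1)
scores = probs[:, 1] - probs[:, 0]
ranking = tf.argsort(scores, direction="DESCENDING")
```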
Results. Results of our fine-tuned RoBERTa model can be found in Table 1 as RoBERTa. This submission placed first among all competing teams with a mAP of 0.8064. Our contribution narrowly beat out the second place results, which likely utilized a similar model. We did not submit our BERT model or COVID-19 Twitter model for formal evaluation.

Table 1. Accenture results from CheckThat! Task 1 English.

Entry     mAP     RR      R-P     P@1     P@3     P@5     P@10    P@20    P@30
RoBERTa   0.8064  1.0000  0.7167  1.0000  1.0000  1.0000  1.0000  0.9500  0.7400

4.2 Arabic Classification

For our internal evaluations, we split the Arabic training data provided into 70% training, 20% validation, and 10% held-out sets. We evaluated four baseline Arabic BERT models retrieved from Huggingface, without any parameter tuning [29]. These models were Hate-speech-CNERG/dehatebert-mono-arabic [2], asafaya/bert-base-arabic [21], aubmindlab/bert-base-arabert [3], and aubmindlab/bert-base-arabertv01 [3]. Of the four, we found three to have promise: aubmindlab/bert-base-arabertv01, aubmindlab/bert-base-arabert, and asafaya/bert-base-arabic.

Classes were imbalanced in the Arabic training dataset, with 30% of tweets labeled as part of the check-worthy class. In order to address the imbalanced classes, we chose to upsample the positive class using machine translation via Amazon Web Services (AWS) Translate. Tweets from the positive class in the training and development sets were translated to English and then back to Arabic (ar→en→ar), appended to our training dataset, and assigned a label of check-worthy. This improved both precision and recall for check-worthy tweets, but slightly harmed the precision and recall for tweets that were not check-worthy. As the goal is to surface and rank the positive class at various levels of precision, a reduction in the F1-score of the negative class was acceptable for improving the F1-score of the positive class.
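The sketch below illustrates this back-translation step with AWS Translate via boto3. It is a simplified reconstruction: batching, throttling, and credential handling are omitted, the helper name is ours, and train_rows is a placeholder for the labeled training examples.

```python
import boto3

translate = boto3.client("translate")  # assumes AWS credentials and region are configured

def back_translate(text: str, source: str = "ar", pivot: str = "en") -> str:
    """Round-trip ar -> en -> ar to create a paraphrased copy of a tweet."""
    pivot_text = translate.translate_text(
        Text=text, SourceLanguageCode=source, TargetLanguageCode=pivot
    )["TranslatedText"]
    return translate.translate_text(
        Text=pivot_text, SourceLanguageCode=pivot, TargetLanguageCode=source
    )["TranslatedText"]

# train_rows: list of (tweet_text, label) pairs; append a back-translated copy of each
# check-worthy tweet, labeled check-worthy, to upsample the minority class.
augmented = [(back_translate(text), 1) for text, label in train_rows if label == 1]
train_rows.extend(augmented)
```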
After a grid search, our final models were fine-tuned with 2 epochs, a learning rate of 2e-05, Adam optimization, and a batch size of 32. We used a Huggingface BERT sequence classification function [29] and, as with English, added a linear layer on top of the pooled output. This output was then fed through a softmax function, and the difference between the positive and negative class likelihoods was used to rank tweets within each pre-labeled topic category.
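A minimal sketch of this configuration is shown below, assuming the Huggingface TF sequence classification head; the choice of TensorFlow backend, the sequence length, and the placeholder variables train_texts and train_labels are our assumptions for illustration rather than the exact code used.

```python
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabertv01"  # base model of our primary run

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# (add from_pt=True if the checkpoint only ships PyTorch weights)
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# train_texts / train_labels: the upsampled Arabic training data (loading not shown).
enc = tokenizer(train_texts, padding=True, truncation=True, max_length=128, return_tensors="tf")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dict(enc), np.array(train_labels), epochs=2, batch_size=32)

# Scoring and ranking follow the same softmax-difference procedure as for English.
logits = model(dict(enc)).logits
probs = tf.nn.softmax(logits, axis=-1)
scores = probs[:, 1] - probs[:, 0]
```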
Results. Results for our Arabic evaluations can be found in Table 2. Our official submission to the competition was AraBERT v0.1 Upsampled and was evaluated in 1st place with a P@30 of 0.7000. Our comparative models AraBERT v1.0 Upsampled, AraBERT v0.1 Unmodified, and ArabicBERT-Base Upsampled were evaluated in 2nd, 3rd, and 4th place with P@30 scores of 0.6750, 0.6694, and 0.6639 respectively. (This is a rapidly evolving area of NLP. At the time of the challenge, documentation was not yet published for AraBERT v1.0, and we did not realize that v1.0 required running the Farasa segmenter [1] as a preprocessing step before utilization. We expect an Upsampled v1.0 to beat an Upsampled v0.1 when the necessary Arabic segmenter is utilized.)

The benefit of back-translation to upsample the minority class can be seen by comparing AraBERT v0.1 Upsampled (P@30 of 0.7000) with AraBERT v0.1 Unmodified (P@30 of 0.6694). These were the same model architectures, with identical hyperparameters, but one had upsampled data and the other did not.

Table 2. Accenture results from CheckThat! Task 1 Arabic.

Entry                      P@5     P@10    P@15    P@20    P@25    P@30    AP
AraBERT v0.1 Upsampled     0.7333  0.7167  0.7167  0.6875  0.6933  0.7000  0.6232
AraBERT v1.0 Upsampled     0.6667  0.7417  0.7333  0.7125  0.6900  0.6750  0.5967
AraBERT v0.1 Unmodified    0.6833  0.7083  0.7111  0.6833  0.6833  0.6694  0.6035
ArabicBERT-Base Upsampled  0.6000  0.6917  0.6944  0.6833  0.6667  0.6639  0.5947

Comments: Preprocessing. Once we had Arabic model performance baselines, we experimented with various preprocessing techniques. We assumed that these steps would reduce noise and help the Arabic BERT models better map words to tokens in their vocabularies. We performed internal evaluations involving variations of removing diacritics, stopwords, URLs, and punctuation, and also of splitting underscores. We tested each of these preprocessing functions alone, as well as in combination with other preprocessing functions. We saw no increase in precision or recall from these steps. In fact, many combinations of these functions brought down our overall accuracy. We ultimately chose to forego all preprocessing.

Comments: Machine Translation. Back-translation provides the model with alternative ways to express similar concepts. This makes the model more robust to vocabulary not present in the training data. We evaluated three strategies to augment the corpus using translation data:

– adding back-translated data (ar→en→ar)
– adding the English translation (ar→en)
– adding both the English and back-translated Arabic text (ar→en; ar→en→ar)

We found that adding the back-translated Arabic without English (ar→en→ar) provided the largest increase in accuracy on our internal evaluations.

English was chosen as an intermediary language due solely to the fact that AWS has strong English NLP support. Future research may explore which intermediary language translations can offer the largest performance boosts. While we may have benefited from exploring intermediary language alternatives (as well as from up-sampling the English training set), we had to leave this for future work due to constraints in both time and budget.

We recognize that this translation approach resulted in label leakage into the hold-out and validation sets, resulting in overfitting on our internal evaluations. However, by expanding the contextual vocabulary of the model, we had the intuition that this would yield increased performance on the unseen test set. Of all of the preprocessing and tuning steps we tried in our internal evaluations, none resulted in a larger accuracy boost than adding this back-translated data.

5 Future Work

New pre-trained neural network models are being released at a rapid pace. The trend is that they are getting larger: trained with more parameters, on larger quantities of text. Additionally, their baseline capabilities are expanding. Work like that which is presented here can be easily updated to take advantage of these new models as they become available. The workflow a year from now will be the same, but performance will improve. Today, BERT and similar pre-trained models have become the new baseline. These systems yield fantastic results, with little training data required for fine-tuning.

As larger models are created and released, the models become more difficult to understand. Classification and ranking are helpful to support SMEs performing their work, but full decision support systems cannot be black boxes, and need to be able to explain why they made the suggestions they did. We are working on improving the explainability of these models to provide better support to decision makers.

6 Conclusions

This paper introduced work by Accenture on using BERT and RoBERTa models to classify and rank unsubstantiated claims in social media for professional fact-checking. We demonstrated five models. We submitted one model to the English track, and placed 1st with a mAP of 0.8064. We submitted 4 models to the Arabic track, yielding 1st (P@30=0.7000), 2nd (P@30=0.6750), 3rd (P@30=0.6694), and 4th (P@30=0.6639) place.

References

1. Abdelali, A., Darwish, K., Durrani, N., Mubarak, H.: Farasa: A fast and furious segmenter for Arabic. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. pp. 11–16 (2016)
2. Aluru, S.S., Mathew, B., Saha, P., Mukherjee, A.: Deep learning models for multilingual hate speech detection. arXiv preprint arXiv:2004.06465 (2020)
3. Antoun, W., Baly, F., Hajj, H.: AraBERT: Transformer-based model for Arabic language understanding. In: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection. pp. 9–15 (2020), https://arxiv.org/pdf/2003.00104v2.pdf
4. Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.): Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS (12260), Springer (2020)
5. Barrón-Cedeño, A., Elsayed, T., Nakov, P., Da San Martino, G., Hasanain, M., Suwaileh, R., Haouari, F., Babulkov, N., Hamdan, B., Nikolov, A., Shaar, S., Sheikh Ali, Z.: Overview of CheckThat! 2020: Automatic identification and verification of claims in social media. In: Arampatzis et al. [4]
6. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners (2020)
7. Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.): Working Notes of CLEF 2020—Conference and Labs of the Evaluation Forum (2020)
8. Organizing Committee: Tasks 1 & 5: Check-worthiness, https://sites.google.com/view/clef2020-checkthat/tasks/tasks-1-5-check-worthiness
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
10. Hasanain, M., Haouari, F., Suwaileh, R., Ali, Z., Hamdan, B., Elsayed, T., Barrón-Cedeño, A., Da San Martino, G., Nakov, P.: Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media. In: Cappellato et al. [7]
11. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
12. Howard, J., Ruder, S.: Fine-tuned language models for text classification. CoRR abs/1801.06146 (2018), http://arxiv.org/abs/1801.06146
13. Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.: RACE: Large-scale reading comprehension dataset from examinations (2017)
14. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019), http://arxiv.org/abs/1907.11692
15. Müller, M., Salathé, M., Kummervold, P.E.: COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter. arXiv preprint arXiv:2005.07503 (2020)
16. Ortiz Suárez, P.J., Romary, L., Sagot, B.: A monolingual approach to contextualized word embeddings for mid-resource languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.156
17. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proc. of NAACL (2018)
18. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. Tech. rep., OpenAI, San Francisco, CA, USA (2019)
19. Rajpurkar, P., Jia, R., Liang, P.: Know what you don't know: Unanswerable questions for SQuAD (2018)
20. Rasmy, L., Xiang, Y., Xie, Z., Tao, C., Zhi, D.: Med-BERT: Pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction (2020)
21. Safaya, A., Abdullatif, M., Yuret, D.: KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media. In: Proceedings of the International Workshop on Semantic Evaluation (SemEval) (2020)
22. Schuster, M., Nakajima, K.: Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5149–5152. IEEE (2012)
23. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units (2015)
24. Shaar, S., Nikolov, A., Babulkov, N., Alam, F., Barrón-Cedeño, A., Elsayed, T., Hasanain, M., Suwaileh, R., Haouari, F., Da San Martino, G., Nakov, P.: Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media. In: Cappellato et al. [7]
25. Sharir, O., Peleg, B., Shoham, Y.: The cost of training NLP models: A concise overview. arXiv preprint arXiv:2004.08900v1 (2020)
26. Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., Wu, H.: ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019)
27. Turc, I., Chang, M.W., Lee, K., Toutanova, K.: Well-read students learn better: On the importance of pre-training compact models (2019)
28. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding (2018)
29. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019)
30. Zhu, Y., Kiros, R., Zemel, R.S., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler, S.: Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. CoRR abs/1506.06724 (2015), http://arxiv.org/abs/1506.06724