UAICS at CheckThat! 2020: Fact-checking claim prioritization

Ciprian-Gabriel Cusmuliuc, Lucia-Georgiana Coca, Adrian Iftene

"Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi, Romania
{cusmuliuc.ciprian.gabriel, coca.lucia.georgiana, adiftene}@info.uaic.ro

Abstract. Claim verification can be an incredibly challenging task, considering that the amount of information in the world increases day by day. Journalists and ordinary readers alike spend a lot of time investigating claims and fact-checking different statements. To address this problem, CLEF 2020 CheckThat! proposes 5 tasks, each presenting a different side of the problem. Our team participated in Task 1 and Task 5, which aim to rank statements by check-worthiness. For Task 5, we proposed 3 methods, each based on a different machine learning algorithm: Naïve Bayes, Logistic Regression, and Decision Tree. For Task 1, we created a system based on BERT. For Task 5, the best result we achieved under the official measure, MAP, was with Naïve Bayes. This paper presents the details and results of our approaches.

Keywords: Naive Bayes, BERT, Logistic Regression, Decision Tree

1 Introduction

The increase in social network popularity has led users to conduct many activities on these platforms, such as exchanging messages, posting, reading news, and commenting. Instant sharing and broadcasting give users fast and vast access to information, but at a cost: news propagation on these platforms is becoming uncontrollable, and users frequently read and share information without checking the veracity of a claim, which leads to misinformation.

The need to check the news and claims circulating in these networks is quickly becoming of high interest to major players such as Facebook and Twitter, which have recently discussed integrating such tools into their systems; one example is Twitter fact-checking Donald Trump and labeling his tweets as 'manipulated media'1, sparking outrage among his supporters. This problem is not present only in social media; we can see an effort to spread misinformation and propaganda across the entire internet.

1 https://www.bbc.com/news/technology-53106029

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

CLEF 2020 CheckThat! [1][19] is an evaluation campaign organized as part of CLEF 2020 [2] and contains 5 tasks, each related to fact-checking. Our team participated in two of them, Tasks 1 and 5.

Task 1 requires the development of a system capable of ranking a stream of potentially related tweets according to their check-worthiness. The task ran in English and Arabic; we participated only in the English version, developing a model based on BERT [3], a bidirectional transformer developed by Google with exceptional performance. Task 5 has the objective of identifying which sentences from a political debate should be prioritized for fact-checking. In this task, we submitted 3 models, based on Naïve Bayes, Logistic Regression, and Decision Tree.

This paper describes the participation of team UAICS, from the Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi, in Tasks 1 and 5 at CLEF 2020. The remainder of this paper is organized as follows: Section 2 describes the state of the art and Section 3 gives a description of the tasks.
Section 4 details the models we developed and the submitted runs, Section 5 presents the results we obtained, and finally Section 6 concludes this paper and presents future work.

2 State of the art

The previous editions of CheckThat!, in 2018 and 2019, had fewer tasks: Task 1 was based on the same claim prioritization problem as this year's Task 5, whilst Task 2 required assessing which web pages can be useful for human fact-checking and consisted of multiple subtasks. We will only refer to Task 1 here, as this task is also present in 2020.

In 2019 the approaches for Task 1 varied widely. The best team, "Copenhagen" [4], had a MAP score of 0.1660; their system was based on learning dual token embeddings in conjunction with an LSTM [5]. The network was pre-trained on previous Trump and Clinton debates, using ClaimBuster2 for supervision. The other approaches were the following (in ranking order): team TheEarthIsFlat [6] used a feed-forward neural network with two hidden layers, team IPIPAN [7] used an L1-regularized logistic regression, team Terrier [8] used an SVM [9] in conjunction with bag-of-words and named entity features, and team UAICS [10] used a Naïve Bayes classifier with bag-of-words features.

In 2018 the best team was again "Copenhagen" [11], with the lowest MAE of 0.7050; they used an approach similar to their 2019 system, combining a convolutional neural network [12] with a support vector machine. Other approaches ranged from random forests and logistic regression to LSTMs.

2 https://idir.uta.edu/claimbuster/

3 Tasks description

In 2020 there were 5 tasks, run in English and Arabic; we participated in only 2 of them, Tasks 1 and 5. In this section, we briefly present the two tasks we took part in.

Task 1 requires: "given a topic and a stream of potentially-related tweets, rank the tweets according to their check-worthiness for the topic". This task runs in English and Arabic.

Task 5 requires: "given a political debate or a transcribed speech, segmented into sentences, with speakers annotated, identify which sentence should be prioritized for fact-checking". This task is only in English.

3.1 Evaluation metric

Both tasks use MAP [13] as the official metric, which is the mean over topics of the average precision. Other measures used are the Mean Reciprocal Rank [14], which averages the reciprocal of the rank of the first relevant document, and Mean Precision at k, which averages the precision computed over the top k candidates. Details on the measures used can be found in the task overview [1].

Evaluations are carried out on primary and contrastive runs. Each participant is allowed three models, one primary and two secondary (contrastive). We tried to take advantage of this by submitting 3 models in Task 5. In previous years of CheckThat! the evaluation metric was MAE, as can be seen in Section 2.
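For reference, these ranking measures follow the standard information retrieval definitions sketched below; the official scorer may differ in details such as the cut-off depth of the ranked list.

\[
\mathrm{AP}(q) = \frac{1}{R_q} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k), \qquad
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q),
\]
\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}, \qquad
P@k = \frac{|\{\text{relevant items in the top } k\}|}{k},
\]

where Q is the set of topics (or debates), R_q is the number of relevant (check-worthy) items for topic q, P(k) is the precision at cut-off k, rel(k) is 1 if the item at rank k is relevant and 0 otherwise, and rank_q is the position of the first relevant item.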
4 Methods and runs

4.1 For Task 1

4.1.1 Training and test data

The data provided for this task contained tweets split into 2 main categories, train and dev. The data was provided in both TSV3 and JSON4 files. We decided to use only the TSV files, as we felt they were easier to work with. The datasets were used as follows: "train" for training our model and "dev" for fine-tuning the hyperparameters after evaluation. A training example can be seen in Table 1.

3 https://www.imf.org/external/help/tsv.htm
4 https://www.json.org/json-en.html

Table 1. Training example.
  Topic id:           covid-19
  Tweet id:           1234964653014384644
  Tweet URL:          https://twitter.com/EricTrump/status/1234964653014384644
  Tweet text:         Since this will never get reported by the media [...]
  Claim:              1
  Check-worthiness:   1

We considered that the tweet URL and tweet id were irrelevant, so we did not include them in the data sent to our algorithm.

4.1.2 Preprocessing and feature extraction

Before feeding the data to the model, we had to preprocess the text. The csv5 library was used to read the files provided by the organizers, after which we put the data into a list that contained, in order, the topic id, tweet id, tweet URL, tweet text, claim, and label. The data was then sent to a tokenizer; we decided to use BertTokenizer6, as this is the official method from Huggingface7. We then padded each sentence to a maximum length, which in our case was 121 tokens. An example tokenization is shown in Fig. 1.

Fig. 1. Tokenization example.

After tokenization we loaded each individual field into a torch tensor8 and inserted them into a TensorDataset9 that contained all the sentence ids, each individual tokenized sentence, and the labels. A code snippet for this operation is the following:

# build one tensor per field, then group them into a single dataset
all_topic_id_id = torch.tensor([f.topic_id_id for f in features], dtype=torch.long)
all_tweet_text_id = torch.tensor([f.tweet_text_id for f in features], dtype=torch.long)
all_claim_id = torch.tensor([f.claim_id for f in features], dtype=torch.long)
dataset = TensorDataset(all_topic_id_id, all_tweet_text_id, all_claim_id)
return dataset

5 https://docs.python.org/3/library/csv.html
6 https://huggingface.co/transformers/model_doc/bert.html#berttokenizer
7 https://huggingface.co/
8 https://pytorch.org/docs/stable/tensors.html
9 https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset

4.1.3 Models

In designing the model, we decided to use BERT as a possible solution to the problem at hand. This model was chosen over other language models such as ULMFiT [15] because multiple papers demonstrate the performance benefits of BERT. For example, [16] trained an RNN on a very large text collection, resulting in 63.7% accuracy on the Winograd Schema Challenge [17], while [18], using a BERT model, achieved an accuracy of 72.5% on the same challenge. These results led us to choose the latter model, as we feel it best fits our purpose.

The system is built on a pre-trained model called "bert-large-uncased"10, a bidirectional transformer with 24 layers, a hidden size of 1,024, 16 attention heads, and 340 million parameters, trained on lower-cased English text. We used a combination of BertModel11 and the Adam12 optimizer in order to get the best results.

The hyperparameters are more or less standard; we tuned them empirically and arrived at the following best configuration: batch size 8, 5 epochs, and an Adam learning rate of 5e-5.

The pipeline of the algorithm starts with preprocessing (tokenizing and padding each sentence to satisfy the input format of the model). We then shuffle the data in order to avoid overfitting and, in each epoch, feed the data through the network; after backpropagation we update the learning rate schedule and tell the optimizer to update the parameters. Evaluation of the trained network is done on the dev dataset: using the model saved in the previous step, we feed the data through the network and compute the loss on our trial data.
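A minimal sketch of this training loop is given below. It is illustrative rather than a verbatim copy of our code: it assumes the Huggingface transformers and PyTorch APIs, uses the ready-made BertForSequenceClassification head rather than the plain BertModel mentioned above, and train_dataset is a hypothetical TensorDataset of (input ids, attention masks, labels) built as described in Section 4.1.2.

import torch
from torch.utils.data import DataLoader, RandomSampler
from transformers import (BertForSequenceClassification, AdamW,
                          get_linear_schedule_with_warmup)

# `train_dataset` is a hypothetical TensorDataset of (input_ids, attention_mask, label).
model = BertForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)
loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=8)
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=5 * len(loader))

model.train()
for epoch in range(5):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]    # classification loss returned by the model
        loss.backward()      # backpropagation
        optimizer.step()     # update the parameters
        scheduler.step()     # update the learning rate

Evaluation on the dev set follows the same pattern, with model.eval() and torch.no_grad(), accumulating the loss instead of calling backward().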
The experimental setup was done both locally and on the cloud. In the development stage, we trained the model locally using a computer with a 12-core CPU and 32 GB of RAM, which proved very inefficient: training took about two days, which made us switch to a cloud setup. Using PyTorch13, we moved the training of the model to a GPU on the Google Colaboratory14 platform, which lowered the training time to about an hour. This made a big difference, as we could now make decisions regarding the model much faster, without waiting a long time for training to finish.

10 https://huggingface.co/transformers/pretrained_models.html
11 https://huggingface.co/transformers/model_doc/bert.html#bertmodel
12 https://huggingface.co/transformers/main_classes/optimizer_schedules.html#adamw
13 https://pytorch.org/
14 https://colab.research.google.com/

4.2 For Task 5

4.2.1 Training and test data

The data provided contained presidential election debates and speeches from the United States in 2016. The data was of two main categories, training and test. The training set had 50 files, while the test set had 20 files. The main difference from 2019 is that the organizers provided more training files but also more test scenarios. This can be seen in the results of the participants, which now have a lower MAP compared to 2019, as the number of test files has increased dramatically since last year. We tried to further augment the models by taking files from 2019 that were not included in 2020; we took training files but also test files with gold labels. One training example with the available columns is shown in Table 2.

Table 2. Training example.
  Line no.   Speaker   Text                  Label
  1          Trump     So Ford is leaving.   1

In training the models we ignored the speaker and line number; we fed in only the preprocessed text and the label.

4.2.2 Preprocessing and feature extraction

Before sending the debate text to the machine learning algorithms we performed several preprocessing operations in a pipeline.

For all the models we first tokenized the text in order to break each phrase into individual terms. After tokenization, for the contrastive 1 submission (Logistic Regression) we used TF-IDF in order to extract the features in the form of a term frequency matrix. The implementation of TF-IDF was taken from Pyspark15 and is a combination of 2 steps: HashingTF16 and IDF17. The minimum document frequency for IDF was set to 10.

For the other two submissions, primary (Naïve Bayes) and contrastive 2 (Decision Tree), we also used a term frequency matrix but with a different implementation; instead of HashingTF we used CountVectorizer18, which has a lower information loss, and we observed empirically that these two algorithms perform better with this feature extractor. The settings of the CountVectorizer are the following: the minimum term frequency is 1 and so is the minimum document frequency, the maximum document frequency is 2^63 - 1, and the vocabulary size is 2^18; for IDF we set a minimum document frequency of 3.

15 https://spark.apache.org/docs/latest/api/python/index.html
16 https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/mllib/feature/HashingTF.html
17 https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/mllib/feature/IDF.html
18 https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer
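A sketch of these two feature-extraction variants, written against the DataFrame-based pyspark.ml API (the footnotes above point at the equivalent lower-level mllib classes), would look roughly as follows; the column names and the train_df DataFrame are placeholders, not part of our actual code.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")

# Variant used for the Logistic Regression run: hashed term frequencies + IDF (minDocFreq=10).
hashing_tf = HashingTF(inputCol="tokens", outputCol="tf")
idf_lr = IDF(inputCol="tf", outputCol="features", minDocFreq=10)
lr_features = Pipeline(stages=[tokenizer, hashing_tf, idf_lr])

# Variant used for the Naive Bayes and Decision Tree runs: CountVectorizer + IDF (minDocFreq=3).
# minTF=1, minDF=1 and vocabSize=2^18 match the settings described above;
# maxDF keeps its default of 2^63 - 1.
count_vec = CountVectorizer(inputCol="tokens", outputCol="tf",
                            minTF=1.0, minDF=1.0, vocabSize=1 << 18)
idf_nb = IDF(inputCol="tf", outputCol="features", minDocFreq=3)
nb_features = Pipeline(stages=[tokenizer, count_vec, idf_nb])

# train_df is a hypothetical DataFrame with one sentence per row in a "text" column.
# features_model = nb_features.fit(train_df)
# train_features = features_model.transform(train_df)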
4.2.3 Models

After preprocessing the data and extracting the features, they are sent to the machine learning models. We decided not to use any external resources for these models.

We trained the algorithms and tested them using the test data from 2019, to which we attached the gold labels, and used the script provided by the organizers to calculate MAP, RR, R-P, and P@k. This gave us a rough estimate of the performance of each model, but also a way to compare with other teams from last year [10].

The first and best model was based on Naïve Bayes and uses the default implementation from Pyspark; we fine-tuned it after multiple sessions of testing, used a multinomial model, and set the smoothing to 1. Even though the model is quite simple, it is very powerful: as can be seen from the results in Table 4, it is sometimes twice as accurate as the other algorithms.

The second-best algorithm was Logistic Regression. Initially this model performed poorly, however we found that increasing the minimum document frequency of IDF to 10 would increase its performance. The parameters of this model are the following: the maximum number of iterations is set to 100, the regularization parameter is 0, the tolerance is 1e-6, and the aggregation depth is set to 2.

The third best was the Decision Tree; this model uses a minimum document frequency in IDF of 3. We tried to make it as good as we could and, in order not to overfit the model, we arrived at the following parameters: the maximum depth was 30, we increased the maximum number of bins to 128 (which allows the algorithm to consider more split candidates and make finer-grained split decisions), we required a minimum of 5 instances per node, and we increased the maximum memory limit of the model to 4096 MB.

We trained the models locally, on CPU; the training time was rather fast, with the Decision Tree being the slowest.
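These three configurations map naturally onto the pyspark.ml classifiers; a hedged sketch is given below, with placeholder column names and with train_features standing for the output of the feature pipeline sketched in Section 4.2.2.

from pyspark.ml.classification import NaiveBayes, LogisticRegression, DecisionTreeClassifier

# Primary run: multinomial Naive Bayes with smoothing set to 1.
nb = NaiveBayes(modelType="multinomial", smoothing=1.0,
                featuresCol="features", labelCol="label")

# Contrastive 1: Logistic Regression with the parameters listed above.
lr = LogisticRegression(maxIter=100, regParam=0.0, tol=1e-6, aggregationDepth=2,
                        featuresCol="features", labelCol="label")

# Contrastive 2: Decision Tree with wider bins and a raised memory limit.
dt = DecisionTreeClassifier(maxDepth=30, maxBins=128, minInstancesPerNode=5,
                            maxMemoryInMB=4096, featuresCol="features", labelCol="label")

# Hypothetical usage: fit on the training features, then rank the test sentences by the
# predicted probability of the positive (check-worthy) class.
# nb_model = nb.fit(train_features)
# predictions = nb_model.transform(test_features)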
5 Results

In this section, the results of our submissions to the two tasks are discussed. Table 3 shows the results for Task 1: there were 12 teams and we ranked 11th, with a MAP of 0.4950, while the best result had a MAP of 0.8064 (team Accenture). It should be noted that contrastive 1 is better than our primary submission: we trained the primary model for more epochs and wrongly interpreted this as an increase in performance. Table 4 presents the results for Task 5, where we ranked 2nd out of 3 teams with a MAP of 0.0515, the best being 0.0867 (team NLPIR01) and the worst 0.0183, roughly three times lower than our submission.

Table 3. Task 1 results.
  Sub.        MAP     RR      R-P     P@1     P@3     P@5   P@10    P@20    P@30
  Accenture   0.8064  1.00    0.7167  1.0000  1.0000  1.00  1.0000  0.9500  0.7400
  primary     0.4950  1.0     0.4667  1.0     0.3333  0.4   0.6     0.6     0.46
  contr. 1    0.5333  0.5     0.5167  0.0     0.3333  0.4   0.6     0.6     0.52

Table 4. Task 5 results.
  Sub.        MAP     RR      R-P     P@1     P@3     P@5   P@10    P@20    P@30
  NLPIR01     0.0867  0.27    0.0930  0.15    0.11    0.13  0.0950  0.0725  0.0390
  primary     0.0515  0.2247  0.0527  0.15    0.10    0.07  0.050   0.0375  0.0270
  contr. 1    0.0431  0.1735  0.0578  0.10    0.05    0.05  0.055   0.0450  0.0250
  contr. 2    0.0328  0.1138  0.0282  0.05    0.05    0.03  0.035   0.0175  0.0190

For Task 5, the best result and the primary submission came from the Naïve Bayes model, contrastive 1 is in 2nd place with the Logistic Regression algorithm, and finally contrastive 2 is based on the Decision Tree. The results are in line with what we had tested locally; we feel that the performance is good and that the models performed well in the evaluation stage.

5.1 Error analysis

The performance in both tasks is good. For Task 1, the main drawback of the model is that we did not arrive at a finished product; we feel that the design of the model needs improvement, and we do not believe it is able to extract all the relevant information, so augmentation with a general knowledge ontology such as WikiData19 would be a great addition. For Task 5, the models are in a much more mature state, as the performance shows, and we feel that they have reached their limit; the remaining errors stem from a lack of understanding of the sentence, which would require a much more complex system, probably based on a language model such as BERT.

6 Conclusion

In this paper, we proposed solutions for two of the CLEF CheckThat! 2020 tasks; one approach is based on a bidirectional transformer and the others are based on classical machine learning. We achieved good results with all the submissions, and in the future we would like to fine-tune our models in order to obtain a much better MAP: there is much room for improvement in Task 1, and for Task 5 a language model approach would be interesting to see in action.

Acknowledgements

This work was supported by project REVERT (taRgeted thErapy for adVanced colorEctal canceR paTients), Grant Agreement number: 848098, H2020-SC1-BHC-2018-2020/H2020-SC1-2019-Two-Stage-RTD.

19 https://www.wikidata.org/

References

1. Barrón-Cedeño, A., Elsayed, T., Nakov, P., Da San Martino, G., Hasanain, M., Suwaileh, R., Haouari, F., Babulkov, N., Hamdan, B., Nikolov, A., Shaar, S., Sheikh Ali, Z. (2020) Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media. In Working Notes of CLEF 2020.
2. Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (2020) Experimental IR Meets Multilinguality, Multimodality, and Interaction. In Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). Lecture Notes in Computer Science (LNCS) 12260, Springer, Heidelberg, Germany.
3. Devlin, J., Chang, M. W., Lee, K., Toutanova, K. (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
4. Hansen, C., Hansen, C., Simonsen, J. G., Lioma, C. (2019) Neural Weakly Supervised Fact Check-Worthiness Detection with Contrastive Sampling-Based Ranking Loss. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. http://ceur-ws.org/Vol-2380/paper_56.pdf
5. Hochreiter, S., Schmidhuber, J. (1997) Long short-term memory. In Neural Computation 9 (8): 1735-1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
6. Favano, L., Carman, M., Lanzi, P. (2019) TheEarthIsFlat's submission to CLEF'19 CheckThat! challenge. In CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland.
7. Gasior, J., Przybyła, P. (2019) The IPIPAN team participation in the check-worthiness task of the CLEF 2019 CheckThat! lab. In CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland.
8. Su, T., Macdonald, C., Ounis, I. (2019) Entity detection for check-worthiness prediction: Glasgow Terrier at CLEF CheckThat! 2019. In CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum.
CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland.
9. Cortes, C., Vapnik, V. N. (1995) Support-vector networks. In Machine Learning 20 (3): 273-297. CiteSeerX 10.1.1.15.9362. doi:10.1007/BF00994018.
10. Coca, L., Cusmuliuc, C. G., Iftene, A. (2019) 2019 UAICS. In CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland.
11. Wang, D., Simonsen, J., Larsen, B., Lioma, C. (2018) The Copenhagen Team Participation in the Factuality Task of the Competition of Automatic Identification and Verification of Claims in Political Debates of the CLEF-2018 Fact Checking Lab. In CLEF 2018 Working Notes.
12. Valueva, M. V., Nagornov, N. N., Lyakhov, P. A., Valuev, G. V., Chervyakov, N. I. (2020) Application of the residue number system to reduce hardware costs of the convolutional neural network implementation. In Mathematics and Computers in Simulation 177: 232-243. Elsevier BV. doi:10.1016/j.matcom.2020.04.031. ISSN 0378-4754.
13. Beitzel, S. M., Jensen, E. C., Frieder, O. (2009) MAP. In Liu, L., Özsu, M. T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
14. Craswell, N. (2009) Mean Reciprocal Rank. In Liu, L., Özsu, M. T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
15. Howard, J., Ruder, S. (2018) Fine-tuned language models for text classification. CoRR, abs/1801.06146.
16. Trinh, T. H., Le, Q. V. (2018) A simple method for commonsense reasoning. CoRR, abs/1806.02847.
17. Levesque, H. J., Davis, E., Morgenstern, L. (2012) The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12, 552-561. AAAI Press.
18. Kocijan, V., Cretu, A. M., Camburu, O. M., Yordanov, Y., Lukasiewicz, T. (2019) A surprisingly robust trick for the Winograd Schema Challenge. CoRR, abs/1905.06290.
19. Shaar, S., Babulkov, N., Alam, F., Barrón-Cedeño, A., Elsayed, T., Hasanain, M., Suwaileh, R., Haouari, F., Da San Martino, G., Nakov, P. (2020) Overview of CheckThat! 2020 English: Automatic Identification and Verification of Claims in Social Media. In Working Notes of CLEF 2020.