=Paper=
{{Paper
|id=Vol-2696/paper_134
|storemode=property
|title=Team Buster.ai at CheckThat! 2020: Insights and Recommendations to Improve Fact-Checking
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_134.pdf
|volume=Vol-2696
|authors=Mostafa Bouziane,Hugo Perrin,Aurélien Cluzeau,Julien Mardas,Amine Sadeq
|dblpUrl=https://dblp.org/rec/conf/clef/BouzianePCMS20
}}
==Team Buster.ai at CheckThat! 2020: Insights and Recommendations to Improve Fact-Checking==
Mostafa Bouziane, Hugo Perrin, Aurélien Cluzeau, Julien Mardas, and Amine Sadeq

Buster.AI, France (contact@buster.ai)

Abstract. As part of the CheckThat! 2020 Task 2, we investigated sentence similarity using transformer models. In Task 2, the goal was to effectively rank claims based on their relevance to an input tweet. While setting our baseline on sentence similarity for fact-checking, we gathered insights we felt compelled to share in this paper. We learned how multi-modal data utilization can foster significant uplifts in model performance. We also gained knowledge on which hybrid training and sampling strategies worked best for fact-checking applications, and we share our interpretation of the results we obtained. Finally, we explain our recommendations on data augmentation. All of the above allowed us to set our baseline in fact-checking in the CLEF CheckThat! 2020 Task 2 competition.

Keywords: Fact-Checking, Veracity, Evidence-based Verification, Fake News Detection, Computational Journalism, Natural Language Processing, Deep Learning, Language Model, Sentence Similarity.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

===Introduction===

In the run-up to elections in many economic powers, being able to discern deception from real news has never been more critical. Moreover, in recent years, the actual cost of inaccurate news has been assessed more thoroughly. With occurrences like the dubious Bloomberg report that almost wiped out 6 billion euros of Vinci's market value and recently resulted in a 5 million euro fine for the media conglomerate [13] [4], it is manifest that fake news can have a significant impact on economic and financial markets. As observed during the 2019/2020 coronavirus crisis [8], false health information can cause grave harm to public safety and to societies as a whole, from the economic to the political stability of entire countries. While most of the effort to combat fake news is still manual, we believe automation is vital to propel this fight to a broader impact.

In CheckThat! 2020 Task 2 [10], as described in [9], the goal was to effectively rank claims based on their relevance to an input tweet. We had to output a list of numerical scores for sentences in the database: the higher the individual score, the more accurately the match relates to the input text. We therefore judged our results with the Mean Average Precision (MAP) metric, which evaluates whether the valid claim is among the top-scored pieces of evidence. We will expand on the several approaches that have been tried in the background section.

Figure 1 explains our strategy: after preprocessing the data and filtering out potentially unsuitable claims, we estimate the matching probabilities with our neural network on the remaining acceptable claims. This filtering enables faster computation while preserving most of the performance obtained when computing scores for all documents in the database. We consequently trained our transformer model to learn sentence similarity at a semantic level, which we detail in the methodology section.
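Since MAP at depth k is the metric used throughout this paper, the following is a minimal sketch of how it can be computed. The list-based inputs and the normalisation by min(|relevant|, k) are assumptions of ours, not a definition taken from the competition's official scorer.

<syntaxhighlight lang="python">
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k for one query: average of precision@i over every rank i <= k holding a relevant claim."""
    hits, total = 0, 0.0
    for i, claim_id in enumerate(ranked_ids[:k], start=1):
        if claim_id in relevant_ids:
            hits += 1
            total += hits / i
    denom = min(len(relevant_ids), k)
    return total / denom if denom else 0.0


def mean_average_precision(rankings, relevants, k):
    """MAP@k over all queries: rankings[i] is the ranked claim-id list for query i."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)
</syntaxhighlight>

In Task 2 each tweet is paired with a single verified claim, so AP@k reduces to the reciprocal rank of that claim when it appears in the top k, and to zero otherwise.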
Solving such a task will assist human fact-checkers in drastically increasing their checking throughput. A desirable algorithm for this task will obtain source candidates automatically, diminishing proof-searching times for human fact-checkers, as explained in Figure 1. It is noteworthy that sentence correspondence analysis only serves as a guide for humans and cannot replace them for the ultimate purpose of fact-checking. Indeed, keeping humans in the loop for the final decision further ensures everyone's trust in the exactness of the fact-checking process, a confidence that is also strengthened by the associated facts from cited references, so anyone can verify them.

Fig. 1. Fact-checking pipeline

Furthermore, great strides have been made towards interpretable artificial intelligence in general, and solving such a task helps achieve this in the fact-checking setting. Being able to find matches in an evidence base allows the algorithm to provide its sources, so individuals can gauge its prediction quality, and it exhibits where an automatic entailment model built on top would acquire its reference material. For all those reasons, this aim has practical significance and usefulness.

After focusing on what has already been done to solve such a problem, we succinctly present our competition-winning submission. We outline the underlying architecture and how we trained it, which serves as a strong starting point for describing the insights we gained from the competition. The first recommendation we share is how multi-modal knowledge can be leveraged to enhance score quality. Then we discuss the use of external datasets, which helped us propel our algorithm farther, and our interpretation of why specific datasets appeared to help more than others. Building on these observations, we highlight how proper sampling is required to train models efficiently. Finally, we give a concise overview of other data augmentations that impacted our model both for training and inference.

===Background===

In Task 2, we recall, the goal is to rank claims based on their correspondence with the input statement. Our goal is therefore chiefly to build a sentence similarity model that learns semantics and hidden meanings. The latest advances in Natural Language Processing (NLP) suggest using Transformer and attention models. The Transformer [14] set the latest paradigm for solving sequence-to-sequence tasks while handling long-range dependencies with ease. This architecture is based on self-attention, which allows the model to look at the other words in the input sequence to better understand a particular word. Such attention is instrumental when the sentence is long or contains hidden context, which is why research has devised different variants of Transformer models, of which Bidirectional Encoder Representations from Transformers (BERT) is the most notable and constitutes the state of the art [2].

BERT's introduction caused a massive surprise in the NLP community, as it outperformed previous models on so many distinct tasks. This considerable success comes from BERT being one of the first models that were non-directional. Previous models based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs) implied a notion of direction, insofar as they read the input in a specific sequential order. For example, such models predict a missing word in a sentence using either the previous words or the following words, and can only combine both approaches through ensembling.
Nevertheless, with the BERT approach based on Transformers, this prediction can be made using all available words simultaneously through the attention-based architecture, making BERT much more potent at creating context-based embeddings. These context-aware representations are of interest for the task at hand, since they allow elegantly intertwined semantic links between the query input and potential candidate evidence. In addition to the non-directional approach, BERT-like models also introduced a new way of encoding input data, which makes BERT able to receive either a single sentence or a two-sentence pair as input, making it suitable for a large range of different tasks. Moreover, the weights pre-trained on vast amounts of data were released, leading to sizable interest from the community and a substantial amount of analysis, understanding, and improvement of the architecture, creating a series of BERT-based papers with several adjustments and specializations [6].

We also deem it noteworthy to draw parallels between the natural CLEF Task 2 pipeline and some of the pipelines proposed for Open-Domain Question Answering, such as the one recently posted by HuggingFace labs [3], since both exhibit the same kind of filter-then-classify pipeline. This blog post did not inspire our work, as it was only published at the time of writing, but it is nonetheless an interesting source for reframing our current work.

===Methodology: Our submission overview===

====Task Definition and dataset====

Task 2 aims to build tools that verify tweets based on a database of genuine claims. When we feed a tweet to the model, the comparison with the claims database produces a ranking of the claims from the most to the least relevant.

Fig. 2. Task 2 definition

Pairs of a tweet and its related claim constitute the samples of the available dataset; the tweet is the input we want to verify, and the related claim can be retrieved from the database of verified claims.

====Proposed approach====

A rank-based method is an obvious starting point for this task: the model would have to learn patterns to rank pairs based on how related they are. Nevertheless, for this specific task, we want to exploit the fact that a single claim can be related to the input, which simplifies the ranking task to either being related, i.e., top-ranked, or not related, i.e., ranked second or lower. A rank-based model could be the right approach if the rank scores were smooth and well distributed over all claims, which is hard to achieve since we do not have this information in the provided dataset.

Training: we therefore opted for a binary classification approach, as follows:
* using the genuine examples (tweet/claim pairs) provided in the dataset;
* generating wrong examples using several sampling strategies;
* using a binary cross-entropy loss.

Inference: for all our submissions, once the model is trained, we pair each tweet in the test set with all the verified claims in the database, and we then apply the neural network with a composite ranking score s, defined in Equation 1, with p_p being the probability of the pair being related and p_n the probability of it not being related to the query. It is important to mention that we replace the final sigmoid activation with a softmax activation for the last layer; in most cases sigmoid and softmax perform similarly, since the values already sum up to roughly one on in-domain data. Nevertheless, softmax proves a more reliable activation on unknown data.

s(p_n, p_p) = \begin{cases} p_p & \text{if } p_p \ge p_n \\ 1 - p_n & \text{otherwise} \end{cases} \qquad (1)
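As a rough illustration of this inference step, here is a minimal sketch assuming a RoBERTa sequence-classification model fine-tuned for the binary related/not-related task. The roberta-base name and the [not related, related] label order are placeholders of ours, not the exact checkpoint used for the submission.

<syntaxhighlight lang="python">
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: roberta-base has the 12-layer, 768-hidden, 12-head shape
# mentioned below, but the submission relied on its own fine-tuned weights.
MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def composite_score(p_not_related, p_related):
    """Equation 1: keep p_p when it dominates, otherwise fall back to 1 - p_n."""
    return p_related if p_related >= p_not_related else 1.0 - p_not_related


@torch.no_grad()
def score_pair(tweet, claim):
    """Score one tweet/claim pair with the softmax probabilities of the classifier."""
    inputs = tokenizer(tweet, claim, truncation=True, max_length=256, return_tensors="pt")
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    return composite_score(probs[0].item(), probs[1].item())  # assumed label order


def rank_claims(tweet, claims):
    """Rank every verified claim for one tweet (O(n*m) pairs over the whole test set)."""
    return sorted(claims, key=lambda claim: score_pair(tweet, claim), reverse=True)
</syntaxhighlight>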
====Network architecture====

Our network starts training from a RoBERTa checkpoint [7]. The underlying architecture is therefore a standard BERT architecture with a 12-layer, 768-hidden, 12-head configuration, which amounts to 125 million parameters [17]. This configuration results from the need for both robust and fast inference, which it achieved in practice: it reached the top score of around 93% MAP at depth 5, with an inference time below 0.1 seconds per pair. In absolute fairness, we point out that the inference complexity, with m the number of facts in our database and n the number of queries, is O(nm), which means our approach requires pre-filtering to scale to a broader knowledge base.

====Augmentations and dataset====

We must now disclose the external data we tried as a supplement for training. We first tried SciFact [15], a scientific claim verification corpus comprising 1,409 claims fact-checked against 5,183 abstracts with entailment labels. Secondly, we tried augmenting our corpus with FEVER [12], a substantially larger dataset with 185,445 short claims, a few orders of magnitude larger than the SciFact and CLEF datasets. Finally, we also used Liar [16], which contains around 12,800 claims, placing it as a middle ground in terms of size and featuring short claims and evidence.

===Results: Our insights===

For all comparisons here, we used checkpoints obtained early in training, to avoid overfitting issues falsifying the results, and we mention whenever we use a checkpoint from our submissions for reference. We used Whoosh [1] as a baseline instead of ElasticSearch: it provides a clean Python API and obtained decent metrics compared to ElasticSearch in our experiments.

====Multi-modal data utilization====

We define multi-modal information in this context as any information coming from external links, pictures, videos, or audio contained in the textual data; information which we eventually translated into text that can be fed to the model alongside the original tweet, as described in Figure 3. The extraction process can be applied to many kinds of data: from image reverse search to get a title for pictures, to fancier image/video description networks, or even speech-to-text algorithms, the possibilities are varied. For our submission, we mainly converted links to the name of the linked page, and obtained a title for images referred to in tweets through an image reverse search.

Table 1 sums up the results in terms of inference, and we notice a significant uplift in performance: with the same model checkpoint and no retraining, we achieve around 3% better performance, which amounts to more than half of the potential performance left to gain in this case. The importance of such preprocessing arises from the inherently multimedia nature of Twitter data, which can also be found in most news outlets nowadays. Hence, we recommend easing the learning and inference process by adding these additional pieces of information.

Fig. 3. Multi-modal pipeline

{| class="wikitable"
|+ Table 1. Metrics on the dev set: comparison between the Whoosh baseline and our model without and with multi-modal information.
! Metric !! Depth !! Whoosh baseline !! Without multi-modal !! With multi-modal
|-
| MAP || 1 || 0.804 || 0.96 || 0.99
|-
| MAP || 3 || 0.838 || 0.967 || 0.992
|-
| MAP || 5 || 0.843 || 0.97 || 0.992
|-
| MAP || 10 || 0.846 || 0.97 || 0.992
|-
| MAP || 20 || 0.847 || 0.97 || 0.992
|-
| MAP || all || 0.847 || 0.97 || 0.992
|}
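The link-resolution part of this preprocessing can be sketched as follows; the regex-based title extraction, the [LINK] separator, and the omission of the image reverse-search step are simplifications of ours, not the exact pipeline used for the submission.

<syntaxhighlight lang="python">
import re
import requests

URL_PATTERN = re.compile(r"https?://\S+")


def page_title(url, timeout=5):
    """Fetch a linked page and return its <title> text, or None on any failure."""
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return None
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def expand_tweet(text, separator=" [LINK] "):
    """Append the title of every linked page to the tweet, so the model sees it as text."""
    expanded = text
    for url in URL_PATTERN.findall(text):
        title = page_title(url)
        if title:
            expanded += separator + title
    return expanded
</syntaxhighlight>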
====Hybrid training====

This section explains how we used hybrid training with different external datasets and their effect on the overall results.

FEVER: sampling pairs from the FEVER dataset during training seems to help the model improve its performance. One reason is that those pairs have syntactic structures and lengths similar to the ones available in Task 2. The performance was stable on the dev set, and the results on the test set, where we obtained better MAP at depth k values, confirmed this.

SciFact: sampling pairs from the SciFact dataset during training did not improve the model's performance. One reason is that SciFact examples have varying numbers of tokens, considerably larger than in our core data, which makes it hard for the model to converge and generalize well.

Liar: as with SciFact, the Liar dataset did not improve the model's performance, since the structure of its examples does not match the ones we have in Task 2.

{| class="wikitable"
|+ Table 2. Metrics on the test set: results with and without sampling from the FEVER dataset.
! Metric !! Depth !! Without FEVER !! With FEVER
|-
| MAP || 1 || 0.8970 || 0.9070
|-
| MAP || 3 || 0.9260 || 0.9370
|-
| MAP || 5 || 0.9290 || 0.9380
|-
| MAP || 10 || 0.9290 || 0.9380
|-
| MAP || 20 || 0.9290 || 0.9380
|-
| MAP || all || 0.9290 || 0.9380
|}

====Sampling scheme====

We also advocate sensible sampling schemes to ensure training on the optimal information signal. When scrutinizing the CLEF dataset, we observe that naive sampling often produces negative samples, i.e., samples not related to the query we pair them with, that are utterly unrelated to said query, leading the model to learn the trivial solution of not pairing sentences that share nothing meaningful. This naive approach results in subpar MAP, as the model has a hard time distinguishing close pieces of evidence and consequently ranks them in an unsatisfactory manner. We can illustrate this effect with the following example: given the query "The Pope is blue with yellow stripes." and evidence mostly about the age and place of birth of various people, plus a few sentences about the Pope, a non-optimal sampling could lead the algorithm to give a high score to anything containing "The Pope", and therefore to potentially score the other sentences about the Pope poorly.

We solved this by drawing inspiration from a reinforcement learning principle called experience replay [5], in which past predictions are used to learn more efficiently. In our case, we moved towards this idea by sampling according to indexed searches, which produce results good enough to fool an algorithm trained with naive sampling, forcing it to learn the proper semantics to distinguish between syntactically and lexically close sentences. Ideally, we would take the top-ranked results of our own model to generate hard examples; however, we chose to avoid the incurred computational cost and used ElasticSearch results as a proxy for our model's ranking. This sampling is a fast and sufficiently accurate approximation of the reinforced mechanism of sampling new data based on our algorithm's errors. We used balanced sampling, and we fully acknowledge that doing so introduces a bias in the distribution of our dataset, which can have a negative impact on learning; nevertheless, we advocate that it is worth it for the fact-checking goal. In this context, we sample pairs of query and evidence, so balancing means we sample as many matching pairs as non-matching pairs.
In Table 3, we show evidence that this can be hugely beneficial to the training process, as it yields a tremendous gain in performance with the same network as our submission, making the difference between a usable and an unusable solution.

{| class="wikitable"
|+ Table 3. Metrics on the dev set: results with the proposed sampling and with naive sampling.
! Metric !! Depth !! With proposed sampling !! With naive sampling
|-
| MAP || 1 || 0.96 || 0.1
|-
| MAP || 3 || 0.967 || 0.1
|-
| MAP || 5 || 0.97 || 0.1
|-
| MAP || 10 || 0.97 || 0.1
|-
| MAP || 20 || 0.97 || 0.1
|-
| MAP || all || 0.97 || 0.1
|}

====Additional augmentations====

In this section, we present other strategies that we tried but that did not significantly improve the model's performance.

Named entity recognition (NER) based augmentation: given that Task 2 data contains many named entities, which are a crucial indicator of similarity between sentences, we ran a named entity recognition model on the pairs and augmented them with the detected labels. The model is a fine-tuned version of the bert-base multilingual cased weights from HuggingFace's transformers [17]. The results did not improve the overall performance, perhaps because our augmentation strategy was not robust: we only add a separator and the detected labels at the end of each sentence, which is not very informative since many sentences share common labels, and this prevents the model from converging better. A better strategy would be to add the label next to the detected entity, so that the model fits this kind of augmentation well and makes more sense of the relation between the words and the added labels. Another strategy would be to use a one-hot encoding of the named entities and stack it with the BERT embeddings before the classifier layers; that way, the model would learn how to use those added encodings without much negative impact on the semantics learned from the original sentence.

ConceptNet [11] based augmentation: ConceptNet, a general knowledge graph, treats words and phrases of natural language as nodes connected by labeled edges, thereby representing the general knowledge involved in understanding language and facilitating comprehension of the meanings hidden behind the everyday use of a word. The intuition behind using ConceptNet for our task is that combining it with word/sentence embeddings empowers an understanding that is not accessible from distributional semantics alone. As with NER, we augmented pairs using the ConceptNet API, with the semantic relations it produces. Whenever we detect a content word in a sentence, i.e., a verb, noun, or adjective, we append to the sentence the relations retrieved from the graph (e.g., "chanting is a synonym of shouting"). With these augmentations, the results did not vary significantly: we gained around 0.5% in MAP at depth 1 on the dev set and stayed almost stable at the other depths. Again, the strategy of adding a separator and the augmentation at the end of the sentence may not be the most pertinent. Indeed, ConceptNet supplies plenty of distinct relations such as "synonym of", "antonym of", "related to", "can be done with", and so on, so choosing the right relation is of great importance and cannot be done straightforwardly, as the correct relation to use may depend on context.
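A rough sketch of this augmentation, assuming the public REST endpoint at api.conceptnet.io and its JSON edge format; the [KB] separator and the pre-extracted list of content words (e.g., from a POS tagger) are placeholders of ours.

<syntaxhighlight lang="python">
import requests

CONCEPTNET_QUERY = "http://api.conceptnet.io/query"


def conceptnet_facts(word, limit=3):
    """Return short textual relations for a word, e.g. 'chant is a synonym of shout'."""
    params = {"node": f"/c/en/{word.lower()}", "limit": limit}
    edges = requests.get(CONCEPTNET_QUERY, params=params, timeout=5).json().get("edges", [])
    facts = []
    for edge in edges:
        surface = edge.get("surfaceText")
        if surface:  # human-readable form, when ConceptNet provides one
            facts.append(surface.replace("[[", "").replace("]]", ""))
        else:
            facts.append(f"{edge['start']['label']} {edge['rel']['label']} {edge['end']['label']}")
    return facts


def augment_with_conceptnet(sentence, content_words, separator=" [KB] "):
    """Append retrieved relations after a separator, mirroring the strategy described above."""
    facts = [fact for word in content_words for fact in conceptnet_facts(word)]
    return (sentence + separator + " ; ".join(facts)) if facts else sentence
</syntaxhighlight>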
{| class="wikitable"
|+ Table 4. Metrics on the dev set: results with and without the ConceptNet augmentation.
! Metric !! Depth !! Without ConceptNet !! With ConceptNet
|-
| MAP || 1 || 0.96 || 0.965
|-
| MAP || 3 || 0.967 || 0.967
|-
| MAP || 5 || 0.967 || 0.968
|-
| MAP || 10 || 0.967 || 0.969
|-
| MAP || 20 || 0.967 || 0.970
|-
| MAP || all || 0.967 || 0.970
|}

Back-and-forth translation: to augment the positive examples in the dataset, one can use a back-and-forth translation strategy. Given a source language S (English in our case) and a target language T (anything but English), we feed the example to translation algorithms in two steps:
* Step one: the example is fed to a translation algorithm from S to T, producing a new sentence in the target language.
* Step two: the new sentence is fed to a translation algorithm from T to S, producing a new sentence in the source language that is semantically equivalent to the original but formulated differently.

{| class="wikitable"
|+ Table 5. Metrics on the dev set: results with and without the translation augmentation.
! Metric !! Depth !! Without translation augmentation !! With augmentation
|-
| MAP || 1 || 0.96 || 0.95
|-
| MAP || 3 || 0.967 || 0.96
|-
| MAP || 5 || 0.967 || 0.965
|-
| MAP || 10 || 0.967 || 0.965
|-
| MAP || 20 || 0.967 || 0.965
|-
| MAP || all || 0.967 || 0.965
|}

Doing so, the dataset becomes more diverse in terms of syntax, which should help the model generalize better. The models are MarianMT models from HuggingFace's transformers [17]. This strategy did not help the baseline model improve, as the results remained stable (Table 5). We believe this is because the dataset is not big enough, so the augmentations did not have much effect on the overall performance.
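A minimal sketch of this round trip with MarianMT follows; French as the pivot language and the specific Helsinki-NLP checkpoints are our own illustrative choices, since the target language used for the experiments is not named above.

<syntaxhighlight lang="python">
from transformers import MarianMTModel, MarianTokenizer


def load_translator(checkpoint):
    return MarianTokenizer.from_pretrained(checkpoint), MarianMTModel.from_pretrained(checkpoint)


EN_FR = load_translator("Helsinki-NLP/opus-mt-en-fr")  # step one: source -> target
FR_EN = load_translator("Helsinki-NLP/opus-mt-fr-en")  # step two: target -> source


def translate(text, tokenizer, model):
    batch = tokenizer([text], return_tensors="pt", truncation=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


def back_translate(text):
    """Round-trip a positive example to obtain a paraphrase with different wording."""
    return translate(translate(text, *EN_FR), *FR_EN)


# e.g. back_translate("The Pope is blue with yellow stripes.")
</syntaxhighlight>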
===Conclusion===

In conclusion, our first recommendation for fact-checking is to ensure that every piece of information, no matter the medium, is leveraged, to avoid as much as possible relying on hidden information, context, or implicit knowledge. Then, we stressed the importance of consistent extra datasets and how they helped training, even when dealing with unrelated subjects, as was the case for us with FEVER, but also how they can impair learning if their size and structure are not consistent with the target data. We also explained the importance of proper sampling. Last but not least, we presented a few augmentations that we experimented with, primarily semantic and lexical augmentations, and hopefully gave inspiration for more augmentations to help fact-checking efforts in the future.

===References===

# Chaput, M. Whoosh documentation, 2020.
# Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018).
# Jernite, Y. Explain anything like I'm five: A model for open domain long form question answering, 2020.
# Kadri, F., and Agence France-Presse. Le faux Vinci ou l'éloge de la prudence, 2016.
# Lin, L.-J. Reinforcement learning for robots using neural networks. Tech. rep., Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, 1993.
# Liu, Q., Kusner, M. J., and Blunsom, P. A survey on contextual embeddings. ArXiv abs/2003.07278 (2020).
# Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019).
# United Nations. During this coronavirus pandemic, 'fake news' is putting lives at risk: UNESCO, 2020.
# Shaar, S., Babulkov, N., Da San Martino, G., and Nakov, P. That is a known lie: Detecting previously fact-checked claims. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Online, July 2020), ACL '20, Association for Computational Linguistics, pp. 3607-3618.
# Shaar, S., Nikolov, A., Babulkov, N., Alam, F., Barrón-Cedeño, A., Elsayed, T., Hasanain, M., Suwaileh, R., Haouari, F., Da San Martino, G., and Nakov, P. Overview of the CLEF-2020 CheckThat! lab on automatic identification and verification of claims in social media: English tasks. In Working Notes of CLEF 2020, Conference and Labs of the Evaluation Forum (Thessaloniki, Greece, 2020), CLEF '2020.
# Speer, R., Chin, J., and Havasi, C. ConceptNet 5.5: An open multilingual graph of general knowledge. CoRR abs/1612.03975 (2016).
# Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans, Louisiana, June 2018), Association for Computational Linguistics, pp. 809-819.
# Times, B. French hoax costs Bloomberg 5m euros in fines, 2019.
# Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR abs/1706.03762 (2017).
# Wadden, D., Lin, S., Lo, K., Wang, L. L., van Zuylen, M., Cohan, A., and Hajishirzi, H. Fact or fiction: Verifying scientific claims, 2020.
# Wang, W. Y. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vancouver, Canada, July 2017), Association for Computational Linguistics, pp. 422-426.
# Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019).