Text Similarity Using Word Embeddings to Classify Misinformation

Caio Sacramento de Britto Almeida1,2 and Débora Abdalla Santos1

1 Computer Science Department, Federal University of Bahia, Brazil
{caiosba,abdalla}@dcc.ufba.br
http://www.dcc.ufba.br
2 Meedan, San Francisco, USA
caio@meedan.com
https://meedan.com

Abstract. Fake news has been a growing problem in recent years, especially during elections. It is hard work to identify what is true and what is false among all the user-generated content that circulates every day. Technology can help with that work and optimize the fact-checking process. In this work, we address the challenge of finding similar content, so that a fact-checker can be shown articles that may have been verified before, thus avoiding the same information being verified more than once. This is especially important in collaborative approaches to fact-checking, where members of large teams will not know what content others have already fact-checked.

Keywords: Fake News · Word Embeddings · Text Classification · Text Similarity · Misinformation

1 Introduction

Fake news has always been around, but it has turned into a growing and more evident problem in recent years with the popularization of the Internet as a source of news, a role that used to be played by traditional media such as television, radio, magazines and newspapers. From a theoretical point of view [1], fake news must meet a few criteria: it is published and shared on the Internet; it is created with false content and without supporting evidence; and it is used to manipulate.

Disinformation (that is, purposely false information) has been playing a fundamental role in recent democratic electoral processes. Social media have the evident merit of allowing debates and amplifying voices in a space with great repercussion. Many studies show [2] how Twitter, Facebook and other platforms

Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). DHandNLP, 2 March 2020, Évora, Portugal.

became important instruments of democracy, as they allow exchanges and stimulate discussions. However, just as happens in public debate outside the virtual world, social media are also used to disseminate false information. Automated accounts that make it easy to send massive numbers of messages have become a potential tool to manipulate debates on social networks, especially at moments of political relevance. In this way, online platforms enable old strategies of defamation and manipulation of public debate, now at a larger scale [4].

Both bots and humans play an important role in creating and spreading fake news in electoral contexts, sometimes on purpose, other times by mistake. Due to the nature of human psychology, people tend to believe things that support their existing beliefs, a process called "confirmation bias", defined as the tendency to remember, interpret or search for information in a way that confirms an initial belief or hypothesis [5].

Attention begets more attention on social media [6]. Fake news on Twitter, for example, is 70% more likely to be accessed than true information [7], often due to shocking headlines or sensationalism.

2 Automatic recognition of fake news

Identifying fake news automatically is not a trivial task. One approach is based on linguistics and tries to identify text properties, such as writing style and content, that can help differentiate false articles from true ones. An assumption of this approach is that linguistic behaviors, such as punctuation, word choice and emotional charge, are unconscious and thus out of the author's control, and could therefore reveal important insights about the nature of the text. Studies based on this approach reach an accuracy near 76% when compared to human performance [9].
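To make the linguistic approach concrete, the sketch below extracts a few simple stylistic signals of the kind such classifiers rely on. These particular features are hypothetical illustrations chosen for this sketch; they are not the feature set used in the cited studies.

```python
import string

def stylistic_features(text):
    """Compute a few simple stylistic signals (hypothetical features,
    for illustration only; not the feature set of the cited studies)."""
    words = text.split()
    n_words = max(len(words), 1)
    return {
        # punctuation marks per character
        "punct_density": sum(c in string.punctuation for c in text) / max(len(text), 1),
        # share of fully upper-case words, a rough sensationalism cue
        "allcaps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        # exclamation marks per word, a proxy for emotional charge
        "exclamations_per_word": text.count("!") / n_words,
    }

calm = stylistic_features("The ministry published its annual report on Tuesday.")
loud = stylistic_features("SHOCKING!!! You WON'T believe what they are HIDING!!!")
```

In a full system, feature vectors like these would be fed to a supervised classifier trained on labeled true and false articles; the point here is only that style, not just content, carries signal.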
Although promising, future effort should not be limited to that and could also include other information, for example: the number of inbound and outbound links, the number of comments, and visual analysis of the page where the content appears, among other computational techniques for fact verification [10].

Even so, since fact-checking is work of high responsibility, a completely automated classification of news can be very risky, and the reputation of a media organization can be affected by it. Therefore, this work proposes a hybrid approach that creates a human-in-the-loop system.

3 Purpose

The purpose of this work is to identify articles that are similar to articles previously classified by a fact-checking agent as true or false, and in this way optimize the verification process. The idea is to implement a plug-in for Check [11], an open-source software for collaborative fact-checking already used in projects around the world. Check has been used, for example, during the US elections in 2016, the French elections in 2017, the Mexican elections in 2018 and the Indian elections in 2019, as well as in verification projects not related to elections. Such a plug-in can optimize the fact-checking process by suggesting similar articles that were already classified in the past, avoiding that the same, or substantially similar, content is fact-checked more than once.

Technically, the idea is to use artificial neural networks to identify similar articles based on their vector representations. The plug-in was implemented as a Check Bot, so every time a new piece of content is created on Check, the plug-in looks for similar items that have already been fact-checked.

4 Artificial Neural Network

Word2Vec [14] is the neural network used in this work. It contains two layers and processes text: the input is a text corpus and the output is a set of vectors, one characteristic vector for each word in that corpus.
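Because related words end up with nearby vectors, simple arithmetic over these representations captures analogies, as in the classic king − man + woman ≈ queen example from Mikolov et al. The sketch below reproduces that behavior with tiny hand-crafted vectors; they are hypothetical and for illustration only, since real Word2Vec vectors have hundreds of dimensions learned from data.

```python
import math

# Toy embeddings (hypothetical, hand-crafted for illustration only).
# Dimensions: [male, female, royal]
vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "child": [0.5, 0.5, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

With learned embeddings the same `analogy` logic works, only over dense high-dimensional vectors instead of these toy coordinates.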
Its applications go beyond text analysis. It can be used for genes, code, music playlists, social network graphs, and other verbal or symbolic series in which patterns can be recognized. An example given by the authors of Word2Vec: the vector that represents the token "Madrid", minus the vector that represents "Spain", plus the vector for "France", will be very close to the vector obtained for "Paris". Or, as an equation:

vec(Madrid) − vec(Spain) + vec(France) ≈ vec(Paris)  (1)

5 Results

Given the great results that Word2Vec can achieve in identifying text patterns [13], it was the choice for this work. We use a pretrained Word2Vec model, which contains 3 million vectors of 300 dimensions [15], trained on over 100 billion words from Google News. There are several promising avenues for extending this work to Portuguese (and other languages), but the datasets currently available in Portuguese are not as extensive or as well tested as Google News; so, in this paper, the plug-in is used to identify the similarity between an input article in English and pre-classified articles in English.

The flow works as follows: when new information to be verified is inserted into Check's database, the plug-in developed in this work takes action. It calculates the vectors of the input text using Word2Vec and stores them in an ElasticSearch database, a distributed, open-source, scalable search service with good search performance [16]. Another useful feature is that ElasticSearch can be extended through plug-ins, including plug-ins for search criteria. For this work, it was necessary to implement a search plug-in for ElasticSearch that calculates the similarity between the input text (represented as vectors) and each stored text (also represented as vectors) using cosine distance.
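The core of that search plug-in is ranking the stored texts by cosine distance to the query vector. A minimal sketch of this ranking step in plain Python follows; the item ids and two-dimensional document vectors are hypothetical stand-ins for real 300-dimensional Word2Vec representations.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def most_similar(query_vec, stored, k=2):
    """Return the k stored item ids closest to query_vec.
    stored: dict mapping item id -> document vector (toy 2-d vectors here)."""
    return sorted(stored, key=lambda item: cosine_distance(query_vec, stored[item]))[:k]

# Hypothetical previously fact-checked items and their vectors.
stored = {
    "claim-1": [0.9, 0.1],
    "claim-2": [0.1, 0.9],
    "claim-3": [0.7, 0.3],
}
print(most_similar([1.0, 0.0], stored))  # -> ['claim-1', 'claim-3']
```

In the deployed system this comparison runs inside ElasticSearch via the custom search plug-in, so the ranking happens close to where the vectors are stored rather than in application code.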
The interface between Check and ElasticSearch is provided by Alegre, an API that is part of the Check suite and is responsible for text and image processing, for example similarity, classification, glossaries and language identification. A search for similar texts therefore returns the vectors with the smallest cosine distances to the input vectors. The plug-in then uses Check's API to suggest, to journalists using Check, similar articles that were classified before and resemble the article the user is currently verifying. This way, the user can decide whether or not to relate them, and thus avoid fact-checking the same content multiple times. This workflow is represented in Fig. 1.

Fig. 1. Workflow

The plug-in was integrated into the software Check and is available on GitHub [17].

6 Conclusions and future work

In this paper, we developed an open-source architecture that uses ElasticSearch to efficiently search large volumes of text items by their vector representations in near real-time, and integrated it into an open-source fact-checking software tool. We used Word2Vec trained on English-language text, but the architecture we developed is not specific to these choices and can easily be adapted to other vector-representation models and languages.

First, we would like to use a corpus with Portuguese text [19], so this solution could be more useful for media organizations in Brazil. Second, the notion of "similar" is too broad: texts can be similar but completely contradictory. In this sense, it would be very useful to determine whether a given text supports or refutes another text. An approach for that could be stance detection [18], but it requires more research.
Moreover, the choice of Word2Vec led to promising results, but recent advances in this domain suggest that transformer models such as BERT, SBERT and XLM-RoBERTa are able to improve the performance of NLP tasks, so those options should also be evaluated [20-22]. Although suggesting similar texts can optimize fact-checking work, much more could be done if more specific input corpora were built. Finally, we should evaluate how much this approach helps the verification work, for example by counting how many content items did not need to be verified again because similar, previously verified items were correctly suggested by this tool.

References

1. TANDOC JR., E. C.; LIM, Z. W.; LING, R. Defining "fake news": a typology of scholarly definitions. Digital Journalism, Taylor & Francis, v. 6, n. 2, p. 137-153, 2018.
2. ENLI, G. S.; SKOGERBØ, E. Personalized campaigns in party-centred politics: Twitter and Facebook as arenas for political communication. Information, Communication & Society, v. 16, n. 5, p. 757-774, 2013.
3. ADORNO, G.; SILVEIRA, J. Pós-Verdade e Fake News: Equívocos do Político na Materialidade Digital. Anais do SEAD, 8:1-6, 2018.
4. RUEDIGER, M. A. et al. Robôs, redes sociais e política no Brasil: estudo sobre interferências ilegítimas no debate público na web, riscos à democracia e processo eleitoral de 2018. 2017.
5. PLOUS, S. The Psychology of Judgment and Decision Making. McGraw-Hill, 1993.
6. HALE, S. A. et al. How digital design shapes political participation: A natural experiment with social information. PLOS ONE, v. 13, p. 1-20, 2018.
7. LANGIN, K. Fake news spreads faster than true news on Twitter - thanks to people, not bots. Science, 2018.
8. DERAKHSHAN, H.; WARDLE, C. Information disorder: definitions. In: Understanding and Addressing the Disinformation Ecosystem, p. 5-12, 2017.
9. PÉREZ-ROSAS, V. et al. Automatic detection of fake news. arXiv preprint arXiv:1708.07104, 2017.
10.
THORNE, J. et al. FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355, 2018.
11. MEEDAN. Check. Available at https://meedan.com/en/check/. 2016.
12. KOVÁCS, Z. L. Redes neurais artificiais. Editora Livraria da Física, 2002.
13. GOLDBERG, Y.; LEVY, O. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
14. MIKOLOV, T. et al. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, p. 3111-3119, 2013.
15. MIHÁLTZ, M. Word2Vec GoogleNews Vectors. Available at https://github.com/mmihaltz/word2vec-GoogleNews-vectors. 2019.
16. BANON, S. Elasticsearch. 2013.
17. MEEDAN. Check Source Code. Available at https://github.com/meedan/check. 2018.
18. GHANEM, B.; ROSSO, P.; RANGEL, F. Stance detection in fake news: a combined feature representation. In: Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), p. 66-71, 2018.
19. MONTEIRO, R. A.; SANTOS, R. L. S.; PARDO, T. A. S.; DE ALMEIDA, T. A.; RUIZ, E. E. S.; VALE, O. A. Contributions to the study of fake news in Portuguese: new corpus and automatic detection results. In: Villavicencio, A. et al. (eds.) Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science, vol. 11122. Springer, Cham, 2018.
20. DEVLIN, J. et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
21. REIMERS, N.; GUREVYCH, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
22. CONNEAU, A. et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.