Bribones tras la esmeralda perdida @ FakeDeS 2021: Fake news detection based on random forests, k-nearest neighbors, and n-grams for a Spanish corpus

Victor Lomas Barrie1, Nora Perez1, Victor Manuel Lara2, and Antonio Neme3

1 IIMAS, Universidad Nacional Autónoma de México, México
2 Facultad de Matemáticas, Universidad Autónoma de Yucatán, Mérida, Yucatán, México
3 Unidad Académica Mérida, IIMAS, Universidad Nacional Autónoma de México, Mérida, México
antonio.neme@iimas.unam.mx

Abstract. Fake news constitutes a social problem that affects democratic societies by influencing and manipulating public opinion. The concept of fake news covers a wide range of news that somehow resembles the truth. For example, such pieces might contain explicit false assertions, contain some aspects of the truth, or even present the truth in a distorted fashion to disqualify a political figure or a specific group. The obnoxious impact of fake news makes it imperative to develop tools that help professional journalists, on the one hand, and the public, on the other, to detect news that is fake. In this contribution, we describe a methodology aimed at telling fake news apart from true news in a dataset of several hundred manually curated journalistic pieces. The proposal is based on a bag-of-words approach and relies on shallow classifiers. We intended to validate the hypothesis that true and fake news can be told apart under simple assumptions. However, as discussed, the hypothesis was rejected, as it was not possible to successfully classify the texts in a subsequent test stage. The corpus used for the classification task is in Spanish, and it was presented as an open challenge in CodaLab.

Keywords: Fake news · Machine learning · Bag-of-words

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The University of Michigan Library defines fake news as those news stories that are false: the story itself is fabricated, with no verifiable facts, sources, or quotes [4]. Fake news is false, partially true, or a modified version of the truth, presented with the aim of giving the public a distorted view of a certain process, public actor, or social phenomenon. Fake news is a major source of confusion in every kind of society, and thus enormous efforts, mainly from public organizations, are in motion to try to expose not only fake news but also the players behind them. The latter tends to be a harder problem and is not further discussed here.

The spread of fake news tends to be faster than that of true news [5]. This difference in diffusion rate amplifies the damage caused by misleading information, since several actions have to be taken in order to counteract false information. The Collins English Dictionary declared fake news the word of the year in 2017, an indication of the growing exposure of the public to this kind of news.

In Mexico, the problem of fake news is a growing malady, documented in several studies [7, 1]. The vast majority of fake news tends to be political in content, aimed at some of the major public personalities. However, since 2020 and the arrival of the COVID-19 pandemic, the presence of fake news in the Mexican media, both online and in traditional TV and radio shows, has become a major problem. An obvious concern is the diffusion of false news that has a negative impact on public health.
The human factor in the detection of fake news is, so far, the best alternative [8]. The identification of suspicious news is a complex task. In order to detect a fake news item, the reader should follow a series of steps. First, she/he should determine the subject of the piece; second, try to put it in the context of what is already known; and third, corroborate that information in other trustworthy sites or media. Alternatively, she/he could confirm the information with an expert.

Big mass media are not exempt from fake news diffusion [6]. In fact, several major newspapers and broadcasting companies tend to be involved, knowingly or unknowingly, in the diffusion of fake news. An automatic tool that could help the public identify a suspicious news item can be obtained via a classifier. The problem of automatic classification involves two main components. The first one is a dataset, usually consisting of a large enough sample containing a relevant number of instances belonging to each of the relevant classes or labels. The second component is an algorithm that, based on certain characterizations of the instances in the dataset, is able to properly tell apart the relevant classes. In the problem at hand, the recognition of a journalistic piece as true or false (fake), the classification is a dichotomy, that is, each piece belongs to either one class or the other.

In this contribution, we describe our efforts to classify journalistic pieces as either true or false. The corpus (dataset) consists of almost 700 pieces manually curated as true or fake. The dataset was generated by researchers working in Mexican public institutions, and a public challenge was launched to obtain an algorithm able to cope with the task at hand. In this work, we started from the hypothesis that true and fake news can be adequately told apart based on the frequency of use of certain words. The rest of the contribution is organized as follows. In section 2 we briefly describe related work; in section 3 we describe our proposal. Finally, in sections 4 and 5 we describe the results of the proposed methodology on the above-mentioned dataset and offer some conclusions and discussion.

2 Related work

In [9], a methodology is described to characterize a given text in terms of n-grams. The main idea is to train an algorithm to find differences in the use of n-grams between true and fake pieces. The authors also provide a tool that helps the reader characterize a source as reliable or not, by checking the reliability of the publishing site. As classifiers, the authors use random forests and naive Bayes.

In [10], a deep learning architecture was trained on a corpus of several hundred pieces made available through a public challenge (Kaggle). The relevant idea was to correlate the title of the journalistic piece with the main body. When an inconsistency between the title and the body, processed by word2vec (more on this in the next section), was found, the text was flagged as a candidate fake news item.

In [8], several classifiers were trained to tell apart true and fake news, based on several features extracted from the texts. Some linguistic features were used as input to the classifiers. In particular, the percentage of words linked to positive or negative emotions, the percentage of stop words, and the use of informal language were of particular relevance.
In [11], the authors apply random forests, boosting, and linear regression algorithms to a corpus of news in Spanish. Each text is characterized following a bag-of-words approach, an n-gram approach, and a POS-tag n-gram representation. The results are relevant since it is one of the first experiences with texts in Spanish, and with Mexican media in particular.

3 The algorithm

Our approach is based on describing each text as a point in a high-dimensional space. Each dimension of that space is the relative frequency of use of one of the words in the whole vocabulary, which was obtained from all texts. The Bag-of-Words (BoW) approach treats language as a structure-less object, in which the relevant features are the relative frequencies of use or appearance of words (tokens). This approach leaves grammar aside. The term BoW was coined more than fifty years ago, in a dismissive way, to highlight the relevance of structure in language [12]. Despite heavy criticism, BoW has offered interesting results in several contexts, such as spam detection [13] and authorship attribution [14].

Related to the BoW approach, the perspective offered by word2vec is a rather powerful one [15], based on coding words as weights in neural networks. The tools under this umbrella allow a link from syntax to semantics. However, this technique requires large datasets and, more importantly, might hide differences in writing style. Since we are interested in detecting differences in the general way true and fake news are written, we postulate that relative frequency is enough. We give evidence regarding this postulate as the paper unfolds. Different alternatives to BoW exist, such as the one presented in [16]. There, a deep structure between words is obtained, and from it, inferences about possible authorship are made. Neural networks with deep architectures also offer a possibility, at the expense of making the inference of the relevant attributes a task in itself [17].

First, all texts are scanned to create the vocabulary, V. Then, in a second scan, each text t is mapped into a new space. This space is generated from the vocabulary, and it has |V| dimensions. The coordinates of text t are given by the relative frequency (probability) of use of each of the words in the vocabulary. Thus, the text t is a point in the space of relative frequencies. This n-gram approach (here with n = 1), although simple, allows the identification of relevant patterns. We define γ_t as the location of text t in the vocabulary space V. Each text t has an associated label, referred to as L_t, indicating whether it is true or fake. The value of L_t is either 0, if the text is true, or 1, if it is fake, according to the manual classification. Now we have defined all the required elements to frame the problem we are attacking within classification theory. The attributes are the relative frequencies of use of each word in the vocabulary, and the label indicates whether the text is true or fake. What we are interested in solving is thus the classification task of finding a mapping f such that L_t = f(γ_t).

The challenge consisted of two stages: 1. training/calibration, and 2. evaluation. In the first stage, 676 news items from the Mexican media were made public, 338 of which were true and 338 of which were false. The sample was balanced, which, in theory, facilitates the classification task. In the second stage, the labels were not unveiled, and the system was evaluated online, at most two times in total.
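To make the representation concrete, the following minimal sketch builds the vocabulary and maps each text to its relative-frequency point γ_t. It is an illustration of the representation described above, not the authors' implementation; whitespace tokenization, the function names, and the toy corpus are assumptions made for the example.

```python
# Minimal sketch of the BoW relative-frequency representation described above.
# Assumptions: whitespace tokenization and an in-memory list of texts;
# the actual pipeline may differ.
from collections import Counter

def build_vocabulary(texts, d):
    """Return the d most frequent tokens across all texts (true and fake)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(d)]

def to_frequency_vector(text, vocabulary):
    """Map a text t to gamma_t: relative frequency of each vocabulary word."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[word] / total for word in vocabulary]

# Toy usage: two texts with labels 0 (true) and 1 (fake).
texts = ["el gobierno anuncia nuevas medidas", "descubren una cura milagrosa"]
labels = [0, 1]
vocabulary = build_vocabulary(texts, d=10)
X = [to_frequency_vector(t, vocabulary) for t in texts]
```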
We give a panoramic view of the algorithm in Fig. 1. A classifier is an algorithm that, given a set of vectors or observations described by several attributes, tries to link them to elements of a second set, namely the labels or classes. In other words, a classifier provides a function between elements of one set and elements of a second one. This function can be either implicit, as in the case of neural networks, or explicit, as in linear regression. The former tend to show very low errors (when provided with enough data, both in quantity and quality), but tend to be only poorly interpretable. The latter are highly interpretable, but tend to have high errors, since almost no relevant phenomenon is well described by a linear association.

Fig. 1. The algorithm. The methodology is based on obtaining the vocabulary from all texts, both true and fake. Then, each text is characterized as a point in a d-dimensional space. Each dimension is the relative frequency of use of one of the d most common words among all texts.

Two classifiers were applied in our proposal. The first is random forests (RF), and the second one is k-nearest neighbors (kNN). RF offers results that are both interpretable and tend to present low classification errors [18]. It is an ensemble method that generates a group of decision trees [19] and, for each one of them, randomly selects a subsample of the whole training set. The overall decision about the class to which a given vector belongs is computed by taking into account the decision made by the majority of the decision trees [20]. In several implementations, the subsamples are randomly modified so as to give more robustness to the ensemble. The second classifier we tested is kNN. It belongs to the family of algorithms commonly described as guilty by association. Here, the label of the vector to be classified is determined by the most common label among its k nearest neighbors. The k neighbors of a vector v are its k closest vectors, based on Euclidean distance.

4 Results

The dataset T, consisting of 676 texts [2], half true and half fake, was used to tune the algorithm. We followed a permutation approach to create several subsamples from T and train the classifiers. The first stage, as described in the previous section, was to obtain the vocabulary from the available corpus. Fig. 2 shows the 25 most common words, leaving aside the majority of prepositions, conjunctions, connectors, and so on.

The number of attributes that are relevant to the classification task at hand, that is, the dimension of the feature space, is an unknown quantity. We estimated it by increasing d from 10 to 1000 and evaluating the classification capabilities of the system when the d most frequent words were considered. We were interested in the validation error, and for that, we split the dataset T into two sets: a training set and a validation set. σ is the percentage of T used for validation, whereas 1 − σ is the percentage used for training; we varied σ from 0.02 to 0.5. For each pair (d, σ), 50 Monte Carlo trials were conducted, selecting possibly different training and validation sets each time, and the average error is reported in Fig. 3. For RF, the maximum depth of the trees was min(10, 3 × log10(d)) and the number of estimators was 2 × d. For kNN, we linked k to the dimensionality d by k = max(5, 2 × log100(d)).
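Under the settings just listed, the Monte Carlo evaluation loop can be sketched as follows. This is a reconstruction with scikit-learn, not the code used in the experiments; the integer rounding of the depth and k heuristics, the function name evaluate, and the use of train_test_split are assumptions.

```python
# Sketch of the Monte Carlo evaluation under the stated settings:
# tree depth min(10, 3*log10(d)), 2*d estimators, k = max(5, 2*log_100(d)),
# and repeated train/validation splits of proportions (1 - sigma, sigma).
import math
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def evaluate(X, y, d, sigma, trials=50, seed=0):
    """Return the average validation error of RF and kNN over the trials."""
    rng = np.random.RandomState(seed)
    depth = int(min(10, 3 * math.log10(d)))       # integer rounding is an assumption
    k = max(5, int(round(2 * math.log(d, 100))))  # log base 100, as in the text
    rf_errors, knn_errors = [], []
    for _ in range(trials):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=sigma, random_state=rng.randint(10**6))
        rf = RandomForestClassifier(n_estimators=2 * d, max_depth=depth)
        rf.fit(X_tr, y_tr)
        rf_errors.append(1.0 - rf.score(X_va, y_va))
        knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
        knn.fit(X_tr, y_tr)
        knn_errors.append(1.0 - knn.score(X_va, y_va))
    return np.mean(rf_errors), np.mean(knn_errors)
```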
Fig. 3. Test error as a function of the number of words (dimensionality) d and the percentage of the sample used to test, σ. The percentage of the dataset T used to train the classifiers was thus 1 − σ. A. Error from random forests. B. Error from kNN.

As observed in Fig. 3, the classification error (the fraction of misclassified texts) did not reach acceptable values. Thus, the hypothesis that an n-gram approach with n = 1, using the d most frequent words in the vocabulary, is enough to classify a text as either true or false (fake) has to be rejected. Once we had trained the RF and kNN classifiers, we tested them in the second stage of the competition. The results were unsatisfactory: the performance for RF was 0.56 and for kNN was 0.54. Both results were below the average of the rest of the competitors.

Fig. 2. The 25 most common words, excluding prepositions, conjunctions, and connectors. At least 1/3 of the true news and 1/3 of the fake news had to make use of a word for it to be kept; otherwise, it was dropped from the vocabulary.

5 Conclusions

The hypothesis that true and fake news can be told apart based on the relative frequency of words has to be rejected, as shown in the previous section. We have shown that the n-gram approach, with n = 1, combined with random forests and k-nearest neighbors, did not offer good results. The reason for the low performance on the test dataset of the second stage of the challenge was that the vocabulary varied to such an extent that several of the most frequent words were not present in one of the two corpora. This limitation cannot easily be solved. We plan to use a more complex feature selection algorithm, while also making use of deep learning architectures over the attributes described in this contribution. We intend to verify whether a deeper algorithm could offer better results over the same attributes, or whether there is something inherently incomplete in them.

References

1. Gómez-Adorno H, Posadas-Durán JP, Bel-Enguix G, Porto C. Overview of FakeDeS Task at IberLEF 2021: Fake News Detection in Spanish. Procesamiento del Lenguaje Natural, Vol. 67 (2021).
2. Posadas-Durán JP, Gómez-Adorno H, Sidorov G, Escobar J, Moreno J. Detection of fake news in a new corpus for the Spanish language. Journal of Intelligent & Fuzzy Systems, Vol. 36, No. 5, 4869-4876 (2019).
3. Montes M, Rosso P, Gonzalo J, Aragón E, Agerri R, Carmona M, Mellado E, Albornoz J, Chiruzzo L, Freitas L, Adorno H, Gutiérrez Y, Zafra S, Lima S, Arco F, Taulé M. The overview of IberLEF 2021. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021).
4. Desai S, Mooney H, Oehrli J. "Fake News," Lies and Propaganda: How to Sort Fact from Fiction. University of Michigan Library (2021). https://guides.lib.umich.edu/fakenews
5. Vosoughi S, Roy D, Aral S. The spread of true and false news online. Science, Vol. 359, Issue 6380, 1146-1151 (2018). doi: 10.1126/science.aap9559
6. Hailey M. Fake News and the Sociological Imagination: Theory Informs Practice. Forty-sixth National LOEX Library Instruction Conference Proceedings (Library Orientation Series No. 51). http://hdl.handle.net/2027.42/143532
7. Además de pandemia por COVID-19, México enfrenta propagación de noticias falsas. UNAM, May 2020. https://www.dgcs.unam.mx/boletin/bdboletin/2020_318.html
8. Ahmad I, Yousaf M, Ahmad M. Fake News Detection Using Machine Learning Ensemble Methods. Complexity, Vol. 2020 (2020). doi: 10.1155/2020/8885861
9. Sharma U, Saran S, Patil S. Fake News Detection using Machine Learning Algorithms. International Journal of Engineering Research and Technology (IJERT), NTASU 2020, Vol. 09, Issue 03.
10. Maslej V, Butka P. Deep learning methods for Fake News detection (2019). doi: 10.1109/CINTI-MACRo49179.2019.9105317
11. Posadas-Durán JP, Sidorov G, Gómez-Adorno H, Moreno J. Detection of Fake News in a New Corpus for the Spanish Language. Journal of Intelligent and Fuzzy Systems (2019). doi: 10.3233/JIFS-179034
12. Harris Z. Distributional Structure. Word, 10 (2/3): 146-162 (1954). doi: 10.1080/00437956.1954.11659520
13. Sahami M, Dumais S, Heckerman D, Horvitz E. A Bayesian approach to filtering junk e-mail. AAAI'98 Workshop on Learning for Text Categorization (1998).
14. Boughaci D, Benmesbah M, Zebiri A. An improved N-grams based model for authorship attribution. 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 1-6 (2019). doi: 10.1109/ICCISci.2019.8716391
15. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013). arXiv:1310.4546
16. Gómez-Adorno H, Sidorov G, Pinto D, Vilariño D, Gelbukh A. Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs. Sensors, Vol. 16, No. 1374 (2016). doi: 10.3390/s16091374
17. Shrestha P, Sierra S, González F, Montes M, Rosso P, Solorio T. Convolutional Neural Networks for Authorship Attribution of Short Texts. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2 (2017).
18. Ho TK. Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, 278-282 (1995).
19. Quinlan JR. Induction of decision trees. Machine Learning, 1: 81-106 (1986). doi: 10.1007/BF00116251
20. Breiman L. Random Forests. Machine Learning, 45 (1): 5-32 (2001). doi: 10.1023/A:1010933404324