=Paper=
{{Paper
|id=Vol-2943/fakedes_paper7
|storemode=property
|title=LCAD - UFES at FakeDeS 2021: Fake News Detection Using Named Entity Recognition and Part-of-Speech Sequences
|pdfUrl=https://ceur-ws.org/Vol-2943/fakedes_paper7.pdf
|volume=Vol-2943
|authors=Marcos A. Spalenza,Leopoldo Lusquino-Filho,Felipe M. G. Franca,Priscila M. V. Lima,Elias de Oliveira
|dblpUrl=https://dblp.org/rec/conf/sepln/SpalenzaFFLO21
}}
==LCAD - UFES at FakeDeS 2021: Fake News Detection Using Named Entity Recognition and Part-of-Speech Sequences==
Marcos A. Spalenza (1), Leopoldo Lusquino-Filho (3), Felipe M. G. França (3), Priscila M. V. Lima (2,3), and Elias de Oliveira (1)

(1) Postgraduate Program in Informatics (PPGI), Federal University of Espírito Santo (UFES), Vitória-ES, Brazil.
(2) NCE, Tércio Pacitti Institute, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro-RJ, Brazil.
(3) Systems Engineering and Computer Science Program (PESC), Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro-RJ, Brazil.

marcos.spalenza@gmail.com, elias@lcad.inf.ufes.br

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. News is fundamental for sharing interesting and relevant facts for public knowledge. However, unreliable sources produce fake and biased information, releasing content without proper fact-checking. Such biased content reaches massive disclosure on the internet and feeds sociopolitical tendencies. Consequently, the identification of inaccurate news minimizes the damage to public entities. To counter misinformation, fact-checking agencies investigate trending news, but manual checking is slow and expensive. To filter these demands, we propose an automated method based on linguistic components to support the identification of fake content. Our approach applies machine learning over POSTag+NER sequences. In an inter-domain analysis, our method achieves 71% in the F1 measure for fake news detection.

Keywords: Fake News Detection · Natural Language Processing · Ensemble Learning · Named Entity Recognition

1 Introduction

The information acquired through the internet has made news agencies more dynamic. The combination of news immediacy and the short textual summaries of social networks increases the disclosure of unlawful, defamatory, threatening, false, or misleading content shared without verification [22]. Such content frequently originates from unreliable sources and is biased toward political or social entities.

The untruth inside news articles impacts common knowledge, reflecting on public entities and social traditions [23]. The Covid-19 pandemic emphasizes the misinformation problem: the widespread circulation of fake news worsened the worldwide situation, causing denial, self-medication, and political attacks [8, 13]. Meanwhile, countries study how to address the major unlawful consequences of fake information while preserving freedom of expression rights. These concerns highlight the core problem of fake news: the subjective details that characterize a fact as true or false [25].

Despite this subjectivity, we study methods to identify fake news among widespread topics using linguistic features. These features model documents through their sequential textual structures; in other words, our approach consists of finding language models for the false and true categories inside the news corpora.

This paper is organized in five sections. Section 2 reviews related work on fake news detection. Section 3 presents the POSTag+NER model, which analyzes documents through grammar sequences. Section 4 describes the fake news datasets and the obtained results. Finally, Section 5 presents our conclusions and future work on fake news detection.

2 Related Works
Content production on digital platforms is mostly unverified, widely reproduced, and continuously modified. The production of fake content is unbalanced with respect to the demand for verification and filtering [21]. These high demands are a consequence of the disclosure of fake information through social networks and unreliable news sites [3].

The propagation of fake news encompasses diverse factors: the media, the source, the propagation chain, the references, and the images or textual content that compose the news article data [6, 14]. Some works highlight the importance of metadata to classify an article as fake or true [2, 27]. Metadata traces the likelihood of a false publication through the replication chain, the user comments, the cross-references, and the general reliability of the news agency. Nevertheless, the contents remain the main evidence of factual and non-factual knowledge. Regarding the contents of news articles, linguistic features are the most studied approach to investigate the dissemination of untruthful content [4].

From a linguistic perspective, to support the verification process and restrain misinformation, fake news detection systems seek factual and non-factual content [5, 20]. The methods include searching for similar structures in the text, sentence descriptors, writing models, and linguistic sequences [10, 17]. Using linguistic descriptors, the models extract the writing patterns of the news; these descriptors comprise syntax, polarity, grammar, and readability levels.

Content-based methods analyze textual and visual information to detect fake content [26]. The textual analysis includes identifying equivalent propositions and weighting potentially incoherent words or sequences. However, modified, symmetric, or related textual features are a vulnerability of basic content-based detection [15, 28]; in other words, fact-checking systems have to be robust to avoid textual bias. From this perspective, recent studies aim to advance the construction of linguistic models to improve the assessment of writing coherence, information quality, and inter-domain adaptability [10, 20].

3 Model

Our approach consists of identifying language models for factual and non-factual articles by recognizing their specific linguistic structures. The language modeling aims to separate the fake and true classes through writing patterns [24], where each pattern represents a particular writing style defined by textual sequences. In this paper, our analysis uses Part-of-Speech Tags (POSTags) to compare news articles through their grammar sequences. We aim to produce a system that recognizes and learns to evaluate the coherence of the writing; therefore, we expect articles categorized as fake to contain incoherent, biased, repetitive, or incorrect language.

We present an example to illustrate the process of generating the linguistic models. In the first step, the POS tagger applies a pre-trained language model to convert the words into their grammatical references, as in the sketch below and the worked example that follows it.
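This first step can be approximated with a spaCy Spanish pipeline, the library mentioned at the end of this section. The sketch below is only an illustration, not the authors' released code; in particular, the model name es_core_news_sm is an assumption, since the paper does not state which pipeline size was used.

```python
import spacy

# Spanish pipeline providing both the POS tagger and the NER model
# (the pipeline size "sm" is an assumption).
nlp = spacy.load("es_core_news_sm")

def pos_sequence(text: str) -> str:
    """Map each token of `text` to its coarse part-of-speech (UPOS) tag."""
    doc = nlp(text)
    return " ".join(token.pos_ for token in doc)

# pos_sequence("Luego de que se revelara que el ex alcalde de Cuernavaca ...")
# -> "ADV ADP SCONJ PRON VERB SCONJ DET ADJ NOUN ADP PROPN ..."
# (the exact tags depend on the pipeline version)
```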
Example

Luego de que se revelara que el ex alcalde de Cuernavaca y ahora aspirante de la gubernatura de Morelos, Cuauhtémoc Blanco, aún sigue registrado como jugador del Club América y en breve participará en un partido, el director técnico de la selección mexicana, Juan Carlos Osorio, reveló que Cuauhtémoc será uno de los futbolistas que serán convocados para representar a México en el Mundial de Rusia del *NUMBER*.

Part-of-Speech

ADV ADP SCONJ PRON VERB SCONJ DET ADJ NOUN ADP PROPN CCONJ ADV NOUN ADP DET NOUN ADP PROPN PUNCT PROPN PROPN PUNCT ADV VERB ADJ SCONJ NOUN ADP PROPN PROPN CCONJ ADP NOUN VERB ADP DET NOUN PUNCT DET NOUN ADJ ADP DET NOUN ADJ PUNCT PROPN PROPN PROPN PUNCT VERB SCONJ PROPN AUX PRON ADP DET NOUN PRON AUX VERB ADP VERB ADP PROPN ADP DET PROPN ADP PROPN ADP SYM PROPN SYM PUNCT

In the example, we observe the original text of the news article and the grammatical tag of each word. Afterward, we apply Named Entity Recognition (NER) to identify the entities that compose the news article and to classify their semantic role. Using NER, we search for the targets of fake news within the textual components, looking for name references such as politicians, organizations, locations, or public figures. The entity classification includes four categories: Person, Organization, Location, and Miscellaneous. Together, over the original text sequence, POSTag and NER transform the sentences into grammar functions plus the semantics of specific names. The second step, shown next, details the detection and classification of named entities within the selected example.

Named Entity Recognition

O O O O O O O O O O LOC O O O O O O O PER O PER PER O O O O O O O ORG ORG O O O O O O O O O O O O O O O O PER PER PER O O O PER O O O O O O O O O O O PER O O MISC MISC MISC O O NUM O O

Cuernavaca (LOC), Morelos (PER), Cuauhtémoc (PER), Blanco (PER), Club (ORG), América (ORG), Juan (PER), Carlos (PER), Osorio (PER), Cuauhtémoc (PER), México (PER), Mundial (MISC), de (MISC), Rusia (MISC), NUMBER (NUM)

Applying the POSTag+NER sequences [24], the system analyzes the words only by their function in the sentence. When words are considered individually, the factual content of an article is not necessarily aligned with its categorization as fake or true; therefore, word frequencies are likely to bias the categorization of an article by its content [15]. Our example outlines the combination of POSTag and NER at the end of the preprocessing.

POSTag+NER

ADV ADP SCONJ PRON VERB SCONJ DET ADJ NOUN ADP LOC CCONJ ADV NOUN ADP DET NOUN ADP PER PUNCT PER PER PUNCT ADV VERB ADJ SCONJ NOUN ADP ORG ORG CCONJ ADP NOUN VERB ADP DET NOUN PUNCT DET NOUN ADJ ADP DET NOUN ADJ PUNCT PER PER PER PUNCT VERB SCONJ PER AUX PRON ADP DET NOUN PRON AUX VERB ADP VERB ADP PER ADP DET MISC MISC MISC ADP NUM PUNCT

To sum up, POSTag+NER is a text preprocessing method that organizes the data into vectors of linguistic patterns. These patterns capture the news targets, language, correctness, and factual structures. The model consists of grammar sequences built from document n-grams, in which a recognized entity replaces the original grammar tag; the entities identify the news characters through a semantic description integrated into the sequence model. At the vectorization step, we apply Term Frequency (TF) to generate high-dimensional and sparse document vectors, using 3- up to 7-gram sequences. A sketch of the entity replacement and of the vectorization, extending the previous one, is shown below.
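Continuing the previous sketch, the entity-aware sequences and the TF n-gram vectorization could be approximated as follows. This is again a hedged illustration: the CountVectorizer settings and the placeholder corpus are assumptions, not the authors' code.

```python
from sklearn.feature_extraction.text import CountVectorizer

def postag_ner_sequence(text: str) -> str:
    """POSTag+NER: use the entity label (PER, ORG, LOC, MISC in the spaCy
    Spanish models) for entity tokens and the UPOS tag for all other tokens."""
    doc = nlp(text)  # `nlp` is the pipeline loaded in the previous sketch
    return " ".join(tok.ent_type_ if tok.ent_type_ else tok.pos_ for tok in doc)

# Placeholder corpus; in the experiments, one string per news article.
corpus = ["Luego de que se revelara que el ex alcalde de Cuernavaca ..."]
tag_sequences = [postag_ner_sequence(article) for article in corpus]

# Term-frequency counts over 3- to 7-gram subsequences of tags, producing the
# high-dimensional and sparse document matrix described in the text.
vectorizer = CountVectorizer(ngram_range=(3, 7), lowercase=False, token_pattern=r"\S+")
X = vectorizer.fit_transform(tag_sequences)
```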
Figure 1 presents the resulting sparse document matrix, containing 1,543 document vectors and 874,584 features, with only 0.2369% of the positions filled. Despite the class subjectivity concerning the documents' content, we highlight the low occurrence of the features (sparsity) and the coverage of linguistic structures (high dimensionality).

Fig. 1. The POSTag+NER features' dispersion in the sparse matrix.

We test four classifiers to analyze the linguistic modeling from different perspectives: Support Vector Machines (SVM), Random Forest (RF), Gradient Boosting (GB), and Wilkie, Stonham & Aleksander's Recognition Device (WiSARD). The SVM evaluates the document dispersion with kernels, which define feature-threshold hyperplanes to classify unseen samples [18]. The RF identifies feature thresholds and classifies the samples through multiple decision trees [7]. The GB combines decision tree models through a differentiable loss function, reducing the global error of each weak learner in iterative improvements [11]. Finally, WiSARD [1] is a weightless neural network that uses a binary thermometer data encoding [9] and classifies the samples by matching the class binary patterns.

The pre-trained linguistic models for the POS tagging and NER steps are provided by spaCy (https://spacy.io/), and the classifiers by scikit-learn (https://scikit-learn.org/) and wisardpkg (https://github.com/IAZero/wisardpkg/). Through the different classifiers, we evaluate how well the linguistic patterns are learned to produce the class models; a sketch of this classification setup is given below.
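A hedged sketch of how this comparison could be assembled with scikit-learn follows. The hyperparameter values mirror those reported in Section 4 (1,000 boosting stages with learning rate 0.03 for GB, 250 trees for RF); all other settings are assumed to remain at library defaults, and WiSARD is omitted because it comes from wisardpkg, whose API is not reproduced here.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# X_train / X_test: TF matrices of POSTag+NER n-grams (see the previous sketch),
# built from the training and test corpora; y_train / y_test: labels with the
# fake class encoded as 1 (the positive class).
classifiers = {
    "GB": GradientBoostingClassifier(n_estimators=1000, learning_rate=0.03),
    "RF": RandomForestClassifier(n_estimators=250),
    "SVM": SVC(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(name, f1_score(y_test, predictions, pos_label=1))
```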
4 Experiments and Results

The conducted experiments evaluate the POSTag+NER approach for Spanish fake news detection. To perform false content detection, the systems need to identify the untruth across different data sources; it is therefore fundamental that the systems adapt to multiple domains, sources, and contents. In this experiment, we tested the inter-domain classification of the system.

The training dataset contains 971 news articles in Spanish [19]: 491 real and 480 fake articles from the science, sport, economy, education, entertainment, politics, health, security, and society domains. The true class was manually tagged from reliable news agencies and the fake class from specialized verification sites. The test data contains 572 news articles about the Covid-19 pandemic, 286 true and 286 fake [12], gathered from multiple Ibero-American countries. The challenge is to recognize relations between the documents despite the dissimilar themes [16]. Additionally, the news articles present the regional language adaptations of each country. Our approach aims to identify the linguistic models and the factual structure of each category, regardless of the textual content.

To evaluate the system, we apply the F1 score to measure the detection performance. The F1 score is the harmonic mean of precision and recall. In fake news detection, the evaluation focuses on the binary categorization with the fake class as the positive class. Besides that, we executed the system ten times to collect the standard deviation of the classifiers' performance. Table 1 reports the classifiers' maximum and average results, together with the standard deviation.

Table 1. Results achieved at the IberLEF Spanish Fake News Detection Task (train: Spanish Fake News Dataset; test: FakeDeS Covid-19 Dataset).

                 GB       RF       SVM      WiSARD
F1 Max. (%)      71.14    67.41    64.81    64.01
F1 Avg. (%)      70.83    65.63    64.81    63.28
Std. dev.        0.1807   0.9660   0.0000   0.4652

Table 1 highlights the performance of the tested classifiers at the IberLEF Spanish Fake News Detection Task. From an inter-domain perspective, the four methods present reasonable classification capabilities.

With WiSARD, using a 3-bit thermometer encoding and an address size of 50, the inter-domain classification averaged 63% F1 score. This case specifically indicates insufficient generalization of high-dimensional binary models. The training vectors, composed mostly of zero values, have low compatibility with the grammar sequences of the test samples; as a consequence, the test vectors are similar to both binary class patterns.

The SVM builds its kernels by identifying the feature thresholds. The training samples outline a regular SVC kernel zone, reinforcing the relevance of key features, although these kernels do not delimit an efficient class area. Even so, the observed performance was slightly better than WiSARD, approximately 65% F1 score. In general, SVM and WiSARD perform reasonably, but according to the results the ensemble methods produce a more robust and efficient classification.

The ensemble training of RF and GB builds complex class models by merging weak classifiers. On the one hand, the RF combines 250 weak learners, using feature analysis to identify relevant information in spite of the high-dimensional and sparse samples. On the other hand, the GB reduces the global error by combining 1,000 weak learners with a learning rate of 0.03. The RF reaches 67% and the GB 71% F1 score.

Analyzing the results, we observe a higher performance of GB compared with the other classifiers. Regardless of the classifier, the learning and the performance of the POSTag+NER linguistic model present good results in the inter-domain experiments. The results confirm the information acquisition of GB, with over 70% F1 score and low standard deviation.

5 Conclusion

Fake news is a growing problem in its social, economic, and political aspects. The automatic detection of false content reduces public replication and misinformation. For this problem, we propose the application of Named Entity Recognition in addition to Part-of-Speech tag sequences.

The experiments establish an inter-domain challenge, evaluating the adequacy of the model between a general fake news dataset and a thematic Covid-19 fake news dataset. The POSTag+NER results are similar to other approaches, presenting high-level performance. To improve this work, the next steps comprise analyzing the most relevant features to iteratively create better models and identifying the textual entailment of the features related to non-factual information. In addition, we intend to further analyze and enhance inter-domain detection.

Acknowledgements

The authors acknowledge the Research Support Foundation of Espírito Santo (FAPES, process 80136451) for the research support grant.

References

1. Aleksander, I., Thomas, W., Bowden, P.: WISARD · A Radical Step Forward in Image Recognition. Sensor Review (1984)
2. Alhindi, T., Petridis, S., Muresan, S.: Where is Your Evidence: Improving Fact-checking by Justification Modeling. In: Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). pp. 85–90. Association for Computational Linguistics (Nov 2018)
3. Apuke, O.D., Omar, B.: Fake News and COVID-19: Modelling the Predictors of Fake News Sharing Among Social Media Users. Telematics and Informatics 56, 101475 (2021)
4. Beer, D., Matthee, M.: Approaches to Identify Fake News: A Systematic Literature Review. Integrated Science in Digital Age, pp. 13–22 (2020)
5. Bonet-Jover, A., Piad-Morffis, A., Saquete, E., Martínez-Barco, P., Ángel García-Cumbreras, M.: Exploiting Discourse Structure of Traditional Digital Media to Enhance Automatic Fake News Detection. Expert Systems with Applications 169, 114340 (2021)
6. Bozarth, L., Budak, C.: Toward a Better Performance Evaluation Framework for Fake News Classification. In: Proceedings of the International AAAI Conference on Web and Social Media. vol. 14, pp. 60–71 (May 2020)
7. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
8. Bridgman, A., Merkley, E., Loewen, P.J., Owen, T., Ruths, D., Teichmann, L., Zhilin, O.: The Causes and Consequences of COVID-19 Misperceptions: Understanding the Role of News and Social Media. Harvard Kennedy School Misinformation Review 1(3) (2020)
9. Buckman, J., Roy, A., Raffel, C., Goodfellow, I.: Thermometer Encoding: One Hot Way To Resist Adversarial Examples. In: International Conference on Learning Representations (2018)
10. Choudhary, A., Arora, A.: Linguistic Feature Based Learning Model for Fake News Detection and Classification. Expert Systems with Applications 169, 114171 (2021)
11. Friedman, J.H.: Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5), 1189–1232 (Oct 2001)
12. Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G., Porto, C.: Overview of FakeDeS Task at IberLEF 2020: Fake News Detection in Spanish. Procesamiento del Lenguaje Natural 67(0) (2021)
13. Hartley, K., Vu, M.K.: Fighting Fake News in the COVID-19 Era: Policy Insights from an Equilibrium Model. Policy Sciences 53(4), 735–758 (2020)
14. Kaliyar, R., Goswami, A., Narang, P.: A Hybrid Model for Effective Fake News Detection with a Novel COVID-19 Dataset. In: Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART). vol. 2, pp. 1066–1072. INSTICC, SciTePress (2021)
15. Monteiro, R.A., Santos, R.L.S., Pardo, T.A.S., de Almeida, T.A., Ruiz, E.E.S., Vale, O.A.: Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results. In: Computational Processing of the Portuguese Language. pp. 324–334. Springer International Publishing, Cham (2018)
16. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez-Carmona, M.A., Mellado, E.A., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Adorno, H.G., Gutiérrez, Y., Zafra, S.M.J., Lima, S., Plaza-de Arco, F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR Workshop Proceedings, CEUR-WS, Spain (2021)
17. Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic Detection of Fake News. In: Proceedings of the 27th International Conference on Computational Linguistics. vol. 17, pp. 3391–3401. Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug 2018)
18. Platt, J., et al.: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers 10(3), 61–74 (1999)
19. Posadas-Durán, J.P., Gómez-Adorno, H., Sidorov, G., Escobar, J.J.M.: Detection of Fake News in a New Corpus for the Spanish Language. Journal of Intelligent & Fuzzy Systems 36(5), 4869–4876 (Jan 2019)
20. Rashkin, H., Choi, E., Jang, J.Y., Volkova, S., Choi, Y.: Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2931–2937. Association for Computational Linguistics, Copenhagen, Denmark (Sep 2017)
21. Shao, C., Ciampaglia, G.L., Varol, O., Yang, K.C., Flammini, A., Menczer, F.: The Spread of Low-Credibility Content by Social Bots. Nature Communications 9(1), 1–9 (2018)
22. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter 19(1), 22–36 (Sep 2017)
23. Shu, K., Wang, S., Lee, D., Liu, H.: Mining Disinformation and Fake News: Concepts, Methods, and Recent Advancements, pp. 1–19. Springer International Publishing, Cham (2020)
24. Spalenza, M.A., Oliveira, E., Lusquino-Filho, L.A.D., Lima, P.M.V., França, F.M.G.: Using NER+ML to Automatically Detect Fake News. In: 20th International Conference on Intelligent Systems Design and Applications (ISDA). vol. 20, pp. 1176–1187 (Dec 2020)
25. Vieira, L.L., Jeronimo, C.L.M., Campelo, C.E.C., Marinho, L.B.: Analysis of the Subjectivity Level in Fake News Fragments. In: Proceedings of the Brazilian Symposium on Multimedia and the Web (WebMedia '20). pp. 233–240. Association for Computing Machinery, São Luís, Brazil (2020)
26. Zhou, X., Wu, J., Zafarani, R.: SAFE: Similarity-Aware Multi-modal Fake News Detection. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 354–367. Springer (2020)
27. Zhou, X., Zafarani, R.: Network-based Fake News Detection: A Pattern-Driven Approach. ACM SIGKDD Explorations Newsletter 21(2), 48–60 (2019)
28. Zhou, Z., Guan, H., Bhat, M., Hsu, J.: Fake News Detection via NLP is Vulnerable to Adversarial Attacks. In: Proceedings of the 11th International Conference on Agents and Artificial Intelligence (ICAART). vol. 2, pp. 794–800. INSTICC, SciTePress, Prague, Czech Republic (2019)