TecNM at MEX-A3T 2020: Fake News and Aggressiveness Analysis in Mexican Spanish

Samuel Arce-Cardenas, Daniel Fajardo-Delgado (fajardo@itcg.edu.mx) and Miguel Á. Álvarez-Carmona

Mexico.

Mexico

This paper describes our participation in MEX-A3T 2020 for the tasks of identifying aggressiveness and fake news in Mexican Spanish tweets. We evaluate combinations of basic text classification techniques, including six machine learning algorithms, two keyword extraction methods, and two preprocessing techniques. Our best run achieved an F1-macro score of 0.754 for aggressiveness and 0.815 for fake news. These preliminary results are satisfactory and competitive with those of the other participating teams.

Keywords: Aggressiveness Identification, Fake News Classification, Natural Language Processing

2. State of the art

MEX-A3T is an evaluation forum at IberLEF intended to foster research in natural language processing (NLP) that takes into account a variety of Mexican Spanish cultural traits. In this vein, the 2018 edition was the first to consider aggressiveness identification for Mexican Spanish tweets [7]. The winning team for the aggressiveness task in that edition was INGEOTEC [8], obtaining an F1-macro score of 0.620. Another interesting result was the development of a linguistic generalization of the typical Mexican slang used in tweets to reduce the impact of the size of the bag of words [9]. For the 2019 edition of the MEX-A3T track [10], the approach of the University of Chihuahua (UACh) [11] obtained the best performance, outperforming all proposed baselines except the results of the winning team of the 2018 edition. Nevertheless, the UACh approach is considerably simpler than that of INGEOTEC.

On the other hand, there are few studies on the detection of fake news in Spanish [12], [6]. One of these studies evaluates the complexity, stylometric, and psychological characteristics of text in a multilingual setting [12]. The authors used a corpus of news written in American English, Brazilian Portuguese, and Spanish together with four classifiers (k-Nearest Neighbors, Support Vector Machine, Random Forest, and Extreme Gradient Boosting), and obtained an average detection accuracy of 85.3% with Random Forest. Another study created a new corpus of news in Spanish [6], labeled as true or fake for the automatic detection of fake news. It presents a detection method that applies classification algorithms to lexical features such as bag of words, part-of-speech tags, n-grams (with n ranging from 3 to 5), and combinations of n-grams; its best result was an accuracy of 76.94%.

3. Methodology

The methodology of this work consists of the following steps: text preprocessing, text representation, and the building of the classification models.

Text preprocessing is commonly the first step in the pipeline of an NLP system, and it includes a set of techniques designed to transform text documents into a form suitable for automatic processing. The preprocessing techniques we employed in this work were regular expressions, tokenization, the removal of punctuation, symbols, and stop words, and stemming. The regular expressions allowed us to identify some words that are incorrect in Mexican Spanish, mainly those in which the same vowel appears three or more times consecutively. We did this with the "re" library in Python.
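As an illustration of the repeated-vowel rule, the following minimal sketch uses the "re" library to find runs of three or more identical vowels. The exact pattern, the inclusion of accented vowels, and the choice to collapse each run to a single vowel are assumptions made for this example; the text above only states that such words were identified.

```python
import re

# Assumed pattern: the same vowel (accented or not) repeated 3+ times in a row.
REPEATED_VOWEL = re.compile(r"([aeiouáéíóú])\1{2,}", flags=re.IGNORECASE)


def has_repeated_vowel(word: str) -> bool:
    """Return True if the word contains a run of 3+ identical vowels."""
    return REPEATED_VOWEL.search(word) is not None


def collapse_repeated_vowels(text: str) -> str:
    """Replace every run of 3+ identical vowels with one vowel (assumed normalization)."""
    return REPEATED_VOWEL.sub(r"\1", text)


print(has_repeated_vowel("nooooo"))
print(collapse_repeated_vowels("nooooo maaaanches, qué buuuena onda"))
```

Words with a legitimate double vowel, such as "leer", are left untouched because the pattern requires at least three repetitions.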

We also used the Natural Language Toolkit (NLTK) to perform tokenization, breaking the texts into words as their basic elements. During this process, we also removed punctuation marks, special characters or symbols, and unnecessary stop words such as "el", "la", and "los". Afterward, we used the Snowball stemming library to reduce derived words to their original form, or stem, by truncating suffixes. Finally, to further reduce the number of uninformative words, we ignored those that appear fewer than 20 or 40 times.
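The tokenization, stop-word removal, and frequency-threshold steps can be sketched with the Python standard library alone. This is a simplified stand-in for the NLTK-based pipeline, not the pipeline itself: the tiny stop-word list, the letter-only tokenizer, and the interpretation of the threshold as a total corpus count are all assumptions, and stemming is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"el", "la", "los", "las", "de", "que", "y"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation and symbols)."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def preprocess(docs, min_count=20, remove_stop_words=True):
    """Tokenize each document, optionally drop stop words, then drop rare words.

    A word is kept only if it occurs at least `min_count` times in the whole corpus.
    """
    tokenized = [tokenize(doc) for doc in docs]
    if remove_stop_words:
        tokenized = [[t for t in doc if t not in STOP_WORDS] for doc in tokenized]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


docs = ["El gobierno anunció la medida.", "¡La medida del gobierno!", "Los datos"]
print(preprocess(docs, min_count=2))
```

Setting `remove_stop_words=False` gives the variant in which stop words are preserved.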

After the text preprocessing, we sought to identify the set of words that best describe the textual context. Extracting these words, also called terms or keywords, consists of assigning each word a numerical value that represents its relevance with respect to the other words in the corpus. In particular, we used two methods based on a simple statistical approach: the term frequency (TF) and the term frequency-inverse document frequency (TF-IDF). TF defines the local importance of each term in a document based on its frequency; i.e., the more frequently a word appears in a document, the more important it is. IDF captures in how many documents a word appears relative to the total number of documents in the corpus; i.e., it highlights the rarity of the word. We used the implementations of TF and TF-IDF included in the scikit-learn library.
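To make the two weighting schemes concrete, here is a small pure-Python sketch using the textbook definitions: TF as the raw count of a term in a document and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the term. This is an assumption for illustration; scikit-learn's default TF-IDF uses a smoothed IDF and normalizes each document vector, so its values will differ from the ones computed here.

```python
import math
from collections import Counter


def term_frequencies(doc_tokens):
    """TF: raw count of each term within a single document."""
    return Counter(doc_tokens)


def inverse_document_frequencies(corpus):
    """Textbook IDF: log(N / df(t)), with N documents and df(t) documents containing t."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(corpus):
    """Return one {term: tf * idf} dictionary per document."""
    idf = inverse_document_frequencies(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequencies(doc).items()}
        for doc in corpus
    ]


# Toy corpus of already-tokenized documents (hypothetical data).
corpus = [["noticia", "falsa", "noticia"], ["noticia", "real"], ["tuit", "agresivo"]]
weights = tf_idf(corpus)
for doc_weights in weights:
    print(doc_weights)
```

In the second toy document, "real" receives a higher weight than "noticia" because it occurs in fewer documents, which is the rarity effect that IDF is meant to capture.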

Finally, to build the classification models, we used the following machine learning algorithms implemented in scikit-learn: k-nearest neighbors (KNN) for k = 3, 7, 11; support vector machines (SVM) with linear and radial basis function (RBF) kernels; decision trees (DT); a neural network (NN); and Naive Bayes (NB). We generated these models from the training set using 10-fold cross-validation.

4. Experimental results

We divided the data set into 10 subsets. Taking the first subset for validation and the remaining subsets for training, we obtained a confusion matrix; we then took the second subset for validation and the rest for training, and repeated this process until every subset had served as the validation set. Finally, we summed the confusion matrices and computed the reported results from the total.
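The evaluation procedure described above can be sketched as follows. The fold construction (contiguous, unshuffled subsets), the derivation of F1-macro from the summed confusion matrix, and the `threshold_classifier` stand-in model are assumptions made for illustration; any of the classifiers could be plugged in through the hypothetical `fit_predict` callback.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores of a confusion matrix."""
    scores = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fp = sum(row[c] for row in matrix) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(X, y, fit_predict, labels, k=10):
    """Sum the confusion matrices of all k folds and score the total."""
    total = [[0] * len(labels) for _ in labels]
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        fold_matrix = confusion_matrix([y[i] for i in fold], preds, labels)
        for r in range(len(labels)):
            for c in range(len(labels)):
                total[r][c] += fold_matrix[r][c]
    return total, macro_f1(total)


def threshold_classifier(X_train, y_train, X_test):
    """Hypothetical stand-in model: predict 1 when a feature exceeds the training mean."""
    threshold = sum(X_train) / len(X_train)
    return [1 if x > threshold else 0 for x in X_test]


X = list(range(20))          # toy one-dimensional features
y = [0] * 10 + [1] * 10      # toy binary labels
total, score = cross_validate(X, y, threshold_classifier, labels=[0, 1], k=10)
print(total, score)
```

Summing the matrices before scoring yields a single F1-macro value over all held-out predictions, which can differ slightly from averaging ten per-fold scores.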

Tables 1 and 2 show the performance of the proposed classification models on the fake news data set using the TF and TF-IDF methods, respectively. The best result for this data set is obtained by NN without stop-word removal or stemming, regardless of whether TF or TF-IDF is used. Note that, except for the SVM with RBF, there is a notable difference between the results of NN and those of the other models. Note also that, in general, the results are slightly better with TF-IDF than with TF.

Similarly, Tables 3 and 4 show the performance of the proposed classification models on the aggressiveness data set using the TF and TF-IDF methods, respectively. The best result for this data set is obtained by NN with the TF-IDF method and without stop-word removal or stemming. As with the fake news data set, the results for the aggressiveness data set are slightly better with TF-IDF than with TF. Unlike the fake news classification results, however, the best model with the TF method is the SVM with RBF. All of these results were obtained by ignoring the words that appear fewer than 20 times, for both data sets (Tables 1-4). We do not report the results for the case in which we ignored the words that appear fewer than 40 times, because of their poor quality and the space limitations of the paper. In addition to the complete text of the news, the fake news data set includes a header containing the title of the news. We performed experiments both with and without the header. Tables 1 and 2 show only the results obtained without the header, since these are better.

Finally, for both data sets, the best results were obtained by preserving the stop words and omitting the stemming process. We conjecture that, in these particular cases, such words may help to distinguish the classes (aggressiveness/fake news) in the texts.

5. Conclusions

In this paper, we approached the tasks of fake news and aggressiveness identification for the 2020 MEX-A3T contest. Using machine learning algorithms, we generated classification models for these tasks with different combinations of preprocessing techniques and keyword extraction methods. Our best configurations for both tasks are NN and SVM with the RBF kernel, using the TF-IDF method and without the preprocessing steps of stop-word removal and stemming. As future work, we look forward to exploring other preprocessing techniques and keyword extraction methods to improve our ranking in the next MEX-A3T contests.

Acknowledgments

S. Arce-Cardenas gratefully acknowledges the financial support from Tecnológico Nacional de México (TecNM) under the project 9518.20-P (2rn3nx).

References

[1] M. B. Yassein, S. Aljawarneh, Y. A. Wahsheh, Survey of online social networks threats and solutions, in: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), 2019, pp. 375-380.

[2] Theocharis, Bekiari, et al., Applying social network indicators in the analysis of verbal aggressiveness at the school, Journal of Computer and Communications 5 (2017) 169. doi:10.4236/jcc.2017.57015.

[3] Nobata, Tetreault, A. Thomas, Mehdad, Chang, Abusive language detection in online user content, in: Proceedings of the 25th International Conference on World Wide Web, WWW '16, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2016, pp. 145-153. URL: https://doi.org/10.1145/2872427.2883062. doi:10.1145/2872427.2883062.

[4] Bovet, H. A. Makse, Influence of fake news in twitter during the 2016 us presidential election, Nature Communications 10 (2019) 7. doi:10.1038/s41467-018-07761-2.

[5] M. E. Aragón, Jarquín, Montes-y Gómez, H. J. Escalante, L. Villaseñor-Pineda, Gómez-Adorno, Bel-Enguix, J.-P. Posadas-Durán, Overview of mex-a3t at iberlef 2020: Fake news and aggressiveness analysis in mexican spanish, in: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain, September, 2020.

[6] J.-P. Posadas-Durán, Gómez-Adorno, Sidorov, J. J. M. Escobar, Detection of fake news in a new corpus for the spanish language, Journal of Intelligent & Fuzzy Systems 36 (2019) 4869-4876.

[7] Á. Álvarez-Carmona, E. Guzmán-Falcón, M. Montes-y Gómez, H. J. Escalante, Villasenor-Pineda, Reyes-Meza, Rico-Sulayes, Overview of mex-a3t at ibereval 2018: Authorship and aggressiveness analysis in mexican spanish tweets, in: Notebook Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL), Seville, Spain, volume 6, 2018.

[8] Graf, Miranda-Jiménez, E. S. Tellez, Moctezuma, Salgado, Ortiz-Bejar, C. N. Sánchez, Ingeotec at mex-a3t: Author profiling and aggressiveness analysis in twitter using tc and evomsa, in: IberEval@SEPLN, 2018, pp. 128-133.

[9] Correa, Martin, Linguistic generalization of slang used in mexican tweets, applied in aggressiveness detection, in: IberEval@SEPLN, 2018, pp. 119-127.

[10] M. E. Aragón, Á. Álvarez-Carmona, M. Montes-y Gómez, H. J. Escalante, L. Villasenor-Pineda, D. Moctezuma, Overview of mex-a3t at iberlef 2019: Authorship and aggressiveness analysis in mexican spanish tweets, in: Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Bilbao, Spain, 2019.

[11] Casavantes, López, L. C. González, Uach at mex-a3t 2019: Preliminary results on [...] 2019), CEUR WS Proceedings, 2019.

[12] H. Q. Abonizio, J. I. de Morais, G. M. Tavares, S. Barbon Junior, Language-independent fake news detection: English, portuguese, and spanish mutual features, Future Internet 12