=Paper=
{{Paper
|id=Vol-2696/paper_149
|storemode=property
|title=Profiling Fake News Spreaders using Character and Words N-grams
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_149.pdf
|volume=Vol-2696
|authors=Daniel Espinosa,Helena Gómez-Adorno,Grigori Sidorov
|dblpUrl=https://dblp.org/rec/conf/clef/EspinosaGS20
}}
==Profiling Fake News Spreaders using Character and Words N-grams==
Notebook for PAN at CLEF 2020

Daniel Yacob Espinosa<sup>1</sup>, Helena Gómez-Adorno<sup>2</sup>, and Grigori Sidorov<sup>1</sup>

<sup>1</sup> Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico. espinosagonzalezdaniel@gmail.com, sidorov@cic.ipn.mx

<sup>2</sup> Universidad Nacional Autónoma de México, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Mexico City, Mexico. helena.gomez@iimas.unam.mx

===Abstract===
With social networks serving as mass media, the spread of fake news has become a research problem. This article describes our approach to the PAN 2020 task on "Profiling Fake News Spreaders on Twitter" [7]. The objective is to distinguish which users share fake news. Our approach includes a data cleaning step and feature extraction using N-grams of characters and words. The experiments were carried out with different N-gram structures depending on the language: English and Spanish. We experimented with the Support Vector Machines (libSVM) machine learning algorithm.

===1 Introduction===
Social networks are currently one of the most important means of communication. Twitter has become an extremely active social network where users can give their opinion on any topic. With so much data and information exposed, millions of users are unable to validate the veracity of the information shared on this social network. Users can therefore be deceived by lies spread by word of mouth, in this case from user to user; this type of information is better known as "fake news". Fake news aims to discredit or create controversy about issues important to society [4]. Fake news is currently difficult to detect because bots are used to replicate and distribute it. Bots are the main source of distribution of this type of message. Bots are programs that pretend to behave like humans within social networks [2].
Through bots, fake news reaches a wide audience of real users, which damages the real news ecosystem on social networks. Bots can distribute fake news with a wide user reach, which becomes a serious problem for many daily activities [1].

What makes this a problem is that this type of news is neither filtered nor fully verified. It is simply posted and, through user interactions, it becomes widely mentioned fake news that many people believe to be true [5]. This problem is still under investigation due to its complexity. A possible solution is to identify the user profiles that propagate false information and not to trust the information they replicate. To address this kind of solution, PAN organizes the "Profiling Fake News Spreaders on Twitter" task, which aims to find users who share fake news using only a sample of their shared tweets [7].

We previously participated in PAN 19 [8], in the task "Bots and Gender Profiling", where we presented a solution based on N-gram structures, in particular character bi-grams. For this task we decided to use the same N-gram methodology, but increasing the number of characters and adding word N-grams.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

===2 Corpus===
For the assigned task, the corpus given by PAN 20 [7] consists of 600 files (300 in Spanish and 300 in English), where each file represents a user and contains 100 tweets. Table 1 shows the corpus configuration in detail.

{| class="wikitable"
|+ Table 1. Corpus of "Profiling Fake News Spreaders on Twitter" [7]
! Language !! Fake news profiles !! Real news profiles !! Total
|-
| Spanish || 150 || 150 || 300
|-
| English || 150 || 150 || 300
|}
A particularity of all the tweets in this corpus is that they are limited to 140 characters. In addition, all URLs, links, hashtags and user mentions are masked; that is, each of these elements was replaced by a label.

===3 Methodology===
====3.1 Pre-processing steps====
Before running the experiments, we pre-processed the data in a series of steps:

'''Digits.''' We decided to remove digits, since we consider them unimportant for our methodology.

'''Emoticons.''' Many of the tweets contain emoticons. After some experiments, we decided to assign a unique label to each distinct emoticon; for example, if a tweet contains the potato chips emoticon, a unique label is assigned for potato chips. This slightly increased the classification accuracy.

'''Punctuation marks.''' Punctuation marks were also not relevant for our methodology, so we removed them as well.

'''Other symbols.''' Symbols outside the standard ASCII (American Standard Code for Information Interchange) range were not considered important as features and were therefore removed.

====3.2 Features====
Once the data is pre-processed, we need to find a pattern that helps us differentiate the users who share fake news. We can find that pattern using N-grams. N-grams are sequences of N consecutive elements (here, characters or words) taken from the selected data, which we can organize by their frequency of occurrence. With this, it is possible to create a matrix that gives an abstract representation of each file; this is called a vector space model. Recall that each file represents a user.

Because we have two languages to classify, each one has a different N-gram configuration.

{| class="wikitable"
|+ Table 2. N-gram features
! Spanish !! English
|-
| 2 Char-Ngrams || 1 Char-Ngrams
|-
| 3 Char-Ngrams || 2 Char-Ngrams
|-
| 5 Char-Ngrams || 3 Char-Ngrams
|-
| 1 Words-Ngrams || 2 Words-Ngrams
|-
| 3 Words-Ngrams || 3 Words-Ngrams
|}
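
The pre-processing and feature extraction described above can be sketched as follows. This is a minimal illustration, assuming plain Python and the N-gram sizes as read from Table 2. The function names, the language keys <code>es</code> and <code>en</code>, and the layout of the feature keys are hypothetical and are not taken from the authors' code. Emoticon labeling is left out of this sketch.

```python
import re
import string
from collections import Counter

# N-gram sizes per language, as read from Table 2 (keys are hypothetical).
NGRAM_CONFIG = {
    "es": {"char": (2, 3, 5), "word": (1, 3)},
    "en": {"char": (1, 2, 3), "word": (2, 3)},
}


def preprocess(text):
    """Remove digits, punctuation marks and non-ASCII symbols."""
    text = re.sub(r"\d", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only characters inside the standard ASCII range.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def char_ngrams(text, n):
    """All overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All overlapping sequences of n whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def extract_features(text, lang):
    """Count every configured N-gram; keys are (kind, n, ngram) tuples."""
    clean = preprocess(text)
    config = NGRAM_CONFIG[lang]
    counts = Counter()
    for n in config["char"]:
        counts.update(("char", n, g) for g in char_ngrams(clean, n))
    for n in config["word"]:
        counts.update(("word", n, g) for g in word_ngrams(clean, n))
    return counts
```

Calling <code>extract_features</code> on the concatenated tweets of one user would give the frequency counts for that user's file. Keeping the N-gram kind and size in the key prevents, for example, a one-letter word and a character 1-gram from being counted as the same feature.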
We selected the configuration in Table 2, which shows the N-gram structures assigned to each language.

====3.3 Vector Space Model for Texts====
Now that we have the characteristics of the tweets represented in a Term-Document Matrix [9], it is necessary to order them so that each column is in the same order for each dimension. Once this is done, the ordered matrix is commonly called a vector space model, in which an abstraction of the objects is represented so that they can be classified by a classification algorithm.

===4 Experiments===
We mainly considered N-gram structures because of the methodology we used in the PAN19 task "Bots and Gender Profiling" [8], where good results were obtained with a single feature: character bi-grams [3]. Using this same methodology, we propose the use of word and character N-grams to solve this task.

After these experiments we compared two classifiers using the vector space model. Table 3 shows the results of the classifiers with which we carried out the experiments. Our methodology makes it possible to test various configurations, since the vector space model is organized and filled with the frequencies of the N-grams of the tweets. We tested different classifiers and selected LinearSVM, which gave us the best performance. All the results shown in Table 3 were evaluated with an accuracy score.

{| class="wikitable"
|+ Table 3. Testing with SVM (accuracy, %)
! Algorithm !! Spanish data !! English data
|-
| SVM || 80.99 || 81.33
|-
| LinearSVM || 81.83 || 86.28
|}

An interesting aspect was assigning a label to each distinct emoji. These emoji labels are not part of the N-gram structures within the feature matrix; that is, each emoji enters the matrix directly as its assigned label.
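
As an illustration of how emoji labels can sit in the feature matrix next to the N-gram counts, the following sketch builds a small term-document matrix with a fixed column order. It rests on several assumptions that are not in the paper: the label format <code>EMOJI_</code> plus the hexadecimal code point, the code-point range used to recognize emojis, and word unigrams standing in for the full N-gram set of Table 2. All function names are hypothetical.

```python
from collections import Counter

# Assumption: treat code points U+1F300..U+1FAFF as emojis (not exhaustive).
EMOJI_START, EMOJI_END = 0x1F300, 0x1FAFF


def emoji_labels(text):
    """Return one label per emoji occurrence, e.g. 'EMOJI_1F35F'."""
    return ["EMOJI_%X" % ord(ch) for ch in text
            if EMOJI_START <= ord(ch) <= EMOJI_END]


def user_features(tweets):
    """Count emoji labels and ASCII word unigrams over one user's tweets."""
    counts = Counter()
    for tweet in tweets:
        # Emoji labels are added directly, outside the N-gram extraction.
        counts.update(emoji_labels(tweet))
        ascii_text = tweet.encode("ascii", errors="ignore").decode("ascii")
        counts.update(ascii_text.split())
    return counts


def build_vsm(per_user_counts):
    """Build a term-document matrix whose columns share one sorted order."""
    vocabulary = sorted(set().union(*per_user_counts))
    matrix = [[counts.get(term, 0) for term in vocabulary]
              for counts in per_user_counts]
    return vocabulary, matrix


# Two toy users; "\U0001F35F" is the french fries emoji.
users = [["fake news \U0001F35F", "fake"], ["real news"]]
vocabulary, matrix = build_vsm([user_features(tweets) for tweets in users])
```

Each row of <code>matrix</code> describes one user, and every row uses the same column order, so the rows could be passed as they are to a classifier such as a linear SVM.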
Labeling emojis in this way gave us an improvement in accuracy of 6.0% for each language.

All these experiments were run on TIRA's technological resources [6], to which the programs must be uploaded for execution. Within this virtual machine, the scores of the researchers taking part in the competition are organized, identifying the best accuracy among all the experiments.

===5 Conclusions===
We propose this solution for the task "Profiling Fake News Spreaders on Twitter" [7], using character and word N-gram structures with LinearSVM as the classifier. We decided to use these features and methods because we had previously worked with tweets; although the task was not the same, this methodology gave very interesting scores. We understand the differences between the tasks, but we consider it important to examine whether they can be solved in a similar way, given that the data come from the same social network.

Interestingly, in our previous work we used hashtags and user mentions to obtain higher accuracy, but this time they could not be used because they had already been masked in the corpus. This means that the solution can be applied to other social networks where these characteristics do not exist.

Another aspect that we would like to test is the use of the 280 characters of a tweet. This corpus was limited to 140 characters, but we would like to know whether we could improve the performance of our method using the 280 characters that a tweet can have.

We also noticed that some of the characteristics of social networks, such as hashtags or emojis, provide an improvement in accuracy. We consider that they can be included in future experiments to see how they contribute to solving a task.
This task is really interesting given how widely social networks are used nowadays. As long as social networks attract the same or greater interest than today, they will continue to serve as mass media, where it will be essential to create a virtual environment that is as healthy as possible.

===References===
# Bakshy, E., Hofman, J., Mason, W., Watts, D.: Everyone’s an influencer: Quantifying influence on Twitter. pp. 65–74 (Jan 2011). https://doi.org/10.1145/1935826.1935845
# Cai, C., Li, L., Zeng, D.: Behavior enhanced deep bot detection in social media. pp. 128–130 (Jul 2017). https://doi.org/10.1109/ISI.2017.8004887
# Espinosa, D., Gómez-Adorno, H., Sidorov, G.: Bots and Gender Profiling using Character Bigrams. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019), http://ceur-ws.org/Vol-2380/
# Ghanem, B., Rosso, P., Rangel, F.: An Emotional Analysis of False Information in Social Media and News Articles. ACM Transactions on Internet Technology (TOIT) 20(2), 1–18 (2020)
# Popat, K., Mukherjee, S., Yates, A., Weikum, G.: DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1003, https://www.aclweb.org/anthology/D18-1003
# Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
# Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
# Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings (2019)
# Sidorov, G.: Formalization in computational linguistics. In: Syntactic n-grams in Computational Linguistics. SpringerBriefs in Computer Science, Springer (2016)