=Paper=
{{Paper
|id=Vol-2696/paper_217
|storemode=property
|title=Assembly of Polarity, Emotion and User Statistics for Detection of Fake Profiles
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_217.pdf
|volume=Vol-2696
|authors=Luis Gabriel Moreno Sandoval,Edwin Puertas,Alexandra Pomares Quimbaya,Jorge Andres Alvarado Valencia
|dblpUrl=https://dblp.org/rec/conf/clef/Moreno-Sandoval20
}}
==Assembly of Polarity, Emotion and User Statistics for Detection of Fake Profiles==
Notebook for PAN at CLEF 2020

Luis Gabriel Moreno-Sandoval (1,3), Edwin Puertas (2,1,3), Alexandra Pomares-Quimbaya (1,3), and Jorge Andres Alvarado-Valencia (1,3)

(1) Pontificia Universidad Javeriana, Bogotá, Colombia — {morenoluis,edwin.puertas,pomares,jorge.alvarado}@javeriana.edu.co
(2) Universidad Tecnológica de Bolívar, Cartagena, Colombia — epuerta@utb.edu.co
(3) Center of Excellence and Appropriation in Big Data and Data Analytics (CAOBA)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. The explosive growth of fake news on social networks has aroused great interest from researchers in different disciplines. Efficient and effective detection of fake news requires scientific contributions from various disciplines, such as computational linguistics, artificial intelligence, and sociology. Here we illustrate how polarity, emotion, and user statistics can be used to detect fake profiles on the social network Twitter. This paper presents a novel strategy for the characterization of Twitter profiles based on an assembly of polarity, emotion, and user-statistics features that serve as input to a set of classifiers. The results are part of our participation in the Profiling Fake News Spreaders on Twitter task at PAN at CLEF 2020.

1 Introduction

The exponential growth of fake news and rumors on social networks has led researchers from different areas to join efforts to quickly and accurately mitigate the proliferation of these phenomena. Thus, the 2020 edition of PAN at CLEF has proposed an authorship analysis task whose objective is to identify possible fake news spreaders [16] on social networks as a first step toward preventing the propagation of fake news among online users.

The way we collect and consume news has become a crucial process these days due to the growth of social media platforms such as Twitter (https://twitter.com/) and Facebook (https://www.facebook.com/), which have reported an exponential increase in popularity [3,18]. As an example, Twitter reported 330 million monthly active users in early 2020 (https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/), while Facebook reported 2,603 million monthly active users worldwide as of Q1 2020 (https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/). In fact, social networks have proven to be extremely useful for generating news, especially in crises, due to their inherent ability to spread breaking news much more quickly than traditional media [7].

Fake news has received enormous attention from the academic community because it can be created and published online more quickly and cheaply than in traditional media such as newspapers and television. In addition, several researchers suggest that humans tend to seek out, consume, and create information that is aligned with their ideological beliefs, which often results in the perception and exchange of fake information within like-minded communities [20].

In this paper, we describe our submission as part of our participation at PAN at CLEF 2020; as Potthast et al. [14] established, this paper closes a cycle by supplying the motivation for the tackled problem, high-level descriptions of the courses of action taken, and the interpretation of the results obtained. In particular, this year the Profiling Fake News Spreaders on Twitter task is presented, whose main objective is to identify possible spreaders of fake news on social networks as a first step toward preventing the spread of fake news among online users. Our main contributions are related to the statistical analysis of the language use of fake news spreader profiles, under the hypothesis that these profiles are created mainly to spread negative opinions on social networks. To do this, we use central tendency metrics (mean, median, and mode), polarity and emotion classification, and a vector of processed words, expecting these features to become contributing factors in characterizing the fake news spreader profile.

The rest of the paper is structured as follows. Section 2 introduces the related work. Section 3 describes the data set and the proposed strategy. Section 4 presents the experiments and the analysis of results. Finally, Section 5 presents some remarks and conclusions.
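To make the user-statistics idea concrete, the following minimal Python sketch computes the mean, median, and mode over tweet lengths for one profile. It is only an illustration of the kind of central tendency features referred to above, not the code used in the submission.

```python
from statistics import mean, median, mode

def profile_length_stats(tweets):
    """Central tendency statistics (mean, median, mode) over tweet lengths.

    Illustrative only: the submission combines statistics like these with
    polarity, emotion, and word-vector features (see Section 3.5).
    """
    lengths = [len(t) for t in tweets]
    return {
        "mean_len": mean(lengths),
        "median_len": median(lengths),
        "mode_len": mode(lengths),
    }

# Toy example with three tweets of a single (hypothetical) profile
print(profile_length_stats(["fake news!!", "read this!!", "unbelievable claim"]))
```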
2 Related work

Profiling fake news spreaders and detecting fake news are among the most complex natural language processing tasks. In addition, social media sites such as Facebook and Twitter are among the largest news dissemination networks [2,5,22]. In recent years, the detection of fake news has gained great importance in different areas of society, as the phenomenon is constantly growing. In this section we review some of the most recent published work.

Fake news detection has been studied with different approaches and techniques according to the scope and format of the available fake news data [11,17,9]. The most recent works are oriented towards dynamic language models such as exBAKE [8], which mitigates the problem of data imbalance. Similarly, Cui et al. [4] propose an end-to-end deep architecture that alleviates the heterogeneity introduced by multimodal data and better captures the representation of user sentiment. Rangel et al. [15] propose a Low Dimensionality Representation (LDR) model to reduce possible over-fitting when identifying the language variation of different Spanish-speaking countries, which may help discriminate among different types of authors.

In general, current approaches based on deep neural networks have been successful in detecting fake news. Still, other investigations use traditional techniques such as term frequency-inverse document frequency (TF-IDF), part-of-speech (POS) tagging, and n-grams, among others. Regarding TF-IDF approaches, we highlight the research of Ahmed et al. [1], who used a Stochastic Gradient Descent model with TF-IDF over bi-grams. Regarding part-of-speech (POS) tagging, the results presented by Rubin et al. [10] stand out: they used bi-grams with POS tagging to determine whether a news item was fake or not. Wynne et al. [21] propose a fake news detection system that considers the content of online news articles through word n-grams and the analysis of character n-grams. Shu et al. [19] analyze the correlation between user profiles and fake news, extracting implicit and explicit linguistic characteristics using a linear regression model, a set of metrics, and the Five-Factor Model (FFM), an unsupervised classification model for personality prediction. Finally, Giachanou et al. [6] improve the performance of their CheckerOrSpreader classification model, which labels user profiles as potential fact checkers or potential fake news spreaders, by combining a Convolutional Neural Network (CNN), the Five-Factor Model (FFM) prediction model with word embeddings, and the LIWC software for tracking language patterns.

3 Materials and Methods

3.1 Data Description

The data set for the Profiling Fake News Spreaders on Twitter task at PAN 2020 consists of 300 user profiles for the detection of users who spread fake news on social media. For each author, an XML file with the content of 100 associated tweets was provided; the set includes texts in Spanish and English.
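As an illustration of how such per-author files can be read, the sketch below parses one XML file per author with Python's standard library. The tag name "document" and the directory layout are assumptions based on earlier PAN author-profiling corpora, not a specification given in this paper.

```python
# Minimal sketch for loading the per-author XML files described above.
import os
import xml.etree.ElementTree as ET

def load_author_tweets(xml_path):
    """Return the list of tweet texts contained in one author's XML file."""
    root = ET.parse(xml_path).getroot()
    # Assumed schema: each tweet is wrapped in a <document> element.
    return [doc.text or "" for doc in root.iter("document")]

def load_corpus(folder):
    """Map author id (file name without extension) to its list of tweets."""
    corpus = {}
    for name in os.listdir(folder):
        if name.endswith(".xml"):
            corpus[name[:-4]] = load_author_tweets(os.path.join(folder, name))
    return corpus
```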
3.2 Model Description

In this section we describe the predictive model used in our submission for the Profiling Fake News Spreaders on Twitter task. Figure 1 shows an overview of the model.

Figure 1. Model for the task of profiling fake news spreaders.

3.3 Resources

To extract emotion and polarity from each comment associated with a user profile in the dataset, the NRC Emotion Lexicon [12] and the Combined Spanish Lexicon (CSL) [13] were used. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive); it also includes translations into over 100 languages. The annotations were done manually by crowdsourcing. The Combined Spanish Lexicon is a sentiment analysis resource that includes an ensemble of six Spanish lexicons and a weighted bag-of-words strategy.
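The following minimal sketch illustrates how lexicon-based scoring of a tweet can work. The tiny inline lexicon is a placeholder for illustration; in practice the word-to-label mapping would be loaded from the NRC Emotion Lexicon or CSL files, whose exact formats are not reproduced here.

```python
# Lexicon-based emotion/polarity scoring, sketched with a toy lexicon.
from collections import Counter
import re

TOY_LEXICON = {
    "fraud": {"anger", "disgust", "negative"},
    "hope": {"anticipation", "joy", "positive"},
    "fear": {"fear", "negative"},
}

def score_text(text, lexicon=TOY_LEXICON):
    """Count how many emotion/polarity labels the words of a tweet trigger."""
    counts = Counter()
    for token in re.findall(r"[a-záéíóúüñ]+", text.lower()):
        for label in lexicon.get(token, ()):
            counts[label] += 1
    return counts

print(score_text("There is no hope, only fear and fraud"))
# e.g. Counter({'negative': 2, 'fear': 1, 'joy': 1, ...})
```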
3.4 Preprocessing

Initially, a cleaning and pre-processing step is applied to the texts of the 300 users. In this way, the resulting corpus is ready for both languages and is integrated into a feature vector. Then, we applied a processing pipeline using Scikit-Learn to create new text features. Later, GridSearchCV was used to search over the hyperparameters of the various previously configured classifiers.

3.5 Feature Extraction

The first part of the pipeline was in charge of reading the text in both languages. Later, a feature vector of these texts was created and a final per-individual preprocessing step was performed, resulting in a feature vector associated with each individual, which analyzes the frequencies of text features such as emojis, emoticons, hashtags, URLs, or mentions.

A polarity analysis was carried out for each individual, taking into account the polarity of each of the messages shared by this user on the social network Twitter. Then, the number of negative or positive comments was averaged, seeking to support the hypothesis that a correct identification of fake profiles could be achieved through an analysis of the polarity of their messages, since there is a correlation between a fake user and the negative polarity of the content shared on the network. In the same way, the emotions of each text were computed by means of an emotion lexicon, which allowed us to identify whether emotions were distinguishing characteristics of fake content.
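A minimal sketch of the per-profile feature vector described in this section is shown below. The regular expressions, feature names, and the simple positive/negative decision rule are our own illustration rather than the submitted code; score_text refers to the lexicon scorer sketched in Section 3.3.

```python
import re
from statistics import mean

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji ranges

def profile_features(tweets):
    """Assemble per-profile frequencies and an average polarity score."""
    polarity = []  # +1 / -1 per tweet, decided from the lexicon counts
    counts = {"urls": 0, "mentions": 0, "hashtags": 0, "emojis": 0}
    for t in tweets:
        counts["urls"] += len(URL.findall(t))
        counts["mentions"] += len(MENTION.findall(t))
        counts["hashtags"] += len(HASHTAG.findall(t))
        counts["emojis"] += len(EMOJI.findall(t))
        scores = score_text(t)  # lexicon scorer sketched in Section 3.3
        polarity.append(1 if scores["positive"] >= scores["negative"] else -1)
    n = len(tweets)
    features = {k: v / n for k, v in counts.items()}  # frequency per tweet
    features["avg_polarity"] = mean(polarity)
    return features
```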
3.6 Settings and Classifiers

The emotion and polarity results, as well as the statistics of the individual, were integrated into a single feature vector in order to later build the classification model. This model comprised a set of classifiers (Logistic Regression, K-Neighbors Classifier, Random Forest Classifier, Decision Tree Classifier, Linear Discriminant Analysis (LDA), Multinomial Naive Bayes, Bernoulli Naive Bayes, and Support Vector Machine) whose hyperparameters were configured.

The goal of hyperparameter tuning was to find the classifier with the best performance among those generated, for each of the reports, by the GridSearchCV library, taking the pipeline into account. The results showed that the best performance was obtained with Random Forest, with an accuracy of 76% for Spanish and 71.7% for English. This performance did not require changing the settings for each language. Finally, it is worth mentioning that the pipeline allowed us to generate the classifiers, save them, serialize the pipeline together with the classifiers, and materialize them to perform the final execution of the model.
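The hyperparameter search can be sketched with scikit-learn's Pipeline and GridSearchCV as follows. The candidate classifiers mirror a subset of the list above for brevity, but the parameter grids and the scaling step are illustrative assumptions, not the submitted configuration; X is assumed to stack the per-profile feature vectors from Section 3.5.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Candidate estimators with illustrative hyperparameter grids.
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]}),
    "log_reg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "svm": (SVC(), {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}),
}

def best_classifier(X, y):
    """Grid-search each candidate and keep the best cross-validated accuracy."""
    best = None
    for name, (estimator, grid) in candidates.items():
        pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
        search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
        search.fit(X, y)
        if best is None or search.best_score_ > best[1]:
            best = (name, search.best_score_, search.best_estimator_)
    return best
```

The fitted pipeline returned by the search can then be serialized (for example with joblib) for the final execution of the model, along the lines described above.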
4 Experiments and Analysis of Results

Table 1 summarizes the performance of the classifiers evaluated for the challenge. For the fake-profile class, it shows the best classification model, the accuracy obtained with it, and the characteristics that worked best for the classification. The classifier with the best performance was Random Forest. Furthermore, the characteristics coming from the raw text, the cleaned text, the per-profile text statistics, and the polarity and emotion classification of each tweet are joined. Finally, a feature vector is created with the objective of grouping the profile's language and sociolinguistic characteristics.

Table 1. Summary of results in the task of profiling fake news spreaders

Model | Accuracy (es) | Accuracy (en)
Logistic Regression | 0.643 | 0.650
K-Neighbors Classifier | 0.640 | 0.577
Random Forest Classifier | 0.780 | 0.737
Decision Tree Classifier | 0.597 | 0.673
Linear Discriminant Analysis | NaN | NaN
Multinomial Naïve Bayes | NaN | NaN
Bernoulli Naïve Bayes | 0.680 | 0.670
SVM | 0.630 | 0.677

4.1 Baselines

Table 2 reports the accuracy of our model for both languages compared to the baseline models provided by the task organizers. The main results show that the SYMANTO (LDSE) and SVM + c n-grams models outperform our model with an average difference of 4.5% and 1.3%, respectively. It should be noted that our performance is better than SVM + c n-grams in English; however, the performance drops when the analysis is in Spanish. On the other hand, our model performs better than the remaining models, with a wide difference of 21.8% over the RANDOM baseline and 2.8% over the closest baseline model.

Table 2. Performance of the different models on PAN at CLEF

Model | Accuracy (en) | Accuracy (es) | Accuracy (avg)
SYMANTO (LDSE) | 0.745 | 0.790 | 0.768
SVM + c n-grams | 0.680 | 0.790 | 0.735
morenosandoval20 | 0.715 | 0.730 | 0.723
NN + w n-grams | 0.690 | 0.700 | 0.695
EIN | 0.640 | 0.640 | 0.640
LSTM | 0.560 | 0.600 | 0.580
RANDOM | 0.510 | 0.500 | 0.505

On the other hand, if we compare our results with the performance of Ghanem et al. (2020) in the identification of fake news on Twitter, the overall performance of their model in English is 6.5% below ours; however, their main class, clickbait, performs better than ours by a difference of 24.5%.

5 Discussion and Conclusion

The Profiling Fake News Spreaders on Twitter task at PAN at CLEF 2020 generated several challenges that are worth highlighting.

The collection and analysis of other language-related elements provide implicit context for the task of profiling fake news spreaders. Identifying profiles from their texts is therefore an interesting approach in which we can analyze variables in the use of words that denote the social use of a "sociolect" or an "idiolect". Profiling the features that are specific to a given language thus helps to increase the accuracy of this type of natural language processing task.

This study associates text-based statistics with the length of characters and with the use of symbols, emojis, and expressions such as hashtags that can carry semiotic meaning. Texts are also used to comment to other users by creating mentions within the network and, finally, to refer to external sources of information through URLs that can guide or give context to the messages. These messages imply different measurements than the use of lexical or syntactic characteristics. By studying text-based statistics and other psychographic characteristics, such as emotion and polarity, it is possible to improve the precision of classification processes over demographic, sociological, psychographic, and behavioral variables of fake news spreaders on Twitter.

Acknowledgements

We thank the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), Pontificia Universidad Javeriana, and the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC). The models and results presented in this challenge contribute to the construction of the research capabilities of CAOBA. The author Edwin Puertas also thanks the Universidad Tecnológica de Bolívar. Needless to say, we thank the organizing committee of PAN, especially Paolo Rosso, Francisco Rangel, Bilal Ghanem, and Anastasia Giachanou, for their encouragement and kind support.

References

1. Ahmed, H., Traore, I., Saad, S.: Detection of online fake news using n-gram analysis and machine learning techniques. In: International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. pp. 127–138. Springer (2017)
2. Ahmed, H., Traore, I., Saad, S.: Detecting opinion spams and fake news using text classification. Security and Privacy 1(1), e9 (2018)
3. Bondielli, A., Marcelloni, F.: A survey on fake news and rumour detection techniques. Information Sciences 497, 38–55 (2019)
4. Cui, L., Wang, S., Lee, D.: SAME: Sentiment-aware multi-modal embedding for detecting fake news. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 41–48 (2019)
5. Ghanem, B., Rosso, P., Rangel, F.: An emotional analysis of false information in social media and news articles. ACM Trans. Internet Technol. 20(2) (Apr 2020), https://doi.org/10.1145/3381750
6. Giachanou, A., Ríssola, E., Ghanem, B., Crestani, F., Rosso, P.: The role of personality and linguistic patterns in discriminating between fake news spreaders and fact checkers. pp. 181–192 (Jun 2020)
7. Imran, M., Castillo, C., Diaz, F., Vieweg, S.: Processing social media messages in mass emergency: Survey summary. In: Companion Proceedings of The Web Conference 2018. pp. 507–511 (2018)
8. Jwa, H., Oh, D., Park, K., Kang, J.M., Lim, H.: exBAKE: Automatic fake news detection model based on Bidirectional Encoder Representations from Transformers (BERT). Applied Sciences 9(19), 4062 (2019)
9. Kochkina, E., Liakata, M., Augenstein, I.: Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221 (2017)
10. Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Metzger, M.J., Nyhan, B., Pennycook, G., Rothschild, D., et al.: The science of fake news. Science 359(6380), 1094–1096 (2018)
11. Long, Y.: Fake news detection through multi-perspective speaker profiles. Association for Computational Linguistics (2017)
12. Mohammad, S.M., Turney, P.D.: Crowdsourcing a word–emotion association lexicon. Computational Intelligence 29(3), 436–465 (2013)
13. Moreno-Sandoval, L.G., Beltrán-Herrera, P., Vargas-Cruz, J.A., Sánchez-Barriga, C., Pomares-Quimbaya, A., Alvarado-Valencia, J.A., García-Díaz, J.C.: CSL: A combined Spanish lexicon - resource for polarity classification and sentiment analysis. In: Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS. pp. 288–295. INSTICC, SciTePress (2017)
14. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
15. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 156–169. Springer (2016)
16. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
17. Ruchansky, N., Seo, S., Liu, Y.: CSI: A hybrid deep model for fake news detection. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. pp. 797–806 (2017)
18. Sharma, K., Qian, F., Jiang, H., Ruchansky, N., Zhang, M., Liu, Y.: Combating fake news: A survey on identification and mitigation techniques. ACM Transactions on Intelligent Systems and Technology (TIST) 10(3), 1–42 (2019)
19. Shu, K., Wang, S., Liu, H.: Understanding user profiles on social media for fake news detection. In: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 430–435. IEEE Computer Society, Los Alamitos, CA, USA (Apr 2018), https://doi.ieeecomputersociety.org/10.1109/MIPR.2018.00092
20. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19(1), 22–36 (2017)
21. Wynne, H.E., Wint, Z.Z.: Content based fake news detection using n-gram models. In: Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services. pp. 669–673 (2019)
22. Zhou, X., Zafarani, R.: Fake news detection: An interdisciplinary research. In: Companion Proceedings of The 2019 World Wide Web Conference. pp. 1292–1292 (2019)