Naive-Bayesian Classification for Bot Detection in Twitter
Notebook for PAN at CLEF 2019

Pablo Gamallo1 and Sattam Almatarneh1,2

1 Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), University of Santiago de Compostela, Galiza
2 Computer Science Department, University of Vigo, Escola Superior de Enxeñaría Informática, Campus As Lagoas, Ourense 32004, Spain
{pablo.gamallo,sattam.almatarneh}@usc.es

Abstract This article describes a system that participated in the Bots and Gender Profiling shared task at PAN 2019. The first objective of the task is to detect whether the author of a Twitter account is a bot or a human; in the case of a human, the second objective is to identify the gender of the user account. For this purpose, we present a Bayesian strategy based on features that include specific content of tweets and automatically built lexicons. The best configuration of features reached 0.88 accuracy on the official Spanish test dataset and 0.81 on the English one for the bot/human classification. For gender profiling, the scores we obtained were lower, around 0.70.

1 Introduction

Social bots are programs that automate common human activities such as message generation. The main objectives of bots are to interact with humans, resend messages or pictures from other users, add likes to other messages, and so on. The rise of bots in online social networks has led to the emergence of malicious behavior, including the dissemination of misinformation and polluting content such as malware and spam. To identify social bots, machine learning techniques have been used successfully, in several cases exceeding 95% accuracy, but with datasets built by the authors themselves [17,11]. In 2015, the DARPA Social Media in Strategic Communications program conducted the Twitter Bot Detection Challenge [15], whose aim was to identify influence bots supporting a pro-vaccination discussion on Twitter. In this case, the final results of the competition were much more modest, as the best systems did not reach 50% accuracy.

Machine learning techniques for social bot detection in Twitter generally make use of a great variety of features. Among them, we have identified the following types: user profile, user friendship networks (following and followers), content of tweets, and user history. The Bots and Gender Profiling Shared Task [14] at PAN 2019 [7] aims to give new impetus to the study of bot profiling. After having dealt with several facets of author profiling in social networks since 2013 (e.g., gender, age, and even language variety), the main aim in 2019 is to detect whether the author of a Twitter account is a bot or a human. In addition, in the case of a human, the second objective is to identify the gender of the user account.

It is worth noting that the training datasets provided by the Shared Task organizers do not contain all the information required to extract all the feature types enumerated above. More precisely, there is no information on user profile, user friendship networks, or user history. Given the characteristics of the training dataset provided by the PAN Shared Task 2019, we design a machine learning strategy based on features that only include the content of tweets as well as automatically built lexicons.
Therefore, features built from user profile and user history are beyond our scope. As the available training data is small, we decided to use a basic Naive Bayes classifier, which performs well in this type of task, as reported in [1].

The PAN Shared Task includes not only bot detection but also gender identification. However, as our main objective is to distinguish bots from human accounts, we decided to reuse the features conceived for the bot/human task for the female/male identification task as well, with slight differences concerning the lexical features. As will be reported later, for the bot detection task in PAN 2019, our best feature configuration reached 0.88 and 0.81 accuracy on the official Spanish and English test datasets, respectively. These results are similar to those achieved during the development phase, even though they are below the state of the art described in the related work section.

In the next section (Section 2), we describe other works and experiments focused only on bot detection (not gender profiling). Then, Section 3 describes the Twitter-based features used by our classification method. Experiments are reported in Section 4 and conclusions are drawn in Section 5.

2 Related Work

The possibility of collecting tweets from user accounts and, thereby, building training datasets has allowed researchers to design machine learning methods for social bot detection. The main idea behind these methods is to discover the key features of social bots in order to draw the border between a human actor and a machine. In the following, we introduce some selected works.

[17] tried to characterize and understand the activity of Sybil accounts, that is, fake accounts, in the Renren online social network. Renren is a Chinese OSN similar to Facebook. They claimed that Sybils in social networks do not form close-knit communities, contrary to what was claimed by structure-based approaches for bot detection [16]. The authors applied an SVM classifier to 2,000 Renren accounts (1,000 human accounts and 1,000 Sybils), achieving very high performance: about 99% accuracy.

[6] focused on the classification of human, bot, and cyborg accounts on Twitter, where cyborgs stand for either human-assisted bots or bot-assisted humans. Using a collection of 500,000 accounts, the authors studied the differences among bots, humans, and cyborgs by considering features related to account properties, tweet content, and tweeting behavior. They applied a Random Forest classifier on a test dataset of 2,000 users, reaching 98% accuracy for humans, 91% for cyborgs, and 96% for bots.

[11] introduced the first strategy to filter out content polluters using social honeypots. They collected 23,869 polluters (bots or not) by making use of a small set of honeypots they created. Then, they evaluated a large variety of classification algorithms, such as Naive Bayes, logistic regression, support vector machines, and tree-based methods, for distinguishing between content polluters and legitimate users. Random Forest produced the highest performance, reaching 98.42% accuracy.

In [1], the authors collected and manually labeled a dataset of Twitter accounts, including bots, human users, and hybrids (i.e., accounts whose tweets are posted by both humans and bots). This dataset was used to train and test several types of classifiers. Random Forest and Bayesian algorithms reached the best performance in both the two-class (bot/human) and the three-class (bot/human/hybrid) classification.

[5] used a deep learning method to extract latent temporal patterns.
To the best of our knowledge, this was the first system to apply deep neural networks to bot detection. However, this method cannot be compared to the PAN Shared Task approach, because the datasets provided by the organizers do not include explicit time-axis information for the user accounts.

"Bot or Not?" [8] is one of the first social bot detectors publicly available for Twitter. The detection algorithm relies on more than 1,000 features, which are grouped into six types: user, network, temporal, content, friends, and sentiment. BotOrNot can be used via a website and APIs (http://botornot.co/).

Another system aimed at detecting bots on Twitter is SentiBot [9], which focuses on a number of sentiment-related factors that are key to the identification of bots. Therefore, SentiBot employs sophisticated sentiment analysis techniques to extract relevant features to train the classifier.

Unlike the previously introduced systems, the method we propose has to adapt to the characteristics of the dataset provided by the shared task and, therefore, focuses on features related to the linguistic content of tweets.

3 Types of Features

Feature extraction and selection is a critical process for any classification task. In the following, we describe the different types of features used in the experiments reported later.

3.1 Social Network Features

These are specific characteristics of the language of social networks, which include textual elements that can only be found on Twitter. We used the following list of social network features:

– Ratio of the number of hashtags (i.e., the number of hashtags used by a user account divided by the total number of tweets sent from this account).
– Ratio of the number of user references.
– Ratio of the number of URL links.
– Ratio of the number of retweets.
– Ratio of the number of textual emoticons, such as ';)' or ':)'.
– Ratio of the number of emojis.
– Ratio of the number of onomatopoeias, such as haha in English or jeje in Spanish.
– Ratio of the number of language abbreviations, such as b4 (before) or btw (by the way) in English, and q (que) or xq (porque) in Spanish.
– Ratio of the number of alliterations, defined here as the repetition of vowel sounds.

3.2 Content-Based Features

These are features that can be extracted from any text message. The content features we used are the following:

– Ratio of the size of tweets.
– Ratio of the number of identical pairs of tweets.
– Lexical richness, defined as the lemma/token ratio (LTR):

  LTR = \frac{\|L\|}{\|T\|}    (1)

  where \|L\| is the number of different lemmas appearing in the tweets of one user account, and \|T\| is the total number of tokens. As grammatical words should not be taken into account, we only consider lexical lemmas and tokens; that is, l ∈ L and t ∈ T only if l and t are nouns, adjectives, verbs, or adverbs.
– Similarity between sequential pairs of tweets, t1 and t2, defined as follows:

  Sim(t_1, t_2) = \frac{\|L_{t_1} \cap L_{t_2}\|}{\|L_{t_1} \cup L_{t_2}\|}    (2)

  where L_{t_1} and L_{t_2} are the lexical lemmas (nouns, adjectives, verbs, and adverbs) of tweets t1 and t2, respectively, with t1 ≺ t2. To obtain the final similarity ratio associated with a user account, all Sim scores between pairs of sequential tweets are added, and the result is divided by the total number of tweets (see the sketch after this list).
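To make the last two measures concrete, the following is a minimal sketch in Python. It assumes each tweet has already been reduced to a list of lexical lemmas (in our pipeline this step is done with LinguaKit); the function names and toy data are illustrative, not part of the actual system.

```python
# Sketch of the content-based features of Section 3.2.
# Each tweet is assumed to be a list of lexical lemmas
# (nouns, adjectives, verbs, adverbs).

def lexical_richness(tweets: list[list[str]]) -> float:
    """Lemma/token ratio (Eq. 1): distinct lexical lemmas over all lexical tokens."""
    tokens = [lemma for tweet in tweets for lemma in tweet]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def sim(t1: list[str], t2: list[str]) -> float:
    """Similarity between two tweets (Eq. 2): Jaccard overlap of their lemma sets."""
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def sequential_similarity(tweets: list[list[str]]) -> float:
    """Account-level ratio: sum of Sim over sequential tweet pairs,
    divided by the total number of tweets."""
    if len(tweets) < 2:
        return 0.0
    pairs = zip(tweets, tweets[1:])
    return sum(sim(t1, t2) for t1, t2 in pairs) / len(tweets)

account = [["love", "new", "phone"], ["love", "phone", "battery"], ["rain", "today"]]
print(lexical_richness(account))       # 6 distinct lemmas / 8 tokens = 0.75
print(sequential_similarity(account))  # (0.5 + 0.0) / 3 tweets ≈ 0.167
```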
3.3 Lexical Features

Lexical features were derived from several domain-specific lexicons. In particular, four different weighted lexicons were automatically built for each language:

– human-machine lexicon: a lexicon consisting of specific words belonging to two classes: the language of bots and the language of humans on Twitter.
– female-male lexicon: a lexicon consisting of specific words belonging to two classes: women's language and men's language.
– sentiment lexicon of human-machine: a lexicon consisting of polarity words (positive or negative) used by bots or humans.
– sentiment lexicon of female-male: a lexicon consisting of polarity words (positive or negative) used by women or men.

Each lexicon was built by making use of the annotated corpora provided by the PAN organizers and a ranking algorithm. For instance, as the word consent appears frequently in female discourse on Twitter, it is added as a female word to the female-male lexicon. In addition, a weight is assigned to each word within a lexicon: the higher the weight, the more intense the female or male value of the word. The same procedure was followed to build the human-machine lexicon. Concerning the sentiment lexicons, we used the same method but restricted it with external polarity lexicons; that is, only words also appearing in external sentiment resources are considered. As in [9], we consider that a number of sentiment-related factors might be essential to the identification of bots.

We only considered words belonging to lexical categories; hence, only nouns, verbs, adjectives, and adverbs were selected. Besides lexical words, hashtags were also taken into account. PoS tagging for English and Spanish was carried out with the multilingual toolkit LinguaKit [10]. The polarity lexicon provided by LinguaKit was also used as the external resource to build the sentiment lexicons of the human-machine and female-male classes.

The method to build a domain-specific lexicon is partly inspired by that reported in [3,2] for very negative opinions. The score of a word given a class (bot, human, female, or male), noted C, is computed as follows:

  C(w) = \frac{freq_{Total}(w)}{freq_C(w)}    (3)

where freq_{Total}(w) is the number of occurrences of word w in the whole annotated corpus, and freq_C(w) stands for the number of occurrences of the same word in the segments (tweets) annotated as belonging to this class, where C stands for bot, human, female, or male. In addition to the class score C, it is also required to compute a threshold above which the word is considered as belonging to the class. So, we compute the difference between the use of a word within the given class and outside it:

  DIFF(w) = freq_C(w) − freq_{¬C}(w)    (4)

where freq_{¬C}(w) stands for the occurrences of w in segments that are not annotated as C. To insert a word into a lexicon, the value of DIFF(w) must be higher than a threshold. In our experiments, this value was 50 for the human-machine and female-male lexicons, so in these two lexicons we only selected words with DIFF values higher than 50; for the sentiment lexicons, the threshold was set to 10. Finally, words were ranked by their C score, giving rise to weighted and ranked lexicons. The same procedure was carried out to build the specific sentiment lexicons; yet, for this purpose, we made use of general-purpose polarity lexicons to extract only polarity words.
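As an illustration of this procedure, here is a minimal sketch in Python. It assumes the corpus is available as (lemmas, class label) pairs and reads Eq. 3 as dividing the total frequency by the class frequency, as reconstructed above; all names are hypothetical, not the authors' actual implementation.

```python
# Sketch of the lexicon construction (Section 3.3). Assumption:
# `corpus` is a list of (lemmas, label) pairs, where `lemmas` are the
# lexical lemmas of one tweet and `label` its annotated class.

from collections import Counter

def build_lexicon(corpus, target_class, diff_threshold=50):
    """Weighted, ranked lexicon for one class, following Eqs. 3 and 4."""
    freq_total, freq_in, freq_out = Counter(), Counter(), Counter()
    for lemmas, label in corpus:
        bucket = freq_in if label == target_class else freq_out
        for w in lemmas:
            freq_total[w] += 1
            bucket[w] += 1
    lexicon = {}
    for w, f_c in freq_in.items():
        # DIFF(w) = freq_C(w) - freq_notC(w): keep clearly class-specific words.
        if f_c - freq_out[w] > diff_threshold:
            # C(w) = freq_Total(w) / freq_C(w); under this reading,
            # C(w) = 1 means the word occurs only inside the target class.
            lexicon[w] = freq_total[w] / f_c
    # Rank words by their C score to obtain the weighted, ranked lexicon.
    return sorted(lexicon.items(), key=lambda item: item[1])
```

For the sentiment lexicons, the same routine would first be restricted to words found in an external polarity lexicon, with the threshold lowered to 10.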
features        bot/human accuracy   male/female accuracy
bow             0.73                 0.77
lex             0.62                 0.68
text            0.62                 0.51
bow+text        0.73                 0.77
lex+text        0.83                 0.67
bow+lex+text    0.62                 0.73

Table 1. Results obtained by using the English training and development datasets.

features        bot/human accuracy   male/female accuracy
bow             0.85                 0.67
lex             0.65                 0.57
text            0.71                 0.50
bow+text        0.83                 0.68
lex+text        0.90                 0.67
bow+lex+text    0.80                 0.63

Table 2. Results obtained by using the Spanish training and development datasets.

4 Experiments

In order to find the best feature configuration for the classification task, we used a Bayesian algorithm. In addition to its simplicity and efficiency, Naive Bayes performs well in this type of task, as described in [1], where the Bayesian classifier obtained the best results in the bot/human classification. Our classifier was implemented with the Algorithm::NaiveBayes Perl module (https://metacpan.org/pod/Algorithm::NaiveBayes). As mentioned before, in order to lemmatize tweets and identify lexical PoS tags, they were processed with the multilingual toolkit LinguaKit [10].

In Tables 1 and 2, content and social network features are jointly called textual features (text), while lexical features (lex) represent both the human-machine and the sentiment lexicons. It is important to point out that the text features are the same for both the bot/human and the male/female classification; by contrast, the lex features were specific to each subtask. In addition, we also consider a traditional bag-of-words with term frequency (abbreviated bow). Tables 1 and 2 show the results obtained by the different feature combinations configuring the Bayesian classifier for English and Spanish, respectively.

The Naive Bayes classifier with bow alone works acceptably, but the combination of bow with other features drops the accuracy, as in the experiments on hate speech detection reported in [4]. By contrast, combining lex or bow with text performs well. This could be explained by the fact that lex-text and bow-text are pairs of conceptually independent features, which suits the Naive Bayes algorithm, as it assumes that all features are independent. For bot/human detection, lex-text achieves 0.90 accuracy in Spanish and 0.83 in English. Concerning the gender profiling task, the best configuration in both languages is bow-text, achieving 0.68 and 0.77 accuracy in Spanish and English, respectively. By contrast, the lex-bow pair does not work well in either task, since lex and bow seem to be quite dependent features; in fact, lex is essentially a subset of bow.

The most important observation is that some feature combinations improve on the bow baseline. It is worth noting that in many classification tasks the bow model is very difficult to beat. So, the results we obtained seem to show that the features described are useful and, therefore, worth improving and exploiting further.

We selected the best configurations to be used with the official test dataset of the PAN Shared Task: for bot detection in Spanish and English, we used the lex-text pair of features, while for gender profiling in the two languages we used bow-text. The official results, depicted in Table 3, are very similar to those obtained in the development phase, although they are a little lower for bot detection in both languages, while for gender profiling they are a little higher in Spanish and a little lower in English.

Language   bot/human accuracy   male/female accuracy
Spanish    0.88                 0.71
English    0.81                 0.72

Table 3. Results obtained by our best configurations with the Spanish and English official test sets of the PAN Shared Task: Bots and Gender Profiling.
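For readers who prefer a concrete picture of the setup, the following is a hedged sketch in Python using scikit-learn's GaussianNB (chosen because the features are real-valued ratios); our actual system used the Algorithm::NaiveBayes Perl module, and the feature values below are toy numbers standing in for the text and lex features of Section 3.

```python
# Illustrative Naive Bayes classification of user accounts (Section 4).
# Not the authors' Perl implementation; toy data only.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row is one user account: [hashtag ratio, retweet ratio, LTR,
# sequential similarity, bot-lexicon score, human-lexicon score].
X_train = np.array([
    [0.90, 0.75, 0.20, 0.60, 0.80, 0.10],  # bot-like: repetitive, bot-lexicon hits
    [0.85, 0.80, 0.25, 0.55, 0.70, 0.15],  # bot-like
    [0.20, 0.10, 0.65, 0.05, 0.10, 0.75],  # human-like: varied vocabulary
    [0.25, 0.15, 0.70, 0.10, 0.05, 0.80],  # human-like
])
y_train = np.array(["bot", "bot", "human", "human"])

# Gaussian Naive Bayes assumes the features are conditionally independent
# given the class, which is why independent feature pairs suit it well.
clf = GaussianNB().fit(X_train, y_train)

# Classify an unseen account from its feature vector.
print(clf.predict([[0.80, 0.70, 0.22, 0.50, 0.75, 0.12]]))  # -> ['bot']
```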
The experiments with the official test dataset were carried out in an Ubuntu 16.04 virtual machine by means of TIRA [12], a web service that facilitates software submissions to shared tasks. The software will be freely available.

It is worth noting that the Spanish accuracy for the bot/human classification outperforms two of the baselines proposed by the organizers, namely the one relying on word embeddings and the one based on the low-dimensionality model reported in [13]. By contrast, the n-gram baselines worked slightly better than our approach, which was the 16th best system out of 44 for this specific task.

5 Conclusions

In this study, we presented a basic classification method for bot detection, focused on the extraction and selection of relevant features. The experiments showed that both linguistic features extracted from tweets and lexical information from external resources may help the classification process by improving baseline feature configurations. The experiments also showed that the selected features perform better at identifying bots than at gender profiling.

In current work, we are collecting political accounts from Twitter in order to analyze the influence of malicious bots in the different elections taking place in Spain in 2019. One of our aims is to build an annotated corpus with the aid of the best features identified in the present work. In future work, we will use those features as heuristics of an unsupervised system aimed at ranking Twitter accounts from more human to less human. This ranked list of accounts will be revised by annotators so that a reliable gold-standard dataset is obtained at the end.

Acknowledgments

This work has been partially supported by the DOMINO project (PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE) and the eRisk project (RTI2018-093336-B-C21). It has also received financial support from the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund (ERDF).

References

1. Alarifi, A., Alsaleh, M., Al-Salman, A.: Twitter turing test. Inf. Sci. 372(C), 332–346 (Dec 2016), https://doi.org/10.1016/j.ins.2016.08.036
2. Almatarneh, S., Gamallo, P.: Automatic construction of domain-specific sentiment lexicons for polarity classification. In: International Conference on Practical Applications of Agents and Multi-Agent Systems. pp. 175–182. Springer (2017)
3. Almatarneh, S., Gamallo, P.: A lexicon based method to search for extreme opinions. PLoS ONE 13(5), e0197816 (2018)
4. Almatarneh, S., Gamallo, P., Pena, F.J.R.: CiTIUS-COLE at SemEval-2019 Task 5: Combining linguistic features to identify hate speech against immigrants and women on multilingual tweets. In: Proceedings of the 13th International Workshop on Semantic Evaluation (2019)
5. Cai, C., Li, L., Zeng, D.: Behavior enhanced deep bot detection in social media. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). pp. 128–130 (July 2017)
6. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Trans. Dependable Secur. Comput. 9(6), 811–824 (Nov 2012), http://dx.doi.org/10.1109/TDSC.2012.75
7. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection.
In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
8. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: BotOrNot: A system to evaluate social bots. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 273–274. WWW '16 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2016), https://doi.org/10.1145/2872518.2889302
9. Dickerson, J.P., Kagan, V., Subrahmanian, V.S.: Using sentiment to detect bots on Twitter: Are humans more opinionated than bots? In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 620–627. ASONAM '14, IEEE Press, Piscataway, NJ, USA (2014), http://dl.acm.org/citation.cfm?id=3191835.3191957
10. Gamallo, P., Garcia, M., Piñeiro, C., Martinez-Castaño, R., Pichel, J.C.: LinguaKit: A big data-based multilingual tool for linguistic analysis and information extraction. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS). pp. 239–244 (2018)
11. Lee, K., Eoff, B.D., Caverlee, J.: Seven months with the devils: A long-term study of content polluters on Twitter. In: AAAI International Conference on Weblogs and Social Media (ICWSM) (2011)
12. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World – Lessons Learned from 20 Years of CLEF. Springer (2019)
13. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: Proceedings of the 17th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2016). Springer-Verlag (2016)
14. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Müller, H., Losada, D. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org (2019)
15. Subrahmanian, V.S., Azaria, A., Durst, S., Kagan, V., Galstyan, A., Lerman, K., Zhu, L., Ferrara, E., Flammini, A., Menczer, F.: The DARPA Twitter bot challenge. Computer 49(6), 38–46 (Jun 2016), https://doi.org/10.1109/MC.2016.183
16. Viswanath, B., Post, A., Gummadi, K.P., Mislove, A.: An analysis of social network-based Sybil defenses. SIGCOMM Comput. Commun. Rev. 41(4), – (Aug 2010), http://dl.acm.org/citation.cfm?id=2043164.1851226
17. Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B.Y., Dai, Y.: Uncovering social network Sybils in the wild. ACM Trans. Knowl. Discov. Data 8(1), 2:1–2:29 (Feb 2014), http://doi.acm.org/10.1145/2556609