Bot and Gender Identification: Textual Analysis of Tweets
Notebook for PAN at CLEF 2019

Rodrigo Ribeiro Oliveira, Cláudio Moisés Valiense de Andrade, José Solenir Lima Figuerêdo, João B. Rocha-Junior, Rodrigo Tripodi Calumby, Iago Machado da Conceição Silva, Almir Moreira da Silva Neto

University of Feira de Santana, Feira de Santana, Bahia, Brazil
rodrigo18br@hotmail.com
{claudiovaliense, solenir.figueredo, iagomachado09, almirneto338}@gmail.com
{joao, rtcalumby}@uefs.br

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract. In this paper, we describe the participation of the Advanced Data Analysis and Management (ADAM) group of the University of Feira de Santana in the Bots and Gender Profiling Task organized by PAN@CLEF 2019. We used Support Vector Machines (SVM) optimized through nested cross-validation. For bot detection, we used features related to account behavior, sentiment, and post variety; for gender detection, function words and emoticons. These features were evaluated both individually and in groups. Before the training phase, we preprocessed the data to better fit it to the experiments. For bot detection, our method reached an accuracy of approximately 0.9057 for English and 0.8767 for Spanish; for gender detection, 0.7696 for English and 0.7150 for Spanish. Although the results for Spanish are poorer than those for English, they are well above the random baseline (50%).

1 Introduction

Social media companies employ mobile and web-based technologies to create highly interactive platforms through which individuals and communities share, co-create, discuss, and modify user-generated content [10]. These services have changed the way we see the world and how information is disseminated. A prominent example of such services is Twitter (http://www.twitter.com), a popular microblogging service. Microblogging is a form of communication in which users describe their current status in short posts distributed by instant messages, mobile phones, email, or the Web [8]. In this kind of social media, users follow others or are followed. However, unlike other social networks, Twitter demands no reciprocity in the follower-followed relation: when following a particular user, that user is not required to follow you back. In this kind of application, users write about different aspects of their lives, sharing a variety of subjects and generating heterogeneous discussions.

Twitter is used in multiple contexts. It is mainly considered an information dissemination tool, but it is also a source of data that may support studies in different areas of knowledge. This is especially interesting considering that Twitter offers an Application Programming Interface (API, https://developer.twitter.com/en/docs.html) that allows crawling and collecting data. In this context, a subject that has attracted the attention of researchers is so-called author profiling, in which information such as age, cultural background, gender, native language, and personality can be inferred through textual analysis of users' posts. This type of analysis enables numerous applications, such as business intelligence, digital forensics, psychological profiling, and brand reputation monitoring.
With regard to forensic applications, bot detection has gained attention, especially due to bots' autonomous capacity to disseminate political, extremist, or misleading material that may negatively influence a massive number of users [6].

In this context, this paper describes the participation of the ADAM team in the Bots and Gender Profiling Task [18] organized by PAN@CLEF 2019. In previous editions, various aspects of an author's profile in social media were investigated: age and gender, also along with personality, gender and language variety, and gender from a multimodal perspective. This edition additionally investigates whether the author of a Twitter feed is a bot or a human, while gender profiling is maintained as a task. The analysis, as in other editions, follows a multilingual perspective, with English and Spanish being the chosen languages. The main contributions of this paper are:

• We define a set of features with discriminative power for bot and gender detection;
• We evaluate the benefit of optimizing a model through cross-validation;
• We analyze the effectiveness of each group of features in the task.

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 presents the experimental setup. The results are discussed in Section 4. Finally, Section 5 presents our conclusions and directions for future work.

2 Related Work

The use of bots has brought problems to many collaborative systems, e.g., Wikipedia and OpenStreetMap. This scenario motivates the study of strategies to identify bots in such collaborative systems by examining the contributions made by users [7].

For text classification, information about function words, n-grams (at word and character level), quantitative features, orthographic features, part-of-speech (POS) tags, and vocabulary richness features is commonly used [11]. In a given country, the texts produced in a language may vary depending on the culture of the region of origin [15]. Language variety identification is a popular research topic in natural language processing [17]. This regional influence has an impact on feature extraction; for example, in sentiment analysis, the weights of the dictionary used may vary according to the location of the author.

In the gender identification task, previous work shows that women use question tags and emoticons more frequently, and profanities less frequently [12]. The work in [19] identifies messages from men as often related to themes such as money, sports, and work, while women write more frequently about family, friends, and food. In addition, some subjects are predominantly approached by men (e.g., gaming) or by women (e.g., shopping).

The work in [1] explored techniques for identifying Twitter messages produced by bots. In the experiments, results are presented for neural networks (MLP) and Random Forest, with the Random Forest algorithm showing superior performance and reaching an accuracy of 92%. In comparison to the dataset used in this paper, the dataset of Braz et al. [1] contains additional information about the user accounts (e.g., number of users the account follows, number of followers), which impacts the results achieved, because there is data beyond the textual aspect.

In [2], a purely textual approach to bot detection is used.
Three features are built from the text: the dissimilarity between pairs of tweets of a user; the word introduction decay rate (a measure of the new unique words a user introduces over time); and the average number of URLs per tweet. The classification used 10-fold cross-validation and achieved an accuracy of 90.32% in detecting bots.

Related research addresses gender identification in e-mail messages [3]. Some of the features used express emotions through character repetition (e.g., "ur ssoooo kooool", "ihaaa") and emoticons (e.g., ":D", ":("). Another feature used by the authors for gender identification is the number of words ending with a given character sequence (e.g., "less"). Drawing on prior research, the paper suggests that women make frequent use of adverbs, emotionally intensive adjectives, and terms related to questions, personal orientation, and support.

The work in [9] compares several classifiers and argues that SVM is appropriate for classifying text, among other reasons because it copes with the high dimensionality of the input space (a large number of features) and has protection against overfitting.

3 Experimental Setup

In this section, we present the dataset (Section 3.1), the preprocessing phase (Section 3.2), the feature extraction method (Section 3.3), and the classification model (Section 3.4).

3.1 Dataset Description

Table 1 contains a description of the training corpus. The data was previously split into train and dev sets for both English and Spanish, indicating the datasets for training and validation. All datasets were balanced between bot and human. Only the human portion of the data was labeled with respect to gender, also with an equal number of male and female users. Each file corresponds to a user and contains 100 of their tweets. The tweets are not processed in any way, so emojis, retweets, hashtags, user mentions, and hyperlinks are still present.

Table 1. Summary of the dataset: number of files.

Language | Train | Dev
English  | 2880  | 1240
Spanish  | 2080  |  920

3.2 Preprocessing

In order to better adjust the posts to the experiments, we executed three independent operations:

• Conversion of multiple whitespaces to a single whitespace;
• Lower-casing of all text;
• Removal of non-alphanumeric characters.

All the operations were performed using the Natural Language Toolkit (NLTK) [13]. A minimal sketch of these operations is shown below.
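Since the paper only lists the operations, the following is a minimal sketch of the preprocessing step, written with Python's standard re module rather than the exact NLTK-based implementation (which is not shown); the order of application is our assumption, as the operations are described as independent.

```python
import re

def preprocess(text: str) -> str:
    """Sketch of the three preprocessing operations of Section 3.2."""
    text = re.sub(r"\s+", " ", text)        # collapse multiple whitespaces into one
    text = text.lower()                     # lower-case all text
    text = re.sub(r"[^a-z0-9 ]", "", text)  # drop non-alphanumeric characters
    return text

print(preprocess("Check  THIS out!!  https://t.co/x #PAN19 :)"))
# -> 'check this out httpstcox pan19 '
```

Note that this step also strips the "#", "@", and URL markers, which is why the cleanliness feature of Section 3.3 compares character counts before and after preprocessing.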
3.3 Feature Extraction

Multiple classes of features were used for bot and gender detection. The following sections describe each one individually, according to the associated task.

Bot detection. Many of the features used in previous work to detect bots on Twitter, such as usernames, geodata, tweet intervals, and following data, are unavailable in this case. Therefore, the features are limited to the tweet text (see the sketch at the end of this section).

• Twitter features: these features are related to the Twitter behavior of the users: frequency of hashtags (#), frequency of user mentions (@), frequency of retweets (occurrences of the string "rt"), and frequency of links (occurrences of "http"). The rationale behind the first two is that bots tend to try to increase their reach by inserting trending hashtags in their posts or by mentioning multiple users to call their attention. Bots tend to retweet content as a way to easily build a profile, and constant posting of links is typical behavior of spam bots.

• Sentiment features: for the English corpus, the VADER library (https://github.com/cjhutto/vaderSentiment) was used to identify the sentiment of the posts. For each user, we compute the mean of the four VADER sentiment metrics: positive, negative, neutral, and compound. An additional feature was also computed, the sentiment flip, defined in [5] as the number of sentiment inversions (positive to negative and vice versa) between two adjacent posts, normalized by the total number of tweets authored by the user. In VADER, the compound score lies in the range [-1, 1], so it is the one used to calculate the sentiment flip, with 0 as the inversion point. Obtaining the sentiment features for the Spanish corpus was somewhat difficult. The method recommended by the VADER developers is to automatically translate the texts to English and run the library on the resulting text. We ended up using a machine-learning-based solution (https://github.com/aylliote/senti-py), which generates a single score in the range [0, 1]; the sentiment inversion point was fixed at 0.5.

• Variety features: these features measure how varied the content generated by the user is, under the assumption that bots tend to repeat content in their posts. Two features belong to this group: the ratio between the number of distinct words used and the total number of words in all posts; and the cleanliness, the ratio between the number of characters after and before preprocessing.

Gender detection.

• Emoticons: frequency of each term in a list of emoticons, meant to cover a range of emotions.
• Function words: frequency of function words: in English, pronouns, determiners, modals, and conjunctions; in Spanish, conjunctions and determiners.
• Sentiment features: the same sentiment features used for bot detection.
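To make the feature definitions concrete, the sketch below shows one way the per-user feature groups for bot detection could be computed. The function names, the normalization by tweet count, the detection of retweets via an "RT " prefix, and the treatment of scores lying exactly on the inversion pivot are our assumptions; only the feature definitions themselves come from the text above.

```python
from typing import List
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # pip install vaderSentiment

def twitter_features(tweets: List[str]) -> List[float]:
    """Per-tweet frequencies of hashtags, mentions, retweets, and links."""
    n = len(tweets)
    hashtags = sum(t.count("#") for t in tweets) / n
    mentions = sum(t.count("@") for t in tweets) / n
    retweets = sum(t.lower().startswith("rt ") for t in tweets) / n  # assumed "RT " prefix
    links = sum(t.count("http") for t in tweets) / n
    return [hashtags, mentions, retweets, links]

def sentiment_flip(compound: List[float], pivot: float = 0.0) -> float:
    """Sentiment flip [5]: inversions between adjacent posts over total tweets."""
    flips = sum((a > pivot) != (b > pivot) for a, b in zip(compound, compound[1:]))
    return flips / len(compound)

def variety_features(raw: List[str], clean: List[str]) -> List[float]:
    """Word-variety ratio and cleanliness (characters kept by preprocessing)."""
    words = " ".join(clean).split()
    word_ratio = len(set(words)) / len(words)                # distinct / total words
    cleanliness = sum(map(len, clean)) / sum(map(len, raw))  # chars after / before
    return [word_ratio, cleanliness]

# Example for one hypothetical user; `preprocess` is the sketch from Section 3.2.
tweets = ["RT @user check this out http://spam.example #win", "what a lovely day :)"]
clean = [preprocess(t) for t in tweets]
analyzer = SentimentIntensityAnalyzer()
compound = [analyzer.polarity_scores(t)["compound"] for t in tweets]
vector = twitter_features(tweets) + [sentiment_flip(compound)] + variety_features(tweets, clean)
```

The resulting per-user vector is what the classifier of Section 3.4 consumes; for Spanish, the scores would instead come from senti-py, with the pivot fixed at 0.5.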
3.4 Classification Model

For conducting the experiments we used an SVM classifier. In order to optimize the learned model, 5-fold nested cross-validation [20] was performed: in an inner loop, the hyperparameters are varied to find the optimal model, and an additional, outer cross-validation evaluates the model found. This process uses only the train dataset. After the model is chosen, it is used to classify the dev dataset, giving a notion of how the model will perform on unseen data, i.e., this set validates the model. To make this last step statistically meaningful, nothing from the dev dataset was used during the learning process.

In the cross-validation, both linear and RBF kernels were tried, varying their respective hyperparameters (C for both kernels and, for RBF, gamma). To do this, we used the scikit-learn library [14]; the sketch below outlines the procedure.
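Since the paper does not include the training code, the following is a minimal sketch of 5-fold nested cross-validation with scikit-learn, using synthetic stand-in data in place of the per-user feature matrix; the hyperparameter grids shown are illustrative, not the exact values searched in our runs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Stand-in data; in practice X holds the per-user feature vectors of Section 3.3.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grids over both kernels and their hyperparameters.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
]

# Inner loop: grid search selects the best kernel and hyperparameters.
inner = GridSearchCV(SVC(), param_grid, cv=5)

# Outer loop: an independent cross-validation estimates the accuracy of the
# whole selection procedure, on the train portion only.
scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV accuracy: %.4f" % scores.mean())

# The chosen model is then refit on the full train set and applied to dev.
inner.fit(X, y)
```

Wrapping the GridSearchCV inside cross_val_score keeps model selection strictly inside each outer training fold, so the outer score is an honest estimate of the whole selection procedure rather than of a single hyperparameter setting.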
4 Results and Discussion

The training and validation steps were first performed for each feature group individually. After that, the same was done using all features together, so that the impact of each feature group on the overall result could be evaluated. The final evaluation was performed on the official PAN@CLEF 2019 [4] test set using the TIRA platform [16]. The results are grouped according to their associated task, i.e., bot or gender detection. For each task we present the assessment of the classifier on the three datasets provided: Train, Validation, and Test.

4.1 Bot Detection

Table 2 presents the results of bot detection for the English portion of the data. In this experiment, the overall best performance is achieved by using all the features together rather than separately.

Table 2. Results for bot detection in English.

Group              | Train (%) | Validation (%) | Test (%)
Twitter features   | 91.70     | 89.91          | –
Sentiment features | 80.31     | 74.03          | –
Variety features   | 63.51     | 66.21          | –
All features       | 93.12     | 90.97          | 90.57

Table 3 presents the results of bot detection for Spanish. The poor accuracy of the sentiment features on the Spanish train set, only slightly above the random baseline (50%), led us to repeat the training experiments without this feature class, which yielded better results. Hence, we decided not to include these features in the final Spanish submission for either task.

Table 3. Results for bot detection in Spanish.

Group              | Train (%) | Validation (%) | Test (%)
Twitter features   | 84.04     | 83.91          | –
Sentiment features | 57.50     | –              | –
Variety features   | 73.26     | 70.33          | –
Twitter + Variety  | 86.16     | 87.07          | 87.67
All features       | 85.58     | 87.61          | –

4.2 Gender Detection

Table 4 presents the results of gender detection for English. Similar to the bot detection task, the best performance occurs when all features are used. Although the computational cost is higher when using more features, an information gain is expected, which positively impacts the performance of the classifier.

Table 4. Results for gender detection in English.

Group                                           | Train (%) | Validation (%) | Test (%)
Function words                                  | 69.23     | 72.23          | –
Function words + Sentiment features             | 69.72     | 72.90          | –
Function words + Sentiment features + Emoticons | 70.60     | 75.67          | 76.86

Table 5 presents the results of gender detection for Spanish. Adding emoticons to function words improved the results only slightly, and the results were poorer than in English.

Table 5. Results for gender detection in Spanish.

Group                      | Train (%) | Validation (%) | Test (%)
Function words             | 62.39     | 69.42          | –
Function words + Emoticons | 62.61     | 68.84          | 71.50

5 Conclusion and Future Work

In this paper we described the participation of the ADAM team in the Bots and Gender Profiling Task organized by PAN@CLEF 2019. In this task, focused on Twitter posts, the goal was to determine whether the author of a Twitter feed was a bot or a human and, in the case of a human author, to identify the gender. We used a set of features with an SVM optimized through cross-validation. The outcome suggests that our proposal, in general, achieves good results compared to the random baseline. The best results were achieved for bot detection in English. Although the accuracy for gender detection is inferior to that for bot detection, it also presents promising results.

As future work, we suggest extracting metadata from the tweet text, for example, building citation networks through user mentions. Another approach is to detect how original the posts of a user are within the corpus, since bots are known to replicate human content to fake authenticity. In addition, it would be interesting to experiment with different machine learning approaches, such as deep learning. Better tools for features in non-English languages, such as the sentiment ones, are also needed, as they are scarce.

References

1. Braz, P.A., Goldschmidt, R.R.: Redes neurais convolucionais na detecção de bots sociais: um método baseado na clusterização de mensagens textuais [Convolutional neural networks for social bot detection: a method based on clustering textual messages]. In: Simpósio Brasileiro de Segurança da Informação e de Sistemas Computacionais (SBSeg 2018). pp. 323-336. SBC (2018)
2. Clark, E.M., Williams, J.R., Jones, C.A., Galbraith, R.A., Danforth, C.M., Dodds, P.S.: Sifting robotic from organic text: a natural language approach for detecting automation on Twitter. Journal of Computational Science 16, 1-7 (2016)
3. Corney, M., de Vel, O., Anderson, A., Mohay, G.: Gender-preferential text mining of e-mail discourse. In: Proceedings of the 18th Annual Computer Security Applications Conference (ACSAC 2002). pp. 282-289. IEEE (2002)
4. Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
5. Dickerson, J.P., Kagan, V., Subrahmanian, V.S.: Using sentiment to detect bots on Twitter: are humans more opinionated than bots? In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). pp. 620-627. IEEE Press (2014)
6. Ferrara, E.: Disinformation and social bot operations in the run up to the 2017 French presidential election. First Monday 22(8) (2017)
7. Hall, A., Terveen, L., Halfaker, A.: Bot detection in Wikidata using behavioral and other informal cues. Proceedings of the ACM on Human-Computer Interaction 2(CSCW), article 64 (2018)
8. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis. pp. 56-65. ACM, New York, NY, USA (2007)
9. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: European Conference on Machine Learning (ECML). pp. 137-142. Springer (1998)
10. Kietzmann, J.H., Hermkens, K., McCarthy, I.P., Silvestre, B.S.: Social media? Get serious! Understanding the functional building blocks of social media. Business Horizons 54(3), 241-251 (2011)
11. Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), 401-412 (2002)
12. Lakoff, R.: Language and woman's place. Language in Society 2(1), 45-79 (1973)
13. Loper, E., Bird, S.: NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028 (2002)
14. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
15. Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: our words, our selves. Annual Review of Psychology 54(1), 547-577 (2003)
16. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF. Springer (2019)
17. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. pp. 156-169. Springer International Publishing, Cham (2018)
18. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
19. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol. 6, pp. 199-205 (2006)
20. Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36(2), 111-133 (1974)