Bots and Gender Profiling on Twitter using Sociolinguistic Features
Notebook for PAN at CLEF 2019

Edwin Puertas2,1,3, Luis Gabriel Moreno-Sandoval1,3, Flor Miriam Plaza-del-Arco4, Jorge Andres Alvarado-Valencia1,3, Alexandra Pomares-Quimbaya1,3, and L. Alfonso Ureña-López4

1 Pontificia Universidad Javeriana, Bogotá, Colombia {edwin.puertas,jorge.alavarado,morenoluis,pomares}@javeriana.edu.co
2 Universidad Tecnológica de Bolívar, Cartagena, Colombia epuerta@utb.edu.co
3 Center of Excellence and Appropriation in Big Data and Data Analytics (CAOBA)
4 Universidad de Jaén, Jaén, Andalucía, Spain {fmplaza, laurena}@ujaen.es

Abstract. Software bots are becoming increasingly common in social networks because malicious actors have discovered their usefulness for spreading false messages and rumors, and even for manipulating public opinion. Although the text generated by users in social networks is a rich source of information about its authors, not being able to recognize which users are truly human and which are not is a major drawback. In this work, we describe the multilingual classification model we submitted to PAN 2019, which distinguishes bots from humans and, for human accounts, females from males. The model extracts 18 features from each user's posts and, combined with a machine learning algorithm, obtains good performance results.

Keywords: Bots profiling, gender profiling, author profiling, sociolinguistics, computational linguistics, user profiling

1 Introduction

A recent study by Yang et al. [15] indicates a steady growth of autonomous artificial entities known as social bots on digital platforms such as Twitter, where they spread messages and influence large populations with ease. The study concludes that between 9% and 15% of Twitter accounts show bot-like behavior [2,13,14].
Bots can be designed to carry out malicious activities that manipulate opinion in a given domain. These bots mislead, exploit, and manipulate social media discourse with rumors, malware, misinformation, spam, and slander, among others [7]. Some emulate human behavior to enact fake political support or to change the public perception of political entities [12]. For instance, according to a report published by researchers at Oxford University5, social bots distorted the online discussion around the 2016 U.S. Presidential election. Bots are also used in marketing to manipulate the stock market [7], and for terrorist purposes such as promoting propaganda and recruitment [1]. The detection of social bots is therefore an important research endeavor.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

The automatic detection of bots in social media has attracted the attention of researchers in recent years, and many techniques for this problem have been proposed in the literature. Among systems based on feature-based machine learning methods, Davis et al. [5] presented the first publicly available social bot detection framework for Twitter, which analyzed more than 1,000 features grouped into six classes: network, user, friends, temporal, content, and sentiment. Dickerson et al. [6] proposed SentiBot, an architecture and associated set of algorithms that automatically identify bots on Twitter using a combination of features including tweet sentiment; they conclude that a number of sentiment-related factors are key to the identification of bots.

In this paper, we describe our proposal as part of our participation in the Bots and Gender Profiling task of PAN 2019 [11,4] at CLEF.
This task focuses on determining whether the author of a Twitter account is a bot or a human and, in the case of a human, on profiling the gender of the author. For this purpose, we study the generation and analysis of different sociolinguistic features in order to identify how various linguistic characteristics differ between bots and humans, and between women and men.

The rest of the paper is structured as follows. In Section 2, we describe the data used in our methods. Section 3 presents the details of the proposed system. In Section 4, we discuss the analysis and evaluation results of our system. We conclude in Section 5 with remarks on future work.

2 Data Description

This year's author profiling task at PAN 2019 is to predict whether an author on Twitter is a bot or a human. The dataset contains tweets in English and Spanish, as shown in Table 1, and is split evenly between human and bot users. The tweets recovered for each user come from their timeline, which can span days to months depending on the frequency of use; the last 100 tweets of each author's timeline were collected.

3 Model Description

In this section, we explain the multilingual predictive model used in our submission. The model for the Bots and Gender Profiling task at PAN 2019 [11,4] was designed to identify two types of classes: bot and gender. We proposed two hypotheses in accordance with the attributes of the dataset and the goals of the task, which are described in detail in Table 2.

5 https://nyti.ms/2mNTwnk

According to the hypotheses presented in Table 2, we proposed two strategies. The first generates features from the vocabulary terms used in the tweets. The second computes statistics for each profile that characterize the use of terms, hashtags, mentions, URLs, and emojis. On the basis of these strategies, the "Training System" was designed.
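Before the stages described next, each author's tweets must be read from disk. The sketch below assumes the layout used by previous PAN author-profiling corpora, with one XML file per author and a truth.txt of ":::"-separated labels; this format is our assumption and is not specified in this paper.

```python
# Sketch of reading one author's tweets from a PAN-style corpus.
# The XML layout (<author><documents><document>...</document></documents></author>)
# and the truth.txt format (id:::bot/human:::gender) are assumptions based on
# earlier PAN author-profiling tasks, not details confirmed by this paper.
import xml.etree.ElementTree as ET

def read_author(xml_text: str) -> list[str]:
    """Return the list of tweets stored in one author XML file."""
    root = ET.fromstring(xml_text)
    return [doc.text or "" for doc in root.iter("document")]

def read_truth(truth_text: str) -> dict[str, tuple[str, str]]:
    """Map author id -> (bot/human label, gender label)."""
    labels = {}
    for line in truth_text.strip().splitlines():
        author_id, kind, gender = line.split(":::")
        labels[author_id] = (kind, gender)
    return labels
```

With the tweets and labels in memory, the concatenation into one document per profile (Section 3.1) is straightforward.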
Figure 1 shows the proposed system to predict bot and gender, which consists of the following stages: preprocessing, normalization and transformation, feature extraction, configuration and classification, and testing.

3.1 Preprocessing

In the preprocessing stage, we concatenate the vocabulary terms of each user's tweets so that there is only one document per user profile. In addition, we re-label tokens: hashtags are replaced with the word "label_hashtag", mentions with "label_mention", URLs with "label_url", and UTF-8 emojis with "label_emoji". Finally, the re-tagged words are searched and counted globally.

3.2 Normalization and Transformation

The next stage is the normalization and transformation process. Normalization generates random samples for the training and testing process. During transformation, the vector representation of words is computed and the features of each user profile are calculated. This process can be configured so that the vector representation of words is built from n-grams, and so that the global features related to the tweets of each user profile are parameterized.

3.3 Feature Extraction

According to [8], human knowledge is distributed among a large number of information sources, with data volumes constantly growing. Social networks have become indispensable tools for the automatic understanding of language because they allow us to model users' writing habits by extracting features from the texts they publish.

Table 1. Characteristics of training dataset.
Statistic                    English  Spanish
# Bot users                  4,120    1,500
# Human users                4,120    1,500
Avg. tweets per bot user     100      100
Avg. tweets per human user   100      100

Table 2. Description of hypotheses

Class   Description
Bots    For the bot classification hypothesis, we suggest that bots have less linguistic diversity than humans. For this reason, we proposed classifiers that use vocabulary features and linguistic diversity.
Gender  For the gender classification hypothesis, we believe that the vocabulary used by users can be associated with the use of linguistic features. For this reason, we analyze the way authors use emojis, hashtags, and mentions in addition to the vocabulary.

Figure 1. System Training.

In fact, the main challenge of classifying bots and gender is the detection of writing style on Twitter. According to [3], tweets produced by bots contain a high number of URLs compared to human tweets; thus, the average number of URLs per tweet is a valuable feature for classification algorithms. In addition, it is well known that people do not always spell words, hashtags, mentions, URLs, and emojis correctly. For these reasons, we extracted features at two levels: the tweet level and the user profile level. At the tweet level, we extracted the words and the counts of hashtags, mentions, URLs, and emojis. At the user profile level, we integrated the results of the previous level by calculating the average, kurtosis, and skewness of those counts. Likewise, we analyze lexical diversity by comparing the words used in one tweet to the words used in the rest of the tweets. Table 3 describes the resulting features.

Table 3. Features description

#   Feature                 Description
1   stats_avg_word          Average word length per tweet
2   stats_kur_word          Kurtosis of the variable stats_avg_word
3   stats_label_emoji       Number of emojis per tweet for the profile
4   stats_label_hashtag     Number of hashtags per tweet for the profile
5   stats_label_mention     Number of mentions per tweet for the profile
6   stats_label_url         Number of URLs per tweet for the profile
7   stat_label_retweets     Number of retweets per tweet for the profile
8   stat_lexical_diversity  Lexical diversity over all tweets of the profile
9   stats_label_word        Number of words per tweet for the profile
10  kurtosis_avg_word       Kurtosis of the variable stats_kur_word
11  kurtosis_label_word     Kurtosis of the variable stats_label_word
12  skew_avg_word           Statistical asymmetry (skewness) of the variable stats_avg_word
13  skew_label_word         Statistical asymmetry (skewness) of the variable stats_label_word
14  stats_person_1_sing     Number of tweets using the first person singular
15  stats_person_2_sing     Number of tweets using the second person singular
16  stats_person_3_sing     Number of tweets using the third person singular
17  stats_person_1_plu      Number of tweets using the first and second person plural
18  stats_person_3_plu      Number of tweets using the third person plural
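The relabeling of Section 3.1 and a few of the profile-level statistics above can be sketched as follows. The regular expressions and the type-token-ratio notion of lexical diversity are simplifying assumptions of ours, not the exact implementation behind Table 3.

```python
# Simplified sketch of the relabeling step and of some per-profile features.
# The regexes and the type-token ratio used for lexical diversity are our
# assumptions; the paper does not publish its exact preprocessing code.
import re
import statistics

def relabel(tweet: str) -> str:
    """Replace URLs, mentions, and hashtags with placeholder labels."""
    tweet = re.sub(r"https?://\S+", "label_url", tweet)
    tweet = re.sub(r"@\w+", "label_mention", tweet)
    tweet = re.sub(r"#\w+", "label_hashtag", tweet)
    return tweet

def profile_features(tweets: list[str]) -> dict[str, float]:
    """Compute a handful of the per-profile statistics of Table 3."""
    relabeled = [relabel(t) for t in tweets]
    tokens_per_tweet = [t.split() for t in relabeled]
    all_tokens = [tok.lower() for toks in tokens_per_tweet for tok in toks]
    words = [tok for tok in all_tokens if not tok.startswith("label_")]
    return {
        # average word length per tweet, averaged over the profile
        "stats_avg_word": statistics.mean(
            statistics.mean(len(w) for w in toks) if toks else 0.0
            for toks in tokens_per_tweet
        ),
        "stats_label_url": sum(tok == "label_url" for tok in all_tokens) / len(tweets),
        "stats_label_mention": sum(tok == "label_mention" for tok in all_tokens) / len(tweets),
        # type-token ratio as a simple proxy for lexical diversity
        "stat_lexical_diversity": len(set(words)) / len(words) if words else 0.0,
    }
```

Kurtosis and skewness (features 2, 10-13) would be computed over the same per-tweet series, e.g. with scipy.stats.kurtosis and scipy.stats.skew.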
3.4 Settings and Classifiers

In the configuration stage, the system adjusts hardware parameters such as the number of processors and threads. In addition, different scenarios can be configured for the use of the classifiers. Finally, the system can be set to store the best-performing word vectors and classifiers. It should be noted that in all our experiments the dataset was split into 60% for training and 40% for testing.

Based on the goals of the task and on previous results of the PAN author profiling tasks, we analyzed different classifiers: Naive Bayes (NB), Gaussian Naive Bayes (GNB), Complement Naive Bayes (CNB), Logistic Regression (LR), and Random Forest (RF).

3.5 Test

During the test stage, a software component was developed. It first reads the test datasets; then the tweets are processed independently for each user profile; afterwards, the features of each user are calculated and the vector representation is built; the best classifiers for the bot and gender classes are then applied; finally, the best predictors are exported. Figure 2 shows the "System Test" used by our models.

Figure 2. System Test.

4 Experiments and Analysis of Results

During the pre-evaluation phase we carried out different experiments, and the best ones were taken into account for the evaluation phase. The system was evaluated using the usual competition metrics: Accuracy (Acc), Precision (P), Recall (R), and F1-score (F1). The best systems for bot and gender classification in the pre-evaluation phase are explained in detail in the following sections. The system was trained and tested with the dataset provided by the official site of PAN 2019 [4], and submissions were made on the TIRA platform [9]. The results obtained after evaluating our system on the training dataset are shown in Table 4.
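The configuration stage above can be sketched with scikit-learn, which provides the classifier families named in the paper (GaussianNB, ComplementNB, LogisticRegression, RandomForestClassifier). The synthetic 18-dimensional data below stands in for the real feature vectors of Table 3; the hyperparameters shown are illustrative assumptions, not the submitted configuration.

```python
# Sketch of the 60/40 split and classifier comparison (Section 3.4).
# make_classification replaces the real PAN feature matrix; ComplementNB is
# omitted here because it requires non-negative features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# 18 features, matching the feature count of Table 3
X, y = make_classification(n_samples=400, n_features=18, random_state=42)

# 60% training / 40% testing, as in the experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42)

scores = {}
for name, clf in [
    ("GNB", GaussianNB()),
    ("LR", LogisticRegression(max_iter=1000)),
    ("RF", RandomForestClassifier(random_state=42)),
]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # accuracy on the held-out 40%

best = max(scores, key=scores.get)
```

The best-scoring classifier per class and language would then be persisted for the test stage.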
The system uses various classification algorithms: Random Forest, GaussianNB, ComplementNB, and Logistic Regression. For English, Random Forest obtained the best performance for both bots and gender. For Spanish, Random Forest had the best accuracy for bots, while Logistic Regression had the best accuracy for gender.

Table 4. Summary of results in bot and gender classification per language

Type     Language  Acc   Best Model
BOT      en        0.91  RF
GENDER   en        0.81  RF
BOT      es        0.90  RF
GENDER   es        0.75  LR

Table 5. Bot classification in English and Spanish

           Precision      Recall         F1-Score       Support
Class      en     es      en     es      en     es      en     es
0          0.97   0.96    0.85   0.82    0.91   0.88    620    460
1          0.86   0.84    0.98   0.97    0.92   0.90    620    460
Micro avg  0.91   0.89    0.91   0.89    0.91   0.89    1240   920
Macro avg  0.92   0.90    0.91   0.89    0.91   0.89    1240   920

Table 6. Gender classification in English and Spanish

           Precision      Recall         F1-Score       Support
Class      en     es      en     es      en     es      en     es
0          0.79   0.76    0.84   0.72    0.81   0.74    310    540
1          0.83   0.73    0.77   0.78    0.80   0.76    310    540
Micro avg  0.81   0.75    0.81   0.75    0.81   0.75    620    1080
Macro avg  0.81   0.75    0.81   0.75    0.81   0.75    620    1080

4.1 Bot classification

Table 5 shows the results obtained for bot classification in English and Spanish after evaluating our system on the training dataset. The best results came from the Random Forest classifier, with a macro-F1 score of 91% for English and 89% for Spanish.

4.2 Gender classification

Table 6 shows the results obtained for gender classification in English and Spanish after evaluating our system on the training dataset. The best results came from the Random Forest classifier for English, with a macro-F1 score of 81%, and from the Logistic Regression classifier for Spanish, with a macro-F1 score of 75%.
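The macro-averaged F1 scores reported above are obtained by computing per-class precision, recall, and F1 first and then averaging with equal class weights. A plain-Python sketch (the toy labels in the test are illustrative only):

```python
# Macro-averaged F1: per-class F1 scores averaged with equal class weights,
# as used in Tables 5 and 6 and in the official task ranking.
def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because the training classes are balanced, micro and macro averages in Tables 5 and 6 are close; macro averaging would matter more on a skewed test set.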
4.3 Submission Results

Table 7 shows the bot and gender classification results for English and Spanish in the evaluation phase, which was ranked against a low-dimensionality representation baseline for the identification of linguistic varieties (LDSE) [10]. The different datasets provided by the task in that phase were used. The measure was the macro-F1 score, a single value that weights the precision and recall of the models. The final results were obtained with the test2 dataset. In the general ranking of the task we reached the 33rd position, and the 9th position with respect to the LDSE baseline.

Table 7. Final classification

         training-dataset-   test-dataset1-   test-dataset2-
         2019-02-18          2019-03-20       2019-04-29
         es     en           es     en        es     en
Bot      0.84   0.91         0.70   0.90      0.81   0.88
Gender   0.80   0.84         0.61   0.78      0.69   0.76

5 Discussion and Conclusion

The Bots and Gender Profiling task of CLEF PAN 2019 [11,4] involved different subtasks. The first was the preprocessing of the corpus, composed of 100 posts per user profile for a total of 300,000 posts. Fortunately, quality assurance during preprocessing was not a challenge, because the tweets were already clean and the dataset was balanced for each of the target classes. On the contrary, feature extraction was one of the most significant challenges, because it was necessary to achieve good performance with few text samples per user profile. To deal with this, we extracted features at two levels: the tweet and the user profile. The first level obtained traditional counts of words, hashtags, mentions, URLs, and emojis per tweet. The second explored the author's diversity based on the features extracted at the first level. The resulting features proved very useful for discriminating bots from humans, and between genders.
Regarding the second subtask, the classification itself, it was necessary to evaluate different techniques with different parameterizations and inputs. The final results showed that Random Forest and Logistic Regression were the most suitable techniques for this problem. In addition, during the evaluation of the model we confirmed our hypothesis that lexical diversity, expressed through the 18 features, is a good discriminator for the target classes. It is important to highlight that for bot classification, the best classifier obtained an accuracy of 0.912 on the training dataset using the n-grams together with the proposed features, and 0.907 using only the proposed features. This demonstrates the predictive value of these features for the bot problem. Finally, there are still issues to explore. One important aspect is to improve the profile analysis from the sociolinguistic point of view by integrating features that describe the interaction dynamics of each user.

6 Acknowledgements

We thank the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), Pontificia Universidad Javeriana, and the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC). The models and results presented in this challenge contribute to building the research capabilities of CAOBA. We also thank the Fondo Europeo de Desarrollo Regional (FEDER) and the REDES project (TIN2015-65136-C2-1-R) of the Spanish Government. Finally, the author Edwin Puertas thanks Universidad Tecnológica de Bolívar. Needless to say, we thank the organizing committee of PAN, especially Paolo Rosso, Francisco Rangel, Matti Wiegmann, and Martin Potthast, for their encouragement and kind support.

References

1. Berger, J.M., Morgan, J.: The ISIS Twitter census: Defining and describing the population of ISIS supporters on Twitter.
The Brookings Project on US Relations with the Islamic World 3(20), 4–1 (2015)
2. Cai, C., Li, L., Zeng, D.: Behavior enhanced deep bot detection in social media. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). pp. 128–130. IEEE (2017)
3. Clark, E.M., Williams, J.R., Jones, C.A., Galbraith, R.A., Danforth, C.M., Dodds, P.S.: Sifting robotic from organic text: a natural language approach for detecting automation on Twitter. Journal of Computational Science 16, 1–7 (2016)
4. Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
5. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: BotOrNot: A system to evaluate social bots. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 273–274. International World Wide Web Conferences Steering Committee (2016)
6. Dickerson, J.P., Kagan, V., Subrahmanian, V.: Using sentiment to detect bots on Twitter: Are humans more opinionated than bots? In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 620–627. IEEE Press (2014)
7. Ferrara, E., Varol, O., Davis, C., Menczer, F., Flammini, A.: The rise of social bots. Communications of the ACM 59(7), 96–104 (2016)
8. Krzywicki, A., Wobcke, W., Bain, M., Martinez, J.C., Compton, P.: Data mining for building knowledge bases: techniques, architectures and applications. The Knowledge Engineering Review 31(2), 97–123 (2016)
9.
Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
10. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 156–169. Springer (2016)
11. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
12. Ratkiewicz, J., Conover, M.D., Meiss, M., Gonçalves, B., Flammini, A., Menczer, F.: Detecting and tracking political abuse in social media. In: Fifth International AAAI Conference on Weblogs and Social Media (2011)
13. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot interactions: Detection, estimation, and characterization. In: Eleventh International AAAI Conference on Web and Social Media (2017)
14. Varol, O., Ferrara, E., Menczer, F., Flammini, A.: Early detection of promoted campaigns on social media. EPJ Data Science 6(1), 13 (2017)
15. Yang, K.C., Varol, O., Davis, C.A., Ferrara, E., Flammini, A., Menczer, F.: Arming the public with AI to counter social bots. arXiv preprint arXiv:1901.00912 (2019)