Twitter User Profiling: Bot and Gender Identification
Notebook for PAN at CLEF 2019

Dijana Kosmajac and Vlado Keselj
Dalhousie University
dijana.kosmajac@dal.ca, vlado@dnlp.ca

Abstract. We use a set of feature extraction and transformation methods in conjunction with ensemble classifiers for the PAN Author Profiling task. For the bot identification subtask we use a user behaviour fingerprint and statistical diversity measures, while for the gender identification subtask we use a set of text statistics, as well as syntactic information and raw words.

1 Introduction

An automated user (bot) is a program that mimics a real person's behavior on social media. A bot can operate based on a simple set of behavioral instructions, such as tweeting, retweeting, "liking" posts, or following other users. In general, there are two types of bots based on their purpose: non-malicious and malicious. Non-malicious bots are transparent, with no intent of mimicking real Twitter users. Often, they share motivational quotes or images, tweet news headlines and other useful information, or help companies respond to users. Malicious bots, on the other hand, may generate spam, try to access private account information, trick users into following them or subscribing to scams, suppress or enhance political opinions, create trending hashtags for financial gain, support political candidates during elections [2], or create offensive material to troll users. Additionally, some influencers may use bots to boost their audience size.

We explore bot and gender identification techniques on the PAN 2019 [5] Author Profiling task [19]. We apply a set of feature extraction methods to describe how diverse a user's behaviour is over an extended period of time and whether the style of writing differs between the two genders. The systems were hosted and evaluated on TIRA [18], a web service that aims to facilitate software submissions and evaluations for shared tasks.

The rest of the paper is organized as follows. Related work is discussed in Section 2. Section 3 gives insights into the datasets. Section 4 describes the set of features used for user profiling, for both the gender and the bot identification tasks; in particular, Section 4.1 describes the method we use to extract and encode features in the form of a digital fingerprint. Section 5 is dedicated to experiments and results. Finally, in Section 6 we give the conclusions and briefly discuss future work.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 Related Work

One of the most prominent tasks in recent social media analysis is the detection of automated user accounts (bots). Research on this topic is very active [16,28,10], because bots pose a big threat if they are intentionally steered to target important events across the globe, such as political elections [2,27,14,12,23,13]. Messias et al. [16] explore strategies by which a bot can interact with real users to increase its influence, and show that a simple strategy can trick influence scoring systems. BotOrNot [6] is an openly accessible machine learning system for bot detection, available as an API; its authors [6,27] show that the system is accurate in detecting social bots. Shu et al. [21] explore methods for fake news detection on social media, which is closely related to the problem of automated accounts.
They state that detecting fake news from content alone generally does not give good results, and suggest using users' social interactions as auxiliary information to improve detection. Ferrara et al. [8] use an extensive set of features (tweet timing, tweet interaction network, content, language, sentiment) to detect online campaigning as early as possible. Another recent work on bot detection, by Cresci et al. [3], is based on DNA-inspired fingerprinting of temporal user behaviour. They define a vocabulary B^n, where n is the dimension, whose elements are labels for tweets, so that user activity is represented as a sequence of tweet labels. They found that bots share longer common substrings (LCSs) than regular users, and the point where the LCS differs most is used as a cut-off value to separate bots from genuine users. The framework by Ahmed et al. [1] for bot detection uses the Euclidean distance between feature vectors to build a similarity graph of the accounts; once the graph is built, they apply clustering and community detection algorithms to identify groups of similar accounts. The bot problem on social media platforms has inspired many competitions and evaluation campaigns, such as DARPA [24] and PAN¹.

When it comes to gender and age user profiling, advances in natural language processing technology have facilitated prediction in several text genres using automatic analysis of the variation of linguistic characteristics. However, social media texts pose several limitations. First, only a small amount of meta information about the users' gender, age, social class, race, geographical location, etc., is available to researchers. Second, communication in online social networks typically occurs in the form of very short messages, often containing non-standard language usage, which makes this type of text a challenging genre for natural language processing. Finally, given the speed at which chat language has originated globally and continues to develop, especially among young people, a third challenge in automatically detecting false profiles on social networks is the constant retraining of the machine learning algorithms in order to learn new variations of chat language. Many researchers have tried to solve some of these challenges [20,17,11,25,4].

¹ https://pan.webis.de/publications.html

3 Dataset

The dataset provided by the organizers is divided into two parts: English and Spanish. The English dataset consists of training and development subsets, with 2,880 and 1,240 samples, respectively. The Spanish dataset is slightly smaller and consists of training and development subsets, with 2,080 and 920 samples, respectively. Each sample is a user timeline in chronological order, with 100 messages per user. Fig. 1 and Fig. 2 show the datasets visualized using t-SNE [15], an enhanced method based on stochastic neighbour embedding. The features used for both visualizations are the ones used for the classifiers in the final submitted run (Experiment 2(4) for bots, and Experiment 5 for gender).

Figure 1. Bot t-SNE visualization. (a) English, (b) Spanish
Figure 2. Gender t-SNE visualization. (a) English, (b) Spanish
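Such a projection is straightforward to reproduce. The following is a minimal sketch using scikit-learn's t-SNE implementation; the random feature matrix X and labels y are hypothetical stand-ins for the actual per-user feature vectors described in Section 4, used only to show the mechanics.

```python
# Sketch: 2-D t-SNE projection of user feature vectors (stand-in data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.random((200, 20))          # stand-in: one feature vector per user
y = rng.integers(0, 2, size=200)   # stand-in: bot (1) / human (0) labels

emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

for label, name in ((0, "human"), (1, "bot")):
    pts = emb[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE projection of user feature vectors")
plt.show()
```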
4 Feature Engineering

In this section we describe the features used for the experiments.

4.1 Bot Identification

User Behaviour Fingerprint DNA sequences have been exploited in different areas such as forensics, anthropology, and bio-medical science. Cresci et al. [3] used the idea of DNA coding to describe social media user behaviour in the temporal dimension. The same idea is used in this study, with a slightly modified way of coding. We define a set of codes A_n with length n = 6. The meaning of each code is given in (1).

A_n = { 0 (plain), 8 (retweet), 16 (reply), 1 (has hashtags), 2 (has mentions), 4 (has URLs) }    (1)

The vocabulary, given the code set A_n, consists of 3 · 2^3 = 24 unique characters. Each character, which describes one tweet, is constructed by adding up the codes for the tweet's features: the first three codes describe the type of the tweet (plain, retweet, or reply), and the remaining three describe its content. For example, if a tweet is neither a retweet nor a reply, it is plain (code = 0). If the tweet contains hashtags, then code = code + 1; if the same tweet also contains URLs, then code = code + 4, so the final tweet code is 5. We transform the code into a character label using ASCII table character indexes: ASCII_tbl[65 + 5] = 'F'. The length of the resulting sequence equals the number of encoded tweets; the sequence, in our case, is simply the user timeline, that is, the user's actions in chronological order with the appropriate character encoding. An example of a user fingerprint generated from a timeline looks like:

fp_user = (ACBCASSCCAFFADADFAFASCB...)
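The encoding takes only a few lines of code. The sketch below is a minimal illustration, assuming each tweet has already been parsed into boolean attributes (the dictionary keys are hypothetical names, not the PAN data format); it reproduces the worked example above, where a plain tweet with hashtags and URLs receives code 5 and label 'F'.

```python
# Sketch of the user-fingerprint encoding described above.
def tweet_code(tweet):
    # Tweet type: plain = 0, retweet = 8, reply = 16.
    if tweet["is_retweet"]:
        code = 8
    elif tweet["is_reply"]:
        code = 16
    else:
        code = 0
    # Content flags: hashtags +1, mentions +2, URLs +4.
    if tweet["has_hashtags"]:
        code += 1
    if tweet["has_mentions"]:
        code += 2
    if tweet["has_urls"]:
        code += 4
    # Map the code (0..23) onto 'A'..'X' via the ASCII offset 65.
    return chr(65 + code)

def fingerprint(timeline):
    # The timeline is assumed to be in chronological order.
    return "".join(tweet_code(t) for t in timeline)

# Plain tweet with hashtags and URLs: 0 + 1 + 4 = 5 -> 'F'.
example = {"is_retweet": False, "is_reply": False,
           "has_hashtags": True, "has_mentions": False, "has_urls": True}
assert tweet_code(example) == "F"
```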
Fingerprint segmentation using the n-gram technique To calculate the data statistics, we extracted n-grams of different lengths (lengths 1-3 appeared to work best). Fig. 3 shows an example of 3-gram extraction from a sample user fingerprint. The n-gram segments are used to calculate richness and diversity measures, which seem to unveil the difference between genuine user and bot online behaviour.

Figure 3. 3-gram extraction example from user fingerprint.

Statistical Measures for Text Richness and Diversity Statistical measures for diversity have a long history and a wide area of application [26]. A constancy measure for a natural language text is a computational measure that converges to a value for a certain amount of text and remains invariant for any larger size. Because such a measure exhibits the same value for any size of text larger than a certain amount, its value can be considered a text characteristic. The common notation is: N is the total number of words in a text, V(N) is the number of distinct words, V(m, N) is the number of words appearing m times in the text, and m_max is the largest frequency of a word.

Yule's K Index Yule's original intention for K was the author attribution task, under the assumption that it would differ for texts written by different authors:

K = C \frac{S_2 - S_1}{S_1^2} = C \left[ -\frac{1}{N} + \sum_{m=1}^{m_{\max}} V(m, N) \left( \frac{m}{N} \right)^2 \right]

where S_1 = \sum_m m V(m, N) = N and S_2 = \sum_m m^2 V(m, N). C is a constant originally determined by Yule, set to 10^4.

Shannon's H Index Shannon's diversity index (H) is a measure commonly used to characterize species diversity in a community; it accounts for both the abundance and the evenness of the species present. The proportion of species i relative to the total number of species (p_i) is multiplied by the natural logarithm of this proportion (ln(p_i)); the resulting products are summed across species and multiplied by -1:

H = -\sum_{i=1}^{V(N)} p_i \ln(p_i)

where V(N) is the number of distinct species.

Simpson's D Index Simpson's diversity index (D) is a mathematical measure that characterizes species diversity in a community. The proportion of species i relative to the total number of species (p_i) is squared, the squared proportions are summed over all species, and the reciprocal is taken:

D = \frac{1}{\sum_{i=1}^{V(N)} p_i^2}

Honoré's R Statistic Honoré (1979) proposed a measure which assumes that the ratio of hapax legomena V(1, N) is constant with respect to the logarithm of the text size:

R = \frac{100 \log(N)}{1 - V(1, N)/V(N)}

Sichel's S Statistic Sichel [22] observed that the ratio of hapax dis legomena V(2, N) to the vocabulary size is roughly constant across a wide range of sample sizes:

S = \frac{V(2, N)}{V(N)}

We use this measure to express the constancy of n-gram hapax dis legomena (the number of n-grams that occur exactly twice), which we show to be distinct for genuine and bot accounts.

Figure 4. Diversity measures density per dataset, per user type. (a) English – top row, (b) Spanish – bottom row

In Fig. 4 we compare the density plots of all measures for bot accounts versus genuine users. The diversity measures visibly differ between bots and genuine users, and we exploit this characteristic to build a good classifier with as few features as possible.
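As an illustration, the sketch below computes all five measures from the frequency spectrum of a fingerprint's character n-grams. The function and variable names are ours, not the authors'; note that Honoré's R is undefined when every n-gram is a hapax legomenon (the denominator becomes zero), which is why the demo uses bigrams of the sample fingerprint.

```python
# Sketch: richness/diversity measures over fingerprint n-grams.
import math
from collections import Counter

def ngrams(fp, n):
    # Overlapping character n-grams of the fingerprint string.
    return [fp[i:i + n] for i in range(len(fp) - n + 1)]

def diversity_measures(tokens, C=10_000):
    freqs = Counter(tokens)                # token -> frequency
    N = sum(freqs.values())                # total number of tokens
    V = len(freqs)                         # number of distinct tokens, V(N)
    spectrum = Counter(freqs.values())     # m -> V(m, N)

    S1 = sum(m * Vm for m, Vm in spectrum.items())     # equals N
    S2 = sum(m * m * Vm for m, Vm in spectrum.items())
    yule_k = C * (S2 - S1) / (S1 ** 2)

    probs = [f / N for f in freqs.values()]
    shannon_h = -sum(p * math.log(p) for p in probs)
    simpson_d = 1.0 / sum(p * p for p in probs)
    honore_r = 100 * math.log(N) / (1 - spectrum.get(1, 0) / V)
    sichel_s = spectrum.get(2, 0) / V
    return yule_k, shannon_h, simpson_d, honore_r, sichel_s

print(diversity_measures(ngrams("ACBCASSCCAFFADADFAFASCB", 2)))
```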
4.2 Gender Identification

The feature types used for this task can be split into four categories.

Character and Word Features We used simple text metrics, such as the total number of characters, the total number of words, the number of characters/words per message, the number of special characters, and the number of digits.

PoS Tags Features Using the spaCy² Python library, we extracted word unigrams and bigrams, as well as PoS tag bigrams.

Emoji Features We counted the number of emojis, with a fine-grained distinction between different types of emojis. To distinguish the categories of emojis we used the latest standard at the time of the experiments³.

Text Readability Measures In 1948, Flesch [9] developed a formula that is considered one of the oldest and most accurate text readability formulas:

R_{Flesch} = 206.835 - 84.6 \cdot \frac{n_{syllables}}{n_{words}} - 1.015 \cdot \frac{n_{words}}{n_{sentences}}

The equivalent for the Spanish language was developed a few years later by Fernández Huerta [7]:

R_{Huerta} = 206.84 - 60 \cdot \frac{n_{syllables}}{n_{words}} - 102 \cdot \frac{n_{sentences}}{n_{words}}

² https://spacy.io/
³ https://unicode.org/Public/emoji/12.0/emoji-test.txt

5 Experiments and Results

5.1 Bot Identification

For the bot identification subtask we conducted four experiments with five different classifiers (Gradient Boosting, Random Forest, SVM, Logistic Regression, K Nearest Neighbours). The experiments focus on testing the improvement gained from increasing the training data, as well as on feature set generalization, comparing raw fingerprint n-grams against statistical diversity measures.

Experiment 1 In Experiment 1 we used character n-grams of the user fingerprint described in Section 4.1, combining n-gram lengths 2, 3 and 4. Several classifiers achieve fairly similar results (Table 1, column E1). The best classifier is Random Forest for both languages. In this experiment we used the training subsets for English and Spanish separately.

Experiment 2 In Experiment 2 we used the diversity measures calculated on character n-grams of the user fingerprint described in Section 4.1, combining n-gram lengths 1, 2 and 3 (Table 1, column E2). The best classifier is again Random Forest for both languages. In this experiment we used the training subsets for English and Spanish separately.

Table 1. Bot classification. Results tested on the development dataset; per-language training datasets. (* not available due to memory restrictions.)

                            E1                             E2
Dataset  Classifier  Precision  Recall   F1        Precision  Recall   F1
English  GB          0.9197     0.9153   0.9151    0.9263     0.9234   0.9233
         SVM         0.9174     0.9161   0.9161    0.9253     0.9242   0.9241
         LR          0.8840     0.8750   0.8743    0.9261     0.9242   0.9241
         KNN         -*         -*       -*        0.9284     0.9258   0.9257
         RF          0.9284     0.9218   0.9215    0.9293     0.9266   0.9265
Spanish  GB          0.8666     0.8663   0.8663    0.8429     0.8391   0.8387
         SVM         0.8602     0.8598   0.8597    0.8164     0.8163   0.8163
         LR          0.8663     0.8663   0.8663    0.8510     0.8478   0.8475
         KNN         -*         -*       -*        0.8617     0.8587   0.8584
         RF          0.9115     0.9033   0.9028    0.8503     0.8489   0.8488

Experiment 3 In Experiment 3 (Table 2, column E3) we used the same features as in Experiment 1, but with the English and Spanish training subsets combined. Because the features are language independent, we merged the training data into one set and tested on both languages, so the final model is the same for both subsets. The best classifier is the Gradient Boosting ensemble for both languages.

Experiment 4 In Experiment 4 (Table 2, column E4) we used the same features as in Experiment 2. The best classifier for English is the Gradient Boosting ensemble, and K Nearest Neighbours for Spanish. As in Experiment 3, we combined the training datasets into one and tested on both languages. Although better performance was obtained with raw features on models trained separately per language (Random Forest, Table 1), we opted for the Gradient Boosting ensemble trained on the combined dataset (the Spanish portion dropped slightly in performance). The classifier from Experiment 4 was used for the official ranking; a minimal sketch of this setup is given at the end of this section.

Table 2. Bot classification. Results tested on the development dataset; combined training dataset. († used as the final classifier, E4 for the official ranking; * not available due to memory restrictions.)

                            E3                             E4
Dataset  Classifier  Precision  Recall   F1        Precision  Recall   F1
English  GB†         0.9252     0.9242   0.9241    0.9330     0.9306   0.9305
         SVM         0.9094     0.9081   0.9080    0.9199     0.9177   0.9176
         LR          0.9121     0.9113   0.9112    0.9214     0.9202   0.9201
         KNN         -*         -*       -*        0.9256     0.9242   0.9241
         RF          0.9189     0.9153   0.9151    0.9256     0.9242   0.9241
Spanish  GB†         0.8896     0.8880   0.8879    0.8512     0.8424   0.8414
         SVM         0.8588     0.8587   0.8587    0.8490     0.8435   0.8429
         LR          0.8478     0.8478   0.8478    0.8473     0.8446   0.8443
         KNN         -*         -*       -*        0.8586     0.8543   0.8539
         RF          0.8764     0.8696   0.8690    0.8498     0.8435   0.8428

5.2 Gender Identification

For the gender identification subtask we used the same set of classifiers as for bot detection. The results in Table 3 show that the Gradient Boosting classifier performed best for both languages. This task was language dependent, so each language had its own model.

Table 3. Gender classification. Results tested on the development dataset. (†, ‡ used as the final classifiers.)

Dataset  Classifier  Precision  Recall   F1
English  GB†         0.8167     0.8129   0.8123
         SVM         0.7782     0.7774   0.7773
         LR          0.7630     0.7629   0.7629
         KNN         0.6054     0.6048   0.6043
         RF          0.7926     0.7919   0.7918
Spanish  GB‡         0.7062     0.7000   0.6977
         SVM         0.6592     0.6587   0.6584
         LR          0.6418     0.6413   0.6410
         KNN         0.5851     0.5848   0.5845
         RF          0.6568     0.6543   0.6530

5.3 Results on Test Data

The official results are shown in Table 4. Bot detection for English performed similarly to our experiments on the development set, while Spanish performed better; a similar improvement was obtained on the Spanish dataset for gender identification. The models for the final evaluation were trained on both the training and the development sets.

Table 4. Final results on the test dataset, averaged per language.

Dataset  Bot     Gender
English  0.9216  0.7928
Spanish  0.8956  0.7494
Average  0.9086  0.7711
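To make the submitted configuration concrete, the following self-contained sketch mirrors the Experiment 4 setup referenced above: language-independent diversity features computed over behaviour fingerprints, fed to a Gradient Boosting ensemble. Everything here is an illustrative stand-in, not the exact submitted system: the fingerprints are synthetic (the "bots" draw characters from a deliberately narrow alphabet, i.e., less diverse behaviour), and only two of the five measures, over bigrams, are used for brevity.

```python
# Sketch: diversity features + Gradient Boosting, in the spirit of Experiment 4.
import math
import random
from collections import Counter

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support

ALPHABET = [chr(65 + c) for c in range(24)]   # the 24 fingerprint characters

def features(fp):
    # Shannon's H and Sichel's S over fingerprint bigrams (reduced feature set).
    grams = Counter(fp[i:i + 2] for i in range(len(fp) - 1))
    N, V = sum(grams.values()), len(grams)
    probs = [f / N for f in grams.values()]
    shannon_h = -sum(p * math.log(p) for p in probs)
    sichel_s = Counter(grams.values()).get(2, 0) / V
    return [shannon_h, sichel_s]

rng = random.Random(0)
# Synthetic 100-tweet timelines: bots use a narrow alphabet, humans the full one.
bots = ["".join(rng.choices(ALPHABET[:3], k=100)) for _ in range(100)]
humans = ["".join(rng.choices(ALPHABET, k=100)) for _ in range(100)]
X = np.array([features(fp) for fp in bots + humans])
y = np.array([1] * 100 + [0] * 100)

# Even/odd split for train/test, then weighted precision/recall/F1.
clf = GradientBoostingClassifier(random_state=42).fit(X[::2], y[::2])
pred = clf.predict(X[1::2])
print(precision_recall_fscore_support(y[1::2], pred, average="weighted"))
```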
6 Conclusion

We show that automated accounts exhibit less diverse behaviour than genuine user accounts, and that statistical diversity measures can help detect automated behaviour without diving into language-specific analyses. For the gender identification task we used a standard set of features common in stylometric analysis, with the addition of emoji features at a more granular level.

References

1. Ahmed, F., Abulaish, M.: A generic statistical approach for spam detection in online social networks. Computer Communications 36(10-11), 1120–1129 (2013)
2. Bessi, A., Ferrara, E.: Social bots distort the 2016 US presidential election online discussion. First Monday 21(11) (2016)
3. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M.: DNA-inspired online behavioral modeling and its application to spambot detection. IEEE Intelligent Systems 31(5), 58–64 (2016)
4. Dadvar, M., Jong, F.d., Ordelman, R., Trieschnigg, D.: Improved cyberbullying detection using gender information. In: Proceedings of the Twelfth Dutch-Belgian Information Retrieval Workshop (DIR 2012). University of Ghent (2012)
5. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
6. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: BotOrNot: A system to evaluate social bots. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 273–274. International World Wide Web Conferences Steering Committee (2016)
7. Fernández Huerta, J.: Medidas sencillas de lecturabilidad. Consigna 214, 29–32 (1959)
8. Ferrara, E., Varol, O., Menczer, F., Flammini, A.: Detection of promoted social media campaigns. In: Tenth International AAAI Conference on Web and Social Media (2016)
9. Flesch, R., Gould, A.J.: The Art of Readable Writing, vol. 8. Harper, New York (1949)
10. Gilani, Z., Wang, L., Crowcroft, J., Almeida, M., Farahbakhsh, R.: Stweeler: A framework for Twitter bot analysis. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 37–38. International World Wide Web Conferences Steering Committee (2016)
11. Goswami, S., Sarkar, S., Rustagi, M.: Stylometric analysis of bloggers' age and gender. In: Third International AAAI Conference on Weblogs and Social Media (2009)
12. Guess, A., Nagler, J., Tucker, J.: Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances 5(1), eaau4586 (2019)
13. Hjouji, Z.e., Hunter, D.S., Mesnards, N.G.d., Zaman, T.: The impact of bots on opinions in social networks. arXiv preprint arXiv:1810.12398 (2018)
14. Howard, P.N., Woolley, S., Calo, R.: Algorithms, bots, and political communication in the US 2016 election: The challenge of automated political communication for election law and administration. Journal of Information Technology & Politics 15(2), 81–93 (2018), https://doi.org/10.1080/19331681.2018.1448735
15. Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
16. Messias, J., Schmidt, L., Oliveira, R., Benevenuto, F.: You followed my bot! Transforming robots into influential users in Twitter. First Monday 18(7) (2013)
17. Peersman, C., Daelemans, W., Van Vaerenbergh, L.: Predicting age and gender in online social networks. In: Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents. pp. 37–44. ACM (2011)
18. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
19. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
20. Sarawgi, R., Gajulapalli, K., Choi, Y.: Gender attribution: tracing stylometric evidence beyond topic and genre. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning. pp. 78–86. Association for Computational Linguistics (2011)
21. Shu, K., Wang, S., Liu, H.: Understanding user profiles on social media for fake news detection. In: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 430–435. IEEE (2018)
22. Sichel, H.S.: On a distribution law for word frequencies. Journal of the American Statistical Association 70(351a), 542–547 (1975), https://doi.org/10.1080/01621459.1975.10482469
23. Stella, M., Ferrara, E., De Domenico, M.: Bots increase exposure to negative and inflammatory content in online social systems. Proceedings of the National Academy of Sciences 115(49), 12435–12440 (2018)
24. Subrahmanian, V., Azaria, A., Durst, S., Kagan, V., Galstyan, A., Lerman, K., Zhu, L., Ferrara, E., Flammini, A., Menczer, F.: The DARPA Twitter bot challenge. Computer 49(6), 38–46 (2016)
25. Thelwall, M., Wilkinson, D., Uppal, S.: Data mining emotion in social network communication: Gender differences in MySpace. Journal of the American Society for Information Science and Technology 61(1), 190–199 (2010)
26. Tweedie, F.J., Baayen, R.H.: How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 32(5), 323–352 (1998)
27. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot interactions: Detection, estimation, and characterization. In: Eleventh International AAAI Conference on Web and Social Media (2017)
28. Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B.Y., Dai, Y.: Uncovering social network sybils in the wild. ACM Transactions on Knowledge Discovery from Data (TKDD) 8(1), 2 (2014)