Bot and Gender Detection of Twitter Accounts Using Distortion and LSA
Notebook for PAN at CLEF 2019

Andrea Bacciu, Massimo La Morgia, Alessandro Mei, Eugenio Nerio Nemmi, Valerio Neri, and Julinda Stefa

Department of Computer Science, Sapienza University of Rome, Italy
{lamorgia, mei, nemmi, stefa}@di.uniroma1.it
bacciu.1747105@studenti.uniroma1.it, neri.1754516@studenti.uniroma1.it

Abstract. In this work, we present our approach for the Author Profiling task of PAN 2019. The task is divided into two sub-problems, bot detection and gender detection, for two different languages: English and Spanish. We address each instance of the problem and each language differently. We use an ensemble architecture to solve bot detection for accounts that write in English and a single SVM for those that write in Spanish. For gender detection, we use a single SVM architecture for both languages, but we pre-process the tweets in different ways. Our final models achieve accuracy over 90% in the bot detection task and, in the gender detection task, 84.17% and 77.61% for the English and Spanish languages, respectively.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

With the advent of social networks and online platforms, the ability to automatically profile the author of a message is more crucial than ever. For example, being able to profile users who write about a specific topic can provide useful insights for targeting advertisement campaigns. In forensic investigations, a detailed profile of an anonymous user can speed up the investigative process. Moreover, profiling an author also includes the ability to understand whether the author of a set of messages is a human or a bot. On social networks, and especially on Twitter, the presence of bots is heavy. Usually, bots have the task of drumming up the attention of human users on a specific event but, unfortunately, they are often used to spread misleading information and fake news.

In this paper, we present our approach for the Author Profiling task of PAN 2019, which focuses on bot and gender detection of Twitter accounts. We start by introducing the problem, the dataset provided by PAN, and the evaluation framework used in this task. In Section 3, we describe the features selected to build our model for bot detection, the architecture, how single features perform on the task, and the final evaluation. In Section 4, we address the problem of gender detection; here too, we describe all the phases we went through to build the final model. Finally, in Section 5, we make a brief survey of the works that inspired our methodology, and Section 6 concludes the paper with final considerations about the work and possible improvements.

2 Problem

In this section, we describe the problem addressed in this work. In more detail, we start by defining the Author Profiling competition, then we report on the data and on the evaluation framework used by the organizers of the task to evaluate the participants' classifiers.

2.1 PAN Task

PAN is a series of scientific events and shared tasks on digital text forensics and stylometry. This year marks the 19th edition of the event [11] and the 7th edition of the author profiling task [34].
In each edition, the organizers propose several challenges related to forensic analysis and provide the guidelines and the datasets needed to accomplish the tasks. Participant teams can compete in one or all of them. At the end of the challenges, all participants are invited to submit a notebook in which they explain the methodology and the ideas developed for the solution of the task.

2.2 Author Profiling Task

The Author Profiling Task 2019 [34] aims to identify the nature of Twitter accounts, detecting whether the writer is a bot or a human being and, in the latter case, the gender of the account owner. The task is proposed in two different languages: English and Spanish. Participants can address just one problem, bot or gender detection, just one language, or the complete challenge.

2.3 The Dataset

The dataset consists of two sets of Twitter accounts, one for the English language and the other for Spanish. Each account in the dataset is a collection of 100 tweets. Both datasets are already split into Train and Dev partitions. The complete dataset also contains two Test partitions, but they are not public: both their size and the number of instances per class are unknown to the participants. The Test sets can be used only in the remote evaluation phases. The public part of the dataset, instead, comes with the ground truth: each account is described by two columns, where the first defines the nature of the account and the second the gender. To maintain a realistic scenario, no cleaning operation was performed on the tweets by the organizers. In this way, participants can tell whether a message is a tweet or a retweet; on the other hand, there is no guarantee that all the tweets of the same user are in the same language. All the links in the dataset appear, as usual on Twitter, in the short format (https://t.co/id). Hence, since no internet connection is available during the evaluation phase, the links can neither be resolved nor used to extract contextual information such as the hostname or the page content.

The English part of the dataset contains 4120 accounts in total, of which 2880 belong to the Train set and 1240 to the Dev set. The Spanish part has 3000 users, with 2080 accounts in the Train partition and the remaining users in the Dev set. In all the partitions, users are evenly divided between bots and humans, and humans are in turn split in half between men and women. As we can see in Table 1, the length of the tweets varies a lot: we have tweets made of a single character and tweets longer than 900 characters.

Table 1. Dataset composition

Partition   Users   Max chars   Min chars
En Train    2880    933         1
En Dev      1240    646         1
Es Train    2080    932         1
Es Dev       920    876         1

Evaluation Framework. As previously said, PAN does not disclose the Test dataset to the participants, in order to simulate a real-case scenario. So, it is possible to evaluate the performance of the developed solution on the Test sets only through TIRA. TIRA [17, 32], the Testbed for Information Retrieval Algorithms, is a platform that focuses on hosting shared tasks and on facilitating the submission of software. PAN provides all participants with a Virtual Private Server on TIRA, where they can deploy their model and run it on preset datasets. For the author profiling task of 2019, two Test sets were released on TIRA.
The first Test set was used during the pre-evaluation phase; at this stage, it was possible to execute multiple runs on the Test set and to request their evaluation from the moderators. For the second, final Test set, instead, it was possible to request the evaluation of just one run. The performance of a run is computed with the accuracy metric. Finally, it is important to note that the evaluation of the gender is based on the output of the bot detection task: only the accounts classified as human by the bot detector are used to compute the accuracy of the gender classifier.

3 Bot Detection

Our classifiers for the bot detection problem rely on the AdaBoost meta-estimator for the English language and on a single SVM architecture for Spanish. In the following, we describe step by step how we built our models.

3.1 Features

In this section, we describe the features that we use for the bot detection classifier. For each account, we compute the following features:

– Emojis: The average number of emojis used in each tweet.
– Web link: The average number of links shared in each tweet.
– Hashtag: The average number of hashtags used.
– Len of Tweets: The average length of the tweets.
– Len of ReTweets: The average length of the retweets.
– Semicolons: The average number of semicolons used.
– Cosine Similarity score: For all the tweets belonging to the same user, we weight each word with Tf-Idf. Then, for each pair of tweets, we compute the cosine similarity. Finally, we take the average of the scores as the feature. The idea is that the cosine similarity of bots is higher than that of humans, since the messages belonging to the same bot tend to be very similar to each other.
– Sentiment Analysis: We perform sentiment analysis on each tweet, then we use as features the average neutral sentiment score and the average compound score. To perform the sentiment analysis, we use the Vader Sentiment Analysis tool [19]. For each analyzed sentence, it provides a positive, a negative, a neutral, and a compound score, where the compound score is a single uni-dimensional measure that sums up the sentiment of the sentence.
– Text Distortion: We use the function of [39] to distort the text. The distortion technique is a pre-processing method that consists in masking part of the text before the feature extraction. The distortion is used to emphasize the use of special characters and punctuation: the function replaces every alphanumeric character with * and leaves the other characters unchanged. An example of distorted text is shown in Table 2. So, we first concatenate all the tweets belonging to the same user. Then, we replace every emoticon with the tag ::EMOJI::; in this way, we normalize all the emoticons and, at the same time, still obtain a recognizable pattern after the text distortion (::*****::). Finally, we apply the distortion and extract from the text the 1000 most frequent char-grams of length from 2 up to 8 that appear at least 5 times among all the tweets of the account. The selected char-grams are then weighted with Tf-Idf, in which the term frequency is logarithmically scaled.

Table 3 reports the final dimension of each feature and the total dimension of our feature set.

Table 2. Text distortion

Original Text:  RT @BIBLE_Retweet: Great men are not always wise - Job 32:9
Distorted Text: ** @*****_*******: ***** *** *** *** ****** **** - *** **:*

Original Text:  I don't know. Just making conversation with you, Morty. What do you think, I-I-I... know everything about everything?
Distorted Text: * ***'* ****. **** ****** ************ **** ***, *****. **** ** *** *****, *-*-*... **** ********** ***** **********?
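As an illustration, the following is a minimal sketch of how the distortion and char-gram extraction described above could be implemented with scikit-learn. It is not the exact code of our submission: the `accounts` variable, the emoji pattern, and the use of min_df as a stand-in for the per-account frequency threshold are our assumptions here.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical emoji pattern: a real implementation would cover the full
# emoji ranges (e.g., via the `emoji` package).
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def distort(text: str) -> str:
    """Normalize emoticons to ::EMOJI:: and mask every alphanumeric
    character with '*', keeping punctuation and special characters."""
    text = EMOJI_RE.sub("::EMOJI::", text)
    return re.sub(r"[A-Za-z0-9]", "*", text)

# One document per account: the concatenation of its distorted tweets.
# `accounts` (a list of tweet lists) is assumed to be loaded elsewhere.
docs = [distort(" ".join(tweets)) for tweets in accounts]

# The 1000 most frequent char-grams of length 2-8, with log-scaled term
# frequency; min_df here approximates the "at least 5 occurrences" cut-off.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 8),
                             max_features=1000, min_df=5, sublinear_tf=True)
X_distortion = vectorizer.fit_transform(docs)
```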
Table 3. Dimension of features

Feature type              En Dimension   Es Dimension
Emoji                     1              1
Web link                  1              1
Hashtag                   1              —
Len of Tweets             1              1
Len of ReTweets           1              1
Semicolons                1              1
Cosine Similarity Score   1              1
SA: Neutral               1              1
SA: Compound              1              1
Distortion Text Tf-Idf    1000           1000
Total                     1009           1008

3.2 Features Reduction

Once all the features are extracted, we use PCA [18, 28], Principal Component Analysis, to reduce the dimensionality of the feature space with a minimum loss of information. In particular, we use the PCA implementation of sklearn. For the English bot classifier, we reduce the feature dimension from 1009 to 56, while for the Spanish bot classifier from 1008 to 46. In both cases, we set the whiten parameter of PCA to True.

3.3 Bot Classifiers

We built two different classifiers, one to detect the English bots and one for the Spanish ones. Figure 1 shows the architecture of the English bot classifier. As we can see, the classifier consists of two layers. In the first layer, we have a single SVM [9] with an RBF kernel and an AdaBoost instance. AdaBoost is a meta-estimator [15] that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, where the weights of incorrectly classified instances are adjusted so that subsequent classifiers focus more on difficult cases. We use AdaBoost with 30 estimators and SVM as base estimator, with a learning rate of 0.65 and the SAMME.R algorithm. In the second layer, we have a Soft-Voting Classifier that ensembles the predictions of the previous layer. To ensemble the predictions, we sum up, for each class, the probability of being the right one as predicted by our classifiers; then we pick as final prediction the class with the highest value.

[Figure 1. English bot classifier architecture: the features feed both an AdaBoost ensemble of 30 SVMs and a single SVM; a soft-voting classifier combines their predictions.]

For the Spanish bot classification, we use just a single SVM since, after some experiments, we found that in this case it performs better than the ensemble architecture used for the English bot detection. In our experiments, the SVM has a Radial Basis Function kernel, with the hyper-parameters left at their default values.

3.4 Evaluation

To evaluate our models, in accordance with the evaluation metric of the task, we use accuracy. As a first experiment, we evaluate the extracted features one by one, with the same architecture described above. Table 4 shows the results. As we can see, the features perform similarly across the two languages. Of particular relevance are the Distortion feature, which alone achieves an accuracy of 90.48% for the English language, and the Web link feature, which reaches more than 79%. The compound feature shows a peculiar behavior: on the English language it is very close to random guessing, while for the bot detection in Spanish it is around 68%.

Table 4. Feature performance

Feature type              En Score   Es Score
Emoji                     71.05%     70.65%
Web link                  79.19%     87.93%
Hashtag                   76.53%     —
Len of Tweets             59.59%     53.36%
Len of ReTweets           86.12%     81.41%
Semicolons                64.27%     55.21%
Cosine Similarity Score   60.00%     64.67%
SA: Neutral               61.53%     66.19%
SA: Compound              51.37%     68.36%
Distortion Text tf-idf    90.48%     81.73%
All without PCA           77.90%     76.73%

Table 5 shows the results of our final model on the three evaluation datasets provided by PAN. As we can see, our model performs better for the English language than for the Spanish one.

Table 5. Bot detection evaluation results

Test    English   Spanish
Dev     95.56%    92.06%
Test1   93.18%    88.89%
Test2   94.32%    90.78%
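To make the architecture of Sections 3.2 and 3.3 concrete, here is a minimal sketch of the English bot-detection pipeline, written against the scikit-learn API of the time; `X_train`, `y_train`, `X_dev`, and `y_dev` are assumed to hold the 1009-dimensional feature matrices and labels.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# First layer: AdaBoost over 30 SVMs (SAMME.R needs probability estimates)
# plus a single RBF SVM. Note: in scikit-learn >= 1.2 the parameter is
# named `estimator` instead of `base_estimator`.
adaboost = AdaBoostClassifier(base_estimator=SVC(kernel="rbf", probability=True),
                              n_estimators=30, learning_rate=0.65,
                              algorithm="SAMME.R")
svm = SVC(kernel="rbf", probability=True)

# Second layer: soft voting sums the predicted class probabilities of the
# two members and picks the class with the highest total.
ensemble = VotingClassifier(estimators=[("ada", adaboost), ("svm", svm)],
                            voting="soft")

# Whitened PCA (1009 -> 56 dimensions) feeding the ensemble.
model = make_pipeline(PCA(n_components=56, whiten=True), ensemble)
model.fit(X_train, y_train)        # X_train, y_train assumed loaded
print(model.score(X_dev, y_dev))   # accuracy, the task metric
```

Boosting SVMs with probability estimates is computationally heavy; the soft-voting layer only requires that both members expose class probabilities.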
4 Gender Profiling

In this section, we describe our approach to gender detection. For this task, we use a single SVM architecture for both languages.

4.1 Pre-Processing

Before extracting the features from the text, we pre-process the tweets differently depending on the language. For both languages, we process all the tweets belonging to the same account as a single document.

English: For the English accounts, we start the pre-processing phase by tokenizing the text. To accomplish this operation, we use the TweetTokenizer of the NLTK Python package [4]. We chose this tokenizer since it was designed exactly for web content: it handles ASCII emoticons out of the box, and it replaces character sequences of length greater than 3 with sequences of length 3. After this step, we stem the tokenized text using the Snowball Stemmer [31] from the NLTK library. Stemming is a Natural Language Processing technique that reduces an inflected word to its base form. For example, given the list of words ['denied', 'died', 'agreed', 'owned'], the Snowball Stemmer produces as output the words ['deni', 'die', 'agre', 'own'].

Spanish: For the Spanish tweets as well, we start the pre-processing with the tokenization of the text. After this stage, we lemmatize the tweets instead of stemming them as done for English. Lemmatization, like stemming, generates the root form of an inflected word; but while the stemmer operates on a single word without knowledge of the context, a lemmatizer performs a full morphological analysis to identify the lemma of each word accurately. For instance, consider the word "meeting", which can be either the base form of a noun or a form of the verb "to meet": the lemmatizer tries to understand from the context of the sentence which form is the right one and lemmatizes the word accordingly. To lemmatize the tweets, we use the es_core_news_sm model of the SpaCy framework (https://spacy.io).

4.2 Features Extraction and Features Reduction

After the pre-processing phase, we move to the feature extraction. Here, we use the same kind of feature, word n-grams, but with different settings for English and Spanish. For the English gender detection, we select the 90,000 most frequent word n-grams of length from 1 up to 5 that have a document frequency higher than 4. Then, we weight the n-grams with Tf-Idf, in which the term frequency is logarithmically scaled. In the Spanish case, we use word n-grams of length from 1 up to 8 with a document frequency of at least 7, weight them with Tf-Idf, and select the 50,000 most frequent n-grams.

Once we have extracted all the features, we apply Latent Semantic Analysis to reduce their dimensionality; the effectiveness of a low-dimensionality representation in the Author Profiling task is shown in [33]. Latent Semantic Analysis (LSA) [14] is a statistical approach to extracting relations among words by means of their contexts of use in documents. It makes no use of natural language processing techniques for analyzing morphological, syntactic, or semantic relations, nor does it use humanly constructed resources. Starting from the Tf-Idf matrix, LSA applies a reduced-rank singular value decomposition (SVD) to reduce the number of rows while preserving the similarity structure among the columns. After this step, each account, for both languages, is represented by 11 features.
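The following is a minimal sketch of the English branch of this pipeline (tokenization, stemming, word n-gram Tf-Idf, and LSA via truncated SVD). The `docs` variable, holding one concatenated document per account, is assumed; the Spanish branch would swap the stemmer for spaCy's es_core_news_sm lemmatizer and use the Spanish settings above.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = TweetTokenizer(reduce_len=True)  # caps character runs at length 3
stemmer = SnowballStemmer("english")

def tokenize_and_stem(document):
    """Tokenize a document with the web-aware tokenizer, then stem."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(document)]

# English settings: word n-grams of length 1-5, document frequency > 4,
# the 90,000 most frequent, with log-scaled term frequency.
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, ngram_range=(1, 5),
                             min_df=5, max_features=90000, sublinear_tf=True)
tfidf = vectorizer.fit_transform(docs)  # `docs` assumed loaded

# LSA: reduced-rank SVD of the Tf-Idf matrix, 11 components per account.
lsa = TruncatedSVD(n_components=11)
X_gender = lsa.fit_transform(tfidf)
```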
4.3 Classifier and Evaluation

In our experiments, we try different classifiers: Logistic Regression, Random Forest, and SVM. To evaluate them, we measure their performance using only the humans in the Dev set; in this way, we avoid the bias introduced by the bots misclassified by the bot detector. Unfortunately, we cannot repeat the same experiment on the Test sets, since they are not public. Table 6 shows the performance of the aforementioned classifiers. As we can see, the best result is achieved, for both languages, by a single SVM with a radial basis function kernel. We tune the hyper-parameters of the classifiers through grid search. We find that the best configuration for our models is a penalty coefficient of 8192 for the English model and of 4096 for the Spanish model, with the remaining hyper-parameters left at their default values.

Table 6. Evaluation results of gender detection on the humans of the Dev set

Classifier            English   Spanish
Logistic Regression   75.16%    64.43%
Random Forest         72.58%    63.91%
SVM                   85.48%    71.30%

Table 7 shows the final results of our classifiers. In this last case, we performed a full evaluation according to the PAN guidelines, so we evaluate all the users that the bot detector classified as human.

Table 7. Final results of gender detection

Test    English   Spanish
Dev     80.42%    63.68%
Test1   80.68%    69.44%
Test2   84.17%    77.61%

Investigating the misclassified English-speaking accounts of the Dev set, we noted that most of them are humans that we erroneously classify as bots. Looking inside their messages, we found that most of them have a strange posting behavior: for example, they share almost the same message multiple times, or the majority of their tweets are bible verses.

5 Related Work

Bot Detection. Since 2006, the year Twitter was released to the public, it has caught the attention of the research community from both a sociological and a technical point of view. Among these works, some focus on detecting the nature of the accounts, such as fake followers [10], bots, spammers, and advertising accounts [40], with different approaches. Morstatter et al. [26] clustered the approaches by the kind of features used in the detection, outlining three categories: the first is characterized by detection algorithms that exploit the content of the messages [45]; the second comprises those that exploit the information contained in the profile description or account details, like the email address used for the sign-up or the device used for tweeting [41, 42]; the last takes into account the network structure and the connections of the accounts under investigation [24]. Chu et al. [8] were the first to address the problem of bot detection on Twitter. They focused on the classification of Twitter accounts into three categories: human, bot, and cyborg, i.e., hybrid accounts managed by a human assisted by software and vice versa. They noticed that humans have complex timing behaviors in terms of posting, while bots and cyborgs post at regular times. Moreover, they found that bot posts are very often spam messages and that bots share more links than humans.
In their classification, they achieve an average accuracy of 96% with a Bayesian classifier on a balanced dataset of 6,000 samples. Chavoshi et al. [6] showed that it is possible to detect automated accounts by exploiting only the time series of their posts, achieving with their methodology an accuracy of 94%. Varol et al. [44] estimated that between 9% and 15% of active Twitter accounts are bots. For the estimation, they used a Random Forest classifier and 1150 different kinds of features belonging to the three categories described above; their classifier can separate the classes with an Area Under the Curve of 0.94. Finally, Kouloumpis et al. [21] were the first to introduce sentiment analysis in the task of bot detection on Twitter. In [3, 13], the authors exploited sentiment analysis, in conjunction with other kinds of features, to analyze the presence of bots in the tweets related to the U.S. and Indian elections, respectively.

Gender identification. The problem of automatically profiling the author of a text is crucial in many application scenarios, such as forensic, security, and commercial settings [2]. The goal of the task is to infer as much information as possible about an unknown author. The profile traits most studied in the literature are age, gender, personality traits, and native language. In the following, we focus on the works that address the gender identification problem. Linguists were the first researchers to face the problem of gender identification [22, 23, 43]. Pennebaker et al. [30] address the problem from a psychological point of view; they conclude that the stylistic differences between women and men are consistent with a sociological framework of gender differences in access to power. In the field of automatic gender identification, several approaches were proposed on different datasets. Cheng et al. [7] explore the problem on the Reuters and Enron corpora using three different classifiers: SVM, decision tree, and logistic regression. In [1, 5, 12, 25], the authors deal with different Twitter datasets built by themselves. Also, the participants of the PAN author profiling tasks of 2015 [38], 2017, and 2018 [36] address the problem on a Twitter dataset provided by the task organizers. Finally, datasets built from heterogeneous social networks were used in [2, 35, 37].

From the point of view of the features, we can divide the features used in these works into three macro-categories. The first is syntactic features, which capture the writing style of the authors: for example, in [7, 27] the authors found that women and men have different habits in the use of punctuation, with women tending to use more question marks than men. The second is function words, words that carry little lexical meaning and express grammatical relationships among the other words within a sentence: in [16, 20, 29], the authors noticed that women tend to use 'I', 'me', and 'my' more frequently than men, and that women use intensive adverbs and positive adjectives more than men. Finally, there are char and word n-grams, which are highly adopted in every stylometric task.

6 Conclusion and Future Work

In this work, we presented our approach to the Author Profiling task of PAN 2019. We developed four different classifiers, one for each sub-problem. For the bot detection of English-written messages, we used an ensemble architecture in which the AdaBoost outputs are combined by a soft-voting classifier. For each of the other problems, we use a single fine-tuned SVM.
We achieve excellent performance, especially in the bot detection task, where we record a score of about 95% on both the English Dev set and the English final Test set. Regarding gender detection, our model achieves an accuracy of 85.48% on the English Dev set and of 71.30% on the Spanish one when there is no bias introduced by the bot detection. If we consider the full pipeline of the task, which means evaluating the gender on all the accounts classified as human by the bot classifiers, our model achieves an accuracy of 80.42% and 63.68% on the English and Spanish datasets, respectively. Globally, our models perform better on the English accounts than on the Spanish ones, so we believe that more work is needed to fill this gap. Finally, we believe that the performance of the classifiers could be even better if the model could take into account at least the hostname of the links found in the dataset. Moreover, in a real-case scenario, the profile description of the account, the username, and the profile image could also help to boost the performance, as shown in the previous editions.

Acknowledgment

This work was supported in part by the MIUR under grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of Sapienza University.

References

1. Alowibdi, J.S., Buy, U.A., Yu, P.: Empirical evaluation of profile characteristics for gender classification on Twitter. In: 2013 12th International Conference on Machine Learning and Applications. vol. 1, pp. 365-369. IEEE (2013)
2. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Commun. ACM 52(2), 119-123 (2009)
3. Badawy, A., Ferrara, E., Lerman, K.: Analyzing the digital traces of political manipulation: The 2016 Russian interference Twitter campaign. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). pp. 258-265. IEEE (2018)
4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc. (2009)
5. Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1301-1309. Association for Computational Linguistics (2011)
6. Chavoshi, N., Hamooni, H., Mueen, A.: DeBot: Twitter bot detection via warped correlation. In: ICDM. pp. 817-822 (2016)
7. Cheng, N., Chandramouli, R., Subbalakshmi, K.: Author gender identification from text. Digital Investigation 8(1), 78-88 (2011)
8. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing 9(6), 811-824 (2012)
9. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273-297 (1995)
10. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M.: Fame for sale: Efficient detection of fake Twitter followers. Decision Support Systems 80, 56-71 (2015)
11. Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
12. Deitrick, W., Miller, Z., Valyou, B., Dickinson, B., Munson, T., Hu, W.: Gender identification on Twitter using the modified balanced winnow. Communications and Network 4(3), 189-195 (2012)
13. Dickerson, J.P., Kagan, V., Subrahmanian, V.: Using sentiment to detect bots on Twitter: Are humans more opinionated than bots? In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 620-627. IEEE Press (2014)
14. Dumais, S.T.: Latent semantic analysis. Annual Review of Information Science and Technology 38(1), 188-230 (2004)
15. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119-139 (1997)
16. Gleser, G.C., Gottschalk, L.A., John, W.: The relationship of sex and intelligence to choice of words: A normative study of verbal behavior. Journal of Clinical Psychology 15(2), 182-191 (1959)
17. Gollub, T., Stein, B., Burrows, S., Hoppe, D.: TIRA: Configuring, executing, and disseminating information retrieval experiments. In: 2012 23rd International Workshop on Database and Expert Systems Applications. pp. 151-155. IEEE (2012)
18. Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24(6), 417 (1933)
19. Hutto, C.J., Gilbert, E.: VADER: A parsimonious rule-based model for sentiment analysis of social media text. In: Eighth International AAAI Conference on Weblogs and Social Media (2014)
20. Jaffe, J.M., Lee, Y., Huang, L., Oshagan, H.: Gender, pseudonyms, and CMC: Masking identities and baring souls. In: 45th Annual Conference of the International Communication Association, Albuquerque, New Mexico (1995)
21. Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: The good the bad and the OMG! In: Fifth International AAAI Conference on Weblogs and Social Media (2011)
22. Labov, W.: The intersection of sex and social class in the course of linguistic change. Language Variation and Change 2(2), 205-254 (1990)
23. Lakoff, R.: Language and woman's place. Language in Society 2(1), 45-79 (1973)
24. Lee, K., Eoff, B.D., Caverlee, J.: Seven months with the devils: A long-term study of content polluters on Twitter. In: Fifth International AAAI Conference on Weblogs and Social Media (2011)
25. Liu, W., Ruths, D.: What's in a name? Using first names as features for gender inference in Twitter. In: 2013 AAAI Spring Symposium Series (2013)
26. Morstatter, F., Wu, L., Nazer, T.H., Carley, K.M., Liu, H.: A new approach to bot detection: Striking the balance between precision and recall. In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). pp. 533-540. IEEE (2016)
27. Mulac, A.: The gender-linked language effect: Do language differences really make a difference? Lawrence Erlbaum Associates Publishers (2006)
28. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559-572 (1901)
29. Pennebaker, J.W., King, L.A.: Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology 77(6), 1296 (1999)
30. Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology 54(1), 547-577 (2003)
31. Porter, M.F.: Snowball: A language for stemming algorithms (2001)
32. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA integrated research architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF. Springer (2019)
33. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 156-169. Springer (2016)
34. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and gender profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
35. Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK. pp. 1-30 (2014)
36. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter. Working Notes Papers of the CLEF (2018)
37. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings. pp. 750-784 (2016)
38. Rangel Pardo, F.M., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Evaluation Labs and Workshop Working Notes Papers. pp. 1-8 (2015)
39. Stamatatos, E.: Authorship attribution using text distortion. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pp. 1138-1149 (2017)
40. Stringhini, G., Egele, M., Kruegel, C., Vigna, G.: Poultry markets: On the underground economy of Twitter followers. ACM SIGCOMM Computer Communication Review 42(4), 527-532 (2012)
41. Thomas, K., Grier, C., Paxson, V.: Adapting social spam infrastructure for political censorship. In: 5th USENIX Workshop on Large-Scale Exploits and Emergent Threats (2012)
42. Thonnard, O., Dacier, M.: A strategic analysis of spam botnets operations. In: Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference. pp. 162-171. ACM (2011)
43. Trudgill, P.: Sex, covert prestige and linguistic change in the urban British English of Norwich. Language in Society 1(2), 179-195 (1972)
44. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot interactions: Detection, estimation, and characterization. In: Eleventh International AAAI Conference on Web and Social Media (2017)
45. Wang, A.H.: Detecting spam bots in online social networking sites: A machine learning approach. In: IFIP Annual Conference on Data and Applications Security and Privacy. pp. 335-342. Springer (2010)