Bots and Gender Profiling using Masking Techniques
Notebook for PAN at CLEF 2019

Victor Jimenez-Villar1, Javier Sánchez-Junquera3, Manuel Montes-y-Gómez1, Luis Villaseñor-Pineda1,2, and Simone Paolo Ponzetto4

1 Instituto Nacional de Astrofísica, Óptica y Electrónica, México. {victor.jimenez,mmontesg,villasen}@inaoep.mx
2 Centre de Recherche en Linguistique Française GRAMMATICA (EA 4521), Université d'Artois, France.
3 PRHLT Research Center, Universitat Politècnica de València, Spain. jjsjunquera@gmail.com
4 Data and Web Science Group, University of Mannheim, Germany. simone@informatik.uni-mannheim.de

Abstract This work describes our proposed solution for the author profiling shared task at PAN 2019. The task consists in identifying whether the author of a Twitter feed is a bot or a human and, in the case of a human, in determining whether the author is male or female. As in previous years, the task considers different languages, in this case English and Spanish. Our proposal focuses on the preprocessing and feature extraction steps; we mainly apply masking techniques that emphasize the relevant terms by obfuscating the irrelevant ones while keeping information about the structure of the texts. Using this approach we obtained accuracies of 0.92 and 0.81 on the Spanish test set for classifying bots/humans and males/females, respectively; similarly, we obtained accuracies of 0.91 and 0.82 on the English dataset.

1 Introduction

The use of bots in social media, like Twitter or Facebook, has been constantly increasing. One of the main purposes of bots is to influence the opinion of people on a particular matter [22]; sometimes the use of bots might be considered unethical, as is the case when they are used in a political context [9]. Furthermore, bots are commonly related to fake news spreading and polarisation.
Therefore, approaching the identification of bots from an author profiling perspective is highly important from the point of view of marketing, forensics, and security [17]. The author profiling task consists in identifying specific traits of an author, such as gender or age, given a set of documents written by him/her. In this work, we focus on the preprocessing step, in which we mask irrelevant terms while maintaining the relevant ones, with the objective of preserving a combination of style, content words, and the structure of the documents.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

The rest of the paper is organized as follows. Section 2 presents a brief overview of the author profiling and bot detection tasks. Section 3 describes the proposed method. Section 4 describes the performed experiments and discusses the obtained results. Finally, Section 5 presents our conclusions.

2 Related Work

There are two main kinds of methods for bot detection: methods that use predictive features of individual users, and methods that consider the network structure to identify a coordinated group of bots [9]. A detailed overview of these methods is presented in [6]. In particular, the network-based approaches look for anomalous behaviour in network interactions [13,9,27]. They have the advantage of requiring less data about the users and of being able to detect multiple bots simultaneously. In contrast, the approaches based on predictive features rely only on textual information [2,7,1]. There are also some hybrid approaches, for example, the work presented in [5]. It employs a Random Forest classifier trained with examples of both human and bot behaviours, based on the Texas A&M dataset [10], which contains 15,000 examples of each class and millions of tweets.
This work uses six different kinds of features related to the network, user metadata, friends, temporal data, content, and sentiment. Nevertheless, all these features are not always available; therefore, in our work, we only make use of textual features.

Regarding gender profiling, a lot of significant work has been done. For example, in the previous edition of PAN@CLEF [18], several different approaches were evaluated, from traditional methods to deep learning techniques. Traditional methods employed content features such as word n-grams [4], and style features such as character n-grams [3] and counts of stop words, emojis, punctuation marks, etc. [14]. Deep learning approaches used word embeddings [20] as well as character n-gram embeddings [12], with a combination of different neural network architectures such as RNNs [23], CNNs, and bi-LSTMs [25]. Despite the use of more advanced techniques, traditional methods remain a good and competitive approach.

In this work, we use a unified approach for the two profiling tasks. It is based on a text distortion method proposed by Stamatatos to mask topic-related information [21]. Basically, it transforms the input texts into a more topic-neutral form while maintaining the structure of the documents, which is associated with the personal style of the authors.

3 Proposed Method

The proposed method is supported by the hypothesis that the general structure of a document, in conjunction with a small subset of its words (mostly comprising style features but including a few content terms), is enough to characterize the author of a Twitter feed. With the objective of better understanding the results of the proposed method, we treat the author profiling task as two binary classification problems. First, we identify whether the given author is a human or a bot. Then, in the case of a human, we determine whether the author is male or female.
Our method follows a classical text classification pipeline considering preprocessing, feature extraction and classification. In this work, we put special emphasis on the preprocessing step. The proposed method is described in the following subsections.

3.1 Preprocessing

This phase is the most relevant in our method; it consists of two main steps: normalization and masking. In the normalization step, with the help of the TweetTokenizer class from the Python library NLTK [11], we perform the following actions:

1. Join all tweets from each author into one string; tweets are separated by the END tag.
2. Replace line feed characters with NL.
3. Lowercase the characters.
4. Replace URLs with URL.
5. Replace user name mentions with USER.

It is important to mention that we maintain punctuation marks, emoticons and other special symbols.

The masking procedure is based on the work by Stamatatos [21]. Inspired by [8], he proposed a set of text distortion methods (DV-MA, DV-SA, DV-EX, DV-L2) in the context of authorship attribution. He found that the method DV-MA is better suited to texts such as blogs. The DV-MA method is defined as follows: Let Wk be the list of the k most frequent words of the language.

– Every word not included in Wk is masked by replacing each of its characters with an asterisk (*).
– Every digit in the text is replaced by the symbol #.

For the task of authorship attribution, masking all words except the k most frequent ones is a good idea because it is mostly a style-based task; however, for the author profiling task it is important to consider style as well as some content features. Based on this observation, we consider different criteria to select the k most relevant terms:

Document Frequency (DF): Defined as the number of documents in which the term ti occurs. We compute the document frequency for each unique term in the corpus vocabulary; the assumption is that with this measure we can obtain a combination of style-related and topic-oriented terms.
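As a rough illustration of the normalization and masking steps above, the following Python sketch joins an author's tweets, inserts the END/NL/URL/USER tags, and applies DV-MA-style masking. The regular expressions, the ordering of the replacements, and the toy keep_words set are simplifying assumptions, not the exact implementation (which also relies on NLTK's TweetTokenizer):

```python
import re

TAGS = {"END", "NL", "URL", "USER"}  # special tags inserted by normalization

def normalize(tweets):
    """Lowercase and join an author's tweets into one string,
    replacing line feeds, URLs and mentions with special tags."""
    text = " END ".join(t.lower() for t in tweets)   # tweets separated by END
    text = text.replace("\n", " NL ")                # line feeds -> NL
    text = re.sub(r"https?://\S+", "URL", text)      # URLs -> URL
    text = re.sub(r"@\w+", "USER", text)             # mentions -> USER
    return text

def mask_dv_ma(text, keep_words):
    """DV-MA-style masking: words outside keep_words become asterisks,
    digits become '#'; punctuation and special symbols are preserved."""
    def mask_token(match):
        tok = match.group(0)
        if tok in keep_words or tok in TAGS:
            return tok
        return "".join("#" if c.isdigit() else "*" for c in tok)
    return re.sub(r"\w+", mask_token, text)

doc = normalize(["Check this out https://t.co/abc", "@friend I won 3 times!"])
print(mask_dv_ma(doc, keep_words={"this", "i"}))
# -> ***** this *** URL END USER i *** # *****!
```

In the actual method, keep_words would be the k terms selected by one of the criteria below (DF, FCE or IG) rather than a hand-picked set.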
Frequently Co-occurring Entropy (FCE): This measure was proposed by [24] and used in [19] for domain adaptation in deception detection. The objective of this measure is to pick out generalizable features that occur frequently in both classes and have a similar occurrence probability (for example, in both humans and bots).

Information Gain (IG): It is commonly employed as a term selection criterion in classification tasks [26]. We compute the information gain of every term in the vocabulary and select as the k most relevant terms those with the greatest values, i.e., the subset of the most discriminative terms, which could include style and content elements.

3.2 Feature Extraction

In order to find the best representation for the two author profiling tasks, we evaluated different representations based on word and character n-grams. The best results were obtained with 3-5 character n-grams. In all experiments we used this representation with the following parameters: we eliminated all n-grams occurring fewer than 3 times, and we applied a tf × idf weighting with sub-linear term frequency (1 + log(tf)).

3.3 Classification

In all the experiments we used a Support Vector Machine (SVM) classifier with a linear kernel and default parameters according to scikit-learn's implementation [15].

4 Experiments and Results

As in previous years, the task's organizers provided a training corpus [17]. The corpus is composed of documents in English and Spanish, where each document contains the 100 tweets of one author. The statistics of this corpus are presented in Table 1.

In order to compare the proposed method, we implemented three baselines: i) a traditional BOW model containing all words, ii) a BOW model applying a feature selection procedure, and iii) a character n-gram model applying the preprocessing actions but without masking.
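The representation and classifier described above can be assembled as a standard scikit-learn pipeline. This is a minimal sketch, assuming the masked documents are already available and interpreting the "occurring fewer than 3 times" cutoff as a document-frequency threshold (min_df=3); the toy documents and labels are purely illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Character 3-5-grams, discarding rare n-grams, with sublinear
# term frequency (1 + log(tf)) and idf weighting, fed to a linear SVM
# with scikit-learn's default parameters.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                              min_df=3, sublinear_tf=True)),
    ("svm", SVC(kernel="linear")),
])

# Illustrative masked documents (asterisks stand for masked words).
docs = ["*** the *** URL END *** # ***"] * 5 + ["i *** you *** :) END ***"] * 5
labels = ["bot"] * 5 + ["human"] * 5

pipeline.fit(docs, labels)
print(pipeline.predict(["*** the *** URL END *** # ***"]))  # -> ['bot']
```

In practice the pipeline would be fit on the masked Twitter feeds, with the masking threshold k tuned separately for each task and language.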
Regarding the proposed method, we considered the different criteria to select the k most relevant terms: document frequency (DF), frequently co-occurring entropy (FCE), and information gain (IG). In all cases, to find the optimal k value, we performed an incremental search using k = 100, 200, ..., 3000.

4.1 Experimental Results

Table 2 shows the baseline results for the bots vs. humans profiling task. We can observe that the model using character n-grams as features had the best results, with an accuracy of 0.91 in Spanish and 0.93 in English. On the other hand, Table 3 shows the results of the proposed method (also with a character n-gram representation) applying different criteria to select the k most relevant words (the words that were not masked). In this table we can observe that masking all words (i.e., using k = 0) produced a very competitive accuracy, 0.78 in Spanish and 0.83 in English, suggesting the importance of the text structure for distinguishing bots from humans. In addition, we can observe that there are only minor differences among the results of the different selection criteria. Nonetheless, there is a slight improvement from 0.93 to 0.94 in English. Also, from the English results, we can deduce that the identification of bots and humans is mainly a stylistic task, since the best results were obtained when selecting only the one hundred most frequent terms.

Tables 4 and 5 summarize the results for the gender profiling task. For this task, the proposed method improved the baseline results. In both languages, the best results were obtained when using the DF criterion. It is important to notice that for this task it was necessary to keep more terms unmasked (2600 for Spanish and 2900 for English), which indicates that, in contrast to the bots vs. humans task, the differences between men and women are not only stylistic but also topic related.
Furthermore, masking all the words did not produce good results, which means that texts generated by men and women share many structural characteristics.

Table 1. Number of documents by language.

                        Training             Development
Author               Spanish  English      Spanish  English
Males                  520      720          230      310
Females                520      720          230      310
Bots                 1,040    1,440          460      620

Table 2. Accuracy results of different baselines in the classification of humans and bots. In the BOW DF (Document Frequency) model, the best k value found was selected as a threshold: k = 2900 for Spanish and k = 100 for English. Classification with the SVM algorithm.

                        Accuracy
Model                Spanish  English
BOW                   90.00    91.53
BOW DF                89.89    90.48
char n-grams (3-5)    91.41    93.46

Table 3. Accuracy results applying the masking technique with different feature selection methods, in the classification of humans and bots. The selected model was a combination of three- to five-character n-grams using tf-idf weighting. Classification with the SVM algorithm.

                Spanish             English
Selection     k     Accuracy      k     Accuracy
None          0      77.93        0      82.98
DF            2900   90.65        100    94.27
FCE           2400   90.72        1000   93.70
IG            1500   91.08        400    93.62

Table 4. Accuracy results of different baselines in the classification of males and females. In the BOW DF (Document Frequency) model, the best k value found was selected as a threshold: k = 2600 for Spanish and k = 2900 for English. Classification with the SVM algorithm.

                        Accuracy
Model                Spanish  English
BOW                   64.13    81.61
BOW DF                65.21    80.80
char n-grams (3-5)    66.30    80.96

Table 5. Accuracy results applying the masking technique with different feature selection methods, in the classification of males and females. The selected model was a combination of three- to five-character n-grams using tf-idf weighting. Classification with the SVM algorithm.
                Spanish             English
Selection     k     Accuracy      k     Accuracy
None          0      60.65        0      51.12
DF            2600   69.13        2900   81.77
FCE           2600   66.30        900    78.87
IG            2900   66.30        2600   80.96

4.2 Final Evaluation

In accordance with the previous results, for the final evaluation on the TIRA platform [16], we applied our method selecting and keeping the k terms with the greatest DF values, while the rest were masked. In particular, for the humans vs. bots profiling task we set k = 2900 for Spanish and k = 100 for English, whereas for the gender profiling task we set k = 2600 for Spanish and k = 2900 for English.

The obtained accuracy results were as follows: in Spanish, 0.92 and 0.81 for classifying bots vs. humans and males vs. females, respectively; in English, 0.91 and 0.82 for the two tasks. The results are very similar to those obtained on the training datasets, except for gender profiling in Spanish. We think this difference could be caused by the size of the training set in the final evaluation, where the training and development partitions used in the experimental configuration were unified.

5 Conclusions

Distinguishing between bots and humans turned out not to be a very complex task. A traditional representation based on character n-grams proved to be a very strong baseline. In particular, our experiments showed that the structure of the documents alone (i.e., when we masked all the words) provides important information for this classification task. In the case of distinguishing between men and women, our experiments showed a very different picture. The proposed approach improved the baseline results, suggesting that for this task it is important to have a combination of style- and topic-based features. The best results were obtained when keeping the 2600-2900 most frequent terms unmasked and masking the rest.
As a general observation, we can say that the proposed approach does not considerably improve the baseline results; despite this, the masking technique has the advantage of normalizing the documents to a more neutral form and, therefore, of reducing the risk of overfitting.

Acknowledgments. This work was partially supported by CONACYT-Mexico under the scholarship 868585 and project grant CB-2015-01-257383.

References

1. Benevenuto, F., Magno, G., Rodrigues, T., Almeida, V.: Detecting spammers on Twitter. In: Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS). vol. 6, p. 12 (2010)
2. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing 9(6), 811–824 (2012)
3. Daneshvar, S., Inkpen, D.: Gender identification in Twitter using n-grams and LSA: Notebook for PAN at CLEF 2018. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125, pp. 1–10 (2018)
4. von Daniken, P., Grubenmann, R., Cieliebak, M.: Word unigram weighing for author profiling at PAN 2018. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125. CEUR-WS.org (2018)
5. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: BotOrNot: A system to evaluate social bots. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 273–274. International World Wide Web Conferences Steering Committee (2016)
6. Ferrara, E., Varol, O., Davis, C., Menczer, F., Flammini, A.: The rise of social bots. Communications of the ACM 59(7), 96–104 (2016)
7. Gianvecchio, S., Xie, M., Wu, Z., Wang, H.: Humans and bots in Internet chat: Measurement, analysis, and automated classification. IEEE/ACM Transactions on Networking 19(5), 1557–1571 (2011)
8.
Granados, A., Cebrian, M., Camacho, D., de Borja Rodriguez, F.: Reducing the loss of information through annealing text distortion. IEEE Transactions on Knowledge and Data Engineering 23(7), 1090–1102 (2010)
9. Hjouji, Z.e., Hunter, D.S., Mesnards, N.G.d., Zaman, T.: The impact of bots on opinions in social networks. arXiv preprint arXiv:1810.12398 (2018)
10. Lee, K., Eoff, B.D., Caverlee, J.: Seven months with the devils: A long-term study of content polluters on Twitter. In: Fifth International AAAI Conference on Weblogs and Social Media (2011)
11. Loper, E., Bird, S.: NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028 (2002)
12. López-Santillán, R., Gonzalez-Gurrola, L., Ramírez-Alonso, G.: Custom document embeddings via the centroids method: Gender classification in an author profiling task. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125. CEUR-WS.org (2018)
13. Mesnards, N.G.d., Zaman, T.: Detecting influence campaigns in social networks using the Ising model. arXiv preprint arXiv:1805.10244 (2018)
14. Patra, B.G., Das, K.G., Das, D.: Multimodal author profiling for Arabic, English, and Spanish. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125. CEUR-WS.org (2018)
15. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
16. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
17. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling.
In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
18. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th Author Profiling Task at PAN 2018: Multimodal gender identification in Twitter. CEUR Workshop Proceedings 2125 (2018)
19. Sánchez Junquera, J.J.: Adaptación de Dominio para la Detección Automática de Textos Engañosos [Domain Adaptation for Automatic Deception Detection]. Master's thesis, Instituto Nacional de Astrofísica, Óptica y Electrónica (2018)
20. Schaetti, N.: Character-based convolutional neural network and ResNet18 for Twitter author profiling. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125. CEUR-WS.org (2018)
21. Stamatatos, E.: Masking topic-related information to enhance authorship attribution. Journal of the Association for Information Science and Technology 69(3), 461–473 (2018)
22. Subrahmanian, V., Azaria, A., Durst, S., Kagan, V., Galstyan, A., Lerman, K., Zhu, L., Ferrara, E., Flammini, A., Menczer, F.: The DARPA Twitter bot challenge. Computer 49(6), 38–46 (2016)
23. Takahashi, T., Tahara, T., Nagatani, K., Miura, Y., Taniguchi, T., Ohkuma, T.: Text and image synergy with feature cross technique for gender identification. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125. CEUR-WS.org (2018)
24. Tan, S., Cheng, X., Wang, Y., Xu, H.: Adapting Naive Bayes to domain adaptation for sentiment analysis. In: European Conference on Information Retrieval. pp. 337–349. Springer (2009)
25. Veenhoven, R., Snijders, S., van der Hall, D., van Noord, R.: Using translated data to improve deep learning author profiling models. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). vol. 2125. CEUR-WS.org (2018)
26. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML. vol. 97, p. 35 (1997)
27.
Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B.Y., Dai, Y.: Uncovering social network sybils in the wild. ACM Transactions on Knowledge Discovery from Data (TKDD) 8(1), 2 (2014)