Bots and Gender Profiling Using a Deep Learning Approach
Notebook for PAN at CLEF 2019

Jose R. Prieto Fontcuberta and Gretel Liz De la Peña Sarracén*

Universitat Politècnica de València
{joprfon,gredela}@posgrado.upv.es

Abstract

This paper describes the system we developed for the Bots and Gender Profiling task at PAN @ CLEF 2019. Given a set of tweets, the task is to determine automatically whether their author is a bot or a human and, in the case of a human, to identify the author's gender. We propose a deep learning system fed with the TFIDF representation of the texts instead of the more usual word embedding representation. Additionally, we use some linguistic features which, according to our experimental results, improve the performance of the system.

1 Introduction

Nowadays we consume a great deal of social media content, and social media have become a powerful tool for communicating with the world. Some companies use bot accounts as social media managers to answer their clients quickly, automatically and at no cost. However, governments, individuals or powerful institutions sometimes abuse these social networks and create bots to manipulate and distort information and the points of view of some users [2], [3]. A bot can be defined as a program that learns to promote information as a normal user would, but automatically, and that can be programmed with software specifically designed for manipulation on certain topics. Hence, identifying bots in social networks is a relevant task, not only from a marketing point of view but also from a forensics and security perspective.

Among the efforts made to address this phenomenon, this year the Bots and Gender Profiling task [9] has been launched as part of PAN @ CLEF 2019¹ [1]. The task focuses on distinguishing bots from human users given a set of texts from Twitter, one of the most widely used social networks. The tweets are in English and Spanish.
In recent related work such as [4], deep learning techniques developed for text classification are also applied to this task. More recent techniques are described and used in [5], which also employs word embeddings, dense layers and LSTMs. In general, many works point out the flexibility of deep learning techniques in capturing nonlinear relationships.

In this paper, we propose a Feedforward Neural Network with two layers for the bot detection task. As a second step, the system identifies the gender of the author when the author is human. For this second subtask we use a similar architecture, in this case with four layers. A point to highlight is the use of some linguistic features, which can help to discriminate among types of users.

The paper is organized as follows. Section 2 describes our system. Experimental results are discussed in Section 3. Finally, Section 4 presents our conclusions with a summary of our findings.

* The authors contributed equally to this paper.
¹ https://pan.webis.de/clef19/pan19-web/

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 System

2.1 Preprocessing

In the preprocessing step the text is cleaned. First, we identify elements that are typical of tweets and that are unlikely to carry discriminative semantic information: URLs, numbers, user mentions and dates. Then, each part of the text that represents one of these elements is replaced with a corresponding tag.

2.2 Method

We propose a model that consists of a Feedforward Neural Network (FFNN) with two layers, batch normalization (BatchNorm) and ReLU activation. As input, the model takes a vector generated as the TFIDF representation of the concatenation of all the tweets of a user. In this way, the information of a user is represented in a single vector.
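As an illustration of the pipeline described in Sections 2.1 and 2.2, the following Python sketch replaces tweet-specific elements with tags, concatenates the tweets of each user, and builds one TFIDF vector per user with the weighting of Equation (1). It is only a sketch under stated assumptions: the tag names, the regular expressions, the whitespace tokenization and the function names (`preprocess`, `user_documents`, `tfidf_vectors`) are hypothetical and are not taken from our implementation.

```python
import math
import re
from collections import Counter

# Hypothetical patterns and tag names (assumptions, not the exact ones of the system).
# Order matters: URLs and mentions first, dates before bare numbers.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"@\w+"), "<user>"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "<date>"),
    (re.compile(r"\b\d+(?:[.,]\d+)*\b"), "<number>"),
]


def preprocess(tweet):
    """Replace URLs, mentions, dates and numbers with a corresponding tag."""
    for pattern, tag in PATTERNS:
        tweet = pattern.sub(tag, tweet)
    return tweet


def user_documents(tweets_by_user):
    """Concatenate the preprocessed tweets of each user into one token list."""
    return {
        user: " ".join(preprocess(t) for t in tweets).split()
        for user, tweets in tweets_by_user.items()
    }


def tfidf_vectors(docs):
    """Return the vocabulary and one TFIDF vector per user (Equation 1)."""
    n_docs = len(docs)  # |S|: here, one text per user (an assumption)
    # tf(t, S): number of user texts in which term t appears
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vocab = sorted(doc_freq)
    vectors = {}
    for user, tokens in docs.items():
        counts = Counter(tokens)  # tf(t, T): raw count of t in this user's text
        vectors[user] = [
            counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab
        ]
    return vocab, vectors


# Toy data, invented for illustration only.
tweets_by_user = {
    "u1": ["Check https://t.co/abc now", "Thanks @bob for 3 tips"],
    "u2": ["Thanks @alice see you on 12/05/2019"],
}
docs = user_documents(tweets_by_user)
vocab, vectors = tfidf_vectors(docs)
# Terms used by every user get weight 0, because log(|S| / |S|) = 0.
```

Each resulting vector has one entry per vocabulary term and could serve as the input of the FFNN. A library such as scikit-learn's `TfidfVectorizer` could be used instead, but its default weighting applies smoothing and normalization and therefore differs from Equation (1).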
Given a set of texts S, a term t, and an individual text T ∈ S, we calculate:

$$\mathrm{TFIDF}(t) = \mathit{tf}(t, T) \cdot \log\left(\frac{|S|}{\mathit{tf}(t, S)}\right) \tag{1}$$

where tf(t, T) is the number of times t appears in T, |S| is the size of the corpus, and tf(t, S) is the number of texts in S in which t appears.

This representation was selected based on the idea that words used frequently by one user and not commonly by others are more important in the representation of that user. This idea matches the phenomenon we try to capture: this kind of representation allows us to obtain characteristics typical of each user.

We also tried encoding every tweet separately with TFIDF and then concatenating the resulting vectors, which yields something similar to a grayscale image. For this representation we used a 2D Convolutional Neural Network with large kernels whose size matches the number of features extracted from the TFIDF representation (2DCNN-TFIDF). These large kernels are computationally expensive even on a GPU, which makes training extremely slow. We also tried a Recurrent Neural Network with a few LSTM layers (LSTM-TFIDF). In both experiments the results were not satisfactory.

Similarly, we tried other approaches with a different kind of representation, based on GloVe word embeddings. We concatenate all the tweets and then look up each word in the embedding table. As before, for this representation we used a 2D Convolutional Neural Network with large kernels whose size matches the number of features extracted from the embeddings (2DCNN-WE) and a Recurrent Neural Network with a few LSTM layers (LSTM-WE). Again, in both experiments the results were not satisfactory.

Finally, we tried concatenating the tweets of a given user along the depth dimension, obtaining an image whose number of channels (depth) equals the number of tweets of the user. Each matrix is the concatenation of the word embeddings of a tweet.
Therefore, the width and height are determined by the size of the embeddings and the number of words in the tweet, respectively. For this approach we used a 3D Convolutional Neural Network with larger kernels (3DCNN-WE). The idea is to make the kernels as wide as the embedding size, while in the second dimension we choose an n-gram size in order to capture context.

2.3 Linguistic Features

We include some linguistic features that we consider important to discriminate among users. For the implementation we used the TextBlob library², which can be used to process texts in English. Hence, these features were employed only with the English corpus. Two types of features were analyzed: features related to subjective information and features related to syntactic information:

– Subjective information (SI): the degree of subjectivity and the sentiment present in the text. This can be a good discriminative feature, since bots may be less subjective and sentimental than humans.
– Syntactic features (SF): the counts of adjectives, adverbs and pronouns in the texts. These features can discriminate between males and females, as some studies suggest [6].

3 Results

In this section, we report and discuss the performance of the system in the task. Training and evaluation were conducted on the datasets provided by PAN 2019, which contain 4120 and 3000 authors (each represented by a set of tweets) in English and Spanish, respectively. The data is balanced for each subtask and language. The results were obtained by uploading the system to TIRA [7].

The results obtained on the development set with each of the approaches described above are reported in Table 1. As we can see, the best results are achieved with the FFNN method. Hence it was the system selected for the task at PAN 2019. The other models, which were expected to be superior, obtained worse results.
This is perhaps because the large number of parameters they introduce could not be trained well with the small amount of data available.

² https://textblob.readthedocs.io/en/dev/

As we can see in Table 1, the best results reach an accuracy of 0.90 and 0.93 on the Spanish and English datasets, respectively. On the Spanish partition we do not use linguistic features, whereas on the English dataset we use the SI and SF features.

| Method | FFNN | 2DCNN-TFIDF | LSTM-TFIDF | 2DCNN-WE | LSTM-WE | 3DCNN-WE |
|---|---|---|---|---|---|---|
| Accuracy ES | 0.90 | 0.76 | 0.73 | 0.70 | 0.69 | 0.72 |
| Accuracy EN | 0.93 | 0.78 | 0.75 | 0.72 | 0.70 | 0.73 |

Table 1. Results of the different approaches on the development dataset

In Table 2 we can see the improvement of the FFNN method when the linguistic features (LF) are added. As mentioned before, we used them only for the English corpus. The system gains one point of accuracy when the SI features are added, but adding the syntactic features makes no further difference.

| Method | Without LF | With SI | With SI + SF |
|---|---|---|---|
| Accuracy EN | 0.92 | 0.93 | 0.93 |

Table 2. Results in the Bots vs Humans task with FFNN

Table 3 shows the accuracy of gender profiling for the authors predicted to be human. We achieve an accuracy of 0.87 on the English corpus, and the results do not vary when the linguistic features are added. Hence, according to our experimental results, these features are not important for this task. On the Spanish corpus, an accuracy of 0.86 is achieved without any linguistic feature.

| Method | Without LF | With SI | With SI + SF |
|---|---|---|---|
| Accuracy EN | 0.87 | 0.87 | 0.87 |
| Accuracy ES | 0.86 | - | - |

Table 3. Results in the Gender task with FFNN

Table 4 shows the results on the test datasets for our final model and the baselines proposed by the organizers of the task. We outperform the baselines in almost all cases. The exception is the English bot task, where the LDSE method [8] obtains better results.
| Task | Language | MAJORITY | RANDOM | LDSE | Our proposal |
|---|---|---|---|---|---|
| Bots | EN | 0.5000 | 0.4905 | 0.9054 | 0.9045 |
| Bots | ES | 0.5000 | 0.4861 | 0.8372 | 0.8578 |
| Gender | EN | 0.5000 | 0.3716 | 0.7800 | 0.7898 |
| Gender | ES | 0.5000 | 0.3700 | 0.6900 | 0.6967 |

Table 4. Final results in both tasks on the test datasets

4 Conclusion

We propose a deep learning system for the Bots and Gender Profiling task at PAN @ CLEF 2019. The model consists of a Feedforward Neural Network that takes as input the TFIDF representation of the text. The experimental results show the suitability of this representation for the task: we achieve an accuracy of 0.8578 on the Spanish corpus and 0.9045 on the English corpus when detecting bots versus humans. For gender profiling we obtain an accuracy of 0.6967 and 0.7898, respectively. In addition, some linguistic features are added, which allows for a small improvement in the bot versus human discrimination task, but not in gender profiling. Furthermore, we tried word embeddings and several different architectures with these features, but the results were not satisfactory. The results of the CNN models might be improved with more data for retraining the word embeddings for this specific task.

References

1. Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
2. Ferrara, E.: Disinformation and Social Bot Operations in The Run Up to The 2017 French Presidential Election. arXiv preprint arXiv:1707.00086 22(8), 1–33 (2017)
3. Forelle, M., Howard, P., Monroy-Hernández, A., Savage, S.: Political Bots and The Manipulation of Public Opinion in Venezuela. CoRR abs/1507.07109 (2015), http://arxiv.org/abs/1507.07109
4. John, V.: A Survey of Neural Network Techniques for Feature Extraction from Text. arXiv preprint arXiv:1704.08531 (2017)
5. Kudugunta, S., Ferrara, E.: Deep Neural Networks for Bot Detection. Information Sciences 467, 312–322 (2018)
6. Nerbonne, J.: The Secret Life of Pronouns. What Our Words Say About Us. Literary and Linguistic Computing 29(1), 139–142 (2014)
7. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
8. Rangel, F., Franco-Salvador, M., Rosso, P.: A Low Dimensionality Representation for Language Variety Identification. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 156–169. Springer (2016)
9. Rangel, F., Rosso, P.: Overview of The 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)