Using N-grams to detect Bots on Twitter
Notebook for PAN at CLEF 2019

Juan Pizarro [0000-0001-9598-1929]
Universitat Politècnica de València
jpizarrom@gmail.com

Abstract. This article describes our participation in the Bots and Gender Profiling shared task at PAN at CLEF 2019. We propose a Support Vector Machine classifier with character and word n-gram features. Our model achieved the best average performance, 88.05%, at the 7th International Competition on Author Profiling. In the task of determining whether the author of a set of tweets is a bot or a human, it obtained an accuracy of 93.60% for English and 93.33% for Spanish. In the task of gender identification, it obtained an accuracy of 83.56% for English and 81.72% for Spanish.

Keywords: Author Profiling · Gender Identification · Bot Identification · Twitter · Spanish · English.

1 Introduction

Nowadays we communicate and interact through social media platforms on a daily basis: we use them as a source of information, as a commercial or marketing channel, to buy products online and even to speak with our bank representatives. Social networks therefore have a great influence on our lifestyle and affect the way decisions are made. It is known that social media bots (software-controlled accounts) pose as humans to intervene in elections and decisions that affect many people [17]. They are also used to influence the perception of products through fake reviews.

The objective of this year's Author Profiling task [15] in PAN [5] at CLEF 2019 is to determine whether the author of a set of tweets in English or Spanish is a human or a bot and, in the case of humans, to determine their gender.

The paper is organised as follows. Section 2 presents the related work. Section 3 describes the corpus, environment setup, preprocessing, features and models, and Section 4 presents the trained models with their accuracy on the dev set. In Section 5 we discuss the results obtained on the test set. After that, Section 6 gives an overview of some experiments using deep learning. Finally, Section 7 draws some conclusions.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 Related Work

In 2002, Koppel et al. [8] evaluated the use of function words and part-of-speech tagging to identify the author gender and the document genre of a corpus of 920 labelled documents. Pang et al. [12] explored the use of unigrams and bigrams, with Naive Bayes, maximum entropy classification and support vector machines, to classify the sentiment of movie reviews. In 2004, Pang and Lee [11] employed support vector machines and Naive Bayes to classify movie reviews as either positive or negative. In 2006, Schler et al. [16] obtained an accuracy of 80.0% in gender identification on a corpus of 85,000 blogs using style and content words.

Metaxas and Mustafaraj [9, 10] found social bots supporting some candidates and attacking their opponents [19]. In 2014, Dickerson et al. [7] studied the problem of identifying bots on all of Twitter and found that 19 of the 25 top features they used are sentiment-related; they used grid search to find the best hyperparameters for each of their classifiers. In 2016, Bessi and Ferrara [2] found social bots generating a large amount of content, possibly distorting online conversations, and noted that the bots tweeting about Donald Trump generated the most positive tweets [19]. In 2018, Stella et al. [17] reported a case of political manipulation on social media, identified using sentiment analysis.
In the gender identification task at PAN at CLEF 2017, Basile et al. [1] obtained 82.33% for English and 83.21% for Spanish using an SVM classifier trained with combinations of character and tf-idf n-grams. In the same gender identification task at PAN at CLEF 2018, Daneshvar et al. [6] obtained 82.21% for English and 82.00% for Spanish using char and word n-grams as features with an SVM classifier. Tellez et al. [18] obtained 81.21% for English and 80.05% for Spanish using a similar strategy.

3 Experimental Work

This section presents the methods and materials applied in the experiments. Subsection 3.1 describes the corpus, Subsection 3.2 shows the environment setup, Subsection 3.3 explains the preprocessing, and Subsection 3.4 describes the feature representations. Finally, the models are presented in Subsection 3.5 and the hyperparameter tuning in Subsection 3.6.

3.1 Corpus

The corpus consists of a set of XML files, one file per author, each containing 100 tweets. It is balanced and annotated with whether the author is a human or a bot and, in the case of humans, with the gender (male or female). It is recommended to use the corpus partitions shown in Table 1 to avoid over-fitting, since more than one file could have been written by the same author. The table also shows that there is more data for English than for Spanish.

Table 1. Corpus partitions.

Lang  Train all  Train  Dev
es    3000       2080    920
en    4120       2880   1240

3.2 Environment Setup

The models were trained in a Jupyter notebook environment known as Colaboratory1. We mainly used the following software tools to build our models: nltk2, sklearn3 and hyperopt4.

3.3 Preprocessing

The XML files are parsed using the Python 3 library xml.etree.ElementTree5. Then, for each author, the 100 tweets are concatenated into one long string, with a custom tag used to separate the individual tweets. After that, the strings are lowercased and tokenized using the nltk TweetTokenizer [3], and each URL, user mention and hashtag is replaced by a fixed tag, following what was done by Daneshvar et al. [6]6.

3.4 Features

Based on quick experimentation, we chose to evaluate char and word n-grams with different n-gram orders. Each document is represented using term frequency-inverse document frequency (TF-IDF) weighting. Finally, the char and word TF-IDF representations are joined with FeatureUnion7 inside a Pipeline8, giving an end-to-end model.

3.5 Models

Taking into account our hardware limitations and the fact that we wanted to try several hyperparameter configurations per model, we opted to include only a few classical machine learning algorithms. As a result of our research, we ended up selecting LinearSVC9, LogisticRegression10 and MultinomialNB11. A sketch of the preprocessing and of the resulting pipeline is shown below.

1 https://colab.research.google.com
2 https://www.nltk.org/
3 https://scikit-learn.org/
4 http://hyperopt.github.io/hyperopt/
5 https://docs.python.org/3/library/xml.etree.elementtree.html
6 https://github.com/pan-webis-de/daneshvar18/blob/5542895062f2404fd5b5a07493ff098132308457/pan18ap/train_model.py
7 https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
8 https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
9 https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
10 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
11 https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
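The following is a minimal sketch of how the preprocessing of Subsection 3.3 and the feature union of Subsections 3.4-3.5 could be assembled with scikit-learn. Only TweetTokenizer, TfidfVectorizer, FeatureUnion, Pipeline and LinearSVC are taken from the text; the XML element name, the separator tag and the replacement tags are illustrative assumptions, not necessarily those used in our submission.

```python
import xml.etree.ElementTree as ET

from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

tokenizer = TweetTokenizer()

def read_author(xml_path):
    """Concatenate the 100 tweets of one author into a single string,
    separated by a custom tag (element name and tag are assumptions)."""
    root = ET.parse(xml_path).getroot()
    tweets = [doc.text or "" for doc in root.iter("document")]
    return " <endoftweet> ".join(tweets)

def normalize(text):
    """Lowercase, tokenize with TweetTokenizer and replace URLs, user
    mentions and hashtags by fixed tags (tag strings are assumptions)."""
    tokens = tokenizer.tokenize(text.lower())
    out = []
    for tok in tokens:
        if tok.startswith(("http://", "https://")):
            out.append("<url>")
        elif tok.startswith("@"):
            out.append("<user>")
        elif tok.startswith("#"):
            out.append("<hashtag>")
        else:
            out.append(tok)
    return " ".join(out)

# Char and word TF-IDF n-grams joined with FeatureUnion inside a Pipeline.
# The n-gram ranges shown here are one of the configurations of Table 6.
features = FeatureUnion([
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(2, 3))),
])

model = Pipeline([
    ("features", features),
    ("clf", LinearSVC(C=1.0)),
])

# docs = [normalize(read_author(path)) for path in training_files]
# model.fit(docs, labels)
```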
3.6 Hyperparameter Tuning

The hyperparameter tuning was initially done by hand, with poor results, so a parameter search tool was used instead. In this work we explored hyperopt12, a Python library for serial and parallel optimisation over awkward search spaces, which may include real-valued, discrete and conditional dimensions. After some experiments, and after looking at how others perform hyperparameter search with hyperopt, we defined the parameter ranges shown in Table 2, Table 3 and Table 4 for LinearSVC, MultinomialNB and LogisticRegression, respectively. A hyperparameter search was also done to find the best parameters for the feature representation, shown in Table 5. A sketch of how these search spaces can be used with hyperopt is given after the tables.

12 http://hyperopt.github.io/hyperopt/

Table 2. SVM hyperparameters.

Param              Values
C                  hp.loguniform('C', np.log(1e-5), np.log(1e5))
tol                hp.loguniform('tol', np.log(1e-5), np.log(1e-2))
intercept_scaling  hp.loguniform('intercept_scaling', np.log(1e-1), np.log(1e1))

Table 3. MultinomialNB hyperparameters.

Param  Values
alpha  hp.loguniform('nb_alpha', -3, 5)

Table 4. Logistic Regression hyperparameters.

Param  Values
C      hp.choice('lr_C', [0.25, 0.5, 1.0])

Table 5. Feature representation hyperparameters.

N-gram type  Param        Values
word         ngram_range  (1, 2), (1, 3), (2, 3)
word         max_df       0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0
word         min_df       0.0001, 0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 1, 2, 5
char         ngram_range  (1, 3), (1, 5), (2, 5), (3, 5), (1, 6), (2, 6)
char         max_df       0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0
char         min_df       0.0001, 0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 1, 2, 5
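As an illustration of how such search spaces can be wired into hyperopt, the sketch below minimises the negative dev-set accuracy of a LinearSVC pipeline. The objective function, the toy data and the negative-accuracy loss are assumptions (the latter consistent with the negative Loss values reported in Table 6); only the hp.loguniform ranges of Table 2 and the n-gram range choices of Table 5 come from the text.

```python
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Search space: LinearSVC parameters (Table 2) plus n-gram ranges (Table 5).
space = {
    "C": hp.loguniform("C", np.log(1e-5), np.log(1e5)),
    "tol": hp.loguniform("tol", np.log(1e-5), np.log(1e-2)),
    "intercept_scaling": hp.loguniform("intercept_scaling", np.log(1e-1), np.log(1e1)),
    "word_ngram_range": hp.choice("word_ngram_range", [(1, 2), (1, 3), (2, 3)]),
    "char_ngram_range": hp.choice("char_ngram_range",
                                  [(1, 3), (1, 5), (2, 5), (3, 5), (1, 6), (2, 6)]),
}

# Placeholder data; in the real setup these are the train/dev author
# documents prepared as in Subsection 3.3 and their labels.
train_docs, train_labels = ["example tweet one", "example tweet two"], [0, 1]
dev_docs, dev_labels = ["example tweet three"], [1]

def objective(params):
    """Train on the train split, evaluate on the dev split (Table 1) and
    return the negative accuracy so that hyperopt minimises it."""
    model = Pipeline([
        ("features", FeatureUnion([
            ("char", TfidfVectorizer(analyzer="char",
                                     ngram_range=params["char_ngram_range"])),
            ("word", TfidfVectorizer(analyzer="word",
                                     ngram_range=params["word_ngram_range"])),
        ])),
        ("clf", LinearSVC(C=params["C"], tol=params["tol"],
                          intercept_scaling=params["intercept_scaling"])),
    ])
    model.fit(train_docs, train_labels)
    acc = accuracy_score(dev_labels, model.predict(dev_docs))
    return {"loss": -acc, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
```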
4 Trained Models

A model was trained separately for each task and language, four models in total. The five best configurations per language and task are shown in Table 6. The models were trained on the training set and evaluated on the dev set, as shown in Table 1.

The best results for the task of determining whether the author is a bot or a human in English were obtained using an SVM with char n-grams in the range (1, 3) and word n-grams in the range (2, 3). For the gender identification task the best results were also obtained with an SVM, but with char n-grams in the range (1, 3) and word n-grams in the range (1, 3). In the Spanish tasks the same configuration obtained the best result on both tasks: an SVM classifier with char n-grams in the range (3, 5) and word n-grams in the range (1, 3).

Table 6. Best model parameters by language and task.

Lang  Task          Classifier  Loss       Feats      word ngram_range  char ngram_range
en    human or bot  LinearSVC   -0.945968  word char  (1, 2)            (2, 6)
en    human or bot  LinearSVC   -0.945968  word char  (1, 2)            (2, 6)
en    human or bot  LinearSVC   -0.945968  word char  (1, 2)            (2, 6)
en    human or bot  LinearSVC   -0.945968  word char  (2, 3)            (1, 3)
en    human or bot  LinearSVC   -0.945161  word char  (2, 3)            (1, 5)
en    gender        LinearSVC   -0.804839  word char  (1, 3)            (1, 3)
en    gender        LinearSVC   -0.803226  word char  (1, 2)            (1, 3)
en    gender        LinearSVC   -0.801613  word char  (1, 2)            (1, 3)
en    gender        LinearSVC   -0.801613  word char  (1, 2)            (1, 3)
en    gender        LinearSVC   -0.801613  word char  (1, 2)            (1, 3)
es    human or bot  LinearSVC   -0.922826  word char  (1, 3)            (3, 5)
es    human or bot  LinearSVC   -0.918478  word char  (1, 2)            (1, 5)
es    human or bot  LinearSVC   -0.918478  word char  (1, 3)            (2, 6)
es    human or bot  LinearSVC   -0.918478  word char  (1, 3)            (2, 6)
es    human or bot  LinearSVC   -0.918478  word char  (1, 3)            (2, 6)
es    gender        LinearSVC   -0.691304  word char  (1, 3)            (3, 5)
es    gender        LinearSVC   -0.691304  word char  (1, 3)            (3, 5)
es    gender        LinearSVC   -0.691304  word char  (1, 3)            (3, 5)
es    gender        LinearSVC   -0.691304  word char  (1, 3)            (3, 5)
es    gender        LinearSVC   -0.691304  word char  (1, 3)            (3, 5)

5 Results

In order to evaluate our model on the test set, we were given a TIRA [13] account to install the software dependencies and to deploy our model in testing mode. The task organisers offered us the possibility of evaluating our models on an early birds dataset, which allowed us to verify the configuration of the environment and gave us an early approximation of our model's behaviour. Table 7 shows the results on the early birds dataset (dataset1) and Table 8 shows the results on the final test set (dataset2). In the gender identification task the results on dataset2 are a couple of points better than on dataset1, while in the bot-or-human task the results are similar for both languages and both datasets. Finally, the results obtained with our proposed model are better than the baselines defined by the task organisers; the baseline models, shown in Table 8, are MAJORITY, RANDOM and LDSE [14].

Table 7. Results on the early birds dataset (pan19-author-profiling-test-dataset1-2019-03-20).

           Bot or Human      Gender
           en      es        en      es
Our model  0.9394  0.9278    0.7879  0.7611

Table 8. Results on the test set and baselines (pan19-author-profiling-test-dataset2-2019-04-29).

           Bot or Human      Gender
           en      es        en      es
Our model  0.9360  0.9330    0.8356  0.8172
MAJORITY   0.5000  0.5000    0.5000  0.5000
RANDOM     0.4905  0.4861    0.3716  0.3700
LDSE       0.9054  0.8372    0.7800  0.6900

6 Trained Deep Learning Models

After the submission of our run, we carried out a couple of further experiments that we could not evaluate on the test dataset; we present them in this section. We conducted several experiments with different deep learning architectures using Keras [4] and the data partitions shown in Table 1 (train and dev). The same preprocessing described in Subsection 3.3 was applied; in addition, each emoji was replaced with a word using the emoji13 library.

The model shown in Figure 3 obtained an accuracy of 94.524 ± 0.00167, evaluated over 10 runs. We also evaluated 100 experiments with different hyperparameters for the same architecture. Figure 1 shows the accuracy on the training set and Figure 2 shows the accuracy on the dev set. It is possible to see that the results were good regardless of the hyperparameters used.

13 https://github.com/carpedm20/emoji/

Fig. 1. Training accuracy.
Fig. 2. Validation accuracy.
Fig. 3. Deep learning model.
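Since the exact architecture is only given in Figure 3, the snippet below is not the model of that figure. It only illustrates the kind of setup described in this section: the emoji-to-word replacement with the emoji library and a deliberately simple Keras text classifier; the layer choices, vocabulary size and sequence length are arbitrary assumptions.

```python
import emoji
from keras.layers import Dense, Embedding, GlobalAveragePooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

def replace_emojis(text):
    """Replace each emoji with its textual name, e.g. a smiley becomes
    'smiling_face' surrounded by spaces."""
    return emoji.demojize(text, delimiters=(" ", " "))

MAX_WORDS, MAX_LEN = 20000, 1000   # arbitrary choices, not from the paper

tokenizer = Tokenizer(num_words=MAX_WORDS)
# texts = [replace_emojis(doc) for doc in train_docs]   # documents prepared as in Subsection 3.3
# tokenizer.fit_on_texts(texts)
# x_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

# A simple binary classifier (bot vs. human); NOT the model of Figure 3.
model = Sequential([
    Embedding(MAX_WORDS, 128, input_length=MAX_LEN),
    GlobalAveragePooling1D(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_dev, y_dev), epochs=10)
```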
7 Conclusions

As in previous years of the Author Profiling shared task at PAN, an SVM classifier with n-gram and TF-IDF features obtained very good results. The use of hyperparameter tuning tools proved to be one of the crucial parts of the model building process for obtaining good results. As future work, it could be very useful to explore additional features, such as lexicons or emojis transformed into custom tags, and to try other feature representations, such as word embeddings with neural networks. In Section 6 we showed the promising results of the preliminary experiments that we carried out.

References

1. Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M.: N-gram: New Groningen author-profiling model. arXiv preprint arXiv:1707.03764 (2017)
2. Bessi, A., Ferrara, E.: Social bots distort the 2016 US presidential election online discussion (2016)
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc. (2009)
4. Chollet, F., et al.: Keras. https://keras.io (2015)
5. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
6. Daneshvar, S., Inkpen, D.: Gender identification in Twitter using n-grams and LSA: Notebook for PAN at CLEF 2018. In: CEUR Workshop Proceedings. vol. 2125 (2018), http://ceur-ws.org/Vol-2125/paper_213.pdf
7. Dickerson, J.P., Kagan, V., Subrahmanian, V.S.: Using sentiment to detect bots on Twitter: Are humans more opinionated than bots? In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). pp. 620-627 (Aug 2014). https://doi.org/10.1109/ASONAM.2014.6921650
8. Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), 401-412 (2002)
9. Metaxas, P.T., Mustafaraj, E.: Social media and the elections. Science 338(6106), 472-473 (2012)
10. Mustafaraj, E., Metaxas, P.T.: From obscurity to prominence in minutes: Political speech and real-time search (2010)
11. Pang, B., Lee, L.: A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. ACL '04, Association for Computational Linguistics, Stroudsburg, PA, USA (2004). https://doi.org/10.3115/1218955.1218990
12. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10. pp. 79-86. EMNLP '02, Association for Computational Linguistics, Stroudsburg, PA, USA (2002). https://doi.org/10.3115/1118693.1118704
13. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
14. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. pp. 156-169. Springer International Publishing, Cham (2018)
15. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
16. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol. 6, pp. 199-205 (2006)
17. Stella, M., Ferrara, E., De Domenico, M.: Bots increase exposure to negative and inflammatory content in online social systems. Proceedings of the National Academy of Sciences 115(49), 12435-12440 (2018). https://doi.org/10.1073/pnas.1803470115
18. Tellez, E.S., Miranda-Jiménez, S., Moctezuma, D., Graff, M., Salgado, V., Ortiz-Bejar, J.: Gender identification through multi-modal tweet analysis using microTC and bag of visual words. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
19. Yang, K.C., Varol, O., Davis, C.A., Ferrara, E., Flammini, A., Menczer, F.: Arming the public with artificial intelligence to counter social bots. Human Behavior and Emerging Technologies 1(1), 48-61 (2019). https://doi.org/10.1002/hbe2.115