Tlemcen University: Bots and Gender Profiling Task
Notebook for PAN at CLEF 2019

Rabia Bounaama¹ and Mohammed El Amine Abderrahim²

¹ Biomedical Engineering Laboratory, Tlemcen University, Algeria
rabea.bounaama@univ-tlemcen.dz
² Laboratory of Arabic Natural Language Processing, Tlemcen University, Algeria
mohammedelamine.abderrahim@univ-tlemcen.dz

Abstract This paper describes the participation of the techno team at PAN @ CLEF 2019. We address the task with text analysis techniques and machine learning approaches. We describe the properties of our multilingual system, based on a Stochastic Gradient Descent (SGD) learning classifier, submitted to PAN 2019; it identifies bots and profiles gender from tweets in two languages, English and Spanish. We show the usefulness of some features for identifying text style and author information, and we then evaluate the model on a number of unseen data sets. On English, the proposed models reach accuracies of 0.50 for bot-versus-human prediction and 0.25 for gender prediction.

Keywords: bots and gender profiling, machine learning, SGD classifier.

1 Introduction

Social media have become one of the most popular ways for people to communicate and to post. Posts vary in length and may involve multiple topics. An author's writing style can be affected by different topics and by different replies or comments (e.g. supportive, negative and aggressive) [8]. In marketing, companies and resellers would like to know people's views of their products based on the analysis of blogs and product reviews [10]. People also tend to seek out and receive news from social media, so these communications and ratings can produce significant quantities of data that must be analyzed. These media allow the users who interact and generate information to hide their real profile. Therefore, the possibility of inferring the traits of social media users from what they share is a field of growing interest named author profiling [11].
Author profiling deals with deciphering information about an author from the text that he or she has written [1]; this helps in identifying aspects of the user. Bots could artificially inflate the popularity of a product by promoting it and/or writing positive ratings, as well as undermine the reputation of competing products through negative valuations³. The Bots and Gender Profiling task at PAN 2019 CLEF [3,2] aims to determine whether the author of a tweet is a bot or a human and, in the case of a human, to identify her or his gender. The task is held in English and Spanish, so participants must provide a multilingual solution to the problem. The performance of participants' systems is ranked by accuracy through TIRA [9].

The paper is structured as follows. In the next section, we give a brief overview of some related work. Section 3 describes the methodology and corpus preprocessing. Section 4 presents the results. Then we conclude the paper.

³ https://pan.webis.de/clef19/pan19-web/author-profiling.html

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 Related Work

Among recent studies on social media, the authors of [1] propose a multilingual model for the identification of age and gender at PAN 2015, treated as a classification task to which they apply a linear model with SGD learning, and a multilingual personality prediction model to which they apply a multivariate regression model, Ensemble of Regressor Chains Corrected (ERCC). In another author profiling work at PAN 2016 [4], the authors used SVM-based classifiers: liblinear for gender classification and libsvm with a radial basis kernel to predict age. They also use the NRC Word-Emotion Association Lexicon for their training data. In [10] the authors apply TF-IDF and a deep-learning model based on Convolutional Neural Networks.
They compute the cosine similarity between the TfIdf_d vector and the Tf_q vector of term frequencies for their training data in order to predict gender or language variety at PAN 2017 on unseen test data. Moreover, in [6], on the prediction of gender and language variety at PAN 2017, and in [12], on last year's task (PAN 2018) of gender identification on Twitter, the authors use logistic regression with good accuracy. The studies mentioned above show the applicability of some statistical methods to author profiling tasks at PAN CLEF. In this paper we propose a multilingual model for the identification of bots and gender profiling based on a Stochastic Gradient Descent (SGD) learning classifier.

3 Method

In this section, we describe the two multilingual predictive models used in our submission. We build one multilingual model for identifying bots or human users and one multilingual model for predicting the gender of human users. The organizers of the PAN 2019 Bots and Gender Profiling task provide a dataset consisting of two training sets, one per language: English and Spanish, with totals of 412000 and 300000 tweets respectively. The collections are described in Table 1. The data was given in the form of XML files containing tweets for several users.

We apply the following set of preprocessing steps to all documents. First, we created a function that extracts the tweets from the XML files and saves them to a CSV file, using the BeautifulSoup⁴ and pandas⁵ libraries, for both languages.
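The extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the submitted system: the file layout (one XML file per author, one document element per tweet), the sample data, the author id, the label and the function names are all hypothetical, and the sketch swaps in the Python standard library (xml.etree.ElementTree and csv) for BeautifulSoup and pandas so that it runs without extra dependencies.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample in the ASSUMED layout: one XML file per author,
# one <document> element per tweet. The real corpus layout may differ.
SAMPLE_XML = """<author lang="en">
  <documents>
    <document><![CDATA[Soooon! New video at https://example.org]]></document>
    <document><![CDATA[@friend thanks for the follow #happy]]></document>
  </documents>
</author>"""


def extract_tweets(xml_text):
    """Return the text of every <document> element in one author file."""
    root = ET.fromstring(xml_text)
    return [(doc.text or "").strip() for doc in root.iter("document")]


def write_csv(rows, handle):
    """Write (author_id, label, tweet) rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["author_id", "label", "tweet"])
    writer.writerows(rows)


tweets = extract_tweets(SAMPLE_XML)

# The author id and label are placeholders; in practice they would come
# from the ground-truth file shipped with the corpus.
rows = [("author_001", "human", tweet) for tweet in tweets]

buffer = io.StringIO()  # stands in for a real CSV file on disk
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with newline="" instead would produce the CSV file on disk.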
⁴ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
⁵ https://pandas.pydata.org/

Table 1: Training corpora statistics

Language   Tweets   Authors (human / bots)   Gender (male / female)
English    412000   206000 / 206000          103000 / 103000
Spanish    300000   150000 / 150000          75000 / 75000

We used only the post text for training, keeping each tweet with its author and the author's gender, and we extracted all tweets belonging to one user. We preprocessed the data before using it to train the SGD (SVM) classifier. The following preprocessing steps were performed:

– Removing URL links, @username mentions and hashtags (#).
– Tokenizing text by white space.
– Normalizing case to lowercase.
– Removing punctuation from each word.
– Removing non-printable characters.
– Removing stopwords.
– Lemmatizing words.

Second, we started the construction of the model. At this stage we created three functions. The first creates the classifier; it takes as parameters the specified classifier, the training feature vectors with their output classes, and the validation vector. According to [5], the use of N-grams is the best method for analyzing emotions in a microblogging context, so we train our classifier using 3-gram features. From these features, we selected only those with a minimum term frequency of 3 in the classification task, and we used them in the third function to train the model. We used the same feature representation and model parameters for the Spanish dataset as those chosen for English. Our model was built using the tools provided by the scikit-learn machine learning library in Python [7]. We also tested several classifiers and different parameter sets.
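The preprocessing list and the 3-gram selection described in this section can be sketched as below. This is an illustration only, using the Python standard library rather than the scikit-learn tools of the submitted system. Several points are assumptions not stated in the paper: the 3-grams are taken over words (not characters), the threshold of 3 is applied to the total count across the corpus, whole hashtags are removed rather than just the # sign, the stopword list is a tiny placeholder, and lemmatization is omitted because it needs an external resource. All function names are hypothetical.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a
# full English or Spanish list.
STOPWORDS = {"the", "a", "is", "to", "and", "of"}

# Matches URLs, @username mentions and whole #hashtags.
URL_MENTION_HASHTAG = re.compile(r"https?://\S+|www\.\S+|[@#]\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Apply the listed cleaning steps, except lemmatization, to one tweet."""
    text = URL_MENTION_HASHTAG.sub(" ", tweet)            # drop URLs, @user, #tag
    tokens = text.split()                                 # whitespace tokenizing
    tokens = [t.lower() for t in tokens]                  # lowercase
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]   # strip punctuation
    tokens = ["".join(c for c in t if c.isprintable()) for t in tokens]
    return [t for t in tokens if t and t not in STOPWORDS]


def word_ngrams(tokens, n=3):
    """Return the contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_ngrams(docs, n=3, min_count=3):
    """Keep the n-grams whose total count over all documents is >= min_count."""
    counts = Counter(g for tokens in docs for g in word_ngrams(tokens, n))
    return {g: c for g, c in counts.items() if c >= min_count}


# Hypothetical tweets, for illustration only.
raw = [
    "Check this out: I love my new phone! https://example.org #tech",
    "@anna I love my new phone, really",
    "I LOVE my new phone... and the camera",
    "the weather is nice today",
]
docs = [preprocess(t) for t in raw]
vocab = select_ngrams(docs)
```

With these four sample tweets, only the three 3-grams shared by the first three tweets survive the threshold; the retained n-grams would then serve as the feature vocabulary for the classifier.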
The following classifiers were tested (scikit-learn implementations, except for the RNN, which scikit-learn does not provide):

– svm.LinearSVC
– Logistic regression
– RNN (recurrent neural network)
– Multinomial naïve Bayes

The best results were obtained with the SGD classifier; for our submitted run we used 'hinge' as the loss function and L2 as the penalty.

4 Results

For the task of bots and gender profile prediction, we obtain slightly better results for Spanish than for English, as shown in Tables 2 and 3.

Table 2: Gender prediction

Language   English   Spanish
Accuracy   0.2511    0.2567

Table 3: Bots/human prediction

Language   English   Spanish
Accuracy   0.5008    0.5050

Our techno team has an average score of 0.3784, which is the mean of the four accuracies above. According to these results, the SGD (SVM) classifier performs better for bot-versus-human (author type) prediction, while the approach did not perform well at gender prediction. To overcome this limitation, we plan to do more advanced preprocessing using, for example, linguistic markers. We faced some limitations in building our system:

– Tweet data contains misspelled words. For example, people spell the word “soon” as “soooon” to convey excitement; in such situations, tokenizing and identifying words becomes challenging.
– Social media users use their own vocabulary to express their thoughts or feelings, so extracting vocabulary-based or grammar-based features may not work efficiently for these platforms. Furthermore, social media users use multiple languages to express their opinions. This makes it impossible to apply knowledge derived from one language, by extracting language-dependent features, to another language.

5 Conclusion

We have presented the system developed by our techno team for the PAN 2019 Bots and Gender Profiling task. We designed and implemented an easily configurable system whose final model is an SGD classifier. The main challenge with this model is to combat overfitting effectively.
The biggest challenge of this year's PAN bots and gender profiling task was the gender classification problem, where our model achieves an average accuracy of 0.25.

References

1. Mounica Arroju, Aftab Hassan, and Golnoosh Farnadi. Age, gender and personality recognition using tweets in a multilingual setting. In 6th Conference and Labs of the Evaluation Forum (CLEF 2015): Experimental IR meets multilinguality, multimodality, and interaction, pages 22–31, 2015.
2. Franco M. Francisco Rangel, Paolo Rosso. A low dimensionality representation for language variety identification. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16). Springer-Verlag, LNCS(9624), pages 156–169, 2018.
3. Francisco Rangel and Paolo Rosso. Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling. In Cappellato L., Ferro N., Müller H., and Losada D., editors, CEUR Workshop Proceedings. CEUR-WS.org, 2019.
4. Pepa Gencheva, Martin Boyanov, Elena Deneva, Preslav Nakov, Yasen Kiprov, Ivan Koychev, and Georgi Georgiev. Pancakes team: A composite system of genre-agnostic features for author profiling. In CEUR Workshop Proceedings, 2016.
5. Gonzalo Blázquez Gil, Antonio Berlanga de Jesús, and José M. Molina López. Combining machine learning techniques and natural language processing to infer emotions using Spanish Twitter corpus. In Highlights on Practical Applications of Agents and Multi-Agent Systems, pages 149–157. Springer Berlin Heidelberg, 2013.
6. Matej Martinc, Iza Skrjanec, Katja Zupan, and Senja Pollak. PAN 2017: Author profiling - gender and language variety prediction. In CLEF (Working Notes), 2017.
7. Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
8. Jian Peng, Kim-Kwang Raymond Choo, and Helen Ashman. Bit-level n-gram based forensic authorship analysis on social media: Identifying individuals from linguistic profiles. Journal of Network and Computer Applications, 70:171–182, 2016.
9. Martin Potthast, Tim Gollub, Matti Wiegmann, and Benno Stein. TIRA Integrated Research Architecture. In Nicola Ferro and Carol Peters, editors, Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer, 2019.
10. Nils Schaetti. UniNE at CLEF 2017: TF-IDF and deep-learning for author profiling. In CLEF (Working Notes), 2017.
11. Mariona Taulé, M Antonia Martí, Francisco M Rangel, Paolo Rosso, Cristina Bosco, Viviana Patti, et al. Overview of the task on stance and gender detection in tweets on Catalan independence at IberEval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017, volume 1881, pages 157–177. CEUR-WS, 2017.
12. P von Daniken, Ralf Grubenmann, and Mark Cieliebak. Word unigram weighing for author profiling at PAN 2018. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), 2018.