INSA LYON and UNI PASSAU's participation at PAN@CLEF'17: Author Profiling task
Notebook for PAN at CLEF 2017
Guillaume Kheng, Léa Laporte, and Michael Granitzer
Institut National des Sciences Appliquées Lyon and Universität Passau
guillaume.kheng@gmail.com, lea.laporte@insa-lyon.fr, michael.granitzer@uni-passau.de

Abstract
This paper describes the participation of INSA Lyon and UNI Passau at the PAN 2017 Author Profiling task. Given the language and the tweets of an author, the goal is to predict his/her gender and language variety. We consider two strategies: a "loose" classification that learns one predictive model for the gender and another one for the variety, and a "successive" classification that first predicts the gender and then learns a predictive model for the variety, given the gender. We consider all the languages. We experiment with various feature representations and machine learning algorithms used in previous PAN Author Profiling editions in order to learn the models. We adapt the features and machine learning algorithm used for each language and each classification task by selecting the configuration that provides the best results in terms of prediction performance.

1 Introduction

Thanks to the expansion of social networks and the progress of the Internet and related technologies, social media are now part of our daily life. A large amount of content, especially textual content, is thus produced and read every day, but without any certitude about the real identity of the author. Indeed, on the Internet, people can easily hide their identity, lie about it, or even usurp someone else's. One may want to know who authored a given content, or simply profile the content's author in order to know more about him/her. Author Profiling (AP) is a text forensics field which tries to tackle this latter issue. AP studies aim at retrieving some characteristics of an author (e.g. his/her age, gender, personality, etc.) by analyzing only the texts he/she writes. Multiple applications of Author Profiling exist in various fields [9]. AP could help investigators profile criminals and use written content as evidence (forensics), or prevent malicious behavior on social networks (security). On the other hand, profiling the users of a product could improve a company's consumer segmentation and yield more accurate advertising campaigns (marketing).

The CLEF PAN track has been offering an Author Profiling task for the last 5 years [6]. The task settings differ each year in terms of text genres, languages and author characteristics. A summary of the previous AP editions' settings is provided in Table 1. As shown in this table, the aim of the 2017 PAN Author Profiling task is to retrieve the gender of the author and his/her language variety, i.e. the specific variation of his/her language due to the geographical area he/she comes from. Gender and language variety predictions can be considered as two subtasks of the PAN 2017 Author Profiling task. Participants are given the choice to participate in both subtasks or in only one of them. Furthermore, they can consider all languages or only a subset. In our approach, we chose to participate in both subtasks, for all the languages.
Table 1: Evolution of the PAN AP task settings, 2013-2017

                      2013   2014   2015   2016   2017
Text genre
  Blogs                       X             X
  Reviews                     X
  Social media         X      X             X
  Tweets                      X      X      X      X
Author features
  Age                  X      X      X      X
  Gender               X      X      X      X      X
  Language variety                                 X
  Personality                        X
Languages
  Arabic                                           X
  Dutch                              X      X
  English              X      X      X      X      X
  Italian                            X
  Portuguese                                       X
  Spanish              X      X      X      X      X

One core aspect of our approach was to analyse the impact of different combinations of feature representation techniques and classification algorithms in terms of classification accuracy. We implemented various feature extraction techniques, such as n-grams, TF-IDF and LSA, and various machine learning algorithms, including Support Vector Machines, Naive Bayes and Random Forests. We also wanted to study the dependency between gender and variety. To do so, we performed a "gender then variety" successive classification and compared it to a loose classification. Successive classification starts by predicting the gender only, then uses the results of this classification to predict the language variety. On the other hand, loose classification predicts the gender and variety labels independently.

This paper is structured as follows. In Section 2, we present our proposed approach. In Section 3, we detail the experimental protocol, settings and results. In Section 4, we discuss perspectives, and Section 5 concludes the paper.

2 Overview of our proposed approach

Our approach consists of three main steps: preprocessing, feature extraction and learning using a machine learning algorithm. First, we present the preprocessing step. Then, we introduce the different features we consider in order to achieve the best classification possible. Finally, we detail the learning step, including the machine learning algorithms we used in the context of the task and the successive learning strategy that we tried.

2.1 Preprocessing

Several approaches have been proposed in the literature in order to process tweets for further information extraction [7]. Based on an analysis of previous PAN Author Profiling editions, we chose to apply the following preprocessing steps:

Removal of short tweets: we remove all tweets whose length is below 10 characters (special characters included).

Removal of @user mentions: on Twitter, '@' is used to address other users. When a user starts a discussion with another '@user', this specific Twitter id will appear multiple times, which might cause over-fitting, so we remove it.

Removal of URLs: URLs might be too user-specific (causing over-fitting) and they might enrich the vocabulary too much for a poor gain in information, as they are likely to occur only once.

Lowercasing of hashtag bodies: people might use the same hashtag with different distributions of upper/lowercase letters. For instance, without this step, #AuthorProfilingRocks and #authorProfilingROCKS supposedly mean the same but would end up as different "words".

Removal of stop words: we made this step optional, as tweets are really short messages and some stop words might carry more meaning than in a longer text.

The corpus we obtain after the preprocessing step is described in Table 2. One can notice that the number of tweets removed never amounts to more than 2% of each language sub-corpus.

Table 2: Characteristics of the corpus before and after preprocessing

Language                                    Arabic    English   Spanish   Portuguese
Tweets in the initial dataset               240,000   360,000   420,000   120,000
"Empty" tweets removed (< 10 characters)    4,219     1,555     1,910     1,895
Tweets after preprocessing                  235,781   358,445   418,090   118,105
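The sketch below illustrates how these preprocessing steps can be applied to a single tweet. It is only a minimal illustration: the function name, the regular expressions and the exact point at which the 10-character filter is applied are assumptions, not a verbatim extract of our implementation.

import re

MIN_TWEET_LENGTH = 10  # tweets shorter than this are considered "empty" and dropped

def preprocess_tweet(text, stop_words=None):
    """Clean a single tweet; return None if it should be discarded."""
    # Remove @user mentions: user ids are too author-specific and may cause over-fitting.
    text = re.sub(r"@\w+", "", text)
    # Remove URLs: they mostly occur once and needlessly enlarge the vocabulary.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Lowercase hashtag bodies so that #AuthorProfilingRocks and #authorProfilingROCKS coincide.
    text = re.sub(r"#(\S+)", lambda m: "#" + m.group(1).lower(), text)
    # Optional stop word removal (kept optional because tweets are very short).
    if stop_words is not None:
        text = " ".join(w for w in text.split() if w.lower() not in stop_words)
    text = " ".join(text.split())  # normalise whitespace
    # Drop short ("empty") tweets.
    return text if len(text) >= MIN_TWEET_LENGTH else None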
2.2 Features Extraction

Once the corpus has been preprocessed, we extract features from it and use them to represent the tweets. Most of the representations we considered have been used by the winners of previous PAN AP task editions [8,7,9]. We consider three representations of tweets, based on n-grams, TF-IDF and LSA. We also planned to consider stylometric features, but we finally did not implement them for technical reasons.

N-gram models. These models consist of establishing a vocabulary for the documents based on sequences of n items (characters, words, associations of words (for n > 1), POS tags) extracted from the text. The features are then the frequencies of the n-grams of the vocabulary. N-gram-based features have proven to be highly useful indicators of various linguistic differences between authors [13]. We implemented features with unigrams, bigrams and trigrams at the word level.

TF-IDF. Term Frequency-Inverse Document Frequency (TF-IDF) is a well-established technique in Information Retrieval. TF-IDF scores each word so as to highlight those that appear often in few documents, which helps the learning algorithm select words with a high discriminative power between labels. This approach has been widely used for the Author Profiling task.

Latent Semantic Analysis (LSA). LSA captures semantic relations between groups of words, as described in [5]. It produces a set of concepts linking words to documents and, by extension, to profiles. LSA also performs a dimensionality reduction, which is quite useful in our case given the potentially huge size of the vocabulary. The features provided by this technique allow us to train classifiers with a deeper understanding of the tweet content. In 2015, the PAN AP task winners [7] yielded the top results with this approach.

Stylometry. We planned to consider features related to the Natural Language Processing field, including Part-Of-Speech features and stylometric features (average word count per sentence, average number of letters per word, etc.), which have been shown to perform well for this task [4]. Those features can be used alone or combined with other types of features in order to carry more information and potentially achieve a better classification. Unfortunately, given the time constraint and some technical issues, we were not able to implement them.
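As an illustration, the combination of word n-gram TF-IDF features with an LSA projection can be assembled with scikit-learn as sketched below. The parameter values (n-gram range, number of LSA components) are illustrative, not the tuned values we finally retained.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# TF-IDF over word unigrams and bigrams.
tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
# LSA, i.e. a truncated SVD applied to the TF-IDF matrix (dimensionality reduction).
lsa = TruncatedSVD(n_components=100)

features = make_pipeline(tfidf, lsa)
# X = features.fit_transform(tweets)   # tweets: list of preprocessed tweet strings
# X then feeds the classifiers described in Section 2.3.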
2.3 Machine learning algorithms

Our goal was to reproduce the state-of-the-art approaches from the previous editions of PAN, more precisely the classification techniques they considered. According to the PAN overviews [7,8,9], SVM, the Naive Bayes classifier and Random Forest are the most common learning techniques used in the context of the PAN AP task. Fortunately, they are also the ones achieving the best results on this task. We chose to implement these three classifiers, i.e. Support Vector Machines, the Naive Bayes classifier and Random Forest, in order to compare their respective results and pick the best one. As they are well-known machine learning algorithms [12,10,2], we do not describe them in detail in this paper. In the case of SVM and the Naive Bayes classifier, we only indicate which variants we used.

For the Naive Bayes classifier, as the official sklearn documentation suggests that the "Multinomial" Naive Bayes classifier (MNBC) works well with TF-IDF, we chose to implement this variant. This method is the fastest in terms of training and classifying on the provided data, so we used it in order to achieve a stable basis for our software at the early stage of development. Regarding SVM, we tested both kernel and linear approaches. However, the linear mode kept yielding significantly better results than the kernel one during the evaluation and comparison phases. As a consequence, we only use a binary linear SVM to predict the gender and a multiclass linear SVM with a "one-vs-rest" strategy to predict the language variety.

2.4 Successive and loose classification

As mentioned in the introduction, we wondered whether gender and language variety were not linked in some way. Our assumption was that predicting one label and using the results of this classification in order to predict the second label might achieve good results. We chose to call this type of classification "successive classification", as opposed to "loose classification", in which one classifies each label regardless of the others (thus predicting gender and variety are two independent subtasks). The protocol for each type of classification is as follows.

For loose classification, we consider gender and variety prediction as two independent subtasks and train the corresponding models separately. We thus train a classifier to predict gender (respectively variety) on the whole language corpus, then we use the learned model to predict gender (respectively variety) for each author within the test dataset.

For successive classification, in the context of the task, two strategies could be considered: first predict the gender, then predict the variety given the knowledge of the gender ("gender then variety" strategy); or first predict the variety, then predict the gender given the variety ("variety then gender" strategy). In this work, we consider only the "gender then variety" strategy for successive classification. We did not consider the "variety then gender" strategy because the number of tweets available for training would have been significantly reduced, which would likely have induced over-fitting and poor classification rates. In order to achieve "gender then variety" successive classification for each language corpus, we proceed as follows (a minimal sketch is given at the end of this subsection):

1. We train a classifier to predict gender on the whole language corpus.
2. We split each language corpus into 2 sub-corpora, based on the ground truth: one for the female authors and another one for the male authors.
3. On each sub-corpus, we train a classifier to predict variety. This provides us with a male-variety classifier and a female-variety classifier.
4. We first classify each author contained within the test dataset on gender and sort the authors predicted as male and as female into 2 sub-test-datasets.
5. We classify each author contained within the sub-test-datasets with the associated variety classifier, i.e. the female-variety classifier predicts the variety labels for the authors classified as female, and likewise for the males.

Figures 1 and 2 show the processing of the test dataset with the loose and successive classification procedures respectively. To simplify the notations, we will call "classification units" the classifiers used to predict gender or variety labels. On Figures 1 and 2, one can notice that the number of classification units differs between the two types of classification. Indeed, the "gender then variety" successive classification needs 3 classification units per language (one for the gender, then two variety classifiers depending on the gender), whereas the loose classification only needs 2 classification units (one per classification subtask).

Figure 1: Prediction work-flow of the loose classification
Figure 2: Prediction work-flow of the successive classification
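The following sketch outlines the "gender then variety" procedure, assuming that X holds the feature vectors of the training samples and that gender and variety are numpy arrays of ground-truth labels. The classifier choices, label strings and variable names are illustrative.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def train_successive(X, gender, variety):
    """Train the 3 classification units of the 'gender then variety' strategy."""
    gender_clf = MultinomialNB().fit(X, gender)  # unit 1: gender, whole corpus
    female_clf = LinearSVC().fit(X[gender == "female"], variety[gender == "female"])
    male_clf = LinearSVC().fit(X[gender == "male"], variety[gender == "male"])
    return gender_clf, female_clf, male_clf

def predict_successive(X_test, gender_clf, female_clf, male_clf):
    """Predict gender first, then route each sample to the matching variety unit."""
    gender_pred = gender_clf.predict(X_test)
    variety_pred = np.empty(len(gender_pred), dtype=object)
    is_female = gender_pred == "female"
    if is_female.any():
        variety_pred[is_female] = female_clf.predict(X_test[is_female])
    if (~is_female).any():
        variety_pred[~is_female] = male_clf.predict(X_test[~is_female])
    return gender_pred, variety_pred

In the loose setting, a single variety classifier is simply trained on the whole corpus instead of the two gender-specific units.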
For each classification unit, we implemented all the combinations of the feature sets described in Section 2.2 and the learning algorithms selected in Section 2.3. For each classification unit, we then selected the best possible combination according to a given evaluation measure, the F-score, which is detailed in the next section.

3 Experimental Evaluation

In this section, we present the experimental evaluation and the results we obtained in the context of the 2017 PAN Author Profiling task. First, we briefly present the software implementation. Second, we describe the experimental protocol we followed throughout the task in order to obtain reliable results. Then, we present the classification units we selected for the loose and the successive classifications, along with the results obtained on the test set provided by the task chairs for each classification type. Finally, we present the results of our submitted final run.

3.1 Software implementation

The software implementation relies on the Python 3 sklearn and NLTK modules. The classifiers and feature extraction tools we used were pre-implemented and documented in the sklearn module. NLTK offered some powerful tools regarding tweet tokenisation and stop word removal. The source files are available in the following GitHub repository: github.com/SunTasked/profiler

3.2 Experimental Protocol

Regarding the training and evaluation of each classification unit, we followed a set of rules in order to achieve reliable results.

Training of classifiers. We tested and optimized all the classifiers described in Section 2.3 using different sets of features: unigrams and bigrams at the word level, TF-IDF based on unigrams, bigrams and trigrams at the word level, LSA, and a combination of LSA and TF-IDF on unigrams and bigrams at the word level. By optimizing, we mean tuning the classifier and feature extractor parameters to achieve the best score possible given this configuration. In order to do so, we used the sklearn "grid search" tool, which allows trying different combinations of parameters in a multi-threaded fashion (an illustrative example is given at the end of this subsection). In addition, we also tested some combinations of features with and without the stop word removal step, as we wanted to determine whether these textual elements had an impact on the classification process. This represents roughly 24 models trained for each classification unit, summing up to a total of 480 models trained, optimized and cross-validated using 10 folds of the training data.

Evaluation measures. We use the micro-averaged and macro-averaged F-measures as evaluation measures, since we have a corpus with a balanced distribution over the different labels, as recommended by [11]. When comparing one approach to another, we consider the macro measure first and, in case of conflict, we then consider the micro measure. If two configurations lead to the same performance in terms of evaluation measure, we choose the one whose features consume less computational power.
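For instance, a TF-IDF + linear SVM configuration can be tuned as in the sketch below. The parameter grid shown here is a reduced, illustrative version of the grids we actually explored.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],  # word n-gram ranges
    "clf__C": [0.1, 1.0, 10.0],                      # SVM regularisation strength
}

grid = GridSearchCV(pipeline, param_grid, scoring="f1_macro",
                    cv=10, n_jobs=-1)  # 10-fold cross-validation, multi-threaded
# grid.fit(tweets, labels)             # tweets: list of str, labels: list of str
# print(grid.best_params_, grid.best_score_)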
3.3 Selected best configurations for the classification units

Regarding the training of the classifiers, we based our approach on a "single tweet" classification. We trained each classifier on tweets as standalone documents, i.e. disregarding the fact that each tweet belongs to an author along with 99 other tweets. Then, when predicting labels, we classified each of the 100 tweets available for each author separately. Consequently, for each author, we obtained 100 label predictions. In order to obtain the author labels, we summed up the per-tweet predictions and chose the label with the highest score (a minimal sketch of this aggregation is given at the end of this subsection).

In this subsection, we start by presenting the selected models for the gender classification units, as those are the same for loose and successive classification. Then we present the selected models for the language variety classification units, for the loose and the successive classification.

Gender classification. As shown in Table 3, in all languages the best models for gender classification have been obtained by combining TF-IDF features on unigrams and bigrams with a Naive Bayes Classifier (NBC). Surprisingly, the classification over the Latin languages (Spanish and Portuguese) seemed to work better when the stop words were not removed.

Table 3: Best configurations for gender classification, for all languages. These configurations have been used for both loose and successive classification.

Language     Preprocessing           Features             Classifier   F macro   F micro
Arabic       removal of stop words   TF-IDF (1/2-grams)   NBC          0.707     0.708
English      removal of stop words   TF-IDF (1/2-grams)   NBC          0.669     0.669
Spanish      -                       TF-IDF (1/2-grams)   NBC          0.659     0.661
Portuguese   -                       TF-IDF (1/2-grams)   NBC          0.659     0.663

Variety classification. Table 4 describes the best approaches selected for the loose variety classification units, while Table 5 describes the best approaches for the "gender then variety" successive classification units. We observe that the loose classification units for variety prediction yield better overall accuracy scores than the corresponding successive classification units. The English and Spanish classifiers are particularly affected by the division of the corpus: halving the corpus on gender implies a shrinking of the extracted feature set. Consequently, the classification units might have fewer features to discriminate the tweets on and offer poorer prediction performance. On the other hand, the Arabic and Portuguese classification units seem to yield roughly equivalent scores in both the loose and successive classification contexts.

Table 4: Best configurations for the variety loose classification units, for all languages

Language     Preprocessing           Features                   Classifier   F macro   F micro
Arabic       -                       TF-IDF (1/2-grams) & LSA   SVM          0.684     0.684
English      removal of stop words   TF-IDF (1/2-grams)         SVM          0.669     0.669
Spanish      -                       TF-IDF (1/2-grams) & LSA   SVM          0.684     0.684
Portuguese   -                       TF-IDF (1/2/3-grams)       SVM          0.879     0.879

Table 5: Best configurations for the variety successive classification units, for all languages

Gender   Language     Preprocessing           Features                   Classifier   F macro   F micro
Female   Arabic       -                       TF-IDF (1/2-grams) & LSA   SVM          0.673     0.674
Female   English      removal of stop words   TF-IDF (1/2-grams)         SVM          0.466     0.467
Female   Spanish      -                       TF-IDF (1/2-grams) & LSA   SVM          0.518     0.520
Female   Portuguese   -                       TF-IDF (1/2-grams) & LSA   SVM          0.880     0.880
Male     Arabic       -                       TF-IDF (1/2-grams) & LSA   SVM          0.687     0.687
Male     English      removal of stop words   TF-IDF (1/2-grams)         NBC          0.450     0.449
Male     Spanish      -                       TF-IDF (1/2-grams)         SVM          0.555     0.556
Male     Portuguese   -                       TF-IDF (1/2/3-grams)       SVM          0.859     0.859
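As announced at the beginning of this subsection, author labels are obtained by aggregating the 100 per-tweet predictions through a majority vote. A minimal sketch of this aggregation (function and variable names are illustrative):

from collections import Counter

def author_label(clf, author_tweet_features):
    """Aggregate the per-tweet predictions of one author into a single label."""
    tweet_labels = clf.predict(author_tweet_features)   # one prediction per tweet
    counts = Counter(tweet_labels)                      # sum up the label predictions
    return counts.most_common(1)[0][0]                  # label with the highest score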
3.4 PAN'17 Results

In the context of the PAN'17 AP task, we decided to submit the loose classification approach. Indeed, as we saw in the previous subsection, the loose classification approach yields the best overall results in terms of variety prediction accuracy. This is particularly noticeable when we consider English and Spanish variety prediction. Moreover, one of the main issues regarding "gender then variety" successive classification is that one must first achieve a high-quality classification on gender. Unfortunately, the results we obtained on this particular label were not very promising. The official results we obtained are shown in Table 6.

Table 6: Official results for the PAN'17 Author Profiling task

Language     Label     Accuracy
Arabic       Gender    0.6856
             Variety   0.7544
             Joint     0.5475
English      Gender    0.7546
             Variety   0.7588
             Joint     0.5704
Spanish      Gender    0.6968
             Variety   0.9168
             Joint     0.6400
Portuguese   Gender    0.6638
             Variety   0.9750
             Joint     0.6475

Although we do not have the results of the other participants yet, the results regarding gender classification are not as high as we expected when we consider the results achieved in the previous PAN editions [8,7]. One could explain such a gap by the fact that we are lacking some preprocessing steps (e.g. POS tags). On the contrary, the quality of the classifiers in terms of variety prediction for Portuguese and Spanish is quite high, as they achieved 97.5% and 91.68% accuracy respectively.

4 Discussion and Perspectives

In this PAN edition, we wanted to implement a Multi-Layer Perceptron (MLP) as a learning algorithm and compare it to the other classification approaches. Indeed, the use of SVM as a learning algorithm represents more than 50% of the approaches in the PAN literature, as opposed to neural networks, which are almost non-existent in that same context. However, according to [3], a properly tuned MLP can outperform an SVM on such a task. Unfortunately, technical issues prevented us from carrying out this comparison.

In addition, we would have liked to compare different types of aggregation for the tweets. In our approach, we used only a "single tweet" classification, meaning that each tweet was considered as a document. As a consequence, the classifier could never grasp the notion of an "author" as a collection of 100 tweets and might have missed some interesting features. One could try to concatenate one author's tweets into a single chunk of text, or to consider the whole tweet collection as a single document.

As we saw in Section 3, we observed poor results in terms of gender classification. In order to improve those results, one could make use of the doc2vec tool, which could significantly improve our results according to [1]. Another way to improve the prediction of gender would be to train convolutional neural networks to automatically extract features from the tweets, as described in [14].

5 Conclusion

In this paper, we have described our approach in the context of the PAN Author Profiling task. Our main aim was to compare loose classification to successive classification. The former predicts each author's feature independently, whereas the latter uses each label prediction to split the dataset and predict the remaining author's features. We selected the best classification units by comparing combinations of multiple feature extractors and multiple classifiers while following a strict experimental protocol.
The prediction rates regarding gender constrained us to submit a software implementing the loose classification.

References

1. Bartle, A., Zheng, J.: Gender classification with deep learning. Text-Interdisciplinary Journal (2003)
2. Biau, G.: Analysis of a random forests model. Journal of Machine Learning Research 13, 1063–1095 (2012)
3. Dichiu, D., Rancea, I.: Using machine learning algorithms for author profiling in social media. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum. pp. 858–863 (2016)
4. Grivas, A., Krithara, A., Giannakopoulos, G.: Author profiling using stylometric and structural feature groupings. In: CLEF (Working Notes) (2015)
5. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998)
6. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17). Berlin Heidelberg New York (Sep.)
7. Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G.J.F., SanJuan, E. (eds.) CLEF (Working Notes). CEUR Workshop Proceedings, vol. 1391 (2015)
8. Rangel, F., Rosso, P., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W., et al.: Overview of the 2nd author profiling task at PAN 2014. In: CEUR Workshop Proceedings. vol. 1180, pp. 898–927 (2014)
9. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF (Working Notes). CEUR Workshop Proceedings, vol. 1609 (2016)
10. Rish, I.: An empirical study of the naive Bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence. vol. 3, pp. 41–46. IBM, New York (2001)
11. Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4), 427–437 (2009)
12. Steinwart, I., Christmann, A.: Support Vector Machines. Springer Publishing Company, Incorporated, 1st edn. (2008)
13. Vollenbroek, M.B.O., Carlotto, T., Kreutz, T., Medvedeva, M., Pool, C., Bjerva, J., Haagsma, H., Nissim, M.: GronUP: Groningen user profiling. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September 2016. pp. 846–857 (2016)
14. Zhang, X., Zhao, J.J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. pp. 649–657 (2015)