Subword-based Deep Averaging Networks for Author Profiling in Social Media
Notebook for PAN at CLEF 2017

Marc Franco-Salvador, Nataliia Plotnikova, Neha Pawar, and Yassine Benajiba
Symanto Research, Nuremberg, Germany
{marc.franco,nataliia.plotnikova,neha.pawar,yassine.benajiba}@symanto.net

Abstract. Author profiling aims at identifying the authors' traits on the basis of their sociolect, that is, how language is shared within their group. This work describes the system submitted by Symanto Research to the PAN 2017 Author Profiling Shared Task. The current edition focuses on language variety and gender identification on Twitter. We address these tasks by exploiting the morphology and semantics of words. For that purpose, we generate embeddings of the authors' texts based on subword character n-grams. These representations are classified using deep averaging networks. Experimental results show competitive performance in the evaluated author profiling tasks.

1 Introduction

Author profiling aims at identifying the authors' traits on the basis of their sociolect, that is, how language is shared within their group. It is used to determine language variety, gender, age, and personality type, among others. The task is especially attractive to industry and particularly helpful for author opinion segmentation in social media. For instance, identifying the geographical distribution and gender of opinion authors may help to improve marketing campaigns. The task is also important for digital text forensics: given a threatening text, knowing the likely traits of its author may help to identify them.

The Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN)1 evaluation lab at the Conference and Labs of the Evaluation Forum (CLEF)2 promotes research and innovation in digital text forensics. Its Author Profiling Shared Task sets the objective of classifying authors' traits in several subtasks. These have included the identification of age, cross-genre age, personality traits, and gender in social media. The current edition3 focuses on language variety and gender identification on Twitter.

1 http://pan.webis.de/
2 http://www.clef-initiative.eu/
3 http://pan.webis.de/clef17/pan17-web/author-profiling.html

Both morphological [1,6] and semantic [7,2] features have proven to be highly discriminant in author profiling. Building on that research, in this work we exploit word morphology and semantics to identify the authors' language variety and gender. We present an approach based on word embeddings which are generated using subword information, i.e., by means of character n-gram embeddings [3]. We classify the author traits using deep averaging networks, a recent technique which magnifies the most discriminant dimensions contained in an embedding average. This has been demonstrated to be a fast and competitive approach in several text classification tasks [10], rivalling the performance of recurrent and convolutional neural networks.

The rest of this work is structured as follows: in Section 2 we provide an overview of the state of the art in author profiling. In Section 3 we describe the system we employed for the PAN 2017 Author Profiling Shared Task. Next, in Section 4 we present our evaluation and discuss the results. Finally, we draw our conclusions in Section 5.

2 Related Work

Authorship attribution [12], the task of identifying authors' stylistic discriminators, set the stage for the author profiling task.
The use of stylistic features such as character and part-of-speech (PoS) n-grams, as well as spelling and grammatical errors, allowed the identification of an author's native language [13]. Similarly, [26] identified age and gender in blogs using stylistic and content word features. The popularity of author profiling motivated the organization of several workshops and shared tasks. The Native Language Identification Shared Task [27] asked participants to classify English essays representing eleven native languages. The Shared Task on Discriminating between Similar Languages (DSL) set the objective of classifying texts representing several sets of closely related languages and language varieties [29,30,17]. Since 2013, the PAN evaluation lab has organized the Author Profiling Shared Task. The first two editions focused on age and gender identification [22,21]. In addition to these two tasks, personality trait recognition was included in 2015 [19]. Finally, the focus of the 2016 edition was cross-genre age and gender identification [24,23]. This year, the PAN author profiling track focuses on the tasks of language variety and gender identification.

Regarding the latter, most of the recent work on gender identification originated in the PAN evaluation lab. The winning system of the 2013-2015 editions is based on a document representation which captures discriminative and subprofile-specific information [14]. Similar to the early work on the subject, the best performing system in 2016 employed content words, emoticons, and stylistic features [4].

The language variety identification task has attracted much interest in the last few years. Character n-grams and other features have been employed to identify varieties of Portuguese in news texts [28], Arabic in blogs and forums [25], and Spanish in tweets [15]. Word embeddings were used to classify varieties of Spanish from blogs and journalistic texts [7,8]. Also in the Spanish blog domain, [20] employed a low-dimensional model based on text statistics. The best performing system of DSL 2015 [16] used an ensemble of models based on word and character n-grams.

Unlike the majority of author profiling researchers, who employ stylistic and lexical features, our approach is based on character n-gram word embeddings, which exploit the morphology and semantics of words. This choice has also been driven by our motivation to experiment with a pipeline that can be replicated fairly easily by researchers who want to compare results, and by practitioners in need of a simple, yet accurate, author profiling pipeline.

3 Proposed Approach

In this section we describe the system we designed for language variety and gender identification on Twitter. First, in Section 3.1 we describe our data preprocessing. Next, in Section 3.2 the embedding representations are described. Finally, in Section 3.3 we detail our classifier.

3.1 Preprocessing

We preprocess each text with tokenization, lowercasing, and URL removal. We use the Tweet NLP4 tokenizer, which is specific to English tweets. We slightly modified its regular expressions to handle Arabic, Portuguese, and Spanish punctuation, e.g., '¿' and '¡' were included for Spanish. A minimal sketch of this step is shown below.

4 http://www.cs.cmu.edu/~ark/TweetNLP/
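To make the preprocessing concrete, the following is an illustrative sketch of this step. It approximates the Tweet NLP tokenizer with a simple regular-expression tokenizer; the patterns and the preprocess function are our own stand-ins, not the exact rules used in the system.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Simplified stand-in for the (modified) Tweet NLP tokenizer: it keeps
# word characters together and splits off punctuation, including the
# Spanish inverted marks '¿' and '¡', as separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def preprocess(tweet):
    """Lowercase a tweet, remove URLs, and tokenize it."""
    tweet = URL_RE.sub("", tweet.lower())
    return TOKEN_RE.findall(tweet)

print(preprocess("¡Hola! Mira esto https://t.co/abc ¿Vienes?"))
# ['¡', 'hola', '!', 'mira', 'esto', '¿', 'vienes', '?']
```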
3.2 Subword Character n-gram Embeddings

In recent years, word embeddings have replaced the bag-of-words (BOW) representation as the standard for text feature extraction.5 These representations are d-dimensional real-valued vectors which capture semantic and syntactic aspects of text. The continuous skip-gram model [18] of the word2vec toolkit is the preferred alternative for generating these embeddings.

5 We note the increasing number of papers published at the ACL conference with "word embeddings" or "distributed representations" as part of the title: 0 (2013), 3 (2014), 15 (2015), and 29 (2016).

We should note the importance of morphology in author profiling. For instance, the derivation of words is a discriminant feature in English language variety identification, e.g., regularized vs. regularised. As an additional example, morphological inflection is indicative of gender in Latin languages, e.g., profesor vs. profesora in Spanish (the male and female translations of the word professor, respectively). In this work we use a recent variant of the continuous skip-gram model [3] which generates word embeddings exploiting word morphology by means of character n-gram embeddings. In addition to better capturing the morphological nuances mentioned above, a character-based embedding model also helps to create classification models that are robust in the presence of typos and abbreviations, as is usually the case in social media data.

When it comes to learning these embeddings, the main difference of this subword model lies in the scoring function used to estimate the probability of observing a context word w_c given a target word w_t. The original model used the scalar product of the word vectors as the score: s(w_t, w_c) = u_{w_t}^T v_{w_c}, where u_{w_t} and v_{w_c} are vectors in R^d. The subword model instead uses a scoring function which represents the target word as the sum of its character n-gram vectors:

    s(w, c) = \sum_{g \in G_w} z_g^T v_c,    (1)

where G_w ⊂ {1, ..., G} is the set of n-grams of the word w, and z_g and v_c are vectors in R^d. For instance, with n = 3 the word where yields the character n-grams <wh, whe, her, ere, re>, where the angle brackets mark word boundaries. Key to the model's design is the use of a hashing function that maps n-grams to integers representing vector indexes. This makes the model memory efficient and provides an additional feature: it does not suffer from out-of-vocabulary words. The embedding of an unknown word is created by extracting its n-grams and averaging the vectors whose indexes are returned by the hash function. For more details about the model please refer to the original work [3].

We generate a word embedding inventory for the training partition (see Section 4.1) of each language using the FastText library.6 We use 300-dimensional vectors, context windows of size 10, 20 negative words for each sample, 15 epochs, and 2M hashed character n-gram vectors. We extract n-grams with lengths from 3 to 6. We post-process and enrich the embeddings with a proprietary model © Symanto Research. A sketch of this training setup is shown below.

6 https://github.com/facebookresearch/fastText
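As an illustration, the training setup described above can be reproduced with the fastText Python bindings roughly as follows. The input file name is hypothetical (one preprocessed, whitespace-tokenized text per line), and the proprietary post-processing/enrichment step is omitted.

```python
import fasttext

# "tweets_en.txt" is a hypothetical input file containing one
# preprocessed, whitespace-tokenized text per line.
model = fasttext.train_unsupervised(
    "tweets_en.txt",
    model="skipgram",    # continuous skip-gram objective
    dim=300,             # embedding dimensionality d
    ws=10,               # context window size
    neg=20,              # negative samples per target word
    epoch=15,
    minn=3, maxn=6,      # character n-gram lengths 3 to 6
    bucket=2_000_000,    # 2M hash buckets for n-gram vectors
)

# Thanks to the n-gram hashing, unseen or misspelled words still
# receive an embedding computed from their character n-grams.
vector = model.get_word_vector("regularised")
print(vector.shape)  # (300,)
```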
3.3 Deep Averaging Networks

A standard method to obtain vector representations of texts consists of computing the average of their word embeddings [5]. This embedding composition method obtained good results in language variety identification [7]. However, the longer the text, the more abstract the resulting embedding becomes.

In this work we classify using Deep Averaging Networks (DAN) [10]. As illustrated in Figure 1, this model receives as input the word embeddings of the text. First, a composition layer averages those embeddings. Then, one or more non-linear hidden layers transform the computed average. Finally, a softmax layer is used for prediction. The rationale behind DAN is that the non-linear transformations applied to the average magnify and capture subtle variations in a more precise manner. As reported in the original paper, this approach can outperform syntactically informed approaches despite its simplicity.

Figure 1: Illustration of the DAN architecture

Our hidden layers have the same size as the embeddings and use rectified linear units (ReLU) [9] as the activation function. We use the cross-entropy loss function. The number of hidden layers is determined in Section 4.2. We optimize the neural network weights with Adam [11], using a learning rate of 0.001 and 100 epochs, with the remaining parameters as indicated in its original work. We should note that our word embeddings are static: we do not allow the model to modify them. A simplified sketch of this architecture is shown below.
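The following is a minimal sketch of such a DAN classifier in PyTorch, under the settings described above (frozen input embeddings, hidden layers of the embedding size, ReLU activations, cross-entropy loss, Adam with learning rate 0.001). The layer and class counts are examples; this is an illustration, not our exact implementation.

```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    """Deep Averaging Network: average the input word embeddings,
    transform the average with ReLU hidden layers, and classify."""

    def __init__(self, dim=300, n_hidden=2, n_classes=6):
        super().__init__()
        layers = []
        for _ in range(n_hidden):
            # Hidden layers have the same size as the embeddings.
            layers += [nn.Linear(dim, dim), nn.ReLU()]
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, embeddings):
        # embeddings: (batch, n_words, dim), pre-computed and frozen;
        # the mean implements the composition (averaging) layer.
        avg = embeddings.mean(dim=1)
        return self.out(self.hidden(avg))  # logits; softmax is in the loss

model = DAN(dim=300, n_hidden=2, n_classes=6)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally
```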
4 Evaluation

In this section we evaluate our approach in the PAN 2017 Author Profiling Shared Task.

4.1 Datasets and Task Setting

Dataset. The objective of the PAN 2017 author profiling shared task is to identify the language variety and gender of Twitter users. Its corpus contains four languages and nineteen language varieties:

- Arabic (Egypt, Gulf, Levantine, and Maghrebi).
- English (Australia, Canada, Great Britain, Ireland, New Zealand, and United States).
- Portuguese (Brazil and Portugal).
- Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, and Venezuela).

Next, we mention some key remarks about the dataset. The language of each user is known, so the dataset is composed of four partitions; Table 1 shows their statistics. The labels are balanced at both the language variety and the gender level. Finally, each Twitter user is represented by a set of approximately 100 tweets.7 In this work, we concatenate the user tweets to obtain a single instance per user. We explored other alternatives, such as the independent classification of the tweets with a subsequent sum of the class probabilities [7]. However, on this dataset, we obtained higher results after concatenating the tweets.

7 Each tweet is composed of up to 140 characters.

Table 1: Statistics of the PAN 2017 author profiling shared task dataset.

Statistic            Arabic  English  Portuguese  Spanish
Training users       2,400   3,600    1,200       3,200
Test users           800     1,200    400         1,400
Language varieties   4       6        2           7

Methodology. We compare our results with those obtained by a random baseline, a BOW model classified with random forest, a model based on continuous skip-gram embedding averages classified with logistic regression, and a model based on subword embedding (see Section 3.2) averages classified with logistic regression. In the rest of the evaluation we refer to these models as Random, BOW, skip-gram emb., and subword emb., respectively. The prototype of our model (henceforth simply referred to as DAN) was designed using 10-fold cross-validation over the training sets. The parameter selection uses the same setting. The official measure of the competition is accuracy. The ranking of the shared task participants is computed as follows: i) for each language, the PAN organizers calculate individual accuracies for gender and variety identification; ii) they calculate the joint accuracy, i.e., the accuracy when both variety and gender are correctly predicted together; and iii) the final ranking is obtained by averaging the accuracy values obtained per language. A numeric illustration of this computation is given at the end of Section 4.

4.2 Parameter Selection

We noticed during our experimentation phase that the performance of DAN is very sensitive to the number of hidden layers, which differs depending on the task and dataset. Figure 2 shows the accuracy as a function of the number of hidden layers, task, and language. As can be seen, both tasks benefit from adding layers after the composition/averaging one. The best performance for language variety identification is achieved using two layers. In contrast, the optimal number of hidden layers for gender identification differs depending on the language. We use the best parameters determined in this section for the rest of the evaluation.

Figure 2: Deep averaging network accuracy as a function of the number of hidden layers: (a) language variety, (b) gender.

4.3 Results and Discussion

In this section we compare and discuss the results of our system. Table 2 shows the development experiments and the comparison with the baseline models (see Section 4.1) using 10-fold cross-validation over the training set. As we can see, the three embedding-based models outperform BOW, the only purely lexical approach. The continuous skip-gram embedding averages classified with logistic regression obtain better results than the subword embedding averages in tasks such as language variety identification in Arabic or gender identification in Portuguese. However, the latter model offers on average higher results than the skip-gram one. Finally, DAN, using the same subword embeddings, obtains the highest results and shows that deep averaging networks are useful in author profiling to magnify the most discriminant values contained in an embedding average.

Table 2: Classification accuracy (in %) using 10-fold cross-validation with the training partition.

Task      Model           Arabic  English  Portuguese  Spanish  Average
Language  Random          25.0    16.7     50.0        14.3     26.5
variety   BOW             71.2    59.4     88.7        75.1     73.6
          Skip-gram emb.  73.0    62.4     98.6        80.6     78.7
          Subword emb.    70.7    68.3     98.5        79.6     79.3
          DAN             80.6    76.5     98.9        91.0     86.8
Gender    Random          50.0    50.0     50.0        50.0     50.0
          BOW             66.4    66.7     71.0        63.4     66.9
          Skip-gram emb.  71.2    78.4     76.5        73.3     74.8
          Subword emb.    73.7    78.8     72.6        74.5     74.9
          DAN             74.5    80.8     78.8        75.5     77.4

In Table 3 we show the results using the official test set of the shared task. This table also includes the joint accuracy, which is employed by the organizers to determine the best system, i.e., the accuracy when both variety and gender are properly predicted together.

Table 3: Test classification accuracy (in %).

Task      Model   Arabic  English  Portuguese  Spanish  Average
Language  Random  25.0    16.7     50.0        14.3     26.5
variety   DAN     76.6    75.9     97.9        90.0     85.1
Gender    Random  50.0    50.0     50.0        50.0     50.0
          DAN     73.0    79.6     76.9        77.2     76.7
Joint     Random  12.5    8.4      25.0        7.2      13.3
          DAN     56.9    60.5     75.3        70.2     65.7

As we can see, DAN's results are in line with those obtained using the 10-fold cross-validation setting. We also observe how the joint accuracy falls compared to the isolated language variety and gender results. This manifests the difficulty of the joint classification task, which remains an open problem.

Our final comments analyse how the difficulty of this shared task differs depending on the task and language. Identifying gender is clearly more difficult than identifying language variety. Although the former task has only two possible labels, gender differences are generally more subtle and require more context and topic understanding. In contrast, language variety peculiarities are differentiable using both lexical and semantic aspects of text. These lexical and semantic aspects are also the cause of the differences across languages. English and Arabic varieties are more similar at the lexical level than Portuguese or Spanish ones. However, the low number of Portuguese varieties employed in this work also plays a role. Finally, considering the high number of Spanish varieties and its high results, we also consider that some languages have tweets with topics more indicative of the variety, e.g., topics about politics or events.
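To make the joint accuracy and the official ranking concrete, the following is a small illustration with hypothetical per-user correctness arrays; the numbers are invented for the example.

```python
import numpy as np

# Hypothetical correctness indicators for four users of one language:
# 1 means the corresponding label was predicted correctly.
variety_ok = np.array([1, 1, 0, 1])
gender_ok  = np.array([1, 0, 0, 1])

variety_acc = variety_ok.mean()                # 0.75
gender_acc  = gender_ok.mean()                 # 0.50
joint_acc   = (variety_ok & gender_ok).mean()  # both correct: 0.50

# The final ranking score averages such accuracy values over the
# four languages of the corpus.
```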
5 Conclusions

In this work we presented the system designed by Symanto Research for the PAN 2017 author profiling shared task. The pipeline we presented is easily replicable and yields good performance while remaining robust and flexible in the presence of noisy data. We described an approach based on subword character n-gram embeddings and deep averaging networks, and we explained the rationale behind using these components in author profiling. We compared our approach with several well-known baseline models. Experimental results in the tasks of language variety and gender identification show that our approach is a competitive alternative. Future work will investigate further how to employ semantic representations and deep learning techniques in the task of author profiling.

References

1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Communications of the ACM 52(2), 119–123 (2009)
2. Bayot, R., Gonçalves, T.: Author Profiling using SVMs and Word Embedding Averages—Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal. CEUR-WS.org (Sep 2016)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
4. Busger op Vollenbroek, M., Carlotto, T., Kreutz, T., Medvedeva, M., Pool, C., Bjerva, J., Haagsma, H., Nissim, M.: GronUP: Groningen User Profiling—Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal. CEUR-WS.org (Sep 2016)
5. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), 2493–2537 (2011)
6. Estival, D., Gaustad, T., Pham, S.B., Radford, W., Hutchinson, B.: TAT: an author profiling tool with application to Arabic emails. In: Proceedings of the Australasian Language Technology Workshop. pp. 21–30 (2007)
7. Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., Martí, M.A.: Language variety identification using distributed representations of words and documents. In: Proceedings of the 6th International Conference of CLEF on Experimental IR meets Multilinguality, Multimodality, and Interaction (CLEF 2015). vol. LNCS(9283). Springer-Verlag (2015)
8. Franco-Salvador, M., Rosso, P., Rangel, F.: Distributed representations of words and documents for discriminating similar languages. In: Proceedings of the RANLP Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial). Hissar, Bulgaria (2015)
9. Hahnloser, R.H., Sarpeshkar, R., Mahowald, M.A., Douglas, R.J., Seung, H.S.: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 947–951 (2000)
10. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Association for Computational Linguistics (2015)
11. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
12. Koppel, M., Schler, J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis. vol. 69, p. 72 (2003)
13. Koppel, M., Schler, J., Zigdon, K.: Automatically determining an anonymous author's native language. In: Intelligence and Security Informatics, pp. 209–217. Springer (2005)
14. López-Monroy, A.P., y Gómez, M.M., Escalante, H.J., Villaseñor-Pineda, L., Stamatatos, E.: Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems 89, 134–147 (2015)
15. Maier, W., Gómez-Rodríguez, C.: Language variety identification in Spanish tweets. In: Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants. pp. 25–35. Doha, Qatar (October 2014)
16. Malmasi, S., Dras, M.: Language identification using classifier ensembles. In: Proceedings of the RANLP Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial). Hissar, Bulgaria (2015)
17. Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J.: Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task. In: Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (VarDial). Osaka, Japan (2016)
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26 (NIPS 2013). pp. 3111–3119 (2013)
19. Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd Author Profiling Task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8-11 September, Toulouse, France. CEUR-WS.org (Sep 2015)
20. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: Proceedings of the 17th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2016). Springer-Verlag (2016)
21. Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, 15-18 September, Sheffield, UK. CEUR-WS.org (Sep 2014)
22. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the Author Profiling Task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, 23-26 September, Valencia, Spain (Sep 2013)
23. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR-WS.org (2017)
24. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016)
25. Sadat, F., Kazemi, F., Farzindar, A.: Automatic identification of Arabic language varieties and dialects in social media. In: Proceedings of the 1st International Workshop on Social Media Retrieval and Analysis (SoMeRA) (2014)
26. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol. 6, pp. 199–205 (2006)
27. Tetreault, J., Blanchard, D., Cahill, A.: A report on the first native language identification shared task. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. pp. 48–57. Citeseer (2013)
28. Zampieri, M., Gebre, B.G.: Automatic identification of language varieties: The case of Portuguese. In: KONVENS 2012 - The 11th Conference on Natural Language Processing. pp. 233–237. Österreichische Gesellschaft für Artificial Intelligence (ÖGAI) (2012)
29. Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J.: A report on the DSL Shared Task 2014. In: Proceedings of the COLING First Joint Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial). pp. 58–67. Dublin, Ireland (August 2014)
30. Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., Nakov, P.: Overview of the DSL Shared Task 2015. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial). Hissar, Bulgaria (2015)