Introduction

Multilingual Author Pro ling using LSTMs Notebook for PAN at CLEF 2018

Roy Khristopher Bayot Teresa Goncalves

0 0 Universidade de Evora Evora , Portugal

This paper shows one approach of the Universidade de Evora for author pro ling for PAN 2018. The approach mainly consists of using word vectors and LSTMs for gender classi cation. Using the PAN 2018 dataset, we achieved an accuracy of 67.60% for Arabic, 77.16% for English, and 68.73% for Spanish gender classi cation.

author pro ling twitter word vectors word2vec LSTM

Introduction

Communication methods have changed rapidly in the recent years especially with the rise of di erent social media platforms such as Facebook, Instagram, and Twitter. Aside from exchanges in these social media platforms, there are also platforms such as message boards, question answering sites, and recommendation sites such as Reddit, Quora, Yelp, and Amazon that also make up activity online.

Fake pro les are one of the problems with these online communication models. Incomplete information in someone's pro le is also a problem. And thus, analyzing the authorship is one way to take measures with this problem. One component of analyzing the authorship of a text is pro ling, may it be determining aspects such as age, gender, or personality.

Our work tries a method on gender author pro ling in English, Spanish, and Arabic with twitter text using long short term memory recurrent neural networks and organized as follows: Section 2 covers related literature where it initially discusses previous author pro ling endeavors, then followed by methods in PAN, followed by long short term memory recurrent neural networks in conjunction with word vectors. Section 3 describes the author pro ling task as well as the dataset. Section 4 describes the methodology and results, beginning with the creation of word2vec vectors, the model, details of the training and then how it was evaluated. Section 5 gives the conclusion and recommendations. content-based features such as the 1000 frequent words in the text with high information gain. The work also used style-based features such as the nodes of a taxonomic tree made from systemic functional linguistics [ 11 ].

Schler et al. in [ 33 ] provides another example of author pro ling. It is still centered on gender and age classi cation and it used stylistic and content features. Parts-of-speech tags, function words, hyperlinks, and non-dictionary words composed the stylistic features while word unigrams with high information gain comprised the content features. These were the features were then used on a Multi-Class Real Winnow for the classi cation. 2.1

PAN Editions

PAN is one of the initiatives at CLEF that has various tasks related to author analysis. It has author identi cation, obfuscation, and pro ling. The author proling task has been running since 2013, with di erent aspects to the task during every year.

The focus for PAN 2013 [ 26 ] was age and gender pro ling. The corpus used then were blogs in Spanish and English. The focus was extended more sources in PAN 2014 [ 26 ]. This edition had texts from blogs, reviews, twitter, and social media. The task was again expanded in PAN 2015 [ 28 ]. In that year, age and gender classi cation was also accompanied with regression for personality traits. The personality traits included extroversion, stability, agreeableness, conscientiousness, and openness. However, there was only a twitter corpus on four languages - English, Spanish, Italian, and Dutch. The focus for PAN 2016 [ 29 ] was cross-genre evaluation. The idea was to train models on tweets and test them on blogs, reviews, and social media. The languages included during this year were English, Spanish, and Dutch.

The focus for PAN 2017 [ 27 ] was on determining the author origin given a speci c text aside from gender classi cation. For instance, an English tweet could come from an author from the US, UK, Canada, Ireland, New Zealand, and Australia. A Portuguese tweet could be from Portugal or Brazil. An Arabic tweet could be from Egypt, Gulf, Levantine, or Maghrebi. And a Spanish tweet could be from Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela.

Majority of the approaches is similar to Argamon et al. [ 2 ] and Schler et al. [ 33 ] wherein features are content-based or stylistic-based. There are also other features that are n-grams based or information retrieval based. The classi ers used are also vary from the use of logistic regression, multinomial Nave Bayes, liblinear, random forests, Support Vector Machines, and decision tables.

Among some of the variations include Weren et al. [ 39 ] where the work also used length features such as number of characters, words, sentences. Their approach also check for infomation retrieval features such as cosine similarity, as well as readability features such as Flesch-Kincaid readability score. Marquardt et al. in [ 17 ] also used a combination of content-based features (MRC, LIWC, sentiments) and stylistic features (readability, html tags, spelling and grammatical error, emoticons, total number of posts, number of capitalized letters number of capitalized words). Maharjan et al. [ 16 ] used n-grams with stopwords, punctuations, and emoticons. The work also included the idf count. Villena Roman et al. [ 38 ] used term vector model representation.

One of the more prominent approaches in the previous editions is that of Lopez-Monroy et al. in [ 14 ]. It is prominent in the sense that it worked best for most tasks in most editions. They placed second for both English and Spanish in 2013 where they used second order representation based on relationships between documents and pro les. Another work that placed rst for English in 2013 was that of Meina et al. [ 19 ] while Santosh et al. in [ 32 ] worked well for Spanish. The work of Meina et al. [ 19 ] used collocations while Santosh et al. [ 32 ] used POS features. The work of Lopez-Monroy et al. in [ 15 ] gave the best result with an average accuracy of 28.95% on all corpus-types and languages for PAN 2014. They used the same method same method as the previous year [ 14 ].

The work of Alvarez-Carmona et al. [ 1 ] used second order pro les similar to previous years. They used it in conjunction with LSA to get the best results on English, Spanish, and Dutch for PAN 2015. The work of Gonzales-Gallardo et al. [ 10 ] on the other hand, used character n-grams and POS n-grams that gave the best result for Italian.

In 2016, there had been multiple comparisons since the test genre for the early bird was di erent from that of the nal evaluation, and there had been comparisons with the earlier years as well. However, looking at the nal ranking, the top 3 are Busger et al. [ 23 ], Modaresi et al. [ 22 ], and Bilan et al. in [ 4 ]. Individually, the work of Bougiatiotis and Krithara [ 5 ] is the top for English while the work of Deneva et al. [ 9 ] is the top for Dutch, while Busger et al. [ 23 ] and Modaresi et al. [ 22 ] are tied for Spanish. Busger et al. [ 23 ] used combinations of stylistic features such as function words, parts-of-speech, emoticons, and punctuations signs. The combined this with second order representation and trained their models with SVM. Modaresi et al. [ 22 ] used a combination of lexical features with word and character n-grams together with stylometric features as inputs to a logistic regression classi er. Bilan et al. [ 4 ] used parts-of-speech, collocations, connective words and various other stylometric features for its classi cation. Bougiatiotis and Krithara [ 5 ] also used stylometric features with character n-grams and the second order representation in conjunction with SVM.

In 2017, although there have been approaches that are more related to deep learning such as RNN [ 8 ] and CNN [ 13 ], most of the top results were given using SVMs [ 7 ]. For instance, the top result came from Basile et al. [ 3 ] who used a combination of character and tf-idf n-grams to train an SVM. The second result came from Martinc et al. [ 18 ] who used a combination of character, word, and POS n-grams, emojis, sentiments, character ooding, and lists of words per variety as features to a logistic regression classi er. The third best result done by by Tellez et al. [ 36 ] also used an SVM. 2.2

LSTM and word vectors

Most of the previous approaches hinges on extracting prede ned features such as that for style and content. However, a recent trend is to use neural networks to learn certain lters at run time and use the learned lters to generate a feature representation suitable for classi cation. This approach need two things - word vectors and the neural network architecture.

Word vectors or word embeddings are needed to be created to represent words in a dictionary. These vectors capture some semantic relation between the words and word2vec is one of the prominent vectors developed by Mikolov in [ 20 ] [ 21 ]. To create the vectors, random numbers are initially used for words from a dictionary of a corpus such as a Wikpedia dump. Then, by going through the text in the corpus, a word's vector representation is learned by predicting using adjacent words. Getting the vector can be done through either skip grams or continuous bag of words (CBOW). In CBOW, the word vector is predicted given the context of adjacent words while it is the opposite in skip grams. The context words are predicted given a word. The word vectors are then updated after all the predictions are made.

Choosing an architecute comes next after creating word vectors. Among neural network architectures, recurrent neural networks are speci cally good for sequences such as text since it uses the previous inputs along with the current input for prediction. This can be shown in the simple recurrent network developed by Je Elman in the paper [ 8 ]. However, recurrent neural networks usually su er from vanishing gradient problem especially with long sequences. One way this was dealt with was using long short term memory units which originally proposed by Hochreiter and Schmidhuber in [ 12 ]. The idea was to make analog gates that control which information could be stored and which information could be used in remembered. This allowed the propagated errors to be more constant and thus help with the vanishing gradient problem.

An example of using LSTM for classi cation is that of Rao and Spasojevic in [ 30 ]. They used LSTMs for two di erent datasets - that of customer service and that of political leaning. Their problem was to classify text as either actionable or non-actionable in the domain of customer service, while classifying either Republican or Democrat in terms of political leaning.

Another example is that of Tang et al. in [ 35 ] for document classi cation. In their work, CNNs and LSTMs were used to learn sentence representations and the results were encoded with a gated recurrent network. Their model was used on reviews of IMDB and Yelp. 3

Methodology and Results

The gure 1 given below shows an overview description of the system from how the dataset is manipulated before fed into the LSTM and how it is evaluated. The details are described in the following subsections. 3.1

Dataset

In the current edition of PAN 2018 [ 34 ] for author pro ling [ 25 ], the task is to predict gender based on text, images, or both. The current dataset has 1500 users for Arabic, and 3000 users each for English and Spanish. These are all balanced with an equal number of male and female.

Each user has 100 tweets and 10 images. The images do not necessarily contain an image of a person who is the user but an assorted number of images the user has on the pro le. The idea is to train a model to classify a user with tweets alone, or images alone, or both. The model is then submitted to the TIRA server [ 24 ] for evaluation over a held-out test set. 3.2

Pre-trained Vectors

Word embeddings were created from Wikipedia dumps. The February 05, 2016 wikipedia dump was used for English and Spanish. The English wikipedia dump at that time was 11.8Gb compressed while the Spanish had 2.2Gb compressed. The Arabic wikipedia dump was from March 20, 2018 with about 600Mb compressed. These dumps were then extracted and transformed into lowercase and entries are in one le. The word2vec implementation of gensim [ 31 ] was used to generate our own vectors by using the wikipedia text as input. In terms of word2vec parameters, no lemmatization was done, and the window size used was 5. Skip grams instead of continuous bag of words was used as the method to generate the vectors and nally the size of the embeddings chosen was 300. 3.3

Preprocessing

Before training the model, we prepared the training le. The training le was done by preprocessing the XML les. Each user has one XML les and the tweets were extracted to form one training example. The examples were all put to lower case. No stop words are removed. Hash tags, numbers, mentions, shares, and retweets were not processed or transformed to anything else. They were retained as is and will correspond to another item in the dictionary of words. The test set from TIRA were also processed in the same manner. 3.4

Model

We have a basic model for this experiment that is implemented in Keras [ 6 ] with a Theano [ 37 ] backend ran on an NVIDIA Tesla K20c GPU. The model mainly consists of an embedding layer, one LSTM layer, and one dense layer.

All words in the training set are turned into number indices that corresponds to a word vector. Each training example will be represented by a sequence of numbers. The sequence length will vary. The total number of indices in the sequence is held at 64 and padding is done to ensure it. We then feed the sequence into the system. Each number will be looked up in the embedding layer and converted to a word vector according to the pre-trained word vectors previously discussed. Then the word vectors passed to an LSTM layer with an output of 32, a dropout value of 0.2, and a recurrent dropout value of 0.5. The output of the LSTM layer is sent to a dense layer with 2 output units and a sigmoid activation function.

Stochastic gradient descent over shu ed mini-batches with the Adam update rule was used for training. Each mini-batch is had 4096 examples. The development set is comprised of 20% of the training set. We also kept the number of epochs to 200 and to provide for early stopping. We saved the best model trained and used it for our test. 3.5

Evaluation

When the training nishes and the model is saved, we load the model from the test le and apply it on the tweets per user. After getting a predictions for all the tweets, we used the majority prediction as a nal prediction for the user. 3.6

Results

We used the model on the full training set and predicted the gender. The confusion matrix of the results are given in tables 1, 2, 3 for Arabic, English, and Spanish respectively.

It gives an accuracy of 81.80% for Arabic, a misclassi cation rate of 18.20%. It's also 82.44% likely to be an actual male when it predicts males. It's also rougly similar for predicting females, which has 81.18% to be actually female. English accuracy is 85.13%. The misclassi cation rate is about 14.87%. When it predicts male, it is 86.55% likely to be actually male. When it predicts female, it is 83.82% likely to predict female. Finally, Spanish accuracy is at 75.27% with a misclassi cation rate of 24.73%. When it predicts male, it's likely to be actually male by 73.51%. When it predicts female, it's likely to be actually female by 77.30%.

However, when this model was applied to the test set in the TIRA servers, the results are lower than the given accuracies. The results of our approach are in table 4. Comparing with the results from other contestants, our approach was 18th globally. We ranked 20 out of 23 for Arabic, 17 out of 23 for English, and 19 out of 23 for Spanish. The highest accuracy achieved for Arabic text was 0.8170, English text was 0.8221, and Spanish text was 0.8200. The di erence between the accuracies are 14.1%, 5.05%, and 13.27% for Arabic, English, and Spanish respectively. English has the closest gap. Perhaps it is also due to the word vectors used. Since the word vectors used came from a bigger resource, 11.8Gb against 2.2Gb and 600Mb of the other languages, it could have contributed to better vectors that were used in the classi cation problem.

Predicted

Male Female

Conclusion and Recommendation

To summarize, we were able to use word vectors together with long short term memory networks as for classi cation. Our approach is higher than 50% however it is among one of the lowest in terms of accuracy. We submitted a naive approach to LSTM and there are multiple parameters that still could be explored. Aside from the breadth of hyperparameters, it would also be interesting to see if using a vector for characters instead of words would be useful. This could be an interesting direction since some of the past approaches to classi cation that worked well has used character ngrams. Another approach could also be a way to incorporate stylometric features to the model.

1. Miguel A Alvarez-Carmona , A Pastor Lopez-Monroy , Manuel Montes-y Gomez, Luis Villasen~or- Pineda , and Hugo Jair-Escalante. Inaoe's participation at pan'15: Author pro ling task . In Working Notes Papers of the CLEF 2015 Evaluation Labs , 2015 .

Shlomo

Argamon , Moshe Koppel, James W Pennebaker, and

Jonathan

Schler . Automatically pro ling the author of an anonymous text . Communications of the ACM , 52 ( 2 ): 119 { 123 , 2009 .

Angelo

Basile , Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and

Malvina

Nissim . N-gram: New groningen author-pro ling model . arXiv preprint arXiv:1707.03764 , 2017 .

Ivan

Bilan and

Desislava

Zhekova . Caps: A cross-genre author pro ling system . In CLEF (Working Notes) , pages 824 { 835 , 2016 .

Konstantinos

Bougiatiotis and

Anastasia

Krithara . Author pro ling using complementary second order attributes and stylometric features . In CLEF (Working Notes) , pages 836 { 845 , 2016 .

Francois

Chollet . keras. https://github.com/fchollet/keras, 2015 .

Corinna

Cortes and

Vladimir

Vapnik . Support-vector networks . Machine learning , 20 ( 3 ): 273 { 297 , 1995 .

8. Je rey L Elman. Finding structure in time . Cognitive science , 14 ( 2 ): 179 { 211 , 1990 .

Pepa

Gencheva , Martin Boyanov, Elena Deneva, Preslav Nakov,

Georgiev ,

Kiprov ,

and I

Koychev . Pancakes team: a composite system of genre-agnostic features for author pro ling . Working Notes Papers of the CLEF , 2016 .

10. Carlos E Gonzalez-Gallardo , Azucena Montes, Gerardo Sierra, J Antonio Nun~ezJuarez, Adolfo Jonathan Salinas-Lopez, and Juan Ek . Tweets classi cation using corpus dependent tags, character and pos n-grams . In Proceedings of CLEF , 2015 .

11. Michael

Halliday

, Christian MIM Matthiessen, and

Christian

Matthiessen . An introduction to functional grammar . Routledge , 2014 .

12. Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory . Neural computation , 9 ( 8 ): 1735 { 1780 , 1997 .

13. Yann

LeCun

, Leon Bottou, Yoshua Bengio, and Patrick Ha ner . Gradient-based learning applied to document recognition . Proceedings of the IEEE , 86 ( 11 ): 2278 { 2324 , 1998 .

14. Adrian Pastor Lopez-Monroy, Manuel Montes-y Gomez, Hugo Jair Escalante, Luis Villasenor-Pineda, and Esau Villatoro-Tello. Inaoe's participation at pan'13: Author pro ling task . In CLEF 2013 Evaluation Labs and Workshop , 2013 .

15. Adrian Pastor Lopez-Monroy, Manuel Montes-y Gomez, Hugo Jair Escalante, and Luis Villasen~or Pineda. Using intra-pro le information for author pro ling . In CLEF (Working Notes) , pages 1116 { 1120 , 2014 .

16. Suraj

Maharjan

, Prasha Shrestha, and

Thamar

Solorio . A simple approach to author pro ling in mapreduce . In CLEF (Working Notes) , pages 1121 { 1128 , 2014 .

17. James

Marquardt

, Golnoosh Farnadi, Gayathri Vasudevan, Marie-Francine

Moens

, Sergio Davalos, Ankur Teredesai, and Martine De Cock. Age and gender identi - cation in social media . Proceedings of CLEF 2014 Evaluation Labs , 2014 .

18. Matej

Martinc

, Iza Skrjanec, Katja Zupan, and

Senja

Pollak . Pan 2017 : Author pro ling-gender and language variety prediction . Cappellato et al.[ 13 ], 2017 .

19. Michal

Meina

, Karolina Brodzinska, Bartosz Celmer, Maja Czokow, Martyna Patera, Jakub Pezacki, and

Mateusz

Wilk . Ensemble-based classi cation for author pro ling using various features . Notebook Papers of CLEF , 2013 .

20. Tomas

Mikolov

, Kai Chen, Greg Corrado, and Je rey Dean. E cient estimation of word representations in vector space . arXiv preprint arXiv:1301.3781 , 2013 .

21. Tomas

Mikolov

, Ilya Sutskever, Kai Chen, Greg S Corrado, and

Dean . Distributed representations of words and phrases and their compositionality . In Advances in neural information processing systems , pages 3111 { 3119 , 2013 .

22. Pashutan

Modaresi

, Matthias Liebeck, and

Stefan

Conrad . Exploring the e ects of cross-genre machine learning for author pro ling in pan 2016 . In CLEF (Working Notes) , pages 970 { 977 , 2016 .

23. Mart Busger op Vollenbroek, Talvany Carlotto, Tim Kreutz, Maria Medvedeva, Chris Pool, Johannes Bjerva, Hessel Haagsma, and

Malvina

Nissim . Gronup: Groningen user pro ling . 2016 .

24. Martin

Potthast

, Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and

Benno

Stein . Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identi cation, and Author Pro ling . In Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms, editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14) , pages 268 { 299 , Berlin Heidelberg New York, September 2014 . Springer.

25. Francisco Rangel, Paolo Rosso, Manuel Montes-y- Gomez , Martin Potthast, and Benno Stein. Overview of the 6th Author Pro ling Task at PAN 2018: Multimodal Gender Identi cation in Twitter . In Linda Cappellato, Nicola Ferro, Jian-Yun Nie , and Laure Soulier, editors, Working Notes Papers of the CLEF 2018 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org , September 2018 .

26. Francisco Rangel, Paolo Rosso, Moshe Moshe Koppel, Efstathios Stamatatos, and

Giacomo

Inches . Overview of the author pro ling task at pan 2013 . In CLEF Conference on Multilingual and Multimodal Information Access Evaluation , pages 352 { 365 . CELCT , 2013 .

27. Francisco Rangel, Paolo Rosso,

Martin

Potthast , and

Benno

Stein . Overview of the 5th author pro ling task at pan 2017: Gender and language variety identi cation in twitter . Working Notes Papers of the CLEF , 2017 .

28. Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and

Walter

Daelemans . Overview of the 3nd author pro ling task at pan 2015 . In L Cappellato,

Ferro ,

Gareth , and E San Juan, editors, CLEF 2015 Labs and Workshops , Notebook Papers, volume 1391 , 2015 .

29. Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans,

Martin

Pottast , and

Benno

Stein . Overview of the 4th Author Pro ling Task at PAN 2016 . In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors, Working Notes Papers of the CLEF 2015 Evaluation Labs , volume 1609 of CEUR Workshop Proceedings , pages 750 { 784 . CLEF and CEUR-WS .org, September 2016 .

30.

Adithya

Rao and

Nemanja

Spasojevic . Actionable and political text classi cation using word embeddings and lstm . arXiv preprint arXiv:1607.02501 , 2016 .

31.

Radim

Rehurek and

Petr

Sojka . Software Framework for Topic Modelling with Large Corpora . In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks , pages 45 { 50 , Valletta , Malta, May 2010 . ELRA. http: //is.muni.cz/publication/884893/en.

32.

Santosh , Romil Bansal, Mihir Shekhar, and

Vasudeva

Varma . Author pro ling: Predicting age and gender from blogs . Notebook Papers of CLEF , 2013 .

33. Jonathan

Schler

, Moshe Koppel, Shlomo Argamon, and James W Pennebaker. E ects of age and gender on blogging . In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6 , pages 199 { 205 , 2006 .

34. Efstathios

Stamatatos

, Francisco Rangel, Michael Tschuggnall, Mike Kestemont, Paolo Rosso, Benno Stein, and

Martin

Potthast . Overview of PAN-2018: Author Identi cation, Author Pro ling, and Author Obfuscation . In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18) , Berlin Heidelberg New York, September 2018 . Springer.

35. Duyu

Tang

, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classi cation . In Proceedings of the 2015 conference on empirical methods in natural language processing , pages 1422 { 1432 , 2015 .

36. Eric S Tellez , Sabino Miranda-Jimenez, Mario

Gra , and Daniela

Moctezuma . Gender and language variety identi cation with microtc . 2017 .

37. Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions . arXiv e-prints, abs/1605 .02688, May 2016 .

38. Julio Villena Roman and Jose Carlos Gonzalez Cristobal. Daedalus at pan 2014: Guessing tweet author's gender and age, 2014 .

39. Edson RD Weren, Viviane Pereira Moreira, and Jose Palazzo M de Oliveira. Exploring information retrieval features for author pro ling . In CLEF (Working Notes) , pages 1164 { 1171 , 2014 .