Author Profiling with Bidirectional RNNs using Attention with GRUs
Notebook for PAN at CLEF 2017
Don Kodiyan, Florin Hardegger, Stephan Neuhaus, and Mark Cieliebak
Zurich University of Applied Sciences
kodiydon@students.zhaw.ch, hardeflo@students.zhaw.ch, neut@zhaw.ch, ciel@zhaw.ch

Abstract This paper describes our approach for the Author Profiling Shared Task at PAN 2017. The goal was to classify the gender and language variety of a Twitter user solely by their tweets. Author Profiling can be applied in various fields like marketing, security and forensics. Twitter already uses similar techniques to deliver personalized advertisements to its users. PAN 2017 provided a corpus for this purpose in four languages: English, Spanish, Portuguese and Arabic. To solve the problem we used a deep learning approach, which has shown recent success in Natural Language Processing. Our submitted model consists of a bidirectional Recurrent Neural Network implemented with Gated Recurrent Units (GRUs) combined with an Attention Mechanism. We achieved an average accuracy over all languages of 75,31% in gender classification and 85,22% in language variety classification.

1 Introduction

Social media has become an important platform for communication and exchange of information. In contrast to classical letters and emails, the language on social media is much more personal. This raises the question whether text style and content allow one to draw conclusions about demographic traits of the author, such as age, gender, or language variety. Such insights can be used in various applications, such as forensics, security, or marketing. For instance, on the basis of such profiles it would be possible to determine which users could be interested in a new product or campaign, how urgent a complaint is, or whether a profile in an online forum might be fake.
The Author Profiling Shared Task at PAN aims to answer these questions by extracting information about authors based on their linguistic style of writing [14,13]. The goal of the 2017 shared task at PAN is to detect the author's gender and dialect from his/her Twitter texts. Both training and test data are provided in four different languages: English, Spanish, Portuguese and Arabic. We have implemented a solution that is based on a bidirectional recurrent neural network (bi-RNN) using gated recurrent units (GRUs) in combination with an attention mechanism.

The paper is structured as follows. In Section 3, we give a short overview of related work. Then, in Section 4, we describe our model, and Section 5 compares the different attempts and their results on test data. Conclusions are drawn in the last section.

2 PAN

PAN is a series of different digital text forensics tasks. It organizes shared task evaluations; shared tasks are computer science events dedicated to a specific problem of interest. This paper is the result of our participation in the Author Profiling Shared Task of 2017. Author Profiling includes gender and language variety predictions for the author of a given Twitter document. To solve these problems, training and test datasets are available [16].

PAN 2017 Training Data. The PAN 2017 training data consists of Twitter profiles in four different languages: English, Spanish, Portuguese and Arabic. The corpus is annotated with gender and language variety information about the authors. For each language variety, there are 600 Twitter profiles, with an equal number of male and female profiles per language. The dataset includes exactly 100 tweets for each author.

Language Variety. A language variety is defined as a specific variation of an author's native language. For instance, one has to identify whether an English author writes a language variety from Australia, Canada, Great Britain, Ireland, New Zealand or the United States.
Table 1. Distribution of data for language variety in the PAN 2017 training corpus

Native Language | Author Profiles | Language Variations
English    | 3600 | Australia, Canada, Great Britain, Ireland, New Zealand, United States
Spanish    | 4200 | Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela
Portuguese | 1200 | Brazil, Portugal
Arabic     | 2400 | Egypt, Gulf, Levantine, Maghrebi

TIRA. TIRA is an evaluation-as-a-service platform [12]. The submission for the PAN shared task was done with this tool. The submitted models were self-evaluated on a virtual machine hosted by the organizers. The test data was only available on this virtual machine and was not visible to the participants.

Evaluation. The performance of the submissions at PAN 2017 is measured with accuracy. The individual accuracy for gender and variety identification was calculated for each language as follows:

accuracy = correctly predicted / total. (1)

The joint accuracy counts a prediction as correct only when both gender and variety are predicted correctly together. The final ranking is calculated from the accuracy averaged over all four languages.

3 Related Work

In this section we provide an overview of the works most relevant to the Author Profiling Task with neural networks.

Neural Networks. Neural networks have achieved great results in natural language processing in the past few years. In many tasks like machine translation [10] and sentiment analysis [7], neural networks have proven to be very successful. The two state-of-the-art neural network architectures used today are recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The main challenge in most NLP tasks is to simplify the input sequence while keeping the most important information. Research on neural machine translation (NMT) already focuses heavily on this challenge. For that reason we applied techniques from NMT to the Author Profiling Task.

RNNs and CNNs.
The recent success of RNNs has been achieved through long short-term memory (LSTM) and gated recurrent unit (GRU) networks [6]. With their ability to capture long-term dependencies, LSTMs and GRUs have achieved state-of-the-art results in various NLP tasks. The work of Bahdanau et al. [1] proposed an attention mechanism to simplify a sequence. In combination with a bidirectional RNN (bi-RNN), this approach learns to automatically weight the most relevant information of the input sequence. This leads to substantial improvements in machine translation and other fields like automatic summarization [4]. The latest research of Gehring et al. [10] has shown that CNNs are capable of achieving state-of-the-art results in NMT. Those results were achieved by applying the attention mechanism to CNNs. CNNs are computationally less expensive than LSTMs and GRUs, which makes them preferable for large datasets.

4 Methodology

In this section we describe our technical solution. The main focus is on the system architecture of the neural networks.

4.1 Preprocessing

Every tweet was preprocessed by converting it to lower case. We replaced URLs and usernames with a standardized token, converted hashtags to regular words, and used the TweetTokenizer from NLTK [2] to tokenize the tweets. We use a vocabulary to map each token to a token-ID. The IDs point to a vector representation of the token, which is used later. After the preprocessing step we obtain a list of tweets for each author, where each tweet is a list of token-IDs.

4.2 Embeddings

Each token in a tweet is represented by pretrained word embeddings [8]. For English and Spanish we used embeddings created with word2vec [11]. For both languages a corpus of 200 million unlabelled tweets was used. The skip-gram algorithm was used for training with window size 5, sample size 1e-05, minimum frequency 15 and 200 dimensions.
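The preprocessing and vocabulary mapping described in Section 4.1 can be sketched as follows. This is a minimal illustration, not the actual implementation: the paper uses NLTK's TweetTokenizer, for which a plain whitespace split stands in here, and the placeholder tokens "<url>" and "<user>" are assumptions.

```python
import re

def preprocess(tweet):
    """Lower-case a tweet, standardize URLs/usernames, strip hashtag signs."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "<url>", tweet)   # standardized URL token
    tweet = re.sub(r"@\w+", "<user>", tweet)          # standardized username token
    tweet = re.sub(r"#(\w+)", r"\1", tweet)           # hashtag -> regular word
    return tweet.split()                              # stand-in for TweetTokenizer

def build_vocab(tokenized_tweets):
    """Map each token to an integer ID; 0 is reserved for padding."""
    vocab = {}
    for tokens in tokenized_tweets:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

tweets = ["Check this out http://example.com @friend #awesome"]
tokenized = [preprocess(t) for t in tweets]
vocab = build_vocab(tokenized)
ids = [[vocab[tok] for tok in toks] for toks in tokenized]
```

The resulting lists of token-IDs are what the embedding layer of Section 4.3 consumes.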
For Portuguese and Arabic we used pretrained embeddings from [3], which were trained on a Wikipedia corpus (http://wikipedia.org). They have an output dimension of 300.

4.3 Architecture

In this section we describe our model, which consists of a bi-RNN with GRUs followed by an attention mechanism.

Figure 1. Representation of the bi-GRU+Attention model. We used n = 4 and u = 3 for visualization purposes.

Embedding Layer. The embedding layer maps the token-IDs to their vector representations. The token-ID is used to look up the word vector in the embeddings. Those vectors are concatenated and passed to the next layer. This results in an output matrix S ∈ R^{d×n}, where d stands for the dimension of the word vector and n for the size of the input. To determine n, we took the tweet with the largest number of tokens in our training dataset and rounded that number up to the next multiple of 10. This resulted in a maximum input size of n = 60. Shorter inputs were padded with zeros to match that size. To reduce the effect of unknown and padded words we used masking [5]. This way our model only uses known words and skips zero values.

GRU Layer. This layer consists of two GRUs with u units each, one per direction, resulting in two matrices R_F ∈ R^{u×n} and R_B ∈ R^{u×n}. Both matrices are concatenated into a matrix R ∈ R^{2u×n}. For our model we used u = 50.

Attention Layer. This layer is used to weight the most important parts of the GRU-encoded input and deliver a simplified representation of the input. The output matrix R of the previous layer, the weight matrix W_a ∈ R^{2u×2u} and the bias b ∈ R^{2u} are used to calculate a hidden state h_t:

h_t = tanh(W_a R + b). (2)

The hidden state h_t and the weight vector W_u ∈ R^{2u} are used to calculate the final attention a for each word:

a = softmax(h_t^T W_u). (3)

The attention vector a is then multiplied with R and the result is summed, which yields a summarized representation of the sentence as a vector s_a ∈ R^{2u}.
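The attention computation of Equations (2) and (3) can be sketched in NumPy. This is an illustrative re-implementation under the dimensions stated in the paper (R is 2u×n, W_a is 2u×2u, b and W_u have length 2u); the random initialization is purely for demonstration.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(R, W_a, b, W_u):
    """Attention over the bi-GRU output R, per Equations (2) and (3)."""
    h = np.tanh(W_a @ R + b[:, None])   # hidden state h_t, shape (2u, n)
    a = softmax(h.T @ W_u)              # one attention weight per token, shape (n,)
    s_a = R @ a                         # attention-weighted sum, shape (2u,)
    return a, s_a

# Illustrative sizes from the paper: u = 50 GRU units, n = 60 tokens.
rng = np.random.default_rng(0)
u, n = 50, 60
R = rng.standard_normal((2 * u, n))
W_a = rng.standard_normal((2 * u, 2 * u)) * 0.1
b = np.zeros(2 * u)
W_u = rng.standard_normal(2 * u)
a, s_a = attention(R, W_a, b, W_u)
```

The attention weights a sum to 1 over the n tokens, and s_a is the fixed-size sentence vector passed on to the softmax layer.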
Softmax Layer. As the final layer we used a fully connected layer with softmax as the activation function. The number of output nodes depends on the number of classes: 2 nodes for gender prediction, and between 2 and 7 nodes for language variety prediction, depending on the language.

Dropout. Dropout drops individual nodes during training with a probability p and is therefore used to reduce overfitting [15]. We applied dropout with p = 0.2 on our softmax layer.

Optimization. Our model is trained using the AdaDelta optimizer [17]. We used ε = 10^-5 and default values for the other hyper-parameters.

Author Prediction. Our model is trained to classify single tweets. To get the classification of an author, his tweets are classified separately. The outputs of our model, i.e. the outputs of the softmax layer, are then summed and the class with the highest value is the final prediction. For example, if we want to predict the gender of a user u who has three tweets t1, t2, t3, we first classify the tweets separately. This could result in the following predictions: t1 = [0.4, 0.6], t2 = [0.3, 0.7], t3 = [1.0, 0.0]. The first number of each output indicates the probability that the tweet was written by a female, the second that it was written by a male. Summing the outputs of t1, t2, t3 gives [1.7, 1.3]. In this example, user u would be predicted as female.

4.4 Training

To train our models for submission we used 90% of the training data; the remaining 10% were used as a validation set. The validation set was used to select a model checkpoint during training. For more details on model checkpoints, see Section 5.1.

5 Evaluation

We distinguish between the evaluation during development and the benchmark measured on actual test data on TIRA. The results during the development phase were achieved on the provided training corpus with cross validation.
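The author-level aggregation described above (sum the per-tweet softmax outputs, then take the argmax) can be sketched in a few lines; the label names are illustrative.

```python
def predict_author(tweet_probs, labels=("female", "male")):
    """Sum per-tweet class probabilities and return the winning label.

    tweet_probs: list of per-tweet softmax outputs, one list per tweet.
    """
    totals = [sum(p[i] for p in tweet_probs) for i in range(len(labels))]
    return labels[totals.index(max(totals))], totals

# The worked example from the paper: user u with three tweets.
label, totals = predict_author([[0.4, 0.6], [0.3, 0.7], [1.0, 0.0]])
# totals is approximately [1.7, 1.3], so the prediction is "female".
```

Summing probabilities rather than majority-voting lets a single very confident tweet (like t3 above) outweigh several uncertain ones.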
Cross Validation. Our models were trained with 10-fold cross validation. We used cross validation to calculate a representative score for the model. The data in each fold was used as follows: 80% training data, 10% validation data and 10% test data. The evaluation on the test data does not influence the training and is only used to evaluate the model. We used the validation set in combination with model checkpoints to prevent overfitting. Model checkpoints are explained in the following section.

F1 Score. During the training phase we used the F1 score to find the best model. The F1 score considers both precision and recall. We used the F1 score because it penalizes one-sided predictions of a model. In the following calculations, tp, fp and fn denote true positives, false positives and false negatives. Precision is the ratio of correctly predicted instances (tp) to all instances classified as this class (tp + fp):

precision = tp / (tp + fp). (4)

Recall is the ratio of correctly classified instances (tp) to the total number of instances in the corresponding class (tp + fn):

recall = tp / (tp + fn). (5)

The harmonic mean of these two scores is the F1 score:

F1 = 2 · (precision · recall) / (precision + recall). (6)

5.1 Model Checkpoints

The accuracy and F1 score of the model were measured during training. The scores were evaluated on a validation and a test dataset. If the model achieved a higher F1 score on the validation data than any previous one, the model (and its weights) was saved. An example of the measured scores is shown in Figure 2. The goal is to select the best weights for a model during the training phase. Figure 2 shows that our model performs very similarly on validation and test data. That means that by choosing the best weights on the validation set, the chances are high that the model performs equally well on the test set. This makes our model very stable and predictable.
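Equations (4) to (6) translate directly into code; the counts in the example are made up for illustration.

```python
def f1_score(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)              # Equation (4)
    recall = tp / (tp + fn)                 # Equation (5)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (6)
    return precision, recall, f1

# Hypothetical example: 8 true positives, 2 false positives, 2 false negatives.
precision, recall, f1 = f1_score(tp=8, fp=2, fn=2)
```

Because F1 is the harmonic mean, a model that trivially predicts one class everywhere scores poorly even if its plain accuracy looks acceptable, which is exactly why it was used for checkpoint selection.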
5.2 Analysis of the Attention

While working with the attention mechanism we developed a tool to visualize how the different words in a tweet are weighted. This tool helped us to understand which words are more important for our model. An example on language variety is shown in Figure 3, where multiple tweets of British and American authors are compared.

Figure 2. Accuracy graphs of the bi-GRU+Attention model during training. Visualized comparison between validation (orange) and test (blue) accuracy scores on author level. The X axis represents the number of epochs and the Y axis the corresponding accuracy value.

Figure 3. Visualized attention weights comparison between British and American Twitter users. The left side visualizes the attention of each word in a tweet: the darker the background color, the stronger those words are weighted. On the right side the final prediction and its probability are shown. In these examples all predictions are correct.

In Figure 3 the attention of the words is highlighted. As we can see, some typical American English and British English words are marked. For example, in the first tweet the word "color" and in the third tweet "Walmart" are marked as very important, both common in American English. In the second and fourth tweet the words "bloody" and "cheeky" are marked as significant, both common in British English.

5.3 Cross Validation Results

During our preparation for the PAN shared task several models were tested and compared. Our baseline was a CNN model [7] which had already participated in PAN 2016. The model has a 2-layer CNN architecture with a fully connected softmax layer at the end. The experiments have shown that the bi-GRU+Attention model has the best performance on both classification tasks (gender, variety). The measured scores of both models are shown in Table 2 and Table 3.

Table 2.
Evaluation results of classifying gender on PAN 2017 training datasets using cross validation.

Gender
Model            | English | Spanish | Portuguese | Arabic | Average
bi-GRU+Attention | 79,03%  | 72,57%  | 79,50%     | 71,58% | 75,67%
CNN              | 73,24%  | 72,93%  | 79,83%     | 70,88% | 74,22%

Table 3. Evaluation results of classifying language variety on PAN 2017 training datasets using cross validation.

Language Variety
Model            | English | Spanish | Portuguese | Arabic | Average
bi-GRU+Attention | 79,03%  | 92,05%  | 98,76%     | 78,71% | 87,11%
CNN              | 70,90%  | 89,67%  | 98,75%     | 78,38% | 84,22%

5.4 PAN 2017 Results

We trained two distinct models for each language: one for gender and one for variety. These models were uploaded to the virtual machine and evaluated on the actual test dataset. Table 4 shows the results obtained on the PAN 2017 Author Profiling test dataset. The highest score on gender prediction was achieved in English; Portuguese gender prediction follows with 0,75% less accuracy. The gender predictions in Spanish and Arabic are lower than the others. We assume that this issue is related to worse vocabulary coverage: for both Spanish and Arabic, the vocabulary coverage is below 80%, in contrast to around 90% coverage of the vocabularies in English and Portuguese.

In general, good scores are achieved for variety prediction. Outstanding is the variety accuracy of 91,43% for the Spanish language, which comprises seven language variations. Only in English and Arabic did the score drop below 80%; the lowest score is achieved for variety prediction on Arabic, due to low vocabulary coverage. The exact vocabulary coverage of the used embeddings is shown in Table 5.

Table 4. Evaluation results in terms of accuracy for the bi-GRU+Attention model on the PAN 2017 Author Profiling test dataset.

Language   | Joint Accuracy | Gender Accuracy | Number of Language Variations | Language Variety Accuracy
English    | 62,63%         | 78,88%          | 6                             | 79,08%
Portuguese | 73,00%         | 78,13%          | 2                             | 93,50%
Spanish    | 66,46%         | 72,17%          | 7                             | 91,43%
Arabic     | 56,88%         | 71,50%          | 4                             | 76,88%

Table 5.
Vocabulary coverage of the embeddings on the PAN 2017 training dataset in comparison to the achieved gender prediction accuracies. The languages are sorted by gender accuracy score.

Language   | Gender Accuracy | Vocabulary Coverage
English    | 78,88%          | 90,85%
Portuguese | 78,13%          | 88,33%
Spanish    | 72,17%          | 79,68%
Arabic     | 71,50%          | 77,66%

The results in Table 5 seem to imply that the accuracy for gender prediction correlates with vocabulary coverage.

6 Conclusion

In this paper, we presented deep learning models to predict the gender and language variety of Twitter profiles. We described a bidirectional RNN with GRUs and an attention mechanism. We compared the average accuracy of our models over all languages with a previously developed CNN model. The RNN exceeds the CNN in gender prediction by 1,45% and in variety prediction by 2,89% on average over four languages on the PAN 2017 training data.

For future work, we would like to see if a combination of several high-quality solutions for Author Profiling with a random forest could outperform each of the subsystems. This has been done successfully for sentiment analysis [9], and it would be interesting to see if it works for Author Profiling as well.

7 References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
2. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016)
4. Cho, K., Courville, A., Bengio, Y.: Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks. IEEE Transactions on Multimedia 17(11), 1875–1886 (Nov 2015)
5. Chollet, F., et al.: Keras. https://github.com/fchollet/keras (2015)
6. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014)
7.
Deriu, J., Cieliebak, M.: Sentiment Analysis using Convolutional Neural Networks with Multi-Task Training and Distant Supervision on Italian Tweets. In: Evaluation of NLP and Speech Tools for Italian (EVALITA) (2016)
8. Deriu, J., Lucchi, A., Luca, V.D., Severyn, A., Müller, S., Cieliebak, M., Hofmann, T., Jaggi, M.: Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification. In: Proceedings of the 26th International Conference on World Wide Web. pp. 1045–1052 (2017)
9. Dürr, O., Uzdilli, F., Cieliebak, M.: JOINT_FORCES: Unite Competing Sentiment Classifiers with Random Forest. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). pp. 366–369 (2014)
10. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional Sequence to Sequence Learning. ArXiv e-prints (May 2017)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. CoRR abs/1310.4546 (2013)
12. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
13. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17). Springer, Berlin Heidelberg New York (Sep 2017)
14.
Rangel, F., Rosso, P., Potthast, M., Stein, B.: In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs
15. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (Jan 2014)
16. Tan, L., Zampieri, M., Ljubešić, N., Tiedemann, J.: Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. In: Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). pp. 11–15. Reykjavik, Iceland (2014)
17. Zeiler, M.D.: ADADELTA: An Adaptive Learning Rate Method. CoRR abs/1212.5701 (2012)