Author Profiling based on Text and Images
Notebook for PAN at CLEF 2018

Luka Stout, Robert Musters, and Chris Pool
Anchormen, The Netherlands
{l.stout,r.musters,c.pool}@anchormen.nl

Abstract. In this paper we describe our participation in the PAN 2018 shared task of Author Profiling, in which the gender of authors is identified based on written text and shared images. We describe our approaches to the text-based, image-based and combined tasks. The presence of three different languages raises the question whether a single model architecture can be built that works well on all of them. We also propose a way to combine multiple predictions on shared content into a single prediction on user level. Our final system for text is an ensemble of a Naive Bayes model and an RNN with attention. The image classification is done by finding selfies and predicting the gender of the person on those images using CNNs.

1 Introduction

With the growing influence and importance of social media it becomes increasingly relevant to gain insights into the authors of its content, which mostly consists of images and text. Because social media networks allow people to create anonymous accounts, it is of great interest to the research community to get to know the users behind them. Inferring specific details about a user, like gender, age, native language or emotional state, is an interesting challenge for the marketing, forensic and security sectors. Author Profiling[1] is the task of determining an author's features, like gender, age and language variety, from their online persona. In addition to tweets, the 2018 edition of the shared task[2] includes images that were shared by the authors. The goal is to infer the gender of an author given one hundred of their tweets and ten images, in three different languages: English, Spanish and Arabic. The presence of three different languages raises the question whether a single model architecture can be built that works well on all three languages.

The shared task is divided into three subtasks: infer the gender based on tweets, based on shared images, and based on a combination of the two. We have focused on the text-based task, but we have also developed an image-based approach in order to participate in the combined task. For this we experimented with traditional techniques, such as tf-idf and Naive Bayes[3], as well as deep learning techniques, such as Recurrent Neural Networks (RNNs)[4] and Convolutional Neural Networks (CNNs)[5]. In this paper we describe our final systems and results.

2 Dataset Description and Preprocessing

The PAN 2018 Author Profiling[6] training set consists of text in three different languages and images grouped by authors, who are labeled by gender and language. The number of authors per gender is balanced in every language. This training set was used for feature engineering, parameter tuning and training of the classification models. For each of the languages English, Spanish and Arabic we received a dataset containing 100 tweets and 10 images per author. For English and Spanish there are 3,000 authors and for Arabic 1,500 authors. This gives a total of 750,000 tweets and 75,000 images. The goal of the task is to predict the gender of a user given these 100 tweets and 10 images. We have chosen to create models on tweet and image level and to combine the predictions into a single prediction for every author.
The following preprocessing steps were performed; the two additional preprocessing steps for Arabic are described online (https://maximromanov.github.io/2013/01-02.html):

– Replaced numbers, URLs, hashtags, mentions, emojis and smileys with their own unique tokens.
– Used a tokenizer to filter out punctuation and tokenize sentences into a list of lowercase words.
– Expanded contractions. (For English)
– Normalization of tokens, namely unifying the orthography of alifs, hamzas, and yas/alif maqsuras. (For Arabic)
– Noise removal, i.e. removing short vowels and other symbols (harakat). (For Arabic)

After preprocessing and tokenization, the maximum number of words in a tweet is 39 for English. For the other languages there are fewer than 200 tweets longer than 39 words. As this accounts for only 0.02% of all tweets in the dataset, and to keep the models consistent across languages, we decided to cap the number of words in a tweet at 39.

Basile et al.[7] note that augmenting the tweet dataset with the data of previous Author Profiling tasks[8, 9] does not improve the performance of the resulting classifiers. They attribute this to temporal differences in the data. We have seen that topics reflecting events from 2017 are clearly present in the data, while the data from previous years covers events from 2016 and before. We have therefore decided not to include additional datasets, to limit the effects of these differences.

For the image classification task we used additional data to create our classifier: a selfie dataset[10] and the MIRFLICKR dataset[11]. Their use is explained in Section 5.

3 Prediction Strategies

There are two ways to predict gender based on an author's social media content. The first is to treat all of the content as a single item and create a single prediction based on the entirety of the data. This is analogous to a bag-of-words approach in the case of text. However, concatenating or summing up the images at pixel level is not straightforward and does not make intuitive sense. We have therefore chosen a different approach: make predictions on item level and combine these predictions.

There are multiple ways of constructing an author-level prediction from item-level predictions, whether for text, images or a combination of both. We have used three different strategies. (1) The first strategy is to use the majority class of all predictions. (2) The second strategy is to use the mean probability of all predictions. (3) The last strategy is to only use predictions where the model is very sure that an input indicates a certain gender. This last strategy is a weighted average of the predictions, where the weight is zero for predictions that fall within a certain range:

$$
w_i =
\begin{cases}
0 & \text{if } \alpha < P_i(\text{female}) < \beta,\\
1 & \text{otherwise}
\end{cases}
\tag{1}
$$

$$
P(\text{female}) = \frac{1}{N} \sum_{i=1}^{N} w_i \, P_i(\text{female})
\tag{2}
$$

where $P$ is the prediction by a single model for a single author, $P_i$ is the prediction for the $i$-th tweet or image of the author and $N$ is the number of tweets or images for the author. If no prediction falls outside the range we fall back to the second strategy, the mean. We have found $\alpha = 0.25$ and $\beta = 0.75$ to be good default values. Using the third prediction strategy improved our accuracy on a validation set, as illustrated in Table 1. The rationale is that only the predictions where the model is sure that an input points towards a specific gender should influence the author-level prediction.
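To make the three strategies concrete, the following is a minimal Python sketch of the author-level combination following equations (1) and (2). The function and variable names are our own illustration, not the code used in our submission.

```python
import numpy as np

def author_prediction(tweet_probs, strategy="sure", alpha=0.25, beta=0.75):
    """Combine per-item P(female) predictions into one author-level score."""
    p = np.asarray(tweet_probs, dtype=float)
    if strategy == "majority":       # (1) fraction of items predicting female
        return float((p > 0.5).mean())
    if strategy == "mean":           # (2) mean probability over all items
        return float(p.mean())
    # (3) "sure": weight is 0 inside the uncertain band (alpha, beta), 1 outside
    w = np.where((p > alpha) & (p < beta), 0.0, 1.0)   # equation (1)
    if w.sum() == 0:                 # no confident prediction: fall back to mean
        return float(p.mean())
    return float((w * p).sum() / len(p))               # equation (2)

# Example: five tweet-level predictions for one author; only the three
# confident ones (outside [0.25, 0.75]) contribute to the final score.
print(author_prediction([0.90, 0.85, 0.40, 0.60, 0.95]))
```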
Strategy        acc ± σ
(1) Majority    0.767 ± 0.003
(2) Mean        0.780 ± 0.011
(3) Sure        0.790 ± 0.008

Table 1. 3-fold validation accuracy of the Recurrent Neural Network on the English text. The model was trained on tweet level and the strategies were then applied to the tweet sets of unseen authors. This illustrates the performance increase gained by using a different prediction strategy.

4 Text Classification

4.1 Features

For author profiling, it has been shown that tf-idf weighted n-gram features, both over characters and over words, are very successful in inferring gender[9]. We therefore use character 2- to 7-grams and word 1- to 3-grams with tf-idf weighting and sublinear term frequency scaling[12].

Word embeddings are a distributed representation for text and perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems[13, 14]. Words with similar meaning get a similar representation in a lower-dimensional space. These embeddings are trained on huge corpora of text to capture as much context-specific information as possible. For English and Arabic we used the pretrained fastText embeddings[15]. For Spanish we used pretrained embeddings trained on the Spanish Billion Word Corpus[16].

4.2 Recurrent Neural Network

RNNs[4] are used to model sequences where the order is important. They have an internal memory that keeps track of the examples seen so far in the current sequence. Text is one of the clear use cases for RNNs[17] because of its sequential nature.

A challenge with using a recurrent neural network is the vanishing gradient problem, in which long-range dependencies get lost over time. The problem was explored in depth in [18, 19], which found fundamental reasons why it is difficult to retain these dependencies. One solution is to use gated units as the atomic units of a recurrent neural network. Multiple versions of these gated units exist, such as Long Short-Term Memory units (LSTM)[20] and Gated Recurrent Units (GRU)[21]. Chung et al.[22] note that both LSTM and GRU are superior to recurrent neural networks with traditional tanh units. LSTMs are, in theory, better able to remember longer sequences than GRUs and outperform them in tasks requiring the modeling of long-distance relations. An advantage of GRUs over LSTMs is that they are computationally more efficient because they have fewer parameters within the units. As noted in Section 2 the texts are short, so we do not need the additional power of LSTMs and have decided to use GRUs.

Another way to address the long-term dependency problem is an attention mechanism. Attention mechanisms have recently been demonstrated to be successful in a wide range of tasks[23, 24, 25, 26]. We use a modification of the mechanism proposed by Zhou et al.[27]: instead of the weighted sum, we take the global maximum and the global average over the attention matrix and concatenate the two.

Bidirectional RNNs[28, 29] are a combination of two separate RNNs. The input sequence is fed in the normal order to one network and in reverse order to the other. The outputs of the two networks are usually concatenated at each time step. This structure gives the network both backward and forward information about the sequence at every time step. Human understanding of text works in the same way: we use the context of words to determine their meaning.
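As an illustration, one possible Keras implementation of this attention-with-pooling is sketched below. The exact scoring function of our attention layer is not fully specified above, so the tanh-activated Dense scorer is an assumption; the max/average pooling over the attention-weighted hidden states reflects our reading of the modification.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionMaxAvgPool(layers.Layer):
    """One reading of the modified attention: score every time step,
    rescale the hidden states by their attention weights, then take the
    global maximum and global average instead of the usual weighted sum."""

    def build(self, input_shape):
        # one attention score per time step (scoring function is assumed)
        self.score = layers.Dense(1, activation="tanh")

    def call(self, h):                                  # h: (batch, time, units)
        a = tf.nn.softmax(self.score(h), axis=1)        # attention weights
        weighted = h * a                                # broadcast over units
        max_pool = tf.reduce_max(weighted, axis=1)      # global max over time
        avg_pool = tf.reduce_mean(weighted, axis=1)     # global average over time
        return tf.concat([max_pool, avg_pool], axis=-1) # (batch, 2 * units)
```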
The two directions of our bidirectional RNN share the same configuration.

Recurrent neural networks can require millions of parameters to sufficiently model a task. This high-dimensional parameter space translates to a high chance of overfitting on the training data, and because large networks are slow to use, creating an ensemble of many large networks is infeasible. One technique to reduce overfitting is to add dropout[30, 31] to the network. We used different amounts of dropout in different places in the network. Between the embeddings and the recurrent layer of our network we use spatial dropout[32] instead of normal dropout. The benefit is that entire embedding channels can be dropped with a certain probability, which is better than removing random points in the embedding matrix.

We used 300-dimensional word embeddings as the input for our network. Spatial dropout with a rate of 0.4 is applied to the word embeddings. We used a bidirectional GRU with 256 units for each direction. The GRU had a tanh activation function, an output dropout rate of 0.35 and an internal dropout rate of 0.1. After the recurrent layer the attention mechanism was applied, and global average and max pooling were applied to this layer to get a single vector for every input text. The pooling operations are concatenated together as input to a dense network with a dropout rate of 0.5 between every dense block. A dense block consists of a fully connected layer with a PReLU[33] activation function; we also apply the tanh activation function to the output of the PReLU and concatenate it with the original, as shown in Figure 1. Three such dense blocks were used, with 256, 128 and 64 neurons respectively. Because of the concatenation the output size of these blocks is twice the number of neurons. The final output is a single neuron with a sigmoid activation function. We optimize the model with the Adam[34] optimizer and binary cross entropy as the loss function. The network architecture can be seen in Figure 2.

Figure 1. The dense block that we use instead of a single activation function: the input passes through a dense layer and a PReLU, whose output is concatenated with its tanh.

Figure 2. The layout of the recurrent neural networks we used for classifying the text. The gray layer is the input text. The orange layer is an embedding layer which replaces the words by their embeddings. The magenta layer is the output of the bidirectional recurrent layer. The green layer is the attention mechanism. The blue layer is a global max pooling layer. The red layer is a global average pooling layer. The white layers are the dense blocks.

4.3 Ensemble

For our final predictions on the texts we use an ensemble of two models: a traditional model and a deep learning model. The deep learning model (GRU) is described in the previous section.

The traditional model is a multinomial Naive Bayes[3] classifier (NB) using the character and word n-grams with tf-idf weighting on tweet level. Naive Bayes is a family of classification algorithms based on the assumption that every feature is independent of every other feature given the class. The Naive Bayes classifier considers each word in a piece of text to contribute independently to the probability that the author is female (or male), regardless of any correlations between features. Although this independence assumption often does not hold in the real world, Naive Bayes can obtain surprisingly good results[35].
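The n-gram side of the ensemble can be assembled in a few lines of scikit-learn. This is a sketch under the feature settings of Section 4.1, with all other hyperparameters left at their defaults rather than tuned as in our submission.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union

# Character 2- to 7-grams and word 1- to 3-grams, tf-idf weighted with
# sublinear term frequency scaling (Section 4.1).
features = make_union(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 7), sublinear_tf=True),
    TfidfVectorizer(analyzer="word", ngram_range=(1, 3), sublinear_tf=True),
)
nb_model = make_pipeline(features, MultinomialNB())

# tweets: list of preprocessed tweet strings; genders: one 0/1 label per tweet
# nb_model.fit(tweets, genders)
# tweet_probs = nb_model.predict_proba(tweets)[:, 1]   # P(female) per tweet
```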
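The GRU side of the ensemble, as specified in Section 4.2, can be sketched in Keras as follows, reusing the attention layer sketched earlier. The hyperparameters are those stated in the text; note that Keras applies the 0.35 dropout to the GRU inputs rather than strictly to its outputs, so the dropout placement is an approximation, and the embedding layer would in practice be initialized with the pretrained vectors of Section 4.1.

```python
from tensorflow.keras import layers, models

def dense_block(x, units):
    """The block of Figure 1: PReLU output concatenated with its tanh."""
    h = layers.PReLU()(layers.Dense(units)(x))
    return layers.Concatenate()([h, layers.Activation("tanh")(h)])

def build_text_model(vocab_size, max_len=39, embed_dim=300):
    inp = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inp)   # pretrained vectors here
    x = layers.SpatialDropout1D(0.4)(x)
    x = layers.Bidirectional(layers.GRU(
        256, activation="tanh", return_sequences=True,
        dropout=0.35, recurrent_dropout=0.1))(x)
    x = AttentionMaxAvgPool()(x)                       # sketched in Section 4.2
    for units in (256, 128, 64):                       # the three dense blocks
        x = layers.Dropout(0.5)(x)
        x = dense_block(x, units)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```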
The ensemble uses a weighted average to combine the outputs of the different models. The weights and the model architectures within the ensemble are the same regardless of language.

5 Image Classification

After inspecting the images in the dataset we found that many users post selfies. Our hypothesis is that if we can identify selfies and detect the gender of the person on a selfie, we can predict the gender of the author who posted it. If we do not find selfies for a user, this pipeline gives a random prediction for that user. The image model is not our main approach to this shared task; we use it in the hope of improving our score in the combined task. Because of this it has little impact on our results if a user does not post selfies.

For this approach we need a dataset consisting of selfies and a dataset without selfies. For the selfie class we used the selfie dataset provided by Kalayeh et al.[10]. In this research 46,836 selfies were collected and annotated with 36 different attributes; its focus was predicting the popularity of a selfie. For the no-selfie class we used the MIRFLICKR dataset[11], which consists of 25,000 images from Flickr annotated with tags. We removed images containing the tags 'person', 'portrait' or 'selfie', resulting in 23,500 images.

In 2012 Krizhevsky et al. won the ImageNet competition with a CNN[36]. Since then CNNs have been the default architecture for computer vision problems. We use one CNN to detect whether an image is a selfie and, if it is, a second CNN with the same architecture to predict the gender of the person on it. The architecture is shown in Figure 3. There are 64 filters in every convolutional layer, the kernel sizes are 3 × 3 and the max-pooling size is 2 × 2. Every layer except the last uses the ReLU[37] activation function; the last dense layer uses a sigmoid. The selfie detector was trained for 20 epochs using the Adam[34] optimizer on 150 × 150 pixel versions of the input images with a batch size of 256. We augmented the dataset by rescaling, zooming, shearing and horizontally flipping the images. We achieved 96% accuracy in identifying selfies on a validation set of our created dataset, and we found that on a small sample over 80% of the users post images that get classified as selfies. The gender model achieved an accuracy of 86% on selfies; it does not perform well on images that are not selfies. One caveat of this approach is that not every posted picture with a face is of the author themselves. However, we hypothesize that more often than not women will post pictures of themselves or of other women, and likewise for men.

Figure 3. The layout of the convolutional neural networks used for selfie identification and gender prediction. The gray layer is the input image. The magenta layers are convolutional layers. The blue layers are max-pooling layers. The green layer is a flattened version of the layer before it. The white layers are normal dense layers.
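A sketch of this CNN in Keras follows. The filter count, kernel and pooling sizes, input resolution and activations are from the text; the number of convolution/pooling blocks, the dense layer width and the augmentation ranges are our assumptions based on Figure 3 and are illustrative only.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_image_cnn(input_shape=(150, 150, 3), n_blocks=3):
    inp = layers.Input(shape=input_shape)
    x = inp
    for _ in range(n_blocks):                        # conv/pool stacks (Figure 3)
        x = layers.Conv2D(64, (3, 3), activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)       # assumed width
    out = layers.Dense(1, activation="sigmoid")(x)   # selfie (or gender) score
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Augmentation as described: rescaling, zooming, shearing, horizontal flips
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255, zoom_range=0.2, shear_range=0.2, horizontal_flip=True)
```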
6 Combining Models

To combine the text models and the image models we use a weighted average. Overall the text models vastly outperformed the image models, but the addition of the image models did improve the overall performance of the system. We chose to keep a single configuration for all languages. The weighted mean between the text models uses a 1:4 ratio in favor of the RNN model. The combination of the image model and the text models is also a weighted mean, with the ratio in favor of the text models.

7 Results

As in previous years of this shared task, the models are compared using the accuracy of correctly predicting the gender of an author. The accuracy is calculated for every language and the accuracies are then averaged to obtain a final score for our submission. The results in this section are evaluated on the PAN 2018 Author Profiling evaluation set.

Table 2 shows the accuracy of our text models and ensemble. We achieve an accuracy of 76.0% for Arabic, 78.5% for English and 74.1% for Spanish on the evaluation set using the ensemble. There is a big difference in performance between Arabic and English on the one hand and Spanish on the other. This might be because we have done additional preprocessing for Arabic and English. The GRU model outperforms the Naive Bayes model. The ensemble has a higher accuracy than the separate models for Spanish; for Arabic and English this is not the case, and there the GRU model has the highest performance. To prevent overfitting on the very small test set we used for tweaking, we did not alter our ensemble based on these results.

          NB†    GRU†   NB+GRU  Image  Joint
Arabic    0.660  0.800  0.760   0.623  0.764
English   0.660  0.790  0.785   0.658  0.788
Spanish   0.640  0.720  0.741   0.623  0.743
Average   0.653  0.770  0.762   0.635  0.765

Table 2. Accuracy of five different models per language: three text models (NB, GRU and NB+GRU), one image model and one joint model. We hypothesize that the big difference between languages for the text models comes from the varying amount of preprocessing done per language. The Image column shows the accuracy of the gender prediction pipeline using our selfie detection on the evaluation set. The Joint column shows the results of the weighted combination of the NB+GRU and Image models on the evaluation set. The joint model is a slight improvement over using just the text models.
† The results of the NB and GRU models were obtained on a small test set of 100 users, as it was not possible to run these models on the evaluation set. As such they might not be entirely representative of the performance of our models; we show them for completeness.

Using only the selfie model we get accuracies upwards of 62% for the different languages, as can be seen in Table 2. This low accuracy might be because not all users post selfies, in which case our model does not know what to predict. Another reason might be that the selfies in the shared task dataset differ from those in the selfie dataset. It might also be the case that the MIRFLICKR dataset is not sufficiently diverse: its images are all high-quality photos, which is not necessarily the case for the images shared in the PAN '18 dataset. We note that the accuracy on the images shared by Spanish users is a lot higher than for the Arabic and English users. We postulate that Spanish users might post more selfies or images representative of their gender. For this reason we could have chosen to increase the weight of the image model in the combined model; however, to prevent overfitting, we have not done this.

The addition of the image models to the text models gives only a very small improvement in accuracy (0.3%). This is because there is a big difference between the performance of the two approaches. If our image models performed on the same level as our text models, we would expect a significant improvement from the ensemble of the two.

8 Conclusion

In this paper we have used a combination of text models and image models to create gender predictions for three different languages.
We made predictions on individual tweets and images and then used multiple strategies to combine these predictions into a single prediction on user level. We also chose to keep a single configuration of the system across the languages. As such our performance on the individual languages is not as high as it could have been had we optimized every combination of models for each language.

An ensemble of an RNN and a bag-of-words model improved performance on Spanish with respect to just using the RNN, but it did not improve on the other languages. On the evaluation set we obtained accuracy scores between 62.3% and 78.8%, depending on the language and on whether we used models that classify based on text or on images. On our small test set our non-ensemble models showed better performance; however, that test set contained only 100 users and is thus not necessarily representative of the distribution in the evaluation set. To conclude: we successfully defined an ensemble of deep learning and traditional models capable of good performance.

References

1. Gómez-Adorno, H., Markov, I., Sidorov, G., Posadas-Durán, J.P., Sanchez-Perez, M.A., Chanona-Hernandez, L.: Improving feature representation based on a neural network for author profiling in social media texts. Computational Intelligence and Neuroscience 2016 (2016)
2. Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast, M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation. In Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N., eds.: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18), Berlin Heidelberg New York, Springer (September 2018)
3. Hand, D.J., Yu, K.: Idiot's Bayes — not so stupid after all? International Statistical Review 69(3) (2001) 385–398
4. Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015)
5. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553) (2015) 436
6. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L., eds.: Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (September 2018)
7. Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M.: N-GrAM: New Groningen Author-profiling Model. (July 2017)
8. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, Évora, Portugal, CLEF and CEUR-WS.org (September 2016)
9. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at PAN 2017: Gender and language variety identification in Twitter. In: Working Notes Papers of the CLEF 2017 Evaluation Labs (2017)
10. Kalayeh, M.M., Seifu, M., LaLanne, W., Shah, M.: How to take a good selfie? In: Proceedings of the 23rd ACM International Conference on Multimedia. MM '15, New York, NY, USA, ACM (2015) 923–926
11. Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, ACM (2008)
12. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK (2008)
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)
14. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP) (2014)
15. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. CoRR abs/1802.06893 (2018)
16. Cardellino, C.: Spanish Billion Words Corpus and Embeddings (March 2016)
17. Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Eleventh Annual Conference of the International Speech Communication Association (2010)
18. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München (1991)
19. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., et al.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (2001)
20. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780
21. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078 [cs, stat] (June 2014)
22. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
23. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
24. Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., Blunsom, P.: Teaching machines to read and comprehend. In: Advances in Neural Information Processing Systems (2015) 1693–1701
25. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems (2015) 577–585
26. Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., Jin, Z.: Classifying relations via long short term memory networks along shortest dependency paths. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015) 1785–1794
27. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2016) 207–212
28. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11) (1997) 2673–2681
29. Zhang, S., Zheng, D., Hu, X., Yang, M.: Bidirectional long short-term memory networks for relation classification. In: Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (2015) 73–78
30. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1) (2014) 1929–1958
31. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
32. Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. arXiv:1411.4280 [cs] (November 2014)
33. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision (2015) 1026–1034
34. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
35. Zhang, H.: The optimality of Naive Bayes. (2004)
36. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012) 1097–1105
37. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015)