Gender Prediction From Tweets With Convolutional Neural Networks Notebook for PAN at CLEF 2018

Erhan Sezerer (erhansezerer@iyte.edu.tr), Ozan Polatbilek (ozanpolatbilek@iyte.edu.tr), Özge Sevgili (ozgesevgili@iyte.edu.tr), Selma Tekir (selmatekir@iyte.edu.tr)

Izmir Institute of Technology

This paper presents a system¹ developed for the author profiling task of PAN at CLEF 2018. The system uses style-based features to predict the gender of each user from the user's tweets. These features are extracted automatically by Convolutional Neural Networks (CNN). The system builds on the idea that tweets are not equally informative about the gender of a user. Thus, an attention mechanism is applied to the CNN outputs in order to identify the tweets carrying more information. Our architecture obtained competitive results on the three languages provided by the PAN 2018 author profiling challenge, with an average accuracy of 75.1% on local runs and 70.23% on the submission run.

Introduction

Author profiling is the characterization of an author through key attributes such as gender, age, and language. It is an indispensable task, especially in security, forensics, and marketing. In the security world, predictive profiling is a measure for proactive threat assessment. In forensics, profiling is used to support attribution for an incident, while in marketing it helps to prepare targeted advertisements.

In today's social media-driven environment, automatic user profiling is not the same as before, because what users write and share on social media provides a rich data source for learning approaches. As a general rule, more data make classifiers more accurate.

In more technical terms, author profiling is defined as a classification task where the aim is to predict the attribute of an author out of the given attribute classes. The traditional machine learning process is followed to fulfill the task, and feature selection is an important part of that process. The literature categorizes the types of features that can be used for authorship profiling as content-based features and style-based features. Evidence indicates that the most effective style-based features for gender discrimination are determiners and prepositions (markers of male writing) and pronouns (markers of female writing). As for content-based features, words related to technology (male) and words related to personal life or relationships (female) proved most useful [1].

Recent deep learning-based approaches have gained prominence in this area because they perform feature selection automatically. We tackled the problem in a similar way. The proposed approach feeds the characters of a specific user's tweets into the system, which learns character embeddings and runs a Convolutional Neural Network (CNN) on each individual tweet of the user. Then, the CNN outputs are combined and passed through an attention layer to form the user-specific vector for prediction.

In this work, we aim to obtain style-based features from the tweets of users by using CNNs. CNNs are known to be good at identifying the local patterns from the inputs [5]. They were originally designed to tackle the problems in vision tasks by identifying the small objects or patterns in images [9], but later, they were introduced into NLP tasks to extract the syntactic, local features from the text [4].

In the PAN 2018 [16] author profiling task [15], the profiling dimensions are gender and language, where the selected languages are English, Spanish, and Arabic. As for the training data, in addition to text in the form of tweets, images shared by the users are provided as well. Thus, hybrid solutions that use both text and image-based features are encouraged.

Our system uses only text-based features. The basic characteristics of our approach can be highlighted as follows:

-The system learns on a user basis iteratively.

-The input is in the form of characters.

-A CNN per tweet is constructed to identify local tweet-wide indicators in the larger user profile vector.

-An attention layer is used to combine CNN outputs using normalized weights.

In the remaining part of the paper, we first present the related work. In Section 3, the proposed method is explained in detail. Then, the performance is tabulated and evaluated. Finally, in Section 5, the paper is concluded with some remarks and possible future directions.

Related Work

Argamon et al. [1] categorize the types of features that can be used for authorship profiling as content-based features and style-based features. Their experiments show that the most effective style-based features for gender discrimination are determiners and prepositions (markers of male writing) and pronouns (markers of female writing). As for content-based features, words related to technology (male) and words related to personal life or relationships (female) proved most useful.

Rangel and Rosso [14] investigate the impact of emotions on age and gender identification. They process the text, create part-of-speech (POS) tag graphs (POS tags as nodes and their sequence as edges), and expand those graphs with related topic words, polarity labels, and emotion words from an emotion dictionary. Then, they extract features using graph analytics and feed them into machine learning algorithms to perform the classification. Their results indicate that language use and emotions are effective in discriminating gender and age.

In the overview paper of the Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter [12], the participant systems are compared with respect to features and classification approaches. In that edition of the author profiling task, more participants employed deep learning techniques, which perform automatic feature selection. In the gender and language variety subtasks, the best performances belong to three systems: a logistic regression classifier with combinations of character, word, and POS n-grams, emojis, sentiments, and character flooding; an SVM trained with combinations of character and tf-idf n-grams; and a deep learning approach combining word and character embeddings with CNN, RNN, attention mechanism, max-pooling layer, and fully-connected layer.

Basile et al. [3] apply a Support Vector Machine (SVM) with word and character n-grams to the PAN 2017 author profiling task, where they obtain the best results among the competitors. They use character 3- to 5-grams and word uni- and bigrams with tf-idf weighting, and train an SVM on this feature space to discriminate both gender and language variety. They also mention that hand-crafted features decrease accuracy rather than helping in this specific task.

Miura et al. [11] propose two deep learning-based approaches which combine context-based and style-based features by taking the word-level and the character-level information of the tweet's text. Their systems use both a Recurrent Neural Network (RNN) (to address context-based features with the given word information) and a CNN (to address style-based features with the given character information). Their architectures consist of attention mechanism layers, a max-pooling layer, and fully-connected layers. The difference between the architectures is that one of them operates on a tweet basis while the other operates on a user basis. Additionally, the placement of the layers differs between the two.

Kodiyan et al. [8] also use a deep learning approach by implementing a bidirectional RNN with Gated Recurrent Units. They add an attention layer at the tweet level to learn the most important parts of each tweet. In order to move from the tweet level to the user level, they sum the tweet predictions of a user and use the result as a single user-level prediction.

Method

In this section, the description of the dataset and the details of the proposed model are given including choice of parameters, preprocessing steps and architectural details.

Data

PAN 2018 Author Profiling dataset [15] is based on 3 languages (English, Arabic, Spanish) with ground-truth gender information. It has 3000 users for English, 3000 users for Spanish, and 1500 users for Arabic language where each user has 100 tweets and 10 images that they posted on Twitter. In this work, only text data are used in gender classification.

Preprocessing

On Twitter, characters are used not only to form words but also to express emotions, such as ':)' for smiling or ';)' for winking. For this reason, punctuation and stop words are not removed, and the texts are used as they are. NLTK [10] is used to tokenize tweets. To illustrate (example from NLTK):

Tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-"

Tokenized Tweet = ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<-']

Each word in the tokenized tweet is lowercased. Then, each character of the word is taken as part of the input to the system. Thus, the tweet in the above example is turned into the following input:

Input = [..., 'y', 's', 'm', 'i', 'l', 'e', 'y', ':', ':', '-', ')', ':', '-', 'p', '<', '3', 'a', 'n', 'd', 's', 'o', 'm', 'e', 'a', 'r', 'r', 'o', 'w', 's', '<', '>', '-', '>', '<', '-', '-']

For each user, the number of characters per tweet is set to the highest number that Twitter allows for a tweet. If a tweet has fewer characters than the maximum, padding is applied to the end of the tweet.
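The character conversion and padding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tweet_to_chars`, the padding symbol `<pad>`, the truncation of over-long inputs, and the value `max_len=280` are all assumptions, since the text only says that the length is set to Twitter's maximum. The token list is the one printed in the example above.

```python
PAD = "<pad>"  # hypothetical padding symbol; the paper does not name one


def tweet_to_chars(tokens, max_len):
    """Lowercase each token, split it into characters, and pad to max_len."""
    chars = [ch for tok in tokens for ch in tok.lower()]
    chars = chars[:max_len]  # assumption: longer inputs are truncated
    return chars + [PAD] * (max_len - len(chars))


tokens = ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P',
          '<3', 'and', 'some', 'arrows', '<', '>', '->', '<-']

# max_len=280 is an assumed value for Twitter's character limit
chars = tweet_to_chars(tokens, max_len=280)
```

Padding every tweet to the same length keeps the per-tweet inputs the same shape, which is what allows one CNN to be applied to all tweets of a user.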

Character Embeddings

Character embeddings of size 25 are initialized by sampling from a zero-mean uniform distribution and are trained simultaneously with the neural network. Because characters are far fewer than words and their embeddings are smaller, training character embeddings requires less text than training word embeddings. Therefore, the given dataset was sufficient to train them, and no additional data were collected or used.
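As a rough sketch of this initialization, the snippet below builds one 25-dimensional vector per character from a symmetric (hence zero-mean) uniform distribution. The function name `init_char_embeddings`, the interval bound `scale=0.1`, and the seed are hypothetical; the paper states only the embedding size and the zero mean. In the actual system these vectors would be trainable parameters updated together with the network.

```python
import random


def init_char_embeddings(vocab, dim=25, scale=0.1, seed=0):
    """Return {character: vector} with entries drawn from U(-scale, scale)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-scale, scale) for _ in range(dim)]
            for ch in sorted(set(vocab))}


# Vocabulary taken from a toy character sequence
table = init_char_embeddings("this is a cooool tweet")
embedded = [table[ch] for ch in "cool"]  # look up one vector per character
```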

Architecture

In this study, each tweet of a user is passed to the CNN simultaneously as a sequence of characters to assess the style-based features of each particular tweet. CNN outputs a feature vector for each tweet.

At this level, combining the feature vectors by other methods such as concatenating, flattening, or averaging would amount to assuming that all tweets are equally important. However, the level of information on gender may differ from tweet to tweet. Therefore, a Bahdanau attention mechanism [2] is combined with the character CNN in order to learn which tweets hold more information on the gender of their author. Figure 1 shows the attention mechanism in detail, which is calculated by the following formulas:

$$A_i = \tanh(W_\alpha t_i + b) \tag{1}$$

$$v_i = \frac{\exp(A_i w_i)}{\sum_j \exp(A_j w_j)} \tag{2}$$

$$o_i = v_i t_i \tag{3}$$

$$K = \sum_i o_i \tag{4}$$

where $W_\alpha$ is a weight matrix used to multiply each output of the CNN, $t_i$ is the ith tweet, $b$ is the bias vector, $w_i$ are the attention weights, $A_i$ is the attention context vector, $v_i$ is the attention value for the ith tweet, $o_i$ is the attention output vector for the corresponding tweet, and $K$ is the output vector for the user. A fully connected layer is applied to the output of the attention layer to reduce the size of the feature vector to the number of genders. Predictions are obtained by applying softmax over the output of the fully connected layer. The proposed model can be seen in Figure 2.
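Equations (1) to (4) can be traced with a small, dependency-free sketch. It is an illustration under stated assumptions rather than the authors' implementation: each $t_i$ is taken to be the CNN feature vector of the ith tweet, a single attention weight vector `w` is shared across tweets (Equation 2 writes it with an index), and the function name `attention_pool` is hypothetical. The maximum score is subtracted before exponentiation for numerical stability, which does not change the normalized values.

```python
import math


def attention_pool(tweets, W, b, w):
    """Return (attention values v, user vector K) for a list of tweet vectors."""
    scores = []
    for t in tweets:
        # Eq. (1): A_i = tanh(W t_i + b)
        A = [math.tanh(sum(W[r][c] * t[c] for c in range(len(t))) + b[r])
             for r in range(len(W))]
        # Unnormalized score A_i . w used inside Eq. (2)
        scores.append(sum(a * wi for a, wi in zip(A, w)))
    # Eq. (2): softmax over the tweets of one user
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    v = [e / total for e in exps]
    # Eq. (3) and (4): K = sum_i v_i * t_i
    dim = len(tweets[0])
    K = [sum(v[i] * tweets[i][d] for i in range(len(tweets)))
         for d in range(dim)]
    return v, K


# With all-zero W and b every tweet gets the same score, so the attention
# reduces to plain averaging of the tweet vectors.
v, K = attention_pool([[1.0, 2.0], [3.0, 4.0]],
                      W=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0], w=[1.0, 1.0])
```

The degenerate case in the usage lines shows why attention generalizes averaging: uniform attention values reproduce the mean, while learned weights can shift mass toward more informative tweets.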

The CNN [6]² is implemented with the ReLU activation function and filters of shape [filter size, embedding size], applied with stride 1 so that all characters are visited. The Adam optimizer [7] is used with cross-entropy loss. To prevent the model from overfitting, an L2 regularization loss is used.
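To make the filter shape concrete, the following sketch slides one filter of shape [filter size, embedding size] over a sequence of character embeddings with stride 1 and applies ReLU. The max-over-time pooling step follows the general design of Kim [6] and is an assumption here, as the text does not spell out the pooling; the function names are hypothetical as well.

```python
def conv_relu_maxpool(embedded, filt, bias=0.0):
    """Slide one [filter size, embedding size] filter with stride 1,
    apply ReLU, and keep the maximum activation over all positions."""
    k = len(filt)
    width = len(filt[0])
    best = 0.0  # ReLU outputs are never below zero
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        s = bias + sum(window[i][j] * filt[i][j]
                       for i in range(k) for j in range(width))
        best = max(best, max(0.0, s))
    return best


def tweet_features(embedded, filters):
    """One pooled value per filter; together they form the tweet vector."""
    return [conv_relu_maxpool(embedded, f) for f in filters]


# Toy input: three characters with 2-dimensional embeddings
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = tweet_features(embedded, [
    [[1.0, 0.0], [0.0, 1.0]],    # filter size 2
    [[-1.0, -1.0]],              # filter size 1, always negative here
])
```

Because the filter spans the full embedding dimension, it moves only along the character axis, so each position covers a character n-gram whose length equals the filter size.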

Parameter Selection

Exhaustive grid search is used to optimize the hyperparameters of the model. The values tried for each language are listed in Table 1. Because the languages differ from one another and the datasets differ in size, different hyperparameters gave the best results for each language (Table 2).
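An exhaustive grid search simply evaluates every combination of candidate values and keeps the best one. The sketch below shows the idea with a stand-in scoring function; `grid_search`, `toy_evaluate`, and the small grid are hypothetical, and in practice `evaluate` would train the model with the given parameters and return its validation accuracy.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return the best (params, score)."""
    names = list(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


# Illustrative grid only; not the full search space of Table 1
grid = {"learning_rate": [1e-4, 5e-4, 1e-5], "num_filters": [50, 60, 75]}


def toy_evaluate(params):
    """Stand-in for training plus validation; peaks at (1e-4, 75)."""
    return (-abs(params["learning_rate"] - 1e-4)
            - abs(params["num_filters"] - 75) / 1000)


best_params, best_score = grid_search(grid, toy_evaluate)
```

The number of runs is the product of the list lengths, which is why the search is repeated per language only over a limited set of values.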

Results

We selected the model with the best-performing parameters, shown in Table 2. As can be seen from Table 3, our best model achieves between 70.7% and 79.0% accuracy across languages in our validation runs. In the submission run over the TIRA framework [13], our best models obtained lower accuracy than in the validation runs: about 4 percentage points lower for English and Spanish, and 6.5 points lower for Arabic.

We also observed in our experiments that, compared with averaging the feature vectors at the output of the CNN or using fully connected layers to combine them, using attention increases the accuracy of the system by 3.6 percentage points on average over the three languages (Table 4). This suggests that the attention layer was able to learn "where to look" and identify the tweets that are more informative for gender prediction.

Table 5 shows an example of attention values for three tweets of a particular user of each gender, where the attention values correspond to the probabilities of the respective tweets over the hundred tweets provided for the user by the PAN author profiling dataset. It can be seen that the attention layer assigned higher values to tweets with stronger gender indicators, such as the word "bro" for male and "love" for female, whereas it assigned low scores to automatically generated tweets like the third tweet of the male user.
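The averages and differences discussed here can be recomputed from the values reported in Tables 3 and 4. The short script below is only an arithmetic check; the variable names are arbitrary.

```python
# Accuracies (%) as reported in Tables 3 and 4
validation = {"English": 79.0, "Arabic": 75.7, "Spanish": 70.7}
test = {"English": 74.95, "Arabic": 69.20, "Spanish": 66.55}
no_attention = {"English": 76.3, "Arabic": 72.0, "Spanish": 66.3}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


avg_validation = round(mean(validation.values()), 1)
avg_test = round(mean(test.values()), 2)
avg_no_attention = round(mean(no_attention.values()), 1)

# Per-language gain from attention and drop from validation to test
attention_gain = {lang: round(validation[lang] - no_attention[lang], 1)
                  for lang in validation}
test_drop = {lang: round(validation[lang] - test[lang], 2)
             for lang in validation}
```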

Conclusion

We have described a system submitted to the author profiling task of PAN at CLEF 2018. A CNN architecture is proposed which takes the characters of each tweet's text as input. The input is organized per user: all tweets of a user are given to the system together. The system aims to extract local style-based features automatically.

The key idea behind the proposed system is that each tweet can carry a different level of information for discriminating the gender of a user. The attention mechanism, which is applied to the CNN outputs, is able to capture that difference. Therefore, the predictions are based mainly on the tweets holding more information about gender. As output, the system gives the prediction of the user's gender in vector form.

In the given dataset, in addition to tweets, there are images posted by the users. In the future, we plan to make use of those image data along with our current architecture, and we expect this addition to improve the results.

Figure 1. Attention mechanism.

Figure 2. The proposed model.

Table 1. Hyperparameters used in optimizations.

| Parameter | Values |
| [...] | [...], 10^-4, 5x10^-4, 10^-5, 5x10^-5, 10^-6 |
| L2 Regularization Coefficient | 5x10^-4, 10^-5, 5x10^-5, 10^-6, 5x10^-6, 10^-7, 5x10^-7, [...] |
| Filter [...] | [...] |

Table 2. Parameters with tuned values.

| Parameter | English | Spanish | Arabic |
| Embedding Size | 25 | 25 | 25 |
| Learning Rate | 10^-4 | 10^-4 | 10^-4 |
| L2 Regularization Coefficient | 10^-6 | 5x10^-6 | 10^-6 |
| Filter sizes | 3, 6 | 3, 6 | 3, 6, 9 |
| Number of Filters | 75 | 60 | 50 |
| Strides | 1 | 1 | 1 |

Table 3. Gender prediction accuracy for each language.

| Language | Validation Accuracy (%) | Test Accuracy (%) |
| English | 79.0 | 74.95 |
| Arabic | 75.7 | 69.20 |
| Spanish | 70.7 | 66.55 |
| Average | 75.1 | 70.23 |

Table 4. Accuracy (%) of models with and without attention mechanism.

| Language | CNN without attention | CNN with attention |
| English | 76.3 | 79.0 |
| Arabic | 72.0 | 75.7 |
| Spanish | 66.3 | 70.7 |
| Average | 71.5 | 75.1 |

Table 5. Example of attention values on tweets for two users.

| User | Tweets |
| [...] | If you caught Prt 2 of "The Oldest Profession" on Night,s hear the other 2 progs: https:******* or download all 3 @******* |

¹ The implementation can be found at: https://github.com/Darg-Iztech/Gender_Classification

² Implementation can be found at: https://github.com/dennybritz/cnn-text-classification-tf

References

[1] Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Commun. ACM 52(2) (Feb 2009). doi:10.1145/1461928.1461959

[2] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations (2014)

[3] Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M.: N-gram: New groningen author-profiling model. CoRR abs/1707.03764 (2017)

[4] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12 (Nov 2011)

[5] Goldberg, Y., Hirst, G.: Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers (2017)

[6] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (2014)

[7] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)

[8] Kodiyan, D., Hardegger, F., Neuhaus, S., Cieliebak, M.: Author profiling with bidirectional rnns using attention with grus. CLEF (2017)

[9] Lecun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. In: Arbib, M. (ed.) The handbook of brain theory and neural networks. MIT Press (1995)

[10] Loper, E., Bird, S.: Nltk: The natural language toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, vol. 1. ETMTNLP '02, Association for Computational Linguistics, Stroudsburg, PA, USA (2002)

[11] Miura, Y., Taniguchi, T., Taniguchi, M., Ohkuma, T.: Author Profiling with Word+Character Neural Attention Network-Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop - Working Notes Papers. Dublin, Ireland (Sep 2017)

[12] Pardo, F.M.R., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter. CLEF (2017)

[13] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York (Sep 2014)

[14] Rangel, F., Rosso, P.: On the impact of emotions on author profiling. Information Processing & Management 52(1) (2016)

[15] Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS (Sep 2018)

[16] Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast, M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18). Springer, Berlin Heidelberg New York (Sep 2018)