Cross-genre Age and Gender Identification in Social Media

Anam Zahid

Aadarsh Sampath

Anindya Dey

Golnoosh Farnadi

0 Center for Data Science, University of Washington Tacoma, WA, USA
1 Dept. of Appl. Math., Comp. Science and Statistics, Ghent University, Belgium
2 Dept. of Computer Science, Katholieke Universiteit Leuven, Belgium

This paper¹ briefly describes the methods we adopted for the author profiling task of the PAN 2016 competition [1]. Author profiling is the task of predicting an author's age and gender from his/her writing. We follow a two-level ensemble approach to tackle the cross-genre author profiling task, in which training and test documents come from different genres, and we use soft voting to build the classification ensemble. To include various feature sets, we first train logistic regression models on the word n-gram, character n-gram, and part-of-speech n-gram features extracted for each genre. We then combine the single-genre predictive models trained on the blog, social media and Twitter data sources to build our multi-genre ensemble. The experimental results indicate that our approach performs well in both single-genre and cross-genre author profiling tasks.

Keywords: Gender identification, Age prediction, Ensemble technique, Text mining, Cross-genre classification, Author profiling

Introduction

The rapid development of social media platforms has led to a massive volume of user-generated text in the form of blog posts, status updates, and tweets. This has generated great research interest in identifying authors' profiles [2]. Author profiling is the task of predicting an author's age and gender from his/her writing. Most recent work in author profiling addresses the problem as a single-genre task, where the instances of the training set and the test set come from a single platform. Because gathering ground-truth data for every platform is difficult, the cross-genre author profiling task has been proposed. Cross-genre profiling has been done for personality prediction in [3]; however, little work has been done on identifying the age and gender of users in a cross-genre setting. Such models could be applied in environments where training data representative of the deployment domain is not available.

Effective features in recent work on age and gender classification include both content features, such as unigrams, bigrams and word classes, and stylistic features, such as part-of-speech (POS) tags, slang words and average sentence length. For instance, for gender identification, Villena Roman et al. [4] extracted n-grams or bag-of-words as content features, and Argamon et al. [5] combined function words with POS tags. Following this related work, we include various feature sets in our model by training logistic regression models on the word n-gram, character n-gram, and POS n-gram features extracted from the documents.

We propose a two-level ensemble approach: a multi-genre predictive model that combines single-genre predictive models built from the available ground-truth datasets of various genres, i.e., the blog, social media and Twitter datasets. Our multi-genre ensemble leverages various types of documents as training examples, which makes it suitable for the cross-genre author profiling task of the PAN 2016 competition, where the test documents are from a hidden genre. The experimental results indicate that our ensemble approach can be used for both single-genre and cross-genre author profiling tasks. The rest of this paper describes the details of our submission to the PAN 2016 cross-genre author profiling task.

¹ This paper is an extended abstract.

Methodology

Let U be the set of all authors, where U = U_train ∪ U_test. For all users in U_train we know their age and gender, and our aim is to predict the age and gender of all users in U_test based on their written text. If U_train and U_test come from one platform (i.e., genre), we call the task a single-genre author profiling task; if U_train and U_test come from different social media platforms, we call it a cross-genre author profiling task. The overall architecture of our proposed ensemble approach for single-genre (S-G) and multi-genre (M-G) author profiling is shown in Figure 1. With the S-G ensemble we incorporate various features extracted from the documents; with the M-G ensemble we not only use different features but also leverage predictive models of different genres, which makes the framework suitable for the cross-genre author profiling task.

2.1 Pre-processing and data description: The data provided by the PAN organizers was in the form of XML documents, from which user content was extracted and cleaned by removing HTML tags and stop words. To tackle the cross-genre author profiling task, we collected data from the 2014 and 2015 PAN author profiling contests and added them to our training dataset. For English and Spanish, we built three datasets from different genres: (1) social media, with 7,746 documents for English and 1,272 documents for Spanish; (2) blog, with 147 documents for English and 88 documents for Spanish; and (3) Twitter, with 576 documents for English and 340 documents for Spanish. For Dutch, we gathered a Twitter dataset with 418 documents. In all datasets the gender distribution is uniform. The statistics of the combined datasets w.r.t. the frequencies of the five age groups (i.e., [18, 24], [25, 34], [35, 49], [50, 64], and [65, xx]) are shown in Table 1. Note that for the Dutch dataset we do not have the age of the authors.
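The cleaning step of Section 2.1 could look roughly like the sketch below. It is an illustration only, not the submitted code: the XML layout (an `author` root containing `document` elements), the deliberately tiny `STOP_WORDS` set, and the function names `clean_text` and `clean_author_xml` are all assumptions made for the example.

```python
import html
import re
import string
import xml.etree.ElementTree as ET

# Hypothetical, deliberately tiny stop-word list; a real run would use a
# full list for each language.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for HTML tags


def clean_text(raw):
    """Strip HTML tags, decode HTML entities, and drop stop words."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    kept = [
        token
        for token in text.split()
        # Compare without surrounding punctuation and case, keep the token as written.
        if token.strip(string.punctuation).lower() not in STOP_WORDS
    ]
    return " ".join(kept)


def clean_author_xml(xml_string):
    """Return one cleaned string per <document> element (assumed layout)."""
    root = ET.fromstring(xml_string)
    return [clean_text(doc.text or "") for doc in root.iter("document")]
```

Stop words are matched case-insensitively while the surviving tokens keep their original form, so that casing and punctuation remain available to character-level features; whether the original pipeline preserved them is not stated in the text.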
2.2 Feature extraction: To create our feature space, we extract three categories of features, drawing inspiration from related work. All implementations are based on the Python machine learning package scikit-learn². The extracted features are: (1) word n-grams with n ∈ {1, 2, 3} (i.e., uni-, bi- and trigrams), using TF-IDF as a weighting mechanism; (2) character n-grams with n ∈ {3, 4, 5, 6, 7}, using TF-IDF as a weighting mechanism; to reduce the size of the feature space, we select the top k features using the chi-square test, with k = 5000; and (3) POS n-grams, for which we extract part-of-speech (POS) tags from each document using the NLTK package in Python³. Each word in the text is mapped to its corresponding POS tag, and the resulting sequence of POS tags is used to extract n-gram features with the same configuration as the word n-grams, i.e., n ∈ {1, 2, 3} and TF-IDF weighting.

2.3 Predictive model: We train binary classifiers for predicting the gender of users and multi-class classifiers for predicting their age. For each genre-label-language combination, we train three predictive models, one per feature set described above, with logistic regression as the classifier. We then apply soft voting over the prediction scores of these models to form the ensemble. The results of applying our S-G ensemble approach to the Twitter, social media and blog datasets are presented in Table 2. Our S-G ensemble outperforms the majority baseline in predicting the gender of users on all three datasets and for all three languages; for age prediction, however, it outperforms the baseline on the social media and Twitter datasets for English and Spanish.
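A single-genre ensemble of this kind might be assembled in scikit-learn along the following lines. This is a hedged sketch rather than the authors' implementation: the function names, the `max_iter` setting, the choice to apply chi-square selection to the character n-grams only, and the `pos_tagger` hook (any callable that turns a document into a space-separated string of POS tags) are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_sg_models(pos_tagger, k=5000):
    """Build one logistic-regression pipeline per feature set.

    pos_tagger is a hypothetical hook: a callable mapping a document string
    to a space-separated string of POS tags.
    """
    word_model = make_pipeline(
        TfidfVectorizer(analyzer="word", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    char_model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(3, 7)),
        SelectKBest(chi2, k=k),  # assumption: selection on char n-grams only
        LogisticRegression(max_iter=1000),
    )
    pos_model = make_pipeline(
        # The preprocessor replaces each document by its POS-tag sequence;
        # the token pattern keeps tags such as "PRP$" or "." intact.
        TfidfVectorizer(preprocessor=pos_tagger, token_pattern=r"\S+",
                        ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    return [word_model, char_model, pos_model]


def fit_sg_ensemble(models, texts, labels):
    """Fit every feature-specific model on the same documents and labels."""
    for model in models:
        model.fit(texts, labels)
    return models


def soft_vote_proba(models, texts):
    """Average the class probabilities of all models (uniform weights)."""
    return np.mean([m.predict_proba(texts) for m in models], axis=0)


def soft_vote_predict(models, texts):
    """Return the label with the highest averaged probability per document."""
    proba = soft_vote_proba(models, texts)
    # All models saw the same labels, so their class order is identical.
    return models[0].classes_[np.argmax(proba, axis=1)]
```

With NLTK and its tokenizer and tagger resources installed, one possible `pos_tagger` is `lambda d: " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(d)))`. The same functions serve both the binary gender task and the multi-class age task, since only the labels differ.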
To tackle the cross-genre author profiling task, we first build an S-G ensemble model for each genre; e.g., for English we build three S-G ensemble models, for the social media, blog and Twitter datasets. We then ensemble their predictions to obtain the final predictive model for the cross-genre author profiling task. To investigate the performance of our approach for cross-genre age and gender prediction, we conducted three sets of experiments: using the blog, social media and Twitter datasets, we apply the pre-trained models of two sources to test on the remaining source. The results indicate that our approach can be used for the cross-genre author profiling task, with results better than or equal to the baseline (see Table 3). Since users' language on Twitter differs from the language they use in blog posts, selecting the training examples from the most similar datasets would be an advantage in cross-genre author profiling. For PAN 2016, however, the genre of the test set was hidden, so we combine all the available datasets in our submitted software. The results of our submission on the hidden PAN 2016 test data, evaluated using TIRA [6], are presented in [1].

² http://scikit-learn.org/
³ http://www.nltk.org/
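The two-level combination and the leave-one-genre-out experiments can be illustrated with a small, dependency-free sketch. It assumes that both levels use uniform soft voting, i.e., plain averaging of class-probability vectors; the function names and the input layout are hypothetical.

```python
def average(vectors):
    """Element-wise mean of equally long probability vectors (uniform weights)."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def two_level_soft_vote(scores_by_genre):
    """Combine class-probability vectors for one test document.

    scores_by_genre maps each training genre to the list of probability
    vectors produced by that genre's feature-specific models.
    """
    # Level 1: average the feature-specific models within each genre (S-G).
    genre_scores = [average(vectors) for vectors in scores_by_genre.values()]
    # Level 2: average the single-genre ensembles across genres (M-G).
    return average(genre_scores)


def leave_one_genre_out(genres):
    """Yield (training_genres, held_out_genre) pairs, one per experiment."""
    for held_out in genres:
        yield [g for g in genres if g != held_out], held_out
```

For example, calling `leave_one_genre_out(["blog", "social media", "twitter"])` yields three splits, each training on two genres and testing on the third, which matches the three sets of experiments described above.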

Conclusion

In this paper, we briefly explained our proposed two-level ensemble approach to the cross-genre author profiling task. The approach is flexible and can incorporate many of the available feature sets and sources of information, which makes it suitable for the cross-genre author profiling task, where little or no training data is available from the same genre as the test data. Experimental results on various datasets and languages indicate the capability of our approach. In our approach, we assigned uniform weights when combining the predictive models. Giving higher weights to the predictive models with better performance may improve the overall performance, which we leave as future work.
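One way to write the ensemble rule with explicit weights is given below. It is a possible formalisation, not taken from the paper: it assumes that soft voting combines the class-probability scores p_m(y | d) that each of the M models assigns to label y for document d, and the weights w_m are introduced here only for illustration.

```latex
\hat{y}(d) = \arg\max_{y} \sum_{m=1}^{M} w_m \, p_m(y \mid d),
\qquad w_m \ge 0, \quad \sum_{m=1}^{M} w_m = 1 .
```

Uniform weights w_m = 1/M correspond to the setting used in this paper. Setting w_m in proportion to each model's accuracy on held-out data would be one hypothetical instance of the performance-based weighting mentioned above.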

References

[1] Rangel, Rosso, Verhoeven, Daelemans, Potthast, and Stein, "Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations," in Proc. of the CLEF Evaluation Labs and Workshop, 2016.

[2] Rangel, Rosso, Potthast, Stein, and W. Daelemans, "Overview of the 3rd Author Profiling Task at PAN 2015," in Proc. of the CLEF Evaluation Labs and Workshop, 2015.

[3] Farnadi, G. Sitaraman, Sushmita, Celli, Kosinski, Stillwell, Davalos, M.-F. Moens, and M. De Cock, "Computational personality recognition in social media," User Modeling and User-Adapted Interaction, vol. 26, no. 2, pp. 109-142, 2016.

[4] J. Villena Roman and J.-C. Gonzalez Cristobal, "DAEDALUS at PAN 2014: Guessing tweet author's gender and age," in Proc. of the CLEF Evaluation Labs and Workshop, 2014.

[5] Argamon, Koppel, Fine, and A. R. Shimoni, "Gender, genre, and writing style in formal written texts," TEXT, vol. 23, no. 3, pp. 321-346, 2003.

[6] Potthast, Gollub, Rangel, Rosso, E. Stamatatos, and Stein, "Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling," in Proc. of the CLEF Evaluation Labs and Workshop, 2014.