=Paper=
{{Paper
|id=Vol-1609/16091014
|storemode=property
|title=Cross-Genre Age and Gender Identification in Social Media
|pdfUrl=https://ceur-ws.org/Vol-1609/16091014.pdf
|volume=Vol-1609
|authors=Anam Zahid,Aadarsh Sampath,Anindya Dey,Golnoosh Farnadi
|dblpUrl=https://dblp.org/rec/conf/clef/ZahidSDF16
}}
==Cross-Genre Age and Gender Identification in Social Media==
Cross-genre Age and Gender Identification in Social Media

Anam Zahid1, Aadarsh Sampath1, Anindya Dey1, Golnoosh Farnadi2,3

1 Center for Data Science, University of Washington Tacoma, WA, USA
2 Dept. of Appl. Math., Comp. Science and Statistics, Ghent University, Belgium
3 Dept. of Computer Science, Katholieke Universiteit Leuven, Belgium

Abstract. This paper, an extended abstract, gives a brief description of the methods adopted for the author profiling task of the PAN 2016 competition [1]. Author profiling is the task of predicting an author's age and gender from his/her writing. In this paper, we follow a two-level ensemble approach to tackle the cross-genre author profiling task, where training documents and testing documents are from different genres. We use a soft-voting approach to build the classification ensemble. To include various feature sets, we first train logistic regression models using the extracted word n-gram, character n-gram, and part-of-speech n-gram features for each genre. We then ensemble the single-genre predictive models trained on the blog, social media, and Twitter data sources to build our multi-genre ensemble approach. The experimental results indicate that our approach performs well in both single-genre and cross-genre author profiling tasks.

Keywords: Gender identification, Age prediction, Ensemble technique, Text mining, Cross-genre classification, Author profiling

1 Introduction

The rapid development of social media platforms has led to a massive volume of user-generated text in the form of blog posts, status updates, and tweets. This has generated great research interest in identifying authors' profiles [2]. Author profiling is the task of predicting an author's age and gender from his/her writing. Most recent work in author profiling addresses the problem as a single-genre task, where the instances of the training set and the test set come from a single platform.
Due to the difficulty of gathering ground-truth data for every platform, the cross-genre author profiling task has been proposed. Cross-genre profiling has been done for personality prediction in [3]; however, little work has been done on identifying the age and gender of users in a cross-genre setting. Such models could be applied in environments where no training data representative of the deployment domain is available. Effective features in recent work on age and gender classification include content features, such as unigrams, bigrams, and word classes, as well as stylistic features, such as part-of-speech (POS) tags, slang words, and average sentence length. For instance, for gender identification, Villena Román et al. [4] extracted n-grams or bag-of-words as content features. In [5], Argamon et al. approached the task of gender identification by combining function words with POS tags.

Given the related work in this domain, we include various feature sets in our model by training logistic regression models using word n-gram, character n-gram, and POS n-gram features extracted from the documents. We propose a two-level ensemble approach: a multi-genre predictive model that ensembles single-genre predictive models built from the available ground-truth datasets of various genres, i.e., the blog, social media, and Twitter datasets. Our multi-genre ensemble approach leverages various types of documents as training examples, which makes it suitable for the cross-genre author profiling task of the PAN 2016 competition, where the testing documents are from a hidden genre. The experimental results indicate that our ensemble approach can be used for both single-genre and cross-genre author profiling tasks. The rest of this paper describes the details of our submission to the PAN 2016 cross-genre author profiling task.

Fig. 1: The architecture of the multi-genre ensemble model.
2 Methodology

Let us assume U is a set of all authors, where U = Utrain ∪ Utest. For all users in Utrain, we know their age and gender, and our aim is to predict the age and gender of all users in Utest based on their written text. If Utrain and Utest come from one platform (a.k.a. genre), we call the task a single-genre author profiling task; if Utrain and Utest are from different social media platforms, we call it a cross-genre author profiling task. The overall architecture of our proposed ensemble approach for single-genre (S-G) and multi-genre (M-G) author profiling is shown in Figure 1. Using the S-G ensemble approach, we incorporate various features extracted from the documents; using the M-G ensemble approach, we not only use different features but also leverage predictive models of different genres, which makes the framework suitable for the cross-genre author profiling task.

2.1 Pre-processing and data description: The data provided by the PAN organizers was in the form of XML documents, from which user contents were extracted and cleaned by removing HTML tags and stop words. To tackle the cross-genre author profiling task, we collected data from the 2014 and 2015 PAN author profiling contests and added it to our training dataset. For English and Spanish, we made three datasets from different genres: (1) social media, with 7,746 documents for English and 1,272 documents for Spanish; (2) blog, with 147 documents for English and 88 documents for Spanish; and (3) Twitter, with 576 documents for English and 340 documents for Spanish. For the Dutch dataset, we gathered data from Twitter, with 418 documents. In all the datasets the gender distributions are uniform.
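A minimal sketch of this cleaning step, assuming a regex-based tag stripper and an illustrative stop-word list; the paper does not specify the exact implementation:

```python
import re

# Illustrative stop-word subset; the actual list used by the authors is not specified.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is"}

def clean_document(raw_html: str) -> str:
    """Strip HTML tags, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)       # remove HTML tags
    tokens = re.findall(r"[a-z']+", text.lower())  # simple word tokenizer
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_document("<p>The cat sat on <b>the</b> mat</p>"))  # cat sat mat
```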
The statistics of the combined datasets w.r.t. the frequencies of the five age groups (i.e., [18, 24], [25, 34], [35, 49], [50, 64], and [65, xx]) are shown in Table 1. Note that for the Dutch dataset we do not have the age of the authors.

Table 1: Statistics of the combined datasets w.r.t. the users' age.

Genre          Language   [18, 24]  [25, 34]  [35, 49]  [50, 64]  [65, xx]
blog           English           6        60        54        23         4
blog           Spanish           4        26        42        12         4
social media   English       1,550     2,098     2,246     1,838        14
social media   Spanish         330       426       324       160        32
Twitter        English          86       200       204        80         6
Twitter        Spanish          38       110       148        38         6

2.2 Feature extraction: To create our feature space, we extract three different categories of features, drawing inspiration from related works. All the implementations are based on the Python machine learning package scikit-learn². The extracted features are (1) word n-grams with n = {1, 2, 3} (i.e., uni-, bi-, and tri-grams), using TF-IDF as a weighting mechanism; (2) character n-grams with n = {3, 4, 5, 6, 7}, using TF-IDF as a weighting mechanism, where, to reduce the size of the feature space, we select the top k features using Chi-square hypothesis testing with k = 5000; and (3) POS n-grams, for which we extract part-of-speech (POS) tags from each document using the nltk package in Python³. Each word in the text is mapped to its corresponding POS tag, and the text comprising those POS tags is used to extract n-gram features with the same configuration as the word n-grams, i.e., n = {1, 2, 3} and TF-IDF weighting.
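Under the configuration above, the three feature sets can be sketched with scikit-learn as follows. The toy corpus, labels, and pre-tagged POS strings are placeholders (the paper tags text with nltk's tagger), and k is capped so the example runs on two documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and binary gender labels; placeholders for the PAN documents.
docs = ["she loves writing long blog posts every week",
        "he tweets short updates about football all day"]
labels = [0, 1]

# (1) word n-grams, n = 1..3, TF-IDF weighted
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
X_word = word_vec.fit_transform(docs)

# (2) char n-grams, n = 3..7, TF-IDF weighted, then top-k chi-square selection
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 7))
X_char = char_vec.fit_transform(docs)
k = min(5000, X_char.shape[1])  # k = 5000 in the paper; capped for the toy corpus
X_char_sel = SelectKBest(chi2, k=k).fit_transform(X_char, labels)

# (3) POS n-grams: each word is replaced by its POS tag (nltk.pos_tag in the
# paper); pre-tagged strings stand in for the tagger here.
pos_docs = ["PRP VBZ VBG JJ NN NNS DT NN", "PRP VBZ JJ NNS IN NN DT NN"]
pos_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
X_pos = pos_vec.fit_transform(pos_docs)
```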
Table 2: Accuracy of the age and gender prediction using the single-genre ensemble (S-G) approach. Values marked with * are higher than the majority baseline (base). All results are averaged over 5-fold cross-validation.

              English              Spanish              Dutch
              Gender     Age       Gender     Age       Gender
Genre         base S-G   base S-G  base S-G   base S-G  base S-G
blog          0.50 0.65* 0.41 0.32 0.50 0.66* 0.48 0.48  -    -
social media  0.50 0.54* 0.29 0.34* 0.50 0.60* 0.33 0.34* -   -
Twitter       0.50 0.60* 0.35 0.46* 0.50 0.57* 0.43 0.48* 0.50 0.53*

2.3 Predictive model: We train binary classifiers for predicting the gender of users and multi-class classifiers for predicting their age. For the age and gender prediction tasks, we train three predictive models using the three feature sets explained above, with logistic regression as the classifier, for each genre-label-language combination. We then apply an ensemble soft-voting approach using the prediction scores of the models. The results of applying our S-G ensemble approach on the Twitter, social media, and blog datasets are presented in Table 2. Our S-G ensemble approach outperforms the majority baseline in predicting the gender of users for all three datasets in all three languages; for the task of age prediction, our approach outperforms the baseline for the social media and Twitter datasets in English and Spanish.

To tackle the cross-genre author profiling task, we first built S-G ensemble models for each genre; e.g., for the English dataset, we built three S-G ensemble models for the social media, blog, and Twitter datasets. We then ensemble their predictions as the final predictive model for the cross-genre author profiling task. To investigate the performance of our approach for the task of cross-genre age and gender prediction, we conducted three sets of experiments: using the blog, social media, and Twitter datasets, we take the pre-trained models of two sources and test on the remaining source. The results indicate that our approach can be used for the cross-genre author profiling task, with results better than or equal to the baseline (see Table 3).

² http://scikit-learn.org/
³ http://www.nltk.org/
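The two-level scheme described in Section 2.3 can be sketched as follows, assuming uniform soft voting at both levels; the random feature matrices and labels are placeholders for the real per-genre feature sets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def single_genre_scores(train_sets, y, test_sets):
    """S-G ensemble: soft-vote (uniform average of predicted class
    probabilities) over one logistic regression per feature set."""
    probas = []
    for X_train, X_test in zip(train_sets, test_sets):
        clf = LogisticRegression(max_iter=1000).fit(X_train, y)
        probas.append(clf.predict_proba(X_test))
    return np.mean(probas, axis=0)

# Toy stand-ins: 3 feature sets, 20 training and 4 test examples each.
# In the real system each genre has its own documents; reused here for brevity.
y = np.array([0, 1] * 10)
train_sets = [rng.normal(size=(20, 10)) for _ in range(3)]
test_sets = [rng.normal(size=(4, 10)) for _ in range(3)]

# Level 1: one S-G score matrix per genre (blog, social media, Twitter).
genre_scores = [single_genre_scores(train_sets, y, test_sets) for _ in range(3)]

# Level 2 (M-G): uniform average of the genre-level scores, then argmax.
mg_scores = np.mean(genre_scores, axis=0)
predictions = mg_scores.argmax(axis=1)
```

Weighted variants (e.g., higher weights for better-performing genre models, as the conclusion suggests) would replace the uniform `np.mean` with a weighted average.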
However, since users' language on Twitter differs from their language in blog posts, selecting training examples from the most similar datasets would be an advantage in cross-genre author profiling. For PAN 2016, since the genre of the test set was hidden, we combined all the available datasets in our submitted software. The results of our submission for PAN 2016 on the hidden test data, evaluated using TIRA [6], are presented in [1].

Table 3: Accuracy of the age and gender prediction using the multi-genre ensemble (M-G) approach. Values marked with * are higher than the majority baseline (base).

                                          English              Spanish
                                          Gender     Age       Gender     Age
Test (genre)   Train (genres)             base M-G   base M-G  base M-G   base M-G
blog           Twitter+social media       0.50 0.62* 0.37 0.46* 0.50 0.50 0.29 0.45*
social media   Twitter+blog               0.50 0.50  0.27 0.29* 0.50 0.52* 0.25 0.25
Twitter        social media+blog          0.50 0.51* 0.35 0.35  0.50 0.55* 0.32 0.46*

3 Conclusion

In this paper, we briefly explained our proposed two-level ensemble approach for tackling the cross-genre author profiling task. Our approach is flexible and can incorporate many available feature sets and sources of information, which makes it suitable for the cross-genre author profiling task, where few or no training examples are available from the same genre. Experimental results on various datasets and languages indicate the capability of our approach. In our approach, we assigned uniform weights when ensembling the predictive models; however, giving higher weights to the predictive models with better performance may improve the overall performance, which is an open path to explore in the future.

References

1. F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Potthast, and B. Stein, "Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations," in Proc. of the CLEF Evaluation Labs and Workshop, 2016.
2. F. Rangel, P. Rosso, M. Potthast, B. Stein, and W.
Daelemans, "Overview of the 3rd Author Profiling Task at PAN 2015," in Proc. of the CLEF Evaluation Labs and Workshop, 2015.
3. G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock, "Computational personality recognition in social media," User Modeling and User-Adapted Interaction, vol. 26, no. 2, pp. 109–142, 2016.
4. J. Villena Román and J.-C. González Cristóbal, "DAEDALUS at PAN 2014: Guessing tweet author's gender and age," in Proc. of the CLEF Evaluation Labs and Workshop, 2014.
5. S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni, "Gender, genre, and writing style in formal written texts," TEXT, vol. 23, no. 3, pp. 321–346, 2003.
6. M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein, "Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling," in Proc. of the CLEF Evaluation Labs and Workshop, 2014.