
Author Profiling of Twitter Users

Roy Bayot, Teresa Gonçalves, and Paolo Quaresma, Universidade de Évora

2015

In this paper, we focus on profiling authors by age, gender, and five personality traits. The corpus consists of anonymized Twitter posts in four languages. Our approach combines tfidf, function words, stylistic features, and text bigrams, and uses an SVM for each task.

1 Introduction

Author profiling from text has attracted interest recently because of the growing availability of text. This growth is mostly due to the internet, where text is one of the main forms of communication: blogs, websites, customer reviews, and even Twitter posts.

While authors on the web are often anonymous, profiling can be useful in areas such as marketing, advertising, and security. Profiling uses an author's text to determine aspects of the author such as age, gender, and certain personality traits. The idea is that topic choice and word usage are affected by such aspects. For instance, bands or music that is trending at the time would be a typical topic for teenagers. Profiling is not always easy, since some people do not write in a way that is typical of their age. Some people also write fiction, so a text may be written from the perspective of someone with a different personality type.

PAN is making an effort to address this problem. In this year's edition of the PAN author profiling task, the goal is to profile Twitter users in four languages: English, Dutch, Italian, and Spanish. The tasks include profiling for age, gender, and the big five personality traits: agreeability, conscientiousness, extrovertedness, openness, and stability [6].

Several similar approaches have been used previously. For instance, [2] categorized text by gender using 405 function words and part-of-speech n-grams: the 500 most common ordered triples, the 100 most common ordered pairs, and all single tags. In [7], both style-based features (POS tags, function words, blog words, and hyperlinks) and content-based features (content words and hand-crafted LIWC categories) were used to classify by age and gender. In the previous year, PAN also ran an author profiling task, but on several sources, not just tweets. In [3], terms were represented in a space of profiles, documents were then represented in that space, and subprofiles were built using expectation-maximization clustering. In [4], n-grams were used with stopwords, punctuation, and emoticons retained, and idf weighting was applied before classification with 5 different classifiers; Liblinear logistic regression gave the best result. In [9], the features were related to length (number of characters, words, sentences), information retrieval (cosine similarity, Okapi BM25), and readability (Flesch-Kincaid readability, correctness, style), and were used with 7 different classifiers. Another approach is to use a term vector model representation, as in [8]. Marquardt et al. [5] used a combination of content-based features (MRC, LIWC, sentiments) and stylistic features (readability, HTML tags, spelling and grammatical errors, emoticons, total number of posts, number of capitalized letters, number of capitalized words).

Since this is our first submission to PAN, we opted for a simpler approach using tfidf, function words, some stylistic features, and text bigrams.

2 Methodology

For a first submission to this task, we decided to use the same approach for all the tasks. The method is straightforward: extract basic features, concatenate the different features, use the combined features for classification or regression, and evaluate with 10-fold cross validation.

2.1 Feature Vector Creation

Four main feature types are used in this submission, and each is processed separately. The first is tfidf. Term frequency-inverse document frequency, or tfidf, is one of the most commonly used text features.

Before extracting the tfidf features, the tweets were preprocessed. All tweets from a single person were concatenated. Numbers were removed, and the text was converted to lower case. Stopwords from the NLTK toolkit [1] were then removed from the set of words. Finally, the resulting words were used to build a tfidf vector representation with the scikit-learn Python library. The maximum vector size was set to 10000, and the excess terms were discarded based on document frequency. All other vectorizer settings were left at their defaults. Note that some of the tfidf representations did not reach the maximum of 10000 dimensions.
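The preprocessing and vectorization steps above can be sketched as follows. This is a minimal illustration, not the submitted code: the `preprocess` helper, the toy `authors` dictionary, and the short `STOPWORDS` set (a stand-in for the NLTK stopword lists) are all assumptions made for the example.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in stop list for the example; the paper uses the NLTK stopword lists.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in"}

def preprocess(tweets):
    """Concatenate one author's tweets, remove numbers, lowercase, drop stopwords."""
    text = " ".join(tweets)
    text = re.sub(r"\d+", "", text).lower()
    return " ".join(w for w in text.split() if w not in STOPWORDS)

# Hypothetical toy corpus: one list of tweets per author.
authors = {
    "user_a": ["The concert is at 9", "Going to the gig in 2015"],
    "user_b": ["Quarterly report is due", "Meeting at 10 and lunch"],
}
docs = [preprocess(tweets) for tweets in authors.values()]

# Cap the vocabulary at 10000 terms; every other setting stays at its default.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of authors, vocabulary size, at most 10000)
```

On a small corpus the vocabulary has fewer than 10000 terms, which is why some representations do not reach the maximum size. In scikit-learn, `max_features` keeps the terms with the highest term frequency across the corpus, so the ranking criterion here may differ slightly from the one described in the text.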

The second feature type is stylistic features. We only detected the presence or absence of certain characters or combinations of characters: "#", "@username", "http://", ":)", ";)", "o_O", "!", "!!", "!!!", ":(". This set is by no means exhaustive and was only an initial choice. The octothorpe indicates whether a hashtag was used. The "@username" token indicates that the user tags other Twitter users. Normally this would be a Twitter handle, but since the data was anonymized, we used this token instead. The set ":)", ";)", "o_O", and ":(" checks for some sort of emotion. Finally, the exclamation points could indicate surprise or the intensity of a statement, which is common on the internet.
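A presence/absence check of this kind can be sketched in a few lines. The marker list is taken from the text; the function name `stylistic_features` and the use of simple substring matching are assumptions for illustration.

```python
# Markers listed in the text; each becomes one binary feature.
MARKERS = ["#", "@username", "http://", ":)", ";)", "o_O", "!", "!!", "!!!", ":("]

def stylistic_features(text):
    """Return a 0/1 vector: 1 if the marker occurs anywhere in the text."""
    return [1 if marker in text else 0 for marker in MARKERS]

print(stylistic_features("Hello @username!! :)"))
# [0, 1, 0, 1, 0, 0, 1, 1, 0, 0]
```

With substring matching, a text containing "!!" also sets the "!" feature, so the three exclamation features are nested rather than mutually exclusive.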

The third feature type is function words. Function words are informative words that can be used to discriminate between classes. They were obtained by fitting a decision tree on all instances in the training data and taking the most informative features, with entropy as the criterion. Tables 1 and 2 show the words and characters obtained as function words for the age and gender tasks.

dutch: "zit", "heel", "best", "geeft", "idee", "nooit", "weer", "binnen", "goed", "avond", [...], "onderzoeksjournalistiek", "onzin", "proficiat", "ten", "verdient", "verzuurde", "werkt"

english: "co", "wanna", "us", "haha", "username", "fitbit", "et", "bowl", "academia", "bitch", "happened", "even", "year", "reach", "free", "times", "speech", "top", "add", "social", "think", "nothing", "financial", "pop", "inspiring", "lil", "complicated", "aa"

italian: "domani", "fa", "poi", "pezzo", "immagini", "quel", "ultimo", "binari", "bravo", [...], "eccomi", "esempio", "novit", "oscena", "pard", "piazza", "preso", "pu", "rispetto", "yg"

spanish: [...]

Table 1. Function words for age task.

dutch: "username", "goed", "bent", "saai"

english: "close", "love", "mention", "co", [...], "round", "thank", "bird", "wouldn", "aa"

italian: [...], "ottimo"

spanish: "vida", "alguien", "corrupci", "ciudades", "si", "temprano", "puro", "meta", "foto", "dio"

Table 2. Function words for gender task.
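The selection of function words with an entropy-based decision tree can be sketched as below. The toy documents, the "younger"/"older" labels, and the choice to rank words by `feature_importances_` are assumptions for illustration; the text only states that a decision tree with the entropy criterion was used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training set: one concatenated document per author.
docs = [
    "haha wanna pop lil",
    "haha fitbit bowl",
    "financial speech academia",
    "financial social times",
]
labels = ["younger", "younger", "older", "older"]

# Binary word-presence features, then a tree that splits on entropy.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, labels)

# Keep every word the tree actually used, most informative first.
index_to_word = {i: w for w, i in vec.vocabulary_.items()}
importances = tree.feature_importances_
ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
function_words = [index_to_word[i] for i in ranked if importances[i] > 0]
print(function_words)
```

In this toy set a single word separates the two classes, so only one word is selected. On real data the tree uses many splits, which yields word lists like those in Tables 1 and 2.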

For the personality tasks, the decision tree was built by framing the output as a classification problem. Instead of continuous values from -0.5 to 0.5, we used discrete values from -0.5 to 0.5 at intervals of 0.1. The words for the personality tasks are shown in Tables 3-7.
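One way to frame the personality scores as classes is to map each score to its position on the 0.1 grid, which gives 11 classes. The helper names `to_class` and `to_score`, and the clipping of out-of-range scores, are assumptions made for this sketch.

```python
def to_class(score):
    """Map a score in [-0.5, 0.5] to a class index 0..10 on the 0.1 grid."""
    clipped = max(-0.5, min(0.5, score))
    return int(round((clipped + 0.5) * 10))

def to_score(index):
    """Map a class index 0..10 back to its grid value."""
    return round(index / 10 - 0.5, 1)

print(to_class(-0.5), to_class(0.23), to_class(0.5))
print(to_score(7))
```

The decision tree can then be trained on the class indices exactly as in the age and gender tasks.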

Finally, we also added text bigrams, to possibly capture some structure in the input texts. After the features were extracted and concatenated, we used a linear SVM with the default penalty parameter of 1. We used the scikit-learn library for this, and used the SVM as an initial check for results.
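The concatenation and classification steps can be sketched as follows. This is an illustration under several assumptions: the toy documents and labels are invented, "text bigrams" is interpreted as word bigrams, the function-word list and the two style markers are placeholders, and the vectorizers are fitted on the full toy set before cross validation for brevity.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical toy data: 10 authors per class so that 10-fold CV is possible.
docs = [f"love this band haha gig{i} !!" for i in range(10)]
docs += [f"financial report meeting due task{i}" for i in range(10)]
y = ["a"] * 10 + ["b"] * 10

# One block per feature type.
tfidf = TfidfVectorizer(max_features=10000).fit_transform(docs)
fw = CountVectorizer(vocabulary=["haha", "financial"]).fit_transform(docs)
style = csr_matrix([[1 if m in d else 0 for m in ("#", "!")] for d in docs])
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs)

# Concatenate the blocks column-wise into one feature matrix.
X = hstack([tfidf, fw, style, bigrams]).tocsr()

# Linear SVM with penalty parameter C = 1, scored with 10-fold cross validation.
scores = cross_val_score(LinearSVC(C=1.0), X, y, cv=10)
print(scores.mean())
```

In a stricter setup the vectorizers would be fitted inside each fold, for example with a scikit-learn `Pipeline`, so that no vocabulary information leaks from the held-out fold.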

Experiments and Results Setups

Each feature type was also used individually to classify or perform a regression, as were some combinations of the features. Tables 8-11 report the results for the different tasks using tfidf, function words (FW), stylistic features (SF), and text bigrams (TB), as well as combinations of these. The results from PAN are summarized in the table below. The results were not as satisfactory as we had hoped.

In conclusion, much improvement is still needed for these tasks, for instance by exploring more features, such as additional stylistic features. Other classifiers and parameter tuning also remain to be explored. One possible mistake this year was to choose the single feature combination that gave the best result overall, rather than selecting specific models for specific languages and tasks. It would have been better to adapt the system in that way.

1. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009)

2. Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), 401-412 (2002)

3. López-Monroy, A.P., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L.: Using intra-profile information for author profiling

4. Maharjan, S., Shrestha, P., Solorio, T.: A simple approach to author profiling in MapReduce

5. Marquardt, J., Farnadi, G., Vasudevan, G., Moens, M.F., Davalos, S., Teredesai, A., De Cock, M.: Age and gender identification in social media. Proceedings of CLEF 2014 Evaluation Labs (2014)

6. Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd Author Profiling Task at PAN 2015. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2015), http://www.clef-initiative.eu/publication/working-notes

7. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol. 6, pp. 199-205 (2006)

8. Villena-Román, J., González-Cristóbal, J.C.: Daedalus at PAN 2014: Guessing tweet author's gender and age

9. Weren, E.R., Moreira, V.P., de Oliveira, J.P.: Exploring information retrieval features for author profiling - notebook for PAN at CLEF 2014. Cappellato et al. [6]