=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Pan-VillenaRomanEt2014
|storemode=property
|title=DAEDALUS at PAN 2014: Guessing Tweet Author's Gender and Age
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-VillenaRomanEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/Villena-RomanG14
}}
==DAEDALUS at PAN 2014: Guessing Tweet Author's Gender and Age==
DAEDALUS at PAN 2014: Guessing Tweet Author's
Gender and Age
Julio Villena-Román1,2, José Carlos González-Cristóbal3,1
1
DAEDALUS - Data, Decisions and Language, S.A.
2
Universidad Carlos III de Madrid
3
Universidad Politécnica de Madrid
jvillena@daedalus.es, josecarlos.gonzalez@upm.es
Abstract. This paper describes our participation at PAN 2014 author profiling
task. Our idea was to define, develop and evaluate a simple machine learning
classifier able to guess the gender and the age of a given user based on his/her
texts, which could become part of the solution portfolio of the company. We
were interested in finding not the best possible classifier that achieves the
highest accuracy, but to find the optimum balance between performance and
throughput using the most simple strategy and less dependent of external
systems. Results show that our software using Naive Bayes Multinomial with a
term vector model representation of the text is ranked quite well among the rest
of participants in terms of accuracy.
Keywords: PAN, CLEF, author profiling, gender, age, user demographics,
machine learning classifier, Naive Bayes Multinomial, term vector model.
1 Introduction
PAN 1 is a competitive evaluation lab on uncovering plagiarism, authorship and social
software misuse, held as part of CLEF 2 conference. PAN 2014 offers three different
main tasks: 1) plagiarism detection, 2) author identification and 3) author profiling.
describes our participation at the PAN 2014 author profiling scenario [1]. We are a
research group led by DAEDALUS 3, a leading provider of language-based solutions
in Spain, and research groups of Universidad Politécnica and Universidad Carlos III
of Madrid. We are long-time participants in CLEF, in many different tracks and tasks
since 2003, and also in a previous edition of PAN [2].
The task is focused on author profiling, i.e., the problem to distinguish between
classes of authors studying how language is shared by people, allowing to identify
aspects such as gender, age, native language, or personality type. Specifically, the
focus is on author profiling in social media messages. Author profiling is a problem of
1 http://pan.webis.de/
2 http://www.clef-initiative.eu/
3 http://www.daedalus.es/
1157
growing importance in different applications such as forensics, security, and
marketing, for instance, to know the demographics of people that like or dislike their
products, based on the analysis of blogs and online product reviews.
Given a document, the task is to determine its author's age and gender.
Participants are provided with a training data set that consists of blog posts, Twitter
tweets and social media texts written in both English and Spanish as well as hotel
reviews written in English. Gender is a binary classification (male or female) and with
regard to age, the following 5 classes are considered: 18-24, 25-34, 35-49, 50-64,
>65. Differently to other CLEF labs, participants must not submit the results of their
experiments using a provided test corpus, but else upload a software that runs within
TIRA evaluation platform 4.
The idea behind our participation was to define, develop and evaluate a simple
machine learning classifier able to guess the gender and the age of a given user based
on his/her texts, which could become part of the solution portfolio of the company.
We were interested to find not the best possible classifier that achieves the best
accuracy, but to find the best balance between performance and throughput using the
most simple strategy and less dependent of external systems. Our system and results
achieved are presented and discussed in the following sections.
2 Our approach
The provided training data covers 1) four different types of corpus with presumably
different language usage, 2) two different languages (English and French), and 3) two
attributes to guess (gender and age). After several preliminary analysis using cross
validation on the training corpora, we decided to build a machine learning classifier
specifically trained for each combination of corpus-language-attribute, so 14
classifiers in all.
Table 1. Information of corpus
Corpus Language Authors Texts
Blog English 147 2 278
Review English 4 160 5 452
Socialmedia English 7 746 146 843
Twitter English 306 201 432
Blog Spanish 88 1 685
Socialmedia Spanish 1 272 22 097
Twitter Spanish 178 155 326
Table 1 shows the number of authors and texts for each training corpus. Given the
heterogeneity of each corpus, where some have just a few long documents per author
(such as in the review corpus) and others have many short texts per author (for
4 http://www.tira.io/
1158
instance Twitter corpus), we decided to design a two-level classifier: first, a
document-oriented classifier, which guesses the gender and age of a given text, and
then, an author-oriented classifier, which predicts the gender and age of a given user
by aggregating the output of the first classifier for each text written by a given user.
All corpora are equally balanced for gender and age, so the training is not affected by
any class unbalance problem.
All 14 classifiers are trained with all texts for each combination of corpus,
language and attribute. We used Weka 3.7 for performing our experiments and for
developing our software to run in TIRA. Texts were tokenized using WordTokenizer
to obtain a simple bag of words representation. The tokenizer allows to define split
characters that are removed from the term vector space representation of the text.
Besides the usual split symbols, spaces and some punctuation marks, we use some
specific delimiters such as hashtags (#), usernames (@), emoticons, slashes,
ampersands, question marks and hyphens that are used to separate words in SEO
optimized URLs. Finally, as a high number of terms were low frequency numerals we
decided to add numbers as well to help in normalization.
Regarding the document-oriented classifiers, a number of supervised algorithms
were evaluated using cross validation, and finally, for its performance, we selected
Multinomial Naive Bayes (NBM) classifier [3] with the default values for parameters.
Different configuration parameters were tested to reach the conclusion that NBM was
robust enough and other representations (bigrams, feature selection) were not adding
additional value.
Results of this document-oriented classifier on training data using cross validation
are shown in Table 2.
Table 2. Results for training data (document-oriented classification)
Corpus Language Gender Age
Blog English 0.8277 0.6485
Review English 0.6852 0.3400
Socialmedia English 0.6187 0.4445
Twitter English 0.8726 0.7571
Blog Spanish 0.8619 0.6660
Socialmedia Spanish 0.6217 0.4439
Twitter Spanish 0.8686 0.7598
The author-oriented classifier reads the output of the document-oriented classifier
for each text written by a given author and predicts the gender and age using a simple
voting strategy, i.e., returns the most frequent value among all texts, selected after
some preliminary tests. Some other strategies were tested, such as a voting approach
using a confusion matrix with different cost for each decision values, depending on
the estimated accuracy for each class, but unfortunately we did not find any definite
conclusion or improvement due to lack of time.
The final submission consists in a script written in PHP that reads the input test
corpus and the output directory, and, using a loop, processes every file in the test
corpus, reading all documents and creating two files in the arff format suitable for
1159
Weka, one for gender and another one for age. Then Weka is called to obtain the
predictions and then the output is aggregated to select the most frequent value that is
chosen as the final output prediction for the author.
3 Results
The gender and age predictions have been evaluated as a classification problem, so
accuracy measure over each class are reported. Results achieved by our software are
shown in Table 3.
Table 3. Results for test data (author-oriented classification)
Corpus Language Gender Age Both
Blog English 0.6410 0.3974 0.3077
Review English 0.6845 0.3143 0.2199
Socialmedia English 0.5421 0.3581 0.1905
Twitter English 0.5130 0.4156 0.2078
Average 0.5952 0.3714 0.2315
Blog Spanish 0.5179 0.4643 0.2321
Socialmedia Spanish 0.5724 0.3622 0.1961
Twitter Spanish 0.5444 0.5000 0.2667
Average 0.5449 0.4422 0.2317
In general, classifiers for Spanish achieve better results than classifiers for English,
except for the case of blogs where English works better.
Although apparently gender attribute achieves a higher precision than age attribute,
the classifier for gender is quite useless, as, taking into account that the range of
values for the attribute is just two (male vs female), a random choice would achieve a
0.50 accuracy (assuming an equally balanced test corpus, the same as the training
corpus). Thus classifiers for age outperform classifiers for gender in terms of lift
(increment with regards to the random choice): for instance, 59% vs 50% for gender
in English, 37% vs 20% for age in English (5 possible classes), etc.
Table 4 shows the comparison with other participants. This table shows, for each
corpus, language and attribute, the maximum, minimum and average values, and the
position of our software in the ranking of participants.
In general, we achieve average results just above the middle of the table, except for
same cases were our software outperforms other participants, such as social media or
reviews in English.
As it can be also noticed in the table, our results for Spanish are worse than the
average for all participants in Spanish, though the approach is the same as for English.
We do not have any explanation for this issue yet. However, we have a feeling that a
stemming or lemmatization step should have been considered for Spanish, as
inflection processes are important in this language and affect other tasks such as
information retrieval or named entity recognition.
1160
Table 4. Overall results.
Corpus Language Value* Gender Age Both
Blog English Max 0.6795 0.4615 0.3077
Min 0.5000 0.1795 0.0897
Average 0.6117 0.3516 0.2326
Ranking 3-4/7 2-3/7 1-2/7
Review English Max 0.7259 0.3502 0.2564
Min 0.5012 0.0901 0.0451
Average 0.6383 0.2879 0.1897
Ranking 2/7 5/7 5/7
Socialmedia English Max 0.5421 0.3652 0.2062
Min 0.5012 0.2355 0.1244
Average 0.5285 0.3246 0.1750
Ranking 1/7 3/7 4/7
Twitter English Max 0.7338 0.5065 0.3571
Min 0.5065 0.1104 0.0584
Average 0.5974 0.3766 0.2305
Ranking 7/8 4/8 4/8
Blog Spanish Max 0.5893 0.4821 0.3214
Min 0.4286 0.2500 0.1786
Average 0.5112 0.4152 0.2366
Ranking 3-4/8 3-4/8 4-5-6/8
Socialmedia Spanish Max 0.6837 0.4894 0.3357
Min 0.5000 0.2191 0.1060
Average 0.6144 0.3847 0.2325
Ranking 7/8 5/8 6/8
Twitter Spanish Max 0.6556 0.6111 0.4333
Min 0.5000 0.2222 0.1444
Average 0.5736 0.4875 0.2889
Ranking 5/8 5-6/8 6/8
* If there is more than one number in the ranking, it means a tie between participants
4 Conclusions and Future work
Results show that our quite simple approach using a two-level classifier composed of
a document-oriented Naive Bayes Multinomial classifier with a term vector model
representation of the text and then a voting strategy for predicting the author age
achieves acceptable results in terms of accuracy. Despite of the difficulty of the task,
results somewhat validate the fact that this technology may be already included into
an automated workflow process for the first step towards social media mining and
author profiling for supporting marketing activities.
1161
However, in general, classifiers for gender (for all participants) are quite useless as
they achieve a very low improvement over the random choice. Classifiers for age are
worse in absolute accuracy but better in terms of lift with respect to the random
choice. Obviously a different approach must be investigated to predict gender more
robustly.
We already include a module for extraction user demographics in our portfolio of
solutions [4], which tries to guess gender, age and user type (person or organization),
using the information in the user public profile in Twitter, i.e., nick, full name and
description, making no use of the texts written by that user. This module is based on
distance among histograms using n-grams (character sequences) for each attribute to
predict. Using internal evaluations, this software achieves good accuracy results for
gender (over 70%) though lower for age.
Based on the results achieved in PAN, our initial idea to find a strategy that offers
a good balance between performance and throughput using the most simple approach
and less dependent of external systems gets validated and developing such classifier is
within our immediate plans. In the short term, we plan to carry out some tests using
our software for text classification [5], which is based on a hybrid algorithm [6] [7]
that combines a statistical classification (currently based on kNN), which provides a
base model that is relatively easy to train, with a rule-based filtering, which is used to
post-process and improve the results provided by the previous classifier. We think
that this combined strategy could provide improvements over these results based just
on machine learning.
Regretfully, due to lack of time and resources, we have not been able yet to carry
out an individual analysis by language, by corpus and a detailed analysis per class
(confusion matrix) so we do not understand yet the effect of each component in the
final result.
Specifically for the age attribute, we think that in a real business scenario, accuracy
as defined in the task, i.e., a binary decision between right or not, could be somewhat
relaxed using a cost matrix, considering that a miss classification between adjacent
age ranges is less serious than between more distant ranges, specially for users who
are near the end of the interval. So, we suggest to consider a modified evaluation
metric that considers this cost matrix for future editions of PAN.
Acknowledgements
This work has been supported by several Spanish R&D projects: Ciudad2020:
Towards a New Model of a Sustainable Smart City (INNPRONTA IPT-20111006),
MA2VICMR: Improving the Access, Analysis and Visibility of Multilingual and
Multimedia Information in Web (S2009/TIC-1542) and MULTIMEDICA:
Multilingual Information Extraction in Health Domain and Application to Scientific
and Informative Documents (TIN2010-20644-C03-01).
1162
References
1. Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and
Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In
Pamela Forner, Roberto Navigli, and Dan Tufis, editors, Working Notes
Papers of the CLEF 2013 Evaluation Labs, September 2013. ISBN 978-88-
904810-3-1.
2. Pablo Suárez, José Carlos González, Julio Villena-Román. 2010. A plagiarism
detector for intrinsic plagiarism. Lab Report for PAN at CLEF 2010. CLEF
2010 Labs and Workshops Notebook Papers. 22-23 September 2010, Padua
Italy. ISBN 978-88-904810-0-0.
3. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten.
2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations,
Volume 11, Issue 1.
4. Textalytics User Demographics v1.0 API. 2014.
http://textalytics.com/core/userdemographics-info
5. Textalytics Text Classification v1.1 API. 2014.
http://textalytics.com/core/class-info
6. Julio Villena-Román, Sonia Collada-Pérez, Sara Lana-Serrano, and José
Carlos González-Cristóbal. 2011. Método híbrido para categorización de texto
basado en aprendizaje y reglas. Procesamiento del Lenguaje Natural, Vol. 46,
2011, pp. 35-42.
7. Julio Villena-Román, Sonia Collada-Pérez, Sara Lana-Serrano, and José
Carlos González-Cristóbal. 2011. Hybrid Approach Combining Machine
Learning and a Rule-Based Expert System for Text Categorization.
Proceedings of the 24th International Florida Artificial Intelligence Research
Society Conference (FLAIRS-11), May 18-20, 2011, Palm Beach, Florida,
USA. AAAI Press 2011.
1163