<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DAEDALUS at PAN 2014: Guessing Tweet Author's Gender and Age</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julio Villena-Román</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Carlos González-Cristóbal</string-name>
          <email>josecarlos.gonzalez@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAEDALUS - Data, Decisions and Language, S.A.</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Carlos III de Madrid</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Madrid</institution>
        </aff>
      </contrib-group>
      <fpage>1157</fpage>
      <lpage>1163</lpage>
      <abstract>
        <p>This paper describes our participation in the PAN 2014 author profiling task. Our idea was to define, develop and evaluate a simple machine learning classifier able to guess the gender and age of a given user based on his/her texts, which could become part of the solution portfolio of the company. We were not interested in finding the best possible classifier in terms of accuracy, but rather the best balance between performance and throughput using the simplest strategy, with the least dependence on external systems. Results show that our software, which uses Naive Bayes Multinomial with a term vector model representation of the text, ranks quite well among participants in terms of accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN</kwd>
        <kwd>CLEF</kwd>
        <kwd>author profiling</kwd>
        <kwd>gender</kwd>
        <kwd>age</kwd>
        <kwd>user demographics</kwd>
        <kwd>machine learning classifier</kwd>
        <kwd>Naive Bayes Multinomial</kwd>
        <kwd>term vector model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        PAN is a competitive evaluation lab on uncovering plagiarism, authorship and social
software misuse, held as part of the CLEF conference. PAN 2014 offers three
main tasks: 1) plagiarism detection, 2) author identification and 3) author profiling.
This paper describes our participation in the PAN 2014 author profiling task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We are a
research group led by DAEDALUS, a leading provider of language-based solutions
in Spain, together with research groups of Universidad Politécnica de Madrid and
Universidad Carlos III de Madrid. We have been long-time participants in CLEF, in many
different tracks and tasks since 2003, and also took part in a previous edition of PAN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The task focuses on author profiling, i.e., the problem of distinguishing between
classes of authors by studying how language is shared by people, which makes it possible to
identify aspects such as gender, age, native language, or personality type. Specifically, the
focus is on author profiling in social media messages. Author profiling is a problem of
growing importance in applications such as forensics, security, and
marketing, where, for instance, companies want to know the demographics of people who like
or dislike their products, based on the analysis of blogs and online product reviews.</p>
      <p>Given a document, the task is to determine its author's age and gender.
Participants are provided with a training data set that consists of blog posts,
tweets and social media texts written in both English and Spanish, as well as hotel
reviews written in English. Gender is a binary classification (male or female) and, with
regard to age, the following 5 classes are considered: 18-24, 25-34, 35-49, 50-64,
&gt;65. Unlike other CLEF labs, participants do not submit the results of their
experiments on a provided test corpus, but instead upload software that runs within
the TIRA evaluation platform.</p>
      <p>The idea behind our participation was to define, develop and evaluate a simple
machine learning classifier able to guess the gender and the age of a given user based
on his/her texts, which could become part of the solution portfolio of the company.
We were not interested in finding the best possible classifier in terms of
accuracy, but rather the best balance between performance and throughput using the
simplest strategy, with the least dependence on external systems. Our system and the
results achieved are presented and discussed in the following sections.</p>
    </sec>
    <sec id="sec-2">
      <title>Our approach</title>
      <p>The provided training data covers 1) four different types of corpus with presumably
different language usage, 2) two different languages (English and Spanish), and 3) two
attributes to guess (gender and age). After several preliminary analyses using cross
validation on the training corpora, we decided to build a machine learning classifier
specifically trained for each combination of corpus, language and attribute: seven
corpus-language pairs (four corpora in English and three in Spanish, as hotel reviews are
available in English only) and two attributes, so 14 classifiers in all.
As a given author may have several texts in a corpus (for
instance, the Twitter corpus), we decided to design a two-level classifier: first, a
document-oriented classifier, which guesses the gender and age of a given text, and
then, an author-oriented classifier, which predicts the gender and age of a given user
by aggregating the output of the first classifier for each text written by that user.
All corpora are equally balanced for gender and age, so the training is not affected by
any class imbalance problem.</p>
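      <p>As an illustration of how the 14 classifiers arise, the combinations can be enumerated as in the following minimal sketch. The identifiers (CORPORA, ATTRIBUTES, classifier_keys) are hypothetical names chosen for this example and are not taken from our software.</p>
      <preformat><![CDATA[
```python
from itertools import product

# Corpus types per language, following the task description:
# hotel reviews are available in English only.
CORPORA = {
    "english": ["blog", "review", "socialmedia", "twitter"],
    "spanish": ["blog", "socialmedia", "twitter"],
}
ATTRIBUTES = ["gender", "age"]


def classifier_keys():
    """Return one (corpus, language, attribute) key per classifier to train."""
    keys = []
    for language, corpora in CORPORA.items():
        for corpus, attribute in product(corpora, ATTRIBUTES):
            keys.append((corpus, language, attribute))
    return keys


keys = classifier_keys()
print(len(keys))  # 14
```
]]></preformat>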
      <p>Each of the 14 classifiers is trained with all the texts available for its combination
of corpus, language and attribute. We used Weka 3.7 to perform our experiments and to
develop the software that runs in TIRA. Texts were tokenized using WordTokenizer
to obtain a simple bag-of-words representation. The tokenizer allows split
characters to be defined, which are removed from the term vector space representation of
the text. Besides the usual split symbols (spaces and some punctuation marks), we use some
specific delimiters such as the hashtag (#) and username (@) symbols, emoticons, slashes,
ampersands, question marks, and the hyphens that are used to separate words in
SEO-optimized URLs. Finally, as a high number of terms were low-frequency numerals, we
decided to add numbers to the delimiters as well to help in normalization.</p>
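      <p>The effect of this tokenization can be sketched in a few lines of Python. This is only an approximation of the behaviour described above, not the Weka WordTokenizer itself: the exact delimiter set (DELIMITERS) and the function name bag_of_words are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# Assumed delimiter set: whitespace, common punctuation, the tweet- and
# URL-specific symbols mentioned in the text, and digits.
DELIMITERS = " \t\r\n.,;:'\"()!?#@/&-_0123456789"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def bag_of_words(text):
    """Split text on the delimiter set and count the remaining terms."""
    tokens = [t for t in _SPLIT_RE.split(text) if t]
    return Counter(tokens)


bag = bag_of_words(
    "Great #hotel @Madrid, see http://example.com/best-hotel-deals 24/7"
)
print(bag["hotel"])  # 2
```
]]></preformat>
      <p>In this sketch the hashtag, the username and the hyphenated URL all contribute plain words to the bag, while the numeral 24/7 disappears entirely.</p>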
      <p>
        Regarding the document-oriented classifiers, a number of supervised algorithms
were evaluated using cross validation and, based on its performance, we finally selected
the Multinomial Naive Bayes (NBM) classifier [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with the default parameter values.
Different configurations were tested, leading to the conclusion that NBM was
robust enough and that other representations (bigrams, feature selection) did not add
any value.
      </p>
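      <p>For readers unfamiliar with the algorithm, the following is a minimal sketch of a multinomial Naive Bayes classifier over bag-of-words counts. It is not the Weka implementation that we used: the class name, the add-one smoothing and the toy training data are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Minimal multinomial Naive Bayes over token lists (illustrative only)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document.
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # terms unseen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data for two of the age classes.
train_docs = [["exam", "party"], ["exam", "campus"],
              ["pension", "garden"], ["garden", "grandchildren"]]
train_labels = ["18-24", "18-24", "50-64", "50-64"]
model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict(["exam"]))
```
]]></preformat>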
      <p>Results of this document-oriented classifier on training data using cross validation
are shown in Table 2.</p>
      <p>The author-oriented classifier reads the output of the document-oriented classifier
for each text written by a given author and predicts the gender and age using a simple
voting strategy, i.e., it returns the most frequent value among all texts. This strategy
was selected after some preliminary tests. Other strategies were also tested, such as a
voting approach using a confusion matrix with a different cost for each decision value,
depending on the estimated accuracy for each class, but unfortunately, due to lack of time,
we did not reach any definite conclusion or find any improvement.</p>
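      <p>The simple voting strategy amounts to a majority vote over the per-document predictions, as in the following sketch. The function name vote and the tie-breaking rule (first label seen wins) are assumptions of this example; the text above does not specify how ties are resolved.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def vote(predictions):
    """Return the most frequent label among per-document predictions."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    # most_common(1) returns [(label, count)]; with equal counts the label
    # seen first is returned, which is an arbitrary choice in this sketch.
    return Counter(predictions).most_common(1)[0][0]


print(vote(["male", "female", "male"]))  # male
```
]]></preformat>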
      <p>The final submission consists of a script written in PHP that takes the input test
corpus and the output directory and, using a loop, processes every file in the test
corpus, reading all documents and creating two files in the ARFF format suitable for
Weka, one for gender and another one for age. Weka is then called to obtain the
predictions, and the output is aggregated to select the most frequent value, which is
chosen as the final prediction for the author.</p>
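      <p>The ARFF generation step can be sketched as follows. The submitted script is written in PHP; this Python version is only an illustration, and the relation name, the attribute names (text, class) and the helper names are assumptions rather than the ones used in our software. The class of each test instance is left as a missing value (?), to be filled in by the classifier.</p>
      <preformat><![CDATA[
```python
def arff_quote(text):
    """Quote a string value for ARFF, escaping backslashes, quotes and newlines."""
    escaped = (text.replace("\\", "\\\\")
                   .replace("'", "\\'")
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return "'" + escaped + "'"


def to_arff(relation, documents, class_values):
    """Build an ARFF file body with one string attribute and a nominal class.

    Every instance gets '?' (missing) as its class, as for unlabeled test data.
    """
    lines = [
        "@relation " + relation,
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(class_values) + "}",
        "",
        "@data",
    ]
    lines.extend(arff_quote(doc) + ",?" for doc in documents)
    return "\n".join(lines) + "\n"


print(to_arff("gender", ["it's a first tweet", "a second tweet"],
              ["male", "female"]))
```
]]></preformat>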
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The gender and age predictions have been evaluated as a classification problem, so
accuracy is reported for each of them. Results achieved by our software are
shown in Table 3.</p>
      <p>In general, classifiers for Spanish achieve better results than classifiers for English,
except for the case of blogs where English works better.</p>
      <p>Although the gender attribute apparently achieves a higher accuracy than the age
attribute, the classifier for gender is quite useless: as the attribute has just two possible
values (male vs female), a random choice would achieve an accuracy of
0.50 (assuming that the test corpus is equally balanced, like the training
corpus). Thus classifiers for age outperform classifiers for gender in terms of lift
(increment with respect to the random choice): for instance, 59% vs 50% for gender
in English, and 37% vs 20% for age in English (5 possible classes).</p>
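      <p>The comparison can be made explicit with a small calculation. In this sketch, lift is computed both as the absolute increment over a uniform random guess and as a ratio to that baseline; the function name and the choice to report both figures are assumptions of the example.</p>
      <preformat><![CDATA[
```python
def lift(accuracy, n_classes):
    """Return (absolute, relative) improvement over a uniform random guess."""
    baseline = 1.0 / n_classes
    return accuracy - baseline, accuracy / baseline


for name, acc, k in [("gender", 0.59, 2), ("age", 0.37, 5)]:
    absolute, relative = lift(acc, k)
    print(name, round(absolute, 2), round(relative, 2))
```
]]></preformat>
      <p>With the English figures quoted above, gender gains about 0.09 over its baseline (1.18 times the random choice), whereas age gains about 0.17 (1.85 times).</p>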
      <p>Table 4 shows the comparison with other participants. This table shows, for each
corpus, language and attribute, the maximum, minimum and average values, and the
position of our software in the ranking of participants.</p>
      <p>In general, we achieve average results just above the middle of the table, except for
some cases where our software outperforms other participants, such as social media or
reviews in English.</p>
      <p>As can also be seen in the table, our results for Spanish are worse than the
average for all participants in Spanish, although the approach is the same as for English.
We do not have an explanation for this yet. However, we suspect that a
stemming or lemmatization step should have been considered for Spanish, as
inflection is important in this language and affects other tasks such as
information retrieval or named entity recognition.</p>
      <p>[Table: results by corpus (Blog, Review, Socialmedia and Twitter for English; Blog, Twitter and Socialmedia for Spanish)]</p>
      <p>
Results show that our quite simple approach, a two-level classifier composed of
a document-oriented Naive Bayes Multinomial classifier with a term vector model
representation of the text followed by a voting strategy for predicting the author's age,
achieves acceptable results in terms of accuracy. Despite the difficulty of the task,
the results suggest that this technology could already be included in
an automated workflow as a first step towards social media mining and
author profiling to support marketing activities.</p>
      <p>However, in general, classifiers for gender (for all participants) are quite useless, as
they achieve a very small improvement over the random choice. Classifiers for age are
worse in absolute accuracy but better in terms of lift with respect to the random
choice. A different approach must clearly be investigated to predict gender more
robustly.</p>
      <p>
        We already include a module for extracting user demographics in our portfolio of
solutions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which tries to guess gender, age and user type (person or organization)
using the information in the user's public Twitter profile, i.e., nick, full name and
description, making no use of the texts written by that user. This module is based on
the distance between histograms of n-grams (character sequences) for each attribute to
predict. In internal evaluations, this software achieves good accuracy for
gender (over 70%), though lower for age.
      </p>
      <p>
        Based on the results achieved in PAN, our initial idea of finding a strategy that offers
a good balance between performance and throughput using the simplest approach,
with the least dependence on external systems, is validated, and developing such a
classifier is within our immediate plans. In the short term, we plan to carry out some
tests using our software for text classification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is based on a hybrid algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
that combines statistical classification (currently based on kNN), which provides a
base model that is relatively easy to train, with rule-based filtering, which is used to
post-process and improve the results provided by the statistical classifier. We think
that this combined strategy could improve on the present results, which are based solely
on machine learning.
      </p>
      <p>Regrettably, due to lack of time and resources, we have not yet been able to carry
out an individual analysis by language and by corpus, or a detailed analysis per class
(confusion matrix), so we do not yet understand the effect of each component on the
final result.</p>
      <p>Specifically for the age attribute, we think that in a real business scenario, accuracy
as defined in the task, i.e., a binary decision between right and wrong, could be somewhat
relaxed using a cost matrix, considering that a misclassification between adjacent
age ranges is less serious than one between more distant ranges, especially for users who
are near the end of an interval. We therefore suggest considering a modified evaluation
metric based on such a cost matrix for future editions of PAN.</p>
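      <p>One possible form of such a metric is sketched below. The linear cost, which grows with the distance between age classes and is normalized so that the most distant confusion costs 1, is purely an assumption made for illustration, as are the function names; other cost matrices could equally be used.</p>
      <preformat><![CDATA[
```python
AGE_CLASSES = ["18-24", "25-34", "35-49", "50-64", ">65"]


def cost_matrix(n):
    """cost[i][j] grows linearly with the distance between class indices."""
    return [[abs(i - j) / (n - 1) for j in range(n)] for i in range(n)]


def relaxed_accuracy(gold, predicted, classes=AGE_CLASSES):
    """Return 1 minus the mean misclassification cost over all authors."""
    index = {label: i for i, label in enumerate(classes)}
    cost = cost_matrix(len(classes))
    total = sum(cost[index[g]][index[p]] for g, p in zip(gold, predicted))
    return 1.0 - total / len(gold)


gold = ["18-24", "25-34", "50-64"]
predicted = ["18-24", "35-49", ">65"]
print(round(relaxed_accuracy(gold, predicted), 3))
```
]]></preformat>
      <p>In this made-up example, plain accuracy would be 1/3, since only one author is classified exactly, whereas the relaxed score is about 0.83 because both errors fall in an adjacent age range.</p>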
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work has been supported by several Spanish R&amp;D projects: Ciudad2020:
Towards a New Model of a Sustainable Smart City (INNPRONTA IPT-20111006),
MA2VICMR: Improving the Access, Analysis and Visibility of Multilingual and
Multimedia Information in Web (S2009/TIC-1542) and MULTIMEDICA:
Multilingual Information Extraction in Health Domain and Application to Scientific
and Informative Documents (TIN2010-20644-C03-01).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches.
          <article-title>Overview of the Author Profiling Task at PAN 2013</article-title>
          . In Pamela Forner, Roberto Navigli, and Dan Tufis, editors,
          <source>Working Notes Papers of the CLEF 2013 Evaluation Labs</source>
          , September
          <year>2013</year>
          . ISBN 978-88-904810-3-1.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Pablo Suárez, José Carlos González, and Julio Villena-Román.
          <year>2010</year>
          .
          <article-title>A plagiarism detector for intrinsic plagiarism</article-title>
          . Lab Report for PAN at CLEF 2010.
          <source>CLEF 2010 Labs and Workshops Notebook Papers</source>
          , 22-23 September 2010, Padua, Italy. ISBN 978-88-904810-0-0.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten.
          <year>2009</year>
          .
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>SIGKDD Explorations</source>
          , Volume
          <volume>11</volume>
          , Issue
          <issue>1</issue>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <source>Textalytics User Demographics v1.0 API</source>
          .
          <year>2014</year>
          . http://textalytics.com/core/userdemographics-info
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <source>Textalytics Text Classification v1.1 API</source>
          .
          <year>2014</year>
          . http://textalytics.com/core/class-info
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Julio Villena-Román, Sonia Collada-Pérez, Sara Lana-Serrano, and José Carlos González-Cristóbal.
          <year>2011</year>
          .
          <article-title>Método híbrido para categorización de texto basado en aprendizaje y reglas</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          , Vol.
          <volume>46</volume>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Julio Villena-Román, Sonia Collada-Pérez, Sara Lana-Serrano, and José Carlos González-Cristóbal.
          <year>2011</year>
          .
          <article-title>Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization</article-title>
          .
          <source>Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference (FLAIRS-11)</source>
          , May 18-20, 2011, Palm Beach, Florida, USA. AAAI Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>