<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-genre Age and Gender Identification in Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anam Zahid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aadarsh Sampath</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anindya Dey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Golnoosh Farnadi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Data Science, University of Washington Tacoma</institution>
          ,
          <addr-line>WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Appl. Math., Comp. Science and Statistics, Ghent University</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Computer Science, Katholieke Universiteit Leuven</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper<sup>1</sup> briefly describes the methods we adopted for the author profiling task of the PAN 2016 competition [1]. Author profiling is the task of predicting an author's age and gender from his/her writing. We follow a two-level ensemble approach to tackle the cross-genre author profiling task, in which training and testing documents come from different genres. We use soft voting to build the classification ensemble. To include various feature sets, we first train logistic regression models on the word n-gram, character n-gram, and part-of-speech n-gram features extracted for each genre. We then ensemble the single-genre predictive models trained on the blog, social media and Twitter data sources to build our multi-genre ensemble. The experimental results indicate that our approach performs well in both single-genre and cross-genre author profiling tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Gender identification</kwd>
        <kwd>Age prediction</kwd>
        <kwd>Ensemble technique</kwd>
        <kwd>Text mining</kwd>
        <kwd>Cross-genre classification</kwd>
        <kwd>Author profiling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The rapid development of social media platforms has led to a massive volume
of user-generated text in the form of blog posts, status updates, and tweets.
This has generated great research interest in identifying authors' profiles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Author profiling is the task of predicting an author's age and gender
from his/her writing. Most recent work on author profiling addresses the
problem as a single-genre task, where the instances of the training set and the
test set come from a single platform. Because gathering ground-truth data
for every platform is difficult, the cross-genre author profiling task has been
proposed. Cross-genre profiling has been done for personality prediction in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ];
however, little work has been done on identifying the age and gender of users
in a cross-genre setting. Such models could be applied to environments where
training data representative of the deployment domain is not available.
Effective features in recent work on age and gender classification include both
content features, such as unigrams, bigrams and word classes, and
stylistic features, such as part-of-speech (POS) tags, slang words and average sentence
length. For instance, for gender identification, Villena Roman et al.
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] extracted n-grams or bag-of-words as content features. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Argamon et al.
approached the task of gender identification by combining function words with
POS tags. Following this related work, we include various feature
sets in our model by training logistic regression models on word
n-gram, character n-gram, and POS n-gram features extracted from the documents.
      </p>
      <sec id="sec-1-1">
        <title>1 This paper is an extended abstract</title>
        <p>We propose a two-level ensemble approach: a multi-genre predictive model
that ensembles single-genre predictive models from the available ground-truth
datasets of various genres, i.e., the blog, social media and Twitter datasets. Our
multi-genre ensemble approach leverages various types of documents as
training examples, which makes it suitable for the cross-genre author profiling task of the
PAN 2016 competition, where the testing documents are from a hidden genre.
The experimental results indicate that our ensemble approach can be used for
both single-genre and cross-genre author profiling tasks. The rest of this
paper describes the details of our submission to the PAN 2016 cross-genre author
profiling task.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>Let U be the set of all authors, where U = U<sub>train</sub> ∪ U<sub>test</sub>. For all users
in U<sub>train</sub>, we know their age and gender, and our aim is to predict the age and
gender of all users in U<sub>test</sub> based on their written text. If U<sub>train</sub> and U<sub>test</sub>
come from one platform (i.e., genre), we call the task a single-genre author
profiling task; if U<sub>train</sub> and U<sub>test</sub> come from different social media platforms,
we call it a cross-genre author profiling task. The overall architecture
of our proposed ensemble approach for single-genre (S-G) and multi-genre
(M-G) author profiling is shown in Figure 1. With the S-G ensemble approach,
we incorporate various features extracted from the documents; with the
M-G ensemble approach, we not only use different features but also leverage
predictive models of different genres, which makes the framework suitable for
the cross-genre author profiling task.
</p>
      <p><bold>2.1 Pre-processing and data description:</bold> The data provided by the PAN
organizers were in the form of XML documents, from which user content was
extracted and cleaned by removing HTML tags and stop words. To tackle the
cross-genre author profiling task, we collected data from the 2014 and 2015 PAN
author profiling contests and added them to our training dataset. For English
and Spanish, we built three datasets from different genres: (1) social media, with
7,746 documents for English and 1,272 documents for Spanish; (2) blog, with 147
documents for English and 88 documents for Spanish; and (3) Twitter, with 576
documents for English and 340 documents for Spanish. For the Dutch dataset,
we gathered data from Twitter, with 418 documents. In all the datasets, the
gender distribution is uniform. The statistics of the combined datasets w.r.t.
the frequencies of the five age groups (i.e., [18; 24], [25; 34], [35; 49], [50; 64], and
[65; xx]) are shown in Table 1. Note that for the Dutch dataset we do not have
the age of the authors.
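A cleaning step of this kind could be sketched as follows, using only the Python standard library. This is a minimal illustration and not necessarily the code used for the submission: the helper name clean_document and the tiny stop-word list are hypothetical, the choice of stop-word list is an assumption, and parsing of the PAN XML files is omitted.
<preformat>
```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it"})


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_document(raw, stop_words=STOP_WORDS):
    """Strip HTML tags and drop stop words from one document."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    tokens = " ".join(parser.chunks).split()
    # Stop words are matched case-insensitively; kept tokens retain their case.
    return " ".join(tok for tok in tokens if tok.lower() not in stop_words)
```
</preformat>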
</p>
      <p><bold>2.2 Feature extraction:</bold> To create our feature space, we extract three different
categories of features, drawing inspiration from related work. All
implementations are based on scikit-learn<sup>2</sup>, a machine learning package for Python.
The extracted features are: (1) word n-grams, where n = {1, 2, 3} (i.e., uni-, bi- and
tri-grams), using TF-IDF as the weighting mechanism; (2) character n-grams, where
n = {3, 4, 5, 6, 7}, using TF-IDF as the weighting mechanism; to reduce the size of
the feature space, we select the top k features using chi-square hypothesis testing,
where k = 5000; and (3) POS n-grams, for which we extract part-of-speech (POS)
tags from each document using the NLTK package for Python<sup>3</sup>. Each word in the
text is mapped to its corresponding POS tag, and the resulting sequence of
POS tags is used to extract n-gram features with the same configuration as the word
n-grams, i.e., n = {1, 2, 3} and TF-IDF weighting.
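The three feature sets could be set up in scikit-learn roughly as in the following sketch. It is an illustration under stated assumptions rather than the submitted code: the function names, the toy corpus and the numeric labels are hypothetical, the chi-square selection is applied to the character n-grams only (one reading of the description above), k is capped at the number of available features, and the POS vectorizer expects documents that have already been rewritten as space-separated tags (for instance from the (word, tag) pairs returned by nltk.pos_tag, which is not called here).
<preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def word_ngram_vectorizer():
    # Word uni-, bi- and tri-grams with TF-IDF weighting.
    return TfidfVectorizer(analyzer="word", ngram_range=(1, 3))


def char_ngram_vectorizer():
    # Character 3- to 7-grams with TF-IDF weighting.
    return TfidfVectorizer(analyzer="char", ngram_range=(3, 7))


def pos_ngram_vectorizer():
    # Input documents are sequences of POS tags such as "PRP VBP NN .";
    # every whitespace-separated tag is one token and case is preserved.
    return TfidfVectorizer(analyzer="word", ngram_range=(1, 3),
                           token_pattern=r"\S+", lowercase=False)


def to_pos_text(tagged_tokens):
    # tagged_tokens: list of (word, tag) pairs; only the tags are kept.
    return " ".join(tag for _word, tag in tagged_tokens)


def select_top_k(X, y, k=5000):
    # Chi-square feature selection, with k capped at the feature count.
    selector = SelectKBest(chi2, k=min(k, X.shape[1]))
    return selector.fit_transform(X, y), selector


# Toy corpus and labels, for illustration only.
docs = ["i love my new phone", "the meeting is at noon",
        "love this sunny day", "quarterly report is due"]
labels = [0, 1, 0, 1]

X_word = word_ngram_vectorizer().fit_transform(docs)
X_char = char_ngram_vectorizer().fit_transform(docs)
X_char_top, selector = select_top_k(X_char, labels, k=5000)
```
</preformat>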
</p>
      <p><bold>2.3 Predictive model:</bold> We train binary classifiers for predicting the gender of
users and multi-class classifiers for predicting their age. For the age and gender
prediction tasks, we train three predictive models, one per feature set
described above, with logistic regression as the classifier, for each
genre-label-language combination. We then apply soft-voting ensembling to the prediction
scores of the models. The results of applying our S-G ensemble approach to
the Twitter, social media and blog datasets are presented in Table 2. Our S-G
ensemble approach outperforms the majority baseline in predicting the gender
of users for all three datasets and all three languages; for the task
of age prediction, however, it outperforms the baseline for the social media
and Twitter datasets for English and Spanish. To tackle the cross-genre author
profiling task, we first build an S-G ensemble model for each genre; e.g., for
the English dataset, we build three S-G ensemble models for the social media,
blog and Twitter datasets. We then ensemble their predictions into the final predictive
model for the cross-genre author profiling task.</p>
      <sec id="sec-2-1">
        <title>2 http://scikit-learn.org/ 3 http://www.nltk.org/</title>
        <p>
          To investigate the performance of
our approach on the task of cross-genre age and gender prediction, we conducted
three sets of experiments. Using the blog, social media and Twitter datasets,
we take the pre-trained models of two sources and test them on the remaining source.
The results indicate that our approach can be used for the cross-genre author
profiling task: the results are better than or equal to the baseline (see
Table 3). However, since users' language on Twitter differs from their language
in blog posts, selecting the training examples from the most similar datasets
would be an advantage in cross-genre author profiling. For
PAN 2016, since the genre of the test set was hidden, we combined all the available
datasets in our submitted software. The results of our PAN 2016 submission
on the hidden test data evaluated using TIRA [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] are presented in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
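        <p>The two-level soft-voting scheme can be illustrated with the following minimal sketch in plain Python. It assumes that each trained model outputs a class-probability vector for an author and that both levels use a uniform average; the function names and all probability values are made up for illustration and are not taken from our experiments.</p>
        <preformat>
```python
def soft_vote(prob_rows):
    """Uniformly average class-probability vectors (one vector per model)."""
    n_models = len(prob_rows)
    return [sum(column) / n_models for column in zip(*prob_rows)]


def multi_genre_predict(per_genre_probs):
    """Two-level ensemble for a single author.

    per_genre_probs maps a genre name to the probability vectors produced by
    that genre's word, character and POS n-gram models.
    """
    # Level 1 (S-G): combine the feature-specific models within each genre.
    single_genre = [soft_vote(rows) for rows in per_genre_probs.values()]
    # Level 2 (M-G): combine the single-genre ensembles across genres.
    return soft_vote(single_genre)


# Made-up probabilities for one author and two classes, for illustration only.
example = {
    "blog": [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5]],
    "social media": [[0.4, 0.6], [0.5, 0.5], [0.3, 0.7]],
    "twitter": [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]],
}
final_probs = multi_genre_predict(example)
# Index of the class with the highest averaged probability.
predicted_class = max(range(len(final_probs)), key=final_probs.__getitem__)
```
        </preformat>
        <p>In a leave-one-genre-out experiment of the kind described above, the entry for the held-out genre would simply be left out of the dictionary before the second-level vote.</p>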
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we briefly explained our proposed two-level ensemble approach
to the cross-genre author profiling task. Our approach is
flexible and can incorporate many feature sets and whatever sources of information are
available, which makes it suitable for the cross-genre author profiling
task, where few or no training examples are available from the target genre.
Experimental results on various datasets and languages indicate the capability of our
approach. In our approach, we assigned uniform weights when ensembling the
predictive models. Giving higher weights to the predictive models with
better performance may improve overall performance, which is an open path
to explore in the future.</p>
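      <p>One possible form of such a weighted scheme is sketched below. This is a hypothetical illustration of the idea, not something we evaluated: the function name is ours, the weights are assumed to come from some per-model performance measure such as validation accuracy, and all numbers are made up.</p>
      <preformat>
```python
def weighted_soft_vote(prob_rows, weights):
    """Weighted average of class-probability vectors, one vector per model."""
    if len(prob_rows) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if not total > 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * p for w, p in zip(weights, column)) / total
        for column in zip(*prob_rows)
    ]


# Made-up class probabilities from three models for one author (two classes).
model_probs = [[0.6, 0.4], [0.4, 0.6], [0.7, 0.3]]

# Uniform weights reproduce plain soft voting.
uniform = weighted_soft_vote(model_probs, [1, 1, 1])

# Hypothetical weights, e.g. validation accuracies of the three models.
weighted = weighted_soft_vote(model_probs, [0.8, 0.5, 0.7])
```
      </preformat>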
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verhoeven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , "
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations,"</article-title>
          <source>in Proc. of the CLEF Evaluation Labs and Workshop</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and W. Daelemans, "
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015,"</article-title>
          <source>in Proc. of the CLEF Evaluation Labs and Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G.</given-names>
            <surname>Farnadi</surname>
          </string-name>
          , G. Sitaraman,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sushmita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stillwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Davalos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>De Cock</surname>
          </string-name>
          , "
          <article-title>Computational personality recognition in social media,"</article-title>
          <source>User Modeling and User-Adapted Interaction</source>
          , vol.
          <volume>26</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>109</fpage>
          –
          <lpage>142</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Villena Roman</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Gonzalez Cristobal</surname>
          </string-name>
          , "
          <article-title>DAEDALUS at PAN 2014: Guessing tweet author's gender and age,"</article-title>
          <source>in Proc. of the CLEF Evaluation Labs and Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fine</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Shimoni</surname>
          </string-name>
          , "
          <article-title>Gender, genre, and writing style in formal written texts,"</article-title>
          <source>TEXT</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>321</fpage>
          –
          <lpage>346</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , E. Stamatatos, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , "
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling,"</article-title>
          <source>in Proc. of the CLEF Evaluation Labs and Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>