<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word Unigram Weighing for Author Profiling at PAN 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pius von Däniken</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Grubenmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Cieliebak</string-name>
          <email>ciel@zhaw.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SpinningBytes AG</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Zurich University of Applied Sciences</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>We present our system for the author profiling task at PAN 2018 on gender identification on Twitter. The submitted system uses word unigrams, character 1- to 5-grams, and emoji unigrams as features to train a logistic regression classifier. We explore the impact of three different word unigram weighing schemes on our system's performance. Our submission achieved accuracies of 77.42% for English, 74.64% for Spanish, and 73.20% for Arabic tweets. It ranked 15th out of 23 competitors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The rise of the internet and social media has brought a plethora of user-generated
content. Since a substantial number of users post pseudonymously, author profiling tasks,
such as age and gender identification, have become a compelling area of study. For
example, in the case of online harassment, one might be interested in identifying the
perpetrator. This naturally extends to other applications in forensics and
security. Similarly, the social sciences might use author profiling as a starting point to
study how different demographics interact with media.</p>
      <p>
        In this work, we describe our submission to the author profiling task at PAN 2018 [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ]
on gender identification based on text and images posted by users of social media. We
compare different unigram weighing schemes for this task, which form the basis of our
approach. Our submitted system achieved accuracies of 77.42% for English, 74.64%
for Spanish, and 73.20% for Arabic Twitter messages.
      </p>
      <p>
The goal of the author profiling task at PAN 2018 is to identify the gender of a user
based on two kinds of input data: text written by the user, and images posted by the
user on social media (not necessarily showing themselves). The training data covers
three languages: English (3000 users), Spanish (3000 users), and Arabic
(1500 users). Male and female authors are balanced for every
language. For every user there are 100 messages and 10 images that the user posted to
Twitter. The competition consists of three subtasks: gender_txt (identify gender from
text only), gender_img (identify gender from images only), and gender_comb (identify
gender from both text and images).
      </p>
      <p>We participated in the gender_txt subtask for all three languages.</p>
      <sec id="sec-1-1">
        <title>1.2 Related Work</title>
        <p>
          This year’s author profiling task is a continuation of a series of related tasks from
previous years [
          <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
          ]. The most similar edition is the 2017 instance of the task [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], as it used a
very similar multilingual text data set based on Twitter and also had a gender
identification subtask. The authors of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] achieved the highest average accuracy over languages
in the gender identification task, attaining accuracies of 82.33% for English, 83.21%
for Spanish, and 80.06% for Arabic. They use word and character n-grams weighted by
TF-IDF. Our work follows a similar approach. The VarDial evaluation campaign [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is
a similar competition that focuses mainly on dialect identification, which has been a
topic of previous tasks at PAN.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 System Description</title>
      <p>Our system uses the same approach for every language. First, we preprocess the tweets
to handle idiosyncrasies such as hashtags and user handles. Then we extract word
unigram features, character n-gram features, and emoji unigram features. Finally, we train
a logistic regression classifier on those features.</p>
      <sec id="sec-2-1">
        <title>2.1 Preprocessing</title>
        <p>We use the same basic preprocessing pipeline for all languages.</p>
        <p>
          First, we substitute user mentions, email addresses, and URLs with special tokens.
We use the regular expression ‘@\S+’ to find and replace user mentions and ‘\S+@\S+’
for email addresses. Inspired by the URLExtract<sup>3</sup> library, we identify top-level domain
names in the text and check their boundaries to find URLs. To handle Twitter’s hashtags,
we remove all ‘#’ characters from the text and replace ‘_’ (underscore) with a space
character. Next, we tokenize the text using the WordPunctTokenizer provided by the Natural
Language Toolkit (NLTK, version 3.3) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Finally, we lowercase all tokens.
        </p>
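        <p>The preprocessing steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the submitted implementation: the placeholder token names are hypothetical, emails are replaced before mentions (an assumed order, since the mention pattern would otherwise consume part of an address), URL handling is omitted, and the tokenizer uses the regular expression that NLTK's WordPunctTokenizer is built on instead of importing NLTK.</p>
        <preformat><![CDATA[
```python
import re

# Placeholder token names are hypothetical; the paper does not list them.
EMAIL_TOKEN = "EMAILTOKEN"
USER_TOKEN = "USERTOKEN"

EMAIL_RE = re.compile(r"\S+@\S+")
MENTION_RE = re.compile(r"@\S+")
# Same pattern that NLTK's WordPunctTokenizer is built on.
WORD_PUNCT_RE = re.compile(r"\w+|[^\w\s]+")


def preprocess(tweet):
    """Return the lowercased tokens of one tweet."""
    # Emails first (assumed order); otherwise the mention pattern
    # would consume the "@domain" part of an address.
    text = EMAIL_RE.sub(EMAIL_TOKEN, tweet)
    text = MENTION_RE.sub(USER_TOKEN, text)
    # Hashtags: drop '#' and turn underscores into spaces.
    text = text.replace("#", "").replace("_", " ")
    return [tok.lower() for tok in WORD_PUNCT_RE.findall(text)]


tokens = preprocess("Hi @bob, mail me at a@b.ch #good_morning!")
```
]]></preformat>
        <p>Note that the mention pattern also swallows punctuation directly attached to a handle, such as the comma after "@bob" in the example.</p>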
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Feature Extraction</title>
        <p>
          TF-IDuF: The TF-IDuF score was introduced in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as an alternative weighting scheme
to the traditional TF-IDF weighting, based on a user’s document collection. For a given
term t it is computed as <inline-formula><tex-math><![CDATA[\text{TF-IDuF}(t) = \text{tf}(t) \cdot \log\left(\frac{N_u}{n_u(t)}\right)]]></tex-math></inline-formula>, where tf(t) is the term
frequency of t, N<sub>u</sub> is the number of documents for user u, and n<sub>u</sub>(t) is the number of
documents for user u that contain the term t. We chose this method because we process
all of one author’s texts at once, so it can be implemented in a stateless
fashion.
        </p>
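        <p>As an illustration of the formula, the following sketch computes TF-IDuF weights for the documents of a single user. The function name and the dictionary-based output are assumptions made for this example. Because only the user's own documents are needed, no corpus-wide statistics have to be stored.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def tf_iduf(user_docs):
    """Weigh the terms of each document of a single user.

    user_docs: list of token lists, one per document (tweet).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(user_docs)  # N_u
    # n_u(t): number of this user's documents that contain t.
    doc_freq = Counter(t for doc in user_docs for t in set(doc))
    weights = []
    for doc in user_docs:
        tf = Counter(doc)  # raw term frequency tf(t)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}
        )
    return weights


weights = tf_iduf([["good", "morning"], ["good", "night"]])
```
]]></preformat>
        <p>In this example "good" occurs in both documents and receives weight 0, while "morning" and "night" each receive log(2).</p>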
        <sec id="sec-2-2-1">
          <p><sup>3</sup> https://github.com/lipoja/URLExtract</p>
          <p>Word Features: For every tweet of a user, we compute TF-IDuF features as
described above. The vocabulary consists of all terms
that appear in the document collections of at least 2 users. In addition, we set all
nonzero term frequencies to 1, as we expect this to be less noisy than full term frequencies
for short texts such as tweets.</p>
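          <p>The vocabulary filter and the binary term frequencies might look like the sketch below. The helper names and the input layout (a mapping from user id to that user's tokenized tweets) are assumptions for illustration only.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def build_vocabulary(users_docs, min_users=2):
    """users_docs: {user_id: list of token lists}."""
    user_freq = Counter()
    for docs in users_docs.values():
        # Count each term once per user, however often it occurs.
        user_freq.update({t for doc in docs for t in doc})
    return {t for t, n in user_freq.items() if n >= min_users}


def binarize(doc, vocabulary):
    """Binary term frequencies of one tweet, restricted to the vocabulary."""
    return {t: 1 for t in set(doc) if t in vocabulary}


corpus = {
    "u1": [["good", "morning"], ["good"]],
    "u2": [["good", "night"]],
}
vocab = build_vocabulary(corpus)
```
]]></preformat>
          <p>Here only "good" is used by two different users, so it is the only term retained in the vocabulary.</p>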
          <p>
            Character Features: We extract character n-gram features for n ranging from 1 to 5.
Every n-gram is considered at most once per tweet, and we use the hashing trick [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]
to get a feature vector of dimension 2<sup>20</sup>.
          </p>
          <p>
            We use HashingVectorizer, the implementation provided by the Scikit-learn (sklearn,
version 3.3) framework [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Since this implementation expects complete strings as
input, we join the tokens from the preprocessing step with a whitespace character. This
leads to n-grams spanning across word boundaries.
          </p>
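          <p>The idea behind these character features can be sketched in plain Python. This is a simplified stand-in, not the HashingVectorizer implementation: it uses zlib.crc32 as the hash function and ignores the signed hashing that scikit-learn applies, so the resulting indices differ from those of the library. All function names are hypothetical.</p>
          <preformat><![CDATA[
```python
import zlib

N_FEATURES = 2 ** 20


def char_ngrams(text, n_min=1, n_max=5):
    """Distinct character n-grams of text, n from n_min to n_max inclusive."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def hashed_features(tokens):
    """Sorted indices of the active (binary) features of one tweet."""
    # Joining with spaces lets n-grams span word boundaries.
    text = " ".join(tokens)
    return sorted(
        {zlib.crc32(g.encode("utf-8")) % N_FEATURES for g in char_ngrams(text)}
    )


grams = char_ngrams("ab cd")
indices = hashed_features(["ab", "cd"])
```
]]></preformat>
          <p>Using a set of n-grams ensures that every n-gram is counted at most once per tweet, and the n-gram "b c" shows how features cross the boundary between the tokens "ab" and "cd".</p>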
          <p>Emoji Features: Using the emoji<sup>4</sup> library, we extract emoji from tweets and weigh
them using TF-IDuF with the same settings as for word tokens.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Classification</title>
        <p>
          We train a separate logistic regression classifier for each language, applying the
LogisticRegression implementation provided by sklearn [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We use every text of every user
as a separate training sample, labeled with the gender of its author.
At inference time, we obtain a prediction from the classifier for every text of an author and
predict the majority label.
        </p>
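        <p>The author-level decision can be illustrated with a short sketch. It assumes the per-tweet labels have already been produced by the classifier; the function name is hypothetical, and the tie-breaking behaviour (the label seen first wins) is an assumption, since ties are not discussed here.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def predict_author(tweet_predictions):
    """Majority vote over the per-tweet labels of one author."""
    # most_common(1) returns the most frequent label; on a tie,
    # the label encountered first is returned.
    return Counter(tweet_predictions).most_common(1)[0][0]


label = predict_author(["female", "male", "female"])
```
]]></preformat>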
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>We performed ablation experiments to compare different ways of weighing the
terms in a document. First we examine the performance of the different unigram
weighing approaches on their own: tf is the system that uses raw term frequencies directly,
tf-idf is the standard TF-IDF weighing of terms, and tf-iduf is the TF-IDuF weighing described
in Section 2.2. tf-iduf &amp; chars uses word unigrams weighed by TF-IDuF and
character 1- to 5-grams. Finally, full refers to the system incorporating all features
described in Section 2. This is the system that we submitted to the competition.</p>
      <sec id="sec-3-1">
        <p><sup>4</sup> https://github.com/carpedm20/emoji</p>
        <p>Table 1 and Figure 1: mean validation accuracy and standard deviation per language (en, es, ar) for the systems tf, tf-idf, tf-iduf, tf-iduf &amp; chars, and full.</p>
        <p>To run the experiment we split the provided training data randomly into a training
set and validation set. The split ratio of training to validation size is 80:20, i.e. 2400
authors for training and 600 for validation in the case of English and Spanish and 1200
authors for training and 300 for validation for Arabic.</p>
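        <p>A random author-level split of this kind could be implemented as in the sketch below. The function name and the fixed seed are assumptions for illustration; how the random split was seeded in our experiments is not specified here.</p>
        <preformat><![CDATA[
```python
import random


def split_authors(author_ids, train_ratio=0.8, seed=0):
    """Shuffle the authors and split them into training and validation sets."""
    ids = list(author_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]


train_ids, val_ids = split_authors(range(3000))
```
]]></preformat>
        <p>Splitting by author rather than by tweet keeps all 100 messages of a user on the same side of the split.</p>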
        <p>Each experiment is run 5 times, and we report the mean and standard deviation of the
accuracy. The numeric results are shown in Table 1, and Figure 1 gives a qualitative overview of the
results. In Figure 1, the bars show the mean accuracy on the validation split for each system and
language. The error bars indicate the standard deviation. The horizontal lines show the
results of our submission in the competition for reference.</p>
        <p>There seem to be no qualitative differences between the explored feature sets and
weighing schemes. The mean accuracies stay mostly within the error bars of each other
per language. Furthermore, for English and Arabic the validation accuracy is close to the
accuracy attained by our submission. For Spanish the validation accuracy appears
lower than the accuracy of our submission, but this might well be due to random chance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>We have given an overview of our classification system for gender identification. Our
system attained accuracies of 77.42% for English, 74.64% for Spanish, and 73.20% for
Arabic Twitter messages at the author profiling task at PAN 2018. We explored different
word unigram weighing schemes and found that they all give similar performance when
applied to our system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwyer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medvedeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rawee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haagsma</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>N-GrAM: New Groningen Author-profiling Model</article-title>
          .
          <source>In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland, September 11-14, 2017 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gipp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users' Personal Document Collections</article-title>
          .
          <source>In: Proceedings of the iConference 2017</source>
          . Wuhan, China (Mar 22-25,
          <year>2017</year>
          ), http://ischools.org/the-iconference/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>In: Proceedings of the ACL 2004 on Interactive poster and demonstration sessions</source>
          . p.
          <fpage>31</fpage>
          . Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Proceedings of the fourth workshop on nlp for similar languages, varieties and dialects (vardial)</article-title>
          .
          <source>In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</source>
          .
          <source>Association for Computational Linguistics</source>
          (
          <year>2017</year>
          ), http://aclweb.org/anthology/W17-1200
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pardo</surname>
            ,
            <given-names>F.M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in twitter</article-title>
          .
          <source>In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland, September 11-14, 2017 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pardo</surname>
            ,
            <given-names>F.M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations</article-title>
          .
          <source>In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum</source>
          , Évora, Portugal, 5-8 September, 2016. pp.
          <fpage>750</fpage>
          -
          <lpage>784</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter</article-title>
          . In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.)
          <source>Working Notes Papers of the CLEF 2018 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation</article-title>
          . In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N. (eds.)
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18)</source>
          . Springer, Berlin Heidelberg New York (Sep
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dasgupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Attenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Feature hashing for large scale multitask learning</article-title>
          .
          <source>In: Proceedings of the 26th Annual International Conference on Machine Learning</source>
          . pp.
          <fpage>1113</fpage>
          -
          <lpage>1120</lpage>
          . ICML '09, ACM, New York, NY, USA (
          <year>2009</year>
          ), http://doi.acm.org/10.1145/1553374.1553516
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>