<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Profiling of Twitter Users</article-title>
      </title-group>
      <contrib-group>
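        <contrib contrib-type="author">
          <string-name>Roy Bayot</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teresa Gonçalves</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Quaresma</string-name>
        </contrib>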
        <aff id="aff1">
          <label>1</label>
          <institution>Universidade de Évora</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>In this paper, we focus on profiling authors by age, gender, and five personality traits. The corpus consists of anonymized Twitter posts in four languages. Our approach combines tfidf, function words, stylistic features, and text bigrams, and uses an SVM for each task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Author profiling from text has attracted interest recently because of the growing
availability of text. Much of this growth comes from the internet, where text is one of the
main forms of communication: blogs, websites, customer reviews,
and even Twitter posts.</p>
      <p>Although authors on the web are often anonymous, profiling can be
useful in areas such as marketing, advertising, and security. Profiling
uses text to determine attributes of the author such as age, gender,
and certain personality traits. The idea is that topics and word usage are
affected by these attributes. For instance, bands or currently trending music
would be a typical topic for teenagers. The task is not always easy, since some people
do not think in a way typical of their age, and that affects their writing. Some people write
fiction, so a text may be written from the perspective of someone with a
different personality type.</p>
      <p>
        PAN, however, is making an effort in this direction. In this year's edition of the PAN
author profiling task, the goal is to profile Twitter users in four languages:
English, Dutch, Italian, and Spanish. The subtasks are profiling for age, gender, and the
Big Five personality traits: agreeableness, conscientiousness, extroversion, openness,
and stability [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Similar approaches have been used before. For instance, in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
the authors used 405 function words together with part-of-speech n-grams (the 500
most common ordered triples, the 100 most common ordered pairs, and all single tags) to
categorize text by gender. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], both style-based features (POS tags, function words, blog
words, and hyperlinks) and content-based features (content words and hand-crafted
LIWC classes) were used to classify by age and gender. In the previous year, PAN also
ran an author profiling task, but on several sources rather than only tweets. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the method
represents terms in a space of profiles and then represents the documents in that space,
with subprofiles built using expectation maximization clustering. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
n-grams were used with stopwords, punctuation, and emoticons retained, and idf
counts were applied before the data was processed with 5 different classifiers. Liblinear
logistic regression gave the best result. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the features
were related to length (number of characters, words, and sentences), information retrieval
(cosine similarity, Okapi BM25), and readability (Flesch-Kincaid readability,
correctness, style). These were used with 7 different classifiers. Another approach is to use a term
vector model representation, as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Marquardt et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used
a combination of content-based features (MRC, LIWC, sentiment) and stylistic
features (readability, HTML tags, spelling and grammatical errors, emoticons, total number of
posts, number of capitalized letters, and number of capitalized words).
      </p>
      <p>Since this is our first submission to PAN, we opted for a simpler
approach using tfidf, function words, some stylistic features, and text bigrams.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Methodology</title>
      <p>For a first submission to this task, we decided to use the same approach for all the
tasks. The method is fairly straightforward: extract basic features,
concatenate them, use the combined features for classification or
regression, and evaluate with 10-fold cross-validation.</p>
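      <p>The 10-fold cross-validation step can be illustrated with a short sketch. This is not the code used for the submission: the helper name k_fold_indices, the seed, and the count of 152 authors are hypothetical choices made only for the example.</p>
      <preformat>
```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# 152 is a hypothetical number of authors, used only for illustration.
splits = list(k_fold_indices(n_samples=152, k=10))
```
      </preformat>
      <p>Each author appears in exactly one test fold, so every instance is used once for evaluation and nine times for training.</p>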
      <sec id="sec-2-1">
        <title>2.1 Feature Vector Creation</title>
        <p>Four main feature types are used in this submission, each processed separately.
The first is tfidf. Term frequency-inverse document frequency, or
tfidf, is one of the most commonly used text features.</p>
        <p>
          Before tfidf feature extraction, the tweets were preprocessed.
For this task, all tweets from a single person were concatenated. Numbers
were removed and the text was converted to lower case. Then stopwords from the NLTK
toolkit [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] were removed from the set of words. Finally, the resulting words were used
to compute a tfidf vector representation with the scikit-learn Python library. The vector
size was capped at 10000 dimensions, with excess terms discarded based on document frequency. Defaults
were used for the other vectorizer settings. Note that some of the tfidf representations
did not reach the maximum of 10000 dimensions.
        </p>
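        <p>The preprocessing and vectorization steps might look roughly as follows. This is a minimal sketch under stated assumptions, not the submitted code: the toy corpus, the preprocess helper, and the short STOPWORDS list (a stand-in for the NLTK stopword lists) are hypothetical. We also assume the size cap corresponds to the max_features setting of TfidfVectorizer, which keeps the most frequent terms across the corpus and so may differ in detail from the selection described above.</p>
        <preformat>
```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in stopword list; the paper used the NLTK stopword lists.
STOPWORDS = ["the", "a", "is", "to", "and", "of"]

def preprocess(tweets):
    """Concatenate one author's tweets, drop numbers, and lowercase."""
    text = " ".join(tweets)
    text = re.sub(r"\d+", " ", text)
    return text.lower()

# Hypothetical toy corpus: one list of tweets per author.
authors = [
    ["The concert is at 8 tonight", "New album of the band is out"],
    ["Quarterly report due in 3 days", "Meeting moved to 10"],
]
documents = [preprocess(tweets) for tweets in authors]

# max_features caps the vocabulary size; everything else is left at defaults.
vectorizer = TfidfVectorizer(stop_words=STOPWORDS, max_features=10000)
X_tfidf = vectorizer.fit_transform(documents)
vocabulary = sorted(vectorizer.vocabulary_)
```
        </preformat>
        <p>With a small corpus the vocabulary stays far below the cap, which matches the observation that some representations did not reach 10000 dimensions.</p>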
        <p>The second type is stylistic features. We only check for the presence or
absence of certain characters or character combinations:
"#", "@username", "http://", ":)", ";)", "o_O", "!", "!!",
"!!!", and ":(". This list is by no means exhaustive and serves only as an initial set. The octothorpe
indicates a hashtag. The "@username" token indicates that the user tags other
Twitter users; normally this would be a Twitter handle, but since the data was anonymized,
we use this token instead. The set ":)", ";)", "o_O", and ":(" checks for some sort of
emotion. Finally, the exclamation points can indicate surprise or the intensity of a
statement, which is common on the internet.</p>
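        <p>A presence/absence check of this kind can be sketched in a few lines. The function name stylistic_features and the example tweet are hypothetical, and plain substring matching is an assumption: under it, a text containing "!!" also sets the flag for "!".</p>
        <preformat>
```python
# Markers listed in the paper; "@username" is the anonymized mention token.
MARKERS = ["#", "@username", "http://", ":)", ";)", "o_O", "!", "!!", "!!!", ":("]

def stylistic_features(text):
    """Return one 0/1 flag per marker: 1 if the marker occurs in the text."""
    return [1 if marker in text else 0 for marker in MARKERS]

# Hypothetical example tweet.
example = "@username loved the show!! :) #live"
features = stylistic_features(example)
# features == [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]
```
        </preformat>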
        <p>The third type is function words. Function words here are informative words
that can be used to discriminate between classes. They were obtained by building
a decision tree from all instances in the training data and selecting the most
informative features, with entropy as the criterion. The following tables
show the words and characters obtained as function words.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Function words for the age task.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Language</th><th>Function words</th></tr>
            </thead>
            <tbody>
              <tr><td>Dutch</td><td>"zit", "heel", "best", "geeft", "idee", "nooit", "weer", "binnen", "goed", "avond", [...], "onderzoeksjournalistiek", "onzin", "proficiat", "ten", "verdient", "verzuurde", "werkt"</td></tr>
              <tr><td>English</td><td>"co", "wanna", "us", "haha", "username", "fitbit", "et", "bowl", "academia", "bitch", "happened", "even", "year", "reach", "free", "times", "speech", "top", "add", "social", "think", "nothing", "financial", "pop", "inspiring", "lil", "complicated", "aa"</td></tr>
              <tr><td>Italian</td><td>"domani", "fa", "poi", "pezzo", "immagini", "quel", "ultimo", "binari", "bravo", [...], "eccomi", "esempio", "novit", "oscena", "pard", "piazza", "preso", "pu", "rispetto", "yg"</td></tr>
              <tr><td>Spanish</td><td>[...]</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Function words for the gender task.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Language</th><th>Function words</th></tr>
            </thead>
            <tbody>
              <tr><td>Dutch</td><td>"username", "goed", "bent", "saai"</td></tr>
              <tr><td>English</td><td>"close", "love", "mention", "co", [...], "round", "thank", "bird", "wouldn", "aa"</td></tr>
              <tr><td>Italian</td><td>[...], "ottimo"</td></tr>
              <tr><td>Spanish</td><td>"vida", "alguien", "corrupci", "ciudades", "si", "temprano", "puro", "meta", "foto", "dio"</td></tr>
            </tbody>
          </table>
        </table-wrap>
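        <p>The decision-tree selection of function words can be illustrated as follows. This is a hedged sketch rather than the submitted code: the toy documents and the labels "younger" and "older" are hypothetical, and keeping every word with non-zero feature importance is our assumption about how the most informative features were read off the tree.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data: one document and one label per author.
documents = [
    "wanna go haha", "haha wanna party", "lol haha wanna",
    "financial speech today", "social speech report", "financial report due",
]
labels = ["younger", "younger", "younger", "older", "older", "older"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Entropy is the split criterion, as in the paper; other settings are defaults.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, labels)

# Order the vocabulary by column index so it lines up with the importances.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
# Keep the words the tree actually split on (non-zero importance).
function_words = [t for t, w in zip(terms, tree.feature_importances_) if w > 0]
```
        </preformat>
        <p>On this toy data a single split on "haha" or "wanna" separates the two classes, so the resulting list contains one of those words.</p>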
        <p>For the personality tasks, the decision tree was built by framing the problem
as classification. Instead of continuous scores from -0.5
to 0.5, we used discrete values from -0.5 to 0.5 in steps of 0.1. The words for
the personality tasks are shown in Tables 3-7.</p>
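        <p>Turning the continuous personality scores into classes can be sketched as below. The helper discretize_score and the sample scores are hypothetical, and rounding to the nearest grid point is an assumption, since the text does not state how values between grid points were handled.</p>
        <preformat>
```python
def discretize_score(score, step=0.1, low=-0.5, high=0.5):
    """Map a continuous trait score to the nearest class on a 0.1 grid."""
    clipped = min(max(score, low), high)
    # Work in integer steps to avoid floating-point drift, then scale back.
    return round(round(clipped / step) * step, 1)

# Hypothetical scores, used only for illustration.
scores = [-0.5, -0.23, 0.04, 0.18, 0.5]
classes = [discretize_score(s) for s in scores]
# classes == [-0.5, -0.2, 0.0, 0.2, 0.5]
```
        </preformat>
        <p>With a step of 0.1 on the range from -0.5 to 0.5, this yields at most 11 distinct classes per trait.</p>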
        <p>Finally, we also added text bigrams, to possibly capture some structure in
the input texts.</p>
        <p>After the features were extracted and concatenated, we used a linear SVM with the default
regularization parameter (C) of 1. We used the scikit-learn library for this, and used the SVM
as an initial check for results.</p>
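        <p>Concatenating the feature blocks and training the linear SVM might look like the sketch below. It is an illustration under assumptions, not the submitted system: the toy documents, the labels "a" and "b", and the single ":)" flag standing in for the stylistic block are hypothetical. We use LinearSVC and scipy.sparse.hstack as one plausible way to do this in scikit-learn; the paper does not name the exact classes it used.</p>
        <preformat>
```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: one concatenated document and one label per author.
documents = [
    "love this song :)", "love the new band :)", "this song rocks :)",
    "quarterly report is due", "the report is late", "meeting about the report",
]
labels = ["a", "a", "a", "b", "b", "b"]

# Word bigrams capture some local structure of the text.
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(documents)
tfidf = TfidfVectorizer().fit_transform(documents)
# One hand-made stylistic flag per document: does it contain ":)"?
style = csr_matrix([[1 if ":)" in doc else 0] for doc in documents])

# Concatenate the feature blocks column-wise into one matrix.
X = hstack([tfidf, bigrams, style]).tocsr()

# Linear SVM with C=1.0, which is also the scikit-learn default.
classifier = LinearSVC(C=1.0, random_state=0)
classifier.fit(X, labels)
predictions = list(classifier.predict(X))
```
        </preformat>
        <p>LinearSVC covers only the classification tasks; a regression setting would need a regressor in its place.</p>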
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <sec id="sec-3-1">
        <title>Setups</title>
        <p>Each feature type was also used individually for classification or
regression, and some combinations of feature types were used as well. In Tables 8-11, the different tasks
were run with tfidf, function words (FW), stylistic features (SF), and text bigrams (TB),
as well as combinations of these.</p>
        <p>The results from PAN are summarized in the table below. The results were not as
satisfactory as we had hoped.</p>
        <p>In conclusion, much improvement is still needed for these tasks, for instance
exploring more features, such as further stylistic features. Other classifiers and
parameter tuning also remain to be explored. One possible mistake this year was to submit the single
feature combination that yielded the best overall result, rather than choosing specific
models for specific languages and tasks. It would have been better to adapt the system
in that way.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>Natural Language Processing with Python</source>
          . O'Reilly Media
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoni</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Automatically categorizing written texts by author gender</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Using intra-profile information for author profiling</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Maharjan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrestha</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A simple approach to author profiling in MapReduce</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Marquardt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farnadi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davalos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teredesai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Cock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Age and gender identification in social media</article-title>
          .
          <source>Proceedings of CLEF 2014 Evaluation Labs</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          . In:
          <source>Working Notes Papers of the CLEF 2015 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2015</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          . In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol.
          <volume>6</volume>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Villena-Román</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González-Cristóbal</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Daedalus at PAN 2014: Guessing tweet author's gender and age</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Weren</surname>
            ,
            <given-names>E.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreira</surname>
            ,
            <given-names>V.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Oliveira</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Exploring information retrieval features for author profiling-notebook for pan at clef 2014</article-title>
          . Cappellato et al.[
          <volume>6</volume>
          ]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>