<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Segmenting Target Audiences: Automatic Author Profiling Using Tweets.</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Maite Giménez</institution>
          ,
          <addr-line>Delia Irazú Hernández, and Ferran Pla</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de València</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>This paper describes a methodology for author profiling using natural language processing and machine learning techniques. We used lexical information in the learning process. For languages without lexicons, we automatically translated existing lexicons so that this information could be used. Finally, we discuss how we applied this methodology to the 3rd Author Profiling Task at PAN 2015 and present the results we obtained.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The exponential growth of social networks has led to new challenges in the study of
Natural Language Processing (NLP). The literature contains extensive work on
understanding normative texts. Social profiling is a less explored topic, even
though its study is also relevant to other fields, such as marketing and sociology [
        <xref ref-type="bibr" rid="ref1 ref3 ref8">3,1,8</xref>
        ].
      </p>
      <p>This paper explores how to define user profiles using classic NLP techniques.
The corpora were created by compiling tweets in different languages. Twitter [15] is a
microblogging service which, according to the latest statistics, has 284 million active users,
77% of them outside the US, who generate 500 million tweets a day in 35 different languages.
That amounts to about 5,700 tweets per second, with activity peaks of 43,000 tweets per second.
These numbers justify the great interest in the automatic processing of this information.</p>
      <p>
        The author profiling competition was proposed at PAN 2015. A detailed explanation
can be found in the overview paper of the task [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We have tackled this task using
NLP techniques and machine learning (ML).
      </p>
      <p>The remainder of this paper is organized as follows. Section 2 briefly covers the state
of the art, Section 3 describes the corpus, Section 4 presents in detail the methodology
we used and Section 5 presents the experiments we carried out. Sections 6 and 7
discuss our results and the future work needed to improve them.</p>
    </sec>
    <sec id="sec-2">
      <title>State of the Art</title>
      <p>Author profiling is a research area of interest to disciplines such as linguistics,
psychology and marketing.</p>
      <p>
        For a long time, the complexity of the task made it unfeasible. Since 2000, however,
technology has become mature enough to tackle it. Early works [
        <xref ref-type="bibr" rid="ref4">4,14</xref>
        ] only studied gender and age. More recently,
psychological features have also been tackled [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The work of Pennebaker et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] linked
language use with the author's psychological features.
      </p>
      <p>Since 2013, an author profiling contest has been held at PAN. Participants in previous
editions [13,12] used stylistic features, such as term frequency, part-of-speech (POS) tags
and stop words, and content features, such as n-grams, sets of words and lists of entities.
They used those features to train systems based on support vector machines (SVM),
decision trees, Naïve Bayes, etc. The accuracy obtained in previous years shows how much
the nature of the texts in the corpus matters: systems achieved around 40% accuracy
predicting gender and age on Twitter data, whereas accuracy fell to 25% on hotel reviews.</p>
      <p>In this edition of PAN, the task has been extended: participants must identify age,
gender and personality traits, as described in Section 1.</p>
    </sec>
    <sec id="sec-3">
      <title>Corpora Description</title>
      <p>We started by studying the corpora, which allowed us to select a suitable methodology
for the task.</p>
      <p>The task organizers provided multilingual corpora. They contain 14,166 tweets
from 152 English authors, 9,879 tweets from 100 Spanish authors, 3,687 tweets from 38
Italian authors and 3,350 tweets from 34 Dutch authors.</p>
      <p>Tweets were balanced by gender and unbalanced by age: there were many more
tweets from users aged 25-34. Nevertheless, according to Twitter's
statistics, it is safe to assume that this age distribution is representative of
reality.</p>
      <p>Then, we studied the vocabulary of each language. For this study we tokenized the
tweets and removed punctuation marks and stop words. In every language, the most frequent
tokens were Twitter-specific ones such as RT, HTTP, username and via, together with
abbreviations. We then studied how the vocabulary is distributed across age and gender
for every language. Table 1 shows the most frequent words by gender and age for
English and Spanish.</p>
      <p>Table 1. Most frequent words for female and male authors in English and Spanish. Tokens such as username, HTTP and RT rank among the most frequent in both languages.</p>
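      <p>The vocabulary study described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in our experiments: the regular-expression tokenizer, the tiny stop word set and the names tokenize and most_frequent are assumptions made for the example.</p>
      <preformat>
```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real study would use a
# full list for each language.
STOP_WORDS = {"the", "a", "is", "to", "and", "i", "my"}

# Word tokens: runs of letters and digits, optionally with one internal
# apostrophe. Punctuation (including '@', '#' and ':') is dropped.
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)?")


def tokenize(tweet):
    """Lowercase a tweet, split it into word tokens and drop stop words."""
    tokens = TOKEN_RE.findall(tweet.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def most_frequent(tweets, n=5):
    """Return the n most frequent tokens over a collection of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts.most_common(n)
```
      </preformat>
      <p>Calling most_frequent once per gender or age group, on the tweets of that group only, gives per-group rankings of the kind summarized in Table 1.</p>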
      <p>Finally, we studied hashtags. Hashtags are relevant on Twitter because they are how
users self-annotate their tweets. We found that 37.9% of English tweets, 26.7% of
Spanish tweets, 59.9% of Italian tweets and 27.3% of Dutch tweets contain hashtags. It
is worth highlighting that English words are present in the other corpora, due to the
massive use of English in social media.</p>
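      <p>As an illustration of how the hashtag percentages above can be obtained, the following sketch computes the share of tweets that contain at least one hashtag. The pattern used to detect hashtags and the function name hashtag_share are assumptions for this example, not part of our original tooling.</p>
      <preformat>
```python
import re

# Assumption: a hashtag is '#' followed by at least one word character.
HASHTAG_RE = re.compile(r"#\w+")


def hashtag_share(tweets):
    """Return the percentage of tweets that contain at least one hashtag."""
    if not tweets:
        return 0.0
    with_hashtag = sum(1 for t in tweets if HASHTAG_RE.search(t))
    return 100.0 * with_hashtag / len(tweets)
```
      </preformat>
      <p>Applying this function separately to the tweets of each language yields one percentage per corpus.</p>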
    </sec>
    <sec id="sec-4">
      <title>Methodology Description</title>
      <p>
        Based on the brief analysis presented in Section 3, we decided to apply machine
learning algorithms in order to identify personality traits. We employed the Scikit-learn
toolkit [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in our analysis and experimental settings. To carry out the training
process, we developed a new function for this toolkit, which we
consider one of our main contributions. This function allows training a
machine learning algorithm using both word lexicons and stylistic features. Furthermore,
we automatically translated some lexicons originally developed for English into Spanish,
Italian and Dutch. In our model we considered three subsets of features:
– Textual features. This set relies only on textual content, which was previously
lowercased. We took into account configurations with the following n-gram
size ranges: 1-3, 1-4, 1-6, 3-6 and 3-9.
      </p>
      <p>For the textual features we used the following representations: TF-IDF coefficients;
inter-word character n-grams with TF-IDF coefficients; intra-word character n-grams with
TF-IDF coefficients; and bag of words.</p>
      <p>– Stylistic features: frequency of words with repeated characters; frequency of
uppercase words; and frequency of hashtags, mentions, URLs and retweets (RT).</p>
      <p>– Lexicon-based features. Using four different lexicons, we calculated a score for
each one with the formula
        <disp-formula><tex-math>\frac{1}{|W|} \sum_{w \in W} \mathit{lexicon}(w)</tex-math></disp-formula>
Stop words were removed before extracting this information. The four lexicons are:</p>
      <p>
        Afinn [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This resource consists of a list of words with polarity values in
the range -5 to +5.
      </p>
      <p>
        NRC [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A polarity dictionary that assigns each word a real value representing
its polarity.
      </p>
      <p>NRC hashtags. This resource consists of a list of positive and negative hashtags. We
normalized the polarity values in this dictionary, assigning +5 to positive entries
and -5 to negative entries.</p>
      <p>
        Jeffrey [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This resource contains two lists of words, one positive and one
negative. We computed two scores from this resource (positive and negative).
      </p>
      <p>
        As mentioned above, we adopted a machine learning experimental
setting. We carried out different classification tasks: one to determine the gender of
the author, a second one to identify the age group, and a binary classification for each
one of the personality traits. In total, our experiments comprise seven
classification tasks. We tested the following classification algorithms:
– Linear Support Vector Machine (all implementations in the toolkit were applied)
– Polynomial Kernel Support Vector Machine
– Naïve Bayes
– Gradient descent
– Logistic Regression
– Random Forest
      </p>
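      <p>The stylistic and lexicon-based features described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not our actual implementation: the toy lexicon, the stop word set, the rule that three identical consecutive characters count as a repetition, the choice of scoring out-of-lexicon words as 0, and the function names are all hypothetical.</p>
      <preformat>
```python
import re

# Hypothetical toy lexicon in the style of Afinn (values between -5 and +5).
TOY_LEXICON = {"good": 3, "love": 3, "bad": -3, "awful": -4}
STOP_WORDS = {"the", "a", "is", "i", "this"}  # illustrative only

WORD_RE = re.compile(r"[^\W\d_]+")    # runs of letters
REPEATED_RE = re.compile(r"(.)\1\1")  # same character three times in a row


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Average lexicon value over the non-stop words W of a text.

    Assumption: words missing from the lexicon contribute 0 to the sum.
    """
    words = [w for w in WORD_RE.findall(text.lower()) if w not in STOP_WORDS]
    if not words:
        return 0.0
    return sum(lexicon.get(w, 0) for w in words) / len(words)


def stylistic_features(text):
    """Relative frequencies of the stylistic cues over whitespace tokens."""
    tokens = text.split()
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "repeated_chars": sum(1 for t in tokens if REPEATED_RE.search(t)) / n,
        # Note: the token "RT" is also counted as an uppercase word here.
        "uppercase": sum(1 for t in tokens if t.isupper() and len(t) > 1) / n,
        "hashtags": sum(1 for t in tokens if t.startswith("#")) / n,
        "mentions": sum(1 for t in tokens if t.startswith("@")) / n,
        "urls": sum(1 for t in tokens if t.startswith("http")) / n,
        "rt": sum(1 for t in tokens if t == "RT") / n,
    }
```
      </preformat>
      <p>Both functions return plain numbers, so their outputs can be appended to the textual feature vector of each sample before training any of the classifiers listed above.</p>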
    </sec>
    <sec id="sec-5">
      <title>Experimental Work</title>
      <p>We considered two approaches to train our system. The first one joins all the tweets of each
user, so there is one sample per user. The second one uses each tweet as a
training sample. This last approach reduces spatial sparsity.</p>
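      <p>The two ways of building training samples can be illustrated with the sketch below. It assumes, purely for the example, that the corpus is available as a list of (user, text, label) tuples; the function names are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict


def per_tweet_samples(tweets):
    """One training sample per tweet: (text, label of its author)."""
    return [(text, label) for _user, text, label in tweets]


def per_user_samples(tweets):
    """One training sample per user: all of a user's tweets joined."""
    texts = defaultdict(list)
    labels = {}
    for user, text, label in tweets:
        texts[user].append(text)
        labels[user] = label
    # Users are emitted in order of first appearance in the input.
    return [(" ".join(texts[u]), labels[u]) for u in texts]
```
      </preformat>
      <p>The per-user variant yields as many samples as authors, whereas the per-tweet variant yields as many samples as tweets, each carrying the label of its author.</p>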
      <p>As a first step, we carried out a preliminary experiment that considered the
whole set of features and all the classifiers mentioned above. We applied 10-fold
cross-validation over the corpus and chose precision as the evaluation measure.
These experiments allowed us to compare the performance of our model under
different configurations. SVM was chosen for gender and age identification, while
linear regression was selected for the personality traits. As a second experimental
setting, the best-ranked models were grouped in order to tune their parameters.
The features considered were textual, stylistic and lexicon-based.</p>
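      <p>A preliminary experiment of this kind can be sketched with Scikit-learn as shown below. The snippet is only an approximation under explicit assumptions: the toy texts and the labels A and B are invented, only character n-grams of sizes 1-3 with TF-IDF weights are used (the char_wb analyzer is our guess at what intra-word character n-grams correspond to in the toolkit), and precision is macro-averaged, which is one possible reading of the evaluation measure.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 20 short "tweets" per class, invented for the example.
texts = ["love this song %d" % i for i in range(20)] + \
        ["match tonight goal %d" % i for i in range(20)]
labels = ["A"] * 20 + ["B"] * 20

# Lowercased character n-grams of sizes 1-3 within word boundaries,
# weighted with TF-IDF, followed by a linear SVM.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=True),
    LinearSVC(),
)

# 10-fold cross-validation scored with macro-averaged precision.
scores = cross_val_score(model, texts, labels, cv=10, scoring="precision_macro")
mean_precision = scores.mean()
```
      </preformat>
      <p>Swapping LinearSVC for another estimator, or changing ngram_range, reproduces the comparison between configurations; the parameter adjustment of the second setting could be approached in the same way with a grid search over the best-ranked models.</p>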
    </sec>
    <sec id="sec-6">
      <title>Results</title>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and future work</title>
      <p>In this paper we presented our participation in the PAN author profiling competition. We
used Natural Language Processing techniques to solve this task. We found that the
accuracy obtained for personality traits is still low. User profiling is a hard task, especially
when dealing with fine-grained traits.</p>
      <p>Our system performed acceptably for all the languages and demographic traits studied.
Poor gender identification penalized our global results. Our results in development
were overfitted when we adjusted the parameters of our system. However, a strength of
our system is that it can be automatically adapted to new languages.</p>
      <p>In the future, there are issues we should tackle, such as how to deal with big data
and real-time processing. Twitter users generate huge amounts of data; if we are able to process
them in real time, our system could improve its accuracy and have a considerable impact on
other areas such as marketing. Moreover, we plan to deal with slang, which is very common
in social media and has a deep impact on NLP tools such as lexicons and part-of-speech
taggers.</p>
      <p>Finally, we would like to try new distributed representations of the data and new
stylistic features. Distributed representations should reduce the spatial complexity, which would
reduce training time and, hopefully, improve the accuracy of our system.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the projects DIANA: DIscourse ANAlysis for
knowledge understanding (MEC TIN2012-38603-C02-01) and ASLP-MULAN:
Audio, Speech and Language Processing for Multimedia Analytics (MEC
TIN2014-54288C4-3-R).</p>
      <p>12. Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B.,
Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: Proceedings of
the Conference and Labs of the Evaluation Forum (Working Notes) (2014)</p>
      <p>13. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author
profiling task at PAN 2013. Notebook Papers of CLEF pp. 23–26 (2013)</p>
      <p>14. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on
blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.
vol. 6, pp. 199–205 (2006)</p>
      <p>15. Twitter: About Twitter, Inc. https://about.twitter.com/company (2014), accessed: 30-12-2014</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alarcón-del Amo</surname>
          </string-name>
          , M.d.C.,
          <string-name>
            <surname>Lorenzo-Romero</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Borja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          Á.:
          <article-title>Classifying and profiling social networking site users: A latent segmentation approach</article-title>
          .
          <source>Cyberpsychology, Behavior, and Social Networking</source>
          <volume>14</volume>
          (
          <issue>9</issue>
          ),
          <fpage>547</fpage>
          -
          <lpage>553</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
          </string-name>
          , J.:
          <article-title>Automatically profiling the author of an anonymous text</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>123</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Boe</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamrick</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aarant</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>System and method for profiling customers for targeted marketing</article-title>
          .
          <source>US Patent 6,236,975</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Corney</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>de Vel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohay</surname>
          </string-name>
          , G.:
          <article-title>Gender-preferential text mining of e-mail discourse</article-title>
          .
          <source>In: Computer Security Applications Conference</source>
          ,
          <year>2002</year>
          .
          <source>Proceedings. 18th Annual</source>
          . pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . IEEE (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>L.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arvidsson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.Å.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colleoni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Good friends, bad news - affect and virality in Twitter</article-title>
          .
          <source>In: Future information technology</source>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Mining and summarizing customer reviews</article-title>
          .
          <source>In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
          . ACM (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiritchenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets</article-title>
          .
          <source>In: Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013)</source>
          . Atlanta, Georgia, USA (June
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Orebaugh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allnutt</surname>
          </string-name>
          , J.:
          <article-title>Classification of instant messaging communications for forensics analysis</article-title>
          .
          <source>Social Networks</source>
          pp.
          <fpage>22</fpage>
          -
          <lpage>28</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehl</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niederhoffer</surname>
            ,
            <given-names>K.G.</given-names>
          </string-name>
          :
          <article-title>Psychological aspects of natural language use: Our words, our selves</article-title>
          .
          <source>Annual Review of Psychology</source>
          <volume>54</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>547</fpage>
          -
          <lpage>577</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2015 Evaluation Labs</source>
          ,
          <publisher-name>CEUR</publisher-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>