<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What do your look-alikes say about you? Exploiting strong and weak similarities for author profiling.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piotr Przybyła</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paweł Teisseyre</string-name>
          <email>teisseyrep@ipipan.waw.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Polish Academy of Sciences</institution>
          ,
          <addr-line>Jana Kazimierza 5, 01-248 Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe a two-step procedure for author profiling, which first exploits language similarities between users and then aims at discovering more complex dependencies for dissimilar users. The method is motivated by the fact that authors using very similar vocabulary are likely to have similar traits. We use both word-based and text-based features, as well as relying on previous research. The proposed approach gives successful results, especially for gender and age prediction. Moreover, we show the most useful features using relevance measures based on random forests.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper outlines our approach to the author profiling task at the 13th PAN evaluation
lab on uncovering plagiarism, authorship, and social software misuse [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The goal is to
analyse a collection of tweets (in English, Spanish, Dutch and Italian) and discover its
author’s gender, age and personality traits: extraversion, stability, agreeableness,
conscientiousness and openness. Unfortunately, the amount of available training data is very
small: from 34 users for Dutch to 152 for English. As it seems very unlikely that
new significant dependencies could be observed in such small sets, we have decided to generate features based
on a collection of lexicons obtained in previous works. What is more, we have observed
that authors using very similar vocabulary (the look-alikes) tend to have identical traits.
We exploit this fact by performing a two-step prediction procedure: classifying a new
item starts by finding a close neighbour; a full prediction model is used only when
no sufficiently close neighbour can be found.
      </p>
      <p>
In our approach, two groups of features are used: word-based and text-based. The
word-based features represent numbers of occurrences of lemmas obtained with
the multi-language TreeTagger [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The text-based features, computed as global statistics of the text,
include the following:
– length – average tweet length (number of characters),
– wordLength – average word length,
– urls – average number of URLs per tweet<sup>1</sup>,
– hashtags – number of hashtags,
– citations – number of citations (@username),
– capitals – fraction of capital letters,
– exclamations – number of exclamation marks,
– questions – number of question marks,
– emoticonsPos – number of positive emoticons (recognized by the regular
expression "[:;]\S*[\)DpP\]\*]"),
– emoticonsNeg – number of negative emoticons (recognized by the regular
expression ":\S*[\(/\\\|C]"),
– repeatedLetters – fraction of repeated letters,
– repeatedMarks – fraction of repeated exclamation and question marks,
– numbers – number of numerical expressions (recognized by the regular expression
"\d+([\.,]\d+)*"),
– errors – number of spelling errors (obtained using the multi-language
LanguageTool),
– yuleK – vocabulary size estimated using Yule’s K [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
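      <p>A few of the text-based features listed above can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments (which is written in R). The regular expressions are those from the list with their backslashes restored, which is an assumption about the original notation. The tokenizer, the URL and citation patterns, and the choice to compute the capitals fraction over letters only are further assumptions made for illustration. Yule's K is computed as 10^4 (sum_i i^2 V_i - N) / N^2, where V_i is the number of word types occurring exactly i times and N is the number of tokens.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# Patterns from the feature list, backslashes restored (assumed notation).
EMOTICON_POS = re.compile(r"[:;]\S*[\)DpP\]\*]")
EMOTICON_NEG = re.compile(r":\S*[\(/\\\|C]")
NUMBER = re.compile(r"\d+([\.,]\d+)*")
# Hypothetical helper patterns, not taken from the paper.
URL = re.compile(r"https?://\S+")
CITATION = re.compile(r"@\w+")
WORD = re.compile(r"[^\W\d_]+")  # runs of letters


def count_per_tweet(pattern, tweets):
    """Average number of pattern matches per tweet."""
    total = sum(1 for t in tweets for _ in pattern.finditer(t))
    return total / len(tweets)


def yule_k(words):
    """Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2."""
    n = len(words)
    if n == 0:
        return 0.0
    freq_of_freq = Counter(Counter(words).values())
    s2 = sum(i * i * v for i, v in freq_of_freq.items())
    return 1e4 * (s2 - n) / (n * n)


def text_features(tweets):
    """Subset of the text-based features for one user (non-empty tweet list)."""
    text = " ".join(tweets)
    words = [w.lower() for w in WORD.findall(text)]
    letters = [c for c in text if c.isalpha()]
    return {
        "length": sum(len(t) for t in tweets) / len(tweets),
        "wordLength": sum(len(w) for w in words) / max(len(words), 1),
        "urls": count_per_tweet(URL, tweets),
        "hashtags": sum(t.count("#") for t in tweets) / len(tweets),
        "citations": count_per_tweet(CITATION, tweets),
        "capitals": sum(c.isupper() for c in letters) / max(len(letters), 1),
        "exclamations": sum(t.count("!") for t in tweets) / len(tweets),
        "questions": sum(t.count("?") for t in tweets) / len(tweets),
        "emoticonsPos": count_per_tweet(EMOTICON_POS, tweets),
        "emoticonsNeg": count_per_tweet(EMOTICON_NEG, tweets),
        "numbers": count_per_tweet(NUMBER, tweets),
        "yuleK": yule_k(words),
    }
```
]]></preformat>
      <p>For instance, text_features(["Hello World :)", "I have 2 cats!"]) yields one feature value per name for that user. The features errors, repeatedLetters and repeatedMarks are omitted here because their exact definitions are not given above.</p>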
      <p>
        To improve the predictions, we have also taken into account previous research on
text-based prediction of sentiment, emotions, etc. by including the following lexical features:
– for all languages: SSPositive/SSNegative – positive/negative sentiment score
of a collection of tweets, using the SentiStrength tool [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
– for English:
  – NRCEmotion* – numerical values of 10 emotion associations (averaged per
word<sup>2</sup>), using the NRC Word-Emotion Association Lexicon [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
  – NRCTwitterSentiment – sentiment value, using the NRC Twitter Sentiment
Lexicon [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
  – NRCHashtagSentiment(140) – sentiment value, using the NRC Hashtag
Emotion Lexicon and the Sentiment140 lexicon [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
  – LexiconAFINN – sentiment value, using the AFINN Lexicon [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
  – MRC* – features from the MRC database [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: familiarity, concreteness, imagery,
meaningfulness (two measures) and age of acquisition,
  – WWBPLexAge and WWBPLexGender – usage of age- and gender-dependent
lexicons from the World Well-Being Project (WWBP) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
  – WWBPAll* – correlations with author features (gender, age and personality),
using data from WWBP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
– for Spanish: SELEmotion* – numerical value of one of 6 emotions (joy, anger,
fear, disgust, surprise, sadness), using the Spanish Emotion Lexicon [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
– for Dutch: NLEmotion* – numerical values of valence, arousal, dominance and
age of acquisition, using the lexicon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
In total, we have obtained 56 features. Unfortunately, many of them provide
information only for English texts.
      </p>
      <p>
<sup>1</sup> All subsequent numbers are also averaged per tweet, unless noted otherwise.
<sup>2</sup> All subsequent values are also averaged per word.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Prediction</title>
      <p>To predict the traits (gender, age and personality) of Twitter users we apply a simple
two-step procedure. The idea is to start by exploring close similarities between writings,
and then to try to discover more complex dependencies. More specifically, to predict traits
for a new user, we first find the most similar user in the training data. If the similarity
is sufficiently close, we assign the traits of the found user to the new user. Otherwise, we
use an advanced classification model to predict the traits. This approach is motivated by
the fact that among a large number of tweets one can easily find messages written by the
same user. Moreover, one person may send tweets from different Twitter
accounts. Such multiple Twitter accounts, which allow users to boost their presence on
the web, are becoming more and more popular. Finally, a very similar vocabulary can be
shared by certain groups of users who also have similar characteristics.</p>
      <p>[Figure: distances between concordant users and between discordant users.]</p>
      <p>
1. Finding similar users in training data. Here, we use two approaches, depending
on the language of the tweets.
– For English, we build a classification model in which the identifier of a group of
concordant users (those having the same traits) is used as the class variable. As the
classification model, we use random forests [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], built on all available features. If the
maximum of the predicted probabilities for the new user is greater than a certain
threshold p<sub>min</sub>, we assign the traits of the corresponding group to the new user.
– For the other languages, we simply find the nearest neighbour of the new user in the
training data. To determine the nearest neighbour, we use the Euclidean distance and
all available features. If the distance is less than a certain threshold d<sub>max</sub>, we
assign the traits of the nearest neighbour to the new user.
2. Prediction for dissimilar users. If no similar users are found in the training data, i.e.
the predicted probability of the best group is smaller than p<sub>min</sub> (for English) or the
distance to the nearest neighbour is greater than d<sub>max</sub> (for the other languages), we
apply the random forest method to predict each trait separately. We use all available
features except the word-based ones. For gender and age, decision trees are taken as base
learners, whereas for personality traits regression trees are used. Other classification
algorithms (e.g. logistic regression) have also been tested, but they yielded
poorer results.
      </p>
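      <p>The nearest-neighbour variant of the two-step procedure can be sketched as follows. This is a minimal illustration in Python, not the R implementation used in the experiments. The function names and the fallback argument are hypothetical. In the procedure described above, the fallback would be a random forest trained on the non-word-based features, whereas here it is an arbitrary callable supplied by the caller.</p>
      <preformat><![CDATA[
```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_step_predict(x, train_X, train_traits, d_max, fallback):
    """Step 1: copy the traits of the nearest training user if it is closer
    than d_max. Step 2: otherwise defer to the fallback model (hypothetical
    stand-in for the full prediction model)."""
    dists = [euclidean(x, row) for row in train_X]
    best = min(range(len(dists)), key=dists.__getitem__)
    if dists[best] < d_max:
        return train_traits[best], "neighbour"
    return fallback(x), "model"
```
]]></preformat>
      <p>With d_max set to math.inf the first step always succeeds, and with d_max set to 0 the fallback model is always used, which corresponds to the two extreme settings of the threshold.</p>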
      <p>Observe that the above procedure depends on the choice of the threshold. If p<sub>min</sub> is
sufficiently small (for English) or d<sub>max</sub> sufficiently large (for the other languages), every new
user is matched to a similar user from the training data and therefore only the first step of
the above procedure is run. In the opposite case, the full prediction model is always
employed. To calibrate the threshold, we randomly split the data (30 times) into training and
testing parts and then compute the averaged accuracy (gender and age) and the root-mean-square
error, RMSE (personality traits), for different values of the threshold. Figure 2 shows the results for
English and Spanish.</p>
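      <p>The calibration loop can be sketched in Python as below, under simplifying assumptions that are ours and not part of the experiments: the data are (feature vector, label) pairs, the share of test data in each split is 25%, and the full prediction model is replaced by a majority-class stand-in so that the example stays self-contained. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
import random


def nn_or_default(x, train, d_max, default):
    """Label of the nearest training item if closer than d_max, otherwise
    a default label (stand-in for the full prediction model)."""
    dist, label = min((math.dist(x, tx), ty) for tx, ty in train)
    return label if dist < d_max else default


def calibrate(data, thresholds, n_splits=30, test_share=0.25, seed=0):
    """Average test accuracy for each candidate threshold over random splits."""
    rng = random.Random(seed)
    totals = {t: 0.0 for t in thresholds}
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_share))
        test, train = shuffled[:n_test], shuffled[n_test:]
        labels = [y for _, y in train]
        # Majority class of the training part; sorted() makes ties deterministic.
        default = max(sorted(set(labels)), key=labels.count)
        for t in thresholds:
            hits = sum(nn_or_default(x, train, t, default) == y for x, y in test)
            totals[t] += hits / len(test)
    return {t: totals[t] / n_splits for t in thresholds}
```
]]></preformat>
      <p>The threshold with the highest averaged accuracy would then be selected. For the personality traits the same loop would accumulate RMSE instead of accuracy and select the minimum.</p>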
      <p>There is a clear optimum (maximum accuracy or minimum RMSE) for a certain value
of the threshold. Note that for English the optimal value is common to all traits and is
p<sub>min</sub> ≈ 0.12. For Spanish, the optimum is at d<sub>max</sub> ≈ 90 for gender and personality traits,
whereas in the case of age it is better to use the nearest neighbour approach for all users. For the
remaining languages we always apply the nearest neighbour method (i.e. set d<sub>max</sub> = ∞), as
the training sets are too small to build complex models.</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We have examined how the prediction procedure presented in the Prediction section works with the
set of features described in the Introduction. As measures of performance we use accuracy
(gender and age) and RMSE (personality traits). We randomly split the data into training and
testing parts in the following proportions: 75% for training and 25% for testing (for
English and Spanish). For Italian and Dutch, due to the small amount of data, we take only
one observation for testing and the rest for training. The above procedure is repeated 30
times and the results are averaged over all runs. The classification procedure is implemented
in the R system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] using libraries: randomForest [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], FNN [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and class [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>[Figure 2: prediction performance as a function of the threshold; recovered labels: panels for Age, panel (b), axes "threshold" and "Accuracy", baseline values 0.43 and 0.35.]</p>
      <sec id="sec-3-1">
        <p>Results of our experiments (for an optimal value of the threshold) are shown in Table
1. Numbers in brackets correspond to a baseline, which is the majority class share (for
classification) and the mean value (for regression), calculated on the training data. The third column
includes the joint accuracy for gender and age, whereas the last column contains RMSE
averaged over the 5 personality traits. First, note that all the results exceed the baseline. Gender
and age identification are successful: we obtain an accuracy of 77%-90% for
gender and 69%-75% for age. Moreover, simultaneous prediction of these two traits is also
possible: the accuracy is about 3 times larger than the baseline. Personality assessment
is a much more challenging task. Our experiments indicate that it is difficult to obtain
an error significantly below the baseline.</p>
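        <p>The two baselines mentioned for Table 1 can be made concrete with a short Python sketch. It assumes one plausible reading of the description, namely that the majority class and the mean value are determined on the training data and then scored on the test data. The function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


def mean_baseline_rmse(train_values, test_values):
    """RMSE of always predicting the mean of the training values."""
    mean = sum(train_values) / len(train_values)
    squared = sum((y - mean) ** 2 for y in test_values)
    return math.sqrt(squared / len(test_values))
```
]]></preformat>
        <p>A model is useful for classification only if its accuracy is above the first value, and for regression only if its RMSE is below the second one.</p>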
        <p>Finally, we assess the predictive power of the features using a variable importance
measure based on random forests. The measure is the average decrease of node
impurity (the Gini impurity index for classification and the residual sum of squares for regression).
The average is taken over all splitting nodes and over all trees used to construct the
ensemble. The measure shows the usefulness of a given feature when a
random forest is used as the prediction tool. Figure 3 shows the top 20 features for
prediction of selected traits for English. The plot clearly shows that features based on
words collected by the World Well-Being Project (WWBP) are among the most useful
for prediction. Moreover, it is interesting that simple style-based features, such as message
length and the numbers of exclamation marks or citations, seem to be relevant for age
identification.</p>
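        <p>The impurity-based relevance measure can be illustrated for the classification case with a small Python sketch. It follows the verbal description above (the decrease of Gini impurity at a split, averaged over the splitting nodes that use a given feature). It is not the code of the randomForest library, whose exact normalisation may differ. The representation of a split as a (feature, parent labels, left labels, right labels) tuple is an assumption made for illustration.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def mean_decrease_by_feature(splits):
    """Average impurity decrease per feature over all recorded splits.
    Each split is a (feature, parent, left, right) tuple (assumed format)."""
    sums, counts = Counter(), Counter()
    for feature, parent, left, right in splits:
        sums[feature] += impurity_decrease(parent, left, right)
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}
```
]]></preformat>
        <p>A split that separates the classes perfectly gives the largest possible decrease, while a split that leaves the class proportions unchanged gives zero. For regression, the residual sum of squares would take the place of the Gini impurity.</p>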
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this study we present a two-stage procedure for author profiling, which first exploits
language similarities between users and then aims to discover more complex
dependencies. The method is motivated by the fact that authors using very similar language tend
to have identical traits. Interestingly, it turns out that the combination of these two steps
usually outperforms using either step separately. Our approach is based on both
word-based and text-based features. While we obtain successful results for gender and
age prediction, personality identification seems to be much more challenging: the
error is only slightly below the baseline. The assessment based on random forests shows
the high relevance of features using lexica from previous works. The results of the experiments
suggest many possibilities for future work. First, in our method separate classification models
are built for each trait; it would be worthwhile to explore dependencies between the traits to
improve the prediction performance. Secondly, in order to significantly improve
personality identification, it seems necessary to look for new features. Finally, we believe
that the advantages of our two-stage procedure could be seen more clearly on a
larger corpus of tweets.</p>
      <p>[Figure 3: top 20 features for prediction of selected traits for English; horizontal axes: mean decrease of Gini index (classification panels) and mean decrease of residual error (regression panels, including Stability).]</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This study was supported by a research fellowship within the “Information technologies:
research and their interdisciplinary applications” agreement number
POKL.04.01.01-00</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kakadet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mount</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>FNN: Fast Nearest Neighbor Search Algorithms and Applications</article-title>
          (manual) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kiritchenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          :
          <article-title>Sentiment Analysis of Short Informal Texts</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>50</volume>
          ,
          <fpage>723</fpage>
          -
          <lpage>762</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Liaw</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiener</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Classification and Regression by randomForest</article-title>
          .
          <source>R News</source>
          <volume>2</volume>
          ,
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing a Word-Emotion Association Lexicon</article-title>
          .
          <source>Computational Intelligence</source>
          <volume>29</volume>
          (
          <issue>3</issue>
          ),
          <fpage>436</fpage>
          -
          <lpage>465</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Moors</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Houwer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wanmaker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Schie</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Harmelen</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Schryver</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Winne</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brysbaert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Norms of valence, arousal, dominance, and age of acquisition for 4,300 Dutch words</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>169</fpage>
          -
          <lpage>77</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.Å.</given-names>
          </string-name>
          :
          <article-title>A new ANEW: evaluation of a word list for sentiment analysis in microblogs</article-title>
          .
          <source>In: Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages</source>
          . vol.
          <volume>718</volume>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>98</lpage>
          . CEUR-WS.org (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <collab>R Core Team</collab>
          :
          <article-title>R: A Language and Environment for Statistical Computing</article-title>
          .
          <source>Tech. rep., R Foundation for Statistical Computing</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          . In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.)
          <source>CLEF 2015 Labs and Workshops, Notebook Papers</source>
          . CEUR-WS.org (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sap</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eichstaedt</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stillwell</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          :
          <article-title>Developing Age and Gender Predictive Lexica over Social Media</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1146</fpage>
          -
          <lpage>1151</lpage>
          . Association for Computational Linguistics (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Improvements In Part-of-Speech Tagging With an Application To German</article-title>
          .
          <source>In: Proceedings of the ACL SIGDAT-Workshop</source>
          . pp.
          <fpage>47</fpage>
          -
          <lpage>50</lpage>
          . Association for Computational Linguistics (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eichstaedt</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dziurzynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramones</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stillwell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seligman</surname>
            ,
            <given-names>M.E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          :
          <article-title>Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach</article-title>
          .
          <source>PLOS ONE</source>
          <volume>8</volume>
          (
          <issue>9</issue>
          ) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jiménez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viveros-Jiménez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro-Sánchez</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velásquez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Díaz-Rangel</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suárez-Guerra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Treviño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Empirical study of machine learning based approach for opinion mining in tweets</article-title>
          .
          <source>In: Proceedings of the 11th Mexican international conference on Advances in Artificial Intelligence (MICAI'12). Lecture Notes in Computer Science</source>
          , Springer-Verlag (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Thelwall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paltoglou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Sentiment Strength Detection in Short Informal Text</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          <volume>61</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2544</fpage>
          -
          <lpage>2558</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Venables</surname>
            ,
            <given-names>W.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ripley</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          : Modern Applied Statistics with S. Springer-Verlag (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>MRC psycholinguistic database: Machine-usable dictionary, version 2.00</article-title>
          .
          <source>Behavior Research Methods, Instruments, &amp; Computers</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <fpage>6</fpage>
          -
          <lpage>10</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Yule</surname>
            ,
            <given-names>G.U.</given-names>
          </string-name>
          :
          <article-title>The Statistical Study of Literary Vocabulary</article-title>
          . Cambridge University Press (
          <year>1944</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>