<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using TF-IDF n-gram and Word Embedding Cluster Ensembles for Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Adam Poulston</institution>
          ,
          <addr-line>Zeerak Waseem, and Mark Stevenson</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science University of Sheffield</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users labeled with their gender and the specific variant of their language used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a Logistic Regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived from an additional, external corpus of tweets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Author profiling is the task of determining the characteristics of the individual who
wrote a document. Many different characteristics can be determined (e.g. personal
characteristics such as gender, age, personality [19] and socioeconomic indicators [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref5">5,13,14,15</xref>
        ])
across a variety of media (e.g. written essays, books, blogs and other social media).
Despite their potential ethical concerns, author profiling techniques can be a valuable
component in various applications, such as bias reduction in predictive models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
language-variant adaption in part-of-speech taggers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In this paper, we present our approach to the 2017 edition of the PAN Author
Profiling shared task [
        <xref ref-type="bibr" rid="ref10 ref11">10,11,16</xref>
        ]. A dataset was provided consisting of Twitter users across
four languages and their variants. Each user was labeled with a binary gender label
(male/female) and the particular variant of their language (e.g. Brazilian vs European
Portuguese). The dataset was balanced by both gender and language variant. Given an
unseen user (and their native language), the task is to determine their gender and
the language variant they use.
      </p>
      <p>
        To predict gender and language variant, we applied an ensemble of probabilistic
machine learning classifiers (described in detail in Section 2). First, an external Twitter
corpus was acquired, and tweets geo-located within the countries covered by the task's
languages were extracted (except for the Arabic language variants). This corpus was
divided into individual languages (Portuguese, English and Spanish) and used to
derive Word2Vec word embeddings [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ] for each language. Then, each set of language-specific
word embeddings was clustered using K-Means to derive a set of word-to-cluster
mappings, which can be thought of as roughly analogous to topics in a topic
model. The normalised frequency of each word cluster across a user’s tweets was used
to train a Gaussian Process classifier. Second, a Logistic Regression classifier was
trained using TF-IDF transformed unigram and bigram frequencies. The two classifiers
were combined in an ensemble by averaging the predicted probabilities for
each sample to determine the label.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Approach</title>
      <p>
        Our approach combines two probabilistic classifiers trained on distinct feature sets in
an ensemble to predict gender and language variant: a
Logistic Regression classifier trained on TF-IDF n-grams (Section 2.1) and a Gaussian
Process classifier trained on word cluster frequencies (Section 2.2). For each unseen
document, the class probabilities from both classifiers are averaged, and the class with
the highest average probability is taken as the prediction. Models were trained using the
implementations found in scikit-learn [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] unless stated otherwise.
      </p>
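      <p>The averaging step described above can be illustrated with a minimal sketch. It assumes an unweighted mean over per-class probabilities; the function names and the probability values are hypothetical and are not taken from our submitted system.</p>
      <preformat preformat-type="code">
```python
from typing import Dict, List


def average_probabilities(prob_sets: List[Dict[str, float]]) -> Dict[str, float]:
    """Average class probabilities from several classifiers (unweighted mean)."""
    labels = prob_sets[0].keys()
    return {label: sum(p[label] for p in prob_sets) / len(prob_sets) for label in labels}


def ensemble_predict(prob_sets: List[Dict[str, float]]) -> str:
    """Return the label with the highest averaged probability."""
    averaged = average_probabilities(prob_sets)
    return max(averaged, key=averaged.get)


# Hypothetical outputs for one unseen user (illustrative numbers only).
lr_probs = {"female": 0.40, "male": 0.60}  # Logistic Regression on TF-IDF n-grams
gp_probs = {"female": 0.75, "male": 0.25}  # Gaussian Process on cluster frequencies

averaged = average_probabilities([lr_probs, gp_probs])
prediction = ensemble_predict([lr_probs, gp_probs])

# With a single classifier, the average is just that classifier's output.
single = ensemble_predict([lr_probs])
```
      </preformat>
      <p>In this toy case the two classifiers disagree, and the more confident one decides the averaged prediction. Passing a single set of probabilities reduces the ensemble to that one classifier.</p>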
      <p>For Arabic data, only the Logistic Regression classifier is applied, as the volume
of geo-located Arabic tweets collected was too low to allow for training of robust
Word2Vec models for use with the Gaussian Process classifier.</p>
      <sec id="sec-2-1">
        <title>2.1 Logistic Regression classifier with TF-IDF n-grams</title>
        <p>
          Word unigram and bigram features were extracted for each training document. The
text was tokenised using a Twitter-aware tokeniser [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]; no additional steps were taken
to deal with the extra complexities of Arabic text. No stop-word list was used
while deriving n-gram features; instead, tokens that appeared in more than 90% of the
documents were removed, which removes n-grams common across a
language’s variants as well as stop words.
        </p>
        <p>TF-IDF weighting was applied to down-weight n-grams that are common across the
documents and to assign a higher weight to n-grams that are rare.</p>
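        <p>The feature extraction can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than our exact configuration: whitespace splitting stands in for the Twitter-aware tokeniser, the smoothed IDF formula is one common variant, length normalisation is omitted, and the example documents are invented. In scikit-learn, a roughly comparable setup would be a TfidfVectorizer with ngram_range=(1, 2) and max_df=0.9.</p>
        <preformat preformat-type="code">
```python
import math
from collections import Counter
from typing import Dict, List


def word_ngrams(tokens: List[str]) -> List[str]:
    """Return word unigrams followed by word bigrams."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def tfidf_features(docs: List[str], max_df: float = 0.9) -> List[Dict[str, float]]:
    """Sparse TF-IDF weights per document, dropping n-grams found in more than max_df of documents."""
    # Whitespace splitting stands in for a Twitter-aware tokeniser (assumption).
    counts = [Counter(word_ngrams(doc.lower().split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter(term for c in counts for term in c)
    kept = {t for t, df in doc_freq.items() if max_df * n_docs >= df}
    features = []
    for c in counts:
        features.append({
            # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1.
            t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, tf in c.items() if t in kept
        })
    return features


# Invented example documents.
docs = [
    "the match was brilliant mate",
    "the game was awesome dude",
    "the weather was lovely today",
]
features = tfidf_features(docs)
```
        </preformat>
        <p>Here "the" and "was" occur in every document and are therefore removed by the document-frequency threshold, while variant-specific terms such as "mate" are retained.</p>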
        <p>
          A Logistic Regression classifier was trained for each language using the n-gram
features. Logistic Regression was chosen for use with the n-gram features because it
has been shown to perform well on similar high-dimensional classification tasks, and
produces probabilistic predictions [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Gaussian Process classifier with word embedding clusters</title>
        <p>We obtained the data for our word embedding clusters from a Twitter Firehose sample
collected throughout 2015 (the Twitter Firehose has since been discontinued and can no longer be accessed).
We only used tweets that were geo-located in the specific
language regions determined by the shared task (see Table 1).</p>
        <p>Some language variants were less frequent in the resulting datasets than others; for
instance, we collected very few tweets from Ireland compared to the U.S.A.
Downsampling was used to avoid over-representation of the more prevalent language variants:
data for the language variant with the largest volume of documents was reduced so that
it contained no more than 10 times the number of tweets of the smallest language variant.</p>
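        <p>A minimal sketch of this downsampling rule is given below. It assumes uniform random sampling without replacement and applies the 10-to-1 cap to every variant, which has the same effect on the largest one; the variant names, corpus sizes, and function name are hypothetical.</p>
        <preformat preformat-type="code">
```python
import random
from typing import Dict, List


def downsample_variants(tweets_by_variant: Dict[str, List[str]],
                        max_ratio: int = 10,
                        seed: int = 0) -> Dict[str, List[str]]:
    """Cap each variant at max_ratio times the size of the smallest variant."""
    rng = random.Random(seed)
    smallest = min(len(tweets) for tweets in tweets_by_variant.values())
    cap = max_ratio * smallest
    sampled = {}
    for variant, tweets in tweets_by_variant.items():
        if len(tweets) > cap:
            # Uniform sampling without replacement (assumption).
            sampled[variant] = rng.sample(tweets, cap)
        else:
            sampled[variant] = list(tweets)
    return sampled


# Hypothetical corpus sizes, for illustration only.
corpus = {
    "ireland": [f"ie tweet {i}" for i in range(20)],
    "great britain": [f"gb tweet {i}" for i in range(150)],
    "united states": [f"us tweet {i}" for i in range(1000)],
}
balanced = downsample_variants(corpus)
sizes = {variant: len(tweets) for variant, tweets in balanced.items()}
```
        </preformat>
        <p>In this example only the largest variant exceeds the cap of 200 tweets and is reduced; the two smaller variants are left unchanged.</p>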
        <p>
          Word embeddings for each language dataset (F<sub>en</sub>, F<sub>es</sub>, and F<sub>pt</sub>) were trained using
the Word2Vec [
          <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
          ] implementation in gensim [18] with Continuous Bag of Words
(CBOW), negative sampling, 200 dimensions, and a window size of 10.
        </p>
        <p>
          We applied K-Means clustering [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to the word embeddings to derive a set of 100
clusters for each language, in which each word is assigned a cluster based on its nearest
cluster in the embedding space. We then computed the frequency distribution of the
clusters for every training document, and used them as features to train a Gaussian
Process classifier with an RBF kernel [17].
        </p>
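        <p>The mapping from words to clusters and the resulting document features can be sketched as follows. The toy embeddings and centroids are invented and two-dimensional (our models use 200 dimensions and 100 clusters), and ignoring out-of-vocabulary tokens is an assumption of the sketch. Feature vectors of this kind could then be passed to a Gaussian Process classifier, for example the GaussianProcessClassifier in scikit-learn with an RBF kernel.</p>
        <preformat preformat-type="code">
```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def nearest_cluster(vector: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Index of the centroid closest to vector (Euclidean distance)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))


def cluster_frequencies(tokens: List[str],
                        word_to_cluster: Dict[str, int],
                        n_clusters: int) -> List[float]:
    """Normalised frequency of each cluster over the in-vocabulary tokens."""
    # Tokens without an embedding are skipped (assumption).
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts[k] / total for k in range(n_clusters)]


# Toy 2-dimensional embeddings and 2 centroids, invented for illustration.
embeddings = {"goal": (0.9, 0.1), "match": (0.8, 0.2), "vote": (0.1, 0.9)}
centroids = [(1.0, 0.0), (0.0, 1.0)]
word_to_cluster = {w: nearest_cluster(v, centroids) for w, v in embeddings.items()}

features = cluster_frequencies("great goal in the match before the vote".split(),
                               word_to_cluster, n_clusters=2)
```
        </preformat>
        <p>Two of the three in-vocabulary tokens fall in the first cluster, so the document is represented by frequencies of two thirds and one third.</p>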
        <p>
          Similar word embedding clusters have been applied with Gaussian Processes to
perform other author profiling tasks such as socio-economic status detection [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ];
furthermore, the derived clusters are similar to topics derived in a topic model, in that they
identify semantically similar groups of words in documents, which we found to perform
well in a similar task [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>In Table 3, we see that our ensemble performs quite well for identifying language
variant or gender individually. For joint prediction our ensemble performs less well,
likely due to errors in either gender or language variant prediction propagating through
to incorrect joint predictions. Of the three languages the ensemble was applied to, the
best performance was observed for Portuguese and the worst for English. Broad topics
of interest appear to be effective for the gender prediction problem while individual
terms that are unique to specific language variants are more discriminating for language
variant prediction.</p>
      <p>
        Similar to our results in a previous PAN Author Profiling shared task
entry [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], in which LDA topic models were able to improve predictive performance over
word n-grams, word embedding clusters improved predictive accuracy for gender
classification. For the language variant differentiation task, however, introducing the word embedding
clusters reduced accuracy relative to earlier runs.
      </p>
      <p>Under our current clustering scheme, each term was assumed to be as
representative of its cluster as every other term; in practice, though, certain terms were closer
to the centroid in embedding space than others. Prior to submission we had begun
experimenting with weighting terms based on their proximity to their closest centroid,
and our initial findings were promising. In future work we would like to investigate the
effect of weighting terms in more detail.
</p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>In this notebook, we have shown that reasonable results can be achieved by employing
an ensemble of classifiers and utilising clusters of word embeddings. We propose
that our approach can be improved by weighting the terms in each word embedding cluster by
their distance to the cluster centroid.</p>
      <p>16. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at
PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato, L., Ferro,
N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation
Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2017)</p>
      <p>17. Rasmussen, C.E., Williams, C.K.: Gaussian processes for machine learning, vol. 1. MIT
Press, Cambridge (2006)</p>
      <p>18. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In:
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp.
45–50. ELRA, Valletta, Malta (May 2010)</p>
      <p>19. Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal,
M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E.P., Ungar, L.H.: Personality,
gender, and age in the language of social media: The open-vocabulary approach. PLoS
ONE 8(9), e73791 (Sep 2013)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blodgett</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Demographic dialectal variation in social media: A case study of African-American English</article-title>
          . pp.
          <fpage>1119</fpage>
          -
          <lpage>1130</lpage>
          (
          <year>November 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Culotta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reducing sampling bias in social media data for county health inference</article-title>
          .
          <source>In: Joint Statistical Meetings Proceedings</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Freedman</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Statistical models: theory and practice</article-title>
          . Cambridge University Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mills</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heilman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yogatama</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flanigan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>Part-of-speech tagging for Twitter: annotation, features, and experiments</article-title>
          .
          <source>Human Language Technologies</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aletras</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geyti</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          :
          <article-title>Inferring the socioeconomic status of social media users based on behaviour and language</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>MacQueen</surname>
          </string-name>
          , J., et al.:
          <article-title>Some methods for classification and analysis of multivariate observations</article-title>
          .
          <source>In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>297</lpage>
          . Oakland, CA, USA. (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer, Berlin Heidelberg New York (
          <year>Sep 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation</article-title>
          . In: Jones, G., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction.
          <source>8th International Conference of the CLEF Initiative (CLEF 17)</source>
          . Springer, Berlin Heidelberg New York (
          <year>Sep 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Poulston</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Topic models and n-gram language models for author profiling: Notebook for PAN at CLEF 2015</article-title>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Poulston</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>User profiling with geo-located posts and demographic data</article-title>
          . pp.
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          (
          <year>November 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Preoţiuc-Pietro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aletras</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>An analysis of the user occupational class through Twitter content</article-title>
          .
          <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          : Long Papers) pp.
          <fpage>1754</fpage>
          -
          <lpage>1764</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Preoţiuc-Pietro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volkova</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bachrach</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aletras</surname>
          </string-name>
          , N.:
          <article-title>Studying user income through language, behaviour and affect in social media</article-title>
          .
          <source>PLoS ONE</source>
          <volume>10</volume>
          (
          <issue>9</issue>
          ), e0138717
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>