<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Custom Document Embeddings Via the Centroids Method: Gender classification in an Author Profiling task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Lopez-Santillan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Carlos Gonzalez-Gurrola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graciela Ramírez-Alonso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Ingeniería, Universidad Autonoma de Chihuahua</institution>
          ,
          <addr-line>Circuito No. 1, Nuevo Campus Universitario, Apdo. postal 1552, Chihuahua, Chih., Mexico. C.P. 31240</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>According to Smart Insights<sup>1</sup>, of the world's 7.5 billion people, 4 billion are Internet users, and of those a remarkable 3.19 billion are active social media users. According to a report by the U.S. Internet Crime Complaint Center, identity theft, extortion, and harassment or threats of violence were among the most frequently reported cyber-crime events in 2016<sup>2</sup>. The Author Profiling (AP) task might help counteract these phenomena by profiling cyber-criminals. AP consists of detecting personal traits of authors (e.g. gender, age, personality) from their texts. In this report we describe a method to address the AP problem, one of the three shared tasks evaluated as an exercise in digital text forensics at PAN 2018, within the CLEF conference (Conference and Labs of the Evaluation Forum). Our approach blends Word Embeddings (WE) and the Centroids Method to produce Document Embeddings (DE) that deliver competitive results when predicting the gender of authors over a dataset of Twitter posts. Specifically, on the testing dataset our proposal achieves an accuracy of 0.78 for English-language users, and on average (over English, Spanish and Arabic users) it reaches an accuracy of 0.77.</p>
      </abstract>
      <kwd-group>
        <kwd>Author Profiling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Author Profiling (AP) is the task of discovering features such as gender, age and
psychological traits of persons by analyzing their use of language. A
methodology that obtains personal profiles of individuals with high accuracy would be useful
in areas such as customer service, the care of neurological disorders (e.g. autism)
and cyber-crime, among others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The ability to accurately profile the
author(s) of plagiarism, identity theft, Internet sexual predatory activities or even
terrorist attacks has become a matter of critical importance. On the other hand,
influential companies such as Amazon, Netflix, Google and Apple, among others, use
several Machine Learning (ML) algorithms to address and create new demand
among their clients and to attract new ones, based on profiling their customers
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
        <fn id="fn1"><label>1</label><p>https://www.smartinsights.com/social-media-marketing/social-mediastrategy/new-global-social-media-research/</p></fn>
        <fn id="fn2"><label>2</label><p>https://www.statista.com/statistics/184083/commonly-reported-types-of-cybercrime/</p></fn>
      </p>
      <p>
        A fair number of works have attempted to solve this problem, and several shared
task events are held each year to test the accuracy of new algorithms. These works
report competitive results when predicting the gender or age group of
individuals [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To standardize evaluation and provide a common framework, international
conferences in the field of Natural Language Processing (NLP) are organized
regularly. Among those events, the PAN evaluation lab on digital text forensics,
organized within the CLEF Initiative (Conference and Labs of the Evaluation
Forum), is held each year. The 2018 edition<sup>3</sup> posed three shared
tasks: Author Identification, Author Obfuscation
and Author Profiling [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Machine Learning (ML) is a field of computer science that has regained strength
in recent years with the advent of Deep Learning (DL), a subfield of ML that
has achieved strong results in image and speech recognition, computer
vision and, more recently, NLP. NLP deals with the problem of how computers can
understand human natural language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. AP may be viewed as a sub-task of
NLP, and it should be approached as such.
      </p>
      <p>
        Departing from traditional NLP strategies, we propose a different approach.
Word embeddings (WE) are a DL technique that projects natural
language words into an n-dimensional space as vectors. The distance between vectors
represents a measure of similarity within a semantic context. A WE algorithm proposed
by Mikolov et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] called Word2Vec is currently used in multiple NLP tasks
with amazing results. Word2Vec uses a Bag of Words approach, but it retains
the order of the tokens within the original text. This method provides
additional semantic information, richer in inner structure components, which might
increase the accuracy of ML algorithms to predict personal traits in people.
      </p>
      <p>
        Although WEs deliver state-of-the-art results in NLP tasks such as text
classification or language translation, more difficult assignments like Sentiment Analysis
(SA) (which tries to tell positive from negative user opinions) or AP do
not benefit in the same way. WEs capture both syntactic and semantic information,
although the latter is captured with less sensitivity. To vectorize whole
documents, the literature suggests other techniques, such as the Centroids Method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
This approach considers a document as the sum of its words; hence the WEs of
the words in a sentence, a paragraph or even a whole document are combined by
an aggregate function such as maximum, minimum, or a weighted average,
producing a single vector for the entire document with the same dimensionality as the
WEs of its words. This design delivers good results when training ML algorithms
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 https://pan.webis.de/clef18/pan18-web/author-profiling.html</title>
      <p>to learn the target of such documents, like topics or themes. For identifying the
gender, the age or the personality profile of the person behind such a text,
the Centroids Method is more limited.</p>
      <p>
        The fields of Computer Science (CS), Linguistics and Psychology must come
together in a cross-disciplinary strategy to tackle this problem. Some studies have
merged the power of computation and linguistics to identify words or
combinations of them; as a result, several large lexicons have been developed, with tokens of
statistical significance that can identify the gender, age or personality of authors
with high accuracy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Such exercises give confidence that words have
discriminative power to differentiate men from women, the young from the old,
or introverts from extroverts. We therefore hypothesize that WEs could lend their
potential to an aggregate strategy that produces document vectors as well.
Taking this into account, our approach to the AP task blends WEs in an
aggregate strategy to produce distributed representations of whole texts (DEs),
which we use to train our model.
      </p>
      <sec id="sec-2-1">
        <title>Related Work</title>
        <p>
          Age and gender are often associated with writing style:
style may change with age and is strongly associated with the gender of
the writer. The amount of text available from a single author for training is known
to affect the outcome of the classifiers. Even though larger texts are preferred,
short samples such as those found on social media platforms like Twitter might
be useful in the AP task. Several studies show that the accuracy of gender and
age prediction on short texts declines little compared with larger samples
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Gomez-Adorno et al. proposed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] an approach that joined the Doc2Vec
algorithm (a spin-off of Word2Vec focusing on sentences rather than words [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ])
and the use of a Neural Network (NN). Their hypothesis relies on the idea that
standardizing the nonstandard language expressions used by many authors could
yield better accuracy on the AP task. Moreover, detecting
emotions in social media posts could improve the accuracy of
trait prediction, as proposed by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], where the emotions detected in on-line
posts might help to predict gender accurately.
        </p>
        <p>
          The 2015 PAN AP shared task included personality as a feature to predict
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This endeavor is better addressed as a regression problem, since the
predicted value is a real number. That year, several strategies were chosen to address
the problem. For instance, the approach of Pervaz et al. focused on the stylistic
and thematic properties of the dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. On the other hand, Álvarez-Carmona
et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] chose Second Order Attributes (SOA) and Latent Semantic Analysis
(LSA) to enhance the discriminative power of their algorithms. Most of the teams
attained competitive results using state-of-the-art ML algorithms. The most
frequently used classifiers were Naive Bayes (NB), Support Vector Machine (SVM)
and Random Forest (RF) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          To properly approach the AP problem, it is essential to understand the
fundamental building blocks of language. The vast majority of languages currently spoken around
the world are built around tokens known as words. Even though syntax,
grammar and other language constructs may vary from one language to another,
basic elements such as words remain. Although Western languages
are based on similar Latin character sets, it is also possible to tokenize languages
like Arabic in order to obtain single tokens equivalent to words. All this
is relevant because works like the one published by Schwartz et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] have
established a statistically significant set of words for the English language that
can categorize the authors who frequently use these terms by gender,
age group and personality traits. They propose a mixture of Linguistics and
Computer Science algorithms to determine which words in the English language
are statistically significant enough to be used as discriminative features.
As implied in this study, the popularity of social media has produced huge
amounts of available data from all kinds of persons around the globe. These
enormous potential datasets allow the development of new ML algorithms, which might reveal
the intricacies of written language.
        </p>
        <p>In order to perform NLP tasks, words need to be treated as numeric entities,
not only as identifiers: their values must represent something
meaningful about the term and the context in which it is used. There are
several ways to represent words in ML/NLP tasks; One Hot
Encoding, Bag-Of-Words and Word Embeddings are three of the most used schemes.
Figure 1 demonstrates the basic structure of
these word-coding methods.</p>
        <p>
          For text classification, textual similarity or even translation, WEs
by themselves deliver extraordinary results. Tasks such as sentiment analysis
or author profiling require vectors for phrases or whole
documents. One of the most straightforward techniques for vectorizing whole
documents is the Centroids Method. As proposed by Kusner et al. in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
in order to perform accurate document classification, texts should be projected
into an n-dimensional space where a distance between them can be computed. A
document embedding or vector is generated by calculating a weighted average
of its WEs; vector distances (Euclidean, Canberra, etc.) between these embeddings
then allow for proper document classification.
        </p>
        <p>
          Recently, WEs have become the preferred method for NLP tasks, since they
deliver state-of-the-art results in tasks such as text classification, translation,
text generation or sentiment analysis (SA). Seyed et al. in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] tried to enhance
WEs in an SA task by adding useful information to the word vectors. Their idea
consists of calculating vectors for the Part of Speech (POS) elements within the
text. A POS is the grammatical category of a word, for instance noun, verb, adjective
or pronoun. Moreover, a lexicon vector is also computed and concatenated;
there are several lexicons, particularly for the English language, with proven
discriminative properties, as demonstrated by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Our approach is partially based on the aforementioned studies. We
differentiate our strategy by creating our own distributed representations for the
vocabulary and the POS tags. Also, a preliminary experiment showed that lexicons did
not improve accuracy when joined to the DEs, hence they were discarded. A more
in-depth explanation is presented in the next section.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Methodology</title>
        <p>
          The goal of the AP shared task at PAN 2018 was to classify subjects by gender
in three different languages: English, Spanish and Arabic. In our implementation
only an SVM classifier was employed, since it delivered the best results in
preliminary runs. The model was trained offline on the provided training dataset and then
uploaded to a Virtual Machine (TIRA) hosted on a server at Bauhaus
University in Weimar, Germany. This environment allows different models
(written in heterogeneous programming languages) to run on a common platform, thus making
the shared task easier to evaluate [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          The employed strategy is a combined approach which fuses a) WEs, b) the
Centroids Method to produce DEs and c) the tf-idf (term frequency-inverse
document frequency) weighting scheme. tf-idf is a statistical value computed for
each term in a document. It helps establish the importance of a term within a
corpus. The more often a word appears in a text, the higher its tf-idf value.
For terms with high frequency but little discriminative power (e.g. the, and), an
offsetting factor is applied. The tf-idf value is widely used in information retrieval
tasks [
          <xref ref-type="bibr" rid="ref15">15</xref>
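          ]. Equation (1) shows how to compute a tf-idf value within a corpus.
          <disp-formula><label>(1)</label><tex-math>\text{tf-idf}_{t,d} = (1 + \log \text{tf}_{t,d}) \cdot \log \frac{N}{\text{df}_t}</tex-math></disp-formula>
        </p>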
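        <p>As a minimal sketch of Equation (1), the snippet below computes tf-idf weights for a toy corpus, reading N as the number of documents and df as the number of documents that contain a term. It is not the authors' implementation: the function name, the toy tokens and the use of the natural logarithm are assumptions made for illustration, and each "document" stands for the full collection of posts of one author.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (1 + log tf) * log(N / df), following the form of Equation (1).
    The natural logarithm is an assumption; the paper does not state the base.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus: one token list per author.
authors = [
    ["the", "drink", "drink", "game"],
    ["the", "drink", "movie"],
    ["the", "movie", "movie", "love"],
]
weights = tf_idf(authors)
```
]]></preformat>
        <p>In this toy corpus the token "the" occurs in every document and therefore receives a weight of zero, while "drink" receives a different weight for each author who uses it.</p>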
        <p>A new vocabulary of WEs was built from the dataset using the Skip-gram
algorithm; then a tf-idf value was calculated for each term within the context
of each individual's collection of posts. For example, the tf-idf value of the
word "drink" differs among persons in the dataset (people use words
differently). Next, one DE was generated for each person in the dataset
by computing the weighted average (using the tf-idf value of each
word as its weight) of all WEs in that person's set of documents, as shown in formula (2). Finally, the
DEs were used to train the SVM classifier. The distribution of specimens in the
training dataset is depicted in Table 1.</p>
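        <p><disp-formula><label>(2)</label><tex-math>\text{DEs-WAvg} = \frac{\sum_{n=1}^{i} w_n \cdot \text{tf-idf}[w_n]}{\sum_{n=1}^{i} \text{tf-idf}[w_n]}</tex-math></disp-formula></p>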
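        <p>The weighted average in formula (2) can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name, the two-dimensional toy vectors and the handling of tokens without a vector or with zero weight are assumptions, and the sum is taken here over every token occurrence in the author's posts.</p>
        <preformat><![CDATA[
```python
def document_embedding(tokens, embeddings, weights):
    """tf-idf weighted centroid of the word vectors of one author."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        # Assumption: tokens without a vector or with zero weight are skipped.
        if tok not in embeddings or weights.get(tok, 0.0) == 0.0:
            continue
        w = weights[tok]
        for k, value in enumerate(embeddings[tok]):
            total[k] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no usable token: fall back to the zero vector
    return [value / weight_sum for value in total]


# Hypothetical 2-dimensional vectors; the paper uses 300 dimensions.
embeddings = {"drink": [1.0, 0.0], "game": [0.0, 1.0]}
weights = {"drink": 3.0, "game": 1.0}
de = document_embedding(["drink", "game"], embeddings, weights)
```
]]></preformat>
        <p>With these toy values the resulting vector lies three quarters of the way towards "drink", since that token carries three times the weight of "game".</p>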
        <p>A tokenization process designed for Twitter posts<sup>4</sup> was used to produce
the individual terms in the dataset. This procedure retrieves each term
in the dataset as a single entity. Using a tool designed to tokenize social
network text allowed us to obtain context-specific terms such as "bit.ly",
":-)" or "#TuesdayThoughts", which might be more discriminative than single
words. Figure 2 shows an analysis of the most frequent Twitter tokens
used in the training dataset.</p>
        <p>
          Inspired by the work in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we performed a POS tagging procedure on
the dataset for the English language. For this process we used the NLTK POS
tagger<sup>5</sup>, which uses a Greedy Averaged Perceptron (GAP) algorithm to
compute the POS tag of each word. Subsequently, each word in the dataset was
replaced by its POS tag and the same Skip-gram algorithm was applied to
generate embeddings for the POS labels. WEs from the dataset were enhanced by
concatenating their POS tag vectors. To produce the Document Embedding
(DE) for every individual in the dataset, a weighted average was also computed
over the enhanced WEs in the collection of posts of each individual. These
embeddings share the dimensionality of the enhanced vectors of single words
(300 + 20). Figure 3 shows the proposed model. For the English language the
POS vectors were attached to the original WEs before the weighted averaging.
For the Spanish and Arabic languages, no POS vectors were attached (DEs of
300 dimensions), because multilingual POS tagging tools did not deliver useful
labels for these languages. A translation approach to apply the POS tag strategy
to these languages did not deliver an advantageous trade-off between accuracy
and running time.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 https://github.com/dlatk/happierfuntokenizing</title>
    </sec>
    <sec id="sec-4">
      <title>5 https://www.nltk.org/book/ch05.html</title>
      <p>Table 2 shows the parameters used to train the WEs of words and their POS.
The criterion to choose the parameters was based on a randomized grid search.</p>
      <p>
        For the classification stage, a Support Vector Machine (SVM) algorithm was
chosen. Although not entirely new (introduced by Vapnik in the 1990s),
SVMs are currently acknowledged as one of the most used and effective
ML algorithms [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. A preliminary set of runs to find the best classifier for
the AP task showed that SVM was the best-performing option among other methods
such as Random Forest and Extra Trees. SVMs are useful in problems that are
hard to separate: by using the so-called kernel trick, they approximate a
higher-dimensional projection by computing variants of the dot product, which is both
accurate enough and computationally tractable [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. A full Grid Search (GS)
was performed to find the best parameters for the SVM; Table 3 shows the best
parameters found by the GS:
(C=10, degree=2, gamma=1, kernel='poly', coef0=0.0)
      </p>
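      <p>To illustrate the kernel trick with the reported settings (degree=2, gamma=1, coef0=0.0), the sketch below assumes the common polynomial kernel form (gamma * x.y + coef0) ** degree, which is the form documented by scikit-learn for kernel='poly'. The paper does not name the library used, so this form, the function names and the toy vectors are assumptions. The example checks that the kernel value computed in the original space equals a plain dot product in an explicit space of all pairwise feature products.</p>
      <preformat><![CDATA[
```python
def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, degree=2, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * x.y + coef0) ** degree (assumed form)."""
    return (gamma * dot(x, y) + coef0) ** degree


def explicit_map(x):
    """Explicit feature map for degree=2, gamma=1, coef0=0: all products x_i * x_j."""
    return [a * b for a in x for b in x]


# Hypothetical 3-dimensional inputs; the paper's vectors have 300 or 320 dimensions.
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
k_implicit = poly_kernel(x, y)                       # computed in 3 dimensions
k_explicit = dot(explicit_map(x), explicit_map(y))   # computed in 9 dimensions
```
]]></preformat>
      <p>Both values agree, but the implicit version never builds the 9-dimensional vectors; for 320-dimensional inputs the explicit space would already have 102400 coordinates.</p>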
      <sec id="sec-4-1">
        <title>Results</title>
        <p>In order to evaluate our method in the training phase, a 10-fold cross-validation
strategy was selected. Table 4 shows the performance of the SVM classifiers on
the training dataset. The accuracy was averaged over the three languages
to produce an overall performance value.</p>
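        <p>A 10-fold split of this kind can be sketched as follows. The helper name, the shuffling seed and the round-robin assignment of samples to folds are assumptions; the paper does not describe how the folds were drawn. The sample count of 3000 matches the number of individuals reported for Spanish.</p>
        <preformat><![CDATA[
```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at j,
    # so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        test_idx = folds[j]
        train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_idx, test_idx


# Each author appears in exactly one test fold and in nine training folds.
splits = list(k_fold_indices(3000, k=10))
```
]]></preformat>
        <p>A classifier would be trained on each training split and scored on the matching test split, and the ten accuracies averaged to give the per-language figure.</p>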
        <p>For the testing stage, we ran our model on the TIRA platform<sup>6</sup>. The
results attained on the testing data showed little decrease compared to the results
in the training step, which suggests the method is robust. Table 5 presents
the results achieved on testing. In both stages (training and testing), the English
language was classified best; this might be due in part to the POS tagging
information that was added only for this language. The strategy for
Spanish and Arabic classification did not use POS tags, which might suggest
that the Document Embeddings produced were not as rich in semantic information
as those enhanced by POS tag vectors. Despite having more samples (3000
individuals, whilst Arabic has 1500), the Spanish task resulted in the lowest
accuracy overall, on both training and testing data. This suggests that
the use of words and topics in the Spanish language might be more uniform than
in the Arabic language.</p>
        <p>
          A more comprehensive account of the methodologies and results of all
participating teams is given by Rangel et al. in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]; our methodology might
be assessed more properly within that context.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Conclusion</title>
        <p>The Author Profiling task is very important for the current online way of life.
The ongoing mission to produce faster and more accurate algorithms takes us
in new directions, to explore new options and to test new approaches. Even though
state-of-the-art results are good enough for some applications, there is still a
lot of room for improvement. The method proposed in this report shows that</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 http://www.tira.io/</title>
      <p>fusing "old" ways with novel ones might be a good strategy for the
future. Moreover, given the limited scope of the available datasets for this type
of task, it might be a good idea to work in parallel on devising new techniques to
produce more comprehensive datasets faster.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hildebrandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gutwirth</surname>
          </string-name>
          ,
          <article-title>Profiling the European Citizen: Cross-Disciplinary Perspectives</article-title>
          . Springer Publishing Company, Incorporated, 1 ed.,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and W. Daelemans, "
          <article-title>Overview of the 3rd author profiling task at PAN 2015," in CLEF 2015 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings, CEUR-WS.org (Sep</source>
          <year>2015</year>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschuggnall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , "Overview of PAN-2018:
          <article-title>Author Identification, Author Profiling, and Author Obfuscation," in Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>9th International Conference of the CLEF Initiative (CLEF</source>
          <volume>18</volume>
          )
          <string-name>
            <surname>(P. Bellot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Trabelsi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Sanjuan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cappellato</surname>
          </string-name>
          , and N. Ferro, eds.), (Berlin Heidelberg New York), Springer, Sept.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          , Deep Learning. The MIT Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          , "
          <article-title>Distributed representations of words and phrases and their compositionality,"</article-title>
          <source>in Advances in Neural Information Processing Systems</source>
          26 (
          <string-name>
            <surname>C. J. C. Burges</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Ghahramani</surname>
            , and
            <given-names>K. Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , eds.), pp.
          <volume>3111</volume>
          –
          <issue>3119</issue>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>M. J. Kusner</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>N. I.</given-names>
          </string-name>
          <string-name>
            <surname>Kolkin</surname>
            , and
            <given-names>K. Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , "
          <article-title>From word embeddings to document distances,"</article-title>
          <source>in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15</source>
          , pp.
          <volume>957</volume>
          –
          <issue>966</issue>
          , JMLR.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Eichstaedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Kern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dziurzynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Ramones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stillwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E. P.</given-names>
            <surname>Seligman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Ungar</surname>
          </string-name>
          ,
          <article-title>"Personality, gender, and age in the language of social media: The open-vocabulary approach,"</article-title>
          <source>PLOS ONE</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          , Sept.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Posadas-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sanchez-Perez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chanona-Hernandez</surname>
          </string-name>
          ,
          <article-title>"Improving feature representation based on a neural network for author profiling in social media texts,"</article-title>
          <source>Comp. Int. and Neurosc.</source>
          , vol.
          <volume>2016</volume>
          , pp.
          <fpage>1638936:1</fpage>
          –
          <lpage>1638936:13</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>"Distributed representations of sentences and documents,"</article-title>
          <source>CoRR</source>
          , vol. abs/1405.4053,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>"On the identification of emotions and authors' gender in Facebook comments on the basis of their writing style,"</article-title>
          <source>in Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM 2013), A workshop of the XIII International Conference of the Italian Association for Artificial Intelligence (AI*IA 2013)</source>
          , Turin, Italy, December 3, 2013, pp.
          <fpage>34</fpage>
          –
          <lpage>46</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>I.</given-names>
            <surname>Pervaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ameer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sittar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. M. A.</given-names>
            <surname>Nawab</surname>
          </string-name>
          ,
          <article-title>"Identification of author personality traits using stylistic features: Notebook for PAN at CLEF 2015,"</article-title>
          <source>in CLEF (Working Notes)</source>
          (L. Cappellato, N. Ferro, G. J. F. Jones, and E. SanJuan, eds.), vol.
          <volume>1391</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M. A. A.</given-names>
            <surname>Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Lopez-Monroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y-Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Pineda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <article-title>"INAOE's participation at PAN'15: Author profiling task,"</article-title>
          <source>in Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum</source>
          , Toulouse, France, September 8–11, 2015,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Rezaeinia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <article-title>"Improving the accuracy of pretrained word embeddings for sentiment analysis,"</article-title>
          <source>CoRR</source>
          , vol. abs/1711.08609,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>"Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling,"</article-title>
          <source>in Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</source>
          (E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, and E. Toms, eds.), (Berlin Heidelberg New York), pp.
          <fpage>268</fpage>
          –
          <lpage>299</lpage>
          , Springer, Sept.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <article-title>"Using tf-idf to determine word relevance in document queries."</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>S.</given-names>
            <surname>Marsland</surname>
          </string-name>
          ,
          <source>Machine Learning: An Algorithmic Perspective, Second Edition</source>
          . Chapman &amp; Hall/CRC, 2nd ed.,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y-Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>"Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter,"</article-title>
          <source>in Working Notes Papers of the CLEF 2018 Evaluation Labs</source>
          (L. Cappellato, N. Ferro, J.-Y. Nie, and L. Soulier, eds.), CEUR Workshop Proceedings, CLEF and CEUR-WS.org, Sept.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>