<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Profiling on Social Media: An Ensemble Learning Model using Various Features</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Engineering, Yonsei University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>We describe our participation in the PAN 2019 shared task on author profiling (AP), which asks participants to determine whether the author of a Twitter feed is a bot or a human and, for human authors, to identify the author's gender, in English and Spanish datasets. In this paper, we investigate the complementarity of stylometry methods and content-based methods, putting forward various techniques for building flexible features. As a complement to these methods, we investigate an ensemble learning method that improves performance on AP tasks. Experimental results demonstrate that an ensemble combining stylometry and content-based methods captures author profiles more accurately than traditional methods. Our proposed model obtained accuracies of 0.9333 and 0.8352 on the bot and gender identification tasks, respectively, for the English test dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Author profiling (AP) is the classification of shared content in order to
predict general or demographic attributes of authors, such as gender, age, personality,
native language, or political orientation [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The ability to infer an
author’s profile has wide applicability and has proved advantageous in areas
such as marketing, forensics, and security.
      </p>
      <p>
        Broadly speaking, approaches to AP treat the task as a multi-class,
single-label classification problem in which the set of class labels is known a priori [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
AP is thus modeled as a classification task in which automatic detection methods
assign labels (e.g., male, female) to objects (texts). Consequently, most work
has been devoted to determining a suitable set of features that capture the
writing profile of authors. In the PAN 2019 shared task on AP [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the goal is to
infer whether the author of a Twitter feed is a bot or a human and to profile the author’s
gender in case of human [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Both training and test data are provided in two
languages: English and Spanish.
      </p>
      <p>To predict bot and gender labels, we exploit the complementarity of
stylometry methods and content-based methods, putting forward various techniques for
building flexible features (basic count features, psycholinguistic features, TF-IDF, and Doc2vec).
As a complement to these features, we also investigate an ensemble learning
method that combines classifiers based on these features with a BERT model
to improve performance on AP tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        Approaches to AP can be broadly categorized into two types of methods:
(1) stylometry methods, which aim to capture an author’s writing style using different
statistical features (e.g., function words, POS tags, punctuation marks, and emoticons), and (2)
content-based methods, which identify an author’s profile from the content
of the text (e.g., bag of words, word n-grams, term vectors, TF-IDF n-grams, slang
words, emotional words) and the topics discussed in the text (e.g., topic models such as
LDA and PLSA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. According to the PAN<sup>1</sup> competitions, the most successful approaches to AP
in social media have combined these two kinds of features.
      </p>
      <p>
        An author’s writing style can be used to identify that author’s attributes. In
previous studies, style-based features were used to predict attributes such as age and
gender [
        <xref ref-type="bibr" rid="ref13 ref18 ref22 ref3 ref6">3,6,13,18,22</xref>
        ]. In these methods, lexical word-based features represent text as a
sequence of tokens forming sentences, paragraphs, and documents. A token can be a
number, an alphabetic word, or a punctuation mark. These tokens are used to compute
statistics such as average sentence length, average word length, total number of words,
and total number of unique words. Character-based features, in turn, consider the
text as a sequence of characters.
      </p>
      <p>
        Content-based methods employ specific words or special content which are used
more frequently in that domain than in other domains [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. These words can be chosen
by correlating the meaning of words with the domain [
        <xref ref-type="bibr" rid="ref23 ref8">8, 23</xref>
        ] or by selecting them from the corpus
by frequency or by other feature selection methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An analysis of information gain
presented in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] showed that the most relevant features for gender identification are
those related to content words (e.g., linux and office for identifying males, and
love and shopping for identifying females).
      </p>
      <p>
        Recently, some works have used deep learning models and representation
learning methods for AP [
        <xref ref-type="bibr" rid="ref11 ref21 ref7 ref9">7, 9, 11, 21</xref>
        ]. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used an approach based on subword character
n-gram embeddings and deep averaging networks (DAN). [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] used a model consisting
of a bidirectional RNN implemented with Gated Recurrent Units (GRU) combined with an
attention mechanism. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed two neural network models, each consisting of multiple layers, for gender
identification and language variety identification in four languages.
      </p>
      <sec id="sec-2-1">
        <title>1 https://pan.webis.de/</title>
        <p>Preprocessing the text data is an essential step, as it makes the raw text ready for
machine learning algorithms. The objective of this step is to remove noise
that is less relevant to detecting the author profile in the texts.</p>
        <p>We first aggregate the tweets published by an individual user into one
document before training to alleviate the shortcomings of short texts. To utilize most
of the information in the text, we apply several transformations to the short texts (i.e.,
XML parsing, contraction unfolding, tokenization, stemming, lemmatization, and
stopword removal). In addition, to support word-level representation of as much of
the text as possible, we perform spelling correction for informal words using the SymSpell
library<sup>1</sup>, word segmentation for splitting hashtags using the WordSegment library<sup>2</sup>, and
annotation (surrounding or replacing tokens with special tags such as &lt;money&gt;, &lt;number&gt;, &lt;date&gt;,
&lt;phone&gt;, or &lt;user&gt;) as a text postprocessing task (see Figure 3).</p>
        <p>Previous works on AP tasks explore lexical, syntactic, and structural features.
Lexical features measure an author's habits in using characters and words.
Commonly used features of this kind include the number of characters, the number of words, and the
frequency of each type of character. Syntactic features include the use of
punctuation, part-of-speech (POS) tags, and function words. Structural features represent
how authors organize their documents, including special structures such as greetings
or signatures.</p>
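        <p>The aggregation and annotation steps described above can be sketched in Python as follows. This is a minimal illustration rather than our exact pipeline: the regular expressions, the function names <monospace>annotate</monospace> and <monospace>aggregate_by_author</monospace>, and the sample tweets are assumptions made for the example, and spelling correction and hashtag segmentation are omitted.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import defaultdict

# Hypothetical annotation patterns: the tag names follow the text,
# but the regular expressions are illustrative assumptions.
# Order matters: dates and money are matched before bare numbers.
PATTERNS = [
    (re.compile(r"@\w+"), "<user>"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "<date>"),
    (re.compile(r"\$\d+(?:\.\d+)?"), "<money>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<number>"),
]


def annotate(text):
    """Replace user mentions, dates, money amounts, and numbers with tags."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text


def aggregate_by_author(tweets):
    """Join all annotated tweets of each author into a single document."""
    grouped = defaultdict(list)
    for author_id, text in tweets:
        grouped[author_id].append(annotate(text))
    return {author: " ".join(texts) for author, texts in grouped.items()}


# Made-up sample data: (author id, tweet text) pairs.
tweets = [
    ("u1", "Thanks @bob, see you on 12/05/2019"),
    ("u1", "Paid $20 for 3 tickets"),
    ("u2", "Hello world"),
]
documents = aggregate_by_author(tweets)
```
]]></preformat>
        <p>Aggregating before feature extraction means that every later step operates on one document per author rather than on individual tweets.</p>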
        <p>As shown in Table 1, we construct a basic count feature set that includes punctuation
characters (e.g., question marks and exclamation marks) and other features (e.g., average
syllables per word, function word count, special character count, and capital ratio).</p>
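        <p>A few count features of this kind could be computed as in the sketch below. The feature names and the function <monospace>basic_count_features</monospace> are hypothetical and cover only a small subset of such features; syllable and function word counts are left out because they require external resources.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import string


def basic_count_features(text):
    """Compute a few stylometric counts for one author's document."""
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        # Share of alphabetic characters that are upper case.
        "capital_ratio": len(capitals) / len(letters) if letters else 0.0,
        # Whitespace-delimited tokens, so attached punctuation is included.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


features = basic_count_features("Is this REAL? Yes!")
```
]]></preformat>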
      </sec>
      <sec id="sec-2-2">
        <title>1 https://github.com/wolfgarbe/SymSpell 2 https://github.com/grantjenks/python-wordsegment</title>
        <p>
          The relationship between personality traits and the use of language has been widely
studied by the psycholinguist Pennebaker. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] analyzed how the use of language
varies depending on personal traits. For example, with regard to the authors’ gender,
he found that, in English, women use more negations and first-person forms, because they
are more self-conscientious, whereas men use more prepositions in order to describe
their environment. These findings are the basis of LIWC<sup>1</sup> (Linguistic Inquiry and Word
Count), one of the most widely used tools for capturing people’s social and psychological
states, which has proved useful in the AP task.
        </p>
        <p>LIWC has two types of categories. The first captures the writing style of the
author, such as POS frequency or the length of the words used (i.e., summary language
variables, linguistic dimensions, other grammar). The second (i.e.,
psychological processes) captures content information by counting the frequency of words related
to thematic categories such as affective processes, social processes, and personal
concerns. In our use of this tool, we focused on the content information.</p>
        <p><bold>3.4 TF-IDF</bold></p>
        <p>We adapted the TF-IDF (Term Frequency-Inverse Document Frequency) method to
judge the topics of each document by the words it contains. TF-IDF assigns each word a
weight that measures relevance rather than raw frequency: word counts
are replaced with TF-IDF scores computed across the corpus. TF-IDF first measures the
number of times that a word appears in a given document (i.e., term frequency). However,
since words such as “and” or “the” appear frequently in all documents, they must be
systematically discounted (i.e., inverse document frequency). The more documents a word
appears in, the less valuable that word is as a signal to differentiate any given document.
This leaves only the frequent and distinctive words as markers. The TF-IDF
relevance of each word is a normalized value given by the following formula:
          <disp-formula><tex-math>w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)</tex-math></disp-formula></p>
      </sec>
      <sec id="sec-2-3">
        <title>1 http://liwc.wpengine.com/</title>
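        <p>where <italic>tf</italic><sub><italic>i,j</italic></sub> is the number of occurrences of term <italic>i</italic> in document <italic>j</italic>; <italic>df</italic><sub><italic>i</italic></sub> is the number of documents
containing <italic>i</italic>; and <italic>N</italic> is the total number of documents.</p>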
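        <p>As a worked illustration of these quantities, the sketch below weights word unigrams, bigrams, and trigrams with the common form tf × log(N / df). This particular weighting, the whitespace tokenizer, and the function names <monospace>ngrams</monospace> and <monospace>tfidf</monospace> are assumptions made for the example; library implementations typically add smoothing and vector normalization.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams (joined by spaces) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(documents, max_n=3):
    """Weight each n-gram i in document j as tf_ij * log(N / df_i)."""
    term_counts = []
    for doc in documents:
        tokens = doc.lower().split()
        terms = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
        term_counts.append(Counter(terms))  # tf_ij for every term in doc j

    n_docs = len(documents)  # N
    # df_i: number of documents that contain term i at least once.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


# Made-up two-document corpus.
docs = ["the bot posts the news", "the human posts photos"]
weights = tfidf(docs)
```
]]></preformat>
        <p>In this toy corpus, "the" occurs in both documents and therefore receives weight zero, while terms unique to one document keep a positive weight.</p>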
        <p>We extract unigrams, bigrams, and trigrams from the bag-of-words
representation of each user's Twitter posts. To account for occasional differences in content length
between the training and test datasets, these features are encoded as TF-IDF values.</p>
        <p><bold>3.5 Doc2vec</bold></p>
        <p>
          We use the Doc2vec method [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], an unsupervised learning model that learns fixed-length feature
representations from documents of variable length. The idea is to
combine the meanings of words to construct the meaning of documents using
a distributed memory model (see Figure 2). There are two models for the distributed
representation of documents: Distributed Memory (DM) and Distributed Bag-of-Words
(DBOW). The distributed representations obtained by this method have been reported to outperform both
Bag-of-Words (BoW) and word n-gram models, producing state-of-the-art results for
text classification and sentiment analysis tasks.
        </p>
        <p>
          In this work, different representations of the texts were used as input to
the Doc2vec method in order to evaluate the quality of the resulting distributed
representations. In particular, we represented the texts in terms of word unigrams, bigrams,
and trigrams. To implement the Doc2vec model, we used the freely available
Doc2vec implementation included in the Gensim module. The Doc2vec method offers two
possible approaches (i.e., PV-DM and PV-DBOW) to build the model. As shown in
Table 2, our experiments yield better performance when both representations are
concatenated; thus, our model’s final document vector is the
concatenation of the representations obtained by the DM and DBOW
models.
        </p>
        <p>
          The key innovation of BERT (Bidirectional Encoder Representations from Transformers) is
applying the bidirectional training of the Transformer, a popular attention model,
to language modelling [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The Transformer includes two separate mechanisms – an
encoder that reads the text input and a decoder that produces a prediction for the task.
Since the goal of BERT is to generate a language model, only the encoder
mechanism is necessary. As opposed to directional models, which read the text input
sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence
of words at once. This characteristic allows the model to learn the context of a word
based on all of its surroundings (left and right of the word).
        </p>
        <p>When training language models, there is a challenge of defining a prediction goal.
To overcome this challenge, BERT uses two training strategies: masked language modelling (MLM)
and next sentence prediction (NSP). Devlin et al. report that BERT outperforms the previous
state of the art in a wide variety of NLP tasks, including question answering (SQuAD) and natural
language inference (MNLI). In this task, we used a pre-trained BERT model
(i.e., bert_uncased_L-24_H-1024_A-16) from the TensorFlow Hub website<sup>1</sup>.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results and Discussion</title>
      <p>
        Identifying bot and gender from text is cast as a classification task. For
each subtask we perform binary classification, i.e., the goal is to distinguish
between two classes: (1) bot versus human and (2) male versus female for the human class.
We used 10-fold cross-validation in our experiments. As classifier models, we employed a Gradient Boosted
Decision Tree (GBDT) classifier and the BERT model to train and test our proposed
system. Our results on the PAN 2019 subtasks are summarized in Table
3. Regarding the submissions to the task through the TIRA [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] platform, a
web service that facilitates software submissions in virtual machines,
training was conducted offline by concatenating the training and validation sets as input and
      </p>
      <sec id="sec-3-1">
        <title>1 https://tfhub.dev/</title>
        <p>then the trained models were deployed to TIRA to classify the unseen test dataset. We
tuned our single models on the development datasets and submitted the final results
using ensemble models. The ensemble models were built by averaging the outputs of
two models: a feature stacking model and a deep learning model. Our
submission achieves an accuracy of 0.9333 on the bot detection task and 0.8352 on the gender
identification task for the English test dataset.</p>
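        <p>Output averaging of this kind can be sketched as below. The sketch assumes that each model emits a positive-class probability per author and that the averaged score is thresholded at 0.5; the function name <monospace>average_ensemble</monospace>, the threshold, and the scores are hypothetical and are not taken from our submission.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def average_ensemble(prob_stacking, prob_bert, threshold=0.5):
    """Average two models' positive-class probabilities and threshold them."""
    if len(prob_stacking) != len(prob_bert):
        raise ValueError("both models must score the same authors")
    averaged = [(a + b) / 2 for a, b in zip(prob_stacking, prob_bert)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical P(bot) scores for three authors from each model.
stacking_scores = [0.9, 0.4, 0.2]
bert_scores = [0.7, 0.7, 0.1]
averaged, labels = average_ensemble(stacking_scores, bert_scores)
```
]]></preformat>
        <p>For the second author the two models disagree, and the averaged score decides the label.</p>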
        <p>In the gender identification task, the content-based approaches show better results;
in particular, the Doc2vec model performs significantly better than the other feature sets.
The accuracy scores obtained with the other features (i.e., basic count features,
psycholinguistic features, and TF-IDF) indicate that these approaches address the problem to some
extent but require more distinctive features to further improve accuracy.
Interestingly, the concatenation of stylometry and content-based features proves highly
effective, achieving the best results overall. Nevertheless, Doc2vec proves to be
highly competitive as a standalone feature set.</p>
        <p>As for the deep learning approach, although the BERT model can learn important
features from almost any data without manually derived features,
its performance in our experiments is not competitive. Our results using
the pre-trained model did not surpass the accuracy of the other models and suggest that
more fine-tuning is required.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>This paper summarized our participation in the PAN 2019 shared task on AP.
We approached the problem from the
perspective of stylometry and content-based methods, as well as contextualized word
embeddings from the BERT model. For training, we adopted an ensemble
approach and carried out experiments on the PAN collections. Experimental results
demonstrate that an ensemble combining style-based and
content-based methods captures author profiles more accurately than traditional methods.</p>
      <p>There is still room to improve our work. In our system, we used
the BERT model, which shows state-of-the-art results in various NLP tasks, without
sophisticated fine-tuning. We leave further fine-tuning on our datasets as future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
          </string-name>
          , J.:
          <article-title>Automatically profiling the author of an anonymous text</article-title>
          .
          <source>Commun. ACM</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>123</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
          </string-name>
          , E.:
          <article-title>Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Crestani,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Heinatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019)</source>
          . Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>De-Arteaga</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duenas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mancera</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baquero</surname>
          </string-name>
          , J.:
          <article-title>Author profiling using corpus statistics, lexicons and stylistic features</article-title>
          . In:
          <source>Online Working Notes of the 10th PAN Evaluation Lab on Uncovering Plagiarism, Authorship, and Social Misuse, CLEF</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fatima</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nawab</surname>
            ,
            <given-names>R.M.A.</given-names>
          </string-name>
          :
          <article-title>Multilingual author profiling on facebook</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>53</volume>
          (
          <issue>4</issue>
          ),
          <fpage>886</fpage>
          -
          <lpage>904</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Flekova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Preoţiuc-Pietro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Exploring stylistic variation with age and income on twitter</article-title>
          .
          In:
          <source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . vol.
          <volume>2</volume>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>319</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plotnikova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pawar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benajiba</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Subword-based deep averaging networks for author profiling in social media</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dias</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paraboni</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Author profiling from facebook corpora</article-title>
          .
          In:
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kodiyan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardegger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neuhaus</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cieliebak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Author profiling with bidirectional RNNs using attention with GRUs: Notebook for PAN at CLEF 2017</article-title>
          . In:
          <source>CLEF 2017 Evaluation Labs and Workshop, Working Notes Papers, Dublin, Ireland, 11-14 September 2017</source>
          . vol.
          <volume>1866</volume>
          . RWTH Aachen (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In: International conference on machine learning</source>
          . pp.
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Miura</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taniguchi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taniguchi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohkuma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Author profiling with word+ character neural attention network</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blackburn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The development and psychometric properties of LIWC2015</article-title>
          .
          <source>Tech. rep.</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pervaz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ameer</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sittar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nawab</surname>
            ,
            <given-names>R.M.A.</given-names>
          </string-name>
          :
          <article-title>Identification of author personality traits using stylistic features: Notebook for PAN at CLEF 2015</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In: The 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16)</source>
          . pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          . Springer-Verlag (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          . CEUR-WS.org (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th author profiling task at PAN 2018: multimodal gender identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Santosh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shekhar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Author profiling: Predicting age and gender from blogs</article-title>
          .
          <source>Notebook for PAN at CLEF 2013</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          .
          <source>In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</source>
          . vol.
          <volume>6</volume>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>34</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sierra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for author profiling</article-title>
          .
          <source>Working Notes of the CLEF</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Wanner</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>A semi-supervised approach for gender identification</article-title>
          . In:
          <string-name>
            <surname>Calzolari</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choukri</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Declerck</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goggi</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobelnik</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maegaard</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mariani</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazo</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Odijk</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piperidis</surname>
            <given-names>S</given-names>
          </string-name>
          (eds.)
          <source>LREC 2016, Tenth International Conference on Language Resources and Evaluation</source>
          ; 23-28 May 2016; Portorož, Slovenia. pp.
          <fpage>1282</fpage>
          -
          <lpage>1287</lpage>
          . LREC (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Authorship analysis in cybercrime investigation</article-title>
          .
          <source>In: International Conference on Intelligence and Security Informatics</source>
          . pp.
          <fpage>59</fpage>
          -
          <lpage>73</lpage>
          . Springer (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>