<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bot Or Not: A Two-Level Approach In Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Flóra Bolonyai</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakab Buda</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eszter Katona</string-name>
        </contrib>
        <aff id="aff0">
          <institution>Eötvös Loránd University</institution>
          ,
          <addr-line>Budapest</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>In this notebook, we summarize our work process of preparing software for the PAN 2019 Bots and Gender Profiling task. We propose a machine learning approach to determine whether an unknown Twitter user is a bot or a human and, if the latter, their gender. We use logistic regressions to identify whether the author is a bot or a human, and neural networks to attribute their gender. We achieved accuracies of 91% and 83% on the bot/human task and 75% and 69% on the gender task, for English and Spanish respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Author profiling is a fast-growing research area, in line with the expanding usage of the “Big Data” paradigm. Originally, public texts were produced only by professional authors (such as journalists and scientific or literary writers). As the internet spread, however, the amount of content produced by everyday users skyrocketed. Website texts, blogs, comments and social media posts now account for a significant part of digitally produced content. Although these texts are unstructured, they contain data about people's opinions, preferences, attitudes and even actions in an amount that is enormous compared to pre-internet times.</p>
      <p>As the online community grows, we face increasing threats from the anonymity of online life. In politics, for example, bots can influence the outcome of elections or create and share fake news. Identifying sexual predators is another important task; detecting potential discrepancies between self-declared and true characteristics of users therefore plays an increasingly important role.</p>
      <p>
        The aim of the PAN 2019 Bots and Gender Profiling task [
        <xref ref-type="bibr" rid="ref14 ref16">6, 18, 20</xref>
        ] was to
investigate whether the author of a given Twitter feed is a bot or a human, and in case
of human, identify their gender. The training and test sets of the task consisted of
English and Spanish Twitter feeds.
      </p>
      <p>We followed a mixed approach and created models on two levels, for both Spanish
and English tweets. To identify bots, we used two logistic regressions: one on the
level of tweets and another one on the level of accounts. To identify the gender of the
author, we first fitted neural networks that predicted the gender of the author of each
tweet and then aggregated the 100 tweet level results to get an account level
prediction for each author. Our final software for distinguishing between bots and
humans performed well on the test set, but our model was less accurate for the task of
discriminating between male and female Twitter users.</p>
      <p>In Section 2 we present the related work on author profiling. In Section 3 we
describe our approach in detail, including the extracted features and the fitted models.
Section 3 consists of two subsections, one for the bot-or-human distinction and one
for the gender attribution. In Section 4 we present our results. In Section 5 we discuss
some potential future work and in Section 6 we conclude our notebook.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>There are multiple approaches to collecting information about the latent characteristics of the author of a text. There is no single best technique, and the literature shows high variance in the performance of different methods on different corpora. The variables that can be used to collect information about the authors of texts fall roughly into two groups: most such variables are either dictionary-based or style-based.</p>
      <p>A well performing model in gender differentiation based on n-grams was built by
Boulis and Ostendorf [4], who also emphasized that tf-idf representation was not
necessarily a useful measure to find differences between the texts of male and female
authors, as the most used words can be very different for the two groups and therefore
inverse weighting can be less effective.</p>
      <p>Garera and Yarowsky [7] found that removing stop-words and lemmatizing were
not useful when trying to differentiate between texts written by men and women as
the distribution of stop-words and certain grammatical forms differ in the case of
female and male authors.</p>
      <p>
        Peersman, Daelemans and Varenbergh [
        <xref ref-type="bibr" rid="ref12">16</xref>
        ] compared the performance of SVM
models based on character and word n-grams to predict the age and gender of social
media post authors. Their results show that token-based variables are more
informative than character-based ones.
      </p>
      <p>
        Schler, Koppel, Argamon and Pennebaker [
        <xref ref-type="bibr" rid="ref17">21</xref>
        ] analyzed blog posts to gain
information about the relationship between style and content-based variables and age
and gender. They found that women use more pronouns and words that express
emotions, agreements and disagreements, while men use more articles, prepositions
and links. In their case, style-based variables proved to be more informative than
content-based ones.
      </p>
      <p>Goswami, Sarkar and Rustagi [8] looked for stylometric differences by age and
gender by including slang words and average sentence length as new explanatory
variables. Their results show that the frequency of certain slang words is very
different by both age and gender but there is no significant difference regarding the
length of sentences among the groups.</p>
      <p>
Word embeddings are another approach that does not belong to either of the categories above. Embeddings capture the essence of words well, and therefore, by combining these representations with neural networks, it is possible to gain knowledge about the latent characteristics of a text, among them information about its author. In the 2018 PAN competition, multiple well-performing models used
neural networks based on word embeddings to classify texts by the gender of their
authors. However, these models did not clearly prove to be superior to traditional
machine learning ones regarding gender classification [
        <xref ref-type="bibr" rid="ref15">19</xref>
        ].
      </p>
      <p>Overall, there is no consensus about the types of variables and models that work
best in identifying latent characteristics of authors, and therefore our approach was to
gain as much information as possible from the training corpora.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Our approach</title>
      <p>For the two tasks of the competition, i.e. bots and gender profiling, we trained two
substantially different models. To differentiate between humans and bots, we used a
system of two logistic regressions based on features extracted from the texts, whereas
to classify authors by gender, we implemented a recurrent neural network based on
word embeddings. In the following sections, we provide a detailed overview of our
methods for both tasks. Our code is available on GitHub1.</p>
      <sec id="sec-3-1">
        <title>3.1 Identifying bots</title>
      </sec>
      <sec id="sec-3-2">
        <title>Features</title>
        <p>As our classifier system consists of two logistic regressions, one that predicts per
tweet and one that predicts per author, we created variables on two levels. On the one
hand, we extracted features on the level of tweets, and on the other, we also created
some aggregate features on the level of authors. The features were slightly different
on the two levels. For example, we investigated on the tweet-level if there was
another user tagged in the tweet and counted on the account-level how many different
people a user tagged. For both Spanish and English tweets we extracted the same
features.2</p>
        <p>It should be noted that we had no internet connection while testing the software, so we could not include some of the planned information (such as expanding and examining shared links).</p>
        <p>To extract features, we primarily utilized publicly available Python packages3 and, in some cases, regular expressions.</p>
        <p>
          We distinguish between three types of features. For some of our features, we had to use predefined dictionaries, so we call those dictionary-based features. Another group of features consists of those that describe the tweets grammatically and structurally, which we call style-based features. Finally, we differentiate a third group of variables, which describe meta information that could be extracted from the tweets. The extracted features are summarized in the following tables.
          1 https://github.com/pan-webis-de/bolonyai19
          2 For efficient data handling we used the numpy [
          <xref ref-type="bibr" rid="ref18">22</xref>
          ] and pandas [
          <xref ref-type="bibr" rid="ref10">14</xref>
          ] packages.
          3 We used libraries from the following packages: spanish_sentiment_analysis [9], emoji [
          <xref ref-type="bibr" rid="ref7">11</xref>
          ], spacy [
          <xref ref-type="bibr" rid="ref6">10</xref>
          ], lexical_diversity [
          <xref ref-type="bibr" rid="ref8">12</xref>
          ], NLTK [3], textblob [
          <xref ref-type="bibr" rid="ref9">13</xref>
          ]
        </p>
        <sec id="sec-3-2-1">
          <title>3.1.1 Feature extraction on tweet-level</title>
          <p>[Table: tweet-level features. Dictionary-based features: emojis, proportion of stopwords, sentiment score, and number of misspelled words (our assumption is that humans misspell words but bots do not; misspelled words can also indicate the use of short forms of expressions). Style-based features: lexical diversity (there are various methods to measure lexical diversity, e.g. simple TTR or log TTR; we used root TTR); POS-features extracted with spacy [
          <xref ref-type="bibr" rid="ref6">10</xref>
          ] (we identified the word class of each word and created features measuring the proportion of nouns, verbs and adjectives in the tweets); and text characteristics extracted with regex [2] (number and proportion of apostrophes, uppercase letters, numbers, points and commas).]</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.1.2 Feature extraction on author-level</title>
          <p>[Table: author-level features.]</p>
        </sec>
        <sec id="sec-3-2-5">
          <title>Models</title>
          <p>To differentiate between humans and bots, we fitted two logistic regressions4 for
each language on the provided training set. First, we fitted a logistic regression using
a total of 30 features extracted from individual tweets. This model predicted
separately for each tweet whether its author was a human or a bot. In our second
logistic regression, we used two types of explanatory variables: some of them were
collected from the original texts (e.g. the number of different usernames that occurred
in the tweets of an author), while other features came from the results of our first
logistic regression. The latter group consisted of the minimum, maximum, median,
mean, standard deviation and rounded mean of the hundred predictions for each
author. The structure of our system is illustrated by Figure 1.</p>
          <p>
            To avoid overfitting, we tuned the hyperparameters of both logistic regression classifiers based on their performance on the development set. We applied grid search over a sensible range of parameters (i.e. C = {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4, 10^5}; Intercept = {True, False}). Our final sets of hyperparameters of the models are summarized in Table 5.
            4 We used the built-in logistic regression from scikit-learn [
            <xref ref-type="bibr" rid="ref11">15</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.2 Gender attribution</title>
        <p>
          On the gender attribution task, the method used for bot identification did not yield satisfactory results. Based on related works, we trained a recurrent neural network5 on the word embedding representation of the texts. Although logistic regression based on text features did not work well for gender attribution, we kept the system of first training a model on the tweets individually and then aggregating the results of this model for each author for the final classification. We used the 25-dimensional pretrained GloVe vectors [
          <xref ref-type="bibr" rid="ref13">17</xref>
          ] trained on tweets as the representation for both English and Spanish texts. The 25d GloVe vectors are space and time efficient, and they contain a large number of Spanish words, despite the fact that the embedding was primarily trained on English tweets. When we experimented with a higher dimensional embedding space, it did not perform significantly better on the development set. Before transforming the texts to the vector space, we applied minimal preprocessing: we replaced all links with the “https” string and all mentions with the “@” character, and finally separated all nonalphanumeric characters from the words to form separate tokens.
          5 We used tensorflow and keras to fit our neural networks [1, 5].
        </p>
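        <p>The preprocessing steps above can be sketched as follows; the regex patterns are our assumptions, and the authors' exact rules may differ:</p>

```python
import re

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", "https", tweet)  # links -> "https"
    tweet = re.sub(r"@\w+", "@", tweet)              # mentions -> "@"
    # split non-alphanumeric characters off into separate tokens
    tweet = re.sub(r"([^\w\s])", r" \1 ", tweet)
    return tweet.split()

tokens = preprocess("Hi @anna, see https://t.co/abc")
```

        <p>The resulting token list is then looked up in the GloVe vocabulary, so collapsing all links and mentions to single known tokens keeps the out-of-vocabulary rate low.</p>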
        <p>The word counts of tweets vary a lot (in the English training set the mean is 18.5
and the maximum is 97), but equal input length is required for computational
efficiency. Thus, we had to pad or truncate the tweets to the same length. Because of
the great variance in the word counts, padding all tweets to the length of the longest
tweet would yield more padding tokens than actual words in the case of most tweets,
hence rendering the training slow and inefficient. We chose 38 tokens as common
length for training the neural network as 90 percent of the tweets are shorter, and the
longer ones are generally tweets with many tags, which, with the embedding used, do
not contain much information. We padded the end of the shorter tweets with padding
tokens (0 vectors) and truncated the longer ones. Based on experimental training sessions, truncating the beginning of the tweets gave the best results (probably because long tweets tend to be long due to the many mentions at their beginning, which do not contribute to the character count, and we used the embedding of the “@” symbol for all mentions), so we kept only the end of the longer tweets.</p>
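        <p>The padding/truncation rule can be expressed compactly (a sketch; in the real pipeline the pad token is a 25-dimensional zero vector rather than a string):</p>

```python
def pad_or_truncate(tokens, length=38, pad_token="PAD"):
    # Keep only the end of long tweets (truncate the beginning) and
    # pad short tweets at the end, as described above.
    kept = tokens[-length:]
    return kept + [pad_token] * (length - len(kept))

seq = pad_or_truncate(["just", "a", "short", "tweet"], length=6)
```

        <p>Every tweet thus becomes a sequence of exactly 38 tokens, which is what allows the tweets to be batched for the RNN.</p>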
        <p>During the training of the RNN on the full training set, despite the various levels
and types of regularization tried, we observed heavy overfitting. This can probably be attributed to the fact that there are relatively few authors, each with many
tweets in the training set, and authors have a more distinct tweeting pattern than
genders. As a result, the network can learn to identify each author and attribute a
gender to the authors more easily than learn the distinction between the genders, but
this cannot be generalized to new authors. To avoid this possibility, during the first
part of the training of the RNN we used only 1/10 of the training set, randomly
selecting 10 tweets from each author. After achieving convergence on this training
set, we continued the training of the RNN on the full set for a few epochs.</p>
        <p>After some experimental training with different RNN architectures, our best
performing RNN on the English dev set was a unidirectional RNN with GRU units
(with recurrent dropout value of 0.35) followed by a dropout layer (p = 0.5) and a
sigmoid unit. On the Spanish texts we used a slightly different architecture, a
bidirectional RNN. On the English set, after a total of 110 epochs, the performance of
the model converged. The Spanish model converged after a total of 140 epochs.</p>
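        <p>The English architecture described above could be assembled roughly as follows. Only the recurrent dropout of 0.35, the dropout of 0.5 and the sigmoid output are stated in the text; the GRU width is an assumption of ours.</p>

```python
from tensorflow import keras
from tensorflow.keras import layers

# 38 tokens per tweet, each a 25-d GloVe vector; unidirectional GRU.
model = keras.Sequential([
    layers.Input(shape=(38, 25)),
    layers.GRU(32, recurrent_dropout=0.35),  # width 32 is an assumption
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),   # per-tweet gender probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

        <p>For the Spanish model, the GRU layer would be wrapped in layers.Bidirectional, matching the bidirectional architecture described above.</p>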
        <p>Following the tweet-level prediction of the RNN, for the English texts we did not find a better performing aggregation method than computing the rounded mean of the tweet-level predicted probabilities and interpreting this as the final prediction for each author. For the Spanish tweets, we trained a logistic regression similar to the one used for bot prediction. As input variables we used the mean, the standard deviation, the minimum and the maximum of the tweet-level predictions for each author.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>
        In this section, we provide a detailed description of the performance of our
classifiers. In all cases, i.e. for bots and gender profiling and for English and Spanish
language, we used the default development set defined by the hosts of the competition
to tune the hyperparameters of the models before submitting our software on TIRA
[
        <xref ref-type="bibr" rid="ref14">18</xref>
        ] and testing it on the real test set. Therefore, we report two sets of results for each
model: performance on the pre-defined dev set, and performance on the actual test set.
For both the development and test sets, we report the accuracy score as a performance
metric. As we used a two-level approach to train our classifiers, we also report the performance of our tweet-based classifiers on the development set.
      </p>
      <p>It is clear that adding the second layer of regressions accounts for significant
improvements, particularly when differentiating between bots and human authors.
However, we did not experiment with pure aggregate models, so we do not know how
our two-level approach would perform against classifiers that use only author-level
features.</p>
      <p>Although our logistic regression gave encouraging results in identifying bots and humans, it did have a shortcoming: extracting features from the texts was rather slow. We did not experiment with feature selection this time, but it is likely that not all of our features are significant predictors for differentiating bots from humans.</p>
      <p>As our models for English tweets generally outperformed the ones for Spanish
tweets, it is likely that some of the packages we used for feature extraction are more
reliable for English texts than for Spanish ones. For example, in the case of POS
tagging, the function we used for English texts was based on a corpus from the web in
general, while the one we used for Spanish texts was trained merely on Spanish news.</p>
      <p>To achieve better performance with the RNN (aside from using word vectors
trained explicitly on the language used), one possible solution could be to construct a
deeper network with two or three layers or use the sequence returned by the RNN as
the input data for another model. Although it could increase the risk of overfitting,
this could be compensated for by changing the random subset of the training set multiple
times during the initial training. Using a higher dimensional word representation and
training more epochs could also yield better accuracy, but at a great computational
cost.</p>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusion</title>
      <p>
        In this notebook, we summarized our work process of preparing software for the
PAN 2019 Bots and Gender Profiling task [
        <xref ref-type="bibr" rid="ref14 ref16">6, 18, 20</xref>
        ]. Overall, we followed different
approaches for the two tasks: to differentiate between bots and humans, we used
logistic regressions with mostly text based explanatory variables, and to differentiate
between female and male authors, we trained recurrent neural networks based on
word embeddings. In both cases, we built classifiers on two levels. First, we fitted
models to predict a response for individual tweets. Second, we created an aggregate
classifier that gave us a prediction for each author. In the case of bots vs. humans, we
used logistic regressions to get our final predictions. Besides the descriptive statistics
of the tweet-level predictions, we also included some author-level features extracted
from the texts as explanatory variables. To predict the gender of the author, we used
different approaches for the English and Spanish texts. For English tweets, we simply
took the rounded average of predictions of all tweets belonging to an author. For
Spanish tweets, we again opted for a logistic regression, using descriptive statistics of
the tweet-level predictions as input variables.
      </p>
      <p>Our results show that our classifiers for English tweets tend to outperform our
classifiers for Spanish tweets. Additionally, we achieved a higher accuracy in
identifying humans and bots than in identifying the gender of the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jozefowicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kudlur</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mané</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monga</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steiner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwar</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tucker</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viégas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warden</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wicke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>TensorFlow: Large-scale machine learning on heterogeneous systems</article-title>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Barnett, M.: regex. https://pypi.org/project/regex/ (2019)</mixed-citation>
        <mixed-citation>Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O'Reilly Media Inc. (2009)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Boulis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostendorf</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations</article-title>
          .
          <source>in: Proceedings of the 43rd Annual Meeting of ACL</source>
          , pp.
          <fpage>435</fpage>
          -
          <lpage>442</lpage>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Chollet, F. et al.: Keras. https://keras.io (2015)</mixed-citation>
        <mixed-citation>Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (2019)</mixed-citation>
        <mixed-citation>Garera, N., Yarowsky, D.: Modeling latent biographic attributes in conversational genres. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP:
          Vol
          <volume>2</volume>
          , pp.
          <fpage>710</fpage>
          -
          <lpage>718</lpage>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          8.
          <string-name><surname>Goswami</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Sarkar</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Rustagi</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Stylometric Analysis of Bloggers' Age and Gender</article-title>
          . In:
          <source>Pattern Recognition and Machine Intelligence, Third International Conference, PReMI, New Delhi, India, December 16-20, 2009, Proceedings</source>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>212</lpage>
          . (
          <year>2009</year>
          )
        </mixed-citation>
        <mixed-citation>
          9.
          <string-name><surname>Hofman</surname>, <given-names>E.</given-names></string-name>:
          <article-title>spanish-sentiment-analysis 1.0.0</article-title>
          . https://pypi.org/project/spanish-sentimentanalysis/ (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          10.
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          . To appear. (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wurster</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : emoji. https://pypi.org/project/emoji/ (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kyle</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>lexical-diversity</article-title>
          . https://pypi.org/project/lexical-diversity/ (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          13.
          <string-name>
            <surname>Loria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : textblob Documentation,
          <source>Release 0.15.2</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          14.
          <string-name>
            <surname>McKinney</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Data Structures for Statistical Computing in Python</article-title>
          ,
          <source>Proceedings of the 9th Python in Science Conference</source>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          . (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          16.
          <string-name>
            <surname>Peersman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Vaerenbergh</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Predicting age and gender in online social networks</article-title>
          .
          <source>In: Proceedings of the 3rd international workshop on Search and mining user-generated contents</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          18.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A Low Dimensionality Representation for Language Variety Identification</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16)</source>
          , Springer-Verlag,
          <source>LNCS (9624)</source>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          . (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          20.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org.</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          21.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          .
          In:
          <source>AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</source>
          , vol.
          <volume>6</volume>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          22.
          <string-name>
            <surname>van der Walt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colbert</surname>
            ,
            <given-names>S. Ch.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The NumPy Array: A Structure for Efficient Numerical Computation</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          , 13, pp.
          <fpage>22</fpage>
          -
          <lpage>30</lpage>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>