<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bot and gender recognition on tweets using feature count deviations</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Language Studies, Radboud University Nijmegen</institution>
          <addr-line>P.O. Box 9103, NL-6500HD Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>This paper describes the system with which I participated in the Author Profiling task at PAN2019, which entailed first profiling Twitter authors into bots and humans, and then the humans into females and males. The system checked to which degree feature counts for a test sample were compatible with the corresponding feature count ranges in the training data. Two feature sets were used, one with surface features (token unigrams and character n-grams with n from 1 to 5), the second with overall measurements (e.g. percentage of retweets, type-token ratio and variation in tweet length). On the training set, recognition quality was extremely high, but it was much lower on the test set, indicating that some type of overtraining must have taken place.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Author Profiling task in PAN2019 was the differentiation between bots and
humans, and subsequently for humans the differentiation between female and male
authors, for samples of 100 tweets in English or in Spanish. A detailed description of
the task is given by Rangel and Rosso[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].1 As early experiments showed a severe
risk of overtraining, the organisers provided splits into training and development sets,
where the development sets were said to be similar to the eventual test sets. The
provided tweets were not preprocessed. My approach for this task2 built on earlier work.
First of all, there was the long-term work on authorship and other text classification
tasks, which used to be published under the name Linguistic Profiling; because of the
ambiguity of that term, this name has now been replaced by the working title “Feature Deviation
Rating Learning System” (henceforth Federales). Although the full name implies a
specific learning technique, the acronym indicates a combination approach. Which form of
combination was used in this task is described below (Section 3). Furthermore, I reused
specific previous work related to the current task. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] addressed, among other things,
recognition of Twitter bots (for Dutch tweets) by noticing that their artificial language
use leads to overall measurements (e.g. type-token ratio) different from those of the more
variable language use of human authors. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] addressed gender recognition on, again,
Dutch tweets, concluding that counts of all words are the best performing features: with
these we measure the authors’ (described) life rather than their language use. Although
the two studies differed in language (Dutch versus English/Spanish), sample size (full
production over several years versus 100 tweets) and time period (when the majority of
Twitter users were still reporting on themselves versus when the majority was slowly
moving towards business users), I kickstarted the current experiment from the basics of
the earlier work.
1 In this paper, I will focus on my own approach. I refer the reader to the overview paper and
the other papers on the PAN2019 profiling task for related work. Not only will this prevent
overlap between the various papers, but most of the other papers, and hence information on
the current state of the art, were not available at the time of writing of this paper.
2 I also participated in the Author Attribution task. The differences in handling the two tasks
were such that I preferred to describe the other task in a separate paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, there
will obviously be some overlap.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano,
Switzerland.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Feature Extraction</title>
      <p>The most important choice in any classification task is the selection of features for
the learning components. For this task, I mostly wanted to investigate the potential of
features relating to regular, bot-like language use. In support I included more standard
features, but kept these simple, by taking only character n-grams and token unigrams.</p>
      <sec id="sec-2-1">
        <title>Tokenization</title>
        <p>
          As some of the features were to be based on tokens, I tokenized all text samples, using
a specialized tokenizer for tweets, as used before for [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Apart from normal tokens like
words, numbers and dates, it is also able to recognize a wide variety of emoticons.3
The tokenizer is able to identify hashtags and Twitter user names to the extent that
these conform to the conventions used on Twitter, i.e. the hash (#) or at (@) sign is
followed by a series of letters, digits and underscores. URLs and email addresses are
not completely covered. The tokenizer relies on clear markers for these, e.g. http, www
or one of a number of domain names for URLs. Assuming that any sequence including
periods is likely to be a URL proves unwise, given that spacing between normal words is
often irregular. And actually checking the existence of a proposed URL was infeasible,
as I expected the test machine [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to be shielded from the internet. Finally, as the use of
capitalization and diacritics is quite haphazard in tweets, the tokenizer strips all words
of diacritics and transforms them to lower case.4
3 The importance of this has dropped seriously since the introduction of emojis.
4 The system worked suboptimally here, as the check for out-of-vocabulary words was
implemented incorrectly, comparing the normalized word forms from the samples with the
unnormalized word forms in the word list. For English, the difference was probably negligible,
but for Spanish, with all its diacritics, the OOV counts were greatly exaggerated, as we will
see in Section 2.3.
        </p>
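        <p>The normalization step described above, stripping diacritics and lower-casing, can be sketched as follows. This is a minimal illustration rather than the actual tokenizer, and the function name is my own:

```python
import unicodedata

def normalize_token(token: str) -> str:
    # Decompose accented characters (NFKD), drop the combining marks,
    # and lower-case the result.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
```

With this, a Spanish form such as “Señor” normalizes to “senor”, so forms differing only in capitalization or diacritics are counted together.</p>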
      </sec>
      <sec id="sec-2-2">
        <title>Surface Frequency Features</title>
        <p>
          Although I generally prefer to use a wide range of features, including syntactic ones
(cf. the notebook on the attribution task [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]), tweets do not lend themselves well to
many such features. I therefore decided on more local patterns, especially since token
unigrams performed best in the experiments in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Given the “informal” nature of tweets,
I complemented unigrams with character n-grams (with n from 1 to 5). Token unigrams
were built from the normalized tokens, whereas character n-grams were built on the
basis of the original tweet. Both types of features were counted separately for original
tweets and retweets. In order to be included in the feature set, a feature needed to be
observed in at least five different authors (in the full training data). This led to about
1.23M features for English and 0.94M features for Spanish. However, in any specific
classification, only those features present in the training or test texts were used.
        </p>
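        <p>The surface feature extraction can be sketched as follows. This is a simplified illustration under my own naming (the actual system additionally distinguishes original tweets from retweets), including the requirement that a feature be seen in at least five different authors:

```python
from collections import Counter

def surface_features(tokens, raw_text, n_max=5):
    # Token unigrams are taken from the normalized tokens; character
    # n-grams (n = 1..n_max) are taken from the original tweet text.
    counts = Counter(("tok", t) for t in tokens)
    for n in range(1, n_max + 1):
        for i in range(len(raw_text) - n + 1):
            counts[("chr", n, raw_text[i:i + n])] += 1
    return counts

def select_features(per_author_counts, min_authors=5):
    # Keep only features observed in at least `min_authors` different authors.
    author_freq = Counter()
    for counts in per_author_counts:
        author_freq.update(counts.keys())
    return {f for f, n in author_freq.items() if n >= min_authors}
```
</p>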
      </sec>
      <sec id="sec-2-3">
        <title>Overall Measurement Features</title>
        <p>
          In addition to the straightforward frequency counts, I followed the strategy described in
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The idea is that bots show more regular behaviour than humans, as they are driven
by algorithms. Such regularities should lead to relatively extreme behaviour, such as
low or high numbers of URLs or out-of-vocabulary words. Other examples might be
low type-token ratio because a limited vocabulary is used, high type-token ratio because
the tweets are less mutually related, or low standard deviations for tweet length. In total
I took 71 measurements, including totals over all tweets, means and coefficients of
variation for specific measurements per tweet, and some richness measures. Specific
examples (those actually used in the experiments) are shown in Tables 1 and 2.5 6
        </p>
        <p>Many of these measurements are mutually correlated. In the initial phases
of my work on the task, I handpicked7 subsets for English and Spanish that
yielded the best classification on their own. In later phases it turned out that the
standard Federales models performed particularly well (on the training data), especially
after splitting the data into clusters. There was no time to return to the measurement
features, so in principle this part of the system can still be improved. This will have
to wait for future work. Information for the features that have been used in the current
experiments is listed in Tables 1 and 2. The counts for low and high values here are
based on a threshold of 2 for a z-score with regard to the mean and standard deviation
for human authors, as listed in the tables.8
5 Words were called out-of-vocabulary if they did not occur in the word lists I had available.</p>
        <p>
          For Spanish I used the file espanol.txt as provided at http://www.gwicks.net/dictionaries.htm.
Unfortunately, the words in the list were not normalized as the words from the text were, which
led to the rather high OOV measurements of 34% and 52%. However, as both human and bot
authors are mismeasured in the same way, the results still hold information. For English I used
a wordlist derived from the British National Corpus combined with a wordlist which on double
checking the software turned out (major embarrassment) to be the wrong one, namely one for
Dutch. Obviously, both lists can be improved upon.
6 IDF for English is based on the British National Corpus. For Spanish I did not have access to
an IDF list.
7 This procedure can be automated, but I did not do this at this time.
8 In the actual recognition, all values over 0.7 are taken into account.
        </p>
      </sec>
      <sec id="sec-2-3a">
        <title>Federales Classification</title>
        <p>
          The Federales system builds on the Linguistic Profiling system, which has been used in
various studies, such as authorship recognition [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ][
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], language proficiency[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], source
language recognition[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and gender recognition[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The approach is based on the
assumption that (relative) counts for each specific feature typically move within a specific
range for a class of texts and that deviations from this typical behavior indicate that the
deviating text does not belong to the class in question. If the frequency range for a
feature is very large, the design of the scoring mechanism ensures that the system mostly
ignores that feature. For each feature, the relative counts9 for all samples in the class are
used to calculate a mean and a standard deviation.10 The deviation of the feature count
for a specific test sample is simply the z-score with respect to this mean and standard
deviation, and is viewed as a penalty value. Hyperparameters enable the user to set a
threshold below which deviations are not taken into account (the smoothing threshold),
a power to apply to the z-score in order to give more or less weight to larger or smaller
deviations (deviation power), and a penalty ceiling to limit the impact of extreme
deviations.
9 I.e. the absolute count divided by the corresponding number of items, e.g. the count of a token
in a retweet divided by all tokens within retweets, or a character n-gram count divided by the
number of characters in the text.
10 Theoretically, this is questionable, as most counts will not be distributed normally, but the
system appears quite robust against this theoretical objection.
When comparing two classes, a further hyperparameter sets a power value for the
difference between the two distributions (difference power), the result of which is then
multiplied with the deviation value. The optimal behaviour in cases where a feature is
seen in the training texts for the class but not in the test sample, or vice versa, is still
under consideration. In the current task, features only seen in the test sample are ignored;
features only seen in the training texts are counted as they are, namely with a count of
0 in the test sample. The penalties for all features are added. A set of benchmark texts
is used to calculate a mean and standard deviation for the penalty totals, to allow
comparison between different models. For verification, the z-score for the penalty total is an
outcome by itself; for comparison between two models, the difference of the z-scores
can be taken; for attribution within larger candidate sets (such as the clusters described
in Section 5), the z-scores can be compared. In all cases, a threshold can be chosen for
the final decision.
        </p>
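        <p>The per-feature scoring mechanism described above can be sketched as follows. This is a minimal illustration with my own naming, covering the hyperparameters mentioned in the text (smoothing threshold, deviation power, penalty ceiling) and the handling of features unseen in the test sample:

```python
def feature_penalty(count, mean, sd, smooth=0.0, power=1.0, ceiling=40.0):
    # Penalty for one feature: the z-score of the test sample's relative
    # count w.r.t. the class distribution, smoothed, powered, and capped.
    if sd == 0.0:
        return 0.0
    z = abs(count - mean) / sd
    if z < smooth:                 # deviations below the smoothing threshold are ignored
        return 0.0
    return min(z ** power, ceiling)

def total_penalty(sample_counts, class_stats, **hyper):
    # Sum over the training features; a feature absent from the test sample
    # is counted with a relative count of 0, as described in the text, while
    # features only seen in the test sample are ignored.
    return sum(
        feature_penalty(sample_counts.get(f, 0.0), m, s, **hyper)
        for f, (m, s) in class_stats.items()
    )
```

The penalty total would then be turned into a z-score against the benchmark texts before comparing models.</p>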
      </sec>
      <sec id="sec-2-4">
        <title>Extreme Language Use (XLU) Determination</title>
        <p>The features for the overall measurements could have been mixed into the general
feature set, but there their influence would have been minimal, given the enormous number of
surface features. Instead I processed these measurements in a different way, for now
dubbed an XLU score (eXtreme Language Use). After investigating the values on
the training data, I set a consideration threshold of 0.7 on the z-score. For any feature
whose z-score with regard to the mean and standard deviation for human authors is higher
than 0.7 or lower than -0.7, the excess is counted as XLU points. The XLU score
for the sample is simply the sum of the XLU points over all selected features. My
expectation was that bots should be recognizable by their high XLU score. As explained
above (Section 2.3), feature selection was done early in the work on the task and may
not be optimal. This is also true for the threshold of 0.7. However, the remaining
hyperparameter, a threshold above which an author is predicted to be a bot, was chosen
in the final phases, separately for each cluster (see below in Section 5).</p>
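        <p>The XLU computation can be sketched as follows. This is a minimal illustration with my own naming, where the statistics are the means and standard deviations observed for human authors:

```python
def xlu_score(measurements, human_stats, threshold=0.7):
    # Sum, over all selected measurement features, the excess of each
    # |z-score| beyond the consideration threshold.
    score = 0.0
    for name, value in measurements.items():
        mean, sd = human_stats[name]
        if sd == 0.0:
            continue
        z = abs(value - mean) / sd
        if z > threshold:
            score += z - threshold   # only the excess counts as XLU points
    return score
```

An author is then predicted to be a bot when this score exceeds a per-cluster threshold.</p>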
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Training Procedure</title>
      <p>From the initially provided data sets, I held out 400 English authors (200 bots, 100
females, 100 males) and 300 Spanish authors (150 bots, 75 females, 75 males), all
randomly selected from the whole data set, as development material. Later during the
task, the organisers provided their own train-development split, with 620/310/310 and
460/230/230 samples held out. With both divisions, I trained XLU and Federales
models on the training sets (just the human samples for the gender tests) and classified the
development sets. Using the gold standard annotations, I selected optimal
hyperparameter settings and tested the system. The results were surprising. Where accuracies were
over 90% for my original held-out test set, they were much lower (for Spanish gender
under 70%) for the organizer train-development split. Although overtraining is likely to
play a role (as perfect and near perfect scores are inherently suspicious), the differences
between the two train-development splits are much larger than one would expect from
simple overtraining effects. Not having information about the exact composition of the
train and development sets provided by the organisers, one of my hypotheses11 was that
the text in the development set was somehow different from that in the training set. A
visual check did not immediately show clear differences. However, when I trained a
classifier for distinguishing between training and development set, most settings led to
an accuracy over 90% and some even over 95%. Unfortunately, inspection of the most
distinguishing unigrams still did not provide a clear picture of the exact difference
between the sets. Seeing that no information was available on the nature of the sets, and
obviously also not on the composition of the eventual test set, I had to adapt to potential
unknown differences. The first step in this was a simplification. The different sets led to
very different optimal hyperparameter settings. I therefore decided to drop hyperparameter
tuning and to select the simplest hyperparameters: no smoothing threshold, a penalty
ceiling of 40, and no power applied to the deviation or to the model distance difference. I would
have preferred to avoid score thresholds for the Federales scores as well, which would
ideally be at 0. However, it turned out that on the training and development data the optimal
thresholds were not 0. As a result, I picked the various thresholds by hand. In addition,
thresholds were also needed for the XLU scores, which have no natural threshold. These
too I picked by hand. An author was predicted to be a bot if either score exceeded its
threshold.12 The second adaptation was anything but a simplification. As different subsets of
the data proved to lead to different outcomes, it seemed a good idea to split the authors
into (data-driven) subsets. Any new author could then first be assigned to a subset, after
which the models for that subset would be applied. Given the size of the whole data sets,
I intuitively decided on seven subsets of authors. How these were derived is described
in the next section.
11 The others included a bug in my own systems.
12 Unfortunately, the testing phase showed that this “tuned” thresholding again led to overtraining.</p>
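      <p>The combined decision rule described above can be sketched as follows. This is an illustration with my own naming, where both thresholds were picked by hand, separately per cluster:

```python
def predict_bot(federales_diff, xlu, fed_threshold, xlu_threshold):
    # An author is predicted to be a bot if either the Federales score
    # difference or the XLU score exceeds its hand-picked threshold.
    return federales_diff > fed_threshold or xlu > xlu_threshold
```
</p>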
    </sec>
    <sec id="sec-4">
      <title>Clustering</title>
      <p>For both languages, I built frequency lists of normalized original tokens13 in the full
data set, i.e. training plus development. I then examined the top of the list to select
the roughly 1000 most frequent tokens. For English this led to a list of 1162 tokens
occurring in at least 300 authors, and for Spanish to a list of 1294 tokens occurring in at least 150
authors. I then built frequency vectors for each sample and used k-means clustering to
produce seven clusters.14 The resulting clusters had sizes 189, 448, 270, 138, 418, 277
and 320 for English, and 357, 268, 141, 145, 66, 225 and 298 for Spanish. For these
seven clusters per language, I ran Federales classification (with the hyperparameters
mentioned above) using the full dataset for both training and testing. Only samples for
which the score for their own cluster was higher than 0.5, and for which the second
highest scoring cluster was more than 10% behind, were used as prototype samples for
the cluster, i.e. used as training samples in the final cluster classification. For English,
clusters 1 and 4 kept all their samples, cluster 2 lost 46 (3 assigned to other cluster, 43
unassigned), cluster 3 lost 8 (0, 8), cluster 5 lost 84 (18, 66), cluster 6 lost 17 (6, 11) and
cluster 7 lost 20 (0, 20). For Spanish, cluster 1 lost 124 (30 assigned to other clusters,
94 unassigned), cluster 2 lost 27 (3, 24), cluster 3 lost 4 (1, 3), cluster 4 lost 4 (2, 2),
cluster 5 lost none, cluster 6 lost 32 (5, 27) and cluster 7 lost 47 (11, 36). In the final
classification, the threshold for acceptance was lowered to 0, but the minimum distance
to the runner up was kept to 10%. Tables 3 and 4 show the final attribution to clusters
for the training and development sets. For some clusters we see that there are indeed
differences between training and development set, e.g. English cluster 5 where
predominance of females/males switches from training to development set, or Spanish cluster
4 where the training set still has 18 males on a total of 117 authors but the development
set no males at all versus 55 females. We also see that clustering already goes quite far
in distinguishing between bots and humans. Females and males, on the other hand, are
both well represented in all clusters.15</p>
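      <p>The prototype selection described above can be sketched as follows. This is a minimal illustration with my own naming, assuming the 10% margin is taken relative to the winning score:

```python
def is_prototype(cluster_scores, own_cluster, accept=0.5, margin=0.10):
    # A sample is a prototype for its own cluster if its own-cluster score
    # exceeds the acceptance threshold and the runner-up cluster is more
    # than `margin` behind.
    own = cluster_scores[own_cluster]
    runner_up = max(s for c, s in cluster_scores.items() if c != own_cluster)
    return own > accept and (own - runner_up) > margin * own
```

In the final cluster classification, the acceptance threshold was lowered to 0 while the margin was kept.</p>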
      <p>In the prediction phase, each test sample was submitted to an attribution choice
between a cluster and the set of all human training samples, for each of the seven clusters. The
models for the cluster with the strongest attribution score were applied to that sample.16
Models were also created from the samples in the training data that were not assigned
to any cluster, to be used for unattributed test samples.</p>
    </sec>
    <sec id="sec-5">
      <title>Training Results</title>
      <p>
        Application of the finally submitted system on the training and development set led
to the confusion tables shown in Tables 5 and 6. The corresponding accuracies are also
shown in these tables. For this material, it appeared to be possible to separate bots from
humans and females from males with an extremely high level of accuracy. If the test
samples were sufficiently similar to the training data, they too should be classified with
high accuracy.17
13 In the full feature set, these are the features marked CTO.
14 I used the function stats::kmeans in R [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with the Hartigan-Wong method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a maximum of
40 iterations, 20 restarts and obviously a target of 7 centers.
15 It would be very interesting to investigate how the clusters differ from each other. I have
postponed this investigation until the test data and hopefully metadata have become available.
16 I contemplated a combination of all clusters accepting the sample, but I deemed this too
complicated for the current experiments.
      </p>
      <p>I would like to point out that, theoretically, the high accuracy was not a natural
consequence of testing on the training data. Both XLU and Federales are greedy methods
based on means and z-scores over all samples. It was not as if the classifier could
recognize an individual sample and reproduce its class. The feature values for each sample
were embedded in distributions for all samples in the class training set, which tended to
be tens or hundreds of samples.</p>
      <p>On the other hand, the accuracy was too high to be believable. The earlier
discrepancy between tests on a small random held-out set and on the organizer-provided
train-test split was also reason for doubt. Unfortunately, if indeed there was some
regularity in the current training data which would not recur in the test data, and scores on
the test data therefore would turn out to be much lower, more extensive metadata was
needed to determine the nature of this regularity, and from there the way to adapt the
system to become robust against such overtraining.
17 The original version of this paper was written before the test results were available. For the
revised version, the test scores were known, but not the test data or metadata. I have decided
to leave this section in its original form, showing my reasoning before the test phase, and to
insert a new section below, commenting on the test results.</p>
      <p>A side-effect of the high accuracy of the Federales models was that the XLU scores
hardly played a role anymore. As this was a major focus of my initial plan, I ran a
separate prediction with only XLU, with new thresholds.18 The results are found in
Table 7.</p>
      <p>In general, results were very good. For Spanish, the bot-dominated cluster 5 was
problematic, as many humans also had high scores. For English, there were no
bot-dominated clusters, but cluster 7 had a slightly larger minority of bots and also yielded
somewhat lower results; cluster 1 was similar but did not seem to be problematic. Again,
these were the results on the training data, with optimal thresholds.
18 The thresholds in the submitted run were tuned for optimal correction of mistakes by the
Federales models.</p>
      <p>
        Given the promising results on the training data, I selected these models and
thresholds for the system to upload to TIRA for a blind test[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], in order to see if the quality
would hold up on the test data.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Test Results</title>
      <p>
        As stated in the previous section, results on the training data were good, in fact
suspiciously good.19 Still, before the test run, the system appeared to be the best choice
at that time. The test results showed, however, that it was not. With bot recognition
scores of 89.6% (English) and 82.8% (Spanish),20 the system was not even close to the
best scores (96.0% for English and 93.3% for Spanish)21 and worse than the serious
baselines, based on character n-grams (93.6%/89.7%), word n-grams (93.6%/88.3%),
word2vec (90.3%/84.4%) and LDSE[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (90.5%/83.7%). The enormous gap between
training scores and test scores demonstrates that some kind of overtraining must have
occurred. The clustering, which improved scores on the training data, now probably
only served to aggravate the overtraining.
      </p>
      <p>The question, now, is what the cause of the overtraining was. Generally, in machine
learning, overtraining is the result of insufficient similarity between training and test
data. If the test authors had been drawn randomly from the same pool as the training
authors, the system should in principle have done better. If, however, the test authors
stem from another source, this would explain the rather disappointing quality in the test
run. However, the author profiling task was not presented as a cross-genre task, like the
author attribution task was, so this should not be the main cause. To determine which
other factor(s) might still have been at work, I will have to investigate the test data and,
possibly even more importantly, the metadata describing the sources and their sampling.
19 This section was inserted into the paper after the test results were made available.
20 As gender scores (English 74.2% and Spanish 67.3%) are partly based on the bot scores, I
cannot judge at this time how well my gender recognition worked by itself.
21 Not reached by the same system.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>The data that was provided for training could be modeled very well with both feature
sets that I applied, token unigrams and character n-grams on one side and overall
measurements on the other. Classification quality was high when training on the set as a
whole, and improved further after clustering the authors and applying separate
thresholds for the various clusters.</p>
      <p>However, the derived model performed disappointingly on the test data. The reasons
for this can only be determined when the test data and metadata become available, and
will therefore have to wait for future work. This future work will then also have to show
the real potential of my proposed approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. van Halteren,
          <string-name>
            <surname>H.</surname>
          </string-name>
          :
          <article-title>Linguistic Profiling for authorship recognition and verification</article-title>
          .
          <source>In: Proceedings ACL</source>
          <year>2004</year>
          . pp.
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. van Halteren,
          <string-name>
            <surname>H.</surname>
          </string-name>
          :
          <article-title>Author verification by Linguistic Profiling: An exploration of the parameter space</article-title>
          .
          <source>ACM Transactions on Speech and Language Processing (TSLP) 4</source>
          (
          <issue>1</issue>
          ),
          <volume>1</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. van Halteren,
          <string-name>
            <surname>H.</surname>
          </string-name>
          :
          <article-title>Source language markers in Europarl translations</article-title>
          .
          <source>In: Proceedings of COLING2008, 22nd International Conference on Computational Linguistics</source>
          . pp.
          <fpage>937</fpage>
          -
          <lpage>944</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. van Halteren,
          <string-name>
            <surname>H.</surname>
          </string-name>
          :
          <article-title>Metadata induction on a Dutch Twitter corpus. initial phases</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          <volume>5</volume>
          ,
          <fpage>37</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. van Halteren,
          <string-name>
            <surname>H.</surname>
          </string-name>
          :
          <article-title>Cross-domain authorship attribution with Federales, Notebook for PAN at CLEF2019</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (eds.)
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. van Halteren,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Oostdijk</surname>
          </string-name>
          , N.:
          <article-title>Linguistic Profiling of texts for the purpose of language verification</article-title>
          .
          <source>In: Proceedings of the 20th international conference on Computational Linguistics</source>
          . p.
          <fpage>966</fpage>
          .
          Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speerstra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Gender recognition of Dutch tweets</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          <volume>4</volume>
          ,
          <fpage>171</fpage>
          -
          <lpage>190</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hartigan</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Algorithm AS 136: A k-means clustering algorithm</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series C (Applied Statistics)</source>
          <volume>28</volume>
          (
          <issue>1</issue>
          ),
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>R Development Core Team</surname>
          </string-name>
          :
          <article-title>R: A Language and Environment for Statistical Computing</article-title>
          . R Foundation for Statistical Computing (
          <year>2008</year>
          ), http://www.R-project.org
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          In:
          <source>Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16)</source>
          . pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          . Springer-Verlag,
          <source>LNCS(9624)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>