<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniNE at PAN-CLEF 2019: Bots and Gender Task Notebook for PAN at CLEF 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Catherine Ikae</string-name>
          <email>Catherine.Ikae@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukanya Nath</string-name>
          <email>Sukunya.Nath@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacques Savoy</string-name>
          <email>Jacques.Savoy@unine.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Neuchatel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>When participating in the “bots and gender” subtask (both in English and Spanish), our aim is to automatically detect different text sources (a sequence of tweets sent by a bot or a human). When a text is identified as being sent by a human, the system must determine the author's gender (author profiling). To solve these questions, we focus on a simple classifier (k-NN, k = 5) usually able to produce a correct answer, but not in an efficient way. Thus, we apply a feature selection procedure to reduce the number of terms (to around 200 to 500). We also propose to apply a Zeta model to reduce the number of decisions taken by the k-NN classifier. In this case, we focus on terms used in one category and ignored or used rarely by the other. In addition, the Type-Token Ratio (TTR) and the lexical density (LD) present some merit to discriminate between tweets sent by a bot (TTR &lt; 0.2, LD ≥ 0.8) or by humans (TTR ≥ 0.2, LD &lt; 0.8).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In the last two decades, UniNE has participated in different CLEF evaluation campaigns with the objective of creating new test collections on the one hand and, on the other, of promoting research in different NLP domains. This year, our team takes part in CLEF-PAN in the subtask “bots and gender profiling”, using both the English and Spanish corpora
        <xref ref-type="bibr" rid="ref7 ref9">(Rangel &amp; Rosso, 2019)</xref>
        .
      </p>
      <p>
        Within this track, given a set of tweets, the computer must identify whether this
sequence was sent by a bot or a human. In the latter case, the author gender must be
determined. This author profiling question is not new (Schwartz et al., 2016) and has
been the subject of previous evaluation campaigns
        <xref ref-type="bibr" rid="ref7 ref8">(Potthast et al., 2019a)</xref>
        . This
problem presents interesting questions from a linguistics point of view because the web
offers new forms of communication (chat, forum, e-mail, social networks, etc.). It was
recognized
        <xref ref-type="bibr" rid="ref3">(Crystal, 2006)</xref>
        that such communication channels might be viewed as new
forms between the classical oral and written usage. In addition, CLEF-PAN campaigns
allow us to access large text corpora to verify stylistic assumptions and to detect new
facets in our understanding of gender differences
        <xref ref-type="bibr" rid="ref6">(Pennebaker, 2011)</xref>
        .
      </p>
      <p>The rest of this paper is organized as follows. Section 2 describes the text datasets, while Section 3 describes our feature selection procedure. Section 4 exposes our combined Zeta and k-NN classifier, and Section 5 shows some of our evaluation results. The conclusion summarizes the main findings of our experiments.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Corpus</title>
      <p>When faced with a new dataset, a first analysis is to extract an overall picture of the
data, their relationships, and to detect and explore some simple patterns related to the
different categories. A statistical overview of these PAN datasets is provided in Section
2.1 while Section 2.2 focuses on the emoticon distribution across the different
categories. The distribution of the Type-Token ratio values is exposed in Section 2.3.
Section 2.4 proposes to use the lexical density to discriminate between bots and
humans. Finally, Section 2.5 exposes a brief overview of the distribution of positive
and negative emotions.</p>
      <sec id="sec-2-1">
        <title>2.1 Overall Statistics</title>
        <p>To design and implement our classification system, a training corpus was available
in the English and Spanish languages. As depicted in Table 1, the training data contains
the same number of documents (one document = a sequence of 100 tweets) in the bots
and human categories. In the latter case, one can find exactly the same number of
documents written by men and women (1,030 in English, 750 in Spanish).</p>
        <p>These values are obtained by concatenating the two subsets made available by the
organizers, namely the train and dev parts. To be precise, the train subset is
composed of 2,880 English documents and the dev by 1,240 items (for a grand total
of 4,120). For the Spanish corpus, one can count respectively 2,080 and 920 documents
(total: 3,000).</p>
        <p>As each document is not a single tweet (but usually 100), the mean number of tokens per document is around 2,097 for the English language (median: 1,920; sd: 961.4; min: 100; max: 5,277). For the Spanish language, the mean length is 1,889 (median: 1,925.5; sd: 619.2; min: 100; max: 4,933). In this computation, punctuation symbols and emoticons (or sequences of them) count as tokens. For example, from the expression “Paul’s books!!!”, our tokenizer returns {paul ’ s book !!!}. As we can see, a light stemmer was applied, removing only the plural suffix ‘-s’ (Harman, 1991). This choice is justified by keeping the word meaning as close as possible to the original one (which is not the case, for example, with Porter’s stemmer reducing “organization” to “organ”).</p>
        <p>Table 1 reports, per category, the number of documents (Nb. doc.), the number of tweets, the mean document length, and the vocabulary size (|Voc|). For the English bots category, for instance, one counts 2,060 documents, 205,919 tweets, and a mean length of 2,097 tokens.</p>
        <sec id="sec-2-1-1">
          <title>Bots</title>
          <p>Looking at the mean length for both genders, Table 1 does not corroborate the
common assumption that “women are more talkative than men”. For the English
language, the mean is slightly higher for women (2,123 vs. 2,014) but not for the
Spanish corpus (1,821 vs. 1,964).</p>
          <p>
            As text categorization problems are known for having large and sparse feature sets
            <xref ref-type="bibr" rid="ref12">(Sebastiani, 2002)</xref>
            , Table 1 indicates the number of distinct terms per category (the vocabulary size, denoted |Voc|), which is 101,826 for the English bots category. Moreover, for both languages, the vocabulary size is larger for the human category than for the bots (English: 101,826 vs. 162,384; Spanish: 119,965 vs. 147,109). The texts sent by bots are certainly composed with a smaller vocabulary, and the same or similar expressions are often repeated, as in the following bot tweets:
          </p>
          <p>“#JOB #medical Anesthesiologist https://t.co/t8C84NGQuI #hiring #health https://t.co/HlAmnmpjPZ”</p>
          <p>“#JOB #medical Mental Health Nurse https://t.co/i9PEEOOxz2 #hiring #health https://t.co/HlAmnmpjPZ”</p>
          <p>“11:21 Of the Izharites, the Hebronites, the family of the LORD, that I am a brother to wife.”</p>
          <p>“9:2 And he called for their land to Assyria unto this day have I drawn thee.”</p>
          <p>Of course, the tweets produced by bots are not really generated by computers but correspond to retweets or tweets showing text excerpts extracted from a larger corpus. To illustrate this, Table 2a exposes six examples of tweets generated by three bots, while Tables 2b and 2c present four tweets written by two women and men.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2 Emoticons</title>
          <p>
            An interesting aspect of web communication
            <xref ref-type="bibr" rid="ref3">(Crystal, 2006)</xref>
            is the frequent usage of emoticons to denote an author’s emotions or to shorten the message. Table 3 shows the most frequent emoticons per category and language. From this table, it is not fully clear how we can detect a pertinent pattern suitable for automatic classification. One can infer that humans employ such symbols more frequently than bots. On the other hand, women show a higher usage of emoticons, but without an important difference in the emoticon types used.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.3 Type-Token Ratio</title>
        <p>
          As bots could be deployed to send a repetitive message (maybe with a slight modification), one can assume that the TTR value (the number of distinct word-types divided by the number of word-tokens) should be smaller than for a sequence of tweets written by a human. Of course, the text genre has a clear impact on this estimation, with a lower TTR value for an oral production compared to a written message. As a comparison basis, the TTR achieved by Trump was 0.297 vs. 0.362 for Hillary Clinton (oral form, primary debates)
          <xref ref-type="bibr" rid="ref11">(Savoy, 2018)</xref>
          . Over all candidates, Trump achieved the lowest value, depicting a candidate with a reduced vocabulary who repeats the same expressions again and again. These examples indicate that values smaller than 0.25 or 0.2 represent a clear lower limit for a message.
        </p>
        <p>Based on the training set (English language), the TTR values have been computed
for documents sent by bots and humans. The two resulting distributions are depicted
in Figure 1. In this case, one can see that messages sent by bots tend to contain the
same or similar expressions resulting in a lower TTR value, even lower than 0.2
(usually producing a boring message). A similar picture can be obtained with the
Spanish language (see the Appendix).</p>
        <p>With the English training data, one can count 398 documents generated by a bot having a TTR value smaller than 0.2 (over 2,060, or 19.3%). On the other hand, only 13 documents having a TTR smaller than 0.2 have been written by humans. For the Spanish corpus, one can find 843 documents generated by bots with a TTR value smaller than 0.2 (over 1,500, or 56.2%), and none by human beings.</p>
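        <p>The TTR filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' exact tokenizer (which also splits apostrophes and applies a light stemmer); a document whose TTR falls below 0.2 is flagged as bot-like.</p>

```python
# Minimal sketch of the TTR-based bot filter (illustrative tokenizer only).
import re

def tokenize(text):
    # rough tokenizer: words and punctuation runs both count as tokens
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def type_token_ratio(tokens):
    # number of distinct word-types divided by the number of word-tokens
    return len(set(tokens)) / len(tokens) if tokens else 0.0

repetitive = tokenize("#JOB #medical Anesthesiologist #hiring #health " * 40)
print(type_token_ratio(repetitive) < 0.2)  # prints True: repetitive text, low TTR
```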
        <p>Figure 1: Histogram of the Type-Token Ratio (TTR) values, from 0.0 to 0.7, for bots vs. humans (English corpus).</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4 Lexical Density</title>
        <p>The lexical density measures the percentage of content words in a text. This
percentage can also be estimated by considering the number of functional words in a
text and assuming that a word could be either a content word or a functional one (see
Eq. 1). In our implementation, the English language has 571 functional words while
for Spanish such a wordlist counts 350 entries.</p>
        <p>LD(T) = nb(content words) / nb(tokens) = 1 − nb(functional words) / nb(tokens)  (1)</p>
        <p>Figure 2: Histogram of the lexical density (LD) values, from 0.2 to 1.0, for bots vs. humans (Spanish corpus).</p>
        <p>As shown in Figure 2, bots tend to present a higher LD value than the set of tweets sent by humans. For example, by assuming that the maximum value for a document written by a human is 0.8, the system can consider documents having a larger value as sent by bots. On the training set, one can count 322 English documents or 216 Spanish ones (sent by bots) above this value, for one single English document written by a human (and none in the Spanish corpus).</p>
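        <p>Equation 1 amounts to counting functional words. The following hedged sketch uses a tiny illustrative stopword list (the paper's English list counts 571 functional words; this one is only for demonstration):</p>

```python
# Hedged sketch of Eq. 1 with an illustrative (not the paper's) stopword list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "that", "it"}

def lexical_density(tokens):
    # LD(T) = nb(content words) / nb(tokens) = 1 - nb(functional words) / nb(tokens)
    functional = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return 1.0 - functional / len(tokens) if tokens else 0.0

print(lexical_density("the cat sat on the mat".split()))  # 1 - 3/6 = 0.5
```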
      </sec>
      <sec id="sec-2-4">
        <title>2.5 Emotion Distribution</title>
        <p>
          With the English language, we have a list of words corresponding to positive (159 entries) and negative (151 entries) emotions, extracted from the LIWC (Linguistic Inquiry and Word Count)
          <xref ref-type="bibr" rid="ref14">(Tausczik &amp; Pennebaker, 2010)</xref>
          . According to Pennebaker’s findings (2011), one can expect a larger number of emotional words in tweets written by women. According to the data depicted in Table 4, such a difference does exist, but it is rather small. Moreover, when analyzing only the emotions expressed with words, the mean is rather low (2.5%) and even smaller for bots (1.86%). One can also consider that emotions are also conveyed by emoticons, and thus we need to take into account both the emoticons and the words indicating emotions.
        </p>
        <p>Table 4: Mean, median, and standard deviation of the percentage of emotional words per category.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 The Feature Selection</title>
      <p>From our point of view, the key function of a successful classifier is to be able to generate a good feature set. Moreover, we also want to understand the proposed attribution and be able to explain it in plain English. Therefore, one of our main objectives is to reduce the feature space by one to three orders of magnitude compared to a solution based on all possible isolated words. As shown in Table 5, the vocabulary size (|Voc|) is large for all categories and languages. If text categorization can be characterized by such huge feature spaces, they are also sparse (when considering isolated words, or n-grams of words or letters). Many terms occur just once (hapax legomena) or twice (dis legomena). Ignoring those words reduces the vocabulary size by around 50%.</p>
      <p>
        To reduce this feature set and based on the training data, terms (isolated words or
punctuation symbols in this study) having a tweet frequency (df) smaller than a
predefined threshold (fixed at 9 in our experiments) are ignored
        <xref ref-type="bibr" rid="ref10">(Savoy, 2015)</xref>
        . With
the English bots corpus (see the first two rows in Table 5), this filter reduced the feature
space from 101,826 to 15,478 dimensions (a reduction of 84.8%). Higher reduction
rates can be achieved with the human vocabulary (English: 162,384 to 14,728 (90.9%);
Spanish: 147,109 to 13,866 (90.6%)).
      </p>
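      <p>This first reduction step can be sketched as follows; `df_filter` is a hypothetical helper name, not the authors' code, and terms whose document frequency falls below the threshold (9 in our experiments) are dropped.</p>

```python
# Minimal sketch of the df-based filter: keep terms whose document frequency
# (number of documents containing the term) reaches the threshold.
from collections import Counter

def df_filter(docs, min_df=9):
    # docs: list of token lists
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return {term for term, n in df.items() if n >= min_df}

docs = [["spam", "ham"], ["spam", "eggs"], ["spam"]]
print(df_filter(docs, min_df=2))  # prints {'spam'}
```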
      <p>Using the term frequency difference, we can observe the terms employed more in each category. For example, the terms “urllink” (replacing the sequence “http://aref”), “job”, “developer”, “and”, “hiring” or “swissmade” appear more frequently in tweets sent by bots than by humans. As other examples, we can mention that men use more frequently: “the”, “that”, “it”, “he”, “a”, “is” and the punctuation symbols “.”, “,”. Women's tweets contain more “rt” (retweets), “you”, “to”, “my”, “your”, “thank”, “me”, “love” and the punctuation symbols “:”, “&amp;”. These short examples tend to confirm part of Pennebaker’s (2011) findings, indicating that definite articles are more frequently used by men, while personal pronouns and emotions tend to appear more often in female messages.</p>
      <sec id="sec-3-2">
        <title>Bots</title>
        <p>
          To go further in this space reduction, one can add a final third step by applying a feature selection procedure. For example, one can reduce the feature space to a value between 200 and 500, yielding a manageable space for explaining the proposed decision. Previous studies indicate that the odds ratio, mutual information, or occurrence frequency tend to produce effective reduced term sets for different text categorization tasks
          <xref ref-type="bibr" rid="ref12">(Sebastiani, 2002)</xref>
          ,
          <xref ref-type="bibr" rid="ref10">(Savoy, 2015)</xref>
          .
        </p>
        <p>
          In addition, our classifier will also consider terms used frequently in one category and ignored or used rarely by the other (Zeta model)
          <xref ref-type="bibr" rid="ref1">(Burrows, 2007)</xref>
          ,
          <xref ref-type="bibr" rid="ref2">(Craig &amp; Kinney, 2009)</xref>
          . To achieve this, terms appearing only in a single category are extracted and ranked according to their term frequency (tf) or document frequency (df). Instead of considering all of them, only the top 200 most frequent ones (based on the tf and df statistics) are judged useful to discriminate between the two classes. The two wordlists (each containing 200 entries) are merged to generate the final terms able to discriminate between the two categories. The size of those lists is depicted in Table 5 under the label “Voc Uniq” (e.g., English bots: 345, English human: 373). For example, within the bots category, one can find terms such as “camber”, “cincinnati”, “cooperative” or “norwalk”. The male category is characterized by terms such as “outwildtv”, “obstruction”, “avalanche” or “golfer”, while in tweets written by women, one can find “gown”, “allergy” or “ballet”.
        </p>
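        <p>The wordlist construction above can be sketched as follows; this is a hedged rendering loosely mirroring the paper's topVoc() function, keeping terms frequent in one category but occurring at most min_occ times in the other.</p>

```python
# Hedged sketch of the Zeta-style unique-wordlist construction (names are
# illustrative, not the authors' code).
from collections import Counter

def top_voc(freq_c1, freq_c2, top=200, min_occ=3):
    # freq_c1, freq_c2: term -> frequency (tf or df) in each category
    unique = {t: f for t, f in freq_c1.items() if freq_c2.get(t, 0) <= min_occ}
    return [t for t, _ in Counter(unique).most_common(top)]

bots = {"job": 50, "hiring": 30, "love": 2}
humans = {"love": 40, "job": 1}
print(top_voc(bots, humans, top=2))  # prints ['job', 'hiring']
```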
        <p>
          It is also interesting to analyze the distribution of definite articles and some pronouns
          <xref ref-type="bibr" rid="ref6">(Pennebaker, 2011)</xref>
          in both languages, as depicted in Tables 6a (English) and 6b (Spanish). In those tables, the number of documents in each gender is the same and represents half of those appearing in the column “Bots”. Thus, looking at the frequencies, one can expect a pattern such as 2:1:1 when the term occurrence frequency is the same across the different categories.
        </p>
        <p>The frequencies depicted in Table 6a confirm Pennebaker’s findings. Definite articles (“the”, “a”) are more frequent with male writers, and personal pronouns (“i”, “you”, “me”, etc.) are more often used by women. Exceptions can be found: the English pronoun “he” is clearly employed more often by men. Bots frequently adopt the pronouns “you” or “we” and use “she” more rarely. Is the bot style more feminine?</p>
        <p>The Spanish corpus also confirms Pennebaker’s conclusions. Definite articles (“el”, “un”, “una”, etc.) appear more frequently with men, and personal pronouns (“yo”, “tu”, “ella”, etc.) are more associated with the women’s style. The Spanish pronouns “nosotros” (we) and “vosotros” (you, plural) are usually not present, but this indication is often implicit in the verbal suffixes (e.g., “podemos”, we can). (A linguist will also infer that the frequencies of such pronouns will be rather small due to their 8-letter spelling, a length not reflecting the least-effort principle.)</p>
        <p>Finally, when analyzing the popularity of some countries (see Table 6a), one can see that “france” is the most popular, while “swissmade” appears only with bots. For the other names, “italy” is more associated with women, while all the others are with men (except “swiss”, which is associated with bots) (due to soccer, a sport popular in Spain, Italy and Germany?).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Proposed Text Classification Strategy</title>
      <p>Our solution is based on a three-stage procedure. In the first stage, the needed variables are initialized (function preProcessing() in Figure 3); they correspond to the unique vocabulary used in the two categories (VocUnC1, VocUnC2) and to the document representations belonging to the two categories (PtC1, PtC2).</p>
      <p>Based on the training data, the system extracts the vocabulary (isolated terms with
their frequency) appearing in both categories (function defineVoc()). From them,
one can determine the terms appearing frequently in one category but absent (or
occurring rarely) in the second (in our implementation, such a term can appear up to
three times (min=3) in the second category). To rank them, the term frequency (tf) or
the tweet frequency (df) statistics are applied. Instead of returning two wordlists, the
system selects the top 200 most frequent ones in the underlying category and merges
them (function topVoc()). In Steps #5 and #6, the system represents the documents
belonging to Category #1 or #2 as vectors (generating the PtC1 and PtC2 variables).</p>
      <p>After this initialization, each document belonging to the test sample can be processed
(see function binaryClassifier() in Figure 3). In Step #1, the Zeta model is
applied. This function counts the number of distinct terms appearing in VocUnC1
(denoted N1) and in VocUnC2 (or N2). If (N1 &gt; N2+q), the test identifies the given
document as belonging to Category #1. On the other hand, if (N2 &gt; N1+q), it is
assumed that the document must be labeled with the second category (e.g. Human). If
the Zeta reaches a decision (e.g., dec=1 for Bot, dec=2, for Human), this value is
returned.</p>
      <p>preProcessing(trainDoc)
1 vocC1 = defineVoc(trainDoc)
2 vocC2 = defineVoc(trainDoc)
3 VocUnC1 = topVoc(vocC1, vocC2, top=200, min=3)
4 VocUnC2 = topVoc(vocC2, vocC1, top=200, min=3)
5 PtC1 = definePoints(trainDoc, C1)
6 PtC2 = definePoints(trainDoc, C2)</p>
      <p>return(VocUnC1, VocUnC2, PtC1, PtC2)
binaryClassifier(newD, VocUnC1,VocUnC2,PtC1,PtC2):
decision = 0
1 dec = Zeta(newD, VocUnC1, VocUnC2, q=3)
2 if (dec == 1) or (dec == 2): return(dec)
3 aTTR = TTR(newD)
4 if (aTTR &lt; 0.2): return(dec=1)
5 dec = k-NN(newD, PtC1, PtC2, k=13)
return(dec)</p>
      <p>Otherwise, Zeta is unable to reach a clear decision (dec=0). For those cases, the TTR value (Type-Token Ratio) is computed (Steps #3 and #4). When this value is smaller than 0.2, the decision is “Bot” (dec=1). In addition, we might have computed the lexical density value and returned “Bot” if this value is larger than 0.8; this step was not included in our final submission (due to time constraints).</p>
      <p>
        In general, the Zeta model (together with the TTR value) cannot always propose a
clear answer. In this case, the system calls the k-NN function (with the new document,
and the set of points corresponding to Category #1 (PtC1) or #2 (PtC2)). In our
experiment, the k value was fixed to 13 and the distance between two text surrogates is
computed according to the Manhattan function
        <xref ref-type="bibr" rid="ref5">(Kocher &amp; Savoy, 2017)</xref>
        .
      </p>
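      <p>The cascade sketched in Figure 3 can be rendered as the following condensed Python sketch; names and data shapes are illustrative, not the authors' code, with a Zeta vote of margin q, the TTR filter, and a k-NN fallback using the Manhattan distance.</p>

```python
# Hypothetical rendering of the cascade classifier described above.
def zeta(tokens, voc_un_c1, voc_un_c2, q=3):
    # count distinct terms of the document found in each unique wordlist
    n1 = len(set(tokens) & voc_un_c1)
    n2 = len(set(tokens) & voc_un_c2)
    if n1 > n2 + q:
        return 1          # e.g., Bot
    if n2 > n1 + q:
        return 2          # e.g., Human
    return 0              # no clear decision

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def knn(vec, pts_c1, pts_c2, k=13):
    # label the training points, keep the k nearest, return the majority class
    scored = sorted([(manhattan(vec, p), 1) for p in pts_c1] +
                    [(manhattan(vec, p), 2) for p in pts_c2])[:k]
    votes = [label for _, label in scored]
    return 1 if votes.count(1) >= votes.count(2) else 2

def binary_classifier(tokens, vec, voc_un_c1, voc_un_c2, pts_c1, pts_c2):
    dec = zeta(tokens, voc_un_c1, voc_un_c2, q=3)
    if dec:
        return dec
    if len(set(tokens)) / len(tokens) < 0.2:   # TTR filter: low TTR -> Bot
        return 1
    return knn(vec, pts_c1, pts_c2, k=13)
```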
      <p>When the document is identified as sent by a human, the system re-applies the binaryClassifier() function, but with Category #1 corresponding to male and Category #2 to female (and ignoring the TTR computation).</p>
    </sec>
    <sec id="sec-5">
      <title>5 Evaluation</title>
      <p>
        Table 7 depicts the accuracy rate achieved with our model under different conditions
and for both the type (bot vs. human) and the gender (male vs. female). These results
were achieved with the English corpus using the dev test set. In the first row, all words
have been used to build the document surrogates. In the second line, the vocabulary
size was reduced to consider only terms having a df value larger than 9. In the next row
labelled “FS”, our feature selection is applied. Finally, the last five lines correspond to
a feature space reduced to 100, 200, 300, 400, or 500 terms selected by the information
gain function
        <xref ref-type="bibr" rid="ref12">(Sebastiani, 2002)</xref>
        . When applying our nearest neighbor approach,
Table 7 indicates the mean accuracy rates achieved considering k=13 or k=5 neighbors.
      </p>
      <p>To compute the accuracy rates, only the train subset is used to define the needed wordlists and document surrogates (in other words, based on 2,880 English documents and 2,080 Spanish ones). During the evaluation, only the dev subset was needed to derive the performance values (with 1,240 English documents and 920 Spanish ones).</p>
      <p>Table 7: Accuracy rates obtained with the full vocabulary (“All voc”), with terms having df &gt; 9, with our tf-based feature selection, and with 100, 200, 300, 400, or 500 terms selected by information gain (IG).</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>The important conclusion that can be drawn from Table 7 is that it is possible to
reduce the feature set to a few hundred words and to still have a good overall
effectiveness. Considering k=13 neighbors tends to produce better results (and this
solution is less prone to over-fitting).</p>
      <p>
        Table 8 reports our official results achieved with the TIRA system
        <xref ref-type="bibr" rid="ref7 ref8">(Potthast et al., 2019b)</xref>
        using the first (test set 1) or the second (test set 2) test set. These evaluations correspond to our feature selection (FS) with the inclusion of the Zeta test and the TTR filter. More information can be found in
        <xref ref-type="bibr" rid="ref7 ref9">(Rangel &amp; Rosso, 2019)</xref>
        .
      </p>
      <p>Table 8: Official accuracy rates. TIRA test set 1, k=5: type 0.8939, gender 0.7689. TIRA test set 1, k=13: type 0.8939, gender 0.7992. TIRA test set 2, k=13: type 0.9125, gender 0.7371.</p>
      <sec id="sec-6-2">
        <title>Main Findings</title>
        <p>Using the CLEF-PAN datasets of the “bots and gender profiling” task, written in English and Spanish, we were able to reach the following main findings. First, the text genre associated with bots can be viewed as repetitive, showing a low TTR value (usually lower than 0.25). After fixing a threshold (e.g., 0.2) for this value, one can detect 9.6% to 55% of the tweet sequences sent by bots (see Figure 1) with a low error rate (around 3%). For the large majority, however (90% for the English corpus), documents present a higher TTR value and no decision can be reached with this simple rule. Similarly, one can compute the lexical density value and observe that values larger than 0.8 correspond very often to bot tweets.</p>
        <p>Second, analyzing the emoticon distribution, or the most frequent emoticons, we can infer that humans tend to employ them more frequently than bots. In tweets sent by machines, the emoticons used indicate directions or appear to draw the reader’s attention (see Table 3). If humans have adopted emoticons in their web communications, it is not clear whether we can easily distinguish their usage between men and women.</p>
        <p>Third, our attribution approach is based on a cascade classifier. In a first step, the Zeta classifier is used to determine the category (bots vs. human, male vs. female) based on terms occurring frequently in the first class and never (or very rarely) in the second. When the test sample is strongly correlated with the training set, such a strategy works well and can accurately classify close to 85% of the cases for which a decision can be computed. As its main drawback, this approach fails to propose an answer when the vocabulary appearing in the new document is not clearly associated with one of the predefined wordlists. In such cases, a second classifier must be used (k-NN in our experiments, with k = 5, Manhattan distance).</p>
        <p>
          Fourth, removing terms occurring rarely or in a few documents corresponds to the first step of the proposed reduction procedure. In addition, we impose that terms appearing more frequently in a given category must be selected for that class. This strategy can be further improved by applying a term filter (e.g., mutual information, odds ratio
          <xref ref-type="bibr" rid="ref12">(Sebastiani, 2002)</xref>
          ,
          <xref ref-type="bibr" rid="ref10">(Savoy, 2015)</xref>
          ). After this step, the number of terms can be limited to 200 to 500. This last step is usually accompanied by an effectiveness decrease (around 3% to 8%, depending on the collection).
        </p>
        <p>Figure A.3: Histogram of the lexical density values for bots vs. humans (English corpus).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <article-title>All the way through: Testing for authorship in different frequency strata</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>27</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Craig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kinney</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          <article-title>Shakespeare, computers, and the mystery of authorship</article-title>
          . Cambridge University Press, Cambridge (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Crystal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Language and the internet</article-title>
          . Cambridge University Press, Cambridge (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>How effective is suffixing</article-title>
          ?
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ),
          <fpage>7</fpage>
          -
          <lpage>15</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Distance measures in author profiling</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>53</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1103</fpage>
          -
          <lpage>1119</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          <article-title>The secret life of pronouns</article-title>
          . Bloomsbury Press, New York (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>A decade of shared tasks in digital text forensics at PAN</article-title>
          .
          <source>Proceedings ECIR 2019</source>
          , Springer LNCS #
          <volume>11437</volume>
          ,
          <fpage>291</fpage>
          -
          <lpage>303</lpage>
          (
          <year>2019a</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>TIRA integrated research architecture</article-title>
          . In N. Ferro, C. Peters (eds.),
          <source>Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF</source>
          . Springer, Berlin (
          <year>2019b</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling</article-title>
          . In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            <given-names>D.</given-names>
          </string-name>
          . (eds.)
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Comparative evaluation of term selection functions for authorship attribution</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>246</fpage>
          -
          <lpage>261</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Analysis of the style and the rhetoric of the 2016 US presidential primaries</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>143</fpage>
          -
          <lpage>159</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Machine learning in automatic text categorization</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eichstaedt</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dziurzynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramones</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stillwell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seligman</surname>
            ,
            <given-names>M.E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          <article-title>Personality, gender, and age in the language of social media</article-title>
          .
          <source>PLOS One</source>
          ,
          <volume>8</volume>
          (
          <issue>9</issue>
          ) (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Tausczik</surname>
            ,
            <given-names>Y.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          <article-title>The psychological meaning of words: LIWC and computerized text analysis methods</article-title>
          .
          <source>Journal of Language and Social Psychology</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ),
          <fpage>24</fpage>
          -
          <lpage>54</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>