<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bots and Gender Identification Based on Stylometry of Tweet Minimal Structure and n-grams Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alex I. Valencia Valencia</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gomez Adorno</string-name>
          <email>helena.gomez@iimas.unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Stephens Rhodes</string-name>
          <email>stephens@nucleares.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gibran Fuentes Pineda</string-name>
          <email>gibranfp@unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Ciencias de la Complejidad</institution>
          ,
          <addr-line>UNAM</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas</institution>
          ,
          <addr-line>UNAM</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Posgrado en Ciencia e Ingeniería de la Computación</institution>
          ,
          <addr-line>UNAM</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Social media bots pose as humans to influence users for commercial, political, or ideological purposes. In the commercial case, they artificially inflate the popularity of a product by promoting it or writing positive ratings, and undermine the reputation of competing products through negative ratings. The threat is even greater when the purpose is political or ideological. Therefore, approaching the identification of bots from an author profiling perspective is highly important for marketing, forensics, and security applications. For automatic bot identification, we present an approach based on the minimal tweet structure and statistical metrics related to the entropy of every tweet. Using logistic regression, we achieved an accuracy of 86% and 90% on the Spanish and English datasets, respectively. For gender classification, we use an n-gram model that includes emojis and special characters converted to text. We achieved an accuracy of 75% and 84% on the Spanish and English datasets, respectively, also using a logistic regression classifier.</p>
      </abstract>
      <kwd-group>
        <kwd>Stylometry</kwd>
        <kwd>Tweets</kwd>
        <kwd>Bots Identification</kwd>
        <kwd>Gender Profiling</kwd>
        <kwd>n-grams</kwd>
        <kwd>Logistic Regression</kwd>
        <kwd>Entropy</kwd>
        <kwd>Emojis</kwd>
        <kwd>Special Characters</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Since its origins, human communication has changed greatly in the way information propagates,
from word of mouth to radio and TV. Society now lives in a digital era shaped by
social media, where every user is a potential propagator of information. However,
social media is populated not only by humans but also by software-controlled agents, better
known as bots. Bots are programmed according to their creators' intentions, which range from sending automated
messages to assuming specific social or antisocial behaviors [13,12,7,14].</p>
      <p>
        Similar to human interactions, bots can affect the structure and the function of a
given society [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For this reason, detecting bots can help to maintain social stability,
avoid network traps, and protect user privacy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The Bots and Gender Profiling
shared task at PAN 2019 [10] focuses on determining whether the author of a Twitter
feed is a bot or a human and, in the case of a human, on profiling the gender of the
author. The task covers English and Spanish.
      </p>
      <p>The article is organized as follows. Section 2 reviews the state of the art in
bot and gender profiling. Section 3 briefly describes the Twitter corpus of the
Bots and Gender Profiling task at PAN 2019. Section 4 details the minimal tweet
structure approach used in this research and the feature extraction, including
the preprocessing procedures for type and gender classification. Section 5 presents our
experiments and results. Finally, Section 6 draws our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Due to their social impact, there has been recent interest in the automatic detection of bots. The
PAN evaluation campaign has included an author profiling task annually since
2013, and the 2019 edition [10] addressed bot detection and gender profiling.</p>
      <p>Yang et al. [15] reviewed the literature on different types of bots, their impact, and
detection methods. They used the case study of Botometer, a popular bot detection tool
developed at Indiana University, to illustrate how people interact with AI
countermeasures.</p>
      <p>Stella et al. [12] analyzed large-scale social data collected during the Catalan
referendum for independence on October 1, 2017, consisting of nearly 4 million Twitter
posts. They identified two polarized groups, Independentists and
Constitutionalists, and quantified the structural and emotional roles played by social bots.</p>
      <p>Dickerson et al. [6] developed a collection of network, linguistic, and
application-oriented variables that could be used as features, and identified specific features that
distinguish well between humans and bots. They analyzed a large dataset related to the
2014 Indian election, showing that a number of sentiment-related factors are key to the
identification of bots. The authors achieved an area under the ROC curve
(AUROC) of 0.73.</p>
      <p>
        Cai et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed a behavior enhanced deep model (BeDM) for bot
detection, which regards user content as temporal text data instead of plain text in order to extract
latent temporal patterns. BeDM fuses content information and behavior information
using deep learning and achieves a precision of 88.41.
      </p>
      <p>Concerning gender profiling, the second-place system at PAN 2017 used word unigrams
and bigrams and character n-grams of sizes 3 to 5 as features. Additional
features included POS n-grams, emojis, and document sentiment information [8].</p>
      <p>More recently, the best-performing team in the gender profiling
shared task at PAN 2018 [5] returned to the n-gram model. At the character level, the authors used
n-grams of sizes 3 to 5. At the word level, they used unigrams, bigrams, and
trigrams for the English dataset, and unigrams and bigrams for the Spanish and Arabic datasets.
Furthermore, the winning approach applied latent semantic analysis (LSA) using the TruncatedSVD class from the
scikit-learn library.</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Description</title>
      <p>The training datasets for the author profiling task at PAN 2019 consist of tweets
written by human and bot authors, in English and Spanish. Human authors are further
labeled as female or male. The English dataset contains
1440 bots, 720 female authors, and 720 male authors. The Spanish dataset contains 1040 bots, 520
female authors, and 520 male authors. The datasets are therefore balanced with respect to the type classification classes,
bot or human. For gender classification, the labels are bot, female, or male. For each
author (Twitter user), a total of 100 tweets were provided. Authors were coded with an
alphanumeric author ID.</p>
    </sec>
    <sec id="sec-4">
      <title>Feature extraction and preprocessing</title>
      <p>Depending on the classification target, we used two different feature sets. For type
classification, we use statistical measures of different features and their corresponding entropy
values. For gender classification, we used the n-gram model proposed in [5].</p>
      <sec id="sec-4-1">
        <title>Minimal tweet structure</title>
        <p>Our hypothesis is that bots, depending on their design, normally generate content
automatically, which leads to a very similar tweet structure in every post. We therefore
summarize each tweet by the common elements it can contain. To this end, we
considered the following components: text, emojis, links, hashtags, and user mentions.
We define a stylometry-based minimal tweet structure as the ordered combination of these
components.</p>
        <p>For example, consider the following tweet: “I knew there was reason that @Kezzang69
went to work there.... https://t.co/vl6LgIHMMc”</p>
        <p>We can identify the following elements in the post:</p>
        <p>Text: I knew there was reason that</p>
        <p>UserMention: @Kezzang69</p>
        <p>Text: went to work there....</p>
        <p>Link: https://t.co/vl6LgIHMMc</p>
        <p>The tweet is composed of a text section, followed by a user mention, more text,
and finally a link. The minimal tweet structure for this post is therefore:</p>
        <p>[’text’, ’userMention’, ’text’, ’link’]</p>
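        <p>The decomposition above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the regular expressions, the emoji code point ranges, and the names PATTERNS and minimal_structure are hypothetical choices made for illustration, since the exact matching rules are not specified here.</p>
        <preformat>
```python
import re

# Hypothetical component patterns; the exact rules are an assumption.
PATTERNS = [
    ("link", r"https?://\S+"),
    ("userMention", r"@\w+"),
    ("hashtag", r"#\w+"),
    ("emoji", r"[\U0001F300-\U0001FAFF\u2600-\u27BF]+"),
]
# One capturing group per component type, tried in the order listed above.
TOKEN_RE = re.compile("|".join("(" + pattern + ")" for _, pattern in PATTERNS))


def minimal_structure(tweet):
    """Return the ordered list of component types found in a tweet."""
    structure = []
    position = 0
    for match in TOKEN_RE.finditer(tweet):
        # Non-blank characters before a match count as one text component.
        if tweet[position:match.start()].strip():
            structure.append("text")
        # lastindex is the 1-based number of the group that matched.
        structure.append(PATTERNS[match.lastindex - 1][0])
        position = match.end()
    # Trailing non-blank characters form a final text component.
    if tweet[position:].strip():
        structure.append("text")
    return structure


example = ("I knew there was reason that @Kezzang69 "
           "went to work there.... https://t.co/vl6LgIHMMc")
print(minimal_structure(example))
```
        </preformat>
        <p>Consecutive words are collapsed into a single text component, so the example tweet yields the structure text, userMention, text, link.</p>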
      </sec>
      <sec id="sec-4-3">
        <title>Entropy of Minimal Tweet Structure</title>
        <p>The next step is to measure the amount of information in every minimal tweet
structure. We use the information entropy of a discrete random variable X with possible
values x_1, ..., x_n, defined as the expected value of the
negative logarithm of the probability mass function P:</p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)</tex-math>
          </disp-formula>
        </p>
        <p>When the source produces a low-probability value (i.e., when a low-probability
event occurs), the event carries more "information" than when the source produces
a high-probability value [11]. Following our hypothesis, bots should therefore be
associated with a low and/or constant entropy value.</p>
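        <p>As a hedged worked example, assume base b = 2 and estimate each probability as the relative frequency of a component type within a single tweet (both are assumptions of this sketch, since the estimation procedure is not spelled out here). For the structure [’text’, ’userMention’, ’text’, ’link’], the component text has probability 1/2, and userMention and link have probability 1/4 each, so Equation (1) gives:</p>
        <p>
          <disp-formula>
            <tex-math>H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits}</tex-math>
          </disp-formula>
        </p>
        <p>Under the same assumptions, a tweet that repeats a single component type has zero entropy.</p>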
        <p>We compute the entropy of the training data set in two ways. In the first approach, we
concatenate the minimal structures of all tweets of every user and then compute the entropy of the concatenation.
Figure 1 shows the results for English (a) and Spanish (b).
In the second approach, we compute the entropy of the minimal structure of every tweet and
then average these entropies per user. Figure 2 shows the resulting average entropy
for English (a) and Spanish (b).</p>
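        <p>The two aggregation strategies can be illustrated with a short Python sketch. It assumes base-2 logarithms and relative component frequencies as probability estimates; the function names and the toy user below are hypothetical and only serve to contrast the two computations.</p>
        <preformat>
```python
import math
from collections import Counter


def shannon_entropy(components):
    """Entropy in bits of the component-type frequencies in a sequence."""
    counts = Counter(components)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def concatenated_entropy(structures):
    """First approach: join all structures of a user, then compute entropy."""
    joined = [component for structure in structures for component in structure]
    return shannon_entropy(joined)


def average_entropy(structures):
    """Second approach: entropy of each tweet, then the mean over tweets."""
    if not structures:
        return 0.0
    return sum(shannon_entropy(s) for s in structures) / len(structures)


# Hypothetical user with three tweets.
user_structures = [
    ["text", "userMention", "text", "link"],
    ["text", "link"],
    ["text"],
]
print(concatenated_entropy(user_structures))
print(average_entropy(user_structures))
```
        </preformat>
        <p>The two values generally differ: the first approach reflects the overall mix of components used by an author, whereas the second reflects how varied each individual tweet is.</p>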
        <p>In both figures, we can observe that the entropy of bots is more clearly separated
from the entropy of humans in English than in Spanish, i.e., bots have a low and/or
constant entropy value in English. It is worth noting that using only the average
entropy of minimal tweet structure (AEMTS) as a variable in a Gaussian Naive Bayes or a
Logistic Regression classifier already yields an accuracy of 80%.</p>
        <p>We also compute statistical metrics for each of the components mentioned above: text,
emojis, links, hashtags, and user mentions. The following metrics were computed for each
user: sum, maximum value, median, average, standard deviation, variance, and
entropy.</p>
        <p>For the n-gram model, building on [5], we kept emojis and special characters rather than removing them. We use
pandas and regular expressions to perform the following preprocessing steps:</p>
        <list list-type="order">
          <list-item><p>Replace emojis with text using the emoji library.</p></list-item>
          <list-item><p>Replace URLs with the word “link”.</p></list-item>
          <list-item><p>Replace user mentions with the word “usermention”.</p></list-item>
          <list-item><p>Replace linefeed characters with the word “linefeed”.</p></list-item>
          <list-item><p>Replace special characters with text using the unicodedata library.</p></list-item>
          <list-item><p>Lowercase all characters.</p></list-item>
          <list-item><p>Trim repeated characters: replace repeated character sequences of length
2 or greater with sequences of length 3.</p></list-item>
          <list-item><p>Treat any n-gram that occurs in all documents as a stop word and
ignore it.</p></list-item>
        </list>
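        <p>The text normalization steps above can be approximated with the Python standard library. This is a sketch under stated assumptions rather than the code used in our system: Unicode character names from unicodedata stand in for the emoji library, only non-ASCII symbols are spelled out, only runs longer than three identical characters are shortened (one possible reading of the trimming step), and the stop-word step is left to the vectorizer. The names preprocess and symbol_to_text are hypothetical.</p>
        <preformat>
```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumption: only runs longer than three identical characters are shortened.
LONG_RUN_RE = re.compile(r"(.)\1{3,}")


def symbol_to_text(char):
    """Spell out a non-ASCII symbol (emoji, special character) by its Unicode name."""
    if char.isascii() or char.isalnum():
        return char  # keep plain characters and accented letters unchanged
    name = unicodedata.name(char, "")
    return " " + name.replace(" ", "_") + " " if name else char


def preprocess(tweet):
    """Apply the URL, mention, linefeed, symbol, case, and repetition steps."""
    text = URL_RE.sub(" link ", tweet)
    text = MENTION_RE.sub(" usermention ", text)
    text = text.replace("\n", " linefeed ")
    text = "".join(symbol_to_text(char) for char in text)
    text = text.lower()
    text = LONG_RUN_RE.sub(r"\1\1\1", text)
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


print(preprocess("Sooooo good \U0001F600 @user\nhttps://t.co/x"))
```
        </preformat>
        <p>For the sample input, the sketch produces the string “sooo good grinning_face usermention linefeed link”. URLs are replaced before user mentions so that an “@” inside a link is not mistaken for a mention.</p>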
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental settings</title>
      <p>We examined the following algorithms for type classification:</p>
      <list list-type="alpha-lower">
        <list-item><p>Naive Bayes</p></list-item>
        <list-item><p>Logistic Regression with C=1e22</p></list-item>
        <list-item><p>Support Vector Machine with a linear kernel and C=1e6</p></list-item>
        <list-item><p>Multi-Layer Perceptron with identity activation function, lbfgs solver, alpha=1e-5,
a maximum of 1000 iterations, and a hidden layer size of 200</p></list-item>
      </list>
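      <p>For illustration, the four configurations can be written with scikit-learn roughly as follows. This is a sketch, not our exact training script: the choice of GaussianNB as the Naive Bayes variant, the single hidden layer written as (200,), the use of make_pipeline, the function name build_type_classifiers, and the toy data are assumptions.</p>
      <preformat>
```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC


def build_type_classifiers():
    """Return the four candidate models, each preceded by a Normalizer."""
    estimators = {
        "naive_bayes": GaussianNB(),  # Gaussian variant is an assumption
        "logistic_regression": LogisticRegression(C=1e22),
        "svm": SVC(kernel="linear", C=1e6),
        "mlp": MLPClassifier(
            activation="identity",
            solver="lbfgs",
            alpha=1e-5,
            max_iter=1000,
            hidden_layer_sizes=(200,),  # one hidden layer of 200 units (assumed)
        ),
    }
    return {
        name: make_pipeline(Normalizer(), estimator)
        for name, estimator in estimators.items()
    }


# Toy feature rows (hypothetical values) with labels 0 = human, 1 = bot.
toy_X = [[0.1, 2.0], [0.2, 1.9], [1.5, 0.1], [1.4, 0.2]]
toy_y = [0, 0, 1, 1]
model = build_type_classifiers()["naive_bayes"].fit(toy_X, toy_y)
print(model.predict(toy_X))
```
      </preformat>
      <p>The very large values of C make the regularization of the logistic regression and the SVM almost negligible, so both models fit the training data as closely as the features allow.</p>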
      <p>
        For gender classification, we only experimented with b, c, and d, using the same
parameters proposed in [5]. In all cases, we apply Normalizer<sup>4</sup> before training,
given that large-margin classifiers are known to be sensitive to the way features are
scaled [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p><sup>4</sup> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html</p>
      <sec id="sec-5-1">
        <title>Gender Classification</title>
        <p>During the development phase, we obtained better results when emojis and special
characters were included as text. Our final model uses the following n-gram
features:</p>
        <list list-type="order">
          <list-item><p>Word unigrams, bigrams, and trigrams</p></list-item>
          <list-item><p>Character 3-grams to 5-grams</p></list-item>
        </list>
        <p>As in [5], we use TfidfVectorizer<sup>5</sup>
with the following parameters for both feature sets:</p>
        <list list-type="alpha-lower">
          <list-item><p>Term frequency-inverse document frequency (tf-idf) weighting.</p></list-item>
          <list-item><p>Sub-linear term frequency scaling, which uses 1 + log(TF) instead of TF.</p></list-item>
          <list-item><p>Minimum document frequency = 0.01%: terms with a document frequency
strictly lower than 0.01% are ignored.</p></list-item>
          <list-item><p>Maximum document frequency = 1.0 (100%): only terms with a document frequency
strictly higher than this threshold are ignored.</p></list-item>
        </list>
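        <p>A possible scikit-learn configuration for these two feature sets is sketched below. It is an illustration under assumptions, not our exact code: the FeatureUnion used to concatenate both feature sets, the plain "char" analyzer, the value min_df=0.0001 as the proportion equivalent of 0.01%, and the function name build_ngram_features are hypothetical choices.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


def build_ngram_features():
    """Combine word 1-3-grams and character 3-5-grams with tf-idf weights."""
    shared = dict(
        sublinear_tf=True,  # use 1 + log(TF) instead of TF
        min_df=0.0001,      # 0.01% expressed as a proportion of documents
        max_df=1.0,         # drops only terms with frequency above 100%
    )
    return FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3), **shared)),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5), **shared)),
    ])


# Two hypothetical preprocessed documents, one per author.
documents = [
    "good morning usermention link",
    "buenos dias usermention linefeed",
]
features = build_ngram_features()
matrix = features.fit_transform(documents)
print(matrix.shape)
```
        </preformat>
        <p>In practice each document would be the concatenation of all preprocessed tweets of one author, so that every author contributes a single row to the feature matrix.</p>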
        <p>
          For gender classification, we used only the n-gram model, given that in our
experiments it performed better than the AEMTS + metrics approach. Adding
emojis and special characters increased the performance by 1%, as
shown in Table 1.
        </p>
        <p>
          For type classification, we used as features the average entropy of minimal tweet structure
(AEMTS) described in Section 4, along with the statistical metrics. We evaluated our
models on the official PAN 2019 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] development and test sets for the author profiling
task on the TIRA platform [9]. Moreover, we also evaluated the n-gram model in order
to compare the accuracy achieved by both models, as shown in Table 2.
        </p>
        <p><sup>5</sup> https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html</p>
        <p>As the classification algorithm, we trained a Logistic Regression classifier with the
parameters mentioned above. The model trained on AEMTS + metrics achieved competitive
results on the development set. However, in the final evaluation on the test set, its performance
was lower than that of the n-gram model. Nevertheless, we believe that the proposed AEMTS +
metrics approach is a good and economical alternative to the n-gram model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <sec id="sec-6-1">
        <title>Type classification</title>
        <p>For the author profiling task at PAN 2019, which comprises the classification of two author
profiling aspects, we can draw the following conclusions.
The minimal tweet structure and entropy approach has the following advantages: a)
the variables are highly predictive for bot detection on these datasets; b) the computational complexity
is lower than that of generating the n-gram model; c) the variables are independent of the language,
i.e., they do not depend on the words used by bots. Finally, our hypothesis regarding bot behavior
was confirmed. Consequently, whether a bot can be detected depends on how much its creator
varies the tweet structure, that is, on how far the resulting entropy is
from the human tweet entropy described in Section 4.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Gender classification</title>
        <p>In the literature, special characters and emojis are generally removed. However,
we think all data can be informative if one knows how to handle it. In our experiments, converting
emojis and special characters to text and adding them to the model improved
the accuracy by 1%. In addition, we think that the selection of stop words is another
key to better performance.</p>
        <p>5. Daneshvar, S., Inkpen, D.: Gender identification in Twitter using n-grams and LSA. In:
Proceedings of the 9th International Conference of the CLEF Association (CLEF 2018). vol.
2125 (2018)</p>
        <p>6. Dickerson, J.P., Kagan, V., Subrahmanian, V.S.: Using sentiment to detect bots on Twitter:
Are humans more opinionated than bots? In: Proceedings of the 2014 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining. pp.
620–627. ASONAM ’14, IEEE Press, Piscataway, NJ, USA (2014),
http://dl.acm.org/citation.cfm?id=3191835.3191957</p>
        <p>7. Ferrara, E., Varol, O., Davis, C., Menczer, F., Flammini, A.: The rise of social bots.
Commun. ACM 59(7), 96–104 (Jun 2016), http://doi.acm.org/10.1145/2818717</p>
        <p>8. Martinc, M., Škrjanec, I., Zupan, K., Pollak, S.: PAN 2017: Author profiling - gender and
language variety prediction (notebook for PAN at CLEF 2017, 2nd place). In: CLEF (Working
Notes). vol. 1866 (02 2018)</p>
        <p>9. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture.
In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World -
Lessons Learned from 20 Years of CLEF. Springer (2019)</p>
        <p>10. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and
Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019
Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)</p>
        <p>11. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal
27(3), 379–423 (1948)</p>
        <p>12. Stella, M., Ferrara, E., De Domenico, M.: Bots increase exposure to negative and
inflammatory content in online social systems. Proceedings of the National Academy of
Sciences 115(49), 12435–12440 (2018)</p>
        <p>13. Tomasello, M.: Origins of human communication. MIT Press, Cambridge, Mass.; London
(2010)</p>
        <p>14. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot
interactions: Detection, estimation, and characterization. In: Eleventh International AAAI
Conference on Web and Social Media (2017)</p>
        <p>15. Yang, K., Varol, O., Davis, C.A., Ferrara, E., Flammini, A., Menczer, F.: Arming the public
with AI to counter social bots. CoRR abs/1901.00912 (2019),
http://arxiv.org/abs/1901.00912</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ben-Hur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A user's guide to support vector machines</article-title>
          . In:
          <source>Data mining techniques for the life sciences</source>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>239</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bessi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrara</surname>
          </string-name>
          , E.:
          <article-title>Social bots distort the 2016 U.S. presidential election online discussion</article-title>
          .
          <source>First Monday</source>
          (November
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Behavior enhanced deep bot detection in social media</article-title>
          .
          <source>In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI)</source>
          . pp.
          <fpage>128</fpage>
          -
          <lpage>130</lpage>
          (
          <year>July 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
          </string-name>
          , E.: Overview of PAN 2019:
          <article-title>Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Crestani,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Heinatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>