<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Unsophisticated Neural Bots and Gender Profiling System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oren Halvani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Marquardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT</institution>
          ,
          <addr-line>Rheinstrasse 75, 64295 Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>In recent years, a sharp increase in bot-aided campaigns can be observed across social media networks. As a consequence, a dedicated research discipline known as social bot detection has been established to counteract them. In the context of the shared task "Bots and Gender Profiling" at the PAN workshop, we propose a simple neural network-based approach that determines for a given Twitter feed whether its author is a bot or a human and, in the latter case, distinguishes between male and female authors. On the official English test set, our approach achieves an accuracy of 92% and 83% for type and gender detection, respectively. For the Spanish test set, however, the results are lower (82% for type and 74% for gender detection).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Bot detection and gender profiling can be seen as research tasks in the field of digital text
forensics where, from the perspective of machine learning, both represent classification
problems. In general, bot detection deals with the problem of judging whether a piece of text (for
instance, a Facebook post or a tweet) stems from a human or a bot, while
gender profiling focuses on the question of whether the text was written by a male or a female
author. With the rise and growth of social networks, social bots have become more and more
present. As an attempt to counteract these, the organizers of the PAN workshop invited
researchers and practitioners to participate in the shared task on bots and gender profiling.
In this context, we present a very simple approach based on a feed-forward neural
network that was ranked 18th out of 55 participants.
      </p>
      <p>
Over the years, many approaches have been proposed for both bot detection and gender
profiling. In 2014, for example, Dickerson et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed their SentiBot system,
which uses sentiment to distinguish humans from bots on Twitter. More precisely, they
considered four classes of features related to tweet syntax, tweet semantics, user
behavior, and network-centric user properties. SentiBot relies on an ensemble of six
classifiers (Naive Bayes, SVMs, AdaBoost, Gradient Boosting, Random Forests and
Extremely Randomized Trees) and achieved an AUC of 0.73 on the
India Election Dataset, which consists of 7.7 million tweets stemming from 550,000
Twitter accounts. One of the findings of Dickerson et al. was that sentiment-related
factors play a significant role in the detection of bots and that taking
the topics of interest to an application into account is highly important for identifying bots
associated with a specific application.
      </p>
      <p>
        In 2017, Varol et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presented a similar framework for bot detection on Twitter.
Based on a large number of tweets, their framework extracted 1,150 features, which they
categorized into six different classes (user meta-data, friends/connected users, tweet
content, sentiment, network patterns and activity time series). As an underlying model,
the authors tried out a variety of classification algorithms (Random Forests, AdaBoost,
Logistic Regression and Decision Tree classifiers), where the best performance was
obtained using the Random Forest classifier. In contrast to the study of Dickerson et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
Varol et al. state that user meta-data and content features are the most
promising classes for detecting simple bots. To evaluate their approach, the authors used a dataset
consisting of 14 million Twitter accounts of English-speaking active users. Their initial
system yielded an AUC score of 0.95 on this dataset. Afterwards, the authors applied
their approach to a more challenging dataset, where it also achieved a high score (0.94
AUC). In their analysis, the authors made several interesting findings. They
estimate, for example, that between 9% and 15% of the active Twitter accounts are
bots. They also observed that simple bots tend to interact with bots that exhibit more
human-like behaviors. Furthermore, the authors performed a clustering analysis, where
the resulting clusters point mainly to three subclasses of accounts (spammers, self-promoters,
and accounts that post content from connected applications).
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Proposed Approach</title>
      <p>In the following, we present our bots and gender profiling method, which is essentially
a simple feedforward neural network. However, before introducing the approach
in more detail, we first describe the preprocessing steps that were performed on the
respective documents.</p>
      <p>
        3.1 Preprocessing.
During the inspection of the provided corpora (more precisely, of the
underlying documents) we observed a large variety of noise such as citations, HTML-encoded
strings such as &amp;amp;, inconsistent apostrophe usage, etc. Initially, we
attempted to clean the noise using a fine-grained preprocessing procedure based on
truecasing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], lexical normalization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], accent/diacritic normalization (https://github.com/motss/normalize-diacritics), etc.
      </p>
      <p>[Figure 1: Architecture of the network. An input text (for example, "Wherefore she went after their ...") is mapped token by token to 200-dimensional embeddings, followed by global max pooling and a fully connected layer with softmax output.]</p>
      <p>
        However, the preprocessing that was finally applied consists of the following steps:
– Concatenation of all tweets in each XML file into one long document
– Lowercasing of the entire text
– Substitution of noisy elements with a dummy token, for example Twitter handles
(@ → §AT§), URLs (http... → §URL§), hashtags (# → §HASHTAG§), numbers
([0-9]+ → §NUMBER§), emojis (... → §EMJOI§), punctuation marks ([.,?¿]+ →
§PUNCTATION§), and retweets (RT → §RT§).
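      </p>
      <p>As a rough illustration, the three steps listed above could be sketched as follows. The dummy tokens are copied verbatim from the list, whereas every regular expression, the choice to replace a whole handle or hashtag rather than only its leading symbol, the approximate emoji ranges, and the function name preprocess are assumptions made for this sketch and are not taken from the original implementation.</p>
      <preformat><![CDATA[
```python
import re

# Dummy tokens are copied verbatim from the list above (including their spelling).
# Every regular expression below is an assumption made for illustration.
SUBSTITUTIONS = [
    (re.compile(r"https?://\S+"), " §URL§ "),   # URLs first, so their dots and digits survive
    (re.compile(r"\brt\b"), " §RT§ "),          # retweet marker (the text is already lowercased)
    (re.compile(r"@\w+"), " §AT§ "),            # assumed: the whole handle is replaced
    (re.compile(r"#\w+"), " §HASHTAG§ "),       # assumed: the whole hashtag is replaced
    (re.compile(r"[0-9]+"), " §NUMBER§ "),
    (re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]+"), " §EMJOI§ "),  # rough emoji ranges
    (re.compile(r"[.,?¿]+"), " §PUNCTATION§ "),
]


def preprocess(tweets):
    """Concatenate one author's tweets, lowercase them, and substitute noisy elements."""
    text = " ".join(tweets).lower()
    for pattern, token in SUBSTITUTIONS:
        text = pattern.sub(token, text)
    # The tokens are padded with spaces, so collapse repeated whitespace at the end.
    return " ".join(text.split())


print(preprocess(["RT @Bob: check https://t.co/abc #News 42!", "Hello, world?"]))
```
]]></preformat>
      <p>In this sketch the order of the substitutions matters: URLs are replaced before numbers and punctuation so that the digits and dots inside a link are not tokenized separately.</p>
      <p>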
Our approach is a simple feedforward neural network with a single
hidden layer. The architecture is illustrated in Figure 1. As can be seen, we first
tokenize a given document and map each token to an embedding vector. Next, we apply
global max pooling on the embedding dimensions over the sequence of tokens and
concatenate the resulting pooled values into a compact representation vector, which is then
fed into a simple fully connected hidden layer. The output layer performs the binary
classification using the softmax function. We used the same architecture for both
classification scenarios (human vs. bot and male vs. female) and for both
languages (English and Spanish).
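      </p>
      <p>To make the data flow concrete, the forward pass described above can be sketched in plain Python. This is a framework-free illustration rather than the Keras model itself: the weights are random, the layer sizes are toy values, the ReLU activation of the hidden layer is an assumption, and all function and parameter names are hypothetical. Training is not shown.</p>
      <preformat><![CDATA[
```python
import math
import random


def global_max_pool(embedded):
    """Take the maximum of every embedding dimension over all token positions."""
    return [max(column) for column in zip(*embedded)]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    # Assumed hidden activation; the text does not name one.
    return [max(0.0, v) for v in vector]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def forward(token_ids, params):
    embedded = [params["embedding"][t] for t in token_ids]  # tokens x dim
    pooled = global_max_pool(embedded)                      # dim
    hidden = relu(dense(pooled, params["w_hidden"], params["b_hidden"]))
    return softmax(dense(hidden, params["w_out"], params["b_out"]))


def init_params(vocab_size, embed_dim, hidden_units, seed=0):
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "embedding": matrix(vocab_size, embed_dim),
        "w_hidden": matrix(hidden_units, embed_dim),
        "b_hidden": [0.0] * hidden_units,
        "w_out": matrix(2, hidden_units),  # two classes, e.g. bot vs. human
        "b_out": [0.0, 0.0],
    }


params = init_params(vocab_size=50, embed_dim=8, hidden_units=4)
probabilities = forward([3, 17, 42, 3], params)
print(probabilities)
```
]]></preformat>
      <p>Because the pooling step keeps only the per-dimension maximum, the pooled vector in this sketch has a fixed size and does not depend on the order of the tokens.</p>
      <p>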
To optimize the hyperparameters of the network, we applied Random Search [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. From
the pool of all constructed configurations, we picked the one that led to the most stable
results at the expense of accuracy. The hyperparameters of this configuration are listed
in Table 1. Note that we use the open-source neural-network framework Keras (https://keras.io)
and that we learn the embeddings from scratch rather than using pretrained models.
Due to the varying lengths of the documents, we performed the following
      </p>
      <sec id="sec-2-1">
        <title>[Table 1: Hyperparameters of the selected configuration; columns: Hyperparameter, Value]</title>
      </sec>
      <sec id="sec-2-2">
        <p>strategy: short documents with fewer than 2,500 tokens were padded with zero values, while
longer texts were truncated after the 2,500th token.</p>
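        <p>A minimal sketch of this length normalization is given below. The limit of 2,500 tokens and the zero padding value come from the text; appending the zeros at the end of the sequence and the function name pad_or_truncate are assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
MAX_TOKENS = 2500  # sequence length stated in the text
PAD_ID = 0         # padding value stated in the text


def pad_or_truncate(token_ids, max_len=MAX_TOKENS, pad_id=PAD_ID):
    """Return exactly max_len ids: cut long inputs, append zeros to short ones."""
    clipped = token_ids[:max_len]
    # Assumption: padding is appended after the real tokens.
    return clipped + [pad_id] * (max_len - len(clipped))


short_doc = pad_or_truncate([5, 6, 7])
long_doc = pad_or_truncate(list(range(1, 3001)))
print(len(short_doc), len(long_doc))
```
]]></preformat>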
        <p>
          In addition to dropout, we made use of Early Stopping [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to counteract overfitting.
Here, we observed that in many cases only a few epochs (on the order of 10) were needed until the
network reached a state where the accuracy stopped improving. We also used the
Keras callback function ReduceLROnPlateau to reduce the learning rate by a factor of 1e-1, where
1e-8 was the minimum value.
        </p>
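        <p>The interplay of the two mechanisms can be illustrated with a small framework-free simulation. It is a simplified stand-in and not the Keras implementation of the EarlyStopping and ReduceLROnPlateau callbacks. Only the reduction factor of 1e-1 and the floor of 1e-8 come from the text; monitoring the validation accuracy, the two patience values, the initial learning rate, and the function name run_schedule are assumptions.</p>
        <preformat><![CDATA[
```python
def run_schedule(val_accuracies, lr=1e-3, factor=1e-1, min_lr=1e-8,
                 lr_patience=2, stop_patience=4):
    """Replay per-epoch validation accuracies; return (epochs_run, final_lr)."""
    best = float("-inf")
    epochs_without_gain = 0  # drives early stopping
    epochs_since_lr_cut = 0  # drives the learning-rate reduction
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_gain = 0
            epochs_since_lr_cut = 0
            continue
        epochs_without_gain += 1
        epochs_since_lr_cut += 1
        if epochs_since_lr_cut >= lr_patience:
            lr = max(lr * factor, min_lr)  # never fall below the floor
            epochs_since_lr_cut = 0
        if epochs_without_gain >= stop_patience:
            return epoch, lr  # stop training early
    return len(val_accuracies), lr


epochs_run, final_lr = run_schedule([0.80, 0.90, 0.95, 0.95, 0.94, 0.95, 0.93, 0.95])
print(epochs_run, final_lr)
```
]]></preformat>
        <p>With the assumed patience values, the example run stops after seven epochs, having reduced the learning rate twice.</p>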
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Evaluation</title>
      <p>
        In order to reduce overfitting, we trained our approach on the provided training set
(truth-train.txt) and evaluated the learned model on the development set (truth-dev.txt),
as suggested by the PAN organizers. On the validation set our approach achieved an
accuracy of 97.69%. Afterwards, we applied the learned model to the official test set
hosted on the TIRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] platform. The results are listed in Table 2.
      </p>
      <sec id="sec-3-1">
        <title>[Table 2: Results on the official test set; columns: Language, Type (bot vs. human), Gender (male vs. female)]</title>
        <p>The task description is available at https://pan.webis.de/clef19/pan19-web/author-profiling.html, and TIRA is available at https://www.tira.io/.</p>
        <p>We proposed a simple feedforward neural network that aims to determine for
a given Twitter feed whether its author is a bot or a human, where in the latter case the
gender (male/female) is also classified. Although the proposed method is quite simple,
we observed in preliminary experiments that it was able to outperform more advanced
approaches based on CNN and LSTM building blocks. In the near future, we plan to
experiment with more sophisticated techniques such as Transformer-based networks
that are able to capture fine-grained patterns in the embedding space.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>13</volume>
          ,
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          (Feb
          <year>2012</year>
          ), http://dl.acm.org/citation.cfm?id=2188385.2188395
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping</article-title>
          .
          <source>In: Proceedings of the 13th International Conference on Neural Information Processing Systems</source>
          . pp.
          <fpage>381</fpage>
          -
          <lpage>387</lpage>
          . NIPS'00, MIT Press, Cambridge, MA, USA (
          <year>2000</year>
          ), http://dl.acm.org/citation.cfm?id=3008751.3008807
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dickerson</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kagan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subrahmanian</surname>
            ,
            <given-names>V.S.</given-names>
          </string-name>
          :
          <article-title>Using Sentiment to Detect Bots on Twitter: Are Humans More Opinionated Than Bots?</article-title>
          <source>In: Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</source>
          . pp.
          <fpage>620</fpage>
          -
          <lpage>627</lpage>
          . ASONAM '14, IEEE Press, Piscataway, NJ, USA (
          <year>2014</year>
          ), http://dl.acm.org/citation.cfm?id=3191835.3191957
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lita</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ittycheriah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kambhatla</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>tRuEcasIng</article-title>
          .
          <source>In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>152</fpage>
          -
          <lpage>159</lpage>
          . Association for Computational Linguistics, Sapporo, Japan (Jul
          <year>2003</year>
          ), https://www.aclweb.org/anthology/P03-1020
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Varol</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menczer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flammini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Online Human-Bot Interactions: Detection, Estimation, and Characterization</article-title>
          .
          <source>In: Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017</source>
          . pp.
          <fpage>280</fpage>
          -
          <lpage>289</lpage>
          . AAAI Press (
          <year>2017</year>
          ), https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15587
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Tweet normalization with syllables</article-title>
          .
          <source>In: ACL (1)</source>
          . pp.
          <fpage>920</fpage>
          -
          <lpage>928</lpage>
          . The Association for Computer Linguistics (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>