<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bots and Gender Profiling with Convolutional Hierarchical Recurrent Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juraj Petrik</string-name>
          <email>juraj.petrik@stuba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Chuda</string-name>
          <email>daniela.chuda@stuba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Slovak University of Technology in Bratislava</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
<p>This paper describes an approach leveraging deep learning principles for the bots and gender profiling task at the CLEF 2019 conference. Our approach uses a hierarchical network to classify sequences of tweets. We achieved 90% accuracy in type profiling for English and 86.9% for Spanish, and 77.6% and 77.2% accuracy respectively in gender profiling. The task is described at https://pan.webis.de/clef19/pan19-web/author-profiling.html</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This paper describes our approach to the bots and gender profiling task for PAN at
CLEF 2019 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our approach builds on our previous work on source
code authorship attribution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and applies recent natural language
processing principles used in text classification and stylometry. Our solution was
evaluated using the TIRA evaluation service [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Bots are commonly used in the banking and insurance sectors as chat bots. These chat bots
act as first-level support for customers and are able to help people with simple
problems. Other positive examples are weather forecasting bots and stock exchange
information bots: they are an effective way to send information to many users
(via Twitter, Facebook, or Instagram, for example).</p>
      <p>However, another type of bot is used to spread misleading information, such as fake
news. This kind of information needs to be filtered out, because people tend
to believe it: it is spread across the whole internet and looks like established
fact.</p>
      <p>The aim of this task (https://pan.webis.de/clef19/pan19-web/author-profiling.html) is to
determine whether a given Twitter feed is written by a human or a bot. If the feed is
written by a human, the next task is to determine whether it is written by a male or
a female. The task is also multilingual: it consists of two sub-datasets, English and
Spanish. Despite the language separation, the creators of the dataset do not guarantee
language consistency for all tweets in a feed.</p>
      <p>Our performance is evaluated by the average accuracy over each subtask (human vs. bot
and male vs. female) for each language.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        In terms of stylometry, authorship attribution is the application of linguistic style
analysis to written language, but also to music [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], defining a writer’s style as a unique
property of a specific author: a fingerprint. Author profiling is the part of
stylometry that focuses specifically on determining author traits, such as age,
gender, or occupation.
      </p>
      <p>In the context of this paper we focus on linguistic stylometry, due to the natural
language character of Twitter feeds.</p>
      <p>
        The problem of duplicate accounts on internet discussion forums was discussed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Duplicate accounts are created because of account bans, group accounts, and
reputation boosting (sales). The authors trained one classifier per
account (discussion forum user), which means that for N user accounts there were N
trained classifiers. The advantage of this solution is that the classifiers can be run
independently, so parallelization is trivial. The paper also makes clear that accounts
with a small number of short messages are problematic. Another problem is
intentional modification of a writer’s style, with the Anonymouth tool for example [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], but
fortunately we have no suspicion that such a tool was used on the task datasets.
      </p>
      <p>
        Another paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] tried to find out whether a writer’s style had been intentionally
modified. The authors used character, numerical, special-character, word, and
word-function properties. Sample classification was done by support vector machines
(SVM) trained with sequential minimal optimization (SMO). Other
classification methods were also tested, such as k-NN, the naive Bayes classifier, decision
trees, and logistic regression; however, SVM with SMO achieved significantly better
results. Next, they evaluated the information gain of the properties for
distinguishing imitated and obfuscated documents from original ones, which is a problem
similar to the type profiling in this task (human vs. bot).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Our Method</title>
      <p>
        Our method is based on our previous work, which achieved state-of-the-art results
in source code authorship attribution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We made several changes to the method to
leverage the nature of Twitter messages, most importantly to deal with natural
language as opposed to source code. An important improvement in our approach is the
hierarchical arrangement of layers, which takes advantage of the sequential character
of tweets in a feed.
      </p>
      <p>
        We also experimented with TF-IDF based approaches in combination
with different classifiers. This approach was superior to our method, and we
used it as a strong baseline for our experiments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The training dataset consists of 4120 English Twitter feeds and 3000 Spanish Twitter
feeds. Each feed consists of exactly 100 tweets, and the maximum tweet length is 140 or 280
characters (https://developer.twitter.com/en/docs/basics/counting-characters.html).
Samples are provided as XML files, one sample per file.</p>
      <p>As stated above, the data for this task consists of unprocessed Twitter feeds. Our brief
data analysis showed that the majority of tweets contain a relatively large number of emojis,
quite a large number of typos, doubled and tripled characters, punctuation, and mixed
language.</p>
      <p>Twitter users routinely use the Unicode set of emojis
(http://www.unicode.org/emoji/charts/full-emoji-list.html). Working with special
Unicode characters is not convenient, and it is easier to work with their word
descriptions. In theory this step should not be necessary, because we use word
embeddings; but our training corpus is relatively small, so this step helps us
train the embeddings better: we are de facto extending the tweets and obtain more
tokens in the dataset. An example of such a transformation is shown in Table 1.</p>
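      <p>The emoji-to-description transformation can be sketched as follows; the small mapping table and function name here are purely illustrative, not our exact implementation (a full table would cover the whole Unicode emoji list):</p>

```python
# Replace Unicode emojis with their word descriptions so they become
# ordinary tokens for the embedding layer. The mapping below is a toy
# subset for illustration only.
EMOJI_DESCRIPTIONS = {
    "\U0001F600": " grinning face ",
    "\U0001F602": " face with tears of joy ",
    "\u2764\uFE0F": " red heart ",
}

def describe_emojis(tweet: str) -> str:
    for emoji, description in EMOJI_DESCRIPTIONS.items():
        tweet = tweet.replace(emoji, description)
    return " ".join(tweet.split())  # normalize the extra whitespace

print(describe_emojis("Good morning \U0001F600\U0001F600"))
# -> Good morning grinning face grinning face
```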
      <p>The next step of our preprocessing pipeline is lemmatization, the process of
extracting the word lemma (word root). We can think of this as a dimensionality
reduction which potentially eases the generalization of our model (Table 2).
Lemmatization is usually not needed when word embeddings are used; however, as stated,
our corpus is small, so we use every available method to make the embeddings more
stable and more domain-specific.</p>
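      <p>The dimensionality-reduction effect of lemmatization (several surface forms collapsing into one token) can be illustrated with a toy lookup table; a real pipeline would of course use a proper lemmatizer rather than this hypothetical mapping:</p>

```python
# Toy lemmatizer: maps inflected forms to their lemma via a lookup table.
# Illustrative only -- it shows how multiple surface forms become a single
# vocabulary entry, shrinking the effective vocabulary.
LEMMAS = {
    "running": "run", "ran": "run", "runs": "run",
    "better": "good", "tweets": "tweet", "tweeted": "tweet",
}

def lemmatize(tokens):
    # Unknown tokens pass through unchanged.
    return [LEMMAS.get(token, token) for token in tokens]

print(lemmatize(["she", "tweets", "while", "running"]))
# -> ['she', 'tweet', 'while', 'run']
```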
      <p>The next step is tokenization and token encoding. We used standard Keras framework
functions for these two steps. Tokens were split on space characters; special
characters such as braces, hash signs, and punctuation were filtered out, and tokens
were converted to their lowercase representation. Tokens were then encoded as
integers based on their index in the corpus dictionary.</p>
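      <p>The following pure-Python sketch mirrors what these Keras utility functions do (lowercase, strip special characters, split on whitespace, map tokens to integer ids); it is an illustration of the procedure, not the framework code itself:</p>

```python
import string

# Characters to filter out, similar in spirit to the Keras Tokenizer default.
FILTERED = set(string.punctuation)

def tokenize(text):
    cleaned = "".join(c for c in text.lower() if c not in FILTERED)
    return cleaned.split()

def build_index(corpus):
    """Map each token to an integer id (1-based; 0 is reserved for padding)."""
    index = {}
    for text in corpus:
        for token in tokenize(text):
            if token not in index:
                index[token] = len(index) + 1
    return index

def encode(text, index):
    # Out-of-vocabulary tokens are simply dropped here.
    return [index[t] for t in tokenize(text) if t in index]

corpus = ["Hello world!", "hello twitter #bots"]
index = build_index(corpus)
print(encode("Hello #bots", index))  # -> [1, 4]
```

Note that dropping out-of-vocabulary tokens, as above, is exactly the weakness discussed later: tokens unseen at training time carry no information into the model.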
      <p>The last step is zero-padding of the inputs to a fixed length (our model does not
support variable-length input sequences). We empirically chose a sequence length of
60 tokens (words) based on the histogram in Figure 1.</p>
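      <p>The zero-padding step can be sketched as below, using the sequence length of 60 chosen above; padding on the left matches the Keras default, though the padding side is our assumption here:</p>

```python
def pad_sequence(seq, maxlen=60, value=0):
    """Truncate or left-pad an encoded tweet to a fixed length.
    Keras pad_sequences pads on the left by default; either side works
    as long as it is used consistently."""
    seq = seq[:maxlen]                       # truncate overly long tweets
    return [value] * (maxlen - len(seq)) + seq

print(pad_sequence([5, 17, 3], maxlen=6))  # -> [0, 0, 0, 5, 17, 3]
```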
      <sec id="sec-3-1">
        <title>3.2 Classification</title>
        <p>
          Our classifier is based on our previous work [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]: a convolutional recurrent neural
network. It consists of multiple layers: embedding, convolutional, recurrent, and
dense layers.
        </p>
        <p>Convolutional neural networks are often used in image processing, where they have
achieved state-of-the-art results in image recognition. They are also gaining popularity
in natural language processing, because they act as feature extractors for text too.</p>
        <p>Embeddings are heavily used throughout a variety of text processing problems; they
effectively encode words into vector representations. Our embedding layer is
randomly initialized (not pretrained) and trained on the fly, so it should learn
more domain-specific embedding vectors.</p>
        <p>Recurrent networks, especially Long Short-Term Memory units, are used in state-of-the-art
models for text classification, emotion detection, and speech recognition. They
are good at sequence learning, and tweets are word sequences.</p>
        <p>We struggled with overfitting of the network, which we solved by adding
dropout. Dropout randomly drops connections between layers and therefore
helps the model generalize.</p>
        <p>As stated above, tweets are word (character) sequences. Feeds, in turn, are
tweet sequences, so it could be beneficial to treat them as such.
That is why we propose a hierarchical network on top of the tweets, processing
all 100 tweets of a sample “at once” as a sequence.</p>
        <p>The specific parameters of the proposed and implemented neural
network are listed below:</p>
      </sec>
      <sec id="sec-3-2">
        <title>Layers parameters:</title>
        <p>Embedding: vector length 30
Convolutional I: kernel size 2, number of filters 16, ReLU activation
Convolutional II: kernel size 2, number of filters 16, ReLU activation
Max pooling: pooling size 2
Bidirectional LSTM: 16 units
Dense: 24 units
Dense (output): 2 units, softmax activation
Dropout: rate 0.5</p>
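      <p>To see how a padded 60-token tweet flows through these layers, we can trace the sequence length; the sketch below assumes 'valid' convolutions with stride 1, which is our assumption rather than something stated above:</p>

```python
def conv1d_len(n, kernel_size):
    # 'valid' padding, stride 1: output shrinks by kernel_size - 1 steps
    return n - kernel_size + 1

def maxpool_len(n, pool_size):
    return n // pool_size

n = 60                    # padded tweet length
n = conv1d_len(n, 2)      # Convolutional I  -> 59
n = conv1d_len(n, 2)      # Convolutional II -> 58
n = maxpool_len(n, 2)     # Max pooling      -> 29
print(n)  # -> 29 time steps fed into the bidirectional LSTM
```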
      </sec>
      <sec id="sec-3-3">
        <title>Hyperparameters:</title>
        <p>Batch size: 8
Epochs: 100
Early stopping: patience 5, validation loss monitoring
Loss: categorical cross-entropy
Adam optimizer: learning rate 0.001, beta1 0.9, beta2 0.99</p>
        <p>This section describes our testing results and the results on the task testing
dataset. Our testing results are the average of multiple runs (10) over different
dataset splits.</p>
        <p>For testing purposes we used a 50/25/25 split into training, validation, and testing
fractions of the data, respectively. We use the accuracy metric because the classes are
perfectly balanced: every class has exactly the same number of samples.</p>
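        <p>A 50/25/25 split of this kind can be produced as follows; the helper and the shuffling seed are illustrative, not our exact code:</p>

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split into 50% train / 25% validation / 25% test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed -> reproducible split
    n = len(samples)
    train_end = n // 2
    val_end = train_end + n // 4
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]

# e.g. the 4120 English feeds split into 2060 / 1030 / 1030 samples
train, val, test = split_dataset(range(4120))
print(len(train), len(val), len(test))  # -> 2060 1030 1030
```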
        <p>Table 3 shows our results on the testing split (25% of the training dataset). These
results were quite encouraging, although the results on the organizers' testing
dataset were significantly worse (Table 4).</p>
        <p>Note also that our testing score (accuracy) was calculated differently than on the task
testing dataset. We trained exclusively on tweets posted by humans, so our gender
testing accuracy is computed on human samples only. On the task testing dataset, gender
accuracy was calculated over all samples (human and bot), which is the main cause of
the large accuracy difference.</p>
        <p>Our approach achieved roughly 90% accuracy in type recognition (human or bot)
and 77.5% accuracy in gender recognition (male or female) for English Twitter feeds
on the task testing dataset. The results for Spanish samples were slightly worse: 86.9%
accuracy for type recognition and 72.5% for gender recognition (Table 4).</p>
        <p>It is evident that the results on the task testing dataset are significantly worse
than the results on our testing split. This is probably caused by insufficient
generalization of our model; we suspect that the collected “vocabulary” is not large
enough. Specifically, the vocabulary of our embedding layer is too small, which
results in many out-of-vocabulary words, so the subsequent layers of our model do
not have enough information to make a reliable decision. The testing dataset had not
been published by the paper submission date, so we were unable to perform a proper
analysis on the task testing dataset.</p>
        <p>Our final ranking in the task is 21 out of 55 contestants. Unfortunately, even the
two baseline methods (word and character n-grams) outperformed our solution. Despite
the final ranking, our first appearance in such a competition was not a total
disaster: we ranked in the better half of the solutions.</p>
        <p>Unfortunately, because of time and computational constraints, we were not able to
realize and test all our ideas. The task results also give us some ideas about what
we could have done better.</p>
        <p>First of all, the tokenization step could be improved: for example, tweets contain
many URLs, and we could follow these links and use information from the target
sites, such as the topic or language of the site. Additionally, we could use
hypernyms, for example, in preprocessing to normalize the texts.</p>
        <p>As discussed above, our vocabulary was probably very limited (due to the small
training dataset). We could overcome this problem by using pretrained English and
Spanish embedding vectors or by enriching the dataset using the Twitter real-time API.</p>
        <p>We used a simple word-level embedding layer; however, other papers show that
more sophisticated methods such as ELMo or character-based embeddings achieve
better results, in topic modeling for example. We can therefore deduce that using
these methods could improve our results.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Acknowledgments</title>
        <p>This work was partially supported by Human Information Behavior in the Digital
Space, the Slovak Research and Development Agency under contract No. APVV-15-0508;
by the Slovak Research and Development Agency under contract No. APVV-17-0267,
Automated Recognition of Antisocial Behaviour in Online Communities; and by Data
Space Based on Machine Learning, the Scientific Grant Agency of the Slovak Republic,
grant No. VG 1/0725/19.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Francisco Rangel, Paolo Rosso.
          <source>Overview of the 7th Author Profiling Task at PAN</source>
          <year>2019</year>
          :
          <article-title>Bots and Gender Profiling</article-title>
          . In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            <given-names>D</given-names>
          </string-name>
          . (Eds.)
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          . CEURWS.org
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Juraj</given-names>
            <surname>Petrík</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Chudá</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Source code authorship approaches natural language processing</article-title>
          .
          <source>In Proceedings of the 19th International Conference on Computer Systems and Technologies (CompSysTech'18)</source>
          ,
          <source>Boris Rachev and Angel Smrikarov (Eds.)</source>
          . ACM, New York, NY, USA,
          <fpage>58</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavancas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
          </string-name>
          , E.: Overview of PAN 2019:
          <article-title>Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Crestani,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Heinatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In: Ferro,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <surname>C</surname>
          </string-name>
          . (eds.)
          <article-title>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of</article-title>
          CLEF. Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>A Low Dimensionality Representation for Language Variety Identification</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16)</source>
          , Springer-Verlag,
          <source>LNCS (9624)</source>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Afroz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caliskan-Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>McCoy</surname>
          </string-name>
          , “
          <article-title>Doppelganger finder: Taking stylometry to the underground</article-title>
          ,
          <source>” Proc. - IEEE Symp. Secur. Priv.</source>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>226</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A. W. E.</given-names>
            <surname>Mcdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barrowclift</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          , “Anonymouth Revamped: Getting Closer to Stylometric Anonymity,” pp.
          <fpage>2</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Afroz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brennan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          , “
          <article-title>Detecting hoaxes, frauds, and deception in writing style online</article-title>
          ,
          <source>” Proc. - IEEE Symp. Secur. Priv.</source>
          , pp.
          <fpage>461</fpage>
          -
          <lpage>475</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>E.</given-names>
            <surname>Backer</surname>
          </string-name>
          and
          <string-name>
            <surname>P. Van Kranenburg</surname>
          </string-name>
          , “
          <article-title>On musical stylometry-a pattern recognition approach,” Pattern Recognit</article-title>
          .
          <source>Lett.</source>
          , vol.
          <volume>26</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>309</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>