<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bots and Gender Profiling using Character and Word N-Grams</article-title>
      </title-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Author profiling is the task of analysing text to identify characteristics of its author based on stylistic and content-based features. In this paper, we describe our approach to deciding whether the author of a set of tweets is a bot or a human (and, if human, male or female), as a submission to the Bots and Gender Profiling shared task at PAN 2019. Our approach combines character and word n-grams as features for each class and trains a Support Vector Machine (SVM). Our experiments show that this method performs well in detecting both bots and gender (male or female).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Bots have played a key role in generating large amounts of internet traffic in recent years;
indeed, they have become ubiquitous on social media platforms such as Twitter and
Facebook [15]. Social media bots pose as humans to influence users for commercial,
political or ideological purposes. For example, bots can artificially inflate the
popularity of a product by promoting it and writing positive ratings, as well as undermine
the reputation of competing products through negative reviews. The threat is even
greater when the purpose is political or ideological [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Research shows that in the 2016
U.S. Presidential Election, more than one fifth of tweets on Twitter came from bot accounts
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, bots are commonly associated with the spread of fake news [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Therefore, bot
detection on social media, and especially on Twitter, has become an important research area
across the globe. This year’s shared task on bots and gender profiling at PAN 2019 [12]
aims to determine whether the author of a Twitter feed is a bot or a human and, if
human, to profile the gender of the author, in two different languages:
English and Spanish.
      </p>
      <p>
        Bot-versus-human classification is a binary problem, and within the human class,
male-versus-female classification is again binary. In this paper, we present the
approach used in the final software version submitted to the TIRA platform [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Word and character n-grams have been strong predictors of gender in author profiling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
For author profiling, it has been shown that tf-idf weighted n-gram features, both in
terms of characters and words, are very successful in capturing gender
distinctions in particular [14], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Character and word n-grams have also been shown to obtain decent results in
gender classification on Twitter: in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors use word unigrams, word bigrams
and character 1-5 grams as features to feed into various training algorithms. Most of the
best performing teams in the author profiling tasks at PAN have adopted similar approaches
to obtain good accuracies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In past shared tasks at PAN, the traditional
machine learning algorithm Support Vector Machine (SVM) has been used in
combination with character and tf-idf word n-grams [13]. Even though there are two
different tasks here (bot vs. human, and male vs. female), can a model for bot detection
be built with the same set of features that have been used extensively for gender
detection?
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Dataset and Preprocessing</title>
      <p>The dataset provided consists of a series of tweets in the form of XML files, each
corresponding to one author and containing 100 tweets. The tweet text is in raw format,
containing links, mentions of other users, and hashtags.</p>
      <p>Two datasets are provided: English (4,120 authors) and Spanish (4,120 authors).
There is one XML file per author (Twitter user), each containing 100 tweets, and authors
are coded with an alphanumeric author ID.</p>
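      <p>For illustration, the per-author files can be read as follows. This is a minimal sketch: the XML layout shown (an &lt;author&gt; element wrapping one &lt;document&gt; element per tweet) is an assumed approximation of the PAN 2019 format, not its exact schema.</p>

```python
import xml.etree.ElementTree as ET

# Assumed per-author file layout: one <document> element per tweet.
xml_text = """<author lang="en">
  <documents>
    <document>First tweet text</document>
    <document>Second tweet text</document>
  </documents>
</author>"""

root = ET.fromstring(xml_text)
tweets = [d.text for d in root.iter("document")]
print(len(tweets))  # 2 in this toy example; 100 in the real data
```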
      <p>
        Most of the preprocessing is done with the TweetTokenizer module of the Natural
Language Toolkit (NLTK) library. The approaches followed in preprocessing the tweet text are similar
to commonly used techniques [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>– Replacing line feeds with &lt;LineFeed&gt;
– Concatenating all tweets of a single author into one document
– Replacing URLs with &lt;URLURL&gt;
– Removal of punctuation
– Trimming repeated character sequences of length &gt;= 3</p>
    </sec>
    <sec id="sec-4">
      <title>4 Features</title>
      <p>
        In author profiling task, PAN 2018, second best performing team [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used different
combinations of word and character n-grams on tweet text. This has motivated us to
use the similar approach for the bot and gender detection task as well. Table 3 shows
character and word n-gram hyper parameters used which are obtained after different
experiments on both English and Spanish datasets.
      </p>
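      <p>Combining character and word TF-IDF n-grams can be sketched as follows, assuming scikit-learn. The n-gram ranges shown are illustrative placeholders for the tuned values in Table 3, and min_df=2 only approximates the frequency cutoff used.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Character and word TF-IDF n-grams side by side; the ranges here are
# illustrative, not the tuned hyperparameters from the paper.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 5), min_df=2)),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=2)),
])

docs = ["sample tweet text one", "another sample tweet text"]
X = features.fit_transform(docs)  # sparse TF-IDF matrix, one row per author
```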
      <p>
        A TF-IDF matrix was created from the character and word n-grams (terms with frequency less
than 2 were omitted). Dimensionality reduction on this matrix was done using Singular Value
Decomposition (SVD), via the TruncatedSVD implementation from scikit-learn. A
reduced rank space of 200 features was found to be optimal: increasing the number of
components ( &gt; 200 ) in the reduced rank space decreased accuracy and sometimes
resulted in memory errors on the 4GB RAM TIRA virtual machine. Support Vector Machines
(SVM) have been shown to obtain decent results in author profiling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. When
compared with other classifiers, SVMs proved to be more discriminative. Therefore, the
linear SVM implementation in the scikit-learn library [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] was chosen as the
classification method. To prevent overfitting, the value of C was fixed at 1.0, as
done in [15].
      </p>
      <p>
In order to validate the approach, the data for each language was split into 60% for training
and 40% for testing (i.e. 2,472 documents for training and 1,648 for testing). These experiments
were made on a subset; the classification in the final task was made using all the
training data. We tried different classifiers: NaiveBayesPredict, LogisticRegression
and LinearSVC. Model training was done using 10-fold cross-validation, as it has obtained
good results previously [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. LinearSVC was chosen for the final version of the software as it gave
better results than the others. Results on the test data (40% of the original training
data) are shown in Table 2 for the English dataset. In the final submission, the model was trained
on the whole training set using the SVM classifier and tested on the official PAN 2019 test
set for the author profiling task, on the TIRA platform [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Results obtained on the final
submission are shown in Table 3.
      </p>
      <p>
The simple approach described here, following earlier work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], performs decently when compared with
other submissions. Word unigrams and bigrams gave good results, while increasing the
word n-gram size beyond 2 decreased performance on both the English and Spanish
datasets, so this hyperparameter tuning was necessary. Our initial software submission
resulted in a memory error due to too many components in the reduced rank space
(produced by TruncatedSVD); moreover, increasing the number of components beyond
200 did not improve performance. Based on our experiments, SVM remains the top choice for the
bot/gender detection task. As future work, deep neural networks, especially
Convolutional Neural Networks (CNN), could be considered to obtain better results.
      </p>
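      <p>The full pipeline described above (TF-IDF n-grams, TruncatedSVD, linear SVM with C fixed at 1.0) can be sketched as follows. The documents, labels, n-gram range and component count are toy placeholders so the example runs on tiny input; the actual system uses 200 components and the PAN data.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Toy data standing in for the per-author tweet documents.
docs = ["good morning friends"] * 10 + ["buy now click here"] * 10
labels = ["human"] * 10 + ["bot"] * 10

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3), min_df=2),
    TruncatedSVD(n_components=5),  # the paper uses 200 components
    LinearSVC(C=1.0),              # C fixed at 1.0 to limit overfitting
)

# 60/40 train/test split, as in the validation setup above.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.4, random_state=0, stratify=labels)
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
```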
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Bots and gender detection (
          <year>2019</year>
          ), https://pan.webis.de/clef19/pan19-web/author-profiling.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Tira platform (
          <year>2019</year>
          ), https://www.tira.io/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwyer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medvedeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rawee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haagsma</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>N-gram: New groningen author-profiling model</article-title>
          .
          <source>arXiv preprint arXiv:1707.03764</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bessi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Social bots distort the 2016 US presidential election online discussion</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Burger</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henderson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zarrella</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Discriminating gender on twitter</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing</source>
          . pp.
          <fpage>1301</fpage>
          -
          <lpage>1309</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Daneshvar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inkpen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Gender Identification in Twitter using N-grams and LSA: Notebook for PAN at CLEF 2018</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . vol.
          <volume>2125</volume>
          (
          <year>2018</year>
          ), http://ceur-ws.org/Vol-2125/paper_213.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Want something to go viral? Make it fake news</article-title>
          . https://www.nbcnews.com/health/health-news/fake-news-lies-spread-faster-social-mediatruthdoes-n854896
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Magliani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fontanini</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fornacciari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manicardi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iotti</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A comparison between preprocessing techniques for sentiment analysis in Twitter</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neto</surname>
            ,
            <given-names>R.F.O.</given-names>
          </string-name>
          :
          <article-title>Using character n-grams and style features for gender and language variety classification</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of machine learning research 12(Oct)</source>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <article-title>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</article-title>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. Working Notes Papers of the CLEF (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Sanchez-Perez, M.A., Markov, I., Gómez-Adorno, H., Sidorov, G.: Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 145-151. Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., Nakov, P.: Overview of the DSL shared task 2015. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects. pp. 1-9 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>