<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bots and Gender Profiling using Character Bigrams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Yacob Espinosa</string-name>
          <email>espinosagonzalezdaniel@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <email>helena.gomez@iimas.unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional, Centro de Investigación en Computación</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional Autónoma de México, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>This paper describes our approach to the Author Profiling task at PAN 2019. The objective is to distinguish between bot and human users and, for human users, to also detect their gender. We are given only Twitter messages in two languages (Spanish and English). Our preprocessing stage includes data cleaning as well as the extraction of features using character bi-grams. We experimented with several feature representations and machine learning algorithms, including Support Vector Machines (SVM) from libSVM. For both languages we use the same methods of feature extraction and classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Artificial intelligence makes it possible for computers to learn: with each
interaction with technology, a system can learn more about us in order to make some
tasks more comfortable or to offer solutions better suited to our tastes and interests.
In essence, what artificial intelligence aims to do is model
human intelligence [11].</p>
      <p>
        Artificial intelligence is now widely used to make predictions in most streaming
services and social networks, to mention only some internet services. These services
constantly learn about their users in order to offer the best service according to their
interests. Streaming services, for example, use artificial intelligence to predict what a
user may like and thereby invite them to keep using the service. Social networks, in turn,
are used to show news, pages, forums, and friends, or simply to meet new people. In this
new generation of Web 2.0, social networks are a double-edged sword: companies and users
interact more directly [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in what can be called horizontal communication. Thanks to this, companies,
agencies, and some ministries can interact directly with users, who can in turn give their
opinion about a product or service. When many people hold similar opinions and share them
on social networks, others who interact with those comments may feel empathy or perhaps
disgust, and express it in the same way [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        One of the main reasons to study bots is the impact they have on social networks
through opinions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; we therefore explore the text an account generates to detect whether it is a bot.
Social networks and the technology behind them are important today: their uses range from
warning of a catastrophe in some part of the world to creating "Trending topics" about
trends in the world of fashion. Unfortunately, because social networks have penetrated
society so deeply, some companies and governments take advantage of them by creating bots
and using them to spread false news, thereby creating doubt, discontent, and uncertainty
among much of the community interested in these issues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        The PAN workshop has been organized every year since 2011 with the aim of promoting
research on authorship analysis, which includes authorship attribution, author profiling,
and plagiarism detection, among others [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this year's campaign, the organizers
included a subtask of automatic bot detection. The aim is to discriminate between real
users and bots based only on text messages posted on Twitter [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpus</title>
      <p>The task proposed by PAN is to predict whether a user is a bot or a human; if the user
is human, it is also necessary to predict the user's gender. The released dataset covers
two languages: Spanish and English. Each user is represented by 100 tweets, which are
analyzed separately for each language. The dataset contains only tweets, and each file
corresponds to one user. This dataset was mainly used for training our system, which was
then tested on other datasets for the PAN evaluations.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Preprocessing steps</title>
        <p>Since only the tweets of each user are available, preprocessing is necessary. We
considered the following elements that tweets can contain:</p>
        <p>Digits: We removed digits because we considered them unnecessary for the text
feature representation.</p>
        <p>URLs: Links are resource identifiers, in this case for Internet pages, and are not
necessary for the bi-gram structures either, so we removed them.</p>
        <p>@Mentions: Mentions refer to other Twitter users with whom the author interacts in
the message. They are important for quoting on Twitter, but they are not necessary in our
case, so we removed them as well.</p>
        <p>Emoticons: Some messages contain emoticons. They are not necessary for the structure
that we use, and we do not consider them helpful for the classification, so we removed
them.</p>
        <p>Given the data, a standardization step is also necessary:</p>
        <p>Punctuation marks: Our feature selection does not use punctuation marks, so we
removed them to keep the data as clean as possible.</p>
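        <p>The cleaning steps listed above can be sketched as follows. This is a minimal
illustration, not the authors' code: the regular expressions and the function name
clean_tweet are assumptions, since the paper does not state its exact rules.</p>

```python
import re
import string

# Hypothetical patterns; the paper does not give its exact cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions, digits, non-ASCII characters and punctuation."""
    text = URL_RE.sub(" ", text)        # URLs first, before punctuation is stripped
    text = MENTION_RE.sub(" ", text)    # @mentions, while the '@' is still present
    text = DIGIT_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)  # drops emojis and other non-ASCII symbols
    text = text.translate(PUNCT_TABLE)  # also removes ASCII emoticons such as ':)'
    return " ".join(text.split())       # collapse runs of whitespace


print(clean_tweet("Hello @bob! Check https://t.co/abc123 now, 100% \U0001F600 fun :)"))
```

        <p>The order of the substitutions matters in this sketch: URLs and mentions are removed
before punctuation, because stripping punctuation first would leave fragments of them
behind. Note that an ASCII-only filter also drops accented Spanish letters, which may or
may not match what the authors did.</p>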
      </sec>
      <sec id="sec-3-2">
        <title>Features</title>
        <p>As a first step, we preprocessed the data: we removed punctuation marks, since we
do not use them as a feature in the experiment, as well as references to other Twitter
users, links, numbers, emojis, and any characters outside the standard ASCII (American
Standard Code for Information Interchange) set.</p>
        <p>Once the data are reasonably clean, the spaces between the words are eliminated
before the features are extracted.</p>
        <p>
          The main idea of feature extraction is to obtain distinctive features of an
object so that we can compare them with those of other objects and find the patterns they
have in common. One way to obtain these features is to use character n-grams [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Traditional character n-grams already performed well on this problem, but the best
results for both languages (Spanish and English) were obtained with character
bi-grams [9]. As the bi-grams are generated, each bi-gram that has already been seen
increments its frequency counter, whereas a bi-gram that appears for the first time becomes
a new feature with a frequency of 1. This continues until the analysis of each user is
complete, recalling that each file corresponds to one user.
        </p>
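        <p>The counting procedure described above can be sketched with a frequency counter.
The function names are hypothetical, and counting bi-grams within each tweet (rather than
across tweet boundaries) is an assumption that the paper does not confirm.</p>

```python
from collections import Counter


def char_bigram_counts(text: str) -> Counter:
    """Count overlapping character bi-grams in an already cleaned text."""
    text = text.replace(" ", "")  # the paper removes spaces between words
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def user_profile(tweets: list[str]) -> Counter:
    """Accumulate bi-gram frequencies over all tweets of one user."""
    profile = Counter()
    for tweet in tweets:
        # A bi-gram seen before is incremented; a new one starts at 1.
        profile.update(char_bigram_counts(tweet))
    return profile


profile = user_profile(["hola hola", "hello"])
print(profile.most_common(3))
```

        <p>A Counter gives exactly the behaviour the text describes: unseen bi-grams are created
on first occurrence and existing ones are incremented.</p>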
      </sec>
      <sec id="sec-3-3">
        <title>Vector Space Model for Texts</title>
        <p>Now that we have the features, that is, the character bi-gram frequencies per user,
we need a method to organize the data of all the documents with respect to their features;
for this purpose we created a vector space model.</p>
        <p>The main idea of the vector space model is to represent each object by its features
in an organized manner, so that the objects can be compared later [10].</p>
        <p>We organize our data in a term-document matrix [11], where the columns correspond to
the documents and the rows to the character bi-grams; each cell therefore contains the
frequency of the character bi-gram in the analyzed file. If a character bi-gram is not
found in a document, the value of the cell is 0.</p>
        <p>With this matrix, all the documents and their character bi-grams are arranged in an
orderly way and can be analyzed much more efficiently.</p>
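        <p>A term-document matrix of this kind can be sketched as follows, with one row per
character bi-gram and one column per document, as in the description above. The function
name and the toy profiles are illustrative assumptions, not data from the paper.</p>

```python
from collections import Counter


def term_document_matrix(profiles: dict[str, Counter]):
    """Build a matrix with one row per bi-gram and one column per document."""
    documents = sorted(profiles)
    terms = sorted(set().union(*profiles.values()))
    # A Counter returns 0 for a bi-gram that a document does not contain.
    matrix = [[profiles[doc][term] for doc in documents] for term in terms]
    return terms, documents, matrix


# Toy bi-gram profiles for two hypothetical users.
profiles = {
    "user_a": Counter({"ho": 2, "ol": 2, "la": 1}),
    "user_b": Counter({"he": 1, "ol": 1}),
}
terms, documents, matrix = term_document_matrix(profiles)
for term, row in zip(terms, matrix):
    print(term, row)
```

        <p>Most classifier libraries expect one row per document instead, so in practice the
matrix would typically be transposed before training.</p>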
      </sec>
      <sec id="sec-3-4">
        <title>Experiments</title>
        <p>Using the resulting matrix, we tested several classifiers and evaluated their
accuracy to determine which was the best classifier for this task. All the n-grams tested
were character n-grams, since they gave us much better accuracy than the other feature
extraction structures we tried.</p>
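        <p>Comparing classifiers by accuracy can be sketched with k-fold cross-validation; the
paper reports using 10 folds for its final model. The sketch below is an assumption-laden
illustration: the helper names are hypothetical, and a majority-class baseline stands in
for a real classifier such as the SVM used by the authors.</p>

```python
import random
from collections import Counter


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k: int = 10) -> float:
    """Mean accuracy over k folds; assumes at least k samples."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


# Stand-in classifier (NOT the paper's SVM): always predicts the majority class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


X = [[i] for i in range(20)]
y = ["bot"] * 15 + ["human"] * 5
print(cross_val_accuracy(X, y, fit_majority, predict_majority, k=10))
```

        <p>Any classifier can be plugged in through the fit and predict callables, which makes
it straightforward to compare several models on the same folds.</p>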
        <p>After obtaining the results of the classification between humans and bots, we use
only the human users for the gender classification, applying the same feature extraction
methods and the same classification models.</p>
        <p>In this paper, we presented our approach to the "Bots and Gender Profiling" task of
PAN at CLEF 2019. Our final system first classifies users as bots or humans and then
classifies the gender of the human users, extracting features from the tweets and
representing them as character bi-grams. In this way, a term-document matrix is formed in
which the entire data set is organized before it passes through the classification process.
Based on the tests carried out, we decided to use a Support Vector Machine as the
classifier, with 10-fold cross-validation for training the model, and later applied it to
the PAN tests. We found that the accuracy did not differ much between the Spanish and
English classifications, so we used the same feature extraction method for both languages,
both to distinguish between human users and bots and to classify the gender of the human
users. Likewise, we used the same classifier for both languages [8].</p>
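        <p>The two-stage scheme described above (bot detection first, gender classification
only for human users) can be sketched as a simple cascade. The function names, the label
strings, and the toy decision rules are hypothetical placeholders for trained models.</p>

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def profile_user(
    features: Vector,
    is_bot: Callable[[Vector], bool],
    predict_gender: Callable[[Vector], str],
) -> dict[str, str]:
    """Stage 1: bot vs. human. Stage 2 (humans only): gender."""
    if is_bot(features):
        # Bots are never passed to the gender classifier.
        return {"type": "bot", "gender": "bot"}
    return {"type": "human", "gender": predict_gender(features)}


# Toy stand-ins for trained models (hypothetical rules, for illustration only).
def toy_is_bot(v: Vector) -> bool:
    return v[0] > 5


def toy_gender(v: Vector) -> str:
    return "female" if v[1] > v[2] else "male"


print(profile_user([9, 0, 0], toy_is_bot, toy_gender))
print(profile_user([1, 3, 2], toy_is_bot, toy_gender))
```

        <p>One consequence of this design is that errors in the first stage propagate: a human
misclassified as a bot never receives a gender prediction.</p>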
        <p>As future work, given the good performance in classifying humans and bots, we would
like to experiment with the different characteristics that social networks allow in
messages (for example, using the 280 characters that Twitter allows in each tweet); perhaps
we can find more efficient ways to distinguish between humans and bots using natural
language processing techniques.</p>
        <p>8. Rangel, F., R.P.F.M.: A Low Dimensionality Representation for Language Variety
Identification. In: Proceedings of the 17th International Conference on Intelligent Text
Processing and Computational Linguistics, pp. 156–169. Springer-Verlag (2018)</p>
        <p>9. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams
are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual
Conference of the North American Chapter of the ACL: Human Language Technologies.
NAACL-HLT ’15, Association for Computational Linguistics (2015)</p>
        <p>10. Sidorov, G.: N-gramas sintácticos y su uso en la lingüística computacional. In:
Vectores de investigación, 6(6). pp. 1–15. SpringerBriefs in Computer Science, Springer
(2013)</p>
        <p>11. Sidorov, G.: Formalization in computational linguistics. In: Syntactic n-grams in
Computational Linguistics. SpringerBriefs in Computer Science, Springer (2016)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
          </string-name>
          , E.: Overview of PAN 2019:
          <article-title>Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Crestani,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Heinatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Emilio</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Onur</given-names>
            <surname>Varol</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.M.A.F.</surname>
          </string-name>
          :
          <article-title>Detection of promoted social media campaigns</article-title>
          .
          <source>In: The 10th International AAAI Conference on Web and Social Media</source>
          . pp.
          <fpage>563</fpage>
          -
          <lpage>566</lpage>
          . SpringerBriefs in Computer Science, ICWSM (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zakaria el Hjouji</surname>
            ,
            <given-names>D. Scott</given-names>
          </string-name>
          <string-name>
            <surname>Hunter</surname>
          </string-name>
          , N.G.d.M.T.Z.:
          <article-title>The impact of bots on opinions in social networks</article-title>
          .
          <source>In: arXiv preprint arXiv:1810</source>
          .
          <volume>12398</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Linda</surname>
            <given-names>S. L.</given-names>
          </string-name>
          <string-name>
            <surname>Lai</surname>
          </string-name>
          , E.T.:
          <article-title>Groups formation and operations in the web 2.0 environment and social networks</article-title>
          .
          <source>In: Group Decision and Negotiation</source>
          . pp.
          <fpage>387</fpage>
          -
          <lpage>402</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
          </string-name>
          , H.:
          <article-title>Statistical estimation: n-gram models over sparse data</article-title>
          .
          <source>In: Foundations of Statistical Natural Language Processing</source>
          . MIT Press, MIT (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Tim Gollub,
          <string-name>
            <surname>M.W.B.S.:</surname>
          </string-name>
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In: Nicola Ferro,
          <string-name>
            <surname>C.P.</surname>
          </string-name>
          (ed.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>R.P.</surname>
          </string-name>
          :
          <article-title>Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling</article-title>
          . In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>M.H.L.D.</surname>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          . CEUR Workshop Proceedings
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>