<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tlemcen University: Bots and Gender Profiling Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rabia Bounaama</string-name>
          <email>rabea.bounaama@univ-tlemcen.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed El Amine Abderrahim</string-name>
          <email>mohammedelamine.abderrahim@univ-tlemcen.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biomedical Engineering Laboratory, Tlemcen University</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratory of Arabic Natural Language Processing, Tlemcen University</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>This paper describes the participation of the techno team in the Bots and Gender Profiling task at PAN @ CLEF 2019. We address the task with text analysis techniques and machine learning approaches. We describe the properties of our multilingual system, based on a Stochastic Gradient Descent (SGD) classifier, which distinguishes bots from humans and predicts the gender of human authors from tweets in two languages, English and Spanish. We show the usefulness of some features for identifying text style and author information, and we evaluate the model on a number of unseen data sets. For English, the proposed models reach accuracies of 0.50 for bot versus human prediction and 0.25 for gender prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>bots and gender profiling</kwd>
        <kwd>machine learning</kwd>
        <kwd>SGD classifier</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Social media have become one of the most popular ways for people to communicate and to
post. Posts generally vary in length and may involve multiple topics. An author’s
writing style can be affected by different topics and by different replies or comments (e.g.
supportive, negative and aggressive) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In marketing, companies and resellers would
like to know people's views of their products based on the analysis of blogs
and product reviews [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. People also tend to seek out and receive news from social media, so these
communications and ratings produce significant quantities of data that must be
analyzed.
      </p>
      <p>
        These media allow the users who interact and generate information to hide their real
profile. Inferring the traits of social media users from
what they share is therefore a field of growing interest, known as author profiling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Author
profiling deals with deciphering information about an author from the text that he or she
has written [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which helps in identifying aspects of the user.
      </p>
      <p>
        Bots can artificially inflate the popularity of a product by promoting it and/or
writing positive ratings, and can undermine the reputation of competing products
through negative ratings. The Bots and Gender Profiling task at PAN 2019 CLEF [
        <xref ref-type="bibr" rid="ref2 ref3">3,2</xref>
        ]
aims to determine whether the author of a tweet is a bot or a human and, in the case of a human,
to identify the author's gender. The task is held in English and Spanish, so participants
must provide a multilingual solution to the problem. The performance of
participants' systems is ranked by accuracy through TIRA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The paper is structured as follows. Section 2 gives a brief overview
of related work. Section 3 describes the methodology and corpus preprocessing.
Section 4 presents the results, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Several recent studies address author profiling in social media. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors propose a multilingual
model for the identification of age and gender at PAN 2015, treated as a classification task to which
they apply a linear model trained with SGD, and a second multilingual personality
prediction model to which they apply a multivariate regression model, Ensemble of Regressor
Chains Corrected (ERCC). In another author profiling work, at PAN 2016
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors use SVM-based classifiers: liblinear for gender classification and
libsvm with a radial basis kernel for age prediction. They also use the NRC Word-Emotion
Association Lexicon for their training data.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the authors apply TF-IDF and a deep learning model based on Convolutional
Neural Networks. They compute the cosine similarity between the TF-IDF<sub>d</sub> vector and
the vector TF<sub>q</sub> of term frequencies for their training data, in order to predict the gender
or language variety of unseen test data at PAN 2017. Moreover, the work of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], on
the prediction of gender and language variety at PAN 2017, and the work of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], on
last year's task (PAN 2018) concerning gender identification on Twitter, both
use logistic regression with good accuracy.
      </p>
      <p>The studies mentioned above show the applicability of some statistical methods to
author profiling tasks at PAN CLEF. In this paper we propose a multilingual model
for the identification of bots and for gender profiling, based on a Stochastic Gradient Descent
(SGD) classifier.</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>In this section, we describe the two multilingual predictive models used in our
submission: a multilingual model for identifying bot or human users and a
multilingual model for predicting the gender of human users.</p>
      <p>The organizers of the PAN 2019 Bots and Gender Profiling task provide a dataset
consisting of two training sets, one per language: English and Spanish,
with a total of 412,000 and 300,000 tweets respectively. The collections are described in Table 1.</p>
      <p>The data was given in the form of XML files containing tweets for several users. We
apply the following set of preprocessing steps to all documents.</p>
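      <p>Before any preprocessing, each user's XML file has to be read. The sketch below illustrates one way to do this, under assumptions that the paper does not state: that each file holds one author element wrapping a list of document elements, one per tweet, and that a two-column CSV is a sufficient output. It uses Python's standard library (xml.etree.ElementTree and csv) in place of BeautifulSoup and pandas so that it has no dependencies. The names SAMPLE_XML, extract_tweets, tweets_to_csv and user_0001 are hypothetical.</p>
      <preformat><![CDATA[
```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical example of one user's file; the real layout may differ.
SAMPLE_XML = """<author lang="en">
  <documents>
    <document>First tweet of this user</document>
    <document>Second tweet, with a link http://example.org</document>
  </documents>
</author>"""


def extract_tweets(xml_text):
    """Return the text of every <document> element in one user's XML file."""
    root = ET.fromstring(xml_text)
    return [(doc.text or "").strip() for doc in root.iter("document")]


def tweets_to_csv(author_id, tweets, handle):
    """Write a header row, then one CSV row per tweet: (author_id, tweet)."""
    writer = csv.writer(handle)
    writer.writerow(["author_id", "tweet"])
    for tweet in tweets:
        writer.writerow([author_id, tweet])


tweets = extract_tweets(SAMPLE_XML)
buffer = io.StringIO()  # stands in for an open CSV file
tweets_to_csv("user_0001", tweets, buffer)
print(buffer.getvalue())
```
]]></preformat>
      <p>Writing to an in-memory buffer keeps the example self-contained; in practice the same function could be given a file opened with newline="" as the csv module recommends.</p>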
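      <p>First, we created a function that extracts the tweets from the XML files and saves them to a CSV
file, using the BeautifulSoup (<ext-link ext-link-type="uri" xlink:href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</ext-link>)
and pandas (<ext-link ext-link-type="uri" xlink:href="https://pandas.pydata.org/">https://pandas.pydata.org/</ext-link>) libraries, for both languages.
For training we used only the text of the posts, together with the author and the author’s
gender, and we extracted all tweets belonging to each user. We preprocessed
the data before using it to train the SGD (SVM) classifier. The following preprocessing steps were
performed:</p>
      <list list-type="bullet">
        <list-item><p>Removing URL links, @username mentions and # hashtags.</p></list-item>
        <list-item><p>Tokenizing text by white space.</p></list-item>
        <list-item><p>Normalizing case to lowercase.</p></list-item>
        <list-item><p>Removing punctuation from each word.</p></list-item>
        <list-item><p>Removing non-printable characters.</p></list-item>
        <list-item><p>Removing stopwords.</p></list-item>
        <list-item><p>Lemmatizing words.</p></list-item>
      </list>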
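      <p>The preprocessing steps listed above can be sketched as follows. This is a minimal illustration and not the submitted code. It makes several assumptions: that hashtags and mentions are removed as whole tokens, that punctuation means ASCII punctuation, and that a tiny hand-written stopword set is enough for the demonstration. The lemmatize function is a placeholder that returns its input unchanged, since the paper does not say which lemmatizer was used.</p>
      <preformat><![CDATA[
```python
import re
import string

# Tiny illustrative stopword list (assumption); a real run would use a
# full list for each language.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in"}

# URLs, @mentions and #hashtags are dropped as whole tokens (assumption).
URL_MENTION_HASHTAG = re.compile(r"https?://\S+|www\.\S+|@\w+|#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only


def lemmatize(token):
    """Placeholder: returns the token unchanged (no lemmatizer is bundled)."""
    return token


def preprocess(tweet):
    """Apply the seven steps in the order given in the text."""
    text = URL_MENTION_HASHTAG.sub(" ", tweet)            # 1. strip URLs, mentions, hashtags
    tokens = text.split()                                 # 2. whitespace tokenization
    tokens = [t.lower() for t in tokens]                  # 3. lowercase
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]   # 4. remove punctuation
    tokens = ["".join(ch for ch in t if ch.isprintable()) for t in tokens]  # 5. printable only
    tokens = [t for t in tokens if t and t not in STOPWORDS]                # 6. stopwords
    return [lemmatize(t) for t in tokens]                 # 7. lemmatization (stub)


example = preprocess("Check THIS out!!! http://t.co/abc @someone #nlp The bots are coming.")
print(example)
```
]]></preformat>
      <p>Tokens that become empty after punctuation removal are discarded together with the stopwords, so the output contains only non-empty lowercase words.</p>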
      <p>Second, we built the model. At this stage
we created three functions. The first creates the classifier: it
takes as parameters the specified classifier, the training feature vectors with their
output classes, and the validation vectors.</p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], N-grams are the best method for analyzing emotions in a
microblogging context, so we train our classifier on 3-gram features. From these
features, we selected only those with a term count of at least
3 in the classification task, and we used them in the third function to train the
model.
      </p>
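      <p>The feature selection described above can be illustrated with a short sketch. The paper does not say whether the 3-grams are built over words or characters, so word 3-grams are assumed here, and the threshold is applied to the total count over the training documents. The function names and the toy documents are hypothetical. In scikit-learn, a close counterpart would be CountVectorizer(ngram_range=(3, 3), min_df=3), although min_df thresholds the number of documents containing a term and not its total count.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def word_ngrams(tokens, n=3):
    """Return the word n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, n=3, min_count=3):
    """Keep the n-grams whose total count over all documents is >= min_count."""
    counts = Counter()
    for tokens in documents:
        counts.update(word_ngrams(tokens, n))
    kept = sorted(gram for gram, c in counts.items() if c >= min_count)
    return {gram: index for index, gram in enumerate(kept)}


def vectorize(tokens, vocabulary, n=3):
    """Count vector of one document over the retained n-grams."""
    vector = [0] * len(vocabulary)
    for gram in word_ngrams(tokens, n):
        if gram in vocabulary:
            vector[vocabulary[gram]] += 1
    return vector


# Toy, already-tokenized documents (hypothetical).
docs = [
    "i love this song".split(),
    "we love this song".split(),
    "they love this song too".split(),
    "follow me for more".split(),
]
vocab = build_vocabulary(docs)
features = vectorize("really love this song".split(), vocab)
print(vocab, features)
```
]]></preformat>
      <p>Only the 3-gram that occurs three times survives the threshold, so rare 3-grams, which are very numerous in tweets, do not enlarge the feature space.</p>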
      <p>
        We used the same feature representation and model parameters for the Spanish dataset as those chosen
for English. Our model was built using the tools provided by the
scikit-learn machine learning library in Python [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We also tested several classifiers
and different parameter sets. The following classifiers were tested:
        <list list-type="bullet">
          <list-item><p>Linear SVM (svm.LinearSVC)</p></list-item>
          <list-item><p>Logistic regression</p></list-item>
          <list-item><p>RNN (recurrent neural network)</p></list-item>
          <list-item><p>Multinomial Naïve Bayes</p></list-item>
        </list>
        The best results were obtained with the SGD classifier. For our submitted run we used ‘hinge’ as the loss function and L2
as the penalty.
      </p>
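      <p>To make the submitted configuration concrete, the sketch below trains a linear classifier by stochastic gradient descent on the hinge loss with an L2 penalty, which is the objective of a linear SVM. In scikit-learn this configuration corresponds to SGDClassifier(loss="hinge", penalty="l2"). The code is a plain-Python illustration of the update rule, not scikit-learn's implementation. The values of alpha, eta and epochs, the toy data and the convention that +1 stands for bot and -1 for human are all assumptions, since the paper does not report them.</p>
      <preformat><![CDATA[
```python
import random


def train_sgd_hinge(X, y, alpha=0.0001, eta=0.1, epochs=20, seed=0):
    """Linear classifier trained by SGD on hinge loss + (alpha / 2) * ||w||^2.

    X: list of feature vectors (lists of floats); y: labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 penalty: shrink the weights a little at every step.
            w = [wj * (1.0 - eta * alpha) for wj in w]
            # Hinge loss: move the boundary only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (hypothetical): +1 = bot, -1 = human.
X_train = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y_train = [1, 1, -1, -1]
w, b = train_sgd_hinge(X_train, y_train)
print([predict(w, b, x) for x in X_train])
```
]]></preformat>
      <p>With the hinge loss, correctly classified samples that lie outside the margin contribute no gradient, so only the penalty term acts on the weights for those samples.</p>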
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>For the task of bots and gender profile prediction, we obtained better results for
Spanish, as presented in Tables 2 and 3.</p>
      <p>Our techno team obtained an average score of 0.3784. According to the obtained
results, the SGD (SVM) classifier performs better at author prediction (bot or human), while this
approach did not perform well at gender prediction. To overcome this limitation,
we plan to do more advanced preprocessing using, for example, linguistic markers.</p>
      <p>We faced some limitations in building our system:</p>
      <list list-type="bullet">
        <list-item><p>Tweet data contains misspelled words. For example, people spell the word “soon”
as “soooon” to convey excitement. In such situations, tokenizing and identifying
words becomes challenging.</p></list-item>
        <list-item><p>Social media users use their own vocabulary to express their thoughts or feelings,
so extracting vocabulary-based or grammar-based features may not work
efficiently on these platforms. Furthermore, social media users use multiple languages
to express their opinions. This makes it impossible to apply knowledge derived from
the language-dependent features of one language to another language.</p></list-item>
      </list>
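      <p>The first limitation can be illustrated with a short sketch. One common mitigation, which was not part of the submitted system, is to collapse runs of three or more identical characters before tokenization. The function name squeeze_elongation and the choice to keep two characters are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import re

# Three or more repetitions of the same character, e.g. "oooo" in "soooon".
ELONGATION = re.compile(r"(.)\1{2,}")


def squeeze_elongation(token, keep=2):
    """Collapse any run of 3+ identical characters down to `keep` characters."""
    return ELONGATION.sub(lambda m: m.group(1) * keep, token)


print(squeeze_elongation("soooon"))   # soon
print(squeeze_elongation("goooood"))  # good
print(squeeze_elongation("yessss"))   # yess
```
]]></preformat>
      <p>The last call shows the limit of this heuristic: "yess" is still not a dictionary word, so a lexicon lookup would be needed to decide between one and two repeated letters.</p>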
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have presented the system developed by our techno team for the
PAN 2019 Bots and Gender Profiling task. We designed and implemented an easily configurable
system whose final model uses an SGD classifier. The main challenge
with this model is to combat overfitting effectively. The biggest challenge of this
year’s PAN Bots and Gender Profiling task was the gender classification problem, where
our model achieves an average accuracy of 0.25.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Mounica</given-names>
            <surname>Arroju</surname>
          </string-name>
          , Aftab Hassan, and
          <string-name>
            <given-names>Golnoosh</given-names>
            <surname>Farnadi</surname>
          </string-name>
          .
          <article-title>Age, gender and personality recognition using tweets in a multilingual setting</article-title>
          .
          <source>In 6th Conference and Labs of the Evaluation Forum (CLEF 2015): Experimental IR meets multilinguality, multimodality, and interaction</source>
          , pages
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Franco</surname>
            <given-names>M</given-names>
          </string-name>
          . Francisco Rangel,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16)</source>
          . Springer-Verlag,
          <source>LNCS(9624)</source>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Francisco Rangel and Paolo Rosso.
          <article-title>Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            <given-names>D</given-names>
          </string-name>
          . (Eds.),
          <year>2019</year>
          . CEUR-WS.org,
          <ext-link ext-link-type="uri" xlink:href="http://ceur-ws.org">http://ceur-ws.org</ext-link>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Pepa</given-names>
            <surname>Gencheva</surname>
          </string-name>
          , Martin Boyanov, Elena Deneva, Preslav Nakov, Yasen Kiprov, Ivan Koychev, and
          <string-name>
            <given-names>Georgi</given-names>
            <surname>Georgiev</surname>
          </string-name>
          .
          <article-title>Pancakes team: A composite system of genre-agnostic features for author profiling</article-title>
          .
          <source>In CEUR Workshop Proceedings</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Gonzalo</given-names>
            <surname>Blázquez Gil</surname>
          </string-name>
          , Antonio Berlanga de Jesús, and
          <string-name>
            <given-names>José M.</given-names>
            <surname>Molina López</surname>
          </string-name>
          .
          <article-title>Combining machine learning techniques and natural language processing to infer emotions using spanish twitter corpus</article-title>
          .
          <source>In Highlights on Practical Applications of Agents and Multi-Agent Systems</source>
          , pages
          <fpage>149</fpage>
          -
          <lpage>157</lpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Matej</given-names>
            <surname>Martinc</surname>
          </string-name>
          , Iza Skrjanec, Katja Zupan, and
          <string-name>
            <given-names>Senja</given-names>
            <surname>Pollak</surname>
          </string-name>
          .
          <article-title>PAN 2017: Author profiling - gender and language variety prediction</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss,
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Jian</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kim-Kwang Raymond</given-names>
            <surname>Choo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Helen</given-names>
            <surname>Ashman</surname>
          </string-name>
          .
          <article-title>Bit-level n-gram based forensic authorship analysis on social media: Identifying individuals from linguistic profiles</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          ,
          <volume>70</volume>
          :
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Tim Gollub, Matti Wiegmann, and Benno Stein.
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In Nicola Ferro and Carol Peters, editors,
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Nils</given-names>
            <surname>Schaetti</surname>
          </string-name>
          .
          <article-title>UniNE at CLEF 2017: TF-IDF and deep-learning for author profiling</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Mariona</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M Antonia</given-names>
            <surname>Martí</surname>
          </string-name>
          , Francisco M Rangel,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Cristina Bosco,
          <string-name>
            <given-names>Viviana</given-names>
            <surname>Patti</surname>
          </string-name>
          , et al.
          <article-title>Overview of the task on stance and gender detection in tweets on catalan independence at ibereval 2017</article-title>
          .
          <source>In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017</source>
          , volume
          <volume>1881</volume>
          , pages
          <fpage>157</fpage>
          -
          <lpage>177</lpage>
          . CEUR-WS,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. P von Daniken, Ralf Grubenmann, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <article-title>Word unigram weighing for author profiling at pan 2018</article-title>
          .
          <source>In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>