<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Learning to Detect Aggressiveness in Mexican Spanish Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>María Dolores Molina-González</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flor Miriam Plaza-del-Arco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María Teresa Martín-Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén</institution>
          ,
          <addr-line>Campus Las Lagunillas, 23071, Jaén</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>495</fpage>
      <lpage>501</lpage>
      <abstract>
        <p>Comments published on social media often contain aggressive language that can have damaging effects on users. The severe consequences of this problem, combined with the large amount of data that users publish daily on the Web, require the development of algorithms capable of automatically detecting inappropriate online remarks. In this paper, we present our participation in IberLEF-2019: subtask MEX-A3T: Authorship and aggressiveness analysis in Twitter: case study in Mexican Spanish. Our main contribution is the development of an ensemble learning system to detect aggressiveness in tweets.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic aggressiveness detection</kwd>
        <kwd>machine learning</kwd>
        <kwd>social media</kwd>
        <kwd>text mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the growing prominence of social media such as Twitter and Facebook, more and
more users are publishing content and sharing their opinions with others. This
content can spread quickly, reaching anywhere in the
world in a few seconds. Unfortunately, comments often contain aggressive
language that can have damaging effects on social media users. Hate speech
detection covers different issues, such as misogyny, xenophobia, homophobia,
cyberbullying, nastiness and aggressiveness. One of the strategies used to deal
with these hateful behaviors and attitudes on social media is reporting or
monitoring this type of content with the main aim of limiting it. However, it is
difficult to monitor efficiently, so automatic support techniques are needed.</p>
      <p>
        Recently, a growing number of researchers have started to focus on the
automatic detection of hateful language online [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Moreover, several
national and international workshops and evaluation campaigns on this issue
have taken place in various languages, such as the
first and second editions of the Workshop on Abusive Language [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the First
Workshop on Trolling, Aggression and Cyberbullying [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which also included
a shared task on aggression identification, the tracks on Automatic Misogyny
Identification (AMI) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and on authorship and aggressiveness analysis
(MEX-A3T) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed at the 2018 edition of IberEval, the GermEval Shared Task
on the Identification of Offensive Language [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the Automatic Misogyny
Identification task at EVALITA 2018 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and finally the SemEval shared task on hate speech (HS)
detection against immigrants and women (HatEval) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The severe consequences of this problem, combined with the large amount of
data that users publish daily on the Web, require the development of algorithms
capable of automatically detecting inappropriate online remarks.</p>
      <p>
        In this paper, we describe our participation in IberLEF-2019: subtask
MEX-A3T: Authorship and aggressiveness analysis in Twitter: case study in Mexican
Spanish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This track addresses the detection of aggressiveness in Mexican Spanish
tweets, that is, texts containing offensive messages that disparage or humiliate
a specific target.
      </p>
      <p>The rest of the paper is structured as follows. In Section 2, we describe the
data used in our experiments. Section 3 presents the details of the proposed system.
In Section 4, we discuss the analysis and evaluation results for our system. We
conclude in Section 5 with remarks and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        To run our experiments, we used the Mexican Spanish dataset provided by the
organizers of the IberLEF-2019 subtask MEX-A3T: Authorship and aggressiveness
analysis in Twitter: case study in Mexican Spanish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The training data
consists of two files: one contains the 7,700 Mexican Spanish tweets of the
training set (one tweet per line) and the other contains the corresponding
labels (one label per line). Each label takes
one of two values: 0 means "non-aggressive" and 1 means "aggressive". The
tweets were preprocessed before release: the organizers replaced all
user mentions with @USUARIO.
      </p>
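      <p>As a minimal sketch, a pair of files in this layout could be read as follows, assuming UTF-8 text. The helper <monospace>load_dataset</monospace> and the file names in the usage note are hypothetical; they are not the names used by the organizers.</p>
      <preformat><![CDATA[
```python
def load_dataset(tweets_path, labels_path):
    """Read parallel files: one tweet per line and one 0/1 label per line."""
    with open(tweets_path, encoding="utf-8") as handle:
        tweets = [line.rstrip("\n") for line in handle]
    with open(labels_path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break the parsing.
        labels = [int(line) for line in handle if line.strip()]
    if len(tweets) != len(labels):
        raise ValueError("tweets and labels must have the same number of lines")
    return tweets, labels
```
]]></preformat>
      <p>For example, <monospace>load_dataset("train_tweets.txt", "train_labels.txt")</monospace> would return two parallel lists, one of tweet texts and one of integer labels.</p>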
      <p>During the pre-evaluation period, we trained our models on the training set and
evaluated different approaches with 10-fold cross-validation. During the evaluation
period, we trained our models on the training set and tested them on the test set.
Table 1 shows the number of tweets used in our experiments.</p>
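      <p>The 10-fold cross-validation protocol can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the exact experimental code: the toy corpus is invented, the use of stratified and shuffled folds is an assumption (the text only says "10-fold"), and the pipeline of term counts plus logistic regression is just one of the candidate approaches.</p>
      <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for the training tweets (1 = aggressive).
aggressive = ["eres un idiota", "callate tonto", "te odio tanto", "que asco das"]
neutral = ["buenos dias a todos", "me gusta el cafe", "feliz cumple", "gran partido hoy"]
tweets = aggressive * 5 + neutral * 5
labels = [1] * 20 + [0] * 20

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())

# Stratified, shuffled folds are an assumption made for this sketch.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, tweets, labels, cv=folds, scoring="f1_macro")
print(len(scores), scores.mean())
```
]]></preformat>
      <p>Each entry of <monospace>scores</monospace> is the macro-averaged F1 that scikit-learn computes on one held-out fold; their mean gives a single figure for comparing approaches.</p>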
    </sec>
    <sec id="sec-3">
      <title>System Description</title>
      <p>In this section, we describe how we addressed the identification of
aggressiveness in Twitter. In particular, the MEX-A3T organizers proposed a
classification task whose aim is to distinguish aggressive tweets from
non-aggressive ones written by Mexican Spanish users.</p>
      <p><bold>Our classification model</bold></p>
      <p>First, we preprocessed the corpus of tweets provided by the organizers.
After tokenization, we carried out the following steps:</p>
      <list list-type="bullet">
        <list-item><p>Convert the text to lower case.</p></list-item>
        <list-item><p>Normalize URLs, emails, user mentions, percentages, money amounts, time and date
expressions, and phone numbers.</p></list-item>
        <list-item><p>Unpack hashtags (e.g. #HechosReales becomes &lt;hashtag&gt; hechos reales
&lt;/hashtag&gt;).</p></list-item>
        <list-item><p>Annotate and reduce elongated words (e.g. agresivooooooo becomes
&lt;elongated&gt;agresivo) and repeated characters (e.g. !!!! becomes &lt;repeated&gt;!).</p></list-item>
        <list-item><p>Map emoticons.</p></list-item>
      </list>
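      <p>A simplified subset of these steps can be sketched with regular expressions. This is not the tool used in the experiments: the function <monospace>preprocess</monospace>, the placeholder tokens &lt;url&gt; and &lt;user&gt;, the spacing around the annotation tags, and the CamelCase rule for unpacking hashtags are all assumptions made for illustration, and emails, dates, money amounts and emoticons are not handled.</p>
      <preformat><![CDATA[
```python
import re


def preprocess(tweet):
    """Apply a simplified subset of the normalisation steps to one tweet."""
    # Replace URLs and user mentions with placeholder tokens (names assumed).
    text = re.sub(r"https?://\S+", "<url>", tweet)
    text = re.sub(r"@\w+", "<user>", text)

    # Unpack CamelCase hashtags before lower-casing removes the word boundaries.
    def unpack(match):
        words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
        return "<hashtag> " + words + " </hashtag>"

    text = re.sub(r"#(\w+)", unpack, text)
    text = text.lower()
    # Reduce a letter repeated three or more times and annotate the word.
    text = re.sub(r"\b(\w*?)([^\W\d_])\2{2,}(\w*)\b", r"<elongated> \1\2\3", text)
    # Collapse repeated punctuation marks and annotate them.
    text = re.sub(r"([!?.])\1+", r" <repeated> \1", text)
    return text
```
]]></preformat>
      <p>Unpacking runs before lower-casing because the capital letters are the only cue this sketch uses to split a hashtag into words.</p>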
      <p>Second, the sentences must be converted into feature vectors,
a central step in supervised-learning-based sentiment analysis methods.
As the statistical feature for text classification, we chose Term
Frequency (TF) over unigrams and bigrams because it provided
the best performance.</p>
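      <p>Term-frequency features over unigrams and bigrams can be illustrated with scikit-learn's <monospace>CountVectorizer</monospace>. Whether this exact class and its default tokenizer were used is an assumption; the two sentences are invented.</p>
      <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import CountVectorizer

# Raw term counts over unigrams and bigrams; the toy sentences are invented.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["no me gusta", "me gusta mucho"])

# The vocabulary maps each unigram or bigram to a column index of X.
features = sorted(vectorizer.vocabulary_)
print(features)
print(X.toarray())
```
]]></preformat>
      <p>Each row of <monospace>X</monospace> is the feature vector of one sentence, with one column per distinct unigram or bigram seen during fitting.</p>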
      <p>
        During our experiments, the scikit-learn machine learning library in Python
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was used for benchmarking.
      </p>
      <p>Applying different classifiers with different parameter settings yields many
possible models. Therefore, one of the most important steps
was to find the best individual classifier for the problem. Table 2 shows the
results obtained by each evaluated classifier in the training phase.</p>
      <p>After running several experiments with each classifier independently, we
selected the LR, MultinomialNB and SVM classifiers. To improve the
performance of each classifier, we chose the best parameter settings for
each of them. For the LR classifier we set the penalty parameter to
l1, and for the SVM classifier we used a linear kernel.</p>
      <p>In view of the results in Table 2, our final classification model is a voting
ensemble classifier that combines three individual algorithms: Logistic Regression
(LR), Multinomial Naive Bayes (MultinomialNB) and Support Vector Machines
(SVMs). We also tested other models such as Decision Tree (DT) and
Random Forest (RF), but we obtained better results with the combination
of the three algorithms mentioned above. Figure 1 shows our model.
We trained the model on the training set and evaluated it on the test set.</p>
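      <p>[Figure 1: Voting ensemble model. The training set is used to fit three classifiers (Naive Bayes, Logistic Regression and SVM); their predictions P1, P2 and P3 are combined by voting into the predictive model, which is applied to the test set to produce the final prediction.]</p>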
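      <p>A voting ensemble of this kind can be sketched with scikit-learn's <monospace>VotingClassifier</monospace>. The sketch assumes hard (majority) voting, default values for every parameter not mentioned in the text, and the <monospace>liblinear</monospace> solver, chosen here only because it supports the l1 penalty; the training tweets are invented.</p>
      <preformat><![CDATA[
```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

ensemble = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            # The solver is an assumption; it is one that accepts penalty="l1".
            ("lr", LogisticRegression(penalty="l1", solver="liblinear")),
            ("nb", MultinomialNB()),
            ("svm", SVC(kernel="linear")),
        ],
        voting="hard",  # majority vote over the three predicted labels
    ),
)

# Invented toy data (1 = aggressive, 0 = non-aggressive).
train_tweets = [
    "eres un idiota", "callate tonto", "te odio tanto", "que asco das",
    "buenos dias a todos", "me gusta el cafe", "feliz cumple", "gran partido hoy",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

ensemble.fit(train_tweets, train_labels)
predictions = ensemble.predict(["eres un idiota", "feliz dia"])
print(predictions)
```
]]></preformat>
      <p>With three base classifiers and two classes, hard voting always has a strict majority, so no tie-breaking rule is needed.</p>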
    </sec>
    <sec id="sec-4">
      <title>Analysis of results</title>
      <p>The system was evaluated using the official competition metric, the
macro-averaged F1-score, computed as follows:</p>
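      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>\text{Macro-F1} = 2 \cdot \frac{\text{Macro-Prec} \cdot \text{Macro-Rec}}{\text{Macro-Prec} + \text{Macro-Rec}}</tex-math>
      </disp-formula>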
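      <p>Equation 1 can be computed directly from gold and predicted labels, as in the following sketch. The function name <monospace>macro_f1</monospace> and the convention of scoring an undefined precision or recall as 0 are assumptions. Note that this harmonic mean of macro-averaged precision and recall is not the same quantity as the mean of per-class F1 scores, which is what scikit-learn's <monospace>f1_score</monospace> returns with <monospace>average="macro"</monospace>.</p>
      <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Harmonic mean of macro-averaged precision and recall (Equation 1)."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        # A class with no predictions (or no gold instances) contributes 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    macro_prec = sum(precisions) / len(classes)
    macro_rec = sum(recalls) / len(classes)
    if macro_prec + macro_rec == 0:
        return 0.0
    return 2 * macro_prec * macro_rec / (macro_prec + macro_rec)
```
]]></preformat>
      <p>For gold labels [1, 1, 0, 0] and predictions [1, 0, 0, 0], macro precision is 5/6 and macro recall is 3/4, so the function returns 15/19.</p>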
      <p>The results of our participation in subtask MEX-A3T of IberLEF Workshop
during the evaluation phase can be seen in Table 3.</p>
      <p>Regarding our results, we achieved a better score
on the non-aggressive class (F1: 0.8232). However, our system does not
classify the aggressive class well (F1: 0.299).</p>
      <p>With respect to the other participants, we were ranked in 21st position, as can be
seen in Table 3.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>
        In this paper, we describe our participation in IberLEF-2019: subtask
MEX-A3T: Authorship and aggressiveness analysis in Twitter: case study in Mexican
Spanish [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To carry out the task, we built a classification model based on a voting
ensemble classifier that combines three individual algorithms.
      </p>
      <p>For the machine learning approach, we studied several supervised
classifiers (Decision Tree, Support Vector Machine, Multinomial Naive Bayes, Random
Forest and Logistic Regression) and the use of n-gram features. We
observed that using the combination of unigrams and bigrams as features
increases the Macro F1-score for all classifiers. We combined the three
best classifiers studied via a majority voting ensemble
classifier.</p>
      <p>In conclusion, we consider the automatic detection of aggressive
language in textual information in general, and in social media in particular, a
very interesting and challenging problem. An additional difficulty is
the diversity of languages and the variety of dialects within Spanish, for
example Mexican or Colombian Spanish. Thus, much work needs to be done
before an accurate system is finally achieved. Therefore, we will continue studying
the problem for different tasks related to hate speech and for different languages. In
particular, since studies concentrating on Spanish are scarce, we will continue
developing systems for detecting hate speech in Spanish and its dialects, as it is
one of the most widely spoken languages in the world.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by Fondo Europeo de Desarrollo
Regional (FEDER), REDES project (TIN2015-65136-C2-1-R) and LIVING-LANG
project (RTI2018-094653-B-C21) from the Spanish Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Álvarez-Carmona</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guzmán-Falcón</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reyes-Meza</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rico-Sulayes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>
          .
          <source>In: Notebook Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL)</source>
          , Seville, Spain. vol.
          <volume>6</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aragón</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Álvarez-Carmona</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>
          .
          <source>In: Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>
          , Bilbao, Spain, September (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bosco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fersini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nozza</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanguinetti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter</article-title>
          .
          <source>In: Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019)</source>
          . Association for Computational Linguistics (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fersini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nozza</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the EVALITA 2018 task on automatic misogyny identification (AMI)</article-title>
          .
          <source>In: Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA18), Turin, Italy</source>
          . CEUR.org (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fersini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anzovino</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the task on automatic misogyny identification at IberEval 2018</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fortuna</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>51</volume>
          (4), article 85
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojha</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</article-title>
          .
          <source>In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (Oct),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Waseem</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>W.H.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Proceedings of the First Workshop on Abusive Language Online</article-title>
          .
          <source>In: Proceedings of the First Workshop on Abusive Language Online</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wiegand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siegel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruppenhofer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the GermEval 2018 shared task on the identification of offensive language</article-title>
          .
          <source>In: 14th Conference on Natural Language Processing (KONVENS 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>