<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Aggressive Analysis in Twitter using a Combination of Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gretel Liz De la Peña Sarracén</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>531</fpage>
      <lpage>536</lpage>
      <abstract>
        <p>This paper describes the system we developed for the aggressiveness detection track of the Authorship and Aggressiveness Analysis in Twitter task (MEX-A3T). The task focuses on the detection of aggressive comments in tweets written by Mexican users. We analyzed three kinds of models, and the proposed system is a combination of them. The first model is based on a Convolutional Neural Network whose outputs feed an LSTM neural network. The second uses the pre-trained Universal Sentence Encoder to encode sentences into embedding vectors. The third consists of a simple Multi-layer Perceptron. The final results show that our best variant, the simplest model, reaches the third position in the competition.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>LSTM Model</kwd>
        <kwd>Universal Sentence Encoder</kwd>
        <kwd>Multi-layer Perceptron</kwd>
        <kwd>Aggressive Detection Track</kwd>
        <kwd>Twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nowadays, the use of social networks is increasing rapidly. Among them, Twitter
stands out as a broadcast medium for information. Many users rely on it
as one of their main sources of news. However, many of those
users are targeted by tweets with aggressive messages.</p>
      <p>
        This phenomenon is a problem that affects different groups of
people, for instance through harassment of immigrants and women or through sexist
comments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Therefore, some mechanism to control this situation is essential.
Considerable research has been done in this regard. Some approaches use traditional
classifiers such as Naive Bayes and Linear SVM [
        <xref ref-type="bibr" rid="ref13 ref15 ref9">15, 13, 9</xref>
        ]. Others use models
based on Deep Learning with architectures such as LSTM and Convolutional
Neural Networks (CNN) [
        <xref ref-type="bibr" rid="ref12 ref2 ref7">2, 7, 12</xref>
        ]. Several international competitions have also
been organized to motivate the creation of systems for the detection of this type
of message, such as the Workshop on Trolling, Aggression and Cyberbullying
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which included a shared task on aggression identification; the tasks on
Automatic Misogyny Identification (AMI) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] at IberEval 2018 and EVALITA 2018
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; the Workshop on Abusive Language [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; and the Authorship and
Aggressiveness Analysis in Twitter task (MEX-A3T) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed at IberEval
2018. This year the second edition of MEX-A3T [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been launched. Its aim is
to further the research on authorship and aggressiveness analysis
and to advance the computational processing of Mexican tweets.
      </p>
      <p>In this work, we propose a system formed by the combination of three
strategies. Each of them analyzes the tweet to be classified in a different way. The first
one is based on a Convolutional Neural Network whose outputs feed an LSTM
neural network. The second one uses the pre-trained Universal Sentence
Encoder to encode sentences into embedding vectors. The third one consists of
a simple Multi-layer Perceptron that receives the TF-IDF representation of the
tweet. The strategies are then combined to build a system that takes
each of the analyses into account and predicts whether a given tweet is aggressive
or not.</p>
      <p>The rest of the paper is organized as follows. Section 2 describes our system.
Experimental results are then discussed in Section 3. Finally, we present our
conclusions with a summary of our findings in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>System</title>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>The first step in the development of the system is the preprocessing of the texts.
In this phase, different elements that are typically present in tweets and that
probably carry no discriminative semantic information are normalized.
Numbers are replaced by the tag num, dates by the tag date, and
all links by the tag url. In addition, user mentions, identified by the initial
character @, are replaced by the tag user. Hashtags are left unprocessed to avoid
losing information that they may contain.</p>
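        <p>The normalization described above can be sketched as follows. This is only an illustrative sketch: the regular expressions and the function names normalize_token and normalize_tweet are assumptions made for this example, since the text names the replacement tags but not the exact rules used to detect numbers, dates and links.</p>
        <preformat>
```python
import re

# Hypothetical patterns: the text names the tags, not the detection rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?\b")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def normalize_token(token):
    """Normalize one whitespace-delimited token."""
    if token.startswith("#"):
        return token                      # hashtags are kept untouched
    token = URL_RE.sub("url", token)      # links first, so digits in URLs survive
    token = USER_RE.sub("user", token)    # mentions start with '@'
    token = DATE_RE.sub("date", token)    # dates before numbers: dates contain digits
    return NUM_RE.sub("num", token)


def normalize_tweet(text):
    """Normalize a tweet; runs of whitespace collapse to single spaces."""
    return " ".join(normalize_token(token) for token in text.split())


print(normalize_tweet("@juan mira https://t.co/abc el 12/05/2019 gane 300 pesos #suerte"))
# prints: user mira url el date gane num pesos #suerte
```
</preformat>
        <p>Working token by token is a design choice of this sketch: it makes it easy to skip hashtags, so that a hashtag containing digits is not rewritten by the number rule.</p>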
      </sec>
      <sec id="sec-2-2">
        <title>Method</title>
        <p>We propose a system that consists of a combination of different strategies, as
Figure 1 shows. The first one is a deep learning model (CNN-LSTM) at the
word level. For each tweet, CNN-LSTM receives the word embeddings as input;
these are processed by a CNN to obtain a sequence of vectors. These vectors
can be seen as representations of n-grams, where n is given by the size of the kernel.
The details are discussed in the next section. The vectors then feed an LSTM
model that produces a prediction. The second model (USE-MLP) takes as input
a single vector per tweet. This vector is obtained with the pre-trained Universal
Sentence Encoder based on the transformer architecture. A Multi-layer
Perceptron is then used to get a prediction. The third model (TFIDF-MLP) is
similar to the second one; the difference lies in the input of the
Multi-layer Perceptron, which in this case is the TF-IDF representation of
the tweet. In addition, a new component is concatenated to this vector. It encodes
a linguistic feature based on a lexicon of obscene and vulgar phrases in
Mexican Spanish. The final prediction is obtained by majority vote
over the predictions of the three models. In each case, cross entropy is used as the loss
function.</p>
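        <p>The majority vote over the three predictions can be illustrated with a short sketch. The function name majority_vote and the label encoding (1 for aggressive, 0 for non-aggressive) are assumptions made for this example.</p>
        <preformat>
```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most of the models."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical outputs of CNN-LSTM, USE-MLP and TFIDF-MLP for one tweet
# (1 = aggressive, 0 = not aggressive).
print(majority_vote([1, 0, 1]))  # prints 1
```
</preformat>
        <p>With three models and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.</p>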
      </sec>
      <sec id="sec-2-3">
        <title>Convolutional Neural Network and LSTM Model</title>
        <p>In this first model, as mentioned before, each tweet is represented as a
sequence of word embeddings. For this, we use the Word2vec MEX-A3T model provided
by the organizers of the competition, which was trained on the
MEX-A3T corpus containing 500,000 tokens. The size of the embeddings is 200.
The objective of this model is to process the bigrams present within a tweet in a
sequential manner. The approach used to obtain the sequence of bigram
vectors is shown in Figure 2: 150 filters of size 2x200 are used, where 2 corresponds
to the size of the bigrams and 200 to that of the embeddings.
The result is a column matrix with depth 150, so that the i-th component, taken
in depth, can be seen as a high-level representation of the i-th bigram. Each
of these vectors is then the input at one time step of the LSTM recurrent neural
network, which processes them sequentially. Finally, a Softmax layer is used
to obtain the prediction.</p>
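        <p>To make the shapes concrete, the following sketch applies 150 filters of size 2x200 to a toy tweet of six random word vectors, using only the standard library. It illustrates the convolution step only: the weights are random, the ReLU activation is an assumption (the text does not name the activation), and the LSTM that would consume the resulting vectors is omitted.</p>
        <preformat>
```python
import random

EMB_DIM = 200      # embedding size stated in the text
KERNEL = 2         # two consecutive words, i.e. a bigram
N_FILTERS = 150    # number of filters stated in the text


def bigram_convolution(embeddings, filters, biases):
    """Slide every KERNEL x EMB_DIM filter over consecutive word pairs.

    embeddings: list of n word vectors, each of length EMB_DIM
    filters:    list of N_FILTERS filters, each a list of KERNEL weight rows
    returns:    list of n - 1 vectors, each of length N_FILTERS
    """
    outputs = []
    for start in range(len(embeddings) - KERNEL + 1):
        window = embeddings[start:start + KERNEL]
        features = []
        for filt, bias in zip(filters, biases):
            total = bias
            for row, word in zip(filt, window):
                total += sum(w * x for w, x in zip(row, word))
            features.append(max(0.0, total))  # ReLU, an assumed activation
        outputs.append(features)
    return outputs


rng = random.Random(0)
# Six random vectors stand in for the embeddings of a six-word tweet.
tweet = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(6)]
filters = [[[rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)] for _ in range(KERNEL)]
           for _ in range(N_FILTERS)]
biases = [0.0] * N_FILTERS

bigram_vectors = bigram_convolution(tweet, filters, biases)
print(len(bigram_vectors), len(bigram_vectors[0]))  # prints: 5 150
```
</preformat>
        <p>A tweet of n words thus yields n - 1 vectors of 150 components, one per bigram, and each of them would be fed to the LSTM at one time step.</p>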
      </sec>
      <sec id="sec-2-4">
        <title>Universal Sentence Encoder Model</title>
        <p>
          The second model takes advantage of the pre-trained Universal Sentence Encoder
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to get the prediction for a tweet. The encoder takes variable-length text as input and
outputs a 512-dimensional vector. We used the encoder variant
based on the transformer architecture trained for Spanish. Two dense layers
with a ReLU activation process the vector, and the final prediction is
obtained with a Softmax layer.
        </p>
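        <p>The classifier on top of the sentence vector can be sketched as a forward pass in plain Python. Everything beyond the 512-dimensional input, the two ReLU layers and the softmax output is an assumption made for illustration: the hidden size of 64, the random weights, the separate output projection before the softmax, and the random vector that stands in for a real encoder output.</p>
        <preformat>
```python
import math
import random

INPUT_DIM = 512   # size of the sentence vector stated in the text
HIDDEN_DIM = 64   # assumed; the layer sizes are not given in the text
N_CLASSES = 2     # aggressive / not aggressive


def dense(vector, weights, biases):
    """Affine layer: one output value per row of weights."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(sentence_vector, layers):
    """Two ReLU dense layers followed by a softmax output layer."""
    (w1, b1), (w2, b2), (w_out, b_out) = layers
    hidden = relu(dense(sentence_vector, w1, b1))
    hidden = relu(dense(hidden, w2, b2))
    return softmax(dense(hidden, w_out, b_out))


def cross_entropy(probabilities, true_class):
    """Cross-entropy loss for a single example."""
    return -math.log(probabilities[true_class])


rng = random.Random(0)
layers = [init_layer(rng, HIDDEN_DIM, INPUT_DIM),
          init_layer(rng, HIDDEN_DIM, HIDDEN_DIM),
          init_layer(rng, N_CLASSES, HIDDEN_DIM)]
vector = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]  # stand-in for an encoder output
probs = predict(vector, layers)
print(probs, cross_entropy(probs, 1))
```
</preformat>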
      </sec>
      <sec id="sec-2-5">
        <title>Multi-layer Perceptron model</title>
        <p>
          A problem that frequently occurs with approaches based on deep learning is
the lack of data to train the models. To mitigate this problem, a model based on a
traditional approach has been included in the system. For this, each tweet is
represented as a TF-IDF vector. Additionally, a linguistic feature is
incorporated into the vector: the number
of aggressive phrases contained in the tweet. The identification of these phrases
is based on the study in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], where the authors propose a
methodology for the detection of obscene and vulgar phrases in Mexican tweets.
The prediction for a tweet is then obtained by a Multi-layer Perceptron of three
layers whose input is the corresponding vector.
        </p>
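        <p>The input representation of this model can be sketched as follows. The sketch covers only the feature vector, not the three-layer perceptron. The smoothed IDF formula, the function names and the toy corpus are assumptions, and the two lexicon entries are harmless placeholders rather than entries of the lexicon from [8].</p>
        <preformat>
```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def count_phrases(tokens, lexicon):
    """Count occurrences of lexicon phrases as contiguous token sequences."""
    hits = 0
    for phrase in lexicon:
        words = phrase.split()
        size = len(words)
        hits += sum(tokens[i:i + size] == words
                    for i in range(len(tokens) - size + 1))
    return hits


def tfidf_with_lexicon(tokens, idf, vocabulary, lexicon):
    """TF-IDF vector of one tweet plus one extra lexicon-count component."""
    counts = Counter(tokens)
    vector = [counts[term] * idf[term] for term in vocabulary]
    vector.append(float(count_phrases(tokens, lexicon)))
    return vector


# Toy corpus and placeholder lexicon, both hypothetical.
docs = [["eres", "un", "tonto"], ["que", "buen", "dia"], ["eres", "muy", "amable"]]
lexicon = ["tonto", "muy tonto"]
idf = fit_idf(docs)
vocabulary = sorted(idf)
vector = tfidf_with_lexicon(docs[0], idf, vocabulary, lexicon)
print(len(vocabulary), len(vector), vector[-1])  # prints: 8 9 1.0
```
</preformat>
        <p>The last component of the vector is the number of lexicon phrases found in the tweet, so the perceptron receives one more input than the size of the vocabulary.</p>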
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>For the experiments, 30% of the Training set was held out as a
Validation set. The F-measure of the aggressive class on that
Validation set was 0.64 for USE-MLP, 0.68 for TFIDF-MLP and 0.65 for
CNN-LSTM, while the combination of the three models obtained 0.68. As can
be seen, no model outperformed the simplest one (TFIDF-MLP), which was matched only
by the combination; the simplest model was also the best on the Test set, as shown below.</p>
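      <p>The evaluation protocol can be sketched as follows. The shuffling with a fixed seed is an assumption, since the text does not say how the 30% was selected, and the labels in the example are invented solely to exercise the F-measure computation.</p>
      <preformat>
```python
import random


def holdout_split(items, validation_fraction=0.3, seed=0):
    """Shuffle the items and hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def f1_positive(y_true, y_pred, positive=1):
    """F-measure of one class (here the aggressive class, labelled 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


train, validation = holdout_split(range(10))
print(len(train), len(validation))  # prints: 7 3

# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
print(f1_positive(y_true, y_pred))  # prints: 0.75
```
</preformat>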
      <p>Table 1 shows the results on the Test set for different variants, together with the
result of the best system in the competition (best). Run1 corresponds to the
combination of the models described above, while run2 and run3 are
systems that use only the CNN-LSTM and TFIDF-MLP models,
respectively. Our best result is obtained with the simplest model, which reaches
the third position in the competition with a value very close to the first two,
both in the F-measure of the aggressive class (F1) and in that of both classes (F(P,R)). Our
results suggest that the lack of data can affect the models based on deep
learning, which obtained worse results in this case. Another
problem that may affect the performance of the deep learning based systems is
that rare or misspelled words cannot be represented with the
embeddings. This can harm the training, since important information may
be lost.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future work</title>
      <p>We proposed a combination of three different models for the MEX-A3T task
on aggressiveness detection in Twitter. The first one uses a CNN whose outputs
feed an LSTM model. The second model analyzes the input at the full-text
level with the Universal Sentence Encoder. The third model is the simplest one:
it takes a TF-IDF representation of the text and obtains the prediction with
a Multi-layer Perceptron. The best results were obtained with this last
model rather than with the system that combines all three models. This may be
due to the lack of data to train deep learning models, or to the problem of rare
words that cannot be represented with the embeddings. Thus, in future work
it is important to address these problems to improve the performance of the
system.</p>
      <p>Acknowledgments. The work of the second author was partially funded by
the Spanish MICINN under the research project MISMIS-FAKEnHATE on
Misinformation and Miscommunication in social media: FAKE news and HATE
speech (PGC2018-096212-B-C31).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Aragón, Mario Ezra and Álvarez-Carmona, Miguel A. and Montes-y-Gómez, Manuel and Escalante, Hugo Jair and Villaseñor-Pineda, Luis and Moctezuma, Daniela.
          <article-title>Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>
          .
          <source>Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>
          , Bilbao, Spain, September. (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Badjatiya, Pinkesh and Gupta, Shashank and Gupta, Manish and Varma, Vasudeva.
          <article-title>Deep Learning for Hate Speech Detection in Tweets</article-title>
          .
          <source>Proceedings of the 26th International Conference on World Wide Web Companion</source>
          . International World Wide Web Conferences Steering Committee. (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and John, Rhomni St and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and others.
          <article-title>Universal Sentence Encoder</article-title>
          . arXiv preprint arXiv:1803.11175. (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Fersini, Elisabetta and Anzovino, Maria and Rosso, Paolo.
          <article-title>Overview of the Task on Automatic Misogyny Identification at IberEval 2018</article-title>
          .
          <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018)</source>
          . CEUR Workshop Proceedings. CEUR-WS.org, Seville, Spain.
          <volume>2150</volume>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Fersini, Elisabetta and Nozza, Debora and Rosso, Paolo.
          <article-title>Overview of the EVALITA 2018 Task on Automatic Misogyny Identification (AMI)</article-title>
          .
          <source>Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA18)</source>
          , Turin, Italy. CEUR.org.
          <volume>2263</volume>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Frenda, Simona and Ghanem, Bilal and Montes-y-Gómez, Manuel and Rosso, Paolo.
          <article-title>Online Hate Speech against Women: Automatic Identification of Misogyny and Sexism on Twitter</article-title>
          .
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          .
          <volume>36</volume>
          (
          <issue>5</issue>
          ), pp.
          <fpage>4743</fpage>
          -
          <lpage>4752</lpage>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Gamback, Bjorn and Sikdar,
          <string-name>
            <given-names>Utpal</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Using Convolutional Neural Networks to Classify Hate-Speech</article-title>
          .
          <source>Proceedings of the First Workshop on Abusive Language Online</source>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Guzmán, Estefanía and Beltrán, Beatriz and Tovar, Mireya and Vázquez, Andrés and Martínez, Rodolfo.
          <article-title>Clasificación de Frases Obscenas o Vulgares dentro de Tweets</article-title>
          .
          <source>Research in Computing Science</source>
          .
          <volume>85</volume>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          . (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Gómez-Adorno, Helena and Bel-Enguix, Gemma and Sierra, Gerardo and Sánchez, Octavio and Quezada, Daniela.
          <article-title>A Machine Learning Approach for Detecting Aggressive Tweets in Spanish</article-title>
          .
          <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)</source>
          . CEUR WS Proceedings.
          <volume>2150</volume>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>107</lpage>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Kumar, Ritesh and Ojha, Atul Kr and Zampieri, Marcos and Malmasi, Shervin.
          <source>Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Álvarez-Carmona, Miguel and Guzmán-Falcón, Estefanía and Montes-y-Gómez, Manuel and Escalante, Hugo Jair and Villaseñor-Pineda, Luis and Reyes-Meza, Verónica and Rico-Sulayes, Antonio.
          <article-title>Overview of MEX-A3T at IberEval 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets</article-title>
          .
          <source>Notebook Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL)</source>
          , Seville, Spain,
          <volume>6</volume>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Nikhil, Nishant and Pahwa, Ramit and Nirala, Mehul Kumar and Khilnani, Rohan.
          <article-title>LSTMs with Attention for Aggression Detection</article-title>
          . arXiv preprint arXiv:1807.06151. (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Waseem, Zeerak and Hovy, Dirk.
          <article-title>Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter</article-title>
          .
          <source>Proceedings of the NAACL Student Research Workshop</source>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Waseem, Zeerak and Chung, Wendy Hui Kyong and Hovy, Dirk and Tetreault, Joel.
          <source>Proceedings of the First Workshop on Abusive Language Online</source>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Xiang, Guang and Fan, Bin and Wang, Ling and Hong, Jason and Rose, Carolyn.
          <article-title>Detecting Offensive Tweets via Topical Feature Discovery over a Large Scale Twitter Corpus</article-title>
          .
          <source>Proceedings of the 21st ACM International Conference on Information and Knowledge Management</source>
          . ACM. (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>