<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UACh at MEX-A3T 2019: Preliminary Results on Detecting Aggressive Tweets by Adding Author Information Via an Unsupervised Strategy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Casavantes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Lopez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Carlos Gonzalez</string-name>
          <email>lcgonzalezg@uach.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Autónoma de Chihuahua, Facultad de Ingeniería, Chihuahua</institution>
          ,
          <addr-line>Chih.</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>537</fpage>
      <lpage>543</lpage>
      <abstract>
        <p>In this paper we describe our participation in the Aggressiveness Detection track of the second edition of MEX-A3T. We evaluate different strategies for text classification, including classifiers such as Support Vector Machines and a Multilayer Perceptron trained on n-grams (words and characters) and word embeddings. We also study the inclusion of features that try to give context to the text messages, exploring whether people verbally attack differently depending on their traits and overall environment. Preliminary results show that our strategy is competitive for detecting aggression in tweets, ranking in 2nd place with respect to the participants of 2018 and 2019.</p>
      </abstract>
      <kwd-group>
        <kwd>Spanish text classification</kwd>
        <kwd>Aggressiveness Detection</kwd>
        <kwd>Multilayer Perceptron</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Technology has changed the way in which people communicate with each other,
giving rise to new services such as social networks, where an informal style of
communication is used. Such social networks, though, present several challenges in
keeping communication channels open to the free sharing of ideas. The
intolerance and aggressiveness of certain users affects the experience of other consumers
or people interested in being part of the communities and their conversations.
Not being face to face in the communication channel, and even being able to
preserve anonymity, encourages these individuals to express themselves offensively.
However, the volume of messages sent daily, the growth of online
communities, and the ease of access to these social networks make the
moderation of communication channels a difficult task to handle by
conventional means; and as people increasingly communicate online, the need for
high-quality automated abusive language classifiers becomes much more
profound [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        One of the goals of the second edition of MEX-A3T [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is to tackle this
problem and further the research on this important NLP task: the detection
of aggressive tweets in Mexican Spanish. In this work we evaluate previously
proposed strategies, such as the use of lexical features through TF-IDF representations,
and different approaches to adding features that try to give context to each
text. Surprisingly, even tackling the task with such a basic approach, our proposal
is able to offer competitive results, just slightly behind the top performer of this
competition in 2018 and 2019, INGEOTEC. Furthermore, we also investigate
how to incorporate authors' traits using unsupervised methods,
attempting to include this information as possible features, based on the hypothesis that
aggression takes different forms depending on the author's context.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Method</title>
      <sec id="sec-2-1">
        <title>Data Pre-processing</title>
        <p>After loading the train and test sets, we strip the tweets of non-alphanumeric
characters, keeping only some relevant Spanish characters (á, é, í, ó, ú, ñ, and ü),
and all words are then lowercased. We noticed that in both sets
there exist many different terms expressing laughter (mainly due to how many
times "ja" is repeated when the word "jaja" appears, and because of typos), which
led us to replace every word containing "jaja" with "risa" (laugh), with the
purpose of decreasing the number of terms that represent this emotion.
It is worth mentioning that we also created and conducted experiments on a
version of the datasets where emojis were converted to text and hashtags were
split into words (e.g., ":)" would turn into "smiling face", and "#FelizMiercoles"
would become "feliz miercoles"); however, most hashtags were wrongly split,
the performance of the classifiers decreased when these steps were incorporated,
and they were therefore discarded.</p>
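        <p>A minimal sketch of the pre-processing steps above (the function and the kept-character set are our own naming, not the authors' code):</p>

```python
import re

# Characters kept after stripping: alphanumerics plus accented Spanish letters.
KEPT = "a-z0-9áéíóúñü"

def clean_tweet(text):
    """Lowercase, drop non-kept characters, and collapse any laughter
    variant containing 'jaja' into the single token 'risa'."""
    text = text.lower()
    # Replace every character outside the kept set (and whitespace) by a space.
    text = re.sub("[^" + KEPT + r"\s]", " ", text)
    words = ["risa" if "jaja" in w else w for w in text.split()]
    return " ".join(words)
```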
      </sec>
      <sec id="sec-2-2">
        <title>Features</title>
        <p>We conducted our research using the following features:</p>
        <p>Lexical: We use word n-grams (n = 1, 2) and char n-grams (n = 3, 4) as features;
this collection of terms is weighted with its term frequency-inverse document
frequency (TF-IDF).</p>
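        <p>A sketch of these lexical features with scikit-learn (which the paper reports using for its TF-IDF matrices); the two-vectorizer union is our reading of "word and char n-grams", not the authors' exact configuration:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Word n-grams (n = 1, 2) and character n-grams (n = 3, 4), each TF-IDF
# weighted, concatenated into a single sparse feature matrix.
lexical = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 4))),
])

docs = ["risa no puede ser", "vamos muy bien"]
X = lexical.fit_transform(docs)  # one row per tweet
```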
        <p>
          Document Embeddings: The objective was to represent the tweets through
word embeddings [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and try different classifiers with these new features. Each
text message was converted to a vector of size 300 (the mean of the vectors of
its words). The Spanish word model was computed with fastText [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
downloaded from [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
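        <p>The averaging step can be sketched as follows (toy word vectors; the paper uses 300-dimensional fastText vectors for Spanish):</p>

```python
def doc_vector(text, word_vectors, dim=300):
    """Mean of the vectors of the words of the text that exist in the model;
    returns the zero vector when no word is covered."""
    words = [w for w in text.split() if w in word_vectors]
    if not words:
        return [0.0] * dim
    sums = [0.0] * dim
    for w in words:
        for i, v in enumerate(word_vectors[w]):
            sums[i] += v
    return [s / len(words) for s in sums]
```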
      </sec>
      <sec id="sec-2-3">
        <title>User Occupation and Location Predictions</title>
        <p>
          Although we attempted several strategies to obtain unsupervised author profiles for each document [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we
ended up using the output of the system developed by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as predictions of
occupation and location values, to explore possible differences in vocabulary
according to the profile of the author of the message.
        </p>
        <p>
          Grouping tweets by theme: An implementation of Self-Organizing Maps
(SOM) called MiniSom [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was used as a clustering strategy, aiming to find
groups in the collection of texts based on underlying, non-explicit features.
The clustering was done both including all words and ignoring swear words (to
reduce the noise and focus on thematic terms). After training the network, we
were able to compute the coordinates assigned to a tweet on the map and use
these as new features.
        </p>
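        <p>After training, a SOM maps each tweet vector to the grid coordinates of its best-matching unit (MiniSom exposes this as <monospace>winner</monospace>); a minimal, library-free sketch of that lookup over a rows × cols × dim weights grid:</p>

```python
def winner(x, weights):
    """Return the (row, col) of the grid unit whose weight vector is
    closest (squared Euclidean distance) to input vector x."""
    dists = {
        (i, j): sum((a - b) ** 2 for a, b in zip(x, w))
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    }
    return min(dists, key=dists.get)
```

        <p>The pair returned by this lookup is what we append to the feature vector of each tweet.</p>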
        <p>
          Perspicuity score / Inflesz scale: Based on [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we adapted the idea of
capturing the quality of each tweet by using a modified Flesch Reading Ease score
(since that test only applies to text written in English), called the Perspicuity score,
and its equivalence on the Inflesz scale, following the equation described in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
where the number of sentences is also fixed at one.
        </p>
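        <p>A sketch of this score under our reading of the cited formulation (the Szigriszt-Pazos perspicuity index, with syllables crudely approximated as vowel groups and the sentence count fixed at one); the constants are an assumption to be checked against [10]:</p>

```python
import re

def syllables(word):
    """Crude Spanish syllable count: number of vowel groups, at least 1."""
    return max(1, len(re.findall("[aeiouáéíóúü]+", word)))

def perspicuity(text):
    """Perspicuity score with the number of sentences fixed at one."""
    words = text.lower().split()
    syl = sum(syllables(w) for w in words)
    return 206.835 - 62.3 * (syl / len(words)) - len(words)
```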
        <p>All the extra categorical features mentioned above were concatenated
following a One-Hot Encoding scheme.</p>
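        <p>One-hot concatenation in its simplest form (the category labels here are hypothetical placeholders, not the task's actual occupation/location sets):</p>

```python
def one_hot(value, categories):
    """Binary indicator vector over a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

# Hypothetical label sets for illustration only.
occupations = ["student", "professional", "other"]
locations = ["north", "center", "south"]

# Extra features for one tweet: concatenated one-hot blocks.
extra = one_hot("student", occupations) + one_hot("north", locations)
```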
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>The datasets were provided by the MEX-A3T team. Table 1 shows the distribution
of training and test partitions for Spanish tweets.</p>
      <p>
        We split the training set into 67% for training and 33% for validation
to evaluate our experiments with the different combinations of features discussed
in Section 2.2. We started our research by recreating the baselines described in
the overview of the first edition of MEX-A3T [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], particularly focusing on the
character trigrams baseline, as it holds the best performance in comparison to
the BoW baseline.
      </p>
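      <p>The 67/33 internal split can be reproduced with scikit-learn's <monospace>train_test_split</monospace> (placeholder data below; the random seed and stratification are our assumptions, not reported by the paper):</p>

```python
from sklearn.model_selection import train_test_split

# Placeholder tweets and binary labels standing in for the official train set.
texts = ["tweet %d" % i for i in range(100)]
labels = [i % 2 for i in range(100)]

# 67% training / 33% validation, keeping the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels)
```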
      <p>
        We trained Linear Support Vector Machines and a Multilayer Perceptron as
classifiers for this task, and we chose the perceptron as the final system
for submitting our predictions, since it exhibited the best results in the validation
stage, as shown in Table 2, where we report the macro F1-score and the
F-measure over the aggressive class. We performed all modeling regarding the
creation of TF-IDF feature matrices and SVM classifiers using scikit-learn [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and
for the Multilayer Perceptron we used the implementation described in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
There was only one instance where this perceptron could not be trained with word
embeddings, so we tried another configuration of the MLPClassifier from
scikit-learn, getting low scores similar to the ones obtained using LinearSVM, and
we therefore cast this approach aside.
      </p>
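      <p>The two metrics reported in Table 2 correspond to scikit-learn's <monospace>f1_score</monospace> with macro averaging and with the aggressive class as the positive label (toy predictions below, for illustration only):</p>

```python
from sklearn.metrics import f1_score

# Toy labels: 1 = aggressive, 0 = non-aggressive.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

macro = f1_score(y_true, y_pred, average="macro")  # macro F1 over both classes
f_aggr = f1_score(y_true, y_pred, pos_label=1)     # F-measure, aggressive class
```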
      <sec id="sec-3-1">
        <title>Results</title>
        <p>
          As stated before, the Multilayer Perceptron was chosen as the final system; however,
because of time and memory constraints we had to train this model using only
character n-grams of range [3, 4] for this task, even though later results showed
better performance using n-grams of range [3, 5]. Table 3 lists the top five
final rankings for the aggressiveness detection task in 2019; further details on all
results of the contest are given in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It is interesting to observe that even
though our system relied on such a basic approach, it is able to compete
face-to-face against INGEOTEC, a model based on an ensemble of classifiers which
specially tailors discriminative features for aggressiveness detection via a Genetic
Programming strategy.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Analysis</title>
        <p>To break down our results, we started by obtaining the 10 most valuable
character-level n-grams, separated by length, as shown in Table 4. With respect to
the aggressive class, our final configuration had more false positives than false
negatives, meaning that it was easier for a non-aggressive tweet to be misclassified
as aggressive than the other way around. Despite running several
experiments and adding new features trying to give context to the tweets, in hopes
of improving classification in this task, these strategies unfortunately showed,
at best, almost unnoticeable changes in the results, and hindered classification
at worst. After manual inspection, we observed that this could have happened
because:
– Occupation and Location predictions did not group the messages in a
balanced way; in fact, most tweets would fall under only one out of eight
available categories for occupation and six categories for location.
– SOM coordinates did not enhance the classification scores, as the
clusters were capturing word repetition instead of thematic aspects of each
tweet. Later experiments (after submission of results) showed that this
behaviour was caused by clustering over n-grams; training
the SOM with word embeddings created from the train set of this task
(without external resources) solved this issue and did a better job at grouping the
tweets by subject.
– There was no relevant pattern from applying a perspicuity score to each tweet,
as there were multiple cases of similar scores assigned to both aggressive and
non-aggressive messages.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper we described our strategy to classify aggressive and non-aggressive
tweets in Mexican Spanish. Our best-performing system uses only lexical
features, and our results show a better performance than those of most
participants. This outcome, and the fact that the F-measure for the aggressive class
is still low compared to the score on the non-aggressive class, motivates
future work focusing on feature analysis for aggressiveness detection:
exploring which representations are truly relevant, including word embeddings and bags
of words and characters with different n-gram ranges, and seeing whether these complement each
other and, if so, how to combine them. We analyzed our clustering strategies, and
after changing the way they were trained we observed a slight improvement
in classification results, motivating us to keep experimenting with ways to
add context to the text messages. We also believe in the potential that neural
networks display for this task, and that more research on how to build and train
them properly will certainly improve the state of this task.
As future work, we look forward to developing new strategies based on deep neural
networks, such as Recurrent Neural Networks, which are tools aimed at working
with sequential data similar in nature to time series.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Chikashi</given-names>
            <surname>Nobata</surname>
          </string-name>
          , Joel Tetreault, Achint Thomas,
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Mehdad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yi</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Abusive language detection in online user content</article-title>
          .
          <source>In Proceedings of the 25th International Conference on World Wide Web, WWW '16</source>
          , pages
          <fpage>145</fpage>
          –
          <lpage>153</lpage>
          , Republic and Canton of Geneva, Switzerland,
          <year>2016</year>
          . International World Wide Web Conferences Steering Committee.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Mario Ezra</given-names>
            <surname>Aragon</surname>
          </string-name>
          , Miguel A. Alvarez-Carmona, Manuel Montes-y-Gomez, Hugo Jair Escalante, Luis Villaseñor-Pineda, and Daniela Moctezuma.
          <article-title>Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>
          .
          <source>In Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>
          , Bilbao, Spain, September,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14</source>
          , pages II-1188–II-1196. JMLR.org,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          :
          <fpage>135</fpage>
          –
          <lpage>146</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. GitHub - mquezada/starsconf2018-word-embeddings:
          <article-title>Material for the workshop "Word vector representations based on neural networks" at StarsConf 2018</article-title>
          . https://github.com/mquezada/starsconf2018-word-embeddings.
          <source>(Accessed on 06/02/</source>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Lopez Santillan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.C.</given-names>
            <surname>Gonzalez-Gurrola</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Graciela</given-names>
            <surname>Ramírez-Alonso</surname>
          </string-name>
          .
          <article-title>Custom document embeddings via the centroids method: Gender classification in an author profiling task</article-title>
          .
          In Linda Cappellato, Nicola Ferro, Jian-Yun Nie, and Laure Soulier, editors,
          <source>CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers</source>
          , 10-14 September, Avignon, France. CEUR-WS.org,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Rosa María</given-names>
            <surname>Ortega-Mendoza</surname>
          </string-name>
          and A. Pastor Lopez-Monroy.
          <article-title>The winning approach for author profiling of Mexican users in Twitter at MEX-A3T@IberEval 2018</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Github - justglowing/minisom:
          <article-title>Minisom is a minimalistic implementation of the self organizing maps</article-title>
          . https://github.com/JustGlowing/minisom. (Accessed on 06/03/
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Davidson</surname>
          </string-name>
          , Dana Warmsley,
          <string-name>
            <given-names>Michael W.</given-names>
            <surname>Macy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ingmar</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <article-title>Automated hate speech detection and the problem of offensive language</article-title>
          .
          <source>CoRR, abs/1703.04009</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Escala Inflesz | legible. https://legible.es/blog/escala-inflesz/. (Accessed on 06/02/
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Miguel</given-names>
            <surname>Alvarez-Carmona</surname>
          </string-name>
          , Estefanía Guzman-Falcon, Manuel Montes-y-Gomez, Hugo Jair Escalante, Luis Villaseñor-Pineda, Veronica Reyes-Meza, and Antonio Rico-Sulayes.
          <article-title>Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <volume>2150</volume>
          :
          <fpage>74</fpage>
          –
          <lpage>96</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. GitHub - afshinrahimi/sparsemultilayerperceptron:
          <article-title>Lasagne/Theano based multilayer perceptron (MLP) which accepts both sparse and dense matrices and is very easy to use, with scikit-learn API similarity</article-title>
          . https://github.com/afshinrahimi/sparsemultilayerperceptron. (Accessed on 06/03/
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>