<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UACh at MEX-A3T 2020: Detecting Aggressive Tweets by Incorporating Author and Message Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Casavantes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Carlos González</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Autónoma de Chihuahua. Facultad de Ingeniería. Chihuahua</institution>
          ,
          <addr-line>Chih.</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <fpage>273</fpage>
      <lpage>279</lpage>
      <abstract>
        <p>In this paper we describe our participation in the Aggressiveness Detection Track at the third edition of MEX-A3T. We evaluate two strategies for text classification: a traditional classifier (Logistic Regression) and a transformer-based classifier (BETO). We also study the inclusion of social media metadata features to capture context from authors and text messages.</p>
      </abstract>
      <kwd-group>
        <kwd>Spanish text classification</kwd>
        <kwd>Aggressiveness Detection</kwd>
        <kwd>Metadata</kwd>
        <kwd>Twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social media platforms are among the most popular means of communication in the "Information Age", allowing their users to express and spread many kinds of ideas, from the cheerful to the less positive side of "freedom of speech". These networks are not immune to people who share offensive content: users who show malicious intent and are quick to reply aggressively. Anonymity, ease of access, and a general lack of punishment encourage these individuals to express themselves offensively. The volume of messages sent daily on social media makes moderation difficult to handle by conventional means, and as people increasingly communicate online, the need for high-quality automated abusive language classifiers becomes much more profound [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>One of the goals of the third edition of MEX-A3T [<xref ref-type="bibr" rid="ref2">2</xref>] is to tackle this problem and to advance research on an important Natural Language Processing (NLP) task: the detection of aggressive tweets in Mexican Spanish. Spotting offensive messages and hate speech is challenging because systems cannot rely on the text content alone [<xref ref-type="bibr" rid="ref3">3</xref>, <xref ref-type="bibr" rid="ref4">4</xref>]. For this reason, in this work we evaluate a strategy that gives context to short social media texts by taking message and author metadata into account. Our hypothesis is that these additional attributes help to distinguish offensive from non-offensive messages and thus improve classification scores.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <sec id="sec-2-1">
        <title>2.1. Data Pre-processing</title>
        <p>We applied the following pre-processing steps to every tweet:
1. We replaced the string of characters “&amp;amp;” with “&amp;”. This was necessary to obtain a representation
closer to the text of the original tweets.
2. We stripped emojis from the tweets.
3. We converted all words to lowercase.</p>
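        <p>The three steps above can be sketched as follows. This is a minimal illustration, not the exact code used for the submitted runs: the function name preprocess and the emoji code-point ranges are assumptions, and the ranges only approximate the set of emojis that may occur in tweets.</p>
        <preformat>
```python
import re

# The ampersand is built with chr() so that this listing contains no raw
# markup characters; ESCAPED_AMP is the five-character escaped form.
AMP = chr(38)
ESCAPED_AMP = AMP + "amp;"

# Assumed emoji ranges (pictographs, misc symbols, dingbats, joiners).
# This is an approximation, not an exhaustive emoji definition.
EMOJI_PATTERN = re.compile(
    "[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]"
)


def preprocess(tweet: str) -> str:
    """Apply the three steps in order: unescape, strip emojis, lowercase."""
    text = tweet.replace(ESCAPED_AMP, AMP)   # step 1: restore the ampersand
    text = EMOJI_PATTERN.sub("", text)       # step 2: drop emoji code points
    return text.lower()                      # step 3: lowercase everything
```
        </preformat>
        <p>Only the characters matched by the assumed pattern are removed; surrounding whitespace is left untouched, so tokenization can be applied afterwards.</p>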
        <p>
          For the Logistic Regression classifier, we tokenized the dataset using the TweetTokenizer utility
from NLTK [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]; for BETO, we employed BertTokenizer [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Features</title>
        <p>We conducted our research using the following features:
Lexical: We use word n-grams (n = 1 to 3) as features; each term in this collection is weighted by its
term frequency (TF).</p>
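        <p>As an illustration of the lexical features, the sketch below extracts word n-grams with one to three tokens and counts their raw term frequency for a single tweet. It is a plain-Python approximation written for clarity; the helper names word_ngrams and term_frequencies are hypothetical, and the example tokens are invented.</p>
        <preformat>
```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """Return every word n-gram with n_min to n_max tokens, in order."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A tweet shorter than n tokens simply yields no n-grams of size n.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def term_frequencies(tokens):
    """Map each n-gram of one tweet to its raw count (term frequency)."""
    return Counter(word_ngrams(tokens))


# Invented example tweet, already tokenized and lowercased.
tf = term_frequencies(["no", "te", "creo", "no"])
```
        </preformat>
        <p>Here the unigram "no" receives a count of 2, while every bigram and trigram appears once.</p>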
        <p>
          Metadata (MD): By using the Standard Twitter API platform [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] together with additional
Python libraries such as GetOldTweets3 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and Twython [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], it is possible to search for every
tweet in the dataset; if a message is still available online, we are able to retrieve properties of
the post as well as information about the author of the tweet (shown in Tables 1 and 2).
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Classifiers</title>
        <p>Logistic Regression (LR): Considered one of the traditional approaches for most
NLP tasks. This algorithm passes a linear combination of the input features through the
logistic (sigmoid) function, whose “S”-shaped curve maps the result to the
probability of class assignment.</p>
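        <p>The role of the sigmoid function can be made concrete with a short sketch. The weights, bias, and feature values used with it are placeholders chosen for illustration; in practice they would be learned from the training data.</p>
        <preformat>
```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    # Linear part: bias plus the dot product of weights and features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid squashes the linear score into the interval (0, 1).
    return sigmoid(z)
```
        </preformat>
        <p>A linear score of zero maps to a probability of 0.5, which is the usual decision boundary between the two classes.</p>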
        <p>
          BETO: A BERT model trained on a large Spanish corpus. BETO [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is similar in size to BERT-Base
and was trained with the Whole Word Masking technique. BERT models [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] are a method
of pre-training language representations that currently obtains state-of-the-art results on a wide
array of NLP tasks.
        </p>
        <p>Because of its computational affordability and its flexibility in handling different types of inputs
(including NULL values), the XGBoost classifier was selected as the blender of the information
present in our proposed systems. The first step in our framework feeds the text part
of the dataset to either LR or BETO, which returns class probabilities. These predictions
are then concatenated with either metadata or NaN values (for tweets whose metadata we
could not retrieve online) to form the input vector for XGBoost, which outputs
the final decisions.</p>
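        <p>The construction of the blender input can be sketched as below. The metadata field names are hypothetical placeholders (the actual attributes are the ones listed in Tables 1 and 2), and the sketch stops at building the input row; it does not call XGBoost itself.</p>
        <preformat>
```python
import math

# Hypothetical metadata field names, used for illustration only.
METADATA_FIELDS = ["favorite_count", "retweet_count", "followers_count"]


def build_blender_row(class_probs, metadata):
    """Concatenate text-classifier probabilities with metadata or NaNs.

    class_probs: probabilities returned by the text classifier (LR or BETO).
    metadata: dict of retrieved attributes, or None if the tweet could
              not be recovered online.
    """
    if metadata is None:
        # Unrecovered tweet: every metadata slot is marked as missing.
        meta_values = [math.nan] * len(METADATA_FIELDS)
    else:
        meta_values = [
            float(metadata.get(field, math.nan)) for field in METADATA_FIELDS
        ]
    return list(class_probs) + meta_values
```
        </preformat>
        <p>Keeping the row length fixed means that recovered and unrecovered tweets share one feature layout, with NaN marking the values a tree-based blender has to treat as missing.</p>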
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <p>The datasets were provided by the MEX-A3T team. Table 3 shows the distribution of the training and
test partitions for Spanish tweets.</p>
      <p>Our first task was to "extend" the dataset. This process is simple: for every message in the
collection, a query is made using the text to search for that tweet online. Due to the nature of the
task at hand, some tweets could not be recovered, possibly because the posts were deleted or the
accounts suspended. Table 4 shows the number of tweets that we were able to recover.</p>
      <p>We trained two classification systems for this task, one with Logistic Regression and one
using BETO, and we decided to submit a set of predictions for each system:
• Run 1 consists of an XGBoost classifier fed with metadata features and Logistic Regression
probabilities; the LR model was trained on Bag of Words features of range=(1, 3), considering the
term frequency of all the tokens that appear at least twice.
• Run 2 is similar to Run 1, but in this case the XGBoost classifier takes metadata features
and BETO probabilities as inputs.</p>
      <p>To evaluate our experiments with the features discussed in Section 2.2, we performed 5-fold
cross-validation on the training set for LR, and a single train-test split for BETO due to time
constraints.</p>
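      <p>For reference, a 5-fold partition of the training set can be sketched as follows. The sketch assumes contiguous, unshuffled folds and does not stratify by class; whether shuffling or stratification was applied is not stated here, so this is only one possible arrangement.</p>
      <preformat>
```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k] * k
    for i in range(n_samples % k):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        # Everything outside the validation slice is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop
```
      </preformat>
      <p>Each sample is used for validation exactly once, and the reported score would typically be the average over the five validation folds.</p>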
      <p>
        We built the TF feature matrices and the LR and XGBoost
classifiers using scikit-learn [12]; for the BETO model, we used the implementation described
in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We decided whether to include or exclude metadata features through manual feature selection.
For each run, a different combination of added features exhibited improved F1 scores; the
attributes used, along with the F1 scores, are shown in Table 5.
      </p>
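      <p>Since F1 is the metric used to compare feature combinations, a small sketch of a per-class F1 computation may be helpful. The function name f1_for_class and the example labels are invented for illustration; a library implementation would normally be used instead.</p>
      <preformat>
```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating the label given by positive as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = offensive, 0 = non-offensive.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
f1_offensive = f1_for_class(y_true, y_pred, positive=1)
f1_non_offensive = f1_for_class(y_true, y_pred, positive=0)
```
      </preformat>
      <p>Computing the score once per class gives the two quantities, F1 offensive and F1 non-offensive, that are reported separately in the evaluation.</p>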
      <sec id="sec-3-1">
        <title>3.1. Results</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Analysis</title>
        <p>Out of a total of 19 registered submissions, the BETO classification ranked 4th (above
both the Bi-GRU and BoW-SVM baselines), while the LR classification ranked 8th. Table 6
lists the top five final rankings for the Aggressiveness Identification track for 2020 (our scores
appear in bold).</p>
        <p>To break down our results, we first assess the performance of our proposal in terms of
F1-score, contrasted against the rest of the competitors. Fig. 1 presents two box plots for the
complete distribution of submitted results in terms of F1 offensive and F1 non-offensive. This
analysis suggests that the outcome achieved by our proposal is competitive, being located within
the first quartile of all submissions. The second part of our analysis reviews
which class predictions were changed by adding metadata to our best-ranked system. In
the validation stage, metadata features rectified 25 label assignments (including 17
true positives) but also introduced 27 new mistakes (24 of them false positives). Table 7 shows some
examples of these results (trigger words appear in bold).</p>
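        <p>The counts in this comparison can be obtained with a procedure along the following lines. It is a sketch under the assumption of binary labels with 1 denoting the offensive class; the function and key names are hypothetical.</p>
        <preformat>
```python
def compare_predictions(y_true, before, after, positive=1):
    """Count labels fixed and labels broken when metadata is added.

    before: predictions of the text-only system.
    after:  predictions of the same system blended with metadata.
    """
    counts = {"fixed": 0, "fixed_true_positives": 0,
              "broken": 0, "new_false_positives": 0}
    for truth, old, new in zip(y_true, before, after):
        if old == new:
            continue  # metadata did not change this prediction
        if new == truth:
            counts["fixed"] += 1
            if truth == positive:
                counts["fixed_true_positives"] += 1
        elif old == truth:
            counts["broken"] += 1
            if new == positive:
                counts["new_false_positives"] += 1
    return counts
```
        </preformat>
        <p>Only predictions that differ between the two systems are inspected, so the result isolates the effect of the metadata features from the behavior of the text classifier.</p>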
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Work</title>
      <p>In this paper, we described our strategy to classify aggressive and non-aggressive tweets in
Mexican Spanish. Our best-performing system uses a transformer-based classifier, BETO,
paired with metadata features that are added through a decision-tree-based ensemble algorithm,
XGBoost. Our proposal proves to be competitive for this specific task. We noticed that metadata
can help to detect subtle samples of aggressiveness, but it can also cause tweets to be classified as offensive
when swearing is present yet not necessarily used to insult someone. Nevertheless, it is interesting
to see how these additional attributes, based on the behavior of users inside social media, can be
used to strengthen classification methods. We look forward to enhancing our current framework,
focusing on different levels of lexical and contextual analysis using state-of-the-art approaches
and complementing them with metadata attributes.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>C.</given-names> <surname>Nobata</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Thomas</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Mehdad</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chang</surname></string-name>, <article-title>Abusive language detection in online user content</article-title>, in: <source>Proceedings of the 25th International Conference on World Wide Web, WWW '16</source>, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, <year>2016</year>, pp. <fpage>145</fpage>-<lpage>153</lpage>. URL: https://doi.org/10.1145/2872427.2883062. doi:10.1145/2872427.2883062.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>M. E.</given-names> <surname>Aragón</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Jarquín</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Montes-y Gómez</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Villaseñor-Pineda</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Bel-Enguix</surname></string-name>, <string-name><given-names>J.-P.</given-names> <surname>Posadas-Durán</surname></string-name>, <article-title>Overview of MEX-A3T at IberLEF 2020: Fake news and Aggressiveness Analysis in Mexican Spanish</article-title>, in: <source>Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>, Malaga, Spain, September, <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Majumder</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Patel</surname></string-name>, <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>, <source>CEUR Workshop Proceedings</source> <volume>2517</volume> (<year>2019</year>) <fpage>167</fpage>-<lpage>190</lpage>. doi:10.1145/3368567.3368584.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>M.</given-names> <surname>Casavantes</surname></string-name>, <string-name><given-names>R.</given-names> <surname>López</surname></string-name>, <string-name><given-names>L.</given-names> <surname>González</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Montes-y Gómez</surname></string-name>, <article-title>UACH-INAOE at HASOC 2019: detecting aggressive tweets by incorporating authors' traits as descriptors</article-title>, in: <source>Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation</source>, <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <source>nltk.tokenize package - NLTK 3.5 documentation</source>, <year>2020</year>. URL: https://www.nltk.org/api/nltk.tokenize.html.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <source>BERT - transformers 2.11.0 documentation</source>, <year>2020</year>. URL: https://huggingface.co/transformers/model_doc/bert.html.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <source>Tweets - Twitter Developers</source>, <year>2020</year>. URL: https://developer.twitter.com/en/products/tweets.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <source>GetOldTweets3 · PyPI</source>, <year>2018</year>. URL: https://pypi.org/project/GetOldTweets3/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>R.</given-names> <surname>McGrath</surname></string-name>, <source>Twython - Twython 3.6.0 documentation</source>, <year>2013</year>. URL: https://twython.readthedocs.io/en/latest/.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>J.</given-names> <surname>Cañete</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Chaperon</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Fuentes</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Pérez</surname></string-name>, <article-title>Spanish pre-trained BERT model and evaluation data</article-title>, in: to appear in <source>PML4DC at ICLR 2020</source>, <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>, <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>, arXiv preprint arXiv:1810.04805 (<year>2018</year>).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] <string-name><given-names>F.</given-names> <surname>Pedregosa</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Varoquaux</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Gramfort</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Michel</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Thirion</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Grisel</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Blondel</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Prettenhofer</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Weiss</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Dubourg</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Vanderplas</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Passos</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Cournapeau</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Brucher</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Perrot</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Duchesnay</surname></string-name>, <article-title>Scikit-learn: Machine learning in Python</article-title>, <source>Journal of Machine Learning Research</source> <volume>12</volume> (<year>2011</year>) <fpage>2825</fpage>-<lpage>2830</lpage>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>