<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TecNM at MEX-A3T 2020: Fake News and Aggressiveness Analysis in Mexican Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samuel Arce-Cardenas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Fajardo-Delgado</string-name>
          <email>fajardo@itcg.edu.mx</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Á. Álvarez-Carmona</string-name>
        </contrib>
        <aff>Mexico</aff>
        <aff>Mexico</aff>
        <aff>Mexico</aff>
      </contrib-group>
      <fpage>265</fpage>
      <lpage>272</lpage>
      <abstract>
        <p>This paper describes our participation in MEX-A3T 2020 for the tasks of identifying aggressiveness and fake news in Mexican Spanish tweets. We evaluate the combination of basic text classification techniques, including six machine learning algorithms, two keyword extraction methods, and two preprocessing techniques. Our best run achieved an F1-macro score of 0.754 for aggressiveness and 0.815 for fake news. Our preliminary results are satisfactory and competitive with those of other participating teams.</p>
      </abstract>
      <kwd-group>
        <kwd>Aggressiveness Identification</kwd>
        <kwd>Fake News Classification</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art</title>
      <p>MEX-A3T is an IberLEF evaluation forum intended for research in natural language
processing (NLP) that considers a variety of Mexican Spanish cultural traits. In this vein, the
2018 edition was the first to consider aggressiveness identification for Mexican Spanish
tweets [7]. The winning team for the aggressiveness task in that edition was INGEOTEC [8],
which obtained an F1-macro score of 0.620. Another interesting result was the development of a
linguistic generalization of the typical Mexican slang used in tweets to reduce the impact of
the size of the bag of words [9]. For the 2019 edition of the MEX-A3T track [10], the approach of
the University of Chihuahua (UACh) [11] obtained the best performance, outperforming all
proposed baselines except the results of the winning team of the 2018 edition. Nevertheless,
the UACh approach is considerably simpler than the one from INGEOTEC.</p>
      <p>On the other hand, there are few studies on the detection of fake news in Spanish [12], [6].
One of these studies evaluates the complexity and the stylometric and psychological characteristics
of the text in a multilingual setting [12]. The authors used a corpus of news written in American English,
Brazilian Portuguese, and Spanish with four classifiers (k-Nearest Neighbors, Support
Vector Machine, Random Forest, and Extreme Gradient Boosting) and obtained an average
detection accuracy of 85.3% with Random Forest. Another study created a new corpus of news
in Spanish [6], labeled with true and fake tags for the automatic detection of fake news, and
presented a detection method based on classification algorithms applied to lexical features
such as bag of words, part-of-speech tags, n-grams (with n ranging from 3 to 5), and
combinations of n-grams; the best result was an accuracy of 76.94%.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The methodology of this work consists of the following steps: text preprocessing, text
representation, and the building of the classification models.</p>
      <p>Text preprocessing is commonly the first step in the pipeline of an NLP system. It comprises
a set of techniques designed to transform text documents into a representation suitable
for automatic processing. The preprocessing techniques we employed in this work were
regular expressions, tokenization, the removal of punctuation, symbols, and stop words,
and stemming. The regular expressions allowed us to identify some words that are incorrect in
Mexican Spanish, mainly those in which the same vowel appears three or more times
consecutively. We did this with the re library in Python.</p>
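      <p>As a minimal sketch of this step, the snippet below uses the re module to detect words in which the same vowel occurs three or more times in a row and to collapse each run to a single vowel. The pattern, the function names, and the decision to collapse the run rather than discard the word are assumptions made for illustration; the text does not state how such words were handled once identified.</p>
      <preformat>
```python
import re

# Assumed pattern: any Spanish vowel (plain or accented) repeated 3+ times in a row.
ELONGATED_VOWEL = re.compile(r"([aeiouáéíóú])\1{2,}", flags=re.IGNORECASE)


def has_elongated_vowel(word):
    """Return True if the same vowel appears three or more times consecutively."""
    return ELONGATED_VOWEL.search(word) is not None


def collapse_elongated_vowels(text):
    """Replace each run of three or more identical vowels with a single vowel."""
    return ELONGATED_VOWEL.sub(r"\1", text)
```
      </preformat>
      <p>Under these assumptions, collapse_elongated_vowels("holaaaa amigooos") returns "hola amigos", while a legitimate double vowel such as the one in "leer" is left untouched.</p>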
      <p>We also used the Natural Language Toolkit (NLTK) to perform the tokenization, breaking the
texts into words as their basic elements. During this process, we also removed punctuation
marks, special characters or symbols, and unnecessary stop words such as "el", "la", and
"los". Afterward, we used the Snowball stemming library to reduce derived words to their
stem by truncating suffixes. Finally, to further reduce the number
of uninformative words, we ignored those that appear fewer than 20 or 40 times.</p>
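      <p>The following sketch illustrates the tokenization, stop-word removal, and rare-word filtering described above using only the Python standard library. It is a stand-in, not the NLTK-based pipeline used in this work: the regular-expression tokenizer, the abbreviated stop-word list, and all function names are hypothetical, and the stemming step is omitted.</p>
      <preformat>
```python
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list; NLTK ships a fuller Spanish list.
STOP_WORDS = {"el", "la", "los", "las", "de", "y", "en"}

# Assumed tokenizer: runs of letters (including Spanish accents), lowercased.
TOKEN = re.compile(r"[a-záéíóúüñ]+")


def tokenize(text, remove_stop_words=True):
    """Split text into lowercase word tokens, dropping punctuation and symbols."""
    tokens = TOKEN.findall(text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def filter_rare_words(token_lists, min_count):
    """Drop every token whose total corpus count is below min_count."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in token_lists]
```
      </preformat>
      <p>Calling filter_rare_words with min_count set to 20 or 40 would correspond to the two thresholds mentioned above; the remove_stop_words flag makes it easy to compare runs with and without stop-word removal.</p>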
      <p>After the text preprocessing, we identified the set of words that best describe the
textual context. Extracting these words, also called terms or keywords, consists of assigning each word
a numerical value that represents its relevance with respect to the others within
the corpus. In particular, we used two methods based on a simple statistical approach: term
frequency (TF) and term frequency-inverse document frequency (TF-IDF). TF defines the
local importance of each term in a document based on its frequency; i.e., the more frequently a word
appears in a document, the more important it is. IDF captures in how many documents
a word appears relative to the total number of documents in the corpus; i.e., it highlights the rarity
of the word. We used the implementations of TF and TF-IDF included in the scikit-learn library.</p>
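      <p>To make the two weighting schemes concrete, the sketch below computes TF and TF-IDF by hand with the textbook definition idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. This is an illustration under that assumption, not the scikit-learn implementation used in this work; the function names are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def term_frequency(tokens):
    """Raw count of each term in one tokenized document."""
    return Counter(tokens)


def inverse_document_frequency(documents):
    """Textbook IDF: log(N / df), with df the number of documents containing the term."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(documents):
    """One dictionary per document, mapping each term to tf * idf."""
    idf = inverse_document_frequency(documents)
    return [
        {term: count * idf[term] for term, count in term_frequency(tokens).items()}
        for tokens in documents
    ]
```
      </preformat>
      <p>Note that the default settings of scikit-learn's TfidfVectorizer apply a smoothed IDF and L2-normalize each document vector, so its values differ from this textbook variant even though the ranking intuition is the same.</p>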
      <p>Finally, to build the classification models, we used the following machine learning
algorithms implemented in scikit-learn: k-nearest neighbors (KNN) for k = 3, 7, 11;
support vector machines (SVM) with linear and radial basis function (RBF) kernels; decision
trees (DT); a neural network (NN); and Naive Bayes (NB). We generated these models on the training
set using 10-fold cross-validation.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <p>We divided the data set into 10 subsets. We took the first subset as the validation set and the
other subsets as the training set and obtained the confusion matrix; we then took the second subset
as the validation set and the rest as the training set, and repeated this process until each subset
had served as the validation set. Finally, we added the confusion matrices and computed the
presented results from their sum.</p>
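      <p>The evaluation procedure above can be sketched as follows. The code splits the data into contiguous folds, sums the per-fold confusion matrices, and computes the F1-macro score from the pooled matrix. Several details are assumptions: the text does not say whether the data were shuffled or stratified, fit_predict is a hypothetical callback standing in for any of the classifiers, and computing F1-macro from the summed matrix is our reading of the description.</p>
      <preformat>
```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx); each contiguous fold is the validation set once."""
    bounds = [i * n_samples // k for i in range(k + 1)]
    indices = list(range(n_samples))
    for start, stop in zip(bounds, bounds[1:]):
        yield indices[:start] + indices[stop:], indices[start:stop]


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples of true class labels[i] predicted as labels[j]."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[position[t]][position[p]] += 1
    return matrix


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def f1_macro(matrix):
    """Unweighted mean of the per-class F1 scores of a confusion matrix."""
    scores = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fp = sum(row[c] for row in matrix) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validated_f1(X, y, labels, fit_predict, k=10):
    """Sum the per-fold confusion matrices, then score the pooled matrix.

    fit_predict(X_train, y_train, X_val) is a hypothetical callback that
    trains a classifier and returns its predictions for X_val.
    """
    total = [[0] * len(labels) for _ in labels]
    for train_idx, val_idx in k_fold_indices(len(y), k):
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in val_idx],
        )
        fold = confusion_matrix([y[i] for i in val_idx], preds, labels)
        total = add_matrices(total, fold)
    return f1_macro(total), total
```
      </preformat>
      <p>Pooling the confusion matrices before scoring yields a single F1-macro value over all samples, which can differ slightly from the average of the ten per-fold F1 scores.</p>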
      <p>Tables 1 and 2 show the performance of the proposed classification models applied to the
fake news data set using the TF and TF-IDF methods, respectively. The best result for this
data set is obtained by NN without stop-word removal and stemming,
regardless of whether TF or TF-IDF is used. Note that, except for the SVM with RBF, there is a
notable difference between the results of NN and the rest. Also note that, in general, the
results are slightly better with TF-IDF than with TF.</p>
      <p>On the other hand, Tables 3 and 4 show the performance of the proposed classification models
applied to the aggressiveness data set using the TF and TF-IDF methods, respectively. The
best result for this data set is obtained by NN with the TF-IDF method and without
stop-word removal and stemming. As with the fake news data set, the results for
the aggressiveness data set are slightly better with TF-IDF than with TF. However,
unlike in the fake news classification results, the best model with the TF method is the
SVM with RBF. All of these results were obtained by ignoring the words that appear fewer
than 20 times in both data sets (Tables 1-4). We do not report the results for the case
in which we ignored the words that appear fewer than 40 times, because of their poor quality and
the space limitations of the paper. In addition, the fake news data set includes, besides
the complete text of the news, a header with the title of the news. We performed
experiments both with and without the header. Tables 1 and 2 show only the
results without the header, since these are better.</p>
      <p>Finally, for both data sets, the best results were obtained by preserving the stop words
and omitting the stemming process. We conjecture that, in these
particular cases, such words may help distinguish the classes (aggressiveness/fake news) in the texts.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, we approached the tasks of fake news and aggressiveness identification for the
2020 MEX-A3T contest. Using machine learning algorithms, we generated classification models
for these tasks with different combinations of preprocessing techniques and keyword extraction
methods. Our best configurations for both tasks are NN and SVM with the RBF kernel, combined with the
TF-IDF method and without the preprocessing steps of stop-word removal and
stemming. As future work, we look forward to exploring other preprocessing techniques and
keyword extraction methods to improve our ranking in upcoming MEX-A3T contests.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>S. Arce-Cardenas gratefully acknowledges the financial support from Tecnológico Nacional de
México (TecNM) under the project 9518.20-P (2rn3nx).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[1]</label>
        <mixed-citation><string-name><given-names>M. B.</given-names> <surname>Yassein</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Aljawarneh</surname></string-name>, <string-name><given-names>Y. A.</given-names> <surname>Wahsheh</surname></string-name>, <article-title>Survey of online social networks threats and solutions</article-title>, in: <source>2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)</source>, <year>2019</year>, pp. <fpage>375</fpage>-<lpage>380</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[2]</label>
        <mixed-citation><string-name><given-names>D.</given-names> <surname>Theocharis</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Bekiari</surname></string-name>, et al., <article-title>Applying social network indicators in the analysis of verbal aggressiveness at the school</article-title>, <source>Journal of Computer and Communications</source> <volume>5</volume> (<year>2017</year>) <fpage>169</fpage>. doi:<pub-id pub-id-type="doi">10.4236/jcc.2017.57015</pub-id>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>[3]</label>
        <mixed-citation><string-name><given-names>C.</given-names> <surname>Nobata</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Thomas</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Mehdad</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chang</surname></string-name>, <article-title>Abusive language detection in online user content</article-title>, in: <source>Proceedings of the 25th International Conference on World Wide Web</source>, WWW '16, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, <year>2016</year>, p. <fpage>145</fpage>-<lpage>153</lpage>. URL: <uri>https://doi.org/10.1145/2872427.2883062</uri>. doi:<pub-id pub-id-type="doi">10.1145/2872427.2883062</pub-id>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[4]</label>
        <mixed-citation><string-name><given-names>A.</given-names> <surname>Bovet</surname></string-name>, <string-name><given-names>H. A.</given-names> <surname>Makse</surname></string-name>, <article-title>Influence of fake news in twitter during the 2016 us presidential election</article-title>, <source>Nature Communications</source> <volume>10</volume> (<year>2019</year>) <fpage>7</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41467-018-07761-2</pub-id>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[5]</label>
        <mixed-citation><string-name><given-names>M. E.</given-names> <surname>Aragón</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Jarquín</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Montes-y Gómez</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Villaseñor-Pineda</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Bel-Enguix</surname></string-name>, <string-name><given-names>J.-P.</given-names> <surname>Posadas-Durán</surname></string-name>, <article-title>Overview of mex-a3t at iberlef 2020: Fake news and aggressiveness analysis in mexican spanish</article-title>, in: <source>Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>, Malaga, Spain, September, <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[6]</label>
        <mixed-citation><string-name><given-names>J.-P.</given-names> <surname>Posadas-Durán</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Sidorov</surname></string-name>, <string-name><given-names>J. J. M.</given-names> <surname>Escobar</surname></string-name>, <article-title>Detection of fake news in a new corpus for the spanish language</article-title>, <source>Journal of Intelligent &amp; Fuzzy Systems</source> <volume>36</volume> (<year>2019</year>) <fpage>4869</fpage>-<lpage>4876</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[7]</label>
        <mixed-citation><string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Guzmán-Falcón</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Montes-y Gómez</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Villaseñor-Pineda</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Reyes-Meza</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Rico-Sulayes</surname></string-name>, <article-title>Overview of mex-a3t at ibereval 2018: Authorship and aggressiveness analysis in mexican spanish tweets</article-title>, in: <source>Notebook Papers of 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL)</source>, Seville, Spain, volume <volume>6</volume>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[8]</label>
        <mixed-citation><string-name><given-names>M.</given-names> <surname>Graff</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Miranda-Jiménez</surname></string-name>, <string-name><given-names>E. S.</given-names> <surname>Tellez</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Moctezuma</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Salgado</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ortiz-Bejar</surname></string-name>, <string-name><given-names>C. N.</given-names> <surname>Sánchez</surname></string-name>, <article-title>Ingeotec at mex-a3t: Author profiling and aggressiveness analysis in twitter using μtc and evomsa</article-title>, in: <source>IberEval@SEPLN</source>, <year>2018</year>, pp. <fpage>128</fpage>-<lpage>133</lpage>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[9]</label>
        <mixed-citation><string-name><given-names>S.</given-names> <surname>Correa</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Martin</surname></string-name>, <article-title>Linguistic generalization of slang used in mexican tweets, applied in aggressiveness detection</article-title>, in: <source>IberEval@SEPLN</source>, <year>2018</year>, pp. <fpage>119</fpage>-<lpage>127</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[10]</label>
        <mixed-citation><string-name><given-names>M. E.</given-names> <surname>Aragón</surname></string-name>, <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Montes-y Gómez</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Villaseñor-Pineda</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Moctezuma</surname></string-name>, <article-title>Overview of mex-a3t at iberlef 2019: Authorship and aggressiveness analysis in mexican spanish tweets</article-title>, in: <source>Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF)</source>, Bilbao, Spain, <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[11]</label>
        <mixed-citation><string-name><given-names>M.</given-names> <surname>Casavantes</surname></string-name>, <string-name><given-names>R.</given-names> <surname>López</surname></string-name>, <string-name><given-names>L. C.</given-names> <surname>González</surname></string-name>, <article-title>Uach at mex-a3t 2019: Preliminary results on</article-title> [...] 2019), <source>CEUR WS Proceedings</source>, <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[12]</label>
        <mixed-citation><string-name><given-names>H. Q.</given-names> <surname>Abonizio</surname></string-name>, <string-name><given-names>J. I.</given-names> <surname>de Morais</surname></string-name>, <string-name><given-names>G. M.</given-names> <surname>Tavares</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Barbon Junior</surname></string-name>, <article-title>Language-independent fake news detection: English, portuguese, and spanish mutual features</article-title>, <source>Future Internet</source> <volume>12</volume></mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>