TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 111-115

INGEOTEC solution for Task 4 in TASS'18 competition
Solución del grupo INGEOTEC para la tarea 4 de la competencia TASS'18

Daniela Moctezuma1, José Ortiz-Bejar3, Eric S. Tellez2, Sabino Miranda-Jiménez2, Mario Graff2
1 CONACYT-CentroGEO, 2 CONACYT-INFOTEC, 3 UMSNH
dmoctezuma@centrogeo.edu.mx, jortiz@umich.mx, eric.tellez@infotec.mx, sabino.miranda@infotec.mx, mario.graff@infotec.mx

Abstract: This paper describes a classification system based on our generic classifier B4MSA, document vectors computed from pre-trained Spanish word embeddings, and a set of specialized resources to detect aggressiveness and affectivity in text. These resources, along with the official training set, were trained independently and then combined into a single model using Genetic Programming with our EvoMSA classifier. Using this approach, our system achieved this year's best performance on two of the three test corpora in Task 4.
Keywords: text categorization, genetic programming, safe-unsafe classification of news.
ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

The news classification problem is closely related to traditional text classification applications such as topic classification (e.g., classifying a news-like text into sports, politics, or economy). Any kind of news categorization can reflect the problems of society in several domains. For instance, discovering negative news over time can give leaders and decision-makers a reference for acting on the current situation (Wang et al., 2018).

This year, the TASS competition (Martínez-Cámara et al., 2018) proposed a new task (Task 4), concerned with the emotional categorization of news articles. To determine whether a news article is SAFE or UNSAFE, a corpus was built from the RSS feeds of a number of online newspapers in different varieties of Spanish (Argentina, Chile, Colombia, Cuba, Spain, USA, Mexico, Peru, and Venezuela). For the purpose of classifying these news items as SAFE or UNSAFE, only the headlines were provided.

Task 4 specifies two sub-tasks: Subtask-1, monolingual classification, and Subtask-2, multilingual classification. The main difference between them is that, in the first case, the algorithm must be trained and tested on the same Spanish variety, whereas in the second case the algorithm can be trained on one Spanish variety and tested on a different one. For more details about Task 4, see (Martínez-Cámara et al., 2018).

In this paper, the solution proposed by the INGEOTEC team is presented.
This solution is based on our B4MSA classifier and a number of specialized resources for aggressiveness and affectivity detection. Finally, our EvoMSA classifier, based on Genetic Programming, is used to combine all the resources and the available training data. It is worth mentioning that our scheme for building our systems for Task 1 (monolingual and cross-lingual polarity classification) and Task 4 (this one) is quite similar; of course, we use each task's training set to learn and optimize for that task.

The manuscript is organized as follows. Section 2 describes our solution in detail. Section 3 presents the results we achieved in Task 4. Finally, Section 4 gives our conclusions.

2 System Description

As mentioned before, we use a combination of several sub-systems to tackle the (un)safeness categorization of the given news. First, we use our generic text classifier B4MSA (Tellez et al., 2017) and a vocabulary of pre-trained word vectors from FastText (Mikolov et al., 2013). We also use two domain-specific lexicon resources, one designed to detect aggressiveness and the other to detect emotions in text. All these sub-systems and resources are combined with our genetic programming scheme (EvoMSA) over the decision functions of several classifiers built on top of these resources. The rest of this section details the use of these sub-systems and resources.

2.1 EvoMSA

EvoMSA¹ has two stages. In the first one, B4MSA (Tellez et al., 2017) uses SVMs to predict the decision function values of a given text. In the second one, EvoDAG (Graff et al., 2016; Graff et al., 2017), a classifier based on Genetic Programming with semantic operators, makes the final prediction through a combination of all the decision function values.

Furthermore, EvoMSA is open to being fed with different models, such as B4MSA (Tellez et al., 2018), lexicon-based models, and EvoDAG. It is a two-phase architecture for solving classification tasks, see Figure 1. In the first phase, a set of different classifiers is trained with the datasets provided by the contests and with others as additional knowledge, i.e., whatever knowledge can be integrated into EvoMSA. In this case, we used tailor-made lexicons for identifying aggressiveness, positiveness, and negativeness in texts; see Section 2.2 for more details. The precise configuration of our benchmarked system is described in Section 3.

2.1.1 B4MSA

B4MSA² (a.k.a. µTC) is a minimalistic system able to tackle general text classification tasks independently of domain and language; for complete details of the model see (Tellez et al., 2018). Roughly speaking, µTC creates text classifiers by searching for the best models in a given configuration space. A configuration consists of instructions that enable several preprocessing functions, a combination of tokenizers taken from the power set of several possible ones (character q-grams, n-word grams, and skip-grams), and a weighting scheme such as TF, TF-IDF, or several distributional schemes. µTC uses an SVM (Support Vector Machine) classifier with a linear kernel. A text transformation feature can have binary options (yes/no) or ternary options (group/delete/none). Tokenizers denote how texts must be split after each text transformation has been applied; all generated tokens are part of the text representation. Table 1 shows details of the preprocessing, tokenizers, and term weighting scheme.

2.2 Lexicon-based models

To introduce extra knowledge into our approach, we used two lexicon-based models. The first, the Up-Down model, counts affective words; that is, it produces two indexes for a given text: one for positive words and another for negative words. We created the positive-negative lexicon from several Spanish affective lexicons (de Albornoz, Plaza, and Gervás, 2012; Sidorov et al., 2013; Perez-Rosas, Banea, and Mihalcea, 2012); we also enriched this lexicon with the Spanish WordNet (Fernández-Montraveta, Vázquez, and Fellbaum, 2008).
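The Up-Down model described above can be pictured as a simple word counter over two lexicons. The following is a minimal sketch of that idea; the tiny lexicons here are illustrative placeholders, not the actual resources built by the authors:

```python
# Sketch of an Up-Down-style lexicon model: count how many tokens of a
# text appear in a positive lexicon and in a negative lexicon.
# POSITIVE and NEGATIVE are toy placeholders for the real resources.

POSITIVE = {"bueno", "feliz", "excelente", "paz"}
NEGATIVE = {"malo", "triste", "violencia", "crimen"}

def up_down_indexes(text: str) -> tuple[int, int]:
    """Return (positive_count, negative_count) for a given text."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos, neg

print(up_down_indexes("ola de violencia y crimen"))  # (0, 2)
```

These two counts are exactly the kind of per-text indexes that are later fed, together with the other decision values, into the second stage.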
The other, a Bernoulli model, was created to predict aggressiveness using a lexicon of aggressive words; we created this lexicon by gathering common aggressive words in Spanish. These indexes and this prediction, along with B4MSA's (µTC) outputs, are the input to the EvoDAG system.

¹ https://github.com/INGEOTEC/EvoMSA
² https://github.com/INGEOTEC/microTC

Figure 1: Architecture of our EvoMSA framework

Table 1: Example of a set of configurations for text modeling

Text transformation    Value
remove diacritics      yes
remove duplicates      yes
remove punctuation     yes
emoticons              group
lowercase              yes
numbers                group
urls                   group
users                  group
hashtags               none
entities               none
Term weighting
TF-IDF                 yes
Entropy                no
Tokenizers
n-words                {1, 2}
q-grams                {2, 3, 4}
skip-grams             —

2.3 EvoDAG

EvoDAG³ (Graff et al., 2016; Graff et al., 2017) is a Genetic Programming system specifically tailored to tackle classification problems on very large and high-dimensional vector spaces. EvoDAG uses the principles of Darwinian evolution to create models represented as directed acyclic graphs (DAGs). Due to lack of space, we refer the reader to (Graff et al., 2016), where EvoDAG is described in depth. It is important to mention that EvoDAG has no information about whether an input Xi comes from a particular class decision function; consequently, from EvoDAG's point of view, all inputs are equivalent.

³ https://github.com/mgraffg/EvoDAG
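As a rough illustration of the idea of searching for a good combination of per-model decision-function values, the toy sketch below samples small arithmetic programs over made-up scores and keeps the most accurate one. The real EvoDAG evolves directed acyclic graphs with genetic operators over populations; this random search, with invented decision values, only conveys the flavor:

```python
# Toy stand-in for EvoDAG's search: combine per-model decision-function
# values with small arithmetic programs and keep the best-scoring one.
# DECISIONS and LABELS are made-up numbers, not data from the paper.
import random

# Decision-function values from three hypothetical sub-models (rows = texts);
# gold labels: +1 = UNSAFE, -1 = SAFE.
DECISIONS = [(-0.8, -0.2, 0.1), (0.9, 0.4, -0.3),
             (-0.5, -0.9, -0.2), (0.3, 0.7, 0.8)]
LABELS = [-1, 1, -1, 1]

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "max": max}

def random_program(n_inputs, rng):
    """A program is (op_name, input_index_a, input_index_b)."""
    return (rng.choice(list(OPS)), rng.randrange(n_inputs), rng.randrange(n_inputs))

def accuracy(program, decisions, labels):
    op, i, j = program
    preds = [1 if OPS[op](row[i], row[j]) > 0 else -1 for row in decisions]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)
best = max((random_program(3, rng) for _ in range(200)),
           key=lambda prog: accuracy(prog, DECISIONS, LABELS))
print(best, accuracy(best, DECISIONS, LABELS))
```

The point is the interface, not the search strategy: the second stage only sees a matrix of decision values, one column per sub-model, which is why all inputs look equivalent to it.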
2.4 FastText

FastText (Joulin et al., 2017) is a tool that creates text classifiers and learns a semantic vocabulary from a given collection of documents; this vocabulary is represented as a collection of high-dimensional vectors, one per word. It is worth mentioning that FastText is robust to lexical errors, since out-of-vocabulary words are represented as the combination of the vectors of their sub-words, that is, a kind of character q-gram limited to the context of a word. Nonetheless, the main reason for including FastText in our system is to overcome the small training set that comes with Task 4, which we do by using the pre-trained vectors computed on the Spanish content of Wikipedia (Bojanowski et al., 2016). We use these vectors to create document vectors, one vector per document. A document vector is, roughly speaking, a linear combination of the word vectors that compose the document, collapsed into a single vector of the same dimension. These document vectors were used as input to an SVM with a linear kernel, and we use its decision function as input to EvoMSA.

3 Experiments and results

In order to test all the approaches in Task 4, the SANSE (Spanish brANd Safe Emotion) corpus was established. The SANSE corpus is composed of 2,000 news headlines written in Spanish across several Spanish-speaking countries: Spain, Mexico, Cuba, Chile, Colombia, Argentina, Venezuela, Peru, and the USA.

In the case of Subtask-1, Monolingual Classification, the goal was to train with one Spanish variety, e.g., Mexico, and then test with the same variety. Our results, together with those of the best five teams ranked by the Macro-F1 metric, are presented in Table 2.

Table 2: Subtask-1: Monolingual Classification results: SANSE-TEST-500

Team's name    Macro-F1    Accuracy
INGEOTEC       0.795       0.802
ELiRF-UPV      0.790       0.800
rbnUGR         0.774       0.786
MeaningCloud   0.767       0.776
SINAI          0.728       0.742
lone wolf      0.700       0.718
TNT-UA-WFU     0.492       0.518

Table 3 shows the best five teams on the SANSE-TEST-13152 corpus. With this corpus, our team reached the third position, with a Macro-F1 of 0.866 and an accuracy of 0.871.

Table 3: Subtask-2: Multilingual Classification: SANSE-TEST-13152

Team's name    Macro-F1    Accuracy
ELiRF          0.883       0.893
rbnUGR         0.873       0.888
INGEOTEC       0.866       0.871
MeaningCloud   0.793       0.801
SINAI          0.773       0.793
TNT-UA-WFU     0.544       0.552
lone wolf      0           0

Regarding the Multilingual Classification subtask, Table 4 reports the results obtained by the best five teams, ranked by the Macro-F1 metric.

Table 4: Subtask-2: Multilingual Classification: SANSE-408

Team's name    Macro-F1    Accuracy
INGEOTEC       0.719       0.737
ELiRF-UPV      0.699       0.722
rbnUGR         0.683       0.631
MeaningCloud   0.651       0.658
ITAINNOVA      0.617       0.575

In a nutshell, across the three datasets, our solution reached the highest Macro-F1 on two corpora and a middle position on the other.

4 Conclusions

Our solution based on Genetic Programming reached the best results in Subtask-1, Monolingual Classification (SANSE-TEST-500), and in Subtask-2, Multilingual Classification, on the SANSE-408 corpus. On the largest corpus of Subtask-2 (SANSE-TEST-13152), our system reached the third-best team solution.

Our approach, EvoMSA, is able to deal with several data sources through an ensemble of decision functions, one per source of data, such as extra knowledge coded into lexicons for sentiment analysis and aggressiveness identification, and semantic information from word vectors.

Acknowledgements

The authors would like to thank Laboratorio Nacional de GeoInteligencia for partially funding this work.

References

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

de Albornoz, J. C., L. Plaza, and P. Gervás. 2012. SentiSense: An easily scalable concept-based affective lexicon for sentiment analysis. In Proceedings of LREC 2012, pages 3562-3567.
Fernández-Montraveta, A., G. Vázquez, and C. Fellbaum. 2008. The Spanish version of WordNet 3.0. Text Resources and Lexical Knowledge. Mouton de Gruyter, pages 175-182.

Graff, M., E. S. Tellez, S. Miranda-Jiménez, and H. J. Escalante. 2016. EvoDAG: A semantic genetic programming Python library. In 2016 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1-6, November.

Graff, M., E. S. Tellez, H. J. Escalante, and S. Miranda-Jiménez. 2017. Semantic Genetic Programming for Sentiment Analysis. In O. Schütze, L. Trujillo, P. Legrand, and Y. Maldonado, editors, NEO 2015, number 663 in Studies in Computational Intelligence. Springer International Publishing, pages 43-65. DOI: 10.1007/978-3-319-44003-3_2.

Joulin, A., E. Grave, P. Bojanowski, and T. Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431. Association for Computational Linguistics, April.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Perez-Rosas, V., C. Banea, and R. Mihalcea. 2012. Learning sentiment lexicons in Spanish. In LREC, volume 12, page 73.

Sidorov, G., S. Miranda-Jiménez, F. Viveros-Jiménez, A. Gelbukh, N. Castro-Sánchez, F. Velásquez, I. Díaz-Rangel, S. Suárez-Guerra, A. Treviño, and J. Gordon. 2013. Empirical study of machine learning based approach for opinion mining in tweets. In Proceedings of the 11th Mexican International Conference on Advances in Artificial Intelligence - Volume Part I, MICAI'12, pages 1-14, Berlin, Heidelberg. Springer-Verlag.

Tellez, E. S., S. Miranda-Jiménez, M. Graff, D. Moctezuma, R. R. Suárez, and O. S. Siordia. 2017. A simple approach to multilingual polarity classification in Twitter. Pattern Recognition Letters, 94:68-74.

Tellez, E. S., D. Moctezuma, S. Miranda-Jiménez, and M. Graff. 2018. An automated text categorization framework based on hyperparameter optimization. Knowledge-Based Systems, 149:110-123.

Wang, B., L. Gao, T. An, M. Meng, and T. Zhang. 2018. A method of educational news classification based on emotional dictionary. In 2018 Chinese Control And Decision Conference (CCDC), pages 3547-3551, June.