=Paper=
{{Paper
|id=Vol-2664/mexa3t_paper3
|storemode=property
|title=Idiap and UAM Participation at MEX-A3T Evaluation Campaign
|pdfUrl=https://ceur-ws.org/Vol-2664/mexa3t_paper3.pdf
|volume=Vol-2664
|authors=Esaú Villatoro-Tello,Gabriela Ramírez-de-la-Rosa,Sajit Kumar,Shantipriya Parida,Petr Motlicek
|dblpUrl=https://dblp.org/rec/conf/sepln/Villatoro-Tello20
}}
==Idiap and UAM Participation at MEX-A3T Evaluation Campaign==
<pdf width="1500px">https://ceur-ws.org/Vol-2664/mexa3t_paper3.pdf</pdf>
<pre>
Idiap and UAM Participation at MEX-A3T Evaluation
Campaign
Esaú Villatoro-Tello (a,b), Gabriela Ramírez-de-la-Rosa (b), Sajit Kumar (c), Shantipriya Parida (b)
and Petr Motlicek (b)
(a) Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Mexico City, Mexico
(b) Idiap Research Institute, Rue Marconi 19, 1920, Martigny, Switzerland
(c) Centre of Excellence in AI, Indian Institute of Technology Kharagpur, West Bengal, India


Abstract
This paper describes our participation in the MEX-A3T 2020 shared evaluation campaign. Our main goal was
to evaluate a Supervised Autoencoder (SAE) learning algorithm in text classification tasks. For our experiments,
we used three different sets of features as inputs, namely classic word n-grams, char n-grams, and Spanish BERT
encodings. Our results indicate that SAE is adequate for longer and more formally written texts. Accordingly,
our approach obtained the best performance (F = 85.66%) in the fake-news classification task.

Keywords
Supervised Autoencoders, Text Representation, Deep Learning, Natural Language Processing


1. Introduction
In this era where social media and instant messaging are widely used for communication, the reach
and volume of these text messages are enormous. The use of aggressive language and the dissemination
of false news are widespread across these communication channels. Verifying these messages manually
is infeasible. We need automated systems that help users of these communication channels to
determine whether they are reading real or fake news, or to flag when someone has been targeted with
aggressive messages.
   Most previous work on these two tasks, namely aggressiveness detection and fake-news detection,
targets English, and little research has applied recent NLP techniques such as deep learning to
Spanish. On the one hand, for aggressiveness detection, in past editions of the MEX-A3T¹ challenge [1],
only three out of nine approaches used a deep learning classifier (CNN, LSTM, or GRU), and none of
them performed well [2]. On the other hand, most current research on fake-news detection has been
done for the English language, using graph CNNs [3] and, more recently, attention-based transformer
models [4].
   Our participation at MEX-A3T 2020 aimed at exploring the use of a Supervised Autoencoder (SAE)
[5] in two different text classification tasks: i) aggressiveness detection in Spanish tweets, where
documents are very short and informal texts; and ii) fake-news detection from Spanish newspapers,

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: evillatoro@correo.cua.uam.mx, esau.villatoro@idiap.ch (E. Villatoro-Tello); gramirez@correo.cua.uam.mx (G.
Ramírez-de-la-Rosa); kumar.sajit.sk@gmail.com (S. Kumar); shantipriya.parida@idiap.ch (S. Parida);
petr.motlicek@idiap.ch (P. Motlicek)
orcid: 0000-0002-1322-0358 (E. Villatoro-Tello); 0000-0003-3387-6300 (S. Parida)
© 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


¹ https://sites.google.com/view/mex-a3t/home
Table 1
Features as inputs for the Supervised Autoencoder Method.

                   Feature type                            Sub-type                     Identifier
                   Word n-grams                            n=(1,2) and n=(1,3)          W
                   Char n-grams                            n=(1,2) and n=(1,3)          C
                   BETO                                    min, max, and mean pooling   B
                   Word n-grams and Char n-grams                                        W+C
                   BETO and Word n-grams                                                B+W
                   BETO and Char n-grams                                                B+C
                   BETO, Word n-grams and Char n-grams                                  B+W+C


where documents are longer and written in a more formal style. We found that SAE generalizes well
for both tasks: for aggressiveness detection our approach obtains a macro F1 of 80.7%, while for
fake-news detection we reached the best score, with a macro F1 of 85.6%.


2. Methodology
For both tasks, we aimed at evaluating the impact of recent generalization techniques, namely SAE
[5] with a varied set of features as input vectors. Although SAE has been extensively evaluated in
image classification tasks [6], very few works exist evaluating the impact of SAE in text classification
tasks, e.g. language detection [7]. Next, we briefly describe the SAE theory, and we provide some
details on how the document representation was generated for all the explored features.

2.1. Supervised Autoencoder
An autoencoder (AE) is a neural network that learns a representation (encoding) of input data and then
learns to reconstruct the original input from the learned representation. The autoencoder is mainly
used for dimensionality reduction or feature extraction [5]. Normally, it is used in an unsupervised
learning fashion, meaning that we leverage the neural network for the task of representation learning.
By learning to reconstruct the input, the AE extracts underlying abstract attributes that facilitate
accurate prediction of the input.
   An SAE is an autoencoder with the addition of a supervised loss on the representation layer.
Adding the supervised loss to the autoencoder loss function acts as a regularizer and results in
learning a better representation for the desired task [6]. With a single hidden layer, the
supervised loss is added to the output layer; in a deep supervised autoencoder, the supervised loss
is added to the innermost (smallest) layer, i.e., the bottleneck, whose representation is usually
transferred to the supervised layer after training the autoencoder.
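
One common way to write the combined objective described above is sketched below. The symbols are notation introduced here for illustration and do not appear in the original text: h is the encoder, d the decoder, g the classification layer on top of the representation, and λ an assumed weight balancing both terms. The paper does not state which loss functions were used or how they were weighted.

```latex
\[
\mathcal{L}(x, y)
  \;=\;
  \underbrace{\mathcal{L}_{\mathrm{sup}}\bigl(g(h(x)),\, y\bigr)}_{\text{supervised loss}}
  \;+\;
  \lambda\,
  \underbrace{\mathcal{L}_{\mathrm{rec}}\bigl(d(h(x)),\, x\bigr)}_{\text{reconstruction loss}}
\]
```

For classification, the supervised term is typically a cross-entropy and the reconstruction term a squared error; with λ = 0 the model reduces to a plain feed-forward classifier, which is why the reconstruction term can be read as a regularizer.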
   For all our experiments, the SAE model used a nonlinear activation function (ReLU) with 3 hidden
layers; the number of nodes in the representation layer was set to 300, and we trained for a
maximum of 100 epochs.

2.2. Input Features
The SAE receives as input the representation of the document built using Spanish pre-trained BERT
encodings (BETO [8]), traditional text representation techniques such as word and char n-grams
(ranges 1-2 and 1-3), and combinations of BETO encodings plus traditional word/char n-gram vectors.


   We chose to evaluate the impact of word and char n-grams since, as previous research has shown
[9, 10, 11], word n-grams are capable of capturing the identity of a word and its contextual usage, while
character n-grams additionally provide an excellent trade-off between sparseness and word identity,
and at the same time combine different types of information: punctuation, the morphological makeup
of a word, lexicon, and even context. For generating this type of features we used the CountVectorizer
and TfidfTransformer classes from the scikit-learn² toolkit. For the fake-news detection task, we
empirically chose the best values for the min-df and max-df parameters, which are reported in Table 3.
For the aggressiveness task, these values were fixed (for all the experiments) to min-df = 0.001 and
max-df = 0.3.
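
The n-gram features described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the toy corpus, the helper name `ngram_tfidf`, and the plain concatenation used for the W+C block are assumptions. Float values of `min_df` and `max_df` are fractions of the corpus, so on a four-document toy corpus the paper's thresholds would discard almost every term; looser values are used in the demo.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline


def ngram_tfidf(analyzer, ngram_range, min_df, max_df):
    """Count n-grams, filter them by document frequency, then apply TF-IDF weighting."""
    return make_pipeline(
        CountVectorizer(analyzer=analyzer, ngram_range=ngram_range,
                        min_df=min_df, max_df=max_df),
        TfidfTransformer(),
    )


# Toy corpus (hypothetical); the task corpora are not reproduced here.
docs = [
    "el gobierno anuncia nuevas medidas",
    "la selección gana el partido",
    "científicos descubren una vacuna",
    "el clima cambia esta semana",
]

# Word 1-2 grams (W) and character 1-3 grams (C), with permissive thresholds.
word_feats = ngram_tfidf("word", (1, 2), min_df=1, max_df=1.0).fit_transform(docs)
char_feats = ngram_tfidf("char", (1, 3), min_df=1, max_df=1.0).fit_transform(docs)

# Assumption: the combined W+C input is a column-wise concatenation of both blocks.
combined = hstack([word_feats, char_feats]).tocsr()
print(word_feats.shape, char_feats.shape, combined.shape)
```

On a corpus of realistic size, the aggressiveness setting would correspond to `min_df=0.001, max_df=0.3`.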
   Additionally, we evaluate the impact of transformer-based models [12] as a language representation
strategy. For our experiments we tested BETO³, a BERT model trained on a large dataset of Spanish
documents [8]. As is well known, the [CLS] token acts as an “aggregate representation” of the input
tokens, and can be considered a sentence representation for many classification tasks [13]. Accordingly,
we apply the following approaches for generating the representation of the document: i) for the
aggressiveness task, each tweet is directly passed to the BETO model and is represented using the
encoding of the [CLS] token from the last hidden layer; ii) for the fake-news detection task, we split
the news document into smaller chunks, obtain the [CLS] encoding of each chunk, and then apply
either min, max, or mean pooling to generate the final document representation. Table 1 lists the
types and variations of features tested during the training phase.
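
The chunk-and-pool step for long documents can be sketched as below. The chunk size and the unit of chunking are not given in the text, so token-based chunks of a caller-chosen size are an assumption, and the small vectors stand in for real [CLS] encodings; the function names are hypothetical.

```python
from statistics import mean


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def pool_chunks(chunk_vectors, mode):
    """Combine per-chunk vectors into one document vector, dimension by dimension."""
    reducers = {"min": min, "max": max, "mean": mean}
    reduce_fn = reducers[mode]
    # zip(*...) groups the i-th component of every chunk vector together.
    return [reduce_fn(column) for column in zip(*chunk_vectors)]


# Hypothetical 3-dimensional [CLS] encodings for three chunks of one document;
# real BETO encodings would have the model's hidden size instead.
cls_vectors = [
    [0.2, -1.0, 0.5],
    [0.4, 0.0, 0.1],
    [0.0, 1.0, 0.3],
]

doc_min = pool_chunks(cls_vectors, "min")    # [0.0, -1.0, 0.1]
doc_max = pool_chunks(cls_vectors, "max")    # [0.4, 1.0, 0.5]
doc_mean = pool_chunks(cls_vectors, "mean")
```

Whatever the number of chunks, the pooled vector keeps the dimensionality of a single [CLS] encoding, so documents of different lengths yield inputs of the same size.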
   Finally, it is worth mentioning that we did not apply any preprocessing steps in any of the tasks.
To validate our experiments, we used stratified 10-fold cross-validation.
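
As an illustration of the validation protocol, the sketch below assigns examples to ten folds while keeping class proportions similar in each fold. It is a dependency-free approximation written for this example (the function name and the toy labels are hypothetical); scikit-learn's `StratifiedKFold` offers the same functionality, and the text does not say which implementation was used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each example index to a fold so class proportions stay similar across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the examples of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical labels: 30 positive and 70 negative examples.
labels = [1] * 30 + [0] * 70
folds = stratified_folds(labels)

for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # A classifier would be trained on `train` and scored on `held_out` here.
    assert len(train) + len(held_out) == len(labels)
```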


3. Aggressiveness Identification
The corpus of offensive language in Mexican Spanish used for this task has 10,475 Spanish tweets. The
training partition contains 7,332 tweets with two possible classes (aggressive or non-aggressive). More
details of this corpus can be found in [14]. Table 2 shows the results obtained in both the validation
phase and the two runs we submitted for the final evaluation of this task over 3,143 unseen tweets. The
difference between the two submitted runs, i.e., run IDs 1 and 2 (†), is the classifier: submission 2
was trained using a Multi-Layer Perceptron (MLP).


4. Fake-News Identification
The Spanish fake-news corpus used in this task has 971 news items from 9 different topics. The training
partition provided for the development stage has 676 news items with a binary class (fake or true). Each
news item is composed of the headline, the body, and the URL where the news was published (the complete
description of this corpus can be found in [15]). For our experiments, we used only the headline and
the body of the news as a single document. Table 3 shows the results obtained in the development
stage of the challenge, and the two runs submitted for the final evaluation of the task over 295 unseen
news items.


   ² https://scikit-learn.org/stable/index.html
   ³ https://github.com/dccuchile/beto


Table 2
Results in validation and test phases reported in F-score for aggressive (F+), non-aggressive (F-), and macro
average of the F-score (Fm).

                                               Validation phase                          Test phase
     Input features                           Fm         F+            F-       ID       Fm           F+       F-
     W (1,2)                                  0.783      0.698         0.868    -        -            -        -
     W (1,3)                                  0.777      0.690         0.864    -        -            -        -
     C (1,2)                                  0.726      0.601         0.850    -        -            -        -
     C (1, 3)                                 0.778      0.689         0.866    -        -            -        -
     B (LHL)                                  0.742      0.628         0.856    -        -            -        -
     C (1, 3) + W (1,2)                       0.780      0.702         0.857    -        -            -        -
     B + W (1,2)                              0.787      0.694         0.879    -        -            -        -
     B + C (1,3)                              0.780      0.684         0.876    -        -            -        -
     B + W (1,2) + C (1,3)                    0.803      0.716         0.889    1        0.807        0.725    0.888
     B + W (1,2) + C (1,3)†                   0.798      0.702         0.894    2        0.801        0.706    0.895
     Bi-GRU (baseline-given by track organizers)                                         0.798        0.712    0.884
     BOW-SVM (baseline-given by track organizers)                                        0.777        0.676    0.878
     Best system (in the task [1])                                                       0.859        0.799    0.919
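
For reference, the macro average reported as Fm in these tables is the unweighted mean of the two per-class F-scores. The short check below recomputes it from the rounded F+ and F- values of the two submitted runs; small differences in the last digit are expected because the table entries are themselves rounded. The function name is introduced here for illustration.

```python
def macro_f(f_positive, f_negative):
    """Macro-averaged F-score for two classes: the unweighted mean of per-class F-scores."""
    return (f_positive + f_negative) / 2


# Test-phase F+ and F- values copied from Table 2.
run_1 = macro_f(0.725, 0.888)  # reported Fm: 0.807
run_2 = macro_f(0.706, 0.895)  # reported Fm: 0.801
print(f"{run_1:.4f} {run_2:.4f}")
```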


Table 3
Results in validation and test phases reported in F-score for fake-news (F+), real-news (F-), and macro average
of F-score (Fm).

                                                       Validation phase                           Test phase
    Input features            min-df,max-df    Fm          F+           F-          ID    Fm           F+       F-
    W(1,2)                    0.01, 0.5        0.775       0.793        0.758       -     -            -        -
    W(1,3)                    0.01, 0.5        0.778       0.798        0.758       -     -            -        -
    C(1,2)                    0.01, 0.5        0.697       0.719        0.674       -     -            -        -
    C(1, 3)                   0.01, 0.5        0.757       0.768        0.745       -     -            -        -
    B(min-pooling)                             0.843       0.842        0.845       2     0.856        0.844    0.868
    B(max-pooling)                             0.830       0.830        0.830       -     -            -        -
    B(mean-pooling)                            0.833       0.831        0.835       -     -            -        -
    C(1, 3)+W(1,2)            0.01, 0.5        0.805       0.807        0.802       -     -            -        -
    B+W(1,2)                  0.01, 0.3        0.845       0.846        0.844       1     0.850        0.840    0.859
    B+C(1,3)                  0.01, 0.3        0.834       0.834        0.835       -     -            -        -
    B+W(1,2)+C(1,3)           0.01, 0.3        0.833       0.831        0.835       -     -            -        -
    B+W(1,2)+C(1,3)           0.01, 0.5        0.848       0.846        0.850       -     -            -        -
    Third best system (in the track)                                                      0.817        0.819    0.817
    BOW-RF (baseline-given by track organizers)                                           0.786        0.785    0.787


5. Conclusions
This paper describes the Idiap & UAM participation at the MEX-A3T 2020 shared task on the classification
of fake news and aggressiveness analysis. Our participation aimed at analyzing the performance of
recent generalization techniques, namely deep supervised autoencoders. To this end, we performed
a comparative analysis among simple transformer-based language representation strategies and
traditional text representations such as word and character n-grams. Notably, the SAE method benefits
the most when it is fed with input features generated from the combination of BERT encodings and
word/char n-grams. Particularly, for the aggression detection task, our proposed approach can obtain
a relative improvement of 1.1% over the stronger baseline, while for the fake-news detection task the
improvement over the baseline is 8.1%.
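
The relative improvements quoted above can be recomputed from the test-phase macro F-scores in Tables 2 and 3. In the sketch below, the pairing of each figure with a specific run is inferred from the tables rather than stated in the text: the 8.1% value is reproduced with run 1 of the fake-news task (Fm = 0.850), while run 2 (Fm = 0.856) would give a larger gain.

```python
def relative_improvement(score, baseline):
    """Relative gain of score over baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


# Macro F-scores copied from Tables 2 and 3 (test phase).
aggressiveness = relative_improvement(0.807, 0.798)  # run 1 vs. Bi-GRU baseline
fake_news_run1 = relative_improvement(0.850, 0.786)  # run 1 vs. BOW-RF baseline
fake_news_run2 = relative_improvement(0.856, 0.786)  # run 2 vs. BOW-RF baseline

# prints "1.1% 8.1% 8.9%"
print(f"{aggressiveness:.1f}% {fake_news_run1:.1f}% {fake_news_run2:.1f}%")
```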
   As future work, we plan to analyze which dataset characteristics allow the SAE approach to
perform well. We also want to evaluate the impact of tuning the SAE’s hyperparameters through
optimization methods, such as Bayesian optimization [16], and to evaluate our proposed approach on
other similar classification tasks.


Acknowledgments
The work was supported by an innovation project (under an InnoSuisse grant) oriented to improve
the automatic speech recognition and natural language understanding technologies for German, titled
“SM2: Extracting Semantic Meaning from Spoken Material” (funding application no. 29814.1 IP-ICT),
and by the EU H2020 project “Real-time network, text, and speaker analytics for combating organized
crime” (ROXANNE), grant agreement: 833635. The first author, Esaú Villatoro-Tello, was partially
supported by Idiap, SNI-CONACyT, CONACyT project grant CB-2015-01-258588, and UAM-C Mexico during
the elaboration of this work.


References
 [1] M. E. Aragón, H. Jarquín, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda,
     H. Gómez-Adorno, G. Bel-Enguix, J.-P. Posadas-Durán, Overview of MEX-A3T at IberLEF 2020: Fake
     news and aggressiveness analysis in Mexican Spanish, in: Notebook Papers of 2nd SEPLN Workshop
     on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain, September, 2020.
 [2] M. E. Aragón, M. Á. Álvarez-Carmona, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda,
     D. Moctezuma, Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in
     Mexican Spanish tweets, in: Notebook Papers of 1st SEPLN Workshop on Iberian Languages
     Evaluation Forum (IberLEF), Bilbao, Spain, 2019.
 [3] F. Monti, F. Frasca, D. Eynard, D. Mannion, M. M. Bronstein, Fake news detection on social
     media using geometric deep learning, arXiv preprint arXiv:1902.06673 (2019).
 [4] M. Qazi, M. U. S. Khan, M. Ali, Detection of fake news using transformer model, in: 2020 3rd
     International Conference on Computing, Mathematics and Engineering Technologies (iCoMET),
     2020, pp. 1–6.
 [5] Q. Zhu, R. Zhang, A classification supervised auto-encoder based on predefined
     evenly-distributed class centroids, arXiv preprint arXiv:1902.00220 (2019).
 [6] L. Le, A. Patterson, M. White, Supervised autoencoders: Improving generalization performance
     with unsupervised regularizers, in: Advances in Neural Information Processing Systems, 2018,
     pp. 107–117.
 [7] S. Parida, E. Villatoro-Tello, S. Kumar, P. Motlicek, Q. Zhan, Idiap submission to Swiss-German
     language detection shared task, in: Proceedings of the 5th Swiss Text Analytics Conference
     (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2020.
 [8] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained BERT model and evaluation data,
     in: to appear in PML4DC at ICLR 2020, 2020.
 [9] Z. Wei, D. Miao, J.-H. Chauchat, R. Zhao, W. Li, N-grams based feature selection and text
     representation for Chinese text classification, International Journal of Computational
     Intelligence Systems 2 (2009) 365–374.
[10] A. Kulmizev, B. Blankers, J. Bjerva, M. Nissim, G. van Noord, B. Plank, M. Wieling, The power
     of character n-grams in native language identification, in: Proceedings of the 12th Workshop
     on Innovative Use of NLP for Building Educational Applications, 2017, pp. 382–389.
[11] F. Sánchez-Vega, E. Villatoro-Tello, M. Montes-y-Gómez, P. Rosso, E. Stamatatos,
     L. Villaseñor-Pineda, Paraphrase plagiarism identification with character-level features,
     Pattern Analysis and Applications 22 (2019) 669–681.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
     Attention is all you need, in: Advances in Neural Information Processing Systems, 2017,
     pp. 5998–6008.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
     URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[14] M. J. Díaz-Torres, P. A. Morán-Méndez, L. Villaseñor-Pineda, M. Montes-y-Gómez, J. Aguilera,
     L. Meneses-Lerín, Automatic detection of offensive language in social media: Defining linguistic
     criteria to build a Mexican Spanish dataset, in: Proceedings of the Second Workshop on Trolling,
     Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille,
     France, 2020, pp. 132–136. URL: https://www.aclweb.org/anthology/2020.trac-1.21.
[15] J.-P. Posadas-Durán, H. Gómez-Adorno, G. Sidorov, J. Moreno, Detection of fake news in a new
     corpus for the Spanish language, Journal of Intelligent & Fuzzy Systems 36 (2019) 4869–4876.
     doi:10.3233/JIFS-179034.
[16] J. Snoek, H. Larochelle, R. P. Adams, Practical Bayesian optimization of machine learning
     algorithms, in: Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.



</pre>