Detecting Aggressiveness in Mexican Spanish Tweets with LSTM + GRU and LSTM + CNN Architectures

Victor Peñaloza
RLICT: Research Laboratory in Information and Communication Technologies, Universidad Galileo, 7a. Avenida, calle Dr. Eduardo Suger Cofiño, Zona 10, Ciudad de Guatemala, Guatemala

Abstract
This paper describes our participation in the MEX-A3T 2020 aggressiveness detection track on Mexican Spanish tweets. The goal of this task is to analyze a corpus of Mexican Spanish tweets and identify the aggressiveness level of each tweet (aggressive or not). For this task, we propose two architectures: the first is BiLSTM + GRU based, and the second is BiLSTM + CNN based. After experimenting and evaluating, our BiLSTM + CNN model achieves a 63.88% aggressive-class F1-score, and our BiLSTM + BiGRU model achieves a 63.87% aggressive-class F1-score.

Keywords
Aggressiveness, Long Short-Term Memory, Gated Recurrent Unit, Convolutional Neural Network, Twitter, Mexican Spanish text classification.

1. Introduction
The use of social communication tools on the Internet has become an essential part of daily human life. These social communication tools generate a large amount of data that has sparked analysis interest among natural language processing and data science experts. Although diverse models have been proposed to analyze social media data, there are still many challenges and ample room to improve research. One of these challenges is the multi-language content generated on these social networks. To push research forward and to promote work on Mexican Spanish data, MEX-A3T 2020 proposed a track to identify aggressiveness in Mexican Spanish tweets. This study proposes two architectures that use LSTM, GRU, and convolutional networks as building blocks, evaluated on the MEX-A3T 2020 aggressiveness detection track. This paper is comprised of five sections: the first presents an introduction to the task and this study.
The second section describes the corpus preprocessing phase. The third section describes the proposed architectures. The fourth section presents the results achieved in the competition and in the testing phase. The last section presents conclusions and future work to continue experimenting with this task and these architectures.

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: victorsergio@galileo.edu (V. Peñaloza)
orcid: 0000-0001-7335-8255 (V. Peñaloza)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

2. Data Preprocessing
Although supervised deep learning models can learn the main features from a dataset, the performance of such models depends on the quality of the input data [1]. Previous sentiment analysis research on Twitter-based corpora shows that various corpus-preprocessing techniques provide a significant improvement in model performance. Some techniques merely remove noisy data, and others reduce terms and expressions to their basic meaning [2].

2.1. Basic Data Preprocessing
For the models described in this paper, the following steps were performed on the training data set [3]:
1. Lower-case the input text.
2. Remove URLs: URLs were encoded in the training data set as .
3. Remove accents, diaeresis, and tilde characters: the input text was normalized to NFKD form and converted to ASCII.
4. Remove numeric characters.
5. Remove single-character and two-character elements.
6. Remove punctuation symbols.

2.2. Text Sequence Length
The LSTM [4] and GRU [5] architectures were proposed to learn long-term dependencies. Despite the success of these architectures, there are concerns about their ability to manage such dependencies [6].
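The cleaning steps of Section 2.1, together with the end-trimming of Section 2.2, can be sketched in plain Python. This is only an illustrative sketch, not the exact code used: the corpus encodes URLs with a placeholder token (elided above), so here we simply strip raw URLs, and `max_tokens` is a hypothetical cutoff parameter.

```python
import re
import string
import unicodedata

def preprocess(text, max_tokens=50):
    """Basic cleaning (Section 2.1) plus end-trimming (Section 2.2)."""
    text = text.lower()                                  # 1. lower-case
    text = re.sub(r"https?://\S+", " ", text)            # 2. remove URLs
    text = unicodedata.normalize("NFKD", text)           # 3. NFKD, then drop
    text = text.encode("ascii", "ignore").decode("ascii")  # non-ASCII marks
    text = re.sub(r"\d+", " ", text)                     # 4. remove numerics
    text = text.translate(                               # 6. remove punctuation
        str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if len(t) > 2]     # 5. drop 1-2 char tokens
    return tokens[:max_tokens]                           # trim at the end
```

For example, `preprocess("Visita https://example.com ¡Él tiene 2 PERROS grandes!")` yields `["visita", "tiene", "perros", "grandes"]`.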
Considering these concerns, we decided to limit the length of the text sequences, seeking a sequence length that preserves the relevant information of each tweet while reducing model training time. Trimming was done by shortening each text sequence at the end.

2.3. Lemmatization
Lemmatization performs a morphological analysis of words and tries to remove inflectional endings, returning words to their dictionary form. In previous research, the use of lemmatization outperformed baseline algorithms on language modeling [7]. The pipeline used was:
1. Tokenization.
2. Multiword token expansion.
3. POS labeling.
4. Lemmatization.
For this pipeline, we used the Spanish AnCora treebank models from the Python StanfordNLP package [8].

2.4. Stop Words
We removed stop words using the Spanish corpus from the open-source Natural Language Toolkit (NLTK) [9].

2.5. Word Vectors
As a word-level representation, we used embedding vectors pre-trained with the FastText library [10]. The embedding vectors were pre-trained on external Mexican Spanish tweets; the pre-trained file contains 1,247.3M tokens with 100 dimensions each. These vectors were provided by the MEX-A3T 2019 organizers [11].

2.6. Balanced Dataset
In unbalanced data sets, the different categories are represented unequally, so the resulting model is biased toward learning the features of the majority class in a classification task. The use of over-sampling techniques on the minority class has been proposed to obtain better classifier performance. SMOTE is an over-sampling method in which the minority class is over-sampled by creating "synthetic" samples rather than by over-sampling with replacement [12]. Since the MEX-A3T 2020 training corpus was not balanced, we applied the SMOTE method to obtain a corpus in which the aggressive and non-aggressive classes are equally represented.

3. Systems Description
Recurrent networks have proven to be useful in natural language processing tasks for their ability to carry information from the past [13].
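Before detailing the models, the core idea behind the SMOTE step of Section 2.6 can be illustrated in plain Python. This is only a minimal sketch of the interpolation idea from [12], not the implementation used in our experiments; real SMOTE restricts the second point to one of the k nearest minority-class neighbours, whereas this sketch pairs arbitrary minority points.

```python
import random

def smote_sample(x_a, x_b, rng=random):
    """Interpolate between a minority sample and a neighbour: the core SMOTE idea.

    A synthetic point is placed at a random position on the segment
    between x_a and x_b, rather than duplicating an existing sample.
    """
    gap = rng.random()  # random position in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x_a, x_b)]

def oversample(minority, n_new, rng=random):
    """Generate n_new synthetic minority vectors.

    Real SMOTE picks x_b among the k nearest neighbours of x_a;
    here any minority point stands in for a neighbour.
    """
    return [smote_sample(rng.choice(minority), rng.choice(minority), rng)
            for _ in range(n_new)]
```

Applying `oversample` to the feature vectors of the minority (aggressive) class until both classes have the same size yields the balanced training set described above.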
On the other hand, convolutional neural networks have been used and have shown promising results in diverse natural language processing applications [14]. Additionally, the architecture used here proved effective in previous NLP classification tasks [15] and was adapted to this specific domain task. This paper discusses the performance of two models with slightly different approaches.

The first model (Fig. 1) is comprised of an embedding input layer, followed by a spatial dropout that feeds a BiLSTM layer and a BiGRU layer, respectively. Each of the BiLSTM and BiGRU blocks feeds an independent global average pooling layer and global max pooling layer. The pooling layers' outputs are merged and followed by a dense layer with a ReLU activation function. Next, batch normalization and dropout are applied. The last layer is a dense layer with a SoftMax activation function. The first model (BiLSTM + BiGRU) was trained using an Adam optimizer (learning rate = 3e-5, epsilon = 1e-8, norm clipping = 1.0), with sparse categorical cross-entropy as the loss function, for 13 epochs.

The second model (Fig. 2) is a slightly different version of the first model in which the BiGRU layer is replaced with a 1D convolutional layer; it was trained for 15 epochs. Table 1 shows in detail the parameter values used for each model.

4. Results
The official competition metric was the F1-score on the aggressive class. Table 2 shows our results on the MEX-A3T 2020 test dataset and on our own test data set, which was used for experimentation during the modeling phase. Our own test data set was created by taking 20% of the provided official training set.
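For reference, the run-2 (BiLSTM + CNN) model described above can be sketched with the Keras functional API, using the layer sizes from Table 1. This is a hedged sketch, not the exact training code: the precise wiring of the figure is not fully recoverable from the text, so we assume the Conv1D branch consumes the BiLSTM output and that both branches are average- and max-pooled before concatenation; `vocab_size`, `embed_dim`, and `seq_len` are illustrative placeholders (the paper's embeddings are 100-dimensional FastText vectors).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_bilstm_cnn(vocab_size=20000, embed_dim=100, seq_len=50):
    """Sketch of the BiLSTM + CNN model with Table 1 hyperparameters."""
    inp = layers.Input(shape=(seq_len,))
    # Embedding layer; in the paper it is initialized with FastText vectors.
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    x = layers.SpatialDropout1D(0.2)(x)
    lstm = layers.Bidirectional(layers.LSTM(600, return_sequences=True))(x)
    conv = layers.Conv1D(332, kernel_size=2, activation="relu")(lstm)
    # Global average and max pooling on both branches, then merge.
    pooled = layers.Concatenate()([
        layers.GlobalAveragePooling1D()(lstm), layers.GlobalMaxPooling1D()(lstm),
        layers.GlobalAveragePooling1D()(conv), layers.GlobalMaxPooling1D()(conv),
    ])
    x = layers.Dense(144, activation="relu")(pooled)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(2, activation="softmax")(x)  # binary classes via SoftMax
    model = Model(inp, out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=3e-5, epsilon=1e-8, clipnorm=1.0),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The BiLSTM + BiGRU variant (run 1) would replace the `Conv1D` line with a `Bidirectional(GRU(600, return_sequences=True))` layer.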
Additionally, Table 2 shows two baselines used by the organizers to compare participating models, as well as some results from other participants, ranked by their place in the competition.

Figure 1: BiLSTM + BiGRU architecture. (Diagram: Input Layer → Embedding → Spatial Dropout1D → Bidirectional CuDNNLSTM and Bidirectional CuDNNGRU → Global Average/Max Pooling 1D → Concatenate → Dense → Batch Normalization → Dropout → Dense.)

Figure 2: BiLSTM + CNN architecture. (Diagram: as Figure 1, with the Bidirectional CuDNNGRU replaced by a Conv 1D layer.)

Based on the results, it should be noted that the two proposed architectures achieved similar performance. It can also be observed that the results achieved on the official test set do not differ much from those achieved on our own test set. This indicates that the test data chosen for the modeling phase represents the task dataset well, and that the proposed models are not overfitting the training set. We achieved 16th place with run 2 (BiLSTM + CNN). Although our results are lower than the baseline models, this work offers a comparison between two proposed models for aggressiveness detection in Mexican Spanish tweets and leaves open possibilities for architecture improvement in further research.

5. Conclusions and Future Work
In this work, we described our participation in the MEX-A3T@IberLEF2020 Aggressiveness Identification on Mexican Spanish Tweets track [3]. We presented two proposed architectures: the first is based on a BiLSTM + BiGRU combination and the second on a BiLSTM + CNN combination.

Table 1
Model architecture parameters. Parameters marked with * are parameters of the convolutional layer used only in the LSTM + CNN model.
Parameter                Value   Description
spatial dropout rate     0.2     Fraction of the embedding layer output to drop.
biLSTM layer units       600     Dimensionality of the bidirectional LSTM output space.
biGRU layer units        600     Dimensionality of the bidirectional GRU output space.
filters*                 332     Number of output filters in the convolution.
kernel size*             2       Length of the 1D convolution window.
activation function*     ReLU    Convolutional layer activation function.
dense layer units        144     Dimensionality of the intermediate dense layer output space.
dropout rate             0.2     Probability that each element of the intermediate dense layer output is dropped.
last dense layer units   2       Dimensionality of the last dense layer output space (binary classification with SoftMax activation).

Table 2
Official results of aggressiveness detection on the organizers' test data, and our own evaluation results on our own test data set.

Rank  Team Name                      Official F1 (aggressive)  Own test F1 (aggressive)
1     CIMAT-1                        0.7998                    -
7     Baseline (Bi-GRU)              0.7124                    -
12    Baseline (BoW-SVM)             0.6760                    -
16    UGalileo-2 (BiLSTM + CNN)      0.6388                    0.6650
17    UGalileo-1 (BiLSTM + BiGRU)    0.6387                    0.6333
21    Intensos-2                     0.2515                    -

According to our experimental results, the two architectures show similar results on the aggressiveness detection task. Although the proposed architectures achieved lower results compared to the baseline models, it is possible to continue improving them, especially by working on the corpus-preprocessing phase. We believe that task-relevant information was lost during the tweet preprocessing phase, which prevented us from obtaining better model performance. Additionally, it would be worthwhile to try other embedding vectors and dictionaries that better represent the particular features of Mexican Spanish.

Acknowledgments
This work was supported by Facultad de Ingeniería de Sistemas, Informática y Ciencias de la Computación (FISICC) and the Research Laboratory in Information and Communication Technologies (RLICT), both part of Universidad Galileo, Guatemala.
References
[1] S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas, Data preprocessing for supervised leaning, World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering 1 (2007) 4104–4109.
[2] G. Angiani, L. Ferrari, T. Fontanini, P. Fornacciari, E. Iotti, F. Magliani, S. Manicardi, A comparison between preprocessing techniques for sentiment analysis in Twitter, in: KDWeb, 2016.
[3] M. E. Aragón, H. Jarquín, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda, H. Gómez-Adorno, G. Bel-Enguix, J.-P. Posadas-Durán, Overview of MEX-A3T at IberLEF 2020: Fake news and aggressiveness analysis in Mexican Spanish, in: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain, September 2020.
[4] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1724–1734. URL: https://www.aclweb.org/anthology/D14-1179. doi:10.3115/v1/D14-1179.
[6] J. Zhao, F. Huang, J. Lv, Y. Duan, Z. Qin, G. Li, G. Tian, Do RNN and LSTM have long memory?, 2020. arXiv:2006.03860.
[7] V. Balakrishnan, E. Lloyd-Yemoh, Stemming and lemmatization: A comparison of retrieval performances, in: Lecture Notes on Software Engineering, volume 2, 2014, pp. 262–267.
[8] P. Qi, T. Dozat, Y. Zhang, C. D. Manning, Universal dependency parsing from scratch, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 160–170.
URL: https://nlp.stanford.edu/pubs/qi2018universal.pdf.
[9] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, 1st ed., O'Reilly Media, Inc., 2009.
[10] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
[11] INGEOTEC, FastText word embeddings for Spanish language variations, 2019 (accessed June 10, 2020). URL: https://github.com/INGEOTEC/RegionalEmbeddings.
[12] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[13] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, Recurrent neural network based language model, in: INTERSPEECH 2010, 2010, pp. 1045–1048.
[14] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751. URL: https://www.aclweb.org/anthology/D14-1181. doi:10.3115/v1/D14-1181.
[15] E. Garcia, Mercado Libre data challenge, https://github.com/eduagarcia/meli-challenge-2019, 2019.