Detecting Aggressiveness in Mexican Spanish Social Media Content by Fine-Tuning Transformer-Based Models

Mircea-Adrian Tanase, George-Eduard Zaharia, Dumitru-Clementin Cercel and Mihai Dascalu
Computer Science Department, University Politehnica of Bucharest, Bucharest, Romania
email: mircea.tanase@stud.acs.upb.ro (M. Tanase); george.zaharia0806@stud.acs.upb.ro (G. Zaharia); dumitru.cercel@upb.ro (D. Cercel); mihai.dascalu@upb.ro (M. Dascalu)

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)

Abstract
Aggressiveness and several related problems, such as hate speech, offensive language, and harassment, have a growing online presence on contemporary social media platforms. Research efforts towards detecting, isolating, and stopping these disturbing behaviors have intensified, in tight relation with the increasing performance of deep learning techniques applied to various Natural Language Processing (NLP) tasks. This study presents our NLP architectures for tackling the problem of aggressiveness detection in the context of the MEX-A3T@IberLEF2020 shared task. We experimented with several pre-trained Transformer-based models, fine-tuned on various combinations of task-specific datasets. Our best model achieves an offensive F1-score of 79.69% on the MEX-A3T test dataset, the third-best in the competition; nevertheless, the difference between the winning solution and our model is marginal, only 0.29%. This result argues that Transformer-based models can be successfully used to detect aggressiveness in Mexican Spanish tweets.

Keywords
BETO, XLM-RoBERTa, social media, aggressiveness detection, Mexican Spanish

1. Introduction

Smart Insights¹ notes that nearly 60% of the world population is online, with more than one third using social media platforms, among which Facebook and Twitter are the most popular alternatives. The massive rise of social media technologies in personal, business, and political communication raised a number of new concerns regarding their misuse, because these platforms can also become channels for the proliferation of disturbing trends, such as aggressiveness, harassment, hate speech, or cyberbullying². Social media companies made various attempts towards detecting, removing, and stopping these behaviors, with both Twitter and Facebook rolling out several tools for flagging and reporting unwanted pieces of content³,⁴; however, these efforts encountered several problems. First, a very small percentage of the victims even consider using these tools⁵; thus, the platforms remain unaware of most of the offences. Second, the amount of data that needs to be flagged and analyzed by human moderators is enormous; for example, Internet Live Stats⁶ estimates that 6,000 tweets are posted online every second. These reasons powered the increasing research efforts towards automated processes, grounded in Natural Language Processing (NLP) techniques, for the identification and removal of aggressive, offensive, or hateful content in online social media.

¹ https://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/. Accessed July 6th, 2020.
² https://www.stopbullying.gov/cyberbullying/what-is-it. Accessed July 6th, 2020.
Nevertheless, creating an annotated corpus of social media content suitable for this work proved to be a very challenging task due to the subjective and fluctuating definitions of the labels [1].

This work presents our approach for the detection track within the MEX-A3T@IberLEF2020 shared task [2], which required a binary classification of aggressiveness in tweets written in the Mexican Spanish dialect. We considered state-of-the-art NLP models, fine-tuning Transformer-based architectures pre-trained on Spanish, English, and multilingual corpora.

The remainder of the paper is structured as follows. A brief analysis of state-of-the-art approaches is performed in Section 2, followed by a description of the datasets and details on the methods employed for the automated detection of aggressiveness in Section 3. Section 4 outlines the evaluation process, while conclusions are drawn in Section 5.

2. Related Work

The task of automated detection of online aggressiveness is a necessity for modern social media platforms. Early studies are based on classical machine learning algorithms; for example, Greevy and Smeaton [3] used Support Vector Machines to detect racist texts in web pages. However, machine learning algorithms evolved in the last decade, with numerous NLP systems being developed and employed for such problems. Cambria et al. [4] used the sentic computing paradigm to detect web trolls; their approach aims to improve the recognition and interpretation of sentiments and opinions in texts by employing Semantic Web and Artificial Intelligence techniques. Davidson et al. [5] were the first to propose a hate speech detection corpus, together with several machine learning experiments covering multiple algorithms (e.g., logistic regression or random forest) and pre-processing techniques (e.g., TF-IDF scores, stemming). Malmasi and Zampieri [6] also analyzed this dataset and tried to improve the results using skip-gram features. Gambäck and Sikdar [7] improved the results on the same dataset by employing a Convolutional Neural Network (CNN) [8] as the classifier. Zhang et al. [9] analyzed both CNNs and Gated Recurrent Units [10], and achieved better results.

In addition, several shared tasks, surveys, and workshops on the previously mentioned topics were published in recent years. For example, Schmidt and Wiegand [11] presented a survey on non-neural network methods for detecting hate speech. Workshops and shared tasks on the subject include both editions of Abusive Language Online [1, 12], which targeted, among other issues, cyberbullying. HASOC [13] also addressed the problem of hate speech. Moreover, both the SemEval 2019 Task 5 (HatEval) [14] and SemEval 2020 Task 12 (OffensEval 2020) [15] shared tasks addressed hateful or offensive language detection in social media. The TRAC-1 [16] and TRAC-2 [17] workshops proposed the problem of aggression detection in social media, focusing on the English, Bangla, and Hindi languages; the best performing models for TRAC-2 used Transformer-based architectures, such as BERT [18] or RoBERTa [19].

³ https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy. Accessed July 6th, 2020.
⁴ https://www.facebook.com/communitystandards/hate_speech. Accessed July 6th, 2020.
⁵ https://www.pewresearch.org/internet/2017/07/11/online-harassment-2017/. Accessed July 6th, 2020.
⁶ https://www.internetlivestats.com/. Accessed July 6th, 2020.
With a stronger focus on the identification of aggressiveness in Mexican Spanish social media, we also considered the first two editions of the MEX-A3T shared task [20, 21]. Reviewing the research of the teams that obtained the highest scores in the 2019 edition reveals that most approaches used deep learning architectures, such as CNNs or Long Short-Term Memory networks [22], with varying degrees of success.

3. Method

3.1. Datasets

The MEX-A3T dataset proposed by the organizers for training consists of 7,332 Mexican Spanish tweets, out of which 2,110 are labeled as positive for aggressiveness. Ten percent of the data was held out for validation purposes, while preserving the label distribution. The test set of 3,143 tweets was used to compare the submitted solutions.

Our aim was to automatically increase the size of the training data in order to improve the results of our models. At the same time, we wanted to analyze the impact on model performance of combining various training datasets with different labeling schemes. Each additional dataset has its own particularities (see Table 1), such as language-specific structures and annotation schemes; thus, a standardization process (i.e., label thresholding) was imposed to make them suitable for a binary classification task (a code sketch of this step follows Table 1).

3.2. Baseline

The XGBoost [30] model was considered as the baseline. In the text pre-processing phase, the tweets were converted to a lowercase representation and stopwords were removed. As the feature extraction step, TF-IDF scores were computed over character n-grams with n = 1, 2, 3. Moreover, a grid search was performed to select the best parameters for the XGBoost classifier.

3.3. BERT

Our models are based on the architecture proposed by Devlin et al. [18], namely the Bidirectional Encoder Representations from Transformers (BERT). BERT is a deep neural network designed for various NLP tasks, leveraging the power of both WordPiece [31] embeddings and the Transformer architecture. The architecture is pre-trained on large language corpora using generic tasks, such as next sentence prediction, and is usually fine-tuned on a downstream task using a specific corpus. A distinct feature and major advantage of BERT is its unified architecture across tasks, as there are minimal differences between the pre-trained architecture and the final architecture used for a particular task.

Table 1
Considered additional datasets.

OLID [23]: 13,240 Twitter samples binary labeled for offensive language detection, with a 33.23% positive label rate.

OffensEval [15]: A collection of offensive language data covering five languages:
• English (SOLID) [24]: over nine million English tweets annotated in a semi-supervised manner, using OLID as the starting point. By applying the previously mentioned thresholding process, we obtained a 12.58% positive label rate.
• Arabic [25]: 7,000 Twitter samples, binary labels, 19.58% positive rate.
• Danish [26]: 3,000 Twitter, Reddit, and Facebook samples, binary labels, 12.80% positive rate.
• Greek [27]: 7,843 tweets, binary labels, 28.43% positive rate.
• Turkish [28]: 31,277 tweets, binary labels, 19.33% positive rate.

HASOC [13]: A binary annotated corpus for both hate speech and offensive language identification, including subsets for three languages (i.e., English, German, and Hindi), with 5,852 (38.63% positive rate), 3,819 (10.65% positive rate), and 4,665 (47.07% positive rate) tweets, respectively.

Davidson et al. [5]: 24,783 tweets annotated using three classes: a) hate speech, b) profanity, but not hate speech, and c) none. We consider the former two positive and the latter negative to standardize the labels; a 16.09% positive label rate was achieved after standardization.

HatEval [14]: Two subsets, one with 10,000 English tweets (42.36% positive) and the other with 5,000 Spanish tweets (41.58% positive), binary annotated for hate speech detection.

SIMAH [29]: 6,374 binary labeled Twitter samples proposed for the SIMAH 2019 competition.
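To make the standardization step concrete, the sketch below maps two of the auxiliary corpora to the binary MEX-A3T scheme. The Davidson et al. mapping follows Table 1; the column names ("tweet", "class", "text", "average") and the 0.5 threshold for SOLID's soft labels are illustrative assumptions, not values fixed by the paper.

```python
# Minimal sketch of the label standardization (thresholding) step.
# Column names and the SOLID threshold are illustrative assumptions.
import pandas as pd

def standardize_davidson(df: pd.DataFrame) -> pd.DataFrame:
    # Davidson et al. [5] classes: 0 = hate speech, 1 = profanity but not
    # hate speech, 2 = neither. Per Table 1, the first two classes become
    # positive and the third negative.
    return pd.DataFrame({
        "text": df["tweet"],                          # assumed column name
        "label": (df["class"] != 2).astype(int),
    })

def standardize_solid(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # SOLID [24] ships soft (averaged) offensiveness scores produced by its
    # semi-supervised annotation; a fixed threshold binarizes them.
    return pd.DataFrame({
        "text": df["text"],                           # assumed column name
        "label": (df["average"] >= threshold).astype(int),  # assumed column
    })

# The standardized frames share the text/label schema of the MEX-A3T
# training set and can simply be concatenated with it before fine-tuning:
# combined = pd.concat([mex_a3t_train, standardize_davidson(davidson_df)])
```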
3.4. Multilingual BERT (mBERT)

mBERT [32] is a model pre-trained by Google using a multilingual corpus⁷; pre-training considered texts belonging to over 100 languages, drawn from their Wikipedia entries. This leads to better performance for highly represented languages, such as English and Spanish. The mBERT model was fine-tuned using the following combinations of datasets:

• The MEX-A3T training set only.
• The MEX-A3T training set along with all the OffensEval data, to observe the influence on test-set performance of adding all the data from OffensEval (labeled using the same annotation scheme).
• All the available non-English data; this allows us to assess the impact of adding other non-English data to the training set.
• All the available data; this is a separate experiment from the previous one because the English set is significantly larger and English is the most represented language in the embedding layers of mBERT.

⁷ https://github.com/google-research/bert/blob/master/multilingual.md

3.5. XLM-RoBERTa

Liu et al. [19] showed that BERT is under-trained and proposed RoBERTa, a robustly optimized BERT pre-training approach that exceeds the results obtained by Devlin et al. [18] on several NLP tasks. Conneau et al. [33] designed XLM-RoBERTa, a cross-lingual masked language model, and pre-trained it on a large corpus covering over 100 languages, applying the same technique as for mBERT. The resulting model significantly outperformed mBERT, while also obtaining good results for low-resource languages. We repeated all the experiments presented in Section 3.4 using XLM-RoBERTa instead of mBERT.

3.6. BETO

Cañete et al. [34] introduced BETO, a pre-trained BERT model for Spanish. The pre-training corpus includes, aside from the entire Spanish Wikipedia content, all the Spanish-language sources of the OPUS project [35]. Two experiments were performed using this model:

• Fine-tuning on the MEX-A3T training set only.
• Fine-tuning on the MEX-A3T training set merged with the HatEval Spanish subset, to observe the effect on model performance of adding a dataset labeled for a slightly different task.
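To illustrate the fine-tuning setup, the sketch below performs one training step of BETO for binary classification with the Transformers package [36]; it is not the authors' exact code. The checkpoint name is assumed to be the publicly released BETO model on the Hugging Face hub (the paper does not name one), the learning rate follows Section 4, and the batch content is a toy example.

```python
# Minimal sketch: one fine-tuning step of BETO for binary aggressiveness
# detection. Checkpoint name, max_length, and batch content are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # assumed BETO checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels=2)

texts = ["ejemplo de tweet agresivo", "ejemplo de tweet neutro"]  # toy batch
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr from Section 4
model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over the two classes
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The same loading-and-stepping pattern applies to the mBERT and XLM-RoBERTa experiments; only the checkpoint name and the fine-tuning dataset change.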
3.7. Dataset Translation

We also experimented with automatically translating the entire MEX-A3T training dataset into English through the Yandex translation API⁸. We chose English considering the multitude of texts and corpora available for pre-training Transformer-based models, and then used the translated training set to fine-tune the English pre-trained BERT. Our intuition is that the pre-trained BERT model should perform better, provided that the semantics and syntax of the translated entries are properly preserved. However, such preservation is non-trivial, as the two languages (i.e., English and Spanish) have noticeable structural differences, and a translation process can seriously alter the original idea of the source entry. Most translation engines suffer from this issue, in that they are not capable of accurately transferring the particularities of the source language into the target one.

⁸ https://tech.yandex.com/translate/

4. Results

Our aim was to integrate as much as possible of the data presented in Section 3.1, although most datasets contain other languages and are labeled for related tasks from the offensive language detection field. All experiments were performed on an Ubuntu server with 4 Intel CPU cores, one Nvidia RTX 2080 Ti GPU, and 64 GB of RAM. The implementations from the Transformers Python package [36] were used for the BERT-based models. The Adam optimizer [37] with a learning rate of 2e-5 was used for all the BERT and XLM-RoBERTa experiments. The results (i.e., accuracy, precision, recall, and offensive F1-score) of each fine-tuned model presented in Section 3 and evaluated on the MEX-A3T validation set are summarized in Table 2.

Table 2
Results obtained on the MEX-A3T aggressiveness validation set.

Architecture | Pre-training Language | Pre-processing | Fine-Tuning Dataset                 | Acc (%) | P (%) | R (%) | F1 (%)
XGBoost      | -                     | n-gram TF-IDF  | MEX-A3T train set                   | 58.61   | 42.12 | 34.82 | 40.61
BERT         | English               | Translation    | All English data                    | 68.42   | 54.18 | 48.41 | 51.07
BETO         | Spanish               | -              | MEX-A3T train set                   | 84.32   | 82.02 | 80.26 | 81.13
BETO         | Spanish               | -              | MEX-A3T train set + HatEval Spanish | 85.24   | 81.90 | 81.22 | 81.55
mBERT        | Multi                 | -              | MEX-A3T train set                   | 72.44   | 69.20 | 64.81 | 66.93
mBERT        | Multi                 | -              | All non-English data                | 74.31   | 69.14 | 68.35 | 68.74
mBERT        | Multi                 | -              | All OffensEval data                 | 76.88   | 72.14 | 69.83 | 70.96
mBERT        | Multi                 | -              | All data                            | 76.27   | 71.92 | 70.18 | 71.03
XLM-RoBERTa  | Multi                 | -              | MEX-A3T train set                   | 73.80   | 70.34 | 65.45 | 67.80
XLM-RoBERTa  | Multi                 | -              | All non-English data                | 77.22   | 72.62 | 68.91 | 70.71
XLM-RoBERTa  | Multi                 | -              | All OffensEval data                 | 83.48   | 80.47 | 78.57 | 79.50
XLM-RoBERTa  | Multi                 | -              | All data                            | 83.42   | 80.42 | 78.96 | 79.68

As expected, fine-tuning the Transformer-based models surpasses the classical machine learning baseline, and XLM-RoBERTa outperforms mBERT. Furthermore, adding more data to the fine-tuning dataset improves the results, even if the added data is labeled for a different but related task and contains other languages.

Our experiment with the automated translation of the Spanish samples into English and using them to fine-tune an English pre-trained BERT was a failed attempt. This is most likely caused by the poor quality of automatic translation applied to short messages: after inspecting the translations, we observed that individual words are mostly translated correctly, but the syntax and semantics of the tweets were destroyed.

The best results were obtained using the BETO models, underlining that, when available, a language-specific pre-trained model scores better than a multilingual one. The best performing model was fine-tuned on both the MEX-A3T training dataset and the HatEval Spanish subset, proving once again that adding the hate speech corpus was beneficial. This last model was used for predicting the labels of the competition test set, obtaining scores of F1=79.69%, P=84.40%, R=86.68%, and Acc=87.59% on the dedicated test set for aggressiveness detection. These results ranked third out of 19 submissions, with the leader achieving an offensive F1-score of 79.98%.
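For reference, the reported metrics can be computed as in the sketch below (toy labels, not the organizers' evaluation script), with the F1-score of the positive (aggressive) class being the ranking criterion of the shared task.

```python
# Sketch of the evaluation metrics with toy predictions: accuracy plus
# precision, recall, and F1 for the positive (aggressive) class.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0]  # gold labels (1 = aggressive), toy values
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions, toy values

acc = accuracy_score(y_true, y_pred)
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"Acc={acc:.2%}  P={p:.2%}  R={r:.2%}  F1={f1:.2%}")
```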
5. Conclusions

This work presented our approach to the shared task of automated aggressiveness detection in Spanish social media samples organized at MEX-A3T 2020. We experimented with fine-tuning pre-trained Transformer-based models, an approach that has achieved state-of-the-art results on multiple NLP tasks. The scores obtained in the validation phase of the MEX-A3T 2020 competition prove that this method can be successfully applied to the current task, using both multilingual and Spanish pre-trained models. Furthermore, several combinations of datasets annotated for various related tasks (i.e., hate speech, offensive language, and harassment) were included for fine-tuning. We discovered that the performance of the models can be improved by adding more data, even if it is labeled for a slightly different task.

Future development directions include exploring other related datasets from the offensive language, aggressiveness, and harassment detection fields. We will also consider additional techniques for pre-processing tweets, including the expansion of mentioned hashtags with corresponding details. In addition, advanced error analysis techniques, such as feature importance or model explainability, could be used to improve the models' performance.

References

[1] Z. Waseem, T. Davidson, D. Warmsley, I. Weber, Understanding abuse: A typology of abusive language detection subtasks, in: Proceedings of the First Workshop on Abusive Language Online, 2017, pp. 78–84.
[2] M. E. Aragón, H. Jarquín, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda, H. Gómez-Adorno, G. Bel-Enguix, J.-P. Posadas-Durán, Overview of MEX-A3T at IberLEF 2020: Fake news and aggressiveness analysis in Mexican Spanish, in: Notebook Papers of the 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain, September 2020.
[3] E. Greevy, A. F. Smeaton, Classifying racist texts using a support vector machine, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 468–469.
[4] E. Cambria, P. Chandra, A. Sharma, A. Hussain, Do not feel the trolls, ISWC, Shanghai (2010).
[5] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Eleventh International AAAI Conference on Web and Social Media, 2017.
[6] S. Malmasi, M. Zampieri, Challenges in discriminating profanity from hate speech, Journal of Experimental & Theoretical Artificial Intelligence 30 (2018) 187–202.
[7] B. Gambäck, U. K. Sikdar, Using convolutional neural networks to classify hate-speech, in: Proceedings of the First Workshop on Abusive Language Online, 2017, pp. 85–90.
[8] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751. URL: https://www.aclweb.org/anthology/D14-1181. doi:10.3115/v1/D14-1181.
[9] Z. Zhang, D. Robinson, J. Tepper, Detecting hate speech on Twitter using a convolution-GRU based deep neural network, in: European Semantic Web Conference, Springer, 2018, pp. 745–760.
[10] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[11] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 2017, pp. 1–10.
[12] D. Fišer, R. Huang, V. Prabhakaran, R. Voigt, Z. Waseem, J. Wernimont (Eds.), Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 2018.
[13] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 14–17.
[14] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso, M. Sanguinetti, SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 54–63.
[15] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 Task 12: Multilingual offensive language identification in social media (OffensEval 2020), in: Proceedings of SemEval, 2020.
[16] R. Kumar, A. K. Ojha, S. Malmasi, M. Zampieri, Benchmarking aggression identification in social media, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), 2018, pp. 1–11.
[17] S. Bhattacharya, S. Singh, R. Kumar, A. Bansal, A. Bhagat, Y. Dawer, B. Lahiri, A. K. Ojha, Developing a multilingual annotated corpus of misogyny and aggression, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France, 2020, pp. 158–168.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[20] M. Á. Álvarez-Carmona, E. Guzmán-Falcón, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda, V. Reyes-Meza, A. Rico-Sulayes, Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets, in: Notebook Papers of the 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL), Seville, Spain, volume 6, 2018.
[21] M. E. Aragón, M. Á. Á. Carmona, M. Montes-y-Gómez, H. J. Escalante, L. V. Pineda, D. Moctezuma, Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets, in: IberLEF@SEPLN, 2019, pp. 478–494.
[22] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[23] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1415–1420.
[24] S. Rosenthal, P. Atanasova, G. Karadzhov, M. Zampieri, P. Nakov, A large-scale semi-supervised dataset for offensive language identification, arXiv preprint (2020).
[25] H. Mubarak, A. Rashed, K. Darwish, Y. Samih, A. Abdelali, Arabic offensive language on Twitter: Analysis and experiments, arXiv preprint arXiv:2004.02192 (2020).
[26] G. I. Sigurbergsson, L. Derczynski, Offensive language and hate speech detection for Danish, in: Proceedings of the 12th Language Resources and Evaluation Conference, ELRA, 2020.
[27] Z. Pitenis, M. Zampieri, T. Ranasinghe, Offensive language identification in Greek, in: Proceedings of the 12th Language Resources and Evaluation Conference, ELRA, 2020.
[28] Ç. Çöltekin, A corpus of Turkish offensive language on social media, in: Proceedings of the 12th International Conference on Language Resources and Evaluation, ELRA, 2020.
[29] S. Sharifirad, S. Matwin, When a tweet is actually sexist: A more comprehensive classification of different online harassment categories and the challenges in NLP, arXiv preprint arXiv:1902.10584 (2019).
[30] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[31] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
[32] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4996–5001.
[33] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[34] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: Practical ML for Developing Countries Workshop @ ICLR 2020, 2020.
[35] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: LREC, volume 2012, 2012, pp. 2214–2218.
[36] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[37] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).