Non-contextual Binary Classification for
         Mexican Spanish with XLM and CNN

      Suidong Qu[0000−0002−7274−5891] , Qinyu Que[0000−0001−6688−7896] , and
                      Shuangjun Jia[0000−0001−8315−5662]

                         Yunnan University,Yunnan,P.R.China
                               icat@mail.ynu.edu.cn


        Abstract. This article first introduces the development of deep learning
        in natural language processing in recent years. Then the related work
        is briefly explained. We participat in the classification tasks of MeOf-
        fendEs@IberLEF 2021 Subtask 3: Non-contextual binary classification of
        Mexican Spanish. In this task, our work involves categorizing tweets as
        offensive or non-offensive tweets in the OffendMEX corpus. The method
        in this article first uses XLM to obtain semantic feature information, and
        then extracts the features again through convolutional neural networks.
        Focal Loss is used in the model to improve the classification effect. In
        this task, our team name is Dong. The precision of our model is 0.6050,
        the recall is 0.5361, and the F1 score is 0.5685.

        Keywords: Binary Classification · Offensive Language Detection · XLM.


1     Introduction
In recent years, with the rapid development of Internet-based media, a large
number of websites with social services as the main content have emerged, such
as Facebook, Twitter and Weibo. Their birth makes information dissemination
no longer centered on media, but centered on users. The cost for people to share
and obtain information has become very low, but at the same time offensive
language is also flooding people’s vision. Offensive language is text content that
can irritate individuals or groups, including hate language, personal attacks,
harassment, ridicule, etc. In recent years, research related to speech abuse has
received more and more attention. Preventing the abuse of offensive language
is of great significance to maintaining social harmony. The basis for effectively
avoiding speech abuse is to quickly, automatically and accurately identify offen-
sive language. In this binary classification task of MeOffendEs@IberLEF 2021
[1, 2]: Subtask 3 , we need to classify the data set of short texts based on Twitter
as offensive or non-offensive tweets in the OffendMEX corpus.
     Section 2 of this article briefly introduces the progress and current situation
of work related to natural language processing. Section 3 describes the data
    IberLEF 2021, September 2021, Málaga, Spain.
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
set used in this mission. Section 4 illustrates our method, which describes the
preprocessing steps and main structure of the model. Section 5 describes the
adjusted hyper-parameters applicable to this model and our experimental results.
We summarized our work in Section 6.


2   Related Work

With the application and development of deep learning in many industrial and
commercial fields and the rapid improvement of computer computing perfor-
mance, deep learning models are becoming more and more easily applicable to
various scenarios, including the task of detecting offensive short texts. Deep neu-
ral networks have a more complex network structure than traditional machine
learning methods, and can dig out high-dimensional feature information from
samples by self-learning the characteristics of samples. In 2017, Thomas David-
son et al. [3] conducted a research on automatic hate speech recognition based on
Twitter text. Experimental results showed that the effects of logistic regression
and linear SVM (Support Vector Machines) are significantly better than classi-
fiers based on naive Bayes and decision trees. In 2017, Shervin Malmasi et al. [4]
used linear SVM as a classifier to detect hate speech in Twitter text. They used
character-based and vocabulary-based N-Grams features with different N values.
In 2016, Zhang et al. [5] used a two-way gated neural network to extract gram-
matical and semantic information in Twitter text, and used a pooling operation
to extract contextual features in historical data. In 2017, Pinkesh Badjatiya et
al. [6] used CNN (Convolutional Neural Network) and LSTM (Long Short-Term
Memory) methods to detect hate speech in Twitter text. In 2017, Ji Ho Park
et al. [7] used a hybrid model combining character CNN and word CNN for the
particularity of tweets. In 2019, Shiwei Zhang et al. [8] used a two-way long and
short-term memory network combined with an attention mechanism to detect
mocking comments in tweets. In 2019, Pushkar Mishra et al. [9] used CNN to
detect the abuse of speech in combination with user personal information and
user social information. The model can capture the characteristics of individual
language behavior and group structure. In 2020, Baruah and Das et al. [10] used
pre-training BERT [11] for feature extraction of context and target text. We
participate in this task for Mexican Spanish and propose a method that includes
XLM [12] and CNN models. It can integrate the advantages of two models to
enhance the effectiveness of the binary classification.


3   Data and Resources

This task is Subtask 3 in MeOffendEs@IberLEF 2021. Our task is to classify
tweets as offensive or non-offensive tweets in the OffendMEX corpus. The train-
ing data used in our task is provided to each participant by the task organizer.
These data are collected from Mexican Spanish on Twitter and have been tagged.
4     System Description
4.1   Data Preprocessing
In order to make the model better understand the semantic relationship of sen-
tences during the training phase, we remove punctuation marks, numbers, and
emoticons during preprocessing. This approach does not affect the effect of the
model in this classification task, and can reduce the difficulty of training the
model. BERT inserts an identifier [CLS] and a separator [SEP] into each sen-
tence sample during training. Among them, [CLS] is added at the beginning of
the sentence, and [SEP] is added at the end of the sentence. [SEP] separates two
sentences, so that the sample forms a structure of [CLS] + sentence + [SEP].

4.2   Model Description
Our model includes XLM and CNN models. Since BERT was proposed, it has
proved that the pre-training model is very effective in processing natural lan-
guage understanding tasks, and it has achieved good results in a variety of nat-
ural language processing tasks. We consider that BERT mainly focuses on the
natural language processing tasks of a single language, so in this task we used
Hugging Face’s implementation of XLM model. XLM is a Cross-Lingual version
developed by Facebook on the basis of BERT. XLM has the same structure
as BERT, so it is also an encoder composed of multiple Encoder structures in
Transformer [13].
   When training XLM, we use XLM to encode any sentences into a shared
embedding space. The input sequence will be masked. The process is to select
15% of the input sequence for random masking, 80% of which are replaced by
[Mask], 10% are replaced by random words, and the remaining 10% are still
correct words. The Fig.1 below shows the basic structure of the model.


      Fig. 1. The XLM model captures the feature information of the sentence.


   The focus of this classification task is to extract the core semantic information
contained in the sentence. We input the feature information extracted by XLM
into CNN. We initialize it before entering CNN. The CNN layer can effectively
reduce the number of parameters, thereby reducing the difficulty of training,
and at the same time extract low-level local features into higher-level features.
In the convolutional neural network we designed, the Batch size is set to 8, the
size of the convolution kernel is 2, 3, 4, and 5, and the dropout is set to 0.2.
Dropout is added to reduce the possibility of overfitting. In order to accelerate
convergence and alleviate the problem of gradient disappearance, we use the
Batch Normalization layer to normalize each batch of data. We use the ReLU
activation function inside the CNN, and use the sigmoid activation function for
the final output.


                      Fig. 2. The convolutional neural network


    When some previous models used the traditional cross-entropy loss function,
those models ignored the difference in the degree of optimization of the model
between positive and negative samples. In this task, the proportion of positive
samples and negative samples in the training data set is unbalanced. Therefore,
this model adoptes the Focal Loss function[14] based on the cross-entropy loss
function in fine-tuning. This method can reduce the weight occupied by a large
number of simple negative samples in the process of training and optimization,
and put the model’s focus on sparse difficult samples. This method is added to
the classification model of this article, and and are set to 2 and 0.75 respectively,
which can improve the classification effect.
5   Hyper-parameters and Results

In this classification task, our model is based on PyTorch. After many experi-
ments, the training parameter values are shown in Table 1. For this task, our
model is composed of XLM and CNN, and the Adam [15] optimizer is used. In
the internal and output of CNN, we use ReLU activation function and sigmoid
activation function respectively.


                           Table 1. Hyper-parameters

                             hyper-parameters       Values
                               learning rate         1e-5
                          per gpu train batch size    64
                        Gradient accumulation steps    6
                             num train epochs          8
                              max seq length         100
                                  dropout             0.2


   In this task, the scores obtained by our model are shown in Table 2. The
organizer’s evaluation indicators for this task are precision, recall and F1 score.
The precision of our model is 0.6050 (ranked 10th, 0.31 lower than the highest
score). The recall is 0.5361 (ranked 9th, 0.16 lower than the highest score). The
F1 score is 0.5685 (ranked 9th, 0.14 lower than the highest score).


                           Table 2. Evaluation results

                             Our model Model 1 Model 2 Model 3
             Macro precision 0.6050(10) 0.7600(4) 0.6733(7) 0.9183(1)
              Macro recall 0.5361(9) 0.6533(5) 0.6966(1) 0.3141(12)
             Macro F1 score 0.5685(9) 0.7026(1) 0.6847(3) 0.4683(12)


    From Table 2 we can see that although our macro precision is not high
compared to model 3, our model has better stability. Comparing madol 1 and
model 2, our macro recall and F1 score are relatively low. We plan to use Focal
Loss to improve macro recall, but the effect is limited. From the perspective of
the data set, we find that the amount of offensive and non-offensive text is very
different. In addition, the semantics of some texts are ambiguous, which can
cause interference. This situation makes our model more biased towards the side
with more data when fine-tuning, and the model may have over-fitting during
training. These situations ultimately lead to the model’s poor performance in
the test set.
6    Conclusion
This article mainly describes the overall idea and optimization scheme of non-
contextual offensive detection and classification for Mexican Spanish. In the
model implementation, we use the XLM pre-training model as a word vector
model, and construct a convolutional neural network in downstream tasks to
extract deep semantic information. The Focal Loss function is introduced into
the model to improve the classification performance. Among these evaluation
indicators, this method is not optimal compared with other methods, indicat-
ing that this model has certain disadvantages. Although the downstream task
captures some local information through CNN, its extraction ability is limited.
In the next stage of research, we can add sentence-level attention information
to improve the detection results. In addition, we consider adjusting the depth
of the CNN in the model and the number of epochs to improve the model’s
characterization ability.


References
 1. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez-Carmona,
    M.Á., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L.,
    Gómez Adorno, H., Gutiérrez, Y., Jiménez-Zafra, S.M., Lima, S., Plaza-de Arco,
    F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum
    (IberLEF 2021) (2021)
 2. Plaza-del-Arco, F.M., Casavantes, M., Jair Escalante, H., Martin-Valdivia, M.T.,
    Montejo-Ráez, A., Montes-y-Gómez, M., Jarquı́n-Vásquez, H., Villaseñor-Pineda,
    L.: Overview of the MeOffendEs task on offensive text detection at IberLEF 2021.
    Procesamiento del Lenguaje Natural 67(0) (2021)
 3. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection
    and the problem of offensive language (2017)
 4. Malmasi, S., Zampieri, M.: Detecting hate speech in social media (2017)
 5. Zhang, M., Yue, Z., Fu, G.: Tweet sarcasm detection using deep neural network
 6. Badjatiya, P., Gupta, S., Gupta, M., Varma, V.: Deep learning for hate speech
    detection in tweets. In: Proceedings of the 26th International Conference on World
    Wide Web Companion. pp. 759–760 (2017)
 7. Ji, H.P., Fung, P.: One-step and two-step classification for abusive language de-
    tection on twitter. In: Proceedings of the First Workshop on Abusive Language
    Online (2017)
 8. Zhang, S., Zhang, X., Chan, J., Rosso, P.: Irony detection via sentiment-based
    transfer learning. Information Processing Management 56(5), 1633–1644 (2019)
 9. Mishra, P., Tredici, M., Yannakoudakis, H., Shutova, E.: Abusive language detec-
    tion with graph convolutional networks (2019)
10. Baruah, A., Das, K., Barbhuiya, F., Dey, K.: Context-aware sarcasm detection
    using bert. In: Proceedings of the Second Workshop on Figurative Language Pro-
    cessing (2020)
11. Bert: Pre-training of deep bidirectional transformers for language understanding
    (2018)
12. Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Advances
    in Neural Information Processing Systems. vol. 32, pp. 7057–7067 (2019)
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
    L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st Interna-
    tional Conference on Neural Information Processing Systems. vol. 30, pp. 5998–
    6008 (2017)
14. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object
    detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2),
    318–327 (2020)
15. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. Computer Sci-
    ence (2014)
16. Miao, Y., Ji, Y., Peng, E.: Application of cnn-bigru model in chinese short text
    sentiment analysis. In: Proceedings of the 2019 2nd International Conference on
    Algorithms, Computing and Artificial Intelligence (2019)