Lee@HASOC2020: ALBERT-based Max Ensemble with Self-training for Identifying Hate Speech and Offensive Content in Indo-European Languages

Junyi Li, Tianzi Zhao
School of Information Science and Engineering, Yunnan University, Yunnan, P.R. China

Abstract
This paper describes the system submitted to HASOC 2020, a task that aims to identify hate speech and offensive content in Indo-European languages. We participated only in the English part of Subtask A, which aims to identify hate speech and offensive content in English. To solve this problem, we propose an ALBERT-based model and use self-training and a max ensemble to improve model performance. Our model achieves a macro F1 score of 0.4976 (rank 20/35) in Subtask A.

Keywords
Hate Speech and Offensive Content, Indo-European Languages, Self-training, ALBERT, Max Ensemble

FIRE '20, Forum for Information Retrieval Evaluation, December 16–20, 2020, Hyderabad, India.
18314327187@163.com (J. Li); ORCID 0000-0002-7162-5396 (J. Li), 0000-0002-8908-2537 (T. Zhao)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

In recent years, with the rapid development of the mobile Internet and social media platforms, people have begun to use social media such as Facebook and Twitter to share their lives. People share their views on social media, and this behavior may attract good or bad comments. Some bad comments gradually evolve into offensive language. Social media is flooded with offensive language [1], and such remarks distort people's perception of events. Therefore, all major social media platforms urgently need tools that automatically monitor user speech [2].

HASOC 2020 [3] is a data challenge on identifying hate speech in multiple languages. Its goal is to use computational methods to identify offensive and hate speech in user-generated content on online social media platforms. The task provides posts from social media platforms and asks participants to classify their content; at the same time, the coverage of multiple languages greatly broadens the recognition range. In this task, we participate only in the English Subtask A: identifying hate, offensive, and profane content.

The main problem to be solved in this task is how to obtain the best task performance. To address it, this paper proposes a method that combines two effective strategies. First, we introduce an external dataset [4] to increase the amount of training data and avoid overfitting during model training. Second, we use an ALBERT-based model with model self-ensemble. Across many experiments, this method achieves good task performance and thus effectively solves the problem.

The rest of the paper is structured as follows. Section 2 describes data preparation, Section 3 describes our methods, Section 4 describes the experiments and evaluation, and Section 5 draws the conclusions.

2. Data and Data Preparation

2.1. Data

The organizers provided training and test datasets containing 3708 and 814 sentences, respectively. We counted the number and distribution of labels in the dataset; the counts are shown in Figure 1.

Figure 1: The number of labels in the dataset, where NOT means the text contains no hate speech, profane, or offensive content, and HOF means it contains hate, offensive, or profane content.

2.2. Data Preparation

Some of the text in tweets has no effect on the expressed meaning. We process tweets with the tweettokenize tool [5]; cleaning the text before further processing helps generate better features and semantics. We perform the following preprocessing steps (a minimal sketch of the pipeline is given after the list).

• Some repeated symbols carry no additional meaning, so repeated periods, question marks, and exclamation marks are replaced with a single instance followed by the special token "repeat".
• All contractions are expanded into their full parts, which helps the machine understand word meanings (for example, "there're" becomes "there" and "are").
• Twitter data contains many emojis and emoticons, which inflate the number of unknown words and thereby weaken the pre-training effect. Emoticons (for example, ":(" and ":P") are replaced with emotion words that carry the same meaning, which improves the pre-training effect.
• Words take different forms depending on context, and these variant forms introduce ambiguity during pre-training. We therefore lemmatize, using WordNetLemmatizer to restore each word to its base form, which still expresses the complete meaning.
• Tokens are converted to lower case.
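As an illustration, here is a minimal Python sketch of the preprocessing pipeline above. The paper uses the tweettokenize tool [5]; this sketch approximates the same steps with the standard library and NLTK, so the tokenizer, the contraction table, and the emoticon lexicon are illustrative assumptions rather than the authors' exact resources.

```python
import re

import nltk
from nltk.stem import WordNetLemmatizer

# Illustrative lexicons; the actual resources used by the authors are not
# specified in the paper.
EMOTICONS = {":(": "sad", ":)": "happy", ":P": "playful"}
CONTRACTIONS = {"there're": "there are", "don't": "do not", "can't": "cannot"}

lemmatizer = WordNetLemmatizer()  # requires nltk.download("wordnet")


def preprocess(tweet: str) -> str:
    # Replace runs of ., ?, ! with a single instance plus the token "repeat".
    tweet = re.sub(r"([.?!])\1+", r"\1 repeat", tweet)
    # Expand contractions into their full parts.
    for short, full in CONTRACTIONS.items():
        tweet = re.sub(re.escape(short), full, tweet, flags=re.IGNORECASE)
    # Replace emoticons with emotion words.
    for emoticon, word in EMOTICONS.items():
        tweet = tweet.replace(emoticon, f" {word} ")
    # Lower-case, tokenize (requires nltk.download("punkt")), and lemmatize.
    tokens = nltk.word_tokenize(tweet.lower())
    return " ".join(lemmatizer.lemmatize(token) for token in tokens)


print(preprocess("There're too many ads!!! :("))
# -> "there are too many ad ! repeat sad"
```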
3. Methods

3.1. Self-training

In natural language processing (NLP), training a model on different data from the same domain is a form of model training called self-training [6]; it aims to build a broad semantic understanding that improves performance on both the training and test tasks. In this paper, we use self-training to train our model; the process is shown in Figure 2.

Our self-training method follows the "teacher-student" idea, in which teacher and student share the same training procedure: student training begins where teacher training ends, which deepens the model's learning. The teacher part uses an additional external dataset, the Task 12 dataset from SemEval-2020 [7]. This dataset has the same purpose as the HASOC 2020 task, but its content differs; we randomly sampled only 10,000 sentences from it. The student part then uses the task's own training dataset. Experiments show that this design is effective; a minimal sketch of the two-stage schedule is given below.

Figure 2: The architecture of self-training
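To make the schedule concrete, the following is a minimal sketch using PyTorch and the Hugging Face transformers library. The checkpoint name, hyper-parameters, and placeholder data standing in for the SemEval-2020 Task 12 sentences and the HASOC training set are illustrative assumptions, not the authors' exact configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained(
    "albert-base-v2", num_labels=2  # NOT vs. HOF
)


def make_loader(texts, labels, batch_size=32):
    """Tokenize (text, label) pairs into a shuffled DataLoader."""
    encodings = tokenizer(texts, truncation=True, padding=True,
                          max_length=128, return_tensors="pt")
    dataset = TensorDataset(encodings["input_ids"],
                            encodings["attention_mask"],
                            torch.tensor(labels))
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)


def fine_tune(model, loader, lr=5e-6, epochs=1):
    """One training stage; the teacher and the student share this procedure."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            loss = model(input_ids=input_ids, attention_mask=attention_mask,
                         labels=labels).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model


# Placeholder data for illustration; in practice these are the 10,000 sampled
# SemEval-2020 Task 12 sentences and the preprocessed HASOC training set.
external_texts, external_labels = ["you are awful"], [1]
hasoc_texts, hasoc_labels = ["have a nice day"], [0]

# Stage 1 (teacher): train on the external dataset.
model = fine_tune(model, make_loader(external_texts, external_labels))
# Stage 2 (student): continue from the teacher's weights on the HASOC data.
model = fine_tune(model, make_loader(hasoc_texts, hasoc_labels))
```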
3.2. ALBERT

Google introduced a language representation model called BERT [8], which stands for Bidirectional Encoder Representations from Transformers. However, BERT's large number of parameters leads to a steep increase in training time and a shortage of computing resources, and an excessive parameter count can even reduce model performance. Therefore, Google and the Toyota Technological Institute proposed a new model, ALBERT [9]. ALBERT combines two parameter-reduction techniques that remove the main obstacles to scaling pre-trained models. The first is factorized embedding parameterization: separating the vocabulary embedding size from the hidden size makes it easier to grow the hidden size without significantly increasing the number of embedding parameters. The second is cross-layer parameter sharing, which prevents the parameter count from growing with the depth of the network. With these two techniques, ALBERT performs better while using fewer parameters. In this task, we use the ALBERT model to obtain good performance. Our model is shown in Figure 3.

Figure 3: The architecture of the model

3.3. Max Ensemble

In this paper, we hope to make full use of the ALBERT model through better fine-tuning strategies and thereby achieve the best task performance. In practice, the fine-tuned performance of ALBERT is sensitive to the random seed and to the order of the training data. To alleviate this, ensemble methods [10] can reduce overfitting and improve generalization, and they are widely used to combine multiple fine-tuned models; an ensemble of ALBERT models usually performs better than a single ALBERT model.

The most common ensemble method is based on voting [11]. In this paper, we fine-tune multiple ALBERT models with different random seeds. For each input, every fine-tuned ALBERT model outputs its best prediction and the corresponding probability, and we sum the predicted probabilities across the models. The output of the ensemble is the prediction with the highest summed probability. We call this ensemble method the max ensemble. The formula for the max ensemble [12] is

\[
ALBERT_{VOTE}(x; s) = \mathrm{Max}\left( \sum_{n=1}^{s} ALBERT(x_n) \right) \qquad (1)
\]

where $s$ is the number of fine-tuned ALBERT models and $ALBERT(x_n)$ denotes the prediction of the $n$-th fine-tuned ALBERT model on input $x$. A short sketch of this voting step is given below.
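The following is a minimal PyTorch sketch of the voting step in Eq. (1); the list of fine-tuned models (`seed_models`) and the batched inputs are assumed to be prepared as in the earlier sketches.

```python
import torch


def max_ensemble(models, input_ids, attention_mask):
    """Implements Eq. (1): sum the class probabilities of every fine-tuned
    model and return the class with the highest summed probability."""
    summed = None
    for model in models:
        model.eval()
        with torch.no_grad():
            logits = model(input_ids=input_ids,
                           attention_mask=attention_mask).logits
        probs = torch.softmax(logits, dim=-1)  # per-model class probabilities
        summed = probs if summed is None else summed + probs
    return summed.argmax(dim=-1)  # Max(.) over the summed distribution


# `seed_models` is an assumed list of ALBERT classifiers fine-tuned with
# different random seeds (Section 3.3):
# predictions = max_ensemble(seed_models, batch_input_ids, batch_attention_mask)
```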
4. Experiments and Evaluation

In this task, we use an ALBERT-based model with self-training and the max ensemble. For the ALBERT model, the main hyper-parameters we focus on are the number of training steps, the learning rate, the batch size, and the number of warm-up steps. After studying the hyper-parameter settings used for similar tasks, we fine-tuned the hyper-parameters as shown in Table 1.

Table 1
Details of the hyper-parameters.

  train steps    learning rate    batch size    warm-up steps
  23800          5e-6             32            1256

This task uses the macro-averaged F1 score for performance evaluation. To test the effectiveness of our method, we conducted ablation experiments; the results on the test dataset are shown in Table 2.

Table 2
Performance of our methods on the test dataset.

  Method                        Macro-averaged F1
  ALBERT (w/o self-training)    0.83
  ALBERT (w/o max ensemble)     0.85
  ALBERT (our method)           0.90

Here, ALBERT (w/o self-training) omits only self-training, ALBERT (w/o max ensemble) omits only the max ensemble, and ALBERT (our method) uses both self-training and the max ensemble. The table shows that self-training and the max ensemble each effectively improve the ALBERT model, so our method obtains good performance on this task.

5. Conclusion

In this task, our main concern is how to obtain good task performance; in other words, we need methods that optimize our model. We mainly use the ALBERT model, and on top of it we adopt self-training and the max ensemble. Experiments show that our method achieves its best performance with both strategies. However, the ranking of our model is still not satisfying. In the future, we will improve our method by adjusting the model and trying more ensemble techniques.

Acknowledgements

This work was supported by the Science Foundation of the Yunnan Education Department under Grant 2020Y0011.

References

[1] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 1415–1420.
[2] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 14–17.
[3] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in Indo-European languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.
[4] S. Rosenthal, P. Atanasova, G. Karadzhov, M. Zampieri, P. Nakov, A large-scale weakly supervised dataset for offensive language identification, arXiv preprint (2020).
[5] R. K. Bakshi, N. Kaur, R. Kaur, G. Kaur, Opinion mining and sentiment analysis, in: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, 2016, pp. 452–455.
[6] I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, D. Mahajan, Billion-scale semi-supervised learning for image classification, arXiv preprint arXiv:1905.00546 (2019).
[7] H. Mubarak, A. Rashed, K. Darwish, Y. Samih, A. Abdelali, Arabic offensive language on Twitter: Analysis and experiments, arXiv preprint arXiv:2004.02192 (2020).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[10] S. Avidan, Ensemble tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 261–271.
[11] A. Onan, S. Korukoğlu, H. Bulut, A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification, Expert Systems with Applications 62 (2016) 1–16.
[12] Y. Xu, X. Qiu, L. Zhou, X. Huang, Improving BERT fine-tuning via self-ensemble and self-distillation, arXiv preprint arXiv:2002.10345 (2020).