=Paper=
{{Paper
|id=Vol-2826/T2-16
|storemode=property
|title=YUN_DE at HASOC2020 subtask A: Multi-Model Ensemble Learning for Identifying Hate Speech and Offensive Language
|pdfUrl=https://ceur-ws.org/Vol-2826/T2-16.pdf
|volume=Vol-2826
|authors=Zichen Zhang,Yuhang Wu,Hao Wu
|dblpUrl=https://dblp.org/rec/conf/fire/ZhangWW20
}}
==YUN_DE at HASOC2020 subtask A: Multi-Model Ensemble Learning for Identifying Hate Speech and Offensive Language==
<pdf width="1500px">https://ceur-ws.org/Vol-2826/T2-16.pdf</pdf>
<pre>
YUN_DE at HASOC2020 subtask A: Multi-Model
Ensemble Learning for Identifying Hate Speech and
Offensive Language
Zichen Zhang, Yuhang Wu and Hao Wu∗
School of Information Science and Engineering, Yunnan University, Chenggong Campus, Kunming, P.R. China


Abstract
This paper describes our system for subtask A of HASOC2020: Hate Speech and Offensive Content
Identification in Indo-European Languages. We propose a multi-model ensemble learning method
that combines BERT, ON-LSTM, and TextCNN models. The ensemble aims to achieve better text
classification results than any single model. Our system achieves a Macro average F1-score
of 0.5017 and is ranked 11th among the 36 teams on the final leaderboard of the competition.

Keywords
Text Classification, BERT, ON-LSTM, TextCNN, Ensemble Learning


1. Introduction
With the popularity of social media, much of the world now communicates through it; for example,
nearly a third of the world’s population is active on Facebook alone. Anyone can publish
content and access content of interest on these platforms [1]. However, this also provides space
for discourse that includes offensive content and hate speech, which is used to express hatred
towards a targeted group or to be derogatory to the members of that group [2]. Such speech
seriously damages social harmony and can even lead to violence and conflict. Although many social
media platforms have responded by creating their own provisions against hate speech, in line with
laws prohibiting it, they still need an efficient automatic detection system to cope with the
rapid spread of massive amounts of hate speech.
   In recent years, many researchers in industry and academia have developed natural
language processing (NLP) techniques for detecting hate speech and offensive content. The
QutNocturnal team utilized Convolutional Neural Networks (CNN) to identify whether tweets
are hate speech [3]. Manolescu et al. proposed a system based on Long Short-Term
Memory (LSTM) with an embedding layer for detecting hate speech against immigrants and
women on Twitter [4]. Ordered Neurons LSTM (ON-LSTM) integrates a hierarchical (tree)
structure into the LSTM, which allows the LSTM to learn hierarchical structure information
automatically [5]; Wang et al. proposed ON-LSTM with attention for identifying hate speech
and offensive language and used a K-fold ensemble approach to enhance the performance [6].
Mozafari et al. investigated the ability of BERT [7] to capture hateful context within social
media content [8]. These methods can detect hate speech effectively, but each relies only on the
strengths of a single model, whose ability is limited. A good solution is to integrate multiple
models so that their advantages are combined.

FIRE ’20, Forum for Information Retrieval Evaluation, December 16–20, 2020, Hyderabad, India.
Email: zczhang@mail.ynu.edu.cn (Z. Zhang); yuhangwu@mail.ynu.edu.cn (Y. Wu); haowu@ynu.edu.cn (H. Wu∗)
ORCID: 0000-0001-6716-3339 (Z. Zhang); 0000-0003-0690-5364 (Y. Wu); 0000-0002-3696-9281 (H. Wu∗)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   HASOC2020 aims at identifying hate speech and offensive content in Indo-European
languages [9] and provides a forum and a data challenge for multilingual research on the
identification of problematic content. Subtask A requires classifying tweets into two
categories: hate and offensive (HOF) and non hate-offensive (NOT). We participated in this
subtask for the English language and proposed an ensemble learning method that includes BERT,
ON-LSTM, and TextCNN models. It integrates the advantages of each model to enhance the detection
of hate speech and offensive content. The results of our own experiments (Table 2) indicate that
our method performs better than any single model and achieves a Macro F1-score of 0.8896.


2. Method
2.1. BERT model
BERT is mainly composed of bidirectional Transformer blocks, which overcome the problem
of context-independent word vector representations in previous models: BERT can generate
different word vectors for the same word according to its context. BERT has achieved excellent
results in many natural language tasks in recent years.
   In Google’s research, a large number of high-quality texts are used for pre-training through
self-supervised methods. The language knowledge contained in the texts is encoded into the
bidirectional Transformer during pre-training, and the pre-trained model parameters¹ have been
released. Usually we only need to fine-tune the pre-trained model to handle most text
classification tasks.
   We add the special [CLS] token to the head of the input text; its final hidden state serves
as an aggregate representation of the whole sequence for the classification task. We then
extract the first element of each output sequence of BERT (the [CLS] position) and send it to a
simple fully connected layer to output the classification results.
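
The classification head described above can be sketched in a few lines. This is only an
illustration under stated assumptions, not the authors' implementation: the BERT output is
replaced by a random array, the names cls_classifier, W and b are hypothetical, and the softmax
after the fully connected layer is an assumption (the text only mentions the layer itself).

```python
import numpy as np

def cls_classifier(bert_output, weight, bias):
    """Classify each sequence from the hidden state at its first ([CLS]) position.

    bert_output: (batch, seq_len, hidden) array of final hidden states.
    weight: (hidden, num_classes), bias: (num_classes,). Hypothetical parameters.
    """
    cls_vectors = bert_output[:, 0, :]                    # first element of every sequence
    logits = cls_vectors @ weight + bias                  # simple fully connected layer
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)           # softmax (assumed)

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 16, 768))   # stand-in for a real BERT output
W = rng.normal(scale=0.02, size=(768, 2))       # two classes: HOF / NOT
b = np.zeros(2)
probs = cls_classifier(hidden_states, W, b)
```

Each row of probs holds the two class probabilities for one input text.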

2.2. TextCNN model
TextCNN is a variant of the CNN architecture. Firstly, the sentence is embedded with word
vectors represented as [𝑥1 , 𝑥2 , … , 𝑥𝑛 ], where 𝑥𝑖 is the k-dimensional word vector corresponding
to the i-th word in the sentence. A convolution operation involves a filter which is applied to
a window of ℎ words to produce a new feature. For example, a feature 𝑧𝑖 is generated from a
window of words [𝑥𝑖 ∶ 𝑥𝑖+ℎ−1 ] as follows:

                                         𝑧𝑖 = 𝑓 (𝑤 [𝑥𝑖 ∶ 𝑥𝑖+ℎ−1 ] + 𝑏)                          (1)
   where 𝑤 is a weight matrix, 𝑏 is a bias term and 𝑓 (⋅) is a non-linear function. This filter
is applied to the 𝑛 − ℎ + 1 possible windows to produce a feature map 𝑍 = [𝑧1 , 𝑧2 , … , 𝑧𝑛−ℎ+1 ].
A 1-max pooling layer then takes the maximum value 𝑧𝑚𝑎𝑥 from 𝑍. We adopt 𝑚 filters with
different window sizes to carry out the above process and combine the maximum value of each
filter into a vector $V = [z_{max}^{1}, z_{max}^{2}, \ldots, z_{max}^{m}]$. Finally, a fully
connected softmax layer outputs the probability of each category.

    ¹ https://github.com/google-research/bert
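
The convolution and 1-max pooling steps of Section 2.2 can be illustrated as follows. This is a
minimal sketch, assuming ReLU as the non-linear function and a scalar bias per filter; the
function names are hypothetical and the final softmax layer is omitted.

```python
import numpy as np

def conv_feature_map(X, w, b):
    """Apply one filter to all n-h+1 windows of h consecutive word vectors.

    X: (n, k) sentence matrix, w: (h, k) filter weights, b: scalar bias.
    ReLU is assumed for the non-linear function f.
    """
    n, h = X.shape[0], w.shape[0]
    return np.array([max(0.0, float(np.sum(w * X[i:i + h]) + b))
                     for i in range(n - h + 1)])

def textcnn_features(X, filters):
    """1-max pooling: keep the largest value of each filter's feature map."""
    return np.array([conv_feature_map(X, w, b).max() for w, b in filters])

# Tiny worked example: 4 words, 2-dimensional vectors, one filter with h = 2.
X = np.arange(8, dtype=float).reshape(4, 2)
Z = conv_feature_map(X, np.ones((2, 2)), 0.0)   # window sums: 6, 14, 22

# Three filters with window sizes 2, 3 and 4 give a 3-dimensional vector V.
filters = [(np.ones((h, 2)), 0.0) for h in (2, 3, 4)]
V = textcnn_features(X, filters)
```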

2.3. ON-LSTM model


                               Figure 1: The unit of ON-LSTM model


  ON-LSTM is a new variant of LSTM, whose unit architecture is shown in Figure 1. It uses an
architecture similar to the standard LSTM, so it is also composed of a forget gate 𝑓𝑡 , an input
gate 𝑖𝑡 and an output gate 𝑜𝑡 . These gates are computed as follows:

                                      𝑓𝑡 = 𝜎 (𝑊𝑓 𝑥𝑡 + 𝑈𝑓 ℎ𝑡−1 + 𝑏𝑓 ) ,
                                      𝑖𝑡 = 𝜎 (𝑊𝑖 𝑥𝑡 + 𝑈𝑖 ℎ𝑡−1 + 𝑏𝑖 ) ,                          (2)
                                      𝑜𝑡 = 𝜎 (𝑊𝑜 𝑥𝑡 + 𝑈𝑜 ℎ𝑡−1 + 𝑏𝑜 )
   where 𝑥𝑡 is the input variable at time 𝑡, ℎ𝑡−1 corresponds to the hidden layer at time 𝑡 − 1, 𝑈
and 𝑊 are weight matrices, 𝑏 is the bias vector, and 𝜎 (⋅) is the sigmoid function.
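   However, ON-LSTM differs from LSTM in that it enforces an order on the update frequency of
the neurons. To do so, it introduces a new activation function; the specific formulas are as
follows:

\[
\begin{aligned}
\tilde{f}_t &= \overrightarrow{cs}\left(\mathrm{softmax}\left(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}\right)\right), \\
\tilde{i}_t &= 1 - \overrightarrow{cs}\left(\mathrm{softmax}\left(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}\right)\right), \\
\omega_t &= \tilde{f}_t \circ \tilde{i}_t, \\
\hat{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right), \\
c_t &= \omega_t \circ \left(f_t \circ c_{t-1} + i_t \circ \hat{c}_t\right) + \left(\tilde{f}_t - \omega_t\right) \circ c_{t-1} + \left(\tilde{i}_t - \omega_t\right) \circ \hat{c}_t
\end{aligned}
\tag{3}
\]

  where $\overrightarrow{cs}$ is the cumulative sum function, which represents the influence of
historical information and current information. $\tilde{f}_t$ and $\tilde{i}_t$ are called the
master forget gate and the master input gate, respectively. $\omega_t$ is the overlap of the two
master gates: it is 1 where they intersect and 0 elsewhere. By restructuring the update of the
long-term memory $c_t$ in this way, high-level information can be stored for a long time, while
low-level information may be updated at each input step, so the hierarchical structure is
embedded through the hierarchy of information.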
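
Equations (2) and (3) can be traced with a small NumPy sketch of a single cell-state update.
It is an illustration only: the parameter dictionary, the gate names and the random weights are
hypothetical, and the computation of the hidden state from the output gate is left out because
it is not given above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cumax(x):
    """Cumulative sum of a softmax, i.e. the activation cs(softmax(.))."""
    e = np.exp(x - x.max())
    return np.cumsum(e / e.sum())

def on_lstm_cell_state(x_t, h_prev, c_prev, params):
    """One ON-LSTM cell-state update; params maps a gate name to (W, U, b)."""
    def affine(name):
        W, U, b = params[name]
        return W @ x_t + U @ h_prev + b

    f_t = sigmoid(affine("f"))                     # standard forget gate, Eq. (2)
    i_t = sigmoid(affine("i"))                     # standard input gate, Eq. (2)
    f_master = cumax(affine("f_master"))           # master forget gate
    i_master = 1.0 - cumax(affine("i_master"))     # master input gate
    omega = f_master * i_master                    # overlap of the master gates
    c_hat = np.tanh(affine("c"))                   # candidate memory
    c_t = (omega * (f_t * c_prev + i_t * c_hat)
           + (f_master - omega) * c_prev
           + (i_master - omega) * c_hat)
    return c_t, f_master, i_master

rng = np.random.default_rng(0)
d_in, d_h = 3, 5   # hypothetical input and hidden sizes
params = {name: (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
          for name in ("f", "i", "f_master", "i_master", "c")}
c_t, f_master, i_master = on_lstm_cell_state(
    rng.normal(size=d_in), np.zeros(d_h), np.ones(d_h), params)
```

The master forget gate rises monotonically towards 1 and the master input gate falls
monotonically towards 0, which is what imposes the ordering on the neurons.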

2.4. Ensemble learning
We use the multi-model ensemble learning approach to obtain a stable system that performs well
in all aspects. We use hard voting to determine the final category: each model votes for a text
according to its classification result, and the minority yields to the majority.
Thus, the system integrates the TextCNN, BERT, and ON-LSTM models through ensemble learning,
as shown in Figure 2.


                             Figure 2: Multi-model ensemble learning


  For an input text from [𝑆1 , 𝑆2 , … , 𝑆𝑛 ], the output of each model is 0 or 1, which is
treated as a vote 𝑝𝑖 . Averaging all the votes gives the final score of a candidate, as follows:

\[
score = \frac{1}{num} \sum_{i=1}^{num} p_i \tag{4}
\]

 where 𝑠𝑐𝑜𝑟𝑒 ≥ 0.5 indicates that the text is NOT, otherwise HOF, and 𝑛𝑢𝑚 is the number of
models.
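
As a small worked example of Equation (4), the following sketch averages the 0/1 votes of the
three models and applies the 0.5 threshold. The function name is hypothetical, and the mapping
of a vote of 1 to NOT is an assumption read off the threshold rule above.

```python
def hard_vote(votes):
    """Average the 0/1 votes of all models and threshold the score at 0.5."""
    score = sum(votes) / len(votes)
    # Following the rule in the text: score >= 0.5 means NOT, otherwise HOF.
    return "NOT" if score >= 0.5 else "HOF"

# Votes in the order TextCNN, BERT, ON-LSTM (illustrative values).
label_a = hard_vote([1, 0, 1])   # score 2/3
label_b = hard_vote([0, 0, 1])   # score 1/3
```

With an odd number of models the score can never be exactly 0.5, so no tie-breaking is needed.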


3. Experiment
3.1. Dataset
We split the English HASOC2020 data into a training set and a validation (dev) set. Statistics
of the dataset are shown in Table 1; the data is relatively balanced, so no class rebalancing
is needed.
Table 1
Statistics of the dataset.

                                       Dataset     NOT     HOF     Total
                                       Train set   6200    4327    10527
                                        Dev set     499     501     1000


3.2. Implementation details
To achieve better results, we clean and preprocess the texts, mainly through the following
steps:
    • Use regular expressions to remove user mentions and topics.
    • Convert emoji into text descriptions.
    • Check the spelling of words.
    • Restore abbreviations.
    • Remove URL links.
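
The regular-expression steps of this list can be sketched with the Python standard library.
The patterns below are assumptions (the exact expressions used are not given), "topics" is read
as hashtags, and emoji conversion, spell checking and abbreviation restoration are left out
because they depend on external resources.

```python
import re

# Hypothetical patterns for the three regex-based cleaning steps.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_tweet(text):
    """Remove URLs, user mentions and hashtags, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_tweet("@user check this https://t.co/abc #topic so rude")
# cleaned == "check this so rude"
```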
   To ensure the objectivity and fairness of all experiments, we train all models with the same
hyperparameters: Adam optimizer, learning rate = 5e-6, 15 epochs, and batch size = 32. For the
TextCNN and ON-LSTM models, we use glove.twitter.27B.200d² for word embedding.
   For training all models, we adopt the classical cross-entropy loss function, which measures
how closely the predicted data distribution approximates the real data distribution. With this
loss, the model learns quickly towards its best performance. The formula is as follows:

\[
CE(p, Y) = \frac{1}{n} \sum_{i=1}^{n} \left[ -Y_i \log(p_i) - (1 - Y_i) \log(1 - p_i) \right] \tag{5}
\]

  where $Y_i$ is the label value (0 or 1) and $p_i$ is the predicted probability that $Y_i = 1$.
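
Equation (5) can be checked numerically with a short function. This is a plain-Python sketch;
the function name and the clipping constant eps, which guards against log(0), are assumptions
added for illustration.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(labels)

uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])   # equals log(2)
confident = binary_cross_entropy([0.9, 0.1], [1, 0])    # smaller loss
```

A model that always predicts 0.5 incurs a loss of log 2 ≈ 0.693, and the loss shrinks as the
predicted probabilities move towards the correct labels.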

3.3. Result analysis
To optimize the prediction results, we first trained each model and kept the parameters that
gave the best F1-score. The loss at each iteration of the training process is shown in Figure 3.
We found that a model's best performance did not necessarily coincide with its lowest loss
value; in particular, ON-LSTM achieved its best performance in the second epoch. We therefore
saved the parameters of each model at its best performance to make ensemble learning more
effective.
                         Figure 3: Training process under different models

   Then we compared the multi-model ensemble learning approach with the single models, using
Macro F1-score, Precision, and Accuracy to evaluate the prediction results, as shown in Table 2.
Our multi-model ensemble method improves over the single models on all three indicators; in
particular, the Macro F1-score is 5.4 percentage points higher than that of TextCNN (0.8896
versus 0.8355). This demonstrates that multi-model ensemble learning can integrate the strengths
of each model and reduce the negative impact of any single model on the results.

    ² https://github.com/stanfordnlp/GloVe

Table 2
Prediction results under different methods
                       Method            Macro F1-score    Precision    Accuracy
                     TextCNN                 0.8355          0.7935         0.826
                       BERT                  0.8789          0.9002         0.884
                     ON-LSTM                 0.8761          0.8992         0.879
                Ensemble_soft_voting         0.8864          0.9010         0.888
                Ensemble_hard_voting         0.8896          0.9119         0.892
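
For reference, the Macro F1-score used in Table 2 is the unweighted mean of the per-class
F1-scores. The sketch below shows the standard definition on made-up labels; it is not the
official evaluation script, and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of one class computed from true/false positives and negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes=("HOF", "NOT")):
    """Unweighted average of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up example: F1(HOF) = 2/3, F1(NOT) = 0.8.
y_true = ["HOF", "HOF", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
```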

   In addition, we compared two ensemble approaches: our method based on hard voting
(Ensemble_hard_voting) and a variant based on soft voting (Ensemble_soft_voting). Hard voting
performed better. We attribute this to the fact that it only follows the majority of the models
and, unlike soft voting, does not need to synthesize the opinions of all models, which reduces
the impact of any individual model on the prediction results.
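
The difference between the two voting schemes can be seen on a single made-up example. The
sketch assumes the common definition of soft voting (averaging the predicted probabilities of
class 1 and thresholding the mean at 0.5); the exact soft-voting procedure is not specified
above, and the function names and probabilities are hypothetical.

```python
def soft_vote(probabilities):
    """Average the models' probabilities for class 1 and threshold the mean."""
    return 1 if sum(probabilities) / len(probabilities) >= 0.5 else 0

def hard_vote_label(probabilities):
    """Turn each probability into a 0/1 vote first, then take the majority."""
    votes = [1 if p >= 0.5 else 0 for p in probabilities]
    return 1 if sum(votes) / len(votes) >= 0.5 else 0

# Two models lean mildly towards class 1; one model is very confident in class 0.
model_probs = [0.55, 0.60, 0.05]
soft_label = soft_vote(model_probs)        # mean is 0.4, so class 0
hard_label = hard_vote_label(model_probs)  # votes are 1, 1, 0, so class 1
```

Here one very confident model overturns the majority under soft voting, whereas hard voting
limits every model to a single vote.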


4. Conclusions
In this paper, we presented a system that detects hate speech and offensive language in English
using multi-model ensemble learning. In our experiments for subtask A, the ensemble achieved
better results than any single model. In future research, we will consider more efficient
ensemble methods to further enhance the performance of the model.


Acknowledgments
This work is supported by the National Natural Science Foundation of China (61962061) and
partially supported by the Yunnan Provincial Foundation for Leaders of Disciplines in Science
and Technology, the Top Young Talents of the "Ten Thousand Plan" in Yunnan Province, and the
Program for Excellent Young Talents of Yunnan University.


References
[1] M. Mondal, L. A. Silva, F. Benevenuto, A measurement study of hate speech in social media,
    in: Proceedings of the 28th ACM Conference on Hypertext and Social Media, 2017, pp. 85–94.
[2] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the
    problem of offensive language, arXiv preprint arXiv:1703.04009 (2017).
[3] M. A. Bashar, R. Nayak, QutNocturnal@HASOC’19: CNN for hate speech and offensive
    content identification in Hindi language, arXiv preprint arXiv:2008.12448 (2020).
[4] M. Manolescu, D. Löfflad, A. N. M. Saber, M. M. Tari, TuEval at SemEval-2019 Task 5: LSTM
    approach to hate speech detection in English and Spanish, in: Proceedings of the 13th
    International Workshop on Semantic Evaluation, 2019, pp. 498–502.
[5] Y. Shen, S. Tan, A. Sordoni, A. Courville, Ordered neurons: Integrating tree structures into
    recurrent neural networks, arXiv preprint arXiv:1810.09536 (2018).
[6] B. Wang, Y. Ding, S. Liu, X. Zhou, YNU_WB at HASOC 2019: Ordered neurons LSTM with
    attention for identifying hate speech and offensive language, in: FIRE (Working Notes),
    2019, pp. 191–198.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
    transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[8] M. Mozafari, R. Farahbakhsh, N. Crespi, A BERT-based transfer learning approach for hate
    speech detection in online social media, in: International Conference on Complex Networks
    and Their Applications, Springer, 2019, pp. 928–940.
[9] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer,
    Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content
    Identification in Indo-European Languages, in: Working Notes of FIRE 2020 - Forum for
    Information Retrieval Evaluation, CEUR, 2020.

</pre>