<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>with Self-training for Identifying Hate Speech and Offensive Content in Indo-European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junyi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianzi Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIRE '20, Forum for Information Retrieval Evaluation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hate Speech and Offensive Content</institution>
          ,
          <addr-line>Indo-European Languages, Self-training, ALBERT, Max Ensemble</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Information Science and Engineering Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the system we submitted to HASOC 2020, a shared task that aims to identify hate speech and offensive content in Indo-European languages. We participated only in the English part of subtask A, which aims to identify hate speech and offensive content in English. To address this problem, we propose an ALBERT-based model and use self-training and a max ensemble to improve its performance. Our model achieves a macro F1 score of 0.4976 in subtask A, ranking 20th of 35.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, with the rapid development of the mobile Internet and social media platforms,
people have begun to use social media such as Facebook and Twitter to share their lives and
their views, and these posts attract both good and bad comments. Some bad comments have gradually
evolved into offensive language. Social media is now flooded with offensive language [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and such remarks distort people’s perception of things. Therefore, all major social media
platforms urgently need tools to automatically monitor user speech [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        HASOC 2020 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a data challenge on identifying hate speech in multiple languages. Its goal is
to use computational methods to identify offensive and hateful speech in user-generated content
on online social media platforms. The task provides posts from social media platforms, which
participating systems must classify. Covering multiple languages also broadens the range of
content that can be recognized.
      </p>
      <p>
        In this task, we participate only in the English part of subtask A: identifying hate, offensive,
and profane content. The main problem is how to obtain the best task performance. To address it,
this paper proposes a method that combines two effective strategies. First, we introduce an external dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to
increase the amount of training data and reduce overfitting. Second, we use
an ALBERT-based model with model self-ensemble. Our experiments indicate that this method
performs well on the task.
      </p>
      <p>The rest of our paper is structured as follows. Section 2 describes data preparation.
Methods are described in Section 3. Experiments and evaluation are described in Section 4. The
conclusions are drawn in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and Data Preparation</title>
      <sec>
        <title>2.1. Data</title>
        <p>The organizers provided training and test datasets containing 3708 and 814 sentences,
respectively. We counted the labels in the dataset; their distribution is shown in Figure 1.</p>
        <p>In Figure 1, NOT means that the text does not contain any hate speech, profanity, or offensive
content, and HOF means that it contains hate, offensive, or profane content.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Data Preparation</title>
        <p>
          Some text in the tweets has no effect on the meaning expressed. Tweets are processed using
the tweettokenize tool [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Cleaning the text before further processing helps to produce better
features and semantics. We perform the following preprocessing steps.
        </p>
        <list list-type="bullet">
          <list-item>
            <p>Repeated symbols carry no extra meaning, so repeated periods, question marks, and
exclamation marks are replaced with a single instance followed by the special mark "repeat".</p>
          </list-item>
          <list-item>
            <p>All contractions are expanded to their full forms, which helps the model understand the
meaning of the words (for example, "there’re" is changed to "there" and "are").</p>
          </list-item>
          <list-item>
            <p>Twitter data contains many emojis, which can increase the number of unknown words and
weaken the effect of pre-training. Emoticons (for example, ":(" and ":P") and emojis are replaced
by words that express their emotional meaning, which improves the pre-training effect.</p>
          </list-item>
          <list-item>
            <p>Words take different forms depending on context, and these different forms can cause
ambiguity and weaken the effect of pre-training. We therefore lemmatize the text with
WordNetLemmatizer, restoring each word to its general form, which still expresses its complete meaning.</p>
          </list-item>
          <list-item>
            <p>Tokens are converted to lower case.</p>
          </list-item>
        </list>
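        <p>The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the implementation used for the submission: the lookup tables CONTRACTIONS and EMOTICONS and the function name preprocess are hypothetical, the emotion words are invented for the example, and the lemmatization step is omitted because it depends on an external lexicon.</p>

```python
import re

# Hypothetical lookup tables for illustration; the paper does not list its mappings.
CONTRACTIONS = {"there're": "there are", "can't": "can not", "i'm": "i am"}
EMOTICONS = {":(": "sad", ":P": "playful"}


def preprocess(tweet: str) -> str:
    """Apply the cleaning steps in order and return a lower-cased string."""
    # Normalize the typographic apostrophe so the contraction lookup works.
    text = tweet.replace("\u2019", "'")
    # Replace emoticons with emotion words before touching punctuation.
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    # Collapse runs of '.', '?' or '!' into one mark followed by a "repeat" tag.
    text = re.sub(r"([.?!])\1+", r"\1 repeat ", text)
    # Expand contractions token by token (case-insensitive lookup).
    tokens = [CONTRACTIONS.get(tok.lower(), tok) for tok in text.split()]
    # Lower-case everything as the final step.
    return " ".join(tokens).lower()


print(preprocess("There\u2019re problems!!! :("))
```

        <p>Emoticons are replaced before repeated punctuation is collapsed so that the punctuation rule cannot alter an emoticon; the order of the remaining steps is an assumption of this sketch.</p>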
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Self-training</title>
        <p>
          In Natural Language Processing (NLP), a model can be trained on different data from the same domain.
This training method is called self-training [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and it aims to establish
a broad semantic understanding that improves performance on the training and test
tasks.
        </p>
        <p>
          In this paper, we use self-training to train the model. The self-training process of our model
is shown in Figure 2. Our method follows the "teacher-student" idea: both stages use the same
training procedure, and student training begins where teacher training ends, which deepens what
the model learns. The teacher stage uses an additional external dataset,
which is the task dataset [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] of Task 12 of SemEval-2020. This dataset has the same purpose as
the HASOC 2020 task, but the content of the data is different. We used only 10,000 randomly selected
sentences from it. The student stage uses the training dataset of the task. Our experiments show that this
design is effective.
        </p>
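        <p>The two-stage schedule described above can be illustrated with a toy example. In this sketch a small bag-of-words logistic regression stands in for ALBERT, and the class name TinyClassifier, the example sentences, and the training settings are all invented for illustration. The only point it shows is that the second stage continues from the weights left by the first.</p>

```python
import math
from collections import defaultdict


class TinyClassifier:
    """Bag-of-words logistic regression; a stand-in for ALBERT in this sketch."""

    def __init__(self) -> None:
        self.weights = defaultdict(float)

    def predict_proba(self, text: str) -> float:
        """Return the estimated probability that the text is HOF."""
        score = sum(self.weights[tok] for tok in text.lower().split())
        return 1.0 / (1.0 + math.exp(-score))

    def fit(self, data, epochs: int = 20, lr: float = 0.5) -> "TinyClassifier":
        # Training continues from the current weights, so calling fit twice
        # chains the two stages instead of restarting from scratch.
        for _ in range(epochs):
            for text, label in data:
                error = label - self.predict_proba(text)
                for tok in text.lower().split():
                    self.weights[tok] += lr * error
        return self


# Toy data invented for illustration (1 = HOF, 0 = NOT).
external_data = [("you are an idiot", 1), ("have a nice day", 0)]
task_data = [("what an idiot move", 1), ("nice weather today", 0)]

model = TinyClassifier()
model.fit(external_data)  # "teacher" stage: external dataset
model.fit(task_data)      # "student" stage: starts where the teacher ended

print(round(model.predict_proba("idiot"), 3))
```

        <p>With a pre-trained transformer the same pattern would amount to fine-tuning on the external data first and then continuing to fine-tune the resulting checkpoint on the task data.</p>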
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ALBERT</title>
        <p>
          Google introduced a language representation model called BERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which stands for
Bidirectional Encoder Representations from Transformers. However, the large number of parameters
in BERT leads to long training times and heavy demands on computing
resources. Moreover, an excessive number of parameters can cause the performance of the model to
decrease. Therefore, Google and the Toyota Technological Institute at Chicago proposed a new model,
ALBERT.
        </p>
        <p>
          The ALBERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] model combines two parameter-reduction techniques that remove the
main obstacles to scaling pre-trained models. The first is factorized embedding
parameterization, which decomposes the large vocabulary embedding matrix into two smaller
matrices. This separation makes it easier to increase the hidden size without significantly increasing
the number of vocabulary embedding parameters. The second technique is cross-layer parameter
sharing, which prevents the number of parameters from growing with the depth of the network.
With these two techniques, ALBERT performs well even though it has far fewer parameters.
        </p>
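        <p>A short calculation makes the effect of the two techniques concrete. The sizes below (vocabulary V, embedding size E, hidden size H, number of layers) and the rough per-layer weight count are illustrative assumptions rather than values reported in this paper, and the function names are hypothetical.</p>

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters of an unfactorized V x H embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    """Parameters of a V x E embedding followed by an E x H projection."""
    return vocab_size * embed_size + embed_size * hidden_size


def encoder_params(per_layer: int, num_layers: int, shared: bool) -> int:
    """With cross-layer sharing, one layer's parameters are reused at every depth."""
    return per_layer if shared else per_layer * num_layers


# Illustrative sizes (assumptions, not taken from the paper).
V, E, H, LAYERS = 30000, 128, 768, 12
PER_LAYER = 12 * H * H  # rough attention + feed-forward weight count, biases ignored

print(embedding_params(V, H))                # 23040000
print(factorized_embedding_params(V, E, H))  # 3938304
print(encoder_params(PER_LAYER, LAYERS, shared=False))
print(encoder_params(PER_LAYER, LAYERS, shared=True))
```

        <p>Under these assumed sizes the factorized embedding needs roughly one sixth of the parameters of the unfactorized one, and the shared encoder no longer grows with the number of layers.</p>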
        <p>In this task, we use the ALBERT model to get good performance. Our model is shown in
Figure 3.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Max Ensemble</title>
        <p>In this paper, we aim to make full use of the ALBERT model through better fine-tuning
strategies, so as to achieve the best task performance. In practice, the performance of a fine-tuned
ALBERT model is usually sensitive to the random seed and the order of the training data. To
alleviate this, ensemble methods can be used to reduce overfitting and improve
generalization. Ensemble [10] methods are therefore widely used to combine
multiple fine-tuned models, and an ensemble of ALBERT models usually performs better than
a single ALBERT model.</p>
        <p>A common ensemble method is based on voting [11]. In this paper, we
fine-tune multiple ALBERT models with different random seeds. For each input, each fine-tuned
ALBERT model outputs its prediction and the associated probability, and we sum the
predicted probabilities over the models. The output of the ensemble model is the prediction with
the highest probability. This ensemble method is called the max ensemble. The formula for the
max ensemble [12] that we use is
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math><![CDATA[\mathrm{ALBERT}_{\max}(x; \theta) = \max\left(\sum_{k=1}^{K} \mathrm{ALBERT}(x; \theta_k)\right)]]></tex-math>
          </disp-formula>
where ALBERT(x; θ<sub>k</sub>) represents the k-th fine-tuned ALBERT model and K is the number of fine-tuned models.</p>
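        <p>One way to read the max ensemble described above is sketched below: the class probabilities of the fine-tuned models are summed and the label with the largest total is returned. The function name max_ensemble and the probability values are hypothetical, and summing before taking the maximum is an interpretation of the description rather than a confirmed detail of the submitted system.</p>

```python
from typing import List, Sequence, Tuple

LABELS = ("NOT", "HOF")


def max_ensemble(model_probs: Sequence[Sequence[float]]) -> Tuple[str, List[float]]:
    """Sum per-class probabilities over models and return the top label."""
    num_classes = len(LABELS)
    summed = [0.0] * num_classes
    for probs in model_probs:
        if len(probs) != num_classes:
            raise ValueError("each model must give one probability per class")
        for i, p in enumerate(probs):
            summed[i] += p
    # The ensemble prediction is the class with the highest summed probability.
    best = max(range(num_classes), key=lambda i: summed[i])
    return LABELS[best], summed


# Hypothetical outputs of three ALBERT models fine-tuned with different seeds.
probs_per_seed = [[0.60, 0.40], [0.45, 0.55], [0.30, 0.70]]
label, totals = max_ensemble(probs_per_seed)
print(label)  # HOF
```

        <p>In this example the first model alone would predict NOT, but the summed probabilities favor HOF, which shows how the ensemble can smooth out seed-dependent variation.</p>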
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Evaluation</title>
      <p>In this task, we use an ALBERT-based model with self-training and max ensemble. For the ALBERT
model, the main hyper-parameters we focused on are the number of training steps, batch size, warm-up
steps, and learning rate. We tuned these hyper-parameters with reference to the settings used for similar tasks.
Table 1 compares two variants: one in which only the max ensemble is not used, and one that uses
both self-training and the max ensemble.</p>
      <p>From this table, we can see that self-training and the max ensemble
improve the effectiveness of the ALBERT model. For this task, our method therefore achieves good
performance.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this task, our main consideration was how to obtain good task performance, that is, which
methods to adopt to optimize our model. We mainly use the ALBERT
model and, on top of it, adopt self-training and max ensemble.
Our experiments show that these methods improve the performance of the model.</p>
      <p>However, our model's position in the ranking is still not satisfactory. In the future,
we will improve our method by adjusting the model and trying more ensemble methods.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was supported by the Science Foundation of Yunnan Education Department under
Grant 2020Y0011.</p>
      <p>[10] S. Avidan, Ensemble tracking, IEEE Transactions on Pattern Analysis and Machine
Intelligence 29 (2007) 261–271.</p>
      <p>[11] A. Onan, S. Korukoğlu, H. Bulut, A multiobjective weighted voting ensemble classifier
based on differential evolution algorithm for text sentiment classification, Expert Systems
with Applications 62 (2016) 1–16.</p>
      <p>[12] Y. Xu, X. Qiu, L. Zhou, X. Huang, Improving BERT fine-tuning via self-ensemble and
self-distillation, arXiv preprint arXiv:2002.10345 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting the Type and Target of Offensive Posts in Social Media</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1415</fpage>
          -
          <lpage>1420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          ,
          <source>in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
          , CEUR
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karadzhov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>A Large-Scale Weakly Supervised Dataset for Offensive Language Identification</article-title>
          , in: arXiv,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Bakshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <article-title>Opinion mining and sentiment analysis</article-title>
          ,
          <source>in: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>452</fpage>
          -
          <lpage>455</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I. Z.</given-names>
            <surname>Yalniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          ,
          <article-title>Billion-scale semi-supervised learning for image classification</article-title>
          , arXiv preprint arXiv:1905.00546 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rashed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Samih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <article-title>Arabic offensive language on Twitter: Analysis and experiments</article-title>
          , arXiv preprint arXiv:2004.02192 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          , arXiv preprint arXiv:1909.11942 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>