<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TUB at HASOC 2020: Character based LSTM for Hate Speech Detection in Indo-European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salar Mohtaj</string-name>
          <email>salar.mohtaj@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinicius Woloszyn</string-name>
          <email>woloszyn@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Möller</string-name>
          <email>sebastian.moeller@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIRE '20, Forum for Information Retrieval Evaluation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Research Centre for Artificial Intelligence (DFKI)</institution>
          ,
          <addr-line>Projektbüro Berlin, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Quality and Usability Lab, Technische Universität Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This paper presents the TU Berlin team's experiments and results on task 1 of the shared task on hate speech and offensive content identification in Indo-European languages. Hate speech has recently become a serious problem affecting online social media, and large-scale social platforms are investing substantial resources to automatically detect and classify toxic language. The competition evaluates how well different natural language processing models automatically detect hate speech in different languages. Among the state-of-the-art deep learning models used in our experiments, the character-based LSTM achieved the best results at detecting hate speech in tweets.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech detection</kwd>
        <kwd>Offensive Content Identification</kwd>
        <kwd>BERT</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the massive increase in content generated on online social media, there has also been an
increase in hateful and offensive language in online posts. Natural Language Processing (NLP)
makes it possible to automate part or all of the process of detecting toxic language in
user-generated content.</p>
      <p>
        HASOC 2020 at FIRE provides a shared task and a data challenge for multilingual research
on the identification of toxic content. HASOC offers hate speech detection tasks in English,
German and Hindi, organized into two sub-tasks, on annotated tweets from Twitter [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Sub-task A of HASOC 2020 is a binary classification task for identifying hate, offensive and
profane content. The two classes are:
• (NOT) Non Hate-Offensive: these posts do not contain any hate speech, profane or offensive
content.
• (HOF) Hate and Offensive: these posts contain hate, offensive or profane content.</p>
      <p>Sub-task B, on the other hand, is a three-class classification task that discriminates between
hate, profane and offensive posts in order to further classify tweets from sub-task A into
three categories:
• (HATE) Hate speech: posts in this class contain hate speech content.
• (OFFN) Offensive: posts in this class contain offensive content.
• (PRFN) Profane: posts in this class contain profane words.</p>
      <p>The TU Berlin team took part in sub-task A, where state-of-the-art text
classification methods were applied to the tweets to categorize them into one of the aforementioned
classes. This paper describes in detail the methods applied to pre-process and process the
content.</p>
      <p>The paper is organized as follows. Section 2 gives a short description of the
datasets provided for training and testing in the competition. The data preprocessing approaches
and the experiments are described in detail in Section 3. Section 4 contains the
results on the test data as reported by the competition's organizers. Finally, Section 5
summarizes the approaches and the results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>In this section we briefly describe the dataset provided by the task organizers to train
and test models for the task of hate speech detection.</p>
      <p>
        The HASOC dataset is sampled from Twitter and partially from Facebook in English, German
and Hindi languages [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Some statistics of the train and test datasets are presented in Tables
1 and 2, respectively. The content may contain hashtags, emojis, links and usernames that
refer to a user on Twitter or Facebook. Moreover, some samples from the English dataset are
shown in Table 3.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In this section we describe the approaches that have been used in order to pre-process the data
and also the state-of-the-art models that have been trained on the resulting text.</p>
      <sec id="sec-3-3">
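        <table-wrap id="tab-3">
          <label>Table 3</label>
          <caption>
            <p>Samples from the English dataset with their sub-task A labels.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Tweet</th><th>Label</th></tr>
            </thead>
            <tbody>
              <tr><td>@realDonaldTrump TRAITOR #TrumpIsATraitor https://t.co/Ip9XqS0U3c</td><td>HOF</td></tr>
              <tr><td>If that Bengali Jihadi $lu7 catches any illness, no doctor must treat her. She should suffer from non-treatment and face consequences! #DoctorsFightBack #DoctorsUnderOppression #DoctorsProtest</td><td>HOF</td></tr>
              <tr><td>@brianstelter @OANN I went looking for dickheads this weekend and I found you. #douchebag</td><td>HOF</td></tr>
              <tr><td>@HuffPost Is she wearing clothes? #TrumpIsATraitor</td><td>NOT</td></tr>
            </tbody>
          </table>
        </table-wrap>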
        <sec id="sec-3-3-1">
          <title>3.1. Data preprocessing</title>
          <p>
            As mentioned in Section 2, the data provided in the shared task contains hashtags, username
mentions, links and emojis. To clean the data while preserving important information from the
tweets and removing unimportant parts, we follow some of the steps that were used in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] for
the preprocessing. The following preprocessing steps have been applied:
          </p>
          <p>• Username mentions (i.e., terms starting with @) are replaced with the phrase ’username’.
Although the username itself does not contain important information for the task of hate
speech detection, signalling that the tweet contains a username can improve the
overall performance of the classifier.
• Emojis (i.e., smileys) are replaced with a short textual description that expresses the
corresponding emotion, using the demoji package (https://pypi.org/project/demoji/).
• Links are replaced with the phrase ’link’. As with usernames, although individual links do not
contain important information worth keeping, signalling to the model that a tweet includes
links can improve its performance. To this end, all terms starting with
http have been replaced with the term ’link’.
• Multiple white spaces are replaced with a single white space.
• All tokens in the tweets are lower-cased.</p>
          <p>The same preprocessing steps are applied to the train and the test datasets in all three
languages (i.e., English, German and Hindi), so that the model can generalize from the
information in the tweets.</p>
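          <p>The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the team's actual code: the function names, the regular expressions and the order of the steps are assumptions, and the Unicode character name from the standard library is used as a stand-in for the textual emoji descriptions that the demoji package provides.</p>
          <preformat><![CDATA[
```python
import re
import unicodedata

# Assumed patterns: any token starting with "http" is a link,
# and "@" followed by word characters is a username mention.
LINK_RE = re.compile(r"\bhttp\S*")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def describe_symbol(char):
    """Stand-in for demoji: spell out a symbol using its Unicode name."""
    if unicodedata.category(char) == "So":  # "Symbol, other" covers most emojis
        return " " + unicodedata.name(char, "symbol").lower() + " "
    return char


def preprocess(tweet):
    """Apply the preprocessing steps described in the text to one tweet."""
    text = LINK_RE.sub("link", tweet)            # links -> 'link'
    text = MENTION_RE.sub("username", text)      # mentions -> 'username'
    text = "".join(describe_symbol(c) for c in text)  # emojis -> description
    text = SPACE_RE.sub(" ", text).strip()       # collapse white space (strip is an extra)
    return text.lower()                          # lower-case all tokens
```
]]></preformat>
          <p>Under these assumptions, calling preprocess on the first sample tweet of the English data yields a string in which the mention and the link are replaced by the placeholder terms and the hashtag is kept in lower case.</p>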
        </sec>
        <sec id="sec-3-3-2">
          <title>3.2. Models</title>
          <p>Pre-trained contextual language models (e.g., BERT [4] and ELMo [5]) have received a lot of attention
in recent years and achieved strong results in many NLP tasks. They were also used
by some of the participants of HASOC 2019 for the task of hate speech detection [6, 7].</p>
          <p>We used a BERT-based architecture and a character-based LSTM [8] model in our
experiments. For the BERT-based transfer learning approach, we fine-tuned the weights of
the pre-trained models on the data provided for the task of hate speech detection.
We used both bert-base and bert-large from the Hugging Face package with different sets
of hyperparameters for the English tweets. The corresponding German and Hindi
models from the same package were used for the two other languages.</p>
          <p>In addition to the state-of-the-art BERT architecture, a character-based LSTM model was
trained on the training datasets. To this end, we trained a bidirectional two-layer LSTM with an embedding
size of 200 and a hidden layer size of 256.</p>
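          <p>A character-based model consumes sequences of character ids rather than word tokens. The paper does not describe how tweets were turned into such sequences, so the following is only a sketch under stated assumptions: the reserved padding and unknown ids, the fixed maximum length and all function names are hypothetical. The resulting id sequences are what an embedding layer (of size 200 in our setting) would take as input.</p>
          <preformat><![CDATA[
```python
from collections import Counter

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id for characters unseen during training


def build_char_vocab(texts, min_count=1):
    """Map every sufficiently frequent character to an integer id (from 2 upwards)."""
    counts = Counter(ch for text in texts for ch in text)
    chars = sorted(ch for ch, n in counts.items() if n >= min_count)
    return {ch: idx for idx, ch in enumerate(chars, start=2)}


def encode(text, vocab, max_len):
    """Turn a preprocessed tweet into a fixed-length list of character ids."""
    ids = [vocab.get(ch, UNK_ID) for ch in text[:max_len]]  # truncate long tweets
    return ids + [PAD_ID] * (max_len - len(ids))            # pad short tweets
```
]]></preformat>
          <p>The vocabulary would be built on the training tweets only, so that characters appearing solely in the test data map to the unknown id.</p>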
          <p>To measure the performance of the two models on the task of hate speech detection, a part
of the training dataset was held out for testing. Our experiments with different
hyperparameters show that the LSTM model outperforms the BERT-based model in terms of the F1
measure. The LSTM model was therefore submitted as the team's model for the competition.
The results of the model on the test data for the different languages are reported in the next section.</p>
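          <p>Holding out part of the training data can be sketched as below. The size of the held-out part and the way it was drawn are not stated in the paper, so the 20% fraction, the per-class (stratified) sampling, the random seed and the function name are all assumptions made for illustration.</p>
          <preformat><![CDATA[
```python
import random
from collections import defaultdict


def stratified_holdout(samples, labels, holdout_fraction=0.2, seed=13):
    """Split (sample, label) pairs into train and held-out parts, class by class."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, dev_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Keep at least one example per class in the held-out part.
        n_dev = max(1, round(len(indices) * holdout_fraction))
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])

    train = [(samples[i], labels[i]) for i in sorted(train_idx)]
    dev = [(samples[i], labels[i]) for i in sorted(dev_idx)]
    return train, dev
```
]]></preformat>
          <p>Sampling per class keeps the HOF/NOT ratio of the held-out part close to that of the full training set, which matters when one class is rare.</p>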
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section we present the results achieved on the test datasets in all three languages.
In addition to the test datasets published by the organizers, the
models were also evaluated on approximately 15% of a private test dataset. The proposed model for
the English dataset achieved a macro-averaged F1 score of 0.504 and ranked 6th among
35 participating teams. The final results of the top 10 teams on the English data are presented in Table
4.</p>
      <p>The proposed model for the German dataset did not generate any positive (HOF) label
on either the training or the test dataset. A likely reason is the high class imbalance (10%-90% for the
HOF and NOT classes, respectively) in the German data compared to the English and Hindi
datasets. The model achieved an F1 score of 0.427 and ranked 19th among 20 participating
teams in the German hate speech detection task. Finally, the proposed Hindi model achieved an
F1 score of 0.467 and ranked 19th among 24 teams in the Hindi task.</p>
      <p>Table 4. Top 10 teams on the English data (columns: #, team, F1 macro average): IIIT_DWD, CONCORDIA_CIT_TEAM, AI_ML_NIT_Patna, Oreo, MUM, Huiping Shi, TU Berlin, NITP-AI-NLP, JU, HASOCOne.</p>
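      <p>For reference, the macro-averaged F1 score used in the ranking is the unweighted mean of the per-class F1 scores. The sketch below is a plain re-implementation with hypothetical function names, not the organizers' evaluation script. It also illustrates why a classifier that never predicts HOF scores poorly: on a toy set with a 10%-90% class split (an illustrative distribution, not the official test set) its macro F1 is roughly 0.47, because the F1 of the HOF class is zero.</p>
      <preformat><![CDATA[
```python
def f1_for_class(gold, predicted, target):
    """F1 score of a single class, treating `target` as the positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, classes=("HOF", "NOT")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, predicted, c) for c in classes) / len(classes)


# Toy illustration: 10 HOF and 90 NOT tweets, classifier always answers NOT.
gold = ["HOF"] * 10 + ["NOT"] * 90
always_not = ["NOT"] * 100
score = macro_f1(gold, always_not)
```
]]></preformat>
      <p>Because both classes weigh equally in the average, high accuracy on the majority class alone cannot compensate for missing the minority class entirely.</p>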
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future work</title>
      <p>In this paper, we presented our models for task 1 of the shared task on hate speech
and offensive content identification in English, German and Hindi. We used a BERT-based
architecture and a character-based LSTM model to train classifiers that detect offensive
language in tweets. Our experiments show that the LSTM model outperforms the BERT-based
approach. The proposed model achieved the 6th best performance on the English data for task 1.</p>
      <p>The results could be improved by making the training data more balanced, for example by using
upsampling approaches. Exploring BERT-based architectures tailored to
our task is another direction for future work.</p>
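      <p>One simple form of upsampling is random oversampling, where examples of the rarer class are duplicated until all classes are equally frequent. The sketch below illustrates this idea only; it was not part of the submitted system, and the function name and seed are hypothetical.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=13):
    """Duplicate randomly chosen minority-class examples until classes are balanced."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    if not by_label:
        return []

    target = max(len(items) for items in by_label.values())  # size of the largest class
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        items = by_label[label]
        # Draw with replacement to fill the gap to the largest class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced
```
]]></preformat>
      <p>Oversampling should be applied to the training part only, after any held-out data has been set aside, so that duplicated tweets do not leak into the evaluation.</p>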
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the organizers of the HASOC 2020 shared task for organizing the competition
and taking the time to answer our inquiries.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>G. K.</given-names> <surname>Shahi</surname></string-name>,
          <string-name><given-names>A. K.</given-names> <surname>Jaiswal</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Nandini</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Patel</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Majumder</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schäfer</surname></string-name>,
          <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>,
          in: <source>Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation</source>,
          CEUR, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Majumder</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Patel</surname></string-name>,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in indo-european languages</article-title>,
          in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>,
          volume <volume>2517</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>, pp. <fpage>167</fpage>–<lpage>190</lpage>.
          URL: http://ceur-ws.org/Vol-2517/T3-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>B.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Ding</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name>,
          <article-title>Ynu_wb at HASOC 2019: Ordered neurons LSTM with attention for identifying hate speech and offensive language</article-title>,
          in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>,
          volume <volume>2517</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>, pp. <fpage>191</fpage>–<lpage>198</lpage>.
          URL: http://ceur-ws.org/Vol-2517/T3-2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>,
          in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers)</source>,
          Association for Computational Linguistics,
          <year>2019</year>, pp. <fpage>4171</fpage>–<lpage>4186</lpage>.
          URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>M. E.</given-names> <surname>Peters</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Neumann</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Iyyer</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Gardner</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <article-title>Deep contextualized word representations</article-title>,
          in: M. A. Walker, H. Ji, A. Stent (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers)</source>,
          Association for Computational Linguistics,
          <year>2018</year>, pp. <fpage>2227</fpage>–<lpage>2237</lpage>.
          URL: https://doi.org/10.18653/v1/n18-1202. doi:10.18653/v1/n18-1202.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>T.</given-names> <surname>Ranasinghe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zampieri</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hettiarachchi</surname></string-name>,
          <article-title>BRUMS at HASOC 2019: Deep learning models for multilingual hate speech and offensive language identification</article-title>,
          in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>,
          volume <volume>2517</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>, pp. <fpage>199</fpage>–<lpage>207</lpage>.
          URL: http://ceur-ws.org/Vol-2517/T3-3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>S.</given-names> <surname>Mishra</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mishra</surname></string-name>,
          <article-title>3idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in indo-european languages</article-title>,
          in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>,
          volume <volume>2517</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>, pp. <fpage>208</fpage>–<lpage>213</lpage>.
          URL: http://ceur-ws.org/Vol-2517/T3-4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>,
          <article-title>Long short-term memory</article-title>,
          <source>Neural Comput.</source> <volume>9</volume> (<year>1997</year>) <fpage>1735</fpage>–<lpage>1780</lpage>.
          URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>