<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning of Multilingual Hate Speech and Offensive Content Detection System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pankaj Singh</string-name>
          <email>pankajsingh7@iitb.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pushpak Bhattacharyya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>BERT, Multilingual Hate Speech and Ofensive Content Detection, Multi-task learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIRE '20, Forum for Information Retrieval Evaluation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Bombay</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This paper describes our system submitted to HASOC FIRE 2020. The goal of the shared tasks was to detect hate speech and offensive content in three languages, namely Hindi, English, and German. The first subtask was a binary classification of a sentence into hate/offensive and normal. In the second subtask, a more granular classification of hate/offensive sentences was required. So overall there were six subtasks, two per language for three languages. We propose a system that performs all these tasks with a single model by jointly training a multilingual system on a combined corpus of all languages. It is relatively easy to fine-tune one model per task, but this can pose various problems during deployment. These days most online platforms support multiple languages, and it is not practical to deploy one model per language or per task. There are many languages and tasks to cover, and an online system will quickly run into memory and latency issues if multiple models handle the same task for different languages. Our system is capable of handling all subtasks for three languages with a single deep learning model. On the test set, we achieved weighted average F1-scores of 0.62, 0.85, and 0.75 on subtask A and 0.35, 0.51, and 0.43 on subtask B for Hindi, English, and German respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rising popularity of the internet, various online platforms have grown, social media platforms being among them. Currently, multiple social media platforms operate globally across many countries and regions. There is a need for automatic monitoring systems on these platforms to detect antisocial elements such as hate speech and offensive content, which can quickly disrupt the harmony in societies. This problem becomes more challenging when we want a platform to support multiple languages to fulfill the needs and enhance the experience of users from various backgrounds. The solution needs to be scalable across multiple languages and multiple tasks. HASOC 2020 provides a platform to test hate speech and offensive content detection through the competition organized by them. We proposed and evaluated a system capable of supporting multiple languages and performing multiple tasks using a single deep learning model. The need for such systems is becoming more obvious day by day, as many social media platforms have started supporting a large number of languages, and it is very inefficient to have one model per language during deployment: it increases the memory requirement and also adds to the training time.</p>
      <p>
        In HASOC 2019 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a similar shared task competition was organized and received many submissions from researchers across the globe. Although various machine learning models were proposed, transformer-based models seemed to be a popular choice and also the best performing. Many other shared tasks, such as GermEval 2018 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], HatEval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and OffensEval [4],
have been organized to push the research in this area.
      </p>
      <p>In HASOC FIRE 2020 [5], the organizers presented two challenges and provided the training
and test datasets. These two subtasks were for three Indo-European languages: Hindi, English,
and German. The dataset was curated by collecting tweets and manually annotating them
for hate speech and offensive content. The challenge consists of the following two tasks for
each of the three languages mentioned above:
• Sub-task A, identifying hate, offensive, and profane content: this was a
binary classification task where each tweet was required to be classified either as
a normal tweet or as hate speech/offensive.
• Sub-task B, discrimination between hate, profane, and offensive posts: this was
a more granular classification of hate speech and offensive tweets. Each hate
speech or offensive tweet was required to be classified as hate, offensive, or
profane. This subtask was relatively more challenging than the first one due to the increase
in the number of classes and the reduction in dataset size, since only tweets labeled as hate
speech or offensive are relevant for training.</p>
      <p>We propose a joint multitask learning approach to perform all six subtasks in the challenge
using a single deep learning model. We combined the datasets of all three languages and
fine-tuned a multilingual BERT [6] on subtasks A and B together. The results of this system
were competitive, and it reduced the resource requirements during training and inference.
In section 2 we explain our system and training method in detail. In section 3 we report the
performance of our system for the various subtasks on the test dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and method</title>
      <p>In this section, we provide the dataset description and statistics, the deep learning network
architecture, and its joint training on multilingual corpora on both subtasks.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>The organizers collected tweets and annotated them for subtasks A and B in all three
languages. This dataset was made available to the participants along with the dataset of
HASOC 2019. However, we used only the HASOC 2020 dataset to train and evaluate our system.
Since the dataset covered the Hindi, English, and German languages, it contained both Roman
and Devanagari scripts. The Hindi dataset also contained some amount of code-mixing
and transliteration. For subtask A, each tweet was labeled as either Non-Hate-Offensive (NOT)
or Hate and Offensive (HOF), and for subtask B, each HOF-labeled tweet was further
categorized as Hate speech (HATE), Offensive (OFFN), or Profane (PRFN). Table 1 provides
details about the number of tweets per class in the training dataset for the three languages. It
also shows the class imbalance in the dataset, an important issue to tackle while building
robust hate speech and offensive content detection systems. We split the provided dataset into
five folds and performed 5-fold cross-validation.</p>
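        <p>The paper does not name a splitting tool, but a stratified five-fold split of the kind used here can be sketched in plain Python. The toy label list below is hypothetical; only the majority/minority skew echoes the class imbalance discussed above:</p>
        <preformat>
```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Split example indices into k folds while preserving label ratios.
    Returns a list of k index lists, one held-out fold per run."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    folds = [[] for _ in range(k)]
    for lab, idxs in by_label.items():
        rng.shuffle(idxs)                 # shuffle within each class
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)      # deal indices round-robin into folds
    return folds

# Toy binary labels mimicking a NOT-heavy subtask A distribution
labels = ["NOT"] * 70 + ["HOF"] * 30
folds = stratified_kfold(labels, k=5)
for held_out in folds:
    hof = sum(1 for i in held_out if labels[i] == "HOF")
    print(len(held_out), hof)  # each fold: 20 examples, 6 of them HOF
```
        </preformat>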
      </sec>
      <sec id="sec-2-2">
        <title>2.2. System description</title>
        <p>Since the dataset was scraped directly from Twitter and raw tweets contain various
unnecessary features, we processed the text before feeding it to the deep learning model. As a
pre-processing step we performed the following tasks:
• Removed @mentions and RT from the tweets
• Replaced website URLs with the string URL
• Removed the # character from words used as hashtags
• Removed multiple spaces, if present in any sentence</p>
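        <p>The four pre-processing steps above can be sketched with simple regular expressions. This is an illustrative reconstruction, not the exact script used:</p>
        <preformat>
```python
import re

def preprocess_tweet(text):
    """Clean a raw tweet following the four steps listed above (illustrative)."""
    text = re.sub(r"\bRT\b", " ", text)           # drop the retweet marker
    text = re.sub(r"@\w+", " ", text)             # drop @mentions
    text = re.sub(r"https?://\S+", "URL", text)   # replace links with the string URL
    text = re.sub(r"#(\w+)", r"\1", text)         # keep hashtag words, drop the '#'
    return re.sub(r"\s+", " ", text).strip()      # collapse multiple spaces

print(preprocess_tweet("RT @user check this #HateSpeech demo https://t.co/xyz"))
# → check this HateSpeech demo URL
```
        </preformat>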
        <p>As the deep learning model, we chose multilingual BERT [6] and trained it jointly on both
subtasks on a combined corpus of all three languages. The final hidden state vector of the special
token [CLS] is taken as an aggregate representation [7] of the entire tweet, and this 768-dimensional
vector is passed through two different fully connected neural networks. One neural network
ends with a softmax layer having two heads, responsible for subtask A, and the second neural
network ends with a softmax layer having four heads, responsible for subtask B. In subtask B
we performed four-class classification by also considering NOT-labeled tweets for training along with
the HATE, OFFN, and PRFN labels. Figure 1 depicts an overview of the proposed multilingual deep
learning system. We combined the losses from both network heads and back-propagated the
average of these two losses. The entire network was then jointly trained on both subtasks by
gradually unfreezing the layers of the multilingual BERT model.</p>
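        <p>To make the joint objective concrete, the following NumPy sketch computes the averaged two-head loss. The random matrices W_a and W_b are hypothetical stand-ins for the fine-tuned classification heads, and the random vectors stand in for [CLS] outputs of multilingual BERT; the real system back-propagates through the encoder as well:</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 768  # size of BERT's [CLS] representation

# Hypothetical classification heads on top of the encoder
W_a = rng.normal(0.0, 0.02, (HIDDEN, 2))   # subtask A head: NOT vs. HOF
W_b = rng.normal(0.0, 0.02, (HIDDEN, 4))   # subtask B head: NONE/HATE/OFFN/PRFN

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, gold):
    # mean negative log-likelihood of the gold labels
    return -np.log(probs[np.arange(len(gold)), gold] + 1e-12).mean()

# One batch: pretend these are [CLS] vectors from multilingual BERT
cls = rng.normal(0.0, 1.0, (8, HIDDEN))
gold_a = rng.integers(0, 2, 8)   # subtask A labels
gold_b = rng.integers(0, 4, 8)   # subtask B labels

loss_a = cross_entropy(softmax(cls @ W_a), gold_a)
loss_b = cross_entropy(softmax(cls @ W_b), gold_b)
joint_loss = (loss_a + loss_b) / 2.0   # the averaged loss that is back-propagated
print(float(joint_loss))
```
        </preformat>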
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and results</title>
      <p>We did extensive hyperparameter tuning to get the best performance from the system. The deep
learning model was trained on a combined corpus of all three languages and jointly fine-tuned
for both subtasks of each language. Since there was a class imbalance in the training
dataset, we employed a weighted cross-entropy loss, giving more weight to under-represented
classes to counter the imbalance.</p>
      <p>In table 2, we report the performance of our system on the test set provided by the organizers.
Subtask A was a binary classification task with two class labels, NOT and HOF. Subtask B
was a four-class classification task with four labels, NONE, HATE, OFFN, and PRFN. We used
the macro average F1-score as the evaluation metric. We also report accuracy, macro average
precision, and recall for each of the six subtasks. In the leaderboard scores, on average there
was an absolute difference of 0.0356 between the macro average F-score of our system and the top-3
best-performing systems on the individual tasks. Given that we trained a single deep learning model
for all subtasks, this trade-off between F-score and resources (memory and latency) seems
promising. In table 3, we compare the performance of our system with the top-3 performing
systems as per the leaderboard published by the organizers.</p>
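      <p>For reference, the macro average F1-score used above is the unweighted mean of per-class F1 scores; a minimal implementation on a toy gold/prediction pair:</p>
      <preformat>
```python
def macro_f1(gold, pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with the subtask A labels
gold = ["NOT", "NOT", "HOF", "HOF"]
pred = ["NOT", "HOF", "HOF", "HOF"]
print(round(macro_f1(gold, pred), 3))  # → 0.733
```
      </preformat>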
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>With an increase in the diversity and number of users, online platforms have to support multiple
languages. This leads to a demand for language-scalable solutions for hate speech
and offensive content detection. Having one deep learning model per language or per
task is a very inefficient solution in deployment if a platform has to support hundreds of
languages and multiple tasks. We proposed and established the efficacy of a multilingual and
multitask system that can support three languages and perform two tasks for each language.
The performance of our system was very competitive and on par with individual models
fine-tuned for a single task.</p>
      <p>In the future, we would like to expand our system to more languages and increase the number
of tasks it can perform. We will also explore the use of other multilingual transformer models
and do a comparative analysis.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank all the organizers of HASOC, FIRE 2020 for arranging this opportunity to push the
research in multilingual hate speech and offensive content detection. We also express our
gratitude towards them for their continuous support throughout the competition and for being
very accommodating towards the requests from participants.</p>
      <p>[3] (cont.) Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 54–63. URL: https://www.aclweb.org/anthology/S19-2007. doi:10.18653/v1/S19-2007.
[4] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval), in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 75–86. URL: https://www.aclweb.org/anthology/S19-2010. doi:10.18653/v1/S19-2010.
[5] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.
[6] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4996–5001. URL: https://www.aclweb.org/anthology/P19-1493. doi:10.18653/v1/P19-1493.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          , FIRE '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          . URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          ,
          <article-title>Overview of the GermEval 2018 shared task on the identification of offensive language</article-title>
          ,
          <source>Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS</source>
          <year>2018</year>
          ), Vienna, Austria - September 21,
          <year>2018</year>
          , Austrian Academy of Sciences, Vienna, Austria,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . URL: http://nbn-resolving.de/urn:nbn:de:bsz:mh39-84935.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Rangel Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter</article-title>
          ,
          <source>in: Proceedings of the 13th International Workshop on Semantic Evaluation,</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>