<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Application of XLM-RoBERTa for Multi-Class Classification of Conversational Hate Speech</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tebo Leburu-Dingalo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karabo Johannes Ntwaagae</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nkwebi Peace Motlogelwa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Thuma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monkgogi Mudongo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Botswana</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, team University of Botswana Computer Science (UB-CS) investigates the use of XLM-RoBERTa, a multilingual model pre-trained on 100 different languages, for transfer learning in the identification of conversational hate-speech in code-mixed languages. We also investigate whether enriching the tweets with textual sentiments from emojis can help improve classification performance. Our proposed solution outperformed the other teams that participated in HASOC (2022) Task 2, with a macro F1 score of 0.4939. The results suggest that enriching the tweets with textual sentiments and using a pre-trained multilingual model for transfer learning can help in the identification of conversational hate-speech in code-mixed languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech</kwd>
        <kwd>XLM-RoBERTa</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        it is supporting an offensive preceding or parent message. Furthermore, messages are often
expressed using a mix of languages, a property that needs to be factored into the development of
hate and offensive content detection systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To address this challenge,
HASOC 2022 Task 2: Identification of Conversational Hate-Speech in Code-Mixed Languages
(ICHCL) - Multiclass Classification encourages the development of systems capable of detecting
offensive or hateful content in tweets by looking at the context of the parent content [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. In
particular, systems should be able to identify posts that are hateful or offensive as well as
those that support the dissemination of hateful and offensive content. In this paper we attempt
to address the problem through the use of a transformer model, XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which has
proved effective in multilingual text classification tasks. We fine-tune the model on the
provided dataset. To improve model performance for the task, we focus on
enhancing the tweets through data cleaning and text augmentation. To this end, we pre-process
the tweets and convert emojis, which make up a sizeable part of the tweets, to text. Our approach
is based on the intuition that emojis can express the actual emotion felt by the user when typing a
post, regardless of the rhetoric expressed in the tweet. Therefore, we theorize that augmenting
tweets with emoji descriptions will enhance model performance, as they give a better reflection
of the sentiment and type of language used in the tweet.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Evolution of the HASOC Shared Task</title>
      <p>
        The Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC)<sup>1</sup>
shared task started in 2019 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], inspired by two evaluation forums, OffensEval<sup>2</sup> [7]
and GermEval<sup>3</sup> [8]. The objective of the HASOC task was to develop data, hate
speech detection technology and evaluation resources for several Indo-European languages.
The HASOC (2019) shared task offered three sub-tasks. The first (Sub-task A), offered
in three languages (English, German and Hindi), was a binary classification task in which
participants were required to classify tweets into Hate and Offensive (HOF) and Non-Hate and
Offensive (NOT) classes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In Sub-task B, posts labelled as HOF in Sub-task A were further classified into
three classes, namely Hate speech (HATE), Offensive (OFFN) and Profane (PRFN). In Sub-task
C, only posts labelled as HOF were included and participants were required to identify the type
of offence [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The two types of offence were Targeted Insult (TIN) and Untargeted (UNT).
The HASOC (2020) shared task did not differ much from that of the preceding year.
In particular, Sub-tasks A and B were made multilingual by joining the English, German and
Hindi datasets in order to promote research on multilingual techniques [9].
      </p>
      <p>
        A new task was introduced in HASOC (2021) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and HASOC (2022)<sup>4</sup>, in which participants were
required to identify, from a conversational thread, whether a parent tweet, comment or reply was
Standalone Hate (SHOF), Contextual Hate (CHOF) or Non-Hate (NONE). This was motivated
by the fact that a majority of messages on social networking sites form part of a conversational
thread. Such conversational threads can contain hate and offensive content which may not be
visible from a single comment or reply but can be determined if parent content is considered.
The aim of the task is thus to detect posts that are hateful or offensive on their own, and those
that support hate or offensive content of their parent posts. Hence the task defines three classes
for the identification of hate and offensive language in posts as follows:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>(SHOF) Standalone Hate - This tweet, comment, or reply contains hate, offensive, and
profane content in itself.</p>
        </list-item>
        <list-item>
          <p>(CHOF) Contextual Hate - Comment or reply is supporting the hate, offence and profanity
expressed in its parent. This includes affirming the hate with positive sentiment and
having apparent hate.</p>
        </list-item>
        <list-item>
          <p>(NONE) Non-Hate - This tweet, comment, or reply does not contain hate, offensive, and
profane content in itself.</p>
        </list-item>
      </list>
      <fn-group>
        <fn id="fn1">
          <label>1</label>
          <p>https://hasocfire.github.io/hasoc/2019/call_for_participation.html</p>
        </fn>
        <fn id="fn2">
          <label>2</label>
          <p>https://competitions.codalab.org/competitions/20011</p>
        </fn>
        <fn id="fn3">
          <label>3</label>
          <p>https://projects.fzai.h-da.de/iggsa/</p>
        </fn>
        <fn id="fn4">
          <label>4</label>
          <p>https://hasocfire.github.io/hasoc/2022/index.html</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>3.1. Training and Validation Dataset</title>
        <p>The dataset for Task 2: Identification of Conversational Hate-Speech in Code-Mixed Languages
(ICHCL) comprises Twitter posts, comments and replies to each comment, based on
controversial stories from different topics including Temple-Mosque Controversy, Taliban and
Covid Controversy. The tweets use a mix of the English and Hindi languages, referred to as
Hinglish. The statistics of the dataset are shown in Table 1. This data was randomly split into 80%
training data and 20% validation data.</p>
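        <p>A random 80/20 split of this kind can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name split_dataset, the fixed seed and the toy records are assumptions made for the example and are not taken from the experimental code.</p>
        <preformat><![CDATA[
```python
import random


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle records and split them into (train, validation) lists.

    The seed is an arbitrary value chosen here so the split is repeatable.
    """
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy stand-in for the labelled threads; real records would hold text and label.
    toy = [{"id": i, "label": "NONE"} for i in range(10)]
    train, valid = split_dataset(toy)
    print(len(train), len(valid))
```
]]></preformat>
        <p>Fixing the seed keeps the validation set identical across runs, which makes runs with different settings easier to compare.</p>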
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pre-processing</title>
        <p>The tweets were first concatenated to create conversational threads comprising parent tweets
and comments, as well as parent tweets, comments and replies where available. A manual
exploration of the training data indicated that the tweets contained many special characters,
URLs and emojis. We performed data cleaning by removing URLs, stopwords, extra spaces and
newlines. We retained the emojis, however, and expanded them to text using the emoji library<sup>5</sup> to
augment the tweets.</p>
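        <p>The steps above can be sketched roughly as follows. To keep the example dependency-free it does not call the emoji library (whose demojize function would normally perform the expansion); instead it substitutes Unicode character names from the Python standard library, treating characters of category "So" as a rough proxy for emojis. That substitution, the tiny stopword list, the URL pattern and all function names are assumptions of this sketch, not the procedure used in the experiments.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
VARIATION_SELECTOR = "\ufe0f"  # often trails an emoji; carries no meaning of its own


def emoji_to_text(text, expand=True):
    """Expand or drop symbol characters, used here as a rough proxy for emojis."""
    pieces = []
    for ch in text:
        if ch == VARIATION_SELECTOR:
            continue
        if unicodedata.category(ch) == "So":
            # With expand=False the symbol is simply replaced by whitespace.
            name = unicodedata.name(ch, "") if expand else ""
            pieces.append(" " + name.lower() + " ")
        else:
            pieces.append(ch)
    return "".join(pieces)


def clean_tweet(text, expand_emojis=True):
    """Remove URLs, stopwords, extra spaces and newlines from one tweet."""
    text = URL_PATTERN.sub(" ", text)
    text = emoji_to_text(text, expand=expand_emojis)
    # split() with no argument also collapses newlines and repeated spaces.
    tokens = [tok for tok in text.split() if tok.lower() not in STOPWORDS]
    return " ".join(tokens)


def build_thread(parent, comment=None, reply=None, expand_emojis=True):
    """Concatenate the cleaned parts of a conversation that are available."""
    parts = [part for part in (parent, comment, reply) if part]
    return " ".join(clean_tweet(part, expand_emojis) for part in parts)


if __name__ == "__main__":
    print(build_thread("Parent tweet https://t.co/x", "A comment \U0001F600"))
```
]]></preformat>
        <p>Calling build_thread with expand_emojis=False yields the emoji-free variant of a thread, so the same helper can prepare both the plain and the augmented inputs.</p>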
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Selection of Model Parameters</title>
        <p>In our empirical investigation we use the SimpleTransformers<sup>6</sup> library, which is built on the HuggingFace<sup>7</sup> Transformers library
and provides task-specific models. In particular, we use its
classification model, ClassificationModel, which fine-tunes a pre-trained model for
binary and multi-class classification. The model used is based on the HuggingFace implementation
of XLM-RoBERTa, a transformer-based multilingual model pre-trained on CommonCrawl data
covering 100 languages. XLM-RoBERTa is based on the BERT architecture and has a total
of 12 layers for learning different semantic information, with a classification layer built on top.
Since we consider the influence of emojis in our experiments, we first trained the model with
emojis omitted from the tweets, using a learning rate of 1e-5 at 3 and 5 epochs respectively.
We then experimented with the augmented tweets, again at a learning rate of 1e-5 at 3 and 5
epochs. All models used the AdamW optimizer. Based on the results in Table 2, we chose
the parameters used in Run 4 (enhanced tweets) for our submission to Task 2: Identification
of Conversational Hate-Speech in Code-Mixed Languages (ICHCL).</p>
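        <p>The four configurations described above (with and without emoji text, at 3 and 5 epochs) could be enumerated and trained along the following lines. This is a hedged sketch: the configuration names, the label order and the helper functions are hypothetical and need not correspond to the run numbering in Table 2, and train_run assumes that the third-party simpletransformers package is installed and that the training data is a pandas DataFrame with "text" and "labels" columns.</p>
        <preformat><![CDATA[
```python
from itertools import product

LABELS = ["NONE", "SHOF", "CHOF"]  # label order is an assumption of this sketch
LEARNING_RATE = 1e-5


def build_run_grid():
    """Enumerate emoji/no-emoji inputs crossed with 3 and 5 training epochs."""
    runs = []
    for use_emojis, epochs in product([False, True], [3, 5]):
        runs.append({
            "name": ("emoji" if use_emojis else "plain") + "-" + str(epochs) + "ep",
            "emoji_text": use_emojis,
            "num_train_epochs": epochs,
            "learning_rate": LEARNING_RATE,
            "optimizer": "AdamW",
        })
    return runs


def train_run(config, train_df, output_dir, use_cuda=True):
    """Fine-tune xlm-roberta-base for one configuration.

    Imported lazily so that the grid above can be inspected without
    simpletransformers being installed.
    """
    from simpletransformers.classification import ClassificationArgs, ClassificationModel

    args = ClassificationArgs(
        num_train_epochs=config["num_train_epochs"],
        learning_rate=config["learning_rate"],
        optimizer=config["optimizer"],
        output_dir=output_dir,
        overwrite_output_dir=True,
    )
    model = ClassificationModel(
        "xlmroberta", "xlm-roberta-base",
        num_labels=len(LABELS), args=args, use_cuda=use_cuda,
    )
    model.train_model(train_df)
    return model


if __name__ == "__main__":
    for run in build_run_grid():
        print(run["name"], run["num_train_epochs"], run["learning_rate"])
```
]]></preformat>
        <p>Keeping the learning rate and optimizer fixed across the grid means that any difference between runs can be attributed to the emoji text or the number of epochs.</p>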
        <fn-group>
          <fn id="fn6">
            <label>6</label>
            <p>https://github.com/ThilinaRajapakse/simpletransformers</p>
          </fn>
          <fn id="fn7">
            <label>7</label>
            <p>https://huggingface.co/xlm-roberta-base</p>
          </fn>
        </fn-group>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>Table 3 shows the leaderboard of the HASOC (2022) Task 2: Identification of Conversational
Hate-Speech in Code-Mixed Languages (ICHCL). Our team, UB-CS, denoted by †,
outperformed the other teams with a macro F1 score of 0.4939. The results suggest that using a multilingual model pre-trained on several
languages can improve the identification of conversational hate speech in code-mixed languages
(Hinglish, i.e. Hindi-English). In addition, the results suggest that performance can be further
improved by enriching the tweets with textual sentiments generated from emojis.</p>
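      <p>For reference, the macro F1 score reported in the abstract is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of how many posts it contains. A minimal sketch of the computation is given below; the function name and the toy labels are illustrative assumptions and are not task data.</p>
      <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold = ["SHOF", "SHOF", "CHOF", "NONE", "NONE", "NONE"]
    pred = ["SHOF", "NONE", "CHOF", "NONE", "NONE", "CHOF"]
    print(round(macro_f1(gold, pred, ["SHOF", "CHOF", "NONE"]), 4))
```
]]></preformat>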
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>The results of our investigation suggest that enriching the tweets with textual sentiments and
using a pre-trained multilingual model for transfer learning can help in the identification of
conversational hate-speech in code-mixed languages. A natural progression of this work is
to analyse whether state-of-the-art performance can be attained by using an ensemble of
several pre-trained multilingual models for transfer learning.</p>
      <ref-list>
        <ref>
          <mixed-citation>[6] (continued) FIRE ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 14–17. URL: https://doi.org/10.1145/3368567.3368584.</mixed-citation>
        </ref>
        <ref id="ref7">
          <mixed-citation>[7] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval), in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 75–86. URL: https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>[8] M. Wiegand, M. Siegel, Overview of the GermEval 2018 shared task on the identification of offensive language, 2018.</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>[9] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: Forum for Information Retrieval Evaluation, FIRE 2020, Association for Computing Machinery, New York, NY, USA, 2020, p. 29–32. URL: https://doi.org/10.1145/3441501.3441517. doi:10.1145/3441501.3441517.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>O'Regan</surname>
          </string-name>
          ,
          <article-title>Hate Speech Online: an (Intractable) Contemporary Challenge?</article-title>
          ,
          <source>Current Legal Problems</source>
          <volume>71</volume>
          (
          <year>2018</year>
          )
          <fpage>403</fpage>
          -
          <lpage>429</lpage>
          . URL: https://doi.org/10.1093/clp/cuy012.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation, FIRE 2021</source>
          , Association for Computing Machinery
          , New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          . URL: https://doi.org/10.1145/3503162.3503176.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , K. North,
          <string-name>
            <given-names>D.</given-names>
            <surname>Premasiri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2022: Hate speech and offensive content identification in English and Indo-Aryan languages</article-title>
          ,
          <source>in: FIRE 2022: Forum for Information Retrieval Evaluation, Virtual Event, 9th-13th December 2022</source>
          , ACM,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2022: Identification of conversational hate-speech in Hindi-English code-mixed and German language</article-title>
          ,
          <source>in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation</source>
          , CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>