<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AI_ML_NIT_Patna @ HASOC 2020: BERT Models for Hate Speech Identification in Indo-European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kirti Kumari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jyoti Prakash Singh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Technical Education and Research (ITER)</institution>
          ,
          <addr-line>Siksha 'O' Anusandhan, Bhubaneswar, Odisha</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Technology Patna</institution>
          ,
          <addr-line>Bihar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>The current paper describes the system submitted by team AI_ML_NIT_Patna. The task aims to identify offensive language in a code-mixed dataset of comments in Indo-European languages (English, German, and Hindi) collected from Twitter. We participated in both Sub-task A, which aims to classify comments into two classes, namely Hate and Offensive (HOF) and Non-Hate and Offensive (NOT), and Sub-task B, which aims to discriminate between hate (HATE), profane (PRFN), and offensive (OFFN) comments. To address these tasks, we fine-tuned pre-trained multi-lingual transformer-based (BERT) neural network models, which resulted in better performance on the validation and test sets. Our model achieved a weighted F1-score of 0.88 for the English language in Sub-task A on the test dataset and ranked 3rd on the leaderboard private test data with a macro-averaged F1 of 0.5078.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-lingual Text</kwd>
        <kwd>Hate Speech</kwd>
        <kwd>Abusive Language</kwd>
        <kwd>HASOC</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the increased use of the Internet for propagating the thoughts and expressions
of individuals, there has been a tremendous increase in the spread of online hate speech
[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], cyberbullying [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and cyber-aggression [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Such phenomena have significantly
affected the daily life of people and may even result in depression or suicide. It has
become very important to track this behaviour in order to flag activity that is hateful,
offensive, or that promotes violence towards an individual or group based on attributes such as
race, religion, gender, or sexual orientation, to ensure that the Internet remains an open place
and to foster diversity of information, opinion, and innovation. This is a very challenging task
for several reasons, such as the multi-linguality, multi-modality, and non-standard writing style of
social media posts.
      </p>
      <p>The Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC)
shared tasks of 2019 and 2020 focused on three different Indo-European
languages: English, German, and Hindi. The shared tasks have two sub-tasks, Sub-task A and
Sub-task B, focused on multi-lingual posts. They therefore gave us an opportunity to address the
multi-lingual issues associated with social media posts. To do so, we applied
pre-trained transformer-based neural network (BERT) models, which are publicly available in
multiple languages and support fine-tuning for specific tasks. In addition, their
multi-lingual nature allows us to analyse comments that mix words and sentences from multiple
languages, for example Hinglish (a combination of English and
Hindi commonly used as an expression medium on Indian social media). We participated in
both sub-tasks and all three languages. We tried several neural network architectures
but found that our BERT model performed better than the other models. We achieved 3rd rank
for English Sub-task A.</p>
      <p>The paper is organized as follows: previous work in this area is discussed
in Section 2, and the data are described in Section 3. The models and their
training are presented in Section 4 and Section 5, respectively. The results of the various
experiments are explained in Section 6, and Section 7 summarizes the most important findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Many researchers and practitioners from industry and academia have been attracted to
the problem of automatically identifying hate speech, cyberbullying, cyber-aggression, and
offensive language.</p>
      <p>
        The evolving definitions of cyberbullying, cyber-aggression, and hate speech, and their
undoubted potential for social impact, are addressed in [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref6">1, 2, 3, 6</xref>
        ], particularly in online
forums and social networking sites. Mishra and Mishra [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a BERT model for the
identification of hate, offensive, and profane comments in multi-lingual English, Hindi, and German
collected from the Facebook and Twitter social media platforms. Mujadia et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used a hybrid of
different machine learning and neural network based models for hate, offensive, and profane
content identification in multi-lingual comments in English, Hindi, and German collected
from Facebook and Twitter. They found that word and character TF-IDF features with an
ensemble of SVM, Random Forest, and AdaBoost classifiers using majority voting performed
better than the other models. Wang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] used an LSTM with attention to identify hate,
offensive, and profane English comments from Facebook and Twitter and found that
a k-fold ensemble method performed better than the other methods. Kumari and Singh [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
introduced a four-layer CNN model with three embedding techniques for detecting hate
speech and offensive content in the multi-lingual HASOC 2019 text comments collected from
Facebook and Twitter.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The data supplied by the organizing team are posts collected from Twitter, written
in three languages: English, German, and Hindi. The competition has two
sub-tasks for each language: Sub-task A and Sub-task B. Sub-task A consists of labelling a post
Hate and Offensive (HOF) if its content contains any hate speech; otherwise the
label is Not Hate-Offensive (NOT). Sub-task B is more fine-grained and
discriminates between hate, profane, and offensive posts with the labels HATE,
PRFN, and OFFN. We evaluated on the testing dataset released by the organizers. The details of the
collection and labelling of the datasets are discussed in [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Table 1 presents the
description of the training dataset, giving the number of samples in each class.
      </p>
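      <p>To make the labelling scheme above concrete, the following minimal Python sketch maps the class labels of the two sub-tasks to integer ids; the class names come from the task definitions, but the integer encoding itself is our own illustrative choice, not part of the official data format:</p>

```python
# Class labels from the HASOC sub-task definitions; the integer ids are an
# arbitrary illustrative encoding, not specified by the organizers.
SUBTASK_A = {"NOT": 0, "HOF": 1}               # binary sub-task
SUBTASK_B = {"HATE": 0, "OFFN": 1, "PRFN": 2}  # fine-grained sub-task

def encode_labels(labels, mapping):
    """Map string class labels to integer ids for model training."""
    return [mapping[label] for label in labels]

print(encode_labels(["HOF", "NOT", "HOF"], SUBTASK_A))  # [1, 0, 1]
```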
    </sec>
    <sec id="sec-4">
      <title>4. Models</title>
      <p>We used BERT as our pre-trained language model because of its recent success as well as its
availability in multiple languages. We fine-tuned the pre-trained model on our training
dataset and used it to generate our submissions.</p>
      <p>Specifically, we used BERT BASE UNCASED, a transformer model
pre-trained on a large corpus of data in a self-supervised fashion. It has a hidden size of 768, with
12 transformer layers, 12 attention heads, and 110 million parameters. Sentences are tokenized
by BertTokenizerFast, and the tokens are converted into sequences padded to a length of 256.
Two tokens are added to each input text (CLS and SEP), marking the
beginning and the end of the sequence. These sequences are passed into the model, loaded as the
BERT BASE UNCASED checkpoint from TFBertModel, which contains both the vocabulary and the
pre-trained weights. We used the Adam optimizer, with categorical cross-entropy as the loss
function for Sub-task B and binary cross-entropy for Sub-task A.</p>
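      <p>The preprocessing step described above (adding the CLS/SEP markers and padding to a fixed length of 256) can be sketched in plain Python. The toy whitespace tokenizer below only stands in for BertTokenizerFast, which uses WordPiece sub-words and integer vocabulary ids; it is illustrative, not the actual pipeline:</p>

```python
MAX_LEN = 256  # fixed sequence length used in the paper
PAD, CLS, SEP = "[PAD]", "[CLS]", "[SEP]"

def to_input_sequence(text, max_len=MAX_LEN):
    """Toy stand-in for BertTokenizerFast: whitespace-tokenize, add the
    [CLS]/[SEP] markers, truncate, and pad to a fixed length.  Also returns
    an attention mask (1 = real token, 0 = padding)."""
    tokens = [CLS] + text.lower().split()[: max_len - 2] + [SEP]
    attention_mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens += [PAD] * (max_len - len(tokens))
    return tokens, attention_mask

tokens, mask = to_input_sequence("Keep the Internet an open place")
print(tokens[:3], sum(mask), len(tokens))  # ['[CLS]', 'keep', 'the'] 8 256
```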
    </sec>
    <sec id="sec-5">
      <title>5. Training</title>
      <p>We trained the BERT model on the English, German, and Hindi datasets of HASOC 2020 for
Sub-task A. In addition to the HASOC 2020 datasets, we also used the HASOC 2019 datasets to train
our models for Sub-task B. Training uses a batch size of 64 and about 20 epochs,
with Adam at a learning rate of 5e-05. For training, we utilized 80% of the samples from the
training data provided by the organizing team, and the remaining 20% were used for validation. After
training the model, we evaluated it on the provided testing data. We experimented
with different numbers of epochs and analysed the trained model after each epoch. We
observed that, on average, the training accuracy stopped improving after 20 epochs and
began to oscillate, with its highest value at 20 epochs. The
final leaderboard is calculated by the organizing team on approximately 15% of the private test
data, with the F1 Macro average as the deciding metric, which can be read off
the classification report.</p>
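      <p>The 80/20 train/validation split described above can be sketched as follows; the shuffle, the fixed seed, and the function name are our own illustrative choices (the paper does not state how the split was drawn):</p>

```python
import random

def train_val_split(samples, val_fraction=0.20, seed=42):
    """Shuffle the samples and split them into training and validation
    portions, mirroring the 80/20 split used in the paper.  The seed and
    shuffling strategy are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

train, val = train_val_split(list(range(100)))
print(len(train), len(val))  # 80 20
```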
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>In this competition, teams are ranked by F1 Macro average on the final
leaderboard. Table 2 presents our results on the testing samples provided by the
organizing team in terms of weighted F1-score and F1 Macro average.
Table 3 presents the final leaderboard results for all the languages and their respective
sub-tasks. According to the official results for Sub-task A, the best Macro F1 scores of our model
are 0.5078, 0.4768, and 0.4561 for English, German, and Hindi, respectively, and we ranked
3rd for English Sub-task A.</p>
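      <p>For reference, the ranking metric above is the standard macro-averaged F1: per-class F1 scores averaged with equal weight, so minority classes count as much as majority ones. The following standalone sketch is our own implementation of that standard definition, not the organizers' evaluation code:</p>

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute precision, recall, and F1 per class,
    then average the per-class F1 scores with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1(["HOF", "NOT", "HOF", "NOT"],
                     ["HOF", "HOF", "HOF", "NOT"]), 3))  # 0.733
```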
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we presented our approach based on fine-tuning monolingual and multi-lingual
transformer networks (BERT models) to classify Twitter posts in three Indo-European
languages (English, German, and Hindi) for hate-speech and offensive content identification.
Using transfer learning with the pre-trained BERT model on English Sub-task A, we achieved
rank 3 on the leaderboard private test data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <article-title>A survey on hate speech detection using natural language processing</article-title>
          ,
          <source>in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <article-title>Towards Cyberbullying-free social media in smart cities: a unified multi-modal approach</article-title>
          ,
          <source>Soft Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>11059</fpage>
          -
          <lpage>11070</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <article-title>Aggressive Social Media Post Detection System Containing Symbolic Images</article-title>
          , in: Conference on e-Business, e-Services and eSociety, Springer,
          <year>2019</year>
          , pp.
          <fpage>415</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>AI_ML_NIT_Patna@TRAC-2: Deep Learning Approach for Multilingual Aggression Identification</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Identification of cyberbullying on multi-modal social media posts using genetic algorithm</article-title>
          ,
          <source>Transactions on Emerging Telecommunications Technologies</source>
          (
          <year>2020</year>
          ) e3907. doi:10.1002/ett.3907.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          , 3Idiots at HASOC 2019:
          <article-title>Fine-tuning Transformer Neural Networks for Hate Speech Identification in Indo-European Languages</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mujadia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , IIIT-Hyderabad at HASOC 2019:
          <article-title>Hate Speech Detection</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , YNU_Wb at HASOC 2019:
          <article-title>Ordered Neurons LSTM with Attention for Identifying Hate Speech and Ofensive Language</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>AI_ML_NIT_Patna at HASOC 2019: Deep Learning Approach for Identification of Abusive Content</article-title>
          ,
          <source>in: Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation (FIRE 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          ,
          <source>in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>