<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Astralis@HASOC 2020: Analysis on Identification of Hate Speech in Indo-European Languages with Fine-Tuned Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hiren Madhu</string-name>
          <email>hirenmadhu16@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrey Satapara</string-name>
          <email>shreysatapara@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harsh Rathod</string-name>
          <email>harshrathod6874@gmail.com</email>
        </contrib>
        <aff>LDRP-ITR, Gandhinagar, India</aff>
      </contrib-group>
      <abstract>
        <p>The detection of hate speech on online social media platforms is an important text classification problem, and there is a need for research on languages other than English. In this paper, we describe our team Astralis' combined effort in the shared task HASOC. We analyzed various models such as Naive Bayes, SVM, ANN, and CNN, and embeddings such as TF-IDF, Multilingual BERT, and OpenAI GPT-2. Our relative performance was better in Subtask B for all languages, with our best-performing system ranked second in the German Subtask B.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The HASOC 2020 track [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] offers two sub-tasks on hate speech and offensive content identification in English, German, and Hindi.
Sub-task A is a coarse-grained binary classification in which posts are classified into two categories.</p>
      <p>● (NOT) Non Hate-Offensive: This division consists of posts that contain neither hate speech nor offensive content.</p>
      <p>● (HOF) Hate and Offensive: This division consists of hate and offensive content.</p>
      <p>Sub-task B is a fine-grained classification offered for English, German, and Hindi. Hate speech and
offensive posts from Sub-task A are further classified into three categories.</p>
      <p>● (HATE) Hate speech: This class contains hate speech content.</p>
      <p>● (OFFN) Offensive: Posts under this class contain offensive content.</p>
      <p>● (PRFN) Profane: This subcategory contains profane content.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Hate speech detection is a vast field of research and attracts many researchers. Here we briefly describe some of
the work done in this area.</p>
      <p>
        GermEval 2018 presented a shared task on identifying offensive language. Performance was measured by
F1 score, precision, and recall [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Twenty teams participated in the shared task. The best results were achieved by using five disjoint
sets to train three different classifiers and then combining them into a meta-level classifier [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
SemEval 2019 focused on studying the type and target of offensive language. They presented a
shared task called OffensEval [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A schema was defined to take the class and target into account. The
dataset used was the Offensive Language Identification Dataset (OLID). Three sub-tasks were given according to this
annotation schema, which the participating teams had to use: Sub-task A was offensive language
identification, Sub-task B was automatic categorization of offence types, and Sub-task C was offence target
identification.
      </p>
      <p>
        Work on text classification using CNNs was done by Yoon Kim [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The CNN models discussed therein
improve upon the state of the art on 4 out of 7 tasks, including sentiment analysis and question
classification. The word vectors used were trained by Mikolov et al. (2013) on 100 billion words of Google
News and are publicly available [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Nobata et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] came up with a model that uses regression techniques to detect hate and offensive
content in online user text. Djuric et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] presented a model that used logistic regression classifiers to identify hate
content. Besides conventional techniques, this research also used comment embeddings as one of
its features.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>Below we briefly describe the dataset used for HASOC 2020. Given below is the class-wise
distribution of the dataset provided to us during this task.</p>
      <table-wrap id="tab-1">
        <table>
          <thead>
            <tr>
              <th>Class</th>
              <th>English</th>
              <th>Hindi</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>HOF</td>
              <td>1856</td>
              <td>847</td>
            </tr>
            <tr>
              <td>NOT</td>
              <td>1852</td>
              <td>2116</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>Here we describe the various methodologies we used in different steps of the experiment.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Preprocessing</title>
      <p>
        For preprocessing, we followed a few simple traditional steps. First, all of the Twitter handles
were removed. After that, the links in the tweets were removed. All the retweets in the data had
“RT” at the start, so we removed that as well. Then we removed all the residual blanks and kept the emojis,
since the transformer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] models we used had emoji support.
      </p>
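      <p>As an illustration, a minimal sketch of these preprocessing steps in Python is given below. This is a hypothetical reconstruction, not our exact code; the regular expressions are assumptions matching the steps described above.</p>
      <preformat>
# Hypothetical sketch of the preprocessing steps described above.
import re

def preprocess(tweet):
    tweet = re.sub(r"@\w+", "", tweet)          # remove Twitter handles
    tweet = re.sub(r"https?://\S+", "", tweet)  # remove links
    tweet = re.sub(r"^\s*RT\b", "", tweet)      # drop the leading retweet marker
    tweet = re.sub(r"\s+", " ", tweet).strip()  # remove residual blanks
    return tweet                                # emojis are left intact

print(preprocess("RT @user Check this out https://t.co/xyz 😊"))
# -> "Check this out 😊"
      </preformat>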
    </sec>
    <sec id="sec-6">
      <title>4.2. Embeddings</title>
      <p>Here we describe the methodology that we used to analyze the two subtasks that were given
to us. We used various models to obtain the best possible results.</p>
      <p>4.2.1. TF-IDF</p>
      <p>
        TF-IDF is short for term frequency–inverse document frequency. Unigrams and bigrams were
used to create this vectorizer, with a minimum of 5 occurrences, for English. The length of the
derived vectors was 1281.
      </p>
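      <p>A sketch of this vectorizer, assuming scikit-learn, is given below; the ngram_range and min_df settings follow the text, while the input variable english_tweets is hypothetical.</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams, with a minimum of 5 occurrences per term.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(english_tweets)  # list of preprocessed tweets
# On this dataset the resulting vocabulary gave vectors of length 1281.
      </preformat>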
      <p>4.2.2. BERT[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
      </p>
      <p>
        BERT stands for Bidirectional Encoder Representations from Transformers. The Hugging
Face [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] transformers library has made many transformer models available for use. From BERT,
we used two models: bert-base-uncased for English and bert-base-multilingual-uncased for English,
Hindi, and German. Matrices of shape 768 * MAX_LEN were received from this. MAX_LEN is a
parameter that denotes the maximum length of the tokenized tweets. Post-padding with ‘0’ was used.
      </p>
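      <p>A minimal sketch of obtaining these matrices with the Hugging Face transformers library is given below. The value of MAX_LEN and the variable tweet are assumptions; the post-padding behaviour follows the text.</p>
      <preformat>
import torch
from transformers import BertModel, BertTokenizer

MAX_LEN = 64  # assumed value; the text only names the parameter
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize one tweet, post-padding with 0 up to MAX_LEN.
enc = tokenizer(tweet, max_length=MAX_LEN, padding="max_length",
                truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
embeddings = out.last_hidden_state[0]  # shape: (MAX_LEN, 768)
      </preformat>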
      <p>4.2.3. OPENAI GPT-2[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
      </p>
      <p>Matrices shaped similarly to those from BERT were received from this transformer model for Hindi,
English, and German.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Models</title>
      <p>
        We have used various machine learning algorithms and deep neural networks, and here we
describe them in detail. We used TensorFlow[18] with Keras[19] to build all the neural networks.
      </p>
      <p>4.3.1. SVM[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
      </p>
      <p>SVM performs exceptionally well in specific NLP scenarios. To implement it, we used the
scikit-learn[14] library. We used SVC with an RBF kernel. To use it with the matrices that we
got from the transformers, we applied the Continuous Bag Of Words (CBOW) method, in which the mean of the
embeddings of all words in a tweet is taken. So the input to the SVM was a vector of length 768.</p>
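      <p>A sketch of this CBOW-style pooling plus the SVM, assuming scikit-learn and NumPy, is given below; tweet_matrices (one MAX_LEN x 768 embedding matrix per tweet) and labels are hypothetical inputs.</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC

# Mean of the word embeddings of each tweet -> one vector of length 768.
X = np.stack([m.mean(axis=0) for m in tweet_matrices])

clf = SVC(kernel="rbf")  # SVC with RBF kernel
clf.fit(X, labels)
      </preformat>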
      <p>4.3.2. KNN</p>
      <p>This was also implemented using the scikit-learn library, with the number of neighbours set to 3. Here again, the
CBOW method was used.</p>
      <p>4.3.3. Naive Bayes[15]</p>
      <p>Naive Bayes also works well for NLP tasks. Here, the CBOW method was used to get the
embeddings.</p>
      <p>4.3.4. ANN</p>
      <p>The input to this ANN was the CBOW embeddings, and the TF-IDF vectors for English. It was a five-layer
network with 128, 128, 256, and 512 neurons in the hidden layers, and the last layer had 1 or 4 neurons depending upon the
subtask.</p>
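      <p>A sketch of this network in tf.keras is given below; the layer widths and output sizes follow the text, while the activations and input dimension are assumptions.</p>
      <preformat>
import tensorflow as tf

def build_ann(input_dim, n_out):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        # 1 neuron (Subtask A) or 4 neurons (Subtask B).
        tf.keras.layers.Dense(n_out,
                              activation="sigmoid" if n_out == 1 else "softmax"),
    ])

model = build_ann(input_dim=768, n_out=1)  # e.g. CBOW embeddings, Subtask A
      </preformat>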
      <p>4.3.5. CNN[16]</p>
      <p>Here we have used an approach similar to the work of Yoon Kim. The architecture of the
CNN has layers in the following order.</p>
      <p>1. Input: For initializing the input tensors.
2. BatchNormalization[20]: To normalize each batch that is being processed.
3. Dropout[17]: Dropout helps to reduce the possibility of overfitting.
4. After this, the network is divided into various branches. Each branch computes x-gram features.</p>
      <p>a. Conv1D: The kernel_size is set to x to compute x-grams.
b. GlobalMaxPool1D: To get a vector whose length is the number of filters.
c. BatchNormalization.
5. The above three layers produce one x-gram branch. After this, all of the x-gram branches are
merged into one tensor.
6. The merged tensor is then passed into a Dense layer of 128 neurons.
7. Finally, there is a dense classifier layer with 1 or 4 neurons based on the
subtask.</p>
      <p>The architecture described above can be seen in Figure 1.</p>
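      <p>A minimal sketch of this multi-branch architecture in tf.keras follows; the layer ordering matches the list above, while the filter count, dropout rate, and activations are assumptions.</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(max_len, emb_dim, n_out, ngram_sizes=(2, 3)):
    inp = layers.Input(shape=(max_len, emb_dim))          # 1. Input
    x = layers.BatchNormalization()(inp)                  # 2. BatchNormalization
    x = layers.Dropout(0.3)(x)                            # 3. Dropout (rate assumed)
    branches = []
    for n in ngram_sizes:                                 # 4. one branch per x-gram
        b = layers.Conv1D(128, kernel_size=n,             # 4a. kernel_size = x
                          activation="relu")(x)
        b = layers.GlobalMaxPool1D()(b)                   # 4b. vector of length = filters
        b = layers.BatchNormalization()(b)                # 4c. BatchNormalization
        branches.append(b)
    merged = layers.Concatenate()(branches)               # 5. merge the x-gram branches
    dense = layers.Dense(128, activation="relu")(merged)  # 6. Dense, 128 neurons
    out = layers.Dense(n_out,                             # 7. classifier: 1 or 4 neurons
                       activation="sigmoid" if n_out == 1 else "softmax")(dense)
    return tf.keras.Model(inp, out)
      </preformat>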
      <p>First, we experimented with a CNN using Bigram, Trigram, and Four-gram branches, and then with Bigram
and Trigram branches only. In most cases, the Bigram + Trigram model worked better than the Bigram + Trigram
+ Four-gram model. So, for further analysis, we considered only the Bigram + Trigram model.</p>
      <p>For training the above models, the Adam optimizer[21] was used. We also used mild kernel
regularization[22] in the English Subtask B. We also used class weights[23] in Subtask B for all
languages because there was an imbalance of classes in the dataset.</p>
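      <p>A sketch of this training setup, assuming tf.keras and scikit-learn, is given below; the optimizer and class weights follow the text, while the loss, epoch count, and the variables X_train and y_train are assumptions.</p>
      <preformat>
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency to counter the imbalance
# (assumes integer labels 0..K-1).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))

model.compile(optimizer="adam",  # Adam optimizer
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, class_weight=class_weight)
      </preformat>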
    </sec>
    <sec id="sec-8">
      <title>5. Analysis</title>
      <p>Experimental results on the available test set show that the CNN (Bigram + Trigram) outperforms all
other models. The CNN can exceed most of the baseline models precisely because of the nature of tweets:
tweets can be indirect (e.g., sarcasm), full of noise, and may not follow proper
grammatical structure.</p>
      <p>
        A CNN can identify many small and large patterns in a tweet; if some are impacted by noise [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], it
can still use other patterns to determine the class. This can be seen in Tables 4 and 5, which display
the embedding vs. model macro F1 scores, the metric used for scoring in HASOC 2020. The
TF-IDF vectors for English Subtask A work well enough, but the Subtask B performance is lower,
which is caused by the imbalance in the dataset's distribution. The bert-base-uncased Bigram +
Trigram model gives the best performance in English. Models for which CBOW pooling was used do not
perform on par with the CNN model for either subtask. The bert-base-multilingual-cased and the
gpt2 transformer embeddings give a reasonably similar performance in some situations, and for
some, BERT performs a little better. bert-base-uncased performs better than
bert-base-multilingual-cased in both of the English subtasks.
      </p>
      <p>From the above work, we concluded to submit the CNN Bigram + Trigram model with the
bert-base-uncased transformer for the English subtasks and the bert-base-multilingual-cased transformer for
the Hindi and German subtasks. The label-wise F1 scores for the submitted models are shown in Table 6.
The OFFN and HATE categories from Subtask B have relatively lower F1 scores due to their relatively lower
occurrence in the dataset. The final results on the private dataset, on which the final ranks were given,
are displayed in Table 7.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Conclusion</title>
      <p>This paper describes offensive text identification in three Indo-European languages. We have shown
our methodology for classifying tweets and posts from social media using multiple models in the given
three languages, categorizing hate and offensive speech. After analyzing different models, we
observed that the bert-base-uncased Bigram + Trigram model gives the best performance in English,
and the bert-base-multilingual-cased transformer for the Hindi and German subtasks. The results indicate that
categorizing profane and hate content is a strenuous task. In future work, we hope that better models
and methods can be used to improve hate speech identification.</p>
    </sec>
    <sec id="sec-10">
      <title>7. References</title>
      <p>2011.
[15]
[18]
[16]
[17]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mandl</surname>
          </string-name>
          , Thomas and Modha, Sandip and Shahi, Gautam Kishore and Jaiswal, Amit Kumar and Nandini, Durgesh and Patel, Daksh and Majumder, Prasenjit and Schäfer, Johannes.
          <year>2020</year>
          .
          <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          .
          <source>In Proceedings of the 12th annual meeting of the Forum for Information Retrieval Evaluation. Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wiegand</surname>
          </string-name>
          , Melanie Siegel, and
          <string-name>
            <given-names>Josef</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language</article-title>
          .
          <source>In Proceedings of GermEval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>Joaquın</given-names>
          </string-name>
          <string-name>
            <surname>Padilla</surname>
          </string-name>
          .
          <year>2018</year>
          . Tuwienkbs at germeval 2018:
          <article-title>German abusive tweet detection</article-title>
          .
          <source>In14thConference on Natural Language Processing KONVENS</source>
          <year>2018</year>
          , page
          <issue>45</issue>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and
          <string-name>
            <given-names>Ritesh</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <year>2019b</year>
          . SemEval
          <article-title>-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)</article-title>
          .
          <source>In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>CoRR abs/1408</source>
          .5882 (
          <year>2014</year>
          ), http://arxiv.org/abs/1408.5882
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>Tomas</given-names>
          </string-name>
          &amp; Chen, Kai &amp; Corrado,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>s &amp; Dean</article-title>
          ,
          <string-name>
            <surname>Jeffrey.</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>Proceedings of Workshop at ICLR</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nobata</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tetreault</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehdad</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Abusive Language Detection in Online User Content</article-title>
          .
          <source>In: WWW 2016</source>
          . pp.
          <fpage>145</fpage>
          -
          <lpage>153</lpage>
          . Montreal (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Djuric</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grbovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radosavljevic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhamidipati</surname>
          </string-name>
          , N.:
          <article-title>Hate Speech Detection with Comment Embeddings</article-title>
          .
          <source>In: WWW 2015</source>
          . pp.
          <fpage>29</fpage>
          -
          <lpage>30</lpage>
          . Florence,
          <string-name>
            <surname>Italy</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , I. Polosukhin, “
          <article-title>Attention Is All You Need”</article-title>
          .
          <source>arXiv:1706</source>
          .03762 https://arxiv.org/pdf/1706.03762.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Association for Computational Linguistics, Minneapolis,
          <source>Minnesota (Jun</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N19</fpage>
          -1423
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Wolf</surname>
            , Thomas &amp; Debut, Lysandre &amp; Sanh, Victor &amp; Chaumond, Julien &amp; Delangue, Clement &amp; Moi, Anthony &amp; Cistac, Pierric &amp; Rault, Tim &amp; Louf, Rémi &amp; Funtowicz, Morgan &amp; Brew,
            <given-names>Jamie.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Transformers: State-of-the-art Natural Language Processing</article-title>
          . arXiv:
          <year>1910</year>
          .
          <article-title>03771v5 [cs</article-title>
          .CL] https://arxiv.org/pdf/
          <year>1910</year>
          .03771.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Radford</surname>
          </string-name>
          , Alec, and
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , Jeff, and
          <string-name>
            <surname>Child</surname>
            , Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya, Language Models are
            <given-names>Unsupervised</given-names>
          </string-name>
          <string-name>
            <surname>Multi-Task</surname>
            <given-names>Learners</given-names>
          </string-name>
          , (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Marti</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hearst</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Support Vector Machines</article-title>
          .
          <source>IEEE Intelligent Systems 13, 4 (July</source>
          <year>1998</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>28</lpage>
          . DOI:https://doi.org/10.1109/5254.708428
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>