<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Comput. Surv.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>IIT Varanasi at HASOC 2019: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akanksha Mishra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <email>spalg@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi - 221005</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Introduction - Task Description</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>51</volume>
      <issue>4</issue>
      <abstract>
        <p>The track aims to develop a system that identifies hate speech and offensive content in a document and further classifies it into hate speech, offensive content, or usage of profane words. It also determines whether hate speech is targeted at an individual or a group. We use a bidirectional long short-term memory (LSTM) network with attention for all languages (English, German, and Hindi) in the track.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech</kwd>
        <kwd>Offensive Content</kwd>
        <kwd>Indo-European Languages</kwd>
        <kwd>Bidirectional LSTM</kwd>
        <kwd>Attention</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction - Task Description</title>
      <p>Hate and Offensive Content (HOF): The document or post
contains unacceptable language, which may take the form of hate speech,
offensive content, or profane words.</p>
      <p>Non-Hate and Offensive Content (NOT): The document or post
contains no hate speech or offensive content directed at an individual or a group.</p>
      <p>Sub-Task 2: This sub-task is a multi-class classification problem that
further classifies whether the document or post contains hate speech, offensive
content, or profane words against an individual or a group. In this sub-task,
we consider only those documents or posts that are classified as HOF in
the first sub-task. This sub-task classifies the document or post into one of the
following classes for all three languages:</p>
      <p>Hate Speech (HATE): The document or post contains hate
speech against an individual or a group. It may also contain hate speech
against a group because of their political opinion, gender, social status, race,
religion, or other comparable reasons.</p>
      <p>Offensive (OFFN): The document or post makes social media users
uncomfortable or upset. The content may also be seen
as a violent act or as insulting an individual.</p>
      <p>Profane (PRFN): The document or post contains unacceptable
language, such as cursing or swear words. It does not include
posts that abuse or insult an individual or a group.</p>
      <p>Sub-Task 3: This sub-task also considers only those documents or posts
that are classified as HOF in sub-task 1. It is offered only for English
and Hindi data. This sub-task classifies the document or post into one of the
following categories:</p>
      <p>Targeted Insult (TIN): The document or post targets an
individual, a group, or others.</p>
      <p>Untargeted (UNT): The document or post does not target
any individual, group, or others.</p>
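      <p>The label hierarchy described above can be sketched as a small consistency check. This is only an illustration under stated assumptions: the function name is_consistent, the constant names, and the use of None for "no label at this level" are hypothetical and are not part of the task definition.</p>
      <preformat><![CDATA[
```python
# Label sets per sub-task, taken from the task description above.
SUBTASK_2_LABELS = {"HATE", "OFFN", "PRFN"}
SUBTASK_3_LABELS = {"TIN", "UNT"}
SUBTASK_3_LANGUAGES = {"English", "Hindi"}  # sub-task 3 is not run for German


def is_consistent(language, label1, label2=None, label3=None):
    """Check one annotation against the sub-task hierarchy.

    None means "no label at this level" (a convention of this sketch only).
    """
    if label1 == "NOT":
        # Sub-tasks 2 and 3 only consider posts classified as HOF.
        return label2 is None and label3 is None
    if label1 != "HOF":
        return False
    if label2 not in SUBTASK_2_LABELS:
        return False
    if language in SUBTASK_3_LANGUAGES:
        return label3 in SUBTASK_3_LABELS
    return label3 is None
```
]]></preformat>
      <p>For example, a German post labelled HOF and PRFN is consistent without a sub-task 3 label, whereas an English HOF post needs both a sub-task 2 and a sub-task 3 label.</p>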
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Several shared tasks have been organized on offensive content identification for one
language or another. The OffensEval [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] task, organized at SemEval-2019, focuses
on the identification of offensive content, automatic categorization of offense
types, and identification of the target of offensive posts. The shared task used the
Offensive Language Identification Dataset (OLID) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which consists of 14,000 English
tweets from Twitter annotated mainly for offensive language.
The GermEval [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] shared task on the identification of offensive content deals
with German tweets from Twitter. It focuses on two sub-tasks: binary
and 4-way classification. For this task, several machine learning (SVM, logistic
regression, decision trees, and naive Bayes) and neural network (CNN, LSTM
and its variants, GRU, and combinations of these) based classifiers were used.
N-grams and word embeddings are commonly used features, and SVM, RNN, and
LSTM are widely used classifiers in the shared task on aggression identification
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] organized as part of the First Workshop on Trolling, Aggression, and
Cyberbullying (TRAC-1) at COLING 2018.
      </p>
      <p>
        The survey on automatic detection of hate speech [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] describes different definitions
of hate speech from various sources. Most studies have treated this as
a binary classification problem; however, some have treated it as a
multi-class problem. Machine learning, deep learning, and ensemble-based classifiers
are generally used. Frequently used features include TF-IDF, bag of words, N-grams,
dictionaries, typed dependencies, word sense disambiguation techniques, word2vec,
paragraph2vec, and several others.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>This section describes the model and architecture used for the identification
of hate speech and offensive content in a document and for further classifying
it into the relevant category.</p>
      <p>
        Preprocessing of Data: We preprocessed the data by removing all
punctuation symbols using the pre-initialized string string.punctuation available in the
Python string library. We kept hashtag words but removed the hash
symbols. After that, we removed stop words from the data. Further, we removed all
usernames, webpage links, and the retweet symbol (RT) in the case of Twitter data.
After removing non-letters from the data, all the tokens were lemmatized. All the
preprocessing steps mentioned here were applied to all the languages.
      </p>
      <p>
        Model Architecture: The model consists of four layers, as explained
below.
      </p>
      <p>
        Word Representation Layer: We represent each word of a sentence of the
document or post as a dense vector. We used two different versions<sup>1</sup> of
pretrained GloVe [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] word embeddings. One is
based on Common Crawl and represents each word as a
300-dimensional vector; the other is based on Twitter data and represents each word as
a 200-dimensional vector.
      </p>
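      <p>The preprocessing steps described in this section can be sketched as follows. This is a minimal illustration, not the code used for the submitted runs, and it rests on several assumptions: links, usernames, and RT are removed before punctuation so that their markers are still present, the stop-word list is a tiny hypothetical one, the non-letter filter keeps ASCII letters only (suitable for the English data; German and Hindi would need a wider character class), and lemmatization is omitted because it needs a language-specific lemmatizer.</p>
      <preformat><![CDATA[
```python
import re
import string

# Tiny illustrative stop-word list; the list actually used is not specified.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "at", "this"}

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=STOP_WORDS):
    """Return the cleaned, lower-cased tokens of one post."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # webpage links
    text = re.sub(r"@\w+", " ", text)                   # usernames
    text = re.sub(r"\bRT\b", " ", text)                 # retweet symbol
    # Deleting punctuation also strips '#', so the hashtag word itself is kept.
    text = text.translate(_PUNCT_TABLE)
    # Replace anything that is not an ASCII letter or whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]
```
]]></preformat>
      <p>Calling preprocess on a tweet returns a list of tokens that can then be lemmatized and mapped to embedding indices.</p>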
      <p>
        Bidirectional LSTM layer: In this layer [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], two copies of the hidden layer are
created. The vector representation of the words is fed to the first hidden layer as is,
and a reversed copy of the input sequence is fed to the second hidden
layer. The outputs of the two hidden layers are concatenated and fed to the next layer.
      </p>
      <p>
        Attention Layer: This layer helps the model focus on the important terms in the
input by iterating over the input and attending to the relevant information.
      </p>
      <p>
        Fully connected layer and output layer: In this layer, all the nodes of the previous
layer are connected to all the nodes of the next layer.
      </p>
      <p>
        <sup>1</sup> https://nlp.stanford.edu/projects/glove/
      </p>
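      <p>The bidirectional and attention steps described above can be illustrated with a small pure-Python sketch. It is not the submitted model: toy_rnn is a hypothetical stand-in for an LSTM (a running average rather than a gated cell), and the attention scoring (tanh of a dot product with a weight vector w, followed by a softmax) is one common choice assumed here, since the exact form is not specified in the text.</p>
      <preformat><![CDATA[
```python
import math


def toy_rnn(seq, decay=0.5):
    """Stand-in for an LSTM: a running exponential average of the inputs."""
    state = [0.0] * len(seq[0])
    outputs = []
    for x in seq:
        state = [decay * s + (1 - decay) * xi for s, xi in zip(state, x)]
        outputs.append(state)
    return outputs


def bidirectional(seq):
    """Run the recurrence on the sequence and on its reverse, then concatenate."""
    forward = toy_rnn(seq)
    backward = toy_rnn(seq[::-1])[::-1]  # re-align to the original time order
    return [f + b for f, b in zip(forward, backward)]


def attention_pool(states, w):
    """Softmax-weighted sum of the per-token states (assumed scoring function)."""
    scores = [math.tanh(sum(h_i * w_i for h_i, w_i in zip(h, w))) for h in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(a * h[d] for a, h in zip(weights, states)) for d in range(dim)]
    return weights, context
```
]]></preformat>
      <p>For a sequence of three 2-dimensional word vectors, bidirectional returns three 4-dimensional states, and attention_pool reduces them to a single 4-dimensional context vector that would be passed to the fully connected layer.</p>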
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Dataset: The dataset was created from Twitter and Facebook data and shared by the task
organizers in a tab-separated format for three languages, namely English,
German, and code-mixed Hindi, for all sub-tasks. However, there is no sub-task 3 for
the German language. All the instances belonging to the NOT category in sub-task
1 are further assigned to the NONE category in sub-task 2 and sub-task 3.
Figure 1 shows detailed statistics about the dataset.</p>
      <p>We pad the sentences to equal length based on
the maximum sentence length in the dataset. In the bidirectional LSTM
layer, we use a recurrent dropout of 0.2 and tanh as the activation function. A
dropout layer with a rate of 0.3 is used to avoid overfitting. At the
output layer, a softmax activation function is used. We use the Adam optimizer and the
categorical cross-entropy loss function for training. The configurations of the different
runs submitted for the various sub-tasks and languages are listed in Table
1.</p>
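      <p>The padding step and the training settings described above can be summarized in a short sketch. The function name pad_to_max, the padding index 0, padding at the end of the sequence, and the dictionary key names are assumptions made for illustration; only the hyperparameter values come from the text.</p>
      <preformat><![CDATA[
```python
def pad_to_max(sequences, pad_id=0):
    """Right-pad every token-id sequence to the longest length in the data."""
    max_len = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]


# Settings reported in the text, gathered in one place (key names are hypothetical).
TRAINING_CONFIG = {
    "lstm_recurrent_dropout": 0.2,
    "lstm_activation": "tanh",
    "dropout_rate": 0.3,
    "output_activation": "softmax",
    "optimizer": "adam",
    "loss": "categorical_crossentropy",
}
```
]]></preformat>
      <p>Padding to the dataset-wide maximum keeps every input the same shape, at the cost of extra computation on short posts.</p>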
      <p>As mentioned earlier, we used two different versions of GloVe pre-trained
embeddings. These versions differ in that they are trained on different
datasets. The GloVe Common Crawl embedding is trained on data crawled from
the internet, comprising about 840B tokens and a 2.2M-word vocabulary, and
represents each word as a 300-dimensional vector. The GloVe Twitter pre-trained embedding
is trained on a Twitter dataset of 2B tweets, comprising 27B tokens and a 1.2M-word
vocabulary, and represents each word as a 200-dimensional vector. We stopped
training as soon as the model started overfitting. The number of epochs for the
different runs of the various sub-tasks is given in Table 1. We trained the model with
or without the NONE category; hence, "NONE included" indicates whether the model
was trained with the NONE category or not. In the case of sub-task 1, there is
no NONE category, so it is not applicable (NA).</p>
      <p>This section discusses the different metrics evaluated for the track. Detailed
results based on macro F1, weighted F1, and accuracy for all sub-tasks and all
languages are given in Table 2. For the English language, the best performance based
on the macro F1 score is obtained by Run 2, Run 3, and Run 3 for sub-task 1,
sub-task 2, and sub-task 3, respectively. Similarly, in the case of German, Run 2 and
Run 3 perform better than the other runs for sub-task 1 and sub-task 2,
respectively. Moreover, for the Hindi language, Run 1, Run 2, and Run 1 outperform the
other runs for sub-task 1, sub-task 2, and sub-task 3, respectively.</p>
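      <p>The three metrics named above can be computed as in the following sketch. The function name evaluate and the returned key names are hypothetical; the definitions are the standard ones (per-class F1 averaged without weights for macro F1, and averaged by class support for weighted F1), and this is not the organizers' evaluation script.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return macro F1, weighted F1, and accuracy for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(label):
        denom = 2 * tp[label] + fp[label] + fn[label]
        return 2 * tp[label] / denom if denom else 0.0

    per_class = {label: f1(label) for label in labels}
    support = Counter(y_true)
    n = len(y_true)
    return {
        "macro_f1": sum(per_class.values()) / len(labels),
        "weighted_f1": sum(per_class[l] * support[l] for l in labels) / n,
        "accuracy": sum(tp.values()) / n,
    }
```
]]></preformat>
      <p>Macro F1 gives every class the same weight regardless of its frequency, so it is more sensitive than accuracy to errors on rare classes.</p>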
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fortuna</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>51</volume>
          (
          <issue>4</issue>
          )
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojha</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Benchmarking aggression identification in social media</article-title>
          .
          <source>In: Proceedings of TRAC</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Modha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          .
          <source>In: Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          ), http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliwal</surname>
            ,
            <given-names>K.K.:</given-names>
          </string-name>
          <article-title>Bidirectional recurrent neural networks</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          <volume>45</volume>
          (
          <issue>11</issue>
          ),
          <fpage>2673</fpage>
          –
          <lpage>2681</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Wiegand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siegel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruppenhofer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the GermEval 2018 shared task on the identification of offensive language</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Predicting the Type and Target of Offensive Posts in Social Media</article-title>
          .
          <source>In: Proceedings of NAACL</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)</article-title>
          .
          <source>In: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>75</fpage>
          –
          <lpage>86</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>