<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NSIT &amp; IIITDWD @ HASOC 2020: Deep learning model for hate-speech identification in Indo-European languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roushan Raj</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivangi Srivastava</string-name>
          <email>shivangisrivastava762@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Saumya</string-name>
          <email>sunil.saumya@iiitdwd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology Dharwad</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Netaji Subhas Institute of Technology</institution>
          ,
          <addr-line>Bihta, Patna</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In current times, social media is the most widely used platform, and everyone has the right to express their speculations, ideas, thoughts, etc. In such a setting, it is often seen that hate speech and offensive content spread like wildfire, making a detrimental impact on the world. It is important to identify and eradicate such offensive content from social media. This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2020 shared task. Our target is to present deep learning models to detect hate speech and offensive content in three languages: English, Hindi, and German. Our team NSIT_ML_Geeks developed models using Convolutional Neural Networks (CNN), Bi-directional Long Short-Term Memory (BiLSTM), and hybrid models (CNN+BiLSTM). The word embeddings used are GloVe and fastText, which convert our corpus into vectors of real numbers to train the models. Our best models for Hindi sub-tasks A and B secured First and Second positions, outperforming the other models submitted in the competition with f1 macro-avg scores of 0.5337 and 0.2667 respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Users across the world express their thoughts in diverse languages. Thus, there has been
extensive research into creating advanced automated systems that use AI technologies to detect and eliminate
offensive content from social media platforms in as many languages as possible. Several models have been
proposed by researchers in the field of hate-speech identification with feature engineering.
Risch and Krestel [<xref ref-type="bibr" rid="ref6">6</xref>] proposed a semi-automatic approach to comment moderation in which
moderators are alerted to a potentially violating comment by a logistic regression model with features
such as word and character n-grams and linguistic features. Mishra and Mishra [7] proposed a
fine-tuned, pre-trained monolingual and multilingual BERT-based approach for the HASOC 2019 shared
task. Waseem and Hovy [8] proposed models for hate-tweet identification linked with sexism and
racism using character n-grams, which outperformed word n-grams. Alfina et al. [9] presented ML
models such as Naïve Bayes, SVM, Bayesian Logistic Regression, and Random Forest Decision Tree to
counter hate speech in the Indonesian language. Kamble et al. [10] used domain-specific word
embeddings for hate speech detection in code-mixed Hindi-English tweets. Xu et al. [11] introduced
the CrossNet model, which consists of four layers (embedding, context encoding, attention, and finally
a prediction layer) and learns to predict stance for unseen but related targets. Research communities are increasingly
interested in applying machine learning and natural language processing techniques to this problem. Many social
media platforms monitor users' posts to identify offensive language. The Hate Speech and
Offensive Content Identification in Indo-European Languages (HASOC) shared task [12] has been organized as a step
in this direction in three major languages: English, Hindi, and German.
      </p>
      <p>There are two sub-tasks for each of the three languages [13]. Sub-task A is a binary classification
problem: classify tweets as HOF (Hate and Offensive) or NOT (Non Hate-Offensive). Sub-task B
is a fine-grained multi-class classification of the hate-speech and offensive posts obtained from sub-task
A into Hate, Offensive, or Profane posts. The HATE class includes posts containing hateful content
targeting political opinion, gender, social status, race, religion, or any other equivalent attribute. The OFFN
(Offensive) class covers posts containing offensive content that insults an individual or a group, or
makes others uncomfortable. The PRFN (Profane) class covers posts containing profane words or
otherwise unacceptable language, such as cursing or the use of swear words.</p>
      <p>In this paper, we propose Convolutional Neural Network (CNN) [14] and Bi-directional Long Short-Term
Memory (BiLSTM) [15] deep neural networks for each language. Our models for Hindi sub-tasks
A and B outperformed the other submitted models, securing 1st and 2nd positions respectively.</p>
      <p>Section 2 describes the dataset; Section 3 presents the methodology in two
steps, pre-processing and model architecture; Section 4 analyzes the results of all the
experimented models; and Section 5 outlines the conclusion and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Description</title>
      <p>The dataset provided for each of the three languages is a collection of tweets from Twitter [16] covering
the two sub-tasks (sub-task A and sub-task B). As shown in Table 1, each instance in the dataset consists
of a tweet_id, which is a unique value for the tweet; the full text of the tweet; the target variables in
two separate columns, indicating whether the tweet is HOF (Hate
and Offensive) or NOT (Non Hate-Offensive) for sub-task A and whether it is HATE, OFFN (Offensive),
or PRFN (Profane) for sub-task B; and the unique HASOC ID for each tweet. For most of the sub-tasks,
the dataset is highly imbalanced in all three languages, for both training and testing, as shown
in Table 2 and Table 3 respectively.</p>
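The schema above can be sketched with a small pandas DataFrame; the rows here are illustrative placeholders, not actual HASOC 2020 tweets, and inspecting the label counts is one simple way to surface the class imbalance the text describes.

```python
import pandas as pd

# Minimal sketch of the dataset schema (tweet_id, text, task1, task2, ID);
# the values are invented examples, not real HASOC data.
df = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104],
    "text": ["example tweet one", "example tweet two",
             "example tweet three", "example tweet four"],
    "task1": ["HOF", "NOT", "HOF", "NOT"],      # sub-task A labels
    "task2": ["HATE", "NONE", "PRFN", "NONE"],  # sub-task B labels
    "ID": ["hasoc_1", "hasoc_2", "hasoc_3", "hasoc_4"],
})

# Counting labels per target column reveals any imbalance at a glance.
task1_counts = df["task1"].value_counts().to_dict()
print(task1_counts)  # {'HOF': 2, 'NOT': 2}
```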
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Description of the dataset columns</p>
        </caption>
        <table>
          <thead>
            <tr><th>Columns</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>tweet_id</td><td>unique value for the tweets</td></tr>
            <tr><td>text</td><td>full text of the tweets</td></tr>
            <tr><td>task1</td><td>target value, either HOF or NOT, for sub-task A</td></tr>
            <tr><td>task2</td><td>target value, either HATE, OFFN, or PRFN, for sub-task B</td></tr>
            <tr><td>ID</td><td>unique HASOC ID for each tweet</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes the approach followed to build a model that separates offensive and
hateful tweets from non-offensive tweets, and the approach used for the further classification of
hate speech into the three different categories of hate, profane, and
offensive. We begin by describing the preprocessing steps for each of the three
languages, followed by the model architecture for each of them. We have also made our approach public4.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-processing</title>
        <p>The text data for the three languages have been preprocessed as follows. For
the Hindi language, we first converted the texts to lowercase and removed redundant text such
as URLs and punctuation symbols, e.g. !"#$%&amp;´()*+,-./:;&lt;=&gt;?@[/]ˆ{|}. We removed the retweet symbol
(RT) of Twitter data. Next, we removed all the Hindi stopwords using the ‘Swadesh’5 list. Further,
we tokenized each word and created a vocabulary of tokens, followed by encoding. Lastly, we performed
padding with a fixed length of 100. The same steps were applied to preprocess the English
and German languages, with slight differences. In the English dataset, we filtered the data using the regex
library (re), eliminated single alphanumeric characters, and removed apostrophes by expanding the
contracted word to maintain proper structure and avoid word-sense ambiguity. Then
we removed English stopwords and performed stemming using ‘PorterStemmer’. For the German
dataset, stopwords were removed and stemming was performed using ‘SnowballStemmer’. The
further preprocessing steps of tokenization, encoding, and padding for the English and German datasets
were identical to those described above for the Hindi dataset.</p>
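The cleaning, tokenization, encoding, and padding steps above can be sketched as follows. This is a minimal illustration: the stopword set stands in for the ‘Swadesh’ (Hindi) or NLTK (English/German) lists, and stemming is omitted.

```python
import re

STOPWORDS = {"the", "is", "a", "rt"}  # placeholder stopword list
MAX_LEN = 100                          # fixed padding length used in the paper

def clean_text(text):
    """Lowercase, strip URLs and punctuation, drop stopwords (incl. 'rt')."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)           # remove punctuation symbols
    return [t for t in text.split() if t not in STOPWORDS]

def build_vocab(token_lists):
    """Assign each token an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode_and_pad(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to ids, truncate to max_len, pad the rest with zeros."""
    ids = [vocab[t] for t in tokens if t in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

corpus = ["RT This is a sample tweet http://t.co/xyz!",
          "Another sample, with punctuation..."]
tokenized = [clean_text(t) for t in corpus]
vocab = build_vocab(tokenized)
encoded = [encode_and_pad(t, vocab) for t in tokenized]
print(len(encoded[0]))  # 100: every sequence is padded to the fixed length
```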
        <p>4https://github.com/roushan-raj/HASOC-2020
5https://docs.cltk.org/en/latest/hindi.html</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Architecture</title>
        <p>The proposed approach consists of two different deep neural network architectures tested for all three
languages: the Convolutional Neural Network (CNN) and the Bi-directional LSTM (BiLSTM).
The following paragraphs describe our best-performing model for each language.</p>
        <p>For the English language, we used GloVe6 embeddings [17] in both sub-tasks. The embedding
layer encodes the corpus using pre-trained weight vectors and feeds them to the
input of the deep neural network. In the CNN model, we used two convolutional, two dropout, and
two max-pooling layers, followed by a flatten layer and a dense layer. For the German model, we
used ‘fastText’7 embeddings [18] for both sub-tasks. The output of the embedding layer is fed to
one convolutional layer followed by a dropout and a max-pooling layer, after which flatten and dense
layers are used. For the Hindi sub-tasks, we also used ‘fastText’ embeddings. In sub-task A, one
bi-directional LSTM layer and a dropout layer followed by a dense layer performed best. For sub-task B,
one convolutional layer with dropout and max-pooling is used, followed by a flatten and a dense layer.
For each sub-task in each language, we used an embedding dimension of 300 and applied the ‘Adam’
optimizer to minimize the loss and achieve the most accurate results possible. We also applied
the ‘ADASYN’ [19] over-sampling technique to balance the data for sub-task B, as that dataset was
heavily unbalanced, as shown in Table 2. ‘ReLU’ activation is used in the internal layers and ‘sigmoid’
activation at the final output dense layer. We used the Keras library8 to build all our models.</p>
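The Hindi sub-task A architecture described above (one bi-directional LSTM layer, a dropout layer, and a dense sigmoid output, with a 300-dimensional embedding and the Adam optimizer) can be sketched in Keras as follows. The vocabulary size, LSTM units, and dropout rate here are illustrative assumptions; in the actual pipeline the Embedding layer would be initialized with pre-trained fastText weights.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # assumed vocabulary size (dataset-dependent)
EMBED_DIM = 300     # embedding dimension used in the paper
MAX_LEN = 100       # padded sequence length

# BiLSTM sketch: embedding -> bidirectional LSTM -> dropout -> sigmoid output.
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # fastText weights would go here
    layers.Bidirectional(layers.LSTM(64)),    # 64 units is an assumed size
    layers.Dropout(0.5),                      # assumed dropout rate
    layers.Dense(1, activation="sigmoid"),    # binary output: HOF vs NOT
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# A forward pass on dummy input confirms the expected output shape.
preds = model.predict(np.zeros((2, MAX_LEN)), verbose=0)
print(preds.shape)  # (2, 1)
```

The CNN variants used for English, German, and Hindi sub-task B differ only in the stack between the embedding and the dense output (convolution, dropout, max-pooling, flatten); ADASYN over-sampling for sub-task B would be applied to the training data before `fit`.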
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we describe the results obtained in each sub-task for the experimented models in all
three languages, and analyze and compare the observations. The results for all sub-tasks
are evaluated using the f1 macro-avg score. We experimented with one-layer and two-layer CNN,
one-layer and two-layer BiLSTM, and hybrid (combination of CNN and BiLSTM) models.</p>
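The macro-averaged f1 score used for evaluation weights each class equally, which makes it a stricter metric than accuracy on the imbalanced HASOC data. A minimal sketch with scikit-learn, using invented labels for illustration:

```python
from sklearn.metrics import f1_score

# Illustrative sub-task A labels only, not actual competition predictions.
y_true = ["HOF", "HOF", "NOT", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]

# average="macro" computes f1 per class and takes the unweighted mean,
# so the minority class counts as much as the majority class.
score = f1_score(y_true, y_pred, average="macro")
print(round(score, 4))  # 0.5833
```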
      <p>Table 4 shows the f1 macro-avg scores of our six best models, calculated by the organizers on
approximately 15% of the private test data. Among all the models submitted, we secured First and
Second positions for Hindi sub-tasks A and B, delivering the leading f1 macro-avg scores of 0.5337
and 0.2667 respectively.</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>f1 macro-avg scores of our best models for sub-tasks A and B in each language</p>
        </caption>
        <table>
          <thead>
            <tr><th>Language</th><th>Sub-task A</th><th>Sub-task B</th></tr>
          </thead>
          <tbody>
            <tr><td>English</td><td>0.4879</td><td>0.2361</td></tr>
            <tr><td>German</td><td>0.4919</td><td>0.2468</td></tr>
            <tr><td>Hindi</td><td>0.5337</td><td>0.2667</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>[The per-configuration results table could not be recovered from the extracted text. It listed, for each language and sub-task, the one-layer and two-layer CNN, one-layer and two-layer BiLSTM, and hybrid models, with GloVe embeddings for English and fastText embeddings for German and Hindi, each trained on the unbalanced dataset and with SMOTE and ADASYN over-sampling for sub-task B.]</p>
      <p>6https://nlp.stanford.edu/projects/glove/
7https://fasttext.cc/docs/en/crawl-vectors.html
8https://keras.io/</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future works</title>
      <p>This paper puts forward deep neural network models to identify hate speech, offensive content, and
profane tweets. We proposed different CNN and BiLSTM architectures developed using word vectors
from relevant pre-trained corpora. Much research has been carried out for the English
language, but languages such as Hindi and German, as well as multilingual data, are now also receiving
attention. We observed that the dataset for sub-task B was heavily unbalanced and yielded a lower f1
macro-avg score even after applying the SMOTE and ADASYN over-sampling techniques. Future work could
improve dataset balancing. A further direction is to tackle the identification of hate
speech in multilingual tweets and posts on social media, presumably by adding other features that
were not included in the presented models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <article-title>A comparative analysis of machine learning techniques for disaster-related tweet classification</article-title>
          ,
          <source>in: 2019 IEEE R10 Humanitarian Technology Conference (R10-HTC)</source>
          (
          <volume>47129</volume>
          ), IEEE,
          <year>2019</year>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting stock movements using social network</article-title>
          , in: Conference on e-Business, e-Services and e-Society, Springer,
          <year>2016</year>
          , pp.
          <fpage>567</fpage>
          -
          <lpage>572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          , et al.,
          <article-title>Spam review detection using lstm autoencoder: an unsupervised approach</article-title>
          ,
          <source>Electronic Commerce Research</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          , https://doi.org/10.1007/s10660-020-09413-4.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Detection of spam reviews: A sentiment analysis approach</article-title>
          ,
          <source>CSI Transactions on ICT 6</source>
          (
          <year>2018</year>
          )
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hinduja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Patchin</surname>
          </string-name>
          ,
          <article-title>Connecting adolescent suicide to the severity of bullying and cyberbullying</article-title>
          ,
          <source>Journal of school violence 18</source>
          (
          <year>2019</year>
          )
          <fpage>333</fpage>
          -
          <lpage>346</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Risch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krestel</surname>
          </string-name>
          ,
          <article-title>Delete or not delete? semi-automatic comment moderation for the newsroom</article-title>
          ,
          <source>in: Proceedings of the first workshop on trolling, aggression and cyberbullying (TRAC-2018)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] S. Mishra, S. Mishra, 3Idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages, in: FIRE (Working Notes), 2019, pp. 208-213.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z. Waseem, Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational Social Science, 2016, pp. 138-142.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] I. Alfina, R. Mulia, M. I. Fanany, Y. Ekanata, Hate speech detection in the Indonesian language: A dataset and preliminary study, in: 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS), IEEE, 2017, pp. 233-238.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Kamble, A. Joshi, Hate speech detection from code-mixed Hindi-English tweets using deep learning models, arXiv preprint arXiv:1811.05145 (2018).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] C. Xu, C. Paris, S. Nepal, R. Sparks, Cross-target stance classification with self-attention networks, arXiv preprint arXiv:1805.06593 (2018).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Hate speech and offensive content identification in Indo-European languages competition overview and details, 2020. URL: https://competitions.codalab.org/competitions/26027.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882 (2014).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, B. Xu, Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling, arXiv preprint arXiv:1611.06639 (2016).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Hate speech and offensive content dataset, 2020. URL: https://competitions.codalab.org/competitions/26027#participate-get-data.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135-146.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] H. He, Y. Bai, E. A. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1322-1328.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>