<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Amrita CEN at HASOC 2019: Hate Speech Detection in Roman and Devanagiri Scripted Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sreelakshmi.K</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Premjith.B</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman K.P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering &amp; Networking (CEN) Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays the usage of social media sites like Facebook and Twitter has increased rapidly, which has led to a huge flood of data on these sites. Though these social media sites give people free opportunities to express and share their thoughts, they also end up spreading a huge amount of hate content. In this paper we present a domain-specific word embedding model for the classification of English tweets into Non Hate-Offensive and Hate-Offensive, and a fastText model for Hindi text classification. The classification is done using the dataset obtained from the HASOC 2019 shared task. A deep learning algorithm is used as the classifier.</p>
      </abstract>
      <kwd-group>
        <kwd>fastText</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Hate speech</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Hate speech is a form of expressing aggression or profanity in a verbal or non-verbal
way. It can take the form of discriminating against or using filthy language about a person or
group merely on grounds of their age, gender, sex, caste, economic status, etc. This
can even lead to serious violence or conflict between individuals or communities.
So it is very important to detect such content before it reaches a huge mass [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In a country like India people tend to use a regional language for texting or
tweeting. Around half of the population speaks Hindi, so the need to find hate
speech in Hindi is very high. It can corrupt not only humans but even chatbots: since
chatbots learn from conversations with humans, a chatbot that is not able to differentiate hate
from non-hate content also starts to use it. So it has become a huge
responsibility for the government, as well as for Twitter and Facebook, to detect this
hate speech content.</p>
      <p>
        To this end, in this paper we developed two separate models to classify tweets
in Hindi and English as hate or not. The English data is in Roman script and
the Hindi data in Devanagari script. The dataset is from the HASOC 2019 shared
task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Two samples of the English data are given below.
HATE:
"I love this bill, I think they should start printing them FuckTrump https://t.co/NY9CuyivGl"
Non-HATE:
"All Indian spectators shd hv BalidanBadge in ground, DhoniKeepsTheGlove
DhoniKeepBalidaanBadgeGlove DhoniKeepsTheGlove DhoniKeSathDesh"
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>A considerable amount of work has been done in the area of hate speech detection; a few
representative studies are given below.</p>
      <p>
        Shervin et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a model using character n-grams,
word n-grams and word skip-grams for the classification of English tweets into
hate speech (HATE), offensive and non-offensive content. The system used an SVM
as the classifier and achieved an accuracy of 78%.
      </p>
      <p>
        Georgios et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] present a model to detect hateful content in
social media. They made use of Recurrent Neural Network (RNN) classifiers
and fed in various features associated with user-related information, such as the
users' tendency towards racism or sexism. They made use of a publicly available
corpus of 16,000 tweets.
      </p>
      <p>
        Satyajith et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] collected around 250,000 tweets using the Twitter API,
trained a word2vec model and obtained domain-specific word embeddings.
Using these embeddings they extracted features for 4,500 Hindi-English
code-mixed samples and classified them as hate or non-hate. They used CNN, LSTM and
BiLSTM as classifiers.
    </sec>
    <sec id="sec-3">
      <title>Proposed Methodology</title>
      <p>The steps of our proposed methodology are as follows:
- Pre-processing: The data consists of usernames, hashtags, URLs and
unwanted characters. The first step was to remove these usernames, hashtags,
URLs, unwanted characters and punctuation. Then the whole text was converted
to lower case.
- Retraining the model: Once the text data was cleaned we tokenized the
data and segmented it to the level of words. Each tokenized sentence was
given to a bilingual model which was already trained on 250K code-mixed
sentences. We retrained that model using gensim's word2vec with our data
and generated word embeddings as feature vectors from the retrained model.
- Feature Extraction: For the Hindi corpus fastText features were extracted.</p>
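The pre-processing and tokenization steps above can be sketched as follows. This is a minimal illustration in pure Python; the exact regular expressions and the helper names are assumptions, not the authors' code:

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs, usernames, hashtags, punctuation and unwanted
    characters, then convert the text to lower case."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # usernames
    text = re.sub(r"#\w+", " ", text)          # hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation / unwanted characters
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text: str) -> list:
    """Segment a cleaned sentence to the level of words."""
    return clean_tweet(text).split()
```

The cleaned, tokenized sentences would then be passed to the bilingual word2vec model for retraining.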
      <p>fastText provides a pre-trained model for Hindi. Each sentence was tokenized,
the word vector of each word was taken from the fastText model, and the
average over the words of a sentence was computed. The vector size for fastText
was specified as 300. For the English data the vector representation of each word
was taken from the bilingual word embedding and the average over the words of
a sentence was computed. For this, word2vec was used and the vector size was
specified to be 300.
- Classification: A deep learning model consisting of CNN and LSTM
layers was used for classification. The extracted feature matrix was fed to
an embedding layer, then to the CNN and then to the LSTM. The flow diagram is given
in Fig. 2.</p>
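The feature extraction described above, averaging the word vectors of a sentence into a single sentence vector, can be sketched as follows. The vectors are assumed to have already been loaded from the fastText or word2vec model into a plain dictionary; the function name and the tiny dimension used in the example are illustrative (the paper uses 300):

```python
def sentence_vector(tokens, word_vectors, dim=300):
    """Average the word vectors of a tokenized sentence.
    Out-of-vocabulary tokens are skipped; if no token is in the
    vocabulary a zero vector is returned."""
    acc = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        acc = [a + v for a, v in zip(acc, vec)]
        count += 1
    if count == 0:
        return acc
    return [a / count for a in acc]
```

The resulting fixed-length sentence vectors form the feature matrix that is fed to the embedding, CNN and LSTM layers of the classifier.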
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In many applications like chatbot building, content recommendation and
sentiment analysis, the need for hate speech detection is high. Especially for a country
like India, with its diverse cultures and languages, the usage of Hindi on Twitter is
high. This paper therefore presented a deep learning model which makes use of two
different feature sets to classify tweets in English and Hindi as hate and non-hate.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval),</article-title>
          <source>in Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <volume>75</volume>
          -
          <fpage>86</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          ,
          <article-title>Overview of the GermEval 2018 shared task on the identification of offensive language</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Benchmarking aggression identification in social media,</article-title>
          <source>in Proceedings of TRAC</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting the Type and Target of Offensive Posts in Social Media,</article-title>
          <source>in Proceedings of NAACL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages,</article-title>
          <source>in Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Pitsilis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ramampiaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Langseth</surname>
          </string-name>
          ,
          <article-title>Detecting offensive language in tweets using deep learning,</article-title>
          arXiv preprint arXiv:1801.04433,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Detecting hate speech in social media,</article-title>
          <source>arXiv preprint arXiv:1712.06427</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamble</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Hate speech detection from code-mixed Hindi-English tweets using deep learning models,</article-title>
          arXiv preprint arXiv:1811.05145,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>