<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Text Generated from Emojis for Hate Speech and Offensive Content Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nkwebi Peace Motlogelwa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Thuma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monkgigi Mudongo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tebo Leburu-Dingalo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gontlafetse Mosweunyane</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Botswana</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, team University of Botswana Computer Science (UBCS) investigates whether enriching social media data with text generated from emojis can help in the identification of hate speech and offensive content. In particular, we build three different binary text classifiers that label data sampled from Twitter as Hate and Offensive content (HOF) or Not Hate-Offensive content (NOT). For the first classifier, we used pre-processed text from Twitter only, without emojis. For the second classifier, we enriched the pre-processed text with text generated from the emojis within the Tweets. For the third classifier, we tuned the parameters of the second classifier. Our results suggest that enriching Tweets with text generated from the emojis they contain improves the classification accuracy of our hate and offensive content classifier.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech</kwd>
        <kwd>Binary Classification</kwd>
        <kwd>fastText</kwd>
        <kwd>Emojis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        pre-trained on relevant social media corpus. In their experimental results, they suggest that
transfer learning of word embeddings can significantly improve the classification accuracy of
hate speech and offensive content. Mishra et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] also fine-tuned pre-trained, BERT-based transformer
neural network models. In their work, they used the BERT
implementation in the pytorch-transformers library. Their proposed solution outperformed
other participants in the HASOC 2019 shared task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this paper, we present our proposed
solution to the HASOC 2021 shared task English Sub-task A, which is a binary classification
task [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In this task, participating systems are required to classify Tweets
into two classes: Hate and Offensive (HOF) and Non-Hate and Offensive (NOT). In
our participation, we investigate whether enriching social media text with text generated from
emojis can improve the classification accuracy of our binary classifier. Our proposed solution is
motivated by the fact that people usually add emojis to their text to supply
emotional cues that are missing from the typed message. For example, a user may send an angry
face emoji on its own to show disgust or outrage, or may add
the emoji to a typed message.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        In this section, we present our binary text classification approaches for classifying tweets into
two classes: Hate and Offensive (HOF) and Non-Hate and Offensive (NOT). The HOF class
signifies that the tweet contains hate, offensive and profane content. The NOT class signifies that the
tweet does not contain any hate speech, profane or offensive content. Our proposed binary text
classifier used fastText [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. fastText<sup>1</sup>, contributed by Facebook AI Research (FAIR), is an
open-source library for efficient text classification and word representation.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Training Dataset</title>
        <p>
          The training dataset was pre-processed to make it compatible with fastText by moving the
labels (HOF or NOT) to the beginning of each tweet and adding __label__ as a prefix to each
label. Additional pre-processing was then performed on the dataset. In particular, we used the
Natural Language Toolkit (NLTK)<sup>2</sup>, a suite of libraries and programs for symbolic and statistical
natural language processing, to stem the text and to remove stop words. The Porter stemming
algorithm was used for stemming [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In addition, the following pre-processing steps were
applied to the dataset, mainly to clean the text:
          <list list-type="bullet">
            <list-item><p>Removing HTML tags</p></list-item>
            <list-item><p>Removing URLs</p></list-item>
            <list-item><p>Converting all text to lower case</p></list-item>
            <list-item><p>Keeping hashtags, mentions and punctuation (these were not removed)</p></list-item>
          </list>
        </p>
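        <p>The cleaning and label-formatting steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' script: the helper names clean_tweet and to_fasttext_line, the two regular expressions and the sample tweet are assumptions made for this example. Stemming and stop word removal, which the text performs with NLTK, are left out so that the sketch needs only the standard library.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical patterns; the text only names the steps, not how they were coded.
TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs


def clean_tweet(text):
    """Remove HTML tags and URLs, then lower-case the text.

    Hashtags, mentions and punctuation are deliberately kept.
    """
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.lower().split())


def to_fasttext_line(label, text):
    """Put the label first, prefixed with __label__, followed by the cleaned tweet."""
    return "__label__" + label + " " + clean_tweet(text)


# Made-up tweet, used only to show the output format.
example = to_fasttext_line("HOF", "<b>@user</b> You are #Awful! https://t.co/abc")
print(example)
```
]]></preformat>
        <p>Each output line starts with the prefixed label, which is the layout the fastText supervised mode reads.</p>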
        <p>The training dataset contains 3843 tweets. Of these, 1342 are not hate speech and 2501 are hate
speech. During training, the dataset was subdivided such that 3043 tweets train our
classification model and 800 tweets are used for validation. The subdivision was done such that
the first 3043 tweets are used for training the model and the last 800 tweets for validation. This was
done using the standard Linux head and tail commands.
          <fn id="fn1"><label>1</label><p>https://fasttext.cc/</p></fn>
          <fn id="fn2"><label>2</label><p>http://www.nltk.org/</p></fn>
        </p>
        <preformat>
head -n 3043 en_hasoc_clean.csv &gt; en_hasoc_clean.train
tail -n 800 en_hasoc_clean.csv &gt; en_hasoc_clean.valid
</preformat>
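        <p>Because the split is positional rather than shuffled, the class balance of the two parts depends on the order of the file. The following Python sketch mimics the head and tail split on a list of lines and counts the labels in each part. The function names and the five-line toy dataset are assumptions made for illustration; they stand in for the 3843-line file, for which 3043 + 800 = 3843 means the two parts do not overlap.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def head_tail_split(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    train = lines[:n_train]
    valid = lines[len(lines) - n_valid:]
    return train, valid


def label_counts(lines):
    """Count the leading __label__ token of each fastText-formatted line."""
    return Counter(line.split(" ", 1)[0] for line in lines)


# Toy data standing in for en_hasoc_clean.csv (5 lines instead of 3843).
toy = [
    "__label__HOF a",
    "__label__NOT b",
    "__label__HOF c",
    "__label__HOF d",
    "__label__NOT e",
]
train, valid = head_tail_split(toy, 3, 2)
print(label_counts(train), label_counts(valid))
```
]]></preformat>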
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Testing Dataset</title>
        <p>The test dataset went through the same pre-processing as the training dataset, except
for the step that labels the tweets as hate speech or non-hate speech.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Description of Runs</title>
      <p>We submitted three runs for Subtask 1A: identifying hate, offensive and profane content in a
post. Below is a brief description of each run.</p>
      <sec id="sec-3-1">
        <title>3.1. Run 1 - UBCS</title>
        <p>
          This is our baseline run. We used fastText to build a binary classifier that separates Hate
and Offensive (HOF) tweets from Non-Hate and Offensive (NOT) tweets. When building the classifier,
fastText automatically generated a vector for each tweet in the pre-processed training set by
averaging the word embeddings of that tweet, and used these vectors as features. To train and test our classification model,
fastText used multinomial logistic regression [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which is a linear learner. Before
predicting labels for the test dataset with the trained model, fastText also generates
feature vectors for the Tweets in the test set, in the same way as
for the training set. Both the training and test datasets underwent the
pre-processing steps described in Section 2.1.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Run 2 - UBCS</title>
        <p>In this run, our aim is to improve the classification accuracy of the binary classifier from our
baseline run (Run 1 - UBCS) by replacing emojis with text. For the emoji replacement, we used
emoji 1.6.1, a Python package for converting emojis to words and vice versa. In
particular, we used the demojize() function to convert the emojis to text. The pre-processing
and emoji replacement were applied to both the training dataset and the test dataset, with
pre-processing as described in Section 2.1. Figure 1 shows emoji
replacement.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Run 3 - UBCS</title>
        <p>In this run, our aim was to improve on the classification accuracy of
Run 2 - UBCS, where both the training data and test data were pre-processed and emojis replaced
with corresponding text. In particular, we tuned the parameters of our classifier in order to
improve the performance of our model. Specifically, we explored the learning rate
(-lr), the number of epochs (-epoch), and the maximum length of word n-grams (-wordNgrams). The
model that improved on performance was then used to predict the labels of the pre-processed test
data. This was achieved using the following command:</p>
        <preformat>
./fasttext supervised -input en_hasoc_clean.train -output model_hasoc_clean_epoch -lr 0.5 -epoch 50 -wordNgrams 2
</preformat>
      </sec>
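      <p>The emoji replacement used in Run 2 can be illustrated with a small, dependency-free Python sketch. It is a stand-in for the demojize() function rather than a call to that package: the helper name emoji_to_text is hypothetical, emoji are detected with a rough filter (Unicode category "So"), and the replacement token is built from the official Unicode character name. Multi-character emoji sequences are not handled.</p>
      <preformat><![CDATA[
```python
import unicodedata


def emoji_to_text(text):
    """Replace each emoji-like symbol with a :name: token from its Unicode name."""
    out = []
    for ch in text:
        # Category "So" (Symbol, other) covers most pictographic emoji.
        # This is an assumption-level filter, not a full emoji detector.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                out.append(" :" + name.lower().replace(" ", "_") + ": ")
                continue
        out.append(ch)
    # Collapse the extra spaces introduced around the tokens.
    return " ".join("".join(out).split())


# Made-up tweet ending with the angry face emoji (U+1F620).
sample = "so rude \U0001F620"
print(emoji_to_text(sample))
```
]]></preformat>
      <p>The emoji is turned into an ordinary token, so fastText can learn an embedding for it like for any other word in the tweet.</p>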
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>In this paper, we investigate whether enriching tweets with text generated from
the emojis that accompany them can improve the classification accuracy of our classifier. Table 1
presents the results of our investigation. Run 1 - UBCS is our baseline run, which does not
include text generated from emojis. This baseline run performed poorly compared to the other
runs in terms of Macro F1, which was the official evaluation measure for the HASOC
2021 binary classification task. Run 2 - UBCS is our best run, with a Micro F1 score of 0.7070.
For this run, we kept all the parameters used in our baseline run (Run 1 - UBCS) fixed and
enriched the tweets in the training and test sets with text generated from emojis. The results of our investigation
suggest that incorporating emotions as text from emojis can improve the classification accuracy for
hate speech or offensive content on social media. In our third run, we attempted to improve
on the classification accuracy of our second run (Run 2 - UBCS) using the parameters that
gave the best classification accuracy on our training set. In particular, we varied the number of epochs and
the learning rate. However, this degraded the classification accuracy on the
test set.</p>
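      <p>Since both Macro F1 and Micro F1 are mentioned above, the following Python sketch shows how the two measures differ for a two-class task. The function names and the toy gold and predicted labels are assumptions made for illustration and are not taken from Table 1. Macro F1 averages the per-class F1 scores, so a classifier that ignores one class is penalised, whereas Micro F1 for single-label data equals plain accuracy.</p>
      <preformat><![CDATA[
```python
def f1_for_class(gold, pred, cls):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, classes=("HOF", "NOT")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def micro_f1(gold, pred):
    """With one label per tweet, micro-averaged F1 equals accuracy."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


# Toy labels: the classifier predicts HOF for every tweet.
gold = ["HOF", "HOF", "HOF", "NOT"]
pred = ["HOF", "HOF", "HOF", "HOF"]
print(round(macro_f1(gold, pred), 4), micro_f1(gold, pred))
```
]]></preformat>
      <p>On this toy input the Micro F1 is 0.75, while the Macro F1 is about 0.43 because the NOT class is never predicted.</p>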
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>The main finding of this study is that enriching tweets with text generated from emojis can improve the classification
accuracy of our binary classifier for identifying hate speech or offensive content in social
media tweets. Evidence from previous studies suggests that BERT-based models produce better performance.
This is also evidenced by the overall performance of the teams that participated in this year's task.
Further studies need to be carried out to validate whether emojis can significantly
improve the classification accuracy of a BERT-based binary classifier built to identify hate speech or
offensive content in social media tweets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>
          , in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>
          , volume
          <volume>2517</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>190</lpage>
          . URL: http://ceur-ws.org/Vol-2517/T3-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klenner</surname>
          </string-name>
          ,
          <article-title>Overview of GermEval task 2, 2019 shared task on the identification of offensive language</article-title>
          , in:
          <source>Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019)</source>
          , German Society for Computational Linguistics &amp; Language Technology, Erlangen, Germany,
          <year>2019</year>
          , pp.
          <fpage>354</fpage>
          -
          <lpage>365</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)</article-title>
          , in:
          <source>Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota, USA,
          <year>2019</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          . URL: https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paraschiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Cercel</surname>
          </string-name>
          ,
          <article-title>UPB at GermEval-2019 task 2: BERT-based offensive language classification of German tweets</article-title>
          , in:
          <source>Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019)</source>
          , German Society for Computational Linguistics &amp; Language Technology, Erlangen, Germany,
          <year>2019</year>
          , pp.
          <fpage>398</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Bashar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nayak</surname>
          </string-name>
          ,
          <article-title>QutNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language</article-title>
          , in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>
          , volume
          <volume>2517</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>245</lpage>
          . URL: http://ceur-ws.org/Vol-2517/T3-8.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <article-title>3idiots at HASOC 2019: Fine-tuning transformer neural networks for hate speech identification in Indo-European languages</article-title>
          , in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
          <source>Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019</source>
          , volume
          <volume>2517</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>213</lpage>
          . URL: http://ceur-ws.org/Vol-2517/T3-4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech</article-title>
          , in:
          <source>FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021</source>
          , ACM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages</article-title>
          , in:
          <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2021</year>
          . URL: http://ceur-ws.org/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
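          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Bag of tricks for efficient text classification</article-title>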
          ,
          <source>arXiv preprint arXiv:1607.01759</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <article-title>An algorithm for suffix stripping</article-title>
          ,
          <source>Program</source>
          <volume>14</volume>
          (
          <year>1980</year>
          )
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Böhning</surname>
          </string-name>
          ,
          <article-title>Multinomial logistic regression algorithm</article-title>
          ,
          <source>Annals of the Institute of Statistical Mathematics</source>
          <volume>44</volume>
          (
          <year>1992</year>
          )
          <fpage>197</fpage>
          -
          <lpage>200</lpage>
          . URL: https://ideas.repec.org/a/spr/aistmt/v44y1992i1p197-200.html. doi:10.1007/BF00048682.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>