<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Offensive Language from Manglish Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sara Renjit</string-name>
          <email>sararenjit.g@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumam Mary Idicula</string-name>
          <email>sumam@cusat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Offensive Language, Social Media Texts, Code-Mixed, Embeddings, Manglish</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Cochin University of Science and Technology</institution>
          ,
          <addr-line>Kerala</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>With the popularity of social media, communication through blogs, Facebook, Twitter, and other platforms has increased. Initially, English was the only medium of communication; now users can communicate in any language. This has led people to mix English with their native language. Some comments are written in another language transliterated into Roman script, while others use the native script of the intended language. Identifying sentiment and offensive content in such code-mixed tweets is therefore a necessary task. We present a working model submitted for Task 2 of the sub-track HASOC Offensive Language Identification-DravidianCodeMix at the Forum for Information Retrieval Evaluation (FIRE), 2020. It is a message-level classification task. In our approach, an embedding-based classifier labels comments as offensive or not offensive. We applied this method to the Manglish dataset provided with the sub-track.</p>
      </abstract>
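      <kwd-group>
        <kwd>Offensive Language</kwd>
        <kwd>Social Media Texts</kwd>
        <kwd>Code-Mixed</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Manglish</kwd>
      </kwd-group>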
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As code-mixing has become very common in present-day communication media, detecting
offensive content in code-mixed tweets and comments is a crucial task [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Systems
developed to identify sentiment in monolingual text are not always suitable in a
multilingual context. Hence, we require efficient methods to classify offensive and non-offensive
content in multilingual texts. In this context, two tasks are part of the HASOC FIRE 2020
sub-track [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The first task deals with the message-level classification of code-mixed YouTube
comments in Malayalam, and the second task deals with the message-level classification of
tweets or YouTube comments in Tanglish and Manglish (Tamil and Malayalam written
using Roman characters). Tamil and Malayalam are Dravidian languages spoken in
South India [
        <xref ref-type="bibr" rid="ref4">4, 5, 6, 7</xref>
        ].
      </p>
      <p>The rest of the paper is organized as follows. Section 2 presents related work on
offensive content identification. Section 3 describes the task and the dataset.
Section 4 explains the methodology used. Section 5 reports the experimental details and
evaluation results. Finally, Section 6 concludes the work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>We discuss work on offensive content identification from the past few years.
Offensive content detection in tweets has featured as a shared task at several conferences. In 2019,
SemEval (Semantic Evaluation) Task 6 [8] comprised three sub-tasks, the first of which was the
identification of offensive and non-offensive comments in English tweets. The dataset used was
OLID. It has 14100 tweets annotated using a hierarchical annotation model: the training set
has 13240 tweets, and the test set has 860 tweets. Participants used different methods, such as Convolutional
Neural Networks (CNN), Long Short-Term Memory (LSTM), LSTM with attention, Embeddings
from Language Models (ELMo), and Bidirectional Encoder Representations from
Transformers (BERT) based systems. A few teams also attempted traditional machine learning approaches
such as logistic regression and support vector machines (SVM) [8]. A new corpus developed for
sentiment analysis of code-mixed text in Malayalam-English is detailed in [9]. SemEval
2020 presented multilingual offensive language identification, named
OffensEval 2020, as a task in five languages: English, Arabic, Danish, Greek, and Turkish [10].
The OLID dataset mentioned above was extended with more data in English and other languages.
Pretrained embeddings from BERT transformers, ELMo, GloVe, and Word2vec were used with
models such as BERT and its variants, CNN, RNN, and SVM.</p>
      <p>
        The same task was conducted for Indo-European languages in FIRE 2019 for English, Hindi,
and German. The dataset was created by collecting samples from Twitter and Facebook for
all three languages. Different models, such as LSTM with attention, CNN with Word2vec
embeddings, and BERT, were used for this task. In some cases, for languages other than
English, the top performance came from traditional machine learning models rather than deep
learning methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Automatic approaches to hate speech detection also include keyword-based approaches
and a TF-IDF based multi-view SVM, as described in [11].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description &amp; Dataset</title>
      <p>The HASOC sub-track, Offensive Language Identification of Dravidian CodeMix [12] [13] [14],
consists of two tasks: message-level classification of YouTube comments in code-mixed Malayalam
(Task 1) and message-level classification of Twitter comments in Tamil and Malayalam written
using Roman characters (Task 2). This sub-track continues the HASOC 2019 task
[15]. Corpus creation and validation are detailed in [16]. Orthographic
information was also utilized while creating the corpus, as mentioned in [17]. This paper discusses
the methods used for the Task 2 classification of Twitter comments in Manglish text. The training
dataset consists of comments in two different classes, as shown in Table 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Methods</title>
      <p>We present three submissions, each experimenting with a different method based on the general
system design shown in Figure 1. The offensive language identification system consists of the
following stages:</p>
      <list list-type="order">
        <list-item>
          <p>Preprocessing: This stage removes English stopwords present in the Manglish comment texts, URLs with the prefix “www” or “https”, usernames with the prefix “@”, the hash in hashtags, repeated characters, and unwanted numbers. All text is converted to lowercase and tokenized into words. The tweet-preprocessor package (https://pypi.org/project/tweet-preprocessor) is used to remove hashtags, URLs, and emojis.</p>
        </list-item>
        <list-item>
          <p>Comment representation: Here, we embed the comments using two embedding mechanisms:</p>
          <list list-type="bullet">
            <list-item>
              <p>Keras embedding (https://keras.io/api/layers/core_layers/embedding): we represent the sentences using a one-hot representation, pad them to a uniform length, and pass them to a Keras embedding layer to produce 50-dimensional sentence representations.</p>
            </list-item>
            <list-item>
              <p>Paragraph vector: an unsupervised framework that learns distributed representations from texts of variable length [18]. The paragraph vector algorithm builds on Word2vec-based word vector representations [19].</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Classification: In this step, the comments, represented as n-dimensional vectors, are used to train classifiers with the following configurations:</p>
          <list list-type="bullet">
            <list-item>
              <p>System A: A classifier with an LSTM layer with recurrent dropout (0.2), followed by a dense layer with sigmoid activation. It is trained with binary cross-entropy loss for 5 epochs to classify comments as offensive or not offensive.</p>
            </list-item>
            <list-item>
              <p>System B: A classifier with three dense layers with ReLU activation, followed by a final dense layer with sigmoid activation. It is trained with binary cross-entropy loss for 50 epochs.</p>
            </list-item>
            <list-item>
              <p>System C: A combination of the two classifiers. We combine the predictions from both classifiers using a decision function to produce the third submission. Prob(X) denotes the probability value output by System X, and Pred(X) denotes the predicted class of System X.</p>
            </list-item>
          </list>
        </list-item>
      </list>
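      <p>The preprocessing stage (stage 1) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function name preprocess, the regular expressions, the deliberately tiny stopword set, and the rule that collapses a run of three or more identical characters to a single character are all assumptions made for the example. The submitted system also relies on the tweet-preprocessor package, which is not used here.</p>
      <preformat>
```python
import re

# Hypothetical, deliberately tiny stopword set; a real system would use a full list.
ENGLISH_STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "this"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # URLs starting with http(s):// or www.
USER_RE = re.compile(r"@\w+")                   # usernames prefixed with "@"
NUMBER_RE = re.compile(r"\d+")                  # digit runs
REPEAT_RE = re.compile(r"(.)\1{2,}")            # a character repeated 3 or more times
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # anything that is not a-z or whitespace


def preprocess(comment: str) -> list[str]:
    """Clean one comment and return its lowercase word tokens."""
    text = comment.lower()
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = text.replace("#", " ")       # keep the hashtag word, drop the hash sign
    text = NUMBER_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1", text)   # assumption: collapse long runs to one character
    text = NON_LETTER_RE.sub(" ", text) # also strips emojis and punctuation
    return [tok for tok in text.split() if tok not in ENGLISH_STOPWORDS]


if __name__ == "__main__":
    print(preprocess("@user Ithu sooooper padam aanu!!! https://t.co/abc #Mass 2020"))
```
      </preformat>
      <p>URLs are removed before digits so that numbers inside a link do not leave fragments behind. The stopword filter runs last, on the already tokenized words.</p>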
      <p>Decision function:</p>
      <preformat>
if Pred(A) == Pred(B) then
    Pred(C) = Pred(A)
else if Prob(A) + Prob(B) &gt; 1 then
    Pred(C) = “Offensive”
else
    Pred(C) = “Not Offensive”
end if
</preformat>
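      <p>The decision function translates directly into Python. The sketch below is illustrative only: the function names predict_label and combine_predictions are hypothetical, Prob(X) is assumed to be the sigmoid output interpreted as the probability of the offensive class, and Pred(X) is assumed to come from thresholding that probability at 0.5 (the threshold is not stated in the text).</p>
      <preformat>
```python
OFFENSIVE = "Offensive"
NOT_OFFENSIVE = "Not Offensive"


def predict_label(prob: float, threshold: float = 0.5) -> str:
    """Pred(X): label from one system's probability (the threshold is an assumption)."""
    return OFFENSIVE if prob >= threshold else NOT_OFFENSIVE


def combine_predictions(prob_a: float, prob_b: float) -> str:
    """Pred(C): combine Systems A and B following the decision function."""
    pred_a = predict_label(prob_a)
    pred_b = predict_label(prob_b)
    if pred_a == pred_b:
        return pred_a          # both systems agree
    if prob_a + prob_b > 1:
        return OFFENSIVE       # disagreement, summed probability above 1
    return NOT_OFFENSIVE       # disagreement, summed probability at most 1


if __name__ == "__main__":
    print(combine_predictions(0.9, 0.3))
    print(combine_predictions(0.6, 0.2))
```
      </preformat>
      <p>Under these assumptions, the sum exceeds 1 exactly when the mean of the two probabilities exceeds 0.5, so a disagreement is resolved in favour of the system whose output lies farther from 0.5.</p>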
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>The proposed system (https://github.com/SaraRenG/Code1) is trained on 4000 comments from the training set and tested on 1000
comments. We use precision, recall, and F1-score as evaluation metrics; the F1-score balances
precision and recall. Because this is an imbalanced binary classification task, the weighted
average F1-score is the most informative figure: each class's score is weighted by the number of
samples of that class in the data, so the average accounts for class imbalance. We calculate these
metrics using the classification report of the Scikit-Learn package
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).
Table 2 shows the performance of System A, based on Keras embedding and an LSTM layer; Table 3
shows the results of System B, based on document embedding using Doc2Vec; and Table 4 shows the
results of the combined classifier, System C. All three systems show moderate performance, which
leaves scope for improvement.</p>
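      <p>The weighted average F1-score can be made concrete with a short calculation. The sketch below computes it in plain Python for illustration; the function name weighted_f1 and the toy labels are assumptions and are not taken from the task data. In practice, the same quantity is reported in the “weighted avg” row of scikit-learn's classification_report.</p>
      <preformat>
```python
from collections import Counter


def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Support-weighted mean of the per-class F1-scores."""
    support = Counter(y_true)   # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total   # weight each class by its support
    return score


# Toy example with an imbalanced label distribution (made-up data).
y_true = ["NOT", "NOT", "NOT", "NOT", "OFF"]
y_pred = ["NOT", "NOT", "NOT", "OFF", "OFF"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.819
```
      </preformat>
      <p>In the toy example, the majority class has an F1-score of about 0.857 and the minority class about 0.667; weighting by supports of 4 and 1 gives about 0.819, closer to the majority-class score than the unweighted mean would be.</p>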
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents offensive content identification systems for Manglish tweets and comments,
developed as part of the HASOC sub-track at FIRE 2020. We implemented simple methods using
sentence representations and binary classification with neural networks. The main challenges
in this task are representing text in Manglish (Malayalam written using Roman
characters) and embedding it without losing much information; we aim to improve both in future
attempts.</p>
      <p>[5] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Wordnet gloss translation for under-resourced languages using multilingual neural machine translation, in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, 2019, pp. 1–7.</p>
      <p>[6] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Comparison of Different Orthographies for Machine Translation of Under-resourced Dravidian Languages, in: 2nd Conference on Language, Data and Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.</p>
      <p>[7] B. R. Chakravarthi, P. Rani, M. Arcan, J. P. McCrae, A survey of orthographic information in machine translation, arXiv e-prints (2020) arXiv–2008.</p>
      <p>[8] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval), in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 75–86.</p>
      <p>[9] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.</p>
      <p>[10] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020), arXiv preprint arXiv:2006.07235 (2020).</p>
      <p>[11] S. MacAvaney, H.-R. Yao, E. Yang, K. Russell, N. Goharian, O. Frieder, Hate speech detection: Challenges and solutions, PLoS ONE 14 (2019) e0221152.</p>
      <p>[12] B. R. Chakravarthi, M. A. Kumar, J. P. McCrae, P. B, S. KP, T. Mandl, Overview of the track on “HASOC-Offensive Language Identification-DravidianCodeMix”, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE ’20, 2020.</p>
      <p>[13] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE ’20, 2020.</p>
      <p>[14] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.</p>
      <p>[15] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, ACM, 2019, pp. 14–17.</p>
      <p>[16] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.</p>
      <p>[17] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020. URL: http://hdl.handle.net/10379/16100.</p>
      <p>[18] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, 2014, pp. 1188–1196.</p>
      <p>[19] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vegupatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Named entity recognition for code-mixed Indian corpus using meta embedding</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          , FIRE '19,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2019</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          . URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Improving wordnets for under-resourced languages using machine translation</article-title>
          ,
          <source>in: Proceedings of the 9th Global WordNet Conference (GWC 2018)</source>
          ,
          <year>2018</year>
          , p.
          <fpage>78</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>