<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Zyy1510@HASOC-Dravidian-CodeMix-FIRE2020: An Ensemble Model for Offensive Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yueying Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaobing Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports the zyy1510 team's work in the HASOC-Offensive Language Identification-Dravidian Code-Mixed FIRE 2020 shared task, whose goal is to identify offensive language in the code-mixed text of comments/posts in Dravidian languages (Malayalam-English and Tamil-English) collected from social media. This is a message-level label classification task: given a tweet or a YouTube comment in code-mixed text, systems must accurately classify it as offensive or not-offensive. We propose an ensemble model that combines different models to improve the F-1 score of the framework. The ensemble model is a combination of a BiLSTM (Bidirectional LSTM), an LSTM+Convolution, and a CNN (Convolutional Neural Network) model. The proposed model achieved an F-1 of 0.93 (ranked 3rd) in Malayalam-English of task1, and F-1 scores of 0.87 (ranked 3rd) and 0.67 (ranked 9th) in Tamil-English and Malayalam-English of task2, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>poyi kananam super abinayam. The English words ‘super’ and ‘family’ are intra-sententially
code-mixed, and a word such as ‘familyaayi’ is a neologism that combines English and Malayalam; this is
another kind of mixing, called intra-word switching, that occurs at the word level [3]. And in
Malayalam-English: Enthu oola trailer aanu ithu. poor dialogue delivery. This is an example of
inter-sentential code-mixing.</p>
      <p>
        This task consists of two subtasks, each of which is a message-level label classification task. Given a
tweet or YouTube comment in Malayalam (not written using Roman characters, in task1), or in
Tanglish and Manglish (Tamil and Malayalam written using Roman characters, in task2),
systems have to classify it as offensive or not-offensive [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As is well known, systems
that are trained on monolingual data, such as English, fail on code-mixed data because of the complexity
of switching between different language levels in text.
      </p>
      <p>We propose an ensemble model that combines different models, a BiLSTM
(Bidirectional LSTM), an LSTM+Convolution, and a CNN (Convolutional Neural Network) model, which
can improve the F-1 score from different aspects. We discuss this model in more detail in the
system description section. We have tested our system on the test data in Dravidian languages
released for the task. The model achieved an F-1 of 0.93 (ranked 3rd) in Malayalam-English of
task1, and F-1 scores of 0.87 (ranked 3rd) and 0.67 (ranked 9th) in Tamil-English and Malayalam-English
of task2, respectively. Our code is available on GitHub1.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>As far as we know, this is the first shared task on offensive language in Dravidian code-mixed
text. The goal of this task is to identify offensive language in the code-mixed dataset of
comments/posts in Dravidian languages (Malayalam-English and Tamil-English) collected from
social media2. The corpora available for code-mixing are small in themselves, and the Tamil and Malayalam
languages are even less common. Some work on code-mixing in other languages serves as a
reference.</p>
      <p>
        Gupta et al. [5] developed a supervised system based on a conditional random field classifier
which assigns coarse-grained and fine-grained PoS tags for English-Hindi. Zhang et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] demonstrated that a feed-forward network with a simple globally constrained decoder can
accurately and quickly annotate 100 languages and 100 pairs of code-mixed and single-language
texts on English-Bengali and English-Telugu. Dahiya et al. [7] introduced curriculum
learning strategies for semantic tasks in code-mixed Hindi-English texts. Vyas et al. [8] described
their initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text
and explored language identification, back-transliteration, normalization and POS tagging of
this data. Solorio et al. [9] described language identification in the first shared task on
code-switched data, held at EMNLP 2014. Prabhu et al. [10] introduced learning sub-word
level representations, and they also provided a usable dataset of Hindi-English code-mixed
text. Choudhary et al. [1
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a new approach, called sentiment analysis of code-mixed text (SACMT),
which uses contrastive learning to classify sentences into the corresponding sentiments
– positive, negative, or neutral.
      </p>
      <p>1https://github.com/TroubleGilr/HASOC-Dravidian-CodeMix—FIRE-2020
2https://sites.google.com/view/dravidian-codemix-fire2020/overview</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The organizers provide YouTube comments in code-mixed Malayalam-English, where Malayalam
is in non-Roman script, for task1; task2 contains Tamil-English and Malayalam-English
(Tamil and Malayalam written using Roman characters). Both subtasks use two kinds of labels,
offensive or not-offensive. No labels are provided for the test texts and no external data is used.
Detailed statistics are given in Table 1.</p>
      <p>The organizers provide two subtasks, in which task1 only contains Malayalam-English
code-mixed text, while task2 includes Tamil and Malayalam code-mixed text. The NOT/OFF counts of the training
set and validation set in task1 are 2633/567 and 328/72, respectively. Task2 does not
distinguish between a training set and a validation set; the NOT/OFF counts of the Tamil and Malayalam
training sets are 2020/1980 and 2047/1953, respectively, and we automatically separate
20% of the training set as the validation set. More data details can be found in the papers [3] [12], and
some of the processing of code-mixed text can be seen in [13].</p>
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <sec id="sec-4-1">
        <title>4.1. Pre-processing</title>
        <p>The tweets or YouTube comments are originally in Malayalam using non-Roman script in
task1 and in Malayalam written using Roman characters in task2. The tweets or comments are
preprocessed in the following ways before being fed to the training stage:
1. Transliteration: Non-English words in task1 are converted into Roman script by phonetic
transliteration. The transliteration API of Google is used for this, while English words are
not changed, and all the words in task2 remain the same.</p>
        <p>2. Out of order: We randomly scramble the order of all the datasets to improve the accuracy
of the prediction.</p>
        <p>3. Noise removal: Usernames (annotated as @username) and emoticons present in the
tweets are removed altogether, while hashtags are left as they are before the text is fed to the model.</p>
        <p>4. Label encoding: Categorical sentiment values were label-encoded as 0 and 1 for offensive and
not-offensive, respectively. This was done to give a numeric representation to the categorical
data.</p>
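        <p>The noise-removal and label-encoding steps above can be sketched in a few lines of Python. The regex patterns, the emoji ranges and the helper names are illustrative assumptions, not the authors' code.</p>

```python
import re

# 0 = offensive, 1 = not-offensive, as in the label-encoding step
LABEL_MAP = {"OFF": 0, "NOT": 1}

def clean_comment(text: str) -> str:
    text = re.sub(r"@\w+", "", text)  # drop @username mentions
    # drop emoticons/emoji (basic Unicode emoji blocks; an assumption)
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    return re.sub(r"\s+", " ", text).strip()  # hashtags are kept as-is

print(clean_comment("@user enthu oola trailer #movie \U0001F600"))
# -> "enthu oola trailer #movie"
```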
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Architecture</title>
        <p>The model consists of three parts: a basic CNN (Convolutional Neural Network), an LSTM +
Convolution, and a BiLSTM (Bidirectional LSTM). These three modules are ensembled as our
classifier, as shown in Figure 1.</p>
        <p>1. LSTM+Conv: The module consists of a convolutional layer with a kernel size of 3, followed
by a global maximum pool layer, an LSTM layer and a dense layer [14], the details of which are
shown in Figure 2(a). The CNN, to some extent, takes into account the ordering of the words and
the context in which each word appears.</p>
        <p>2. CNN: This module uses 3 different convolutional layers, with kernel sizes of 3, 4 and 5,
connected to the embedding layer. The output of each layer is concatenated and then passed to a
global maximum pool layer, followed by two dense layers, as shown in Figure 2(b). The idea
behind using several filter sizes is to capture contexts of varying lengths. The convolution layers
are used to extract local features around each word window, while the global maximum pool
layer is used to extract the essential features in the feature map.</p>
        <p>3. BiLSTM: In this module, a BiLSTM [15] layer is used, followed by a convolutional layer
with a kernel size of 3. The output of this layer goes through two different pooling layers, global
average pooling and global maximum pooling. Their outputs are concatenated and then passed to a dense
layer. Figure 2(c) shows the details of the model.</p>
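        <p>The three modules can be sketched in Keras as follows. Only the kernel sizes (3 for LSTM+Conv and BiLSTM; 3, 4 and 5 for the CNN), the vocabulary size of 20000 and the sequence length of 50 come from the paper; the embedding dimension, all layer widths, and the exact placement of the pooling layers are assumptions made for this sketch.</p>

```python
# Hedged Keras sketch of the three ensemble members in Figure 2.
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, EMB = 20000, 50, 100  # EMB is an assumed embedding size

def lstm_conv():
    # Figure 2(a): convolution (kernel 3), pooling, LSTM, dense output
    inp = layers.Input(shape=(SEQ_LEN,))
    x = layers.Embedding(VOCAB, EMB)(inp)
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)  # local pool so the LSTM still sees a sequence
    x = layers.LSTM(64)(x)
    out = layers.Dense(2, activation="softmax")(x)
    return models.Model(inp, out)

def cnn():
    # Figure 2(b): parallel convolutions with kernels 3, 4, 5; concatenated,
    # globally max-pooled, then two dense layers
    inp = layers.Input(shape=(SEQ_LEN,))
    emb = layers.Embedding(VOCAB, EMB)(inp)
    convs = [layers.Conv1D(64, k, padding="same", activation="relu")(emb)
             for k in (3, 4, 5)]
    x = layers.Concatenate()(convs)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(2, activation="softmax")(x)
    return models.Model(inp, out)

def bilstm():
    # Figure 2(c): BiLSTM, convolution (kernel 3), global average and max
    # pooling concatenated, dense output
    inp = layers.Input(shape=(SEQ_LEN,))
    x = layers.Embedding(VOCAB, EMB)(inp)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    avg = layers.GlobalAveragePooling1D()(x)
    mx = layers.GlobalMaxPooling1D()(x)
    out = layers.Dense(2, activation="softmax")(layers.Concatenate()([avg, mx]))
    return models.Model(inp, out)

print([m.output_shape for m in (lstm_conv(), cnn(), bilstm())])  # three 2-class heads
```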
        <p>To achieve a better F-1 score, we build an ensemble model that utilizes the advantages of
these individual models. The text processed in the pre-processing stage is input to all models,
and the output after training is denoted as:
O_j = ∑_{n=1}^{3} O_nj   (1)
computed for each sentence i, i = 1, ..., number of sentences.</p>
        <p>The final output matrix was calculated using the following formula:
O = ((O_10, O_20, O_30), (O_11, O_21, O_31))   (2)</p>
        <p>O_nj represents the probability of class j for the nth model (here n is the number of the model
stated above), where n = 1, 2, 3 denotes the model and j = 0, 1 denotes the category (0: offensive,
1: not-offensive) in O. After the calculation, each sentence is assigned the class with the
maximum summed probability.</p>
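        <p>A minimal NumPy sketch of the ensemble step in equations (1) and (2): the class probabilities from the three models are summed per class, and the class with the maximum sum is assigned. The probability values are made up for illustration.</p>

```python
import numpy as np

# O[n, j]: probability of class j from model n
# (n = 1..3: BiLSTM, LSTM+Conv, CNN; j = 0: offensive, 1: not-offensive)
O = np.array([[0.30, 0.70],   # BiLSTM
              [0.55, 0.45],   # LSTM+Conv
              [0.40, 0.60]])  # CNN

summed = O.sum(axis=0)          # per-class total over the three models
label = int(np.argmax(summed))  # class with the maximum summed probability
print(summed, label)            # -> [1.25 1.75] 1
```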
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments Detail</title>
      <p>The officially provided dataset in task1 is divided into three parts - training, validation, and
test sets - but task2 has no validation set. We randomly divide the training data with an 80-20 split
to get the final training and validation data for task2. In this paper, we propose an ensemble
model and train it on the training set. We then tested our system on the test data. Our
model achieved an F-1 of 0.93 (ranked 3rd) in Malayalam-English of task1 and F-1 scores of 0.87 (ranked
3rd) and 0.67 (ranked 9th) in Tamil-English and Malayalam-English of task2, respectively. Details
are shown in Table 2.</p>
      <p>Through experimental comparison, we find that training for 7, 5 and 4 epochs in the BiLSTM, the
LSTM+Convolution and the CNN model, respectively, gives the best accuracy, with a batch
size of 128, a vocabulary size of 20000, a text sequence length of 50, sparse categorical cross-entropy loss
and a learning rate of 0.01.</p>
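      <p>The shuffle and the 80-20 train/validation split for task2 can be sketched as follows; the seed and the toy data are illustrative assumptions.</p>

```python
import random

# toy (text, label) pairs standing in for the task2 training data
data = [("comment %d" % i, i % 2) for i in range(4000)]

rng = random.Random(0)  # assumed seed, for reproducibility
rng.shuffle(data)       # randomly scramble the order of the dataset

split = int(0.8 * len(data))          # 80-20 split point
train, val = data[:split], data[split:]
print(len(train), len(val))           # -> 3200 800
```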
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this paper, our detailed approach to offensive language detection in Dravidian
languages is described. We propose an ensemble model over three distinct modules that on their
own already perform well on the task; the ensemble model, however, is able to capture a particular
sentiment exceptionally well. We achieve a score of 0.93, just 0.02 below the first rank. In the
future, we plan to incorporate emotional information into the system, and a voted ensemble may be
attempted to improve the score. BERT is also one of the approaches we are considering.</p>
      <p>[10] A. Prabhu, A. Joshi, M. Shrivastava, V. Varma, Towards sub-word level compositions for
sentiment analysis of Hindi-English code mixed text, 2016.
[11] N. Choudhary, R. Singh, I. Bindlish, M. Shrivastava, Sentiment analysis of code-mixed
languages leveraging resource rich languages, 2018.
[12] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint
Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)
and Collaboration and Computing for Under-Resourced Languages (CCURL), European
Language Resources association, Marseille, France, 2020, pp. 202-210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[13] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation
of under-resourced languages, Ph.D. thesis, NUI Galway, 2020. URL: http://hdl.handle.net/10379/16100.
[14] R. Sawhney, M. Ayyar, R. R. Shah, Did you offend me? Classification of offensive tweets in
Hinglish language, 2018, pp. 138-148. doi:10.18653/v1/W18-5118.
[15] G. Xu, Y. Meng, X. Qiu, Z. Yu, X. Wu, Sentiment analysis of comment texts based on BiLSTM,
IEEE Access 7 (2019) 51522-51532. doi:10.1109/ACCESS.2019.2909919.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          , P. B,
          <string-name>
            <surname>S. KP</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the track on 'HASOC-Offensive Language Identification-DravidianCodeMix'</article-title>
          , in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE '
          <volume>20</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>D. S Nair</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. R R</surname>
          </string-name>
          , J. Jayan,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elizabeth</surname>
          </string-name>
          , SentiMa - sentiment
          <source>extraction for Malayalam</source>
          ,
          <year>2014</year>
          . doi:10.1109/ICACCI.2014.6968548.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
          , in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          , P. B,
          <string-name>
            <surname>S. KP</surname>
          </string-name>
          , T. Mandl,
          <article-title>Overview of the track on 'HASOC-offensive languageidentification-dravidiancodemix'</article-title>
          ,
          <source>in: Working Notes of the Forum for Information RetrievalEvaluation(FIRE</source>
          <year>2020</year>
          ). CEUR Workshop Proceedings. In: CEUR-WS. org, Hyderabad,India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tripathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <article-title>SMPOST: Parts of speech tagger for code-mixed Indic social media text (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riesa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gillick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baldridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <article-title>A fast, compact, accurate model for language identification of code-mixed text</article-title>
          ,
          <year>2018</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>337</lpage>
          . doi:10.18653/v1/D18-1030.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dahiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Battan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Curriculum learning strategies for Hindi-English code-mixed sentiment analysis</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <article-title>POS tagging of English-Hindi code-mixed social media content</article-title>
          ,
          <year>2014</year>
          , pp.
          <fpage>974</fpage>
          -
          <lpage>979</lpage>
          . doi:10.3115/v1/D14-1105.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maharjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghoneim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hawwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alghamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Overview for the first shared task on language identification in code-switched data</article-title>
          ,
          <year>2014</year>
          . doi:10.3115/v1/W14-3907.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>