<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Varsha Pathak</string-name>
          <email>varsha.pathak@imr.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Joshi</string-name>
          <email>joshmanish@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasad Joshi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Mundada</string-name>
          <email>monicamundada5@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanmay Joshi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Digitalization, Copenhagen Business School</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Brihan Maharashtra College of Commerce</institution>
          ,
          <addr-line>Pune, MS</addr-line>
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Management and Research, Jalgaon, affiliated to KBC North Maharashtra University</institution>
          ,
          <addr-line>Jalgaon, MS</addr-line>
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>JET's Z. B. College, Dhule, affiliated to KBC North Maharashtra University</institution>
          ,
          <addr-line>Jalgaon, MS</addr-line>
          <country country="IN">India</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>School of Computer Sciences, KBC North Maharashtra University</institution>
          ,
          <addr-line>Jalgaon, MS</addr-line>
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task “Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC)” at the Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. The HASOC organizers shared datasets of two Dravidian languages, viz. Malayalam and Tamil, of about 4000 observations each. These datasets are used to train the machine with different machine learning algorithms based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels, “offensive” and “not offensive”, and the machine is trained to classify such social media messages into these two categories. Appropriate n-gram feature sets, based on TF-IDF weights, are extracted to learn the specific characteristics of hate speech text messages. The reported experiments show that word, character and combined word-character n-gram features can identify the term patterns of offensive text content. As part of the HASOC shared task, test datasets were made available by the HASOC track organizers, and the best performing classification models developed for both languages were applied to them. The model with the highest accuracy on the Malayalam training dataset was used to predict the categories of the respective test data and obtained an F1 score of 0.77; similarly, the best performing model for Tamil obtained an F1 score of 0.87. This work received 2nd and 3rd rank in shared Task 2 for the Malayalam and Tamil languages respectively. The proposed system is named HASOC_kbcnmujal.</p>
      </abstract>
      <kwd-group>
        <kwd>Support Vector Classifier</kwd>
        <kwd>Multinomial Naive Bayes</kwd>
        <kwd>Logistic Regression</kwd>
        <kwd>Random Forest Classifier</kwd>
        <kwd>n-gram model</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social media has become a modern channel of public expression for people irrespective of
socio-economic boundaries. Due to the pandemic situation, even common people have
started looking at social media as a formal medium to remain connected with the masses. Though
such media are available mainly for constructive and creative expression, these days they
carry many negative and offensive expressions. Some people take advantage of
language-based social boundaries: they use absurd and hateful speech in their native
languages to hurt other communities. People using hate speech or offensive language,
intentionally or unintentionally, can hurt a popular person, a specific community or even
innocent people. If not detected in time, such messages can damage social health, and ignoring
them can inflame unhealthy issues that may turn into disastrous events at
some point in the future. There are cases where such offensive comments have posed
serious threats of communal disturbance. Identifying such content is today's need, and
significant work has been done for the English language.</p>
      <sec id="sec-1-1">
        <title>1.1. Transliterated Text</title>
        <p>India is a multi-state, multilingual nation. Each state has its own official spoken language and
respective script.</p>
        <p>The transliterated, Romanized text of a native language with English as the binding language
is termed code-mixed text. Many people use code-mixed text to create their social media
content. Most southern Indian languages have their origin in the Dravidian language family;
hence languages such as Malayalam of Kerala, Tamil of Tamil Nadu and Telugu of Andhra
Pradesh are commonly called Dravidian languages. It is a challenge for the research
community to trace and restrict offensive content in the native code-mixed text messages
of Dravidian languages.</p>
        <p>
          The "Hate Speech and Offensive Content Identification in Indo-European Languages
(HASOC)" track at the Forum for Information Retrieval Evaluation, 2019 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] was the first such initiative
as a shared task on offensive language. The HASOC track has further introduced a shared
task on Dravidian code-mixed text at the "Forum for Information Retrieval Evaluation, December
16-20, 2020, Hyderabad, India" [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], including the Malayalam and Tamil languages. The goal of Task
2 is to classify tweets into offensive or not-offensive categories.
        </p>
        <p>
          This paper presents the development and implementation of our model for Task 2. The Task 2
challenge provides the respective datasets in Romanized transliterated code form. The challenges
and related study of transliterated search in the context of Indian languages, with different
native scripts, can be found in literature such as [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The HASOC problem can be seen as an
extension of transliterated search. Similarly, on social media, people use Short Message
Service (SMS) style text, so "SMS-based transliterated code" is an upcoming challenge of Natural
Language Processing (NLP) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. With an interest in working on different extensions and
applications of this recent area of NLP, our team participated in this HASOC shared Task 2.
In this section we have discussed the need and relevance of the HASOC problem. Related
work on the detection of hate speech and offensive language is discussed in the
following subsection. The problem statement, brief information on the hate speech dataset and
the methodology applied in this work are discussed in Section 2. This is followed by
two sections on the experimental work and result analysis respectively.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Related Work</title>
        <p>
          Many researchers have published work on automated detection of hate speech and
offensive content. Malmasi and Zampieri [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] adopted a linear support vector classifier on word
skip-grams, Brown clusters and surface n-grams. Arup Baruah et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] worked with support vector
machine, BiLSTM and neural network models on TF-IDF features of character and word
n-grams, embeddings from language models, and GloVe and fastText embeddings. Saroj et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
applied traditional machine learning classifiers: Support Vector Machine and XGBoost. Nemanja
Djuric et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] used paragraph2vec and Continuous Bag of Words to build a neural
language model.
        </p>
        <p>
          The related study shows that significant work has also been done on detecting hate speech in
languages other than English. The system developed by Mubarak et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is based
on seed words, word unigrams and word bigrams, and detects abusive language in Arabic
social media. Another work, by Su et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], detects and rephrases profanity in Chinese.
Bharathi et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] used many machine learning classifiers to determine sentiment in
Malayalam-English code-mixed data.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition</title>
      <p>We propose a model that uses the best performing classifier on the given dataset. The best resulting
features are obtained by extracting language-specific and language-independent characteristics of
the tweet dataset. The approach applied in developing this model is explained in this
section. The statistical details of the hate speech dataset are described in the next subsection,
followed by the methodology and the experimental work in the respective subsections.</p>
      <sec id="sec-2-1">
        <title>2.1. Hate Speech Dataset</title>
        <p>
          The Task 2 dataset released for the HASOC shared task discussed above
consists of two CSV files of tweets: one of Manglish (Malayalam + English) and the other
of Tamglish (Tamil + English) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This training dataset has 3 columns, named
Tweet ID, Tweet text (or YouTube comment) and Label respectively. The Label column
has the value OFF, indicating offensive text, or NOT, indicating a non-offensive tweet. The
number of tweets in each file is around 4000, a good number for the experimental
work of training the machine with appropriate machine learning algorithms. The test
dataset was released in a later part of the HASOC Task 2 shared task and
consists of only the first two columns; the third column, i.e. the Label column, is missing. The machine,
after training in the first phase, has to predict the labels of the respective tweets. Approximately
1000 tweets are available in this dataset for each language. Table 1 presents the statistics
of the training and test datasets for both the Malayalam and Tamil languages. The
details of how these datasets were constructed and an overview of the shared HASOC tasks are
available in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Methodology</title>
        <p>In the experimental work, a supervised machine learning approach is applied. The tweets from
the labeled dataset are preprocessed to remove noisy elements from their text content.
An appropriate feature extraction model is developed that enables the machine to learn
offensive/non-offensive terms and respective patterns in the text.</p>
        <p>Finally, the performance of different classifiers on the extracted features is compared using
standard measures to develop our best performing proposed model, which we named the
HASOC_kbcnmujal system. Figure 1 shows the architectural view of the machine
learning approach of this system.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Data Pre-processing</title>
          <p>
            In general, social media users enjoy flexibility in forming their tweets or comments. They
do not worry about applying any specific grammar of the respective languages and generally
use their native language with casual expressions. Due to this flexibility in code-mixed
text [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], an appropriate pre-processing method is an important concern.
          </p>
          <p>In our system, unnecessary words and stop words are removed. Similarly, white spaces, digits, special
characters [@, #, %, $, ^, (, )] and extra spaces are eliminated to simplify the text messages. In
addition, the special tags @USER, @RT (retweet) and TAG are also removed.
Thus the tweet text is cleaned by applying the appropriate pre-processing as above. As a
result, both the Manglish and Tamglish datasets are ready for the feature extraction phase, which
is the next phase of the HASOC_kbcnmujal system.</p>
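          <p>A minimal sketch of this cleaning step; the regular expressions and the stop-word list are illustrative assumptions, not the exact ones used in the system:</p>

```python
import re

# Illustrative placeholder stop-word list (the real list is language specific).
STOPWORDS = {"the", "a", "oru"}

def clean_tweet(text):
    text = re.sub(r"@\w+|\bRT\b", " ", text)    # drop @USER / @RT-style tags
    text = re.sub(r"http\S+", " ", text)        # drop URLs
    text = re.sub(r"[@#%$^()\d]", " ", text)    # special characters and digits
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("@USER RT oru nalla video 100%"))  # -> "nalla video"
```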
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Features Extraction</title>
          <p>For feature extraction we applied two major methods, viz. TF-IDF and custom word
embedding. The details of these methods are given below.</p>
          <p>• Using TF-IDF: Term Frequency (TF) and Inverse Document Frequency (IDF)
are two important measures that reflect the specificity and relevance of terms with respect to
the information carried by the documents. TF-IDF-weighted n-grams are useful to capture the small
and localized syntactic patterns within text in a flexible language.</p>
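          <p>As a sketch, the three TF-IDF n-gram variants can be built with scikit-learn's TfidfVectorizer; the toy corpus and the n-gram ranges are illustrative assumptions:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

corpus = ["oru nalla video aanu", "ithu mosam comment aanu"]  # toy stand-in

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))

X_word = word_vec.fit_transform(corpus)   # word n-grams of order (1, n)
X_char = char_vec.fit_transform(corpus)   # character n-grams of order (1, n)
X_comb = hstack([X_word, X_char])         # combined word-char n-grams

print(X_word.shape[1], X_char.shape[1], X_comb.shape[1])
```

<p>As in the paper's feature counts, the character model yields far more features than the word model, and the combined model is the concatenation of the two.</p>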
          <p>We have applied three variations of the TF-IDF weighted n-gram model: "word n-grams of
order (1, n)", "character n-grams of order (1, n)" and the "combined word-char
n-grams" of both. The value of n can be 2 or more. These three TF-IDF n-gram models
produced 38536, 81191 and 119727 features respectively in the case of Malayalam. Similarly,
in the case of Tamglish, the same three feature models, with minor variation,
produced 117173, 325902 and 443075 features respectively. Figure 1 presents this feature
data for both languages.
• Using Custom Word Embedding: We apply another feature model, termed
"custom word embedding". For this, we extracted 15430 and 15292 unique words from
the Malayalam and Tamil datasets respectively. The length of the longest sentence, i.e. MaxLen,
is used for the custom embedding of all the sentences. For the given datasets,
MaxLen("Malayalam") = 65 and MaxLen("Tamil") = 64. In the next step, the length of all
the sentences is normalized to MaxLen by appending zeros at the end of
each sentence using the pad_sequences method from Keras (https://keras.io/). This results in a 13000-dimensional
vector for Malayalam and a 12800-dimensional vector for Tamil.</p>
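          <p>The zero-padding step can be sketched in plain Python (Keras's pad_sequences with padding="post" performs the same operation on integer-encoded sentences):</p>

```python
def pad_post(sequences, maxlen):
    # Append zeros so every integer-encoded sentence has length maxlen,
    # truncating any sentence longer than maxlen.
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]

encoded = [[4, 7], [1, 2, 3, 5, 9]]       # toy integer-encoded sentences
print(pad_post(encoded, 4))               # [[4, 7, 0, 0], [1, 2, 3, 5]]
```

<p>With MaxLen = 65 and 200-dimensional word vectors, flattening a padded sentence yields the 13000-dimensional vector mentioned above.</p>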
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Classifier Models</title>
          <p>
            We have adopted various classifiers such as SVC, MNB, LR, AdaBoost, DTC and RF. The
above-mentioned features are extracted from the training data. For the neural network model, we
used the "custom word embedding" feature set [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. 70% of the given training dataset is
used to train each classifier, and we evaluated the performance of these classifiers by
measuring classification accuracy, treating the remaining 30% of the given training
dataset as test data. The parameters of each classifier were varied to find
the best performing parametric values. For training and evaluation of these classifiers,
scikit-learn (http://scikit-learn.org/) is used. The best performing classifiers from this
experimental work are described in the next section.
          </p>
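          <p>A minimal sketch of this 70/30 evaluation protocol with scikit-learn; the toy tweets and labels are placeholders for the HASOC data, and the alpha value is the one reported later for MNB:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["nalla video", "adipoli speech", "mosam comment", "chee verupp",
         "super content", "waste talk", "kollam song", "mosam aanu",
         "nice one", "chee mosam"]
labels = ["NOT", "NOT", "OFF", "OFF", "NOT", "OFF", "NOT", "OFF", "NOT", "OFF"]

X = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit_transform(texts)

# 70% of the labelled data for training, the remaining 30% held out as test data.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

clf = MultinomialNB(alpha=0.6)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(acc)
```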
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental work</title>
      <p>To develop a best performing model for the HASOC Task 2 problem, a supervised machine
learning approach is applied. The designed machine is trained using different sets of
extracted features, and various classifier algorithms are tried with these feature sets.
The hyper-parameters of each classifier are tuned to find the best performing parameters of
the respective classifier. The same method is applied for both languages. For evaluation,
appropriate measures, viz. accuracy, precision, recall and F1 score, are applied. We now discuss
the experimental setup of some selected classifiers.</p>
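      <p>The hyper-parameter tuning described here can be sketched with scikit-learn's GridSearchCV; the toy corpus, labels and grid values are illustrative assumptions, not the exact grid used:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

texts = ["nalla video", "super speech", "mosam comment", "verupp aanu",
         "adipoli content", "waste talk", "kollam nice", "chee mosam"]
labels = ["NOT", "NOT", "OFF", "OFF", "NOT", "OFF", "NOT", "OFF"]

X = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit_transform(texts)

# Grid mirroring the parameters tuned in the text: kernel, gamma and C.
grid = GridSearchCV(SVC(),
                    {"kernel": ["linear", "rbf"],
                     "gamma": ["auto", "scale"],
                     "C": [1, 10, 20]},
                    cv=2)
grid.fit(X, labels)
print(grid.best_params_)
```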
      <p>
        • Support Vector Classifier (SVC): We used a linear SVC. SVC fits the data and returns a
best fitting hyperplane that divides the data points into two categories. It scales to a large
number of samples and has more flexibility in the choice of penalties and loss functions
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. For Malayalam as well as Tamil, we trained with all three TF-IDF
n-gram feature extraction models discussed in the previous section. For both languages,
we tuned the parameters: kernel over 'linear' and 'rbf', and gamma over 'auto' and 'scale'.
L2 regularization is used, and the hyper-parameter 'C' is also tuned with these features.
For the Tamil language, SVC trained on character n-grams gives the best result, with the
hyper-parameter kernel set to 'linear', gamma set to 'auto' and 'C' set to 20. For
both languages, we found that the "character n-grams" feature model increased the
accuracy of the classifier compared to the "word n-gram" and "combined
word-char n-gram" feature models.
• Multinomial Naive Bayes (MNB): This is a probabilistic model and a specialized
version of Naive Bayes. Simple Naive Bayes represents a document by the presence or
absence of a particular word, whereas Multinomial Naive Bayes explicitly represents the
document with word counts and adjusts the underlying calculations to identify them
from the document set [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. It works very well on small amounts of training data and
trains relatively fast compared to other models. We trained MNB using
the same set of TF-IDF features on the tweet datasets of both languages as mentioned above.
The hyper-parameter "alpha" was set to 0.6. The "combined word-character n-gram"
feature model gives the best accuracy for both the Manglish and Tamglish datasets.
• Logistic Regression (LR): This is a statistical model. It transforms its output into
probability values which can be mapped to two or more discrete classes. LR is used for
regression when the dependent variable is binary. We trained LR in the same way
as SVC and MNB. L2 regularization is used, with the hyper-parameter "C" set
to its default value of 1.0. The respective experiments show the highest accuracy for the
"combined word-character n-gram" feature model, for both languages.
• Ensemble Approach: We used a hard voting approach. Hard voting sums the predictions
for each class label from multiple models and predicts the class label with the maximum votes.
We combined the predictions of our top three models: SVC, MNB and LR. The
"combined word-char n-gram" features give the best accuracy for the datasets of both languages.
• Random Forest Classifier (RFC): This is an ensemble approach combining multiple
decision trees, producing them randomly without defining the rules [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We keep all
the hyper-parameters at their default settings, testing only n_estimators with
different values. We found that the "character n-gram" features and the "combined word-char
n-gram" features have nearly the same accuracy for both languages.
• Neural Network for Text Classification: A simple text classification neural network
model was created using Python's Keras library. Keras is one of the most popular and
commonly used deep learning libraries. It can be used to learn custom word embeddings or
to load pre-trained word embeddings. We used the Keras Sequential model
and added an embedding layer as its first layer. The Keras embedding layer takes 3
parameters as arguments, as shown below.
      </p>
      <p>keras.layers.Embedding(size_of_vocabulary, number_of_word_dimensions, length_of_longest_sentence)</p>
      <p>Here, for both datasets, size_of_vocabulary is simply the number of unique words,
i.e. 15430 (Malayalam) and 15292 (Tamil). At the time of training the model, we
rounded these values up to 15450 and 15300 respectively.</p>
      <p>The second parameter, number_of_word_dimensions, represents each word as a 200-dimensional
vector, and the third parameter, length_of_longest_sentence, is the length of the longest
sentence in the dataset, i.e. 65 (Malayalam) and 64 (Tamil).</p>
      <p>At the embedding layer, we get 3,090,000 trainable parameters for Malayalam and 3,060,000
for Tamil. The output of the embedding layer represents each word of a sentence
by a 200-dimensional vector. We flattened this embedding layer to get 13000- and
12800-dimensional vectors for Malayalam and Tamil respectively. The second layer, i.e. the
dense layer, has 1 neuron. Since ours is a binary classification problem, for both
languages we use the sigmoid activation function at the dense layer and train for 4 epochs. The Adam
optimizer and the binary cross-entropy loss function were used for training.</p>
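      <p>The parameter and dimension counts above follow from simple arithmetic on the rounded vocabulary sizes, the 200-dimensional embeddings and the MaxLen values:</p>

```python
embed_dim = 200
vocab = {"Malayalam": 15450, "Tamil": 15300}   # rounded vocabulary sizes
maxlen = {"Malayalam": 65, "Tamil": 64}        # longest sentence lengths

for lang in vocab:
    params = vocab[lang] * embed_dim   # trainable embedding parameters
    flat = maxlen[lang] * embed_dim    # flattened vector fed to the dense layer
    print(lang, params, flat)
# Malayalam: 3090000 parameters, 13000-dim flattened vector
# Tamil:     3060000 parameters, 12800-dim flattened vector
```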
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>For evaluating the performance of our models, 70% of the dataset is used for training and 30%
is used as the test dataset; the same strategy is used for both languages.
Table 2 shows the accuracy (F1 score) of all classifiers for both the Malayalam and Tamil
languages. As shown in Table 2, for Malayalam, MNB has an F1 score of 0.78 for “not offensive”
tweets and, for “offensive” tweets, the highest F1 score of 0.74. Hence MNB stood out as
the best classifier. The second best F1 score (0.72) for “offensive” tweets is shared by SVC,
LR and Ensemble. Among all the classifiers, the performance of SVC and LR is identical for “offensive”
and “not offensive” tweets. RFC performs very poorly in predicting both categories, and the simple
NN model scored 0.52.
In the same Table 2, it can be observed that for the Tamil data SVC has the highest F1 scores,
0.87 for “not offensive” and 0.86 for “offensive” tweets. The performances of MNB, LR
and Ensemble are identical. RFC has also performed well for the Tamil language. The simple NN model
shows no improvement for Tamil, scoring just 0.53.</p>
      <p>Table 3 presents the confusion matrices of the best performing classifiers and their
different (feature based) variations on the Malayalam data. The table clearly shows that
the "combined word-char n-grams" features effectively allow MNB to predict both “not offensive”
and “offensive” tweets. Using these features, SVC predicts “not offensive”
tweets well but fails to predict “offensive” tweets correctly. Interestingly, the "combined word-char"
and "word n-gram" features increased the performance of most of the classifiers, except
SVC for the “not offensive” class and RFC for the “offensive” class.</p>
      <p>Similarly, Table 4 presents the confusion matrices for the Tamil language. The performance of
MNB in predicting “offensive” tweets is high for all feature models. While
the results of the "char n-grams" and "combined word-char n-gram" features for SVC
are marginally the same, the "char n-gram" model proved best for predicting both
classes.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper presents the experimental work and results of HASOC Task 2, detecting
offensive content in code-mixed datasets of Dravidian languages. We used different features,
namely word n-grams, character n-grams, combined word-character n-grams and custom word
embeddings; the custom word embeddings were used to train a simple neural network model. Applying a
systematic supervised machine learning approach, we developed our HASOC_kbcnmujal
system. The system obtained an F1 score of 0.77 for the Malayalam language and received
2nd rank in the HASOC shared task (Task 2) competition; for the Tamil language it
obtained an F1 score of 0.87 and received 3rd rank.
This work will be further extended to develop a system that can learn offensive terms from
text content, or even from speech, irrespective of the language. We are also interested in
revealing hidden negative messages in social media comments that may be presented
superficially as positive messages.</p>
      <p>Content of social messages that can damage social and communal health should be
detected and curbed at the right time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 14-17. URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] V. Pathak, M. Joshi, ITRANS encoded Marathi literature document relevance ranking for natural language flexible queries, in: Computer Networks &amp; Communications (NetCom), Springer, 2013, pp. 417-424.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] V. M. Pathak, M. R. Joshi, Relevance feedback mechanism for resolving transcription ambiguity in SMS based literature information system, in: Smart Intelligent Computing and Applications, Springer, 2019, pp. 527-542.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Malmasi, M. Zampieri, Detecting hate speech in social media, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria, 2017, pp. 467-472. URL: https://doi.org/10.26615/978-954-452-049-6_062. doi:10.26615/978-954-452-049-6_062.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baruah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Barbhuiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dey</surname>
          </string-name>
          , IIITG-ADBU at HASOC 2019:
          <article-title>Automated hate speech and offensive content detection in English and code-mixed Hindi text</article-title>
          , in: P. Mehta,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15,
          <year>2019</year>
          , volume
          <volume>2517</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>236</lpage>
          . URL: http://ceur-ws.org/Vol-2517/T3-7.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saroj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Mundotiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          , Irlab@iitbhu at HASOC 2019:
          <article-title>Traditional machine learning for hate speech and offensive content identification</article-title>
          , in: P. Mehta,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15,
          <year>2019</year>
          , volume
          <volume>2517</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>314</lpage>
          . URL: http://ceur-ws.org/Vol-2517/T3-17.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Djuric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grbovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Radosavljevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhamidipati</surname>
          </string-name>
          ,
          <article-title>Hate speech detection with comment embeddings</article-title>
          ,
          <source>in: WWW (Companion Volume)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>30</lpage>
          . URL: https://doi.org/10.1145/2740908.2742760.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Magdy</surname>
          </string-name>
          ,
          <article-title>Abusive language detection on Arabic social media</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Abusive Language Online</source>
          , Association for Computational Linguistics, Vancouver, BC, Canada,
          <year>2017</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>56</lpage>
          . URL: https://www.aclweb.org/anthology/W17-3008. doi:10.18653/v1/W17-3008.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-T.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Rephrasing profanity in Chinese text</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Abusive Language Online</source>
          , Association for Computational Linguistics, Vancouver, BC, Canada,
          <year>2017</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          . URL: https://www.aclweb.org/anthology/W17-3003. doi:10.18653/v1/W17-3003.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources Association
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , E. Sherly,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>in: Proceedings of the 12th Forum for Information Retrieval Evaluation</source>
          ,
          <source>FIRE '20</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources Association
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Leveraging orthographic information to improve machine translation of under-resourced languages</article-title>
          ,
          <source>Ph.D. thesis, NUI Galway</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>scikit-learn developers (BSD License)</article-title>
          , 2007-2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          ,
          <article-title>A comparison of event models for naive bayes text classification</article-title>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>