<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DLRG@HASOC 2019: An Enhanced Ensemble Classifier for Hate and Offensive Content Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>R. Rajalakshmi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B. Yashwant Reddy</string-name>
          <email>byashwanth.reddy2016@vitstudent.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science and Engineering Vellore Institute of Technology</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Recent advancements in Internet technologies have made a tremendous change in social media. Hate speech is an attack directed towards a group of people based on their religion, gender, colour, etc. Offensive content in social media poses a threat to democracy. As this kind of hate speech and offensive content on the web increases day by day, manually monitoring or controlling such hate crimes is a highly challenging task. Most of the existing methodologies focus on English language tweets, and only limited work has been reported for Hindi and German language posts. Also, the importance of feature selection methods has not been explored much for this problem. In this research work, an enhanced ensemble classifier approach is proposed to identify hate and offensive content posted in Hindi or German. In the proposed approach, a Chi-square based feature selection method is combined with a Random Forest classifier to classify the tweets. This work was submitted to the Hate and Offensive Content Identification (HASOC) task@FIRE2019. From the various experiments conducted on the released HASOC dataset, it is shown that accuracies of 81% and 64% were achieved on German and Hindi language tweets respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech Identification</kwd>
        <kwd>Ensemble Classifier</kwd>
        <kwd>Chi Square Feature Selection</kwd>
        <kwd>German</kwd>
        <kwd>Hindi</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nowadays, thanks to advanced technologies, many people post their opinions,
thoughts and comments on social websites such as Facebook and Twitter. The
offensive and hate speech posted on social media increases every day, and
companies are investing heavily to identify such offensive tweets. As these
offensive tweets contain different hashtags and emojis and follow various language
styles, it is highly challenging to monitor and control such hate crimes manually.</p>
      <p>
        To overcome the above issues, machine learning based methods were proposed
in existing works [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], but they focused on detecting hate speech in general rather than
offensive speech in particular. Even though the hate speech detection problem for
English has been studied by various researchers, only a few works have been
reported for German and Hindi language tweets. The lexicon based and
rule-based approaches followed in the existing works do not generalize well.
Also, traditional tf-idf based methods were used with simple linear
classifiers, and little emphasis was given to other feature weighting methods. In this
research work, an attempt is made to study the importance of feature selection
methods along with the power of ensemble based classifiers. We have proposed an
enhanced ensemble classifier with a Chi-square based feature selection method
to select the important features.
      </p>
      <p>This research work was submitted to the Hate and Offensive Content
Identification (HASOC) task@FIRE2019. As part of the task, the organizers released
datasets containing tweets in the German and Hindi languages. The task is
to identify the tweets that contain hate and offensive content. To perform
this binary classification task, we applied various machine learning techniques
by extracting suitable features from the given data. To study the importance
of feature selection methods, we conducted experiments with different feature
selection methods, namely TF-IDF, Mutual Information and a Chi-square based
approach. Among the two datasets, the German dataset was highly imbalanced, so
we applied the widely used SMOTE analysis. To design a suitable predictive
model, we conducted experiments with various machine learning techniques such
as Logistic Regression, Support Vector Machine and the Random Forest classifier.
From the experimental results, it is observed that the ensemble based approach
is better than the individual classifiers. We achieved an accuracy of 81% on the
German dataset and 64% on the Hindi dataset by applying the Random Forest classifier
with Chi-square based feature selection.</p>
      <p>The paper is organized as follows: related works are presented in Section 2,
and the proposed methodology is detailed in Section 3. The experimental results
and discussion are given in Section 4, followed by the conclusion in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        Many studies have been reported on classifying offensive content on
the web. Greevy and Smeaton [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used SVM and bag of words to detect offensive
content on web pages. They used the PRINCP corpus of 3 million words with
two class labels, namely offensive and not offensive. BOW, n-gram word sequences
and POS tagged documents were used to represent the dataset, but
they used only the SVM classifier for detection and other methods were not explored.
A similar approach was suggested by Warner and Hirschberg [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], using unigrams
with SVM to detect offensive content on the web. Hatebase is an online
repository of hate speech words. T. Davidson, D. Warmsley et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] built a classifier
for Hatebase. They created unigram, bigram and trigram features weighted
by TF-IDF and calculated Part of Speech (POS) tags. They suggested
linear classifiers for classifying offensive language, but the model was biased
towards offensive language and failed to differentiate commonplace
offensive language from serious hate speech. Google has developed a tool
for scoring the toxicity of comments on a scale from 0 to 100. C. Nobata, J.
Tetreault et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed annotating hate speech versus clean speech. They
collected news and finance datasets for the binary classification of abusive
and clean tweets. They employed Vowpal Wabbit's regression model on
features obtained through n-grams and linguistic, syntactic and distributional
semantics. They compared the accuracies of all the features, but they worked
only on the English language and did not attempt others. D. Gitari [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
further classified the tweets into strong or weak using lexicon based
approaches. They used semantic and subjectivity approaches to create a lexicon
and used these features for a classifier, but they used a rule-based classifier instead
of a machine learning model, which led to low precision and recall scores. Nitesh
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] over-sampled the minority class through SMOTE (Synthetic Minority
Over-sampling Technique), which generated new synthetic examples along the
line between the minority examples and their selected nearest neighbors.
      </p>
      <p>
        To handle multilingual queries, code mixing and code borrowing need to
be differentiated. The borrowing likeliness of English words in the Hindi language
was determined by a novel relevance factor [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In that work, both Hindi and
English tweets were considered to find the relevant words. Various feature weighting
methods have been proposed for URL classification and sentiment analysis
problems, and the effectiveness of different classifiers was studied. The importance of
features like tf-idf and mutual information in determining the category of a web
page was explored by using URL based features [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. For sentiment analysis on
movie reviews, the tf-idf and word2vec methods were applied and the
effectiveness of a deep learning model was studied in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. A novel feature weighting
method was proposed for the Naive Bayes classifier [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for the problem of
categorizing URLs by considering only the features derived from the URLs. In
that work, a variant of the Chi-square method was suggested to find the goodness
of features, and it was embedded into the calculation of the likelihood probability for
the Naive Bayes classifier. Using linear SVM weights as features, URL
classification was performed in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. These URL features were automatically learnt, and a
dataset-independent dictionary was constructed to classify the URLs. In another
work [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], a transfer learning approach was preferred, learning the features with a
Convolutional Neural Network and using them as input to an SVM for
classifying URLs generated by Domain Generation Algorithms. In all
the above mentioned works, the significance of feature weighting methods has
been studied for classifying web pages. GermEval is a shared task focused on
offensive language identification in German tweets (8500 tweets). Wiegand et al.
(2018) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] applied the ideas of Waseem et al. to this task. They
experimented with detecting offensive vs. non-offensive tweets, and also with a second
task on further sub-classifying the offensive tweets as insult, abuse or profanity.
The 2018 Workshop on Trolling, Aggression, and Cyberbullying (TRAC) hosted
a shared task focused on detecting aggressive text in both English and Hindi [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        The dataset from this task is available to the public and contains 15,869
Facebook comments labeled as overtly aggressive (OAG), covertly aggressive (CAG),
or non-aggressive (NAG). The best-performing scores were obtained using
convolutional neural network (CNN), recurrent neural network and LSTM based
approaches. The Offensive Language Identification Dataset (OLID), which was
built specifically for this task, was annotated using the hierarchical three-level
annotation model introduced in Zampieri et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The three sub-tasks are Offensive
Language Identification (Not Offensive, Offensive), Categorization of Offensive
Language (Targeted Insult, Untargeted) and Offensive Language Target
Identification (Individual, Group, Other) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. All the above methods emphasize the importance
of determining offensive content.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Methodology</title>
      <p>The task of identifying hate and offensive content in tweets is
considered as a binary classification problem. The performance of any binary classifier
depends on suitable features and the chosen machine learning algorithm. In this
work, three different feature selection methods were chosen, viz. i) TF-IDF
(Term Frequency - Inverse Document Frequency), ii) Mutual Information and iii) Chi-square.
Also, the effectiveness of the ensemble method has been studied by
comparing three classifiers, viz. Logistic Regression, Support Vector Machine and
the Random Forest classifier. To identify hate and offensive speech on the two datasets,
viz. the German dataset and the Hindi dataset, the following steps are performed:
- Translation of tweets to English
- Pre-processing and tokenization
- Feature extraction by applying three variants, viz. TF-IDF, Mutual
Information and Chi-square
- Performing SMOTE analysis (this step is required only for the German dataset,
as it is highly imbalanced)
- Building the model and predicting whether a given tweet is offensive or
not by using the model.</p>
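The steps above can be sketched as a scikit-learn pipeline. The toy tweets, labels and parameter values below are illustrative stand-ins, not the paper's actual data or settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the translated, pre-processed tweets (1 = offensive, 0 = not)
tweets = ["i am proud", "you are terrible people",
          "have a good day", "terrible hateful people"]
labels = [0, 1, 0, 1]

# Vectorize, keep the k best features by Chi-square, then classify
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("chi2", SelectKBest(chi2, k=5)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(tweets, labels)
print(pipe.predict(["terrible hateful people"]))
```

The `k`, `n_estimators` and `random_state` values here are arbitrary; the paper's own feature counts and thresholds are given in the following subsections.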
      <sec id="sec-3-1">
        <title>Translation of Tweets</title>
        <p>
          In this task, we were provided with datasets in two different languages (German
and Hindi). As a first step, the tweets are translated to English. For
example, the German tweet "Frank Rennicke - Ich bin stolz" was converted
by employing ML translation, resulting in the corresponding English tweet
"Frank Rennicke - I am proud". For this translation process, the ML Translator API
was used, which is based on Google's Neural Machine Translation (NMT) system [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
This translation method is widely used because of its simplicity and zero-shot
translation. Melvin et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] proposed a single multilingual Neural Translation
model that shares the same encoder, decoder and attention modules across all
languages without increasing the complexity of the model. Also, as the parameters
are shared across all the languages, it generalizes well to multiple languages.
This NMT model has the advantage of zero-shot translation: since several language
pairs are handled by a single model, unseen language pairs
can also be translated by the model. We found this translation process suitable for
this task and hence applied it to convert the tweets from German and
Hindi to English.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Pre-processing and Tokenization</title>
        <p>Hashtags provide insights about a specific ideology held by a group of people. These
tags provide vital information for text classification, especially in the case of
identification of offensive language in tweets. So we processed the hashtags
and obtained tokenized words from them after segmenting the tokens. For
example, after applying hashtag segmentation to the pre-processed tweet
#everythingisgood, we obtain "everything is good". Lemmatization, the process of
reducing a word to its root form, is also helpful here. We used the NLTK
(Natural Language Toolkit) WordNet lemmatizer to perform lemmatization.
Consider the following example: "Koeln Mohamed recognizes no German right
but only the #Scharia. That he wanted to break Cologne Cathedral was just a
joke but when he comes out of jail, he has no more pity." After lemmatizing, it
becomes "koeln mohamed recognizes german right scharia wanted break cologne
cathedral joke come jail pity".</p>
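Hashtag segmentation of the kind described above can be sketched as a small dynamic program. The word list and the fewest-words criterion here are illustrative assumptions, not the segmenter actually used in the paper:

```python
# Illustrative dictionary; a real segmenter would use a full wordlist.
WORDS = {"every", "everything", "is", "good", "thing"}

def segment_hashtag(tag: str) -> list:
    """Split a hashtag body into dictionary words, preferring the
    segmentation with the fewest words (dynamic programming)."""
    text = tag.lstrip("#").lower()
    n = len(text)
    best = [None] * (n + 1)  # best[i] = best segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in WORDS:
                cand = best[j] + [text[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n] or [text]  # fall back to the raw tag if unsegmentable

print(segment_hashtag("#everythingisgood"))  # ['everything', 'is', 'good']
```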
      </sec>
      <sec id="sec-3-3">
        <title>Feature Extraction</title>
        <p>In any text classification task, feature extraction plays an important role.
To extract suitable features from the pre-processed data, we used three
variants, namely TF-IDF, Mutual Information and Chi-square.</p>
        <p>TF-IDF: TF-IDF (Term Frequency - Inverse Document Frequency) is a
well-known weighting scheme whose score is calculated by comparing the count of
terms present in every tweet with the terms present in the entire corpus.
As it extracts the most descriptive terms from the tweet collection and is simple to
implement, we chose this feature weighting scheme. In our experiments,
the minimum frequency of a word is set to 5 and the maximum number of words
is set to 5000.</p>
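With scikit-learn, the thresholds stated above map directly onto `TfidfVectorizer` parameters; the toy corpus below is only for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=5: a word must occur in at least 5 documents;
# max_features=5000: keep at most 5000 words (the paper's settings).
vectorizer = TfidfVectorizer(min_df=5, max_features=5000)

corpus = ["good day"] * 5 + ["bad day"] * 5  # toy corpus so min_df=5 is met
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))  # ['bad', 'day', 'good']
```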
        <p>Mutual Information: Mutual Information (MI) is a measure of the dependence
between two random variables, and it can be used to find the dependency
between the input features and the output categories in the context of feature
selection for text classification problems. For the given task of classifying
tweets, we can calculate the amount of information a particular word contributes
to the class label (offensive). If the mutual information is high, then the feature
has high relevance to the target; if it is zero, there is no relevance.</p>
        <p>In the HASOC German (and likewise the Hindi) dataset, we calculated the values
of a, b, c and d based on the number of training tweets in the positive / negative
category that contain / do not contain the term ti. The mutual information
is obtained by using the formula shown below.</p>
        <p>
          MI = log2( max( aN / ((a + b)(a + c)), cN / ((c + d)(a + c)) ) )
(1)
where 'a' denotes the number of positive category tweets in the training data that
contain the term ti, 'b' denotes the number of positive category tweets in the training
data that do not contain the term ti, 'c' denotes the number of negative category
tweets in the training data that contain the term ti, 'd' denotes the number of
negative category tweets in the training data that do not contain the term ti,
and N = a + b + c + d is the total number of training tweets.
Chi-Square: The Chi-square test is generally applied to find the relationship
between two variables. The effectiveness of Chi-square based feature selection
has been reported in various text / web page classification problems
[
          <xref ref-type="bibr" rid="ref15 ref17">15,17</xref>
          ]. In Natural Language Processing, identifying the relevant words is
important to increase the efficiency of the classification algorithm. The Chi-square
statistic is small if the term is uncorrelated with the class and
high if the term is correlated. In this task, we calculated the Chi-square
statistic on the dataset and selected the terms with high scores, as they are the
most informative features. Its formula is given below, using the same notations
a, b, c and d mentioned above.
        </p>
        <p>Chi = N(ad - bc)^2 / ((a + c)(b + d)(a + b)(c + d))
(2)</p>
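The Chi-square statistic of Equation (2) translates directly to code; a minimal sketch with made-up contingency counts:

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for one term, from the 2x2 contingency counts:
    a/b = positive tweets with/without the term,
    c/d = negative tweets with/without the term."""
    n = a + b + c + d  # total number of training tweets
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Example (invented counts): a term in 30 of 40 offensive tweets
# and in 5 of 60 non-offensive tweets scores highly.
print(round(chi_square(30, 10, 5, 55), 2))  # 46.89
```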
      </sec>
      <sec id="sec-3-4">
        <title>Addressing Imbalanced Data and Classification</title>
        <p>The German dataset was highly imbalanced, containing 3412 hate and
offensive tweets against only 407 non-offensive tweets, so SMOTE analysis was performed.
For the Hindi dataset, this step was skipped, as it is a balanced dataset.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Random Oversampling and Undersampling</title>
        <p>
          The mechanics of random oversampling follow naturally from its description:
a group of N samples drawn from the minority category is added to the
original dataset. While oversampling adds data to the
original dataset, random undersampling removes data from the dataset.
Both methods alter the size of the original dataset. Even though
training accuracy may increase by applying these methods, the model performance
will be relatively low on the testing data [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          SMOTE Analysis We applied SMOTE (Synthetic Minority
Oversampling Technique) using the sklearn-compatible imbalanced-learn library. With this
oversampling technique, the size of the minority class is increased to the size of the
majority class. The method
generates synthetic minority examples to over-sample the minority class. For
every minority example, its k (set to 5 in SMOTE) nearest neighbours of
the same class are calculated; then some examples are randomly selected from
them according to the over-sampling rate. SMOTE has been reported to give
better performance compared with other sampling techniques [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
Classification After making the dataset suitable for training, two different
models were designed, one with Logistic Regression and another with the
ensemble classifier Random Forest, varying the feature weighting method,
viz. TF-IDF, Mutual Information and Chi-square.
        </p>
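The interpolation step described above can be sketched as follows. This is a simplified, NumPy-only illustration of SMOTE's core idea, not the library implementation used for the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(minority: np.ndarray, k: int = 5) -> np.ndarray:
    """Generate one synthetic minority example: pick a random minority
    point, find its k nearest same-class neighbours, and interpolate
    towards a randomly chosen one of them."""
    i = rng.integers(len(minority))
    x = minority[i]
    dists = np.linalg.norm(minority - x, axis=1)   # distances to all points
    neighbours = np.argsort(dists)[1:k + 1]        # skip the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                             # interpolation factor in [0, 1)
    return x + gap * (minority[j] - x)

minority = rng.random((20, 3))  # 20 toy minority examples, 3 features
synthetic = smote_sample(minority)
print(synthetic.shape)  # (3,)
```

Repeating this until the minority class matches the majority class size gives the balanced training set used for classification.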
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>To study the performance of the proposed method on the German and Hindi
datasets, various experiments were conducted. For the implementation, we used
Python 3 and the scikit-learn library. All the experiments were carried out on a
workstation with an Intel Xeon Quad Core processor, 32 GB RAM and an NVIDIA Quadro
P4000 GPU with 8 GB memory. For the initial experiments, we divided the released
training data into a training set and a validation set and conducted the experiments
using accuracy as the performance metric. Finally, the performance of the proposed
system was tested on the test set released by the organizers. For these
experiments, we combined all the training and validation data into a single training
set and applied the algorithm. We report the validation accuracy and
test accuracy obtained on both the German dataset and the Hindi dataset.</p>
      <p>After translation and pre-processing of the tweets, tokenization was performed.
Then, to extract suitable features, we applied the three variants, viz.
TF-IDF, Mutual Information and Chi-square. First, the TF-IDF vectorizer (from sklearn)
was used to get a maximum of 10,000 features with a minimum occurrence
frequency of 2 for the German dataset, and 5000 features for the Hindi dataset. We then
used the count vectorizer (from sklearn) and calculated the Mutual Information
and Chi-square values for every word token using the above mentioned
formulas. In this way, a total of 12,717 features were extracted for the German dataset
and 15,111 features for the Hindi dataset. We used these
features with the Logistic Regression (LR) and Random Forest (RF) classifiers
under the three variants, viz. TF-IDF, Mutual Information and Chi-square.
The accuracy of the simple and ensemble classifiers on the validation set and test set
is presented in Table 1 and Table 2.</p>
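The split-and-score protocol can be sketched as follows. The synthetic feature matrix stands in for the extracted text features (the real runs used 12,717 / 15,111 features), so the accuracies printed here are not the paper's results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the extracted feature matrix and labels
X, y = make_classification(n_samples=500, n_features=50, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train each classifier on the training split, score on the validation split
results = {}
for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("RF", RandomForestClassifier(n_estimators=100, random_state=42))]:
    clf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_val, clf.predict(X_val))
    print(name, round(results[name], 2))
```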
      <p>It is observed from Table 1 that, on the German dataset, among the three feature
weighting schemes, the Chi-square based feature weighting method performs better
than the other two, viz. TF-IDF and MI, with the Random Forest classifier.
A validation accuracy of 90% was achieved when combining the Chi-square based
features with the ensemble classifier Random Forest. It is also to be noted
that MI performs better than TF-IDF, resulting in 88% and 89% validation
accuracy on the German dataset with the single Logistic classifier and the Random
Forest classifier respectively. On the Hindi dataset, a validation accuracy of 79% was achieved
with the Random Forest classifier using Chi-square based feature selection.
Based on this inference on the validation set, we applied Chi-square with
the Random Forest classifier on the released test set, and the results are reported in
Table 2. We obtained accuracies of 81% and 64% on the German dataset and
Hindi dataset respectively.</p>
      <p>This work was submitted to the FIRE2019 task, Identification of Hate and
Offensive Speech in Indo-European Languages. In this research work, the problem
of identifying hate and offensive content in tweets has been
experimentally studied on two different language datasets, German and Hindi. The
importance of feature weighting methods was analysed using three different
variants, viz. TF-IDF, Mutual Information and Chi-square based feature
selection. After choosing a suitable feature selection method, we studied the
significance of the ensemble classifier over individual classifiers. Among the released
datasets, the German dataset was highly imbalanced, so we applied SMOTE
analysis before performing classification. The experimental results show
that the performance of the Random Forest classifier with the Chi-square based
feature selection method is better than the other methods, and test accuracies of
81% and 64% were achieved on the German and Hindi datasets respectively. In this
work, we restricted ourselves to machine learning approaches with a suitable feature
selection method; deep learning techniques will be explored in future work.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>The authors would like to thank the management of Vellore Institute of
Technology, Chennai for providing the support to carry out this work. The first author
would like to thank the Science and Engineering Research Board (SERB),
Government of India for the financial grant (Award Number: ECR/2016/00484)
supporting this research work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Burnap</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making</article-title>
          .
          <source>In: Policy and Internet</source>
          , Vol.
          <volume>7</volume>
          .
          <issue>2</issue>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>242</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kwok</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Locate the hate: Detecting tweets against blacks</article-title>
          .
          <source>In: Twenty-Seventh AAAI Conference on Artificial Intelligence</source>
          ,pp.
          <fpage>1621</fpage>
          -
          <lpage>1622</lpage>
          (
          <year>2013</year>
          ) .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>de Gibert</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Pablos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuadros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Hate Speech Dataset from a White Supremacy Forum</article-title>
          .
          <source>In: 2nd Workshop on Abusive Language Online</source>
          ,pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Warner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirschberg</surname>
          </string-name>
          , J.:
          <article-title>Detecting Hate Speech on the World Wide Web</article-title>
          .
          <source>In: Proceedings of the Second Workshop on Language in Social Media</source>
          ,pp.
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Greevy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>Classifying racist texts using a support vector machine</article-title>
          .
          <source>In: Proceedings of the 27th annual international conference on Research and development in information retrieval - SIGIR '04</source>
          , pp.
          <fpage>468</fpage>
          -
          <lpage>469</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warmsley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Automated Hate Speech Detection and the Problem of Offensive Language</article-title>
          .
          <source>In: Proceedings of the Eleventh International AAAI Conference on Web and Social Media (ICWSM</source>
          <year>2017</year>
          ),pp.
          <fpage>512</fpage>
          -
          <lpage>515</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Nobata</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tetreault</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehdad</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Abusive Language Detection in Online User Content</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on World Wide Web (WWW</source>
          <year>2016</year>
          ),pp.
          <fpage>145</fpage>
          -
          <lpage>153</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gitari</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuping</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damien</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A Lexicon-based Approach for Hate Speech Detection</article-title>
          .
          <source>In: International Journal of Multimedia and Ubiquitous Engineering</source>
          , vol.
          <volume>10</volume>
          .4, pp.
          <fpage>215</fpage>
          -
          <lpage>230</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Practical feature subset selection for machine learning</article-title>
          .
          <source>In: Proceedings of the 21st Australasian Conference on Computer Science</source>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>191</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Balancing Between Over-Weighting and Under-Weighting in Supervised Term Weighting</article-title>
          .
          <source>In: International Journal of Information Processing and Management</source>
          , vol.
          <volume>53</volume>
          , pp.
          <fpage>547</fpage>
          -
          <lpage>557</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowyer</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>L.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kegelmeyer</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          :
          <article-title>SMOTE: Synthetic Minority Over-Sampling Technique</article-title>
          .
          <source>In: Journal of Artificial Intelligence Research</source>
          , vol.
          <volume>16</volume>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>B.H.</given-names>
          </string-name>
          :
          <article-title>Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning</article-title>
          .
          <source>In: International Conference on Intelligent Computing</source>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>887</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Learning from imbalanced data</article-title>
          .
          <source>In: IEEE Transactions On Knowledge and Data Engineering</source>
          , vol.
          <volume>21</volume>
          , pp.
          <fpage>1263</fpage>
          -
          <lpage>1284</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rajalakshmi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Borrowing Likeliness Ranking based on Relevance Factor</article-title>
          .
          <source>In: Proceedings of the Fourth ACM IKDD Conferences on Data Sciences, CODS</source>
          <year>2017</year>
          , India, pp.
          <fpage>12:1</fpage>
          -
          <lpage>12:2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rajalakshmi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xaviar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <article-title>Experimental Study of Feature Weighting Techniques for URL Based Webpage Classification</article-title>
          ,
          <source>Procedia Computer Science</source>
          , vol.
          <volume>115</volume>
          , pp.
          <fpage>218</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sivakumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajalakshmi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Comparative evaluation of various feature weighting methods on movie reviews</article-title>
          ,
          <source>Advances in Intelligent Systems and Computing</source>
          , vol.
          <volume>711</volume>
          , pp.
          <fpage>721</fpage>
          -
          <lpage>730</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rajalakshmi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>Naive Bayes approach for URL classification with supervised feature selection and rejection framework</article-title>
          ,
          <source>Computational Intelligence</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>363</fpage>
          -
          <lpage>396</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rajalakshmi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>An Effective and Discriminative Feature Learning for URL Based Web Page Classification</article-title>
          ,
          <source>2018 IEEE International Conference on Systems, Man, and Cybernetics</source>
          (SMC), Miyazaki, Japan, pp.
          <fpage>1374</fpage>
          -
          <lpage>1379</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rajalakshmi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramraj</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramesh Kannan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Transfer learning approach for identification of malicious domain names</article-title>
          ,
          <source>Communications in Computer and Information Science</source>
          , vol.
          <volume>969</volume>
          , pp.
          <fpage>656</fpage>
          -
          <lpage>666</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)</article-title>
          .
          <source>In: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Wiegand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siegel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruppenhofer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojha</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Benchmarking Aggression Identification in Social Media</article-title>
          .
          <source>In: Proceedings of TRAC</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Predicting the Type and Target of Offensive Posts in Social Media</article-title>
          .
          <source>In: Proceedings of NAACL</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>Melvin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>Mike</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Quoc V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krikun</surname>
            ,
            <given-names>Maxim</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Yonghui</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Zhifeng</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorat</surname>
            ,
            <given-names>Nikhil</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viégas</surname>
            ,
            <given-names>Fernanda</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>Martin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>Greg</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>Macduff</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>Jeffrey</given-names>
          </string-name>
          :
          <source>Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation</source>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>339</fpage>
          -
          <lpage>351</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Modha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          .
          <source>In: Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation</source>
          (December
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>