<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suyash Sangwan</string-name>
          <email>suyash.sangwan@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lipika Dey</string-name>
          <email>lipika.dey@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Shakir</string-name>
          <email>m.shakir@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deep Learning</institution>
          ,
          <addr-line>Multi-task framework, Text classification, Twitter, Hate speech, Offensive speech, Profane</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tata Consultancy Services Limited</institution>
          ,
          <addr-line>Block-C, Kings Canyon, ASF Insignia, Gwal Pahari, Gurgaon 12203, Haryana, India</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>Related tasks are generally dependent on each other and therefore perform better when solved in a joint framework. Building on this observation, our team, 'TCS Research Lab Gurgaon', presents a deep learning-based multi-task framework that jointly identifies the presence of 'Hate and Offensive' (HOF) speech and further classifies the type of 'HOF' speech present (i.e. hate, offensive, or profane). For each tweet, we extract three feature sets (word embeddings, topical distribution, and TF-IDF scores) that convey diverse and distinctive information. Since these feature sets do not contribute equally to the final decision, we use a 'gated' mechanism to assign them weights based on their importance in the final prediction. We have evaluated and validated our proposed approach on subtask 1A and subtask 1B [1] of the HASOC-2021 challenge [2]. Evaluation results show that the multi-task learning ('MTL') framework provides better results than the single-task learning ('STL') framework (i.e. solving both subtasks independently).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Offensive data is data that offends someone. It does not have to be threatening or hateful
to be considered offensive; it is simply data that upsets someone, and has nothing to do with
the legality of the speech. For example: “I get turned on by 13 years old”. ‘Profanity’, on the
other hand, is a class of data that includes dirty words and ideas. Swear words, vulgar language,
obscene gestures, and naughty jokes are all considered ‘Profane’.</p>
      <p>
        There are many layers to the difficulty of automatically detecting hateful and/or offensive
speech, particularly in social media. The first is ‘subjectivity’: a seemingly neutral sentence
can be offensive to one person and not bother another. Similarly, hate data has no legal
definition, so what is and is not hate/offensive data is open to interpretation. A lot
depends on the domain and the context. Even slang or insulting words vary in usage: they
can express contempt or a difference of opinion, and in some cases people use these
words in humor. Another big challenge is the use of “Hinglish” (a blend of Hindi and
English, i.e. Hindi written in the Roman script instead of the native Devanagari) in the HASOC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
dataset. As the dataset is collected from Twitter India, a multi-lingual society, people
tend to use code-mixed patterns.
      </p>
      <p>Keeping all these challenges in mind, we propose a multi-task deep learning-based
gated model that aims to leverage the inter-dependence of the two given subtasks to increase
the confidence of each subtask's prediction. For example: if subtask 1A says “NOT”, then
for subtask 1B the answer is always “NONE”. For each tweet, we utilize three feature sets (RNN-based
word embeddings, topical distribution, and TF-IDF scores of the words in the tweet). Our main intuition
behind using different feature sets is to combine the advantages of different feature extraction
methods in a single framework. For tweets having out-of-vocabulary words (like hashtags), the
TF-IDF component is of utmost importance. Similarly, the intensity and combination of
different topics present within a tweet also help in providing additional context.</p>
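The inter-task dependence described above can be expressed as a simple consistency rule. The sketch below is purely illustrative; `check_label_consistency` is a hypothetical helper, not part of our system:

```python
def check_label_consistency(label_1a: str, label_1b: str) -> bool:
    """Inter-task dependence between the two subtasks:
    a tweet labelled NOT in subtask 1A must be NONE in subtask 1B,
    while a HOF tweet takes one of the fine-grained HOF classes."""
    if label_1a == "NOT":
        return label_1b == "NONE"
    return label_1b in {"HATE", "OFFN", "PRFN"}

print(check_label_consistency("NOT", "NONE"))   # True
print(check_label_consistency("NOT", "PRFN"))   # False
```

A multi-task model can exploit exactly this constraint: confidence in one subtask's prediction raises confidence in the other's.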
      <p>The main contributions of this paper are:
1. We propose a multi-task framework that leverages the inter-dependence of two related tasks
(i.e. HOF/NOT and type of HOF) to improve each other’s performance.
2. We utilize different feature extraction techniques in a single framework.
3. We apply a gated module to refine the feature sets (i.e. assign weights to feature sets) based
on their role in the final prediction.</p>
      <p>The remainder of this paper is organized as follows. In section 2 we state our problem
definition. In section 3 we review previous work. Our architecture is introduced in
section 4, followed by an outline of the dataset, experimental results, and analysis in section 5.
Finally, we conclude with a discussion in section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition</title>
      <p>Our first subtask (1A) focuses on hate speech and offensive language identification. It is a
coarse-grained binary classification task where we classify tweets into two classes:
Hate and Offensive (HOF) and Non-Hate and Offensive (NOT). The first category,
‘HOF’, contains posts having hate, offensive, or profane content, whereas the second
category, ‘NOT’, contains posts without any hate, profane, or offensive content.</p>
      <p>The other subtask (1B) is a fine-grained classification task where ‘HOF’ posts from the above
subtask are further classified into three categories: 1) Hate (HATE) – posts under this category
contain hate speech content; 2) Offensive (OFFN) – posts under this category contain offensive
content; 3) Profane (PRFN) – posts under this category contain profane content.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        The interest in detecting hate speech and bullying data, particularly on social media, has
attracted researchers interested in sociological and linguistic features. In this
section we review a number of studies and briefly discuss their findings. In 2012, Xu et al.
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] applied sentiment analysis to detect bullying roles in tweets. The authors
used LDA to identify relevant topics and formulated the task as binary classification,
i.e. a text is classified as an instance of bullying or not. In 2015, Burnap et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
used a supervised machine learning text classifier that distinguishes hateful and/or
antagonistic responses, with a focus on race, ethnicity, or religion, from more general responses.
They derived classification features such as grammatical dependencies between words and used
the model to forecast the likely spread of cyber hate in a sample of Twitter data. In 2016, Nobata
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed a machine learning based method to detect hate speech in online user
comments by building a corpus of user comments annotated for abusive language. Lexical
detection methods tend to have low precision because they classify all messages
containing particular terms as hate speech, and previous work using supervised learning had
failed to distinguish between different categories. Therefore, in 2017, Davidson et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used a
crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords. Their
findings revealed that racist and homophobic tweets are more likely to be classified as hate
speech, whereas sexist tweets are generally classified as offensive, and tweets without explicit hate
keywords are also more difficult to classify. Our model differs in that we utilize
different feature sets to identify both the presence and the type of ‘HOF’ data in a single framework.
We also hypothesize that applying a gating mechanism to assign weights to the different feature
sets may assist the network.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Architecture</title>
      <p>
        In this section, we describe our proposed architecture, shown in Figure 1, where we aim to
leverage different feature sets and a gating mechanism to solve multiple tasks together. The
proposed framework takes a tweet as input and pre-processes it. For each pre-processed
tweet we provide three types of input (or feature sets). First, pre-trained word embeddings, which
are further processed through bi-directional Gated Recurrent Units (GRUs) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to capture
contextual information. Second, the topical distribution (i.e. the percentage of each topic present in the
tweet). Third, the ‘TF-IDF’ scores of the tweet words. All three feature sets are then passed through a fully
connected layer to make their output dimensions equal. The key challenge is then to
fuse the relevant information for the prediction. For that, we employ a gating mechanism to
refine the feature sets (i.e. assign weights to the feature sets). The gating mechanism evaluates
the importance of each individual feature set based on its role in the final prediction.
1. Pre-Processing: The first step before building any machine learning model is to
pre-process the data; if the data is properly pre-processed, the results will be more reliable.
First we remove all punctuation and convert the tokenized words to
lowercase. Then, to clean the noise present in the dataset, we use the framework of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to
automatically improve the quality of the dataset. Since the HASOC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] dataset consists
of tweets, it is prone to spelling mistakes, and people often deliberately misspell abusive
words to avoid auto-blocking. Through this framework, we detect
misspelled words and replace them with correctly spelled words to improve the quality
of the classifier.
2. Feature Extraction: The pre-processed text needs to be transformed into vectors so
that the algorithms can learn and make predictions. In the following, we give
a brief description of the three feature sets used in our proposed multi-task framework.
a) Word Embedding (WE): Each word is represented as a 100-dimensional
feature vector using pre-trained Word2Vec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] embeddings. These word-wise
embeddings are then fed to a bi-directional Gated Recurrent Unit (GRU)
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to capture contextual information, and a single output representation for
all the input words is obtained at the end.
b) Topical Distribution (TD): Second, we use the topical distribution as a feature set.
      </p>
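The pre-processing step above can be sketched as follows. This is a minimal illustration: the misspelling-correction framework [9] is not public, so a tiny hypothetical `corrections` lookup stands in for it.

```python
import string

def preprocess(tweet: str) -> list:
    """Lowercase a tweet, strip punctuation, tokenize, and
    normalize misspelled abusive words.

    `corrections` is a hypothetical stand-in for the cited
    misspelling-correction framework, which is not public.
    """
    corrections = {"b1tch": "bitch", "fck": "fuck"}  # illustrative only
    # Strip punctuation, lowercase, and split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokens = tweet.translate(table).lower().split()
    # Replace known deliberate misspellings with canonical forms.
    return [corrections.get(tok, tok) for tok in tokens]

print(preprocess("Don't be a B1TCH!"))  # ['dont', 'be', 'a', 'bitch']
```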
      <p>
        Unsupervised models like topic extraction help us cluster similar content. We
use LDA MALLET (Latent Dirichlet Allocation MAchine Learning for LanguagE
Toolkit [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], an open-source toolkit designed for NLP tasks like classification,
clustering, and topic modeling) to extract topics from the training data. Given a
fixed number of topics, say ‘k’ (which we choose by maximizing intra-topic coherence scores),
LDA MALLET uses an optimized Gibbs sampling algorithm to calculate the
probability that each word in a document belongs to a particular topic, and thereby
the distribution of all ‘k’ topics in each text document. The probability of
each word in the vocabulary belonging to each topic is also obtained, so each topic can
be represented by the top ‘n’ words that have the highest probability of
belonging to it. These probability scores are further used to compute the topical
distribution of the test data. Our main intuition behind using the topical distribution as
a feature set is that the combination and intensity of the different topics present in a
tweet may help in predicting the final class.
      </p>
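As an illustration of this feature set, the sketch below derives per-tweet topic distributions. It uses scikit-learn's variational LDA as a stand-in for the MALLET Gibbs-sampling toolkit used in the paper (a separate Java package), and toy tweets rather than the HASOC data:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "lockdown virus death money time",
    "vaccine virus covid lockdown",
    "dumb shit bitch whore",
]

# Bag-of-words counts feed the topic model.
counts = CountVectorizer().fit_transform(tweets)

# k = 2 topics here; MALLET uses collapsed Gibbs sampling, whereas
# sklearn uses variational inference -- an illustrative stand-in only.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(counts)  # shape: (n_tweets, k)

# Each row is one tweet's topical distribution (sums to 1),
# usable directly as the 'TD' feature vector.
print(topic_dist.shape)  # (3, 2)
```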
      <p>For example: the top topic words of topic2 are ‘time’, ‘death’, ‘money’, ‘virus’, ‘lockdown’,
‘Chinese’, which show that the topic contains general discussion of the Coronavirus. So
tweets having a higher percentage of topic2 are more likely to belong to the
‘NOT’ class. Similarly, the top topic words of topic5 are ‘bitch’, ‘ass’, ‘fuck’, ‘shit’,
‘whore’, ‘dumb’, so tweets having a higher percentage of topic5 are more likely
to belong to the ‘Profane’ (PRFN) class.
c) TF-IDF (TI): Third, we use ‘TF-IDF’ as a feature set. We use the Term Frequency
– Inverse Document Frequency (‘TF-IDF’) weight to evaluate how important a word
is to a class (HOF, NOT, OFFN, or PRFN). ‘TF’ summarizes how often a given
word appears within a given class, whereas ‘IDF’ downscales words that appear
across many classes. A word has a high ‘IDF’ score if it appears in few classes; conversely,
if the word is common across classes (e.g. ‘a’, ‘an’, ‘the’), it has
a low ‘IDF’ score.</p>
      <p>IDF = log( (Number of total classes) / (Number of classes the term appears in) )   (1)</p>
      <p>So using ‘TF-IDF’, we try to capture the most important words or terms per class.
This approach also helps in dealing with out-of-vocabulary words or terms (like
hashtags and slang words).
3. Gating Mechanism: All three feature sets cannot be expected to contribute
equally to the final prediction. As the ‘topical distribution’ and ‘TF-IDF’ approaches are
completely unsupervised, we refine these features before passing them to the classifier.
For refinement, we use a gating mechanism that decides the weights of these feature
sets with respect to the word-embedding feature vector (‘WE’). First, we concatenate the
‘WE’ feature vector and the topical distribution (‘TD’) feature vector and apply a dense layer
to the concatenation. We then pass the resulting vector (X) through a ‘Sigmoid’,
which gives a value (x′) representing the weight of the ‘TD’ feature vector w.r.t. the ‘WE’
feature vector. Finally, this weight (x′) is multiplied by the entire ‘TD’ feature vector, so the
weighted topical distribution w.r.t. the word embedding is TD_WE = x′ ⋅ TD. This refines
the topical-distribution feature (i.e. either passes it or suppresses it) depending on its role
in the final prediction. The TF-IDF vector (‘TI’) is refined w.r.t. the word-embedding
vector (‘WE’) in the same way. The gating equations are:</p>
      <p>X = Dense([WE, TD])   (2)
x′ = Sigmoid(X)   (3)
TD_WE = x′ ⋅ TD   (4)
Y = Dense([WE, TI])   (5)
y′ = Sigmoid(Y)   (6)
TI_WE = y′ ⋅ TI   (7)</p>
      <p>We then concatenate the gated representations TD_WE and TI_WE with ‘WE’, along with
their residual connections, for the final prediction. We append residual connections of the
modalities to boost the gradient flow to the lower layers.
4. Multi-task Framework: The multi-task learning paradigm provides an efficient platform
for achieving generalization. Multiple tasks can exploit their inter-relatedness to improve
individual performance through a shared representation. Overall, it provides three basic
advantages over the single-task learning paradigm: 1) it helps in achieving generalization
across multiple tasks; 2) each task improves its performance in association with the
other participating tasks; 3) it offers reduced complexity because a single system can
handle multiple problems or tasks at the same time. After gating, the concatenated
representation is shared across the two branches of our proposed multi-task
framework corresponding to the two tasks, i.e. ‘HOF’ or ‘NOT’, and type of ‘HOF’. The shared
representation receives gradients of error from both branches and accordingly
adjusts the weights of the models. Thus, the shared representation is not biased toward
any particular task, and it assists the model in achieving generalization for both
tasks.</p>
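The gating step can be sketched in plain NumPy. This is an illustrative re-implementation with random toy vectors: the dimension `D`, the shared dense weights, and the element-wise (rather than scalar) gate are assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # common dimension after the per-feature dense projection (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(anchor, feat, W, b):
    """Weight `feat` w.r.t. the word-embedding anchor (cf. eqs. 2-4 / 5-7):
    dense layer on the concatenation, sigmoid gate, then scale `feat`."""
    x = np.concatenate([anchor, feat]) @ W + b   # Dense([anchor, feat])
    w = sigmoid(x)                               # gate values in (0, 1)
    return w * feat                              # pass or suppress feat

we = rng.standard_normal(D)  # Bi-GRU sentence vector (toy)
td = rng.standard_normal(D)  # topical distribution, projected (toy)
ti = rng.standard_normal(D)  # TF-IDF vector, projected (toy)

W = rng.standard_normal((2 * D, D))
b = np.zeros(D)
td_we = gate(we, td, W, b)
ti_we = gate(we, ti, W, b)

# Shared representation: WE, gated features, plus residual connections,
# fed to both task branches.
shared = np.concatenate([we, td_we, ti_we, td, ti])
print(shared.shape)  # (40,)
```

Because the gate values lie in (0, 1), each gated feature is never amplified, only passed through or suppressed, which matches the refinement described above.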
    </sec>
    <sec id="sec-5">
      <title>5. Dataset, Experimental Results, and Analysis</title>
      <p>
        In this section, we describe the dataset used for our experiments, the hyper-parameters used, the results,
an error analysis, and the other models that we compare our results with.
1. Dataset: We use the English dataset of the HASOC-2021 challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to evaluate and
validate our proposed approach. The training and test sets consist of 3,843 and 1,281
tweets respectively.
2. Experiments: We use the Python-based Keras library for the implementation. For the
experiments, we perform 5-fold cross-validation on the training data, as the test data is
unlabeled. For evaluation, we compute the macro f1-score and the weighted f1-score to measure
the performance of the model. We choose the weighted f1-score as a metric because samples
are unbalanced across the classes. We use grid search to find the optimal
hyper-parameters for our experiments. We use Bi-GRUs with 300 neurons each and set the dropout
to 0.3, the batch size to 50, and the number of epochs to 5. We use ReLU as the activation
function, Adam as the optimizer, and binary cross-entropy as the loss function. As the first
subtask (1A) is a binary classification problem, a ‘sigmoid’ is applied for the final prediction
with a threshold of 0.5, whereas for the other subtask (1B) we use
a ‘softmax’.
      </p>
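The two evaluation metrics can be reproduced with scikit-learn. The labels below are toy values for illustration, not our actual predictions:

```python
from sklearn.metrics import f1_score

# Toy subtask-1A labels: HOF = 1, NOT = 0 (imbalanced, as in the dataset).
y_true = [1, 1, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# Macro f1 averages per-class f1 equally; weighted f1 weights each
# class by its support, which matters for unbalanced classes.
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
print(round(macro, 3), round(weighted, 3))  # 0.778 0.852
```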
      <p>The optimal number of topics extracted differs across the cross-validation folds,
but we choose the model with the highest weighted f1-score. The optimal number of
topics extracted is 10, as shown in Figure 2.</p>
      <p>
        For extracting the class-wise most important features using the TF-IDF approach, we use the Python
library ‘sklearn’ [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and select the features having p-value &gt; 0.95. Some examples
of the class-wise top features selected:
a) ‘NOT’ class - ‘indiacovidcrisis’, ‘covidvaccine’, ‘heartbreaking’, ‘covid19’.
b) ‘HATE’ class - ‘bjp’, ‘shame’, ‘resignmodi’, ‘resignpmmodi’, ‘politics’.
      </p>
      <p>c) ‘OFFN’ class - ‘narendramodi’, ‘modi’, ‘pm’, ‘modi ji’, ‘pmoindia’.</p>
      <p>[Tables 1 and 2: STL vs. MTL framework comparison on Subtasks 1A and 1B; numeric values not recoverable]</p>
      <p>d) ‘PRFN’ class - ‘fuck’, ‘motherfucker’, ‘dick’, ‘pussy’, ‘asshole’, ‘bollock’.</p>
      <p>
        We evaluate our proposed approach with FastText [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] embeddings, a pre-trained BERT
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] model, and Word2Vec [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] embeddings. As shown in Tables 1 and 2, the Word2Vec model
performed better than the FastText and BERT models. We therefore choose the Word2Vec model
for further experiments. We then evaluate our proposed model with different input combinations:
only word embeddings (WE), word embeddings and TF-IDF (WE+TI),
word embeddings and topical distribution (WE+TD), and finally
all three (WE+TD+TI). For consistency, we use the same hyper-parameters for
training all the models. We obtain the best results when we use the gated multi-task framework
with all three feature sets (i.e. ‘WE’, ‘TI’, and ‘TD’).
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        In this paper, we have proposed an RNN-based gated multi-task framework that aims to reveal
and utilize the inter-dependence of two related tasks, i.e. the presence of ‘HOF’ data and the type of ‘HOF’
data present. Our proposed approach learns a joint representation for both tasks and uses a
weighted representation of the feature sets (via the gating mechanism). We evaluate our proposed
approach on the recently released English dataset of the HASOC challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Experimental results
suggest that the ‘topical distribution’ and ‘TF-IDF’ features, once refined, help the ‘word embeddings’
make better predictions. In the future, we would like to explore other dimensions of our multi-task
framework.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages</article-title>
          , in: Working Notes of FIRE 2021 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          ,
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: http://ceur-ws.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech</article-title>
          , in: FIRE 2021:
          <article-title>Forum for Information Retrieval Evaluation, Virtual Event</article-title>
          ,
          13th-17th December
          <year>2021</year>
          , ACM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          HASOC
          <year>2021</year>
          dataset, https://hasocfire.github.io/hasoc/2021/dataset.html,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-S.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bellmore</surname>
          </string-name>
          ,
          <article-title>Learning from bullying traces in social media, in: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies</article-title>
          ,
          <year>2012</year>
          , pp.
          <fpage>656</fpage>
          -
          <lpage>666</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Burnap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making</article-title>
          ,
          <source>Policy &amp; internet 7</source>
          (
          <year>2015</year>
          )
          <fpage>223</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nobata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mehdad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Abusive language detection in online user content</article-title>
          ,
          <source>in: Proceedings of the 25th international conference on world wide web</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warmsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Automated hate speech detection and the problem of offensive language</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>11</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Salem</surname>
          </string-name>
          ,
          <article-title>Gate-variants of gated recurrent unit (gru) neural networks</article-title>
          ,
          <source>in: 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1597</fpage>
          -
          <lpage>1600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shakir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <article-title>Learning domain terms-empirical methods to enhance enterprise text analytics performance</article-title>
          ,
          <source>in: Proceedings of the 28th International Conference on Computational Linguistics: Industry Track</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>190</fpage>
          -
          <lpage>201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Dlib-ml: A machine learning toolkit</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>1755</fpage>
          -
          <lpage>1758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Fasttext.zip: Compressing text classification models</article-title>
          ,
          <source>arXiv preprint arXiv:1612.03651</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Tenney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <article-title>Bert rediscovers the classical nlp pipeline</article-title>
          ,
          <source>arXiv preprint arXiv:1905.05950</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>