<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Transformers for Hate Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sayar Ghosh Roy</string-name>
          <email>sayar.ghosh@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ujwal Narayan</string-name>
          <email>ujwal.narayan@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tathagata Raha</string-name>
          <email>tathagata.raha@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zubair Abid</string-name>
          <email>zubair.abid@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIRE '20, Forum for Information Retrieval Evaluation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Retrieval and Extraction Lab, International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Detecting and classifying instances of hate in social media text has been a problem of interest in Natural Language Processing in recent years. Our work leverages state-of-the-art Transformer language models to identify hate speech in a multilingual setting. Capturing the intent of a post or a comment on social media involves careful evaluation of the language style, semantic content and additional pointers such as hashtags and emojis. In this paper, we look at the problem of identifying whether a Twitter post is hateful and offensive or not. We further discriminate the detected toxic content into one of the following three classes: (a) Hate Speech (HATE), (b) Offensive (OFFN) and (c) Profane (PRFN). With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages. On the provided testing corpora, we achieve Macro F1 scores of 90.29, 81.87 and 75.40 for English, German and Hindi respectively while performing hate speech detection, and of 60.70, 53.28 and 49.74 during fine-grained classification. In our experiments, we show the efficacy of Perspective API features for hate speech classification and the effects of exploiting a multilingual training scheme. A feature selection study is provided to illustrate the impact of specific features upon the architecture's classification head.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rise in the number of posts made on social media, an increase in the amount of toxic
content on the web is witnessed. Measures to detect such instances of toxicity are of paramount
importance in today’s world with regards to keeping the web a safe and healthy environment
for all. Detecting hateful and offensive content in typical posts and comments found on the web
is the first step towards building a system which can flag items with possible adverse effects
and take the steps necessary to handle such behavior.</p>
      <p>In this paper, we look at the problem of detecting hate speech and offensive remarks within
tweets. More specifically, we attempt to solve two classification problems. Firstly, we try to
assign a binary label to a tweet indicating whether it is hateful and offensive (class HOF) or not
(class NOT). Secondly, if the tweet belongs to class HOF, we classify it further into one of the
following three possible classes: (a) HATE: Contains hate speech, (b) OFFN: Is offensive, and (c)
PRFN: Contains profanities.</p>
      <p>The language in use on the web is of a different text style as compared to day-to-day speech,
formally written articles, and webpages. In order to fully comprehend the social media style of
text, a model needs to have knowledge of the pragmatics of emojis and smileys, the specific
context in which certain hashtags are being used, and it should be able to generalize to various
domains. Also, social media text is full of acronyms, abbreviated forms of words and phrases,
orthographic deviations from standard forms such as the dropping of vowels from certain words,
and contains instances of code mixing.</p>
      <p>The escalation in derogatory posts on the internet has prompted certain agencies to make
toxicity detection modules available for web developers as well as for the general public. A
notable work in this regard is Google’s Perspective API (https://www.perspectiveapi.com/),
which uses machine learning models to estimate various metrics such as toxicity, insult, threat,
etc., given a span of text as input. We study the usefulness of these features for hate speech
detection tasks in English and German.</p>
      <p>
        In recent years, utilizing Transformer-based [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] Language Models pre-trained with certain
objectives on vast corpora [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been crucial to obtaining good representations of textual
semantics. In our work, we leverage the advances in language model pre-training research and
apply the same to the task of hate speech detection. Lately, we have witnessed the growing
popularity of multilingual language models which can work upon input text in a language
independent manner. We hypothesize that such models will be effective on social media texts
across a collection of languages and text styles. Our intuition is experimentally verified as we
are able to obtain respectable results on the provided testing data for the two tasks in question.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In this section, we will provide a brief overview of the variety of methods and procedures
applied in attempts to solve the problem of hate speech detection. Approaches using Bag of
Words (BoW) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] typically lead to a high number of false positives. They also suffer from data
sparsity issues. In order to deal with the large number of false positives, efforts were made to
better characterize and understand the nature of hate speech itself. This led to the formation
of finer distinctions between the types of hate speech [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; in that, hate speech was further
classified into “profane” and “offensive”. Features such as N-gram graphs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or Part of Speech
features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] were also incorporated into the classification models leading to an observable rise
in the prediction scores.
      </p>
      <p>
        Later approaches used better representation of words and sentences by utilizing semantic
vector representations such as word2vec [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These approaches outshine the
earlier BoW approaches as similar words are located closer together in the latent space. Thus,
these continuous and dense representations replaced the earlier binary features, resulting in a
more effective encoding of the input data. Support Vector Machines (SVMs) with a combination
of lexical and parse features have also been shown to perform well for detecting hate speech
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>1https://www.perspectiveapi.com/</title>
        <p>3708
2373
2963</p>
        <sec id="sec-2-1-1">
          <title>Test</title>
          <p>
            The recent trends in deep learning led to better representations of sentences. With RNNs, it
became possible to model larger sequences of text. Gated RNNs such as LSTMs [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] and GRUs
[
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] made it possible to better represent long-term dependencies. This boosted classification
scores, with LSTM and CNN-based models significantly outperforming character- and word-based
N-gram models [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Character-based modelling with CharCNNs [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] has been applied
for hate speech classification. These approaches particularly shine in cases where the offensive
speech is disguised with symbols like ‘*’, ‘$’ and so forth [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>
            More recently, attention-based approaches like Transformers [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] have been shown to
capture contextualized embeddings for a sentence. Approaches such as BERT [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] which have
been trained on massive quantities of data allow us to generate robust and semantically rich
embeddings which can then be used for downstream tasks including hate speech detection.
          </p>
          <p>
            There have also been a variety of open or shared tasks to encourage research and development
in hate speech detection. The TRAC shared task [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] on aggression identification included
both English and Hindi Facebook comments. Participants had to detect abusive comments and
distinguish between overtly aggressive comments and covertly aggressive comments. OffensEval
(SemEval-2019 Task 6) [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] was based on the Offensive Language Identification Dataset
(OLID) containing over 14,000 tweets. This SemEval task had three subtasks: discriminating
between offensive and non-offensive posts, detecting the type of offensive content in a post,
and identifying the target of an offensive post. At GermEval [16], there was a task to detect
and classify hurtful, derogatory or obscene comments in the German language. Two sub-tasks
were continued from their first edition, namely a coarse-grained binary classification task and a
fine-grained multi-class classification problem. As a novel sub-task, they introduced the binary
classification of offensive tweets into explicit and implicit.
          </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The datasets for the tasks were provided by the organizers of HASOC ’20
(https://competitions.codalab.org/competitions/26027) [17]. The data
consists of tweets from three languages: English, German and Hindi, and was annotated on two
levels. The coarse annotation involved a binary classification task with the given tweet being
marked as hate speech (HOF) or not (NOT). In the finer annotation, we differentiate between
the types of hate speech and have four different formal classes:
1. HATE: This class contains tweets which highlight negative attributes or deficiencies of
certain groups of individuals; it also includes hateful comments directed towards individuals.
2. OFFN: This class contains tweets which are offensive in nature.
3. PRFN: This class contains tweets with profanities.
4. NONE: This class contains tweets with no hateful, offensive or profane content.</p>
      <sec id="sec-3-1">
        <title>2https://competitions.codalab.org/competitions/26027</title>
        <p>1852
1700
2116</p>
      <p>In table 1, we list the data size in number of tweets, and in tables 2 and 3, we provide the
number of instances of the different classification labels.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>In this section, we outline our approach towards solving the task at hand.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>We utilized the Python libraries tweet-preprocessor (https://github.com/s/preprocessor)
and ekphrasis (https://github.com/cbaziotis/ekphrasis) for tweet tokenization and
hashtag segmentation respectively. For extracting English and German cleaned tweet texts,
tweet-preprocessor’s clean functionality was used. For Hindi tweets, we tokenized the tweet
text on whitespaces and symbols including colons, commas and semicolons. This was followed
by the removal of hashtags, smileys, emojis, URLs, mentions, numbers and reserved words (such as
@RT, which indicates a retweet) to yield the pure Hindi text within the tweet.</p>
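        <p>As a concrete illustration, here is a minimal sketch of this cleaning step. It assumes the tweet-preprocessor and ekphrasis packages are installed; the clean and Segmenter calls follow those libraries' documented usage, and the example output is approximate:</p>
        <preformat>
import preprocessor as tp                      # the tweet-preprocessor package
from ekphrasis.classes.segmenter import Segmenter

seg = Segmenter(corpus="twitter")              # segmentation statistics from a Twitter corpus

def clean_tweet(text):
    """Strip URLs, mentions, hashtags, emojis, smileys and reserved words."""
    return tp.clean(text)

def segment_hashtag(tag):
    """e.g. '#BBMAsTopSocial' -> 'bbmas top social' (approximately)."""
    return seg.segment(tag.lstrip("#").lower())

print(clean_tweet("RT @user: he is enjoying his snacks #BBMAsTopSocial https://t.co/x"))
print(segment_hashtag("#BBMAsTopSocial"))
        </preformat>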
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature Engineering</title>
        <p>In addition to the cleaned tweet, we utilize tweet-preprocessor to populate certain information
fields which can act as features for our classifiers. We include the hashtag text, which is
segmented into meaningful tokens using the ekphrasis segmenter for the Twitter corpus. We
also save information such as URLs, name mentions such as ‘@derCarsti’, quantitative values
and smileys. We extract emojis, which can be processed in two ways. We initially experimented
with the emot Python library (https://github.com/NeelShah18/emot) to obtain the textual
description of a particular emoji: for example, one emoji maps to ‘smiling face with open mouth
&amp; cold sweat’ and another to ‘panda’. We later chose to utilize emoji2vec [18] to obtain a
semantic vector representing the particular emoji. The motivation behind this is as follows: the
text describing the emoji’s attributes might not capture all the pragmatics and the true sense of
what the emoji signifies in reality. As a concrete example, consider the tongue emoji: its textual
description will not showcase the emoji’s association with ‘joking around, laughter and general
goofiness’, which is its real-world implication. We expect emoji2vec to capture these kinds of
associations.</p>
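        <p>A short sketch of the emoji feature extraction follows, assuming the pre-trained emoji2vec vectors released by its authors are available locally in word2vec format and loaded through gensim; the file name here is illustrative:</p>
        <preformat>
from gensim.models import KeyedVectors

# Pre-trained 300-dimensional emoji embeddings (emoji2vec); the path is an assumption
e2v = KeyedVectors.load_word2vec_format("emoji2vec.bin", binary=True)

def emoji_vectors(tweet):
    """Collect emoji2vec embeddings for every character the model knows."""
    return [e2v[ch] for ch in tweet if ch in e2v]
        </preformat>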
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Perspective API Features</title>
        <p>We perform experiments with features extracted from the Perspective API. The API uses machine
learning models to estimate various numerical metrics modeling the perceived impact which a
post or a comment might have within a conversation. At the time of writing, the Perspective API
does not support Hindi natural language text in the Devanagari script; thus, our experiments are
on German and English. On German text, the API provides scores, real numbers between 0 and 1,
for the following fields: ‘toxicity’, ‘severe toxicity’, ‘identity attack’, ‘insult’, ‘profanity’ and
‘threat’. For English text, in addition to the fields for German, the API provides similar scores
for the fields ‘sexually explicit’, ‘obscene’ and ‘toxicity fast’ (which simply uses a faster model
for computing toxicity levels on the back-end).</p>
      <p>For both English and German tweets, we extract Perspective API scores for all available fields
using (a) the complete tweet as is, and (b) the extracted cleaned tweet text excluding emojis,
smileys, URLs, mentions, numbers, hashtags and reserved words. Thus, we have 18 features for
English tweets and 12 features for German tweets to work with.</p>
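        <p>For illustration, a hedged sketch of retrieving such scores through the public Comment Analyzer endpoint; the endpoint and attribute names follow the API's public documentation, while the key, the raw_tweet and clean_text variables, and the attribute subset are assumptions:</p>
        <preformat>
import requests

API_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           "comments:analyze?key=YOUR_API_KEY")      # hypothetical API key
ATTRS = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
         "INSULT", "PROFANITY", "THREAT"]            # fields shared by en and de

def perspective_scores(text, lang):
    """Return one score in [0, 1] per requested attribute."""
    payload = {
        "comment": {"text": text},
        "languages": [lang],
        "requestedAttributes": {a: {} for a in ATTRS},
    }
    resp = requests.post(API_URL, json=payload).json()
    return [resp["attributeScores"][a]["summaryScore"]["value"]
            for a in ATTRS]

# 12 features for a German tweet: scores for the raw and the cleaned text
features = perspective_scores(raw_tweet, "de") + perspective_scores(clean_text, "de")
        </preformat>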
      <p>We trained multi-layer perceptron classifiers for English and German using a concatenation
of these features as the input vector. In addition to these classifiers trained in the monolingual
setting, we trained an English-German multilingual classifier using the 12 Perspective API
features which are common to English and German. The datapoints in the corresponding
training sets were randomly shuffled and standardized. The same standardization values were
used on the test set during inference. We tried out multiple training settings with different
activation functions and optimization techniques. The best results with Perspective features
are presented in Section 5.</p>
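        <p>A minimal sketch of this pipeline with scikit-learn, assuming X_train, y_train and X_test hold the Perspective feature vectors and labels; the activation and solver shown are two of the settings we searched over, not fixed choices:</p>
        <preformat>
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

scaler = StandardScaler().fit(X_train)           # standardization statistics
clf = MLPClassifier(activation="tanh", solver="adam",
                    early_stopping=True, batch_size=200)
clf.fit(scaler.transform(X_train), y_train)

# The same standardization values are reused on the test set at inference
preds = clf.predict(scaler.transform(X_test))
        </preformat>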
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Proposed Transformer-based Models</title>
        <p>We leverage Transformer-based [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] masked language models to generate semantic embeddings
for the cleaned tweet text. In addition to the cleaned tweet’s embedding, we generate and
utilize semantic vector representations for all the emojis and segmented hashtags available
within the tweet. The segmented hashtag embeddings are generated using the same pre-trained
Transformer model such that the text and hashtag embeddings are grounded in the same latent
space. emoji2vec is used to create the emojis’ semantic embeddings. The Transformer layers
encoding the cleaned tweet text are updated during the fine-tuning process on the available
training data. For classification, we use the concatenation of the cleaned tweet’s embedding
with the collective embedding vectors for segmented hashtags and emojis.</p>
        <p>[Figure 1: High-level overview of the model flow. A raw tweet (e.g., “RT @jeonggukpics: Don’t disturb please, he is enjoying his snacks while making those little dance #BBMAsTopSocial BTS #JUNGKOOK #정국…”) is split into cleaned text, hashtags and emojis; the cleaned text is encoded by the fine-tuned Transformer encoder, the segmented hashtags by a static Transformer, and the emojis by emoji2vec; the concatenated embeddings are passed to the classifier, whose loss against the true label drives training.]</p>
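        <p>A condensed sketch of the forward pass in PyTorch; the hidden size, the pooling choice and the tensor names are illustrative assumptions, with 768 the output width of XLM-R base and 300 that of emoji2vec:</p>
        <preformat>
import torch
import torch.nn as nn

class HateSpeechClassifier(nn.Module):
    def __init__(self, encoder, n_classes, text_dim=768, emoji_dim=300):
        super().__init__()
        self.encoder = encoder                   # fine-tuned XLM-R encoder
        self.mlp = nn.Sequential(                # two-layer classification head
            nn.Linear(2 * text_dim + emoji_dim, 256),
            nn.Tanh(),
            nn.Linear(256, n_classes),
        )

    def forward(self, tweet_ids, tweet_mask, hashtag_emb, emoji_emb):
        # Pooled embedding of the cleaned tweet text (first token position)
        text_emb = self.encoder(tweet_ids, attention_mask=tweet_mask)[0][:, 0]
        # hashtag_emb / emoji_emb: pre-computed averaged vectors (not updated)
        feats = torch.cat([text_emb, hashtag_emb, emoji_emb], dim=-1)
        return self.mlp(feats)
        </preformat>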
        <p>We are required to encode a list of emojis and a list of segmented hashtags, both of which can
be of variable lengths. Therefore, we average the vector representations of all the individual
emojis or segmented hashtags, as the case may be, to generate the centralised emoji or hashtag
representation. This is simple and intuitive, and earlier work on averaging local word embeddings
to generate global sentence embeddings [19] has shown that this yields a comprehensive vector
representation for sentences. We assume the same to hold true for emojis and hashtags as well.</p>
        <sec id="sec-4-1-1">
          <title>Activation Optimization</title>
          <p>identity</p>
          <p>tanh
identity
identity
tanh
tanh
adam (early-stop)
adam (early-stop)
sgd (adaptive LR)
adam (early-stop)
sgd (adaptive LR)
adam (early-stop)</p>
          <p>The concatenated feature set is then passed to a two-layer multi-layer perceptron (MLP). The
loss from the classifier is propagated back through the cleaned-tweet Transformer encoder during
training. We experimented with XLM-RoBERTa (XLMR) [20] as our pre-trained Transformer in
various training settings. XLM-RoBERTa has outperformed similar multilingual Transformer
models such as mBERT (multilingual BERT) [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] and multilingual-distilBERT [21] on various
downstream tasks. We therefore chose XLMR as our base Transformer model for the purpose
of the shared task. A high-level overview of our model flow is shown in Figure 1.</p>
          <p>For fine-tuning our XLMR Transformer weights, we perform learning rate scheduling based
on the actual computed macro F1-scores on the validation split instead of using the validation
loss. As opposed to simply using early stopping to prevent overfitting, we consider the change
in validation performance at the end of each training iteration. If the validation performance
goes down across an iteration, we trace back to the previous model weights and scale down our
learning rate. Training stops when the learning rate reaches a very small value (set to 1e-12 in
our experiments). Although expensive, this form of scheduling ensures that we maximize our
macro F1-score on the validation split. For further details on specific implementation nuances
and choice of hyperparameters, refer to Section 6.</p>
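          <p>The scheduling loop, sketched below; model, optimizer, train_epoch and macro_f1_on_validation are assumed helpers, and the decay factor of 0.1 is an illustrative choice rather than our reported setting:</p>
          <preformat>
import copy

best_f1, best_state = 0.0, copy.deepcopy(model.state_dict())
lr = 2e-5                                   # initial learning rate (Section 6)
while lr > 1e-12:                           # stop once the rate is very small
    train_epoch(model, optimizer)           # one pass over the training data
    f1 = macro_f1_on_validation(model)      # actual macro F1, not validation loss
    if f1 > best_f1:
        best_f1, best_state = f1, copy.deepcopy(model.state_dict())
    else:
        model.load_state_dict(best_state)   # trace back to previous best weights
        lr *= 0.1                           # scale down the learning rate
        for group in optimizer.param_groups:
            group["lr"] = lr
          </preformat>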
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we provide quantitative performance evaluations of our approaches on the
provided testing set, the evaluation metric used throughout being the macro F1-score.</p>
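      <p>For reference, a macro F1 computation with scikit-learn, where y_true and y_pred are assumed label arrays:</p>
      <preformat>
from sklearn.metrics import f1_score

# Macro F1: the unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
      </preformat>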
      <p>In table 4, we present our study on the usage of Perspective API features with a multi-layer
perceptron classifier for the English and German tasks. We notice that these features are able to
provide respectable results on hate and offensive content detection but cannot compete with
the Transformer-based models when fine-grained classification is required. In the monolingual
mode, our exhaustive grid search showed that the identity activation for English and the tanh
activation for German are the most effective MLP hidden-layer activation settings. Table 4
lists the best activation functions and optimization techniques for particular (task, language)
pairs. We observe that German Task 2 benefits from the multilingual mode, and we attribute this
to the additional data from the English training examples which allows the model to generalize
better. However, a drop in the English results is witnessed, which might be due to the reduction
in the number of available features.</p>
      <p>In table 5, we present results using our proposed Transformer-based models. We present
XLMR-freeze-mono and XLMR-freeze-multi as baselines in which we use the pre-trained
XLM-RoBERTa Transformer weights without any fine-tuning (we used the pre-trained model
‘xlm-r-100langs-bert-base-nli-mean-tokens’ from UKPLab’s sentence-transformers library,
available at https://github.com/UKPLab/sentence-transformers). Only the classifier head is
trained in these models. We train six separate models for the three languages (two tasks per
language) and report corresponding results in the monolingual mode. In multilingual mode, we
only train two models on the aggregated training data for the two tasks and use them for
inference across the three languages.</p>
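      <p>A sketch of such a frozen-encoder baseline; the sentence-transformers usage shown is the library's standard pattern, while the training arrays are assumed:</p>
      <preformat>
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

# Frozen multilingual encoder: we only embed text, never fine-tune it
encoder = SentenceTransformer("xlm-r-100langs-bert-base-nli-mean-tokens")
X_train = encoder.encode(train_texts)    # train_texts: cleaned tweets (assumed list)
clf = MLPClassifier(batch_size=200).fit(X_train, y_train)
      </preformat>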
      <p>The models XLMR-adaptive and XLMR-tuned use our proposed adaptive learning rate
scheduling. In XLMR-tuned, the epsilon value of the Adam optimizer was set to 1e-7, as this
experimental setting provided gains on the validation split in our hyper-parameter tuning phase.
In both of these models, we jointly fine-tune the XLM-RoBERTa Transformer weights and the
classifier head in a multilingual setting. Our proposed models significantly outperform the
baselines with frozen Transformer weights, which is both intuitive and expected.</p>
      <p>Finally, in table 6, we show results for a study on feature selection using pre-trained
XLM-RoBERTa as the Transformer architecture for generating text embeddings. Note that our
primary models, including XLMR-freeze, utilize all of the discussed features. Like XLMR-freeze,
the Transformer layers are frozen and not fine-tuned during the training process. The table
is separated into monolingual and multilingual modes of training. Results are shown using
different feature collections, namely ‘cleaned tweet text only’, ‘cleaned tweet + hashtags’, and
‘cleaned tweet + emojis’ as inputs to the classifier. We observe a performance drop for English
and Hindi and a considerable performance gain for German while moving from monolingual to
multilingual training settings.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental Details</title>
      <p>We used Hugging Face’s (https://huggingface.co/) implementation of XLM-RoBERTa in our
proposed architecture. Our architectures using Transformer models with custom classification
heads were implemented using PyTorch (https://pytorch.org/). We used the Adam optimizer for
training with an initial learning rate of 2e-5 and a dropout probability of 0.2, with other
hyper-parameters set to their default values. We updated weights based on cross-entropy loss
values. For studies with Perspective API features and experiments where we do not fine-tune the
Transformer weights, we used scikit-learn’s [22] implementation of a multi-layer perceptron and
UKPLab’s sentence-transformers library [23] whenever applicable.</p>
      <p>In our Perspective API experiments, we used deep multi-layer perceptrons with 12 and 9
hidden layers for the binary and multi-class classification modes respectively. Across all our
experimental settings, we used a batch size of 200 with other hyper-parameter values set to
default. We performed an exhaustive grid search for every multi-layer perceptron model varying
the activation function, size of hidden layer, optimization algorithm and type of learning rate
scheduling. We reported results using the grid search settings which performed the best on a
4-fold cross validation on the training set. Our experimentation code is publicly available at
https://github.com/sayarghoshroy/Hate-Speech-Detection.</p>
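      <p>A sketch of one such grid search; the parameter grid below is illustrative rather than our exact search space, and X_train, y_train are assumed feature and label arrays:</p>
      <preformat>
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "activation": ["identity", "tanh", "relu"],
    "hidden_layer_sizes": [(100,) * 12, (100,) * 9],   # deep MLP variants
    "solver": ["adam", "sgd"],
    "learning_rate": ["constant", "adaptive"],
}
search = GridSearchCV(MLPClassifier(batch_size=200), param_grid,
                      scoring="f1_macro", cv=4)        # 4-fold cross validation
search.fit(X_train, y_train)
print(search.best_params_)
      </preformat>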
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we have leveraged the recent advances in large-scale Transformer-based language
model pre-training to build models for the coarse detection and fine-grained classification of
hateful and offensive content in social media posts. Our experiments showcase the utility
and effectiveness of language models pre-trained with multilingual training objectives on a
variety of languages. Our studies show the efficacy of Perspective API metrics by using them as
standalone features for hate speech detection. Our best model utilized semantic embeddings for
the cleaned tweet text, emojis, and segmented hashtags as features, and a customized two-layer
feedforward neural network as the classifier. We further conducted a feature selection experiment
to view the impact of individual features on the classification performance. We concluded that
the usage of hashtags as well as emojis adds valuable information to the classification head. We
plan to further explore other novel methods of capturing social media text semantics as part of
future work.</p>
      <sec id="sec-7-1">
        <title>8https://huggingface.co/ 9https://pytorch.org/</title>
        <p>[16] J. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, Overview of germeval task 2,
2019 shared task on the identification of ofensive language, 2019.
[17] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer,
Overview of the HASOC track at FIRE 2020: Hate Speech and Ofensive Content
Identification in Indo-European Languages), in: Working Notes of FIRE 2020 - Forum for
Information Retrieval Evaluation, CEUR, 2020.
[18] B. Eisner, T. Rocktäschel, I. Augenstein, M. Bosnjak, S. Riedel, emoji2vec: Learning
emoji representations from their description, CoRR abs/1609.08359 (2016). URL: http:
//arxiv.org/abs/1609.08359.
[19] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings
(2016).
[20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
scale, 2020. a r X i v : 1 9 1 1 . 0 2 1 1 6 .
[21] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, 2020. a r X i v : 1 9 1 0 . 0 1 1 0 8 .
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
Learning Research 12 (2011) 2825–2830.
[23] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using
knowledge distillation, arXiv preprint arXiv:2004.09813 (2020). URL: http://arxiv.org/abs/
2004.09813.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Locate the hate: Detecting tweets against blacks</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thirunarayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <article-title>Cursing in english on twitter</article-title>
          ,
          <source>in: Proceedings of the 17th ACM conference on Computer supported cooperative work &amp; social computing</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>415</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Themeli</surname>
          </string-name>
          ,
          <article-title>Hate Speech Detection using diferent text representations in online user comments</article-title>
          ,
          <source>Ph.D. thesis</source>
          ,
          <year>2018</year>
          .
          <source>doi:1 0 . 1 3 1 4 0 / R G . 2 . 2 . 1 2</source>
          <volume>9 9 1 . 2 5 7 6 4 .</volume>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Detecting ofensive language in social media to protect adolescent online safety</article-title>
          ,
          <source>in: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Confernece on Social Computing, IEEE</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Glove:
          <article-title>Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Sequence to sequence learning with neural networks</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          ,
          <source>arXiv preprint arXiv:1412.3555</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Badjatiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Deep learning for hate speech detection in tweets</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE</source>
          ,
          <year>2017</year>
          , p.
          <fpage>759</fpage>
          -
          <lpage>760</lpage>
          . URL: https://doi.org/ 10.1145/3041021.3054223.
          <source>doi:1 0 . 1 1</source>
          <volume>4 5 / 3 0 4 1 0 2 1 . 3 0 5 4 2 2 3 .</volume>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>LeCun, Character-level convolutional networks for text classification</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>649</fpage>
          -
          <lpage>657</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mehdad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <source>Do characters abuse more than words?</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>303</lpage>
          .
          <source>doi:1 0 . 1 8</source>
          <volume>6 5 3</volume>
          / v 1 / W 1 6
          <article-title>- 3 6 3 8</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , S. Malmasi (Eds.),
          <source>Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Santa Fe, New Mexico, USA,
          <year>2018</year>
          . URL: https://www.aclweb.org/anthology/ W18-4400.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , R. Kumar, SemEval
          <article-title>-2019 task 6: Identifying and categorizing ofensive language in social media (OfensEval)</article-title>
          ,
          <source>in: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Minneapolis, Minnesota, USA,
          <year>2019</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          . URL: https: //www.aclweb.org/anthology/S19-2010.
          <article-title>doi:1 0 . 1 8 6 5 3 / v 1 / S 1 9 - 2 0 1 0</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, Overview of GermEval task 2, 2019 shared task on the identification of offensive language, 2019.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. Eisner, T. Rocktäschel, I. Augenstein, M. Bosnjak, S. Riedel, emoji2vec: Learning emoji representations from their description, CoRR abs/1609.08359 (2016). URL: http://arxiv.org/abs/1609.08359.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings, 2016.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020. arXiv:1911.02116.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2020. arXiv:1910.01108.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, arXiv preprint arXiv:2004.09813 (2020). URL: http://arxiv.org/abs/2004.09813.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>