<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>model for identification of offensive content in South Indian languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shankar Biradar</string-name>
          <email>shankar@iiitdwd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Saumya</string-name>
          <email>sunil.saumya@iiitdwd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arun Chauhan</string-name>
          <email>aruntakhur@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology Dharwad</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>In recent years, offensive content has received considerable attention. The amount of offensive content generated on social media is increasing at an alarming rate, creating a greater need than ever to address it. To this end, the organizers of “Dravidian-Code Mixed HASOC-2021” have created two challenges. Task 1 involves identifying offensive content in Malayalam data, whereas Task 2 includes Malayalam and Tamil code-mixed sentences. Our team participated in Task 2. In our proposed model, we used multilingual BERT to extract features and applied two different classifiers, Support Vector Machine (SVM) and Deep Neural Network (DNN), to the extracted features. In addition, we used the task data to evaluate the performance of a monolingual BERT classifier. Our best-performing model, monolingual BERT, received a weighted F1 score of 0.70 for Malayalam data, ranking fifth; we also received a weighted F1 score of 0.573 for Tamil code-mixed data, ranking twelfth.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The availability of smartphones and the internet has created a lot of interest in social media
among today’s youth. These applications give a huge platform for users to connect with the
outside world and share their ideas and opinions with others. With these benefits comes a
disadvantage: many people misuse the platform under the name of freedom of expression to
publish inflammatory content on social media. This inflammatory information typically targets
a single person, a group of people, a particular faith, or a community [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. People generate
objectionable content and aggressively propagate it on social media. This type of material is
produced for a variety of reasons, including commercial and political gain [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This type of
content can disrupt social harmony and cause riots in society. Also, it has the potential to have a
detrimental psychological influence on the readers. It can harm people’s emotions and conduct.
Therefore, identifying this type of content is critical; as a result, researchers, policymakers, and
investors (stakeholders) are attempting to develop a dependable technique to identify offensive
content on social media.
      </p>
      <p>
        Various studies on hate speech, harmful content, and abusive language identification in social
media have been conducted during the previous decade. The majority of these studies were
focused on monolingual English content, and a large number of English-language corpora
have been created [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, people in countries with a complex culture and history, such as
India, frequently use regional languages to generate inappropriate social media posts. Users
typically mix their regional languages with English while creating such content. This type of
text is known as code-mixed text. Hence, we require an efficient method to
classify offensive content in code-mixed Indian languages. In this context, the organizers of the
“Dravidian-CodeMixed HASOC-2021” shared task have set up two tasks for detecting hate
speech in Dravidian languages, namely Malayalam and Tamil code-mixed data. Our team took
part in Task 2, and this paper presents the working notes for our proposed model.
      </p>
      <p>The rest of the article is arranged as follows: Section 2 provides a brief summary
of previous work, Section 3 describes the task and data set, and Section 4 describes the proposed
model in full. Section 5 reports the results, and Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        Many researchers and practitioners from industry and academia have been attracted to the
subject of automatic identification of hostile and harmful speech. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] provided a high-level
review of the current state-of-the-art techniques in offensive language identification and
related issues, such as hate speech recognition. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] developed a publicly accessible dataset for
identifying offensive language in tweets by categorizing them as hate speech, offensive
but not hate speech, or neither. Various machine learning models, such as Support Vector
Machine (SVM) and logistic regression, were built using various features, such as
n-grams, TF-IDF, and readability scores, for this purpose. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] built a model combining deep neural networks
with SVM for the detection of offensive content, achieving an F1
score of 90%.
      </p>
      <p>
        Offensive content detection from tweets features in several conferences and competition
tasks. OffensEval 2020 was organized at SemEval in 2020 as a task in five languages: English,
Arabic, Danish, Greek, and Turkish [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In FIRE 2019, a similar task was organized for
Indo-European languages such as English, Hindi, and German. The data set was created using
samples obtained from Twitter and Facebook in all three languages. Various models, including
LSTM with attention, Word2vec embeddings with CNN, and BERT, were used for this task. In
several cases, traditional learning models outperformed deep learning methods for languages
other than English [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A shared task on offensive language detection in Dravidian languages was
organized by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Data set description</title>
      <p>
        We took the data set from the HASOC subtask on offensive language identification in Dravidian
CodeMix [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The challenges set by the organizers are as follows.
      </p>
      <p>Task 1: A binary classification problem with message-level labeling for offensive and
non-offensive information in Malayalam CodeMixed YouTube comments.</p>
      <p>Task 2: Given Romanized Tanglish and Manglish Twitter or YouTube comments, the system
must classify them as offensive or non-offensive.</p>
      <p>
        Our team took part in Task 2, identifying offensive content in the Tanglish and
Manglish data sets. According to the organizers, the Tanglish data was collected from Twitter
and from comments on the Helo app, whereas the Manglish data was taken from YouTube comments [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
A detailed description of the data set is provided in Table 1. Both the Tanglish and Manglish data
contain ID, Tweet, and Label fields.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our team made three submissions based on three different models. In the first two
models, mBERT embeddings are passed through SVM and DNN classifiers, while in the third
model, monolingual BERT is employed as a classifier. Each of them follows the general
architecture shown in Figure 1. Our model thus consists of three stages, each of which is
discussed in the following subsections.</p>
      <sec id="sec-4-1">
        <title>4.1. Data processing</title>
        <p>
          The data set provided by the organizers contains a lot of unwanted information. A few data
preprocessing steps were applied to both the text and label fields to make the data suitable for
model building. Digits, special characters, hyperlinks, and Twitter user handles were removed
from the data set because they did not help improve the performance of our model.
Furthermore, the social media data provided by the organizers did not follow grammatical norms;
hence, lemmatization was performed to convert words to their base form. For example,
the words ate, eaten, and eating were converted to their base form eat. The text was also converted
to lower case to eliminate redundant terms. All of this preprocessing was done with the
help of the NLTK toolkit for Python [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The preprocessed data is then fed into
a tokenizer, which divides the tweet into several tokens. The mBERT tokenizer<sup>1</sup> is used for
this purpose. Padding and masking were also used to handle variable-length sentences.
        </p>
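        <p>The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the exact rules used in the submitted system: the function name <monospace>clean_comment</monospace> and the regular expressions are assumptions, the input is assumed to be Romanized (Latin-script) text, and lemmatization is omitted so that the sketch needs only the Python standard library.</p>
        <preformat><![CDATA[
```python
import re

# Illustrative patterns (assumptions, not the authors' exact rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # assumes Romanized, lowercased text
SPACE_RE = re.compile(r"\s+")


def clean_comment(text):
    """Lowercase a comment and strip links, handles, digits and symbols."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop hyperlinks first
    text = HANDLE_RE.sub(" ", text)      # drop @user handles
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and special characters
    return SPACE_RE.sub(" ", text).strip()


print(clean_comment("@user123 Padam SUPER!!! 100 https://t.co/abc"))
# padam super
```
]]></preformat>
        <p>Hyperlinks and handles are removed before the generic character filter so that fragments of a URL or user name do not survive as ordinary words.</p>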
        <p><sup>1</sup> https://huggingface.co/bert-base-multilingual-cased</p>
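        <p>Padding and masking for variable-length sentences can be pictured with the small sketch below. It assumes that token ids have already been produced by a tokenizer; the helper name <monospace>pad_and_mask</monospace>, the padding id 0, the maximum length, and the example ids are hypothetical values chosen for illustration.</p>
        <preformat><![CDATA[
```python
def pad_and_mask(batch_ids, max_len, pad_id=0):
    """Pad (or truncate) each id sequence to max_len and build attention masks.

    A mask value of 1 marks a real token, 0 marks padding.
    """
    padded, masks = [], []
    for ids in batch_ids:
        ids = list(ids)[:max_len]              # truncate overly long sequences
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)  # right-pad with the padding id
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Made-up token ids for two sentences of different length.
ids, mask = pad_and_mask([[101, 7592, 102], [101, 102]], max_len=4)
print(ids)   # [[101, 7592, 102, 0], [101, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```
]]></preformat>
        <p>The mask lets the encoder ignore padded positions, so sentences of different lengths can share one fixed-size batch.</p>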
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature extraction</title>
        <p>
          To obtain contextual embeddings from code-mixed data, we used the multilingual Bidirectional
Encoder Representations from Transformers (mBERT) model [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] in models 1 and 2, and monolingual BERT in
model 3. The architecture of the mBERT model is largely based on the original monolingual
BERT architecture [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which has 12 transformer blocks, 12 attention heads, and a hidden
size of 768. Consequently, the vector dimension of mBERT embeddings is 768. This model was
trained using the same pre-training objectives as BERT, namely Masked Language Modeling
(MLM) and Next Sentence Prediction. The only distinction is that multilingual BERT is trained
on Wikipedia data from 104 different languages to handle languages other than English. For
classification, we use only the embedding of the [CLS] token at the beginning of each sequence
because it serves as an embedding of the whole sentence.
        </p>
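        <p>One possible way to obtain such [CLS] embeddings is sketched below with the Hugging Face <monospace>transformers</monospace> library and PyTorch. The checkpoint name is the one given in the footnote of Section 4.1; the function name <monospace>extract_cls_embeddings</monospace>, the maximum length of 128, and the choice of PyTorch are assumptions made for illustration and may differ from the submitted implementation.</p>
        <preformat><![CDATA[
```python
MODEL_NAME = "bert-base-multilingual-cased"  # checkpoint from the footnote
EMBEDDING_DIM = 768                          # hidden size of BERT-base models


def extract_cls_embeddings(texts, model_name=MODEL_NAME, max_length=128):
    """Return one [CLS] vector (a list of floats) per input text.

    max_length=128 is an assumed value, not taken from the paper.
    """
    # Imported lazily: these are heavy, optional dependencies, and the
    # first call downloads the checkpoint.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    enc = tokenizer(
        list(texts),
        padding=True,        # pad to the longest sentence in the batch
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():    # feature extraction only, no gradients needed
        out = model(**enc)
    # Position 0 of the last hidden state holds the [CLS] token.
    return out.last_hidden_state[:, 0, :].tolist()
```
]]></preformat>
        <p>Calling <monospace>extract_cls_embeddings(["padam super"])</monospace> would return a list with a single 768-dimensional vector, which can then be passed to a downstream classifier.</p>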
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Classification</title>
        <p>We experimented with three different classifiers: SVM and DNN classifiers
on mBERT embeddings in models 1 and 2, and the pre-trained language model BERT in model 3.
These models are described in the subsections that follow. We selected them
because they outperformed other models such as Logistic
Regression (LR), Random Forest (RF), and Naive Bayes (NB) in our preliminary trials.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Traditional machine learning based classifier</title>
          <p>
            We experimented with a traditional machine learning algorithm, the Support Vector Machine
(SVM), with ten-fold cross-validation. Experimental results for the proposed model show
that kernel value "1" and solver "lbfgs" produce the best results; these hyper-parameter
values were determined through experimental trials. This model accepts mBERT embeddings as input
and produces labels that are either offensive or non-offensive. The model was developed using
Python’s scikit-learn library [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
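          <p>The ten-fold cross-validation protocol can be illustrated with the following standard-library sketch, which only produces the train/test index splits. The helper name <monospace>k_fold_indices</monospace>, the shuffling, and the seed are assumptions; the text does not state whether the folds were shuffled or stratified. In practice, each split would be used to fit the classifier on the training embeddings and score it on the held-out fold, for example with scikit-learn's <monospace>cross_val_score</monospace>.</p>
          <preformat><![CDATA[
```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# With 4000 training comments, each fold holds out 400 and trains on 3600.
splits = list(k_fold_indices(4000, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 3600 400
```
]]></preformat>
          <p>Every comment is used for testing exactly once, so the averaged score reflects the whole training set rather than a single held-out portion.</p>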
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Deep neural network based model</title>
          <p>Next, we experimented with a Deep Neural Network (DNN), the second model in our
proposed methodology. The DNN comprises several dense layers designed to
extract more significant features from the input embeddings. We used dense layers of 1000, 500,
100, and 50 neurons. Each dense layer is followed by dropout with a rate of 0.4 to prevent
overfitting. The dropout rate of 0.4 was selected by grid search and
remains constant throughout the experiment. To normalize activations, we additionally
employed a batch normalization layer. The output of these layers is then classified using a
sigmoid layer.</p>
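          <p>A minimal sketch of such a network is given below. The layer sizes, the dropout rate, and the sigmoid output come from the description above; the use of Keras, the ReLU activations, the placement of batch normalization after every dense layer, the single output unit, and the optimizer and loss are assumptions, since the text does not specify them. The helper <monospace>dense_parameter_count</monospace> is a hypothetical addition that shows how large the dense stack is for 768-dimensional inputs.</p>
          <preformat><![CDATA[
```python
HIDDEN_UNITS = (1000, 500, 100, 50)  # dense layer sizes given in the text
DROPOUT_RATE = 0.4                   # dropout rate given in the text
INPUT_DIM = 768                      # size of one mBERT [CLS] embedding


def dense_parameter_count(input_dim=INPUT_DIM, hidden=HIDDEN_UNITS, n_out=1):
    """Count weights and biases of the dense layers only (no batch norm)."""
    total, fan_in = 0, input_dim
    for units in (*hidden, n_out):
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


def build_dnn():
    """Assemble the classifier with Keras (assumed framework and layer order)."""
    import tensorflow as tf  # imported lazily; requires TensorFlow

    layers = [tf.keras.Input(shape=(INPUT_DIM,))]
    for units in HIDDEN_UNITS:
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
        layers.append(tf.keras.layers.BatchNormalization())
        layers.append(tf.keras.layers.Dropout(DROPOUT_RATE))
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model


print(dense_parameter_count())  # 1324701
```
]]></preformat>
          <p>Most of the parameters sit in the first dense layer (768 x 1000 weights), which is one reason dropout is useful when only a few thousand training comments are available.</p>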
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Transformer model</title>
          <p>In our last model, we experimented with a transformer-based language model, BERT.
Transformer architectures are trained on generic tasks such as language modeling and then
fine-tuned for classification. The underlying model for our classification is bert-base-uncased<sup>2</sup>,
which is provided by the BERT developers. We did not use ten-fold cross-validation to evaluate
monolingual BERT because it is computationally more expensive. Implementation details of all three
proposed models are available in our GitHub repository<sup>3</sup>.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>To evaluate the submitted models, the organizers used the weighted F1 score. Among the
proposed models, our top-performing monolingual BERT placed sixth for offensive
content recognition on the Manglish data set and eleventh on the Tanglish data set. Table 2
and Table 3 list the top-performing models with weighted F1 scores for the Manglish
and Tanglish data sets, respectively (the result of our proposed model is shown in bold).
Among our proposed models, BERT outperformed the other classifiers, reaching a weighted F1 score
of 0.70 on the Manglish data set and 0.57 on the Tanglish data set. Finally, we compare the
results of our proposed models in Table 4. We trained our proposed models on a Tanglish data
set comprising 4000 comments from the training set and tested them on 940 comments from
the test set. For the Manglish data set, 4000 training comments and 1000 test comments were used.</p>
      <p><sup>2</sup> https://huggingface.co/transformers/model_doc/bert.html</p>
      <p><sup>3</sup> https://github.com/shankarb14/dravidian-codemix</p>
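      <p>For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its number of true instances. The sketch below is a plain-Python illustration of this definition; the function name <monospace>weighted_f1</monospace> and the label strings in the example are made up for illustration and are not the task's actual label names. It corresponds to what scikit-learn computes with <monospace>f1_score(..., average="weighted")</monospace>.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1   # weight each class by its support
    return total / len(y_true)


# Toy example with made-up labels: three offensive, one non-offensive comment.
y_true = ["OFF", "OFF", "OFF", "NOT"]
y_pred = ["OFF", "OFF", "NOT", "NOT"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7667
```
]]></preformat>
      <p>In the toy example, the offensive class has an F1 of 0.8 and the non-offensive class an F1 of about 0.667; weighting them 3:1 gives roughly 0.767. Because of this weighting, the majority class dominates the score on imbalanced data.</p>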
      <sec id="sec-5-1">
        <title>5.1. Error analysis</title>
        <p>We examined the behavior of the proposed models on sample test sentences to evaluate their
performance. Based on our experimental observations, our best-performing model, the monolingual
BERT classifier, correctly classified all of the sample sentences. However, the
multilingual BERT models, mBERT+SVM and mBERT+DNN, misclassified test
samples 3 and 2, respectively. Table 5 summarises the findings.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future enhancement</title>
      <p>In this work, we presented the models submitted by our team IIITD-ShankarB for offensive content
identification in the shared task “Dravidian-CodeMixed HASOC-2021”. We
experimented with three distinct models: a machine learning-based model, a Deep Neural
Network model, and a transformer-based language model. Our best model was among the top-performing
submissions, ranking sixth on the Manglish data set and eleventh on the Tanglish data set. In future
work, we plan to improve the performance of the proposed model by including domain-specific
embeddings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          , S.-g. Jung,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Salminen</surname>
          </string-name>
          ,
          <article-title>A multiplatform Arabic news comment dataset for offensive language detection</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6203</fpage>
          -
          <lpage>6212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. N. d.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Melnyk</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Padhi</surname>
          </string-name>
          ,
          <article-title>Fighting offensive language on social media with unsupervised text style transfer</article-title>
          , arXiv preprint arXiv:1805.07685
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          , W. Magdy,
          <article-title>Abusive language detection on Arabic social media</article-title>
          ,
          <source>in: Proceedings of the first workshop on abusive language online</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warmsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macy</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Weber</surname>
          </string-name>
          ,
          <article-title>Automated hate speech detection and the problem of offensive language</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>11</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Al-Khalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Magdy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <article-title>Proceedings of the 4th workshop on open-source Arabic corpora and processing tools, with a shared task on offensive language detection</article-title>
          ,
          <source>in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          , G. Karadzhov,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Derczynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitenis</surname>
          </string-name>
          , Ç. Çöltekin,
          <article-title>SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020)</article-title>
          , arXiv preprint arXiv:2006.07235
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages</article-title>
          ,
          <source>in: Proceedings of the 11th forum for information retrieval evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar M</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>R L</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada</article-title>
          , in:
          <source>Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          , Association for Computational Linguistics
          , Kyiv,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>145</lpage>
          . URL: https://aclanthology.org/2021.dravidianlangtech-1.17.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar M</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          , P. B,
          <string-name>
            <given-names>S. Chinnaudayar</given-names>
            <surname>Navaneethakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam</article-title>
          ,
          <source>in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          , E. Loper,
          <source>Natural language processing with Python: analyzing text with the natural language toolkit</source>
          , O'Reilly Media, Inc.,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <article-title>How multilingual is multilingual BERT?</article-title>
          , arXiv preprint arXiv:1906.01502
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>