<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ALBERT for Hate Speech and Offensive Content Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jun Zeng</string-name>
          <email>zengjun@mail.ynu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Li Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Wu</string-name>
          <email>haowu@ynu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Kunming</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>This paper describes our system for Subtask 1A of HASOC 2021, submitted under the team name JZ2021. Subtask 1A focuses on recognizing hate speech and offensive language in English and Hindi. The detection of hate speech and offensive content on the Internet has received widespread attention: such comments cause considerable harm to people, so identifying them is very meaningful. With the development of deep learning, many pre-trained deep neural network models are used for text classification tasks. However, although some pre-trained models perform well, they contain a large number of parameters. In the HASOC 2021 task, we use ALBERT, which improves on the BERT model and effectively reduces the number of model parameters. We chose ALBERT-large, which obtains strong results on the task. Our system achieves a Macro F1 score of 83.75%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, with the rapid development of Internet technology, the number of users has
increased rapidly and various social platforms have emerged. Netizens can freely express
their opinions on these platforms. Because most platforms allow anonymous posting, many
people use them to vent their dissatisfaction with life. As a result, a great deal of hate speech and offensive
content has been generated on the Internet. This problem needs to be taken seriously,
because such content not only causes distress but can even contribute to mental
illness or suicide. However, a huge number of comments are generated on the
Internet every day, and it is unrealistic to screen them manually.
It is therefore particularly important to replace manual screening with artificial intelligence methods.</p>
      <p>The identification of hate speech and offensive content faces several challenges. First,
posts on social media span multiple languages, and each person’s writing style is different.
Irregular writing and newly emerging Internet expressions also
make detection harder. Second, some comments do not directly contain
insulting words but are implicit or ironic attacks, which are difficult to detect. In addition,
there is no clear, agreed standard for what counts as hate speech. The performance of
a model depends heavily on the training data set, which in turn depends to some extent on the people who annotated
the data.</p>
      <p>
        In response to the above issues, the NLP community has released a series of shared tasks on hate
speech identification. The HASOC 2021 shared task is one of them [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is dedicated to the
identification of hate speech and offensive content in English and Indo-Aryan languages. In
this task, we used the pre-trained model ALBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which improves on
the BERT model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by greatly reducing the number of parameters and speeding up
training.
      </p>
      <p>The structure of this article is as follows: Section 2 introduces related research on hate speech
and offensive language recognition. Section 3 explains the model used in this task. Section 4
describes the experimental procedure. Section 5 presents and analyzes the experimental
results. Section 6 summarizes this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Identifying hate speech is a text classification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] task, and text classification has attracted
a lot of attention in the field of natural language processing. In recent years, deep learning
technology has developed rapidly and has been widely used in many fields. In text classification
tasks, deep learning models based on Convolutional Neural Networks (CNN) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Recurrent
Neural Networks (RNN) and Attention Mechanisms have made good progress.
      </p>
      <p>
        CNNs were initially successful in image processing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and they were used for text classification
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] later. TextCNN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a CNN-based deep learning model that has achieved excellent
results in classification tasks. However, CNNs are not well suited to modeling long-range order in sequential input, and in natural language
processing tasks most of the input data is sequential. To meet this need,
Recurrent Neural Networks (RNN) have also developed rapidly. Long Short-Term Memory networks
(LSTM) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Gated Recurrent Units (GRU) are two classic RNN-based models, but on their own they do not map an input sequence to an output
sequence of a different length. Sutskever et al. proposed the Sequence to Sequence (Seq2Seq) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] model, which
can handle variable-length sequences and is used for machine translation. Later, the Attention
Mechanism was proposed, and many models began to adopt it. Google
proposed the Transformer model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which relies solely on the Attention Mechanism. The
Transformer has had a far-reaching influence on the field of natural language processing.
In pursuit of better results, the number of model parameters has grown considerably, and
training has become slower. Pre-trained models can alleviate this problem, because a pre-trained model can be adapted to
a downstream task by fine-tuning. GPT and BERT are two pre-trained models, both of which are based on
the Transformer structure. After the release of BERT, many new results appeared on
NLP tasks. However, the BERT model has a huge number of parameters, which makes
training difficult. The ALBERT model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] alleviates these problems while still performing very well.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec>
        <title>3.1. BERT</title>
        <p>BERT [13] stands for Bidirectional Encoder Representations from Transformers; as its name suggests, it is a
bidirectional model based on the Transformer. BERT uses a large amount of text data to construct
two pre-training tasks, namely the Masked Language Model (MLM) and Next Sentence Prediction
(NSP). MLM removes the constraint of a single direction: it randomly masks some tokens in
the input and then predicts the masked tokens from their context. This differs from
traditional left-to-right training because it allows the model to learn deep bidirectional
language representations. With MLM, the masked word is not always replaced by the actual [MASK] token.
The training data generator randomly selects 15% of the token
positions for prediction, and each selected token is then (1) replaced
with the [MASK] token 80% of the time; (2) replaced with a random token
10% of the time; or (3) left unchanged 10% of the time. NSP trains the model
to understand the relationship between two sentences, which is very important for
many downstream tasks. Specifically,
sentence <italic>A</italic> and sentence <italic>B</italic> are selected from the corpus to form a training example: <italic>B</italic> is the actual next
sentence after <italic>A</italic> 50% of the time and is not 50% of the time. The main structure of BERT is
a stack of Transformer encoders. It takes a sequence of tokens as input, applies
the Self-attention Mechanism in each layer, and passes the result through a feedforward
neural network to the next encoder. BERT comes in two versions, BERT-base and
BERT-large, as shown in Table 1. L represents the number of Transformer layers, H
represents the dimension of the output (the hidden size), A represents the number of Multi-head Attention heads, and
TotalParameters represents the number of model parameters. Thanks to the Self-attention
Mechanism in the Transformer, BERT can be applied to many downstream tasks by fine-tuning.</p>
      </sec>
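      <p>The 80%/10%/10% replacement rule of MLM can be illustrated with a minimal Python sketch. It is not the original BERT data generator: the function name mask_tokens, the toy vocabulary, and the independent per-position sampling (instead of selecting exactly 15% of the positions) are simplifying assumptions made here for illustration, and special tokens such as [CLS] are not treated separately.</p>

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Corrupt a token list for MLM training (illustrative sketch only).

    Returns (corrupted, labels). labels[i] holds the original token at the
    positions selected for prediction and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position is not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll >= 0.9:
            pass  # 10% of the time: keep the original token
        elif roll >= 0.8:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        else:
            corrupted[i] = MASK_TOKEN  # 80%: replace with [MASK]
    return corrupted, labels


# Toy usage with an assumed six-word vocabulary.
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
toy_sentence = ["the", "cat", "sat", "on", "the", "mat"]
corrupted_sentence, target_labels = mask_tokens(toy_sentence, toy_vocab)
```

      <p>The model is trained to recover the original token only at the positions where the label is not None, so unchanged and randomly replaced tokens still contribute to the loss.</p>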
      <sec id="sec-3-1">
        <title>3.2. ALBERT</title>
        <p>The recent trend in the NLP field is to use large-scale models to obtain better performance.
However, blindly adding model parameters does not necessarily bring better results. Although BERT
is powerful, it has a very large number of parameters, which places higher demands on hardware
and makes training more time-consuming. A Lite BERT (ALBERT)
alleviates this problem: it has significantly fewer parameters than the traditional BERT model.</p>
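        <p>ALBERT mainly uses two methods to reduce model parameters. The first is factorized
embedding parameterization. In the BERT model, the WordPiece [13] embedding size <inline-formula><tex-math>E</tex-math></inline-formula> and the
hidden layer size <inline-formula><tex-math>H</tex-math></inline-formula> are equal. However, this is not necessary. In practice, NLP requires
a large vocabulary of size <inline-formula><tex-math>V</tex-math></inline-formula>, and the size of the embedding matrix is <inline-formula><tex-math>V \times E</tex-math></inline-formula>. If <inline-formula><tex-math>E</tex-math></inline-formula> is always equal
to <inline-formula><tex-math>H</tex-math></inline-formula>, increasing <inline-formula><tex-math>H</tex-math></inline-formula> also enlarges the embedding matrix. As a result, the number of model
parameters may increase dramatically. ALBERT uses factorization to decompose the
embedding parameters into two smaller matrices: the one-hot vector is first projected into a
low-dimensional space of size <inline-formula><tex-math>E</tex-math></inline-formula> and then projected into the hidden space. Through factorization,
the embedding parameters change from <inline-formula><tex-math>O(V \times H)</tex-math></inline-formula> to <inline-formula><tex-math>O(V \times E + E \times H)</tex-math></inline-formula>. When <inline-formula><tex-math>H</tex-math></inline-formula> is much larger
than <inline-formula><tex-math>E</tex-math></inline-formula>, the parameters of ALBERT are greatly reduced compared to BERT. The second method is
cross-layer parameter sharing, which further improves parameter efficiency.
Parameters can be shared in several ways, but the
default in ALBERT is to share all parameters across layers.</p>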
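        <p>The effect of both methods on the parameter count can be checked with a short calculation. The sketch below is only a back-of-the-envelope estimate: the vocabulary size of 30000 and embedding size of 128 are assumed values that are not stated in this paper, and the encoder estimate of about 12 × H × H weights per layer ignores biases and layer normalization.</p>

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Number of weights in the token-embedding block.

    Without factorization: V x H. With factorization: V x E + E x H.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(num_layers, hidden_size, shared):
    """Rough encoder weight count, using about 12 * H^2 weights per layer.

    4 * H^2 for the attention projections plus 8 * H^2 for a feed-forward
    block with inner size 4H; biases and layer normalization are ignored.
    With cross-layer sharing, one layer's weights are reused by every layer.
    """
    per_layer = 12 * hidden_size * hidden_size
    return per_layer if shared else num_layers * per_layer


# H and L follow the ALBERT-large setting; V and E are assumed values.
V, H, E, L = 30000, 1024, 128, 24

bert_style_embedding = embedding_params(V, H)          # 30,720,000
albert_style_embedding = embedding_params(V, H, E)     # 3,971,072
unshared_encoder = encoder_params(L, H, shared=False)  # 301,989,888
shared_encoder = encoder_params(L, H, shared=True)     # 12,582,912
```

        <p>Under these assumptions, the factorized embedding plus one shared encoder layer amounts to roughly 16.6 million weights, which is of the same order as the parameter count reported for ALBERT-large.</p>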
        <p>BERT uses the NSP loss so that the model learns the relationship between two sentences
and can adapt to NLP tasks such as QA. However, research has
found that NSP is not very effective and is unreliable. ALBERT therefore proposes the
sentence-order prediction (SOP) loss, which emphasizes the coherence between sentences. The
SOP loss takes two consecutive segments from the same document as a positive example, and
uses the same two segments with their order swapped as a negative example. NSP tends to learn
simpler topic prediction signals, so it cannot solve the SOP task, whereas SOP performs well
on the NSP task.</p>
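        <p>The construction of SOP training examples can be sketched as follows. The function name make_sop_example and the equal probability of positive and negative examples are assumptions made here for illustration; they are not taken from the description above.</p>

```python
import random


def make_sop_example(segment_a, segment_b, rng=None):
    """Build one sentence-order prediction example (illustrative sketch).

    segment_a is assumed to directly precede segment_b in the same document.
    Returns (first, second, label), where label 1 means the original order
    was kept and label 0 means the two segments were swapped.
    """
    rng = rng or random.Random()
    if rng.random() >= 0.5:
        return segment_a, segment_b, 1  # positive example: original order
    return segment_b, segment_a, 0      # negative example: swapped order


# Toy usage with two consecutive segments of an imaginary document.
example = make_sop_example("It started to rain.", "We went back inside.")
```

        <p>Because both segments always come from the same document, topic cues alone cannot separate positives from negatives, which is why the task focuses on coherence.</p>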
        <p>There are four versions of ALBERT: ALBERT-base, ALBERT-large, ALBERT-xlarge and
ALBERT-xxlarge. The basic hyperparameters of the different versions of the BERT model and
the ALBERT model are shown in Table 2. When the number of layers and the hidden layer size of
ALBERT are similar to those of BERT, ALBERT has far fewer parameters than
BERT.</p>
        <p>In this shared task, we chose the ALBERT model to identify hate speech and offensive content.
More specifically, we used ALBERT-large, which has 18M parameters, a total of 24
Transformer layers, and a hidden size of <inline-formula><tex-math>H = 1024</tex-math></inline-formula>. We introduced the BERT model above because ALBERT
is an improvement on BERT, and we need to compare the parameters of the two models. The
smaller number of parameters is the reason we chose ALBERT.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>This section introduces the task description, the data set we used, and other experimental details.</p>
      <sec id="sec-4-1">
        <title>4.1. Task Description</title>
        <p>
          We participated in Subtask 1A [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Subtask 1A focuses on hate speech and offensive language
recognition for English and Hindi. It is a coarse-grained binary classification task, in which the
participating system is required to classify tweets into two classes: Hate and Offensive
(HOF) and Non-Hate and Offensive (NOT). A post is labeled NOT if it does not
contain any hate speech, profanity, or offensive content. A post is labeled HOF if it
contains hate, offensive, and profane content.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset</title>
        <p>
          The data of this experiment is provided by HASOC 2021 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and the data sets are collected from Twitter.
We used the English data set. There are a total of 3790 English training
instances, of which 2417 are labeled HOF and 1325 are labeled NOT. The
test data set contains 1268 instances.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Steps</title>
        <p>First, we preprocess the data. This mainly includes: (1) converting emojis
into corresponding phrases; (2) changing all text to lowercase; (3) turning numbers
into string form; (4) removing the @ symbol. We use 80% of the English training data set
as training data and the rest as validation data. We use ALBERT for embedding. The
[CLS] vector contains the semantic information of the entire sentence; it is passed to the
classifier (fully connected layers), whose output is activated with the softmax function. The loss
function is sparse_categorical_crossentropy, and the optimizer is Adam. The architecture
of our model is shown in Figure 1.</p>
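        <p>The preprocessing steps and the 80/20 split can be sketched in plain Python. This is an illustration under stated assumptions, not our exact pipeline: the two-entry emoji table is a hypothetical stand-in for a full emoji-to-phrase mapping, spelling out each digit as an English word is only one possible reading of step (3), and the shuffling seed is arbitrary.</p>

```python
import random
import re

# Hypothetical, deliberately tiny emoji table (an assumption for this sketch).
EMOJI_TO_PHRASE = {
    "\U0001F620": " angry face ",
    "\U0001F602": " face with tears of joy ",
}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def preprocess(text):
    """Apply the four preprocessing steps to one tweet."""
    # (1) Convert emojis into corresponding phrases.
    for emoji_char, phrase in EMOJI_TO_PHRASE.items():
        text = text.replace(emoji_char, phrase)
    # (2) Change all text to lowercase.
    text = text.lower()
    # (3) Turn numbers into string form (assumed: spell out each digit).
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    # (4) Remove the @ symbol.
    text = text.replace("@", "")
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


def train_validation_split(examples, train_fraction=0.8, seed=42):
    """Shuffle the examples and split them into training and validation parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


cleaned = preprocess("@User You are 2 LOUD \U0001F620")
train_part, validation_part = train_validation_split(range(10))
```

        <p>The cleaned text would then be tokenized and fed to ALBERT; that part depends on the deep learning framework and is omitted here.</p>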
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We also use other methods as baseline models: SVM [14], LR (Logistic Regression) [15],
and BERT-base. SVM is a binary classifier that maps the feature vector of each example
to a point in space and seeks a hyperplane that correctly separates
the points of the two classes. LR is also a binary classification method. We use Macro F1 to evaluate the performance
of the models, and the results are shown in Table 3.</p>
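      <p>For reference, Macro F1 is the unweighted mean of the per-class F1 scores. A minimal Python sketch of the metric follows; the function name macro_f1 is ours, the label names follow the task definition, and the convention of scoring a class with no true or predicted instances as 0 is an assumption of this sketch.</p>

```python
def macro_f1(y_true, y_pred, labels=("HOF", "NOT")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy example: one HOF tweet is misclassified as NOT.
toy_true = ["HOF", "HOF", "NOT", "NOT"]
toy_pred = ["HOF", "NOT", "NOT", "NOT"]
toy_score = macro_f1(toy_true, toy_pred)
```

      <p>In the toy example, the F1 score is 2/3 for HOF and 4/5 for NOT, so the Macro F1 is 11/15, or about 0.733. Because both classes count equally, the metric is not dominated by the majority class.</p>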
      <p>ALBERT-large performs best (Macro F1 of 83.75%), and
we used this model for our submitted runs in the shared task. BERT-base also performed well
(Macro F1 of 83.26%, which is 0.49 percentage points lower than ALBERT-large), ranking second among the
models. The two traditional machine learning methods perform substantially worse.
Deep learning methods can automatically extract features with neural networks. Although their
computational cost is higher, they capture more useful information and outperform
traditional machine learning methods.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The performance of deep learning models is constantly improving, but the number of
parameters is also increasing considerably. ALBERT effectively reduces the number of model
parameters through factorization of the embedding layer and cross-layer parameter sharing, so
that users with ordinary hardware can also run it. Compared to traditional machine learning methods, ALBERT
performs better. Its performance is not inferior to that of BERT, while its number
of parameters is much smaller. Unfortunately, due to our insufficient hardware
resources, we could not try the larger ALBERT-xlarge and ALBERT-xxlarge versions in this task.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (61962061, 61562090,
U1802271), and partially supported by the Yunnan Provincial Foundation for Leaders of Disciplines
in Science and Technology (202005AC160005), the Top Young Talents of the “Ten Thousand Plan” in
Yunnan Province (YNWR-QNBJ-2019-188), and the Program for Excellent Young Talents of Yunnan
University.</p>
      <p>[12] (continued) December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.</p>
      <p>[13] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.),
Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019,
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for
Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423.
doi:10.18653/v1/n19-1423.</p>
      <p>[14] S. Liang, A. Q. M. Sabri, F. Alnajjar, C. K. Loo, Autism spectrum self-stimulatory behaviors
classification using explainable temporal coherency deep features and SVM classifier,
IEEE Access 9 (2021) 34264–34275. URL: https://doi.org/10.1109/ACCESS.2021.3061455.
doi:10.1109/ACCESS.2021.3061455.</p>
      <p>[15] W. Ksiazek, M. Gandor, P. Plawiak, Comparison of various approaches to combine logistic
regression with genetic algorithms in survival prediction of hepatocellular carcinoma,
Comput. Biol. Medicine 134 (2021) 104431. URL: https://doi.org/10.1016/j.compbiomed.2021.104431.
doi:10.1016/j.compbiomed.2021.104431.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages</article-title>
          , in:
          <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          ,
          <publisher-name>CEUR</publisher-name>
          ,
          <year>2021</year>
          . URL: http://ceur-ws.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech</article-title>
          , in:
          <source>FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021</source>
          , ACM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          , in:
          <source>8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020</source>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=H1eA7AEtvS.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. T. R.</given-names>
            <surname>Laskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hoque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Utilizing bidirectional encoder representations from transformers for answer selection</article-title>
          ,
          <source>CoRR</source>
          abs/2011.07208 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2011.07208. arXiv:2011.07208.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Alsaleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Marie-Sainte</surname>
          </string-name>
          ,
          <article-title>Arabic text classification using convolutional neural network and genetic algorithms</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>91670</fpage>
          -
          <lpage>91685</lpage>
          . URL: https://doi.org/10.1109/ACCESS.2021.3091376. doi:10.1109/ACCESS.2021.3091376.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Alhichri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Alswayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ammour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Ajlan</surname>
          </string-name>
          ,
          <article-title>Classification of remote sensing images using efficientnet-b3 CNN model with attention</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>14078</fpage>
          -
          <lpage>14094</lpage>
          . URL: https://doi.org/10.1109/ACCESS.2021.3051085. doi:10.1109/ACCESS.2021.3051085.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <article-title>Feature-guided CNN for denoising images from portable ultrasound devices</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>28272</fpage>
          -
          <lpage>28281</lpage>
          . URL: https://doi.org/10.1109/ACCESS.2021.3059003. doi:10.1109/ACCESS.2021.3059003.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <article-title>An analysis method for interpretability of CNN text classification model</article-title>
          ,
          <source>Future Internet</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <elocation-id>228</elocation-id>
          . URL: https://doi.org/10.3390/fi12120228. doi:10.3390/fi12120228.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Alshubaily</surname>
          </string-name>
          ,
          <article-title>Textcnn with attention for text classification</article-title>
          ,
          <source>CoRR</source>
          abs/2108.01921 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2108.01921. arXiv:2108.01921.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Acartürk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sirlanci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Balikcioglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demirci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sahin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. A.</given-names>
            <surname>Kucuk</surname>
          </string-name>
          ,
          <article-title>Malicious code detection: Run trace output analysis by LSTM</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>9625</fpage>
          -
          <lpage>9635</lpage>
          . URL: https://doi.org/10.1109/ACCESS.2021.3049200. doi:10.1109/ACCESS.2021.3049200.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <article-title>Application of seq2seq models on code correction</article-title>
          ,
          <source>Frontiers Artif. Intell.</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <elocation-id>590215</elocation-id>
          . URL: https://doi.org/10.3389/frai.2021.590215. doi:10.3389/frai.2021.590215.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>