<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Detection of Rhetorical Role Labels using ERNIE2.0 and RoBERTa</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guneet Singh Kohli</string-name>
          <email>gkohli_be18@thapar.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>PrabSimran Kaur</string-name>
          <email>pkaur_be18@thapar.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jatin Bedi</string-name>
          <email>jatin.bedi@thapar.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology</institution>
          ,
          <addr-line>Patiala, Punjab</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>17</lpage>
      <abstract>
<p>Automatic detection of the rhetorical roles of sentences in a legal case judgment can help in numerous tasks such as summarizing legal decisions and legal search, which makes this problem a field of interest for many researchers. Legal case documents, however, are usually not well structured, which makes the task challenging. In this paper, we propose a multi-class text classification approach for rhetorical role labeling of legal judgments for Task 2 of the track 'Artificial Intelligence for Legal Assistance' presented by the Forum for Information Retrieval Evaluation in 2021. We (i) used ERNIE 2.0 token embeddings, which can better capture the lexical, syntactic, and semantic aspects of the information in the training data, and (ii) applied a single attention mechanism to capture long-range relations. The overall F1 score, Precision, and Recall are 0.505, 0.465, and 0.591 respectively, which ranked third among all the submitted teams. We make our code publicly available on GitHub.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
<p>The task requires assigning each sentence in the document one of seven labels: Facts, Ruling by Lower Court, Argument, Statute, Precedent, Ratio of the decision, and Ruling by Present Court. Our team made the following contributions to this problem as part of the shared task effort:
• We use ERNIE 2.0 token embeddings, which can better capture the lexical, syntactic, and semantic aspects of the information in the training data;
• We perform single attention learning to capture long-range relations.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
There have been various attempts towards the automatic identification of rhetorical roles. Initial work focused on understanding the rhetorical roles in case documents in order to summarize them. Later, the focus shifted to applying techniques based on handcrafted features for segmenting a document into functional and issue-specific parts. For instance, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
used Conditional Random Fields (CRF) to classify the sentences of a document into seven rhetorical roles in order to produce an effective summary. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] looked into the segmentation of U.S. court documents into
functional and issue-specific parts using CRF with handcrafted features.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a skip-gram model for identifying factual and non-factual sentences using a classifier model from the fastText library. In another line of work, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] compared rule-based scripts with machine learning approaches. Unlike all these works, which relied on handcrafted features to identify rhetorical roles in the legal domain, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used a deep learning approach in which no handcrafted features were required: a hierarchical BiLSTM model, which performs much better at this task. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also produced a fully annotated dataset of 53,210 documents collected from http://www.westlawindia.com. Later, more deep learning approaches were applied to these data. For instance, [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used RoBERTa embeddings and passed the output through a neural network model for classification, and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used a BERT model for the classification purpose.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset</title>
<p>The dataset provided by AILA 2021 contained 60 legal case documents. The training set included 50 annotated legal text documents, comprising 9,380 training sentences, as shown in Table 1. The test set included 10 annotated legal text documents, comprising 1,905 test sentences, also shown in Table 1. Every sentence in the documents was assigned one of the 7 rhetorical roles, as explained below.</p>
<p>1. Facts (abbreviated as FAC): Sentences that describe the events that led to the filing of the case and how it evolved in the legal system (e.g., a First Information Report at a police station, the filing of an appeal).
2. Ruling by Lower Court (abbreviated as RLC): Since the cases in the dataset are from the Supreme Court, there are preliminary rulings by the lower courts (Tribunal, High Court, etc.). These sentences correspond to the verdicts given by those lower courts.
3. Argument (abbreviated as ARG): Sentences that contain the court’s discussion of the arguments presented by the opposing parties.
4. Statute (abbreviated as STA): Established law cited from various sources.
5. Precedent (abbreviated as PRE): Relevant precedents cited; these are similar to the statute citations.
6. Ratio of the decision (abbreviated as Ratio): Sentences that denote the rationale/reasoning given by the Supreme Court for the application of a legal principle to the legal issue (final judgment).
7. Ruling by Present Court (abbreviated as RPC): Sentences that denote the final decision given by the Supreme Court for the case document.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
<p>Legal case documents are usually lengthy and unstructured, full of legal jargon, and often lack headings, which makes them difficult to read. It becomes a tedious task for a reader to locate the components of a document: the facts that led to the filing of the case, the arguments presented by the contenders, the statutes cited, and other similar categories related to the legal proceedings. The task of semantic/thematic segmentation, also known as rhetorical role labelling of sentences, therefore becomes important. It not only enhances the readability of the document but also supports several downstream tasks such as summarization, case law analysis, and semantic search, thus increasing the use cases of the data and opening up various possibilities.</p>
<p>Our methodology focuses on understanding the semantic relations between the tokens of each sentence that point towards the legal concepts to be mapped to the provided labels. The overall task was treated as an extension of a sentence classification problem with seven labels to be predicted. We employed dedicated pre-processing techniques for the efficient handling of sentences with a token length of less than ten, integrating knowledge derived from the data with observations made during exploratory text analysis (EDA). This pre-processing, combined with the removal of stop words and the application of inflectional stemming, helped increase the accuracy of the information retrieval system. The processed text was passed through the complete pipeline of a deep-learning, transformer-based approach to derive accurate contextual representations of the text and efficiently associate them with the corresponding labels in the data.</p>
      <sec id="sec-5-1">
        <title>4.1. Data Preparation</title>
<p>Properly cleaned data is essential for correct text analysis, so unwanted noise must be removed before feeding the text into the model. We therefore performed simple preprocessing consisting of tokenization, stopword removal, and lemmatization. Initially, the sentences were split into smaller pieces, or ”tokens.” The data was further cleaned by removing common words (stopwords) such as ”we” and ”are,” which do not help in text classification. Finally, the words were lemmatized to obtain the lemma, or base form, of each word.</p>
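<p>The preprocessing steps above can be sketched as a small pipeline. This is a minimal, self-contained illustration: the stopword list and lemma map below are toy stand-ins, since the paper does not specify which library or word lists were used.</p>

```python
import re

# Illustrative stand-ins for a full stopword list and lemmatizer
# (the actual resources used are not specified in the paper).
STOPWORDS = {"we", "are", "the", "of", "a", "an", "in", "to", "and"}
LEMMAS = {"filed": "file", "rulings": "ruling", "arguments": "argument"}

def preprocess(sentence: str) -> list[str]:
    """Tokenize, drop stopwords, and map tokens to their base form."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())   # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization

print(preprocess("We are reviewing the arguments filed in the appeal"))
# → ['reviewing', 'argument', 'file', 'appeal']
```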
<p>Additionally, a surprising pattern was observed in the labels of sentences with fewer than 10 words: such sentences carried the same label as the preceding sentence. This made prediction easier for the model in the case of short sentences, making our approach robust to sentences that lack rich semantic content.</p>
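<p>One way to apply this observation is to propagate the previous sentence’s label to any sentence below the length threshold. The sketch below is our illustration of that heuristic; the function name and the whitespace-based token count are assumptions, not the authors’ exact code.</p>

```python
def propagate_short_labels(sentences, labels, min_tokens=10):
    """Sentences shorter than `min_tokens` inherit the previous
    sentence's label, per the pattern observed in the training data."""
    out, prev = [], None
    for sent, lab in zip(sentences, labels):
        if prev is not None and len(sent.split()) < min_tokens:
            lab = prev  # inherit the preceding label
        out.append(lab)
        prev = lab
    return out
```

Because `prev` is updated with the possibly inherited label, a run of consecutive short sentences all receive the label of the last long sentence before them.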
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Modelling</title>
        <p>
          RoBERTa: [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] retrains BERT with an improved methodology, much more data, larger
batch sizes, and longer training times. RoBERTa modifies BERT's training strategy by
removing the NSP objective. Further, RoBERTa uses byte-pair encoding (BPE) as its tokenization
algorithm instead of BERT's WordPiece tokenization.
        </p>
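<p>BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in a corpus. The sketch below shows a single merge step on a toy character-level corpus; it illustrates the algorithm only, not RoBERTa’s actual byte-level vocabulary.</p>

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of {word: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Apply one BPE merge: replace every occurrence of `pair` with the joined symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is split into characters, mapped to its frequency.
corpus = {tuple("court"): 3, tuple("courts"): 2, tuple("cost"): 1}
pair = most_frequent_pair(corpus)   # ('c', 'o') is most frequent here
corpus = merge_pair(corpus, pair)   # "court" becomes ('co', 'u', 'r', 't')
```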
        <p>
          BERT: [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] BERT is a bidirectional language model that aims to learn contextual relations between
words using the transformer architecture. We use an official release of the pre-trained models;
details about the specific hyperparameters are given in Section 5. The input to BERT is
either a single text (a sentence or document) or a text pair. The first token of each sequence is
the special classification token [CLS], followed by the WordPiece tokens of the first text A, then a
separator token [SEP], and (optionally) the WordPiece tokens of the second text B. In
addition to token embeddings, BERT uses positional embeddings to represent the positions of
tokens in the sequence. For training, BERT applies the Masked Language Modeling (MLM) and
Next Sentence Prediction (NSP) objectives. In MLM, BERT randomly masks 15% of the input tokens and learns to predict them.
        </p>
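<p>The input layout described above can be sketched directly. This is an illustration of the [CLS]/[SEP] sequence assembly with segment ids, not the actual WordPiece tokenizer.</p>

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble BERT's input: [CLS] A [SEP] (B [SEP]), with segment ids
    marking which text each position belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)          # segment 0 for text A
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)  # segment 1 for text B
    return tokens, segments

tokens, segments = build_bert_input(["the", "appeal"], ["was", "dismissed"])
# tokens   = ['[CLS]', 'the', 'appeal', '[SEP]', 'was', 'dismissed', '[SEP]']
# segments = [0, 0, 0, 0, 1, 1, 1]
```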
        <p>
          ERNIE2.0 Transformer Encoder: [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] The model uses a multi-layer Transformer [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
as the basic encoder, like other pre-training models such as GPT [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The
transformer captures the contextual information of each token in the sequence via
self-attention and generates a sequence of contextual embeddings. Given a sequence, the special
classification embedding [CLS] is added at the first position, and the [SEP] symbol is added
as a separator between segments for tasks with multiple input segments. Task embedding: the
model feeds in a task embedding to represent the characteristics of different tasks. Different
tasks are represented by ids ranging from 0 to N, and each task id is assigned one unique task
embedding. The corresponding token, segment, position, and task embeddings are taken as the
input of the model. Any task id can be used to initialize the model in the fine-tuning process.
        </p>
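<p>The input construction described above can be sketched as follows. The dimensions and random lookup tables are toy values for illustration; in ERNIE 2.0 the embeddings are learned, and the base hidden size is 768.</p>

```python
import random

DIM = 4  # toy embedding size; ERNIE base uses 768

def table(n):
    """A toy lookup table of n random embedding vectors."""
    return [[random.random() for _ in range(DIM)] for _ in range(n)]

tok_emb, seg_emb, pos_emb, task_emb = table(100), table(2), table(50), table(8)

def ernie_input(token_ids, segment_ids, task_id):
    """Per ERNIE 2.0, the encoder input at each position is the sum of the
    token, segment, position, and task embeddings."""
    return [
        [t + s + p + k for t, s, p, k in zip(
            tok_emb[tid], seg_emb[sid], pos_emb[pos], task_emb[task_id])]
        for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]

vectors = ernie_input(token_ids=[1, 5, 9], segment_ids=[0, 0, 0], task_id=3)
```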
<p>In the present research work, we implemented the ERNIE 2.0 model to carry out the task of sentence label classification; ERNIE 2.0 has been shown to outperform BERT and the more recent XLNet on 16 NLP tasks in Chinese and English. The base model contains 12 layers, 12 self-attention heads, and a hidden size of 768, while the large model contains 24 layers, 16 self-attention heads, and a hidden size of 1024. The transformer employed for prediction used ERNIE 2.0 pre-trained token embeddings, whose stronger contextual dependencies helped mitigate deviation from the relationship between the text and the output labels. A single attention mechanism was applied to the token embeddings to derive a better understanding of the hidden relationships that help determine the output labels in a more optimized and accurate way. The same pipeline was also tested with models such as RoBERTa (base), LawBERT, and BERT-base-uncased; however, the ERNIE 2.0 architecture together with its generated embeddings performed best in our case.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experimentation and Results</title>
      <p>The data was tested using the proposed methodology; the two submitted runs differ only in the base model used, which is ERNIE 2.0 in the case of Run 1 and RoBERTa in the case of Run 2.</p>
      <p>[Figure 1: Per-label and overall Precision, Recall, and F-score for (a) ERNIE (Run-1) and (b) RoBERTa (Run-2), over the labels Argument, Facts, Precedent, Ratio of the decision, Ruling by Lower Court, Ruling by Present Court, and Statute.]</p>
      <p>After a thorough analysis of the data processed by our proposed scheme, the best run was established to be the ERNIE 2.0-based sequence classification pipeline built upon the ERNIE 2.0 pre-trained token embeddings, which are known to capture contextual understanding of the English language better than existing models such as BERT, RoBERTa, and XLNet. With reference to Table 2 and Table 3, it can be observed that the F1 scores for Argument (ARG), Facts (FAC), Ratio of the decision (Ratio), and Ruling by Present Court (RPC) in particular are above 0.5, which indicates the ability of our methodology to capture the underlying meanings of those labels. On closely observing the results of the submitted runs (shown in Figure 1), we can conclude that ERNIE 2.0 establishes itself as the better analyzer of legal sentences. RoBERTa gives a better score on Precedent and Ruling by Lower Court; for the other labels, however, Run 1 establishes itself as the go-to approach. Looking at each label separately, the improvement in results from RoBERTa to ERNIE, i.e., from Run 2 to Run 1, can be attributed to ERNIE's more complex architecture with 12 self-attention heads, which is more robust to the category of data the model encountered. Run 1 achieves an F1 score of 0.505, a direct 3% improvement over Run 2. With reference to Table 4, the token length was set to 250 in the case of Run 1 and 300 in the case of Run 2; the final outcome showed the accuracy of ERNIE with a token length of 250 to be better. Table 5 also validates the proposed pre-processing, as the results improved by 0.05 in F1 score. The number of epochs for both models was set to 15, and the models were trained on a Tesla P100-PCIE-16GB GPU. The final result in the case of Run 1 was an overall precision of 0.465 in comparison to Run 2's precision of 0.450, and the recall of Run 1 was 0.591 in comparison to Run 2's recall of 0.586.</p>
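<p>The overall scores reported here are averages of per-label precision, recall, and F1. A minimal sketch of how such per-label and macro-averaged scores can be computed follows; the function names are ours, and this is not the official AILA evaluation script.</p>

```python
def per_label_prf(gold, pred, label):
    """Precision, recall, and F1 for a single rhetorical role label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_average(gold, pred, labels):
    """Unweighted mean of per-label precision, recall, and F1."""
    scores = [per_label_prf(gold, pred, lab) for lab in labels]
    return tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))

# Toy example with three of the seven labels.
gold = ["FAC", "FAC", "ARG", "STA"]
pred = ["FAC", "ARG", "ARG", "FAC"]
precision, recall, f1 = macro_average(gold, pred, ["FAC", "ARG", "STA"])
```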
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
<p>From the overall experiments carried out on the legal corpus, it can be concluded that ERNIE 2.0 is the better analyser of legal text and best captures its underlying meaning. The seven labels of the corpus overlap contextually, which makes it difficult for many models to achieve a higher performance; however, the use of stronger deep learning approaches with the advanced embeddings of ERNIE 2.0 makes this problem easier to solve. For future work, ensembling RoBERTa, ERNIE 2.0, and LawBERT could yield promising results, along with further exploration of preprocessing based on the corresponding token lengths, as tried in our proposed methodology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Lawsum: A weakly supervised approach for indian legal document summarization</article-title>
          ,
          <source>arXiv preprint arXiv:2110.01188v3</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saravanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ravindran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <article-title>Automatic identification of rhetorical roles using conditional random fields for legal document summarization</article-title>
          ,
          <source>in: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Segmenting us court decisions into functional and issue specific parts</article-title>
          .,
          <source>in: JURIX</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Nejadgholi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bougueng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Witherspoon</surname>
          </string-name>
          ,
          <article-title>A semi-supervised training method for semantic search of legal facts in canadian immigration cases</article-title>
          ., in: JURIX,
          <year>2017</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pillaipakkamnatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Linares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Pesce</surname>
          </string-name>
          ,
          <article-title>Automatic classification of rhetorical roles for sentences: Comparing rule-based scripts with machine learning</article-title>
          .,
          <source>in: ASAIL@ ICAIL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <article-title>Identification of rhetorical roles of sentences in indian legal judgments</article-title>
          ,
          <source>in: Legal Knowledge and Information Systems: JURIX</source>
          <year>2019</year>
          :
          <article-title>The Thirty-second Annual Conference</article-title>
          , volume
          <volume>322</volume>
          , IOS Press,
          <year>2019</year>
          , p.
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Rhetorical role labelling for legal judgements using roberta</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          <article-title>The language model for legal retrieval and bert-based model for rhetorical role labeling for legal judgments</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Ernie 2.0: A continual pre-training framework for language understanding</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>8968</fpage>
          -
          <lpage>8975</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>