A Novel Initial Reminder Framework for Acronym Extraction

Xiusheng Huang1,2, Bin Li3, Fei Xia1,2 and Yixuan Weng1,2
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100190, China
3 College of Electrical and Information Engineering, Hunan University

SDU@AAAI-22: Workshop on Scientific Document Understanding, co-located with AAAI 2022, Vancouver, Canada.
huangxiusheng2020@ia.ac.cn (X. Huang); libincn@hnu.edu.cn (B. Li); xiafei2020@ia.ac.cn (F. Xia); wengsyx@gmail.com (Y. Weng)

Abstract
Acronym extraction aims to extract acronyms (i.e., short-forms) and their meanings (i.e., long-forms) from a source document; it is one of the key and challenging tasks in scientific document understanding (SDU@AAAI-22). Previous work treated it as a named entity recognition task, ignoring the relationship between acronyms and their meanings, especially the importance of initials. In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, the IRF first recognizes the span of the acronym and then, combined with the initial information, recognizes its meaning. At the same time, considering that acronyms usually appear close to their meanings, the IRF adopts a Neighborhood Search Strategy. Experiments on two acronym extraction datasets show that IRF outperforms previous methods by 5.90/7.10 F1. Further analysis reveals that IRF is effective in extracting both short-forms and long-forms.

Keywords
Acronym extraction, Initials, Initial Reminder Framework, Neighborhood Search Strategy

1. Introduction

Acronym extraction is the task of identifying acronyms and their meanings, which is very important for scientific document understanding (SDU@AAAI-22) [1, 2]. Previous methods mostly regard this task as a sequence annotation task [3, 4, 5], in which the model recognizes the acronyms and long-forms.

Figure 1: An example from the Spanish dataset: "XLII (I) Consejo Mundial de la Paz (CMP) Federación Internacional de Comercio de Cacao (FICC) Federación Internacional de la Industria del Medicamento (FIIM)." Green text represents acronyms, orange text represents long-forms, and red text represents initials; the red, blue and black lines indicate the correspondence between initials and acronyms. (Dataset: Spanish)

The context of an acronym often has fairly obvious characteristics: for example, there are brackets around the acronym, or the acronym itself has a specific format, which leads to higher accuracy in identifying acronyms. However, the accuracy of identifying long-forms is relatively low, with problems such as inaccurate identification or missed identification.

As shown in Figure 1, given a document, we need to identify the acronyms and the long-forms. The context of acronyms often carries characteristic cues (e.g., brackets), which help the model identify them. Long-form recognition is more challenging: it requires a certain understanding of the document content. A better solution is to know what the corresponding acronym is before extracting the long-form, which helps the model recognize the long-form.

From Figure 1, we can also see that each character of an acronym corresponds to the initial of a word in the long-form, which helps the model identify the long-form.

In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Through experiments, we find that the model achieves higher accuracy in acronym recognition than in long-form recognition: specifically, on Spanish, the model reaches 91% F1 when identifying acronyms, while the F1 score for long-forms is only 83%. Considering the correlation between acronyms and long-forms, IRF first completes the task of identifying acronyms.
On this basis, combined with the initial information contained in acronyms, IRF further identifies the long-forms. We verify the effectiveness of our method on two acronym extraction datasets, Spanish and Danish. We summarize our contributions as follows:

• We introduce a fresh perspective to revisit the acronym extraction task with a principled problem formulation, which implies a general algorithmic framework that helps identify long-forms by their initials.
• We propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, IRF makes use of the high accuracy of acronym recognition and helps the model recognize long-forms by integrating the initial information.
• We conduct experiments on two acronym extraction datasets. Experimental results demonstrate that our IRF model achieves state-of-the-art performance compared with the baselines.

2. Task introduction

2.1. Problem definition

We regard the acronym extraction task as a sequence annotation task. Different from previous methods, considering the high accuracy of acronym recognition, we first recognize the acronym and then use the character information of the acronym to recognize the long-form. Given a document D = {x_1, x_2, ..., x_n}, the initials of the words in the document are I = {y_1, y_2, ..., y_n}. Using our IRF model, we obtain each acronym and long-form:

A, L = IRF(x_1, x_2, ..., x_n; y_1, y_2, ..., y_n)   (1)

where A refers to the acronyms and L refers to the long-forms.

2.2. Evaluation metric

The online results are evaluated with macro-averaged precision, recall, and F1 scores. The final score measures the prediction correctness of short-form (i.e., acronym) and long-form (i.e., phrase) boundaries in the given sentence. A short-form or long-form prediction is correct only if both the beginning and the end of the predicted position equal the labeled boundaries. The official score is the macro average of the short-form and long-form F1 scores.

2.3. Dataset introduction

This task contains several multilingual datasets composed of document sentences from scientific fields. The statistics of the Spanish and Danish datasets are shown in Table 1 and Table 2. The Spanish dataset is divided into training (5928), development (741), and test (741) sets. As shown in Table 2, the Danish dataset is divided into training (3082), development (385), and test (386) sets. Both datasets have been manually labeled, where each label is a list of position boundaries.

Table 1
Statistical Information of the Spanish Dataset.
Data             Sample Number   Ratio
Training Set     5928            80.00%
Development Set  741             10.00%
Test Set         741             10.00%
Total            7410            100%

Table 2
Statistical Information of the Danish Dataset.
Data             Sample Number   Ratio
Training Set     3082            80.00%
Development Set  385             9.99%
Test Set         386             10.02%
Total            3853            100%

3. Methodology

In this section, we introduce our proposed IRF model. IRF utilizes the correspondence between the characters of an acronym and the initials of its long-form, which effectively helps the model improve the accuracy of long-form recognition.

3.1. Encoder

Given a document D = {x_1, x_2, ..., x_n} and the initials of its words I = {y_1, y_2, ..., y_n}, we leverage a pre-trained language model as the encoder to obtain the embeddings:

H = BERT_Encode(x_1, x_2, ..., x_n; y_1, y_2, ..., y_n)   (2)

where H = [h_1, h_2, ..., h_n] contains the embedding of each token, and I is mapped to the embedding of each initial.
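To make the encoding step of Eq. (2) concrete, the following is a minimal sketch assuming a multilingual BERT backbone from the HuggingFace transformers library (the paper initializes from mBERT [12]). The function name and the first-sub-token pooling are illustrative assumptions, not the authors' released code; the initial embeddings used later by the Long-term Tagger can be built analogously from the first character of each word.

```python
# Minimal sketch of the encoder step (Eq. 2), assuming a multilingual BERT
# backbone from the HuggingFace transformers library. Names and pooling
# choices are illustrative, not taken from the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode_document(words):
    """Return one contextual embedding per word (first sub-token of each word)."""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state.squeeze(0)  # (num_subtokens, dim)
    word_ids = enc.word_ids(0)              # map sub-tokens back to word indices
    first_positions, seen = [], set()
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            first_positions.append(pos)
    return hidden[first_positions]          # H = [h_1, ..., h_n], one row per word

words = "XLII ( I ) Consejo Mundial de la Paz ( CMP )".split()
H = encode_document(words)
print(H.shape)   # -> (len(words), hidden_size)
```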
3.2. Acronyms Tagger

The low-level tagging module is designed to recognize all possible acronyms in the input sentence by directly decoding the encoded vector H produced by the N-layer BERT encoder. More precisely, it adopts two identical binary classifiers to detect the start and end positions of acronyms, assigning each token a binary tag (0/1) that indicates whether the current token corresponds to a start or end position of an acronym. The detailed operations of the Acronyms Tagger on each token are as follows:

p_i^{start_a} = σ(W_{start} h_i + b_{start})   (3)

p_i^{end_a} = σ(W_{end} h_i + b_{end})   (4)

where p_i^{start_a} and p_i^{end_a} represent the probability of identifying the i-th token in the input sequence as the start and end position of an acronym, respectively. The corresponding token is assigned the tag 1 if the probability exceeds a certain threshold, and the tag 0 otherwise. h_i is the encoded representation of the i-th token in the input sequence, i.e., h_i = H[i]; W_{start} and W_{end} are trainable weights, b_{start} and b_{end} are biases, and σ is the sigmoid activation function.
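To illustrate Eqs. (3)-(4), the sketch below implements the two start/end sigmoid classifiers over the token embeddings together with a simple span-decoding heuristic. The 0.5 threshold and the nearest-following-end pairing rule are assumptions made for readability; the paper only states that a threshold is applied.

```python
# Sketch of the acronym tagger (Eqs. 3-4): two independent sigmoid classifiers
# over each token embedding h_i predict start/end positions of acronyms.
# The 0.5 threshold and the span-pairing rule are assumptions for illustration.
import torch
from torch import nn

class AcronymTagger(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.start_linear = nn.Linear(hidden_size, 1)   # W_start, b_start
        self.end_linear = nn.Linear(hidden_size, 1)     # W_end, b_end

    def forward(self, H: torch.Tensor):
        # H: (seq_len, hidden_size) -> per-token start/end probabilities
        p_start = torch.sigmoid(self.start_linear(H)).squeeze(-1)
        p_end = torch.sigmoid(self.end_linear(H)).squeeze(-1)
        return p_start, p_end

def decode_spans(p_start, p_end, threshold=0.5):
    """Pair each predicted start with the nearest following predicted end."""
    starts = (p_start > threshold).nonzero(as_tuple=True)[0].tolist()
    ends = (p_end > threshold).nonzero(as_tuple=True)[0].tolist()
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans
```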
3.3. Long-term Tagger

Considering that each acronym and its meaning are usually located close together, we utilize the Neighborhood Search Strategy to select the context near the detected acronym, so as to extract the correct long-form.

The high-level tagging module identifies the long-form with respect to the acronyms obtained at the lower level. As shown in Figure 2, for the acronym CMP, we search for its corresponding long-form within a limited context. Different from the Acronyms Tagger, which directly decodes the encoded vector H, the Long-term Tagger takes the acronym features and the initial features into account as well. The detailed operations of the Long-term Tagger on each token are as follows:

p_i^{start_l} = σ(W_{start}(h_i + V^k_{short} + E^k_{initial}) + b_{start})   (5)

p_i^{end_l} = σ(W_{end}(h_i + V^k_{short} + E^k_{initial}) + b_{end})   (6)

where p_i^{start_l} and p_i^{end_l} represent the probability of identifying the i-th token in the input sequence as the start and end position of a long-form, respectively; V^k_{short} represents the encoded representation vector of the k-th acronym detected by the low-level module, and E^k_{initial} represents the embedding of its initials (e.g., C, M and P). For each acronym, we iteratively apply the same decoding process. Meanwhile, for the Neighborhood Search Strategy, we set the search length to γ, where γ is a hyperparameter equal to the longest distance between an acronym and its long-form observed in the training set.

Figure 2: An overview of the proposed IRF framework. The BERT encoder feeds the Acronyms Tagger, whose detected spans, together with the initial embeddings, condition the Long-term Tagger (h_i + V^k_short + E^k_initial) under the Neighborhood Search Strategy. In this example, two candidate acronyms are detected at the low level, while the 0/1 tags shown at the high level are specific to the first acronym CMP, i.e., a snapshot of the iteration state when k = 1; k = 2 corresponds to the second acronym FICC.
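A minimal sketch of how Eqs. (5)-(6) and the Neighborhood Search Strategy could be realized is given below. The paper does not spell out how V^k_short and E^k_initial are pooled, so mean pooling over the acronym span and over the initial-character embeddings, as well as zeroing predictions outside the γ-window, are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the long-form tagger (Eqs. 5-6) with the Neighborhood Search
# Strategy. Pooling choices and the character vocabulary size are assumptions.
import torch
from torch import nn

class LongFormTagger(nn.Module):
    def __init__(self, hidden_size: int, num_chars: int = 512):
        super().__init__()
        self.start_linear = nn.Linear(hidden_size, 1)
        self.end_linear = nn.Linear(hidden_size, 1)
        self.initial_emb = nn.Embedding(num_chars, hidden_size)  # E_initial lookup

    def forward(self, H, acronym_span, initial_ids, gamma):
        # H: (seq_len, d); acronym_span: (start, end) from the Acronyms Tagger;
        # initial_ids: LongTensor of character ids for the acronym's letters.
        s, e = acronym_span
        v_short = H[s:e + 1].mean(dim=0)                        # V_short^k
        e_initial = self.initial_emb(initial_ids).mean(dim=0)   # E_initial^k
        fused = H + v_short + e_initial                         # h_i + V_short^k + E_initial^k
        p_start = torch.sigmoid(self.start_linear(fused)).squeeze(-1)
        p_end = torch.sigmoid(self.end_linear(fused)).squeeze(-1)

        # Neighborhood Search Strategy: only tokens within gamma positions of
        # the acronym span may be tagged as part of the long-form.
        mask = torch.zeros_like(p_start)
        lo, hi = max(0, s - gamma), min(H.size(0), e + 1 + gamma)
        mask[lo:hi] = 1.0
        return p_start * mask, p_end * mask
```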
4. Experiments

4.1. Baseline models

• Rule-based method. The rule-based baseline adopts manual rules for this task [6]. Words with more than 60% of their characters upper-cased are selected as acronyms. The long-forms are chosen when the initial characters of the words preceding an acronym correspond to it. The full code is available online¹.

• BiLSTM-CRF model. The bidirectional LSTM [7] is an extension of the LSTM that adopts a forward and a backward LSTM network for sequence processing, with a CRF used as the output layer (Huang et al., 2015). The BiLSTM structure gathers contextual information from both directions and mitigates the gradient vanishing of the plain RNN. The output hidden states of the forward LSTM H_f and the backward LSTM H_b are concatenated as the final output [H_f, H_b]. This baseline is trained with the cross-entropy loss against the token-level labels.

• BERT-CRF model. The BERT-CRF [8] is a token-level neural network with a conditional random field (CRF) layer on top, where the backbone is mBERT, a multilingual masked language model (MLM) trained on multiple corpora. The backbone has variants such as base and large, both of which are used as baselines. The backbone encodes the input tokens, and the final classification scores are obtained in the CRF layer, where the tag set is used as the transition matrix; the matrix contains two states, the beginning (B) and the end (E). This baseline is trained on the first sub-token of each word with the cross-entropy loss.

• Roberta-CRF model. The Roberta-CRF [9] has the same architecture as the BERT-CRF; the difference is that the RoBERTa model removes the next sentence prediction (NSP) task and uses dynamic masking for text encoding. RoBERTa uses Byte-Pair Encoding (BPE) to mix character-level and word-level representations and supports the vocabularies of many common natural language corpora. We adopt different variants of RoBERTa as baselines, including the base and the large versions.

¹ https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE

4.2. Datasets

We evaluate our method on two acronym extraction datasets, the Spanish dataset and the Danish dataset. Specifically, the Spanish dataset has 7410 samples and the Danish dataset has 3853 samples [10].

4.3. Implementation Detail

We use cased BERT-base or RoBERTa-large as the encoder on the Spanish and Danish datasets. All models are implemented with the open-source transformers library from HuggingFace [11], and we initialize the model with mBERT [12]. We use mixed-precision training [13] based on the Apex library. Our model is optimized with AdamW [14] using learning rates in {2e-5, 3e-5, 5e-5, 1e-4}, with a linear warmup [15] over the first 6% of steps followed by a linear decay to 0. We report the mean and standard deviation of F1 on the development set over 5 training runs with different random seeds. We use the In-trust loss [5] to optimize the model.
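To make this training setup concrete, below is a minimal sketch of the optimizer and warmup schedule described above, assuming PyTorch's AdamW and the scheduler from the transformers library. The weight-decay value is an assumption, the learning rate is one value from the searched range, and the In-trust loss [5] is not reproduced here.

```python
# Minimal sketch of the optimization setup in Section 4.3: AdamW with a linear
# warmup over the first 6% of training steps, then linear decay to 0.
# weight_decay is an assumption; lr is one value from the searched range.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, num_training_steps, lr=2e-5, warmup_ratio=0.06):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

# Usage: step the scheduler once per optimizer step during training, e.g.
#   optimizer, scheduler = build_optimizer(model, total_steps)
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```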
4.4. Results

On the Spanish and Danish datasets, we compare IRF with the baselines, including Rule-based, BiLSTM-CRF, BERT-CRF and Roberta-CRF. The results in Table 3 and Table 4 show that IRF performs better than these methods. Specifically, on the Spanish dataset, our best model, IRF built upon RoBERTa-large, is +5.66 / +5.90 F1 better on the Val/Test set than Roberta-CRF. In addition, on the Danish dataset, IRF built upon RoBERTa-large is +7.65 / +7.10 F1 better on the Val/Test set than Roberta-CRF. These are new state-of-the-art (SOTA) results, and we held the first position on the CodaLab scoreboard under the alias WENGSYX².

² https://competitions.codalab.org/competitions/34925

Table 3
F1 Performance on the Spanish dataset
Method                     Val F1    Test F1
Rule-based                 0.5667    0.5596
BiLSTM-CRF                 0.7717    0.7623
BERT-CRF                   0.8397    0.8211
Roberta-CRF                0.8667    0.8531
IRF-BERT_base (ours)       0.8742    0.8537
IRF-BERT_large (ours)      0.9035    0.8911
IRF-Roberta_large (ours)   0.9233    0.9121

Table 4
F1 Performance on the Danish dataset
Method                     Val F1    Test F1
Rule-based                 0.7021    0.6842
BiLSTM-CRF                 0.7671    0.7587
BERT-CRF                   0.8673    0.8554
Roberta-CRF                0.8979    0.8931
IRF-BERT_base (ours)       0.9133    0.9032
IRF-BERT_large (ours)      0.9532    0.9413
IRF-Roberta_large (ours)   0.9744    0.9641

4.5. Analysis

Considering the correlation between acronyms and the initials of their long-forms, our IRF establishes the relationship between acronyms and long-forms, which improves the accuracy of extracting long-forms and the overall performance of the model. To further explore the effectiveness of our method, we analyze the accuracy of identifying long-forms in the acronym extraction task. As shown in Table 5, compared with the baselines, IRF significantly improves the accuracy of extracting long-forms: on the F1 score, we obtain a maximum improvement of about 5 points on the test set. This significant increase in long-form recognition helps improve the overall performance of the model.

Table 5
F1 score (%) on extracting long-forms.
Model                      Val F1           Test F1
BERT-CRF                   80.42            79.11
IRF-BERT_base (ours)       85.31 (+4.89)    84.23 (+5.12)
Roberta-CRF                83.44            82.19
IRF-Roberta_large (ours)   90.13 (+6.69)    89.07 (+6.88)

5. Conclusion

In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, IRF utilizes the Acronyms Tagger to first recognize the span of the acronym; then, combining the initial information, IRF utilizes the Long-term Tagger to recognize the long-form. IRF captures the relationship between acronyms and long-forms in the dataset, and by utilizing the character information in acronyms it improves the accuracy of long-form recognition. We conduct experiments on two acronym extraction datasets. Experimental results demonstrate that our IRF model achieves state-of-the-art performance compared with the baselines.

References

[1] A. P. B. Veyseh, F. Dernoncourt, T. H. Nguyen, W. Chang, L. A. Celi, Acronym identification and disambiguation shared tasks for scientific document understanding, arXiv preprint arXiv:2012.11760 (2020).
[2] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[3] L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, H. Lin, J. Wang, An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition, Bioinformatics 34 (2018) 1381–1388.
[4] H. Zhao, L. Huang, R. Zhang, Q. Lu, H. Xue, SpanMlt: A span-based multi-task learning framework for pair-wise aspect and opinion terms extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3239–3248.
[5] X. Huang, Y. Chen, S. Wu, J. Zhao, Y. Xie, W. Sun, Named entity recognition via noise aware training mechanism with data filter, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 4791–4803. URL: https://aclanthology.org/2021.findings-acl.423. doi:10.18653/v1/2021.findings-acl.423.
[6] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: Biocomputing 2003, World Scientific, 2002, pp. 451–462.
[7] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, arXiv preprint, 2022.
[11] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace's Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[12] J. Libovický, R. Rosa, A. Fraser, How language-neutral is multilingual BERT?, arXiv preprint arXiv:1911.03310 (2019).
[13] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed precision training, in: International Conference on Learning Representations, 2018.
[14] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2018.
[15] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch SGD: Training ImageNet in 1 hour (2018).