Effective Ensembling of Transformer based Language Models for Acronyms Identification

Divesh Kubal, Apurva Nagvenkar
CRIMSON AI
divesh.kubal@crimsoni.ai, apurva.nagvenkar@crimsoni.ai

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

An acronym can be viewed as a word constructed by taking the initial components of a phrase. This study deals with the problem of identification and extraction of an acronym's short and long-form. The proposed approach solves the Acronym Identification (AI) task of the scientific document understanding (SDU@AAAI-21) shared task. This paper models the Acronym Identification task as a sentence-level sequence-labeling problem. The proposed method is an ensemble of various Language Models trained with hyper-parameter tuning. This ensembling technique is then coupled with post-processing steps to extract the best possible predictions. The trained model's performance is evaluated against standard evaluation metrics such as precision, recall, and F1-score. The final model achieves an F1-score of 95.60%, a precision of 93.97%, and a recall of 97.95% on the development dataset. On the test data, the proposed model achieves an F1-score of 92.08%, a precision of 89.70%, and the highest recall of 94.59% compared to the other participants' results in the competition.

Introduction

Acronyms in the technical and scientific domains are increasing at an exponential rate, owing to the huge amount of research conducted in technical and non-technical domains. The term "acronym" refers to the name given to a particular word or phrase by taking the first letters of each word of the phrase (Mack 2012). For example, 'ANN' is an acronym that stands for 'Artificial Neural Network'. The more common or general term is "abbreviation". Abbreviations encompass acronyms as well as a few abbreviations that use letters other than the initial characters of phrases, such as 'Mr.' for 'Mister'. Hence, there is a thin line that distinguishes acronyms from abbreviations. Acronyms serve a vital role in writing scientific or research-related technical documents, patents, etc., by preventing content repetition. This speeds up the reading process and paves the way for an easier understanding of the content written inside a document.

Several techniques have been proposed to extract acronyms from a given input text corpus. These systems are rule-based (Schwartz and Hearst 2002) or machine learning-based (Jacobs, Itai, and Wintner 2020; Kuo et al. 2009; Liu, Liu, and Huang 2017). Some techniques use word-embedding methods (Kirchhoff and Turner 2016) to extract acronyms. There are also several packages available in Python (Cook 2019; Schwartz and Hearst 2002) that extract acronyms and their expansions. The understanding of acronyms and their expansions is an important task in the following use-cases:

• Text understanding: There can be multiple expansions of an acronym. Hence, identification of the correct contextual meaning is important to understand the text in an unambiguous manner.
• Information retrieval: When a document collection is queried with a query containing an acronym, the results should contain the relevant documents.
• Machine translation: Acronyms pose a big challenge when translating a source language into its target language.
• Text summarization: It is advisable to use the acronym counterpart of an expansion to summarize the text.

Figure 1: Examples of Acronym Identification

This paper presents an effective ensembling-based approach to automatically extract acronyms along with their extended/long-forms.
The proposed approach combines ensembling of Language Models, hyper-parameter tuning, and post-processing to obtain the best possible results. This paper is structured as follows: a survey of related work gives a quick overview of existing approaches used to solve the problem of Acronym Identification (Veyseh et al. 2020a). After the related work section, an in-depth explanation of the proposed system architecture is given. A thorough comparative analysis of results under different scenarios is presented in the Results and Discussions section. Finally, the paper closes with a quick conclusion and future directions.

Related Work

The existing Acronym Identification systems can be broadly classified into rule-based, feature & machine learning-based, and Deep Learning-based systems, as depicted in figure 2. The rule-based systems mainly consist of techniques that use rules, patterns, and regular expressions to extract acronyms and their expansions. The machine learning-based techniques first extract the features required for Acronym Identification, and then a classifier (such as a Support Vector Machine or Conditional Random Fields) is trained over these features. The Deep Learning-based techniques use state-of-the-art models such as Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM) networks, and transformer-based models.

Figure 2: Acronym Identification Systems Broad Classification

The Acronym Identification techniques can be traced back to the year 1999, when (Taghva and Gilbreth 1999) proposed a system named the Acronym Finding Program (AFP). It is a simple regex-based system that identifies candidate acronyms (upper-case words of three to ten characters). It attempts to find an acronym's expansion by scanning a 2n-word window, where n is the number of letters in the candidate acronym. The final system was evaluated on 1328 files. Many regex-based Acronym Identification systems were proposed afterward. The issue with regex-based techniques is that it is impossible to incorporate all the possible rules, and considerable manual work is required to identify patterns and then write rules.

A popular approach to detect acronyms was proposed by (Schwartz and Hearst 2002), with an implementation available at https://github.com/philgooch/abbreviation-extraction. This technique is capable of extracting complicated acronyms and their expansions. It works in two stages. The first stage identifies candidate acronyms using predefined patterns (an acronym adjacent to its expansion, in either order). In the second stage, the overlapping characters in an acronym and its expansion are counted, and this count is compared to a specified threshold. As this technique cannot store contextual information, it fails to extract acronyms and expansions with long-term dependencies.

One of the machine learning-based Acronym Identification systems, proposed by (Kuo et al. 2009), extracts features and then uses various algorithms such as Support Vector Machines, Naive Bayes, Logistic Regression, and Monte-Carlo Sampling Logistic Regression for training.

(Liu, Liu, and Huang 2017) proposed a Latent-state Neural Conditional Random Fields (LNCRF) system, which couples Conditional Random Fields with nonlinear hidden layers. This system models the task of Acronym Identification as a sequence labeling problem, and it surpassed many baseline models in existence at that time.

A machine learning-based approach proposed by (Jacobs, Itai, and Wintner 2020) aims to extract acronyms by automatically building an acronym-based dictionary from an unannotated dataset. One of the critical features of this system is that it is capable of extracting non-local acronyms too, i.e., extracting the expanded form of an acronym even if the short form is not present in the given sentence.

Another approach, based on a Long Short Term Memory - Conditional Random Field (LSTM-CRF) model and proposed by (Veyseh et al. 2020b), provides an in-depth treatment of both Acronym Identification and Disambiguation.

The rule-based and classical machine learning-based models are not able to capture contextual information. An acronym extraction system that combines machine learning with a neural network-based contextual model was proposed by (Kirchhoff and Turner 2016). This model can store contextual information and hence can also be used to solve the task of acronym disambiguation.

Effective Ensembling of Language Models for Sequence Labeling Problem: EELM-SLP

The proposed system architecture is depicted in figure 3. The main components of the Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) system architecture are as follows:
1. Data Acquisition/Collection.
2. Finetuning of transformer-based Language Models for the sequence-labeling task.
3. Hyperparameter tuning and retraining of Language Models.
4. Ensembling and postprocessing to obtain final predictions.
5. Evaluation on the development and test datasets.

Figure 3: Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) System Architecture

Data Acquisition/Collection

The entire dataset can be downloaded from https://github.com/amirveyseh/AAAI-21-SDU-shared-task-1-AI. A sample snapshot of the data is depicted in figure 1. The data is split into training, development, and test sets. The training dataset consists of 14006 labeled sentences (Veyseh et al. 2020b), and the development data consists of 1717 labeled sentences. Each datapoint in the train and development data has an 'id' representing the train or development datapoint ID, 'tokens', a list containing the sentence split into individual tokens, and 'labels', the token-wise annotations associated with those tokens. The test dataset consists of about 1750 datapoints where only 'id' and 'tokens' are provided. The 'labels' mark the short- and long-form acronyms in BIO format (Beginning, Inside, Outside); B-long, I-long, B-short, and I-short are the entity labels used, with O marking tokens outside any acronym or expansion.
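To make the data format concrete, the following minimal sketch loads the training file and inspects one datapoint. The file path follows the shared-task repository layout and may differ locally; this is illustrative only and not part of the authors' released code.

```python
import json

# Illustrative only: the path assumes the shared-task repository layout.
with open("AAAI-21-SDU-shared-task-1-AI/dataset/train.json", encoding="utf-8") as f:
    train_data = json.load(f)

sample = train_data[0]
print(sample["id"])      # datapoint identifier
print(sample["tokens"])  # the sentence split into individual tokens
print(sample["labels"])  # one BIO tag per token: B-short, I-short, B-long, I-long or O
```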
Finetuning of transformer-based Language Models for the sequence-labeling task

The proposed approach uses six transformer-based pretrained Language Models, which are finetuned on the downstream sequence labeling task. The following Language Models are used: BERT (Devlin et al. 2018), RoBERTa (Liu et al. 2019), XLM-RoBERTa (Conneau et al. 2019), CamemBERT (Martin et al. 2019), Longformer (Beltagy, Peters, and Cohan 2020) and DistilBERT (Sanh et al. 2019). BERT is trained with the objectives of masked language modeling (MLM) and next sentence prediction (NSP). For BERT, RoBERTa, XLM-RoBERTa, CamemBERT, and DistilBERT, the base-cased configurations are used; the AllenAI base-4096 configuration is used for Longformer.

The configurations of the finetuned Language Models are as follows:
• The BERT and CamemBERT models have 12 layers, 768-dimensional hidden states, 12 attention heads, and 109M parameters.
• RoBERTa has the same configuration as BERT, except that it has 125M parameters.
• XLM-RoBERTa has approximately 270M parameters, with 12 layers, 768-dimensional hidden states, 3072-dimensional feed-forward hidden states, and 8 attention heads, and is pretrained on CommonCrawl data in 100 languages.
• DistilBERT has fewer parameters (65M) and only 6 layers; it is distilled from the BERT-base configuration.
• Longformer has approximately 149M parameters; here, 4096 indicates that the model is pretrained on documents of maximum length 4096.

All the above-mentioned Language Models are finetuned on the training dataset for sentence-level sequence labeling. Every model is trained separately, and the latest iteration is stored.
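As a rough illustration (not the authors' released code), a single model can be finetuned for this token-labeling task with the Simple Transformers library used in the next subsection. The model strings follow Hugging Face naming, and the toy training frame below only shows the column layout the library expects; in practice the rows would be built from the 'tokens' and 'labels' fields of the training file.

```python
import pandas as pd
from simpletransformers.ner import NERModel, NERArgs

# Toy training frame in the column layout Simple Transformers expects for NER.
train_df = pd.DataFrame(
    [(0, "Artificial", "B-long"), (0, "Neural", "I-long"), (0, "Network", "I-long"),
     (0, "(", "O"), (0, "ANN", "B-short"), (0, ")", "O")],
    columns=["sentence_id", "words", "labels"],
)

args = NERArgs()
args.labels_list = ["B-long", "I-long", "B-short", "I-short", "O"]

# One of the six models; the others swap the type/name pair, e.g.
# ("bert", "bert-base-cased") or ("longformer", "allenai/longformer-base-4096").
model = NERModel("roberta", "roberta-base", args=args, use_cuda=True)
model.train_model(train_df)

# Token-level BIO predictions for an unseen sentence.
predictions, raw_outputs = model.predict(["Recurrent Neural Network ( RNN ) models are used ."])
```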
Hyperparameter tuning of Language Models for the Sequence Labeling Task

It was observed that during the inference phase the output length was truncated to 128 tokens, which is the default value of the 'max_seq_length' parameter. To preserve the entire token length, two parameters were investigated: 'sliding_window' and 'max_seq_length'. The sliding window prevents the truncation of sentences by splitting the input sequence into multiple windows if it exceeds the maximum sequence length. The problem with the sliding window is that contextual information is broken during prediction, and hence it was not used. Experiments were carried out with the max_seq_length parameter, and it was observed that the models performed best when max_seq_length was set to 350; the other values tried were 128, 256, 300, 400, 450, and 512. The language models were finetuned using the 'Simple Transformers' library (Rajapakse 2020, https://simpletransformers.ai/) and 'Hugging Face' pretrained models (Wolf et al. 2019, https://huggingface.co/). A GeForce RTX 2080 Ti 11GB GPU was used to train all the models on a system with 32GB of primary memory.

Table 1 lists the final hyperparameters used to finetune the Language Models on the sequence labeling task. The 'adam_epsilon' is the epsilon value used by the Adam optimizer. 'best_model_dir' is the directory where the best model is automatically stored after the completion of all epochs. 'cache_dir' is where the processed data is stored in a form consumable by PyTorch (Paszke et al. 2019). 'train_custom_parameters_only' is kept False, as we utilize all the parameters available in the Language Models. 'dataloader_num_workers' specifies the number of CPU workers used for data processing. As this is a sequence labeling task, the 'cased' configurations of the Language Models are used, and hence 'do_lower_case' is kept False. The 'early_stopping_metric' is the evaluation loss, and the 'early_stopping_patience' is kept at 3. The training process avoids evaluation while training; hence, 'evaluate_during_training' is kept False. 'fp16' corresponds to 16-bit training, sometimes referred to as mixed-precision training, and is kept True. 'gradient_accumulation_steps' is kept at 1 and the 'learning_rate' at 4e-05. The training was carried out on a single Graphics Processing Unit (GPU) system, so 'n_gpu' was fixed at 1. As the model should not be saved too frequently (to spare secondary storage), 'save_steps' is kept at a high value of 20,000. The number of training epochs was set to 50, but it was observed that the models converged after around 20 epochs; hence the training process was stopped manually as soon as the loss became stagnant at its lowest point. The 'labels_list' corresponds to the target labels and was set to ["B-long", "I-long", "B-short", "I-short", "O"]. 'max_seq_length' is one of the hyperparameters that was tuned. These are some of the hyperparameters known to have a large impact on finetuning Language Models for a sequence labeling task. After the final hyperparameters were determined, all the Language Models were finetuned again for the sentence-level sequence labeling task to identify acronyms in a given input text.

Table 1: Final hyperparameters used to finetune the Language Models for the sequence labeling task

Hyperparameter                      Value
adam_epsilon                        1e-08
best_model_dir                      outputs/best_model/
cache_dir                           cache/
train_custom_parameters_only        False
dataloader_num_workers              4
do_lower_case                       False
early_stopping_consider_epochs      False
early_stopping_delta                0
early_stopping_metric               eval_loss
early_stopping_metric_minimize      True
early_stopping_patience             3
encoding                            null
eval_batch_size                     8
evaluate_during_training            False
evaluate_during_training_silent     True
evaluate_during_training_steps      2000
evaluate_during_training_verbose    False
fp16                                True
gradient_accumulation_steps         1
learning_rate                       4e-05
local_rank                          -1
logging_steps                       50
max_grad_norm                       1.0
max_seq_length                      350
multiprocessing_chunksize           500
n_gpu                               1
no_cache                            False
no_save                             False
num_train_epochs                    50
output_dir                          outputs/
overwrite_output_dir                True
process_count                       4
reprocess_input_data                True
save_best_model                     True
save_eval_checkpoints               True
save_model_every_epoch              True
save_steps                          20000
save_optimizer_and_scheduler        True
silent                              False
train_batch_size                    16
use_cached_eval_features            False
use_early_stopping                  False
use_multiprocessing                 True
warmup_ratio                        0.06
warmup_steps                        2628
weight_decay                        0
classification_report               False
labels_list                         B-long, I-long, B-short, I-short, O
lazy_loading                        False
lazy_loading_start_line             0
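Assuming the Simple Transformers library named above, the Table 1 settings map onto its NERArgs configuration object roughly as follows. Only a representative subset is shown; the remaining fields keep the library defaults listed in the table.

```python
from simpletransformers.ner import NERArgs

# Representative subset of the Table 1 settings expressed as NERArgs fields.
args = NERArgs()
args.labels_list = ["B-long", "I-long", "B-short", "I-short", "O"]
args.max_seq_length = 350            # tuned value; the default of 128 truncates tokens
args.learning_rate = 4e-05
args.adam_epsilon = 1e-08
args.num_train_epochs = 50           # training was stopped manually around epoch 20
args.train_batch_size = 16
args.eval_batch_size = 8
args.gradient_accumulation_steps = 1
args.do_lower_case = False           # cased models, so casing is preserved
args.fp16 = True                     # mixed-precision training
args.evaluate_during_training = False
args.save_steps = 20000              # avoid saving checkpoints too frequently
args.warmup_ratio = 0.06
args.weight_decay = 0
args.output_dir = "outputs/"
args.best_model_dir = "outputs/best_model/"
```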
Ensembling and postprocessing to obtain final predictions

After all the language models were finetuned on the training data, the ensembling technique was applied. The advantage of ensembling is that a robust discriminator can be constructed from comparatively weaker models. In this paper, an ensembling approach based on RoBERTa prioritization is used; table 2 shows that the F1-score of RoBERTa on the development dataset is higher than that of the other individual models. The steps are as follows (a minimal sketch of the voting scheme is given after the list):
1. Initially, the predictions from all six finetuned language models are extracted.
2. For every sentence, token-level predictions are collected from all six models. Hence, for each token of every sentence, six predictions are obtained. Each prediction is one of the labels 'O', 'B-short', 'I-short', 'B-long' and 'I-long'.
3. If most models predict 'O' but any model predicts one of 'B-short', 'I-short', 'B-long' or 'I-long', then the non-'O' prediction is considered.
4. If there is a conflict between 'B-short', 'I-short', 'B-long', and 'I-long', then the prediction suggested by RoBERTa is taken. If, in this case, RoBERTa's prediction is 'O', then the prediction with the highest frequency is considered.
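The following is a minimal sketch of the voting scheme just described, assuming the per-token labels from the six models are available as plain strings; it is an illustration of the stated rules, not the authors' implementation.

```python
from collections import Counter

def ensemble_token(predictions, roberta_prediction):
    """Combine the six per-token predictions for a single token.

    `predictions` holds one BIO label per model and `roberta_prediction`
    is the label proposed by the finetuned RoBERTa model.
    """
    non_o = [p for p in predictions if p != "O"]
    if not non_o:
        return "O"                               # every model predicts 'O'
    if len(set(non_o)) == 1:
        return non_o[0]                          # a single non-'O' label was proposed
    if roberta_prediction != "O":
        return roberta_prediction                # conflicting labels: defer to RoBERTa
    return Counter(non_o).most_common(1)[0][0]   # RoBERTa says 'O': take the majority

# Example: four models predict 'O', while RoBERTa and one other model disagree.
print(ensemble_token(["O", "O", "B-short", "O", "B-long", "O"], roberta_prediction="B-short"))
```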
The above ensembling technique increased the recall. Its limitation is that it sometimes predicted 'I-short' and 'I-long' as the beginning tags of a span, which lowered the precision for some tokens. To overcome this problem, the following post-processing rules were applied (a short sketch of these fixes is given at the end of this subsection):

• If a predicted tag sequence starts with 'I-short' (an 'I-short' not preceded by 'B-short' or 'I-short'), it is replaced with 'B-short'.
• The pattern 'O', 'I-long' is replaced with 'B-long', 'I-long' to maintain consistency.

By applying these post-processing rules, the precision improved, and hence the proposed EELM-SLP system surpassed the other individually trained models on the development dataset.

All the models were trained on an 11GB Nvidia RTX 2080 Ti GPU for 20-22 epochs, with each epoch taking around three to four minutes. Each model took around 90 minutes to train, and due to early stopping, the training was stopped as soon as the loss stopped decreasing. The proposed architecture is efficient and can be integrated into an actual production environment. The prediction pipeline is designed to take advantage of CPU parallelism: if the input contains a batch of sentences, the predictions are computed in parallel, independent batches and then combined in the order in which the sentences appear in the original paragraph or list.
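The two BIO-consistency fixes can be expressed as a small function over a sentence's predicted tag sequence. This sketch reflects one straightforward reading of the rules above and is illustrative only.

```python
def postprocess(tags):
    """Repair ensembled BIO tags according to the two rules described above."""
    fixed = list(tags)
    for i, tag in enumerate(fixed):
        # Rule 1: a span must not begin with 'I-short'.
        if tag == "I-short" and (i == 0 or fixed[i - 1] not in ("B-short", "I-short")):
            fixed[i] = "B-short"
        # Rule 2: the pattern 'O', 'I-long' becomes 'B-long', 'I-long'.
        if tag == "O" and i + 1 < len(fixed) and fixed[i + 1] == "I-long":
            fixed[i] = "B-long"
    return fixed

# Example: the leading 'I-short' and the 'O' before 'I-long' are both repaired.
print(postprocess(["I-short", "O", "O", "I-long", "I-long"]))
# -> ['B-short', 'O', 'B-long', 'I-long', 'I-long']
```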
Results and Discussions

Table 2 depicts the evaluation metrics on the development dataset. The top-3 performers among the individual language models by F1-score are RoBERTa, followed by XLM-RoBERTa, followed by BERT. RoBERTa achieves the highest individual F1-score of 92.18%, XLM-RoBERTa achieves an F1-score of 91.95%, and the third-best performing model, BERT, achieves an F1-score of 91.6%. The ensembling technique surpasses the hyperparameter-finetuned individual Language Models and achieves a precision of 92.36%, recall of 97.2%, and F1-score of 94.72%. The proposed Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) approach achieves the highest F1-score of 95.6% on the development data. It can be clearly seen that the proposed approach surpasses the other individual language models by a considerable margin. For all the individual finetuned models and for the proposed approach, the class-wise scores, namely precision, recall, and F1-score for the 'long' and 'short' labels, are also reported.

Table 2: Evaluation results for the development dataset (all values in percentage). For each algorithm (hyperparameters finetuned), the overall scores are followed by the class-wise scores for the 'long' and 'short' labels.

Algorithm                                  Precision  Recall  F1     Long P  Long R  Long F1  Short P  Short R  Short F1
Ensembling + Post Processing (EELM-SLP)    93.35      97.95   95.6   91.43   97.27   94.26    95.27    98.63    96.92
Ensembling Technique                       92.36      97.2    94.72  90.9    96.65   93.69    93.81    97.74    95.74
RoBERTa                                    93.19      91.2    92.18  91.65   92.56   92.1     94.74    89.83    92.22
XLM-RoBERTa                                93.05      90.89   91.95  91.49   92      91.75    94.61    89.77    92.12
BERT                                       92.34      90.88   91.6   90.43   90.27   90.35    94.24    91.48    92.84
CamemBERT                                  93.45      88.93   91.13  92.32   90.64   90.98    95.58    87.23    91.21
Longformer                                 93.14      88.59   90.81  90.49   90.2    90.34    95.8     86.97    91.17
DistilBERT                                 91.04      90.51   90.77  88.47   88.03   88.25    93.6     92.98    93.29

In table 3, the evaluation on the test dataset is presented using the standard evaluation metrics of precision, recall, and F1-score. Initially, the test dataset was evaluated with the RoBERTa model finetuned before hyperparameter tuning. This initial evaluation resulted in a precision, recall, and F1-score of 90.85%, 91.73%, and 91.29%, respectively. RoBERTa was then finetuned again after hyperparameter tuning, which further improved the metrics: the hyperparameter-tuned RoBERTa achieved a precision of 90.26%, recall of 92.46%, and F1-score of 91.34%. One of the significant highlights is the result obtained after applying the ensembling technique, which improved the F1-score from 91.34% to 91.86%. Finally, the highest F1-score was achieved after applying the post-processing step: the final proposed system achieved an F1-score of 92.08%, precision of 89.7%, and recall of 94.59%.

Table 3: Evaluation results for the test dataset (all values in percentage)

Algorithm                                  Precision  Recall  F1-score
Ensembling + Post Processing (EELM-SLP)    89.7       94.59   92.08
Ensembling Technique                       89.93      93.87   91.86
RoBERTa after hyperparameter tuning        90.26      92.46   91.34
RoBERTa before hyperparameter tuning       90.85      91.73   91.29
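The shared task supplies its own scorer, which is not reproduced here; as a rough approximation, entity-level precision, recall and F1 over BIO tags can be computed with the seqeval package, as the sketch below shows on toy data.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Toy gold/predicted tag sequences; in practice these would be the development-set
# annotations and the ensembled, post-processed predictions, one list per sentence.
y_true = [["B-long", "I-long", "I-long", "O", "B-short", "O"]]
y_pred = [["B-long", "I-long", "I-long", "O", "B-short", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1-score: ", f1_score(y_true, y_pred))
```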
Conclusion and Future Scope

In this paper, an Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) approach is proposed. The importance of the ensembling technique is clearly visible in the development and test dataset results. The final model achieves a precision of 93.35%, recall of 97.95%, and F1-score of 95.6% on the development dataset, and a precision of 89.7%, recall of 94.59%, and F1-score of 92.08% on the test set. On the development dataset, the proposed approach improves all three metrics of precision, recall, and F1-score. On the test dataset, although the recall and F1-score are the highest, the precision is lowered; this might be due to false positives introduced by the ensembling and post-processing steps. On the one hand, the ensembling and post-processing steps increase the recall and F1-score, but on the other hand, they lower the precision. Hence, the post-processing can be refined further. Moreover, this paper uses all the language models in their 'base' configuration and not in their 'large' configuration. In the future, experiments can be carried out with 'large'-configuration language models, which might further improve the scores.

References

Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Cook, B. 2019. ACRONYM: Acronym CReatiON for You and Me. arXiv preprint arXiv:1903.12180.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jacobs, K.; Itai, A.; and Wintner, S. 2020. Acronyms: identification, expansion and disambiguation. Annals of Mathematics and Artificial Intelligence 88(5): 517–532.

Kirchhoff, K.; and Turner, A. M. 2016. Unsupervised resolution of acronyms and abbreviations in nursing notes using document-level context models. In Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, 52–60.

Kuo, C.-J.; Ling, M. H.; Lin, K.-T.; and Hsu, C.-N. 2009. BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature. In BMC Bioinformatics, volume 10, S7. Springer.

Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity sequence labeling model for acronym expansion identification. Information Sciences 378: 462–474.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Mack, C. A. 2012. How to write a good scientific paper: acronyms. Journal of Micro/Nanolithography, MEMS, and MOEMS 11(4): 040102.

Martin, L.; Muller, B.; Suárez, P. J. O.; Dupont, Y.; Romary, L.; de la Clergerie, É. V.; Seddah, D.; and Sagot, B. 2019. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8026–8037.

Rajapakse, T. 2020. Simple Transformers. https://simpletransformers.ai/.

Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Schwartz, A. S.; and Hearst, M. A. 2002. A simple algorithm for identifying abbreviation definitions in biomedical text. In Biocomputing 2003, 451–462. World Scientific.

Taghva, K.; and Gilbreth, J. 1999. Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition 1(4): 191–198.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation shared tasks for Scientific Document Understanding. arXiv preprint arXiv:2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. arXiv preprint arXiv:2010.14678.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.