SciDr at SDU-2020 : IDEAS - Identifying and Disambiguating Everyday Acronyms for Scientific Domain

Aadarsh Singh, Priyanshu Kumar *
Indian Institute of Technology (Indian School of Mines)
Dhanbad, Jharkhand, India
aadarshsingh191198@gmail.com, kpriyanshu256@gmail.com


Abstract

We present our systems submitted for the shared tasks of Acronym Identification (AI) and Acronym Disambiguation (AD) held under the Workshop on SDU. We mainly experiment with BERT and SciBERT. In addition, we assess the effectiveness of "BIOless" tagging and blending, along with the prowess of ensembling, in AI. For AD, we formulate the problem as a span prediction task, experiment with different training techniques and also leverage the use of external data. Our systems rank 11th and 3rd in the AI and AD tasks respectively.

* Authors have equal contribution.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

An acronym is an abbreviation formed from the initial letters of other words and pronounced as a word. The usage of acronyms in articles and speech has increased, as it avoids the effort of remembering long complex terms. However, this increased usage of acronyms has also given rise to the problems of Acronym Identification (AI) and Acronym Disambiguation (AD). AI is the process of identifying which parts of a sentence constitute acronyms and their corresponding long forms, whereas AD is the process of correctly predicting the long form expansion of an acronym given a context of its usage. AI and AD are beneficial for applications like question answering (Ackermann et al. 2020) and definition extraction (Kumar et al. 2020; Singh, Kumar, and Sinha 2020). Since both AI and AD benefit from domain knowledge, manual identification and disambiguation of acronyms by domain experts is possible; however, it is tiresome and expensive. Hence, there is a dire need to develop intelligent systems that can mimic the role of domain experts and help us automate the tasks of AI and AD.

In this paper, we present our approach for the shared tasks of Acronym Identification and Acronym Disambiguation held under the workshop on Scientific Document Understanding (SDU). The problem of AI is treated as a sequence tagging problem. For AD, we treat it as a span prediction problem, i.e. given a sentence containing an acronym and the possible long forms of that acronym, we aim to extract the span from the possible expansions which is the most appropriate long form of the acronym as per the context in the sentence. We start the experimentation process for AI with rule based models. The experiments on both tasks are then extended to Transformer-based (Vaswani et al. 2017) architectures, with BERT (Devlin et al. 2018) as the backbone of the model, followed by SciBERT (Beltagy, Lo, and Cohan 2019), which is also a BERT-based model but is pretrained on text from scientific research papers instead of the Wikipedia corpus. In addition, for AD, we experiment with different training procedures, aiming to instill knowledge about various topics into our models.

The rest of the paper is organized as follows: related work is discussed in Section 2, followed by a brief description of the shared task datasets in Section 3. The methodology and experimental settings are covered in Sections 4 and 5. Sections 6 and 7 contain the results and discussion. Section 8 concludes the paper and also includes the scope of future work.

2 Related Work

Initial works on AI incorporate the use of rule-based methods. Park and Byrd (2001) present rule based methods for finding acronyms in free text. They make use of various patterns, text markers and linguistic cue words to detect acronyms and also their definitions. Schwartz and Hearst (2002) make use of the fact that the majority of acronyms and their long forms are found in close vicinity within a sentence, with one of them enclosed in parentheses, and thus extract short and long form pairs from sentences. They also propose an algorithm for identifying the correct long forms.

People have also tried to leverage web-search queries and logs to identify acronym-expansion pairs. A framework for automatic acronym extraction on a large scale was proposed by Jain, Cucerzan, and Azzam (2007). They scrape the web for candidate sentences (those containing acronym-expansion pairs) and then identify acronym-expansion pairs using search query logs and search results. They also try to rank acronym expansions by assigning a score to expansions using various factors. Taneva et al. (2013) target the problem of finding distinct expansions for an acronym. They make use of query click logs and clustering techniques to extract candidate expansions of acronyms and group them such that each group has a unique meaning. They then assign scores to the grouped expansions to find the appropriate expansion.
A comprehensive comparative study between rule-based and machine based methods for identifying and resolving acronyms has been done by Harris and Srinivasan (2019). They collect data from various resources and then experiment with machine based algorithms, crowd-sourcing methods and a game based approach.

Liu, Liu, and Huang (2017) treat AI as a sequence labelling problem and propose the Latent-state Neural Conditional Random Fields model (LNCRF), which is superior to CRFs in handling complex sentences by making use of nonlinear hidden layers. The incorporation of neural networks into CRFs enables the learning of better representations from manually created features, which helps performance.

Many works solve the AD task by creating word vectors and then using them to rank the candidates of the acronym with reference to its usage. McInnes et al. (2011) correlate acronym disambiguation with word sense disambiguation. They create 2nd order vectors of all possible long forms and the acronym with the help of word co-occurrences. The correct long form is then found using cosine similarity between the vectors. Li et al. (2018) present an end-to-end pipeline for acronym disambiguation in the enterprise domain. Due to the lack of a mapping from acronyms to their long forms, they first use data mining techniques to create a knowledge base. Further, they treat acronym disambiguation as a ranking problem and create ranking models using some manually created features.

With the advent of deep learning, researchers have tried to create more informative word vectors for the previous approach. Wu et al. (2015) first use deep learning to create neural word embeddings from medical domain data. They combine the word embeddings of a sample text in different ways and then train a Support Vector Machine (SVM) classifier for each acronym. Charbonnier and Wartena (2018) explore acronym disambiguation in the scientific research domain. They obtain word vectors from the text of scientific research papers and create vector representations for the context of the acronym. Distance minimisation between the vector of the context and that of an acronym expansion gives the appropriate expansion.

Ciosici, Sommer, and Assent (2019) present an unsupervised approach for acronym disambiguation by treating it as a word prediction problem. They use word2vec (Mikolov et al. 2013) to simultaneously learn word embeddings while learning to predict the correct special token (a concatenation of short and long form) for a sentence. The obtained word embeddings are used to create representations of the context of the short form, and the best expansion of the short form is obtained from the candidates by minimising the distance between representations.

Many works also treat AD as a classification problem. Jin, Liu, and Lu (2019) explore the usage of contextualised BioELMo word embeddings for acronym disambiguation. They train separate BiLSTM classifiers for each acronym, which output the appropriate expansion for an input text. They achieve state of the art performance on the PubMed dataset. Li et al. (2019) propose a novel neural topic attention mechanism to learn better contextualised representations for medical term acronym disambiguation. They compare the performance of LSTMs with ELMo embeddings armed with different types of attention mechanisms.

An overview of the submissions made to the shared tasks of AI and AD has been provided by the organizers (Veyseh et al. 2020a).

3 Datasets

Veyseh et al. (2020b) provide the shared task participants with datasets for the AI and AD tasks, called SciAI and SciAD respectively. SciAI contains 17,506 sentences from research papers, in which the boundaries of acronyms and their long forms are labelled using the BIO format. The tag set consists of B-short, B-long, I-short, I-long and O, with "short" representing the acronym and "long" representing the expansion. SciAD contains 62,441 instances covering acronyms used in the scientific domain. Each instance contains the sentence, the acronym and the correct expansion of that acronym as per its usage in the sentence. The dataset also provides a dictionary which maps acronyms to candidate long forms. Both datasets differ from existing datasets for AI and AD in that they are larger in size and have instances belonging to the scientific domain (the majority of AI and AD datasets belong to the medical domain).

4 Methodology

4.1 Models

Since both tasks are similar, we try out the following models for both of them and then build upon them:

• BERT : BERT, based on the Transformer architecture, consists of multi-attention heads which apply a sequence-to-sequence transformation on the input text sequence. The training objectives of BERT make it unique. The Masked Language Model (MLM) objective learns to predict a masked token using the left and right context of the text sequence. BERT also learns to predict whether two sentences occur in continuation or not (Next Sentence Prediction).

• SciBERT : The Allen Institute for Artificial Intelligence (AI2) pretrained the base version of BERT (SciBERT) on scientific text from 1.14 million research papers from Semantic Scholar. Owing to the similarity between the domain of the shared task dataset and the SciBERT training corpus, we believe the model will be beneficial for the tasks. We use SciBERT with SciVocab in our experiments.

4.2 AI

Problem Formulation We can easily identify the AI task as a NER (Named Entity Recognition) / BIO tagging task. The tags used are the short-form and long-form labels of the words in BIO format. One of the interesting experiments that we perform is to make use of "BIOless" tags. Keeping all factors constant, classifiers ought to work better if the number of classes is smaller. Tagging is a token classification task; hence, the tagger should perform better if the number of tags is reduced. The following changes are carried out on the training data to obtain "BIOless" tags:
1. B-short and I-short tags are changed to B-short.
2. B-long and I-long tags are changed to B-long.
3. O tags are unchanged.

The models are trained and, once the results are obtained, the definitions of the B, I and O tags, viz. beginning, inside and outside, are used to reconstruct the original tags. This is done by changing the first tag in a cluster to B-short or B-long and the rest of them to I-short or I-long.
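To make the transformation concrete, the sketch below implements both directions in Python; it is our own illustration (the function names and the list-of-tags representation are assumptions), not the authors' released code.

```python
# A minimal sketch of the BIOless transformation and the rule-based
# reconstruction described above (illustrative, not the authors' code).

def to_bioless(tags):
    """Collapse BIO tags: B-short/I-short -> B-short, B-long/I-long -> B-long."""
    mapping = {"B-short": "B-short", "I-short": "B-short",
               "B-long": "B-long", "I-long": "B-long", "O": "O"}
    return [mapping[t] for t in tags]

def to_bio(tags):
    """Reconstruct BIO tags: the first tag of each contiguous cluster keeps
    its B- prefix; the remaining tags of the cluster become I- tags."""
    restored, prev = [], "O"
    for t in tags:
        if t != "O" and t == prev:   # token continues the current cluster
            restored.append("I-" + t.split("-", 1)[1])
        else:                        # start of a cluster, or an O tag
            restored.append(t)
        prev = t
    return restored

# to_bio(to_bioless(["B-long", "I-long", "O", "B-short"]))
# -> ["B-long", "I-long", "O", "B-short"]
```

Note that two originally adjacent clusters of the same type would be merged by this reconstruction; the BIOless scheme itself cannot distinguish them.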
Models We experiment with the following models/variations of the models already mentioned:

• Conditional Random Fields (CRFs) : Consider labelling sentences with POS (Parts Of Speech) tags, where it is highly probable that a NOUN is followed by a VERB. Such tasks are essentially a combination of classification (classifying a word into one of the POS tags) and graphical modelling (one word influences the POS tag of other words). Thus, these tasks involve predicting a large number of variables that depend on each other as well as on other observed variables.

CRFs are a popular probabilistic method suitable for tasks such as this. They combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. For the current data, we use the following features as input:

For the current word -
a. The lower cased version of the word
b. The last three letters of the word
c. Whether all characters of the word are upper case
d. Whether the word is title cased
e. The POS tag of the word
f. The first two characters of the POS tag of the word
g. Whether at least 60% of the word is uppercase

For the neighbouring words -
a. The lower cased version of the word
b. Whether the word is title cased
c. Whether all characters of the word are upper case
d. The POS tag of the word
e. The first two characters of the POS tag of the word
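These features can be encoded as one dictionary per token, for instance in the format consumed by sklearn-crfsuite; the sketch below assumes that library and a (word, POS tag) pair per token, since the paper does not name its CRF implementation.

```python
# A sketch of the hand-crafted CRF features listed above, one dict per token.

def word_features(sent, i):
    """sent: list of (word, pos) pairs for one sentence; i: token index."""
    word, pos = sent[i]
    features = {
        "word.lower": word.lower(),        # a. lower cased version
        "word[-3:]": word[-3:],            # b. last three letters
        "word.isupper": word.isupper(),    # c. all characters upper case
        "word.istitle": word.istitle(),    # d. title cased
        "pos": pos,                        # e. POS tag
        "pos[:2]": pos[:2],                # f. first two characters of POS tag
        "word.mostly_upper":               # g. at least 60% uppercase
            sum(c.isupper() for c in word) >= 0.6 * len(word),
    }
    for offset in (-1, 1):                 # features of the neighbouring words
        j = i + offset
        if 0 <= j < len(sent):
            w, p = sent[j]
            features.update({
                f"{offset}:word.lower": w.lower(),
                f"{offset}:word.istitle": w.istitle(),
                f"{offset}:word.isupper": w.isupper(),
                f"{offset}:pos": p,
                f"{offset}:pos[:2]": p[:2],
            })
    return features

# X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
# can then be fed to sklearn_crfsuite.CRF(algorithm="lbfgs").fit(X, y).
```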
• BERT base cased : We use the cased base version of BERT as the backbone of our Transformer-CRF architecture.

• SciBERT cased : We use the cased version of SciBERT as the backbone of our Transformer-CRF architecture.

Post Modelling Experiments The process of ensembling helped to get a major boost over the scores of the base models. We used two kinds of ensembling processes:

• Majority Voting/Hard Voting (Wu et al. 2006) : The idea here is to simply go with what the majority of the models in the ensemble are predicting. In the case of classification, the final prediction is the mode of the predictions of the participating models; similarly, in a tagging task, or rather token classification, the final prediction for a given sequence is the sequence of token-wise modes over the prediction sequences of the participating models. Assume y is a label, x is a token, N is the total number of base taggers employed and T_i is a function that returns 1 if the prediction of the i-th tagger for x is y, and 0 otherwise. Then the score W(y, x) is defined as

  W(y, x) = Σ_{i=1}^{N} T_i(y, x)

and the y with the highest score is chosen as the label of x.
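Token-level hard voting then reduces to taking the per-token mode, as in the following sketch (ties are broken by first occurrence, a detail the paper does not specify):

```python
# Hard voting for sequence tagging: the final tag of each token is the mode
# of the tags assigned to it by the N base taggers.
from collections import Counter

def hard_vote(predictions):
    """predictions: N tag sequences (one per tagger) for the same sentence."""
    return [Counter(token_tags).most_common(1)[0][0]
            for token_tags in zip(*predictions)]

# hard_vote([["B-short", "O"], ["B-short", "B-long"], ["O", "O"]])
# -> ["B-short", "O"]
```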
• Blending (Sikdar and Gambäck 2017) : Here we depict our process of blending models (Figure 1). The whole process consists of the following 3 stages:

a. The base models are trained on the training data and then predictions are made on the validation data using these models.

b. The predictions obtained in the previous stage are used as the features for this stage. A CRF is fit on these features using 5-fold cross validation.

c. The five trained models obtained in the previous stage are then ensembled using majority voting to make the final prediction.

Figure 1: Blending for AI. (The original diagram shows stage 1 training the base models and collecting their predictions on the validation data, stage 2 fitting one CRF per fold on these predictions as features, and stage 3 combining the per-fold CRFs' test predictions by voting into the final prediction.)
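Stages 2 and 3 can be sketched as follows, again assuming sklearn-crfsuite; the base models' predicted tags are the only features of the second-level CRF:

```python
# A sketch of blending stages 2 and 3: fit K CRFs on the base models'
# validation predictions via K-fold cross validation, then combine their
# test predictions by voting (illustrative assumptions throughout).
import sklearn_crfsuite
from sklearn.model_selection import KFold

def fit_blender(base_preds, gold_tags, n_splits=5):
    """base_preds[s][m][i]: tag from base model m for token i of sentence s;
    gold_tags[s][i]: the gold tag. Returns the fitted stage-2 CRFs."""
    X = [[{f"model_{m}": sent_preds[m][i] for m in range(len(sent_preds))}
          for i in range(len(sent_preds[0]))]
         for sent_preds in base_preds]
    models = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True).split(X):
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
        crf.fit([X[i] for i in train_idx], [gold_tags[i] for i in train_idx])
        models.append(crf)
    return models  # stage 3: hard-vote the models' predictions on test data
```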


4.3 AD

Problem Formulation Many existing works solve AD as a text classification problem, i.e. given a text and an acronym, they classify the long form of the acronym, or they develop rich word vector representations to extract the most suitable long form out of the candidate long forms. We, instead, treat AD as a span prediction problem. The model predicts the span containing the correct long form from the concatenated text consisting of the acronym, the candidate long forms of that acronym and the sentence (in that order). The predicted span is then compared with the candidate long forms and the best match is chosen as per the Jaccard score.
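The matching of the predicted span back to a candidate can be sketched with a token-level Jaccard score (our own illustration of this step):

```python
# Choose the candidate long form whose token set is closest to the
# predicted span under the Jaccard similarity.

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def best_candidate(predicted_span, candidates):
    return max(candidates, key=lambda c: jaccard(predicted_span, c))

# best_candidate("convolutional neural net",
#                ["convolutional neural network", "cable news network"])
# -> "convolutional neural network"
```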
Each approach has its own shortcomings. For the classification approach, the size of the model increases with the size of the dictionary; training models for a large number of classes is difficult. A solution to this problem is to build an individual model per acronym, but this might not be feasible if there are many acronyms. For the vector based methods, achieving rich representations is difficult. As for the span prediction approach, the handling of long inputs is difficult and time-consuming; we may have to compromise on the context of the acronym in order to adjust for long sequences.

To prepare our input text for the model, we take advantage of the fact that BERT can encode a pair of sequences together. The first sequence is the acronym concatenated with all possible expansions from the dictionary, and the second sequence is the input text. Since some of the input sentences are quite long, we sample tokens from the sentences: in order to input sufficient context of the acronym into the models, we take n/2 space delimited tokens to the left of the acronym and n/2 space delimited tokens to its right, where n is a hyperparameter. We find in our experiments that taking n sufficiently large gives almost consistent performance. We fix n to 120 in our experiments.
We experiment with different training approaches and pretrained weights, keeping the architecture of our model constant in all cases. The backbone of the architecture is the base version of BERT. The sequence output of the last layer of BERT (shape = (batch_size, max_len, 768)) is passed through a dense layer to reduce its shape to (batch_size, max_len, 2). The output is split into 2 parts along the last axis to obtain our token level logits for the start position and the end position. A pictorial representation of the model can be found in Figure 2.

Figure 2: Model Architecture for AD; SP and EP stand for Start Probability and End Probability. (The original diagram shows the input [CLS] Acronym Expansion_1 Expansion_2 ... Expansion_N [SEP] Sentence [SEP] passing through BERT/SciBERT, a token level encoding and a linear layer producing SP and EP.)
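A minimal PyTorch sketch of this architecture follows; the HuggingFace model name is an assumption, and the authors' exact implementation may differ:

```python
# Span-prediction head: BERT sequence output -> Linear(hidden_size, 2),
# then split along the last axis into start and end logits.
import torch.nn as nn
from transformers import AutoModel

class SpanPredictor(nn.Module):
    def __init__(self, backbone="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(backbone)
        self.head = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        # (batch_size, max_len, 768) -> (batch_size, max_len, 2)
        hidden = self.bert(input_ids, attention_mask=attention_mask)[0]
        logits = self.head(hidden)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```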

Models We experiment with the following models:

• BERT base uncased : We use the uncased base version of BERT as the backbone of our model.

• SciBERT uncased : We use the uncased version of SciBERT as the backbone of our model.

• SciBERT uncased with fine tuned LM : The dataset does not contain samples for all acronym expansions. Hence, models trained only on the provided dataset may suffer when it comes to predicting unseen acronym expansions. We try to instill some knowledge of the acronym expansions into our model by fine tuning the MLM. We scrape Wikipedia for articles (using the Wikipedia API, https://pypi.org/project/wikipedia/) related to the long forms of acronyms present in the dictionary and fine tune the LM of SciBERT using this data. We then use the fine tuned weights for the SciBERT backbone and train it for span prediction.

• SciBERT uncased with 2 stage training : We train the model in 2 stages using different data. We prepare our own dataset from the articles scraped from Wikipedia, in which occurrences of long forms of acronyms are replaced by the acronym. We first train our model on this data and then on the shared task data. This is a supervised approach to help the model learn acronyms and expansions under-represented in the shared task data, in contrast to the above approach, which is unsupervised. (A sketch of the external data preparation for these two variants follows this list.)
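The external data preparation for the last two variants can be sketched as below. The helper names are ours; the wikipedia package calls are those provided by the library linked above.

```python
# Collect Wikipedia text for the dictionary's long forms (used for MLM fine
# tuning) and build 2-stage training samples by substituting each long form
# with its acronym (an illustrative sketch, not the authors' exact pipeline).
import re
import wikipedia

def collect_articles(long_forms):
    corpus = []
    for form in long_forms:
        try:
            hits = wikipedia.search(form)
            if hits:
                corpus.append(wikipedia.page(hits[0]).content)
        except Exception:  # skip disambiguation pages and missing articles
            continue
    return corpus

def make_stage1_samples(article_text, acronym, long_form):
    """Replace the long form with the acronym, keeping the long form as label."""
    pattern = re.compile(re.escape(long_form), re.IGNORECASE)
    return [{"text": pattern.sub(acronym, sent),
             "acronym": acronym,
             "label": long_form}
            for sent in article_text.split(". ") if pattern.search(sent)]
```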
Post Modelling Experiments

• Ensemble : Since our approach outputs start and end probability distributions over the entire sequence of tokens, we cannot average probabilities from models using different tokenizers. Keeping this fact in mind, we average the probabilities from the two best models (as per CV), i.e. SciBERT uncased and SciBERT uncased with 2 stage training. The appropriate acronym expansion is then extracted with the help of this averaged probability, which provides robustness in our predictions.

• Ensemble with post-processing : We also devise a post-processing step that can rectify some of the mistakes of our models. All the post-processing does is: if a candidate expansion of an acronym is present in the sentence and the acronym is enclosed within parentheses in the sentence, then that candidate expansion is predicted as the expansion of the acronym. The motivation for devising this post-processing is discussed in Section 7.
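Because both chosen models share the SciBERT tokenizer, their start/end distributions are token-aligned and can be averaged directly, for example:

```python
# Soft ensemble of two span predictors with a shared tokenizer: average the
# start/end probabilities, then decode a span whose end does not precede
# its start.
import numpy as np

def ensemble_span(start_a, end_a, start_b, end_b):
    start = (np.asarray(start_a) + np.asarray(start_b)) / 2
    end = (np.asarray(end_a) + np.asarray(end_b)) / 2
    s = int(start.argmax())
    e = s + int(end[s:].argmax())  # constrain the end to follow the start
    return s, e
```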
5 Experimental Settings

For the AI task, there are three kinds of experimental settings:

a. The base models are trained on the training data and evaluated on the validation data.

b. For the better performing base models, we concatenate the training and validation data and perform a 5 fold cross-validation on the concatenated dataset.

c. For blending, we perform a 5 fold cross-validation on the validation data.

For each of the above settings, training was done for 20 epochs using early stopping with a patience of 10. Model optimisation was done using BertAdam with a learning rate of 1e-3, a batch size of 16 and a gradient accumulation batch size of 32.

For the AD task, we concatenate the training and validation data and perform a 5 fold stratified cross-validation on the joined dataset (stratified with respect to the acronym). The folds are trained for 5 epochs using early stopping with a patience of 2 and a tolerance of 1e-3. Model optimisation is done using AdamW (Loshchilov and Hutter 2018) with a learning rate of 2e-5 and a batch size of 32.
6 Results

6.1 AI

The macro F1 scores of our approaches are listed in Table 1. For the base models, validation is done using the validation data. Only the promising models, in our case the SciBERT models, are taken through the more arduous cross validation process.

Model                               Val      CV       Test
Baseline                            0.8546   -        0.8409
CRF                                 0.8254   -        -
CRF BIOless                         0.7994   -        -
BERT cased                          0.9145   -        -
BERT cased BIOless                  0.9163   -        -
SciBERT cased                       0.9173   -        0.8921
SciBERT cased BIOless               0.9165   -        0.9005
SciBERT cased (5-fold CV)           -        0.9075   0.9023
SciBERT cased BIOless (5-fold CV)   -        0.9073   0.9036
Blending with mode ensembling       -        0.8962   0.9090

Table 1: Results of AI task.

It should also be noted that the folds for the cross validation of the modified blending technique are extracted from the validation data, unlike the SciBERT models, which are cross validated on the combined data (train + validation); hence the two CV scores are not comparable. The other observations are enumerated as follows:

a. The official baseline, though rule-based, surpasses CRF.

b. As expected, SciBERT performs better than BERT.

c. As for the BIOless variants:
   • CRFs see a considerably big difference (0.026) between the BIOless and BIO variants. The hypothesis that "the tagger should perform better if the number of tags is reduced" seems to fail here. The present task of AI seems a bit complex for CRFs, as they do not even surpass the baseline score of 0.84. Hence, it is justifiable to treat CRFs as an exception with respect to the hypothesis.
   • For all the other models/variations, BIOless is very close to the BIO variant (a difference of 0.0008 or 0.0002) or surpasses it (with a relatively larger difference of 0.0084 or 0.0013).

d. Based on the test score, BIOless variants perform better than their corresponding BIO counterparts.

e. The test score clearly shows the eminence of the modified blending technique.

Table 2 shows a comparison of our results with the top scoring submissions of the AI task.

User / Team Name   Test Score
zdq                0.9330
qinpersevere       0.9311
Mobius             0.9281
SciDr (Us)         0.9090

Table 2: Comparison of AI results.

6.2 AD

We tabulate the macro F1 scores of the models in the cross-validation and test settings in Table 3. The performance of SciBERT is superior to that of BERT, owing to the similarity of the pretraining corpus and the task dataset. We also observe that the performance of SciBERT uncased and SciBERT uncased with 2 stage training is almost similar in both cross-validation and test, with the latter performing a bit better than the former, whereas the performance of the variant with the fine-tuned LM is lower. A possible reason for this observation is the difference between the source of the data used for fine tuning (Wikipedia) and the shared task data (scientific papers). The usage of extra data created from Wikipedia is beneficial for the model, since it contains samples for some acronyms under-represented in the task dataset.

Model                                   CV       Test
Baseline                                -        0.6097
BERT uncased                            0.7549   0.8980
SciBERT uncased                         0.8423   0.9244
SciBERT uncased with fine tuned LM      0.8278   0.9194
SciBERT uncased with 2 stage training   0.8424   0.9292
Ensemble                                -        0.9303
Ensemble with post-processing           -        0.9319

Table 3: Results of AD task.

Table 4 lists the scores of the top submissions for the AD task.

User / Team Name   Test Score
DeepBlueAI         0.9405
qwzhong            0.9373
SciDr (Us)         0.9319
del2z              0.9266

Table 4: Comparison of AD results.
Figure 3: A few erroneously tagged instances for AI.

7 Discussion

7.1 AI

The best proposed method for the AI task involves the use of the following three main building blocks:

• SciBERT as the base model
• BIOless variants
• The modified blending technique, i.e. the blending method coupled with hard voting.

The reason for SciBERT performing better than the BERT model lies in the fact that its pretraining corpus is similar to our dataset. The hypothesis for using BIOless variants instead of the conventional technique seems to hold true (points c, d and e in Subsection 6.1).

Model                                    F1       Precision   Recall
Baseline                                 0.8409   0.9131      0.7793
SciBERT cased BIOless with hard voting   0.9036   0.8987      0.9086
Blending with mode ensembling            0.9090   0.9097      0.9083

Table 5: F1, Precision and Recall of some models used in the AI task.

Ensembling has always helped in the domain of Machine Learning. The third block, viz. the modified blending technique, is a combination of two propitious methods, blending and hard voting, and ultimately gave the best results. The baseline method used by the organizers had a low F1, but its precision was quite good compared to the precision of the SciBERT cased BIOless model with hard voting. The only way to employ the adroitness of the baseline model was to stack it (and some other better performing models) with the SciBERT cased BIOless model. As is visible in Table 5, the blended model improved considerably, especially with respect to precision.

Figure 3 shows some of the sentences tagged incorrectly by the SciBERT model. Ideally the analysis should have been done on the best model, but it is too complex to interpret. Looking at DEV-297 and DEV-42, it is clear that the gold truths have some annotation flaws. HMM is clearly an acronym for Hidden Markov Models and is still not labelled. Similarly, RNN, CNN and WiFi are acronyms for Recurrent Neural Network, Convolutional Neural Network and Wireless Fidelity respectively, but only CNN is marked in the ground truth. Also, "complicated neural network" is not a long form but is used to describe the complexity of RNN and CNN networks. Our base model does well in predicting the right tags for these samples.

On the other hand, we find that in DEV-1313 and DEV-593 the model has completely failed to identify the long forms, and has also misidentified a few short forms. Two likely causes could be as follows:

• improper tokenization of the dataset
• "and", "-", "of" etc. in between long forms

7.2 AD

The formulation of AD as a span prediction problem is quite efficient from both the performance and the computational expense points of view. A complete cross-validation run under the given experimental settings can be performed in about 6 hours on average on an NVIDIA Tesla P100.

Speaking of the results, for the out-of-fold predictions of SciBERT uncased, we observe that the model errs mainly on acronyms which do not have many occurrences in the task dataset. This motivated us to attempt instilling knowledge into our models via external data.
We first examine the differences between the test set predictions of SciBERT uncased, SciBERT uncased with 2 stage training and their ensemble (represented as Normal, Stage and Ensemble respectively) to understand the differences between the models and to find out which model exhibits more confidence in its predictions.

We examine those samples where all three predictions are different (Table 6). It can be observed that the predictions of SciBERT uncased seem quite appropriate as per the context, and that the contributions from the Stage model change the final prediction. However, there are 92 instances in the test predictions where the three predictions do not all agree. These are the instances where the ensemble submission gets its test score boost.

Id        Acronym   Text                                                Normal           Stage              Ensemble
TS-633    FM        Ultimately, once we select an FM, the ChI           feature map      fuzzy measure      factorization machines
                    becomes a specific operator.
TS-811    GS        Additionally, using WSE (GS search) we obtained     genetic search   google scholar's   gold standard
                    84.4 accuracy with an FPR of 0.157 and AUC
                    value of 0.918.
TS-5682   EL        Thus, with EL system ( ), only two structures       external links   euler - lagrange   entity linking
                    are possible for : (i), and (ii),.

Table 6: Mismatch of predictions between SciBERT uncased, SciBERT uncased with 2 stage training and their soft ensemble.

We observe that some of the samples in the test set do not contain sufficient context to help in acronym disambiguation. This can be an issue, and it is difficult to say how the models will behave in such situations. Some of these samples are shown in Table 7. For the text with id TS-5572, the possible long forms of LPP are "locality preserving projections" and "load planning problem". The models each predict one of the expansions, and both expansions seem relevant in the given context. Similar arguments can be made for the text with id TS-5830, where the models get confused between "global convolution networks" and "graph convolution networks".

Id        Acronym   Text
TS-5572   LPP       The LPP can be briefly described as follows.
TS-5830   GCN       Effect of both kernels added at end to get actual GCN output.

Table 7: Instances lacking sufficient context for AD.

Many of the instances in the test set are such that the long form expansion of the acronym is present in the text and the acronym is present within parentheses. Our models correctly predict the long form for most of these instances, but miss out on a few occasions. This motivated us to devise a post-processing step for such instances, where we directly check for this condition and predict accordingly, overwriting the model predictions.
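The rule is simple enough to state as code; this sketch is our own formulation of the check described above:

```python
# Post-processing for AD: if the acronym appears inside parentheses and a
# candidate expansion occurs verbatim in the sentence, override the model
# prediction with that candidate.

def postprocess(sentence, acronym, candidates, model_prediction):
    in_parens = f"({acronym}" in sentence or f"( {acronym}" in sentence
    if in_parens:
        for cand in candidates:
            if cand.lower() in sentence.lower():
                return cand
    return model_prediction
```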
8 Conclusion

We present our approaches for Acronym Identification and Acronym Disambiguation in the scientific domain. The usage of SciBERT in both tasks is beneficial because of the similarity between its training corpus and the task domain. We addressed AI as a tagging problem; our experiments prove the usefulness of the data transformation using BIOless tags and the adroitness of blending incorporated with hard voting. We approached AD as a span prediction problem; our experimental work demonstrates the effect of pretrained weights, external data, ensembling and post-processing. Our analysis provides some interesting insights into some of the shortcomings of the models and also some of the flaws in the dataset annotation. For future work, we can experiment with data augmentation and observe the behaviour of the models on both AI and AD.

9 Appendix

The source code of our approaches for AI and AD can be found at:

• AI : https://github.com/aadarshsingh191198/AAAI-21-SDU-shared-task-1-AI
• AD : https://github.com/aadarshsingh191198/AAAI-21-SDU-shared-task-2-AD

Acknowledgements

We thank Google Colab and Kaggle for their free computational resources.

References

Ackermann, C. F.; Beller, C. E.; Boxwell, S. A.; Katz, E. G.; and Summers, K. M. 2020. Resolution of acronyms in question answering systems. US Patent 10,572,597.

Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Charbonnier, J.; and Wartena, C. 2018. Using word embeddings for unsupervised acronym disambiguation.

Ciosici, M.; Sommer, T.; and Assent, I. 2019. Unsupervised Abbreviation Disambiguation: Contextual disambiguation using word embeddings. arXiv preprint arXiv:1904.00929.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Harris, C. G.; and Srinivasan, P. 2019. My Word! Machine versus Human Computation Methods for Identifying and Resolving Acronyms. Computación y Sistemas 23(3).

Jain, A.; Cucerzan, S.; and Azzam, S. 2007. Acronym-expansion recognition and ranking on the web. In 2007 IEEE International Conference on Information Reuse and Integration, 209–214. IEEE.

Jin, Q.; Liu, J.; and Lu, X. 2019. Deep Contextualized Biomedical Abbreviation Expansion. arXiv preprint arXiv:1906.03360.

Kumar, P.; Singh, A.; Kumar, P.; and Kumar, C. 2020. An explainable machine learning approach for definition extraction. In International Conference on Machine Learning, Image Processing, Network Security and Data Sciences, 145–155. Springer.

Li, I.; Yasunaga, M.; Nuzumlalı, M. Y.; Caraballo, C.; Mahajan, S.; Krumholz, H.; and Radev, D. 2019. A Neural Topic-Attention Model for Medical Term Abbreviation Disambiguation. arXiv preprint arXiv:1910.14076.

Li, Y.; Zhao, B.; Fuxman, A.; and Tao, F. 2018. Guess Me if You Can: Acronym Disambiguation for Enterprises. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1308–1317.

Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity sequence labeling model for acronym expansion identification. Information Sciences 378: 462–474.

Loshchilov, I.; and Hutter, F. 2018. Fixing weight decay regularization in Adam.

McInnes, B.; Pedersen, T.; Liu, Y.; Pakhomov, S.; and Melton, G. B. 2011. Using second-order vectors in a knowledge-based method for acronym disambiguation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 145–153.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Park, Y.; and Byrd, R. J. 2001. Hybrid text mining for finding abbreviations and their definitions. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.

Schwartz, A. S.; and Hearst, M. A. 2002. A simple algorithm for identifying abbreviation definitions in biomedical text. In Biocomputing 2003, 451–462. World Scientific.

Sikdar, U. K.; and Gambäck, B. 2017. A feature-based ensemble approach to recognition of emerging and rare named entities. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 177–181.

Singh, A.; Kumar, P.; and Sinha, A. 2020. DSC IIT-ISM at SemEval-2020 Task 6: Boosting BERT with Dependencies for Definition Extraction. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 710–716. Barcelona (online): International Committee for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.semeval-1.93.

Taneva, B.; Cheng, T.; Chakrabarti, K.; and He, Y. 2013. Mining acronym expansions and their meanings using query click log. In Proceedings of the 22nd International Conference on World Wide Web, 1261–1272.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding. arXiv preprint arXiv:2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. In Proceedings of COLING.

Wu, C.-W.; Jan, S.-Y.; Tsai, R. T.-H.; and Hsu, W.-L. 2006. On using ensemble methods for Chinese named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, 142–145.

Wu, Y.; Xu, J.; Zhang, Y.; and Xu, H. 2015. Clinical abbreviation disambiguation using neural word embeddings. In Proceedings of BioNLP 15, 171–176.