=Paper=
{{Paper
|id=Vol-3164/paper29
|storemode=property
|title=A Novel Initial Reminder Framework for Acronym Extraction
|pdfUrl=https://ceur-ws.org/Vol-3164/paper29.pdf
|volume=Vol-3164
|authors=Xiusheng Huang,Bin Li,Fei Xia,Yixuan Weng
|dblpUrl=https://dblp.org/rec/conf/aaai/HuangLXW22
}}
==A Novel Initial Reminder Framework for Acronym Extraction==
A Novel Initial Reminder Framework for Acronym Extraction
Xiusheng Huang1,2, Bin Li3, Fei Xia1,2 and Yixuan Weng1,2
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100190, China
3 College of Electrical and Information Engineering, Hunan University
Abstract
Acronym extraction aims to extract acronyms (i.e., short-forms) and their meanings (i.e., long-forms) from the original document. It is one of the key and challenging tasks in scientific document understanding (SDU@AAAI-22). Previous work regarded it as a named entity recognition task, ignoring the relationship between acronyms and their meanings, especially the importance of initials. In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, the IRF first recognizes the span of the acronym and then, combined with the initial information, recognizes its meaning. At the same time, considering that acronyms are often close to their meanings, the IRF adopts a Neighborhood Search Strategy. Experiments on two acronym extraction datasets show that IRF outperforms previous methods by 5.90/7.10 F1. Further analysis reveals that IRF is effective in extracting both short-forms and long-forms.
Keywords
Acronym extraction, Initials, Initial Reminder Framework, Neighborhood Search Strategy
SDU@AAAI-22: Workshop on Scientific Document Understanding, co-located with AAAI 2022, Vancouver, Canada.
huangxiusheng2020@ia.ac.cn (X. Huang); libincn@hnu.edu.cn (B. Li); xiafei2020@ia.ac.cn (F. Xia); wengsyx@gmail.com (Y. Weng)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction

Figure 1: Example document from the Spanish dataset: "XLII (I) Consejo Mundial de la Paz (CMP) Federación Internacional de Comercio de Cacao (FICC) Federación Internacional de la Industria del Medicamento (FIIM)." In the figure, the green text represents acronyms, the orange text represents long-forms, and the red text represents initials. At the same time, the red, blue and black lines indicate the correspondence between initials and acronyms.

Acronym extraction is the task of identifying acronyms and their meanings, which is very important for scientific document understanding (SDU@AAAI-22) [1, 2]. Previous methods treat this task mostly as a sequence annotation task [3, 4, 5], in which the model recognizes the acronyms and long-terms.
The context of acronyms often has obvious characteristics; for example, there are brackets around acronyms, or the acronyms themselves have a specific format, which leads to a higher accuracy when identifying acronyms. However, the accuracy of identifying long-terms is relatively low, with problems such as inaccurate identification or missed identification.
As shown in Figure 1, in a document we need to identify the acronyms and the long-terms. The context of acronyms often has some characteristics (e.g., brackets) which help the model to identify them. Long-term recognition is a challenge: it requires a certain understanding of the document content. A better solution is to know what the corresponding acronym is before extracting the long-term, which helps the model recognize the long-term.
From Figure 1, we can see that each character of the acronym corresponds to an initial of the long-term, which helps the model identify the long-term.
In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Through experiments, we find that the model has higher accuracy in acronym recognition than in long-term recognition. Specifically, in Spanish, the model achieves 91% F1 on the task of identifying acronyms, while the F1 score for long-terms is only 83%. At the same time, considering the correlation between acronyms and long-terms, IRF first completes the task of identifying acronyms. On this basis, combined with the initial information contained in the acronyms, IRF further identifies the long-terms. We verify the effectiveness of our method on two acronym extraction datasets, covering Spanish and Danish.
We summarize our contributions as follows:

• We introduce a fresh perspective to revisit the acronym extraction task with a principled problem formulation, which implies a general algorithmic framework that helps identify long-terms by initials.
• We propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, IRF makes use of the high accuracy of acronym recognition and helps the model recognize long-terms by integrating the initial information.
• We conduct experiments on two acronym extraction datasets. Experimental results demonstrate that our IRF model achieves state-of-the-art performance compared with the baselines.
2. Task introduction

2.1. Problem definition

We regard the acronym extraction task as a sequence annotation task. Different from previous methods, considering the high accuracy of acronym recognition, we first recognize the acronym and then use the character information of the acronym to recognize the long-term. Given a document D = {x_1, x_2, ..., x_n}, the initials of each word in the document are I = {y_1, y_2, ..., y_n}. Utilizing our IRF model, we obtain the acronyms and long-terms:

A, L = IRF(x_1, x_2, ..., x_n; y_1, y_2, ..., y_n)    (1)

where A refers to the acronyms and L refers to the long-terms.
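The paper does not spell out how the initial sequence I is derived from the document, so the following is a minimal sketch under the assumption of simple whitespace tokenization, with leading punctuation stripped before the initial is taken.

```python
# Minimal sketch (not from the paper's code release): building the initial
# sequence I = {y_1, ..., y_n} that accompanies the document D = {x_1, ..., x_n}.
# Whitespace tokenization is an assumption; punctuation-only tokens get an
# empty initial placeholder.
from typing import List, Tuple


def build_initials(document: str) -> Tuple[List[str], List[str]]:
    """Return the token sequence D and the per-token initial sequence I."""
    tokens = document.split()          # D = {x_1, ..., x_n}
    initials = []                      # I = {y_1, ..., y_n}
    for token in tokens:
        # Strip leading punctuation such as '(' before taking the initial.
        stripped = token.lstrip("([{'\"")
        initials.append(stripped[0] if stripped else "")
    return tokens, initials


if __name__ == "__main__":
    doc = "Consejo Mundial de la Paz ( CMP )"
    D, I = build_initials(doc)
    print(list(zip(D, I)))  # e.g. [('Consejo', 'C'), ('Mundial', 'M'), ...]
```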
2.2. Evaluation metric

The online results are evaluated with the macro-averaged precision, recall, and F1 scores. The final score measures the prediction correctness of short-form (i.e., acronym) and long-form (i.e., phrase) boundaries in the given sentence. A short-form or long-form prediction is correct once the beginning and end positions of the predicted short-form or long-form are equal to the label. The official score is the macro average of the short-form and long-form F1 scores.
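As an illustration of this exact-boundary evaluation (a sketch, not the official scorer), spans can be represented as (start, end) index pairs and scored as follows:

```python
# Illustrative sketch of the metric described above: exact-boundary span F1 for
# short-forms and long-forms, macro-averaged. `gold` and `pred` are lists of
# span sets, one set per sentence; a span is a (start, end) token index pair.
from typing import List, Set, Tuple

Span = Tuple[int, int]


def span_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def official_score(gold_short, pred_short, gold_long, pred_long) -> float:
    # Macro average of the short-form F1 and the long-form F1.
    return (span_f1(gold_short, pred_short) + span_f1(gold_long, pred_long)) / 2
```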
2.3. Dataset introduction

This task contains various multi-lingual datasets composed of document sentences in scientific fields. Among them, the statistics of the Spanish and the Danish datasets are shown in Table 1 and Table 2. The Spanish dataset is divided into training (5928), development (741), and testing (741) sets from the whole dataset. As shown in Table 2, the Danish dataset is divided into training (3082), development (385), and testing (386) sets from the whole dataset. Both datasets have been manually labeled, where the label is a list of position boundaries.

Table 1
Statistical Information of the Spanish Dataset.
Data              Sample Number   Ratio
Training Set      5928            80.00%
Development Set   741             10.00%
Test Set          741             10.00%
Total             7410            100%

Table 2
Statistical Information of the Danish Dataset.
Data              Sample Number   Ratio
Training Set      3082            80.00%
Development Set   385             9.99%
Test Set          386             10.1%
Total             3853            100%

3. Methodology

In this section, we introduce our proposed IRF model. IRF utilizes the corresponding relationship between the characters of acronyms and the initials of long-terms, which effectively helps the model improve the accuracy of long-term recognition.

3.1. Encoder

Given a document D = {x_1, x_2, ..., x_n} and the initials of each word in the document I = {y_1, y_2, ..., y_n}, we leverage a pre-trained language model as the encoder to obtain the embeddings as follows:

H = BERT_Encode(x_1, x_2, ..., x_n; y_1, y_2, ..., y_n)    (2)

where H = [h_1, h_2, ..., h_n] is the embedding of each token and I is the embedding of each initial.
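A minimal sketch of how such an encoder call might look with the HuggingFace transformers library is given below. Passing the initial sequence as the second segment of a sentence pair is our assumption about the interface, not the authors' released code.

```python
# Minimal sketch (assumption, not the released code): encoding the document
# tokens together with their initials using a multilingual BERT encoder from
# the HuggingFace transformers library. The initials are fed as the second
# segment of a sentence pair so the encoder can attend across both sequences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

document = "Consejo Mundial de la Paz ( CMP )"
initials = "C M d l P ( C )"  # one initial per word, as in Figure 2

inputs = tokenizer(document, initials, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

H = outputs.last_hidden_state  # [1, seq_len, hidden]; h_i = H[0, i] in Eq. (2)
print(H.shape)
```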
3.2. Acronyms Tagger

The low-level tagging module is designed to recognize all possible acronyms in the input sentence by directly decoding the encoded vector H produced by the N-layer BERT encoder. More precisely, it adopts two identical binary classifiers to detect the start and end positions of acronyms, respectively, by assigning each token a binary tag (0/1) that indicates whether the current token corresponds to a start or end position of an acronym. The detailed operations of the Acronyms Tagger on each token are as follows:

p_i^{start_a} = σ(W_start · h_i + b_start)    (3)

p_i^{end_a} = σ(W_end · h_i + b_end)    (4)

where p_i^{start_a} and p_i^{end_a} represent the probability of identifying the i-th token in the input sequence as the start and end position of an acronym, respectively. The corresponding token is assigned a tag 1 if the probability exceeds a certain threshold, or a tag 0 otherwise. h_i is the encoded representation of the i-th token in the input sequence, i.e., h_i = H[i], W_start and W_end represent the trainable weights, b_start and b_end are the biases, and σ is the sigmoid activation function.
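A minimal PyTorch sketch of the start/end binary classifiers in Eqs. (3)–(4) is shown below; the layer names and the 0.5 threshold are our assumptions, since the paper only states that a "certain threshold" is used.

```python
# Minimal PyTorch sketch of the low-level Acronyms Tagger (Eqs. 3-4). Two sigmoid
# classifiers are applied to every token embedding h_i to predict start and end
# probabilities; the 0.5 threshold and the layer names are assumptions.
import torch
import torch.nn as nn


class AcronymTagger(nn.Module):
    def __init__(self, hidden_size: int, threshold: float = 0.5):
        super().__init__()
        self.start_classifier = nn.Linear(hidden_size, 1)  # W_start, b_start
        self.end_classifier = nn.Linear(hidden_size, 1)    # W_end, b_end
        self.threshold = threshold

    def forward(self, H: torch.Tensor):
        # H: [batch, seq_len, hidden] produced by the encoder.
        p_start = torch.sigmoid(self.start_classifier(H)).squeeze(-1)  # Eq. (3)
        p_end = torch.sigmoid(self.end_classifier(H)).squeeze(-1)      # Eq. (4)
        start_tags = (p_start > self.threshold).long()
        end_tags = (p_end > self.threshold).long()
        return p_start, p_end, start_tags, end_tags
```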
3.3. Long-term Tagger

Considering that each acronym and its meaning are usually located close together, we utilize a Neighborhood Search Strategy to select the context near the detected acronym, so as to extract the correct long-term.
The high-level tagging module identifies the long-term with respect to the acronyms obtained at the lower level. As shown in Figure 2, for the acronym CMP, we search for its corresponding long-term in a limited context. Different from the Acronyms Tagger, which directly decodes the encoded vector H, the Long-term Tagger takes the acronym features and initial features into account as well.

Figure 2: An overview of the proposed IRF framework (the Acronyms Tagger and the Long-term Tagger with the Neighborhood Search Strategy, built on the BERT encoding of the tokens and their initials). In this example, there are two candidate acronyms detected at the low level, while the presented 0/1 tags at the high level are specific to the first acronym CMP, i.e., a snapshot of the iteration state when k = 1 is shown; k = 2 corresponds to the second acronym FICC.

The detailed operations of the Long-term Tagger on each token are as follows:

p_i^{start_l} = σ(W_start · (h_i + v_k^{short} + E_k^{initial}) + b_start)    (5)

p_i^{end_l} = σ(W_end · (h_i + v_k^{short} + E_k^{initial}) + b_end)    (6)

where p_i^{start_l} and p_i^{end_l} represent the probability of identifying the i-th token in the input sequence as the start and end position of a long-term, respectively, v_k^{short} represents the encoded representation vector of the k-th acronym detected in the low-level module, and E_k^{initial} represents the embedding of its initials (i.e., C, M and P). For each acronym, we iteratively apply the same decoding process. Meanwhile, for the Neighborhood Search Strategy, we set the search length to K, where K is a hyperparameter set to the longest distance between an acronym and its long-form observed in the training set.
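A minimal PyTorch sketch of Eqs. (5)–(6) together with the Neighborhood Search Strategy is given below; the layer names, the masking scheme used to restrict decoding to the neighborhood, and the 0.5 threshold are our assumptions.

```python
# Minimal PyTorch sketch of the high-level Long-term Tagger (Eqs. 5-6) with the
# Neighborhood Search Strategy. The acronym vector v_k and the initial embedding
# E_k are added to every token representation before the start/end classifiers;
# a neighborhood mask limits decoding to tokens within K positions of the acronym.
import torch
import torch.nn as nn


class LongTermTagger(nn.Module):
    def __init__(self, hidden_size: int, search_length: int, threshold: float = 0.5):
        super().__init__()
        self.start_classifier = nn.Linear(hidden_size, 1)
        self.end_classifier = nn.Linear(hidden_size, 1)
        self.K = search_length     # longest acronym/long-form distance in the training set
        self.threshold = threshold

    def forward(self, H, v_k, e_k, acronym_start: int, acronym_end: int):
        # H: [seq_len, hidden]; v_k: acronym span vector; e_k: initial embedding.
        fused = H + v_k.unsqueeze(0) + e_k.unsqueeze(0)
        p_start = torch.sigmoid(self.start_classifier(fused)).squeeze(-1)  # Eq. (5)
        p_end = torch.sigmoid(self.end_classifier(fused)).squeeze(-1)      # Eq. (6)

        # Neighborhood Search Strategy: keep only positions within K tokens of the acronym.
        positions = torch.arange(H.size(0))
        neighborhood = (positions >= acronym_start - self.K) & (positions <= acronym_end + self.K)
        p_start = p_start * neighborhood
        p_end = p_end * neighborhood
        return (p_start > self.threshold).long(), (p_end > self.threshold).long()
```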
4. Experiments

4.1. Baseline models

• Rule-based method The rule-based baseline adopts manual rules for this task [6]. Words with more than 60% of their characters upper-cased are selected as acronyms. The long-form is chosen once the initial characters of the words preceding an acronym match it. The code is available online at https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE (a simplified sketch of this heuristic is given after this list).
• BiLSTM-CRF model The bidirectional LSTM [7] is an extension of the LSTM that adopts a forward and a backward LSTM network for sequence processing, where the links of the network are used as the output layer. The BiLSTM structure gathers contextual information from both directions simultaneously. Besides, the LSTM has the advantage of avoiding gradient vanishing compared with the RNN. The output hidden states of the forward LSTM H_f and the backward LSTM H_b are concatenated as the final output [H_f, H_b]. This feature is trained with the cross-entropy loss against the target token-level labels.
• BERT-CRF model The BERT-CRF [8] is implemented as a token-level neural network with a conditional random field (CRF) layer on top, where the backbone of this baseline is Mbert [12]. Mbert is a multilingual masked language model (MLM) trained on multiple corpora. The backbone has variants such as base and large, which are chosen as our baselines. The backbone encodes the input tokens, and the final classification scores are obtained in the CRF layer, where the tag set is used as the transition matrix. The matrix contains two states, the beginning (B) and the end (E). This baseline is trained on the first sub-token via the cross-entropy loss.
• Roberta-CRF model The Roberta-CRF [9] has the same architecture as the BERT-CRF; the difference is that the Roberta model removes the next sentence prediction (NSP) task and uses dynamic masking for text encoding. The Roberta model uses Byte-Pair Encoding (BPE) to mix character-level and word-level representations and supports the vocabularies of many common natural language corpora. We adopt different variants of Roberta as our baselines, including the base and the large versions.
cluding the base and the large version. Val/Test set than Roberta-CRF. In addition, in Danish
dataset, IRF built upon π
πππππ‘ππ ππππ, is +7.65 / +7.10 F1
4.2. Datasets better on Val/Test set than Roberta-CRF. They obtain
new state-of-the-art(SOTA) results, we held the first
We evaluated our method on two acronym extraction position on the CodaLab scoreboard under the alias
datasetsοΏ½ mainly including Spanish dataset and Danish WENGSYX2 .
dataset. Specifically, the Spanish dataset has 7410 sam-
ples, and the Danish dataset has 3853 samples [10].
4.5. Analysis
4.3. Implementation Detail Considering the correlation between acronyms and the
initials of long-term, our IRF establishes the relationship
We used cased BERT-base, or RoBERTa-large as the en- between acronyms and long-term, which improves the
coder on Spanish and Danish dataset. All models are accuracy of extracting long and the overall performance
implemented based on the open-source transformers li- of the model. In order to further explore the effectiveness
brary of huggingface [11]. we initialize the model with of our method, we analyze the accuracy of identifying
mbert [12]. We use mixed-precision training [13] based long-term in the acronym extraction task. As show in
on the Apex library. Our model is optimized with AdamW Table 5, compared with baseline, our IRF can significantly
[14] using learning rates β [2πβ5, 3πβ5, 5πβ5, 1πβ4], with improve the accuracy of extracting long-term. Specifi-
a linear warmup [15] for the first 6% steps followed by cally, on the F1 score, we have a maximum performance
a linear decay to 0. We report the mean and standard improvement of 5%. The significant increase of the recog-
deviation of F1 on the development set by conducting 5 nition accuracy of the model in long term will help to
runs of training using different random seeds. We utilize improve the overall performance of the model.
the In-trust loss [5] function to optimize the model.
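A minimal sketch of the optimization setup described above, using the HuggingFace transformers scheduler helper, is shown below; the total number of steps, the weight decay, and the chosen learning rate are placeholder assumptions.

```python
# Minimal sketch of the optimization setup: AdamW with a linear warmup over the
# first 6% of steps followed by linear decay to 0. Total steps, weight decay and
# the selected learning rate are placeholders, not values reported in the paper.
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("bert-base-multilingual-cased")

learning_rate = 3e-5          # one of [2e-5, 3e-5, 5e-5, 1e-4]
total_steps = 10_000          # placeholder: epochs * steps_per_epoch
warmup_steps = int(0.06 * total_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# Inside the training loop, after loss.backward():
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```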
4.4. Results

On the Spanish and Danish datasets, we compare IRF with the baselines, including Rule-based, BiLSTM-CRF, BERT-CRF and Roberta-CRF. The results in Table 3 and Table 4 show that IRF performs better than these methods. Specifically, on the Spanish dataset, our best model, IRF built upon Roberta-large, is +5.66 / +5.90 F1 better on the Val/Test set than Roberta-CRF. In addition, on the Danish dataset, IRF built upon Roberta-large is +7.65 / +7.10 F1 better on the Val/Test set than Roberta-CRF. These are new state-of-the-art (SOTA) results; we held the first position on the CodaLab scoreboard under the alias WENGSYX (https://competitions.codalab.org/competitions/34925#results).

Table 3
F1 Performance on the Spanish dataset.
Method                     Val F1   Test F1
Rule-based                 0.5667   0.5596
BiLSTM-CRF                 0.7717   0.7623
BERT-CRF                   0.8397   0.8211
Roberta-CRF                0.8667   0.8531
IRF-BERT_base (ours)       0.8742   0.8537
IRF-BERT_large (ours)      0.9035   0.8911
IRF-Roberta_large (ours)   0.9233   0.9121

Table 4
F1 Performance on the Danish dataset.
Method                     Val F1   Test F1
Rule-based                 0.7021   0.6842
BiLSTM-CRF                 0.7671   0.7587
BERT-CRF                   0.8673   0.8554
Roberta-CRF                0.8979   0.8931
IRF-BERT_base (ours)       0.9133   0.9032
IRF-BERT_large (ours)      0.9532   0.9413
IRF-Roberta_large (ours)   0.9744   0.9641

4.5. Analysis

Considering the correlation between acronyms and the initials of long-terms, our IRF establishes the relationship between acronyms and long-terms, which improves the accuracy of extracting long-terms and the overall performance of the model. In order to further explore the effectiveness of our method, we analyze the accuracy of identifying long-terms in the acronym extraction task. As shown in Table 5, compared with the baselines, our IRF can significantly improve the accuracy of extracting long-terms. Specifically, on the F1 score, we obtain a maximum performance improvement of 5 points. The significant increase in the recognition accuracy of long-terms helps to improve the overall performance of the model.

Table 5
F1 score (%) on extracting long-terms.
Model                      Val F1           Test F1
BERT-CRF                   80.42            79.11
IRF-BERT_base (ours)       85.31 (+4.89)    84.23 (+5.12)
Roberta-CRF                83.44            82.19
IRF-Roberta_large (ours)   90.13 (+6.69)    89.07 (+6.88)
5. Conclusion

In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, IRF utilizes the Acronyms Tagger to first recognize the spans of acronyms. Then, combining the initial information, IRF utilizes the Long-term Tagger to recognize the long-terms. IRF captures the relationship between acronyms and long-terms in the dataset. Meanwhile, by utilizing the character information in acronyms, IRF improves the accuracy of long-term recognition. We conduct experiments on two acronym extraction datasets. Experimental results demonstrate that our IRF model achieves state-of-the-art performance compared with the baselines.

References

[1] A. P. B. Veyseh, F. Dernoncourt, T. H. Nguyen, W. Chang, L. A. Celi, Acronym identification and disambiguation shared tasks for scientific document understanding, arXiv preprint arXiv:2012.11760 (2020).
[2] A. P. B. Veyseh, N. Meister, et al., Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[3] L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, H. Lin, J. Wang, An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition, Bioinformatics 34 (2018) 1381-1388.
[4] H. Zhao, L. Huang, R. Zhang, Q. Lu, H. Xue, SpanMlt: A span-based multi-task learning framework for pair-wise aspect and opinion terms extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3239-3248.
[5] X. Huang, Y. Chen, S. Wu, J. Zhao, Y. Xie, W. Sun, Named entity recognition via noise aware training mechanism with data filter, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 4791-4803. URL: https://aclanthology.org/2021.findings-acl.423. doi:10.18653/v1/2021.findings-acl.423.
[6] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: Biocomputing 2003, World Scientific, 2002, pp. 451-462.
[7] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] A. P. B. Veyseh, N. Meister, et al., MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, arXiv, 2022.
[11] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace's Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[12] J. Libovický, R. Rosa, A. Fraser, How language-neutral is multilingual BERT?, arXiv preprint arXiv:1911.03310 (2019).
[13] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed precision training, in: International Conference on Learning Representations, 2018.
[14] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2018.
[15] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch SGD: Training ImageNet in 1 hour (2018).