A Novel Initial Reminder Framework for Acronym Extraction

Xiusheng Huang1,2, Bin Li3, Fei Xia1,2 and Yixuan Weng1,2
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100190, China
3 College of Electrical and Information Engineering, Hunan University

SDU@AAAI-22: Workshop on Scientific Document Understanding, co-located with AAAI 2022, Vancouver, Canada.
huangxiusheng2020@ia.ac.cn (X. Huang); libincn@hnu.edu.cn (B. Li); xiafei2020@ia.ac.cn (F. Xia); wengsyx@gmail.com (Y. Weng)

Abstract
Acronym extraction aims to extract acronyms (i.e., short-forms) and their meanings (i.e., long-forms) from a source document; it is one of the key and challenging tasks in scientific document understanding (SDU@AAAI-22). Previous work treated it as a named entity recognition task, ignoring the relationship between acronyms and their meanings, especially the importance of initials. In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, the IRF first recognizes the span of the acronym and then, combined with the initial information, recognizes its meaning. At the same time, considering that acronyms usually appear close to their meanings, the IRF adopts a Neighborhood Search Strategy. Experiments on two acronym extraction datasets show that IRF outperforms previous methods by 5.90/7.10 F1. Further analysis reveals that IRF is effective in extracting both short-forms and long-forms.

Keywords
Acronym extraction, Initials, Initial Reminder Framework, Neighborhood Search Strategy

1. Introduction

Acronym extraction is the task of identifying acronyms and their meanings, which is very important for scientific document understanding (SDU@AAAI-22) [1, 2]. Previous methods mostly regard this task as a sequence annotation task [3, 4, 5], in which the model recognizes the acronyms and long-forms.

Figure 1: An example from the Spanish dataset: "XLII (I) Consejo Mundial de la Paz (CMP) Federación Internacional de Comercio de Cacao (FICC) Federación Internacional de la Industria del Medicamento (FIIM)." Green text represents acronyms, orange text represents long-forms, and red text represents initials; the red, blue and black lines indicate the correspondence between initials and acronyms. (Dataset: Spanish)

The context of an acronym often has fairly obvious characteristics: for example, there are brackets around the acronym, or the acronym itself has a specific format, which leads to higher accuracy in identifying acronyms. However, the accuracy of identifying long-forms is relatively low, with problems such as inaccurate identification or missed identification.

As shown in Figure 1, given a document, we need to identify the acronyms and the long-forms. The context of acronyms often carries characteristic cues (e.g., brackets), which help the model identify them. Long-form recognition is more challenging: it requires a certain understanding of the document content. A better solution is to know what the corresponding acronym is before extracting the long-form, which helps the model recognize the long-form.

From Figure 1, we can also see that each character of an acronym corresponds to the initial of a word in the long-form, which helps the model identify the long-form.

In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Through experiments, we find that the model achieves higher accuracy in acronym recognition than in long-form recognition: specifically, on Spanish, the model reaches 91% F1 when identifying acronyms, while the F1 score for long-forms is only 83%. Considering the correlation between acronyms and long-forms, IRF first completes the task of identifying acronyms.
On this basis, combined with the initial information contained in acronyms, IRF further identifies the long-forms. We verify the effectiveness of our method on two acronym extraction datasets, Spanish and Danish. We summarize our contributions as follows:

• We introduce a fresh perspective to revisit the acronym extraction task with a principled problem formulation, which implies a general algorithmic framework that helps identify long-forms by their initials.
• We propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, IRF makes use of the high accuracy of acronym recognition and helps the model recognize long-forms by integrating the initial information.
• We conduct experiments on two acronym extraction datasets. Experimental results demonstrate that our IRF model achieves state-of-the-art performance compared with the baselines.

2. Task introduction

2.1. Problem definition

We regard the acronym extraction task as a sequence annotation task. Different from previous methods, considering the high accuracy of acronym recognition, we first recognize the acronym and then use the character information of the acronym to recognize the long-form. Given a document D = {x_1, x_2, ..., x_n}, the initials of the words in the document are I = {y_1, y_2, ..., y_n}. Using our IRF model, we obtain each acronym and long-form:

A, L = IRF(x_1, x_2, ..., x_n; y_1, y_2, ..., y_n)   (1)

where A refers to the acronyms and L refers to the long-forms.

2.2. Evaluation metric

The online results are evaluated with macro-averaged precision, recall, and F1 scores. The final score measures the prediction correctness of short-form (i.e., acronym) and long-form (i.e., phrase) boundaries in the given sentence. A short-form or long-form prediction is correct only if both the beginning and the end of the predicted position equal the labeled boundaries. The official score is the macro average of the short-form and long-form F1 scores.

2.3. Dataset introduction

This task contains several multilingual datasets composed of document sentences from scientific fields. The statistics of the Spanish and Danish datasets are shown in Table 1 and Table 2. The Spanish dataset is divided into training (5928), development (741), and test (741) sets. As shown in Table 2, the Danish dataset is divided into training (3082), development (385), and test (386) sets. Both datasets have been manually labeled, where each label is a list of position boundaries.

Table 1
Statistical Information of the Spanish Dataset.
Data             Sample Number   Ratio
Training Set     5928            80.00%
Development Set  741             10.00%
Test Set         741             10.00%
Total            7410            100%

Table 2
Statistical Information of the Danish Dataset.
Data             Sample Number   Ratio
Training Set     3082            80.00%
Development Set  385             9.99%
Test Set         386             10.02%
Total            3853            100%

3. Methodology

In this section, we introduce our proposed IRF model. IRF utilizes the correspondence between the characters of an acronym and the initials of its long-form, which effectively helps the model improve the accuracy of long-form recognition.

3.1. Encoder

Given a document D = {x_1, x_2, ..., x_n} and the initials of its words I = {y_1, y_2, ..., y_n}, we leverage a pre-trained language model as the encoder to obtain the embeddings:

H = BERT_Encode(x_1, x_2, ..., x_n; y_1, y_2, ..., y_n)   (2)

where H = [h_1, h_2, ..., h_n] contains the embedding of each token, and I is mapped to the embedding of each initial.
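To make the encoding step of Eq. (2) concrete, the following is a minimal sketch assuming a multilingual BERT backbone from the HuggingFace transformers library (the paper initializes from mBERT [12]). The function name and the first-sub-token pooling are illustrative assumptions, not the authors' released code; the initial embeddings used later by the Long-term Tagger can be built analogously from the first character of each word.

```python
# Minimal sketch of the encoder step (Eq. 2), assuming a multilingual BERT
# backbone from the HuggingFace transformers library. Names and pooling
# choices are illustrative, not taken from the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode_document(words):
    """Return one contextual embedding per word (first sub-token of each word)."""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state.squeeze(0)  # (num_subtokens, dim)
    word_ids = enc.word_ids(0)              # map sub-tokens back to word indices
    first_positions, seen = [], set()
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            first_positions.append(pos)
    return hidden[first_positions]          # H = [h_1, ..., h_n], one row per word

words = "XLII ( I ) Consejo Mundial de la Paz ( CMP )".split()
H = encode_document(words)
print(H.shape)   # -> (len(words), hidden_size)
```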
3.2. Acronyms Tagger

The low-level tagging module is designed to recognize all possible acronyms in the input sentence by directly decoding the encoded vector H produced by the N-layer BERT encoder. More precisely, it adopts two identical binary classifiers to detect the start and end positions of acronyms, assigning each token a binary tag (0/1) that indicates whether the current token corresponds to a start or end position of an acronym. The detailed operations of the Acronyms Tagger on each token are as follows:

p_i^{start_a} = σ(W_{start} h_i + b_{start})   (3)

p_i^{end_a} = σ(W_{end} h_i + b_{end})   (4)

where p_i^{start_a} and p_i^{end_a} represent the probability of identifying the i-th token in the input sequence as the start and end position of an acronym, respectively. The corresponding token is assigned the tag 1 if the probability exceeds a certain threshold, and the tag 0 otherwise. h_i is the encoded representation of the i-th token in the input sequence, i.e., h_i = H[i]; W_{start} and W_{end} are trainable weights, b_{start} and b_{end} are biases, and σ is the sigmoid activation function.
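To illustrate Eqs. (3)-(4), the sketch below implements the two start/end sigmoid classifiers over the token embeddings together with a simple span-decoding heuristic. The 0.5 threshold and the nearest-following-end pairing rule are assumptions made for readability; the paper only states that a threshold is applied.

```python
# Sketch of the acronym tagger (Eqs. 3-4): two independent sigmoid classifiers
# over each token embedding h_i predict start/end positions of acronyms.
# The 0.5 threshold and the span-pairing rule are assumptions for illustration.
import torch
from torch import nn

class AcronymTagger(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.start_linear = nn.Linear(hidden_size, 1)   # W_start, b_start
        self.end_linear = nn.Linear(hidden_size, 1)     # W_end, b_end

    def forward(self, H: torch.Tensor):
        # H: (seq_len, hidden_size) -> per-token start/end probabilities
        p_start = torch.sigmoid(self.start_linear(H)).squeeze(-1)
        p_end = torch.sigmoid(self.end_linear(H)).squeeze(-1)
        return p_start, p_end

def decode_spans(p_start, p_end, threshold=0.5):
    """Pair each predicted start with the nearest following predicted end."""
    starts = (p_start > threshold).nonzero(as_tuple=True)[0].tolist()
    ends = (p_end > threshold).nonzero(as_tuple=True)[0].tolist()
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans
```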
3.3. Long-term Tagger

Considering that each acronym and its meaning are usually located close together, we utilize the Neighborhood Search Strategy to select the context near the detected acronym, so as to extract the correct long-form.

The high-level tagging module identifies the long-form with respect to the acronyms obtained at the lower level. As shown in Figure 2, for the acronym CMP, we search for its corresponding long-form within a limited context. Different from the Acronyms Tagger, which directly decodes the encoded vector H, the Long-term Tagger takes the acronym features and the initial features into account as well. The detailed operations of the Long-term Tagger on each token are as follows:

p_i^{start_l} = σ(W_{start}(h_i + V^k_{short} + E^k_{initial}) + b_{start})   (5)

p_i^{end_l} = σ(W_{end}(h_i + V^k_{short} + E^k_{initial}) + b_{end})   (6)

where p_i^{start_l} and p_i^{end_l} represent the probability of identifying the i-th token in the input sequence as the start and end position of a long-form, respectively; V^k_{short} represents the encoded representation vector of the k-th acronym detected by the low-level module, and E^k_{initial} represents the embedding of its initials (e.g., C, M and P). For each acronym, we iteratively apply the same decoding process. Meanwhile, for the Neighborhood Search Strategy, we set the search length to γ, where γ is a hyperparameter equal to the longest distance between an acronym and its long-form observed in the training set.

Figure 2: An overview of the proposed IRF framework. The BERT encoder feeds the Acronyms Tagger, whose detected spans, together with the initial embeddings, condition the Long-term Tagger (h_i + V^k_short + E^k_initial) under the Neighborhood Search Strategy. In this example, two candidate acronyms are detected at the low level, while the 0/1 tags shown at the high level are specific to the first acronym CMP, i.e., a snapshot of the iteration state when k = 1; k = 2 corresponds to the second acronym FICC.
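A minimal sketch of how Eqs. (5)-(6) and the Neighborhood Search Strategy could be realized is given below. The paper does not spell out how V^k_short and E^k_initial are pooled, so mean pooling over the acronym span and over the initial-character embeddings, as well as zeroing predictions outside the γ-window, are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the long-form tagger (Eqs. 5-6) with the Neighborhood Search
# Strategy. Pooling choices and the character vocabulary size are assumptions.
import torch
from torch import nn

class LongFormTagger(nn.Module):
    def __init__(self, hidden_size: int, num_chars: int = 512):
        super().__init__()
        self.start_linear = nn.Linear(hidden_size, 1)
        self.end_linear = nn.Linear(hidden_size, 1)
        self.initial_emb = nn.Embedding(num_chars, hidden_size)  # E_initial lookup

    def forward(self, H, acronym_span, initial_ids, gamma):
        # H: (seq_len, d); acronym_span: (start, end) from the Acronyms Tagger;
        # initial_ids: LongTensor of character ids for the acronym's letters.
        s, e = acronym_span
        v_short = H[s:e + 1].mean(dim=0)                        # V_short^k
        e_initial = self.initial_emb(initial_ids).mean(dim=0)   # E_initial^k
        fused = H + v_short + e_initial                         # h_i + V_short^k + E_initial^k
        p_start = torch.sigmoid(self.start_linear(fused)).squeeze(-1)
        p_end = torch.sigmoid(self.end_linear(fused)).squeeze(-1)

        # Neighborhood Search Strategy: only tokens within gamma positions of
        # the acronym span may be tagged as part of the long-form.
        mask = torch.zeros_like(p_start)
        lo, hi = max(0, s - gamma), min(H.size(0), e + 1 + gamma)
        mask[lo:hi] = 1.0
        return p_start * mask, p_end * mask
```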
4. Experiments

4.1. Baseline models

• Rule-based method. The rule-based baseline adopts manual rules for this task [6]. Words with more than 60% of their characters upper-cased are selected as acronyms. The long-forms are chosen when the initial characters of the words preceding an acronym correspond to it. The full code is available online¹.

• BiLSTM-CRF model. The bidirectional LSTM [7] is an extension of the LSTM that adopts a forward and a backward LSTM network for sequence processing, with a CRF used as the output layer (Huang et al., 2015). The BiLSTM structure gathers contextual information from both directions and mitigates the gradient vanishing of the plain RNN. The output hidden states of the forward LSTM H_f and the backward LSTM H_b are concatenated as the final output [H_f, H_b]. This baseline is trained with the cross-entropy loss against the token-level labels.

• BERT-CRF model. The BERT-CRF [8] is a token-level neural network with a conditional random field (CRF) layer on top, where the backbone is mBERT, a multilingual masked language model (MLM) trained on multiple corpora. The backbone has variants such as base and large, both of which are used as baselines. The backbone encodes the input tokens, and the final classification scores are obtained in the CRF layer, where the tag set is used as the transition matrix; the matrix contains two states, the beginning (B) and the end (E). This baseline is trained on the first sub-token of each word with the cross-entropy loss.

• Roberta-CRF model. The Roberta-CRF [9] has the same architecture as the BERT-CRF; the difference is that the RoBERTa model removes the next sentence prediction (NSP) task and uses dynamic masking for text encoding. RoBERTa uses Byte-Pair Encoding (BPE) to mix character-level and word-level representations and supports the vocabularies of many common natural language corpora. We adopt different variants of RoBERTa as baselines, including the base and the large versions.

¹ https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE

4.2. Datasets

We evaluate our method on two acronym extraction datasets, the Spanish dataset and the Danish dataset. Specifically, the Spanish dataset has 7410 samples and the Danish dataset has 3853 samples [10].

4.3. Implementation Detail

We use cased BERT-base or RoBERTa-large as the encoder on the Spanish and Danish datasets. All models are implemented with the open-source transformers library from HuggingFace [11], and we initialize the model with mBERT [12]. We use mixed-precision training [13] based on the Apex library. Our model is optimized with AdamW [14] using learning rates in {2e-5, 3e-5, 5e-5, 1e-4}, with a linear warmup [15] over the first 6% of steps followed by a linear decay to 0. We report the mean and standard deviation of F1 on the development set over 5 training runs with different random seeds. We use the In-trust loss [5] to optimize the model.
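To make this training setup concrete, below is a minimal sketch of the optimizer and warmup schedule described above, assuming PyTorch's AdamW and the scheduler from the transformers library. The weight-decay value is an assumption, the learning rate is one value from the searched range, and the In-trust loss [5] is not reproduced here.

```python
# Minimal sketch of the optimization setup in Section 4.3: AdamW with a linear
# warmup over the first 6% of training steps, then linear decay to 0.
# weight_decay is an assumption; lr is one value from the searched range.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, num_training_steps, lr=2e-5, warmup_ratio=0.06):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

# Usage: step the scheduler once per optimizer step during training, e.g.
#   optimizer, scheduler = build_optimizer(model, total_steps)
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```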
4.4. Results

On the Spanish and Danish datasets, we compare IRF with the baselines, including Rule-based, BiLSTM-CRF, BERT-CRF and Roberta-CRF. The results in Table 3 and Table 4 show that IRF performs better than these methods. Specifically, on the Spanish dataset, our best model, IRF built upon RoBERTa-large, is +5.66 / +5.90 F1 better on the Val/Test set than Roberta-CRF. In addition, on the Danish dataset, IRF built upon RoBERTa-large is +7.65 / +7.10 F1 better on the Val/Test set than Roberta-CRF. These are new state-of-the-art (SOTA) results, and we held the first position on the CodaLab scoreboard under the alias WENGSYX².

² https://competitions.codalab.org/competitions/34925

Table 3
F1 Performance on the Spanish dataset
Method                     Val F1    Test F1
Rule-based                 0.5667    0.5596
BiLSTM-CRF                 0.7717    0.7623
BERT-CRF                   0.8397    0.8211
Roberta-CRF                0.8667    0.8531
IRF-BERT_base (ours)       0.8742    0.8537
IRF-BERT_large (ours)      0.9035    0.8911
IRF-Roberta_large (ours)   0.9233    0.9121

Table 4
F1 Performance on the Danish dataset
Method                     Val F1    Test F1
Rule-based                 0.7021    0.6842
BiLSTM-CRF                 0.7671    0.7587
BERT-CRF                   0.8673    0.8554
Roberta-CRF                0.8979    0.8931
IRF-BERT_base (ours)       0.9133    0.9032
IRF-BERT_large (ours)      0.9532    0.9413
IRF-Roberta_large (ours)   0.9744    0.9641

4.5. Analysis

Considering the correlation between acronyms and the initials of their long-forms, our IRF establishes the relationship between acronyms and long-forms, which improves the accuracy of extracting long-forms and the overall performance of the model. To further explore the effectiveness of our method, we analyze the accuracy of identifying long-forms in the acronym extraction task. As shown in Table 5, compared with the baselines, IRF significantly improves the accuracy of extracting long-forms: on the F1 score, we obtain a maximum improvement of about 5 points on the test set. This significant increase in long-form recognition helps improve the overall performance of the model.

Table 5
F1 score (%) on extracting long-forms.
Model                      Val F1           Test F1
BERT-CRF                   80.42            79.11
IRF-BERT_base (ours)       85.31 (+4.89)    84.23 (+5.12)
Roberta-CRF                83.44            82.19
IRF-Roberta_large (ours)   90.13 (+6.69)    89.07 (+6.88)

5. Conclusion

In this paper, we propose a novel Initial Reminder Framework (IRF) for the acronym extraction task. Specifically, IRF utilizes the Acronyms Tagger to first recognize the span of the acronym; then, combining the initial information, IRF utilizes the Long-term Tagger to recognize the long-form. IRF captures the relationship between acronyms and long-forms in the dataset, and by utilizing the character information in acronyms it improves the accuracy of long-form recognition. We conduct experiments on two acronym extraction datasets. Experimental results demonstrate that our IRF model achieves state-of-the-art performance compared with the baselines.

References

[1] A. P. B. Veyseh, F. Dernoncourt, T. H. Nguyen, W. Chang, L. A. Celi, Acronym identification and disambiguation shared tasks for scientific document understanding, arXiv preprint arXiv:2012.11760 (2020).
[2] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[3] L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, H. Lin, J. Wang, An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition, Bioinformatics 34 (2018) 1381–1388.
[4] H. Zhao, L. Huang, R. Zhang, Q. Lu, H. Xue, SpanMlt: A span-based multi-task learning framework for pair-wise aspect and opinion terms extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3239–3248.
[5] X. Huang, Y. Chen, S. Wu, J. Zhao, Y. Xie, W. Sun, Named entity recognition via noise aware training mechanism with data filter, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 4791–4803. URL: https://aclanthology.org/2021.findings-acl.423. doi:10.18653/v1/2021.findings-acl.423.
[6] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: Biocomputing 2003, World Scientific, 2002, pp. 451–462.
[7] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, arXiv preprint, 2022.
[11] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace's Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[12] J. Libovický, R. Rosa, A. Fraser, How language-neutral is multilingual BERT?, arXiv preprint arXiv:1911.03310 (2019).
[13] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed precision training, in: International Conference on Learning Representations, 2018.
[14] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2018.
[15] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch SGD: Training ImageNet in 1 hour (2018).