<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Siheng Li</string-name>
          <email>lisiheng21@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tian Liang</string-name>
          <email>liangt21@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinyu Zhu</string-name>
          <email>zhuxy21@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengze Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujiu Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Acronym Extraction, Natural Language Processing, BERT</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tsinghua Shenzhen International Graduate School, Tsinghua University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Acronym extraction plays an important role in scientific document understanding. Recently, the AAAI-22 Workshop on Scientific Document Understanding released multiple high-quality datasets and attracted widespread attention. In this work, we present our hybrid strategies with adversarial training for this task. Specifically, we first apply pre-trained models to obtain contextualized text encodings. On the one hand, we employ a sequence labeling strategy with a BiLSTM and a CRF to tag each word in a sentence. On the other hand, we use a span selection strategy that directly predicts the acronym and long-form spans. In addition, we adopt adversarial training to further improve the robustness and generalization ability of our models. Experimental results show that both methods outperform strong baselines and rank high on SDU@AAAI-22 - Shared Task 1: Acronym Extraction: our scores rank 2nd on 4 test sets and 3rd on 3 test sets. Moreover, the ablation study further verifies the effectiveness of each component. Our code is available at https://github.com/carlyoung1999/AAAI-SDU-Task1.</p>
      </abstract>
      <kwd-group>
        <kwd>Acronym Extraction</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Input:</title>
      <p>Existing methods for learning with noisy labels (LNL)
primarily take a loss correction approach.</p>
    </sec>
    <sec id="sec-2">
      <title>Output:</title>
      <p>Acronym: LNL</p>
    </sec>
    <sec id="sec-3">
      <title>Long-form: learning with noisy labels</title>
      <sec id="sec-3-1">
        <title>1. Introduction</title>
        <p>An acronym consists of the initial letters of the corresponding terminology and is widely used in scientific documents for its convenience. However, this also makes scientific documents difficult to understand for both humans and machines. In natural language processing, accurate acronym extraction is beneficial for downstream applications like question answering [1], definition extraction [2] and relation extraction [3, 4]. Recently, the second Workshop on Scientific Document Understanding at AAAI-22 released multiple high-quality acronym extraction datasets and attracted widespread attention.</p>
        <p>In this work, we present hybrid strategies with adversarial training for this task. We first apply pre-trained models to obtain contextualized text encodings. For the sequence labeling strategy, we employ a BiLSTM to further capture feature interactions between adjacent words and a CRF to model the dependency between sequence labels. For the span selection strategy, we use binary taggers to predict the start and end indices of acronyms and long-forms. To further improve our models' robustness and generalization ability, we employ adversarial training, which dynamically adds noise to avoid overfitting. The two strategies achieve comparable performance, and we choose the better one for evaluation according to their performance on the development set. Our contributions are as follows:</p>
        <p>• We propose two strategies for acronym extraction: sequence labeling and span selection.</p>
        <p>• Our adversarial training further improves the robustness and generalization ability of our models.</p>
        <p>• Experiments show that our models outperform strong baselines and rank high on the shared task test sets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2. Related</title>
      </sec>
      <sec id="sec-3-3">
        <title>Works</title>
        <p>In this section, we introduce the related studies for
acronym extraction, including Rule-based, LSTM-based,
and Pre-trained-based methods.</p>
        <sec id="sec-3-3-1">
          <title>2.1. Rule-based</title>
          <p>Traditional acronym extraction methods mainly focus on rule-based approaches. Specifically, most of them [13] utilize generic rules or text patterns to discover acronym expansions in the field of biomedicine. Torres-Schumann and Schulz [14] further extend rule sets to hidden Markov models and improve both recall and precision. Recently, a new work [15] has given a comprehensive introduction to rule-based machine identification methods; they classify present rule-based methods (a machine algorithm and a crowd-sourcing approach) and compare them in detail. However, due to the conservative nature of rule-based models, these methods require complicated manual formulations and lack flexibility.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>2.2. LSTM-based</title>
          <p>Taking advantage of the power of LSTM [16] for text modeling, LSTM-based methods have achieved decent performance in acronym extraction. They mainly focus on better semantic representations and attention mechanisms. DECBAE [17] extracts contextualized features with BioELMo [18] and provides these features to abbreviation-specific BiLSTMs, achieving good performance. In addition, they use a simple but effective heuristic method for automatically collecting datasets from a large corpus. Li et al. [19] propose a novel topic-attention model and compare the performance of different attention mechanisms embedded in LSTM and ELMo; their model is applied to the acronym task of medical terms. To further capture the dependency between sequence labels, Veyseh et al. [20] propose to combine LSTM with CRF for acronym identification and disambiguation.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>2.3. Pre-trained-based</title>
          <p>Language models pre-trained on a large corpus have shown promising performance on many downstream tasks. One of the most popular is Bidirectional Encoder Representations from Transformers (BERT) [21], which obtains rich semantic representations through the masked language modeling task in the pre-training stage. BERT has been applied to many NLP tasks like information extraction [22] and dialogue state tracking [23].</p>
          <p>In addition, it is worth mentioning that there have been many fine-grained improvements and domain-specific variants of BERT. RoBERTa [12] optimizes the training strategy with BPE (Byte-Pair Encoding) and dynamic masking to increase the shared vocabulary, thus providing more fine-grained representations and stronger robustness. SciBERT [24] has the same structure as BERT, but is pre-trained specifically to process scientific documents. Many works utilize the power of pre-trained models for acronym extraction. Pan et al. [25] propose a multi-task learning method based on BERT-CRF and BERT-Span, which makes full use of these two separate models by redefining the fusion loss function and achieves great performance. Li et al. [26] utilize SentencePiece byte-pair encoding to relabel sentences, which are then fed into XLNet [27] for processing.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Methodology</title>
        <p>similar with  .</p>
        <sec id="sec-3-4-1">
          <title>3.2. Overview</title>
          <p>Given a text X = {x_1, x_2, ..., x_n}, where each x_i is a word and n represents the text length, acronym extraction aims to find all acronyms and long-forms mentioned in this text. Formally, the model needs to automatically extract the acronym mention set 𝒜 = {[s_1, e_1), [s_2, e_2), ..., [s_m, e_m)}, where s_i and e_i denote the start and end positions of the i-th acronym, respectively. In addition, the model also needs to extract the long-form mention set ℬ = {[s_1, e_1), [s_2, e_2), ..., [s_k, e_k)}, defined similarly to 𝒜.</p>
          <p>We describe our hybrid strategies to extract acronyms and long-forms in this section. First, we use pre-trained models to tokenize and encode the original sentence. Then, we employ a BiLSTM-CRF head to model acronym extraction as a sequence labeling task and a BiLSTM-Span head to model it as a span selection task. In addition, to improve the robustness and generalization of our models, we apply adversarial training techniques.</p>
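          <p>To make the task format concrete, the following minimal illustration (ours, not from the shared task toolkit) shows the mention sets for the running example “DL stands for Deep Learning”, assuming 0-based character offsets and half-open [s, e) intervals.</p>
          <preformat>
# Hypothetical illustration of the task format: half-open character spans
# [s, e) for acronyms and long-forms, assuming 0-based offsets.
text = "DL stands for Deep Learning"

acronym_mentions = [(0, 2)]      # "DL"
long_form_mentions = [(14, 27)]  # "Deep Learning"

for start, end in acronym_mentions + long_form_mentions:
    print(repr(text[start:end]))
          </preformat>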
        </sec>
        <sec id="sec-3-4-2">
          <title>3.3. BERT Encoder</title>
          <p>We adopt BERT or RoBERTa as the text encoder to capture rich contextualized word embeddings. For brevity, we use BERT to refer to both BERT and RoBERTa in the following. Given the input X = {x_1, x_2, ..., x_n}, BERT captures a contextualized representation for each token with the help of deep multi-head attention layers. The encoding process is as follows:</p>
          <p>H = BERT([x_1, x_2, ..., x_n]) = [h_1, h_2, ..., h_n], (1)</p>
          <p>where H ∈ ℝ^{n×d} and d denotes the hidden dimension.</p>
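          <p>As a concrete illustration of Eq. (1), the following minimal sketch obtains contextualized token representations H from a pre-trained encoder through the HuggingFace Transformers library; roberta-base is used only as an example checkpoint, and the snippet is not part of the released codebase.</p>
          <preformat>
# Sketch: obtain contextualized token representations H (Eq. 1)
# from a pre-trained encoder via HuggingFace Transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

sentence = "DL stands for Deep Learning"
batch = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**batch)
H = outputs.last_hidden_state  # shape: (1, n_tokens, d)
print(H.shape)
          </preformat>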
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4. Sequence Labeling Strategy</title>
          <p>For this strategy, we first transform the character-level position labels provided by the raw datasets into token-level BIO labels as follows:</p>
          <p>• B-Acronym: Beginning of an acronym.</p>
          <p>• I-Acronym: Inside of an acronym.</p>
          <p>• B-Long: Beginning of a long-form.</p>
          <p>• I-Long: Inside of a long-form.</p>
          <p>• O: Outside of any acronym and long-form.</p>
          <p>To solve this sequence labeling problem, we adopt a BERT-BiLSTM-CRF method, and the architecture is shown in Figure 2. First, we utilize a BiLSTM network to further capture feature interactions between adjacent words:</p>
          <p>H' = BiLSTM(H), (2)</p>
          <p>where H' ∈ ℝ^{n×2d}. Then, a linear classifier transforms H' into the logits of the 5 BIO labels defined above:</p>
          <p>P = [p_0, p_1, p_2, p_3, p_4] = H' W_P, (3)</p>
          <p>where W_P ∈ ℝ^{2d×5} and P = [p_0, p_1, p_2, p_3, p_4] ∈ ℝ^{n×5} are the logits.</p>
          <p>To model the dependency between sequence labels, we adopt a Linear Chain CRF (Conditional Random Field) [28]. The probability of a tagged sequence is</p>
          <p>p(y | X) = exp(Σ_{i=1}^{n} E(y_i | x_i) + Σ_{i=1}^{n} T(y_i | y_{i-1})) / Z(X), (4)</p>
          <p>where y = [y_1, y_2, ..., y_n] is the ground-truth label sequence and y_i is the label for the i-th token, E(·) represents the emission scorer, which refers to the logits P above, T(·) represents the transition scorer, and Z(X) is the normalization term. The loss function is the negative log likelihood of the ground-truth sequence:</p>
          <p>ℒ_SL = -log p(y | X). (5)</p>
          <p>[Figure 2: architecture of the sequence labeling strategy. The input “DL stands for Deep Learning” is encoded by BERT and a BiLSTM, passed through a linear layer and a CRF tagger, and tagged as B-Acronym, O, O, B-Long, I-Long; the extracted acronym is “DL” and the long-form is “Deep Learning”.]</p>
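          <p>A minimal PyTorch sketch of this head follows, assuming the BERT outputs H from Eq. (1) and the third-party pytorch-crf package for the CRF layer; the class name and default sizes are illustrative, not the released implementation.</p>
          <preformat>
# Sketch of the sequence labeling head (Eqs. 2-5): BiLSTM over BERT outputs,
# a linear emission layer for the 5 BIO tags, and a CRF.
# Assumes the third-party pytorch-crf package (torchcrf).
import torch.nn as nn
from torchcrf import CRF

NUM_TAGS = 5  # B-Acronym, I-Acronym, B-Long, I-Long, O

class SequenceLabelingHead(nn.Module):
    def __init__(self, hidden_dim=768, lstm_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(hidden_dim, lstm_dim, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_dim, NUM_TAGS)  # Eq. 3
        self.crf = CRF(NUM_TAGS, batch_first=True)         # Eq. 4

    def forward(self, H, tags=None, mask=None):
        H_prime, _ = self.bilstm(H)      # Eq. 2
        logits = self.emission(H_prime)  # emission scores P
        if tags is not None:
            # Training: negative log likelihood of the gold tag sequence (Eq. 5).
            return -self.crf(logits, tags, mask=mask, reduction="mean")
        # Inference: best-scoring tag sequence per sentence.
        return self.crf.decode(logits, mask=mask)
          </preformat>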
        </sec>
        <sec id="sec-3-4-4">
          <title>3.5. Span Selection Strategy</title>
          <p>We also formulate acronym extraction as an extractive span selection task, aiming to find the text spans of acronyms and long-forms directly. Similar to the sequence labeling strategy, we transform the character-level labels [s, e) provided by the raw datasets to token-level labels for the following token classification.</p>
          <p>We adopt the same BERT encoder and BiLSTM network as above to get contextualized word representations H' ∈ ℝ^{n×2d}. Then we construct four binary taggers:</p>
          <p>• S-Acronym Tagger predicts whether a token is the start of an acronym.</p>
          <p>• E-Acronym Tagger predicts whether a token is the end of an acronym.</p>
          <p>• S-Long Tagger predicts whether a token is the start of a long-form.</p>
          <p>• E-Long Tagger predicts whether a token is the end of a long-form.</p>
          <p>[Figure: architecture of the span selection strategy. The input “DL stands for Deep Learning” is encoded by BERT and a BiLSTM, and the four binary taggers (S-Acronym, E-Acronym, S-Long, E-Long) predict the start and end tokens of acronyms and long-forms.]</p>
          <p>We apply a simple linear layer to represent these taggers, which works as follows:</p>
          <p>Q = [q_0, q_1, q_2, q_3] = H' W_Q, (6)</p>
          <p>where W_Q ∈ ℝ^{2d×4} and Q = [q_0, q_1, q_2, q_3] ∈ ℝ^{n×4} are the logits for the 4 classes declared above. The loss function is binary cross entropy:</p>
          <p>ℒ_SS = -Σ_{i=1}^{n} Σ_{c} [y_{i,c} log σ(q_{i,c}) + (1 - y_{i,c}) log(1 - σ(q_{i,c}))], (7)</p>
          <p>where y_{i,c} is the label for the i-th token regarding class c, q_{i,c} is the logit for the i-th token regarding class c, and σ(·) denotes the sigmoid function.</p>
          <p>For the inference, we first predict the class label of
each token. Then, we match each S-Acronym token with
the nearest E-Acronym token to get an acronym. The
operation for long-form is the same.</p>
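          <p>The sketch below illustrates the binary taggers of Eqs. (6)-(7) and the nearest-end matching rule described above; it assumes the BiLSTM outputs H' and, for matching, that a span's end token is at or after its start token. Names and defaults are illustrative.</p>
          <preformat>
# Sketch of the span selection head (Eqs. 6-7) and the nearest-end
# matching rule used at inference time for acronyms.
import torch
import torch.nn as nn

class SpanSelectionHead(nn.Module):
    # Four taggers: S-Acronym, E-Acronym, S-Long, E-Long.
    def __init__(self, lstm_dim=256, num_taggers=4):
        super().__init__()
        self.taggers = nn.Linear(2 * lstm_dim, num_taggers)  # Eq. 6
        self.loss_fn = nn.BCEWithLogitsLoss()                # Eq. 7

    def forward(self, H_prime, labels=None):
        logits = self.taggers(H_prime)  # shape: (batch, n_tokens, 4)
        if labels is not None:
            return self.loss_fn(logits, labels.float())
        return torch.sigmoid(logits)

def match_acronym_spans(start_tokens, end_tokens):
    """Pair each predicted S-Acronym token with the nearest E-Acronym token."""
    spans = []
    for s in start_tokens:
        candidates = [e for e in end_tokens if e >= s]
        if candidates:
            spans.append((s, min(candidates) + 1))  # half-open token span
    return spans
          </preformat>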
        </sec>
        <sec id="sec-3-4-5">
          <title>3.6. Adversarial Training</title>
          <p>To enhance the robustness and generalization ability of our models, we adopt adversarial training. Specifically, given an input X, we incorporate a posterior differential regularization mechanism [29]:</p>
          <p>ℒ_adv = max_{‖δ‖ ≤ ε} Σ Div(f(X) || f(X + δ)), (8)</p>
          <p>where Div is an f-divergence (we use the Jensen-Shannon divergence in our experiments), δ is the noise, ε is the noise norm, and f represents the prediction function in our models, i.e., the CRF tagger or the binary taggers. This loss regularizes the posterior difference between original and noisy inputs to avoid overfitting. In practice, we use an inner loop to search for the most adversarial direction.</p>
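          <p>A simplified sketch of this regularizer is given below; it performs a single inner ascent step on an element-wise bounded embedding perturbation and uses the Jensen-Shannon divergence. The callable model_forward is hypothetical and stands in for the prediction function f of our models.</p>
          <preformat>
# Minimal sketch of the adversarial regularizer (Eq. 8): perturb the input
# embeddings with noise found by one inner ascent step, and penalize the
# Jensen-Shannon divergence between clean and perturbed predictions.
# model_forward is a hypothetical callable mapping embeddings to class log-probs.
import torch
import torch.nn.functional as F

def js_divergence(p_logp, q_logp):
    # Jensen-Shannon divergence between two log-probability tensors.
    m = 0.5 * (p_logp.exp() + q_logp.exp())
    return 0.5 * (F.kl_div(m.log(), p_logp.exp(), reduction="batchmean") +
                  F.kl_div(m.log(), q_logp.exp(), reduction="batchmean"))

def adversarial_loss(model_forward, embeddings, epsilon=1e-3, step_size=1e-3):
    clean_logp = model_forward(embeddings).detach()
    noise = torch.zeros_like(embeddings).uniform_(-epsilon, epsilon).requires_grad_(True)
    # Inner loop (a single step here) searching for the most adversarial direction.
    adv_logp = model_forward(embeddings + noise)
    div = js_divergence(clean_logp, adv_logp)
    grad = torch.autograd.grad(div, noise)[0]
    noise = (noise + step_size * grad.sign()).clamp(-epsilon, epsilon).detach()
    # Final regularization term with the fixed perturbation.
    return js_divergence(clean_logp, model_forward(embeddings + noise))
          </preformat>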
        </sec>
        <sec id="sec-3-4-6">
          <title>3.7. Objective Function</title>
          <p>We jointly train our models with adversarial training. For the sequence labeling strategy:</p>
          <p>ℒ = ℒ_SL + α ℒ_adv. (9)</p>
          <p>For the span selection strategy:</p>
          <p>ℒ = ℒ_SS + α ℒ_adv. (10)</p>
          <p>The weight α is used for controlling the significance of adversarial training.</p>
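          <p>In code, the joint objective is simply the task loss plus the weighted adversarial term; the α values mentioned below in Section 4 are 0.1 for sequence labeling and 1.0 for span selection. The helper is a trivial illustration, not part of the released code.</p>
          <preformat>
# Joint objective (Eqs. 9-10): task loss plus weighted adversarial regularizer.
# alpha is 0.1 for sequence labeling and 1.0 for span selection in our settings.
def total_loss(task_loss, adv_loss, alpha):
    return task_loss + alpha * adv_loss
          </preformat>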
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>4. Experiments</title>
        <sec id="sec-3-5-1">
          <title>4.1. Datasets</title>
          <p>Our experiments are conducted on the official dataset of SDU@AAAI-22 - Shared Task 1: Acronym Extraction. The organizers provide data of the scientific domain, including English, Persian, and Vietnamese, and of the legal domain, including English, French, Spanish, and Danish. Table 3 summarizes the statistics of the datasets used in our experiments.</p>
          <p>[Table 3: dataset statistics, reported as Training / Development / Test examples. English Scientific: 3980 / 497 / 498; Persian: 1336 / 167 / 168; Vietnamese: 1274 / 159 / 160; English Legal: 3564 / 445 / 446; French: 7783 / 973 / 973; Spanish: 5928 / 741 / 741; Danish: 3082 / 385 / 386.]</p>
        </sec>
        <sec id="sec-3-5-1b">
          <title>4.2. Baselines</title>
          <p>To investigate the effectiveness of our proposed approach, we compare it with the following three baselines:</p>
          <p>• Rule-based: This method utilizes a manually designed pattern to extract acronyms and is provided by SDU@AAAI-22 (https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE).</p>
          <p>• BERT-based: This method employs BERT [21] as a text encoder to get contextualized word representations, then employs a classification head to tag each word.</p>
          <p>• RoBERTa-based: This method is the same as the BERT-based baseline, but uses RoBERTa [12] as the text encoder.</p>
          <p>For the baselines, we select pre-trained models trained with the corresponding language corpora in HuggingFace Transformers [30]. As for ours, we adopt the best pre-trained models according to their performance in the development set. Specifically, we adopt roberta-base (https://huggingface.co/roberta-base) for English, roberta-fa-zwnj-base-ner (https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner) for Persian, bert-base-vi-cased (https://huggingface.co/Geotrend/bert-base-vi-cased) for Vietnamese, bert-base-fr-cased (https://huggingface.co/Geotrend/bert-base-fr-cased) for French, bert-base-es-cased (https://huggingface.co/Geotrend/bert-base-es-cased) for Spanish, and danish-bert-botxo-ner-dane (https://huggingface.co/Maltehb/danish-bert-botxo-ner-dane) for Danish.</p>
          <p>We tune the hyper-parameters according to the performance in the development set. For the sequence labeling strategy, the batch size, number of LSTM layers, LSTM hidden size, and adversarial training weight are 8, 1, 256, and 0.1, respectively. For the span selection strategy, they are 16, 1, 256, and 1.0. We run all experiments using PyTorch 1.9.1 on an Nvidia GeForce RTX 3090 GPU and an Intel(R) Xeon(R) Platinum 8260L CPU on Ubuntu 18.04.4 LTS. Our code will be released soon.</p>
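          <p>For reference, a small helper (ours, purely illustrative) that loads the per-language checkpoints listed above through HuggingFace Transformers could look as follows.</p>
          <preformat>
# Sketch: pre-trained encoders used per language, loaded through
# HuggingFace Transformers. The mapping mirrors the checkpoints listed above.
from transformers import AutoModel, AutoTokenizer

ENCODERS = {
    "english": "roberta-base",
    "persian": "HooshvareLab/roberta-fa-zwnj-base-ner",
    "vietnamese": "Geotrend/bert-base-vi-cased",
    "french": "Geotrend/bert-base-fr-cased",
    "spanish": "Geotrend/bert-base-es-cased",
    "danish": "Maltehb/danish-bert-botxo-ner-dane",
}

def load_encoder(language):
    name = ENCODERS[language]
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
          </preformat>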
        </sec>
        <sec id="sec-3-5-2">
          <title>4.4. Results</title>
          <sec id="sec-3-5-2-1">
            <title>4.4.1. Scientific Domain</title>
            <p>The comparison between the proposed model and the baseline models is shown in Table 1 [Table 1: performance comparison of Rule, BERT, RoBERTa, Ours-SL, and Ours-SS on the test sets of the scientific domain; ⋆ indicates the score of our model]. The main observations can be summarized as follows:</p>
            <p>• Compared with manually designed rule-based methods, pre-trained model-based methods have huge advantages because they can capture reasonable word representations.</p>
            <p>• The difference between the BERT model and the RoBERTa model is remarkable. We conjecture this is because the datasets are small; thus, the results depend more on the power of the pre-trained model.</p>
            <p>• Our two strategies get similar results and outperform all baseline methods. We submit the better one for testing.</p>
          </sec>
          <sec id="sec-3-5-2-2">
            <title>4.4.2. Legal Domain</title>
            <p>The comparison is shown in Table 4. The observations are similar to those in the scientific domain, and our method stably outperforms all baseline models. Table 5 shows the top 4 scores on the test sets; our method gets decent performance and ranks 2nd in English and French, and 3rd in Spanish and Danish.</p>
          </sec>
        </sec>
        <sec id="sec-3-5-3">
          <title>4.5. Ablation Study</title>
          <p>To further prove the effectiveness of each component, we run ablation studies on the development set of English Scientific, as shown in Table 6. [Table 6: ablation studies on the development set of English Scientific (P / R / F1). Ours-SS: 0.86 / 0.89 / 0.87; Ours-SS w/o AT: 0.86 / 0.87 / 0.86.] We find that: (1) for our sequence labeling strategy, CRF is necessary because it helps capture the dependency between sequence labels; (2) adversarial training is beneficial to both strategies by adding reasonable noise, which improves our models' robustness and generalization performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>5. Conclusion</title>
        <p>In this paper, we explore and propose two strategies
with adversarial training for SDU@AAAI-22 - Shared
Task 1: Acronym Extraction. Experiments show that
our methods outperform strong baseline methods in all
7 datasets. In addition, our score ranks high in the test
sets. For future work, we will try to solve the problem of
class imbalance in both strategies.</p>
      </sec>
      <sec id="sec-3-7">
        <title>6. Acknowledgments</title>
        <p>This research was supported in part by the National
Key Research and Development Program of China (No.
2020YFB1708200) and the Shenzhen Key Laboratory of
Marine IntelliSense and Computation under Contract
ZDSYS20200811142605016.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Ackermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Beller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Boxwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. M. Summers</surname>
          </string-name>
          ,
          <article-title>Resolution of acronyms in question answering systems</article-title>
          ,
          <year>2020</year>
          . US Patent
          <volume>10</volume>
          ,
          <issue>572</issue>
          ,
          <fpage>597</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Head</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sidhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>Document-level definition detection in scholarly documents: Existing models, error analyses, and future directions</article-title>
          ,
          <source>in: Proceedings of SDP@EMNLP</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <year>2020</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Relational facts extraction with splitting mechanism</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Knowledge Graph, ICKG 2020</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>374</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          , G. Xun,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , FAT-RE:
          <article-title>A faster dependency-free model for relation extraction</article-title>
          ,
          <source>J. Web Semant</source>
          .
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <fpage>100598</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          , N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen,
          <article-title>MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction</article-title>
          , in: arXiv,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          , N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen,
          <source>Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU</source>
          <year>2022</year>
          ,
          <source>in: Proceedings of SDU@AAAI-22</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>Building an abbreviation dictionary using a term recognition approach</article-title>
          ,
          <source>Bioinform</source>
          .
          <volume>22</volume>
          (
          <year>2006</year>
          )
          <fpage>3089</fpage>
          -
          <lpage>3095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H. T.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhong</surname>
          </string-name>
          , G. Zeng,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , AT-BERT:
          <article-title>adversarial training BERT for acronym identification winning solution for sdu@aaai-21</article-title>
          , CoRR abs/2101.03700 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Egan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohannon</surname>
          </string-name>
          ,
          <article-title>Primer ai's systems for acronym identification and disambiguation</article-title>
          ,
          <source>in: Proceedings of the SDU@AAAI</source>
          <year>2021</year>
          , volume
          <volume>2831</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the NAACL-HLT</source>
          <year>2019</year>
          ,
          Volume 1 (Long and Short Papers), Association for Computational Linguistics, pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, Pacific Symposium on Biocomputing 4 (2003) 451–462.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. Torres-Schumann, K. U. Schulz, Stable methods for recognizing acronym-expansion pairs: from rule sets to hidden Markov models, Int. J. Document Anal. Recognit. 8 (2006).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. G. Harris, P. Srinivasan, My word! Machine versus human computation methods for identifying and resolving acronyms, Computación y Sistemas 23 (2019).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of BioNLP@ACL 2019, Association for Computational Linguistics, 2019, pp. 88–96.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Probing biomedical embeddings from language models, CoRR abs/1904.02181 (2019).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] I. Li, M. Yasunaga, M. Y. Nuzumlali, C. Caraballo, S. Mahajan, H. M. Krumholz, D. R. Radev, A neural topic-attention model for medical term abbreviation disambiguation, CoRR abs/1910.14076 (2019).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, arXiv preprint arXiv:2010.14678 (2020).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Z. Wei, J. Su, Y. Wang, Y. Tian, Y. Chang, A novel cascade binary tagging framework for relational triple extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 1476–1488.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] S. Kim, S. Yang, G. Kim, S. Lee, Efficient dialogue state tracking by selectively overwriting memory, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 567–582.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 3613–3618.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] F. Li, Z. Mai, W. Zou, W. Ou, X. Qin, Y. Lin, W. Zhang, Systems at SDU-2021 Task 1: Transformers for sentence level sequence label, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32, NeurIPS 2019, 2019, pp. 5754–5764.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] C. Sutton, A. McCallum, An introduction to conditional random fields, Found. Trends Mach. Learn. 4 (2012) 267–373.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] H. Cheng, X. Liu, L. Pereira, Y. Yu, J. Gao, Posterior differential regularization with f-divergence for improving model robustness, in: Proceedings of NAACL-HLT 2021, Association for Computational Linguistics, 2021, pp. 1078–1089.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of EMNLP 2020 - Demos, Association for Computational Linguistics, Online, 2020, pp. 38–45.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>