<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-031-17849-8_2</article-id>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-0754-7696</contrib-id>
          <string-name>Vincenzo Moscato</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-1470-8053</contrib-id>
          <string-name>M. Postiglione</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-4033-3777</contrib-id>
          <string-name>G. Sperlì</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-0273-1056</contrib-id>
          <string-name>Andrea Vignali</string-name>
          <email>andrea.vignali@unina.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II</institution>
          ,
          <addr-line>Via Claudio 21, Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>13590</volume>
      <fpage>19</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Traditional techniques for Named Entity Recognition (NER) need an extensive amount of labeled data in order to obtain accurate outcomes. However, in real-world situations, it can be difficult to find large datasets, particularly in the biomedical field, where it is challenging to retrieve the required material from which to derive the examples to be annotated, and where a domain expert is required for annotations. To address this challenge, data augmentation can be used to generate synthetic data from an existing few-shot training set. However, current methods have a tendency to generate a vast amount of noise, thus hindering performance improvements. In this work, we propose a framework to refine a policy that allows the selection of the most informative examples in an augmented pool, with a Policy-based Active Learning approach that employs a deep Q-network to define the selection strategy. We evaluated the proposed approach on three benchmark biomedical datasets by simulating few-shot scenarios and found it to be more effective than the selected baselines in most cases.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Few-shot learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Biomedical Named Entity Recognition (BioNER) is a crucial natural language processing task that plays a pivotal role in automatically identifying and extracting essential entities, such as diseases, chemical agents, and genes, from unstructured text data. Accurate BioNER is fundamental for numerous downstream applications, including medical question answering agents and knowledge graph construction.</p>
      <p>Training effective Named Entity Recognition (NER) models requires significant amounts of manually annotated data, which is a time-consuming and expensive process, especially in specialized domains such as the legal, historical, or biomedical ones, where domain knowledge is fundamental. Additionally, the availability of domain experts in the medical field may be limited due to their busy schedules.</p>
      <p>To address the challenges posed by limited training data, several data augmentation strategies have been explored, including simple data manipulations [<xref ref-type="bibr" rid="ref3">3</xref>], the use of context similarity-based criteria [<xref ref-type="bibr" rid="ref4">4</xref>] and the imitation of language patterns from high-resource domains [5].</p>
      <p>Although the first attempts of NER data augmentation have shown promising results, the proposed methods of data manipulation may frequently generate a considerable amount of mislabeled and noisy samples, as the new data may not be syntactically and/or semantically accurate. For example, if we manipulate the sentence "Hypotension is a term that indicates low blood pressure" so as to replace the entity mention hypotension with another disorder (e.g. dyspnea, hypertension), the resulting augmented sample may be inaccurate and thus mislead the model in effectively identifying mentions.</p>
      <p>In this work, we address the issue of selecting the most informative samples from an augmented pool. Inspired by policy-based active learning [6], we do not use a fixed heuristic, but rather allow our framework to learn how to actively select data by formalizing the selection process as a reinforcement learning problem. Specifically, for each sample in the augmented pool, an agent has to decide whether to select it or not, based on its characteristics and model outputs. The selection policy is learned by means of a deep Q-network [7].</p>
      <p>The idea of selecting data from an augmented pool has its foundations in policy-based active learning [6]. Active learning (AL) is a well-established method to select the most informative unlabeled data to be annotated in order to train the best classifier, thus optimizing human efforts. AL approaches are based on heuristics: uncertainty sampling [9, 10] selects data based on the uncertainty expressed in the outputs of the model, while Seung et al. [11] choose data based on the disagreement of a committee. Fang et al. [6] revise AL as a reinforcement learning problem where the selection strategy is automatically learned by an agent by means of a deep Q-network [7]. In our work, an intelligent agent automatically learns a policy to identify the most advantageous samples from an augmented dataset, so as to improve the overall performance of the model. By doing so, the agent selects samples that are less likely to mislead the model, without the requirement of human input or guidance.</p>
      <p>We test our method by simulating few-shot scenarios in BioNER applications, i.e. we use only k samples as our training data, k ∈ {10, 50, 100}. In such settings, we demonstrate the ability of the framework to select the most informative augmented samples first, and show promising results in the comparison with the selected baselines. Our approach presents a new direction for exploring the potential of data augmentation to improve the performance of NER models when the training data is scarce, and our findings reveal a considerable margin for improvement, as the data augmentation technique employed for generating the augmented pool can be readily replaced with more advanced and effective methods.</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>The remainder of this paper is organized as follows.</p>
      <p>In Section 2, we summarize the literature on NER data
augmentation. In Section 3, we present our augmentation
framework, while experimental results are reported in
Section 4. We conclude our work in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Data augmentation aims to increase the amount of available training data by means of data manipulations, heuristics or external data sources. Dai and Adel [<xref ref-type="bibr" rid="ref3">3</xref>] investigate the improvements in performance obtained by augmenting NER data with simple data manipulations, such as token replacements, mention replacements, and shuffling. However, these approaches may generate too many noisy samples, which may in turn hinder the ability of the model to be effectively trained. Bartolini et al. [<xref ref-type="bibr" rid="ref4">4</xref>] address this challenge by replacing entity mentions with the most similar entities retrieved by computing context-based similarity. Zeng et al. [8] address the poor generalization of NER models.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>The methodological workflow of the proposed framework is illustrated in Figure 1. In this section, we will first provide a formalization of the few-shot BioNER problem, and then describe each module in depth, from the generation of a concepts vocabulary to the reinforcement learning cycles.</p>
      <sec id="sec-4-1">
        <title>3.1. Problem formulation</title>
        <p>The input of a NER system is a sentence s, which can be represented as a sequence of tokens s = [w_1, w_2, …, w_n]. NER outputs a list of tuples [i_s, i_e, t] representing named entities mentioned in s. Here, i_s ∈ [1, n] and i_e ∈ [1, n] are the indexes of the start and end characters of the named entity mention, while t is the entity type [12].</p>
        <p>In practice, this task is usually accomplished by producing a paired sequence of categorical values y = [y_1, y_2, …, y_n] as the output of the NER model, where y_i indicates the entity type of the i-th token. Hence, a NER dataset is defined as a collection of pairwise data D = {(s_i, y_i)}, i = 1, …, N, with N being the number of examples.</p>
        <p>For the purposes of this work, we will be using the IOB scheme to identify entity mentions. Under this scheme, each input token is mapped to the beginning (B), the inside (I), or the outside (O) of an entity mention.</p>
        <p>Our approach for the selection of the less noisy samples will consider inputs from biomedical domains, where the NER task is known as Biomedical NER (BioNER). Due to the data scarcity that usually affects such domains, we will test our system in few-shot settings, i.e. the number of training instances k is small (e.g. k ∈ {10, 50, 100}).</p>
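        <p>As a concrete illustration of this formulation, the following toy sketch (ours, not the authors' code; the helper name to_iob and the token-level span convention are assumptions for illustration) maps a sentence and its entity mention spans to the corresponding IOB tag sequence.</p>
        <preformat>
```python
# Toy illustration (not the paper's code): mapping a tokenized sentence
# and its entity mentions to an IOB tag sequence, as in Section 3.1.
def to_iob(tokens, mentions):
    """mentions: list of (start_token, end_token_inclusive, entity_type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in mentions:
        tags[start] = "B-" + etype           # first token of the mention
        for i in range(start + 1, end + 1):  # remaining mention tokens
            tags[i] = "I-" + etype
    return tags

tokens = ["Hypotension", "indicates", "low", "blood", "pressure"]
print(to_iob(tokens, [(0, 0, "Disease")]))
# ['B-Disease', 'O', 'O', 'O', 'O']
```
        </preformat>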
        <p>[Figure 1: workflow of the proposed framework. The training data (e.g. hypertensive disorder, cancer, dyspnea, …) is expanded via mention replacement into an augmented pool; a reinforcement learning agent observes states built from the content, marginals and confidence representations, takes selection actions on the augmented samples, and the selected data is used to re-train the NER model.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Generation of a vocabulary of concepts</title>
        <p>Based on the available training data, we extract all the
entity mentions, thus building a vocabulary of concepts.</p>
        <p>In this work, we test our framework by relying solely on the input training data, but this module can be easily extended to include concepts from biomedical ontologies or guided by domain experts. For example, physicians are usually aware of the ways medical concepts can be written in clinical notes; hence, if they are interested in recognizing mentions of a particular concept, they can provide a set of aliases for our system to effectively augment the original training set.</p>
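        <p>The vocabulary-building step described above can be sketched as follows (a hypothetical implementation, ours; the function name and data layout are assumptions): every maximal B-/I- tag run in the training data is collected as a mention surface form, grouped by entity type.</p>
        <preformat>
```python
# Hypothetical sketch (ours) of the concepts-vocabulary module:
# collect every entity mention surface form from IOB-labelled sentences.
from collections import defaultdict

def build_vocabulary(dataset):
    """dataset: list of (tokens, tags) pairs with IOB tags like 'B-Disease'."""
    vocab = defaultdict(set)
    for tokens, tags in dataset:
        mention, etype = [], None
        # the (None, "O") sentinel flushes a mention that ends the sentence
        for token, tag in list(zip(tokens, tags)) + [(None, "O")]:
            if tag.startswith("B-"):
                if mention:
                    vocab[etype].add(" ".join(mention))
                mention, etype = [token], tag[2:]
            elif tag.startswith("I-") and mention:
                mention.append(token)
            else:
                if mention:
                    vocab[etype].add(" ".join(mention))
                mention, etype = [], None
    return dict(vocab)

data = [(["hypertensive", "disorder", "and", "dyspnea"],
         ["B-Disease", "I-Disease", "O", "B-Disease"])]
print(build_vocabulary(data))
# {'Disease': {'hypertensive disorder', 'dyspnea'}} (set order may vary)
```
        </preformat>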
        <p>3.3. Data augmentation via mention replacement
For each sentence in our training set, to determine whether a mention should be replaced, we employ a binomial distribution. If the outcome is affirmative, we select a replacement mention from the concepts vocabulary. Subsequently, we modify the corresponding IOB-label sequence as needed. Some examples of mention replacement are provided in Table 1.</p>
        <p>The reason behind the choice of this augmentation technique lies in the high number of noisy samples it may generate, given the random nature of the mention replacement. This allows us to effectively test the ability of our framework to discard samples that may mislead the model. However, it is our belief that the performance of the framework can be further improved with more sophisticated augmentation methods, e.g. based on context similarity [<xref ref-type="bibr" rid="ref4">4</xref>] or learning patterns from cross-domain data [5].</p>
        <p>We learn how to select data from the augmented pool with a module based on reinforcement learning. Our method is built upon the foundations of Policy-based Active Learning [6], which has been previously demonstrated to be capable of automatically learning an active learning strategy from data, by formulating active learning as a reinforcement learning problem where the state corresponds to the unlabeled data selected for labeling, together with their labels, and the action is the selection heuristic. Specifically, we adapt the method not to work with unlabeled data and human oracles, but with the augmented pool generated in the previous step, and to learn the best strategy to select the samples that may mostly benefit the performance of the NER model.</p>
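        <p>A minimal sketch (ours, not the authors' implementation) of the mention replacement step under these assumptions: each mention is kept or replaced according to a Bernoulli draw, a replacement is sampled from the concepts vocabulary, and the IOB label sequence is rewritten to match the new mention length.</p>
        <preformat>
```python
# Illustrative sketch (ours) of mention replacement with IOB relabelling.
import random

def replace_mentions(tokens, tags, vocab, p=0.5, rng=random):
    out_tokens, out_tags, i = [], [], 0
    while i != len(tokens):
        tag = tags[i]
        # Bernoulli(p) decision on each mention start
        if tag.startswith("B-") and rng.random() > 1 - p:
            etype = tag[2:]
            j = i + 1
            while j != len(tokens) and tags[j] == "I-" + etype:
                j += 1                       # skip the old mention span
            new_mention = rng.choice(sorted(vocab[etype])).split()
            out_tokens.extend(new_mention)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new_mention) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags
```
        </preformat>
        <p>For instance, with p = 1 and a vocabulary containing only "dyspnea" for the Disease type, "Hypotension indicates low blood pressure" becomes "dyspnea indicates low blood pressure" with an unchanged label pattern.</p>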
      </sec>
      <sec id="sec-4-3">
        <title>3.4. Reinforcement learning cycles</title>
        <sec id="sec-4-3-4">
          <p>[Table 1: examples of data augmentation via mention replacement (entity mentions in bold in the original). "If untreated, hemochromatosis can cause serious illness and early death, but the disease is still substantially underdiagnosed." becomes "If untreated, mononucleosis can cause serious illness and early death, but the disease is still substantially underdiagnosed."; "When expressed in Escherichia coli, SH-PTP2 displays tyrosine-specific phosphatase activity" becomes "When expressed in Escherichia coli, PTPN6 displays tyrosine-specific phosphatase activity".]</p>
          <p>While Fang et al. [6] make a streaming assumption, i.e. unlabelled data arrive one by one and the agent decides the action to take, we assume batch-based learning: the augmented pool is entirely available, and the reward is computed on the set of actions that the agent has decided to take on the whole dataset in each cycle.</p>
          <p>In the remainder of this section, we provide in-depth details on the components of the reinforcement learning process.</p>
          <p>3.4.1. States
We represent the state of each sentence s in the augmented pool at time i by taking into consideration both an embedded representation of its content and the outputs of the NER model Θ_i trained over the selected data at time i. Specifically, the state consists in the concatenation of the three representations described in the following: content, marginals and confidence. We denote with S_i the set of states at time i.</p>
          <p>Content
Following Kim [13], we first encode each of the n tokens w_i in the sentence to produce a matrix X = {x_1, x_2, …, x_n}, and then apply a convolutional neural network, which consists in a series of filters using linear transformations followed by ReLU activation functions; the last layer of the network performs a max-pooling operation that provides the representation of the sentence content h_c.</p>
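        <p>The content representation can be sketched in numpy as follows (embedding size, filter count and window width are illustrative assumptions, not the paper's hyper-parameters): windows of token embeddings pass through linear filters with ReLU, and a max-pool over time yields h_c.</p>
        <preformat>
```python
# Minimal numpy sketch (ours) of a Kim-style content encoder.
import numpy as np

def content_representation(X, filters, width=3):
    """X: (n_tokens, d) embeddings; filters: (n_filters, width*d) weights."""
    n, d = X.shape
    # slide a window of `width` tokens and flatten each window
    windows = [X[i:i + width].reshape(-1) for i in range(n - width + 1)]
    H = np.maximum(0.0, np.stack(windows) @ filters.T)  # conv + ReLU
    return H.max(axis=0)                                # max-pool over time: h_c

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 tokens, 8-dim embeddings (assumed sizes)
W = rng.normal(size=(16, 3 * 8))   # 16 filters of width 3
h_c = content_representation(X, W)
print(h_c.shape)  # (16,)
```
        </preformat>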
          <p>Marginals
Let P_Θ(y|s) indicate the prediction outputs of the NER model given the input sentence s. Another convolutional neural network is used to represent the predictive marginals, i.e. the probability distributions associated to all the tokens in s. Following Fang et al. [6], the convolutional layer contains filters activated with ReLU, applied with a window width of 3 and a height equal to the number of classes (3 in our case, i.e. I, O and B). Padding is used to ensure a wide convolution, and mean pooling is used to allow the network to effectively capture the average uncertainty in each window. The final hidden layer outputs the representation of the predictive marginals h_m.</p>
          <p>Confidence
Following Fang et al. [6], we represent the confidence by computing the probability of the most probable sequence of labels under the model, c = max_y P_Θ(y|s), normalized by n, the length of the sentence.</p>
          <p>3.4.2. Actions
Given the state of each input sample, an agent has to decide whether or not to select it to re-train the NER model. Thus, for each sentence s in the augmented pool, the agent selects either to use it (a = 1) or not (a = 0).</p>
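        <p>Assuming token-independent marginals for illustration (the paper's exact normalization is not fully specified in the extraction), the confidence feature can be sketched as a length-normalized probability of the most likely labelling:</p>
        <preformat>
```python
# Sketch (ours) of the confidence feature: probability of the most likely
# label sequence under per-token marginals, length-normalised as a
# geometric mean so that long sentences are comparable to short ones.
import math

def confidence(marginals):
    """marginals: one dict of label probabilities per token, e.g. {'B': 0.7, ...}."""
    log_p = sum(math.log(max(m.values())) for m in marginals)
    return math.exp(log_p / len(marginals))

m = [{"B": 0.7, "I": 0.2, "O": 0.1}, {"B": 0.1, "I": 0.1, "O": 0.8}]
print(round(confidence(m), 3))  # geometric mean of 0.7 and 0.8 -> 0.748
```
        </preformat>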
          <p>We denote the set of actions made at time i with A_i.</p>
          <p>3.4.3. Reward
The reward provides feedback on the quality of the decisions made by the agent. At each step i, the reward is defined as the change in held-out performance:
R(S_i−1, A_i) = Performance(Θ_i) − Performance(Θ_i−1),   (1)
where Performance(⋅) is a measure of the model's quality. In our work, we compute the F1 score to determine rewards. Note that the value of R could also be negative, i.e. the actions made by the agent have a detrimental effect on the performance.</p>
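        <p>The reward of Eq. (1) can be sketched as follows; the span-level F1 below is a simplified stand-in (ours) for the seqeval implementation used in the paper:</p>
        <preformat>
```python
# Sketch (ours) of the reward: change in held-out span-level F1 between
# consecutive model versions. May be negative if the selection hurts.
def span_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold.intersection(pred))     # exact-match true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def reward(f1_current, f1_previous):
    return f1_current - f1_previous       # Eq. (1): Performance delta

gold = [(0, 1, "Disease"), (4, 4, "Disease")]
pred = [(0, 1, "Disease")]
print(span_f1(gold, pred))  # precision 1.0, recall 0.5
```
        </preformat>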
          <p>3.4.4. Deep Q-Network
We adopt a deep Q-learning [7] approach where the utility of choosing the action a_i from state s_i is evaluated by the Q function Q_π(s_i, a_i) according to the policy π. The Q-function is iteratively updated by the agent by considering the rewards obtained in each episode.</p>
          <p>The deep Q-network (DQN) consists in a single hidden layer which takes the state vector of a single instance, s_i = [h_c, h_m, c], as input and uses a ReLU activation function to output the two scalar values Q(s_i, a_i) associated to the two possible actions a_i ∈ {0, 1}.</p>
          <p>The training objective is to minimize the difference between the estimated Q-value and the true Q-value for a given state-action pair. This is typically done by using a variant of the Q-learning algorithm known as the Bellman equation, which recursively defines the Q-value for a state-action pair as the immediate reward plus the discounted future Q-value for the next state-action pair.</p>
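        <p>A minimal numpy sketch (ours; layer sizes are illustrative assumptions) of the single-hidden-layer DQN described above: the state vector is mapped through a ReLU hidden layer to two Q-values, one per action, and a temporal-difference target is formed from the reward and the discounted best next-state Q-value.</p>
        <preformat>
```python
# Minimal numpy sketch (ours) of the single-hidden-layer DQN.
import numpy as np

rng = np.random.default_rng(1)
state_dim, hidden = 33, 32               # e.g. h_c (16) + h_m (16) + c (1), assumed
W1, b1 = rng.normal(size=(hidden, state_dim)), np.zeros(hidden)
W2, b2 = rng.normal(size=(2, hidden)), np.zeros(2)

def q_values(s):
    h = np.maximum(0.0, W1 @ s + b1)     # ReLU hidden layer
    return W2 @ h + b2                   # Q(s, 0) and Q(s, 1)

def td_target(reward_r, next_state, gamma=0.99):
    # immediate reward plus discounted best Q-value of the next state
    return reward_r + gamma * q_values(next_state).max()

s = rng.normal(size=state_dim)
q = q_values(s)
action = int(q.argmax())                 # greedy selection decision
print(q.shape, action in (0, 1))
```
        </preformat>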
        </sec>
        <sec id="sec-4-3-7">
          <p>Mathematically, this can be expressed as:
Q(s_i, a_i) = E[r_i + γ ⋅ max_a_i+1 Q(s_i+1, a_i+1)],   (2)
where Q(s, a) is the Q-value for state s and action a, γ ∈ [0, 1] is the discount factor, and max_a_i+1 Q(s_i+1, a_i+1) is the maximum Q-value over all possible actions in the next state.</p>
          <p>The goal of the Q-learning algorithm is to update the Q-network weights θ to minimize the mean squared error between the estimated Q-value Q(s, a; θ) and the target Q-value y_i:
L(θ) = E[(y_i(s_i, a_i+1) − Q(s_i, a_i; θ))²],   (3)
where y_i(s_i, a_i+1) = r_i + γ ⋅ max_a_i+1 Q(s_i+1, a_i+1; θ_i−1) is the target Q-value based on the previous parameters θ_i−1, and results are averaged over a minibatch of samples. Learning updates are based on stochastic gradient descent.</p>
          <p>4. Experiments
In this section, we provide an in-depth description of the experiments we ran to assess the performance of our system. First, we describe the experimental setup in Section 4.1; then, we discuss experimental results in Section 4.2.</p>
          <p>[Table 2: performance comparison of the selected methods on NCBI-Disease, BC2GM and BC5CDR in the 10-, 50- and 100-shot settings.]</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.1. Experimental setup</title>
        <p>4.1.1. Datasets
We test our method on the three popular benchmark
datasets from the biomedical field listed as follows:
• NCBI-Disease [14]: 793 abstracts from PubMed,
annotated with disorders entity mentions.
• BC2GM [15]: over 20,000 abstracts from PubMed
annotated with gene mentions.
• BC5CDR [16]: over 1,500 abstracts from PubMed
annotated with diseases and chemicals. For
simplicity, we consider only chemical entity
mentions in our experiments.</p>
        <p>For each dataset, we have considered the original
training, validation and test sets provided with their original
release.</p>
        <p>Few-shot simulations
To simulate data scarcity scenarios, we randomly sample k sentences from the training set, k ∈ {10, 50, 100}. Since performance can vary greatly based on the selected training samples, we run each experiment 5 times and always report averaged results.</p>
        <p>For each method, we always pre-train the initial model
with the available training data in the simulated few-shot
scenario. Then, we assess the performance of the same
model when it is fine-tuned with the selected samples.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.2. Results</title>
        <p>4.1.2. Training details
Given the data-scarcity nature of our work, we assume that data to tune hyper-parameters is not available. However, we test models on the entire test set. Hence, we choose hyperparameters based on previous work and practical considerations. Specifically, we use a pre-trained biomedical Transformer network [17] and train all our models for 3 epochs with a learning rate of 2 ⋅ 10⁻⁴, an AdamW optimizer, a batch size of 5 and a maximum sequence length of 256. We run our reinforcement learning framework for 5 episodes. We evaluate the quality of models in terms of precision, recall and F1 scores obtained with the seqeval Python library (https://github.com/chakki-works/seqeval).</p>
        <p>4.1.3. Hardware configuration
All experiments were conducted on the Google Colab platform, using the free tier plan, which provides a virtual machine with an Intel® Xeon® processor with a frequency of 2.3 GHz and 10 cores (only one of which is used by the VM instance), 12 GB of available memory, 78.19 GB of free disk and an NVIDIA T4 GPU with 16 GB of RAM. Due to the limits imposed by Colab's free plan, we were unable to pursue further improvements on the obtained results. Specifically, we could only run a maximum of 5 PAL episodes per experiment. Although this was sufficient to achieve the intended goals, conducting a greater number of episodes could have allowed for a more refined selection policy for the augmented instances, potentially leading to improved model performances.</p>
        <p>4.1.4. Baselines
We compare our method with the two baselines for the selection of samples from the augmented pool described as follows:
• Random: we sample a random set of instances from the augmented pool.
• Uncertainty: we leverage uncertainty-based active learning [9] as a heuristic-based framework for the selection of samples, i.e. we rank augmented samples according to the uncertainty of the model in its predictions. Since model predictions are mapped to each token in the sentence, we aggregate them to obtain a single ranking value.</p>
        <p>Table 2 presents a comprehensive comparison of the performance of the different baselines in various k-shot scenarios and datasets for BioNER tasks. We can observe that our method consistently shows competitive performance, achieving the highest F1 scores in several cases, indicating its effectiveness in correctly identifying entity mentions. The highest improvement over the random baseline can be seen in the 10-shot scenario on BC2GM data. This dataset being focused on the gene entity type, and this information being usually mentioned in many heterogeneous ways (e.g. mStaf gene, OBP, primase, V8 protease, MT), a vast amount of noise can be generated by randomly replacing mentions. This negative effect of noise is higher as the data-scarcity scenario becomes stricter, because the effect of noise can be limited when we include clean training data. Furthermore, our method consistently achieves the highest precision or rivals the top performer, as the learned selection policy reliably minimizes the occurrence of false positives.</p>
        <p>We should notice that results show high standard deviations, the quality of the model being strongly related to the sampling that generates the few-shot training set. Figure 2 illustrates the performance trends of F1 scores as we increase the number of selected data from the augmented pool in the 50-shot setting. Our method consistently surpasses the random and uncertainty-based selection curves, and the improvement is generally higher when the number of selected elements is smaller. As the number of elements increases, the three curves converge to values that are reasonably close. This trend is consistent with findings from Fang et al. [6], which demonstrated that Policy-based Active Learning achieves superior performance while using fewer selected elements. In all three plots, it can be observed that the peak performance of our method outperforms the performance obtained by the model without the augmented samples, represented by the dashed red line.</p>
        <p>[Figure 2: F1 score as a function of the number of selected samples in the 50-shot setting; panel (c) reports results on BC2GM.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion &amp; Future Work</title>
      <p>In this work, we proposed a novel approach for selecting
informative samples from an augmented pool to improve
the performance of NER models in the biomedical domain
with limited training data. Our framework leverages
policy-based active learning [6] to learn a selection policy
that identifies the most informative augmented samples
to enhance the NER model’s generalization ability.</p>
      <p>We evaluated our method on simulated few-shot
scenarios in BioNER applications, where we demonstrated
its ability to select the most informative augmented
samples first and achieve promising results compared to
selected baselines. Our approach presents a new direction
for exploring the potential of data augmentation to
improve NER models’ performance in specialized domains,
such as biomedical, where labeled data is scarce and
domain knowledge is essential.</p>
      <p>Future work should explore the robustness of our framework on real-world biomedical datasets and investigate the effectiveness of different data augmentation techniques in improving the performance of the proposed approach. The simple mention replacement approach used in the current implementation of our framework could indeed be replaced with more advanced and sophisticated approaches.</p>
      <p>Furthermore, we intend to extend our approach to other NLP tasks beyond NER, such as relation extraction and entity linking, and compare its performance with existing state-of-the-art methods. Finally, we also plan to investigate the interpretability of the learned selection policy to gain insights into the most informative features and characteristics of the augmented data samples that contribute to the NER model's improved performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>This project has been funded by PNRR MUR project PE0000013-FAIR. We also acknowledge support from NextGenerationEU via PNRR - DM 352 (CUP: E66G22000400009).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Bayer, M. Kaufhold, C. Reuter, A survey on data augmentation for text classification, ACM Comput. Surv. 55 (2023) 146:1-146:39. doi:10.1145/3544558.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. Liu, G. Xu, C. Jia, W. Ma, L. Wang, S. Vosoughi, Data boost: Text data augmentation through reinforcement learning guided conditional generation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 9031-9041. doi:10.18653/v1/2020.emnlp-main.726.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] X. Dai, H. Adel, An analysis of simple data augmentation for named entity recognition, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 3861-3867. doi:10.18653/v1/2020.coling-main.343.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] I. Bartolini, V. Moscato, M. Postiglione, G. Sperlì, A. Vignali, COSINER: context similarity data augmentation for named entity recognition, in: Similarity Search and Applications (SISAP 2022).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>