<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-031-17849-8_2</article-id>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-0754-7696</contrib-id>
          <string-name>Vincenzo Moscato</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-1470-8053</contrib-id>
          <string-name>M. Postiglione</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-4033-3777</contrib-id>
          <string-name>G. Sperlì</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-0273-1056</contrib-id>
          <string-name>Andrea Vignali</string-name>
          <email>andrea.vignali@unina.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II</institution>
          ,
          <addr-line>Via Claudio 21, Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>13590</volume>
      <fpage>19</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Traditional techniques for Named Entity Recognition (NER) need an extensive amount of labeled data in order to obtain accurate outcomes. However, in real-world situations, it can be difficult to find large datasets, particularly in the biomedical field, where it is challenging to retrieve the required material from which to derive the examples to be annotated, and where a domain expert is required for annotations. To address this challenge, data augmentation can be used to generate synthetic data from an existing few-shot training set. However, current methods have a tendency to generate a vast amount of noise, thus hindering performance improvements. In this work, we propose a framework to refine a policy that allows the selection of the most informative examples in an augmented pool, with a Policy-based Active Learning approach that employs a deep Q-network to define the selection strategy. We evaluated the proposed approach on three benchmark biomedical datasets by simulating few-shot scenarios and found it to be more effective than the selected baselines in most cases.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Few-shot learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Biomedical Named Entity Recognition (BioNER) is a crucial natural language processing task that plays a pivotal role in automatically identifying and extracting essential entities, such as diseases, chemical agents, and genes, from unstructured text data. Accurate BioNER is fundamental for numerous downstream applications, including medical question answering agents and knowledge graph construction.</p>
      <p>Training effective Named Entity Recognition (NER) models requires significant amounts of manually annotated data, which is a time-consuming and expensive process, especially in specialized domains such as the legal, historical, or biomedical ones, where domain knowledge is fundamental. Additionally, the availability of domain experts in the medical field may be limited due to their busy schedules.</p>
      <p>To address the challenges posed by limited training data, several data augmentation strategies have been explored, including simple data manipulations [<xref ref-type="bibr" rid="ref3">3</xref>], the use of context similarity-based criteria [<xref ref-type="bibr" rid="ref4">4</xref>] and the imitation of language patterns from high-resource domains [5].</p>
      <p>Although the first attempts of NER data augmentation have shown promising results, the proposed methods of data manipulation may frequently generate a considerable amount of mislabeled and noisy samples, as the new data may not be syntactically and/or semantically accurate. For example, if we manipulate the sentence "Hypotension is a term that indicates low blood pressure" so as to replace the entity mention hypotension with another disorder (e.g. dyspnea, hypertension), the resulting augmented sample may be inaccurate and thus mislead the model in effectively identifying mentions.</p>
      <p>In this work, we address the issue of selecting the most informative samples from an augmented pool. Inspired by policy-based active learning [6], we do not use a fixed heuristic, but rather allow our framework to learn how to actively select data by formalizing the selection process as a reinforcement learning problem. Specifically, for each sample in the augmented pool, an agent has to decide whether to select it or not, based on its characteristics and model outputs. The selection policy is learned by means of a deep Q-network [7].</p>
      <p>The idea of selecting data from an augmented pool has its foundations in policy-based active learning [6]. Active learning (AL) is a well-established method to select the most informative unlabeled data to be annotated in order to train the best classifier, thus optimizing human efforts. AL approaches are based on heuristics: uncertainty sampling [9, 10] selects data based on the uncertainty expressed in the outputs of the model, while Seung et al. [11] choose data based on the disagreement of a committee. Fang et al. [6] revise AL as a reinforcement learning problem where the selection strategy is automatically learned by an agent by means of a deep Q-network [7]. In our work, an intelligent agent automatically learns a policy to identify the most advantageous samples from an augmented dataset, so as to improve the overall performance of the model. By doing so, the agent selects samples that are less likely to mislead the model, without the requirement of human input or guidance.</p>
      <p>We test our method by simulating few-shot scenarios in BioNER applications, i.e. we use only k samples as our training data, k ∈ {10, 50, 100}. In such settings, we demonstrate the ability of the framework to select the most informative augmented samples first, and show promising results in the comparison with the selected baselines. Our approach presents a new direction for exploring the potential of data augmentation to improve the performance of NER models when the training data is scarce, and our findings reveal a considerable margin for improvement, as the data augmentation technique employed for generating the augmented pool can be readily replaced with more advanced and effective methods.</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>The remainder of this paper is organized as follows.</p>
      <p>In Section 2, we summarize the literature on NER data
augmentation. In Section 3, we present our augmentation
framework, while experimental results are reported in
Section 4. We conclude our work in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Data augmentation aims to increase the amount of available training data by means of data manipulations, heuristics or external data sources. Dai and Adel [<xref ref-type="bibr" rid="ref3">3</xref>] investigate the improvements in performance obtained by augmenting NER data with simple data manipulations, such as token replacements, mention replacements, and shuffling. However, these approaches may generate too many noisy samples, which may in turn hinder the ability of the model to be effectively trained. Bartolini et al. [<xref ref-type="bibr" rid="ref4">4</xref>] address this challenge by replacing entity mentions with the most similar entities retrieved by computing context-based similarity. Zeng et al. [8] address the poor generalization of NER models.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>The methodological workflow of the proposed framework is illustrated in Figure 1. In this section, we will first provide a formalization of the few-shot BioNER problem, and then describe each module in depth, from the generation of a concepts vocabulary to the reinforcement learning cycles.</p>
      <sec id="sec-4-1">
        <title>3.1. Problem formulation</title>
        <p>The input of a NER system is a sentence s, which can be represented as a sequence of tokens s = [w_1, w_2, …, w_n]. NER outputs a list of tuples [i_s, i_e, t] representing named entities mentioned in s. Here, i_s ∈ [1, n] and i_e ∈ [1, n] are the indexes of the start and end characters of the named entity mention, while t is the entity type [12].</p>
        <p>In practice, this task is usually accomplished by producing a paired sequence of categorical values y = [y_1, y_2, …, y_n] as the output of the NER model, where y_i indicates the entity type of the i-th token. Hence, a NER dataset is defined as a collection of pairwise data D = {(s_i, y_i)}, i = 1, …, N, with N being the number of examples.</p>
        <p>For the purposes of this work, we will be using the IOB scheme to identify entity mentions. Under this scheme, each input token is mapped to the beginning (B), the inside (I), or the outside (O) of an entity mention.</p>
        <p>Our approach for the selection of the less noisy samples will consider inputs from biomedical domains, where the NER task is known as Biomedical NER (BioNER). Due to the data scarcity that usually affects such domains, we will test our system in few-shot settings, i.e. the number of training instances k is small (e.g. k ∈ {10, 50, 100}).</p>
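        <p>As a concrete illustration of this formulation, the following toy sketch (ours, not the authors' code; the helper name to_iob and the token-level span convention are assumptions for illustration) maps a sentence and its entity mention spans to the corresponding IOB tag sequence.</p>
        <preformat>
```python
# Toy illustration (not the paper's code): mapping a tokenized sentence
# and its entity mentions to an IOB tag sequence, as in Section 3.1.
def to_iob(tokens, mentions):
    """mentions: list of (start_token, end_token_inclusive, entity_type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in mentions:
        tags[start] = "B-" + etype           # first token of the mention
        for i in range(start + 1, end + 1):  # remaining mention tokens
            tags[i] = "I-" + etype
    return tags

tokens = ["Hypotension", "indicates", "low", "blood", "pressure"]
print(to_iob(tokens, [(0, 0, "Disease")]))
# ['B-Disease', 'O', 'O', 'O', 'O']
```
        </preformat>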
        <p>[Figure 1: workflow of the proposed framework. The training data (e.g. hypertensive disorder, cancer, dyspnea, …) is expanded via mention replacement into an augmented pool; a reinforcement learning agent observes states built from the content, marginals and confidence representations, takes selection actions on the augmented samples, and the selected data is used to re-train the NER model.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Generation of a vocabulary of concepts</title>
        <p>Based on the available training data, we extract all the
entity mentions, thus building a vocabulary of concepts.</p>
        <p>In this work, we test our framework by relying solely on the input training data, but this module can be easily extended to include concepts from biomedical ontologies or guided by domain experts. For example, physicians are usually aware of the ways medical concepts can be written in clinical notes; hence, if they are interested in recognizing mentions of a particular concept, they can provide a set of aliases for our system to effectively augment the original training set.</p>
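        <p>The vocabulary-building step described above can be sketched as follows (a hypothetical implementation, ours; the function name and data layout are assumptions): every maximal B-/I- tag run in the training data is collected as a mention surface form, grouped by entity type.</p>
        <preformat>
```python
# Hypothetical sketch (ours) of the concepts-vocabulary module:
# collect every entity mention surface form from IOB-labelled sentences.
from collections import defaultdict

def build_vocabulary(dataset):
    """dataset: list of (tokens, tags) pairs with IOB tags like 'B-Disease'."""
    vocab = defaultdict(set)
    for tokens, tags in dataset:
        mention, etype = [], None
        # the (None, "O") sentinel flushes a mention that ends the sentence
        for token, tag in list(zip(tokens, tags)) + [(None, "O")]:
            if tag.startswith("B-"):
                if mention:
                    vocab[etype].add(" ".join(mention))
                mention, etype = [token], tag[2:]
            elif tag.startswith("I-") and mention:
                mention.append(token)
            else:
                if mention:
                    vocab[etype].add(" ".join(mention))
                mention, etype = [], None
    return dict(vocab)

data = [(["hypertensive", "disorder", "and", "dyspnea"],
         ["B-Disease", "I-Disease", "O", "B-Disease"])]
print(build_vocabulary(data))
# {'Disease': {'hypertensive disorder', 'dyspnea'}} (set order may vary)
```
        </preformat>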
        <p>3.3. Data augmentation via mention replacement
For each sentence in our training set, to determine whether a mention should be replaced, we employ a binomial distribution. If the outcome is affirmative, we select a replacement mention from the concepts vocabulary. Subsequently, we modify the corresponding IOB-label sequence as needed. Some examples of mention replacement are provided in Table 1.</p>
        <p>The reason behind the choice of this augmentation technique lies in the high number of noisy samples it may generate, given the random nature of the mention replacement. This allows us to effectively test the ability of our framework to discard samples that may mislead the model. However, it is our belief that the performance of the framework can be further improved with more sophisticated augmentation methods, e.g. based on context similarity [<xref ref-type="bibr" rid="ref4">4</xref>] or learning patterns from cross-domain data [5].</p>
        <p>We learn how to select data from the augmented pool with a module based on reinforcement learning. Our method is built upon the foundations of Policy-based Active Learning [6], which has been previously demonstrated to be capable of automatically learning an active learning strategy from data, by formulating active learning as a reinforcement learning problem where the state corresponds to the unlabeled data selected for labeling, together with their labels, and the action is the selection heuristic. Specifically, we adapt the method not to work with unlabeled data and human oracles, but with the augmented pool generated in the previous step, and to learn the best strategy to select the samples that may mostly benefit the performance of the NER model.</p>
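        <p>A minimal sketch (ours, not the authors' implementation) of the mention replacement step under these assumptions: each mention is kept or replaced according to a Bernoulli draw, a replacement is sampled from the concepts vocabulary, and the IOB label sequence is rewritten to match the new mention length.</p>
        <preformat>
```python
# Illustrative sketch (ours) of mention replacement with IOB relabelling.
import random

def replace_mentions(tokens, tags, vocab, p=0.5, rng=random):
    out_tokens, out_tags, i = [], [], 0
    while i != len(tokens):
        tag = tags[i]
        # Bernoulli(p) decision on each mention start
        if tag.startswith("B-") and rng.random() > 1 - p:
            etype = tag[2:]
            j = i + 1
            while j != len(tokens) and tags[j] == "I-" + etype:
                j += 1                       # skip the old mention span
            new_mention = rng.choice(sorted(vocab[etype])).split()
            out_tokens.extend(new_mention)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new_mention) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags
```
        </preformat>
        <p>For instance, with p = 1 and a vocabulary containing only "dyspnea" for the Disease type, "Hypotension indicates low blood pressure" becomes "dyspnea indicates low blood pressure" with an unchanged label pattern.</p>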
      </sec>
      <sec id="sec-4-3">
        <title>3.4. Reinforcement learning cycles</title>
        <sec id="sec-4-3-4">
          <p>[Table 1: examples of data augmentation via mention replacement (entity mentions in bold in the original). "If untreated, hemochromatosis can cause serious illness and early death, but the disease is still substantially underdiagnosed." becomes "If untreated, mononucleosis can cause serious illness and early death, but the disease is still substantially underdiagnosed."; "When expressed in Escherichia coli, SH-PTP2 displays tyrosine-specific phosphatase activity" becomes "When expressed in Escherichia coli, PTPN6 displays tyrosine-specific phosphatase activity".]</p>
          <p>While Fang et al. [6] make a streaming assumption, i.e. unlabelled data arrive one by one and the agent decides the action to take, we assume batch-based learning: the augmented pool is entirely available, and the reward is computed on the set of actions that the agent has decided to take on the whole dataset in each cycle.</p>
          <p>In the remainder of this section, we provide in-depth details on the components of the reinforcement learning process.</p>
          <p>3.4.1. States
We represent the state of each sentence s in the augmented pool at time i by taking into consideration both an embedded representation of its content and the outputs of the NER model Θ_i trained over the selected data at time i. Specifically, the state consists in the concatenation of the three representations described in the following: content, marginals and confidence. We denote with S_i the set of states at time i.</p>
          <p>Content
Following Kim [13], we first encode each of the n tokens w_i in the sentence to produce a matrix X = {x_1, x_2, …, x_n}, and then apply a convolutional neural network, which consists in a series of filters using linear transformations followed by ReLU activation functions; the last layer of the network performs a max-pooling operation that provides the representation of the sentence content h_c.</p>
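        <p>The content representation can be sketched in numpy as follows (embedding size, filter count and window width are illustrative assumptions, not the paper's hyper-parameters): windows of token embeddings pass through linear filters with ReLU, and a max-pool over time yields h_c.</p>
        <preformat>
```python
# Minimal numpy sketch (ours) of a Kim-style content encoder.
import numpy as np

def content_representation(X, filters, width=3):
    """X: (n_tokens, d) embeddings; filters: (n_filters, width*d) weights."""
    n, d = X.shape
    # slide a window of `width` tokens and flatten each window
    windows = [X[i:i + width].reshape(-1) for i in range(n - width + 1)]
    H = np.maximum(0.0, np.stack(windows) @ filters.T)  # conv + ReLU
    return H.max(axis=0)                                # max-pool over time: h_c

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 tokens, 8-dim embeddings (assumed sizes)
W = rng.normal(size=(16, 3 * 8))   # 16 filters of width 3
h_c = content_representation(X, W)
print(h_c.shape)  # (16,)
```
        </preformat>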
          <p>Marginals
Let P_Θ(y|s) indicate the prediction outputs of the NER model given the input sentence s. Another convolutional neural network is used to represent the predictive marginals, i.e. the probability distributions associated to all the tokens in s. Following Fang et al. [6], the convolutional layer contains filters activated with ReLU, applied with a window width of 3 and a height equal to the number of classes (3 in our case, i.e. I, O and B). Padding is used to ensure a wide convolution, and mean pooling is used to allow the network to effectively capture the average uncertainty in each window. The final hidden layer outputs the representation of the predictive marginals h_m.</p>
          <p>Confidence
Following Fang et al. [6], we represent the confidence by computing the probability of the most probable sequence of labels under the model, c = max_y P_Θ(y|s), normalized by n, the length of the sentence.</p>
          <p>3.4.2. Actions
Given the state of each input sample, an agent has to decide whether or not to select it to re-train the NER model. Thus, for each sentence s in the augmented pool, the agent selects either to use it (a = 1) or not (a = 0).</p>
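        <p>Assuming token-independent marginals for illustration (the paper's exact normalization is not fully specified in the extraction), the confidence feature can be sketched as a length-normalized probability of the most likely labelling:</p>
        <preformat>
```python
# Sketch (ours) of the confidence feature: probability of the most likely
# label sequence under per-token marginals, length-normalised as a
# geometric mean so that long sentences are comparable to short ones.
import math

def confidence(marginals):
    """marginals: one dict of label probabilities per token, e.g. {'B': 0.7, ...}."""
    log_p = sum(math.log(max(m.values())) for m in marginals)
    return math.exp(log_p / len(marginals))

m = [{"B": 0.7, "I": 0.2, "O": 0.1}, {"B": 0.1, "I": 0.1, "O": 0.8}]
print(round(confidence(m), 3))  # geometric mean of 0.7 and 0.8 -> 0.748
```
        </preformat>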
          <p>We denote the set of actions made at time i with A_i.</p>
          <p>3.4.3. Reward
The reward provides feedback on the quality of the decisions made by the agent. At each step i, the reward is defined as the change in held-out performance:
R(S_i−1, A_i) = Performance(Θ_i) − Performance(Θ_i−1),   (1)
where Performance(⋅) is a measure of the model's quality. In our work, we compute the F1 score to determine rewards. Note that the value of R could also be negative, i.e. the actions made by the agent have a detrimental effect on the performance.</p>
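        <p>The reward of Eq. (1) can be sketched as follows; the span-level F1 below is a simplified stand-in (ours) for the seqeval implementation used in the paper:</p>
        <preformat>
```python
# Sketch (ours) of the reward: change in held-out span-level F1 between
# consecutive model versions. May be negative if the selection hurts.
def span_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold.intersection(pred))     # exact-match true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def reward(f1_current, f1_previous):
    return f1_current - f1_previous       # Eq. (1): Performance delta

gold = [(0, 1, "Disease"), (4, 4, "Disease")]
pred = [(0, 1, "Disease")]
print(span_f1(gold, pred))  # precision 1.0, recall 0.5
```
        </preformat>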
          <p>3.4.4. Deep Q-Network
We adopt a deep Q-learning [7] approach where the utility of choosing the action a_i from state s_i is evaluated by the Q function Q_π(s_i, a_i) according to the policy π. The Q-function is iteratively updated by the agent by considering the rewards obtained in each episode.</p>
          <p>The deep Q-network (DQN) consists in a single hidden layer which takes the state vector of a single instance, s_i = [h_c, h_m, c], as input and uses a ReLU activation function to output the two scalar values Q(s_i, a_i) associated to the two possible actions a_i ∈ {0, 1}.</p>
          <p>The training objective is to minimize the difference between the estimated Q-value and the true Q-value for a given state-action pair. This is typically done by using a variant of the Q-learning algorithm known as the Bellman equation, which recursively defines the Q-value for a state-action pair as the immediate reward plus the discounted future Q-value for the next state-action pair.</p>
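        <p>A minimal numpy sketch (ours; layer sizes are illustrative assumptions) of the single-hidden-layer DQN described above: the state vector is mapped through a ReLU hidden layer to two Q-values, one per action, and a temporal-difference target is formed from the reward and the discounted best next-state Q-value.</p>
        <preformat>
```python
# Minimal numpy sketch (ours) of the single-hidden-layer DQN.
import numpy as np

rng = np.random.default_rng(1)
state_dim, hidden = 33, 32               # e.g. h_c (16) + h_m (16) + c (1), assumed
W1, b1 = rng.normal(size=(hidden, state_dim)), np.zeros(hidden)
W2, b2 = rng.normal(size=(2, hidden)), np.zeros(2)

def q_values(s):
    h = np.maximum(0.0, W1 @ s + b1)     # ReLU hidden layer
    return W2 @ h + b2                   # Q(s, 0) and Q(s, 1)

def td_target(reward_r, next_state, gamma=0.99):
    # immediate reward plus discounted best Q-value of the next state
    return reward_r + gamma * q_values(next_state).max()

s = rng.normal(size=state_dim)
q = q_values(s)
action = int(q.argmax())                 # greedy selection decision
print(q.shape, action in (0, 1))
```
        </preformat>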
        </sec>
        <sec id="sec-4-3-7">
          <p>Mathematically, this can be expressed as:
Q(s_i, a_i) = E[r_i + γ ⋅ max_a_i+1 Q(s_i+1, a_i+1)],   (2)
where Q(s, a) is the Q-value for state s and action a, γ ∈ [0, 1] is the discount factor, and max_a_i+1 Q(s_i+1, a_i+1) is the maximum Q-value over all possible actions in the next state.</p>
          <p>The goal of the Q-learning algorithm is to update the Q-network weights θ to minimize the mean squared error between the estimated Q-value Q(s, a; θ) and the target Q-value y_i:
L(θ) = E[(y_i(s_i, a_i+1) − Q(s_i, a_i; θ))²],   (3)
where y_i(s_i, a_i+1) = r_i + γ ⋅ max_a_i+1 Q(s_i+1, a_i+1; θ_i−1) is the target Q-value based on the previous parameters θ_i−1, and results are averaged over a minibatch of samples. Learning updates are based on stochastic gradient descent.</p>
          <p>4. Experiments
In this section, we provide an in-depth description of the experiments we ran to assess the performance of our system. First, we describe the experimental setup in Section 4.1; then, we discuss experimental results in Section 4.2.</p>
          <p>[Table 2: performance comparison of the selected methods on NCBI-Disease, BC2GM and BC5CDR in the 10-, 50- and 100-shot settings.]</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.1. Experimental setup</title>
        <p>4.1.1. Datasets
We test our method on the three popular benchmark
datasets from the biomedical field listed as follows:
• NCBI-Disease [14]: 793 abstracts from PubMed,
annotated with disorders entity mentions.
• BC2GM [15]: over 20,000 abstracts from PubMed
annotated with gene mentions.
• BC5CDR [16]: over 1,500 abstracts from PubMed
annotated with diseases and chemicals. For
simplicity, we consider only chemical entity
mentions in our experiments.</p>
        <p>For each dataset, we have considered the original
training, validation and test sets provided with their original
release.</p>
        <p>Few-shot simulations
To simulate data scarcity scenarios, we randomly sample k sentences from the training set, k ∈ {10, 50, 100}. Since performance can vary greatly based on the selected training samples, we run each experiment 5 times and always report averaged results.</p>
        <p>For each method, we always pre-train the initial model
with the available training data in the simulated few-shot
scenario. Then, we assess the performance of the same
model when it is fine-tuned with the selected samples.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.2. Results</title>
        <p>4.1.2. Training details
Given the data-scarcity nature of our work, we assume that data to tune hyper-parameters is not available. However, we test models on the entire test set. Hence, we choose hyperparameters based on previous work and practical considerations. Specifically, we use a pre-trained biomedical Transformer network [17] and train all our models for 3 epochs with a learning rate of 2 ⋅ 10⁻⁴, an AdamW optimizer, a batch size of 5 and a maximum sequence length of 256. We run our reinforcement learning framework for 5 episodes. We evaluate the quality of models in terms of precision, recall and F1 scores obtained with the seqeval Python library (https://github.com/chakki-works/seqeval).</p>
        <p>4.1.3. Hardware configuration
All experiments were conducted on the Google Colab platform, using the free tier plan, which provides a virtual machine with an Intel® Xeon® processor with a frequency of 2.3 GHz and 10 cores (only one of which is used by the VM instance), 12 GB of available memory, 78.19 GB of free disk and an NVIDIA T4 GPU with 16 GB of RAM. Due to the limits imposed by Colab's free plan, we were unable to pursue further improvements on the obtained results. Specifically, we could only run a maximum of 5 PAL episodes per experiment. Although this was sufficient to achieve the intended goals, conducting a greater number of episodes could have allowed for a more refined selection policy for the augmented instances, potentially leading to improved model performances.</p>
        <p>4.1.4. Baselines
We compare our method with the two baselines for the selection of samples from the augmented pool described as follows:
• Random: we sample a random set of instances from the augmented pool.
• Uncertainty: we leverage uncertainty-based active learning [9] as a heuristic-based framework for the selection of samples, i.e. we rank augmented samples according to the uncertainty of the model in its predictions. Since model predictions are mapped to each token in the sentence, we aggregate them to obtain a single ranking value.</p>
        <p>Table 2 presents a comprehensive comparison of the performance of the different baselines in various k-shot scenarios and datasets for BioNER tasks. We can observe that our method consistently shows competitive performance, achieving the highest F1 scores in several cases, indicating its effectiveness in correctly identifying entity mentions. The highest improvement over the random baseline can be seen in the 10-shot scenario on BC2GM data. This dataset being focused on the gene entity type, and this information being usually mentioned in many heterogeneous ways (e.g. mStaf gene, OBP, primase, V8 protease, MT), a vast amount of noise can be generated by randomly replacing mentions. This negative effect of noise is higher as the data-scarcity scenario becomes stricter, because the effect of noise can be limited when we include clean training data. Furthermore, our method consistently achieves the highest precision or rivals the top performer, as the learned selection policy reliably minimizes the occurrence of false positives.</p>
        <p>We should notice that results show high standard deviations, the quality of the model being strongly related to the sampling that generates the few-shot training set. Figure 2 illustrates the performance trends of F1 scores as we increase the number of selected data from the augmented pool in the 50-shot setting. Our method consistently surpasses the random and uncertainty-based selection curves, and the improvement is generally higher when the number of selected elements is smaller. As the number of elements increases, the three curves converge to values that are reasonably close. This trend is consistent with findings from Fang et al. [6], which demonstrated that Policy-based Active Learning achieves superior performance while using fewer selected elements. In all three plots, it can be observed that the peak performance of our method outperforms the performance obtained by the model without the augmented samples, represented by the dashed red line.</p>
        <p>[Figure 2: F1 score as a function of the number of selected samples in the 50-shot setting; panel (c) reports results on BC2GM.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion &amp; Future Work</title>
      <p>In this work, we proposed a novel approach for selecting
informative samples from an augmented pool to improve
the performance of NER models in the biomedical domain
with limited training data. Our framework leverages
policy-based active learning [6] to learn a selection policy
that identifies the most informative augmented samples
to enhance the NER model’s generalization ability.</p>
      <p>We evaluated our method on simulated few-shot
scenarios in BioNER applications, where we demonstrated
its ability to select the most informative augmented
samples first and achieve promising results compared to
selected baselines. Our approach presents a new direction
for exploring the potential of data augmentation to
improve NER models’ performance in specialized domains,
such as biomedical, where labeled data is scarce and
domain knowledge is essential.</p>
      <p>Future work should explore the robustness of our framework on real-world biomedical datasets and investigate the effectiveness of different data augmentation techniques in improving the performance of the proposed approach. The simple mention replacement approach used in the current implementation of our framework could indeed be replaced with more advanced and sophisticated approaches.</p>
      <p>Furthermore, we intend to extend our approach to other NLP tasks beyond NER, such as relation extraction and entity linking, and compare its performance with existing state-of-the-art methods. Finally, we also plan to investigate the interpretability of the learned selection policy to gain insights into the most informative features and characteristics of the augmented data samples that contribute to the NER model's improved performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>This project has been funded by PNRR MUR project PE0000013-FAIR. We also acknowledge support from NextGenerationEU via PNRR - DM 352 (CUP: E66G22000400009).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Bayer, M. Kaufhold, C. Reuter, A survey on data augmentation for text classification, ACM Comput. Surv. 55 (2023) 146:1-146:39. doi:10.1145/3544558.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. Liu, G. Xu, C. Jia, W. Ma, L. Wang, S. Vosoughi, Data boost: Text data augmentation through reinforcement learning guided conditional generation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 9031-9041. doi:10.18653/v1/2020.emnlp-main.726.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] X. Dai, H. Adel, An analysis of simple data augmentation for named entity recognition, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 3861-3867. doi:10.18653/v1/2020.coling-main.343.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] I. Bartolini, V. Moscato, M. Postiglione, G. Sperlì, A. Vignali, COSINER: context similarity data augmentation for named entity recognition, in: Similarity Search and Applications (SISAP 2022).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>