Named Entity Recognition using context similarity data augmentation

Ilaria Bartolini1,†, Angelo Chianese2,†, Vincenzo Moscato2,3,†, Marco Postiglione2,†, Giancarlo Sperlí2,3,*,† and Andrea Vignali2,†

1 Alma Mater Studiorum, University of Bologna, Via Zamboni 33, 40126, Bologna, Italy
2 University of Naples Federico II, Dept. of Electrical Engineering and Information Technology (DIETI), Via Claudio 21, 80125, Naples, Italy
3 CINI - ITEM National Lab, Complesso Universitario Monte S. Angelo, Naples, Italy

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
† These authors contributed equally.
ilaria.bartolini@unibo.it (I. Bartolini); angelo.chianese@unina.it (A. Chianese); vincenzo.moscato@unina.it (V. Moscato); marco.postiglione@unina.it (M. Postiglione); giancarlo.sperli@unina.it (G. Sperlí); andrea.vignali@unina.it (A. Vignali)

Abstract
This paper is an extended abstract of a recent work in which we introduce COSINER, a novel approach to enhancing Named Entity Recognition (NER) through data augmentation. Unlike traditional methods that risk introducing noise, COSINER leverages context similarity to substitute entity mentions with more contextually appropriate ones, yielding superior performance in limited-data scenarios. Experimental results demonstrate COSINER's effectiveness over existing baselines, with computational times comparable to basic augmentation methods and superior to pre-trained model-based approaches.

Keywords
Named Entity Recognition, Data Augmentation, Similarity Learning, Few-Shot Learning

1. Introduction

Named Entity Recognition (NER) is a crucial component of natural language processing (NLP). Its objective is to identify and classify entity mentions (e.g., persons, organizations, diseases) in unstructured text, such as diseases or genes in medical records. It thus serves as a foundational step for a variety of downstream applications, including machine translation, information discovery, knowledge graph construction, and question-answering systems.

Training a NER model typically requires vast amounts of annotated data, but obtaining quality annotations, especially in specialized domains, is time-consuming and costly. Few-shot learning, which explores strategies tailored to constrained datasets, addresses this challenge, particularly in fields lacking readily available domain specialists.

Data augmentation, a method to enlarge a dataset by generating additional samples, is commonly used to address data scarcity. In NLP, techniques such as word replacement [1], random deletion [2], word position swap [3] and generative models [4] are popular. However, the token-level nature of the NER classification task makes traditional augmentation harder to apply, requiring careful analysis of the possible approaches in this area [5]. Recent efforts explore transfer learning [6] and Masked Language Models (MLM) [7] to alleviate label misalignment and augment datasets effectively.
Moreover, while data augmentation holds promise, current manipulation methods often generate noisy and misclassified samples: the added data may contain syntactic or semantic errors, leading to inaccuracies in classification. To address this challenge, we present COntext SImilarity-based data augmentation for NER (COSINER) [8], which uses similarity metrics to generate augmented examples that closely resemble real contexts. Our approach introduces a context-based mention replacement technique, substituting mentions in the input data with contextually appropriate entities drawn from an Entity Lexicon.

In this paper, which is an extended abstract of our previous work [9], our contribution consists of the development of COSINER and an extensive evaluation across three prominent biomedical benchmark datasets, demonstrating COSINER's superiority over existing methods and highlighting its general applicability beyond the biomedical domain. Notably, COSINER's effectiveness is attributable to its ability to improve performance primarily through top-ranked samples, reducing the reliance on large augmented datasets and enhancing computational efficiency.

2. Methodology

COSINER uses mention replacement to expand the initial training set, a technique previously explored by Dai et al. [5]. While their method randomly substitutes entities within sentences using a binomial distribution, we introduce a systematic approach centered on similarity, where entity mentions are replaced with counterparts that closely match in syntax, semantics, and context. Despite the quadratic time complexity of our methodology, equal to $O(mn^2)$ for computing cosine similarity between $n$ embeddings of size $m$, the time spent generating new examples remains negligible. Figure 1 provides an overview of our methodological flow, elaborated further in the following paragraphs.

Figure 1: COSINER methodological flow: (1) the original training set is used to create a Lexicon of all entities; (2) entities are embedded into a vector space based on the sentences containing at least one mention; (3) similarity scores between pairs of embeddings are computed to associate each entity with a ranked list of similar entities; (4) an augmented training set is generated; (5) the model is trained on both the original and the augmented training set.

Lexicon generation. Each entity in the training set, referred to as a concept, needs to be collected for replacement purposes. A concept can comprise a single word or a group of words, and the Lexicon $C_{concept}$ also records the frequency with which each concept appears in the training set. The size of the Lexicon varies with the number of mentions in the dataset. It is worth emphasizing that although the size of the Lexicon influences the speed of computing similarity values between entity pairs, this is not a practical constraint, particularly as we conduct experiments in few-shot scenarios.
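To make this step concrete, the following minimal Python sketch shows one way such a frequency-annotated Lexicon could be built from IOB2-tagged sentences. The data format, function name, and toy corpus are our own illustrative assumptions, not the actual COSINER implementation.

```python
from collections import Counter

def build_lexicon(sentences):
    """Collect every entity mention (concept) in an IOB2-tagged corpus,
    together with its frequency. `sentences` is assumed to be a list of
    (tokens, tags) pairs, e.g. (["EGFR", "binds"], ["B-Gene", "O"])."""
    lexicon = Counter()
    for tokens, tags in sentences:
        mention = []
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):          # a new mention begins
                if mention:
                    lexicon[" ".join(mention)] += 1
                mention = [token]
            elif tag.startswith("I-") and mention:  # mention continues
                mention.append(token)
            else:                             # outside any mention
                if mention:
                    lexicon[" ".join(mention)] += 1
                mention = []
        if mention:                           # mention at end of sentence
            lexicon[" ".join(mention)] += 1
    return lexicon

# Toy usage: two sentences sharing one multi-word disease mention
corpus = [
    (["Patients", "with", "breast", "cancer"], ["O", "O", "B-Disease", "I-Disease"]),
    (["Treatment", "of", "breast", "cancer"], ["O", "O", "B-Disease", "I-Disease"]),
]
print(build_lexicon(corpus))  # Counter({'breast cancer': 2})
```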
Embeddings extraction. In order to compute entity similarities, we need a comprehensive representation $V_{concept}$ of every Lexicon concept, which serves as a viable input for our predictive model. We employ a pre-trained language model as a feature extractor [10, 11] to process each sentence containing a given mention from the Lexicon, mapping each token to its word embedding $V_{context}$ (i.e., an array of numerical features representing the token in its context). When a mention consists of multiple tokens, $V_{context}$ is obtained by averaging the word embeddings of all its tokens. Upon retrieving $V_{context}$, the numerical representation of the concept $V_{concept}$ is updated using the formula:

$V_{concept} = V_{concept} + lr \cdot (1 - sim) \cdot V_{context}$

where $lr$ denotes a regularization term given by the inverse of the frequency of the mention across the entire dataset, and $sim$ represents the cosine similarity between $V_{concept}$ and $V_{context}$. Initially, $V_{concept}$ is set to the $V_{context}$ value of the first sentence in which the mention appears.
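The update rule can be sketched directly in a few lines of NumPy. This is an illustrative reading of the formula above, assuming the contextual vectors have already been extracted (and token-averaged) by the feature extractor and that $lr$ is one over the mention frequency; all names are hypothetical.

```python
import numpy as np

def update_concept_embedding(v_concept, v_context, freq):
    """One update step: V_concept += lr * (1 - sim) * V_context,
    with lr = 1 / freq (the mention's dataset-wide frequency)."""
    sim = np.dot(v_concept, v_context) / (
        np.linalg.norm(v_concept) * np.linalg.norm(v_context)
    )
    lr = 1.0 / freq
    return v_concept + lr * (1.0 - sim) * v_context

def concept_embedding(context_vectors, freq):
    """Fold all contextual occurrences of a mention into one V_concept;
    the first occurrence initializes the representation."""
    v_concept = context_vectors[0].copy()
    for v_context in context_vectors[1:]:
        v_concept = update_concept_embedding(v_concept, v_context, freq)
    return v_concept

# Toy usage: random 768-d vectors standing in for Transformer features
rng = np.random.default_rng(0)
contexts = [rng.standard_normal(768) for _ in range(3)]
print(concept_embedding(contexts, freq=3).shape)  # (768,)
```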
Similarity computation. We calculate the cosine similarity between the embeddings $V_{concept}$ of every pair of entities in the Lexicon, deriving for each Lexicon entry a ranked list of similarity scores $z_{ij} = sim(V_{concept}^{i}, V_{concept}^{j})$. We define two ranking criteria:

1) Maximum (descending order): concepts with the highest relatedness are placed at the top of the list. This criterion facilitates the generation of realistic augmented samples that preserve the contextual consistency of sentences.

2) Minimum (ascending order): by considering the least similar entities first, we include the samples farthest from the knowledge boundary. This inclusion enables the recognition and accurate classification of extreme cases.

Augmented set generation. The augmented set is constructed from all sentences featuring at least one mention. Each sentence is assigned a similarity value $s_m$, computed as the mean of the entity similarity scores $z_{ij}$ of the additional entities present within the sentence. We employ two strategies, sketched in code after this list:

1) Local Augmentation: each sentence results in the generation of $k$ new samples, ensuring that every training instance contributes to the augmented set.

2) Global Augmentation: as in the previous strategy, $k$ new samples are generated for each sentence. Subsequently, we rank all newly generated sentences in a single list based on their similarity value $s_m$ and select the first $h$ elements.

In Figure 2 we emphasize the distinctions between the two strategies.

Figure 2: COSINER augmentation strategies. Both the local and the global strategy start by generating $k$ augmented sentences per phrase with at least one mention, using Mention Replacement (MR) and the similarity lists derived from the training set. Each augmented example is then assigned a sentence similarity value $s_m$. In the local strategy, the new training set comprises all augmented examples. In the global approach, a single list is built by sorting the examples by their $s_m$ values, and the top $h$ sentences are selected for the augmented training set.
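The following Python sketch illustrates the ranking and selection logic behind the two strategies. It simplifies the method: each sentence is assumed to carry a single mention, so its $s_m$ reduces to the score of the one substituted entity, and mention replacement is a plain string substitution; the function names and data structures are our assumptions rather than the paper's code.

```python
import numpy as np

def similarity_lists(concepts, embeddings, descending=True):
    """Ranked similarity list for every concept in the Lexicon.
    `concepts` is a list of mention strings; `embeddings` maps each
    concept to its V_concept vector. descending=True implements the
    Maximum criterion, descending=False the Minimum criterion."""
    E = np.stack([embeddings[c] for c in concepts])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    Z = E @ E.T                                       # all cosine sims: the O(m n^2) step
    lists = {}
    for i, c in enumerate(concepts):
        order = np.argsort(-Z[i]) if descending else np.argsort(Z[i])
        lists[c] = [(concepts[j], float(Z[i, j])) for j in order if j != i]
    return lists

def augment(sentences, sim_lists, k, strategy="local", h=None):
    """Mention Replacement driven by the similarity lists. Each sentence
    (assumed to contain one mention) yields up to k augmented copies,
    each tagged with its similarity value s_m; 'local' keeps them all,
    'global' keeps only the top-h copies overall."""
    augmented = []
    for text, mention in sentences:
        for replacement, s_m in sim_lists[mention][:k]:
            augmented.append((text.replace(mention, replacement), s_m))
    if strategy == "global":
        augmented.sort(key=lambda pair: pair[1], reverse=True)
        augmented = augmented[:h]
    return augmented
```

In a full implementation, $s_m$ would be the mean of the $z_{ij}$ scores of all replaced entities in a sentence, and the replacement step would also adjust the IOB2 tags to match the length of the new mention.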
NER model training. We adhere to the IOB2 scheme for the NER token-classification task [12], in which each token is labeled as the beginning of an entity (B), inside an entity (I), or outside any entity (O). The original training dataset and the augmented samples are fed into a Transformer network backbone [10, 11], and the model parameters are optimized via cross-entropy minimization.
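As an illustration of this training step, here is a minimal sketch using the Hugging Face transformers library, with a generic bert-base-cased backbone standing in for the BERT/BioBERT models used in the paper. The subword label alignment (masking special tokens with the index -100) is a common convention we assume, not a detail specified in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# IOB2 label set for a single entity type (e.g., Disease)
labels = ["O", "B-Disease", "I-Disease"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

# One (original or augmented) training example with word-level IOB2 tags
tokens = ["Patients", "with", "breast", "cancer"]
tags = ["O", "O", "B-Disease", "I-Disease"]

# Tokenize into subwords and align word-level tags to subword positions;
# special tokens ([CLS], [SEP]) get -100 so cross-entropy ignores them
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
aligned = [-100 if w is None else label2id[tags[w]] for w in enc.word_ids()]
labels_t = torch.tensor([aligned])

# Forward pass: the model returns the token-level cross-entropy loss
out = model(**enc, labels=labels_t)
out.loss.backward()   # an optimizer.step() would follow in a real loop
print(float(out.loss))
```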
3. Experimental Analysis

We conduct training and evaluation on three renowned benchmark datasets sourced from biomedical articles: i) NCBI-Disease [13], comprising 793 PubMed abstracts with 6,881 disease entities; ii) BC5CDR [14], comprising 1,500 PubMed articles containing 15,935 chemical mentions; and iii) BC2GM [15], comprising 20,000 sentences extracted from PubMed abstracts, involving 20,702 gene entities.

We delineate three distinct few-shot scenarios, each characterized by the percentage of samples drawn from the available corpora: specifically, 2%, 5%, and 10%. All experimental findings are reported within these few-shot scenarios. Dataset statistics and few-shot scenario details are summarized in Table 1.

Table 1: Statistics of the datasets used (Train/Dev/Test are dataset splits; 2%/5%/10% are few-shot sizes).

| Dataset      | Entity type | N. Annotations | Train | Dev  | Test | 2%  | 5%  | 10%  |
|--------------|-------------|----------------|-------|------|------|-----|-----|------|
| NCBI-disease | Disease     | 6881           | 5425  | 924  | 941  | 108 | 271 | 542  |
| BC5CDR       | Chemical    | 15411          | 4561  | 4582 | 4798 | 91  | 228 | 456  |
| BC2GM        | Gene        | 20703          | 12575 | 2520 | 5039 | 251 | 628 | 1257 |

3.1. Hyperparameter tuning

Table 2 presents the results achieved with the various parameter configurations for similarity computation (Maximum vs Minimum) and augmented set generation (Local vs Global) discussed in Section 2.

Table 2: Exploration of optimal strategies for COSINER (F1 scores, mean ± std).

| Dataset size | Similarity | Strategy | NCBI-Disease  | BC5CDR        | BC2GM         |
|--------------|------------|----------|---------------|---------------|---------------|
| 2%           | Maximum    | Global   | 0.688 ± 0.077 | 0.830 ± 0.023 | 0.658 ± 0.036 |
| 2%           | Minimum    | Global   | 0.683 ± 0.086 | 0.823 ± 0.032 | 0.652 ± 0.027 |
| 2%           | Maximum    | Local    | 0.689 ± 0.088 | 0.832 ± 0.022 | 0.665 ± 0.038 |
| 2%           | Minimum    | Local    | 0.692 ± 0.081 | 0.824 ± 0.015 | 0.659 ± 0.049 |
| 5%           | Maximum    | Global   | 0.765 ± 0.035 | 0.858 ± 0.023 | 0.717 ± 0.007 |
| 5%           | Minimum    | Global   | 0.756 ± 0.028 | 0.853 ± 0.029 | 0.713 ± 0.009 |
| 5%           | Maximum    | Local    | 0.760 ± 0.031 | 0.863 ± 0.042 | 0.726 ± 0.022 |
| 5%           | Minimum    | Local    | 0.764 ± 0.041 | 0.860 ± 0.031 | 0.714 ± 0.007 |
| 10%          | Maximum    | Global   | 0.807 ± 0.038 | 0.880 ± 0.018 | 0.760 ± 0.020 |
| 10%          | Minimum    | Global   | 0.807 ± 0.029 | 0.873 ± 0.016 | 0.761 ± 0.012 |
| 10%          | Maximum    | Local    | 0.816 ± 0.066 | 0.882 ± 0.007 | 0.767 ± 0.023 |
| 10%          | Minimum    | Local    | 0.807 ± 0.038 | 0.876 ± 0.016 | 0.760 ± 0.009 |

As anticipated, the Maximum similarity criterion generally yields superior performance, since the augmented samples are plausible and closer to the test distribution. Nevertheless, the notable performance achieved with the Minimum configuration suggests that, at times, considering "distant" entities may help broaden the NER model's scope. Regarding augmented set generation, the Local criterion typically performs better, owing to its augmentation of all sentences in the original dataset. In summary, Maximum Local emerges as the most favorable overall strategy.

When creating an augmented dataset, the number of augmented samples is a crucial parameter to consider. We therefore conducted experiments using three distinct budgets for the augmented set: small (100 samples), medium (300 samples), and large (500 samples). Figure 3 illustrates the results obtained across the three benchmark datasets. Because the similarity-based approach prioritizes the most informative examples in the top-ranked positions, only minimal discrepancies are observed when using higher budgets.

Figure 3: Comparison of outcomes among the small, medium, and large budget allocations for the local augmentation strategy using the maximum similarity technique (panels: NCBI-Disease, BC5CDR, BC2GM; y-axis: F1 score; x-axis: dataset size at 2%, 5%, 10%).

4. Results

We contrast our top-performing results with baselines drawn from the current literature [5], as follows:

• No Augmentation: results obtained using the original training set, assessed with a BERT or BioBERT pre-trained model.
• Mention Replacement (MR): for each mention in the instance, random selection of a replacement mention with the same entity type from the original training set.
• Label-wise Token Replacement (LwTR): randomly decide whether to replace each word within a sentence with any other word in the dataset sharing the same label.
• Synonym Replacement (SR): employ a binomial distribution to decide whether to replace each word within a sentence with a synonym from WordNet [16].
• Masked Entity Language Modeling (MELM): employ a pre-trained RoBERTa model as MLM to predict masked tokens within the training set, then use the augmented dataset to train a BERT model.
• Cross-Domain Named Entity Recognition (style_NER): use additional data to transfer knowledge from a source domain to a target domain.

Table 3 compares the precision, recall, and F1 scores of the baselines with our method, using the best-performing COSINER configuration for each dataset and scenario. Results indicate that COSINER outperforms most baselines across scenarios and datasets. While it consistently achieves the highest recall scores, signifying the system's ability to identify more of the entity mentions present in the corpus, COSINER falls short of SR in terms of precision in some scenarios. This suggests that the augmentation process may generate a higher number of false positives.

Table 3: Comparative results between baselines and our best strategy (NCBI = NCBI-Disease, CDR = BC5CDR, GM = BC2GM; F1/P/R = F1 score, Precision, Recall; mean ± std).

| Size | Method                    | NCBI F1     | NCBI P      | NCBI R      | CDR F1      | CDR P       | CDR R       | GM F1       | GM P        | GM R        |
|------|---------------------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| 2%   | No augmentation           | 0.430±0.193 | 0.403±0.169 | 0.461±0.225 | 0.628±0.179 | 0.625±0.185 | 0.634±0.215 | 0.510±0.036 | 0.448±0.015 | 0.592±0.082 |
| 2%   | No augmentation (BioBERT) | 0.651±0.122 | 0.619±0.100 | 0.688±0.162 | 0.792±0.067 | 0.799±0.058 | 0.786±0.110 | 0.644±0.031 | 0.600±0.057 | 0.695±0.022 |
| 2%   | MR                        | 0.666±0.084 | 0.626±0.100 | 0.710±0.067 | 0.813±0.032 | 0.806±0.060 | 0.822±0.071 | 0.640±0.020 | 0.593±0.062 | 0.696±0.049 |
| 2%   | LwTR                      | 0.677±0.101 | 0.637±0.125 | 0.723±0.080 | 0.828±0.019 | 0.808±0.052 | 0.850±0.075 | 0.642±0.037 | 0.591±0.059 | 0.704±0.019 |
| 2%   | SR                        | 0.692±0.103 | 0.649±0.132 | 0.742±0.084 | 0.813±0.032 | 0.811±0.085 | 0.835±0.064 | 0.662±0.033 | 0.619±0.058 | 0.710±0.029 |
| 2%   | MELM                      | 0.578±0.038 | 0.545±0.046 | 0.615±0.041 | 0.754±0.019 | 0.719±0.047 | 0.795±0.036 | 0.566±0.011 | 0.504±0.006 | 0.647±0.027 |
| 2%   | style_NER                 | 0.581±0.061 | 0.537±0.076 | 0.636±0.067 | 0.752±0.018 | 0.713±0.041 | 0.796±0.016 | 0.581±0.003 | 0.540±0.018 | 0.631±0.025 |
| 2%   | COSINER (ours)            | 0.689±0.088 | 0.629±0.078 | 0.764±0.110 | 0.832±0.022 | 0.814±0.080 | 0.853±0.066 | 0.665±0.038 | 0.614±0.065 | 0.724±0.025 |
| 5%   | No augmentation           | 0.621±0.055 | 0.572±0.088 | 0.680±0.054 | 0.757±0.039 | 0.730±0.062 | 0.788±0.121 | 0.612±0.022 | 0.563±0.030 | 0.671±0.077 |
| 5%   | No augmentation (BioBERT) | 0.735±0.041 | 0.706±0.051 | 0.767±0.062 | 0.850±0.020 | 0.836±0.010 | 0.865±0.048 | 0.711±0.012 | 0.680±0.028 | 0.744±0.019 |
| 5%   | MR                        | 0.743±0.048 | 0.712±0.045 | 0.776±0.059 | 0.849±0.021 | 0.834±0.030 | 0.865±0.026 | 0.713±0.006 | 0.675±0.020 | 0.755±0.024 |
| 5%   | LwTR                      | 0.743±0.072 | 0.710±0.066 | 0.780±0.086 | 0.860±0.039 | 0.846±0.017 | 0.876±0.067 | 0.699±0.012 | 0.660±0.024 | 0.742±0.029 |
| 5%   | SR                        | 0.758±0.044 | 0.719±0.049 | 0.800±0.049 | 0.858±0.030 | 0.841±0.033 | 0.875±0.067 | 0.719±0.011 | 0.684±0.023 | 0.758±0.019 |
| 5%   | MELM                      | 0.678±0.034 | 0.647±0.037 | 0.713±0.035 | 0.800±0.020 | 0.769±0.043 | 0.835±0.030 | 0.629±0.010 | 0.587±0.010 | 0.677±0.021 |
| 5%   | style_NER                 | 0.687±0.040 | 0.662±0.038 | 0.714±0.042 | 0.805±0.015 | 0.793±0.020 | 0.818±0.020 | 0.640±0.005 | 0.594±0.018 | 0.695±0.017 |
| 5%   | COSINER (ours)            | 0.760±0.031 | 0.721±0.029 | 0.805±0.057 | 0.863±0.042 | 0.839±0.040 | 0.892±0.058 | 0.726±0.022 | 0.692±0.013 | 0.767±0.030 |
| 10%  | No augmentation           | 0.712±0.056 | 0.670±0.065 | 0.760±0.046 | 0.804±0.032 | 0.781±0.046 | 0.829±0.054 | 0.669±0.019 | 0.626±0.026 | 0.720±0.045 |
| 10%  | No augmentation (BioBERT) | 0.791±0.028 | 0.760±0.024 | 0.825±0.036 | 0.875±0.013 | 0.858±0.020 | 0.892±0.028 | 0.759±0.017 | 0.734±0.019 | 0.786±0.016 |
| 10%  | MR                        | 0.794±0.018 | 0.761±0.025 | 0.831±0.019 | 0.874±0.034 | 0.859±0.038 | 0.889±0.040 | 0.754±0.010 | 0.724±0.013 | 0.787±0.032 |
| 10%  | LwTR                      | 0.789±0.023 | 0.756±0.034 | 0.825±0.036 | 0.882±0.017 | 0.870±0.021 | 0.893±0.022 | 0.741±0.012 | 0.712±0.023 | 0.772±0.025 |
| 10%  | SR                        | 0.803±0.033 | 0.776±0.033 | 0.832±0.053 | 0.883±0.018 | 0.862±0.016 | 0.904±0.021 | 0.763±0.012 | 0.738±0.019 | 0.788±0.020 |
| 10%  | MELM                      | 0.740±0.017 | 0.712±0.019 | 0.770±0.016 | 0.841±0.010 | 0.824±0.013 | 0.858±0.019 | 0.685±0.006 | 0.647±0.008 | 0.728±0.010 |
| 10%  | style_NER                 | 0.745±0.014 | 0.738±0.018 | 0.752±0.014 | 0.838±0.012 | 0.829±0.025 | 0.847±0.021 | 0.694±0.004 | 0.660±0.009 | 0.732±0.010 |
| 10%  | COSINER (ours)            | 0.816±0.066 | 0.780±0.014 | 0.856±0.068 | 0.882±0.007 | 0.861±0.022 | 0.914±0.020 | 0.767±0.023 | 0.738±0.026 | 0.798±0.015 |

5. Conclusion

In this study, we have employed a context similarity-based approach to generate augmented data, aiming to enhance the performance of NER tasks while mitigating the adverse effects of the noisy and mislabeled data commonly produced by existing techniques. Our experiments in the medical domain, where data augmentation is particularly crucial, underscore the efficacy of our method: we have demonstrated its superiority over several state-of-the-art baselines, with comparable or improved execution times.

Looking ahead, our approach holds promise for integration with complementary techniques beyond Mention Replacement. Future investigations will explore its applicability across diverse contexts and with various entity types, fostering a deeper understanding of its potential and versatility.

Acknowledgments

This work has been funded by the project NextGenerationEU via PNRR - DM 352 (CUP: E66G22000400009). We acknowledge financial support from the PNRR project "Future Artificial Intelligence Research (FAIR)", CUP E63C22002150007.

References
[1] H. Cai, H. Chen, Y. Song, C. Zhang, X. Zhao, D. Yin, Data manipulation: Towards effective instance learning for neural dialogue generation via learning to augment and reweight, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6334–6343.
[2] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388.
[3] J. Min, R. T. McCoy, D. Das, E. Pitler, T. Linzen, Syntactic data augmentation increases robustness to inference heuristics, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 2339–2352.
[4] K. M. Yoo, Y. Shin, S.-g. Lee, Data augmentation for spoken language understanding via joint variational generation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 7402–7409.
[5] X. Dai, H. Adel, An analysis of simple data augmentation for named entity recognition, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 3861–3867.
[6] S. Chen, G. Aguilar, L. Neves, T. Solorio, Data augmentation for cross-domain named entity recognition, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5346–5356.
[7] R. Zhou, X. Li, R. He, L. Bing, E. Cambria, L. Si, C. Miao, MELM: Data augmentation with masked entity language modeling for low-resource NER, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2251–2262.
[8] I. Bartolini, V. Moscato, M. Postiglione, G. Sperlì, A. Vignali, COSINER: Context similarity data augmentation for named entity recognition, in: International Conference on Similarity Search and Applications, Springer, 2022, pp. 11–24.
[9] I. Bartolini, V. Moscato, M. Postiglione, G. Sperlì, A. Vignali, Data augmentation via context similarity: An application to biomedical named entity recognition, Information Systems 119 (2023) 102291.
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901.
[12] L. A. Ramshaw, M. P. Marcus, Text chunking using transformation-based learning, in: Natural Language Processing Using Very Large Corpora, Springer, 1999, pp. 157–176.
[13] R. Doğan, R. Leaman, Z. Lu, NCBI disease corpus: a resource for disease name recognition and concept normalization, Journal of Biomedical Informatics 47 (2014) 1–10.
[14] J. Li, Y. Sun, R. Johnson, D. Sciaky, C. Wei, R. Leaman, A. Davis, C. Mattingly, T. Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database 2016 (2016).
[15] L. Smith, L. Tanabe, R. Ando, et al., The BioCreative II - Critical Assessment for Information Extraction in Biology Challenge, 2008.
[16] G. A. Miller, WordNet: A lexical database for English, Commun. ACM 38 (1995) 39–41.