Comparing Transformer-based NER approaches for analysing textual medical diagnoses

Marco Polignano, Marco de Gemmis and Giovanni Semeraro
University of Bari Aldo Moro, Via E. Orabona 4, 70125, Bari, Italy

Abstract
The automated analysis of medical documents has attracted growing research interest in recent years, as a consequence of the social relevance of the topic and of the difficulties often encountered with short and very specific documents. In particular, this active area of research has stimulated the development of several techniques for automatic document classification, question answering, and named entity recognition (NER). Nevertheless, many open issues must be addressed to obtain results that are satisfactory for a field in which the reliability of predictions is fundamental, since mistakes could compromise people's lives. To this end, we focused on the named entity recognition task on medical documents and, in this work, we discuss the results obtained by our hybrid approach. In order to take advantage of the most relevant findings in the field of natural language processing, we decided to focus on deep neural network models. We compared several configurations of our model by varying the transformer architecture, such as BERT, RoBERTa and ELECTRA, until we obtained the configuration that we considered the best for our goals. The most promising model was used to participate in the SpRadIE task of the annual CLEF (Conference and Labs of the Evaluation Forum). The obtained results are encouraging and can serve as a reference for future studies on the topic.

1. Introduction
Different definitions of digital health [1] are provided in the scientific literature, and many of them focus on the use of smart systems that enable the delivery of medical and health services directly to patients.
Those systems rely on innovative technologies that not only offer users functionalities such as recommendations, monitoring services, and document archiving, but also collect an enormous amount of data that needs to be automatically processed in order to be correctly dispatched to physicians, specialist doctors or the national health network. Generally speaking, eHealth approaches can be grouped into two macro-categories according to their domain of application:
• Wellness: all those applications for fitness and for supporting a correct lifestyle;
• Disease & Treatment Management: all those systems for the management of a pathology, supporting all the phases from diagnosis to treatment and monitoring.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
marco.polignano@uniba.it (M. Polignano); marco.degemmis@uniba.it (M. de Gemmis); giovanni.semeraro@uniba.it (G. Semeraro)
ORCID: 0000-0002-3939-0136 (M. Polignano); 0000-0002-2007-9559 (M. de Gemmis); 0000-0001-6883-1853 (G. Semeraro)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Both kinds of eHealth systems have recently gained a large number of new young consumers. eHealth systems and applications are quickly increasing, and the market of eHealth-related systems and mobile applications is expected to grow by 48.44% each year for the next four years¹. The number of users of eHealth platforms, which today is around 825 million worldwide, will cross the billion mark in the next four years. Unfortunately, despite the increasing amount of healthcare data available, the percentage that is annotated and usable for supervised machine learning systems is still very small.
As already stated in [2], this situation is more frequent in the medical domain than in others, due to critical limits concerning the ownership of sensitive data and the deep knowledge needed to annotate them correctly. Only a few years ago, in Europe, the regulation regarding the use of sensitive user data (GDPR) [3] was defined, and guidelines on how to manage such data in a uniform manner among the member states of the European Union were provided. Nevertheless, a simple anonymization of the data is often not enough, and a careful procedure for requesting consent for research purposes is necessary. These limitations call for artificial intelligence techniques that can make good use of the little data available, inheriting, where possible, knowledge from other similar application domains. Fortunately, in recent years, transfer learning has become a common practice for addressing various natural language processing tasks. This opportunity has allowed the growth of several applications [4] in the field of eHealth, including named entity recognition [5, 6, 7], the main topic of interest of this work. In particular, we decided to focus on the NER task for medical diagnoses in text format, thanks to the substantial amount of data made available for the SpRadIE 2021 [8, 9] competition, co-located with the Conference and Labs of the Evaluation Forum 2021. In detail, the main contributions of this work are:
• the definition of a hybrid NER model for analysing textual medical diagnoses, based on deep learning transformer-based models [10] and a conditional random field (CRF) [11];
• the comparison of different model configurations in order to identify the best-performing transformer-based model among BERT, BERT-mul, BETO, RoBERTa, XLM-RoBERTa, ELECTRA, ELECTRA-ES, ELECTRA-MED, BioBERT, BioCliBERT.
In the following, we present some relevant work useful for better understanding the technologies used in the model described in Section 3. In Section 4, we discuss the configurations evaluated and the results obtained, and we identify the configuration that performed best on the data at hand. We conclude with our considerations and directions for future work.

¹ https://www.mckinsey.com/industries/healthcare-systems-and-services/our-insights/the-era-of-exponential-improvement-in-healthcare

2. Related Work
In recent years, natural language processing has gained a lot of attention, quickly becoming one of the most active research areas. In particular, technologies based on deep learning have made it possible to learn, from vast amounts of data, models that are highly effective and versatile in many scenarios of text analysis and classification. The most famous of such transfer learning approaches is BERT [12], a transformer model that uses a bidirectional encoding model to generate a text representation that takes into account the context in which each term is used. Its applicability has been demonstrated in several works that adopt it for tasks such as sentiment analysis [13, 14, 15], question answering, named entity recognition, text summarization, and many more [16]. The use of BERT is not limited to the English language. Versions trained on different sets of data have been presented to make it suitable for scenarios where it needs to work correctly with multilingual text (BERT multilingual). It has also been trained on single-language data in order to work correctly on specific languages other than English, such as Italian (AlBERTo) [17] and Spanish (BETO) [18]. Different specializations of BERT have also been proposed for specific domains of use, including the medical one. Indeed, BioBERT [19] has been trained on biomedical literature, making it suitable for many text analysis tasks in the biological domain.
BioClinicalBERT [20], starting from BioBERT, specializes the vocabulary on clinical terms. Taking inspiration from the BERT architecture, other transformer-based models have been developed. Among these, we would like to cite RoBERTa [21], a model that evolves BERT with the aim of making the learning phase less demanding from the computational point of view while maintaining performance comparable to that of the original approach. In addition, to ensure its easy use in contexts where more than one language is used in the text, the XLM-RoBERTa [22] version was created. Similarly to RoBERTa, ELECTRA [23] follows the BERT architecture but replaces its training procedure. Instead of using a masking approach, it uses another neural network that attempts to trick the model during the training phase by replacing random tokens with fake tokens. This strategy allows ELECTRA to train a more effective final language understanding model.
The idea of using BERT as the basis for a named entity recognition task in the medical domain is not entirely new. The named entity recognition (NER) task consists of identifying n-grams of a textual content that refer to entities or names relevant for the domain of application. It is common to find in the literature NER systems able to detect locations, organizations, companies, and person names in text, but with enough training data it is possible to train those systems on any entity of interest. The task is a subproblem of information extraction and involves processing structured and unstructured documents [24]. Since it does not require the annotation of strictly personal and sensitive data, and generic texts are sufficient, this task is widely studied in the literature, including in the medical domain of application. Mao et al. [25] use BERT and Conditional Random Field (CRF) models to identify the names of hospitals, medical staff, and territories in medical records. Similarly, Xue et al.
[26] fine-tuned the BERT model in order to detect entities and relations in Chinese medical documents. Vunikili et al. [27] explored the use of Bidirectional Encoder Representations from Transformers (BERT) based contextual embeddings trained on general-domain Spanish text to extract tumor morphology from clinical reports written in Spanish. In a similar way, it is not difficult to find attempts that use the RoBERTa model [28], ELECTRA [29] or BioBERT [30]. All these works fail to evaluate the efficacy of different transformer models before choosing the one to use. On the contrary, we focus on developing a hybrid model that merges an arbitrary transformer-based architecture with a classic CRF layer. In this way, we do not limit ourselves to blindly adopting a single transformer-based model. Instead, we evaluate different ones in order to identify the one that performs best in the medical domain and, in particular, on documents written in Spanish, as provided by the organizers of the SpRadIE challenge [8, 9].

Figure 1: Hybrid architecture proposed

3. Transformer based NER Model
The hybrid model we propose is inspired by a common architecture widely used for dealing with NER tasks. In particular, we refer to the architecture proposed by Huang [31], which is composed of the concatenation of a Long Short Term Memory (LSTM) model and a Conditional Random Field (CRF) layer [11]. This approach has been demonstrated to be very effective for the identification of entities and relations in text, and consequently it has been our starting point. We decided to combine a transformer-based model with a CRF layer, similarly to the approach discussed in [32]. The architecture of our proposed hybrid model is shown in Fig. 1. The first step consists of pre-processing the text to make it suitable for the transformer-based model.
In particular, we split the text into sentences, and we added special tokens at the beginning and end of each sentence. The [CLS] token is used to indicate the beginning of the sentence, [SEP] to indicate its end, and [PAD] tokens are added before the [SEP] to make the input length uniform. The transformer-based model requires input ids, a sequence of integers mapping each input token to its index in the tokenizer vocabulary provided by the specific pre-trained model. Consequently, the tokens are transformed into their numerical counterpart in order to be properly encoded using token embeddings, sentence embeddings, and positional vectors [33]. The formatted input of length $n$ is denoted as $w = \langle w_1, w_2, \ldots, w_n \rangle$, and the corresponding tag sequence is denoted as $t = \langle t_1, t_2, \ldots, t_n \rangle$. The tag annotation schema used for the model is BIO. The schema makes it possible to learn a NER model able to assign each token in the text one of the following tags: (i) the Beginning (B) of an entity; (ii) a word Inside (I) an entity; (iii) a word Outside (O) any entity [34].
When the data is ready, it is given as input to the chosen transformer-based model. Each model has its own internal architecture, but all of them allow us to obtain an embedding representation of each token in the input sentence. Generally speaking, the transformer is an architecture for transforming one sequence into another by using two modules: the encoder and the decoder. Both encoder and decoder are composed of blocks that can be stacked on top of each other multiple times in order to obtain the desired depth of the network. Each block is mainly composed of a multi-head attention layer, followed by a normalization layer and a feed-forward layer. Using only the encoder module, we can obtain a word embedding representation of each token of the input after an arbitrary number of encoding blocks.
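As a rough illustration of the BIO schema and of the special tokens described above, the following minimal sketch converts entity spans defined over tokens into BIO tags and wraps the sentence with [CLS] and [SEP]. The function names, the span representation (token indices, end exclusive), the label strings, and the choice of tagging the special tokens as "O" are assumptions made for the example; they are not taken from the implementation used in this work, and padding is omitted for brevity.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label); hypothetical format


def to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Assign a BIO tag to every token given non-overlapping entity spans."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def add_special_tokens(tokens: List[str], tags: List[str]) -> Tuple[List[str], List[str]]:
    """Wrap a sentence with [CLS] and [SEP]; special tokens are tagged 'O' (assumption)."""
    return ["[CLS]"] + tokens + ["[SEP]"], ["O"] + tags + ["O"]


if __name__ == "__main__":
    # Illustrative sentence built from terms mentioned in the dataset description.
    tokens = ["lóbulo", "tiroideo", "derecho", "sin", "quiste"]
    spans = [(0, 3, "Anatomical_Entity"), (4, 5, "Finding")]
    for token, tag in zip(*add_special_tokens(tokens, to_bio(tokens, spans))):
        print(f"{token}\t{tag}")
```

In a real pipeline the tokens would then be mapped to input ids by the tokenizer of the chosen pre-trained model, and the tags would be aligned with the resulting sub-tokens.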
In particular, BERT, in its basic version, is trained on a Transformer network made of only 12 encoding blocks, 768-dimensional states and 12 attention heads, for a total of 110M parameters, trained on BooksCorpus [35] and English Wikipedia for 1M steps. The learning phase is performed by scanning the span of text in both directions, from left to right and from right to left, as was already done in Bidirectional LSTMs. Moreover, BERT uses a "masked language model": during training, random terms are masked in order to be predicted by the network. Jointly, the network is also designed to learn whether a span of text follows the one given as input. These peculiarities make BERT one of the state-of-the-art language understanding models.
Following the logical flow of our hybrid model, we used the embeddings generated by the transformer-based model to perform a token-level classification task. In particular, a hidden layer has been added on top of the stack of the model in order to obtain a prediction matrix $P \in \mathbb{R}^{n \times k}$ (i.e., one prediction vector for each one of the $n$ tokens) for the given input sequence. Formally speaking, the classification layer projects each token's encoded representation to the tag space, $\mathbb{R}^{H} \to \mathbb{R}^{k}$, where $H$ is the size of the encoded representation and $k$ is the number of tags, which varies in accordance with the number of classes and the tagging scheme. The CRF layer consequently learns only the transition scores of the output labels, $A \in \mathbb{R}^{(k+2) \times (k+2)}$, where $+2$ accounts for one tag each for the start and end markers. The matrix $A$ is such that $A_{i,j}$ represents the score of transitioning from tag $i$ to tag $j$ [36].
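To make the roles of the prediction matrix and of the transition matrix more concrete, the following pure-Python sketch scores a tag sequence by adding transition and emission scores, and finds the highest-scoring sequence by dynamic programming (Viterbi decoding). It is only an illustration under stated assumptions: the matrices are plain nested lists, the start and end markers are assumed to occupy indices k and k+1 of the transition matrix, and the function names are hypothetical. It is not the TorchCRF implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sequence_score(P: Matrix, A: Matrix, tags: Sequence[int]) -> float:
    """Sum of transition scores (including start/end markers) and emission scores."""
    k = len(P[0])
    start, end = k, k + 1                      # assumed positions of the two extra tags
    path = [start] + list(tags) + [end]
    transitions = sum(A[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[i][t] for i, t in enumerate(tags))
    return transitions + emissions


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the highest-scoring tag sequence and its score."""
    n, k = len(P), len(P[0])
    start, end = k, k + 1
    # best[j]: best score of any partial path that ends with tag j at the current token
    best = [A[start][j] + P[0][j] for j in range(k)]
    backpointers: List[List[int]] = []
    for i in range(1, n):
        prev, best, pointers = best, [], []
        for j in range(k):
            p = max(range(k), key=lambda q: prev[q] + A[q][j])
            best.append(prev[p] + A[p][j] + P[i][j])
            pointers.append(p)
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j] + A[j][end])
    score = best[last] + A[last][end]
    # Follow the back-pointers from the last token to the first one.
    tags = [last]
    for pointers in reversed(backpointers):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    return tags, score
```

The decoder runs in O(n·k²) time instead of enumerating all kⁿ tag sequences, which is the reason dynamic programming is normally preferred for this step.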
For an input sequence $x = \langle x_1, \ldots, x_n \rangle$ and a sequence of tag predictions $y = \langle y_1, \ldots, y_n \rangle$, $y_i \in \{1, \ldots, k\}$, the score of the sequence is defined as [11]:

$$S(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=0}^{n} P_{i, y_i} \qquad (1)$$

The training objective is to maximize the log-likelihood of the tag sequence, given by:

$$\ln p(y \mid x) = S(x, y) - \ln \Big( \sum_{y' \in Y} e^{S(x, y')} \Big) \qquad (2)$$

where $Y$ denotes the set of all possible tag sequences for $x$. We use $y^{\star}$ to represent the most likely tag sequence of $x$, which is considered as the output of the model:

$$y^{\star} = \arg\max_{y' \in Y} S(x, y') \qquad (3)$$

During the evaluation, the most likely sequence is obtained by the Viterbi algorithm. The output obtained is re-processed to group back the sub-tokens produced by the tokenization approach used by transformer models [37]. Only the tag of the first sub-token is considered for the final tag sequence.

4. Experimental setting and discussion of results
4.1. Implementation
The model architecture has been implemented using the Python programming language. In particular, we used PyTorch [38] and the Hugging Face Transformers library [39]. The CRF layer used is the one provided by the TorchCRF library². We used AdamW as the optimizer, a modified version of the Adam optimizer that uses a weight decay factor. We set a weight decay of 0.01, a learning rate of 1e-5, a number of epochs equal to 500, a number of steps equal to 2000, and a batch size of 16. The code has been run on the Google Colab environment³ by using an Nvidia Tesla T4 GPU with 16GB of RAM.

4.2. Dataset
The dataset is provided by the organizers of the SpRadIE 2021 [8, 9] competition. It consists of 513 ultrasonography reports provided by a pediatric hospital in Argentina. Reports are unstructured and contain an abundance of orthographic and grammatical errors. They have been anonymized in order to remove patient IDs and the names and enrollment numbers of the physicians. Reports were annotated by clinical experts and then revised by linguists.
Annotation guidelines and training were provided for both rounds of annotations. Automatic classifiers will be expected to perform well in those cases where human annotators have strong agreement, and worse in cases that are difficult for human annotators to identify consistently. Annotations are provided in brat format. More details are reported in [2]. The annotated dataset is constituted as follows: training set (175 reports); Same-sample development set (47 reports); Held-out development set (45 reports); Held-out test set (207 reports). The following entities are distinguished:
• Anatomical Entity: entities corresponding to an anatomical part, for example breast (pecho), liver (hígado), right thyroid lobe (lóbulo tiroideo derecho);
• Finding: a pathological finding or diagnosis, for example: cyst, cyanosis;
• Location: it refers to a location in the body. Examples of locations are: walls, cavity, longitudinal, frontal, occipital, cervicodorsolumbosacra, lumbosacral, intracanalar, subcutanea;
• Measure: expression indicating a measure;
• Type of Measure: expression indicating a type of measure;
• Degree: it indicates the degree of a finding or some other property of an entity, for example, “leve”, “levemente” (slight), “mínimo” (minimal);
• Abbreviation: acronyms or abbreviations to indicate a medical concept;
• Negation: hedge cues indicating negation;
• Uncertainty: hedge cues indicating a probability (not a certainty) that some finding may be present in a given patient;
• Conditional Temporal: hedge cues indicating that something occurred in the past or may occur in the future.

² https://pypi.org/project/TorchCRF/
³ https://colab.research.google.com/

Figure 2: Example of annotated piece of diagnosis.

Figure 3: Distribution of tag annotations among the training dataset.

Fig. 2 reports an example of an annotated piece of diagnosis. Fig. 3 shows the distribution of the different entities in the training dataset.
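Since the annotations are distributed in brat format, a tag distribution such as the one in Fig. 3 can in principle be recomputed from the annotation files. The sketch below parses the text-bound annotation lines of the brat standoff format (identifier, then type and character offsets, then the covered text, separated by tabs) and counts the entities per type. The label strings in the usage example and the function names are hypothetical; the actual label spelling in the SpRadIE files may differ.

```python
from collections import Counter
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound annotations (lines starting with 'T') of a brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and empty lines
        ann_id, type_and_offsets, surface = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous spans look like "0 5;10 15"; only the outer bounds are kept here.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, surface))
    return entities


def tag_distribution(entities: List[Entity]) -> Counter:
    """Count how many entities were annotated with each label."""
    return Counter(entity.label for entity in entities)


if __name__ == "__main__":
    # Hypothetical annotation content, not taken from the SpRadIE dataset.
    sample = (
        "T1\tAnatomical_Entity 0 6\thígado\n"
        "T2\tFinding 20 26\tquiste\n"
        "T3\tAnatomical_Entity 30 35\tpecho\n"
    )
    print(tag_distribution(parse_brat_entities(sample)))
```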
The entity type Finding is particularly challenging, as it presents great variability in its textual forms. It ranges from a single word to more than 10 words in some cases, and comprises all kinds of phrases. However, this is also the most informative type of entity for the potential users of these annotations. Other challenging phenomena are the regular polysemy observed between Anatomical Entities and Locations, and the irregular uses of Abbreviations. Moreover, the same token can be labeled with different tags at the same time, making the automatic annotation a complex task.

4.3. Setting and metrics
In order to train the proposed hybrid NER model, we merged the training set with the Same-sample development set, resulting in a single set of 222 diagnoses (i.e., 18428 tokens obtained by white-space splitting). We then used only the Held-out development set for evaluating the performance of the model on 45 diagnoses (i.e., 5090 tokens). The performance of the models is evaluated by using classic metrics such as Precision, Recall and F1-score. Since the dataset is unbalanced, we decided to consider the micro average of the results obtained for each tag category. We chose ten different transformer models imported through the Hugging Face Transformers library. They are listed in the following:
• bert-base-uncased (BERT): it is the standard base version of BERT (12-layer, 768-hidden, 12-heads, 110M parameters) with an uncased English vocabulary;
• bert-base-multilingual-uncased (BERT-mul): it is the base version of BERT trained on lower-cased text in the top 102 languages;
• dccuchile/bert-base-spanish-wwm-uncased (BETO): BETO is a BERT model trained on a big Spanish corpus.
BETO is of size similar to a BERT-base and was trained with the Whole Word Masking technique;
• roberta-base (RoBERTa): it is a standard RoBERTa transformer model pretrained on a large corpus of English data in a self-supervised fashion;
• xlm-roberta-base (XLM-RoBERTa): it is a multilingual RoBERTa model trained on 100 different languages;
• google/electra-small-discriminator (ELECTRA): it is based on a pretraining approach which trains two transformer models, the generator and the discriminator. It has been trained on English data;
• skimai/electra-small-spanish (ELECTRA-ES): it is the ELECTRA model trained on Spanish data;
• fspanda/electra-medical-small-discriminator (ELECTRA-MED): ELECTRA model trained on English scientific papers in the medical domain published on PubMed and PubMed Central;
• emilyalsentzer/Bio_ClinicalBERT (BioCliBERT): the Bio_ClinicalBERT model is initialized from BioBERT and was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA;
• seiya/oubiobert-base-uncased (BioBERT): Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) is a language model based on the BERT-Base architecture.

4.4. Discussion of results
The results we obtained are reported in Tables 1-3. Considering the F1-score as the main metric, the best-performing model is BERT in its base version trained on a multilingual dataset (BERT-mul). The model obtained a micro F1-score of 0.69632, which outperforms the competitors by a margin ranging from 0.59% to 17.94%. The largest difference is observed with respect to the ELECTRA-MED model. These results are unexpectedly low and require further investigation to understand their causes.
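The quoted range can be checked with a few lines of arithmetic. The sketch below assumes that the percentages are computed relative to the micro F1-score of the best model, an interpretation that reproduces both bounds; the micro-averaged F1-scores are copied from Table 3, while the function and variable names are illustrative.

```python
def relative_gap(best: float, other: float) -> float:
    """Percentage by which `other` falls short of `best`, relative to `best`."""
    return (best - other) / best * 100


# Micro-averaged F1-scores as reported in Table 3.
micro_f1 = {
    "BERT": 0.62782,
    "BERT-mul": 0.69632,
    "BETO": 0.69220,
    "RoBERTa": 0.64619,
    "XLM-RoBERTa": 0.6517,
    "ELECTRA": 0.64255,
    "ELECTRA-ES": 0.66549,
    "ELECTRA-MED": 0.57143,
    "BioCliBERT": 0.63282,
    "BioBERT": 0.63820,
}

best_name = max(micro_f1, key=micro_f1.get)
gaps = {
    name: relative_gap(micro_f1[best_name], score)
    for name, score in micro_f1.items()
    if name != best_name
}

if __name__ == "__main__":
    # Print the competitors from the closest to the farthest one.
    for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
        print(f"{name}: {gap:.2f}% below {best_name}")
```

Under this reading, the smallest gap corresponds to BETO and the largest to ELECTRA-MED.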
On the contrary, the model based on BETO performs particularly well, obtaining results comparable with those obtained by using BERT multilingual. The differences between the two models are very small, probably because they share the same internal architecture and both rely on a vocabulary able to deal with Spanish terms. On the one hand, BERT multilingual is trained on many different languages simultaneously, which introduces some noise; on the other hand, BETO is trained on a smaller set of data, which costs some accuracy. These two weaknesses lead to results that are very close to each other. If we observe the results obtained for precision (Table 1) and recall (Table 2), we can notice an interesting behavior of the two models. BETO, even if by a small margin, obtains better precision than BERT-mul. On the contrary, considering recall, BERT-mul is the best. Still looking at the precision values, we can observe that, in general, the chosen models show prediction difficulties for the tag classes "Conditional Temporal", "Uncertainty", "Degree" and "Location". Such behavior could derive from the specificity of these classes for the medical field. Indeed, for classes of a more general character, like "Negation", "Type of Measure" and "Abbreviation", all the models perform reasonably well. The only exception is the class "Anatomical Entity" which, even if very specific to the medical domain, is classified correctly by nearly all the models. This could derive from the wide everyday use of terms regarding parts of the body in common language. Looking at the recall, we can see that some classes were easier to find than others, such as "Anatomical Entity", "Negation", "Abbreviation", "Uncertainty".
For the first three classes, which obtain good results for both precision and recall, we can say that the models are able not only to be precise in the classification but also to identify them correctly in the text. For the class "Uncertainty", however, the models tend to overestimate its presence in the sentences, because it has a good recall value and low precision. To understand whether the training language of the model is an issue for the correct classification of individual tokens, we can observe the differences between monolingual and multilingual models. Generally speaking, multilingual and Spanish-trained models perform better than their English monolingual counterparts. This result was expected as a consequence of the language of the dataset. A model able to manage words in the dataset language is naturally more effective than a model trained on English data, even if the sub-word tokenizers typical of transformer-based architectures (e.g., WordPiece or SentencePiece) alleviate these differences through the use of sub-tokens. If we focus on domain-specific models (BioCliBERT and BioBERT), we can observe that they perform in line with their non-specialized counterparts. The difference in F1-score among BERT-base, BioCliBERT, and BioBERT is only around 0.01. This suggests that training on the medical domain does not effectively support the task of NER for clinical diagnoses if, as in this case, the model is pre-trained on a language (i.e., English) different from that of the testing dataset. In order to draw conclusions about the usefulness of a model trained on medical data for the specific NER task, it would be necessary, as a future development, to train a new model with medical data in the Spanish language. Focusing on the differences among transformer architectures, we can affirm that BERT commonly performs better than the others, even if XLM-RoBERTa and ELECTRA obtain results not very far from their BERT counterparts.
Finally, the models could hardly classify Conditional Temporal tags correctly (only BioCliBERT obtains a non-zero score), but this problem is almost certainly related to the size of this class in the dataset. Indeed, only 19 instances are available in the training set and only one instance was present in the set used for the analysis.

Table 1
Results obtained for the chosen transformer models considering the Precision.

| Tag | BERT | BERT-mul | BETO | RoBERTa | XLM-RoBERTa |
|---|---|---|---|---|---|
| Abbreviation | 0.79412 | 0.73874 | 0.69919 | 0.66071 | 0.6774 |
| Anatomical Entity | 0.75776 | 0.77193 | 0.76630 | 0.77844 | 0.7794 |
| Conditional Temporal | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.0000 |
| Degree | 0.33333 | 0.45455 | 0.55556 | 0.50000 | 0.5294 |
| Finding | 0.50292 | 0.63725 | 0.63462 | 0.52688 | 0.5418 |
| Measure | 0.58549 | 0.71721 | 0.67932 | 0.69515 | 0.7111 |
| Negation | 0.89394 | 0.89888 | 0.86408 | 0.86357 | 0.8701 |
| Type of measure | 0.89394 | 0.79167 | 0.70492 | 0.80625 | 0.8421 |
| Uncertainty | 0.50000 | 0.42857 | 0.33333 | 0.50000 | 0.5000 |
| Location | 0.46400 | 0.51250 | 0.63846 | 0.43810 | 0.4806 |
| microAvg | 0.62489 | 0.69462 | 0.69645 | 0.65649 | 0.6635 |

| Tag | ELECTRA | ELECTRA-ES | ELECTRA-MED | BioCliBERT | BioBERT |
|---|---|---|---|---|---|
| Abbreviation | 0.66667 | 0.71585 | 0.36111 | 0.75385 | 0.81818 |
| Anatomical Entity | 0.74859 | 0.77273 | 0.64306 | 0.79655 | 0.78049 |
| Conditional Temporal | 0.00000 | 0.00000 | 0.00000 | 0.25000 | 0.00000 |
| Degree | 0.38095 | 0.57895 | 0.27778 | 0.31579 | 0.50000 |
| Finding | 0.50581 | 0.57285 | 0.56757 | 0.47181 | 0.56836 |
| Measure | 0.63687 | 0.73092 | 0.66818 | 0.68456 | 0.60870 |
| Negation | 0.86765 | 0.91429 | 0.80723 | 0.86667 | 0.89773 |
| Type of measure | 0.93103 | 0.74627 | 0.39130 | 0.85714 | 0.82927 |
| Uncertainty | 0.50000 | 0.26667 | 0.23077 | 0.42857 | 0.28571 |
| Location | 0.54082 | 0.48503 | 0.44037 | 0.44231 | 0.41104 |
| microAvg | 0.63983 | 0.67321 | 0.56189 | 0.63311 | 0.64984 |

5. SpRadIE submission
In light of these results, we made some decisions about the system to submit to the SpRadIE competition. The first decision concerns the configuration of the hybrid model to use. The obvious choice would have been to use BERT-mul as the transformer-based model. Instead, XLM-RoBERTa was chosen.
Table 2
Results obtained for the chosen transformer models considering the Recall.

| Tag | BERT | BERT-mul | BETO | RoBERTa | XLM-RoBERTa |
|---|---|---|---|---|---|
| Abbreviation | 0.87097 | 0.85417 | 0.81132 | 0.86071 | 0.8630 |
| Anatomical Entity | 0.80795 | 0.78338 | 0.81503 | 0.80069 | 0.8179 |
| Conditional Temporal | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.0000 |
| Degree | 0.37500 | 0.58824 | 0.55556 | 0.50286 | 0.5294 |
| Finding | 0.42786 | 0.53830 | 0.52800 | 0.41294 | 0.4467 |
| Measure | 0.59162 | 0.75431 | 0.68511 | 0.59553 | 0.6038 |
| Negation | 0.85507 | 0.83333 | 0.85577 | 0.83869 | 0.8481 |
| Type of measure | 0.85507 | 0.76000 | 0.76786 | 0.75152 | 0.7619 |
| Uncertainty | 1.0 | 0.75000 | 0.75000 | 0.66667 | 1.0 |
| Location | 0.65169 | 0.75926 | 0.72807 | 0.64750 | 0.6526 |
| microAvg | 0.63077 | 0.69803 | 0.68801 | 0.63621 | 0.6404 |

| Tag | ELECTRA | ELECTRA-ES | ELECTRA-MED | BioCliBERT | BioBERT |
|---|---|---|---|---|---|
| Abbreviation | 0.83871 | 0.85065 | 0.75581 | 0.87500 | 0.70130 |
| Anatomical Entity | 0.87748 | 0.72070 | 0.67964 | 0.81338 | 0.78528 |
| Conditional Temporal | 0.00000 | 0.00000 | 0.00000 | 1.0 | 0.00000 |
| Degree | 0.50000 | 0.55000 | 0.29412 | 0.46154 | 0.41176 |
| Finding | 0.43284 | 0.51805 | 0.43659 | 0.43923 | 0.45887 |
| Measure | 0.59686 | 0.71094 | 0.66516 | 0.58286 | 0.58065 |
| Negation | 0.85507 | 0.80672 | 0.72043 | 0.83871 | 0.89773 |
| Type of measure | 0.77143 | 0.64935 | 0.73469 | 0.80000 | 0.77273 |
| Uncertainty | 1.0 | 0.80000 | 1.0 | 1.0 | 0.66667 |
| Location | 0.59551 | 0.61364 | 0.45714 | 0.58974 | 0.67000 |
| microAvg | 0.64530 | 0.65794 | 0.58129 | 0.63252 | 0.62697 |

This decision is motivated by our impression that the validation dataset was not large enough to contain a representative subset of all the possible variations in the data that we could encounter in the test phase, especially regarding the less numerous classes. Indeed, the model that performs best on the validation set is not always the one that performs best on the test set, especially when the validation set is very small. Therefore, we opted for a model with intermediate performance in F1-score, following the idea that a less specialized model could better deal with the variability of new unseen data.
In particular, we decided to use this strategy as a form of regularization, similarly to what is commonly done with early stopping for reducing overfitting.

Table 3
Results obtained for the chosen transformer models considering the F1-score.

| Tag | BERT | BERT-mul | BETO | RoBERTa | XLM-RoBERTa |
|---|---|---|---|---|---|
| Abbreviation | 0.83077 | 0.79227 | 0.75109 | 0.74071 | 0.7590 |
| Anatomical Entity | 0.78205 | 0.77761 | 0.78992 | 0.77513 | 0.7982 |
| Conditional Temporal | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.0000 |
| Degree | 0.35294 | 0.51282 | 0.55556 | 0.51053 | 0.5294 |
| Finding | 0.46237 | 0.58361 | 0.57642 | 0.46865 | 0.4896 |
| Measure | 0.58854 | 0.73529 | 0.68220 | 0.62508 | 0.6531 |
| Negation | 0.87407 | 0.86486 | 0.85990 | 0.84060 | 0.8590 |
| Type of measure | 0.87407 | 0.77551 | 0.73504 | 0.75385 | 0.8000 |
| Uncertainty | 0.66667 | 0.54545 | 0.46154 | 0.57143 | 0.6667 |
| Location | 0.54206 | 0.61194 | 0.68033 | 0.50979 | 0.5536 |
| microAvg | 0.62782 | 0.69632 | 0.69220 | 0.64619 | 0.6517 |

| Tag | ELECTRA | ELECTRA-ES | ELECTRA-MED | BioCliBERT | BioBERT |
|---|---|---|---|---|---|
| Abbreviation | 0.74286 | 0.77745 | 0.48872 | 0.80992 | 0.75524 |
| Anatomical Entity | 0.80793 | 0.74581 | 0.66084 | 0.80488 | 0.78287 |
| Conditional Temporal | 0.00000 | 0.00000 | 0.00000 | 0.40000 | 0.00000 |
| Degree | 0.43243 | 0.56410 | 0.28571 | 0.37500 | 0.45161 |
| Finding | 0.46649 | 0.54408 | 0.49354 | 0.45494 | 0.50778 |
| Measure | 0.61622 | 0.72079 | 0.66667 | 0.62963 | 0.59434 |
| Negation | 0.86131 | 0.85714 | 0.76136 | 0.85246 | 0.89773 |
| Type of measure | 0.84375 | 0.69444 | 0.51064 | 0.82759 | 0.80000 |
| Uncertainty | 0.66667 | 0.40000 | 0.37500 | 0.60000 | 0.40000 |
| Location | 0.56684 | 0.54181 | 0.44860 | 0.50549 | 0.50951 |
| microAvg | 0.64255 | 0.66549 | 0.57143 | 0.63282 | 0.63820 |

Our second decision concerned the number of models to be learned. Our study of the dataset revealed that the same token could be annotated with more than one tag. For example, the term "cm" could be part of both a "Measure" tag and an "Abbreviation" tag. To this end, we decided to create four groups of tags with as little overlap as possible and to create a classifier for each.
In particular, the four groups are composed as follows: (i) Finding; (ii) Anatomical Entity, Measure, Degree; (iii) Location, Negation, Type of Measure; (iv) Abbreviation, Uncertainty, Conditional Temporal. In the prediction phase, for each tag we collected the results of each classifier and chose the one with the highest probability.

The official results we received from the task organizers, reported in Table 4, are lower than expected, especially for some specific tag classes. In particular, the preliminary F1-score for "Type of measure" (0.80 for XLM-RoBERTa in Table 3) is 116.2% higher than the official one (0.37), which corresponds to a drop of about 54%. Similar drops are also observed for the classes Uncertainty, Abbreviation and Anatomical Entity. In any case, investigating the reasons for this poorer-than-expected performance is left as future work, to be carried out as soon as the annotated test set is released.

Table 4
Results obtained for the SpRadIE submission considering the "exact" matching and the micro F1-score.

        Abbr.  Anat. Entity  Cond. Temp  Degree  Finding  Location  Measure  Negation  Type measure  Unc.
SWAP    0.49   0.52          0.00        0.48    0.44     0.31      0.51     0.76      0.37          0.17

6. Conclusion

Automated analysis of textual documents in medical settings is still a research challenge that will remain a topic of discussion in the coming years. Recently, the amount of medical data available has grown rapidly due to the deployment of numerous applications and systems that support patients and wellness enthusiasts. Nevertheless, the amount of annotated data is still meager because of privacy issues. Attempts have recently been made to overcome this problem with transfer learning strategies based on large pre-trained deep learning models. Among these, those based on transformers have proved to be the most reliable and versatile. In the wake of these scientific innovations, we decided to address the problem of named entity recognition in the medical field by exploiting these approaches.
Specifically, we proposed a hybrid model that combines the representational power of transformer models with the predictive power of probabilistic conditional random field models. The model was evaluated in several configurations, varying the transformer model at the core of its architecture: BERT, BERT-mul, BETO, RoBERTa, XLM-RoBERTa, ELECTRA, ELECTRA-ES, ELECTRA-MED, BioBERT and BioClinBERT. The experiments were performed on data provided by the SpRadIE 2021 [8, 9] competition, co-located with the Conference and Labs of the Evaluation Forum 2021. In particular, the training set consisted of 222 medical diagnoses in Spanish, appropriately annotated by experts in the field. The preliminary analysis was carried out on a validation set extracted from a data collection different from the training set. This portion of data contained 45 medical diagnoses, also in Spanish. The results showed that the transformer-based models trained on multilingual or Spanish data were the most effective. Specifically, BERT-base in its multilingual version (BERT-mul) and BETO proved to be the most suitable for the task of NER on medical data, with micro-averaged F1-scores of about 0.70 and 0.69 respectively. The results of this analysis were used to make implementation decisions regarding the system to be submitted to the SpRadIE 2021 competition. We decided to submit a system based on the XLM-RoBERTa transformer model, in the hope of obtaining satisfactory results with a model that showed average performance in the preliminary analysis. This choice was made to avoid basing our decisions solely on a portion of the data that was not representative enough because of its size. The results obtained proved to be lower than expected, but further study and investigation will be carried out upon release of the annotated test set.

7. Acknowledgment

This work has been supported by Apulia Region, Italy through the project "Un Assistente Dialogante Intelligente per il Monitoraggio Remoto di Pazienti" (Grant n. 10AC8FB6) in the context of "Research for Innovation - REFIN".

References

[1] H. Oh, C. Rizo, M. Enkin, A. Jadad, What is eHealth?: a systematic review of published definitions, World Hosp Health Serv 41 (2005) 32–40.
[2] V. Cotik, D. Filippo, R. Roller, H. Uszkoreit, F. Xu, Annotation of entities and relations in Spanish radiology reports, in: RANLP, 2017, pp. 177–184.
[3] P. Voigt, A. Von dem Bussche, The EU General Data Protection Regulation (GDPR), A Practical Guide, 1st Ed., Cham: Springer International Publishing 10 (2017) 3152676.
[4] H. Suominen, L. Kelly, L. Goeuriot, A. Névéol, L. Ramadier, A. Robert, E. Kanoulas, R. Spijker, L. Azzopardi, D. Li, et al., Overview of the CLEF eHealth evaluation lab 2018, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2018, pp. 286–301.
[5] R. M. R. Zavala, P. Martínez, I. Segura-Bedmar, A hybrid Bi-LSTM-CRF model for knowledge recognition from eHealth documents, in: TASS@SEPLN, 2018, pp. 65–70.
[6] E. Andrés, O. Sainz, A. Atutxa, O. L. de Lacalle, IXA-NER-RE at eHealth-KD challenge 2020: Cross-lingual transfer learning for medical relation extraction, in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN, volume 2020, 2020.
[7] N. García-Santa, K. Cetina, FLE at CLEF eHealth 2020: Text mining and semantic knowledge for automated clinical encoding, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings, 2020.
[8] H. Suominen, L. Goeuriot, L. Kelly, L. A. Alemany, E. Bassani, N. Brew-Sam, V. Cotik, D. Filippo, G. González-Sáez, F.
Luque, et al., Overview of the CLEF eHealth evaluation lab 2021, in: CLEF 2021 - 12th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, 2021.
[9] V. Cotik, L. A. Alemany, D. Filippo, F. Luque, R. Roller, J. Vivaldi, A. Ayach, F. Carranza, L. D. Francesca, A. Dellanzo, et al., Overview of CLEF eHealth Task 1 - SpRadIE: A challenge on information extraction from Spanish Radiology Reports, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2021.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[11] J. Lafferty, A. McCallum, F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001).
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423.
[13] P. Basile, V. Basile, D. Croce, M. Polignano, Overview of the EVALITA 2018 aspect-based sentiment analysis task (ABSITA), volume 2263, 2018. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85058658278&partnerID=40&md5=9719e0512649279a6ed46a8262f050dc.
[14] L. de Mattei, G. de Martino, A. Iovine, A. Miaschi, M. Polignano, G. Rambelli, ATE ABSITA @ EVALITA2020: Overview of the aspect term extraction and aspect-based sentiment analysis task, volume 2765, 2020. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85097555110&partnerID=40&md5=2122e26ed7367e5a7c40642835591903.
[15] K. Florio, V. Basile, M. Polignano, P.
Basile, V. Patti, Time of your hate: The challenge of time in hate speech detection on social media, Applied Sciences (Switzerland) 10 (2020). URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85087763770&doi=10.3390%2fAPP10124180&partnerID=40&md5=d4c6b8ba193ed062299bc19f02cb462e. doi:10.3390/APP10124180.
[16] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018).
[17] M. Polignano, P. Basile, M. De Gemmis, G. Semeraro, V. Basile, AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets, in: 6th Italian Conference on Computational Linguistics, CLiC-it 2019, volume 2481, CEUR, 2019, pp. 1–6.
[18] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[19] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
[20] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, arXiv preprint arXiv:1904.03323 (2019).
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[22] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[23] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[24] A. Mansouri, L. S. Affendey, A.
Mamat, Named entity recognition approaches, International Journal of Computer Science and Network Security 8 (2008) 339–344.
[25] J. Mao, W. Liu, Hadoken: a BERT-CRF model for medical document anonymization, in: IberLEF@SEPLN, 2019, pp. 720–726.
[26] K. Xue, Y. Zhou, Z. Ma, T. Ruan, H. Zhang, P. He, Fine-tuning BERT for joint entity and relation extraction in Chinese medical text, in: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2019, pp. 892–897.
[27] R. Vunikili, N. SH, G. Marica, O. Farri, Clinical NER using Spanish BERT embeddings, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[28] K. S. Kalyan, S. Sangeetha, Want to identify, extract and normalize adverse drug reactions in tweets? use RoBERTa, arXiv preprint arXiv:2006.16146 (2020).
[29] I. B. Ozyurt, On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining, in: Proceedings of the First Workshop on Scholarly Document Processing, 2020, pp. 104–112.
[30] X. Yu, W. Hu, S. Lu, X. Sun, Z. Yuan, BioBERT based named entity recognition in electronic medical record, in: 2019 10th International Conference on Information Technology in Medicine and Education (ITME), IEEE, 2019, pp. 49–52.
[31] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
[32] M. Liu, Z. Tu, Z. Wang, X. Xu, LTP: A new active learning strategy for BERT-CRF based named entity recognition, arXiv preprint arXiv:2001.02524 (2020).
[33] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[34] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009, pp. 147–155.
[35] Y. Zhu, R.
Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 19–27.
[36] R. Panchendrarajan, A. Amaresan, Bidirectional LSTM-CRF for named entity recognition, in: PACLIC, 2018.
[37] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[38] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, arXiv preprint arXiv:1912.01703 (2019).
[39] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., HuggingFace’s Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).