
Vicomtech at MEDDOPROF: Automatic Information Extraction and Disambiguation in Clinical Text

SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua, Donostia/San-Sebastian, Spain

ezotova

agarciap

mcuadrosg@vicomtech.org

Department of Languages and Computer Systems, University of the Basque Country (UPV-EHU), Leioa, Spain

This paper describes the participation of the Vicomtech NLP team in the MEDDOPROF shared task. The challenge consists of the automatic detection of occupations and employment statuses, as well as their normalization or entity mapping, in medical documents written in Spanish. The competition is split into three tasks: NER, CLASS and NORM. We participated with a multitask joint model based on Transformers, which tries to solve all three tasks at once. However, the NORM task, which consists of disambiguating the detected entities against thousands of different possible codes, can be solved more effectively with other approaches. For that reason, we submitted an additional sequence-to-sequence approach and a semantic-search approach for the NORM task. We achieve an F1-score of 77% for the NER task, 70% for the CLASS task, and 48% for the NORM task.

Keywords: Clinical Text, Information Extraction, Automatic Indexing

This article presents the participation of the Vicomtech NLP team in the MEDDOPROF (Medical Documents Profession Recognition) shared task [7]. The shared task consists of developing systems for the automatic detection of occupations and employment statuses, as well as their normalization or entity mapping, in medical documents written in Spanish. The target data is a corpus of clinical case reports from heterogeneous medical specialities.

The competition is divided into three tasks. The first task, NER, requires automatically finding mentions of occupations and classifying each of them as a profession, an employment status or an activity. The second task, CLASS, requires classifying mentions of occupations to determine whether they are related to the patient, to a family member, to a health professional or to someone else. Finally, the third task, NORM, requires mapping the task 1 predictions to one of the codes in a list of unique concept identifiers. We refer the reader to the shared task overview article [7] for more detailed information about MEDDOPROF.

The rest of the document is structured as follows. Section 2 introduces the data provided by the organizers of the challenge. Sections 3 and 4 describe our submitted systems and the training setup, respectively. Section 5 presents the official results. In Section 6, we discuss some decisions taken during the development and training phases, inherent flaws of our systems, and potential improvements. Finally, Section 7 provides some concluding remarks and hints for future work.

2 Data description

The provided corpus is a collection of 1844 clinical cases from over 20 different specialties, annotated with professions and employment statuses. The gold annotations for NER and CLASS are provided in Brat format [12] (see Figure 1), while the codes for the NORM task are provided as a .tsv file with the codes assigned to each profession/activity in the corpus (see Table 1). Note that the NER and NORM tasks are related, because the input to the NORM task is the set of entities detected in the NER task.

Table 1. Examples of code assignments in the NORM .tsv file.

File                        Mention                      Start  End   Code
caso clinico psiquiatria95  haber dejado el ejercito     2562   2586  SCTID: 73438004
caso clinico psiquiatria23  le ha llevado a despedirse   2127   2153  SCTID: 73438004
caso clinico urologia302    medico de Atencion Primaria  185    212   2211.1
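As a rough illustration of how the Brat standoff annotations mentioned above can be read, the following sketch parses text-bound annotation lines into typed spans. The annotation id `T1`, the choice of label, and the helper names are assumptions made for the example; only the offsets and mention text come from the Table 1 example, and discontinuous spans are not handled.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat(ann_text: str) -> List[Mention]:
    """Parse text-bound annotations (lines starting with 'T') from Brat standoff text.

    Assumes contiguous spans of the form: ID<TAB>LABEL START END<TAB>TEXT.
    """
    mentions = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and blank lines
        ann_id, type_and_span, text = line.split("\t", 2)
        label, start, end = type_and_span.split(" ")
        mentions.append(Mention(ann_id, label, int(start), int(end), text))
    return mentions


# Hypothetical annotation line built from the offsets of the Table 1 example.
example = "T1\tPROFESION 185 212\tmedico de Atencion Primaria"
mentions = parse_brat(example)
```

Each parsed mention keeps its character offsets, so it can later be joined with the NORM .tsv rows on file name and span.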

In addition, the organizers provided an extension of the dataset with an extra set of labels for different entities: symptoms, diseases, procedures, negation markers and negation spans, etc. These entities do not count toward the competition evaluation, but the organizers encourage the participants to make use of them to develop more interesting and complete systems.

The organizers also provide a list of valid codes related to professions, labour activities and occupations from the SNOMED Clinical Terms (www.snomed.org) (50 codes) and European Skills, Competences, Qualifications and Occupations (ESCO) (ec.europa.eu/esco) classifications (3508 codes). Both SNOMED CT and ESCO are described as machine-readable multilingual thesauri with an ontological foundation.

The core concepts of the SNOMED CT ontology are: concept codes (numerical codes that identify clinical terms, organized in hierarchies); descriptions (textual descriptions of concept codes); and relationships between concepts. The comprehensive coverage of SNOMED CT includes a large variety of concepts such as symptoms, diagnoses, procedures, body structures, etc. The use of SNOMED CT within this competition is restricted to classifying activities and employment statuses. European Skills, Competences, Qualifications and Occupations (ESCO) is a multilingual classification of skills, competences, qualifications and occupations relevant for the EU labour market and education.

We have approached the challenge with a joint multitask end-to-end model based on Transformers, trying to solve the three tasks at the same time. However, the NORM task can be solved more effectively using other techniques and separate models, so we have competed with several different approaches for this third task.

3.1 Multitask joint model

The multitask joint model tries to solve all the tasks, including the detection of the extended entity set, using a single model based on transformers.

All tasks except NORM are treated as regular sequence-labelling tasks. At the core, there is a pre-trained BERT model that encodes the texts, converting each token into a contextual word embedding. These embeddings are the input to several classification heads that perform IOB tagging [10].
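To make the IOB scheme concrete, the following minimal sketch converts token-level entity spans into IOB tags. The example sentence, the span indices and the function name are illustrative assumptions, not taken from the corpus or from our implementation.

```python
from typing import List, Tuple


def spans_to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (first_token, end_token_exclusive, label) spans into IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the mention
    return tags


# Hypothetical sentence; the mention covers tokens 4 to 7.
tokens = ["Paciente", "que", "trabaja", "como", "medico", "de", "Atencion", "Primaria"]
tags = spans_to_iob(tokens, [(4, 8, "PROFESION")])
```

Each classification head then predicts one such tag per token, with its own label inventory.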

This regular sequence-labelling approach solves the NER and CLASS tasks. However, the joint model also tries to deal with the NORM task, treating it with a hierarchical classification approach.

The hierarchical classification consists of a set of classification heads, one for each non-terminal node of the hierarchy, that are trained at the same time. The ESCO codes, used in this task to identify the professions, follow a hierarchical structure in which the first digit is the most coarse-grained category. Each following digit adds a more fine-grained definition of the profession that the code describes. The key difference from a flat classifier is that, instead of trying to select a code from a flat list of potentially thousands of codes, a tree is built from the codes, level by level. Each node of the tree has only a limited number of child nodes, according to the actual ESCO hierarchy. These non-terminal nodes are turned into classifiers, and the number of children of each node determines the output size of its classifier.

For training, each ESCO code is decoded so that only the appropriate classifiers have something to predict. The rest of the nodes are forced to predict a special "OUT" value. At inference time, the final code is reconstructed from the root of the hierarchy, classifier after classifier, following the hierarchy structure. The resulting code is emitted when the current classifier predicts the special value "OUT", or when a leaf node (with no further children in the hierarchy) is reached.
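The encoding and decoding of codes into node-level targets can be sketched as follows. This is a simplified illustration rather than our actual implementation: it assumes that every digit before the dot and every dot-separated suffix is one level of the tree, and it replaces the node classifiers with a plain dictionary of predictions. All function names are hypothetical.

```python
from typing import Dict, List, Set

OUT = "OUT"


def code_to_path(code: str) -> List[str]:
    """Cumulative prefixes of a code, e.g. '5223.7' -> ['5', '52', '522', '5223', '5223.7']."""
    head, *suffixes = code.split(".")
    path = [head[: i + 1] for i in range(len(head))]
    for suffix in suffixes:
        path.append(path[-1] + "." + suffix)
    return path


def build_tree(codes: List[str]) -> Dict[str, Set[str]]:
    """Map each non-terminal node to its children; the empty string is the root."""
    children: Dict[str, Set[str]] = {}
    for code in codes:
        parent = ""
        for node in code_to_path(code):
            children.setdefault(parent, set()).add(node)
            parent = node
    return children


def training_targets(code: str, children: Dict[str, Set[str]]) -> Dict[str, str]:
    """One target per node classifier: the next node on the code's path, else OUT."""
    targets = {node: OUT for node in children}
    parent = ""
    for node in code_to_path(code):
        targets[parent] = node
        parent = node
    return targets


def decode(predictions: Dict[str, str], children: Dict[str, Set[str]]) -> str:
    """Walk down from the root until a classifier predicts OUT or a leaf is reached."""
    node = ""
    while node in children:
        predicted = predictions[node]
        if predicted == OUT:
            break
        node = predicted
    return node


codes = ["5223", "5223.7", "2211.1"]
tree = build_tree(codes)
roundtrip = [decode(training_targets(c, tree), tree) for c in codes]
```

In this toy tree, the code 5223 is both a valid code and a non-terminal node, so its own classifier must predict "OUT" for the code to be emitted at that level.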

This approach is suitable for this kind of task, but it has several disadvantages. It is computationally expensive, depending on the size of the hierarchy: for the ESCO codes involved in this competition, it resulted in about 800 node classifiers. It is also complex to implement, and each hierarchy may have subtleties that must be taken into account when modelling the tree structure and how the codes are encoded into and decoded from a set of nodes. Finally, since there are many nodes to train, the amount of training data available in the competition might not be enough.

For these reasons, we have tried several additional approaches for the NORM task, which are described in the next subsections.

3.2 Seq2Seq Translation System for NORM Task

The second approach to the NORM task is based on the self-attention Transformer architecture [14]. We adapt sequence-to-sequence modelling (seq2seq) [13, 2] to the task of mapping terms from clinical texts to their codes in the SNOMED CT and ESCO classifications. The term description is the source input for the encoder, and its code in the corresponding ontology is the target input for the decoder. The high-level architecture of the system is depicted in Figure 3.

In order to train the mapping system, we prepare the training corpus as follows. The set of valid codes consists of 3,558 unique codes, some of which have several synonymous definitions; specifically, the number of synonyms varies from 1 to 38. We split the multiple term definitions and assign the corresponding code to each synonym, obtaining a dataset arranged as shown in Table 3. As can be seen, one code may be represented by various highly similar definitions.

Furthermore, we combine the training set of the NORM task and the descriptions of all codes from the SNOMED CT and ESCO ontologies, which results in 297 unique codes with a distribution that ranges from 1 to 182 examples per code. Finally, we obtain 15,869 examples for the training set and reserve 346 examples for the development set (10% of the original training set provided for the task). The dataset is highly unbalanced: only 10% of the codes have more than 10 examples.

During the text preprocessing step, the source terms are lower-cased, stripped of punctuation and tokenized at word level. The target codes are not preprocessed, only tokenized on spaces. The number of tokens in the source examples varies from 1 to 22, and in the target set it is 1 or 2, depending on the code type. We train the model with the Transformer architecture parameters shown in Table 4.

Table 3. Examples of synonymous term definitions and their codes.

Term                                              Code
jefe de producto de las TIC                       1330.6
jefa de producto de las TIC                       1330.6
encargado de la gestion de productos de las TIC   1330.6
encargada de la gestion de productos de las TIC   1330.6
product manager de las TIC                        1330.6
jefa de producto de las TI                        1330.6
jefa de producto de las TICs                      1330.6
jefe de producto de las TI                        1330.6
jefe de producto de las TICs                      1330.6
desempleado                                       SCTID: 73438004
trabajador                                        SCTID: 106541005
refugiado                                         SCTID: 446654005
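The data preparation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function names are hypothetical, punctuation is assumed to mean ASCII punctuation that is simply deleted, and the small dictionary stands in for the full list of valid codes.

```python
import string
from typing import Dict, List, Tuple

# Translation table that deletes ASCII punctuation (an assumption about the cleaning step).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_source(term: str) -> List[str]:
    """Lower-case, strip punctuation and tokenize a source term at word level."""
    return term.lower().translate(_PUNCT_TABLE).split()


def make_pairs(code_to_terms: Dict[str, List[str]]) -> List[Tuple[List[str], List[str]]]:
    """Expand each code into one (source tokens, target tokens) pair per synonym."""
    pairs = []
    for code, terms in code_to_terms.items():
        target = code.split(" ")  # codes are only split on spaces, not cleaned
        for term in terms:
            pairs.append((preprocess_source(term), target))
    return pairs


# Toy subset of the valid codes, using rows from Table 3.
ontology = {
    "1330.6": ["jefe de producto de las TIC", "product manager de las TIC"],
    "SCTID: 73438004": ["desempleado"],
}
pairs = make_pairs(ontology)
```

With this tokenization, ESCO codes yield a single target token and SNOMED CT codes yield two, which matches the target lengths of 1 or 2 reported above.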

This approach has several advantages. First, it turns an extremely large multi-class classification problem into a straightforward sequence-to-sequence problem. Second, the pairs of codes and their descriptions, which are already defined in ontologies and vocabularies, can be leveraged as extra training instances that complement the actual training data.

3.3 Semantic similarity mapping for NORM Task

The third mapping system for the NORM task is based on the idea of semantic search. Semantic search is an information retrieval method that leverages a semantic similarity measure to retrieve semantically close documents. The main objective of semantic similarity is to measure the distance between the vectors that represent a pair of words, sentences or documents. The key concepts of semantic search are the following: the query, the collection of documents, and the degree of relevance between a query and the retrieved documents.

We adapt the method to map terms written in natural language to the codes in the SNOMED CT and ESCO classifications. In this case, a term previously detected by the NER system (see Subsection 3.1) as PROFESION, ACTIVIDAD or SITUACION LABORAL is used as the query to search for the closest document. The collection of documents is represented by the SNOMED CT and ESCO ontologies provided by the organizers. The codes are separated by synonyms as shown in Table 3, so each code has several descriptions. These descriptions are the documents to search through. To compute a notion of similarity between a term and a code description, we use the cosine distance.
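The retrieval step can be illustrated with a minimal sketch of nearest-neighbour search by cosine similarity. The three-dimensional vectors below are made-up toy values standing in for sentence embeddings, and the function names are hypothetical; in practice the vectors would come from a pretrained sentence encoder.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_code(
    query_vec: Sequence[float],
    documents: List[Tuple[str, str, Sequence[float]]],
) -> Tuple[str, str, float]:
    """Return (code, description, similarity) of the description closest to the query."""
    code, description, vec = max(
        documents, key=lambda doc: cosine_similarity(query_vec, doc[2])
    )
    return code, description, cosine_similarity(query_vec, vec)


# Code descriptions from Table 3 paired with toy embeddings (illustrative values only).
documents = [
    ("SCTID: 73438004", "desempleado", [0.9, 0.1, 0.0]),
    ("SCTID: 106541005", "trabajador", [0.1, 0.9, 0.1]),
    ("1330.6", "jefe de producto de las TIC", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # toy embedding of a detected mention
code, description, score = nearest_code(query, documents)
```

Because each synonym is stored as its own document, a code is selected as soon as any one of its descriptions is the closest to the query.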

We have experimented with different pretrained language models to create a common vector space for terms and code descriptions. We selected the LaBSE model [5], because it obtained the best F-score during the experimentation. LaBSE is a BERT sentence embedding model supporting 109 languages. It is developed using masked language modelling [4] and translation language modelling [3], with a translation ranking task using bi-directional dual encoders.

Since the type of a term (i.e. whether it is a profession, an activity or an employment status) is detected in the previous task, we execute the mapping process in two ways: 1) searching in SNOMED CT and ESCO separately; 2) searching in a database where the SNOMED CT and ESCO codes are merged. Figure 4 depicts the basic algorithm of semantic search applied to the NORM task, independently of the database.

For the separate search, we select the closest description under the following condition: if the assigned tag is SITUACION LABORAL or ACTIVIDAD, the term is searched in the SNOMED CT database (50 codes); if the tag is PROFESION, the term is searched in the ESCO database (3554 codes). In our experiments, the terms tagged as SITUACION LABORAL, and thus mapped to SNOMED CT codes, reached a micro-F1 score of 0.577, while PROFESION terms mapped to ESCO obtained a micro-F1 score of 0.215. This suggests that, as could be expected, the performance of the semantic search method is influenced by the number of target elements and by the extent to which they are semantically separable.

4 Training setup and submitted systems

We have participated in all the tasks proposed by the competition. The first two tasks, NER and CLASS, have been addressed only with the multitask joint model. For the NORM task we have submitted different runs using different approaches. The first approach is the same multitask model, since it aims to predict all the information requested in the competition in a single step.

In order to train the multitask joint model, we used IXAmBERT [9] and BETO [1] as the pre-trained BERT models that form the core of the model. We experimented with both because both are pre-trained on Spanish data. After validation on the development set, BETO seemed to obtain a slight advantage, so we finally decided to make the submission using the BETO-based multitask model.

The multitask joint model has been implemented in Python 3.7 with HuggingFace's transformers library [15] (github.com/huggingface/transformers), and it has been trained on an Nvidia GeForce RTX 2080ti GPU with 11GB of memory. The learning rate was set to 2E-5 and the optimizer was AdamW [8]. During training, the micro-F1 score of the predictions on the development set was monitored, with an early stopping patience of 100 epochs. That is, the model continued training until it reached 100 consecutive epochs without any improvement in the validation metric.
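The early stopping criterion can be sketched independently of any training framework. The function below is a hypothetical helper, not part of our code base: it scans a sequence of per-epoch development scores and stops once the given number of consecutive epochs has passed without improvement. The scores and the patience of 3 are toy values chosen to keep the example short.

```python
from typing import Iterable, Tuple


def best_epoch_with_patience(
    dev_scores: Iterable[float], patience: int
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_score, epochs_run) under early stopping."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, score in enumerate(dev_scores):
        epochs_run = epoch + 1
        if score > best_score:
            best_score, best_epoch = score, epoch  # improvement: reset the counter
        elif epoch - best_epoch >= patience:
            break  # `patience` consecutive epochs without improvement
    return best_epoch, best_score, epochs_run


# Toy micro-F1 values on the development set, one per epoch.
scores = [0.50, 0.60, 0.58, 0.59, 0.57, 0.61]
result = best_epoch_with_patience(scores, patience=3)
```

In this toy run, training stops after the fifth epoch, so the later score of 0.61 is never observed; a large patience such as 100 epochs makes this kind of premature stop less likely at the cost of longer training.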

The NORM task Transformer model was implemented with the OpenNMT toolkit [6], using its PyTorch-based framework OpenNMT-py (opennmt.net/OpenNMT-py). The model was trained on an Nvidia GeForce RTX 2080ti GPU with 11GB of memory, with the learning rate set to 2E-5 and the AdamW optimizer [8], for 10,000 steps. The best model was selected by micro-F1 score.

The semantic similarity inference was implemented with the Sentence Transformers library [11] and run on an Nvidia GeForce RTX 2080ti GPU with 11GB of memory.

5 Results

The multitask joint model performs reasonably well for the NER and CLASS tasks. The F-scores achieved by our multitask joint model for the NER and CLASS tasks are 25.6% and 31.7% above the baseline, respectively. For the NORM task, the best performing system is the one that uses a sequence-to-sequence approach based on Transformers.

The score for the NORM task, even for the best performing system, is below the baseline score. A possible explanation is that the most frequent codes are repeated many times, so the baseline approach can find those common codes easily.

Since the NORM task input is the output of the NER task, and the participants do not have access to any gold-labelled input for the NORM task, the competing systems need to rely on the imperfect outcomes of the corresponding NER system. This results in an error accumulation that lowers the final score. To clarify this point, it would be interesting to compare our results with those of other participants.

At the time of writing these working notes, the official ranking with the scores from all the participants has not been published yet, so we cannot assess to what extent our results are competitive.

6 Discussion

In order to better understand the behaviour and the result of some of our submitted systems, we have carried out some error analysis to pose some discussion points for future work.

In the NORM task we see the following challenging issues:

- The Seq2Seq Translation system (see Subsection 3.2) seems to be biased by the unbalanced dataset: some codes have only one description, while others have more than 130.
- The hierarchical structure of the ESCO classification and its short descriptions lead to many semantically close terms that are labelled with different codes. This leads to codes that are "almost" correctly predicted, in the sense that only the most fine-grained part of the code is incorrect. However, this counts as an error regardless of how close the predicted code is to the correct one. For instance, the term "vendedora en un comercio pequeño" ("salesperson in a small business") is manually labelled with code 5223 (Asistentes de venta de tiendas y almacenes, sales assistants of shops and warehouses), and the system predicts code 5223.7 (vendedor especializado/vendedora especializada, specialized salesperson).
- The performance of the Seq2Seq Translation system is highly influenced by hyperparameters and other factors that deserve further experimentation.
- The semantic mapping method presented in this article is straightforward and does not require previous training. However, the system fails mainly when mapping semantically close terms. It performs better when the search database is of moderate size and the documents are more semantically separable.

7 Conclusions

In these working notes we have presented Vicomtech's participation in the MEDDOPROF shared task. We have participated with a multitask joint model based on Transformers, which addresses the three tasks, NER, CLASS and NORM, at the same time. In addition, we have presented two other systems for the NORM task, since it can be better tackled with other approaches, such as a sequence-to-sequence approach that maps terms to codes. The quantitative results seem reasonable, but at the time of writing the official score ranking has not been published, so we cannot compare against other participants to conclude whether our proposed systems are competitive.

All in all, the objective of the proposed tasks is relevant and interesting, and it is still far from being solved. In order to keep improving the results, apart from trying new approaches, more experimentation will be needed to improve some design decisions and choose better hyper-parameter settings, which seem to highly influence the performance of the systems.

Acknowledgments

This work has been partially funded by the projects DeepText (KK-2020-00088, SPRI, Basque Government) and DeepReading (RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).

References

1. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020). pp. 1-9 (2020)

2. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1724-1734. Association for Computational Linguistics, Doha, Qatar (2014)

3. Conneau, A., Lample, G.: Cross-lingual Language Model Pretraining. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 32. Curran Associates, Inc. (2019)

4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171-4186 (2019)

5. Feng, F., Yang, Y., Cer, D., Arivazhagan, N., Wang, W.: Language-agnostic BERT Sentence Embedding (2020)

6. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.: OpenNMT: Open-Source Toolkit for Neural Machine Translation. In: Proceedings of ACL 2017, System Demonstrations. pp. 67-72. Association for Computational Linguistics, Vancouver, Canada (2017)

7. Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., Krallinger, M.: NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural 67 (2021)

8. Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019). pp. 1-18 (2019)

9. Otegi, A., Agirre, A., Campos, J.A., Soroa, A., Agirre, E.: Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque. In: Proceedings of The 12th Language Resources and Evaluation Conference. pp. 436-442 (2020)

10. Ramshaw, L.A., Marcus, M.P.: Text Chunking Using Transformation-based Learning. In: Natural Language Processing Using Very Large Corpora, pp. 157-176. Springer (1999)

11. Reimers, N., Gurevych, I.: Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2020)

12. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: BRAT: A Web-based Tool for NLP-assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12). pp. 102-107 (2012)

13. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. pp. 3104-3112. NIPS'14, MIT Press, Cambridge, MA, USA (2014)

14. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: Proceedings of the Thirty-first Conference on Advances in Neural Information Processing Systems (NeurIPS 2017). pp. 5998-6008 (2017)

15. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 pp. 1-11 (2019)