<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ADOP FERT-Automatic Detection of Occupations and Profession in Medical Texts using Flair and BERT</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Fazlourrahman</forename><surname>Balouchzahi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Computing Research</orgName>
								<orgName type="institution">Instituto Politécnico Nacional</orgName>
								<address>
									<settlement>CDMX</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Grigori</forename><surname>Sidorov</surname></persName>
							<email>sidorov@cic.ipn.mx</email>
							<affiliation key="aff0">
								<orgName type="department">Center for Computing Research</orgName>
								<orgName type="institution">Instituto Politécnico Nacional</orgName>
								<address>
									<settlement>CDMX</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hosahalli</forename><forename type="middle">Lakshmaiah</forename><surname>Shashirekha</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Mangalore University</orgName>
								<address>
									<postCode>574199</postCode>
									<settlement>Mangalore</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ADOP FERT-Automatic Detection of Occupations and Profession in Medical Texts using Flair and BERT</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3673B072B4A3E7CA8DFFB3769F57F83C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:23+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Profession</term>
					<term>Medical Documents</term>
					<term>NER</term>
					<term>BERT</term>
					<term>Flair</term>
					<term>Embeddings</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Technological developments in the healthcare industry are generating large volumes of electronic health records as well as text data, usually referred to as medical text data. Processing medical text data in unstructured form is not only challenging but also has many applications. Named entity recognition, the task of extracting named entities and classifying them into predefined categories, is an important preprocessing step in the NLP pipeline. Extracting named entities from medical text is very useful for many applications and, at the same time, very challenging because of the characteristics of medical text data. Considering the importance of medical text processing, in this paper we (Team MUCIC) describe the models submitted to "MEDical DOcuments PROFessions recognition" (MEDDOPROF), the first shared task of its kind, consisting of three Tracks in Spanish, namely Track 1: MEDDOPROF-NER, Track 2: MEDDOPROF-CLASS, and Track 3: MEDDOPROF-NORM. We participated in Tracks 1 and 2 and proposed two models based on fine-tuning BERT embeddings using i) BertForTokenClassification from the transformers library and ii) the Flair framework, for the automatic detection of occupations and professions in medical text. The model using BertForTokenClassification obtained micro F1 scores of 0.629 and 0.598 for Tracks 1 and 2 respectively, while the Flair framework model obtained micro F1 scores of 0.8 and 0.764. Further, the Flair framework model was among the best models for the MEDDOPROF-NER Track.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recent advances in medical and healthcare information systems are generating a large amount of Electronic Health Records (EHRs) <ref type="bibr" target="#b0">[1]</ref> as well as text data in the medical domain. Despite the popularity of existing systems for managing EHRs, there is a massive amount of unstructured medical text data that must be transformed into a more structured format for further processing <ref type="bibr" target="#b1">[2]</ref>. Medical text processing, or text analytics, is one of the exciting areas of research in NLP that deals with various applications such as Text Classification (TC) (classification of medical records, classification of medical news articles), Text Summarization (automatic generation of summaries from medical news articles, summarization of clinical information), Hypothesis Generation, Knowledge Discovery, and so on.</p><p>One of the most popular text processing applications is Named Entity Recognition (NER), which is used to automatically recognize and classify Named Entities (NEs) <ref type="bibr" target="#b2">[3]</ref> representing names of persons, organizations, locations, and so on in a given natural language text. NER is a crucial step in the NLP pipeline, as the performance of the NER module decides the performance of subsequent modules <ref type="bibr" target="#b3">[4]</ref>, and NER systems also act as a preprocessing step for tasks like Relation Extraction <ref type="bibr" target="#b4">[5]</ref>. Medical NER, which deals with extracting medical NEs such as disease names, symptoms, medical conditions, medications, medical professions, employment status, etc., from medical texts, is challenging due to specialized terminology, a huge number of alternate spellings, and multi-word NEs. 
Even though a variety of works have explored processing medical texts from diverse aspects, very few works are reported in the literature on processing texts related to medical professions and employment status in general, and in particular on identifying and classifying the NEs describing medical occupations in medical documents.</p><p>To address the challenges of identifying and classifying the NEs describing medical occupations and employment status in Spanish medical documents, in this paper we (Team MUCIC) describe the models submitted to two Tracks of the MEDical Documents PROFessions recognition (MEDDOPROF) <ref type="bibr" target="#b5">[6]</ref> task. MEDDOPROF is the first shared task of its kind and consists of three sub-tracks; the descriptions of Tracks 1 and 2 (the ones in which we participated) are briefly given below:</p><p>-Track 1 -MEDDOPROF-NER: identifying the portions of text that mention an occupation and classifying them into one of three predefined categories, namely: PROFESION (PROFESSION), SITUACION LABORAL (WORKING STATUS) or ACTIVIDAD (ACTIVITY). -Track 2 -MEDDOPROF-CLASS: automatically finding the beginning and end of occupation mentions and classifying them into one of the following categories, namely: PACIENTE (Patient), FAMILIAR (Family member), SANITARIO (Health professional) or OTROS (Other).</p><p>Based on the descriptions of the Tracks and categories, Tracks 1 and 2 can be modeled as an NER task of identifying the NEs (tokens), which could be either single words or multi-word, and then classifying/labeling them into one of the predefined categories according to the Track. Of late, transformer-based models have been achieving state-of-the-art results for many NLP tasks compared to various Machine Learning (ML) and Deep Learning (DL) models. 
To explore transformers <ref type="bibr" target="#b17">[18]</ref>, we propose two models based on fine-tuning Bidirectional Encoder Representations from Transformers (BERT) <ref type="bibr" target="#b6">[7]</ref> embeddings using i) the BertForTokenClassification class from the transformers library and ii) the Flair framework, for the automatic detection of occupations and professions in Spanish medical texts for Tracks 1 and 2 of MEDDOPROF.</p><p>BERT is a language representation model that pre-trains bidirectional representations from text by jointly conditioning on both left and right context. It can also be fine-tuned for downstream tasks such as NER and TC simply by adding a task-specific output layer <ref type="bibr" target="#b7">[8]</ref>. The difference between BERT and Embeddings from Language Models (ELMo) <ref type="bibr" target="#b8">[9]</ref>, which also uses pre-trained language models, is that ELMo uses the language model as additional features, whereas BERT fine-tunes all parameters of the pre-trained language model to make it task-specific for the downstream task <ref type="bibr" target="#b6">[7]</ref>.</p><p>The Flair framework provides a standard model for training along with hyperparameter selection and a unified interface that reduces the complexity of using various embeddings and enables researchers to mix embeddings effectively. It also offers various embeddings that are publicly available on HuggingFace <ref type="bibr" target="#b18">[19]</ref>. In the current work, Flair is used with BERT embeddings <ref type="bibr" target="#b9">[10]</ref>. The Generative Pre-trained Transformer (OpenAI GPT) is another architecture that allows fine-tuning. 
However, it is limited to unidirectional representations, whereas BERT uses bidirectional representations, which effectively overcomes this restriction of the OpenAI GPT architecture <ref type="bibr" target="#b6">[7]</ref>.</p><p>The rest of the paper is organized as follows: Section 2 gives an overview of the works carried out in the related area and Section 3 describes the proposed methodology. Section 4 presents the experiments and results, and Section 5 concludes and outlines future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>ML classifiers have reported reasonable and competitive performance for various TC applications such as NER, Sentiment Analysis, Opinion Mining, etc. However, these days Neural Network (NN) based systems are commonly used for various TC applications in many domains, including the medical domain. Some recent works in medical text processing are described below:</p><p>Yepes et al. <ref type="bibr" target="#b10">[11]</ref> developed an NN-based system for the identification of medical NEs in Twitter posts. The authors used 148 million collected tweets to generate CBOW word embeddings that are used as weights in model construction. Two LSTMs are used to construct a sequence-to-sequence model, where the first LSTM acts as an encoder that encodes the texts into vectors and the second LSTM serves as the main classification model to classify the tokens. On the Micromed <ref type="bibr" target="#b19">[20]</ref> dataset containing 1,300 tweets, the proposed model obtained F1 scores of 0.665, 0.682, and 0.718 on disease, pharmacological substance, and symptom entities respectively.</p><p>Li et al. <ref type="bibr" target="#b11">[12]</ref> presented an NN-based model for medical NER in Chinese texts. The authors used character-level and word-level embeddings to capture orthographic and lexico-semantic features, along with POS tags as word information features. A Chinese medical corpus containing 12,498 records is used, and 1,739 of them were manually annotated into two categories, namely subject and lesion, where symptoms related to the body are considered the subject and lesion refers to the pathological changes of the subject. The dataset is transformed into the BIESO NE representation, where B, I, E, and O indicate the beginning, inside, end, and outside of an entity respectively, and S indicates that the entity consists of only a single word. 
RNN, LSTM, GRU, BLSTM, and BGRU were experimented with in various configurations and feature combinations. Among them, BGRU using only POS tag features and no embeddings had the best performance, with F1 scores of 90.36% and 90.48% for the subject and lesion detection tasks respectively. Feature engineering is one of the important steps in any NLP task, as it aims to improve the performance of the system. Weegar et al. <ref type="bibr" target="#b12">[13]</ref> explored the impact of simple feature engineering in NER systems for medical texts in three languages, namely English, Swedish, and Spanish. The authors examined some basic features including POS and semantic tags, along with prefixes, window size, and capitalization. The averaged structured perceptron algorithm was used with three datasets: the SemEval-2014 Task 7 Analysis of Clinical Text shared task dataset containing 9,694 disease NEs for English; EHRs consisting of patient records developed by Oronoz et al. <ref type="bibr" target="#b13">[14]</ref>, containing 3,362 disease and 1,406 drug entities, as the Spanish dataset; and a Swedish dataset released by Dalianis et al. <ref type="bibr" target="#b14">[15]</ref> containing 4,000 entities corresponding to body parts, disorders, and findings from over 500 different clinical units at Karolinska University Hospital. The results illustrate that in many cases simple but neglected features can significantly enhance the performance of the systems. Their best performing systems, which obtained F1 scores of 66.40, 68.41, and 68.22 for English, Swedish, and Spanish respectively, used specialized medical dictionaries. Sometimes, instead of working on features and model construction, proposing a new representation for the data might be more efficient. In one such study, Hamada et al. <ref type="bibr" target="#b3">[4]</ref> proposed the FROBES Segment Representation (SR) model, an extension of the IOBES model for NEs that are multi-word in nature. 
In the proposed FROBES model, F, R, O, B, and E represent front, rear, outside, begin, and end respectively, and S represents a single word. FROBES extends IOBES by replacing the tag I with F and R when an entity has more than two words. As it considers both halves of an entity, the first half is annotated with B and F and the second half with R and E. The proposed SR scheme is evaluated using BiLSTM as the baseline model on two datasets, namely the i2b2/VA 2010 challenge dataset and the JNLPBA 2004 shared task dataset, and the results reported by the authors illustrate that using FROBES improved the performance slightly. However, ensembling the baseline models with different SR models, namely IOB2, IOBES, and FROBES, outperformed the baseline models with F1 scores of 71.99 and 83.62 on the same datasets.</p></div>
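As a concrete illustration of these segment representations, the following minimal sketch (with hypothetical tag sequences, not data from the cited papers) rewrites IOB tags into the IOBES scheme discussed above:

```python
def iob_to_iobes(tags):
    """Rewrite IOB tags into IOBES: a single-token entity becomes S-,
    and the last token of a multi-token entity becomes E-."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
        elif tag.startswith("B-"):
            # B- stays B- only if the entity continues on the next token.
            out.append(("B-" if nxt == "I-" + tag[2:] else "S-") + tag[2:])
        else:  # I- tag
            # I- stays I- only if the entity continues; otherwise it ends here.
            out.append(("I-" if nxt == "I-" + tag[2:] else "E-") + tag[2:])
    return out

print(iob_to_iobes(["B-X", "I-X", "I-X", "O", "B-Y"]))
# → ['B-X', 'I-X', 'E-X', 'O', 'S-Y']
```

FROBES would further split the I- run of long entities into F- (first half) and R- (second half) tags, as described above.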
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>The two proposed models based on fine-tuning BERT embeddings using i) BertForTokenClassification from the transformers library and ii) the Flair framework, designed and evaluated for Tracks 1 and 2 of MEDDOPROF, are described in this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data Transformation</head><p>The datasets provided by the organizers of the MEDDOPROF shared task for the sub-Tracks are in Brat standoff annotation format. As per this format, for each text file there is a corresponding annotation file consisting of an annotation ID, a label, and the beginning and ending offsets of each NE, which could be a single word or multi-word. More details of the Brat standoff format can be found on its website <ref type="bibr" target="#b20">[21]</ref>. As data in the CONLL IOB <ref type="bibr">[22]</ref> format is easy to handle, the given data in Brat standoff annotation format is transformed to CONLL format with IOB representation using the brat_to_conll.py <ref type="bibr" target="#b21">[23]</ref> module. The IOB representation assigns the tags I and O to tokens that are inside and outside an NE respectively, and assigns the tag B to the first word of an NE <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4]</ref>. A snapshot of data in Brat format and CONLL (IOB) format is shown in Figure <ref type="figure" target="#fig_0">1</ref>. As the data transformed into CONLL IOB format is used to train the classifier models, the predictions of the models will also be in CONLL IOB format. This requires a post-processing step to transform the predictions in CONLL IOB format back to Brat standoff annotation format to generate the .ann output files required by the organizers.</p></div>
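The offset-based transformation described above can be sketched in a few lines of Python. This is a simplified illustration with a made-up Spanish sentence and entity span, not the actual conversion module used in the paper, and it only handles whitespace tokenization:

```python
def to_iob(tokens, spans):
    """Assign IOB tags to tokens given character-offset entity spans.

    `tokens` is a list of (text, start_offset) pairs; `spans` is a list of
    (start, end, label) annotations, as found in Brat .ann files.
    """
    tags = []
    for text, start in tokens:
        end = start + len(text)
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of the span gets B-, subsequent tokens get I-.
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append(tag)
    return tags

# Toy sentence with one multi-word entity annotated at character offsets.
sentence = "Trabaja como auxiliar de enfermeria"
tokens, pos = [], 0
for word in sentence.split():
    start = sentence.index(word, pos)
    tokens.append((word, start))
    pos = start + len(word)
spans = [(13, 35, "PROFESION")]  # "auxiliar de enfermeria"
print(to_iob(tokens, spans))
# → ['O', 'O', 'B-PROFESION', 'I-PROFESION', 'I-PROFESION']
```

A real converter must also handle punctuation, sentence splitting, and entities that do not align with token boundaries.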
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Models</head><p>The main component of the proposed models is BETO <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b22">24]</ref>, a Spanish BERT language model trained on a large amount of unannotated Spanish corpora <ref type="bibr" target="#b23">[25]</ref>. In this work, we have used the bert-base-spanish-wwm-cased <ref type="bibr" target="#b24">[26]</ref> model, which is more effective for NER tasks as capitalization plays a major role in identifying NEs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BertForTokenClassification using Transformers:</head><p>The first step of this model is to fine-tune the BERT model on the downstream task using the transformers library. Using the data in CONLL IOB format, the fine-tuned models are further trained for Tracks 1 and 2 of the shared task. For each test dataset, the models generate tagged sequences sentence-wise in IOB annotation format, which are then converted back to Brat standoff annotation format.</p><p>As BERT-based models must be fed sequences of the same length, the maximum sequence length is set to 510 and shorter sequences are padded to this length. However, an attention mask is employed so that the padded elements do not distract the model. Similar to Keras <ref type="bibr" target="#b26">[28]</ref>, BERT supports attention masks that allow the model to focus on the main part of the sequence, ignoring padded elements. In other words, a mask is typically used for attention when a batch contains sentences of varying lengths. Real tokens are thus used for training by assigning 1 to in-sequence tokens and 0 to out-of-sequence (padding) positions. After assembling the training data and the corresponding masks using PyTorch <ref type="bibr" target="#b25">[27]</ref>, BETO is initialized using the BertForTokenClassification class from the transformers library, which adds a token-level predictor on top of the BERT model.</p><p>Setting the optimizer to AdamW, the models were trained for 50 epochs. Figure <ref type="figure">2</ref> shows the training and validation loss, where the validation set is 10% of the training set. An overview of the model based on BERT using the transformers library is shown in Figure <ref type="figure">3</ref>.  
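The padding-and-mask step described above can be illustrated in plain Python. This is a sketch of the idea only, with made-up token IDs; the actual implementation builds PyTorch tensors:

```python
def pad_with_masks(batch, pad_id=0, max_len=510):
    """Pad token-id sequences to a common length and build attention masks:
    1 for real (in-sequence) tokens, 0 for padding, as described in the text."""
    length = min(max(len(seq) for seq in batch), max_len)
    ids, masks = [], []
    for seq in batch:
        seq = seq[:length]  # truncate sequences longer than max_len
        pad = [pad_id] * (length - len(seq))
        ids.append(seq + pad)
        masks.append([1] * len(seq) + [0] * len(pad))
    return ids, masks

# Two toy sequences of different lengths (101/102 stand in for [CLS]/[SEP]).
ids, masks = pad_with_masks([[101, 7, 8, 102], [101, 5, 102]])
print(masks)
# → [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask rows are passed alongside the input IDs so the self-attention layers ignore the padded positions.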
Flair with BERT Embeddings: Flair is a PyTorch-based NLP tool that provides a model training framework in which various embeddings and language models can be used individually or in combination and fine-tuned for downstream tasks, with special support for medical domain data <ref type="bibr" target="#b9">[10]</ref>. To compare the performance of this model with that of the BertForTokenClassification model, the bert-base-spanish-wwm-cased model is used and fine-tuned using the SequenceTagger from Flair, which has a BiLSTM-based backend. It is also possible to use a CRF on top of the model, but this is not done in this work. As Flair requires the training data in CONLL format, the data in Brat standoff annotation format is transformed to CONLL IOB format as described in Section 3.1 and is loaded using the ColumnCorpus class from the Flair library. A summary of the layers used in this model is given on our GitHub page <ref type="bibr">[29]</ref>. The parameters of the proposed model are set as given in Table <ref type="table" target="#tab_0">1</ref> and an overview of the proposed Flair model is presented in Figure <ref type="figure" target="#fig_2">4</ref>. The main requirement of any task is an annotated dataset for training the models. The MEDDOPROF corpus provided by the organizers contains 1,844 clinical cases covering more than 20 specialties, annotated manually by clinical and linguistic experts following strict guidelines. Each clinical case is stored as a single text file along with a corresponding Brat standoff annotation file. The description of the dataset is available on the task website <ref type="bibr" target="#b27">[30]</ref> and the descriptions of the labels for both Tracks are given in Table <ref type="table" target="#tab_1">2</ref>.</p><p>Evaluating the models' performance is an equally important task. 
As per the submission guidelines <ref type="bibr" target="#b5">[6]</ref>, the predictions for each test file should be in Brat standoff annotation format, i.e., each annotation file should have the extension .ann and should consist of an annotation ID, a label, and the correct beginning and ending offsets for each predicted NE, on one line, similar to the annotation files given in the training set. However, the annotation ID is generated at random, as it has no influence on the prediction. Annotation files are generated for each file in the test set and submitted to the task organizers for evaluation. The performance of the models is evaluated in terms of micro-averaged Precision, Recall, and F1-score.</p><p>The organizers provided, as a baseline, the results obtained by a simple lookup system using the annotations from the training data. The baseline results and the performances of the proposed models reported by the organizers for both Tracks, as micro-averaged scores, are shown in Table <ref type="table" target="#tab_2">3</ref>. The results illustrate that the proposed models obtained good performance for both Tracks, and that they performed better on the MEDDOPROF-NER task. In addition, the model using the Flair framework with BERT embeddings outperformed the other proposed model and was one of the best performing models in the shared task. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Snapshot of data in Brat standoff and CONLL IOB format</figDesc><graphic coords="5,137.61,370.32,340.13,123.25" type="bitmap" /></figure>
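The post-processing step that maps IOB predictions back to Brat standoff lines can be sketched as follows. This is a simplified illustration with sequential IDs and a made-up example (the text above notes that the actual annotation IDs are generated at random):

```python
def iob_to_ann(tokens, tags):
    """Collect B-/I- runs into entities and emit Brat standoff lines of the
    form 'T<id>\t<LABEL> <start> <end>\t<text>'.

    `tokens` is a list of (text, start, end) triples aligned with `tags`.
    """
    entities, current = [], None
    for (text, start, end), tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], start, end, text]  # open a new entity
        elif tag.startswith("I-") and current:
            current[2] = end                        # extend the entity span
            current[3] += " " + text
        else:  # O tag: close any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [f"T{i}\t{lab} {s} {e}\t{txt}"
            for i, (lab, s, e, txt) in enumerate(entities, 1)]

tokens = [("auxiliar", 13, 21), ("de", 22, 24), ("enfermeria", 25, 35)]
print(iob_to_ann(tokens, ["B-PROFESION", "I-PROFESION", "I-PROFESION"]))
```

Each emitted line corresponds to one predicted NE in the .ann file, ready to be scored against the gold standoff annotations.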
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .Fig. 3 .</head><label>23</label><figDesc>Fig. 2. Training and Validation loss while fine-tuning BERT</figDesc><graphic coords="6,165.95,461.06,283.46,154.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Overview of the proposed Flair model</figDesc><graphic coords="7,137.60,552.11,340.16,88.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Parameters in the Flair model</figDesc><table><row><cell>Parameter</cell><cell>Max len</cell><cell>Hidden size</cell><cell>Learning rate</cell><cell>Mini batch size</cell><cell>Epochs</cell></row><row><cell>Value</cell><cell>512</cell><cell>256</cell><cell>5.0e-6</cell><cell>4</cell><cell>10</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Labels description in MEDDOPROF-NER and MEDDOPROF-CLASS</figDesc><table><row><cell>Track</cell><cell>Labels</cell><cell>Token Description</cell></row><row><cell cols="3">MEDDOPROF-NER PROFESION Indicates a profession</cell></row><row><cell></cell><cell>SITUACION LABORAL</cell><cell>Indicates an employment status</cell></row><row><cell></cell><cell cols="2">ACTIVIDAD Indicates an activity</cell></row><row><cell cols="3">MEDDOPROF-CLASS PACIENTE Token is related to the patient</cell></row><row><cell></cell><cell cols="2">FAMILIAR Token is related to a family member</cell></row><row><cell></cell><cell cols="2">SANITARIO Token is related to a health professional</cell></row><row><cell></cell><cell>OTROS</cell><cell>Token is related to someone else</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Performances of proposed models (Micro average)Medical text processing is one of the most exciting as well as vital tasks in NLP. Considering its importance, MEDDOPROF organized a shared task with three Tracks, and we participated in two of them, namely MEDDOPROF-NER and MEDDOPROF-CLASS, for the automatic detection of occupations and professions in Spanish medical texts. We (Team MUCIC) proposed two models using BERT embeddings, namely BertForTokenClassification from the transformers library and the Flair framework. The results illustrate that the models performed better on NER, and the Flair model outperformed the other model in both Tracks, obtaining very good results with micro F1-scores of 0.8 and 0.764 for MEDDOPROF-NER and MEDDOPROF-CLASS respectively. Further, the Flair model became one of the best performing models in the shared task. As future work, we plan to explore the Language Understanding with Knowledge-based Embeddings (LUKE) model, a new pre-trained contextualized representation of words and entities based on the transformer architecture. 
Improving the performance of the system through modifications to the NE representations, and exploring various learning approaches for the task of NER in medical texts, are other plans for future work.</figDesc><table><row><cell>Subtask</cell><cell>Model</cell><cell cols="2">Precision Recall F1-score</cell></row><row><cell cols="2">MEDDOPROF-NER Baseline</cell><cell>0.465</cell><cell>0.508 0.486</cell></row><row><cell></cell><cell>BERT</cell><cell>0.809</cell><cell>0.515 0.629</cell></row><row><cell></cell><cell cols="2">Flair-BERT embeddings 0.813</cell><cell>0.788 0.8</cell></row><row><cell cols="2">MEDDOPROF-CLASS Baseline</cell><cell>0.391</cell><cell>0.377 0.384</cell></row><row><cell></cell><cell>BERT</cell><cell>0.77</cell><cell>0.488 0.598</cell></row><row><cell></cell><cell cols="2">Flair-BERT embeddings 0.77</cell><cell>0.75 0.764</cell></row><row><cell cols="2">5 Conclusion and Future Work</cell><cell></cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Acknowledgements</head><p>Team MUCIC deeply appreciates the organizers of the MEDDOPROF shared task for their efforts, guidance, and support during the task, and the anonymous reviewers for their valuable comments.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Improving NER for Clinical Texts by Ensemble Approach using Segment Representations</title>
		<author>
			<persName><forename type="first">H</forename><surname>Nayel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Shashirekha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Conference on Natural Language Processing</title>
				<meeting>the 14th International Conference on Natural Language Processing<address><addrLine>ICON-</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-12">2017. 2017 Dec</date>
			<biblScope unit="page" from="197" to="204" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Regular Expression Based Medical Text Classification using Constructive Heuristic Approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Aickelin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ge</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="147892" to="147904" />
			<date type="published" when="2019-10-11">2019 Oct 11</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">PUNER-Parsi ULMFiT for Named-Entity Recognition in Persian Texts</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Balouchzahi Fazlourrahman</surname></persName>
		</author>
		<author>
			<persName><surname>Shashirekha</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>EasyChair</publisher>
			<biblScope unit="volume">4224</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Improving Multi-Word Entity Recognition for Biomedical Texts</title>
		<author>
			<persName><forename type="first">H</forename><surname>Nayel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Shashirekha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shindo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsumoto</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.05691.2019</idno>
		<imprint>
			<date>Aug 15</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Comparative Study of Segment Representation for Biomedical Named Entity Recognition</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Shashirekha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">A</forename><surname>Nayel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Advances in Computing, Communications and Informatics (ICACCI)</title>
				<imprint>
			<date type="published" when="2016-09-21">2016. 2016 Sep 21</date>
			<biblScope unit="page" from="1046" to="1052" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts</title>
		<author>
			<persName><forename type="first">Salvador</forename><surname>Lima-López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eulàlia</forename><surname>Farré-Maduell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antonio</forename><surname>Miranda-Escalada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vicent</forename><surname>Brivá-Iglesias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Krallinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018-10-11">2018 Oct 11</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A Clinical Trials Corpus Annotated with UMLS Entities to Enhance the Access to Evidence-Based Medicine</title>
		<author>
			<persName><forename type="first">L</forename><surname>Campillos-Llanos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Valverde-Mateos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Capllonch-Carrión</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moreno-Sandoval</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Medical Informatics and Decision Making</title>
		<imprint>
			<date type="published" when="2021-12">2021 Dec</date>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.05365</idno>
		<title level="m">Deep Contextualized Word Representations</title>
				<imprint>
			<date type="published" when="2018-02-15">2018 Feb 15</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">FLAIR: An Easy-To-Use Framework for State-Of-The-Art NLP</title>
		<author>
			<persName><forename type="first">A</forename><surname>Akbik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bergmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blythe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rasul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schweter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</meeting>
		<imprint>
			<date type="published" when="2019-06">2019 Jun</date>
			<biblScope unit="page" from="54" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">NER for Medical Entities in Twitter using Sequence to Sequence Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Yepes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mackinlay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Australasian Language Technology Association Workshop</title>
				<meeting>the Australasian Language Technology Association Workshop</meeting>
		<imprint>
			<date type="published" when="2016-12">2016 Dec</date>
			<biblScope unit="page" from="138" to="142" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">WCP-RNN: a Novel RNN-based Approach for Bio-NER in Chinese EMRs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Supercomputing</title>
		<imprint>
			<date type="published" when="2020-03">2020 Mar</date>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="page" from="1450" to="1467" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The Impact of Simple Feature Engineering in Multilingual Medical NER</title>
		<author>
			<persName><forename type="first">R</forename><surname>Weegar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Casillas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>De Ilarraza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oronoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pérez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gojenola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)</title>
				<meeting>the Clinical Natural Language Processing Workshop (ClinicalNLP)</meeting>
		<imprint>
			<date type="published" when="2016-12">2016 Dec</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Automatic Annotation of Medical Records in Spanish with Disease, Drug and Substance Names</title>
		<author>
			<persName><forename type="first">M</forename><surname>Oronoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Casillas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gojenola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Perez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Iberoamerican Congress on Pattern Recognition</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013-11-20">2013 Nov 20</date>
			<biblScope unit="page" from="536" to="543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">HEALTH BANK-A Workbench for Data Science Applications in Healthcare</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dalianis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Henriksson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kvist</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Velupillai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weegar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CAiSE Industry Track</title>
		<imprint>
			<biblScope unit="volume">1381</biblScope>
			<biblScope unit="page" from="1" to="8" />
			<date type="published" when="2015-06-11">2015 Jun 11</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Spanish Pre-trained BERT Model and Evaluation data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cañete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chaperon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pérez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PML4DC at ICLR</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dredze</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09077</idno>
		<title level="m">Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT</title>
				<imprint>
			<date type="published" when="2019-04-19">2019 Apr 19</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<ptr target="https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification" />
		<title level="m">BertForTokenClassification</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<ptr target="https://huggingface.co/" />
		<title level="m">Hugging Face homepage</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<ptr target="https://github.com/IBMMRL/medinfo2015" />
		<title level="m">MedInfo</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<ptr target="https://brat.nlplab.org/standoff.html" />
		<title level="m">brat standoff format homepage</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m">NeuroNER</title>
		<ptr target="https://github.com/Franck-Dernoncourt/NeuroNER" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">BETO: Spanish BERT</title>
		<ptr target="https://github.com/dccuchile/beto" />
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<ptr target="https://github.com/josecannete/spanish-corpora" />
		<title level="m">Spanish Unannotated Corpora</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m">bert-base-spanish-wwm-cased</title>
		<ptr target="https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<ptr target="https://pytorch.org/" />
		<title level="m">PyTorch homepage</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m">Keras homepage</title>
		<ptr target="https://keras.io/" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<ptr target="https://temu.bsc.es/meddoprof/data/" />
		<title level="m">MEDDOPROF homepage</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
