
Occupation Recognition and Normalization in Clinical Notes

Philips India Limited, Bangalore, Karnataka 560045, India

This paper describes the system submitted to the MEDical DOcuments PROFessions recognition (MEDDOPROF) shared task, which is part of the Iberian Languages Evaluation Forum (IberLEF) 2021. The Named Entity Recognition (NER) model, built using a Conditional Random Field, detects occupation and employment status entities in Spanish medical documents. The entities are mapped to their codes using the vector embedding similarity between the mention text and the code label text. The model obtains an F-score of 0.635 for the NER task and 0.566 for the normalization task.

Keywords: Named Entity Recognition, Entity Linking, Vector Embedding Similarity

1 Introduction

The shared task comprises three tracks:

- Track 1: MEDDOPROF-NER
- Track 2: MEDDOPROF-CLASS
- Track 3: MEDDOPROF-NORM

MEDDOPROF-NER Track 1 is a named entity recognition problem that requires finding mentions of occupations and classifying them as:

- Profession: label PROFESION.
- Employment Status: label SITUACION_LABORAL.
- Activity: label ACTIVIDAD.

MEDDOPROF-CLASS Track 2 requires determining to whom the occupation mention belongs:

- Patient: label PACIENTE.
- Family Member: label FAMILIAR.
- Health Professional: label SANITARIO.
- Someone else: label OTROS.

MEDDOPROF-NORM Track 3 is an entity linking problem that requires mapping predicted entities to codes. The codes are unique concept identifiers from:

- European Skills, Competences, Qualifications and Occupations (ESCO)
- SNOMED-CT

The shared task consists of clinical text written in Spanish. According to the Instituto Cervantes 2019 yearbook, Spanish is the world's second-most spoken native language, with more than 480 million native speakers (https://www.languagemagazine.com/2019/11/18/spanish-in-the-world/). According to its 2014 report (https://en.wikipedia.org/wiki/List_of_countries_where_Spanish_is_an_official_language), it is the official language of 20 sovereign states and one dependent territory, with a total population of around 442 million. Hence, building NLP systems for Spanish can have a significant societal impact.

The system described in this paper participated in MEDDOPROF-NER and MEDDOPROF-NORM. The source code has been shared on GitHub (https://github.com/kaushikacharya/clinical_occupation_recognition).

2 Model Description

2.1 Data

The corpus was annotated with professions and employment statuses in the BRAT standoff format (https://brat.nlplab.org/standoff.html) by a team of linguists and clinical experts. The clinical cases were sourced from different specialties.

MEDDOPROF-NER For each clinical case, the clinical note is stored in a text file (.txt) and its annotations in an annotation file (.ann). An example BRAT annotation is shown in Fig. 1.
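As an illustration of the file layout described above, the following is a minimal sketch of reading entity annotations from the content of a .ann file. It assumes the standard BRAT text-bound line layout (identifier, then type and character offsets, then mention text, separated by tabs). The `Entity` tuple, the `parse_ann` helper and the sample line with its offsets are hypothetical and are not taken from the corpus.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str  # annotation identifier, e.g. "T1"
    label: str   # entity type, e.g. "PROFESION"
    start: int   # character offset where the mention starts
    end: int     # character offset just past the mention
    text: str    # mention text as it appears in the .txt file


def parse_ann(ann_text: str) -> List[Entity]:
    """Parse text-bound annotations (lines starting with 'T') from .ann content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations and other annotation types
        ann_id, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer bounds.
        offsets = [int(x) for part in span.split(";") for x in part.split()]
        entities.append(Entity(ann_id, label, offsets[0], offsets[-1], text))
    return entities


# Hypothetical annotation line; the offsets are made up for illustration.
sample = "T1\tPROFESION 27 47\tguardia de seguridad"
entities = parse_ann(sample)
```

Keeping the character offsets alongside the label makes it possible to align each mention with the tokens of the corresponding .txt file later on.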

MEDDOPROF-NORM The code corresponding to each annotated entity in the train set was provided in a tab-separated file (.tsv). Additionally, a code reference list was provided, giving each code with its label and, where available, an alternative label. Professions are primarily mapped to ESCO, while working statuses and activities are mapped to SNOMED-CT.

A linear-chain Conditional Random Field (CRF) [5] classifier was trained to recognize the named entities. Parameters are estimated using the Limited-memory BFGS (L-BFGS) [8] optimization algorithm. L-BFGS belongs to the family of quasi-Newton methods and approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. The classifier is trained with both L1 and L2 regularization, with a coefficient of 0.1 for each. The CRF model is implemented using sklearn-crfsuite, a wrapper over CRFsuite [9].
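A linear-chain CRF of this kind is trained on sequences of per-token feature dictionaries paired with label sequences. The sketch below shows one way such dictionaries could be built from a parsed sentence. The `Token` tuple layout, the feature names, the hand-written toy parse and the BIO labels are assumptions made for illustration and are not the exact feature set of the submitted system.

```python
from typing import Dict, List, Tuple

# One token: (word, POS tag, dependency label, index of its governor).
# This tuple layout is a hypothetical stand-in for a parser's output.
Token = Tuple[str, str, str, int]


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the i-th token of a parsed sentence."""
    word, pos, dep, head = sent[i]
    head_word, head_pos, head_dep, _ = sent[head]
    feats: Dict[str, object] = {
        "word": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "pos": pos,
        "dep": dep,
        "head_word": head_word.lower(),
        "head_pos": head_pos,
        "head_dep": head_dep,
    }
    # Immediate neighbours, when they exist.
    if i > 0:
        feats["prev_word"] = sent[i - 1][0].lower()
        feats["prev_pos"] = sent[i - 1][1]
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1][0].lower()
        feats["next_pos"] = sent[i + 1][1]
    return feats


# Toy parse, written by hand for illustration only.
sentence: List[Token] = [
    ("Paciente", "NOUN", "ROOT", 0),
    (",", "PUNCT", "punct", 2),
    ("guardia", "NOUN", "appos", 0),
    ("de", "ADP", "case", 4),
    ("seguridad", "NOUN", "nmod", 2),
]
X = [token_features(sentence, i) for i in range(len(sentence))]
y = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION"]
```

With sklearn-crfsuite installed, lists of such sequences could be passed to `sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1).fit(...)`, which corresponds to the optimizer and regularization coefficients stated above.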

Features Features are extracted using spaCy [3], an open-source natural language processing library written in Python and Cython. The model used is es_core_news_sm (https://spacy.io/models/es), which was trained on news and media text. The features can be categorized as follows:

- Lexical features
  - Unigrams (current and immediate neighbor words)
  - String case
- Part-of-speech (POS) features
  - Current word's POS
  - Previous and next words' POS
  - Governor word's POS
- Dependency parse features
  - Governor word
  - Dependency type of the current word
  - Dependency type of the governor word

Example In the sentence shown in Fig. 2, seguridad produces the following features:

- current word POS: NOUN
- dependency tag: nmod
- parent dependency tag: appos

Mapping each predicted entity to its relevant code is solved using a vector embedding similarity approach. Vector embeddings for the entities and for the text corresponding to the codes are generated using fastText's pre-trained model [2] for Spanish. Each predicted entity is assigned the code with the highest cosine similarity.
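The code assignment step described above can be sketched as a nearest-neighbour lookup under cosine similarity. This is a minimal illustration, not the submitted implementation: the three-dimensional vectors and the code identifiers `CODE_A` and `CODE_B` are hypothetical placeholders, whereas in practice the vectors would come from the fastText model.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def best_code(mention_vec: Vector, code_vecs: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the code whose label embedding is most similar to the mention."""
    scored = ((code, cosine(mention_vec, vec)) for code, vec in code_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical embeddings of code label texts and of one predicted mention.
code_vecs = {
    "CODE_A": [1.0, 0.0, 0.0],
    "CODE_B": [0.0, 1.0, 0.0],
}
mention_vec = [0.9, 0.1, 0.0]
code, score = best_code(mention_vec, code_vecs)
```

Because every mention is assigned its highest-scoring code, a sketch like this always returns some code, even when the best similarity is low.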

3 Experimental Setup

3.1 Data

The train-test split of the corpus for the shared task is shown in Table 1. Entity count statistics are shown in Table 2.

Table 2. Entity counts in the train and test sets.

Occupation Entity     Count (Train set)   Count (Test set)
PROFESION             2513                693
SITUACION_LABORAL     1010                356
ACTIVIDAD             109                 28

For both the MEDDOPROF-NER and MEDDOPROF-NORM tasks, precision, recall and F-score are computed for each clinical case. These metrics are then summarized over the corpus using the micro-average.
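Micro-averaging pools the true positive, false positive and false negative counts of all clinical cases before computing the metrics. The following is a minimal sketch of that computation under strict span-and-label matching. It is not the official evaluation script, and the `micro_prf` helper, the document identifiers and the spans are hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def micro_prf(gold: Dict[str, Set[Span]],
              pred: Dict[str, Set[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-score over all documents."""
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)  # predicted spans that match gold exactly
        fp += len(p - g)  # predicted spans absent from gold
        fn += len(g - p)  # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical gold and predicted annotations for two clinical cases.
gold = {
    "case1": {(0, 5, "PROFESION"), (10, 20, "SITUACION_LABORAL")},
    "case2": {(3, 9, "PROFESION")},
}
pred = {
    "case1": {(0, 5, "PROFESION")},
    "case2": {(3, 9, "PROFESION"), (15, 22, "ACTIVIDAD")},
}
precision, recall, f_score = micro_prf(gold, pred)
```

In this toy example there are two true positives, one false positive and one false negative, so precision, recall and F-score all equal 2/3.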

4 Results

4.1 Quantitative Findings

MEDDOPROF-NER Table 3 compares the metrics obtained when the model trained on the training set is applied to that same set and to the unseen test set. Although performance is expected to decrease on the test set, the drop in recall is almost double the drop in precision. The primary reason for the low recall is the model's failure to detect the entity mentions at all. Table 6 shows the number of instances where the model succeeded or failed in identifying entity mentions, irrespective of the three entity classes.

Description of the match types:

- Exact Match: the text span of the predicted entity mention matches the ground truth exactly.
- Partial Match: the text span of the predicted entity mention partially matches the ground truth.
- False Negative: ground truth entity mentions for which the model failed to predict any entity class.
- False Positive: entity mentions generated by the model where the ground truth has no entity.

Around 37% of the ground truth entities fall under False Negative. False negative and false positive cases are those where there is not even a partial match between the ground truth and predicted entities.
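The four match types can be sketched as a comparison of character spans that ignores the entity class. The code below is an illustrative reading of the definitions above, not the analysis script used for the paper. Counting exact and partial matches from the ground truth side is an assumption, and the spans are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def match_counts(gold: List[Span], pred: List[Span]) -> Counter:
    """Count match types for one document, ignoring entity labels."""
    counts: Counter = Counter()
    pred_set = set(pred)
    for g in gold:
        if g in pred_set:
            counts["exact"] += 1
        elif any(overlaps(g, p) for p in pred):
            counts["partial"] += 1
        else:
            counts["false_negative"] += 1  # no predicted span touches it
    for p in pred:
        if not any(overlaps(p, g) for g in gold):
            counts["false_positive"] += 1  # no gold span touches it
    return counts


# Hypothetical spans for a single clinical case.
gold_spans = [(0, 5), (10, 30), (40, 50)]
pred_spans = [(0, 5), (18, 30), (60, 70)]
counts = match_counts(gold_spans, pred_spans)
```

Here the first gold span is matched exactly, the second only partially, the third is missed, and the last prediction has no counterpart in the ground truth.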

The official scores shown in Table 3 are based on the strict evaluation setting, which counts Partial Match entities as failures. Table 5 shows a more granular analysis at the token level. B-Entity stands for the first token of an entity, and I-Entity stands for the remaining tokens of the entity.

MEDDOPROF-NORM As the output of MEDDOPROF-NER is fed as input to MEDDOPROF-NORM, the corresponding metrics for MEDDOPROF-NORM are lower. Any improvement in MEDDOPROF-NER would be reflected in MEDDOPROF-NORM.

5 Conclusion

This paper proposes a CRF-based approach to named entity extraction and a vector embedding similarity based approach to entity linking.

Future plans

- Extract global structured information features from the dependency parse tree [4].
- Develop an LSTM-CRF model [6], which would automate feature extraction.

Around 9.4% of the ground truth entities fall under Partial Match. Jie et al. [4] define a valid span as a word sequence that is covered by a chain of dependency parse arcs where no arc is covered by another. This enables the extraction of global structured information based on the dependency parse, rather than only the local features mentioned in Section 2.2. Under this definition, the entity guardia de seguridad shown in Fig. 1 can be considered a valid span, as can be seen in Fig. 2. Using global structured information features would hopefully produce correct entity mention spans.

The significant drop in both precision and recall on unseen test data compared to seen training data shows that better features are needed. Hence, the plan is to develop an LSTM-based model for improved feature extraction.

References

1. Fisher, K., Griffith, L., Gruneir, A., Upshur, R., Perez, R., Favotto, L., Nguyen, F., Markle-Reid, M., Ploeg, J.: Effect of socio-demographic and health factors on the association between multimorbidity and acute care service use: population-based survey linked to health administrative data. BMC Health Services Research 21, 62 (Jan 2021). https://doi.org/10.1186/s12913-020-06032-5

2. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)

3. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303

4. Jie, Z., Muis, A.O., Lu, W.: Efficient dependency-guided named entity recognition. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. pp. 3457–3465. AAAI'17, AAAI Press (2017)

5. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 282–289. ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)

6. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural Architectures for Named Entity Recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 260–270. Association for Computational Linguistics, San Diego, California (Jun 2016). https://doi.org/10.18653/v1/N16-1030

7. Lima-Lopez, S., Farre-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., Krallinger, M.: NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural 67 (2021)

8. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Mathematics of Computation 35(151), 773–782 (1980), http://www.jstor.org/stable/2006193

9. Okazaki, N.: CRFsuite: a fast implementation of Conditional Random Fields (CRFs) (2007), http://www.chokkan.org/software/crfsuite/

10. Park, S., Jeon, H.J., Kim, J.U., Kim, S., Roh, S.: Sociodemographic factors associated with the use of mental health services in depressed adults: Results from the Korea National Health and Nutrition Examination Survey (KNHANES). BMC Health Services Research 14, 645 (Dec 2014). https://doi.org/10.1186/s12913-014-0645-7