Occupation Recognition and Normalization in Clinical Notes

Kaushik Acharya [0000-0003-2759-3646]

Philips India Limited, Bangalore, Karnataka 560045, India
acharya.kaushik@gmail.com

Abstract. This paper describes the system submitted to the MEDical DOcuments PROFessions recognition (MEDDOPROF) shared task, part of the Iberian Languages Evaluation Forum (IberLEF) 2021. The Named Entity Recognition (NER) model, built using a Conditional Random Field, detects occupation and employment status entities in Spanish medical documents. The detected entities are mapped to their codes using the vector embedding similarity between the mention text and the code label text. The model obtains an F-score of 0.635 for the NER task and 0.566 for the normalization task.

Keywords: Named Entity Recognition · Entity Linking · Vector Embedding Similarity.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Several studies [1,10] have shown the correlation of socio-demographic factors with physical and mental health, habits and lifestyle choices. Occupation and employment status are among the prime factors. It is therefore important to automatically identify their mentions in clinical free text.

The MEDDOPROF shared task [7] focuses on the automatic detection of these factors in medical documents as well as their mapping to standard codes. The shared task comprises three tracks:

– Track 1: MEDDOPROF-NER
– Track 2: MEDDOPROF-CLASS
– Track 3: MEDDOPROF-NORM

MEDDOPROF-NER Track 1 is a named entity recognition problem which requires finding mentions of occupations and classifying them as:

– Profession: label PROFESION.
– Employment Status: label SITUACION_LABORAL.
– Activity: label ACTIVIDAD.

MEDDOPROF-CLASS Track 2 requires determining to whom the occupation mention belongs:

– Patient: label PACIENTE.
– Family Member: label FAMILIAR.
– Health Professional: label SANITARIO.
– Someone else: label OTROS.

MEDDOPROF-NORM Track 3 is an entity linking problem which requires mapping the predicted entities to codes. The codes are unique concept identifiers from

– European Skills, Competences, Qualifications and Occupations (ESCO)
– SNOMED-CT

The shared task consists of clinical text written in Spanish. As per Instituto Cervantes's 2019 yearbook, Spanish is the world's second-most spoken native language, with 480+ million native speakers (https://www.languagemagazine.com/2019/11/18/spanish-in-the-world/). According to their 2014 report (https://en.wikipedia.org/wiki/List_of_countries_where_Spanish_is_an_official_language), it is the official language of 20 sovereign states and one dependent territory, with a total population of around 442 million. Hence, building NLP systems for Spanish can have a significant societal impact.

The system described in this paper participated in MEDDOPROF-NER and MEDDOPROF-NORM. The source code is available on GitHub (https://github.com/kaushikacharya/clinical_occupation_recognition).

2 Model Description

2.1 Data

The corpus was annotated with profession and employment status entities in BRAT standoff format (https://brat.nlplab.org/standoff.html) by a team composed of linguists and clinical experts. The clinical cases were sourced from different specialties.

MEDDOPROF-NER For each clinical case, the clinical note is stored in a text file (.txt) and its annotation(s) in an annotation file (.ann). An example BRAT annotation is shown in Fig. 1.
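For illustration, a minimal sketch of how such a .txt/.ann pair can be read is shown below. It handles only the entity ("T") lines of the standoff format; the file names, the parse_ann helper and the example line in the docstring are hypothetical and are not part of the shared-task distribution.

from pathlib import Path

def parse_ann(ann_path):
    """Parse the entity lines of a BRAT standoff .ann file.

    Entity lines look like: T1<TAB>PROFESION 112 121<TAB>limpiador
    Returns a list of (label, start, end, mention_text) tuples.
    """
    entities = []
    for line in Path(ann_path).read_text(encoding="utf-8").splitlines():
        if not line.startswith("T"):  # skip relation/attribute/note lines
            continue
        _, label_span, mention = line.split("\t")
        if ";" in label_span:  # discontinuous spans are not handled in this sketch
            continue
        label, start, end = label_span.split()
        entities.append((label, int(start), int(end), mention))
    return entities

# Hypothetical file names for one clinical case
text = Path("caso_clinico_1.txt").read_text(encoding="utf-8")
for label, start, end, mention in parse_ann("caso_clinico_1.ann"):
    assert text[start:end] == mention  # BRAT offsets are character based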
MEDDOPROF-NORM The code corresponding to each annotated entity in the training set was provided in a tab-separated file (.tsv). Additionally, a code reference list was provided containing each code together with its label and alternative labels (if available). Professions are primarily mapped to ESCO, while working statuses and activities are mapped to SNOMED-CT.

Fig. 1. Example BRAT annotation with profession and employment status labels. English translation: He is unemployed (he has had sporadic jobs as a cleaner, security guard, etc.).

2.2 Named Entity Recognition

A linear-chain Conditional Random Field (CRF) [5] classifier was trained to recognize the named entities. Parameter estimation is done using the Limited-memory BFGS (L-BFGS) [8] optimization algorithm. L-BFGS belongs to the family of quasi-Newton methods and approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. The classifier is trained with both L1 and L2 regularization (coefficient value of 0.1 for each). The CRF model is implemented using sklearn-crfsuite, which is a wrapper over CRFsuite [9].

Features Features are extracted using spaCy [3], an open-source natural language processing library written in Python and Cython. The es_core_news_sm model (https://spacy.io/models/es), trained on news and media genre text, is used. The features can be categorized as follows:

– Lexical features
  • Unigrams (current and immediate neighbor words)
  • String case
– Part-of-speech (POS) features
  • Current word's POS
  • Previous and next word's POS
  • Governor word's POS
– Dependency parse features
  • Governor words
  • Dependency type of the current word
  • Dependency type of the governor word

Example In the sentence shown in Fig. 2, seguridad produces the following features:

– current word POS: NOUN
– dependency tag: nmod
– parent dependency tag: appos

Fig. 2. POS & dependency parse for the sentence in Fig. 1

2.3 Entity Linking

Mapping the predicted entities to their relevant codes is solved using a vector embedding similarity approach. Vector embeddings for the entity mentions and for the text corresponding to the codes are generated using fastText's pre-trained model for Spanish [2]. Each predicted entity is assigned the code whose label text has the highest cosine similarity with the mention.
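The sketch below illustrates this linking step. It assumes the publicly released Spanish fastText model file (cc.es.300.bin) and an in-memory dictionary from codes to their label text; the variable names and the toy reference list are illustrative rather than the exact implementation.

import fasttext
import numpy as np

# Pre-trained Spanish fastText model [2]; the exact file name is an assumption.
model = fasttext.load_model("cc.es.300.bin")

def embed(text):
    """Sentence-level fastText embedding of a mention or a code label."""
    return model.get_sentence_vector(text.lower())

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def link(mention, code_to_label):
    """Assign the code whose label text is most similar to the mention."""
    mention_vec = embed(mention)
    return max(code_to_label,
               key=lambda code: cosine(mention_vec, embed(code_to_label[code])))

# Toy reference list; the codes are placeholders, not real ESCO/SNOMED-CT identifiers.
code_to_label = {
    "code_1": "guardia de seguridad",
    "code_2": "limpiador",
}
print(link("guardia de seguridad", code_to_label))  # prints "code_1"

In practice the label vectors would be computed once and cached rather than re-embedded for every mention.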
3 Experimental Setup

3.1 Data

The train/test split of the corpus for the shared task is shown in Table 1. Count statistics of the entities are shown in Table 2.

Table 1. Dataset count statistics.

Dataset     Clinical Cases    Sentences
Training    1500              49932
Test        344               9671

Table 2. Count statistics of the entities.

Occupation Entity     Count (Train set)    Count (Test set)
PROFESION             2513                 693
SITUACION_LABORAL     1010                 356
ACTIVIDAD             109                  28

3.2 External Libraries

The system utilized the following libraries:

– fastText (version 0.9.2) (https://github.com/facebookresearch/fastText)
– sklearn-crfsuite (version 0.3.6) (https://sklearn-crfsuite.readthedocs.io/)
– spaCy (version 3.0.5) (https://github.com/explosion/spaCy)

Their usage has been explained in Sections 2.2 and 2.3.

3.3 Evaluation Metrics

For both the MEDDOPROF-NER and MEDDOPROF-NORM tasks, precision, recall and F-score are computed for each clinical case. These metrics are then summarized over the corpus using micro-averaging.

4 Results

4.1 Quantitative Findings

Table 3 shows the micro-average metrics for MEDDOPROF-NER on both the training and test sets, computed using the evaluation script provided by the task organizers. Results for MEDDOPROF-NORM are shown in Table 4.

Table 3. MEDDOPROF-NER micro-average metrics.

Metrics/Dataset    Training    Test
Precision          0.953       0.807
Recall             0.839       0.524
F-score            0.892       0.635

Table 4. MEDDOPROF-NORM micro-average metrics.

Metrics/Dataset    Training    Test
Precision          0.956       0.720
Recall             0.840       0.467
F-score            0.894       0.566

4.2 Error Analysis

MEDDOPROF-NER Table 3 compares the metrics obtained when the model trained on the training set is applied to the same set and to the unseen test set. Although a performance drop on the test set is expected, the drop in recall is almost double the drop in precision. The primary reason for the low recall is the model's failure to detect the entity mentions themselves. Table 6 shows the count of instances where the model succeeded or failed in identifying entity mentions, irrespective of the three entity classes. The match types are defined as follows:

– Exact Match: the predicted entity mention's text span matches exactly with the ground truth.
– Partial Match: the predicted entity mention's text span matches partially with the ground truth.
– False Negative: ground truth entity mentions for which the model failed to predict any entity class.
– False Positive: model-predicted entity mentions where there is no entity class as per the ground truth.

Around 37% of the ground truth entities fall under False Negative. False negative/positive cases are those where there is not even a partial match between the ground truth and the predicted entities. The official scores shown in Table 3 are based on the strict evaluation setting, which counts Partial Match entities as errors. Table 5 shows a more granular analysis at the token level, where B-Entity stands for the first token of an entity and I-Entity for the remaining entity tokens.

Table 5. MEDDOPROF-NER token-level metrics on the test set, produced using the seqeval library (https://github.com/chakki-works/seqeval).

                       Precision    Recall    F-score    Support
B-ACTIVIDAD            0.818        0.321     0.462      28
I-ACTIVIDAD            0.867        0.433     0.578      60
B-PROFESION            0.923        0.657     0.767      693
I-PROFESION            0.825        0.616     0.705      1148
B-SITUACION_LABORAL    0.790        0.444     0.568      356
I-SITUACION_LABORAL    0.760        0.446     0.562      491
O                      0.995        0.999     0.997      216660
accuracy                                      0.993      219436
macro avg              0.854        0.559     0.663      219436
weighted avg           0.993        0.993     0.993      219436

Table 6. Entity mention detection in the test set.

Match type        Count
Exact Match       579
Partial Match     102
False Negative    405
False Positive    38
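The match-type counts in Table 6 can be derived from the character offsets of the gold and predicted mentions alone. The sketch below is an illustrative span comparison, not the official evaluation script; the helper names and the toy spans are hypothetical.

def overlaps(a, b):
    """True if two character spans (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def match_type_counts(gold_spans, pred_spans):
    """Classify the spans of one document into exact match, partial match,
    false negative and false positive, as in Table 6."""
    counts = {"exact": 0, "partial": 0, "false_negative": 0, "false_positive": 0}
    for g in gold_spans:
        if g in pred_spans:
            counts["exact"] += 1
        elif any(overlaps(g, p) for p in pred_spans):
            counts["partial"] += 1
        else:
            counts["false_negative"] += 1
    for p in pred_spans:
        if p not in gold_spans and not any(overlaps(p, g) for g in gold_spans):
            counts["false_positive"] += 1
    return counts

# Toy example: one exact match, one partial match, one missed gold mention
gold = [(10, 21), (30, 52), (80, 90)]
pred = [(10, 21), (30, 40)]
print(match_type_counts(gold, pred))
# {'exact': 1, 'partial': 1, 'false_negative': 1, 'false_positive': 0}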
MEDDOPROF-NORM Since the output of MEDDOPROF-NER is fed as input to MEDDOPROF-NORM, the corresponding metrics for MEDDOPROF-NORM are poorer. Any improvement in MEDDOPROF-NER would therefore be reflected in MEDDOPROF-NORM.

5 Conclusion

This paper proposes a CRF-based approach for named entity extraction and a vector embedding similarity based approach for entity linking.

Future plans:

– Extract global structured information features from the dependency parse tree [4].
– Develop an LSTM-CRF model [6], which would automate the feature extraction.

Around 9.4% of the ground truth entities fall under Partial Match. [4] defines a valid span as a word sequence that is covered by a chain of dependency parse arcs where no arc is covered by another. This enables the extraction of dependency-parse-based global structured information rather than only the local features mentioned in Section 2.2. As per this definition, the entity guardia de seguridad shown in Fig. 1 can be considered a valid span, as can be seen in Fig. 2. Using global structured information features would hopefully produce correct entity mention spans.

The significant drop in both precision and recall on unseen test data compared to the seen training data shows the need for better features. Hence, the plan is to develop an LSTM-based model for improved feature extraction.

References

1. Fisher, K., Griffith, L., Gruneir, A., Upshur, R., Perez, R., Favotto, L., Nguyen, F., Markle-Reid, M., Ploeg, J.: Effect of socio-demographic and health factors on the association between multimorbidity and acute care service use: population-based survey linked to health administrative data. BMC Health Services Research 21, 62 (2021). https://doi.org/10.1186/s12913-020-06032-5
2. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
3. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303
4. Jie, Z., Muis, A.O., Lu, W.: Efficient dependency-guided named entity recognition. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 3457–3465. AAAI'17, AAAI Press (2017)
5. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
6. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural Architectures for Named Entity Recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics, San Diego, California (2016). https://doi.org/10.18653/v1/N16-1030
7. Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Brivá-Iglesias, V., Krallinger, M.: NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural 67 (2021)
8. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Mathematics of Computation 35(151), 773–782 (1980). http://www.jstor.org/stable/2006193
9. Okazaki, N.: CRFsuite: a fast implementation of Conditional Random Fields (CRFs) (2007). http://www.chokkan.org/software/crfsuite/
10. Park, S., Jeon, H.J., Kim, J.U., Kim, S., Roh, S.: Sociodemographic factors associated with the use of mental health services in depressed adults: results from the Korea National Health and Nutrition Examination Survey (KNHANES). BMC Health Services Research 14, 645 (2014). https://doi.org/10.1186/s12913-014-0645-7