<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Occupation Recognition and Normalization in Clinical Notes</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Philips India Limited</institution>
          ,
          <addr-line>Bangalore, Karnataka 560045</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the system submitted to the MEDical DOcuments PROFessions recognition (MEDDOPROF) shared task, which is part of the Iberian Languages Evaluation Forum (IberLeF) 2021. The Named Entity Recognition (NER) model, built using a Conditional Random Field, detects occupation and employment status entities in Spanish medical documents. The entities are mapped to their codes using the vector embedding similarity between the mention text and the code label text. The model obtains an F-score of 0.635 for NER and 0.566 for the normalization task.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>Vector Embedding Similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
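      <p>The MEDDOPROF shared task comprises three tracks:
        <list list-type="bullet">
          <list-item><p>Track 1: MEDDOPROF-NER</p></list-item>
          <list-item><p>Track 2: MEDDOPROF-CLASS</p></list-item>
          <list-item><p>Track 3: MEDDOPROF-NORM</p></list-item>
        </list>
      </p>
      <p>MEDDOPROF-NER (Track 1) is a named entity recognition problem that
requires finding mentions of occupations and classifying them as:
        <list list-type="bullet">
          <list-item><p>Profession: label PROFESION.</p></list-item>
          <list-item><p>Employment Status: label SITUACION_LABORAL.</p></list-item>
          <list-item><p>Activity: label ACTIVIDAD.</p></list-item>
        </list>
      </p>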
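      <p>MEDDOPROF-CLASS (Track 2) requires determining to whom the occupation
mention belongs:
        <list list-type="bullet">
          <list-item><p>Patient: label PACIENTE.</p></list-item>
          <list-item><p>Family Member: label FAMILIAR.</p></list-item>
          <list-item><p>Health Professional: label SANITARIO.</p></list-item>
          <list-item><p>Someone else: label OTROS.</p></list-item>
        </list>
      </p>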
      <p>MEDDOPROF-NORM (Track 3) is an entity linking problem that requires
mapping predicted entities to codes. The codes are unique concept identifiers
from:
        <list list-type="bullet">
          <list-item><p>European Skills, Competences, Qualifications and Occupations (ESCO)</p></list-item>
          <list-item><p>SNOMED-CT</p></list-item>
        </list>
      </p>
      <p>The shared task consists of clinical text written in Spanish. According to the Instituto
Cervantes 2019 yearbook, Spanish is the world's second-most spoken native
language, with more than 480 million native speakers (https://www.languagemagazine.com/2019/11/18/spanish-in-the-world/). According to its 2014 report (https://en.wikipedia.org/wiki/List_of_countries_where_Spanish_is_an_official_language),
it is the official language of 20 sovereign states and one dependent territory,
with a total population of around 442 million. Hence, building NLP systems for
Spanish can have a significant societal impact.</p>
      <p>The system described in this paper participated in MEDDOPROF-NER and
MEDDOPROF-NORM. The source code has been shared on GitHub (https://github.com/kaushikacharya/clinical_occupation_recognition).</p>
    </sec>
    <sec id="sec-2">
      <title>Model Description</title>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>The corpus was annotated with profession and employment status in BRAT
standoff format (https://brat.nlplab.org/standoff.html) by a team composed of
linguists and clinical experts. The clinical cases were sourced from different
specialties.</p>
        <p>MEDDOPROF-NER: For each clinical case, the clinical note is stored in a text file
(.txt) and its annotation(s) in an annotation file (.ann). An example BRAT annotation
is shown in Fig. 1.</p>
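        <p>To make the annotation format concrete, the following is a minimal sketch of reading the text-bound annotations of a BRAT .ann file. The function name parse_brat_ann, the Entity record and the example note with its offsets are hypothetical and constructed for illustration only; they are not taken from the corpus or from the submitted system.</p>
        <preformat>
```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_ann(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from the content of a BRAT .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations (IDs starting with "T") describe entity spans.
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, mention = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer boundaries.
        numbers = offsets.replace(";", " ").split()
        entities.append(Entity(ann_id, label, int(numbers[0]), int(numbers[-1]), mention))
    return entities


# Hypothetical note and annotation, constructed for illustration only.
note = "Trabaja como guardia de seguridad."
ann = "T1\tPROFESION 13 33\tguardia de seguridad\n"

entities = parse_brat_ann(ann)
for ent in entities:
    # The character offsets index directly into the note text.
    print(ent.label, note[ent.start:ent.end])
```
        </preformat>
        <p>In practice the two strings would be read from the paired .txt and .ann files of a clinical case.</p>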
        <p>
          MEDDOPROF-NORM: The code corresponding to each annotated entity in the
training set was provided in a tab-separated file (.tsv). Additionally, a code reference
list was provided with each code, its corresponding label and an alternative label (if
available). Professions are primarily mapped to ESCO, while working statuses
and activities are mapped to SNOMED-CT.
        </p>
        <p>
          A linear-chain Conditional Random Fields (CRF) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] classifier was trained
to recognize the named entities. Parameter estimation is done using the
limited-memory BFGS (L-BFGS) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] optimization algorithm. L-BFGS belongs
to the family of quasi-Newton methods and approximates the
Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount
of computer memory. The classifier is trained with both L1 and L2
regularization (coefficient 0.1 for each). The CRF model is implemented using
sklearn-crfsuite, which is a wrapper over CRFsuite [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
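        <p>For reference, the regularized training objective implied by this setup can be sketched as follows. The notation is an assumption introduced here: θ denotes the CRF weights, N the number of training sequences, and c1 and c2 the L1 and L2 coefficients (exposed as the c1 and c2 parameters in sklearn-crfsuite). The formula holds up to the constant-factor convention the library uses for the L2 term.</p>
        <disp-formula>
          <tex-math>
\hat{\theta} = \arg\min_{\theta} \; -\sum_{i=1}^{N} \log p_{\theta}\left(y^{(i)} \mid x^{(i)}\right) + c_1 \lVert \theta \rVert_1 + c_2 \lVert \theta \rVert_2^2, \qquad c_1 = c_2 = 0.1
          </tex-math>
        </disp-formula>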
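        <p>
          Features: Features are extracted using spaCy [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], an open-source software library
for natural language processing written in Python and Cython. The model used is es_core_news_sm
(https://spacy.io/models/es), which was trained on news and
media text. The features can be categorized as follows:
          <list list-type="bullet">
            <list-item>
              <p>Lexical features</p>
              <list list-type="bullet">
                <list-item><p>Unigrams (current and immediate neighbor words)</p></list-item>
                <list-item><p>String case</p></list-item>
              </list>
            </list-item>
            <list-item>
              <p>Part-of-speech (POS) features</p>
              <list list-type="bullet">
                <list-item><p>Current word's POS</p></list-item>
                <list-item><p>Previous and next word's POS</p></list-item>
                <list-item><p>Governor word's POS</p></list-item>
              </list>
            </list-item>
            <list-item>
              <p>Dependency parse features</p>
              <list list-type="bullet">
                <list-item><p>Governor word</p></list-item>
                <list-item><p>Dependency type of the current word</p></list-item>
                <list-item><p>Dependency type of the governor word</p></list-item>
              </list>
            </list-item>
          </list>
        </p>
        <p>
          Example: In the sentence shown in Fig. 2, seguridad produces the following
features:
          <list list-type="bullet">
            <list-item><p>current word POS: NOUN</p></list-item>
            <list-item><p>dependency tag: nmod</p></list-item>
            <list-item><p>parent dependency tag: appos</p></list-item>
          </list>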
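        </p>
        <p>The feature categories above can be sketched as a per-token feature dictionary of the kind CRF toolkits accept. This is an illustration under stated assumptions, not the submitted implementation: the feature names, the token_features function and the hand-written parse of the example phrase are hypothetical, and each token is represented as a plain dictionary carrying the attributes a parser such as spaCy exposes (text, POS tag, dependency label and governor index).</p>
        <preformat>
```python
from typing import Dict, List

# Each token carries: text, coarse POS tag, dependency label and the index
# of its governor (the root points to itself, following spaCy's convention).
Token = Dict[str, object]


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build the feature dict for token i of a parsed sentence."""
    tok = sent[i]
    head = sent[tok["head"]]
    word = tok["text"]
    feats = {
        # Lexical features: current unigram and string case
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        # POS features of the current word and its governor
        "pos": tok["pos"],
        "head.pos": head["pos"],
        # Dependency parse features
        "head.word": head["text"].lower(),
        "dep": tok["dep"],
        "head.dep": head["dep"],
    }
    if i > 0:
        feats["prev.word"] = sent[i - 1]["text"].lower()
        feats["prev.pos"] = sent[i - 1]["pos"]
    else:
        feats["BOS"] = True  # beginning of sentence
    if len(sent) > i + 1:
        feats["next.word"] = sent[i + 1]["text"].lower()
        feats["next.pos"] = sent[i + 1]["pos"]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


# Hand-written, hypothetical parse used only to illustrate the feature layout.
SENTENCE: List[Token] = [
    {"text": "Paciente", "pos": "NOUN", "dep": "ROOT", "head": 0},
    {"text": ",", "pos": "PUNCT", "dep": "punct", "head": 2},
    {"text": "guardia", "pos": "NOUN", "dep": "appos", "head": 0},
    {"text": "de", "pos": "ADP", "dep": "case", "head": 4},
    {"text": "seguridad", "pos": "NOUN", "dep": "nmod", "head": 2},
]

features = token_features(SENTENCE, 4)
print(features["pos"], features["dep"], features["head.dep"])
```
        </preformat>
        <p>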
Mapping each predicted entity to its relevant code is solved using a vector
embedding similarity approach. Vector embeddings for the entity mentions and for the text
corresponding to the codes are generated using fastText's pretrained model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
for Spanish. Each predicted entity is assigned the code whose text has the highest
cosine similarity with the mention.
        </p>
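        <p>The code assignment by highest cosine similarity can be sketched as follows. The toy three-dimensional word vectors, the placeholder code identifiers CODE_A and CODE_B, and the choice of averaging word vectors to embed a phrase are all assumptions made for this illustration; a real system would obtain the vectors from a pretrained Spanish embedding model.</p>
        <preformat>
```python
import math
from typing import Dict, List, Sequence

# Toy word vectors invented for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "guardia": [0.9, 0.1, 0.0],
    "vigilante": [0.8, 0.2, 0.1],
    "de": [0.1, 0.1, 0.1],
    "seguridad": [0.7, 0.0, 0.3],
    "enfermera": [0.0, 0.9, 0.2],
}


def embed(text: str) -> List[float]:
    """Average the vectors of the known words in text (an assumed pooling choice)."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def assign_code(mention: str, code_labels: Dict[str, str]) -> str:
    """Return the code whose label text is most similar to the mention."""
    mention_vec = embed(mention)
    return max(code_labels, key=lambda code: cosine(mention_vec, embed(code_labels[code])))


# Placeholder code identifiers; real ESCO or SNOMED-CT codes are not shown here.
CODE_LABELS = {"CODE_A": "guardia de seguridad", "CODE_B": "enfermera"}
best_code = assign_code("vigilante de seguridad", CODE_LABELS)
print(best_code)
```
        </preformat>
        <p>When a code has both a label and an alternative label, one option is to score both texts and keep the higher similarity; this is a design choice the description above leaves open.</p>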
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>The train/test split of the corpus for the shared task is shown in Table 1. Count
statistics of the entities are shown in Table 2.</p>
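        <table-wrap>
          <label>Table 2</label>
          <caption><p>Count statistics of the entities.</p></caption>
          <table>
            <thead>
              <tr><th>Occupation Entity</th><th>Count (Train set)</th><th>Count (Test set)</th></tr>
            </thead>
            <tbody>
              <tr><td>PROFESION</td><td>2513</td><td>693</td></tr>
              <tr><td>SITUACION_LABORAL</td><td>1010</td><td>356</td></tr>
              <tr><td>ACTIVIDAD</td><td>109</td><td>28</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For both the MEDDOPROF-NER and MEDDOPROF-NORM tasks, precision,
recall and F-score are computed for each clinical case. These metrics are
then summarized over the corpus using micro-averaging.</p>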
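        <p>One common reading of micro-averaging is to pool the true positive, false positive and false negative counts of all clinical cases before computing the metrics. The sketch below follows that reading with strict matching of (start, end, label) triples. The function name micro_prf and the example spans are hypothetical, and the official evaluation script may differ in its details.</p>
        <preformat>
```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def micro_prf(gold: Dict[str, Set[Span]], pred: Dict[str, Set[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-score with strict span matching."""
    tp = fp = fn = 0
    for doc_id in set(gold).union(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g.intersection(p))  # same offsets and same label
        fp += len(p - g)              # predicted but not in the ground truth
        fn += len(g - p)              # in the ground truth but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score


# Hypothetical annotations for two clinical cases.
gold = {
    "case1": {(13, 33, "PROFESION"), (40, 48, "SITUACION_LABORAL")},
    "case2": {(5, 12, "PROFESION")},
}
pred = {
    "case1": {(13, 33, "PROFESION")},
    "case2": {(5, 12, "PROFESION"), (20, 25, "ACTIVIDAD")},
}

precision, recall, f_score = micro_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```
        </preformat>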
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Quantitative Findings</title>
        <p>MEDDOPROF-NER: Table 3 compares the metrics obtained when the model
trained on the training set is applied to that same set and to the unseen test set. Although
a decrease in performance on the test set is expected, the drop
in recall is almost double the drop in precision. The primary reason for the low
recall is the model's failure to detect the entity mentions at all. Table 6 shows the
count of instances where the model succeeded or failed in identifying entity mentions,
irrespective of the three entity classes.</p>
        <p>Description of the match types:
          <list list-type="bullet">
            <list-item><p>Exact Match: The text span of the predicted entity mention matches the ground truth exactly.</p></list-item>
            <list-item><p>Partial Match: The text span of the predicted entity mention matches the ground truth partially.</p></list-item>
            <list-item><p>False Negative: A ground truth entity mention for which the model failed to predict any of the entity classes.</p></list-item>
            <list-item><p>False Positive: A model-generated entity mention where there is no entity class according to the ground truth.</p></list-item>
          </list>
        </p>
        <p>Around 37% of the ground truth entities are false negatives. False
negative and false positive cases are those where there is not even a partial match between
ground truth and predicted entities.</p>
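        <p>The four match types can be sketched as a span-overlap count that ignores the entity class. This is an illustration only: the function names and the example offsets are hypothetical, and counting exact and partial matches from the ground-truth side is an assumption about how the counts are tallied.</p>
        <preformat>
```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return min(a[1], b[1]) > max(a[0], b[0])


def match_types(gold: List[Span], pred: List[Span]) -> Dict[str, int]:
    """Count exact, partial, false negative and false positive mentions."""
    counts = {"exact": 0, "partial": 0, "false_negative": 0, "false_positive": 0}
    for g in gold:
        if g in pred:
            counts["exact"] += 1
        elif any(overlaps(g, p) for p in pred):
            counts["partial"] += 1
        else:
            # Not even a partial match with any prediction.
            counts["false_negative"] += 1
    for p in pred:
        if not any(overlaps(p, g) for g in gold):
            # Prediction that touches no ground truth mention.
            counts["false_positive"] += 1
    return counts


# Hypothetical offsets for one clinical case.
gold_spans = [(13, 33), (40, 48), (60, 70)]
pred_spans = [(13, 33), (44, 48), (80, 90)]

counts = match_types(gold_spans, pred_spans)
print(counts)
```
        </preformat>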
        <p>The official scores shown in Table 3 are based on the strict evaluation setting, which
counts Partial Match entities as failures. Table 5 shows a granular analysis
at the token level. B-Entity stands for the first token of an entity; I-Entity stands for
the remaining tokens of the entity.</p>
        <p>MEDDOPROF-NORM: Since the output of MEDDOPROF-NER is fed as input to
MEDDOPROF-NORM, the corresponding metrics for MEDDOPROF-NORM
are lower. Any improvement in MEDDOPROF-NER would be reflected in
MEDDOPROF-NORM.</p>
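        <p>The B-Entity and I-Entity token labels used in the token-level analysis can be illustrated with a small sketch that converts character-level entity spans into per-token tags. The function name bio_tags, the token offsets and the example entity are hypothetical; tokens outside any entity receive the conventional O tag, which is an assumption since the text above names only the B- and I- prefixes.</p>
        <preformat>
```python
from typing import List, Tuple


def bio_tags(tokens: List[Tuple[int, int]], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Assign B-/I-/O tags to tokens given as (start, end) character offsets."""
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        first = True
        for idx, (tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity when it lies inside the entity span.
            if tok_start >= ent_start and ent_end >= tok_end:
                tags[idx] = ("B-" if first else "I-") + label
                first = False
    return tags


# Token offsets for the hypothetical note "Trabaja como guardia de seguridad."
token_offsets = [(0, 7), (8, 12), (13, 20), (21, 23), (24, 33), (33, 34)]
entity_spans = [(13, 33, "PROFESION")]

tags = bio_tags(token_offsets, entity_spans)
print(tags)
```
        </preformat>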
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper proposes CRF-based named entity extraction and entity linking
based on vector embedding similarity.</p>
      <p>
        Future plans:
        <list list-type="bullet">
          <list-item>
            <p>
              Extract global structured information features from the dependency parse
tree [
              <xref ref-type="bibr" rid="ref4">4</xref>
              ].
            </p>
          </list-item>
          <list-item>
            <p>
              Develop an LSTM-CRF model [
              <xref ref-type="bibr" rid="ref6">6</xref>
              ], which would automate the feature extraction.
            </p>
          </list-item>
        </list>
      </p>
      <p>
        Around 9.4% of the ground truth entities fall under Partial Match. Jie et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] define a
valid span as a word sequence that is covered by a chain of dependency parse arcs
where no arc is covered by another. This enables extraction of dependency-parse-based
global structured information rather than only the local features described
in Section 2.2. By this definition, the entity guardia de seguridad shown in Fig. 1 can
be considered a valid span, as can be seen in Fig. 2. Using global structured
information features would hopefully produce the correct entity mention span.
      </p>
      <p>The significant drop in both precision and recall on unseen test data
compared to seen training data shows that there is a need for better features. Hence,
we plan to develop an LSTM model for improved feature extraction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fisher</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffith</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruneir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Upshur</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Favotto</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markle-Reid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ploeg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Effect of socio-demographic and health factors on the association between multimorbidity and acute care service use: population-based survey linked to health administrative data</article-title>
          .
          <source>BMC Health Services Research</source>
          <volume>21</volume>
          ,
          <elocation-id>62</elocation-id>
          (Jan
          <year>2021</year>
          ). https://doi.org/10.1186/s12913-020-06032-5
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning word vectors for 157 languages</article-title>
          .
          In:
          <source>Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Landeghem</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>spaCy: Industrial-strength Natural Language Processing in Python</article-title>
          (
          <year>2020</year>
          ). https://doi.org/10.5281/zenodo.1212303
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jie</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muis</surname>
            ,
            <given-names>A.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Efficient dependency-guided named entity recognition</article-title>
          .
          In:
          <source>Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          . pp.
          <fpage>3457</fpage>
          –
          <lpage>3465</lpage>
          . AAAI'17, AAAI Press (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.N.</given-names>
          </string-name>
          :
          <article-title>Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</article-title>
          .
          In:
          <source>Proceedings of the Eighteenth International Conference on Machine Learning</source>
          . pp.
          <fpage>282</fpage>
          –
          <lpage>289</lpage>
          . ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural Architectures for Named Entity Recognition</article-title>
          .
          In:
          <source>Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>260</fpage>
          –
          <lpage>270</lpage>
          . Association for Computational Linguistics, San Diego, California (Jun
          <year>2016</year>
          ). https://doi.org/10.18653/v1/N16-1030
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lima-Lopez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farre-Maduell</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Escalada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Briva-Iglesias</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nocedal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Updating quasi-Newton matrices with limited storage</article-title>
          .
          <source>Mathematics of Computation</source>
          <volume>35</volume>
          (
          <issue>151</issue>
          ),
          <fpage>773</fpage>
          –
          <lpage>782</lpage>
          (
          <year>1980</year>
          ), http://www.jstor.org/stable/2006193
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>CRFsuite: a fast implementation of Conditional Random Fields (CRFs)</article-title>
          (
          <year>2007</year>
          ), http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeon</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Sociodemographic factors associated with the use of mental health services in depressed adults: Results from the Korea National Health and Nutrition Examination Survey (KNHANES)</article-title>
          .
          <source>BMC Health Services Research</source>
          <volume>14</volume>
          ,
          <elocation-id>645</elocation-id>
          (Dec
          <year>2014</year>
          ). https://doi.org/10.1186/s12913-014-0645-7
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>