<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERT Representations to Identify Professions and Employment Statuses in Health data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jose-Alberto Mesa-Murgado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pilar López-Úbeda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel-Carlos Díaz-Galiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Teresa Martín-Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Informática, CEATIC, Universidad de Jaén</institution>
          ,
          <addr-line>España</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we detail our submission to the Medical Documents Profession Recognition (MEDDOPROF) shared task, for which we submitted three different proposals for each subtask. The subtasks involve: Named Entity Recognition of mentions of professions, occupations and employment statuses within clinical reports; classifying whether those mentions refer to the visiting patient, a relative, medical staff or others; and, lastly, mapping those mentions to the European Skills, Competences, Qualifications and Occupations (ESCO) classification and relevant SNOMED-CT terms. Our proposals are based on BERT representations and on the similarity between the resulting word embeddings and those of their corresponding classes.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>BERT Representations</kwd>
        <kwd>Professions and Employment Statuses</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper describes the systems we developed for our participation in
the Medical Documents Profession Recognition (MEDDOPROF) shared task
[7], focused on the detection and recognition of professions and employment
statuses in health data.</p>
      <p>This challenge was organised by the Barcelona Supercomputing Center
as part of the 2021 Iberian Languages Evaluation Forum (IberLEF) evaluation
initiative, co-located with the XXXVII Congreso Internacional de la Sociedad
Española para el Procesamiento del Lenguaje Natural (SEPLN 2021).</p>
      <p>Detecting professions and employment statuses in clinical reports could help
public and private organizations to evaluate and predict the impact of future
pandemic diseases spreading in a given location, using the reports written after every
medical visit.</p>
      <p>These medical reports are a very valuable resource, although access to them is strictly
restricted. Their study leads to the development of highly useful applications for
the general public, such as the detection of COVID-19 using radiological textual
reports [9] or the detection of tumor morphology within medical reports, as seen in
the CANcer TExt Mining Shared Task: tumor named entity recognition (NER)
(CANTEMIST) [11], another challenge proposed by the Barcelona
Supercomputing Center in which our research group also participated [8].</p>
      <p>The idea behind this study was introduced in the ProfNER shared task, held as
part of the Social Media Mining for Health Applications (SMM4H) workshop co-located
with the 2021 Annual Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL 2021) [12], which used data retrieved
from social media sources instead of clinical reports. Our research group also
participated in this initiative [10], using a Bidirectional Long Short-Term Memory
(BiLSTM) recurrent neural network (RNN) approach.</p>
      <p>
        The development of such applications tackles tasks that are commonly
attributed to the Natural Language Processing (NLP) field, which aims at
providing techniques to process textual information written in natural language so
that a computer can understand such data in order to fulfill
a specific purpose such as Question Answering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Automated Translation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
among others [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These tasks are solved through the use of Machine Learning
and Deep Learning approaches.
      </p>
      <p>
        In particular, the current state-of-the-art approach revolves around the use of
BERT (Bidirectional Encoder Representations from Transformers) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which has
achieved strong results in a wide variety of the common NLP
tasks stated above [13].
      </p>
      <p>
        Our team proposes three systems based on BETO [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the BERT Transformer
language model trained on a Spanish corpus: on the one hand, a multiclass
approach and, on the other, two approaches that rely on a procedure similar to
binary classification, that is, masking all classes under the same label
in order to discriminate between tokens as either entities or non-entities.
      </p>
      <p>The rest of this paper is structured as follows. Section 2
introduces the aim of this challenge as well as the description and characteristics
of the provided dataset. Section 3 describes each of the three systems proposed
to tackle each task, including the hyperparameters we explored, the
preprocessing applied to the input data and the results obtained on our
development set. Section 4 presents the results obtained by our systems on the
evaluation set, which are then analyzed in Section 5. Lastly, we state
our conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>MEDDOPROF Tasks Description</title>
      <p>The MEDDOPROF shared task proposed three tasks. Participants were not
obliged to provide a system for each one of them,
as the tasks did not depend on each other.</p>
      <p>1. Task 1, namely "MEDDOPROF-NER", asks participants to automatically
identify mentions of professions or occupations and employment statuses
in plain-text medical documents and to tag each of those entities with
the label "PROFESION" or "SITUACION LABORAL". There is also a
third label, "ACTIVIDAD", but it is scarce among the given
training documents.</p>
      <p>2. Task 2, namely "MEDDOPROF-CLASS", requires participants to once again
identify mentions of professions or occupations and employment statuses
in a given set of plain-text documents and to label each of those
mentions according to the person to whom it refers: the patient ("PACIENTE"),
a relative ("FAMILIAR"), medical staff ("SANITARIO") or
anyone else ("OTROS").</p>
      <p>3. Lastly, Task 3, or "MEDDOPROF-NORM", requires participants to catalogue
those same mentions according to the European Skills, Competences, Qualifications
and Occupations (ESCO) classification and relevant SNOMED-CT terms.</p>
      <sec id="sec-2-1">
        <title>Dataset provided</title>
        <p>The MEDDOPROF-NER and MEDDOPROF-CLASS corpus consists of 1844
documents; these documents have been transformed by the organizers into BRAT
format to extract the entities associated with each document. Below, we examine
each one of the datasets provided: training and evaluation.</p>
        <p>Training dataset. It contains 1500 documents. For each of them we computed
the length of its content (number of characters in the document), the number
of tokens and the number of entities it contains. This information is
detailed in Table 1.</p>
        <p>Development dataset. In order to evaluate the performance of our systems
we had to divide the training corpus. We used approximately 80% of
the given data as training data (1191 documents), while the remaining 20% was
used as our development set (309 documents). This division was random but
consistent with the number of entities in the corpus. Therefore, the development
set contains proportionally fewer entities than the training set, in line with the 80/20 split.
Its characteristics are detailed in Table 2.</p>
        <p>Evaluation dataset. It consists of 344 documents, for which we also retrieved the
document length (number of characters) as well as the
number of tokens and entities. This information is available in Table 3.</p>
        <p>We propose three systems based on BETO:</p>
        <p>1. The first approach is a multiclass NER system trained to detect and
differentiate between professions and occupations, employment statuses and
activities as per their corresponding labels: "PROFESION", "SITUACION
LABORAL" and "ACTIVIDAD", respectively. Throughout this document
we refer to this system as "Multiclass BETO".</p>
        <p>2. The second one discerns tokens that are entities from those that are not,
tagging them as either 'ENTITY' or 'O', respectively. This approach follows
the IOB (Inside-Outside-Beginning) [14] format. After retrieving the text
spans identified as entities, we map each of them to its
corresponding word embedding using BETO and then calculate the distance
between it and the word embeddings related to each tag. Afterwards,
our system assigns to the identified entity the label with the smallest distance value (in other
words, with the highest similarity). This distance is
measured using cosine similarity as per Equation 1, where
x and y are two distinct word embeddings.</p>
        <p>\[ k(x, y) = \frac{x y^{\top}}{\lVert x \rVert \, \lVert y \rVert} \tag{1} \]</p>
        <p>3. The last of our systems is an extension of the one presented in the previous
point. The training set is expanded with
the corpus provided by the ProfNER shared task, which gives our system
more training data and results in better predictions.</p>
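        <p>The label assignment of the second system can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and short toy vectors stand in for the BETO embeddings of the text span and of each label.</p>

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equally sized vectors (Equation 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def closest_label(span_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the span embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(span_embedding, label_embeddings[label]),
    )


# Toy 3-dimensional vectors (assumption): real BETO embeddings are much larger.
label_embeddings = {
    "PROFESION": [1.0, 0.0, 0.0],
    "SITUACION LABORAL": [0.0, 1.0, 0.0],
    "ACTIVIDAD": [0.0, 0.0, 1.0],
}
span_embedding = [0.9, 0.1, 0.2]
print(closest_label(span_embedding, label_embeddings))  # PROFESION
```

        <p>Because the same comparison only needs one embedding per candidate class, the sketch applies unchanged to any label inventory, which is consistent with the binary systems being reused across tasks.</p>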
      </sec>
      <sec id="sec-2-2">
        <title>Preprocessing</title>
        <p>
          We use the spaCy [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] library for Python to create a text file containing
each sequence in our dataset, its start and end positions with respect
to the document in which it is found, as well as each original token and its
normalized counterpart.
        </p>
        <p>For this normalization we transformed each token to lowercase,
converted accented characters into their unaccented forms and removed all
characters other than letters, numbers or punctuation marks such as dots, commas,
semicolons, etc.</p>
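        <p>A minimal sketch of this normalization, using only the Python standard library, could look as follows. The exact set of characters that are kept is an assumption, since the text lists the allowed punctuation only by example, and the function name is hypothetical.</p>

```python
import re
import unicodedata

# Assumption: lowercase letters, digits, whitespace and basic punctuation are kept.
DISALLOWED = re.compile(r"[^a-z0-9\s.,;:]")


def normalize_token(token):
    """Lowercase a token, strip diacritics and drop any other character."""
    lowered = token.lower()
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    unaccented = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return DISALLOWED.sub("", unaccented)


print(normalize_token("Médico"))      # medico
print(normalize_token("Compañero,"))  # companero,
```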
        <p>BERT allows a maximum of 512 tokens per
sequence and the given dataset contains sentences above that limit, so we
established a maximum of 215 tokens per sequence. This decision was
made after noticing the memory limitations of our machine. We experimented
with other values (256 and 300, to be precise); however,
either the improvement in our results was not noticeable or results could not be obtained at all.</p>
        <p>Consequently, our strategy was to divide sequences into smaller
ones after normalization. A sequence was cut right after a
dot character was detected, or once its length reached
our established maximum number of tokens per sequence.</p>
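        <p>The splitting strategy can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function name is hypothetical and the input is assumed to be an already normalized list of tokens.</p>

```python
MAX_TOKENS = 215  # maximum sequence length reported in the text


def split_sequence(tokens, max_tokens=MAX_TOKENS):
    """Cut a token list after each dot, or once a chunk reaches max_tokens."""
    chunks, current = [], []
    for token in tokens:
        current.append(token)
        if token == "." or len(current) >= max_tokens:
            chunks.append(current)
            current = []
    if current:  # keep the trailing tokens that did not end with a dot
        chunks.append(current)
    return chunks


tokens = ["paciente", "jubilado", ".", "trabaja", "como", "medico"]
print(split_sequence(tokens, max_tokens=3))
# [['paciente', 'jubilado', '.'], ['trabaja', 'como', 'medico']]
```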
      </sec>
      <sec id="sec-2-3">
        <title>Hyperparameters tuning</title>
        <p>Training was carried out on six NVIDIA RTX 2080 Ti GPUs (Turing architecture),
each with 11 GB of memory. We explored
different batch sizes (8 and 16) and different numbers of epochs (3, 4 and 5), and
used the development dataset to test the performance of our systems.</p>
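        <p>The explored search space is small enough to enumerate exhaustively. The sketch below lists the six combinations and selects the best one; the evaluate callable is a hypothetical stand-in for training a model and returning its micro F1 score on the development set.</p>

```python
from itertools import product

BATCH_SIZES = (8, 16)
EPOCHS = (3, 4, 5)


def grid():
    """Enumerate every batch size / epoch combination named in the text."""
    return [
        {"batch_size": batch_size, "epochs": epochs}
        for batch_size, epochs in product(BATCH_SIZES, EPOCHS)
    ]


def select_best(configs, evaluate):
    """Return the configuration with the highest score.

    evaluate is a hypothetical callable: config -> development-set micro F1.
    """
    return max(configs, key=evaluate)


print(len(grid()))  # 6
```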
        <p>The combinations of these parameters (epochs and batch sizes) produced
similar scores, so we include only the top
performers for each task and system, as follows:</p>
        <p>- Table 4 shows the top combinations for the MEDDOPROF-NER task using the
multiclass approach.</p>
        <p>- Table 5 does the same for the MEDDOPROF-CLASS task using the
multiclass approach.</p>
        <p>- Table 6 displays the scores obtained on both the MEDDOPROF-NER and
MEDDOPROF-CLASS tasks using the binary approach.</p>
        <p>- Lastly, Table 7 shows the results obtained with the binary approaches
on the MEDDOPROF-NORM task.</p>
        <p>The metrics displayed in these tables are micro
precision, micro recall and micro F1 score.</p>
        <p>The organizers provided participants with a baseline against which to compare our
results. These comparisons can be found in detail in Table 8 for the NER task,
Table 9 for the CLASS task and, lastly, Table 10 for the NORM task. The metrics
used are common in NLP tasks to compare the performance of different systems
and are expressed in terms of micro-precision (Precision), micro-recall (Recall)
and micro-F1 (F1).</p>
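        <p>For reference, micro-averaged scores pool the counts of all documents before computing each ratio. The following sketch assumes per-document counts of true positives, false positives and false negatives are already available; it is not the official evaluation script, and the example counts are invented.</p>

```python
def micro_scores(per_document_counts):
    """Compute micro precision, recall and F1 from (tp, fp, fn) tuples."""
    counts = list(per_document_counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for two documents: (tp, fp, fn).
precision, recall, f1 = micro_scores([(8, 2, 4), (2, 0, 0)])
print(round(precision, 3), round(recall, 3), round(f1, 3))
```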
        <p>In the MEDDOPROF-NER task, our results show an average increase of 54% with respect to the
baseline given by the organization. The Multiclass BETO approach
shows the highest gain, an increase of 58% over the baseline, followed
by the Binary approach trained with the added ProfNER dataset, with an increase
of 52%. In this task, all of our proposed systems obtained similar results,
exceeding 78% precision, 70% recall and 71% F1.</p>
        <p>In the MEDDOPROF-CLASS task, our systems obtained an average increase
of 61%. The Multiclass BETO approach once
again holds the highest gain with respect to the baseline (90% better than the
score given by the organizers). Our Binary approach trained with the ProfNER
data showed an increase of 47% over the baseline score. In this
case, the results obtained by our binary approach are worse than those of the
multiclass approach.</p>
        <p>Lastly, for the MEDDOPROF-NORM task we did not provide a Multiclass BETO approach and
only submitted results for the systems using the binary approaches. As mentioned,
these models tackle all three tasks at the same time, since they compare
the word embeddings of each text span with those of their respective classes. Our
results for this task fall below the baseline given by the organizers by 8%.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results Analysis</title>
      <p>To assess the problems that our models encountered in each task, we
implemented our own evaluation software. This program compares each document in
the gold standard with the results obtained by our systems and retrieves the entities
that were not properly identified and labelled in the MEDDOPROF-NER
and MEDDOPROF-CLASS tasks, as well as those that were not correctly
normalized in the MEDDOPROF-NORM task.</p>
      <p>We applied this evaluation to the top-performing system of every task
and created a single evaluation summary, shown
in Figure 1 for every task (coloured blue, orange and gray
for MEDDOPROF-NER, MEDDOPROF-CLASS and MEDDOPROF-NORM,
respectively). The data we wanted to extract and highlight consisted of:</p>
      <p>- Entities that our system detected and labelled correctly with respect to the
gold standard, namely True Positives (TP).</p>
      <p>- Entities detected by our systems that are not found in the gold standard, namely False
Positives (FP).</p>
      <p>- Lastly, entities included in the gold standard that were not detected by our
systems, namely Missed Entities.</p>
      <p>To understand why our models were unable to correctly identify some
entities in the evaluation dataset, we retrieved the text spans of the
FP and the Missed Entities. Since this information is too large to be
included in this document, we display only the 10 most frequent
entities in each of these sets (FP and Missed
Entities) along with their frequency (number of occurrences within those documents),
in Table 11 and Table 12, respectively.</p>
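        <p>The comparison described above can be sketched with plain set operations. This is not the evaluation software used in the paper: the tuple layout (document id, start, end, label, text) and the example annotations are assumptions made for illustration.</p>

```python
from collections import Counter


def compare_annotations(gold, predicted):
    """Split annotations into true positives, false positives and missed entities.

    Both arguments are collections of (doc_id, start, end, label, text) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    return {
        "TP": gold.intersection(predicted),
        "FP": predicted.difference(gold),
        "Missed": gold.difference(predicted),
    }


def most_frequent_texts(annotations, k=10):
    """Return the k most frequent entity texts with their number of occurrences."""
    counts = Counter(
        text.lower() for (_doc, _start, _end, _label, text) in annotations
    )
    return counts.most_common(k)


# Invented annotations for illustration only.
gold = {
    ("doc1", 0, 6, "PROFESION", "medico"),
    ("doc1", 20, 26, "PROFESION", "tutora"),
    ("doc2", 5, 11, "PROFESION", "tutora"),
}
predicted = {
    ("doc1", 0, 6, "PROFESION", "medico"),
    ("doc2", 30, 36, "PROFESION", "equipo"),
}
report = compare_annotations(gold, predicted)
print(most_frequent_texts(report["Missed"]))  # [('tutora', 2)]
```

        <p>Under exact matching of this kind, a span that is found but given the wrong label is counted once as a false positive and once as a missed entity.</p>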
      <p>After retrieving the results from our error assessment software, we
noticed that certain entities are counted as false positives
while also appearing as missed entities.</p>
      <p>From the list of entities contained in Table 11 and Table 12, we observe that:</p>
      <p>- For the MEDDOPROF-NER task, none of the 10 most frequent entities
showed an error in the label assignment.</p>
      <p>- For the MEDDOPROF-CLASS task, this happened with 4 entities. In these cases
our system identified the entity span correctly but tagged it with a different
label (e.g. "companero" as "OTROS" instead of "PACIENTE", or "psicologa"
as "OTROS" instead of "SANITARIO").</p>
      <p>Entity mentions recovered from a table at this point in the source (frequencies and column layout were not preserved): compañeros, equipo, terapeuta, en prision, estudia, tutora, medico de atencion primaria, musico, compañera de la peluquería, especialista en otro centro.</p>
      <p>- Lastly, it is in MEDDOPROF-NORM where we find the most matching terms,
as 9 of the 10 most frequent entities appear in both tables. After analyzing
these matching results we concluded that this is due to the fact that
some codes included the term "SCTID: " in their labelling while our output
only included the number. After repeating the experiment and applying the
official evaluation software provided by the organization, our scores seem to
have outperformed the baseline, as seen in Table 13.</p>
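        <p>A mismatch of this kind can be avoided by normalizing codes before they are compared. The sketch below is a hypothetical fix, not the correction actually applied in the paper, and the code value in the example is a made-up placeholder rather than a real SNOMED-CT identifier.</p>

```python
SCTID_PREFIX = "SCTID:"


def normalize_code(code):
    """Strip an optional 'SCTID:' prefix and surrounding whitespace from a code."""
    code = code.strip()
    if code.upper().startswith(SCTID_PREFIX):
        code = code[len(SCTID_PREFIX):].strip()
    return code


def same_code(gold_code, predicted_code):
    """Compare two codes after normalization."""
    return normalize_code(gold_code) == normalize_code(predicted_code)


# "12345" is a placeholder value used only for illustration.
print(same_code("SCTID: 12345", "12345"))  # True
```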
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>For our participation in the MEDDOPROF shared task we
proposed three systems based on BERT representations to tackle all three tasks:
MEDDOPROF-NER, MEDDOPROF-CLASS and MEDDOPROF-NORM.
Our systems outperformed the given baselines with average
increases of 54% and 61% for MEDDOPROF-NER and MEDDOPROF-CLASS, respectively.
The exception is MEDDOPROF-NORM, in which our submitted
results fell below the baseline, although close to it.</p>
      <p>As future work, we plan to implement another Deep Learning model, a
BiLSTM in combination with a Conditional Random Field (CRF), to
tackle all of the proposed tasks and thereby evaluate whether this system would
have outperformed our proposed solutions. In addition, we also want to
extend the first of our approaches so that it can tackle
MEDDOPROF-NORM, where we expect it to outperform the baseline.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the LIVING-LANG project
[RTI2018-094653-B-C21] of the Spanish Government and the Fondo Europeo de Desarrollo
Regional (FEDER).</p>
      <p>7. Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V.,
Krallinger, M.: NLP applied to occupational health: MEDDOPROF shared task at
IberLEF 2021 on automatic recognition, classification and normalization of professions
and occupations from medical texts. Procesamiento del Lenguaje Natural 67 (2021)</p>
      <p>8. López-Úbeda, P., Díaz-Galiano, M.C., Martín-Valdivia, M.T., López, L.:
Extracting neoplasms morphology mentions in Spanish clinical cases through word
embeddings. In: IberLEF@SEPLN (2020)</p>
      <p>9. López-Úbeda, P., Díaz-Galiano, M.C., Martín-Noguerol, T., Luna, A.,
Ureña-López, L.A., Martín-Valdivia, M.T.: COVID-19 detection in radiological text reports
integrating entity recognition. Computers in Biology and Medicine 127, p. 104066
(2020)</p>
      <p>10. Mesa Murgado, J.A., Parras Portillo, A.B., López-Úbeda, P., Martín-Valdivia,
M.T., López, L.A.U.: Identifying professions &amp; occupations in health-related social
media using natural language processing. In: Proceedings of the Sixth Social Media
Mining for Health Applications Workshop &amp; Shared Task (2021)</p>
      <p>11. Miranda-Escalada, A., Farré, E., Krallinger, M.: Named entity recognition,
concept normalization and clinical coding: Overview of the CANTEMIST track for cancer
text mining in Spanish, corpus, guidelines, methods and results. In: Proceedings
of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop
Proceedings (2020)</p>
      <p>12. Miranda-Escalada, A., Farré-Maduell, E., López, S.L., Briva-Iglesias, V.,
Agüero-Torales, M., Gascó-Sánchez, L., Krallinger, M.: The ProfNER shared task on
automatic recognition of professions and occupation mentions in social media: systems,
evaluation, guidelines, embeddings and corpora. In: Proceedings of the Sixth Social
Media Mining for Health Applications Workshop &amp; Shared Task (2021)</p>
      <p>13. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering 22(10), 1345–1359 (2009)</p>
      <p>14. Schneider, N., Danchik, E., Dyer, C., Smith, N.A.: Discriminative
lexical semantic segmentation with gaps: Running the MWE gamut.
Transactions of the Association for Computational Linguistics 2, pp. 193–206 (2014),
https://www.aclweb.org/anthology/Q14-1016</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bouziane</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouchiha</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doumi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Question answering systems: survey and trends</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>73</volume>
          , pp.
          <fpage>366</fpage>
          –
          <lpage>375</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Cañete, J.,
          <string-name>
            <surname>Chaperon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pérez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Spanish pretrained bert model and evaluation data</article-title>
          .
          <source>In: PML4DC at ICLR 2020</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>G.G.</given-names>
          </string-name>
          :
          <article-title>Natural language processing</article-title>
          .
          <source>Annual Review of Information Science and Technology</source>
          <volume>37</volume>
          (
          <issue>1</issue>
          ),
          <fpage>51</fpage>
          –
          <lpage>89</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          :
          <article-title>A survey of current work in biomedical text mining</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>57</fpage>
          –
          <lpage>71</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Landeghem</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>spaCy: Industrial-strength Natural Language Processing in Python</article-title>
          (
          <year>2020</year>
          ). https://doi.org/10.5281/zenodo.1212303
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>