BERT Representations to Identify Professions
      and Employment Statuses in Health data

        José-Alberto Mesa-Murgado1 , Pilar López-Úbeda1 , Manuel-Carlos
    Dı́az-Galiano1 , M. Teresa Martı́n-Valdivia1 , and L. Alfonso Ureña-López1

        Departamento de Informática, CEATIC, Universidad de Jaén, España
              {jmurgado,plubeda,mcdiaz,maite,laurena}@ujaen.es


        Abstract. In this paper we detail our submission for the Medical Doc-
        uments Profession Recognition (MEDDOPROF) shared task by submit-
        ting three different proposals for each subtask which involves: Named
        Entity Recognition of mentions regarding professions, occupations and
        employment statuses within clinical reports. Classify those mentions
        whether they refer to the visiting patient, a relative, medical staff or
        others and lastly, mapping those mentions with respect to the Euro-
        pean Skills, Competences, Qualifications and Occupations (ESCO) clas-
        sification and relevant SNOMED-CT terms. Our proposals are based
        on BERT representations and the similarity between the resulting word
        embeddings and its corresponding classes.

        Keywords: Named Entity Recognition · BERT Representations · Pro-
        fessions and Employment Statuses · Natural Language Processing · Span-
        ish


1     Introduction
This paper describes the systems we have developed as our participation in
the Medical Documents Profession Recognition (MEDDOPROF) shared task
[7] focused on the Detection and Recognition of Professions and Employment
statuses in Health data.
    This challenge has been organised by the Barcelona Supercomputing Center
as part of the 2021 Iberian Languages Evaluation Forum (IberLEF) evaluative
initiative co-located with the XXXVII Congreso Internacional de la Sociedad
Española para el Procesamiento del Lenguaje Natural (SEPLN 2021).
    Detecting professions and employment statuses in these resources could help
public and private organizations to evaluate and predict the impact of future
pandemic diseases spreading in any location using the reports written after every
medical visit.
    IberLEF 2021, September 2021, Málaga, Spain.
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
    These medical reports are a very valuable resource, as well as strictly re-
stricted, whose study leads to the development of highly useful applications for
the general public such as the detection of COVID-19 using radiological textual
reports [9] or the detection of tumor morphology within medical reports as seen in
the CANcer TExt Mining Shared Task – tumor named entity recognition (NER)
(CANTEMIST) [11] shared task, another challenge proposed by the Barcelona
Supercomputing Center in which our research group also participated [8].
    The idea behind this study was introduced in the ProfNER shared task as
part of the Social Media Mining for Health Applications or SMM4H, co-located
with the 2021 Annual Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics (NAACL 2021) [12] using data retrieved
from social media sources instead of clinical reports. Our research group also par-
ticipated [10] in this initiative using a Bidirectional Long Short Term Memory
(BiLSTM) Recurrent neural network (RNN) approach.
    The development of such applications tackle tasks that are commonly at-
tributed to the the Natural Language Processing (NLP) field, aimed at provid-
ing techniques to process textual information written in natural language with
the purpose of enabling a computer to understand such data in order to fulfill
an specific purpose such as Question Answering [1], Automated Translation [3],
among others [4]. These tasks are solved through the use of Machine Learning
and Deep Learning approaches.
    In particular, the latest state of the art approach revolves around the use of
BERT (Bidirectional Encoder Representations from Transformers) [5] which has
presented good grounding results in a wide variety of the stated NLP common
tasks [13].
    Our team proposes three systems based on BETO [2], the BERT Transformer
language model trained on a Spanish Corpus. On the one hand, a multiclass
approach and, on the other, two approaches relying on procedures similar to the
one of binary classification, meaning, masking all classes under the same label
in order to discriminate between tokens as either entities or non-entities.
    Considering this information, this paper is structured as follows: Section 2
introduces the aim of this challenge as well as the description and characteristics
of the provided dataset. Section 3 describes each of the three systems proposed
to tackle each task, including what hyperparameters we have explored, the pre-
processing applied to the input data and the results we have obtained on our
development set. Section 4 exhibits the results obtained by our systems on the
evaluation set, subsequently, this data is analyzed in Section 5. Lastly, we state
our conclusions in Section 6 concluding our submission.


2   MEDDOPROF Tasks Description

The MEDDOPROF Shared Task proposed three tasks in which to participate
although participants were not obliged to provide a system for each one of them
as these tasks were not dependent on each other.
1. Task 1, namely “MEDDOPROF-NER”, asks participants to automatically
   identify mentions of professions or occupations and employment statuses
   out of plain text medical documents and tag each one of those entities under
   the label of “PROFESION” or “SITUACION LABORAL, there is also one
   last label, “ACTIVIDAD” but it seems to be scarce label among the given
   training documents.
2. Task 2, namely “MEDDOPROF-CLASS”, requires participants to once again
   identify mentions of professions or occupations and employment statuses
   from a given set of plain text documents and label each one of those men-
   tions according to the person to whom they refer to, considering whether
   the mention is associated to the patient “PACIENTE”, to a relative ”FA-
   MILIAR”, to medical staff “SANITARIO” or any other as “OTROS”.
3. Lastly, task 3 or “MEDDOPROF-NORM” requires participants to catalogue
   those same mentions as per the European Skills, Competences, Qualifications
   and Occupations (ESCO) classification and relevant SNOMED-CT terms.


2.1   Dataset provided

The MEDDOPROF-NER and MEDDOPROF-CLASS corpus consists of 1844
documents; these documents have been transformed by the organizers into BRAT
format to extract the entities associated with each document. Below, we examine
each one of the datasets provided: training and evaluation.


Training dataset contains 1500 documents, out of them we have computed
the length of their content (number of characters in the document), the number
of tokens as well as entities contained in them. This information can be found
detailed in Table 1.

              Table 1. Metrics associated with the Training dataset.

                          Sum        Minimum       Maximum             Average
Lengths                6,239,588       184          27,529             4,159.72
Tokens                 1,114,919        29           4,807              743.28
Entities                 9,217           1            86                 7.44


Development dataset In order to evaluate the performance of our system
we had to divide the training corpus, therefore we decided to use the 80% of
the given data as training data (1191 documents) while the 20% remaining was
used as our development set (309 documents). This division was random but
consistent with the number of entities of the corpus. Therefore, the number of
entities in the development set is 80 times lower than those in the training set.
Its characteristics are detailed in Table 2.
            Table 2. Metrics associated with the Development dataset.

                          Sum         Minimum           Maximum         Average
Lengths                1,339,910        274              27,529         4,336.27
Tokens                  239,033          45               4,807          773.57
Entities                 1,762           36                 1              5.7


Evaluation dataset consists of 344 documents and we have also retrieved the
associated document’s length (number of characters contained) as well as the
number of tokens and entities in them, this information is available in Table 3.


             Table 3. Metrics associated with the Evaluation dataset.

                          Sum         Minimum           Maximum         Average
Lengths                1,240,562        228              23,446         3,606.29
Tokens                  222,744          37               4,376          647.51
Entities                 2,786            1                79             9.05


3   System Description

As stated in the previous section, we have implemented three systems based
on BERT language model trained using Spanish Corpus or BETO for which
we used an implementation that uses Pytorch library for Python. However, our
approaches differ one from the other in the labels used to discriminate tokens:

1. The first approach is a multiclass NER system trained to detect and dif-
   ferentiate between professions and occupations, employment statuses and
   activities as per their corresponding labels: “PROFESION”, “SITUACION
   LABORAL” and “ACTIVIDAD”, respectively. Throughout this document
   we will refer to this particular system as “Multiclass BETO”.
2. The second one discern tokens that are entities from those that are not,
   tagging them as either ’ENTITY’ or ’O’, respectively. This approach follows
   the IOB (Inside-Outside-Beginning) [14] format. After retrieving the text
   spans identified as entities, we map each one of them to their correspond-
   ing word embedding using BETO to, subsequently, calculate the distance
   between them and the word embeddings related to each tag. Afterwards,
   our system associates the label with the smallest distance value (in other
   words, with the highest similarity) to the identified entity. This distance is
   measured using the Cosine similarity method as per the Equation 1, being
   x and y, two word distinct embeddings.

                                                xy >
                                   k(x, y) =                                       (1)
                                               kxkkyk
3. The last of our systems is an extension of the one presented in the previous
   point. This expansion is produced on the training set through the usage of
   the corpus provided by ProfNER shared task, this data provides our system
   with more training data that results in better predictions.

3.1   Preprocessing
We use the spaCy [6] library for Python in order to create a text file containing
each sequence contained in our dataset, their start and end positions with respect
to the document in which they are found, as well as the original token and its
normalized counterpart.
    For this normalization we have decided to transform each token to lowercase,
converted accented characters into their non accented shapes and removed all
characters differing from numbers or special characters such as dots, commas,
semicolons, etc.
    Also, considering the fact that BERT allows a maximum of 512 tokens per
sequence and the given dataset contains sentences above that range, we have
decided to establish a maximum of 215 tokens per sequence. This decision was
made after noticing the memory limitations of our machine. We experimented
with distinct configurations of this value (256 and 300 to be precise), however,
the improvement of our results was not noticeable or, results were unobtainable.
    Consequently, our strategy involved the division of sequences into smaller
ones after normalization. This separation took place right after the detection of
a dot character or until conforming a sequence whose length was equal or greater
than our established maximum number of tokens per sequence

3.2   Hyperparameters tuning
Training was carried out using 6 RTX 2080Ti, each one containing 11 GB of
memory built under Nvidia’s Turing architecture. For this purpose we decided
to use different batch sizes: 8 and 16, and different epochs: 3, 4 and 5. We have
used the development dataset in order to test the performance of our systems.
    The combination of these parameters (epochs and batch sizes) produced sim-
ilar scores between them, therefore we have decided to include only the top
performers for each task and system displayed as follows:
 – Table 4 shows top combinations for the MEDDOPROF-NER task using the
   multiclass approach.
 – Table 5 does the same for the MEDDOPROF-CLASS task using the multi-
   class approach.
 – Table 6 displays the scores applied to both tasks MEDDOPROF-NER and
   MEDDOPROF-CLASS using the binary approach.
 – Lastly, Table 7 shows the results retrieved from the binary approaches used
   for the MEDDOPROF-NORM task.
    The metrics displayed on these tables are expressed in terms of micro preci-
sion, micro recall and micro F1 score.
Table 4. Metrics associated with the Training dataset using Multiclass BETO on
MEDDOPROF-NER.

                           Batch Size   Epochs      Precision     Recall       F1
Multiclass BETO                 8         3           0.83         0.78       0.80
Multiclass BETO                16         4           0.82         0.79       0.80

Table 5. Metrics associated with the Training dataset using Multiclass BETO on
MEDDOPROF-CLASS.

                           Batch Size   Epochs      Precision     Recall       F1
Multiclass BETO                16         3           0.79         0.81       0.80
Multiclass BETO                16         4           0.82         0.79       0.80

Table 6. Metrics associated with the Training dataset using Binary BETO on
MEDDOPROF-NER and MEDDOPROF-CLASS.

                           Batch Size   Epochs      Precision     Recall       F1
Binary BETO                    16         3           0.83         0.82       0.82
Binary BETO                    16         5           0.85          0.8       0.82
Binary BETO + ProfNER          16         3           0.83         0.81       0.82

Table 7. Metrics associated with the Training dataset using Binary BETO on
MEDDOPROF-NORM.

                           Batch Size   Epochs      Precision     Recall       F1
Binary BETO                    16         3          0.859        0.637       0.732
Binary BETO                    16         5          0.866        0.628       0.728


4   Results

Organizers have provided participants with a baseline from which to compare our
results. These comparisons can be found in detail in Table 8 for the NER task,
Table 9 for the CLASS task and lastly, Table 10 for the NORM task. Metrics
used are common in NLP tasks to compare the performance of different systems
and are expressed in terms of micro-precision (Precision), micro-recall (Recall)
and micro-F1 scoring (F1).


                   Table 8. Results on MEDDOPROF-NER.

System                                  Precision        Recall             F1
Multiclass BETO                          0.821           0.740             0.778
Binary BETO                              0.787           0.707             0.745
Binary BETO + ProfNER                    0.788           0.720             0.752
Baseline                                 0.465           0.508             0.486
    Our results show an increase of 54% (on average) with respect to those given
by the organization as baseline. The Multiclass BETO approach is the one that
shows the highest gain, an increase of 58% compared against the baseline, along
with the Binary approach added to the ProfNER dataset, showing an increase
of 52%. In this task, all of our proposed systems have obtained similar results
exceeding 78% accuracy, 70% recall and more than 71% on F1.

                   Table 9. Results on MEDDOPROF-CLASS.

System                                   Precision       Recall         F1
Multiclass BETO                           0.775          0.69          0.730
Binary BETO                               0.592          0.531         0.560
Binary BETO + ProfNER                     0.593          0.541         0.566
Baseline                                  0.391          0.377         0.384


   In the MEDDOPROF-CLASS task, we obtained an increase of 61% (on aver-
age) with our systems, being the Multiclass BETO approach the one which, once
again, holds the highest gain with respect to the baseline (90% better than the
organizers given score). Our Binary approach trained along with the ProfNER
data showed an increase of 47% when compared to the baseline score. In this
case, the results obtained by our binary approach are worse than those of the
multiclass.

                  Table 10. Results on MEDDOPROF-NORM.

System                                   Precision       Recall          F1
Binary BETO                               0.685          0.404          0.508
Binary BETO + ProfNER                     0.679          0.404          0.507
Baseline                                  0.502          0.533          0.517


   Lastly, for this task we did not provide a Multiclass BETO approach and
only provided results for the systems using the binary approaches. As mentioned,
these models managed to tackle all three tasks at the same time as they compared
the word embeddings of each text span with those of their respective classes. Our
results for this task fall below the baseline given by the organizers by 8%.

5   Results Analysis
To assess the problems that our models have encountered for each task, we imple-
mented our own evaluation software. This program compares each document in
the golden test with the results obtained by our systems and retrieves the entities
that have not been properly identified and labelled as per MEDDOPROF-NER
and MEDDOPROF-CLASS taks, as well as those that have not been correctly
normalized as per the MEDDOPROF-NORM taks.
    We have applied this evaluation to the top performer systems of every task
and created one single evaluation summary, this information can be appreci-
ated in Figure 1 with respect to every task (coloured as blue, orange and gray
for MEDDOPROF-NER, MEDDOPROF-CLASS and MEDDOPROF-NORM,
respectively). The data we wanted to extract and highlight consisted of:

 – Entities that our system detected and labelled correctly with respect to the
   golden test, namely, True Positives (TP).
 – Entities detected by our systems not found in the golden test, namely, False
   Positives (FP).
 – Lastly, entities included in the golden test that were not detected by our
   systems, namely, Missed Entities.


               Fig. 1. Number of entities mislabeled by our systems.


    To comprehend why our models were unable to correctly identify some enti-
ties in the evaluation dataset, we decided to retrieve the text spans concerning
FP as well as Missed Entities. Considering that this information is too large to be
included in this document, we decided to only display the top 10 most frequent
entities included in each of the previously mentioned sets (FP and Missed Enti-
ties) along with their frequency (number of occurrences within those documents)
, respectively in Table 11 and Table 12.
    After retrieving the results from our error assessment software, we have no-
ticed that there are certain entities that have been conceived as false positives
while also appearing as missed entities.
    From the list of entities contained in Table 11 and Table 12, we observe that:

 – For the MEDDOPROF-NER, the number of errors in the label assignment
   was 0 out of the 10 top most frequent.
 – For the MEDDOPROF-CLASS, this event happened in 4 entities and what
   we have noticed is that our system has incorrectly tagged with a different la-
   bel (e.g: “compańero” as “OTROS” instead of “PACIENTE”, or “psicóloga”
   as “OTROS” instead of “SANITARIO”) although the entity has been span
   has been identified.
      Table 11. False Positives per task along with their Frequencies (Freq).

NER Task with Freq CLASS Task with Freq NORM Task with Freq
Multiclass BETO        Multiclass BETO       Binary BETO +
                                             ProfNER
compañeros          6 compañero          5 trabajadora        26
equipo               3 trabajador          5 trabajador         23
terapeuta            3 psicóloga          3 compañeros        21
en prisión          2 profesionales       3 ama de casa         9
estudia              2 en prisión         3 estudiante          7
tutora               2 terapeuta           3 en prisión         6
médico de atención 2 compañeros         3 jubilado            5
primaria
músico              2 de baja             2 Jubilado            5
compañera de la 1 fábrica de aluminio    2 profesionales       4
peluquerı́a
especialista en otro 1 compañeras de piso 2 Incapacidad Tempo- 4
centro                                       ral


 – Lastly, it is in MEDDOPROF-NORM where we find more matching terms
   as 9 entities of the 10 most frequent appear in both tables. After analyzing
   these matching results we have concluded that this is due to the fact that
   some codes included the term “SCTID: ” in their labelling while our output
   only included the number. After repeating the experiment and applying the
   official evaluation software provided by the organization, our scores seem to
   have outperformed the baseline as seen in Table 13.


6   Conclusions

For our participation on the MEDDOPROF shared task challenge we have pro-
posed three systems based on the BERT representations to tackle all three tasks:
MEDDOPROF-NER, MEDDOPROF-CLASS and MEDDOPROF-NORM, re-
spectively. Our systems managed to outperform the given baselines with an in-
crease of 54% and 61% for the MEDDOPROF-NER and MEDDOPROF-CLASS.
The exception of this performance is the MEDDOPROF-NORM in which our
performance has fallen below the baseline, although close to it.
   For future work our plan is to implement another Deep Learning model, a
BiLSTM approach in combination with a Conditional Random Field (CRF) to
tackle all of the proposed tasks and therefore, evaluate whether this system would
have outperformed our proposed solutions or not. In addition, we also want to
implement the first of our approaches to be able to tackle the MEDDOPROF-
NORM for which our premise is that it will outperform the baseline.
            Table 12. Misses per task along with their Frequency (Freq).

NER Task with Freq CLASS Task with Freq NORM Task with Freq
Multiclass BETO        Multiclass BETO        Binary BETO +
                                              ProfNER
IT                  10 IT                  10 trabajadora        29
en el colegio        4 trabajador           7 trabajador         23
AXE                  4 compañeros          6 compañeros        16
cuidadores           3 en el colegio        4 IT                 10
especialista         3 cuidadores           4 ama de casa         9
militar              2 AXE                  4 estudiante          7
robo de dinero       2 médico              3 incapacidad tempo- 6
                                              ral
cuidado de su madre  2 psicóloga           3 profesionales       5
excursionistas       2 bajas laborales      2 en el colegio       5
despido              2 compañero de viaje  2 de baja laboral     5

Table 13. Results on MEDDOPROF-NORM after fixing the problem regarding codes.

System                                     Precision        Recall          F1
Binary BETO                                 0.687           0.613          0.647
Binary BETO + ProfNER                       0.688           0.624          0.654
Baseline                                    0.502           0.533          0.517


Acknowledgements

This work has been partially supported by the LIVING-LANG project [RTI2018-
094653-B-C21] of the Spanish Government and the Fondo Europeo de Desarrollo
Regional (FEDER).


References
 1. Bouziane, A., Bouchiha, D., Doumi, N., Malki, M.: Question answering systems:
    survey and trends. Procedia Computer Science 73 pp. 366–375 (2015)
 2. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-
    trained bert model and evaluation data. In: PML4DC at ICLR 2020 (2020)
 3. Chowdhury, G.G.: Natural language processing. Annual review of information sci-
    ence and technology 37 (1), 51–89 (2003)
 4. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining.
    Briefings in bioinformatics 6 (1), 57–71 (2005)
 5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training
    of deep bidirectional transformers for language understanding (2018),
    http://arxiv.org/abs/1810.04805, cite arxiv:1810.04805Comment: 13 pages
 6. Honnibal,      M.,     Montani,       I.,   Van      Landeghem,        S.,    Boyd,
    A.:      spaCy:      Industrial-strength      Natural      Language        Process-
    ing      in      Python      (2020).      https://doi.org/10.5281/zenodo.1212303,
    https://doi.org/10.5281/zenodo.1212303
 7. Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Brivá-Iglesias, V.,
    Krallinger, M.: Nlp applied to occupational health: Meddoprof shared task at iber-
    lef 2021 on automatic recognition, classification and normalization of professions
    and occupations from medical texts. Procesamiento del Lenguaje Natural 67 (2021)
 8. López-Úbeda, P., Dı́az-Galiano, M.C., Martı́n-Valdivia, M.T., López, L.: Extract-
    ing neoplasms morphology mentions in spanish clinical cases through word embed-
    dings. In: IberLEF@SEPLN (2020)
 9. López-Úbeda, P., Dı́az-Galiano, M.C., Martı́n-Noguerol, T., Luna, A., Ureña-
    López, L.A., Martı́n-Valdivia, M.T.: Covid-19 detection in radiological text reports
    integrating entity recognition. Computers in Biology and Medicine 127 p. 104066
    (2020)
10. Mesa Murgado, J.A., Parras Portillo, A.B., López-Úbeda, P., Martı́n-Valdivia,
    M.T., Lopez, L.A.U.: Identifying professions & occupations in health-related social
    media using natural language processing. In: Proceedings of the Sixth Social Media
    Mining for Health Applications Workshop & Shared Task (2021)
11. Miranda-Escalada, A., Farré, E., Krallinger, M.: Named entity recognition, con-
    cept normalization and clinical coding: Overview of the cantemist track for cancer
    text mining in spanish, corpus, guidelines, methods and results. In: Proceedings
    of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop
    Proceedings (2020)
12. Miranda-Escalada, A., Farré-Maduell, E., López, S.L., Briva-Iglesias, V., Agüero-
    Torales, M., Gascó-Sánchez, L., Krallinger, M.: The profner shared task on auto-
    matic recognition of professions and occupation mentions in social media: systems,
    evaluation, guidelines, embeddings and corpora. In: Proceedings of the Sixth Social
    Media Mining for Health Applications Workshop & Shared Task (2021)
13. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on knowledge
    and data engineering 22 (10), 1345–1359 (2009)
14. Schneider, N., Danchik, E., Dyer, C., Smith, N.A.: Discriminative lexi-
    cal semantic segmentation with gaps: Running the MWE gamut. Transac-
    tions of the Association for Computational Linguistics 2 pp. 193–206 (2014),
    https://www.aclweb.org/anthology/Q14-1016