<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Resource-Based Anonymization for Spanish Clinical Cases</article-title>
      </title-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>704</fpage>
      <lpage>711</lpage>
      <abstract>
        <p>This implementation notes describe a system for anonymization of clinical case texts written in Spanish based on careful collection of gazetteers for several entity types. The entities gathered are improved converting their strings to regular expressions covering commonest text variations. A simple grammar formalism is implemented to sugar the creation of context-dependent regular expressions. These expressions are applied in order (organized according to reliability) and context specified by rules may refer to the character/word/regular expressions and/or annotations previously introduced in the running text by already applied rules. With this simple approach, the system obtained an F1-score of 0.9595 in the NER offset and entity type classification (subtask 1) and 0.96409 in the sensitive token detection (subtask 2).</p>
      </abstract>
      <kwd-group>
        <kwd>Anonymization</kwd>
        <kwd>de-identification</kwd>
        <kwd>resource-based processing</kwd>
        <kwd>Electronic Health Records</kwd>
        <kwd>Protected Health Information</kwd>
        <kwd>clinical cases</kwd>
        <kwd>meddocan</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The anonymization of Electronic Health Records (EHRs) opens a new range of
uses for these otherwise private data, being just a few of them the application of
big data techniques to gain feedback on treatment effectiveness, the estimation
of survival of a patient that has overcome a serious illness, and, generally, the
effective easiness of the study of human diseases in broad sense. In this new (at
least for Spanish) avenue of research activities, IberEval 2019 has launched
evaluation activity of Medical Document Anonymization (henceforth, meddocan).
This working paper accounts for our participation in the referred evaluation
track.1</p>
      <p>Section 2 describes the resources we have collected and enriched for our
the task at hand. Some general issues on the development of our system are
presented in section 3. System results are presented in section 4 along with both
a discussion on errors by the system and some comments of the track objectives
and the golden dataset. Finally, section 5 includes some concluding remarks.
1 http://temu.bsc.es/meddocan/</p>
      <p>Copyright c 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24
September 2019, Bilbao, Spain.</p>
      <p>Fernando S´anchez-Leo´n
2</p>
    </sec>
    <sec id="sec-2">
      <title>Resource building</title>
      <p>This section describes the resources built from scratch (or reused, adapted and
further developed) specifically for the meddocan competition. Since the
organizers released gazetteers for some of the annotation classes (henceforth, simply
classes, we have decided to explore this avenue.</p>
      <p>From previous work, we had lists of first names and last names (c. 80,000
elements) that we have reused as such. Other gazetteers have been complemented
using different information sources, mainly c. 175,000 articles in the health
domain downloaded from SciELO electronic library2; but also Wikipedia3,
GeoNames4 and certain web searches and sites have been visited.</p>
      <p>For instance, the gazetteer for hospitals released by meddocan organizers,
which is basically the list published by the Ministerio de Sanidad, Consumo
y Bienestar Social5, has been cleaned for orthographic errors (lack of stresses:
HOSPITAL RIO CARRIO´N, HOSPITAL QUIRONSALUD MIGUEL DOMINGUEZ, missing
final dots: AVANCES ME´DICOS S.A). This list has been complemented with all
hospital and clinical centers extracted in the SciELO collection. Some other webs
have been used to download hospital names. The list used in the competition
has around 2,000 unique hospital names.</p>
      <p>As regards, known addresses we have used exclusively those included in the
gazetteer released by the organizers. Again, these addresses have been cleaned
and canonicalized (more on this below), and duplicate addresses have been
removed.</p>
      <p>We have also gathered gazetteers for services, units, laboratories (with cue
words in Spanish, English and Portuguese), as well as clinical specialties. These
class is not to be annotated within meddocan, but its identification will be of
help for the matching of positive classes.</p>
      <p>Institutions have been collected from a number of web pages, most notably
from the Ranking Web of Universities6. These have been complemented (in
another gazetteer) with company names, specially companies and holdings within
the business area of health.</p>
      <p>As for location names, we have collected names of countries, autonomous
communities, provinces and cities in Spain (from Wikipedia), as well as a list
of provinces and largest cities in countries around the world. Besides this list of
location names, we use another gazetteer with all cities with population greater
than 500, downloaded from GeoNames database7. These GeoNames are filtered
using five heuristics: 1) delete all name shorter than four characters long; 2)
2 https://www.scielo.org. The collection used was downloaded during February
2019.
3 https://es.wikipedia.org
4 https://www.geonames.org
5 https://www.mscbs.gob.es/ciudadanos/prestaciones/centrosServiciosSNS/
hospitales/docs/2018 CNH.pdf
6 http://www.webometrics.info/en
7 http://downloads.geonames.org/export/dump/cities500.zip
remove all names ambiguous with first names; 3) delete those ambiguous with
last names; 4) discard single- or multi-word names all whose parts are lexical
(i.e., found in a full form lexicon); 5) finally, forms ambiguous with any other
entity in the rest of gazetteers are also removed (Calcium). As a final
pruning of the set of GeoNames, all those names seen in train+dev datasets as part
of the construction signo|f´ormula|pasta|asa|drenaje|escala|... de &lt;name&gt;
have been blocked.</p>
      <p>Unfortunately, the information processed from GeoNames includes only the
column for the English name, discarding other columns with the GeoNames in
other languages. For this reason, we have missed some otherwise common
locations like Mil´an and Nueva York in the test set.</p>
      <p>Gazetteer elements are not plain strings but regular expressions. This way,
we can allow for the matching of text variants of known class instances found in
the clinical cases. However, these (possibly regular expression-form) entities are
further processed as described below. Before this further processing, the system
potentially uses c. 200,000 entities, plus 430,000 unique names from GeoNames
pre-processed database.</p>
      <p>Resource munging As already said, prior to munging the gazetteer data, we
have canonicalized all the entries in the new gazetteer types8. As a result of
this, the entry DR ESQUERDO, 46 (which actually appears twice in the released
gazetteer for known addresses) has been converted to
DR. ESQUERDO, 46</p>
      <p>The following are all the text versions of this ”canonical” address in the
train+dev dataset:
Calle del Dr. Esquerdo, 46
c/ Doctor Esquerdo 46
c/ Doctor Esquerdo, 46
C/ Dr. esquerdo, 46
C/ Dr. Esquerdo 46
C/ Dr. Esquerdo, 46
Doctor Esquerdo, 46
Dr Esquerdo 46</p>
      <p>When loading resources and depending on the class set as first line of each
gazetteer file, list elements are munged by a set of methods applied to each
of them. This way, after munging the former canonical address, we store the
following regular expression (split in two lines for editorial purposes):
8 Canonicalization has been done using editor search and replace commands on the
dataset. A random sample representing 20% of the addresses was exhaustively
checked and, when satisfied with the result, the rest was spot checked.
(?:(?:calle|c[\/.]*|avenida|avda\.?|av\.?|paseo|ps?\.?)
(?: del?| de la)? ?)?(?:dr\.|dr|doctor) esquerdo(?:,?) (?: *46)?</p>
      <p>We use a similar strategy also for hospital and institution names, but haven’t
used it (due to lack of time) to companies (that map also to INSTITUCION in
meddocan9).
3</p>
    </sec>
    <sec id="sec-3">
      <title>Development</title>
      <p>Our system’s approach is very simple, as it is described in this section. A core
subset of all entities10 is dynamically complemented with all entity candidates
found in the corpus to be processed. This is achieved performing a fast light
tokenization11 that spot fragments of (possibly multi-word) names in the above
mentioned three categories. All munging processes are performed also on the
newly gathered entity candidates and compiled with the rest of regular
expressions. A regular grammar, with the possibility to express left and right (regular)
context and/or tags already inserted on both sides of the focus, due to previous
applications of matching rules, has been defined. In order to access fragments
already identified by a previous rule, the system uses an offset vector, where we
record the span and the type of information identified12.</p>
      <p>The current grammar has nearly 100 rules, ordered in three sections: first
section contains rules that block the further identification of (mainly) entities as
any of the proposed classes, while the second and third sections identify positively
text fragments to be tagged, being the latter of these of a more heuristic shape.</p>
      <p>As an example of the type of context captured by rules in the former section,
take, for instance, escala de Glasgow, where the name must remain untagged. In
order for the reader to have a cleared insight of the rules shape, one of them is
included below:
HOSPITAL =&gt; {@HOSPITALS}
resp_known_hospital =&gt; ’({HOSPITAL})
(?:{SEP}* ({CITY_OR_PROVINCE}))?’,
types =&gt; [ ’HOSPITAL’, ’TERRITORIO’ ]
9 This lack of further processing of company names accounts for some errors in our
system run on the test data —Alcon Cus´ı, Novartis, Allergan are companies in our
gazetteer not annotated because in the text show with a SA or S.A. post-modifier.</p>
      <p>These and other cases could have been easily captured.
10 These entities were selected using frequency information gathered from SciELO
collection processing. Currently the core has the full content of gazetteers for all classes
except for locations, first names and last names.
11 Light tokenization takes 5.01 seconds on the meddocan background corpus —
1,644,964 running words as counted with wc— using an Intel(R) Core(TM) i7-3770K
CPU @ 3.50GHz. The system is implemented in Perl and it relies on a library
developed entirely by the authors for last years barr2 competition and extended for
the meddocan and the PharmaCoNER competitions this year. This library uses
other libraries from the Perl ecosystem.
12 Neither the algorithm nor the regular expressions dynamically built have been
optimized.</p>
      <p>Resource-based anonymization for Spanish clinical cases</p>
      <p>A symbol table is declared, and some notation sugar is implemented. For
instance, an @ sign at the beginning of an identified symbol instruct the
program to return the disjunction of all the gazetteer elements (possibly regular
expressions) of that class. The same is done for the symbol CITY OR PROVINCE.
If matched in the text, either a hospital name and optionally a city or province,
they get tagged with the types given in the rule.</p>
      <p>As another example, this time from the third section of the grammar, the
following, more heuristic rule, we try to identify an unknown city or province
(defined via a simple regular expression) if that offset span is not already tagged
and there is a specific set of entities to the left (lc stand for left context ).
city_heur_1
=&gt; ’{SEP}+ ({CITY_OR_PROVINCE_HEUR}) {SEP}+’,
types =&gt; [ ’TERRITORIO’ ],
lc =&gt; [ ’TERRITORIO|NOMBRE|HOSPITAL’ ]</p>
      <p>Among the information types for which a gazetteer does not exist in the
system are addresses (CALLE). For these, we build a regular expression with
typical cue elements (Av., C/?.?, Pg, Paseo,...) and common cues to express
street number and floor (or kilometer point or no number —s/n—)13.
address_heur_last_1 =&gt; ’(?:{HEALTH_RECORD_RESP_CUE}.*?) {TSEP}
({ADDRESS_CUE}.*?{ADDRESS_POST}) {SEP}’,
types =&gt; [ ’CALLE’ ],
rc =&gt; [ ’TERRITORIO’ ]</p>
      <p>Finally, in order to match addresses with no cues, we have used a third
scanning over the document string, with no particularly good results (see section 4).
4</p>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <p>This section presents the results obtained by the system developed and also
discusses some weak points discovered after sending our datasets to the organizers.
Finally, we also express some insights from the overall task and the quality of
the golden datasets.</p>
      <p>Results We sent three runs. The three of them use the approach described in
these notes. However, runs 2 and 3 (B and C in the table below) make use of the
annotations gathered from train+dev datasets released by the organizers. This
information (obviously pruned of certain classes for which propagation across
documents is nonsense) was used prior to annotation by the system. Finally,
run 1 (A) is the result of system annotation alone. meddocan is structured in
two subtasks —the former, aiming to identify named entities and other sensible
patient information and to provide their offsets; the latter, simply extracting
the offsets of sensible text fragments. Results, as provided by the organizers, are
shown in the following tables.
13 Note that these examples are a small fragment, for illustrative purposes, of a very
complex set of cues including Portuguese, Catalan, Italian and English, apart from
Spanish, of cue text fragments.
Fernando S´anchez-Leo´n
Discussion No comparative comments can be included in this working paper
since, as it was the case in last year’s competition on abbreviation resolution
(barr2), the organizers don’t publish results of the participants prior to paper
writing. This lack of information on other participants systems performance is
somewhat surprising in an applied research activity that is defined as a
competition.</p>
      <p>For the sake of saying something, not precisely positive, about the system
described here, we included a third non-regular grammar-based scan on the
input text in order to improve recall on unknown addresses not showing any
cue. Due to an inadvertent bug, we applied this strategy also to addresses with
a known cue and, more critically, to known addresses, thus, wrongly extending in
some cases these known addresses to convert them in false positives. We haven’t
quantified the number of errors introduced by this bug, but we are sure that has
had a significant impact on the F1-score.</p>
      <p>Besides, some common expressions —like 3 d´ıas de nacido— were not
included in system resources. This and other new expressions can boost system
performance devoting a few minutes to resources update.</p>
      <p>Annotation and anonymization The ultimate goal of tagging certain
information bits to protect patient’s health information (PHI) is to anonymize a text
in a way that no person or program can achieve the re-identification of the
patient identity behind the data. With this primary goal in mind, and assuming a
clinical case structure like that found in the datasets released by the organizers
of meddocan, there are simpler ways to perform the task, at least as regards
subtask 2 of the competition. Let’s expand the point.</p>
      <p>Every clinical case has three parts —a patient profile, the case description
and the health professional data. The first and last parts of the clinical case can</p>
      <p>Resource-based anonymization for Spanish clinical cases
be trivially trimmed, although preserving relevant information on the specialty
of the health professional signing the case —information which is included in
the department, service, unit,. . . sometimes explicitly stated in the health
professional data part. Accepting this premise, the task at hand is that of thoroughly
work the case description.</p>
      <p>The test dataset contains 105,062 running words (as counted by wc) and
5,661 annotations in the .ann files, but only 769 annotations are in the body of
the text. However, we believe that this part of the clinical case makes a difference
in anonymization systems development for the reasons just exposed. Moreover,
it is precisely in body text where a human annotator can have more problems
to hear the bells of information that has to be protected. In order to prove
this assumption, we have read the body of all the cases in the test dataset and
checked the expert annotations. The results of this exercise are shown in table 3.
TAG True Positives True Negatives False Positives
EDAD SUJETO ASISTENCIA 242 2 0
SEXO SUJETO ASISTENCIA 194 0 0
FECHAS 109 0 0
FAMILIARES SUJETO ASISTENCIA 77 3 3
INSTITUCION 39 0 7
TERRITORIO 26 0 4
PAIS 27 0 0
HOSPITAL 13 0 2
ID SUJETO ASISTENCIA 13 1 0
PROFESION 4 5 3
OTROS SUJETO ASISTENCIA 3 0 0
NOMBRE SUJETO ASISTENCIA 2 0 0
NUMERO TELEFONO 1 0 0
Table 3. A closer look at the test dataset annotations in the body of clinical cases.</p>
      <p>In the preceding table, we have computed span errors as false positives.
Besides, Vancouver in the entity name Escala de Cicatrizaci´on de Vancouver and
similar cases have been also considered false positives. We also judge hospital
militar afgano a false positive, as it does not meet the criteria of having a unique
referent, as it should be the case for proper names. In the same vein, trabajador
en canteras and similar cases count as false positives given that the annotation
guidelines explicitly state that the word trabajador must not be annotated as part
of a PROFESION14. Finally, 2 out of 3 cases marked for OTROS SUJETO ASISTENCIA
are debatable.</p>
      <p>The number of false positives in the INSTITUCION class is consistent with
the scarce number of examples in the annotation guidelines for this particular
14 See page 15 of the annotation guidelines, available
http://temu.bsc.es/meddocan/wp-content/uploads/2019/02/gu%C3%ADas-deanotaci%C3%B3n-de-informaci%C3%B3n-de-salud-protegida.pdf.</p>
      <p>at
class15. Actually, there is no indication in the guidelines as to what to do when
the institution names refer to medical products and, hence, are not part of the
PHI. Take, for instance, the following text fragment:</p>
      <p>Se procedio´ a estudio inmunohistoqu´ımico de cortes representativos
mediante el m´etodo avidina-biotina peroxidasa, utilizando anticuerpos
primarios anti ant´ıgeno de membrana epitelial EMA, (Dako M613, USA, 10/500),
prote´ına S-100 (Dako, L1845, USA, predilu´ıda), neurofilamentos (Biogenex
6670-0154, USA), enolasa neuroespec´ıfica NSE, (Biogenex MU055-VC, USA,
10/1000), CD57 (Becton-Dickinson 7660, USA, 10/500), CD34 (Becton-Dickinson
7660, USA, 10/500), a-actina de mu´sculo liso (Dako MO851, USA, 10/200),
desmina (Dako M760, USA, 10/500), y vimentina (Shandon 402255, USA,
pred.).</p>
      <p>In this fragment every single occurrence of Dako, Biogenex, . . . , and USA is
tagged. This can be important for a NER task but is irrelevant for anonymization
of PHI. Re-identification is not only possible for the set of antibodies mentioned
but also of paramount importance for some human disease studies. In fact, in
other similar contexts these entities remain untagged16, a fact that mislead the
participant.</p>
      <p>Finally, some words remain untagged (and they have not been reflected in
the table above), albeit carrying personal information —this is the case of
primigesta de 32 an˜os. However, the most argueable case for untagged
information is that of gender when denoting sex of patient. There are, of course,
medical specialties only for men or for women, but, for the sake of complicating
re-identification of PHI, occurrences of la paciente —and the more difficult to
capture cases of feminine agreement of adjectives, participles and the like— need
to be anonymized, in our humble opinion.</p>
      <p>The error rate found in test set for body text is relatively high, standing
out that maybe some of the details in the guidelines are not clearly exemplified
and have been applied inconsistently. This has a significant impact on system
evaluation, especially when system scores are in the 90s.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper has presented a resource-based anonymization system for Spanish
clinical cases. Its F1-score for the meddocan dataset is 0.9595 for NER offset
and entity type classification (subtask 1) and 0.96409 in the sensitive token
detection (subtask 2). We believe that, after fixing a bug in the program that
garbles known addresses and devoting a couple of hours to expand system’s
regular grammar, performance can be improved significantly.
15 See page 16 of the guidelines.
16 For example, in the text fragment: “Como antecedentes destaca estar en tratamiento
hematolo´gico con hidroxyurea (Hydrea R , 500 mg, Bristol-Myers, Squibb) y ´acido
acetilsalic´ılico (Adiro R , 300 mg, Bayer).”</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>