<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Spanish Medical Document Anonymization with Three-channel Convolutional Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordi Porta-Zamorano</string-name>
          <email>porta@rae.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Estudios de la Real Academia Española</institution>
          ,
          <addr-line>c/ Serrano 187-189, Madrid 28002</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>639</fpage>
      <lpage>646</lpage>
      <abstract>
        <p>This paper describes the system presented at the MEDDOCAN (Medical Document Anonymization) task. The system consists of a candidate generator, which uses a PoS-tagger, and a candidate classifier based on a convolutional neural network, which uses three channels and pretrained word embeddings to represent the sequence of words to be classified and its left and right context. On the MEDDOCAN Test Set, the system achieved a best F1 score of 0.9184 for subtask 1 (NER offset and entity type classification) and 0.9345 for subtask 2 (sensitive token detection).</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Document Anonymization</kwd>
        <kwd>Convolutional neural networks (CNN)</kwd>
        <kwd>Three-channel CNN (TCCNN)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the system presented at the MEDDOCAN (Medical
Document Anonymization) task within the IberLEF 2019 initiative. MEDDOCAN is
the first community challenge task specifically devoted to the anonymization of
medical documents in Spanish. Although the MEDDOCAN task is structured
in two subtasks, (i) NER offset and entity type classification and (ii) sensitive
token detection, the system approaches both simultaneously, providing
the same output for both. A more detailed description of MEDDOCAN and the
results of the submitted systems can be found in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The anonymization or de-identification task can be framed as a named
entity recognition and classification problem, which has been extensively studied
with a variety of approaches ranging from rule-based systems to machine learning
algorithms. Several systems using different kinds of recurrent neural networks
can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] reporting continuous improvements in the state
of the art for English in similar anonymization tasks.
      </p>
      <p>
        The use of multichannel convolutional neural networks for sentence
classification was first described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where channels were used to represent different
sets of word vectors. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the standard deep learning model for text classification
and sentiment analysis, which uses a word embedding layer and a one-dimensional
convolutional neural network, was expanded with multiple parallel convolutional
neural networks with different kernel sizes over the same text. Here, we use
channels to represent the sequence of words to be classified and its left and right
context.
      </p>
    </sec>
    <sec id="sec-2">
      <title>MEDDOCAN Corpus Description</title>
      <p>
        The MEDDOCAN corpus is composed of 1000 clinical cases distributed over
a Training Set (500 cases), Development Set (250 cases), and a Test Set (250
cases). Each clinical case came in raw text format with its corresponding brat
annotation file [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>One of these texts is shown in Fig. 1, where the MEDDOCAN entities in the
corresponding annotation file in Fig. 2 have been highlighted in red and
section titles in blue in order to show the underlying structure of these
clinical cases and the co-occurrence of entities with the labels that introduce many
of them (e.g., Apellidos: Garcia Prieto). Note also that, although these texts
have been annotated blindly by two different expert annotators with 98%
inter-annotator agreement, the text contains a false positive (Testes, wrongly
classified as FAMILIARES SUJETO ASISTENCIA) and a false negative (08022, not
identified and classified as TERRITORIO).</p>
    </sec>
    <sec id="sec-3">
      <title>System Description</title>
      <p>The system built for MEDDOCAN consists of a candidate generator and a
candidate classifier, which are depicted in Fig. 3 and described in the following
paragraphs.</p>
      <p>Candidate generation is performed in two steps: (i) texts are PoS-tagged
with a general in-house Spanish PoS-tagger which performs sentence
segmentation, tokenization, and morphosyntactic analysis. The tokenization
subcomponent has been slightly modified to split textual elements containing hyphens
and some textual elements consisting of unbroken sequences of words, numbers
and punctuation without proper spacing (e.g., Edad:68 → Edad : 68); and
(ii) PoS-tagged texts are then given to an n-gram generator to provide
candidates along with their textual context for all the types of entities defined by
the MEDDOCAN task. Since most of the n-grams do not correspond to entities
from a linguistic point of view (they should be noun phrases), a filter is applied
to reduce the number of false candidates. This filter removes n-grams with
non-allowed PoS tags in n-gram initial and final positions (punctuation, prepositions,
adverbs, articles, conjunctions, etc.), n-grams with internal unpaired
punctuation, and n-grams containing verbs (except past participle forms) or any of the
words in a blacklist of the 15,000 most frequent non-allowed words within entities
manually extracted from the MEDDOCAN texts in the Training Set.</p>
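      <p>The two candidate generation steps can be illustrated with a minimal Python sketch. It is not the in-house tagger or generator used by the system: the regular expression, the coarse tag names (PUNCT, PREP, ADV, ART, CONJ, VERB), the function names and the max_n limit are all assumptions made for illustration, and the check for internal unpaired punctuation is omitted.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical coarse PoS tags that may not start or end a candidate n-gram.
BOUNDARY_BLOCKED = {"PUNCT", "PREP", "ADV", "ART", "CONJ"}


def split_glued(text):
    """Split words, numbers and punctuation that lack proper spacing."""
    return re.findall(r"\w+|[^\w\s]", text)


def candidates(tagged, max_n=4, blacklist=frozenset()):
    """Yield (start, end, words) for n-grams that pass the filter.

    `tagged` is a list of (word, tag) pairs and `end` is exclusive.
    """
    for start in range(len(tagged)):
        for end in range(start + 1, min(start + max_n, len(tagged)) + 1):
            span = tagged[start:end]
            words = [word for word, _ in span]
            tags = [tag for _, tag in span]
            # Reject n-grams that begin or end with a non-allowed tag.
            if tags[0] in BOUNDARY_BLOCKED or tags[-1] in BOUNDARY_BLOCKED:
                continue
            # Assumption: past participles carry a tag other than "VERB",
            # so they are not rejected here.
            if "VERB" in tags:
                continue
            # Reject n-grams containing any blacklisted word form.
            if any(word.lower() in blacklist for word in words):
                continue
            yield start, end, words


tokens = split_glued("Edad:68")  # ['Edad', ':', '68']
```
]]></preformat>
      <p>In this sketch the blacklist is an ordinary set of lower-cased word forms, so a membership test per token is enough to discard an n-gram.</p>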
      <p>Nombre: Marc.</p>
      <p>Apellidos: Garcia Prieto.</p>
      <p>CIPA: 270058.</p>
      <p>NASS: 27 2226350 05.</p>
      <p>Domicilio: Av. Litoral, 30, 1C .</p>
      <p>Localidad/ Provincia: Barcelona.</p>
      <p>CP: 08005.</p>
      <p>NHC: 270058.</p>
      <p>Datos asistenciales.</p>
      <p>Fecha de nacimiento: 27-10-1968.</p>
      <p>País: España.</p>
      <p>Edad: 47 Sexo: H.</p>
      <p>Fecha de Ingreso: 13-12-2015.</p>
      <p>Especialidad: andrología.</p>
      <p>Médico: Anna Bujons Tur NoCol: 08 08 76541.</p>
      <p>Antecedentes: sin antecedentes patológicos de interés ni hábitos tóxicos.
Historia Actual: Paciente varón de 47 años de edad, acude a la
consulta de andrología por presentar erecciones prolongadas no dolorosas
de aproximadamente 4 años de evolución tras traumatismo perineal cerrado
con el manillar de una bicicleta.</p>
      <p>Exploración física:En la exploración física se observan cuerpos
cavernosos aumentados de consistencia, no dolorosos a la palpación, sin
palpar pulsos anómalos. Sensibilidad peneana conservada. Testes móviles
en ambas bolsas escrotales y sin alteraciones.</p>
      <p>Resumen de pruebas complementarias: Como exploraciones complementarias se
le realiza ecodoppler penenano: vascularización cavernosa derecha
aparentemente conservada; en la porción más proximal del cuerpo cavernoso
izquierdo se observa formación anecoica (2x1.8x1.5cm) con flujo
turbulento en su interior compatible con fístula arteriovenosa (FAV) de
larga duración.</p>
      <p>Evolución y comentarios: Con la orientación diagnóstica de Priapismo de
alto flujo se decide realización de Arteriografía pudenda con anestesia
local confirmándose FAV y posterior embolización de la misma mediante 2
coil 3x5. Tras la embolización el paciente e voluciona favorablemente con
detumecencia peneana completa y erecciones normales. Actualmente está
asintomático.</p>
      <p>Diagnóstico Principal: Priapismo
Remitido por: Anna Bujons Tur Calle Joaquim Folguera, 3 - 5o 2a 08022
Barcelona. abujons@gmail.com</p>
      <p>Fig. 1. MEDDOCAN Clinical Case S0004-06142006000600015-1: Raw Text</p>
      <p>Candidate n-grams with their context are classified in two steps: (i) n-grams
are initially classified by a three-channel convolutional neural network (TCCNN)
with precomputed word embeddings (channels represent the left and right context
and the n-gram to be classified); and (ii) a final filter is applied to discard
misclassifications and to resolve overlapping classified candidates by selecting the left-most
longest n-grams. This final filter applies constraints at word and character levels
on classified n-grams. For example, emails must contain the @ character, streets
have no street type information, etc. These constraints have been created on the
basis of observation of misclassifications in the Development Set.</p>
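      <p>A minimal Python sketch of the final filter follows, under stated assumptions: spans are (start, end, label) tuples with an exclusive end, the label name EMAIL and both function names are hypothetical, and only the e-mail rule is taken from the description above. The system's full set of constraints is not reproduced.</p>
      <preformat><![CDATA[
```python
def passes_constraints(text, label):
    """Character-level check; only the e-mail rule comes from the paper."""
    if label == "EMAIL":  # hypothetical label name
        return "@" in text
    return True


def leftmost_longest(spans):
    """Keep non-overlapping spans, preferring earlier starts, then longer spans.

    Each span is (start, end, label) with an exclusive `end`.
    """
    chosen = []
    last_end = 0
    # Sort by start offset; among equal starts, the longest span comes first.
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if start >= last_end:
            chosen.append((start, end, label))
            last_end = end
    return chosen


kept = leftmost_longest([(0, 2, "A"), (0, 3, "B"), (2, 4, "C"), (3, 5, "D")])
# kept == [(0, 3, "B"), (3, 5, "D")]
```
]]></preformat>
      <p>Sorting once and scanning left to right keeps the selection linear after the sort, which is sufficient when every candidate has already been classified.</p>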
      <p>
        Despite the initial filter applied to reduce the number of candidates, 89%
of the candidates in the MEDDOCAN corpus are still non-entities, resulting in
training sets with a very unbalanced label distribution. To avoid biasing the
machine learning algorithm, labels are weighted to make them equally important
to the algorithm. Word embeddings have been trained using Word2Vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] from
Spanish newspaper archives, the Iberia Corpus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and the MEDDOCAN
Training Set, with a total of 1,355,054,567 tokens and a vocabulary size of 940,576.
The vectors have a dimensionality of 100 and were trained using the continuous
bag-of-words architecture. Words out of vocabulary are initialized to zero.
      </p>
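      <p>The zero initialization of out-of-vocabulary words can be sketched in plain Python as follows. The function name, the toy two-dimensional vectors and the list-of-lists representation are assumptions for illustration; the system itself uses 100-dimensional Word2Vec vectors.</p>
      <preformat><![CDATA[
```python
def embedding_matrix(vocabulary, vectors, dim=100):
    """Build one row per vocabulary word; unknown words get a zero vector."""
    matrix = []
    for word in vocabulary:
        if word in vectors:
            matrix.append(list(vectors[word]))
        else:
            matrix.append([0.0] * dim)  # out-of-vocabulary word
    return matrix


# Toy pretrained vectors, invented for illustration only.
toy_vectors = {"paciente": [0.1, 0.2], "edad": [0.3, 0.4]}
rows = embedding_matrix(["paciente", "xyz", "edad"], toy_vectors, dim=2)
# rows[1] is [0.0, 0.0] because "xyz" has no pretrained vector.
```
]]></preformat>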
      <p>
        The topology of the TCCNN is depicted in Fig. 4. Three input channels are
defined for processing the sequence of words to be classified and its left and right
context. Each channel has an input layer, limiting the length of the input
sequence to 15, followed by an embedding layer and a one-dimensional convolutional
layer. Max pooling layers down-sample the output from the convolutional layers
and a flatten layer reduces its dimension for concatenation. The concatenated
output from the three channels is processed by a dense layer and an output
layer. The code used for implementing the TCCNN is an adaptation of the code
published in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], implemented in Keras [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
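      <p>As a worked example of the tensor sizes implied by this topology, the following Python sketch computes the width of the concatenated layer from the hyperparameter values reported in this section. It rests on several assumptions: the reported value of 32 is read as the number of convolution filters, the convolution is unpadded with stride 1, and pooling windows do not overlap. The function names are hypothetical and no neural network library is used.</p>
      <preformat><![CDATA[
```python
def conv1d_len(n, kernel):
    """Output length of an unpadded, stride-1 one-dimensional convolution."""
    return n - kernel + 1


def pool1d_len(n, pool):
    """Output length of non-overlapping max pooling (stride equal to pool size)."""
    return n // pool


def channel_features(seq_len=15, filters=32, kernel=2, pool=2):
    """Number of values one channel contributes after flattening."""
    return pool1d_len(conv1d_len(seq_len, kernel), pool) * filters


per_channel = channel_features()  # 7 positions x 32 filters = 224
merged = 3 * per_channel          # left context, n-gram, right context = 672
print(per_channel, merged)        # prints: 224 672
```
]]></preformat>
      <p>Under these assumptions, the dense layer of 200 units would receive 672 inputs.</p>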
      <p>Training and development sets were created from MEDDOCAN datasets
using the candidate generator to train and evaluate the TCCNN. For each entity
type, its weight was computed as <inline-formula><tex-math>n/n_i</tex-math></inline-formula>, where <inline-formula><tex-math>n = \sum_i n_i</tex-math></inline-formula> and <inline-formula><tex-math>n_i</tex-math></inline-formula> is the number
of examples of type <inline-formula><tex-math>i</tex-math></inline-formula>. Hyperparameters of the neural network were manually
explored until a peak of 0.965 accuracy was reached on the development set
(corresponding to 0.992 on the training set). Channels have a capacity of 15
words, dropout rates are set to 0.005, filter windows to 32, kernel and pool sizes
to 2; dense layers contain 200 units with L2 kernel regularizers set to 0.001;
activation functions are ReLU except for the output layer, which is softmax.
The mini-batch size was 156 and the optimizer was Adam with a learning rate of
0.001. The learning curve of Fig. 5 indicates that the model fits quite well with
a training-development accuracy gap of 3%. The performance of the neural
network classifier (without the final filter) on every entity type in terms of precision
and recall, ordered by F1, is shown in Table 1. There is no apparent correlation
between performance and the number of examples of each entity type.</p>
      <p>The two systems submitted differ in the training corpus: System II is trained
with the Training Set and System III is trained with all available data, i.e., the
Training and Development Sets.</p>
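      <p>The label weighting used for training, in which each type is weighted by the total number of examples divided by the number of examples of that type, can be sketched in Python as below. The function name, the NON_ENTITY label and the toy counts are invented for illustration and are not the corpus figures.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def label_weights(labels):
    """Return {label: n / n_i}, where n_i counts label i and n sums all n_i."""
    counts = Counter(labels)
    n = sum(counts.values())
    return {label: n / n_i for label, n_i in counts.items()}


# Toy label distribution (invented): 89 non-entities and 11 entities.
toy = (["NON_ENTITY"] * 89
       + ["TERRITORIO"] * 8
       + ["FAMILIARES_SUJETO_ASISTENCIA"] * 3)
weights = label_weights(toy)
# weights["TERRITORIO"] == 12.5, since 100 / 8 = 12.5
```
]]></preformat>
      <p>With these weights, every label contributes the same total weight (its count times its weight equals the number of examples), which is the sense in which labels become equally important. In Keras, a mapping of this kind keyed by class index is the form accepted by the class_weight argument of Model.fit.</p>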
      <p>The results on the Development and Test Sets in subtasks 1 and 2 are shown
in Tables 2.1 and 2.2, where the effect of the final filter is also measured (in all
cases left-most longest matches are selected in case of overlapping). Results for
System III on the Development Set are not given since this set was seen
during training. In all the experiments, the effect of the filter is positive,
increasing all figures by 0.8-3.41%. On the Test Set, Systems II and III have
slightly different precision and recall but a similar F1 of about 0.918 for subtask
1 and 0.93 for subtask 2.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>An initial approach to the MEDDOCAN task was to create a pattern-based
system using context-free grammars to classify and generate candidates for the
kinds of entities defined by the task (System I). This approach
resulted in high precision but low recall due to the difficulty of modelling the
span and context of some entity types. Development then shifted to
n-gram generation using a blacklist of word forms, combined
with a machine learning classifier that had previously been used for named entity
classification, resulting in a best F1 of 0.9184 for subtask 1 and 0.9345 for subtask
2. Part of the success in the tasks can be explained by the distribution of the
entities within the texts, many of which are placed in fields and surrounded
by labels introducing or delimiting them. A natural path for future work is to
incorporate PoS tags and character-level information into new channels of
the TCCNN, for the purpose of eliminating the filter in the n-gram candidate
generation and the final filter.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brownlee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Deep Learning for Natural Language Processing</source>
          , chap. Develop an n-gram CNN for Sentiment Analysis (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.: Keras. https://keras.io (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>De-identification of patient notes with recurrent neural networks</article-title>
          .
          <source>Journal of the American Medical Informatics Association : JAMIA</source>
          <volume>24</volume>
          (
          <issue>3</issue>
          ),
          <fpage>596</fpage>
          –
          <lpage>606</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Khin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burckhardt</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A Deep Learning Architecture for De-identification of Patient Notes: Implementation and Evaluation</article-title>
          (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.01570
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1746</fpage>
          –
          <lpage>1751</lpage>
          . Association for Computational Linguistics, Doha, Qatar (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>De-identification of clinical notes via recurrent neural network and conditional random field</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          ,
          <fpage>S34</fpage>
          –
          <lpage>S42</lpage>
          (
          <year>2017</year>
          ), A Natural Language Processing Challenge for Clinical Records: Research Domains Criteria (RDoC) for Psychiatry
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)</source>
          . vol. TBA, p. TBA. CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain (
          <year>2019</year>
          ), TBA
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed Representations of Words and Phrases and Their Compositionality</article-title>
          .
          <source>In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2</source>
          . pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          . NIPS'13, USA
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Porta-Zamorano</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Rosal García</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahumada-Lara</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Design and development of Iberia: a corpus of scientific Spanish</article-title>
          .
          <source>Corpora</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>145</fpage>
          –
          <lpage>158</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stenetorp</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Topic</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohta</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>BRAT: A Web-based Tool for NLP-assisted Text Annotation</article-title>
          .
          <source>In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics</source>
          . pp.
          <fpage>102</fpage>
          –
          <lpage>107</lpage>
          . EACL '12, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>