<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anonymization of Clinical Reports in Spanish: a Hybrid Method Based on Machine Learning and Rules.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pilar López-Úbeda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel C. Díaz-Galiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Teresa Martín-Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén</institution>
          ,
          <addr-line>Campus Las Lagunillas, 23071, Jaén</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>687</fpage>
      <lpage>695</lpage>
      <abstract>
        <p>Biomedicine is an ideal environment for the use of Natural Language Processing, due to the huge amount of information processed and stored in electronic format. This information cannot be shared while it contains confidential patient data. To address this problem, the Medical Document Anonymization workshop has been created. In this paper, we present an automated anonymization system for clinical reports written in Spanish. Three different methods are evaluated and compared. The first method is rule-based, the second uses machine learning, and the third is a hybrid of the first two. The evaluation showed that the hybrid method obtained the best results. The results are as expected: we obtained an F1 score of 90% in sub-task 1 and 95% in sub-task 2.</p>
      </abstract>
      <kwd-group>
        <kwd>Anonymization</kwd>
        <kwd>Named Entities Recognition</kwd>
        <kwd>CRF</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Regular Expressions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Named Entity Recognition (NER) is a key component of many natural
language applications, such as the anonymization of clinical records. This task is
crucial because a hospital cannot freely publish information related to a patient
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        NER consists of the automatic identification of text fragments, called entities,
which refer to information units such as persons, geographical locations, sex,
names of organizations, dates, occupations or references to documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The OTG de Sanidad of the Plan TL, in collaboration with the National
Center for Oncological Research and the Hospital 12 de Octubre in Madrid, has
organized a task to contribute to the studies mentioned above. This task is
part of IberLEF (Iberian Languages Evaluation Forum) at SEPLN (Sociedad
Española para el Procesamiento del Lenguaje Natural) 2019.</p>
      <p>
        The Medical Document Anonymization (MEDDOCAN) task is the first
community challenge task specifically devoted to the anonymization of medical
documents in Spanish [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The MEDDOCAN task is structured into two sub-tasks:
NER offset and entity type classification, and sensitive span detection.
      </p>
      <p>The first sub-track requires matching exactly the beginning and end locations
of each Protected Health Information (PHI) entity tag, as well as correctly
detecting the annotation type. On the other hand, the second sub-track is more
specific to the practical scenario needed for releasing de-identified clinical
documents, where the main goal is to identify and be able to obfuscate or mask
sensitive data, regardless of the actual type of entity or the correct offset
identification of multi-token sensitive phrase mentions.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>The MEDDOCAN corpus was selected manually by a practicing physician and
augmented with PHI phrases by health documentalists, adding PHI information
from discharge summaries and medical genetics clinical records. The clinical
records we process are usually structured, and most of them follow a common
format.</p>
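      <p>The organizers provided us with the following datasets:</p>
      <list list-type="bullet">
        <list-item><p>The training set consists of 500 documents.</p></list-item>
        <list-item><p>The validation set consists of 250 documents.</p></list-item>
        <list-item><p>The test set contains 3751 documents.</p></list-item>
      </list>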
      <p>The MEDDOCAN annotation scheme defines a total of 29 entity types (see footnote 1), and
the official annotation guidelines used to annotate the MEDDOCAN data sets
are available online (see footnote 2).</p>
    </sec>
    <sec id="sec-3">
      <title>Strategies</title>
      <p>In this section we describe the methods and strategies followed to address
the tasks. These methods are used for both sub-tasks: NER offset and entity type
classification, and sensitive token detection.</p>
      <sec id="sec-3-1">
        <title>Rule-based method</title>
        <p>
          An initial survey of the training dataset showed that the majority of the
values for a given field were recorded in certain patterns. Some of the patterns we
found are shown in Table 1. This table shows that some named entities are easy
to identify. Therefore, we designed rules incorporating
regular expressions (REs or regex) for each field, according to the description
types of the fields [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          There have been many studies on the use of regular expressions in different
areas of medicine [
          <xref ref-type="bibr" rid="ref5 ref8">8, 5</xref>
          ]. Some of them are very recent, which shows that this method is still
applied in the area of Natural Language Processing (NLP).
          <fn id="fn1"><label>1</label><p>http://temu.bsc.es/meddocan/index.php/annotation-guidelines/</p></fn>
          <fn id="fn2"><label>2</label><p>http://temu.bsc.es/meddocan/wp-content/uploads/2019/02/gu%C3%ADas-deanotaci%C3%B3n-de-informaci%C3%B3n-de-salud-protegida.pdf</p></fn>
        </p>
        <p>The first step in this experiment was to define the rules to extract
the required entities. These rules were derived from the annotation guide provided
by the task organizers. The text is converted to lower case to make it
homogeneous. A total of 25 rules were defined in this method. Some of these rules are
shown in Table 2.</p>
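        <p>As an illustration of the kind of rule described above, the following minimal Python sketch applies a few regular expressions to lower-cased text. The three patterns, the names RULES and apply_rules, and the label names other than TERRITORIO are assumptions made for this sketch; they are not the 25 rules actually used by the system.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical rules: (label, compiled pattern). Not the system's actual rules.
RULES = [
    ("FECHAS", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("CORREO_ELECTRONICO", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("TERRITORIO", re.compile(r"\b\d{5}\b")),  # five-digit zip code
]


def apply_rules(text):
    """Return (label, start, end, matched_text) tuples sorted by start offset."""
    # The rules operate on lower-cased text; offsets are reused on the
    # original text, which assumes lower() does not change its length.
    lowered = text.lower()
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(lowered):
            start, end = match.start(), match.end()
            found.append((label, start, end, text[start:end]))
    return sorted(found, key=lambda item: item[1])
```
]]></preformat>
        <p>Calling apply_rules on a line such as "Fecha: 12/05/2018. CP 50006." would return one FECHAS span and one TERRITORIO span with their character offsets, which is the form of output required by sub-task 1.</p>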
        <p>
          In addition, to obtain greater precision in recognizing PHI information
within the text, we used some resources to correctly identify territories and
people. First, we used a list of Spanish provinces to identify
territories. Second, we used the Stanford Named Entity Recognizer
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] with a pre-trained Spanish model. The Spanish models are trained on a combination of two
corpora, after very heavy modifications: the AnCora Spanish 3.0 corpus<fn id="fn3"><label>3</label><p>http://clic.ub.edu/corpus/ancora</p></fn> and the DEFT
Spanish Treebank V2<fn id="fn4"><label>4</label><p>https://catalog.ldc.upenn.edu/LDC2018T01</p></fn>. With these Stanford models, people and territories are
identified in the text.
        </p>
        <sec id="sec-3-1-1">
          <title>CRF</title>
          <p>
            The Conditional Random Fields (CRF) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] classifier is a stochastic model commonly
used to label and segment data sequences or to extract information from medical
documents [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. We used CRFsuite, the implementation provided by Okazaki [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ],
as it is fast and provides a simple interface for training and modifying the input
features.
          </p>
          <p>We incorporate some basic features of each word, such as isLower, isUpper,
isTitle, isDigit, isAlpha, isBeginOfSentence and isEndOfSentence.</p>
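          <p>The word-level features listed above could be computed roughly as in the following sketch. It assumes that each sentence is already available as a list of tokens and that each feature maps directly onto the Python string method of the same name; the function names token_features and sentence_features are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def token_features(tokens, index):
    """Build the basic orthographic features for the token at `index`."""
    word = tokens[index]
    return {
        "isLower": word.islower(),
        "isUpper": word.isupper(),
        "isTitle": word.istitle(),
        "isDigit": word.isdigit(),
        "isAlpha": word.isalpha(),
        # Position features, assuming `tokens` holds exactly one sentence.
        "isBeginOfSentence": index == 0,
        "isEndOfSentence": index == len(tokens) - 1,
    }


def sentence_features(tokens):
    """Return one feature dictionary per token of the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```
]]></preformat>
          <p>A list of such dictionaries per sentence is the input format expected by common Python wrappers around CRFsuite.</p>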
          <p>Like most machine learning-based de-identification systems, the
token-level CRF first requires a tokenization module. The tokenizer used is
WordPunctTokenizer from the NLTK library (footnote 5) in Python.</p>
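          <p>To show what this tokenization step produces without requiring NLTK, the sketch below uses only the standard library. It relies on the regular expression that, to our knowledge, NLTK's WordPunctTokenizer is built on; the function name word_punct_tokenize is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)
```
]]></preformat>
          <p>Applied to "Medico: David Hernandez Alcaraz.", this yields the tokens Medico, ":", David, Hernandez, Alcaraz and ".", so punctuation always becomes a separate token.</p>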
          <p>Below are a few short lines from the training file
(S1130-01082009000500012-1).</p>
          <preformat>S1130-01082009000500012-1 Medico     0
S1130-01082009000500012-1 :          0
S1130-01082009000500012-1 David      NOMBRE_PERSONAL_SANITARIO
S1130-01082009000500012-1 Hernandez  NOMBRE_PERSONAL_SANITARIO
S1130-01082009000500012-1 Alcaraz    NOMBRE_PERSONAL_SANITARIO
S1130-01082009000500012-1 .          0

S1130-01082009000500012-1 NCol       0
S1130-01082009000500012-1 :          0
S1130-01082009000500012-1 29         ID_TITULACION_PERSONAL_SANITARIO
S1130-01082009000500012-1 29585      ID_TITULACION_PERSONAL_SANITARIO</preformat>
          <p>The CRF model was trained with the following parameters: algorithm = lbfgs, c1 =
0.1, c2 = 0.1, max_iterations = 100, all_possible_transitions = False.</p>
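          <p>These parameters could be passed to a CRFsuite wrapper as in the sketch below. The use of the sklearn-crfsuite package is an assumption of this sketch (the text only states that CRFsuite was used), and the names CRF_PARAMS and build_crf are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
# Hyperparameters as reported in the text.
CRF_PARAMS = {
    "algorithm": "lbfgs",
    "c1": 0.1,  # L1 regularization coefficient
    "c2": 0.1,  # L2 regularization coefficient
    "max_iterations": 100,
    "all_possible_transitions": False,
}


def build_crf(params=CRF_PARAMS):
    """Create an untrained CRF model; requires the sklearn-crfsuite package."""
    # Assumption: sklearn-crfsuite is the wrapper; imported lazily so that
    # the parameters can be inspected without the package installed.
    import sklearn_crfsuite
    return sklearn_crfsuite.CRF(**params)
```
]]></preformat>
          <p>With that package installed, build_crf().fit(X_train, y_train) would train the model, where X_train is a list of sentences (each a list of feature dictionaries) and y_train holds the corresponding label sequences.</p>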
          <p>The output provided by this method is shown below. As we can see, CRF
returns tokens and their predicted annotation, so it is necessary to perform a
treatment to join different tokens in the same concept.</p>
          <preformat>[("Medico",0), (":",0), ("David",NOMBRE_PERSONAL_SANITARIO),
("Hernandez",NOMBRE_PERSONAL_SANITARIO),
("Alcaraz",NOMBRE_PERSONAL_SANITARIO), (".",0),
("NCol",0), (":",0), ("29",ID_TITULACION_PERSONAL_SANITARIO),
("29585",ID_TITULACION_PERSONAL_SANITARIO)]</preformat>
          <p>This treatment consisted of joining all the contiguous tokens of the same
category. In this way, we create the correct output file as shown below:</p>
          <preformat>David Hernandez Alcaraz    NOMBRE_PERSONAL_SANITARIO
29 29585                   ID_TITULACION_PERSONAL_SANITARIO</preformat>
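          <p>The joining of contiguous tokens that share a category can be sketched as follows. This is a simplified illustration: it joins tokens with a single space and ignores character offsets, which a complete system would need to track. The function name merge_tokens and the use of the string "0" for tokens outside any entity are assumptions.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from itertools import groupby


def merge_tokens(tagged_tokens, outside="0"):
    """Join contiguous tokens that share a label into (phrase, label) pairs."""
    phrases = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == outside:
            continue  # tokens outside any entity are dropped
        phrases.append((" ".join(token for token, _ in group), label))
    return phrases
```
]]></preformat>
          <p>Note that this strategy also merges two adjacent entities of the same category into one phrase, which is one source of the errors that the hybrid method corrects with rules.</p>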
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Hybrid method</title>
        <p>The last method combines the two methods described above: the rule-based
method and machine learning with CRF.</p>
        <p>After applying the machine learning method, an error analysis was carried out
on the development dataset, and we found some inconsistencies in the PHI phrases
that had been annotated.</p>
        <sec id="sec-3-2-1">
          <title>Error cases</title>
          <p><fn id="fn5"><label>5</label><p>https://www.nltk.org/</p></fn></p>
          <p>Most of the error cases that we observed involved annotations such as
TERRITORIO, ID_CONTACTO_ASISTENCIAL and ID_ASEGURAMIENTO.</p>
          <p>The main problem with the TERRITORIO annotation is that our system joined
separate phrases into a single one, as shown below:</p>
          <preformat>AV. San Francisco 7, 3D 50006 Zaragoza    TERRITORIO</preformat>
          <p>The correct annotation should be:</p>
          <preformat>AV. San Francisco 7, 3D    TERRITORIO
50006                      TERRITORIO
Zaragoza                   TERRITORIO</preformat>
          <p>These errors could be solved by using regular expressions that separate the
annotation into different entries. In this case, we used a regular expression to
find the zip code, together with the list of cities in Spain; in this way, we could separate
the different PHI phrases.</p>
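          <p>A minimal sketch of this correction is given below, under stated assumptions: the zip code is taken to be any five-digit number, CITIES is a tiny hypothetical excerpt standing in for the full list of Spanish cities, and the function name split_territory is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re

ZIP_CODE = re.compile(r"\b\d{5}\b")
# Hypothetical excerpt; the system relied on a full list of cities in Spain.
CITIES = {"zaragoza", "madrid", "sevilla"}


def split_territory(phrase):
    """Split a merged address phrase into street part, zip code and city."""
    match = ZIP_CODE.search(phrase)
    if match is None:
        return [phrase.strip()]
    before = phrase[:match.start()].strip()
    after = phrase[match.end():].strip()
    if after.lower() not in CITIES:
        # Unknown trailing text: leave the phrase untouched.
        return [phrase.strip()]
    return [part for part in (before, match.group(), after) if part]
```
]]></preformat>
          <p>For the example above, split_territory("AV. San Francisco 7, 3D 50006 Zaragoza") returns the three separate phrases "AV. San Francisco 7, 3D", "50006" and "Zaragoza".</p>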
          <p>Another error that we could avoid with the use of regex is the erroneous
annotation of ID_CONTACTO_ASISTENCIAL, which the algorithm identified
as ID_SUJETO_ASISTENCIA.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <p>For both sub-tracks, the primary de-identification metrics are
standard measures from the NLP community, namely micro-averaged precision,
recall, and balanced F-score.</p>
      <p>In addition, the leak score is also used for sub-task 1. This measure is related
to the detection of leaks (non-redacted PHI remaining after de-identification),
and is computed as (#false negatives / #sentences present).</p>
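      <p>For reference, these measures can be computed from raw counts as in the following sketch. The function name micro_scores and its argument names are hypothetical, and any counts passed to it in an example are illustrative rather than taken from our results.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def micro_scores(true_positives, false_positives, false_negatives, sentences):
    """Micro-averaged precision, recall, F1 and leak score from raw counts."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    # Leak score: false negatives per sentence present in the documents.
    leak = false_negatives / sentences if sentences else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "leak": leak}
```
]]></preformat>
      <p>Because the counts are summed over all entity types before the ratios are taken, frequent entity types weigh more in these micro-averaged scores than rare ones.</p>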
      <p>The results obtained by our team for sub-task 1 (NER offset and entity type
classification) are shown in Table 3.</p>
      <p>These results show that machine learning algorithms improve on our baseline
(the rule-based method). The improvement is substantial: the F1 score rises from
0.59 with the rule-based method to 0.86 with CRF. With the addition of some rules,
we improved all measures, obtaining a precision of 0.92 and a recall
of 0.88.</p>
      <p>Table 4 and Table 5 show the evaluation of the systems for sub-task 2
(Sensitive span detection) with strict and merged spans evaluation respectively.</p>
      <p>In this second sub-task we obtain higher values than in the
previous one, because the objective is only to identify confidential data.
This is a span-based evaluation, regardless of the actual type of entity
or the correct offset identification of multi-token sensitive phrase mentions.</p>
      <p>In this sub-task we get a higher baseline than in the previous one, and the
results improve further thanks to CRF and the rules applied. We achieve nearly equal
results in the strict and merged evaluations. The precision is high:
our third method is correct in 97% of cases.</p>
      <p>We do see a difference between the two evaluations for systems 2 and 3. In
the strict evaluation, the machine learning algorithms reach 90% F1, whereas in
the merged evaluation the second method reaches 95%. The
second method therefore works best when the evaluation is not strict.</p>
      <p>Finally, it is interesting to see that the third method proposed by our
group obtains similar values in the two evaluations of sub-task 2.
This suggests that the spans produced by our third method match the gold spans almost exactly.</p>
    </sec>
    <sec id="sec-5">
      <title>Error analysis</title>
      <p>The main purpose of this section is to carry out an error analysis to identify
the weaknesses of our best system: the hybrid method (run 3). To this end, we
obtained some basic statistics for the 250 gold test files. The test files taken into
account are from sub-task 1, described above. These files contain 5661 annotated key
phrases.</p>
      <p>We have described three basic types of errors for this analysis:</p>
      <sec id="sec-5-1">
        <title>1. Does not have the same annotated label.</title>
        <p>In this case, our system marks the positions of the key phrase correctly, but
the associated annotation is incorrect. We found a total of 221 errors of this
type. The most frequent confusions made by our system are described in Table 6. This
table shows the annotation found in the gold test files, the annotation produced by our
system, and the number of errors found.
These types of errors should be relatively easy to solve in future work. The most
frequent confusion is between the territory and the ID of the subject of assistance,
because zip codes have 5 digits and the IDs are usually digits as well.
To improve this case, we should take into account the context in which the
digits appear in order to annotate them correctly.</p>
      </sec>
      <sec id="sec-5-2">
        <title>2. Incorrect positions.</title>
        <p>In this case, our system incorrectly marks the start or end position of the
key phrase. Some examples are shown below.
In the first example, the first frame shows the correct annotations: the name
of the health employee and the street are
annotated separately, with consecutive positions (340 - 360 and
361 - 374). Our system (second frame), however, takes everything as a full name
with positions from 340 to 374.</p>
        <preformat>NOMBRE_PERSONAL_SANITARIO 340 360 Raquel Ridruejo Saez
CALLE 361 374 Paseo Calanda</preformat>
        <preformat>NOMBRE_PERSONAL_SANITARIO 340 374 Raquel Ridruejo Saez Paseo Calanda</preformat>
        <p>The following example shows a similar error: the first frame shows the
correct output and the next frame shows the output of our system. Our system
annotates everything as a street when it should
record institution and street separately.</p>
        <preformat>INSTITUCION 1923 1944 Ciutat de la Justícia
CALLE 1945 1974 Gran Via Corts Catalanes, 111</preformat>
        <preformat>CALLE 1923 1974 Ciutat de la Justícia Gran Via Corts Catalanes, 111</preformat>
        <p>The number of errors of this type in our system is 215.</p>
      </sec>
      <sec id="sec-5-3">
        <title>3. Not found.</title>
        <p>Finally, we found 203 cases that are annotated in the gold test set but that our
system does not detect.</p>
        <p>The total number of errors found is 639, which is about 11% of the gold test
annotations. It is a small number that could possibly be reduced in later
systems.</p>
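        <p>The three error types described in this section could be counted automatically as in the sketch below. It assumes that annotations are (label, start, end) triples with an exclusive end offset, and that a gold span overlapping any system span counts as a position error; the text does not state how overlaps were matched, so this criterion and the name classify_errors are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def classify_errors(gold, system):
    """Count the three error types; annotations are (label, start, end)."""
    counts = {"label": 0, "position": 0, "not_found": 0}
    system_spans = {(start, end): label for label, start, end in system}
    for label, start, end in gold:
        predicted = system_spans.get((start, end))
        if predicted == label:
            continue  # exact match, no error
        if predicted is not None:
            counts["label"] += 1  # same span, different label
        elif any(min(end, s_end) > max(start, s_start)
                 for _, s_start, s_end in system):
            counts["position"] += 1  # overlapping span, wrong boundaries
        else:
            counts["not_found"] += 1  # no system span touches the gold span
    return counts
```
]]></preformat>
        <p>With the first example of incorrect positions, both gold annotations (340 to 360 and 361 to 374) overlap the single system span from 340 to 374, so this function counts two position errors.</p>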
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>
        We have observed an increase in the number of studies associated with the
identification of PHI phrases in electronic medical records. Statistical analyses
or machine learning, followed by NLP techniques, have been gaining popularity over
the years in comparison with rule-based systems [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In this study we try to
verify that traditional and automatic methods can still coexist and that they
can be of great help if we use them together.
      </p>
      <p>This is the first participation of the SINAI group in this type of task, where
the main objective is to find PHI information in clinical records. The results are
encouraging. In both sub-tasks we reached more than 90% F1 with our best
method, and we obtained more than 97% precision in sub-task 2 (sensitive span
detection).</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by Fondo Europeo de Desarrollo
Regional (FEDER), LIVING-LANG project (RTI2018-094653-B-C21) and REDES
project (TIN2015-65136-C2-1-R) from the Spanish Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Using natural language processing to extract clinically useful information from Chinese electronic medical records</article-title>
          .
          <source>International Journal of Medical Informatics</source>
          <volume>124</volume>
          ,
          <fpage>6</fpage>
          –
          <lpage>12</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gralinski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jassem</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcinczuk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wawrzyniak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition in machine anonymization</article-title>
          .
          <source>Recent Advances in Intelligent Information Systems</source>
          pp.
          <fpage>247</fpage>
          –
          <lpage>260</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kayaalp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Biological entity recognition with conditional random fields</article-title>
          . In:
          <source>AMIA Annual Symposium Proceedings</source>
          . vol.
          <volume>2008</volume>
          , p.
          <fpage>293</fpage>
          . American Medical Informatics Association (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.</given-names>
          </string-name>
          :
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A pattern-based method for medical entity recognition from Chinese diagnostic imaging text</article-title>
          .
          <source>Frontiers in Artificial Intelligence</source>
          <volume>2</volume>
          ,
          <issue>1</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          . In:
          <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          . pp.
          <fpage>55</fpage>
          –
          <lpage>60</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          . In:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)</source>
          . vol. TBA, p. TBA.
          <publisher-name>CEUR Workshop Proceedings (CEUR-WS.org)</publisher-name>
          , Bilbao, Spain (Sep
          <year>2019</year>
          ), TBA
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawley</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>R.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clarke</surname>
            ,
            <given-names>B.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duhig</surname>
            ,
            <given-names>E.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colquist</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Symbolic rule-based classification of lung cancer stages from free-text pathology reports</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>440</fpage>
          –
          <lpage>445</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>CRFsuite: a fast implementation of conditional random fields (CRFs)</article-title>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Shivade</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fosler-Lussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Embi</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>A review of approaches to identifying patient phenotype cohorts using electronic health records</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <fpage>221</fpage>
          –
          <lpage>230</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Szarvas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farkas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busa-Fekete</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>State-of-the-art anonymization of medical records using an iterative machine learning framework</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>574</fpage>
          –
          <lpage>580</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>