<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oliver Gros</string-name>
          <email>oliver.gros@izi-bb.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reinhard Thasler</string-name>
          <email>reinhard.thasler@med.uni-muenchen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biobank under administration of HTCR, Department of General, Visceral, Vascular, and Transplantation Surgery, Hospital of the University of Munich</institution>
          ,
          <addr-line>81377 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer Institute for Cell Therapy and Immunology, Branch Bioanalytics and Bioprocesses (IZI-BB)</institution>
          ,
          <addr-line>Am Mühlenberg 13, 14476 Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biobanks represent key resources for biomedical research. To be accessible, e.g. via web-based query tools or transinstitutional metabiobanks, the stored human biospecimens have to be annotated with clinical data in a harmonized and structured form, e.g. ICD codes. Currently, these data are available only from free-text records.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Biobanks represent key resources for translational
research and personalized medicine
        <xref ref-type="bibr" rid="ref6">(Thasler et al.,
2013)</xref>
        . The Biobank under Administration of the
Human Tissue and Cell Research (HTCR) Foundation
at the University of Munich Medical Centre
routinely collects remnant tissues and blood
samples from treatments of patients at the Clinic
for General, Visceral, Vascular and
Transplantation Surgery as well as the Department
for Thoracic Surgery. For diagnostic classification of the
corresponding cases, a biobank-specific
classification was developed, but it has not yet been
matched to ICD codes.
      </p>
      <p>Such coding has so far been done manually. We have now used the
automated knowledge extraction software
CRIP.CodEx, which needs neither a training set nor
access to external resources, to recode the
textual description of the specialized
HTCR biobank classification with ICD.
We show that the information contained in
the nomenclature of the individual
biobank-specific catalogue of diagnoses is sufficient
for a mapping to ICD-10 as well as
ICD-O-3 catalogues, and we deliver an
automated matching of two different
classification systems using CRIP.CodEx.</p>
      <p>However, to be a valuable resource for
translational biomedical research, these human
biospecimens have to be annotated with clinical
data, which are currently available only from free-text
records.</p>
      <p>
        Access to these biospecimens and data, e.g.
over web-based query tools, like the
transinstitutional metabiobank CRIP (
        <xref ref-type="bibr" rid="ref5">Schröder et al.,
2011</xref>
        ) or p-BioSPRE
        <xref ref-type="bibr" rid="ref7">(Weiler et al., 2014)</xref>
        ,
requires that information from various sources
be integrated into harmonized and structured data
to enable stratified, parameterized queries
        <xref ref-type="bibr" rid="ref1">(Ambert
and Cohen, 2009)</xref>
        .
      </p>
      <p>
        So far, analysis of free text sources and
structured data entry in the data protection
compliant biobank information system
        <xref ref-type="bibr" rid="ref4">(Müller and
Thasler, 2014)</xref>
        is done manually. For the
diagnostic classification of cases, a biobank-specific
classification was developed; however, this
classification has not yet been matched to ICD
codes. Based on automated free-text analysis,
this coding can now be amended with international
classifications, e.g. ICD-10 or ICD-O-3, to
facilitate project queries.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>The automated knowledge extraction software
CRIP.CodEx was designed to identify and extract
information in free text medical records and assign
corresponding codes (see Figure 1).</p>
      <p>
        CRIP.CodEx is part of the CRIP-Toolbox
(www.crip.fraunhofer.de/en/toolbox) and
will be published separately; an online demonstrator is available at
https://preview-crip.fraunhofer.de/intern/codex-demo/. It works automatically,
quickly and efficiently (time per text: less than one up to a few seconds; 103 words),
identifies word relations and
negation, and handles extended negation scopes
        <xref ref-type="bibr" rid="ref3">(Gros
and Stede, 2013)</xref>
        , but does not need access to
databases or other external resources. Specifically,
it needs neither a training set of pre-annotated
texts nor manually entered extraction rules.
The sources for the self-generated
extraction rules are lists of codes and their
descriptions, contained for example in coding
guides for ICD-10 or ICD-O-3.</p>
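      <p>The idea of deriving extraction rules from a list of codes and their descriptions can be illustrated with a minimal Python sketch. It is not the CRIP.CodEx algorithm, which is not described here: the function names, the stop-word list, the two-entry coding guide and the "all tokens must occur" rule are assumptions made purely for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical two-entry excerpt of a coding guide (code -> description).
CODING_GUIDE = {
    "C22.0": "Liver cell carcinoma",
    "C23": "Malignant neoplasm of gallbladder",
}

# Assumed stop-word list; a real system would need a far richer one.
STOPWORDS = frozenset({"of", "the", "and", "in"})


def tokenize(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def build_rules(guide):
    """Derive one rule per code: the tokens that must all be present."""
    return {code: tokenize(description) for code, description in guide.items()}


def assign_codes(text, rules):
    """Return the sorted codes whose required tokens all occur in the text."""
    tokens = tokenize(text)
    return sorted(code for code, required in rules.items()
                  if required.issubset(tokens))


rules = build_rules(CODING_GUIDE)
matches = assign_codes("Carcinoma of the liver cell type", rules)
print(matches)
```
]]></preformat>
      <p>In this toy setting the rules come entirely from the coding guide, so no pre-annotated training texts are involved; negation and word relations, which CRIP.CodEx handles, are deliberately left out.</p>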
      <p>We used CRIP.CodEx slightly outside its
designed-for purpose, which is the analysis of free-text medical reports.
Instead, we used CRIP.CodEx to recode the
textual description of the specialized HTCR
biobank classification with ICD (see Figure 2). The
initial catalogue of diagnoses was based
on the actual collection of cases in the biobank,
reflecting indications and surgical treatments, and
was therefore primarily aligned along a list of
organs affected by the surgical treatment. To build
a catalogue of diagnoses suitable for biospecimen
research, this catalogue first had to be
reorganized towards distinct pathological findings
across affected organs.</p>
      <p>The resulting categories were therefore
composed of two separate elements: a diagnosis
from the pathological report and the affected organ.
These categories, 65 in total, were then manually
mapped to the ICD-10 catalogue, resulting in 82
codes.</p>
    </sec>
    <sec id="sec-3">
      <title>Automated Matching</title>
      <p>After ensuring that the final categories were
sufficiently selective, we analyzed them in a first
run using CRIP.CodEx for ICD-O-3 as well as
ICD-10 classifications. The result was checked
manually and discussed with a pathologist.</p>
      <p>However, the discussion of the first run showed
that the problems identified were related not only to
CRIP.CodEx, but also to the diagnostic catalogue,
mainly to the categories’ composition as separate
elements of diagnosis and organs affected by the
surgical treatment. Both sources could be tackled,
and therefore both the software configuration and
the categories were revised to reach
improved results in the second run.</p>
      <p>The configuration of CRIP.CodEx was optimized
in the following respects: adding variants of local
wording and synonyms, e.g. ‘Carzinom’ instead of
‘Karzinom’/‘carcinoma’, to the internal
dictionary; increasing the number of allowed
multiple matches when detecting combined and
hyphenated words with synonyms; and
restricting ICD-10 neoplasm code assignment
depending on the detected ICD-O-3 classification.</p>
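      <p>Two of these configuration aspects, the synonym dictionary and the restriction of ICD-10 neoplasm codes by the detected ICD-O-3 classification, can be sketched in Python as follows. This is a simplified illustration and not the CRIP.CodEx configuration: the dictionary entries beyond the spelling variants quoted above, the function names, and the reduction of the restriction to the ICD-O-3 behaviour digit and the three-character ICD-10 category are all assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical synonym dictionary: local spelling variants -> canonical term.
SYNONYMS = {"carzinom": "carcinoma", "karzinom": "carcinoma"}


def normalize(term):
    """Map a spelling variant to its canonical term (case-insensitive)."""
    key = term.lower()
    return SYNONYMS.get(key, key)


def behaviour_of(morphology_code):
    """Return the behaviour digit of an ICD-O-3 morphology code like '8170/3'."""
    return morphology_code.rsplit("/", 1)[1]


def icd10_allowed(icd10_code, morphology_code):
    """Check an ICD-10 neoplasm code against the ICD-O-3 behaviour digit.

    Simplified assumption: only the three-character ICD-10 category is
    compared with the neoplasm blocks of ICD-10 chapter II.
    """
    category = icd10_code[:3]
    secondary = "C77" <= category <= "C79"
    behaviour = behaviour_of(morphology_code)
    if behaviour == "6":  # malignant, metastatic site
        return secondary
    if behaviour == "3":  # malignant, primary site
        return "C00" <= category <= "C97" and not secondary
    if behaviour == "2":  # in situ
        return "D00" <= category <= "D09"
    if behaviour == "0":  # benign
        return "D10" <= category <= "D36"
    if behaviour == "1":  # uncertain or unknown behaviour
        return "D37" <= category <= "D48"
    return False


print(normalize("Carzinom"))
print(icd10_allowed("C22.0", "8170/3"), icd10_allowed("C78.7", "8170/3"))
```
]]></preformat>
      <p>With such a filter, a secondary-neoplasm code such as C78.7 would be rejected for a morphology with behaviour /3 and accepted for behaviour /6, which is the kind of dependency between the two classifications described above.</p>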
      <p>
        The categories of the biobank-specific
classification (see Table 1) were amended
mainly by integrating the pathological diagnosis
and the primarily diseased organ into one
descriptive text. Further amendments include
specifying the affected organs and the information on
primary and secondary tumors more consistently,
adding tumor types to some categories
while shortening the description of others, and
dividing categories into subcategories. In addition,
the ICD-O-3 code assignments were amended together with an
expert from the regional tumor registry in line with the
WHO “Blue Book”
        <xref ref-type="bibr" rid="ref2">(Bosman et al., 2010)</xref>
        as the
international standard.
      </p>
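      <p>After both optimizing the configuration of
CRIP.CodEx and further amending the
categories, we performed a second run to
analyze the categories.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>ICD-O-3 M</th></tr>
          </thead>
          <tbody>
            <tr><td>8070/3</td></tr>
            <tr><td>8140/3</td></tr>
            <tr><td>8800/3</td></tr>
            <tr><td>8010/3, 8170/3</td></tr>
            <tr><td>8010/3, 8170/3</td></tr>
            <tr><td>8010/3</td></tr>
            <tr><td>8010/3, 8140/3</td></tr>
            <tr><td>8890/3</td></tr>
            <tr><td>8160/3</td></tr>
            <tr><td>8010/3, 8160/3</td></tr>
            <tr><td>9120/3</td></tr>
            <tr><td>8010/3</td></tr>
            <tr><td>8140/3</td></tr>
            <tr><td>9050-55/0-3</td></tr>
            <tr><td>8010/3, 8021/3, 8050/3, 8260/3, 8330/3, 8510/3</td></tr>
            <tr><td>8010/3</td></tr>
            <tr><td>8000/6, 8010/6</td></tr>
            <tr><td>8000/6</td></tr>
          </tbody>
        </table>
      </table-wrap>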
    </sec>
    <sec id="sec-4">
      <title>Results</title>
    </sec>
    <sec id="sec-5">
      <title>First run</title>
      <p>The reorganized catalogue of diagnoses contained
64 categories and was automatically matched to
classifications by CRIP.CodEx. For each of the
categories, CRIP.CodEx assigned matching
classifications in ICD-10 or ICD-O-3. In total,
there were 237 correct code assignments (true
positives), while we identified 6 wrong
assignments (false positives) and 12 missing codes
(false negatives).</p>
      <p>Because each individual code assignment made by
CRIP.CodEx is traceable, it was possible to
identify the causes of all false positives and
false negatives. All but two causes could be resolved by
the outlined optimization of the CRIP.CodEx
configuration and of the imported coding guide,
respectively.</p>
    </sec>
    <sec id="sec-6">
      <title>Second run</title>
      <p>The configuration of CRIP.CodEx and the
reorganized biobank-specific catalogue of
diagnoses were optimized and amended as
outlined above. The local catalogue was then
again automatically matched to ICD classifications
by CRIP.CodEx. The final catalogue of diagnoses
contains 73 categories; for each of them, the
software assigned matching classifications in
ICD-10 or ICD-O-3 in the final second run. In total,
there were 442 correct code assignments (true
positives) by CRIP.CodEx, while we identified
no wrong assignments (false positives) and 14
missing codes (false negatives).</p>
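      <p>The counts reported for the two runs can be condensed into the usual precision, recall and F1 measures. These derived values are not stated in the text above; the short Python sketch below only applies the standard definitions to the reported true positives, false positives and false negatives (the function and variable names are illustrative).</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions based on true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Counts as reported for the first and the second run.
first_run = precision_recall_f1(tp=237, fp=6, fn=12)
second_run = precision_recall_f1(tp=442, fp=0, fn=14)

for name, scores in (("first run", first_run), ("second run", second_run)):
    print(name, " ".join(f"{score:.3f}" for score in scores))
```
]]></preformat>
      <p>Under these definitions the first run corresponds to a precision of about 0.975 and a recall of about 0.952, and the second run to a precision of 1.000 and a recall of about 0.969.</p>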
    </sec>
    <sec id="sec-7">
      <title>Conclusion &amp; Outlook</title>
      <p>CRIP.CodEx was designed to extract and codify
diagnostic information contained in medical free text.
Here we showed
that the information contained in the nomenclature
of the individual biobank-specific catalogue of
diagnoses is also sufficient for a mapping to
ICD-10 and, in essence, to ICD-O catalogues. One
issue remains, however: for categories such as
“Carcinoma of the gallbladder”, which
summarize a wide range of different morphologies,
ICD-O code extraction from the nomenclature is too
general, and amending the nomenclature with a
list of these morphologies is also unsatisfactory.</p>
      <p>For the implementation of the restructured and
amended biobank-specific catalogue in the
Biobanks Database or the HTCR Web Application,
however, the categories have now been structured
as a table by key features that can be maintained:</p>
      <list list-type="bullet">
        <list-item><p>“Primary tumor vs. secondary tumor vs. no tumor”</p></list-item>
        <list-item><p>“Affected organ”</p></list-item>
        <list-item><p>“Originating organ in case of secondary tumor”</p></list-item>
        <list-item><p>“Included morphologies”</p></list-item>
      </list>
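      <p>As an illustration of how these four key features could be held in a maintainable record, the following Python sketch models one category. The class and field names, the consistency check and the example values are hypothetical and do not describe the actual Biobanks Database or HTCR Web Application schema.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class TumorStatus(Enum):
    PRIMARY = "primary tumor"
    SECONDARY = "secondary tumor"
    NONE = "no tumor"


@dataclass(frozen=True)
class DiagnosisCategory:
    """One row of the category table, described by its four key features."""
    tumor_status: TumorStatus
    affected_organ: str
    originating_organ: Optional[str] = None
    included_morphologies: Tuple[str, ...] = ()

    def __post_init__(self):
        # Assumed rule: an originating organ is given for secondary tumors only.
        is_secondary = self.tumor_status is TumorStatus.SECONDARY
        if is_secondary != (self.originating_organ is not None):
            raise ValueError(
                "originating_organ must be set for secondary tumors only")


# Hypothetical example row, not taken from the HTCR catalogue.
example = DiagnosisCategory(
    tumor_status=TumorStatus.SECONDARY,
    affected_organ="liver",
    originating_organ="colon",
    included_morphologies=("8000/6", "8010/6"),
)
print(example)
```
]]></preformat>
      <p>Keeping the morphologies as a separate field, rather than inside the category name, is one way to avoid overloading the nomenclature with morphology listings.</p>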
      <p>As a next step, we will tap complementary
data sources, mainly by extracting diagnostic
information from pathology reports with
CRIP.CodEx, in addition to coding the specialized
local classification. In this way we deliver an automated
matching of two different classification systems.</p>
      <p>Even though all automated classifications have to be
checked thoroughly against each individual
request from a research project before samples and
data are provided, this initial, very effective
automated classification greatly facilitates
case-related database research as well as data
export for display of the collection in biobank
registries and even metabiobanks. We have thereby
not only enhanced the biobank’s availability
for translational research but also proposed a
general protocol for matching internal codes with
international classifications and standards.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>The authors especially thank Dr. Jens Neumann
of the Institute of Pathology,
Ludwig-Maximilians-University Munich, and Dr. Gabriele
Schubert-Fritschle of the Tumor Registry Munich for discussing
the results of the CRIP.CodEx code assignments and the
amended categories and manual code assignments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ambert</surname>
            <given-names>KH</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cohen</surname>
            <given-names>AM</given-names>
          </string-name>
          (
          <year>2009</year>
          )
          <article-title>: A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>16</volume>
          ,
          <fpage>590</fpage>
          -
          <lpage>595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bosman</surname>
            <given-names>FT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carneiro</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hruban</surname>
            <given-names>RH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theise</surname>
            <given-names>ND</given-names>
          </string-name>
          (Editors)
          <article-title>: WHO classification of tumours of the digestive system - 4th edition</article-title>
          , IARC: Lyon,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Gros</surname>
            <given-names>O</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stede</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>: Determining Negation Scope in German and English medical diagnoses</article-title>
          , in
          <string-name>
            <surname>Taboada</surname>
            <given-names>M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Trnavac</surname>
            <given-names>R</given-names>
          </string-name>
          (Eds.):
          <source>Nonveridicality and Evaluation - Theoretical, Computational and Corpus Approaches</source>
          , Studies in Pragmatics 11, BRILL. ISBN:
          <isbn>9789004258167</isbn>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Müller</surname>
            <given-names>TH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thasler</surname>
            <given-names>R</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>: Separation of personal data in a biobank information system</article-title>
          .
          <source>Stud Health Technol Inform</source>
          <volume>205</volume>
          :
          <fpage>388</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Schröder</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heidtke</surname>
            <given-names>KR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zacherl</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zatloukal</surname>
            <given-names>K</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Taupitz</surname>
            <given-names>J</given-names>
          </string-name>
          (
          <year>2011</year>
          ):
          <article-title>Safeguarding donors' personal rights and biobank autonomy in biobank networks: the CRIP privacy regime</article-title>
          .
          <source>Cell and tissue banking</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Thasler</surname>
            <given-names>WE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thasler</surname>
            <given-names>RM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schelcher</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jauch</surname>
            <given-names>KW</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>: Biobanking for research in surgery: are surgeons in charge for advancing translational research or mere assistants in biomaterial and data preservation?</article-title>
          <source>Langenbecks Arch Surg</source>
          .
          <year>2013</year>
          Apr;
          <volume>398</volume>
          (
          <issue>4</issue>
          ):
          <fpage>487</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Weiler</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schröder</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schera</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobkowicz</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiefer</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heidtke</surname>
            <given-names>KR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hänold</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nwankwo</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forgó</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanulla</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckert</surname>
            <given-names>C</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Graf</surname>
            <given-names>N</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>: p-BioSPRE - an information and communication technology framework for transnational biomaterial sharing and access</article-title>
          .
          <source>ecancer</source>
          <year>2014</year>
          ,
          <volume>8</volume>
          :
          <fpage>401</fpage>
          -
          <lpage>419</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>