=Paper= {{Paper |id=Vol-1650/smbm16Gros |storemode=property |title=Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications |pdfUrl=https://ceur-ws.org/Vol-1650/smbm16Gros.pdf |volume=Vol-1650 |authors=Oliver Gros,Reinhard Thasler |dblpUrl=https://dblp.org/rec/conf/smbm/GrosT16 }} ==Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications== https://ceur-ws.org/Vol-1650/smbm16Gros.pdf
        Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx:
                   Automated Matching of Classifications


                  Oliver Gros
         Fraunhofer Institute for Cell Therapy and Immunology,
         Branch Bioanalytics and Bioprocesses (IZI-BB)
         Am Mühlenberg 13, 14476 Potsdam, Germany
         oliver.gros@izi-bb.fraunhofer.de

                  Reinhard Thasler
         Biobank under administration of HTCR,
         Department of General, Visceral, Vascular and Transplantation Surgery
         Hospital of the University of Munich, 81377 Munich, Germany
         reinhard.thasler@med.uni-muenchen.de




                    Abstract

    Biobanks represent key resources for biomedical research. To be accessible, e.g. over web-based query tools or trans-institutional metabiobanks, the stored human biospecimens have to be annotated with clinical data in a harmonized and structured form, e.g. ICD codes, whereas these data are currently available only from free text records.

    The Biobank under Administration of Human Tissue and Cell Research Foundation HTCR at the University of Munich Medical Centre is routinely collecting remnant tissues and blood samples from treatments of patients. For diagnostic classification of the corresponding cases, a biobank specific classification was developed, but not yet matched to ICD codes.

    Such coding has so far been done manually. We now used the automated knowledge extraction software CRIP.CodEx, which needs neither a training set nor access to external resources, to recode the textual description of the specialized HTCR biobank classification with ICD. We show that the information contained in the nomenclature of the individual biobank specific catalogue of diagnoses is sufficient for a mapping to the ICD-10 as well as the ICD-O-3 catalogue, and deliver an automated matching of two different classification systems using CRIP.CodEx.


1    Introduction
Biobanks represent key resources for translational research and personalized medicine (Thasler et al., 2013). The Biobank under Administration of Human Tissue and Cell Research Foundation HTCR at University of Munich Medical Centre is routinely collecting remnant tissues and blood samples from treatments of patients at the Clinic for General, Visceral, Vascular and Transplantation Surgery as well as the Department for Thoracic Surgery.
   However, to be a valuable resource for translational biomedical research these human biospecimens have to be annotated with clinical data, currently only available from free text records.
   Access to these biospecimens and data, e.g. over web-based query tools, like the trans-institutional metabiobank CRIP (Schröder et al., 2011) or p-BioSPRE (Weiler et al., 2014), demands that information from various sources
gets integrated into harmonized and structured data to enable stratified, parameterized queries (Ambert and Cohen, 2009).
   So far, analysis of free text sources and structured data entry in the data protection compliant biobank information system (Müller and Thasler, 2014) is done manually, and regarding diagnostic classification of cases, a biobank specific classification was developed. This classification however was not matched to ICD codes so far. Based on automated free text analysis this coding can now be amended with international classifications, e.g. ICD-10 or ICD-O-3, to facilitate project queries.

2    Methods
The automated knowledge extraction software CRIP.CodEx was designed to identify and extract information in free text medical records and assign corresponding codes (see Figure 1).

       Figure 1: UI variants for CRIP.CodEx¹

   CRIP.CodEx is part of the CRIP-Toolbox² and will be published separately. It works automatically, fast and efficiently³, identifies word relations and negation, and handles extended negation scopes (Gros and Stede, 2013), but does not need access to databases or other external resources. Specifically, it needs neither a training set of pre-annotated texts nor manually entered extraction rules. The sources for the self-generated extraction rules are lists of codes and their descriptions, contained for example in coding guides for ICD-10 or ICD-O-3.
   We used CRIP.CodEx slightly off its designed-for purpose, which is the analysis of free text medical reports. Instead we used CRIP.CodEx to recode the textual description of the specialized HTCR biobank classification with ICD (see Figure 2). The initially existing catalogue of diagnoses was based on the actual collection of cases in the biobank, reflecting indications and surgical treatments and therefore being primarily aligned along a list of organs affected by the surgical treatment. To build a catalogue of diagnoses suitable for biospecimen research, in a first step, this catalogue had to be reorganized towards distinct pathological findings across affected organs.

       Figure 2: Knowledge Extraction with CRIP.CodEx – text example

   The resulting categories therefore were composed of two separate elements, a diagnosis from the pathological report and the affected organ. These categories, 65 in total, were then manually mapped to the ICD-10 catalogue, resulting in 82 codes.

2.1   Automated Matching
After ensuring that the final categories were sufficiently selective, we analyzed them in a first run using CRIP.CodEx for ICD-O-3 as well as ICD-10 classifications. The result was checked manually and discussed with a pathologist.
   However, the discussion of the first run showed that the problems identified were related not only to CRIP.CodEx, but also to the diagnostic catalogue, mainly the categories' composition as separate elements of diagnosis and organs affected by the surgical treatment. Both sources could be tackled; therefore both the software configuration and the categories have been worked on to reach improved results in the second run.

¹ Online demonstrator available at https://preview-crip.fraunhofer.de/intern/codex-demo/
² www.crip.fraunhofer.de/en/toolbox
³ Time per text less than one up to a few seconds (103 words)
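Section 2 states that the extraction rules are generated from lists of codes and their descriptions, without a training set. How CRIP.CodEx does this internally is not described here. The following is only a toy sketch of the general idea, assuming a naive keyword approach: every token that occurs in exactly one description of the code list becomes a keyword for that code. The three-entry code list is illustrative, and all function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative excerpt of a code list (ICD-10 category and German title).
# Assumption: a real coding guide would supply thousands of such entries.
CODE_LIST = {
    "C15": "Bösartige Neubildung des Ösophagus",
    "C16": "Bösartige Neubildung des Magens",
    "C25": "Bösartige Neubildung des Pankreas",
}


def tokenize(text):
    """Return the set of lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def build_rules(code_list):
    """Derive one keyword set per code: tokens unique to its description."""
    token_sets = {code: tokenize(desc) for code, desc in code_list.items()}
    doc_freq = Counter(tok for toks in token_sets.values() for tok in toks)
    return {
        code: {tok for tok in toks if doc_freq[tok] == 1}
        for code, toks in token_sets.items()
    }


def assign_codes(text, rules):
    """Assign every code whose keyword set shares a token with the text."""
    tokens = tokenize(text)
    return sorted(code for code, keywords in rules.items() if keywords & tokens)


rules = build_rules(CODE_LIST)
print(assign_codes("Adenokarzinom des Magens", rules))
```

Tokens shared by all descriptions (here "bösartige", "neubildung", "des") carry no discriminating information and drop out automatically. Word relations, negation handling and compound splitting, which the paper names as features of CRIP.CodEx, are outside the scope of this sketch.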
2.2    Optimization
The configuration of CRIP.CodEx was optimized including these aspects: adding variants of local wording/synonyms, e.g. ‘Carzinom’ instead of ‘Karzinom’/‘carcinoma’, into the internal dictionary, increasing the number of allowed multiple matches when detecting combined and hyphenated words with synonyms, as well as restricting ICD-10 neoplasm code assignment depending on detected ICD-O-3 classification.
   The categories of the biobank specific classification (see Table 1) have been amended mainly by integrating the pathological diagnosis and the primarily diseased organ into one descriptive text. Further amendments include specifying the affected organs and information on primary and secondary tumors more consistently, as well as adding tumor types in some categories while shortening the description of others and dividing them into subcategories. In addition, the ICD-O-3 code assignments were amended together with an expert from the regional tumor registry to follow the WHO “Blue Book” (Bosman et al., 2010) as international standard.
   After both optimizing the configuration of CRIP.CodEx and further amending the categories, we performed a second run for analyzing the categories.
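Two of the configuration aspects named in Section 2.2 can be illustrated with a short sketch. It is not the CRIP.CodEx configuration, whose format is not described here. The variant dictionary is a hypothetical two-entry example, and the restriction rule is only one possible reading of "restricting ICD-10 neoplasm code assignment depending on detected ICD-O-3 classification": ICD-10 neoplasm codes (chapter II, C00-D48) are kept only if at least one ICD-O-3 morphology was detected.

```python
import re

# Hypothetical variant dictionary: local spellings mapped to one canonical form.
VARIANTS = {"carzinom": "karzinom", "carcinoma": "karzinom"}


def normalize(text):
    """Lower-case a text and replace known variants, also inside compounds."""
    result = text.lower()
    for variant, canonical in VARIANTS.items():
        result = result.replace(variant, canonical)
    return result


def is_neoplasm_code(icd10_code):
    """Return True for codes of ICD-10 chapter II (C00-D48)."""
    match = re.fullmatch(r"([A-Z])(\d{2})(\.\d+)?", icd10_code)
    if match is None:
        return False
    letter, number = match.group(1), int(match.group(2))
    return letter == "C" or (letter == "D" and number <= 48)


def restrict_icd10(candidates, icdo3_morphologies):
    """Drop ICD-10 neoplasm candidates when no ICD-O-3 morphology was found."""
    if icdo3_morphologies:
        return list(candidates)
    return [code for code in candidates if not is_neoplasm_code(code)]


print(normalize("Gallenblasencarzinom"))
print(restrict_icd10(["C16", "K51"], []))
```

With an empty morphology list the second call keeps only the non-neoplasm candidate K51; if a morphology such as 8140/3 had been detected, both candidates would pass.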


  ICD-10               Diagnosis category                                            ICD-O-3 M
  C15                  Plattenepithelkarzinom des Ösophagus                          8070/3
  C16                  Adenokarzinom des Magens                                      8140/3
  C16, C17, C25, C74   Sarkom (Bauchraum: Leber, Magen, Pankreas, Dünndarm,          8800/3
                       Dickdarm, Niere)
  C16, C22.9           Hepatozelluläres Karzinom (HCC)                               8010/3, 8170/3
  C16.9, C16, C22.9    Hepatozelluläres Karzinom (HCC), Subtyp: fibrolamelläres      8010/3, 8170/3
                       Leberzellkarzinom
  C17                  Karzinome des Dünndarmes                                      8010/3
  C17, C18.9, C20      Kolorektales Karzinom (CRC), auch Adenokarzinome              8010/3, 8140/3
  C17, C22.9           Leiomyosarkom (Magen, Abdomen, Peritoneum,                    8890/3
                       Retroperitoneum, Bindegewebe)
  C17, C22.9, C25      Cholangiokarzinom, extrahepatisch, auch Klatskin-Tumoren      8160/3
  C22.1, C22.9         Cholangiokarzinom (CCC), intrahepatisch                       8010/3, 8160/3
  C22.3                Angiosarkom der Leber                                         9120/3
  C23                  Gallenblasenkarzinom                                          8010/3
  C25                  Adenokarzinom des Pankreas                                    8140/3
  C45, C45.0, C45.9    Mesotheliom (Pleura, Peritoneum)                              9050-55/0-3
  C73                  Schilddrüsenkarzinom (papilläres, follikuläres, medulläres,   8010/3, 8021/3, 8050/3,
                       anaplastisches)                                               8260/3, 8330/3, 8510/3
  C74                  Nebennierenkarzinom                                           8010/3
  C78.7                [Lokalisation, z.B. Leber]metastase nach [diverse             8000/6, 8010/6
                       Primärkarzinome]
  C80                  Lebermetastase bei unbekanntem Primärtumor (CUP-              8000/6
                       Syndrom)
  D13.4, D13.6, D35.0, Adenom (einschl. Zystadenom) des/der Leber, Pankreas,         8140/0, 8440/0
  D37.2, D44.1         Dünndarm, Dickdarm, Nebenniere
  D13.4, D18.0         Hämangiom der Leber                                           9120/0
  D30.0, D35.0, D41.0 Phäochromozytom                                                8700/0
  E66                  Adipositas per magna (BMI >40)
  K51                  Colitis ulcerosa
  K57.32, K57.33       Divertikulitis

      Table 1: Final catalogue of diagnoses (excerpt) and their classification in ICD-10 and ICD-O-3
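The ICD-O-3 M column of Table 1 lists morphology codes of the form histology/behaviour, for example 8070/3. As a reading aid, the following minimal sketch splits such a table cell into its codes and attaches the behaviour label of the general ICD-O-3 coding scheme. The function names are hypothetical and not part of CRIP.CodEx. Range notations such as 9050-55/0-3 are deliberately left unparsed instead of being expanded by guesswork.

```python
import re

# Fifth-digit behaviour codes of ICD-O-3 morphology.
BEHAVIOUR = {
    "0": "benign",
    "1": "uncertain whether benign or malignant",
    "2": "carcinoma in situ",
    "3": "malignant, primary site",
    "6": "malignant, metastatic site",
    "9": "malignant, uncertain whether primary or metastatic site",
}


def parse_morphology(code):
    """Parse 'hhhh/b' into (histology, behaviour, label); None if malformed."""
    match = re.fullmatch(r"(\d{4})/(\d)", code.strip())
    if match is None:
        return None
    histology, behaviour = match.groups()
    return histology, behaviour, BEHAVIOUR.get(behaviour, "unknown")


def parse_cell(cell):
    """Split a comma-separated cell into parsed codes and unparsed leftovers."""
    parsed, leftover = [], []
    for part in (p.strip() for p in cell.split(",")):
        if not part:
            continue
        result = parse_morphology(part)
        if result is None:
            leftover.append(part)
        else:
            parsed.append(result)
    return parsed, leftover


print(parse_cell("8010/3, 8170/3"))
print(parse_cell("9050-55/0-3"))
```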
3     Results

3.1    First run
The reorganized catalogue of diagnoses contained 64 categories and has been automatically matched to classifications by CRIP.CodEx. For each of the categories CRIP.CodEx assigned matching classifications in ICD-10 or ICD-O-3. Altogether there have been 237 correct code assignments (true positives), while we identified 6 wrong assignments (false positives) and 12 missing codes (false negatives).
   Because each individual code assignment by CRIP.CodEx is traceable, it was possible to identify the causes for all false positives and negatives. All but two causes could be resolved by the outlined optimization in the CRIP.CodEx configuration and the read-in coding guide respectively.

3.2    Second run
The configuration of CRIP.CodEx as well as the original reorganized biobank specific catalogue of diagnoses has been optimized and amended as outlined above. Then again the local catalogue has been automatically matched to ICD classifications by CRIP.CodEx. The final catalogue of diagnoses contains 73 categories; for each of them the software assigned matching classifications in ICD-10 or ICD-O-3 in the final second run. Altogether there have been 442 correct code assignments (true positives) by CRIP.CodEx, while we identified zero wrong assignments (false positives) and 14 missing codes (false negatives).

4     Conclusion & Outlook
   Since diagnostic information contained in medical free text is extracted and codified by the automated CRIP.CodEx software, we also showed that the information contained in the nomenclature of the individual biobank specific catalogue of diagnoses is sufficient for a mapping towards ICD-10 and basically also ICD-O catalogues. As a remaining issue however, for categories such as “Carcinoma of the gallbladder”, which summarize a wide range of different morphologies, ICD-O code extraction from nomenclature is too general, and amending the nomenclature with a listing of these morphologies is also unsatisfactory.
   For the implementation of the restructured and amended biobank-specific catalogue in the Biobanks Database or the HTCR Web Application however, the categories have now been structured as a table by key features that can be maintained:

        “Primary tumor vs. secondary tumor vs. no tumor”
        “affected Organ”
        “originating organ in case of secondary tumor”
        “included morphologies”

   So as a next step, by tapping complementary data sources, mainly extracting diagnostic information from pathology reports by CRIP.CodEx, in addition to coding of specialized local classification, we deliver an automated matching of two different classification systems.
   Even if all automated classifications have to be checked thoroughly according to an individual request from a research project before providing samples and data, this initial, very effective automated classification to a great extent facilitates case related database research as well as data export for display of the collection in biobank registries and even metabiobanks. Thereby we have not only enhanced the biobank’s availability for translational research but also proposed a general protocol for matching internal codes with international classifications and standards.

Acknowledgements

   The authors specially thank Dr. Jens Neumann from Institute of Pathology, Ludwig-Maximilians-University Munich and Dr. Gabriele Schubert-Fritschle of Tumor Registry Munich for discussing the results of CodEx code assignments and the amended categories and manual code assignments.

References

Ambert KH and Cohen AM (2009): A system for
  classifying disease comorbidity status from medical
  discharge summaries using automated hotspot and
  negated concept detection. Journal of the American
  Medical Informatics Association 16, 590-595.
Bosman FT, Carneiro F, Hruban RH, Theise ND
  (Editors): WHO classification of tumours of the
  digestive system – 4th edition, IARC: Lyon, 2010.
Gros O and Stede M (2013): Determining Negation
  Scope in German and English medical diagnoses, in
  Taboada M and Trnavac R (Eds.): Nonveridicality
  and Evaluation – Theoretical, Computational and
  Corpus Approaches, Studies in Pragmatics 11,
  BRILL. ISBN: 9789004258167.
Müller TH, Thasler R (2014): Separation of personal
  data in a biobank information system. Stud Health
  Technol Inform. 2014; 205:388-92.
Schröder C, Heidtke KR, Zacherl N, Zatloukal K, &
  Taupitz J (2011): Safeguarding donors’ personal
  rights and biobank autonomy in biobank networks:
  the CRIP privacy regime. Cell and tissue
  banking, 12(3), 233-240.
Thasler WE, Thasler RM, Schelcher C, Jauch KW
  (2013): Biobanking for research in surgery: are
  surgeons in charge for advancing translational
  research or mere assistants in biomaterial and data
  preservation? Langenbecks Arch Surg. 2013 Apr;
  398(4):487-99.
Weiler G, Schröder C, Schera F, Dobkowicz M, Kiefer
  S, Heidtke KR, Hänold S, Nwankwo I, Forgó N,
  Stanulla M, Eckert C, and Graf N (2014): p-BioSPRE:
  an information and communication
  technology framework for transnational biomaterial
  sharing and access. ecancer 2014, 8:401-419.
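
The counts reported in Sections 3.1 and 3.2 can be condensed into precision and recall. These measures are not stated in the text. The short calculation below derives them from the reported true positives, false positives and false negatives using the standard definitions; the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions based on true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts as reported for the first run (64 categories)
# and the second run (73 categories): (TP, FP, FN).
RUNS = {
    "first run": (237, 6, 12),
    "second run": (442, 0, 14),
}

for name, counts in RUNS.items():
    precision, recall, f1 = precision_recall_f1(*counts)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Under these definitions the first run reaches a precision of about 0.975 and a recall of about 0.952, the second run a precision of 1.000 and a recall of about 0.969. Since the two runs are based on different versions of the catalogue, the figures describe each run on its own and are not a like-for-like comparison.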