=Paper=
{{Paper
|id=Vol-1650/smbm16Gros
|storemode=property
|title=Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications
|pdfUrl=https://ceur-ws.org/Vol-1650/smbm16Gros.pdf
|volume=Vol-1650
|authors=Oliver Gros,Reinhard Thasler
|dblpUrl=https://dblp.org/rec/conf/smbm/GrosT16
}}
==Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications==
Oliver Gros
Fraunhofer Institute for Cell Therapy and Immunology, Branch Bioanalytics and Bioprocesses (IZI-BB)
Am Mühlenberg 13, 14476 Potsdam, Germany
oliver.gros@izi-bb.fraunhofer.de

Reinhard Thasler
Biobank under administration of HTCR, Department of General, Visceral, Vascular and Transplantation Surgery
Hospital of the University of Munich, 81377 Munich, Germany
reinhard.thasler@med.uni-muenchen.de

Abstract
Biobanks represent key resources for biomedical research. To be accessible, e.g. via web-based query tools or trans-institutional metabiobanks, the stored human biospecimens have to be annotated with clinical data transformed into a harmonized and structured form, e.g. ICD codes, while these data are currently only available from free text records.

The Biobank under Administration of Human Tissue and Cell Research Foundation (HTCR) at the University of Munich Medical Centre routinely collects remnant tissues and blood samples from treatments of patients. For diagnostic classification of the corresponding cases, a biobank-specific classification was developed, but not yet matched to ICD codes.

While this work has so far been done manually, we now used the automated knowledge extraction software CRIP.CodEx, which needs neither a training set nor access to external resources, to recodify the textual description of the specialized HTCR biobank classification with ICD. We show that the information contained in the nomenclature of the individual biobank-specific catalogue of diagnoses is sufficient for a mapping towards the ICD-10 as well as the ICD-O-3 catalogues, and deliver an automated matching of two different classification systems using CRIP.CodEx.

1 Introduction
Biobanks represent key resources for translational research and personalized medicine (Thasler et al., 2013). The Biobank under Administration of Human Tissue and Cell Research Foundation (HTCR) at the University of Munich Medical Centre routinely collects remnant tissues and blood samples from treatments of patients at the Clinic for General, Visceral, Vascular and Transplantation Surgery as well as the Department for Thoracic Surgery.

However, to be a valuable resource for translational biomedical research, these human biospecimens have to be annotated with clinical data, which are currently only available from free text records.

Access to these biospecimens and data, e.g. via web-based query tools like the trans-institutional metabiobank CRIP (Schröder et al., 2011) or p-BioSPRE (Weiler et al., 2014), demands that information from various sources be integrated into harmonized and structured data to enable stratified, parameterized queries (Ambert and Cohen, 2009).

So far, the analysis of free text sources and the structured data entry into the data protection compliant biobank information system (Müller and Thasler, 2014) have been done manually, and for the diagnostic classification of cases a biobank-specific classification was developed. This classification, however, has not been matched to ICD codes so far. Based on automated free text analysis, this coding can now be amended with international classifications, e.g. ICD-10 or ICD-O-3, to facilitate project queries.

2 Methods
The automated knowledge extraction software CRIP.CodEx was designed to identify and extract information in free text medical records and to assign corresponding codes (see Figure 1).

Figure 1: UI variants for CRIP.CodEx [1]

CRIP.CodEx is part of the CRIP-Toolbox [2] and will be published separately. It works automatically, fast and efficiently [3], identifies word relations and negation, and handles extended negation scopes (Gros and Stede, 2013), but does not need access to databases or other external resources. Specifically, it needs neither a training set of pre-annotated texts nor manually entered extraction rules. The sources for the self-generated extraction rules are lists of codes and their descriptions, contained for example in coding guides for ICD-10 or ICD-O-3.

[1] Online demonstrator available at https://preview-crip.fraunhofer.de/intern/codex-demo/

[2] www.crip.fraunhofer.de/en/toolbox

[3] Time per text: less than one up to a few seconds (10^3 words)

We used CRIP.CodEx slightly off its designed-for purpose, which is the analysis of free text medical reports. Instead, we used CRIP.CodEx to recodify the textual description of the specialized HTCR biobank classification with ICD (see Figure 2). The initially existing catalogue of diagnoses was based on the actual collection of cases in the biobank, reflecting indications and surgical treatments, and was therefore primarily aligned along a list of organs affected by the surgical treatment. To build a catalogue of diagnoses suitable for biospecimen research, this catalogue first had to be reorganized towards distinct pathological findings across affected organs.

Figure 2: Knowledge Extraction with CRIP.CodEx – text example

The resulting categories were therefore composed of two separate elements, a diagnosis from the pathological report and the affected organ. These categories, 65 in total, were then manually referred to the ICD-10 catalogue, resulting in 82 codes.

2.1 Automated Matching
After ensuring that the final categories were sufficiently selective, we analyzed them in a first run using CRIP.CodEx for ICD-O-3 as well as ICD-10 classifications. The result was checked manually and discussed with a pathologist. However, the discussion of the first run showed that the problems identified related not only to CRIP.CodEx but also to the diagnostic catalogue, mainly to the categories’ composition as separate elements of diagnosis and organs affected by the surgical treatment. Both sources could be tackled, and therefore both the software configuration and the categories were worked on to reach improved results in the second run.

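The idea of deriving extraction rules from the code lists themselves, rather than from a training set, can be illustrated with a minimal sketch. It is not the CRIP.CodEx implementation, which is not described in this paper. The toy guide `TOY_GUIDE`, the cue list `NEGATION_CUES` and the functions `build_rules` and `assign_codes` are hypothetical names, and both the prefix matching and the adjacent-token negation check are simplifying assumptions.

```python
import re

# Toy stand-in for a coding guide (code -> key term). Simplified for
# illustration; a real guide holds full code descriptions.
TOY_GUIDE = {
    "ICD-10": {"C15": "Ösophagus", "C16": "Magen", "C25": "Pankreas"},
    "ICD-O-3": {"8070/3": "Plattenepithelkarzinom", "8140/3": "Adenokarzinom"},
}

# Assumed negation cues; only a cue directly before a term is considered.
NEGATION_CUES = {"kein", "keine", "ohne", "nicht"}


def build_rules(guide):
    """Derive one lowercase prefix rule per code from the guide itself."""
    return {
        system: {code: term.lower() for code, term in codes.items()}
        for system, codes in guide.items()
    }


def assign_codes(text, rules):
    """Assign every code whose term prefixes a non-negated token of the text."""
    tokens = re.findall(r"\w+", text.lower())
    hits = {system: [] for system in rules}
    for system, codes in rules.items():
        for code, term in codes.items():
            for i, token in enumerate(tokens):
                negated = i > 0 and tokens[i - 1] in NEGATION_CUES
                if token.startswith(term) and not negated:
                    hits[system].append(code)
                    break
    return hits


RULES = build_rules(TOY_GUIDE)
print(assign_codes("Adenokarzinom des Magens", RULES))
# {'ICD-10': ['C16'], 'ICD-O-3': ['8140/3']}
```

The prefix match lets the genitive form "Magens" hit the term "Magen". The sketch ignores word relations and extended negation scopes, which the software is reported to handle.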
2.2 Optimization
The configuration of CRIP.CodEx was optimized in the following respects: adding variants of local wording and synonyms to the internal dictionary, e.g. ‘Carzinom’ instead of ‘Karzinom’/‘carcinoma’; increasing the number of allowed multiple matches when detecting combined and hyphenated words with synonyms; and restricting the ICD-10 neoplasm code assignment depending on the detected ICD-O-3 classification.

The categories of the biobank-specific classification (see Table 1) have been amended mainly by integrating the pathological diagnosis and the primarily diseased organ into one descriptive text. Further amendments include specifying the affected organs and the information on primary and secondary tumors more consistently, adding tumor types in some categories while shortening the description of others, and dividing categories into subcategories. In addition, the ICD-O-3 code assignments were amended together with an expert from the regional tumor registry towards the WHO “Blue Book” (Bosman et al., 2010) as the international standard.

After both optimizing the configuration of CRIP.CodEx and further amending the categories, we performed a second run to analyze the categories.

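Two of the configuration changes named above, the spelling variants and the restriction of ICD-10 neoplasm codes by the detected ICD-O-3 classification, can be sketched as follows. This is an illustration under stated assumptions, not the actual CRIP.CodEx configuration. The names `VARIANTS`, `ICD10_RANGES`, `normalise` and `restrict_icd10` are hypothetical, and the strict mapping from the ICD-O-3 behaviour digit to ICD-10 neoplasm ranges is an assumed rule that the software need not follow.

```python
# Local spelling variant mapped to the dictionary form (example from the text).
VARIANTS = {"carzinom": "karzinom"}

# Assumed restriction table: ICD-O-3 behaviour digit -> admissible ICD-10
# three-character ranges (benign, uncertain, in situ, primary, secondary).
ICD10_RANGES = {
    "0": [("D10", "D36")],
    "1": [("D37", "D48")],
    "2": [("D00", "D09")],
    "3": [("C00", "C76"), ("C80", "C97")],
    "6": [("C77", "C79")],
}


def normalise(text):
    """Lowercase the text and replace known variants, also inside compounds."""
    text = text.lower()
    for variant, canonical in VARIANTS.items():
        text = text.replace(variant, canonical)
    return text


def restrict_icd10(candidates, icdo3_code):
    """Keep only ICD-10 candidates compatible with the ICD-O-3 behaviour digit."""
    behaviour = icdo3_code.rsplit("/", 1)[-1]
    ranges = ICD10_RANGES.get(behaviour)
    if ranges is None:  # unknown behaviour digit: leave candidates untouched
        return list(candidates)
    return [
        code for code in candidates
        if any(low <= code[:3] <= high for low, high in ranges)
    ]


print(normalise("Adenocarzinom des Pankreas"))
print(restrict_icd10(["C22.9", "D13.4", "C78.7"], "8170/3"))
```

With a malignant primary morphology such as 8170/3, only C22.9 remains among the three candidates, while the benign and secondary codes are dropped.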
{| class="wikitable"
|+ Table 1: Final catalogue of diagnoses (excerpt) and their classification in ICD-10 and ICD-O-3
! ICD-10 !! Diagnosis category !! ICD-O-3 M (morphology)
|-
| C15 || Plattenepithelkarzinom des Ösophagus || 8070/3
|-
| C16 || Adenokarzinom des Magens || 8140/3
|-
| C16, C17, C25, C74 || Sarkom (Bauchraum: Leber, Magen, Pankreas, Dünndarm, Dickdarm, Niere) || 8800/3
|-
| C16, C22.9 || Hepatozelluläres Karzinom (HCC) || 8010/3, 8170/3
|-
| C16.9, C16, C22.9 || Hepatozelluläres Karzinom (HCC), Subtyp: fibrolamelläres Leberzellkarzinom || 8010/3, 8170/3
|-
| C17 || Karzinome des Dünndarmes || 8010/3
|-
| C17, C18.9, C20 || Kolorektales Karzinom (CRC), auch Adenokarzinome || 8010/3, 8140/3
|-
| C17, C22.9 || Leiomyosarkom (Magen, Abdomen, Peritoneum, Retroperitoneum, Bindegewebe) || 8890/3
|-
| C17, C22.9, C25 || Cholangiokarzinom, extrahepatisch, auch Klatskin-Tumoren || 8160/3
|-
| C22.1, C22.9 || Cholangiokarzinom (CCC), intrahepatisch || 8010/3, 8160/3
|-
| C22.3 || Angiosarkom der Leber || 9120/3
|-
| C23 || Gallenblasenkarzinom || 8010/3
|-
| C25 || Adenokarzinom des Pankreas || 8140/3
|-
| C45, C45.0, C45.9 || Mesotheliom (Pleura, Peritoneum) || 9050-55/0-3
|-
| C73 || Schilddrüsenkarzinom (papilläres, follikuläres, medulläres, anaplastisches) || 8010/3, 8021/3, 8050/3, 8260/3, 8330/3, 8510/3
|-
| C74 || Nebennierenkarzinom || 8010/3
|-
| C78.7 || [Lokalisation, z.B. Leber]metastase nach [diverse Primärkarzinome] || 8000/6, 8010/6
|-
| C80 || Lebermetastase bei unbekanntem Primärtumor (CUP-Syndrom) || 8000/6
|-
| D13.4, D13.6, D35.0, D37.2, D44.1 || Adenom (einschl. Zystadenom) des/der Leber, Pankreas, Dünndarm, Dickdarm, Nebenniere || 8140/0, 8440/0
|-
| D13.4, D18.0 || Hämangiom der Leber || 9120/0
|-
| D30.0, D35.0, D41.0 || Phäochromozytom || 8700/0
|-
| E66 || Adipositas per magna (BMI >40) ||
|-
| K51 || Colitis ulcerosa ||
|-
| K57.32, K57.33 || Divertikulitis ||
|}

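A row of Table 1 can be held in a small record so that manually assigned codes can be compared with automatically assigned ones. The following is a minimal sketch under assumptions: `CatalogueEntry`, `parse_codes` and `compare` are hypothetical names, and the `assigned` set is invented purely for illustration. It is not output of CRIP.CodEx.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """One catalogue category with its manually assigned codes."""
    diagnosis: str
    icd10: set = field(default_factory=set)
    icdo3: set = field(default_factory=set)


def parse_codes(cell):
    """Split a comma-separated table cell into a set of codes."""
    return {code.strip() for code in cell.split(",") if code.strip()}


def compare(expected, assigned):
    """Count agreements and disagreements between two code sets."""
    return {
        "true_positives": len(expected & assigned),
        "false_positives": len(assigned - expected),
        "false_negatives": len(expected - assigned),
    }


entry = CatalogueEntry(
    diagnosis="Kolorektales Karzinom (CRC), auch Adenokarzinome",
    icd10=parse_codes("C17, C18.9, C20"),
    icdo3=parse_codes("8010/3, 8140/3"),
)

# Hypothetical automated result, invented for this example only.
assigned = {"8140/3", "8000/3"}
counts = compare(entry.icdo3, assigned)
print(counts)
# {'true_positives': 1, 'false_positives': 1, 'false_negatives': 1}
```

Summing such per-category counts over the whole catalogue gives totals of correct, wrong and missing code assignments.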
3 Results
3.1 First run
The reorganized catalogue of diagnoses contained 64 categories and was automatically matched to classifications by CRIP.CodEx. For each of the categories, CRIP.CodEx assigned matching classifications in ICD-10 or ICD-O-3. Altogether there were 237 correct code assignments (true positives), while we identified 6 wrong assignments (false positives) and 12 missing codes (false negatives).

Because each individual code assignment by CRIP.CodEx is traceable, it was possible to identify the causes of all false positives and false negatives. All but two causes could be resolved by the outlined optimization of the CRIP.CodEx configuration and of the coding guide that is read in, respectively.

3.2 Second run
The configuration of CRIP.CodEx as well as the original reorganized biobank-specific catalogue of diagnoses were optimized and amended as outlined above. Then the local catalogue was again automatically matched to ICD classifications by CRIP.CodEx. The final catalogue of diagnoses contains 73 categories; for each of them the software assigned matching classifications in ICD-10 or ICD-O-3 in the final second run. Altogether there were 442 correct code assignments (true positives) by CRIP.CodEx, while we identified zero wrong assignments (false positives) and 14 missing codes (false negatives).

4 Conclusion & Outlook
CRIP.CodEx extracts and codifies diagnostic information contained in medical free text. We showed that, in addition, the information contained in the nomenclature of the individual biobank-specific catalogue of diagnoses is sufficient for a mapping towards the ICD-10 and, basically, also the ICD-O catalogues. As a remaining issue, however, for categories such as “Carcinoma of the gallbladder”, which summarize a wide range of different morphologies, ICD-O code extraction from the nomenclature is too general, and amending the nomenclature with a listing of these morphologies is also unsatisfactory.

For the implementation of the restructured and amended biobank-specific catalogue in the Biobanks Database or the HTCR Web Application, however, the categories have now been structured as a table by key features that can be maintained:
* “Primary tumor vs. secondary tumor vs. no tumor”
* “affected organ”
* “originating organ in case of secondary tumor”
* “included morphologies”

As a next step, by tapping complementary data sources, mainly by extracting diagnostic information from pathology reports with CRIP.CodEx in addition to coding the specialized local classification, we deliver an automated matching of two different classification systems. Even if all automated classifications have to be checked thoroughly against the individual request from a research project before samples and data are provided, this initial, very effective automated classification greatly facilitates case-related database research as well as data export for display of the collection in biobank registries and even metabiobanks. We have thereby not only enhanced the biobank’s availability for translational research but also proposed a general protocol for matching internal codes with international classifications and standards.

Acknowledgements
The authors especially thank Dr. Jens Neumann from the Institute of Pathology, Ludwig-Maximilians-University Munich, and Dr. Gabriele Schubert-Fritschle of the Tumor Registry Munich for discussing the results of the CodEx code assignments, the amended categories and the manual code assignments.

References
Ambert KH and Cohen AM (2009): A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection. Journal of the American Medical Informatics Association 16, 590-595.

Bosman FT, Carneiro F, Hruban RH, Theise ND (Editors): WHO classification of tumours of the digestive system – 4th edition, IARC: Lyon, 2010.

Gros O and Stede M (2013): Determining Negation Scope in German and English medical diagnoses. In Taboada M and Trnavac R (Eds.): Nonveridicality and Evaluation – Theoretical, Computational and Corpus Approaches, Studies in Pragmatics 11, BRILL. ISBN: 9789004258167.

Müller TH and Thasler R (2014): Separation of personal data in a biobank information system. Stud Health Technol Inform. 2014; 205:388-92.

Schröder C, Heidtke KR, Zacherl N, Zatloukal K, and Taupitz J (2011): Safeguarding donors’ personal rights and biobank autonomy in biobank networks: the CRIP privacy regime. Cell and Tissue Banking, 12(3), 233-240.

Thasler WE, Thasler RM, Schelcher C, and Jauch KW (2013): Biobanking for research in surgery: are surgeons in charge for advancing translational research or mere assistants in biomaterial and data preservation? Langenbecks Arch Surg. 2013 Apr; 398(4):487-99.

Weiler G, Schröder C, Schera F, Dobkowicz M, Kiefer S, Heidtke KR, Hänold S, Nwankwo I, Forgó N, Stanulla M, Eckert C, and Graf N (2014): p-BioSPRE – an information and communication technology framework for transnational biomaterial sharing and access. ecancer 2014, 8:401-419.