Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications

Oliver Gros
Fraunhofer Institute for Cell Therapy and Immunology, Branch Bioanalytics and Bioprocesses (IZI-BB)
Am Mühlenberg 13, 14476 Potsdam, Germany
oliver.gros@izi-bb.fraunhofer.de

Reinhard Thasler
Biobank under administration of HTCR, Department of General, Visceral, Vascular and Transplantation Surgery
Hospital of the University of Munich, 81377 Munich, Germany
reinhard.thasler@med.uni-muenchen.de

Abstract

Biobanks are key resources for biomedical research. To be accessible, e.g. via web-based query tools or trans-institutional metabiobanks, the stored human biospecimens have to be annotated with clinical data in a harmonized and structured form, e.g. ICD codes, whereas these data are currently available only from free text records.

The Biobank under Administration of the Human Tissue and Cell Research Foundation (HTCR) at the University of Munich Medical Centre routinely collects remnant tissues and blood samples from treatments of patients. For the diagnostic classification of the corresponding cases, a biobank-specific classification was developed, but it has not yet been matched to ICD codes.

Whereas this has so far been done manually, we now used the automated knowledge extraction software CRIP.CodEx, which needs neither a training set nor access to external resources, to recode the textual description of the specialized HTCR biobank classification with ICD.

We show that the information contained in the nomenclature of the individual biobank-specific catalogue of diagnoses is sufficient for a mapping to the ICD-10 as well as the ICD-O-3 catalogues, and we deliver an automated matching of two different classification systems using CRIP.CodEx.

1 Introduction

Biobanks represent key resources for translational research and personalized medicine (Thasler et al., 2013). The Biobank under Administration of the Human Tissue and Cell Research Foundation (HTCR) at the University of Munich Medical Centre routinely collects remnant tissues and blood samples from treatments of patients at the Clinic for General, Visceral, Vascular and Transplantation Surgery as well as the Department for Thoracic Surgery.

However, to be a valuable resource for translational biomedical research, these human biospecimens have to be annotated with clinical data, which are currently available only from free text records.

Access to these biospecimens and data, e.g. via web-based query tools like the trans-institutional metabiobank CRIP (Schröder et al., 2011) or p-BioSPRE (Weiler et al., 2014), demands that information from various sources be integrated into harmonized and structured data to enable stratified, parameterized queries (Ambert and Cohen, 2009).

So far, the analysis of free text sources and the structured data entry into the data-protection-compliant biobank information system (Müller and Thasler, 2014) have been done manually, and a biobank-specific classification was developed for the diagnostic classification of cases. This classification, however, has not yet been matched to ICD codes. Based on automated free text analysis, this coding can now be amended with international classifications, e.g. ICD-10 or ICD-O-3, to facilitate project queries.

2 Methods

The automated knowledge extraction software CRIP.CodEx was designed to identify and extract information in free text medical records and to assign corresponding codes (see Figure 1).

Figure 1: UI variants for CRIP.CodEx [1]

CRIP.CodEx is part of the CRIP-Toolbox [2] and will be published separately. It works automatically, fast and efficiently [3], identifies word relations and negation, and handles extended negation scopes (Gros and Stede, 2013), but does not need access to databases or other external resources. Specifically, it needs neither a training set of pre-annotated texts nor manually entered extraction rules. The extraction rules are self-generated from lists of codes and their descriptions, as contained for example in coding guides for ICD-10 or ICD-O-3.

[1] Online demonstrator available at https://preview-crip.fraunhofer.de/intern/codex-demo/
[2] www.crip.fraunhofer.de/en/toolbox
[3] Time per text less than one up to a few seconds (103 words)

We used CRIP.CodEx slightly off its designed-for purpose, which is the analysis of free text medical reports. Instead, we used it to recode the textual description of the specialized HTCR biobank classification with ICD (see Figure 2). The initially existing catalogue of diagnoses was based on the actual collection of cases in the biobank, reflecting indications and surgical treatments, and was therefore primarily aligned along a list of organs affected by the surgical treatment. To build a catalogue of diagnoses suitable for biospecimen research, this catalogue first had to be reorganized towards distinct pathological findings across affected organs.

Figure 2: Knowledge Extraction with CRIP.CodEx – text example

The resulting categories were therefore composed of two separate elements, a diagnosis from the pathological report and the affected organ. These categories, 65 in total, were then manually mapped to the ICD-10 catalogue, resulting in 82 codes.

2.1 Automated Matching

After ensuring that the final categories were sufficiently selective, we analyzed them in a first run using CRIP.CodEx for ICD-O-3 as well as ICD-10 classifications. The result was checked manually and discussed with a pathologist. The discussion of the first run showed that the problems identified related not only to CRIP.CodEx but also to the diagnostic catalogue, mainly to the categories' composition as separate elements of diagnosis and organs affected by the surgical treatment. Both sources could be tackled, and therefore both the software configuration and the categories were revised to reach improved results in the second run.

2.2 Optimization

The configuration of CRIP.CodEx was optimized in the following respects: adding variants of local wording and synonyms, e.g. 'Carzinom' instead of 'Karzinom'/'carcinoma', to the internal dictionary; increasing the number of allowed multiple matches when detecting combined and hyphenated words with synonyms; and restricting the ICD-10 neoplasm code assignment depending on the detected ICD-O-3 classification.

The categories of the biobank-specific classification (see Table 1) were amended mainly by integrating the pathological diagnosis and the primarily diseased organ into one descriptive text. Further amendments include specifying the affected organs and the information on primary and secondary tumors more consistently, adding tumor types in some categories while shortening the description of others, and dividing categories into subcategories. In addition, the ICD-O-3 code assignments were amended together with an expert from the regional tumor registry to follow the WHO "Blue Book" (Bosman et al., 2010) as the international standard.

After both optimizing the configuration of CRIP.CodEx and further amending the categories, we performed a second run to analyze the categories.

| ICD-10 | Diagnosis category | ICD-O-3 M |
|---|---|---|
| C15 | Plattenepithelkarzinom des Ösophagus | 8070/3 |
| C16 | Adenokarzinom des Magens | 8140/3 |
| C16, C17, C25, C74 | Sarkom (Bauchraum: Leber, Magen, Pankreas, Dünndarm, Dickdarm, Niere) | 8800/3 |
| C16, C22.9 | Hepatozelluläres Karzinom (HCC) | 8010/3, 8170/3 |
| C16.9, C16, C22.9 | Hepatozelluläres Karzinom (HCC), Subtyp: fibrolamelläres Leberzellkarzinom | 8010/3, 8170/3 |
| C17 | Karzinome des Dünndarmes | 8010/3 |
| C17, C18.9, C20 | Kolorektales Karzinom (CRC), auch Adenokarzinome | 8010/3, 8140/3 |
| C17, C22.9 | Leiomyosarkom (Magen, Abdomen, Peritoneum, Retroperitoneum, Bindegewebe) | 8890/3 |
| C17, C22.9, C25 | Cholangiokarzinom, extrahepatisch, auch Klatskin-Tumoren | 8160/3 |
| C22.1, C22.9 | Cholangiokarzinom (CCC), intrahepatisch | 8010/3, 8160/3 |
| C22.3 | Angiosarkom der Leber | 9120/3 |
| C23 | Gallenblasenkarzinom | 8010/3 |
| C25 | Adenokarzinom des Pankreas | 8140/3 |
| C45, C45.0, C45.9 | Mesotheliom (Pleura, Peritoneum) | 9050-55/0-3 |
| C73 | Schilddrüsenkarzinom (papilläres, follikuläres, medulläres, anaplastisches) | 8010/3, 8021/3, 8050/3, 8260/3, 8330/3, 8510/3 |
| C74 | Nebennierenkarzinom | 8010/3 |
| C78.7 | [Lokalisation, z.B. Leber]metastase nach [diverse Primärkarzinome] | 8000/6, 8010/6 |
| C80 | Lebermetastase bei unbekanntem Primärtumor (CUP-Syndrom) | 8000/6 |
| D13.4, D13.6, D35.0, D37.2, D44.1 | Adenom (einschl. Zystadenom) des/der Leber, Pankreas, Dünndarm, Dickdarm, Nebenniere | 8140/0, 8440/0 |
| D13.4, D18.0 | Hämangiom der Leber | 9120/0 |
| D30.0, D35.0, D41.0 | Phäochromozytom | 8700/0 |
| E66 | Adipositas per magna (BMI >40) | |
| K51 | Colitis ulcerosa | |
| K57.32, K57.33 | Divertikulitis | |

Table 1: Final catalogue of diagnoses (excerpt) and their classification in ICD-10 and ICD-O-3

3 Results

3.1 First run

The reorganized catalogue of diagnoses contained 64 categories and was automatically matched to classifications by CRIP.CodEx. For each of the categories, CRIP.CodEx assigned matching classifications in ICD-10 or ICD-O-3. Altogether, there were 237 correct code assignments (true positives), 6 wrong assignments (false positives) and 12 missing codes (false negatives).

Because each individual code assignment by CRIP.CodEx is traceable, it was possible to identify the causes of all false positives and false negatives. All but two causes could be resolved by the outlined optimization of the CRIP.CodEx configuration or of the coding guide that is read in.

3.2 Second run

The configuration of CRIP.CodEx as well as the original reorganized biobank-specific catalogue of diagnoses were optimized and amended as outlined above. The local catalogue was then again automatically matched to ICD classifications by CRIP.CodEx. The final catalogue of diagnoses contains 73 categories; for each of them the software assigned matching classifications in ICD-10 or ICD-O-3 in the final second run. Altogether, there were 442 correct code assignments (true positives) by CRIP.CodEx, zero wrong assignments (false positives) and 14 missing codes (false negatives).

4 Conclusion & Outlook

Beyond extracting and codifying diagnostic information contained in medical free text with the automated CRIP.CodEx software, we also showed that the information contained in the nomenclature of the individual biobank-specific catalogue of diagnoses is sufficient for a mapping to ICD-10 and, in principle, also to ICD-O catalogues. As a remaining issue, however, for categories such as "Carcinoma of the gallbladder", which summarize a wide range of different morphologies, ICD-O code extraction from the nomenclature is too general, and amending the nomenclature with a listing of these morphologies is also unsatisfactory.

For the implementation of the restructured and amended biobank-specific catalogue in the biobank's database or the HTCR web application, however, the categories have now been structured as a table of key features that can be maintained:

- "Primary tumor vs. secondary tumor vs. no tumor"
- "affected organ"
- "originating organ in case of secondary tumor"
- "included morphologies"

As a next step, by tapping complementary data sources, mainly by extracting diagnostic information from pathology reports with CRIP.CodEx in addition to coding the specialized local classification, we deliver an automated matching of two different classification systems.

Even if all automated classifications have to be checked thoroughly against the individual request of a research project before samples and data are provided, this initial, very effective automated classification greatly facilitates case-related database research as well as data export for display of the collection in biobank registries and even metabiobanks. We have thereby not only enhanced the biobank's availability for translational research but also proposed a general protocol for matching internal codes with international classifications and standards.

Acknowledgements

The authors specially thank Dr. Jens Neumann from the Institute of Pathology, Ludwig-Maximilians-University Munich, and Dr. Gabriele Schubert-Fritschle of the Tumor Registry Munich for discussing the results of the CodEx code assignments, the amended categories and the manual code assignments.

References

Ambert KH and Cohen AM (2009): A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection. Journal of the American Medical Informatics Association 16, 590-595.

Bosman FT, Carneiro F, Hruban RH, Theise ND (Editors) (2010): WHO classification of tumours of the digestive system – 4th edition. IARC: Lyon.

Gros O and Stede M (2013): Determining Negation Scope in German and English medical diagnoses, in Taboada M and Trnavac R (Eds.): Nonveridicality and Evaluation – Theoretical, Computational and Corpus Approaches, Studies in Pragmatics 11, BRILL. ISBN: 9789004258167.

Müller TH and Thasler R (2014): Separation of personal data in a biobank information system. Stud Health Technol Inform. 2014; 205:388-92.

Schröder C, Heidtke KR, Zacherl N, Zatloukal K, and Taupitz J (2011): Safeguarding donors' personal rights and biobank autonomy in biobank networks: the CRIP privacy regime. Cell and Tissue Banking, 12(3), 233-240.

Thasler WE, Thasler RM, Schelcher C, Jauch KW (2013): Biobanking for research in surgery: are surgeons in charge for advancing translational research or mere assistants in biomaterial and data preservation? Langenbecks Arch Surg. 2013 Apr; 398(4):487-99.

Weiler G, Schröder C, Schera F, Dobkowicz M, Kiefer S, Heidtke KR, Hänold S, Nwankwo I, Forgó N, Stanulla M, Eckert C, and Graf N (2014): p-BioSPRE – an information and communication technology framework for transnational biomaterial sharing and access. ecancer 2014, 8:401-419.