<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Oliver</forename><surname>Gros</surname></persName>
							<email>oliver.gros@izi-bb.fraunhofer.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Fraunhofer Institute for Cell Therapy and Immunology</orgName>
								<orgName type="department">Branch Bioanalytics and Bioprocesses (IZI-BB)</orgName>
								<address>
									<addrLine>Am Mühlenberg</addrLine>
									<postCode>14476</postCode>
									<settlement>Potsdam</settlement>
									<country>Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Reinhard</forename><surname>Thasler</surname></persName>
							<email>reinhard.thasler@med.uni-muenchen.de</email>
							<affiliation key="aff1">
								<orgName type="laboratory">Biobank under administration of HTCR</orgName>
								<orgName type="department">Department of General, Visceral, Vascular and Transplantation Surgery</orgName>
								<orgName type="institution">Hospital of the University of Munich</orgName>
								<address>
									<postCode>81377</postCode>
									<settlement>Munich</settlement>
									<country>Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Diagnostic Free Text Analysis in Biobanks with CRIP.CodEx: Automated Matching of Classifications</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">AC1DEFF218EF79636C78151BB88B9113</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Biobanks represent key resources for biomedical research. To be accessible, e.g. via web-based query tools or transinstitutional metabiobanks, the stored human biospecimens have to be annotated with clinical data in harmonized and structured form, e.g. ICD codes, yet these data are currently available only from free text records. The Biobank under Administration of the Human Tissue and Cell Research Foundation (HTCR) at the University of Munich Medical Centre routinely collects remnant tissues and blood samples from treatments of patients. For diagnostic classification of the corresponding cases, a biobank-specific classification was developed, but it had not yet been matched to ICD codes. Whereas such coding has so far been done manually, we now used the automated knowledge extraction software CRIP.CodEx, which needs neither a training set nor access to external resources, to recodify the textual description of the specialized HTCR biobank classification with ICD.</p><p>We show that the information contained in the nomenclature of the individual biobank-specific catalogue of diagnoses is sufficient for a mapping to the ICD-10 as well as the ICD-O-3 catalogues, and we deliver an automated matching of two different classification systems using CRIP.CodEx.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Biobanks represent key resources for translational research and personalized medicine <ref type="bibr" target="#b5">(Thasler et al., 2013)</ref>. The Biobank under Administration of the Human Tissue and Cell Research Foundation (HTCR) at the University of Munich Medical Centre routinely collects remnant tissues and blood samples from treatments of patients at the Clinic for General, Visceral, Vascular and Transplantation Surgery as well as the Department for Thoracic Surgery.</p><p>However, to be a valuable resource for translational biomedical research, these human biospecimens have to be annotated with clinical data, which are currently available only from free text records.</p><p>Access to these biospecimens and data, e.g. via web-based query tools such as the transinstitutional metabiobank CRIP <ref type="bibr" target="#b4">(Schröder et al., 2011)</ref> or p-BioSPRE <ref type="bibr" target="#b6">(Weiler et al., 2014)</ref>, demands that information from various sources be integrated into harmonized and structured data to enable stratified, parameterized queries <ref type="bibr" target="#b0">(Ambert and Cohen, 2009)</ref>.</p><p>So far, the analysis of free text sources and the structured data entry into the data-protection-compliant biobank information system <ref type="bibr" target="#b3">(Müller and Thasler, 2014)</ref> have been done manually, and a biobank-specific classification was developed for the diagnostic classification of cases. This classification, however, had not yet been matched to ICD codes. Based on automated free text analysis, this coding can now be amended with international classifications, e.g. ICD-10 or ICD-O-3, to facilitate project queries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methods</head><p>The automated knowledge extraction software CRIP.CodEx was designed to identify and extract information in free text medical records and assign corresponding codes (see Figure <ref type="figure" target="#fig_0">1</ref>). We used CRIP.CodEx slightly outside its designed-for purpose, which is the analysis of free text medical reports. Instead, we used CRIP.CodEx to recodify the textual description of the specialized HTCR biobank classification with ICD (see Figure <ref type="figure" target="#fig_1">2</ref>). The initially existing catalogue of diagnoses was based on the actual collection of cases in the biobank, reflecting indications and surgical treatments, and was therefore primarily aligned along a list of organs affected by the surgical treatment. To build a catalogue of diagnoses suitable for biospecimen research, this catalogue first had to be reorganized towards distinct pathological findings across affected organs. The resulting categories were therefore composed of two separate elements: a diagnosis from the pathology report and the affected organ. These categories, 65 in total, were then manually mapped to the ICD-10 catalogue, resulting in 82 codes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Automated Matching</head><p>After ensuring that the final categories were sufficiently selective, we analyzed them in a first run using CRIP.CodEx for ICD-O-3 as well as ICD-10 classifications. The result was checked manually and discussed with a pathologist.</p><p>However, the discussion of the first run showed that the problems identified were attributable not only to CRIP.CodEx but also to the diagnostic catalogue, mainly to the categories' composition as separate elements of diagnosis and organs affected by the surgical treatment. Both sources could be tackled, and therefore both the software configuration and the categories were reworked to reach improved results in the second run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Optimization</head><p>The configuration of CRIP.CodEx was optimized in the following respects: adding variants of local wording and synonyms, e.g. 'Carzinom' instead of 'Karzinom'/'carcinoma', to the internal dictionary; increasing the number of allowed multiple matches when detecting combined and hyphenated words with synonyms; and restricting ICD-10 neoplasm code assignment depending on the detected ICD-O-3 classification.</p><p>The categories of the biobank-specific classification (see Table <ref type="table" target="#tab_0">1</ref>) were amended mainly by integrating the pathological diagnosis and the primarily diseased organ into one descriptive text. Further amendments include specifying the affected organs and the information on primary and secondary tumors more consistently, adding tumor types in some categories while shortening the description of others, and dividing categories into subcategories. In addition, the ICD-O-3 code assignments were amended together with an expert from the regional tumor registry, following the WHO "Blue Book" <ref type="bibr" target="#b1">(Bosman et al., 2010)</ref> as the international standard.</p><p>After both optimizing the configuration of CRIP.CodEx and further amending the categories, we performed a second run to analyze the categories.</p></div>
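<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dictionary step described above, adding local spelling variants so that they match the catalogue wording, can be illustrated with a minimal Python sketch. It is not the CRIP.CodEx implementation, whose internals are not described here. The names VARIANTS and normalize are hypothetical, and only the 'Carzinom'/'Karzinom' pair is taken from the text.</p>
<p>
```python
# Hypothetical variant dictionary: maps a local spelling to the form used in
# the coding catalogue. Only the Carzinom/Karzinom pair comes from the paper.
VARIANTS = {"carzinom": "karzinom"}


def normalize(text: str) -> str:
    """Lower-case the text and replace known spelling variants, including
    variants that occur inside compound or hyphenated words."""
    result = text.lower()
    for variant, canonical in VARIANTS.items():
        result = result.replace(variant, canonical)
    return result


print(normalize("Gallenblasen-Carzinom"))
```
</p>
<p>Because the replacement works on substrings, a variant is also caught inside a compound or hyphenated word, which is the situation the second optimization item addresses. A real dictionary would need many more entries and probably word-boundary handling.</p></div>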
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">First run</head><p>The reorganized catalogue of diagnoses contained 64 categories and was automatically matched to classifications by CRIP.CodEx. For each of the categories, CRIP.CodEx assigned matching classifications in ICD-10 or ICD-O-3. In total, there were 237 correct code assignments (true positives), while we identified 6 wrong assignments (false positives) and 12 missing codes (false negatives).</p><p>Because each individual code assignment by CRIP.CodEx is traceable, it was possible to identify the causes of all false positives and false negatives. All but two causes could be resolved by the outlined optimization of the CRIP.CodEx configuration and of the coding guide that is read in, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Second run</head><p>The configuration of CRIP.CodEx as well as the original reorganized biobank-specific catalogue of diagnoses was optimized and amended as outlined above. The local catalogue was then again automatically matched to ICD classifications by CRIP.CodEx. The final catalogue of diagnoses contains 73 categories; for each of them, the software assigned matching classifications in ICD-10 or ICD-O-3 in the final second run. In total, there were 442 correct code assignments (true positives) by CRIP.CodEx, while we identified zero wrong assignments (false positives) and 14 missing codes (false negatives).</p></div>
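<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counts reported for the two runs can be condensed into precision and recall. These metrics are not stated in the text; the following Python sketch merely derives them from the reported true positives, false positives and false negatives using the standard definitions. The function name precision_recall is an illustrative choice.</p>
<p>
```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for counts of true positives, false
    positives and false negatives."""
    return tp / (tp + fp), tp / (tp + fn)


# Counts (tp, fp, fn) as reported in Sections 3.1 and 3.2.
RUNS = {"first run": (237, 6, 12), "second run": (442, 0, 14)}

for name, (tp, fp, fn) in RUNS.items():
    precision, recall = precision_recall(tp, fp, fn)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```
</p>
<p>Under these definitions the first run yields a precision of about 0.975 and a recall of about 0.952, and the second run a precision of 1.000 and a recall of about 0.969. The two runs are not strictly comparable, since the catalogue grew from 64 to 73 categories between them.</p></div>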
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion &amp; Outlook</head><p>Since diagnostic information contained in medical free text is extracted and codified by the automated CRIP.CodEx software, we also showed that the information contained in the nomenclature of the individual biobank-specific catalogue of diagnoses is sufficient for a mapping to the ICD-10 and, in essence, also the ICD-O catalogues. As a remaining issue, however, for categories such as "Carcinoma of the gallbladder", which summarize a wide range of different morphologies, ICD-O code extraction from the nomenclature is too general, and amending the nomenclature with a listing of these morphologies is also unsatisfactory.</p><p>For the implementation of the restructured and amended biobank-specific catalogue in the Biobanks Database or the HTCR Web Application, however, the categories have now been structured as a table by key features that can be maintained:</p><list><item>"Primary tumor vs. secondary tumor vs. no tumor"</item><item>"affected organ"</item><item>"originating organ in case of secondary tumor"</item><item>"included morphologies"</item></list><p>As a next step, by tapping complementary data sources, mainly diagnostic information extracted from pathology reports by CRIP.CodEx, in addition to the coding of the specialized local classification, we deliver an automated matching of two different classification systems.</p><p>Even if all automated classifications have to be checked thoroughly for each individual request from a research project before samples and data are provided, this initial, very effective automated classification greatly facilitates case-related database research as well as data export for display of the collection in biobank registries and even metabiobanks. We have thereby not only enhanced the biobank's availability for translational research but also proposed a general protocol for matching internal codes with international classifications and standards.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: UI variants for CRIP.CodEx<ref type="foot" target="#foot_0">1</ref>. CRIP.CodEx is part of the CRIP-Toolbox<ref type="foot" target="#foot_1">2</ref> and will be published separately. It works automatically, quickly and efficiently<ref type="foot" target="#foot_2">3</ref>, identifies word relations and negation, and handles extended negation scopes <ref type="bibr" target="#b2">(Gros and Stede, 2013)</ref>, but does not need access to databases or other external resources. Specifically, it needs neither a training set of pre-annotated texts nor manually entered extraction rules. The sources for the self-generated extraction rules are lists of codes and their</figDesc><graphic coords="2,92.05,343.90,186.70,157.95" type="bitmap" /></figure>
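<div xmlns="http://www.tei-c.org/ns/1.0"><p>The key features by which the Conclusion says the categories are now structured (tumor status, affected organ, originating organ in case of a secondary tumor, and included morphologies) could be represented as one record per category. The Python sketch below is only an illustration of such a table row. The class and field names are hypothetical and do not describe the actual schema of the Biobanks Database or the HTCR Web Application. The example values are taken from the C23 row of Table 1, and treating that category as a primary tumor is an assumption.</p>
<p>
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TumorStatus(Enum):
    PRIMARY = "primary tumor"
    SECONDARY = "secondary tumor"
    NONE = "no tumor"


@dataclass
class DiagnosisCategory:
    """One row of the key-feature table (all field names are hypothetical)."""
    label: str
    tumor_status: TumorStatus
    affected_organ: str
    originating_organ: Optional[str] = None  # only set for secondary tumors
    included_morphologies: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # An originating organ is only meaningful for secondary tumors.
        has_origin = self.originating_organ is not None
        if has_origin and self.tumor_status is not TumorStatus.SECONDARY:
            raise ValueError("originating_organ requires a secondary tumor")


# Example values from the C23 row of Table 1; primary status is assumed.
gallbladder = DiagnosisCategory(
    label="Gallenblasenkarzinom",
    tumor_status=TumorStatus.PRIMARY,
    affected_organ="gallbladder",
    included_morphologies=["8010/3"],
)
print(gallbladder.tumor_status.value, gallbladder.included_morphologies)
```
</p>
<p>Keeping the morphologies as a separate list, rather than spelling them out in the category name, reflects the remark in the Conclusion that listing morphologies in the nomenclature itself is unsatisfactory.</p></div>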
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Knowledge Extraction with CRIP.CodEx: text example</figDesc><graphic coords="2,336.00,286.80,181.20,99.85" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc></figDesc><table>
<row><cell>ICD-10</cell><cell>Diagnosis category</cell><cell>ICD-O-3 M</cell></row>
<row><cell>C15</cell><cell>Plattenepithelkarzinom des Ösophagus</cell><cell>8070/3</cell></row>
<row><cell>C16</cell><cell>Adenokarzinom des Magens</cell><cell>8140/3</cell></row>
<row><cell>C16, C17, C25, C74</cell><cell>Sarkom (Bauchraum: Leber, Magen, Pankreas, Dünndarm, Dickdarm, Niere)</cell><cell>8800/3</cell></row>
<row><cell>C16, C22.9</cell><cell>Hepatozelluläres Karzinom (HCC)</cell><cell>8010/3, 8170/3</cell></row>
<row><cell>C16.9, C16, C22.9</cell><cell>Hepatozelluläres Karzinom (HCC), Subtyp: fibrolamelläres Leberzellkarzinom</cell><cell>8010/3, 8170/3</cell></row>
<row><cell>C17</cell><cell>Karzinome des Dünndarmes</cell><cell>8010/3</cell></row>
<row><cell>C17, C18.9, C20</cell><cell>Kolorektales Karzinom (CRC), auch Adenokarzinome</cell><cell>8010/3, 8140/3</cell></row>
<row><cell>C17, C22.9</cell><cell>Leiomyosarkom (Magen, Abdomen, Peritoneum, Retroperitoneum, Bindegewebe)</cell><cell>8890/3</cell></row>
<row><cell>C17, C22.9, C25</cell><cell>Cholangiokarzinom, extrahepatisch, auch Klatskin-Tumoren</cell><cell>8160/3</cell></row>
<row><cell>C22.1, C22.9</cell><cell>Cholangiokarzinom (CCC), intrahepatisch</cell><cell>8010/3, 8160/3</cell></row>
<row><cell>C22.3</cell><cell>Angiosarkom der Leber</cell><cell>9120/3</cell></row>
<row><cell>C23</cell><cell>Gallenblasenkarzinom</cell><cell>8010/3</cell></row>
<row><cell>C25</cell><cell>Adenokarzinom des Pankreas</cell><cell>8140/3</cell></row>
<row><cell>C45, C45.0, C45.9</cell><cell>Mesotheliom (Pleura, Peritoneum)</cell><cell>9050-55/0-3</cell></row>
<row><cell>C73</cell><cell>Schilddrüsenkarzinom (papilläres, follikuläres, medulläres, anaplastisches)</cell><cell>8010/3, 8021/3, 8050/3, 8260/3, 8330/3, 8510/3</cell></row>
<row><cell>C74</cell><cell>Nebennierenkarzinom</cell><cell>8010/3</cell></row>
<row><cell>C78.7</cell><cell>[Lokalisation, z.B. Leber]metastase nach [diverse Primärkarzinome]</cell><cell>8000/6, 8010/6</cell></row>
<row><cell>C80</cell><cell>Lebermetastase bei unbekanntem Primärtumor (CUP-Syndrom)</cell><cell>8000/6</cell></row>
<row><cell>D13.4, D13.6, D35.0, D37.2, D44.1</cell><cell>Adenom (einschl. Zystadenom) des/der Leber, Pankreas, Dünndarm, Dickdarm, Nebenniere</cell><cell>8140/0, 8440/0</cell></row>
<row><cell>D13.4, D18.0</cell><cell>Hämangiom der Leber</cell><cell>9120/0</cell></row>
<row><cell>D30.0, D35.0, D41.0</cell><cell>Phäochromozytom</cell><cell>8700/0</cell></row>
<row><cell>E66</cell><cell>Adipositas per magna (BMI &gt;40)</cell><cell></cell></row>
<row><cell>K51</cell><cell>Colitis ulcerosa</cell><cell></cell></row>
<row><cell>K57.32, K57.33</cell><cell>Divertikulitis</cell><cell></cell></row>
</table><note>Final catalogue of diagnoses (excerpt) and their classification in ICD-10 and ICD-O-3</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Online demonstrator available at https://preview-crip.fraunhofer.de/intern/codex-demo/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">www.crip.fraunhofer.de/en/toolbox</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Time per text: from less than one second up to a few seconds (10³ words)</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>The authors especially thank Dr. Jens Neumann from the Institute of Pathology, Ludwig-Maximilians-University Munich, and Dr. Gabriele Schubert-Fritschle of the Tumor Registry Munich for discussing the results of the CodEx code assignments, the amended categories and the manual code assignments.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Ambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Cohen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="590" to="595" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">T</forename><surname>Bosman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Carneiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">H</forename><surname>Hruban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Theise</surname></persName>
		</author>
		<title level="m">WHO Classification of Tumours of the Digestive System, 4th edition</title>
		<imprint>
			<pubPlace>Lyon</pubPlace>
			<publisher>IARC</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Determining Negation Scope in German and English medical diagnoses</title>
		<author>
			<persName><forename type="first">O</forename><surname>Gros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stede</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Nonveridicality and Evaluation: Theoretical, Computational and Corpus Approaches</title>
		<title level="s">Studies in Pragmatics</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Taboada</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Trnavac</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">11</biblScope>
		</imprint>
		<idno type="ISBN">9789004258167</idno>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Separation of personal data in a biobank information system</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Thasler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Stud Health Technol Inform</title>
		<imprint>
			<biblScope unit="volume">014</biblScope>
			<biblScope unit="page" from="388" to="392" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Safeguarding donors&apos; personal rights and biobank autonomy in biobank networks: the CRIP privacy regime</title>
		<author>
			<persName><forename type="first">C</forename><surname>Schröder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Heidtke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zacherl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zatloukal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Taupitz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cell and tissue banking</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="233" to="240" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Biobanking for research in surgery: are surgeons in charge for advancing translational research or mere assistants in biomaterial and data preservation?</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">E</forename><surname>Thasler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Thasler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schelcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Jauch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Langenbecks Arch Surg</title>
		<imprint>
			<biblScope unit="volume">398</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="487" to="499" />
			<date type="published" when="2013-04">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">p-BioSPRE - an information and communication technology framework for transnational biomaterial sharing and access</title>
		<author>
			<persName><forename type="first">G</forename><surname>Weiler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schröder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Schera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dobkowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kiefer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Heidtke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hänold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Nwankwo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Forgó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stanulla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Eckert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Graf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ecancer</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="401" to="419" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
