<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">INDEX: the Intelligent Data Steward Toolbox Utilizing Large Language Model Embeddings for Automated Data Harmonization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tim</forename><surname>Adams</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Algorithms and Scientific Computing</orgName>
								<address>
									<addrLine>Schloss Birlinghoven, Sankt Augustin</addrLine>
									<postCode>53757</postCode>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mohamed</forename><surname>Aborageh</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Algorithms and Scientific Computing</orgName>
								<address>
									<addrLine>Schloss Birlinghoven, Sankt Augustin</addrLine>
									<postCode>53757</postCode>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yasamin</forename><surname>Salimi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Algorithms and Scientific Computing</orgName>
								<address>
									<addrLine>Schloss Birlinghoven, Sankt Augustin</addrLine>
									<postCode>53757</postCode>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Bonn-Aachen International Center for IT</orgName>
								<orgName type="institution">Rheinische Friedrich-Wilhelms-Universität Bonn</orgName>
								<address>
									<postCode>53115</postCode>
									<settlement>Bonn</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Holger</forename><surname>Fröhlich</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Algorithms and Scientific Computing</orgName>
								<address>
									<addrLine>Schloss Birlinghoven, Sankt Augustin</addrLine>
									<postCode>53757</postCode>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Bonn-Aachen International Center for IT</orgName>
								<orgName type="institution">Rheinische Friedrich-Wilhelms-Universität Bonn</orgName>
								<address>
									<postCode>53115</postCode>
									<settlement>Bonn</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marc</forename><surname>Jacobs</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Algorithms and Scientific Computing</orgName>
								<address>
									<addrLine>Schloss Birlinghoven, Sankt Augustin</addrLine>
									<postCode>53757</postCode>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">INDEX: the Intelligent Data Steward Toolbox Utilizing Large Language Model Embeddings for Automated Data Harmonization</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">873F5EE40F1BF95A3260E74402C5ACE3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>data stewardship</term>
					<term>large language models</term>
					<term>embeddings</term>
					<term>semantic mappings</term>
					<term>common data model</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The data steward, responsible for overseeing data management, plays a pivotal role in evidence-based medicine by ensuring the quality, integrity, and accessibility of data throughout its lifecycle. However, managing medical data poses challenges, including handling diverse structured and unstructured data from various sources in different formats. This data curation process demands significant time and resources. To alleviate these challenges and enhance the efficiency of data stewards, we introduce a novel data stewardship tool and curation workflow utilizing Large Language Models (LLMs). We evaluated our approach by performing automatic pairwise cohort harmonization using data dictionaries of 6 different Parkinson's Disease (PD) studies and 13 different studies in the context of Alzheimer's Disease (AD), as well as a mapping task of over 38,000 ICD10 codes using code descriptions obtained from UK Biobank. Compared with a string-matching baseline that does not capture the context of variable descriptions, Generative Pre-trained Transformer (GPT) embedding-based mappings performed significantly better, reaching a best average accuracy of 82% for the automated initial closest match in PD cohort harmonization. Although differences in formulation and wording meant that descriptions could not be matched automatically in all cases, we are confident that our data steward tool can significantly facilitate the work of the data steward in a semi-automatic fashion.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data stewardship is an important but often time- and resource-intensive process, and data stewardship tools can make it more efficient. Variable descriptions for data harmonization are often very diverse in their formulation; it is therefore important to incorporate their semantics in order to harmonize them with high accuracy. With the ongoing development of GPT models, we evaluated whether vector distances between GPT model embeddings can be used to automatically harmonize variable descriptions. We developed a data steward tool and a harmonization workflow that can be used to iteratively improve harmonization results in a semi-automated process.</p><p>We evaluated our automated mapping approach in three application cases. First, we harmonized 6 different Parkinson's Disease (PD) cohorts pairwise using GPT embeddings, with an in-house Common Data Model (CDM) as ground truth. Second, we repeated this evaluation in the context of Alzheimer's Disease (AD) using 13 different collected studies. Third, we mapped over 38,000 Read codes for medical diagnosis to ICD10 codes using code descriptions obtained from UK Biobank, referring to a pre-existing mapping as ground truth. Notable examples of correct and incorrect matches are shown in Table 1. We tested each approach against a baseline method using Fuzzy String Matching. The results are shown in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>We found that GPT-embedding-based matching outperformed the baseline method significantly in all three tested application cases, reaching an average accuracy of 82% for the PD cohorts, 63% for the AD mappings and 56% for the automatic mapping of ICD10 codes. Especially for the harmonization application, we found that semantically coherent variable descriptions from different cohorts form distinct clusters that may overlap for different studies, even for different disease types (see Figure 2). However, we also found that, because data descriptions can be formulated in very different ways, including special cases such as custom abbreviations (see Table 1), fully automatic data harmonization using LLMs is not yet feasible. We expect that with the ongoing development of LLMs, and especially of domain-trained models, we will be able to further improve and build on our results in the future.</p><p>SWAT4HCLS'24: Semantic Web Applications and Tools for Health Care and Life Sciences, Feb 26-29, 2024, Leiden, NL. tim.adams@scai.fraunhofer.de (T. Adams); mohamed.aborageh@scai.fraunhofer.de (M. Aborageh); yasamin.salimi@scai.fraunhofer.de (Y. Salimi); holger.froehlich@scai.fraunhofer.de (H. Fröhlich); marc.jacobs@scai.fraunhofer.de (M. Jacobs)</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Average accuracy for the three evaluated harmonization tasks.</figDesc></figure>
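<div xmlns="http://www.tei-c.org/ns/1.0"><p>The closest-match evaluation described in the text can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names (toy_embed, cosine_similarity, closest_match, accuracy) are hypothetical, and toy_embed is only a character-trigram stand-in for an embedding model, so it does not capture semantics the way GPT embeddings would. The sketch only shows how vector similarity turns into an automated initial match and an accuracy score against a ground-truth mapping.</p><p>
```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an LLM embedding: character-trigram counts.

    Assumption: a real workflow would replace this with vectors obtained
    from an embedding model; this placeholder keeps the sketch offline.
    """
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_match(source, targets, embed=toy_embed):
    """Return the target description most similar to the source description."""
    source_vec = embed(source)
    return max(targets, key=lambda t: cosine_similarity(source_vec, embed(t)))


def accuracy(sources, targets, ground_truth, embed=toy_embed):
    """Fraction of sources whose closest match equals the ground-truth target."""
    hits = sum(closest_match(s, targets, embed) == ground_truth[s] for s in sources)
    return hits / len(sources)
```
</p><p>Passing a different embed callable (for example, one that looks up precomputed model embeddings and returns them in the same dict form) would leave the matching and scoring logic unchanged, which is the main reason the embedding step is kept as a parameter here.</p></div>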
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Two-dimensional t-SNE representation of computed AD and PD embeddings.</figDesc><graphic coords="2,301.80,89.63,179.68,174.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Examples of mapped Read and ICD10 descriptions. The "Logic" column indicates a correct match.</figDesc><table><row><cell>Source Read Description</cell><cell>Matched ICD10 Description</cell><cell>Correct ICD10 Description</cell><cell>Logic</cell></row><row><cell>FH: Stomach cancer</cell><cell>Family history of malignant neoplasm of digestive organs</cell><cell>-</cell><cell>True</cell></row><row><cell>Cardiac function test abnormal</cell><cell>Abnormal results of cardiovascular function studies</cell><cell>-</cell><cell>True</cell></row><row><cell>Macrocytosis</cell><cell>Macroglossia</cell><cell>Other specified diseases of blood and blood-forming organs</cell><cell>False</cell></row><row><cell>FH: Depression</cell><cell>Unhappiness</cell><cell>Family history of other mental and behavioral disorders</cell><cell>False</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl/>
			</div>
		</back>
	</text>
</TEI>
