=Paper=
{{Paper
|id=Vol-3890/paper-19
|storemode=property
|title=INDEX: the Intelligent Data Steward Toolbox. Utilizing Large Language Model embeddings
for automated data harmonization
|pdfUrl=https://ceur-ws.org/Vol-3890/paper-19.pdf
|volume=Vol-3890
}}
==INDEX: the Intelligent Data Steward Toolbox. Utilizing Large Language Model embeddings for automated data harmonization==
INDEX: the Intelligent Data Steward Toolbox
Utilizing Large Language Model Embeddings for Automated Data Harmonization
Tim Adams¹, Mohamed Aborageh¹, Yasamin Salimi¹,², Holger Fröhlich¹,² and Marc Jacobs¹
¹ Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin 53757, Germany
² Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn 53115, Germany
Abstract
The data steward, responsible for overseeing data management, plays a pivotal role in evidence-based
medicine by ensuring the quality, integrity, and accessibility of data throughout its lifecycle. However,
managing medical data poses challenges, including handling diverse structured and unstructured data
from various sources in different formats. This data curation process demands significant time and
resources. To alleviate these challenges and enhance the efficiency of data stewards, we introduce
a novel data stewardship tool and curation workflow utilizing Large Language Models (LLMs). We
evaluated our approach by performing automatic pairwise cohort harmonization using data dictionaries
of 6 different Parkinson’s Disease (PD) studies and 13 different studies in the context of Alzheimer’s
Disease (AD), as well as a mapping task of over 38,000 ICD10 codes using code descriptions obtained
from UK Biobank. Compared with a string-matching baseline that does not capture the context of
variable descriptions, Generative Pre-trained Transformer (GPT) embedding-based mappings performed
significantly better, reaching a best average accuracy of 82% for the automated initial closest
match in PD cohort harmonization. Although differences in formulation and wording meant that
descriptions could not be matched automatically in all cases, we are confident that our data
steward tool can significantly facilitate the work of the data steward in a semi-automatic fashion.
Keywords
data stewardship, large language models, embeddings, semantic mappings, common data model
As data stewardship is an important but often time- and resource-intensive process, data
stewardship tools can make it considerably more efficient. Variable descriptions for
data harmonization are often very diverse in their formulation; it is therefore important to
incorporate their semantics in order to harmonize them with high accuracy. With the
ongoing development of GPT models, we evaluated whether vector distances between GPT model
embeddings can be used to automatically harmonize variable descriptions. We developed a
data steward tool and a harmonization workflow¹ that can be used to iteratively improve
harmonization results in a semi-automated process.
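The closest-match step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the vector measure (the text only speaks of vector distances) and uses small hand-written vectors in place of real GPT embeddings. The function names and the example variable descriptions are hypothetical and are not taken from the INDEX code base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """Map each source description to the target description with the most similar embedding.

    `source` and `target` are dicts of description -> embedding vector.
    """
    matches = {}
    for s_name, s_vec in source.items():
        matches[s_name] = max(
            target, key=lambda t_name: cosine_similarity(s_vec, target[t_name])
        )
    return matches


# Hypothetical 3-dimensional stand-ins for real embeddings (assumption: real
# embeddings would come from a GPT embedding model and be much larger).
cohort_a = {
    "Age at baseline": [0.9, 0.1, 0.0],
    "Years of education": [0.1, 0.9, 0.2],
}
cohort_b = {
    "Education (years)": [0.2, 0.8, 0.1],
    "Subject age": [0.8, 0.2, 0.1],
}

matches = closest_matches(cohort_a, cohort_b)
for source_name, target_name in matches.items():
    print(f"{source_name} -> {target_name}")
```

In a semi-automated setting, such a top-ranked suggestion would be reviewed by the data steward rather than accepted blindly.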
¹ https://github.com/SCAI-BIO/index & https://github.com/SCAI-BIO/index/tree/main/doc/workflow
SWAT4HCLS’24: Semantic Web Applications and Tools for Health Care and Life Sciences, Feb 26–29, 2024, Leiden, NL
Email: tim.adams@scai.fraunhofer.de (T. Adams); mohamed.aborageh@scai.fraunhofer.de (M. Aborageh);
yasamin.salimi@scai.fraunhofer.de (Y. Salimi); holger.froehlich@scai.fraunhofer.de (H. Fröhlich);
marc.jacobs@scai.fraunhofer.de (M. Jacobs)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Figure 1: Average accuracy for the three evaluated harmonization tasks.
Figure 2: Two-dimensional t-SNE representation of computed AD and PD embeddings.
We evaluated our automated mapping approach based on three different application cases:
We harmonized 6 different Parkinson’s Disease (PD) cohorts pairwise using GPT embeddings
and Fuzzy String Matching as a baseline comparison, using an in-house Common Data Model
(CDM) for ground-truth data. The same was tested in the context of Alzheimer’s Disease (AD)
using 13 different collected studies. We mapped over 38,000 Read codes for medical diagnosis to
ICD10 codes using code descriptions obtained from UK Biobank, referring to a pre-existing
mapping as ground truth. Notable examples of correct and incorrect matches are shown in
Table 1. We tested each approach against a baseline method using Fuzzy String Matching. The
results are shown in Figure 1. We found that GPT embedding-based matching outperformed the
baseline method significantly in all three tested application cases, reaching an average accuracy
of 82% for the PD cohorts, 63% for the AD mappings and 56% for the automatic mapping of
ICD10 codes. Especially for the harmonization application, we found that semantically coherent
variable descriptions from different cohorts form distinct clusters that may overlap for different
studies, even for different disease types (see Figure 2). However, we also found that, given the
widely varying ways of formulating data descriptions and special cases such as custom
abbreviations (see Table 1), fully automatic data harmonization using LLMs is
not yet feasible. We expect that with the ongoing development of LLMs, and especially
domain-trained models, we will be able to further improve and build on our results in the future.
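A minimal sketch of how such a comparison could be scored: each source description is assigned its top-ranked candidate, and the share of assignments that agree with the ground-truth mapping is reported as accuracy. Python's `difflib.SequenceMatcher` stands in for the Fuzzy String Matching baseline here, which is an assumption, since the text does not name the library or metric used. The function names and toy descriptions are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Case-insensitive surface similarity in [0, 1] (stand-in for the baseline metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_fuzzy_match(description, candidates):
    """Return the candidate description with the highest surface similarity."""
    return max(candidates, key=lambda c: fuzzy_score(description, c))


def top1_accuracy(predicted, ground_truth):
    """Fraction of ground-truth pairs for which the predicted match is the correct one."""
    if not ground_truth:
        raise ValueError("ground truth mapping is empty")
    correct = sum(
        1 for source, target in ground_truth.items() if predicted.get(source) == target
    )
    return correct / len(ground_truth)


# Hypothetical toy ground truth (assumption: in the paper this role is played by
# the in-house CDM or the pre-existing Read-to-ICD10 mapping).
ground_truth = {
    "Age at baseline": "Subject age",
    "Years of education": "Education (years)",
}
candidates = list(ground_truth.values())

predicted = {source: best_fuzzy_match(source, candidates) for source in ground_truth}
accuracy = top1_accuracy(predicted, ground_truth)
print(f"Top-1 accuracy of the string-matching stand-in: {accuracy:.2f}")
```

The same `top1_accuracy` function can score any matcher that yields one suggestion per source description, so an embedding-based matcher and the string-based stand-in can be compared on identical ground truth.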
{| class="wikitable"
|+ Table 1: Examples of mapped Read and ICD10 descriptions. The "Logic" column indicates a correct match.
! Source Read Description !! Matched ICD10 Description !! Correct ICD10 Description !! Logic
|-
| FH: Stomach cancer || Family history of malignant neoplasm of digestive organs || - || True
|-
| Cardiac function test abnormal || Abnormal results of cardiovascular function studies || - || True
|-
| Macrocytosis || Macroglossia || Other specified diseases of blood and blood-forming organs || False
|-
| FH: Depression || Unhappiness || Family history of other mental and behavioral disorders || False
|}