INDEX: the Intelligent Data Steward Toolbox Utilizing Large Language Model Embeddings for Automated Data Harmonization Tim Adams1 , Mohamed Aborageh1 , Yasamin Salimi1,2 , Holger Fröhlich1,2 and Marc Jacobs1 1 Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin 53757, Germany 2 Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn 53115, Germany Abstract The data steward, responsible for overseeing data management, plays a pivotal role in evidence-based medicine by ensuring the quality, integrity, and accessibility of data throughout its lifecycle. However, managing medical data poses challenges, including handling diverse structured and unstructured data from various sources in different formats. This data curation process demands significant time and resources. To alleviate these challenges and enhance the efficiency of data stewards, we introduce a novel data stewardship tool and curation workflow utilizing Large Language Models (LLMs). We evaluated our approach by performing automatic pairwise cohort harmonization using data dictionaries of 6 different Parkinson’s Disease (PD) studies and 13 different studies in the context of Alzheimer’s Disease (AD), as well as a mapping task of over 38,000 ICD10 codes using code descriptions obtained from UKBioBank. When compared with a String Matching based baseline method that does not capture the context of variable descriptions, we found that Generative Pre-trained Transformer (GPT) embedding based mappings performed significantly better, reaching a best average accuracy for the application of PD cohort harmonization for an automated initial closest match of 82%. While we found that due to various different formulation and wording issues descriptions could not be automatically matched in all cases, we are confident that our data steward tool can significantly facilitate the work of the data steward in a semi-automatic fashion. Keywords data stewardship, large language models, embeddings, semantic mappings, common data model As data stewardship is an important but often time and resources intensive process, data stewardship tools can be used to facilitate the process effectively. Variable descriptions for data harmonization are often very diverse in their formulation; it is therefore important to incorporate their semantics to be able to harmonize them with a high accuracy. With the ongoing development of GPT models, we evaluated whether vector distances of GPT model embeddings can be used to automatically harmonize variable descriptions. We developed a data steward tool and a harmonization workflow 1 that can be used to iteratively improve harmonization results in a semi-automated process. We evaluated our automated mapping approach based on three different application cases: We harmonized 6 different Parkinson’s Disease (PD) cohorts pairwise using GPT- embedding SWAT4HCLS’24: Semantic Web Applications and Tools for Health Care and Life Sciences, Feb 26–29, 2024, Leiden, NL $ tim.adams@scai.fraunhofer.de (T. Adams); mohamed.aborageh@scai.fraunhofer.de (M. Aborageh); yasamin.salimi@scai.fraunhofer.de (Y. Salimi); holger.froehlich@scai.fraunhofer.de (H. Fröhlich); marc.jacobs@scai.fraunhofer.de (M. Jacobs) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings CEUR Workshop Proceedings (CEUR-WS.org) http://ceur-ws.org ISSN 1613-0073 1 https://github.com/SCAI-BIO/index & https://github.com/SCAI-BIO/index/tree/main/doc/workflow Figure 1: Average accuracy for the three Figure 2: Two-dimensional t-SNE representation evaluated harmonization tasks. of computed AD and PD embeddings. and Fuzzy String Matching as a baseline comparison, using an in-house Common Data Model (CDM) for ground-truth data. The same was tested in the context of Alzheimer’s Disease (AD) using 13 different collected studies. We mapped over 38,000 Read codes for medical diagnosis to ICD10 codes using code descriptions obtained from UK Biobank and referring to a pre-existing mapping as ground truth. Notable examples of correct and incorrect matches are shown in Table1. We tested each approach against a baseline method using Fuzzy String Matching. The results are shown in Figure 1. We found that GPT-Embedding based matching outperformed the baseline method significantly in all three tested application cases, reaching an average accuracy of 82% for the PD cohorts, 63% for the AD mappings and 56% for the automatic mapping of ICD10 codes. Especially for the harmonization application, we found that semantically coherent variable descriptions from different cohorts form distinct clusters that may overlap for different studies, even for different disease types (see Figure2). We however also found that given the very much different ways to formulate data descriptions when taking into account special cases such as custom abbreviations (see Table1), fully automatic data harmonization using LLMs is not yet feasible. We expect that with the ongoing development of LLMs and especially domain trained models, we will be able to further improve and build on our results in the future. Source Read Description Matched ICD10 Description Correct ICD10 Description Logic FH: Stomach cancer Family history of malignant - True neoplasm of digestive organs Cardiac function test abnor- Abnormal results of cardiovas- - True mal cular function studies Macrocytosis Macroglossia Other specified diseases of blood False and blood-forming organs FH: Depression Unhappiness Family history of other mental and False behavioral disorders Table 1 Examples of mapped Read and ICD10 descriptions. The "Logic" column indicates a correct match.