<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KGSynX: Knowledge Graph and Explainable Feedback Guided LLMs for Synthetic Tabular Data Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ke YU</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shigeru Ishikura</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yukari Usukura</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuki Shigoku</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teruaki Hayashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Systems Innovation, School of Engineering, the University of Tokyo</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infomart Corporation</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Synthetic tabular data is vital for augmentation, privacy, and performance under limited data, yet most work targets marginal statistics, neglecting downstream utility and explainability in scarce-data scenarios. We propose KGSynX, which builds a knowledge graph from table records and derives graph embeddings to inform LLM prompts. A SHAP‑guided feedback loop measures attribution differences between real and generated data and injects targeted corrections into subsequent prompts. Evaluated under the Train-on-Synthetic, Test-on-Real (TSTR) protocol on heart disease, enterprise invoice, and telco churn datasets, KGSynX consistently outperforms baselines in accuracy, F1, and AUC while closing the SHAP attribution gap. By explicitly modeling structure and semantics, KGSynX produces more reliable synthetic datasets for downstream tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Synthetic Data</kwd>
        <kwd>LLM</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Synthetic tabular data generation has emerged as a critical technique in scenarios where access to
real datasets is limited by privacy, regulatory, or logistical constraints—for example, in healthcare [16],
finance, and telecommunications [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ]. By creating high‑quality synthetic records, practitioners can
augment scarce data, share information without exposing sensitive details [18], and improve model
training under low‑resource conditions. However, most state‑of‑the‑art approaches—ranging from
generative adversarial networks (GANs) [
        <xref ref-type="bibr" rid="ref1 ref12 ref13">1, 12, 13</xref>
        ] and diffusion models [
        <xref ref-type="bibr" rid="ref14 ref15 ref6">14, 15, 6</xref>
        ] to Large Language Model
(LLM) based generators [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] primarily focus on matching marginal feature distributions or low‑order
statistics. While these methods can reproduce individual column histograms or pairwise correlations,
they often fail to capture higher‑order semantic relationships present in the joint distribution. As a result,
synthetic samples may exhibit unrealistic combinations of features, leading to degraded performance
in downstream tasks and undermining user trust [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, these techniques still rely on handcrafted
objectives or black‑box signals, making it difficult to trace how structural or semantic errors persist in
the synthetic data.
      </p>
      <p>
        To address these challenges, we present KGSynX, which integrates knowledge graphs (KG) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and explainable AI feedback to steer LLM‑based synthesis. Our key contributions are as follows. First, KGSynX
constructs a knowledge graph in which each record is represented as an entity node and each
feature-value pair as an attribute node; edges encode the semantic dependencies inherent in the original
table. We then extract structure‑aware embeddings via Node2Vec [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and incorporate them into LLM
prompts, ensuring that sample generation respects the encoded graph topology. Next, we implement
a SHAP‑driven refinement loop [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: after each generation round, we compute the attribution gap
between real and synthetic data, identify the top‑k discrepant features, and automatically inject targeted
instructions into the prompt to correct those errors. This explainable feedback mechanism both improves
downstream utility [19] and provides clear diagnostics for auditing.
      </p>
      <p>
        We validate KGSynX under the Train‑on‑Synthetic, Test‑on‑Real (TSTR) protocol [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] on three
benchmark datasets. Compared to baselines, our method achieves substantial gains in accuracy, F1
score, and AUC [20], while progressively narrowing the SHAP attribution gap. These results demonstrate
that explicitly modeling semantic structure and leveraging interpretable feedback are key to producing
reliable synthetic data for practical applications.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method Overview</title>
      <sec id="sec-2-1">
        <title>2.1. Framework</title>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Core Components</title>
        <sec id="sec-2-2-1">
          <title>Knowledge Graph Construction.</title>
          <p>We construct a knowledge graph G = (V, E), where V = V_entity ∪ V_attribute and
E = {(r, a) ∣ record r has attribute a}.</p>
          <p>Here, V_entity represents the set of sample entity nodes and V_attribute represents the set of feature-value
nodes. The edge set E captures associations between entities and their attributes, thus encoding the
structural dependencies inherent in the original tabular data.</p>
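<p>The construction above can be sketched in a few lines of Python. This is an illustrative, pure-Python version: record IDs become entity nodes, each feature-value pair becomes a shared attribute node, and the function name and toy rows are our own, not the paper's code. Node2Vec would then be run over this graph to obtain the structure-aware embeddings (omitted here).</p>

```python
def build_kg(records):
    """Build G = (V_entity ∪ V_attribute, E) from a list of row dicts."""
    entity_nodes, attribute_nodes, edges = set(), set(), set()
    for rid, row in enumerate(records):
        entity = f"record:{rid}"
        entity_nodes.add(entity)
        for feature, value in row.items():
            attr = f"{feature}={value}"      # one node per feature-value pair
            attribute_nodes.add(attr)
            edges.add((entity, attr))        # (r, a): record r has attribute a
    return entity_nodes, attribute_nodes, edges

# Toy clinical-style rows (illustrative values only).
rows = [
    {"age": "54", "sex": "male", "cp": "typical"},
    {"age": "61", "sex": "male", "cp": "asymptomatic"},
]
V_e, V_a, E = build_kg(rows)
```

<p>Note that attribute nodes are shared: both records connect to the same "sex=male" node, which is what lets graph embeddings capture co-occurrence structure across rows.</p>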
        </sec>
        <sec id="sec-2-2-2">
          <title>SHAP Attribution Gap.</title>
          <p>We quantify semantic alignment by computing
D_SHAP_cos = 1 − (φ_real ⋅ φ_syn) / (‖φ_real‖ ‖φ_syn‖),
where φ_real and φ_syn are the normalized SHAP attribution vectors for the real and synthetic datasets.
The cosine distance D_SHAP_cos measures the angular dissimilarity between these vectors, with values
closer to 0 indicating that the synthetic data’s attribution pattern closely aligns with that of the real
data.</p>
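<p>The attribution gap is one minus the cosine similarity of the two SHAP vectors. A minimal pure-Python computation follows; the vectors here are illustrative placeholders, not attribution values from the paper.</p>

```python
import math

def shap_cos_gap(phi_real, phi_syn):
    """Cosine-distance gap between real and synthetic SHAP attribution vectors."""
    dot = sum(a * b for a, b in zip(phi_real, phi_syn))
    norm_r = math.sqrt(sum(a * a for a in phi_real))
    norm_s = math.sqrt(sum(b * b for b in phi_syn))
    return 1.0 - dot / (norm_r * norm_s)

shap_cos_gap([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])  # identical vectors give a gap near 0
shap_cos_gap([1.0, 0.0], [0.0, 1.0])            # orthogonal attribution patterns give 1
```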
          <p>Prompt Refinement. Given an initial prompt P_0, we iteratively refine it by updating based on the
top-k attribution discrepancies Δ_t:</p>
          <p>P_{t+1} = P_t ⊕ {emphasize features in Δ_t}.</p>
          <p>The operator ⊕ denotes the appending of targeted instructions to the existing prompt. Through
this SHAP-guided feedback loop, the LLM is steered to generate samples whose feature importance
distributions progressively converge to those of the real dataset.</p>
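<p>The ⊕ update can be sketched as follows. The instruction template and feature names are our own illustration; the paper's actual prompt wording appears in Section 2.3.</p>

```python
def refine_prompt(prompt, attribution_gaps, k=2):
    """Append instructions for the k features with the largest |attribution gap|."""
    top_k = sorted(attribution_gaps,
                   key=lambda f: abs(attribution_gaps[f]),
                   reverse=True)[:k]
    extra = "; ".join(f"align the attribution of {f}" for f in top_k)
    return prompt + " " + extra + "."   # the ⊕ operator: append targeted instructions

# Toy per-feature gaps (illustrative values).
gaps = {"age": 0.02, "chol": 0.31, "thalach": -0.18}
p1 = refine_prompt("Generate records matching the KG context.", gaps)
```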
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Prompt Example</title>
        <p>Initial Prompt:
"Using the knowledge graph context, generate synthetic records ensuring the
following attribute dependencies: [KG summary]."
After SHAP Feedback:
"Prioritize matching the distribution of {Feature_A} and reduce
overrepresentation of {Feature_B}."</p>
        <p>The first prompt instructs the LLM to adhere to the structural relationships embedded within the
knowledge graph during the generation of new records. The second prompt encourages the model to
refine its output by prioritizing features exhibiting the most significant attribution discrepancies.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Semantic Alignment Convergence</title>
        <p>As shown in Figure 2, at each iteration we measure the SHAP divergence between the real and synthetic
data and update the prompts accordingly. This loop terminates when the semantic-alignment gap
falls below a threshold ε (default 0.1) or the maximum number of rounds T is reached (default 5). In practice,
convergence is typically achieved within 3–4 rounds.</p>
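<p>The loop structure can be summarized as a short skeleton. Here <code>generate</code> and <code>measure_gap</code> are stand-ins for the LLM call and the SHAP-gap computation; the toy stand-ins below simply feed a shrinking gap to show the termination logic.</p>

```python
def refinement_loop(generate, measure_gap, prompt, epsilon=0.1, max_rounds=5):
    """Regenerate until the SHAP gap drops below epsilon or the round budget runs out."""
    for t in range(max_rounds):
        synthetic = generate(prompt)
        gap = measure_gap(synthetic)
        if gap < epsilon:
            return synthetic, gap, t + 1          # converged
        prompt = prompt + f" [round {t + 1}: correct top-k features]"
    return synthetic, gap, max_rounds             # budget exhausted

# Toy stand-ins: the measured gap shrinks each round, mimicking convergence.
gap_sequence = iter([0.4, 0.2, 0.08])
_, final_gap, rounds = refinement_loop(lambda p: p,
                                       lambda s: next(gap_sequence),
                                       "initial prompt")
```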
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments &amp; Results</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets and Classifiers</title>
        <p>We used three benchmark datasets in our experiments. The UCI Heart Disease dataset contains
303 samples with 13 clinical features, and is evaluated using a RandomForest classifier to capture
non‐linear interactions. The Enterprise Invoice Usage dataset comprises 500 enterprise transaction
records with 11 attributes, for which we employ XGBoost due to its robustness on structured financial
data. Finally, the Telco Churn dataset (7,043 samples, 20 features) is tested with LightGBM to leverage
its high efficiency and accuracy in large‐scale customer churn prediction. All classifiers are trained
with default hyperparameter settings and 5‐fold cross‐validation to ensure a fair comparison.</p>
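<p>The 5-fold splitting used for comparison can be illustrated with a minimal index splitter. This is a sketch of the splitting step only; the actual experiments use library classifiers (RandomForest, XGBoost, LightGBM) with default hyperparameters, which are omitted here.</p>

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so every sample is tested exactly once."""
    idx = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(303, k=5))   # e.g. the 303 Heart Disease samples
```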
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Performance Comparison</title>
        <p>Our KGSynX consistently outperforms CTGAN and vanilla LLM generators, achieving the best F1
and Area Under the Curve (AUC) scores across the board (Table 1). In the Heart Disease dataset,
KGSynX boosts Accuracy from 0.667 (CTGAN) to 0.767 and improves F1 from 0.474 to 0.750. On the
Enterprise dataset, it reaches the highest accuracy (0.900) and F1 (0.904), demonstrating its ability to
model complex enterprise data. For Telco Churn, KGSynX attains the top AUC (0.853) and a balanced
F1 (0.611), confirming its robustness in large‐scale customer prediction tasks. These results validate
that integrating knowledge‐graph embeddings with SHAP‐driven prompt refinement yields synthetic
data with downstream utility and semantic fidelity.</p>
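<p>The reported Accuracy and F1 follow their standard binary definitions; a minimal computation is shown below with toy labels (not the paper's predictions) to make the metrics concrete.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 4 of 6 correct; precision = recall = 0.75.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
```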
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion &amp; Future Work</title>
      <p>In this work, we introduced KGSynX, a framework that seamlessly integrates knowledge‑graph
embeddings with SHAP‑driven feedback to guide large language models in generating synthetic tabular data.
Our method explicitly models the structural dependencies of tabular data and iteratively refines
generation prompts based on feature attribution discrepancies. Our experiments, conducted under the TSTR
protocol on UCI Heart Disease, Enterprise Invoice Usage, and Telco Churn datasets, demonstrate that
KGSynX outperforms GAN-based models, TabDDPM, LLM‑only, and LLM+KG baselines in classification
accuracy, F1 score, and AUC, while preserving semantic fidelity and interpretability.</p>
      <p>Despite these encouraging results, the current implementation relies on heuristic prompt adjustments,
which may require manual tuning and domain expertise. Additionally, SHAP‑based attribution
computations introduce substantial computational overhead, limiting scalability in resource‑constrained
environments. Future work will focus on developing reinforcement‑learning‑based or differentiable
optimization techniques for automated prompt refinement to reduce reliance on heuristics. We also plan
to explore efficient SHAP approximation methods and extend our approach to multi‑label, multi‑modal
knowledge graphs and streaming data scenarios to enhance applicability.</p>
      <p>Dataset sources: (1) https://archive.ics.uci.edu/dataset/45/heart+disease;
(2) provided by Infomart Corporation;
(3) https://www.kaggle.com/datasets/blastchar/telco-customer-churn.</p>
    </sec>
    <sec id="sec-5">
      <title>Supplemental Material Statement</title>
      <p>The source code, real and synthetic datasets, and reproducible pipeline for KGSynX are available online via GitHub.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This study was supported by the joint research project with Infomart Corporation and JST PRESTO
Grant Number JPMJPR2369.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking and for paraphrasing and rewording. After using this tool/service, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>[16] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Generating multi-label discrete electronic health records using generative adversarial networks,” in Proceedings of the 2nd Machine Learning for Healthcare Conference, pp. 286–305, 2017.
[17] E. Mosca, F. Szigeti, S. Tragianni, D. Gallagher, and G. Groh, “SHAP-based explanation methods: a review for NLP interpretability,” in Proceedings of the 29th International Conference on Computational Linguistics, pp. 4593–4603, 2022.
[18] E.-J. van Kesteren, “To democratize research with sensitive data, we should make synthetic data more accessible,” Patterns, vol. 5, no. 9, 2024.
[19] J. Achterberg, M. Haas, B. van Dijk, and M. Spruit, “Fidelity-agnostic synthetic data generation improves utility while retaining privacy,” Patterns, 2025.
[20] M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly, “Assessing generative models via precision and recall,” in Advances in Neural Information Processing Systems, vol. 31, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skoularidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and G. Ermon, “
          <article-title>Modeling tabular data using conditional GAN</article-title>
          ,” in NeurIPS,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          , “
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,” in NeurIPS,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , “node2vec:
          <article-title>Scalable feature learning for networks</article-title>
          ,” in KDD,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>De Cristofaro</surname>
          </string-name>
          , “
          <article-title>Synthetic Data: Methods, Use Cases, and Risks</article-title>
          ,” arXiv preprint arXiv:
          <volume>2303</volume>
          .01230,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Marwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fournier-Tombs</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Stinckwich</surname>
          </string-name>
          , “
          <article-title>The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development</article-title>
          ,” arXiv preprint arXiv:
          <volume>2309</volume>
          .00652,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kotelnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Baranchuk</surname>
          </string-name>
          , et al.,
          <source>“TabDDPM: Modeling Tabular Data with Diffusion Models,” arXiv preprint arXiv:2302.07984</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] OpenAI, “GPT‑4
          <source>Technical Report,” arXiv preprint arXiv:2303.08774</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nickleach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Socolinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengamedu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          , “
          <article-title>Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey</article-title>
          ,” arXiv preprint arXiv:
          <volume>2402</volume>
          .17944,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Patki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wedge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          , “The Synthetic Data Vault,” in
          <source>2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. de Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          , A.-C. Ngonga Ngomo, et al.,
          <source>“Knowledge Graphs,” ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>54</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Esteban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Hyland</surname>
          </string-name>
          , and G. Rätsch, “
          <article-title>Real-valued (medical) time series generation with recurrent conditional GANs,”</article-title>
          <source>arXiv preprint arXiv:1706.02633</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , “Generative adversarial nets,
          <source>” Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>27</volume>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          , “
          <article-title>Conditional generative adversarial nets</article-title>
          ,
          <source>” arXiv preprint arXiv:1411.1784</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Maheswaranathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          , “
          <article-title>Deep unsupervised learning using nonequilibrium thermodynamics</article-title>
          ,
          <source>” arXiv preprint arXiv:1503.03585</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          , “
          <article-title>Generative modeling by estimating gradients of the data distribution,”</article-title>
          <source>in Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>32</volume>
          , pp.
          <fpage>11895</fpage>
          -
          <lpage>11907</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>