<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KGSynX: Knowledge Graph and Explainable Feedback Guided LLMs for Synthetic Tabular Data Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ke YU</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shigeru Ishikura</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yukari Usukura</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuki Shigoku</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teruaki Hayashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Systems Innovation, School of Engineering, the University of Tokyo</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infomart Corporation</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Synthetic tabular data is vital for augmentation, privacy, and performance under limited data, yet most work targets marginal statistics, neglecting downstream utility and explainability in scarce-data scenarios. We propose KGSynX, which builds a knowledge graph from table records and derives graph embeddings to inform LLM prompts. A SHAP‑guided feedback loop measures attribution differences between real and generated data and injects targeted corrections into subsequent prompts. Evaluated under the Train-on-Synthetic, Test-on-Real (TSTR) protocol on heart disease, enterprise invoice, and telco churn datasets, KGSynX consistently outperforms baselines in accuracy, F1, and AUC while closing the SHAP attribution gap. By explicitly modeling structure and semantics, KGSynX produces more reliable synthetic datasets for downstream tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Synthetic Data</kwd>
        <kwd>LLM</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Synthetic tabular data generation has emerged as a critical technique in scenarios where access to
real datasets is limited by privacy, regulatory, or logistical constraints—for example, in healthcare [16],
finance, and telecommunications [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ]. By creating high‑quality synthetic records, practitioners can
augment scarce data, share information without exposing sensitive details [18], and improve model
training under low‑resource conditions. However, most state‑of‑the‑art approaches—ranging from
generative adversarial networks (GANs) [
        <xref ref-type="bibr" rid="ref1 ref12 ref13">1, 12, 13</xref>
        ] and diffusion models [
        <xref ref-type="bibr" rid="ref14 ref15 ref6">14, 15, 6</xref>
        ] to Large Language Model
(LLM) based generators [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] primarily focus on matching marginal feature distributions or low‑order
statistics. While these methods can reproduce individual column histograms or pairwise correlations,
they often fail to capture higher‑order semantic relationships present in the joint distribution. As a result,
synthetic samples may exhibit unrealistic combinations of features, leading to degraded performance
in downstream tasks and undermining user trust [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, these techniques still rely on handcrafted
objectives or black‑box signals, making it difficult to trace how structural or semantic errors persist in
the synthetic data.
      </p>
      <p>
        To address these challenges, we present KGSynX, which integrates knowledge graphs (KG) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and explainable AI feedback to steer LLM‑based synthesis. Our key contributions are as follows. First, KGSynX
constructs a knowledge graph in which each record is represented as an entity node and each
feature-value pair as an attribute node; edges encode the semantic dependencies inherent in the original
table. We then extract structure‑aware embeddings via Node2Vec [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and incorporate them into LLM
prompts, ensuring that sample generation respects the encoded graph topology. Next, we implement
a SHAP‑driven refinement loop [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: after each generation round, we compute the attribution gap
between real and synthetic data, identify the top‑k discrepant features, and automatically inject targeted
instructions into the prompt to correct those errors. This explainable feedback mechanism both improves
downstream utility [19] and provides clear diagnostics for auditing.
      </p>
      <p>
        We validate KGSynX under the Train‑on‑Synthetic, Test‑on‑Real (TSTR) protocol [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] on three
benchmark datasets. Compared to baselines, our method achieves substantial gains in accuracy, F1
score, and AUC [20], while progressively narrowing the SHAP attribution gap. These results demonstrate
that explicitly modeling semantic structure and leveraging interpretable feedback are key to producing
reliable synthetic data for practical applications.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method Overview</title>
      <sec id="sec-2-1">
        <title>2.1. Framework</title>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Core Components</title>
        <sec id="sec-2-2-1">
          <title>Knowledge Graph Construction.</title>
          <p>We construct a knowledge graph G = (V, E), where V = V_entity ∪ V_attribute and
E = {(r, a) ∣ record r has attribute a}.</p>
          <p>Here, V_entity represents the set of sample entity nodes and V_attribute represents the set of feature-value
nodes. The edge set E captures associations between entities and their attributes, thus encoding the
structural dependencies inherent in the original tabular data.</p>
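<p>The construction above can be sketched in a few lines of Python. This is an illustrative, pure-Python version: record IDs become entity nodes, each feature-value pair becomes a shared attribute node, and the function name and toy rows are our own, not the paper's code. Node2Vec would then be run over this graph to obtain the structure-aware embeddings (omitted here).</p>

```python
def build_kg(records):
    """Build G = (V_entity ∪ V_attribute, E) from a list of row dicts."""
    entity_nodes, attribute_nodes, edges = set(), set(), set()
    for rid, row in enumerate(records):
        entity = f"record:{rid}"
        entity_nodes.add(entity)
        for feature, value in row.items():
            attr = f"{feature}={value}"      # one node per feature-value pair
            attribute_nodes.add(attr)
            edges.add((entity, attr))        # (r, a): record r has attribute a
    return entity_nodes, attribute_nodes, edges

# Toy clinical-style rows (illustrative values only).
rows = [
    {"age": "54", "sex": "male", "cp": "typical"},
    {"age": "61", "sex": "male", "cp": "asymptomatic"},
]
V_e, V_a, E = build_kg(rows)
```

<p>Note that attribute nodes are shared: both records connect to the same "sex=male" node, which is what lets graph embeddings capture co-occurrence structure across rows.</p>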
        </sec>
        <sec id="sec-2-2-2">
          <title>SHAP Attribution Gap.</title>
          <p>We quantify semantic alignment by computing
D_SHAP_cos = 1 − (φ_real ⋅ φ_syn) / (‖φ_real‖ ‖φ_syn‖),
where φ_real and φ_syn are the normalized SHAP attribution vectors for the real and synthetic datasets.
The cosine distance D_SHAP_cos measures the angular dissimilarity between these vectors, with values
closer to 0 indicating that the synthetic data’s attribution pattern closely aligns with that of the real
data.</p>
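<p>The attribution gap is one minus the cosine similarity of the two SHAP vectors. A minimal pure-Python computation follows; the vectors here are illustrative placeholders, not attribution values from the paper.</p>

```python
import math

def shap_cos_gap(phi_real, phi_syn):
    """Cosine-distance gap between real and synthetic SHAP attribution vectors."""
    dot = sum(a * b for a, b in zip(phi_real, phi_syn))
    norm_r = math.sqrt(sum(a * a for a in phi_real))
    norm_s = math.sqrt(sum(b * b for b in phi_syn))
    return 1.0 - dot / (norm_r * norm_s)

shap_cos_gap([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])  # identical vectors give a gap near 0
shap_cos_gap([1.0, 0.0], [0.0, 1.0])            # orthogonal attribution patterns give 1
```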
          <p>Prompt Refinement. Given an initial prompt P_0, we iteratively refine it by updating based on the
top-k attribution discrepancies Δ_t:</p>
          <p>P_{t+1} = P_t ⊕ {emphasize features in Δ_t}.</p>
          <p>The operator ⊕ denotes the appending of targeted instructions to the existing prompt. Through
this SHAP-guided feedback loop, the LLM is steered to generate samples whose feature importance
distributions progressively converge to those of the real dataset.</p>
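<p>The ⊕ update can be sketched as follows. The instruction template and feature names are our own illustration; the paper's actual prompt wording appears in Section 2.3.</p>

```python
def refine_prompt(prompt, attribution_gaps, k=2):
    """Append instructions for the k features with the largest |attribution gap|."""
    top_k = sorted(attribution_gaps,
                   key=lambda f: abs(attribution_gaps[f]),
                   reverse=True)[:k]
    extra = "; ".join(f"align the attribution of {f}" for f in top_k)
    return prompt + " " + extra + "."   # the ⊕ operator: append targeted instructions

# Toy per-feature gaps (illustrative values).
gaps = {"age": 0.02, "chol": 0.31, "thalach": -0.18}
p1 = refine_prompt("Generate records matching the KG context.", gaps)
```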
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Prompt Example</title>
        <p>Initial Prompt:
"Using the knowledge graph context, generate synthetic records ensuring the
following attribute dependencies: [KG summary]."
After SHAP Feedback:
"Prioritize matching the distribution of {Feature_A} and reduce
overrepresentation of {Feature_B}."</p>
        <p>The first prompt instructs the LLM to adhere to the structural relationships embedded within the
knowledge graph during the generation of new records. The second prompt encourages the model to
refine its output by prioritizing features exhibiting the most significant attribution discrepancies.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Semantic Alignment Convergence</title>
        <p>As shown in Figure 2, at each iteration we measure the SHAP divergence between the real and synthetic
data and update the prompts accordingly. This loop terminates when the semantic-alignment gap
falls below a threshold ε (default 0.1) or the maximum number of rounds T is reached (default 5). In practice,
convergence is typically achieved within 3–4 rounds.</p>
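<p>The loop structure can be summarized as a short skeleton. Here <code>generate</code> and <code>measure_gap</code> are stand-ins for the LLM call and the SHAP-gap computation; the toy stand-ins below simply feed a shrinking gap to show the termination logic.</p>

```python
def refinement_loop(generate, measure_gap, prompt, epsilon=0.1, max_rounds=5):
    """Regenerate until the SHAP gap drops below epsilon or the round budget runs out."""
    for t in range(max_rounds):
        synthetic = generate(prompt)
        gap = measure_gap(synthetic)
        if gap < epsilon:
            return synthetic, gap, t + 1          # converged
        prompt = prompt + f" [round {t + 1}: correct top-k features]"
    return synthetic, gap, max_rounds             # budget exhausted

# Toy stand-ins: the measured gap shrinks each round, mimicking convergence.
gap_sequence = iter([0.4, 0.2, 0.08])
_, final_gap, rounds = refinement_loop(lambda p: p,
                                       lambda s: next(gap_sequence),
                                       "initial prompt")
```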
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments &amp; Results</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets and Classifiers</title>
        <p>We used three benchmark datasets in our experiments. The UCI Heart Disease dataset contains
303 samples with 13 clinical features, and is evaluated using a RandomForest classifier to capture
non‐linear interactions. The Enterprise Invoice Usage dataset comprises 500 enterprise transaction
records with 11 attributes, for which we employ XGBoost due to its robustness on structured financial
data. Finally, the Telco Churn dataset (7,043 samples, 20 features) is tested with LightGBM to leverage
its high efficiency and accuracy in large‐scale customer churn prediction. All classifiers are trained
with default hyperparameter settings and 5‐fold cross‐validation to ensure a fair comparison.</p>
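<p>The 5-fold splitting used for comparison can be illustrated with a minimal index splitter. This is a sketch of the splitting step only; the actual experiments use library classifiers (RandomForest, XGBoost, LightGBM) with default hyperparameters, which are omitted here.</p>

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so every sample is tested exactly once."""
    idx = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(303, k=5))   # e.g. the 303 Heart Disease samples
```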
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Performance Comparison</title>
        <p>Our KGSynX consistently outperforms CTGAN and vanilla LLM generators, achieving the best F1
and Area Under the Curve (AUC) scores across the board (Table 1). In the Heart Disease dataset,
KGSynX boosts Accuracy from 0.667 (CTGAN) to 0.767 and improves F1 from 0.474 to 0.750. On the
Enterprise dataset, it reaches the highest accuracy (0.900) and F1 (0.904), demonstrating its ability to
model complex enterprise data. For Telco Churn, KGSynX attains the top AUC (0.853) and a balanced
F1 (0.611), confirming its robustness in large‐scale customer prediction tasks. These results validate
that integrating knowledge‐graph embeddings with SHAP‐driven prompt refinement yields synthetic
data with downstream utility and semantic fidelity.</p>
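<p>The reported Accuracy and F1 follow their standard binary definitions; a minimal computation is shown below with toy labels (not the paper's predictions) to make the metrics concrete.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 4 of 6 correct; precision = recall = 0.75.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
```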
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion &amp; Future Work</title>
      <p>In this work, we introduced KGSynX, a framework that seamlessly integrates knowledge‑graph
embeddings with SHAP‑driven feedback to guide large language models in generating synthetic tabular data.
Our method explicitly models the structural dependencies of tabular data and iteratively refines
generation prompts based on feature attribution discrepancies. Our experiments, conducted under the TSTR
protocol on UCI Heart Disease, Enterprise Invoice Usage, and Telco Churn datasets, demonstrate that
KGSynX outperforms GAN-based models, TabDDPM, LLM‑only, and LLM+KG baselines in classification
accuracy, F1 score, and AUC, while preserving semantic fidelity and interpretability.</p>
      <p>Despite these encouraging results, the current implementation relies on heuristic prompt adjustments,
which may require manual tuning and domain expertise. Additionally, SHAP‑based attribution
computations introduce substantial computational overhead, limiting scalability in resource‑constrained
environments. Future work will focus on developing reinforcement‑learning‑based or differentiable
optimization techniques for automated prompt refinement to reduce reliance on heuristics. We also plan
to explore efficient SHAP approximation methods and extend our approach to multi‑label, multi‑modal
knowledge graphs and streaming data scenarios to enhance applicability.</p>
      <p>Dataset sources: (1) https://archive.ics.uci.edu/dataset/45/heart+disease;
(2) provided by Infomart Corporation;
(3) https://www.kaggle.com/datasets/blastchar/telco-customer-churn.</p>
    </sec>
    <sec id="sec-5">
      <title>Supplemental Material Statement</title>
      <p>The source code, real and synthetic datasets, and reproducible pipeline for KGSynX are available online via GitHub.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This study was supported by the joint research project with Infomart Corporation and JST PRESTO
Grant Number JPMJPR2369.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking and for paraphrasing and rewording. After using this tool/service, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>[16] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Generating multi-label discrete electronic health records using generative adversarial networks,” in Proceedings of the 2nd Machine Learning for Healthcare Conference, pp. 286–305, 2017.
[17] E. Mosca, F. Szigeti, S. Tragianni, D. Gallagher, and G. Groh, “SHAP-based explanation methods: a review for NLP interpretability,” in Proceedings of the 29th International Conference on Computational Linguistics, pp. 4593–4603, 2022.
[18] E.-J. van Kesteren, “To democratize research with sensitive data, we should make synthetic data more accessible,” Patterns, vol. 5, no. 9, 2024.
[19] J. Achterberg, M. Haas, B. van Dijk, and M. Spruit, “Fidelity-agnostic synthetic data generation improves utility while retaining privacy,” Patterns, 2025.
[20] M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly, “Assessing generative models via precision and recall,” in Advances in Neural Information Processing Systems, vol. 31, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skoularidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and G. Ermon, “
          <article-title>Modeling tabular data using conditional GAN</article-title>
          ,” in NeurIPS,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          , “
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,” in NeurIPS,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , “node2vec:
          <article-title>Scalable feature learning for networks</article-title>
          ,” in KDD,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>De Cristofaro</surname>
          </string-name>
          , “
          <article-title>Synthetic Data: Methods, Use Cases, and Risks</article-title>
          ,” arXiv preprint arXiv:
          <volume>2303</volume>
          .01230,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Marwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fournier-Tombs</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Stinckwich</surname>
          </string-name>
          , “
          <article-title>The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development</article-title>
          ,” arXiv preprint arXiv:
          <volume>2309</volume>
          .00652,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kotelnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Baranchuk</surname>
          </string-name>
          , et al.,
          <source>“TabDDPM: Modeling Tabular Data with Diffusion Models,” arXiv preprint arXiv:2302.07984</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] OpenAI, “GPT‑4
          <source>Technical Report,” arXiv preprint arXiv:2303.08774</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nickleach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Socolinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengamedu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          , “
          <article-title>Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey</article-title>
          ,” arXiv preprint arXiv:
          <volume>2402</volume>
          .17944,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Patki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wedge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          , “The Synthetic Data Vault,” in
          <source>2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. de Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          , A.-C. Ngonga Ngomo, et al.,
          <source>“Knowledge Graphs,” ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>54</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Esteban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Hyland</surname>
          </string-name>
          , and G. Rätsch, “
          <article-title>Real-valued (medical) time series generation with recurrent conditional GANs,”</article-title>
          <source>arXiv preprint arXiv:1706.02633</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , “Generative adversarial nets,
          <source>” Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>27</volume>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          , “
          <article-title>Conditional generative adversarial nets</article-title>
          ,
          <source>” arXiv preprint arXiv:1411.1784</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Maheswaranathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          , “
          <article-title>Deep unsupervised learning using nonequilibrium thermodynamics</article-title>
          ,
          <source>” arXiv preprint arXiv:1503.03585</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          , “
          <article-title>Generative modeling by estimating gradients of the data distribution,”</article-title>
          <source>in Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>32</volume>
          , pp.
          <fpage>11895</fpage>
          -
          <lpage>11907</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>