<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph-Driven Validation of CSR (TFLs) Using Semantic Technologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muthiah Giri Hanuragav</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viswanathan Gopinath</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Care2Data</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Regulators require that every number printed in a clinical study report (CSR) be internally consistent across its tables, figures, and listings (TFLs) and traceable to underlying observations. Scripted quality assurance around rich-text outputs (RTF) is brittle under schema drift and provides weak provenance. We present an ontology-centered workflow that converts RTF to JSON, maps JSON to RDF using compact YAML mappers read by a deterministic translator, enforces structure with SHACL, and applies SPARQL rule suites for content checks. Large language models (LLMs) assist only in drafting YAML; the converter is deterministic and auditable. A pilot across three studies reduced manual TFL QC by up to 75% while surfacing discrepancies earlier.</p>
      </abstract>
      <kwd-group>
        <kwd>TFL</kwd>
        <kwd>Clinical Study Report</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>SHACL</kwd>
        <kwd>SPARQL</kwd>
        <kwd>YAML</kwd>
        <kwd>ETL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Industry Problem and Background</title>
    </sec>
    <sec id="sec-2">
      <title>2. Limitations of Traditional Solutions</title>
      <p>Relational warehouses can load the tables, but they do not encode relationships such as
derivesFrom or hasBaselineFlag. They therefore cannot answer cross-artifact questions like “Is
the baseline value in the Vital-Sign Table the same as in the Vital-Sign Listing?” Procedural
rule engines break when schemas drift, and version-controlling thousands of SQL files becomes
unmanageable. Column labels alone cannot capture semantics such as “baseline is the last
non-missing value before first dose.”</p>
    </sec>
    <sec id="sec-3">
      <title>3. Why Semantic Technologies</title>
      <p>An OWL vocabulary stabilizes concepts (subjects, visits, observations, summary cells), SHACL
expresses structural constraints, and SPARQL encodes cross-artifact content rules. Because
rules bind to concepts rather than column labels, validations remain stable even when table
layouts change.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ETL Pipeline</title>
      <sec id="sec-4-1">
        <title>Stage 1: RTF to JSON Conversion</title>
        <p>A layout-aware parser extracts tables/listings from RTF files, normalizes headers, and emits
canonical JSON while preserving complete provenance (file id, page, row, column). Figure 1
illustrates the complete ETL workflow.</p>
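        <p>As an illustration of the canonical JSON described above, the following minimal Python sketch serializes extracted cells together with their provenance fields. The RTF parsing itself is not shown; the Cell record, the header normalization rule, and the file name are assumptions made for the example, not the pipeline's actual schema.</p>
        <preformat><![CDATA[
```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Cell:
    """One extracted cell plus the provenance fields named in the text."""
    file_id: str
    page: int
    row: int
    column: int
    header: str
    value: str


def normalize_header(raw: str) -> str:
    """Collapse whitespace and case so layout variants share one header key."""
    return "_".join(raw.strip().lower().split())


def to_canonical_json(cells) -> str:
    """Serialize cells in a stable order so re-runs produce identical text."""
    ordered = sorted(cells, key=lambda c: (c.file_id, c.page, c.row, c.column))
    return json.dumps([asdict(c) for c in ordered], sort_keys=True, indent=2)


# Hypothetical cells; values stay as strings until the RDF stage casts them.
cells = [
    Cell("t14_3_1.rtf", 2, 5, 1, normalize_header("  Baseline   Value "), "120"),
    Cell("t14_3_1.rtf", 2, 4, 1, normalize_header("Baseline Value"), "118"),
]
canonical = to_canonical_json(cells)
print(canonical)
```
]]></preformat>
        <p>Sorting by the provenance key before serialization means that the order in which the parser happens to emit cells does not affect the output.</p>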
      </sec>
      <sec id="sec-4-2">
        <title>Stage 2: JSON to RDF Transformation</title>
        <p>A deterministic Python translator reads compact YAML mapper files to transform JSON into
RDF. The translator mints namespaced IRIs following consistent patterns, performs type casting
(e.g., xsd:decimal), threads prov:wasDerivedFrom relationships, and outputs idempotent
N-Triples suitable for re-runs and diffs.</p>
        <p>YAML mapper contents: Class instantiation targets, IRI templates, property mappings, and
provenance passthrough rules. Only the translator consumes YAML; the triplestore receives
the resulting RDF.</p>
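        <p>The translation step can be sketched in a few lines of Python. This is a simplified illustration, not the production translator: the mapper is shown as the dictionary a YAML loader would return, and the example.org namespace, the mapper keys, and the row fields are all hypothetical.</p>
        <preformat><![CDATA[
```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

# Hypothetical mapper, written as the dict a YAML loader would return.
MAPPER = {
    "class": "http://example.org/csr#Observation",
    "iri_template": "http://example.org/csr/obs/{subject}/{visit}/{test}",
    "properties": {
        "value": {"predicate": "http://example.org/csr#value",
                  "datatype": XSD_DECIMAL},
    },
    "provenance_field": "source",
}


def literal(text, datatype):
    """Format a typed literal in N-Triples syntax."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>'


def translate(rows, mapper):
    """Turn JSON rows into sorted, de-duplicated N-Triples text."""
    triples = set()
    for row in rows:
        # Percent-encode field values before placing them in the IRI template.
        safe = {k: quote(str(v), safe="") for k, v in row.items()}
        subject = "<" + mapper["iri_template"].format(**safe) + ">"
        triples.add(f"{subject} <{RDF_TYPE}> <{mapper['class']}> .")
        for field, spec in mapper["properties"].items():
            obj = literal(str(row[field]), spec["datatype"])
            triples.add(f"{subject} <{spec['predicate']}> {obj} .")
        source = row[mapper["provenance_field"]]
        triples.add(f"{subject} <{PROV_DERIVED}> <{source}> .")
    return "\n".join(sorted(triples)) + "\n"


rows = [
    {"subject": "P001", "visit": "BASELINE", "test": "SYSBP", "value": "120.0",
     "source": "http://example.org/csr/src/t14_3_1/p2/r5/c1"},
]
ntriples = translate(rows, MAPPER)
print(ntriples)
```
]]></preformat>
        <p>Because triples are collected in a set and sorted before serialization, translating the same rows twice, or in a different order, yields identical output, which is what makes re-runs and diffs meaningful.</p>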
        <p>YAML creation process: Mappers are drafted from 2–3 representative JSON rows via
constrained prompts, then human-reviewed and unit-tested before deployment.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Stage 3: Load and SHACL Validation</title>
        <p>The repository accepts only graphs satisfying predefined shapes. Conformance and violations
are recorded as machine-readable validation reports.</p>
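        <p>The gate logic can be illustrated without a triplestore. The sketch below is a dependency-free Python stand-in that checks only minimum and maximum property counts on a target class, analogous to sh:minCount and sh:maxCount; a real deployment would express these as SHACL shapes and run a SHACL engine. The shape dictionary, the prefixed names, and the report keys are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical shape: every ex:Observation needs exactly one ex:value
# and at least one prov:wasDerivedFrom link.
SHAPE = {
    "target_class": "ex:Observation",
    "properties": {
        "ex:value": {"min": 1, "max": 1},
        "prov:wasDerivedFrom": {"min": 1, "max": None},
    },
}


def validate(triples, shape):
    """Return a small report with a conforms flag and per-node violations."""
    counts = defaultdict(lambda: defaultdict(int))
    targets = set()
    for s, p, o in triples:
        counts[s][p] += 1
        if p == RDF_TYPE and o == shape["target_class"]:
            targets.add(s)
    results = []
    for node in sorted(targets):
        for path, bounds in sorted(shape["properties"].items()):
            n = counts[node][path]
            too_few = n < bounds["min"]
            too_many = bounds["max"] is not None and n > bounds["max"]
            if too_few or too_many:
                results.append({"focusNode": node, "path": path, "count": n})
    return {"conforms": not results, "results": results}


def load_if_conformant(store, triples, shape):
    """Append triples to the store only when the report conforms."""
    report = validate(triples, shape)
    if report["conforms"]:
        store.extend(triples)
    return report


store = []
good = [("ex:o1", RDF_TYPE, "ex:Observation"),
        ("ex:o1", "ex:value", "120"),
        ("ex:o1", "prov:wasDerivedFrom", "ex:src1")]
bad = [("ex:o2", RDF_TYPE, "ex:Observation"),
       ("ex:o2", "ex:value", "1"),
       ("ex:o2", "ex:value", "2")]
good_report = load_if_conformant(store, good, SHAPE)
bad_report = load_if_conformant(store, bad, SHAPE)
```
]]></preformat>
        <p>In this sketch the second graph is rejected as a whole, and its report names the focus node and property path of each violation.</p>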
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Automated Validation</title>
      <p>The automated suite runs after the SHACL gate and mirrors established QC practice on the
graph:</p>
      <p>Structure. SHACL ensures well-formedness (typing and cardinalities; mandatory properties
present; links such as observes point to Observation).</p>
      <p>Content. A library of SPARQL templates re-computes denominators/percentages, verifies
cross-table agreement (e.g., listings vs. summary tables), and enforces key business rules (e.g.,
baseline and change-from-baseline linkages; date constraints for dosing vs. events). Findings
are emitted as RDF validation reports with click-through provenance to file/page/row/column.</p>
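      <p>To make the denominator and percentage check concrete, the following Python sketch reproduces the arithmetic that such a rule performs; in the pipeline this logic is expressed as SPARQL over the graph. The record fields, the one-decimal half-up rounding convention, and the sample values are assumptions for illustration.</p>
      <preformat><![CDATA[
```python
from decimal import Decimal, ROUND_HALF_UP


def recompute_percent(numerator, denominator, places=1):
    """Return numerator/denominator as a percentage, rounded half up."""
    if denominator == 0:
        return None  # an empty population has no defined percentage
    pct = Decimal(numerator) * 100 / Decimal(denominator)
    return pct.quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)


def check_summary_cells(listing_rows, summary_cells, denominators):
    """Compare each printed n (%) cell with values recomputed from the listing."""
    findings = []
    for cell in summary_cells:
        # Count distinct subjects, so repeated events are not double counted.
        subjects = {
            r["subject"] for r in listing_rows
            if r["arm"] == cell["arm"] and r["event"] == cell["event"]
        }
        n = len(subjects)
        pct = recompute_percent(n, denominators[cell["arm"]])
        if n != cell["n"] or pct != Decimal(cell["percent"]):
            findings.append({
                "source": cell["source"],
                "printed": (cell["n"], cell["percent"]),
                "recomputed": (n, str(pct)),
            })
    return findings


# Hypothetical listing rows, summary cells, and arm denominators.
listing = [
    {"subject": "P001", "arm": "A", "event": "Headache"},
    {"subject": "P001", "arm": "A", "event": "Headache"},
    {"subject": "P002", "arm": "A", "event": "Headache"},
    {"subject": "P010", "arm": "B", "event": "Headache"},
]
summary = [
    {"arm": "A", "event": "Headache", "n": 2, "percent": "6.7",
     "source": "t14_3_1.rtf/p2/r5/c2"},
    {"arm": "B", "event": "Headache", "n": 1, "percent": "3.5",
     "source": "t14_3_1.rtf/p2/r5/c3"},
]
denominators = {"A": 30, "B": 29}
findings = check_summary_cells(listing, summary, denominators)
print(findings)
```
]]></preformat>
      <p>Each finding carries the source location of the printed cell, which corresponds to the click-through provenance described above.</p>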
      <p>
        Where rules come from. Rule templates are derived from sponsor QC checklists/SOPs,
CDISC SDTM and ADaM implementation guides (variables, flags, and derivations), and ICH E3
guidance on CSR structure/consistency [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Shapes and queries are version-controlled and
regression-tested prior to release.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. LLMs: Draft Only, Never Execute</title>
      <p>LLMs (GPT-4, Claude-3) are used only to draft YAML mappers, delivering ∼70% faster authoring;
the converter never calls an LLM, keeping ETL deterministic and auditable. Drafts do contain
hallucinations; a representative example is inconsistent IRIs:</p>
      <list list-type="bullet">
        <list-item><p>Input: Subject “P001” in multiple rows</p></list-item>
        <list-item><p>LLM: Generates both ex:subject/P001 and ex:subj/P-001</p></list-item>
        <list-item><p>Impact: Duplicate subjects and incorrect aggregations</p></list-item>
      </list>
      <p>Why fine-tuning and public APIs did not fit, and why prompt tuning did: public APIs lack fixed
seeds, so drafts vary across runs (failing audit reproducibility), while on-prem fine-tuning strained GPUs
(quantization degraded output; full precision was impractical). Constrained prompt tuning with
a YAML schema and 2–3 golden JSON→triple examples yielded stable drafts; the deterministic
translator preserved auditability.</p>
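      <p>Inconsistent subject IRIs of this kind can be caught mechanically before a drafted mapper is deployed. A minimal sketch, assuming subject IRIs should follow a single house pattern; the ex:subject/ pattern and the normalization rule below are hypothetical:</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict

# Hypothetical house pattern for subject IRIs.
APPROVED = re.compile(r"^ex:subject/[A-Z0-9]+$")


def normalize_id(iri):
    """Reduce an IRI to a comparable subject key (alphanumerics, upper case)."""
    local = iri.rsplit("/", 1)[-1]
    return re.sub(r"[^A-Za-z0-9]", "", local).upper()


def review_subject_iris(iris):
    """Return subjects minted under several IRIs, and IRIs off the pattern."""
    by_subject = defaultdict(set)
    for iri in iris:
        by_subject[normalize_id(iri)].add(iri)
    duplicates = {k: sorted(v) for k, v in by_subject.items() if len(v) > 1}
    off_pattern = sorted({i for i in iris if not APPROVED.match(i)})
    return duplicates, off_pattern


duplicates, off_pattern = review_subject_iris(
    ["ex:subject/P001", "ex:subj/P-001", "ex:subject/P002"]
)
print(duplicates, off_pattern)
```
]]></preformat>
      <p>A check like this fits naturally among the unit tests that gate mapper deployment, since a non-empty result points the reviewer at the exact IRIs to correct.</p>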
    </sec>
    <sec id="sec-7">
      <title>7. Business Value and Pilot Results</title>
      <p>Across three internal studies, manual TFL QC effort decreased by up to 75% with earlier detection
of denominator/baseline inconsistencies (internal benchmarks, Q4 2024). Gains stem from
removing duplicate programming and enabling click-through lineage for auditors. Regulatory
writers gain machine-readable provenance they can paste straight into submission dossiers.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>Separating semantics (ontology + shapes + rules) from layout (YAML mappers) yields robust,
auditable TFL validation for the CSR. The graph layer provides stable checks under schema drift
and click-through provenance from any printed number to its contributing rows.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we, the authors, used ChatGPT for grammar and spelling
checks, paraphrasing, and citation management. After using this service, we reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>CDISC</collab>
          .
          <source>Study Data Tabulation Model (SDTM) Implementation Guide</source>
          , Version 3.3,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>CDISC</collab>
          .
          <source>Analysis Data Model (ADaM) Implementation Guide</source>
          , Version 1.3,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>ICH</collab>
          .
          <source>E3: Structure and Content of Clinical Study Reports</source>
          ,
          <year>1995</year>
          (R1).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <collab>W3C</collab>
          .
          <source>Shapes Constraint Language (SHACL)</source>
          .
          <uri>https://www.w3.org/TR/shacl/</uri>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <collab>W3C</collab>
          .
          <source>SPARQL 1.1 Query Language</source>
          .
          <uri>https://www.w3.org/TR/sparql11-query/</uri>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>