<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RuleSum: Injecting Rulesets into Knowledge Graphs for Accurate and Accessible Legal Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saumya Chauhan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aditi Chandrashekar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>California Institute of Technology</institution>
          ,
          <addr-line>Pasadena, CA 91125</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York, NY 10003</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Legal texts are often complex and inaccessible, limiting understanding for non-experts. Large language models (LLMs) can summarize such material but often sacrifice interpretability and factual accuracy. We present RuleSum, a framework that integrates structured rulesets and knowledge graphs (KGs) with LLMs to generate legal summaries that are faithful, readable, and pedagogically aligned. Leveraging the IRAC method (Issue, Rule, Application, Conclusion) as a reasoning scaffold, RuleSum applies three structured representations, namely free-form, tuple-style (KAPING), and IRAC-labeled serialization, to guide summarization. We provide a framework for evaluation along multiple axes, including semantic consistency and readability. We evaluate our approach on the MultiLexSum dataset, using ROUGE-L for lexical overlap with reference summaries, SBERT for semantic similarity, and Flesch-Kincaid Grade Level (FKGL) for readability. The IRAC-guided KAPING-IRAC configuration consistently outperforms all baselines, achieving the highest alignment with reference summaries while maintaining accessibility for general audiences. Finally, we provide open-source code (github.com/Saumya-Chauhan-MHC/Rulesum) and an interactive Gradio-based demo that visualizes how each pipeline stage improves clarity and factual grounding, supporting future applications of structured reasoning for education and decision-making.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>structured prompting</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>educational NLP</kwd>
        <kwd>evaluation frameworks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Legal documents are notoriously complex, making it difficult for students and non-experts to identify
and understand critical information. While large language models (LLMs) like GPT-4 can produce
fluent summaries of such documents, their reasoning processes remain opaque and rarely follow the
formal structure expected in legal analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In high-stakes domains such as law, where accuracy
and clarity are paramount, this lack of interpretability (e.g. opaque reasoning chains) and potential for
factual error (hallucinated or imprecise details) greatly limits the usefulness of raw LLM-generated
summaries. Learners cannot easily discern how an LLM arrived at a conclusion or verify the facts in its
summary. These limitations highlight the need for summarization frameworks that provide not only
concise overviews, but also transparent reasoning and faithfully grounded content. Existing LLM-based
approaches fall short in educational settings, motivating new methods that align summaries with formal
legal reasoning to improve interpretability and trust.
      </p>
      <p>
        One approach to improve factual alignment and transparency is to augment LLMs with structured
knowledge. Prior work has integrated knowledge graphs (KGs) and extracted fact triples into the
summarization or QA process to better ground outputs in verifiable information. For example, the
KAPING system retrieves and verbalizes relevant KG triples to assist in zero-shot question answering,
and KG-to-text frameworks rewrite subgraphs into fluent, answer-oriented sentences – underscoring
how the serialization format of facts can influence a summary’s factual accuracy [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. Beyond
retrieval-based augmentation, systems like RTSUM rank triples by salience to improve interpretability,
and graph condensation methods (e.g., PCSG, HCSumm, KGTrimmer) preserve key semantics in a
compact form [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. However, even these structured augmentation methods do not explicitly
incorporate domain-specific reasoning models or educational scaffolds. In the context of legal education,
a summary needs to explain why and how conclusions are reached, not just state the conclusions, which
calls for integrating formal reasoning patterns into the summarization process.
      </p>
      <p>
        The legal domain introduces additional challenges and opportunities for structured summarization.
Legal reasoning often follows a formal schema taught in law schools, such as the IRAC framework (Issue,
Rule, Application, Conclusion), which provides a canonical scaffold for argumentation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Datasets
like MultiLexSum align lengthy case documents with expert-written summaries (including issue and
holding annotations), underscoring the need to capture not just facts but the logical flow of arguments
in a case [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Incorporating an IRAC-based scaffold into the summarization pipeline can therefore help
align generated summaries with the way legal reasoning is communicated by experts. Yet, few existing
summarization systems fully leverage such formal reasoning schemas, limiting their effectiveness as
educational tools for illustrating the step-by-step logic of legal decisions. This gap suggests that a
combination of structured knowledge and domain-specific reasoning cues could significantly improve
the interpretability of legal summaries.
      </p>
      <p>
        In this paper, we propose RuleSum, a structured summarization framework that injects rulesets and
knowledge graphs into the LLM’s generation process to produce legal summaries that are faithful,
readable, and pedagogically aligned. RuleSum leverages the IRAC method as a reasoning scaffold and
applies multiple representation strategies – from free-form text rephrasings to tuple-style fact triples
(KAPING format) and IRAC-labeled text segments – to guide the LLM in breaking down complex legal
language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. By unifying symbolic knowledge with the generative fluency of LLMs, this design enables
the model to capture not only the content of a case, but also the logical progression of its arguments
in the summary. We evaluate our approach on the MultiLexSum dataset, using metrics that assess
both factual fidelity and accessibility (ROUGE-L for overlap with reference summaries, SBERT for
semantic similarity, and Flesch–Kincaid Grade Level for readability) [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref9">10, 11, 12, 9</xref>
        ]. In experiments,
an IRAC-guided configuration of RuleSum consistently outperforms a baseline GPT-4 summarizer,
achieving higher alignment with expert reference summaries while maintaining a lower reading level
(i.e. improved readability for general audiences). These results demonstrate that injecting structured
knowledge and legal reasoning cues into the summarization process can substantially improve both the
accuracy and clarity of the generated summaries.
      </p>
      <p>
        Our contributions are the following:
      </p>
      <p>
        1. A novel legal summarization framework that decomposes complex legal language using
multiple structured representations and an IRAC-guided pipeline. RuleSum is the first
framework (to our knowledge) that explicitly integrates the IRAC reasoning schema into both
the knowledge graph construction and the LLM prompting stages of summarization, aligning
the summary’s structure with formal legal reasoning steps. The framework uniquely combines
several representation methods – including free-form text extraction, knowledge-graph triples,
and IRAC-labeled text segments – at different stages of the pipeline to break down dense “legalese”
into more accessible language without losing the case’s logical structure. By using IRAC-based
quotas in triple selection and organizing facts under Issue/Rule/Application/Conclusion labels,
our approach ensures that each part of the summary corresponds to a reasoning role, which sets
it apart from prior legal summarization systems that lack such structured scaffolding [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        2. A comprehensive evaluation methodology that disentangles readability and factuality in
zero-shot legal summarization. We design a factorial study that (i) controls summary length to
probe completeness–clarity trade-offs, (ii) varies triple-selection policies to contrast reasoning-role
coverage with semantic relevance, and (iii) ablates structured components across the pipeline (KG
use, serialization, and a no-knowledge baseline). This yields 48 configurations per case and ∼1,200
summaries overall, evaluated with ROUGE-L (lexical alignment), SBERT (semantic similarity),
and FKGL (readability), plus readability-quartile analyses to assess behavior on harder texts.
Beyond reporting aggregate gains, this protocol isolates each module’s contribution and surfaces
regime-specific strengths (e.g., IRAC-quota + KAPING-IRAC under higher FKGL), providing a
reusable template for rigorous, component-wise evaluation of structured LLM summarizers in
education-oriented, high-stakes domains [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ].
      </p>
      <p>
        3. An interactive visualization tool and demo that enhances transparency in the
summarization process. We developed a visualization interface allowing users to explore which
parts of the source text and which knowledge graph triples contribute to each section of the
generated summary. This interactive demo shows how a particular Issue sentence in the summary
is grounded in specific source passages or facts in the graph. Such transparency allows users
(e.g., law students) to trace the origin of each summarized point, building trust in the summary’s
accuracy. This insight helps users learn how to better formulate prompts and questions for LLMs
in similarly complex domains, as they can see the effects of structured guidance on the output. We
have open-sourced the code and demo for RuleSum, enabling others to examine the step-by-step
reasoning integration and supporting future applications of structured reasoning in education
and decision-support [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        LLM Summarization with Structured Augmentation. To improve factual grounding in specialized
domains, recent work augments LLMs with knowledge graphs (KGs). KAPING retrieves and verbalizes
relevant triples for zero-shot prompting, boosting factuality without fine-tuning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. RRA instead
retrieves a subgraph and rewrites it into coherent text for the prompt, showing that how KG evidence is
serialized (triples vs. narrative) strongly affects downstream quality [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Triple Selection, Salience, and Interpretability. Selecting which facts to include is crucial at
scale. RTSUM ranks relation triples using multi-level salience and then “sentencifies” top items, yielding
concise outputs with traceable evidence links—an interpretability benefit directly relevant to legal
settings [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Graph Condensation and Pruning. KG summarization/pruning methods target compact yet
representative subgraphs. PCSG optimizes pattern coverage and connectivity for RDF snippets [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ];
HCSumm preserves latent structure by approximately maintaining embedding distances [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; and
KGTrimmer prunes nodes/edges by dual-view importance for recommendation without hurting accuracy
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our use of top-k triples differs: we condense for reasoning coverage, not maximum compression or
structural preservation [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ].
      </p>
      <p>
        Serialization Formats for LLMs. KG evidence can be injected as compact tuples (KAPING-style)
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or rewritten sentences (RRA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; the former is precise, the latter easier for LLMs to assimilate. We
compare these with an IRAC-labeled format that groups facts by reasoning role, explicitly scaffolding
the argumentative flow.
      </p>
      <p>
        Positioning of Our Approach. RuleSum combines KG augmentation with a domain-specific schema:
IRAC labels guide both selection (IRAC-Quota) and serialization, complementing
relevance/salience-based selection (e.g., RTSUM) and prior KG-to-text prompting (e.g., KAPING/RRA) [
        <xref ref-type="bibr" rid="ref13 ref2 ref4">2, 4, 13</xref>
        ]. This
alignment to legal reasoning is orthogonal to general graph condensation goals [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Solution</title>
      <sec id="sec-3-1">
        <title>3.1. Problem Formulation and Workflow</title>
        <p>Legal case texts are not only long but also follow a formal reasoning structure, making them challenging
for non-experts. Large language models (LLMs) like GPT-4 can produce fluent summaries, but their
reasoning is often opaque and unstructured. To bridge this gap, RuleSum introduces a pipeline that
integrates knowledge graphs (KGs) and formal rulesets into the summarization process for better factual
accuracy and interpretability. We leverage the IRAC method (Issue, Rule, Application, Conclusion) as a
reasoning scaffold – a canonical structure in legal analysis – to guide the summarization towards legal
pedagogical norms.</p>
        <p>Workflow: As shown in Figure 1, our system first parses the input legal document into a KG of
factual triples (subject, relation, object) capturing key entities, events, and outcomes. This KG (optionally
enriched with IRAC labels) is then serialized into text and injected into the prompt of an LLM (GPT-4).
Finally, the LLM generates a summary conditioned on this structured context. The design unifies
symbolic structure with LLM fluency, allowing the model to capture not just relevant content but
also the logical progression of arguments. In essence, RuleSum’s pipeline ensures that the summary
remains faithful to the case facts and accessible in explanation, by combining the strengths of knowledge
encoding and natural language reasoning. The following subsections describe each component and its
motivation in detail.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Knowledge Graph Construction</title>
        <p>
          The first stage is to construct a knowledge graph from the legal text. We employ an LLM-assisted
extraction module (built with LangChain’s graph-extraction tools [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]) to identify salient entities,
relations, and facts from the case document. Each fact is represented as a triple (e.g., Party_A – wins –
claim_X), forming a case-specific KG. This approach builds on prior work in knowledge-augmented
prompting, such as the KAPING system which retrieves and verbalizes triples to assist QA models. By
distilling the source text into a set of triples, we create a compact, structured representation of the
case’s key points.
        </p>
        <p>Crucially, RuleSum can incorporate IRAC-guided knowledge graphs to inject domain reasoning. If
IRAC integration is enabled, the extraction step tags each triple with one of the IRAC roles – Issue,
Rule, Application, or Conclusion. For example, a triple describing the central legal question would be
labeled as an Issue, whereas a triple stating the court’s decision would be labeled as a Conclusion. This
IRAC-tagged graph provides a scaffolded view of the case: it not only lists facts, but also situates them
in the argument’s logical structure. If IRAC labeling is turned off, the KG remains unstructured (a plain
set of triples). By encoding legal reasoning roles into the KG, we enable the summarizer to more readily
follow the flow of a legal analysis, rather than treating all facts as equal or disconnected. This step is
inspired by the formalism of rhetorical role annotation in legal texts and ensures that our framework
aligns with how experts write case analyses.</p>
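        <p>As a minimal sketch of the representation described above, an IRAC-tagged triple can be modeled as a small record whose role field is empty when IRAC labeling is turned off. The names Triple, IRAC_ROLES, and group_by_role are hypothetical and are not taken from the RuleSum code base; the example facts are invented for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# The four IRAC reasoning roles named in the text.
IRAC_ROLES = ("Issue", "Rule", "Application", "Conclusion")


@dataclass(frozen=True)
class Triple:
    """One extracted fact; `role` is None when IRAC labeling is turned off."""
    subject: str
    relation: str
    obj: str
    role: Optional[str] = None

    def __post_init__(self):
        # Reject labels outside the IRAC schema.
        if self.role is not None and self.role not in IRAC_ROLES:
            raise ValueError(f"unknown IRAC role: {self.role!r}")


def group_by_role(triples):
    """Bucket triples by IRAC role; unlabeled triples go under 'Unlabeled'."""
    groups = defaultdict(list)
    for t in triples:
        groups[t.role or "Unlabeled"].append(t)
    return dict(groups)


# Hypothetical example facts (not drawn from MultiLexSum).
kg = [
    Triple("Court", "must decide", "whether duty of care was breached", "Issue"),
    Triple("Party_A", "wins", "claim_X", "Conclusion"),
    Triple("Party_A", "filed", "complaint"),  # unstructured-KG style triple
]
grouped = group_by_role(kg)
```
]]></preformat>
        <p>Keeping the role optional lets the same record type serve both the IRAC-guided and the unstructured KG settings.</p>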
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Triple Selection and Serialization</title>
        <p>Given the full KG (which may contain dozens of triples), the next challenge is to select and linearize
the most relevant facts for the LLM. Including every extracted triple could overwhelm the prompt or
introduce irrelevant details. We therefore experiment with two complementary top-k selection policies
for choosing a subset of triples:</p>
        <p>IRAC-Quota selection: This policy ensures balanced coverage of each IRAC reasoning role. As
seen in Figure 2, we allocate slots for triples such that Issue, Rule, Application, and Conclusion facts
are all represented in the prompt. With our choice of k = 8 triples, the quota policy selects the top 2
triples from each IRAC category (assuming enough triples exist per category). This guarantees that
the summary isn’t missing an essential part of the legal reasoning, addressing interpretability and
completeness. Our IRAC-Quota strategy extends prior work on guided summarization – whereas
RTSUM ranked triples by general salience, our method explicitly balances salience with logical role
coverage to preserve the case’s argumentative structure.</p>
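        <p>The quota policy can be sketched as follows, assuming each candidate triple already carries an IRAC role and a salience score. The function name irac_quota_select, the tuple layout, and in particular the backfill rule for roles with too few triples are our assumptions; the text only states that two triples per role are chosen when enough exist.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

IRAC_ROLES = ("Issue", "Rule", "Application", "Conclusion")


def irac_quota_select(scored_triples, k=8):
    """Pick up to k triples with an equal per-role quota (k // 4 each).

    `scored_triples` is a list of (triple, role, score) tuples; a higher
    score means a more salient triple. If a role has fewer triples than
    its quota, leftover slots are filled with the best remaining triples
    from any role (this backfill rule is an assumption).
    """
    quota = k // len(IRAC_ROLES)
    by_role = defaultdict(list)
    for item in scored_triples:
        by_role[item[1]].append(item)

    selected, leftovers = [], []
    for role in IRAC_ROLES:
        ranked = sorted(by_role[role], key=lambda it: it[2], reverse=True)
        selected.extend(ranked[:quota])
        leftovers.extend(ranked[quota:])

    # Backfill any unused slots by score, regardless of role.
    leftovers.sort(key=lambda it: it[2], reverse=True)
    selected.extend(leftovers[: k - len(selected)])
    return selected


# Hypothetical scored candidates: (triple, role, salience score).
candidates = [
    (("Court", "considers", "breach of duty"), "Issue", 0.9),
    (("Court", "considers", "standing"), "Issue", 0.7),
    (("Court", "considers", "venue"), "Issue", 0.4),
    (("Statute", "requires", "reasonable care"), "Rule", 0.8),
    (("Precedent", "defines", "duty of care"), "Rule", 0.6),
    (("Defendant", "ignored", "safety warnings"), "Application", 0.9),
    (("Plaintiff", "suffered", "injury"), "Application", 0.8),
    (("Defendant", "controlled", "premises"), "Application", 0.5),
    (("Party_A", "wins", "claim_X"), "Conclusion", 0.9),
]
picked = irac_quota_select(candidates, k=8)
```
]]></preformat>
        <p>In this toy input only one Conclusion triple exists, so the unused slot is backfilled with the highest-scoring leftover triple.</p>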
        <p>Similarity-based selection: This policy ranks candidate triples by their semantic relevance to
the source document (or a reference summary) using embedding-based similarity. The intuition is to
pick facts that best capture the core content of the case. This approach connects to established graph
condensation methods like PCSG and KGTrimmer, which aim to retain the most informative parts of a
knowledge graph. By using sentence embeddings, we prioritize triples that cover important concepts
and events from the case, serving a role similar to how prior systems select supporting facts for QA.
We include this policy to benchmark a purely relevance-driven selection against the more structured
IRAC-Quota strategy.</p>
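        <p>A relevance-driven ranking of this kind might look like the sketch below. To keep the example self-contained, it uses a bag-of-words count vector (toy_embed) as a stand-in for sentence embeddings; this is an assumption for illustration and not the encoder used in the paper. All function names are hypothetical, and swapping in a dense sentence encoder would also require a cosine function over dense vectors.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (NOT a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def similarity_select(triples, document, k=8, embed=toy_embed):
    """Return the k triples whose text is most similar to the document."""
    doc_vec = embed(document)
    scored = [(cosine(embed(" ".join(t)), doc_vec), t) for t in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


# Hypothetical example inputs.
document = "The defendant owed the plaintiff a duty of care and breached that duty."
triples = [
    ("Defendant", "owed", "duty of care"),
    ("Clerk", "stamped", "filing"),
    ("Defendant", "breached", "duty"),
]
top = similarity_select(triples, document, k=2)
```
]]></preformat>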
        <p>After selection, the chosen triples must be serialized into a textual form that can be fed into the LLM
prompt. We implement three serialization formats in the RuleSum pipeline:</p>
        <p>Free-form verbalization: Each triple is converted into a natural language sentence, yielding a
short paragraph of text. For example, a triple (Defendant – owed – duty) might be verbalized as “The
defendant owed a duty.” This free-form mode aims for readability and seamless integration into the
prompt, at the cost of possibly losing the explicit triple structure. It aligns with approaches where KG
facts are rewritten into narrative statements for the model (e.g., the RRA Retrieve-Rewrite-Answer
framework that turns subgraphs into answer-oriented sentences).</p>
        <p>Structured tuple (KAPING) format: Triples are presented in a terse, tuple-style notation. We
adopt the format used by KAPING, which might list facts as (Subject; Relation; Object) or a similar
semi-structured template. For instance: (Defendant; owes; duty of care). This compact representation
preserves the precise subject-predicate-object structure, providing the LLM with clear symbolic facts.
Prior work has shown that such structured prompts can improve factual grounding in zero-shot settings.
However, tuples alone lack explicit cues about their role in the argument.</p>
        <p>IRAC-labeled serialization (KAPING-IRAC): This mode groups the selected triples under IRAC
section headers. By prefixing each group of facts with its IRAC category, we provide the model with
strong hints of the legal reasoning context for each fact. This format, essentially KAPING tuples
organized by IRAC, offers explicit reasoning cues while retaining symbolic precision. We expect this
to guide the LLM to produce summaries that follow the Issue-Rule-Application-Conclusion narrative,
improving interpretability. Notably, our use of IRAC labels in serialization is a novel addition building
on the idea that structural organization of input can steer an LLM’s output style. It goes beyond
surface-level fact inclusion by embedding an outline of the argument within the prompt.</p>
        <p>By comparing these three formats, we explore the trade-offs between a more natural language context
(free-form), a concise factual context (tuple format), and a structured reasoning context (IRAC-labeled).
Past studies underscore that how knowledge is presented in the prompt greatly affects factual grounding,
so this component is key to RuleSum’s design.</p>
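        <p>The three serialization formats can be contrasted with a small sketch. The function names, the naive template used for free-form verbalization, and the exact header strings (for example "Rule:") are assumptions made for illustration; the text does not specify RuleSum's exact output layout.</p>
        <preformat preformat-type="code"><![CDATA[
```python
IRAC_ROLES = ("Issue", "Rule", "Application", "Conclusion")


def serialize_free_form(triples):
    """Verbalize each (subject, relation, object, role) triple as a sentence."""
    return " ".join(f"{s} {r} {o}." for s, r, o, _role in triples)


def serialize_kaping(triples):
    """Tuple-style listing, one '(Subject; Relation; Object)' per line."""
    return "\n".join(f"({s}; {r}; {o})" for s, r, o, _role in triples)


def serialize_kaping_irac(triples):
    """Tuple-style listing grouped under IRAC section headers."""
    lines = []
    for role in IRAC_ROLES:
        members = [t for t in triples if t[3] == role]
        if members:
            lines.append(f"{role}:")
            lines.extend(f"({s}; {r}; {o})" for s, r, o, _ in members)
    return "\n".join(lines)


# Hypothetical selected triples with IRAC roles.
selected = [
    ("The defendant", "owed", "a duty of care", "Rule"),
    ("The court", "held for", "the plaintiff", "Conclusion"),
]
free_form = serialize_free_form(selected)
kaping = serialize_kaping(selected)
kaping_irac = serialize_kaping_irac(selected)
```
]]></preformat>
        <p>Only the IRAC-labeled variant exposes the reasoning role of each fact to the model; the other two drop the role field when writing the prompt text.</p>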
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Prompt Engineering and Summary Control</title>
        <p>The final component of RuleSum is the prompt engineering that integrates the selected and serialized
facts into the LLM’s input and controls the summary generation. We use GPT-4 as our base
summarization model. The prompt to GPT-4 includes two main elements: (a) the knowledge-graph content (in one
of the serializations above) and (b) instructions specifying the desired summary style and length.</p>
        <p>Length control: We direct GPT-4 to produce summaries in one of three target length ranges to
examine the effect of compression. In particular, we define a tiny summary (50 words), a short summary
(150 words), and a long summary (300 words). The prompt explicitly mentions the target length
(e.g., “Summarize the case in about 150 words...”). By varying length, we can study the trade-off
between informativeness and readability: shorter summaries force the model to be concise, while longer
ones allow more details at the risk of increased complexity. We found that GPT-4 adheres to these
length instructions reliably, which is crucial for fair evaluation. Each length setting serves a different
educational use case – from a quick gist (tiny) to a detailed study aid (long) – and testing across them
provides insight into how our structured approach scales with summary depth.</p>
        <p>Prompt structure and summary control: In the prompt, the chosen facts (free-form sentences or
tuples with optional IRAC headings) are typically prefaced by a brief instruction (e.g., “Facts: . . . ”) and
followed by the summarization cue (e.g., “Summary: Please provide a summary of the case...”). This
ensures the model treats the injected KG facts as contextual information and not as the final answer. By
labeling the sections (and using IRAC headings in one mode), we nudge the model to organize its output
in a logical order. The combination of factual grounding and IRAC cues in the prompt is designed to
produce summaries that are both faithful to the case facts and aligned with the pedagogical structure
of legal reasoning. Any hallucinations or deviations from the facts can be minimized since the model
can rely on the injected triples as a source of truth, similar in spirit to retrieve-and-read pipelines in
open-domain QA. In case the initial summary is lacking clarity or completeness, the pipeline allows for
an optional refinement step (e.g., passing the draft through another prompt or applying minor edits),
though our main evaluations focus on the single-pass output. Overall, this prompt engineering strategy
gives us control over the summary’s content (via selected facts), structure (via IRAC labels), and length,
making the generation process more transparent and tunable compared to a vanilla LLM prompt.</p>
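        <p>The prompt layout described above (a facts section, then a summarization cue with an explicit target length) might be assembled as in the following sketch. The function name build_prompt and the exact instruction wording are assumptions; only the "Facts:" and "Summary:" cues and the three target lengths come from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Target lengths from the text: tiny (50), short (150), long (300) words.
TARGET_WORDS = {"tiny": 50, "short": 150, "long": 300}


def build_prompt(serialized_facts, length="short"):
    """Assemble a 'Facts: ... Summary: ...' prompt with a length instruction.

    Raises KeyError if `length` is not one of the three defined settings.
    """
    n_words = TARGET_WORDS[length]
    return (
        "Facts:\n"
        f"{serialized_facts}\n\n"
        f"Summary: Please provide a summary of the case in about {n_words} words, "
        "using the facts above as context."
    )


# Hypothetical usage with one tuple-style fact.
prompt = build_prompt("(Defendant; owes; duty of care)", length="tiny")
```
]]></preformat>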
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>To validate the effectiveness of RuleSum, we designed a comprehensive evaluation on a real-world legal
summarization dataset with multiple controlled variations. We used the MultiLexSum dataset, which
consists of long U.S. case documents paired with expert-written summaries and annotated issue–holding
pairs. From this corpus, 75 cases were sampled for evaluation (the same split as in prior work): 50
cases for development (parameter tuning) and 25 cases held out for testing. Each reference summary in
MultiLexSum serves as a gold standard to evaluate content coverage and writing quality.</p>
      <p>LLMs and libraries. We implement RuleSum using LangChain as the orchestration layer around
OpenAI chat models. Unless otherwise noted, all LLM calls use the same base model (OpenAI gpt-4o),
with temperature set to 0.0, top_p = 1.0, and max_tokens = 2048. We fix the random seed for
all non-LLM components and call each configuration once; with temperature 0.0, knowledge-graph
extraction and summarization are effectively deterministic up to minor non-determinism in the API.</p>
      <p>Evaluation dimensions: We systematically evaluate our system across several dimensions to isolate
the impact of each component. For each test case, we generated summaries under a full factorial
combination of conditions:</p>
      <p>Knowledge Graph Integration: IRAC-guided KG vs. Unstructured KG. In the IRAC-guided setting,
the input triples are tagged with IRAC roles (Issue/Rule/Application/Conclusion) during extraction; in
the unstructured setting, no such labels are used (the triples are factual only). This tests the effect of
injecting the IRAC reasoning structure into the pipeline.</p>
      <p>Triple Selection Policy: IRAC-Quota vs. Similarity-based. We compare the balanced selection of facts
across IRAC categories to a purely relevance-driven selection by semantic similarity. This dimension
shows whether emphasizing reasoning coverage trades off any relevance in content selection.</p>
      <p>Serialization Format: Free-form sentences vs. KAPING tuples vs. IRAC-labeled (KAPING-IRAC). This
evaluates how the format of injected knowledge affects the summary. Free-form provides a fluent
context, tuple format provides a compact factual list, and IRAC-labeled provides a structured outline.</p>
      <p>Target Summary Length: Tiny (50 words) vs. Short (150 words) vs. Long (300 words). By varying
the allowed length, we examine performance at different compression levels – from extremely concise
summaries to more detailed ones.</p>
      <p>In addition to the above, we include a baseline condition for comparison: a zero-shot GPT-4 summary
with no knowledge graph input (i.e., the model only sees the case text and is prompted to summarize it).
For fairness, the baseline is generated under the same length constraints as our structured runs (tiny,
short, long for each case). This yields a point of reference to quantify the benefit of RuleSum’s injected
structure.</p>
      <p>Combining the factors above, each test case is summarized under all 2 × 2 × 3 × 3 = 36 structured
configurations, plus the baseline variants. In total, our evaluation produced on the order of a thousand
generated summaries (36 structured configurations per case × 25 test cases = 900 summaries, before baseline and ablation runs) for analysis. This
exhaustive setup allows us to conduct a detailed ablation study. Specifically, we not only compare
the full RuleSum pipeline to the baseline, but also evaluate partial variants where we enable only one
structured component at a time (e.g., only adding the KG without IRAC labels, or only using IRAC
prompting without the KG, etc.). These ablation experiments help in attributing performance gains
to specific components of the pipeline (KG, IRAC structure, selection policy). All generation used the
same underlying model (GPT-4) and prompting approach, to control for variability.</p>
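      <p>The factorial grid can be enumerated directly, which makes the configuration counts easy to check. The identifier names below are hypothetical labels for the four factors described above; the baseline count assumes one zero-shot summary per length setting per case, as stated in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import product

# Hypothetical labels for the four experimental factors.
KG_MODES = ("irac_guided", "unstructured")
SELECTION_POLICIES = ("irac_quota", "similarity")
SERIALIZATIONS = ("free_form", "kaping", "kaping_irac")
LENGTHS = ("tiny", "short", "long")

# Full factorial grid of structured configurations.
structured_configs = list(
    product(KG_MODES, SELECTION_POLICIES, SERIALIZATIONS, LENGTHS)
)
# Zero-shot baseline: one run per length setting.
baseline_configs = [("baseline", length) for length in LENGTHS]

n_test_cases = 25
print(len(structured_configs))                 # 36
print(len(structured_configs) * n_test_cases)  # 900
print(len(baseline_configs) * n_test_cases)    # 75
```
]]></preformat>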
      <p>Evaluation metrics: We assessed each generated summary along three primary metrics that capture
complementary aspects of summary quality:</p>
      <p>ROUGE-L (F-measure): This metric measures lexical overlap between the generated summary and the
reference summary, focusing on the longest common subsequence. ROUGE-L provides a sense of how
much of the important content (as written by experts) is captured in the model’s summary. A higher
ROUGE-L indicates better coverage of reference facts or phrases. We report ROUGE-L as a percentage,
with higher being better for content fidelity.</p>
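      <p>For readers unfamiliar with the metric, the sketch below computes a balanced ROUGE-L F-measure from the longest common subsequence of lowercase whitespace tokens. It is a simplified illustration: the tokenization, the absence of stemming, and the function names are our assumptions, and the scores reported in this paper may come from a standard ROUGE package with different preprocessing.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F-measure on lowercase whitespace tokens (no stemming)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: five of six tokens align in order.
score = rouge_l_f1(
    "the court ruled for the plaintiff",
    "the court ruled against the plaintiff",
)
```
]]></preformat>
      <p>Multiplying the returned value by 100 gives the percentage form used when reporting ROUGE-L.</p>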
      <p>SBERT-based Similarity: To evaluate semantic fidelity, we use Sentence-BERT (SBERT) embeddings to
compute the similarity between each generated summary and the source document. This
embedding-based score (cosine similarity) reflects how well the summary preserves the meaning of the original
text, beyond exact word overlap. We compare our summaries’ SBERT scores to those of the baseline
and even the expert summaries, to understand the semantic retention. Higher SBERT scores indicate
that the summary contains information semantically closer to the full document.</p>
      <p>Flesch–Kincaid Grade Level (FKGL): We measure the readability of the summaries using the FKGL
readability test, which approximates the U.S. grade school level required to understand the text. A
lower FKGL score means the summary is easier to read (simpler vocabulary and sentence structure).
This metric is crucial for our goal of accessible summaries – we want legal summaries that a broader
audience can comprehend. We compute FKGL for each summary and also compare against the reference
and source texts as benchmarks. For instance, the original cases often have high FKGL (legal jargon),
while good summaries should ideally have a lower FKGL for accessibility.</p>
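      <p>The FKGL score follows the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below implements it with a rough vowel-group syllable heuristic; the heuristic, the sentence splitter, and the function names are assumptions, so values may differ slightly from those of dedicated readability libraries.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text):
    """Flesch-Kincaid Grade Level of a text (lower means easier to read)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical examples: a plain sentence versus dense legal phrasing.
easy = fkgl("The court dismissed the case.")
hard = fkgl(
    "Notwithstanding the aforementioned jurisdictional considerations, "
    "the appellate tribunal affirmed the interlocutory determination."
)
```
]]></preformat>
      <p>As expected, the dense legal sentence receives a higher grade level than the plain one.</p>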
      <p>All metrics are computed for each system configuration and length. We do not rely on any single
metric; rather, we consider the trade-offs – e.g., does a structured method improve ROUGE (content)
and SBERT (meaning) without hurting FKGL (readability)? The evaluation is designed to answer such
questions. Importantly, we refrain from hand-tuning the summaries for these metrics; all outputs are
model-generated under the specified conditions, ensuring an apples-to-apples comparison. In the next
section, we will present the results of these experiments, highlighting how RuleSum performs under
different settings and which combination of knowledge injection strategies proves most effective.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We summarize outcomes for our framework described in Figure 1, comparing the zero-shot baseline,
single-component ablations, and the best combined variant. Aggregate metrics and ablations appear in
Tables 1 and 2, while difficulty-stratified and length-sensitivity trends are shown in Figures 3 and 4.</p>
      <p>Semantic fidelity and readability across lengths. Figure 4 shows that structured prompting
preserves or improves semantic fidelity (SBERT cosine to the source document) relative to both the
baseline and expert references across all target lengths, with the largest margin at tiny (50 words),
where gains reach up to +0.12 SBERT over gold. In parallel, the same figure indicates that structured
variants generally have lower or comparable FKGL (better or similar readability) than the baseline
across lengths, with the largest readability margin at the shortest (tiny) summaries. This suggests that
the injected structure guides the model toward clearer phrasing rather than harming fluency. As length
increases from tiny → short → long, SBERT rises as expected and FKGL increases slightly, yet structured
variants retain a readability edge.</p>
      <p>Difficulty-stratified performance. To assess robustness under varying case complexity, we stratify
by FKGL quartiles of the source documents (Fig. 3). On easier cases (Q1–Q2), methods are broadly
comparable; on harder cases (Q3–Q4), the kaping_irac+IRAC-quota configuration tends to achieve
slightly higher mean ROUGE-L than the baseline (on the order of +0.006), although these differences
are small in absolute terms. This pattern suggests that explicit reasoning-role scaffolds may be most helpful
when language is denser, consistent with the intuition that structure mitigates complexity, even if the
gains are modest.</p>
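      <p>The quartile stratification can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for Figure 3: the record fields (source_fkgl, rouge_l), the helper names, and all numeric values are hypothetical, and the inclusive quantile method is one reasonable choice among several.</p>
      <preformat>
```python
from collections import defaultdict
from statistics import mean, quantiles


def quartile_label(value: float, cuts: list[float]) -> str:
    """Map a value to Q1..Q4 given the three quartile cut points."""
    for index, cut in enumerate(cuts, start=1):
        if cut >= value:
            return f"Q{index}"
    return "Q4"


def stratify(records: list[dict]) -> dict[str, float]:
    """Mean ROUGE-L per quartile of source-document FKGL."""
    cuts = quantiles([r["source_fkgl"] for r in records], n=4, method="inclusive")
    buckets = defaultdict(list)
    for r in records:
        buckets[quartile_label(r["source_fkgl"], cuts)].append(r["rouge_l"])
    return {label: mean(scores) for label, scores in sorted(buckets.items())}


# Made-up per-document records for illustration only (assumption).
records = [
    {"source_fkgl": 10.0, "rouge_l": 0.30},
    {"source_fkgl": 11.0, "rouge_l": 0.32},
    {"source_fkgl": 12.0, "rouge_l": 0.28},
    {"source_fkgl": 13.0, "rouge_l": 0.30},
    {"source_fkgl": 14.0, "rouge_l": 0.26},
    {"source_fkgl": 15.0, "rouge_l": 0.28},
    {"source_fkgl": 16.0, "rouge_l": 0.24},
    {"source_fkgl": 17.0, "rouge_l": 0.26},
]
by_quartile = stratify(records)
```
      </preformat>
      <p>Running the same function once per system configuration gives per-quartile means that can be compared directly, for example the baseline against a structured variant on Q3 and Q4.</p>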
      <p>Ablation trends and complementary effects. Tables 1 and 2 summarize aggregate trends when
toggling individual components (KG ruleset, serialization, selection policy). Among the configurations
we tested, the IRAC-guided KG + KAPING_IRAC serialization + IRAC-quota selection configuration
achieves one of the strongest trade-offs between semantic fidelity and readability: it marginally improves SBERT
over the baseline (0.714 vs. 0.713) while lowering FKGL (13.04 vs. 13.27). Another configuration achieves
a slightly higher SBERT score (0.715) at the cost of higher FKGL (13.07), so we prioritize the configuration
that yields easier-to-read summaries under comparable semantic similarity. Overall, configurations
missing at least one of IRAC KG, KAPING_IRAC serialization, or IRAC-quota selection show weaker
performance on one or more metrics, indicating complementary contributions from each component.</p>
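      <p>The selection rule described above (prefer the most readable configuration among those with comparable semantic similarity) can be written as a short sketch. The SBERT and FKGL values are the aggregate figures quoted in this section; the tolerance of 0.0015, the configuration labels, and the function name are assumptions made for illustration, since no explicit threshold is specified.</p>
      <preformat>
```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    sbert: float
    fkgl: float


def pick_configuration(results: list[Result], sbert_tolerance: float) -> Result:
    """Among configurations whose SBERT is within the tolerance of the best,
    return the one with the lowest FKGL (easiest to read)."""
    best_sbert = max(r.sbert for r in results)
    comparable = [r for r in results if r.sbert >= best_sbert - sbert_tolerance]
    return min(comparable, key=lambda r: r.fkgl)


# Aggregate values quoted in the text; labels are hypothetical.
results = [
    Result("baseline", 0.713, 13.27),
    Result("irac_kg+kaping_irac+irac_quota", 0.714, 13.04),
    Result("alternative", 0.715, 13.07),
]
chosen = pick_configuration(results, sbert_tolerance=0.0015)
```
      </preformat>
      <p>With these numbers the rule selects the IRAC-guided configuration: its SBERT score is within the tolerance of the highest score, and its FKGL is the lowest of the comparable candidates.</p>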
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our structured prompting approach boosts both fidelity and clarity in legal case summaries. Across all
target lengths, IRAC-guided configurations achieve SBERT scores that are comparable to or slightly
higher than a zero-shot GPT-4 baseline, and at the shortest length they can exceed expert summaries
by up to +0.12 SBERT cosine similarity. These gains come without sacrificing readability: on average,
our preferred configuration yields lower or comparable Flesch–Kincaid Grade Levels relative to the
baseline, indicating summaries that are at least as accessible to non-expert readers. The benefit of
this structured strategy is most pronounced on complex, high-density texts, where it yields slightly higher
ROUGE-L overlap with the reference summaries (on the order of +0.006) and helps avoid the omission or
misalignment errors seen in the baseline. Finally, ablation studies indicate that each component of our
pipeline (IRAC-tagged knowledge graphs, structured KAPING_IRAC serialization, and an IRAC-based quota)
contributes: removing any one component weakens performance on at least one metric, underscoring the
value of incorporating domain-specific structure to guide large language models.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly in order to perform grammar and
spelling checks throughout the manuscript. After using this tool, the authors reviewed and edited the
content. They take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2403.05530</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2403.05530, accessed: 2025-10-06.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saffari</surname>
          </string-name>
          ,
          <article-title>Knowledge-augmented language model prompting for zero-shot knowledge graph question answering</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023)</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>98</lpage>
          . URL: https://aclanthology.org/2023.matching-1.7/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Koncel-Kedziorski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bekal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Text generation from knowledge graphs with graph transformers</article-title>
          ,
          <source>in: NAACL-HLT</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2284</fpage>
          -
          <lpage>2295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Retrieve-rewrite-answer: A kg-to-text enhanced llms framework for knowledge graph question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2309.11206</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2309.11206. doi:10.48550/arXiv.2309.11206.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>From large-scale graphs to condensed graph-free data</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <article-title>Node embedding preserving graph summarization</article-title>
          ,
          <source>ACM Transactions on Knowledge Discovery from Data</source>
          (
          <year>2024</year>
          ). HCSumm.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Knowledge graph pruning for recommendation</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2024</year>
          . KGTrimmer.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <source>Legal Reasoning and Legal Writing: Structure, Strategy, and Style</source>
          , 6th ed., Aspen Publishers, New York,
          <year>2009</year>
          .
          <comment>Introduces the IRAC (Issue-Rule-Application-Conclusion) framework widely used in legal analysis</comment>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <article-title>Multi-lexsum: Real-world summaries of civil rights lawsuits at multiple granularities</article-title>
          ,
          <source>in: NeurIPS 2022 Datasets and Benchmarks Track</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>
          ,
          <source>arXiv preprint arXiv:1908.10084</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Kincaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Fishburne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Chissom</surname>
          </string-name>
          ,
          <source>Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel</source>
          ,
          <source>Technical Report Research Branch Report 8-75</source>
          , Naval Technical Training Command, Research Branch, Memphis, TN
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>RTSUM: Relation triple-based interpretable summarization with multi-level salience visualization</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference of the North American Chapter of the ACL: Demonstrations</source>
          , Association for Computational Linguistics,
          <year>2024</year>
          . URL: https://aclanthology.org/2024.naacl-demo.5/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pan</surname>
          </string-name>
          , et al.,
          <article-title>Pcsg: Pattern-coverage snippet generation for RDF datasets</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2021</source>
          , volume
          <volume>12922</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2021</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          . URL: https://dl.acm.org/doi/10.1007/978-3-030-88361-4_1. doi:10.1007/978-3-030-88361-4_1.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Node embedding preserving graph summarization</article-title>
          ,
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD)</source>
          <volume>18</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          . URL: https://openreview.net/forum?id=PLCM2cQz8b.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          LangChain,
          <article-title>LangChain: Building applications with large language models</article-title>
          , https://www.langchain.com/,
          <year>2023</year>
          . Accessed: 2025-10-06.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>