<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Legal Question Answering through Structured Knowledge Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ankita Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Schilder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Thomson Reuters Foundational Research</institution>
          ,
          <addr-line>Minnesota</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Massachusetts Amherst</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large language models (LLMs) exhibit exciting potential to assist legal practitioners and enhance legal services while reducing costs. However, legal texts present unique challenges for automated processing due to their complex sentence structures, specialized terminology, and fact-intensive nature. These processing difficulties often impact downstream task performance, particularly in question answering scenarios that require careful reasoning about laws and precedents. In this work, we examine whether structuring knowledge in legal texts can improve LLMs' ability to answer legal questions. Our approach prompts LLMs to first generate structured triples of the form entity-relation-argument from a given legal text. We then prompt the LLM to answer questions based on these triples, which serve as a structured knowledge representation of the text. Our results demonstrate that this approach improves the performance of small language models like Qwen-3-8B and Llama-3.1-8B in two settings: a) when the gold passage relevant to the query is given, and b) when passages relevant to the query must be retrieved from a corpus.1</p>
      </abstract>
      <kwd-group>
        <kwd>Legal question answering</kwd>
        <kwd>Argumentation</kwd>
        <kwd>Knowledge graphs</kwd>
        <kwd>Large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Legal professionals worldwide are increasingly enthusiastic about integrating large language
models (LLMs) into their workflows to enhance legal services while reducing costs. Such AI tools are
already being deployed across various aspects of legal practice, including providing jurisdiction-specific
legal information, spotting potential issues in cases, creating legal documents, and numerous other
applications.1</p>
      <p>
        However, legal texts present unique challenges for automated processing due to their complex
sentence structures with multiple subclauses embedded within a single sentence. These texts are also
fact-intensive, interleaved with specialized legal terminology, making their comprehension difficult. For
instance, prior work has observed limitations in the ability of LLMs to distinguish between arguments
made by different legal actors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        These comprehension difficulties can significantly impact downstream task performance. For instance,
answering legal questions based on a relevant legal text requires the model to carefully understand and
reason about laws and precedent cases mentioned in the text and apply them to answer the specific
question. The challenge intensifies in real-world scenarios where relevant passages must first be
retrieved from extensive legal corpora, as in a retrieval augmented generation (RAG) pipeline [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In
such cases, imperfect retrieval mechanisms can introduce noisy or only partially relevant passages,
further complicating the question answering process.
      </p>
      <p>In this work, we examine whether structuring the knowledge present in the legal text can help
improve the LLMs’ ability to answer legal questions. In particular, before answering the question, we
prompt an LLM to generate triples of the form entity-relation-argument from the given legal
text(s), where entity refers to a legal actor making a statement in the text (e.g., litigants, judge, cited
precedents or statutes), argument is the information that is expressed by the entity in the legal text(s)
and relation is how the information is expressed, whether the entity supports or contradicts the
information. After extracting these triples, we further prompt the LLM to answer the question based on
these triples, which serve as a structured knowledge representation of the input legal text(s).</p>
      <p>Our results show that the proposed approach helps improve the performance of small language
models, such as Qwen-3-8B and Llama-3.1-8B, in both settings: a) when the query-relevant passage is
given and b) when the passage relevant to the query is not known and has to be retrieved from a given
legal corpus. Our work opens several interesting avenues for future work, such as using the proposed
approach to reduce misattributions in LLM-generated answers and improving legal reasoning models.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The term knowledge representation refers to various formats that encode structured meaning
representations in the form of pre-defined schemata representing entities and properties of these entities.
The knowledge is often encoded via entity-relation-entity triples (e.g., capital(France,
Paris)) extracted from plain texts or manually curated, which are used to construct knowledge graphs
or databases with special data formats or schemata [
        <xref ref-type="bibr" rid="ref3">3, 4</xref>
        ]. Our approach draws inspiration from these
methods, although our approach focuses on triples of the form entity-relation-argument,
extracting arguments supported by the entities as mentioned within legal texts, thus extending the prior
approaches beyond entity-level relations. In addition, we instruct the model to create triples at the
time of prompting and hence do not rely on a pre-curated knowledge graph (KG).
      </p>
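      <p>To make the contrast concrete, the two triple shapes can be sketched as plain Python types; the type and field names below are illustrative assumptions, not part of any cited system:</p>
      <p>
```python
from typing import NamedTuple

# Classic KG triple: both ends are entities, e.g. capital(France, Paris).
class EntityTriple(NamedTuple):
    head: str      # entity
    relation: str  # property
    tail: str      # entity

# Triple shape used in this work: the third slot is a free-text argument
# attributed to a legal actor rather than another entity.
class ArgumentTriple(NamedTuple):
    entity: str    # legal actor: litigant, judge, precedent, statute
    relation: str  # how the argument is expressed: Supports or Contradicts
    argument: str  # clause, sentence, or multi-sentence span

kg_fact = EntityTriple("France", "capital", "Paris")
legal_fact = ArgumentTriple(
    "Kaplan, 514 U.S. at 944",
    "Supports",
    "courts should apply ordinary state-law principles that govern "
    "the formation of contracts",
)
```
      </p>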
      <sec id="sec-2-1">
        <title>2.1. General knowledge graphs</title>
        <p>KGs are graph-structured knowledge bases that integrate entities and relations from diverse data sources
into a unified schema [5]. They typically employ semantic web standards (RDF, OWL) and ontologies to
model concepts and relationships, enabling rich semantic queries (e.g., SPARQL) and both deductive and
inductive reasoning. However, these KGs require pre-defined schemata, and it is impossible to anticipate
every type of relation and property encountered in a lawsuit. KGs are also difficult to maintain and
may be too rigid for legal reasoning use cases.</p>
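        <p>As a rough illustration of such pattern-based queries, a toy in-memory triple store can be sketched in Python, with None acting as a wildcard; this is illustrative only, since production KGs use RDF stores and a real query engine:</p>
        <p>
```python
# Toy in-memory triple store with SPARQL-style pattern matching, where
# None acts as a wildcard over (head, relation, tail) positions.
def match(triples, head=None, relation=None, tail=None):
    return [
        t for t in triples
        if (head is None or t[0] == head)
        and (relation is None or t[1] == relation)
        and (tail is None or t[2] == tail)
    ]

facts = [
    ("France", "capital", "Paris"),
    ("France", "memberOf", "EU"),
    ("Paris", "locatedIn", "France"),
]

# "What is the capital of France?"
answer = match(facts, head="France", relation="capital")
```
        </p>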
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Legal knowledge graphs</title>
        <p>Work in the domain of legal KG is closer to our approach because it often aims to structure case law,
statutes, regulations, and related documents into a semantic network. Nodes represent legal entities
(courts, cases, laws, legal concepts) and edges capture relations such as citations, amendments, or topic
hierarchies. Legal KGs often build on domain-specific ontologies, e.g., LKIF Core (https://github.com/RinkeHoekstra/lkif-core) and LegalRuleML (https://docs.oasis-open.org/legalruleml/legalruleml-core-spec/v1.0/os/legalruleml-core-spec-v1.0-os.html), to
model norms, actions, and agents. Similarly to general-purpose KGs, they tend to be cumbersome to
maintain, although recent work has shown the utility of combining KG modeling of Chinese criminal
statutes and historical cases, achieving much higher law-article recommendation accuracy [6]. Similarly,
Li et al. [7] propose the automated construction of a Chinese legal KG by fine-tuning a large language
model with legal prior knowledge, yielding a KG of thousands of legal triples.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Structured representations for prompting LLMs</title>
        <p>
          Structured data (graphs, tables, schemata) are increasingly used to guide large language models,
especially in domains like law that demand precise reasoning. Motivated by Chain-of-Thought (CoT)
approaches [8], recent work has used more structured data in the prompts, showing improvements
for complex problems that require multi-step reasoning. For example, frameworks such as StructGPT
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] leverage structured inputs by interleaving reading and reasoning phases: the system first queries
a knowledge graph or table to collect evidence, then prompts the LLM to reason over that evidence.
Hannah et al. [9] design a prompt-reformulation system tied to a legal KG: the KG is queried to generate
precise legal citations for issues raised by the LLM’s output, enriching and correcting the response but
not modeling the attribution of different statements to the respective legal actors.
        </p>
        <p>Overall, these works suggest that the explicit semantic structure created and retrieved from a KG
can improve the responses and even the reasoning capabilities of LLMs. In contrast to other work, we
prompt the LLM directly to produce the respective triple structure of legal entities and arguments, not
requiring a well-defined, pre-determined KG of legal concepts.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We next describe our method for generating triples from a given legal text. We extract these triples
from the legal text(s) that have either been provided in the input or retrieved from a legal
corpus to answer the question.</p>
      <sec id="sec-3-1">
        <title>3.1. Triples generation</title>
        <p>We prompt an LLM to generate triples of the form entity-relation-argument from the given
legal text(s). An entity refers to a legal actor making a statement in the text (e.g., litigants, judges,
precedents, statutes). An argument is the information that is expressed by the entity in the legal
text(s). An argument can be a clause within a sentence, complete sentences, or even multiple sentences,
based on the amount of information provided by the legal actor in the text. Finally, the relation
describes how the argument is expressed by the entity, i.e., whether the entity supports or contradicts
the argument. An example legal text along with extracted triples is shown in Figure 1.</p>
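        <p>The extraction step can be sketched as follows; the prompt wording and the parsing helper below are illustrative assumptions, not the paper's exact prompt or parser:</p>
        <p>
```python
import re

# Sketch of the triple-extraction step. The prompt wording is an
# illustrative assumption, not the paper's exact template.
EXTRACTION_PROMPT = (
    "Extract triples of the form entity-relation-argument from the legal "
    "text below. An entity is a legal actor (litigant, judge, precedent, "
    "statute); the relation states whether the entity Supports or "
    "Contradicts the argument.\n\nText:\n{text}"
)

def parse_triples(output):
    """Parse 'entity: / argument: / relation:' blocks emitted by the model
    into (entity, relation, argument) tuples."""
    pattern = re.compile(
        r"entity:\s*(.+?)\s*argument:\s*(.+?)\s*relation:\s*(.+?)\s*(?=entity:|$)",
        re.DOTALL,
    )
    return [(e.strip(), r.strip(), a.strip())
            for e, a, r in pattern.findall(output)]

sample = (
    "entity: Kaplan, 514 U.S. at 944\n"
    "argument: courts should apply ordinary state-law principles\n"
    "relation: Supports\n"
    "entity: Brown Bros. Elec. Contractors v. Beam Constr. Corp.\n"
    "argument: look to the objective manifestations of intent\n"
    "relation: Supports\n"
)
triples = parse_triples(sample)
```
        </p>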
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Answer generation</title>
        <p>After extracting the triples from the given legal text(s), we further prompt the LLM to answer the
question based on these triples. Our motivation is that allowing the model to represent the legal text in
a structured knowledge representation format can help its ability to comprehend and reason over it.</p>
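        <p>A minimal sketch of this second prompting step, rendering the triples as structured context for the question; the helper name and prompt template are illustrative assumptions, not the paper's exact prompt:</p>
        <p>
```python
# Sketch of the answer-generation step: extracted triples become
# structured context preceding the question and answer choices.
def build_answer_prompt(triples, question, choices):
    lines = ["- %s %s: %s" % (entity, relation.lower(), argument)
             for entity, relation, argument in triples]
    return (
        "Structured knowledge from the legal text:\n"
        + "\n".join(lines)
        + "\n\nQuestion: " + question
        + "\nChoices: " + "; ".join(choices)
        + "\nAnswer with the letter of the correct choice."
    )

prompt = build_answer_prompt(
    [("Kaplan, 514 U.S. at 944", "Supports",
      "courts apply ordinary state-law contract principles")],
    "Which law governs contract formation in arbitration disputes?",
    ["A) federal common law", "B) ordinary state law",
     "C) the UCC", "D) international law"],
)
```
        </p>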
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Large language models.</title>
          <p>We consider small open-source models including Llama-3.1-8B-Instruct and Qwen-3-8B. These
models are particularly valuable as they can be downloaded locally and help alleviate privacy concerns.
However, the off-the-shelf performance of such models is often not on par with their larger counterparts,
necessitating the development of methods to improve their performance.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Datasets.</title>
          <p>We consider the Bar Exam QA dataset introduced by Zheng et al. [10] for evaluation. Bar Exam
QA is a dataset of questions from the Multistate Bar Examination (MBE), which certifies law students to
practice law in the U.S. Each example in this dataset contains a legal scenario, a question about a specific
legal issue implicated in the scenario, a gold passage mentioning laws that can help answer the question,
and four answer choices. The task is to select the correct answer choice. We use the test split of this
dataset for our evaluations.</p>
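          <p>A hypothetical example record and a simple accuracy metric can illustrate the task shape; the field names below are assumptions, not the dataset's actual schema:</p>
          <p>
```python
# Hypothetical shape of one Bar Exam QA example plus a simple accuracy
# metric; field names are assumptions, not the dataset's real schema.
example = {
    "scenario": "A buyer and seller exchange signed letters about a sale...",
    "question": "Is there an enforceable contract?",
    "gold_passage": "An essential element of any contract is a mutual "
                    "intent to be bound.",
    "choices": ["A", "B", "C", "D"],
    "answer": "B",
}

def accuracy(predictions, examples):
    # Fraction of predicted choice letters matching the gold answer.
    correct = sum(p == ex["answer"] for p, ex in zip(predictions, examples))
    return correct / len(examples)
```
          </p>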
          <p>Figure 1: An example legal text with the triples extracted from it.</p>
          <p>Legal text: "When deciding whether the parties agreed to arbitrate a certain matter..., courts generally ... should apply ordinary state-law principles that govern the formation of contracts." Kaplan, 514 U.S. at 944, 115 S.Ct. 1920. It is black-letter law that "an essential element of any contract is a mutual intent to be bound," Martin H. Bauman Assocs. v. H&amp;M Int’l Transp., Inc., 171 A.D.2d 479, 567 N.Y.S.2d 404, 407 (1st Dep’t 1991), and that "there can be no contract absent a mutual intent to be bound." [...] It is also true that the question is not what each party subjectively intended; "it is necessary to look, rather, to the objective manifestations of the intent of the parties as gathered by their expressed words and deeds." Brown Bros. Elec. Contractors v. Beam Constr. Corp., 41 N.Y.2d 397, 399, 393 N.Y.S.2d 350, 361 N.E.2d 999 (1977).</p>
          <p>Extracted triples: (1) entity: Kaplan, 514 U.S. at 944; relation: Supports; argument: When deciding whether the parties agreed to arbitrate a certain matter..., courts generally ... should apply ordinary state-law principles that govern the formation of contracts. (2) entity: Martin H. Bauman Assocs. v. H&amp;M Int’l Transp., Inc.; relation: Supports; argument: an essential element of any contract is a mutual intent to be bound, and there can be no contract absent a mutual intent to be bound. (3) entity: Brown Bros. Elec. Contractors v. Beam Constr. Corp.; relation: Supports; argument: it is necessary to look, rather, to the objective manifestations of the intent of the parties as gathered by their expressed words and deeds.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Gold passage relevant to the query is given.</title>
          <p>We first examine the performance of our proposed approach in a simpler setting, where the legal text
that can help answer the query is already known. Thus, in the input, we provide the LLM with the relevant
legal passage and the query.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Gold passage relevant to the query is not given.</title>
          <p>In this case, we simulate a more realistic setting, where the passages useful for answering the legal
query are not known in advance and must be retrieved from a large legal corpus, as in a RAG setting.
To simulate this, we consider a simulated retriever that provides the LLM with input containing
the gold passage, distractor passages (intended to simulate imperfect retrieval), and the legal query. We
randomly order the gold and distractor passages.</p>
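          <p>The simulated retriever described above can be sketched as follows; the seeding and sampling details are assumptions made for reproducibility, not the paper's exact procedure:</p>
          <p>
```python
import random

# Sketch of the simulated retriever: mix the gold passage with k
# distractor passages in random order before passing them to the model.
def simulate_retrieval(gold, distractors, k, seed=0):
    rng = random.Random(seed)                       # seeded for reproducibility
    passages = [gold] + rng.sample(distractors, k)  # k imperfect retrievals
    rng.shuffle(passages)                           # hide the gold position
    return passages

gold = "Courts apply ordinary state-law principles to contract formation."
distractors = ["Unrelated passage %d." % i for i in range(10)]
context = simulate_retrieval(gold, distractors, k=3)
assert gold in context and len(context) == 4
```
          </p>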
          <p>Table 1: Performance on Bar Exam QA for Llama-3.1-8B-Instruct, Qwen-3-8B (no think), and Qwen-3-8B (think), with and without the intermediate triple-generation step (w/ Triples).</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <sec id="sec-4-3-1">
          <title>4.3.1. Triple generation helps improve performance.</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Performance with varying number of distractor passages in the RAG setting.</title>
          <p>We further examine the performance of the model when the gold passage is not given in advance and
must be retrieved from a corpus. To simulate this setting, in addition to the gold passage, we add a
varying number of distractor passages and provide all of them as input to the model. We again test the
Llama-3.1-8B-Instruct model’s performance with and without the triple generation step.</p>
          <p>As shown in Figure 2, the intermediate generation of triples consistently helps to achieve better
performance compared to the setting without triples. As expected, when the number of distractor
passages increases, the performance generally declines in both settings, as more noise in the input
affects the model’s performance. One potential reason for decreased performance gains is that LLMs
struggle to effectively process longer input contexts [12] when more passages are included in the
prompt. Nevertheless, the triple generation method maintains its advantage across different distractor
levels.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Our work demonstrates that structuring legal texts as entity-relation-argument triples significantly
enhances LLMs’ performance on legal question answering tasks. By decomposing complex legal texts
into structured triple representations before answering questions, we enable models like Qwen-3-8B
and Llama-3.1-8B to better navigate the inherent challenges of legal language. The effectiveness
of our approach across both controlled settings (with the gold passage given) and more realistic retrieval
settings highlights its robustness and practical utility. This structured knowledge extraction serves
as an effective intermediary step that helps models systematically process the complex relationships
between legal actors, their arguments, and the underlying legal principles. Future work can explore
the correlation between the complexity of the legal text (e.g., using measures such as the number of
subclauses) and the effectiveness of the triple generation approach for answering questions based on
this text to determine which types of complex texts benefit most from the proposed approach. The
triple generation approach can also be used for reflection over the generated answer, similar to the
self-reflection style prompting approaches [13].</p>
      <p>More broadly, our findings open several promising avenues for improving legal AI systems. Future
work could explore extending this approach to other legal tasks such as contract analysis, compliance
verification, and legal document drafting, as well as investigating how these structured representations
might improve explainability and reduce hallucinations in legal AI systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Perplexity and Ollama (Llama-3.1-8B) to help write the literature review, starting from a set of relevant papers, and to improve the clarity and conciseness of the text via LLM-generated suggestions. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[4] H. Li, J. Zhang, C. Li, H. Chen, RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL, in: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23, AAAI Press, 2023. URL: https://doi.org/10.1609/aaai.v37i11.26535. doi:10.1609/aaai.v37i11.26535.</p>
      <p>[5] L. Ehrlinger, W. Wöß, Towards a definition of knowledge graphs, in: SEMANTiCS (Posters, Demos, SuCCESS), 2016.</p>
      <p>[6] Y. Chen, M. Chen, Y. Zhu, J. Pei, S. Chen, Y. Zhou, Y. Wang, Y. Zhou, H. Li, S. Zhang, Leverage knowledge graph and large language model for law article recommendation: A case study of Chinese criminal law, 2025. URL: https://arxiv.org/abs/2410.04949. arXiv:2410.04949.</p>
      <p>[7] J. Li, L. Qian, P. Liu, T. Liu, Construction of legal knowledge graph based on knowledge-enhanced large language models, Information 15 (2024). URL: https://www.mdpi.com/2078-2489/15/11/666. doi:10.3390/info15110666.</p>
      <p>[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, D. Zhou, Chain of thought prompting elicits reasoning in large language models, CoRR abs/2201.11903 (2022). URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903.</p>
      <p>[9] G. Hannah, R. T. Sousa, I. Dasoulas, C. d’Amato, A prompt engineering approach and a knowledge graph based framework for tackling legal implications of large language model answers, 2024. URL: https://arxiv.org/abs/2410.15064. arXiv:2410.15064.</p>
      <p>[10] L. Zheng, N. Guha, J. Arifov, S. Zhang, M. Skreta, C. D. Manning, P. Henderson, D. E. Ho, A reasoning-focused legal retrieval benchmark, in: Proceedings of the 2025 Symposium on Computer Science and Law, 2025, pp. 169–193.</p>
      <p>[11] X. Li, Z. Yu, Z. Zhang, X. Chen, Z. Zhang, Y. Zhuang, N. Sadagopan, A. Beniwal, When thinking fails: The pitfalls of reasoning for instruction-following in LLMs, ArXiv abs/2505.11423 (2025). URL: https://api.semanticscholar.org/CorpusID:278715317.</p>
      <p>[12] M. Karpinska, K. Thai, K. Lo, T. Goyal, M. Iyyer, One thousand and one pairs: A “novel” challenge for long-context language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 17048–17085. URL: https://aclanthology.org/2024.emnlp-main.948/. doi:10.18653/v1/2024.emnlp-main.948.</p>
      <p>[13] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, S. Yao, Reflexion: Language agents with verbal reinforcement learning, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Curran Associates Inc., Red Hook, NY, USA, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Magesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Surani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>Hallucination-free? Assessing the reliability of leading AI legal research tools</article-title>
          ,
          <source>Journal of Empirical Legal Studies</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>StructGPT: A general framework for large language model to reason over structured data</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>9237</fpage>
          -
          <lpage>9251</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.574/. doi:10.18653/v1/2023.emnlp-main.574.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>