<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Ensemble LLMs and Contextual Embeddings for Case-Based Reasoning in the Legal Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ramitha Abeyratne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Robert Gordon University</institution>
          ,
          <addr-line>Aberdeen, Scotland</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research investigates the integration of Case-Based Reasoning (CBR) with Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) to enhance the reliability of legal question-answering systems. Thus far, we have developed a structured retrieval mechanism using CBR to improve the contextual relevance of generative outputs. Additionally, we introduced two novel alignment-based evaluation metrics, weighted and unweighted, which demonstrated superior performance over existing baselines in assessing QA responses. Our experimental validation on a legal dataset confirmed the effectiveness of the CBR-RAG approach in improving response accuracy. Moving forward, we aim to refine weighting strategies for alignment metrics and enhance textual representations to improve evaluation robustness. Furthermore, we plan to extend our study beyond the legal domain by conducting a comparative analysis across multiple datasets, ensuring broader applicability of the CBR-RAG framework.</p>
      </abstract>
      <kwd-group>
        <kwd>CBR</kwd>
        <kwd>RAG</kwd>
        <kwd>LLMs</kwd>
        <kwd>LLMs-as-Judges</kwd>
        <kwd>Case alignment</kwd>
        <kwd>Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the outputs
of Large Language Models (LLMs) by incorporating external knowledge sources [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This approach is
particularly crucial in specialised domains such as legal question-answering, where responses must
be both highly accurate and contextually relevant. Standalone LLMs often suffer from hallucinations,
largely due to their limited knowledge coverage and reliance on probabilistic text generation rather than
factual verification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, conventional RAG systems typically depend on generic information
retrieval mechanisms, which may not always provide structured or contextually appropriate content
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As a result, these limitations can lead to suboptimal outputs, reducing the overall reliability and
trustworthiness of such systems.
      </p>
      <p>
        To address these challenges, Case-Based Reasoning (CBR) presents itself as a structured retrieval
framework that leverages past cases to inform new queries [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Unlike traditional information retrieval
systems, CBR-based approaches contain multiple attributes that facilitate nuanced comparisons between
cases [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], making them particularly advantageous for legal applications. By integrating CBR with
RAG, retrieval processes can be significantly enhanced through structured similarity-based knowledge
extraction to ensure improved contextual alignment and greater accuracy in generated responses [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This fusion of methodologies allows for more precise case retrieval, leading to well-informed and legally
sound outputs.
      </p>
      <p>
        Despite the potential benefits of RAG-based legal QA systems, their effectiveness is highly dependent
on the availability of high-quality, annotated datasets that enable rigorous performance evaluation
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, the manual annotation of legal datasets is an exceptionally resource-intensive task,
requiring not only substantial time but also considerable domain expertise [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The complexity of legal
texts further magnifies these challenges as annotations must adhere to strict legal interpretations and
contextual nuances. Consequently, the scarcity of large-scale annotated datasets poses a significant
hurdle in the development and refinement of RAG-based legal QA systems.
      </p>
      <p>
        To mitigate these challenges, automated annotation methodologies have emerged as a promising
solution for enhancing the evaluation process [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Leveraging LLMs for dataset annotation, a concept
recently introduced as ‘LLM-as-a-Judge’, has demonstrated potential in expediting this otherwise
laborious task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, such existing methods are inherently susceptible to biases originating
from the applied LLMs themselves, which can lead to erroneous evaluations. Moreover, specific
biases—including positional bias, verbosity bias, and self-enhancement bias—can influence the reliability
of automated evaluations depending on the type of assessment being conducted [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These biases can
compromise the validity of the evaluation process and impact the reliability of legal QA systems.
      </p>
      <p>To overcome these limitations, we propose an AI evaluation framework based on ensemble
LLMs functioning as collective judges. The framework begins with the development of case
alignment-based assessment metrics to provide a structured evaluation process. These measures
are designed to avoid the bias issues associated with existing methods. Next, the CBR-infused RAG framework is applied to a
legal QA system, leveraging case-based reasoning to enhance retrieval and generation accuracy. Finally,
the proposed evaluation metrics are used to assess the effectiveness of CBR-RAG, ensuring a balanced
and reliable performance analysis. This approach strengthens the robustness and credibility of legal AI
systems by refining evaluation methodologies, supporting more dependable and context-aware legal
information retrieval.</p>
    </sec>
    <sec id="sec-1-2">
      <title>2. Research Plan</title>
      <sec id="sec-1-2-1">
        <title>2.1. Research Objectives</title>
        <list list-type="bullet">
          <list-item>
            <p>RO1: Develop an AI-driven evaluation framework using ensemble LLMs-as-a-judge to reduce
bias and enhance automated assessment through CBR-based alignment.</p>
          </list-item>
          <list-item>
            <p>RO2: Improve contextual embeddings from transformer networks to strengthen the
representation quality required for reliable alignment-based evaluation.</p>
          </list-item>
          <list-item>
            <p>RO3: Build a CBR-based RAG retrieval approach to provide structured, context-aware inputs for
generation, and evaluate its domain generalisability.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-1-3">
      <title>2.2. Approach / Methodology</title>
      <p>First, we build the QA evaluation framework on a concept called reverse generation (RO1):
given an answer to a question, LLMs acting as judges reverse-generate the question from that answer.
This reverse-generated question is then used to generate a reverse-generated answer
via the same judges. We establish our first evaluation metric by using the embedded representations of
reverse-generated questions and reverse-generated answers to form the problem and solution spaces
as shown in Figure 1. Contextual embeddings from the generating model are used to convert the
text responses to embeddings. The relevance of an answer to the original question is assessed by
evaluating the alignment between the original question and answer within this space. This concept
is heavily inspired by case alignment literature [9]. We name this evaluation metric Inter-Language
Model Reconstruction Alignment (ILRAlign). Another metric called Weighted ILRAlign (WILRAlign) is
formed by assigning dynamic weights based on problem similarity to improve evaluation reliability.</p>
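      <p>The reverse-generation procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function names (cosine, ilr_align, wilr_align), the equal blend of problem-space and solution-space similarity, and the use of normalised problem similarities as weights are assumptions chosen to match the description in this paper. The inputs are assumed to be precomputed embeddings of the original question and answer and of each judge's reverse-generated question and answer.</p>
      <preformat>
```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def ilr_align(q, a, rev_qs, rev_as, problem_weight=0.5):
    """Unweighted alignment (assumed form).

    For every judge, blend the problem-space similarity (original vs
    reverse-generated question) with the solution-space similarity
    (original vs reverse-generated answer), then average over judges.
    """
    scores = []
    for rq, ra in zip(rev_qs, rev_as):
        p = cosine(q, rq)
        s = cosine(a, ra)
        scores.append(problem_weight * p + (1.0 - problem_weight) * s)
    return float(np.mean(scores))


def wilr_align(q, a, rev_qs, rev_as):
    """Weighted alignment (assumed form).

    Solution-space similarities are averaged with weights given by the
    normalised problem-space similarities of the same judges.
    """
    p = np.array([cosine(q, rq) for rq in rev_qs])
    s = np.array([cosine(a, ra) for ra in rev_as])
    p = np.clip(p, 0.0, None)  # negative similarities receive zero weight
    total = p.sum()
    if total == 0.0:
        return float(s.mean())  # fall back to a plain average
    return float(np.dot(p / total, s))
```
      </preformat>
      <p>Under these assumptions, a judge whose reverse-generated question lies far from the original question contributes little to the weighted score, which is the intuition behind weighting by problem similarity.</p>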
      <p>To address limitations in current alignment reliability, we enhance the alignment-based measures
by optimising both weighting mechanisms and contextual embeddings (RO2). This builds directly on
RO1, where weighting can be applied to the problem and solution space alignment measurements,
as well as during dynamic weighting in WILRAlign. For the initial set of experiments, we utilise
contextual embeddings of text from the last hidden layer. We also aim to explore strategies such as
basic embeddings and aggregated embeddings from different layers [10]. This is expected to have a
significant impact on the results as the evaluation metric relies entirely on numerical representations.</p>
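      <p>The embedding strategies mentioned above differ only in which hidden layers are pooled. The sketch below illustrates this with NumPy on a toy array. The array layout (layers, tokens, dimensions), the function names, the choice of mean pooling, and the reading of "basic embeddings" as the layer-0 representation are all assumptions made for illustration; no specific model API is implied.</p>
      <preformat>
```python
import numpy as np


def basic_embedding(hidden_states):
    """Mean-pool the tokens of layer 0 (assumed reading of 'basic embeddings')."""
    return hidden_states[0].mean(axis=0)


def last_layer_embedding(hidden_states):
    """Mean-pool the tokens of the final hidden layer."""
    return hidden_states[-1].mean(axis=0)


def aggregated_embedding(hidden_states, num_layers=4):
    """Average the last num_layers layers, then mean-pool over tokens."""
    stacked = hidden_states[-num_layers:]  # shape: (num_layers, tokens, dim)
    return stacked.mean(axis=0).mean(axis=0)


# Toy stand-in for model output: 6 layers, 5 tokens, 8 dimensions.
hidden_states = np.arange(6 * 5 * 8, dtype=float).reshape(6, 5, 8)
print(last_layer_embedding(hidden_states).shape)  # (8,)
```
      </preformat>
      <p>Because every strategy returns a vector of the same size, the alignment measures can be recomputed with each one without further changes.</p>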
      <p>Finally, we implement a Case-Based Reasoning and Retrieval-Augmented Generation (CBR-RAG)
architectural method to enable structured retrieval of legal cases and improve the contextual grounding
of LLM outputs (RO3). This involves developing a case-based retrieval mechanism that indexes legal
cases to leverage similarity knowledge containers, thereby improving retrieval performance. Figure 2
illustrates this at a high level: multiple cases are retrieved and fed into the generator. We utilise
two types of embeddings for case-base embedding: one optimised for retrieval and another for similarity
assessment. Additionally, dynamic weighting of case attributes is applied to enhance retrieval accuracy.
RO3 is designed to be disjoint from RO1 and RO2, and is evaluated using the matured output of RO2 to
ensure effective integration and factual grounding.</p>
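      <p>Attribute-weighted case retrieval of the kind described above can be illustrated as follows. This is a sketch under stated assumptions, not the CBR-RAG implementation: the attribute names ("question", "context"), the weights, the case layout and the retrieve function are hypothetical, and the weights are fixed here rather than dynamically adjusted.</p>
      <preformat>
```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def retrieve(query, case_base, weights, k=3):
    """Return the ids of the k cases with the highest weighted similarity.

    query:     dict mapping attribute name to embedding
    case_base: list of dicts with an "id" and an "attrs" dict of embeddings
    weights:   dict mapping attribute name to a non-negative weight
    """
    total = sum(weights.values())
    scored = []
    for case in case_base:
        sim = sum(
            w * cosine(query[attr], case["attrs"][attr])
            for attr, w in weights.items()
        ) / total
        scored.append((sim, case["id"]))
    scored.sort(reverse=True)  # most similar case first
    return [case_id for _, case_id in scored[:k]]


# Hypothetical two-attribute case base with toy 2-D embeddings.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
case_base = [
    {"id": "c1", "attrs": {"question": e2, "context": e1}},
    {"id": "c2", "attrs": {"question": e1, "context": e2}},
    {"id": "c3", "attrs": {"question": e1, "context": e1}},
]
query = {"question": e1, "context": e2}
weights = {"question": 0.7, "context": 0.3}
top_cases = retrieve(query, case_base, weights, k=2)
```
      </preformat>
      <p>The retrieved case ids would then be used to assemble the context passed to the generator.</p>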
    </sec>
    <sec id="sec-2">
      <title>3. Progress Summary</title>
      <p>The question-and-answer evaluation metrics (RO1), ILRAlign and WILRAlign, were successfully
implemented and tested on the Australian Legal Question Answering (ALQA) [11] and Sri Lankan
Supreme Court (SLSC) datasets. We generated 10 sets of answers for these datasets by varying the
generator model’s temperature parameter. Four 7-billion-parameter LLMs were used in the experiments:
variants of Gemma, Llama, Mistral and Falcon. A leave-one-out strategy was
employed for evaluating the judges.</p>
      <p>Similarity was measured using cosine similarity by comparing the candidate answers to the ground
truth. Finally, Pearson’s coefficient was used to calculate the correlation between the metrics and
variations in the ground-truth similarity. Table 1 presents these results, demonstrating that our novel
metrics outperform traditional techniques by a significant margin. The detailed findings have been
accepted for publication in the proceedings of the International Conference on Case-Based Reasoning
(ICCBR 2025).</p>
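      <p>The correlation step described above amounts to a single Pearson computation between metric scores and ground-truth similarities. The snippet below is a minimal sketch. The pearson helper is a hypothetical wrapper around NumPy, and the numbers are toy values chosen for illustration, not results from our experiments.</p>
      <preformat>
```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy values only: one metric score and one ground-truth cosine
# similarity per generated answer set.
metric_scores = [0.61, 0.64, 0.70, 0.75, 0.81]
ground_truth_similarity = [0.55, 0.60, 0.66, 0.72, 0.80]
correlation = pearson(metric_scores, ground_truth_similarity)
```
      </preformat>
      <p>A coefficient close to 1 would indicate that the metric rises and falls with the ground-truth similarity across the generated answer sets.</p>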
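      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Document count for RAG</th>
              <th>No embeddings</th>
              <th>BERT</th>
              <th>LegalBERT</th>
              <th>AngleBERT</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>No RAG</td>
              <td>0.897</td>
              <td>N/A</td>
              <td>N/A</td>
              <td>N/A</td>
            </tr>
            <tr>
              <td>1 - RAG with context</td>
              <td>N/A</td>
              <td>0.899</td>
              <td>0.902</td>
              <td>0.912</td>
            </tr>
            <tr>
              <td>1 - RAG with full case</td>
              <td>N/A</td>
              <td>0.907</td>
              <td>0.904</td>
              <td>0.907</td>
            </tr>
            <tr>
              <td>3 - RAG with context</td>
              <td>N/A</td>
              <td>0.900</td>
              <td>0.903</td>
              <td>0.909</td>
            </tr>
            <tr>
              <td>3 - RAG with full case</td>
              <td>N/A</td>
              <td>0.900</td>
              <td>0.905</td>
              <td>0.914</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>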
    </sec>
    <sec id="sec-3">
      <title>4. Future Work</title>
      <p>The alignment-based question-and-answer evaluation metric, which utilises an ensemble of LLMs,
produced significant improvements over existing baseline approaches. This advancement
highlights the potential of ensemble-based methodologies in enhancing the accuracy and fairness of
automated legal assessments. However, the initial prototype was developed with equal weightings
assigned to both the problem space and solution space when computing alignment. While this provided a
balanced approach, it may not always reflect the true complexity and nuance of question-answering tasks.
A key area for future research involves refining these weightings to optimise alignment calculations and
ensure a more contextually appropriate evaluation framework. Additionally, we plan to explore more
sophisticated weighting strategies for WILRAlign, moving beyond the simple normalised weighting
derived from problem space similarity. By incorporating adaptive and dynamic weighting mechanisms,
we aim to enhance the precision of our evaluation process.</p>
      <p>Furthermore, we will focus on enhancing the representation of textual data as embeddings, which
plays a crucial role in the effectiveness of our evaluation metrics. Given the multiple layers within
a transformer model, we intend to conduct a comprehensive study to identify the most effective
representation for alignment measurement. A deeper understanding of how different embedding layers
influence alignment computations should allow us to improve performance. This investigation
will include an exploration of alternative embedding methodologies to optimise representation learning
for legal text alignment.</p>
      <p>Subsequently, we aim to validate the adaptability of our Case-Based Reasoning-infused
Retrieval-Augmented Generation (CBR-RAG) architecture. While initial tests on the Australian Legal QA dataset
have yielded promising results, we plan to conduct a comparative study across multiple domains such
as healthcare, finance and education to assess its broader applicability and generalisability. Extending
our evaluation to different domains will provide deeper insights into the robustness and scalability of
our approach. By assessing its effectiveness across diverse knowledge-intensive sectors, we can refine
the model to better accommodate the specific requirements of each domain.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This study successfully developed an AI-driven evaluation framework that leverages ensemble LLMs
to enhance the grading of legal QA systems. Our proposed alignment-based metrics—ILRAlign and
WILRAlign—demonstrated superior performance in assessing the overall quality of generated legal
responses. Furthermore, by integrating CBR with RAG, we enhanced the retrieval of legal cases,
ensuring that generated responses are informed by structured, contextually relevant knowledge. This
advancement improves the overall generation quality of LLMs, making them more reliable for legal
applications.</p>
      <p>While our current evaluation framework applies uniform weightings to the problem and solution
spaces, future research will focus on dynamic weighting mechanisms to improve reliability. Additionally,
further investigation into transformer-based embedding optimisation will enhance representation
learning, with the goal of refining legal text alignment measurements. By continuously improving these
methodologies, we aim to develop a highly effective evaluation framework that not only improves legal
NLP applications but also serves as a benchmark for AI-driven legal reasoning. This work lays the
foundation for more robust, reliable and precise AI-driven legal reasoning in legal NLP applications.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used ChatGPT for the purpose of: grammar and spelling
check, paraphrase and reword. After using this tool, the author reviewed and edited the content as
needed and takes full responsibility for the publication’s content.</p>
      <p>[9] S. Massie, N. Wiratunga, S. Craw, A. Donati, E. Vicari, From anomaly reports to cases, 2007, pp. 359–373. doi:10.1007/978-3-540-74141-1_25.</p>
      <p>[10] C. Tao, T. Shen, S. Gao, J. Zhang, Z. Li, Z. Tao, S. Ma, LLMs are also effective embedding models: An in-depth overview, 2024. URL: https://arxiv.org/abs/2412.12591. arXiv:2412.12591.</p>
      <p>[11] U. Butler, Open Australian Legal QA, 2023. URL: https://huggingface.co/datasets/umarbutler/open-australian-legal-qa. doi:10.57967/hf/1479.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , T. Liu,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>43</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          . URL: http://dx.doi.org/10.1145/3703155. doi:10.1145/3703155.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abeyratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jayawardena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Massie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nkisi-Orji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weerasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fleisch</surname>
          </string-name>
          ,
          <article-title>CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.04302. arXiv:2404.04302.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Upadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Massie</surname>
          </string-name>
          ,
          <article-title>A case-based approach for content planning in data-to-text generation</article-title>
          ,
          <source>in: Int. Conf. on CBR</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>380</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <article-title>A case-based persistent memory for a large language model</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.08842. arXiv:2310.08842.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Montgomery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuadron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>JudgeBench: A benchmark for evaluating LLM-based judges</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.12784. arXiv:2410.12784.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Multi-news+: Cost-efficient dataset cleansing via LLM-based data annotation</article-title>
          , in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), 2024 EMNLP, ACL, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>29</lpage>
          . URL: https://aclanthology.org/2024.emnlp-main.2/. doi:10.18653/v1/2024.emnlp-main.2.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A survey on LLM-as-a-judge</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>