<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Evaluation of Legal Essay Reasoning with Transfer-Learned LLMs: A Crowdsourced Elo Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ying-Chu Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsuan-Lei Shao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Law, National Taiwan University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate Institute of Health and Biotechnology Law, Taipei Medical University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper presents a semantic evaluation framework for legal large language models (LLMs), designed to assess performance on essay-style questions requiring interpretive reasoning, issue identification, and context-sensitive application of legal doctrine. We perform supervised transfer learning on a curated corpus of national bar exam essays to align a foundation model with civil law reasoning patterns. To evaluate model outputs, we develop a browser-based interface that supports pairwise comparison through an Elo ranking mechanism, enabling systematic aggregation of expert preferences. In contrast to traditional benchmarks centered on retrieval accuracy or discrete classification, the proposed framework captures the inherently open-textured nature of legal reasoning, where multiple doctrinally plausible interpretations may coexist. The preference-based Elo method provides a scalable means of modeling consensus among legal readers and highlights how training configurations influence reasoning quality. Empirically, moderate batch sizes and controlled training epochs yield more coherent and generalizable analyses, whereas overfitting diminishes interpretive depth and argumentative breadth. This work contributes to the semantic evaluation of legal texts by integrating transfer-learned LLMs with a human-in-the-loop, preference-driven assessment protocol. The framework offers a reproducible methodology for examining the interpretive adequacy of legal AI systems and lays groundwork for future evaluation research in high-stakes, ambiguity-rich domains where reasoning quality matters as much as factual correctness.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal large language models</kwd>
        <kwd>legal essay reasoning</kwd>
        <kwd>supervised fine-tuning</kwd>
        <kwd>crowdsourced evaluation</kwd>
        <kwd>Elo scoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this paper, we investigate the ability of LLMs to answer legal essay questions, which require open-ended reasoning,
issue identification, argumentative coherence, and sensitivity to contextual nuance.</p>
      <p>The Taiwanese legal education and examination system offers a compelling testbed for studying this
problem. National judicial officer and bar examinations rely heavily on essay-style questions that probe
analytical depth rather than rote recall. Over decades, an extensive commercial ecosystem—composed
of publishers, cram schools, and private tutoring services—has produced model answers, doctrinal
commentaries, and annotated corpora tailored specifically to these examinations. As a result, Taiwan
provides a uniquely dense, highly localized, and pedagogically curated dataset for analyzing legal essay
reasoning. These corpora not only shape the professional formation of lawyers, prosecutors, and judges
but also reflect a broader knowledge industry that standardizes interpretive reasoning patterns.</p>
      <p>
        While LLMs have shown remarkable proficiency in essay generation and automated scoring in
domains such as English composition [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ], legal essay evaluation presents fundamentally different
challenges. Legal problems are inherently open-textured: multiple doctrinally plausible answers may
coexist, and the quality of an essay often depends not on reaching a predetermined conclusion but
on selecting a defensible analytical direction. This makes “correctness” difficult to operationalize.
Furthermore, the scarcity of high-quality legal essay datasets means that models trained primarily on
court judgments, academic articles, or general legal content lack sufficient exposure to the structure
and rhetorical conventions of exam-style reasoning. Even when LLMs mimic legal tone and citation
patterns, they often struggle with issue spotting—the ability to identify the core disputes embedded in
multi-layered fact patterns and link them to relevant legal principles.
      </p>
      <p>These challenges underscore the need for systematic investigation into how LLMs can be optimized
and evaluated for open-ended legal reasoning. Understanding the limits of current models, the impact of
fine-tuning strategies, and the design of human-aligned evaluation protocols is essential for responsibly
deploying LLM-based tools in legal education and professional practice. By focusing on essay-style
reasoning within a localized and pedagogically significant corpus, this study seeks to advance both
methodological rigor and domain-specific understanding of legal AI systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work: Generative Legal AI and Its Evaluation</title>
      <sec id="sec-2-1">
        <title>2.1. Generative AI on Legal Studies and Limitations on Essay Tasks</title>
        <p>
          Pretrained language models such as LegalBERT[
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ] have been fine-tuned on large-scale legal corpora to
better capture the unique characteristics of legal language. While pretrained models such as LegalBERT
have achieved success in structured legal tasks, they typically underperform in open-ended reasoning
tasks such as legal essay writing or argument construction. These tasks demand not only correct
citation of laws but also the ability to identify legal issues, articulate reasoning chains, and evaluate
interpretive diversity. Recent benchmarking efforts (e.g., LawBench) focus on legal comprehension
but lack evaluative methods suited for domains where answers are fundamentally indeterminate. In
legal essay questions, the absence of a definitive answer before a court ruling is not a limitation
but an essential feature of legal discourse, reflecting the role of doctrinal interpretation and judicial
discretion. Thus, crowdsourcing emerges as a uniquely appropriate evaluation approach, as it aligns
with the normative foundations of legal reasoning, where majority opinion often represents the practical
threshold of plausibility. Our work extends the evaluation landscape by offering a framework tailored
to expert-oriented, open-ended, and legally plausible reasoning tasks.
        </p>
        <p>Recent developments in legal NLP underscore the need for evaluation methodologies that extend
beyond accuracy-based metrics and capture the qualitative structure of legal reasoning. Studies in legal
argument mining and factor-based analysis have emphasized the importance of reasoning chains, issue
identification, and doctrinal coherence. In parallel, preference-based frameworks such as Elo-style
ranking have been adopted in domains such as summarization, translation, and dialogue evaluation,
where human judgment plays a central role. This study builds on these broader developments by
applying a preference-driven, human-in-the-loop method specifically tailored to doctrinal, open-ended
legal essay responses, thereby contributing to emerging evaluation practices for high-stakes legal AI
systems.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Supervised Fine-Tuning in the Legal Domain</title>
        <p>
          To address these challenges, we propose leveraging Supervised Fine-Tuning (SFT) to develop a large
language model specifically designed for answering legal essay questions[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. SFT is ideal when aligning
the model to a specific, narrowly-defined task or domain, such as legal reasoning, medical diagnosis,
or summarizing technical documents. According to prior research, instruction fine-tuning has proven
efective in improving LLMs’ ability to respond to task-specific instructions[
          <xref ref-type="bibr" rid="ref14 ref6">6, 14</xref>
          ]. However, due to the
highly specialized nature of legal essay questions, this study opts for Supervised Fine-Tuning, which
enables the use of expert-annotated datasets to enhance the model’s precision and reasoning capabilities.
Fine-tuning LLMs with supervised learning techniques has been shown to improve their capabilities
in handling specific tasks. For instance, a study found that the composition of supervised fine-tuning
data significantly affects the abilities of LLMs, indicating that carefully curated datasets can lead to
substantial performance gains[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. On the other hand, the authors of [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] analyzed the impact of supervised fine-tuning
on LLMs for question-answering tasks, demonstrating its effectiveness in enhancing performance.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Our Contribution: Taiwan Bar Exam Corpora and Evaluation Framework</title>
        <p>Our research focuses on developing and evaluating large language models (LLMs) capable of responding
to legal essay questions in the Taiwanese civil law context, with enhanced reasoning and expressive
abilities achieved through Supervised Fine-Tuning (SFT). Unlike many studies that rely on judicial decisions,
statutory databases, or academic articles, we deliberately selected questions from Taiwan’s national
judicial officer and bar examinations, together with expert-written model answers and explanatory
materials published by commercial preparatory institutions, as the core corpus. This choice is significant:
in Taiwan, essay-style questions are the centerpiece of high-stakes national examinations. They embody
the requirements of legal education for issue spotting, doctrinal debate, logical argumentation, and
stance-taking. At the same time, they sustain a substantial knowledge industry, where cram schools,
publishers, and tutoring services continuously produce model answers and commentaries. Thus, legal
exam essays in Taiwan are not only pedagogical tools but also commercialized knowledge products,
reflecting the core mechanisms of professional training and market value.</p>
        <p>Methodologically, our approach combines expert-reviewed and rigorously scored datasets with a
browser-based evaluation environment that integrates human scoring and an Elo ranking mechanism.
Legal professionals assess model-generated answers in pairwise comparisons, yielding preference data
that balances expert credibility with scalability. This design reflects the interpretive nature of legal
reasoning, where majority opinion often serves as a proxy for plausibility. Importantly, while
state-of-the-art LLMs can readily generate coherent and professional-looking legal texts, the true challenge
lies in evaluating whether these outputs meet professional standards. Legal essay questions do not
admit a single “correct” answer; instead, they require proper issue identification, stance formulation,
and logically consistent reasoning chains.</p>
        <p>Accordingly, our contributions are twofold. First, by grounding our work in Taiwan-specific exam
corpora, we combine supervised fine-tuning and human-in-the-loop evaluation into a reproducible
framework for legal text generation and assessment. Second, we expose the structural asymmetry of
the task: generation is relatively easy, but evaluation remains far more difficult. We argue that future
progress lies in the integration of AI-agent technologies, which could dynamically interact with both
experts and models—highlighting overlooked issues, simulating counterarguments, or adjusting answer
structures in real time. Such agents would not only improve evaluation reliability at scale but also
enhance explainability, feeding back into legal education and professional training to strengthen the
trustworthiness of legal AI systems in practice.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research Design and Localized Corpus Construction</title>
      <p>As previously mentioned, our research team aims to integrate Supervised Fine-Tuning (SFT) algorithms,
expert-curated datasets, and human evaluation feedback to develop an LLM tailored to the Taiwanese
legal knowledge framework, along with a standardized operating procedure (SOP) for its evaluation.
Our primary focus is on legal essay questions, as their question-and-answer format is one of the
most common structures encountered in the legal domain. Compared to tasks like summarization or
multilingual translation, generating answers to essay questions appears to align well with the strengths
of LLMs, which leverage transformer mechanisms to produce contextually relevant outputs. However,
evaluating these responses poses a unique challenge, as it requires not only assessing the applicability
of legal provisions but also ensuring the accurate incorporation of legal concepts.</p>
      <sec id="sec-3-1">
        <title>3.1. Supervised fine tuning model</title>
        <p>Supervised Fine-Tuning (SFT) is a critical step in training large language models, significantly enhancing
their performance on domain-specific tasks. In the context of legal AI, SFT plays an even more pivotal
role due to the unique characteristics of legal texts. These texts are not only highly specialized and
unstructured (or semi-structured) but also require strict adherence to accuracy and logical consistency.
By fine-tuning on legal datasets, models can better align with the complex reasoning and precise
language inherent to legal tasks.</p>
        <p>One of the most significant advantages of SFT in legal AI is its ability to address the gap between the
linguistic style of legal texts and general-purpose datasets. Legal texts often contain intricate sentence
structures, domain-specific terminology, and jurisdictional nuances that are absent from traditional
training data. SFT allows large language models to accurately extract critical clauses, interpret nuanced
statutory language, and understand the hierarchical organization of legal documents.</p>
        <p>Most importantly, SFT enables models to replicate the reasoning patterns of legal professionals,
which is essential for tasks such as statutory interpretation, legal compliance analysis, and case-specific
decision-making. By learning to synthesize facts and statutes, the models can provide more
context-aware and legally sound outputs. This capability is critical as it addresses one of the most challenging
aspects of applying AI in law: bridging the gap between data-driven models and the rigorous demands
of legal reasoning.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompt Design and Prompt Refinement</title>
        <p>All models evaluated in this study were queried using a standardized instruction prompt to ensure
comparability across conditions. The instruction was formulated to reflect the conventions of Taiwanese
civil-law essay writing and to elicit structured, doctrinally grounded responses:
“You are a legal expert trained in Taiwanese civil law. Read the following bar-exam essay
question carefully and provide an answer structured as: (1) issue identification, (2) relevant
legal doctrines with citations, (3) application to facts, and (4) conclusion. Avoid conversational
tone and unnecessary explanations.”</p>
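        <p>For concreteness, the sketch below shows how this fixed instruction can be attached to an individual exam question before querying a model. It is a minimal illustration only: the generate call at the end is a hypothetical placeholder for whichever inference API serves the model.</p>
        <preformat>
# Minimal sketch of the fixed-prompt querying procedure described above.
# INSTRUCTION reproduces the standardized prompt; "generate" is a
# hypothetical placeholder for the actual inference call.

INSTRUCTION = (
    "You are a legal expert trained in Taiwanese civil law. Read the "
    "following bar-exam essay question carefully and provide an answer "
    "structured as: (1) issue identification, (2) relevant legal doctrines "
    "with citations, (3) application to facts, and (4) conclusion. "
    "Avoid conversational tone and unnecessary explanations."
)

def build_prompt(question: str) -> str:
    """Attach the fixed instruction to one bar-exam essay question."""
    return f"{INSTRUCTION}\n\nQuestion:\n{question}"

# Hypothetical usage, with "generate" supplied by the serving stack:
# answer = generate(build_prompt(exam_question))
</preformat>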
        <p>Before performing supervised fine-tuning, a series of minor refinements were tested, including
variations in step-wise guidance and stylistic constraints. These adjustments produced only modest
differences and did not substantially alter the qualitative structure of baseline outputs. Because the goal
of this study is to examine the effect of task-specific supervised fine-tuning, a single fixed prompt was
maintained throughout all experiments.</p>
        <p>In addition to the main prompt, a small pilot experiment was conducted using a one-shot example that
included an expert-written sample answer. One-shot prompting improved the structural organization
of some baseline outputs but did not markedly change evaluator preferences when compared to the
fine-tuned models. These preliminary observations suggest that supervised fine-tuning exerts a more
stable influence on legal reasoning quality than prompt variation alone.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. SFT Parameter Configuration</title>
        <p>
          For the model architecture, we selected the Breeze 3B model for fine-tuning to achieve a balance between
performance and computational resource requirements[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Within this framework, we trained five
legal language models using varying configurations to assess the impact of different parameter settings.
        </p>
        <p>Regarding model-specific parameter configurations, we developed five models, denoted as V1 through
V5. V1 serves as the baseline model without any supervised fine-tuning (SFT) adjustments, while V2
through V5 incorporate specific configurations of two parameters, described below: (a) sequence length
size and (b) learning rate.</p>
        <p>The sequence length size determines the maximum number of tokens a model can process in a single
forward pass. To accommodate the long-text nature and contextual reasoning requirements of legal
essay questions, we set the sequence length size to 8252 tokens for models V2 through V5.
The learning rate is a crucial hyperparameter in the training process, controlling the step size
for model parameter updates. An excessively high learning rate may lead to instability or
non-convergence, whereas a low learning rate could result in slow training or convergence to a suboptimal
local minimum. To balance stability and convergence speed, the initial learning rate for all models
was set to 5×10⁻⁷.</p>
        <p>
          To further optimize the training process, we employed a Cosine Annealing Learning Rate Scheduler
to dynamically adjust the learning rate. This strategy gradually decreases the learning rate following a
cosine function as training progresses, while periodically resetting it close to zero at the end of each
training cycle. This approach enhances the model’s exploration capability by avoiding entrapment
in local minima[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. By adding the “Warm Restarts” method, we enable more effective exploration of the
parameter space.¹
        </p>
        <p>η_t = η_min + ½ (η_max − η_min) (1 + cos(π T_cur / T_max))</p>
        <p>• η_t: learning rate at step t.
• η_max: maximum value of the learning rate.
• η_min: minimum value of the learning rate (typically close to zero).
• T_cur: current training step within the cycle.
• T_max: total number of training steps in the cycle.</p>
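        <p>As an illustration, this schedule maps directly onto PyTorch’s built-in scheduler. In the sketch below, the 5×10⁻⁷ initial learning rate follows the configuration above, while the optimizer choice and the cycle length T_0 are illustrative assumptions rather than reported settings.</p>
        <preformat>
# Minimal PyTorch sketch of cosine annealing with warm restarts (SGDR).
# lr=5e-7 matches the initial learning rate reported above; the AdamW
# optimizer and T_0 (steps per restart cycle) are illustrative assumptions.
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(16, 16)  # stand-in for the fine-tuned model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)  # eta_max
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=100,      # steps in the first cycle (assumed value)
    T_mult=1,     # keep the cycle length constant across restarts
    eta_min=0.0,  # eta decays toward zero, then restarts near eta_max
)

for step in range(300):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()  # applies the cosine-with-restarts formula each step
</preformat>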
        <p>Another controlling parameter in SFT is Batch Size, which refers to the number of samples used in
each forward or backward propagation during the training process. Batch size significantly impacts
the speed, stability, and performance of training. Setting the batch size too small may lead to unstable
gradient updates, while excessively large batch sizes might exceed the memory limitations of the
hardware and hinder efficient operation. To identify the optimal batch size, we experimented with
various configurations: both V2 and V3 were trained with a batch size of 1024. However, due to the
limited size of our dataset, each batch effectively covered the entire dataset in a single pass. In
contrast, V4 and V5 were trained with a batch size of 8.</p>
        <p>Steps per Epoch = Total Number of Samples / Batch Size</p>
        <p>Completing one epoch means the model has fully traversed the dataset once and adjusted its
parameters based on the error signals generated during that pass. In this experiment, we set the
V2 and V4 models to 10 epochs, while V3 and V5 were trained for 100 epochs. This configuration aimed
to identify the optimal parameter settings for our task.</p>
        <p>¹Warm Restarts is a commonly used learning rate scheduling strategy in deep learning, designed to optimize the training
process of models. The core idea involves periodically resetting the learning rate during training: starting from a relatively
high value and gradually decaying it within each cycle. This approach helps the model escape local minima.</p>
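        <p>For reference, the sketch below merely restates the five training configurations and the steps-per-epoch relation described in this section; it introduces no settings beyond those reported above.</p>
        <preformat>
# Restatement of the five configurations described in Section 3.3.
# V1 is the untrained baseline; V2-V5 share the 8252-token sequence
# length and the 5e-7 initial learning rate.
CONFIGS = {
    "V1": None,  # baseline: no supervised fine-tuning
    "V2": {"seq_len": 8252, "lr": 5e-7, "batch_size": 1024, "epochs": 10},
    "V3": {"seq_len": 8252, "lr": 5e-7, "batch_size": 1024, "epochs": 100},
    "V4": {"seq_len": 8252, "lr": 5e-7, "batch_size": 8, "epochs": 10},
    "V5": {"seq_len": 8252, "lr": 5e-7, "batch_size": 8, "epochs": 100},
}

def steps_per_epoch(total_samples: int, batch_size: int) -> int:
    """Steps per Epoch = Total Number of Samples / Batch Size (ceiling)."""
    return -(-total_samples // batch_size)
</preformat>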
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Curating Taiwan Bar Exam Datasets for Fine-Tuning</title>
        <p>One of the key contributions of this study is the creation of a domain-specific evaluation environment
rooted in Taiwan’s bar and judicial examinations. We constructed a dataset from 2014–2021 consisting of
essay questions and model answers authored by legal scholars and commercial publishers. This decision
reflects the reality that Taiwan’s exam system generates a vast amount of structured pedagogical
material, forming a quasi-standard corpus for evaluating professional legal reasoning. Compared
to judicial opinions or academic journals, these exam materials emphasize issue spotting, doctrinal
debate, and stance-taking, which more closely mirror the skills required in professional legal writing.
At this stage we focused on civil law domains—General Principles, Obligations, Property Law, and
Family Law—while excluding criminal and public law to ensure tractability. To benchmark model
performance, we incorporated outputs from ChatGPT-4, ChatGPT-4o, and Breeze-7B, alongside
gold-standard responses crafted by human experts. This localized corpus illustrates both the relative ease
of generating coherent legal text and the far greater difficulty of evaluating such text in a manner
consistent with professional standards.</p>
        <p>Most legal materials in the pretraining stage are derived from judicial rulings, statutory databases,
or legal journal articles. While these sources carry substantial legal expertise, the content in judicial
rulings or legal journals vastly differs from the format, structure, and tone expected in answers written
by law students for national bar exams. In legal essay questions for national exams, the emphasis is
placed on identifying key issues, discussing various academic perspectives, adopting a preferred stance,
and then deriving conclusions based on that stance.</p>
        <p>These questions were answered by ChatGPT-4, ChatGPT-4o, and the MTK Breeze 7B model.
Additionally, we incorporated a set of standard responses crafted by legal professionals as benchmarks. In
this stage, in order to narrow the scope, we focused specifically on civil law essay questions, covering
topics such as General Principles of Civil Law, Obligations (General and Specific Provisions), Property
Law, and Family Law. Questions from criminal law and public law were excluded to concentrate on
improving the model’s capability in addressing civil law issues.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation Interface Design</title>
        <sec id="sec-3-5-1">
          <title>3.5.1. Expert Scoring Interface</title>
          <p>
Unlike evaluation methods for translation tasks, which often rely on metrics like ROUGE that focus on
word-level matches, assessing legal essay questions demands a specialized approach. Currently, our
team references human feedback as the primary evaluation method[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Legal experts assign scores on
a 10-point scale, focusing on the appropriateness of legal reasoning and the accuracy of conceptual
application. These human-assigned scores are then used to fine-tune the model’s parameters. While
this method provides a baseline, we recognize the need for a more robust and systematic evaluation
framework in the future.
        </p>
        <p>
          After generating 80 legal essay question responses from each model, we sought to evaluate their
performance and assign scores, which would then be used as part of the dataset for subsequent SFT
processes. To achieve this, we assembled a panel of 16 legal professionals, including law professors,
graduate students, and undergraduate students, to assess these 240 responses. Each answer was scored
on a scale of 0 to 10, with the human-crafted standard responses serving as the benchmark and assumed
to score a perfect 10. The standard responses were used as a reference for evaluating the models’
outputs.
          </p>
        </sec>
        <sec id="sec-3-5-2">
          <title>3.5.2. Crowdsourced Comparison Interface</title>
          <p>Evaluating legal essay responses poses a unique challenge due to the highly subjective nature of legal
writing. Establishing consistent grading criteria is particularly difficult. To address this, participants
were instructed to base their evaluations on the provided standard answers and to avoid referencing
minority opinions or personal interpretations outside the mainstream legal doctrines. Drawing on
insights from LegalBench[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we focused the grading criteria on three key aspects: issue
coverage, the accuracy of legal citations, and the logical application of legal concepts. Each response
was scored as an integer between 0 and 10.
        </p>
        <p>To ensure that legal professionals, including professors and law students, could assess
model-generated responses with impartiality and precision in a conducive environment, we carefully designed
a grading interface that promotes focus and minimizes external distractions.</p>
        <p>This system allows users to evaluate legal question responses. The User ID field identifies the
evaluator, while the Query section lets users input a question number to retrieve a case. The Question
area presents the legal scenario and sub-questions. The Standard Answer provides a reference drafted
by legal experts. Finally, the Score field allows evaluators to rate the response, supporting both expert
and crowdsourced legal assessment.</p>
        <p>To facilitate evaluation at scale, we designed a web-based interface deployable directly in the browser,
developed using Python and Gradio. This system eliminates the need for complex server infrastructure,
enabling participants to conduct evaluations seamlessly through a browser. The interface was specifically
designed to display, on a single page, the standard reference answer (crafted by legal experts), responses
generated by three distinct models, and an input field for evaluators to record their scores. This
centralized presentation streamlines the assessment process and ensures that evaluators can compare
responses efficiently within a unified framework.</p>
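        <p>A minimal Gradio sketch of such a scoring page is given below. The field names mirror the interface description above; load_case and save_score are hypothetical placeholders for the project’s actual data handling.</p>
        <preformat>
# Minimal Gradio sketch of the browser-based scoring interface described
# above. `load_case` and `save_score` are illustrative placeholders for
# the project's data access; field names mirror the interface description.
import gradio as gr

def load_case(question_id):
    # Placeholder: fetch the question, standard answer, and a model response.
    return "question text ...", "expert-written standard answer ...", "model response ..."

def save_score(user_id, question_id, score):
    # Placeholder: persist one evaluation record.
    return f"Saved: user={user_id}, question={question_id}, score={score}"

with gr.Blocks(title="Legal Essay Evaluation") as demo:
    user_id = gr.Textbox(label="User ID")
    query = gr.Textbox(label="Query (question number)")
    question = gr.Textbox(label="Question", interactive=False)
    standard = gr.Textbox(label="Standard Answer", interactive=False)
    response = gr.Textbox(label="Model Response", interactive=False)
    score = gr.Slider(0, 10, step=1, label="Score")
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Load").click(load_case, inputs=query,
                            outputs=[question, standard, response])
    gr.Button("Submit").click(save_score, inputs=[user_id, query, score],
                              outputs=status)

demo.launch()  # serves the page directly in the browser
</preformat>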
        <p>Each participant was tasked with evaluating five questions, with three model-generated responses
per question, amounting to a total of 15 evaluations per person. To ensure the consistency and accuracy
of scores, the grading process was conducted continuously over approximately one hour, thereby
preserving evaluators’ concentration and reducing variability in their judgments.</p>
        <p>This setup supports pairwise model comparisons, with legal experts assessing outputs against standard
answers. While the current study employed expert-based evaluations, the design of the interface can
easily be extended for crowdsourcing scenarios in future research, allowing broader participation from
lay users or law students to enrich our feedback corpus. Importantly, the integration of the Elo scoring
framework allows for ongoing inclusion of new models and dynamic adjustment of performance
rankings based on user preferences. It offers a practical and scalable solution for assessing model
performance in a manner aligned with the rigor and precision required in the legal domain. Moreover,
the interface has the potential to become a critical tool in future efforts to fine-tune large language
models for legal applications, enabling efficient data collection and evaluation while maintaining high
standards of objectivity and consistency.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Elo Scoring</title>
        <p>
          The Elo rating system, originally developed for chess by Arpad Elo, was adapted for predictive purposes
in association football by Hvattum and Arntzen[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Their study analyzed its effectiveness for forecasting
match results. Since then, the system has been adapted and applied across various regions and domains
worldwide. Its versatility allows it to be used in diverse contexts, including sports, games, and academic
evaluations. However, its direct application in the legal field remains limited, primarily due to the lack
of objective metrics and quantifiable data. The Elo system relies on clear, binary outcomes, whereas
legal processes are inherently complex and often lack definitive "win" or "loss" results. Consequently,
applying the Elo framework to legal contexts requires transforming legal data into numerical formats
that facilitate the comparison of rankings or the assessment of case complexity.
        </p>
        <p>In our experiment, the Elo score system proved to be a highly convenient tool, as it allows for
direct comparison of which model performs better on the same legal essay problem. Compared to
requiring participants to assign precise scores to each answer, determining relative quality is a simpler
yet equally effective method for collecting human preferences. This approach significantly reduces the
time needed to recruit professionals for evaluating model performance, enabling us to involve a broader
pool of individuals with legal knowledge in the testing process. Consequently, this system facilitates
the enrichment of our dataset while maintaining efficiency and scalability.</p>
        <p>Precisely speaking, when the evaluation begins, each model (V1 to V5) is assigned an initial Elo
score, typically set at 1500 points. During the testing process, the system randomly selects two models
for a head-to-head comparison and records human preferences to determine the outcome; in our setting,
ten different questions were posed, each answered by the five models. Based on these results, the program
dynamically updates the scores of both models using the Elo score update formula, ensuring that the
rankings reflect the relative performance of the models over time. This iterative scoring process provides
a robust framework for evaluating and refining the models’ capabilities.
        </p>
        <p>E_A = 1 / (1 + 10^((R_B − R_A) / 400)) (1)</p>
        <p>R′_A = R_A + K (S_A − E_A) (2)</p>
        <p>R′_B = R_B + K (S_B − E_B) (3)</p>
        <p>Here R denotes a model’s current rating, E its expected score against the paired opponent, S the observed outcome of the comparison (1 if preferred, 0 otherwise), and K the update factor.</p>
        <p>
Elo Hyperparameters and Stopping Criterion. The Elo framework used in this study follows
standard implementations for pairwise preference modeling. All models were assigned an initial score
of 1500. A fixed K-factor of K = 20 was used to balance score sensitivity with stability. Across
the evaluation process, a total of 160 pairwise comparisons were generated. The iterative update
procedure terminated when rating changes fell below 1 point across two successive iterations or when
all comparisons had been processed. Convergence was typically achieved after approximately 120
iterations, after which additional updates produced negligible changes in ranking order.</p>
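        <p>The update loop can be sketched as follows. The K-factor of 20 and the 1500-point starting score match the settings above; the final call with a random outcome is a hypothetical stand-in for the recorded evaluator judgment.</p>
        <preformat>
# Minimal sketch of the Elo update used for pairwise model comparison.
# K=20 and the 1500-point initial score follow the settings above; the
# pairing is uniformly random, so earlier winners gain no exposure
# advantage (see Section 3.7).
import random

K = 20.0
ratings = {m: 1500.0 for m in ["V1", "V2", "V3", "V4", "V5"]}

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B, Eq. (1)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update(a: str, b: str, a_wins: bool) -> None:
    """Apply R' = R + K(S - E) to both models, Eqs. (2)-(3)."""
    e_a = expected(ratings[a], ratings[b])
    s_a = 1.0 if a_wins else 0.0
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))

# One comparison round; iteration would stop once the largest rating
# change stays below 1 point across two successive passes.
a, b = random.sample(sorted(ratings), 2)
update(a, b, a_wins=random.random() > 0.5)  # placeholder for human judgment
</preformat>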
        <p>The most distinctive feature of this approach is its iterative updating process. The comparison and
score updating cycles are repeated multiple times until the scores of all models gradually stabilize or the
predefined maximum number of testing iterations is reached. Ultimately, the model with the highest
Elo score is identified as the one most aligned with human preferences. The final Elo scores serve as
key performance indicators for the models. These scores are further analyzed alongside supplementary
data, such as preference distributions and model characteristics, to determine potential directions for
optimizing model parameters and improving overall performance.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Process Flow</title>
        <p>The interaction design of the evaluation interface supports human-computer interaction studies within
information retrieval systems. Although the overall design is similar to the previous system, this
interface was tailored to the primary objective of the current experiment: to allow participants to select
the model that performs best rather than assign scores. To achieve this, the answers generated by two
models were displayed side by side, and user tags were added to prevent overlapping responses.</p>
        <p>The experiment involved eight participants, all law students. The testing session lasted one hour
and featured ten legal essay questions. Each question had responses from five models, and the system
randomly selected two model-generated answers for comparison at a time. Participants evaluated the
same question twice, meaning they faced two distinct random pairings for each question.</p>
        <p>This experimental design does not rely on prior scoring results. In other words, a model that
performed better in the first comparison does not gain an advantage by appearing more frequently in
subsequent pairings. This approach aligns with the experiment’s goal of fairly assessing the overall
performance of all five models on legal questions. By avoiding selection bias, where higher-scoring
models are over-tested and lower-scoring models are under-tested, the methodology ensures balanced
evaluation opportunities for all models.</p>
        <p>Additionally, given the limited number of participants and the constrained testing duration, the
randomized selection logic simplifies the operation process, enhances testing efficiency, and improves
the credibility of the results. In addition to facilitating precise pairwise model comparisons, the
platform offers an educational component: legal evaluators, especially law students, reported increased
metacognitive awareness of legal reasoning through the assessment process. This reveals the dual
role of our interface as both an evaluation mechanism and a pedagogical tool, making it a promising
candidate for broader applications in legal education and interactive AI training environments.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results and Comparative Observations</title>
      <p>This section presents a comparative analysis of the performance of five model versions after undergoing
Supervised Fine-Tuning, as evaluated by legal professionals using the Elo Score System.</p>
      <sec id="sec-4-1">
        <title>4.1. Observation of SFT Model V1 to V5: Generation Strengths and Weaknesses</title>
        <p>The training processes and parameter configurations for the five models were conducted as described
in the experimental design. Upon completing the training, we tested the models’ performance using
four civil law questions that were not included in the training dataset: one on Family Law, one on
Property Law, one on General Provisions of Obligations, and one on General Principles of Civil Law.
Additionally, ChatGPT and ChatGPT-4 were included as anonymous models to provide responses for
comparison. A law professor then evaluated the performance of all models. The scoring results are
illustrated in the figure below.</p>
        <p>The performance comparison between our supervised fine-tuned models (V1–V5) and baseline LLMs
such as ChatGPT-4 and ChatGPT-4o highlights a core theme in evaluating large language models.
However, because GPT-4 and GPT-4o are closed-source models that cannot be fine-tuned or controlled
for parameter consistency, we included them only as qualitative baselines rather than as participants in
the Elo ranking system. By experimenting with various training configurations, including different batch
sizes and epoch counts, we were able to systematically observe the trade-offs between generalization
and overfitting. The results reinforce the necessity of empirical benchmarks in LLM evaluation and
demonstrate that supervised fine-tuning can meaningfully improve performance on domain-specific,
long-form question answering tasks, such as those found in legal contexts. A more detailed explanation
of each model is as follows:</p>
        <p>1. Untrained Model V1 Displays a Scattergun Approach to Legal Answers</p>
        <p>The untrained V1 model tended to answer legal questions by listing a wide range of potentially
relevant legal provisions without identifying key issues. In legal essay responses, law students
are expected to identify key legal issues after analyzing the question, then selectively reference
applicable statutes. However, V1 appeared to take a "scattergun approach," using keywords from
the question to retrieve potentially related statutes and attempting to link them to the facts
presented. This approach resulted in overly verbose answers that often failed to address the core
issues.</p>
        <p>2. V2 Outperforms V4 Despite Similar Answer Styles</p>
        <p>Both V2 and V4 exhibited answer structures and writing styles resembling those of law students,
such as starting responses with issue-focused questions and avoiding overly conversational
language. Unlike V1, both models were better at articulating a clear legal stance. For instance,
V1 often emphasized resolving specific problems with phrases like "this depends on the court’s
judgment in individual cases." However, V2 received a higher average score than V4. Due to
the small sample size of only four questions and a single evaluator, the reason for V2’s superior
performance remains unclear. Further evaluation using the Elo Score methodology is planned to
confirm these findings.</p>
        <p>3. V2 Outperformed ChatGPT-4o on Question 3 (General Principles of Civil Law)</p>
        <p>
Question 3 focused on the validity of marriage and home-buying actions undertaken by a person
under guardianship. Both V2 and ChatGPT-4o incorrectly assessed the validity of the marriage
but differed in their treatment of the home-buying action. V2 identified the key point that a
person under guardianship has "limited legal capacity," a critical consideration in determining the
validity of the purchase. In contrast, ChatGPT-4o failed to mention the concept of legal capacity
entirely. Although V2’s phrasing was not entirely precise, its closer alignment with the intended
legal reasoning earned it a higher score. This result suggests that with further fine-tuning, V2’s
responses could more closely align with precise legal terminology and outperform ChatGPT-4o
in tasks requiring multi-layered legal reasoning.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Elo-Based Comparative Evaluation of Models V1–V5</title>
        <p>Before proceeding with the analysis of the experimental results, we ensured that the preference data
provided by the eight participants showed no significant variability. To achieve this, we calculated the
standard deviation and coefficient of variation, and performed outlier detection. Additionally, we created
box plots to visualize the distribution and deviations in the model preferences, providing a clearer
understanding of any inconsistencies in the data.</p>
      <p>According to the data analysis, the standard deviation and coefficient of variation (CV) indicate that
the evaluation results across models were relatively stable and consistent. Model V4 demonstrated the
smallest standard deviation and the lowest variance, highlighting the stability of its evaluation results.
Overall, the standard deviations for all models were relatively low, suggesting that the evaluation
outcomes did not exhibit extreme dispersion or significant deviation. Additionally, the CV values for
all models were below 0.04, further confirming that the evaluation scores were stable and free from
notable bias.</p>
        <p>For a more intuitive representation, the average performance of each model is summarized in the
table below. The Elo Score baseline starts at 1500 points. As shown in the table, Model V4 achieved the
highest performance, followed by Model V2. The untrained baseline model, V1, ranked third, while V5
and V3 exhibited similar performance, with both falling behind the other models.</p>
        <p>The analysis revealed that a small batch size (Batch Size: 8) combined with fewer training epochs
(Epoch: 10) contributed significantly to performance improvement.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Interpretation of Findings</title>
        <p>The empirical results offer early but informative insights into how supervised fine-tuning influences
the ability of large language models to address open-ended legal essay questions. While Model V4
obtained the highest Elo rating in our study, the domain scope and sample size necessarily limit the
generality of these observations. Rather than providing definitive rankings, the results highlight patterns
regarding batch size, epoch configuration, and the risk of overfitting in long-form legal reasoning. These
patterns suggest practical directions for optimizing fine-tuning strategies and illustrate the importance
of balancing generalization with doctrinal specificity in legal LLM development.</p>
        <p>These findings underline the importance of balancing batch size, training epochs, and generalization
to optimize the performance of models tailored for legal applications. Further refinements in training
strategies could enhance both the specificity and the analytical depth of model-generated legal responses.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Inter-Rater Reliability</title>
        <p>To assess the consistency of human judgments, inter-rater agreement statistics were computed based
on the preference data collected from eight evaluators. Kendall’s coefficient of concordance yielded
W = 0.62, indicating moderate agreement across evaluators. Krippendorff’s alpha was calculated at
α = 0.58, reflecting a similar level of concordance suitable for preference-based assessments of
open-ended legal reasoning. These measures suggest that the resulting preference rankings are reasonably
stable, while also pointing to the potential value of incorporating a broader range of legal professionals
in future evaluations.</p>
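        <p>For transparency, Kendall’s W can be computed from a rater-by-item matrix of ranks as in the short sketch below; the toy matrix is purely illustrative and does not reproduce the study’s data.</p>
        <preformat>
# Small numpy sketch of Kendall's coefficient of concordance (W) over a
# rater-by-item matrix of preference ranks (assumes complete rankings
# without ties). In the study, rows would correspond to the evaluators.
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """ranks: shape (m raters, n items), each row a ranking 1..n."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return float(12.0 * s / (m ** 2 * (n ** 3 - n)))

# Toy example: 3 raters ranking 5 model outputs.
toy = np.array([[1, 2, 3, 4, 5],
                [2, 1, 3, 5, 4],
                [1, 3, 2, 4, 5]])
print(kendalls_w(toy))  # 1.0 would mean perfect agreement
</preformat>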
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Directions for Legal Essay Evaluation</title>
      <sec id="sec-5-1">
        <title>5.1. Research Finding</title>
        <p>Legal essay evaluation differs fundamentally from accuracy-driven NLP tasks because legal reasoning
is inherently interpretive, open-textured, and sensitive to doctrinal context. The primary contribution
of this study therefore lies not only in comparing the relative performance of fine-tuned models,
but in establishing an evaluation paradigm suitable for domains where answers are not uniquely
determined. By integrating expert-curated corpora, preference-based comparisons, and a scalable Elo
ranking mechanism, this work demonstrates how human-aligned assessment can be operationalized
for long-form legal reasoning tasks.</p>
        <p>Crowdsourced evaluation emerges as an especially suitable benchmark for assessing legal
LLMs. In doctrinal systems where judicial reasoning reflects competing—and sometimes
unsettled—interpretations, the aggregated judgments of trained legal readers provide a practical proxy
for legal plausibility. This form of assessment is both methodologically scalable and conceptually
coherent with the interpretive nature of legal analysis, where consensus-based plausibility often matters
more than fixed notions of correctness.</p>
        <p>The proposed framework contributes a structured pathway for evaluating legal LLMs under
conditions of expertise, ambiguity, and contextual sensitivity—characteristics underrepresented in current
evaluation pipelines. By combining expert-derived prompts, web-based scoring interfaces, and iterative
preference modeling through the Elo system, this study advances the methodological groundwork for
evaluating information-access systems in high-stakes, open-textured professional domains. Rather than
offering definitive performance rankings, the framework demonstrates how human preference signals
can be systematically captured and leveraged to assess reasoning quality in generative legal AI.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Limitations and Toward AI-Agent Assisted Evaluation</title>
        <p>This study has several limitations that indicate directions for future work. First, the evaluation dataset
was restricted to a small set of civil law essay questions, primarily covering the General Principles of
Civil Law, Property Law, and Family Law. As such, the results may not generalize to other domains,
including criminal or administrative law, where the structure of reasoning and doctrinal constraints
may differ substantially. Expanding the corpus to encompass a broader range of legal topics would
provide a more comprehensive understanding of model performance.</p>
        <p>Second, the evaluation relied on a relatively small pool of participants, consisting mainly of law
students. Although their training was sufficient for the purposes of this exploratory study, evaluations
involving a more diverse group of legal practitioners—such as attorneys, judges, or senior scholars—could
offer richer insights and increase the reliability of preference signals. Furthermore, the occasional use
of a single evaluator may have introduced subjective bias. Future research should employ larger, more
varied evaluator pools and additional controls to improve robustness.</p>
        <p>Finally, the models demonstrated difficulty in generating nuanced, well-elaborated legal analyses,
particularly in settings where overfitting reduced interpretive depth. This reflects a broader methodological
challenge: while LLMs can readily produce fluent doctrinal text, constructing evaluation protocols that
reliably capture legal reasoning quality remains considerably more complex. Our current framework
incorporates expert scoring and Elo-based comparisons, but further progress will require AI-agent
systems capable of mediating between model outputs, expert expectations, and evolving legal standards.
Such agents could automatically identify omitted issues, provide counter-arguments, assess doctrinal
coherence, or simulate peer-review-like critique, thereby enhancing both evaluation rigor and the
reasoning quality of generated outputs. Advancing toward agent-supported evaluation offers a potential
path for reconciling the growing ease of text generation with the persistent difficulty of high-stakes
legal assessment.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The authors have not employed any Generative AI tools.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ilias</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          , Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>LegalBERT: The Muppets straight out of law school</article-title>
          .
          <source>In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pp.
          <fpage>2898</fpage>
          -
          <lpage>2904</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Zhiwei</given-names>
            <surname>Fei</surname>
          </string-name>
          , Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen,
          <string-name>
            <given-names>Zongwen</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Jidong</given-names>
            <surname>Ge</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>LawBench: Evaluating legal reasoning and comprehension abilities of large language models</article-title>
          .
          <source>Artificial Intelligence and Law</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>45</fpage>
          -
          <lpage>62</lpage>
          . arXiv:2309.16289.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Lars</given-names>
            <surname>Magnus</surname>
          </string-name>
          Hvattum and
          <string-name>
            <given-names>Halvard</given-names>
            <surname>Arntzen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Using ELO ratings for match result prediction in association football</article-title>
          .
          <source>International Journal of Forecasting</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>460</fpage>
          -
          <lpage>470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Anindita</given-names>
            <surname>Kundu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Denilson</given-names>
            <surname>Barbosa</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Are Large Language Models Good Essay Graders?</article-title>
          <source>In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pp.
          <fpage>1456</fpage>
          -
          <lpage>1466</lpage>
          . arXiv:2409.13120.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>SGDR: Stochastic Gradient Descent with Warm Restarts</article-title>
          .
          <source>In Proceedings of the International Conference on Learning Representations (ICLR</source>
          <year>2017</year>
          ). arXiv:1608.03983.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hsuan-Lei</surname>
            <given-names>Shao</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei-Hsin</surname>
            <given-names>Wang</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sieh-Chuen Huang</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Quantity Afects Quality: Instruction Fine-Tuning on LLM's Multiple-Choice Question Abilities</article-title>
          . EasyChair Preprint No.
          <volume>12345</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Xia</surname>
          </string-name>
          , Shaoguang Mao, and
          <string-name>
            <given-names>Chanjing</given-names>
            <surname>Zheng</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Empirical Study of Large Language Models as Automated Essay Scoring Tools in English Composition: TOEFL Independent Writing Task</article-title>
          .
          <source>arXiv preprint arXiv:2401.03401</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Junjie</given-names>
            <surname>Ye</surname>
          </string-name>
          , Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang,
          <string-name>
            <surname>Peng</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhongchao Shi</surname>
            , and
            <given-names>Jianping</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Empirical Insights on Fine-Tuning Large Language Models for Question Answering</article-title>
          .
          <source>arXiv preprint arXiv:2409.15825</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Lucia</given-names>
            <surname>Zheng</surname>
          </string-name>
          , Neel Guha,
          <string-name>
            <surname>Brandon R. Anderson</surname>
            ,
            <given-names>Peter</given-names>
          </string-name>
          <string-name>
            <surname>Henderson</surname>
          </string-name>
          , and Daniel E. Ho.
          <year>2021</year>
          .
          <article-title>When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset</article-title>
          .
          <source>arXiv preprint arXiv:2104.08671</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Zexuan</given-names>
            <surname>Zhong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Danqi</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A Frustratingly Easy Approach for Entity and Relation Extraction</article-title>
          . arXiv preprint arXiv:2010.12812.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Chih-Jen</surname>
            <given-names>Hsu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chih-Lung</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng-Tsun</surname>
            <given-names>Liao</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Po-Chun</surname>
            <given-names>Hsu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yen-Chieh Chen</surname>
          </string-name>
          , and
          <string-name>
            <surname>Der-Shiuan Shiu</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Breeze-7B Technical Report</article-title>
          .
          <source>arXiv preprint arXiv:2403.02712</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Neel</surname>
            <given-names>Guha</given-names>
          </string-name>
          , Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood,
          <string-name>
            <given-names>Zhengyan</given-names>
            <surname>Li</surname>
          </string-name>
          , and others.
          <year>2024</year>
          .
          <article-title>LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models</article-title>
          .
          <source>Advances in Neural Information Processing Systems (NeurIPS)</source>
          , vol.
          <volume>36</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mennatallah</surname>
            <given-names>Elaraby</given-names>
          </string-name>
          , Huiyi Xu, Matthew Gray,
          <string-name>
            <given-names>Kevin D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Diane</given-names>
            <surname>Litman</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions</article-title>
          .
          <source>In Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval)</source>
          ,
          <source>LREC-COLING</source>
          <year>2024</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Matthew</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
          </string-name>
          , Jaromir Savelka,
          <string-name>
            <surname>William M. Oliver</surname>
            , and
            <given-names>Kevin D.</given-names>
          </string-name>
          <string-name>
            <surname>Ashley</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Empirical Legal Analysis Simplified: Reducing Complexity through Automatic Identification and Evaluation of Legally Relevant Factors</article-title>
          .
          <source>Philosophical Transactions of the Royal Society A</source>
          ,
          <volume>382</volume>
          (
          <issue>2270</issue>
          ),
          <fpage>20230155</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>