<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Specialization of Legal LLMs via Large-Scale Fine-Tuning on Jurisdictional MCQs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hsuan-Lei Shao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate Institute of Health and Biotechnology Law, Taipei Medical University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper examines how instruction-based fine-tuning enhances the semantic competence of legal Large Language Models (LLMs) in answering multiple-choice questions (MCQs) drawn from jurisdiction-specific legal examinations. Using Breeze-7B, a Mandarin LLM adapted from Mistral-7B, we fine-tune the model on two corpora containing 5,000 and 70,000 legal MCQs to evaluate how training volume affects accuracy, norm identification, and alignment with local legal reasoning. We evaluate the resulting models across three datasets, the Taiwanese Bar Examination (TBE), the Taiwan Massive Multitask Language Understanding benchmark (TMMLU), and MMLU, using both probability-based and zero-shot prompt-based protocols. Our findings show that large-scale fine-tuning substantially improves the model's ability to recognize legal obligations, interpret statutory structures, and reproduce jurisdiction-specific doctrinal reasoning. Comparisons with general-purpose models such as GPT-3.5 and GPT-4o highlight the trade-offs between broad linguistic generality and jurisdictional semantic specialization. We argue that MCQs function as structured, scalable signals for assessing and cultivating legal semantic competence in LLMs. Our results demonstrate that task-specific instruction tuning, combined with jurisdiction-grounded datasets, offers a cost-efficient pathway for developing explainable, localized legal AI systems. This work contributes to ongoing efforts in legal informatics to advance semantic analysis, model adaptation, and digital sovereignty within diverse legal environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal Large Language Models (LLMs)</kwd>
        <kwd>Multiple-Choice Questions (MCQs)</kwd>
        <kwd>Instruction Fine-Tuning</kwd>
        <kwd>Jurisdiction-Specific Datasets</kwd>
        <kwd>Semantic Specialization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.2. Challenges in Implementing LLMs Across Diverse Legal Systems</title>
        <p>Large Language Models (LLMs) have shown substantial promise in tasks such as legal drafting,
summarization, and research assistance. However, significant challenges arise when deploying general-purpose
LLMs within specialized legal systems, where statutory schemes, judicial doctrines, and regulatory
structures vary widely across jurisdictions. Such tasks require models to internalize and adapt to
heterogeneous legal frameworks—capabilities that current general LLMs lack [8, 9].</p>
        <p>A key concern is the potential erosion of legal diversity. Smaller jurisdictions or culturally distinctive
legal systems are often underrepresented in global training corpora, leading to the marginalization of
their doctrinal nuances. This issue is increasingly discussed under the rubric of “sovereign AI,” which
emphasizes the necessity of preserving jurisdiction-specific legal reasoning and institutional identity
within AI systems [10, 11]. As a result, there is growing consensus on the need for customized, locally
trained LLMs to ensure that generated legal analysis remains relevant, accurate, and compliant with
jurisdictional requirements [12, 13].</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.3. Specializing Local Knowledge in Legal LLMs</title>
        <p>In this study, we focus on building an LLM capable of incorporating jurisdiction-specific legal knowledge.
Each jurisdiction possesses its own legal system, professional practices, examination mechanisms, and
pedagogical traditions. Accordingly, we conceptualize a legal LLM not merely as a general-purpose
model applied to legal tasks, but as a form of localized AI infrastructure—and potentially a foundation
for “sovereign AI” development.</p>
        <p>We employ Breeze-7B, a Mandarin derivative of the Mistral-7B architecture, and enhance it through
instruction fine-tuning using curated legal MCQs. Two fine-tuned variants were constructed: one
trained on 5,000 MCQs (bz5k) and another on 70,000 MCQs (bz70k). The comparative performance
of these models reveals the intricate relationship between training volume and legal specialization.
Fine-tuning with insufficient data can degrade overall performance by overfitting limited doctrinal
patterns, thereby weakening the model’s capacity for knowledge-intensive tasks. By contrast,
large-scale fine-tuning (as with bz70k) substantially strengthens legal semantic competence—particularly
in statutory interpretation and obligation recognition—while producing modest trade-offs in
general-domain performance.</p>
        <p>This balance underscores the inherent tension between specialization and generality. It also
demonstrates that global LLMs frequently overlook smaller jurisdictions, such as Taiwan, whose legal corpora
are rarely incorporated into mainstream training pipelines. By explicitly introducing localized materials,
our approach shows that jurisdictions and expert communities can build domain-specific LLMs that
preserve digital sovereignty and capture professional legal reasoning otherwise absent from globally
trained models.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <sec id="sec-2-1">
        <title>2.1. Overview of LLM Capabilities in Legal Domains</title>
        <p>As we mentioned, LLMs have increasingly become integral to various applications within legal domains,
demonstrating capabilities that span from basic legal information retrieval to complex reasoning and
document generation. Studies have shown that LLMs, like the GPT series and its successors, can
interpret, generate, and summarize legal texts with a high degree of accuracy. These models have been
employed for contract analysis, litigation prediction, and even legal education assistance by generating
hypothetical legal scenarios for study. This section reviews the extent of LLM integration in legal
practices and evaluates their effectiveness in handling diverse legal tasks [8, 15, 19].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Current Methodologies in Prompt Engineering</title>
        <p>Prompt engineering involves the strategic crafting of input prompts to LLMs to elicit the most accurate
and relevant outputs. In legal applications, effective prompt engineering is critical because it significantly
impacts the model’s ability to provide legally sound advice and information. Recent advancements
have focused on developing methodologies that enhance the specificity and context-awareness of
prompts, thereby improving the precision of the model’s responses. These include techniques such
as prompt chaining, in which a series of logically connected prompts guides the model
through complex reasoning tasks, although such chaining may also degrade performance on the
knowledge-intensive tasks the model was originally designed for [15].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Current Methodologies in Instruction Fine-Tuning</title>
        <p>Instruction fine-tuning is a recent development aimed at refining the training process of LLMs to
better follow user instructions. Unlike traditional model training, instruction fine-tuning focuses on
aligning the model’s outputs with specific user expectations and requirements. In the legal field, this
is particularly advantageous for ensuring that models adhere to legal reasoning patterns and comply
with jurisdiction-specific regulations. This segment covers the latest methodologies in instruction
fine-tuning, including the application of specialized datasets (such as legal judgments or statutory
provisions) that train models to recognize and replicate the nuanced decision-making processes typical
in legal analyses [23, 24].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research Design</title>
      <sec id="sec-3-1">
        <title>3.1. Multiple Choice Questions in Legal Evaluation</title>
        <p>On the one hand, LLMs have become important in daily life and in research; on the other hand, they
remain black boxes into which new knowledge is difficult to inject. We therefore turn to one of the basic
instruments of legal education: multiple-choice questions (MCQs). MCQs play a critical role in legal education and
professional assessments. They are widely used in exams such as the bar examination to efficiently test a
broad range of legal knowledge. The format of MCQs allows for assessing students’ understanding of
key concepts and their ability to distinguish between closely related legal issues, a crucial skill in legal
practice.</p>
        <p>Moreover, the structured nature of MCQs makes them particularly suitable for automation using AI
technologies like LLMs. By incorporating LLMs in creating and grading MCQs, educational institutions
can enhance the objectivity and efficiency of assessments. LLMs can also be used to generate diverse
question sets that cover a wide array of topics, providing a robust tool for comprehensive legal training
[14, 15]. In Taiwan, this logic is particularly pronounced because the national judicial officer and bar
examinations are not only pedagogical tools but also part of a state-sanctioned “knowledge industry.”
These exams are high-stakes gateways to the legal profession, and their official questions and answers
embody codified statutory interpretation, case law reasoning, and doctrinal expectations. The fact that
cram schools, publishers, and exam candidates invest enormous resources into these MCQs underscores
their commercial value and national significance. Thus, by using Taiwanese bar exam MCQs as
fine-tuning data, we are not only leveraging a structured educational tool but also converting a resource of
immense social and economic value into semantically precise AI training material.</p>
        <p>However, the effectiveness of LLMs in this area depends heavily on their training and the quality
of data used. It is essential that the data reflect the specific legal principles and practices relevant to
the jurisdiction where the education or assessment is taking place. This ensures that the questions are
accurate and contextually appropriate, fostering a more effective and meaningful learning environment
[16, 17].</p>
        <p>To clarify the provenance and representativeness of our training data, we briefly summarise the
composition and sampling strategy for the MCQ corpus. The 70k-sample pool is drawn from publicly
available multiple-choice questions used in Taiwanese legal education and professional examinations,
including past judicial officer and bar exams as well as high-stakes preparatory materials. We then
construct two training sets: a smaller subset of 5,000 MCQs (for bz5k) and a larger subset of 70,000
MCQs (for bz70k). Both sets were sampled to preserve diversity across core legal domains, including
constitutional law, criminal law, civil law, administrative law, and procedural law.</p>
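        <p>As an illustration of this sampling step, the following minimal Python sketch shows how domain-balanced subsets could be drawn from the MCQ pool; the record layout (a "domain" field per item) and the function name are illustrative assumptions rather than part of a released toolkit.</p>
        <preformat>
import random
from collections import defaultdict

# Hypothetical record layout: {"question": ..., "options": [...], "answer": "C", "domain": "criminal_law"}
def stratified_sample(pool, n_total, seed=42):
    """Draw n_total MCQs while preserving the pool's per-domain proportions."""
    random.seed(seed)
    by_domain = defaultdict(list)
    for item in pool:
        by_domain[item["domain"]].append(item)
    sample = []
    for items in by_domain.values():
        k = round(n_total * len(items) / len(pool))   # proportional allocation per legal domain
        sample.extend(random.sample(items, min(k, len(items))))
    random.shuffle(sample)
    return sample[:n_total]

# Illustrative calls for the two training sets:
# train_5k  = stratified_sample(mcq_pool, 5_000)
# train_70k = stratified_sample(mcq_pool, 70_000)
        </preformat>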
        <p>To avoid data leakage with the 2023 Taiwanese Bar Examination (TBE) evaluation set, we rely on two
institutional characteristics of the Taiwanese examination system. First, by regulation, the Ministry of
Examination does not reuse questions across years. Each annual examination is produced by a newly
convened committee that deliberates and drafts items specifically for that year, ensuring that questions
are not duplicated across exam cycles. Second, because the TBE is typically administered in August,
constructing our training corpus exclusively from materials dated prior to 2023 guarantees that no
TBE-2023 items could have appeared in the fine-tuning data. This separation ensures that improvements
observed on TBE 2023, TMMLU, and MMLU genuinely reflect generalisation from the training corpus
rather than memorisation of evaluation items.</p>
        <p>Moreover, even assuming arguendo that data leakage had occurred, one would reasonably expect a
markedly elevated accuracy on the TBE 2023 set, as rote exposure to identical items typically produces
a clear and disproportionate performance spike. The empirical pattern observed here is inconsistent
with such an effect. Indeed, the only model with any plausible access to contemporaneous online
material—GPT-4o accessed through an API—may be more susceptible to overestimation due to its
capacity, in principle, to retrieve web-based information that indirectly references examination content.
Such asymmetry further supports the inference that the improvements achieved by our fine-tuned
models derive from genuine generalisation rather than contamination.</p>
        <p>With respect to item difficulty, while we do not provide a full psychometric decomposition, it is well
recognised within Taiwan’s legal examination regime that yearly fluctuations in difficulty are routine
and are often noted by examination specialists. Importantly, the system does not rely on absolute
cut scores but instead employs a relative-ranking mechanism: approximately the lowest 33.3% of candidates
are eliminated based on cohort-wide performance. This institutional design mitigates the impact of
year-to-year variance in difficulty and preserves comparability across examination cycles. Notably,
in the years comprising our training corpus, an accuracy level near 66.6% closely corresponds to the
empirical cumulative percentile distributions reported by the Ministry of Examination. Taken together,
these structural features reinforce our position that MCQs produced within this regime constitute a
stable, scalable, and jurisdiction-grounded signal for legal semantic specialisation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Instruction Fine-tuning: Breeze-7B</title>
        <p>In this study, we build on the Breeze-7B-Base model [27], which rests on the foundations of the
Mistral-7B architecture, by incorporating an extensive set of MCQs into its training
regimen. The original Breeze model, without any specific fine-tuning towards these datasets, serves as a
control to understand the baseline capabilities of the LLM. The bz5k model, fine-tuned with 5,000 samples,
represents a modest increase in dataset-specific training. The bz70k model, representing a substantial
fine-tuning effort with 70,000 samples, aims to tailor the model towards the dataset characteristics
significantly.</p>
        <p>Our approach to instruction fine-tuning involves directly using the multiple-choice options as inputs.
The output consists of the correct option (A, B, C, or D) along with the content of the option, which
provides additional information. This method ensures that the model not only selects the correct answer
but also captures the context and details associated with each option, enhancing its ability to handle
similar questions effectively. For example:</p>
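        <p>A minimal sketch of a single training pair is shown below; the field names and the abridged wording are illustrative assumptions rather than an excerpt from our corpus.</p>
        <preformat>
# One instruction-tuning pair (hypothetical field names; content abridged for illustration).
example = {
    "instruction": (
        "Which of the following statements about liability for aggravated injury is correct?\n"
        "(A) ...\n(B) ...\n(C) ...\n(D) ..."
    ),
    # The target is the correct letter plus the full option text, so the model
    # learns the answer together with the content that justifies it.
    "output": "(C) A should have objectively foreseen that B might fall into the pit ...",
}
        </preformat>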
        <p>For both bz5k and bz70k, we fine-tune Breeze-7B using the AdamW optimiser with a learning rate of
2e-5 and a linear warm-up over the first 10% of the training steps. We trained for one or three epochs and
ultimately chose the one-epoch version because of its better performance. Training used a global batch size of
32 and a maximum sequence length of 1,024 tokens. All models are trained on one NVIDIA A6000
(80GB) GPU using bfloat16 mixed precision to reduce memory consumption. The random seed is fixed
to 42 to ensure reproducibility. For LoRA, we adopt standard low-rank adaptation parameters with a
rank (r) of 8, an alpha of 16, and a dropout rate of 0.05. Except for the size of the fine-tuning corpus (5k
vs. 70k MCQs), all other hyperparameters remain identical so that observed performance differences
can be attributed to differences in training volume rather than in the optimisation scheme or model
architecture.</p>
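        <p>The sketch below restates this configuration with the Hugging Face transformers and peft libraries. Only the reported hyperparameters (AdamW at 2e-5, 10% linear warm-up, global batch size 32, 1,024-token sequences, bfloat16, seed 42, LoRA r = 8, alpha = 16, dropout 0.05, one epoch) come from our setup; the exact argument mapping and the accumulation split are assumptions for illustration.</p>
        <preformat>
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, set_seed

set_seed(42)                                              # fixed seed for reproducibility
model_name = "MediaTek-Research/Breeze-7B-Base-v1_0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA adapter: rank 8, alpha 16, dropout 0.05 (values reported above).
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Optimisation settings as reported; the per-device/accumulation split is an assumption.
args = TrainingArguments(
    output_dir="bz70k",
    learning_rate=2e-5,              # AdamW learning rate
    warmup_ratio=0.1,                # linear warm-up over the first 10% of steps
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # 8 x 4 = global batch size of 32
    num_train_epochs=1,              # the retained one-epoch variant
    bf16=True,
    seed=42,
)
# Inputs are truncated or padded to a maximum of 1,024 tokens during preprocessing (not shown).
        </preformat>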
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Design</title>
        <p>These models were benchmarked against a general baseline model, GPT-3.5, and the more advanced
GPT-4o. The datasets employed for evaluation included the Massive Multitask Language Understanding benchmark (MMLU)
[28], the Taiwan Massive Multitask Language Understanding benchmark (TMMLU) [29], and the 2023 Taiwanese Bar
Examination questions [18]. We employ two complementary evaluation protocols to assess the semantic
and jurisdictional alignment of the models: (i) probability-based scoring, and (ii) zero-shot prompt-based
evaluation. Using both methods allows us to measure not only the model’s underlying token-level
preferences but also its behaviour in a more realistic prompting setting.</p>
        <p>(i) Probability-Based Evaluation (Figure 1)</p>
        <p>For probability-based scoring, we follow the standard MMLU evaluation protocol. Each MCQ is
converted into four separate model queries, each corresponding to a candidate option (“A”, “B”, “C”,
or “D”). For each query, we extract the logits assigned to the option token by the final decoder layer
with sampling disabled (temperature = 0, top-p = 0), and normalise them using a softmax over the four
options. The option with the highest probability is taken as the model’s prediction. This procedure
yields fully deterministic results and enables systematic comparison across models. Because probability
extraction does not require autoregressive generation, no new tokens are produced beyond the option
token itself, ensuring consistency across runs.</p>
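        <p>A minimal sketch of this scoring step, assuming a causal LM whose answer options tokenize to a single token and an “Answer:”-style template (our exact template is omitted here), is as follows.</p>
        <preformat>
import torch
import torch.nn.functional as F

OPTIONS = ["A", "B", "C", "D"]

@torch.no_grad()
def predict_option(model, tokenizer, stem_with_options):
    """One deterministic forward pass per candidate option; softmax over the four option logits."""
    option_logits = []
    for opt in OPTIONS:
        prompt = f"{stem_with_options}\nAnswer: {opt}"          # assumed prompt template
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        logits = model(input_ids=ids).logits                    # no sampling, no new tokens generated
        option_id = ids[0, -1]                                  # id of the appended option token
        option_logits.append(logits[0, -2, option_id])          # logit assigned to that option token
    probs = F.softmax(torch.stack(option_logits), dim=-1)       # normalise over the four options only
    return OPTIONS[int(probs.argmax())], probs.tolist()
        </preformat>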
        <p>(ii) Prompt-Based Zero-Shot Evaluation (Figure 2)</p>
        <p>In addition to probability scoring, we evaluate the models in a zero-shot prompting setting that
mirrors real-world usage. Each MCQ is presented as a single prompt containing the stem and the four
candidate options, and the model is asked to answer directly. If the model explicitly outputs one of “A”,
“B”, “C”, or “D,” that token is taken as the predicted answer. In the rare cases where no option token
appears in the output (fewer than 10% of all prompts), we follow prior work and assign a neutral default
answer (“C”). This convention avoids privileging extreme positions (e.g., always selecting “A” or “D”)
and allows fair comparison across models with different verbosity levels.</p>
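        <p>A small helper of the following form, whose regular expression is our own simplification of the matching rule, captures the letter-extraction and neutral-default convention described above.</p>
        <preformat>
import re

def extract_answer(generated_text, default="C"):
    """Map free-form model output to one of A/B/C/D; fall back to the neutral default."""
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else default

# e.g. extract_answer("The correct answer is (C) ...") returns "C";
#      extract_answer("I cannot decide.") returns the default "C".
        </preformat>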
        <p>(iii) Rationale for Two Protocols. Probability-based scoring isolates the model’s preference structure at
the token level and is therefore sensitive to fine-grained semantic distinctions. In contrast, prompt-based
evaluation captures the model’s behaviour under naturalistic conditions, where instruction following,
formatting, and verbosity also influence performance. Using both provides a more comprehensive
assessment of semantic specialization, jurisdictional alignment, and robustness. The evaluation of
these models was conducted using two different methodologies to assess their performance in legal
multiple-choice question scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Research Result and Discussion I: Fine-tuning Quantity Effect</title>
      <sec id="sec-4-1">
        <title>4.1. Probability Selection Evaluation</title>
        <p>The table "Comparing Diferent Fine-tuning Quantity Efects" showcases the impact of varying
quantities of data used in fine-tuning on the performance of the Breeze model across diferent datasets.
These datasets encompass the Taiwanese Multi-Modal Legal Understanding (TMMLU) in Law and
Engineering domains, the broader Multi-Modal Legal Understanding (MMLU), and the 2023 Taiwanese
Bar Examination (TBE).</p>
        <p>1. Dataset-Specific Performance: TMMLU (Law) and TMMLU (Engineering): For the Law subset of
TMMLU, increasing the fine-tuning data quantity results in improved performance, as evident
from the bz70k model’s score of 0.486 compared to the bz5k’s 0.401 and the baseline’s 0.407. This
suggests that a larger dataset helps the model better understand and adapt to legal nuances.
Conversely, in the Engineering subset, the performance decreases as the quantity of fine-tuning
increases (0.458 in bz70k down from 0.498 in the baseline). This could indicate overfitting or
perhaps the introduction of noise or less relevant information through the additional data.
2. Different Language Performance: MMLU</p>
        <p>Here, we see a slight improvement in bz5k over the baseline (0.562 vs. 0.560), but a reduction
with bz70k (0.515). This pattern suggests that while some targeted fine-tuning can be beneficial,
excessive fine-tuning may lead to diminishing returns or negative transfer, where too much
specificity detracts from the model’s general applicability. This demonstrates that if an LLM
performs better in one language, it often performs worse in another. This may be related to the
parameters of individual tokens, where fine-tuning can detrimentally affect the original linguistic
structure of the LLM.</p>
        <p>The observation that LLMs may exhibit improved performance in one language at the expense of
another underscores a critical aspect of language model training known as the trade-off between
specialization and generalization. In the context of LLMs, which are often trained on diverse
multilingual corpora, each token’s parameters carry the weight of representing linguistic features
pertinent to multiple languages. During the fine-tuning process, these parameters are adjusted
to optimize performance for specific tasks or languages. This optimization, while beneficial in
the targeted context, can inadvertently lead to a degradation of the model’s capabilities in other
languages not emphasized during the fine-tuning phase.
3. The Latest Local Knowledge: TBE</p>
        <p>Performance on the Taiwanese Bar Examination dataset improves significantly with the highest
data volume (bz70k), increasing from 0.486 to 0.514. This improvement indicates that
comprehensive legal training data can enhance model performance on specialized legal tasks such as
bar exams, which likely benefit from a deeper understanding of localized legal principles and
practices.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Discussion</title>
        <p>The graph represents the performance comparison of three models (Breeze, bz5k, bz70k) across four
different datasets (TMMLU-Law, TMMLU-Engineering, MMLU, TBE), with each model serving as a
point on the x-axis and performance scores on the y-axis. Different line styles distinguish each dataset.
1. TMMLU (Law) (dotted line): Shows a trend of improvement as the fine-tuning data volume
increases, peaking with the bz70k model.
2. TMMLU (Engineering) (dash-dot line): This line trends downward, indicating a decrease in
performance with more extensive fine-tuning, potentially due to overfitting or less relevant
fine-tuning data for engineering-specific content.
3. MMLU (dashed line): Performance slightly increases with moderate fine-tuning (bz5k) but
decreases with extensive fine-tuning (bz70k), suggesting that a balance needs to be found to avoid
diminishing returns.
4. TBE (solid line): Shows a recovery in performance with the most extensive fine-tuning (bz70k),
indicating that larger, more focused datasets may be beneficial for specialized legal examinations
like the bar exam.</p>
        <p>This graph visually illustrates how varying the amount of fine-tuning data impacts model performance
across different domains. It highlights the need for careful consideration of how much and what type of
data to use for fine-tuning to optimize performance without compromising on the model’s generalization
capabilities. This insight is crucial for applying LLMs to specialized fields where accuracy and specificity
are paramount.</p>
        <p>Impact of Dataset Size in Fine-tuning: The varying effects of increased fine-tuning data volumes across
different datasets underscore the need for a nuanced approach to model training, where the quantity
and quality of data must be carefully managed to avoid overfitting while still achieving meaningful
improvements.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Research Result and Discussion II: Prompt Evaluation</title>
      <sec id="sec-5-1">
        <title>5.1. Prompt Evaluation</title>
        <p>In the second phase, we input the MCQ strings directly through the API (see Section 3.3, Evaluation Design),
which allows us to compare our results with other LLMs.</p>
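        <p>For the API-based models, the querying step can be sketched as follows; the client usage follows the OpenAI Python SDK, but the wrapper function and the model identifier shown are illustrative rather than a verbatim excerpt of our pipeline.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # API key taken from the environment

def ask_mcq(mcq_text, model="gpt-3.5-turbo"):
    """Send the full MCQ string as one user message and return the raw reply text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": mcq_text + "\nAnswer with A, B, C, or D."}],
    )
    return response.choices[0].message.content

# The reply is then mapped to a letter with the extract_answer() helper sketched in Section 3.3.
        </preformat>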
        <p>The table provides comparative performance data for different language models on MCQs across
three datasets: TMMLU (administrative law, a sub-category of the Law category), MMLU, and TBE. It
shows how each model fares in accurately responding to prompts within these specific domains.</p>
        <p>First of all, we wish to skip the discussion of GPT-4o since it is an outlier in terms of performance.
Moreover, because it is much larger, its performance is not directly comparable to our approximately 7B
parameter model. We can see the trends in the Breeze series and the other LLMs:
1. Impact of Fine-tuning Method: Because our instruction fine-tuning data offer sufficient MCQ diversity,
the bz70k model achieves high performance when asked directly. Where bz70k lags in some
TMMLU fields, it is simply because it was not “familiar” with those instructions.
2. Quantity Affects Quality: This table clearly illustrates that fine-tuning with a larger volume of
data specifically tailored to the task at hand can significantly enhance a model’s performance.
The bz70k’s success across datasets indicates that the additional specific training it received is
markedly more effective than that of Breeze-base and bz5k, and even surpasses GPT-3.5.
3. General vs. Specialized Models: The comparison between the bz70k variants and GPT-4o highlights
an essential aspect of language model application: general models can perform well across broad
tasks, but with domain-specific fine-tuning, a smaller LLM (7B) can approach the performance of
much larger models.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Discussion</title>
        <p>The graph illustrates the performance comparison across different language models on three datasets:
TMMLU (administrative law), MMLU, and TBE (Taiwanese Bar Examination). Each model is represented
on the X-axis, and the performance score, likely accuracy or a similar metric, is represented on the
Y-axis. Different line styles and colors distinguish the performance of each dataset.</p>
        <p>Comparison of Breeze, bz5k, bz70k, GPT-3.5, and GPT-4o on TMMLU (administrative law), MMLU,
and TBE under direct prompting conditions. The bz70k model shows marked improvement in Taiwanese
bar exam accuracy, rivaling larger general-purpose LLMs. These results highlight the potential of
fine-tuned small LLMs to serve as cost-efficient, explainable, and localized AI tools for legal applications.
Trend Analysis of the graph:
1. Incremental Improvements: The graph illustrates a clear trend of incremental performance
improvements as we move from the baseline Breeze model to the bz5k and then to the bz70k.</p>
        <p>This trend is evident across all datasets but varies in magnitude.
2. TMMLU (Admin Law): For the TMMLU (Admin Law) dataset, the performance improvement
from Breeze (0.25) to bz5k (0.38) and then to bz70k (0.58) is quite pronounced. This significant
uptick suggests that the additional training samples used in fine-tuning the bz70k model are
highly effective at enhancing the model’s capabilities in handling complex administrative law
scenarios.
3. MMLU: The trend in the MMLU dataset follows a similar pattern. Starting from a performance
score of 0.32 with Breeze, there is a noticeable increase to 0.54 with bz5k, and further improvement
to 0.59 with bz70k. This consistent increase across fine-tuning stages underscores the effectiveness
of using larger, more targeted training sets for enhancing model performance in general legal
contexts.
4. TBE: In the TBE dataset, the performance jumps considerably from Breeze (0.106) to bz5k (0.423),
and sees a significant peak at bz70k (0.64). This demonstrates that extensive fine-tuning with a
large volume of specialized data is particularly beneficial for models that navigate the complexities
of bar examination questions, which likely involve nuanced legal reasoning and specific legal
knowledge.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Qualitative Error Analysis</title>
        <p>To complement the aggregate accuracy scores, we conducted a small-scale qualitative error analysis
on a sample of questions drawn from TBE 2023 and TMMLU-Law. A representative error pattern arises
in questions involving causal attribution and the foreseeability of aggravated results. Consider the
following item from TBE 2023:</p>
        <p>
          A and B bore longstanding animosity toward each other. One day, upon seeing B loitering on the street, A
proceeded to strike B. Startled after being attacked, B attempted to flee but accidentally fell into a roadside
construction pit. As a result, in addition to the contusions inflicted by A, B suffered a comminuted fracture
of the right lower leg, causing destruction of the limb. According to the latest judicial practice, which of the
following statements is correct? (A) Although B sustained severe injury, the fact that B accidentally fell into
the roadside construction pit constitutes a self-responsible act on the part of the victim; therefore, A is
liable only for the ordinary injury offense under Article 277(1) of the Criminal Code for striking B’s head.
(B) Because B’s right lower leg merely sustained a fracture with destruction, it does not satisfy the definition
of “destruction or serious impairment of the function of one or more limbs” constituting a serious injury
under Article 10(4)(4) of the Criminal Code. (C) A should have had objective foreseeability that B might fall
into the construction pit and thereby suffer destruction of the right lower leg; thus, A bears liability for the
offense of causing serious injury under the latter part of Article 277(2) of the Criminal Code. (D) Because
B’s serious injury did not result from the aggravation of the head contusion, the offense of causing
serious injury under the latter part of Article 277(2) of the Criminal Code is not established. Correct answer: (C).
        </p>
        <p>
          Both the base Breeze-7B model and the bz5k variant frequently selected option (D), reflecting a
misunderstanding of Taiwan’s established doctrine that the victim’s panic-driven flight does not constitute
a self-responsible interruption of causation. Under contemporary Taiwanese criminal jurisprudence the
aggravated injury remains objectively foreseeable to the initial assailant and therefore attributable to
him under Article 277(2) of the Criminal Code. This example illustrates how large-scale
jurisdiction-specific MCQ fine-tuning enables the model to internalize nuanced doctrinal concepts—such as the
scope of risk and objective foreseeability—that are rarely captured by general-purpose LLMs.
        </p>
        <p>Beyond the specific examples, a broader pattern emerges from our qualitative inspection. The
models—particularly the base Breeze-7B and the bz5k variant—tend to exhibit substantially higher error
rates on questions that depend on (i) Taiwan-specific statutory provisions, and (ii) multi-step causal
attribution under contemporary Taiwanese criminal jurisprudence. Items requiring precise knowledge
of local statutory wording (e.g., the threshold definitions of “serious injury” under Criminal Code Article
10 or institutional authority allocations under the organizational law of the courts) frequently led to
incorrect answers by non-specialised models, suggesting that general LLM pretraining corpora contain
insufficient coverage of Taiwanese doctrinal sources.</p>
        <p>Similarly, questions involving complex causation analysis—such as differentiating between victim
self-responsibility, independent intervening acts, and results that remain within the objectively foreseeable
scope of risk—proved particularly challenging. These topics involve doctrinal nuances that are heavily
shaped by Taiwan’s case law and academic commentary. The bz70k model performs notably better in
these scenarios, indicating that large-scale instruction tuning on jurisdiction-grounded MCQs provides
the model with access to precisely those doctrinal subtleties that generic LLMs lack.</p>
        <p>Taken together, these patterns demonstrate that errors cluster around areas where local legal
knowledge or multi-stage legal reasoning is required—precisely the domains in which jurisdiction-specific
fine-tuning is most impactful.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <sec id="sec-6-1">
        <title>6.1. Performance of Fine-tuned Small LLMs</title>
        <p>The inference results across the Breeze model family—from bz5k to bz70k—demonstrate the decisive
influence of both data volume and domain specificity on model performance in legal tasks. The
Taiwanese Bar Examination (TBE) dataset, in particular, functions as a highly practical source of localized
legal knowledge, providing a rigorous benchmark for evaluating how effectively AI systems can operate
within jurisdiction-specific contexts. This aligns with a broader imperative in AI development: the
need for datasets embedded in concrete legal and cultural environments to evaluate the extent to which
LLMs can meaningfully adapt to localized conditions.</p>
        <p>The strong performance of bz70k on the TBE benchmark shows that, given sufficient targeted
instruction tuning, small LLMs can attain substantial proficiency in legal reasoning, statutory interpretation,
and doctrinal alignment. This finding has direct implications for the deployment of AI in legal settings,
where accuracy, jurisdictional compliance, and the capacity to reflect professional reasoning practices
are essential. The progression from bz5k to bz70k underscores the value of incremental fine-tuning
strategies and highlights the necessity of aligning training data with the substantive demands of legal
tasks.</p>
        <p>More broadly, this study affirms the central premise that data quantity—when paired with appropriate
data quality—plays a critical role in shaping the semantic competence of LLMs. This observation echoes
the title of the paper, “Quantity Affects Quality,” by illustrating how substantial and well-curated
fine-tuning inputs can significantly enhance model outputs, even when the baseline model is comparatively
modest. Ultimately, high-quality, jurisdiction-grounded training data enables LLMs to acquire the
doctrinal precision and contextual awareness necessary for real-world legal applications.</p>
        <p>Our improvement plan has met the original research objectives:
1. Localized and Personalized LLMs. Digital Sovereignty and Localization. The project successfully
incorporates jurisdiction-specific legal knowledge into LLMs through targeted fine-tuning. This
approach aligns with the global movement toward digital sovereignty, in which regions or
institutions regain control over the informational and operational parameters of their AI systems.
Personalizing LLMs to reflect local legal standards enhances both practical utility and doctrinal
fidelity, ensuring that the generated analyses and responses remain legally sound and contextually
appropriate.
2. Data Volume and Model Performance. The study confirms that expanding the volume
of fine-tuning data produces substantial gains in model performance, particularly for smaller
architectures that lack the inherent advantages of models such as GPT-3.5 or GPT-4o. Larger and
well-curated datasets allow these smaller models to close the performance gap, challenging the
assumption that scale alone determines capability. This finding underscores the iterative nature of
AI development: future advances in base models can likely benefit from the same domain-specific
fine-tuning strategies examined here.
3. Practical Application and Instructional Method. Alignment with Real-World Use Scenarios. By
using direct question input as the primary instruction mechanism, the project mirrors real-world
legal inquiry patterns. This alignment improves practical usability, enabling the model to generate
responses that more closely reflect the tasks and reasoning structures encountered in actual legal
practice.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Research Limitations and Future Prospects</title>
        <p>1. Technological and Resource Limitations. Constraints on computing resources limited our
ability to experiment with larger base models, a common challenge in contemporary LLM research.
Because larger models demand substantial computational infrastructure, future work would
benefit from institutional partnerships or shared-resource models that enable experimentation
with more advanced architectures. Such collaborations could accelerate progress and foster a
more equitable research ecosystem.
2. Expanding the Scope of Legal Informatics. Although MCQs offer a structured and scalable
foundation for modeling legal knowledge, they capture only a subset of legal reasoning. Doctrinal
analysis, argumentation, and judgment-like reasoning remain more complex tasks that require
deeper semantic understanding. Future research should therefore expand beyond MCQs to
incorporate higher-order legal tasks—such as argument mining, statutory interpretation across
contexts, or analysis of legal policy debates—that more closely approximate human legal reasoning.
3. Improving Evaluation Methods. Existing evaluation protocols may not fully capture the
reproducibility, normative coherence, or deliberative quality required for legal applications.
Effective legal AI must produce consistent and defensible outputs across varied conditions. Thus,
more robust evaluation frameworks are essential—potentially involving standardized datasets,
metrics that reflect legal reasoning quality rather than accuracy alone, and systematic
cross-validation procedures. Human-in-the-loop assessments and explainability metrics should likewise
play a larger role in future evaluations.</p>
        <p>The principle that “Quantity Affects Quality” highlights the transformative influence of data scale
and quality on LLM development. Our results demonstrate that large-scale fine-tuning on
jurisdictional MCQs significantly enhances the semantic precision of small LLMs, particularly in statutory
interpretation and norm recognition. Yet this achievement also exposes a fundamental limitation:
specialization may narrow generality. When optimized to excel at multiple-choice reasoning, models
can lose robustness in broader or multilingual settings, as reflected in our comparative performance on
MMLU. This trade-off—between localization and generalization—remains a core challenge in legal NLP
and cannot be fully resolved under current training paradigms.</p>
        <p>Future research must proceed along three complementary directions. Technologically, modular
fine-tuning and mixture-of-experts architectures may allow specialized legal reasoning modules to
coexist with general-purpose capabilities. Methodologically, expanding beyond MCQs to integrate
diverse legal tasks—such as argument extraction, contract-clause consistency checking, multilingual
statutory alignment, or legal discourse analysis—can enrich the semantic landscape of fine-tuning.
Evaluatively, new benchmarks are needed to assess legal reasoning quality, doctrinal soundness, and
explainability, rather than accuracy alone.</p>
        <p>In sum, while jurisdictional MCQs provide a cost-efficient and semantically rich foundation for
training localized legal LLMs, they also reveal the enduring tension between domain specialization
and broad adaptability. Addressing this tension will require sustained innovation in training strategies,
cross-jurisdictional collaboration, and the development of evaluation frameworks that respect both
local legal sovereignty and the need for interoperability in global AI research.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>[9] Sun Z. A short survey of viewing large language models in legal aspect. ArXiv. 2023;abs/2303.09136. doi:10.48550/arXiv.2303.09136.</p>
      <p>[10] Shaghaghian S, Feng L, Jafarpour B, Pogrebnyakov N. Customizing contextualized language models for legal document reviews. In: 2020 IEEE International Conference on Big Data (Big Data). 2020. p. 2139-2148. doi:10.1109/BigData50022.2020.9378201.</p>
      <p>[11] Zhang D, Petrova A, Trautmann D, Schilder F. Unleashing the power of large language models for legal applications. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023. doi:10.1145/3583780.3615993.</p>
      <p>[12] Trozze A, Davies TP, Kleinberg B. Large language models in cryptocurrency securities cases: Can ChatGPT replace lawyers? ArXiv. 2023;abs/2308.06032. doi:10.48550/arXiv.2308.06032.</p>
      <p>[13] Elwany E, Moore DA, Oberoi G. BERT goes to law school: Quantifying the competitive advantage of access to large legal corpora in contract understanding. ArXiv. 2019;abs/1911.00473.</p>
      <p>[14] Shui R, Cao Y, Wang X, Chua T. A comprehensive evaluation of large language models on legal judgment prediction. ArXiv. 2023;abs/2310.11761. doi:10.48550/arXiv.2310.11761.</p>
      <p>[15] Robinson J, Rytting C, Wingate D. Leveraging large language models for multiple choice question answering. ArXiv. 2022;abs/2210.12353. doi:10.48550/arXiv.2210.12353.</p>
      <p>[16] Zhang Z, Lei L, Wu L, Sun R, Huang Y, Long C, Liu X, Lei X, Tang J, Huang M. SafetyBench: Evaluating the safety of large language models with multiple choice questions. ArXiv. 2023;abs/2309.07045. doi:10.48550/arXiv.2309.07045.</p>
      <p>[17] Bitew SK, Deleu J, Develder C, Demeester T. Distractor generation for multiple-choice questions with predictive prompting and large language models. ArXiv. 2023;abs/2307.16338. doi:10.48550/arXiv.2307.16338.</p>
      <p>[18] Ministry of Examination, Taiwan. “Exam Question and Answer Search System,” Ministry of Examination, Available at: https://wwwq.moex.gov.tw/exam/wFrmExamQandASearch.aspx?y=2012&amp;e=101120, accessed on August 8, 2024.</p>
      <p>[19] Nay JJ, Karamardian D, Lawsky S, Tao W, Bhat MM, Jain R, Lee AT, Choi JH, Kasai J. Large language models as tax attorneys: A case study in legal capabilities emergence. ArXiv. 2023;abs/2306.07075. doi:10.48550/arXiv.2306.07075.</p>
      <p>[20] Phogat KS, Harsha C, Dasaratha S, Ramakrishna S, Puranam SA. Zero-shot question answering over financial documents using large language models. ArXiv. 2023;abs/2311.14722. doi:10.48550/arXiv.2311.14722.</p>
      <p>[21] Kojima T, Gu S, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. ArXiv. 2022;abs/2205.11916. doi:10.48550/arXiv.2205.11916.</p>
      <p>[22] Cheng D, Huang S, Bi J, Zhan YW, Liu J, Wang Y, Sun H, Wei F, Deng D, Zhang Q. UPRISE: Universal prompt retrieval for improving zero-shot evaluation. ArXiv. 2023;abs/2303.08518. doi:10.48550/arXiv.2303.08518.</p>
      <p>[23] Ni Y, Jiang S, Wu X, Shen H, Zhou Y. Evaluating the robustness to instructions of large language models. ArXiv. 2023;abs/2308.14306. doi:10.48550/arXiv.2308.14306.</p>
      <p>[24] Xu C, Sun Q, Zheng K, Geng X, Zhao P, Feng J, Tao C, Jiang D. WizardLM: Empowering large language models to follow complex instructions. ArXiv. 2023;abs/2304.12244. doi:10.48550/arXiv.2304.12244.</p>
      <p>[25] Hugging Face. “Open LLM Leaderboard MMLU,” Hugging Face Blog, Available at: https://github.com/huggingface/blog/blob/main/open-llm-leaderboard-mmlu.md, accessed on August 8, 2024.</p>
      <p>[26] Liu Yu-Wei. “LLM Model Evaluation,” GitHub Repository, Available at: https://github.com/LiuYuWei/llm_model_evaluation, accessed on August 8, 2024.</p>
      <p>[27] MediaTek Research. “Breeze-7B-Base-v1_0,” Hugging Face Model, Available at: https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v1_0, accessed on August 8, 2024.</p>
      <p>[28] Pei-Yuan Liu. “MMLU Dataset,” Kaggle Dataset, Available at: https://www.kaggle.com/datasets/peiyuanliu2001/mmlu-dataset, accessed on August 8, 2024.</p>
      <p>[29] iKala. “TMMLUPlus Dataset,” Hugging Face Dataset, Available at: https://huggingface.co/datasets/ikala/tmmluplus, accessed on August 8, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Jenkins</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>What can information technology do for law</article-title>
          .
          <source>Harvard J Law Technol</source>
          .
          <year>2008</year>
          ;
          <volume>21</volume>
          :
          <fpage>589</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Waltl</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zec</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matthes</surname>
            <given-names>F</given-names>
          </string-name>
          .
          <article-title>A data science environment for legal texts</article-title>
          .
          <source>In: Proceedings of the International Conference on Artificial Intelligence and Law (ICAIL)</source>
          .
          <year>2015</year>
          . p.
          <fpage>193</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Conrad</surname>
            <given-names>JG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Kofahi</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Effective document clustering for large heterogeneous law firm collections</article-title>
          .
          <source>In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          .
          <year>2005</year>
          . p.
          <fpage>177</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Pacheco</surname>
            <given-names>HR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavez</surname>
            <given-names>MM</given-names>
          </string-name>
          .
          <article-title>Contemporary challenges in legal informatics: Workshop INJU</article-title>
          .
          <source>In: 2016 11th Iberian Conference on Information Systems and Technologies (CISTI)</source>
          .
          <year>2016</year>
          . p.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lachmayer</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyras</surname>
            <given-names>V</given-names>
          </string-name>
          .
          <article-title>Visualization of legal informatics</article-title>
          .
          <source>J Vis Law</source>
          .
          <year>2021</year>
          ;
          <volume>3</volume>
          :
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Sharma</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gamoura</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prasad</surname>
            <given-names>DM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aneja</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Emerging legal informatics towards legal innovation: Current status and future challenges and opportunities</article-title>
          .
          <source>Legal Inf Manage</source>
          .
          <year>2021</year>
          ;
          <volume>21</volume>
          :
          <fpage>218</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Katz</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolin</surname>
            <given-names>R</given-names>
          </string-name>
          . Legal informatics. Cambridge: Cambridge University Press;
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Šavelka</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashley</surname>
            <given-names>KD</given-names>
          </string-name>
          .
          <article-title>The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts</article-title>
          .
          <source>Front Artif Intell</source>
          .
          <year>2023</year>
          ;
          <volume>6</volume>
          :
          <fpage>1279794</fpage>
          . doi:10.3389/frai.2023.1279794.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>