<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rabula: A Benchmark for Evaluating LLMs in Brazilian Legal Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eduardo Caruso Barbosa Pacheco</string-name>
          <email>edu@atlasia.tech</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernanda Mattar Suriani</string-name>
          <email>fernandasuriani@alumni.usp.br</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Ribeiro</string-name>
          <email>ricardosribeiro1976@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff>
          <institution>Atlas.IA</institution>
          ,
          <addr-line>Santa Catarina</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff>
          <institution>Lawgorithm</institution>
          ,
          <addr-line>São Paulo</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Advocacia-Geral da União (AGU)</institution>
          ,
          <addr-line>Brasília</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Rabula is a benchmark for evaluating large language models (LLMs) in the Brazilian legal domain, addressing the current limitation of assessments based solely on multiple-choice legal questions. Built upon the 2024 Brazilian Bar Exam (OAB), it includes four tasks: multiple-choice questions, legal document selection, document drafting, and essay-style legal problem solving. Multiple-choice responses are evaluated against the official answer key, while generative tasks are assessed using an LLM-as-judge method based on the official OAB grading criteria. Each generated answer is scored through a set of binary questions weighted by difficulty. Human evaluation using a golden label (from three legal experts) shows high agreement with the best evaluator model (Cohen's Kappa [14] 79.4) on the essay-style task, comparable to the agreement among the human reviewers themselves (Fleiss' Kappa [13] 78.9). On the document drafting task, however, agreement was lower (Cohen's Kappa 64.3) than among the humans (Fleiss' Kappa 76.2). Rabula allows for granular assessment of model competence across legal areas and tasks. As a demonstration, we compare OpenAI's gpt4o-mini with Maritaca.AI's Sabiazinho-3, revealing that performance varies both by legal area and by task. This benchmark aims to close the evaluation gap for Portuguese-language LLMs and foster the development of more capable legal models in Brazil.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmark</kwd>
        <kwd>LLM</kwd>
        <kwd>LLM-as-a-Judge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Brazilian Judiciary has made significant progress in adopting artificial intelligence for both
administrative and judicial tasks. According to the 2023 AI Projects Dashboard in the Judiciary
from the National Council of Justice, there were 140 AI projects in 2023, compared to 111 in 2022,
with 80% of courts having at least one AI project in 2023. Generative AI holds relevance due to its
potential for drafting legal documents, which requires significant effort from human professionals
in both court administration and judicial activities. The same dashboard indicates that 39.7% of
courts already use or are implementing Generative AI in judicial activities, while 21.9% do so in
administrative tasks. The importance of AI has also caught the attention of the Brazilian
government, which announced investments of BRL 23 billion by 2028 to develop the sector. Despite
these developments, there has been little effort to create or adapt LLMs specifically for Portuguese
or specialize them in the legal domain. The recent surge in publicly available LLMs has not
translated into the development of foundational models in Portuguese, especially those tailored for
the Brazilian legal field. Notable exceptions include Sabiá-3 from Maritaca.AI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Juru [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (based on
Sabiá-2 and specialized in the legal domain), and Tucano [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>While these projects are commendable, the limited development of Portuguese generative AI
specialized in the legal domain presents practical challenges for these initiatives and imposes
constraints on conclusions regarding model capabilities. Among the mentioned projects, no
benchmark exists for the Portuguese legal domain beyond multiple-choice tests.</p>
      <p>
        For instance, Sabiá-3’s legal domain performance was evaluated using two multiple-choice
exams. The first, introduced in the Sabiá-2 article [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is based on 160 multiple-choice questions
from the 2023 OAB exam for lawyer qualification. The second, called ENAM, is a qualification
exam for law graduates aspiring to judicial careers, consisting of 160 multiple-choice questions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
For the legal domain, the only task tested was the ability to answer multiple-choice questions. The
model achieved 75.9% accuracy on the first phase of the OAB exams and 64.9% on the ENAM.
      </p>
      <p>
        Tucano [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], on the other hand, used a dataset of 1,820 questions from the first phase of the OAB
exam, as detailed in the article "Passing the Brazilian OAB Exam: Data Preparation and Some
Experiments" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Tucano performed poorly, with Pearson product-moment correlation scores
between 0.21 and 0.34 depending on model size.
      </p>
      <p>
        Meanwhile, Juru [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], developed in collaboration with Maritaca.AI, was evaluated using 160
multiple-choice questions from the OAB exam and 35 multiple-choice questions from the ENADE exam, which is designed for
law students. Juru achieved 82.5% accuracy on the ENADE and 62.2% on the OAB, both
multiple-choice tasks.
      </p>
      <p>The authors of Juru argue that open-ended question evaluations using LLM-as-judge may
exhibit low correlation with human evaluators due to domain-specific vocabulary, knowledge, and
the inherent difficulty of evaluating open-ended responses, which are costly to assess. They claim
that multiple-choice questions serve as better benchmarks in the context of Juru development.</p>
      <p>
        We diverge from this view for several reasons. First, there is not necessarily low correlation
between human evaluators and LLM-as-a-judge, as variation can be mitigated through
decision-making processes involving many techniques to improve performance. Second, the official OAB
exam provides a scoring key for essay questions, offering a standardized answer that reduces
evaluator (human or machine) discretion and simplifies decision-making to a binary compliance
check. Thus, the problem characterization was inaccurate: while the questions are open-ended, the
responses are evaluated through a less complex task, owing to the binary nature of the criteria, which mitigates
the criticisms raised. Finally, our experiment on response convergence showed a Fleiss' Kappa [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] of
0.7893 or 0.7625, depending on the task, among three human evaluators. Comparing the evaluator
models against the golden standard (majority vote among humans), we found a mean Cohen's
Kappa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] of 0.7941 for the best model in one task and 0.643 in the other.
      </p>
      <p>In addition to justifying the adequacy of LLM-as-judge evaluations, it is worth questioning the
argument that multiple-choice tests are inherently more suitable. Although multiple-choice tests
measure domain knowledge breadth and have their utility, they do not reflect the daily tasks of
legal professionals. Tasks like drafting legal documents, selecting the correct procedural document,
or providing legal opinions are assessed in the second phase of the OAB exam and reflect the
day-to-day work of a lawyer, making them good candidates for a benchmark. This paper aims to
offer a methodology for evaluating Portuguese-language LLMs in the legal domain, both through
objective multiple-choice questions from the first phase of the OAB exam (measuring knowledge
breadth) and essay questions from the second phase (measuring practical professional skills). As an
example of the analysis this benchmark enables, we compare OpenAI's gpt4o-mini with Maritaca.AI's
Sabiazinho-3 model. While both perform comparably on multiple-choice questions (with a slight
edge for gpt4o-mini), Sabiazinho-3 outperforms it in most legal subfields on essay tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>Our methodology involves developing an evaluation framework that assesses both the breadth of
legal knowledge and practical legal skills. This is achieved through four tasks: multiple-choice
questions, selecting the appropriate legal document for a case, drafting the legal document, and
responding to legal case questions in free-text format.</p>
      <p>
        The multiple-choice questions correspond to the first phase of the Brazilian Bar Exam (OAB).
For these, we prompt the evaluated models to provide objective answers, which are then compared
against the official answer key prepared by legal experts from Fundação Getúlio Vargas (FGV), the
institution responsible for administering the OAB exam. The remaining tasks are evaluated using
the LLM-as-a-Judge technique [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which compares the free-text responses of candidate models to
the official answer key in the second-phase tasks. This begins with the creation of a golden label,
based on the official grading rubric, by having a responder model generate answers that are then
reviewed by three experienced human jurists. This review was conducted over 21 legal drafting tasks
and 84 essay questions from the second phase of the OAB, following the exam’s official criteria.
      </p>
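      <p>As a minimal illustration, assuming each question record carries the id, answer, and cancelled_question fields described in the Data section, and that the evaluated model's answers are collected as a mapping from question id to chosen letter (a hypothetical structure, not the benchmark's released code), first-phase accuracy can be computed as follows:</p>
      <preformat>
# Minimal sketch: scoring first-phase (multiple-choice) answers against the official key.
# Field names mirror the dataset description; `model_answers` is a hypothetical
# {question_id: letter} mapping produced by the evaluated model.

def multiple_choice_accuracy(questions: list[dict], model_answers: dict[str, str]) -> float:
    """Fraction of non-cancelled questions whose predicted letter matches the official key."""
    scored = [q for q in questions if not q.get("cancelled_question")]  # assumption: skip cancelled items
    correct = sum(
        1 for q in scored
        if model_answers.get(q["id"], "").strip().upper() == q["answer"].strip().upper()
    )
    return correct / len(scored) if scored else 0.0
      </preformat>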
      <p>This process provides not only the expected answer but also a reference for determining
whether a given model response meets the required standard. We then select the best evaluator
model based on the highest Cohen’s Kappa agreement between its evaluations and the golden label
(majority vote among human reviewers).</p>
      <p>Once the evaluator model is selected, three independent instances of it, each prompted with the
candidate’s response and the grading criteria, assess each criterion individually. The final judgment
is determined by majority vote. Each question or task is scored according to the predefined rubric.
These scores are aggregated by legal subject and computed as an overall total, providing a
comprehensive measure of each model’s performance across tasks.
</p>
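      <p>A minimal sketch of this aggregation step, assuming each evaluator instance returns a yes/no verdict per criterion and that the weights come from the points field of the formatted rubric (the function names and the criterion keys are illustrative, not the benchmark's actual code):</p>
      <preformat>
from collections import Counter

# Minimal sketch: majority vote over the three evaluator verdicts per criterion,
# followed by a weighted sum using each criterion's rubric points.
# `verdicts` maps an illustrative criterion id to the three boolean verdicts.

def majority(votes: list[bool]) -> bool:
    """True if most of the three independent evaluator instances judged the criterion as met."""
    return Counter(votes).most_common(1)[0][0]

def weighted_score(criteria: list[dict], verdicts: dict[str, list[bool]]) -> float:
    """Sum of rubric points for criteria judged as met by majority vote."""
    return sum(c["points"] for c in criteria if majority(verdicts[c["id"]]))
      </preformat>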
      <sec id="sec-2-1">
        <title>OAB Bar Exam</title>
        <p>The benchmark is based on the Brazilian Bar Exam (OAB), developed by legal experts from FGV.
Like the U.S. bar exam, it determines who is qualified to practice law in Brazil. The exam occurs
three times a year and has two phases. The first consists of 80 multiple-choice questions across
various areas of law. Those who pass (with at least 50%) proceed to the second phase, where
candidates choose one of seven legal specializations and complete two tasks: drafting a legal
document based on a practical case and answering four essay-style questions. The second phase
requires a minimum score of 60% and is graded using a binary rubric: each response element is either
correct and awarded points or incorrect and receives none. Selecting the wrong type of legal
document results in an automatic zero on the practical task, making it impossible to pass. The
essay responses also follow predefined scoring criteria and represent 50% of the final grade.
Figure 1 shows examples of an official first-phase OAB question and the official answer
key used for grading responses in the second phase.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Data</title>
        <p>The dataset consists of three tables: multiple-choice, practical, and discursive. These tables correspond
to questions that were manually extracted from the OAB exams administered in 2024. The
multiple-choice table contains 240 multiple-choice questions from the first phase of the 2024 OAB
exams (OAB 40, 41, and 42) and includes the following fields: id, exam, exam_date, number,
question, answer, and cancelled_question. The "practical" table contains 21 practical exam cases
distributed across the seven specialization areas of the second phase of the OAB exam. These cases
were taken from the 2024 second-phase OAB exams (OAB 39, OAB 40, and OAB 41) and include
the following fields: id, exam, exam_date, area, question, answer, criteria, and legal_document. The
question field contains the exam prompt, the answer field provides general response guidelines,
the criteria field includes the set of scoring criteria and their respective points, and the
legal_document field specifies the expected legal document. The "discursive" table contains 84
essay-style questions from the seven specialization areas of the second phase across the three OAB
exams administered in 2024. Its structure is the same as the "practical" table, but without the
"legal_document" field.</p>
        <p>To enable the systematic answering of questions by different models instructed with response
prompts, so that they can be evaluated by models instructed with evaluation prompts considering
all assessment criteria, we processed the grading criteria for each question in the second-phase
discursive and practical tasks using a prompt that converts the criteria, originally presented as
continuous text in the official answer key, into JSON format. For example, consider the answer key
for Question 1-A of Administrative Law from OAB 41, which appears in the official answer key as
follows (translated into English): “A. No. The granting of retirement constitutes a complex act that
will only be complete after a ruling by the Court of Accounts (0.50), in accordance with Article 71,
item III, of the Brazilian Federal Constitution of 1988 (CRFB/88) or Binding Precedent No. 3 (0.10).
0.00/0.50/0.60”. The formatted response in JSON appears as follows (translated into English):</p>
        <preformat>
[
  {
    "letter": "A",
    "part": "I",
    "answer": "No. The granting of retirement constitutes a complex act that will only be complete after a ruling by the Court of Accounts, in accordance with Article 71, item III, of the Brazilian Federal Constitution (CRFB/88) or Binding Precedent No. 3.",
    "criteria": "No. The granting of retirement constitutes a complex act that will only be complete after a ruling by the Court of Accounts",
    "points": 0.5
  },
  {
    "letter": "A",
    "part": "II",
    "answer": "No. The granting of retirement constitutes a complex act that will only be complete after a ruling by the Court of Accounts, in accordance with Article 71, item III, of the Brazilian Federal Constitution (CRFB/88) or Binding Precedent No. 3.",
    "criteria": "in accordance with Article 71, item III, of the Brazilian Federal Constitution (CRFB/88) or Binding Precedent No. 3",
    "points": 0.1
  }
]
        </preformat>
        <p>With this format, the models can see both the scored criterion itself (criteria, points) and
the answer, which serves as context for the criterion. Considering the sum of individual criteria,
the benchmark comprises 379 discursive criteria, 561 criteria for drafting practical cases, 21 criteria
for legal document identification, and 240 multiple-choice questions, for a total of 1,201 evaluation
points.</p>
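        <p>For readers implementing a similar pipeline, the sketch below shows how one such formatted answer-key entry can be loaded into a typed record; the field names follow the JSON example above, while the dataclass itself is merely illustrative:</p>
        <preformat>
import json
from dataclasses import dataclass

# Minimal sketch: parsing one formatted answer-key entry into a typed record.
# Field names (letter, part, answer, criteria, points) follow the JSON example above.

@dataclass
class Criterion:
    letter: str    # sub-item of the question (e.g., "A")
    part: str      # part of the criterion within the sub-item (e.g., "I", "II")
    answer: str    # full reference answer that gives context for the criterion
    criteria: str  # the specific element that must appear in the candidate response
    points: float  # points awarded when the criterion is met

def load_criteria(raw_json: str) -> list[Criterion]:
    return [Criterion(**entry) for entry in json.loads(raw_json)]

# For the example above, the maximum attainable score is 0.5 + 0.1 = 0.6 points.
        </preformat>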
        <p>We employed LLMs in several data-processing tasks. The first, as mentioned, was formatting the
scoring criteria for the discursive and practical questions of the second phase. Next, we designed
prompts to generate model responses for evaluation. This category includes four different prompts,
one for each task: answering multiple-choice questions, identifying the appropriate legal
document, drafting legal documents, and responding to discursive questions. To organize and
systematize these prompts, we used the LangChain framework in Python. Finally, we created evaluation
prompts, in which we applied the LLM-as-a-Judge technique using the "Evaluation by Criteria"
approach, which assesses responses against reference criteria in a binary decision framework
(compliance or non-compliance with each criterion).</p>
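        <p>The sketch below illustrates how such a binary "Evaluation by Criteria" judge prompt can be assembled with LangChain and an OpenAI chat model; the prompt wording, model name, and output parsing are our own assumptions rather than the exact prompts used in the benchmark:</p>
        <preformat>
# Minimal sketch of a binary criterion judge using LangChain (illustrative, not the
# benchmark's exact prompt). The judge must reason first, then emit YES or NO.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

judge_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an OAB grader. Decide whether the candidate response satisfies the "
     "criterion. Briefly explain your reasoning, then answer on a final line with "
     "exactly YES or NO."),
    ("human",
     "Question: {question}\nReference answer: {answer}\n"
     "Criterion: {criteria}\nCandidate response: {response}"),
])

judge = judge_prompt | ChatOpenAI(model="gpt-4o", temperature=0)

def judge_criterion(question: str, answer: str, criteria: str, response: str) -> bool:
    """Run one evaluator instance; True if the final line of the verdict starts with YES."""
    output = judge.invoke({"question": question, "answer": answer,
                           "criteria": criteria, "response": response})
    return output.content.strip().splitlines()[-1].strip().upper().startswith("YES")
        </preformat>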
      </sec>
      <sec id="sec-2-3">
        <title>LLM-as-a-Judge</title>
        <p>
          It is possible to use LLMs to evaluate responses generated by other models or by humans. The
method known as LLM-as-a-Judge is an approach for automated evaluation, which mitigates the
high costs of human assessments and enables large-scale evaluations. One study [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] demonstrated
that "the result of LLM evaluation is consistent with the results obtained by expert human
evaluation: the texts rated higher by human experts are also rated higher by the LLMs." Another
study [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] characterized the type of evaluation conducted in this work as "Solving Yes/No
Questions" and provided an in-depth discussion on the relevance of using LLMs as judges,
highlighting the challenges, ways to overcome them, and the biases present in both LLMs and
human evaluators.
        </p>
        <p>
          A third study [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] found that the level of agreement between humans and LLM-as-a-Judge (using
GPT-4) is around 85%, approximately the same level of agreement found among human evaluators
(81%). A fourth study [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], from early 2023, demonstrated that ChatGPT (GPT-3.5-turbo) achieved
"state-of-the-art or competitive correlation with human judgments in most cases" even in tasks
where previous generative models had failed in the field of linguistics. Another study [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], from
late 2024, introduced ways to improve LLM-as-a-Judge performance, such as requiring reasoning
for the response, a technique we implemented in the present work. Additionally, a 2024 study [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
found that larger models, such as GPT-4 and Llama3-70B, achieve human alignment above 80%,
while also highlighting some concerns, such as the higher variance observed in smaller models.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Alignment with Human Evaluators</title>
        <p>To evaluate the alignment between the responses of the evaluator model and human assessments,
we conducted the following experiment. First, we instructed an arbitrary model, gpt4o-mini, to
respond to two discursive tasks from the benchmark: drafting the legal document in 21 cases and
answering 84 discursive questions. Combined, these two tasks provide 940 evaluation points, each
consisting of binary questions assessing whether the established criteria in the answer key were
met.</p>
        <p>
          Next, we developed an annotation tool using Streamlit and asked three experienced legal experts to
evaluate the model's responses using this tool. The tool presents the question posed to the model,
the answer key, the candidate's response (in this case, the model's response), and the list of relevant
criteria for that question. When clicking on a criterion, the user sees a binary question regarding
the fulfillment of that criterion, with "yes" or "no" options. The human evaluators answered the
questions, and we then asked an LLM-as-a-Judge to respond to the same questions. By comparing
the responses, we can assess two key aspects: the agreement between human evaluators (see
Figure 2) and the agreement between humans and the LLM-as-a-Judge (see Figure 3). We
calculated Fleiss' Kappa to assess inter-rater agreement among the human evaluators and obtained
a score of 0.789. According to the widely adopted interpretation proposed by Landis and Koch[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ],
this value falls within the range of 0.61 to 0.80, which indicates substantial agreement. This level of
consistency among the evaluators provides a reliable foundation for constructing the golden labels
used in our benchmark and for comparing model-generated responses. Next, we prompted seven
different candidate models (gpt4o, gpt4o-mini, gpt4.1, gpt4.1-mini, Sabiá-3, Sabiazinho-3,
gpto3mini) to act as evaluators using the LLM-as-a-Judge framework, assessing the responses generated
by the responder model (gpt4o-mini). For the open-ended essay question task from the second
phase of the exam, we ran 25 evaluation executions for each candidate model across all grading
criteria.
        </p>
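        <p>Under the assumption that the three human annotations are collected as one 0/1 column per reviewer, this inter-rater agreement can be computed with statsmodels as sketched below (the array contents shown are placeholders):</p>
        <preformat>
# Minimal sketch: Fleiss' Kappa among the three human evaluators on the binary
# criterion questions. `human_labels` is an (n_items x 3) array of 0/1 answers,
# one column per reviewer; the rows shown here are placeholders.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

human_labels = np.array([
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    # ... one row per evaluation point (940 in our experiment)
])

counts, _ = aggregate_raters(human_labels)  # rows: items, columns: per-category counts
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")
        </preformat>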
        <p>We then calculated Cohen’s Kappa between each model’s evaluations and the golden labels.
Figure 3 presents the results: the best-performing evaluator model, gpt4o, achieved a mean
Cohen’s Kappa of 0.794 and an average accuracy of 90.9%. According to the Landis and Koch (1977)
scale, this value is at the upper bound of the “substantial agreement” range (0.61–0.80), and just
short of the “almost perfect” category (above 0.80), which further supports its reliability as an
evaluator model in this context.</p>
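        <p>Selecting the evaluator model then reduces to comparing each candidate's verdicts with the golden label (majority vote of the three reviewers), for example with scikit-learn, reusing the human_labels array from the previous sketch; the model verdict arrays below are placeholders for the actual 25-run evaluations:</p>
        <preformat>
# Minimal sketch: Cohen's Kappa (and accuracy) between each candidate evaluator model
# and the golden label built from the human majority vote. Verdict arrays are placeholders.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

golden = (human_labels.sum(axis=1) >= 2).astype(int)  # majority of the three reviewers

model_verdicts = {
    "gpt4o": np.array([1, 1, 1]),
    "sabiazinho-3": np.array([1, 0, 1]),
}

for name, preds in model_verdicts.items():
    print(f"{name}: kappa={cohen_kappa_score(golden, preds):.3f}, "
          f"accuracy={accuracy_score(golden, preds):.3f}")

# The evaluator with the highest Cohen's Kappa is kept as the benchmark's judge.
best = max(model_verdicts, key=lambda m: cohen_kappa_score(golden, model_verdicts[m]))
        </preformat>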
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We applied the benchmark using the best-performing evaluator model to compare two models: gpt4o-mini
(OpenAI) and Sabiazinho-3 (Maritaca.AI). The objective of this test is to understand the strengths
and weaknesses of the models, as well as to identify gaps that can be addressed in future versions.</p>
      <table-wrap id="tab1">
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>Area</th>
              <th>Multiple Choice</th>
              <th>Document Choice</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Gpt-4o-mini</td>
              <td>all</td>
              <td>64.7</td>
              <td/>
            </tr>
            <tr>
              <td>Sabiazinho-3</td>
              <td>all</td>
              <td>63.0</td>
              <td/>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Although the GPT-4o-mini model achieved a slightly higher score on multiple-choice questions,
the Sabiazinho-3 model performed better in legal document identification, legal document drafting,
and answering discursive questions. We can break this down by task and legal area for more detail.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Our findings from the human-alignment experiment support the existing literature on
LLM-as-a-Judge: there is high alignment with human evaluations and strong potential for using this
technique as an assessment method for open-ended questions. The methodology developed here
advances the evaluation of LLMs in the Brazilian legal domain. Previously, assessments were
limited to multiple-choice tests, which are insufficient because they do not reflect the tasks that legal
professionals perform in their daily practice. By creating a benchmark that combines the procedure
that qualifies human lawyers to practice law with the models' ability to evaluate compliance with
criteria defined in an official grading rubric, we enable more in-depth scrutiny of LLMs used by
law students, legal professionals, and the public. Our method has the potential to facilitate the
development of LLMs for the Brazilian legal domain by allowing better scrutiny, which could lead
to improved models for students learning law, professionals practicing it, and the public seeking
clarification on whether a particular conduct constitutes a crime.</p>
      <p>During the preparation of this work, the authors used GPT-4o for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Abonizio</surname>
          </string-name>
          , Thales Sales Almeida, Thiago Laitz, Roseval Malaquias Junior, Giovana Kerche Bonás, Rodrigo Nogueira, and
          <string-name>
            <given-names>Ramon</given-names>
            <surname>Pires</surname>
          </string-name>
          .
          <year>2024</year>
          . Sabia-3
          <source>Technical Report. arXiv preprint arXiv:2410.12049v2 [cs.CL] (Nov. 29</source>
          ,
          <year>2024</year>
          ). Available at: https://arxiv.org/abs/2410.12049.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Roseval</given-names>
            <surname>Malaquias</surname>
          </string-name>
          <string-name>
            <surname>Junior</surname>
          </string-name>
          , Ramon Pires, Roseli Romero, and
          <string-name>
            <given-names>Rodrigo</given-names>
            <surname>Nogueira</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Juru: Legal Brazilian Large Language Model from Reputable Sources</article-title>
          .
          <source>arXiv:2403.18140 [cs.CL] (Mar. 26</source>
          ,
          <year>2024</year>
          ). Available at: https://arxiv.org/abs/2403.18140.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Kluge</surname>
          </string-name>
          <string-name>
            <surname>Corrêa</surname>
          </string-name>
          , Aniket Sen, Sophia Falk, and
          <string-name>
            <given-names>Shiza</given-names>
            <surname>Fatimah</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Tucano: Advancing Neural Text Genera-tion for Portuguese</article-title>
          .
          <source>arXiv:2411.07854 [cs.CL] (Nov. 12</source>
          ,
          <year>2024</year>
          ). Available at: https://arxiv.org/abs/2411.07854.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Thales</given-names>
            <surname>Sales</surname>
          </string-name>
          <string-name>
            <surname>Almeida</surname>
          </string-name>
          , Hugo Abonizio, Rodrigo Nogueira, and
          <string-name>
            <given-names>Ramon</given-names>
            <surname>Pires</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Sabiá-2: A New Generation of Portuguese Large Language Models</article-title>
          .
          <source>arXiv:2403.09887 [cs.CL] (Mar. 26</source>
          ,
          <year>2024</year>
          ). Available at: https://arxiv.org/abs/2403.09887.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Delfino</surname>
          </string-name>
          , Bruno Cuconato, Edward Hermann Haeusler, and
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Rademaker</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Passing the Brazilian OAB Exam: data preparation and some experiments</article-title>
          .
          <source>arXiv:1712.05128v1 [cs.CL] (Dec. 14</source>
          ,
          <year>2017</year>
          ). Available at: https://arxiv.org/abs/1712.05128v1.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Evidently</surname>
            <given-names>AI</given-names>
          </string-name>
          <string-name>
            <surname>Team</surname>
          </string-name>
          .
          <year>2025</year>
          .
          <article-title>LLM-as-a-Judge: A Complete Guide to Using LLMs for Evaluations</article-title>
          .
          <source>Evidently AI (Jan. 9</source>
          ,
          <year>2025</year>
          ). Available at: https://www.evidentlyai.com/llm-guide/
          <article-title>llm-as-a-judge.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Cheng-Han Chiang and
          <string-name>
            <surname>Hung-yi Lee</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Can Large Language Models Be an Alternative to Human Evaluations?</article-title>
          arXiv:
          <fpage>2305</fpage>
          .
          <year>01937</year>
          [cs.CL]
          <article-title>(May 3,</article-title>
          <year>2023</year>
          ). Available at: https://arxiv.org/abs/2305.
          <year>01937</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jiawei</given-names>
            <surname>Gu</surname>
          </string-name>
          , Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yinghan</given-names>
            <surname>Shen</surname>
          </string-name>
          , Shengjie Ma, Honghao Liu,
          <string-name>
            <given-names>Yuanzhuo</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Guo</surname>
          </string-name>
          .
          <year>2025</year>
          .
          <article-title>A Survey on LLM-as-aJudge</article-title>
          .
          <source>arXiv:2411.15594 [cs.CL] (Jan. 9</source>
          ,
          <year>2025</year>
          ). Available at: https://arxiv.org/abs/2411.15594.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Lianmin</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wei-Lin</surname>
            <given-names>Chiang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ying Sheng</surname>
            , Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
            <given-names>Zhuohan</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Dacheng</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Eric P.</given-names>
          </string-name>
          <string-name>
            <surname>Xing</surname>
            , Hao Zhang, Joseph E. Gonzalez, and
            <given-names>Ion</given-names>
          </string-name>
          <string-name>
            <surname>Stoica</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena</article-title>
          . arXiv:
          <volume>2306</volume>
          .05685 [cs.CL]
          <article-title>(June 9,</article-title>
          <year>2023</year>
          ). Available at: https://arxiv.org/abs/2306.05685.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jiaan</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi,
          <string-name>
            <given-names>Zhixu</given-names>
            <surname>Li</surname>
          </string-name>
          , Jinan Xu,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Qu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jie</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Is ChatGPT a Good NLG Evaluator? A Preliminary Study</article-title>
          .
          <source>arXiv:2303.04048 [cs.CL] (Mar. 7</source>
          ,
          <year>2023</year>
          ). Available at: https://arxiv.org/abs/2303.04048.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Haitao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qian</given-names>
            <surname>Dong</surname>
          </string-name>
          , Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai,
          <string-name>
            <given-names>Ziyi</given-names>
            <surname>Ye</surname>
          </string-name>
          , and Yiqun Liu.
          <year>2024</year>
          .
          <article-title>LLMs-as-</article-title>
          <string-name>
            <surname>Judges</surname>
          </string-name>
          :
          <article-title>A Comprehensive Survey on LLM-based Evaluation Methods</article-title>
          .
          <source>arXiv:2412.05579v2 [cs.CL] (Dec. 10</source>
          ,
          <year>2024</year>
          ). Available at: https://arxiv.org/abs/2412.05579v2.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Aman</given-names>
            <surname>Singh</surname>
          </string-name>
          <string-name>
            <given-names>Thakur</given-names>
            , Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and
            <surname>Dieuwke</surname>
          </string-name>
          Hup-kes.
          <year>2025</year>
          .
          <article-title>Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-</article-title>
          <string-name>
            <surname>Judges</surname>
          </string-name>
          .
          <source>arXiv:2406.12624 [cs.CL] (Jan. 21</source>
          ,
          <year>2025</year>
          ). Available at: https://arxiv.org/abs/2406.12624.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Joseph</surname>
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Fleiss</surname>
          </string-name>
          .
          <year>1971</year>
          .
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .
          <source>Psychological Bulletin</source>
          <volume>76</volume>
          ,
          <issue>5</issue>
          (
          <year>1971</year>
          ),
          <fpage>378</fpage>
          -
          <lpage>382</lpage>
          . https://doi.org/10.1037/h0031619.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <year>1960</year>
          .
          <article-title>A coefficient of agreement for nominal scales</article-title>
          .
          <source>Educational and Psychological Measurement</source>
          <volume>20</volume>
          ,
          <issue>1</issue>
          (
          <year>1960</year>
          ),
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          . https://doi.org/10.1177/001316446002000104.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. Richard</given-names>
            <surname>Landis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gary G.</given-names>
            <surname>Koch</surname>
          </string-name>
          .
          <year>1977</year>
          .
          <article-title>The measurement of observer agreement for categorical data</article-title>
          .
          <source>Biometrics</source>
          <volume>33</volume>
          ,
          <issue>1</issue>
          (
          <year>1977</year>
          ),
          <fpage>159</fpage>
          -
          <lpage>174</lpage>
          . https://doi.org/10.2307/2529310.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>