<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging LLMs to Build a Semi-Synthetic Dataset for Legal Information Retrieval: a Case Study on the Italian Civil Code and GPT4-o</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mattia Proietti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Although raw textual data in the legal domain is abundant, making it easy to collect large amounts of material from several sources, the structured and annotated data needed to fine-tune machine learning models is limited and difficult to obtain. Creating human-annotated datasets is both time- and money-consuming, which often makes it impractical to get quality data to train machines on various legal language tasks. AI models such as Large Language Models (LLMs) are becoming appealing tools to generate synthetic data, judge model responses, and annotate textual information, so as to cope with such shortcomings. In this work, we evaluate the applicability of LLMs to the automatic generation of a dataset of legal query-passage pairs to train retrieval systems. Indeed, Legal Information Retrieval (LIR) has been crucial for the creation of robust search systems for legal documents and is now gaining new importance in the context of the Retrieval Augmented Generation (RAG) framework, which is becoming a widespread tool to cope with the hallucinating behaviour of LLMs. Our goal is to test the feasibility of building a query-passage dataset in which the queries are generated by an LLM about real textual passages, and to assess the reliability of such a process in terms of the generation of hallucination-free data points in a delicate domain such as the legal one. We do so in a two-step pipeline: i) we use the Italian Civil Code as a source of self-contained, semantically coherent legal textual passages and ask the model to generate hypothetical questions on them; ii) we use the LLM itself to judge the coherence of the questions, to spot those inconsistent with the passage. We then select a random subset of the question-passage pairs and ask humans to evaluate them. Finally, we compare human and model evaluations on the randomly selected subset. We show that the model generates many questions easily and, while it lags behind humans when evaluating the appropriateness of the generated questions with respect to the reference passages in zero-shot settings, it substantially reduces the gap with human judgements when only two examples are provided.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Legal Information Retrieval</kwd>
        <kwd>Synthetic data generation</kwd>
        <kwd>LLM-as-a-judge</kwd>
        <kwd>Legal-NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, we have witnessed great advancements in the field of Artificial Intelligence (AI), in particular in its sub-domain of Natural Language Processing (NLP). The advent of Large Language Models (LLMs), especially on the wave initiated by the GPT family [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>], has revolutionised the way we produce, understand, and manipulate textual content. This revolution has permeated all domains, and the legal field is no exception. Indeed, NLP for legal applications is spreading and is gaining a core role in the discussion about the integration of AI into legal practice. However, due to its high degree of specialization, the intellectual complexity of legal tasks, and the technical specificity of its language, the legal domain, similarly to other specialized fields, has progressed more slowly toward a mature integration of language technologies. Despite the vast volume of textual material generated daily by legal practitioners, the field still faces a significant shortage of machine-readable and annotated resources needed to train and fine-tune AI systems for Legal NLP (LNLP) tasks, a process that is complex and presents numerous challenges [3]. The lack of data encompasses all the devisable LNLP tasks. In this work, we focus on data formats necessary to train systems to perform Legal Information Retrieval (LIR) tasks. LIR is a crucial task in the field of LNLP, primarily concerned with retrieving relevant documents in response to a given textual query. A typical application scenario involves a system capable of identifying and returning pertinent legal documents based on a user’s question. To effectively perform this task, it is essential to train models on in-domain data, specifically question-passage pairs derived from legal documents and expressed in legal language, in order to address domain shifts [4]. However, building such datasets purely through human annotation is both extremely time-consuming and costly, as it requires coming up with questions and associating them with relevant documents that may be used to answer those questions. To cope with such shortcomings, synthetic data generation and annotation through LLMs is arising as a promising strategy, and it is now being explored within the legal domain as well. Despite its ease, the increasing application of LLMs to generate synthetic data calls for a major assessment of their reliability and real applicability for the task at hand.</p>
      <p>This paper aims to answer the following research question: “How reliable are automated methods for generating and evaluating semi-synthetic datasets in the context of Legal Information Retrieval?” In turn, the motivation behind this question is two-fold. On the one hand, we want to generate a dataset that can be used to train machine learning systems to perform the task of LIR. On the other hand, we aim to assess the feasibility of this process by evaluating the reliability of using a state-of-the-art LLM both to generate questions and to assess their relevance to reference text passages, as well as the efficiency of this approach in terms of time and cost. We consider this process as a proxy to evaluate the model’s ability to understand legal texts at a basic level, since formulating a good question is an index of the degree of understanding reached by the system formulating that question.</p>
      <p>To this end, we integrate two established paradigms of LLM applications: (i) synthetic data generation [5, 6], employed to automatically construct the dataset, and (ii) LLM-as-a-judge [7], used to evaluate and filter out noisy or inaccurate outputs. Specifically, we apply a multi-step strategy involving a state-of-the-art LLM, namely GPT4-o, to generate questions on articles of the Italian Civil Code and evaluate whether the generated questions are answerable by reading the reference article text. We subsequently sample subsets of the generated questions at random and have them evaluated by human annotators using the same criteria as the model, in order to compare the results of automatic and manual evaluation. In that way, we estimate both the question-generation abilities of the LLM and its self-evaluation ability, both of which are crucial for assessing the feasibility of fully automating the process of creating a legal question-answering dataset.</p>
      <p>Given the aforementioned lack of datasets to train machine learning models for tasks related to the legal domain and the costs related to manually annotating corpora from the ground up, integrating LLMs in the process of dataset creation is nowadays a promising approach. This work contributes to the understanding of how much we can rely on state-of-the-art LLMs to generate synthetic textual data that are free from hallucinations and that may actually be useful in practical downstream tasks, particularly focusing on the generation of question-passage pairs to be used to train retriever models for LIR and RAG in the legal domain. This aspect is particularly important for low-resource languages and vertical domains, where annotated data is especially scarce. We found that not only is the model’s performance in generating questions remarkable in terms of quantity, but the model can also be almost as good as human judges in the evaluation task in 2-shot settings, though it lags behind humans when a 0-shot prompt is used.<sup>1</sup></p>
      <p><sup>1</sup>Code and data available at https://github.com/aittam9/cc_qa</p>
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
      <sec id="sec-2-1">
        <title>2https://it.wikisource.org/wiki/Codice_civile</title>
        <p>3https://platform.openai.com/docs/overview.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4https://spacy.io/</title>
        <p>
LLM-as-a-judge/annotator – LLMs have recently been involved in the process of both annotating data and evaluating model-generated responses. Aldeen et al.
[
          <xref ref-type="bibr" rid="ref3">17</xref>
          ] evaluate the performance of ChatGPT in annotating texts, comparing it with that of human annotators.
        </p>
        <p>
          Savelka [
          <xref ref-type="bibr" rid="ref4">18</xref>
          ] uses GPT to semantically annotate legal texts in a zero-shot fashion. Wang et al. [
          <xref ref-type="bibr" rid="ref5">19</xref>
          ] deploy a human-LLM collaborative protocol for data annotation.
        </p>
        <p>
          More broadly, LLMs have been used as judges in a
variety of works that are relevant to ours, both for the
methods employed and the aims pursued. For example, Sun
et al. [
          <xref ref-type="bibr" rid="ref6">20</xref>
          ] use LLMs to judge whether the knowledge retrieved as a triplet from a graph is sufficient to answer a given
question. Bavaresco et al. [
          <xref ref-type="bibr" rid="ref7">21</xref>
          ] tested LLMs as judges on
20 tasks, comparing their judgements with human ones
through Spearman’s correlation [
          <xref ref-type="bibr" rid="ref8">22</xref>
          ] for graded scores
and Cohen’s κ inter-annotator agreement [
          <xref ref-type="bibr" rid="ref9">23</xref>
          ] for
categorical ones. We refer to Gu et al. [
          <xref ref-type="bibr" rid="ref10">24</xref>
          ] for a comprehensive
overview of works that have adapted the LLM-as-a-judge
paradigm in several ways.
        </p>
        <p>Although a variety of works have addressed the
problem of augmenting data for IR through synthetic question
generation, to the best of our knowledge, a gap exists
both for the Italian language and the Italian legal
domain. The same holds for the application of an LLM as a
judge/annotator to evaluate and label data points to build
a dataset for LIR. The contribution of our work resides
precisely within that frame.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>As a first step, we used the articles of the Italian Civil Code<sup>2</sup> as a source of self-contained, semantically coherent legal passages and prompted GPT4-o through the OpenAI API<sup>3</sup> to generate hypothetical questions on each article.</p>
      <p><sup>2</sup>https://it.wikisource.org/wiki/Codice_civile <sup>3</sup>https://platform.openai.com/docs/overview <sup>4</sup>https://spacy.io/</p>
      <p>Automatic Questions Evaluation. In a second step, we provided the model with each article paired with the questions it had generated initially and asked it to evaluate whether the answer to each question could be found within the corresponding textual passage. The model was instructed to produce a binary output to facilitate efficient parsing in subsequent evaluation stages. Specifically, the model assigned one of two labels to each question–passage pair: “SI” for a positive match, indicating the answer is present, and “NO” for a negative match, indicating it is absent. Formally, given a pair consisting of a passage $p \in P$, a related question $q \in Q$ generated in the previous step, and a general template prompt $\tau$ shown in Figure 2, we built a prompt $\rho$ for each passage-question pair. The model $M$ had to determine whether $p$ contains the necessary information to answer $q$, which basically translates into the model performing a binary classification task over the prompt $\rho$, as shown in (1):</p>
      <p>$M(\rho) = \begin{cases} \text{SI}, \text{ if } p \text{ answers } q \\ \text{NO}, \text{ otherwise} \end{cases}$ (1)</p>
      <p>The question, passage, and instructions were formatted into the prompt template illustrated in Figure 2, translated here from the original Italian:</p>
      <p>###ISTRUZIONI###
You are an expert in law. Below you will be shown a text and a question. Your task is to determine whether the answer to the question is contained in the text. You can use only the following two valid OUTPUTs: ["SI", "NO"]. The OUTPUT is "SI" if the answer to the question is contained in the text. The OUTPUT is "NO" if the answer to the question is not contained in the text. In order to answer "SI", the answer to the question must be strictly and clearly present in the text. Return only "SI" or "NO" and nothing else.
###TESTO###
{text}
###DOMANDA###
{query}</p>
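      <p>As an illustration, the core of this judging step can be sketched as follows. This is a minimal, hypothetical sketch assuming the OpenAI Python SDK; the helper name judge_pair and the inlined English instruction text are our own, not part of the released code (the actual prompt of Figure 2 is in Italian).</p>
      <preformat>
# Minimal sketch of the self-evaluation step (assumes the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "###ISTRUZIONI###\n"
    "You are an expert in law. Determine whether the answer to the question "
    "is contained in the text. Valid OUTPUTs: [\"SI\", \"NO\"]. "
    "Return only \"SI\" or \"NO\" and nothing else.\n"
    "###TESTO###\n{text}\n"
    "###DOMANDA###\n{query}"
)

def judge_pair(passage: str, question: str, model: str = "gpt-4o") -> str:
    """Label a question-passage pair as SI/NO; anything else counts as a failure."""
    prompt = JUDGE_TEMPLATE.format(text=passage, query=question)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # make the binary labelling as deterministic as possible
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"SI", "NO"} else "INVALID"
      </preformat>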
      <p>Manual Evaluation. Forms have been automatically generated using the TypeScript extension from Google Sheets<sup>5</sup>. We have been able to collect manual annotations for 12 random samples of 100 entries each, for a total of 1200 question-passage pairs. Each sample was assigned to a single annotator, with no overlap of annotators on the same sets. Each question-passage pair to be evaluated has been presented to the annotators as shown in Figure 3. In this way, the human annotators had to perform the same binary classification task as the model, as illustrated in the previous paragraph, so that (1) can be turned into (2), where $H$ indicates the human performing the task:</p>
      <p>$H(\rho) = \begin{cases} \text{SI}, \text{ if } p \text{ answers } q \\ \text{NO}, \text{ otherwise} \end{cases}$ (2)</p>
      <p><sup>5</sup>This task has been performed with the aid of an LLM.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Generation</title>
        <p>The result statistics for the first experiment, that is the generation step, are shown in Table 1.</p>
        <p>As shown, the model demonstrates strong proficiency in generating questions for each article in terms of quantity, with an average of approximately 3 questions per article, ranging from 2.38 to 3.42 across books. Given a total of 2,927 input articles, the model generated 8,076 questions, effectively doubling or tripling the length of each book.</p>
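        <p>As a quick sanity check on these figures, 8,076 questions over 2,927 articles amounts to 8,076/2,927 ≈ 2.76 questions per article on average, consistent with the reported per-book range of 2.38 to 3.42.</p>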
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Automatic Self-Evaluation</title>
        <sec id="sec-3-2-1">
          <title>As introduced in the previous section, we randomly se</title>
          <p>Next, we examine the results of the auto evaluation per- lected a subset of the generated questions and asked
formed by the model itself and regarding the quality of human evaluators to judge if a question would be good
the generated questions with respect to the input refer- for a given reference passage, thus eliciting the same type
ence text. Figure 4 shows the distribution of the positive of binary judgment obtained by prompting GPT4-o. We
and negative values assigned by the model to each pair did so for 12 sub-sets of data each containing 100 items,
of generated questions and reference article text. The for a total of 1200 items. As can be seen from Figure 5,
values are respectively represented by the labels SI and human annotators assigned far more positive labels than
NO as required by the prompt shown in the previous negative, as the model itself already did in the zero-shot
section in Figure 2, and their distribution is computed settings, but with an even greater gap between the two
per ICC book. In this phase, the model assigned the pos- classes, for a total of 1036 (86%) positive labels against
itive label SI to a total of 5369 question-passage pairs, 164 (14%) negative ones. The manual evaluation on the
while judging 2692 pairs as negative, which were labelled random sample seems to point out that the majority of
with NO. Additionally, the model failed to provide a le- questions generated by the model are, on average, correct
gitimate answer (SI or NO), thus failing to follow the with respect to the related text passage.
instructions written in the prompt in 15 cases. Overall,
the model judged as relevant to the reference article 66%
of the questions, thus interpreting as correct only 2/3 of 5.4. Cross Evaluation
its own generations.</p>
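        <p>Both evaluation rounds reduce to validating and tallying the two admissible labels. A minimal sketch in Python follows; the helper below is our own illustration, not part of the released code.</p>
        <preformat>
# Sketch of label validation and tallying, e.g. per ICC book (Figure 4).
from collections import Counter

def tally_labels(records):
    """records: iterable of (book, label) pairs, label in {'SI', 'NO', 'INVALID'}."""
    per_book = {}
    for book, label in records:
        per_book.setdefault(book, Counter())[label] += 1
    return per_book

# Global zero-shot counts reported above:
totals = Counter({"SI": 5369, "NO": 2692, "INVALID": 15})
positive_rate = totals["SI"] / sum(totals.values())  # ~0.66, i.e. 66%
        </preformat>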
      </sec>
      <sec id="sec-3-4">
        <title>5.4. Cross Evaluation</title>
        <p>We ran a cross-analysis between Human and Model evaluations on the randomly selected subsets, considering both zero-shot and two-shot prompting. As for the model, we use the zero-shot evaluations previously performed on the whole generated dataset, as well as a new set of 2-shot evaluations elicited for the random subsets assigned to humans. In that way, we could compare Human evaluations against two types of model evaluations, namely Model-0shot and Model-2shot. As shown in Table 2, human evaluations assigned the most positive labels (86%), closely followed by the Model-2shot (82%), while Model-0shot evaluations lag behind both (66%). In fact, when the model is prompted with no example provided, its evaluations display a gap of around 18-20% compared to the other two modalities.</p>
        <p>It should be stressed that, in this case, positive and negative do not necessarily correspond to correct and incorrect, but to how an evaluator, human or artificial, has considered the input pair. So, at this stage the comparison between human annotators and the model concerns the propensity to assign positive values to the analysed pairs rather than the judgement of correct responses.</p>
        <p>Therefore, we then analysed how the model evaluations performed against the human ones, using the latter as the gold standard, in order to have a more meaningful comparison between Human and Model evaluations. As previously stated (see Section 4), the evaluation task can be formalised as a binary classification task. Therefore, we computed classical machine learning metrics such as Precision, Recall and F1 between human and model annotations. Again, we did so for the model’s evaluations elicited in 0-shot and 2-shot settings. Results are shown in Table 3.</p>
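        <p>For illustration, this comparison can be sketched as follows, assuming scikit-learn is available; the helper name is our own and not part of the released code.</p>
        <preformat>
# Sketch of the cross-evaluation metrics: human labels are the gold standard,
# model labels (0-shot or 2-shot) are the predictions.
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

def compare_annotations(human_labels, model_labels, positive="SI"):
    """Both inputs are aligned lists of 'SI'/'NO' labels over the same pairs."""
    p, r, f1, _ = precision_recall_fscore_support(
        human_labels, model_labels, pos_label=positive, average="binary"
    )
    cm = confusion_matrix(human_labels, model_labels, labels=["SI", "NO"])
    return p, r, f1, cm
        </preformat>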
        <p>As expected, given the previous comparisons, the F1 score obtained between Human and Model-0shot is modest (76%). This confirms the tendency of the model to underestimate the correctness of the generated questions when prompted with no example whatsoever. This led the model to mislabel many items, favouring negative labels and hence leading to a problem of false negatives, as already guessable from the previous analysis, while the percentage of false positives assigned by the model is much lower.</p>
        <p>On the other hand, the F1 improved by 10 points (86%) when Model-2shot evaluations are used, substantially levelling the false-negative problem that emerged in the 0-shot evaluation. In other words, as further summed up in the confusion matrices shown in Figure 6, much of the discrepancy between the two evaluation settings depends on GPT4-o underestimating the goodness of its own generations when the evaluation is led with no examples provided, failing to correctly match a large number of pairs in which the question and reference article text were positively related. On the contrary, with just one correct and one incorrect example, the model evaluations align significantly better with the human ones.</p>
      </sec>
    </sec>
    <sec id="sec-disc">
      <title>6. Discussion</title>
      <p>We have performed a series of experiments to assess the ability of GPT4-o to generate pertinent legal questions in relation to articles of the Italian Civil Code. We first prompted the LLM to generate the questions, then asked the model itself to judge their goodness, adopting a binary labelling schema. In parallel, we sampled a subset of the generated questions and asked humans to judge their quality with respect to the reference text they were generated from, using the same schema adopted for the model. Next, we compared the two kinds of evaluation, the automatic one made by the model and the manual one performed by human annotators.</p>
      <p>Overall, we saw that, as expected, GPT4-o has generally been able to produce an adequate number of questions for each article, as stated by our heuristic, which would allow the seamless creation of a dataset to train models for the Legal Information Retrieval task, which may then be integrated into Search Engines or RAG applications. In fact, given the starting set of input texts, we have been able to triple its size in terms of generated questions.</p>
      <p>The model’s self-evaluation phase seemed to reveal an underestimation of the goodness of the questions by the model itself when it is prompted to perform the task in 0-shot settings. The model judged only 66% of the questions as pertinent to their respective reference text when no example is provided, initially leading us to think that, while it is very good at generating, it underperforms when it comes to evaluating, even though the evaluation concerns its own generated texts. On the other hand, the model has been able to close the gap with human judges in positively evaluating question-passage pairs, from a difference of 20% to only 4%, when provided with a correct and an incorrect example. While the 0-shot settings underlined a substantial problem of false negatives, this has been substantially reduced in the 2-shot settings.</p>
      <p>The results show that a SOTA LLM can be seamlessly used to generate legal content-related questions. It can hardly compete with humans in the 0-shot evaluation of the quality of the same questions with respect to their reference passage, but it can better mimic human performance when provided with a negligible number of examples. Overall, all the above hints suggest that using LLMs to cope with the shortage of annotated resources to train machine learning models in the legal domain is an asset worth putting into practice. As stated in previous sections, we used the LLM as a generator to produce questions and as a judge to evaluate the goodness of its own generations. While the LLM-as-a-judge paradigm provides an easy and efficient way to evaluate model responses, its value is not limited to that. Indeed, we can readapt model evaluations and consider them as annotations, with no need to discard incorrect questions, which can be used as negative labels of the generated dataset.</p>
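      <p>As a minimal sketch of this last point (our own illustration, not the released code), the judge labels can be recast as relevance annotations for retriever training:</p>
      <preformat>
# Recast judge labels as retrieval training annotations: positively judged
# pairs become positive examples, negatively judged ones become negatives.
def to_training_examples(judged_pairs):
    """judged_pairs: iterable of dicts with 'question', 'passage', 'label'."""
    examples = []
    for rec in judged_pairs:
        if rec["label"] not in {"SI", "NO"}:
            continue  # drop the few non-compliant judgements
        examples.append({
            "query": rec["question"],
            "passage": rec["passage"],
            "relevance": 1 if rec["label"] == "SI" else 0,
        })
    return examples
      </preformat>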
    </sec>
    <sec id="sec-4">
      <title>8. Conclusions</title>
    </sec>
    <sec id="sec-5">
      <title>7. Limitations and Future Directions</title>
      <sec id="sec-5-1">
        <title>Some limitations of the present work need to be noted.</title>
        <p>First of all, we used a proprietary model. While this
choice is apt to our purpose and data, using a
closedsource closed-access model implies not being able to
precisely define the engine being used, which can undergo
updates or modifications without notification. That may
hinder the reproducibility and stability of the results
across time.</p>
        <p>On the side of question evaluation, we used a simple
binary approach aiming at identifying whether a question
could be answered with the information provided in the
document from which it has been generated. While this
is straightforward and seamless to implement, it does not
allow a more nuanced assessment of the quality of the
questions. Therefore, future work will be devoted to refining the evaluation approach, introducing additional criteria to assess the quality of a question other than simple answerability (e.g., fluency, ambiguity, and the like). Also,
due to resource constraints, we distributed the random
samples for the manual evaluation among annotators,
assigning a single sample to each one, without overlapping.
This made it impossible to assess the soundness of the
annotations by computing annotators’ agreement measures.
In the future, we plan to widen the number of annotated
items as well as the pool of annotators, in order to obtain
a stronger and more faithful gold standard.</p>
        <p>Lastly, in this work, we focused solely on the Italian
Civil Code, from which we derived more than 8000
training inputs. Despite being a robust starting point, we are
planning to extend the strategy to other Italian Codes,
like the Penal Code, in order to both extend the dataset
quantitatively and add greater linguistic and conceptual variation qualitatively.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>In conclusion, integrating LLMs in the process of creating datasets for LNLP tasks is surely a promising and
worthwhile route, as it may have many benefits in terms of costs and time efficiency. Indeed, we estimate that the
total cost of generating and evaluating questions with
GPT4-o is less than 30 dollars, and the amount of time
needed to perform the computational experiments is
between 15 and 20 hours. These numbers suggest that the process may be easily scalable without a great expenditure of resources. Also, we showed how the model needs
at least two examples to approach the human
performance in evaluation, while substantially lagging behind
it when a 0-shot prompt is used. While manual
evaluation seems to still be the most faithful way to derive gold
standards, we estimated that around one hour is
necessary for a human to perform an evaluation on a sample
of 100 entries, which may become impractical to extend
to larger datasets. In contrast, using an LLM to both
generate and judge-annotate synthetic questions seems to
be a viable alternative to fully automate the process of
generating training data for Legal Information Retrieval,
providing huge benefits in terms of money and time
resources, while maintaining an acceptable performance
rate, up to an unavoidable level of noise.</p>
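        <p>For scale, the reported figures amount to roughly 30/8,076 ≈ 0.004 dollars per generated and self-evaluated question, against roughly one hour of human annotation time per 100 pairs.</p>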
      </sec>
    <sec id="sec-6">
      <title>9. Acknowledgments</title>
      <sec id="sec-6-1">
        <title>We are deeply grateful to the volunteer human annotators who have participated in the experiments.</title>
        <p>[3] H. Darji, J. Mitrović, M. Granitzer, Challenges and can Chapter of the Association for Computational
considerations in annotating legal data: A compre- Linguistics: Human Language Technologies,
Ashensive overview, 2024. URL: https://arxiv.org/abs/ sociation for Computational Linguistics, Seattle,
2407.17503. arXiv:2407.17503. United States, 2022, pp. 2345–2360. URL: https:
[4] D. Dua, E. Strubell, S. Singh, P. Verga, To adapt or to //aclanthology.org/2022.naacl-main.168/. doi:10.
annotate: Challenges and interventions for domain 18653/v1/2022.naacl-main.168.
adaptation in open-domain question answering, in: [11] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Pro- M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
ceedings of the 61st Annual Meeting of the Asso- limits of transfer learning with a unified
text-tociation for Computational Linguistics (Volume 1: text transformer, Journal of Machine Learning
ReLong Papers), Association for Computational Lin- search 21 (2020) 1–67. URL: http://jmlr.org/papers/
guistics, Toronto, Canada, 2023, pp. 14429–14446. v21/20-074.html.</p>
        <p>URL: https://aclanthology.org/2023.acl-long.807/. [12] J. Ma, I. Korotkov, Y. Yang, K. Hall, R.
Mcdoi:10.18653/v1/2023.acl-long.807. Donald, Zero-shot neural passage retrieval via
[5] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, domain-targeted synthetic question generation, in:
Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.),
ProJ. E. Gonzalez, I. Stoica, Judging llm-as-a-judge ceedings of the 16th Conference of the European
with mt-bench and chatbot arena, in: A. Oh, T. Nau- Chapter of the Association for Computational
Linmann, A. Globerson, K. Saenko, M. Hardt, S. Levine guistics: Main Volume, Association for
Compu(Eds.), Advances in Neural Information Processing tational Linguistics, Online, 2021, pp. 1075–1088.
Systems, volume 36, Curran Associates, Inc., 2023, URL: https://aclanthology.org/2021.eacl-main.92/.
pp. 46595–46623. URL: https://proceedings. doi:10.18653/v1/2021.eacl-main.92.
neurips.cc/paper_files/paper/2023/file/ [13] R. Meng, Y. Liu, S. Yavuz, D. Agarwal, L. Tu, N. Yu,
91f18a1287b398d378ef22505bf41832-Paper-Datasets_ J. Zhang, M. Bhat, Y. Zhou, Augtriever:
Unsuperand_Benchmarks.pdf. vised dense retrieval by scalable data augmentation,
[6] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, arXiv preprint arXiv:2212.08841 (2022).</p>
        <p>G. Chen, H. Wang, On llms-driven synthetic [14] Z. Tong, C. Qin, C. Fang, K. Yao, X. Chen, J. Zhang,
data generation, curation, and evaluation: A sur- C. Zhu, H. Zhu, From missteps to mastery:
Envey, 2024. URL: https://arxiv.org/abs/2406.15126. hancing low-resource dense retrieval through
adaparXiv:2406.15126. tive query generation, in: Proceedings of the 31st
[7] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, ACM SIGKDD Conference on Knowledge
DiscovA. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu, ery and Data Mining V.1, KDD ’25, Association for
L. Cheng, H. Liu, From generation to judg- Computing Machinery, New York, NY, USA, 2025,
ment: Opportunities and challenges of llm-as-a- p. 1373–1384. URL: https://doi.org/10.1145/3690624.
judge (2025). URL: https://arxiv.org/abs/2411.16594. 3709225. doi:10.1145/3690624.3709225.
arXiv:2411.16594. [15] L. Bonifacio, H. Abonizio, M. Fadaee, R. Nogueira,
[8] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, Inpars: Unsupervised dataset generation for
inforG. Chen, H. Wang, On LLMs-driven syn- mation retrieval, in: Proceedings of the 45th
Inthetic data generation, curation, and evalua- ternational ACM SIGIR Conference on Research
tion: A survey, in: L.-W. Ku, A. Martins, and Development in Information Retrieval, SIGIR
V. Srikumar (Eds.), Findings of the Association ’22, Association for Computing Machinery, New
for Computational Linguistics: ACL 2024, Asso- York, NY, USA, 2022, p. 2387–2392. URL: https:
ciation for Computational Linguistics, Bangkok, //doi.org/10.1145/3477495.3531863. doi:10.1145/
Thailand, 2024, pp. 11065–11082. URL: https: 3477495.3531863.
//aclanthology.org/2024.findings-acl.658/. doi: 10. [16] J. Saad-Falcon, O. Khattab, K. Santhanam, R.
Flo18653/v1/2024.findings-acl.658. rian, M. Franz, S. Roukos, A. Sil, M. Sultan, C. Potts,
[9] S. Shashidhar, C. Fourrier, A. Lozovskia, T. Wolf, UDAPDR: Unsupervised domain adaptation via
G. Tur, D. Hakkani-Tür, Yourbench: Easy custom LLM prompting and distillation of rerankers, in:
evaluation sets for everyone, 2025. URL: https:// H. Bouamor, J. Pino, K. Bali (Eds.),
Proceedarxiv.org/abs/2504.01833. arXiv:2504.01833. ings of the 2023 Conference on Empirical
Meth[10] K. Wang, N. Thakur, N. Reimers, I. Gurevych, GPL: ods in Natural Language Processing, Association
Generative pseudo labeling for unsupervised do- for Computational Linguistics, Singapore, 2023,
main adaptation of dense retrieval, in: M. Carpuat, pp. 11265–11279. URL: https://aclanthology.org/
M.-C. de Marnefe, I. V. Meza Ruiz (Eds.), Proceed- 2023.emnlp-main.693/. doi:10.18653/v1/2023.
ings of the 2022 Conference of the North Ameri- emnlp-main.693.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by generative pre-training</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings. neurips.cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf .
        </mixed-citation>
      </ref>
      <ref id="bib3">
        <mixed-citation>[3] H. Darji, J. Mitrović, M. Granitzer, Challenges and considerations in annotating legal data: A comprehensive overview, 2024. URL: https://arxiv.org/abs/2407.17503. arXiv:2407.17503.</mixed-citation>
      </ref>
      <ref id="bib4">
        <mixed-citation>[4] D. Dua, E. Strubell, S. Singh, P. Verga, To adapt or to annotate: Challenges and interventions for domain adaptation in open-domain question answering, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 14429-14446. URL: https://aclanthology.org/2023.acl-long.807/. doi:10.18653/v1/2023.acl-long.807.</mixed-citation>
      </ref>
      <ref id="bib5">
        <mixed-citation>[5] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 46595-46623. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.</mixed-citation>
      </ref>
      <ref id="bib6">
        <mixed-citation>[6] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, H. Wang, On LLMs-driven synthetic data generation, curation, and evaluation: A survey, 2024. URL: https://arxiv.org/abs/2406.15126. arXiv:2406.15126.</mixed-citation>
      </ref>
      <ref id="bib7">
        <mixed-citation>[7] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu, L. Cheng, H. Liu, From generation to judgment: Opportunities and challenges of LLM-as-a-judge, 2025. URL: https://arxiv.org/abs/2411.16594. arXiv:2411.16594.</mixed-citation>
      </ref>
      <ref id="bib8">
        <mixed-citation>[8] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, H. Wang, On LLMs-driven synthetic data generation, curation, and evaluation: A survey, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 11065-11082. URL: https://aclanthology.org/2024.findings-acl.658/. doi:10.18653/v1/2024.findings-acl.658.</mixed-citation>
      </ref>
      <ref id="bib9">
        <mixed-citation>[9] S. Shashidhar, C. Fourrier, A. Lozovskia, T. Wolf, G. Tur, D. Hakkani-Tür, YourBench: Easy custom evaluation sets for everyone, 2025. URL: https://arxiv.org/abs/2504.01833. arXiv:2504.01833.</mixed-citation>
      </ref>
      <ref id="bib10">
        <mixed-citation>[10] K. Wang, N. Thakur, N. Reimers, I. Gurevych, GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2345-2360. URL: https://aclanthology.org/2022.naacl-main.168/. doi:10.18653/v1/2022.naacl-main.168.</mixed-citation>
      </ref>
      <ref id="bib11">
        <mixed-citation>[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1-67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation>
      </ref>
      <ref id="bib12">
        <mixed-citation>[12] J. Ma, I. Korotkov, Y. Yang, K. Hall, R. McDonald, Zero-shot neural passage retrieval via domain-targeted synthetic question generation, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 1075-1088. URL: https://aclanthology.org/2021.eacl-main.92/. doi:10.18653/v1/2021.eacl-main.92.</mixed-citation>
      </ref>
      <ref id="bib13">
        <mixed-citation>[13] R. Meng, Y. Liu, S. Yavuz, D. Agarwal, L. Tu, N. Yu, J. Zhang, M. Bhat, Y. Zhou, AugTriever: Unsupervised dense retrieval by scalable data augmentation, arXiv preprint arXiv:2212.08841 (2022).</mixed-citation>
      </ref>
      <ref id="bib14">
        <mixed-citation>[14] Z. Tong, C. Qin, C. Fang, K. Yao, X. Chen, J. Zhang, C. Zhu, H. Zhu, From missteps to mastery: Enhancing low-resource dense retrieval through adaptive query generation, in: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD '25, Association for Computing Machinery, New York, NY, USA, 2025, pp. 1373-1384. URL: https://doi.org/10.1145/3690624.3709225. doi:10.1145/3690624.3709225.</mixed-citation>
      </ref>
      <ref id="bib15">
        <mixed-citation>[15] L. Bonifacio, H. Abonizio, M. Fadaee, R. Nogueira, InPars: Unsupervised dataset generation for information retrieval, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2387-2392. URL: https://doi.org/10.1145/3477495.3531863. doi:10.1145/3477495.3531863.</mixed-citation>
      </ref>
      <ref id="bib16">
        <mixed-citation>[16] J. Saad-Falcon, O. Khattab, K. Santhanam, R. Florian, M. Franz, S. Roukos, A. Sil, M. Sultan, C. Potts, UDAPDR: Unsupervised domain adaptation via LLM prompting and distillation of rerankers, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 11265-11279. URL: https://aclanthology.org/2023.emnlp-main.693/. doi:10.18653/v1/2023.emnlp-main.693.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[17] M. Aldeen, J. Luo, A. Lian, V. Zheng, A. Hong, P. Yetukuri, L. Cheng, ChatGPT vs. human annotators: A comprehensive analysis of ChatGPT for text annotation, in: 2023 International Conference on Machine Learning and Applications (ICMLA), 2023, pp. 602-609. doi:10.1109/ICMLA58977.2023.00089.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[18] J. Savelka, Unlocking practical applications in legal domain: Evaluation of GPT for zero-shot semantic annotation of legal texts, in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 447-451. URL: https://doi.org/10.1145/3594536.3595161. doi:10.1145/3594536.3595161.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[19] X. Wang, H. Kim, S. Rahman, K. Mitra, Z. Miao, Human-LLM collaborative annotation through effective verification of LLM labels, in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI '24, Association for Computing Machinery, New York, NY, USA, 2024. URL: https://doi.org/10.1145/3613904.3641960. doi:10.1145/3613904.3641960.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[20] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. M. Ni, H.-Y. Shum, J. Guo, Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph, 2024. URL: https://arxiv.org/abs/2307.07697. arXiv:2307.07697.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[21] A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, A. Testoni, LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 238-255. URL: https://aclanthology.org/2025.acl-short.20/. doi:10.18653/v1/2025.acl-short.20.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Spearman</surname>
          </string-name>
          ,
          <article-title>The proof and measurement of association between two things</article-title>
          ,
          <source>The American Journal of Psychology</source>
          <volume>15</volume>
          (
          <year>1904</year>
          )
          <fpage>72</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
<article-title>A coefficient of agreement for nominal scales</article-title>
          ,
          <source>Educational and psychological measurement 20</source>
          (
          <year>1960</year>
          )
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[24] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang, W. Gao, L. Ni, J. Guo, A survey on LLM-as-a-judge, 2025. URL: https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
[25] OpenAI, GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>