<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging LLMs to Build a Semi-Synthetic Dataset for Legal Information Retrieval: a Case Study on the Italian Civil Code and GPT4-o</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mattia Proietti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Although raw textual data in the legal domain is abundant, making it easy to collect large amounts of material from several sources, the structured and annotated data needed to fine-tune machine learning models is limited and difficult to obtain. Creating human-annotated datasets is both time- and money-consuming, which often makes it impractical to get quality data to train machines on various legal language tasks. AI models such as Large Language Models (LLMs) are becoming appealing tools to generate synthetic data, judge model responses, and annotate textual information, so as to cope with such shortcomings. In this work, we evaluate the applicability of LLMs to the automatic generation of a dataset of legal query-passage pairs to train retrieval systems. Indeed, Legal Information Retrieval (LIR) has been crucial for the creation of robust search systems for legal documents and is now gaining new importance in the context of the Retrieval Augmented Generation (RAG) framework, which is becoming a widespread tool to cope with the hallucinating behaviour of LLMs. Our goal is to test the feasibility of building a query-passage dataset in which the queries are generated by an LLM about real textual passages, and to assess the reliability of such a process in terms of the generation of hallucination-free data points in a delicate domain such as the legal one. We do so in a two-step pipeline: i) we use the Italian Civil Code as a source of self-contained, semantically coherent legal textual passages and ask the model to generate hypothetical questions on them; ii) we use the LLM itself to judge the coherence of the questions, to spot those inconsistent with the passage. We then select a random subset of the question-passage pairs and ask humans to evaluate them. Finally, we compare human and model evaluations on the randomly selected subset. We show that the model generates many questions easily and, while it lags behind humans when evaluating the appropriateness of the generated questions with respect to the reference passages in zero-shot settings, it substantially reduces the gap with human judgements when only two examples are provided.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Legal Information Retrieval</kwd>
        <kwd>Synthetic data generation</kwd>
        <kwd>LLM-as-a-judge</kwd>
        <kwd>Legal-NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, we have witnessed great advancements in the field of Artificial Intelligence (AI), in particular in its sub-domain of Natural Language Processing (NLP). The advent of Large Language Models (LLMs), especially on the wave initiated by the GPT family [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>], has revolutionised the way we produce, understand, and manipulate textual content. This revolution has permeated all domains, and the legal field is no exception. Indeed, NLP for legal applications is spreading and is gaining a core role in the discussion about the integration of AI into legal practice. However, due to its high degree of specialization, the intellectual complexity of legal tasks, and the technical specificity of its language, the legal domain, similarly to other specialized fields, has progressed more slowly toward a mature integration of language technologies. Despite the vast volume of textual material generated daily by legal practitioners, the field still faces a significant shortage of machine-readable and annotated resources needed to train and fine-tune AI systems for Legal NLP (LNLP) tasks, a process that is complex and presents numerous challenges [3]. The lack of data encompasses all the devisable LNLP tasks. In this work, we focus on data formats necessary to train systems to perform Legal Information Retrieval (LIR) tasks. LIR is a crucial task in the field of LNLP, primarily concerned with retrieving relevant documents in response to a given textual query. A typical application scenario involves a system capable of identifying and returning pertinent legal documents based on a user’s question. To effectively perform this task, it is essential to train models on in-domain data, specifically question-passage pairs derived from legal documents and expressed in legal language, in order to address domain shifts [4]. However, building such datasets purely through human annotation is both extremely time-consuming and costly, as it requires coming up with questions and associating them with relevant documents that may be used to answer those questions. To cope with such shortcomings, synthetic data generation and annotation through LLMs is arising as a promising strategy, and it is now being explored within the legal domain as well. Despite its ease, the increasing application of LLMs to generate synthetic data calls for a major assessment of their reliability and real applicability for the task at hand.</p>
      <p>This paper aims to answer the following research question: “How reliable are automated methods for generating and evaluating semi-synthetic datasets in the context of Legal Information Retrieval?” In turn, the motivation behind this question is two-fold. On the one hand, we want to generate a dataset that can be used to train machine learning systems to perform the task of LIR. On the other hand, we aim to assess the feasibility of this process by evaluating the reliability of using a state-of-the-art LLM both to generate questions and to assess their relevance to reference text passages, as well as the efficiency of this approach in terms of time and cost. We consider this process as a proxy to evaluate the model’s ability to understand legal texts at a basic level, since formulating a good question is an index of the degree of understanding reached by the system formulating that question.</p>
      <p>To this end, we integrate two established paradigms of LLM applications: (i) synthetic data generation [5, 6], employed to automatically construct the dataset, and (ii) LLM-as-a-judge [7], used to evaluate and filter out noisy or inaccurate outputs. Specifically, we apply a multi-step strategy involving a state-of-the-art LLM, namely GPT4-o, to generate questions on articles of the Italian Civil Code and evaluate whether the generated questions are answerable by reading the reference article text. We subsequently sample subsets of the generated questions at random and have them evaluated by human annotators using the same criteria as the model, in order to compare the results of automatic and manual evaluation. In that way, we estimate both the question-generation abilities of the LLM and its self-evaluation ability, both of which are crucial for assessing the feasibility of fully automating the process of creating a legal question-answering dataset.</p>
      <p>Given the aforementioned lack of datasets to train machine learning models for tasks related to the legal domain and the costs related to manually annotating corpora from the ground up, integrating LLMs in the process of dataset creation is nowadays a promising approach. This work contributes to the understanding of how much we can rely on state-of-the-art LLMs to generate synthetic textual data that are free from hallucinations and that may actually be useful in practical downstream tasks, particularly focusing on the generation of question-passage pairs to be used to train retriever models for LIR and RAG in the legal domain. This aspect is particularly important for low-resource languages and vertical domains, where annotated data is especially scarce. We found that not only is the model’s performance in generating questions remarkable in terms of quantity, but the model can also be almost as good as human judges in the evaluation task in 2-shot settings, though it lags behind humans when a 0-shot prompt is used.<sup>1</sup></p>
      <p><sup>1</sup>Code and data available at https://github.com/aittam9/cc_qa</p>
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
      <sec id="sec-2-1">
        <title>2https://it.wikisource.org/wiki/Codice_civile</title>
        <p>3https://platform.openai.com/docs/overview.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4https://spacy.io/</title>
        <p>
LLM-as-a-judge/annotator – LLMs have recently been involved in the process of both annotating data and evaluating model-generated responses. Aldeen et al.
[
          <xref ref-type="bibr" rid="ref3">17</xref>
          ] evaluate the performance of ChatGPT in annotating texts, comparing it with that of human annotators.
        </p>
        <p>
          Savelka [
          <xref ref-type="bibr" rid="ref4">18</xref>
          ] uses GPT to semantically annotate legal texts in a zero-shot fashion. Wang et al. [
          <xref ref-type="bibr" rid="ref5">19</xref>
          ] deploy a human-LLM collaborative protocol for data annotation.
        </p>
        <p>
          More broadly, LLMs have been used as judges in a
variety of works that are relevant to ours, both for the
methods employed and the aims pursued. For example, Sun
et al. [
          <xref ref-type="bibr" rid="ref6">20</xref>
          ] use LLMs to judge whether the knowledge retrieved as a triplet from a graph is sufficient to answer a given
question. Bavaresco et al. [
          <xref ref-type="bibr" rid="ref7">21</xref>
          ] tested LLMs as judges on
20 tasks, comparing their judgements with human ones
through Spearman’s correlation [
          <xref ref-type="bibr" rid="ref8">22</xref>
          ] for graded scores
and Cohen’s κ inter-annotator agreement [
          <xref ref-type="bibr" rid="ref9">23</xref>
          ] for
categorical ones. We refer to Gu et al. [
          <xref ref-type="bibr" rid="ref10">24</xref>
          ] for a comprehensive
overview of works that have adapted the LLM-as-a-judge
paradigm in several ways.
        </p>
        <p>Although a variety of works have addressed the
problem of augmenting data for IR through synthetic question
generation, to the best of our knowledge, a gap exists
both for the Italian language and the Italian legal
domain. The same holds for the application of an LLM as a
judge/annotator to evaluate and label data points to build
a dataset for LIR. The contribution of our work resides
precisely within that frame.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>As a first step, we used the articles of the Italian Civil Code<sup>2</sup> as a source of self-contained, semantically coherent legal passages and prompted GPT4-o through the OpenAI API<sup>3</sup> to generate hypothetical questions on each article.</p>
      <p><sup>2</sup>https://it.wikisource.org/wiki/Codice_civile <sup>3</sup>https://platform.openai.com/docs/overview <sup>4</sup>https://spacy.io/</p>
      <p>Automatic Questions Evaluation. In a second step, we provided the model with each article paired with the questions it had generated initially and asked it to evaluate whether the answer to each question could be found within the corresponding textual passage. The model was instructed to produce a binary output to facilitate efficient parsing in subsequent evaluation stages. Specifically, the model assigned one of two labels to each question–passage pair: “SI” for a positive match, indicating the answer is present, and “NO” for a negative match, indicating it is absent. Formally, given a pair consisting of a passage $p \in P$, a related question $q \in Q$ generated in the previous step, and a general template prompt $\tau$ shown in Figure 2, we built a prompt $\rho$ for each passage-question pair. The model $M$ had to determine whether $p$ contains the necessary information to answer $q$, which basically translates into the model performing a binary classification task over the prompt $\rho$, as shown in (1):</p>
      <p>$M(\rho) = \begin{cases} \text{SI}, \text{ if } p \text{ answers } q \\ \text{NO}, \text{ otherwise} \end{cases}$ (1)</p>
      <p>The question, passage, and instructions were formatted into the prompt template illustrated in Figure 2, translated here from the original Italian:</p>
      <p>###ISTRUZIONI###
You are an expert in law. Below you will be shown a text and a question. Your task is to determine whether the answer to the question is contained in the text. You can use only the following two valid OUTPUTs: ["SI", "NO"]. The OUTPUT is "SI" if the answer to the question is contained in the text. The OUTPUT is "NO" if the answer to the question is not contained in the text. In order to answer "SI", the answer to the question must be strictly and clearly present in the text. Return only "SI" or "NO" and nothing else.
###TESTO###
{text}
###DOMANDA###
{query}</p>
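      <p>As an illustration, the core of this judging step can be sketched as follows. This is a minimal, hypothetical sketch assuming the OpenAI Python SDK; the helper name judge_pair and the inlined English instruction text are our own, not part of the released code (the actual prompt of Figure 2 is in Italian).</p>
      <preformat>
# Minimal sketch of the self-evaluation step (assumes the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = (
    "###ISTRUZIONI###\n"
    "You are an expert in law. Determine whether the answer to the question "
    "is contained in the text. Valid OUTPUTs: [\"SI\", \"NO\"]. "
    "Return only \"SI\" or \"NO\" and nothing else.\n"
    "###TESTO###\n{text}\n"
    "###DOMANDA###\n{query}"
)

def judge_pair(passage: str, question: str, model: str = "gpt-4o") -> str:
    """Label a question-passage pair as SI/NO; anything else counts as a failure."""
    prompt = JUDGE_TEMPLATE.format(text=passage, query=question)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # make the binary labelling as deterministic as possible
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"SI", "NO"} else "INVALID"
      </preformat>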
      <p>Manual Evaluation. Forms have been automatically generated using the TypeScript extension from Google Sheets<sup>5</sup>. We have been able to collect manual annotations for 12 random samples of 100 entries each, for a total of 1200 question-passage pairs. Each sample was assigned to a single annotator, with no overlap of annotators on the same sets. Each question-passage pair to be evaluated has been presented to the annotators as shown in Figure 3. In this way, the human annotators had to perform the same binary classification task as the model, as illustrated in the previous paragraph, so that (1) can be turned into (2), where $H$ indicates the human performing the task:</p>
      <p>$H(\rho) = \begin{cases} \text{SI}, \text{ if } p \text{ answers } q \\ \text{NO}, \text{ otherwise} \end{cases}$ (2)</p>
      <p><sup>5</sup>This task has been performed with the aid of an LLM.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Generation</title>
        <p>The result statistics for the first experiment, that is the generation step, are shown in Table 1.</p>
        <p>As shown, the model demonstrates strong proficiency in generating questions for each article in terms of quantity, with an average of approximately 3 questions per article, ranging from 2.38 to 3.42 across books. Given a total of 2,927 input articles, the model generated 8,076 questions, effectively doubling or tripling the length of each book.</p>
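        <p>As a quick sanity check on these figures, 8,076 questions over 2,927 articles amounts to 8,076/2,927 ≈ 2.76 questions per article on average, consistent with the reported per-book range of 2.38 to 3.42.</p>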
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Automatic Self-Evaluation</title>
        <sec id="sec-3-2-1">
          <title>As introduced in the previous section, we randomly se</title>
          <p>Next, we examine the results of the auto evaluation per- lected a subset of the generated questions and asked
formed by the model itself and regarding the quality of human evaluators to judge if a question would be good
the generated questions with respect to the input refer- for a given reference passage, thus eliciting the same type
ence text. Figure 4 shows the distribution of the positive of binary judgment obtained by prompting GPT4-o. We
and negative values assigned by the model to each pair did so for 12 sub-sets of data each containing 100 items,
of generated questions and reference article text. The for a total of 1200 items. As can be seen from Figure 5,
values are respectively represented by the labels SI and human annotators assigned far more positive labels than
NO as required by the prompt shown in the previous negative, as the model itself already did in the zero-shot
section in Figure 2, and their distribution is computed settings, but with an even greater gap between the two
per ICC book. In this phase, the model assigned the pos- classes, for a total of 1036 (86%) positive labels against
itive label SI to a total of 5369 question-passage pairs, 164 (14%) negative ones. The manual evaluation on the
while judging 2692 pairs as negative, which were labelled random sample seems to point out that the majority of
with NO. Additionally, the model failed to provide a le- questions generated by the model are, on average, correct
gitimate answer (SI or NO), thus failing to follow the with respect to the related text passage.
instructions written in the prompt in 15 cases. Overall,
the model judged as relevant to the reference article 66%
of the questions, thus interpreting as correct only 2/3 of 5.4. Cross Evaluation
its own generations.</p>
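        <p>Both evaluation rounds reduce to validating and tallying the two admissible labels. A minimal sketch in Python follows; the helper below is our own illustration, not part of the released code.</p>
        <preformat>
# Sketch of label validation and tallying, e.g. per ICC book (Figure 4).
from collections import Counter

def tally_labels(records):
    """records: iterable of (book, label) pairs, label in {'SI', 'NO', 'INVALID'}."""
    per_book = {}
    for book, label in records:
        per_book.setdefault(book, Counter())[label] += 1
    return per_book

# Global zero-shot counts reported above:
totals = Counter({"SI": 5369, "NO": 2692, "INVALID": 15})
positive_rate = totals["SI"] / sum(totals.values())  # ~0.66, i.e. 66%
        </preformat>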
      </sec>
      <sec id="sec-3-4">
        <title>5.4. Cross Evaluation</title>
        <p>We ran a cross-analysis between Human and Model evaluations on the randomly selected subsets, considering both zero-shot and two-shot prompting. As for the model, we use the zero-shot evaluations previously performed on the whole generated dataset, as well as a new set of 2-shot evaluations elicited for the random subsets assigned to humans. In that way, we could compare Human evaluations against two types of model evaluations, namely Model-0shot and Model-2shot. As shown in Table 2, human evaluations assigned the most positive labels (86%), closely followed by the Model-2shot (82%), while Model-0shot evaluations lag behind both (66%). In fact, when the model is prompted with no example provided, its evaluations display a gap of around 18-20% compared to the other two modalities.</p>
        <p>It should be stressed that, in this case, positive and negative do not necessarily correspond to correct and incorrect, but to how an evaluator, human or artificial, has considered the input pair. So, at this stage the comparison between human annotators and the model concerns the propensity to assign positive values to the analysed pairs rather than the judgement of correct responses.</p>
        <p>Therefore, we then analysed how the model evaluations performed against the human ones, using the latter as the gold standard, in order to have a more meaningful comparison between Human and Model evaluations. As previously stated (see Section 4), the evaluation task can be formalised as a binary classification task. Therefore, we computed classical machine learning metrics such as Precision, Recall and F1 between human and model annotations. Again, we did so for the model’s evaluations elicited in 0-shot and 2-shot settings. Results are shown in Table 3.</p>
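        <p>For illustration, this comparison can be sketched as follows, assuming scikit-learn is available; the helper name is our own and not part of the released code.</p>
        <preformat>
# Sketch of the cross-evaluation metrics: human labels are the gold standard,
# model labels (0-shot or 2-shot) are the predictions.
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

def compare_annotations(human_labels, model_labels, positive="SI"):
    """Both inputs are aligned lists of 'SI'/'NO' labels over the same pairs."""
    p, r, f1, _ = precision_recall_fscore_support(
        human_labels, model_labels, pos_label=positive, average="binary"
    )
    cm = confusion_matrix(human_labels, model_labels, labels=["SI", "NO"])
    return p, r, f1, cm
        </preformat>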
        <p>As expected, given the previous comparisons, the F1 score obtained between Human and Model-0shot is modest (76%). This confirms the tendency of the model to underestimate the correctness of the generated questions when prompted with no example whatsoever. This led the model to mislabel many items, favouring negative labels and hence leading to a problem of false negatives, as already guessable from the previous analysis, while the percentage of false positives assigned by the model is much lower.</p>
        <p>On the other hand, the F1 improved by 10 points (86%) when Model-2shot evaluations are used, substantially levelling the false-negative problem that emerged in the 0-shot evaluation. In other words, as further summed up in the confusion matrices shown in Figure 6, much of the discrepancy between the two evaluation settings depends on GPT4-o underestimating the goodness of its own generations when the evaluation is led with no examples provided, failing to correctly match a large number of pairs in which the question and reference article text were positively related. On the contrary, with just one correct and one incorrect example, the model evaluations align significantly better with the human ones.</p>
      </sec>
    </sec>
    <sec id="sec-disc">
      <title>6. Discussion</title>
      <p>We have performed a series of experiments to assess the ability of GPT4-o to generate pertinent legal questions in relation to articles of the Italian Civil Code. We first prompted the LLM to generate the questions, then asked the model itself to judge their goodness, adopting a binary labelling schema. In parallel, we sampled a subset of the generated questions and asked humans to judge their quality with respect to the reference text they were generated from, using the same schema adopted for the model. Next, we compared the two kinds of evaluation, the automatic one made by the model and the manual one performed by human annotators.</p>
      <p>Overall, we saw that, as expected, GPT4-o has generally been able to produce an adequate number of questions for each article, as stated by our heuristic, which would allow the seamless creation of a dataset to train models for the Legal Information Retrieval task, which may then be integrated into Search Engines or RAG applications. In fact, given the starting set of input texts, we have been able to triple its size in terms of generated questions.</p>
      <p>The model’s self-evaluation phase seemed to reveal an underestimation of the goodness of the questions by the model itself when it is prompted to perform the task in 0-shot settings. The model judged only 66% of the questions as pertinent to their respective reference text when no example is provided, initially leading us to think that, while it is very good at generating, it underperforms when it comes to evaluating, even though the evaluation concerns its own generated texts. On the other hand, the model has been able to close the gap with human judges in positively evaluating question-passage pairs, from a difference of 20% to only 4%, when provided with a correct and an incorrect example. While the 0-shot settings underlined a substantial problem of false negatives, this has been substantially reduced in the 2-shot settings.</p>
      <p>The results show that a SOTA LLM can be seamlessly used to generate legal content-related questions. It can hardly compete with humans in the 0-shot evaluation of the quality of the same questions with respect to their reference passage, but it can better mimic human performance when provided with a negligible number of examples. Overall, all the above hints suggest that using LLMs to cope with the shortage of annotated resources to train machine learning models in the legal domain is an asset worth putting into practice. As stated in previous sections, we used the LLM as a generator to produce questions and as a judge to evaluate the goodness of its own generations. While the LLM-as-a-judge paradigm provides an easy and efficient way to evaluate model responses, its value is not limited to that. Indeed, we can readapt model evaluations and consider them as annotations, with no need to discard incorrect questions, which can be used as negative labels of the generated dataset.</p>
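      <p>As a minimal sketch of this last point (our own illustration, not the released code), the judge labels can be recast as relevance annotations for retriever training:</p>
      <preformat>
# Recast judge labels as retrieval training annotations: positively judged
# pairs become positive examples, negatively judged ones become negatives.
def to_training_examples(judged_pairs):
    """judged_pairs: iterable of dicts with 'question', 'passage', 'label'."""
    examples = []
    for rec in judged_pairs:
        if rec["label"] not in {"SI", "NO"}:
            continue  # drop the few non-compliant judgements
        examples.append({
            "query": rec["question"],
            "passage": rec["passage"],
            "relevance": 1 if rec["label"] == "SI" else 0,
        })
    return examples
      </preformat>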
    </sec>
    <sec id="sec-4">
      <title>8. Conclusions</title>
    </sec>
    <sec id="sec-5">
      <title>7. Limitations and Future Directions</title>
      <sec id="sec-5-1">
        <title>Some limitations of the present work need to be noted.</title>
        <p>First of all, we used a proprietary model. While this
choice is apt to our purpose and data, using a
closedsource closed-access model implies not being able to
precisely define the engine being used, which can undergo
updates or modifications without notification. That may
hinder the reproducibility and stability of the results
across time.</p>
        <p>On the side of question evaluation, we used a simple
binary approach aiming at identifying whether a question
could be answered with the information provided in the
document from which it has been generated. While this
is straightforward and seamless to implement, it does not
allow a more nuanced assessment of the quality of the
questions. Therefore, future work will be devoted to refining the evaluation approach, introducing additional criteria to assess the quality of a question other than simple answerability (e.g., fluency, ambiguity, and the like). Also,
due to resource constraints, we distributed the random
samples for the manual evaluation among annotators,
assigning a single sample to each one, without overlapping.
This made it impossible to assess the soundness of the
annotations by computing annotators’ agreement measures.
In the future, we plan to widen the number of annotated
items as well as the pool of annotators, in order to obtain
a stronger and more faithful gold standard.</p>
        <p>Lastly, in this work, we focused solely on the Italian
Civil Code, from which we derived more than 8000
training inputs. Despite being a robust starting point, we are
planning to extend the strategy to other Italian Codes,
like the Penal Code, in order to both extend the dataset
quantitatively and add greater linguistic and conceptual variation qualitatively.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>In conclusion, integrating LLMs in the process of creating datasets for LNLP tasks is surely a promising and
worthwhile route, as it may have many benefits in terms of costs and time efficiency. Indeed, we estimate that the
total cost of generating and evaluating questions with
GPT4-o is less than 30 dollars, and the amount of time
needed to perform the computational experiments is
between 15 and 20 hours. These numbers suggest that the process may be easily scalable without a great expenditure of resources. Also, we showed how the model needs
at least two examples to approach the human
performance in evaluation, while substantially lagging behind
it when a 0-shot prompt is used. While manual
evaluation seems to still be the most faithful way to derive gold
standards, we estimated that around one hour is
necessary for a human to perform an evaluation on a sample
of 100 entries, which may become impractical to extend
to larger datasets. In contrast, using an LLM to both
generate and judge-annotate synthetic questions seems to
be a viable alternative to fully automate the process of
generating training data for Legal Information Retrieval,
providing huge benefits in terms of money and time
resources, while maintaining an acceptable performance
rate, up to an unavoidable level of noise.</p>
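        <p>For scale, the reported figures amount to roughly 30/8,076 ≈ 0.004 dollars per generated and self-evaluated question, against roughly one hour of human annotation time per 100 pairs.</p>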
      </sec>
    <sec id="sec-6">
      <title>9. Acknowledgments</title>
      <sec id="sec-6-1">
        <title>We are deeply grateful to the volunteer human annotators who have participated in the experiments.</title>
        <p>[3] H. Darji, J. Mitrović, M. Granitzer, Challenges and can Chapter of the Association for Computational
considerations in annotating legal data: A compre- Linguistics: Human Language Technologies,
Ashensive overview, 2024. URL: https://arxiv.org/abs/ sociation for Computational Linguistics, Seattle,
2407.17503. arXiv:2407.17503. United States, 2022, pp. 2345–2360. URL: https:
[4] D. Dua, E. Strubell, S. Singh, P. Verga, To adapt or to //aclanthology.org/2022.naacl-main.168/. doi:10.
annotate: Challenges and interventions for domain 18653/v1/2022.naacl-main.168.
adaptation in open-domain question answering, in: [11] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Pro- M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
ceedings of the 61st Annual Meeting of the Asso- limits of transfer learning with a unified
text-tociation for Computational Linguistics (Volume 1: text transformer, Journal of Machine Learning
ReLong Papers), Association for Computational Lin- search 21 (2020) 1–67. URL: http://jmlr.org/papers/
guistics, Toronto, Canada, 2023, pp. 14429–14446. v21/20-074.html.</p>
        <p>URL: https://aclanthology.org/2023.acl-long.807/. [12] J. Ma, I. Korotkov, Y. Yang, K. Hall, R.
Mcdoi:10.18653/v1/2023.acl-long.807. Donald, Zero-shot neural passage retrieval via
[5] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, domain-targeted synthetic question generation, in:
Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.),
ProJ. E. Gonzalez, I. Stoica, Judging llm-as-a-judge ceedings of the 16th Conference of the European
with mt-bench and chatbot arena, in: A. Oh, T. Nau- Chapter of the Association for Computational
Linmann, A. Globerson, K. Saenko, M. Hardt, S. Levine guistics: Main Volume, Association for
Compu(Eds.), Advances in Neural Information Processing tational Linguistics, Online, 2021, pp. 1075–1088.
Systems, volume 36, Curran Associates, Inc., 2023, URL: https://aclanthology.org/2021.eacl-main.92/.
pp. 46595–46623. URL: https://proceedings. doi:10.18653/v1/2021.eacl-main.92.
neurips.cc/paper_files/paper/2023/file/ [13] R. Meng, Y. Liu, S. Yavuz, D. Agarwal, L. Tu, N. Yu,
91f18a1287b398d378ef22505bf41832-Paper-Datasets_ J. Zhang, M. Bhat, Y. Zhou, Augtriever:
Unsuperand_Benchmarks.pdf. vised dense retrieval by scalable data augmentation,
[6] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, arXiv preprint arXiv:2212.08841 (2022).</p>
        <p>G. Chen, H. Wang, On llms-driven synthetic [14] Z. Tong, C. Qin, C. Fang, K. Yao, X. Chen, J. Zhang,
data generation, curation, and evaluation: A sur- C. Zhu, H. Zhu, From missteps to mastery:
Envey, 2024. URL: https://arxiv.org/abs/2406.15126. hancing low-resource dense retrieval through
adaparXiv:2406.15126. tive query generation, in: Proceedings of the 31st
[7] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, ACM SIGKDD Conference on Knowledge
DiscovA. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu, ery and Data Mining V.1, KDD ’25, Association for
L. Cheng, H. Liu, From generation to judg- Computing Machinery, New York, NY, USA, 2025,
ment: Opportunities and challenges of llm-as-a- p. 1373–1384. URL: https://doi.org/10.1145/3690624.
judge (2025). URL: https://arxiv.org/abs/2411.16594. 3709225. doi:10.1145/3690624.3709225.
arXiv:2411.16594. [15] L. Bonifacio, H. Abonizio, M. Fadaee, R. Nogueira,
[8] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, Inpars: Unsupervised dataset generation for
inforG. Chen, H. Wang, On LLMs-driven syn- mation retrieval, in: Proceedings of the 45th
Inthetic data generation, curation, and evalua- ternational ACM SIGIR Conference on Research
tion: A survey, in: L.-W. Ku, A. Martins, and Development in Information Retrieval, SIGIR
V. Srikumar (Eds.), Findings of the Association ’22, Association for Computing Machinery, New
for Computational Linguistics: ACL 2024, Asso- York, NY, USA, 2022, p. 2387–2392. URL: https:
ciation for Computational Linguistics, Bangkok, //doi.org/10.1145/3477495.3531863. doi:10.1145/
Thailand, 2024, pp. 11065–11082. URL: https: 3477495.3531863.
//aclanthology.org/2024.findings-acl.658/. doi: 10. [16] J. Saad-Falcon, O. Khattab, K. Santhanam, R.
Flo18653/v1/2024.findings-acl.658. rian, M. Franz, S. Roukos, A. Sil, M. Sultan, C. Potts,
[9] S. Shashidhar, C. Fourrier, A. Lozovskia, T. Wolf, UDAPDR: Unsupervised domain adaptation via
G. Tur, D. Hakkani-Tür, Yourbench: Easy custom LLM prompting and distillation of rerankers, in:
evaluation sets for everyone, 2025. URL: https:// H. Bouamor, J. Pino, K. Bali (Eds.),
Proceedarxiv.org/abs/2504.01833. arXiv:2504.01833. ings of the 2023 Conference on Empirical
Meth[10] K. Wang, N. Thakur, N. Reimers, I. Gurevych, GPL: ods in Natural Language Processing, Association
Generative pseudo labeling for unsupervised do- for Computational Linguistics, Singapore, 2023,
main adaptation of dense retrieval, in: M. Carpuat, pp. 11265–11279. URL: https://aclanthology.org/
M.-C. de Marnefe, I. V. Meza Ruiz (Eds.), Proceed- 2023.emnlp-main.693/. doi:10.18653/v1/2023.
ings of the 2022 Conference of the North Ameri- emnlp-main.693.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by generative pre-training</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings. neurips.cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf .
        </mixed-citation>
      </ref>
      <ref id="bib3">
        <mixed-citation>[3] H. Darji, J. Mitrović, M. Granitzer, Challenges and considerations in annotating legal data: A comprehensive overview, 2024. URL: https://arxiv.org/abs/2407.17503. arXiv:2407.17503.</mixed-citation>
      </ref>
      <ref id="bib4">
        <mixed-citation>[4] D. Dua, E. Strubell, S. Singh, P. Verga, To adapt or to annotate: Challenges and interventions for domain adaptation in open-domain question answering, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 14429-14446. URL: https://aclanthology.org/2023.acl-long.807/. doi:10.18653/v1/2023.acl-long.807.</mixed-citation>
      </ref>
      <ref id="bib5">
        <mixed-citation>[5] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 46595-46623. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.</mixed-citation>
      </ref>
      <ref id="bib6">
        <mixed-citation>[6] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, H. Wang, On LLMs-driven synthetic data generation, curation, and evaluation: A survey, 2024. URL: https://arxiv.org/abs/2406.15126. arXiv:2406.15126.</mixed-citation>
      </ref>
      <ref id="bib7">
        <mixed-citation>[7] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu, L. Cheng, H. Liu, From generation to judgment: Opportunities and challenges of LLM-as-a-judge, 2025. URL: https://arxiv.org/abs/2411.16594. arXiv:2411.16594.</mixed-citation>
      </ref>
      <ref id="bib8">
        <mixed-citation>[8] L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, H. Wang, On LLMs-driven synthetic data generation, curation, and evaluation: A survey, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 11065-11082. URL: https://aclanthology.org/2024.findings-acl.658/. doi:10.18653/v1/2024.findings-acl.658.</mixed-citation>
      </ref>
      <ref id="bib9">
        <mixed-citation>[9] S. Shashidhar, C. Fourrier, A. Lozovskia, T. Wolf, G. Tur, D. Hakkani-Tür, YourBench: Easy custom evaluation sets for everyone, 2025. URL: https://arxiv.org/abs/2504.01833. arXiv:2504.01833.</mixed-citation>
      </ref>
      <ref id="bib10">
        <mixed-citation>[10] K. Wang, N. Thakur, N. Reimers, I. Gurevych, GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2345-2360. URL: https://aclanthology.org/2022.naacl-main.168/. doi:10.18653/v1/2022.naacl-main.168.</mixed-citation>
      </ref>
      <ref id="bib11">
        <mixed-citation>[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1-67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation>
      </ref>
      <ref id="bib12">
        <mixed-citation>[12] J. Ma, I. Korotkov, Y. Yang, K. Hall, R. McDonald, Zero-shot neural passage retrieval via domain-targeted synthetic question generation, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 1075-1088. URL: https://aclanthology.org/2021.eacl-main.92/. doi:10.18653/v1/2021.eacl-main.92.</mixed-citation>
      </ref>
      <ref id="bib13">
        <mixed-citation>[13] R. Meng, Y. Liu, S. Yavuz, D. Agarwal, L. Tu, N. Yu, J. Zhang, M. Bhat, Y. Zhou, AugTriever: Unsupervised dense retrieval by scalable data augmentation, arXiv preprint arXiv:2212.08841 (2022).</mixed-citation>
      </ref>
      <ref id="bib14">
        <mixed-citation>[14] Z. Tong, C. Qin, C. Fang, K. Yao, X. Chen, J. Zhang, C. Zhu, H. Zhu, From missteps to mastery: Enhancing low-resource dense retrieval through adaptive query generation, in: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD '25, Association for Computing Machinery, New York, NY, USA, 2025, pp. 1373-1384. URL: https://doi.org/10.1145/3690624.3709225. doi:10.1145/3690624.3709225.</mixed-citation>
      </ref>
      <ref id="bib15">
        <mixed-citation>[15] L. Bonifacio, H. Abonizio, M. Fadaee, R. Nogueira, InPars: Unsupervised dataset generation for information retrieval, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2387-2392. URL: https://doi.org/10.1145/3477495.3531863. doi:10.1145/3477495.3531863.</mixed-citation>
      </ref>
      <ref id="bib16">
        <mixed-citation>[16] J. Saad-Falcon, O. Khattab, K. Santhanam, R. Florian, M. Franz, S. Roukos, A. Sil, M. Sultan, C. Potts, UDAPDR: Unsupervised domain adaptation via LLM prompting and distillation of rerankers, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 11265-11279. URL: https://aclanthology.org/2023.emnlp-main.693/. doi:10.18653/v1/2023.emnlp-main.693.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[17] M. Aldeen, J. Luo, A. Lian, V. Zheng, A. Hong, P. Yetukuri, L. Cheng, ChatGPT vs. human annotators: A comprehensive analysis of ChatGPT for text annotation, in: 2023 International Conference on Machine Learning and Applications (ICMLA), 2023, pp. 602-609. doi:10.1109/ICMLA58977.2023.00089.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[18] J. Savelka, Unlocking practical applications in legal domain: Evaluation of GPT for zero-shot semantic annotation of legal texts, in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 447-451. URL: https://doi.org/10.1145/3594536.3595161. doi:10.1145/3594536.3595161.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[19] X. Wang, H. Kim, S. Rahman, K. Mitra, Z. Miao, Human-LLM collaborative annotation through effective verification of LLM labels, in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI '24, Association for Computing Machinery, New York, NY, USA, 2024. URL: https://doi.org/10.1145/3613904.3641960. doi:10.1145/3613904.3641960.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[20] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. M. Ni, H.-Y. Shum, J. Guo, Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph, 2024. URL: https://arxiv.org/abs/2307.07697. arXiv:2307.07697.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[21] A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, A. Testoni, LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 238-255. URL: https://aclanthology.org/2025.acl-short.20/. doi:10.18653/v1/2025.acl-short.20.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Spearman</surname>
          </string-name>
          ,
          <article-title>The proof and measurement of association between two things</article-title>
          ,
          <source>The American Journal of Psychology</source>
          <volume>15</volume>
          (
          <year>1904</year>
          )
          <fpage>72</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
<article-title>A coefficient of agreement for nominal scales</article-title>
          ,
          <source>Educational and psychological measurement 20</source>
          (
          <year>1960</year>
          )
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[24] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang, W. Gao, L. Ni, J. Guo, A survey on LLM-as-a-judge, 2025. URL: https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
[25] OpenAI, GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>