1. Introduction

A Novel Multi-Step Prompt Approach for LLM-based Q&As on Banking Supervisory Regulation

Daniele Licari

0 1

Canio Benedetto

Praveen Bushipaka

Alessandro De Gregorio

Marco De Leonardis

Tommaso Cucinotta

1 0 Banca d'Italia , Via Nazionale, 91, Rome, 00184 , Italy 1 Scuola Superiore Sant'Anna , P.zza dei Martiri della Libertà, 33, Pisa, 56100 , Italy

This paper investigates the use of large language models (LLMs) in analyzing and answering questions related to banking supervisory regulation concerning reporting obligations. We introduce a multi-step prompt construction method that enhances the context provided to the LLM, resulting in more precise and informative answers. This multi-step approach is compared with standard "zero-shot" and "few-shot" approaches, which lacks context enrichment. To assess the quality of the generated responses, we utilize an LLM evaluator. Our findings indicate that the multi-step approach significantly outperforms the zero-shot method, producing more comprehensive and accurate responses.

eol>Regulatory Q&A Banking Supervisory Reporting Regulation Artificial Intelligence GenAI GPT-4o RAG LLM Evaluator

1. Introduction Therefore, it is essential to establish strong verification

procedures and retain human supervision to counter The advent of generative AI (GenAI), and specifically of these risks. The complexity of regulatory documents, large language models (LLMs), ofers significant oppor- with their dense network of cross-referenced texts/cats tunities, among others, in the legal and financial sector, and specialized content, necessitates careful analysis to facilitating the implementation of innovative solutions retrieve the needed information ensuring at the same across various domains of activities [1, 2, 3, 4, 5]. One time efective risk management and limit the burden of of the most promising applications is the business case such manual compliance. for supporting the navigation and analysis of complex This study introduces a novel methodology to autoregulatory documents [6, 7, 8, 9], which can be particu- mate and expedite the "question & answer" (Q&A) prolarly valuable for compliance oficers, legal teams, and cess in regulatory compliance, leveraging advanced large other professionals working in financial institutions who language models (LLMs) to provide accurate and timely need to have a clear and timely understanding of the responses to inquiries about the European Banking Auregulations and the consequently derived obligations. thority’s (EBA) reporting regulations. Our multi-step

Supervisory authorities could benefit from a tool that approach aligns with Retrieval-Augmented Generation streamlines the consultation of complex legislation, pro- (RAG) principles, enhancing context retrieval and genviding swift responses to entities and enhancing efi- erative capabilities through mechanisms like explicit ciency [10]. While LLMs ofer advantages for this pur- extraction of Capital Requirements Regulation (CRR) pose, they also pose risks like bias and inaccuracies [11]. references, implicit reference analysis, and a dedicated cross-encoder for precise regulatory text retrieval. This CDLeciC0-4it—200264,:2T0e2n4t,hPIitsaal,iIatnalCyonference on Computational Linguistics, methodology ensures tailored response generation suited * Corresponding author. to the complex regulatory compliance context, where pre† The views and opinions expressed in this paper are those of the cise and comprehensive answers are crucial. authors and do not necessarily reflect the oficial policy or position Our work finds particular applications within the doof the Bank of Italy. main of EBA regulatory reporting because it is charac$ daniele.licari@bancaditalia.it (D. Licari); terized by a large and complex set of interrelated docucparnavioe.ebne.nbeudsehtitpoa@kab@anscaandtiatnanliaa.pitis(aC.i.t B(Pe.nBeduesthtiop)a;ka); ments, including delegated and implementing acts, techalessandro.degregorio@bancaditalia.it (A. De Gregorio); nical standards, guidelines, and recommendations, which marco.deleonardis@bancaditalia.it (M. De Leonardis); cover various aspects of financial entities. Such comtommaso.cucinotta@santannapisa.it (T. Cucinotta) plexity makes the business case both challenging and 0000-0002-2963-9233 (D. Licari); 0000-0002-8446-9468 rewarding. (0C0.00B-e0n0e0d1e-7tt5o7)7;-03060595-0(A00.9D-7e7G53r-e8g6o6r2io()P;.0B00u9s-h0i0p0a4k-a6)5;23-186X In this work, we focus on Regulation (EU) N.2013/575, (M. De Leonardis); 0000-0002-0362-0657 (T. Cucinotta) also called Capital Requirements Regulati on (CRR) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License https://eur-lex.europa.eu/legal-content/en/ALL/?uri= Attribution 4.0 International (CC BY 4.0). celex%3A32013R0575, specifically on the topic of Table 1 Liquidity Risk as a first use case to evaluate the potential Sample distribution across training, validation, and test sets benefit of enriched context for an accurate response for CRR-related Q&A and the subset of only Liquidity Risk generation. The main reason for this choice is that this Q&A. topic is supported by a relatively limited number of Set CRR-related Q&A Liquidity Risk Q&A regulatory documents, so it was a good starting point since the regulation is not readily available in the form Training 798 58 of a structured dataset and its pre-processing is usually a ValTideasttion 166327 1426 time-consuming task.

We used the actual EBA Q&As dataset [12] as the foun- variables: question ID, question, submission date, status, dation for developing a system capable of generating au- topic, legal act, article [within that act], background infortomated responses to questions formulated by analysts mation,final answer, submission date and status (details on EBA reporting requirements and rules. By harnessing in Table 4, Appendix 4) Secondly, we implemented a twothe capabilities of LLMs we aim to create a tool that can step filtering process aimed at ensuring model eficacy: deliver accurate and contextually relevant answers to by excluding non-English entries, and by focusing on any inquiry on the content of the CRR. CRR-related questions within the same timeframe. This

Recent studies highlight the potential of LLMs for qual- resulted in a final dataset of 1597 CRR-related questions itative assessment [13, 14, 15, 16]. For this reason, in this and answers, which was then split into training (50%), work we also propose the use of an "LLM Evaluator" to validation (10%), and test sets (40%) for robust evaluation automate the validation process. (token number distribution in Figure 1 in Appendix A).

The structure of this paper is the following. Chapter The distribution of samples for the dataset is summarized 2 introduces the methodology and provides a detailed in Table 1. description of the approach adopted in this study; it explains the dataset utilized and the normative retrieval techniques employed to identify the regulatory docu- 2.2. Context Enrichment ments necessary to address the EBA’s Q&As. Chapter 3 presents the LLM Evaluator and the evaluation criteria. Chapter 4 reports experimental results and results and presents the main outcomes of the study. Chapter 5 discusses challenges as well as potential areas for future developments.

The context enrichment process is a three-step approach designed to identify, within the data set, the most relevant CRR references to provide an appropriate content to formulate the answer to the inquiry. The first step simply involves extracting explicit CRR references, if directly mentioned in the question (Article in tab 4). The second step leverages on the capabilities of the GPT-4o (prompt 2. Methodology in Appendix C.1) to analyse the “question” and the “background information” to identify other CRR references This research employs a multi-step methodology to con- that are not explicitly stated by the user. The last step struct a comprehensive prompt for the GPT-4 omni (GPT- of the process utilizes our CRR Ranker model, a cross4o) language model [17], enabling it to answer EBA- encoder architecture that has been trained to identify related questions efectively. This step-wise approach and retrieve pertinent references from the Capital Refocuses on enriching the context provided by the user’s quirements Regulation in response to specific inquiries. question. First, it identifies relevant EBA regulations This 3-steps comprehensive approach ensures a broader (specifically CRR references) within the inquiry. Second, and potentially more accurate understanding of the the it incorporates response examples to guide the LLM’s inquiry and the specific legal act(s) related to the CRR output format ensuring alignment with EBA regulations. that the Q&A tool deems applicable. This enriched context is then leveraged by a powerful LLM to generate more accurate and informative re- 2.2.1. CRR Ranker Training sponses (details in Appendix B.1).

With regard to the context enrichment, i.e. the CRR

Ranker Training, we employed a specifically trained 2.1. Dataset Construction cross-encoder model [18] to identify relevant CRR references for enriching inquiry context. We used a dedicated To develop and then evaluate our LLM-based Q&A sys- “question-article” pair dataset derived from our EBA Q&A tem, firstly we extracted a subset from the EBA’s Single- Train Dataset, excluding questions related to CRR Artirule-book-qa online resource [12], comprising “question- cle 99 https://www.eba.europa.eu/regulation-and-policy/ and-answer” pairs submitted to the EBA between 2013 and 2020. In particular, we focused on the following single-rulebook/interactive-single-rulebook/14212 due Dataset. This evaluation employed recall metrics at varito their frequent lack of topical relevance. Each data point ous retrieval cutofs, including recall@5, recall@10, reconsisted of a question (user query and background in- call@20, and recall@30 (results in Section 4). formation), an associated CRR article, and a binary label indicating relevance (1 for relevant, 0 for not applicable). 2.3. Examples Enrichment

We constructed the training dataset by selecting positive and negative samples. Positive samples comprised To improve the model’s understanding of the desired requestion-article pairs where the article explicitly ad- sponse format, tone, and content, we adopted a few-shot dressed the user’s query. Additionally, we included pairs prompting approach [22]. This involved extracting five formed by questions and implicit CRR references ex- relevant examples from the EBA Q&A Train Dataset with tracted from the user’s text, context information, and the same topic as the user question we want to answer. oficial response using GPT-4o (used prompt in Appendix These examples served as demonstrations for the model, C.1). showcasing the ideal structure, language style, and level

Negative training samples were mined by using the of detail expected in the final responses. Notably, the seBAAI bge-large-en-v1.5 pre-trained language model [19]. lection process ensured heterogeneity within the chosen For the CRR Ranker Training we employed a two-phase topic, meaning the examples covered various aspects to process for negative sample selection: first, all CRR ar- promote a broader understanding. Limiting the number ticles were encoded using the bge-large-en-v1.5 model, of examples to five struck a balance between providing and cosine similarity was utilized to rank them relative to diverse demonstrations and maintaining cost-eficiency the user’s question; second, a set of 20 negative examples during inference, as the LLM’s input token length has was randomly chosen from a pre-defined ranking inter- limitations. val (250-300). The choice of 20 negative samples provides a good balance between computational eficiency and 2.4. Answer Generation the availability of enough training data. This approach aimed to balance the representation of relevant and irrel- Figure 2 in Appendix B.1 details how we construct a evant information within the training data, ensuring the comprehensive prompt that enhances GPT-4o’s ability model learns to distinguish between the user’s query and to efectively answer user questions. The final prompt in potentially related but ultimately of-topic CRR articles Appendix C.2 integrates the enriched context (extracted [20]. CRR references) and the example enrichment (demon

The final dataset comprised 12,533 unique "question- strations of desired response format, tone, and content). article" pairs with positive and negative labels. This data This comprehensive prompt is fed to GPT-4o through the was split into training (10,179 pairs) and development OpenAI API, enabling it to generate a well-reasoned and (2,354 pairs) sets for model fine-tuning. This fine-tuning informative response that adheres to the EBA’s regulaaimed to learn robust semantic representations for ques- tory framework and professional tone. tions and CRR articles, enabling the model to efectively identify relevant CRR references for enriching user query 2.5. Comparison with RAG Principles context.

We selected the BAAI BGE Reranker v2 m3 model Our multi-step prompt approach aligns with the core [18] as the basis for our cross-encoder, owing to its task- principles of Retrieval-Augmented Generation (RAG) specific aptness and its demonstrated superior perfor- while incorporating tailored enhancements that improve mance relative to the BGE Reranker Large [19], as re- context enrichment for regulatory Q&A tasks. Like RAG, ported in Section 4. We adopted the Cross-Entropy Bi- our method integrates information retrieval with lannary Classification loss function, following the approach guage generation, but it adds specialized steps to enhance suggested in the BGE Rerank Git repository [21]. To context enrichment. These include explicit extraction of promote stable convergence, we incorporated a warmup CRR references, implicit analysis using LLM capabilities, schedule ( with a number of steps 0.1 × len(train_data) × and precise retrieval through a dedicated cross-encoder. num_epochs step) that gradually increases the learning Compared to standard RAG, which often relies on singlerate during the initial phase of training. The entire fine- stage retrieval, our structured multi-step process adds a tuning process was conducted over 4 epochs. We em- higher level of granularity, including example enrichment ployed an evaluation interval of 800 steps during training through few-shot prompts. This ensures not only factual and saved the model that achieved the highest F1 score accuracy but also alignment with domain-specific lanon the development set. guage standards, ultimately improving response quality

Finally, we evaluated the model’s retrieval ability of for complex regulatory inquiries. Overall, our approach CRR items for a given user question on EBA Q&A Test extends the RAG principles to generate tailored, contextually enriched answers, which is particularly beneficial for the intricate requirements of regulatory compliance.

3. LLM Evaluator

over a human one, especially for tasks involving prompt optimization and evaluation. The figure in Appendix B.2 illustrates the complete process of evaluating agreement between the LLM evaluator and the human expert.

In our pipeline, we employ an LLM Evaluator to assess the 4. Experiments and Results quality of generated responses, defined in Section 2, compared to the EBA’s answers already provided. Employing This section describes the results obtained by measuran LLM Evaluator ofers significant advantages in terms ing retrieval efectiveness and answer quality. Retrieval of cost-efectiveness and eficiency compared to tradi- performance is measured by the number of relevant regtional human evaluation/comparison methods. Recent ulations retrieved (recall) using diferent encoder models. research highlights the potential of LLMs for large-scale Answer quality is then evaluated by a separate LLM, natural language evaluation tasks [23, 24, 25]. which scores each generated response based on factors

The evaluation process uses a scale from one to four, like relevance and adherence to EBA legal acts. We combased on two evaluation criteria: correctness and com- pare the multi-step prompt approach with a few-shot and pleteness. A generated response is considered correct if zero-shot one focusing on a single topic within the EBA its content aligns with the information presented in the Q&A framework, specifically Liquidity Risk. Finally, we oficial answer. Additionally, a response is deemed com- test our Multi-Step pipeline with other LLM models, such plete if it incorporates all relevant regulatory references as Google Gemini Flash 1.5 and Llama 3.1 70B. provided in the oficial answer. The following scoring rubric outlines the evaluation criteria:

4.1. CRR Retrieval To preliminary validate the efectiveness of our LLM

evaluator, we conducted an experiment using a synthetic • Bi-encoders: all-MiniLM-L6-v2 [27], gte-large-endataset. This dataset was carefully designed to test var- v1.5 [28], and bge-large-en-v1.5 [19]. ious aspects of language generation and was evaluated • Cross-encoders: bge-reranker-large [19], bgeby both a human expert and the LLM. The alignment be- reranker-v2-m3 [29, 18]. tween the human expert’s assessments and those of the LLM was then analyzed. The complete details of the final The detailed results (presented in table 2) show the prompt used for LLM evaluator are provided in Appendix achieved recall scores on EBA Q&As Test Dataset for C.3. each model. Our fine-tuned CRR Ranker significantly

The dataset comprises 60 Q&A pairs, balanced across outperformed all other models, achieving a more than the four score categories. For each category, two pairs 20% improvement compared to the best pre-trained model were excluded as they were used as examples for the (bge-large-en-v1.5). prompt for the LLM evaluator, resulting in a final dataset of 52 Q&A pairs to measure the alignment between the 4.2. Answer Generation human and LLM evaluator. Using GPT-4o, we obtained a Kendall-tau coeficient of 0.77, with a p-value of 6· 10− 11. Here we compare the performance of our multi-step apThese results justified the adoption of the LLM evaluator proach with a zero-shot one for answering EBA liquidity • Score 1: The generated answer is completely incorrect and incomplete compared to the oficial answer. • Score 2: The generated answer is incorrect but either complete or partially complete compared to the oficial answer . It contains some useful information found in the oficial answer , but the main statement is incorrect. • Score 3: The generated answer is correct but only partially complete. The main statement matches the oficial answer , but some information from the oficial answer is missing. • Score 4: The generated answer is fully correct and complete. It is essentially a rephrased version of the oficial answer with no significant diferences.

We employed “recall” as the primary metric to assess the efectiveness of bi and cross encoder models in retrieving relevant CRR articles based on the information submitted with the inquiry. “Recall” signifies the proportion of truly relevant CRR articles retrieved from the dataset compared to all the pertinent actual articles [26]. In the context of legal information retrieval, prioritizing the retrieval of all crucial regulatory information for the inquiry makes the recall a particularly relevant metric.

Our primary objective was to identify a model that delivers exceptional retrieval accuracy while maintaining computational eficiency. This potentially excluded models with an extremely large number of parameters, as they can be computationally expensive to run.

We conducted a performance comparison between our ifne-tuned CRR Ranker and several pre-trained models: 0.37 0.39 0.41 0.17 0.24 0.51 0.46 0.48 0.52 0.23 0.31 0.67 0.55 0.57 0.62 0.31 0.39 0.81 0.59 0.63 0.67 0.38 0.44 0.86 risk inquiries, using our LLM as the evaluation system (Figure in Appendix B.3). To this end, we utilized a subset of 46 Q&As from our EBA Q&A Test dataset specifically focused on liquidity risk.

We tested:

In this section, we extend our analysis of the multi-step pipeline by incorporating evaluations using additional large language models (LLMs), specifically Google Gemini Flash 1.5 and Llama 3.1 70B. Google Gemini Flash 1.5 is widely recognized for its high-speed processing • Zero-Shot Approach: for each question, a stan- capabilities and eficiency in response generation, makdard prompt was provided to the LLM. It encom- ing it a suitable benchmark for comparative performance passed both the specific query and any relevant analysis. Conversely, Llama 3.1 70B is noted for its rocontextual information they provided. bustness in handling complex queries while maintaining • Few-Shot Approach: for each question, a few moderate computational demands, providing an interexamples were provided along with the query to esting contrast in terms of performance and resource guide the LLM in generating responses. eficiency. • Multi-Step Approach: for each question, we Our experimental results indicate that the average evalcreated prompts following our established multi- uation score achieved by Google Gemini Flash 1.5 was 2.0, step approach, incorporating context enrichment whereas Llama 3.1 70B attained an average score of 2.2. and example enrichment (as detailed in previous Notably, these scores did not surpass the performance sections). of the GPT-4o zero-shot approach, which underscores The LLM Evaluator assessed each response based on the advanced capabilities of GPT-4o in addressing the its correctness and completeness relative to the oficial complexities of regulatory compliance inquiries. This obEBA response. As described in Section 3, the LLM Evalu- servation highlights the inherent strength of GPT-4o in ator assigned an overall score on a scale of 1 (completely generating accurate and contextually relevant responses, incorrect and incomplete) to 4 (fully correct and compre- outperforming the other models under similar conditions. hensive). Future research will focus on an in-depth analysis of

Table 3 summarizes the evaluation results for re- these models with a view toward optimizing each step sponses generated by the diferent approaches. The of the multi-step pipeline in a model-specific manner. By “multi-step” approach consistently achieved higher tailoring our methodology to align with the distinctive counts in the high-quality rating categories compared to strengths and limitations of each model, we aim to furboth the “zero-shot” and “few-shot” ones. This demon- ther enhance the overall accuracy and reliability of the strates that the multi-step approach significantly outper- generated responses. formed the other methods in terms of response quality.

The LLM evaluator awarded the multi-step approach an average score of 2.7, representing a 12.5% improvement 5. Challenges and Advancements over the zero-shot and few-shot approaches, which both received an average score of 2.4. Notably, a larger portion Our work has highlighted several key challenges that are of the responses generated by our multi-step approach worth discussing. One of the primary issues concerns received scores of 3 or higher, indicating correct answers. the limited size of our test dataset. This constraint arose In contrast, only 2 out of 46 responses generated by the because we focused on the single topic of Liquidity Risk. multi-step approach were rated as completely incorrect However, to achieve robust human alignment and ensure (score 1), compared to 6 such responses for the zero-shot the system addresses diverse user inquiries across EBA approach and 11 for the few-shot approach. These find- topics, future eforts should prioritize dataset expansion ings suggest that the context enrichment in the multi-step and human evaluation integration. prompts efectively guides the primary LLM toward gen- Another topic for reflection is that the study emphaerating more comprehensive and informative responses sizes the need to retrieve relevant CRR articles. Future that accurately reflect the EBA regulations. research could investigate methods to further refine the generated responses by incorporating legal reasoning domains with complex regulatory frameworks like regand argumentation capabilities into the LLM [30, 31], ulatory reporting. Even at this early stage, the tool has and the most relevant Q&As as examples for few-shot demonstrated its ability to make the work of the human prompting [6]. analyst more eficient. Future research directions include

It is also crucial to underscore the importance of op- exploring the use of diferent LLM architectures and intimizing prompts for this kind of application, and we vestigating alternative methods for incorporating human plan to address this moving forward. Our future research feedback into the prompt construction process. Lastly, endeavors will focus on investigating automatic prompt exploring the generalization of this approach to other engineering techniques [32] leveraging the LLM Evalu- regulatory domains would be valuable. ator as a metric to optimize. These techniques aim to tailor and optimize prompts based on the specific topic of inquiries, enhancing overall performance. Acknowledgments

Moreover, currently we have utilized only one model, GPT-4o, but we intend to extend our testing to include The authors would like to express their sincere gratiother models that have demonstrated similar perfor- tude to Vincenzo Capone, Pamela Maggiori, Daniele Bovi, mance levels in the field of open question answering Fabio Zambuto, Francesca Monacelli, and Roberto Sab[33]. This will help us identify the most efective model batini (Bank of Italy) for their insightful comments and for our application with an unbiased evaluation [34]. stimulating discussions on an earlier draft of this paper.

Similarly, in the context of LLM evaluators, we also in- Their feedback greatly enhanced the clarity and focus of tend to explore additional models, including open-source our work. options [35, 36], that have shown strong performance They would also like to thank the anonymous reviewers in assessing the quality of responses from various LLMs. for their invaluable suggestions and constructive feedThis approach is expected to increase the correlation back. between human and LLM evaluations, thereby enhancing the system’s overall accuracy and reliability. The scientific community is very active in this area to better understand the limitations of the diferent types of models considered as evaluators [37].

By addressing the identified limitations through increased human involvement, expanded data coverage, and domain-specific evaluation methods, we believe it is possible to enhance the system’s efectiveness and generalizability across a wide range of regulatory domains.

6. Conclusion This study explored a novel approach for generating au

tomated responses to inquiries on the Regulation (EU) N.2013/575, specifically on the liquidity risk subject. We proposed a multi-step prompt construction method that enriches the context to be provided to LLMs, enabling them to generate more accurate and informative answers. An LLM Evaluator, which demonstrated strong agreement with human experts, was employed to compare our multi-step approach with standard zero-shot and fewshot methods that lack context enrichment. The quality of the generated responses was assessed, and our findings indicate that the multi-step approach significantly outperforms both the zero-shot and few-shot methods, resulting in responses that are more comprehensive and accurate in relation to the EBA regulation. These results suggest that the multi-step prompt construction is a promising approach for enhancing LLM performance in legal information retrieval tasks, particularly within

A. Dataset B. Multi-Step Generation and Evalutation B.1. Multi-Step Approach for Answer Generation B.2. LLM evaluator Alignment B.3. Multi-Step vs. Zero-Shot C. Prompt template C.1. Extracting Law References

#task Extract from the text (#text) any reference to regulatory documents contained in it and insert them into a list (e.g. ["regulatory document name": ["article 1","article 2",...]]). I will provide you an example (#text (example)) and the expected output (#output (example)): #text (example) "In accordance with Article 425 (1) of Regulation (EU) No. 575/2013 (CRR) institutions may exempt contractual liquidity inflows from borrowers and bond investors arising from mortgage lending funded by covered bonds eligible for preferential treatment as set out in Article 129b (4-6) of CRR or by bonds as referred to in Article 52(4) of Directive 2009/65/EC from the 75% inflow cap." #output (example) "["Regulation (EU) No. 575/2013 (CRR)": ["425","129b"], "Directive 2009/65/EC" : ["52"]]" #text > text_to_extract #output (list only)

This prompt was used to extract any reference to regulatory documents from the provided text_to_extract ) (placeholder to input text)

C.2. Answer Generation

" #system You are a virtual assistant for the European Banking Authority (EBA), handling user inquiries related to Liquidity Risk regulations. The user’s query specifically pertains to Regulation (EU) No. 575/2013 (CRR) or Delegated Regulation (EU) No. 2015/61 (LCR DA).""" #task Answer the question based on the instructions below. 1. Analyze the User’s Question (#question): - Identify the central topic and relevant keywords related to Liquidity Risk and the specified EBA regulations. 2. Leverage the Provided Context (#context): - Incorporate the context (including CRR articles and additional information) to tailor the answer to the user’s specific scenario. 3. Liquidity Risk Topic: - Reference relevant articles from provided context (#context) that address the specific aspect of Liquidity Risk raised in the question. 4. Desired Answer (#answer): - Use only the information provided in the context and examples (if provided) to answer the question. - Craft a well-reasoned and informative response that covers all aspects of the user’s query. - Clearly articulate the regulatory implications while considering the provided context. - Maintain a professional and informative tone suitable for the EBA.

#examples: Example 1: > example_1 Example 2: > example_2 Example 3: > example_3 Example 4: > example_4 Example 5: > example_5 #question: > question #context: > context

> enhanced_context #answer:

This prompt was used to generate answer given a question and context. #examples section (placeholder to include 5 examples) and enhanced_context (placeholder to include CRR articles), highlighted in yellow, were used only for multi-step approach.

Gpt4-omni Prompt

I will provide you with two answers to a question. One is the #oficial answer, which serves as the benchmark. The other is the #generated answer, which needs to be evaluated against the #oficial answer. You must compare the answers step by step.

Consider the following definitions for this evaluation: - Correctness: A #generated answer is correct if its content aligns with that of the #oficial answer. - Completeness: A #generated answer is complete if it includes all the information present in the #oficial answer.

Your task is to act as an evaluator and rate the #generated answer according to the following scale: RATING 1: The #generated answer is completely incorrect and incomplete compared to the #oficial answer. RATING 2: The #generated answer is incorrect but either complete or partially complete compared to the #oficial answer. It contains some useful information found in the #oficial answer but the main statement is incorrect.

RATING 3: The #generated answer is correct but only partially complete. The main statement matches the #oficial answer, but some information from the #oficial answer is missing.

RATING 4: The #generated answer is fully correct and complete. It is essentially a rephrased version of the #oficial answer with no significant diferences.

Please provide a single numerical rating (1-4) followed by a brief explanation for your rating <EXAMPLE 1> ... <EXAMPLE 8> Compute the score in the following case: #question > question #background > background #oficial answer > answer #generated answer generated answer

18653 /v1/ 2023 .findings-acl. 858 . [31]

an Lu , H. yu Kao , 0x .yuan at semeval-2024

task 5: Enhancing legal argument reasoning with

on Semantic Evaluation , 2024 . URL: https://api.

semanticscholar.org/CorpusID:270765544. [32]

Ye ,

Axmed ,

Pryzant ,

Khani , Prompt

engineering a prompt engineer , 2024 . URL: https:

//arxiv.org/abs/2311.05661. arXiv: 2311 . 05661 . [33]

Huang ,

Wang ,

Xia , P. Liu, Olympicarena

far? , 2024 . URL: https://arxiv.org/abs/2406.16772.

arXiv:2406 . 16772 . [34]

Panickssery ,

S. R.

Bowman ,

Feng , Llm eval-

tions , 2024 . URL: https://arxiv.org/abs/2404.13076.

arXiv:2404 . 13076 . [35]

Kim ,

Suk ,

Longpre ,

B. Y.

Lin ,

Shin ,

Prometheus 2: An open source language model

els , 2024 . URL: https://arxiv.org/abs/2405.01535.

arXiv:2405 . 01535 . [36]

Kim ,

Suk ,

J. Y.

Cho ,

Longpre ,

Kim ,

Yoon ,

models with language models , 2024 . URL: https:

//arxiv.org/abs/2406.05761. arXiv: 2406 . 05761 . [37]

Huang ,

Qu ,

Zhou , J. Liu,

Yang ,

Xu ,

els for llm evaluation , 2024 . URL: https://arxiv.org/

abs/2403 .02839. arXiv: 2403 . 02839 .