A Novel Multi-Step Prompt Approach for LLM-based Q&As on Banking Supervisory Regulation

Daniele Licari1,2,*,†, Canio Benedetto1,†, Praveen Bushipaka2, Alessandro De Gregorio1,†, Marco De Leonardis1,† and Tommaso Cucinotta2

1 Banca d'Italia, Via Nazionale, 91, Rome, 00184, Italy
2 Scuola Superiore Sant'Anna, P.zza dei Martiri della Libertà, 33, Pisa, 56100, Italy

Abstract
This paper investigates the use of large language models (LLMs) in analyzing and answering questions related to banking supervisory regulation concerning reporting obligations. We introduce a multi-step prompt construction method that enhances the context provided to the LLM, resulting in more precise and informative answers. This multi-step approach is compared with standard "zero-shot" and "few-shot" approaches, which lack context enrichment. To assess the quality of the generated responses, we utilize an LLM evaluator. Our findings indicate that the multi-step approach significantly outperforms the zero-shot method, producing more comprehensive and accurate responses.

Keywords: Regulatory Q&A, Banking Supervisory Reporting Regulation, Artificial Intelligence, GenAI, GPT-4o, RAG, LLM Evaluator

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† The views and opinions expressed in this paper are those of the authors and do not necessarily reflect the official policy or position of the Bank of Italy.
Contacts: daniele.licari@bancaditalia.it (D. Licari); canio.benedetto@bancaditalia.it (C. Benedetto); praveen.bushipaka@santannapisa.it (P. Bushipaka); alessandro.degregorio@bancaditalia.it (A. De Gregorio); marco.deleonardis@bancaditalia.it (M. De Leonardis); tommaso.cucinotta@santannapisa.it (T. Cucinotta)
ORCID: 0000-0002-2963-9233 (D. Licari); 0000-0002-8446-9468 (C. Benedetto); 0009-0009-7753-8662 (P. Bushipaka); 0000-0001-7577-3655 (A. De Gregorio); 0009-0004-6523-186X (M. De Leonardis); 0000-0002-0362-0657 (T. Cucinotta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The advent of generative AI (GenAI), and specifically of large language models (LLMs), offers significant opportunities in the legal and financial sectors, among others, facilitating the implementation of innovative solutions across various domains of activity [1, 2, 3, 4, 5]. One of the most promising applications is the business case of supporting the navigation and analysis of complex regulatory documents [6, 7, 8, 9], which can be particularly valuable for compliance officers, legal teams, and other professionals working in financial institutions who need a clear and timely understanding of the regulations and of the obligations derived from them. Supervisory authorities could benefit from a tool that streamlines the consultation of complex legislation, providing swift responses to entities and enhancing efficiency [10]. While LLMs offer advantages for this purpose, they also pose risks such as bias and inaccuracies [11]. It is therefore essential to establish strong verification procedures and retain human supervision to counter these risks. The complexity of regulatory documents, with their dense network of cross-referenced texts and acts and their specialized content, calls for careful analysis to retrieve the needed information, while at the same time ensuring effective risk management and limiting the burden of manual compliance.

This study introduces a novel methodology to automate and expedite the "question & answer" (Q&A) process in regulatory compliance, leveraging advanced large language models (LLMs) to provide accurate and timely responses to inquiries about the European Banking Authority's (EBA) reporting regulations. Our multi-step approach aligns with Retrieval-Augmented Generation (RAG) principles, enhancing context retrieval and generative capabilities through mechanisms such as explicit extraction of Capital Requirements Regulation (CRR) references, implicit reference analysis, and a dedicated cross-encoder for precise regulatory text retrieval. This methodology ensures tailored response generation suited to the complex regulatory compliance context, where precise and comprehensive answers are crucial.

Our work finds particular application within the domain of EBA regulatory reporting because this domain is characterized by a large and complex set of interrelated documents, including delegated and implementing acts, technical standards, guidelines, and recommendations, which cover various aspects of financial entities. Such complexity makes the business case both challenging and rewarding.

In this work, we focus on Regulation (EU) No. 575/2013, also called the Capital Requirements Regulation (CRR, https://eur-lex.europa.eu/legal-content/en/ALL/?uri=celex%3A32013R0575), and specifically on the topic of Liquidity Risk as a first use case to evaluate the potential benefit of enriched context for accurate response generation. The main reason for this choice is that the topic is supported by a relatively limited number of regulatory documents, which makes it a good starting point, since the regulation is not readily available in the form of a structured dataset and its pre-processing is usually a time-consuming task.

We used the actual EBA Q&As dataset [12] as the foundation for developing a system capable of generating automated responses to questions formulated by analysts on EBA reporting requirements and rules. By harnessing the capabilities of LLMs, we aim to create a tool that can deliver accurate and contextually relevant answers to any inquiry on the content of the CRR.

Recent studies highlight the potential of LLMs for qualitative assessment [13, 14, 15, 16]. For this reason, in this work we also propose the use of an "LLM Evaluator" to automate the validation process.

The structure of this paper is the following.
Section 2 introduces the methodology and provides a detailed description of the approach adopted in this study; it explains the dataset utilized and the normative retrieval techniques employed to identify the regulatory documents necessary to address the EBA's Q&As. Section 3 presents the LLM Evaluator and the evaluation criteria. Section 4 reports the experimental results and presents the main outcomes of the study. Section 5 discusses challenges as well as potential areas for future development.

2. Methodology

This research employs a multi-step methodology to construct a comprehensive prompt for the GPT-4 omni (GPT-4o) language model [17], enabling it to answer EBA-related questions effectively. This step-wise approach focuses on enriching the context provided by the user's question. First, it identifies relevant EBA regulations (specifically CRR references) within the inquiry. Second, it incorporates response examples to guide the LLM's output format, ensuring alignment with EBA regulations. This enriched context is then leveraged by a powerful LLM to generate more accurate and informative responses (details in Appendix B.1).

2.1. Dataset Construction

To develop and then evaluate our LLM-based Q&A system, we first extracted a subset of the EBA's Single Rulebook Q&A online resource [12], comprising "question-and-answer" pairs submitted to the EBA between 2013 and 2020. In particular, we focused on the following variables: question ID, topic, legal act, article (within that act), question, background information, final answer, submission date, and status (details in Table 4, Appendix A). Secondly, we implemented a two-step filtering process aimed at ensuring model efficacy: we excluded non-English entries and focused on CRR-related questions within the same timeframe. This resulted in a final dataset of 1,597 CRR-related questions and answers, which was then split into training (50%), validation (10%), and test (40%) sets for robust evaluation (token number distribution in Figure 1, Appendix A). The distribution of samples across the splits is summarized in Table 1.

Table 1
Sample distribution across training, validation, and test sets for CRR-related Q&A and the subset of Liquidity Risk Q&A.

Set          CRR-related Q&A   Liquidity Risk Q&A
Training     798               58
Validation   162               12
Test         637               46

2.2. Context Enrichment

The context enrichment process is a three-step approach designed to identify, within the dataset, the most relevant CRR references so as to provide appropriate content for formulating the answer to the inquiry. The first step simply involves extracting explicit CRR references, if directly mentioned in the question (the Article field in Table 4). The second step leverages the capabilities of GPT-4o (prompt in Appendix C.1) to analyse the "question" and the "background information" in order to identify other CRR references that are not explicitly stated by the user. The last step of the process utilizes our CRR Ranker model, a cross-encoder architecture trained to identify and retrieve pertinent references from the Capital Requirements Regulation in response to specific inquiries. This three-step approach ensures a broader and potentially more accurate understanding of the inquiry and of the specific legal act(s) related to the CRR that the Q&A tool deems applicable.
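To make the three steps concrete, the sketch below shows one possible implementation in Python; it is an illustration rather than the authors' exact code. The regular expression, the crr_articles dictionary, the "path/to/crr-ranker" checkpoint path, and the parsing of the GPT-4o reply are assumptions introduced for the example; the extraction prompt is the one reported in Appendix C.1.

import re
from openai import OpenAI
from sentence_transformers import CrossEncoder

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def explicit_refs(question: str) -> set[str]:
    # Step 1: CRR article numbers mentioned verbatim in the question text.
    return set(re.findall(r"[Aa]rticle\s+(\d+[a-z]?)", question))

def implicit_refs(question: str, background: str, extraction_prompt: str) -> set[str]:
    # Step 2: ask GPT-4o (Appendix C.1 prompt) for references that are not explicit.
    filled = extraction_prompt.replace("text_to_extract", f"{question}\n{background}")
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": filled}],
    )
    # Simplified parsing of the returned list of article numbers.
    return set(re.findall(r'"(\d+[a-z]?)"', reply.choices[0].message.content))

def ranked_refs(question: str, crr_articles: dict[str, str], top_k: int = 10) -> list[str]:
    # Step 3: score every CRR article against the question with the fine-tuned cross-encoder.
    ranker = CrossEncoder("path/to/crr-ranker")  # hypothetical local checkpoint
    ids = list(crr_articles)
    scores = ranker.predict([(question, crr_articles[i]) for i in ids])
    return [i for i, _ in sorted(zip(ids, scores), key=lambda x: -x[1])[:top_k]]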
2.2.1. CRR Ranker Training

For the context enrichment, i.e. the CRR Ranker, we employed a specifically trained cross-encoder model [18] to identify relevant CRR references for enriching the inquiry context. We used a dedicated "question-article" pair dataset derived from our EBA Q&A Train Dataset, excluding questions related to CRR Article 99 (https://www.eba.europa.eu/regulation-and-policy/single-rulebook/interactive-single-rulebook/14212) due to their frequent lack of topical relevance. Each data point consisted of a question (user query and background information), an associated CRR article, and a binary label indicating relevance (1 for relevant, 0 for not applicable).

We constructed the training dataset by selecting positive and negative samples. Positive samples comprised question-article pairs where the article explicitly addressed the user's query. Additionally, we included pairs formed by questions and implicit CRR references extracted from the user's text, context information, and official response using GPT-4o (prompt in Appendix C.1).

Negative training samples were mined using the BAAI bge-large-en-v1.5 pre-trained language model [19]. We employed a two-phase process for negative sample selection: first, all CRR articles were encoded with the bge-large-en-v1.5 model, and cosine similarity was used to rank them relative to the user's question; second, a set of 20 negative examples was randomly chosen from a pre-defined ranking interval (250-300). The choice of 20 negative samples provides a good balance between computational efficiency and the availability of enough training data. This approach aimed to balance the representation of relevant and irrelevant information within the training data, ensuring the model learns to distinguish between the user's query and potentially related but ultimately off-topic CRR articles [20].

The final dataset comprised 12,533 unique "question-article" pairs with positive and negative labels. This data was split into training (10,179 pairs) and development (2,354 pairs) sets for model fine-tuning. The fine-tuning aimed to learn robust semantic representations for questions and CRR articles, enabling the model to effectively identify relevant CRR references for enriching the user query context. We selected the BAAI BGE Reranker v2 m3 model [18] as the basis for our cross-encoder, owing to its task-specific aptness and its superior performance relative to the BGE Reranker Large [19], as reported in Section 4.
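A minimal sketch of the negative-mining procedure described above, assuming the sentence-transformers API and a crr_articles dictionary mapping article identifiers to their text (both the dictionary and the function name are illustrative, not the authors' code):

import random
from sentence_transformers import SentenceTransformer, util

crr_articles: dict[str, str] = {}  # assumed to be filled with {article_id: article_text}

# Encode all CRR articles once with the bi-encoder used for mining.
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
article_ids = list(crr_articles)
article_emb = encoder.encode([crr_articles[a] for a in article_ids],
                             convert_to_tensor=True, normalize_embeddings=True)

def mine_negatives(question: str, n_neg: int = 20, interval: tuple = (250, 300)) -> list[str]:
    """Rank articles by cosine similarity and sample negatives from ranks 250-300."""
    q_emb = encoder.encode(question, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, article_emb)[0]          # similarity to every article
    ranking = scores.argsort(descending=True).tolist()    # article indices, best match first
    window = ranking[interval[0]:interval[1]]             # related-but-off-topic region
    return [article_ids[i] for i in random.sample(window, n_neg)]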
We adopted a binary cross-entropy classification loss, following the approach suggested in the BGE Reranker Git repository [21]. To promote stable convergence, we incorporated a warmup schedule (with a number of warmup steps equal to 0.1 × len(train_data) × num_epochs) that gradually increases the learning rate during the initial phase of training. The entire fine-tuning process was conducted over 4 epochs. We employed an evaluation interval of 800 steps during training and saved the model that achieved the highest F1 score on the development set.

Finally, we evaluated the model's ability to retrieve CRR items for a given user question on the EBA Q&A Test Dataset. This evaluation employed recall metrics at various retrieval cutoffs, including recall@5, recall@10, recall@20, and recall@30 (results in Section 4).
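The following sketch shows how such a fine-tuning run could be set up with the sentence-transformers CrossEncoder API (binary cross-entropy is the default loss when num_labels=1). The variable names, the batch size, and the use of CEBinaryClassificationEvaluator are assumptions made for illustration; the warmup term follows the 10%-of-training-steps idea described above, here computed from the number of batches.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator

# train_pairs / dev_pairs: lists of (question, article_text, label) tuples with label in {0, 1}
train_samples = [InputExample(texts=[q, a], label=float(y)) for q, a, y in train_pairs]
dev_samples = [InputExample(texts=[q, a], label=int(y)) for q, a, y in dev_pairs]

model = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)   # binary relevance head
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)  # batch size assumed

num_epochs = 4
warmup_steps = int(0.1 * len(train_dataloader) * num_epochs)    # ~10% of the training steps

model.fit(
    train_dataloader=train_dataloader,
    evaluator=CEBinaryClassificationEvaluator.from_input_examples(dev_samples),
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    evaluation_steps=800,        # evaluate every 800 training steps
    output_path="crr-ranker",    # the best-scoring checkpoint on the dev set is kept
    save_best_model=True,
)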
2.3. Examples Enrichment

To improve the model's understanding of the desired response format, tone, and content, we adopted a few-shot prompting approach [22]. This involved extracting five relevant examples from the EBA Q&A Train Dataset with the same topic as the user question to be answered. These examples served as demonstrations for the model, showcasing the ideal structure, language style, and level of detail expected in the final responses. Notably, the selection process ensured heterogeneity within the chosen topic, meaning the examples covered various aspects to promote a broader understanding. Limiting the number of examples to five struck a balance between providing diverse demonstrations and maintaining cost-efficiency during inference, as the LLM's input token length is limited.

2.4. Answer Generation

Figure 2 in Appendix B.1 details how we construct a comprehensive prompt that enhances GPT-4o's ability to effectively answer user questions. The final prompt in Appendix C.2 integrates the enriched context (extracted CRR references) and the example enrichment (demonstrations of the desired response format, tone, and content). This comprehensive prompt is fed to GPT-4o through the OpenAI API, enabling it to generate a well-reasoned and informative response that adheres to the EBA's regulatory framework and professional tone.

2.5. Comparison with RAG Principles

Our multi-step prompt approach aligns with the core principles of Retrieval-Augmented Generation (RAG) while incorporating tailored enhancements that improve context enrichment for regulatory Q&A tasks. Like RAG, our method integrates information retrieval with language generation, but it adds specialized steps to enhance context enrichment. These include explicit extraction of CRR references, implicit reference analysis using LLM capabilities, and precise retrieval through a dedicated cross-encoder. Compared to standard RAG, which often relies on single-stage retrieval, our structured multi-step process adds a higher level of granularity, including example enrichment through few-shot prompts. This ensures not only factual accuracy but also alignment with domain-specific language standards, ultimately improving response quality for complex regulatory inquiries. Overall, our approach extends the RAG principles to generate tailored, contextually enriched answers, which is particularly beneficial for the intricate requirements of regulatory compliance.

3. LLM Evaluator

In our pipeline, we employ an LLM Evaluator to assess the quality of the generated responses, defined in Section 2, against the answers already provided by the EBA. Employing an LLM Evaluator offers significant advantages in terms of cost-effectiveness and efficiency compared to traditional human evaluation and comparison methods. Recent research highlights the potential of LLMs for large-scale natural language evaluation tasks [23, 24, 25].

The evaluation process uses a scale from one to four, based on two evaluation criteria: correctness and completeness. A generated response is considered correct if its content aligns with the information presented in the official answer. Additionally, a response is deemed complete if it incorporates all relevant regulatory references provided in the official answer. The following scoring rubric outlines the evaluation criteria:

• Score 1: The generated answer is completely incorrect and incomplete compared to the official answer.
• Score 2: The generated answer is incorrect but either complete or partially complete compared to the official answer. It contains some useful information found in the official answer, but the main statement is incorrect.
• Score 3: The generated answer is correct but only partially complete. The main statement matches the official answer, but some information from the official answer is missing.
• Score 4: The generated answer is fully correct and complete. It is essentially a rephrased version of the official answer with no significant differences.

To preliminarily validate the effectiveness of our LLM Evaluator, we conducted an experiment using a synthetic dataset. This dataset was carefully designed to test various aspects of language generation and was evaluated by both a human expert and the LLM. The alignment between the human expert's assessments and those of the LLM was then analyzed. The complete details of the final prompt used for the LLM Evaluator are provided in Appendix C.3.

The dataset comprises 60 Q&A pairs, balanced across the four score categories. For each category, two pairs were excluded because they were used as examples in the prompt for the LLM Evaluator, resulting in a final dataset of 52 Q&A pairs to measure the alignment between the human and the LLM evaluator. Using GPT-4o, we obtained a Kendall-tau coefficient of 0.77, with a p-value of 6·10⁻¹¹. These results justified the adoption of the LLM evaluator over a human one, especially for tasks involving prompt optimization and evaluation. The figure in Appendix B.2 illustrates the complete process of evaluating agreement between the LLM evaluator and the human expert.

4. Experiments and Results

This section describes the results obtained by measuring retrieval effectiveness and answer quality. Retrieval performance is measured by the number of relevant regulations retrieved (recall) using different encoder models. Answer quality is then evaluated by a separate LLM, which scores each generated response based on factors such as relevance and adherence to EBA legal acts. We compare the multi-step prompt approach with few-shot and zero-shot ones, focusing on a single topic within the EBA Q&A framework, namely Liquidity Risk. Finally, we test our multi-step pipeline with other LLM models, namely Google Gemini Flash 1.5 and Llama 3.1 70B.

4.1. CRR Retrieval

We employed recall as the primary metric to assess the effectiveness of bi-encoder and cross-encoder models in retrieving relevant CRR articles based on the information submitted with the inquiry. Recall signifies the proportion of truly relevant CRR articles retrieved from the dataset compared to all the pertinent articles [26]. In the context of legal information retrieval, prioritizing the retrieval of all crucial regulatory information for the inquiry makes recall a particularly relevant metric.

Our primary objective was to identify a model that delivers high retrieval accuracy while maintaining computational efficiency. This potentially excluded models with an extremely large number of parameters, as they can be computationally expensive to run. We conducted a performance comparison between our fine-tuned CRR Ranker and several pre-trained models:

• Bi-encoders: all-MiniLM-L6-v2 [27], gte-large-en-v1.5 [28], and bge-large-en-v1.5 [19].
• Cross-encoders: bge-reranker-large [19], bge-reranker-v2-m3 [29, 18].

The detailed results (presented in Table 2) show the recall scores achieved by each model on the EBA Q&As Test Dataset. Our fine-tuned CRR Ranker significantly outperformed all other models, achieving a more than 20% improvement over the best pre-trained model (bge-large-en-v1.5).

Table 2
Recall scores on the EBA Q&As Test Dataset.

Models               r@5    r@10   r@20   r@30
all-MiniLM           0.37   0.46   0.55   0.59
gte-large            0.39   0.48   0.57   0.63
bge-large            0.41   0.52   0.62   0.67
bge-reranker-large   0.17   0.23   0.31   0.38
bge-reranker-v2-m3   0.24   0.31   0.39   0.44
CRR Ranker (ours)    0.51   0.67   0.81   0.86
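For clarity, recall@k as reported in Table 2 can be computed as in the short sketch below; test_queries and its field names are hypothetical placeholders, not part of the released pipeline.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the truly relevant CRR articles found among the top-k retrieved ones."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall(queries: list[dict], k: int) -> float:
    # queries: [{"retrieved": [ranked article ids], "relevant": {gold article ids}}, ...]
    return sum(recall_at_k(q["retrieved"], q["relevant"], k) for q in queries) / len(queries)

# Example with the cutoffs used in Table 2:
# for k in (5, 10, 20, 30):
#     print(k, mean_recall(test_queries, k))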
4.2. Answer Generation

Here we compare the performance of our multi-step approach with zero-shot and few-shot ones for answering EBA liquidity risk inquiries, using our LLM Evaluator as the evaluation system (figure in Appendix B.3). To this end, we utilized a subset of 46 Q&As from our EBA Q&A Test Dataset specifically focused on liquidity risk. We tested:

• Zero-Shot Approach: for each question, a standard prompt was provided to the LLM, encompassing both the specific query and any relevant contextual information provided with it.
• Few-Shot Approach: for each question, a few examples were provided along with the query to guide the LLM in generating responses.
• Multi-Step Approach: for each question, we created prompts following our established multi-step approach, incorporating context enrichment and example enrichment (as detailed in the previous sections).

The LLM Evaluator assessed each response based on its correctness and completeness relative to the official EBA response. As described in Section 3, the LLM Evaluator assigned an overall score on a scale of 1 (completely incorrect and incomplete) to 4 (fully correct and comprehensive).

Table 3 summarizes the evaluation results for the responses generated by the different approaches. The multi-step approach consistently achieved higher counts in the high-quality rating categories compared to both the zero-shot and few-shot ones, demonstrating that it significantly outperformed the other methods in terms of response quality. The LLM Evaluator awarded the multi-step approach an average score of 2.7, representing a 12.5% improvement over the zero-shot and few-shot approaches, which both received an average score of 2.4. Notably, a larger portion of the responses generated by our multi-step approach received scores of 3 or higher, indicating correct answers. In contrast, only 2 out of 46 responses generated by the multi-step approach were rated as completely incorrect (score 1), compared to 6 such responses for the zero-shot approach and 11 for the few-shot approach. These findings suggest that the context enrichment in the multi-step prompts effectively guides the primary LLM toward generating more comprehensive and informative responses that accurately reflect the EBA regulations.

Table 3
Evaluation results for responses generated by the zero-shot, few-shot, and multi-step approaches (GPT-4o).

Rating   zero-shot   few-shot   multi-step
1        6           12         2
2        18          11         14
3        19          16         26
4        3           7          4

4.2.1. Other LLMs

In this section, we extend our analysis of the multi-step pipeline by incorporating evaluations using additional large language models (LLMs), specifically Google Gemini Flash 1.5 and Llama 3.1 70B. Google Gemini Flash 1.5 is widely recognized for its high-speed processing capabilities and efficiency in response generation, making it a suitable benchmark for comparative performance analysis. Conversely, Llama 3.1 70B is noted for its robustness in handling complex queries while maintaining moderate computational demands, providing an interesting contrast in terms of performance and resource efficiency.

Our experimental results indicate that the average evaluation score achieved by Google Gemini Flash 1.5 was 2.0, whereas Llama 3.1 70B attained an average score of 2.2. Notably, these scores did not surpass the performance of the GPT-4o zero-shot approach, which underscores the advanced capabilities of GPT-4o in addressing the complexities of regulatory compliance inquiries. This observation highlights the inherent strength of GPT-4o in generating accurate and contextually relevant responses, outperforming the other models under similar conditions.

Future research will focus on an in-depth analysis of these models with a view toward optimizing each step of the multi-step pipeline in a model-specific manner. By tailoring our methodology to align with the distinctive strengths and limitations of each model, we aim to further enhance the overall accuracy and reliability of the generated responses.
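As an illustration of how the aggregate figures above follow from Table 3, and of how the human-LLM agreement of Section 3 can be measured, consider the following sketch; kendalltau comes from SciPy, and the rating lists in the comment are placeholders rather than released data.

from scipy.stats import kendalltau

def average_rating(counts: dict[int, int]) -> float:
    """Average of the 1-4 evaluator ratings from a count table like Table 3."""
    return sum(score * n for score, n in counts.items()) / sum(counts.values())

multi_step = {1: 2, 2: 14, 3: 26, 4: 4}        # multi-step column of Table 3
print(round(average_rating(multi_step), 1))     # ~2.7, as reported in Section 4.2

# Agreement between human and LLM ratings on the 52 validation Q&A pairs (Section 3):
# human_scores and llm_scores are parallel lists of 1-4 ratings.
# tau, p_value = kendalltau(human_scores, llm_scores)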
5. Challenges and Advancements

Our work has highlighted several key challenges that are worth discussing. One of the primary issues concerns the limited size of our test dataset. This constraint arose because we focused on the single topic of Liquidity Risk. However, to achieve robust human alignment and to ensure the system addresses diverse user inquiries across EBA topics, future efforts should prioritize dataset expansion and the integration of human evaluation.

Another point for reflection is that the study emphasizes the need to retrieve relevant CRR articles. Future research could investigate methods to further refine the generated responses by incorporating legal reasoning and argumentation capabilities into the LLM [30, 31], and by using the most relevant Q&As as examples for few-shot prompting [6].

It is also crucial to underscore the importance of optimizing prompts for this kind of application, and we plan to address this moving forward. Our future research will focus on investigating automatic prompt engineering techniques [32], leveraging the LLM Evaluator as the metric to optimize. These techniques aim to tailor and optimize prompts based on the specific topic of the inquiries, enhancing overall performance.

Moreover, we have currently tested only one model, GPT-4o, but we intend to extend our testing to other models that have demonstrated similar performance levels in the field of open question answering [33]. This will help us identify the most effective model for our application with an unbiased evaluation [34]. Similarly, in the context of LLM evaluators, we also intend to explore additional models, including open-source options [35, 36], that have shown strong performance in assessing the quality of responses from various LLMs. This approach is expected to increase the correlation between human and LLM evaluations, thereby enhancing the system's overall accuracy and reliability. The scientific community is very active in this area, working to better understand the limitations of the different types of models considered as evaluators [37].

By addressing the identified limitations through increased human involvement, expanded data coverage, and domain-specific evaluation methods, we believe it is possible to enhance the system's effectiveness and generalizability across a wide range of regulatory domains.

6. Conclusion

This study explored a novel approach for generating automated responses to inquiries on Regulation (EU) No. 575/2013, specifically on the liquidity risk subject. We proposed a multi-step prompt construction method that enriches the context provided to LLMs, enabling them to generate more accurate and informative answers. An LLM Evaluator, which demonstrated strong agreement with human experts, was employed to compare our multi-step approach with standard zero-shot and few-shot methods that lack context enrichment. The quality of the generated responses was assessed, and our findings indicate that the multi-step approach significantly outperforms both the zero-shot and few-shot methods, resulting in responses that are more comprehensive and accurate with respect to the EBA regulation. These results suggest that multi-step prompt construction is a promising approach for enhancing LLM performance in legal information retrieval tasks, particularly within domains with complex regulatory frameworks such as regulatory reporting. Even at this early stage, the tool has demonstrated its ability to make the work of the human analyst more efficient. Future research directions include exploring the use of different LLM architectures and investigating alternative methods for incorporating human feedback into the prompt construction process. Lastly, exploring the generalization of this approach to other regulatory domains would be valuable.

Acknowledgments

The authors would like to express their sincere gratitude to Vincenzo Capone, Pamela Maggiori, Daniele Bovi, Fabio Zambuto, Francesca Monacelli, and Roberto Sabbatini (Bank of Italy) for their insightful comments and stimulating discussions on an earlier draft of this paper. Their feedback greatly enhanced the clarity and focus of our work. They would also like to thank the anonymous reviewers for their invaluable suggestions and constructive feedback.
References

[1] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, BloombergGPT: A Large Language Model for Finance, 2023. URL: http://arxiv.org/abs/2303.17564. arXiv:2303.17564 [cs, q-fin].
[2] J. Lai, W. Gan, J. Wu, Z. Qi, P. S. Yu, Large Language Models in Law: A Survey, 2023. URL: http://arxiv.org/abs/2312.03718. doi:10.48550/arXiv.2312.03718. arXiv:2312.03718 [cs].
[3] C. Biancotti, C. Camassa, Loquacity and Visible Emotion: ChatGPT as a Policy Advisor, 2023. URL: https://papers.ssrn.com/abstract=4533699. doi:10.2139/ssrn.4533699.
[4] J. J. Horton, Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?, 2023. URL: https://arxiv.org/abs/2301.07543v1.
[5] P. Homoki, Z. Ződi, Large language models and their possible uses in law, Hungarian Journal of Legal Studies 64 (2024) 435-455. URL: https://akjournals.com/view/journals/2052/64/3/article-p435.xml. doi:10.1556/2052.2023.00475.
[6] N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, B. Fleisch, CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering, 2024. URL: https://arxiv.org/abs/2404.04302. arXiv:2404.04302.
[7] A. Louis, G. van Dijck, G. Spanakis, Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models, 2023. URL: http://arxiv.org/abs/2309.17050. doi:10.48550/arXiv.2309.17050. arXiv:2309.17050 [cs].
[8] W. Zhang, H. Shen, T. Lei, Q. Wang, D. Peng, X. Wang, GLQA: A Generation-based Method for Legal Question Answering, in: 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 1-8. URL: https://ieeexplore.ieee.org/document/10191483. doi:10.1109/IJCNN54540.2023.10191483.
[9] A. Abdallah, B. Piryani, A. Jatowt, Exploring the state of the art in legal QA systems, Journal of Big Data 10 (2023) 127. URL: https://doi.org/10.1186/s40537-023-00802-8. doi:10.1186/s40537-023-00802-8.
[10] J. Prenio, Peering through the hype - assessing suptech tools' transition from experimentation to supervision, 2024. URL: https://www.bis.org/fsi/publ/insights58.htm.
[11] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. URL: https://arxiv.org/abs/2311.05232. arXiv:2311.05232.
[12] Single Rulebook Q&A | European Banking Authority, 2013-2024. URL: https://www.eba.europa.eu/single-rule-book-qa.
[13] S. Ye, D. Kim, S. Kim, H. Hwang, S. Kim, Y. Jo, J. Thorne, J. Kim, M. Seo, FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, 2024. URL: http://arxiv.org/abs/2307.10928. doi:10.48550/arXiv.2307.10928. arXiv:2307.10928 [cs].
[14] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023. URL: http://arxiv.org/abs/2306.05685. doi:10.48550/arXiv.2306.05685. arXiv:2306.05685 [cs].
[15] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 2511-2522. URL: https://aclanthology.org/2023.emnlp-main.153. doi:10.18653/v1/2023.emnlp-main.153.
[16] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, Z. Liu, ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, 2023. URL: http://arxiv.org/abs/2308.07201. doi:10.48550/arXiv.2308.07201. arXiv:2308.07201 [cs].
[17] OpenAI, J. Achiam, S. Adler, S. Agarwal, et al., GPT-4 Technical Report, 2024. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774. arXiv:2303.08774 [cs].
[18] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. arXiv:2402.03216.
[19] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-Pack: Packaged resources to advance general Chinese embedding, 2023. arXiv:2309.07597.
[20] H. Xuan, A. Stylianou, X. Liu, R. Pless, Hard negative examples are hard, but useful, 2021. URL: https://arxiv.org/abs/2007.12749. arXiv:2007.12749.
[21] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, FlagEmbedding/FlagEmbedding/reranker at master · FlagOpen/FlagEmbedding, 2024. URL: https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker.
[22] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
[23] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023. URL: https://arxiv.org/abs/2303.16634. arXiv:2303.16634.
[24] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, T. B. Hashimoto, AlpacaFarm: A simulation framework for methods that learn from human feedback, 2024. URL: https://arxiv.org/abs/2305.14387. arXiv:2305.14387.
[25] J. Fu, S.-K. Ng, Z. Jiang, P. Liu, GPTScore: Evaluate as you desire, 2023. URL: https://arxiv.org/abs/2302.04166. arXiv:2302.04166.
[26] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, USA, 2008.
[27] P. S. H. Lewis, Y. Wu, L. Liu, P. Minervini, H. Küttler, A. Piktus, P. Stenetorp, S. Riedel, PAQ: 65 million probably-asked questions and what you can do with them, CoRR abs/2102.07033 (2021). URL: https://arxiv.org/abs/2102.07033. arXiv:2102.07033.
[28] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings with multi-stage contrastive learning, arXiv preprint arXiv:2308.03281 (2023).
[29] C. Li, Z. Liu, S. Xiao, Y. Shao, Making large language models a better foundation for dense retrieval, 2023. arXiv:2312.15503.
[30] F. Yu, L. Quartey, F. Schilder, Exploring the effectiveness of prompt engineering for legal reasoning tasks, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 13582-13596. URL: https://aclanthology.org/2023.findings-acl.858. doi:10.18653/v1/2023.findings-acl.858.
[31] Y.-A. Lu, H.-Y. Kao, 0x.Yuan at SemEval-2024 Task 5: Enhancing legal argument reasoning with structured prompts, in: International Workshop on Semantic Evaluation, 2024. URL: https://api.semanticscholar.org/CorpusID:270765544.
[32] Q. Ye, M. Axmed, R. Pryzant, F. Khani, Prompt engineering a prompt engineer, 2024. URL: https://arxiv.org/abs/2311.05661. arXiv:2311.05661.
[33] Z. Huang, Z. Wang, S. Xia, P. Liu, OlympicArena medal ranks: Who is the most intelligent AI so far?, 2024. URL: https://arxiv.org/abs/2406.16772. arXiv:2406.16772.
[34] A. Panickssery, S. R. Bowman, S. Feng, LLM evaluators recognize and favor their own generations, 2024. URL: https://arxiv.org/abs/2404.13076. arXiv:2404.13076.
[35] S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, M. Seo, Prometheus 2: An open source language model specialized in evaluating other language models, 2024. URL: https://arxiv.org/abs/2405.01535. arXiv:2405.01535.
[36] S. Kim, J. Suk, J. Y. Cho, S. Longpre, C. Kim, D. Yoon, G. Son, Y. Cho, S. Shafayat, J. Baek, S. H. Park, H. Hwang, J. Jo, H. Cho, H. Shin, S. Lee, H. Oh, N. Lee, N. Ho, S. J. Joo, M. Ko, Y. Lee, H. Chae, J. Shin, J. Jang, S. Ye, B. Y. Lin, S. Welleck, G. Neubig, M. Lee, K. Lee, M. Seo, The BiGGen Bench: A principled benchmark for fine-grained evaluation of language models with language models, 2024. URL: https://arxiv.org/abs/2406.05761. arXiv:2406.05761.
[37] H. Huang, Y. Qu, H. Zhou, J. Liu, M. Yang, B. Xu, T. Zhao, On the limitations of fine-tuned judge models for LLM evaluation, 2024. URL: https://arxiv.org/abs/2403.02839. arXiv:2403.02839.

A. Dataset

Table 4
EBA Q&As dataset. For this research, we focused on the fields highlighted in yellow.

Question ID: The unique identifier for each question.
Topic: The general topic or category under which the question falls.
Subject matter: The specific subject matter of the question.
Legal act: The specific legal act to which the question relates (e.g., CRR).
Article: The specific article of the legal act to which the question relates.
COM Delegated or Implementing Acts/RTS/ITS/GLs/Recommendations: Other legislation, standards, guidelines or recommendations to which the question relates.
Article/Paragraph: The specific article or paragraph within the above-mentioned documents.
Question: The actual question asked.
Background on the question: Any additional information or context provided by the question submitter.
Final answer: The official answer provided to the question.
Submission date: The date when the question was submitted.
Final publishing date: The date when the final answer to the question was published.
Status: The current status of the question (e.g. Final, Rejected, etc.).
Type of submitter: The type of entity that submitted the question (e.g. credit institution, investment firm, etc.).
Answer prepared by: The entity that prepared the answer to the question.

Figure 1: Distribution of tokens among Questions, Background, and Answers in the datasets and splits.

B. Multi-Step Generation and Evaluation
B.1. Multi-Step Approach for Answer Generation

Figure 2: Multi-Step Approach for Answer Generation.

B.2. LLM Evaluator Alignment

Figure 3: Evaluating alignment between the LLM evaluator and the human expert.

B.3. Multi-Step vs. Zero-Shot

Figure 4: Multi-Step vs. Zero-Shot Approach for EBA Liquidity Risk Inquiries.

C. Prompt Templates

C.1. Extracting Law References

GPT-4 omni prompt:

#task
Extract from the text (#text) any reference to regulatory documents contained in it and insert them into a list (e.g. ["regulatory document name": ["article 1","article 2",...]]). I will provide you an example (#text (example)) and the expected output (#output (example)):

#text (example)
"In accordance with Article 425 (1) of Regulation (EU) No. 575/2013 (CRR) institutions may exempt contractual liquidity inflows from borrowers and bond investors arising from mortgage lending funded by covered bonds eligible for preferential treatment as set out in Article 129b (4-6) of CRR or by bonds as referred to in Article 52(4) of Directive 2009/65/EC from the 75% inflow cap."

#output (example)
"["Regulation (EU) No. 575/2013 (CRR)": ["425","129b"], "Directive 2009/65/EC" : ["52"]]"

#text
> text_to_extract

#output (list only)

This prompt was used to extract any reference to regulatory documents from the provided text_to_extract (placeholder for the input text).

C.2. Answer Generation

GPT-4 omni prompt:

#system
You are a virtual assistant for the European Banking Authority (EBA), handling user inquiries related to Liquidity Risk regulations. The user's query specifically pertains to Regulation (EU) No. 575/2013 (CRR) or Delegated Regulation (EU) No. 2015/61 (LCR DA).

#task
Answer the question based on the instructions below.
1. Analyze the User's Question (#question):
   - Identify the central topic and relevant keywords related to Liquidity Risk and the specified EBA regulations.
2. Leverage the Provided Context (#context):
   - Incorporate the context (including CRR articles and additional information) to tailor the answer to the user's specific scenario.
3. Liquidity Risk Topic:
   - Reference relevant articles from the provided context (#context) that address the specific aspect of Liquidity Risk raised in the question.
4. Desired Answer (#answer):
   - Use only the information provided in the context and examples (if provided) to answer the question.
   - Craft a well-reasoned and informative response that covers all aspects of the user's query.
   - Clearly articulate the regulatory implications while considering the provided context.
   - Maintain a professional and informative tone suitable for the EBA.

#examples:
Example 1: > example_1
Example 2: > example_2
Example 3: > example_3
Example 4: > example_4
Example 5: > example_5

#question:
> question

#context:
> context
> enhanced_context

#answer:

This prompt was used to generate an answer given a question and context. The #examples section (placeholder for the five examples) and enhanced_context (placeholder for the retrieved CRR articles), highlighted in yellow, were used only for the multi-step approach.
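As an illustration (not the authors' exact implementation), the C.2 template could be filled and submitted to GPT-4o through the OpenAI chat completions API roughly as follows; the placeholder-substitution logic, the function name, and the temperature setting are assumptions introduced for this sketch.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def generate_answer(template: str, question: str, context: str,
                    enhanced_context: str, examples: list[str]) -> str:
    """Fill the Appendix C.2 template and query GPT-4o; placeholder names are illustrative."""
    prompt = template
    for i, ex in enumerate(examples, start=1):
        prompt = prompt.replace(f"example_{i}", ex)          # the 5 few-shot demonstrations
    prompt = (prompt.replace("> question", question)
                    .replace("> enhanced_context", enhanced_context)
                    .replace("> context", context))
    reply = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content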
C.3. LLM as Evaluator

GPT-4 omni prompt:

I will provide you with two answers to a question. One is the #official answer, which serves as the benchmark. The other is the #generated answer, which needs to be evaluated against the #official answer. You must compare the answers step by step. Consider the following definitions for this evaluation:
- Correctness: A #generated answer is correct if its content aligns with that of the #official answer.
- Completeness: A #generated answer is complete if it includes all the information present in the #official answer.

Your task is to act as an evaluator and rate the #generated answer according to the following scale:
RATING 1: The #generated answer is completely incorrect and incomplete compared to the #official answer.
RATING 2: The #generated answer is incorrect but either complete or partially complete compared to the #official answer. It contains some useful information found in the #official answer but the main statement is incorrect.
RATING 3: The #generated answer is correct but only partially complete. The main statement matches the #official answer, but some information from the #official answer is missing.
RATING 4: The #generated answer is fully correct and complete. It is essentially a rephrased version of the #official answer with no significant differences.

Please provide a single numerical rating (1-4) followed by a brief explanation for your rating ...

Compute the score in the following case:
#question
> question
#background
> background
#official answer
> answer
#generated answer
> generated answer
Output:

This prompt was used to compare an AI-generated answer (#generated answer) to an official one (#official answer), rating its correctness and completeness and providing an explanation.
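A hedged sketch of how the C.3 prompt might be invoked and its 1-4 rating parsed in practice; here rubric stands for the instruction portion of the prompt above, and the parsing regex and field layout are assumptions, not the authors' released code.

import re
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def llm_evaluate(rubric: str, question: str, background: str,
                 official: str, generated: str) -> int:
    """Fill the Appendix C.3 evaluation prompt, call GPT-4o, and return the parsed rating."""
    prompt = (f"{rubric}\n\n#question\n{question}\n\n#background\n{background}\n\n"
              f"#official answer\n{official}\n\n#generated answer\n{generated}\n\nOutput:")
    reply = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-4]", reply.choices[0].message.content)
    return int(match.group()) if match else 0   # 0 signals an unparsable response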