=Paper= {{Paper |id=Vol-3878/58_main_long |storemode=property |title=A Novel Multi-Step Prompt Approach for LLM-based Q&As on Banking Supervisory Regulation |pdfUrl=https://ceur-ws.org/Vol-3878/58_main_long.pdf |volume=Vol-3878 |authors=Daniele Licari,Canio Benedetto,Praveen Bushipaka,Alessandro De Gregorio,Marco De Leonardis,Tommaso Cucinotta |dblpUrl=https://dblp.org/rec/conf/clic-it/LicariBBGLC24 }} ==A Novel Multi-Step Prompt Approach for LLM-based Q&As on Banking Supervisory Regulation== https://ceur-ws.org/Vol-3878/58_main_long.pdf
                                A Novel Multi-Step Prompt Approach for LLM-based Q&As
                                on Banking Supervisory Regulation
                                Daniele Licari1,2,*,† , Canio Benedetto1,† , Praveen Bushipaka2 , Alessandro De Gregorio1,† ,
                                Marco De Leonardis1,† and Tommaso Cucinotta2
                                1
                                    Banca d’Italia, Via Nazionale, 91, Rome, 00184, Italy
                                2
                                    Scuola Superiore Sant’Anna, P.zza dei Martiri della Libertà, 33, Pisa, 56100, Italy


                                                 Abstract
                                                 This paper investigates the use of large language models (LLMs) in analyzing and answering questions related to banking
                                                 supervisory regulation concerning reporting obligations. We introduce a multi-step prompt construction method that
                                                 enhances the context provided to the LLM, resulting in more precise and informative answers. This multi-step approach
is compared with standard "zero-shot" and "few-shot" approaches, which lack context enrichment. To assess the quality
                                                 of the generated responses, we utilize an LLM evaluator. Our findings indicate that the multi-step approach significantly
                                                 outperforms the zero-shot method, producing more comprehensive and accurate responses.

                                                 Keywords
                                                 Regulatory Q&A, Banking Supervisory Reporting Regulation, Artificial Intelligence, GenAI, GPT-4o, RAG, LLM Evaluator



CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
† The views and opinions expressed in this paper are those of the authors and do not necessarily reflect the official policy or position of the Bank of Italy.
daniele.licari@bancaditalia.it (D. Licari); canio.benedetto@bancaditalia.it (C. Benedetto); praveen.bushipaka@santannapisa.it (P. Bushipaka); alessandro.degregorio@bancaditalia.it (A. De Gregorio); marco.deleonardis@bancaditalia.it (M. De Leonardis); tommaso.cucinotta@santannapisa.it (T. Cucinotta)
ORCID: 0000-0002-2963-9233 (D. Licari); 0000-0002-8446-9468 (C. Benedetto); 0009-0009-7753-8662 (P. Bushipaka); 0000-0001-7577-3655 (A. De Gregorio); 0009-0004-6523-186X (M. De Leonardis); 0000-0002-0362-0657 (T. Cucinotta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The advent of generative AI (GenAI), and specifically of large language models (LLMs), offers significant opportunities, among others, in the legal and financial sector, facilitating the implementation of innovative solutions across various domains of activity [1, 2, 3, 4, 5]. One of the most promising applications is the business case of supporting the navigation and analysis of complex regulatory documents [6, 7, 8, 9], which can be particularly valuable for compliance officers, legal teams, and other professionals working in financial institutions who need a clear and timely understanding of the regulations and of the obligations derived from them.

Supervisory authorities could benefit from a tool that streamlines the consultation of complex legislation, providing swift responses to entities and enhancing efficiency [10]. While LLMs offer advantages for this purpose, they also pose risks like bias and inaccuracies [11]. Therefore, it is essential to establish strong verification procedures and retain human supervision to counter these risks. The complexity of regulatory documents, with their dense network of cross-referenced texts and specialized content, necessitates careful analysis to retrieve the needed information, ensuring effective risk management while limiting the burden of manual compliance.

This study introduces a novel methodology to automate and expedite the "question & answer" (Q&A) process in regulatory compliance, leveraging advanced large language models (LLMs) to provide accurate and timely responses to inquiries about the European Banking Authority's (EBA) reporting regulations. Our multi-step approach aligns with Retrieval-Augmented Generation (RAG) principles, enhancing context retrieval and generative capabilities through mechanisms like explicit extraction of Capital Requirements Regulation (CRR) references, implicit reference analysis, and a dedicated cross-encoder for precise regulatory text retrieval. This methodology ensures tailored response generation suited to the complex regulatory compliance context, where precise and comprehensive answers are crucial.

Our work finds particular application within the domain of EBA regulatory reporting because it is characterized by a large and complex set of interrelated documents, including delegated and implementing acts, technical standards, guidelines, and recommendations, which cover various aspects of financial entities. Such complexity makes the business case both challenging and rewarding.

In this work, we focus on Regulation (EU) No 575/2013, also called the Capital Requirements Regulation (CRR), https://eur-lex.europa.eu/legal-content/en/ALL/?uri=
celex%3A32013R0575, specifically on the topic of Liquidity Risk as a first use case to evaluate the potential benefit of enriched context for accurate response generation. The main reason for this choice is that the topic is supported by a relatively limited number of regulatory documents, making it a good starting point, since the regulation is not readily available in the form of a structured dataset and its pre-processing is usually a time-consuming task.

We used the actual EBA Q&As dataset [12] as the foundation for developing a system capable of generating automated responses to questions formulated by analysts on EBA reporting requirements and rules. By harnessing the capabilities of LLMs, we aim to create a tool that can deliver accurate and contextually relevant answers to any inquiry on the content of the CRR.

Recent studies highlight the potential of LLMs for qualitative assessment [13, 14, 15, 16]. For this reason, in this work we also propose the use of an "LLM Evaluator" to automate the validation process.

The structure of this paper is the following. Chapter 2 introduces the methodology and provides a detailed description of the approach adopted in this study; it explains the dataset utilized and the normative retrieval techniques employed to identify the regulatory documents necessary to address the EBA's Q&As. Chapter 3 presents the LLM Evaluator and the evaluation criteria. Chapter 4 reports the experimental results and presents the main outcomes of the study. Chapter 5 discusses challenges as well as potential areas for future developments.

2. Methodology

This research employs a multi-step methodology to construct a comprehensive prompt for the GPT-4 omni (GPT-4o) language model [17], enabling it to answer EBA-related questions effectively. This step-wise approach focuses on enriching the context provided by the user's question. First, it identifies relevant EBA regulations (specifically CRR references) within the inquiry. Second, it incorporates response examples to guide the LLM's output format, ensuring alignment with EBA regulations. This enriched context is then leveraged by a powerful LLM to generate more accurate and informative responses (details in Appendix B.1).

2.1. Dataset Construction

To develop and then evaluate our LLM-based Q&A system, we first extracted a subset from the EBA's Single-rule-book-qa online resource [12], comprising "question-and-answer" pairs submitted to the EBA between 2013 and 2020. In particular, we focused on the following variables: question ID, question, submission date, status, topic, legal act, article [within that act], background information, and final answer (details in Table 4 in the Appendix). Secondly, we implemented a two-step filtering process aimed at ensuring model efficacy: we excluded non-English entries and focused on CRR-related questions within the same timeframe. This resulted in a final dataset of 1597 CRR-related questions and answers, which was then split into training (50%), validation (10%), and test (40%) sets for robust evaluation (token number distribution in Figure 1 in Appendix A). The distribution of samples for the dataset is summarized in Table 1.

Table 1
Sample distribution across training, validation, and test sets for CRR-related Q&A and the subset of only Liquidity Risk Q&A.

      Set        CRR-related Q&A    Liquidity Risk Q&A
   Training            798                  58
  Validation           162                  12
     Test              637                  46

2.2. Context Enrichment

The context enrichment process is a three-step approach designed to identify, within the dataset, the most relevant CRR references to provide appropriate content for formulating the answer to the inquiry. The first step simply involves extracting explicit CRR references, if directly mentioned in the question (Article in Table 4). The second step leverages the capabilities of GPT-4o (prompt in Appendix C.1) to analyse the "question" and the "background information" to identify other CRR references that are not explicitly stated by the user. The last step of the process utilizes our CRR Ranker model, a cross-encoder architecture that has been trained to identify and retrieve pertinent references from the Capital Requirements Regulation in response to specific inquiries. This comprehensive three-step approach ensures a broader and potentially more accurate understanding of the inquiry and of the specific legal act(s) related to the CRR that the Q&A tool deems applicable.

2.2.1. CRR Ranker Training

For the context enrichment, i.e. the CRR Ranker training, we employed a specifically trained cross-encoder model [18] to identify relevant CRR references for enriching the inquiry context. We used a dedicated "question-article" pair dataset derived from our EBA Q&A Train Dataset, excluding questions related to CRR Article 99 https://www.eba.europa.eu/regulation-and-policy/
single-rulebook/interactive-single-rulebook/14212 due to their frequent lack of topical relevance. Each data point consisted of a question (user query and background information), an associated CRR article, and a binary label indicating relevance (1 for relevant, 0 for not applicable).

We constructed the training dataset by selecting positive and negative samples. Positive samples comprised question-article pairs where the article explicitly addressed the user's query. Additionally, we included pairs formed by questions and implicit CRR references extracted from the user's text, context information, and official response using GPT-4o (prompt in Appendix C.1).

Negative training samples were mined using the BAAI bge-large-en-v1.5 pre-trained language model [19]. For the CRR Ranker training we employed a two-phase process for negative sample selection: first, all CRR articles were encoded using the bge-large-en-v1.5 model, and cosine similarity was utilized to rank them relative to the user's question; second, a set of 20 negative examples was randomly chosen from a pre-defined ranking interval (250-300). The choice of 20 negative samples provides a good balance between computational efficiency and the availability of enough training data. This approach aimed to balance the representation of relevant and irrelevant information within the training data, ensuring the model learns to distinguish between the user's query and potentially related but ultimately off-topic CRR articles [20].

The final dataset comprised 12,533 unique "question-article" pairs with positive and negative labels. This data was split into training (10,179 pairs) and development (2,354 pairs) sets for model fine-tuning. This fine-tuning aimed to learn robust semantic representations for questions and CRR articles, enabling the model to effectively identify relevant CRR references for enriching the user query context.

We selected the BAAI BGE Reranker v2 m3 model [18] as the basis for our cross-encoder, owing to its task-specific aptness and its demonstrated superior performance relative to the BGE Reranker Large [19], as reported in Section 4. We adopted the Cross-Entropy Binary Classification loss function, following the approach suggested in the BGE Rerank Git repository [21]. To promote stable convergence, we incorporated a warmup schedule (with 0.1 × len(train_data) × num_epochs warmup steps) that gradually increases the learning rate during the initial phase of training. The entire fine-tuning process was conducted over 4 epochs. We employed an evaluation interval of 800 steps during training and saved the model that achieved the highest F1 score on the development set.

Finally, we evaluated the model's ability to retrieve CRR items for a given user question on the EBA Q&A Test Dataset. This evaluation employed recall metrics at various retrieval cutoffs, including recall@5, recall@10, recall@20, and recall@30 (results in Section 4).

2.3. Examples Enrichment

To improve the model's understanding of the desired response format, tone, and content, we adopted a few-shot prompting approach [22]. This involved extracting five relevant examples from the EBA Q&A Train Dataset with the same topic as the user question we want to answer. These examples served as demonstrations for the model, showcasing the ideal structure, language style, and level of detail expected in the final responses. Notably, the selection process ensured heterogeneity within the chosen topic, meaning the examples covered various aspects to promote a broader understanding. Limiting the number of examples to five struck a balance between providing diverse demonstrations and maintaining cost-efficiency during inference, as the LLM's input token length has limitations.

2.4. Answer Generation

Figure 2 in Appendix B.1 details how we construct a comprehensive prompt that enhances GPT-4o's ability to effectively answer user questions. The final prompt in Appendix C.2 integrates the enriched context (extracted CRR references) and the example enrichment (demonstrations of desired response format, tone, and content). This comprehensive prompt is fed to GPT-4o through the OpenAI API, enabling it to generate a well-reasoned and informative response that adheres to the EBA's regulatory framework and professional tone.

2.5. Comparison with RAG Principles

Our multi-step prompt approach aligns with the core principles of Retrieval-Augmented Generation (RAG) while incorporating tailored enhancements that improve context enrichment for regulatory Q&A tasks. Like RAG, our method integrates information retrieval with language generation, but it adds specialized steps to enhance context enrichment. These include explicit extraction of CRR references, implicit analysis using LLM capabilities, and precise retrieval through a dedicated cross-encoder. Compared to standard RAG, which often relies on single-stage retrieval, our structured multi-step process adds a higher level of granularity, including example enrichment through few-shot prompts. This ensures not only factual accuracy but also alignment with domain-specific language standards, ultimately improving response quality for complex regulatory inquiries. Overall, our approach extends the RAG principles to generate tailored, contextually enriched answers, which is particularly beneficial for the intricate requirements of regulatory compliance.

over a human one, especially for tasks involving prompt optimization and evaluation. The figure in Appendix B.2 illustrates the complete process of evaluating agreement between the LLM evaluator and the human expert.
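The multi-step prompt construction described in Sections 2.2–2.4 can be sketched as follows. This is our own simplified illustration, not the authors' implementation (their actual prompts are in Appendix C): the regular expression, function names, and prompt wording are illustrative assumptions, and the implicit-reference step and cross-encoder ranking are represented here only by their pre-computed outputs.

```python
import re

# Step 1 (explicit context enrichment): pull CRR article numbers that the
# user mentions directly, e.g. "Article 412(1) CRR" or "Article 8 of the CRR".
# The pattern is a hypothetical sketch, not the paper's actual extractor.
CRR_REF_PATTERN = re.compile(
    r"Article\s+(\d+[a-z]?)(?:\s*\(\d+\))?\s+(?:of\s+the\s+)?CRR",
    re.IGNORECASE,
)

def extract_explicit_refs(question: str) -> list:
    """Return explicit CRR references found in the user's question."""
    return [f"Article {m}" for m in CRR_REF_PATTERN.findall(question)]

def build_prompt(question, background, crr_texts, examples):
    """Assemble enriched context (retrieved CRR provisions) and few-shot
    examples (same-topic Q&A demonstrations) into one prompt string."""
    parts = ["You are an expert on EBA supervisory reporting regulation."]
    for ex_q, ex_a in examples:                       # example enrichment (2.3)
        parts.append(f"Example question: {ex_q}\nExample answer: {ex_a}")
    parts.append("Relevant CRR provisions:\n" + "\n".join(crr_texts))  # context (2.2)
    parts.append(f"Background: {background}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)
```

In the paper's pipeline the assembled prompt would then be sent to GPT-4o through the OpenAI API; references found implicitly by GPT-4o and by the CRR Ranker would simply be appended to the list of retrieved provisions before the prompt is built.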
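The recall@k evaluation used for the CRR Ranker (Section 2.2.1) can be made concrete with a short sketch; the function names and data layout here are our own illustrative choices rather than the paper's code:

```python
def recall_at_k(ranked_articles, relevant_articles, k):
    """Fraction of the truly relevant CRR articles that appear in the
    top-k positions of the ranked retrieval list."""
    if not relevant_articles:
        return 0.0
    top_k = set(ranked_articles[:k])
    hits = len(top_k & set(relevant_articles))
    return hits / len(relevant_articles)

def average_recall(queries, k_values=(5, 10, 20, 30)):
    """Average recall@k over (ranked_list, relevant_set) pairs, one pair
    per test question, at the cutoffs reported in the paper."""
    return {
        k: sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)
        for k in k_values
    }
```

Averaging these per-question scores over the test set yields the figures of the kind reported in Table 2.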
3. LLM Evaluator

In our pipeline, we employ an LLM Evaluator to assess the quality of the generated responses, defined in Section 2, compared to the answers already provided by the EBA. Employing an LLM Evaluator offers significant advantages in terms of cost-effectiveness and efficiency compared to traditional human evaluation/comparison methods. Recent research highlights the potential of LLMs for large-scale natural language evaluation tasks [23, 24, 25].

The evaluation process uses a scale from one to four, based on two evaluation criteria: correctness and completeness. A generated response is considered correct if its content aligns with the information presented in the official answer. Additionally, a response is deemed complete if it incorporates all relevant regulatory references provided in the official answer. The following scoring rubric outlines the evaluation criteria:

     • Score 1: The generated answer is completely incorrect and incomplete compared to the official answer.
     • Score 2: The generated answer is incorrect but either complete or partially complete compared to the official answer. It contains some useful information found in the official answer, but the main statement is incorrect.
     • Score 3: The generated answer is correct but only partially complete. The main statement matches the official answer, but some information from the official answer is missing.
     • Score 4: The generated answer is fully correct and complete. It is essentially a rephrased version of the official answer with no significant differences.

To preliminarily validate the effectiveness of our LLM evaluator, we conducted an experiment using a synthetic dataset. This dataset was carefully designed to test various aspects of language generation and was evaluated by both a human expert and the LLM. The alignment between the human expert's assessments and those of the LLM was then analyzed. The complete details of the final prompt used for the LLM evaluator are provided in Appendix C.3.

The dataset comprises 60 Q&A pairs, balanced across the four score categories. For each category, two pairs were excluded as they were used as examples in the prompt for the LLM evaluator, resulting in a final dataset of 52 Q&A pairs to measure the alignment between the human and LLM evaluator. Using GPT-4o, we obtained a Kendall-tau coefficient of 0.77, with a p-value of 6·10⁻¹¹. These results justified the adoption of the LLM evaluator.

4. Experiments and Results

This section describes the results obtained by measuring retrieval effectiveness and answer quality. Retrieval performance is measured by the number of relevant regulations retrieved (recall) using different encoder models. Answer quality is then evaluated by a separate LLM, which scores each generated response based on factors like relevance and adherence to EBA legal acts. We compare the multi-step prompt approach with few-shot and zero-shot ones, focusing on a single topic within the EBA Q&A framework, specifically Liquidity Risk. Finally, we test our Multi-Step pipeline with other LLM models, such as Google Gemini Flash 1.5 and Llama 3.1 70B.

4.1. CRR Retrieval

We employed "recall" as the primary metric to assess the effectiveness of bi- and cross-encoder models in retrieving relevant CRR articles based on the information submitted with the inquiry. "Recall" signifies the proportion of truly relevant CRR articles retrieved from the dataset compared to all the pertinent actual articles [26]. In the context of legal information retrieval, prioritizing the retrieval of all crucial regulatory information for the inquiry makes recall a particularly relevant metric.

Our primary objective was to identify a model that delivers exceptional retrieval accuracy while maintaining computational efficiency. This potentially excluded models with an extremely large number of parameters, as they can be computationally expensive to run.

We conducted a performance comparison between our fine-tuned CRR Ranker and several pre-trained models:

     • Bi-encoders: all-MiniLM-L6-v2 [27], gte-large-en-v1.5 [28], and bge-large-en-v1.5 [19].
     • Cross-encoders: bge-reranker-large [19], bge-reranker-v2-m3 [29, 18].

The detailed results (presented in Table 2) show the achieved recall scores on the EBA Q&As Test Dataset for each model. Our fine-tuned CRR Ranker significantly outperformed all other models, achieving a more than 20% improvement compared to the best pre-trained model (bge-large-en-v1.5).

4.2. Answer Generation

Here we compare the performance of our multi-step approach with a zero-shot one for answering EBA liquidity
risk inquiries, using our LLM as the evaluation system (Figure in Appendix B.3). To this end, we utilized a subset of 46 Q&As from our EBA Q&A Test dataset specifically focused on liquidity risk.

Table 2
Recall scores on EBA Q&As Test Dataset

         Models           r@5     r@10    r@20    r@30
       all-MiniLM         0.37    0.46    0.55    0.59
        gte-large         0.39    0.48    0.57    0.63
        bge-large         0.41    0.52    0.62    0.67
   bge-reranker-large     0.17    0.23    0.31    0.38
   bge-reranker-v2-m3     0.24    0.31    0.39    0.44
   CRR Ranker (ours)      0.51    0.67    0.81    0.86

We tested:

     • Zero-Shot Approach: for each question, a standard prompt was provided to the LLM. It encompassed both the specific query and any relevant contextual information provided by the user.
     • Few-Shot Approach: for each question, a few examples were provided along with the query to guide the LLM in generating responses.
     • Multi-Step Approach: for each question, we created prompts following our established multi-step approach, incorporating context enrichment and example enrichment (as detailed in previous sections).

The LLM Evaluator assessed each response based on its correctness and completeness relative to the official EBA response. As described in Section 3, the LLM Evaluator assigned an overall score on a scale of 1 (completely incorrect and incomplete) to 4 (fully correct and comprehensive).

Table 3
Evaluation results for responses generated by zero-shot, few-shot and multi-step

    Rating    zero-shot    few-shot    multi-step (GPT-4o)
      1           6            12               2
      2          18            11              14
      3          19            16              26
      4           3             7               4

Table 3 summarizes the evaluation results for responses generated by the different approaches. The "multi-step" approach consistently achieved higher counts in the high-quality rating categories compared to both the "zero-shot" and "few-shot" ones. This demonstrates that the multi-step approach significantly outperformed the other methods in terms of response quality. The LLM evaluator awarded the multi-step approach an average score of 2.7, representing a 12.5% improvement over the zero-shot and few-shot approaches, which both received an average score of 2.4. Notably, a larger portion of the responses generated by our multi-step approach received scores of 3 or higher, indicating correct answers. In contrast, only 2 out of 46 responses generated by the multi-step approach were rated as completely incorrect (score 1), compared to 6 such responses for the zero-shot approach and 12 for the few-shot approach. These findings suggest that the context enrichment in the multi-step prompts effectively guides the primary LLM toward generating more comprehensive and informative responses that accurately reflect the EBA regulations.

4.2.1. Other LLMs

In this section, we extend our analysis of the multi-step pipeline by incorporating evaluations using additional large language models (LLMs), specifically Google Gemini Flash 1.5 and Llama 3.1 70B. Google Gemini Flash 1.5 is widely recognized for its high-speed processing capabilities and efficiency in response generation, making it a suitable benchmark for comparative performance analysis. Conversely, Llama 3.1 70B is noted for its robustness in handling complex queries while maintaining moderate computational demands, providing an interesting contrast in terms of performance and resource efficiency.

Our experimental results indicate that the average evaluation score achieved by Google Gemini Flash 1.5 was 2.0, whereas Llama 3.1 70B attained an average score of 2.2. Notably, these scores did not surpass the performance of the GPT-4o zero-shot approach, which underscores the advanced capabilities of GPT-4o in addressing the complexities of regulatory compliance inquiries. This observation highlights the inherent strength of GPT-4o in generating accurate and contextually relevant responses, outperforming the other models under similar conditions.

Future research will focus on an in-depth analysis of these models with a view toward optimizing each step of the multi-step pipeline in a model-specific manner. By tailoring our methodology to align with the distinctive strengths and limitations of each model, we aim to further enhance the overall accuracy and reliability of the generated responses.

5. Challenges and Advancements

Our work has highlighted several key challenges that are worth discussing. One of the primary issues concerns the limited size of our test dataset. This constraint arose because we focused on the single topic of Liquidity Risk. However, to achieve robust human alignment and ensure the system addresses diverse user inquiries across EBA topics, future efforts should prioritize dataset expansion and human evaluation integration.

Another topic for reflection is that the study emphasizes the need to retrieve relevant CRR articles. Future research could investigate methods to further refine the
generated responses by incorporating legal reasoning          domains with complex regulatory frameworks like reg-
and argumentation capabilities into the LLM [30, 31],         ulatory reporting. Even at this early stage, the tool has
and the most relevant Q&As as examples for few-shot           demonstrated its ability to make the work of the human
prompting [6].                                                analyst more efficient. Future research directions include
   It is also crucial to underscore the importance of op-     exploring the use of different LLM architectures and in-
timizing prompts for this kind of application, and we         vestigating alternative methods for incorporating human
plan to address this moving forward. Our future research      feedback into the prompt construction process. Lastly,
endeavors will focus on investigating automatic prompt        exploring the generalization of this approach to other
engineering techniques [32] leveraging the LLM Evalu-         regulatory domains would be valuable.
ator as a metric to optimize. These techniques aim to
tailor and optimize prompts based on the specific topic
of inquiries, enhancing overall performance.                  Acknowledgments
   Moreover, currently we have utilized only one model,
                                                              The authors would like to express their sincere grati-
GPT-4o, but we intend to extend our testing to include
                                                              tude to Vincenzo Capone, Pamela Maggiori, Daniele Bovi,
other models that have demonstrated similar perfor-
                                                              Fabio Zambuto, Francesca Monacelli, and Roberto Sab-
mance levels in the field of open question answering
                                                              batini (Bank of Italy) for their insightful comments and
[33]. This will help us identify the most effective model
                                                              stimulating discussions on an earlier draft of this paper.
for our application with an unbiased evaluation [34].
                                                              Their feedback greatly enhanced the clarity and focus of
   Similarly, in the context of LLM evaluators, we also in-
                                                              our work.
tend to explore additional models, including open-source
                                                              They would also like to thank the anonymous reviewers
options [35, 36], that have shown strong performance
                                                              for their invaluable suggestions and constructive feed-
in assessing the quality of responses from various LLMs.
                                                              back.
This approach is expected to increase the correlation
between human and LLM evaluations, thereby enhanc-
ing the system’s overall accuracy and reliability. The
scientific community is very active in this area to bet-
ter understand the limitations of the different types of
models considered as evaluators [37].
   By addressing the identified limitations through in-
creased human involvement, expanded data coverage,
and domain-specific evaluation methods, we believe it is
possible to enhance the system’s effectiveness and gen-
eralizability across a wide range of regulatory domains.


6. Conclusion
This study explored a novel approach for generating au-
tomated responses to inquiries on the Regulation (EU)
N.2013/575, specifically on the liquidity risk subject. We
proposed a multi-step prompt construction method that
enriches the context to be provided to LLMs, enabling
them to generate more accurate and informative answers.
An LLM Evaluator, which demonstrated strong agree-
ment with human experts, was employed to compare our
multi-step approach with standard zero-shot and few-
shot methods that lack context enrichment. The quality
of the generated responses was assessed, and our find-
ings indicate that the multi-step approach significantly
outperforms both the zero-shot and few-shot methods,
resulting in responses that are more comprehensive and
accurate in relation to the EBA regulation. These re-
sults suggest that the multi-step prompt construction is
a promising approach for enhancing LLM performance
in legal information retrieval tasks, particularly within
References

[1] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, BloombergGPT: A Large Language Model for Finance, 2023. URL: http://arxiv.org/abs/2303.17564. arXiv:2303.17564 [cs, q-fin].
[2] J. Lai, W. Gan, J. Wu, Z. Qi, P. S. Yu, Large Language Models in Law: A Survey, 2023. URL: http://arxiv.org/abs/2312.03718. doi:10.48550/arXiv.2312.03718. arXiv:2312.03718 [cs].
[3] C. Biancotti, C. Camassa, Loquacity and Visible Emotion: ChatGPT as a Policy Advisor, 2023. URL: https://papers.ssrn.com/abstract=4533699. doi:10.2139/ssrn.4533699.
[4] J. J. Horton, Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?, 2023. URL: https://arxiv.org/abs/2301.07543v1.
[5] P. Homoki, Z. Ződi, Large language models and their possible uses in law, Hungarian Journal of Legal Studies 64 (2024) 435–455. URL: https://akjournals.com/view/journals/2052/64/3/article-p435.xml. doi:10.1556/2052.2023.00475.
[6] N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, B. Fleisch, CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering, 2024. URL: https://arxiv.org/abs/2404.04302. arXiv:2404.04302.
[7] A. Louis, G. van Dijck, G. Spanakis, Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models, 2023. URL: http://arxiv.org/abs/2309.17050. doi:10.48550/arXiv.2309.17050. arXiv:2309.17050 [cs].
[8] W. Zhang, H. Shen, T. Lei, Q. Wang, D. Peng, X. Wang, GLQA: A Generation-based Method for Legal Question Answering, in: 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 1–8. URL: https://ieeexplore.ieee.org/document/10191483. doi:10.1109/IJCNN54540.2023.10191483. ISSN: 2161-4407.
[9] A. Abdallah, B. Piryani, A. Jatowt, Exploring the state of the art in legal QA systems, Journal of Big Data 10 (2023) 127. URL: https://doi.org/10.1186/s40537-023-00802-8. doi:10.1186/s40537-023-00802-8.
[10] J. Prenio, Peering through the hype - assessing suptech tools' transition from experimentation to supervision, 2024. URL: https://www.bis.org/fsi/publ/insights58.htm.
[11] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. URL: https://arxiv.org/abs/2311.05232. arXiv:2311.05232.
[12] Single Rulebook Q&A | European Banking Authority, 2013–2024. URL: https://www.eba.europa.eu/single-rule-book-qa.
[13] S. Ye, D. Kim, S. Kim, H. Hwang, S. Kim, Y. Jo, J. Thorne, J. Kim, M. Seo, FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, 2024. URL: http://arxiv.org/abs/2307.10928. doi:10.48550/arXiv.2307.10928. arXiv:2307.10928 [cs].
[14] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023. URL: http://arxiv.org/abs/2306.05685. doi:10.48550/arXiv.2306.05685. arXiv:2306.05685 [cs].
[15] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 2511–2522. URL: https://aclanthology.org/2023.emnlp-main.153. doi:10.18653/v1/2023.emnlp-main.153.
[16] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, Z. Liu, ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, 2023. URL: http://arxiv.org/abs/2308.07201. doi:10.48550/arXiv.2308.07201. arXiv:2308.07201 [cs].
[17] OpenAI, J. Achiam, S. Adler, S. Agarwal, et al., GPT-4 Technical Report, 2024. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774. arXiv:2303.08774 [cs].
[18] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. arXiv:2402.03216.
[19] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-Pack: Packaged resources to advance general Chinese embedding, 2023. arXiv:2309.07597.
[20] H. Xuan, A. Stylianou, X. Liu, R. Pless, Hard negative examples are hard, but useful, 2021. URL: https://arxiv.org/abs/2007.12749. arXiv:2007.12749.
[21] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, FlagEmbedding/FlagEmbedding/reranker at master · FlagOpen/FlagEmbedding, 2024. URL: https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker.
[22] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
[23] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023. URL: https://arxiv.org/abs/2303.16634. arXiv:2303.16634.
[24] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, T. B. Hashimoto, AlpacaFarm: A simulation framework for methods that learn from human feedback, 2024. URL: https://arxiv.org/abs/2305.14387. arXiv:2305.14387.
[25] J. Fu, S.-K. Ng, Z. Jiang, P. Liu, GPTScore: Evaluate as you desire, 2023. URL: https://arxiv.org/abs/2302.04166. arXiv:2302.04166.
[26] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, USA, 2008.
[27] P. S. H. Lewis, Y. Wu, L. Liu, P. Minervini, H. Küttler, A. Piktus, P. Stenetorp, S. Riedel, PAQ: 65 million probably-asked questions and what you can do with them, CoRR abs/2102.07033 (2021). URL: https://arxiv.org/abs/2102.07033. arXiv:2102.07033.
[28] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings with multi-stage contrastive learning, arXiv preprint arXiv:2308.03281 (2023).
[29] C. Li, Z. Liu, S. Xiao, Y. Shao, Making large language models a better foundation for dense retrieval, 2023. arXiv:2312.15503.
[30] F. Yu, L. Quartey, F. Schilder, Exploring the effectiveness of prompt engineering for legal reasoning tasks, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 13582–13596. URL: https://aclanthology.org/2023.findings-acl.858. doi:10.18653/v1/2023.findings-acl.858.
[31] Y.-A. Lu, H.-Y. Kao, 0x.Yuan at SemEval-2024 Task 5: Enhancing legal argument reasoning with structured prompts, in: International Workshop on Semantic Evaluation, 2024. URL: https://api.semanticscholar.org/CorpusID:270765544.
[32] Q. Ye, M. Axmed, R. Pryzant, F. Khani, Prompt engineering a prompt engineer, 2024. URL: https://arxiv.org/abs/2311.05661. arXiv:2311.05661.
[33] Z. Huang, Z. Wang, S. Xia, P. Liu, OlympicArena medal ranks: Who is the most intelligent AI so far?, 2024. URL: https://arxiv.org/abs/2406.16772. arXiv:2406.16772.
[34] A. Panickssery, S. R. Bowman, S. Feng, LLM evaluators recognize and favor their own generations, 2024. URL: https://arxiv.org/abs/2404.13076. arXiv:2404.13076.
[35] S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, S. Welleck, G. Neubig, M. Lee, K. Lee, M. Seo, Prometheus 2: An open source language model specialized in evaluating other language models, 2024. URL: https://arxiv.org/abs/2405.01535. arXiv:2405.01535.
[36] S. Kim, J. Suk, J. Y. Cho, S. Longpre, C. Kim, D. Yoon, G. Son, Y. Cho, S. Shafayat, J. Baek, S. H. Park, H. Hwang, J. Jo, H. Cho, H. Shin, S. Lee, H. Oh, N. Lee, N. Ho, S. J. Joo, M. Ko, Y. Lee, H. Chae, J. Shin, J. Jang, S. Ye, B. Y. Lin, S. Welleck, G. Neubig, M. Lee, K. Lee, M. Seo, The BiGGen Bench: A principled benchmark for fine-grained evaluation of language models with language models, 2024. URL: https://arxiv.org/abs/2406.05761. arXiv:2406.05761.
[37] H. Huang, Y. Qu, H. Zhou, J. Liu, M. Yang, B. Xu, T. Zhao, On the limitations of fine-tuned judge models for LLM evaluation, 2024. URL: https://arxiv.org/abs/2403.02839. arXiv:2403.02839.
A. Dataset

Table 4
EBA Q&As dataset. For this research, we focused on the fields highlighted in yellow.

 Variable Name                            Description
 Question ID                              The unique identifier for each question.
 Topic                                    The general topic or category under which the question falls.
 Subject matter                           The specific subject matter of the question.
 Legal act                                The specific legal act to which the question relates. (e.g., CRR)
 Article                                  The specific article of the legal act to which the question relates.
 COM Delegated or Implementing
 Acts/RTS/ITS/GLs/Recommendations         Other legislation, standards, guidelines or recommendations to which the question relates.
 Article/Paragraph                        The specific article or paragraph within the above-mentioned acts.
 Question                                 The actual question asked.
 Background on the question               Any additional information or context provided by the question submitter.
 Final answer                             The official answer provided to the question.
 Submission date                          The date when the question was submitted.
 Final publishing date                    The date when the final answer to the question was published.
 Status                                   The current status of the question (e.g. Final, rejected, etc.).
 Type of submitter                        The type of entity that submitted the question (e.g. Credit institution, investment firm, etc.).
 Answer prepared by                       The entity that prepared the answer to the question.
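The fields in Table 4 can be narrowed down to the subset actually used in this work (question, background, and final answer, filtered by topic and status). The following Python sketch illustrates this filtering step; the sample records and the exact field values are invented for illustration and do not come from the real EBA dataset.

```python
# Hypothetical records following the Table 4 layout; values are invented
# for illustration only.
records = [
    {"Question ID": "2013_001", "Topic": "Liquidity risk", "Status": "Final",
     "Question": "How is the 75% inflow cap applied?",
     "Background on the question": "...", "Final answer": "..."},
    {"Question ID": "2014_002", "Topic": "Own funds", "Status": "Rejected",
     "Question": "...", "Background on the question": "...", "Final answer": "..."},
]

def select_liquidity_qas(rows):
    """Keep finalized Liquidity-risk Q&As, restricted to the fields used here."""
    keep = ("Question ID", "Question", "Background on the question", "Final answer")
    return [
        {field: row[field] for field in keep}
        for row in rows
        if row["Topic"] == "Liquidity risk" and row["Status"] == "Final"
    ]

subset = select_liquidity_qas(records)
```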




Figure 1: Distribution of tokens among Questions, Backgrounds, and Answers across the dataset and its splits
B. Multi-Step Generation and Evaluation
B.1. Multi-Step Approach for Answer Generation




Figure 2: Multi-Step Approach for Answer Generation




B.2. LLM Evaluator Alignment




Figure 3: Evaluating Alignment between the LLM evaluator and the human expert




B.3. Multi-Step vs. Zero-Shot




Figure 4: Multi-Step vs. Zero-Shot Approach for EBA Liquidity Risk Inquiries
C. Prompt template
C.1. Extracting Law References
    GPT-4o Prompt

    #task
    Extract from the text (#text) any reference to regulatory documents contained in it and insert them into a
    list (e.g. ["regulatory document name": ["article 1","article 2",...]]). I will provide you an example (#text
    (example)) and the expected output (#output (example)):

    #text (example) "In accordance with Article 425 (1) of Regulation (EU) No. 575/2013 (CRR) institu-
    tions may exempt contractual liquidity inflows from borrowers and bond investors arising from mortgage
    lending funded by covered bonds eligible for preferential treatment as set out in Article 129b (4-6) of CRR
    or by bonds as referred to in Article 52(4) of Directive 2009/65/EC from the 75% inflow cap."

    #output (example) "["Regulation (EU) No. 575/2013 (CRR)": ["425","129b"], "Directive 2009/65/EC" : ["52"]]"
    #text
    > text_to_extract

    #output (list only)


   This prompt was used to extract references to regulatory documents from the provided text_to_extract (a placeholder for the input text).
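Note that the list format shown in the prompt's example output is not valid JSON as written, because the outer delimiters are square brackets around key-value pairs. A minimal post-processing sketch, assuming the model reproduces the example format (the paper does not specify how the output is parsed):

```python
import json

def parse_reference_list(raw: str) -> dict:
    """Parse the model's reference list, e.g.
    '["Regulation (EU) No. 575/2013 (CRR)": ["425","129b"], "Directive 2009/65/EC": ["52"]]'
    The outer square brackets are rewritten as braces so json.loads accepts it.
    """
    s = raw.strip().strip('"')
    if s.startswith("[") and s.endswith("]"):
        s = "{" + s[1:-1] + "}"
    return json.loads(s)  # maps document name -> list of article numbers

refs = parse_reference_list(
    '["Regulation (EU) No. 575/2013 (CRR)": ["425","129b"], '
    '"Directive 2009/65/EC": ["52"]]'
)
```

In practice the returned article numbers can then drive the retrieval of the corresponding CRR articles for the enhanced context.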
C.2. Answer Generation
    GPT-4o Prompt

    " #system
    You are a virtual assistant for the European Banking Authority (EBA), handling user inquiries related to
    Liquidity Risk regulations. The user’s query specifically pertains to Regulation (EU) No. 575/2013 (CRR) or
    Delegated Regulation (EU) No. 2015/61 (LCR DA)."""

    #task
   Answer the question based on the instructions below.
   1. Analyze the User’s Question (#question):
   - Identify the central topic and relevant keywords related to Liquidity Risk and the specified EBA regulations.
    2. Leverage the Provided Context (#context):
   - Incorporate the context (including CRR articles and additional information) to tailor the answer to the
    user’s specific scenario.
    3. Liquidity Risk Topic:
   - Reference relevant articles from provided context (#context) that address the specific aspect of Liquidity
    Risk raised in the question. 4. Desired Answer (#answer):
   - Use only the information provided in the context and examples (if provided) to answer the question.
   - Craft a well-reasoned and informative response that covers all aspects of the user’s query.
   - Clearly articulate the regulatory implications while considering the provided context.
   - Maintain a professional and informative tone suitable for the EBA.

     #examples:
     Example 1: > example_1
     Example 2: > example_2
     Example 3: > example_3
     Example 4: > example_4
     Example 5: > example_5

    #question:
    > question

    #context:
    > context
     > enhanced_context

    #answer:


   This prompt was used to generate an answer given a question and context. The #examples section (a placeholder for the five examples) and enhanced_context (a placeholder for the retrieved CRR articles), highlighted in yellow, were used only in the multi-step approach.
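The placeholder substitution described above can be sketched as follows. The template is abridged and the function and argument names are illustrative, not taken from the paper's implementation; in the zero-shot setting the examples and enhanced context are simply left empty.

```python
# Abridged version of the Section C.2 template; "..." stands for the full
# instruction text, which is omitted here.
PROMPT_TEMPLATE = """#task
Answer the question based on the instructions below.
...

#examples:
{examples}

#question:
{question}

#context:
{context}
{enhanced_context}

#answer:
"""

def build_prompt(question, context, examples=(), enhanced_context=""):
    """Assemble the prompt. Multi-step runs pass retrieved Q&A examples and
    CRR articles; zero-shot runs leave both empty."""
    example_block = "\n".join(
        f"Example {i}: {ex}" for i, ex in enumerate(examples, start=1)
    )
    return PROMPT_TEMPLATE.format(
        examples=example_block,
        question=question,
        context=context,
        enhanced_context=enhanced_context,
    )

prompt = build_prompt(
    "Does the 75% inflow cap apply here?",
    "User background information ...",
    examples=["Q/A pair 1", "Q/A pair 2"],
    enhanced_context="Article 425 CRR: ...",
)
```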
C.3. LLM as Evaluator
    GPT-4o Prompt

    I will provide you with two answers to a question. One is the #official answer, which serves as the
    benchmark. The other is the #generated answer, which needs to be evaluated against the #official answer.
    You must compare the answers step by step.

    Consider the following definitions for this evaluation:

   - Correctness: A #generated answer is correct if its content aligns with that of the #official answer.
   - Completeness: A #generated answer is complete if it includes all the information present in the #official
    answer.
   Your task is to act as an evaluator and rate the #generated answer according to the following scale:

    RATING 1: The #generated answer is completely incorrect and incomplete compared to the #official answer.
    RATING 2: The #generated answer is incorrect but either complete or partially complete compared to the
    #official answer. It contains some useful information found in the #official answer but the main statement is
    incorrect.
    RATING 3: The #generated answer is correct but only partially complete. The main statement matches the
    #official answer, but some information from the #official answer is missing.
    RATING 4: The #generated answer is fully correct and complete. It is essentially a rephrased version of the
    #official answer with no significant differences.
    Please provide a single numerical rating (1-4) followed by a brief explanation for your rating.

    
    ...
    

    Compute the score in the following case:


    #question
    > question


    #background
    > background


    #official answer
    > answer


    #generated answer
    > generated_answer

    Output:


   This prompt was used to compare an AI-generated answer (#generated answer) against the official one (#official answer), rating its correctness and completeness and providing a brief explanation for the rating.
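Since the evaluator is instructed to reply with a single numerical rating (1-4) followed by an explanation, the rating can be recovered and aggregated with a short post-processing sketch. The reply strings below are invented for illustration; the paper does not describe its parsing code.

```python
import re
from statistics import mean

def extract_rating(evaluation: str) -> int:
    """Pull the first 1-4 rating out of the evaluator's reply, which is
    expected to start with a single numeral followed by an explanation."""
    match = re.search(r"\b([1-4])\b", evaluation)
    if match is None:
        raise ValueError("no rating found in evaluator output")
    return int(match.group(1))

# Aggregate per-question ratings into an average score, as in Section 4.
# These evaluator replies are illustrative only.
ratings = [extract_rating(reply) for reply in [
    "3 - correct but omits part of the official answer.",
    "4: essentially a rephrasing of the official answer.",
    "1. The main statement contradicts the official answer.",
]]
average = mean(ratings)
```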