<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel Multi-Step Prompt Approach for LLM-based Q&amp;As on Banking Supervisory Regulation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Licari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Canio Benedetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Praveen Bushipaka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro De Gregorio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco De Leonardis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Cucinotta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Banca d'Italia</institution>
          ,
          <addr-line>Via Nazionale, 91, Rome, 00184</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Scuola Superiore Sant'Anna</institution>
          ,
          <addr-line>P.zza dei Martiri della Libertà, 33, Pisa, 56100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates the use of large language models (LLMs) in analyzing and answering questions related to banking supervisory regulation concerning reporting obligations. We introduce a multi-step prompt construction method that enhances the context provided to the LLM, resulting in more precise and informative answers. This multi-step approach is compared with standard "zero-shot" and "few-shot" approaches, which lacks context enrichment. To assess the quality of the generated responses, we utilize an LLM evaluator. Our findings indicate that the multi-step approach significantly outperforms the zero-shot method, producing more comprehensive and accurate responses.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Regulatory Q&amp;A</kwd>
        <kwd>Banking Supervisory Reporting Regulation</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>GenAI</kwd>
        <kwd>GPT-4o</kwd>
        <kwd>RAG</kwd>
        <kwd>LLM Evaluator</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Therefore, it is essential to establish strong verification</title>
        <p>procedures and retain human supervision to counter
The advent of generative AI (GenAI), and specifically of these risks. The complexity of regulatory documents,
large language models (LLMs), ofers significant oppor- with their dense network of cross-referenced texts/cats
tunities, among others, in the legal and financial sector, and specialized content, necessitates careful analysis to
facilitating the implementation of innovative solutions retrieve the needed information ensuring at the same
across various domains of activities [1, 2, 3, 4, 5]. One time efective risk management and limit the burden of
of the most promising applications is the business case such manual compliance.
for supporting the navigation and analysis of complex This study introduces a novel methodology to
autoregulatory documents [6, 7, 8, 9], which can be particu- mate and expedite the "question &amp; answer" (Q&amp;A)
prolarly valuable for compliance oficers, legal teams, and cess in regulatory compliance, leveraging advanced large
other professionals working in financial institutions who language models (LLMs) to provide accurate and timely
need to have a clear and timely understanding of the responses to inquiries about the European Banking
Auregulations and the consequently derived obligations. thority’s (EBA) reporting regulations. Our multi-step</p>
        <p>
          Supervisory authorities could benefit from a tool that approach aligns with Retrieval-Augmented Generation
streamlines the consultation of complex legislation, pro- (RAG) principles, enhancing context retrieval and
genviding swift responses to entities and enhancing efi- erative capabilities through mechanisms like explicit
ciency [10]. While LLMs ofer advantages for this pur- extraction of Capital Requirements Regulation (CRR)
pose, they also pose risks like bias and inaccuracies [11]. references, implicit reference analysis, and a dedicated
cross-encoder for precise regulatory text retrieval. This
CDLeciC0-4it—200264,:2T0e2n4t,hPIitsaal,iIatnalCyonference on Computational Linguistics, methodology ensures tailored response generation suited
* Corresponding author. to the complex regulatory compliance context, where
pre† The views and opinions expressed in this paper are those of the cise and comprehensive answers are crucial.
authors and do not necessarily reflect the oficial policy or position Our work finds particular applications within the
doof the Bank of Italy. main of EBA regulatory reporting because it is
charac$ daniele.licari@bancaditalia.it (D. Licari); terized by a large and complex set of interrelated
docucparnavioe.ebne.nbeudsehtitpoa@kab@anscaandtiatnanliaa.pitis(aC.i.t B(Pe.nBeduesthtiop)a;ka); ments, including delegated and implementing acts,
techalessandro.degregorio@bancaditalia.it (A. De Gregorio); nical standards, guidelines, and recommendations, which
marco.deleonardis@bancaditalia.it (M. De Leonardis); cover various aspects of financial entities. Such
comtommaso.cucinotta@santannapisa.it (T. Cucinotta) plexity makes the business case both challenging and
0000-0002-2963-9233 (D. Licari); 0000-0002-8446-9468 rewarding.
(0C0.00B-e0n0e0d1e-7tt5o7)7;-03060595-0(A00.9D-7e7G53r-e8g6o6r2io()P;.0B00u9s-h0i0p0a4k-a6)5;23-186X In this work, we focus on Regulation (EU) N.2013/575,
(M. De Leonardis); 0000-0002-0362-0657 (T. Cucinotta) also called Capital Requirements Regulati
          <xref ref-type="bibr" rid="ref3">on (CRR)
© 2024</xref>
          Copyright for this paper by its authors. Use permitted under Creative Commons License https://eur-lex.europa.eu/legal-content/en/ALL/?uri=
Attribution 4.0 International (CC BY 4.0).
celex%3A32013R0575, specifically on the topic of Table 1
Liquidity Risk as a first use case to evaluate the potential Sample distribution across training, validation, and test sets
benefit of enriched context for an accurate response for CRR-related Q&amp;A and the subset of only Liquidity Risk
generation. The main reason for this choice is that this Q&amp;A.
topic is supported by a relatively limited number of Set CRR-related Q&amp;A Liquidity Risk Q&amp;A
regulatory documents, so it was a good starting point
since the regulation is not readily available in the form Training 798 58
of a structured dataset and its pre-processing is usually a ValTideasttion 166327 1426
time-consuming task.
        </p>
        <p>We used the actual EBA Q&amp;As dataset [12] as the foun- variables: question ID, question, submission date, status,
dation for developing a system capable of generating au- topic, legal act, article [within that act], background
infortomated responses to questions formulated by analysts mation,final answer, submission date and status (details
on EBA reporting requirements and rules. By harnessing in Table 4, Appendix 4) Secondly, we implemented a
twothe capabilities of LLMs we aim to create a tool that can step filtering process aimed at ensuring model eficacy:
deliver accurate and contextually relevant answers to by excluding non-English entries, and by focusing on
any inquiry on the content of the CRR. CRR-related questions within the same timeframe. This</p>
        <p>Recent studies highlight the potential of LLMs for qual- resulted in a final dataset of 1597 CRR-related questions
itative assessment [13, 14, 15, 16]. For this reason, in this and answers, which was then split into training (50%),
work we also propose the use of an "LLM Evaluator" to validation (10%), and test sets (40%) for robust evaluation
automate the validation process. (token number distribution in Figure 1 in Appendix A).</p>
        <p>The structure of this paper is the following. Chapter The distribution of samples for the dataset is summarized
2 introduces the methodology and provides a detailed in Table 1.
description of the approach adopted in this study; it
explains the dataset utilized and the normative retrieval
techniques employed to identify the regulatory docu- 2.2. Context Enrichment
ments necessary to address the EBA’s Q&amp;As. Chapter
3 presents the LLM Evaluator and the evaluation
criteria. Chapter 4 reports experimental results and results
and presents the main outcomes of the study. Chapter 5
discusses challenges as well as potential areas for future
developments.</p>
        <p>The context enrichment process is a three-step approach
designed to identify, within the data set, the most
relevant CRR references to provide an appropriate content to
formulate the answer to the inquiry. The first step simply
involves extracting explicit CRR references, if directly
mentioned in the question (Article in tab 4). The second
step leverages on the capabilities of the GPT-4o (prompt
2. Methodology in Appendix C.1) to analyse the “question” and the
“background information” to identify other CRR references
This research employs a multi-step methodology to con- that are not explicitly stated by the user. The last step
struct a comprehensive prompt for the GPT-4 omni (GPT- of the process utilizes our CRR Ranker model, a
cross4o) language model [17], enabling it to answer EBA- encoder architecture that has been trained to identify
related questions efectively. This step-wise approach and retrieve pertinent references from the Capital
Refocuses on enriching the context provided by the user’s quirements Regulation in response to specific inquiries.
question. First, it identifies relevant EBA regulations This 3-steps comprehensive approach ensures a broader
(specifically CRR references) within the inquiry. Second, and potentially more accurate understanding of the the
it incorporates response examples to guide the LLM’s inquiry and the specific legal act(s) related to the CRR
output format ensuring alignment with EBA regulations. that the Q&amp;A tool deems applicable.
This enriched context is then leveraged by a
powerful LLM to generate more accurate and informative re- 2.2.1. CRR Ranker Training
sponses (details in Appendix B.1).</p>
      </sec>
      <sec id="sec-1-2">
        <title>With regard to the context enrichment, i.e. the CRR</title>
        <p>Ranker Training, we employed a specifically trained
2.1. Dataset Construction cross-encoder model [18] to identify relevant CRR
references for enriching inquiry context. We used a dedicated
To develop and then evaluate our LLM-based Q&amp;A sys- “question-article” pair dataset derived from our EBA Q&amp;A
tem, firstly we extracted a subset from the EBA’s Single- Train Dataset, excluding questions related to CRR
Artirule-book-qa online resource [12], comprising “question- cle 99 https://www.eba.europa.eu/regulation-and-policy/
and-answer” pairs submitted to the EBA between 2013
and 2020. In particular, we focused on the following
single-rulebook/interactive-single-rulebook/14212 due Dataset. This evaluation employed recall metrics at
varito their frequent lack of topical relevance. Each data point ous retrieval cutofs, including recall@5, recall@10,
reconsisted of a question (user query and background in- call@20, and recall@30 (results in Section 4).
formation), an associated CRR article, and a binary label
indicating relevance (1 for relevant, 0 for not applicable). 2.3. Examples Enrichment</p>
        <p>We constructed the training dataset by selecting
positive and negative samples. Positive samples comprised To improve the model’s understanding of the desired
requestion-article pairs where the article explicitly ad- sponse format, tone, and content, we adopted a few-shot
dressed the user’s query. Additionally, we included pairs prompting approach [22]. This involved extracting five
formed by questions and implicit CRR references ex- relevant examples from the EBA Q&amp;A Train Dataset with
tracted from the user’s text, context information, and the same topic as the user question we want to answer.
oficial response using GPT-4o (used prompt in Appendix These examples served as demonstrations for the model,
C.1). showcasing the ideal structure, language style, and level</p>
        <p>Negative training samples were mined by using the of detail expected in the final responses. Notably, the
seBAAI bge-large-en-v1.5 pre-trained language model [19]. lection process ensured heterogeneity within the chosen
For the CRR Ranker Training we employed a two-phase topic, meaning the examples covered various aspects to
process for negative sample selection: first, all CRR ar- promote a broader understanding. Limiting the number
ticles were encoded using the bge-large-en-v1.5 model, of examples to five struck a balance between providing
and cosine similarity was utilized to rank them relative to diverse demonstrations and maintaining cost-eficiency
the user’s question; second, a set of 20 negative examples during inference, as the LLM’s input token length has
was randomly chosen from a pre-defined ranking inter- limitations.
val (250-300). The choice of 20 negative samples provides
a good balance between computational eficiency and 2.4. Answer Generation
the availability of enough training data. This approach
aimed to balance the representation of relevant and irrel- Figure 2 in Appendix B.1 details how we construct a
evant information within the training data, ensuring the comprehensive prompt that enhances GPT-4o’s ability
model learns to distinguish between the user’s query and to efectively answer user questions. The final prompt in
potentially related but ultimately of-topic CRR articles Appendix C.2 integrates the enriched context (extracted
[20]. CRR references) and the example enrichment
(demon</p>
        <p>The final dataset comprised 12,533 unique "question- strations of desired response format, tone, and content).
article" pairs with positive and negative labels. This data This comprehensive prompt is fed to GPT-4o through the
was split into training (10,179 pairs) and development OpenAI API, enabling it to generate a well-reasoned and
(2,354 pairs) sets for model fine-tuning. This fine-tuning informative response that adheres to the EBA’s
regulaaimed to learn robust semantic representations for ques- tory framework and professional tone.
tions and CRR articles, enabling the model to efectively
identify relevant CRR references for enriching user query 2.5. Comparison with RAG Principles
context.</p>
        <p>We selected the BAAI BGE Reranker v2 m3 model Our multi-step prompt approach aligns with the core
[18] as the basis for our cross-encoder, owing to its task- principles of Retrieval-Augmented Generation (RAG)
specific aptness and its demonstrated superior perfor- while incorporating tailored enhancements that improve
mance relative to the BGE Reranker Large [19], as re- context enrichment for regulatory Q&amp;A tasks. Like RAG,
ported in Section 4. We adopted the Cross-Entropy Bi- our method integrates information retrieval with
lannary Classification loss function, following the approach guage generation, but it adds specialized steps to enhance
suggested in the BGE Rerank Git repository [21]. To context enrichment. These include explicit extraction of
promote stable convergence, we incorporated a warmup CRR references, implicit analysis using LLM capabilities,
schedule ( with a number of steps 0.1 × len(train_data) × and precise retrieval through a dedicated cross-encoder.
num_epochs step) that gradually increases the learning Compared to standard RAG, which often relies on
singlerate during the initial phase of training. The entire fine- stage retrieval, our structured multi-step process adds a
tuning process was conducted over 4 epochs. We em- higher level of granularity, including example enrichment
ployed an evaluation interval of 800 steps during training through few-shot prompts. This ensures not only factual
and saved the model that achieved the highest F1 score accuracy but also alignment with domain-specific
lanon the development set. guage standards, ultimately improving response quality</p>
        <p>Finally, we evaluated the model’s retrieval ability of for complex regulatory inquiries. Overall, our approach
CRR items for a given user question on EBA Q&amp;A Test extends the RAG principles to generate tailored,
contextually enriched answers, which is particularly beneficial
for the intricate requirements of regulatory compliance.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. LLM Evaluator</title>
      <p>over a human one, especially for tasks involving prompt
optimization and evaluation. The figure in Appendix B.2
illustrates the complete process of evaluating agreement
between the LLM evaluator and the human expert.</p>
      <p>In our pipeline, we employ an LLM Evaluator to assess the 4. Experiments and Results
quality of generated responses, defined in Section 2,
compared to the EBA’s answers already provided. Employing This section describes the results obtained by
measuran LLM Evaluator ofers significant advantages in terms ing retrieval efectiveness and answer quality. Retrieval
of cost-efectiveness and eficiency compared to tradi- performance is measured by the number of relevant
regtional human evaluation/comparison methods. Recent ulations retrieved (recall) using diferent encoder models.
research highlights the potential of LLMs for large-scale Answer quality is then evaluated by a separate LLM,
natural language evaluation tasks [23, 24, 25]. which scores each generated response based on factors</p>
      <p>The evaluation process uses a scale from one to four, like relevance and adherence to EBA legal acts. We
combased on two evaluation criteria: correctness and com- pare the multi-step prompt approach with a few-shot and
pleteness. A generated response is considered correct if zero-shot one focusing on a single topic within the EBA
its content aligns with the information presented in the Q&amp;A framework, specifically Liquidity Risk. Finally, we
oficial answer. Additionally, a response is deemed com- test our Multi-Step pipeline with other LLM models, such
plete if it incorporates all relevant regulatory references as Google Gemini Flash 1.5 and Llama 3.1 70B.
provided in the oficial answer. The following scoring
rubric outlines the evaluation criteria:</p>
      <sec id="sec-2-1">
        <title>4.1. CRR Retrieval</title>
        <sec id="sec-2-1-1">
          <title>To preliminary validate the efectiveness of our LLM</title>
          <p>evaluator, we conducted an experiment using a synthetic • Bi-encoders: all-MiniLM-L6-v2 [27],
gte-large-endataset. This dataset was carefully designed to test var- v1.5 [28], and bge-large-en-v1.5 [19].
ious aspects of language generation and was evaluated • Cross-encoders: bge-reranker-large [19],
bgeby both a human expert and the LLM. The alignment be- reranker-v2-m3 [29, 18].
tween the human expert’s assessments and those of the
LLM was then analyzed. The complete details of the final The detailed results (presented in table 2) show the
prompt used for LLM evaluator are provided in Appendix achieved recall scores on EBA Q&amp;As Test Dataset for
C.3. each model. Our fine-tuned CRR Ranker significantly</p>
          <p>The dataset comprises 60 Q&amp;A pairs, balanced across outperformed all other models, achieving a more than
the four score categories. For each category, two pairs 20% improvement compared to the best pre-trained model
were excluded as they were used as examples for the (bge-large-en-v1.5).
prompt for the LLM evaluator, resulting in a final dataset
of 52 Q&amp;A pairs to measure the alignment between the 4.2. Answer Generation
human and LLM evaluator. Using GPT-4o, we obtained a
Kendall-tau coeficient of 0.77, with a p-value of 6· 10− 11. Here we compare the performance of our multi-step
apThese results justified the adoption of the LLM evaluator proach with a zero-shot one for answering EBA liquidity
• Score 1: The generated answer is completely
incorrect and incomplete compared to the oficial
answer.
• Score 2: The generated answer is incorrect but
either complete or partially complete compared
to the oficial answer . It contains some useful
information found in the oficial answer , but the
main statement is incorrect.
• Score 3: The generated answer is correct but only
partially complete. The main statement matches
the oficial answer , but some information from
the oficial answer is missing.
• Score 4: The generated answer is fully correct and
complete. It is essentially a rephrased version of
the oficial answer with no significant diferences.</p>
          <p>We employed “recall” as the primary metric to assess the
efectiveness of bi and cross encoder models in retrieving
relevant CRR articles based on the information submitted
with the inquiry. “Recall” signifies the proportion of truly
relevant CRR articles retrieved from the dataset compared
to all the pertinent actual articles [26]. In the context of
legal information retrieval, prioritizing the retrieval of
all crucial regulatory information for the inquiry makes
the recall a particularly relevant metric.</p>
          <p>Our primary objective was to identify a model that
delivers exceptional retrieval accuracy while
maintaining computational eficiency. This potentially excluded
models with an extremely large number of parameters,
as they can be computationally expensive to run.</p>
          <p>We conducted a performance comparison between our
ifne-tuned CRR Ranker and several pre-trained models:
0.37
0.39
0.41
0.17
0.24
0.51
0.46
0.48
0.52
0.23
0.31
0.67
0.55
0.57
0.62
0.31
0.39
0.81
0.59
0.63
0.67
0.38
0.44
0.86
risk inquiries, using our LLM as the evaluation system
(Figure in Appendix B.3). To this end, we utilized a subset
of 46 Q&amp;As from our EBA Q&amp;A Test dataset specifically
focused on liquidity risk.</p>
          <p>We tested:</p>
          <p>In this section, we extend our analysis of the multi-step
pipeline by incorporating evaluations using additional
large language models (LLMs), specifically Google
Gemini Flash 1.5 and Llama 3.1 70B. Google Gemini Flash
1.5 is widely recognized for its high-speed processing
• Zero-Shot Approach: for each question, a stan- capabilities and eficiency in response generation,
makdard prompt was provided to the LLM. It encom- ing it a suitable benchmark for comparative performance
passed both the specific query and any relevant analysis. Conversely, Llama 3.1 70B is noted for its
rocontextual information they provided. bustness in handling complex queries while maintaining
• Few-Shot Approach: for each question, a few moderate computational demands, providing an
interexamples were provided along with the query to esting contrast in terms of performance and resource
guide the LLM in generating responses. eficiency.
• Multi-Step Approach: for each question, we Our experimental results indicate that the average
evalcreated prompts following our established multi- uation score achieved by Google Gemini Flash 1.5 was 2.0,
step approach, incorporating context enrichment whereas Llama 3.1 70B attained an average score of 2.2.
and example enrichment (as detailed in previous Notably, these scores did not surpass the performance
sections). of the GPT-4o zero-shot approach, which underscores
The LLM Evaluator assessed each response based on the advanced capabilities of GPT-4o in addressing the
its correctness and completeness relative to the oficial complexities of regulatory compliance inquiries. This
obEBA response. As described in Section 3, the LLM Evalu- servation highlights the inherent strength of GPT-4o in
ator assigned an overall score on a scale of 1 (completely generating accurate and contextually relevant responses,
incorrect and incomplete) to 4 (fully correct and compre- outperforming the other models under similar conditions.
hensive). Future research will focus on an in-depth analysis of</p>
          <p>Table 3 summarizes the evaluation results for re- these models with a view toward optimizing each step
sponses generated by the diferent approaches. The of the multi-step pipeline in a model-specific manner. By
“multi-step” approach consistently achieved higher tailoring our methodology to align with the distinctive
counts in the high-quality rating categories compared to strengths and limitations of each model, we aim to
furboth the “zero-shot” and “few-shot” ones. This demon- ther enhance the overall accuracy and reliability of the
strates that the multi-step approach significantly outper- generated responses.
formed the other methods in terms of response quality.</p>
          <p>The LLM evaluator awarded the multi-step approach an
average score of 2.7, representing a 12.5% improvement 5. Challenges and Advancements
over the zero-shot and few-shot approaches, which both
received an average score of 2.4. Notably, a larger portion Our work has highlighted several key challenges that are
of the responses generated by our multi-step approach worth discussing. One of the primary issues concerns
received scores of 3 or higher, indicating correct answers. the limited size of our test dataset. This constraint arose
In contrast, only 2 out of 46 responses generated by the because we focused on the single topic of Liquidity Risk.
multi-step approach were rated as completely incorrect However, to achieve robust human alignment and ensure
(score 1), compared to 6 such responses for the zero-shot the system addresses diverse user inquiries across EBA
approach and 11 for the few-shot approach. These find- topics, future eforts should prioritize dataset expansion
ings suggest that the context enrichment in the multi-step and human evaluation integration.
prompts efectively guides the primary LLM toward gen- Another topic for reflection is that the study
emphaerating more comprehensive and informative responses sizes the need to retrieve relevant CRR articles. Future
that accurately reflect the EBA regulations. research could investigate methods to further refine the
generated responses by incorporating legal reasoning domains with complex regulatory frameworks like
regand argumentation capabilities into the LLM [30, 31], ulatory reporting. Even at this early stage, the tool has
and the most relevant Q&amp;As as examples for few-shot demonstrated its ability to make the work of the human
prompting [6]. analyst more eficient. Future research directions include</p>
          <p>It is also crucial to underscore the importance of op- exploring the use of diferent LLM architectures and
intimizing prompts for this kind of application, and we vestigating alternative methods for incorporating human
plan to address this moving forward. Our future research feedback into the prompt construction process. Lastly,
endeavors will focus on investigating automatic prompt exploring the generalization of this approach to other
engineering techniques [32] leveraging the LLM Evalu- regulatory domains would be valuable.
ator as a metric to optimize. These techniques aim to
tailor and optimize prompts based on the specific topic
of inquiries, enhancing overall performance. Acknowledgments</p>
          <p>Moreover, currently we have utilized only one model,
GPT-4o, but we intend to extend our testing to include The authors would like to express their sincere
gratiother models that have demonstrated similar perfor- tude to Vincenzo Capone, Pamela Maggiori, Daniele Bovi,
mance levels in the field of open question answering Fabio Zambuto, Francesca Monacelli, and Roberto
Sab[33]. This will help us identify the most efective model batini (Bank of Italy) for their insightful comments and
for our application with an unbiased evaluation [34]. stimulating discussions on an earlier draft of this paper.</p>
          <p>Similarly, in the context of LLM evaluators, we also in- Their feedback greatly enhanced the clarity and focus of
tend to explore additional models, including open-source our work.
options [35, 36], that have shown strong performance They would also like to thank the anonymous reviewers
in assessing the quality of responses from various LLMs. for their invaluable suggestions and constructive
feedThis approach is expected to increase the correlation back.
between human and LLM evaluations, thereby
enhancing the system’s overall accuracy and reliability. The
scientific community is very active in this area to
better understand the limitations of the diferent types of
models considered as evaluators [37].</p>
          <p>By addressing the identified limitations through
increased human involvement, expanded data coverage,
and domain-specific evaluation methods, we believe it is
possible to enhance the system’s efectiveness and
generalizability across a wide range of regulatory domains.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <sec id="sec-3-1">
        <title>This study explored a novel approach for generating au</title>
        <p>tomated responses to inquiries on the Regulation (EU)
N.2013/575, specifically on the liquidity risk subject. We
proposed a multi-step prompt construction method that
enriches the context to be provided to LLMs, enabling
them to generate more accurate and informative answers.
An LLM Evaluator, which demonstrated strong
agreement with human experts, was employed to compare our
multi-step approach with standard zero-shot and
fewshot methods that lack context enrichment. The quality
of the generated responses was assessed, and our
findings indicate that the multi-step approach significantly
outperforms both the zero-shot and few-shot methods,
resulting in responses that are more comprehensive and
accurate in relation to the EBA regulation. These
results suggest that the multi-step prompt construction is
a promising approach for enhancing LLM performance
in legal information retrieval tasks, particularly within</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Dataset</title>
    </sec>
    <sec id="sec-5">
      <title>B. Multi-Step Generation and Evalutation</title>
      <sec id="sec-5-1">
        <title>B.1. Multi-Step Approach for Answer Generation</title>
      </sec>
      <sec id="sec-5-2">
        <title>B.2. LLM evaluator Alignment</title>
      </sec>
      <sec id="sec-5-3">
        <title>B.3. Multi-Step vs. Zero-Shot</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>C. Prompt template</title>
      <sec id="sec-6-1">
        <title>C.1. Extracting Law References</title>
        <p>#task
Extract from the text (#text) any reference to regulatory documents contained in it and insert them into a
list (e.g. ["regulatory document name": ["article 1","article 2",...]]). I will provide you an example (#text
(example)) and the expected output (#output (example)):
#text (example) "In accordance with Article 425 (1) of Regulation (EU) No. 575/2013 (CRR)
institutions may exempt contractual liquidity inflows from borrowers and bond investors arising from mortgage
lending funded by covered bonds eligible for preferential treatment as set out in Article 129b (4-6) of CRR
or by bonds as referred to in Article 52(4) of Directive 2009/65/EC from the 75% inflow cap."
#output (example) "["Regulation (EU) No. 575/2013 (CRR)": ["425","129b"], "Directive 2009/65/EC" : ["52"]]"
#text
&gt; text_to_extract
#output (list only)</p>
        <p>This prompt was used to extract any reference to regulatory documents from the provided text_to_extract ) (placeholder
to input text)</p>
      </sec>
      <sec id="sec-6-2">
        <title>C.2. Answer Generation</title>
        <p>" #system
You are a virtual assistant for the European Banking Authority (EBA), handling user inquiries related to
Liquidity Risk regulations. The user’s query specifically pertains to Regulation (EU) No. 575/2013 (CRR) or
Delegated Regulation (EU) No. 2015/61 (LCR DA)."""
#task
Answer the question based on the instructions below.
1. Analyze the User’s Question (#question):
- Identify the central topic and relevant keywords related to Liquidity Risk and the specified EBA regulations.
2. Leverage the Provided Context (#context):
- Incorporate the context (including CRR articles and additional information) to tailor the answer to the
user’s specific scenario.
3. Liquidity Risk Topic:
- Reference relevant articles from provided context (#context) that address the specific aspect of Liquidity
Risk raised in the question. 4. Desired Answer (#answer):
- Use only the information provided in the context and examples (if provided) to answer the question.
- Craft a well-reasoned and informative response that covers all aspects of the user’s query.
- Clearly articulate the regulatory implications while considering the provided context.
- Maintain a professional and informative tone suitable for the EBA.</p>
        <p>#examples:
Example 1: &gt; example_1
Example 2: &gt; example_2
Example 3: &gt; example_3
Example 4: &gt; example_4
Example 5: &gt; example_5
#question:
&gt; question
#context:
&gt; context</p>
        <p>&gt; enhanced_context
#answer:</p>
        <p>This prompt was used to generate answer given a question and context. #examples section (placeholder to include
5 examples) and enhanced_context (placeholder to include CRR articles), highlighted in yellow, were used only for
multi-step approach.</p>
        <sec id="sec-6-2-1">
          <title>Gpt4-omni Prompt</title>
          <p>I will provide you with two answers to a question. One is the #oficial answer, which serves as the
benchmark. The other is the #generated answer, which needs to be evaluated against the #oficial answer.
You must compare the answers step by step.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>Consider the following definitions for this evaluation: - Correctness: A #generated answer is correct if its content aligns with that of the #oficial answer. - Completeness: A #generated answer is complete if it includes all the information present in the #oficial answer.</title>
          <p>Your task is to act as an evaluator and rate the #generated answer according to the following scale:
RATING 1: The #generated answer is completely incorrect and incomplete compared to the #oficial answer.
RATING 2: The #generated answer is incorrect but either complete or partially complete compared to the
#oficial answer. It contains some useful information found in the #oficial answer but the main statement is
incorrect.</p>
          <p>RATING 3: The #generated answer is correct but only partially complete. The main statement matches the
#oficial answer, but some information from the #oficial answer is missing.</p>
          <p>RATING 4: The #generated answer is fully correct and complete. It is essentially a rephrased version of the
#oficial answer with no significant diferences.</p>
          <p>Please provide a single numerical rating (1-4) followed by a brief explanation for your rating
&lt;EXAMPLE 1&gt;
...
&lt;EXAMPLE 8&gt;
Compute the score in the following case:
#question
&gt; question
#background
&gt; background
#oficial answer
&gt; answer
#generated answer
generated answer</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .findings-acl.
          <volume>858</volume>
          . [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>an Lu</surname>
          </string-name>
          ,
          <source>H. yu Kao</source>
          ,
          <year>0x</year>
          .yuan at semeval-2024
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>task 5: Enhancing legal argument reasoning with</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>on Semantic Evaluation</source>
          ,
          <year>2024</year>
          . URL: https://api.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          semanticscholar.org/CorpusID:270765544. [32]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Axmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pryzant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khani</surname>
          </string-name>
          , Prompt
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>engineering a prompt engineer</source>
          ,
          <year>2024</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          //arxiv.org/abs/2311.05661. arXiv:
          <volume>2311</volume>
          .
          <fpage>05661</fpage>
          . [33]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xia</surname>
          </string-name>
          , P. Liu, Olympicarena
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>far?</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.16772.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>arXiv:2406</source>
          .
          <fpage>16772</fpage>
          . [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Panickssery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          , Llm eval-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>tions</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.13076.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>arXiv:2404</source>
          .
          <fpage>13076</fpage>
          . [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Suk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Prometheus 2: An open source language model</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>els</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.01535.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>arXiv:2405</source>
          .
          <fpage>01535</fpage>
          . [36]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Suk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>models with language models</source>
          ,
          <year>2024</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          //arxiv.org/abs/2406.05761. arXiv:
          <volume>2406</volume>
          .
          <fpage>05761</fpage>
          . [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>els for llm evaluation</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>abs/2403</source>
          .02839. arXiv:
          <volume>2403</volume>
          .
          <fpage>02839</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>