Preparing AI for Compliance: Initial Steps of a Framework for Teaching LLMs to Reason About Compliance

Barbara Makovec1,2, Luis Rei1,∗ and Inna Novalija1

1 Institut "Jožef Stefan", Jamova 39, Ljubljana, Slovenia
2 Faculty of Mathematics and Physics, University of Ljubljana, Jadranska 19, Ljubljana, Slovenia

Abstract

The integration of powerful Large Language Models into diverse applications has been rapid, but it faces significant challenges due to the complexity of global regulatory and ethical frameworks, such as those in the GDPR and the AI Act. To address the need for AI systems that can navigate these compliance requirements, we propose a tool designed to create a specialized dataset for training AI assistants in regulatory and ethical reasoning, and we present its initial implementation. Our approach uses a Retrieval-Augmented Generation (RAG) method that preserves the structure of legal texts, ensuring accurate retrieval and interpretation of relevant provisions. The tool automates the generation of compliance reasoning data by selecting relevant legal and ethical guidelines and explaining how they impact real-world examples of AI technologies. This is to be followed by a refinement process that ensures only the best candidates are presented to the annotators. We aim to facilitate the development of AI-driven compliance assistants that can effectively align with global legal and ethical standards.

Keywords

Large Language Models (LLMs), Regulatory Reasoning, Retrieval-Augmented Generation (RAG), Chain-of-Thought, Text Mining, AI Governance, Fair Transparent and Trustworthy AI, Artificial Intelligence (AI) Compliance

1. Introduction

In recent years, we have witnessed the disruptive emergence of powerful Large Language Models, which can be utilized as ready-to-deploy AI services with minimal effort. Their rapid adoption spans from small-scale single-developer projects to critical integrations within Fortune 500 companies.
Simultaneously, a plethora of legislation, regulations, ethical guidelines, and policy goals has emerged in the technology and data sectors, such as the GDPR¹, the Data Governance Act², the Data Act³, and the Artificial Intelligence Act⁴. The rapid technological advancement, coupled with diverse and evolving regulatory landscapes across different countries, presents significant challenges for developers, data scientists, researchers, regulators, and policymakers.

We believe that leveraging Large Language Models (LLMs) to explain, review, and assess AI models, datasets, and complete pipelines from the perspective of legislation, regulations, ethical guidelines, and social impact can help address these challenges. For instance, a data scientist developing a new pipeline could check compliance with EU and USA regulations by submitting the pipeline description, along with each dataset and model card, to the compliance assistant. By selecting the relevant jurisdictions, potential issues can be identified early in the development process, facilitating faster progress before a more detailed review by the company's compliance experts.

Beyond just understanding the law, any general solution will likely require some form of Retrieval-Augmented Generation (RAG) in which the LLM reasons over the specific set of retrieved compliance requirements that apply to a single product, service, or company at a given point in time within a certain jurisdiction. The first step towards developing a "compliance assistant" is to build datasets that can be used to teach and evaluate the assistant in this complex task. Annotating and labeling this data demands the expertise of legal professionals to ensure accuracy, making the process both time-consuming and expensive. To address this challenge, we propose a framework that generates high-quality examples for annotation (Figure 1). In this paper, we discuss the details of the first part, the initial generation of examples.

RuleML+RR'24: Companion Proceedings of the 8th International Joint Conference on Rules and Reasoning, September 16–22, 2024, Bucharest, Romania
∗ Corresponding author.
makovecbarbara1@gmail.com (B. Makovec); luis.rei@ijs.si (L. Rei); inna.koval@ijs.si (I. Novalija)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

¹ https://eur-lex.europa.eu/eli/reg/2016/679/oj
² https://digital-strategy.ec.europa.eu/en/policies/data-governance-act
³ https://digital-strategy.ec.europa.eu/en/policies/data-act
⁴ https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

2. Related Work

Our ultimate goal of creating a compliance assistant is not conceptually unique. For example, Gracenote.ai⁵ is an AI-driven platform for regulatory compliance, while the legal AI CoCounsel⁶ from Thomson Reuters includes contract compliance features. CuratedAI⁷ uses a RAG approach to answer legal questions about EU laws and regulations. In research, as overall systems, we highlight DISC-LawLLM, which includes a retriever with access to a knowledge base of Chinese laws [1], and Chatlaw, which dynamically builds a case-specific Knowledge Graph by various methods within a multi-agent system and answers using a RAG approach [2]. Several public datasets evaluate LLM assistants' legal reasoning, such as LegalBench [3] and the Contract Understanding Atticus Dataset [4]. Our goal is slightly different, as we want to reason about the compliance of AI tools with a variable set of provisions. Given an LLM that is instructed to reason only on specific retrieved provisions, the user can select which provisions are considered by choosing those that can be retrieved, e.g. only laws that apply in the EU, plus provisions that apply to the financial sector, plus the user's own ethical guidelines.
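The jurisdiction- and sector-based selection of provisions described above can be sketched as a metadata filter applied before similarity search: only provisions whose tags overlap the user's selection are retrievable. The field names, tags, and toy two-dimensional embeddings below are illustrative assumptions, not part of our implementation.

```python
# Sketch: restrict retrieval to provisions the user has selected
# (e.g. EU laws plus finance-sector rules). All tags, IDs, and the
# toy embeddings here are hypothetical, for illustration only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_emb, provisions, allowed_tags, k=2):
    """Similarity search over the subset of provisions whose tags
    intersect the user-selected set; un-selected provisions can
    never appear in the results."""
    pool = [p for p in provisions if p["tags"] & allowed_tags]
    pool.sort(key=lambda p: dot(query_emb, p["embedding"]), reverse=True)
    return pool[:k]

provisions = [
    {"id": "EU-AI-Act-Art-5", "tags": {"eu"},            "embedding": [0.9, 0.1]},
    {"id": "US-State-Rule-1", "tags": {"us"},            "embedding": [0.8, 0.2]},
    {"id": "Finance-Guide-3", "tags": {"eu", "finance"}, "embedding": [0.2, 0.9]},
]

# Only EU and finance provisions are retrievable for this user.
hits = retrieve([1.0, 0.0], provisions, allowed_tags={"eu", "finance"})
print([p["id"] for p in hits])
```

Filtering before scoring (rather than after) guarantees the excluded jurisdiction can never leak into the prompt, which is the property the LLM's "reason only on retrieved provisions" instruction relies on.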
For generating better responses, Chain-of-Thought prompting enhances LLM reasoning by generating intermediate steps [5], and LLMs can perform zero-shot reasoning when "Let's think step by step" is added before the answer [6]. Self-Consistency improves on this by sampling diverse reasoning paths and selecting the most consistent answer [7]. Additionally, LLMs can self-improve by generating high-confidence, rationale-augmented answers and fine-tuning themselves on them [8]. The SELF-DISCOVER framework allows LLMs to self-compose reasoning structures from atomic modules [9], and the Self-Instruct framework enhances instruction-following capabilities through self-generated instructions [10].

In ranking and selecting model responses, the use of strong LLMs as judges to evaluate responses to open-ended questions has become one of the most popular options [11]. Building on this, a Panel of LLM evaluators (PoLL) has been proposed to provide a more diverse and balanced evaluation [12]. The Llama Guard model introduces an LLM-based input-output safeguard for classifying and evaluating responses that can filter out undesirable ones [13]. Self-Refine introduces an iterative feedback mechanism in which an LLM generates an initial output, provides feedback on its own output, and then refines the output based on this feedback [14]. The utility of LLM critics has been demonstrated in code and mathematics evaluation, where LLMs provide natural language feedback that highlights issues in code [15] or proofs [16].

3. Data and Methods

We focus on the candidate generation phase of our framework (shown in Figure 1). This process uses a RAG approach, starting with the selection of examples from our database, which includes news articles about specific AI technologies or incidents, GitHub README files from AI-related repositories, and Hugging Face model and dataset cards.
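The Self-Consistency idea [7] mentioned above can be sketched in a few lines: sample several reasoning paths for the same question and majority-vote their final answers. The `sample_answer` stub below is a deterministic stand-in for a real LLM call at nonzero temperature; its answers and the compliance labels are purely illustrative.

```python
# Minimal sketch of self-consistency voting over sampled answers.
# sample_answer is a hypothetical stand-in for one LLM reasoning path.
from collections import Counter

def sample_answer(question, seed):
    """Deterministic stub: most simulated 'paths' agree, some dissent.
    A real system would sample the LLM with temperature > 0."""
    return "compliant" if seed % 5 else "non-compliant"

def self_consistent_answer(question, n_samples=15):
    """Sample n reasoning paths and return the most common final answer."""
    answers = [sample_answer(question, seed) for seed in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

verdict = self_consistent_answer("Does Article 5 apply to this system?")
print(verdict)  # majority of the stubbed paths say "compliant"
```

In a candidate-generation setting like ours, the same voting step could serve as a cheap first filter before the more expensive LLM-judge and refinement stages.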
The next step retrieves relevant sections of legal and ethical provisions from our knowledge base, identified through similarity search. The retrieved provisions are then combined, and the language model is prompted to reason about and explain how they impact the selected example using a zero-shot Chain-of-Thought (CoT) prompt [6].

A common limitation of many RAG pipelines is their disregard for the structural integrity of documents, often dividing them into uniform-length chunks. This can lead to critical oversights, especially when dealing with legal documents, which are typically organized into articles and paragraphs. We employ a systematic approach to structuring and querying legal documents for efficient retrieval and compliance analysis, as described in Algorithm 1. The legal document 𝐿 is divided into its pre-defined articles and paragraphs as they are structured in the base document. Each paragraph is further segmented into overlapping passages of fixed length 𝑠 with an overlap 𝑜 to maintain context across segments. Each passage is then encoded using a dense retrieval embedding model. When querying, we embed the query and compute the dot-product similarity between the embeddings of the query and the stored passages. We retrieve the top 𝑘 passages with the highest scores, look up the articles to which these passages belong, and generate a prompt from a predefined template and 𝑛 of these articles. The prompt asks the LLM to analyze step by step [6] the implications of the provided legislative articles with respect to the query.

⁵ https://gracenote.ai/
⁶ https://casetext.com/cocounsel/
⁷ https://www.curatedai.eu/

Figure 1: Our framework leveraging RAG and an LLM to generate, judge, criticize, and refine candidate examples.
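The structure-preserving indexing and retrieval just described can be sketched as follows. The character-frequency "embedding" is a toy stand-in for a dense retrieval model such as BGE, and the article texts are invented; everything else follows the article → paragraph → overlapping-passage scheme of the paper.

```python
# Sketch: index a legal document by article/paragraph, cut paragraphs
# into overlapping passages of length s with overlap o, embed each
# passage, and map retrieved passages back to their source articles.
# embed() is a toy stand-in for a dense retriever (e.g. BGE).

def embed(text):
    """Toy embedding: L2-normalized letter-frequency vector (26 dims)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]

def passages(paragraph, s, o):
    """Overlapping character windows of length s with overlap o."""
    step = s - o
    return [paragraph[i:i + s] for i in range(0, max(len(paragraph) - o, 1), step)]

def index(articles, s=40, o=10):
    store = []  # (article_id, passage_embedding) pairs
    for art_id, pars in articles.items():
        for par in pars:
            for g in passages(par, s, o):
                store.append((art_id, embed(g)))
    return store

def retrieve_articles(store, query, k=3, t=0.0):
    """Top-k passages above threshold t, mapped back to unique articles."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    q = embed(query)
    scored = sorted(store, key=lambda item: dot(q, item[1]), reverse=True)
    top = [item for item in scored[:k] if dot(q, item[1]) >= t]
    return list(dict.fromkeys(art_id for art_id, _ in top))  # order-preserving dedup

articles = {
    "Article 5": ["Prohibited practices include subliminal manipulation."],
    "Article 10": ["Training data shall be subject to data governance."],
}
store = index(articles)
print(retrieve_articles(store, "rules about training data governance"))
```

Mapping hits back to whole articles, rather than feeding raw chunks to the LLM, is what preserves the legal context a provision-level citation needs.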
Algorithm 1: Legal Text Indexing and Retrieval-Augmented Generation
Input: legal document 𝐿, query 𝑄, embedding model 𝐸, parameters 𝑠, 𝑜, 𝑘, 𝑡, 𝑛
Output: LLM-generated candidate responses based on 𝑄
1: Indexing:
2: Split 𝐿 into articles 𝒜 = {𝐴1, …, 𝐴𝑋}
3: for each article 𝐴𝑥 in 𝒜 do
4:   Split 𝐴𝑥 into paragraphs {𝑃𝑥1, …, 𝑃𝑥𝑌}
5:   for each paragraph 𝑃𝑥𝑦 in 𝐴𝑥 do
6:     Partition 𝑃𝑥𝑦 into overlapping passages 𝑔𝑥𝑦𝑧 of length 𝑠 with overlap 𝑜
7:     Encode each 𝑔𝑥𝑦𝑧 using model 𝐸
8:   end for
9: end for
10: Retrieval-Augmented Generation:
11: Encode query 𝑄 using model 𝐸
12: Compute similarity scores between the encoded 𝑄 and each encoded passage 𝑔𝑥𝑦𝑧
13: Retrieve the top 𝑘 passages {𝑔1, …, 𝑔𝑘} with a similarity score ≥ 𝑡
14: Get the subset of articles 𝒜𝑢 to which the passages {𝑔1, …, 𝑔𝑘} belong
15: for each subset of up to 𝑛 articles in 𝒜𝑢 do
16:   Construct prompt 𝑀𝑄 and obtain LLM response 𝑅𝑄
17: end for

In our initial experiments, we used the EU AI Act as our legislative text, with queries consisting of sentences reporting on AI-related incidents from the news, dataset and model cards, and open-source AI project README files. The retrieval model was the small BGE [17] model⁸ for dense retrieval, while the LLM was GPT-4 [18]. The parameters, determined heuristically, were 𝑠 = 184, 𝑜 = 30, 𝑘 = 10, and 𝑡 = 0.3. We explored creating queries with both 𝑛 = 1 and 𝑛 = 𝑘; the choice determines how many articles are included in a single query. An example prompt template is shown in Listing 1.

Listing 1: Example Prompt for Legal Compliance Analysis

Consider the following articles of legislation, provided between triple backticks, and nothing else:
```{articles}```
Under these articles and only these articles, and ignoring those that are not applicable, as a legal compliance expert, answer: what are the implications of that legislation for the following {example type}, provided between triple backticks:
```{query}```
Let's think step by step.

4.
Conclusions and Future Work

In this work, we introduced the initial phase of a framework and tool designed to prepare datasets for training Large Language Models (LLMs) to perform compliance reasoning in AI applications. Our approach preserves the critical structure and content of legal provisions within a Retrieval-Augmented Generation (RAG) setting, ensuring more accurate and contextually aware reasoning.

The proposed framework offers significant advantages for companies developing and deploying AI systems across different regulatory landscapes. By integrating a compliance assistant into the AI development process, companies can proactively ensure that their models and data pipelines comply with complex regulations, identify potential legal issues early in the development cycle, and streamline the process by reducing the need for extensive manual reviews by legal experts. As a result, companies can reduce compliance risks, accelerate time-to-market, and maintain high standards of ethical and legal accountability in their AI initiatives.

Looking ahead, our next steps will focus on the implementation of the refinement loop. Additionally, we plan to explore the tool's potential use by the public and policymakers to raise awareness and deepen understanding of AI technologies and the associated regulatory landscape.

5. Acknowledgments

This work was supported by the European Union through the enrichMyData HORIZON-IA project under grant agreement No 101070284 and the ELIAS HORIZON-RIA project under grant agreement No 101120237.

References

[1] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, Y. Xiao, S. Yun, W. Lin, X. Huang, Z. Wei, DISC-LawLLM: Fine-tuning large language models for intelligent legal services, 2023. arXiv:2309.11325.
[2] J. Cui, M. Ning, Z. Li, B. Chen, Y. Yan, H. Li, B. Ling, Y. Tian, L. Yuan, Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model, 2024. arXiv:2306.16092.
⁸ https://huggingface.co/BAAI/bge-small-en-v1.5

[3] N. Guha, et al., LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models, in: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023, pp. 44123–44279.
[4] D. Hendrycks, C. Burns, A. Chen, S. Ball, CUAD: An expert-annotated NLP dataset for legal contract review, in: J. Vanschoren, S. Yeung (Eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
[5] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022, 2022, pp. 24824–24837.
[6] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022, 2022, pp. 22199–22213.
[7] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023, 2023.
[8] J. Huang, S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, J. Han, Large language models can self-improve, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6–10, 2023, Association for Computational Linguistics, 2023, pp. 1051–1068. doi:10.18653/V1/2023.EMNLP-MAIN.67.
[9] P. Zhou, et al., Self-Discover: Large language models self-compose reasoning structures, 2024. arXiv:2402.03620.
[10] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-Instruct: Aligning language models with self-generated instructions, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9–14, 2023, Association for Computational Linguistics, 2023, pp. 13484–13508. doi:10.18653/V1/2023.ACL-LONG.754.
[11] L. Zheng, et al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023.
[12] P. Verga, S. Hofstätter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, P. S. H. Lewis, Replacing judges with juries: Evaluating LLM generations with a panel of diverse models, 2024. arXiv:2404.18796.
[13] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, M. Khabsa, Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023. arXiv:2312.06674.
[14] A. Madaan, et al., Self-Refine: Iterative refinement with self-feedback, in: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023, pp. 46534–46594.
[15] N. McAleese, R. M. Pokorny, J. F. C. Uribe, E. Nitishinskaya, M. Trebacz, J. Leike, LLM critics help catch LLM bugs, 2024. arXiv:2407.00215.
[16] B. Gao, et al., LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback, 2024. arXiv:2406.14024.
[17] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-Pack: Packaged resources to advance general Chinese embedding, 2023. arXiv:2309.07597.
[18] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.