<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Web Information Systems</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1108/ijwis-12-2023-0256</article-id>
      <title-group>
        <article-title>Integrating KGs and Ontologies with RAG for Personalised Summarisation in Regulatory Compliance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Umair Arshad</string-name>
          <email>U.Arshad1@rgu.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Corsar</string-name>
          <email>d.corsar1@rgu.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikechukwu Nkisi-Orji</string-name>
          <email>i.nkisi-orji@rgu.ac.uk</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>20</volume>
      <issue>2024</issue>
      <abstract>
        <p>As regulatory texts grow in complexity and volume, remaining compliant is fast becoming a significant challenge for organisations. Traditional ways of summarising legal texts do not adequately accommodate critical, domain-specific requirements, which makes the process inefficient and exposes organisations to the risk of non-compliance. This paper therefore proposes a solution that integrates ontologies and Knowledge Graphs (KGs) with the Retrieval-Augmented Generation (RAG) paradigm to aid process automation and improve regulatory compliance. The approach aims to offer deep semantic understanding, accurate contextual summaries, and personalised insights relevant to users' needs. In turn, this will assist organisations in operating with more precision and confidence in an ever-changing regulatory environment.</p>
      </abstract>
      <kwd-group>
        <kwd>Regulatory Compliance</kwd>
        <kwd>KGs</kwd>
        <kwd>RAG</kwd>
        <kwd>Personalised Summarisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Regulations are becoming more complex and numerous, and they change more frequently, making
compliance a continuing challenge for companies. Compliance teams
must constantly adapt to meet evolving regulatory requirements [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In practice, they manage
regulatory documents, find the relevant content, and match it to their organisation’s situation,
which is a complex, manual, and inefficient process [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The process could benefit greatly from
natural language understanding technologies that provide automated support
by identifying and summarising the parts of regulatory documents related to the specific task at
hand. Most research into text summarisation of legal documents has focused on extractive approaches,
which assemble key phrases from a text to produce a summary [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, these techniques are
ineffective in domains with a closed vocabulary, including legal and regulatory documents, where capturing
semantic relationships and contextual dependencies is essential [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such documents typically
contain complex clauses, conditions, and implicit references that require in-depth interpretation, beyond
what surface-level extraction can provide.
      </p>
      <p>Therefore, more advanced natural language processing (NLP) techniques are needed to account
for the subtleties of regulatory documents, including the implicit semantic relations expressed
throughout them, both when interpreting the documents and when generating outputs such as summaries.
Ontologies are a well-established means of describing the concepts within a domain and the
relationships between them. They provide a schema for knowledge graphs (KGs), which capture
individual instances of those concepts and their interconnections. The domain knowledge captured in
ontologies and KGs can be used to support question-answering tasks [5]. Creating
ontologies and KGs is a complex process that demands significant domain knowledge
and can be particularly challenging in changing environments such as regulation. However, approaches to
automated KG construction, which extract knowledge and structure from unstructured documents, have the
potential to reduce ontology maintenance overheads [6, 7, 8].</p>
      <p>Combining recent advances in NLP with ontologies specific to the regulatory domain has the potential
to improve the effectiveness of systems that support compliance teams within organisations. One
such advance is the Retrieval-Augmented Generation (RAG) architecture. RAG combines representations
of unstructured information, in the form of text embeddings, with additional knowledge (such as
that from ontologies) to improve outputs compared with those of Large Language Models (LLMs) alone on
similar tasks [9]. Learning how to generate embeddings of both the data (in this case, regulatory
documents) and the user’s query (e.g. a request for a summary of updates) is a critical part of this process.
It enables the system to retrieve more accurate results when searching for potential answers by
utilising the domain knowledge in the ontology. To support this process, [10] demonstrated how
a cross-attention mechanism allows the system to weigh and prioritise the
retrieved information according to its relevance to the query, highlighting the most important details
in the outputs. This approach aims to address some of the challenges that LLMs face when analysing
domain-specific content [11]. The KG can also be used to evaluate the system’s output, providing
domain constraints against which the factual accuracy of the generated outputs can be assessed.</p>
      <p>SICSA REALLM Workshop 2024, CEUR (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach</title>
      <p>This position paper proposes combining regulatory domain knowledge expressed in ontologies and
KGs with the text understanding and generation capabilities of LLMs within a RAG architecture, to
enhance the automated support available for monitoring regulatory compliance. Figure 1 outlines the
proposed architecture. The first step is data preprocessing: this step takes inputs from
different sources, such as ontologies and regulation documentation files, and performs the
necessary data cleaning and processing to create and store embeddings of the domain knowledge
in each source. Next, when a user’s query is received, a query embedding is obtained by encoding
it with a pre-trained language model, such as BERT [12]. The model captures the syntactic
and semantic content of the query in a vector representation, which is needed to retrieve relevant
information from the data sources. The query embedding is then used to search both a vector
database of the unstructured text and the KG (i.e. structured data). Vector databases, such as Milvus
[13], are used to quickly retrieve relevant regulatory text; a Neo4j graph database is used to store the
ontology and knowledge graph, supporting retrieval of relevant concepts and entities [14]. Graph
embedding techniques such as TransE can be used to generate vector embeddings for the entities and relationships
in the KG [15]. Cosine similarity or another similarity metric is then
used to compare the user query embedding with the KG and text embeddings to retrieve contextually
relevant information. This dual retrieval process is intended to ensure that the returned information
is appropriate and relevant to the user’s query by combining structured and
unstructured information about regulations.</p>
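      <p>The dual retrieval step can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed system: plain Python dictionaries stand in for Milvus and Neo4j, hand-written three-dimensional vectors stand in for BERT and TransE embeddings, and the names <monospace>top_k</monospace>, <monospace>dual_retrieve</monospace>, and the clause and concept identifiers are hypothetical. It also assumes that query, text, and KG embeddings already share one vector space, which in practice would require an alignment step.</p>
      <preformat><![CDATA[
```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Vector, index: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k index entries most similar to the query, best first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def dual_retrieve(query: Vector, text_index: Dict[str, Vector],
                  kg_index: Dict[str, Vector], k: int = 2) -> Dict[str, List[Tuple[str, float]]]:
    """Search the unstructured-text index and the KG index with the same query embedding."""
    return {"text": top_k(query, text_index, k), "kg": top_k(query, kg_index, k)}


# Toy embeddings (assumed values, for illustration only).
text_index = {
    "clause_12_reporting": [0.9, 0.1, 0.0],
    "clause_30_penalties": [0.1, 0.9, 0.1],
    "clause_44_definitions": [0.0, 0.2, 0.9],
}
kg_index = {
    "ReportingObligation": [0.8, 0.2, 0.1],
    "Penalty": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

results = dual_retrieve(query, text_index, kg_index, k=1)
print(results)
```
]]></preformat>
      <p>Keeping the two indexes separate until after scoring means either store can be swapped for a dedicated vector or graph database without changing the ranking logic.</p>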
      <p>After retrieving relevant information, the system performs a retrieval evaluation to rank
the retrieved data by relevance and importance. For each retrieved piece of information, a
cross-attention mechanism [11] can be used to calculate an attention score that indicates
how strongly the retrieved information correlates with the user’s query. A context vector
consisting of the most relevant features from the knowledge graph and retrieved text is generated by
pooling the highest-ranked results. This context vector is then processed by Transformer layers that
refine the information further and generate a response aligned with the query.</p>
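      <p>One way to read the scoring and pooling step is as single-query scaled dot-product attention over the retrieved embeddings. The sketch below is a simplification made for illustration: it omits the learned query, key, and value projections that a trained cross-attention layer would have, and the function names and toy vectors are assumptions. Attention weights are computed for every retrieved item, the highest-weighted items are kept, and their embeddings are averaged using the renormalised weights to form the context vector.</p>
      <preformat><![CDATA[
```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention weights of one query over the retrieved items."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def pooled_context(query: Sequence[float], items: Sequence[Sequence[float]],
                   top_k: int) -> List[float]:
    """Weighted average of the top_k items, with weights renormalised over the kept items."""
    weights = attention_weights(query, items)
    ranked = sorted(range(len(items)), key=lambda i: weights[i], reverse=True)[:top_k]
    kept_total = sum(weights[i] for i in ranked)
    context = [0.0] * len(query)
    for i in ranked:
        share = weights[i] / kept_total
        for j, value in enumerate(items[i]):
            context[j] += share * value
    return context


# Toy query and retrieved embeddings (assumed values).
query = [1.0, 0.0]
items = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = attention_weights(query, items)
context = pooled_context(query, items, top_k=2)
print(weights, context)
```
]]></preformat>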
      <p>To improve performance, the generated response goes through an evaluation step before being
presented to the user. In the proposed approach, this step involves checking the accuracy of the
information within the response using domain-specific rules and legal standards stored in the KG.
These act as constraints to ensure that the output is valid, is a semantically accurate presentation of
the legal information in the regulations, and is appropriate for the context. This can be achieved using
methodologies such as those suggested by [16, 17]. If the response is validated, it is enriched with
structured insights from the KG to form a personalised summary, which is passed to the final stage. If the
validation criteria are not met, the system iterates and revises the response, safeguarded by iteration limits to
prevent infinite loops. This is intended to ensure that only high-quality, legally compliant information is passed to the
user.</p>
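      <p>The validate-and-revise loop with an iteration limit can be sketched as below. The checker here is a toy stand-in: it only tests for required phrases, whereas the approach described above would query rules and standards held in the KG, for example with the fact-verification methods of [16, 17]. The reviser is likewise hypothetical and simply appends the first missing phrase, where a real system would regenerate the response with the LLM. All function names and example phrases are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from typing import Callable, List, Optional, Tuple

Checker = Callable[[str], List[str]]
Reviser = Callable[[str, List[str]], str]


def make_checker(required_phrases: List[str]) -> Checker:
    """Toy stand-in for KG-backed validation: list required phrases missing from a draft."""
    def check(draft: str) -> List[str]:
        lowered = draft.lower()
        return [p for p in required_phrases if p.lower() not in lowered]
    return check


def validate_and_revise(draft: str, check: Checker, revise: Reviser,
                        max_iterations: int = 3) -> Tuple[Optional[str], int]:
    """Return (validated draft, checks used), or (None, max_iterations) if the limit is hit."""
    for attempt in range(1, max_iterations + 1):
        violations = check(draft)
        if not violations:
            return draft, attempt
        if attempt < max_iterations:
            draft = revise(draft, violations)
    return None, max_iterations


def append_first_missing(draft: str, violations: List[str]) -> str:
    """Hypothetical reviser; a real system would ask the LLM to regenerate instead."""
    return f"{draft} ({violations[0]})"


# Example constraint phrases are invented for this sketch.
check = make_checker(["within 30 days", "data controller"])
initial = "Breaches must be reported."

summary, checks_used = validate_and_revise(initial, check, append_first_missing)
print(summary, checks_used)
```
]]></preformat>
      <p>Returning <monospace>None</monospace> when the limit is reached makes the failure explicit, so a caller can fall back to expert review rather than present an unvalidated summary.</p>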
      <p>To assess the performance of the proposed solution, several evaluation metrics will be employed:</p>
      <list list-type="bullet">
        <list-item>
          <p>Accuracy will be measured by checking the factual correctness of the output against the corresponding regulations.</p>
        </list-item>
        <list-item>
          <p>Relevance will be assessed using attention scores from the cross-attention mechanism to ensure the retrieved information aligns contextually with the query.</p>
        </list-item>
        <list-item>
          <p>Completeness will be evaluated through manual reviews by domain experts, ensuring that the summary addresses all critical aspects of the query.</p>
        </list-item>
        <list-item>
          <p>User satisfaction will be gauged through surveys and feedback from end-users (e.g., compliance officers) using metrics like Net Promoter Score (NPS).</p>
        </list-item>
        <list-item>
          <p>Efficiency will be tested by benchmarking the system’s response time and scalability performance.</p>
        </list-item>
      </list>
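      <p>As a concrete example of one of these metrics, NPS is conventionally computed from 0 to 10 survey ratings as the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6). The sketch below applies that standard definition; the ratings are invented for illustration and are not results from this work.</p>
      <preformat><![CDATA[
```python
from typing import Sequence


def net_promoter_score(ratings: Sequence[int]) -> float:
    """NPS on a -100..100 scale: % promoters (9-10) minus % detractors (0-6)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must lie between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


# Invented survey responses: 5 promoters, 2 passives, 3 detractors.
example_ratings = [10, 9, 8, 7, 6, 3, 10, 9, 5, 10]
nps = net_promoter_score(example_ratings)
print(nps)
```
]]></preformat>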
      <p>By integrating ontologies, KGs, and LLMs in a RAG architecture, and evaluating performance using the
outlined metrics, this solution aims to simplify the process of monitoring regulatory compliance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Rationale for the Proposed Solution</title>
      <p>
        The proposed approach is expected to be more effective in managing the complications arising from regulatory
compliance by automating conventionally time- and labour-intensive processes and enhancing scalability
and efficiency [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Such automation is required because the regulatory landscape keeps expanding
and manual effort struggles to keep pace. By applying ontologies
and integrating KGs, the model gains a deeper grasp of legal concepts and how they interlink.
As a result, the generated summaries can be more specific and relevant in content and can better
represent the regulatory documents [18]. In doing so, the approach can better serve legal professionals
and organisations and make the whole process more reliable and comprehensive.
      </p>
      <p>The approach is also more agile, since RAG allows the system to remain responsive to an ever-changing regulatory environment. With
RAG, summaries remain up to date with the regulatory requirements, which is important
for compliance in a fast-changing legal world. Cross-attention can further be applied
so that summaries are personalised to the particular needs of different users. This user-specific
context makes the compliance process more intuitive, which would aid better decision-making and
reduce the incidence of non-compliance.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion of Challenges</title>
      <p>We identify several significant challenges in creating such a method for personalised summarisation
in regulatory compliance. One major issue is that legal language is complex: sentences in legal documents
are replete with jargon and have complex structure [19]. The system might therefore
need domain-specific language models and NLP techniques to handle this. Extensive collections of
documents from different regulatory bodies raise performance issues, which call for
solutions such as distributed computing and optimised indexing for processing documents.
Another challenge is ensuring the precision and consistency of the data in KGs and ontologies, since any inaccuracies
will degrade the quality of the summaries. These knowledge structures could be validated regularly,
with automated routine integrity checks. Computational complexity is also critical because,
given the large amounts of data involved, the associated processes are resource-intensive. These
challenges can be mitigated using parallel processing, GPU acceleration, and model optimisation.</p>
      <p>Addressing scalability will make it easier to adapt the system to new regulations as they are issued,
without requiring frequent manual updates. Incremental updates and continuous learning might
be employed to avoid degradation of system performance over time. Additionally, the KGs and ontologies
need to be updated continuously to incorporate regulation changes, which could be handled via automated
update mechanisms backed by expert reviews. By addressing these challenges through optimisation strategies,
continuous updates, and cutting-edge technologies, the proposed solution is intended to be scalable and
reliable. It aims to improve decision-making while reducing the risk of non-compliance in real-world
applications.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, integrating ontologies, KGs and RAG into an overall solution to aid organisations in
regulatory compliance provides a stable and feasible approach. It will offer intelligent support
to organisational efforts in regulatory monitoring activities. To successfully develop this approach,
detailed experiments are needed to quantify how far it overcomes the limitations of relying on LLMs alone in the
legal and regulatory domains. In addition, domain ontologies for regulatory compliance and knowledge
graphs for different regulations will furnish the research community with reusable artefacts for use
in other decision support systems.</p>
      <p>[4, continued] Association for Computational Linguistics, 2023. URL: http://dx.doi.org/10.18653/v1/2023.nllp-1.7. doi:10.18653/v1/2023.nllp-1.7.</p>
      <p>[5] M. Hofer, D. Obraczka, A. Saeedi, H. Köpcke, E. Rahm, Construction of knowledge graphs: State and challenges, 2023. URL: https://arxiv.org/abs/2302.11509. arXiv:2302.11509.</p>
      <p>[6] W. Liang, P. D. Meo, Y. Tang, J. Zhu, A survey of multi-modal knowledge graphs: Technologies and trends, ACM Computing Surveys 56 (2024) 1–41. URL: http://dx.doi.org/10.1145/3656579. doi:10.1145/3656579.</p>
      <p>[7] G. Wang, W. Li, E. Lai, J. Jiang, Katsum: Knowledge-aware abstractive text summarization, 2022. URL: https://arxiv.org/abs/2212.03371. arXiv:2212.03371.</p>
      <p>[8] R. C. Barron, V. Grantcharov, S. Wanna, M. E. Eren, M. Bhattarai, N. Solovyev, G. Tompkins, C. Nicholas, K. Rasmussen, C. Matuszek, B. S. Alexandrov, Domain-specific retrieval-augmented generation using vector stores, knowledge graphs, and tensor factorization, 2024. URL: https://arxiv.org/abs/2410.02721. arXiv:2410.02721.</p>
      <p>[9] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, Q. Li, A survey on rag meeting llms: Towards retrieval-augmented large language models, in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, volume 24 of KDD ’24, ACM, 2024, pp. 6491–6501. URL: http://dx.doi.org/10.1145/3637528.3671470. doi:10.1145/3637528.3671470.</p>
      <p>[10] Y. Zhang, C. Liu, M. Liu, T. Liu, H. Lin, C.-B. Huang, L. Ning, Attention is all you need: utilizing attention in ai-enabled drug discovery, Briefings in Bioinformatics 25 (2023). URL: http://dx.doi.org/10.1093/bib/bbad467. doi:10.1093/bib/bbad467.</p>
      <p>[11] M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, M. M. J. Mim, J. Ahmad, M. E. Ali, S. Azam, A review on large language models: Architectures, applications, taxonomies, open issues and challenges, IEEE Access (2023). URL: http://dx.doi.org/10.36227/techrxiv.24171183.v1. doi:10.36227/techrxiv.24171183.v1.</p>
      <p>[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[13] J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang, X. Guo, C. Li, X. Xu, K. Yu, Y. Yuan, Y. Zou, J. Long, Y. Cai, Z. Li, Z. Zhang, Y. Mo, J. Gu, R. Jiang, Y. Wei, C. Xie, Milvus: A purpose-built vector data management system, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 2614–2627. URL: https://doi.org/10.1145/3448016.3457550. doi:10.1145/3448016.3457550.</p>
      <p>[14] J. J. Miller, Graph database applications and concepts with neo4j, in: Proceedings of the southern association for information systems conference, Atlanta, GA, USA, volume 2324, 2013, pp. 141–147.</p>
      <p>[15] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in neural information processing systems 26 (2013).</p>
      <p>[16] J. Kim, S. Park, Y. Kwon, Y. Jo, J. Thorne, E. Choi, Factkg: Fact verification via reasoning on knowledge graphs, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2023. URL: http://dx.doi.org/10.18653/v1/2023.acl-long.895. doi:10.18653/v1/2023.acl-long.895.</p>
      <p>[17] J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, Gear: Graph-based evidence aggregating and reasoning for fact verification, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019. URL: http://dx.doi.org/10.18653/v1/p19-1085. doi:10.18653/v1/p19-1085.</p>
      <p>[18] F. Sovrano, M. Palmirani, F. Vitali, Legal Knowledge Extraction for Knowledge Graph Based Question-Answering, IOS Press, 2020. URL: http://dx.doi.org/10.3233/faia200858. doi:10.3233/faia200858.</p>
      <p>[19] X. Yang, Z. Wang, Q. Wang, K. Wei, K. Zhang, J. Shi, Large language models for automated Q&amp;A involving legal documents: a survey on algorithms, frameworks and applications, Inter</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Levacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stephenson</surname>
          </string-name>
          ,
          <article-title>Towards automated extraction of business constraints from unstructured regulatory text</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Santa Fe, New Mexico,
          <year>2018</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>160</lpage>
          . URL: https://aclanthology.org/C18-2034.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hinterleitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Steinebach</surname>
          </string-name>
          ,
          <article-title>The growth of policies, rules, and regulations: A review of the literature and research agenda</article-title>
          ,
          <source>Regulation &amp; Governance</source>
          <volume>18</volume>
          (
          <year>2023</year>
          )
          <fpage>637</fpage>
          -
          <lpage>654</lpage>
          . URL: http://dx.doi.org/10.1111/rego.12511. doi:10.1111/rego.12511.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          ,
          <article-title>Extractive summarization of legal decisions using multi-task learning and maximal marginal relevance</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2022</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <year>2022</year>
          . URL: http://dx.doi.org/10.18653/v1/2022.findings-emnlp.134. doi:10.18653/v1/2022.findings-emnlp.134.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Mixed-domain language modeling for processing long legal documents</article-title>
          ,
          <source>in: Proceedings of the Natural Legal Language Processing Workshop</source>
          <year>2023</year>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>