Graph Representational Learning for Internal Audit
Pai, Sumit (1,∗,†), Singh, Vivek Kumar (1,†), Gupta, Sanvi (1), Chavali, Pavani (1),
Siddhartha, Siddhartha (1), Bowen, Edward (2) and Tiyyagura, Sunil Reddy (1)
(1) Deloitte & Touche Assurance & Enterprise Risk Services India Private Limited
(2) Deloitte & Touche LLP


Abstract
This work aims to improve the quality of Internal Audits (IA), which are a critical part of an
organization’s governance structure and serve as the third line of defense, helping provide
assurance that controls and processes have adequate risk mitigation strategies in place. We focus
on AI-enabled internal audits, which could improve the quality and coverage of audits and reduce
the time needed to perform them. This would make assurance more effective and efficient and help
auditors identify potential risks that may go unnoticed with traditional methods. We compare
different AI methodologies that can be used in controls testing for various financial and
corporate processes. We propose the use of Knowledge Graphs (KGs) and representational learning
to leverage the inherent relational nature of the data and to identify potential non-compliance
or fraud. The experimental results show that our proposed method achieves an F1 score of 0.57,
compared with 0.32 and 0.40 for standard outlier detection approaches, reducing the number of
False Positives (FPs) and, in turn, the manual review involved.

Keywords
Internal Audit, Controls testing, Knowledge Graphs, Representation Learning


Introduction. The Internal Audit function’s primary responsibility is to evaluate and advise
on risk management and the related effectiveness of internal controls across an organization,
including financial, operational, regulatory, IT and strategic domains. Controls are typically a
set of defined processes designed to mitigate risk. Common business processes for which controls
testing is performed include accounts receivable/payable, employee expenses, payroll and supply
chain. In this paper, we specifically discuss the use case of employee expenses.
Problem Statement and Proposed Solution. The key challenges with a rules-based approach
to controls testing include: limited scope, since predefined rules may not suit every business
environment; too many false positives and false negatives; poor scalability; manual effort,
especially in identifying high-risk samples for testing; and heavy reliance on Subject Matter
Specialists to eliminate False Positives. Standard outlier detection approaches, such as
Isolation Forests (IF) and AutoEncoders (AE), usually do not work well, as shown in Table 1,
because of distribution shifts, label imbalance, and their inability to leverage the relational
nature of the data (they assume that samples are independent and identically distributed
(i.i.d.)). We are working towards a solution that uses KGs to leverage this inherent relational
aspect and to capture domain knowledge, and thus improve upon these methods.
ISWC-2023: The 22nd International Semantic Web Conference
∗ Corresponding author.
† These authors contributed equally.
sumpai@deloitte.com (P. Sumit)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Knowledge Graph Design. Given a tabular representation of the time and expense dataset,
we identify relationships between the columns, and model the data as a KG. The KG Schema
has five primary nodes, shown in red in Fig. 1, each of which is described by its
respective attributes, shown in green (e.g. transaction amount for the transaction identifier
node). The dataset contains continuous, discrete and textual columns, each of which is
incorporated into the graph after appropriate pre-processing: continuous values are binned;
textual attributes are cleaned and split into keywords, and semantically similar words are
connected using Bidirectional Encoder Representations from Transformers (BERT)-based word
embeddings. We then use a semi-supervised setup, in which a small fraction of transactions
(< 1%) are noisily labelled as fraudulent based on a small set of controls, and each labelled
transaction is assigned a corresponding edge in the KG.
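
The pre-processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names, bin edges, relation names (`submittedBy`, `hasAmountBin`, `hasKeyword`, `similarTo`, `isFraudulent`), stop-word list and similarity threshold are all hypothetical, and the 2-d vectors are toy stand-ins for BERT-based word embeddings.

```python
import math

# Hypothetical expense rows; column names are illustrative, not from the paper.
ROWS = [
    {"txn_id": "T1", "employee": "E1", "amount": 42.0, "description": "taxi to airport"},
    {"txn_id": "T2", "employee": "E2", "amount": 950.0, "description": "cab fare"},
]

# Assumed bin edges for the continuous 'amount' column: (low, high, node label).
AMOUNT_BINS = [(0, 100, "amount_0_100"), (100, 1000, "amount_100_1000")]

# Toy 2-d vectors standing in for BERT-based word embeddings (made-up values).
TOY_EMBEDDINGS = {
    "taxi": (1.0, 0.1),
    "cab": (0.9, 0.2),
    "airport": (0.1, 1.0),
    "fare": (0.5, 0.5),
}
STOPWORDS = {"to", "the", "a", "of"}


def bin_amount(value):
    """Map a continuous amount to a discrete bin node."""
    for low, high, label in AMOUNT_BINS:
        if low <= value < high:
            return label
    return "amount_other"


def keywords(text):
    """Clean a textual attribute and split it into keyword nodes."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_triples(rows, flagged_ids, sim_threshold=0.95):
    """Turn tabular rows into (subject, predicate, object) triples."""
    triples, vocab = set(), set()
    for row in rows:
        txn = row["txn_id"]
        triples.add((txn, "submittedBy", row["employee"]))
        triples.add((txn, "hasAmountBin", bin_amount(row["amount"])))
        for kw in keywords(row["description"]):
            triples.add((txn, "hasKeyword", kw))
            vocab.add(kw)
        # Noisy label from a control: add an edge only for flagged transactions.
        if txn in flagged_ids:
            triples.add((txn, "isFraudulent", "True"))
    # Connect semantically similar keywords via embedding similarity.
    words = sorted(w for w in vocab if w in TOY_EMBEDDINGS)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if cosine(TOY_EMBEDDINGS[w1], TOY_EMBEDDINGS[w2]) >= sim_threshold:
                triples.add((w1, "similarTo", w2))
    return triples


triples = build_triples(ROWS, flagged_ids={"T2"})
```

In this sketch the two transactions share no attribute node directly, but they become connected through the `similarTo` edge between their keywords, which is the kind of relational signal an i.i.d. outlier detector cannot use.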


Model    Precision    Recall    F1
IF       0.25         0.45      0.32
AE       0.32         0.55      0.40
KGE      0.59         0.55      0.57

Table 1: Classification Results for Fraudulent Transactions.

Figure 1: Representative KG Schema. Primary nodes are in red and their attributes in green.
Graph Representational Learning. We leverage the relational modeling power of graphs
and learn representations of nodes and edges by propagating this relational information using
Knowledge Graph Embedding (KGE) models [1]. The trained model is calibrated on a held-out
set made up of fraudulent and non-fraudulent transactions. The classification threshold
is chosen to maximize the F1 score on this set, and performance is then measured on a test set
using this threshold. Both sets are constructed to be representative of the true data
distribution, in which fraudulent transactions are expected to make up a very small percentage.
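
The calibration step described above can be sketched as a simple search over candidate thresholds. This is an illustrative sketch only: the scores and labels are made-up toy values, the function names are hypothetical, and it assumes that a higher model score means "more likely fraudulent".

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when every transaction with score >= threshold is flagged."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def choose_threshold(scores, labels):
    """Pick the candidate threshold that maximizes F1 on the held-out set."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy held-out set (made-up model scores; True marks a fraudulent transaction).
held_out_scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90]
held_out_labels = [False, False, False, True, False, True, True]
threshold = choose_threshold(held_out_scores, held_out_labels)

# The threshold is then frozen and applied unchanged to a separate test set.
test_scores = [0.15, 0.40, 0.70, 0.95]
test_labels = [False, False, True, True]
test_f1 = f1_at_threshold(test_scores, test_labels, threshold)
```

Selecting the threshold on the held-out set and reporting on a separate test set keeps the reported F1 from being tuned to the data it is measured on.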
Results and Conclusion. As shown in Table 1, relational modeling with KGs outperforms the
other two approaches (IF and AE): the KGE model achieves an F1 score of 0.57 on identifying
fraudulent transactions, compared with 0.32 for IF and 0.40 for AE. We provided an overview of
IA and a related use case highlighting the potential benefits of employing semantic modeling and
learning-based approaches to enhance controls testing. With continuous monitoring, we expect the
number of FPs to decrease, enabling greater confidence across the three lines of defense.
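
As a consistency check on Table 1, the reported F1 values can be recomputed from the reported precision $P$ and recall $R$ using the standard definition of F1 as their harmonic mean. Only the numbers in Table 1 are used; rounding to two decimals is assumed.

```latex
F_1 = \frac{2PR}{P + R}

\text{IF:}\quad  \frac{2(0.25)(0.45)}{0.25 + 0.45} = \frac{0.225}{0.70} \approx 0.32
\qquad
\text{AE:}\quad  \frac{2(0.32)(0.55)}{0.32 + 0.55} = \frac{0.352}{0.87} \approx 0.40
\qquad
\text{KGE:}\quad \frac{2(0.59)(0.55)}{0.59 + 0.55} = \frac{0.649}{1.14} \approx 0.57
```

AE and KGE have the same recall (0.55), so the F1 gain of KGE over AE comes entirely from precision (0.32 to 0.59), which is consistent with the reduction in FPs.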
Future Work. While the results of the transductive models from [1] are promising, they must be
retrained from scratch whenever a new batch of data arrives, because new batches contain unseen
symbolic nodes. We therefore plan to explore inductive models, which can approximate unseen
symbolic nodes during inference and thus avoid the large computational cost of retraining.
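
The difference between the two settings can be illustrated with a toy sketch. It is not the method the authors plan to use: averaging the embeddings of known neighbours is only one naive way to approximate an unseen node, and the node names and embedding values below are made up.

```python
# Toy embeddings for nodes seen at training time (made-up values).
TRAINED = {
    "E1": (0.2, 0.4),
    "amount_0_100": (0.6, 0.0),
    "taxi": (0.1, 0.8),
}


def lookup_transductive(node):
    """A transductive model only has embeddings for nodes seen in training."""
    return TRAINED[node]  # raises KeyError for an unseen node


def approximate_unseen(neighbours):
    """Naive inductive stand-in: mean of the known neighbours' embeddings."""
    known = [TRAINED[n] for n in neighbours if n in TRAINED]
    if not known:
        raise ValueError("no known neighbours to aggregate")
    dim = len(known[0])
    return tuple(sum(vec[i] for vec in known) / len(known) for i in range(dim))


# A new transaction arrives after training; one of its neighbours is also unseen.
new_txn_neighbours = ["E1", "amount_0_100", "taxi", "unseen_vendor"]
embedding = approximate_unseen(new_txn_neighbours)
```

Here `lookup_transductive("T_new")` would fail because the new transaction has no learned embedding, whereas `approximate_unseen` builds one from attribute nodes that were already embedded, with no retraining.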


References
[1] L. Costabello, S. Pai, C. L. Van, R. McGrath, N. McCarthy, P. Tabacof, AmpliGraph: a Library
    for Representation Learning on Knowledge Graphs, 2019.
[2] P. L. Tang, T. D. Le Pham, T. B. Dinh, Tree-Based Credit Card Fraud Detection Using
    Isolation Forest, Spectral Residual, And Knowledge Graph, in: MLODS, 2023.