Graph Representational Learning for Internal Audit

Sumit Pai, Vivek Singh Kumar, Chavali Pavani, Edward Bowen, Sunil Tiyyagura Reddy

Deloitte & Touche Assurance & Enterprise Risk Services India Private Limited

ISSN 1613-0073

Keywords: Internal Audit, Controls testing, Knowledge Graphs, Representation Learning

This work aims to improve the quality of Internal Audits (IA), which are a critical part of an organization's governance structure and serve as the third line of defense, helping provide assurance that controls and processes have adequate risk mitigation strategies in place. We focus on AI-enabled internal audits, which could improve the quality and coverage of audits and reduce the time needed to perform them, making assurance more effective and efficient and helping auditors identify potential risks that may go unnoticed through traditional methods. We compare different AI methodologies that can be used in controls testing for various financial and corporate processes. We propose the use of Knowledge Graphs (KGs) and representational learning to leverage the inherent relational nature of the data and to identify potential non-compliance or fraud. The experimental results show that our proposed method achieves an F1 score of 0.57, compared with 0.32 and 0.40 for two standard outlier detection approaches, reducing the number of False Positives (FPs) and, in turn, the manual review involved.

We identify relationships between the columns and model the data as a KG. The KG schema has five primary nodes (shown in red in Fig. 1), each of which is described by its respective attributes (shown in green), e.g. the transaction amount for the transaction identifier node. The dataset contains continuous, discrete and textual columns, each of which is incorporated into the graph after appropriate pre-processing: continuous values are binned; textual attributes are cleaned and split into keywords; and semantically similar words are connected using Bidirectional Encoder Representations from Transformers (BERT)-based word embeddings. We then use a semi-supervised setup in which a small fraction of transactions (< 1%) are given noisy fraud labels based on a small set of controls, and each such label is represented as an edge in the KG.

Graph Representational Learning

We leverage the relational modeling power of graphs and learn representations of nodes and edges by propagating this relational information using Knowledge Graph Embedding (KGE) models [1]. The trained model is calibrated on a held-out set made up of fraudulent and non-fraudulent transactions. The classification threshold is chosen to maximize the F1 score on this set, and performance is then measured on a test set using this threshold. Both sets are constructed to be representative of the true data distribution, in which fraudulent transactions are expected to make up a very small percentage.
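The threshold selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `f1_score` and `best_threshold`, the sweep over observed scores, and the held-out scores and labels are all assumptions made for the example.

```python
def f1_score(labels, preds):
    """F1 for binary labels, where 1 marks a fraudulent transaction."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return (threshold, f1) maximising F1 when fraud is predicted for score >= threshold."""
    best_t, best_f1 = None, -1.0
    # Every distinct score is a candidate cut-off.
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical calibrated scores on a held-out set (not data from the paper).
held_out_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
held_out_labels = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, f1 = best_threshold(held_out_scores, held_out_labels)
print(threshold, round(f1, 3))
```

The chosen threshold would then be applied unchanged to the test set, so that the reported metrics are not tuned on the data they are measured on.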

Results and Conclusion

As shown in Table 1, the KGE-based approach outperforms the other two approaches, Isolation Forest (IF) and Autoencoder (AE): relational modeling with KGs achieves an F1 score of 0.57 on identifying fraudulent transactions, against 0.32 for IF and 0.40 for AE. We provided an overview of IA and a related use case highlighting the potential benefits of employing semantic modeling and learning-based approaches to enhance controls testing. With continuous monitoring, the number of FPs should fall, enabling greater confidence across the three lines of defense.

Future Work

While the results of the transductive models from [1] are promising, these models must be retrained from scratch whenever a new batch of data arrives, because the new data contains unseen symbolic nodes. We therefore plan to explore inductive models, which can approximate unseen symbolic nodes during inference and thus avoid the large computational cost of retraining.

Figure 1: Representative KG Schema. Primary nodes are in red and their attributes in green.
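A schema of this kind could be turned into graph triples roughly as sketched below. The column names (`transaction_id`, `vendor`, `amount`, `description`), the predicate names, and the bin edges are hypothetical; the paper does not list them. The BERT-based linking of semantically similar keywords is left out of this sketch.

```python
import bisect

# Hypothetical bin edges; the paper does not state how amounts are binned.
AMOUNT_BIN_EDGES = [100.0, 1000.0, 10000.0]


def amount_bin(amount):
    """Map a continuous amount to a discrete bin label such as 'amount_bin_2'."""
    return f"amount_bin_{bisect.bisect_right(AMOUNT_BIN_EDGES, amount)}"


def row_to_triples(row, flagged=False):
    """Turn one tabular transaction record into (subject, predicate, object) triples."""
    txn = f"txn_{row['transaction_id']}"
    triples = [
        (txn, "hasVendor", f"vendor_{row['vendor']}"),
        (txn, "hasAmountBin", amount_bin(row["amount"])),
    ]
    # Clean the free text: lowercase, strip punctuation, keep distinct keywords.
    keywords = {
        "".join(ch for ch in word if ch.isalnum())
        for word in row["description"].lower().split()
    }
    for keyword in sorted(keywords - {""}):
        triples.append((txn, "hasKeyword", f"kw_{keyword}"))
    # A noisy label from a rule-based control becomes an edge in the graph.
    if flagged:
        triples.append((txn, "isFlagged", "fraud"))
    return triples


# Hypothetical record, for illustration only.
row = {"transaction_id": 42, "vendor": "acme", "amount": 2500.0,
       "description": "Office supplies, urgent"}
triples = row_to_triples(row, flagged=True)
for triple in triples:
    print(triple)
```

Binning the amount turns a continuous value into a shared node, so transactions with similar amounts become connected through the graph rather than being isolated by unique numeric values.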

Table 1: Classification Results for Fraudulent Transactions.

Model   Precision   Recall   F1
IF      0.25        0.45     0.32
AE      0.32        0.55     0.40
KGE     0.59        0.55     0.57
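As a consistency check, the F1 values in Table 1 can be recomputed from the reported precision and recall using the harmonic mean. The helper name `f1_from` is an assumption for this sketch; the numbers are those reported in the table.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in Table 1.
table_1 = {"IF": (0.25, 0.45), "AE": (0.32, 0.55), "KGE": (0.59, 0.55)}

recomputed = {model: f1_from(p, r) for model, (p, r) in table_1.items()}
for model, f1 in recomputed.items():
    print(f"{model}: {f1:.2f}")
```

The three models have similar recall (0.45 to 0.55), so the gain in F1 for KGE comes mainly from its higher precision, which is consistent with the reported reduction in false positives.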

[1] L. Costabello, S. Pai, C. L. Van, R. McGrath, N. McCarthy, P. Tabacof, AmpliGraph: a Library for Representation Learning on Knowledge Graphs, 2019.

[2] P. L. Tang, T. D. Le Pham, T. B. Dinh, Tree-Based Credit Card Fraud Detection Using Isolation Forest, Spectral Residual, and Knowledge Graph, MLODS, 2023.