
Building A Knowledge Graph for Audit Information

Naser Ahmadi

Hansjorg Sand

Paolo Papotti

EURECOM

France

Germany

We present our insights from the experience of creating a knowledge graph (KG) for the auditing domain. We discuss the main challenges in building such a KG starting from text and unstructured data and present an overview of our solution. The proposed approach follows a standard pipeline in which it first extracts entities from auditing documents and then finds relationships among them. However, the process is especially challenging because auditing entities are in most cases non-named entities, which are hard to model in the graph and to identify in text. From our experience, we finally derive a set of observations on the limits of automatic methods for the construction of audit KGs and a possible direction to address them.

Keywords: knowledge graph, auditing, text, taxonomy, structured data

1. Introduction

A Knowledge Graph (KG) is a structured representation of information which stores real-world entities as nodes, and relationships between them as edges. KGs represent data with large collections of interconnected entities.

Usually, types (classes) describe the entities (e.g., entity Paris is a city, France is a country), while predicates describe their relationships (a city isCapital of a country) and their properties (France has a population: 62M). RDF KGs organize information in the form of triples with a predicate expressing a binary relation between a subject and an object. KGs store large amounts of triples, or facts, e.g., the English version of DBpedia stores 850 million facts. The syntactic and semantic structures of knowledge in KGs are useful in building applications, such as Question Answering [1, 2] and Semantic Search [3].

Manually building a KG is a very expensive process. For this reason, research has been conducted on KG creation both in academia [4, 5, 6, 7, 8] and in the industry [9, 10]. However, when applied to the textual documents in the financial domain, these methods fall short. Indeed, the KGs for legal and audit enterprises are very different from Wikipedia pages. While most of the KGs in the literature are encyclopedic, covering objects and facts in the real world, some enterprises may have information which is mostly composed of non-named entities and abstract topics, making it close to a commonsense KG. See examples that highlight the difference in Figure 1. The latter category is much harder to build automatically, and most efforts rely on humans, usually in a crowdsourcing fashion, such as ConceptNet [11] and ATOMIC [12].

Figure 1: Examples of knowledge triples from encyclopedic and commonsense KGs [14].

The specific and technical domain of enterprise content is one of the biggest challenges in creating financial KGs [13], in general, and an audit KG in our setting. External commonsense resources, such as ConceptNet, are used in some of the relevant methods, but they are not a direct solution to the KG construction problem. Many terms are domain-specific, so they are either missing from the existing resource or their modeling in the commonsense KGs does not match the level of detail that is needed in the enterprise setting. For example, in an accounting dictionary AIM stands for Alternative Investment Market and goodwill is “a type of tangible assets that occurs when a buyer acquires an existing business”, while these words have very different meanings in a general dictionary. We also remark the challenge of modeling the above definition of goodwill with non-named entities in the KG: what are the right noun phrases to add? Can the properties expressed in the sentence be represented with binary relationships?

In our work, we are developing tools for automating different parts of a framework for continuous creation and curation of KGs. However, we face many challenges that make the automatic creation of such data structures much harder than in other settings. We start with an example of a KG we are creating in our collaboration with KPMG and then explain the difficulties and the opportunities in building an audit KG.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
naser.ahmadi@eurecom.fr (N. Ahmadi); hsand@kpmg.com (H. Sand); papotti@eurecom.fr (P. Papotti)

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Audit Knowledge Graph

We introduce a very high-level KG based on node entities and only two kinds of relationships between entities. This KG is different from traditional entity-centric knowledge graphs and it is motivated by the text data and taxonomies that are available in the KPMG corpus of textual documents. The design of the KG also takes the target applications into account.

In Figure 3, there is only one kind of node, representing entities. Those are very generic texts: they can be single words, paragraphs or long documents. The relationships across them are represented by directed edges and the nodes are connected in many-to-many relationships. We consider two kinds of relationships. The first one is containment: in the example, E6 is contained in E4. This could be a word contained in a document, for example, or a sub-element in a hierarchy (e.g., the relation between IEC 27001 and Audit process in the hierarchy in Figure 2). The second one is description: for example, E2 could be a topic that describes document E8. We remark that all manually defined edges are given the same weight with value 1, but in the KG edges can be weighted with a value between 0 and 1 for uncertain relationships (according to the confidence given by an automatic tool, for example).

The above example representation is very generic and simplified; we introduce it to give a feeling of the kind of graph that we are interested in. However, in our deployed KG, the nodes are of six different types:

• Document nodes are (possibly long) texts containing one to multiple paragraphs. For example, in Figure 2 two paragraphs are shown on the left side; those correspond to two D nodes.
• Taxonomy nodes are auditing concepts following a hierarchical structure. For example, every process step can be represented as a path from the root node to the leaf, e.g., Audit programme → ISO 19001 → Initial audit.
• Caption nodes are client-specific short documents that are described by taxonomy nodes, i.e., a describes edge goes from a taxonomy node to a caption node.
• Topic nodes are terms with one or multiple related entities; e.g., “risk treatment” and “audit process” are topics in the describes relationship with the Risk treatment in audit process step. Entities are associated in an isIn relationship with a topic.
• Entity nodes contain n-gram terms that are representative of relevant items, names and concepts in the audit domain. Every entity is the representative for a family of words, where a family includes (with isIn relationships) synonyms and abbreviations that can be used to express such entity in documents.
• Word nodes are words in an entity, their synonyms or other variations. E.g., auditing, adt and prc are words for entity audit process.
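The six node types and the two weighted relationship types described above can be sketched as a small typed graph. The Python below is only an illustration under stated assumptions, not the deployed implementation: the class names (`AuditKG`, `Edge`), the `type:label` node identifiers, and the 0.7 confidence value are hypothetical, while the node types, the isIn and describes relations, and the example labels come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    DOCUMENT = "document"
    TAXONOMY = "taxonomy"
    CAPTION = "caption"
    TOPIC = "topic"
    ENTITY = "entity"
    WORD = "word"


class Relation(Enum):
    IS_IN = "isIn"            # containment
    DESCRIBES = "describes"


@dataclass(frozen=True)
class Edge:
    source: str
    relation: Relation
    target: str
    weight: float = 1.0  # manually defined edges keep weight 1

    def __post_init__(self):
        if not 0.0 <= self.weight <= 1.0:
            raise ValueError("edge weight must be in [0, 1]")


@dataclass
class AuditKG:
    nodes: dict = field(default_factory=dict)  # node id -> NodeType
    edges: list = field(default_factory=list)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target, weight=1.0):
        for node_id in (source, target):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.edges.append(Edge(source, relation, target, weight))

    def targets(self, source, relation, min_weight=0.0):
        """Targets reached from `source` via `relation` with weight >= min_weight."""
        return [e.target for e in self.edges
                if e.source == source and e.relation is relation
                and e.weight >= min_weight]


kg = AuditKG()
kg.add_node("word:adt", NodeType.WORD)
kg.add_node("word:prc", NodeType.WORD)
kg.add_node("entity:audit process", NodeType.ENTITY)
kg.add_node("topic:audit process", NodeType.TOPIC)
kg.add_node("taxonomy:Risk treatment in audit process", NodeType.TAXONOMY)

kg.add_edge("word:adt", Relation.IS_IN, "entity:audit process")
# Hypothetical confidence from an automatic tool, for illustration only.
kg.add_edge("word:prc", Relation.IS_IN, "entity:audit process", weight=0.7)
kg.add_edge("entity:audit process", Relation.IS_IN, "topic:audit process")
kg.add_edge("topic:audit process", Relation.DESCRIBES,
            "taxonomy:Risk treatment in audit process")

print(kg.targets("word:adt", Relation.IS_IN))
print(kg.targets("word:prc", Relation.IS_IN, min_weight=0.8))
```

Prefixing identifiers with the node type is one way to let the same label, such as audit process, appear both as an entity and as a topic. The `min_weight` filter shows how uncertain, automatically produced edges could be separated from manually defined ones.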

There are two main design choices behind our representation.

First, we use several node types and very few relationship types, as the latter are harder to extract automatically from text. We found that NLP analysis of the text can identify the two (relatively simple from a semantic viewpoint) relationships, while for the entity types the task is simplified by the awareness of their provenance, i.e., some types can be mostly derived from the source of extraction. However, obtaining such types and relationships automatically from text documents is a difficult task, as we discuss in the next section.

Second, some node types are inspired by the target users. The proposed representation has been validated by experts and it is used for one text matching application at the firm. This application exploits the rich granularity of the text representation in the KG. Indeed, the different types enable the immediate characterization of a new text, say a customer document, in terms of entities (with entity and word nodes) and more abstract concepts (sets of entities). We found this freedom crucial given the challenge of fixing the right abstraction for the expression of non-named entities in the KG.

3. Limits and Opportunities of Automatic Methods

Given the nature of the auditing content, automatic methods for encyclopedic KG construction are not very effective [15, 16, 17]. We experimented extensively with such methods, but the results were far from the required quality [18]. We list five main challenges. (1) Auditing entities are not standard named entities, such as France and IBM. (2) Non-named entities are expressed as noun phrases that can be recognized as subjects in sentences but are hard to organize in a structured graph. For example, should “tangible asset” be modeled with one or two entities? (3) These entities are often used in the form of acronyms or abbreviations. (4) Taking into account the richness of human language, there are many variations of noun phrases expressing the same concept. (5) There is no training data in this domain, and general corpora miss the subtle differences in the audit domain [19, 15]. While some of these challenges apply in general to KG construction, we found that these problems are especially hard for existing tools in this setting.

As the project moved forward, different parts of the KG have been manually defined by the domain experts at KPMG. For example, a list of potential entities has been identified with traditional NLP tools and then manually revised by a human team. This process identified some of the opportunities to introduce automatic methods to help in the KG construction. Moreover, the manually crafted portions of the KG offered us some ground truth for the evaluation of the proposed algorithms [20].

In our pipeline, the first task is the automatic identification of nodes and the second task is the identification of relationships across the different nodes. We first tackle the task of generating the entity nodes, or key short phrases, that act as subjects and objects. Starting from those, we generate families of words for each entity node. The goal is to find a group of semantically equivalent words, including abbreviations and acronyms, and to associate them with the representative entity given only the documents [20]. Words and representative entities are related with isIn relationships. When evaluated against the ground truth written by the experts, we found that the proposed unsupervised technique for mapping words and entities can achieve high precision, but only limited recall, with the latter varying between 0.55 and 0.4 depending on the language at hand, i.e., English is easier than German [20].

We then propose a method to identify relationships of type describes between nodes, and we conduct experimental campaigns on the discovery of relations between documents and taxonomy nodes [21]. Our method exploits a deep learning approach for the unsupervised modeling of the entities as vectors in the presence of free text and structured data [22]. Such vectors are then used in the unsupervised matching step. In particular, we report promising results in matching documents and taxonomy nodes, which is a challenging task for existing methods because of the long textual content in our entities. Compared to the manually created relationships, the unsupervised method obtains 0.6 F-measure when looking at top-3 matches [21].

While our initial results are promising, we need better methods that involve the experts in the KG building process with simple interfaces [23, 24]. The design of human-in-the-loop solutions is at the core of our current efforts. The knowledge graphs with the human-in-the-loop solutions we work on will support a broad range of scenarios in financial and economic settings:

• Automated classification of financial records in data ingestion and analysis pipelines.
• Automated classification of financial transaction documents to support automated transaction processing.
• Automated metadata tagging for documents and sub-documents in legal and accounting corpora to improve the reliability of semantic search engines.

References

[1] C. Unger, A. Freitas, P. Cimiano, An introduction to question answering over linked data, in: Reasoning Web International Summer School, Springer, 2014, pp. 100–140.
[2] D. Diefenbach, V. Lopez, K. Singh, P. Maret, Core techniques of question answering systems over knowledge bases: a survey, Knowledge and Information systems 55 (2018) 529–569.
[3] H. Bast, B. Björn, E. Haussmann, Semantic search on text and knowledge bases, Foundations and Trends in Information Retrieval 10 (2016) 119–271.
[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, T. M. Mitchell, Toward an architecture for never-ending language learning, in: AAAI, 2010, pp. 1306–1313.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann, DBpedia - A crystallization point for the web of data, Web Semantics 7 (2009) 154–165.
[6] F. M. Suchanek, G. Kasneci, G. Weikum, YAGO: A core of semantic knowledge unifying wordnet and wikipedia, in: WWW, 2007, pp. 697–706.
[7] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Comm. of the ACM 57 (2014) 78–85.
[8] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, Y. Ye, KATARA: a data cleaning system powered by knowledge bases and crowdsourcing, in: SIGMOD, 2015.
[9] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, W. Zhang, From data fusion to knowledge fusion, PVLDB 7 (2014) 881–892.
[10] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, A. Doan, Building, maintaining, and using knowledge bases: a report from the trenches, in: SIGMOD, 2013, pp. 1209–1220.
[11] R. Speer, J. Chin, C. Havasi, Conceptnet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[12] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, Y. Choi, ATOMIC: an atlas of machine commonsense for if-then reasoning, in: AAAI, AAAI Press, 2019, pp. 3027–3035.
[13] S. Elhammadi, L. V.S. Lakshmanan, R. Ng, M. Simpson, B. Huai, Z. Wang, L. Wang, A high precision pipeline for financial knowledge graph construction, in: COLING, 2020, pp. 967–977.
[14] T. Safavi, D. Koutra, Relational world knowledge representation in contextual language models: A review, arXiv preprint arXiv:2104.05837 (2021).
[15] M. Kejriwal, Domain-specific knowledge graph construction, Springer, 2019.
[16] M. Kejriwal, R. Shao, P. Szekely, Expert-guided entity extraction using expressive rules, in: SIGIR, 2019, pp. 1353–1356.
[17] B. Abu-Salih, Domain-specific knowledge graphs: A survey, Journal of Network and Computer Applications 185 (2021) 103076.
[18] S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, C. Ré, Fonduer: Knowledge base construction from richly formatted data, in: SIGMOD, ACM, 2018, pp. 1301–1316.
[19] N. Jain, Domain-specific knowledge graph construction for semantic analysis, in: European Semantic Web Conference, Springer, 2020, pp. 250–260.
[20] N. Ahmadi, A framework for the continuous curation of a knowledge base system, Ph.D. thesis, EURECOM, 2021.
[21] N. Ahmadi, H. Sand, P. Papotti, Unsupervised matching of data and text, in: ICDE, IEEE, 2022.
[22] R. Cappuzzo, P. Papotti, S. Thirumuruganathan, Creating embeddings of heterogeneous relational datasets for data integration tasks, in: SIGMOD, 2020.
[23] S. Zhang, L. He, E. C. Dragut, S. Vucetic, How to invest my time: Lessons from human-in-the-loop entity extraction, in: SIGKDD, ACM, 2019, pp. 2305–2313.
[24] P. Ristoski, A. L. Gentile, A. Alba, D. Gruhl, S. Welch, Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop, J. Web Semant. 60 (2020).