<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building A Knowledge Graph for Audit Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Naser Ahmadi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hansjorg Sand</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Papotti</string-name>
        </contrib>
        <aff>EURECOM, France</aff>
        <aff>Germany</aff>
      </contrib-group>
      <abstract>
        <p>We present our insights from the experience of creating a knowledge graph (KG) for the auditing domain. We discuss the main challenges in building such a KG starting from text and unstructured data and present an overview of our solution. The proposed approach follows a standard pipeline: it first extracts entities from auditing documents and then finds relationships among them. However, the process is especially challenging because auditing entities are in most cases non-named entities, which are hard to model in the graph and to identify in text. From our experience, we finally derive a set of observations on the limits of automatic methods for the construction of audit KGs and a possible direction to address them.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge graph</kwd>
        <kwd>auditing</kwd>
        <kwd>text</kwd>
        <kwd>taxonomy</kwd>
        <kwd>structured data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>A Knowledge Graph (KG) is a structured representation of information which stores real-world entities as nodes and the relationships between them as edges. KGs represent data as large collections of interconnected entities.</p>
        <p>Usually, types (classes) describe the entities (e.g., entity Paris is a city, France is a country), while predicates describe their relationships (a city isCapital of a country) and their properties (France has a population:62M). RDF KGs organize information in the form of triples, with a predicate expressing a binary relation between a subject and an object. KGs store large amounts of triples, or facts; e.g., the English version of DBpedia stores 850 million facts. The syntactic and semantic structures of knowledge in KGs are useful in building applications, such as Question Answering [1, 2] and Semantic Search [3].</p>
        <p>Manually building a KG is a very expensive process. For this reason, research has been conducted on KG creation both in academia [4, 5, 6, 7, 8] and in industry [9, 10]. However, when applied to the textual documents in the financial domain, these methods fall short. Indeed, the KGs for legal and audit enterprises are very different from Wikipedia pages. While most of the KGs in the literature are encyclopedic, covering objects and facts in the real world, some enterprises may have information which is mostly composed of non-named entities and abstract topics, making it close to a commonsense KG. See examples that highlight the difference in Figure 1. The latter category is much harder to build automatically, and most efforts rely on humans, usually in a crowdsourcing fashion, such as ConceptNet [11] and ATOMIC [12].</p>
        <p>Figure 1: Examples of knowledge triples from encyclopedic and commonsense KGs [14].</p>
        <p>The specific and technical domain of enterprise content is one of the biggest challenges in creating financial KGs [13] in general, and an audit KG in our setting. External commonsense resources, such as ConceptNet, are used in some of the relevant methods, but they are not a direct solution to the KG construction problem. Many terms are domain-specific, so they are either missing from the existing resources or their modeling in the commonsense KGs does not match the level of detail that is needed in the enterprise setting. For example, in an accounting dictionary AIM stands for Alternative Investment Market and goodwill is “a type of tangible assets that occurs when a buyer acquires an existing business”, while these words have very different meanings in a general dictionary. We also remark on the challenge of modeling the above definition of goodwill by using non-named entities in the KG: what are the right noun phrases to add? Can the properties expressed in the sentence be represented with binary relationships?</p>
        <p>In our work, we are developing tools for automating different parts of a framework for continuous creation and curation of KGs. However, we face many challenges that make the automatic creation of such data structures much harder than in other settings. We start with an example of a KG we are creating in our collaboration with KPMG and then explain the difficulties and the opportunities in building an audit KG.</p>
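        <p>To make the triple model described in this section concrete, the following minimal Python sketch stores facts as (subject, predicate, object) tuples and retrieves them by pattern. The Triple type, the match helper and the predicate spellings are illustrative assumptions; the facts are the encyclopedic examples used earlier in this section.</p>
        <preformat><![CDATA[
```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One fact: a predicate relating a subject to an object."""
    subject: str
    predicate: str
    obj: str


# Facts from the examples in the text; the predicate names are assumed spellings.
facts = [
    Triple("Paris", "type", "city"),
    Triple("France", "type", "country"),
    Triple("Paris", "isCapitalOf", "France"),
    Triple("France", "population", "62M"),
]


def match(triples, subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples that agree with every non-None field of the pattern."""
    pattern = (subject, predicate, obj)
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


# All facts about France, then every capital relationship.
print(match(facts, subject="France"))
print(match(facts, predicate="isCapitalOf"))
```
]]></preformat>
        <p>Named entities such as Paris fit this shape naturally; the open question raised above is which noun phrases should play the subject and object roles when the content is a definition such as the one for goodwill.</p>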
        <p>Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint
Conference (March 29-April 1, 2022), Edinburgh, UK.
Contact: naser.ahmadi@eurecom.fr (N. Ahmadi); hsand@kpmg.com
(H. Sand); papotti@eurecom.fr (P. Papotti)</p>
        <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Audit Knowledge Graph</title>
      <p>We introduce a very high-level KG based on node entities and only two kinds of relationships between entities. This KG is different from traditional entity-centric knowledge graphs and is motivated by the text data and taxonomies that are available in the KPMG corpus of textual documents. The design of the KG also takes the target applications into account.</p>
      <p>In Figure 3, there is only one kind of node, representing entities. Those are very generic texts: they can be single words, paragraphs or long documents. The relationships across them are represented by directed edges, and the nodes are connected in many-to-many relationships. We consider two kinds of relationships. The first one is containment: in the example, E6 is contained in E4. This could be a word contained in a document, for example, or a sub-element in a hierarchy (e.g., the relation between IEC 27001 and Audit process in the hierarchy in Figure 2). The second one is description: for example, E2 could be a topic that describes document E8. We remark that all manually defined edges are given the same weight with value 1, but in the KG edges can be weighted with a value between 0 and 1 for uncertain relationships (according to the confidence given by an automatic tool, for example).</p>
      <p>The above example representation is very generic and simplified; we introduce it to give a feeling of the kind of graph that we are interested in. However, in our deployed KG, the nodes are of six different types:</p>
      <list list-type="bullet">
        <list-item><p>Documents nodes are (possibly long) texts containing one or more paragraphs. For example, in Figure 2 two paragraphs are shown on the left side; those correspond to two D nodes.</p></list-item>
        <list-item><p>Taxonomy nodes are auditing concepts following a hierarchical structure. For example, every process step can be represented as a path from the root node to a leaf, e.g., Audit programme → ISO 19001 → Initial audit.</p></list-item>
        <list-item><p>Caption nodes are client-specific short documents that are described by taxonomy nodes, i.e., a describes edge goes from a taxonomy node to a caption node.</p></list-item>
        <list-item><p>Topics nodes are terms with one or multiple related entities; e.g., “risk treatment” and “audit process” are topics in the describes relationship with the Risk treatment in audit process step. Entities are associated in an isIn relationship with a topic.</p></list-item>
        <list-item><p>Entities nodes contain n-gram terms that are representative of relevant items, names and concepts in the audit domain. Every entity is the representative for a family of words, where a family includes (with isIn relationships) synonyms and abbreviations that can be used to express such an entity in documents.</p></list-item>
        <list-item><p>Word nodes are words in an entity, their synonyms or other variations. E.g., auditing, adt and prc are words for entity audit process.</p></list-item>
      </list>
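      <p>As a minimal sketch of the representation described above, the following Python code models the six node types and the two weighted edge types (isIn and describes). The names Node, Edge and AuditKG, the family helper and the sample confidence of 0.8 are illustrative assumptions and do not reflect the deployed implementation; the node names come from the examples in the list.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field

# Hypothetical names throughout: an illustration, not the deployed system.
NODE_TYPES = {"document", "taxonomy", "caption", "topic", "entity", "word"}
EDGE_TYPES = {"isIn", "describes"}


@dataclass(frozen=True)
class Node:
    name: str
    node_type: str


@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    edge_type: str
    weight: float = 1.0  # manually defined edges keep weight 1


@dataclass
class AuditKG:
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, edge_type, weight=1.0):
        """Add a directed edge after validating types and weight range."""
        if source.node_type not in NODE_TYPES or target.node_type not in NODE_TYPES:
            raise ValueError("unknown node type")
        if edge_type not in EDGE_TYPES:
            raise ValueError("unknown edge type")
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0 and 1")
        edge = Edge(source, target, edge_type, weight)
        self.edges.append(edge)
        return edge

    def family(self, entity):
        """Sorted names of the word nodes linked to an entity by isIn edges."""
        return sorted(e.source.name for e in self.edges
                      if e.edge_type == "isIn" and e.target == entity
                      and e.source.node_type == "word")


kg = AuditKG()
entity = Node("audit process", "entity")
for word in ("auditing", "adt", "prc"):
    kg.add_edge(Node(word, "word"), entity, "isIn")

topic = Node("audit process", "topic")
kg.add_edge(entity, topic, "isIn")

step = Node("Risk treatment in audit process", "taxonomy")
# An automatically extracted edge carries the tool's confidence (0.8 is made up).
kg.add_edge(topic, step, "describes", weight=0.8)

print(kg.family(entity))
```
]]></preformat>
      <p>Keeping the edge vocabulary to two labels means that most of the modeling effort goes into the node types, which is the trade-off discussed next.</p>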
      <sec id="sec-2-1">
        <p>There are two main design choices behind our representation.</p>
        <p>First, we use several node types and very few relationship types, as the latter are harder to extract automatically from text. We found that NLP analysis of the text can identify the two (relatively simple from a semantic viewpoint) relationships, while for the entity types the task is simplified by the awareness of their provenance, i.e., some types can mostly be derived from the source of extraction. However, obtaining such types and relationships automatically from text documents is a difficult task, as we discuss in the next section.</p>
        <p>Second, some node types are inspired by the target users. The proposed representation has been validated by experts and it is used for one text matching application at the firm. This application exploits the rich granularity of the text representation in the KG. Indeed, the different types enable the immediate characterization of a new text, say a customer document, in terms of entities (with entity and word nodes) and more abstract concepts (set of entities). We found this freedom crucial given the challenge of fixing the right abstraction for the expression of non-named entities in the KG.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Limits and Opportunities of Automatic Methods</title>
      <p>Given the nature of the auditing content, automatic methods for encyclopedic KG construction are not very effective [15, 16, 17]. We experimented extensively with such methods, but with results that were far from the required quality [18]. We list five main challenges. (1) Auditing entities are not standard named entities, such as France and IBM. (2) Non-named entities are expressed as noun phrases that can be recognized as subjects in sentences but are hard to organize in a structured graph. For example, should “tangible asset” be modeled with one or two entities? (3) Most of these entities are often used in the form of acronyms or abbreviations. (4) Given the richness of human language, there are many variations of noun phrases expressing the same concept. (5) There is no training data in this domain, and general corpora miss the subtle differences in the audit domain [19, 15]. While some of these challenges apply in general to KG construction, we found that these problems are especially hard for existing tools in this setting.</p>
      <p>As the project moved forward, different parts of the KG have been manually defined by the domain experts at KPMG. For example, a list of potential entities has been identified with traditional NLP tools and then manually revised by a human team. This process identified some of the opportunities to introduce automatic methods to help in the KG construction. Moreover, the manually crafted portions of the KG offered us some ground truth for the evaluation of the proposed algorithms [20].</p>
      <p>In our pipeline, the first task is the automatic identification of nodes and the second task is the identification of relationships across the different nodes. We first tackle the task of generating the entity nodes, or key short phrases, that act as subjects and objects. Starting from those, we generate families of words for each entity node. The goal is to find a group of semantically equivalent words, including abbreviations and acronyms, and to associate them with the representative entity given only the documents [20]. Words and representative entities are related with isIn relationships. When evaluated against the ground truth written by the experts, we found that the proposed unsupervised technique for mapping words and entities can achieve high precision, but only limited recall, with the latter varying between 0.55 and 0.4 depending on the language at hand, i.e., English is easier than German [20].</p>
      <p>We then propose a method to identify relationships of type describes between nodes, and we conduct experimental campaigns on the discovery of relations between documents and taxonomy nodes [21]. Our method exploits a deep learning approach for the unsupervised modeling of the entities as vectors in the presence of free text and structured data [22]. Such vectors are then used in the unsupervised matching step. In particular, we report promising results in matching documents and taxonomy nodes, which is a challenging task for existing methods because of the long textual content in our entities. Compared to the manually created relationships, the unsupervised method obtains 0.6 F-measure when looking at top-3 matches [21].</p>
      <p>While our initial results are promising, we need better methods that involve the experts in the KG building process with simple interfaces [23, 24]. The design of human-in-the-loop solutions is at the core of our current efforts. The knowledge graphs with the human-in-the-loop solutions we work on will support a broad range of scenarios in financial and economic settings:</p>
      <list list-type="bullet">
        <list-item><p>Automated classification of financial records in data ingestion and analysis pipelines.</p></list-item>
        <list-item><p>Automated classification of financial transaction documents to support automated transaction processing.</p></list-item>
        <list-item><p>Automated metadata tagging for documents and sub-documents in legal and accounting corpora to improve the reliability of semantic search engines.</p></list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>References</title>
      <p>[1] C. Unger, A. Freitas, P. Cimiano, An introduction to question answering over linked data, in: Reasoning Web International Summer School, Springer, 2014, pp. 100–140.</p>
      <p>[2] D. Diefenbach, V. Lopez, K. Singh, P. Maret, Core techniques of question answering systems over knowledge bases: a survey, Knowledge and Information systems 55 (2018) 529–569.</p>
      <p>[3] H. Bast, B. Björn, E. Haussmann, Semantic search on text and knowledge bases, Foundations and Trends in Information Retrieval 10 (2016) 119–271.</p>
      <p>[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, T. M. Mitchell, Toward an architecture for never-ending language learning, in: AAAI, 2010, pp. 1306–1313.</p>
      <p>[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann, DBpedia - A crystallization point for the web of data, Web Semantics 7 (2009) 154–165.</p>
      <p>[6] F. M. Suchanek, G. Kasneci, G. Weikum, YAGO: A core of semantic knowledge unifying wordnet and wikipedia, in: WWW, 2007, pp. 697–706.</p>
      <p>[7] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Comm. of the ACM 57 (2014) 78–85.</p>
      <p>[8] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, Y. Ye, KATARA: a data cleaning system powered by knowledge bases and crowdsourcing, in: SIGMOD, 2015.</p>
      <p>[9] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, W. Zhang, From data fusion to knowledge fusion, PVLDB 7 (2014) 881–892.</p>
      <p>[10] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, A. Doan, Building, maintaining, and using knowledge bases: a report from the trenches, in: SIGMOD, 2013, pp. 1209–1220.</p>
      <p>[11] R. Speer, J. Chin, C. Havasi, Conceptnet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.</p>
      <p>[12] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, Y. Choi, ATOMIC: an atlas of machine commonsense for if-then reasoning, in: AAAI, AAAI Press, 2019, pp. 3027–3035.</p>
      <p>[13] S. Elhammadi, L. V.S. Lakshmanan, R. Ng, M. Simpson, B. Huai, Z. Wang, L. Wang, A high precision pipeline for financial knowledge graph construction, in: COLING, 2020, pp. 967–977.</p>
      <p>[14] T. Safavi, D. Koutra, Relational world knowledge representation in contextual language models: A review, arXiv preprint arXiv:2104.05837 (2021).</p>
      <p>[15] M. Kejriwal, Domain-specific knowledge graph construction, Springer, 2019.</p>
      <p>[16] M. Kejriwal, R. Shao, P. Szekely, Expert-guided entity extraction using expressive rules, in: SIGIR, 2019, pp. 1353–1356.</p>
      <p>[17] B. Abu-Salih, Domain-specific knowledge graphs: A survey, Journal of Network and Computer Applications 185 (2021) 103076.</p>
      <p>[18] S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, C. Ré, Fonduer: Knowledge base construction from richly formatted data, in: SIGMOD, ACM, 2018, pp. 1301–1316.</p>
      <p>[19] N. Jain, Domain-specific knowledge graph construction for semantic analysis, in: European Semantic Web Conference, Springer, 2020, pp. 250–260.</p>
      <p>[20] N. Ahmadi, A framework for the continuous curation of a knowledge base system, Ph.D. thesis, EURECOM, 2021.</p>
      <p>[21] N. Ahmadi, H. Sand, P. Papotti, Unsupervised matching of data and text, in: ICDE, IEEE, 2022.</p>
      <p>[22] R. Cappuzzo, P. Papotti, S. Thirumuruganathan, Creating embeddings of heterogeneous relational datasets for data integration tasks, in: SIGMOD, 2020.</p>
      <p>[23] S. Zhang, L. He, E. C. Dragut, S. Vucetic, How to invest my time: Lessons from human-in-the-loop entity extraction, in: SIGKDD, ACM, 2019, pp. 2305–2313.</p>
      <p>[24] P. Ristoski, A. L. Gentile, A. Alba, D. Gruhl, S. Welch, Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop, J. Web Semant. 60 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>