On the Analysis of Large Integrated Knowledge Graphs for Economics, Banking, and Finance

Shuai Wang
Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, the Netherlands

Abstract
Knowledge graphs are being used for the detection of money laundering, insurance fraud, and other suspicious activities. Some recent work demonstrated how knowledge graphs are being used to study the impact of the COVID-19 outbreak on the economy. The fact that knowledge graphs are being used in more and more interdisciplinary problems calls for a reliable source of interdisciplinary knowledge. In this paper, we study the integration of knowledge graphs in the domains of economics, banking, and finance. Our integrated knowledge graph has over 610K nodes and 1.7 million edges. By performing statistical and graph-theoretical analysis, we demonstrate how the integration results in more entities with richer information. Its quality was examined by analyzing the subgraphs of the identity links and (pseudo-)transitive relations. Finally, we study the sources of error and their refinement, and discuss the benefit of our integrated graph.

Keywords
Integrated knowledge graphs, knowledge graph analysis, knowledge graph refinement

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
Email: shuai.wang@vu.nl (S. Wang). URL: https://shuai.ai (S. Wang). ORCID: 0000-0002-1261-9930 (S. Wang).
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The 2008 financial crisis underscored the need for early detection of systemic risk to national and world economies in derivatives markets. The relative size of these markets is a fundamental risk to geopolitical as well as economic security [1]. One popular tool for modelling the relations between companies and their economic behavior is the knowledge graph. Knowledge graphs show great potential because they can represent companies structured in complex shareholdings, as well as information about investment, acquisition, bankruptcy, etc. Shao et al. used knowledge graphs of real financial data in which nodes are customers, merchants, buildings, etc. The edges can be transactions between customers, residential information about customers, etc. As a benefit of the graph structure, their knowledge graph captures interrelations and interactions across a large number of entity types more effectively than traditional methods. They performed extensive experiments and demonstrated the usage of knowledge graphs in the consumer banking sector [2].

Bellomarini et al. address the impact of the COVID-19 outbreak on the network of Italian companies using knowledge graphs of millions of nodes [3]. Such projects require multiple types of domain knowledge, from company ownership to public health policy, from bankruptcy to social resilience. The essence of such knowledge becomes clear for strategy formation and policy making based on the dynamics of complex inter-connected systems. Unfortunately, many sources of knowledge were developed independently of each other. Fusing these independent knowledge graphs (KGs) could lead to a significantly richer source of knowledge, which could improve the performance of existing applications. In this paper, we study properties of the integration of knowledge graphs by analyzing their statistical and graph-theoretical properties. More specifically, we study properties of integrated knowledge graphs by combining existing knowledge graphs in the domains of economics, banking, and finance.

Finance. The Financial Industry Business Ontology (FIBO) [4] includes formal models that are intended to define unambiguous shared meaning for financial industry concepts. Another popular ontology is the Financial Regulation Ontology (FRO), which has been used as a higher-level, core ontology for ontologies such as the Insurance Regulation Ontology¹ (IRO), the Fund Ontology², etc.

Economics. The STW (Standard Thesaurus Wirtschaft) Thesaurus for Economics was developed by the German National Library of Economics (ZBW) and gained popularity in scientific institutes, libraries and documentation centers, as well as business information providers. The JEL classification system was initially developed for use in the Journal of Economic Literature (JEL) [5] and is now a standard method of classifying scholarly literature in the field of economics.

Banking. Knowledge graphs have attracted increasing attention in the banking industry over the past decade. The WBG Taxonomy³ includes 3,882 concepts. It serves as a small classification schema which represents the concepts used to describe the World Bank Group's topical knowledge domains and areas of expertise, providing an enterprise-wide, application-independent framework. In comparison, the Bank Regulation Ontology (BRO) is much bigger and uses two industrial standards, namely FIBO and LKIF [6], as its upper ontology. It was built on top of the FRO ontology mentioned above. Unfortunately, many knowledge graphs are developed by banks and are not open source.

In this paper we study properties of integrated knowledge graphs in the domains of economics, banking, and finance. Our results show that even though the integrated knowledge graph contains some errors caused by minor mistakes, its overall usefulness has been improved. Our contributions are:

a) We integrate some knowledge graphs in the domains of economics, banking, and finance and present the integrated knowledge graph consisting of over 610K entities and 1.7 million triples⁴.
b) We study how the integration can enrich the information of entities with some statistical and graph-theoretical analysis.
c) We discuss the sources of error in the integrated knowledge graph and their refinement for future use.

The paper is organised as follows: Section 2 presents the knowledge graphs and their statistics. Section 3 presents details of the integrated knowledge graph with an analysis of the sources of error, followed by a discussion. Finally, we draw the conclusion in Section 4.

¹ https://insuranceontology.com/
² https://fundontology.com/
³ https://vocabulary.worldbank.org/PoolParty/wiki/taxonomy
⁴ The data and Python scripts are available at https://github.com/shuaiwangvu/EcoFin-integrated.

2. Integrating Knowledge Graphs

A knowledge graph $G = \langle V, E, L, l \rangle$ is a directed and labelled graph, where $V$ is the set of nodes, $E \subseteq V \times V$ the set of edges, and $L$ the set of edge labels. A function $l : E \to 2^{L}$ assigns to each edge a set of labels from $L$. The nodes $V$ can be IRIs, literals, or blank nodes. The edges $E$ are relations between nodes and their types in the form of triples. Ontologies are semantic models of data that define the entities, their properties and types, types and subtyping, as well as relations between entities. An ontology can be represented as a knowledge graph.

An integrated knowledge graph $\mathcal{G} = \langle \mathcal{V}, \mathcal{E}, \mathcal{L}, \ell \rangle$ is a combination of a set of $N$ knowledge graphs $\{G_1, \ldots, G_N\}$ where $\mathcal{V} = V_1 \cup \ldots \cup V_N$, $\mathcal{E} = E_1 \cup \ldots \cup E_N$, and $\mathcal{L} = L_1 \cup \ldots \cup L_N$. A function $\ell : \mathcal{E} \to 2^{\mathcal{L}}$ assigns to each edge a set of labels, which is the union of the labels: $\ell(e) = l_1(e) \cup \ldots \cup l_N(e)$, taking $l_i(e) = \emptyset$ when $e \notin E_i$. For a given set of relations $R$, the subgraph $\mathcal{G}_R$ is the graph with $\mathcal{L} = R$. When $R = \{r\}$, we write $\mathcal{G}_R = \mathcal{G}_r$. Often, such an integration requires the process of determining correspondences between concepts in ontologies. Such a process is called ontology alignment, and the set of correspondences is called a mapping or an alignment.

By integrating knowledge graphs of various domains, we expect more entities and richer information for entities. The following is a list of 11 knowledge graphs we collected from 9 projects in the domains of economics, banking, and finance.

1. the Financial Industry Business Ontology (we collected the FIBO ontology using OWL and the FIBO vocabulary using SKOS)⁵
2. the Financial Regulation Ontology (FRO)⁶
3. the Hedge Fund Regulation (HFR) ontology⁷
4. the Legal Knowledge Interchange Format (LKIF) ontology⁸
5. the Bank Regulation Ontology (BRO)⁹
6. the Financial Instrument Global Identifier (FIGI)¹⁰
7. the STW Thesaurus for Economics (and its mappings)¹¹
8. the Journal of Economic Literature (JEL) classification system¹²
9. the Fund Ontology¹³

Not all knowledge graphs are available: some are not open source (e.g., the Italian Ownership Graph [3]), others are commercial (e.g., the enterprise knowledge graphs by Agnos.ai¹⁴), and a few are no longer maintained (e.g., the OntoBacen project [7]).

We used LogMap¹⁵ for the alignment between knowledge graphs [8]. LogMap is a highly scalable ontology matching system with 'built-in' reasoning and inconsistency repair capabilities. It can efficiently match semantically rich ontologies containing tens (and even hundreds) of thousands of classes. Considering the size of our files, we used the version with mapping repair but without the aid of any reasoner. Unfortunately, FRO, BRO, and HFR failed to load due to parsing errors in some files they import. Table 1 summarizes the number of pairs of entities generated by LogMap. Overall, 1,698 unique identity links of skos:exactMatch were added to the integrated graph.

Table 1: Alignment of knowledge graphs

|          | FIBO-vD | FIBO-OWL | LKIF | FIGI | STW | JEL | Fund |
|----------|---------|----------|------|------|-----|-----|------|
| FIBO-vD  | -       | 599      | 1    | 147  | 12  | 204 | 11   |
| FIBO-OWL | -       | -        | 24   | 516  | 5   | 57  | 70   |
| LKIF     | -       | -        | -    | 1    | 0   | 0   | 23   |
| FIGI     | -       | -        | -    | -    | 0   | 34  | 2    |
| STW      | -       | -        | -    | -    | -   | 2   | 0    |
| JEL      | -       | -        | -    | -    | -   | -   | 1    |
| Fund     | -       | -        | -    | -    | -   | -   | -    |

All the knowledge graphs were first converted to Turtle format and then integrated using RDFpro¹⁶ [9], with duplicate triples removed. RDFpro is an open source stream-oriented toolkit for the processing of RDF triples. We used RDFpro (version 0.6) without smushing. The integration took 23 seconds on a 2.2 GHz Quad-Core i7 laptop with 16GB of memory running macOS. All the files were then converted to HDT format for further experiments. The integrated knowledge graph consists of 1,778,755 unique triples (edges) and 610,866 nodes. Its size is 93MB in Turtle format and 22MB in HDT format. Table 2 summarizes the number of nodes and edges and the size of the Turtle files. For the sake of speed, when studying properties of these knowledge graphs, we use the files in HDT format.

Table 2: General statistics of knowledge graphs

| Name         | |V|     | |E|       | Size  |
|--------------|---------|-----------|-------|
| FIBO-vD      | 17,547  | 28,128    | 3.1MB |
| FIBO-OWL     | 103,288 | 250,002   | 16MB  |
| FRO          | 94,215  | 283,976   | 16MB  |
| HFR          | 14,235  | 34,771    | 2.6MB |
| LKIF         | 1,005   | 2,363     | 141KB |
| BRO          | 259,074 | 838,007   | 43MB  |
| FIGI         | 12,180  | 16,434    | 822KB |
| STW          | 51,128  | 113,276   | 3.4MB |
| JEL          | 12,109  | 17,757    | 1.1MB |
| Fund         | 10,119  | 35,005    | 3.2MB |
| STW-mappings | 78,398  | 177,603   | 11MB  |
| alignment    | 2,327   | 1,698     | 255KB |
| integrated   | 610,866 | 1,778,755 | 93MB  |

⁵ The product version was retrieved from https://edmconnect.edmcouncil.org/fibointerestgroup/fibo-products/fibo-owl (147 files in Turtle format) and https://edmconnect.edmcouncil.org/fibointerestgroup/fibo-products/fibo-voc (1 file in Turtle format) respectively on 14th January, 2022.
⁶ 32 Turtle files were retrieved from https://finregont.com/ontology-directory-files-prefixes/ on 14th January, 2022.
⁷ 12 Turtle files were retrieved from https://hedgefundontology.com/ontology-files/ on 14th January, 2022.
⁸ Retrieved from http://www.estrellaproject.org/lkif-core/#download on 30th January, 2022.
⁹ 16 Turtle files were retrieved from https://bankontology.com/ontology-directory-files-prefixes/ on 30th January, 2022.
¹⁰ 4 RDF files were retrieved from https://www.omg.org/spec/FIGI/ on 22nd December, 2021.
¹¹ The paper used STW v9.12 based on the SKOS ontology. The ontology and its 9 mapping files were retrieved from https://zbw.eu/stw/version/latest/download/about.en.html on 30th January, 2022.
¹² The Turtle file was retrieved from https://zbw.eu/beta/external_identifiers/jel/about on 30th January, 2021.
¹³ The paper used 8 Turtle files retrieved from https://fundontology.com/ontology-files/ on 28th December, 2021.
¹⁴ https://agnos.ai/services
¹⁵ http://krrwebtools.cs.ox.ac.uk/logmap/
¹⁶ http://rdfpro.fbk.eu/

3. Analysis of the Integrated Knowledge Graph

In this section, we first study how the information of entities can be enriched with some statistical analysis of graph structure (Section 3.1). We then examine identity links (e.g. skos:exactMatch) in the integrated graph $\mathcal{G}$ and their corresponding subgraphs (Section 3.2). Finally, we study transitive and pseudo-transitive relations such as concept generalisation (Section 3.3), followed by a discussion (Section 3.5).

3.1. Statistical analysis

We study how the information of entities can be enriched when combining different resources. When an entity is described in different domains, its in- and out-degree are expected to increase. Figure 1 illustrates the in-/out-degree of the knowledge graphs and the integrated knowledge graph. Both the in- and out-degrees of the integrated graph show a power-law distribution. Moreover, the figures show that the integration increases both the degrees in general and the number of nodes with high degrees, which demonstrates how this integration can enrich the information of entities. For example, lkif-core-norm:allowed_by has an out-degree of 7 in the integrated graph, but the three graphs that contain information about it have out-degrees of 2, 5, and 1 respectively¹⁷.

Figure 1: Distribution of in-/out-degree of nodes in knowledge graphs

A strongly connected component (SCC) of a directed graph is a maximal subgraph where there is a path between all pairs of vertices. A weakly connected component (WCC) is a subgraph of the original graph where all vertices are connected to each other by some path, ignoring the direction of edges. Table 3 summarizes the graph-theoretical statistics. Let maxSCC and maxWCC represent the number of nodes in the largest strongly connected component and weakly connected component respectively. In addition, we compute the fraction of nodes in the biggest SCC and WCC, denoted $p_S$ and $p_W$ respectively. The high values of $p_W$ in the table show that the graphs are mostly connected. More specifically, $p_W = 99.98\%$ for the integrated graph, which is due to the overlapping domains of the knowledge graphs and the mappings. The low values of $p_S$ indicate that the underlying structure of these graphs is mostly hierarchical, especially that of JEL, BRO, and FIBO-vD.

Table 3: Graph-theoretical statistics of knowledge graphs

| Name         | maxSCC | p_S (%) | maxWCC  | p_W (%) |
|--------------|--------|---------|---------|---------|
| FIBO-vD      | 1      | 0.01    | 17,535  | 99.93   |
| FIBO-OWL     | 297    | 0.29    | 103,208 | 100     |
| FRO          | 17     | 0.02    | 94,015  | 99.79   |
| HFR          | 849    | 5.96    | 14,230  | 99.96   |
| LKIF         | 88     | 8.76    | 963     | 95.82   |
| BRO          | 13     | 0.01    | 258,982 | 99.96   |
| FIGI         | 13     | 0.11    | 12,180  | 100     |
| STW          | 6,777  | 13.25   | 51,128  | 100     |
| JEL          | 1      | 0.01    | 12,099  | 99.92   |
| Fund         | 109    | 1.08    | 10,111  | 99.92   |
| STW-mappings | 617    | 0.79    | 78,398  | 100     |
| alignment    | 3      | 0.13    | 119     | 5.11    |
| integrated   | 36,853 | 6.03    | 610,792 | 99.98   |

¹⁷ The prefix lkif-core-norm corresponds to the namespace http://www.estrellaproject.org/lkif-core/norm.owl#.

3.2. Analysis of identity links

Identity links are relations between entities that are considered identical and intended to refer to the same real-world entities. Typical identity links use relations such as owl:sameAs and skos:exactMatch. We first study identity links in $\mathcal{G}$ and their corresponding subgraphs. In contrast to the statistics reported by Raad et al., where owl:sameAs is much more popular than skos:exactMatch [10], our analysis shows that only 5,253 triples about owl:sameAs are in $\mathcal{G}$ against 31,254 triples about skos:exactMatch. In addition, there are 8,172 triples about skos:relatedMatch, and 6,418 triples about skos:closeMatch. Figure 2 shows the frequency distribution of the weakly connected components in their corresponding subgraphs.

Figure 2: Frequency distribution of connected components in the integrated graph

The two largest connected components of the subgraph of owl:sameAs have 8 and 6 entities. In contrast, the two largest connected components of skos:exactMatch are much bigger, with 119 and 45 entities respectively. For skos:relatedMatch, the largest weakly connected component consists of 21 entities. That of skos:closeMatch consists of 52 entities. A manual examination shows that there are errors in these large connected components. The misuse of these SKOS mapping properties can have fewer implications than the misuse of owl:sameAs, since skos:exactMatch indicates only "a high degree of confidence that the concepts can be used interchangeably across a wide range of applications" [10]. Moreover, lkif-core:mereology.owl#strictly_equivalent is an equivalence relation but corresponds to no triple¹⁸. More discussion is included in Section 3.4.

¹⁸ The prefix lkif-core corresponds to the namespace http://www.estrellaproject.org/lkif-core/.

3.3. Analysis of transitive and pseudo-transitive relations

Transitive relations are widely used in knowledge graphs for the definition of class subsumption, concept generalisation, organisation composition, etc. Due to transitivity, entities in cycles imply some equivalence relation, which could be erroneous. Take lkif-core:component_of for example. A triple specifies that "some thing is a (functional) component of some other thing". Entities in a cycle of lkif-core:component_of indicate that they are all components of each other, which could be erroneous. Some past work showed how strongly connected components can be used to locate errors when refining knowledge graphs [11, 12].

There are in total 20 relations typed owl:TransitiveProperty in $\mathcal{G}$. We also study the pseudo-transitive relations: those relations that are not typed owl:TransitiveProperty but show transitivity in their intended semantics [11]. In this study, we focus on two pairs of such relations: skos:broader and its inverse skos:narrower, and skos:broadMatch and its inverse skos:narrowMatch. This section excludes identity link relations such as skos:exactMatch, which were discussed in Section 3.2.

Take skos:broadMatch for example. A manual analysis of the three largest SCCs shows that the edges could be erroneous. These SCCs are: a component with four entities about plebiscite, referendum, and popular initiative; a component with three entities about insurance and private insurance; a component with three distinct entities about the CARICOM countries, Caribbean countries, and the Caribbean Community.

Let $\mathcal{G}_B$ be the subgraph of the integrated graph $\mathcal{G}$ with $B = \{$skos:broader, skos:broadMatch$\}$, and $\mathcal{G}_N$ the subgraph with $N = \{$skos:narrower, skos:narrowMatch$\}$. Next, we combine $\mathcal{G}_B$ with the graph $\mathcal{G}'_N$, where $\mathcal{G}'_N$ is the graph $\mathcal{G}_N$ with each edge reversed in direction. After performing the same analysis, we discover a new strongly connected component with four entities about adjustable peg, fixed exchange rate, exchange rate regime and internationales Währungssystem, respectively. Moreover, the resulting graph has 44 strongly connected components of two entities, which is more than in the subgraphs corresponding to each individual relation. This indicates that such integration can result in more complex errors which do not appear in stand-alone graphs.

Our analysis shows that rdfs:subClassOf is a popular relation with 47,597 triples. However, there is no SCC with more than one entity, which implies that the underlying class hierarchy is a directed acyclic graph. In addition, lkif-core:component, fro:divides¹⁹, and its inverse fro:divided_by are also popular transitive relations. None of them has strongly connected components of size greater than or equal to two.

¹⁹ The prefix fro corresponds to the namespace http://finregont.com/fro/ref/LegalReference.ttl#.

3.4. Source of Error and Refinement

When tracing back to the sources of each edge, we found that skos:broader and skos:narrower are mostly from three sources: STW, JEL, and FIBO-vD. When combined with the subgraph of skos:broadMatch and skos:narrowMatch, there are in total 44 SCCs of two entities, two SCCs of three entities, and two SCCs of four entities. It is feasible for domain experts to examine all these small SCCs manually without employing any refinement algorithm.

Our analysis also shows that the identity links come from a limited set of sources: the owl:sameAs triples are from the FIBO-OWL knowledge graph, while the triples about skos:exactMatch, skos:closeMatch, and skos:relatedMatch are from STW-mappings and our alignment. Mapping files about the STW subject categories were created by the alignment tool Amalgame²⁰. Our manual examination shows that these identity links connect closely related concepts and that their refinement requires knowledge from experts.

²⁰ https://github.com/jrvosse/amalgame

3.5. Discussion

As shown above, this integration results in new statistical and graph-theoretical properties. Next, we compare how these problems manifest in our graph and in LOD-a-lot²¹ [13]. LOD-a-lot is a dataset that integrates over 28 billion triples from 650K files of the LOD Cloud into a single ready-to-consume file. While our integrated knowledge graph has 1.7 million unique triples, LOD-a-lot is much larger with 28.3 billion triples. For LOD-a-lot, 356.9K edges out of 11.8 million edges of skos:broader are involved in SCCs [11]. In contrast, we have no SCC with two or more entities among 17,868 edges of skos:broader. For LOD-a-lot, 1.4K edges out of 4.4 million edges of rdfs:subClassOf are involved in SCCs [11, 12]. In contrast, there is no cycle in our corresponding subgraph. This confirms the quality of the knowledge graphs we used. The identity graph of LOD-a-lot regarding owl:sameAs consists of 558.9 million triples, with the largest connected component consisting of 177.8K entities [10]. In contrast, our identity graphs of both owl:sameAs and skos:exactMatch are small and can be manually refined.

²¹ http://lod-a-lot.lod.labs.vu.nl/

4. Conclusion

In this paper, we presented an integrated knowledge graph in the domains of economics, finance, and banking. We demonstrated how the integrated graph has more entities with richer information. We discussed subgraphs of (pseudo-)transitive and identity relations as well as their refinement. The overall usefulness has been improved despite minor errors introduced by the integration.

Our integrated knowledge graph can be used to evaluate data interoperability. It can also enrich the features of entities, which may increase the accuracy of pattern recognition using machine learning for the detection of takeovers, money laundering, insurance fraud, counterfeiting, etc. Furthermore, it can be used to improve the quality of suspicious activity reports, recommendation systems, conversational agents, etc.

References

[1] S. Malik, The ontology of finance: Price, power, and the arkhederivative, in: Collapse Vol. VIII: Casino Real, Falmouth: Urbanomic, 2014, pp. 629–811.
[2] D. Shao, R. Annam, Translation embeddings for knowledge graph completion in consumer banking sector, in: A. El Fallah Seghrouchni, D. Sarne (Eds.), Artificial Intelligence. IJCAI 2019 International Workshops, Springer International Publishing, Cham, 2020, pp. 5–17.
[3] L. Bellomarini, M. Benedetti, A. Gentili, R. Laurendi, D. Magnanimi, A. Muci, E. Sallinger, COVID-19 and company knowledge graphs: Assessing golden powers and economic impact of selective lockdown via AI reasoning, CoRR abs/2004.10119 (2020). URL: https://arxiv.org/abs/2004.10119. arXiv:2004.10119.
[4] M. Bennett, The financial industry business ontology: Best practice for big data, Journal of Banking Regulation 14 (2013) 255–268.
[5] B. Cherrier, Classifying economics: A history of the JEL codes, Journal of Economic Literature 55 (2017) 545–79.
[6] R. Hoekstra, J. Breuker, M. Di Bello, A. Boer, The LKIF core ontology of basic legal concepts, 2007, pp. 43–63.
[7] F. F. Polizel, S. J. Casare, J. S. Sichman, OntoBacen: A modular ontology for risk management in the Brazilian financial system, in: Proceedings of the Joint Ontology Workshops, 2015.
[8] E. Jiménez-Ruiz, B. Cuenca Grau, LogMap: Logic-based and scalable ontology matching, in: L. Aroyo, C. Welty, H. Alani, J. Taylor, A. Bernstein, L. Kagal, N. Noy, E. Blomqvist (Eds.), The Semantic Web – ISWC 2011, Springer Berlin Heidelberg, 2011, pp. 273–288.
[9] F. Corcoglioniti, M. Rospocher, M. Mostarda, M. Amadori, Processing billions of RDF triples on a single machine using streaming and sorting, in: R. L. Wainwright, J. M. Corchado, A. Bechini, J. Hong (Eds.), Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015, ACM, 2015, pp. 368–375. URL: https://doi.org/10.1145/2695664.2695720. doi:10.1145/2695664.2695720.
[10] J. Raad, N. Pernelle, F. Saïs, W. Beek, F. van Harmelen, The sameAs problem: A survey on identity management in the web of data, CoRR abs/1907.10528 (2019). URL: http://arxiv.org/abs/1907.10528. arXiv:1907.10528.
[11] S. Wang, J. Raad, P. Bloem, F. van Harmelen, Refining transitive and pseudo-transitive relations at web scale, in: European Semantic Web Conference, Springer, 2021, pp. 249–264.
[12] S. Wang, J. Raad, P. Bloem, F. van Harmelen, Submassive: Resolving subclass cycles in very large knowledge graphs, in: Workshop on Large Scale RDF Analytics, 2020.
[13] W. Beek, J. D. Fernández, R. Verborgh, LOD-a-lot: A single-file enabler for data science, Association for Computing Machinery, New York, NY, USA, 2017.
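
The definition of an integrated knowledge graph in Section 2 (union of nodes, edges, and labels, with the label set of each edge being the union over all inputs) can be illustrated with a minimal sketch. It is not the RDFpro pipeline used in the paper; the function names `integrate` and `out_degree` and all `ex:` identifiers are assumptions made for illustration. It also shows why the out-degree of an entity in the integrated graph can be smaller than the sum of its out-degrees in the source graphs: duplicate triples are merged.

```python
from collections import defaultdict


def integrate(graphs):
    """Union of knowledge graphs, each an iterable of (subject, predicate, object).

    Returns (nodes, labels) where `labels` maps each edge (s, o) to the
    union of its labels over all input graphs.
    """
    labels = defaultdict(set)
    for triples in graphs:
        for s, p, o in triples:
            labels[(s, o)].add(p)  # a set, so duplicate triples collapse
    nodes = {n for edge in labels for n in edge}
    return nodes, dict(labels)


def out_degree(labels, node):
    """Number of distinct triples whose subject is `node`."""
    return sum(len(ps) for (s, _), ps in labels.items() if s == node)


# Toy data with hypothetical identifiers (not taken from the paper's files).
g1 = [("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:q", "ex:c")]
g2 = [("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:r", "ex:b"), ("ex:d", "ex:p", "ex:a")]

nodes, labels = integrate([g1, g2])
deg_g1 = out_degree(integrate([g1])[1], "ex:a")   # 2
deg_g2 = out_degree(integrate([g2])[1], "ex:a")   # 2
deg_all = out_degree(labels, "ex:a")              # 3, not 4: one triple is shared
```

Storing labels as sets keyed by edge mirrors the function that maps each edge to a set of labels, and makes duplicate removal implicit.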
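
The connectivity statistics of Section 3.1 (maxSCC, $p_S$, maxWCC, $p_W$) can be sketched in pure Python as follows. This is an illustrative implementation based on Kosaraju's algorithm, chosen here for brevity; it is not necessarily how the paper's scripts compute the values, and the function name `connectivity_stats` and the toy edge list are assumptions.

```python
from collections import defaultdict


def connectivity_stats(edges):
    """Return (maxSCC, p_S, maxWCC, p_W) for directed edges given as (u, v) pairs.

    p_S and p_W are the percentages of nodes in the largest strongly and
    weakly connected component respectively.
    """
    nodes = set()
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        nodes.update((u, v))
        succ[u].append(v)
        pred[v].append(u)
    if not nodes:
        return 0, 0.0, 0, 0.0

    # Kosaraju, pass 1: iterative DFS recording nodes by completion time.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:  # all successors explored
                order.append(node)
                stack.pop()

    def flood(start, neighbours, assigned):
        """Size of the set of unassigned nodes reachable from `start`."""
        size, todo = 0, [start]
        assigned.add(start)
        while todo:
            node = todo.pop()
            size += 1
            for nxt in neighbours(node):
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        return size

    # Pass 2: flood the reversed graph in decreasing completion time.
    assigned, max_scc = set(), 0
    for node in reversed(order):
        if node not in assigned:
            max_scc = max(max_scc, flood(node, lambda n: pred[n], assigned))

    # Weak components: ignore edge direction.
    assigned, max_wcc = set(), 0
    for node in nodes:
        if node not in assigned:
            size = flood(node, lambda n: succ[n] + pred[n], assigned)
            max_wcc = max(max_wcc, size)

    n = len(nodes)
    return max_scc, 100 * max_scc / n, max_wcc, 100 * max_wcc / n


# Toy graph: a 3-cycle with one dangling node, plus a separate edge.
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]
max_scc, p_s, max_wcc, p_w = connectivity_stats(toy)  # maxSCC = 3, maxWCC = 4
```

A mostly hierarchical graph yields a small maxSCC and a large maxWCC under this computation, which is the pattern the section reports for graphs such as JEL and BRO.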
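
The construction in Section 3.3, where the narrower-direction edges are reversed and combined with the broader-direction edges before searching for strongly connected components, can be sketched as below. The helper names `reachable` and `cyclic_components` and the nodes A, B, C are hypothetical, and the reachability-intersection approach is a simple stand-in suited only to small graphs. The toy data shows a cycle that exists in neither relation alone but appears once they are combined.

```python
def reachable(start, adjacency):
    """All nodes reachable from `start` (including itself)."""
    seen, todo = {start}, [start]
    while todo:
        for nxt in adjacency.get(todo.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def cyclic_components(edges):
    """Strongly connected components with at least two nodes.

    A node's SCC is the intersection of what it can reach and what can
    reach it. This is quadratic in the worst case, so it is only meant
    for small illustrative graphs.
    """
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
    done, components = set(), []
    for node in sorted(set(succ) | set(pred)):
        if node in done:
            continue
        scc = reachable(node, succ) & reachable(node, pred)
        done |= scc
        if len(scc) >= 2:
            components.append(scc)
    return components


# Hypothetical toy edges, written as (subject, object).
broader = {("A", "B"), ("B", "C")}   # A skos:broader B, B skos:broader C
narrower = {("A", "C")}              # A skos:narrower C

# Reverse the narrower edges so both sets point in the broader direction.
reversed_narrower = {(o, s) for s, o in narrower}
combined = broader | reversed_narrower

cycles_broader = cyclic_components(broader)      # []
cycles_narrower = cyclic_components(narrower)    # []
cycles_combined = cyclic_components(combined)    # [{"A", "B", "C"}]
```

Here A is asserted to be below C through the broader chain and above C through the narrower edge, so the combined graph contains the cycle A, B, C even though each relation on its own is acyclic.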