Augmenting Logic-based Knowledge Graphs: The Case of Company Graphs Paolo Atzeni1 , Luigi Bellomarini2 , Michela Iezzi2 , Emanuel Sallinger3,4 , and Adriano Vlad4 1 Università Roma Tre 2 Banca d’Italia 3 TU Wien 4 University of Oxford 1 The Industrial Setting Company ownership graphs are central objects in corporate economics [4, 10, 13, 17] and are of high importance for central banks to solve problems in banking supervision, credit-worthiness, anti-money laundering and many more areas. The Bank of Italy owns the database of Italian companies that contains details about shareholding structures. In a graph-based view of the database, as shown in Figure 1, we see ownership as the core concept: nodes are companies and persons (black resp. blue nodes), and ownership edges (black solid links) are labelled with the shares a company or person x owns of a company y. An important problem with com- pany graphs is company control [9], which amounts to deciding whether a company x controls a company y, i.e., x can force decisions of y. A company (or a person) x controls a company y, if: (i) x directly owns more than 50% of y; or, (ii) x controls a set of compa- nies that jointly, and possibly together with x, own more than 50% of y. Fig. 1. Sample from the Italian graph. Besides financial relationships, per- sonal or family connections (of various degrees, e.g., “PartnerOf”, “SiblingOf”, and so on) enable a much broader use of such company graphs in several fields such as anti-money laundering, fraud detection or economic and statistical re- search [12]. In these fields, the definition of control can be extended to families. For instance, in Figure 1: P1 controls C4 with a direct 80% edge; P2 controls C7 , via C5 and C6 . Dashed green edges represent control links. The dashed red edge between P2 and P3 represents a “PartnerOf ” relationship. P2 and P3 are in the same family. Neither P2 nor P3 control C8 , but their family jointly owns 60% and thus controls C8 . In this paper we continue our recent work [2] and study the problem of predicting family links as a special case of a wider class of graph augmentation problems providing a hybrid embedding-reasoning approach to it. Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 P. Atzeni et al. 2 The Vada-Link Framework With knowledge graph augmentation, (KG augmentation) we wish to character- ize a special case of link prediction [15], concerned with predicting hidden links in network structures. Unlike the standard problem, our emphasis is on the need for a careful combination of the extensional data (existing nodes and edges, namely the ground extensional component) and the available domain knowledge (the in- tensional component). This view on the problem is well captured by logic-based Knowledge Graphs [6], which represent available data as facts, often encoding property graphs (PGs) [1] in a relational form, and domain knowledge as rea- soning rules. Rules are modeled in some logic formalism, e.g., Vadalog [7] a language in the Datalog± [8] family, striking a good balance between compu- tational complexity and expressive power. The application of rules upon the extensional component allows to accomplish complex reasoning tasks by gener- ating new entailed facts and thus new knowledge. At the core of our approach, Vada-Link, there is the idea of modeling KG augmentation problems as reasoning tasks on logic-based KGs, having enterprise data stores as extensional component (typically a PG) and Vadalog reason- ing rules expressing the augmentation logic. Link prediction, and therefore KG augmentation, is an intrinsically quadratic problem, making naı̈ve approaches inapplicable to dense graphs like that of the Italian companies with millions of nodes. Inspired by entity resolution approaches [11], Vada-Link addresses scal- ability with a principled interplay of node embeddings-based clustering [14] that reduces the set of candidate nodes, and pure logic reasoning, which exhaustively determines links inside each cluster and hence improves clustering accuracy. The Vadalog reasoning rules for KG augmentation are structured into three sets: 1. input mappings: taking as input the relational representation of one specific PG, and transforming it into higher-level concepts: generic nodes, generic edges and types and properties for those nodes and edges; 2. link prediction logic: containing the core reasoning process giving rise to new edges; 3. output mappings: transforming the high-level generic links that have been created by the link prediction logic into their relational representation in the PG. Let us come to the link prediction logic, described by the following rules: (1) Node(x, f1x , . . . , fnx ), Link(e, v, w, f1e , . . . , fm e ), NodeType(x, tn ), EdgeType(e, te ), b1 = #GraphEmbedClust(f1x , . . . , fnx , f1e , . . . , fm e , tn , te , hei), x x b2 = #GenerateBlocks(f1 , . . . , fn , tn ) → Block(b1 , b2 , x) (2) Node(x, f1x , . . . , fnx ), Node(y, f1y , . . . , fny ), NodeType(x, tn ), NodeType(y, tn ), x 6= y, Block(b1 , b2 , x), Block(b1 , b2 , y), LinkClass(t), Candidate(x, y, t) → ∃z Link(z, x, y, . . .), EdgeType(z, t). For every node x, Rule (1) considers all edges e and positions x as a two-level nested clustering structure represented by the atom Block, where b1 and b2 are the clustering levels. The first clustering is established by applying the function #GraphEmbedClust, which wraps a function call to a specific clustering algo- Augmenting Logic-based Knowledge Graphs: The Case of Company Graphs 3 rithm based on a node embedding primitive, e.g., node2vec [14]. It takes as input the features of x, the edges e of the graph along with their features, and the re- spective types tn and te , and returns the identifier b1 of the first-level cluster. #GraphEmbedClust is a monotonic aggregation function. The second-level clustering is determined by applying the function #Gener- ateBlocks, whose resulting cluster identifier b2 only depends on the node prop- erties and type. #GenerateBlocks is a fact-level function: for a specific binding of the function arguments, a value for b1 is produced. The function is in some sense polymorphic: depending on the type tn of the involved nodes, a specific semantics is applied to decide the target cluster on the basis of the node features. For every second-level cluster defined by a Block fact, Rule (2) exhaustively considers all pairs of nodes x and y and every possible LinkClass t (of which with respect to our case many exist: Control, ParentOf, PartnerOf, etc.). The Candidate predicate is used to decide whether a Link from x to y must be produced or not. If this is the case, a new t-typed edge is created. Rule (2) compares only the pairs of nodes in the same sub-cluster (identified by b1 and b2 ). Candidate is defined for company control (3-4) and family links (5) as: (3) Node(x, f1 , . . . , fn ), NodeType(x, Company) → Candidate(x, x, Control). (4) Candidate(x, z, Control), Link(u, z, y, w)EdgeType(u, Shareholding), msum(w, hzi) > 0.5→ Candidate(x, y, Control). (5) Node(x, f1x . . . fnx ), Node(y, f1y . . . fny ), NodeType(x, Person), NodeType(y, Person), #LinkProbability(f1x . . . fnx , f1y . . . fny ) > T → Candidate(x, y, PartnerOf). Many different models to predict family links exist. As common in these cases, we simply define the presence of a family link between x and y with a multi-feature Bayesian classifier (wrapped by #LinkProbability). Discussion. Our approach is schema independent as Vada-Link is able to per- form KG augmentation regardless of the specific input PGs. We implement it with polymorphic Candidate atoms for each problem. The approach is model in- dependent because the extensional component can originate from heterogeneous data sources, even based on different data models. Finally, we adopt a kind of reinforcement principle because the the first-level clustering in the Vada- Link algorithm is gradually improved with new edges from Rule (2). Termination is guaranteed as the number of Links that can be generated by 2 Rule (2) is finite and, in the worst case, it amounts to |N | × C, where N are the PG nodes and C is the number of possible link types. Rule (1) produces a finite number of clusterings hb1 , b2 i, since it can fire in the worst case for every single edge in E plus all the ones introduced by Rule (2). Our solution is scalable, given Vadalog tractability [7], and limits the search space via multi-level clustering. In the average case, clustering allows for linear 2 behaviour. In the worst case, the approach performs |N | × C comparisons, (N are the nodes and C is the number of possible link types). However, for high density, complexity is dominated by embedding algorithms, having quadratic complexity in the branching factor [16]. 4 P. Atzeni et al. 3 Experiments We provide an experimental evaluation of Vada-Link to family detection, high- lighting its scalability and accuracy for both real-world and artificial settings. Datasets. For the real-world case, we use the database of Italian companies of Banca d’Italia whose details can be found in [2]. For the synthetic case, we adopt Barábasi algorithm [3] to generate realistic scale-free company networks with 6-feature nodes obtained respecting the original statistical properties. Software and hardware configuration. We ran the experiments on a Mac- Book with 1.8 GHz Intel Core i5 and 4 GB 1600 MHz DDR3 memory. The clustering functions have been executed with Python 3.2.3. 3.1 Evaluating Scalability We tested the scalability of Vada-Link for different clustering structures, vary- ing graph topology (e.g., number of nodes, density) and data distribution. Varying number of nodes. We built 20 scenarios with subsets (≈ 1-100k nodes) from the Italian company graph and averaged elapsed times. To further stress the system, we also built 6 artifi- cial graphs of higher density but same size and scale-free topology as the real-world ones. Figure 2(a) and 2(c) show Vada-Link good scal- ability for the real-world and synthetic case respectively. The trend is linear and the Fig. 2. Vada-Link execution time and recall. approach very effective. Varying the number of clusters. We crafted a real-world-like experiment, using the Italian company graph. Vada-Link clustering technique recursively and jointly employs node2vec (first-level clustering) and feature based-blocking (second-level clustering); using a deterministic mapping, we assign –via hash- ing or Skolem functors– a feature vector f1 , . . . , fn into a second-level cluster identifier. In this experiment, by artificially tweaking the value of k of such n features, we hijack the mapping into an increasing number of clusters of de- creasing size and observe how this affects elapsed time. We extract values for the vector f1 , . . . , fk from a discrete multivariate uniform distribution over the sample space S1 , . . . , Sk . By restricting (expanding) the cardinality of the do- main of S1 × . . . × Sk , we induce more (fewer) and smaller (bigger) clusters. Specifically, we mapped f1 , . . . , fn into ≈ 1-500 clusters. Augmenting Logic-based Knowledge Graphs: The Case of Company Graphs 5 Figure 2(b) confirms that the effective application of Vada-Link requires a careful feature engineering phase: the number and size of the clusters vary according to the feature selectivity. For example, searching for the “siblingOf” relationship among people with common last names and same age range would lead to large clusters. Vada-Link, through the joint adoption of a hybrid node- 2vec/feature-based clustering and recursive self-improving approach, makes it easier to balance the number of clusters, elapsed time, and recall. Varying the density. We built 4 artificial graph scenarios, superdense, dense, normal, sparse and measured the relevant execution time. Figure 2(c) shows the impact of graph density on the performance of Vada-Link. Elapsed times increase significantly for more than 500 nodes whilst, for lower values, sparse, normal and dense show similar trends. On the contrary, the computation for superdense graphs is slower taking ≈30 seconds for 500 nodes. The trend is amplified for greater values, with superlinear growth for dense and superdense. While for second-level clustering #GenerateBlocks is not affected by node den- sity, since it only considers node features, both first-level clustering (#Graph- EmbedClust) and the implementations of Candidate predicate are. Node2vec processes a number of random walks that grows with the density; nevertheless, once a density is fixed, it scales almost linearly with the number of nodes, as also experimentally shown in [14]. Also the behaviour of Candidate is highly depen- dent on the considered KG augmentation problem. For example, the detection of family connections has good scalability with respect to density, as evident in Figure 2(c). Company control and close link detection are more challenging (spe- cific experiments in [7, 5]). Nevertheless, thanks to clustering, Vada-Link can achieve good behaviour also in this case, clearly at the cost of loss in accuracy. 3.2 Evaluating Accuracy Following a common validation approach for link prediction settings [14], we arbitrarily remove some edges from the initial graph, then try to predict and recover them, and evaluate the overall tradeoff between scalability and recall. Varying the number of clusters. We built 10 realistic random graphs Si (with 1 ≤ i ≤ 10). For each of them, we ran Vada-Link concentrating all nodes in one single cluster and producing all the theoretically possible links resulting in an augmented one Ŝi . Then, from each Ŝi , we randomly selected 10 edge sets Θij (with 1 ≤ j ≤ 10), each containing 20% of the predicted links and generated new subgraphs S Θij without those edges. For each S Θij we ran Vada-Link by vary- ing the number of clusters with 20 configurations from 1 to 500. For each cluster, we obtained an augmented graph Ŝ Θij and the recall Rijc as |E(Ŝ Θij )|/|E(Ŝi )|, i.e., the percentage of removed edges that have been recovered. By comparing Figures 2(b) and 2(d), we observe that 10 clusters enable processing times under 10 seconds and an effective efficiency-recall balance. The recursive interplay be- tween first- and second-level clustering compensates for increases in the number of clusters and contributes to a favourable balance between scalability and recall proving the robustness of Vada-Link. 6 P. Atzeni et al. 4 Conclusion In this paper, we proposed Vada-Link, a novel approach for weaving enterprise KGs by discovering hidden links. At the core of our technique, there is a princi- pled formulation of the problem as a Vadalog reasoning task, which guarantees scalability and problem independence, as well as machine learning/embeddings. Acknowledgements. We acknowledge the support of the WWTF grant VRG18-013, the EPSRC grant EP/M025268/1, and the EU Horizon 2020 grant 809965. References 1. Renzo Angles. The property graph database model. In AMW, 2018. 2. Paolo Atzeni, Luigi Bellomarini, Michela Iezzi, Emanuel Sallinger, and Adriano Vlad. Weaving enterprise knowledge graphs: The case of company ownership graphs. In EDBT, pages 555–566. OpenProceedings.org, 2020. 3. Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random networks. Science (New York, N.Y.), 286:509–12, 11 1999. 4. Fabrizio Barca and Marco Becht. The control of corporate europe. oxford university press, european corporate governance network. 2001. 5. Luigi Bellomarini, Davide Benedetto, Georg Gottlob, and Emanuel Sallinger. Vada- log: A modern architecture for automated reasoning with large knowledge graphs. Information Systems, page 101528, 2020. 6. Luigi Bellomarini, Daniele Fakhoury, Georg Gottlob, and Emanuel Sallinger. Knowledge graphs and enterprise AI: the promise of an enabling technology. In ICDE, pages 26–37. IEEE, 2019. 7. Luigi Bellomarini, Emanuel Sallinger, and Georg Gottlob. The vadalog system: Datalog-based reasoning for knowledge graphs. PVLDB, 11(9):975–987, 2018. 8. Andrea Calı̀, Georg Gottlob, Thomas Lukasiewicz, Bruno Marnette, and Andreas Pieris. Datalog+/-: A family of logical knowledge representation and query lan- guages for new applications. In LICS, 2010. 9. Stefano Ceri, Georg Gottlob, and Letizia Tanca. Logic programming and databases. Springer, 2012. 10. Ariane Chapelle and Ariane Szafarz. Controlling firms through the majority voting rule. Physica A: Statistical Mechanics and its Applications, 355(2-4):509–529, 2005. 11. Peter Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-centric systems and applications. Springer, 2012. 12. Caroline Fohlin. Trends in business organization: Do participation and cooperation increase competitiveness? edited by horst siebert. tubingen: J. c. b. mohr, 1995. pp. viii, 292. dm 108. The Journal of Economic History, 58(1):292–293, 1998. 13. James B Glattfelder. Ownership networks and corporate control: mapping economic power in a globalized world. PhD thesis, ETH Zurich, 2010. 14. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864. ACM, 2016. 15. David Liben-Nowell and Jon M. Kleinberg. The link prediction problem for social networks. In CIKM, pages 556–559. ACM, 2003. 16. Tiago Pimentel, Adriano Veloso, and Nivio Ziviani. Fast node embeddings: Learn- ing ego-centric representations. In ICLR (Workshop). OpenReview.net, 2018. 17. Andrea Romei, Salvatore Ruggieri, and Franco Turini. The layered structure of company share networks. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE, 2015.