Towards Distributed Contextualized Knowledge Repositories for Analysis of Large-Scale Knowledge Graphs Loris Bozzato1 and Christoph G. Schuetz2 1 Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy 2 Johannes Kepler University Linz, Altenberger Str. 69, 4040 Linz, Austria bozzato@fbk.eu, christoph.schuetz@jku.at Abstract. A knowledge graph (KG) represents real-world entities as well as their properties and relationships in a structured and often logic-based formalism. Given the large amount of information and the diversity of data stored in KGs, op- erations for analysis of such data akin to traditional OLAP operations are useful to understand the contents of KGs along different dimensions. In this direction, we recently proposed Knowledge Graph OLAP (KG-OLAP), a framework based on contextualized description logics that allows to organize knowledge graphs in a multi-dimensional structure – a KG-OLAP cube. For KG-OLAP cubes, we defined operations for combination of knowledge from different cells and for abstraction of knowledge within cells. Experiments with a proof-of-concept pro- totype, however, revealed that the management of a centralized KG-OLAP cube is impractical for large KGs. In this paper, we extend KG-OLAP in order to for- malize the case in which knowledge is distributed across different repositories. We hence formalize a distributed version of the multidimensional cube structure, and we show how the operations can be adapted to this scenario. 1 Introduction A knowledge graph (KG) represents real-world entities and their properties as well as relationships between entities in a structured and often logic-based formalism KGs [8]. In recent years, many organizations aimed for the development of KGs for practical purposes, necessitating the development of appropriate Knowledge Graph Management Systems (KGMS) to leverage KGs. In particular, given the diversity of data stored in KGs, operations for analysis of such data akin to traditional OLAP operations can be useful to understand the contents of graphs and abstract their information along different dimensions (see e.g., [1,7]). In this direction, we recently proposed Knowledge Graph OLAP (KG-OLAP) [9], a framework based on contextualized description logics that allows to organize KGs in a multi-dimensional structure - a KG-OLAP cube. The main feature of the KG-OLAP model, in contrast to traditional OLAP, where numerical measures are aggregated, is that the cube cells in a KG-OLAP cube comprise knowledge facts. In this regard, each cell is identified by the combination of different dimensions (e.g., time, location, im- portance) with the members being hierarchically organized objects (e.g., 7-july-20 ≺ Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons Li- cense Attribution 4.0 International (CC BY 4.0). july-2020, austria ≺ europe). The multi-dimensional model of KG-OLAP cubes is an adaptation of the Contextualized Knowledge Repository (CKR) framework [11,3], a for- mal approach for representing and reasoning with contexts in knowledge graphs based on a description logic. The multi-dimensional structure of KG-OLAP cubes allows us to formulate OLAP-like operations for the analysis of a cube’s contents: We defined sets of operations for the combination of knowledge from different cells (contextual operations) and for the abstraction of knowledge within cells (graph operations). The KG-OLAP model and operations were implemented in a proof-of-concept prototype and evaluated with respect to the real-world scenario of Air Traffic Management (ATM) knowledge graphs [10]. A limitation of the model, especially when applied to large-scale KGs, lies in the centralized view of the cube: in practical cases (like in the ATM scenario [10]), the information may be distributed across different repositories and only part of it should be integrated when applying analysis operations. Thus, the centralized view of KG-OLAP might not be applicable to represent cases where single “nodes” in a distributed model cannot have the entire vision of the knowledge and structure of the cube. Moreover, experiments with the proof-of-concept prototype of KG-OLAP framework revealed that the management of operations over a centralized cube is impractical for KG-OLAP cubes with a large amount of data contents or a large number of cells. In response to these needs, in this paper we introduce our ongoing work to ex- tend the KG-OLAP model towards a distributed version of the framework, with the aim of managing scenarios where knowledge from the model is partitioned in different repositories. The benefit of this decentralized view is not only to be able to model the integration and analysis of a distributed KG, but also to allow for larger contextualized KGs to be divided into smaller units that can be managed by different processors. Thus, while in this paper we focus on the formalization of the modeling and data analysis in a distributed KG-OLAP model, we envision that such model can be useful also in im- plementations in order to increase the performance of reasoning (e.g., by instance-level materialization as in [9]) and analysis over large-scale KGs. We present the distributed KG-OLAP model in Section 2. We define the syntax and semantics of the distributed model by extending the KG-OLAP model presented in [9]. The proposed model can be seen as a formal specialization in the context of KGs of the multi-dimensional model presented in [10] for the management of ATM information. In Section 3 we adapt the contextual OLAP operations presented for single KG-OLAP cubes to the distributed model, considering operations over the node structure and the local organization of cells. 2 Distributed KG-OLAP Framework As in the original KG-OLAP model [9], we build our extended KG-OLAP model on definitions of the CKR framework [5,3] for a generic description logic language [2]. Intuitively, in our model we distribute the information of a KG-OLAP cube to different nodes. Nodes are organized in a hierarchical structure (akin to the coverage structure of contextual dimensions). Each node will contain a partial view of the KG-OLAP cube (representing its local information, e.g., in ATM, for a specific airport or geographic zone) and nodes from lower levels will inherit knowledge from the higher levels. The multidimensional structure is expressed using a DL signature Ω, that we call cube vocabulary, composed of the mutually disjoint sets NRΩ of atomic roles, NCΩ of atomic concepts, and NIΩ of individual names. The vocabulary further specifies a set F ⊆ NIΩ of cell names, a set D ⊆ NRΩ of dimensions, a set I ⊆ NIΩ of dimension members, and for every dimension E ∈ D, a set DE ⊆ I of dimension members of E (cf. dimensional structure in [11]). The cube language LΩ is a DL language over cube vocabulary Ω. For every dimension A ∈ D, we define the role ≺A (the dimensional ordering for A) as a strict partial order relation over dimension members DA . In the following, we may also use non-strict orderings A and direct successor relations ≺ ˙ A. In order to represent the points of the distributed structure of repositories, we con- sider a set of nodes S ⊆ NIΩ . Similarly to dimensions, we consider a role ≺S which defines a partial order over the nodes of S and describes the (hierarchical) relationships across repositories. In the rest of the paper, we assume that each dimension and the node structure S are ordered in a simple hierarchy (in particular supporting a notion of level, see e.g., ranked hierarchies in [6,4]). We thus consider a set L ⊆ NIΩ of levels and a relation lev associating each dimensional value to its level as defined in [9]. The notion of dimensional vector provides a local identifier for cells inside a given repository. Let |D| = k, we define a dimensional vector as the set d = {A1 := d1 , . . . , Ak := dk } where every dj is in the set of dimension members DAj of Aj , with j ∈ {1, . . . , k}. We refer to the set of all dimensional vectors of cube vocabulary Ω as the multidimensional space DΩ . The distribution of the cube into different nodes affects the way in which we identify and order cells inside the overall structure: intuitively, these need to depend both on the node and local dimensional coordinates. Given a node and a dimensional vector, we associate with such coordinates a cell name using the function cn : S × DΩ → F. We require cn to be bijective, i.e., each cell name c is associated with a couple (s, d) identifying the repository s and the point d in the dimensional space; conversely, c can be interpreted as the unique identifier of the corresponding dimensional point d in node s. Cells coverage now depends both on the ordering of nodes and the coverage of dimensional vectors. Let d, e ∈ DΩ , we say that d  e iff dA  eA for each A ∈ D. In the case of cells, given c1 , c2 ∈ F, we say that c2 covers c1 and we write c1  c2 iff cn(s1 , d) = c1 and cn(s2 , e) = c2 , s1 S s2 and for every A ∈ D, dA  eA . Knowledge represented inside cells is expressed in a DL language LΣ , called the object language, which is based on a DL object vocabulary Σ = NCΣ ] NRΣ ] NIΣ . The distributed KG-OLAP model can be represented by a distributed extension of the CKR framework [3,5]. The contextual structure of a CKR is expressed by a meta- vocabulary Γ , a DL vocabulary that contains a set of context names N ⊆ NIΓ , a set of module names M ⊆ NIΓ , a set of context classes C ⊆ NCΓ including class Ctx, a set of contextual relations R ⊆ NRΓ , a set of contextual attributes A ⊆ NRΓ , and for every attribute A ∈ A, a set DA ⊆ NIΓ of attribute values of A. The role mod defined over N × M expresses associations between contexts and modules. Modules represent pieces of knowledge specific to a context or context class; attributes describe contextual properties (e.g., time, location, provenance) identifying a context (or class); the context class Ctx defines the class of all contexts. It is then easy to relate the KG-OLAP cube language LΩ to the CKR meta-language LΓ : we have that F ⊆ N (i.e. cells are a kind of context), D ⊆ A (i.e. dimensions are a kind of contextual attributes) and cell coverage is a partial order relation in R. We can define as follows the extension of the CKR with distribution on nodes. Definition 1 (distributed Contextualized Knowledge Repository). A distributed Con- textualized Knowledge Repository (dCKR) over hΓ, Σi is a family KS = {Ks }s∈S where each Ks is a structure Ks = hBs , Gs , KMs i such that: ˙ S s or s≺ – Bs is a set of direct ordering relations relative to node s of the kind z ≺ ˙ S z; – Gs is a DL knowledge base over LΓ ∪ LΣ ; – KMs = {Km }m∈Ms where every Km is a DL knowledge base over LΣ , for each module name m ∈ Ms ⊆ M. Intuitively, this definition extends the definition of local CKRs with a set Bs , containing the direct relations of the node with other higher or lower nodes. In the following we call KS a distributed KG-OLAP cube (or simply distributed cube) if its metaknowledge is based on a cube language LΩ , following the above relations. A dCKR interpretation is composed by a set of CKR interpretations for each node of the distributed structure; in turn, each CKR interpretation, has a DL interpretation for the global context and a DL interpretation for every local context. Definition 2 (dCKR interpretation). A dCKR interpretation IS for hΓ, Σi is a family {Is }s∈S where each Is is a structure Is = hMs , Is i such that: (i) Ms = h∆Ms , ·Ms i is a DL interpretation of Γ ∪ Σ; (ii) for every x ∈ CtxMs , Is (x) = h∆Is (x) , ·Is (x) i is a DL interpretation over Σ. Moreover, the following conditions hold: (i). for s ∈ S, for every c ∈ N, cMs ∈ CtxMs and, for every C ∈ C, CMs ⊆ CtxMs ; (ii). for s, z ∈ S, ∆Ms = ∆Mz ; moreover, for every x ∈ CtxMs , ∆Is (x) = ∆Ms ; (iii). for s, z ∈ S and a ∈ NIΣ , aMs = aMz ; moreover, for every x ∈ CtxMs , aIs (x) = aMs . The conditions in this definition ensure that the interpretation of domains is consistent across contexts and nodes. The interpretation of ordinary DL expressions in each DL interpretation is defined, as in CKR, by the language of choice for the object language (e.g., SROIQ-RL in [3]). Given a distributed cube KS , let us consider the set BS of all bridge conditions defined as BS = s∈S Bs . We consider BS∗ to be the closure of BS with respect to the S node ordering relation. We then extend as follows the definition of KG-OLAP model [9] with new conditions for the intended interpretation of the node structure. Definition 3 (distributed KG-OLAP cube model). A dCKR interpretation IS is a distributed KG-OLAP cube model of KS iff the following conditions hold: – Local conditions: for each s ∈ S 1. for α ∈ LΣ ∪ LΓ in Gs , Ms |= α; 2. for hx, yi ∈ modMs with y = mMs , Is (x) |= Km ; 3. for α ∈ Gs ∩ LΣ and x ∈ CtxMs , Is (x) |= α. 4. if c1 , c2 ∈ F, and for every A ∈ D with d ∈ DA , Ms |= A(c1 , d) and Ms |= A(c2 , d) then c1 = c2 . 5. for d ∈ DΩ and cn(s, d) = c ∈ F, then Ms |= A(c, dA ) for each A ∈ D with d A ∈ DA . – Global conditions: 6. if s ≺S z ∈ BS∗ , then for each α ∈ LΣ ∪ LΓ in Gz , Ms |= α; 7. given c1 , c2 ∈ F with cn(s, d) = c1 and cn(z, e) = c2 where s S z ∈ BS∗ , if Ms |= c1  c2 and Ms |= mod(c2 , m) with m ∈ M, then Ms |= mod(c1 , m); Intuitively, global conditions require that knowledge from the higher nodes is propa- gated to nodes at the lower levels: if s ≺S z, condition (6) requires that axioms of the global context Gz are propagated to Gs ; condition (7) ensures that knowledge modules associated to a higher cell c2 (inside the same or higher node) are also associated to cell c1 in lower levels of the multi-dimensional structure. Given KS and c ∈ N with cn(s, d) = c, an axiom α ∈ LΣ is c-entailed by KS (denoted KS |= c : α) if Is (cMs ) |= α for every model IS of KS . We say that an axiom α is globally entailed by KS (denoted KS |= α) if: (i) α ∈ LΣ and KS |= c : α for every c ∈ N, or (ii) α ∈ LΓ and Ms |= α for every cube model Is in IS of KS . Example 1. We demonstrate the idea of distributed KG-OLAP in the Air Traffic Man- agement scenario. Figure 1 shows a “monolithic” KG-OLAP cube for the representation of a contextualized KG in ATM: each cell contains the information relative to a specific context, e.g., describing Flight Critical knowledge about Construction in 08-2020. The dimensions are hierarchically ordered, e.g., Flight Critical and Restriction are part of the Essential package of knowledge. Cells exist at multiple granularities within the cube, e.g., the KG-OLAP cube contains Additional knowledge about Weather in 08-2020 but also Supplementary knowledge about Operational Control in 2020. Figure 2 illustrates the distribution of such KG-OLAP cube for ATM in multiple nodes. Instead of a single monolithic KG-OLAP cube for the entire relevant knowl- edge, the knowledge is distributed across nodes for the knowledge of the Europe area, the LOVV (Austria) flight information region and the LOWW (Vienna) airport. The presented framework also allows for the redundant allocation of knowledge on dif- ferent nodes, e.g., LOVV and LOVV’. Knowledge propagates, on the one hand, from more general to more specific cells. For example, the knowledge from the Essential– Closure–2020 context propagates to the Restriction–Closure–2020 context. On the other hand, knowledge also propagates from more general to more specific nodes, e.g., the knowledge from the Essential–Closure–2020 cell at the Europe node propagate to the Essential–Closure–2020 cell at the LOWW node. 3 3 Query operations for distributed KG-OLAP cubes In this section we show how OLAP query operations can be defined for analyzing the contents of distributed KG-OLAP cubes. In the following we only consider contextual operations, namely operations that manipulate the structure of distributed cubes, while graph operations, that is operations that modify the graph inside modules of a cell (and thus do not depend on the organization of cells), can be defined similarly to the original formulation in [9]. Slice and dice. The slice and dice operations allows for the selection of a set of facts (i.e. a “slice” of the distributed cube structure) with specific coordinates for subsequent manipulation. In the case of distributed cubes, we can now consider two distinct oper- ations for the selection of “slices” of the multi-dimensional structure provided by the node structure and dimensional space. Additional Supplementary Potential Hazard Importance Restriction Essential Critical Flight Construction Closure Weather Comm Airport Operational Operations Control Topic Fig. 1. An individual three-dimensional KG-OLAP cube with the hierarchically ordered dimen- sions Importance, Topic, and Time and cells at multiple levels of granularity Fig. 2. An example of (a) the distributed KG-OLAP cube KATM , which corresponds to the federa- tion of the individual cubes Europe, LOVV, LOVV’, and LOWW; (b) propagation of knowledge along the hierarchy of nodes and the hierarchy of cells. The arrows represent the partial order between nodes and contexts, respectively, e.g., LOVV ATM Europe and c2  c1 . A first operation, the node slice and dice, operates on the node structure and restricts the structure to local cubes related to an input node. Definition 4 (Node slice and dice). Given a distributed cube KS = {Ks }s∈S and an element z ∈ S, we define the node slice and dice operation δ(KS , z) as a new distributed cube KS0 = {Ks }s∈S0 such that S0 = {s ∈ S | s S z or z S s ∈ BS∗ }. The selection of cells inside a local cube (once the node has been fixed), defining a local slice and dice, can be determined by the values of their dimensions, analogously to the slice operation in [9]. Let us define the set of cells relative to a node s ∈ S as Fs = {c ∈ F | cn− (c) = (s, e)}. Definition 5 (Local slice and dice). Given a cube Ks = hBs , Gs , KMs i and a dimen- sional vector d which defines the dice coordinates, we define the local slice and dice operation δ(Ks , d) as a new cube K0s = hBs , G0s , KMs i over hΓ 0 , Σi, such that: – For each A ∈ D, D0A = {e ∈ DA | e  dA or dA  e, with dA ∈ d}; – F0s = {c ∈ Fs | cn− (c) = (s, e), for each eA ∈ e, eA ∈ D0A } and other components of Γ 0 are defined analogously to Γ ; – G0 = GΣ ∪ GΓ 0 (i.e., metaknowledge in G0 is equal to the formulas in GΓ that have only symbols in Γ 0 ). Note that while this dimension based slice is defined on a single node, it can be easily extended to all the cubes in KS if the operation is applied over each s ∈ S. Merge. Similarly, we can provide operations for merging information considering the whole distribution structure and local cubes. The merge (or roll-up) operations combine knowledge at a certain granularity by merging knowledge from a lower granularity. In the case of distributed cubes, we can distinguish a node merge operation, that adds to a given node the combination of knowledge coming from its more specific nodes, and a local merge, which adds to cells of a given “dimensional level” l the roll-up of knowledge from cells in lower levels (inside a local cube). Similarly to [9], we can parametrize both operations with respect to the method met that is used to combine the rolled-up knowledge. In the following definitions we con- sider the cases in which met can be the union (∪) or intersection (∩) of such knowledge. Definition 6 (Node merge). Given a distributed cube KS = {Ks }s∈S and an element z ∈ S, we define the node merge operation ρmet (KS , z) as a new distributed cube KS0 = {Ks }s∈S0 such that: – S0 = S \ {s ∈ S | s ≺S z ∈ BS∗ }; – M0z = Mz ∪ {mg(c) | c ∈ Fz } with each mg(c) a new module name; – for each c ∈ Fz , mod(c, mg(c)) is added to G0z S (met = ∪): – Union merge – Gz ∪ s≺S z Gs is added to G0z ; – for every cell c ∈ Fz with cn− (c) = (z, d), add module Kmg(c) to KM0z with union of every module for c0 s.t. cn− (c0 ) = (s, d) for s ≺S z ∈ BS∗ . T merge (met = ∩): – Intersection – Gz ∪ s≺S z Gs is added to G0z ; – for every cell c ∈ Fz with cn− (c) = (z, d), add module Kmg(c) to KM0z with intersection of every module for c0 s.t. cn− (c0 ) = (s, d) for s ≺S z ∈ BS∗ . Let us now consider the local merge operation, which is defined as a merge on the local cells at a specific level of the dimensional structure. We define a level vector as a set l = {l1 , . . . , lk } s.t. for j ∈ {1, . . . , k}, lj ∈ LAj . We define restrictions of dimensional space DΩ given w.r.t. a level vector l as follows. The subspace DlΩ identifies all the vectors exactly at the level specified by the level vector l: DlΩ = {d ∈ DΩ | for d ∈ DA , lev(d, l) with l ∈ l}. Dl Ω defines the vectors above (or equal to) the specified level l vector: DΩ = {d ∈ DΩ | e  d, with e ∈ DlΩ }. Let µs (c) = {m ∈ M | Gs |= mod(c0 , m), c0 ≺ c}. The set µs (c) then contains all module names of the initial cube associated to contexts c0 that are more specific than the input context c. Definition 7 (Local merge). Given a local cube Ks = hBs , Gs , KMs i and a level vec- tor l, we define the merge operation ρmet (Ks , l) as a new cube K0s = hBs , G0s , K0Ms i over hΓ 0 , Σi such that: – F0s = {c ∈ Fs | cn− (c) = (s, d), d ∈ Dl Ω }; – M0s = Ms ∪ {mg(c) | c ∈ F0s with cn(c)− = (s, d), d ∈ DlΩ } with each mg(c) a new module name; – Metaknowledge of G0s is restricted to Γ 0 and mod(c, mg(c)) for each c ∈ F0s with cn(c)− = (s, d), d ∈ DlΩ is added to G0s ; 0 S (met = ∪): knowledge module Kmg(c) for c is added to KMs with: – Union merge Kmg(c) = m∈µ(c) Km 0 T (met = ∩): knowledge module Kmg(c) for c is added to KMs – Intersection merge with: Kmg(c) = m∈µ(c) Km We remark that the operations over the distribution and dimensional structure have been separated for a more clear presentation: however, in practice they could be easily com- bined in a single operation that manipulates both levels of the KG-OLAP structure. 4 Summary and Future Work In this paper we introduced a distributed version of the KG-OLAP model for the anal- ysis of contextualized knowledge graphs. We first extended the KG-OLAP framework presented in [9] to provide a decentralized definition for the multi-dimensional frame- work structure. On the base of this, we then demonstrated how analytic operations on KG-OLAP cubes can be extended to manipulate the newly defined distributed structure of the framework. This paper represents a first step in the study of distributed KG-OLAP cubes. As a next step, we need to develop reasoning methods for computing inferences over the distributed structure, possibly by extending the materialization methods pro- posed for KG-OLAP. In order to evaluate the possibilities provided by the distributed framework, we will develop a proof-of-concept implementation of KG-OLAP cubes in [9]. This will allow us to compare the current performance and scalability results over centralized KG-OLAP cubes to the case of a distributed scenario. As previously mentioned, we conjecture that distribution of a KG over multiple processors can be used to enhance the performance of reasoning over large-scale KGs. We also plan to verify the applicability of our model and operations to the real scenario of ATM information, possibly by further developing our initial definitions. References 1. Abelló, A., Darmont, J., Etcheverry, L., Golfarelli, M., Mazón, J., Naumann, F., Pedersen, T.B., Rizzi, S., Trujillo, J., Vassiliadis, P., Vossen, G.: Fusion cubes: Towards self-service business intelligence. Int. J. of Data Warehousing and Mining 9(2), 66–88 (2013) 2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.): The De- scription Logic Handbook. Cambridge University Press (2003) 3. Bozzato, L., Eiter, T., Serafini, L.: Enhancing context knowledge repositories with justifiable exceptions. Artificial Intelligence 257, 72–126 (2018) 4. Bozzato, L., Eiter, T., Serafini, L.: Justifiable exceptions in general contextual hierarchies. In: CONTEXT 2019. Lecture Notes in Computer Science, vol. 11939, pp. 26–39. Springer (2019) 5. Bozzato, L., Serafini, L.: Materialization calculus for contexts in the Semantic Web. In: DL 2013. CEUR Workshop Proceedings, vol. 1014. CEUR-WS.org (2013) 6. Bozzato, L., Serafini, L., Eiter, T.: Reasoning with justifiable exceptions in contextual hier- archies. In: KR 2018. pp. 329–338. AAAI Press (2018) 7. Colazzo, D., Goasdoué, F., Manolescu, I., Roatiş, A.: RDF Analytics: Lenses over Semantic Graphs. In: WWW’14. pp. 467–478 (2014) 8. Krötzsch, M., Weikum, G.: Editorial for special section on knowledge graphs. Journal of Web Semantics 37-38, 53–54 (2016) 9. Schuetz, C.G., Bozzato, L., Neumayr, B., Schrefl, M., Serafini, L.: Knowledge Graph OLAP: A Multidimensional Model and Query Operations for Contextualized Knowledge Graphs. Semantic Web Journal (2020), in press. http://www.semantic-web- journal.net/content/knowledge-graph-olap-multidimensional- model-and-query-operations-contextualized-knowledge-0 10. Schuetz, C.G., Neumayr, B., Schrefl, M., Gringinger, E., Wilson, S.: Semantics-based sum- marisation of ATM information: Managing information overload in pilot briefings using se- mantic data containers. The Aeronautical Journal 123(1268), 1639–1665 (2019) 11. Serafini, L., Homola, M.: Contextualized Knowledge Repositories for the Semantic Web. Journal of Web Semantics 12, 64–87 (2012)