A Data-Driven Graph Schema

Larry González¹ and Aidan Hogan²

¹ Center for Advancing Electronics Dresden (cfaed), TU Dresden, Germany. larry.gonzalez@tu-dresden.de
² Millennium Institute for Foundational Research on Data, DCC, University of Chile. ahogan@dcc.uchile.cl

Abstract. In this paper, we summarise our results on Modelling Dynamics in Semantic Web Knowledge Graphs, published at WWW 2018, where we proposed a novel data-driven schema for graphs and applied it to the use-case of predicting high-level changes in Wikidata.

1 Introduction

Graph-based data models [1] have become increasingly common in data management scenarios that require flexibility beyond what is offered by traditional relational databases. Such flexibility is particularly important in Web scenarios, where potentially many users may be involved (either directly or indirectly) in the creation, management, and curation of data. An example of such a scenario is the Wikidata knowledge graph [2], where users can add new properties and types that can be used to define further data.

The flip-side of flexibility is higher levels of heterogeneity. Conceptually understanding the current state of a knowledge graph – in terms of what data it contains, what it is missing, how it can be effectively queried, what has changed recently, etc. – is thus a major challenge: it is unclear how to distil an adequate, high-level description that captures an actionable overview of a knowledge graph. We thus need well-founded methodologies to make sense of knowledge graphs, where an obvious approach is to define some notion of schema for such graphs. The traditional approach in the Semantic Web has been what Pham and Boncz [3] call the schema first approach, which defines the schema that the data should follow; the most established language for specifying such schemas is RDFS. An alternative to the schema first approach is the schema last approach [3], which foregoes an upfront schema and rather lets the data evolve naturally; thereafter, the goal is to understand what the legacy graph data contain by extracting high-level summaries that characterise the graph, resulting in a data-driven schema.

In this paper, we summarise recently published results [4] on a novel approach to compute a data-driven schema from knowledge graphs. We believe that such schemas are useful for understanding what a knowledge graph contains and how it can be queried, among several other use-cases. Nevertheless, in this work we focus on the use-case of predicting how the knowledge graph will evolve in future versions, which could be used for measuring the time-to-live of cached SPARQL results, identifying missing properties for entities, etc.

2 Preliminaries

RDF is a graph-structured data model based on three pairwise disjoint sets of terms: IRIs (I), literals (L) and blank nodes (B). Claims involving these terms can be organised into RDF triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where s is called the subject, p the predicate, and o the object. An RDF graph G is then a finite set of RDF triples; each such triple (s, p, o) ∈ G can be viewed as a directed labelled edge s −p→ o. The terms used in the predicate position are referred to as properties. We use the term entity to refer to the real-world objects identified by the subjects of the graph. Given an RDF graph G, for • ∈ {s, p, o}, we denote by π_•(G) the projection of the set of terms appearing in that triple position in G; e.g., π_s(G) := {s | ∃p, o : (s, p, o) ∈ G}. We also use this notation for more than one triple position; for example, π_{s,p}(G) := {(s, p) | ∃o : (s, p, o) ∈ G}.
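As a concrete illustration (a minimal sketch of ours, not code from [4]), an RDF graph can be modelled in Python as a set of (subject, predicate, object) triples over plain strings, with the projections computed as set comprehensions:

    # A minimal sketch (not from the paper): an RDF graph as a set of
    # (subject, predicate, object) triples; terms are plain strings here.
    G = {
        (":UT", ":name", '"U Thurman"'), (":UT", ":star", ":Gattaca"),
        (":GO", ":name", '"G Orwell"'),  (":GO", ":writer", ":1984"),
    }

    def pi_s(G):
        """pi_s(G): the set of terms in the subject position of G."""
        return {s for (s, _, _) in G}

    def pi_sp(G):
        """pi_{s,p}(G): the set of (subject, predicate) pairs in G."""
        return {(s, p) for (s, p, _) in G}

    assert pi_s(G) == {":UT", ":GO"}
    assert (":UT", ":star") in pi_sp(G)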
3 A Data-Driven Schema for (RDF) Graphs

To define our data-driven schema proposal, let ⟦G⟧ ⊆ 2^{π_s(G)} × 2^{π_p(G)} denote a set such that ∪_{(S,P) ∈ ⟦G⟧} S × P = π_{s,p}(G), and where for all (S, P) ∈ ⟦G⟧, it holds that S ≠ ∅, P ≠ ∅, and there does not exist (S′, P′) ∈ ⟦G⟧ with (S, P) ≠ (S′, P′) such that S ∩ S′ ≠ ∅ or P = P′. Intuitively, letting ⟦s⟧_G := {p | ∃o : (s, p, o) ∈ G} denote the characteristic set of s in G [5], each pair (S, P) ∈ ⟦G⟧ represents the (non-empty) set of all subjects S with the (non-empty) characteristic set P. We further define ⟦P⟧_G = S such that (S, P) ∈ ⟦G⟧ (or ⟦P⟧_G = ∅ if no such S exists), and ⟦S⟧_G = P such that (S, P) ∈ ⟦G⟧ (or ⟦S⟧_G = ∅ if no such P exists). Abusing notation, we say that (S, P) ⊆ (S′, P′) iff P ⊆ P′. We then also define ⟦G⟧* := ⟦G⟧ ∪ {(∅, ∅), (⟦π_p(G)⟧_G, π_p(G))}, adding a bottom and a top concept (if needed) respectively in the ⊆ order. Finally, for a graph G, our data-driven schema proposal is then given by the lattice L = (⟦G⟧*, ⊆).

Given that large-scale knowledge graphs may often have orders of magnitude more subjects than predicates, we can greatly reduce the overall (e.g., in-memory) size of the lattice by encoding the number of subjects rather than the full set of subjects. In other words, given a lattice L = (⟦G⟧*, ⊆), we define ⟦G⟧# := {(n, P) | ∃S : (S, P) ∈ ⟦G⟧*, n = |S|} with lattice L# := (⟦G⟧#, ⊆). We denote by ⟦P⟧#_G = |⟦P⟧_G| the number of subjects that P has. Figure 1 provides an example RDF graph and the Hasse diagram of its corresponding L#.

    :UT :name "U Thurman" ; :star :Gattaca .
    :GO :name "G Orwell" ; :writer :1984 .
    :AK :name "A Kurosawa" ; :director :Ikiru , :Ran .
    :PD :name "PK Dick" ; :writer :Ubik , :Valis .
    :CE :name "C Eastwood" ; :director :Sully ; :star :Unforgiven , :Tightrope .

[Hasse diagram of the #-lattice: (0, ∅) at the bottom; (1, {d, n}) and (1, {n, s}) below (1, {d, n, s}); (1, {d, n, s}) and (2, {n, w}) below the top concept (0, {d, n, s, w}).]

Fig. 1: Example RDF graph (in Turtle syntax) and corresponding data-driven schema, where characteristic sets are annotated with the number of subjects (#-lattice). Properties are abbreviated by their initial: d/:director, n/:name, s/:star, w/:writer.
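To make the construction concrete, the following sketch (ours; for scale, the implementation evaluated in Section 5 uses MapReduce instead) computes ⟦G⟧# from the triple-set representation above, with the lattice order given by frozenset inclusion on the P components:

    from collections import defaultdict

    def hash_lattice(G):
        """Compute [[G]]# for a graph G given as a set of (s, p, o)
        triples: pairs (n, P) where n subjects share characteristic set
        P, plus the bottom (0, {}) and, if needed, top (0, pi_p(G))
        concepts; (n, P) precedes (n', P') in the lattice iff P <= P'."""
        char_sets = defaultdict(set)         # s -> [[s]]_G
        for s, p, _ in G:
            char_sets[s].add(p)
        counts = defaultdict(int)            # P -> [[P]]#_G
        for P in char_sets.values():
            counts[frozenset(P)] += 1
        lattice = {(n, P) for P, n in counts.items()}
        lattice.add((0, frozenset()))        # bottom concept
        top = frozenset(p for _, p, _ in G)  # pi_p(G)
        if top not in counts:                # top concept (if needed)
            lattice.add((0, top))
        return lattice

On the graph of Figure 1, this yields the six pairs of the Hasse diagram, e.g., (2, frozenset({":name", ":writer"})) for :GO and :PD.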
4 Lattice Diff-Algebra

Though we believe that the lattices defined previously may serve a number of applications, we currently focus on the use-case of modelling and predicting changes in a graph. More specifically, if we have the lattices for two versions of a knowledge graph, we can apply a diff to see the high-level changes between the versions. Furthermore, given such a diff, we could further consider adding it to the most recent version to try to predict future changes.

Defining lattice diffs. Let L_i = (⟦G_i⟧*, ⊆) and L_j = (⟦G_j⟧*, ⊆) be the lattices for two versions (i and j) of an RDF graph G. We define the diff between these two lattices as ∆_{j,i} := {(⟦s⟧_{G_j}, s, ⟦s⟧_{G_i}) | s ∈ π_s(G_i ∪ G_j)}; note that ⟦s⟧_{G_j} = ∅ for deleted subjects and ⟦s⟧_{G_i} = ∅ for new subjects. Given ∆_{j,i}, we also define a cardinality-only version ∆#_{j,i} := {(P′, n, P) : n = |{s : (P′, s, P) ∈ ∆_{j,i}}|}, where by ∆#_{j,i}(P′, P) we denote the n such that (P′, n, P) ∈ ∆#_{j,i}, or 0 if no such n exists.

Predicting future #-lattices. Given ∆#_{j,i} and L#_k (for k a third version of the graph), we can "add" the changes between the ith and jth versions to the kth version to predict the (k+j−i)th version (where typically i < j ≤ k). We thus define the operation L#_k + ∆#_{j,i} as producing a #-lattice L#_{k,j,i} predicting L#_{k+j−i}. To apply this operation, we consider the ratio of subjects moving from a source to a target characteristic set. Formally, we define the ratio of subjects of P (where ⟦P⟧#_{G_i} ≠ 0, P ≠ ∅) moving to P′ (where P′ ≠ ∅) as ρ_{j,i}(P′, P) := ∆#_{j,i}(P′, P) / ⟦P⟧#_{G_i}; in the case that ⟦P⟧#_{G_i} = 0, we define ρ_{j,i}(P′, P) := 1 if P′ = P, and 0 otherwise. We then define L−_{k,j,i} := ({(σ(P), P) | P ≠ ∅ and σ(P) ≠ 0}, ⊆), where:

    σ(P) := round( Σ_{P_k ∈ {P′ | ∃S : (S, P′) ∈ ⟦G_k⟧}} ρ_{j,i}(P, P_k) × ⟦P_k⟧#_{G_k} ) + ∆#_{j,i}(P, ∅).

The summand ∆#_{j,i}(P, ∅) adds the number of fresh subjects (not appearing in version i) added to P in version j. Finally, we add top and bottom concepts (as before) to L−_{k,j,i} to generate the predicted #-lattice L#_{k,j,i}.
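Reading these definitions operationally, the following sketch (our own rendering, with Python dictionaries standing in for the formal sets; counts_k, counts_i and delta are assumed inputs, not names from the paper) computes σ(P) for every non-empty characteristic set:

    def predict_counts(counts_k, counts_i, delta):
        """Sketch of L#_k + Delta#_{j,i}. counts_k / counts_i map each
        characteristic set (a frozenset of properties) to its positive
        number of subjects in versions k / i; delta maps (P_target,
        P_source) pairs to the number of subjects moving between them
        from version i to version j (empty frozensets mark new and
        deleted subjects). Returns sigma(P) for each non-empty P."""
        EMPTY = frozenset()
        sigma = {}
        for (P_tgt, P_src), n in delta.items():
            if P_tgt == EMPTY:
                continue                    # deleted subjects feed no target
            if P_src == EMPTY:
                # Delta#_{j,i}(P, {}): fresh subjects are added directly
                sigma[P_tgt] = sigma.get(P_tgt, 0.0) + n
            elif counts_i.get(P_src, 0) > 0:
                # rho_{j,i}(P_tgt, P_src), scaled by P_src's subjects in version k
                ratio = n / counts_i[P_src]
                sigma[P_tgt] = sigma.get(P_tgt, 0.0) + ratio * counts_k.get(P_src, 0)
        # sets absent from version i keep their version-k subjects (rho = 1)
        for P, n in counts_k.items():
            if P not in counts_i and P != EMPTY:
                sigma[P] = sigma.get(P, 0.0) + n
        return {P: round(v) for P, v in sigma.items() if round(v) != 0}

The result corresponds to L−_{k,j,i}; adding the top and bottom concepts as before yields the predicted #-lattice L#_{k,j,i}.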
5 Evaluation

We consider 11 weeks of "truthy" RDF dumps of Wikidata from 2017-04-18 to 2017-06-27; the first version has 1,102,242,331 triples, 54,236,592 unique subjects and 3,276 unique properties, while the last version has 1,293,099,057 triples (+17%), 57,197,406 unique subjects (+5%) and 3,492 unique properties (+6%). From the last version, with a MapReduce implementation, we extract 2,118,109 characteristic sets in approximately 2.5 hours; computing the lattice by the ⊆ relation then took almost 8 hours on a single machine [4].

To test the quality of the future #-lattices we predict – specifically the number of subjects per characteristic set in the future, unseen version – we run experiments considering from 2–5 previous weekly versions to predict the next version of the #-lattice. As a baseline, for each characteristic set, we apply linear regression over the number of subjects in that characteristic set for the previous weeks to predict the number of subjects for the next week; we compare this baseline with our diff algebra (∆), computing the error with respect to the real lattice of the predicted week. The results are available in [4], where we show that our diff algebra outperforms the linear regression baseline in all cases; we believe that this is because our diff algebra considers the number of subjects remaining in source characteristic sets for its predictions, whereas the baseline does not consider where predicted new subjects will come from.

6 Conclusion

We have proposed a form of data-driven schema for large-scale knowledge graphs and shown it to be feasible to compute. As a concrete use-case, we presented an algebraic method by which these schemas can be used to predict high-level changes in the dataset. Our evaluation over 11 weeks of Wikidata demonstrates that such predictions are feasible to compute; furthermore, we validated the quality of the predictions made by our algebraic approach against a linear-model baseline. We refer the interested reader to the full paper [4] for additional details on the proposed schema, examples of diffs, algorithms to compute the lattices, statistics on the lattices produced, details of the experiments, and further discussion.

Acknowledgements: This work was supported by the Millennium Scientific Initiative, by Fondecyt Grant No. 1181896, and by the German Research Foundation (DFG) in CRC 912 (HAEC) and in Emmy Noether grant KR 4381/1-1.

References

1. Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J., Vrgoč, D.: Foundations of modern query languages for graph databases. ACM Comput. Surv. 50(5) (2017)
2. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10) (2014) 78–85
3. Pham, M., Boncz, P.A.: Exploiting emergent schemas to make RDF systems more efficient. In: ISWC. LNCS, Springer (2016) 463–479
4. González, L., Hogan, A.: Modelling dynamics in Semantic Web knowledge graphs with formal concept analysis. In: WWW (2018)
5. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE, IEEE Computer Society (2011) 984–994