A Data-Driven Graph Schema

Larry González¹ and Aidan Hogan²

¹ Center for Advancing Electronics Dresden (cfaed), TU Dresden, Germany. larry.gonzalez@tu-dresden.de
² Millennium Institute for Foundational Research on Data, DCC, University of Chile. ahogan@dcc.uchile.cl

Abstract. In this paper, we summarise our results on Modelling Dynamics in Semantic Web Knowledge Graphs, published at WWW 2018, where we proposed a novel data-driven schema for graphs and applied it to the use-case of predicting high-level changes in Wikidata.

1 Introduction

Graph-based data models [1] have become increasingly common in data management scenarios that require flexibility beyond what is offered by traditional relational databases. Such flexibility is particularly important in Web scenarios, where potentially many users may be involved (either directly or indirectly) in the creation, management, and curation of data. An example of such a scenario is the Wikidata knowledge graph [2], where users can add new properties and types that can be used to define further data.

The flip-side of flexibility is higher levels of heterogeneity. Conceptually understanding the current state of a knowledge graph – in terms of what data it contains, what it is missing, how it can be effectively queried, what has changed recently, etc. – is thus a major challenge: it is unclear how to distil an adequate, high-level description that captures an actionable overview of a knowledge graph. We thus need well-founded methodologies to make sense of knowledge graphs, where an obvious approach is to define some notion of schema for such graphs. The traditional approach in the Semantic Web has been what Pham and Boncz [3] call the schema first approach, which defines the schema that the data should follow; the most established language for specifying such schemas is RDFS. An alternative to the schema first approach is the schema last approach [3], which foregoes an upfront schema and rather lets the data evolve naturally; thereafter, the goal is to understand what the legacy graph data contain by extracting high-level summaries that characterise the graph, resulting in a data-driven schema.

In this paper, we summarise recently published results [4] on a novel approach to compute a data-driven schema from knowledge graphs. We believe that such schemas are useful for understanding what a knowledge graph contains and how it can be queried, among several other use-cases. Nevertheless, in this work we focus on the use-case of predicting how the knowledge graph will evolve in future versions, which could be used for measuring the time-to-live of cached SPARQL results, identifying missing properties for entities, etc.

2 Preliminaries

RDF is a graph-structured data model based on three pairwise disjoint sets of terms: IRIs (I), literals (L) and blank nodes (B). Claims involving these terms can be organised into RDF triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where s is called the subject, p the predicate, and o the object. An RDF graph G is then a finite set of RDF triples; each such triple (s, p, o) ∈ G can be viewed as a directed labelled edge s −p→ o. The terms used in the predicate position are referred to as properties. We use the term entity to refer to the real-world objects identified by the subjects of the graph. Given an RDF graph G, for • ∈ {s, p, o}, we denote by π_•(G) the projection of the set of terms appearing in that triple position in G; e.g., π_s(G) := {s | ∃p, o : (s, p, o) ∈ G}. We also use this notation for more than one triple position; for example, π_{s,p}(G) := {(s, p) | ∃o : (s, p, o) ∈ G}.
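As a concrete illustration (a minimal sketch of ours, not code from [4]), an RDF graph can be modelled in Python as a set of (subject, predicate, object) triples over plain strings, with the projections computed as set comprehensions:

    # A minimal sketch (not from the paper): an RDF graph as a set of
    # (subject, predicate, object) triples; terms are plain strings here.
    G = {
        (":UT", ":name", '"U Thurman"'), (":UT", ":star", ":Gattaca"),
        (":GO", ":name", '"G Orwell"'),  (":GO", ":writer", ":1984"),
    }

    def pi_s(G):
        """pi_s(G): the set of terms in the subject position of G."""
        return {s for (s, _, _) in G}

    def pi_sp(G):
        """pi_{s,p}(G): the set of (subject, predicate) pairs in G."""
        return {(s, p) for (s, p, _) in G}

    assert pi_s(G) == {":UT", ":GO"}
    assert (":UT", ":star") in pi_sp(G)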
3 A Data-Driven Schema for (RDF) Graphs

To define our data-driven schema proposal, let ⟦G⟧ ⊆ 2^{π_s(G)} × 2^{π_p(G)} denote a set such that ∪_{(S,P) ∈ ⟦G⟧} S × P = π_{s,p}(G), and where for all (S, P) ∈ ⟦G⟧, it holds that S ≠ ∅, P ≠ ∅, and there does not exist (S′, P′) ∈ ⟦G⟧ with (S, P) ≠ (S′, P′) such that S ∩ S′ ≠ ∅ or P = P′. Intuitively, letting ⟦s⟧_G := {p | ∃o : (s, p, o) ∈ G} denote the characteristic set of s in G [5], each pair (S, P) ∈ ⟦G⟧ represents the (non-empty) set of all subjects S with the (non-empty) characteristic set P. We further define ⟦P⟧_G = S such that (S, P) ∈ ⟦G⟧ (or ⟦P⟧_G = ∅ if no such S exists), and ⟦S⟧_G = P such that (S, P) ∈ ⟦G⟧ (or ⟦S⟧_G = ∅ if no such P exists). Abusing notation, we say that (S, P) ⊆ (S′, P′) iff P ⊆ P′. We then also define ⟦G⟧* := ⟦G⟧ ∪ {(∅, ∅), (⟦π_p(G)⟧_G, π_p(G))}, adding a bottom and a top concept (if needed) respectively in the ⊆ order. Finally, for a graph G, our data-driven schema proposal is then given by the lattice L = (⟦G⟧*, ⊆).

Given that large-scale knowledge graphs may often have orders of magnitude more subjects than predicates, we can greatly reduce the overall (e.g., in-memory) size of the lattice by encoding the number of subjects rather than the full set of subjects. In other words, given a lattice L = (⟦G⟧*, ⊆), we define ⟦G⟧# := {(n, P) | ∃S : (S, P) ∈ ⟦G⟧*, n = |S|} with lattice L# := (⟦G⟧#, ⊆). We denote by ⟦P⟧#_G = |⟦P⟧_G| the number of subjects that P has. Figure 1 provides an example RDF graph and the Hasse diagram of its corresponding L#.

    :UT :name "U Thurman" ; :star :Gattaca .
    :GO :name "G Orwell" ; :writer :1984 .
    :AK :name "A Kurosawa" ; :director :Ikiru , :Ran .
    :PD :name "PK Dick" ; :writer :Ubik , :Valis .
    :CE :name "C Eastwood" ; :director :Sully ; :star :Unforgiven , :Tightrope .

[Hasse diagram of the #-lattice: (0, ∅) at the bottom; (1, {d, n}) and (1, {n, s}) below (1, {d, n, s}); (1, {d, n, s}) and (2, {n, w}) below the top concept (0, {d, n, s, w}).]

Fig. 1: Example RDF graph (in Turtle syntax) and corresponding data-driven schema, where characteristic sets are annotated with the number of subjects (#-lattice). Properties are abbreviated by their initial: d/:director, n/:name, s/:star, w/:writer.
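To make the construction concrete, the following sketch (ours; for scale, the implementation evaluated in Section 5 uses MapReduce instead) computes ⟦G⟧# from the triple-set representation above, with the lattice order given by frozenset inclusion on the P components:

    from collections import defaultdict

    def hash_lattice(G):
        """Compute [[G]]# for a graph G given as a set of (s, p, o)
        triples: pairs (n, P) where n subjects share characteristic set
        P, plus the bottom (0, {}) and, if needed, top (0, pi_p(G))
        concepts; (n, P) precedes (n', P') in the lattice iff P <= P'."""
        char_sets = defaultdict(set)         # s -> [[s]]_G
        for s, p, _ in G:
            char_sets[s].add(p)
        counts = defaultdict(int)            # P -> [[P]]#_G
        for P in char_sets.values():
            counts[frozenset(P)] += 1
        lattice = {(n, P) for P, n in counts.items()}
        lattice.add((0, frozenset()))        # bottom concept
        top = frozenset(p for _, p, _ in G)  # pi_p(G)
        if top not in counts:                # top concept (if needed)
            lattice.add((0, top))
        return lattice

On the graph of Figure 1, this yields the six pairs of the Hasse diagram, e.g., (2, frozenset({":name", ":writer"})) for :GO and :PD.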
4 Lattice Diff-Algebra

Though we believe that the lattices defined previously may serve a number of applications, we currently focus on the use-case of modelling and predicting changes in a graph. More specifically, if we have the lattices for two versions of a knowledge graph, we can apply a diff to see the high-level changes between the versions. Furthermore, given such a diff, we could further consider adding it to the most recent version to try to predict future changes.

Defining lattice diffs. Let L_i = (⟦G_i⟧*, ⊆) and L_j = (⟦G_j⟧*, ⊆) be the lattices for two versions (i and j) of an RDF graph G. We define the diff between these two lattices as ∆_{j,i} := {(⟦s⟧_{G_j}, s, ⟦s⟧_{G_i}) | s ∈ π_s(G_i ∪ G_j)}; note that ⟦s⟧_{G_j} = ∅ for deleted subjects and ⟦s⟧_{G_i} = ∅ for new subjects. Given ∆_{j,i}, we also define a cardinality-only version ∆#_{j,i} := {(P′, n, P) : n = |{s : (P′, s, P) ∈ ∆_{j,i}}|}, where by ∆#_{j,i}(P′, P) we denote the n such that (P′, n, P) ∈ ∆#_{j,i}, or 0 if no such n exists.

Predicting future #-lattices. Given ∆#_{j,i} and L#_k (for k a third version of the graph), we can "add" the changes between the ith and jth versions to the kth version to predict the (k+j−i)th version (where typically i < j ≤ k). We thus define the operation L#_k + ∆#_{j,i} as producing a #-lattice L#_{k,j,i} predicting L#_{k+j−i}. To apply this operation, we consider the ratio of subjects moving from a source to a target characteristic set. Formally, we define the ratio of subjects of P (where ⟦P⟧#_{G_i} ≠ 0, P ≠ ∅) moving to P′ (where P′ ≠ ∅) as ρ_{j,i}(P′, P) := ∆#_{j,i}(P′, P) / ⟦P⟧#_{G_i}; in the case that ⟦P⟧#_{G_i} = 0, we define ρ_{j,i}(P′, P) := 1 if P′ = P, and 0 otherwise. We then define L−_{k,j,i} := ({(σ(P), P) | P ≠ ∅ and σ(P) ≠ 0}, ⊆), where:

    σ(P) := round( Σ_{P_k ∈ {P′ | ∃S : (S, P′) ∈ ⟦G_k⟧}} ρ_{j,i}(P, P_k) × ⟦P_k⟧#_{G_k} ) + ∆#_{j,i}(P, ∅).

The summand ∆#_{j,i}(P, ∅) adds the number of fresh subjects (not appearing in version i) added to P in version j. Finally, we add top and bottom concepts (as before) to L−_{k,j,i} to generate the predicted #-lattice L#_{k,j,i}.
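Reading these definitions operationally, the following sketch (our own rendering, with Python dictionaries standing in for the formal sets; counts_k, counts_i and delta are assumed inputs, not names from the paper) computes σ(P) for every non-empty characteristic set:

    def predict_counts(counts_k, counts_i, delta):
        """Sketch of L#_k + Delta#_{j,i}. counts_k / counts_i map each
        characteristic set (a frozenset of properties) to its positive
        number of subjects in versions k / i; delta maps (P_target,
        P_source) pairs to the number of subjects moving between them
        from version i to version j (empty frozensets mark new and
        deleted subjects). Returns sigma(P) for each non-empty P."""
        EMPTY = frozenset()
        sigma = {}
        for (P_tgt, P_src), n in delta.items():
            if P_tgt == EMPTY:
                continue                    # deleted subjects feed no target
            if P_src == EMPTY:
                # Delta#_{j,i}(P, {}): fresh subjects are added directly
                sigma[P_tgt] = sigma.get(P_tgt, 0.0) + n
            elif counts_i.get(P_src, 0) > 0:
                # rho_{j,i}(P_tgt, P_src), scaled by P_src's subjects in version k
                ratio = n / counts_i[P_src]
                sigma[P_tgt] = sigma.get(P_tgt, 0.0) + ratio * counts_k.get(P_src, 0)
        # sets absent from version i keep their version-k subjects (rho = 1)
        for P, n in counts_k.items():
            if P not in counts_i and P != EMPTY:
                sigma[P] = sigma.get(P, 0.0) + n
        return {P: round(v) for P, v in sigma.items() if round(v) != 0}

The result corresponds to L−_{k,j,i}; adding the top and bottom concepts as before yields the predicted #-lattice L#_{k,j,i}.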
5 Evaluation

We consider 11 weeks of "truthy" RDF dumps of Wikidata from 2017-04-18 to 2017-06-27; the first version has 1,102,242,331 triples, 54,236,592 unique subjects and 3,276 unique properties, while the last version has 1,293,099,057 triples (+17%), 57,197,406 unique subjects (+5%) and 3,492 unique properties (+6%). From the last version, with a MapReduce implementation, we extract 2,118,109 characteristic sets in approximately 2.5 hours; computing the lattice by the ⊆ relation then took almost 8 hours on a single machine [4].

To test the quality of the future #-lattices we predict – specifically the number of subjects per characteristic set in the future, unseen version – we run experiments considering from 2–5 previous weekly versions to predict the next version of the #-lattice. As a baseline, for each characteristic set, we apply linear regression over the number of subjects in that characteristic set for the previous weeks to predict the number of subjects for the next week; we compare this baseline with our diff algebra (∆), computing the error with respect to the real lattice of the predicted week. The results are available in [4], where we show that our diff algebra outperforms the linear regression baseline in all cases; we believe that this is because our diff algebra considers the number of subjects remaining in source characteristic sets for its predictions, whereas the baseline does not consider where predicted new subjects will come from.

6 Conclusion

We have proposed a form of data-driven schema for large-scale knowledge graphs and shown it to be feasible to compute. As a concrete use-case, we presented an algebraic method by which these schemas can be used to predict high-level changes in the dataset. Our evaluation over 11 weeks of Wikidata demonstrates that such predictions are feasible to compute; furthermore, we validated the quality of the predictions made by our algebraic approach against a linear-model baseline. We refer the interested reader to the full paper [4] for additional details on the proposed schema, examples of diffs, algorithms to compute the lattices, statistics on the lattices produced, details of the experiments, and further discussion.

Acknowledgements: This work was supported by the Millennium Scientific Initiative, by Fondecyt Grant No. 1181896, and by the German Research Foundation (DFG) in CRC 912 (HAEC) and in Emmy Noether grant KR 4381/1-1.

References

1. Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J., Vrgoč, D.: Foundations of modern query languages for graph databases. ACM Comput. Surv. 50(5) (2017)
2. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10) (2014) 78–85
3. Pham, M., Boncz, P.A.: Exploiting emergent schemas to make RDF systems more efficient. In: ISWC. LNCS, Springer (2016) 463–479
4. González, L., Hogan, A.: Modelling dynamics in Semantic Web knowledge graphs with formal concept analysis. In: WWW (2018)
5. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: ICDE, IEEE Computer Society (2011) 984–994