Introduction

Linked Data Querying through FCA-based Schema Indexing

0 Institute for Web Science and Technologies, University of Koblenz-Landau

The e ciency of SPARQL query evaluation against Linked Open Data may bene t from schema-based indexing. However, many data items come with incomplete schema information or lack schema descriptions entirely. In this position paper, we outline an approach to an indexing of linked data graphs based on schemata induced through Formal Concept Analysis. We show how to map queries onto RDF graphs based on such derived schema information. We sketch next steps for realizing and optimizing the suggested approach.

Linked Data Formal Concept Analysis Schema Indexing

Introduction

The ease of consuming Linked Open Data (LOD) depends on the availability of schema information. This holds for tasks such as browsing, where a user may not know the structure of the data found in a dataset, or querying, where schema information could be used for query optimization. In the LOD setting a query can be evaluated against the original LOD datasets through query federation or by evaluating it centrally against a LOD crawl. In both scenarios, the e ciency of query evaluation may be improved by identifying those datasets schematically matching the query { maybe partially { while leaving out others.

In the LOD cloud schema information is sparse, as many data items come with incomplete schema information or lack schema descriptions entirely. Methods for schema induction and ontology learning have been studied to remedy this problem ([ 4 ], cf. section 2). A method, that has successfully been used in this context, is Formal Concept Analysis (FCA) (cf. [ 2 ], [ 3 ], [ 12 ]).

To our knowledge, FCA has not been used for constructing indices, that allow a schema-oriented query-to-graph mapping. Particularly, we are not aware of previous work on mapping queries to formal contexts. We expect that FCA will bene t this kind of indexing through the formally sound construction of schemata that are free of external heuristics and concise at the same time. In this paper, we outline an approach for such a schema index and point to future work on an implementation. For our approach to schema indexing, we propose a property-based schema induction using FCA. Property-based type clustering is discussed in literature as a remedy for the schema sparsity in LOD datasets. The feasibility of such approaches has been analyzed in [ 5 ]. The authors show that the properties of resources within LOD datasets can be used for deducing resource types. [ 10 ] argue for a property-based data access for programming with Linked Data in order to deal with a lack of type descriptions. [ 6 ] further corroborate this position, as they argue for property-based concept de nitions. In order to ensure quality of such de nitions, they propose a system for incorporating interactive user feedback.

Schema induction, the learning of schema information from data, is a way to handle schema sparsity. [ 11 ] describes a statistical approach to schema induction through association rule mining on transaction tables. A recent approach is presented in [ 7 ]. It combines the mining of property-based entity descriptions and type clustering over these using DBSCAN. An overview over the eld of schema/ontology learning is given in [ 4 ].

FCA has been applied to problems related to ontology learning. In [ 3 ] it has been used for learning taxonomies from natural language text. The authors of [ 2 ] and [ 12 ] use attribute exploration from FCA for semi-automatic approaches to ontology completion. Lately, [ 1 ] used FCA-based association rule mining for completing type de nitions given in DBPedia data through further implied information.

We combine the induction of schema with building and providing an index for query-to-graph mapping. An index for subgraph querying is presented in [ 14 ]. There the authors make use of a precomputed lattice of subgraphs in order to narrow down the candidates for subgraph queries. An approach similar to ours is demonstrated in [ 8 ]. The authors present a schema-based index that is consisting of three layers, each supporting di erent types of queries. The index is constructed through clustering of RDF type information in a stream-based fashion. The approaches presented in [ 7 ], [ 8 ] and [ 12 ] either introduce heuristics or human oversight or require case speci c parameter tuning. We sketch a FCAbased schema index that is free of such external factors and general enough to map the diverse data found in the LOD cloud. 3

Preliminaries

In the following we give a short introduction to Formal Concept Analysis (FCA), a method for conceptual knowledge discovery and representation, as well as the basics of RDF and SPARQL.

We lend the following de nitions 1, 2, 3 from [ 13 ]: De nition 1 (Formal Context). A formal context K := (G; M; I) is a triple comprised of a set of objects G, a set of attributes M and an indicidence relation I G M encoding that "g has attribute m" i gIm. De nition 2 (Formal Concept). For A G and B M [ 13 ] de ne A0 := fm 2 M j8g 2 A : gImg, B0 := fg 2 Gj8m 2 B : gImg.

A formal concept is a pair (A; B) with A0 = B and B0 = A. A is the extent, B is the intent of the concept. As in [ 13 ], we abbreviate the set of all formal concepts for a formal context as L(G; M; I).

Formal concepts fall naturally into a hierarchy, called a concept lattice, by a subconcept-superconcept-relation de ned via the subset relations over G and M , respectively.

De nition 3 ( ). For two concepts (A1; B1), (A2; B2) [ 13 ] de ne (A1; B1) (A2; B2) , A1 A2 (equivalently, , B1 B2). The pair (L(G; M; I); ) is called concept lattice.

RDF Datasources of the LOD cloud contain RDF1 data. For now, we abstract from the possibility of blank nodes in RDF data.

De nition 4 (RDF Graph). Given the set of RDF resource identi ers U and the set of literals L and having the set of RDF terms T := U [ L: An RDF graph is a set of RDF triples D f< s; p; o > js 2 U; p 2 U; o 2 T g. We further de ne the set of all known RDF graphs D := fDjD is a known graphg. SPARQL A language for formulating graph oriented queries against RDF datasets is SPARQL2. For this paper, we will focus on Basic Graph Patterns that are used for formalizing graph pattern matching and, thus, are fundamental to SPARQL(cf. [ 9 ]). Here, we further assume that queries are only evaluated against the default graph of a datasource.

De nition 5 (Basic Graph Pattern). Given a set V of variable names, a Triple Pattern tp is a triple (s; p; o) 2 (U [V ) (U [V ) (T [V ). A Basic Graph Pattern (BGP) of a query q is a set containing a number of Triple Patterns. We abbreviate the set of all queries only containing a single BGP as BGP .

With this restriction to single-BGP queries, we de ne the solution to a query q through matching its BGP against the RDF graph D as q(D) (cf. [ 9 ]). We further de ne a solution to q against a set of graphs D as q(D) := q(SD2D D).

An example for a BGP and a possible solution is given in g. 1 b), c). 4

Schema Index

In a query federation system a central query processor accepts and processes queries by dispatching (parts of) the queries to datasets (i.e., computing nodes serving them). In the case of LOD these datasets are graphs. The schema index maps queries onto minimal sets of graphs required for evaluating the given 1 Resource Description Framework; http://www.w3.org/TR/rdf-concepts/ 2 http://www.w3.org/TR/sparql11-overview/ (a) album title ex:abbey:road "Abbey Road" ex:quadrophenia "Quadrophenia" ...

...

(c) queries. The optimal schema index minimizes the subset of all known RDF graphs to be used for returning the complete solution of a query q: Di(q) = argminDj D^q(Dj)=q(D)jDj j: (1)

The improvement of e ciency of query evaluation then depends on the number of graphs left out of Di(q). Judging a graph to be relevant for query evaluation may be based on a schematic match of a query and the graph.

In our approach the matching will be informed by an underlying lattice of property type clusters induced from the graph data via FCA. With our approach, the schema index constructs and maintains a schema lattice covering every graph encountered. We build this lattice by applying FCA on a formal context extracted from an RDF graph.

De nition 6 (Resource-Feature Map). We de ne a feature domain F as the set of all known predicates p encountered in D. A resource-feature map f : D U ! 2F assigns an RDF resource r the set of describing features found in a graph: f (D; r) = fp 2 F j9o 2 T :< r; p; o >2 Dg.

Example 1. f (gex; ex:quadrophenia) = fex:title; ex:tracksg There are other possible manifestations of such a resource-feature map. For example, one could (also) map the types assigned to resources through rdf:type. De nition 7 (Schema Lattice). We de ne a graph-covering formal context K := (U; F; If ) with the set of resources U , the feature domain F and If := f(r; p)j9D 2 D; 9r 2 U : p 2 f (D; r)g. The schema lattice LS = (L(U; F; If ); ) is the lattice constructed from K via FCA.

While extracting features given in the graphs, the schema index will also store the set of graphs contributing individual features to the subsequent lattice construction.

De nition 8 (Feature-contributing Graph). For an RDF resource r and a feature p 2 F , we de ne a set of feature-contributing graphs as (r; p) = fDjp 2 f (D; r)g. For the features of an intent B, we de ne the set of contributing graphs as 0(B) = fDj9p 2 B; 9r 2 U : D 2 (r; p)g.

We conceive the schema index as a pair (LS ; ) of the schema lattice LS and a function mapping queries to indexed graphs. In order to look up graphs that a query should be evaluated against, the schema index then derives the queries type information through the Query-Type Map function: De nition 9 (Query-Type Map). The Query-Type Map BGP ! 2F that maps a query q to a set of intents (q). is a function :

For a query q the schema index returns the set of schematically appropriate graphs as follows: (q) = fDjD 2 0(B) for some B 2 (q)g:

For our running example (cf. g. 1), the graph gex would be returned for the BGP, as here fex:title, ex:tracksg constitutes an intent re ected in the query. 5

Outlook and Conclusion

In this position paper, we sketch the idea of a schema index for a query-to-graph mapping. For constructing the index we aim at an automatic schema induction process through FCA. We hypothesize that FCA is a good method for this use. One, it is independent of external factors, such as expert involvement as in [ 12 ] or a task speci c parameter tuning as done in [ 7 ]. Two, by design it is a method targeting property-based entity descriptions as argued for in [ 6 ], [ 5 ].

Our proposition is that the schema index returns the graphs necessary for a complete query evaluation: q( (q)) = q(D) (for every query q) (2) However, (q) may return graphs that might be disregarded for evaluating BGP. Hence, the proposed schema index is not necessarily optimal. For a future implementation we plan to approximate the optimal case (cf. (1)). This must only be done to a degree that is sensible, since potential savings in execution time are being countered by costs for building and looking up the index.

Towards a realization of the proposed schema index, we expect to face questions such as for the size of formal contexts as given through the feature domains of a graph. Next steps for our research will also involve empirical analysis of runtime behaviour and scalability of FCA in the LOD use scenario. We have indications that in spite of exponential worst-case complexity, the property-based clustering of data items may lead to empirically acceptable runtime behaviour. A further question in this context is how to e ciently update existing lattices, when, e.g., encountering further features of an object while still ingesting a graph in a stream-based fashion. For such a scenario, we will also look into what lattice construction algorithm to choose. Finally, we plan to address the problems of reducing the size of formal contexts and lattices through appropriate data preparation and cleansing lattices of "noisy" concepts.

Alam ,

Buzmakov ,

Codocedo , and

Napoli . Mining de nitions from RDF annotations using formal concept analysis . Proc. of IJCAI , pages 823 { 829 , 2015 .

Baader ,

Ganter , and

Sattler . Completing description logic knowledge bases using formal concept analysis . Proc. of IJCAI , pages 230 { 235 , 2007 .

Cimiano ,

Hotho , and

Staab . Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis . Journal of Arti cial Intelligence Research , 24 : 305 { 339 , 2011 .

4. C. D'Amato , N.

Fanizzi , and F.

Esposito . Inductive learning for the Semantic Web: What does it buy? Semantic Web , 1 ( 1 -2): 53 { 59 , 2010 .

Gottron ,

Knauf ,

Scheglmann , and

Scherp . A Systematic Investigation of Explicit and Implicit Schema Information on the Linked Open Data Cloud . Proc. of ESWC , 7882 : 228 { 242 , 2013 .

Homoceanu ,

Wille , and

Balke . ProSWIP: Property-based data access for semantic web interactive programming . In Proc. of ISWC , pages 184 { 199 , 2013 .

Kellou-Menouer and

Kedad . Schema discovery in RDF data sources . In Proc. of ER , pages 481 { 495 , 2015 .

Konrath ,

Gottron ,

Staab , and

Scherp . Schemex - e cient construction of a data catalogue by stream-based indexing of linked data . Web Semantics , 16 : 52 { 58 , 2012 .

Prudhommeaux ,

Seaborne , et al. Sparql query language for rdf . W3C recommendation , https://www.w3.org/TR/rdf-sparql-query/, 15, 2008 .

10. S. Scheglmann, G. Groener,

Staab , and

La mmel. Incompleteness-aware programming with RDF data . Proc. of the 2013 workshop on Data driven functional programming , DDFP 2013 , pages 11 { 14 , 2013 .

11. J. Volker and

Niepert . Statistical schema induction . In Proc. of ESWC , pages 124 { 138 , 2011 .

12. J. Vo

lker and

Rudolph . Fostering web intelligence by semi-automatic OWL ontology re nement . Proc. of Web Intelligence , WI 2008 , pages 454 { 460 , 2008 .

13.

Wille . Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts , pages 445 { 470 . 1982 .

14.

Yuan and

Mitra. Lindex : A lattice-based index for graph databases . VLDB Journal , 22 ( 2 ): 229 { 252 , 2013 .