<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linked Data Querying through FCA-based Schema Indexing</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Web Science and Technologies, University of Koblenz-Landau</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The e ciency of SPARQL query evaluation against Linked Open Data may bene t from schema-based indexing. However, many data items come with incomplete schema information or lack schema descriptions entirely. In this position paper, we outline an approach to an indexing of linked data graphs based on schemata induced through Formal Concept Analysis. We show how to map queries onto RDF graphs based on such derived schema information. We sketch next steps for realizing and optimizing the suggested approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>Schema Indexing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The ease of consuming Linked Open Data (LOD) depends on the availability of
schema information. This holds for tasks such as browsing, where a user may not
know the structure of the data found in a dataset, or querying, where schema
information could be used for query optimization. In the LOD setting a query
can be evaluated against the original LOD datasets through query federation or
by evaluating it centrally against a LOD crawl. In both scenarios, the e ciency
of query evaluation may be improved by identifying those datasets schematically
matching the query { maybe partially { while leaving out others.</p>
      <p>
        In the LOD cloud schema information is sparse, as many data items come
with incomplete schema information or lack schema descriptions entirely.
Methods for schema induction and ontology learning have been studied to remedy
this problem ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], cf. section 2). A method, that has successfully been used in
this context, is Formal Concept Analysis (FCA) (cf. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
      <p>
        To our knowledge, FCA has not been used for constructing indices, that
allow a schema-oriented query-to-graph mapping. Particularly, we are not aware
of previous work on mapping queries to formal contexts. We expect that FCA
will bene t this kind of indexing through the formally sound construction of
schemata that are free of external heuristics and concise at the same time. In
this paper, we outline an approach for such a schema index and point to future
work on an implementation.
For our approach to schema indexing, we propose a property-based schema
induction using FCA. Property-based type clustering is discussed in literature
as a remedy for the schema sparsity in LOD datasets. The feasibility of such
approaches has been analyzed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The authors show that the properties of
resources within LOD datasets can be used for deducing resource types. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] argue
for a property-based data access for programming with Linked Data in order to
deal with a lack of type descriptions. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] further corroborate this position, as they
argue for property-based concept de nitions. In order to ensure quality of such
de nitions, they propose a system for incorporating interactive user feedback.
      </p>
      <p>
        Schema induction, the learning of schema information from data, is a way to
handle schema sparsity. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] describes a statistical approach to schema induction
through association rule mining on transaction tables. A recent approach is
presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It combines the mining of property-based entity descriptions
and type clustering over these using DBSCAN. An overview over the eld of
schema/ontology learning is given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        FCA has been applied to problems related to ontology learning. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] it has
been used for learning taxonomies from natural language text. The authors of
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use attribute exploration from FCA for semi-automatic approaches
to ontology completion. Lately, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] used FCA-based association rule mining for
completing type de nitions given in DBPedia data through further implied
information.
      </p>
      <p>
        We combine the induction of schema with building and providing an index
for query-to-graph mapping. An index for subgraph querying is presented in
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. There the authors make use of a precomputed lattice of subgraphs in order
to narrow down the candidates for subgraph queries. An approach similar to
ours is demonstrated in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The authors present a schema-based index that is
consisting of three layers, each supporting di erent types of queries. The index
is constructed through clustering of RDF type information in a stream-based
fashion. The approaches presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] either introduce heuristics
or human oversight or require case speci c parameter tuning. We sketch a
FCAbased schema index that is free of such external factors and general enough to
map the diverse data found in the LOD cloud.
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In the following we give a short introduction to Formal Concept Analysis (FCA),
a method for conceptual knowledge discovery and representation, as well as the
basics of RDF and SPARQL.</p>
      <p>
        We lend the following de nitions 1, 2, 3 from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
De nition 1 (Formal Context). A formal context K := (G; M; I) is a triple
comprised of a set of objects G, a set of attributes M and an indicidence relation
I G M encoding that "g has attribute m" i gIm.
De nition 2 (Formal Concept). For A G and B M [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] de ne A0 :=
fm 2 M j8g 2 A : gImg, B0 := fg 2 Gj8m 2 B : gImg.
      </p>
      <p>
        A formal concept is a pair (A; B) with A0 = B and B0 = A. A is the extent,
B is the intent of the concept. As in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we abbreviate the set of all formal
concepts for a formal context as L(G; M; I).
      </p>
      <p>Formal concepts fall naturally into a hierarchy, called a concept lattice, by
a subconcept-superconcept-relation de ned via the subset relations over G and
M , respectively.</p>
      <p>
        De nition 3 ( ). For two concepts (A1; B1), (A2; B2) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] de ne (A1; B1)
(A2; B2) , A1 A2 (equivalently, , B1 B2). The pair (L(G; M; I); ) is
called concept lattice.
      </p>
      <p>RDF Datasources of the LOD cloud contain RDF1 data. For now, we abstract
from the possibility of blank nodes in RDF data.</p>
      <p>
        De nition 4 (RDF Graph). Given the set of RDF resource identi ers U and
the set of literals L and having the set of RDF terms T := U [ L: An RDF graph
is a set of RDF triples D f&lt; s; p; o &gt; js 2 U; p 2 U; o 2 T g. We further de ne
the set of all known RDF graphs D := fDjD is a known graphg.
SPARQL A language for formulating graph oriented queries against RDF
datasets is SPARQL2. For this paper, we will focus on Basic Graph Patterns
that are used for formalizing graph pattern matching and, thus, are
fundamental to SPARQL(cf. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Here, we further assume that queries are only evaluated
against the default graph of a datasource.
      </p>
      <p>De nition 5 (Basic Graph Pattern). Given a set V of variable names, a
Triple Pattern tp is a triple (s; p; o) 2 (U [V ) (U [V ) (T [V ). A Basic Graph
Pattern (BGP) of a query q is a set containing a number of Triple Patterns. We
abbreviate the set of all queries only containing a single BGP as BGP .</p>
      <p>
        With this restriction to single-BGP queries, we de ne the solution to a query
q through matching its BGP against the RDF graph D as q(D) (cf. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). We
further de ne a solution to q against a set of graphs D as q(D) := q(SD2D D).
      </p>
      <p>An example for a BGP and a possible solution is given in g. 1 b), c).
4</p>
    </sec>
    <sec id="sec-3">
      <title>Schema Index</title>
      <p>In a query federation system a central query processor accepts and processes
queries by dispatching (parts of) the queries to datasets (i.e., computing nodes
serving them). In the case of LOD these datasets are graphs. The schema
index maps queries onto minimal sets of graphs required for evaluating the given
1 Resource Description Framework; http://www.w3.org/TR/rdf-concepts/
2 http://www.w3.org/TR/sparql11-overview/
(a)
album title
ex:abbey:road "Abbey Road"
ex:quadrophenia "Quadrophenia"
...</p>
      <p>...</p>
      <p>(c)
queries. The optimal schema index minimizes the subset of all known RDF
graphs to be used for returning the complete solution of a query q:
Di(q) = argminDj D^q(Dj)=q(D)jDj j:
(1)</p>
      <p>The improvement of e ciency of query evaluation then depends on the
number of graphs left out of Di(q). Judging a graph to be relevant for query evaluation
may be based on a schematic match of a query and the graph.</p>
      <p>In our approach the matching will be informed by an underlying lattice of
property type clusters induced from the graph data via FCA. With our
approach, the schema index constructs and maintains a schema lattice covering
every graph encountered. We build this lattice by applying FCA on a formal
context extracted from an RDF graph.</p>
      <p>De nition 6 (Resource-Feature Map). We de ne a feature domain F as
the set of all known predicates p encountered in D. A resource-feature map f :
D U ! 2F assigns an RDF resource r the set of describing features found in
a graph: f (D; r) = fp 2 F j9o 2 T :&lt; r; p; o &gt;2 Dg.</p>
      <p>Example 1. f (gex; ex:quadrophenia) = fex:title; ex:tracksg
There are other possible manifestations of such a resource-feature map. For
example, one could (also) map the types assigned to resources through rdf:type.
De nition 7 (Schema Lattice). We de ne a graph-covering formal context
K := (U; F; If ) with the set of resources U , the feature domain F and If :=
f(r; p)j9D 2 D; 9r 2 U : p 2 f (D; r)g. The schema lattice LS = (L(U; F; If ); )
is the lattice constructed from K via FCA.</p>
      <p>While extracting features given in the graphs, the schema index will also
store the set of graphs contributing individual features to the subsequent lattice
construction.</p>
      <p>De nition 8 (Feature-contributing Graph). For an RDF resource r and a
feature p 2 F , we de ne a set of feature-contributing graphs as (r; p) = fDjp 2
f (D; r)g. For the features of an intent B, we de ne the set of contributing graphs
as 0(B) = fDj9p 2 B; 9r 2 U : D 2 (r; p)g.</p>
      <p>We conceive the schema index as a pair (LS ; ) of the schema lattice LS and
a function mapping queries to indexed graphs. In order to look up graphs that
a query should be evaluated against, the schema index then derives the queries
type information through the Query-Type Map function:
De nition 9 (Query-Type Map). The Query-Type Map
BGP ! 2F that maps a query q to a set of intents (q).
is a function
:</p>
      <p>For a query q the schema index returns the set of schematically appropriate
graphs as follows:
(q) = fDjD 2
0(B) for some B 2
(q)g:</p>
      <p>For our running example (cf. g. 1), the graph gex would be returned for
the BGP, as here fex:title, ex:tracksg constitutes an intent re ected in the
query.
5</p>
    </sec>
    <sec id="sec-4">
      <title>Outlook and Conclusion</title>
      <p>
        In this position paper, we sketch the idea of a schema index for a query-to-graph
mapping. For constructing the index we aim at an automatic schema induction
process through FCA. We hypothesize that FCA is a good method for this use.
One, it is independent of external factors, such as expert involvement as in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
or a task speci c parameter tuning as done in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Two, by design it is a method
targeting property-based entity descriptions as argued for in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Our proposition is that the schema index returns the graphs necessary for a
complete query evaluation:
q( (q)) = q(D) (for every query q)
(2)
However, (q) may return graphs that might be disregarded for evaluating BGP.
Hence, the proposed schema index is not necessarily optimal. For a future
implementation we plan to approximate the optimal case (cf. (1)). This must only
be done to a degree that is sensible, since potential savings in execution time
are being countered by costs for building and looking up the index.</p>
      <p>Towards a realization of the proposed schema index, we expect to face
questions such as for the size of formal contexts as given through the feature domains
of a graph. Next steps for our research will also involve empirical analysis of
runtime behaviour and scalability of FCA in the LOD use scenario. We have
indications that in spite of exponential worst-case complexity, the property-based
clustering of data items may lead to empirically acceptable runtime behaviour.
A further question in this context is how to e ciently update existing lattices,
when, e.g., encountering further features of an object while still ingesting a graph
in a stream-based fashion. For such a scenario, we will also look into what
lattice construction algorithm to choose. Finally, we plan to address the problems
of reducing the size of formal contexts and lattices through appropriate data
preparation and cleansing lattices of "noisy" concepts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buzmakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Codocedo</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Napoli</surname>
          </string-name>
          .
          <article-title>Mining de nitions from RDF annotations using formal concept analysis</article-title>
          .
          <source>Proc. of IJCAI</source>
          , pages
          <volume>823</volume>
          {
          <fpage>829</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Baader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>Completing description logic knowledge bases using formal concept analysis</article-title>
          .
          <source>Proc. of IJCAI</source>
          , pages
          <volume>230</volume>
          {
          <fpage>235</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          .
          <article-title>Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis</article-title>
          .
          <source>Journal of Arti cial Intelligence Research</source>
          ,
          <volume>24</volume>
          :
          <fpage>305</fpage>
          {
          <fpage>339</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>C. D'Amato</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Fanizzi</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Esposito</surname>
          </string-name>
          .
          <article-title>Inductive learning for the Semantic Web: What does it buy?</article-title>
          <source>Semantic Web</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          -2):
          <volume>53</volume>
          {
          <fpage>59</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knauf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scheglmann</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          .
          <article-title>A Systematic Investigation of Explicit and Implicit Schema Information on the Linked Open Data Cloud</article-title>
          .
          <source>Proc. of ESWC</source>
          ,
          <volume>7882</volume>
          :
          <fpage>228</fpage>
          {
          <fpage>242</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Homoceanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wille</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Balke</surname>
          </string-name>
          . ProSWIP:
          <article-title>Property-based data access for semantic web interactive programming</article-title>
          .
          <source>In Proc. of ISWC</source>
          , pages
          <volume>184</volume>
          {
          <fpage>199</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K.</given-names>
            <surname>Kellou-Menouer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kedad</surname>
          </string-name>
          .
          <article-title>Schema discovery in RDF data sources</article-title>
          .
          <source>In Proc. of ER</source>
          , pages
          <volume>481</volume>
          {
          <fpage>495</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Konrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          .
          <article-title>Schemex - e cient construction of a data catalogue by stream-based indexing of linked data</article-title>
          .
          <source>Web Semantics</source>
          ,
          <volume>16</volume>
          :
          <fpage>52</fpage>
          {
          <fpage>58</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>E.</given-names>
            <surname>Prudhommeaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seaborne</surname>
          </string-name>
          , et al.
          <article-title>Sparql query language for rdf</article-title>
          .
          <source>W3C recommendation</source>
          , https://www.w3.org/TR/rdf-sparql-query/, 15,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. S. Scheglmann, G. Groener,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>La</surname>
          </string-name>
          <article-title>mmel. Incompleteness-aware programming with RDF data</article-title>
          .
          <source>Proc. of the 2013 workshop on Data driven functional programming</source>
          ,
          <source>DDFP 2013</source>
          , pages
          <fpage>11</fpage>
          {
          <fpage>14</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. J. Volker and
          <string-name>
            <given-names>M.</given-names>
            <surname>Niepert</surname>
          </string-name>
          .
          <article-title>Statistical schema induction</article-title>
          .
          <source>In Proc. of ESWC</source>
          , pages
          <volume>124</volume>
          {
          <fpage>138</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>J. Vo</surname>
          </string-name>
          <article-title>lker and</article-title>
          <string-name>
            <given-names>S.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          .
          <article-title>Fostering web intelligence by semi-automatic OWL ontology re nement</article-title>
          .
          <source>Proc. of Web Intelligence</source>
          ,
          <string-name>
            <surname>WI</surname>
          </string-name>
          <year>2008</year>
          , pages
          <fpage>454</fpage>
          {
          <fpage>460</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>R.</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <source>Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts</source>
          , pages
          <volume>445</volume>
          {
          <fpage>470</fpage>
          .
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>D.</given-names>
            <surname>Yuan</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra. Lindex</surname>
          </string-name>
          :
          <article-title>A lattice-based index for graph databases</article-title>
          .
          <source>VLDB Journal</source>
          ,
          <volume>22</volume>
          (
          <issue>2</issue>
          ):
          <volume>229</volume>
          {
          <fpage>252</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>