<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards an Incremental Schema-level Index for Distributed Linked Open Data Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Till Blume</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ansgar Scherp</string-name>
          <email>ansgar.scherp@stir.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kiel University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Stirling</institution>
          ,
          <addr-line>Scotland</addr-line>
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ZBW - Leibniz Information Centre for Economics</institution>
          ,
          <addr-line>Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semi-structured, schema-free data formats are used in many applications because their flexibility enables simple data exchange. Especially graph data formats like RDF have become well established in the Web of Data. For the Web of Data, it is known that data instances are not only added, changed, and removed regularly, but that their schemas are also subject to enormous changes over time. Unfortunately, the collection, indexing, and analysis of the evolution of data schemas on the web is still in its infancy. To enable a detailed analysis of the evolution of Linked Open Data, we lay the foundation for the implementation of incremental schema-level indices for the Web of Data. Unlike existing schema-level indices, incremental schema-level indices have an eficient update mechanism to avoid costly recomputations of the entire index. This enables us to monitor changes to data instances at schema-level, trace changes, and ultimately provide an always up-to-date schema-level index for the Web of Data. In this paper, we analyze in detail the challenges of updating arbitrary schema-level indices for the Web of Data. To this end, we extend our previously developed meta model FLuID. In addition, we outline an algorithm for performing the updates.</p>
      </abstract>
      <kwd-group>
        <kwd>Incremental schema-level index</kwd>
        <kwd>Schema computation</kwd>
        <kwd>LOD</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The Web of Data comprises a huge amount of interlinked data instances (Linked
Data) in a standard format, e. g., RDF. However, since the Web of Data is by its
nature a decentralized database, without a central authority responsible for data
management, diferent data schemas are used. In addition, data instances on the
Web are not just regularly added, modified, and removed [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but their schema
is also subject to enormous changes over time [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Consequently, data-driven
applications that aim to make use of the Web of Data render useless if they rely
on, e. g., outdated data caches [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The analysis of the evolution of data and data
schemas can bring enormous advantages for the understanding of the data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
and help to keep local copies of the data up-to-date [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Many research works
analyze the (co-)evolution of data [
        <xref ref-type="bibr" rid="ref14 ref5 ref9">5, 9, 14</xref>
        ], but few analyze the (co-)evolution
of Linked Open Data to understand dependencies of change behavior such as
frequently used vocabularies, network links, and other (co-)evolution patterns of
data instances [
        <xref ref-type="bibr" rid="ref13 ref4">4,13</xref>
        ]. In order to perform large scale analyses on the Web of Data,
eficient access to the dynamic data is required, with incremental indices playing
a decisive role. Eficient access to data on the Web can be achieved with crawling
and indexing. Crawling the data produces a data stream that can only be indexed
with an incremental algorithm, which needs to provide eficient updates of the
index instead of costly re-computations of the entire index.
      </p>
      <p>
        So far, there exist only incremental instance-level indices but no incremental
schema-level indices. Instance-level indices allow to quickly retrieve nodes or
answer questions about reachability, distance, or shortest path [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In contrast,
schema level indices (SLI) index the schema computed from the data [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ].
For example, if there are instances in a data source, each described with three
properties (see Fig. 1), an SLI can summarize these instances. An SLI stores only
the combination of the three properties (schema information) and links them to
the corresponding data source (payload information). The payload is needed to
use an SLI in a concrete application context such as for data search [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where
the aim is to find data sources relevant to a user’s query. There are numerous
diferent variations of SLIs that index diferent schema information and store
diferent payload information [
        <xref ref-type="bibr" rid="ref2 ref6">2,6</xref>
        ]. However, there is no incremental schema-level
index that allows the analysis of the evolution of data schemas on the Web.
      </p>
      <p>
        In the related field of schema discovery in NoSQL databases, there are
incremental algorithms to discover hidden schemas in the data [
        <xref ref-type="bibr" rid="ref1 ref18">1, 18</xref>
        ]. However,
these are limited to document-oriented formats such as JSON and are designed
for closed database systems. Therefore, they cannot be applied to an incremental
schema-level index for open, distributed graph databases where data instances
are stored decentralized. In such an open system like the Web of Data, the index
computation must be able to process incomplete data due to the size of the Web
and the high dynamics. Let us assume that a data source DS-URI-2 is crawled
at a certain point in time after data source DS-URI-1 has been indexed (Fig. 1).
If the data source DS-URI-2 contains instances with the same schema as the
instances in DS-URI-1, the payload of PC-1 must be updated. The crawler can
also visit the data source DS-URI-1 again and provide changed information. This
can trigger diferent types of updates. For example, if only one of two instances
in Fig. 1 disappears, we have to update the payload with the new number of
summarized instances (1 Instance). If both instances disappear we must also
remove the data source and consider removing the schema element (PC-1) from
the SLI since a matching query would not return any payload. If, however, only
the instance information changes, e. g., the title of I-2 changes to "Title-C", no
update (neither on schema nor on payload) is required.
      </p>
      <p>For the Web of Data, we make certain assumptions about the data graph.
First, we consider the crawling strategy as a black box. This means that all
data sources in the Web of Data are (re-)visited eventually by the crawler at
an unknown point in time. Second, crawler download the entire data source for
a given data source URI. Thus, we assume that all statements fetched via one
data source URI come next to each other in the data stream. Third, each data
source can contain statements about instances stored in other data sources. Our
experiments suggest that almost one third of the instances are distributively
defined. Thus, instances can be defined in a truly distributed setting.</p>
      <p>
        With this work, we want to lay the foundation to close the gap between
continuously crawling data from the Web and analyzing the data schemas. We do
not want to develop a customized incremental SLI, but rather create the basis for
updates to any SLI. To this end, we examine all possible update operations on
SLIs. For this purpose, we extend our previously developed formal model FLuID
to describe and implement any SLI using only 11 building blocks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For updates
at schema level, we need to consider several cases: Instances with new schemas are
observed (SEnew), instances with a known schema are added (P Eadd), schemas
of known instances are changed (SEmod), not all instances with a known schema
are deleted (P Edel), and all instances with a known schema are deleted (SEdel).
      </p>
      <p>The remainder is structured as follows: In Sect. 2, we discuss related works.
In Sect. 3, we outline FLuID’s building blocks and present how SLIs can be
computed for a given data graph. Subsequently, we analyze the problem of
updating SLIs in accordance with the FLuID model in Sect. 4. Finally, we give a
brief outlook on our incremental algorithm, before we conclude the paper.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The approaches for indexing graph data can be divided in instance-level indices
and schema-level indices. Instance-level indices store information about the
actual data instances and their statements [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They support queries for specific
resources such as finding all persons with the surname “Müller”. Schema-level
indices provide a concise summary of the data instances by memorizing the schema
defined by the instances’ statements. Schema-level indices satisfy information
needs like “Find all data sources with information about persons who are actors
and American presidents” [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ]. There exist various variants of schema-level
indices (SLIs) [
        <xref ref-type="bibr" rid="ref17 ref2 ref6">2,6,17</xref>
        ], which greatly difer in what kind of schema is captured. For
example, TermPicker [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] summarizes instances based on a common set of types,
a common set of properties, and a common set of types of all property-linked
resources. However, there exists no incremental schema-level index. Therefore,
we discuss existing incremental instance-level indices and discuss incremental
schema discovery algorithms in NoSQL databases.
      </p>
      <p>
        There are few incremental instance-level indices that allow adding new data
instances on the fly and seamlessly incorporating it into the the index [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ].
Yuan et. al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] propose a subgraph mining algorithm to mine frequent and
discriminative features in subgraphs in order to optimize a computed index
and regroup subgraphs based on newly added features. Thereby the algorithm
minimizes the number of index lookups for a given type of query. The idea of
subgraph mining for instance-level indices has been further improved by various
works [
        <xref ref-type="bibr" rid="ref11 ref20">11, 20</xref>
        ]. Some works investigate the eficient subgraph matching problem
on unlabeled and undirected graphs [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Qiao et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] propose to compress
the data graph G using a pattern graph P by extracting isomorphic sub-graphs.
They propose interesting techniques and address the problem of graphs too large
for the main memory. However, it needs to be evaluated whether they can also
be applied on labeled directed graphs such as the Web of Data.
      </p>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] propose a schema management framework for JSON
document stores. They propose to store the schema in tree-like structures, which
can be computed using each instance only once. However, their approach has
several limitations. First, they assume that instances are always complete when
the schema is computed. Second, their approach is only applicable for schema
structures that consider properties used by the specific instance only and no
information beyond that is needed. Baazizi et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] compute schemas for JSON
documents to present information about optional fields and mandatory fields.
However, incrementally discovering hidden schemas is not the same as computing
exact schemas given by the data, they are limited to document orientated formats
like JSON, and are designed for closed database systems.
      </p>
      <p>
        In summary, all related approaches assume that the data is accessible at any
time and that data instances are defined in a single data source, i. e., are not
distributed. The only approach handling decentralized graph data is proposed by
Konrath et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The authors assume that the data is provided by a crawler and
apply the idea of a stream-based index computation over a window of the data.
Their computation approach, however, does not support incremental updates.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Basic building blocks of Schema-level Indices</title>
      <p>
        In our previous work, we developed the FLuID model, which is able to define
arbitrary schema-level indices as combination of equivalence relations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Formally,
a schema-level index is a 3-tuple (G, EQR, PAY), where G is the data graph
which is indexed, EQR is an equivalence relation over instances in G, and PAY is
an n-tuple of payload functions, which map instance information to equivalence
classes in EQR. These equivalence relations can be defined using paramaterized
simple and complex schema elements [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They basically define how the schema
is computed from the data graph, e. g., which types and properties are taken
into account. In the following, we introduce common notions, define the data
graph, and highlight the relevant aspects of FLuID’s schema elements, their
parameterizations, and the payload elements [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>A data graph G is defined by G ⊆ VUB × P × (VUB ∪ L), where VUB denotes
the set of URIs and blank nodes, P the set of properties, and L the set of literals.
A triple is a statement about a resource s ∈ VUB ∪ P in the form of a
subjectpredicate-object expression (s, p, o) ∈ G. We define instances Is ⊆ G to be a set
of triples, where each triple shares a common subject URI s. The properties P
can be divided into disjoint subsets P = Ptype ∪˙ Prel, where Ptype contains the
properties denoting type information and Prel contains the properties linking
instances in the data graph. We say the instance Is with the subject URI s is
of type c if there exists a triple (s, pt, c) ∈ G, with pt ∈ Ptype. In the context of
RDF, Ptype contains only rdf:type and Prel all p ∈ P \ Ptype. Furthermore, we
denote with δ+(Is) the outdegree of an instance Is and with δ−(Is) the indegree.</p>
      <sec id="sec-3-1">
        <title>3.1 Index Definition with FluID</title>
        <p>Below, we briefly describe FLuID’s building blocks and subsequently describe,
how FLuID-indices can be computed.</p>
        <p>Simple Schema Elements summarize instances I1 ∈ G and I2 ∈ G by comparing
all triples (s1, p1, o1) ∈ I1 to the triples (s2, p2, o2) ∈ I2. For simple schema
elements, we distinguish property cluster (PC), object cluster (OC), and
propertyobject cluster (POC). Each simple schema element compares triples following a
diferent pattern, i. e., comparing only the properties, only the objects, or the
combination of both. Furthermore, there are undirected versions of the three
simple schema elements, where additionally the incoming triples (x, p, i) ∈ G
with i as the subject of the instance in object position are considered.
Complex Schema Elements partition the data graph by summarizing instances
based on the three given equivalence relations ∼s, ∼p, and ∼o. Therefore, they can
be defined as 3-tuple CSE := (∼s, ∼p, ∼o). While the simple schema elements
summarize instances by comparing triples using the identity equivalence “=”, the
complex schema elements compare triples using arbitrary equivalences ∼s, ∼p,
and ∼o, e. g., simple schema elements. Thus, they can be considered as containers
to combine any number of simple schema elements.</p>
        <p>Parameterizations further specify our schema elements. There are four
parameterizations defined in FLuID. Chaining parameterization determines the size
of the considered sub-graph of a complex schema element CSE of up to k-hops,
and is denoted by CSEk. Label parameterization allows restricting the SLI to
consider only specific properties and can be used to define type cluster OCtype,
where the object cluster only compares objects connected over the properties in
Ptype. Inference parameterization Φ is applied on the data graph G and enables
ontology reasoning using a schema graph SG. In practice this means that a
schema graph SGRDF S is constructed on-the-fly (or in a pre-processing step),
which stores all hierarchical dependencies between types and properties denoted
by RDFS properties found in the data graph G. Instance parameterization σ
allows merging equivalent instances, e. g., instances linked with owl:sameAs.</p>
        <p>The label, inference, and instance parameterizations pose further restrictions
or relaxations on the comparison of triples when computing the schema elements.
Thus, they change the number of considered triples for each schema element. For
example, the label parameterization allows ignoring a certain set of properties
to determine the equivalence of two instances. The chaining parameterizations,
however, afects the number of neighboring instances that also need to be
equivalent according to the complex schema element. Two instances I1 and I2 are
considered equivalent by CSEk, if for each neighboring instances of I1 in the
data graph for each distance up to k, there is a neighboring instance of I2 in the
same distance that is considered equivalent by CSE.</p>
        <p>
          Payload Elements are attached to schema elements and contain information about
the summarized instances, e. g., their data source or the number of summarized
instances. The payload is needed to implement a concrete application, e. g.,
a search engine [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For each type of payload, a mapping function needs to
be defined. One such function is the data source mapping function ds, which
maps a schema element to the data sources of the summarized instances. The
function ds takes a schema element EQR as input and returns all sources, with
ds(EQR) := SI∈EQR context(I), with the function context : P(G) → P(VU )
returning all data sources of an instance I.
3.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Index Construction with FLuID</title>
        <p>In this section, we present our approach to compute schema-level indices modeled
with FLuID. Susequently, we extend this to an incremental indexing approach.</p>
        <p>
          Computing an SLI can be described as a function SC : (G, EQR, PAY) → sli,
which computes according to the equivalence relations given in EQR, and the
n payload functions PAY the concrete schema-level index sli for a data graph
G. Since the equivalence relations given in EQR can be defined with schema
elements, we can treat EQR as a tuple of schema elements. In the following, we
denote schema elements in EQR as abstract schema elements. For example, the
SLI TermPicker [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] defines the schema structure that summarizes instances that
have a common set of types, a common set of properties, and a common set of
combined types of all neighboring instances. With the help of FLuID, this can
be expressed with four abstract schema elements:
        </p>
        <p>EQRTermPicker := (OCtype ∩ P Crel, T, OCtype),
(1)
A complex schema element wraps the label parameterized object cluster using
the set Ptype denoted as OCtype, the label parameterized property cluster using
the set Prel denoted as P Crel, and an arbitrary tautology denote as T , which
considers everything equivalent.</p>
        <p>To compute a concrete schema-level index sli, the abstract schema elements are
used as “blueprints” to compute (instantiate) schema elements for the instances
in the data graph G. This can, for example, be done, by extracting all properties
of an instance to form a property cluster. For instances using the same set of
properties, the same schema element is computed. More specifically, instantiating
schema elements means creating a schema element se-i ∈ sli with a certain set
of specific attributes, e. g., the three properties {_:title, _:subject, _:abstract}.</p>
        <p>There is a significant diference between computing simple schema elements
and computing complex schema elements. The FLuID model defines simple
schema elements as low-level building blocks that summarize instances by
comparing triples using the identity equivalence (=) for properties and/or objects.
Thus, simple schema elements can always be computed directly from the data
without dependencies to other schema elements. In contrast, complex schema
elements are deliberately designed to have dependencies either on complex schema
elements or on simple schema element. To explain this in more detail, we look at
the example shown in Fig. 2. When we start with an empty sli (interval [t0, t1]),
to compute the schema of instances, we first need to compute the sub-schema
structures (i. e., the simple schema elements octype-1, octype-2, pcrel-1) according
to the abstract schema definition (blueprint) given in Eq. ( 1). We can compute the
complex schema elements cse-1 and cse-2 based on the simple schema elements.
To compute the complex schema element for instance i6, we need to know the
type information about instances i2 and i1 (octype-2).</p>
        <p>
          The advantage of this feature is that with increasing complexity of the abstract
schema definition, the computation costs increase only linearly since the index
reuses already computed simple schema elements for several complex schema
elements [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. In general this means, we have to instantiate at most all abstract
schema elements for each instance I1 ∈ G. We denote with ς the number of
abstract simple schema elements and with τ the number of abstract complex
schema elements. Please note that the chaining parameterization is defined so
that the pattern of an abstract complex schema element is applied recursively k
times. When chaining a CSE k times, comparable to unrolling a loop, we can
simply assume that we have to calculate k × τ many abstract complex schema
elements. However, no new abstract simple schema element is required, so the
number ς is not afected.
        </p>
        <p>The payload information needs to be computed for a data graph G according
to the payload functions PAY. Computing the payload elements follows the same
principal as computing the schema elements. Thus, we omit the details here.
Please note, in the following, we consider a computed sli for a data graph G
again as a graph. However, other data structures can be used as well.</p>
        <p>To summarize, the total number of instances in G can be bounded by the
number of nodes in the data graph G, with |G| = v. When computing an
schemalevel index, we have to compute α ≤ ς × v many simple schema elements and
β ≤ τ × v many complex schema elements. Furthermore, we need to consider p
many payload mapping functions instantiating γ ≤ p × v many payload elements
for the data graph G. Thus, we know that α + β ≤ v × (ς + τ × k) schema elements
are instantiated in the final schema-level index. With a data graph where no two
instances can be summarized, we end up with α + β = v × (ς + τ × k).
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Update Operations on an Index built with FLuID</title>
      <p>As described in the previous sections, computing an SLI means extracting schema
and payload information from the data to compute schema and payload elements.
In the following, we analyze the diferent cases of updates with respect to the
expected complexity. We compare the costs implied with our incremental index
to computing the index from scratch.
4.1</p>
      <sec id="sec-4-1">
        <title>Updating Schema Elements</title>
        <p>There are six cases of updates possible for an SLI: a new instance is observed
with a new schema (SEnew), a new instance is observed with a known schema
(P Eadd), a known instance is observed with a changed schema (SEmod), a known
instance is observed with only changed instance information (P Emod), a instance
with its schema and payload information no longer exists (P Edel), and no more
instance with a specific schema exists in the data graph ( SEdel). Please note, we
denote with the schema of an instance the complete schema defined by all abstract
schema elements in the index definition, e. g., the complex schema element.
Subschemas are smaller parts of the schema, e. g., simple schema elements combined
within a complex schema element.</p>
        <p>New Data Instances Adding an instance to G means adding a set of triples
(s1, p1, s2) about a resource s1 when there where no triple about this resource in
G before, neither with s1 as subject nor with s1 as object. Due to our blackbox
crawling scenario, we cannot assume any order in the data. Thus, we may obeserve
statements with s1 as object before we observe statements with s1 as subject. In
this case, we consider s1 as instance with an empty schema and no payload.</p>
        <p>For new instances in G, we consider two cases. First, if the instance is observed
with an entirely new schema (SEnew), all ς + τ × k (sub-) schema elements need
to be computed and added to the index. However, it is possible that some
sub-schema elements are already known even if the schema of the instance is
new. For example, the specific combination of properties (pc- 1) is known but
not the specific combination of types (tc- 8) and therefore also not the complex
combination cse-3 using pc-1 and tc-8. This means, adding an instance with a
new schema to G (SEnew) can require adding between 1 and ς + τ × k schema
elements to the index.</p>
        <p>Second, if the instance is observed with a known schema (P Eadd), also all
sub-schema elements have to already exist in the index. Therefore, only the
payload of existing schema elements may needs to be updated.</p>
        <p>
          Modifying Known Data Instances Modifying instances means adding and/or
removing a set of triples (s1, p1, s2) ∈ G about the resource s1 when there existed
triples about this resource in G before. When the modifications are only on
instance-level (P Emod), this requires no change on the schema elements but the
payload may needs to be updated. However, the modifications can be on
schemalevel (SEmod), e. g., new properties are added or types are changed. When we
modify an instance on schema-level, analogously to adding an instance, between
1 and ς + τ × k schema-elements in the index may need to be updated or created.
Furthermore, it is possible that for another instance I2 a triple (s2, p1, s1) ∈ G
was observed before, where s1, s2 are the subject URIs of I1, I2 respectively. This
means, for each such instance I2 in G, we may need to update the schema elements
as well. Thus, for each incoming edge to I1 in G, up to τ many complex schema
elements require an update. When we apply the chaining parameterization, the
individual indegree of each neighbor up to a distance of k hops needs to be
considered. To ease notation, we denote with δ−(I1maxk ) the maximum indegree
of all instances within a k-hop distance from instance I1. We can then simplify
the estimation to δ−(I1maxk )k × τ efected schema elements. Thus, SEmod requires
a maximum of ς + τ × k + δ−(I1maxk )k × τ updates on schema-level. Since ς and
τ are fixed before computing the index, the only variable factor depending on the
data is the indegree δ− of the instances. Please note, for existing SLIs we know
that τ ≤ 1 and ς ≤ 3 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Furthermore, our results discussed in Sect. 5 indicate
that the indegree is on average only 5.
        </p>
        <p>Deleting Known Data Instances Deletion can be seen as an extreme case of
modification where all instance information is deleted, i. e., all triples about the
resource. Due to the decentrality of the Web of Data, there can still be links to
that instance in the data graph. Thus, the URI of the instance can still appear
in the data graph, but only on object position of a triple.</p>
        <p>First, the deletion can mean that there are no further data instances in the
data graph G with that schema (SEdel). This means, we basically have to revert
all changes that where needed to add a new instance with a new schema (SEnew)
and we have to delete between 1 and ς + τ × k schema elements from the index.
Furthermore, deleting single schema elements may require an update of depending
schema-elements when (chained) complex schema elements are used analogously
to the modification SEmod. Considering Fig. 2, the schema is computed based on
types extracted from instances linked via properties. This means, when we remove
instance i4 and i5 after t4, updates on depending schema elements are necessary.
We delete the (sub-) schema elements of cse-2 and also have to update the schema
element cse-1 since it can no longer reference octype-2. Thus, analogously to the
modification, we have to consider the indegrees of neighboring instances and
possibly update up to δ−(I1maxk )k × τ complex schema elements.</p>
        <p>Second, the deletion can mean that there are still instances in the data graph
G with the same schema as the deleted instance (P Edel). This means, also all
sub-schema elements have to remain in the index. Thus, there is no deletion of
schema information required and only the payload of existing schema elements
may needs to be updated. For this case, although no schema element is deleted,
we may still have to update δ−(I1maxk )k × τ complex schema elements. Let us
consider the example illustrated in Fig. 3. When we compute indices using
(chained) complex schema elements, e. g., TermPicker, we do not delete even
a sub-schema element but we rather need to add schema elements cse-2 and
octype-3. Thus, up to δ−(I1maxk )k × τ + ς schema elements need an update.
With p many payload functions, up to p new payload-elements need to be
instantiated for each new instance I1. Making general assumptions about the
payload is not possible since there can be a unlimited number of diferent payload
functions. Thus, we focus on two often used payload functions, the data source
function and the instance count function. Adding or deleting an instance always
triggers an update on the instance count payload. Modifying instances does not
trigger a change on the instance counts. All three kinds of instance-level change
can trigger an update of the data source payload if a new data source is added
or an existing instance is moved to another data source.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>As we discussed in the previous sections, to update schema-level indices some
information about the data graph needs to be preserved to coordinate the right
update operation to the correct schema elements. Thus, we propose to store two
kinds of link information, instance URI to schema element and instance URI to
data source URI. Furthermore, if complex schema elements are used, we need to
memorize a subset of incoming edges for each instance. The amount of storage
space required to store all these links for the Web of Data may be undesirably
large. We plan to conduct detailed analyses using real-world datasets crawled
from the Web of Data. From our analysis in Sect. 4, we know that one major
factor is the indegree of instances in the data graph. Our preliminary results
for the first four snapshots crawled by the Dynamic Linked Data Observatory
(DyLDO)4 indicate large variety in the link distribution and instance distribution.
About 28% of the data instances are split over 2 to 5, 800 (average 2.2) data
sources. Furthermore, about one third of the instances have dependencies on 1
to 172, 000 instances. However, on average instances only have dependencies on
5 instances. Storing all links sums up to about 33% of the number of links in the
data graph. These first results suggest that implementing an eficient incremental
schema-level index using a complex schema element for the Web requires storing
significantly more data. Thus, we will also analyze possible optimization and
approximation strategies.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Outlook</title>
      <p>We reviewed existing works on incremental indices and showed that there is a
clear gap for incremental schema-level indices. Our analysis of update operations
on the FLuID model shows that the update performance of schema-level indices
mainly depends on the indegree of changed data instances and on the number of
complex schema elements used. All known schema-level indices, however, need
only at most one complex schema element. Furthermore, our preliminary results
suggest that to update an incremental index for the Web of Data defined with
a complex schema element, we need to memorize about an additional 33% of
data graph size. A detailed evaluation of such an incremental SLI is needed to
measure the impact on the performance when computing indices incrementally.
Acknowledgement. This research was co-financed by the EU H 2020 project
MOVING (http://www.moving-project.eu/) under contract no 693092.
4 http://km.aifb.kit.edu/projects/dyldo/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baazizi</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben</surname>
            <given-names>Lahmar</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Colazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Ghelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Sartiani</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Schema inference for massive JSON datasets</article-title>
          .
          <source>In: EDBT</source>
          . pp.
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          . OpenProceedings.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blume</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Towards flexible indices for distributed graph data: The formal schema-level index model FLuID</article-title>
          . In: Foundations of Databases. CEURWS.org (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dividino</surname>
            ,
            <given-names>R.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Strategies for eficiently keeping local linked open data caches up-to-date</article-title>
          .
          <source>In: ISWC</source>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dividino</surname>
            ,
            <given-names>R.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gröner</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>From changes to dynamics: Dynamics analysis of linked open data sources</article-title>
          .
          <source>In: PROFILES@ESWC. CEURWS.org</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A general and parallel platform for mining co-movement patterns over large-scale trajectories</article-title>
          .
          <source>PVLDB</source>
          <volume>10</volume>
          (
          <issue>4</issue>
          ),
          <fpage>313</fpage>
          -
          <lpage>324</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gómez</surname>
            ,
            <given-names>S.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etcheverry</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marotta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consens</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          :
          <article-title>Findings from two decades of research on schema discovery using a systematic literature review</article-title>
          .
          <source>In: AMW. CEUR-WS.org</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krayer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>LODatio: using a schema-level index to support users infinding relevant sources of linked data. In: K-CAP</article-title>
          .
          <source>ACM</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobald</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>Database foundations for scalable RDF processing</article-title>
          .
          <source>In: Reasoning Web. - Int. Summer School</source>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ke</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Detecting influence relationships from graphs</article-title>
          .
          <source>In: SDM. SIAM</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Käfer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdelrahman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            , J.,
            <given-names>O</given-names>
          </string-name>
          <string-name>
            <surname>'Byrne</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Observing linked data dynamics</article-title>
          .
          <source>In: ESWC</source>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kansal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spezzano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A scalable graph-coarsening based index for dynamic graph databases</article-title>
          .
          <source>In: CIKM. ACM</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Konrath</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SchemEX - eficient construction of a data catalogue by stream-based indexing of Linked Data</article-title>
          .
          <source>J. Web Sem</source>
          .
          <volume>16</volume>
          ,
          <fpage>52</fpage>
          -
          <lpage>58</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nishioka</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Temporal patterns and periodicity of entity dynamics in the linked open data cloud. In: K-CAP</article-title>
          .
          <source>ACM</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ohsaka</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akiba</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshida</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawarabayashi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Dynamic influence analysis in evolving networks</article-title>
          .
          <source>PVLDB</source>
          <volume>9</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1077</fpage>
          -
          <lpage>1088</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Qiao</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , H., Cheng, H.:
          <article-title>Subgraph matching: on compression and computation</article-title>
          .
          <source>PVLDB</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          ),
          <fpage>176</fpage>
          -
          <lpage>188</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sakr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Naymat</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Graph indexing and querying: a review</article-title>
          .
          <source>J. of Web Inf. Sys</source>
          .
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>101</fpage>
          -
          <lpage>120</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Schaible</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data cloud</article-title>
          .
          <source>In: ESWC</source>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wangz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Schema management for document stores</article-title>
          .
          <source>VLDB</source>
          <volume>8</volume>
          (
          <issue>9</issue>
          ),
          <fpage>922</fpage>
          -
          <lpage>933</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Iterative graph feature mining for graph indexing</article-title>
          .
          <source>In: ICDE. IEEE Computer Society</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Updating graph indices with a one-pass algorithm</article-title>
          .
          <source>In: SIGMOD. ACM</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>