A Main Memory Index Structure to Query Linked Data

Olaf Hartig (hartig@informatik.hu-berlin.de) and Frank Huber (huber@informatik.hu-berlin.de)
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany

Copyright is held by the author/owner(s). LDOW2011, March 29, 2011, Hyderabad, India.

ABSTRACT
A possible approach to query Linked Data combines the actual evaluation of a query with the traversal of data links in order to discover and retrieve potentially relevant data. An implementation of this idea requires approaches that support an efficient and flexible management of temporary, ad hoc data collections that emerge during query execution. However, existing proposals for managing RDF data primarily focus on persistent storage and query execution for large datasets and, thus, are unsuitable in our dynamic scenario, which involves many small sets of data.

In this paper we investigate main memory data structures to store Linked Data in a query execution system. We discuss the requirements for such a data structure, introduce three strategies that make use of hash table based indexes, and compare these strategies empirically. While this paper focuses on our query execution approach, the discussed data structures can also be applied to other approaches that retrieve Linked Data from remote sources via URI look-ups in order to process this data locally.

1. INTRODUCTION
In [12] we propose link traversal based query execution as a new query execution paradigm tailored for the Web of Data. In contrast to traditional query execution paradigms, which assume knowledge of a fixed set of potentially relevant data sources beforehand, our approach conceives the Web of Data as an initially unknown set of data sources. We make use of the characteristics of Linked Data, in particular, the existence of links between data items of different sources. The main idea of our approach is to intertwine the construction of the query result with the traversal of data links that correspond to intermediate solutions in the construction process. This strategy allows the execution engine to discover potentially relevant data during the query execution.

An important characteristic that distinguishes link traversal based query execution from other approaches, such as query federation, is the retrieval and query-local processing of Linked Data from the Web. Moreover, the integration of result construction and link traversal means that the query-local data collection is continuously augmented with further data discovered during the query execution. To support this procedure a link traversal based query system requires a data structure that stores the retrieved data while it is being processed. Such a data structure must allow the query operators to efficiently add and access the retrieved data. For our query execution approach, such an access includes the evaluation of triple based queries. Furthermore, the data structure must also support concurrent access in order to enable an efficient, multi-threaded implementation of our query approach. Disk-based data structures for RDF¹ are unsuitable for the task because they feature very costly I/O operations, and there is no need to persist data collections that emerge during link traversal based query execution. However, existing data structures that manage RDF data in main memory do not satisfy the outlined requirements either, primarily because they are optimized for path queries [14, 22] or for complete graph queries [2, 5, 16] but not for simple triple based queries.

¹ RDF is the data model employed by Linked Data.

In this paper we investigate three main memory data structures that satisfy the outlined requirements. All three data structures are based on the same index structure for RDF. This index uses hash tables which store sets of RDF triples, arranged in six possible ways to efficiently support all possible types of triple based queries. The three studied data structures, which we call individually indexed representation, combined index representation, and combined quad index representation, differ in how they actually make use of the index structure: The individually indexed representation stores the data resulting from each link traversal step in a separate index, whereas the combined and the quad index representation use a single index for all data that the query engine retrieves during the execution of queries. These different index utilization strategies result in different characteristics of the three data structures: A single index enables a more efficient execution of triple based queries than a collection of individual indexes. On the other hand, it is not as straightforward to add, remove, or replace data in a single index that multiple, parallel link traversal operations access concurrently as it is for the individually indexed representation. However, the combined index representation and the quad index representation that we propose in this paper support concurrent access in an efficient manner. Although we present and discuss the three data structures in the context of link traversal based query execution, they can also be applied to other approaches that retrieve Linked Data by looking up URIs in order to process this data locally; for instance, the navigational query approach proposed by Bouquet et al. [7] or the data summary based query approach from Harth et al. [10]. Hence, our main contributions are:

• three main memory data structures for ad hoc storing of Linked Data that is consecutively retrieved from remote sources in order to be processed locally, including a discussion how to access these data structures, and

• an evaluation that analyzes the proposed data structures and compares them empirically.

This paper is structured as follows: First, we introduce preliminaries in Section 2, including the link traversal based query execution paradigm. In Section 3 we discuss the requirements that a data structure must support in order to be suitable for our usage scenario. Based on these requirements we review existing work in Section 4. Section 5 introduces the studied data structures that satisfy our requirements, including the index structure they are based on. We empirically evaluate these data structures in Section 6 and conclude in Section 7.

2. PRELIMINARIES
This section introduces the preliminaries for the work presented in this paper, including a brief, informal introduction of our query approach: Link traversal based query execution is a novel query execution paradigm developed to exploit the Web of Data to its full potential. Since adhering to the Linked Data principles [3] is the minimal requirement for publication in the Web of Data, link traversal based query execution relies solely on these principles instead of assuming the existence of source-specific query services.

The Linked Data principles require describing data using the RDF data model. RDF distinguishes three distinct sets of RDF terms: URIs, literals, and blank nodes. Let U be the infinite set of URIs, L an infinite set of literals, and B an infinite set of blank nodes that represent unnamed entities. An RDF triple is a 3-tuple (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L).

In the Web of Data each entity has to be identified via a single HTTP scheme based URI. Let U_LD ⊂ U be the (possibly infinite) set of all these URIs. By looking up such a URI we retrieve RDF data about the entity identified by the URI. Conceptually, we understand the result of such a URI look-up as a descriptor object:

Definition 1. A descriptor object is a set of RDF triples, i.e. a finite subset D ⊂ (U ∪ B) × U × (U ∪ B ∪ L), which i) was retrieved by looking up the URI u ∈ U_LD of an arbitrary entity and which ii) describes that entity. To denote the URL of the Web resource from which a descriptor object D was actually retrieved we write url(D).

The query language for RDF data is SPARQL [17], which is based on graph patterns and subgraph matching. The basic building block of a SPARQL query is a basic graph pattern (BGP), that is, a finite subset of the set (U ∪ B ∪ V) × (U ∪ V) × (U ∪ B ∪ L ∪ V) where V is an infinite set of query variables, distinct from U, B and L. The elements of a BGP are called triple patterns. We adjusted the semantics of BGP queries² for our link traversal based query execution approach. While we outline the idea in the following, we refer to [11] for a formal definition.

² While we consider only BGP queries in this paper, the results for a BGP that might be determined using link traversal based query execution can be processed by the SPARQL algebra, which provides operators for the other types of graph patterns in SPARQL queries.

To provide results for a BGP query, a link traversal based query execution engine intertwines the construction of such query results with the traversal of data links in order to discover data that might be relevant to answer the executed query. By using the descriptor objects retrieved from looking up the URIs in a query as a starting point, a link traversal based query execution engine evaluates a certain triple pattern of the BGP. The intermediate solutions resulting from this triple pattern evaluation usually contain further URIs. These URIs link to additional data which may provide further, intermediate solutions for the same or for other patterns of the BGP. To determine results of the whole BGP query the execution engine alternates between evaluating triple patterns and looking up URIs; finally, the intermediate solutions for the triple patterns have to be joined. A comprehensive example of link traversal based query execution can be found in [13], in which we also compare our approach to other approaches that execute SPARQL queries over the Web of Data.

As can be seen from the outlined procedure, instead of evaluating a BGP query over a fixed set of RDF triples, our link traversal based approach executes BGPs over a dataset that is continuously augmented with descriptor objects retrieved during query execution itself. Formally, we define such a query-local dataset as follows:

Definition 2. A query-local dataset is a set of descriptor objects where for each two distinct descriptor objects D1 and D2 in the query-local dataset it must hold url(D1) ≠ url(D2).

3. REQUIREMENTS
In this section we discuss the requirements for a data structure that stores the query-local dataset. We first describe the core functional requirements, introduce important non-functional properties, and, finally, mention properties that are not necessary.

3.1 Core Functional Requirements
The data structure must store a query-local dataset and it must support the following four operations over the query-local dataset:

• find – This operation finds RDF triples in the query-local dataset that match a given triple pattern. Formally, an RDF triple t = (s, p, o) is a matching triple for a triple pattern (s̃, p̃, õ) in a query-local dataset D iff it holds³ t ∈ ⋃_{D′∈D} D′ and (s̃ ∉ V ⇒ s̃ = s) ∧ (p̃ ∉ V ⇒ p̃ = p) ∧ (õ ∉ V ⇒ õ = o). Due to this definition the find operation must report each matching triple only once, even if it occurs in multiple descriptor objects.

• add – This operation adds a given descriptor object to the query-local dataset. The query execution system performs this operation for each descriptor object that has been retrieved during the execution of a query.

• remove – The remove operation removes a descriptor object that was retrieved from a given URL from the query-local dataset. This operation is necessary to enable an invalidation strategy in query systems that (re-)use the query-local dataset for the execution of multiple queries. While such a reuse can be beneficial w.r.t. query execution costs and result completeness [11], it bears the risk of providing query results based on stale data. Thus, an invalidation strategy should identify potentially stale descriptor objects and remove them from the query-local dataset.

• replace – This operation replaces a specific descriptor object with a more recent version that has been retrieved from the same URL. This operation is useful for invalidation strategies that immediately retrieve new versions of (potentially) stale data.

³ For the union of descriptor objects we assume that no two descriptor objects share the same blank nodes. This requirement can be guaranteed by using a unique set of blank node identifiers for each descriptor object retrieved from the Web.

Due to the dynamic nature of link traversal based query execution it is likely that an implementation of this query approach makes use of multiple processes (or threads). For instance, our prototypical query system implements URI look-ups by asynchronous function calls and, thus, enables the processing of multiple look-up tasks in parallel [12]. Due to this parallelism multiple processes may access the query-local dataset concurrently. Therefore, the data structure for storing the query-local dataset has to enable the implementation of the aforementioned operations in a way that it supports concurrent access.
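The four core operations can be illustrated with a deliberately naive sketch (hypothetical Python; all names are ours, not taken from the authors' implementation): it stores each descriptor object under its retrieval URL and implements find with the duplicate elimination required above, but without any indexing or concurrency control.

```python
from typing import Dict, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class QueryLocalDataset:
    """Maps each retrieval URL to its descriptor object (a set of triples)."""

    def __init__(self) -> None:
        self._descriptors: Dict[str, Set[Triple]] = {}

    def add(self, url: str, descriptor: Set[Triple]) -> None:
        self._descriptors[url] = set(descriptor)

    def remove(self, url: str) -> None:
        self._descriptors.pop(url, None)

    def replace(self, url: str, new_descriptor: Set[Triple]) -> None:
        self._descriptors[url] = set(new_descriptor)

    def find(self,
             pattern: Tuple[Optional[str], Optional[str], Optional[str]]
             ) -> Iterator[Triple]:
        """Yield each matching triple once; None marks a variable position."""
        seen: Set[Triple] = set()
        for descriptor in self._descriptors.values():
            for t in descriptor:
                if t in seen:
                    continue  # already reported from another descriptor object
                if all(p is None or p == x for p, x in zip(pattern, t)):
                    seen.add(t)
                    yield t
```

A real implementation would replace the linear scan in find with the index structures of Section 5; this sketch only pins down the intended semantics of the four operations.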
In particular, we require the transactional property isolation, given that an execution of any of the four operations, find, add, remove and replace, can be understood as a transaction over the stored query-local dataset. Isolation requires that "events within a transaction must be hidden from other transactions running concurrently" [8]. In our case, this requirement means that the addition, replacement and removal of descriptor objects must not have side effects that cause an interference with any other add, remove and replace operation. Furthermore, each find operation must operate on a single, immutable state of the query-local dataset, where:

• only those RDF triples that are part of a descriptor object contained in the query-local dataset can be used for triple pattern matching (i.e. by the find operation), and

• all RDF triples that are part of a descriptor object contained in the query-local dataset can be used for triple pattern matching.

The isolation requirement also implies that:

• a descriptor object can only be used for triple pattern matching after it has been added completely,

• different descriptor objects which were retrieved from the same URL must not be used at the same time for triple pattern matching, and

• a descriptor object cannot be used for triple pattern matching when it has already been removed partially.

3.2 Non-Functional Requirements
The data structure that stores a query-local dataset has to support an efficient execution of queries. The query execution time of link traversal based query execution depends on multiple factors such as delays caused by looking up URIs, retrieving descriptor objects and adding them to local data structures, as well as the time to actually evaluate the query patterns over the query-local dataset. In order to reduce the latter two factors we require a data structure that supports an efficient implementation of the operations find, add and replace. In particular, the performance of the find operation should scale with the number of descriptor objects in the query-local dataset. Furthermore, the overall amount of main memory consumed by the data structure storing a query-local dataset should be as small as possible to allow for query-local datasets that contain a large number of descriptor objects.

3.3 Non-Relevant Properties
In addition to the aforementioned requirements a data structure that can hold Linked Data from the Web may have further properties, and other approaches to consume Linked Data may have different requirements. In the remainder of this section we point out potentially relevant properties that are not required in our scenario.

In our case it is not necessary to query specific descriptor objects individually: Intermediate solutions in link traversal based query execution are generated from matching RDF triples that may occur in any of the descriptor objects. For this reason, we understand the find operation to operate on the whole query-local dataset instead of individual descriptor objects. Therefore, it is also not necessary that an implementation of find distinguishes and reports the descriptor objects from which matching triples originate. However, the latter may become a requirement for systems that have to provide the provenance of each query result.

Since the execution of queries is a read-only process there is no need to write data back to the Web. It is not even necessary to modify retrieved descriptor objects within the query system.

Finally, our understanding of a transaction in this paper does not correspond to the traditional understanding according to which a whole query execution is a single transaction. In the case of link traversal based query execution we do not require transactional properties on that level. In fact, multiple queries executed in parallel may even mutually benefit from a shared and collectively augmented query-local dataset because descriptor objects discovered and retrieved for one query could also be useful to others.

4. RELATED WORK
Several data structures to store RDF data have been proposed in the literature. In this section we review them w.r.t. the requirements introduced in the previous section.

The majority of existing work focuses on RDF stores, which are DBMSs that are designed to persistently store large amounts of RDF data in secondary memory. Theoharis et al. provide a classification and comparison of earlier approaches that make use of relational databases [21]. More recent work proposes to partition the data vertically by grouping RDF triples with the same property into dedicated tables and to store these tables in a column-oriented DBMS [1], an approach that is suitable for data with a moderate number of properties [19]. An alternative to relational storage schemes is the development of native, disk-based data structures for RDF data (e.g. [20, 15, 23]). From the work in this area the most relevant in our context is the index structure presented by Harth and Decker [9]. The authors associate RDF triples (s, p, o) with a context c and represent such a "triple in context" by a quad, that is, a 4-tuple (s, p, o, c). The proposed index structure comprises a lexicon and 6 quad indexes. The lexicon provides a mapping from RDF terms to identifiers in order to save space and improve processing time. The quad indexes are implemented using B+ trees and contain quads encoded by the identifiers of their elements. While all 6 quad indexes contain all indexed quads, each index uses different elements of the quads as index keys such that these 6 quad indexes cover all kinds of quad based queries, leveraging the fact that B+ trees support prefix queries. While this data structure is disk-based, its design influenced the main memory data structures that we investigate in this paper.

In contrast to RDF stores, the majority of existing software frameworks for RDF, such as Jena⁴ and Sesame⁵, also provide an in-memory data structure for RDF data. In our experiments we empirically compare some of these in-memory models with the data structures discussed in this paper (cf. Section 6). However, to the best of our knowledge, only a few publications exist that explicitly introduce data structures to manage RDF data in main memory:

⁴ http://openjena.org/
⁵ http://openrdf.org/

Atre et al. introduce BitMat [2], a compressed bit-matrix structure that enables processing of BGP queries on very large sets of RDF triples completely in main memory. Binna et al. pursue the same goal with a main memory RDF store, called SpiderStore [5]. SpiderStore applies a graph-based storage layout that represents an RDF graph (i.e. a graph presentation of a set of RDF triples) natively. Based on this representation, Binna et al. propose a depth-first graph traversal approach to evaluate BGP queries. Since the applied graph-based storage layout is optimized for graph traversal operations, query execution in SpiderStore is very efficient. However, due to the focus on storing a single RDF graph and executing complete BGP queries, both approaches, BitMat and SpiderStore, are unsuitable in our context; we require a data structure that is optimized for an efficient evaluation of single triple patterns over multiple RDF graphs.

Oren et al. propose the application of evolutionary algorithms to execute queries over RDF data [16]. The authors represent the data in main memory using Bloom filters. While this approach allows the query system to process large amounts of data, the proposed query execution method also focuses on an evaluation of complete BGPs instead of triple pattern queries. Furthermore, the approach provides for approximate query answers only.

Janik and Kochut present BRAHMS, a main memory-based storage system for RDF [14]. BRAHMS is based on a read-only data structure which comprises multiple hash tables, similar to the index structure that we use. However, BRAHMS requires these hash tables to enable an efficient retrieval of node neighborhoods in RDF graphs in order to support path queries for semantic association discovery algorithms. Another approach that focuses on path queries over RDF graphs has been presented by Udrea et al. [22], who propose a graph based index called GRIN. While both approaches, BRAHMS and GRIN, provide for an efficient query execution, they are unsuitable in our context, due to their focus on path queries.

5. STORING QUERY-LOCAL DATASETS
In this paper we investigate three main memory data structures that satisfy the requirements as introduced in Section 3. All three data structures are based on the same index for RDF data. This index enables an efficient execution of triple pattern queries (i.e. find operations). In this section we introduce the index and describe the three data structures.

5.1 Indexing RDF Data
The index that we introduce in the following stores a set of RDF triples; it comprises a dictionary, a triple list, and 6 hash tables.

The dictionary provides a two-way mapping between RDF terms and numerical identifiers for these terms. Using term identifiers allows for a more efficient query execution because it is faster to process and compare numbers than strings or any other kind of object representation. Formally, we represent the dictionary by two bijective functions: id : (U ∪ B ∪ L) → I and term : I → (U ∪ B ∪ L) where I denotes the set of numerical term identifiers; id and term are inverse to each other. Based on the term identifiers we introduce the notion of an ID-encoded triple, that is, a 3-tuple containing three identifiers of RDF terms. To denote the ID-encoded triple that corresponds to an RDF triple t = (s, p, o) we write enc(t); i.e. enc(t) = (id(s), id(p), id(o)); similarly, dec(t̄) denotes the RDF triple corresponding to an ID-encoded triple t̄ = (s̄, p̄, ō); i.e. dec(t̄) = (term(s̄), term(p̄), term(ō)).

In addition to the dictionary, the index contains a triple list and 6 hash tables, denoted as HS, HP, HO, HSP, HSO and HPO. Each of these hash tables consists of n different hash buckets. To denote the i-th bucket in hash table HX we write HX[i]. These buckets store references to ID-encoded triples; all triples referenced in the buckets are stored in the triple list of the index. The hash function hX : I × I × I → [1, n] of each of the 6 hash tables associates any ID-encoded triple t̄ with the bucket which may contain a reference to t̄ in that hash table.

The index contains 6 hash tables to support the different access patterns based on which the query execution system may try to find RDF triples that match a triple pattern. These access patterns correspond to the 8 different kinds of triple patterns (cf. Table 1, in which '?' denotes that the corresponding element of the triple pattern is a query variable and '!' denotes that the element is an RDF term). To support these access patterns each of the 6 hash tables has to contain references to all ID-encoded triples of the RDF data to be indexed; furthermore, the corresponding hash functions, hS, hP, hO, hSP, hSO and hPO, have to satisfy the following requirement: For each pair of ID-encoded triples (s̄1, p̄1, ō1) and (s̄2, p̄2, ō2) it holds:

s̄1 = s̄2 ⇒ hS(s̄1, p̄1, ō1) = hS(s̄2, p̄2, ō2)
p̄1 = p̄2 ⇒ hP(s̄1, p̄1, ō1) = hP(s̄2, p̄2, ō2)
ō1 = ō2 ⇒ hO(s̄1, p̄1, ō1) = hO(s̄2, p̄2, ō2)
s̄1 = s̄2 ∧ p̄1 = p̄2 ⇒ hSP(s̄1, p̄1, ō1) = hSP(s̄2, p̄2, ō2)
s̄1 = s̄2 ∧ ō1 = ō2 ⇒ hSO(s̄1, p̄1, ō1) = hSO(s̄2, p̄2, ō2)
p̄1 = p̄2 ∧ ō1 = ō2 ⇒ hPO(s̄1, p̄1, ō1) = hPO(s̄2, p̄2, ō2)

Due to this requirement we can expect to find all triples that may match a triple pattern (s, p, o) of type (!, ?, ?) from searching through the references in the hS(id(s), id(p), id(o))-th bucket of hash table HS; i.e. in bucket HS[hS(id(s), id(p), id(o))]. Similarly for triple patterns of type (?, !, ?), (?, ?, !), (!, !, ?), (!, ?, !) and (?, !, !). However, since we do not assume collision-free hash functions, the relevant bucket may also contain references to triples that do not match the corresponding triple pattern. Hence, each triple referenced in a relevant bucket must still be checked. For triple patterns of type (!, !, !) we may access any one of the hash tables; for instance, bucket HSO[hSO(id(s), id(p), id(o))]; and for triple patterns of type (?, ?, ?) we have to inspect all n buckets in one of the hash tables. Table 1 summarizes what buckets are relevant for which kind of triple patterns.

Table 1: Relevant hash buckets for triple patterns (s, p, o) of different types.
Pattern Type | Relevant Bucket(s)
(!, ?, ?) | HS[hS(id(s), id(p), id(o))]
(?, !, ?) | HP[hP(id(s), id(p), id(o))]
(?, ?, !) | HO[hO(id(s), id(p), id(o))]
(!, !, ?) | HSP[hSP(id(s), id(p), id(o))]
(!, ?, !) | HSO[hSO(id(s), id(p), id(o))]
(?, !, !) | HPO[hPO(id(s), id(p), id(o))]
(!, !, !) | e.g. HSO[hSO(id(s), id(p), id(o))]
(?, ?, ?) | e.g. ⋃_{i=1}^{n} HS[i] (all n buckets of HS)

Based on the presented index we propose three strategies for storing a query-local dataset: individual indexing, combined indexing, and quad indexing.

5.2 Individual Indexing
The main idea of the individually indexed representation (IndIR) of the query-local dataset is to keep the triples of each descriptor object in a separate index. Due to this separation we need an additional mapping idx that associates each URL from which a descriptor object was retrieved with the index that contains the triples of that descriptor object. To reduce the amount of required memory all indexes share a common dictionary⁶.

⁶ We assume that the dictionary always assigns different identifiers to the blank nodes in different descriptor objects.

IndIR allows for a straightforward implementation of the four main operations add, remove, replace and find: Adding a newly retrieved descriptor object D to the query-local dataset is a matter of creating a new index I_D, indexing all triples t ∈ D in I_D, and adding the association idx(url(D)) = I_D.
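The index of Section 5.1 — a dictionary, a triple list, and six hash tables whose hash functions depend only on their key positions — might be sketched as follows (hypothetical Python; the names, the bucket count n = 64, and the use of Python's built-in hash are our assumptions, not the authors' implementation). IndIR would instantiate one such index per descriptor object.

```python
from typing import Dict, Iterator, List, Optional, Tuple

N_BUCKETS = 64  # n, chosen arbitrarily for this sketch

# Which positions of (s, p, o) each of the six tables uses as its key.
KEY_POSITIONS = {
    "S": (0,), "P": (1,), "O": (2,),
    "SP": (0, 1), "SO": (0, 2), "PO": (1, 2),
}


class TripleIndex:
    """Dictionary + triple list + six hash tables (HS, HP, ..., HPO)."""

    def __init__(self) -> None:
        self.ids: Dict[str, int] = {}   # term -> identifier (the id function)
        self.terms: List[str] = []      # identifier -> term (the term function)
        self.triples: List[Tuple[int, int, int]] = []  # the triple list
        self.tables: Dict[str, List[List[int]]] = {
            name: [[] for _ in range(N_BUCKETS)] for name in KEY_POSITIONS
        }

    def _encode(self, term: str) -> int:
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    @staticmethod
    def _bucket(key_positions: Tuple[int, ...], enc: Tuple[int, ...]) -> int:
        # hX depends only on the key positions, so triples that agree on
        # those positions land in the same bucket, as Section 5.1 requires.
        return hash(tuple(enc[i] for i in key_positions)) % N_BUCKETS

    def add(self, s: str, p: str, o: str) -> None:
        enc = (self._encode(s), self._encode(p), self._encode(o))
        ref = len(self.triples)
        self.triples.append(enc)
        for name, pos in KEY_POSITIONS.items():
            self.tables[name][self._bucket(pos, enc)].append(ref)

    def find(self, s: Optional[str], p: Optional[str],
             o: Optional[str]) -> Iterator[Tuple[str, str, str]]:
        """Evaluate one triple pattern; None marks a variable position."""
        pattern = (s, p, o)
        if any(x is not None and x not in self.ids for x in pattern):
            return  # a bound term missing from the dictionary cannot match
        enc = tuple(self.ids[x] if x is not None else -1 for x in pattern)
        bound = [i for i, x in enumerate(pattern) if x is not None]
        if not bound:                 # (?,?,?): inspect all n buckets of HS
            buckets = self.tables["S"]
        else:                         # (!,!,!) may use any table, e.g. HSO
            name = "SO" if len(bound) == 3 else next(
                n for n, pos in KEY_POSITIONS.items() if list(pos) == bound)
            buckets = [self.tables[name][self._bucket(KEY_POSITIONS[name], enc)]]
        for bucket in buckets:
            for ref in bucket:        # hash collisions: re-check each triple
                t = self.triples[ref]
                if all(i not in bound or t[i] == enc[i] for i in range(3)):
                    yield tuple(self.terms[x] for x in t)
```

The bucket selection mirrors Table 1: fully bound patterns are answered from HSO, fully unbound patterns scan all buckets of HS, and every other pattern type uses the table keyed on exactly its bound positions.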
After this add op- patterns based on which the query execution system may try to find eration finished successfully, it is not necessary to keep the added RDF triples that match a triple pattern. These access patterns corre- descriptor object anymore because it is completely represented by spond to the 8 different kinds of triple patterns (cf. Table 1 in which the index. To remove the descriptor object that was retrieved from ’?’ denotes that the corresponding element of the triple pattern is a query variable and ’!’ denotes the element is an RDF term). To sup- 6 We assume that the dictionary always assigns different identifiers port these access patterns each of the 6 hash tables has to contain to the blank nodes in different descriptor objects. URL u from the query-local dataset, the association idx(u) = ID • BeingRemoved – This status indicates that the descriptor ob- has to be removed and ID has to be deleted. Replacing an old de- ject is currently being removed from the index. scriptor object that was retrieved from URL u with a new descriptor The mapping cur associates the URL of each descriptor object object Dnew , retrieved from the same URL (i.e. url(Dnew ) = u), which is currently part of the query-local dataset with the identi- requires creating an index Inew , adding all t ∈ Dnew to Inew , fier that refers to that descriptor object. For each URL u that is atomically changing the association idx(u) = Iold to idx(u) = mapped by cur it must hold: status(cur(u)) = Indexed. Inew , and deleting Iold . To find matching triples in the query-local Using the combined index and the mappings src, status and dataset D, we have to inspect the relevant hash bucket in each index cur we can implement the required operations as follows: To find ID ∈ {idx(url(D0 )) | D0 ∈ D}. 
However, since each matching triple that occurs in multiple descriptor objects must be reported only once, we have to record all reported triples in a temporary set; if we find a recorded triple again in the index of another descriptor object, we omit reporting it again.

These implementations of the modification operations add, remove and replace guarantee our isolation requirement naturally, as long as the modifications of idx are performed atomically. To guarantee the isolation requirement for the implementation of the find operation it is necessary to use a temporary, immutable copy of idx instead of iterating over the shared idx that might be modified by concurrent operations.

While the separation of the descriptor objects into individual indexes allows for a straightforward implementation, it introduces the potential for performance issues during the execution of the find operation. The need to access |D| different indexes and to iterate over multiple hash buckets may have a significant impact on the execution time, in particular for query-local datasets that contain a large number of descriptor objects. The combined indexing strategy addresses this issue.

5.3 Combined Indexing

The combined index representation (CombIR) of query-local datasets uses only a single index that contains the triples of all descriptor objects in its 6 hash tables. For CombIR the find operation can be implemented very efficiently: We only have to search through the references in the relevant hash bucket(s) of our single index.

Implementing the other three operations for CombIR is not as straightforward as it is for IndIR: Just adding the RDF triples from each descriptor object to the combined index would make it impossible to remove or replace a specific descriptor object because it would be unclear which of the indexed triples are part of it and have to be removed from the index. To enable the implementation of add, remove and replace in a way that guarantees our isolation requirement and that satisfies the efficiency requirements we adopt the idea of a multiversion database system [4]. In particular, we propose to use three additional mappings: src, status and cur. The mapping src associates each ID-encoded triple in the index with a set of unique identifiers that refer to the descriptor objects from which the triple originates. The mapping status associates each of these identifiers with the current indexing status of the corresponding descriptor object. We distinguish the following indexing statuses:

• BeingIndexed – This status indicates that the descriptor object is currently being added to the index.

• Indexed – This status indicates that the descriptor object is completely contained in the index and can be used by the find operation.

• ToBeRemoved – This status indicates that the descriptor object is completely contained in the index but it cannot be used by the find operation because the descriptor object has been removed from the query-local dataset or the query-local dataset already contains a more recent version of the descriptor object.

To find matching triples in the query-local dataset D, we have to search through the references in the relevant hash bucket of our combined index, but we ignore all ID-encoded triples t̄ for which it does not hold: ∃i ∈ src(t̄) : status(i) = Indexed.

To add a descriptor object D to the query-local dataset we, first, have to create a unique identifier i for it and add the association status(i) = BeingIndexed. For each RDF triple t ∈ D we have to find enc(t) in the index. If enc(t) is not in the index, we add the association src(enc(t)) = {i} and insert enc(t) in the index; otherwise, we just add i to src(enc(t)). Finally, we change status(i) = BeingIndexed to status(i) = Indexed and add the mapping cur(url(D)) = i.

To remove the descriptor object that was retrieved from URL u from the query-local dataset, we change status(cur(u)) = Indexed to status(cur(u)) = ToBeRemoved and remove the association cur(u) = i. After these simple changes we can understand the descriptor object to be removed because it is ignored by find operations, even if its data is still present in our index. To delete this data completely we propose an additional operation clear that an asynchronous process executes regularly. This operation, first, changes all associations status(i) = ToBeRemoved to status(i) = BeingRemoved. Next, it searches all buckets from hash table HS in our combined index for ID-encoded triples t̄ with ∀i ∈ src(t̄) : status(i) = BeingRemoved. It removes the references to these triples from all 6 hash tables and deletes the corresponding associations from mapping src. Finally, it removes all associations status(i) = BeingRemoved.

Replacing an old descriptor object that was retrieved from URL u with a new descriptor object Dnew corresponds to such an execution of add for Dnew that does not modify the mapping cur in the end. Instead, we complete the replacement by changing status(cur(u)) = Indexed to status(cur(u)) = ToBeRemoved and cur(u) = iold to cur(u) = inew, where identifier inew refers to Dnew.

Due to the association of descriptor objects with indexing statuses we ensure the isolation requirement for the modification operations add, remove and replace. To guarantee isolation for the find operation it is necessary to atomically create a temporary copy of relevant buckets and inspect these immutable snapshots instead of iterating over the original buckets that might be modified by concurrent operations.

5.4 Quad Indexing

For CombIR we propose to use the mapping src to keep track of the origins of the triples in the combined index. An alternative strategy is the application of the concept of quads, or "triples in context" as Harth and Decker call them [9]. The combined quad index representation (CombQuadIR) implements this idea by using a slightly modified version of the index structure that we present in Section 5.1. Instead of the triple list, this modified version comprises a list of quads which we formally represent as pairs (t̄, i) where t̄ is an ID-encoded triple and i is an identifier for a descriptor object. Consequently, the 6 hash tables in the modified version of the index structure do not keep references to ID-encoded triples but to the quads in the quad list. However, the hash functions of the index operate on the triple part of the quads only, exactly as described for the original version of the index structure. Hence, the relevancy of hash buckets for triple patterns as specified in Table 1 still holds.

While the use of quads obsoletes the mapping src, CombQuadIR still requires the mappings status and cur that we introduced for CombIR. The implementation of the four required operations for the quad index is also similar to their implementation for CombIR: To add a descriptor object D to the query-local dataset we create a unique identifier i and add the association status(i) = BeingIndexed. For each RDF triple t ∈ D we insert a new quad (enc(t), i) into the index. Finally, we change status(i) = BeingIndexed to status(i) = Indexed and add the mapping cur(url(D)) = i. To remove or replace descriptor objects we can use the same implementations as introduced for CombIR.

Table 2: Statistics of query-local datasets for our experiments.

q.-local dataset   BSBM scaling factor   Number of descriptor objects   Overall number of RDF triples
D50                 50                    2,599                          22,616
D100                100                   4,178                          40,133
D150                150                   5,756                          57,524
D200                200                   7,329                          75,062
D250                250                   9,873                          97,613
D300                300                   11,455                         115,217
D350                350                   13,954                         137,567
D500                500                   18,687                         190,502
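The bookkeeping behind these operations can be sketched in a few lines. The following is a minimal, single-table illustration of the CombIR idea: it keeps only the subject-keyed hash table HS, models ID-encoded triples as three longs, and implements add, remove, and a status-filtering find. All class and member names are ours, not those of the SQUIN implementation, and concurrency control as well as the asynchronous clear operation are omitted.

```java
import java.util.*;

/** Minimal sketch of the CombIR bookkeeping from Section 5.3 (names are ours). */
class CombIRSketch {
    enum Status { BEING_INDEXED, INDEXED, TO_BE_REMOVED, BEING_REMOVED }

    static final class Triple {
        final long s, p, o;
        Triple(long s, long p, long o) { this.s = s; this.p = p; this.o = o; }
        @Override public boolean equals(Object x) {
            if (!(x instanceof Triple)) return false;
            Triple t = (Triple) x;
            return s == t.s && p == t.p && o == t.o;
        }
        @Override public int hashCode() { return Objects.hash(s, p, o); }
    }

    final Map<Triple, Set<Integer>> src = new HashMap<>();  // triple -> descriptor ids
    final Map<Integer, Status> status = new HashMap<>();    // descriptor id -> status
    final Map<String, Integer> cur = new HashMap<>();       // URL -> current descriptor id
    final Map<Long, List<Triple>> hS = new HashMap<>();     // bucket table keyed by subject id
    private int nextId = 0;

    /** add: index all triples of the descriptor object retrieved from url. */
    int add(String url, List<Triple> descriptorObject) {
        int i = nextId++;
        status.put(i, Status.BEING_INDEXED);
        for (Triple t : descriptorObject) {
            Set<Integer> origins = src.get(t);
            if (origins == null) {                          // triple not yet indexed
                src.put(t, new HashSet<>(Set.of(i)));
                hS.computeIfAbsent(t.s, k -> new ArrayList<>()).add(t);
            } else {
                origins.add(i);                             // just record the new origin
            }
        }
        status.put(i, Status.INDEXED);                      // now visible to find
        cur.put(url, i);
        return i;
    }

    /** remove: only mark the current descriptor object of url; clear() would delete later. */
    void remove(String url) {
        Integer i = cur.remove(url);
        if (i != null) status.put(i, Status.TO_BE_REMOVED);
    }

    /** find for a pattern (s, ?, ?): skip triples that have no INDEXED origin. */
    List<Triple> findBySubject(long s) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : hS.getOrDefault(s, List.of())) {
            boolean visible = false;
            for (int i : src.get(t))
                if (status.get(i) == Status.INDEXED) { visible = true; break; }
            if (visible) result.add(t);
        }
        return result;
    }
}
```

Note how the sketch mirrors the text: a triple shared by several descriptor objects is stored once, removal is a pure status change, and find filters on ∃i ∈ src(t̄) : status(i) = Indexed.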
However, the asynchronously executed clear operation has to be adjusted: After changing all associations status(i) = ToBeRemoved to status(i) = BeingRemoved it goes through all quads in the quad list of the index. For each quad q = (t̄, i) with status(i) = BeingRemoved it removes the references to q from all 6 hash tables and deletes q from the quad list. Finally, it removes all associations status(i) = BeingRemoved.

For the find operation we have to search through the references in the relevant hash bucket of the combined quad index, but we ignore all quads (t̄, i) for which status(i) ≠ Indexed. To ensure that we report each matching triple only once, even if it occurs in multiple descriptor objects and, thus, is part of multiple quads, we apply the same strategy as we use for the find operation of IndIR: We record reported triples temporarily and do not report them again if they are part of another quad referenced by the relevant hash bucket.

For the same reasons as discussed for CombIR, it is possible to guarantee that an implementation of add, remove, replace and find for CombQuadIR satisfies our isolation requirement.

6. EVALUATION

In this section we analyze and compare IndIR, CombIR, and CombQuadIR empirically. We describe the experimental setup, determine suitable parameters for the indexes that we use in these data structures, and compare the data structures with each other as well as with similar data structures available in existing systems.

6.1 Experimental Environment

We implemented the data structures and the corresponding operations as introduced in this paper. For the indexes we use hash tables with a number of n buckets where n = 2^m for some exponent m, and hash functions that return the m least significant bits of the term identifiers at specific positions in ID-encoded triples7:

hS(s̄, p̄, ō) ≡ s̄ ⊕ bitmask[m]
hP(s̄, p̄, ō) ≡ p̄ ⊕ bitmask[m]
hO(s̄, p̄, ō) ≡ ō ⊕ bitmask[m]
hSP(s̄, p̄, ō) ≡ (s̄ · p̄) ⊕ bitmask[m]
hSO(s̄, p̄, ō) ≡ (s̄ · ō) ⊕ bitmask[m]
hPO(s̄, p̄, ō) ≡ (p̄ · ō) ⊕ bitmask[m]

We use these hash functions because they can be calculated very efficiently. Furthermore, the calculated hash values are distributed sufficiently because our dictionary implementation does not calculate term identifiers based on some representation of the RDF terms; instead, we use a counter that provides a new identifier whenever the id function is called for an unknown RDF term.

Our implementation is written in Java and available as Free Software as part of SQUIN8, our link traversal based query execution system. To evaluate the query performance provided by the three data structures we used the query engine in SQUIN. However, for most experiments we disabled the traversal of data links during the query execution and used pre-populated query-local datasets instead. This strategy allowed us to avoid measuring the effects of URI look-ups and data retrieval.

For our experiments we adopt the Berlin SPARQL Benchmark (BSBM) [6]. The BSBM executes different mixes of 25 SPARQL queries over synthetic RDF data. This data describes entities in a distributed e-commerce scenario, including different producers and vendors, the offered products, and reviews from multiple reviewing systems. The data is generated automatically and it is possible to use datasets that vary in the number of described entities and, thus, the amount of RDF triples. We use these BSBM datasets to prepare multiple query-local datasets of different sizes for our experiments: We understand each dataset generated for BSBM as the union of multiple datasets that could be exposed as Linked Data on the Web. Thus, looking up the URI of any of the described entities would result in a descriptor object which contains only these RDF triples from the BSBM dataset that describe the corresponding entity. To generate a query-local dataset that contains all these descriptor objects we query a given BSBM dataset with SPARQL DESCRIBE queries for all relevant entities in that dataset; the result of each DESCRIBE query becomes a descriptor object in the query-local dataset that we generate9. Table 2 describes the query-local datasets that we prepared for the experiments. These datasets are typical for the query-local datasets that SQUIN usually processes.

In addition to merely comparing our implementations of IndIR, CombIR, and CombQuadIR among each other, we also compare these implementations with existing data structures that can be used to store a query-local dataset: the main memory implementation of the NamedGraphSet interface in NG4J10 and the main memory implementation of the DatasetGraph interface in ARQ11, the query engine for the Jena framework. Both implementations store each descriptor object in a separate data structure, similar to our individually indexed representation IndIR. However, while we use indexes with ID-encoded triples, NG4J and ARQ use an implementation of the Jena Graph interface that represents RDF terms as Java objects.

7 With ⊕ we denote the bitwise AND operator and bitmask[m] is a bitmask in which the m least significant bits are set.
8 http://squin.org
9 For a more detailed description and the code of the corresponding extension for the BSBM data generator we refer to a blog post at http://sourceforge.net/apps/wordpress/squin/2009/04/15/a-data-generator-for-bsbm-that-provides-linked-data-characteristics/
10 http://www4.wiwiss.fu-berlin.de/bizer/ng4j/
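Since term identifiers are plain integers, each of the six hash functions reduces to (at most) one multiplication and one bitwise AND. The following sketch illustrates this; the class and method names are ours, not those of the SQUIN code base:

```java
/** Sketch of the six bucket hash functions from Section 6.1: each returns the
 *  m least significant bits of one term identifier, or of the product of two
 *  identifiers. The paper's ⊕ is defined as bitwise AND (footnote 7), and
 *  bitmask[m] is the mask with the m least significant bits set. */
class TripleHashes {
    final long bitmask;                        // bitmask[m]

    TripleHashes(int m) { this.bitmask = (1L << m) - 1; }

    long hS(long s, long p, long o)  { return s & bitmask; }
    long hP(long s, long p, long o)  { return p & bitmask; }
    long hO(long s, long p, long o)  { return o & bitmask; }
    long hSP(long s, long p, long o) { return (s * p) & bitmask; }
    long hSO(long s, long p, long o) { return (s * o) & bitmask; }
    long hPO(long s, long p, long o) { return (p * o) & bitmask; }
}
```

For example, with m = 4 the bucket number is simply the low four bits of the relevant identifier(s), so hS maps subject id 18 (binary 10010) to bucket 2.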
11 http://openjena.org/ARQ/

Figure 1: Measurements for combined index representations of different query-local datasets with varying m (from 1 to 20): (a) average creation times for representations of D50, D250 and D500, (b) average number of triples per hash bucket in HS after loading D200, (c) estimated memory consumed by the representations, and (d) average time to execute BSBM query mixes over the representations.

For the experiments we used the latest releases of Jena (v.2.6.4) and ARQ (v.2.8.7) and a recent version of NG4J (CVS check-out from Jan. 20, 2011).

We conducted all experiments on an Intel Core 2 Duo T7200 processor with 2 GHz, 4 MB L2 cache, and 2 GB main memory. This machine was connected through the university LAN and it runs a recent 32 bit version of Gentoo Linux with Sun Java 1.6.0. All software used for this evaluation, including the scripts to run the experiments, as well as all our measurements can be found on the Website for this paper12.

6.2 Tuning the Combined Index

Before we compared IndIR, CombIR, and CombQuadIR we had to determine suitable parameters for the indexes that we use in these data structures. In the following we discuss the reasons for this need and describe the analysis based on which we tuned the indexes.

The number of buckets per hash table has an impact on the performance of find operations and, thus, query execution times, because a higher number corresponds to fewer elements per bucket and it is faster to search through a smaller number of potentially matching triples. Furthermore, the number of buckets may also affect the amount of main memory required for the indexes if the implementation of the buckets always allocates some spare memory so that buckets are never filled to their full capacity. For these reasons it is necessary to decide on a number of buckets that is suitable for the typical amount of RDF data which has to be stored in the hash tables. Since this amount is different for the individual indexes and the combined index, we require a different number of buckets for these indexes. As we describe in Section 5.1, the hash tables in our index structure contain n = 2^m buckets. Thus, to provide for a fair comparison of IndIR, CombIR, and CombQuadIR we have to find suitable exponents m for the indexes that we use in these data structures. In this paper we focus on tuning CombIR. However, we conducted similar analyses for the other two data structures.

As outlined before, selecting a suitable number of buckets (or m) is a trade-off between query performance and required memory. To find a suitable m for the combined index we used different m, varying from 1 to 20, to create different combined index representations of the aforementioned query-local datasets (cf. Table 2). For each of these representations we measure the creation time, the required memory, and the query performance.

To determine the creation times we loaded our query-local datasets into an ARQ DatasetGraph and measured the time to create combined index representations from it. For each m we followed this procedure 5 times and took the mean of the results. The chart in Figure 1(a) illustrates these measurements for the different combined index representations of the query-local datasets D50, D250 and D500. As can be seen from this chart, the creation times are small for representations with a high m (e.g. about 6.3 s and 8.8 s for the m = 20 representations of D50 and D500, respectively) and they increase significantly with a decreasing m. The reason for this behavior is the number of elements that have to be added per hash bucket: For a high m we have a large number of buckets in the hash tables so that each bucket is only relevant for a few triples, as Figure 1(b) illustrates. In this case, the time to encode all RDF triples, find the relevant hash buckets for each, and add the corresponding references into these buckets is roughly constant, independent of m. However, for a small number of buckets per hash table (i.e. for a small m), adding the references to the buckets is more expensive and impossible in constant time because each bucket is relevant for a much greater number of triples (cf. Figure 1(b)). Thus, for a decreasing m we reach a point where the time to fill the buckets dominates the overall creation time.

To measure the required memory we applied a method for Java object profiling as proposed in [18]. Using this method we estimated the amount of main memory consumed by the different combined index representations of the query-local datasets. Figure 1(c) illustrates the results for D50, D100, D150, D200, D350 and D500; while the x-axis denotes the exponents m, the y-axis denotes the estimated amount of memory (in MB) consumed by the different representations. For each dataset we note that representations with a smaller number of buckets per hash table (i.e. a smaller m) require roughly the same amount of memory (e.g. about 12.5 MB for D50 and about 95.3 MB for D500). For representations with an m greater than 10 the required memory starts to increase exponentially. We explain this behavior as follows: For smaller m the required memory is dominated by a constant amount of memory required for the dictionary and the ID-encoded triples, which is independent of m for each dataset.
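The trade-off can be made concrete with a back-of-the-envelope calculation: under the (idealized) assumption of a uniformly distributing hash function, a table with n = 2^m buckets holds about |T| / 2^m triples per bucket. The following small sketch (names are ours) prints this estimate for the 75,062 triples of dataset D200 from Table 2:

```java
/** Back-of-the-envelope illustration of the bucket trade-off in Section 6.2:
 *  with n = 2^m buckets and |T| indexed triples, a uniformly distributing
 *  hash function yields about |T| / 2^m triples per bucket. */
class BucketLoad {
    static double avgTriplesPerBucket(long triples, int m) {
        return (double) triples / (1L << m);   // |T| / 2^m
    }

    public static void main(String[] args) {
        long d200 = 75_062;                    // |T| for dataset D200 (Table 2)
        for (int m : new int[] { 1, 6, 12, 20 })
            System.out.printf("m = %2d -> %10.2f triples per bucket%n",
                              m, avgTriplesPerBucket(d200, m));
    }
}
```

For D200 this gives tens of thousands of triples per bucket at m = 1 but fewer than twenty at m = 12, which matches the qualitative shape of the curves in Figures 1(a) and 1(b).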
At some point, m > 10 in our experiments, this constant amount is largely exceeded by the memory required for the hash buckets that are not filled to their capacity. The exponential growth can be attributed to the exponential correlation between the number of buckets per hash table and m.

To measure the query performance provided by CombIR initialized with different m we used the different combined index representations of each of our query-local datasets and executed identical BSBM (V2.0) query mixes over them. Hence, for each representation we ran the BSBM query mix 13 times where the first 3 runs were for warm up and are not considered for the measurements. For the actual execution of queries we used the query engine in SQUIN. However, we disabled the look-up of URIs for this experiment in order to actually measure the efficiency of the find operations and to avoid distortions caused by the effects of network access, data retrieval, etc. Figure 1(d) depicts the average times to execute the query mixes for CombIR with different m. For all query-local datasets, the execution times are small for a high m (e.g. 0.28 s and 0.44 s for D50 and D500 with m = 15, respectively) and they increase significantly with a decreasing m. Similarly to the behavior of the creation time, this observation can be attributed to the notable differences in the average number of triples per hash bucket: While the time to find relevant buckets is constant, searching through the entries in these buckets and checking them w.r.t. a given triple pattern gets more expensive for a small number of buckets that reference more triples each.

Based on the results of these experiments we identified a value of 12 for m as the best trade-off between required memory and query performance (incl. creation time) for CombIR. Note, for applications that operate on much smaller or much larger query-local datasets than those that we used for the experiments, we suggest finding a value for m that is more suitable in these scenarios. However, in the remainder of this paper we use m = 12 for the combined index. To tune CombQuadIR accordingly we conducted the same experiments and found that m = 12 is also the best choice for the quad index. Similarly, we analyzed IndIR. According to this analysis13, m = 4 is most suitable for the individual indexes that contain a single descriptor object each.

12 http://squin.org/experiments/IndexStructuresLDOW2011/

Figure 2: Measurements for different data structures storing the query-local datasets Dpc from Table 2: (a) estimated memory consumed by the data structures, (b) average time to execute BSBM query mixes over the data structures, and (c) average creation times for the data structures.

6.3 Comparing the Data Structures

We use our implementations of IndIR, CombIR, and CombQuadIR to empirically compare the three approaches for storing a query-local dataset. Furthermore, we also compare our implementations with the main memory implementation of similar data structures in NG4J and ARQ as introduced before (cf. Section 6.1). For our comparison we conducted the same kind of experiments as in the previous section: We measured the required memory, the query execution time, and the creation time. Figures 2(a) to 2(c) illustrate the results of these experiments. In the following we discuss the results of these experiments.

The chart in Figure 2(a) illustrates the estimated amount of memory that was required to store the query-local datasets D50 to D500 using the different data structures. As can be seen from this chart, CombQuadIR required less memory than the other data structures, of which CombIR required slightly less than ARQ, NG4J and IndIR. We attribute the latter observation to the fact that ARQ, NG4J and IndIR have to allocate memory for Java objects that represent the containers of the separated descriptor objects. For instance, our implementation of IndIR has to provide a separate Java object for each of the individual indexes, whereas CombIR and CombQuadIR require only one of these objects. The other observation is the advantage of CombQuadIR over CombIR w.r.t. memory consumption. This observation shows that storing the src associations for each indexed triple requires more space in our implementation than storing multiple quads that contain the same triple.

While the amount of memory required for the different data structures was at least in the same order of magnitude, the provided query performance differed significantly: In comparison to ARQ, NG4J and IndIR, we measured almost constant query mix execution times for CombIR and CombQuadIR (cf. Figure 2(b)). Even if these times increase slightly with the size of the query-local dataset (e.g. from an average of 0.3 seconds to execute a BSBM query mix for CombIR of D50 to an average of 1.3 s for D500), we consider this increase insignificant compared to the increase we measured for the other data structures. The reason for these remarkable differences is the number of indexes that have to be accessed during the execution of find operations: While it is only a single index for CombIR and CombQuadIR, independent of the number of descriptor objects in the query-local dataset, IndIR needs as many index accesses as there are descriptor objects contained in the query-local dataset; similarly for NG4J and ARQ. Figure 2(b) also illustrates that ARQ performed very badly (for the larger datasets query execution aborts with a stack overflow error) and that IndIR was slightly better than the NG4J data structure.
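The cost difference can be illustrated with a sketch of an IndIR-style find: the pattern is probed against one index per descriptor object, and a temporary set suppresses triples reported by more than one of them (the strategy from Section 5.2). All names are ours; this is not the SQUIN code:

```java
import java.util.*;

/** Sketch of why IndIR-style find cost grows with the number of descriptor
 *  objects: every individual index is probed, while CombIR/CombQuadIR probe
 *  a single index regardless of how many descriptor objects are stored. */
class IndIRFind {
    /** Each map is one individual index: subject id -> triples (long[3]). */
    static List<long[]> findBySubject(List<Map<Long, List<long[]>>> indexes, long s) {
        Set<List<Long>> reported = new HashSet<>();    // dedup across indexes
        List<long[]> result = new ArrayList<>();
        for (Map<Long, List<long[]>> idx : indexes) {  // one probe per descriptor object
            for (long[] t : idx.getOrDefault(s, List.of())) {
                List<Long> key = List.of(t[0], t[1], t[2]);
                if (reported.add(key)) result.add(t);  // report each triple only once
            }
        }
        return result;
    }
}
```

Even when most probes return nothing, the loop over all indexes runs in full, which is consistent with the growth of the IndIR, NG4J, and ARQ curves in Figure 2(b).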
We explain the latter by our use of numerical identifiers to represent RDF terms; these identifiers are faster to process and compare than the Java objects used by NG4J/Jena. We also note that CombIR and CombQuadIR performed equally well.

Finally, we measured the creation times14 for our implementations of IndIR, CombIR, and CombQuadIR. Figure 2(c) denotes the results of this experiment: The creation times for our implementation of CombQuadIR are much smaller than for IndIR and CombIR. In particular, the difference between CombIR and CombQuadIR might be surprising given the similarity of both data structures. However, the need to check whether the combined index already contains each triple that has to be inserted is the reason why CombIR has higher creation times than CombQuadIR, for which inserting a triple just means adding a new quad. Focusing solely on IndIR and CombIR we note that the creation times for our implementations of these data structures are comparable, even if the individually indexed representations were created slightly faster. However, this small advantage of IndIR over CombIR does not outweigh the divergence that we measured for the query execution times.

Figure 3: Measurements for a link traversal based execution of a sequence of 200 BSBM query mixes that re-use the local dataset.

6.4 Evaluating the Impact on Link Traversal Based Query Execution

In all experiments discussed so far, we used query-local datasets that we populated in advance. This strategy allowed us to avoid measuring the effect of link traversal and network access, which affect the overall execution time of link traversal based query execution. Hence, we were able to evaluate the actual query performance provided by the different data structures. This evaluation revealed a clear advantage of CombIR and CombQuadIR over the other data structures, where CombQuadIR is superior to CombIR due to the smaller creation times. A question that remains is whether these differences have an impact on the overall execution time of link traversal based query execution. To answer this question we enabled the traversal of data links in our query engine and conducted an experiment that we describe in the remainder of this section.

Using the BSBM dataset generated with a scaling factor of 50 (cf. Table 2) we set up a Linked Data server15 which publishes the generated data following the Linked Data principles. With this server we simulate the Web of Data in order to conduct the experiment in a controlled environment. To query over this simulated Web we adjusted the BSBM queries in such a way that they access our simulation server. For each data structure we ran the same sequence of 200 BSBM query mixes, using a single, initially empty query-local dataset. Hence, we did not clear the query-local dataset between the queries so that each query execution could benefit from the local availability of all descriptor objects that have been retrieved for previous queries. For each of the 200 query mixes we measured the overall query execution time and the number of descriptor objects that were contained in the query-local dataset after executing the query mix.

Figure 3(a) denotes the number of descriptor objects that had been requested during the execution of queries in each of the 200 query mixes. The trendline in this chart illustrates that this number remained roughly constant for our sequence of query mixes. However, the number of newly discovered descriptor objects added to the query-local dataset decreased for later query mixes in the sequence, as can be seen from the overall number of descriptor objects contained in the query-local dataset after each query mix (cf. Figure 3(b)). Hence, many descriptor objects that were requested during the execution of queries in later query mixes had already been retrieved during the execution of previous queries. This observation indicates that certain queries in the whole sequence of query mixes are similar w.r.t. the data that is required to answer them. Applications that generate a query workload with similar characteristics are most likely to benefit from re-using the query-local dataset for the execution of multiple queries.

Figure 3(c) illustrates the overall execution time of each of the 200 query mixes for the different data structures. We note that all three data structures exhibit the same general behavior: The first query mixes have a significantly higher execution time than later query mixes. We explain this behavior as follows: For the first query mixes the query execution times are dominated by the effects of link traversal (e.g. the need to retrieve descriptor objects from the Web and to add them to the data structure that stores the query-local dataset). However, with a decreasing number of descriptor objects that have to be retrieved and added, this domination decreases, until, at around the 13th query mix in our experiment, the trend changes completely. An additional indicator for this domination is the 20th query mix, for which an untypically high number of 60 additional descriptor objects were discovered (in comparison, for the query mixes #19 and #21 it was 15 and 19, respectively).

13 http://sourceforge.net/apps/wordpress/squin/2009/04/25/identified-a-proper-index-size-for-the-new-swcllib-storage-solution/
14 As mentioned before, we understand creation time as the time to create our data structures from an ARQ DatasetGraph object into which we loaded the corresponding dataset before. Hence, based on this understanding ARQ has a creation time of 0 seconds.
Due to the retrieval of this comparably high number of newly discovered descriptor objects, the query execution times measured for all three data structures peak again.

After the domination of the effects of link traversal disappeared (i.e. after query mix #20) the three data structures exhibit the same behavior w.r.t. query execution times as they did in the previous experiments without link traversal based query execution: For CombIR and CombQuadIR the query engine required roughly the same execution time for all query mixes, independent of the still increasing size of the query-local dataset. For IndIR, in contrast, the execution times are higher and they begin to increase slightly. We expect that this increase could even rise, given that the query-local dataset contained 2,397 descriptor objects after query mix #200, which is comparable to D50 used in the previous experiments (cf. Table 2). As a final observation we note that the smaller creation times of CombQuadIR impact the overall execution times only when a significant number of newly discovered descriptor objects are added during link traversal based query execution. Figure 3(d) provides a detailed view on the query execution times for query mixes #1 to #25 to illustrate this effect: For the first query mixes (during which more descriptor objects were retrieved) we measured slightly smaller query execution times for CombQuadIR than for CombIR. However, with a decreasing number of newly discovered descriptor objects, the advantage of CombQuadIR over CombIR vanishes.

15 Our server uses RAP Pubby, which is available from http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/tutorial/RAP_Pubby.htm

7. CONCLUSION

In this paper we investigate main memory data structures to store a temporary collection of Linked Data from the Web as is required in a link traversal based query execution system. We discuss that these data structures must support efficient, multi-threaded loading and triple-based querying of many small sets of RDF data, which is a requirement that is not sufficiently addressed by existing work. Therefore, we introduce and analyze three alternative data structures that are suitable in our usage scenario. These data structures make use of hash table based indexes and present i) an individually indexed organization of the data, ii) a combined index, and iii) a combined quad index. In an empirical evaluation we demonstrate a significant performance improvement that can be achieved by using the combined index or the combined quad index. Comparing these two alternatives, our evaluation revealed that the quad index features smaller creation times and, thus, is advantageous in cases where new data is retrieved and added frequently. As future work we aim to investigate whether we can improve the query performance or memory requirements of the data structures by using another hash function for the indexes.

8. REFERENCES

[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management. The VLDB Journal, 2009.
[2] M. Atre, J. Srinivasan, and J. A. Hendler. BitMat: A Main-memory Bit Matrix of RDF Triples for Conjunctive Triple Pattern Queries. In Proceedings of the Poster and Demo Session at the 7th International Semantic Web Conference (ISWC), 2008.
[3] T. Berners-Lee. Linked Data. Online at http://www.w3.org/DesignIssues/LinkedData.html, 2006.
[4] P. A. Bernstein and N. Goodman. Multiversion Concurrency Control – Theory and Algorithms. ACM Transactions on Database Systems, 8, 1983.
[5] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying. In Proceedings of the Workshop on Semantic Data Management (SemData) at VLDB, 2010.
[6] C. Bizer and A. Schultz. Benchmarking the Performance of Storage Systems that Expose SPARQL Endpoints. In Proceedings of the Workshop on Scalable Semantic Web Knowledge Base Systems at ISWC, 2008.
[7] P. Bouquet, C. Ghidini, and L. Serafini. Querying the Web of Data: A Formal Approach. In Proceedings of the 4th Asian Semantic Web Conference (ASWC), 2009.
[8] T. Haerder and A. Reuter. Principles of Transaction-Oriented Database Recovery. ACM Computing Surveys, 15, 1983.
[9] A. Harth and S. Decker. Optimized Index Structures for Querying RDF from the Web. In Proceedings of the 3rd Latin American Web Congress (LA-Web), 2005.
[10] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data Summaries for On-Demand Queries over Linked Data. In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.
[11] O. Hartig. How Caching Improves Efficiency and Result Completeness for Querying Linked Data. In Proceedings of the 4th International Linked Data on the Web Workshop (LDOW) at WWW, 2011.
[12] O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL Queries over the Web of Linked Data. In Proceedings of the 8th International Semantic Web Conference (ISWC), 2009.
[13] O. Hartig and A. Langegger. A Database Perspective on Consuming Linked Data on the Web. Datenbank-Spektrum, 10(2), 2010.
[14] M. Janik and K. Kochut. BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery. In Proceedings of the 4th International Semantic Web Conference (ISWC), 2005.
[15] T. Neumann and G. Weikum. RDF-3X: A RISC-style Engine for RDF. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), 2008.
[16] E. Oren, C. Gueret, and S. Schlobach. Anytime Query Answering in RDF through Evolutionary Algorithms. In Proceedings of the 7th International Semantic Web Conference (ISWC), 2008.
[17] E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. W3C Recommendation, Online at http://www.w3.org/TR/rdf-sparql-query/, 2008.
[18] V. Roubtsov. Sizeof for Java. http://www.javaworld.com/javaworld/javaqa/2003-12/02-qa-1226-sizeof.html, 2003.
[19] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold. Column-Store Support for RDF Data Management: Not All Swans Are White. Proceedings of the VLDB Endowment, 1, 2008.
[20] H. Stuckenschmidt, R. Vdovjak, G.-J. Houben, and J. Broekstra. Index Structures and Algorithms for Querying Distributed RDF Repositories. In Proceedings of the 13th International Conference on World Wide Web (WWW), 2004.
[21] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking Database Representations of RDF/S Stores. In Proceedings of the 4th International Semantic Web Conference (ISWC), 2005.
[22] O. Udrea, A. Pugliese, and V. S. Subrahmanian. GRIN: A Graph Based RDF Index. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI), 2007.
[23] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), 2008.