Querying the Web of Interlinked Datasets using VOID Descriptions

Ziya Akar, Tayfun Gökmen Halaç, Erdem Eser Ekinci, Oguz Dikenelli
Department of Computer Engineering, Ege University, 35100 Bornova, Izmir, Turkey
ziya.seagent@gmail.com, tayfunhalac@gmail.com, erdemeserekinci@gmail.com, oguz.dikenelli@ege.edu.tr

ABSTRACT
Query processing is an important way of accessing data on the Semantic Web. Today, the Semantic Web is characterized as a web of interlinked datasets, and thus querying the web can be seen as dataset integration on the web. This dataset integration must also be transparent to the data consumer, as if she were querying the whole web. To decide which datasets should be selected and integrated for a query, one requires metadata about the web of data. In this paper, to enable this transparency, we introduce a federated query engine called WoDQA (Web of Data Query Analyzer), which discovers the datasets relevant to a query in an automated manner using VOID documents as metadata. WoDQA focuses on powerful dataset elimination by analyzing the query structure with respect to the metadata of datasets. Dataset and linkset descriptions in VOID documents are analyzed for a SPARQL query, and a federated query is constructed. By means of the linkset concept of VOID, links between datasets are incorporated into the selection of federated data sources. The current version of WoDQA is available as a SPARQL endpoint.

1. INTRODUCTION
While the web is evolving into a structured data space, many applications are publishing and linking their data, and the cloud of this linked and open data will grow gigantic as time goes on. In this interlinked and structured data space, query execution becomes one of the most important research problems, and different query execution approaches and tools have been proposed in the literature [11, 7].

VOID [1] is published as a W3C Semantic Web Interest Group note¹. VOID is an RDF vocabulary used to describe the metadata of RDF datasets, in a sense, the metadata of the web of data. The linked open data cloud is represented as a graph of datasets, in which datasets are represented as nodes and sets of links between datasets are represented as edges. Since VOID is based on this graph nature of the web of data, it provides a strong way of describing metadata that allows discovering the datasets over which queries are distributed.

In this paper, we present a federated query engine called WoDQA (Web of Data Query Analyzer), which is developed to execute a query on distributed datasets without missing answers, using the VOID metadata of the datasets in the linked open data cloud. WoDQA focuses on effective dataset selection for a query and analyzes the query structure to eliminate irrelevant datasets. Relevant datasets are selected by analyzing VOID documents and considering which dataset includes a resource related to the query and which links between datasets allow finding a result to the query. VOID metadata provides dataset descriptions, representing the content of a dataset, and linkset descriptions, representing the relationships between datasets, which are used by WoDQA for effective dataset selection.

There are two main approaches which enable automated query processing on the web of data and prevent data consumers from searching for relevant datasets. The first approach, called follow-your-nose (link traversal) [11], is based on following links between data to discover potentially relevant data; the second one is query federation [7].
Query federation is based on dividing a query into sub-queries and distributing the sub-queries to relevant datasets which are selected using metadata about the datasets.

The follow-your-nose approach conceptualizes the web as a graph of documents which contains dereferenceable URIs. This approach is based on executing queries on relevant documents which are retrieved by following links between resources in different documents. But this method raises completeness and performance issues. Although some heuristic query planning methods can be used to answer different kinds of queries [10], this approach cannot guarantee finding all results, because the relevant documents vary according to the starting point and the path. Also, although follow-your-nose requires nothing other than linked data principles to process a query, another disadvantage is that encountering large documents causes retrieval problems. The other approach, query federation, has arisen from the database literature and is composed of two main steps before performing a query. Firstly, the query is divided into sub-queries, and datasets relevant to the sub-queries are selected using some metadata which reflects dataset content. Then, the query evaluation plan is changed using statistics about the datasets in the query optimization step. For the purpose of executing sub-queries on distributed data sources, query federation requires accessing datasets via SPARQL endpoints. Contrary to the follow-your-nose approach, in this approach all results can be found under the assumption that the metadata of all datasets is complete and accurate, and queries can be optimized before execution by estimating execution cost using dataset metadata. To find all results in an effective way, query federation determines relevant datasets before execution using well-defined dataset metadata such as VOID documents.

The query execution on the web of data basically depends on searching for resources that satisfy our needs, but we need to discover which parts of the linked open data cloud may have such resources. To make this discovery effective, which resources and vocabularies reside in a dataset and which datasets are interlinked to others via links of interest should be taken into account. If dataset publishers provide such information by describing the metadata of their datasets, relevant datasets can be selected effectively in an automated manner. To enable this automation, the Vocabulary of Interlinked Datasets (VOID) was introduced.

Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France.
1 http://www.w3.org/TR/void/

On the other hand, distributed querying depends on processing query parts directly on the original data and managing the results retrieved from the distributed data. Query federation [9, 7] and follow-your-nose [11] are the mainstream distributed querying approaches. DARQ [14], FedX [15] and SPLENDID [8] are example implementations of the query federation approach. DARQ distributes a query using dataset metadata called Service Descriptions³, which are constructed manually by the query developer, and benefits from triple and entity counts and selectivity estimates to optimize the query plan. Since DARQ uses predicates to select relevant datasets, the success of query execution depends on associating datasets with predicates, and triple patterns which have unbound predicates cannot be handled⁴. On the other hand, FedX is an extended version of the Federation SAIL provided by AliBaba⁵. The datasets which will be queried are given to FedX, and it checks each triple pattern's existence on each dataset by ASK queries to decide which triple pattern will be queried on which datasets [16]. These two query federation implementations also stand up to the self-descriptive
nature of linked data, since the metadata of datasets should be described by data publishers just as they describe and link their data. The last query federation implementation is SPLENDID, which indexes datasets using VOID descriptions, eliminates datasets by ASK queries for triple patterns, and benefits from the statistical data in VOID to optimize federated queries.

Although the aforementioned query federation implementations aim to query linked datasets, they do not consider links between data for dataset selection. For this reason, there are some shortcomings of these implementations from the querying-the-web-of-data perspective. The first one is that deciding datasets via only predicate indexes causes an inability to select datasets effectively for triple patterns which have unbound predicates or have very general predicates, such as owl:sameAs and foaf:page, that are extensively used in datasets⁶. The second shortcoming is that many datasets may be selected for the triple patterns.

In the light of these ideas, WoDQA executes queries by analyzing VOID documents, which constitute a projection of the web of data, and incorporates the follow-your-nose approach into query federation by considering the links between datasets in the metadata. WoDQA does not change the evaluation order of a query, because the main focus of this initial version of WoDQA is only eliminating as many irrelevant datasets as possible in dataset selection, without query optimization. Current RDF federation implementations select relevant datasets by considering only predicate and type indexes. Since vocabularies in the Semantic Web should be common, there can be a lot of datasets which use a specific property or class. Therefore, using such indexes causes the selection of redundant datasets. The main contribution of WoDQA is incorporating both the links between data, through the linkset concept, and the relationships between the triple patterns of a query into dataset selection, to eliminate irrelevant datasets effectively.
We serve WoDQA as a SPARQL endpoint and a simple web form² to execute raw queries by analyzing the datasets in the VOID stores.

The remaining sections are organized as follows. In Section 2, related work is discussed. Section 3 introduces the general architecture of WoDQA and details the dataset selection approach. In Section 4, the usage of WoDQA is shown with a working example. Finally, Section 5 concludes the paper.

2. RELATED WORK
The Semantic Web querying approaches can be classified as centralized and distributed. Centralized querying is based on collecting linked data into a single central data store and querying the data from this store. This approach includes data warehousing, which collects pre-selected data sources, and search engines, which crawl the Web by following RDF links and index the discovered data [12]. But the main disadvantage of this approach is that the queried data is not live, i.e. it is a duplicate of the original sources.

Executing ASK queries for many candidate datasets increases the cost notably. One needs to take the structure of the query into account to eliminate the right irrelevant datasets in the web of data context.

The second distributed querying approach is follow-your-nose [11], whose basic idea is traversing RDF links between data to discover relevant datasets. There is no need for any prior metadata about datasets as in query federation, but it needs initial URIs in some triple patterns to start exploring datasets. The main disadvantages of this approach are infinite link discovery, trying to retrieve large RDF graphs, and failing to discover relevant data for queries with only bound predicates (?s foaf:friend ?o) or type statements (?s rdf:type foaf:Person). These restrictions cause less comprehensive result sets. One of the well-known follow-your-nose implementations is SQUIN [11], which traverses RDF links on the fly, i.e. during query execution. Hartig et al. improve this work using some heuristic methods that modify the query
evaluation order to reduce execution cost and to provide more comprehensive results [10], but the results strictly depend on the starting point and the evaluation order. On the other hand, Bouquet et al. formalize the web of data and suggest three different querying methods exploiting their web of data formalization [4]. These methods are based on merging relevant graphs to execute queries on them. One of these methods uses the follow-your-nose approach, which specifies and merges relevant graphs by looking up URIs before query execution.

WoDQA aims to query the web of interlinked datasets using VOID dataset and linkset descriptions to decide the relevant datasets for a query. At first, it assumes that all datasets are relevant to a query; then irrelevant datasets are eliminated by analyzing the query structure in the light of the metadata of datasets. Its novelty is considering the query structure and the links between datasets to select relevant datasets before query execution, and thus it incorporates the follow-your-nose approach into query federation.

Moreover, search engines cannot crawl the whole web and cannot answer complete structured queries.

2 The simple web form and up-to-date endpoint address can be found on the http://seagent.ege.edu.tr/etmen/wodqa.html page.
3 The Service Description introduced in that paper contains information about the triples in the dataset, limitations on access patterns, and statistical information about the dataset.
4 http://darq.sourceforge.net/#Limitations and known issues
5 http://www.openrdf.org/doc/alibaba/2.0-beta6/alibaba-sail-federation/
6 SPLENDID also uses type indexes, but it is still not enough since vocabularies can be used frequently.

Figure 3.1: WoDQA internal architecture

The QueryReorganizer rewrites the analyzed query into SERVICE expressions¹¹; details of the QueryReorganizer are given in Subsection 3.2. The last module is the query executor, which directly uses Jena ARQ to execute SPARQL queries including the SERVICE expressions.
The SERVICE expressions are inserted by the QueryReorganizer, and the federated query constructed by the QueryReorganizer is passed to ARQ to be executed. The results of the query execution are returned to the querier. In the following subsections, the first two modules, which implement the WoDQA analysis and reorganization phases, are explained.

To the best of our knowledge, WoDQA is the first query engine which uses datasets and linksets together, the two elements of VOID that are critical for describing dataset metadata.

3. WODQA INTERNAL ARCHITECTURE
In this section, the query processing architecture of WoDQA is explained in detail. Since it is impractical to perform a query on all published datasets on the web, WoDQA aims to transform a query into a federated query which is evaluated only on the relevant datasets. In this direction, to process a query on the linked data cloud, WoDQA contains three main modules, as seen in Figure 3.1: DatasetAnalyzer, QueryReorganizer and Jena ARQ⁷.

Dataset publishers construct the VOID documents of their datasets, and Semantic Web programmers can access these documents through services called VOID stores, such as voiD Browser⁸, CKAN⁹ and voiDStore¹⁰. A VOID store generates a projection of Linked Open Data, and thus this structure obliges dataset publishers to create well-defined VOID documents.

3.1 Dataset Analyzer
This section introduces the details of the DatasetAnalyzer module, which is the core and innovative part of the current version of WoDQA. Unlike other query federation approaches, WoDQA considers triple pattern relations and links between datasets while selecting datasets. Thanks to the dataset analysis of WoDQA, relevant datasets are specified while plenty of irrelevant ones are excluded. The output of the dataset analysis is a subset of all published datasets on the web of data, and thus the query is performed only on this subset of related datasets. For the purpose of explaining how this subset is constructed, we give a formalization in this section.
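The reorganization step described in this section can be sketched in plain Python. This is a hypothetical, minimal illustration of wrapping each triple pattern in a SPARQL 1.1 SERVICE block for its selected endpoints; the function name, the endpoint URLs and the UNION strategy for multi-dataset patterns are assumptions, not WoDQA's actual API.

```python
# Hypothetical sketch of a SERVICE-based query rewriting step.
# selected_endpoints maps each triple pattern to its relevant endpoints.

def reorganize(triple_patterns, selected_endpoints):
    """Build a federated query body from {pattern: [endpoint, ...]}."""
    blocks = []
    for tp in triple_patterns:
        endpoints = selected_endpoints[tp]
        if len(endpoints) == 1:
            blocks.append("  SERVICE <%s> { %s }" % (endpoints[0], tp))
        else:
            # A pattern relevant to several datasets becomes a UNION of
            # SERVICE blocks so no answer is missed.
            union = " UNION ".join(
                "{ SERVICE <%s> { %s } }" % (ep, tp) for ep in endpoints)
            blocks.append("  " + union)
    return "SELECT * WHERE {\n" + "\n".join(blocks) + "\n}"

query = reorganize(
    ["?film owl:sameAs ?x .", "?x dbpprop:name ?n ."],
    {"?film owl:sameAs ?x .": ["http://data.linkedmdb.org/sparql"],
     "?x dbpprop:name ?n .": ["http://dbpedia.org/sparql"]})
```

The resulting string is an ordinary SPARQL 1.1 query that any federation-capable engine such as ARQ could evaluate; the real QueryReorganizer works on Jena's query algebra rather than on strings.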
A well-defined VOID document which reflects the actual content of a dataset enables including the dataset in relevant queries. The DatasetAnalyzer is the module responsible for discovering relevant datasets and eliminating irrelevant ones using the VOID documents of the datasets in the VOID stores. We assume that dataset publishers update the description documents in the VOID stores when their datasets change, keeping the VOID stores up to date for dataset selection. In the current version of WoDQA, the DatasetAnalyzer discovers the VOID documents from the CKAN net and analyzes the dataset and linkset descriptions for each triple pattern in the query. This analysis eliminates the irrelevant datasets which definitely do not contain any result contributing to the result of the query, assuming that accurate and complete VOID documents of the datasets are available. Dataset analysis is achieved by a rule-based approach. We explain the rules which discover relevant datasets in Subsection 3.1. The second module is the QueryReorganizer, which rewrites queries depending on the results of the DatasetAnalyzer. This rewriting process constructs federated SPARQL queries including SERVICE expressions.

We firstly give a definition of the web of data to formalize our dataset selection approach. In summary, the web of data is an RDF graph which is constructed by typed links between data from different sources. Basically, an RDF graph (G) is formally represented as a set of triples of the form ⟨s, p, o⟩: G = {⟨s, p, o⟩ | ⟨s, p, o⟩ ∈ (I ∪ B) × I × (I ∪ B ∪ L)}, where I is the set of IRIs [6], B is the set of blank nodes, L is the set of literals, and all are RDF terms, T = I ∪ B ∪ L. In this direction, the web of data is the global graph (G_wod) which consists of the triples constructed from the IRIs, blank nodes and literals on the web. G_wod is a mathematical model of the RDF construct for the web of data.

From another perspective, the web of data means a web of interlinked datasets [3]. A dataset (δ) is a meaningful set of RDF triples [1] which decreases the granularity of the web. Rather than publishing information only as single resources and connecting these resources, datasets are the way of publishing information as sub-graphs of G_wod. These sub-graphs, i.e. datasets, are connected via RDF triples which connect resources in different datasets [13]. Publishers create their resources and deploy them into the datasets on the web, and consumers use these resources while creating their own datasets. With regard to this, to formalize a dataset, we use subj(G), which represents the set of resources that are the subjects of the triples in a graph.

Definition 1. A dataset is a sub-graph of the web of data, δ_x ⊂ G_wod, and the resources which are included by δ_x are specified as follows: ∀r, δ_x (r ∈ subj(δ_x) → Owner(δ_x, r)).

VOID describes a dataset with well-defined properties¹², and we formalize a VOID dataset description as a tuple ⟨L^space, I^voc⟩. The first dataset property, L^space, corresponds to the void:uriSpace set, which contains the string literals that all entity IRIs in a dataset start with. The other one is I^voc, corresponding to void:vocabulary, which denotes the set of vocabularies used by the dataset¹³.

The triples whose object is a resource in another dataset make the web of data a graph of interlinked datasets. We call such triples link triples, and define the set of link triples as LT = {⟨s, p, o⟩ | owner(s) ≠ owner(o)} where s, o ∈ I.

7 http://jena.sourceforge.net/ARQ/
8 http://kwijibo.talis.com/voiD/
9 http://ckan.net/
10 http://void.rkbexplorer.com/
11 http://www.w3.org/TR/sparql11-federated-query/

Figure 3.2: Example VOID Models

Performing every triple pattern on every published dataset is impractical, so our purpose is eliminating the irrelevant datasets for each triple pattern. By eliminating irrelevant datasets, we construct a federated query.
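The formalization above (uriSpaces, ownership and link triples) can be sketched in plain Python. The dataset names and namespaces below are illustrative assumptions, not the actual VOID store contents.

```python
# A VOID dataset description as the tuple <L_space, I_voc>:
# uriSpaces and vocabularies per dataset (illustrative values).
datasets = {
    "DBpedia":   {"uriSpaces": ["http://dbpedia.org/resource/"],
                  "vocabularies": ["http://dbpedia.org/property/"]},
    "LinkedMDB": {"uriSpaces": ["http://data.linkedmdb.org/resource/"],
                  "vocabularies": ["http://data.linkedmdb.org/movie/"]},
}

def owner(resource):
    """Owner(δx, r): δx owns r if r starts with one of δx's uriSpaces."""
    return {name for name, d in datasets.items()
            if any(resource.startswith(s) for s in d["uriSpaces"])}

def is_link_triple(s, p, o):
    """A triple is a link triple when subject and object have different owners."""
    return owner(s) != owner(o)
```

For example, `owner("http://dbpedia.org/resource/Nikola_Tesla")` yields only DBpedia, and an owl:sameAs triple from a LinkedMDB film to a DBpedia resource is recognized as a link triple.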
This federated query distributes the sub-queries to only the relevant datasets. In order to eliminate irrelevant datasets, we introduce a set of rules which discover the relevant datasets in the dataset analysis, called relevant dataset discovery rules. A relevant dataset for a triple pattern is formalized as an assertion ρ(δ_x, tp_i), which denotes that the dataset δ_x may contain a result for the triple pattern tp_i. We now need to discuss how the relevant dataset assertions (ρ) inferred by the discovery rules are used for eliminating irrelevant datasets. To explain the elimination method, assume that Q_tpi is the set of datasets selected to be queried for tp_i, which we call the selected set; this set initially contains all datasets on the web, Q_tpi^init ≡ Δ (all datasets in a VOID store, in our case). Each rule analyzes the datasets in Q_tpi of each triple pattern. After applying a rule, the elements of Q_tpi about which the rule does not infer a relevant dataset assertion are removed from Q_tpi.

The definition of link triples leads us to define the link predicate (p^link) concept, which corresponds to void:linkPredicate, used to define a linkset, an important contribution of the VOID effort. The set of link predicates P^link includes all the predicates which are used in a link triple: P^link = {p^link | ∃s, o: ⟨s, p^link, o⟩ ∈ LT}. A linkset represents the link triples which connect resources in different datasets using the same link predicate. We formalize the linkset, λ, as a tuple ⟨δλ^from, δλ^to, pλ^link⟩ ∈ Δ × Δ × P^link, where Δ is the set of all datasets on the web. In this definition, δλ^from is the referrer dataset, which is the owner of the subjects of the link triples in the linkset; δλ^to is the referenced dataset, which is the owner of the objects of the link triples in the linkset; and pλ^link is the link predicate of all the link triples in the linkset. The DatasetAnalyzer uses both the dataset and linkset descriptions of VOID metadata to select the relevant datasets.
In other words, it selects the sub-graphs which may contain the results of the query. By this means, the query is transformed into a federated query and executed on the relevant datasets on the web. Accordingly, we exclude the datasets which do not contain any result for the query while querying the global graph, G_wod. Besides analyzing the relationships between the VOID descriptions (datasets, linksets) and a triple pattern, the relationships between the triple patterns in a query are considered. For this reason, we need to give a formal definition of SPARQL queries.

We consider a subset of SPARQL queries which corresponds to a basic graph pattern for formalization [5]. A basic graph pattern consists of triple patterns, BGP = {tp_i | ⟨s_tpi, p_tpi, o_tpi⟩ ∈ (I ∪ V) × (I ∪ V) × (I ∪ L ∪ V)}. A triple pattern is slightly different from an RDF triple, since it contains at least one and at most three variables, which are elements of the infinite set V¹⁴. While performing a query, its variables are replaced by RDF terms.

If the rule does not imply any relevant dataset assertion at all, Q_tpi remains the same. The irrelevant dataset elimination method is formalized as Q_tpi^new below, denoting the update of the selected set subsequent to executing a rule:

Q_tpi^new = {δ_x | (δ_x ∈ Q_tpi) ∧ ρ(δ_x, tp_i)}   if ∃δ_a (ρ(δ_a, tp_i))
Q_tpi^new = Q_tpi                                   if ∄δ_a (ρ(δ_a, tp_i))

In the following subsection we give the relevant dataset discovery rules in detail, and we use some queries to exemplify the application of the rules. Figure 3.2 shows the VOID models of a set of datasets which are used in these examples. This model contains simplified VOID descriptions of five datasets and the linksets connecting these datasets by link predicates. Link predicates are shown by arrows between the dataset descriptions, which are represented with squares. Also, we create a sample dataset called Facebook, which keeps data about Facebook users in our local store.
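The update of the selected set formalized above can be sketched as a small Python function. The dataset names are illustrative assumptions.

```python
# Sketch of the elimination step: a rule asserts ρ for some datasets;
# the selected set Q_tp shrinks only when at least one assertion is made.

def apply_rule(selected, relevant):
    """Q_new = Q ∩ relevant if relevant is non-empty, else Q unchanged."""
    return selected & relevant if relevant else selected

q = {"DBpedia", "LinkedMDB", "YAGO", "Facebook"}
q = apply_rule(q, {"DBpedia", "YAGO"})   # a rule fired: intersect
q = apply_rule(q, set())                 # a rule asserted nothing: keep
```

After the two steps, `q` is `{"DBpedia", "YAGO"}`: the first rule eliminated LinkedMDB and Facebook, and the second rule, having made no assertion, left the selected set untouched.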
Thus, according to the semantics of SPARQL, a query result is a set of solution mappings, {μ | μ: V → T}, where a solution mapping, μ, is a partial function from variables to RDF terms.

To perform a query on the web of data, in the worst case each triple pattern has to be queried on all published datasets.

This data records which users like the movie resources located in the LinkedMDB dataset. Although there can be a lot of linksets, only the few needed to explain our rules are depicted in Figure 3.2.

3.1.1 Relevant Dataset Discovery Rules
This subsection presents a set of rules, each of which represents an analysis method of the DatasetAnalyzer module to discover relevant datasets effectively. Each relevant dataset discovery rule aims to analyze datasets from different perspectives, and combinations of them, to infer relevant dataset (ρ) assertions for triple patterns. These perspectives can be classified into three groups. The first one is analyzing the IRIs in the triple patterns; we call this perspective IRI-based Analysis, where the namespaces of IRIs and the vocabularies in the VOID documents are considered to determine relevant datasets. The second one is considering linked resources; we call this perspective Linking Analysis.

12 The DatasetAnalyzer of the current version of WoDQA considers only these properties for the sake of simplicity. We plan to integrate other properties, such as statistics, in the future to make optimized queries.
13 Note that the schemas of the Semantic Web languages, such as RDFS and OWL, are not specified in I^voc.
14 We exclude blank nodes in queries.

Another perspective to discover relevant datasets is Linking Analysis. Since a triple links two resources either in the same dataset or in different datasets, this perspective is separated into two kinds of analyses, each of which considers a different kind of triple.
The first one is Internal Linking Analysis, which considers triples linking resources in the same dataset. A relevant dataset found by this analysis is called an internal relevant dataset and is represented with ρ^int(δ_x, tp_i). On the other hand, External Linking Analysis considers link triples, which connect resources in different datasets. In this case, the linkset descriptions of the VOID documents are taken into account to discover relevant datasets, and a dataset found by this analysis is called an external relevant dataset, represented as ρ^ext(δ_x, tp_i). The Internal and External Linking analyses are both executed for a triple pattern in the whole Linking Analysis process, and then the produced internal and external relevant datasets for a triple pattern are unified as the relevant datasets for the triple pattern, as shown in Rule 3.

Rule 3. The union of the external and internal datasets for a triple pattern constitutes the relevant datasets for the triple pattern.

∀δ_x, tp_i ((ρ^int(δ_x, tp_i) ∨ ρ^ext(δ_x, tp_i)) → ρ(δ_x, tp_i))

The irrelevant dataset elimination method considers the relevant datasets specified by Rule 3 to eliminate irrelevant datasets from the selected set of a triple pattern (Q_tpi). Internal and external datasets are intermediate results used to infer relevant datasets in a Linking Analysis.

The first two rules under the Linking Analysis perspective are the linking-to-IRI discovery rules. Consider the example triple pattern ?film owl:sameAs dbpedia:A_Fistful_of_Dollars, which can be matched with a triple that links a resource to the dbpedia:A_Fistful_of_Dollars resource. Triple patterns whose object is an IRI are analyzed by these rules. Since the owners of the linked IRI must be known in these rules, we give a definition that depicts the owners of any resource on the basis of IRI Analysis. We formalize the inclusion of a resource (r ∈ I) by a dataset (δ_x) in Definition 3, by using the urispaces (L^space) property of the VOID description and the startsWith(r, L_δx^space) function, which represents that r starts with one of the urispaces in L_δx^space.

Linking Analysis considers whether a triple links two resources in the same dataset (internal) or in different datasets (external) to eliminate irrelevant datasets. The last perspective is Shared Variable Analysis: since triple patterns share some variables, each triple pattern affects the relevant datasets of the other triple patterns that include the same variables. The relevant dataset discovery rules of all perspectives are introduced in this section.

The first two rules are under the IRI-based analysis perspective, and each of them considers the vocabularies I^voc of VOID metadata. The first discovery rule checks whether the IRIs in triple patterns are RDFS (or OWL) classes or properties in the vocabulary set (I^voc) of the VOID documents. To give the rule, we define the has(I_δx^voc, r) function, which represents that a resource r ∈ I (a property or a class IRI) is included by one of the vocabularies in I_δx^voc. Using the has expression, the relevance of a dataset δ_x to an IRI r is represented with VocMatch, as shown in Definition 2.

Definition 2. ∀δ_x (has(I_δx^voc, r) → VocMatch(δ_x, r)) where r ∈ I

For a triple pattern such as ?s dbpprop:name¹⁵ "Nikola Tesla", the dbpprop:name RDFS property in the predicate position obliges that matching triples can only be in datasets which use the dbpprop vocabulary. Therefore, we can eliminate the datasets which do not use the dbpprop vocabulary. This situation is handled by Rule 1, which is similar to predicate indexes.

Rule 1. If there is a dataset in which one of its vocabularies includes the predicate of a triple pattern, then it is relevant for the triple pattern.

∀tp_i, δ_x (VocMatch(δ_x, p_tpi) → ρ(δ_x, tp_i))

According to Figure 3.2, DBpedia uses the dbpprop vocabulary, and the rule decides that it is relevant for such a triple pattern. In the web of data, lots of datasets which use the dbpprop vocabulary can be found, and they can be eliminated using the outputs of the other discovery rules.
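The IRI-based rules can be sketched in plain Python. This is an illustrative approximation: VocMatch is reduced to a namespace-prefix check, and the vocabulary assignments are assumptions (Rule 2, the rdf:type variant, is introduced next in the text).

```python
# Sketch of VocMatch and the IRI-based discovery rules.
RDF_TYPE = "rdf:type"

vocabularies = {          # I_voc per dataset (illustrative)
    "DBpedia":   ["dbpprop:"],
    "LinkedMDB": ["linkedMDB:"],
    "Facebook":  ["foaf:"],
}

def voc_match(dataset, resource):
    """VocMatch(δx, r): r belongs to one of δx's vocabularies."""
    return any(resource.startswith(v) for v in vocabularies[dataset])

def rule1(tp):
    """Rule 1: datasets whose vocabularies include the bound predicate."""
    s, p, o = tp
    return {d for d in vocabularies
            if not p.startswith("?") and voc_match(d, p)}

def rule2(tp):
    """Rule 2: for rdf:type patterns, datasets whose vocabularies include
    the class in the object position."""
    s, p, o = tp
    return {d for d in vocabularies if p == RDF_TYPE and voc_match(d, o)}
```

On the running examples, `rule1(("?s", "dbpprop:name", '"Nikola Tesla"'))` keeps only DBpedia, and `rule2(("?producer", "rdf:type", "linkedMDB:producer"))` keeps only LinkedMDB.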
To introduce Rule 2, consider another triple pattern, ?producer rdf:type linkedMDB:producer, which contains a type definition for the variable ?producer. In such cases, the object of the triple pattern is a class definition, and it makes sense to eliminate the datasets which do not use the vocabulary of this class. Rule 2, resembling type indexes, is used to specify relevant datasets for such triple patterns. The example triple pattern is queried from the datasets that use the linkedMDB vocabulary, i.e. the LinkedMDB dataset in our example model. Other datasets do not include a resource which is an instance of the linkedMDB:producer class, and therefore they are eliminated by the output of this rule.

Rule 2. If there is a dataset in which one of its vocabularies includes the object of a triple pattern, when the predicate of the triple pattern is rdf:type, then the dataset is relevant for the triple pattern.

∀tp_i, δ_x (VocMatch(δ_x, o_tpi) ∧ (p_tpi = rdf:type) → ρ(δ_x, tp_i))

15 All prefixes used in the paper are defined in Table 1.

Definition 3. ∀δ_x (startsWith(r, L_δx^space) → Owner(δ_x, r))

Rule 4 is the linking-to-IRI internal discovery rule, from the internal linking point of view. According to the example query, ?film should be in the same dataset as dbpedia:A_Fistful_of_Dollars, i.e. the owner of the dbpedia:A_Fistful_of_Dollars resource. Therefore, matching triples can be found in the DBpedia dataset.

Rule 4. If there is a triple pattern whose object is an IRI, then the owner datasets of the IRI are internal relevant for the triple pattern.

∀tp_i, δ_x ((δ_x ∈ Q_tpi) ∧ Owner(δ_x, o_tpi) → ρ^int(δ_x, tp_i)) where o_tpi ∈ I, s_tpi ∈ V

On the other hand, from the external linking point of view, matching triples can be found in datasets which are linked to the owner datasets of the object IRI. For our example triple pattern, ?film should be in datasets which contain link triples whose object resource is defined in DBpedia.
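Definition 3 and Rule 4 can be sketched as follows; the namespaces are illustrative assumptions.

```python
# Sketch of Definition 3 (ownership via void:uriSpace) and Rule 4
# (internal relevance of the owner datasets of an IRI object).
uri_spaces = {
    "DBpedia":   ["http://dbpedia.org/resource/"],
    "LinkedMDB": ["http://data.linkedmdb.org/resource/"],
}

def owners(r):
    """Owner(δx, r): datasets whose uriSpace is a prefix of r."""
    return {d for d, spaces in uri_spaces.items()
            if any(r.startswith(s) for s in spaces)}

def rule4(tp, selected):
    """Internal relevant datasets for a pattern with an IRI object."""
    s, p, o = tp
    return selected & owners(o)

tp = ("?film", "owl:sameAs",
      "http://dbpedia.org/resource/A_Fistful_of_Dollars")
```

With both DBpedia and LinkedMDB in the selected set, `rule4(tp, ...)` keeps only DBpedia, the owner of the object IRI, as internal relevant.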
For this analysis, the linkset descriptions of the VOID documents are used. To discover relevant datasets for a triple pattern by using the linking-to-IRI external discovery rule, the triple pattern should have a bound predicate. We define the Compatible expression in Definition 4 to represent which linkset description is appropriate for determining the relevant datasets of a triple pattern.

Definition 4. If the selected set of a triple pattern contains the referrer dataset of a linkset description, and the link predicate of the linkset description is the same as the triple pattern's predicate, then the linkset description is compatible with the triple pattern.

∀λ_m, tp_i ((δλ_m^from ∈ Q_tpi) ∧ (pλ_m^link = p_tpi) → Compatible(λ_m, tp_i))

Consider the ?film owl:sameAs dbpedia:A_Fistful_of_Dollars triple pattern again, and remember the linkset descriptions of our example model in Figure 3.2.

Rule 7 is the IRI-links-to external discovery rule, and it finds the triples that connect resources in different datasets. According to this rule, an owner dataset of the subject IRI of the triple pattern is external relevant only when there is a linkset description that includes the owner dataset as its referrer dataset and is compatible with the triple pattern.

Rule 7. If there is a linkset description that is compatible with the triple pattern and whose referrer dataset is an owner dataset of the triple pattern's subject, then the referrer dataset of the linkset description is external relevant for the triple pattern.

∀tp_i, λ_m, δ_x (Compatible(λ_m, tp_i) ∧ Owner(δ_x, s_tpi) ∧ (δ_x = δλ_m^from) → ρ^ext(δ_x, tp_i)) where o_tpi ∈ V, s_tpi ∈ I

According to the example, DBpedia is the owner dataset of dbpedia:Ennio_Morricone, and there is also a linkset description whose referrer dataset is DBpedia and whose link predicate is owl:sameAs. Other discovery rules in Linking Analysis are combined with Shared Variable Analysis.
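Definition 4 and Rule 7 can be sketched over linkset tuples ⟨δ_from, δ_to, p_link⟩. The linkset list below is an illustrative assumption loosely mirroring Figure 3.2, including a linkset with DBpedia as referrer.

```python
# Sketch of Definition 4 (Compatible) and Rule 7 (IRI-links-to external).
linksets = [("LinkedMDB", "DBpedia", "owl:sameAs"),
            ("YAGO", "DBpedia", "owl:sameAs"),
            ("DBpedia", "YAGO", "owl:sameAs")]  # illustrative targets

def compatible(linkset, predicate, selected):
    """Definition 4: referrer in the selected set, matching link predicate."""
    frm, to, p_link = linkset
    return frm in selected and p_link == predicate

def rule7(predicate, subject_owners, selected):
    """External relevant: an owner of the subject IRI that is also the
    referrer of a compatible linkset."""
    return {frm for (frm, to, p) in linksets
            if compatible((frm, to, p), predicate, selected)
            and frm in subject_owners}

sel = {"DBpedia", "LinkedMDB", "YAGO", "Facebook"}
```

For dbpedia:Ennio_Morricone owl:sameAs ?person, DBpedia owns the subject and appears as the referrer of a compatible owl:sameAs linkset, so `rule7("owl:sameAs", {"DBpedia"}, sel)` keeps DBpedia as external relevant.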
The first discovery rule link predicates are owl:sameAs. Rule 5 which is under the which conforms to this combined analysis is Chaining Triple Linking Analysis perspective gives these two datasets as ex- Patterns Analysis. This rule considers two triple patterns to- ternal relevant datasets for this triple pattern. If a dataset gether to discover relevant datasets. This is a characteristic is not linked to DBpedia by owl:sameAs predicate then one of Shared Variables Analysis, since it depends on analyzing can conclude that this dataset is irrelevant with the triple more than one triple pattern that have same variable. Triple pattern. patterns below are example of chaining triple patterns: ?s owl:sameAs ?film. Rule 5. If there is a linkset description that is compati- ?film linkedMDB:producer name “Sergio Leone” ble with the triple pattern and whose referenced dataset is an Notice that the second triple pattern is used for querying owner dataset of the triple pattern’s object, then the referrer films whose producer name is “Sergio Leone”. Thus, the sec- dataset of the linkset description is external relevant for the ond triple pattern affects the relevant datasets of the first triple pattern. triple pattern. From internal linking point of view, the first triple pattern can be found in datasets which satisfy the sec- ∀tpi , λm , δx (Compatible(λm , tpi ) ∧ Owner(δx , otpi ) ∧ (δx = ond triple pattern because ?s and ?film should be in the same δλtom ) → ρext (δλf m rom , tpi )) where otpi ∈ I, stpi ∈ V dataset. In this direction, Internal Chaining Triple Pattern Analysis formalized in Rule 8 is used to discover internal Recall that internal and external relevant datasets are uni- relevant datasets for triple patterns. fied by Rule 3 after applying rules in Rule 4 and Rule 5. Thus, the final relevant datasets which are selected by the Rule 8. If there is a triple pattern whose object is same linking-to-IRI rules are DBpedia, LinkedMDB and YAGO. 
with the subject of another triple pattern, then datasets in- Another couple of rules under the Linking Analysis per- cluded by selected sets of both triple patterns are internal spective are IRI-links-to rules. These rules are applied to relevant. triple patterns whose subject is an IRI and object is a vari- able to determine relevant datasets from internal and ex- ∀δx , tpi , tpj ((otpi = stpj ) ∧ (δx ∈ Qtpi ) ∧ (δx ∈ Qtpj ) → ternal linking point of view. Rule 6 is IRI-links-to internal ρint (δx , tpi ) ∧ρint (δx , tpj )) where otpi , stpi ∈ V discovery rule and it finds the triples that link resources in the same dataset. One can conclude that if the subject of To execute Chaining Triple Pattern Analysis, execution a triple pattern is an IRI, then triples matching with this order of rules becomes important. To exemplify this situ- triple pattern are in the owner dataset of this IRI. ation according to Chaining Triple Patterns query, assume that IRI-based analysis is applied before, and LinkedMDB Rule 6. If a dataset is an owner of the subject of a triple is the relevant dataset for the second triple pattern since pattern, then this dataset is internal relevant dataset for the LinkedMDB VOID includes linkedMDB as the value of vo- triple pattern. cabulary. Then, this rule can specify that the internal rel- evant dataset for the first triple pattern is LinkedMDB. It ∀tpi , δx (δx ∈ Qtpi ) ∧ Owner(δx , stpi ) → ρint (δx , tpi ) where  is clearly seen from this example, Shared Variable Analysis otpi ∈ V, stpi ∈ I should be performed after the execution of IRI-based Anal- ysis rules to eliminate more datasets. Hence, in the Sub- Consider the triple pattern dbpedia:Ennio Morricone owl: section 3.1.2, we give an overview of analysis process which sameAs ?person. Subject of this triple pattern is an IRI specifies an execution order for these rules. 
whose namespace is dbpedia, and therefore an internal rel- After the IRI-based analysis, if we apply Rule 8 for the ex- evant dataset is DBpedia whose VOID metadata contains ample triple patterns, relevant datasets of the second triple dbpedia as value of urispace property. pattern is shown as Qtp2 = {δLinkedM DB }, and of the first one is shown as Qtp1 ≡ ∆. For this case, this rule asserts triple patterns, then referrer datasets of the linkset descrip- that ρint (δLinkedM DB , tp1 ) and ρint (δLinkedM DB , tp2 ). tions are external relevant for the triple patterns. On the other hand, External Chaining Triple Patterns analysis uses linkset descriptions while considering two triple ∀λm , λn , tpi , tpj ((otpi = otpj )∧Compatible(λm , tpi )∧Com− patterns. While internal one can be used for all triple pat- patible(λn , tpj )∧(δλtom = δλton ) → ρext (δλf m rom , tpi )∧ρext (δλf nrom , terns without considering the predicate, external rule is ap- tpj )) where otpi ∈ V plied for a triple pattern which has a link predicate. Rule 9 introduces this rule. For the example of Object Sharing Triple Pattern Anal- ysis, assume that no rule is applied before this analysis Rule 9. If there is a triple pattern (tpi ) whose object is and selected sets are Qtp1 ≡ ∆ and Qtp2 ≡ ∆. In Fig- same with the subject of another one (tpj ), and there is a ure 5, δDBpedia , δY AGO , and δLinkedM DB have link triples linkset description which is compatible with (tpi ) and its ref- with predicate owl:sameAs. But, only δF acebook has a linkset erenced dataset is included by the selected set of tpj , then to δLinkedM DB with predicate facebook:likes, and therefore referrer dataset is external relevant for tpi and referenced Qnew tp1 = {δF acebook }. In this case, tp2 can only be queried dataset is external relevant for tpj . on the datasets which is linked to δLinkedM DB with predi- cate owl:sameAs, and thus Qnewtp2 = {δDBpedia }. 
Other triples ∀λm , tpi , tpj ((otpi = stpj ) ∧ Compatible(λm , tpi ) ∧ (δλtom ∈ whose subject corresponding to ?film in other datasets ac- Qtpj ) → ρext (δλf mrom , tpi ) ∧ ρext (δλtom , tpj )) where otpi ∈ V cording to tp2 cannot include objects that satisfy object of tp1 . As done in other Linking Analysis methods, internal With respect to the example of chaining triple patterns, and external relevant datasets which are inferred by Rule 10 assume that selected set for the first triple pattern is Qtp1 ≡ and Rule 11 are unified according to Rule 3. ∆, and for the second one is Qtp2 = {δLinkedM DB }. Rule The last analysis from the Shared Variable Analysis per- 9 determines that δDBpedia is external relevant for tp1 since spective considers the triple patterns which have the same resources ?film can be found in δLinkedM DB and there is a subject called Subject Sharing Triple Patterns Analysis. This linkset between these two datasets with owl:sameAs predi- rule does not use Linking Analysis perspective, and thus it cate. It is clear that no other dataset can contain an appro- does not contain Internal and External Linking Analysis. priate triple if it is not linked to LinkedMDB by owl:sameAs. Consider the following example triple patterns for Subject At the end of Chaining Triple Pattern Analysis, internal and Sharing Triple Pattern Analysis: external relevant datasets specified by Rule 8 and 9 are uni- ?city dbpprop:name “Izmir” . fied according to Rule 3. ?city dc:terms ?subject. Another analysis which uses both Linking Analysis and According to our dataset definition, triples which have the Shared Variable Analysis is Object Sharing Triple Patterns same subject are included by the same dataset. Based on Analysis. For triple patterns which have the same object this, Rule 12 infers datasets which are relevant for triple pat- variable, only the datasets can include triples which satisfy terns by taking the intersection of selected sets into account. 
the object variable of both triple patterns. To simplify the Assume that selected sets for triple patterns are Qtp1 = explanation, we use the following example for Object Shar- {δDBpedia } and Qtp2 ≡ ∆. According to the rule, the fi- ing Triple Pattern Analysis below: nal datasets according to this rule are Qnew tp1 ≡ Qnew tp2 = ?person facebook:likes ?movie. {δDBpedia }. ?film owl:sameAs ?movie. From internal linking point of view, ?person and ?film Rule 12. If there is a triple pattern whose subject is same should be in the same dataset with ?movie. Rule 10 speci- with the subject of another triple pattern, then the datasets fies the internal relevant datasets for triple patterns which included by selected sets of both triple patterns are relevant. have the same object. Assume that Qtp1 = {δF acebook } due to value of vocabulary property of Facebook VOID and ∀δx , tpi , tpj ((stpi = stpj )∧(δx ∈ (Qtpi ∩Qtpj )) → ρ(δx , tpi )∧ Qtp2 ≡ ∆. This rule determines that only δF acebook can ρ(δx , tpj )) where stpi ∈ V contain internal triples that satisfy triple patterns together. Up to this point, we have given the relevant dataset dis- Rule 10. If there is a triple pattern whose object is same covery rules which are used to determine relevant datasets with the object of another one, then the datasets included by from different perspectives. These rules are executed to- selected sets of both triple patterns are internal relevant. gether for a query to make a complete analysis. Next section introduces the analysis process which specifies the execution ∀δx , tpi , tpj ((otpi = otpj ) ∧ (δx ∈ Qtpi ) ∧ (δx ∈ Qtpj ) → order for rules. ρint (δx , tpi ) ∧ρint (δx , tpj )) where otpi ∈ V 3.1.2 Analysis Process To execute Object Sharing Analysis from external point The rules introduced above should be executed together of view, we benefit from the linkset descriptions. To find ap- to provide effective dataset selection. 
In this section, exe- propriate link triples, ?person and ?film should be in different cution of rules are explained in the process of DatasetAn- datasets. Rule 11 determines external relevant dataset for alyzer which is shown Figure 3.3. In the figure, QBGP = triple patterns which have the same object. This rule consid- {Qtpi |tpi ∈ BGP } is the set of selected sets of all triple pat- ers two linkset descriptions together for two triple patterns. terns in a query. We use Qinit BGP to represent the initial state where selected set of each triple pattern contains the whole Rule 11. If two triple patterns have the same object, and web of data, ∀Qtpi ∈ QinitBGP (Qtpi ≡ ∆). This set is the in- there are two linkset descriptions which have the same ref- put of the single step analysis, and selected sets in this set are erenced dataset each of which is compatible with one of the constrained by execution of the rules includes the IRI-based Algorithm 1 This algorithm divides query into sub-triples FUNCTION DivideT riples() INPUT bgp = {tp1 , . . . , tpn } including n triple patterns; LET i := 1, α := 1; Figure 3.3: WoDQA Analysis Process LET SubT riplesα := {tpi }; WHILE i < n DO IF Qtpi+1 = Qtpi THEN analyses. Single step analysis includes vocabulary match, LET SubT riplesα := SubT riplesα ∪ {tpi+1 }; linking-to-IRI and IRI-links-to rules which are executed only ELSE once. The reason is that the rules based on IRI-based analy- LET SubT riplesα+1 := {tpi+1 }; sis produce the same result for every execution because they LET α := α + 1; do not depend on current Qtpi . Although the rules in sin- LET i := i + 1; gle step analysis do not have a specific order, using output of each rule in dataset elimination method reduces current datasets set of the triple patterns. Then, Qconstrained BGP is sents the federated form of the initial query which contains given to the repetitive analysis phase. ordered service graph patterns. 
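Algorithm 1 groups consecutive triple patterns whose selected sets are equal into the same sub-query. A minimal executable sketch of it in Python (the pattern and selected-set representations are illustrative; WoDQA works on Jena query structures):

```python
# Algorithm 1 (DivideTriples) in executable form: consecutive triple
# patterns with the same selected set Q_tp fall into one sub-query.
def divide_triples(bgp, Q):
    """bgp: ordered list of triple patterns; Q: maps each pattern to its
    selected set of datasets. Returns the SubTriples groups in order."""
    groups = [[bgp[0]]]
    for prev, cur in zip(bgp, bgp[1:]):
        if Q[cur] == Q[prev]:
            groups[-1].append(cur)   # same selected set: extend the sub-query
        else:
            groups.append([cur])     # selected set changed: start a new one
    return groups

# Selected sets from the usage scenario of Section 4 (illustrative labels).
Q = {"tp1": {"Facebook"}, "tp2": {"LinkedMDB"},
     "tp3": {"DBpedia"}, "tp4": {"DBpedia"}, "tp5": {"DBpedia"}}
groups = divide_triples(["tp1", "tp2", "tp3", "tp4", "tp5"], Q)
# groups == [['tp1'], ['tp2'], ['tp3', 'tp4', 'tp5']]
```

Note that only consecutive patterns are merged: if tp3 appeared between tp1 and tp2, the DBpedia patterns would be split into two sub-queries, since WoDQA does not change the evaluation order of the query.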
On the other hand, the rules based on Shared Variable Analysis take more than one triple pattern into consideration. Since different combinations of triple patterns affect each other's selected sets, these rules depend on the current selected sets of the triple patterns to discover the datasets relevant to them. For this reason, they are executed repetitively until no dataset is eliminated from any Q_tpi. The repetitive analysis phase produces Q^new_BGP by executing the Shared Variable Analysis rules. After the phase has completed once, if an elimination has been made, Q^new_BGP is given back to the repetitive analysis as Q^constrained_BGP and the phase is repeated. When the rules no longer change any selected set, i.e. Q^new_BGP ≡ Q^constrained_BGP, the analysis is finished and the result is produced as Q^final_BGP.

3.2 Query Reorganizer

The QueryReorganizer module is responsible for rewriting a query using the final selected set of each triple pattern (Q^final_BGP) decided by the DatasetAnalyzer. While rewriting a federated query, the Query Reorganizer conforms to the SPARQL 1.1 federation extension (http://www.w3.org/TR/sparql11-federated-query/).

While the initial query is a set of triple patterns, the QueryReorganizer divides the query into sub-queries and turns it into a set of service graph patterns, each of which is represented by a tuple sgpα = ⟨Srvα, SubTpα⟩. A service graph pattern consists of a Srv set, which includes the datasets to which the sub-query is sent (in the implementation, the SPARQL endpoints of the datasets are used in SERVICE expressions), and a SubTp set, which is a subset of the triple patterns of the initial query, i.e. a sub-query. Performing a service graph pattern means unifying the results of the sub-query over each dataset in Srvα.

Triple patterns which have the same selected set (Q_tpi) are added to the same service graph pattern to decrease networking cost. However, only consecutive triple patterns can be in the same sub-query, because WoDQA does not change the evaluation order of the query. In this direction, the elements of the SubTp sets of the service graph patterns are found by Algorithm 1. The datasets of a sub-query are formalized as Srvα = {δx | δx ∈ Q_tpi, tpi ∈ SubTpα}, and the service endpoint URLs are procured from the VOID documents of the datasets. The output of the Query Reorganizer is shown as ReorganizedBGP = ⟨sgp1, ..., sgpm⟩, where 1 ≤ m ≤ n and n is the number of triple patterns of the initial query. ReorganizedBGP represents the federated form of the initial query, which contains ordered service graph patterns. WoDQA executes the reorganized query using the Jena ARQ query engine.

Although WoDQA does not include the query optimization phase of the query federation approach, besides grouping triple patterns it applies the moving up FILTER expressions optimization technique [2], so that intermediate results are filtered as early as possible. Furthermore, WoDQA supports queries that include the UNION and OPTIONAL keywords, but queries that include the GRAPH keyword or blank nodes are not supported.

4. USAGE SCENARIO

There are two ways for users to benefit from WoDQA. The first one is the SPARQL endpoint of WoDQA (the up-to-date endpoint address can be found on the http://seagent.ege.edu.tr/etmen/wodqa.html page), which can be used to redirect raw queries. One can construct a SPARQL query with a SERVICE block containing the raw query and use the WoDQA SPARQL endpoint as the remote service. When this query is executed, the WoDQA endpoint is invoked; it transforms the query into a federated form by means of WoDQA and executes this federated query on the relevant datasets transparently to the user. The other way is to use the web form of WoDQA (http://seagent.ege.edu.tr/etmen/wodqa.html).

In this section, a sample query execution on the web form of WoDQA is explained. The reorganized form of the query, the results of select and construct queries, and the execution time can be observed in this form.

The example query seen in the WoDQA web form in Figure 4.1 searches for an answer to "Which Facebook users like movies which are produced by a German producer?". This query is represented as BGP = ⟨tp1, ..., tp5⟩ where
tp1 = ⟨?faceUser, facebook:likes, ?movie⟩,
tp2 = ⟨?movie, linkedMDB:producer, ?producer⟩,
tp3 = ⟨?dbProducer, owl:sameAs, ?producer⟩,
tp4 = ⟨?anyMovie, dbpo:producer, ?dbProducer⟩,
tp5 = ⟨?dbProducer, dbpo:birthPlace, dbpedia:Germany⟩.

We explain how the relevant datasets are found according to the WoDQA analysis process introduced in Figure 3.3. Initially, the single step analysis phase is performed for this query. Irrelevant datasets are eliminated from the selected sets of tp1 and tp2 via the output of the predicate vocabulary match, so at the end of the single step analysis they are Q_tp1 = {δFacebook} and Q_tp2 = {δLinkedMDB}. No relevant dataset is found for tp3 in the single step analysis, because owl:sameAs is a generic property and owl is not defined as a vocabulary property in the VOIDs; thus Q_tp3 still includes all datasets (∆). The predicate vocabulary match discovers the relevant datasets for tp4 and tp5, and their selected sets are Q_tp4 = Q_tp5 = {δDBpedia}.

After the single step analysis is applied to all triple patterns, the repetitive analysis phase is performed, and the selected set of tp3 is constrained in this phase. Subject Sharing Triple Patterns Analysis discovers the relevant datasets for tp3, since tp3 and tp5 have the same subject variable; thus its selected set becomes Q_tp3 = {δDBpedia}.

The reorganized query shown in Figure 4.1 is rendered from these analysis results by the QueryReorganizer and formalized as ReorganizedBGP = ⟨sgp1, sgp2, sgp3⟩ where
sgp1 = ⟨{δFacebook}, {tp1}⟩,
sgp2 = ⟨{δLinkedMDB}, {tp2}⟩,
sgp3 = ⟨{δDBpedia}, {tp3, tp4, tp5}⟩.

Figure 4.1: Example query execution with WoDQA

After the reorganizing process, the triple patterns in the service graph patterns are executed on the related endpoints by means of Jena ARQ, and the query results are incrementally collected. In conclusion, the results related to the query are listed at the bottom of the web form page, as seen in Figure 4.1.

5. CONCLUSION

In this paper, we have introduced a query federation engine called WoDQA that discovers the datasets related to a query in a VOID store and distributes the query over these datasets. The novelty of our approach is its exhaustive dataset selection mechanism, which includes analysis of triple pattern relations and of links between datasets, besides analyzing datasets for each triple pattern. WoDQA focuses on discovering relevant datasets and eliminating irrelevant ones using the rule-based approach introduced in this paper. To select datasets effectively, our approach requires VOID descriptions which include a SPARQL endpoint to query the dataset, reflect the actual content of the dataset completely and accurately, and include linksets between datasets. WoDQA allows users to construct raw queries without needing to know how the query will be divided into sub-queries and where the sub-queries will be executed. Query results are complete under the assumption of available, accurate and complete VOID descriptions of the datasets.

The initial version of WoDQA introduced in this paper has some disadvantages arising from the query federation approach on which WoDQA builds. As mentioned previously, follow-your-nose has problems such as missing results and large document retrieval; similar problems may occur for query federation. Firstly, to find complete results for queries, the metadata of all datasets must be well-defined and accurate. But to provide such accurate dataset metadata, an automated mechanism which continuously updates the metadata is required. However, even if there were a tool which implements this requirement, providing accurate dataset metadata via such a tool would remain the responsibility of the dataset publishers. Other problems of query federation are high latency and low selectivity of datasets, which are analogous to the retrieval of large documents in follow-your-nose. Query optimization can be a solution to these problems: grouping triple patterns to filter more triples on an endpoint can reduce latency (required processing time), and changing the query evaluation order according to dataset selectivity statistics can prevent retrieving large result sets. To make WoDQA function in the wild, the optimization step of query federation needs to be implemented. We plan to incorporate triple pattern selectivity into query reorganization using the VOID properties about statistics.

On the other hand, we could not evaluate our approach in this paper, since the VOID documents in current VOID stores are not well-defined. Because SPARQL endpoint definitions, linkset descriptions or vocabularies are missing in most VOID documents, we did not have the chance to execute comprehensive scenarios. Developing a tool which extracts well-defined VOID descriptions of datasets, and by this means evaluating our approach, is required future work to confirm the applicability of WoDQA to linked open data. Also, evaluating the analysis cost of WoDQA for a large VOID store will become possible when well-defined VOIDs are constructed.

Table 1: Prefix definitions
rdf:, dbpedia:, dbpprop:, linkedMDB:, owl:, dc:, foaf:, facebook:

6. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets - On the Design and Usage of voiD, the 'Vocabulary of Interlinked Datasets'. In WWW 2009 Workshop: Linked Data on the Web (LDOW2009), Madrid, Spain, 2009.
[2] A. Bernstein, C. Kiefer, and M. Stocker. OptARQ: A SPARQL Optimization Approach based on Triple Pattern Selectivity Estimation. Technical Report ifi-2007.03, Department of Informatics, University of Zurich, 2007.
[3] C. Bizer, T. Heath, D. Ayers, and Y. Raimond. Interlinking open data on the web. www4.wiwiss.fu-berlin.de/bizer/pub/LinkingOpenData.pdf, 2007. Retrieved 12.5.2009.
[4] P. Bouquet, C. Ghidini, and L. Serafini. Querying the web of data: A formal approach. In Proceedings of the 4th Asian Conference on The Semantic Web (ASWC '09), pages 291-305, Berlin, Heidelberg, 2009. Springer-Verlag.
[5] C. Buil, M. Arenas, and Ó. Corcho. Semantics and Optimization of the SPARQL 1.1 Federation Extension. In Proc. of the 8th Extended Semantic Web Conference (ESWC 2011), Heraklion, Crete, Greece, volume 6644 of Lecture Notes in Computer Science, pages 1-15. Springer, 2011.
[6] M. Duerst and M. Suignard. Internationalized Resource Identifiers (IRIs). RFC 3987 (Proposed Standard), January 2005.
[7] O. Görlitz and S. Staab. Federated data management and query optimization for linked open data. In A. Vakali and L. Jain, editors, New Directions in Web Data Management 1, volume 331 of Studies in Computational Intelligence, pages 109-137. Springer Berlin / Heidelberg, 2011.
[8] O. Görlitz and S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. In Proceedings of the 2nd International Workshop on Consuming Linked Data, Bonn, Germany, 2011.
[9] P. Haase, T. Mathäß, and M. Ziller. An evaluation of approaches to federated query processing over linked data. In Proceedings of the 6th International Conference on Semantic Systems (I-SEMANTICS '10), pages 5:1-5:9, New York, NY, USA, 2010. ACM.
[10] O. Hartig. Zero-knowledge query planning for an iterator implementation of link traversal based query execution. In G. Antoniou, M. Grobelnik, E. Paslaru Bontas Simperl, B. Parsia, D. Plexousakis, P. De Leenheer, and J. Pan, editors, ESWC (1), volume 6643 of Lecture Notes in Computer Science, pages 154-169. Springer, 2011.
[11] O. Hartig, C. Bizer, and J. C. Freytag. Executing SPARQL queries over the web of linked data. In International Semantic Web Conference, pages 293-309, 2009.
[12] O. Hartig and A. Langegger. A Database Perspective on Consuming Linked Data on the Web. Datenbankspektrum, Semantic Web Special Issue, July 2010.
[13] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, San Rafael, CA, 1st edition, 2011.
[14] B. Quilitz and U. Leser. Querying Distributed RDF Data Sources with SPARQL. In S. Bechhofer, M. Hauswirth, J. Hoffmann, and M. Koubarakis, editors, The Semantic Web: Research and Applications, volume 5021 of Lecture Notes in Computer Science, chapter 39, pages 524-538. Springer Berlin / Heidelberg, 2008.
[15] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: A federation layer for distributed query processing on linked open data. In G. Antoniou, M. Grobelnik, E. Simperl, B. Parsia, D. Plexousakis, P. De Leenheer, and J. Pan, editors, The Semantic Web: Research and Applications, volume 6644 of Lecture Notes in Computer Science, pages 481-486. Springer Berlin / Heidelberg, 2011.
[16] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization techniques for federated query processing on linked data. In L. Aroyo, C. Welty, H. Alani, J. Taylor, A. Bernstein, L. Kagal, N. Noy, and E. Blomqvist, editors, The Semantic Web - ISWC 2011, volume 7031 of Lecture Notes in Computer Science, pages 601-616. Springer Berlin / Heidelberg, 2011.
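As a closing illustration of the Query Reorganizer described in Section 3.2, the sketch below renders service graph patterns as SPARQL 1.1 federated query text. The endpoint URLs and triple patterns are invented for the example, PREFIX declarations are omitted, and WoDQA itself builds Jena ARQ query structures rather than raw query strings:

```python
# Render service graph patterns (endpoint, triple patterns) as a SPARQL 1.1
# federated query string. All endpoints and patterns below are illustrative.
def render_federated(select_vars, sgps):
    lines = ["SELECT %s WHERE {" % " ".join(select_vars)]
    for endpoint, triple_patterns in sgps:
        lines.append("  SERVICE <%s> {" % endpoint)   # one block per sub-query
        for s, p, o in triple_patterns:
            lines.append("    %s %s %s ." % (s, p, o))
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)

query = render_federated(
    ["?faceUser"],
    [("http://example.org/facebook/sparql",
      [("?faceUser", "facebook:likes", "?movie")]),
     ("http://example.org/linkedmdb/sparql",
      [("?movie", "linkedMDB:producer", "?producer")])])
```

Each SERVICE block corresponds to one sgpα; a dataset appearing in Srvα is addressed through its SPARQL endpoint, procured from its VOID document.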