Faceted Views over Large-Scale Linked Data

                                                           Orri Erling
                                                    OpenLink Software, Inc.
                                                    10 Burlington Mall Road
                                                           Suite 265
                                                     Burlington,MA 01803
                                                             U.S.A.
                                                oerling@openlinksw.com

ABSTRACT
Faceted views over structured and semi-structured data have been popular in user interfaces for some years. Deploying such views of arbitrary linked data at arbitrary scale has been hampered by lack of suitable back end technology. Many ontologies are also quite large, with hundreds of thousands of classes.

Also, the linked data community has been concerned with the processing cost and potential for denial of service presented by public SPARQL end points.

This paper discusses how we use Virtuoso Cluster Edition for providing interactive browsing over billions of triples, combining full text search, structured querying and result ranking. We discuss query planning, run time inferencing and partial query evaluation. This functionality is exposed through SPARQL, a specialized web service and a web user interface.

Categories and Subject Descriptors
H.5.4 [Information Systems]: Hypertext/Hypermedia; H.2.8 [Information Systems]: Database Applications

Keywords
Faceted Views, Linked Data, SPARQL, OpenLink Virtuoso, partial query evaluation, entity ranking, large ontologies

Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION
The transition of the web from a distributed document repository into a universal, ubiquitous database requires a new dimension of scalability for supporting rich user interaction. If the web is the database, then it also needs a query and report writing tool to match. A faceted user interaction paradigm has been found useful for aiding discovery and query of variously structured data. Numerous implementations exist but they are chiefly client side and are limited in the data volumes they can handle.

At the present time, linked data is well beyond prototypes and proofs of concept. This means that what was done in limited specialty domains before must now be done at real world scale, in terms of both data volume and ontology size. On the schema, or T box side, there exist many comprehensive general purpose ontologies such as Yago [1], Open CYC [2], Umbel [3] and the DBpedia [4] ontology and many domain specific ones, such as [5]. For these to enter into the user experience, the platform must be able to support the user's choice of terminology or terminologies as needed, preferably without blow up of data and concomitant slowdown.

Likewise, in the LOD world, many link sets have been created for bridging between data sets. Whether such linkage is relevant will depend on the use case. Therefore we provide fine grained control over which owl:sameAs assertions will be followed, if any.

Against this background, we discuss how we tackle incremental interactive query composition on arbitrary data with Virtuoso Cluster [6].

Using SPARQL or a web/web service interface, the user can form combinations of text search and structured criteria, including joins to an arbitrary depth. If queries are precise and select a limited number of results, the results are complete. If queries would select tens of millions of results, partial results are shown.

The system being described is being actively developed as of this writing, early March of 2009, and is online at lod.openlinksw.com. The data set is a combination of DBpedia, MusicBrainz, Freebase, web crawls from www.pingthesemanticweb.com, UniProt, Neurocommons and Bio2RDF.

The hardware consists of two 8-core servers with 16 GB of RAM and 4 disks each. The system runs on Virtuoso 6 Cluster Edition. All application code is written in SQL procedures with limited client side Ajax; the Virtuoso platform itself is in C.

The facets service allows the user to start with a text search or a fixed URI and to refine the search by specifying classes, property values etc., on the selected subjects or any subjects referenced therefrom.

This process generates queries involving combinations of text and structured criteria, often dealing with property and class hierarchies and often involving aggregation over millions of subjects, especially at the initial stages of query composition. To make this work within interactive time, two things are needed:

1. a query optimizer that can almost infallibly produce the right join order based on cardinalities of the specific constants in the query
2. a query execution engine that can return partial results after a timeout.

It is often the case, especially at the beginning of query formulation, that the user only needs to know if there are relatively many or few results that are of a given type or
involve a given property. Thus partially evaluating a query is often useful for producing this information. This must however be possible with an arbitrary query; simply citing precomputed statistics is not enough.

It has for a long time been a given that any search-like application ranks results by relevance. Whenever the facets service shows a list of results, not an aggregation of result types or properties, it is sorted on a composite of text match score and link density.

The paper is divided into the following parts:

     • SPARQL query optimization and execution adapted for run time inference over large subclass structures
     • Resolving identity with inverse functional properties
     • Ranking entities based on graph link density
     • SPARQL partial query evaluation for displaying partial results in fixed time
     • A facets web service providing an XML interface for submitting queries, so that the user interface is not required to parse SPARQL
     • A sample web interface for interacting with this
     • Sample queries and their evaluation times against combinations of large LOD data sets

2. PROCESSING LARGE HIERARCHIES IN SPARQL
Virtuoso has for a long time had built-in superclass and superproperty inference. This is enabled by specifying the define input:inference "context" option, where context is previously declared to be all subclass, subproperty, equivalence, inverse functional property and same as relations defined in a given graph. The ontology file is loaded into its own graph and this is then used to construct the context. Multiple ontologies and their equivalences can be loaded into a single graph which then makes another context which holds the union of the ontology information from the merged source ontologies.

Let us consider a sample query combining a full text search and a restriction on the class of the desired matches:

define input:inference "yago"
prefix cy: <http://dbpedia.org/class/yago/>
select distinct ?s1 as ?c1,
  (bif:search_excerpt (
    bif:vector ('Shakespeare'), ?o1 ) ) as ?c2
where {
    ?s1 ?s1textp ?o1 .
    filter (bif:contains (?o1, '"Shakespeare"')) .
    ?s1 a cy:Performer110415638 .
  } limit 20

This selects all Yago performers that have a property that contains "Shakespeare" as a whole word.

The define input:inference "yago" clause means that subclass, subproperty and inverse functional property statements contained in the inference context called yago are considered when evaluating the query. The built-in function bif:search_excerpt makes a search engine style summary of the found text, highlighting occurrences of Shakespeare. The bif:contains function in the filter specifies the full text search condition on ?o1.

This query is a typical example of queries that are executed all the time when a user refines a search. We will now look at how we can make an efficient execution plan for the query. First, we must know the cardinalities of the search conditions.

To see the count of subclasses of Yago performer, we can do:

prefix cy: <http://dbpedia.org/class/yago/>
select count (*)
from <http://dbpedia.org/yago.owl>
where {
    ?s rdfs:subClassOf cy:Performer110415638
        option (transitive, t_distinct) }

There are 4601 distinct subclasses, including indirect ones.

Next we look at how many Shakespeare mentions there are:

select count (*) where {
    ?s ?p ?o .
    filter (bif:contains (?o, 'Shakespeare')) }

There are 10267 subjects with Shakespeare mentioned in some literal.

define input:inference "yago"
prefix cy: <http://dbpedia.org/class/yago/>
select count (*) where {
    ?s1 a cy:Performer110415638 . }

There are 184885 individuals that belong to some subclass of performer.

This is the data that the SPARQL compiler must know in order to have a valid query plan. Since these values will vary wildly depending on the specific constants in the query, the actual database must be consulted as needed while preparing the execution plan. This is regular query processing technology but is now specially adapted for deep subclass and subproperty structures.

Conditions in the queries are not evaluated twice, once for the cardinality estimate and once for the actual run. Instead, the cardinality estimate is a rapid sampling of the index trees that reads at most one leaf page.

Consider a B tree index, which we descend from the top to the leftmost leaf containing a match of the condition. At each level, we count how many children would match and always select the leftmost one. When we reach a leaf, we see how many entries are on the page. From these observations, we extrapolate the total count of matches.

With this method, the guess for the count of performers is 114213, which is acceptably close to the real number.

Given these numbers, we see that it makes sense to first find the full text matches and then retrieve the actual classes of each and see if this class is a subclass of performer. This last check is done against a memory resident copy of the Yago hierarchy, the same copy that was used for enumerating the subclasses of performer.

However, the query

define input:inference "yago"
prefix cy: <http://dbpedia.org/class/yago/>
select distinct ?s1 as ?c1,
  (bif:search_excerpt (
      bif:vector ('Shakespeare'), ?o1 ) ) as ?c2
where {
    ?s1 ?s1textp ?o1 .
    filter (bif:contains (?o1, '"Shakespeare"')) .
    ?s1 a cy:ShakespeareanActors .
  }

will start with Shakespearean actors since this is a leaf class with only 74 instances and then check if the properties contain Shakespeare and return their search summaries.

In principle, this is common cost based optimization but is here adapted to deep hierarchies combined with text patterns. An unmodified SQL optimizer would have no possibility of arriving at these results.

The implementation reads the graphs designated as holding ontologies when first needed and subsequently keeps a memory based copy of the hierarchy on all servers. This is used for quick iteration over sub/superclasses or properties as well as for checking if a given class or property is a subclass/property of another. Triples with OWL predicates equivalentClass, equivalentProperty and sameAs are also cached in the same data structure if they occur in the ontology graphs.

Also, cardinality estimates for members of classes near the root of the class hierarchy take some time since a sample of each subclass is needed. These are cached for some minutes in the inference context, so that repeated queries will not redo the sampling.

3. INVERSE FUNCTIONAL PROPERTIES AND SAME AS
Especially when navigating social data, as in FOAF [7] and SIOC [8] spaces, there are many blank nodes that are identified by properties only. For this, we offer an option for automatically joining to subjects which share an IFP value with the subject being processed. For example, the query for the friends of friends of Kjetil Kjernsmo returns empty:

select count (?f2) where {
    ?s a foaf:Person ; ?p ?o ; foaf:knows ?f1 .
    ?o bif:contains "'Kjetil Kjernsmo'" .
    ?f1 foaf:knows ?f2 };

But with the option

define input:inference "b3sifp"
select count (?f2) where {
    ?s a foaf:Person ; ?p ?o ; foaf:knows ?f1 .
    ?o bif:contains "'Kjetil Kjernsmo'" .
    ?f1 foaf:knows ?f2 };

we get 4022. We note that there are many duplicates since the data is blank nodes only, with people easily represented 10 times. The context b3sifp simply declares that foaf:name and foaf:mbox_sha1sum should be treated as inverse functional properties (IFPs). The name is not an IFP in the actual sense but treating it as such for the purposes of this one query makes sense, otherwise nothing would be found.

This option is controlled by the choice of the inference context, which is selectable in the interface discussed below.

The IFP inference can be thought of as a transparent addition of a subquery into the join sequence. The subquery joins each subject to its synonyms given by sharing IFPs. This subquery has the special property that it has the initial binding automatically in its result set. It could be expressed as:

select ?f where {
    ?k foaf:name "Kjetil Kjernsmo" .
      { select ?org ?syn where {
            ?org ?p ?key .
            ?syn ?p ?key .
            filter ( bif:rdf_is_sub ("b3sifp", ?p,
                 <b3s:any_ifp>, 3) &&
              ?syn != ?org ) }
      } option (transitive,
          t_in (?org), t_out (?syn), t_min (0), t_max (1) )
    filter (?org = ?k) .
    ?syn foaf:knows ?f . }

It is true that each subject shares IFP values with itself but the transitive construct with 0 minimum and 1 maximum depth allows passing the initial binding of ?org directly to ?syn, thus getting first results more rapidly. The rdf_is_sub function is an internal function that simply tests whether ?p is a subproperty of b3s:any_ifp.

Internally, the implementation has a special query operator for this and the internal form is more compact than would result from the above but the above could be used to the same effect.

The issues of run time vs precomputed identity inference through IFPs and owl:sameAs are discussed in much more detail in [9].

Our general position is that identity criteria are highly application specific and thus we offer the full spectrum of choice between run time and precomputing. Further, weaker identity statements than sameness are difficult to use in queries, thus we prefer identity with semantics of owl:sameAs but make this an option that can be turned on and off query by query.

4. ENTITY RANKING
It is a common end user expectation to see text search results sorted by their relevance. The term entity rank refers to a quantity describing the relevance of a URI in an RDF graph.

This is a sample query using entity rank:

prefix yago: <http://dbpedia.org/class/yago/>
prefix prop: <http://dbpedia.org/property/>
select distinct ?s2 as ?c1 where {
    ?s1 ?s1textp ?o1 .
    ?o1 bif:contains 'Shakespeare' .
    ?s1 a yago:Writer110794014 .
    ?s2 prop:writer ?s1 .
  } order by desc (<LONG::IRI_RANK> (?s2))
  limit 20 offset 0

This selects works where a writer with Shakespeare in some property is the writer.

Here the query returns subjects, thus no text search sum-
maries, so only the entity rank of the returned subject is used. We order text results by a composite of text hit score and entity rank of the RDF subject where the text occurs. The entity rank of the subject is defined by the count of references to it, weighted by the rank of the referrers and the outbound link count of referrers. Such techniques are used in text based information retrieval [15].

One interesting application of entity rank and inference on IFPs and owl:sameAs is in locating URIs for reuse. We can easily list synonym URIs in order of popularity as well as locate URIs based on associated text. This can serve in applications such as the Entity Name Server [14].

Entity ranking is one of the few operations where we take a precomputing approach. Since a rank is calculated based on a possibly long chain of references, there is little choice but to precompute. The precomputation itself is straightforward enough: First all outbound references are counted for all subjects. Next all ranks of subjects are incremented by 1 over the referrer's outbound link count. On successive iterations, the increment is based on the rank increment the referrer received in the previous round.

The operation is easily partitioned, since each partition increments the ranks of subjects it holds. The referrers are spread throughout the cluster, though. When rank is calculated, each partition accesses every other partition. This is done with relatively long messages: referee ranks are accessed in batches of several thousand at a time, thus absorbing network latency.

On the test system, this operation performs a single pass over the corpus of 2.2 billion triples and 356 million distinct subjects in about 30 minutes. The operation has 100% utilization of all 16 cores. Adding hardware would speed it up, as would implementing it in C instead of the SQL procedures it is written in at present.

The main query in rank calculation is

select O, P, iri_rank (S)
from rdf_quad table option (no cluster)
where isiri_id(O) order by O;

This is the SQL cursor iterated over by each partition. The no cluster option means that only rows in this process' partition are retrieved. The RDF_QUAD table holds the RDF quads in the store, i.e. triple plus graph. The S, P, O columns are the subject, predicate and object respectively. The graph column is not used here. The iri_rank function is a partitioned SQL function. This works by using the S argument to determine which cluster node should run the function. The specifics of the partitioning are declared elsewhere. The calls are then batched for each intended recipient and sent when the batches are full. The SQL compiler automatically generates the relevant control structures. This is like an implicit map operation in the map-reduce terminology.

An SQL procedure loops over this cursor, adds up the rank and when seeing a new O, the added rank is persisted into a table. Since links in RDF are typed, we can use the semantics of the link to determine how much rank is transferred by a reference. With extraction of named entities from text content, we can further place a given entity into a referential context and use this as a weighting factor. This is to be explored in future work. The experience thus far shows that we greatly benefit from Virtuoso being a general purpose DBMS, as we can create application specific data structures and control flows where these are efficient. For example, it would make little sense to store entity ranks as triples due to space consumption and locality considerations. With these tools, the whole ranking functionality took under a week to develop.

5. QUERY EVALUATION TIME LIMITS
When scaling the Linked Data model, we have to take it as a given that the workload will be unexpected and that the query writers will often be unskilled in databases. Insofar as possible, we wish to promote the forming of a culture of creative reuse of data. To this effect, even poorly formulated questions deserve an answer that is better than just a timeout.

If a query produces a steady stream of results, interrupting it after a certain quota is simple. However, most interesting queries do not work in this way. They contain aggregation, sorting, maybe transitivity.

When evaluating a query with a time limit in a cluster setup, all nodes monitor the time left for the query. When dealing with a potentially partial query to begin with, there is little point in transactionality. Therefore the facet service uses read committed isolation. A read committed query will never block since it will see the before-image of any transactionally updated row. There will be no waiting for locks and timeouts can be managed locally by all servers in the cluster.

Thus, when having a partitioned count, for example, we expect all the partitions to time out around the same time and send a ready message with the timeout information to the cluster node coordinating the query. The condition raised by hitting a partial evaluation time limit differs from a run time error in that it leaves the query state intact on all participating nodes. This allows the timeout handling to come fetch any accumulated aggregates.

Let us consider the query for the top 10 classes of things with "Shakespeare" in some literal. This is typical of the workload generated by the faceted browsing web service:

define input:inference "yago"
select ?c count (*) where {
    ?s a ?c ; ?p ?o .
    ?o bif:contains "Shakespeare" .
  } group by ?c order by desc 2 limit 10

On the first execution with an entirely cold cache, this times out after 2 seconds and returns:

yago:class/yago/Entity100001740                     566
yago:class/yago/PhysicalEntity100001930             452
yago:class/yago/Object100002684                     452
yago:class/yago/Whole100003553                      449
yago:class/yago/Organism100004475                   375
yago:class/yago/LivingThing100004258                375
yago:class/yago/CausalAgent100007347                373
yago:class/yago/Person100007846                     373
yago:class/yago/Abstraction100002137                150
yago:class/yago/Communicator109610660               125

The next repeat gets about double the counts, starting with 1291 entities.

With a warm cache, the query finishes in about 300 ms (4 core Xeon, Virtuoso 6 Cluster) and returns:
                                                                     • Enter in the search form “Napoleon’:
yago:class/yago/Entity100001740                   13329
yago:class/yago/PhysicalEntity100001930           10423           <query inference="" same-as="" view3=""
yago:class/yago/Object100002684                   10408             s-term="e" c-term="type">
yago:class/yago/Whole100003553                    10210             <text>napoleon</text>
yago:class/yago/LivingThing100004258               8868             <view type="text" limit="20" offset="" />
yago:class/yago/Organism100004475                  8868           </query>
yago:class/yago/CausalAgent100007347               8853
                                                                     • Select the “types” view:
yago:class/yago/Person100007846                    8853
yago:class/yago/Abstraction100002137               3284           <query inference="" same-as="" view3=""
yago:class/yago/Entertainer109616922               2356             s-term="e" c-term="type">
                                                                    <text>napoleon</text>
   It is a well known fact that running from memory is thou-        <view type="classes" limit="20" offset="0"
sands of times faster than from disk.                                 location-prop="0" />
   The query plan begins with the text search. The subjects       </query>
with “Shakespeare” in some property get dispatched to the
                                                                     • Choose “MilitaryConflict” type:
partition that holds their class. Since all partitions know the
class hierarchy, the superclass inference runs in parallel, as    <query inference="" same-as="" view3=""
does the aggregation of the group by. When all partitions           s-term="e" c-term="type">
have finished, the process coordinating the query fetches the       <text>napoleon</text>
partial aggregates, adds them up and sorts them by count.           <view type="classes" limit="20" offset="0"
   If a timeout occurs, it will most likely occur where the           location-prop="0" />
classes of the text matches are being retrieved. When this          <class iri="yago:ontology/MilitaryConflict" />
happens, this part of the query is reset, but the aggregate       </query>
states are left in place. The process coordinating the query
                                                                     • Choose “NapoleonicWars”:
then goes on as if the aggregates had completed. If there are
many levels of nested aggregates, each timeout terminates         <query inference="" same-as="" view3=""
the innermost aggregation that is still accumulating results,       s-term="e" c-term="type">
thus a query is guaranteed to return in no more than n              <text>napoleon</text>
timeouts, where n is the number of nested aggregations or           <view type="classes" limit="20" offset="0"
subqueries.                                                           location-prop="0" />
  <class iri="yago:ontology/MilitaryConflict" />
  <class iri="yago:class/yago/NapoleonicWars" />
</query>

6. FACETS WEB SERVICE
   The Virtuoso Facets web service is a general purpose RDF query
facility for facet based browsing. It takes an XML description of the
view desired and generates the reply as an XML tree containing the
requested data. The user agent or a local web page can use XSLT for
rendering this for the end user. The selection of facets and values is
represented as an XML tree. The rationale for this is that such a
representation is easier to process in an application than the SPARQL
source text or a parse tree of SPARQL, and it more compactly captures
the specific subset of SPARQL needed for faceted browsing. All such
queries internally generate SPARQL and the SPARQL generated is returned
with the results. One can therefore use this as a starting point for
hand crafted queries.
   The query has the top level element <query>. The child elements of
this element represent conditions pertaining to a single subject. A
join is expressed with the property or property-of element. This has in
turn children which state conditions on a property of the first
subject. Property and property-of elements can be nested to an
arbitrary depth and many can occur inside one containing element. In
this way, tree-shaped structures of joins can be expressed.
   Expressing more complex relationships, such as intermediate
grouping, subqueries, arithmetic or such requires writing the query in
SPARQL. The XML format is for easy automatic composition of queries
needed for showing facets, not a replacement for SPARQL.
   Consider composing a map of locations involved with Napoleon. Below
we list user actions and the resulting XML query descriptions.

   • Select “any location” in the select list beside the “map” link,
     then hit the “map” link:

<query inference="" same-as="" view3=""
  s-term="e" c-term="type">
  <text>napoleon</text>
  <class iri="yago:ontology/MilitaryConflict" />
  <class iri="yago:class/yago/NapoleonicWars" />
  <view type="geo" limit="20" offset="0"
    location-prop="any" />
</query>

   This last XML fragment corresponds to the following SPARQL query
text:

select ?location as ?c1 ?lat1 as ?c2 ?lng1 as ?c3
where {
    ?s1 ?s1textp ?o1 .
    filter (bif:contains (?o1, '"Napoleon"')) .
    ?s1 a <yago:ontology/MilitaryConflict> .
    ?s1 a <yago:class/yago/NapoleonicWars> .
    ?s1 ?anyloc ?location .
    ?location geo:lat ?lat1 ; geo:long ?lng1 .
  }
limit 200 offset 0

   The query takes all subjects with some literal property with
“Napoleon” in it, then filters for military conflicts and Napoleonic
wars, then takes all objects related to these where the related object
has a location. The map has the objects and their locations.
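   The correspondence between the XML description and the generated
SPARQL in the example above can be illustrated with a minimal sketch. It
is not the Virtuoso implementation: the function name
facet_xml_to_sparql is hypothetical, only the <text>, <class> and geo
<view> elements shown in the example are handled, and the limit and
offset are simply copied from the <view> element.

```python
import xml.etree.ElementTree as ET


def facet_xml_to_sparql(xml_text: str) -> str:
    """Translate a tiny subset of the facet XML description into SPARQL text.

    Hypothetical helper for illustration only; the real service supports
    far more (property / property-of joins, inference, other view types).
    """
    query = ET.fromstring(xml_text)
    if query.tag != "query":
        raise ValueError("top level element must be <query>")

    patterns = []
    for child in query:
        if child.tag == "text":
            # Free-text condition: some literal property of ?s1 contains the term.
            patterns.append("?s1 ?s1textp ?o1 .")
            patterns.append(f"filter (bif:contains (?o1, '\"{child.text}\"')) .")
        elif child.tag == "class":
            # Class condition on the same subject.
            patterns.append(f"?s1 a <{child.get('iri')}> .")
        elif child.tag != "view":
            raise ValueError(f"element <{child.tag}> is not handled by this sketch")

    view = query.find("view")
    if view is None or view.get("type") != "geo" or view.get("location-prop") != "any":
        raise ValueError("this sketch only handles the geo view with location-prop='any'")

    # Geo view: any object related to ?s1 that has coordinates.
    patterns.append("?s1 ?anyloc ?location .")
    patterns.append("?location geo:lat ?lat1 ; geo:long ?lng1 .")

    body = "\n".join("    " + p for p in patterns)
    return (
        "select ?location as ?c1 ?lat1 as ?c2 ?lng1 as ?c3\n"
        "where {\n"
        f"{body}\n"
        "  }\n"
        f"limit {view.get('limit')} offset {view.get('offset')}"
    )


EXAMPLE_XML = """
<query inference="" same-as="" view3="" s-term="e" c-term="type">
  <text>napoleon</text>
  <class iri="yago:ontology/MilitaryConflict" />
  <class iri="yago:class/yago/NapoleonicWars" />
  <view type="geo" limit="20" offset="0" location-prop="any" />
</query>
"""

sparql = facet_xml_to_sparql(EXAMPLE_XML)
print(sparql)
```

   Each child of <query> contributes one or two triple patterns on the
same subject ?s1, which is why a flat list of patterns suffices here;
nested property elements would instead introduce new subject variables.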
                Figure 1: The displayed result

7. VOID DISCOVERABILITY
   A long awaited addition to the LOD cloud is the Vocabulary of
Interlinked Datasets (VoID)[10]. Virtuoso automatically generates VoID
descriptions of data sets it hosts.
   Virtuoso incorporates an SQL function rdf_void_gen which returns a
Turtle representation of a given graph’s VoID statistics.

8. TEST SYSTEM AND DATA
   The test system consists of two 2x4 core Xeon 5345, 2.33 GHz servers
with 16G RAM and 4 disks each. The machines are connected by two 1Gbit
Ethernet connections. The software is Virtuoso 6 Cluster. The Virtuoso
server is split into 16 partitions, 8 for each machine. Each partition
is managed by a separate server process.
   The test database has the following data sets:

      • DBpedia 3.2
      • Musicbrainz
      • Bio2RDF
      • Neurocommons
      • Uniprot
      • Freebase (95M triples)
      • Ping The Semantic Web (1.6 million miscellaneous files
        from http://www.pingthesemanticweb.com).

     Ontologies:

      • Yago
      • Open CYC
      • Umbel
      • DBpedia

   The database is 2.2 billion triples with 356 million distinct URIs.

9. FUTURE WORK
   All the functions discussed above are presently being productized
for delivery with Virtuoso 6, with the single-server version open
source and the cluster version commercial only. The most relevant
future work is thus final debugging and tuning of existing
functionality.
   The technology will be first commercially used as a platform for an
Amazon EC2 offering of the whole LOD cloud on a cluster of servers.
This complements the existing line of data sets pre-packaged by
OpenLink[11].
   For more sophisticated, also editable user facing functionality,
OpenLink is presently working with the developers of OntoWiki[12] on
integrating the functionality discussed here into OntoWiki as a new
large-scale back-end. From this development, we expect to have the
functional equivalent of Freebase[13], except with more data, working
with open, standard data models, being more integrable and above all
having a full range of deployment options. This means anything from the
desktop to the data center, with either software as a service or
installation at end user sites as options.
   We presently rank search results on text match scores and link
density around the URIs related to the text hits. We expect that having
semantics associated with links will open new possibilities in this
domain. We plan to leverage link semantics for ranking but as of this
writing have not extensively explored this.

10. CONCLUSIONS
   We have presented a set of query processing techniques and a web
service and user interface for interactive browsing of a large corpus
of linked data. We have shown significant scalability on low cost
server hardware, with open ended scale out capacity for larger data set
sizes and more concurrent usage.
   The service described is online and is also packaged with Virtuoso 6
open source distributions.
   The technical experience derived from developing this service
emphasizes the following:

   • Central importance of a SPARQL/SQL cost model that is aware of
     hierarchies and is capable of sampling data as needed. Without the
     right execution plan, no amount of hardware will save the day.
   • The importance of enforcing a cap on resource usage.
   • The need for scale-out in order to have enough data in memory.
     Disk is a far greater bottleneck than processor or network speed.
     Scaling out in a shared nothing fashion is by far the most
     economical and scalable means of increasing total memory, disk
     bandwidth and processing power.
   • Additional verification of our capacity to schedule parallel query
     processing on a distributed memory cluster without being killed by
     latency.
   • Confirmation of the Virtuoso platform’s flexibility for building
     additional data intensive services, such as entity ranking.

   Present work is therefore concentrated on refining and productizing
the platform and its RDF applications. We believe this to be a
significant infrastructure element enabling the take-off of linked
data.
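   To put the scale-out point in concrete terms, the figures quoted in
Section 8 can be worked through as plain arithmetic. The sketch below
assumes that “16G” means 16 GiB per machine; it says nothing about
Virtuoso’s actual storage footprint per triple, only how much RAM the
test cluster has available per stored triple.

```python
# Figures taken from Section 8 of the paper.
TRIPLES = 2_200_000_000
MACHINES = 2
PARTITIONS = 16

# Assumption: "16G RAM" is read as 16 GiB per machine.
RAM_PER_MACHINE_BYTES = 16 * 2**30

# Even split of the data over the partitions (an idealisation).
triples_per_partition = TRIPLES // PARTITIONS
partitions_per_machine = PARTITIONS // MACHINES

# Total cluster RAM divided by the number of triples.
ram_bytes_per_triple = MACHINES * RAM_PER_MACHINE_BYTES / TRIPLES

print(f"triples per partition : {triples_per_partition:,}")
print(f"partitions per machine: {partitions_per_machine}")
print(f"RAM bytes per triple  : {ram_bytes_per_triple:.1f}")
```

   With under 16 bytes of RAM per triple across the whole cluster, the
working set rather than the full data set has to fit in memory, which
is consistent with the emphasis on adding machines to increase total
memory.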
11. REFERENCES
[1] Suchanek, F.M.; Kasneci, G.; Weikum, G.: YAGO: A
    Core of Semantic Knowledge Unifying WordNet and
    Wikipedia. WWW2007, ACM
    978-1-59593-654-7/07/0005.
[2] Overview of OpenCyc.
    http://www.cyc.com/cyc/opencyc/overview
[3] UMBEL Ontology, Vol. 1: Technical Documentation,
    TR 08-08-28-A1.
    http://www.umbel.org/doc/UMBELOntology_vA1.pdf
[4] Auer, S.; Bizer, C.; Lehmann, J.; Kobilarov, G.;
    Cyganiak, R.; Ives, Z.: DBpedia: A Nucleus for a Web
    of Open Data. In Aberer et al. (Eds.): The Semantic
    Web, 6th International Semantic Web Conference, 2nd
    Asian Semantic Web Conference, ISWC 2007 + ASWC
    2007, Busan, Korea, November 11-15, 2007. LNCS 4825
    Springer 2007, ISBN 978-3-540-76297-3.
[5] The National Center for Biomedical Ontology:
    Resources. http://bioontology.org/repositories.html
[6] OpenLink Software, Inc. Virtuoso 6 FAQ.
    http://virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html
[7] Brickley, D.; Miller, L.: FOAF Vocabulary Specification
    0.91. http://xmlns.com/foaf/spec/
[8] Bojars, U.; Breslin, J.G. (eds.): SIOC Core Ontology
    Specification http://rdfs.org/sioc/spec/
[9] Erling, O.: “E Pluribus Unum”, or “Inversely
    Functional Identity”, or “Smooshing Without the
    Stickiness”.
    http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling's%20Blog/1498
[10] Hausenblas, M.: Discovery and Usage of Linked
    Datasets on the Web of Data. NodMag #4. Available
    at http://www.talis.com/nodalities/pdf/nodalities_issue4.pdf
[11] OpenLink Software, Inc. Virtuoso Universal Server
    (Cloud Edition) AMI for EC2.
    http://virtuoso.openlinksw.com/wiki/main/Main/VirtuosoEC2AMI
[12] Auer, S.; Dietzold, S.; Riechert, T.: OntoWiki - A Tool
    for Social, Semantic Collaboration. 5th International
    Semantic Web Conference, Nov 5th–9th, Athens, GA,
    USA. In I. Cruz et al. (Eds.): ISWC 2006, LNCS 4273,
    pp. 736-749, 2006. Springer-Verlag Berlin Heidelberg
    2006.
[13] Metaweb Technologies, Inc.: What is Freebase?
    http://www.freebase.com/view/en/what_is_freebase
[14] Stoermer, H.: Entity Name System: The Back-bone of
    an Open and Scalable Web of Data. In: Proceedings of
    the IEEE International Conference on Semantic
    Computing, ICSC 2008, number CSS-ICSC
    2008-4-28-25. IEEE, August 2008. Available at
    http://www.okkam.org/publications/stoermer-EntityNameSystem.pdf/at_download/file
[15] Brin, S., Page, L.: The Anatomy of a Large-Scale
    Hypertextual Web Search Engine. In: Seventh
    International World-Wide Web Conference (WWW
    1998), April 14-18, 1998, Brisbane, Australia. Available
    at http://ilpubs.stanford.edu:8090/361/