
Paths towards the Sustainable Consumption of Semantic Data on the Web

Aidan Hogan

Claudio Gutierrez

Department of Computer Science, Universidad de Chile

Based on recent results, we argue that the right method for Web clients to access relevant information from Linked Datasets has not yet been found. We propose that something is needed between (i) Linked Data dereferencing, which is simple and reliable but too vaguely defined; (ii) data dumps, which are simple and reliable but too coarse-grained; and (iii) SPARQL querying, which is powerful and fine-grained but too unreliable. We argue that new protocols and query languages need to be investigated, and we define eight desiderata that an access method should meet in order to be considered sustainable for a mature Web of Data.

(225) triples for the BTC '11 crawl due to accessibility issues [4, §4.3]. In other work, we showed that publishers often provide incomplete, ad hoc information in dereferenced documents: e.g., we found that, on average, publishers include 83.6% of the local triples where a URI appears as subject in its respective dereferenced document, but only 55.2% of the local triples where it appears as object, and that only 32.8% of local dereferenceable resources are assigned a human-readable label [3].

Downloading a data dump is coarse-grained and involves accessing a lot of irrelevant data. The problem can be mitigated by compression: in previous work, we proposed HDT, which can compress data dumps by a factor of 15 while offering lookup functionality over the archive, thus tackling problems with bandwidth and helping clients to extract relevant data offline [2]. However, the full dataset still needs to be transferred, which is wasteful for small requests, and updates are difficult to mirror.

For SPARQL endpoints, evaluating even a SPARQL 1.0 query is PSPACE-complete [5], and query-planning costs become less reliable as queries grow more complex. Relatedly, we previously showed that many public SPARQL endpoints suffer from various issues that harm their usability: for example, we found that of the 427 public endpoints surveyed, only 32% had an availability (i.e., "uptime") falling into the range 99–100%, and that only 13.3% could return more than 100,000 results upon request (due to the popular use of result-size thresholds) [1].

Each access method has, in practice, exhibited issues that undermine its sustainability; furthermore, these three access methods cannot be readily combined.

Desiderata: To help define a path forward, we propose a list of eight desiderata for sustainable data-access methods, divided into four different goals, as follows:

Standardised: The first two criteria refer to the agreement that exists between client(s) and server(s):

1. Accessible: a software agent can access data through a uniform protocol without location-specific logic. This holds for SPARQL and for dereferencing, but not for dumps (which vary in formats, compression, access, etc.).

2. Well-defined: given a query Q and a dataset D, both client and server can precisely agree on what the response R(Q, D) should be. This holds for SPARQL, but not for dumps (which may vary in their completeness) or dereferencing (where dereferenced content varies from server to server).

Bandwidth conservation: The second two criteria aim to minimise wasting bandwidth in transferring irrelevant data to a client:

3. Granular: the query language allows the client to specify sufficient information in Q to avoid transferring irrelevant data. This is true for SPARQL, but not for dereferencing (e.g., to get the capitals of countries, full descriptions for each country must be dereferenced) or dumps.

4. Pagination: a large response R(Q, D) can be served in chunks until the client is satisfied. This is not true for dereferencing or dumps and is costly in SPARQL (which relies on ORDER BY or vendor-specific heuristics).
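To illustrate why pagination over SPARQL tends to be costly, the following minimal sketch generates the sequence of queries a client might send to page through a result set using the standard ORDER BY, LIMIT and OFFSET solution modifiers. The function name `paged_queries`, the `ex:capital` property and the example.org namespace are hypothetical and serve only as illustration; they do not come from any of the surveyed systems. Since each page is a separate query, an endpoint will typically have to order the full result set again before skipping the first OFFSET rows.

```python
from typing import Iterator


def paged_queries(base_query: str, order_var: str,
                  page_size: int, max_pages: int) -> Iterator[str]:
    """Yield one SPARQL query string per page.

    A stable ORDER BY is needed so that consecutive LIMIT/OFFSET windows
    neither overlap nor skip solutions.
    """
    if page_size <= 0 or max_pages < 0:
        raise ValueError("page_size must be positive and max_pages non-negative")
    body = base_query.rstrip()
    for page in range(max_pages):
        offset = page * page_size
        yield f"{body}\nORDER BY ?{order_var}\nLIMIT {page_size}\nOFFSET {offset}"


if __name__ == "__main__":
    # Hypothetical query for the "capitals of countries" example in the text.
    capitals = (
        "PREFIX ex: <http://example.org/vocab#>\n"
        "SELECT ?country ?capital WHERE { ?country ex:capital ?capital }"
    )
    for query in paged_queries(capitals, "country", page_size=1000, max_pages=2):
        print(query, end="\n\n")
```

The sketch only builds query strings; sending them to an endpoint and detecting the last page (e.g., a page with fewer than `page_size` results) is left out.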

Server efficiency: The third two criteria aim to make the access method sustainable for the server to host:

5. Cacheable: common requests are amenable to caching techniques or answerable by direct lookup; previously computed responses can be easily re-used. This is true for dereferencing and dumps but is prohibitive for SPARQL.

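6. Costable: the server can efficiently and accurately predict the processing/transport cost of serving R(Q, D), fostering quality of service. This is true for dereferencing and dumps, but not for general SPARQL queries.

Table 1: Desiderata satisfied by each access method

Desideratum        Dereferencing   Dumps   SPARQL endpoints
1. Accessible      yes             no      yes
2. Well-defined    no              no      yes
3. Granular        no              no      yes
4. Pagination      no              no      costly
5. Cacheable       yes             yes     no
6. Costable        yes             yes     no
7. Transparent     arguably        no      no
8. Robust          yes             yes     no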

Client usability: The final two criteria refer to the needs of clients:

7. Transparent: the client can determine if a dataset D is relevant for their needs and if a service is sufficiently reliable to serve a given purpose. This is (arguably) true for dereferencing since a client knows the topic of a page; however, a client will not know a priori what content (or quality of service) is provided behind a SPARQL endpoint or a dump.

8. Robust: the access method can gracefully handle any type of valid request from multitudinous clients; if exceptions occur, they are clearly identified and accounted for by quality of service. This is true for dereferencing and dumps but not for SPARQL (which may, e.g., fail or silently return a partial response).
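The assessments stated for the eight desiderata can be tallied per access method, as in the following minimal sketch. Two simplifications are assumptions of this sketch rather than claims of the text: pagination in SPARQL, described above as costly, is counted as not satisfied, and transparency for dereferencing, described as arguably true, is counted as satisfied. The names `ASSESSMENT` and `satisfied` are illustrative only.

```python
# Each desideratum maps to whether the text states that it holds for a method.
# Assumptions: "costly" pagination in SPARQL is encoded as False, and
# "arguably true" transparency for dereferencing is encoded as True.
ASSESSMENT = {
    "Accessible":   {"dereferencing": True,  "dumps": False, "sparql": True},
    "Well-defined": {"dereferencing": False, "dumps": False, "sparql": True},
    "Granular":     {"dereferencing": False, "dumps": False, "sparql": True},
    "Pagination":   {"dereferencing": False, "dumps": False, "sparql": False},
    "Cacheable":    {"dereferencing": True,  "dumps": True,  "sparql": False},
    "Costable":     {"dereferencing": True,  "dumps": True,  "sparql": False},
    "Transparent":  {"dereferencing": True,  "dumps": False, "sparql": False},
    "Robust":       {"dereferencing": True,  "dumps": True,  "sparql": False},
}


def satisfied(method: str) -> list[str]:
    """Return the desiderata that the given access method satisfies."""
    return [name for name, row in ASSESSMENT.items() if row[method]]


if __name__ == "__main__":
    for method in ("dereferencing", "dumps", "sparql"):
        met = satisfied(method)
        print(f"{method}: {len(met)}/8 ({', '.join(met)})")
```

Under these assumptions no method satisfies all eight desiderata, and the criteria met by SPARQL are largely those that dereferencing and dumps miss, and vice versa.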

Table 1 summarises these desiderata for the three state-of-the-art Linked Data access methods: clearly the right combination of protocol and query language has not yet been found. New proposals for access methods that satisfy more of these desiderata (perhaps like "Linked Data Fragments" [7]) need to be investigated in order to meet the expectations of future applications and increased traffic. Otherwise, if we stick with current trends, the Web of Data may continue to stagnate in its current local maximum: a perpetual experimental phase; a nice idea.

1. C. B. Aranda, A. Hogan, J. Umbrich, and P.-Y. Vandenbussche. SPARQL Web-Querying Infrastructure: Ready for Action? In ISWC, pages 277–293, 2013.

2. J. D. Fernández, M. Martínez-Prieto, C. Gutierrez, A. Polleres, and M. Arias. Binary RDF representation for publication and exchange (HDT). JWS, 19:22–41, 2013.

3. A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. JWS, 14:14–44, 2012.

4. T. Käfer, J. Umbrich, A. Hogan, and A. Polleres. Towards a Dynamic Linked Data Observatory. In LDOW, 2012.

5. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009.

6. J. Umbrich, C. Gutierrez, A. Hogan, M. Karnstedt, and J. X. Parreira. Eight fallacies when querying the Web of Data. In DESWEB, 2013.

7. R. Verborgh, M. Vander Sande, P. Colpaert, S. Coppens, E. Mannens, and R. Van de Walle. Web-scale querying through Linked Data Fragments. In LDOW, 2014.