SPORTAL: Searching for
                      Public SPARQL Endpoints

    Ali Hasnain1 , Qaiser Mehmood1 , Syeda Sana e Zainab1 , and Aidan Hogan2
             1
               Center for Semantic Web Research, DCC, University of Chile
     2
         INSIGHT Centre for Data Analytics, National University of Ireland, Galway


          Abstract. There are hundreds of SPARQL endpoints on the Web, but
          finding an endpoint relevant to a client’s needs is difficult: each endpoint
          acts like a black box, often without a description of its content. Herein
          we briefly describe Sportal: a system that collects meta-data about the
          content of endpoints and collects them into a central catalogue over which
          clients can search. Sportal sends queries to individual endpoints offline
          to learn about their content, generating a best-effort VoID description for
          each endpoint. These descriptions can then be searched and queried over
          by clients in the Sportal user interface, for example, to find endpoints
          that contain instances of a given class, or triples with a given predicate,
          or more complex requests such as endpoints with at least 1,000 images
          of people. Herein we give a brief overview of Sportal, its design and
          functionality, and the features that shall be demoed at the conference.


1        Introduction

Finding public SPARQL endpoints that contain content relevant for a client’s
needs is not easy. Say, for example, a client is interested in data about movies
and wants to find related SPARQL endpoints on the Web. They could try to
use a traditional search engine with keywords like “movie sparql” or something
similar, but many of the webpages returned may not actually be SPARQL end-
points. Another option would be to use a specialised service such as the VoID
Store3 , which allows for performing searches over dataset descriptions provided
by publishers that sometimes include a link to a SPARQL endpoint (or even
multiple endpoints); however, the system relies on publishers creating their own
VoID files [1], keeping them up to date, etc. A better option might be to use
the Datahub catalogue4 to find movie datasets with a SPARQL endpoint, but
about half of the endpoints listed in Datahub are no longer working [2].
    Rather than relying on publisher-submitted meta-data about the content
of endpoints – which may be out of date and is in any case not available for
the majority of endpoints [2,4] – we instead propose a system, which we call
the SPARQL portal (Sportal), to run SPARQL queries against a given list
3
    http://void.rkbexplorer.com/; (l.a. 2016-08-29)
4
    http://datahub.io/; (l.a. 2016-08-29)
2         Ali Hasnain, Qaiser Mehmood, Syeda Sana e Zainab, and Aidan Hogan

of endpoints to compute meta-data about their content, with a particular em-
phasis on schema data. The meta-data that Sportal collects is based on the
computable subset of VoID [1,3] and some further extensions thereof.5
    Computing dataset descriptions for endpoints in this way has a number of ad-
vantages: (1) Sportal can be updated on demand by rerunning queries against
the endpoints (we currently update every 15 days), hence excluding offline end-
points and reflecting changes to content; (2) we assume no external descriptions
or services other than a working SPARQL (1.1) endpoint; (3) the provenance of
the descriptions we compute are given by the queries we use, and the time we run
them against the endpoint. However, likewise, there are a number of disadvan-
tages: (1) the queries needed to generate a detailed dataset description can be
expensive, and may fail due to performance limitations or result-size thresholds
of public endpoints [2], thus Sportal will have incomplete descriptions for many
operational endpoints; (2) most of the queries require support for SPARQL 1.1,
which although growing, is not yet universal [2].
    In our previous work [3], we introduced Sportal, where we (1) proposed
a set of what we call “self-descriptive queries” that can be used to generate a
dataset description from an endpoint with incremental expressivity/complexity,
(2) evaluated the feasibility of running these queries in a local setting for four
SPARQL implementations (4Store, Fuseki, Sesame, Virtuoso) over four different
datasets, (3) performed experiments for a list of 618 public SPARQL endpoints
collected from Datahub and Bio2RDF to see how well these queries performed in
real-world settings and how much data we could collect for the central Sportal
catalogue, and, (4) gave an overview of the online Sportal system that allows
clients to search and/or query the catalogue of dataset descriptions.
    Our additional contribution will be to demo the Sportal system, which is
available online at http://www.sportalproject.org/. Herein, we first describe
the process of data collection, recapitulating some of the main results from our
previous study [3] (Section 2). Thereafter, we focus on the functionality of the
Sportal system itself, which will be demoed at the conference (Section 3).


2      SPORTAL Data Collection

To each public SPARQL endpoint being catalogued, Sportal sends a sequence
of queries of increasing complexity to learn about the content of that endpoint.
In particular, Sportal uses CONSTRUCT queries that will directly generate VoID
meta-data from the endpoint. For example, we send the following query6 to the
endpoint to generate meta-data about VoID class partitions [1]:
                                                                               
    CONSTRUCT { <D> void:classPartition [ void:class ?c ] } WHERE { ?s a ?c }
                                                                               
Here, <D> represents an IRI for the dataset that we generate internally based on
the endpoint URL. Many of the queries we use require SPARQL 1.1 features, in
5
    http://ldf.fi/void-ext#; (l.a. 2016-08-29)
6
    Prefixes used can be located at http://prefix.cc/ (l.a. 2016-08-29)
                            SPORTAL: Searching for Public SPARQL Endpoints                       3

particular aggregation and sub-queries; e.g., the following query uses aggregation
and a sub-query to count triples in each property partition [1]:
                                                                                 
    CONSTRUCT { <D> void:propertyPartition [ void:property ?p ; void:triples ?x ] }
    WHERE { SELECT (COUNT(?o) AS ?x) ?p WHERE { ?s ?p ?o } GROUP BY ?p }
                                                                              
   The queries become increasingly complex, where, for example, the following
query counts, for each class, the number of instances that have each property:
                                                                              
    CONSTRUCT { <D> void:classPartition [ void:class ?c ;
                      void:propertyPartition [ void:distinctSubjects ?x ] ] }
    WHERE { SELECT (COUNT(DISTINCT ?s) AS ?x) ?c ?p WHERE { ?s a ?c ; ?p ?o } GROUP BY ?c ?p }
                                                                                   
Each endpoint answers – or attempts to answer – each such query over its local
dataset; when merged, the results for each query comprise a description of the
dataset. We consider 29 such CONSTRUCT queries in total [3].
    These queries – in particular the latter more complex ones – would be expen-
sive to compute, particularly over large datasets. Hence in our previous work [3],
we performed a variety of experiments to ascertain how feasible it would be to an-
swer these queries over current SPARQL implementations and public endpoints.
The most reliable implementation appeared to be Virtuoso, which managed to
successfully run 27/29 queries on datasets of around 1 million triples, but could
only run 8/29 queries over a subset of DBpedia with 114.5 million triples, 53,200
unique predicates and 447 unique classes.
    In experiments over 618 endpoints taken from the DataHub and Bio2RDF [3],
307 (49.7%) responded to a simple SPARQL 1.0 query (i.e., were operational)
and 168 (27.2%) responded to a simple SPARQL 1.1 query (i.e., support SPARQL
1.1). Considering just these 307 operational endpoints, non-empty/non-error re-
sponses to our queries varied from 94% for the first query above listing classes, to
about 25% for the latter query. We did not verify the completeness nor the cor-
rectness of answers; we could only report that non-empty results were returned.
    We refresh the data collected from the DataHub/Bio2RDF endpoints every
15 days. We refer the reader to our previous work [3] for more details on queries,
runtimes, result sizes, and so forth.

3      SPORTAL Interfaces
Over the data collected by running these dataset-description queries on end-
points, we build a number of interfaces to help clients find endpoints of interest.
SPARQL Interface: All collected meta-data are indexed in a public SPARQL
endpoint that clients can query programmatically. Just as an example, we can
find the largest five datasets/endpoints with instances of foaf:Person:
                                   
SELECT DISTINCT ?ep ?ts                                        ?ep                     ?ts
WHERE {
 ?ds void:sparqlEndpoint ?ep ;               http://commons.dbpedia.org/sparql     1229690546
  void:triples ?ts ; void:classPartition     http://live.dbpedia.org/sparql         563358498
   [ void:class foaf:Person ] .              http://lod.kaist.ac.kr/sparql          326078469
}                                            http://data.oceandrilling.org/sparql   284665595
ORDER BY DESC(?ts) LIMIT 5                   http://data.utpl.edu.ec/.../lod/sparql 215627469
                                          
4       Ali Hasnain, Qaiser Mehmood, Syeda Sana e Zainab, and Aidan Hogan

A variety of queries are supported per the data we collect. One can also ask, e.g.,
for the most frequent classes/properties across all endpoints, endpoints with the
most instances of a given class, endpoints that have images of people, etc. The
SPARQL endpoint (with a YASGUI interface [5]) is available at the following
location http://www.sportalproject.org/yasgui/yasgui.html.
User Interface: The two main features of the U.I. we provide are class and
property search, where a user can enter a substring such as person/knows to
autocomplete a list of class/property URLs (in descending order of the number
of endpoints using the term) and then search that class/property URL to find
endpoints with data using that class/property (in descending order of the number
of instances/triples using that term, where available). We also offer endpoint
search based on autocompleting URL substrings (e.g., dbpedia), showing the
meta-data we have for that endpoint. We also have some views of statistics,
including success rates of queries, distributions of property/class terms, etc. The
U.I. front-page is available at http://www.sportalproject.org/.

4   Conclusions
Though by the nature of the data collection process, the Sportal catalogue is
incomplete – e.g., results for the previous example query may miss endpoints
with foaf:Person instances that failed to return a triple count – the system
does offer useful (partial) results when looking for relevant public endpoints in
a manner that, to the best of our knowledge, no existing service does.
    We will demonstrate both the SPARQL interface and the user interface at
the conference, showing examples of queries and searches that can be executed,
discussing possible use-cases for Sportal (e.g., SPARQL federation, finding
datasets to link to, etc.), as well as possible future directions for Sportal and
alternative strategies for finding relevant endpoints.
    Aside from improvements to the interface, possible future plans for the tool
include discovery of new endpoints and integration with SPARQLES [2].
Acknowledgments This work was supported by Science Foundation Ireland (SFI) under
Grant № SFI/12/RC/2289, the Millennium Nucleus Center for Semantic Web Research,
Grant № NC120004, and Fondecyt, Grant № 11140900.


References
1. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets.
   In: Linked Data On the Web (LDOW). CEUR (2009)
2. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.Y.: SPARQL Web-
   Querying Infrastructure: Ready for Action? In: ISWC, pp. 277–293. Springer (2013)
3. Hasnain, A., Mehmood, Q., e Zainab, S.S., Hogan, A.: SPORTAL: Profiling the
   Content of Public SPARQL Endpoints. IJSWIS 12(3), 134–163 (2016)
4. Paulheim, H., Hertling, S.: Discoverability of SPARQL Endpoints in Linked Open
   Data. In: ISWC Posters & Demos. pp. 245–248. Springer (2013)
5. Rietveld, L., Hoekstra, R.: YASGUI: feeling the pulse of Linked Data. In: EKAW.
   pp. 441–452 (2014)