Dataset Dashboard – a SPARQL Endpoint Explorer

               Petr Křemen, Lama Saeeda, Miroslav Blaško, and Michal Med

     Czech Technical University in Prague, Praha, Department of Cybernetics, Knowledge-based
                            Software Systems Group, Czech Republic
                                http://kbss.felk.cvut.cz
           {petr.kremen,saeedla1,blaskmir,medmicha}@fel.cvut.cz


          Abstract. We present the Dataset Dashboard, a SPARQL endpoint exploration
          tool that helps users understand the structure of a dataset, as well as its
          relationships to other datasets. The tool is based on the notion of a dataset
          descriptor, which describes some characteristic of a dataset. Currently, the tool
          offers descriptors for basic class/property statistics, spatial information, temporal
          information, as well as advanced dataset summarization. The tool currently registers
          over 200 SPARQL endpoints and named graphs inside the SPARQL endpoints.

          Keywords: Linked Data · SPARQL · Dataset Descriptor


1      Introduction
When trying to understand unknown RDF datasets¹, linked data consumers need to
judge the suitability of a dataset for their scenario in an efficient way. However,
dataset exploration itself is a non-trivial and time-consuming task for many reasons.
The first problem lies in the data: although many RDF datasets are cataloged, the metadata
gathered about these datasets are typically not kept in sync with the actual dataset content
on a regular basis, or they describe only provenance and do not reflect the dataset
content at all. The second important problem lies in the accessibility of the data, which
is often limited by the (un)availability or latency of SPARQL endpoints, or by the
performance of the underlying hardware.
    In this paper we address the problem of dataset exploration by users with a technical
background who need to become familiar with unknown datasets (e.g. linked data designers
who link to external sources, or data journalists who explore a new linked open
data source). As an example, consider a linked data designer who needs to
find out whether a new and unknown dataset is suitable for integration with another
dataset (s)he is designing. For this purpose, the designer has to pose a number of
exploratory SPARQL queries to become familiar with the vocabulary used in the
dataset, the mutual interlinks between classes and their significance, as well as the
spatial (which area it covers) and temporal (which time interval it spans) scope of the
dataset.
    The introduced Dataset Dashboard tool aims at making the process of understanding
an unknown RDF dataset more efficient. For this purpose, the tool shows several
interactive visualizations of general content-level characteristics of the dataset, including
a filterable dataset summary, basic class and property statistics, as well as the spatial
and temporal range of the dataset.

 ¹ In this paper, we only consider RDF datasets available as subsets of actual snapshots of
   SPARQL endpoints content.


 2     Dataset Dashboard Overview

 Please refer to the RDF(S) specifications [5, 6] for related notions. We consider an
 RDF graph of the form $g = \{\langle s_i, p_i, o_i \rangle\}$, where for each triple
 $\langle s_i, p_i, o_i \rangle$, $s_i$ (resp. $p_i$, $o_i$) is its subject (resp. predicate,
 object). A dataset descriptor is an RDF graph $\delta(g)$ constructed from $g$ which is
 easier to interpret and visualize than $g$ itself. The function $\delta : 2^g \to 2^g$ is
 the actual dataset descriptor function. The main purpose of the Dataset Dashboard is to
 compute and visualize descriptors of various types.
      Let’s introduce the basic features of the Dataset Dashboard², which are later presented
 on an example in Section 3.


 Summary Schema Widget The basic summarization descriptor provided by the Summary
 Schema Widget is the SPO Summary, schematically described as

 $$\delta_{SPO}(g) = \{\langle c_1, p, c_2 \rangle \mid \{\langle x, \texttt{rdf:type}, c_1 \rangle, \langle y, \texttt{rdf:type}, c_2 \rangle, \langle x, p, y \rangle\} \subseteq g\}$$

 Additionally, frequencies of the summary triples are computed and visualized by the
 weight (thickness) of the corresponding graph edge. The widget also provides an enhanced
 version of the SPO Summary descriptor which shows instance-level links to
 other datasets. The technique to compute enhanced descriptors is currently under review
 at the Journal of Web Semantics.
      To explore large summaries, two filtering options are offered: content-based filtering
 excludes selected classes/properties from the summary graph, while
 weight-based filtering excludes all edges with a weight below a threshold, i.e.
 it filters out triple patterns which are not frequent enough in the dataset.
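
 The SPO Summary and the two filtering options can be illustrated with a minimal Python
 sketch. It is not the tool's implementation: the graph is modelled as a plain set of
 string triples, the function names and the `ex:` identifiers are hypothetical, and the
 edge weight is assumed to be the number of instance-level triples that support a summary
 triple (the text above does not define the frequency more precisely).

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def spo_summary(graph):
    """Return a Counter mapping summary triples (c1, p, c2) to their weight.

    A summary triple is produced whenever (x, rdf:type, c1), (y, rdf:type, c2)
    and (x, p, y) all occur in the graph. The weight is assumed here to be the
    number of supporting (x, p, y) triples.
    """
    types = {}
    for s, p, o in graph:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    weights = Counter()
    for s, p, o in graph:
        # Only triples whose subject and object are both typed contribute.
        for c1 in types.get(s, ()):
            for c2 in types.get(o, ()):
                weights[(c1, p, c2)] += 1
    return weights


def filter_summary(weights, min_weight=0, excluded=frozenset()):
    """Apply weight-based and content-based filtering to a summary."""
    excluded = set(excluded)
    return {
        triple: w
        for triple, w in weights.items()
        if w >= min_weight and not (set(triple) & excluded)
    }


# Hypothetical example data.
graph = {
    ("ex:b1", "rdf:type", "ex:Building"),
    ("ex:b2", "rdf:type", "ex:Building"),
    ("ex:p1", "rdf:type", "ex:Parcel"),
    ("ex:b1", "ex:locatedOn", "ex:p1"),
    ("ex:b2", "ex:locatedOn", "ex:p1"),
    ("ex:b1", "ex:adjacentTo", "ex:b2"),
    ("ex:b1", "rdfs:label", "Building 1"),
}

summary = spo_summary(graph)
frequent = filter_summary(summary, min_weight=2)
without_located = filter_summary(summary, excluded={"ex:locatedOn"})
```

 In this sketch, content-based filtering drops any summary triple that mentions an excluded
 class or property, while weight-based filtering keeps only edges whose weight reaches the
 threshold, mirroring the slider described for the widget.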


 Spatial Widget The spatial part of the data is represented by geographical objects, using
 GeoSPARQL³. The Spatial Widget extracts GeoSPARQL geometries and displays them on
 a map, schematically

 $$\delta_{GeoSPARQL}(g) = \{\langle o, p, g \rangle \mid \langle o, p, g \rangle \in g, \text{ for } p \in \{\texttt{geosparql:asWKT}, \texttt{geosparql:asGML}\}\}$$

 The widget recognizes both Well-Known Text (WKT) and Geography Markup
 Language (GML) geometries. All types of features found in the dataset are listed in the
 pop-up menu next to the map, where the user may choose which features to show on
 the map. Point, line and polygon geometries are supported. Moreover, points are represented
 as markers with a pop-up bubble containing a link to the RDF resource representing
 the feature.
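
 As an illustration of the GeoSPARQL descriptor, the following Python sketch selects the
 geometry triples from a graph modelled as a set of string triples and parses a WKT point
 into coordinates for a map marker. The function names and the example data are
 hypothetical, and the parser is deliberately simplified: it assumes a bare `POINT(x y)`
 literal without a leading coordinate reference system IRI and does not handle lines,
 polygons or GML.

```python
import re

GEOMETRY_PREDICATES = {"geosparql:asWKT", "geosparql:asGML"}

# Simplified pattern for a two-dimensional WKT point such as "POINT(14.42 50.09)".
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def geosparql_descriptor(graph):
    """Keep only the triples whose predicate carries a serialized geometry."""
    return {(o, p, geom) for (o, p, geom) in graph if p in GEOMETRY_PREDICATES}


def parse_wkt_point(wkt):
    """Return (x, y) for a simple WKT point, or None for anything else."""
    match = POINT_RE.match(wkt)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical example data.
graph = {
    ("ex:geom1", "geosparql:asWKT", "POINT(14.42 50.09)"),
    ("ex:geom2", "geosparql:asGML",
     "<gml:Point><gml:pos>14.42 50.09</gml:pos></gml:Point>"),
    ("ex:feature1", "rdfs:label", "Feature 1"),
}

descriptor = geosparql_descriptor(graph)
marker = parse_wkt_point("POINT(14.42 50.09)")
```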
  ² Available on-line at https://onto.fel.cvut.cz/dataset-dashboard
  ³ http://www.opengeospatial.org/standards/geosparql, cit. 1.6.2018




 Temporal Widget The Temporal Widget shows the temporal coverage of the dataset, computed
 as the minimum and maximum time points occurring in the dataset. The computation
 considers structured temporal information as well as temporal information extracted
 from unstructured text. For the unstructured temporal information, it retrieves
 the textual literals and performs a natural language processing analysis to extract the
 time information. This step uses the Stanford temporal tagger (SUTime)⁴, which
 is part of the Stanford CoreNLP⁵ toolkit. The pipeline for computing the temporal
 descriptor is described in [10].
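
 The idea of temporal coverage can be sketched in Python as follows. This is only an
 illustration under simplifying assumptions: structured values are given as `date` objects,
 and a regular expression for ISO-formatted dates stands in for the natural language
 analysis (it is not SUTime and recognizes far fewer expressions). All function names and
 example texts are hypothetical.

```python
import re
from datetime import date

# Stand-in for the NLP step: only dates written as YYYY-MM-DD are recognized.
ISO_DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(text):
    """Extract all valid ISO dates mentioned in a textual literal."""
    found = []
    for year, month, day in ISO_DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip strings that look like dates but are not valid (e.g. month 13).
            pass
    return found


def temporal_coverage(structured_dates, texts):
    """Return (minimum, maximum) over structured and extracted dates, or None."""
    all_dates = list(structured_dates)
    for text in texts:
        all_dates.extend(extract_dates(text))
    if not all_dates:
        return None
    return min(all_dates), max(all_dates)


# Hypothetical example data.
structured = [date(2018, 1, 29)]
texts = [
    "Record created on 2015-06-01 and changed on 2017-11-30.",
    "No temporal expression here.",
]

coverage = temporal_coverage(structured, texts)
```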

 Dataset Dashboard Bookmarking In order to share a particular dataset dashboard
 across the community, bookmarkable dataset dashboard URIs are offered.


 3    Example
 Let’s consider a dataset $g_{exp}$ about parcels, buildings, floors and land use which is a
 part of the dataset maintained by the Prague Institute of Planning and Development⁶.
 Compared to the original, $g_{exp}$ is limited to the data for Prague Centre, which results
 in approx. 1.8 million triples.
      Basic capabilities of the Dataset Dashboard are shown in Figure 1. The figure shows
 an overview of $g_{exp}$ in terms of the SPO summary provided by the Summary Schema
 Widget. The full SPO summary contains 13 RDFS classes and 25 RDF properties. The
 Dataset Dashboard allows filtering out some classes/properties from the schema. Additionally,
 edges can be filtered according to their weight. Thus, the figure excludes two
 properties from the view (rdfs:label, ddo:has-published-dataset-snapshot)
 and shows only edges with a weight of at least 12000 (adjustable by the slider control above
 the SPO summary graph).
      The spatial and temporal context of the dataset is shown in Figure 2. The GeoSPARQL
 descriptor shows geometries attached to functional land use (instances of the RDFS class
 town-fvu:VyuzitiPloch). A dereferenceable link of one of the geometries⁷ is shown
 in a tooltip. The temporal descriptor shows the temporal range of the dataset. This
 dataset provides temporal information only in the properties town-parcely:dat_vznik,
 town-budovy:dat_vznik and town-budovy:dat_zmena, which denote the
 creation/change dates of the data change record. So in this case, and contrary to the
 GeoSPARQL descriptor, the extracted temporal knowledge refers to the creation
 of the data rather than to the actual content. The dashboard of this example is available
 at the following URL:
 https://onto.fel.cvut.cz/dataset-dashboard/#/dashboard/online?endpointUrl=http://onto.fel.cvut.cz:7200/repositories/ipr_datasets&graphIri=http://onto.fel.cvut.cz/ipr-datasets/resource/town-plan
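
 The structure of such a bookmarkable URL can be sketched in Python. The sketch assumes
 that the line-wrapped URL above joins without spaces, that the parameter names
 `endpointUrl` and `graphIri` are the only ones needed, and that `:` and `/` are left
 unescaped as in the example; the helper name `bookmark_url` is hypothetical.

```python
from urllib.parse import urlencode

DASHBOARD_BASE = "https://onto.fel.cvut.cz/dataset-dashboard/#/dashboard/online"


def bookmark_url(endpoint_url, graph_iri):
    """Build a dashboard URL for a named graph inside a SPARQL endpoint."""
    params = {"endpointUrl": endpoint_url, "graphIri": graph_iri}
    # Keep ':' and '/' readable, matching the form of the example URL.
    return DASHBOARD_BASE + "?" + urlencode(params, safe=":/")


url = bookmark_url(
    "http://onto.fel.cvut.cz:7200/repositories/ipr_datasets",
    "http://onto.fel.cvut.cz/ipr-datasets/resource/town-plan",
)
```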
  ⁴ https://nlp.stanford.edu/software/sutime.html
  ⁵ https://stanfordnlp.github.io/CoreNLP/
  ⁶ http://en.iprpraha.cz, cit. 1.6.2018
  ⁷ http://onto.fel.cvut.cz/ontologies/town-plan/pvp_fvu_p/geometry/70/2018-01-29T14:36:24.178617




 Fig. 1: An example SPO summary and basic statistics widget.

 Fig. 2: Example temporal/spatial descriptors of the dataset, panels (a) and (b).




 4     Technical Background

 Dataset Dashboard is a client-server application⁸, as shown in Fig. 3a. The client side
 serves mainly as a container for various widgets. Each widget is a separate React component
 which is able to visualize a dataset descriptor of one or more types (see Section 4.1).
 The server takes care of the proper creation and consumption of Dataset Descriptor
 Ontology metadata (see Section 4.1). In order to implement a new visualization,
 one needs to implement a visual widget component and insert it in the
 DatasetDashboardController, along with the other widgets. Upon request, the
 widget obtains from the server the result of one of the predefined SPARQL queries. For
 more complicated scenarios, one also needs to implement an SPipes (see Section 4.2)
 pipeline that computes the particular descriptor.


 Fig. 3: (a) Dataset Dashboard Architecture; (b) Dataset Description Ontology Core.


 4.1   Dataset Descriptor Ontology

 Dataset Descriptor Ontology (DDO) [4] is a domain ontology⁹ for describing the process
 of dataset description, backed by the Unified Foundational Ontology (UFO) [8].
 A simplified view of the ontology core is depicted in Figure 3b. The fundamental notion
 of the ontology is that of a ddo:dataset, an endurant entity which can
 have different ddo:dataset-snapshots in time. Each dataset is available through some
 ddo:dataset-source to which particular ddo:dataset-snapshots can be deployed by
 an event of ddo:dataset-publication. Upon an event of describing a dataset source
 (a ddo:description), a new ddo:dataset-descriptor is created as a description of a
 ddo:described-dataset-snapshot. An example of a ddo:dataset-snapshot is the particular
 content of a SPARQL endpoint.

  ⁸ https://kbss.felk.cvut.cz/gitblit/summary/?r=dataset-dashboard.git, cit. 8.6.2018
  ⁹ http://onto.fel.cvut.cz/ontologies/ddo, cit. 4.6.2018

 4.2   SPipes
 SPipes is a custom implementation of the SPARQLMotion engine¹⁰. It allows creating
 RDF processing pipelines consisting of modules (nodes) and their dependencies
 (edges). A dependency says that the source module is executed first and its
 resulting RDF graph, together with global variable bindings, is passed to the target
 module. Then the target module is executed. The language is tightly integrated with
 SPARQL [13]: variables from SPARQL expressions can be reused in modules and
 vice versa. A pipeline can be accessed through a REST interface that is generated from
 the pipeline definitions. Currently, SPipes pipelines are used to compute the SPO
 summaries and temporal descriptors.
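
 The execution model described above (a source module runs before its target and hands
 over its RDF graph and the variable bindings) can be illustrated by a small Python
 sketch. It is not the SPipes or SPARQLMotion API: modules are modelled as plain functions,
 graphs as sets of string triples, and the assumption that a target receives the union of
 its sources' graphs is made only for this illustration. All names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(modules, dependencies, bindings=None):
    """Execute modules so that every source runs before its target.

    modules:      maps a module name to a function (graph, bindings) -> (graph, bindings)
    dependencies: maps a target module name to the set of its source module names
    """
    bindings = dict(bindings or {})
    deps = {name: set(dependencies.get(name, ())) for name in modules}
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        # The input graph is assumed to be the union of the sources' outputs.
        input_graph = set()
        for source in deps[name]:
            input_graph |= outputs[source]
        output_graph, new_bindings = modules[name](input_graph, bindings)
        bindings.update(new_bindings)
        outputs[name] = set(output_graph)
    return outputs, bindings


def load(graph, bindings):
    """Hypothetical source module producing a tiny RDF graph."""
    data = {
        ("ex:b1", "rdf:type", "ex:Building"),
        ("ex:b2", "rdf:type", "ex:Building"),
    }
    return data, {"source": "example"}


def count_instances(graph, bindings):
    """Hypothetical target module summarizing the graph it receives."""
    n = sum(1 for _, p, _ in graph if p == "rdf:type")
    return {("ex:summary", "ex:instanceCount", str(n))}, {"count": n}


modules = {"load": load, "count": count_instances}
dependencies = {"count": {"load"}}

outputs, final_bindings = run_pipeline(modules, dependencies)
```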

 5     Related Work
 An overview of exploration tools can be found in [3] and a more recent one in [9]. Let’s
 discuss some of the tools in detail.
      LODeX [2] is a tool (with no on-line demo available any more) offering a visualization
 similar to the SPO summaries. Compared to Dataset Dashboard, LODeX does not graphically
 visualize information about frequent triple patterns of the dataset, which thus cannot be
 used for filtering the summary.
      LODSight [7] visualizes SPO summaries in a similar way as the Dataset Dashboard
 does. Compared to Dataset Dashboard, LODSight does not allow filtering the SPO
 summary by the classes occurring in the data.
      Linked Data Visualization Wizard [1] is an on-line tool¹¹ detecting the presence of
 several types of information inside a SPARQL endpoint, including temporal information,
 spatial information, statistical data or SKOS. In addition to detecting the presence of the
 information, the tool also shows examples of data for each type of information and the
 query used to generate them. Compared to Dataset Dashboard, a graphical visualization
 of the spatio-temporal characteristics and dataset summaries is missing.
      Linked Geo Data Browser [11] dynamically generates a faceted search
 based on linked data geographically located inside a region selected by the user on a map.
 Compared to Dataset Dashboard, it does not use the GeoSPARQL vocabulary and is
 bound to the LinkedGeoData dataset¹².
      Facete [12] is a tool providing faceted search over a SPARQL endpoint, presenting
 the results on a map. Map4rdf¹³ is a similar tool for GeoSPARQL-compliant data.
 Compared to Dataset Dashboard, both tools seem to support only point geometries.
  ¹⁰ http://sparqlmotion.org/, cit. 3.6.2018
  ¹¹ http://semantics.eurecom.fr/datalift/rdfViz/apps, cit. 11.7.2018
  ¹² http://linkedgeodata.org, cit. 11.7.2018
  ¹³ http://oegdev.dia.fi.upm.es/map4rdf/, cit. 1.6.2018




      Furthermore, compared to all the mentioned tools, Dataset Dashboard provides a
 comprehensive view of the dataset under exploration by combining different descriptors
 into a single dashboard. It also stores the computed dataset descriptors¹⁴ and provides
 persistent identifiers for efficient sharing of the view over the dataset.


 6    Evaluation

 We conducted a preliminary survey on the usefulness of the tool among three IT experts
 (a PhD student in the semantic web field, a linked data expert and a semantic web
 developer) not involved in the system design and development. As a part of the survey
 they had to explore three datasets not known to them before (a SKOS vocabulary about
 labor law¹⁵, a complex dataset about EU television content¹⁶, and a dataset with
 geospatial and temporal information about urban development in Prague¹⁷) and judge the
 benefits of using the tool for their use case. Although the experts were not provided with
 any information about the tool beyond its SPARQL endpoint URL, all of them were
 successful in describing the topics of all three datasets (mainly using the Summary
 Schema Widget). Two of them see the main advantage of the tool in its support for
 subsequent SPARQL query formulation against the particular SPARQL endpoint. The third
 one sees its advantage in providing a visualization revealing the complexity of the dataset.
 Two of the experts find the Dataset Dashboard persistent links useful for sharing
 information about the datasets. On the other hand, for the purpose of dataset exploration,
 the experts miss a visualization of data samples and more advanced data statistics.


 7    Conclusions

 We presented the Dataset Dashboard, a tool for dataset exploration using different
 dataset descriptors. The tool currently registers over 200 SPARQL endpoints and named
 graphs inside the SPARQL endpoints and is used in two national research
 projects. Initial feedback by IT experts is encouraging: it reveals that the tool is useful
 for dataset exploration and provides suggestions for future work.
      In the future, we aim at providing history tracking for computed descriptors, as well
 as introducing new descriptor types (e.g. data samples, as suggested by the survey) for
 spatial and temporal widgets.


 Acknowledgements This work was partially supported by grants No. GA 16-09713S
 Efficient Exploration of Linked Data Cloud of the Grant Agency of the Czech Republic
 and No. SGS16/229/OHK3/3T/13 Supporting ontological data quality in information
 systems of the Czech Technical University in Prague.
  ¹⁴ Currently only for SPO summary and temporal descriptors.
  ¹⁵ http://vocabulary.wolterskluwer.de/PoolParty/sparql/arbeitsrecht, cit. 10.7.2018
  ¹⁶ http://lod.euscreen.eu/sparql, cit. 10.7.2018
  ¹⁷ The dataset presented in Section 3




 References
  1. Atemezing, G.A., Troncy, R.: Towards a linked-data based visualization wizard. In: ISWC
     2014, 5th International Workshop on Consuming Linked Data (COLD 2014), 20 October
     2014, Riva del Garda, Italy (2014), http://www.eurecom.fr/publication/4380
  2. Benedetti, F., Bergamaschi, S., Po, L.: LODeX: A tool for visual querying linked open data.
     In: CEUR Workshop Proceedings. vol. 1486 (2015)
  3. Bikakis, N., Sellis, T.K.: Exploration and visualization in the web of big linked data: A survey
     of the state of the art. In: Palpanas, T., Stefanidis, K. (eds.) Proceedings of the Workshops of
     the EDBT/ICDT 2016 Joint Conference, EDBT/ICDT Workshops 2016, Bordeaux, France,
     March 15, 2016. CEUR Workshop Proceedings, vol. 1558. CEUR-WS.org (2016),
     http://ceur-ws.org/Vol-1558/paper28.pdf
  4. Blasko, M., Kostov, B., Kremen, P.: Ontology-based Dataset Exploration – A Temporal
     Ontology Use-Case. In: Proc. of the Intelligent Exploration of Semantic Data (IESD’16). Kode
     (2016)
  5. Brickley, D., Guha, R.V.: RDF Schema 1.1 (Feb 2014), http://www.w3.org/TR/rdf-schema/
  6. Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 Concepts and Abstract Syntax.
     W3C Recommendation, W3C (2014),
     http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
  7. Dudáš, M., Svátek, V., Mynarz, J.: Dataset summary visualization with LODsight. In: Lecture
     Notes in Computer Science. vol. 9341, pp. 36–40 (2015).
     https://doi.org/10.1007/978-3-319-25639-9_7
  8. Guizzardi, G.: Ontological foundations for structural conceptual models. Ph.D. thesis,
     University of Twente, The Netherlands (Mar 2005),
     http://doc.utwente.nl/50826/1/thesis_Guizzardi.pdf
  9. Klímek, J., Škoda, P., Nečaský, M.: Survey of Tools for Linked Data Consumption. Semantic
     Web (2018)
 10. Saeeda, L., Kremen, P.: Temporal knowledge extraction for dataset discovery. In: CEUR
     Workshop Proceedings. vol. 1927 (2017)
 11. Stadler, C., Lehmann, J., Höffner, K., Auer, S.: LinkedGeoData: A core for a web of spatial
     open data. Semantic Web 3(4), 333–354 (Oct 2012),
     http://dl.acm.org/citation.cfm?id=2590208.2590210
 12. Stadler, C., Martin, M., Auer, S.: Exploring the web of spatial data with Facete. In:
     Proceedings of the 23rd International Conference on World Wide Web. pp. 175–178. WWW ’14
     Companion, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2567948.2577022
 13. The W3C SPARQL Working Group: SPARQL 1.1 Overview. W3C Recommendation (2012),
     https://www.w3.org/TR/sparql11-overview/

