=Paper=
{{Paper
|id=Vol-1704/paper11
|storemode=property
|title=LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
|pdfUrl=https://ceur-ws.org/Vol-1704/paper11.pdf
|volume=Vol-1704
|authors=Marc Weise,Steffen Lohmann,Florian Haag
|dblpUrl=https://dblp.org/rec/conf/semweb/WeiseLH16
}}
==LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints==
LD-VOWL: Extracting and Visualizing Schema
Information for Linked Data
Marc Weise1 , Steffen Lohmann2 , Florian Haag1
1
Institute for Visualization and Interactive Systems (VIS), University of Stuttgart,
Universitätsstraße 38, 70569 Stuttgart, Germany
2
Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS),
Schloss Birlinghoven, 53757 Sankt Augustin, Germany
Abstract. Users currently face the problem that schema information for Linked
Data is often not available. If it is available, it tends to be incomplete or does not
adequately represent the data. It can therefore be hard for users to get an impres-
sion of the data provided by some Linked Data source. In this paper, we introduce
LD-VOWL, a web-based tool that extracts and visualizes schema information of
Linked Data sources based on the VOWL notation. SPARQL queries are used to
infer the schema information from the data of the source, which is then gradually
added to an interactive VOWL graph visualization. We tested LD-VOWL on a
number of Linked Data endpoints with promising results.
Keywords: Linked Data, Schema Extraction, Visualization, SPARQL, RDF, OWL.
1 Introduction
A huge amount of Linked Data has been published in recent years, and is ready for
consumption [2,5]. A large portion of this data is available in RDF format and can be
queried using the standardized query language SPARQL [4,5]. The data often does not
follow a strict schema, but typically different ontologies and vocabularies are used to
describe it in a flexible way. On the one hand, this flexibility is an important character-
istic and benefit of Linked Data; on the other hand, it can make it difficult to get an idea
of what data is actually provided by a SPARQL endpoint. Visualizations can help to get
a better overview of the type and structure of the data and can serve as a useful starting
point for further querying and analysis.
In this paper, we introduce LD-VOWL, a tool that extracts and visualizes schema
information from Linked Data endpoints, based on a number of SPARQL queries.
This schema information is then incrementally added to an interactive graph visual-
ization, using a slightly adapted version of the Visual Notation for OWL Ontologies
(VOWL) [13,14].
2 Related Work
There are surprisingly few works concerning the extraction and visualization of schema
information from Linked Data. Presutti et al. describe an approach of extracting core
120
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
knowledge [16] from Linked Data by detecting knowledge patterns. Central types and
properties are identified by their betweenness and number of instances. In contrast to
our approach, Presutti et al. focus on the detection of patterns in the data but not on the
extraction and visualization of schema information.
Peroni et al. developed an approach for the automatic identification of key con-
cepts [15]. Different from our work, the approach runs on ontologies and not on Linked
Data. They use a couple of metrics, such as the length of concept names and their cen-
trality in the graph structure, to find natural categories in the dataset. The concepts are
also weighted by their popularity, which is defined as the number of results found by a
search engine.
Another related work is QueryVOWL [9], which is also based on the VOWL no-
tation and enables users without prior knowledge about RDF and SPARQL to query
Linked Data. A graph consisting of VOWL elements is gradually constructed by the
user and mapped to SPARQL queries which are sent to an endpoint. However, in con-
trast to our approach, QueryVOWL does not provide an overview visualization of the
dataset but assumes that the user has already an idea of the type and structure of the
data and knows how to start the querying process—as it is also assumed by many re-
lated querying approaches, such as LodLive [7] or the RelFinder [10].
Other works are concerned with the recommendation of concepts based on Linked
Data [8,17], or follow general approaches of applying formal concept analysis to the
Semantic Web [11].
3 Extraction of Schema Information
The schema extraction in LD-VOWL uses a class-centric perspective, i.e., the classes
are extracted first and define the view on the Linked Data source. The classes are then
connected by properties and enriched by datatypes. A class-centric perspective is very
common in ontology engineering and fits well with the node-link paradigm of the graph
visualization that the VOWL notation is based upon [13].
The extraction is realized by dynamically generated SPARQL queries revealing the
schema information from a given dataset based on a couple of assumptions. For these
queries, we had to find a trade-off between the number of required requests and the com-
plexity of the queries. Since the SPARQL endpoints of Linked Data sources can have
strict limits in terms of execution time, the queries must not be too complex. At the
same time, we were aiming for displaying parts of the retrieved schema information as
soon as possible, hence short response times were important as well. In addition, short
response times were important, as we were aiming for displaying parts of the retrieved
schema information as soon as possible to the users to minimize waiting time. There-
fore, our priority was on using simple SPARQL queries, while we were also interested
in limiting the total number of requests.
The SPARQL queries are sent in a stepwise approach, based on a couple of assump-
tions that are detailed in the following:
1. Extraction of classes with the most instances: A generic SPARQL query asking
for the n classes with the most instances is sent to the endpoint first (where n is a
user-defined upper limit). Listing 1.1 (Appendix) shows this query for the default
121
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
limit of n = 10. The results of this query serve as a starting point for further
extractions.
This approach is based on the assumption that a dataset is well represented by the
classes having the most instances. On the other hand, these classes are often also the
more generic ones. Therefore, we integrated three strategies to avoid a too generic
visualization:
(a) All built-in classes and properties of RDF, RDFS, OWL and optionally SKOS
are contained in a blacklist that is filtered by default.
(b) Users can customize this blacklist by adding or removing classes according to
their needs. For instance, they can remove owl:Thing from the list to include
it in the visualization or add foaf:Agent to filter it too.
(c) Users can increase the number n of retrieved classes if the n initially retrieved
classes are too generic, by adapting the limit of classes accordingly.
2. Detection of subclasses, equivalent and disjoint classes: Based on the n extracted
classes with most instances, further SPARQL queries are sent to the endpoint in
order to detect classes that can be considered equivalent, subclasses or disjoint
classes. This is done by a pairwise comparison of the numbers of shared instances
for all n classes, using the following assumptions:
(a) If the number of shared instances of two classes is equal to the number of
instances of each individual class, the classes are assumed to be equivalent.
(b) If the number of shared instances of two classes is equal to the number of
instances of the class having fewer instances, the class with fewer instances
is considered a proper subset of the other class, which indicates a subclass
relation between the two classes.
(c) If there are no common instances at all, the two classes are considered to be
disjoint.
All three assumptions are based entirely on the actual data retrieved from the end-
point. For instance, two classes might not be explicitly defined as disjoint; how-
ever, if they do not share any instances in the dataset, a disjoint relationship will
be inferred following the above assumption. This informs users that any search for
individuals in that dataset which belong to both classes will be in vain.
3. Retrieval of object properties: In the third step, properties between the instances
of the classes are retrieved. As with the classes, we retrieve the most frequently used
properties first, i.e., properties having the greatest number of subject individuals
(see example in Listing 1.2, Appendix). This also includes property loops, i.e.,
properties where the subject and object individuals are from the same class.
As there can be a huge amount of different properties between the instances of two
classes, we retrieve the properties in an incremental manner. When using a single
SPARQL query, the execution of the query could take a very long time, possibly
too long for Linked Data endpoints that have a strict limit for the execution time.
Therefore, we choose the following approach in LD-VOWL: Starting with a limit of
l properties, this limit is doubled with each SPARQL query sent until all properties
are retrieved.
4. Retrieval of datatype properties: In the fourth step, LD-VOWL retrieves datatypes
linked with the instances of the extracted classes. This step can be performed either
after the third step or in parallel to it. LD-VOWL executes it in parallel in order
122
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
to avoid the impression that there are no datatypes defined for the retrieved classes
due to the delayed retrieval and visualization (remember that LD-VOWL follows a
stepwise approach and visualizes the elements as soon as they are extracted).
For each class, LD-VOWL sends queries that retrieve up to m datatypes which are
most often used with the instances of that class (Listing 1.3, Appendix). After the
datatypes are retrieved, the properties that connect the instances of the classes with
these datatypes are fetched in a second step (Listing 1.4, Appendix). The reason
for this two-step approach is again the limited execution time of many SPARQL
endpoints. In addition, it supports our goal of visualizing the extracted schema in-
formation as quickly as possible, even if it is still incomplete. This requires the use
of placeholders as labels for the datatype properties in the visualization as long as
the actual properties are unknown.
It must be noted that due to the pairwise retrieval in both step two and three of the
extraction process, the number of SPARQL requests that need to be sent grows quadrat-
ically with the number of classes n retrieved in the first step (i.e., Nrequests ∈ O(n2 )).
Thus, we recommend to select the number of classes n that are initially retrieved with
care and in accordance to the endpoint performance (LD-VOWL currently uses n = 10
as default).
4 Visualization Based on VOWL
LD-VOWL uses VOWL [13,14] for the visualization of the extracted schema infor-
mation. We had to make some minor modifications to VOWL in order to address the
peculiarities arising when visualizing information extracted from Linked Data.
In accordance with VOWL, extracted classes are represented as circle nodes in a
force-directed layout (see Figure 1). The radii of the circles refer to the number of in-
stances of the classes. Extracted properties are shown as directed and labeled edges
linking the nodes. Different from VOWL, multiple properties between instances of the
same pair of classes are merged into one edge. The more different properties exist be-
tween the instances of a pair of classes, the broader the edge is drawn. If different prop-
erties are merged into one edge, the property which occurs most often is considered
most important—analogous to the class extraction principle. Therefore, the label of this
property is shown on the edge, with the number of properties that have been merged
given in brackets. If we would not merge those properties into one edge, this could
result in a large number of edges being displayed between two classes, which would
quickly clutter the visualization. Datatypes are displayed as yellow rectangles with a
black border, like it is specified by VOWL. Accordingly, datatype properties linking the
class instances with the datatypes are shown as edges with a green label.
The ontology or vocabulary comprising most of the classes is set as the main names-
pace of the dataset. The recommended default color of VOWL (light blue) is used as
the background color of all elements in this namespace. All other ontologies and vo-
cabularies that are part of the extracted schema have a different background color and
inverted font color (white) in accordance with the VOWL specification. The colors used
to indicate and group these other namespaces range from dark blue to pink in order to
make different namespaces easily distinguishable in the visualization.
123
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
5 Implementation and User Interface
LD-VOWL is a web application implemented in JavaScript that sends SPARQL queries
via HTTP GET to extract the schema information3 , and uses web standards like HTML5,
CSS and SVG to display the extracted information. Furthermore, it makes use of the vi-
sualization toolkit D3 [6] for computing and displaying the force-directed graph.4 The
user interface of LD-VOWL is inspired by WebVOWL [12] and consists of three views:
1. The start view allows the user to select a Linked Data source by either entering the
URL of its SPARQL endpoint or selecting one from a predefined list.
2. The main view (see Figure 1) shows the visualization of the extracted schema in-
formation. It is complemented by a sidebar with controls and information details.
3. The settings view enables the user to adjust the extraction by editing the blacklist
or the language of labels, among others.
Fig. 1. LD-VOWL applied to the SPARQL endpoint of the Semantic Web Conference Corpus [3].
Users can zoom and pan to adjust the visible area and position of the graph that is
shown in the main view. They can also modify the graph layout via drag and drop or by
changing the average edge length. Furthermore, LD-VOWL provides options to filter
parts of the extracted information in the visualization, such as datatypes, property loops,
subclass relations, and disjoint classes. All nodes and edges in the graph visualization
can be selected to see details on demand, for instance, the exact number of instances
of a class or the list of all properties that connect two classes. URIs are displayed as
hyperlinks, i.e., users can click on them to view further information (if available).
3
Note that there are some endpoint requirements with regard to the supported SPARQL con-
structs, returned file format, handling of cross-origin requests, etc.
4
A demo of LD-VOWL is available at: http://ldvowl.visualdataweb.org.
124
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
Finally, users can control the namespace classification by flagging namespaces as
belonging to the main vocabulary or being marked as external. Users can also decide
whether different colors should be used for the external namespaces or not.
6 Discussion
To unleash the full potential of Linked Data, it is important that users can get a quick
overview of the type and structure of the data provided by a SPARQL endpoint. In this
paper, we presented LD-VOWL, which allows to extract and visualize schema infor-
mation from SPARQL endpoints. It uses a number of SPARQL queries that help to
structure the data and reveal how it is described by ontologies and vocabularies, based
on a set of assumptions. This schema information is then incrementally added to an
interactive graph visualization using the VOWL notation.
We implemented LD-VOWL as a web application and tested it on several SPARQL
endpoints. The results of these tests showed that LD-VOWL can create comprehensible
overviews of the content and structures of datasets within a few seconds to minutes,
depending on the performance of the endpoint (i.e., the used server, middleware, etc.),
the extraction parameters selected in LD-VOWL (variables l, m, n, see Section 3) and
the size of the dataset (which may affect the query execution time).
However, the results also show that the scalability of LD-VOWL is limited in several
regards. As mentioned in Section 3, the number of SPARQL requests that need to be
sent grows quadratically with the number of classes n that are initially retrieved. For this
purpose, LD-VOWL retrieves only those classes that have the most instances (if not on
the blacklist), which comes with benefits and limitations: On the one hand, LD-VOWL
intends to provide an overview visualization, which implies that not all information is
shown for datasets that contain a lot of classes and properties. On the other hand, it
could be useful to explore certain regions of the overview visualization in more detail
by ‘expanding’ parts of the graph and extracting further information for those parts on
demand. Therefore, we could envision to extend LD-VOWL with such an exploration
mode, or combine it with related visual querying approaches like QueryVOWL [9]. In
general, LD-VOWL can be easily integrated with other tools, as it runs completely on
the client side and only requests the server via SPARQL.
A direction for future research would be the extraction of further ontology concepts
from the Linked Data sources, such as inverse properties or set operators, by develop-
ing corresponding assumptions and extraction patterns. These additional concepts could
again be visualized with VOWL, which provides graphical representations for a large
number of OWL language constructs [13,14]. LD-VOWL would also benefit from ad-
ditional interactive features enabling the users to highlight, filter and collapse parts of
the graph, as they are implemented in WebVOWL [12]. Such interactive features can
improve the visual scalability and support the exploration and analysis of the extracted
schema information.
However, the visual scalability of node-link diagrams can only be improved up to
a certain extent with interactive features: Although a node-link diagram as used by
VOWL is very suitable to depict the structure of some dataset, its visual scalability
is inherently limited. The readability usually decreases with the number of nodes and
125
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
edges that are visualized. Therefore, another important direction of research is to inves-
tigate better means of visualizing a large amount of structured information—as it could
potentially be extracted with approaches like LD-VOWL—in a more compact way.
References
1. DBpedia endpoint. http://dbpedia.org/sparql
2. Linked Data. http://linkeddata.org
3. Semantic Web Dog Food endpoint. http://data.semanticweb.org/sparql
4. SPARQL Endpoints Status. http://sparqles.okfn.org
5. Bizer, C., Heath, T., Berners-Lee, T.: Linked data – the story so far. International Journal on
Semantic Web and Information Systems 5(3), 1–22 (2009)
6. Bostock, M., Ogievetsky, V., Heer, J.: D3 data-driven documents. IEEE Transactions on Vi-
sualization and Computer Graphics 17(12), 2301–2309 (2011)
7. Camarda, D.V., Mazzini, S., Antonuccio, A.: LodLive, exploring the web of data. In: 8th
International Conference on Semantic Systems (I-SEMANTICS ’12). pp. 197–200. ACM
(2012)
8. Damljanovic, D., Stankovic, M., Laublet, P.: Linked data-based concept recommendation:
Comparison of different methods in open innovation scenario. In: 9th Extended Semantic
Web Conference (ESWC ’12). LNCS, vol. 7295, pp. 24–38. Springer (2012)
9. Haag, F., Lohmann, S., Siek, S., Ertl, T.: QueryVOWL: A visual query notation for linked
data. In: ESWC 2015 Satellite Events. LNCS, vol. 9341, pp. 387–402. Springer (2015)
10. Heim, P., Hellmann, S., Lehmann, J., Lohmann, S., Stegemann, T.: RelFinder: Revealing
relationships in RDF knowledge bases. In: 4th International Conference on Semantic and
Digital Media Technologies (SAMT ’09). LNCS, vol. 5887, pp. 182–187. Springer (2009)
11. Kirchberg, M., Leonardi, E., Tan, Y.S., Link, S., Ko, R.K.L., Lee, B.: Formal concept dis-
covery in semantic web data. In: 10th International Conference on Formal Concept Analysis
(ICFC ’12). LNCS, vol. 7278, pp. 164–179. Springer (2012)
12. Lohmann, S., Link, V., Marbach, E., Negru, S.: WebVOWL: Web-based visualization of
ontologies. In: EKAW 2014 Satellite Events. LNAI, vol. 8982, pp. 154–158. Springer (2015)
13. Lohmann, S., Negru, S., Haag, F., Ertl, T.: Visualizing ontologies with VOWL. Semantic
Web 7(4), 399–419 (2016)
14. Negru, S., Lohmann, S., Haag, F.: VOWL: Visual notation for OWL ontologies. http:
//purl.org/vowl/ (2014)
15. Peroni, S., Motta, E., D’Aquin, M.: Identifying key concepts in an ontology, through the
integration of cognitive principles with statistical and topological measures. In: 3rd Asian
Semantic Web Conference (ASWC ’08). LNCS, vol. 5367, pp. 242–256. Springer (2008)
16. Presutti, V., Aroyo, L., Adamou, A., Schopman, B.A.C., Gangemi, A., Schreiber, G.: Ex-
tracting core knowledge from linked data. In: 2nd International Workshop on Consuming
Linked Data (COLD ’2011). CEUR Workshop Proceedings, vol. 782. CEUR-WS.org (2011)
17. Stankovic, M., Breitfuss, W., Laublet, P.: Linked-data based suggestion of relevant topics. In:
7th International Conference on Semantic Systems (I-SEMANTICS ’11). pp. 49–55. ACM
(2011)
126
LD-VOWL: Extracting and Visualizing Schema Information for Linked Data Endpoints
A Examples of SPARQL Queries Used for the Schema Extraction
The following listings provide examples of SPARQL queries used by LD-VOWL to ex-
tract schema information from Linked Data sources, based on a couple of assumptions
that are described in Section 3.
SELECT DISTINCT ? c l a s s (COUNT( ? i n s t a n c e ) AS ? i n s t a n c e C o u n t )
WHERE {
? instance a ? class .
}
GROUP BY ? c l a s s
ORDER BY DESC( ? i n s t a n c e C o u n t )
LIMIT 10 OFFSET 0
Listing 1.1. SPARQL query retrieving the n = 10 classes having the most instances.
SELECT (COUNT( ? o r i g i n I n s t a n c e ) AS ? c o u n t ) ? p r o p
WHERE {
? o r i g i n I n s t a n c e a .
? t a r g e t I n s t a n c e a .
? o r i g i n I n s t a n c e ? prop ? t a r g e t I n s t a n c e .
}
GROUP BY ? p r o p
ORDER BY DESC( ? c o u n t )
LIMIT 10 OFFSET 0
Listing 1.2. SPARQL query retrieving the l = 10 most often used object properties connecting
instances of the classes Agent and Document (run on the DBpedia endpoint [1]).
SELECT (COUNT( ? v a l ) AS ? v a l C o u n t ) ? v a l T y p e
WHERE {
? i n s t a n c e a .
? i n s t a n c e ? prop ? val .
BIND (DATATYPE( ? v a l ) AS ? v a l T y p e ) .
}
GROUP BY ? v a l T y p e
ORDER BY DESC( ? v a l C o u n t )
LIMIT 10
Listing 1.3. SPARQL query retrieving the m = 10 datatypes most often linked to the DBpedia
class Agent.
SELECT DISTINCT ? p r o p
WHERE {
? i n s t a n c e a .
? i n s t a n c e ? prop ? val .
FILTER (
DATATYPE( ? v a l ) =
)
}
LIMIT 10 OFFSET 0
Listing 1.4. SPARQL query retrieving properties between instances of the DBpedia class Agent
and the linked datatype string.
127