<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing, Monitoring and Analyzing Linked Data Quality in Public SPARQL Endpoints</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Intizar Ali</string-name>
          <email>ali.intizar@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qaiser Mehmood</string-name>
          <email>qaiser.mehmood@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Saleem</string-name>
          <email>saleem@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we propose a domain-agnostic and query-driven approach to monitor, assess, and analyze the quality of the linked data hosted by public SPARQL endpoints. We identified various quality-related metrics for linked datasets and used a linked data vocabulary to represent quality information. We provide a Linked Data Quality (LDQ) dataset, which is generated by conducting various quality-related tests over a few public SPARQL endpoints. Our main goal in this paper is to provide a platform for monitoring, assessing and analyzing linked data quality. Data consumers can also execute analytical queries over LDQ to analyze the quality-related metrics of public SPARQL endpoints. We hope that LDQ will increase data consumers' confidence in public SPARQL endpoints and will support the wide adoption of these datasets in various linked data applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Linking Open Data (LOD) is gaining popularity with every passing day, and
the amount of data available as LOD is growing rapidly. The LOD cloud
contains data originating from hundreds of sources, and the number of data sources
is continuously increasing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These datasets are accessible through different
interfaces such as SPARQL endpoints, triple pattern fragments, RDF
data dumps, and HDT files. SPARQL endpoints provide a public interface for
querying the underlying RDF data. Providing access to linked datasets through
SPARQL queries not only makes the datasets easy to access, but
also allows data consumers to integrate data from multiple datasets on the
fly. Moreover, applications can use these datasets without committing any
resources to locally host these large linked datasets. According to SPARQLES
(https://sparqles.ai.wu.ac.at/), a service that monitors the status of public
SPARQL endpoints, around 557 SPARQL endpoints were accessible on the
Web (last accessed: July 2019)<sup>4</sup>.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)</p>
      <p>However, the wide adoption of public SPARQL endpoints is hindered by a
number of challenges. Data quality, reliability and quality of service are among
the prominent challenges faced by any linked data application using SPARQL
endpoints. The limited availability of information about data quality
reduces the confidence and trust of data consumers in public open linked data
services. To this end, different monitoring services have been proposed to
monitor and evaluate the quality-of-service features of public SPARQL endpoints.</p>
      <p>However, evaluating the data quality of a dataset usually requires a deep
understanding of the internal structure of the data as well as domain-specific
knowledge.</p>
      <p>
        In this paper, we propose a domain-agnostic and query-driven quality
monitoring and assessment approach to remotely assess the quality of linked
datasets that are accessible via public SPARQL endpoints. We identified
various quality-related metrics for linked datasets that can be monitored through
SPARQL queries. In contrast to existing query-driven approaches, we
designed a linked data quality (LDQ) dataset, which contains quality profiles of
different public SPARQL endpoints generated at various timestamps. Each
quality profile holds the results of query-driven tests conducted over a given SPARQL
endpoint. Initially, we focused on three important aspects of linked data, namely
(i) IRIs, (ii) data types, and (iii) data structuredness (introduced in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
Regarding IRIs, we designed tests to evaluate the validity of the IRIs in the linked
dataset. We also evaluated the dereferenceability of these IRIs. Regarding data
types, we provide a sample test to locate all DateTime literals that are wrongly
stored as string data types. Lastly, for data structuredness, we computed
individual and weighted class coverage to show the coherence or structuredness
of a given dataset. Although we conducted the evaluation for a limited number
of parameters, the LDQ dataset is easily extensible, and users can evaluate any
quality metric of their choice by designing their own query-driven tests and
executing them over any SPARQL endpoint. The results of all quality assessment tests are
stored as linked data following the LDQ vocabulary<sup>5</sup> structure, and these results are
linked to a quality profile generated for that particular public SPARQL endpoint.
      </p>
      <p>Our aim is to provide a central monitoring service that executes quality
assessment tests following a predefined schedule and also allows its users to execute
on-demand tests. A quality profile of each public SPARQL endpoint will be
generated after every planned test, and the values of the different quality metrics will be
stored in the quality profile. We host LDQ as a SPARQL endpoint accessible at:
http://srvgal89.deri.ie:8022/sparql. Open access to the SPARQL
endpoint hosting the LDQ data allows data consumers to directly execute
analytical queries for analyzing the quality metrics of any SPARQL endpoint.
Users can also analyze historical data to understand quality-related evolution by
observing how the quality metrics change over time. LDQ has the
potential to increase data consumers' confidence in public SPARQL endpoints and
hence can contribute towards the wide adoption of public SPARQL endpoints
by linked data applications. We also provide a Web interface to execute tests over
a limited number of endpoints. We foresee LDQ being provided as a service for quality
monitoring, attaching the evaluated quality profiles to each dataset (initially
only public SPARQL endpoints) listed in the Linked Open Data Cloud.</p>
      <p>4 The SPARQLES service is executed periodically to check the status of public SPARQL
endpoints, and the number of available SPARQL endpoints can fluctuate.</p>
      <p>5 Data Quality Vocabulary: https://www.w3.org/TR/vocab-dqv/</p>
      <p>Structure of the Paper: We position our work in comparison with the state
of the art in Section 2. In Section 3, we identify linked data quality metrics and
present the LDQ data model. Section 4 discusses our quality assessment approach
with a list of quality-related parameters and their evaluation methods. We discuss
our linked data quality monitoring approach and some evaluation results in
Section 5. We conclude our work and discuss future directions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Different approaches have been proposed for linked data quality assessment in
the past [
        <xref ref-type="bibr" rid="ref10 ref16 ref5">10, 16, 5</xref>
        ], which are broadly categorized as (i) automated, (ii)
semi-automated, and (iii) manual. Most of these approaches require the involvement
of a user with expert domain knowledge of the dataset under quality
inspection. Because of this requirement, quality assessment tests
cannot be generalized to all types of datasets. Test-driven approaches have been
proposed for the quality assessment of linked datasets, in which different SPARQL queries
are designed to assess the quality of linked data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Similarly, crowdsourcing
approaches for linked data quality assessment have also been introduced [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However,
most of these approaches conducted a one-time quality evaluation. In the
dynamic Web environment, linked datasets are prone to frequent updates,
each of which can change the quality level of the overall dataset.
Moreover, linked datasets are gradually growing and improving at
the same time. Hence, a one-time quality assessment of a public SPARQL
endpoint will not truly reflect the quality of frequently updated linked
datasets.
      </p>
      <p>
        SPARQLES is a monitoring service designed to monitor the status of public
SPARQL endpoints [
        <xref ref-type="bibr" rid="ref18 ref4">4, 18</xref>
        ]. This service is executed periodically using
various SPARQL queries to monitor four metrics of the endpoint service,
namely (i) Availability, (ii) Performance, (iii) Interoperability, and (iv)
Discoverability. The results of the SPARQLES monitoring are accessible at
https://sparqles.ai.wu.ac.at/. Our proposed work is closely aligned with
SPARQLES, except that we focus on the quality of the underlying data
hosted by the SPARQL endpoint rather than the quality of service monitored by
SPARQLES.
      </p>
      <p>
        Acknowledging the importance of quality measurements for linked open
data, a community effort has led to the definition of a W3C proposed standard
for the Data Quality Vocabulary (DQV), accessible at: https://www.w3.org/TR/vocab-dqv/.
We built our dataset for monitoring the linked data quality of public
SPARQL endpoints using the same vocabulary. A similar approach, which
represents QoS parameters of public SPARQL endpoints using a QoS data model, is
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Linked Data Quality Metrics and Data Model</p>
      <p>In this section, we discuss two important data quality related metrics specifically
for linked data quality assessment and present the DQV data model, which was used
for representing and storing the values of quality metrics calculated over data hosted
by public SPARQL endpoints.</p>
      <sec id="sec-2-1">
        <title>Linked Data Quality Monitoring</title>
        <p>
          Data quality is a broad term referring to a variety of dimensions and quality
check metrics. Pipino et al. summarised 16 dimensions of data quality. Table
1 provides an overview of the data quality dimensions listed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. As is
evident from this list of dimensions, data quality assessment depends heavily
on the domain of the data as well as the requirements of the data manipulation
tasks. Zaveri et al. presented a comprehensive overview of linked data quality
metrics and added a few quality metrics that they considered more
relevant to linked datasets [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. These metrics are (i) Interlinking,
(ii) Licensing, (iii) Versatility, and (iv) Security.
        </p>
        <p>However, because linked data is distributed by nature and is mostly made
openly accessible via SPARQL endpoints, it is not easy to apply
quality tests locally. Most existing quality tests for linked data require
a local replica of the complete dataset before quality metrics can be evaluated.
It is often impractical to download a complete dataset hosted at a
SPARQL endpoint, either because of limits on data access imposed by the SPARQL
endpoint service or simply because the hosted data is so large that
a local replica is hard to download and process.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Query-driven Linked Open Data Quality Assessment</title>
        <p>SPARQL endpoints follow a distributed, service-oriented architecture in which
different endpoints are accessible through the SPARQL query service. Their large size
and high level of distribution make it very hard
to create a local copy of a dataset containing all data sources. Hence, in contrast to existing quality checks over
linked data, which require complete local access to the whole dataset, we focused
on generic mechanisms to assess the quality of linked data hosted by SPARQL
endpoints. We define generic quality assessment SPARQL queries that can be
executed by any client capable of dispatching queries to SPARQL endpoints.
We propose a query-based evaluation of quality
metrics that can be executed over any endpoint. We
identify various data quality parameters for linked datasets and consider only
those that can be evaluated by executing SPARQL
queries.</p>
        <p>A few examples of potential query-driven quality metric assessments are listed
below:</p>
        <list list-type="bullet">
          <list-item>
            <p>Validity of IRIs can be determined by extracting all IRIs in a dataset hosted
at a SPARQL endpoint and then checking what percentage of the total IRIs
are valid IRIs.</p>
          </list-item>
          <list-item>
            <p>Fact checking by comparing the answers to the same query over multiple
endpoints hosting similar information.</p>
          </list-item>
          <list-item>
            <p>Contradictory information detection by using well-known predicates (e.g. date
of birth and date of death) and checking whether the corresponding triples
use a valid date-time format and are free from contradictions (e.g. whether date of
birth, date of death and age triples present accurate information).</p>
          </list-item>
          <list-item>
            <p>Dereferenceability of IRIs in a dataset can be checked via SPARQL queries,
indicating to what extent the IRIs present in a dataset are dereferenceable.</p>
          </list-item>
        </list>
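        <p>The contradiction check mentioned in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used for LDQ: the function name find_date_contradictions and the input layout (a mapping from a subject IRI to a pair of ISO 8601 birth and death dates, as might be collected from the rows of a SPARQL SELECT result) are hypothetical.</p>
        <preformat><![CDATA[
```python
from datetime import date


def find_date_contradictions(records):
    """Return subjects whose dates are unparsable or contradict each other.

    `records` maps a subject IRI to a (birth, death) pair of ISO 8601 date
    strings. The layout is a hypothetical stand-in for SPARQL result rows.
    """
    contradictions = []
    for subject, (birth, death) in records.items():
        try:
            born = date.fromisoformat(birth)
            died = date.fromisoformat(death)
        except ValueError:
            # A value that is not a valid date is itself a quality problem.
            contradictions.append(subject)
            continue
        if born > died:
            # Nobody can be born after their own death.
            contradictions.append(subject)
    return contradictions
```
]]></preformat>
        <p>Reporting malformed values together with genuine contradictions is a design choice of this sketch; a test could equally report the two cases as separate quality measurements.</p>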
        <p>It is worth mentioning that the general categorization of quality parameters
provided in this article is not exhaustive, but rather an indicative list that showcases
only the relevant quality parameters and their broader categories. The exact
categorization of each query-driven test or quality parameter is beyond the scope of
this paper. We leave it to the user's discretion to assign a broader category
to any of the quality parameters discussed in this paper or to their own
defined quality parameters.</p>
      </sec>
      <sec id="sec-2-3">
        <title>LDQ Data Model</title>
        <p>We used the W3C Data Quality Vocabulary to represent the outcomes of
quality evaluations. Figure 1 gives an abstract overview of the Data Quality
Vocabulary, showing a few relevant classes. The LDQ data model is flexible, and any
number of data quality parameters can be introduced after their proper
categorization. The prefix ldq: http://www.insight-centre.org/ldq is the default prefix for
all classes and properties starting with the ":" symbol in Figure 1. For most of
the dataset, we stick to the classes and prefixes defined within the DQV. The
detailed description of the vocabulary can be found in the W3C description
of DQV, accessible at: https://www.w3.org/TR/vocab-dqv/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Assessing Linked Data Quality</title>
      <p>In order to assess the quality of any public SPARQL endpoint in a query-driven manner, we
identified various quality-related parameters. This section discusses the quality-related
parameters considered in this paper along with their assessment methods.
The quality parameters measured in this paper fall into three
categories, namely (i) IRIs, (ii) Data Types, and (iii) Data Structure. Below we discuss
each of these categories and their relevant tests.</p>
      <sec id="sec-3-1">
        <title>IRIs Related Quality Parameters</title>
        <p>IRIs are one of the key ingredients of linked data and hold a prominent role in
the vision and principles of linked data. IRI-related quality parameters indicate
to what degree a dataset adheres to the linked data principles. We consider the
following IRI-related quality parameters.</p>
        <p>IRI Validity: IRI validity refers to whether a given IRI complies with the IRI
syntax. For example, any IRI containing restricted characters (e.g.
a space) is not a valid IRI. An IRI validity test can be conducted by simply
selecting all IRIs and then using a predefined Java UrlValidator function to
check whether each selected IRI is valid.</p>
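        <p>The validity test can be approximated outside Java as well. The following Python sketch is an assumption-laden illustration, not the UrlValidator check described above: it rejects IRIs that contain whitespace or other characters that RFC 3987 disallows and requires a scheme, which is a looser test than a full grammar check. The function names is_valid_iri and percent_valid are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
from urllib.parse import urlsplit

# Whitespace plus the printable characters that RFC 3987 never allows in an IRI.
_FORBIDDEN = re.compile(r'[\s<>"{}|\\^`]')


def is_valid_iri(iri):
    """Approximate syntactic check: no forbidden characters and an absolute form."""
    if not iri or _FORBIDDEN.search(iri):
        return False
    try:
        parts = urlsplit(iri)
    except ValueError:
        # urlsplit raises ValueError for some malformed authorities.
        return False
    # Require a scheme plus either an authority or a path.
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


def percent_valid(iris):
    """Percentage of syntactically valid IRIs, i.e. the PV column of a profile."""
    iris = list(iris)
    if not iris:
        return 0.0
    valid = sum(1 for iri in iris if is_valid_iri(iri))
    return 100.0 * valid / len(iris)
```
]]></preformat>
        <p>Because validators differ in strictness, the percentages produced by this sketch would not necessarily match those obtained with the Java function.</p>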
        <p>
          IRI Dereferenceability: Dereferencing refers to the process of retrieving a
representation of a resource. It is an important feature of the linked data principles,
which demand that all IRIs within a linked dataset must dereference. It is
particularly important for link traversal-based federated SPARQL query
processing [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In this type of SPARQL federation, query processing is done
by traversing dereferenceable IRIs [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. A quality parameter for linked
data can measure how many of the total IRIs are dereferenceable. This
can be achieved by retrieving the list of all IRIs in the dataset, as in
the IRI validity test, and then following the HTTP path of each IRI to check
whether that particular IRI is dereferenceable.
        </p>
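        <p>As a rough illustration of the dereferencing step, the Python sketch below issues an HTTP HEAD request per IRI and counts the lookups that end in a 2xx or 3xx status. It is a sketch under assumptions: the names http_status and dereferenceable_ratio, the Accept header, and the choice of HEAD over GET are illustrative and not taken from the Java program used in this paper. The status lookup is passed in as a parameter so that the ratio can be computed without network access.</p>
        <preformat><![CDATA[
```python
import urllib.error
import urllib.request


def http_status(iri, timeout=10):
    """Send a HEAD request and return the final HTTP status code, or None on failure."""
    request = urllib.request.Request(
        iri,
        method="HEAD",
        headers={"Accept": "text/turtle, application/rdf+xml"},
    )
    try:
        # urlopen follows redirects, so a 303 to a description ends as 200 here.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (ValueError, OSError):
        # Malformed IRI, DNS failure, refused connection, timeout, and so on.
        return None


def dereferenceable_ratio(iris, fetch_status=http_status):
    """Return the fraction of IRIs whose lookup yields a 2xx or 3xx status."""
    iris = list(iris)
    if not iris:
        return 0.0
    reachable = 0
    for iri in iris:
        status = fetch_status(iri)
        if status is not None and status in range(200, 400):
            reachable += 1
    return reachable / len(iris)
```
]]></preformat>
        <p>Some servers reject HEAD requests, so a production test would probably fall back to GET before declaring an IRI unreachable.</p>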
        <p>
          Blank Nodes: Blank nodes are an important feature of linked data. Although the
number of blank nodes is not necessarily a quality parameter, statistics
on the percentage of blank nodes in a linked
dataset can still be indicative of its quality. SPARQL query
processing in the presence of blank nodes is particularly challenging [
          <xref ref-type="bibr" rid="ref17 ref8">8, 17</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Data Type Related Quality Parameters</title>
        <p>These parameters are mainly concerned with the literal values in a linked
dataset. Ideally, most literals have specific data types declared to
indicate which type of data can be stored in that literal. This quality
parameter can indicate how correctly data types are defined and whether all
literals hold a value belonging to the right data type.</p>
        <p>Date Type Validity: String is the default data type for all literals in linked
datasets unless stated otherwise. As a result, values belonging to other
data types may end up being stored in string format. A common
mistake is to store a literal value as a string instead of using the best matching
data type for that particular value. A simple date type quality parameter
can count all xsd:dateTime values that are
wrongly stored with the xsd:string data type.</p>
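        <p>A possible client-side version of this check is sketched below in Python. It assumes that the literals have already been retrieved as pairs of lexical value and datatype IRI (with None for plain literals); that input layout and the name count_datetime_as_string are hypothetical. The regular expression is a simplification of the xsd:dateTime lexical form and does not cover edge cases such as 24:00:00 or per-month day limits.</p>
        <preformat><![CDATA[
```python
import re

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Simplified xsd:dateTime lexical form: YYYY-MM-DDThh:mm:ss, with an
# optional fractional second and an optional time zone.
_DATETIME = re.compile(
    r"^-?\d{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
    r"T([01]\d|2[0-3]):[0-5]\d:[0-5]\d(\.\d+)?"
    r"(Z|[+-]([01]\d|2[0-3]):[0-5]\d)?$"
)


def count_datetime_as_string(literals):
    """Count literals that look like xsd:dateTime but are typed as strings.

    `literals` is an iterable of (lexical_value, datatype_iri_or_None) pairs.
    """
    count = 0
    for value, datatype in literals:
        stored_as_string = datatype is None or datatype == XSD_STRING
        if stored_as_string and _DATETIME.match(value.strip()):
            count += 1
    return count
```
]]></preformat>
        <p>The same pattern could in principle be pushed into a SPARQL FILTER with REGEX, at the cost of a heavier query on the remote endpoint.</p>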
      </sec>
      <sec id="sec-3-3">
        <title>Data Structuredness Related Quality Parameters</title>
        <p>
          These quality parameters provide insights into the internal
structure of the dataset. Since linked datasets are essentially graph structures,
these parameters show how connected or disconnected a linked
dataset is. We discuss a few structuredness-related quality parameters
below.
Class Coverage: This metric was introduced in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and determines how well
the instance data conform to rdf:class (class for short), i.e., how well a specific
class is covered by the different instances of that class. The coverage of a
class C, denoted by CV(C), is defined as follows:
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Definition 1 (Class Coverage)</title>
        <p>For a dataset D, let P(C) denote the set of distinct properties having class C and I(C) denote the set of distinct
instances having class C. Let I(p, C) denote the number of distinct
instances having predicate p and class C. Then, the coverage of the class
CV(C) is</p>
        <p>
          <disp-formula>
            <tex-math>CV(C) = \frac{\sum_{\forall p \in P(C)} I(p, C)}{|P(C)| \times |I(C)|}</tex-math>
          </disp-formula>
        </p>
        <preformat>SELECT (COUNT(DISTINCT ?s) AS ?occurences)
WHERE {
  ?s a &lt;Class name C&gt; .
  ?s &lt;Predicate p&gt; ?o
}</preformat>
        <p>Listing 1. Calculating the number of distinct instances having predicate p and class
C denoted by I(p, C)</p>
        <preformat>SELECT DISTINCT ?typePred
WHERE {
  ?s a &lt;Class name C&gt; .
  ?s ?typePred ?o
}</preformat>
        <p>Listing 2. The set of distinct properties having class C denoted by P(C)</p>
        <preformat>SELECT (COUNT(DISTINCT ?s) AS ?cnt)
WHERE {
  ?s a &lt;Class name C&gt; .
}</preformat>
        <p>Listing 3. Calculating the number of instances having class C denoted by I(C)</p>
        <p>Listings 1, 2, and 3 contain three different SPARQL queries which can be
used to evaluate class coverage.</p>
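        <p>To make Definition 1 concrete, the following Python sketch computes the coverage of a class from an in-memory list of triples instead of a remote endpoint. It mirrors the three listings: P(C) is taken as every predicate used by an instance of C (which, as in Listing 2, includes rdf:type itself). The function name class_coverage and the tuple-based triple representation are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_coverage(triples, cls):
    """Compute CV(C) for class `cls` over (subject, predicate, object) tuples."""
    triples = list(triples)
    # I(C): distinct instances of the class (Listing 3).
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    if not instances:
        return 0.0
    # For each property in P(C) (Listing 2), collect the distinct instances
    # that use it, so that len(...) gives I(p, C) (Listing 1).
    subjects_by_property = {}
    for s, p, o in triples:
        if s in instances:
            subjects_by_property.setdefault(p, set()).add(s)
    numerator = sum(len(subjects) for subjects in subjects_by_property.values())
    return numerator / (len(subjects_by_property) * len(instances))
```
]]></preformat>
        <p>For two instances of a class where both carry rdf:type and a name but only one carries an age, the numerator is 2 + 2 + 1 = 5 and the denominator is 3 × 2 = 6, giving a coverage of 5/6.</p>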
        <p>
          Weighted Class Coverage: Definition 1 considers the structuredness of a
dataset with respect to a single class. Obviously, a dataset D has instances
from multiple classes, with each instance belonging to at least one of these
classes (if multiple instantiations are supported). It is possible that dataset
D has a high structuredness for a class C, say CV(C) = 0.8, and a
low structuredness for another class C', say CV(C') = 0.15. But then, what
is the structuredness of the whole dataset with respect to our class system
(the set of all classes)? Duan et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] proposed a mechanism to compute this by
considering the weighted sum of the coverage CV(C) of the individual classes.
In particular, for each class C, the weighted coverage is defined below.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Definition 2 (Weighted Class Coverage)</title>
        <p>Taking Definition 1 into account, the weighted coverage for a class C, denoted by WT(CV(C)), is
calculated using the following formula:</p>
        <p>
          <disp-formula>
            <tex-math>WT(CV(C)) = \frac{|P(C)| + |I(C)|}{\sum_{\forall C' \in D} \left(|P(C')| + |I(C')|\right)}</tex-math>
          </disp-formula>
        </p>
        <p>Dataset Structuredness: By using Definitions 1 and 2, we are now ready to
compute the structuredness, hereafter termed coherence, of a whole dataset
D.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Definition 3 (Dataset Structuredness)</title>
        <p>The overall structuredness or coherence of a dataset D, denoted by CH(D), is defined as</p>
        <p>
          <disp-formula>
            <tex-math>CH(D) = \sum_{\forall C \in D} CV(C) \times WT(CV(C))</tex-math>
          </disp-formula>
        </p>
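        <p>Definitions 2 and 3 combine into a short computation once the per-class statistics are known. The Python sketch below assumes those statistics are supplied as a mapping from class name to a triple of |P(C)|, |I(C)| and CV(C); the function names and the example counts are hypothetical, while the coverage values 0.8 and 0.15 are the ones used in the discussion of weighted class coverage.</p>
        <preformat><![CDATA[
```python
def weighted_coverage(class_stats, cls):
    """WT(CV(C)): share of class `cls` in the total of |P(C')| + |I(C')|."""
    denominator = sum(props + insts for props, insts, _ in class_stats.values())
    if denominator == 0:
        return 0.0
    props, insts, _ = class_stats[cls]
    return (props + insts) / denominator


def coherence(class_stats):
    """CH(D): sum over all classes of CV(C) times WT(CV(C))."""
    return sum(
        coverage * weighted_coverage(class_stats, cls)
        for cls, (_, _, coverage) in class_stats.items()
    )


# Hypothetical dataset with two classes: (|P(C)|, |I(C)|, CV(C)).
example_stats = {
    "C": (3, 2, 0.8),
    "C_prime": (5, 10, 0.15),
}
```
]]></preformat>
        <p>With these example counts the weights are 5/20 = 0.25 and 15/20 = 0.75, so the coherence is 0.8 × 0.25 + 0.15 × 0.75 = 0.3125. The larger class dominates the result even though the smaller class is much better covered.</p>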
        <p>
          The dataset structuredness has a direct impact on query runtimes as well
as result sizes. According to [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], the higher the dataset structuredness, the
larger the result sizes and the longer the runtimes of SPARQL queries. This metric
is particularly important when designing federated SPARQL query benchmarks
[
          <xref ref-type="bibr" rid="ref12 ref15">12, 15</xref>
          ]. A federated SPARQL querying benchmark should comprise datasets
from multiple domains with varying structuredness values [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>Monitoring &amp; Analyzing Linked Data Quality</p>
        <p>In order to monitor the quality parameters of linked data, we defined a variety
of query-driven and domain-agnostic tests that can be executed over linked
datasets. We randomly selected six public SPARQL endpoints hosting linked
datasets from different domains; the endpoints and their brief
descriptions are presented in Table 2. We conducted different tests on each of these
public SPARQL endpoints to monitor their data quality. A simple Java program
was written to execute SPARQL queries on a remote server. A list of selected
SPARQL endpoints was initially provided to the Java program together with the
list of all tests to be executed.</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Name</th><th>Endpoint URI</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>DBPedia</td><td>http://dbpedia.org/sparql</td><td>DBpedia contains a linked data representation of the data extracted from Wikipedia.</td></tr>
              <tr><td>Semantic Web Dog Food</td><td>http://data.semanticweb.org/sparql</td><td>Semantic Web Dog Food contains a linked dataset representing publications and attendee records of different conferences and workshops.</td></tr>
              <tr><td>Symbolic Dataset</td><td>http://symbolicdata.org:8890/sparql</td><td>Symbolic data is a dataset designed for profiling, testing and benchmarking Computer Algebra Software (CAS).</td></tr>
              <tr><td>LRI Dataset</td><td>https://sparql.lri.fr/sparql</td><td>LRI is a dataset containing information about the scientists working in a French laboratory.</td></tr>
              <tr><td>Open Data</td><td>https://data.gov.cz/sparql</td><td>This endpoint contains national open data provided by govt. of Czech.</td></tr>
              <tr><td>Linked ISPRA</td><td>http://dati.isprambiente.it/sparql</td><td>This dataset is a compartment of environmental protection information.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          Our main aim for this evaluation was to showcase the feasibility and
potential usage of LDQ by evaluating a few quality parameters that mainly belong to
two broad categories of data quality assessment, namely (i) Completeness and
(ii) Accuracy. We recommend that LDQ users consider the categories in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] when
designing tests for the evaluation of their own quality parameters.
Depending on the nature of the test conducted, either a SPARQL query was
able to provide the score of the quality parameter directly, or
additional processing was required after retrieving the SPARQL query results. For
example, in order to evaluate the dereferencing of IRIs, all IRIs were retrieved by
a SPARQL query and then each IRI was tested by the Java program to locate a
description of the resource on the Web. The results of the quality tests were annotated
following the data model described earlier and stored directly in a locally hosted
SPARQL endpoint. We strongly encourage LDQ users to utilize the existing LDQ
dataset accessible at http://srvgal89.deri.ie:8022/sparql.
        </p>
        <p>Listing 4 contains a sample query to access the quality profile of the Semantic Web
Dog Food endpoint, while Listing 5 depicts a sample excerpt of the LDQ dataset.</p>
        <p>Table 3 presents the values of the different quality parameters assessed after
executing these quality assessment tests<sup>6</sup>. We also hope to attract a larger
audience willing to define their own quality parameters and data
quality assessment tests. To facilitate and encourage the design of such
tests, we provide the source code for LDQ generation at:
https://github.com/qaimeh/LinkedDataQuality</p>
        <p>6 Details of the tests and the source code for test re-execution and reproducibility are
available at https://github.com/qaimeh/LinkedDataQuality</p>
        <preformat>PREFIX dcat: &lt;http://www.w3.org/ns/dcat#&gt;
PREFIX dcterms: &lt;http://purl.org/dc/terms/&gt;
PREFIX dqv: &lt;http://www.w3.org/ns/dqv#&gt;
SELECT DISTINCT ?endpoint ?MeasurementName ?value
FROM &lt;http://linked.data.quality/July 2019&gt;
WHERE {
  ?endpoint a dcat:Dataset .
  ?endpoint dcterms:title ?title .
  ?endpoint dcat:distribution ?endpointDistribution .
  ?endpointDistribution dqv:hasQualityMeasurement ?measurements .
  ?measurements dqv:isMeasurementOf ?MeasurementName .
  ?measurements dqv:value ?value
  FILTER (?title = "Semantic Web Dog Food")
}</preformat>
        <p>Listing 4. A Sample Query over LDQ Endpoint</p>
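        <p>A query such as the one in Listing 4 can be sent to the LDQ endpoint from any HTTP client. The Python sketch below shows one way to do this with the standard library, following the SPARQL 1.1 Protocol convention of passing the query in a query parameter of a GET request and asking for JSON results. The function names build_query_url and run_select are hypothetical, and whether the endpoint is reachable or honours the requested result format is an assumption that has not been verified here.</p>
        <preformat><![CDATA[
```python
import json
import urllib.parse
import urllib.request

# Endpoint address as given in this paper.
LDQ_ENDPOINT = "http://srvgal89.deri.ie:8022/sparql"


def build_query_url(endpoint, query):
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint, query, timeout=30):
    """Run a SELECT query and return the list of result bindings."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # SPARQL JSON results keep the rows under results.bindings.
    return payload["results"]["bindings"]
```
]]></preformat>
        <p>Each returned binding maps a variable name, for example value, to a dictionary holding the type and value of the bound term, so the measurements can be read with a simple loop over the rows.</p>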
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Concluding Remarks and Future Directions</title>
      <p>In this paper, we presented LDQ, a linked data quality monitoring service to assess
and analyze the data quality of linked datasets. We designed a generic data model
to represent quality evaluation results for public SPARQL endpoints and showcased
the feasibility of our approach by designing two simple quality tests over six public
SPARQL endpoints. The LDQ data model is extensible, and users are free to
define their own quality parameters and design the relevant query-driven tests
for their assessment. LDQ will serve as a baseline for getting a
general idea of the data quality level of any public SPARQL endpoint, and data
consumers can rely on statistics extracted from LDQ before using a public
SPARQL endpoint. The LDQ monitoring service will act as a central hub for data
quality assessment where end consumers can execute their quality assessment tests.</p>
      <p>As future directions, we plan to define a comprehensive list of query-driven
quality assessment tests and execute these tests on the complete list of public
SPARQL endpoints available at Datahub. We plan to execute quality assessment
tests periodically, which will result in a comprehensive linked data quality
dataset that can be used to analyze the evolution of linked datasets in terms of their
quality over time. We also plan to host a linked data quality
service for users who are not familiar with SPARQL, so that they can simply use an online
service to execute quality tests from a website. We foresee our service being run
periodically on all datasets available as SPARQL endpoints, so that a quality score
could be attached to each individual dataset within the whole LOD Cloud.</p>
      <preformat>@prefix ldq:&lt;http://insight-centre.org/LDQ#&gt;.
@prefix xsd:&lt;http://www.w3.org/2001/XMLSchema#&gt;.
@prefix void:&lt;http://www.w3.org/TR/void&gt;.
@prefix muo:&lt;http://purl.oclc.org/NET/muo/muo#/&gt;.

:SWDF
    a dcat:Dataset ; dcterms:title "Semantic Web Dog Food" ;
    dcat:distribution :SWDFDistribution ;
    hasQualityMetaData dqv:QualityMetadataSWDF .

:SWDFDistribution
    a dcat:Distribution ;
    dcat:downloadURL &lt;http://www.scholarlydata.org/dumps/indicators/03 02 2018 indicators.nt&gt; ;
    dcterms:title "RDF distribution of dataset" ;
    dcat:mediaType "text/nt" ; dcat:byteSize "5889"^^xsd:decimal .

:SWDFDistribution
    dqv:hasQualityMeasurement :measurement1 .

dqv:QualityMetadataSWDF
    a dqv:QualityMetadata ;
    prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:wasGeneratedBy :SWDFQualityChecking .

:SWDFQualityChecking
    a prov:Activity ; rdfs:label "The checking of SWDFDatasetDistribution's quality"^^xsd:string ;
    prov:endedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:startedAtTime "2015-05-27T00:52:02Z"^^xsd:dateTime .

:measurement1
    a dqv:QualityMeasurement ;
    dqv:computedOn :SWDFDistribution ;
    dqv:isMeasurementOf :ntCompletenessMetric ;
    dqv:value "0.5"^^xsd:double ;
    prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:wasGeneratedBy :SWDFQualityChecking .

:ntCompletenessMetric
    a dqv:Metric ;
    skos:definition "Ratio between the number of objects represented and the number of objects expected to be represented according to the declared dataset scope."@en ;
    dqv:expectedDataType xsd:double ;
    dqv:inDimension :completeness .

#definition of dimensions and metrics
:completeness a dqv:Dimension ;
    skos:prefLabel "Completeness"@en ;
    skos:definition "Completeness refers to the degree to which all required information is present in a particular dataset."@en ;
    dqv:inCategory :intrinsicDimensions .</preformat>
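      <p>Listing 5. A Sample Excerpt From LDQ Dataset</p>
      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Quality Parameters Assessment Values (IR = Total IRIs, VI = Valid IRIs, PV = % Valid IRIs, DI = Dereferenceable IRIs, PD = % Dereferenceable IRIs, BN = Total Blank Nodes, BS = Blank Nodes as Subject, BO = Blank Nodes as Object, DT = Date Time as String, ST = Structuredness, SD = Symbolic Dataset). We could not obtain a structuredness value for the Open Data SPARQL endpoint because of a runtime error.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Name</th><th>IR</th><th>VI</th><th>PV</th><th>DI</th><th>PD</th><th>BN</th><th>BS</th><th>BO</th><th>DT</th><th>ST</th></tr>
          </thead>
          <tbody>
            <tr><td>DBPedia</td><td>1950000</td><td>1889033</td><td>96</td><td>1318941</td><td>67</td><td>55209471</td><td>27655447</td><td>27554024</td><td>0</td><td>0.19</td></tr>
            <tr><td>SWDF</td><td>41700</td><td>41416</td><td>99</td><td>34797</td><td>83</td><td>37524</td><td>28164</td><td>9360</td><td>428</td><td>0.42</td></tr>
            <tr><td>SD</td><td>41273</td><td>40702</td><td>98</td><td>16286</td><td>39</td><td>9</td><td>6</td><td>3</td><td>42</td><td>0.68</td></tr>
            <tr><td>LRI</td><td>2047</td><td>1438</td><td>70</td><td>1048</td><td>51</td><td>421</td><td>348</td><td>73</td><td>1</td><td>0.52</td></tr>
            <tr><td>Open Data</td><td>2048843</td><td>871730</td><td>42</td><td>1859127</td><td>90</td><td>46749</td><td>35369</td><td>11380</td><td>273</td><td>n/a</td></tr>
            <tr><td>ISPRA</td><td>598111</td><td>597594</td><td>99</td><td>546609</td><td>91</td><td>1144</td><td>771</td><td>373</td><td>10907</td><td>0.95</td></tr>
          </tbody>
        </table>
      </table-wrap>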
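      <p>The derived columns of Table 3 can be cross-checked against the raw counts. The following is a minimal sketch and not part of the LDQ tooling: it assumes that PV and PD are integer percentages of the total IRI count (IR) obtained by truncation, and that BN is the sum of BS and BO. Both assumptions are inferred from the reported numbers rather than stated in the text, and the names TABLE_3, truncated_percent, and check_row are hypothetical.</p>
      <preformat>
```python
# Rows of Table 3 as reported: name -> (IR, VI, PV, DI, PD, BN, BS, BO).
# DT and ST are omitted because they are not derived from other columns.
TABLE_3 = {
    "DBPedia": (1950000, 1889033, 96, 1318941, 67, 55209471, 27655447, 27554024),
    "SWDF": (41700, 41416, 99, 34797, 83, 37524, 28164, 9360),
    "SD": (41273, 40702, 98, 16286, 39, 9, 6, 3),
    "LRI": (2047, 1438, 70, 1048, 51, 421, 348, 73),
    "Open Data": (2048843, 871730, 42, 1859127, 90, 46749, 35369, 11380),
    "ISPRA": (598111, 597594, 99, 546609, 91, 1144, 771, 373),
}


def truncated_percent(part: int, total: int) -> int:
    """Integer percentage of part in total, truncated (assumed convention)."""
    return 100 * part // total


def check_row(row: tuple) -> dict:
    """Recompute the derived columns of one row and compare with the reported values."""
    ir, vi, pv, di, pd, bn, bs, bo = row
    return {
        "PV": truncated_percent(vi, ir) == pv,  # % valid IRIs
        "PD": truncated_percent(di, ir) == pd,  # % dereferenceable IRIs
        "BN": bs + bo == bn,                    # blank nodes: subject + object
    }


results = {name: check_row(row) for name, row in TABLE_3.items()}
for name, checks in results.items():
    print(name, checks)
```
      </preformat>
      <p>Under these assumptions every reported row is reproduced, which suggests that PD is computed relative to all IRIs rather than to valid IRIs only (for Open Data, DI exceeds VI).</p>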
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This publication has emanated from research supported in part by a
research grant from Science Foundation Ireland (SFI) under Grant Number
SFI/12/RC/2289-P2, co-funded by the European Regional Development Fund
and Enable SPOKE under Grant Number 16/SP/3804. The work conducted at
the University of Leipzig has been supported by the projects LIMBO (Grant no.
19F2029I), OPAL (no. 19F2028A), KnowGraphs (no. 860801), and SOLIDE
(no. 13N14456).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simperl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>Crowdsourcing linked data quality assessment</article-title>
          .
          <source>In The Semantic Web – ISWC 2013</source>
          , pages
          <fpage>260</fpage>
          –
          <lpage>276</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mileo</surname>
          </string-name>
          .
          <article-title>How good is your SPARQL endpoint?</article-title>
          <source>In On the Move to Meaningful Internet Systems: OTM 2014 Conferences</source>
          , pages
          <fpage>491</fpage>
          –
          <lpage>508</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data – the story so far</article-title>
          .
          <source>Semantic Services, Interoperability and Web Applications: Emerging Concepts</source>
          , pages
          <fpage>205</fpage>
          –
          <lpage>227</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Buil-Aranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Vandenbussche</surname>
          </string-name>
          .
          <article-title>SPARQL web-querying infrastructure: Ready for action?</article-title>
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>277</fpage>
          –
          <lpage>293</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Debattista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Luzzu – a framework for linked data quality assessment</article-title>
          .
          <source>arXiv preprint arXiv:1412.3750</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Udrea</surname>
          </string-name>
          .
          <article-title>Apples and oranges: a comparison of RDF benchmarks and real RDF datasets</article-title>
          .
          <source>In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>145</fpage>
          –
          <lpage>156</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Freytag</surname>
          </string-name>
          .
          <article-title>Executing SPARQL queries over the web of linked data</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>293</fpage>
          –
          <lpage>309</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          .
          <article-title>Certain answers for SPARQL with blank nodes</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>337</fpage>
          –
          <lpage>353</lpage>
          . Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Test-driven evaluation of linked data quality</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World Wide Web</source>
          , pages
          <fpage>747</fpage>
          –
          <lpage>758</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          , H. Muhleisen, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Sieve: linked data quality assessment and fusion</article-title>
          .
          <source>In Proceedings of the 2012 Joint EDBT/ICDT Workshops</source>
          , pages
          <fpage>116</fpage>
          –
          <lpage>123</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Pipino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Data quality assessment</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>45</volume>
          (
          <issue>4</issue>
          ):
          <fpage>211</fpage>
          –
          <lpage>218</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hasnain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>LargeRDFBench: a billion triples benchmark for SPARQL endpoint federation</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>48</volume>
          :
          <fpage>85</fpage>
          –
          <lpage>125</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hasnain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>A fine-grained evaluation of SPARQL endpoint federation systems</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>7</volume>
          (
          <issue>5</issue>
          ):
          <fpage>493</fpage>
          –
          <lpage>518</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Szárnyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Conrads</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. C.</given-names>
            <surname>Bukhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>How representative is a SPARQL benchmark? An analysis of RDF triplestore benchmarks</article-title>
          .
          <source>In The World Wide Web Conference</source>
          , pages
          <fpage>1623</fpage>
          –
          <lpage>1633</lpage>
          . ACM,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
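          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Görlitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ladwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarte</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>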
          .
          <article-title>FedBench: A benchmark suite for federated semantic data query processing</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>585</fpage>
          –
          <lpage>600</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matteini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          .
          <article-title>LDIF – a framework for large-scale linked data integration</article-title>
          .
          <source>In 21st International World Wide Web Conference (WWW 2012), Developers Track</source>
          , Lyon, France,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolpe</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <article-title>Distributed query processing in the presence of blank nodes</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1001</fpage>
          –
          <lpage>1021</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Vandenbussche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Matteis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buil-Aranda</surname>
          </string-name>
          .
          <article-title>SPARQLES: Monitoring public SPARQL endpoints</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1049</fpage>
          –
          <lpage>1065</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietrobon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Quality assessment for linked data: A survey</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>63</fpage>
          –
          <lpage>93</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>