<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Luzzu - A Framework for Linked Data Quality Assessment</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bonn and Fraunhofer IAIS</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the increasing adoption and growth of the Linked Open Data cloud, the variety of the Web of Data makes it challenging to determine the quality of the data published on the Web and to subsequently make this information explicit to data consumers. In this demo paper we describe Luzzu, a scalable quality assessment framework for Linked Data. Apart from providing quality metadata and quality problem reports that can be used for data cleaning, Luzzu is extensible: third-party metrics can easily be plugged into the framework. Hence, the extensibility of Luzzu enables quality assessment in light of “fitness for use”.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Assessment Framework</kwd>
        <kwd>Quality Metadata</kwd>
        <kwd>Quality Metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>RDF dataset (http://datahub.io/dataset/democratic-city). These steps
can be replicated on Luzzu Web. The video demonstrates (1) the quality
assessment of a dataset; (2) the filtering and ranking of assessed datasets using
the daQ meta-model; (3) the visualisation of quality metadata.
</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>The framework follows a three-step workflow, starting with the metric
initialisation process (Step 1). In this step, user-defined metrics are compiled and
initialised together with metrics implemented in Java. The quality assessment
process is then commenced (Step 2) by sequentially streaming statements of
the candidate dataset into the initialised quality metrics. Once this process is
completed, the annotation process (Step 3) generates quality metadata and
compiles a comprehensive quality report. The quality report produced in this
framework enables data curators to improve the dataset’s quality by using the
report to identify quality issues within the dataset.</p>
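      <p>The three steps above can be sketched as follows. This is a minimal illustration in Python, not Luzzu's own code: the class EmptyObjectMetric, the function assess and the example quads are hypothetical names and data chosen for this sketch only.</p>
      <preformat><![CDATA[
```python
from typing import Iterable, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)


class EmptyObjectMetric:
    """Hypothetical metric: share of statements whose object is not empty."""

    def __init__(self) -> None:
        self.total = 0
        self.problems = []

    def compute(self, quad: Quad) -> None:
        self.total += 1
        if quad[2].strip() == "":
            self.problems.append(quad)

    def value(self) -> float:
        if self.total == 0:
            return 1.0
        return 1.0 - len(self.problems) / self.total


def assess(statements: Iterable[Quad], metric_classes):
    # Step 1: initialise the selected metrics.
    metrics = [cls() for cls in metric_classes]
    # Step 2: stream every statement into every initialised metric.
    for quad in statements:
        for metric in metrics:
            metric.compute(quad)
    # Step 3: annotate, i.e. collect quality metadata and a problem report.
    metadata = {type(m).__name__: m.value() for m in metrics}
    report = {type(m).__name__: list(m.problems) for m in metrics}
    return metadata, report


data = [
    ("ex:a", "ex:name", "Alice", "ex:g"),
    ("ex:b", "ex:name", "", "ex:g"),
]
metadata, report = assess(iter(data), [EmptyObjectMetric])
```
]]></preformat>
      <p>In this sketch the dataset is consumed exactly once, and the problem report keeps the offending statements so that a data curator could locate them.</p>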
      <p>The framework comprises three layers: Communication, Assessment and
Knowledge. The first exposes the framework’s interfaces as a REST service,
whilst the latter two are described in the remainder of this section.</p>
      <sec id="sec-2-1">
        <title>Knowledge Layer</title>
        <p>
          The Knowledge Layer is composed of three units, namely the Semantic Schema
Layer, the Annotation Unit, and the Operations Unit. These units support the
provision of quality metadata and assessment reports, as well as other operations that
can be performed upon the same metadata. This layer, and subsequently Luzzu,
is driven by a number of schemas that enable the representation of quality
metadata (daQ) and quality problem reports (QPRO), alongside other operational schemas
used to operate the framework<sup>2</sup>. The Dataset Quality Ontology (daQ) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is the core
vocabulary, based on the RDF Data Cube vocabulary<sup>3</sup> and PROV-O<sup>4</sup>, that defines
how quality metadata should be represented at an abstract level. It is used to
attach the results of quality benchmarking of a Linked Open Dataset to the
dataset itself. These results can be used to rank (cf. Section 3) or visualise (cf.
Section 4) datasets according to their quality.
        </p>
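        <p>To make the idea of multi-dimensional quality metadata more concrete, the following Python sketch models a quality observation and groups observations by dimension. It does not reproduce daQ's actual classes or properties: the field names, the metric identifiers and the values are hypothetical and purely illustrative.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One quality measurement; all field names are hypothetical."""
    dataset: str
    category: str
    dimension: str
    metric: str
    value: float
    computed_on: str  # ISO 8601 date, so string order equals time order


def by_dimension(observations):
    """Group observations per dimension, similar to slicing a data cube."""
    grouped = defaultdict(list)
    for obs in observations:
        grouped[obs.dimension].append(obs)
    return dict(grouped)


# Made-up example values, not results from any real assessment.
observations = [
    Observation("ex:dataset", "Accessibility", "Availability",
                "ex:Dereferenceability", 0.9, "2015-03-01"),
    Observation("ex:dataset", "Intrinsic", "Consistency",
                "ex:MisplacedClasses", 0.7, "2015-03-01"),
    Observation("ex:dataset", "Accessibility", "Availability",
                "ex:Dereferenceability", 0.95, "2015-04-01"),
]
slices = by_dimension(observations)
```
]]></preformat>
        <p>Because every observation carries its dataset, metric and date, the same records could serve both ranking and visualisation over time.</p>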
      </sec>
      <sec id="sec-2-2">
        <title>Assessment Layer</title>
        <p>The Assessment Layer is composed of three units, namely the Processing Unit,
the LQML Compilation Unit, and the Quality Assessment Unit. These units
handle the operations related to the quality assessment of a dataset.
</p>
        <p>2 All Luzzu ontologies have the namespace http://purl.org/eis/vocab/{prefix}
3 http://www.w3.org/TR/vocab-data-cube/
4 http://www.w3.org/TR/prov-o/</p>
        <p>[Figure 1: High-level workflow of the quality assessment. Diagram labels: Input: Dataset and Metric Selection; Communication Layer; Stream Processor; Dataset triple/quad; Quad &lt;s,p,o,c&gt;; Invoke; Quality Assessment Unit (Metric 1, Metric 2, …, Metric n); Metric Value / Problematic Triples; Annotation Unit; Quality Metadata; Quality Report; Output: Problem Report.]</p>
        <p>
          The Quality Assessment Unit is the most important unit of the framework.
We offer many common quality metrics for download on the Luzzu
homepage. Their implementation follows a comprehensive survey of Linked Data
quality by Zaveri et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which also reviews related approaches. Third parties
can extend the framework with custom metrics, either by implementing
simple Java interfaces<sup>5</sup> or by writing them in LQML [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a novel quality metric language. The main
advantage of LQML is that creators of quality metrics do not need to go through
the whole process of creating a Java package, but can declaratively define a metric in
a few lines of code. We are currently implementing
functionality that allows more complex metrics to be implemented in LQML, not just
simple pattern matching rules.
        </p>
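        <p>The difference between implementing an interface and declaring a pattern matching rule can be illustrated with the Python sketch below. It is an analogue only: it shows neither Luzzu's Java interfaces nor LQML syntax, and the names QualityMetric, PatternMetric and no_empty_labels are hypothetical.</p>
        <preformat><![CDATA[
```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)


class QualityMetric(ABC):
    """Hypothetical metric interface, loosely analogous to a Java interface."""

    @abstractmethod
    def compute(self, quad: Quad) -> None: ...

    @abstractmethod
    def metric_value(self) -> float: ...


class PatternMetric(QualityMetric):
    """Flags quads matching a pattern; None acts as a wildcard."""

    def __init__(self, pattern: Tuple[Optional[str], ...]) -> None:
        self.pattern = pattern
        self.seen = 0
        self.problems: List[Quad] = []

    def compute(self, quad: Quad) -> None:
        self.seen += 1
        if all(p is None or p == q for p, q in zip(self.pattern, quad)):
            self.problems.append(quad)

    def metric_value(self) -> float:
        # Share of statements that do not match the problem pattern.
        if self.seen == 0:
            return 1.0
        return 1.0 - len(self.problems) / self.seen


# Declarative use: the rule is just data, so no new class is needed.
no_empty_labels = PatternMetric((None, "rdfs:label", "", None))
for quad in [
    ("ex:a", "rdfs:label", "Alice", "ex:g"),
    ("ex:b", "rdfs:label", "", "ex:g"),
    ("ex:c", "ex:age", "", "ex:g"),
    ("ex:d", "rdfs:label", "Dora", "ex:g"),
]:
    no_empty_labels.compute(quad)
```
]]></preformat>
        <p>A metric that needs more than a fixed pattern would subclass the interface directly, which mirrors the trade-off between the two extension routes described above.</p>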
        <p>The Processing Unit controls the whole execution of the quality assessment
of a chosen dataset. Luzzu implements two stream processing units: one based on
the Jena RDF API and the other on the Spark processing framework. Streaming
ensures scalability (since we are not limited by main memory) and
parallelisability (since the parsing of a dataset can be split into several streams to be processed
on different threads, cores or machines). Figure 1 shows a high-level workflow
of the quality assessment. All triples in a dataset are fed into each initialised
metric processor; the output comprises quality metadata and a quality report.</p>
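        <p>The following Python sketch illustrates how a statement stream might be split into batches that are processed on several threads and then merged. It uses neither Jena nor Spark, and it assumes a metric whose partial results can simply be added up; the functions chunks, count_empty_objects and assess_in_parallel are hypothetical.</p>
        <preformat><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(stream, size):
    """Yield lists of at most `size` statements taken from the stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_empty_objects(batch):
    """Partial result for one batch: (statements seen, empty objects)."""
    empty = sum(1 for (_s, _p, o, _g) in batch if o == "")
    return len(batch), empty


def assess_in_parallel(stream, batch_size=2, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_empty_objects, chunks(stream, batch_size)))
    # Merge the partial results of all batches.
    total = sum(seen for seen, _ in partials)
    empty = sum(bad for _, bad in partials)
    return total, empty


def statements():
    # Generator standing in for a parser that emits one quad at a time.
    yield ("ex:a", "ex:p", "x", "ex:g")
    yield ("ex:b", "ex:p", "", "ex:g")
    yield ("ex:c", "ex:p", "y", "ex:g")
    yield ("ex:d", "ex:p", "", "ex:g")
    yield ("ex:e", "ex:p", "z", "ex:g")


total, empty = assess_in_parallel(statements())
```
]]></preformat>
        <p>Note that Executor.map submits all batches as soon as it is called, so this sketch does not bound memory by itself; a production version would limit the number of batches in flight.</p>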
        <p>5 See http://eis-bonn.github.io/Luzzu/howto.html for how to implement the Java interfaces.</p>
        <p>Ranking Datasets using the Quality Metadata</p>
        <p>Our framework enables flexible filtering and ranking in that the daQ
vocabulary facilitates access to dataset quality metrics in different dimensions
and thus facilitates the (re)computation of custom aggregated metrics derived
from base metrics. To keep quality metric information easily accessible, the daQ
quality metadata graph about a dataset should be stored in that dataset itself
once it has been computed. In the spirit of “fitness for use”, the Luzzu ranking
algorithm enables users to define weights on their preferred categories,
dimensions or metrics, as they deem suitable for the task at hand. Figure 2 shows
the ranking view of the Luzzu web application.</p>
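        <p>User-defined weights could drive a ranking along the lines of the Python sketch below. The text does not spell out the formula of the Luzzu ranking algorithm, so the weighted mean used here is an assumption, as are the function name rank, the metric names and the example values. Only weights on metrics are shown; weights on categories or dimensions are left out.</p>
        <preformat><![CDATA[
```python
def rank(datasets, weights):
    """Order datasets by the weighted mean of their metric values.

    `datasets` maps a dataset name to {metric name: value in [0, 1]};
    `weights` maps a metric name to a non-negative user-defined weight.
    Metrics without a weight are ignored; missing values count as 0.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    scores = {
        name: sum(weights[m] * values.get(m, 0.0) for m in weights) / total_weight
        for name, values in datasets.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up metric values for two hypothetical datasets.
datasets = {
    "ex:A": {"availability": 1.0, "consistency": 0.5},
    "ex:B": {"availability": 0.5, "consistency": 1.0},
}
ranking = rank(datasets, {"availability": 3, "consistency": 1})
```
]]></preformat>
        <p>With these weights availability counts three times as much as consistency, so ex:A is ranked first; swapping the weights would reverse the order.</p>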
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Visualising Quality Metadata</title>
      <p>Apart from displaying ranked lists, the Luzzu web application visualises quality
metadata as charts. A visualisation wizard helps the user to choose the right
visualisation type and charts. Currently, three types of chart can be
visualised: (a) multiple datasets vs metric; (b) dataset vs metric over time; (c)
quality of dataset. Figure 3 depicts a dataset’s quality evolution over time.</p>
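      <p>As a rough illustration of the data behind chart types (a) and (b), the Python sketch below reshapes a flat list of observations into the series a chart would plot. The tuple layout (dataset, metric, date, value), the function names and the values are hypothetical and do not reflect the Luzzu web application's internals.</p>
      <preformat><![CDATA[
```python
def metric_over_time(observations, dataset, metric):
    """Chart type (b): chronologically ordered (date, value) points."""
    points = [
        (date, value)
        for (ds, m, date, value) in observations
        if ds == dataset and m == metric
    ]
    return sorted(points)


def datasets_vs_metric(observations, metric):
    """Chart type (a): most recent value of one metric for every dataset."""
    latest = {}
    # ISO 8601 dates sort chronologically, so later values overwrite earlier ones.
    for ds, m, _date, value in sorted(observations, key=lambda o: o[2]):
        if m == metric:
            latest[ds] = value
    return latest


# Made-up example observations.
observations = [
    ("ex:A", "consistency", "2015-02-01", 0.6),
    ("ex:A", "consistency", "2015-01-01", 0.5),
    ("ex:B", "consistency", "2015-01-15", 0.8),
]
series = metric_over_time(observations, "ex:A", "consistency")
bars = datasets_vs_metric(observations, "consistency")
```
]]></preformat>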
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Data quality assessment is crucial for the wider deployment and use of Linked
Data. With Luzzu we presented a scalable Linked Data quality assessment
framework. The Luzzu Web frontend furthermore makes quality assessment easy to
use. We see Luzzu as the first step on a long-term research agenda aiming at
shedding light on the quality of data published on the Web.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Debattista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Luzzu Quality Metric Language - A DSL for Linked Data Quality Assessment</article-title>
          .
          <year>2015</year>
          . arXiv:1504.07758 [cs.DB].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Debattista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Representing Dataset Quality Metadata using Multi-Dimensional Views</article-title>
          . In: SEMANTiCS.
          <year>2014</year>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.
          <article-title>Quality Assessment for Linked Data</article-title>
          .
          In:
          <source>Semantic Web Journal</source>
          (
          <year>2015</year>
          ). http://www.semantic-web-journal.net/content/quality-assessment-linked-data-survey.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>