Luzzu - A Framework for Linked Data Quality Assessment

University of Bonn and Fraunhofer IAIS, Germany

With the increasing adoption and growth of the Linked Open Data cloud, the variety of the Web of Data makes it challenging to determine the quality of the data published on the Web and to subsequently make this information explicit to data consumers. In this demo paper we describe Luzzu, a scalable quality assessment framework for Linked Data. Apart from providing quality metadata and quality problem reports that can be used for data cleaning, Luzzu is extensible: third-party metrics can easily be plugged into the framework. Hence, the extensibility of Luzzu enables quality assessment in light of “fitness for use”.

Data Quality, Assessment Framework, Quality Metadata, Quality Metrics

1 Introduction

RDF dataset (http://datahub.io/dataset/democratic-city). These steps can be replicated on Luzzu Web. The video demonstrates (1) the quality assessment of a dataset; (2) the filtering and ranking of assessed datasets using the daQ meta-model; (3) the visualisation of quality metadata.

2 Approach

The framework follows a three-step workflow, starting with the metric initialisation process (Step 1). In this step, user-defined metrics are compiled and initialised together with metrics implemented in Java. The quality assessment process then begins (Step 2) by sequentially streaming the statements of the candidate dataset into the initialised quality metrics. Once this process is completed, the annotation process (Step 3) generates quality metadata and compiles a comprehensive quality report. Data curators can use this report to identify quality issues within the dataset and thus improve its quality.
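The three steps can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Luzzu's actual API (Luzzu's own metrics are written in Java or LQML): the `Metric` protocol, the `EmptyLiteralMetric` class, and the `assess` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Protocol, Sequence, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)


class Metric(Protocol):
    """Hypothetical metric interface (an assumption, not Luzzu's Java interface)."""
    name: str

    def compute(self, quad: Quad) -> None: ...
    def value(self) -> float: ...
    def problems(self) -> List[Quad]: ...


@dataclass
class EmptyLiteralMetric:
    """Illustrative metric: share of statements whose object is not an empty literal."""
    name: str = "non-empty-literal-ratio"
    total: int = 0
    bad: List[Quad] = field(default_factory=list)

    def compute(self, quad: Quad) -> None:
        self.total += 1
        if quad[2] == '""':
            self.bad.append(quad)

    def value(self) -> float:
        return 1.0 if self.total == 0 else 1 - len(self.bad) / self.total

    def problems(self) -> List[Quad]:
        return list(self.bad)


def assess(
    dataset: Iterable[Quad],
    metric_factories: Sequence[Callable[[], Metric]],
) -> Tuple[Dict[str, float], Dict[str, List[Quad]]]:
    # Step 1: initialise the selected metrics.
    metrics = [factory() for factory in metric_factories]
    # Step 2: stream the statements sequentially into every metric.
    for quad in dataset:
        for metric in metrics:
            metric.compute(quad)
    # Step 3: annotate, i.e. collect quality metadata and a problem report.
    metadata = {m.name: m.value() for m in metrics}
    report = {m.name: m.problems() for m in metrics}
    return metadata, report
```

Because `assess` only iterates over `dataset` once, the input can be a generator that reads statements lazily, which mirrors the sequential streaming described above.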

The framework comprises three layers: Communication, Assessment and Knowledge. The first exposes the framework’s interfaces as a REST service, whilst the latter two are described in the remainder of this section.

2.1 Knowledge Layer

The Knowledge Layer is composed of three units, namely the Semantic Schema Layer, the Annotation Unit, and the Operations Unit. These units support the provision of quality metadata and assessment reports, as well as other operations that can be performed on this metadata. This layer, and consequently Luzzu as a whole, is driven by a number of schemas that enable the representation of quality metadata (daQ) and quality problem reports (QPRO), together with other operational schemas needed to operate the framework (footnote 2). The Dataset Quality Ontology (daQ) [2] is the core vocabulary; based on the RDF Data Cube vocabulary (footnote 3) and PROV-O (footnote 4), it defines how quality metadata should be represented at an abstract level. It is used to attach the results of quality benchmarking of a Linked Open Dataset to the dataset itself. These results can be used to rank (cf. Section 3) or visualise (cf. Section 4) datasets according to their quality.

2.2 Assessment Layer

The Assessment Layer is composed of three units, namely the Processing Unit, the LQML Compilation Unit, and the Quality Assessment Unit. These units handle the operations related to the quality assessment of a dataset.

Footnote 2: All Luzzu ontologies have the namespace http://purl.org/eis/vocab/{prefix}
Footnote 3: http://www.w3.org/TR/vocab-data-cube/
Footnote 4: http://www.w3.org/TR/prov-o/

Figure 1: High-level workflow of the quality assessment. Diagram labels: Input: Dataset and Metric Selection; Communication Layer; Stream Processor; Dataset triple/quad <s,p,o,c>; Quality Assessment Unit (Invoke Metric 1, Metric 2, …); Metric Value / Problematic Triples; Annotation Unit; Output: Quality Metadata, Quality Report, Problem Report.

The Quality Assessment Unit is the most important unit of the framework. We offer many common quality metrics for download on the Luzzu homepage. In their implementation, we followed a comprehensive survey of linked data quality by Zaveri et al. [3], which also reviews related approaches. Third parties can extend the framework with custom metrics, either by implementing simple Java interfaces (footnote 5) or by writing them in LQML [1], a novel quality metric language. The main advantage of LQML is that creators of quality metrics do not need to go through the whole process of creating a Java package, but can declaratively define a metric in a few lines of code. We are currently implementing functionality that allows more complex metrics, and not just simple pattern matching rules, to be expressed in LQML.
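To illustrate what a declaratively defined pattern matching rule amounts to, the following Python sketch turns a triple pattern into a metric. It does not use LQML syntax, which is not reproduced here; `Pattern`, `matches`, `compile_rule`, and the `ex:deprecatedProp` predicate are hypothetical and serve only as an example.

```python
from typing import Callable, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]
# A declarative rule: one term per position, None acts as a wildcard.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(pattern: Pattern, triple: Triple) -> bool:
    """Return True if every non-wildcard position of the pattern equals the triple's term."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def compile_rule(pattern: Pattern) -> Callable[[Iterable[Triple]], float]:
    """Compile a pattern into a metric: the fraction of triples that do NOT match it."""
    def metric(triples: Iterable[Triple]) -> float:
        total = violations = 0
        for triple in triples:
            total += 1
            if matches(pattern, triple):
                violations += 1
        return 1.0 if total == 0 else 1 - violations / total
    return metric


# The metric author only writes the pattern; the counting logic is shared.
no_deprecated_property = compile_rule((None, "ex:deprecatedProp", None))
```

The point of the sketch is the division of labour: the rule itself is a single line of data, while the boilerplate that a hand-written metric class would otherwise repeat lives in one shared function.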

The Processing Unit controls the whole execution of the quality assessment of a chosen dataset. Luzzu implements two stream processing units: one based on the Jena RDF API and the other on the Spark processing framework. Streaming ensures scalability (since we are not limited by main memory) and parallelisability (since the parsing of a dataset can be split into several streams to be processed on different threads, cores or machines). Figure 1 shows a high-level workflow of the quality assessment. All triples in a dataset are fed into each initialised metric processor; the output comprises quality metadata and a quality report.
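The parallelisability argument can be sketched as follows: each stream is consumed independently, and only small partial results are merged at the end. This is an illustrative Python sketch using the standard library's thread pool, not the Jena-based or Spark-based processors mentioned above; `partial_counts`, `assess_parallel`, and the blank-node metric are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def partial_counts(stream: Iterable[Triple]) -> Tuple[int, int]:
    """Consume one stream and return (total statements, statements with a blank-node subject)."""
    total = blank = 0
    for subject, _predicate, _obj in stream:
        total += 1
        if subject.startswith("_:"):
            blank += 1
    return total, blank


def assess_parallel(streams: List[Iterable[Triple]], workers: int = 4) -> float:
    """Process each stream on its own worker, then merge the partial counts into one ratio."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_counts, streams))
    total = sum(t for t, _ in parts)
    blank = sum(b for _, b in parts)
    return 0.0 if total == 0 else blank / total
```

Each stream may be a generator, so no stream has to fit into main memory. The approach works here because the metric is a ratio of counts, which can be merged by simple addition; a metric whose state cannot be merged this way would need a different strategy.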

Footnote 5: See http://eis-bonn.github.io/Luzzu/howto.html for how to do this.

3 Ranking Datasets using the Quality Metadata

Our framework enables flexible filtering and ranking: the daQ vocabulary gives access to dataset quality metrics in different dimensions and thus facilitates the (re)computation of custom aggregated metrics derived from base metrics. To keep quality metric information easily accessible, the daQ quality metadata graph about a dataset should be stored in that dataset itself once it has been computed. In the spirit of “fitness for use”, the Luzzu ranking algorithm lets users define weights on their preferred categories, dimensions or metrics, according to what they deem suitable for the task at hand. Figure 2 shows the ranking view of the Luzzu web application.
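User-defined weighting can be illustrated with a weighted average of metric values. The aggregation formula is an assumption made for this sketch, since the text does not spell out the ranking algorithm; the `rank` function, the metric names, and the convention that a missing metric counts as 0.0 are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def rank(
    datasets: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank datasets by the weighted average of their metric values, best first.

    `datasets` maps a dataset name to its metric values (assumed to lie in [0, 1]);
    `weights` maps a metric name to the importance the user assigns to it.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one metric needs a positive weight")
    scored = []
    for name, metrics in datasets.items():
        # A metric that was never computed for a dataset contributes 0.0 (assumption).
        weighted_sum = sum(w * metrics.get(m, 0.0) for m, w in weights.items())
        scored.append((name, weighted_sum / total_weight))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Two users with different tasks can rank the same datasets differently simply by passing different `weights`, which is the sense of “fitness for use” here. Weights on dimensions or categories could be handled the same way after aggregating the metrics that belong to them.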

4 Visualising Quality Metadata

Apart from displaying ranked lists, the Luzzu web application visualises quality metadata as charts. A visualisation wizard helps the user to choose the right visualisation type and charts. Currently, three types of visualisation are supported: (a) multiple datasets vs metric; (b) dataset vs metric over time; (c) quality of dataset. Figure 3 depicts a dataset’s quality evolution over time.

5 Conclusion

Data quality assessment is crucial for the wider deployment and use of Linked Data. With Luzzu we presented a scalable Linked Data quality assessment framework. The Luzzu Web frontend furthermore makes quality assessment easy to use. We see Luzzu as the first step on a long-term research agenda aiming at shedding light on the quality of data published on the Web.

References

1. Debattista, J., Lange, C., Auer, S. Luzzu Quality Metric Language - A DSL for Linked Data Quality Assessment. 2015. arXiv:1504.07758 [cs.DB].

2. Debattista, J., Lange, C., Auer, S. Representing Dataset Quality Metadata using Multi-Dimensional Views. In: SEMANTiCS. 2014, pp. 92-99.

3. Zaveri, A. et al. Quality Assessment for Linked Data. In: Semantic Web Journal (2015). http://www.semantic-web-journal.net/content/quality-assessment-linked-data-survey.