=Paper=
{{Paper
|id=Vol-1362/paper1
|storemode=property
|title=Roomba: An Extensible
Framework to Validate and Build Dataset Profiles
|pdfUrl=https://ceur-ws.org/Vol-1362/PROFILES2015_paper1.pdf
|volume=Vol-1362
}}
==Roomba: An Extensible Framework to Validate and Build Dataset Profiles==
<pdf width="1500px">https://ceur-ws.org/Vol-1362/PROFILES2015_paper1.pdf</pdf>
<pre>
 Roomba: An Extensible Framework to Validate
         and Build Dataset Profiles

                 Ahmad Assaf1,2 , Raphaël Troncy1 and Aline Senart2
     1
         EURECOM, Sophia Antipolis, France, <firstName.lastName@eurecom.fr>
               2
                 SAP Labs France, <firstName.lastName@sap.com>


          Abstract. Linked Open Data (LOD) has emerged as one of the largest
          collections of interlinked datasets on the web. In order to benefit from
          this mine of data, one needs access to descriptive information about
          each dataset (or metadata). This information can be used to delay data
          entropy, enhance dataset discovery, exploration and reuse, and help
          data portal administrators detect and eliminate spam. However, such
          metadata is currently limited to a few data portals, where it is
          usually provided manually and is thus often incomplete and inconsistent
          in terms of quality. To address these issues, we propose a scalable
          automatic approach for extracting, validating, correcting and
          generating descriptive linked dataset profiles. This approach applies
          several techniques in order to check the validity of the metadata
          provided and to generate descriptive and statistical information for a
          particular dataset or for an entire data portal.

          Keywords: Linked Data, Dataset Profile, Metadata, Data Quality


1        Introduction
From 12 datasets cataloged in 2007, the Linked Open Data cloud has grown to
nearly 1000 datasets containing more than 82 billion triples3 [7]. Data is being
published by both the public and private sectors and covers a diverse set of
domains from life sciences to media or government data. The Linked Open Data
cloud is potentially a gold mine for organizations and individuals who are trying
to leverage external data sources in order to produce more informed business
decisions [5].
    Dataset discovery can be done through public data portals like Datahub.io
and publicdata.eu or private ones like quandl.com and enigma.io. Private
portals harness manually curated data from various sources and expose them
to users either freely or through paid plans. Similarly, in some public data
portals, administrators manually review dataset information, then validate,
correct and attach suitable metadata. This information is mainly in the form
of predefined tags such as media, geography, life sciences for organization and
clustering purposes. However, the diversity of those datasets makes it harder
to classify them in a fixed number of predefined tags that can be subjectively
assigned without capturing the essence and breadth of the dataset [22].
Furthermore, the increasing number of datasets available makes the metadata
review and curation process unsustainable even when outsourced to communities.
3
    http://datahub.io/dataset?tags=lod
    There are several Data Management Systems (DMS) that power public data
portals. CKAN4 is the world’s leading open-source data portal platform,
powering web sites like DataHub, Europe’s Public Data and the U.S. Government’s
open data. Modeled on CKAN, DKAN5 is a standalone Drupal distribution
that is used in various public data portals as well. Socrata6 helps public sector
organizations improve data-driven decision making by providing a set of
solutions including an open data portal. In addition to these traditional data
portals, there is a set of tools that allow exposing data directly as RESTful
APIs like thedatatank.com.
    Metadata provisioning is one of the Linked Data publishing best practices
mentioned in [6]. Datasets should contain the metadata needed to effectively un-
derstand and use them. This information includes the dataset’s license, prove-
nance, context, structure and accessibility. The ability to automatically check
this metadata helps in:
 – Delaying data entropy: Information entropy refers to the degradation or
   loss limiting the information content in raw or metadata. As a consequence
   of information entropy, data complexity and dynamicity, the life span of
   data can be very short. Even when the raw data is properly maintained, it
   is often rendered useless when the attached metadata is missing, incomplete
   or unavailable. Comprehensive high quality metadata can counteract these
   factors and increase dataset longevity [21].
 – Enhancing data discovery, exploration and reuse: Users who are
   unfamiliar with a dataset require detailed metadata to interpret and
   analyze it accurately. A study conducted by the European Union
   commission [29] found that both businesses and users face difficulties in
   discovering, exploring and reusing public data due to missing or inconsistent
   metadata information.
 – Enhancing spam detection: Portals hosting public open data like Datahub
   allow anyone to freely publish datasets. Even with security measures like
   captchas and anti-spam devices, detecting spam is increasingly difficult. In
   addition, the increasing number of datasets hinders the scalability of
   this process, making it harder to spot dataset spam correctly and efficiently.
    Data profiling is the process of creating descriptive information and
collecting statistics about data. It is a cardinal activity when facing an
unfamiliar dataset [25]. Data profiles reflect the importance of datasets
without the need for detailed inspection of the raw data. They also help in
assessing the importance of a dataset, improving users’ ability to search and
reuse part of the dataset, and detecting irregularities to improve its quality.
Data profiling typically includes several tasks:
4
  http://ckan.org
5
  http://nucivic.com/dkan/
6
  http://www.socrata.com

 – Metadata profiling: Provides general information on the dataset (dataset
   description, release and update dates), legal information (license information,
   openness), practical information (access points, data dumps), etc.
 – Statistical profiling: Provides statistical information about data types and
   patterns in the dataset (e.g. properties distribution, number of entities and
   RDF triples).
 – Topical profiling: Provides descriptive knowledge on the dataset content
   and structure. This can be in form of tags and categories used to facilitate
   search and reuse.
    In this work, we address the challenges of automatic validation and
generation of descriptive dataset profiles. This paper proposes Roomba, an
extensible framework consisting of a processing pipeline that combines
techniques for data portal identification and dataset crawling with a set of
pluggable modules covering several profiling tasks. The framework validates the
provided dataset metadata against an aggregated standard set of information.
Metadata fields are automatically corrected when possible (e.g. adding a missing
license URL reference). Moreover, a report describing all the issues, and
highlighting those that cannot be automatically fixed, is created to be sent by
email to the dataset’s maintainer. There exist various statistical and topical
profiling tools for both relational and Linked Data. The architecture of the
framework makes it easy to add them as additional profiling tasks. However, in
this paper, we focus on the task of dataset metadata profiling. We validate our
framework against a manually created set of profiles and manually check its
accuracy by examining the results of running it on various CKAN-based data
portals.
    The remainder of the paper is structured as follows. In Section 2, we review
relevant related work. In Section 3, we describe our proposed framework’s archi-
tecture and components that validate and generate dataset profiles. In Section 4,
we evaluate the framework and we finally conclude and outline some future work
in Section 5.


2     Related Work

Data Catalog Vocabulary (DCAT) [12] and the Vocabulary of Interlinked Datasets
(VoID) [11] are concerned with metadata about RDF datasets. There exist sev-
eral tools aiming at exposing dataset metadata using these vocabularies. In [8],
the authors generate VoID descriptions limited to a subset of properties that
can be automatically deduced from resources within the dataset. However, it
still provides data consumers with interesting insights. Flemming’s Data Qual-
ity Assessment Tool7 provides basic metadata assessment as it computes data
quality scores based on manual user input. The user assigns weights to the pre-
defined quality metrics and answers a series of questions regarding the dataset.
These include, for example, the use of obsolete classes and properties by defining
the number of described entities that are assigned disjoint classes, the usage of
stable URIs and whether the publisher provides a mailing list for the dataset.
The ODI certificate8, on the other hand, provides a description of the published
data quality in plain English. It aspires to act as a mark of approval that helps
publishers understand how to publish good open data and users how to use it.
It gives publishers the ability to provide assurance and support on their data
while encouraging further improvements through an ascending scale. The ODI
certificate comes as a free online questionnaire for data publishers focusing on
certain characteristics of their data.
7
    http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php
    Metadata profiling: The Project Open Data Dashboard9 tracks and measures
how US government web sites implement the Open Data principles to understand
the progress and current status of their public data listings. A validator
analyzes machine readable files (e.g. JSON files) for automated metrics like
the resolved URLs, HTTP status and content-type. However, deep schema
information about the metadata, such as description, license information or
tags, is missing. Similarly, on the LOD cloud, the Datahub LOD Validator10 gives
an overview of Linked Data sources cataloged on the Datahub. It offers
step-by-step validator guidance to check a dataset’s completeness level for
inclusion in the LOD cloud. The results are divided into four different
compliance levels from basic to reviewed and included in the LOD cloud. Although
it is an excellent tool to monitor LOD compliance, it cannot give detailed
insights about the completeness of the metadata or an overview of the state of
the entire LOD cloud group, and it is very specific to the LOD cloud group rules
and regulations.
    Statistical profiling: Calculating statistical information on datasets is vital
to applications dealing with query optimization and answering, data cleansing,
schema induction and data mining [18, 15, 22]. Semantic sitemaps [10] and
RDFStats [23] are among the first to deal with RDF data statistics and summaries.
ExpLOD [20] creates statistics on the interlinking between datasets based on
owl:sameAs links. In [25], the author introduces a tool that induces the actual
schema of the data and gathers corresponding statistics accordingly. LODStats
[2] is a stream-based approach that calculates more general dataset statistics.
ProLOD++ [1] is a Web-based tool that allows LOD analysis via automatically
computed hierarchical clustering [4]. Aether [26] generates VoID statistical
descriptions of RDF datasets. It also provides a Web interface to view and compare
VoID descriptions. LODOP [14] is a MapReduce framework to compute, optimize
and benchmark dataset profiles. The main target of this framework is to
optimize the runtime costs for Linked Data profiling. In [19], the authors calculate
certain statistical information for the purpose of observing the dynamic changes
in datasets.
    Topical Profiling: Topical and categorical information facilitates dataset
search and reuse. Topical profiling focuses on content-wise analysis at the
instance and ontological levels. GERBIL [28] is a general entity annotation
framework that provides machine processable output allowing efficient querying.
In addition, there exist several entity annotation tools and frameworks [9] but
none of those systems is designed specifically for dataset annotation. In [16],
the authors created a semantic portal to manually annotate and publish metadata
about both LOD and non-RDF datasets. In [22], the authors automatically
assigned Freebase domains to extracted instance labels of some of the LOD Cloud
datasets. The goal was to provide automatic domain identification, thus
improving dataset clustering and categorization. In [3], the authors extracted
dataset topics by exploiting the graph structure and ontological information,
thus removing the dependency on textual labels. In [13], the authors generate
VoID and VoL descriptions via a processing pipeline that extracts dataset topic
models ranked on graphical models of selected DBpedia categories.
8
   https://certificates.theodi.org/
9
   http://labs.data.gov/dashboard/
10
   http://validator.lod-cloud.net/
    Although the above mentioned tools are able to provide various types of
information about a dataset, there exists no approach that aggregates this in-
formation and is extensible to combine additional profiling tasks. To the best
of our knowledge, this is the first effort towards extensible automatic validation
and generation of descriptive dataset profiles.

3      Profiling Data Portals
In this section, we provide an overview of Roomba’s architecture and the
processing steps for validating and generating dataset profiles. Figure 1 shows
the main steps, which are the following: (i) data portal identification;
(ii) metadata extraction; (iii) instance and resource extraction; (iv) profile
validation; and (v) profile and report generation.
    Roomba is built as a Command Line Interface (CLI) application using Node.js.
Instructions on installing and running the framework are available on its public
GitHub repository11. The various steps are explained in detail below.

3.1     Data Portal Identification
Roomba should be extensible to any data portal that exposes its functionalities
via an externally accessible API. Since every portal can have its own data model,
identifying the software powering data portals is a vital first step. We rely on
several Web scraping techniques in the identification process, which include a
combination of the following:
 – URL inspection: Various CKAN-based portals are hosted on subdomains
   of http://ckan.net, for example CKAN Brazil (http://br.ckan.net).
   Checking the existence of certain URL patterns can detect such cases.
 – Meta tags inspection: The <meta> tag provides metadata about the HTML
   document. Meta tags are used to specify page description, keywords, author,
   etc. Inspecting the content attribute can indicate the type of the data
   portal. We use CSS selectors to check the existence of these meta tags. An
   example of a query selector is meta[content*="ckan"] (all meta tags with
   the attribute content containing the string ckan). This selector can
   identify CKAN portals whereas meta[content*="Drupal"] can identify
   DKAN portals.
 – Document Object Model (DOM) inspection: Similar to the meta tags
   inspection, we check the existence of certain DOM elements or properties. For
   example, CKAN powered portals will have DOM elements with class names
   like ckan-icon or ckan-footer-logo. A CSS selector like .ckan-icon will
   be able to check if a DOM element with the class name ckan-icon exists. The
   list of elements and properties to inspect is stored in a separate configurable
   object for each portal. This allows the addition and removal of elements as
   deemed necessary.

      Fig. 1. Processing pipeline for validating and generating dataset profiles

11
     https://github.com/ahmadassaf/opendata-checker

The identification process for each portal can be easily customized by overriding
the default function. Moreover, adding or removing steps from the identification
process can be easily configured.
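
The inspection steps above can be illustrated with a minimal sketch. It is
written in Python for brevity (Roomba itself is a Node.js application) and uses
the standard library HTML parser instead of CSS selectors. The FINGERPRINTS
object and the identify_portal function are hypothetical names, the fingerprint
values are taken from the examples in the text, and the meta tag comparison is
case-insensitive as a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical, configurable fingerprints per portal type (assumed values,
# taken from the examples in the text).
FINGERPRINTS = {
    "CKAN": {"meta": "ckan", "classes": {"ckan-icon", "ckan-footer-logo"}},
    "DKAN": {"meta": "drupal", "classes": set()},
}


class _Collector(HTMLParser):
    """Collects meta tag content values and all class names on a page."""

    def __init__(self):
        super().__init__()
        self.meta_contents = []
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("content"):
            self.meta_contents.append(attributes["content"].lower())
        if attributes.get("class"):
            self.class_names.update(attributes["class"].split())


def identify_portal(url, html):
    """Return the first portal type whose fingerprint matches, else None."""
    # URL inspection: portals hosted on subdomains of ckan.net.
    host = urlparse(url).hostname or ""
    if host.endswith(".ckan.net"):
        return "CKAN"
    collector = _Collector()
    collector.feed(html)
    for portal, fingerprint in FINGERPRINTS.items():
        # Meta tags inspection, then DOM (class name) inspection.
        meta_hit = any(fingerprint["meta"] in content
                       for content in collector.meta_contents)
        dom_hit = bool(fingerprint["classes"] & collector.class_names)
        if meta_hit or dom_hit:
            return portal
    return None
```

Keeping the fingerprints in a separate object mirrors the configurable
per-portal list described above: supporting another portal type would only
require a new entry.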
     After those preliminary checks, we query one of the portal’s API endpoints.
For example, DataHub is identified as CKAN, so we will query the API endpoint
on http://datahub.io/api/action/package_list. A successful request will
list the names of the site’s datasets, whereas a failing request will signal a possible
failure of the identification process.
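
This confirmation step could look roughly like the following Python sketch. It
assumes the usual CKAN action API response shape, a JSON object with a boolean
success flag and a result list. The function names are hypothetical, and any
network or HTTP error is treated as a failed identification.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def package_list_url(portal_url):
    """Build the CKAN action API endpoint that lists dataset names."""
    return urljoin(portal_url.rstrip("/") + "/", "api/action/package_list")


def parse_package_list(body):
    """Return the dataset names, or None if the body is not a successful
    CKAN answer (a hint that the identification may have failed)."""
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    if not isinstance(payload, dict) or payload.get("success") is not True:
        return None
    result = payload.get("result")
    return result if isinstance(result, list) else None


def confirm_ckan(portal_url, timeout=10):
    """Query the endpoint; any network or HTTP error counts as a failure."""
    try:
        with urlopen(package_list_url(portal_url), timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        return None
    return parse_package_list(body)
```

Separating URL construction and response parsing from the network call keeps
the logic testable without contacting a live portal.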


3.2   Metadata Extraction

Data portals expose a set of information about each dataset as metadata. The
model used varies across portals. However, a standard model should contain
information about the dataset’s title, description, maintainer email, update and
creation date, etc. We divided the metadata information into the following types:
    General information: General information about the dataset, e.g., title,
description, ID, etc. This general information is manually filled by the dataset
owner. In addition, tags and group information are required for classification
and enhancing dataset discoverability. This information can be entered
manually or inferred by modules plugged into the topical profiler.
    Access information: Information about accessing and using the dataset.
This includes the dataset URL, license information (i.e., license title and URL)
and information about the dataset’s resources. Each resource also has a set
of attached metadata, e.g., resource name, URL, format, size.
    Ownership information: Information about the ownership of the dataset,
e.g., organization details, maintainer details, author. The existence of this
information is important to identify the authority to which the generated report
and the newly corrected profile will be sent.
    Provenance information: Temporal and historical information on the dataset
and its resources, for example, creation and update dates, version information,
etc. Most of this information can be automatically filled and tracked.
    Building a standard metadata model is beyond the scope of this paper, and
since we focus on CKAN-based portals, we validate the extracted metadata against
the CKAN standard model12.
    After identifying the underlying portal software, we perform iterative queries
to the API in order to fetch dataset metadata and persist it in a file-based
cache system. Depending on the portal software, we can issue specific extraction
jobs. For example, in CKAN-based portals, we are able to crawl and extract the
metadata of a specific dataset, all the datasets in a specific group (e.g. LOD
cloud) or all the datasets in the portal.
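
A file-based cache of this kind can be sketched in a few lines of Python. The
layout (one JSON file per dataset) and the names cached_fetch, crawl and fetch
are assumptions made for illustration and do not describe Roomba's actual cache
format. The fetch argument stands for any callable that retrieves one dataset's
metadata from the portal API.

```python
import json
from pathlib import Path


def cached_fetch(cache_dir, dataset_id, fetch):
    """Return dataset metadata, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical callable (dataset_id -> dict), for example a
    wrapper around the portal's dataset description API call.
    """
    cache_path = Path(cache_dir) / f"{dataset_id}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    metadata = fetch(dataset_id)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata


def crawl(cache_dir, dataset_ids, fetch):
    """Fetch (or load from cache) the metadata of every listed dataset."""
    return {dataset_id: cached_fetch(cache_dir, dataset_id, fetch)
            for dataset_id in dataset_ids}
```

Crawling a single dataset, a group or a whole portal then only differs in how
the list of dataset identifiers is obtained.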

3.3     Instance and Resource Extraction
From the extracted metadata we are able to identify all the resources associated
with that dataset. They can have various types like a SPARQL endpoint, API,
file, visualization, etc. However, before extracting the resource instance(s) we
perform the following steps:
 – Resource metadata validation and enrichment: Check the resource’s
   attached metadata values. Similar to the dataset metadata, each resource
   should include information about its mimetype, name, description, format,
   valid de-referenceable URL, size, type and provenance. The validation
   process issues an HTTP request to the resource and automatically fills in
   missing information when possible, like the mimetype and size, by
   extracting them from the HTTP response header. However, missing fields like
   name and description that need manual input are marked as missing and
   will appear in the generated summary report.
 – Format validation: Validate specific resource formats against a linter or
   a validator. For example, node-csv13 for CSV files and n314 to validate N3
   and Turtle RDF serializations.
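
The enrichment part of the first step can be sketched as follows, in Python and
under simplifying assumptions: the response headers are passed in as a plain
dictionary, mimetype and size are read from the standard Content-Type and
Content-Length headers, and only name and description are treated as fields
needing manual input. The function name enrich_resource is hypothetical.

```python
def enrich_resource(resource, headers):
    """Fill mimetype and size from HTTP response headers when absent.

    Returns (enriched_copy, missing), where `missing` lists the fields that
    still need manual input and belong in the summary report.
    """
    enriched = dict(resource)
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    content_type = lowered.get("content-type")
    if not enriched.get("mimetype") and content_type:
        # Drop parameters such as "; charset=utf-8".
        enriched["mimetype"] = content_type.split(";")[0].strip()

    content_length = lowered.get("content-length")
    if not enriched.get("size") and content_length and content_length.isdigit():
        enriched["size"] = int(content_length)

    missing = [field for field in ("name", "description")
               if not enriched.get(field)]
    return enriched, missing
```

Returning a copy leaves the originally published metadata untouched, so the
provided and the corrected values can still be compared.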
12
   http://demo.ckan.org/api/3/action/package_show?id=adur_district_spending
13
   https://github.com/wdavidw/node-csv
14
   https://github.com/RubenVerborgh/N3.js

    Considering that certain datasets contain large amounts of resources and
that the machines on which the framework runs might have limited computation
power, a sampler module can be introduced to execute the sample-based
strategies detailed below, as they were found to generate accurate results even
with a comparably small sample size of 10%. These strategies, introduced in
[13], are:
 – Random Sampling: Randomly selects resource instances.
 – Weighted Sampling: Weighs each resource as the ratio of the number of
   datatype properties used to define a resource over the maximum number of
   datatype properties over all the dataset’s resources.
 – Resource Centrality Sampling: Weighs each resource as the ratio of the
   number of resource types used to describe a particular resource divided by
   the total number of resource types in the dataset. This is specific and
   important to RDF datasets where important concepts tend to be more structured
   and linked to other concepts.
    However, the sampler is not restricted only to these strategies. Strategies
like those introduced in [24] can be configured and plugged into the processing
pipeline.
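
The two weighting schemes can be written down directly from their definitions.
The Python sketch below is illustrative only: the input dictionaries are
hypothetical, and selecting the highest-weighted fraction of resources in
top_fraction is an assumption about how the weights are used, not necessarily
the procedure followed in [13].

```python
import random


def weighted_scores(datatype_property_counts):
    """Weight = datatype properties of a resource / maximum over all resources."""
    maximum = max(datatype_property_counts.values(), default=0)
    if maximum == 0:
        return {resource: 0.0 for resource in datatype_property_counts}
    return {resource: count / maximum
            for resource, count in datatype_property_counts.items()}


def centrality_scores(resource_types):
    """Weight = types describing a resource / distinct types in the dataset."""
    all_types = set().union(*resource_types.values())
    if not all_types:
        return {resource: 0.0 for resource in resource_types}
    return {resource: len(set(types)) / len(all_types)
            for resource, types in resource_types.items()}


def random_sample(resources, ratio=0.10, seed=None):
    """Random Sampling: pick about `ratio` of the resources uniformly."""
    resources = list(resources)
    size = max(1, round(len(resources) * ratio)) if resources else 0
    return random.Random(seed).sample(resources, size)


def top_fraction(scores, ratio=0.10):
    """Assumed selection rule: keep the highest-weighted `ratio` of resources."""
    size = max(1, round(len(scores) * ratio)) if scores else 0
    ranked = sorted(scores, key=lambda resource: (-scores[resource], resource))
    return ranked[:size]
```

Because each strategy only produces a score per resource, further strategies
can be plugged in by supplying another scoring function.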

3.4     Profile Validation
A dataset profile should include descriptive information about the data
examined. In our framework, we have identified three main categories of
profiling information. However, the extensibility of our framework allows for
additional profiling techniques to be plugged in easily (e.g. a quality
profiling module reflecting the dataset quality). In this paper, we focus on the
task of metadata profiling.
    The metadata validation process identifies missing information and
determines whether it can be corrected automatically. Each set of metadata
(general, access, ownership and provenance) is validated and corrected
automatically when possible. Each profiler task has a set of metadata fields to
check against. The validation process checks if each field is defined and if the
value assigned is valid.
    There exist many special validation steps for various fields. For example,
email addresses and URLs should be validated to ensure that the value entered
is syntactically correct. In addition, for URLs, we issue an HTTP HEAD
request in order to check if that URL is reachable. We also use the information
contained in the headers of a valid response to extract, compare and correct
some resource metadata values like mimetype and size.
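
The field checks described above can be sketched in Python as follows. The
status labels, the deliberately loose email pattern and all function names are
assumptions made for illustration. The reachability check issues a real HTTP
HEAD request through the standard library and treats any error as unreachable.

```python
import re
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Deliberately loose syntactic check; not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_email(value):
    return isinstance(value, str) and EMAIL_PATTERN.match(value) is not None


def is_valid_url(value):
    """Syntactic check only: an http(s) scheme and a host must be present."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_field(record, field, validator=None):
    """Classify a field as 'missing', 'undefined', 'invalid' or 'ok'."""
    if field not in record:
        return "missing"
    value = record[field]
    if value is None or value == "":
        return "undefined"
    if validator is not None and not validator(value):
        return "invalid"
    return "ok"


def is_reachable(url, timeout=10):
    """Issue an HTTP HEAD request; True when the server answers without error."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        return False
```

Distinguishing a missing field from an undefined value matches the two kinds of
problems that the generated report lists separately.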
    From our experiments, we found out that datasets’ license information is
noisy. The license names, if found, are not standardized. For example, Creative
Commons CCZero can also be CC0 or CCZero. Moreover, the license URI, if
found and if de-referenceable, can point to different reference knowledge bases,
e.g., http://opendefinition.org. To overcome this issue, we have manually
created a mapping file standardizing the set of possible license names and the
reference knowledge base15. In addition, we have also used the open source and
knowledge license information16 to normalize the license information and add
extra metadata like the domain, maintainer and open data conformance.
15
     https://github.com/ahmadassaf/opendata-checker/blob/master/util/licenseMappings.json
{
    "license_id": ["ODC-PDDL-1.0"],
    "disambiguations": ["Open Data Commons Public Domain Dedication and License (PDDL)"]
},
{
    "license_id": ["CC-BY-SA-4.0", "CC-BY-SA-3.0"],
    "disambiguations": ["cc-by-sa", "CC BY-SA", "Creative Commons Attribution Share-Alike"]
}


                               Listing 1.1. License mapping file sample
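
A lookup against such a mapping file could work as in the Python sketch below.
It assumes that the keys in the listing are license_id and disambiguations,
embeds the two sample entries directly instead of reading the file, and matches
names case-insensitively. The function name normalize_license is hypothetical.

```python
# The two sample entries from the listing, with assumed key names.
LICENSE_MAPPINGS = [
    {"license_id": ["ODC-PDDL-1.0"],
     "disambiguations": [
         "Open Data Commons Public Domain Dedication and License (PDDL)"]},
    {"license_id": ["CC-BY-SA-4.0", "CC-BY-SA-3.0"],
     "disambiguations": ["cc-by-sa", "CC BY-SA",
                         "Creative Commons Attribution Share-Alike"]},
]


def normalize_license(raw_name, mappings=LICENSE_MAPPINGS):
    """Return the candidate standard license ids for a free-text name."""
    needle = raw_name.strip().lower()
    for entry in mappings:
        known = entry["license_id"] + entry["disambiguations"]
        if any(needle == name.lower() for name in known):
            return entry["license_id"]
    # Unknown names are left for manual review.
    return []
```

For example, normalize_license("CC BY-SA") returns both CC-BY-SA identifiers,
since the free-text name alone does not determine the license version.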


3.5     Profile and Report Generation

The validation process highlights the missing information and presents it in
a human readable report. The report can be automatically sent to the dataset
maintainer’s email address if it exists in the metadata. In addition to the
generated report, the enhanced profiles are represented in JSON using the CKAN
data model and are publicly available17.
   Data portal administrators need an overall knowledge of the portal datasets
and their properties. Our framework has the ability to generate numerous reports
of all the datasets by passing formatted queries. There are two main sets of
aggregation tasks that can be run:

    – Aggregating meta-field values: Passing a string that corresponds to a
      valid field in the metadata. The field can be flat like license_title
      (aggregates all the license titles used in the portal or in a specific
      group) or nested like resource>resource_type (aggregates all the resource
      types for all the datasets). Such reports are important to have an
      overview of the possible values used for each metadata field.
    – Aggregating key:object meta-field values: Passing two meta-field values
      separated by a colon, e.g., resources>resource_type:resources>name.
      These reports are important as they aggregate the information needed
      and also print the set of values associated with it.

    For example, the meta-field value query resource>resource_type run against
the LODCloud group will result in an array containing [file, api, documentation, ...]
values. These are all the resource types used to describe all the datasets of
the group. However, to also know which datasets contain resources corresponding
to each type, we issue a key:object meta-field query
resource>resource_type:name. The result will be a JSON object having the
resource type as the key and an array of the titles of the corresponding
datasets that have a resource of that type.
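
The two kinds of aggregation queries can be sketched in Python as follows. The
query syntax handled here (">" for nesting, ":" between key and object) follows
the description above, but the function names, the use of the CKAN key
resources for the nested list and the sorted output are assumptions made for
illustration.

```python
from collections import defaultdict


def _values(record, path):
    """Yield every value reached by following an 'a>b' path; lists are expanded."""
    nodes = [record]
    for key in path.split(">"):
        next_nodes = []
        for node in nodes:
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes
    for node in nodes:
        if isinstance(node, list):
            yield from node
        else:
            yield node


def aggregate(datasets, query):
    """'field' -> sorted distinct values; 'key:object' -> {key: sorted objects}."""
    if ":" not in query:
        return sorted({value for dataset in datasets
                       for value in _values(dataset, query)})
    key_path, object_path = query.split(":", 1)
    grouped = defaultdict(set)
    for dataset in datasets:
        objects = list(_values(dataset, object_path))
        for key in _values(dataset, key_path):
            grouped[key].update(objects)
    return {key: sorted(values) for key, values in grouped.items()}
```

With two datasets named dbpedia and geonames that both expose a file resource,
the key:object query resources>resource_type:name maps file to both dataset
names.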

16
     https://github.com/okfn/licenses
17
     https://github.com/ahmadassaf/opendata-checker/tree/master/results


=======================================================================
                            Metadata Report
=======================================================================
 group information is missing. Check organization information as they
          can be mixed sometimes
 organization image_url field exists but there is no value defined
=======================================================================
                            Tag Statistics
=======================================================================
 There is a total of: 21 [undefined] vocabulary_id fields       100.00%
=======================================================================
                            License Report
=======================================================================
 License information has been normalized!
=======================================================================
                          Resource Statistics
=======================================================================
 There is a total of: 10 [missing] url-type fields              100.00%
 There is a total of: 9 [missing] created fields                 90.00%
 There is a total of: 10 [undefined] cache_last_updated fields  100.00%
 There is a total of: 10 [undefined] size fields                100.00%
 There is a total of: 10 [undefined] hash fields                100.00%
 There is a total of: 10 [undefined] mimetype_inner fields      100.00%
 There is a total of: 7 [undefined] mimetype fields              70.00%
 There is a total of: 10 [undefined] cache_url fields           100.00%
 There is a total of: 6 [undefined] name fields                  60.00%
 There is a total of: 9 [undefined] webstore_url fields          90.00%
 There is a total of: 9 [undefined] last_modified fields         90.00%
 There is one [undefined] format field                           10.00%
=======================================================================
                     Resource Connectivity Issues
=======================================================================
 There are 2 connectivity issues with the following URLs:
     - http://dbpedia.org/void/Dataset
=======================================================================
                        Un-Reachable URLs Types
=======================================================================
 There are: 1 unreachable URLs of type [file]

                   Listing 1.2. Excerpt of the DBpedia validation report


4    Experiments and Evaluation
In this section, we present the experiments and evaluation of the proposed
framework. All the experiments are reproducible with our tool and their results
are available in its GitHub repository. CKAN dataset metadata describes four
main sections in addition to the core dataset’s properties. These sections are:
 – Resources: The distributable parts containing the actual raw data. They
   can come in various formats (JSON, XML, RDF, etc.) and can be down-
   loaded or accessed directly (REST API, SPARQL endpoint).
 – Tags: Provide descriptive knowledge on the dataset content and structure.
   They are used mainly to facilitate search and reuse.
 – Groups: A dataset can belong to one or more groups that share common
   semantics. A group can be seen as a cluster or a curation of datasets based
   on shared categories or themes.
 – Organizations: A dataset can belong to one or more organizations controlled
   by a set of users. Organizations are different from groups as they are not
   constructed by shared semantics or properties, but solely on their association
   to a specific administration party.
    Each of these sections contains a set of metadata corresponding to one or
more types (general, access, ownership and provenance). For example, a dataset
resource will have general information such as the resource name, access
information such as the resource URL and provenance information such as creation
date. The framework generates a report aggregating all the problems in all these
sections, fixing field values when possible. Errors can be the result of missing
metadata fields, undefined field values or field value errors (e.g. unreachable URL
or incorrect email addresses).
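
The counts of missing fields and undefined values shown in the report can be
derived with a short routine such as the following Python sketch. The function
name field_statistics is hypothetical, the wording of the output lines is only
modelled on the report excerpt, and a value counts as undefined here when it is
None or an empty string.

```python
def field_statistics(resources, expected_fields):
    """Count, per field, how many resources lack it ('missing') or carry an
    empty value ('undefined'), with the share of affected resources."""
    total = len(resources)
    lines = []
    for field in expected_fields:
        missing = sum(1 for resource in resources if field not in resource)
        undefined = sum(1 for resource in resources
                        if field in resource and resource[field] in (None, ""))
        for label, count in (("missing", missing), ("undefined", undefined)):
            if count:
                share = 100.0 * count / total
                lines.append(f"There is a total of: {count} [{label}] "
                             f"{field} fields {share:.2f}%")
    return lines
```

Value errors such as unreachable URLs or malformed email addresses need the
field-specific checks described in Section 3.4 and are not covered by this
count.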

4.1     Experimental Setup
We ran our tool on two CKAN-based data portals. The first one is datahub.io,
where we specifically targeted the LOD cloud group. The current state of the
LOD cloud report [27] indicates that the LOD cloud contains 1014 datasets. They
were harvested via an LDSpider crawler [17] seeded with 560 thousand URIs.
Roomba, on the other hand, fetches datasets hosted in data portals where
datasets have relevant metadata attached. As a result, we relied on the
information provided by the Datahub CKAN API. Examining the available tags, we
found two candidate groups. The first one, tagged with “lodcloud”, returned 259
datasets, while the second one, tagged with “lod”, returned only 75 datasets.
After manually examining the two lists, we found that the datasets grouped
under the tag “lodcloud” are the correct ones. To qualify other CKAN-based
portals for the experiments, we used http://dataportals.org/, which contains a
comprehensive list of Open Data portals from around the world. In the end, we
chose the Amsterdam data portal18. The portal was commissioned in 2012 by the
Amsterdam Economic Board Open Data Exchange (ODE) and covers a wide range of
information domains (energy, economy, education, urban development, etc.) about
the Amsterdam metropolitan region.
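
As an illustration of how datasets carrying a given tag could be listed through
the CKAN Action API, the sketch below builds a package_search request filtered
by tag and pages through the results. The helper names are hypothetical and the
portal URL passed in is up to the caller; whether a given portal still exposes
this endpoint is an assumption to verify before use. This is not the code used
in the experiments.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(portal, tag, rows=100, start=0):
    """Build a CKAN package_search URL restricted to one tag."""
    query = urlencode({"fq": f"tags:{tag}", "rows": rows, "start": start})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{query}"


def dataset_names(payload):
    """Extract dataset names from a decoded package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN API call did not succeed")
    return [dataset["name"] for dataset in payload["result"]["results"]]


def fetch_tagged(portal, tag, rows=100):
    """Page through all datasets with the given tag (performs HTTP requests)."""
    names, start = [], 0
    while True:
        with urlopen(build_search_url(portal, tag, rows, start)) as response:
            payload = json.load(response)
        page = dataset_names(payload)
        names += page
        start += rows
        if not page or start >= payload["result"]["count"]:
            return names
```

A call such as fetch_tagged(portal_url, "lodcloud") would return the dataset
names for that tag, which can then be compared against a second tag such as
"lod" before deciding which list to keep.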
    We ran the Roomba instance and resource extractors to cache the metadata
files for these datasets locally, and then ran the validation process. The
experiments were executed on a machine with a 2.6 GHz Intel Core i7 processor
and 16 GB of DDR3 memory. The approximate execution time and a summary of the
datasets’ properties are presented in Table 1.


     Data Portal         No. Datasets No. Groups No. Resources Processing Time
     LOD Cloud               259         N/A         1068          140 mins
     Amsterdam Open Data     172          18          480          35 mins
                   Table 1. Summary of the experiments details


  In our evaluation, we focused on two aspects: i) profiling correctness, which
manually assesses the validity of the errors generated in the report, and
ii) profiling completeness, which assesses whether the profilers cover all the
errors in the datasets’ metadata.

18
     http://data.amsterdamopendata.nl/

4.2    Profiling Correctness
To measure profiling correctness, we need to make sure that the issues reported
by Roomba are valid on the dataset, group and portal levels.
   On the dataset level, we chose three datasets from each of the LOD Cloud and
the Amsterdam data portal. The details of these datasets are shown in Table 2.


      Dataset Name                     Data Portal   Group ID    Resources   Tags
      dbpedia                          Datahub       lodcloud           10     21
      event-media                      Datahub       lodcloud            9     15
      bbc-music                        Datahub       lodcloud            2     14
      bevolking cijfers amsterdam      Amsterdam     bevolking           6     12
      bevolking-prognoses-amsterdam    Amsterdam     bevolking           1      3
      religieuze samenkomstlocaties    Amsterdam     bevolking           1      8
                 Table 2. Datasets chosen for the correctness evaluation


   To measure the profiling correctness on the group level, we selected four
groups from the Amsterdam data portal containing a total of 25 datasets. The
choice was made to cover groups in various domains that contain a moderate
number of datasets that can be checked manually (between 3 and 9 datasets).
Table 3 summarizes the groups chosen for the evaluation.


      Group Name                     Domain          Datasets Resources Tags
      bestuur-en-organisatie       Management           9          45    101
      bevolking                     Population          3           8     23
      geografie                     Geography           8          16     56
      openbare-orde-veiligheid Public Order & Safety    5          19     34
                Table 3. Groups chosen for the correctness evaluation


    After running Roomba and examining the results on the selected datasets
and groups, we found that our framework provides 100% correct results on the
individual dataset level and on the aggregation level over groups. Since our
portal-level aggregation is an extension of the group aggregation, we can infer
that the portal-level aggregation also produces completely correct profiles.
However, the lack of a standard way to create and manage collections of
datasets was the source of some errors when comparing the results from these
two portals. For example, in Datahub, we noticed that the group information
was missing for all the datasets, while in the Amsterdam Open Data portal, all
the organization information was missing. Although the error detection is
correct, the overlap in the usage of group and organization can give a false
indication about the metadata quality.
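
The inference from group-level to portal-level correctness rests on the portal
aggregate being built from the group aggregates. A minimal sketch of that idea
follows, assuming that each dataset report is a list of (field, problem) pairs;
the function names and the toy input are invented for illustration and are
neither Roomba's code nor results from the experiments.

```python
from collections import Counter


def aggregate(reports):
    """Count each (field, problem) pair across dataset-level reports."""
    totals = Counter()
    for problems in reports:
        totals.update(problems)
    return totals


def aggregate_portal(groups):
    """Portal-level totals obtained by summing the group-level totals."""
    totals = Counter()
    for dataset_reports in groups.values():
        totals += aggregate(dataset_reports)
    return totals


# Toy input: two hypothetical groups with made-up dataset reports.
toy_groups = {
    "group_a": [
        [("organization", "missing")],
        [("organization", "missing"), ("author_email", "invalid value")],
    ],
    "group_b": [[("author_email", "invalid value")]],
}
portal_totals = aggregate_portal(toy_groups)
print(portal_totals)
```

Because the portal totals are only a sum over group totals, any counting error
at the portal level would have to originate in a group or dataset report.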

4.3     Profiling Completeness
We analyzed the completeness of our framework by manually constructing a
set of profiles that act as a gold standard. These profiles cover the range of
uncommon problems that can occur in a certain dataset19 . These errors are:
 – Incorrect mimetype or size for resources;
 – Invalid number of tags or resources defined;
 – License information that cannot be normalized via the license id or
   the license title, or that yields an incorrect normalization result;
 – Syntactically invalid author email or maintainer email.
    After running our framework on each of these profiles, we measured the
completeness and correctness of the results. We found that our framework
correctly covers all the metadata problems that can be found in the standard
CKAN model.
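
The text does not define how completeness and correctness were computed. One
plausible reading, sketched below, treats the seeded errors of a gold-standard
profile as the expected set and compares it with the reported set, so that
completeness behaves like recall and correctness like precision. That reading,
the function names and the sample profile are assumptions; the num_tags and
num_resources counters are standard fields of a CKAN dataset record.

```python
def count_problems(dataset):
    """Flag declared counts that disagree with the listed tags or resources."""
    problems = set()
    for counter, items in (("num_tags", "tags"), ("num_resources", "resources")):
        if dataset.get(counter) != len(dataset.get(items, [])):
            problems.add((counter, "invalid value"))
    return problems


def completeness_and_correctness(reported, expected):
    """Return (share of expected errors found, share of reports that are valid)."""
    hits = reported & expected
    completeness = len(hits) / len(expected) if expected else 1.0
    correctness = len(hits) / len(reported) if reported else 1.0
    return completeness, correctness


# Made-up gold-standard profile with one seeded error: num_tags is wrong.
profile = {
    "num_tags": 3,
    "tags": [{"name": "a"}],
    "num_resources": 1,
    "resources": [{"url": "http://example.org/x"}],
}
expected_errors = {("num_tags", "invalid value")}
scores = completeness_and_correctness(count_problems(profile), expected_errors)
print(scores)
```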

5      Conclusion and Future Work
In this paper, we proposed a scalable automatic approach for extracting, validat-
ing, correcting and generating descriptive linked dataset profiles. This approach
applies several techniques in order to check the validity of the metadata provided
and to generate descriptive and statistical information for a particular dataset
or for an entire data portal. Based on our experiments running the tool on the
LOD cloud, we discovered that the general state of the datasets needs attention,
as most of them lack informative access information and their resources suffer
from low availability. These two metrics are of high importance for enterprises
looking to integrate and use external linked data.
    We observed that the issues surrounding metadata quality directly affect
dataset search, as data portals rely on such information to power their search
index. We noted the need for tools that are able to identify various issues in
this metadata and correct them automatically. We evaluated our framework
manually against two prominent data portals and showed that the validation of
dataset metadata profiles can be automated at scale, completely and correctly.
    As part of our future work, we plan to introduce workflows that will be
able to correct the rest of the metadata either automatically or through
intuitive manually-driven interfaces. We also plan to integrate statistical and
topical profilers to be able to generate full, comprehensive profiles. In
addition, we intend to suggest a ranked standard metadata model that will help
generate more accurate and scored metadata quality profiles. We further plan to
run this tool on various CKAN-based data portals and to schedule periodic
reports to monitor the evolution of the datasets’ metadata. Finally, at some
stage, we plan to extend this tool to other data portal types such as DKAN and
Socrata.
19
     https://github.com/ahmadassaf/opendata-checker/tree/master/test

Acknowledgments
This research has been partially funded by the European Union’s 7th Framework
Programme via the project Apps4EU (GA No. 325090).


References
 1. Z. Abedjan, T. Gruetze, A. Jentzsch, and F. Naumann. Profiling and mining
    RDF data with ProLOD++. In 30th IEEE International Conference on Data
    Engineering (ICDE), pages 1198–1201, 2014.
 2. S. Auer, J. Demter, M. Martin, and J. Lehmann. LODStats - an Extensible Frame-
    work for High-performance Dataset Analytics. In 18th International Conference
    on Knowledge Engineering and Knowledge Management (EKAW), pages 353–362,
    Galway, Ireland, 2012.
 3. C. Böhm, G. Kasneci, and F. Naumann. Latent Topics in Graph-structured Data.
    In 21st ACM International Conference on Information and Knowledge Manage-
    ment (CIKM), pages 2663–2666, Maui, Hawaii, USA, 2012.
 4. C. Böhm, F. Naumann, Z. Abedjan, D. Fenz, T. Grütze, D. Hefenbrock, M. Pohl,
    and D. Sonnabend. Profiling linked open data with ProLOD. In 26th International
    Conference on Data Engineering Workshops (ICDEW), 2010.
 5. D. Boyd and K. Crawford. Six provocations for big data. A Decade in Internet
    Time: Symposium on the Dynamics of the Internet and Society, 2011.
 6. C. Bizer. Evolving the Web into a Global Data Space. In 28th British National
    Conference on Advances in Databases, 2011.
 7. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far.
    International Journal on Semantic Web and Information Systems (IJSWIS), 2009.
 8. C. Böhm, J. Lorey, and F. Naumann. Creating voiD Descriptions for Web-scale
    Data. Journal of Web Semantics, 9(3):339–345, 2011.
 9. M. Cornolti, P. Ferragina, and M. Ciaramita. A Framework for Benchmarking
    Entity-annotation Systems. In 22nd World Wide Web Conference (WWW), 2013.
10. R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello. Semantic
    Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web. In
    5th European Semantic Web Conference (ESWC), pages 690–704, Tenerife, Spain,
    2008.
11. R. Cyganiak, J. Zhao, M. Hausenblas, and K. Alexander. Describing Linked
    Datasets with the VoID Vocabulary. W3C Note, 2011. http://www.w3.org/TR/
    void/.
12. F. Maali and J. Erickson. Data Catalog Vocabulary (DCAT). W3C Recommendation,
    2014. http://www.w3.org/TR/vocab-dcat/.
13. B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Ne-
    jdl. A Scalable Approach for Efficiently Generating Structured Dataset Topic Pro-
    files. In 11th European Semantic Web Conference (ESWC), 2014.
14. B. Forchhammer, A. Jentzsch, and F. Naumann. LODOP - Multi-Query Opti-
    mization for Linked Data Profiling Queries. In International Workshop on Dataset
    PROFIling and fEderated Search for Linked Data (PROFILES), Heraklion, Greece,
    2014.
15. M. Frosterus, E. Hyvönen, and J. Laitio. Creating and Publishing Semantic Meta-
    data about Linked and Open Datasets. In Linking Government Data. 2011.
16. M. Frosterus, E. Hyvönen, and J. Laitio. DataFinland - A Semantic Portal for
    Open and Linked Datasets. In 8th Extended Semantic Web Conference (ESWC),
    pages 243–254, 2011.
17. R. Isele, J. Umbrich, C. Bizer, and A. Harth. LDspider: An Open-source Crawl-
    ing Framework for the Web of Linked Data. In 9th International Semantic Web
    Conference (ISWC), Posters & Demos Track, 2010.
18. A. Jentzsch. Profiling the Web of Data. In 13th International Semantic Web
    Conference (ISWC), Doctoral Consortium, Trentino, Italy, 2014.
19. T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan. Observing
    Linked Data Dynamics. In 10th European Semantic Web Conference (ESWC),
    2013.
20. S. Khatchadourian and M. P. Consens. ExpLOD: Summary-based Exploration of
    Interlinking and RDF Usage in the Linked Open Data Cloud. In 7th Extended
    Semantic Web Conference (ESWC), pages 272–287, Heraklion, Greece, 2010.
21. Kovács-Láng. Global Terrestrial Observing System. Technical report, GTOS Cen-
    tral and Eastern European Terrestrial Data Management and Accessibility Work-
    shop, 2000.
22. S. Lalithsena, P. Hitzler, A. Sheth, and P. Jain. Automatic Domain Identification
    for Linked Open Data. In IEEE/WIC/ACM International Joint Conferences on
    Web Intelligence (WI) and Intelligent Agent Technologies (IAT), pages 205–212,
    2013.
23. A. Langegger and W. Wöß. RDFStats - An Extensible RDF Statistics Generator
    and Library. In 20th International Workshop on Database and Expert Systems
    Application (DEXA), pages 79–83, 2009.
24. J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In 12th ACM
    International Conference on Knowledge Discovery and Data Mining (KDD),
    2006.
25. H. Li. Data Profiling for Semantic Web Data. In International Conference on Web
    Information Systems and Mining (WISM), pages 472–479, 2012.
26. E. Mäkelä. Aether - Generating and Viewing Extended VoID Statistical Descrip-
    tions of RDF Datasets. In 11th European Semantic Web Conference (ESWC),
    Demo Track, Heraklion, Greece, 2014.
27. M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the Linked Data
    Best Practices in Different Topical Domains. In 13th International Semantic
    Web Conference (ISWC), 2014.
28. R. Usbeck, M. Röder, A.-C. Ngonga-Ngomo, C. Baron, A. Both, M. Brümmer,
    D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, P. Ferragina, C. Lemke,
    A. Moro, R. Navigli, F. Piccinno, G. Rizzo, H. Sack, R. Speck, R. Troncy, J. Wait-
    elonis, and L. Wesemann. GERBIL - General Entity Annotation Benchmark
    Framework. In 24th World Wide Web Conference (WWW), 2015.
29. G. Vickery. Review of Recent Studies on PSI-use and Related Market Develop-
    ments. Technical report, EC DG Information Society, 2011.

</pre>