<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Linked Data Quality; Editors: Magnus Knuth,
Dimitris Kontokostas, and Harald Sack Sept.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards assured data quality and validation by data certification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>John P. McCrae</string-name>
          <email>jmccrae@cit-ec.uni-bielefeld.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cord Wiljes</string-name>
          <email>cwiljes@cit-ec.uni-bielefeld.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Cimiano</string-name>
          <email>cimiano@cit-ec.uni-bielefeld.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CITEC, Bielefeld University</institution>
          ,
          <addr-line>Inspiration 1, Bielefeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>2</volume>
      <issue>2014</issue>
      <abstract>
        <p>An increasingly large amount of data relevant to a wide variety of scientific domains is self-published by scientists on websites, and this is proving to be an important resource for the replicability and further development of science. Much of this data is even made available as linked data. However, the self-publishing model provides no quality control on the data, and as such datasets frequently contain errors. We therefore consider the architecture of a system that enables the certification of data (both linked and otherwise) by a web service and the sharing of this certification on the web, and consider why this may improve data quality.</p>
      </abstract>
      <kwd-group>
        <kwd>data quality</kwd>
        <kwd>data sharing</kwd>
        <kwd>open science</kwd>
        <kwd>validation service</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        It has been widely acknowledged across the sciences that the
publishing of data generated or required for an experiment
is a crucial step towards the replicability of experiments or
analyses [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, it is also the case that most data
is of poor quality and plagued by basic data errors [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ].
In this paper we tackle an aspect of data quality that we refer
to as "readiness for use", by which we mean whether the
data can be directly applied, rather than whether its content is
actually useful for a specific application. Errors that affect
readiness for use can easily be detected by validation in a manner that does
not need to know the domain or the intended application
of the data. Such errors not only make the data
fundamentally harder to use but also mean that anyone consuming
the dataset must first correct any existing data errors,
possibly making unwarranted assumptions about the data and thus
potentially introducing unintended modifications.
Much of this is due to the fact that for many small datasets
there is insufficient institutional support for the publication
of data, leading to many datasets containing formal errors,
such as incorrectly escaped characters. It is our belief that
many scientists who self-publish datasets do not make such
errors out of intention or indifference, but instead out of a
lack of supporting validation services. To this end we propose
a simple, general, extensible web service to provide syntactic
and semantic validation of data in, initially, XML and RDF,
which can be extended to a wider range of data formats.
Finally, a second key goal is to provide continuous validation:
we continue to check the validity of
resources periodically after they are published. It is in fact a
common problem that resources and data cease to be
available after the end of a funding period, and as such the
data generated during the project are lost. It is also
quite common for URLs to be changed for technical reasons
without a redirect from the old URL being implemented. For
example, in a study of MEDLINE papers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it was found
that 37% of URLs quoted in papers had become unavailable
or were only intermittently available after publication,
although it was unclear how many of these URLs referred to
datasets.
      </p>
      <p>The architecture of such a system has several clear design
goals in order to cope with such a wide range of potential
resources. The architecture should fulfill the following
requirements:</p>
      <p>Extensibility: There is a wide range of data formats in
use in scientific work, and as such the system should be able to
grow and extend to the wide range of datasets that are
available on the web.</p>
      <p>Efficiency: The system should be able to process
potentially very large datasets in a reasonable amount of
time. For that reason, validation algorithms that have
a linear time complexity in the input size are to be
preferred.</p>
      <p>Tiered Architecture: It should be possible to apply successively deeper
validation layers, such that we can first validate whether the data
is available on the web and whether it uses
a standard and open format. Further, in the case that
the data uses a valid RDF vocabulary, we can check
whether it conforms to the RDFS/OWL schema or ontology
that it claims to adhere to.</p>
      <p>Such an architecture should allow us to quickly build an
extensible service that allows new data formats and models
to be handled and validated.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MOTIVATION</title>
      <p>
        The Open Science movement advocates sharing the data
that scientific results are based on [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Open data
publication is expected to improve the integrity and efficiency
of science. Errors and fraud will be easier to detect and
valuable research data can be re-used by other scientists for
their own research questions. Therefore, scientific journals
and research funding agencies worldwide have been
instituting policies for data sharing.
      </p>
      <p>
        Good scientific practice calls for research to be reproducible,
i.e. other researchers must be able to test the data as well
as the analysis procedures. The growing volume and
diversity of digital research data and the strong increase in
importance of computational methods in all empirical
sciences have created hurdles for this ideal. Whereas in the
past reproducibility in the scientific research process (Fig.
1) was mainly concerned with reproducing the experimental
result, in recent years it has become increasingly difficult to
ensure the reproducibility of the computational analysis of
research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, a new "culture of reproducibility for
computational science" [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is needed.
      </p>
      <p>For data to be useful it has to be of high quality, so
additional efforts will be necessary to test and ensure data
quality. Standards and best practices for data publication
need to be defined. Building on the proven workflows for
quality assessment in science, we propose a combination of
tool-assisted automated quality evaluation and
a social, peer-review-based approach.</p>
    </sec>
    <sec id="sec-3">
      <title>3. TARGET DATASETS</title>
      <p>In general, we impose three main conditions
on datasets that are necessary in order to build a service
for their validation. Firstly, we require that
the dataset is open. Here we do not require that
the license is necessarily fully open, such as a
CC-BY<sup>1</sup> license, but rather that we can
access the dataset systematically by downloading it from
the web, without the impediment of authentication systems
or the like. Secondly, it is important that the dataset is a
single file, as we wish to download the dataset without the
user having to fill in complex metadata to describe how we
may access individual files. We see no conceivable use case
where a dataset cannot be combined into a single file by
archiving or a similar method. Finally, we require that the
dataset uses a standard format, that is, a format that is open
and is standardized by some standardization body. These
requirements are similar to the third star of the "5 Star Open
Data Model"<sup>2</sup>. The advantage of these requirements is that
we do not require complex metadata to describe a dataset
but instead require only a download URL, which is easy to
work with.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ARCHITECTURE</title>
      <p>
        The certification system we propose in this paper takes the
form of a very simple web service which takes as
input a single URL and assigns a local identifier (also a
URL, based on the MD5 hash of the external URL) to the
dataset, under which we make the results of the process
available by means of linked data. As such the service is based
on simple RESTful principles, allowing a single URL to
be posted to the service and the resulting report URL to be
returned by means of an HTTP redirect. Dereferencing the
returned URL will give the current status of the resource as
an RDF document based on the DCAT vocabulary [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
the DataID scheme [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
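      <p>As an illustration of the identifier scheme described above, the following minimal Python sketch derives a report URL from the MD5 hash of a submitted dataset URL and models the redirect that answers a submission. The base URL, the function names and the choice of status code 303 are assumptions made for this example; the text only states that an HTTP redirect is used.</p>
      <preformat><![CDATA[
```python
import hashlib
from typing import Dict, Tuple

# Hypothetical base URL of the certification service (an assumption for this sketch).
SERVICE_BASE = "http://certifier.example.org/report/"


def report_url(dataset_url: str, base: str = SERVICE_BASE) -> str:
    """Return the local report URL derived from the MD5 hash of the dataset URL."""
    digest = hashlib.md5(dataset_url.encode("utf-8")).hexdigest()
    return base + digest


def handle_submission(dataset_url: str) -> Tuple[int, Dict[str, str]]:
    """Model the response to a POSTed URL as (status, headers).

    The 303 status code is an assumption; any HTTP redirect would fit the text.
    """
    return 303, {"Location": report_url(dataset_url)}


if __name__ == "__main__":
    status, headers = handle_submission("http://example.org/data.rdf")
    print(status, headers["Location"])
```
]]></preformat>
      <p>Because the identifier depends only on the dataset URL, the same submission always maps to the same report, and a client can compute the report URL without contacting the service.</p>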
    </sec>
    <sec id="sec-5">
      <title>4.1 User interaction</title>
      <p>A key goal of the web service is to engage with a wide range
of data publishers, including many who may not be
familiar with web services and RESTful principles. As such, we
acknowledge that it is important to make the
service usable by a wide range of users. Thus, we provide a
simple form-based interaction that explains to the user how to use
the web service. Furthermore, users see the report
URL immediately, as it is based on a hash function and
calculated in the browser.</p>
      <p>Of most importance, however, is the final step, where a
certificate is provided which users can include on their own
website next to the download link. This certificate will
dynamically display the dataset's current evaluation as an iconic
image, which will contain a brief summary of the dataset in
terms of badges or stars awarded to the dataset based on the
validation. This image will be provided directly at a URL
derived from the dataset URL by means of an MD5 hash and will
thus be up-to-date with current evaluations, and firmly tied
to that URL, encouraging data providers not to change the URL
without providing a URL forwarding mechanism.
<fn id="fn1"><label>1</label><p>https://creativecommons.org/licenses/by/4.0/</p></fn>
<fn id="fn2"><label>2</label><p>http://www.w3.org/DesignIssues/LinkedData.html</p></fn>
The possible awards are as follows:</p>
      <p>Warning: This URL is invalid, has not yet been analysed,
or the dataset has not been available for more than
three months.</p>
      <p>Bronze star: It is possible to download this URL.</p>
      <p>Silver star: It is possible to download this URL, extract it
if necessary, and the data contains syntactically valid
RDF or XML.</p>
      <p>Gold star: As silver, but deeper semantic validation
(discussed below) was also successful.</p>
      <p>Linked data star: The data is valid and contains external
links.</p>
      <p>It is important to stress that the linked data star is not
awarded for simply using RDF, but instead for having at
least 50 triples<fn id="fn3"><label>3</label><p>Following http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation</p></fn> that refer to entities hosted on some other
domain, where the domain of the dataset is assumed to be
that of its download URL.</p>
      <p>
        These stars are included as part of the badge that the user
can display on the website and as such allow external users to
easily verify the quality of the downloaded dataset. These
badges, which take the form of a custom generated PNG
image, allow the data publisher to show the quality of their
data<sup>4</sup>, and assure the user of the quality of the data. The
image's URL is related to the more detailed report and so it
is easy to verify that it refers to the published dataset.
Furthermore, we believe that issuing a separate star for linking the dataset
will be an enticement for data providers
to follow linked data principles and thus move towards 5 star
data as defined by Heath and Bizer [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
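      <p>The award levels described in this section can be summarised in a short Python sketch. It is only an illustration under stated assumptions: triples are represented as plain tuples of strings, only the object position is inspected for external references, and "valid" for the linked data star is read as syntactically valid. The names <monospace>Evaluation</monospace>, <monospace>badges</monospace> and <monospace>count_external_triples</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlparse

MIN_EXTERNAL_TRIPLES = 50  # threshold stated in the text

Triple = Tuple[str, str, str]


def is_external(iri: str, dataset_host: Optional[str]) -> bool:
    """True if the value is an HTTP(S) IRI hosted on a domain other than the dataset's."""
    parsed = urlparse(iri)
    return parsed.scheme in ("http", "https") and parsed.hostname not in (None, dataset_host)


def count_external_triples(triples: Iterable[Triple], download_url: str) -> int:
    """Count triples whose object refers to an entity on another domain.

    Assumption: only the object position is checked; the dataset's domain is
    taken to be the host of its download URL, as in the text.
    """
    host = urlparse(download_url).hostname
    return sum(1 for (_s, _p, o) in triples if is_external(o, host))


@dataclass
class Evaluation:
    downloadable: bool
    syntax_valid: bool
    semantics_valid: bool
    external_triples: int


def badges(ev: Evaluation) -> List[str]:
    """Map an evaluation to the awards listed in the text."""
    if not ev.downloadable:
        return ["warning"]
    if not ev.syntax_valid:
        level = "bronze"
    elif not ev.semantics_valid:
        level = "silver"
    else:
        level = "gold"
    result = [level]
    # The linked data star is awarded separately from the quality level.
    if ev.syntax_valid and ev.external_triples >= MIN_EXTERNAL_TRIPLES:
        result.append("linked-data")
    return result
```
]]></preformat>
      <p>Keeping the linked data star separate from the bronze, silver and gold levels mirrors the text: a dataset can be fully valid without linking to other datasets, and the extra star rewards the linking itself.</p>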
    </sec>
    <sec id="sec-6">
      <title>4.2 Validation architecture</title>
      <p>As our goal is to handle datasets which are both very large
and potentially very diverse, implementing the
validation system is far from trivial. To this
end, we require that the validation itself follows specific
requirements. The most important of these requirements are
as follows:</p>
      <p>The service will not permanently store any data, both
for practical reasons and to ensure that we do not
violate any licenses. This means the service will not be
able to act as a back-up or an alternative source for
any of these datasets. As such the service is not
intended to replace the use of a DOI to provide a fixed
identifier for the data.</p>
      <p>The steps should be able to process the dataset in a
single pass, without either using significant memory
or requiring the creation of a large database. This
requirement prevents a single validation run from
monopolizing the resources of the server.</p>
      <p>It should be possible to add new steps without
significant modification to the system. This will enable not
only us but also outside collaborators to contribute
new validation steps, and as such we will make the
source code available on the web and accept
appropriate extensions.</p>
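      <p>To make the single-pass requirement concrete, the following Python sketch checks XML well-formedness by feeding a stream to an incremental SAX parser in fixed-size chunks, so that memory use does not grow with the size of the document. It uses only the Python standard library; the chunk size, the element counter and the function name are choices made for this example and are not part of the described system.</p>
      <preformat><![CDATA[
```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class ElementCounter(ContentHandler):
    """Keeps only a counter, so memory use is independent of document size."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def check_well_formed(stream, chunk_size=65536):
    """Read the stream once in chunks; return (ok, element_count, message)."""
    parser = xml.sax.make_parser()
    # Do not fetch external entities while validating untrusted input.
    parser.setFeature(feature_external_ges, False)
    handler = ElementCounter()
    parser.setContentHandler(handler)
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
        parser.close()
    except xml.sax.SAXParseException as err:
        return False, handler.elements, str(err)
    # A stream without any element is not a well-formed document either.
    return handler.elements > 0, handler.elements, ""


if __name__ == "__main__":
    print(check_well_formed(io.BytesIO(b"<a><b/><b/></a>"), chunk_size=4))
```
]]></preformat>
      <p>Each chunk is discarded after it has been parsed, which is what keeps the step from monopolizing server memory on very large files.</p>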
      <p>
        The architecture of the system is illustrated in Figure 3.
The basic services start with the
download step, which as its name suggests obtains a copy of the
resource by HTTP(S). The next step, which we call the
format sniffer, attempts to deduce the format of the file. It
does this by looking at the file name (extension), the HTTP
headers and the first 1KB of the file. If the file is found to
be an archive of some form then we extract it and apply
the format sniffer to each extracted file. We also note that
the format sniffer is extensible by means of dependency
injection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], allowing external contributors to easily add new
formats.
      </p>
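      <p>A format sniffer of this kind could be sketched in Python as follows. The sketch assumes that each format is recognised by a small function receiving the file name, the HTTP headers and the first 1 KB of content, and that these functions are injected into the sniffer so that new formats can be added without changing it. The specific extensions, media types and magic bytes checked here are illustrative assumptions, not the rules used by the actual service.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Dict, List, Optional

# A sniffer inspects (file name, HTTP headers, first 1 KB) and returns a format or None.
Sniffer = Callable[[str, Dict[str, str], bytes], Optional[str]]

RDF_EXTENSIONS = (".rdf", ".ttl", ".nt")
RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/n-triples")
ARCHIVE_MAGIC = (b"PK\x03\x04", b"\x1f\x8b")  # zip and gzip signatures


def media_type(headers: Dict[str, str]) -> str:
    """Return the Content-Type value without parameters such as charset."""
    return headers.get("Content-Type", "").split(";")[0].strip().lower()


def sniff_archive(name: str, headers: Dict[str, str], head: bytes) -> Optional[str]:
    if head.startswith(ARCHIVE_MAGIC) or name.lower().endswith((".zip", ".gz")):
        return "archive"
    return None


def sniff_rdf(name: str, headers: Dict[str, str], head: bytes) -> Optional[str]:
    if (name.lower().endswith(RDF_EXTENSIONS)
            or media_type(headers) in RDF_TYPES
            or b"<rdf:RDF" in head):
        return "rdf"
    return None


def sniff_xml(name: str, headers: Dict[str, str], head: bytes) -> Optional[str]:
    if (head.lstrip().startswith(b"<?xml")
            or name.lower().endswith(".xml")
            or media_type(headers) in ("application/xml", "text/xml")):
        return "xml"
    return None


class FormatSniffer:
    """Tries the injected sniffers in order and returns the first match."""

    def __init__(self, sniffers: List[Sniffer]):
        self._sniffers = list(sniffers)

    def sniff(self, name: str, headers: Dict[str, str], head: bytes) -> Optional[str]:
        head = head[:1024]  # only the first 1 KB is inspected
        for sniffer in self._sniffers:
            fmt = sniffer(name, headers, head)
            if fmt is not None:
                return fmt
        return None


if __name__ == "__main__":
    sniffer = FormatSniffer([sniff_archive, sniff_rdf, sniff_xml])
    print(sniffer.sniff("data.ttl", {}, b"@prefix ex: <http://example.org/> ."))
```
]]></preformat>
      <p>The order of the injected functions matters in this sketch: RDF/XML documents also begin with an XML declaration, so the RDF check is placed before the generic XML check.</p>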
      <p>
        Then the system applies a format-specific validator, such
as a SAX parser for XML, or the Rapper<fn id="fn5"><label>5</label><p>http://librdf.org/raptor/rapper.html</p></fn> tool for RDF
documents. Each of these services is implemented as a single
command and is extended to return an RDF document.
This RDF document contains the result of the execution
(success, failure, internal error), any potential next steps
to run in the chain and any extra annotations to be added
to the report.<fn id="fn4"><label>4</label><p>This is similar to the use of build status images by
continuous integration servers, such as Travis CI.</p></fn> For example, if the XML syntax validator
finds a link to an XML document type definition (DTD) or
schema description (XSD) then the service may indicate that
validation according to the schema is the next step in the
chain, which may have already been carried out by the SAX
parser. Furthermore, the steps may yield additional
output. For example, Rapper produces the number of triples
and this is then added to the report using the VoID
vocabulary. Finally, we apply deeper tests to the RDF using the
RDFUnit [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] framework, which checks whether the dataset
conforms to the constraints defined by its
ontology. This framework is based on SPARQL and works
on the principle of checking whether certain queries produce
results as intended.
      </p>
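      <p>The chaining of validation steps described above can be sketched in Python as follows. In the described system each step returns an RDF document; for brevity this sketch models the same three pieces of information (outcome, next steps, annotations) as a plain data class, and passes the data as an in-memory byte string. The step names and the two toy steps are hypothetical and stand in for real validators such as a SAX parser.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    INTERNAL_ERROR = "internal error"


@dataclass
class StepResult:
    outcome: Outcome
    next_steps: List[str] = field(default_factory=list)
    annotations: Dict[str, object] = field(default_factory=dict)


Step = Callable[[bytes], StepResult]


def run_chain(steps: Dict[str, Step], first: str, data: bytes) -> Dict[str, StepResult]:
    """Run steps starting at `first`, following the next steps each result names."""
    report: Dict[str, StepResult] = {}
    queue = [first]
    while queue:
        name = queue.pop(0)
        if name in report or name not in steps:
            continue  # skip steps already run or not registered
        try:
            result = steps[name](data)
        except Exception:
            # A crashing step is reported as an internal error, not a data failure.
            result = StepResult(Outcome.INTERNAL_ERROR)
        report[name] = result
        if result.outcome is Outcome.SUCCESS:
            queue.extend(result.next_steps)
    return report


def xml_syntax(data: bytes) -> StepResult:
    """Toy stand-in for a real XML syntax check."""
    if not data.lstrip().startswith(b"<"):
        return StepResult(Outcome.FAILURE)
    follow_up = ["xml-schema"] if b"<!DOCTYPE" in data else []
    return StepResult(Outcome.SUCCESS, next_steps=follow_up)


def xml_schema(data: bytes) -> StepResult:
    """Toy stand-in for schema validation; always succeeds."""
    return StepResult(Outcome.SUCCESS, annotations={"schema-checked": True})


if __name__ == "__main__":
    registry = {"xml-syntax": xml_syntax, "xml-schema": xml_schema}
    for step, res in run_chain(registry, "xml-syntax", b"<!DOCTYPE a SYSTEM 'a.dtd'><a/>").items():
        print(step, res.outcome.value, res.annotations)
```
]]></preformat>
      <p>Because steps are looked up by name in a registry, a contributor can add a new validator by registering one more function, which fits the extensibility requirement stated earlier in this section.</p>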
    </sec>
    <sec id="sec-7">
      <title>4.3 Continuous validation</title>
      <p>While datasets are frequently of good quality when released,
one of the key concerns in data quality is that eventually
these datasets become unavailable or the URL they are
published at changes. As such, our service plans not only to perform
initial validation but also to provide continuous validation.
To this end we will access the URL by means of a
header-only request (falling back to a GET for servers that do not
support HEAD). Then by analysing the response status and headers,
especially the Last-Modified header, we can deduce whether a resource
is likely to have changed. In such cases we can re-run the
full validation chain. If a resource remains unavailable for a fixed time
period we will mark it as not downloadable.</p>
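      <p>The periodic check described above could look roughly like the following Python sketch. The HTTP request is passed in as a function so that the decision logic can be exercised without network access; the status codes treated as "HEAD not supported" (405 and 501), the 90-day grace period and all function names are assumptions made for this illustration.</p>
      <preformat><![CDATA[
```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Dict, Optional, Tuple

# A requester performs one HTTP request and returns (status code, headers).
Requester = Callable[[str, str], Tuple[int, Dict[str, str]]]

# Assumption: the "fixed time period" is taken as roughly three months.
GRACE_PERIOD = timedelta(days=90)


def probe(url: str, request: Requester) -> Tuple[int, Dict[str, str]]:
    """Issue a HEAD request, falling back to GET if HEAD is not supported."""
    status, headers = request("HEAD", url)
    if status in (405, 501):
        status, headers = request("GET", url)
    return status, headers


def has_changed(headers: Dict[str, str], last_seen: Optional[datetime]) -> bool:
    """Compare Last-Modified with the time of the previous validation."""
    value = headers.get("Last-Modified")
    if value is None or last_seen is None:
        return True  # cannot tell, so re-validate to be safe
    try:
        return parsedate_to_datetime(value) > last_seen
    except (TypeError, ValueError):
        return True  # unparseable date, treat as changed


def next_action(status: int, headers: Dict[str, str], last_seen: Optional[datetime],
                failing_since: Optional[datetime], now: datetime) -> str:
    """Decide what the service should do after one periodic probe."""
    if status >= 400:
        if failing_since is not None and now - failing_since >= GRACE_PERIOD:
            return "mark-not-downloadable"
        return "retry-later"
    return "revalidate" if has_changed(headers, last_seen) else "unchanged"


if __name__ == "__main__":
    def fake_request(method: str, url: str) -> Tuple[int, Dict[str, str]]:
        # Stand-in for a real HTTP client; always reports a fixed modification date.
        return 200, {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}

    status, headers = probe("http://example.org/data.rdf", fake_request)
    earlier = datetime(2015, 1, 1, tzinfo=timezone.utc)
    print(next_action(status, headers, earlier, None, datetime.now(timezone.utc)))
```
]]></preformat>
      <p>Only when the probe suggests a change is the full validation chain run again, which keeps the cost of continuous validation low for datasets that are stable.</p>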
    </sec>
    <sec id="sec-8">
      <title>5. CONCLUSION</title>
      <p>
        In this paper we have presented the architecture of a system
that aims to improve the quality of data, and in particular
linked data, as self-published by scientists and other
professionals on the web. This system works by
certifying that datasets follow not only the simple syntactic constraints
of RDF and XML, but also deeper semantic conditions as
defined by the schema. The system is currently under
development and we expect to release the prototype version
shortly after publication of this article. While it is clear that
this service cannot guarantee that a dataset is fit for use
in a given application, it can assure developers that the
dataset is ready to be applied, avoiding the "tedious process
of data wrangling" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by ensuring that formats are valid and
encouraging the use of data semantics. We hope that by
providing an easy-to-use interface, without requiring significant
metadata, this service can play a key role in improving data
quality and enabling replicability of experiments across all
computational sciences.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work is supported by the Cognitive Interaction
Technology, Center of Excellence (CITEC) and the LIDER Project
under the European Seventh Framework Program grant
number 610782.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Volz</surname>
          </string-name>
          .
          <article-title>Patching syntax in OWL ontologies</article-title>
          .
          <source>In The Semantic Web – ISWC</source>
          <year>2004</year>
          , pages
          <fpage>668</fpage>
          –
          <lpage>682</lpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brümmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freudenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>DataID: Towards semantically rich metadata for complex datasets</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Semantic Systems</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Donoho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maleki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. U.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shahram</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Stodden</surname>
          </string-name>
          .
          <article-title>Reproducible research in computational harmonic analysis</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>8</fpage>
          –
          <lpage>18</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <article-title>Proactive wrangling: mixed-initiative end-user programming of data transformation scripts</article-title>
          .
          <source>In Proceedings of the 24th annual ACM symposium on User interface software and technology</source>
          , pages
          <fpage>65</fpage>
          –
          <lpage>74</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked data: Evolving the web into a global data space</article-title>
          .
          <source>Synthesis lectures on the semantic web: theory and technology</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>136</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Test-driven evaluation of linked data quality</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World Wide Web</source>
          , pages
          <fpage>747</fpage>
          –
          <lpage>758</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Maali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Erickson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Archer</surname>
          </string-name>
          .
          <article-title>Data catalog vocabulary (DCAT)</article-title>
          .
          <source>W3C Working Draft</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>The dependency inversion principle</article-title>
          .
          <source>C++ Report</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>61</fpage>
          –
          <lpage>66</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Murray-Rust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neylon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pollock</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilbanks</surname>
          </string-name>
          .
          <article-title>Panton principles: principles for open data in science</article-title>
          .
          <source>Panton Principles</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Peng</surname>
          </string-name>
          .
          <article-title>Reproducible research in computational science</article-title>
          .
          <source>Science</source>
          ,
          <volume>334</volume>
          (
          <issue>6060</issue>
          ):
          <fpage>1226</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Do</surname>
          </string-name>
          .
          <article-title>Data cleaning: Problems and current approaches</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>23</volume>
          (
          <issue>4</issue>
          ):
          <fpage>3</fpage>
          –
          <lpage>13</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Wren</surname>
          </string-name>
          .
          <article-title>404 not found: the stability and persistence of URLs published in MEDLINE</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>20</volume>
          (
          <issue>5</issue>
          ):
          <fpage>668</fpage>
          –
          <lpage>672</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>