=Paper=
{{Paper
|id=Vol-1215/paper-05
|storemode=property
|title=Towards Assured Data Quality and Validation by Data Certification
|pdfUrl=https://ceur-ws.org/Vol-1215/paper-05.pdf
|volume=Vol-1215
|dblpUrl=https://dblp.org/rec/conf/i-semantics/McCraeWC14
}}
==Towards Assured Data Quality and Validation by Data Certification==
John P. McCrae, Cord Wiljes, Philipp Cimiano
CITEC, Bielefeld University, Inspiration 1, Bielefeld, Germany
jmccrae@cit-ec.uni-bielefeld.de, cwiljes@cit-ec.uni-bielefeld.de, cimiano@cit-ec.uni-bielefeld.de

===ABSTRACT===
Increasingly, a large amount of data relevant to a wide variety of scientific domains is self-published by scientists on websites, and this is proving to be an important resource for the replicability and further development of science. Much of this data is even made available as linked data. However, the self-publishing model provides no quality control on the data, and as such datasets frequently contain errors. We therefore consider an architecture of a system that enables the certification of data (both linked and otherwise) by a web service and the sharing of this certification on the web, and contemplate why this may improve data quality.

===Categories and Subject Descriptors===
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics: complexity measures, performance measures

===General Terms===
science, data, quality

===Keywords===
data quality, data sharing, open science, validation service

Copyright is held by the author/owner(s). LDQ 2014, 1st Workshop on Linked Data Quality; Editors: Magnus Knuth, Dimitris Kontokostas, and Harald Sack. Sept. 2, 2014, Leipzig, Germany.

===1. INTRODUCTION===
It has been widely acknowledged across the sciences that the publishing of data generated or required for an experiment is a crucial step towards the replicability of experiments or analyses [10]. However, it is also the case that most data is of poor quality and plagued by basic data errors [1, 11]. In this paper we tackle an aspect of data quality we refer to as "readiness for use", by which is meant whether the data can be directly applied, rather than whether its content is actually useful for a specific application. Errors such as these can easily be detected by validation in a manner that does not need to know the domain or the intended application of the data. Such errors not only make the data fundamentally harder to use but also mean that anyone consuming the dataset must first correct any existing data errors, possibly making unwarranted assumptions about the data and thus potentially leading to unintended modifications of the data. Much of this is due to the fact that for many small datasets there is no sufficient institutional support for the publication of data, leading to many datasets containing formal errors, such as incorrectly escaped characters. It is our belief that many scientists who self-publish datasets do not make such errors out of intention or indifference, but rather out of a lack of supporting validation services. To this end we propose a simple, general, extensible web service to provide syntactic and semantic validation of data in, initially, XML and RDF, which can be extended to a wider range of data formats.

Finally, a second key goal is to provide continuous validation of the resource, in that we continue to check the validity of resources periodically after they are published. It is in fact a common problem that resources and data cease to be available after the end of the funding period, and as such the data generated during the project becomes lost. It is also quite common for URLs to be changed for technical reasons without a redirect from the old URL being implemented. For example, in a study of MEDLINE papers [12], it was found that 37% of URLs quoted in papers had become unavailable or were only intermittently available after publication, although it was unclear how many of these URLs referred to datasets.

The architecture of such a system has several clear design goals in order to cope with such a wide range of potential resources.
The architecture should fulfill the following requirements:

* Extensibility: There are a wide range of data formats in use in scientific work and as such we should be able to grow and extend to a wide range of datasets that are available on the web.
* Efficiency: The system should be able to process potentially very large datasets in a reasonable amount of time. For that reason, validation algorithms that have a linear time complexity in the input size are to be preferred.
* Tiered Architecture: It should be possible to follow deeper validation layers, such that we can validate data as to whether it is available on the web and whether it uses a standard and open format. Further, in the case that the data uses a valid RDF vocabulary, we can check whether it conforms to the RDFS/OWL schema/ontology that the data claims to adhere to.

Such an architecture should allow us to quickly build an extensible service that allows new data formats and models to be handled and validated.
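As an illustration only, the following Python sketch shows one way such a tiered, pluggable validation chain could be organised; the class and method names (ValidationStep, AvailabilityStep, Report, validate) are our own illustrative choices and are not prescribed by the paper.

```python
# Illustrative sketch of a tiered, pluggable validation chain.
# All names here are hypothetical; the paper does not prescribe an API.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Report:
    """Accumulates the outcome of each validation tier for one dataset URL."""
    url: str
    results: dict = field(default_factory=dict)    # step name -> "success" / "failure" / "error"
    annotations: list = field(default_factory=list)


class ValidationStep(ABC):
    """One tier of validation; new formats are supported by adding new steps."""

    name: str = "abstract-step"

    @abstractmethod
    def run(self, data: bytes, report: Report) -> list["ValidationStep"]:
        """Validate `data`, record the outcome and return any follow-up steps."""


class AvailabilityStep(ValidationStep):
    """Shallowest tier: is there any content at all for this URL?"""

    name = "availability"

    def run(self, data: bytes, report: Report) -> list[ValidationStep]:
        report.results[self.name] = "success" if data else "failure"
        # Deeper tiers (format detection, syntax, schema checks) would be returned here.
        return []


def validate(url: str, data: bytes, first_step: ValidationStep) -> Report:
    """Run the chain of validation tiers starting from `first_step`."""
    report = Report(url)
    queue = [first_step]
    while queue:
        step = queue.pop(0)
        queue.extend(step.run(data, report))
    return report
```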
===2. MOTIVATION===
The Open Science movement advocates sharing the data that scientific results are based on [9]. Open data publication is expected to improve the integrity and efficiency of science: errors and fraud will be easier to detect, and valuable research data can be re-used by other scientists for their own research questions. Therefore, scientific journals and research funding agencies worldwide have been instituting policies for data sharing.

Good scientific practice calls for research to be reproducible, i.e. other researchers must be able to test the data as well as the analysis procedures. The growing number and diversity of digital research data and the strong increase in the importance of computational methods in all empirical sciences have created hurdles for this ideal. Whereas in the past reproducibility in the scientific research process (Fig. 1) was mainly concerned with reproducing the experimental result, in recent years it has become increasingly difficult to ensure the reproducibility of the computational analysis of research [3]. Therefore, a new "culture of reproducibility for computational science" [10] is needed.

Figure 1: Research data in the scientific discovery process

For data to be useful it has to be of high quality, so additional efforts will be necessary to test and ensure data quality. Standards and best practices for data publication need to be defined. Building on the proven workflows for quality assessment in science, we propose a combination of tool-assisted automated quality evaluation, complemented by a social, peer-review based approach.

===3. TARGET DATASETS===
In general, we require that three main conditions hold for datasets in order to build a service for their validation. Firstly, we require that the dataset is open. In this case we do not require that the license is necessarily fully open, such as using a CC-BY license (https://creativecommons.org/licenses/by/4.0/), but rather this requirement states that we can access the datasets systematically by downloading them from the web, without the impediment of authentication systems or the like. Secondly, it is important that the dataset is a single file, as we wish to download the dataset without the user having to fill in complex metadata to describe how we may access individual files. We see no conceivable use case where a dataset cannot be combined into a single file by archiving or a similar method. Finally, we require that the dataset uses a standard format, that is, a format that is open and is standardized by some standardization body. These requirements are similar to the third star of the "5 Star Open Data Model" (http://www.w3.org/DesignIssues/LinkedData.html). The advantage of these requirements is that we do not require complex metadata to describe a dataset but instead require only a download URL, which is easy to work with.

===4. ARCHITECTURE===
The certification system we propose in this paper takes the form of a very simple web service which takes as input a single URL and then assigns a local identifier (also a URL, based on the MD5 hash of the external URL) to the dataset, at which we make the results of the process available by means of linked data. As such, the service is based around simple RESTful principles, allowing a single URL to be posted to the service and the resulting report URL to be returned by means of an HTTP redirect. Dereferencing the returned URL will give the current status of the resource as an RDF document based on the DCAT vocabulary [7] and the DataID scheme [2].
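As a concrete illustration of this interaction, the following sketch derives a report URL from the MD5 hash of a dataset URL and posts the dataset URL to the service; the service base URL, endpoint path and parameter name are hypothetical, since the paper does not specify them.

```python
# Illustrative client-side sketch; the service endpoint and parameter
# names below are hypothetical, not taken from the paper.
import hashlib
import urllib.parse
import urllib.request

SERVICE_BASE = "https://example.org/certify"   # hypothetical service base URL


def report_url(dataset_url: str) -> str:
    """Derive the local report identifier from the MD5 hash of the dataset URL."""
    digest = hashlib.md5(dataset_url.encode("utf-8")).hexdigest()
    return f"{SERVICE_BASE}/report/{digest}"


def submit(dataset_url: str) -> str:
    """POST the dataset URL and follow the HTTP redirect to the report URL."""
    data = urllib.parse.urlencode({"url": dataset_url}).encode("ascii")
    with urllib.request.urlopen(SERVICE_BASE, data=data) as response:
        # urllib follows the redirect, so the final URL is the report URL;
        # dereferencing it yields the RDF (DCAT/DataID-based) status document.
        return response.geturl()


if __name__ == "__main__":
    print(report_url("http://example.org/data/mydataset.zip"))
```

Because the report URL is a pure function of the dataset URL, it can equally be computed client-side, which is the property the user interface described below relies on.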
====4.1 User interaction====
A key goal of the web service is to engage with a wide range of data publishers, including many who may not be familiar with web services and RESTful principles. As such, we acknowledge that it is important to enable the usage of the service by a wide range of users. Thus, we provide a simple form-based interaction explaining to the user how to use the web service. Furthermore, they get to see the report URL immediately, since it is based on a hash function and can be calculated in the browser.

Figure 2: A mock-up of the user page for the certification service

Of most importance, however, is the final step, where a certificate is provided which users can include on their own website next to the download link. This certificate will dynamically display the dataset's current evaluation as an iconic image, which will contain a brief summary of the dataset in terms of badges or stars awarded to the dataset based on the validation. This image will be provided directly at a URL derived from the dataset by means of an MD5 hash and will thus be up-to-date with current evaluations, and firmly tied to that URL, encouraging data providers not to change the URL without providing a URL forwarding mechanism.

* Warning: This URL is invalid, has not yet been analysed, or the dataset has not been available for more than three months.
* Bronze star: It is possible to download this URL.
* Silver star: It is possible to download this URL, extract it if necessary, and the data contains syntactically valid RDF or XML.
* Gold star: As silver, but deeper semantic validation (discussed below) was also successful.
* Linked data star: The data is valid and contains external links.

It is important to stress that the linked data star is not awarded for simply using RDF, but instead for having at least 50 triples (following http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation) that refer to entities hosted on some other domain, where the domain of the dataset is assumed to be the same as that of its download URL.

These stars are included as part of the badge that the user can display on their website and as such allow external users to easily verify the quality of the downloaded dataset. These badges, which take the form of a custom-generated PNG image (similar to the build status images used by continuous integration servers such as Travis CI), allow the data publisher to show the quality of their data and assure the user of the quality of the data. This image's URL is related to the more detailed report, and so it is easy to verify that it refers to the published dataset. Furthermore, by issuing a separate star for linking the dataset, we believe that this will be an enticement for data providers to follow linked data principles and thus move towards 5-star data as defined by Heath and Bizer [5].

====4.2 Validation architecture====
As our goal is to handle datasets which are both very large and potentially very diverse, the validation system is far from trivial to implement. To this end, we require that the validation itself follows specific requirements. The most important of these requirements are as follows:

* The service will not permanently store any data, both for practical reasons and to ensure that we do not violate any licenses. This means the service will not be able to act as a back-up or an alternative source for any of these data services. As such, the service is not intended to replace the use of a DOI to provide a fixed identifier for the data.
* The steps should be able to process the dataset in a single pass, without either using significant memory or requiring the creation of a large database. This requirement stops an execution of the validation from monopolizing the resources of the server.
* It should be possible to add new steps without significant modification to the system. This will enable not only us but also outside collaborators to contribute new validation steps, and as such we will make the source code available on the web and accept appropriate extensions.

Figure 3: The initial set of back-end validation services

The architecture of the system is illustrated in Figure 3. The basic services start off with the download step, which, as its name suggests, obtains a copy of the resource by HTTP(S). The next step, which we call the format sniffer, attempts to deduce the format of the file. It does this by looking at the file name (extension), the HTTP headers and the first 1KB of the file. If the file is found to be an archive of some form, then we extract it and apply the format sniffer to each extracted file. We also note that the format sniffer is extensible by means of dependency injection [8], allowing external contributors to easily add new formats.
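To make the format sniffer more concrete, here is a minimal sketch of how a format could be guessed from the file extension, the Content-Type header and the first kilobyte of content; the function name, the set of recognised formats and the specific heuristics are our own illustrative assumptions, not the paper's implementation.

```python
# Minimal illustrative format sniffer; heuristics and format labels
# are assumptions for illustration only.
from urllib.parse import urlparse


def sniff_format(url: str, content_type: str, head: bytes) -> str:
    """Guess the format from the URL extension, the HTTP Content-Type header
    and the first 1KB of the file."""
    path = urlparse(url).path.lower()
    content_type = content_type.lower()
    head = head[:1024].lstrip()

    # 1. File name extension.
    if path.endswith((".zip", ".tar.gz", ".tgz")):
        return "archive"
    if path.endswith((".rdf", ".ttl", ".nt")):
        return "rdf"
    if path.endswith(".xml"):
        return "xml"

    # 2. HTTP headers.
    if "application/rdf+xml" in content_type or "text/turtle" in content_type:
        return "rdf"
    if "xml" in content_type:
        return "xml"

    # 3. First bytes of the content.
    if head.startswith(b"PK\x03\x04"):            # ZIP magic number
        return "archive"
    if head.startswith(b"@prefix") or head.startswith(b"@base"):
        return "rdf"                               # Turtle serialisation
    if head.startswith(b"<?xml") or head.startswith(b"<"):
        return "xml"
    return "unknown"
```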
The system then applies a format-specific validator, such as a SAX parser for XML, or the Rapper tool (http://librdf.org/raptor/rapper.html) for RDF documents. Each of these services is implemented as a single command and is extended to return an RDF document. This RDF document contains the result of the execution (success, failure, internal error), any potential next steps to run in the chain and any extra annotations to be added to the report. For example, if the XML syntax validator finds a link to an XML document type definition (DTD) or schema description (XSD), then the service may indicate that validation according to the schema is the next step in the chain, which may have already been carried out by the SAX parser. Furthermore, the steps may yield additional output: for example, Rapper produces the number of triples, and this is then added to the report using the VoID vocabulary (http://www.w3.org/TR/void/). Finally, we apply deeper tests to the RDF using the RDFUnit [6] framework, which checks whether the dataset conforms to its ontological constraints. This framework is based on SPARQL and works on the principle of checking whether certain queries produce results as intended.

====4.3 Continuous validation====
While datasets are frequently of good quality when released, one of the key concerns in data quality is that eventually these datasets become unavailable or the URL they are published at changes. As such, our service plans to not only perform initial validation but also to provide continuous validation. To this end we will access the URL by means of a header-only request (falling back to a GET for servers that do not support HEAD). Then, by analysing the return status, and especially the Last-Modified header, we can deduce whether a resource is likely to have changed. In such cases we can re-run the full validation chain. If a resource fails over a fixed time period, we will mark it as not downloadable.
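A minimal sketch of this periodic check is given below, assuming a HEAD request with a GET fallback and a comparison of the Last-Modified header against the value seen at the previous check; the function name and return values are illustrative only.

```python
# Illustrative sketch of the periodic availability/change check;
# names and return conventions are assumptions, not the paper's code.
from typing import Optional
import urllib.error
import urllib.request


def check_resource(url: str, last_seen_modified: Optional[str]) -> str:
    """Return 'unavailable', 'changed' or 'unchanged' for the given dataset URL."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        response = urllib.request.urlopen(request, timeout=30)
    except urllib.error.HTTPError as error:
        if error.code == 405:  # server does not support HEAD, fall back to GET
            try:
                response = urllib.request.urlopen(url, timeout=30)
            except urllib.error.URLError:
                return "unavailable"
        else:
            return "unavailable"
    except urllib.error.URLError:
        return "unavailable"

    last_modified = response.headers.get("Last-Modified")
    if last_modified is not None and last_modified != last_seen_modified:
        return "changed"  # would trigger a re-run of the full validation chain
    return "unchanged"
```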
===5. CONCLUSION===
In this paper we have presented the architecture of a system that aims to help with the quality of data, and in particular linked data, as self-published by scientists and other professionals on the web. This system works by means of certifying that datasets follow not only the simple syntactic constraints of RDF and XML, but also deeper semantic conditions as defined by the schema. The system is currently under development and we expect to release the prototype version shortly after the publication of this article. While it is clear that this service cannot guarantee that a dataset is fit for use in a given application, it can guarantee to developers that the dataset is ready to be applied, avoiding the "tedious process of data wrangling" [4] by ensuring that formats are valid and by encouraging the use of data semantics. We hope that by providing an easy-to-use interface, without requiring significant metadata, this service can play a key role in improving data quality and enabling the replicability of experiments across all computational sciences.

===Acknowledgments===
This work is supported by the Cognitive Interaction Technology Center of Excellence (CITEC) and the LIDER project under the European Seventh Framework Programme, grant number 610782.

===6. REFERENCES===
* [1] S. Bechhofer and R. Volz. Patching syntax in OWL ontologies. In The Semantic Web – ISWC 2004, pages 668–682. Springer, 2004.
* [2] M. Brümmer, C. Baron, I. Ermilov, M. Freudenberg, D. Kontokostas, and S. Hellmann. DataID: Towards semantically rich metadata for complex datasets. In Proceedings of the 10th International Conference on Semantic Systems, 2014.
* [3] D. L. Donoho, A. Maleki, I. U. Rahman, M. Shahram, and V. Stodden. Reproducible research in computational harmonic analysis. Computing in Science & Engineering, 11(1):8–18, 2009.
* [4] P. J. Guo, S. Kandel, J. M. Hellerstein, and J. Heer. Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pages 65–74, 2011.
* [5] T. Heath and C. Bizer. Linked Data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1):1–136, 2011.
* [6] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758, 2014.
* [7] F. Maali, J. Erickson, and P. Archer. Data Catalog Vocabulary (DCAT). W3C Working Draft, 2012.
* [8] R. C. Martin. The dependency inversion principle. C++ Report, 8(6):61–66, 1996.
* [9] P. Murray-Rust, C. Neylon, R. Pollock, and J. Wilbanks. Panton Principles: Principles for open data in science. Panton Principles, 2010.
* [10] R. D. Peng. Reproducible research in computational science. Science, 334(6060):1226, 2011.
* [11] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
* [12] J. D. Wren. 404 not found: the stability and persistence of URLs published in MEDLINE. Bioinformatics, 20(5):668–672, 2004.