Towards assured data quality and validation by data certification

John P. McCrae, CITEC, Bielefeld University, Inspiration 1, Bielefeld, Germany, jmccrae@cit-ec.uni-bielefeld.de
Cord Wiljes, CITEC, Bielefeld University, Inspiration 1, Bielefeld, Germany, cwiljes@cit-ec.uni-bielefeld.de
Philipp Cimiano, CITEC, Bielefeld University, Inspiration 1, Bielefeld, Germany, cimiano@cit-ec.uni-bielefeld.de

Copyright is held by the author/owner(s). LDQ 2014, 1st Workshop on Linked Data Quality; Editors: Magnus Knuth, Dimitris Kontokostas, and Harald Sack. Sept. 2, 2014, Leipzig, Germany.

ABSTRACT
Increasingly a large amount of data relevant to a wide variety of scientific domains is self-published by scientists on websites and this is proving to be an important resource for the replicability and further development of science. Much of this data is even made available as linked data. However, the self-publishing model provides no quality control on the data, and as such datasets frequently contain errors. We therefore consider an architecture of a system that enables the certification of data (both linked and otherwise) by a web service and the sharing of this certification on the web, and contemplate why this may improve data quality.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics (complexity measures, performance measures)

General Terms
science, data, quality

Keywords
data quality, data sharing, open science, validation service

1. INTRODUCTION
It has been widely acknowledged across the sciences that the publishing of data generated or required for an experiment is a crucial step towards the replicability of experiments or analyses [10]. However, it is also the case that most data is of poor quality and plagued by basic data errors [1, 11]. In this paper we tackle an aspect of data quality we refer to as the "readiness for use", by which is meant whether the data can be directly applied, rather than whether its content is actually useful for a specific application. Errors such as these can easily be detected by validation in a manner that does not need to know the domain or the intended application of the data. Such errors not only make the data fundamentally harder to use but also mean that anyone consuming the dataset must first correct any existing data errors, possibly making unwarranted assumptions about the data, thus potentially leading to unintended modifications of the data. Much of this is due to the fact that for many small datasets there is insufficient institutional support for the publication of data, leading to many datasets containing formal errors, such as incorrectly escaped characters. It is our belief that many scientists who self-publish datasets do not make such errors out of intention or indifference, but instead out of a lack of supporting validation services. To this end we propose a simple, general, extensible web service to provide syntactic and semantic validation of data in, initially, XML and RDF, which can be extended to a wider range of data formats.

Finally, a second key goal is to provide continuous validation of the resource, in that we continue to check the validity of resources periodically after they are published. It is in fact a common problem that resources and data cease to be available after the end of the funding period, and as such the data generated during the project become lost. It is also quite common for URLs to be changed for technical reasons without a redirect from the old URL being implemented. For example, in a study of MEDLINE papers [12], it was found that 37% of URLs quoted in papers had become unavailable or were only intermittently available after publication, although it was unclear how many of these URLs referred to datasets.

The architecture of such a system has several clear design goals in order to cope with such a wide range of potential resources. The architecture should fulfill the following requirements:

Extensibility: There is a wide range of data formats in use in scientific work and as such we should be able to grow and extend to a wide range of data sets that are available on the web.

Efficiency: The system should be able to process potentially very large datasets in a reasonable amount of time. For that reason, validation algorithms that have a linear time complexity in the input size are to be preferred.

Tiered Architecture: It should be possible to follow deeper validation layers, such that we can validate data as to whether it is available on the web and whether it uses a standard and open format. Further, in the case that the data uses a valid RDF vocabulary, we can check whether it conforms to the RDFS/OWL schema or ontology that the data claims to adhere to.

Such an architecture should allow us to quickly build an extensible service that allows new data formats and models to be handled and validated.

2. MOTIVATION
The Open Science movement advocates sharing the data that scientific results are based on [9]. Open data publication is expected to improve the integrity and efficiency of science. Errors and fraud will be easier to detect and valuable research data can be re-used by other scientists for their own research questions. Therefore, scientific journals and research funding agencies worldwide have been instituting policies for data sharing.

Good scientific practice calls for research to be reproducible, i.e. other researchers must be able to test the data as well as the analysis procedures. The growing number and diversity of digital research data and the strong increase in importance of computational methods in all empirical sciences have created hurdles for this ideal. Whereas in the past reproducibility in the scientific research process (Fig. 1) was mainly concerned with reproducing the experimental result, in recent years it has become increasingly difficult to ensure the reproducibility of the computational analysis of research [3]. Therefore, a new "culture of reproducibility for computational science" [10] is needed.

Figure 1: Research data in the scientific discovery process

For data to be useful it has to be of high quality, so additional efforts will be necessary to test and ensure data quality. Standards and best practices for data publication need to be defined. Building on the proven workflows for quality assessment in science we propose a combination of tool-assisted automated quality evaluation, complemented by a social, peer-reviewing based approach.

3. TARGET DATASETS
In general, we require three main conditions on datasets in order to build a service for their validation. Firstly, we require that the dataset is open. In this case we do not require that the license is necessarily fully open, such as using a CC-BY¹ license, but rather this requirement states that we can access the datasets systematically by downloading them on the web, without the impediment of authentication systems or suchlike. Secondly, it is important that the dataset is a single file, as we wish to download the dataset without the user having to fill in complex metadata to describe how we may access individual files. We see no conceivable use case where a dataset cannot be combined into a single file by archiving or a similar method. Finally, we require that the dataset uses a standard format, that is, a format that is open and is standardized by some standardization body. These requirements are similar to the 3rd star of the "5 Star Open Data Model"². The advantage of these requirements is that we do not require complex metadata to describe a dataset but instead require only a download URL, which is easy to work with.

¹ https://creativecommons.org/licenses/by/4.0/
² http://www.w3.org/DesignIssues/LinkedData.html

4. ARCHITECTURE
The certification system we propose in this paper takes the form of a very simple web service in which we take as input a single URL and then assign a local identifier (also a URL, based on the MD5 hash of the external URL) to the dataset, where we can make the results of the process available by means of linked data. As such the service is based around simple RESTful principles, allowing a single URL to be posted to the service and the resulting report URL to be returned by means of an HTTP redirect. Dereferencing the returned URL will give the current status of the resource as an RDF document based on the DCAT vocabulary [7] and the DataID scheme [2].

4.1 User interaction
A key goal of the web service is to engage with a wide range of data publishers, including many who may not be familiar with web services and RESTful principles. As such, we acknowledge that it is important to enable the usage of the service by a wide range of users. Thus, we provide a simple form-based interaction explaining to the user how to use the web service. Furthermore, they get to see the report URL immediately, which is based on a hash function and calculated in the browser.

Figure 2: A mock-up of the user page for the certification service

Of most importance, however, is the final step, where a certificate is provided which users can include on their own website next to the download link. This certificate will dynamically display the dataset's current evaluation as an iconic image, which will contain a brief summary of the dataset in terms of badges or stars awarded to the dataset based on the validation. This image will be provided directly at a URL derived from the dataset by means of an MD5 hash and will thus be up-to-date with current evaluations, and firmly tied to that URL, encouraging data providers not to change the URL without providing a URL forwarding mechanism.

Warning: This URL is invalid, has not yet been analysed, or the data set has not been available for more than three months.

Bronze star: It is possible to download this URL.

Silver star: It is possible to download this URL, extract it if necessary, and the data contains syntactically valid RDF or XML.

Gold star: As silver, but deeper semantic validation (discussed below) was also successful.

Linked data star: The data is valid and contains external links.

It is important to stress that the linked data star is not awarded for simply using RDF, but instead for having at least 50 triples³ that refer to entities hosted on some other domain, where the domain of the dataset is assumed to be the same as that of its download URL.

These stars are included as part of the badge that the user can display on the website and as such allow external users to easily verify the quality of the downloaded dataset. These badges, which take the form of a custom generated PNG image, allow the data publisher to show the quality of their data⁴, and assure the user of the quality of the data. This image's URL is related to the more detailed report and so it is easy to verify that it refers to the published dataset. Furthermore, by issuing a separate star for linking the dataset, we believe that this will be an enticement for data providers to follow linked data principles and thus move towards 5 star data as defined by Heath and Bizer [5].

³ Following http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
⁴ This is similar to the use of build status images by continuous integration servers, such as Travis CI.

4.2 Validation architecture
As our goal is to handle datasets which are both very large and potentially very diverse, implementing the validation system is far from trivial. To this end, we require that the validation itself follows specific requirements. The most important of these requirements are as follows:

• The service will not permanently store any data, both for practical reasons and to ensure that we do not violate any licenses. This means the service will not be able to act as a back-up or an alternative source of any of these data services. As such the service is not intended to replace the use of a DOI to provide a fixed identifier for the data.

• The steps should be able to process the dataset in a single pass, without either using significant memory or requiring the creation of a large database. This requirement stops an execution of the validation from monopolizing the resources on the server.

• It should be possible to add new steps without significant modification to the system. This will enable not only us but also outside collaborators to contribute new validation steps, and as such we will make the source code available on the web and accept appropriate extensions.

The architecture of the system is illustrated in Figure 3. In this we see that the basic services start off with the download step, which as its name suggests obtains a copy of the resource by HTTP(S). The next step, which we call the format sniffer, attempts to deduce the format of the file. It does this by looking at the file name (extension), the HTTP headers and the first 1KB of the file. If the file is found to be an archive of some form then we extract it and apply the format sniffer to each extracted file. We also note that the format sniffer is extensible by means of dependency injection [8], allowing external contributors to easily add new formats.

Figure 3: The initial set of back-end validation services

Then the system applies a format-specific validator, such as a SAX parser for XML, or the Rapper⁵ tool for RDF documents. Each of these services is implemented as a single command and is extended to return an RDF document. This RDF document contains the result of the execution (success, failure, internal error), any potential next steps to run in the chain and any extra annotations to be added to the report. For example, if the XML syntax validator finds a link to an XML document type definition (DTD) or schema description (XSD) then the service may indicate that validation according to the schema is the next step in the chain, which may have already been carried out by the SAX parser. Furthermore, the steps may yield additional output. For example, Rapper produces the number of triples and this is then added to the report using the VoID vocabulary⁶. Finally, we apply deeper tests to the RDF using the RDFUnit [6] framework, which checks whether the dataset conforms to the constraints defined by its ontology. This framework is based on SPARQL and works on the principle of checking whether certain queries produce results as intended.

⁵ http://librdf.org/raptor/rapper.html
⁶ http://www.w3.org/TR/void/

4.3 Continuous validation
While datasets are frequently of good quality when released, one of the key concerns in data quality is that eventually these datasets become unavailable or the URL they are published at changes. As such, our service plans not only to do initial validation but also to provide continuous validation. To this end we will access the URL by means of a header-only request (falling back to a GET for servers that do not support HEAD). Then by analysing the response, especially the Last-Modified header, we can deduce whether a resource is likely to have changed. In such cases we can re-run the full validation chain. If a resource fails over a fixed time period we will mark it as not downloadable.

5. CONCLUSION
In this paper we have presented the architecture of a system that aims to help with the quality of data, and in particular linked data, as self-published by scientists and other professionals on the web. This system works by means of certifying that datasets follow not only simple syntactic constraints of RDF and XML, but also deeper semantic conditions as defined by the schema. The system is currently under development and we expect to release the prototype version shortly after publication of this article. While it is clear that this service cannot guarantee that a dataset is fit for use in a given application, it can guarantee developers that the dataset is ready to be applied, avoiding the "tedious process of data wrangling" [4] by ensuring that formats are valid and encouraging the use of data semantics. We hope that by providing an easy-to-use interface, without requiring significant metadata, this service can play a key role in improving data quality and enabling replicability of experiments across all computational sciences.

Acknowledgments
This work is supported by the Cognitive Interaction Technology, Center of Excellence (CITEC) and the LIDER Project under the European Seventh Framework Program grant number 610782.

6. REFERENCES
[1] S. Bechhofer and R. Volz. Patching syntax in OWL ontologies. In The Semantic Web – ISWC 2004, pages 668–682. Springer, 2004.
[2] M. Brümmer, C. Baron, I. Ermilov, M. Freudenberg, D. Kontokostas, and S. Hellmann. DataID: Towards semantically rich metadata for complex datasets. In Proceedings of the 10th International Conference on Semantic Systems, 2014.
[3] D. L. Donoho, A. Maleki, I. U. Rahman, M. Shahram, and V. Stodden. Reproducible research in computational harmonic analysis. Computing in Science & Engineering, 11(1):8–18, 2009.
[4] P. J. Guo, S. Kandel, J. M. Hellerstein, and J. Heer. Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 65–74, 2011.
[5] T. Heath and C. Bizer. Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1):1–136, 2011.
[6] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd international conference on World Wide Web, pages 747–758, 2014.
[7] F. Maali, J. Erickson, and P. Archer. Data catalog vocabulary (DCAT). W3C Working Draft, 2012.
[8] R. C. Martin. The dependency inversion principle. C++ Report, 8(6):61–66, 1996.
[9] P. Murray-Rust, C. Neylon, R. Pollock, and J. Wilbanks. Panton principles: principles for open data in science. Panton Principles, 2010.
[10] R. D. Peng. Reproducible research in computational science. Science, 334(6060):1226, 2011.
[11] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[12] J. D. Wren. 404 not found: the stability and persistence of URLs published in MEDLINE. Bioinformatics, 20(5):668–672, 2004.