Linking UK Government Data

John Sheridan
The National Archives
102 Petty France
London SW1H 9AJ
john.sheridan@nationalarchives.gsi.gov.uk

Jeni Tennison
The Stationery Office
Mandela Way
London SE1 5SS
jeni.tennison@tso.co.uk

Copyright is held by the author/owner(s).
LDOW 2010, April 27, 2010, Raleigh, North Carolina.

ABSTRACT
What does it take to create a web of linked government data? With the launch of data.gov.uk the UK Government has been finding out. This paper sets out the case for using Linked Data standards for publishing open government data and describes some of the benefits. It explains how Linked Data standards uniquely allow governments to publish data responsibly and why responsible data publishing is so important to the open government data movement. The paper goes on to explain how the Linked Data world was not quite ready for the large-scale adoption of these standards by a major government, leaving much to be done to develop practical approaches and patterns for the publishing of government data. From URIs, to provenance and versioning, through to statistics and geographic information, much thinking and work has been done. In each case the emphasis has been not on research but on designing simple, repeatable patterns, supported through tools. This work has also involved building understanding and capability amongst officials from across government departments and agencies.

The paper also explains why the government's use of linked data standards was not universally welcomed and was even greeted with antagonism by some. Learning from this feedback, the paper describes how we are now using linked data standards to enable government as a platform, commoditising the process of creating APIs to meet the needs of a wide range of data consumers, from business, academia and the developer communities.

Categories and Subject Descriptors
H.4 [Information Systems]: Information Systems Applications

General Terms
Design, Standardization

Keywords
Linked Data, eGovernment

1. INTRODUCTION
There are many different ways of putting data on the web. It has been possible for governments to publish data using the Internet for over 30 years, long before the web was invented, by providing access to flat files over FTP. What is distinctive about the web is HTTP and the web's linking ability. In 2009 governments around the world started to move decisively towards publishing increasing volumes of government data on the web, perhaps most notably with the launch of data.gov in the United States. In the UK, Sir Tim Berners-Lee and Professor Nigel Shadbolt were appointed as advisors to the Government to spearhead work in government. The UK Government's data website, data.gov.uk, aimed at developers, was launched with a commitment that the government would use W3C standards and in particular adopt linked data standards and approaches for publishing UK Government data on the web.

The open government data movement provides a golden opportunity for linked data advocates to prove the value of standards such as RDF, OWL and SKOS. The experience from the UK is that the linked data community was not quite ready for a major government to start creating a web of linked government data. Whilst the standards are mature, capable and powerful, much work needs to be done to translate those standards into simple and repeatable publishing patterns that government departments and agencies can adopt, use and implement.

The UK Government aims to be a responsible publisher of linked data; indeed, that is an important part of the motivation for using linked data standards. To do this it is addressing questions such as how to handle versioning and provenance information. The government is developing publishing patterns that ease the process of publishing data in linked data form. More specifically, it is thinking through the implications of linked data standards in two important domain areas, statistics and geo-spatial information.

The divergent needs of government data publishers and government data consumers are also becoming apparent. Many developers want to download government datasets or, better, to have programmatic access to data through RESTful APIs. The demand from data consumers for government data in linked data form, particularly from those outside academia, is limited to a growing minority. The majority of developers want easy and immediate access to data in simple-to-use formats. Despite the inherently RESTful nature of linked data, SPARQL endpoints and RDF are viewed with suspicion. The data.gov.uk mailing list received numerous posts from developers wanting to have programmatic access to government data but not wishing to use either SPARQL or RDF. To bridge the gap between responsible data publishing and easy data use, the UK government is using linked data as an underpinning technology, as a bridge that enables data publishers to meet the diverse needs of data consumers.

2. OPEN GOVERNMENT DATA
There is a global movement of governments and local authorities starting to put their data on the web. Open government data projects have sprung up in countries around the world from the United States, Australia and New Zealand to The Netherlands, Sweden, Spain, Austria and Denmark, not to mention an increasing number of city- and local-authority-based initiatives from Vancouver to London. The policy objectives achieved through open government data vary, including increasing transparency and democratic accountability, supporting economic growth by stimulating new data-based products and services, and improving how public services are delivered.

In the UK, the government has set out clear public data principles. These state that the government will make public data available in machine-readable formats, published using open standards and released under an open licence. The UK government has gone further, committing that it will make any raw dataset available in linked data form.

There are a number of advantages to using linked data standards for publishing open government data. The most important benefit for publishers of government data is how linked data standards enable departments and agencies to publish their data responsibly. This is because each fact or data point is associated with a URI and that URI can be resolved. The publisher determines what information is returned when a request is made and can serve whatever additional context or provenance information they deem necessary. For example, if the 2002 figures can't be compared with those in subsequent years, the publisher can say so, for every data point. The data can be copied, adapted and re-used, but the publisher always controls what is returned when each URI is dereferenced. This is an important benefit over interchange formats such as CSV or XML, where data can be changed or context lost as it is passed from hand to hand or system to system. For government officials worried that their datasets will be dumbed down to create the machine-readable form and then used in incorrect or even misleading ways, linked data provides a real boon. It leaves the publisher ultimately in control of their data in a unique way, whilst enabling very flexible consumption and re-use.

There are other benefits for governments wishing to publish data. Linked data is based on open standards. This aligns it well with the UK Government's commitment to open standards in Open Source, Open Standards and Re-use [1].

Linked data enables the government to publish its data in a very modular way, benefiting from a 'small pieces loosely joined' approach to government data. This is important as the government is itself a complex and highly distributed set of organisations. The most useful data about schools, for example, will be the combination of information from a number of different departments and agencies. Each organisation can publish its own data but, using linked data, the information can also easily be combined. Neither the government nor data consumers need everything to be planned in advance; the data web can evolve, as the web of documents has.

Rather than create many different bespoke APIs, which would prove time consuming and potentially expensive, linked data technology offers a way of providing flexible and easy programmatic access to data. Moreover, linked data is also very portable, not locking the government in to a particular vendor's technology platform or approach.

3. DESIGN PATTERNS
The four Linked Data principles provide some very clear guidance: that HTTP URIs should be used to name real-world things, that they should be resolvable to representations of RDF graphs, and that they should link to other resources. However, they naturally remain silent on the details. Some of these gaps have been filled by work such as Cool URIs for the Semantic Web [2] and Best Practice Recipes for Publishing RDF Vocabularies [3], but even these rightly leave developers with a lot of choice about how to approach the publication of linked data.

This choice is good in many ways, and accurately reflects the relative immaturity of the field. However, in the context of encouraging government departments to publish their data, it can also be confusing. To help publishers get up and running quickly, with the minimum of effort, we have adopted a policy of providing clear guidelines and recommendations. These are not intended to constrain publishers (they are not rules) but to ease the path to publication by giving clear directions along the way. By providing a level of consistency in approach, we also hope to make it easier to create tools and to help consumer developers know what to expect from government linked data.

These guidelines are led by and refined by experience, and are thus at different levels of maturity. There are three sets in particular that are worth highlighting here: the design of URIs, the approach to versioning and the provision of provenance information.

3.1 URIs
Some of the earliest work centred on the creation of patterns for URIs, culminating in the publication of Designing URI Sets for the UK Public Sector [4]. These married guidance based on the Linked Data approach, usability guidance, and practical constraints, particularly regarding the impermanence of many department-based domain names. The result is a slash-based scheme that includes four main patterns:

• http://{sector}.data.gov.uk/id/{concept}[/{identifier}] for real-world things such as schools and roads
• http://{sector}.data.gov.uk/doc/{concept}[/{identifier}] for documents about those things
• http://{sector}.data.gov.uk/def/{scheme}[/{concept}] for vocabularies, classes, properties, concept schemes and concepts
• http://{sector}.data.gov.uk/data/{dataset}[/{part}] for datasets and the graphs they contain

The first pattern results in a 303 See Other response to an equivalent URI using the second pattern. For the latter three patterns, suffixes are used to indicate particular formats for the returned documents.

We have also framed general guidance about these URIs, such as:

• using natural identifiers within URIs where possible
• considering the persistence of URIs over time
• designing URIs for things that are not ultimately controlled by data.gov.uk

3.2 Versioning
Versioning is particularly important with government data. It's important to be able to relate a given event to a local authority that was disbanded in April 2009, and to relate that local authority to the one that has taken its place. It's important to be able to track shifting classifications and coding schemes, as these have implications for how we interpret statistics about crime or health over time. While in general consumers will be interested in the current state of the world, it's particularly important for policy makers to look back into the past and project into the future.

We also have to deal with different sets of information about a given resource updating at different times, and information from different sources, modified at different times, potentially overlapping with each other. For example, a school's name might be recorded in five different databases, all exposed as linked data and updated at different intervals. How can we determine which reflects the current name of the school?

These considerations have led us to adopt named graphs as a mechanism for annotating sets of statements with information about their validity over time, their authoritativeness, and other named graphs in the same series. While many sources may provide information about a given resource, only one should provide authoritative information about a particular property of that resource, such as the school's name. These graphs can be combined to give slices of information at a particular point in time.

3.3 Provenance
Alongside the requirement for handling changing information, the UK government needs to provide information about the source of the information that it publishes as linked data. This includes the provenance of the data itself but also, crucially, the ways in which it has been manipulated en route to the final consumer of the data.

There are going to be many different ways in which linked data is generated and published, including:

• publication of RDF-based representations from standard relational databases, as just another output format
• generation of RDF based on transformations from other formats, such as CSV files, which is then served purely as RDF
• programmatic, on-demand generation based purely on the resource URI

Named graphs are again vital for associating metadata, this time about the provenance of information, with sets of triples. We are currently working on developing the patterns to represent the complexity and variety of provenance of government linked data, in coordination with the W3C Provenance Incubator Group [5].

4. KEY CONTENT AREAS
From our earliest forays into linked data, it was clear that there were two areas which deserved special attention: statistics and geo-spatial information. Practically every interesting dataset contains statistics of some description, whether it's the number of vehicles passing a particular point or the number of pupils of a certain age within a school, and references a location or an area in the real world.

Unsurprisingly, both areas have large existing communities of interest and standard approaches to modelling and representing their data. But in both cases, the advantages of a linked data approach have been recognised very quickly.

4.1 Statistics
Statistics are a vital source of data. While RDF does not provide succinct representations of multi-dimensional hypercubes, representing statistical information using linked data provides three important benefits:

• the ability to slice hypercubes in ways not anticipated by their original publishers
• links into the wider cloud that provide extra contextual information for the statistics
• the ability to make annotations at various levels, from whole datasets down to individual observations

The UK government has led work to align the major existing standard for publishing statistics, SDMX [6], with Linked Data, leading to work on both extending SCOVO and mechanisms for accessing these statistics using web-based APIs.

4.2 Geo-Spatial Information
The UK is under a statutory obligation to implement the European INSPIRE Directive [7], which seeks to ensure that European countries are able to exchange spatial information. One of the features of this directive is the requirement to provide identifiers for, and a resolution mechanism for, spatial objects. This dovetails with the opening up of geographic information within the UK, particularly that provided by the Ordnance Survey.

The UK has decided to use Linked Data to fulfil these requirements: spatial objects will be identified through HTTP URIs, and the resolution mechanism will be the standard web architecture. Particular issues that we are working through include:

• the correspondence between real-world things and the spatial objects that represent them
• the representation of phenomena such as boundaries that both change over time and are available at varying resolutions
• the representation of geometries within RDF, whether as literals or as resources

5. LINKED DATA FOR DEVELOPERS
Linked data standards are very powerful, but for many developers RDF and SPARQL are new technologies. Although the data is machine readable, the standard formats such as RDF/XML and Turtle are impenetrable without special parsers, making the data hard for non-experts to use. Representing data as graphs rather than using the more familiar paradigms of trees or tables adds another obstacle. To consume linked data effectively the data consumer has to think differently, constructing a new mental model of the data and, very often, starting to use a new or unfamiliar code library to work with and manipulate it. The experience from the data.gov.uk mailing list is that even experienced developers baulk at the learning curve required to fully exploit a SPARQL endpoint.

When the UK Government started publishing data using linked data standards, those already familiar with the technology cheered. A wider group of developers approached the technology with an open mind but many floundered. It was too hard to find out what data was available and how it had been modelled. This was made harder by difficulties in getting overviews of RDF vocabularies and the possible properties that a given resource might have.

We provided example SPARQL queries, but important features such as aggregation of results through summing and counting were missing both from the current SPARQL standard and from the then data.gov.uk implementation. Tutorials available elsewhere on the web were sparse and not developer friendly. There was no easy-to-follow progression path from the world developers knew to the world of linked data. The choice was either to make the leap and invest considerable time and energy to use linked data, or to be left frustrated. Far from enabling access to open government data, linked data standards looked like they were getting in the way!

The last few years have seen an explosion in web services provided through RESTful APIs. These APIs are defined either through URI patterns or queries and return data in simple XML or JSON formats that are easy to process on both servers and clients.

The UK government has therefore been supporting work to create middleware, based on a standard configuration format, that can sit above SPARQL endpoints to:

• provide simple JSON and XML views of linked data
• provide simple URI-based searching, filtering and sorting of linked data
• support the creation of flexible domain-specific APIs by data publishers

To do this, we have written a specification for the API features the middleware will provide and supported the creation of initial implementations.

All UK Government linked data can now also be made available through RESTful APIs. The ultimate goal here is to commoditise the production of APIs through linked data in a way that is both simple for the publisher and valuable to the consumer, to demonstrate that publishing in linked data provides benefits that vastly outweigh the costs. The case for linked data is transformed by this approach. Powerful in their own right, linked data standards now also provide the UK Government with a basis for the rapid creation of APIs.

As a result of this work, the UK Government has now embraced linked data as the most effective route for providing programmatic access to data, both natively and in the easy-to-consume formats that developers already know, such as JSON. Linked data standards provide an underpinning technology that can enable other forms of programmatic access to data.

6. CONCLUSIONS
The UK Government is making a serious attempt to create a web of linked government data as part of the wider linked data cloud. There are important benefits for governments in using linked data standards for data publishing. For data publishers in government, linked data standards mean they can publish their data responsibly. For data consumers, linked data standards mean they can re-use government data flexibly and easily, for example through APIs.

Adopting Linked Data within the UK government has been an exercise in balance:

• between the largely academic advocates of linked data and the pragmatic concerns of data consumers
• between providing publishers with helpful patterns and guidance, and the flexibility to move outside their bounds when necessary
• between the need for a centralised, single point of access to government information that makes it easy to find and use, and its distributed publication
• between providing core resources on which we can build and recognising that growth will only come from data holders publishing their own data on the web

To overcome the bootstrapping problem, an important focus has been on realising immediate benefits from the use of linked data standards as well as the longer term gains: jam today as well as jam tomorrow. Using linked data as an underpinning technology for creating APIs is an important approach, embracing the needs of the widest range of data users, not just those familiar with linked data.

There are major opportunities for linked data standards with government data, particularly for statistical and geo-spatial information. There is much still to be done and more to learn about the implementation of linked data standards for government data. We believe the practical application of linked data standards by the UK government has strengthened the case for linked data whilst highlighting some weaknesses in the maturity of implementation approaches, which we have worked to resolve.

[1] http://www.cabinetoffice.gov.uk/media/318020/open_source.pdf
[2] http://www.w3.org/TR/cooluris/
[3] http://www.w3.org/TR/swbp-vocab-pub/
[4] http://writetoreply.org/ukgovurisets/
[5] http://www.w3.org/2005/Incubator/prov/
[6] http://www.sdmx.org
[7] http://inspire.jrc.ec.europa.eu/
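
As an illustration of the URI patterns described in Section 3.1, the following is a minimal Python sketch of how an identifier URI for a real-world thing could be mapped to the document URI it redirects to with a 303 See Other response. It is not taken from the published guidance or from any data.gov.uk implementation: the function names, the table of format suffixes and the example school identifier are all hypothetical, and the guidance itself only states that suffixes indicate formats without fixing which ones.

```python
import re

# Hypothetical suffix-to-media-type table. The guidance only says that
# suffixes indicate formats; these particular suffixes are assumptions.
FORMAT_SUFFIXES = {
    "rdf": "application/rdf+xml",
    "ttl": "text/turtle",
    "json": "application/json",
}

# First pattern from Section 3.1:
#   http://{sector}.data.gov.uk/id/{concept}[/{identifier}]
ID_URI = re.compile(
    r"^http://(?P<sector>[a-z0-9-]+)\.data\.gov\.uk/id/"
    r"(?P<concept>[^/]+)(?:/(?P<identifier>.+))?$"
)


def doc_uri_for(id_uri, suffix=None):
    """Return the /doc/ URI corresponding to an /id/ URI.

    If a suffix is given, it is appended to request a particular format.
    """
    match = ID_URI.match(id_uri)
    if match is None:
        raise ValueError("not an /id/ URI: " + id_uri)
    path = match.group("concept")
    if match.group("identifier"):
        path += "/" + match.group("identifier")
    doc_uri = "http://{0}.data.gov.uk/doc/{1}".format(match.group("sector"), path)
    if suffix is not None:
        if suffix not in FORMAT_SUFFIXES:
            raise ValueError("unknown format suffix: " + suffix)
        doc_uri += "." + suffix
    return doc_uri


def resolve(id_uri):
    """Simulate the response to a request for an /id/ URI.

    A real-world thing cannot be returned over HTTP, so the server answers
    with 303 See Other and points at the document describing the thing.
    """
    return 303, {"Location": doc_uri_for(id_uri)}


if __name__ == "__main__":
    # Illustrative identifier only; not a reference to a real school.
    school = "http://education.data.gov.uk/id/school/100000"
    print(resolve(school))
    print(doc_uri_for(school, "ttl"))
```

Keeping the mapping between the two patterns purely mechanical is what allows the redirect to be handled generically, without per-dataset configuration; a publisher would only need to decide which sectors, concepts and format suffixes to support.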