Semantic Statistics: Bringing Together SDMX and SCOVO

Richard Cyganiak
Digital Enterprise Research Institute
NUI Galway
Galway, Ireland
richard@cyganiak.de

Simon Field
Office for National Statistics
Cardiff Road
Newport NP10 8XG
simon.field@ons.gsi.gov.uk

Arofan Gregory
Metadata Technology
5335 North Nina Drive
Tucson, Arizona AZ 85704
arofan.gregory@metadatatechnology.com

Wolfgang Halb
Institute of Information Systems
JOANNEUM RESEARCH
Graz, Austria
wolfgang.halb@joanneum.at

Jeni Tennison
The Stationery Office
Mandela Way
London SE1 5SS
jeni.tennison@tso.co.uk

ABSTRACT
Whether it's population, income, unemployment or interest rates, statistical data is a fundamental source of information for analysis and visualisations. Many publishers of statistics use SDMX to represent statistics and make them available through web services. The linked data principles of identifying items with HTTP URIs and representing data using RDF provide some benefits (though also some costs) for statistical publishing. This paper describes how the SDMX information model can be used with linked data and RDF and describes some ongoing work to explore the impact of doing so.

Categories and Subject Descriptors
H.4 [Information Systems]: Information Systems Applications

General Terms
Design, Standardization

Keywords
statistics, linked data, RDF, SDMX, SCOVO, open data

Copyright is held by the author/owner(s). LDOW 2010, April 27, 2010, Raleigh, North Carolina.

1. INTRODUCTION
Statistical data is the life blood of the interesting mash-ups and visualisations we see on the web. It provides the raw numbers that designers love to turn into graphs and charts. More importantly, its analysis enables policy makers to make predictions, plan and adjust. So much of the data that we have is statistical data. But how can we bring together the existing standards for transferring statistical data and the linked data approach [1]? What advantages does this bring?

The current standard for statistical organisations to produce their statistical data is Statistical Data and Metadata Exchange (SDMX) [2]. This standard covers everything from how to represent statistical data in flat files and as XML, to the definitions of the dimensions and attributes of observations, to how to discover statistical dataset flows through a central registry. SDMX is used by organisations such as the U.S. Federal Reserve Board, the European Central Bank, Eurostat, the WHO, the IMF, and the World Bank. The Organisation for Economic Cooperation and Development (OECD) and the UN expect the publishers of national statistics, such as the Office for National Statistics (ONS) in the UK, to produce their statistical data using SDMX so that these can be aggregated on an international level.

At the same time, the UK Government has made a commitment to make public data available on the web using linked data standards to enable its widespread re-use. The government publishes large volumes of statistical data and the Office for National Statistics has invested heavily in technology for managing and publishing statistical data on the web. Individual departments and agencies are also publishing increasing volumes of statistical data, in specialist XML-based formats such as LGDx1, in CSV, as well as in other formats that generally curtail their wider combination and re-use, such as Excel or PDF.

The UK Government has to reconcile the requirement to share statistical data with other statistics authorities using SDMX with the wider requirement to enable re-use of statistical data on the web using linked data standards. The challenge is how to marry these different requirements in a pragmatic way for publishers and consumers alike. Publishers are concerned about publishing data responsibly and avoiding the accidental misinterpretation of statistical data, while the majority of consumers care about the ease of accessing, querying and processing statistical data.

In practice, this means publishing statistical data using HTTP URIs for datasets, time series and individual observations. These enable both publishers and third parties to annotate and reference statistical data on the web, which helps to build trust with those engaging in conversations about the data. Using the RDF data model enables consumers to query statistical data in standard ways and to enhance statistical data by mixing it with other linked data.

Pragmatically, however, we must adopt approaches that utilise and build on existing data standards and technology investments rather than replacing them. While this paper focuses on the domain of official statistics, a similar situation can be observed in many fields: organisations have heavily invested in existing domain-specific standards. Bridging from these standards to the linked data universe is a prerequisite for getting valuable domain-specific data into the linked data web.

In this paper, we'll first describe SDMX and the data model it uses. We'll then talk about how publishing statistical data as linked data provides some advantages that aren't realised within the more traditional SDMX publishing pattern. We'll go on to describe how the SDMX data model and process model map on to linked data concepts, and thus how organisations such as the ONS can publish statistical data as linked data without disrupting their existing tool chains.

1 http://neighbourhood.statistics.gov.uk/dissemination/Info.do?page=nde.htm
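The idea of giving HTTP URIs to datasets, time series and individual observations can be made concrete with a small sketch. The URI patterns and the example.org base below are purely illustrative assumptions of ours, not a scheme proposed by this paper:

```python
# Hypothetical URI patterns for the three levels of statistical resources
# mentioned above: dataset, time series, and individual observation.
# The example.org base and path layout are invented for illustration only.
BASE = "http://example.org/statistics"

def dataset_uri(dataset):
    return f"{BASE}/dataset/{dataset}"

def series_uri(dataset, *dimensions):
    # a time series is identified by its non-time dimension values
    return dataset_uri(dataset) + "/series/" + "/".join(dimensions)

def observation_uri(dataset, *dimensions, time):
    # an individual observation adds the time period to the series key
    return series_uri(dataset, *dimensions) + "/" + time

print(observation_uri("unemployment", "UK", time="2009"))
# http://example.org/statistics/dataset/unemployment/series/UK/2009
```

Because each observation URI extends its series URI, which in turn extends its dataset URI, annotations can attach at whichever level of granularity is appropriate, and each identifier can be resolved over plain HTTP.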
2. SDMX
The Statistical Data and Metadata Exchange (SDMX) Initiative was organised in 2001 by seven international organisations (BIS, ECB, Eurostat, IMF, OECD, World Bank and the UN) who remain the governing sponsors. The goals for the initiative were to realise greater efficiencies in statistical practice, with a focus on employing current technology to enhance efficiency, improve quality, and address other challenges. These organisations all collect significant amounts of data, mostly from the national level, to support policy. They also disseminate data at the supra-national and international levels.

There have been several important results from this work: two versions of a set of technical specifications - ISO/TS 17369 (SDMX) - and the release of several recommendations for structuring and harmonising cross-domain statistics, the SDMX Content-Oriented Guidelines. All of the products are available at www.sdmx.org. The standards are now being widely adopted around the world for the collection, exchange, processing, and dissemination of aggregate statistics by official statistical organisations. The UN Statistical Commission recommended SDMX as the preferred standard for statistics in 2007.

SDMX emphasises the SDMX Information Model, shown at a high level in Figure 1 - a meta-model of the important aspects of collection, processing, exchange, and dissemination of aggregate data. All of the technology artefacts of SDMX are implementations of the SDMX Information Model. The model is based on an earlier standard, GESMES/TS, which used the UN/EDIFACT flat-file syntax. SDMX in its current version has expanded this model to include a view of the entire process of statistical production. This model is the result of implementation and analysis of statistical processes in many national, supra-national, and international organisations, and has been effectively used to support these functions in many implementations.

Figure 1: Schematic high-level view of the SDMX Information Model

SDMX is currently being employed by several producers of important data sets. To name a few examples, it is being used and adopted by the U.S. Federal Reserve Board, the Federal Reserve Bank of New York, the European Central Bank, Eurostat, the Bank for International Settlements, the OECD, the UN (for the Millennium Development Goals indicators), the World Health Organization, UNESCO (for education statistics), the IMF, the World Bank, the Food and Agriculture Organization, and many others, including numerous national-level statistical organisations and central banks. In some cases, it has become a standard mechanism for the dissemination of statistical data and metadata (OECD.stat is a good example). In others, it is used as an internal standard for processing, or as a means of supporting data collection and production. Adoption has been increasing rapidly throughout the official statistical world.

The SDMX Technical Specifications describe two major syntaxes: SDMX-EDI, which uses the flat-file UN/EDIFACT syntax; and SDMX-ML, which is broader in scope and offers XML formats for many types of statistical data and metadata. In both cases, users configure the formats to work with the statistical concepts of importance to their data and metadata, providing a flexible, generic basis from which to work, which remains conformant with the standard model.

Figure 2: Schematic high-level view of data structures in the SDMX Information Model

As well as specifying the formats for exchanging statistical data, SDMX defines a services-based architecture centred around the deployment of queryable web services and coordination enabled by the use of an optimised set of registry services.

Tools are becoming widely available for working with SDMX, as freeware, as open-source, and in statistical tools offered by commercial vendors.
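The relationship between datasets and the structures that describe them (Fig. 1 and Fig. 2) can be sketched in a few lines of code. The following minimal Python sketch is our own illustration, not part of the SDMX specifications; the class and field names are invented:

```python
from dataclasses import dataclass, field

# A minimal, hypothetical rendering of the SDMX Information Model's core
# data structures: a data structure definition (DSD) lists the components
# (dimensions, measures, attributes) that any conforming dataset must use.
@dataclass
class DataStructureDefinition:
    dimensions: list   # what is measured, e.g. country, time period
    measures: list     # the observed phenomenon, e.g. income per household
    attributes: list   # metadata, e.g. unit of measurement

@dataclass
class Observation:
    key: dict          # one value per dimension of the DSD
    value: float

@dataclass
class DataSet:
    structure: DataStructureDefinition
    observations: list = field(default_factory=list)

    def conforms(self):
        # Every observation must supply a value for every dimension.
        return all(set(o.key) == set(self.structure.dimensions)
                   for o in self.observations)

dsd = DataStructureDefinition(
    dimensions=["country", "time"],
    measures=["population"],
    attributes=["unit"])
ds = DataSet(dsd, [Observation({"country": "IE", "time": "2010"}, 4.5e6)])
print(ds.conforms())  # True: the observation's key covers every dimension
```

The key design point this mirrors is that the DSD is defined once, independently of any concrete data, and many datasets (and data flows) can then be validated against it.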
3. STATISTICAL LINKED DATA
Linked data takes a different approach from SDMX. Rather than a centralised repository that can resolve URNs for the discovery of datasets, linked data simply uses HTTP URIs so that information can be found using the usual web architecture. Linked data emphasises the assignment of identifiers to all the instances in a data model, which in the SDMX case includes:

- code lists and codes
- concept schemes and concepts
- datasets and dataset flows
- time series and sections
- individual observations

Identifying each of these resources allows statements to be made about them. An individual anomalous observation might be annotated, for example. Datasets can be annotated to indicate the provenance of the statistics, including how they were collected and what processing they have been through. Being published as linked data also means that these items are accessible programmatically via the web. For example, this means that the detailed semantics of a particular dimension can be located easily, simply through resolving its identifier on the web.

What's more, statistical datasets can link into the wider web of data, leading to more powerful ways of filtering and querying statistical data. For example, statistical observations often reference the geographical area to which the statistic applies. Extra analysis can be made possible by resolving information about that area, such as the political make-up of its council, how rural it is, or the size of its police force. Similarly, information about the containment of an area inside others supports the aggregation of statistics for larger areas.

The Statistical Core Vocabulary (SCOVO) [3] demonstrates these principles and how they can apply to statistical data. It is a lightweight RDF vocabulary for expressing statistical data. Its relative simplicity allows easy adoption by data producers and consumers, and it can be combined with other RDF vocabularies for greater effect. The model is extensible both on the schema and the instance level for more specialized use cases. SCOVO's origins are in the riese ("RDFizing and Interlinking the EuroStat Data Set Effort") project [4, 5]. It has also been used to express statistics about RDF datasets [6, 7], in early efforts to convert UK government data to RDF2, to publish Italian university statistics3, and in other contexts.

SCOVO defines three basic concepts:

- a dataset, representing the container of some data, such as a table holding some data in its cells;
- a data item, representing a single piece of data (e.g. a cell in a table);
- a dimension, representing some kind of unit of a single piece of data (for example a time period, location, etc.)

A statistical dataset in SCOVO is represented by the class scovo:Dataset, which is a SKOS concept [8] in order to allow hooking into a categorisation scheme. A statistical data item scovo:Item belongs to a dataset. An Item subsumes the Event concept, as defined in the Event ontology4. A statistical item is a particular classification of a time/space region. Dimensions of a statistical item are factors of the corresponding events, attached through the dimension property, pointing to an instance of the SCOVO Dimension class. This model is easily extensible by defining new factors and agents pertaining to the actual statistical data. For example, one can relate to a statistical data item the institutional body responsible for it as well as the methodology used. A Dimension can have a minimum (and respectively a maximum) range value, captured through the min and max properties. The Statistical Core Vocabulary (depicted in Fig. 3) is defined in RDF Schema.

Figure 3: SCOVO data model

4. EXPRESSING SDMX IN RDF
While SCOVO addresses the basic use case of publishing statistical data in linked data form, its minimalist design is limiting, and it does not support important scenarios that occur in statistical publishing and have led to the development of the SDMX information model, such as:

- definition and publication of the structure of a dataset independent from concrete data
- data flows which group together datasets that share the same structure, for example from different national data providers
- definition of "slices" through a dataset, such as an individual time series or cross-section, for individual annotation
- distinctions between dimensions, attributes and measures

There are also features of SDMX which are rightly not addressed by SCOVO but can be expressed through other vocabularies or design patterns, such as:

- describing code lists, category schemes, and mappings between them using SKOS
- describing metadata and access details about datasets using Dublin Core and voiD [7]
- describing organisations using FOAF

In this section, a mapping from SDMX to RDF is described. It is based on SCOVO, with extensions that partly borrow from existing vocabularies and partly reside in a new SDMX vocabulary. The task of mapping SDMX to RDF is greatly aided by the fact that the SDMX standard is separated into an abstract information model (SDMX-IM) and concrete XML and UN/EDIFACT based syntaxes. We will list the main structures of the SDMX information model (see Fig. 1 and Fig. 2), and sketch their translation to RDF. The section will conclude with first implementation experiences.

Mapping overview. Fig. 4 provides a high-level overview of the RDF model. At the core of SDMX is the data structure definition (DSD), which describes the structure, or metamodel, of one or more statistical datasets. Individual datasets must conform to a DSD, and are represented by instances of the sdmx:DataSet class. The sdmx:structure property connects a dataset and its DSD.

2 http://www.jenitennison.com/blog/node/138
3 http://sw.unime.it/loius/info.html
4 http://purl.org/NET/c4dm/event.owl
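The three SCOVO concepts above can be illustrated with a handful of triples. The following is a minimal sketch in plain Python, representing triples as tuples; an RDF library would normally be used instead, and the example dataset and all example.org URIs are hypothetical:

```python
# A hypothetical SCOVO-style dataset expressed as plain (subject,
# predicate, object) tuples. The http://example.org/ resources are
# invented for illustration; the scovo: terms follow the vocabulary
# described above (assuming the http://purl.org/NET/scovo# namespace).
SCOVO = "http://purl.org/NET/scovo#"
EX = "http://example.org/"

triples = {
    # the dataset (the container)
    (EX + "pop2010", "rdf:type", SCOVO + "Dataset"),
    # one data item (a cell), belonging to the dataset
    (EX + "item1", "rdf:type", SCOVO + "Item"),
    (EX + "item1", SCOVO + "dataset", EX + "pop2010"),
    (EX + "item1", "rdf:value", "4470700"),
    # dimensions classifying the item: a time period and a location
    (EX + "item1", SCOVO + "dimension", EX + "year2010"),
    (EX + "item1", SCOVO + "dimension", EX + "ireland"),
    (EX + "year2010", "rdf:type", SCOVO + "Dimension"),
    (EX + "ireland", "rdf:type", SCOVO + "Dimension"),
}

def dimensions_of(item):
    """Return all dimensions attached to a data item."""
    return {o for s, p, o in triples
            if s == item and p == SCOVO + "dimension"}

print(dimensions_of(EX + "item1"))  # the year and the location
```

Because the time period and the location are first-class resources rather than cell coordinates, further statements (labels, containment in larger areas, min/max ranges) can attach to them directly.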
Figure 4: Mapping SDMX to RDF: Overview

The sdmx:DataSet class is defined as a subclass of SCOVO's scovo:Dataset class, and also as a subclass of void:Dataset, so VoiD properties can be used to describe access methods (SPARQL endpoint, RDF dump, etc.) to the data. VoiD covers much of the same ground as SDMX's web service based registry module, which we therefore do not map to RDF.

Data providers. The organisation that publishes a dataset is given via the dc:publisher relationship. Organisations are represented as instances of foaf:Agent. Organisations are also used with the sdmx:maintainer property, which indicates the maintenance agency of various SDMX artefacts, such as DSDs, code lists, and category schemes.

Data flows and provision agreements. Two important scenarios in official statistics are the periodical publishing of datasets according to a schedule, and the aggregation of datasets from different data providers (e.g., European Union national statistics offices) into a larger collection for central dissemination (e.g., Eurostat). These scenarios are addressed via sdmx:DataFlow. A data flow represents a "feed" of datasets that all conform to the same DSD. Data flows are associated with provision agreements, which can be understood as commitments from an organisation to publish datasets into a data flow.

Data structure definition details. A DSD, also known as a key family in SDMX, describes the metamodel of one or more datasets (see Fig. 5). It defines attributes, measures, and dimensions, collectively called components. Measures name the observable phenomenon, such as income per household. Dimensions identify what is measured, such as that of a particular country at a particular time. Attributes define metadata about the observations, such as the method of data collection or the unit of measurement. Components are coded if possible values come from a pre-defined code list (such as country), or uncoded otherwise. Code lists are mapped to a subclass of skos:ConceptScheme.

We represent all components as instances of rdf:Property. We define subclasses of rdf:Property to indicate the particular kind of component, as well as whether it is coded, and the particular role it plays in the DSD (e.g., TimeDimension, PrimaryMeasure). Compared to SCOVO, the property-based modeling of dimensions allows for a more compact RDF representation of observations.

Figure 5: SDMX Data Structure Definition in RDF

Dimensions, attributes and measures in SDMX take their semantics from concepts. Concepts are items in concept schemes. By using standard concepts and code lists, data becomes comparable across datasets, DSDs, and providers. Concepts could be modeled as properties, and could be associated with components using rdfs:subPropertyOf. Instead, we model them as skos:Concepts, and introduce a new property for associating them with the component. This takes advantage of the easier management, wider reusability, and fine-grained mapping features of SKOS vocabularies compared to RDFS-defined properties.

Data set details. SDMX offers two approaches to organising the data inside a dataset. Either the dataset is a collection of time series (a set of observations that share the same dimension values except for the time dimension), or it is a collection of cross-sections (a set of observations that share the same dimension values except for one or more non-time "wildcard dimensions"). In our RDF mapping, we unify both models into a simpler yet more verbose model that can be more easily interrogated with SPARQL queries (see Fig. 6). The observation values are modeled as instances of sdmx:Observation, a subclass of scovo:Item. Each observation instance is directly connected to the sdmx:DataSet via the sdmx:dataset property. An observation must have a value for each dimension property defined in the DSD. The actual observation value is recorded using rdf:value.

The time series and cross-sections found in SDMX data are still translated to RDF, in order to make any metadata attached to them available in the RDF view. The same applies to groups, which are another organisational tool that can be used to apply metadata to sections of a dataset, for example to monthly, quarterly and annual timelines of the same measure.

Content-Oriented Guidelines. A key component of the SDMX standards package is the Content-Oriented Guidelines, a set of cross-domain concepts, code lists, and categories that support interoperability and comparability between datasets by providing a shared language between SDMX implementers. We have performed an initial mapping of the cross-domain concepts and the category scheme to RDF, although the result has not yet found a permanent home where we can guarantee stable URIs.
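The flat observation model just described can be sketched as triples. The following minimal illustration uses plain Python, with a toy pattern matcher standing in for SPARQL; the dataset, dimension properties and all example.org URIs are hypothetical:

```python
# Hypothetical sdmx:Observation instances, following the mapping described
# above: each observation is linked to its dataset, carries one value per
# dimension property from the DSD, and records its value with rdf:value.
# All example.org URIs and the figures are invented for illustration.
EX = "http://example.org/"
SDMX = "sdmx:"

triples = {
    (EX + "obs1", "rdf:type", SDMX + "Observation"),
    (EX + "obs1", SDMX + "dataset", EX + "unemployment"),
    (EX + "obs1", EX + "refArea", "UK"),       # a coded dimension
    (EX + "obs1", EX + "timePeriod", "2009"),  # the time dimension
    (EX + "obs1", "rdf:value", "7.6"),
    (EX + "obs2", "rdf:type", SDMX + "Observation"),
    (EX + "obs2", SDMX + "dataset", EX + "unemployment"),
    (EX + "obs2", EX + "refArea", "UK"),
    (EX + "obs2", EX + "timePeriod", "2008"),
    (EX + "obs2", "rdf:value", "5.7"),
}

def match(pattern):
    """Toy triple-pattern matcher: None acts like a SPARQL variable."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "Slice" out the UK time series, as a SPARQL query over the flat
# observation model would:
uk_obs = {s for s, _, _ in match((None, EX + "refArea", "UK"))}
values = {s: o for s, p, o in triples if s in uk_obs and p == "rdf:value"}
print(sorted(values.values()))  # ['5.7', '7.6']
```

This makes concrete both the benefit noted in the text (any slice, here a time series, falls out of a simple pattern query) and the cost (every observation repeats one statement per dimension).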
Figure 6: SDMX DataSet in RDF

Implementation experience. This mapping from SDMX to RDF can be carried out in a number of ways. For example, we have used XSLT to demonstrate mapping SDMX-ML (the XML dialect for SDMX) into RDF/XML. The mapping is easy to articulate, and the resulting linked data can be queried in a number of ways, to provide slices through the data that were not anticipated by the original publishers. On the down side, this approach to modeling statistical data does result in a large number of triples, with each observation resulting in a number of statements equal to at least the number of dimensions in the DSD, and consequently large file sizes.

5. CONCLUSIONS
The statistical publishing community and the linked data community share some common aims. Both seek to make data easy to locate and open it up for reuse, particularly for analysis and visualisation. From a linked data perspective, the world of statistical publishing is an extremely rich seam of data. For statistical publishers, linked data provides a way of addressing and retrieving information at a range of levels and in a distributed fashion.

By showing how to map from the standard SDMX information model into RDF, we hope to illustrate the ease with which statistical publishers could transition to providing data to both communities. There are already moves within the UK to create demonstrators that illustrate both the feasibility and the practical costs and benefits of publishing statistical data using linked data principles using this mapping, both from statistical data that is currently represented in SDMX and that represented in other forms such as LGDx and CSV.

Future work in this area includes formally capturing both the additional RDF vocabularies and concept schemes that are required to support SDMX and the ways in which this can work with other vocabularies to provide a mapping from SDMX. We will also be exploring the use of APIs aimed at web developers who will not typically be interested in learning the complexities of either SDMX or, indeed, RDF, but who simply want to access statistical data in simple and familiar ways. We believe that combining the experience and rigour behind SDMX with the web-based paradigm of linked data in this way will ultimately enhance the value both of statistical data and of the web of data.

6. ACKNOWLEDGEMENTS
This paper is based on the collaboration that was initiated in a workshop, "Publishing statistical datasets in SDMX and the semantic web", hosted by ONS in Sunningdale, United Kingdom in February 2010. The completion of a draft reference model was one of several recommendations made by the participants, and this ongoing work continues in an open collaborative environment5. Taken together with the proposed collaboration to create a recommended style for URI design for use in APIs to find, obtain and query statistical data6, we believe this work represents a key step towards bringing the worlds of linked data and official statistics together through the wider adoption of open standards. The authors would like to thank all the participants at that workshop for their input into this work. The authors would also like to thank John Sheridan for his comments and suggestions on an earlier draft of this paper.

7. REFERENCES
[1] C. Bizer, T. Heath, T. Berners-Lee: Linked Data - The Story So Far. In International Journal on Semantic Web and Information Systems (IJSWIS), 2009
[2] International Organisation for Standardisation: ISO/TS 17369:2005 Statistical Data and Metadata Exchange (SDMX)
[3] M. Hausenblas, W. Halb, Y. Raimond, L. Feigenbaum, D. Ayers: SCOVO: Using Statistics on the Web of Data. In Proceedings of ESWC 2009 - 6th European Semantic Web Conference, p704-718, Heraklion, Greece, 2009
[4] W. Halb, Y. Raimond, M. Hausenblas: Building Linked Data For Both Humans and Machines. In WWW 2008 Workshop: Linked Data on the Web (LDOW2008), Beijing, China, 2008
[5] M. Hausenblas, W. Halb, Y. Raimond: Scripting User Contributed Interlinking. In 4th Workshop on Scripting for the Semantic Web (SFSW08), Tenerife, Spain, 2008
[6] A. Langegger, W. Wöß: RDFStats - An Extensible RDF Statistics Generator and Library. In DEXA Workshops 2009, p79-83
[7] K. Alexander, R. Cyganiak, M. Hausenblas, J. Zhao: Describing Linked Datasets. In Proceedings of the Linked Data on the Web Workshop (LDOW2009), Madrid, Spain, 2009
[8] Semantic Web Deployment Working Group: SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 2009

5 http://groups.google.com/group/publishing-statistical-data
6 http://groups.google.com/group/publishing-statistical-data/web/workshop-summary