
Publishing Official Classifications in Linked Open Data

Giorgia Lodi

giorgia.lodi@agid.gov.it 0

Antonio Maccioni

antonio.maccioni@agid.gov.it 0

Monica Scannapieco

scannapi@istat.it 1

Mauro Scanu

scanu@istat.it 1

Laura Tosco

tosco@istat.it 1

0 Agenzia per l'Italia Digitale
1 Istituto Nazionale di Statistica

Data interoperability is widely recognized as a basic step in developing integrated services that support inter-organizational communication. Many different communities have tackled the issue of ensuring data interoperability in order to address various problems. In particular, national and supranational institutes of statistics are deeply concerned with issuing official and shared classifications (i.e., taxonomies, schemes, code lists) to be used in the jurisdiction of reference. From a different perspective, the Web data management community has done much work on publishing data on the Web in an interoperable way. These efforts have converged on a series of standards and practices gathered under the Semantic Web stack. Clearly, the two scenarios are complementary, as they can benefit each other. To this end, the Italian National Institute of Statistics (Istat) and the Agency for Digital Italy (AgID) have launched an initiative that aims to produce official classifications in the form of Linked Open Data, to be published in the Web of Data using standard ontologies. The paper describes and motivates this initiative.

In the Official Statistics (OS) domain, the issue of data interoperability has been present for decades: both national and international exchanges of data resulting from statistical processes are made possible only by adopting common metadata models and formats.

From this point of view, a specific effort has been made in this domain to standardize classifications, and hence to introduce the concept of an "official" classification, that is, one that conforms to internationally accepted standards. Indeed, most of the data exchanged in OS are multidimensional data represented as measures and related dimensions. The latter are coded according to specific classifications, and hence official ones have played a significant role in data exchange processes among National Statistical Institutes (NSIs).

The Linked Data initiative [7] is increasingly establishing itself as the principal means for data interoperability, by making it possible to create and interlink arbitrary volumes of structured data across the Web. In particular, the initiative is made possible by the widespread adoption of Web standards for publishing data according to the Resource Description Framework (RDF) model. RDF allows resources on the Web to be uniquely identified by means of a specific URI (Uniform Resource Identifier). This feature has several advantages, including (i) the possibility of direct access to resources via a query language and (ii) the ability to link data together in order to access them in an integrated way (with the clear positive side effects of higher quality, more information that is more easily accessed, and so on).
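As a minimal sketch of the two advantages just listed, an RDF graph can be pictured as a set of subject-predicate-object triples whose resources are named by URIs. The code below uses plain Python rather than an RDF library, the `example.org` URIs are hypothetical, and the `owl:sameAs` link between the two resources is an illustrative assumption rather than a statement taken from any published dataset.

```python
# Hypothetical URIs under example.org; not the identifiers of any real dataset.
EX = "http://example.org/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
OWL = "http://www.w3.org/2002/07/owl#"

# An RDF graph is a set of (subject, predicate, object) triples.
triples = {
    (EX + "ateco/A", SKOS + "prefLabel", "Agriculture, forestry and fishing"),
    (EX + "ateco/A", OWL + "sameAs", EX + "nace/A"),
    (EX + "nace/A", SKOS + "prefLabel", "Agriculture, forestry and fishing"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# (i) Direct access: everything stated about one resource.
about_a = match(triples, s=EX + "ateco/A")

# (ii) Linking: follow owl:sameAs to reach the same concept in another dataset.
linked = [o for (_, _, o) in match(triples, s=EX + "ateco/A", p=OWL + "sameAs")]
```

In practice the pattern matching would be delegated to a SPARQL endpoint; the sketch only shows why globally unique identifiers make both querying and cross-dataset linking straightforward.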

The Linked Open Data (LOD) project is concerned with the publication of Linked Data that are "Open". Several LOD datasets are already available. The so-called LOD cloud [8] covers an estimated 50 billion facts or more from many different domains, such as geography, media, biology, chemistry, economy, and energy.

The LOD project has also had immediate and widespread success in the e-government sector: several public administrations (PAs) and institutions are starting to publish their data as LOD. To this end, in the specific Italian context, at the end of 2012 the Agency for Digital Italy (AgID) published national guidelines [11] that paved the way for the use of LOD as the data paradigm for enabling semantic interoperability in the collaboration between PAs. Since then, AgID has continued to exercise its role of national Public Sector Information (PSI) enabler by annually releasing a number of strategic documents for PAs. One of these documents is the so-called national agenda, which includes principles (e.g., interoperability, usability, accessibility, data quality) and objectives to be achieved by PAs within a year in order to implement, and sustain in the long term, the PSI enhancement process. Following the G8 Open Data Charter experience [3], the agenda introduces the concept of key datasets to be released as high-quality open data. Among the key datasets, AgID identified "official" classifications as cross-domain data to be published in LOD so as to foster an effective integration of even heterogeneous data.

In view of this scenario, Istat, the Italian National Institute of Statistics, and AgID launched a joint project whose objectives can be summarised as follows:

- to model "official" classifications such as Ateco 2007 (Classification of Economic Activity) and COFOG (Classification of the Functions of Government) in LOD using standard ontologies (e.g., SKOS - Simple Knowledge Organization System, XKOS - eXtended Knowledge Organization System, ADMS - Asset Description Metadata Schema);
- to certify data by provenance using the PROV framework. In particular, the framework has been used to document the overall process of the creation of the classifications by Istat, as well as of the creation of their LOD versions and of their publication on the SPARQL endpoint by AgID.

The project also helped AgID and Istat to delineate guidelines and artefacts for LOD publication that could raise awareness of the need to certify data quality and reliability, thus enabling effective data reuse and interoperability in the Italian PSI context.

Similar projects have been carried out by other institutions, such as the publication of official classifications by INSEE [5] and the publication of the NACE official classification by the European Community [9].

The paper describes this project and the data model that has been adopted and implemented in LOD for the specific Ateco 2007 standardized classification. The remainder of the paper is structured as follows. Sections 2 and 3 provide the background on Ateco 2007 and on the PROV and XKOS ontologies, respectively. Section 4 introduces the data modelling in LOD of the Ateco classification and Section 5 concludes the paper.

2 Background: Official Classifications and ATECO 2007

Classifications were among the first metadata sets that NSIs started to store, with the objectives of (i) reusing them in different production processes and (ii) promoting harmonization. The main result came from the Neuchâtel Group, which in 2004 issued version 2.1 of the Neuchâtel Terminology Model: Classification database object types and their attributes [10]. The principal purpose of that work was to arrive at a common language and perception of the structure of classifications as well as of the links between them.

More recently, the Generic Statistical Information Model (GSIM) [4] was proposed with the objective of describing the information objects and flows in the statistical business process; its Concepts module describes classifications in detail. Some of the elements characterizing GSIM classifications are:

- A Classification Family is a group of Classification Series. A Classification Series is an ensemble of one or more Statistical Classifications.
- A Statistical Classification is a set of Categories. The Categories at each Level of the classification structure must be mutually exclusive and jointly exhaustive of all objects/units in the population of interest.
- A Statistical Classification has Categories that are represented by Classification Items. These Classification Items are organised into Levels determined by the hierarchy. A Level is a set of Concepts that are mutually exclusive and jointly exhaustive.
- Statistical Classifications can be versions or variants. A variant type of Statistical Classification is based on a version type of Statistical Classification. In a variant, the Categories of the version may be split, aggregated or regrouped to provide additions or alternatives to the standard order and structure of the original Statistical Classification.

Up to the level classe, ATECO 2007 coincides with the NACE classification, which is a modification of ISIC Rev. 4 [6], managed by the United Nations. The variants of these classifications managed in the Istat Sistema Unitario dei Metadati (SUM) are those used for specific purposes within Istat, with the following distinctions:

- Variants organized by specific Istat systems (e.g., the system devoted to dissemination, the one devoted to data collection, and so on). Each of these variants includes additional categories, such as the codes of the Main Industrial Groupings.
- Variants used in specific data structures (of either macro or micro data) in order to show the level of detail at which the dimension Economic Activity is given. For instance, some disseminated data use all the codes of the first ATECO level; others use a subgroup of them together with some Main Industrial Groupings; still others focus on one or two codes of the first ATECO level and then decompose them down to the last ATECO level.
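The GSIM requirement that the Categories of a Level be mutually exclusive and jointly exhaustive, and the idea that a variant regroups the Categories of a version, can be illustrated with a small sketch. The population, the codes and the function name `is_partition` are invented for illustration; they are not part of GSIM or of any Istat system.

```python
def is_partition(units, level):
    """Check that the categories of `level` are mutually exclusive and
    jointly exhaustive with respect to the population `units`.

    `level` maps a category code to the set of units assigned to it.
    """
    seen = set()
    for members in level.values():
        if seen & members:        # a unit falls into two categories
            return False
        seen |= members
    return seen == set(units)     # every unit is covered, and nothing else


# Toy population and codes, invented for illustration only.
units = {"u1", "u2", "u3", "u4"}
version_level = {"A": {"u1", "u2"}, "B": {"u3"}, "C": {"u4"}}

# A variant regroups categories of the version: here B and C are aggregated.
variant_level = {
    "A": version_level["A"],
    "B+C": version_level["B"] | version_level["C"],
}

# A level that violates mutual exclusivity (u2 appears twice).
overlapping_level = {"A": {"u1", "u2"}, "B": {"u2", "u3", "u4"}}
```

Both `version_level` and `variant_level` pass the check, which reflects the point that aggregating whole categories preserves the partition, whereas `overlapping_level` does not.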

3 Background: PROV and XKOS Ontologies

As mentioned, standardizing a classification is a complex process that involves different actors: the PROV ontology, described in Section 3.1, helps to track this aspect.

Moreover, statistical classifications are a specific type of classification that needs to be modeled in an ad hoc way. This is addressed by the XKOS ontology, described in Section 3.2.

3.1 PROV Ontology

Provenance information is relevant to certifying data quality and reliability. In more detail, it is very important to publish, together with the data, who is responsible for them and which entities, activities and agents are involved in the generation and manipulation processes; namely, the following concepts:

- Responsible for the data: the person/institution/administration that manages/creates/manipulates the data.
- Certified data: data published by the party responsible for them.
- Provenance: the set of detailed information regarding the entities and processes involved in the production and publication of data.

We tested the PROV Ontology [12], a W3C Recommendation, to certify the role of Istat as the official producer of the ATECO 2007 classification published as Linked Open Data.

Figure 2 shows the principal elements of the PROV data model. The PROV Ontology (PROV-O) is an OWL2 ontology that expresses the PROV Data Model (PROV-DM); it provides a set of classes, properties, and restrictions to represent provenance information. PROV-O is part of the PROV framework, a set of documents describing different aspects of provenance. In detail, the framework consists of the following documents:

- PROV-DM: describes the data model; that is, the entity, activity and agent concepts.
- PROV-O (PROV-Ontology): describes the data model using the OWL2 language.
- PROV-N: describes the data model using a human-readable notation.
- PROV-XML: describes the data model using the XML notation.
- PROV-Constraints: describes the integrity constraints for writing provenance information correctly using the PROV ontology.
- PROV-Sem: describes the semantics of the PROV data model.
- PROV-Dictionary: defines an extension of PROV-DM for collections and dictionaries.
- PROV-Links: describes an extension of PROV-DM for the correct description of multiple data sources.
- PROV-AQ: describes some methods to access and query provenance data.
- PROV-DC: allows referencing provenance data expressed with the Dublin Core ontology.

The specific usage we made of the PROV ontology is described in Section 4.

3.2 XKOS Ontology

XKOS (eXtended Knowledge Organization System) [2] is an extension of SKOS (Simple Knowledge Organization System) [13] for managing statistical classifications and concept management systems.

With respect to SKOS, XKOS enables the representation of statistical classifications with their structure and textual properties, as well as the relations between classifications. Moreover, XKOS refines the SKOS semantic properties, allowing the use of more specific relations between concepts for describing statistical classifications. In more detail, SKOS concepts are defined from the point of view of a thesaurus; thus SKOS only defines the following relationships: (i) broader, (ii) narrower, and (iii) related. By contrast, statistical classifications rely on hierarchical relations (generic-specific and whole-part); thus XKOS introduces the definition of these relations and structures data into levels, where a level corresponds to all those concepts that are at the same distance from the top of the hierarchy. Finally, XKOS defines causal, sequential, and temporal relations. Figure 3 shows the XKOS concepts.
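The notion of a level as the set of concepts at the same distance from the top of the hierarchy can be made concrete with a short sketch. The function name `levels_by_depth`, the toy excerpt of codes, and the choice of counting depth from 1 are assumptions made for illustration.

```python
from collections import defaultdict


def levels_by_depth(broader):
    """Group concepts by their distance from the top of the hierarchy.

    `broader` maps each concept code to the code of its parent concept,
    or to None for a top concept. Depth is counted from 1 (an assumption
    of this sketch). The hierarchy is assumed to contain no cycles.
    """
    def depth(code):
        parent = broader[code]
        return 1 if parent is None else 1 + depth(parent)

    levels = defaultdict(list)
    for code in sorted(broader):
        levels[depth(code)].append(code)
    return dict(levels)


# Toy excerpt of a three-level hierarchy, for illustration only.
broader = {"A": None, "01": "A", "02": "A", "01.1": "01", "01.2": "01"}
levels = levels_by_depth(broader)
```

Here `levels` groups `A` at depth 1, `01` and `02` at depth 2, and `01.1` and `01.2` at depth 3, which is the structure a level-based representation makes explicit instead of leaving it implicit in the broader/narrower links.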

4 Modeling the Classification

This section describes the process that we followed in order to create the LOD version of the official classification Ateco 2007, which we call LinkedAteco2007. Note that the same process has also been applied to another classification, COFOG, which has recently been published by AgID in collaboration with Istat. Both classifications are available at the AgID SPARQL endpoint: http://spcdata.digitpa.gov.it/.

The activity involved the use of different ontologies and vocabularies and a customization of existing models. It is worth noting that this activity was also described as a best practice within the Italian national guidelines for PSI valorization, in order to guide PAs in (i) certifying the provenance of their data (crucial especially in the collaboration between the local and central levels of government), and (ii) using standard and common ontologies when describing their data.

The following subsections detail the activity carried out by Istat and AgID.

4.1 Provenance Modeling

In order to deal with the complexity of the process of standardizing classifications, in our modeling we gathered as much of the information related to that process as possible, while leaving other users the flexibility to extend the classification with variants and further versions. In this respect, the first phase of the modeling was to represent the provenance of the data that form LinkedAteco2007. Figure 4 shows the resulting diagram, which includes the activities, entities and actors involved in the process, all modeled through the PROV ontology. From the diagram, we can observe that the activity of publishing LinkedAteco2007 (i.e., Generation Linked Ateco2007) was carried out by two different actors: Ufficio RST/B and AgID. Ufficio RST/B is an organizational unit of the DCIT, which belongs to Istat. The diagram also specifies that AgID is the publisher of LinkedAteco2007, whereas Ufficio RST/B is the creator. In addition, the original classification was influenced by the Eurostat classification NACE2006v2 and derived from Ateco2002, a previous variant of the Ateco classification. All the Ateco classifications are grouped together through the Ateco collection.

The model is sufficiently flexible to allow an external user (e.g., a local administration) to extend Ateco2007 for its own purposes (e.g., to add a non-existent activity): in this case, the user can define a variant of LinkedAteco2007, as illustrated by the Variation Ateco 2007 box at the bottom of Figure 4.
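The provenance relations described in this subsection can be sketched as a handful of statements. The property names (`prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:wasInfluencedBy`, `prov:wasDerivedFrom`, `prov:hadMember`, `dcterms:creator`, `dcterms:publisher`) are standard PROV-O and Dublin Core terms, but which property labels which arrow of Figure 4 is an assumption of this sketch, as are the `ex:` identifiers and the helper functions.

```python
# Provenance statements as (subject, predicate, object) triples.
# The ex: names are hypothetical identifiers chosen for this sketch.
provenance = [
    ("ex:LinkedAteco2007", "prov:wasGeneratedBy", "ex:GenerationLinkedAteco2007"),
    ("ex:GenerationLinkedAteco2007", "prov:wasAssociatedWith", "ex:UfficioRSTB"),
    ("ex:GenerationLinkedAteco2007", "prov:wasAssociatedWith", "ex:AgID"),
    ("ex:LinkedAteco2007", "dcterms:creator", "ex:UfficioRSTB"),
    ("ex:LinkedAteco2007", "dcterms:publisher", "ex:AgID"),
    ("ex:Ateco2007", "prov:wasInfluencedBy", "ex:NACE2006v2"),
    ("ex:Ateco2007", "prov:wasDerivedFrom", "ex:Ateco2002"),
    ("ex:AtecoCollection", "prov:hadMember", "ex:Ateco2002"),
    ("ex:AtecoCollection", "prov:hadMember", "ex:Ateco2007"),
    # A variant defined by an external user, e.g. a local administration.
    ("ex:VariationAteco2007", "prov:wasDerivedFrom", "ex:LinkedAteco2007"),
]


def objects(statements, subject, predicate):
    """Return all objects of the statements with the given subject and predicate."""
    return [o for (s, p, o) in statements if s == subject and p == predicate]


def responsible_agents(statements, entity):
    """Agents associated with the activities that generated `entity`."""
    agents = []
    for activity in objects(statements, entity, "prov:wasGeneratedBy"):
        agents.extend(objects(statements, activity, "prov:wasAssociatedWith"))
    return agents


agents = responsible_agents(provenance, "ex:LinkedAteco2007")
```

Walking from an entity to its generating activity and then to the associated agents is the kind of question a data consumer would ask in order to check who stands behind a published classification.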

4.2 Classification Distribution

To enrich the metadata of the classification, we relied on standard and well-known ontologies. Figure 5 illustrates the metadata enrichment. In particular, we used two ontologies; namely, DCAT (footnote 3) and ADMS (footnote 4). The former allows us to insert the classification as a dataset in our data catalogue and to decouple the abstract notion of dataset from its actual implementation. It also allows us to state that the conceptual model of the dataset is expressed using the RDF framework and that we have produced the classification in different formats (e.g., RDF/N3, RDF/XML). These different productions are instances of the class dcat:Distribution, and some of them are downloadable.

3 http://www.w3.org/TR/vocab-dcat/
4 http://www.w3.org/TR/vocab-adms/

The ADMS ontology allows us to specify that the classification is a semantic asset, since it can be effectively used as an integration element between different data, thus enabling semantic interoperability.

Finally, Figure 5 shows that the LinkedAteco2007 entity is a Taxonomy represented using SKOS (see next subsection).
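The decoupling between the abstract dataset and its concrete distributions can be sketched as follows. Only the DCAT part is covered; the class and field names, the `example.org` URIs and the media types are assumptions made for illustration, while `dcat:Dataset`, `dcat:Distribution`, `dcat:distribution`, `dcat:mediaType` and `dcat:downloadURL` are terms of the DCAT vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Distribution:
    """One concrete serialization of a dataset."""
    uri: str
    media_type: str
    download_url: Optional[str] = None   # None means "not downloadable"


@dataclass
class Dataset:
    """The abstract dataset, independent of any serialization."""
    uri: str
    title: str
    distributions: List[Distribution] = field(default_factory=list)

    def to_triples(self):
        """Describe the dataset and its distributions with DCAT terms."""
        triples = [
            (self.uri, "rdf:type", "dcat:Dataset"),
            (self.uri, "dcterms:title", self.title),
        ]
        for dist in self.distributions:
            triples.append((self.uri, "dcat:distribution", dist.uri))
            triples.append((dist.uri, "rdf:type", "dcat:Distribution"))
            triples.append((dist.uri, "dcat:mediaType", dist.media_type))
            if dist.download_url is not None:
                triples.append((dist.uri, "dcat:downloadURL", dist.download_url))
        return triples


# Hypothetical URIs; one downloadable distribution and one that is not.
linked_ateco = Dataset(
    uri="http://example.org/dataset/LinkedAteco2007",
    title="LinkedAteco2007",
    distributions=[
        Distribution("http://example.org/dist/ateco-rdfxml",
                     "application/rdf+xml",
                     "http://example.org/download/ateco2007.rdf"),
        Distribution("http://example.org/dist/ateco-n3", "text/n3"),
    ],
)
catalogue_triples = linked_ateco.to_triples()
```

Adding a further format then amounts to appending one more `Distribution`, without touching the description of the dataset itself.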

4.3 Classification Modeling

To model the content of the classification we used the widely used SKOS ontology [13] and XKOS [2], its extension for statistical data.

SKOS allows us to express that LinkedAteco2007 is a ConceptScheme, and XKOS allows us to represent the full hierarchy of classification levels (i.e., the instances of the class xkos:ClassificationLevel). Each level of the classification is defined by its depth (i.e., xkos:depth) and is connected to all its members (i.e., skos:member).

Figure 7 completes the description of the classification by showing how each member is described. Specifically, a member has a notation (i.e., skos:notation), a label (i.e., skos:prefLabel), a textual description in the attribute note (i.e., skos:note) and a comment (i.e., rdfs:comment). It is worth noting that every element is explicitly related to the upper-level member, of which it is a specialization (expressed through xkos:specializes), and to the lower-level members, of which it is a generalization (expressed through xkos:generalizes).

Implementation. In general, the implementation of the modelling described above adheres to a methodology proposed and used by AgID in other LOD projects [15]. In particular, the implementation consisted of the following four steps:

1. import the tabular data (i.e., a CSV file) from the publicly available data source (footnote 5) into a relational database;
2. clean and prepare the data for processing (for instance, by adding new attributes composed from existing fields);
3. model the RDF data following the diagrams illustrated above and transform them accordingly using common tools; in our case, we used the D2RQ framework [14];
4. publish the resulting classification on the SPCData Web portal (footnote 6) and on the corresponding linked data cloud, and make the LOD dataset available for querying through the SPCData SPARQL endpoint.
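The first three implementation steps can be sketched end to end with the Python standard library. This is not the D2RQ mapping used in the project: the transformation is written by hand, an in-memory SQLite database stands in for the relational database, the CSV layout (`code`, `parent`, `label`), the base URI and the use of `skos:broader` for the parent link are all assumptions, and step 4 (publication) is left out.

```python
import csv
import io
import sqlite3

# Illustrative rows; the real source file has a different layout.
CSV_TEXT = (
    "code,parent,label\n"
    'A,,"Agriculture, forestry and fishing"\n'
    '01,A,"Crop and animal production"\n'
)
BASE = "http://example.org/ateco2007/"            # hypothetical base URI
SKOS = "http://www.w3.org/2004/02/skos/core#"


def load(conn, csv_text):
    """Step 1: import the tabular data into a relational table."""
    conn.execute("CREATE TABLE item (code TEXT PRIMARY KEY, parent TEXT, label TEXT)")
    rows = [(r["code"], r["parent"], r["label"])
            for r in csv.DictReader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO item VALUES (?, ?, ?)", rows)


def clean(conn):
    """Step 2: normalise missing parents and add a column composed from existing fields."""
    conn.execute("UPDATE item SET parent = NULL WHERE parent = ''")
    conn.execute("ALTER TABLE item ADD COLUMN uri TEXT")
    conn.execute("UPDATE item SET uri = ? || code", (BASE,))


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(conn):
    """Step 3: transform each row into RDF statements (N-Triples lines)."""
    lines = []
    query = "SELECT code, parent, label, uri FROM item ORDER BY code"
    for code, parent, label, uri in conn.execute(query):
        lines.append(f'<{uri}> <{SKOS}notation> "{escape(code)}" .')
        lines.append(f'<{uri}> <{SKOS}prefLabel> "{escape(label)}"@en .')
        if parent is not None:
            lines.append(f"<{uri}> <{SKOS}broader> <{BASE}{parent}> .")
    return lines


def run_pipeline(csv_text):
    """Run steps 1 to 3 on an in-memory database and return the N-Triples lines."""
    conn = sqlite3.connect(":memory:")
    try:
        load(conn, csv_text)
        clean(conn)
        return to_ntriples(conn)
    finally:
        conn.close()


ntriples = run_pipeline(CSV_TEXT)
```

A declarative mapping tool such as D2RQ replaces the hand-written `to_ntriples` function with a mapping file, which is easier to maintain when the source tables change.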

5 Conclusions

The paper describes a real project by Istat and AgID related to the publication of Official Classifications as LOD in the Italian public sector context. The technical contribution of the paper focuses on the modeling aspects of the classifications, making use of ontologies that are specific to the statistical domain, like XKOS, as well as of more general-purpose ontologies, like PROV.

5 http://sistemaclassificazioni.istat.it/class/sistemaclassificazioni/index.php
6 http://spcdata.digitpa.gov.it/classificazioni.html

Besides its technical contribution, the paper also provides a relevant methodological contribution to fostering data interoperability among different institutions.

Indeed, data interoperability is made possible by efforts on two fronts:

- shared formats and models, i.e., technological standards like the RDF framework and ontologies like XKOS;
- content harmonization, i.e., common domain-specific information concepts.

We think that publishing OS classifications in LOD is an important first step towards "content" harmonization, which can considerably speed up cooperation among different institutions based on data sharing and exchange.

References

1. Classificazione delle attività economiche 2007 (Ateco), http://www3.istat.it/strumenti/definizioni/ateco/ateco.html?versione=2007.3
2. Extended Knowledge Organization System (XKOS), http://www.ddialliance.org/Specification/RDF/XKOS/
3. G8 Open Data Charter and Technical Annex, https://www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-and-technical-annex
4. Generic Statistical Information Model (GSIM) v. 1.1, http://www1.unece.org/stat/platform/display/gsim/Generic+Statistical+Information+Model
5. INSEE official classification site, http://www.rdf.insee.fr/codes/index.html
6. ISIC Rev. 4, http://unstats.un.org/unsd/cr/registry/isic-4.asp
7. Linked Data initiative, http://linkeddata.org/
8. LOD cloud, http://lod-cloud.net/
9. NACE official classification, http://www.ec.europa.eu/eurostat/ramon/rdfdata/nacer2.rdf
10. Neuchâtel Terminology Model: Classification, http://www3.ssb.no/DOCS/Neuchatelversion2.1.pdf
11. An overview of the Italian guidelines for semantic interoperability through Linked Open Data, http://www.agid.gov.it/sites/default/files/documentazione_trasparenza/semanticinteroperabilitylod_en_3.pdf
12. Provenance Ontology (PROV), http://www.w3.org/TR/prov-o/
13. Simple Knowledge Organization System (SKOS), http://www.w3.org/2004/02/skos/
14. Bizer, C.: D2R MAP - a database to RDF mapping language. In: WWW (Posters) (2003)
15. Lodi, G., Maccioni, A., Tortorelli, F.: Linked open data in the Italian e-government interoperability framework. In: 6th International Conference on Methodologies, Technologies and Tools enabling e-Government (METTEG) (2012)