<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Publishing Official Classifications in Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgia Lodi</string-name>
          <email>giorgia.lodi@agid.gov.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Maccioni</string-name>
          <email>antonio.maccioni@agid.gov.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Scannapieco</string-name>
          <email>scannapi@istat.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Scanu</string-name>
          <email>scanu@istat.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Tosco</string-name>
          <email>tosco@istat.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Agenzia per l'Italia Digitale</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto Nazionale di Statistica</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data interoperability is widely recognized as a basic step towards integrated services that support communication between organizations. Many different communities have tackled it in order to address various problems. In particular, national and supranational institutes of statistics are deeply concerned with issuing official and shared classifications (i.e., taxonomies, schemes, code lists) to be used in their jurisdiction of reference. From a different perspective, the Web data management community has done much work on publishing data on the Web in an interoperable way. These efforts have converged on a series of standards and practices gathered under the Semantic Web stack. The two scenarios are clearly complementary, as each can benefit from the other. To this end, the Italian Institute of Statistics (Istat) and the Agency for Digital Italy (AgID) have launched an initiative to produce official classifications as Linked Open Data, to be published in the Web of data using standard ontologies. The paper describes and motivates this initiative.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the Official Statistics (OS) domain, data interoperability has been an issue for decades: both national and international exchanges of data resulting from statistical processes are possible only by adopting common metadata models and formats.</p>
      <p>From this point of view, a specific effort has been made in this domain to standardize classifications, and hence to introduce the concept of an "official" classification, that is, one that conforms to internationally accepted standards. Indeed, most of the data exchanged in OS are multidimensional data represented as measures and related dimensions. The latter are coded according to specific classifications, and hence official ones have played a significant role in data exchange processes among National Statistical Institutes (NSIs).</p>
      <p>The Linked Data initiative [7] is increasingly establishing itself as the principal means for data interoperability, making it possible to create and interlink arbitrary volumes of structured data across the Web. In particular, the initiative is made possible by the widespread adoption of Web standards for publishing data according to the Resource Description Framework (RDF) model. RDF allows resources on the Web to be uniquely identified by means of a URI (Uniform Resource Identifier). This feature has several advantages, including (i) direct access to resources via a query language and (ii) the ability to link data together so that they can be accessed in an integrated way (with clear positive side effects such as higher quality and more information that is more easily accessed).</p>
      <p>The Linked Open Data (LOD) project is concerned with the publication of Linked Data that are "Open". Several LOD datasets are already available. The so-called LOD cloud [8] covers an estimated 50 billion facts or more from many different domains, such as geography, media, biology, chemistry, economy and energy.</p>
      <p>
        The LOD project has also had immediate and widespread success in the e-government sector: several public administrations (PAs) and institutions are starting to publish their data as LOD. In the Italian context specifically, at the end of 2012 the Agency for Digital Italy (AgID) published national guidelines [11] that paved the way for the use of LOD as the data paradigm for enabling semantic interoperability in the collaboration between PAs. Since then, AgID has continued to exercise its role as national Public Sector Information (PSI) enabler by annually releasing a number of strategic documents for PAs. One of these documents is the so-called national agenda, which includes principles (e.g., interoperability, usability, accessibility, data quality) and objectives to be achieved by PAs within a year in order to implement, and sustain in the long term, the PSI enhancement process. Following the G8 open data charter experience [<xref ref-type="bibr" rid="ref3">3</xref>], the agenda introduces the concept of key datasets to be released as high-quality open data. Among the key datasets, AgID identified "official" classifications as cross-domain data to be published as LOD so as to foster effective integration even between heterogeneous data.
      </p>
      <p>In view of this scenario, Istat, the Italian National Institute of Statistics, and AgID launched a joint project whose objectives can be summarised as follows:
        <list list-type="bullet">
          <list-item><p>to model "official" classifications such as Ateco 2007 (Classification of Economic Activity) and COFOG (Classification of the Functions of Government) in LOD using standard ontologies (e.g., SKOS - Simple Knowledge Organization System, XKOS - eXtended Knowledge Organization System, ADMS - Asset Description Metadata Schema, etc.);</p></list-item>
          <list-item><p>to certify data by provenance using the PROV framework. In particular, the framework has been used to document the overall process of the creation of the classifications by Istat, as well as the creation of their LOD versions and their publication on the SPARQL endpoint by AgID.</p></list-item>
        </list>
      </p>
      <p>The project also helped AgID and Istat to define guidelines and artefacts for LOD publication that could raise awareness of the need to certify data quality and reliability, thus enabling effective data reuse and interoperability in the Italian PSI context.</p>
      <p>Similar projects have been carried out by other institutions, such as the publication of official classifications by INSEE [5] and the publication of the NACE official classification by the European Community [9].</p>
      <p>The paper describes this project and the data model that has been adopted and implemented in LOD for the Ateco 2007 standardized classification. The remainder of the paper is structured as follows. Sections 2 and 3 provide the background on Ateco 2007 and PROV, respectively. Section 4 introduces the LOD data modelling of the Ateco classification and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Background: Official Classifications and ATECO 2007</title>
      <p>Classifications were among the first metadata sets that NSIs started to store, with the objectives of (i) reusing them in different production processes and (ii) promoting harmonization. The main result was the work of the Neuchâtel Group, which in 2004 issued version 2.1 of the Neuchâtel Terminology Model for classification database object types and their attributes [10]. The principal purpose of the work was to arrive at a common language and perception of the structure of classifications as well as of the links between them.</p>
      <p>More recently, the Generic Statistical Information Model (GSIM) [4] was proposed with the objective of describing the information objects and flows in the statistical business process: in the Concepts module, classifications are described in detail. Some of the elements characterizing GSIM classifications are:
        <list list-type="bullet">
          <list-item><p>A Classification Family is a group of Classification Series. A Classification Series is an ensemble of one or more Statistical Classifications.</p></list-item>
          <list-item><p>A Statistical Classification is a set of Categories. The Categories at each Level of the classification structure must be mutually exclusive and jointly exhaustive of all objects/units in the population of interest.</p></list-item>
          <list-item><p>A Statistical Classification has Categories that are represented by Classification Items. These Classification Items are organised into Levels determined by the hierarchy. A Level is a set of Concepts that are mutually exclusive and jointly exhaustive.</p></list-item>
          <list-item><p>Statistical Classifications can be versions or variants. A variant type of Statistical Classification is based on a version type of Statistical Classification. In a variant, the Categories of the version may be split, aggregated or regrouped to provide additions or alternatives to the standard order and structure of the original Statistical Classification.</p></list-item>
        </list>
      </p>
      <p>to the level classe, it coincides with the NACE classification, which is a modification of ISIC Rev. 4 [6], managed by the United Nations. Variants of these classifications managed in the Istat Sistema Unitario dei Metadati (SUM) are those used for specific purposes in Istat, with the following distinctions:
        <list list-type="bullet">
          <list-item><p>Variants organized by specific Istat systems (e.g., the system devoted to dissemination, the one devoted to data collection, and so on). Each of these variants includes additional categories, such as the codes in the Main Industrial Groupings.</p></list-item>
          <list-item><p>Variants used in specific data structures (of either macro or micro data) in order to show the level of detail at which the dimension Economic Activity is given. For instance, some disseminated data use all the codes of the first ATECO level, others use a subgroup of them together with some Main Industrial Groupings, and others focus on one or two codes of the first ATECO level and then decompose them down to the last ATECO level.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background: PROV and XKOS Ontologies</title>
      <p>As mentioned, standardizing a classification is a complex process that involves different actors: the PROV ontology, described in Section 3.1, helps to track this aspect.</p>
      <p>Moreover, statistical classifications are a specific type of classification that needs to be modeled in an ad hoc way. This is addressed by the XKOS ontology, described in Section 3.2.</p>
      <sec id="sec-3-1">
        <title>PROV Ontology</title>
        <p>Provenance information is relevant for certifying data quality and reliability. In more detail, it is very important to publish, together with the data, who is responsible for them and which entities, activities and agents are involved in the generation and manipulation processes; namely, the following concepts:
          <list list-type="bullet">
            <list-item><p>Responsible for the data: the person, institution or administration that manages, creates or manipulates the data.</p></list-item>
            <list-item><p>Certified data: data published by the party responsible for them.</p></list-item>
            <list-item><p>Provenance: the set of detailed information regarding entities and processes involved in the production and publication of data.</p></list-item>
          </list>
        </p>
        <p>We tested the PROV Ontology [12], a W3C recommendation, to certify the role of Istat as the official producer of the ATECO 2007 classification published as linked open data.</p>
        <p>Figure 2 shows the principal elements of the PROV data model. The PROV Ontology (PROV-O) is an OWL2 ontology that expresses the PROV Data Model (PROV-DM), providing a set of classes, properties, and restrictions to represent provenance information. A PROV framework is available, composed of a set of documents describing different aspects of provenance. In detail, the framework consists of the following documents:
          <list list-type="bullet">
            <list-item><p>PROV-DM: describes the data model; that is, the entity, activity and agent concepts.</p></list-item>
            <list-item><p>PROV-O (PROV-Ontology): describes the data model using the OWL2 language.</p></list-item>
            <list-item><p>PROV-N: describes the data model using a human-readable notation.</p></list-item>
            <list-item><p>PROV-XML: describes the data model using the XML notation.</p></list-item>
            <list-item><p>PROV-Constraints: describes the integrity constraints for writing valid provenance information with the PROV ontology.</p></list-item>
            <list-item><p>PROV-Sem: describes the semantics of the PROV data model.</p></list-item>
            <list-item><p>PROV-Dictionary: defines an extension of PROV-DM for collections and dictionaries.</p></list-item>
            <list-item><p>PROV-Links: describes an extension of PROV-DM for the correct description of multiple data sources.</p></list-item>
            <list-item><p>PROV-AQ: describes some methods to access and query provenance data.</p></list-item>
            <list-item><p>PROV-DC: allows referencing provenance data expressed in the Dublin Core Ontology.</p></list-item>
          </list>
        </p>
        <p>The specific use we made of the PROV ontology is described in Section 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>XKOS Ontology</title>
        <p>
          XKOS (eXtended Knowledge Organization System) [<xref ref-type="bibr" rid="ref2">2</xref>] is an extension of SKOS (Simple Knowledge Organization System) [13] for managing statistical classifications and concept management systems.
        </p>
        <p>With respect to SKOS, XKOS enables the representation of statistical classifications with their structure and textual properties, as well as the relations between classifications. Moreover, XKOS refines the SKOS semantic properties, allowing the use of more specific relations between concepts for describing statistical classifications. In more detail, SKOS concepts are defined from the point of view of a thesaurus, so SKOS only defines the following relationships: (i) broader, (ii) narrower, and (iii) related. Statistical classifications, by contrast, rely on hierarchical relations (generic-specific and whole-part), so XKOS introduces the definition of these relations and structures data into levels; a level corresponds to all those concepts that are at the same distance from the top of the hierarchy. Finally, XKOS defines causal, sequential, and temporal relations. Figure 3 shows the XKOS concepts.</p>
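        <p>As an illustration of the notion of level, the following Python sketch groups the items of a small hierarchy by their distance from the top. The codes and parent links are hypothetical, chosen only to resemble a classification fragment, and the function names are ours; the sketch is not part of XKOS itself.</p>
        <preformat><![CDATA[
```python
# Hypothetical parent links for a fragment of a classification (child -> parent).
# Top-level items have no parent.
PARENT = {
    "A": None,
    "01": "A",
    "01.1": "01",
    "01.11": "01.1",
    "02": "A",
}


def depth(code, parent=PARENT):
    """Distance from the top of the hierarchy, counting top-level items as 1."""
    steps = 1
    while parent[code] is not None:
        code = parent[code]
        steps += 1
    return steps


def levels(parent=PARENT):
    """Group codes by depth: each group plays the role of one level."""
    grouped = {}
    for code in parent:
        grouped.setdefault(depth(code, parent), []).append(code)
    return {d: sorted(codes) for d, codes in sorted(grouped.items())}
```
]]></preformat>
        <p>With the sample data, the five codes fall into four levels, and the codes "01" and "02" share the second one because both are one step below "A".</p>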
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Modeling the Classification</title>
      <p>This section describes the process that we followed to create the LOD version of the official classification Ateco 2007, which we call LinkedAteco2007. Note that the same process has also been applied to another classification, COFOG, which has recently been published by AgID in collaboration with Istat. Both classifications are available at the AgID SPARQL endpoint: http://spcdata.digitpa.gov.it/.</p>
      <p>The activity involved the use of different ontologies and vocabularies and a customization of existing models. It is worth noting that this activity was also described as a best practice within the Italian national guidelines for PSI valorization in order to guide PAs in (i) certifying the provenance of their data (crucial especially in the collaboration between the local and central levels of government), and (ii) using standard and common ontologies when describing their data.</p>
      <p>The following subsections detail the activity carried out by Istat and AgID.</p>
      <sec id="sec-4-1">
        <title>Provenance Modeling</title>
        <p>In order to deal with the complexity of the process of standardizing classifications, in our modeling we gathered as much of the information related to such a process as possible, while leaving other users the flexibility to extend the classification with variants and further versions. In this respect, the first phase of the modeling was to represent the provenance of the data that form LinkedAteco2007. Figure 4 shows the resulting diagram, which includes the activities, entities and actors involved in the process, all modeled through the PROV ontology. From the diagram, we can observe that the activity of publishing LinkedAteco2007 (i.e., Generation Linked Ateco2007) was carried out by two different actors: Ufficio RST/B and AgID. Ufficio RST/B is an organizational unit of the DCIT belonging to Istat. The diagram also specifies that AgID is the publisher of LinkedAteco2007, whereas Ufficio RST/B is the creator. In addition, the original classification was influenced by the Eurostat classification NACE2006v2 and derived from Ateco2002, a previous variant of the Ateco classification. All the Ateco classifications are grouped together through the Ateco collection.</p>
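        <p>The provenance statements described above can be sketched as plain subject-predicate-object triples. This is only a minimal illustration: the ex: prefix and all resource names are hypothetical, and the use of dcterms:creator and dcterms:publisher for the creator and publisher roles is our assumption, so the property choices in Figure 4 may differ. The PROV-O terms used (prov:Activity, prov:Entity, prov:Collection, prov:wasAssociatedWith, prov:wasGeneratedBy, prov:wasDerivedFrom, prov:wasInfluencedBy, prov:hadMember) are defined by the W3C recommendation.</p>
        <preformat><![CDATA[
```python
# Provenance of LinkedAteco2007 as (subject, predicate, object) tuples.
# All "ex:" names are hypothetical placeholders, not the published URIs.
TRIPLES = [
    ("ex:GenerationLinkedAteco2007", "rdf:type", "prov:Activity"),
    ("ex:GenerationLinkedAteco2007", "prov:wasAssociatedWith", "ex:UfficioRSTB"),
    ("ex:GenerationLinkedAteco2007", "prov:wasAssociatedWith", "ex:AgID"),
    ("ex:LinkedAteco2007", "rdf:type", "prov:Entity"),
    ("ex:LinkedAteco2007", "prov:wasGeneratedBy", "ex:GenerationLinkedAteco2007"),
    # Assumption: Dublin Core terms express the creator and publisher roles.
    ("ex:LinkedAteco2007", "dcterms:creator", "ex:UfficioRSTB"),
    ("ex:LinkedAteco2007", "dcterms:publisher", "ex:AgID"),
    ("ex:Ateco2007", "prov:wasDerivedFrom", "ex:Ateco2002"),
    ("ex:Ateco2007", "prov:wasInfluencedBy", "ex:NACE2006v2"),
    ("ex:AtecoCollection", "rdf:type", "prov:Collection"),
    ("ex:AtecoCollection", "prov:hadMember", "ex:Ateco2002"),
    ("ex:AtecoCollection", "prov:hadMember", "ex:Ateco2007"),
]


def agents_of(activity, triples=TRIPLES):
    """Return the agents associated with an activity, sorted by name."""
    return sorted(
        o for s, p, o in triples
        if s == activity and p == "prov:wasAssociatedWith"
    )


def to_statements(triples=TRIPLES):
    """Serialize the triples as Turtle-style lines (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```
]]></preformat>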
        <p>The model is sufficiently flexible to allow an external user (e.g., a local administration) to extend Ateco2007 for its own purposes (e.g., to add an activity that does not yet exist): in this case, the user can define a variant of LinkedAteco2007, as illustrated by the Variation Ateco 2007 box at the bottom of Figure 4.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Classification Distribution</title>
        <p>To enrich the metadata of the classification, we used standard and well-known ontologies. Figure 5 illustrates the metadata enrichment. In particular, we used two ontologies, namely DCAT (http://www.w3.org/TR/vocab-dcat/) and ADMS (http://www.w3.org/TR/vocab-adms/). The former allows us to insert the classification as a dataset in our data catalogue and to decouple the abstract notion of dataset from its actual implementation. This also allows us to state that the conceptual model of the dataset is expressed using the RDF framework and that we have produced the classification in different formats (e.g., RDF/N3, RDF/XML, etc.). These different productions are instances of the class dcat:Distribution, and some of them are downloadable.</p>
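        <p>The decoupling between a dataset and its distributions can be sketched as follows. This is an illustrative Python model, not the catalogue implementation: the class and field names, the ex: identifier and the example.org download URL are all hypothetical, while dcat:Dataset, dcat:Distribution, dcat:distribution, dcat:mediaType and dcat:downloadURL are terms of the DCAT vocabulary.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    media_type: str
    download_url: str = ""  # left empty when the distribution is not downloadable


@dataclass
class Dataset:
    identifier: str
    distributions: list = field(default_factory=list)

    def to_triples(self):
        """Describe the abstract dataset once, then one node per distribution."""
        triples = [(self.identifier, "rdf:type", "dcat:Dataset")]
        for i, dist in enumerate(self.distributions, start=1):
            node = f"{self.identifier}-dist{i}"
            triples.append((self.identifier, "dcat:distribution", node))
            triples.append((node, "rdf:type", "dcat:Distribution"))
            triples.append((node, "dcat:mediaType", f'"{dist.media_type}"'))
            if dist.download_url:
                triples.append((node, "dcat:downloadURL", f"<{dist.download_url}>"))
        return triples


# Hypothetical example: one downloadable RDF/XML file and one N3 serialization.
EXAMPLE = Dataset(
    "ex:LinkedAteco2007",
    [
        Distribution("application/rdf+xml", "http://example.org/ateco2007.rdf"),
        Distribution("text/n3"),
    ],
)
```
]]></preformat>
        <p>Adding a further serialization only appends a Distribution to the list; the statements about the dataset itself stay unchanged, which is the point of the decoupling.</p>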
        <p>The ADMS ontology allows us to specify that the classification is a semantic asset, since it can be effectively used as an integration element between different data, thus enabling semantic interoperability.</p>
        <p>Finally, Figure 5 shows that the LinkedAteco2007 entity is a Taxonomy represented using SKOS (see next subsection).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Classification Modeling</title>
        <p>
          To model the content of the classification we used the widely used SKOS ontology [13] and its extension for statistical data, XKOS [<xref ref-type="bibr" rid="ref2">2</xref>].
        </p>
        <p>SKOS allows us to express that LinkedAteco2007 is a ConceptScheme, and XKOS allows us to represent the full hierarchy of classification levels (i.e., the instances of the class xkos:ClassificationLevel). Each level of the classification is defined by its depth (i.e., xkos:depth) and is connected to all its members (i.e., skos:member).</p>
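        <p>A minimal sketch of these statements, under stated assumptions, is given below. The ex: identifiers and the sample members are hypothetical, and the helper function is ours; only skos:ConceptScheme, xkos:ClassificationLevel, xkos:depth and skos:member come from the vocabularies named in the text.</p>
        <preformat><![CDATA[
```python
def level_triples(scheme, levels):
    """Emit the scheme and its levels as (subject, predicate, object) tuples.

    `levels` is a list of (level_id, member_ids) pairs ordered from the top
    of the hierarchy downwards; depth is derived from the position in the list.
    """
    triples = [(scheme, "rdf:type", "skos:ConceptScheme")]
    for depth, (level_id, members) in enumerate(levels, start=1):
        triples.append((level_id, "rdf:type", "xkos:ClassificationLevel"))
        triples.append((level_id, "xkos:depth", str(depth)))
        triples.extend((level_id, "skos:member", m) for m in members)
    return triples


# Hypothetical fragment: one top-level item and two items at the second level.
EXAMPLE = level_triples(
    "ex:LinkedAteco2007",
    [
        ("ex:level-1", ["ex:A"]),
        ("ex:level-2", ["ex:01", "ex:02"]),
    ],
)
```
]]></preformat>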
        <p>Figure 7 completes the description of the classification by showing how each member is described. Specifically, a member has a notation (i.e., skos:notation), a label (i.e., skos:prefLabel), and a textual description in the attributes note (i.e., skos:note) and comment (i.e., skos:comment). It is worth noting that every element is explicitly related to the upper-level member, of which it is a specialization (expressed through xkos:specializes), and to the lower-level members, of which it is a generalization (expressed through xkos:generalizes).</p>
        <p>Implementation. In general, the implementation of the modelling described above adheres to a methodology proposed and used by AgID in other LOD projects [15]. In particular, the implementation consisted of the following four steps:
          <list list-type="order">
            <list-item><p>import the tabular data (i.e., a CSV file) from the publicly available data source (footnote 5) into a relational database;</p></list-item>
            <list-item><p>clean and prepare the data for processing (for instance, adding new attributes composed from existing fields);</p></list-item>
            <list-item><p>model the RDF data following the diagrams illustrated above and transform them accordingly using common tools; in our case, we used the D2RQ framework [14];</p></list-item>
            <list-item><p>publish the resulting classification on the Web portal SPCData (footnote 6) and on the corresponding linked data cloud, and make the LOD dataset available for querying through the SPCData SPARQL endpoint.</p></list-item>
          </list>
        </p>
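        <p>The import, cleaning and transformation steps can be approximated in a few lines of plain Python. This sketch is a stand-in for illustration only: it does not use a relational database or D2RQ, and the CSV columns (code, label, parent), the sample rows and the ex: prefix are hypothetical. The hierarchical links are written with xkos:specializes and xkos:generalizes, the XKOS properties for the generic-specific relation.</p>
        <preformat><![CDATA[
```python
import csv
import io

# Hypothetical tabular source: one row per classification item.
SAMPLE_CSV = """code,label,parent
A,Example section A,
01, Example division 01 ,A
"""


def rows_to_triples(text, base="ex:"):
    """Read CSV text, clean each field, and emit one concept per row."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        # Cleaning step: strip stray whitespace from every field.
        code = row["code"].strip()
        label = row["label"].strip()
        parent = row["parent"].strip()

        item = f"{base}{code}"
        triples.append((item, "rdf:type", "skos:Concept"))
        triples.append((item, "skos:notation", f'"{code}"'))
        triples.append((item, "skos:prefLabel", f'"{label}"'))
        if parent:
            # Link the item to its upper-level member in both directions.
            triples.append((item, "xkos:specializes", f"{base}{parent}"))
            triples.append((f"{base}{parent}", "xkos:generalizes", item))
    return triples
```
]]></preformat>
        <p>The resulting tuples could then be serialized and loaded into a triple store for publication; that last step depends on the chosen infrastructure and is not shown here.</p>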
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The paper describes a real project by Istat and AgID related to the publication of Official Classifications as LOD in the Italian Public Sector context. The technical contribution of the paper focuses on the modeling aspects of the classifications, making use of ontologies that are specific to the statistical domain, like XKOS, as well as more general-purpose ontologies, like PROV.</p>
      <p>Footnotes: (5) http://sistemaclassificazioni.istat.it/class/sistemaclassificazioni/index.php ; (6) http://spcdata.digitpa.gov.it/classificazioni.html</p>
      <p>Besides technical contributions, the paper also provides a relevant methodological contribution to foster data interoperability among different institutions.</p>
      <p>Indeed, data interoperability is made possible by a twofold effort:
        <list list-type="bullet">
          <list-item><p>shared formats and models, i.e., technological standards like the RDF framework and ontologies like XKOS;</p></list-item>
          <list-item><p>content harmonization, i.e., common domain-specific information concepts.</p></list-item>
        </list>
      </p>
      <p>We think that publishing OS classifications as LOD is a first important step towards "content" harmonization, which can considerably speed up cooperation among different institutions based on data sharing and exchange.</p>
      <p>4. Generic Statistical Information Model (GSIM) v. 1.1, http://www1.unece.org/stat/platform/display/gsim/Generic+Statistical+Information+Model</p>
      <p>5. INSEE official classification site, http://www.rdf.insee.fr/codes/index.html</p>
      <p>6. ISIC Rev. 4, http://unstats.un.org/unsd/cr/registry/isic-4.asp</p>
      <p>7. Linked Data initiative, http://linkeddata.org/</p>
      <p>8. LOD cloud, lod-cloud.net/</p>
      <p>9. NACE official classification, http://www.ec.europa.eu/eurostat/ramon/rdfdata/nacer2.rdf</p>
      <p>10. Neuchâtel Terminology Model Classification, http://www3.ssb.no/DOCS/Neuchatelversion2.1.pdf</p>
      <p>11. An overview of the Italian guidelines for semantic interoperability through linked open data, http://www.agid.gov.it/sites/default/files/documentazione trasparenza/semanticinteroperabilitylod en 3.pdf</p>
      <p>12. Provenance Ontology (PROV), http://www.w3.org/TR/prov-o/</p>
      <p>13. Simple Knowledge Organization System (SKOS), http://www.w3.org/2004/02/skos/</p>
      <p>14. Bizer, C.: D2R MAP - a database to RDF mapping language. In: WWW (Posters) (2003)</p>
      <p>15. Lodi, G., Maccioni, A., Tortorelli, F.: Linked open data in the Italian e-government interoperability framework. In: 6th International Conference on Methodologies, Technologies and Tools enabling e-Government (METTEG) (2012)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Classificazione delle attività economiche 2007 (Ateco)</article-title>
          , http://www3.istat.it/strumenti/definizioni/ateco/ateco.html?versione=2007.3
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>Extended Knowledge Organization System (XKOS)</article-title>
          , http://www.ddialliance.org/Specification/RDF/XKOS/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>G8 open data charter and technical annex</article-title>
          , https://www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-and-technical-annex
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>