Proceedings of the I-SEMANTICS 2012 Posters & Demonstrations Track, pp. 26-30, 2012. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

Linked Open Data Infrastructure for Public Sector Information: Example from Serbia

Valentina Janev¹, Uroš Milošević¹, Mirko Spasić¹, Jelena Milojković², Sanja Vraneš¹

¹ Mihailo Pupin Institute, University of Belgrade, Belgrade, Serbia
{valentina.janev, uros.milosevic, mirko.spasic, sanja.vranes}@pupin.rs
² Statistical Office of the Republic of Serbia, Belgrade, Serbia
jelena.milojkovic@stat.gov.rs

Abstract. To improve transparency and public service delivery, national, regional and local governmental bodies need to consider new strategies for opening up their data. We approach the problem of creating a more scalable and interoperable Open Government Data ecosystem by considering the latest advances in Linked Open Data. More precisely, we showcase how an integrated and coherent collection of aligned state-of-the-art software tools, the LOD2 Stack, can be used to deliver trusted, open and rich collections of interlinked datasets to the public. The usage of the tool stack is demonstrated on the case of one of the largest data providers in the Republic of Serbia, its Statistical Office.

Keywords. linked open data, open government data, infrastructure, tools, public sector, Serbia

1 Introduction

In order to improve efficiency in the provision of public services, increase transparency and interaction with citizens and society as a whole, and also create new businesses and job opportunities, both local and national governments need to find better strategies for delivering large amounts of trusted data to the public. The fact that the European Commission is investing considerable funds to overcome this problem is a strong indicator of its significance.
As a direct example, consider the ISA (Interoperability Solutions for European Public Administrations) program for the period from 2010 to 2015, which has been assigned a budget of 164.1 million euros¹. The program enables “the delivery of electronic public services and ensures the availability, interoperability, re-use and sharing of common solutions”². To make government data truly open (for use and re-use) and to increase transparency, the data needs to be published in a non-proprietary, machine-readable format (e.g. RDF, http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210).

¹ European Commission ISA Webpage, http://ec.europa.eu/isa/
² European Commission ISA Webpage, http://ec.europa.eu/isa/faq/faq_en.htm

In this paper, we show why Linked Data is considered a promising approach to the above problem, and how the LOD2 Stack, a powerful set of software tools and components, can be used to lower the cost of addressing the challenges of publishing and integrating Open Government Data (OGD). Section 2 evaluates the tools used in the National Statistical Office use case workflow (see Fig. 1). Section 3 discusses the results achieved in integrating Serbian public data into the LOD cloud, with special attention to the case of one of the largest data providers in the Republic of Serbia, its Statistical Office (SORS).

1.1 LOD2: The Project and the OGD Use Case

In the last few years the Linked Data paradigm has evolved as a powerful enabler for the transition of the current document-oriented Web into a Web of interlinked Data and, ultimately, into the Semantic Web.
Aiming to speed up this process, the partners of the LOD2 project ("Creating knowledge out of interlinked data", http://lod2.eu) have delivered the LOD2 Stack, “an integrated collection of aligned state of the art software components that enable corporations, organizations and individuals to employ Linked Data technologies with minimal initial investments” [1]. One of the LOD2 objectives is to showcase the wide applicability of the LOD2 Stack for building public services for ordinary citizens of the European Union. As partners of the LOD2 project, the Mihailo Pupin Institute’s team established the Serbian CKAN³, the first catalogue of this kind in the Western Balkan countries, with the goal of making it an essential tool for supporting business ventures based on open data in this region. The RDF datasets cataloged with the Serbian CKAN (rs.ckan.net) are periodically harvested and synchronized at an international level with the PublicData.eu portal⁴ and integrated into the LOD cloud.

2 Evaluation of LOD Tools and Technologies

The LOD2 Stack was evaluated for its ability to let governments and governmental agencies publish their data based on open standards. Requirements identified for the National Statistical Office scenario [2] were grouped into the following types: Data extraction and transformation, Domain-specific modeling, Data enrichment and interlinking, Data storage, Exploration and analysis, and Data and Service administration. Table 1 shows how the LOD2 Stack responds to these requirements.

Vocabularies suitable for modeling statistical data in RDF format are the Data Cube vocabulary [3], which is fully compatible with the cube model that underlies SDMX⁵, and VoID (Vocabulary of Interlinked Datasets, http://www.w3.org/TR/void/), an RDF-based schema used to describe linked datasets.

³ CKAN is a data catalogue system used by various institutions and communities to manage open data.
⁴ PublicData.eu has been developed as a part of the LOD2 project.
⁵ SDMX (Statistical Data and Metadata eXchange), http://code.google.com/p/publishing-statistical-data/wiki/Documentation.

Table 1. Overview of LOD2 Stack capabilities
Data Extraction and Transformation: Where direct access to the central database is enabled, the D2R server and the D2RQ mapping language can be used to expose the content in RDF format (e.g. through a SPARQL endpoint). Otherwise, for data provided in Excel or XML format, OntoWiki's stat2RDF extension or the LOD2 XSLT processor can be used.
Domain-specific Modeling: The PoolParty Thesaurus Manager (PPT, http://lod2.poolparty.biz), a tool for enterprise metadata management and linked data publishing, is based on the standard SKOS vocabulary and can be combined with text mining and linked data technologies. Additionally, knowledge models developed with PoolParty can be edited and enhanced with the OntoWiki (http://ontowiki.net/) authoring tool.
Data Enrichment and Interlinking: These features are very important as a pre-processing step in the integration and analysis of statistical data from multiple sources. LOD2 tools such as Silk (http://www4.wiwiss.fu-berlin.de/bizer/silk) and LIMES (http://aksw.org/Projects/LIMES) facilitate mapping between knowledge bases, while GRefine can be used to enrich the data with descriptions from DBpedia or to reconcile it with other information in the LOD cloud.
Data Storage: The LOD Cloud Cluster knowledge store for the LOD2 Project (http://lod.openlinksw.com), hosting more than 50 billion triples, consists of a clustered Virtuoso instance hosted on 8 server nodes at the Sindice Data Centre at DERI (NUIG) [4].
Exploration and Analysis: The LOD2 Stack offers tools such as SparQLed, Sindice’s assisted SPARQL editor (http://sindicetech.com/sindice-suite/sparqled/), and the RDF Data Cube visualization component CubeViz (https://github.com/AKSW/cubeviz.ontowiki), which are of special importance for statistical data analysis and visualization.

3 Linked Open Data Example from Serbia

In an attempt to adopt the LOD2 Stack for the Statistical Office of the Republic of Serbia, over 100 datasets were extracted from the central statistics database (http://webrzs.stat.gov.rs/WebSite/public/ReportView.aspx), transformed into RDF, stored as RDF dump files on a local server (http://elpo.stat.gov.rs/lod2/) and registered with the Serbian CKAN. The data includes statistics from the Prices, National accounts, Usage of Information and Communication Technologies, and Science, Technology and Innovation domains (see [2] for more details). The activities performed can be summarized as follows.

Metadata Management. The statistics published by National Statistical Offices or Eurostat are organized by theme and presented in aggregate form using a wide range of standard metadata (code lists). In the SORS use case, a knowledge model was built in which standard code lists were modeled using the SKOS vocabulary [2]. The model (http://lod2.poolparty.biz/) currently incorporates 12 concept schemes, including NACE (revision 1 and revision 2), COICOP and SITC (revision 4), as well as other schemes used in SORS statistical publications, such as geographical, time and statistical area code lists. In order to formalize the conceptualization of the National accounts domain, for instance, ESA 95 (European System of Accounts, http://circa.europa.eu/irc/dsis/nfaccount/info/data/ESA95/en/titelen.htm) was used.
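As a rough illustration of the code-list modelling described above, the following Python sketch serialises a small code list as a SKOS concept scheme in N-Triples. The namespace http://example.org/codelist/, the function names and the single sample entry are assumptions made for this example only; they are not taken from the SORS knowledge model, which was built with PoolParty.

```python
# Minimal sketch: a code list expressed as a SKOS concept scheme (N-Triples).
# BASE is a hypothetical namespace, not the one used by SORS.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/codelist/"


def iri(value):
    """Wrap an absolute IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang=None):
    """Build an N-Triples literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def code_list_to_ntriples(scheme_id, title, codes, lang="en"):
    """Return N-Triples lines for a scheme and its (notation, label) codes."""
    scheme = iri(BASE + scheme_id)
    lines = [
        f"{scheme} {iri(RDF_TYPE)} {iri(SKOS + 'ConceptScheme')} .",
        f"{scheme} {iri(SKOS + 'prefLabel')} {literal(title, lang)} .",
    ]
    for notation, label in codes:
        concept = iri(f"{BASE}{scheme_id}/{notation}")
        lines += [
            f"{concept} {iri(RDF_TYPE)} {iri(SKOS + 'Concept')} .",
            f"{concept} {iri(SKOS + 'inScheme')} {scheme} .",
            f"{concept} {iri(SKOS + 'notation')} {literal(notation)} .",
            f"{concept} {iri(SKOS + 'prefLabel')} {literal(label, lang)} .",
            f"{scheme} {iri(SKOS + 'hasTopConcept')} {concept} .",
        ]
    return lines


if __name__ == "__main__":
    # Illustrative entry only; a real code list would be loaded from a file.
    sample = [("01", "Food and non-alcoholic beverages")]
    print("\n".join(code_list_to_ntriples("coicop", "COICOP", sample)))
```

Each code becomes a skos:Concept that carries its notation and label and is attached to the scheme, which is the kind of structure that interlinking tools can then match against external code lists.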
In governmental organizations, the metadata management activity is carried out by users with administration permissions (depicted in Fig. 1). Using Silk and LODRefine (http://code.zemanta.com/sparkica/), some of the code lists were interlinked with DBpedia and Eurostat code lists.

Fig. 1. Using LOD2 tools for publishing and consuming statistical data

The Serbian CKAN. The Serbian CKAN portal is deployed on a server with the following characteristics: Intel® Xeon® CPU 5140, dual core @ 2.33 GHz, 8 GB RAM, Ubuntu 11.04 with kernel version 2.6.38-12. The CKAN software was fully translated into Serbian, with support for two character sets (Latin and Cyrillic). Furthermore, a large number of dataset relationships have been defined, making browsing and navigation in CKAN more comfortable. The Serbian CKAN is currently maintained by the Mihailo Pupin Institute’s team.

The SORS LOD Cloud. The SORS statistical data in XML form was passed as input to the XSLT processor and transformed into RDF using the aforementioned vocabularies (RDF Data Cube, SDMX-RDF, SKOS, Dublin Core Terms, VoID) and the developed concept schemes. The VoID definition of the SORS LOD dataset is given in Fig. 2. The SORS dataset (87,968 triples, see http://stats.lod2.eu/serbia) was also uploaded to the LOD Cloud Cluster knowledge store under the graph name http://elpo.stat.gov.rs/lod2/.

Fig. 2. VoID description of the SORS LOD

4 Conclusion and Outlook

This paper contributes to the understanding of the LOD2 tools and technologies and discusses their use for publishing and consuming public sector information through the SORS use case. The main lessons learnt from this study are:
• The Data Cube RDF vocabulary is mature enough to be used for publishing statistical data, as it improves interoperability and allows comparison of data from different statistical sources.
• The LOD2 Stack provides a wide range of data transformation, enrichment and exploitation tools. However, advanced tools for analysis and visualization of statistical data are still under development.
• For publishers who currently only offer static files, Linked Data offers a flexible, non-proprietary, machine-readable means of publication that supports an out-of-the-box web API for programmatic access.
• The Serbian CKAN increases the visibility and accessibility of Serbian public sector data.

We conclude that the adoption of LOD2 tools and technologies leads to the establishment of an interoperable Open Government Data ecosystem. Future work will include an analysis of the LOD2 Stack components for building custom applications for different LOD stakeholders.

Acknowledgements. The research presented in this paper is partly financed by the European Union (FP7 LOD2 project, Pr. No: 257943), and partly by the Ministry of Science and Technological Development of the Republic of Serbia (SOFIA project, Pr. No: TR-32010). The Linked Open Data example was realized through close cooperation with the Statistical Office of the Republic of Serbia.

References

1. Auer, S., Martin, M., Frischmuth, P., Deblieck, B.: Facilitating the publication of Open Governmental Data with the LOD2 Stack. Share-PSI workshop, Brussels. Retrieved from http://share-psi.eu/papers/LOD2.pdf (2011).
2. Vraneš, S., Janev, V., Spasić, M., Milošević, U.: Establishment of the Serbian CKAN. LOD2 Deliverable 9.5.1, Institute Mihajlo Pupin (2012).
3. Cyganiak, R., Reynolds, D., Tennison, J.: The RDF Data Cube vocabulary (July 14, 2010).
4. Williams, H., Boncz, P., Tummarello, G., Auer, S.: 50 Billion plus Triple LOD Cloud Hosted on the LOD2 Knowledge Store Cluster. LOD2 Deliverable 2.1.3 (2012).