GovLOD: Towards a Linked Open Data Portal

Octavian Rinciog and Vlad Posea
Politehnica University of Bucharest
{octavian.rinciog, vlad.posea}@cs.pub.ro

Abstract. Nowadays, governments and public agencies publish open data at an exponentially growing rate on dedicated portals. These sources provide semi-structured or unstructured data, because the focus is on publishing the data rather than on how they are later used. GovLOD is a platform that aims to transform the information found in these heterogeneous files into Linked Open Data using RDF triples.

Keywords: Linked Open Data, RDF, OCR, SPARQL

1 Introduction

The governments of most developed countries have implemented public open data portals. The data for these portals are supplied by governments and public institutions, helping citizens to find, download and use the information these institutions generate and hold. Such portals exist in countries like the USA, the UK and Germany. Romania started its open data portal in 2013; it now holds 441 datasets from 71 institutions, less than 2% of the content of the UK portal, for example.

The problem with these open data portals, and implicitly with the data they deliver, is that in most cases [2] the provided files have little or no structure and are therefore very hard to work with. The information found in these files can be classified as:

a) Semi-structured data: Most published files are semi-structured, such as CSV or XLS. These files do not have a formally defined structure, so users of public data must download each document they want to use and work out how the data are laid out in that file.

b) Unstructured data: Unfortunately, some published files have no structure at all, such as scanned PDF, DOC or ZIP files; these documents must be processed before they can actually be used.

Developers who work with these published data face two problems:

Heterogeneous data - The same information can be found in multiple formats on the same portal, published by different institutions. This heterogeneity of formats and structures hinders data consumption.

No data linking - There are no explicit links between the published datasets. For example, a user who wants all the information published about a city or about a particular industry must download every published file and inspect each one that could contain relevant information.

In the remainder of this article we present GovLOD, a platform that aims to change the paradigm of open data portals by publishing the data as Linked Open Data with a well-defined structure. We propose a semi-automatic solution for the ingestion, processing and publication of the information hidden in the files published on governmental open data portals. By publishing these data in Linked Open Data format, our platform also gives developers a very easy way to build applications without having to process the initial files.

2 Platform

Our platform consists of several layers, each with a clearly defined role. Some of these layers are automatic, for example the ingestion and processing layers; others, such as vocabulary selection, currently are not. The purpose of the platform is to transform open data files into Linked Open Data, according to the requirements from [3].

2.1 Architecture

Fig. 1. GovLOD Architecture

The platform architecture is shown in Figure 1 and follows this workflow: first, files are fetched from open data portals; second, these files are processed so that they have a well-defined structure; third, the information in the files is converted to RDF triples and stored in a semantic repository. Developers who want to use the public data can then access it through SPARQL queries, without having to process the initial files.

Ingestion layer - Most open data portals are built on the CKAN platform, which provides an API for retrieving information from the portal. Our system connects to this API and processes the information published since the last query. The process is automatic: the layer keeps track of all downloaded files and, on each run, downloads only the files added since the previous query.
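The following is a minimal sketch of such an incremental ingestion step, assuming the portal exposes the standard CKAN Action API (as data.gov.ro does). The cut-off date handling, local state file and file naming are illustrative choices for the example, not the platform's actual implementation.

```python
# Poll a CKAN portal for datasets modified since the last run and
# download any resource files not seen before. The PORTAL address is
# real; STATE_FILE and the naming scheme are illustrative.
import datetime
import json
import pathlib
import requests

PORTAL = "http://data.gov.ro"                # any CKAN-based portal
STATE_FILE = pathlib.Path("last_run.json")   # hypothetical local state

def fetch_modified_packages(since_iso):
    """Return packages whose metadata changed after `since_iso`."""
    params = {
        # Solr-style filter on CKAN's metadata_modified field
        "fq": f"metadata_modified:[{since_iso} TO *]",
        "rows": 100,
        "sort": "metadata_modified desc",
    }
    resp = requests.get(f"{PORTAL}/api/3/action/package_search",
                        params=params, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    assert body["success"]
    return body["result"]["results"]

def download_new_resources(packages, out_dir=pathlib.Path("raw")):
    """Download every resource file attached to the given packages."""
    out_dir.mkdir(exist_ok=True)
    for pkg in packages:
        for res in pkg.get("resources", []):
            if not res.get("url"):
                continue
            name = f"{res['id']}.{(res.get('format') or 'bin').lower()}"
            target = out_dir / name
            if target.exists():              # already ingested earlier
                continue
            target.write_bytes(requests.get(res["url"], timeout=60).content)

if __name__ == "__main__":
    since = "2016-01-01T00:00:00Z"           # first-run default
    if STATE_FILE.exists():
        since = json.loads(STATE_FILE.read_text())["last_run"]
    download_new_resources(fetch_modified_packages(since))
    now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    STATE_FILE.write_text(json.dumps({"last_run": now}))
```

Filtering on metadata_modified keeps each run incremental: only packages changed since the previous run are examined again.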
Processing layer - In the second step, the retrieved files pass through a processing stage whose purpose is to identify and define the structure of each file. Currently we have modules for processing PDF files and tabular files such as CSV or XLS. From PDF files, whether scanned or not, the platform extracts the tables and makes them available as structured data. This process relies on techniques that identify the table structure within the pages of the file, recognize the text by applying OCR to each table cell, and finally export the tables as CSV files. In this step the data are also cleaned and normalized by identifying incorrectly formatted numbers and strings.

Semantic layer - The data taken from the CSV files produced in the previous stage fall into two categories: a) statistical data and b) information about physical entities (e.g. monuments, hospitals, schools, churches and people). In this phase, the data from the files are converted to RDF triples. The URI schema is consistent, so each time an entity is recognized it is mapped to the same URI.

Files containing statistical data are converted using the RDF Data Cube Vocabulary (https://www.w3.org/TR/vocab-data-cube). We map every dimension of the data to one semantic dimension of the cube. Because all statistical open data must be located in time and space, every dataset shares these two common dimensions; the remaining dimensions are variable, and the user must identify the property to be mapped to each of them (a minimal sketch of such a mapping is given at the end of this section).

For data about physical entities, we must identify an ontology onto which each item can be mapped. At present this mapping is done manually, but in the future we plan to make it automatic or semi-automatic. If no existing ontology can be found for the given subject, the user can define new properties inside the platform. Also in this step, we augment the existing data using several external web services, such as geolocation for physical entities. We then link the created resources to existing resources from the Linked Open Data Cloud. Currently the system links our resources to existing resources in dbpedia.org by exact matching on a shared property. In the future we intend to integrate a framework for automatic link discovery, such as Silk [5].

Storage layer - The RDF triples produced in the previous step are stored in the Apache Marmotta platform, which contains both a SPARQL engine (our endpoint is available at http://opendata.cs.pub.ro/repo/sparql) and a reasoning engine. For the actual data storage, Apache Marmotta can use either an H2 or a PostgreSQL database; we use the latter, as it provides additional scalability.

App layer - As already mentioned, using the SPARQL engine from the previous layer, developers can implement applications simply by running SPARQL queries, without having to process the initial files. The major advantage of this layer is that the data structure is documented and well-defined: each dataset that was transformed into RDF triples has an attached wiki page (http://opendata.cs.pub.ro/wiki), which explains the properties of its RDF triples and gives a few sample SPARQL queries showing how to access the data; a similar query sketch is given below.
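To make the Data Cube mapping concrete, the sketch below converts one row of a hypothetical statistical CSV file (registered cars per county per year) into a qb:Observation carrying the two mandatory time and space dimensions. It uses the rdflib library; the URIs under opendata.cs.pub.ro/resource and the property names are assumptions made for the example, not the platform's actual URI schema.

```python
# Map one CSV row to an RDF Data Cube observation. The GOVLOD
# namespace and the registeredCars property are assumed for the
# example; only qb: and sdmx-dimension: are standard vocabularies.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
SDMX_DIM = Namespace("http://purl.org/linked-data/sdmx/2009/dimension#")
GOVLOD = Namespace("http://opendata.cs.pub.ro/resource/")  # assumed base

def row_to_observation(g, dataset_uri, row):
    """Turn {'year', 'county', 'cars'} into a qb:Observation."""
    obs = URIRef(f"{dataset_uri}/obs/{row['county']}-{row['year']}")
    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, QB.dataSet, URIRef(dataset_uri)))
    # The two dimensions every statistical dataset must share:
    g.add((obs, SDMX_DIM.refPeriod,
           Literal(row["year"], datatype=XSD.gYear)))
    g.add((obs, SDMX_DIM.refArea, GOVLOD[f"county/{row['county']}"]))
    # Dataset-specific measure, chosen by the user during mapping:
    g.add((obs, GOVLOD.registeredCars,
           Literal(int(row["cars"]), datatype=XSD.integer)))
    return obs

g = Graph()
g.bind("qb", QB)
row_to_observation(g, GOVLOD["dataset/registered-cars"],
                   {"year": "2015", "county": "Cluj", "cars": "12345"})
print(g.serialize(format="turtle"))
```

Because each observation URI is derived deterministically from the dataset and its dimension values, re-processing the same file yields the same URIs, which is what keeps the URI schema consistent.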
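As an illustration of the application layer, the following sketch runs a query against the platform's public SPARQL endpoint using the SPARQLWrapper library. The endpoint URL is the one given above; the gl:Hospital class and the gl:name and gl:county properties are assumed names, so a real query should use the vocabulary documented on the corresponding wiki page.

```python
# Query the platform's SPARQL endpoint directly instead of parsing
# the original files. The vocabulary in the query is assumed.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://opendata.cs.pub.ro/repo/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX gl: <http://opendata.cs.pub.ro/resource/>
    SELECT ?hospital ?name WHERE {
        ?hospital a gl:Hospital ;
                  gl:name ?name ;
                  gl:county <http://opendata.cs.pub.ro/resource/county/Cluj> .
    }
    LIMIT 10
""")

# Each binding maps a variable name to {"type": ..., "value": ...}.
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["hospital"]["value"], b["name"]["value"])
```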
3 Results

In order to test the platform, we used data taken from the Romanian national open data portal (http://data.gov.ro). As already mentioned, this portal publishes 441 datasets comprising 3,231 files. From these we selected 117 documents, divided into 23 datasets, covering the following areas: health (data about hospitals and pharmacies), education (schools), culture (museums, archaeological sites and churches) and statistical data (types of cars registered in the country). From these files, converted to RDF, our platform stored 165,279 resources and 2.4 million RDF triples.

The ease of using the data provided by this platform is evidenced by the Romanian Linked Open Data Map application (http://opendata.cs.pub.ro:3000), whose purpose is to display the museums, hospitals or pharmacies nearest to the user's location. This application was implemented using only data provided by our platform, accessed through SPARQL queries.

4 Conclusion

In this article we presented GovLOD, a solution that ingests and transforms files from national open data portals and publishes RDF triples into a semantic repository. The platform also provides a SPARQL engine that developers can use to implement applications without having to process the initial documents.

In recent years, a number of research projects have aimed to transform open data into Linked Open Data. For example, City Data Pipeline [1] creates a single model under which it aggregates all available data about a given city, and DataLift [4] is a tool that converts CSV files and relational databases taken from city open data portals into RDF triples. Our platform differs from these solutions in that it automatically ingests files containing open data and transforms tabular data extracted from scanned PDF files, as well as statistical data, into RDF triples.

References

1. Bischof, S., Polleres, A., et al.: City Data Pipeline. In: Proc. of I-SEMANTICS (2013)
2. Böhm, C., Freitag, M., Heise, A., et al.: GovWILD: Integrating open government data for transparency. In: Proceedings of the 21st International Conference on World Wide Web, pp. 321-324. ACM (2012)
3. Heath, T., Hausenblas, M., Bizer, C., Cyganiak, R., Hartig, O.: How to publish Linked Data on the web. Tutorial at the 7th International Semantic Web Conference, Karlsruhe, Germany (2008)
4. Scharffe, F., Atemezing, G., Troncy, R., et al.: Enabling linked-data publication with the DataLift platform. In: Proc. AAAI Workshop on Semantic Cities (2012)
5. Volz, J., Bizer, C., et al.: Discovering and maintaining links on the web of data. In: International Semantic Web Conference, pp. 650-665. Springer (2009)