The GeoKnow Generator Workbench – an Integrated Tool Supporting the Linked Data Lifecycle for Enterprise Usage

Andreas Both, R&D, Unister GmbH, Leipzig (Germany), andreas.both@unister.de
Alejandra Garcia-Rojas, Ontos A.G., Nidau (Switzerland), alejandra.garciarojas@ontos.com
Matthias Wauer, R&D, Unister GmbH, Leipzig (Germany), matthias.wauer@unister.de
Daniel Hladky, Ontos A.G., Nidau (Switzerland), daniel.hladky@ontos.com
Jens Lehmann, Universität Leipzig, AKSW, Leipzig (Germany), lehmann@informatik.uni-leipzig.de

ABSTRACT
Linked Data promises to make data integration easier for academic and industrial use. However, performing such data integration tasks currently requires high investments because of several major challenges. Available tools are not connected to each other, access restrictions on private data and certain tools have to be enforced, processes have to be manageable and easy to use, and, finally, data processing needs to be comprehensible in terms of provenance and traceability. The GeoKnow Generator Workbench solves these problems by providing an integrated Web interface on top of an extensible solution for easy access to tools dedicated to certain Linked Data lifecycle phases, also addressing major industrial requirements. While it focuses on geospatial aspects, it is generally applicable to Linked Data management tasks.

Categories and Subject Descriptors
H.3.5 [Online Information Services]: Web-based services; D.2.11 [Software Architectures]: Data abstraction, Domain-specific architectures

Keywords
linked data management, data processing, data publishing, data provenance, integrated workbench, geospatial information systems

1. INTRODUCTION
In the last decade, many open data sources have been published following the Linked Data (LD) principles [3]. Today, even some industrial applications are driven by LD (e.g., [7]). Many LD data sources include geospatial attributes. In general, geospatial data has a high relevance in everyday life and is crucial for decision making and search applications. Examples include using demographics and terrain data for strategic business planning, or improving search engines for questions like "find a good typical restaurant in Vienna next to the Danube river". Finding answers to such questions depends on appropriate preprocessing of a combination of geospatial and related information.

The Linked Data lifecycle¹ (cf. Figure 1) is a blueprint for extracting data from different types of sources, interlinking it with other datasets, enriching it, assuring its quality, and exploring and visualising it. Thus, it describes the processes needed to make LD usable. Based on the LD lifecycle, the GeoKnow Generator presented in this paper enables a seamless integrated workflow and comprehensive processing options, based on a variety of tools in a modular workbench Web application, meeting industrial requirements. Most of the integrated tools include specific functionality for working with geospatial data.

¹ See http://stack.linkeddata.org/.

The paper is organized as follows. We discuss primary requirements in Section 2. The GeoKnow Generator Workbench is presented in Section 3. Section 4 outlines applications based on the proposed solution. Related work is discussed in Section 5. Finally, the paper closes with conclusions and future work.

2. REQUIREMENTS
In this section, we describe the use cases and the requirements derived from them, which led to the creation of the GeoKnow Generator.

2.1 Use Cases
Tourism e-Commerce. In this use case by Unister², internal data have to be enriched with public geospatial data in order to improve online search applications. Thus, Unister can understand users' search motives and support queries beyond basic hotel features.

² http://www.unister.com/

Supply Chain. In order to visualize key information about the logistics in a supply chain, information from supply chain transactions has to be connected to related LD. As a result, the flow of material and accompanying information can be observed in real time, bottlenecks can be identified early, and media breaks in the information flows are minimised. This use case by a large automotive company incorporates traffic, weather, and transport information, which is linked to the supply chain information.

E-Government Services. The Linked Data Service³ (LINDAS) aims to provide information about authorities. Their services and software solutions are collected in a decentralised manner by the Swiss Confederation, the cantons, or the communes. The service gathers, homogenises, and publishes authority data using Semantic Web standards.

³ http://lindas-data.ch/

Automotive Data Investigation. Geosocial networks for sharing location-based messages, such as recommendations and notifications, benefit from providing context-related information. For services like community-based truck networks developed by Continental Automotive GmbH, relevant geospatial LD has to be filtered and selected, e.g., motorway service areas. Of future interest is further touristic information, such as museums and playgrounds, which is readily available in public data sets.

Figure 1: Linked Data Lifecycle (phases: Extraction; Storage/Querying; Manual revision/Authoring; Interlinking/Fusing; Classification/Enrichment; Quality Analysis; Evolution/Repair; Search/Browsing/Exploration)

2.2 Requirements
Concerning functional requirements, the primary functionality of tools for LD lifecycle phases has to be extended towards geospatial data, e.g., by implementing geospatial distance metrics for interlinking and fusing datasets, and appropriate quality metrics. In addition, non-functional requirements include:

• Scalability for working with large data sets
• Authentication, Authorization and Role Management as a primary requirement in companies
• Data Provenance tracking for traceability of changes
• Job Monitoring and Robustness for applicability in production
• Modularity and Composability in order to provide flexibility w.r.t. integrating additional tools

3. GEOKNOW GENERATOR
The GeoKnow Generator is a stack of tools for data preparation following the LD lifecycle. The GeoKnow Generator Workbench is the common entry point to all those tools. The actual architecture is presented in Figure 2. This diagram reflects the stack of tools integrated for each stage of the LD lifecycle. This architecture rests on the following three pillars:

• Software integration and deployment using the Debian packaging system. This infrastructure facilitates the packaging and integration as well as the maintenance of dependencies between the various components. Using the Debian system also enables deployment on individual servers or cloud infrastructures.
• Use of a central SPARQL endpoint and standardized vocabularies for knowledge base access and integration between the different tools. All components can access this central knowledge base repository and write their data back to it. In order for other tools to make sense of the information, it is important to define vocabularies for each of the stages of the LD lifecycle.
• Integration of the user interfaces based on REST-enabled Web applications. Currently the user interfaces of the various tools are technologically and methodologically heterogeneous. Thus, a common entry point for accessing the tools can forward a user to a specific UI component provided by a certain tool in order to complete a certain task. For tools that do not provide an interface, extra development effort is needed.

Figure 2: GeoKnow Generator Workbench

For integrating components, some JavaScript and basic RDF editing are required. Specifically, the AngularJS framework⁴ is used for straightforward creation of GUIs and application routing. A more detailed description of the GeoKnow Generator Workbench and how to integrate components can be found in the repository wiki⁵.

⁴ https://angularjs.org/
⁵ https://github.com/GeoKnow/GeoKnowGeneratorUI/wiki

Tool             Description
Sparqlify [2]    SPARQL-to-SQL rewriter; enables querying an RDBMS with SPARQL
TripleGeo [9]    Geospatial feature extraction from ESRI shapefiles, GML, KML, INSPIRE-aligned data, and several geospatially enabled DBMSs
DEER [11]        Data enrichment with implicit geospatial information through dereferencing, interlinking, and NLP
LIMES            Link discovery framework; supports 13 similarity measures, of which six are geospatial distance measures [8]
FAGI-gis [4]     Fusion of geospatial RDF data and metadata
Mappify [1]      Map view generator producing HTML/JavaScript snippets
Facete [12]      Web-based faceted browsing of geospatial RDF data
Coevolution      Service for managing dataset provenance and modifications
Virtuoso         Hybrid RDBMS/graph column store, scalable to cluster and cloud deployments

Table 1: Integrated LD Stack components

Table 1 describes the actual software tools integrated in the GeoKnow Generator Workbench. Besides this integration work, the main benefits of the GeoKnow Generator Workbench are the following features:

(1) Authentication and Role Management: Access to different components can be restricted via the Workbench using roles.
(2) Authorisation: Graph-based access control allows users to create and configure public and user-specific access rights to datasets. For components accessing private graphs, a Cross-Origin Resource Sharing (CORS) and proxy-based model is provided.
(3) Job Monitoring: For some of the software tools, which can have long runtimes on large-scale input, the user can execute batch jobs that are configured and observable in a dashboard (Figure 3b).
(4) Data Provenance: When working with several data sources and different processing stages, it is required to keep information about the provenance of certain triples. The Workbench adds metadata about the tools used to process these data, the timestamp, and the authors.
(5) Scalability: Storage scalability is supported by the Virtuoso Cluster edition. The Workbench and the integrated tools can easily be scaled out to different nodes.

All software tools used in the GeoKnow Generator Workbench and the GeoKnow Generator Workbench itself are available in the LD Stack⁶ repositories. The LD Stack is an independent project that aims to ease the distribution, installation, and integration of LD tools developed in different research projects. The GeoKnow project is an active contributor to and supporter of the LD Stack.

⁶ http://stack.linkeddata.org

4. APPLICATION IN USE CASES
In the Tourism e-Commerce use case, the GeoKnow Generator Workbench has been applied to generate an interlinked dataset used for a motive-based search infrastructure. External datasets have been transformed to RDF using TripleGeo and Sparqlify. The linking of external data and internal data was performed using LIMES, with substantial performance gains compared to a comparable custom approach. Besides integration of structured data, unstructured data such as hotel reviews can be processed using DEER. That way, related entities can be identified and integrated so that their attributes (such as locations) can be used for further analysis of places, providing useful information for a search engine.

In the Supply Chain use case, a Dashboard (see Figure 3a) offers a unified spatial view of the logistics in the supply chain. Companies can benefit from the Supply Chain Dashboard by gaining a better picture of the current state of the supply chain and the spatial distribution of goods and products in it. The required data integration and linking were enabled by the Sparqlify and LIMES components of the Workbench. The resulting information allows live visualisation of order and shipment status in the Dashboard. Circulated messages and a supplier score card provide live analytics of the supply chain based on user-defined metrics.

Continental products DropYa and TruckYa use GeoKnow technology in the Automotive Data Investigation use case. DropYa is a geosocial network where users can send and receive location-based messages sharing their experience and recommendations. TruckYa is a community-based tool for finding adequate parking spaces, aimed at truck drivers. In the investigation process of assessing LD sources, Facete provides the functionality to browse data on a map, view attributes of interest, export relevant parts, and support the editorial process.

The Ontos AG⁷ Linked Data Information Workbench (OntosLDIW) is a generic, enterprise-ready workbench built on the GeoKnow architecture and supporting the LD lifecycle. OntosLDIW was applied to a real-world e-government scenario for the State Secretariat for Economic Affairs (SECO) in Switzerland⁸. The developed Linked Data Service (LINDAS) has centralized the tasks of the data scientist in one common workbench, making it possible to orchestrate, monitor, and execute processes from one standardized UI. Thus, it reduces the effort of learning various tools and front ends, improves efficiency, and reduces costs.

⁷ http://ontos.com/
⁸ http://www.seco.admin.ch/?lang=en

Figure 3: Use Cases and Applications of GeoKnow Generator Workbench. (a) The supply chain dashboard. (b) Task monitoring dashboard.

As generalized feedback from these use case applications, an integrated workbench brings the benefit of orchestrating the process from a single point of view. It reduces the time required for learning and switching between tools, and it reduces the interface and data exchange effort through a single point of access and a common UI.

5. RELATED WORK
The LOD2 Statistical Workbench [5] provides an integrated set of tools from the LD Stack for official statistical production processes of governments. The workbench supports many different operations. This solution is suitable for a specific use case but lacks the general applicability of a more configurable approach. Unifiedviews [6] is an LD processing framework created under the EU project COSMODE⁹ using components from the LD Stack. This platform requires implementing Data Processing Units for each component in order to be integrated. Moreover, Unifiedviews does not provide support for authentication or authorisation features. Still, it represents a relevant reference point for the GeoKnow Generator Workbench. Shaon et al. [10] present a workbench for publishing geospatial linked data. In contrast to our work, the data processing is highly specialized and does not provide solutions for all steps of the LD lifecycle.

⁹ http://www.cosmode.eu/

6. CONCLUSIONS AND FUTURE WORK
The main contribution of this paper, presented in Section 3, is the GeoKnow Generator Workbench, which is a web-based user interface that integrates all components needed for processing data following the LD lifecycle. It enables simple access to and interaction with the different components needed for different tasks. Moreover, it provides APIs for integration into other systems and for exchanging the components currently available out of the box. Future tools can be integrated easily. An online demo and video tutorials of the Workbench are available at http://generator.geoknow.eu.

We described the GeoKnow Generator and the main features enabling enterprise use. All requirements, including those w.r.t. managing geospatial data, are derived from real-world use cases, which also demonstrate the usability of the Generator components and the Workbench in enterprise environments. In the future we will integrate additional tools and decouple the Workbench from Virtuoso.

Acknowledgments.
This work is part of the European Commission FP7 Project GeoKnow (GA No 318159).

7. REFERENCES
[1] Mappify: a tool to easily create interactive maps backed by semantic web technologies. http://mappify.aksw.org/.
[2] Sparqlify: a SPARQL-SQL rewriter. http://aksw.org/Projects/Sparqlify.html.
[3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, pages 205–227, 2009.
[4] G. Giannopoulos, D. Skoutas, T. Maroulis, N. Karagiannakis, and S. Athanasiou. FAGI: A framework for fusing geospatial RDF data. In On the Move to Meaningful Internet Systems: OTM 2014 Conferences, volume 8841 of Lecture Notes in Computer Science, pages 553–561. Springer, 2014.
[5] V. Janev, B. V. Nuffelen, V. Mijović, K. Kremer, M. Martin, U. Milošević, and S. Vraneš. Supporting the linked data publication process with the LOD2 statistical workbench. Semantic Web – Interoperability, Usability, Applicability, 2014.
[6] T. Knap, M. Kukhar, B. Machác, P. Skoda, J. Tomes, and J. Vojt. Unifiedviews: An ETL framework for sustainable RDF data processing. In The Semantic Web: ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers, pages 379–383, 2014.
[7] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets semantic web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.
[8] A.-C. N. Ngomo. Orchid - reduction-ratio-optimal computation of geo-spatial distances for link discovery. In International Semantic Web Conference, pages 395–410, 2013.
[9] K. Patroumpas, M. Alexakis, G. Giannopoulos, and S. Athanasiou. TripleGeo: an ETL tool for transforming geospatial data into RDF triples. 2014.
[10] A. Shaon, A. Woolf, R. Boczek, W. Rogers, and M. Jackson. An Open Source Linked Data Framework for Publishing Environmental Data under the UK Location Strategy, volume 798. CEUR Workshop Proceedings, 2011.
[11] M. Sherif, A.-C. Ngonga Ngomo, and J. Lehmann. Automating RDF dataset transformation and enrichment. In 12th Extended Semantic Web Conference, Portoroz, Slovenia, 31st May - 4th June 2015. Springer, 2015.
[12] C. Stadler, M. Martin, and S. Auer. Exploring the web of spatial data with Facete. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 175–178, 2014.