Towards Sustainable view-based Extract-Transform-Load (ETL) Fusion of Open Data

Kay Müller, Claus Stadler, Ritesh Kumar Singh, Sebastian Hellmann
AKSW/KILT, Leipzig University & DBpedia Association
{kay.mueller,cstadler,ritesh.kumar.singh,hellmann}@informatik.uni-leipzig.de
http://aksw.org/Groups/KILT.html

Abstract. Openly available datasets originate from different data providers, ranging from government agencies over commercial enterprises to communities of data enthusiasts. Integrating different source datasets into a single RDF graph by using ETL (Extract-Transform-Load) systems, which perform offline transformation, ontology matching and linking, usually takes many iterations of revision until the target dataset is free of the most obvious mapping, linking and consistency errors. Since ETL systems produce the RDF offline, any mapping or content change requires a re-ingest of the relevant source data. When dealing with heterogeneous source datasets, creating a unified target dataset can therefore be a tedious undertaking. This paper proposes an RDF-view-based ingestion approach which allows real-time “debugging” of the unified dataset: mappings and links can be changed with immediate effect. Once the unified graph passes all data quality tests, the RDF can be materialized. This process offers an alternative to existing ETL solutions.

Keywords: Linked Data, ETL, RDF View

1 Introduction and Related Work

Traditionally, ETL (Extract-Transform-Load) systems are used to convert different source datasets into a structured target dataset. These systems are commonly used in the area of data warehousing. As the name suggests, these frameworks follow a three-step workflow: 1) extract the required information from the source datasets, 2) transform portions of the source data into a target model using schema mappings, normalization and deduplication, and 3) load the data into a store, possibly with versioning support.

In the Semantic Web community, tools such as LDIF (http://ldif.wbsg.de/), OpenRefine (http://openrefine.org/) and UnifiedViews (https://www.semantic-web.at/unifiedviews) have been developed to allow Linked Data developers to create new Linked Data content. Since many Linked Data publishers do not always follow the existing best practices and (quasi-)standards of the Linked Data community, linking tasks become more complicated and the overall data quality of the source datasets decreases. The LOD Laundromat [1] was created in response to this dilemma: the system crawls the Linked Data Cloud (http://lod-cloud.net/) and publishes corresponding standards-compliant datasets which can be downloaded from its website.

Since data quality can vary between datasets, using an ETL system to convert the source data into the target format may result in mapping and content errors which are only discovered once all the data has been converted. Very often this requires a re-ingest of the source data through the whole ETL pipeline. These trial-and-error attempts can be very time- and resource-consuming. When integrating different company datasets into a knowledge graph using the standard ETL approach, we realised that the pipeline had to be rerun many times due to conversion, linking or mapping errors: once the ETL pipeline had finished, we examined its output for errors, fixed them in the pipeline and reran it. This process was repeated until a satisfactory ETL output was generated.
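To make the cost of this trial-and-error cycle concrete, the following deliberately simplified Python sketch caricatures an offline ETL pipeline. Every name (extract, transform, quality_errors) and the toy data are hypothetical stand-ins for the stages of a real framework, not an actual API:

    # Hypothetical sketch of the offline ETL cycle described above; each
    # function is a stand-in for a real pipeline stage.

    def extract(sources):
        """Pull all records from every source dataset."""
        return [record for source in sources for record in source]

    def transform(records, mapping):
        """Apply a (toy) schema mapping: rename keys to target-ontology terms."""
        return [{mapping.get(key, key): value for key, value in r.items()}
                for r in records]

    def quality_errors(records):
        """Stand-in for data quality tests: every record must carry a 'label'."""
        return [r for r in records if "label" not in r]

    sources = [[{"name": "Leipzig"}], [{"label": "Dresden"}]]
    mapping = {}

    # The trial-and-error loop: each mapping fix forces a complete re-ingest,
    # because the materialized output can only be inspected after a full run.
    while True:
        dataset = transform(extract(sources), mapping)
        errors = quality_errors(dataset)
        if not errors:
            break
        mapping["name"] = "label"  # fix the error in the pipeline, then rerun

    print(dataset)  # [{'label': 'Leipzig'}, {'label': 'Dresden'}]

The design proposed below aims to remove exactly this full re-run from the inner loop.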
Hence the idea was born to create an ETL system which can perform online dataset updates. As shown in [2], RDF views can present a feasible alternative to standard ETL approaches in the context of data quality management. Based on the concept of RDF views, this paper proposes a novel architecture which makes it possible to “debug” knowledge graphs. The term “debugging” is used in a broad sense here: the effects of changes to the linking, mapping or data normalization steps of the pipeline can be examined immediately, without having to wait for a complete re-ingest, hence allowing an immediate “debugging/evaluation” of changes and their impact on the target dataset.

2 Design

In order to ease the quality assessment of heterogeneous datasets, we propose a view-based RDF transformation design. This design bears the advantage that many data- and transformation-related changes can be reviewed promptly instead of potentially having to re-ingest the data. The proposed design is shown in Figure 1.

Fig. 1. RDF Views Design Diagram

In order to support the handling of heterogeneous datasets, we identified the following main features:

– RDF views: Since our proposed architecture uses RDF-based views, we suggest converting the source data directly to RDF without cleaning the input data. As RDF-2-RDF mapping languages, SPARQL CONSTRUCT queries (https://www.w3.org/TR/rdf-sparql-query/#construct) with minor syntax extensions for naming views and for quad support could be exploited. Another option is to adapt RML (http://semweb.mmlab.be/rml/spec.html) for this purpose. Compared to SQL-based views, the RDF-based approach has the advantage that SPARQL queries of one layer can be transformed into SPARQL queries of another layer without the limitations of SPARQL-2-SQL conversions.
– Functional indexes for transformation functions: The input data can rarely be expected to fit the target ontology and all defined requirements. In order to speed up access to data normalized via views, the proposed system supports functional indexes similar to those of traditional RDBMSs (see, for example, Postgres expression indexes: http://www.postgresql.org/docs/9.5/static/indexes-expressional.html).
– Real-time updates: Due to the virtualization of the RDF and Normalization Layers, changes applied in these layers have immediate effect on the virtualized RDF views. This feature allows ontology engineers to follow an iterative process when creating a new dataset, rather than re-ingesting the data and, to save time, checking only parts of the resulting dataset.
– Unified querying and ETL: All relevant source data is expected to have been mapped to appropriate target ontologies. Using SPARQL, it is possible to retrieve portions of the data on-the-fly as well as to export all data at once.

Our design consists of six layers: The Source Data Layer is used to ingest the original source data via the Source Mapping Layer into the RDF backend, using mappings which alter the source data as little as possible. The Intermediate RDF Layer provides direct access to the non-normalized RDF data. The Normalization Layer expresses normalizations by means of RDF2RDF views, possibly supported by functional indexes and caches; these indexes and caches are supposed to improve the query performance against the normalized RDF data. A minimal sketch of such a normalization view is given below.
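To make the view mechanism concrete, here is a minimal sketch using the rdflib Python library. The source vocabulary (src:companyName) and the target mapping (rdfs:label) are illustrative assumptions rather than mappings from the actual system, and the view-naming syntax extensions mentioned above are omitted:

    from rdflib import Graph

    # Non-normalized source RDF, as it would reside in the Intermediate
    # RDF Layer; the src: vocabulary is a made-up example.
    source = Graph()
    source.parse(data="""
        @prefix src: <http://example.org/source/> .
        src:company1 src:companyName "ACME Corp" .
        src:company2 src:companyName "Globex" .
    """, format="turtle")

    # An RDF2RDF normalization view expressed as a SPARQL CONSTRUCT query:
    # it maps the source predicate onto the target ontology (here simply
    # rdfs:label). Editing this query changes the view immediately; no
    # re-ingest of the source data is needed.
    NORMALIZATION_VIEW = """
        PREFIX src:  <http://example.org/source/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        CONSTRUCT { ?s rdfs:label ?name }
        WHERE     { ?s src:companyName ?name }
    """

    # Materializing the view: evaluate the query and collect the resulting
    # triples into a new graph, which could then be bulk-loaded.
    normalized = Graph()
    for triple in source.query(NORMALIZATION_VIEW):
        normalized.add(triple)

    print(normalized.serialize(format="turtle"))

In the proposed architecture such a view would by default stay virtual: queries against the Normalization Layer would be rewritten against it, and the constructed triples would only be materialized for a final “release” export.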
Within the Normalization Layer, there are two kinds of views: term-based views for crafting RDF terms, especially IRIs, and triple-based views for relating these terms to each other. If required, the Enrichment Layer can be used to generate additional information about existing entities. Finally, the Query Federation Layer can be used to integrate other (possibly virtual) SPARQL endpoints. Note that changes made in one layer are automatically propagated to the upper layers.

3 Discussion and Future Work

The authors are aware that RDF views come with performance implications [2]. The situation is comparable to a debug executable, which runs an order of magnitude slower than a release build but offers many features that support the debugging process. In the same way, the proposed design allows unique knowledge graph “debugging/evaluation” capabilities. For some use cases the performance of this design might be sufficient, so that no full RDF export is required. If a “release” version of the dataset is required, SPARQL queries can be used to retrieve portions of the data on-the-fly as well as to export all data at once. This “release” data can then be loaded into a graph backend.

This generic architecture can be used to add different data sources, such as databases or REST APIs, to the knowledge graph via the Query Federation Layer. Since a lot of source data is only accessible through REST APIs or databases, this opens new integration opportunities. In addition, it would be possible to add a Meta Data Layer, which could store provenance, confidence and other meta-information relevant for relations and entities, as well as a Fusion Layer, which would use owl:sameAs relationships and entity fusion algorithms to show a fused view of all entities across the imported source datasets.

Despite a possible performance loss caused by RDF views, the advantages of such a design outweigh the disadvantages, especially since it can be combined with other existing solutions at each layer. Hence we believe that this novel design will allow the creation and curation of knowledge graphs in a new, iterative way.

Acknowledgments. This work was supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the SmartDataWeb project (GA-01MD15010B, http://smartdataweb.de/).

References

1. W. Beek, L. Rietveld, H. R. Bazoobandi, J. Wielemaker, and S. Schlobach. LOD Laundromat: A uniform way of publishing other people's dirty data. In ISWC, 2014.
2. N. Konstantinou, D. E. Spanos, and N. Mitrou. Transient and persistent RDF views over relational databases in the context of digital repositories. Communications in Computer and Information Science, 390 CCIS:342–354, 2013.