Introduction

ODIN: A Dataspace Management System

Odin is a system that supports the incremental pay-as-you-go integration of data sources into dataspaces and provides user-friendly querying mechanisms on top of them. We describe its main characteristics and underlying assumptions, including the user interactions required. Odin's novelty lies in a largely automated bottom-up approach (i.e., driven by the sources at hand) that includes the user in the loop for disambiguation purposes. The on-site demonstration will feature an ongoing project with the World Health Organization (WHO). Online demo and videos: www.essi.upc.edu/dtim/odin/

A prominent approach to virtual data integration is that of exposing an ontology, which conceptualizes the domain of interest, to offer a uniform query interface over the sources. Queries over the ontology are rewritten over the sources via schema mappings. The maintenance of such constructs (i.e., evolving the ontology, or adding new sources and mappings) is well known to be an arduous and manually intensive task that hinders the ability of such systems to flexibly adapt and provide right-time integration. This limitation has been termed the data variety challenge, which refers to the complexity of providing on-demand integration over a vast and evolving set of data sources.

Dataspaces represent a major step towards tackling the variety challenge. With the vision of reducing the usual upfront and maintenance costs, dataspaces advocate the adoption of a flexible and dynamic pay-as-you-go approach where different integration tasks are automated [1]. Supporting the end-to-end lifecycle of dataspaces is a technically challenging task. The state of the art on automatic construction of an ontology from the data sources (and their respective mappings), commonly known as bootstrapping, is BootOX [2]. Targeted at ontology-based data integration, BootOX generates OWL 2 QL ontologies from relational databases, together with R2RML mappings to the sources.
Yet, this approach falls short in settings where managing data variety is a key requirement. On the one hand, the extraction is restricted to relational databases and misses widely used semi-structured data formats such as CSV, JSON or XML. On the other hand, such mappings conform to the global-as-view (GaV) family, which characterizes the ontological classes and properties in terms of SQL queries; such mappings are not well suited for highly dynamic and evolving settings.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This work was partly supported by the GENESIS project, funded by the Spanish Ministerio de Ciencia e Innovación under grant TIN2016-79269-R.

Odin (short for On-demand Data INtegration), a dataspace management system grounded in knowledge graphs, was conceived to overcome the aforementioned challenges. Fig. 1 depicts how Odin supports the complete dataspace lifecycle. Odin automatically extracts the schemata from structured (e.g., relational) and semi-structured (e.g., JSON) data sources and translates them into a canonical data model, namely RDFS. To this end, a set of production rules parses the source metadata and automatically generates RDFS-compliant source graphs. Next, the source graphs are aligned, taking user feedback into account throughout the process. As a result, Odin generates provenance graphs (PGs) that trace the results of the previous stages. A PG is a target-agnostic metadata construct (i.e., not tailored to a specific tool) describing the integration of a particular set of data sources. It captures the results of bootstrapping the sources and aligning their schemata, and guarantees that target-specific metadata can be generated from it (footnote 1). Thus, PGs are used to generate the specific constructs of a given integration tool. In this demo, Odin generates the constructs required by [3]. Precisely, these are conjunctive query (CQ)-oriented graphs, which expose the source schemata in first normal form and are linked via local-as-view (LaV) schema mappings (represented as graphs) to the global graph. LaV mappings characterize the sources in terms of a query over the ontology, which makes them inherently more suitable in data variety settings. This entails a more complex query answering process, which boils down to the problem of answering queries using views. In our demo, however, we will show the feasibility of our approach in real cases.
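The schema extraction step described above can be illustrated with a minimal sketch. It is not Odin's production-rule engine: the function name `bootstrap_json`, the `undata` prefix, the `root` class name, the type table and the sample document are all assumptions made for this example. It only shows the general idea of deriving RDFS-style triples (classes, properties, domains and ranges) from the structure of a JSON object.

```python
import json

RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Assumed mapping from Python types (as produced by json.loads) to XSD datatypes.
XSD_TYPES = {str: "xsd:string", bool: "xsd:boolean",
             int: "xsd:integer", float: "xsd:double"}


def bootstrap_json(source_name, document):
    """Derive RDFS-style triples describing the schema of one JSON object."""
    triples = set()

    def visit(class_name, obj):
        class_iri = f"{source_name}:{class_name}"
        triples.add((class_iri, RDF_TYPE, RDFS_CLASS))
        for key, value in obj.items():
            prop_iri = f"{source_name}:{class_name}.{key}"
            triples.add((prop_iri, RDF_TYPE, RDF_PROPERTY))
            triples.add((prop_iri, RDFS_DOMAIN, class_iri))
            # Simplification: a non-empty array is described by its first element.
            sample = value[0] if isinstance(value, list) and value else value
            if isinstance(sample, dict):
                # Nested objects become classes of their own.
                triples.add((prop_iri, RDFS_RANGE, f"{source_name}:{key}"))
                visit(key, sample)
            else:
                datatype = XSD_TYPES.get(type(sample), "rdfs:Literal")
                triples.add((prop_iri, RDFS_RANGE, datatype))

    visit("root", document)
    return triples


# Hypothetical sample document; the field names and values are illustrative only.
doc = json.loads(
    '{"country": "Spain", "year": 2017, "indicator": {"code": "HE1", "value": 8.9}}'
)
source_graph = bootstrap_json("undata", doc)
for triple in sorted(source_graph):
    print(triple)
```

A real bootstrapping component would also have to handle heterogeneous arrays, optional keys and name clashes between nested objects, which this sketch ignores.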

[Fig. 1: the dataspace lifecycle supported by Odin. Diagram labels: Schema Extraction, Source Graph, Alignment, Feedback (Data Analyst), Provenance Graph, Target-oriented merging & consolidation, Global Graph, LaV mappings, CQ-oriented Graph, Wrapper (1NF), Query, Iterative Processing (Data Analyst), Provenance Metadata, Dataspaces Metadata, Metadata Flow.]

1 Although the focus of this paper is on answering queries, during the demo we will highlight the ability to generate from a PG the metadata required by other ontological reasoning services (e.g., DL-Lite for satisfiability checking).

Generation of RDFS schemata from sources. Odin adopts a meta-modeling approach to bootstrap disparate sources and create an RDFS representation of their schemata. For each source data model, Odin defines an equivalent first-order logic representation of its meta-model. Given a source model (e.g., relational or JSON), a set of predefined production rules (i.e., tuple-generating dependencies defined at the meta-model level) generates an equivalent RDFS model (footnote 2). Source graphs are the result of this bootstrapping phase.

User-driven source graph alignment. From the source graphs, Odin incrementally generates the PG, in which it annotates source graph alignments in the form of taxonomies. To discover alignments, Odin uses an enhanced version of LogMap (footnote 3) that considers WordNet synonyms. Candidate alignments are ranked, and Odin prompts the user to accept or reject them. Further, since aligning two ontologies is a hard task, Odin also provides an intuitive interface to manually assert alignments.
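The rank-then-confirm loop for alignments can be sketched as follows. This is a toy stand-in, not LogMap: string similarity comes from Python's `difflib`, the `SYNONYMS` table is a hand-written substitute for WordNet lookups, and the labels, the threshold and the `decide` callback (which plays the role of the user) are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {frozenset({"country", "nation"}), frozenset({"drug", "medicine"})}


def similarity(a, b):
    """Score two labels in [0, 1]; equal labels and known synonyms score 1.0."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def candidate_alignments(labels_a, labels_b, threshold=0.6):
    """Return (score, label_a, label_b) candidates, best score first."""
    scored = [(similarity(a, b), a, b) for a in labels_a for b in labels_b]
    return sorted((c for c in scored if c[0] >= threshold), reverse=True)


def review(candidates, decide):
    """Keep only the candidates that the decide(a, b, score) callback accepts."""
    return [(a, b) for score, a, b in candidates if decide(a, b, score)]


# Labels taken from two hypothetical source graphs.
candidates = candidate_alignments(
    ["country", "year", "treatment"], ["nation", "yr", "diagnosis"]
)
# Simulated user who only accepts near-certain candidates.
accepted = review(candidates, lambda a, b, score: score >= 0.9)
print(candidates)
print(accepted)
```

Separating candidate generation from the review callback mirrors the interaction described above: the matcher proposes and ranks, while the final decision stays with the user.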

Querying the sources via the ontology. This final step consists of generating the metadata constructs required to pose and resolve queries over the dataspace. To this end, from PGs, Odin automatically generates the global graph (i.e., a merged view of the aligned source graphs) and CQ-oriented graphs, which expose a first-normal-form structure of the sources. To guarantee the incremental evolution of the system, Odin also generates LaV mappings from CQ-oriented graphs to the global graph. Since PGs are created bottom-up, we are able to automate the definition of all required constructs. Consequently, given that Odin explicitly models the schema that sources expose, LaV mappings are exact and are not required to deal with incompleteness in the sources. Finally, Odin provides a user-friendly interface to pose conjunctive queries (CQs) on the global graph, which are automatically translated to SPARQL. A rewriting algorithm interprets such a query and generates the certain answers under the closed-world assumption in terms of unions of CQs [3]. The demo will show that such constructs are automatically generated in linear time (w.r.t. the size of the PG).
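To make the translation of a conjunctive query into SPARQL concrete, here is a minimal sketch. It is neither Odin's translator nor the rewriting algorithm of [3]; it only renders the atoms of a CQ as a SPARQL basic graph pattern. The function `cq_to_sparql`, the `g:` prefix, its IRI and the class and property names are hypothetical. `DISTINCT` is used because CQ answers are sets, whereas a plain SPARQL `SELECT` may return duplicates.

```python
def cq_to_sparql(head_vars, atoms, prefixes):
    """Render a conjunctive query as a SPARQL SELECT over a basic graph pattern.

    head_vars: projected variables, e.g. ["?name"]
    atoms:     (subject, predicate, object) triples; variables start with "?"
    prefixes:  mapping from prefix name to namespace IRI
    """
    body_vars = {term for atom in atoms for term in atom if term.startswith("?")}
    missing = [v for v in head_vars if v not in body_vars]
    if missing:
        # A CQ is safe only if every head variable also occurs in its body.
        raise ValueError(f"head variables not in body: {missing}")
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    lines.append("SELECT DISTINCT " + " ".join(head_vars))
    lines.append("WHERE {")
    lines.extend(f"  {s} {p} {o} ." for s, p, o in atoms)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical global-graph vocabulary, for illustration only.
prefixes = {"g": "http://example.org/global/"}
atoms = [
    ("?c", "a", "g:Country"),
    ("?c", "g:name", "?name"),
    ("?c", "g:hasTreatment", "?t"),
]
sparql = cq_to_sparql(["?name", "?t"], atoms, prefixes)
print(sparql)
```

The resulting query is posed over the global graph only; resolving it against the sources still requires rewriting it through the LaV mappings.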

Demo

We will present the functionalities of Odin via the WHO Information System to Control and Eliminate Neglected Tropical Diseases (WISCENTD; see footnote 4). The goal of WISCENTD is to support the collection, integration and analysis of data coming from different monitoring systems that survey different aspects of neglected tropical diseases (NTDs). Data related to NTDs are largely fragmented, and their integration is mandatory to shed light on NTDs around the world. The demo will simulate the day-to-day work of a WHO data analyst and show how Odin is used to first collect and integrate different sources relevant to a certain NTD, and later cross-query them. We will use relevant datasets, such as UN Data (open-data JSON datasets) about health economics indicators and migrant information per country (footnote 5), data about diagnosis and treatment per country periodically extracted from WIDP (footnote 6), which hosts a relational database, data about drug distribution periodically extracted from WIMEDS (footnote 7) as CSVs, etc. We will first showcase how the data analyst, just by interacting with Odin's interface, is able to integrate and query such sources in a friendly manner. Odin allows interested users to browse the metadata generated throughout the whole process: (i) source bootstrapping, (ii) source alignment to construct the PG (Fig. 2), and (iii) the automatic creation of the constructs for query answering. The audience will be encouraged to participate by including new sources incrementally, querying the global graph, or even applying Odin to other domains.

Implementation details. Odin follows a service-oriented architecture, which enables extensibility and separation of concerns. The frontend is implemented in JavaScript and resides in a Node.js web server. Odin uses WebVOWL to visualize and interact with graphs. The backend is implemented as a set of REST APIs defined using Jersey for Java. To deal with RDF graphs, this component makes heavy use of Jena and its persistence engine, Jena TDB.

2 http://essi.upc.edu/dtim/ardi
3 https://github.com/ernestojimenezruiz/logmap-matcher
4 https://www.who.int/neglected_diseases/disease_management/wiscentds/en
5 http://data.un.org
6 http://bit.ly/whowidp
7 http://bit.ly/whowimeds

1. Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Record 34(4), 27–33 (2005)

2. Jiménez-Ruiz, E., Kharlamov, E., Zheleznyakov, D., Horrocks, I., Pinkel, C., Skjæveland, M.G., Thorstensen, E., Mora, J.: BootOX: Practical Mapping of RDBs to OWL 2. In: ISWC 2015. pp. 113–132 (2015)

3. Nadal, S., Romero, O., Abelló, A., Vassiliadis, P., Vansummeren, S.: An integration-oriented ontology to govern evolution in big data ecosystems. Inf. Syst. 79, 3–19 (2019)