=Paper= {{Paper |id=Vol-1343/poster14 |storemode=property |title=UnifiedViews: Towards ETL Tool for Simple yet Powerfull RDF Data Management |pdfUrl=https://ceur-ws.org/Vol-1343/poster14.pdf |volume=Vol-1343 |dblpUrl=https://dblp.org/rec/conf/dateso/KnapSKN15 }} ==UnifiedViews: Towards ETL Tool for Simple yet Powerfull RDF Data Management== https://ceur-ws.org/Vol-1343/poster14.pdf
      UnifiedViews: Towards ETL Tool for Simple yet Powerful RDF Data Management*

               Tomáš Knap, Petr Škoda, Jakub Klímek, and Martin Nečaský

                 Charles University in Prague, Faculty of Mathematics and Physics
                   Malostranské nám. 25, 118 00 Praha 1, Czech Republic
                       {knap, skoda, klimek, necasky}@ksi.mff.cuni.cz


           Abstract. We present UnifiedViews, an Extract-Transform-Load (ETL) frame-
           work that allows users to define, execute, monitor, debug, schedule, and share
           ETL data processing tasks, which may employ custom plugins (data processing
           units, DPUs) created by users. UnifiedViews differs from other ETL frameworks
           by natively supporting RDF data and ontologies. In this paper, we introduce Uni-
           fiedViews and discuss planned features that aim to make the tool simple to use for
           non-RDF experts. We argue that UnifiedViews helps RDF/Linked Data publishers
           and consumers to address the problem of sustainable RDF data processing; we
           support this claim by listing projects and other activities in which UnifiedViews
           is successfully exploited.


        Keywords: RDF data processing, ETL, Linked Data


   1 Introduction and Basic Concepts of UnifiedViews
The advent of Linked Data [1] accelerates the evolution of the Web into a rapidly
growing information space, where the unprecedented volume of data offers information
consumers a level of information integration that was not previously possible.
    Suppose a consumer is building a data mart integrating information from various RDF
and non-RDF sources. The RDF/Linked Data community uses many tools¹ that may
support various phases of the data processing; e.g., a consumer may use any23² to
extract non-RDF data and convert it to RDF, the Virtuoso³ database to store RDF data
and execute SPARQL (Update) queries [2,3], Silk [6] for RDF data linkage, or
Cr-batch⁴ for RDF data fusion. Nevertheless, the consumer preparing a data processing
task that produces the desired data mart typically has to
(1) configure every such tool properly (using a different configuration for every tool),
(2) implement a script for downloading and unpacking certain source data, (3) write his
own script holding the set of SPARQL Update queries refining the data, (4) implement
 * This work was partially supported by a grant from the European Union’s 7th Framework
   Programme number 611358 provided for the project COMSODE
 1 http://semanticweb.org/wiki/Tools
 2 https://any23.apache.org/
 3 http://virtuoso.openlinksw.com/
 4 https://github.com/mifeet/cr-batch



M. Nečaský, J. Pokorný, P. Moravec (Eds.): Dateso 2015, pp. 111–120, CEUR-WS.org/Vol-1343.




            Fig. 1. UnifiedViews framework – Definition of a data processing task


custom transformers which, e.g., enrich processed data with the data in his knowledge
base, (5) write his own script executing the tools in the required order, so that every
tool has all desired inputs when being launched, (6) prepare a scheduling script, which
ensures that the task is executed regularly, and (7) extend his script with notification
capabilities, such as sending an email in case of an error during task execution.
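Steps (2)–(7) above usually end up as a hand-written driver script. A minimal Python sketch of such manual orchestration (all step names and actions are illustrative; real versions would shell out to any23, Virtuoso's isql, Silk, and so on) might look like:

```python
from typing import Callable

def run_pipeline(steps: list[tuple[str, Callable[[], None]]],
                 notify: Callable[[str], None]) -> bool:
    """Run each tool in the required order; abort and notify on failure."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # step (7): notification on error
            notify(f"step '{name}' failed: {exc}")
            return False
    return True

log: list[str] = []
steps = [
    ("download-and-unpack", lambda: log.append("fetched source data")),   # step (2)
    ("refine-with-sparql", lambda: log.append("applied SPARQL Update")),  # step (3)
    ("load-to-store", lambda: log.append("loaded into triple store")),    # step (5)
]
ok = run_pipeline(steps, notify=lambda msg: log.append("EMAIL: " + msg))
```

Every consumer ends up rewriting some variant of this scaffolding; UnifiedViews replaces it with a shared, debuggable framework.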
    Maintenance of such data processing tasks is challenging. Suppose, for example, that
a consumer defines tens of data processing tasks which should run every week, and that
a certain task does not work as expected. To find the problem, the consumer typically
has to browse/query the RDF data output by a certain tool; to do that, he has to
manually launch the required tool with the problematic configuration and load the
output RDF data into a store, such as Virtuoso, that supports browse/query capabilities.
Furthermore, when other consumers would like to prepare similar data processing tasks,
they cannot share the tools’ configurations already prepared by the consumer.
    The general problem RDF/Linked Data publishers and consumers are facing is that
they have to write most of the logic to define, execute, monitor, schedule, and share the
data processing tasks themselves. Furthermore, they do not get any support for
debugging the tasks. To address these problems, we developed UnifiedViews, an
Extract-Transform-Load (ETL) framework built around the central concept of a data
processing task, with native support for the RDF data format and ontologies.
    A data processing task (or simply task) consists of one or more data processing
units. A data processing unit (DPU) encapsulates certain business logic needed when
processing data (e.g., one DPU may extract data from a SPARQL endpoint or apply
a SPARQL query). Every DPU must define its required/optional inputs and produced
outputs. UnifiedViews supports the exchange of RDF data between DPUs. Every tool
produced by the RDF/Linked Data community can be used in UnifiedViews as a DPU, if a
simple wrapper is provided5 .
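The DPU contract can be sketched as follows (a hypothetical Python analogue; the actual UnifiedViews plugin API is Java-based and differs in detail):

```python
from abc import ABC, abstractmethod

# A triple is modelled here simply as a (subject, predicate, object) tuple.
Triple = tuple[str, str, str]

class DataProcessingUnit(ABC):
    """Sketch of a DPU: declared inputs/outputs plus the business logic."""
    inputs: list[str] = []    # names of required input data units
    outputs: list[str] = []   # names of produced output data units

    @abstractmethod
    def execute(self, data: dict[str, list[Triple]]) -> dict[str, list[Triple]]:
        """Consume the named inputs, produce the named outputs."""

class SparqlEndpointExtractor(DataProcessingUnit):
    outputs = ["rdf"]
    def execute(self, data):
        # a real extractor would query a SPARQL endpoint; this is a fixture
        return {"rdf": [("ex:Prague", "rdf:type", "ex:City")]}

result = SparqlEndpointExtractor().execute({})
```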
    UnifiedViews allows users to define and adjust data processing tasks using a graphi-
cal user interface (an excerpt is depicted in Figure 1). Every consumer may also define
their custom DPUs, or share DPUs provided by others together with their configura-
tions. DPUs may be dragged and dropped onto the canvas where the data processing task
is constructed. A data flow between two DPUs is denoted as an edge on the canvas (see
Figure 1); a label on the edge clarifies which outputs of a DPU are mapped to which
 5 https://grips.semantic-web.at/display/UDDOC/Creation+of+Plugins


inputs of another DPU. UnifiedViews natively supports the exchange of RDF data between
DPUs; apart from that, files and folders may also be exchanged between DPUs.
    UnifiedViews takes care of task scheduling; a user may configure UnifiedViews
to get notifications about errors in task executions, and may also get daily
summaries about the executed tasks. UnifiedViews ensures that DPUs are executed in the
proper order, so that every DPU has all required inputs when being launched. Uni-
fiedViews provides users with debugging capabilities – a user may browse and query
(using the SPARQL query language) the RDF inputs to and RDF outputs from any DPU.
UnifiedViews allows users to share DPUs and tasks as needed.
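The "proper order" guarantee amounts to a topological sort of the task graph. A sketch using Kahn's algorithm (the scheduling inside UnifiedViews is of course more involved; the DPU names here are illustrative):

```python
from collections import deque

def execution_order(edges: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm: order DPUs so every DPU runs after its inputs.
    edges maps a DPU to the set of DPUs it feeds data into."""
    nodes = set(edges) | {t for ts in edges.values() for t in ts}
    indeg = {n: 0 for n in nodes}
    for targets in edges.values():
        for t in targets:
            indeg[t] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for t in edges.get(n, ()):
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    if len(order) != len(nodes):
        raise ValueError("cycle in task definition")
    return order

# an extractor feeds a transformer, which feeds a loader
order = execution_order({"extract": {"transform"}, "transform": {"load"}})
```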
    The code of UnifiedViews is available under a combination of the GPLv3 and LGPLv3
licenses6 at https://github.com/UnifiedViews.

2 Related Work
There are plenty of ETL frameworks for preparing tabular data to be loaded into data
warehouses, some of them also open-source7 – for example, CloverETL (community
edition)8 . In all these frameworks custom DPUs may be created in some way, but the
disadvantage of these non-RDF ETL frameworks is that there is no support for the RDF
data format and ontologies in the framework itself. As a result, these non-RDF ETL
frameworks are, e.g., not prepared to suggest ontological terms in DPU configurations,
a feature important when preparing SPARQL queries or mappings of table columns
to RDF predicates. Furthermore, these frameworks have no native support for
exchanging RDF data between DPUs; the existing DPUs also do not support the RDF data
format or URIs for identifying things according to Linked Data principles. Therefore,
in the following, we discuss related work in the area of RDF ETL frameworks.
    ODCleanStore (Version 1)9 was the original Linked Data management framework,
which served as an inspiration for ODCleanStore (Version 2)10 , a student project
implemented at Charles University in Prague and defended in March 2014. Unified-
Views is based on ODCleanStore (Version 2). Linked Data Manager (LDM)11 is a Java-
based Linked (Open) Data management suite to schedule and monitor ETL
tasks for web-based Linked Open Data portals and data integration scenarios. LDM
was developed by the Semantic Web Company in Austria12 . The company recently decided
to replace LDM, used by their clients, with UnifiedViews and to continue maintaining
UnifiedViews together with Charles University in Prague, the Czech Linked Data com-
pany Semantica.cz s.r.o.13 , and the Slovak company EEA s.r.o.14 .
 6 http://www.gnu.org/licenses/gpl.txt, http://www.gnu.org/licenses/lgpl.txt
 7 http://sourceforge.net/directory/business-enterprise/enterprise/data-warehousing/etl/
 8 http://www.cloveretl.com/products/community-edition
 9 http://sourceforge.net/projects/odcleanstore/
10 https://github.com/mff-uk/ODCS/
11 https://github.com/lodms/lodms-core
12 http://www.semantic-web.at
13 http://semantica.cz/en/
14 http://eea.sk/


     DERI Pipes15 is an engine and graphical environment for general Web data trans-
formations. DERI Pipes supports the creation of custom DPUs; however, an adjustment of
the core is needed whenever a new DPU is added, which is not acceptable; in UnifiedViews,
it is possible to reload DPUs while the framework is running. DERI Pipes also does not
provide any solution for library version clashes; in UnifiedViews, on the other hand,
DPUs are loaded as OSGi bundles, so it is possible to use two DPUs requiring two
different versions of the same dependency (library) without clashes. In DERI Pipes,
it is not possible to debug inputs and outputs of DPUs.
     Linked Data Integration Framework (LDIF) [5] is an open-source Linked Data inte-
gration framework that can be used to transform Web data. The framework consists of a
predefined set of DPUs, whose behaviour may be influenced by their configuration; however,
new DPUs cannot be easily added16 . LDIF provides a user interface to monitor the results
of executed tasks; however, compared with UnifiedViews, LDIF does not provide any
graphical user interface for defining and scheduling tasks, managing DPUs, browsing
and querying the inputs to and outputs from the DPUs, or managing users and their roles
in the framework. LDIF also does not provide any possibility to share pipelines/DPUs
among users. On the other hand, LDIF provides the possibility to run tasks using Hadoop17 .


3 Impact of the UnifiedViews Framework

The goal of the OpenData.cz initiative18 is to extract, transform, and publish Czech open
data in the form of Linked Data, so that the initiative contributes to the Czech Linked
(Open) Data cloud. The UnifiedViews framework has been successfully used for this effort
since September 2013; so far we have published tens of datasets comprising hundreds of
millions of triples. Figure 2 depicts an excerpt of the datasets (blue circles) published
with UnifiedViews and the integration of these datasets (links are depicted by blue
arrows, pointing from the linking dataset to the linked dataset).
     Project INTLIB19 aims at extracting (1) references between legislation documents,
such as decisions and acts, (2) entities (e.g., a citizen, a president) defined by these
documents and (3) the rights and obligations of these extracted entities. UnifiedViews is
used in INTLIB to extract data from selected sources of legislation documents, convert
it to RDF data, and provide it as Linked Data.
     The COMSODE FP7 project20 aims to create a publication platform for pub-
lishing (linked) open data. UnifiedViews is used there as the core tool for converting
hundreds of original datasets to RDF/Linked Data.
     The UnifiedViews framework is being integrated into the stack of tools produced by the
LOD2 project21 . As a result, anybody using tools from the LOD2 stack, such as Virtuoso
15 http://pipes.deri.org/
16 http://ldif.wbsg.de/
17 http://hadoop.apache.org/
18 http://opendata.cz
19 http://www.isvav.cz/projectDetail.do?rowId=TA02010182
20 http://www.comsode.eu/
21 http://lod2.eu/




             Fig. 2. Datasets published by opendata.cz initiative – an excerpt


and Silk, also has the possibility to use UnifiedViews. UnifiedViews will also be used
in the recently started EU H2020 project YourDataStories22 .
    The UnifiedViews framework is intended to be used for commercial purposes by the
companies Semantica.cz s.r.o. (Czech Republic), EEA s.r.o. (Slovak Republic), Semantic
Web Company (Austria), and TenForce (Belgium), to help their customers prepare and
process RDF data.


4 Ongoing and Future Work Towards Simplicity of Use
In this section, we introduce ongoing and future work on UnifiedViews aimed at making
the tool simple to use for non-RDF experts. Each subsection below describes one planned
feature and contains its motivation, the goals to be achieved, and at least an outline of
how the goals will be realised. In the subsections below, we speak about a task designer
– a person who creates new or adjusts existing data processing tasks.

4.1 Automated Schema Alignment and Object Linkage
Motivation. Suppose that a data processing task is created to extract, every day, tabular
data provided by the Czech Hydrometeorological Institute about air pollution in var-
ious cities of the Czech Republic and publish it as Linked Data. Publishing data
as Linked Data involves (1) alignment of the schema used for the published data with
well-known schemas (RDF vocabularies) used in the Linked Data community, e.g., in
Linking Open Data cloud23 and, (2) linkage of the published objects with the objects
22 https://www.insight-centre.org/content/your-data-stories
23 http://lod-cloud.net/


already available in the Linked Data space, so that common objects in the datasets have
the same identifiers across the datasets. As a result, Linked Data applications work on
top of integrated datasets (thanks to (2)), which use common schema elements (thanks
to (1)).
    If the task designer is a Linked Data expert, he is able to manually integrate the
data, e.g., link the RDF representation of the cities introduced in the source tabular
data to the generally accepted representation of the cities – identifiers used by the LAU
codes dataset24 ; based on that, Linked Data applications may not only show pollution
in a particular city, but also, e.g., the level of carbon emissions in that city, demographic
statistics for its population, the number of child inhabitants, etc.
    Linked Data experts may also ensure that published data uses well-known RDF
vocabularies used in the Linked Open Data community to publish certain types of data; for
example, the task designer may ensure that instead of an automatically generated predicate
ex:firstName holding first names of persons, the predicate foaf:givenName is
used. Such adjustments of the RDF data related to schema alignment or object
linkage are not trivial and cannot be easily done by non-experts.


Goal to be achieved. UnifiedViews should simplify Linked Data publishing for non-
RDF experts by:

 1. Automatically discovering that certain columns in the processed tabular data represent
    certain types of data (e.g., cities of the Czech Republic) and automatically mapping
    values in such a column to URIs taken from the preferred dataset for the given type of
    data (e.g., from the dataset with LAU codes). As a result, all datasets use the same
    identifiers for the same types of data, which realises the data integration and avoids
    costly and ad-hoc application integration.
 2. Automatically suggesting mappings of the used RDF vocabulary terms (e.g., pred-
    icates) to well-known vocabulary terms, which increases the understandability of
    the data and its reuse by various applications.

    To realise 1), it is first necessary to identify that certain columns contain certain
types of values; such identification is always probabilistic and typically based on
comparing the name of the column with a list of names of RDF classes and/or
matching sample data from the considered column against known codelists, such as the
list of Czech cities; experiments are needed to decide on the particular algorithm for
identifying types among input data. The second step to realise 1) is to apply predefined
Silk [6] rules for the identified type of data within the column of the input tabular
data. To realise 2), various schema matching techniques have to be experimented with [4].
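The codelist-matching part of the identification step could be sketched as follows (the threshold, type name, and codelist below are illustrative assumptions; a production version would also compare column names with RDF class names):

```python
from typing import Optional

def detect_column_type(values: list[str],
                       codelists: dict[str, set[str]],
                       threshold: float = 0.8) -> Optional[str]:
    """Probabilistic column typing: if enough sample values match a known
    codelist, report that type; otherwise report nothing."""
    best_type, best_score = None, 0.0
    for type_name, codelist in codelists.items():
        hits = sum(1 for v in values if v.strip() in codelist)
        score = hits / len(values) if values else 0.0
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type if best_score >= threshold else None

# Toy codelist of Czech cities; real codelists would come from, e.g., LAU codes.
czech_cities = {"Praha", "Brno", "Ostrava", "Plzeň"}
col = ["Praha", "Brno", "Ostrava", "Unknown", "Plzeň"]
kind = detect_column_type(col, {"ex:CzechCity": czech_cities})
```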


4.2 Hiding SPARQL Queries

Motivation. A Linked Data expert is able to query RDF data using the SPARQL query
language [2,3], so for an expert it is enough to have one generic DPU which is able to
execute an arbitrary SPARQL query on top of the processed RDF data. Nevertheless, using
24 http://opendata.cz/linked-data


the SPARQL query language for rather typical and simple operations on the data, such as
renaming predicates or replacing a predicate’s value based on a regular expression, may
be considered too heavy-weight and difficult for RDF beginners, and tedious for
experienced users.


Goal to be achieved. UnifiedViews should simplify work with the SPARQL query lan-
guage by providing a set of DPUs for executing the rather typical and simple operations on
top of RDF data; such DPUs may be configured via a configuration dialog, and the SPARQL
query behind the scenes is completely hidden from the task designer.
    To realise the goal, a list of typical operations should be written down and the
corresponding DPUs should be prepared. Discussion with the users – task designers – is
crucial to focus on the most typically used operations on top of RDF data.
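For instance, a hypothetical "rename predicate" DPU could generate the hidden SPARQL Update from just two configured IRIs (the IRIs below are examples):

```python
def rename_predicate_query(old_iri: str, new_iri: str) -> str:
    """Build the SPARQL Update hidden behind a hypothetical
    'rename predicate' DPU configured with only two IRIs."""
    return (
        f"DELETE {{ ?s <{old_iri}> ?o }}\n"
        f"INSERT {{ ?s <{new_iri}> ?o }}\n"
        f"WHERE  {{ ?s <{old_iri}> ?o }}"
    )

q = rename_predicate_query("http://example.org/firstName",
                           "http://xmlns.com/foaf/0.1/givenName")
```

The task designer only picks the two predicates in a dialog; the query text never surfaces.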


4.3 Autocompleting Terms from Well-known Vocabularies

Motivation. In many cases, e.g., when aligning vocabulary terms as described in Sec-
tion 4.1 or when configuring a DPU hiding the complexity of the SPARQL query language as
described in Section 4.2, the task designer has to define certain vocabulary terms. Since
the number of well-known vocabularies is quite high, the task designer may easily use a
wrong vocabulary term or misspell the term.


Goal to be achieved. As the task designer is configuring DPUs, UnifiedViews should
suggest and autocomplete vocabulary terms from well-known Linked Data vocabular-
ies. The task designer should be provided not only with the suggested term, but also with
the description of the term, its formal definition, its recommended usage, etc.
    To realise this goal, data processing tasks should be prepared to regularly populate
an RDF database with the well-known Linked Data vocabularies – such a knowledge
base is then used for suggesting and autocompleting the vocabulary terms. Selected
components of the DPUs’ configurations, e.g., text fields, should by design support
suggestion of terms from well-known vocabularies – so any DPU developer may use
such a vocabulary-autocomplete-aware text field when defining the configuration dialog
for his DPU.
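Prefix-based suggestion over such a knowledge base reduces to a simple lookup (the vocabulary entries below are a toy excerpt; the real knowledge base would be populated from the vocabularies themselves):

```python
VOCAB = {  # tiny illustrative knowledge base of well-known terms
    "foaf:givenName": "The given name of a person (FOAF).",
    "foaf:name": "A name for some thing (FOAF).",
    "dcterms:title": "A name given to the resource (Dublin Core).",
}

def suggest(prefix: str, vocab: dict[str, str] = VOCAB) -> list[tuple[str, str]]:
    """Return (term, description) pairs whose term starts with the prefix,
    so the UI can show the description alongside each suggestion."""
    return sorted((t, d) for t, d in vocab.items() if t.startswith(prefix))

hits = suggest("foaf:")
```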


4.4 Sustainable RDF Data Processing

Motivation. As the task designer updates a task, the interconnections among and
configurations of the DPUs comprising that task are adjusted. As a result, the task
designer may introduce errors into the definition of the task, yielding erroneous or no
data produced by the task.


Goal to be achieved. UnifiedViews should address the problem of sustainable RDF
data processing by allowing the task designer to define for each DPU a set of SPARQL
queries which test that the output data produced by the given DPU satisfies certain
conditions. Such a set of SPARQL queries for testing the data output by the DPU plays


a similar role as standard JUnit tests – testing that a change to the DPU configuration did
not change the produced data of that DPU in an unexpected way. The task designer should
be supported with the autocomplete feature (described in Section 4.3) as he is specifying
the SPARQL unit tests.
    To realise this goal, every DPU detail should be extended with the possibility to
define a set of SPARQL ASK queries realising the unit testing.
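The idea can be sketched in Python, with ASK queries stood in for by boolean checks over a DPU's output triples (the check names and predicates are illustrative):

```python
from typing import Callable

Triple = tuple[str, str, str]
Check = Callable[[list[Triple]], bool]

def ask_has_predicate(triples: list[Triple], predicate: str) -> bool:
    """Analogue of SPARQL ASK { ?s <predicate> ?o }."""
    return any(p == predicate for _, p, _ in triples)

def run_unit_tests(triples: list[Triple], checks: dict[str, Check]) -> list[str]:
    """Run every check on the DPU's output; return names of failed checks."""
    return [name for name, check in checks.items() if not check(triples)]

output = [("ex:alice", "foaf:givenName", "Alice")]
failed = run_unit_tests(output, {
    "has given name": lambda t: ask_has_predicate(t, "foaf:givenName"),
    "has family name": lambda t: ask_has_predicate(t, "foaf:familyName"),
})
```

A failed check flags the task designer's change, much like a red JUnit bar.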


4.5 Wizards for Simple Definition of Data Processing Tasks

Motivation. Defining data processing tasks typically requires detailed knowledge of
the DPUs that are available in the deployed UnifiedViews instance; the task designer has to
know which DPUs are suitable for the task at hand and how they should be configured and
interconnected with other DPUs.


Goal to be achieved. It should be possible to define simple tasks without knowl-
edge of the DPUs and their configurations. UnifiedViews will contain so-called wizards,
which provide task designers with step-by-step guides for defining new data processing
tasks – at least for typical types of data processing tasks, e.g., extracting tabular data and
publishing it as Linked Data, or extracting data from relational databases and publishing
it as Linked Data.
     To realise the goal, a list of typical types of data processing tasks should be written
down and wizards should be prepared for such tasks. Discussion with the users – task
designers – is crucial to focus on the most typical types of tasks. The idea of incorpo-
rating wizards into the existing UnifiedViews frontend is as follows: when a task designer
creates a new pipeline, he may either manually define the task or start the wizard, which
will guide him through the process of task preparation; the task designer may then manually
fine-tune the definition of the task.


4.6 Assessing Quality of Produced Data, Recommendation of Cleansing DPUs

Motivation. As the goal of a task designer is to produce high-quality data, the task de-
signer should be informed about any problems in the data, e.g., w.r.t. syntactic/semantic
accuracy of the produced Linked Data or completeness of the published dataset. Fur-
thermore, if such problems can be corrected, they should be corrected.


Goal to be achieved. UnifiedViews should provide a set of DPUs assessing the quality
of the produced data and a set of DPUs able to cleanse the problems in the data.
UnifiedViews should also automatically recommend cleansing DPUs for data process-
ing tasks based on the problems revealed in the data.
    To realise the goal, a list of quality assessment and cleansing DPUs should be imple-
mented, inspired by the list of data quality dimensions and metrics relevant for
Linked Data [7]. The recommendation of cleansing DPUs should be based on the types
of quality assessment DPUs which reported problems.
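As a toy example, a completeness metric and a rule mapping failed metrics to recommended cleansing DPUs (the DPU name, threshold, and data below are purely illustrative):

```python
def completeness(records: list[dict], required: list[str]) -> float:
    """Share of records that carry every required property."""
    if not records:
        return 1.0
    complete = sum(1 for r in records if all(k in r for k in required))
    return complete / len(records)

CLEANSERS = {  # hypothetical mapping: failed quality metric -> cleansing DPU
    "completeness": "FillMissingValuesDPU",
}

def recommend(metrics: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Recommend a cleansing DPU for every metric falling below the threshold."""
    return [CLEANSERS[m] for m, v in metrics.items()
            if v < threshold and m in CLEANSERS]

data = [{"name": "Praha", "pm10": 21}, {"name": "Brno"}]  # second record incomplete
recs = recommend({"completeness": completeness(data, ["name", "pm10"])})
```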


4.7 Evolution of DPUs
Motivation. DPUs may evolve over time, and different tasks may use different versions of
the same DPU. When the version of a DPU is updated, the configurations used in the tasks
must also be updated without the need for the task designer to reconfigure the DPU.

Goal to be achieved. UnifiedViews must be able to cope with changing versions of
the DPUs; each new version of a DPU may bring changes to the DPU’s configuration.
UnifiedViews must be able to automatically convert an outdated configuration, so that it
may be used in the latest version of the DPU.
    To realise the goal, the interface of the DPUs should be extended so that DPU devel-
opers may provide a migration method converting a previous configuration to the current
DPU configuration. As a result, if an outdated version of the configuration is encountered
and should be updated to the current version, a sequence of these migration methods
may be automatically executed by UnifiedViews (if the methods are properly pro-
vided by the DPU developer).
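Chaining such per-version migration methods could look like this sketch (the configuration keys and version numbers are invented for illustration):

```python
def _v1_to_v2(cfg: dict) -> dict:
    cfg["endpoint"] = cfg.pop("url")   # version 2 renamed the key
    return cfg

def _v2_to_v3(cfg: dict) -> dict:
    cfg.setdefault("timeout_s", 30)    # version 3 added a key with a default
    return cfg

MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}  # version N -> method producing N+1
LATEST = 3

def migrate(cfg: dict, version: int) -> dict:
    """Apply migration methods in sequence until the latest version is reached."""
    cfg = dict(cfg)  # never mutate the stored configuration
    while version < LATEST:
        cfg = MIGRATIONS[version](cfg)
        version += 1
    return cfg

new_cfg = migrate({"url": "http://example.org/sparql"}, version=1)
```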


5 Conclusions
We presented UnifiedViews, an ETL framework with native support for processing
RDF data. The framework allows users to define, execute, monitor, debug, schedule, and
share data processing tasks. UnifiedViews also allows users to create custom plugins –
data processing units.
    We discussed future intended features of the tool w.r.t. simplicity of use for non-RDF
experts or those not familiar with all the Linked Data vocabularies, datasets, etc. For
each intended feature we discussed its motivation, goals, and realisation.
    We are convinced that UnifiedViews is a mature tool which addresses the major
problem of RDF/Linked Data consumers – the problem of sustainable RDF data pro-
cessing; we support this claim by listing projects where UnifiedViews is successfully
used and by mentioning its commercial exploitations.


References
1. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. International Journal
   on Semantic Web and Information Systems, 5(3):1–22, 2009.
2. S. H. Garlik, A. Seaborne, and E. Prud’hommeaux. SPARQL 1.1 Query Language. W3C
   Recommendation, 2013. http://www.w3.org/TR/2013/REC-sparql11-query-20130321/,
   Retrieved 20/03/2014.
3. P. Gearon, A. Passant, and A. Polleres. SPARQL 1.1 Update. Technical report, W3C, 2013.
   Published online on March 21st, 2013 at http://www.w3.org/TR/2013/REC-sparql11-update-20130321/,
   Retrieved 20/03/2014.
4. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The
   VLDB Journal, 10(4):334–350, Dec. 2001.
5. A. Schultz, A. Matteini, R. Isele, C. Bizer, and C. Becker. LDIF: Linked Data Integration
   Framework. In Proceedings of the Second International Workshop on Consuming Linked Data
   (COLD), Bonn, Germany, 2011. CEUR-WS.org.


6. J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk - A Link Discovery Framework for
   the Web of Data. In Proceedings of the WWW2009 Workshop on Linked Data on the Web
   (LDOW), Madrid, Spain, 2009. CEUR-WS.org.
7. A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment
   for linked data: A survey. Semantic Web Journal, 2015.