Reusable transformations of Data Cube Vocabulary datasets from the fiscal domain

Jindřich Mynarz1, Jakub Klímek123, Marek Dudáš1, Petr Škoda12, Christiane Engels4, Fathoni A. Musyaffa5, and Vojtěch Svátek1

1 University of Economics, Prague, Nám. W. Churchilla 4, 130 67 Praha 3, Czech Republic
jindrich.mynarz|marek.dudas|svatek@vse.cz
2 Charles University in Prague, Faculty of Mathematics and Physics, Malostranské nám. 25, 118 00 Praha 1, Czech Republic
klimek|skoda@ksi.mff.cuni.cz
3 Czech Technical University in Prague, Faculty of Information Technology, Thákurova 9, 160 00 Praha 6, Czech Republic
jakub.klimek@fit.cvut.cz
4 Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany
christiane.engels@iais.fraunhofer.de
5 University of Bonn, Institute for Computer Sciences, Römerstraße 164, 53117 Bonn, Germany
musyaffa@cs.uni-bonn.de

Abstract. Shared data models provide leverage for reusable data transformations. Common modelling patterns and data structures can make data transformations applicable to diverse datasets. Like data models, reusable data transformations promote separation of concerns, prevent duplication of effort, and reduce the time spent processing data. However, unlike data models, which can be shared as RDF vocabularies or ontologies, there is no well-established way of sharing data transformations. We propose a way to share data transformations as 'pipeline fragments' for LinkedPipes ETL (LP-ETL), an RDF-based data processing tool focused on RDF data. We describe the features of LP-ETL that enable the development of reusable transformations as pipeline fragments. Pipeline fragments are represented in RDF as JSON-LD files that can be shared directly or via dereferenceable IRIs. We demonstrate the use of pipeline fragments on data transformations for fiscal data described by the Data Cube Vocabulary (DCV). We cover both generic transformations for any DCV-compliant data, such as DCV validation or DCV to CSV conversion, and transformations specific to the fiscal data used in the OpenBudgets.eu (OBEU) project, including the conversion of the Fiscal Data Package to RDF and the normalization of monetary values. The applicability of these transformations is shown on concrete use cases serving the goals of the OBEU project.

1 Introduction

Many data processing tasks share the same requirements for data transformations. For example, generic cleaning and enrichment tasks apply to a variety of datasets. The commonality of applicable data transformations can be based on the common data models used to describe the transformed datasets. In such cases, common data models provide affordances for common data transformations. However, while data models can be shared as RDF vocabularies or ontologies, there is no well-established way to share data transformations.

It is desirable that common data transformations can be separated into reusable components. Componentization into modular transformations endowed with a single responsibility promotes separation of concerns, avoids duplication of code, and makes data transformations more maintainable. It reduces the development time needed for data transformations, thus decreasing the effort spent on pre-processing data. In this paper we describe a way to share data transformations as 'pipeline fragments' for LinkedPipes ETL (LP-ETL) [5]. The pipeline fragments are represented in RDF and can be shared either as files or via dereferenceable IRIs.
We show several reusable transformations built on top of the Data Cube Vocabulary (DCV) [1] that can be shared as pipeline fragments. We cover both generic transformations applicable to any DCV data and domain-specific transformations for fiscal data, comprising both budget and spending data, that are motivated by the use cases in the OpenBudgets.eu (OBEU)6 project. More details about the presented transformations can be found in an OBEU deliverable on data optimisation, enrichment, and preparation for analysis [4].

6 http://openbudgets.eu

OBEU is an EU-funded H2020 project devoted to advancing analyses of fiscal open data. In order to support the goals of the project we developed the OBEU data model, which is built on DCV. The data model is based on reusable component properties (i.e. instances of qb:ComponentProperty) that can be composed into dataset-specific data structure definitions (DSDs). We chose to adopt DCV because it provides the OBEU data model with a uniform way to represent heterogeneous fiscal datasets, which makes them better combinable and comparable. Moreover, DCV simplifies combining fiscal data described using the data model with other statistical datasets, such as macroeconomic indicators, which can put the data into a broader context useful for data analyses.

We start this paper with a description of the features of LP-ETL that enable building reusable data transformations. Subsequently, we cover reusable transformations built for DCV data for the purposes of the OBEU project, and we conclude with a discussion of how these transformations are used in the project. Before we proceed, we review related work in technologies for reusable transformations of RDF data.

1.1 Related work

While there are many ETL tools, only a few allow reusing partial transformations of RDF data. Thanks to its declarative nature, SPARQL 1.1 Update [2] operations can be used to formalize reusable transformations of RDF. SPARQL allows bundling several update operations, separated by semicolons, in a single request, so that it can encapsulate more complex transformations requiring multiple steps, e.g., creating, populating, and deleting a temporary named graph. However, SPARQL Update is restricted to transformations of RDF data, and as such it cannot incorporate tasks involving other data formats, e.g., decompressing a ZIP file. Moreover, SPARQL Update operations must be static. If they contain dynamic parts, another mechanism is needed to generate them.

Besides the standard SPARQL, there are bespoke tools for ETL of RDF data that allow reusing transformations. DataGraft [7] provides reusable components and transformations focusing on tabular data. UnifiedViews [6] features templates7 that enable pre-configuring individual data processing units, e.g., with a specific regular expression. If we extend our scope beyond RDF, we can find Pentaho, a widely used ETL tool, which features mapping steps that can encapsulate reusable transformations.8 The mapping steps, also known as sub-transformations, use placeholders for the inputs they expect to be provided by their parent transformation and for the outputs they provide back to their parent transformation to read. The mapping steps can be stored in separate files and imported into other transformations.

7 https://grips.semantic-web.at/display/UDDOC/3.+DPU+Templates+Section
8 http://wiki.pentaho.com/display/EAI/Mapping

However, none of these tools satisfies all requirements for easy reuse of data transformations. DataGraft focuses only on tabular data, while other formats, such as XML, JSON, and RDF, also need to be processed.
UnifiedViews is not very user-friendly and has issues with sharing pipelines between multiple versions of the software. Pentaho does not support RDF transformations. This is why we chose LP-ETL, described in the next section, as the most suitable candidate for our use case.

2 LinkedPipes ETL

LP-ETL is a data processing tool focused on producing RDF data from various data sources. A data processing task in LP-ETL is defined as a pipeline. A pipeline is a repeatable process that consists of configurable components, each responsible for an atomic data transformation task, such as a SPARQL query or a CSV to RDF transformation. ETL (extract, transform, load) denotes a data processing task in which data is first extracted, then transformed, and finally loaded into a database or a file system. LP-ETL consists of a back-end, which runs the data transformations and exposes APIs, and a front-end, which is a responsive web application that provides a pipeline editor and an execution monitor to the user. One of the distinguishing features of LP-ETL, when compared to other ETL tools, is the fact that the pipelines and all component configurations are themselves stored as RDF, which enables novel data processing workflows. Consequently, the use of the tool requires knowledge of RDF, SPARQL, and related technologies. In addition, its use for statistical data processing requires knowledge of DCV and the Simple Knowledge Organization System (SKOS)9 vocabulary.

9 https://www.w3.org/TR/skos-reference/

2.1 Runtime configuration

All component configurations in LP-ETL are stored as RDF. The configurations range from simple, such as a SPARQL query to be executed or a URL to download, to complex, as is the case of the component that transforms tabular data to RDF and contains a full mapping of the CSV columns to RDF. A key feature of LP-ETL is the possibility to provide a configuration to a component at runtime. In this way, a pipeline can create configurations for its components based on its input data. For example, this feature can be used to download multiple files specified in a list. First, a download component downloads the list and passes it to another component, which parses the list and transforms it into a runtime configuration for a second download component that subsequently downloads the files in the list. A more complex use of this feature is generating a SPARQL query based on the pipeline's input data and then executing it.

2.2 Pipeline fragments

Since LP-ETL pipelines themselves are stored using RDF as JSON-LD files, they can be easily shared on the Web. When exporting a pipeline from LP-ETL, the user can either download the pipeline as is or download it without potentially sensitive information, such as user names and passwords for uploading to a server, so that it can be shared safely. The exported pipeline can then be shared publicly, such as by making it available at a public URL.

Users of a shared pipeline can import it into their LP-ETL instance either from a JSON-LD file or from an IRI that dereferences to the pipeline's representation in JSON-LD. The shared pipelines can be incomplete and may need to be provided with input or configuration to work. Incomplete pipelines constitute the so-called pipeline fragments, which can be imported into existing pipelines to reuse common transformations.
Instead of referencing pipeline fragments directly, they are reused as copies to enable local modifications, so they are not automatically updated if the 'master' copy changes. In particular, this feature is useful for sharing transformations that require no or only minor adjustments to serve their purpose, such as providing a URL of the input file. Pipeline fragments can enforce the contract of their interface, i.e. necessary configuration or requisite input, by using the SPARQL ASK component, which allows checking whether the contract's preconditions are satisfied by querying the fragment's input and asserting the expected boolean result. If the expectations are not met, the component makes the pipeline execution fail.

2.3 Sending input data to pipelines using an HTTP API

Execution of LP-ETL pipelines can be triggered by sending an HTTP POST request containing input data for the pipelines. In this case, a component that receives the posted input data is placed at the start of a pipeline. Any file can be provided as input data. This feature allows pipeline designers to pass data into a pipeline without needing to specify the source of the data directly in the pipeline. It promotes modularization of pipelines, so that a complex pipeline can be split into several simpler, possibly reusable pipelines that trigger each other.

2.4 Typical pipeline shape

In order to provide a concrete illustration of LP-ETL pipelines we discuss their typical shape. A typical shape of a pipeline transforming source tabular files into RDF using DCV and the OBEU data model can be seen in Figure 1.

Fig. 1. An example pipeline in LP-ETL

First, the source tabular data in the Excel format is downloaded by the HTTP Get component. In the next step, it is converted to CSV files using the Excel to CSV component, which can also perform basic operations like sheet selection and row and column selection. The CSVs are in turn transformed to raw RDF using the Tabular component. This component is based on the CSV on the Web W3C Recommendation10 with some technical adjustments to better serve the ETL process. For example, we adjusted the way property IRIs are generated, as in the specification they are based on the CSV file name, which is impractical for batch processing of multiple files, where the property IRIs need to be fixed across files. The raw RDF data is subsequently mapped to the desired vocabularies, typically via a series of SPARQL CONSTRUCT or SPARQL Update components. The resulting RDF data is then serialized to dumps and uploaded to a target server.

10 https://www.w3.org/TR/csv2rdf/

During the process, one source table can be split into multiple RDF datasets (three are used in Figure 1) for data modeling reasons. One example of such a reason is an input table that is formatted for printing and, besides individual transactions, contains quarterly sums of the transactions. In this case, we may decide to transform it into two datasets, one containing the individual transactions and another one containing the quarterly sums. In addition to the data itself, metadata can also be added in the pipeline, providing important information about the datasets themselves.

2.5 Performance and scaling

Given the nature of the sequential pipeline processing, the size of the input data and the complexity of the pipeline are the key factors affecting the overall performance of the transformation. So far, every component has all of its input and output data materialized in an in-memory RDF store, which poses restrictions on the size of the data.
The number of components affects the transformation runtime, as the data is copied in each step. The upside of this approach is the availability of intermediate data for debugging purposes. Multiple optimizations are being developed, but they are out of the scope of this paper.

3 Collection of reusable transformations

We created several reusable transformations to support the goals of the OBEU project. The transformations are implemented as LP-ETL pipeline fragments and are available at https://github.com/openbudgets/pipeline-fragments. While the individual technologies used in these pipelines are not new, the way they are combined is novel. We can split the transformations into generic ones that are applicable to any DCV-compliant data and ones that are applicable only to the fiscal data represented with the OBEU data model.

3.1 Transformations of DCV-compliant data

The DCV specification [1] provides two standard transformations: DCV normalization and validation using integrity constraints. Both of these transformations are encoded in SPARQL. We reused these formulations and extended them into pipeline fragments. Additionally, we developed a pipeline fragment for DCV to CSV conversion.

DCV normalization. DCV data can be coerced to a regular structure using the DCV normalization algorithm,11 which simplifies queries on the data. The algorithm is expressed via SPARQL Update operations, which makes it simple to wrap in an LP-ETL pipeline fragment via its SPARQL Update component. DCV normalization can serve for pre-processing input data for subsequent transformations. Since the normalized DCV data follows a regular structure, transformations of such data are simpler due to a less heterogeneous input.

11 https://www.w3.org/TR/vocab-data-cube/#normalize-algorithm

DCV validation. DCV defines 22 integrity constraints12 that formalize some of the assumptions about well-formed DCV-compliant data. Apart from one constraint that tests datatype consistency, the remaining 21 constraints are implemented via SPARQL ASK queries that evaluate to true when a constraint violation is found. While 19 constraints are static, two constraints must be generated dynamically based on the dataset to be validated. We reformulated the constraints as SPARQL CONSTRUCT queries that produce descriptions of the errors found in the validated data. The errors are represented using the SPIN RDF13 vocabulary, which allows pinpointing the RDF resources causing the errors and providing explanations that help users fix the errors. In this way, instead of boolean answers from ASK queries, users receive richer descriptions of the detected constraint violations. We generate the dynamic integrity constraints using Mustache14 templates and pass the generated queries further as runtime configuration. Apart from these constraints, we also generate a dataset-specific query for integrity constraint 12, which detects duplicate observations. Compared to the generic query for this constraint in the DCV specification, we observed approximately 100× speed-up for the generated dataset-specific query. We implemented DCV validation as a pipeline fragment that executes the integrity constraints, merges their resulting RDF graphs, and outputs them both as RDF, which allows further automated processing, and as an HTML report for quick visual inspection by users.

12 https://www.w3.org/TR/vocab-data-cube/#wf-rules
13 http://spinrdf.org/spin.html
14 https://mustache.github.io
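As an illustration of this reformulation, the following minimal sketch recasts integrity constraint IC-1 (every observation is attached to exactly one dataset) from its ASK form in the DCV specification into a CONSTRUCT query emitting SPIN RDF; the violation message and the exact choice of SPIN properties are illustrative and may differ from the queries in the published pipeline fragment.

PREFIX qb:   <http://purl.org/linked-data/cube#>
PREFIX spin: <http://spinrdf.org/spin#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# IC-1 recast as CONSTRUCT: emit a SPIN violation for each offending
# observation instead of answering a single boolean.
CONSTRUCT {
  [] a spin:ConstraintViolation ;
     spin:violationRoot ?obs ;
     spin:violationPath qb:dataSet ;
     rdfs:label "Observation is not attached to exactly one qb:DataSet."@en .
}
WHERE {
  {
    # Observation without any data set
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?dataset . }
  } UNION {
    # Observation attached to more than one data set
    ?obs a qb:Observation ;
         qb:dataSet ?dataset1, ?dataset2 .
    FILTER (?dataset1 != ?dataset2)
  }
}

The graphs produced by the individual constraint queries are then merged by the pipeline fragment and rendered as the RDF output and HTML report described above.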
DCV to CSV conversion. Many data mining tools cannot handle RDF and instead require input data to be provided in a single propositional table. The required tabular structure can be serialized in CSV, which is considerably less expressive than RDF. However, DCV constrains RDF to follow a more regular structure. DCV dimensions and measures must adhere to 1..1 cardinality and attributes to 0..1 cardinality. Since multi-valued properties are missing from the DCV model, producing a single CSV table out of DCV data is simpler than doing so out of arbitrary RDF data. RDF can be transformed to tabular data using SPARQL SELECT queries [3], the results of which can be serialized to CSV.

We developed a pipeline fragment for DCV to CSV conversion, which is based on the data structure definition (DSD) of the transformed dataset. The columns of the CSV output are derived from the components composing the DSD. We extract the component specifications from the DSD, including their type, attachment, and order. The extracted data is used in a Mustache template to generate a SPARQL SELECT query that transforms the datasets conforming to the DSD to CSV.

3.2 Transformations of OBEU-compliant data

Besides the generic transformations applicable to any DCV data, we also developed transformations that can be reused for any data described using the OBEU data model. In this section we describe the pipeline fragments for conversion from the Fiscal Data Package (FDP) to RDF, validation of integrity constraints of the OBEU data model, and normalization of monetary amounts using exchange rates and GDP deflators.

FDP to RDF transformation. FDP15 is a data format used in the OpenSpending project.16 Since the OpenSpending project and OBEU share some goals and FDP exhibits a similar structure to the OBEU data model,17 it is straightforward to provide a transformation between the formats, enabling the use of FDP data in OBEU and vice versa. Therefore we have developed an FDP to RDF transformation pipeline in LP-ETL. The FDP and OBEU data models cover the same domain and are semantically almost identical. FDP consists of CSV files accompanied by metadata in a JSON descriptor file describing the meaning of CSV columns and possible relationships between them. The description is based on dimensions, which have a similar interpretation to the OBEU dimension properties. The FDP dimensions are mapped to CSV columns. The goal of the pipeline is then first to transform the metadata into RDF and then to transform the CSV records according to the metadata.

15 http://fiscal.dataprotocols.org/spec/
16 http://openspending.org
17 https://github.com/openbudgets/data-model

The input of the pipeline is a single JSON file – an FDP descriptor – containing references to CSV files with the actual data. The input CSV files are first transformed to RDF via the Tabular component so that they can be manipulated via SPARQL queries. The rest of the transformation is implemented via SPARQL CONSTRUCT queries. The FDP descriptor in JSON is reinterpreted as JSON-LD18 and transformed to RDF. The output of the FDP to RDF pipeline is an RDF graph compliant with the OBEU data model. The pipeline is available at https://github.com/openbudgets/pipeline-fragments/tree/master/FDPtoRDF.19

18 http://json-ld.org/
19 See also the pipeline's overview at https://goo.gl/H6SSiE.

The implementation in SPARQL was guided by several principles. We split the transformation into atomic steps, each having a single responsibility. This leads to better code organization that eases iterative development and improves maintenance.
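As an illustration of such an atomic step, a single CONSTRUCT query might map one raw column produced by the Tabular component to its OBEU counterpart. The following is a minimal sketch in which the src: prefix (standing in for the generated column properties) and the expansion of the obeu-measure: prefix are illustrative assumptions rather than the exact IRIs used in the pipeline; only the base namespace http://data.openbudgets.eu/ontology/ is fixed by the OBEU data model.

PREFIX xsd:          <http://www.w3.org/2001/XMLSchema#>
# Assumed prefix expansion for the OBEU measure namespace.
PREFIX obeu-measure: <http://data.openbudgets.eu/ontology/dsd/measure/>
# Placeholder for the column properties generated by the Tabular component.
PREFIX src:          <http://example.org/raw/column/>

# One atomic step: map the raw 'amount' column to the OBEU measure,
# casting the lexical value to a decimal.
CONSTRUCT {
  ?observation obeu-measure:amount ?amount .
}
WHERE {
  ?observation src:amount ?rawAmount .
  BIND (xsd:decimal(?rawAmount) AS ?amount)
}

Keeping each query this small is what makes the individual steps easy to test and replace.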
The decomposition of the transformation process simplified debugging, because the intermediate outputs of each step were available for scrutiny, albeit at the cost of a more complex pipeline structure. The transformation's state is propagated through the pipeline as auxiliary RDF annotations, which are read and updated by the transformation steps. The advantages of the declarative formulation of the pipeline are that it is easier to understand and maintain and that it can be configured directly through the UI of LP-ETL. Conversely, the downside of this approach is the duplication of transformation code due to the granularity of the SPARQL queries. It is often the case that transformation queries differ only slightly, e.g., transformations of different FDP dimensions to OBEU dimension properties, in which only the source and target dimension/property differ. It is possible that modularization of SPARQL queries could avoid such duplication and ease their maintenance.

Validation of the OBEU integrity constraints. In a similar fashion to DCV, the OBEU data model provides integrity constraints that can detect violations of some assumptions behind the data model. The constraints cover the recurrent errors made by the users of the data model. As in our implementation of DCV validation, these integrity constraints are formalized as SPARQL CONSTRUCT queries producing SPIN RDF data. Besides the patterns expressed in the SPARQL queries, these constraints leverage background knowledge encoded in DCV and the OBEU data model. We implemented six integrity constraints in total, mostly testing the assumptions about datasets' DSDs.

1. Redefinition of component property's code list: Detects if the validated dataset redefines the code list of a coded component property reused from the OBEU data model. The validated datasets should instead define custom code lists for a derived subproperty.
2. Hijacked core namespace: The validated dataset must not invent terms in the namespace of the OBEU data model.20 A different namespace should be used instead.
3. Missing mandatory component property: The validated dataset must contain the mandatory component properties (or their subproperties) from the OBEU data model: obeu-attribute:currency, obeu-dimension:fiscalPeriod, obeu-dimension:operationCharacter, obeu-dimension:organization, and obeu-measure:amount.
4. Property instantiation: A common error we discovered was instantiation of properties. Since RDF only allows classes to be instantiated, instantiating properties is incorrect. In the context of DCV, this error may be caused by typos in class IRIs that differ from property IRIs only in character case (e.g., qb:DataSet vs. qb:dataSet). A sketch of the corresponding query is shown after this list.
5. Use of abstract property: Properties marked as abstract in the OBEU data model (e.g., obeu-dimension:classification) should not be directly reused. Users should mint subproperties of the abstract properties and use these instead.
6. Wrong character case in DCV: The constraint detects if the validated dataset contains a non-existent term from the DCV namespace that differs from an existing DCV term only in character case. Apart from reporting the invalid term, the constraint suggests a valid substitute.

20 http://data.openbudgets.eu/ontology/
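As one example, constraint 4 might be expressed along the following lines. This is a minimal sketch that assumes the property declarations from DCV and the OBEU data model are available as background knowledge next to the validated data; the violation message is illustrative.

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX spin: <http://spinrdf.org/spin#>

# Constraint 4: report resources whose rdf:type points at something
# declared as a property rather than a class.
CONSTRUCT {
  [] a spin:ConstraintViolation ;
     spin:violationRoot ?resource ;
     spin:violationPath rdf:type ;
     spin:violationValue ?property ;
     rdfs:label "A property is used as a class in rdf:type."@en .
}
WHERE {
  ?resource a ?property .
  ?property a ?propertyKind .
  VALUES ?propertyKind { rdf:Property owl:ObjectProperty owl:DatatypeProperty }
}

The other five constraints follow the same CONSTRUCT-plus-SPIN pattern and differ only in the graph patterns they match.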
Normalization of monetary amounts. In order to support comparative analysis combining fiscal data from different times and areas, we developed a pipeline fragment for normalization of monetary amounts using exchange rates and GDP deflators. This normalization allows comparing amounts of money not only as nominal values, but also in terms of their real value. Among other things, monetary amounts can differ in the currency, country, and time in which they were spent. The amounts can be converted to a single currency, such as the euro, using exchange rates and adjusted for different price levels depending on country and time using GDP deflators. The GDP deflator is a price index based on the gross domestic product (GDP) of a country that measures changes in price levels with respect to a specific base year. We can normalize an amount Q using the following calculation, in which Q_0 is the normalized monetary amount, I_{p,t} is the price index for the target year to which we normalize, I_{p,0} is the price index for the original year when Q was spent, and E_t is the exchange rate to euro for the target year:

Q_0 = \frac{I_{p,t}}{I_{p,0}} \cdot \frac{Q}{E_t}    (1)

The normalization requires the monetary amounts to be enriched with two datasets: one containing exchange rates and the other containing GDP deflators. Eurostat21 provides both these datasets in tabular format,22 but thanks to Linked Statistics23 they are also available in RDF modelled with DCV, which makes it easier to combine them with other data. We developed a pipeline fragment that uses these datasets to calculate normalized values of monetary amounts. The datasets are joined via country (and in turn by the country's currency) and year. The normalized datasets can either reuse Eurostat's resources for countries and years or provide explicit links to them. The above-described calculation for normalization is implemented in a dynamically generated SPARQL query.

21 http://ec.europa.eu/eurostat
22 Exchange rates: http://ec.europa.eu/eurostat/web/products-datasets/-/tec00033; GDP deflators: http://ec.europa.eu/eurostat/web/products-datasets/-/nama_10_gdp
23 http://eurostat.linked-statistics.org

4 OBEU use case

The pipeline fragments introduced in the previous section are used at several points in the OBEU project. The transformations of OBEU datasets take advantage of the two validation fragments as well as of the reusability of the dataset transformations themselves. Between the integration of datasets and the analytics part of OBEU, the DCV to CSV pipeline serves as a connection point. We present two selected use cases to show how these transformations are applied.

4.1 Validation of OBEU datasets

An important part of the OBEU project is the integration of datasets from different sources, structures, and formats into one platform using the uniform OBEU data model. To this end, the datasets have to be transformed to this data model. For datasets in formats other than FDP, such as raw datasets from various governments' publication offices, we developed separate pipelines using LP-ETL.24 Due to the huge diversity in the structure of the raw datasets and the different data formats, the pipelines for transforming these datasets differ and have to be developed or at least adapted manually. Consequently, this is an error-prone process, for which an automated way to validate the transformed datasets with minimum effort is essential. Using the pipeline fragments described above, such a validation pipeline can be set up with a few clicks.

24 Pipelines for EU-level datasets (e.g., the budget of European Structural Investment Funds) and various EU member states' regions and municipalities (e.g., Thessaloniki and Athens in Greece, Aragon in Spain, or Bonn in Germany) are available at https://github.com/openbudgets/datasets.

A validation pipeline for an OBEU dataset checking both the DCV and OBEU validity can be composed as follows (cf. Figure 2).
Fig. 2. Overview of the OBEU validation pipeline checking both the DCV and OBEU validity.

First, the transformed dataset to be validated is loaded into the pipeline (e.g., downloaded from GitHub via the HTTP Get component) and normalized using the DCV normalization pipeline fragment, since the following DCV validation step requires its input data to be in the DCV normal form. Then the two validation steps, DCV and OBEU, are performed over the normalized dataset. Finally, the resulting validation reports are combined into a joint output. Iterating over this pipeline supports debugging of the OBEU transformation pipelines and bug fixing in the corresponding datasets. An example of such a validation pipeline for the 2016 budget of the city of Bonn (Germany) is available on GitHub.25 To reuse the pipeline for another dataset, only the URLs of the dataset to be validated need to be changed.

25 https://github.com/openbudgets/datasets/blob/master/Bonn/pipelines/validation.jsonld

Thanks to their modularity, the validation and normalization pipelines are also being used for the FDP2RDF pipeline by simply linking them to its output: the pipeline fragment can be imported directly into the FDP2RDF pipeline and connected to its end, so that the validation is run every time the FDP2RDF pipeline runs. As the FDP2RDF pipeline is still in development, the validation pipelines are used mainly for checking the FDP2RDF pipeline itself and have already helped to discover and fix bugs.

4.2 Comparative analysis on normalized values

Several datasets have been transformed to the OBEU data model to be used in the OBEU project. They vary with respect to time, country, and also administrative level (i.e., local, regional, national, and EU level). To perform a proper analysis of budget and spending datasets from different years and regions, it is essential that the analysed monetary values are comparable. More concretely, the currency should be the same, and both inflation and the different price levels among countries have to be taken into account. Combining the pipeline fragment for normalizing monetary amounts with the pipeline fragment for DCV to CSV conversion, we are able to produce a CSV file with comparable amounts. This can be used as input for a comparative analysis across different regions or a time series analysis along different years with state-of-the-art analysis tools, addressing questions such as how Thessaloniki in Greece and Bonn in Germany budget their expenditures for public transportation, or to what extent the crisis in Greece has influenced the municipalities' budgets over time.

5 Conclusion

The need to transform fiscal or similar data from heterogeneous source formats to RDF is becoming widespread, and reusing solutions for recurring data transformation tasks thus becomes vital. We proposed a technological solution for such reuse, based on the state-of-the-art LP-ETL tool, presented a concrete collection of reusable transformations, and demonstrated it on use cases in the OBEU project.
We have composed pipelines that combine reusable fragments of different functionality, ranging from validation through structural normalization to complete transformation between formats. Their use significantly reduces the effort spent on the repetitive steps required in the dataset transformation cycle.

Our future work will focus primarily on the application of the pipelines in the context of the OpenBudgets.eu use case, while extending their collection further and testing to what degree it could be ported to other fields that exploit multidimensional data.

Acknowledgements: The presented research has been supported by the H2020 project no. 645833 (OpenBudgets.eu).

References

1. Cyganiak, R., Reynolds, D.: The RDF Data Cube Vocabulary. W3C Recommendation, W3C (2014), https://www.w3.org/TR/vocab-data-cube/
2. Gearon, P., Passant, A., Polleres, A.: SPARQL 1.1 Update. W3C Recommendation, W3C (2013), https://www.w3.org/TR/sparql11-update/
3. Hausenblas, M., Villazón-Terrazas, B., Cyganiak, R.: Data shapes and data transformations. Tech. rep. (2012), http://arxiv.org/abs/1211.1565
4. Klímek, J., Mynarz, J., Škoda, P., Zbranek, J., Zeman, V.: Deliverable 2.2: Data optimisation, enrichment, and preparation for analysis. Tech. rep. (2016), http://openbudgets.eu/assets/deliverables/D2.2.pdf
5. Klímek, J., Škoda, P., Nečaský, M.: LinkedPipes ETL: Evolved linked data preparation. In: The Semantic Web: ESWC 2016 Satellite Events, Anissaras, Crete, Greece, May 29-June 2, 2016, Revised Selected Papers, to appear (2016)
6. Knap, T., Kukhar, M., Macháč, B., Škoda, P., Tomeš, J., Vojt, J.: UnifiedViews: An ETL framework for sustainable RDF data processing. In: ESWC (2014)
7. Roman, D., Dimitrov, M., Nikolov, N., Putlier, A., Sukhobok, D., Elvesæter, B., Berre, A.J., Ye, X., Simov, A., Petkov, Y.: DataGraft: Simplifying open data publishing. In: Posters & Demos of the 13th European Semantic Web Conference (2016), http://2016.eswc-conferences.org/sites/default/files/papers/Accepted%20Posters%20and%20Demos/ESWC2016_DEMO_DataGraft.pdf