Exploiting Linked Data Cubes with OpenCube Toolkit

       Evangelos Kalampokis1,2, Andriy Nikolov3, Peter Haase3, Richard Cyganiak4,
      Arkadiusz Stasiewicz4, Areti Karamanou1,2, Maria Zotou1,2, Dimitris Zeginis1,2,
                   Efthimios Tambouris 1,2, Konstantinos Tarabanis1,2
       1
           Centre for Research & Technology - Hellas, 6th km Xarilaou-Thermi, 57001, Greece
                  2
                    University of Macedonia, Egnatia 156, 54006 Thessaloniki, Greece
                           {ekal, akarm, mzotou, zegin, tambouris, kat}@uom.gr
                    3
                      fluid Operations AG, Altrottstraße 31, 69190 Walldorf, Germany
                                {andriy.nikolov, peter.haase}@fluidops.com
                            4
                              Insight Centre for Data Analytics, Galway, Ireland
                        {richard.cyganiak, arkadiusz.stasiewicz}@insight-centre.org


           Abstract. The adoption of the Linked Data principles and technologies has
           promised to enhance the analysis of statistics at a Web scale. Statistical data,
           however, is typically organized in data cubes where a numeric fact (aka
           measure) is categorized by dimensions. Both data cubes and linked data
           introduce complexity that raises the barrier for reusing the data. The majority of
           linked data tools are not able to support or do not facilitate the reuse of linked
           data cubes. In this demo we present the OpenCube Toolkit that enables the easy
           publishing and exploitation of linked data cubes using visualizations and data
           analytics.

           Keywords: Linked data, statistics, data cubes, visualization, analytics.


1 Introduction

A major part of Open Data concerns statistics such as population figures, economic
and social indicators. Analysis of statistical open data can provide value to both
citizens and businesses in various areas such as business intelligence, epidemiological
studies and evidence-based policy-making. Linked Data has emerged as a promising
paradigm to enable the exploitation of the Web as a platform for data integration. As a
result Linked Data has been proposed as the most appropriate way for publishing
open data on the Web. Statistical data needs to be formulated as RDF data cubes [1]
characterized by dimensions, slices and observations in order to unveil its full
potential and value [2]. Processing of linked statistical data has only become a
popular research topic in the recent years. Several practical solutions have been
developed in this domain: for example, the LOD2 Statistical Workbench1 brings
together components developed in the LOD2 project by means of the OntoWiki2 tool.

  1
      http://wiki.lod2.eu/display/LOD2DOC/LOD2+Statistical+Workbench
  2
      http://aksw.org/Projects/OntoWiki.html
   In this demo paper we describe the OpenCube Toolkit that enable users to work
with linked data cubes in an easy manner. In comparison with existing tools, our
toolkit provides the following contributions:
   • application development SDK allowing customized domain-specific
        applications to be built to support various use cases;
   • new functionalities enabling users to better exploit linked data cubes;
   • components supporting the whole linked data cube lifecycle.


2 OpenCube Toolkit

The OpenCube Toolkit3 integrates a number of components which enable the user to
work with semantic statistical data at different stages of the lifecycle: from importing
legacy data and exposing it as linked open data to applying advanced visualization
techniques and complex statistical methods to it.
   The Information Workbench (IWB) platform [3] serves as a backbone for the
toolkit components. The components are integrated into a single architecture via
standard interfaces provided by the IWB SDK: widgets (for UI controls) and data
providers (for data importing and processing components). The overall UI design is
based on the use of wiki-based templates providing dedicated views for RDF
resources: an appropriate view template is applied to an RDF resource based on its
type. All components of the architecture share the access to a common RDF
repository (local or remote) and can retrieve data by means of SPARQL queries.
   The OpenCube Toolkit demo uses datasets from the Linked Data version of
Eurostat4 and can be currently accessed using the following link:
http://data.fluidops.net.


2.1 Using the OpenCube Toolkit for data import, transformation, and publishing

Much of the relevant valuable statistical data are only available in various legacy
formats, such CSV and Excel. To present these data in the form of linked RDF data
cubes, they have to be imported, transformed into the RDF Data Cube format and
made accessible for querying.
   The OpenCube TARQL5 component enables cubes construction from legacy data
via TARQL (Transformation SPARQL): a SPARQL-based data mapping language
that enables conversion of data from RDF, CSV, TSV and JSON (and potentially
XML and relational databases) to RDF. TARQL is tool for converting CSV files to
RDF using SPARQL 1.1 syntax. It is built on top of Apache ARQ6. The OpenCube
TARQL component includes the new release of TARQL. It brings several
improvements, such as: streaming capabilities, multiple query patterns in one

  3 http://opencube-toolkit.eu
  4
    http://eurostat.linked-statistics.org
  5
    https://github.com/cygri/tarql
  6
    http://jena.apache.org/documentation/query/
mapping file, convenient functions for typical mapping activities, validation rules
included in mapping file, increased flexibility (dealing with CSV variants like TSV).
    The R2RML7 language is a W3C standard for mappings from relational databases
to RDF datasets. D2RQ8 is a platform for accessing relational databases as virtual,
read-only RDF graphs. D2RQ Extensions for Data Cube cover the functionality of
importing of raw data as data cubes by mapping raw data to RDF. The process of
mapping the data cube with a relational data source includes: (a) mapping the tables
to classes of entities, (b) mapping selected columns into cube dimensions and cube
measures, (c) mapping selected rows into observation values, and (d) generate triples
with data structure definition. The user, by providing information about the dataset,
such as the data dimensions and related measures, will receive an R2RML mapping
file, which as a result will be used to generate and store the output.


2.2 Using the OpenCube Toolkit to utilize statistical data

To make use of available statistical data cubes, the user requires, as a minimum, to be
able to explore and, visualize the data. The next step involves being able to apply to
these data relevant statistical analysis methods.
   The OpenCube Browser enables the exploration of an RDF data cube by
presenting two-dimensional slices of the cube as a table. Currently browser enables
users to change the two dimensions that define the table of the browser and also
change the values of the fixed dimensions and thus select a different slice to be
presented. Moreover, the browser supports roll-up and drill-down OLAP operations
through dimensions reduction and insertion respectively. Finally, the user can create
and store a two-dimensional slice of the cube based on the data presented in the
browser. Initially, the browser selects two dimensions to present in the table and sets
up a fixed value for all other dimensions. Based on these it creates and sends a
SPARQL query to the store to retrieve the appropriate data. For the drill-down and
roll-up operations the browser assumes that a set of data cubes has been created out of
the initial cube by summarizing observations across one or more dimensions. We
assume that these cubes define an Aggregation Set.
   The OpenCube Map View enables the visualization of RDF data cubes on a map
based on their geospatial dimension. Initially, Map View presents to the user the
supported types of visualization (including markers, bubbles, choropleth and heat
maps) along with all the dimensions and their values in drop-down lists.
The user selects the type of visualization and a map appears that actually visualizes a
one-dimension slice of the cube where the geospatial dimension is free and the other
dimensions are randomly “fixed”. In addition, the user can click on an area or marker
or bubble and see the details of the specific observation. The maps are created using
OpenStreetMap9 and Leaflet10 open-source library.


  7
    http://www.w3.org/TR/r2rml/
  8
    http://d2rq.org/
  9
    http://wiki.openstreetmap.org/wiki/Develop
  10
     http://leafletjs.com/
   To allow the user explore the data in a data cube, it is important that the used
visualization controls are (i) interactive and (ii) adapted to the cube data
representation. In this way the user can easily switch between different slices of the
cube and compare between them. To this end, we implemented our Chart-based
Visualization functionality. The charts can be inserted into a wiki page of an RDF
resource and configured to show data cube slices. When viewing the page, the user
can change the selection of dimension values to change the visualised cube slices. The
SPARQL query to retrieve the appropriate data is constructed based on the slice
definition, and the data is downloaded from the SPARQL endpoint dynamically.
   When working with statistical data, a crucial requirement is the possibility to apply
specialized analysis methods. One of the most popular environments for statistical
data analysis is R11. To use the capabilities of R inside the OpenCube Toolkit, we
integrated it with our architecture through the Statistical Analysis of RDF Data
Cubes component. R is run as a web service (using Rserve12 package) and accessed
via HTTP. Input data are retrieved using SPARQL queries and passed to R together
with an R script. Then, R capabilities can be exploited in two modes: (i) as a widget
(the script generates a chart, which is then shown on the wiki page) and (ii) as a data
source (the script produces a data frame, which is then converted to RDF using
defined R2RML mappings and stored in the data repository).


3 Conclusions

This demo paper presents the first release of the OpenCube Toolkit developed to
enable easy publishing and reusing of linked data cubes. The toolkit smoothly
integrates separate components dealing with different subtasks of the linked statistical
data processing workflow to provide the user with a rich set of functionalities for
working with statistical semantic data.

Acknowledgments. The work presented in this paper was partially carried out in
OpenCube13 project, which is funded by the EC within FP7 (No. 611667).


References

1.        Cyganiak,      R.,    Reynolds,     D.:     The     RDF     Data     Cube vocabulary,
          http://www.w3.org/TR/vocab-data-cube/ (2013)
2.        Kalampokis, E., Tambouris, E., Tarabanis, K.: Linked Open Government Data Analytics.
          In: Wimmer, M.A., Janssen, M., Scholl, H.J. (eds.) EGOV 2013. LNCS, vol. 8074, pp. 99-
          110. IFIP International Federation for Information Processing (2013)
3.        Haase, P., Schmidt, M., Schwarte, A. Information Workbench as a Self-Service platform.
          COLD 2011, ISWC 2011, Shanghai, China (2011).


     11
        http://www.r-project.org/
     12
        http://www.rforge.net/Rserve/
     13
        http://www.opencube-project.eu