Interacting with Statistical Linked Data via OLAP Operations

Benedikt Kämpgen1, Sean O'Riain2, and Andreas Harth1

1 Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany
{benedikt.kaempgen,harth}@kit.edu
2 Digital Enterprise Research Institute, National University of Ireland, Galway
sean.oriain@deri.org

Abstract. Online Analytical Processing (OLAP) promises an interface to analyse Linked Data containing statistics that goes beyond other interaction paradigms such as follow-your-nose browsers, faceted-search interfaces and query builders. As a new way to interact with statistical Linked Data we define common OLAP operations on data cubes modelled in RDF and show how a nested set of OLAP operations leads to an OLAP query. Then, we show how to transform an OLAP query into a SPARQL query which generates all required facts from the data cube. Both metadata and OLAP queries are issued directly on a triple store; therefore, if the RDF is modified or updated, changes are propagated directly to OLAP clients.

Keywords: OLAP, query, operation, Linked Data, statistics, XBRL

Copyright © 2012 by the paper's authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. In: C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, R. Cyganiak (eds.): Proceedings of Interacting with Linked Data (ILD 2012), Workshop co-located with the 9th Extended Semantic Web Conference, Heraklion, Greece, 28-05-2012, published at http://ceur-ws.org

1 Introduction

Linked Data provides easy access to large amounts of interesting statistics from many organizations for information integration and decision support, including financial information from institutions such as the UK government (http://data.gov.uk/resources/coins) and the U.S. Securities and Exchange Commission (http://edgarwrap.ontologycentral.com/). However, interaction paradigms for Linked Data such as follow-your-nose browsers, faceted-search interfaces, and query builders [10, 12] do not allow users to analyse large amounts of numerical data in an exploratory fashion of "overview first, zoom and filter, then details-on-demand" [24]. Online Analytical Processing (OLAP) operations on data cubes for viewing statistics from different angles and granularities, filtering for specific features, and comparing aggregated measures fulfil this information seeking mantra and provide interfaces for decision support from statistics [2, 4, 25]. However, OLAP on statistical Linked Data poses two main challenges:

– OLAP requires a model of data cubes, dimensions, and measures. Automatically creating such a multidimensional schema from generic Linked Data is difficult, and only semi-automatic methods have proved applicable [16, 18, 21]. The RDF Data Cube vocabulary (QB, http://www.w3.org/TR/2012/WD-vocab-data-cube-20120405/) is a Linked Data vocabulary to model statistics in RDF. Several publishers have already used the vocabulary for statistical datasets (http://wiki.planet-data.eu/web/Datasets).
– OLAP queries are complex and require specialised data models, e.g., star schemas in relational databases, to be executed efficiently [9]. The typical architecture of an OLAP system consists of an ETL pipeline that extracts, transforms and loads data from the data sources into a data warehouse, e.g., a relational or multidimensional database. OLAP clients such as JPivot allow users to build OLAP queries and display results in pivot tables. An OLAP engine, e.g., Mondrian, transforms OLAP queries into queries to the data warehouse, and deploys mechanisms for fast data cube computation and selection, under the additional complexity that data in the data warehouse may change dynamically [8, 14].

In this paper, we assume that statistical Linked Data has been modelled using QB and focus on efficiently executing OLAP queries on statistical Linked Data. In previous work [11] we have presented a proof-of-concept to interpret Linked Data reusing QB as a multidimensional model and to automatically load the data into a data warehouse used by common OLAP systems. The fact that OLAP queries are executed not on the RDF directly but by a common OLAP engine after automatically populating a data warehouse results in two drawbacks: first, although the relational star schema we adopted is a quasi-standard logical model for data warehouses, our approach requires a ROLAP engine to execute OLAP queries; second, if statistical Linked Data is updated, e.g., if a single new row is added, the entire ETL process has to be repeated to have the changes propagated. Figure 1 shows our new data flow of having an OLAP engine issue SPARQL queries directly to a triple store. The current work presents a new way to interact with statistical Linked Data:

– We define common OLAP operations on data cubes as Linked Data reusing QB and show how a nested set of OLAP operations leads to an OLAP query.

– We show how to transform an OLAP query into a SPARQL query which generates all required facts from the data cube.

In the remainder of the paper, we first present a motivational scenario from the financial domain in Section 2. As a prerequisite for our contribution, in Section 3, we formally define a multidimensional model of data cubes based on QB. Then, in Section 4, we introduce OLAP operations on data cubes and present a direct mapping of OLAP to SPARQL queries. We apply this mapping in a small experiment in Section 5 and discuss some lessons learned in Section 6. In Section 7, we describe related work, after which, in Section 8, we conclude and describe future research.

Fig. 1. Data flow for OLAP queries on statistical Linked Data in a triple store

2 Scenario: Analysing Financial Linked Data

In this section we describe a scenario of analysing Linked Data containing financial information. The Edgar Linked Data Wrapper (http://edgarwrap.ontologycentral.com/) provides access to XBRL filings (http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+Corrected-Errata-2008-07-02.htm) as Linked Data reusing QB. Those filings disclose balance sheets of a large number of US organizations, for instance that RAYONIER INC had a sales revenue net of 377,515,000 USD from 2010-07-01 to 2010-09-30 (http://edgarwrap.ontologycentral.com/archive/52827/0001193125-10-238973#ds).

Using LDSpider, we crawled Linked Data from the Edgar wrapper and stored a data cube SecCubeGrossProfitMargin in an Open Virtuoso triple store. The data cube contains single disclosures from financial companies such as RAYONIER INC. Each disclosure either discloses cost of goods sold (CostOfGoodsSold) or sales revenue net (Sales) as measures. The two measures have the unit USD and an aggregation function that returns the number of disclosures or, if there is only one, the actual number. Any disclosure is fully dependent on the following dimensions: the disclosing company (Issuer), the date a disclosure started (Dtstart) and ended (Dtend) to be valid, and additional segment information (Segment).
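To give an impression of how such disclosures are represented in the triple store, the following SPARQL query is a minimal sketch that lists a few cost-of-goods-sold disclosures of the cube together with their issuer and validity end date. It is an illustration only: it reuses the dimension and measure properties (edgar:issuer, edgar:dtend, edgar:CostOfGoodsSold) that also appear in the queries later in this paper and omits prefix declarations.

select ?obs ?issuer ?dtend ?costOfGoodsSold
where {
  # observations belonging to the SecCubeGrossProfitMargin cube
  ?obs qb:dataSet ?ds .
  ?ds qb:structure edgar:SecCubeGrossProfitMargin .
  # dimension values and measure value attached to each observation
  ?obs edgar:issuer ?issuer .
  ?obs edgar:dtend ?dtend .
  ?obs edgar:CostOfGoodsSold ?costOfGoodsSold .
}
limit 10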
In our scenario, a business analyst wants to compare the number of disclosures of cost of goods sold for two companies. He requests a pivot table with the issuers RAYONIER INC and WEYERHAEUSER CO on the columns, the possible periods for which disclosures are valid on the rows, and in the cells the number of disclosed cost of goods sold values or, if there is only one, the actual number. Figure 2 shows the needed pivot table.

Fig. 2. Pivot table to be filled in our scenario

3 A Multidimensional Model Based on QB

In this section, as a precondition for OLAP queries on Linked Data, we formally define the notion of data cubes in terms of QB. The definition is based on the multidimensional models of Gómez et al. [7], Pedersen et al. [20] and the Open Java API for OLAP (http://www.olap4j.org/) as well as on the Linked Data vocabularies QB, SKOS (http://www.w3.org/2004/02/skos/) and skosclass (http://www.w3.org/2011/gld/wiki/ISO_Extensions_to_SKOS).

Definition 1 (Linked Data store with terms and triples). The set of RDF terms in a triple store consists of the set of IRIs I, the set of blank nodes B and the set of literals L. A triple (s, p, o) ∈ T = (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple, where s is the subject, p is the predicate and o is the object.

Given a triple store with statistical Linked Data, we use basic SPARQL triple patterns on the store to help us define sets of multidimensional elements. Given a multidimensional element x, iri(x) ∈ (I ∪ B) returns its IRI or blank node:

Member defines the set of members as Member = {?x ∈ (I ∪ B) | ?x a skos:Concept}. Let 𝒱 = 2^Member, V ∈ 𝒱, ROLLUPMEMBER ⊆ Member × Member, rollupmember(V) = {(v1, v2) ∈ V × V | iri(v1) skos:broader iri(v2) ∨ iri(v2) skos:narrower iri(v1)}.

Level defines the set of levels as Level = {(?x, V, rollupmember(V)) ∈ (I ∪ B) × 𝒱 × ROLLUPMEMBER | (?x a skosclass:ClassificationLevel) ∧ ∀v ∈ V (iri(v) skos:member ?x)}. Let ℒ = 2^Level, L ∈ ℒ, ROLLUPLEVEL ⊆ Level × Level, rolluplevel(L) = {(l1, l2) ∈ L × L | (iri(l1) skosclass:depth x) ∧ (iri(l2) skosclass:depth y) ∧ x ≤ y}.

Hierarchy defines the set of hierarchies as Hierarchy = {(?x, L, rolluplevel(L)) ∈ (I ∪ B) × ℒ × ROLLUPLEVEL | (?x a skos:ConceptScheme) ∧ ∀l ∈ L (iri(l) skos:inScheme ?x)}. Let ℋ = 2^Hierarchy.

Dimension defines the set of dimensions as Dimension = {(?x, H) ∈ (I ∪ B) × ℋ | (?x a qb:DimensionProperty) ∧ ∀h ∈ H (?x qb:codeList iri(h))}. Let 𝒟 = 2^Dimension.

Measure defines the set of measures as Measure = {(?x, aggr) ∈ (I ∪ B) × {UDF} | ?x a qb:MeasureProperty}, with UDF a default aggregation function, since QB so far does not provide a standard way to represent typical aggregation functions such as SUM, AVG and COUNT: if only one value is given, the value itself is returned, otherwise the number of values. Conceptually, measures are treated as members of a dimension-hierarchy-level combination labelled "Measures". Let ℳ = 2^Measure.

DataCubeSchema defines the set of data cube schemas as DataCubeSchema = {(?x, D, M) ∈ (I ∪ B) × 𝒟 × ℳ | (?x a qb:DataStructureDefinition) ∧ ∀d ∈ D (?x qb:componentProperty ?comp ∧ ?comp qb:dimensionProperty iri(d)) ∧ ∀m ∈ M (?x qb:componentProperty ?comp ∧ ?comp qb:measureProperty iri(m))}.
Fact defines the set of possible statistical facts as Fact = {(?x, ?c0, ..., ?ci, ?e0, ..., ?ej) ∈ (I ∪ B) × (I ∪ B ∪ L) × ... × (I ∪ B ∪ L) × L × ... × L | (?x a qb:Observation) ∧ (?x ?d0 ?c0 ∧ ?d0 a qb:DimensionProperty) ∧ ... ∧ (?x ?di ?ci ∧ ?di a qb:DimensionProperty) ∧ (?x ?m0 ?e0 ∧ ?m0 a qb:MeasureProperty) ∧ ... ∧ (?x ?mj ?ej ∧ ?mj a qb:MeasureProperty)}. Let ℱ = 2^Fact.

DataCube defines the set of data cubes as DataCube = {(cs, F) ∈ DataCubeSchema × ℱ | cs = (?x, D, M) ∧ D = {D0, ..., D|D|} ∧ M = {m0, ..., m|M|} ∧ ∀c ∈ F (c = (?obs, c0, ..., c|D|, e0, ..., e|M|) : (?obs qb:dataSet ?ds ∧ ?ds qb:structure ?x) ∧ ∀Di ∈ D (?obs iri(Di) ci ∧ iri(Di) qb:codeList ?h ∧ ?l skos:inScheme ?h ∧ ?v skos:member ?l ∧ (?v skos:notation ci ∨ ?v skos:exactMatch ci)) ∧ ∀mi ∈ M (?obs iri(mi) ei ∨ ei = null))}.

We distinguish metadata queries and OLAP queries on data cubes. Whereas metadata queries return multidimensional objects such as the cube schema, the dimensions, and the measures, OLAP queries return facts directly contained in or derived from (e.g., by aggregation of several facts) the data cube.

Definition 2 (Materialised data cube). According to Gray et al. [8], a materialised data cube (cs, F) with cs = (x, D, M) contains a set of facts MF of tuples (?x, ?c0, ..., ?c|D|, ?e0, ..., ?e|M|) with ?ci ∈ Ci, 0 ≤ i ≤ |D|, where Ci contains all possible members of a dimension Di ∈ D or the special ALL value, and with ej ∈ T, 0 ≤ j ≤ |M|, where T is a numeric domain including the special null value in cases of cube sparsity. A data cube can be materialised by a union of 2^|D| sub-queries, each grouping the result on the members of a subset of dimensions, replacing the values of the other dimensions by a special ALL value, and computing each measure value by the measure's aggregation function. The number of facts of a materialised data cube is given by |MF| = (|C0| + 1) · ... · (|C|D|| + 1).

Subqueries and aggregation functions in SPARQL 1.1 make it possible to directly apply the concept of Gray et al. [8] and fully materialise a data cube represented as Linked Data reusing QB. However, due to its exponential size w.r.t. the number of dimensions, such a SPARQL query is inefficient. OLAP queries may require only a small subset of all possible facts of a data cube; therefore, in the next section, we show how to evaluate OLAP queries using a single SPARQL query.
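For illustration, the following SPARQL 1.1 sketch indicates what such a materialisation could look like for a reduced cube with only the two dimensions Issuer and Dtend and the measure CostOfGoodsSold. It is a sketch under simplifying assumptions: it uses SUM in place of the scenario's UDF aggregation function, binds dimension values directly from the observations instead of resolving members via skos:notation or skos:exactMatch, and introduces the literal "ALL" merely as a placeholder; it is not the query generated by our approach.

select ?issuer ?dtend ?total
where {
  # group by both dimensions
  { select ?issuer ?dtend (sum(xsd:decimal(?m)) as ?total)
    where { ?obs qb:dataSet ?ds . ?ds qb:structure edgar:SecCubeGrossProfitMargin .
            ?obs edgar:issuer ?issuer ; edgar:dtend ?dtend ; edgar:CostOfGoodsSold ?m . }
    group by ?issuer ?dtend }
  union
  # aggregate over Dtend
  { select ?issuer ("ALL" as ?dtend) (sum(xsd:decimal(?m)) as ?total)
    where { ?obs qb:dataSet ?ds . ?ds qb:structure edgar:SecCubeGrossProfitMargin .
            ?obs edgar:issuer ?issuer ; edgar:CostOfGoodsSold ?m . }
    group by ?issuer }
  union
  # aggregate over Issuer
  { select ("ALL" as ?issuer) ?dtend (sum(xsd:decimal(?m)) as ?total)
    where { ?obs qb:dataSet ?ds . ?ds qb:structure edgar:SecCubeGrossProfitMargin .
            ?obs edgar:dtend ?dtend ; edgar:CostOfGoodsSold ?m . }
    group by ?dtend }
  union
  # aggregate over both dimensions
  { select ("ALL" as ?issuer) ("ALL" as ?dtend) (sum(xsd:decimal(?m)) as ?total)
    where { ?obs qb:dataSet ?ds . ?ds qb:structure edgar:SecCubeGrossProfitMargin .
            ?obs edgar:CostOfGoodsSold ?m . } }
}

Already this reduced cube requires four sub-queries; with the four dimensions of the scenario, 2^4 = 16 such sub-queries would be needed (cf. Section 5).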
4 Mapping OLAP Operations to SPARQL on QB

In this section we show how to issue OLAP queries on a multidimensional model. We define common OLAP operations on single data cubes [19, 20, 22, 23]. A nested set of OLAP operations leads to an OLAP query. We describe how to evaluate such an OLAP query using SPARQL on QB. Figure 3 illustrates the effect of common OLAP operations, with inputs and outputs.

Fig. 3. Illustration of common OLAP operations with inputs and outputs (adapted from [23])

Note that this paper focuses on direct querying of single data cubes; the integration of several data cubes through Drill-Across or set-oriented operations such as union, intersection, and difference is out of scope. Multiple datasets can already be queried together if they are covered by one qb:DataStructureDefinition.

Each OLAP operation has a data cube as input and output. Therefore, operations can be nested. A nested set of OLAP operations leads to an OLAP query. For interpreting a set of OLAP operations as an OLAP query and evaluating it using a SPARQL query on QB, we adopt the notion of subcube queries [13].

Definition 3 (OLAP query). We define an OLAP query as a subcube query [13] on a certain cube (cs, C) with cs = (?uri, D, M), represented as a tuple with as many elements as dimensions and measures: (q0, ..., q|D|, m0, ..., m|M|) with dom(qi) = {?, ALL, x}, with ? for an inquired dimension, ALL for an aggregated dimension, and x for one or more members that fix a dimension, and with mi a measure to query for. As examples, we describe three distinguishable subcube queries:

– A full-cube query returns exactly the facts from a data cube instance and inquires all dimensions: (?, ?, ?, ..., m0, ..., m|M|).

– A point query asks for a data cube comprising one specific instance tuple: (a0, a1, ..., a|D|, m0, ..., m|M|) with ai a member of dimension Di.

– A fully-aggregated query asks for the measures aggregated over all dimensions: (ALL, ALL, ..., ALL, m0, ..., m|M|).

In the following we describe how to evaluate each OLAP operation in terms of this query model and how a nested set of OLAP operations results in one specific OLAP subcube query. We assume that names and schemas of input and output data cubes are implicitly given and we focus on the data cube facts which will be queried for. An input data cube is represented as a full-cube query (?, ?, ?, ..., m0, ..., m|M|).

Projection is defined as Projection : DataCube × Measure → DataCube; it removes a measure from the input cube and allows querying only for specific measures. We evaluate Projection by removing a measure from the subcube query tuple.

Slice is defined as Slice : DataCube × Dimension → DataCube; it removes a dimension from the input cube and aggregates over the members of that dimension. We evaluate Slice by setting the tuple element of that dimension to ALL.

Dice is defined as Dice : DataCube × Dimension × 𝒱 → DataCube; it allows filtering for and aggregating over certain dimension members. We evaluate Dice by setting the tuple element of that dimension to this particular member or set of members and aggregating over the set. Note that Dice is not a selection operation but a combined filter and slice operation.

Roll-Up is defined as Roll-Up : DataCube × Dimension × Level → DataCube; it allows creating a cube that contains instance data on a higher aggregation level. We evaluate Roll-Up by replacing the inquired members of a dimension with members of a higher level. Note that we do not define Drill-Down, since it can be seen as the inverse operation of Roll-Up.

As an example, consider an OLAP query on our SecCubeGrossProfitMargin cube for the cost of goods sold (CostOfGoodsSold) for each issuer and each date until which a disclosure is valid (Dtend), filtering for disclosures from two specific segments. A nested set of OLAP operations that queries the requested facts can be composed as follows. In all our queries, we use prefixes to make URIs more readable:

Slice(
  Dice(
    Projection(
      edgar:SecCubeGrossProfitMargin,
      edgar:CostOfGoodsSold),
    edgar:segment,
    {edgar:segmentAHealthCareInsuranceCompany,
     edgar:segmentAResidentialRealEstateDeveloperMember}),
  edgar:dtstart)

This query can then be represented as a subcube query with dimensions Issuer, Dtstart, Dtend, Segment:

(?, ALL, ?,
 {edgar:segmentAHealthCareInsuranceCompany,
  edgar:segmentAResidentialRealEstateDeveloperMember},
 edgar:CostOfGoodsSold)
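As a point of reference for the translation described next, the simplest subcube query type from Definition 3, a fully-aggregated query such as (ALL, ALL, ALL, ALL, edgar:CostOfGoodsSold), already corresponds to a plain aggregate query. The following sketch assumes that counting the measure values is an acceptable stand-in for the scenario's UDF aggregation function:

select (count(?measureValue0) as ?numDisclosures)
where {
  # all observations of the cube that carry the CostOfGoodsSold measure
  ?obs qb:dataSet ?ds .
  ?ds qb:structure edgar:SecCubeGrossProfitMargin .
  ?obs edgar:CostOfGoodsSold ?measureValue0 .
}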
Next, we describe how to evaluate such an OLAP subcube query using a SPARQL query on QB. Since OLAP hierarchies add considerable complexity to QB and since a Roll-Up has a similar effect to a Slice operation, in this paper we assume data cubes with only one hierarchy and level per dimension. A subcube query Q = (q0, ..., q|D|, m0, ..., m|M|) can be translated into a SPARQL query using the following steps:

1. We initialise the SPARQL query using the URI of the data cube. We query for all instance data from the data cube, i.e., observations linking to datasets which link to the data structure definition.

2. For each selected measure, we incorporate it in the SPARQL query by selecting an additional variable for the measure and by aggregating it using the aggregation function of that measure, using OPTIONAL patterns in case we query for several measures.

3. For each inquired dimension, we query for all the instances of skos:Concept in a level of a hierarchy of the dimension and for the represented values used in the observations (from members linked either via skos:notation or skos:exactMatch). We query for the observations showing property-value pairs for each of these variables. To display inquired dimensions in the result and to correctly aggregate the measures, we group by each dimension variable.

4. For each fixed dimension, we filter for those observations that exhibit one of the listed members for that dimension.

We transform our example from above into the following SPARQL query. Note that "UDF" represents the standard aggregation function from our scenario:

select ?dimMem0 ?dimMem1 UDF(?measureValue0)
where {
  ?obs qb:dataSet ?ds .
  ?ds qb:structure edgar:SecCubeGrossProfitMargin .
  ?obs edgar:issuer ?values0 .
  ?dimMem0 skos:member ?level0 .
  ?level0 skos:inScheme ?hierarchy0 .
  edgar:issuer qb:codeList ?hierarchy0 .
  ?dimMem0 skos:exactMatch ?values0 .
  ?obs edgar:dtend ?values1 .
  ?dimMem1 skos:member ?level1 .
  ?level1 skos:inScheme ?hierarchy1 .
  edgar:dtend qb:codeList ?hierarchy1 .
  ?dimMem1 skos:notation ?values1 .
  ?obs edgar:segment ?values2 .
  ?slicerMem0 skos:member ?level2 .
  ?level2 skos:inScheme ?hierarchy2 .
  edgar:segment qb:codeList ?hierarchy2 .
  ?slicerMem0 skos:exactMatch ?values2 .
  Filter(?slicerMem0 = edgar:segmentAHealthCareInsuranceCompany ||
         ?slicerMem0 = edgar:segmentAResidentialRealEstateDeveloperMember)
  ?obs edgar:CostOfGoodsSold ?measureValue0 .
}
group by ?dimMem0 ?dimMem1

5 Experiment

In this section we demonstrate in a small experiment the applicability of our OLAP-to-SPARQL mapping to our scenario from the financial domain. The SecCubeGrossProfitMargin cube contains 17,448 disclosures that either disclose cost of goods sold or sales revenue net. The values of the measures fully depend on one of 625 different issuers, the date a disclosure started (27 members of dimension Dtstart) and ended (20 members of Dtend) to be valid, and additional segment information (21,227 members of Segment).
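Member counts like these can be obtained with metadata queries against the triple store. The following sketch counts the members of the Segment dimension; it assumes the SKOS-based hierarchy modelling from Section 3 and follows the same codeList/inScheme/member patterns as the queries above:

select (count(distinct ?member) as ?numMembers)
where {
  # members attached to the code list of the Segment dimension
  edgar:segment qb:codeList ?hierarchy .
  ?level skos:inScheme ?hierarchy .
  ?member skos:member ?level .
}

Analogous queries for edgar:issuer, edgar:dtstart, and edgar:dtend yield the sizes of the other dimensions.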
The two measures have the unit USD and an aggregation function that returns the number of disclosures or, if there is only one, the actual number. If fully materialised according to Definition 2, the cube contains 626 · 28 · 21 · 21,228 = 7,813,772,064 facts. To compute all of its facts, 2^4 = 16 SPARQL subqueries would be needed.

OLAP interfaces allow users to interactively combine OLAP operations into an expression of an OLAP query language. Results of the query are to be visualised using a pivot table, a compact format to display multidimensional data [5]. As far as we know, MDX is the most widely used OLAP query language, adopted by OLAP engines such as Microsoft SQL Server, the Open Java API for OLAP, XML for Analysis (XMLA), and Mondrian. Therefore, we show that an MDX query can be transformed into an OLAP subcube query according to Definition 3 and evaluate the subcube query using a SPARQL query. The result is a subset of all possible facts from the data cube. The pivot table determines what dimensions to display on its columns and rows.

In order to answer the OLAP question of our scenario, we created the following MDX query. Multidimensional elements are described there using URIs; note that URIs need to be translated to an MDX-compliant format that does not use reserved MDX-specific characters. For an introduction to MDX, see http://msdn.microsoft.com/en-us/library/aa216770%28v=sql.80%29.aspx. Due to space constraints, we leave a more detailed description of how to transform an MDX query into an OLAP query for future work, when we evaluate our OLAP-to-SPARQL mapping more thoroughly.

SELECT {edgar:cik1417907idConcept, edgar:cik106535idConcept} ON COLUMNS,
  CrossJoin(edgar:dtstartRootLevel.Members,
            edgar:dtendRootLevel.Members) ON ROWS
FROM [edgar:SecCubeGrossProfitMargin]
WHERE {edgar:CostOfGoodsSold}

A nested set of OLAP operations to compose our OLAP query is as follows:

Slice(
  Projection(
    edgar:SecCubeGrossProfitMargin,
    edgar:CostOfGoodsSold),
  edgar:segment)

This query can then be represented as a subcube query with dimensions Issuer, Dtstart, Dtend, Segment: (?, ?, ?, ALL, edgar:CostOfGoodsSold). The resulting SPARQL query is as follows:

select ?dimMem0 ?dimMem1 ?dimMem2
       count(xsd:decimal(?measureValue0)) sum(xsd:decimal(?measureValue0))
where {
  ?obs qb:dataSet ?ds .
  ?ds qb:structure edgar:SecCubeGrossProfitMargin .
  ?obs edgar:issuer ?values0 .
  ?dimMem0 skos:member ?level0 .
  ?level0 skos:inScheme ?hierarchy0 .
  edgar:issuer qb:codeList ?hierarchy0 .
  ?dimMem0 skos:exactMatch ?values0 .
  ?obs edgar:dtstart ?values1 .
  ?dimMem1 skos:member ?level1 .
  ?level1 skos:inScheme ?hierarchy1 .
  edgar:dtstart qb:codeList ?hierarchy1 .
  ?dimMem1 skos:notation ?values1 .
  ?obs edgar:dtend ?values2 .
  ?dimMem2 skos:member ?level2 .
  ?level2 skos:inScheme ?hierarchy2 .
  edgar:dtend qb:codeList ?hierarchy2 .
  ?dimMem2 skos:notation ?values2 .
  ?obs edgar:CostOfGoodsSold ?measureValue0 .
}
group by ?dimMem0 ?dimMem1 ?dimMem2
The aggregation function used is a non-standard one; therefore, we had to compute both the SUM and the COUNT for the measure. We ran the query after a reboot of the triple store. The query took 18 seconds and returned 58 facts to be filled into the requested pivot table. The number of 7,813,772,064 potential facts in the cube does not have a strong influence on the query time since the cube is very sparse; for instance, the triple store contains observations for only a fraction of the segment members.

6 Discussion

Data from the data cube is queried on demand, and no materialisation is done. We correctly aggregate data on one specific granularity, defined by the mentioned inquired and fixed dimensions. Dimensions that are not mentioned are automatically handled as having an ALL value [8], representing all possible values of the dimension. The aggregation results in correct calculations, since we assume only one hierarchy and level per dimension in this work. Only if observations were defined on different granularities, e.g., gender male, female, and total, would aggregating over them result in incorrect numbers.

Filling the pivot table with measure values from the SPARQL result requires matching of the member values for each fact, for the following reasons: first, if the data cube is sparse, i.e., a value is not given for every possible combination of members, then the SPARQL query does not return a value for non-occurring combinations; second, all member combinations of inquired dimensions are calculated, even though only specific combinations might be selected, as in the case of the two issuers in the OLAP query of our scenario. Indexing of either the pivot table or the SPARQL result table may allow faster population of the pivot table.

In summary, though our OLAP-algebra-to-SPARQL mapping may not result in the most efficient SPARQL query and requires additional effort for populating the pivot table, it correctly computes all required facts from the data cube without the need for explicitly introducing the non-relational ALL member or using sub-queries [8].

7 Related Work

Kobilarov and Dickinson [12] have combined browsing, faceted-search, and query-building capabilities for more powerful Linked Data exploration, similar to OLAP, but not focusing on statistical data. Though years have passed since then, current literature on Linked Data interaction paradigms does not seem to expand on analysing large amounts of statistics.

OLAP query processing in general has long been a topic of research [27]. OLAP operations have been defined on a logical level [1, 20, 26] or on a conceptual level [3, 22, 25]. Execution of OLAP operations is mainly concerned with the computation of the data cube and with storing parts of the results of that computation to efficiently return the results, to require little disk or memory space, and to remain easy to update if data sources change [14]. Approaches mainly depend on the type of data structure on which to perform the computations and in which to store the results. Data structures can roughly be grouped into ROLAP, using relational tables and star or snowflake schemas, and MOLAP, using multidimensional arrays for direct storing and querying of data cubes.

Specific approaches regarding OLAP on Linked Data seem to have concentrated so far on multidimensional modelling from ontologies [6, 15–17].
For instance, Nebot et al. [16] recognise the potential of OLAP to analyse RDF data, but do not provide a dedicated query engine and require a multidimensional database that needs to be updated if the RDF data changes. Entity-centric object databases [20] show some resemblance to OLAP querying; however, they have so far not been applied to Linked Data.

In this work we use the graph-based RDF data model for querying and storing multidimensional data reusing QB. Here, both schema information and actual data are accessed using Linked Data principles and managed using an RDF store. Our approach focuses on OLAP queries that can be composed from common OLAP operations and can be represented as a subcube query. Our mapping allows translating OLAP queries into one SPARQL query to be run on the RDF without storage of intermediate results. In our small experiment the produced SPARQL query proved sufficiently fast, but queries are expected to become too slow for larger datasets. Since no materialisation is done, only little extra space is required for a hashmap to fill the pivot table with the SPARQL result set, and updates to the RDF are propagated directly to OLAP clients. Although there may be more efficient querying approaches such as special indexing and caching, to the best of our knowledge, this is the first work on computing and querying data cubes represented in RDF.

8 Conclusions and Future Work

We have presented an approach to interact with statistical Linked Data using common Online Analytical Processing operations, following the information-seeking mantra of "overview first, zoom and filter, then details-on-demand". For that, we define common OLAP operations on single data cubes in RDF reusing the RDF Data Cube vocabulary, map nested sets of OLAP operations to OLAP subcube queries, and evaluate those OLAP queries using SPARQL. Both metadata and OLAP queries are issued directly on a triple store; therefore, if the RDF is modified or updated, changes are propagated directly to OLAP clients. Though our OLAP-to-SPARQL mapping may not result in the most efficient SPARQL query and requires additional effort for populating the resulting pivot tables, we correctly calculate the requested facts of a data cube without the need for explicitly introducing the non-relational ALL member or using subqueries.

Future work may be conducted in three areas: 1) extending our current approach with OLAP hierarchies and Drill-Across queries; 2) implementing an OLAP engine to more thoroughly evaluate our current approach and to investigate more efficient OLAP query execution plans; 3) investigating possible OLAP clients that map OLAP operations to intuitive user interactions.

Acknowledgements

This work was supported by the German Ministry of Education and Research (BMBF) within the SMART project (Ref. 02WM0800) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).

References

1. Agrawal, R., Gupta, A., Sarawagi, S.: Modeling Multidimensional Databases. In: Proc. of the Thirteenth International Conference on Data Engineering (1997)
2. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. ACM SIGMOD Record 26 (1997) 65–74
3. Chen, L., Ramakrishnan, R., Barford, P., Chen, B., Yegneswaran, V.: Composite Subset Measures. In: Proc. of the 32nd International Conference on Very Large Databases (2006)
4. Codd, E. F., Codd, S. B., Salley, C. T.: Providing OLAP to User-Analysts: An IT Mandate. (1993)
5. Cunningham, C., Galindo-Legaria, C. A., Graefe, G.: PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS. In: Proc. of the Thirtieth International Conference on Very Large Databases (2004)
6. Diamantini, C., Potena, D.: Semantic enrichment of strategic datacubes. In: Proc. of the ACM 11th international workshop on Data warehousing and OLAP (2008)
7. Gómez, L. I., Gómez, S. A., Vaisman, A. A.: A Generic Data Model and Query Language for Spatiotemporal OLAP Cube Analysis. In: Proc. of EDBT 2012
8. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. In: Proc. of the Twelfth International Conference on Data Engineering (1995) 152–159
9. Harinarayan, V., Rajaraman, A.: Implementing data cubes efficiently. ACM SIGMOD Record (1996)
10. Harth, A.: VisiNav: A system for visual search and navigation on web data. Journal of Web Semantics 8(4) (2010) 348–354
11. Kämpgen, B., Harth, A.: Transforming Statistical Linked Data for Use in OLAP Systems. In: Proc. of I-Semantics 2011
12. Kobilarov, G., Dickinson, I.: Humboldt: Exploring Linked Data. In: Proc. of the Linked Data on the Web Workshop (LDOW 2008) at WWW 2008
13. Li, X., Han, J., Gonzalez, H.: High-dimensional OLAP: a minimal cubing approach. In: Proc. of the Thirtieth International Conference on Very Large Databases (2004)
14. Morfonios, K., Konakas, S., Ioannidis, Y., Kotsis, N.: ROLAP implementations of the data cube. ACM Computing Surveys 39 (2007)
15. Nebot, V., Berlanga, R.: Building data warehouses with semantic web data. Decision Support Systems 52 (2012) 853–868
16. Nebot, V., Berlanga, R., Pérez, J. M., Aramburu, M. J., Pedersen, T. B.: Multidimensional Integrated Ontologies: A Framework for Designing Semantic Data Warehouses. Journal on Data Semantics (2009) 1–36
17. Niinimäki, M., Niemi, T.: An ETL Process for OLAP Using RDF/OWL Ontologies. Journal on Data Semantics XIII 5530 (2009) 97–119
18. Pardillo, J., Mazón, J.-N.: Using Ontologies for the Design of Data Warehouses. International Journal of Database Management Systems 3 (2011) 73–87
19. Pardillo, J., Mazón, J.-N., Trujillo, J.: Bridging the semantic gap in OLAP models: platform-independent queries. In: Proc. of the ACM 11th international workshop on Data warehousing and OLAP (2008)
20. Pedersen, T. B., Gu, J., Shoshani, A., Jensen, C. S.: Object-extended OLAP querying. Data Knowl. Eng. 68 (2009) 453–480
21. Romero, O., Abelló, A.: Automating multidimensional design from ontologies. In: Proc. of the ACM tenth international workshop on Data warehousing and OLAP (2007)
22. Romero, O., Abelló, A.: On the Need of a Reference Algebra for OLAP. In: Proc. of DaWaK 2007
23. Romero, O., Marcel, P., Abelló, A., Peralta, V., Bellatreche, L.: Describing analytical sessions using a multidimensional algebra. In: Proc. of the 13th international conference on Data warehousing and knowledge discovery (2011)
24. Shneiderman, B.: The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Information Visualization (1996) 336–343
25. Trujillo, J.: Bridging the Semantic Gap in OLAP Models: Platform-independent Queries. Computing Systems (2008) 89–96
26. Vassiliadis, P.: Modeling Multidimensional Databases, Cubes and Cube Operations. In: Proc. of the 10th International Conference on Scientific and Statistical Database Management (1998)
27. Vassiliadis, P., Sellis, T.: A survey of logical models for OLAP databases. ACM SIGMOD Record 28 (1999) 64–69