OLAP Manipulations on RDF Data following a Constellation Model

Rafik Saad, Olivier Teste, Cassia Trojahn

IRIT (UMR5505) & Université Toulouse 2 Le Mirail (UTM2), France
srf.rafik@gmail.com, cassia.trojahn@irit.fr

Multidimensional analysis is an alternative way of summarising, aggregating and viewing RDF data on different axes (dimensions) and subjects of analysis (facts). From an RDF data collection conforming to the W3C Data Cube specification, we formalise a multidimensional model in terms of RDF data structures following a conceptual constellation model. This model groups facts, which are studied according to several dimensions possibly shared between facts, with each dimension organised into one or more hierarchies. We show how elementary OLAP operations can be translated into SPARQL queries using an OLAP algebra that is compliant with the constellation model. This algebra is based on a multidimensional table which displays data from one fact and two of its linked dimensions. Initial experiments have been carried out using both synthetic and real data sets.

1 Introduction

The Linked Open Data (LOD) movement has promoted the publication of large interlinked collections of data, represented as RDF graphs. Following this initiative, many organisations currently publish statistical data in RDF format (e.g., Eurostat¹, European Central Bank², UK COINS³, to cite a few examples). The need to exploit these data for analytical and decision-making purposes has rapidly become evident. On the one hand, one promising way of analysing these numeric data is by means of OLAP (Online Analytical Processing) analysis [1, 2]. This technique allows for summarising, filtering, aggregating, and viewing data on different axes and subjects of analysis. On the other hand, OLAP treatments require data to be structured following a specific model, i.e., the multidimensional model, which organises data into a set of facts (subjects of analysis), and dimensions and hierarchies (axes of analysis).

¹ http://eurostat.linked-statistics.org
² http://ecb.270a.info/.html
³ http://data.gov.uk/resources/coins

A first category of approaches for manipulating RDF data following a multidimensional model considers an ETL process for extracting and transforming these data into a specific structure, usually the star relational model, before using standard OLAP systems [6]. Another category of approaches aims at performing OLAP operations directly on RDF data collections without using an ETL process [8, 3]. While the first approach requires the ETL process to be repeated to propagate changes in the source data, the second approach requires a multidimensional modelling of RDF data and a dynamic translation of OLAP operations into SPARQL queries.

In this paper, we present an approach for OLAP manipulations on RDF data which falls into the second category of approaches. To that end, we consider that RDF data are modelled according to the RDF Data Cube Vocabulary⁴, a vocabulary for modelling multidimensional data, such as statistics, in RDF. First, we propose a formalisation of a multidimensional structure based on the RDF format, following a constellation model of facts and dimensions composed of multiple hierarchies. This model was introduced by Ravat and colleagues in [12], extending star schemas [9], which are commonly used in multidimensional modelling. Second, based on this formalisation, we show how the main OLAP operations (DRILLDOWN, ROLLUP, SELECT and ROTATE) can be translated into SPARQL queries using an OLAP algebra that is compliant with the constellation model. This algebra is based on a multidimensional table which displays data from one fact and two of its linked dimensions. It defines a set of elementary operators from which more complex OLAP operations can be defined. Our proposal has been implemented and experiments were carried out using both synthetic and real data sets. The main contributions of the paper can be outlined as follows:

- we provide an efficient RDF constellation model that is closely related to the multidimensional data model. This generalised model supports multiple facts, multiple dimensions and multiple hierarchies;
- the proposed modelling supports complex hierarchies in order to represent a variety of real-world data organisations, and covers the case of non-covering hierarchies [11, 10], where instances may not strictly follow the hierarchical specification: values of a child level are allowed to skip intermediate levels along the hierarchy. An example of this kind of hierarchy is a company having customers in different cities of several countries. American cities can be grouped into parent levels such as states, whereas French cities skip these intermediate levels.

The remainder of the paper is organised as follows. Section 2 introduces the conceptual constellation model we propose for the multidimensional modelling of RDF data. Based on this model, Section 3 presents how OLAP operations are translated into SPARQL queries. Then, Section 4 describes the prototype developed to validate our approach. Section 5 discusses related work and Section 6 concludes the paper and discusses future work.

2 Constellation Model on RDF Data

The multidimensional manipulation of RDF data requires the definition of a conceptual model from which the OLAP operations will be specified. This section formally defines the elements of a multidimensional model in terms of RDF data described using the RDF Data Cube⁵, SKOS⁶ and RDFS⁷ vocabularies. This formalisation is based on a conceptual model that defines a constellation of facts and dimensions, the latter being composed of multiple hierarchies. A constellation groups several facts, which are studied according to several dimensions possibly shared between facts [12].

Before introducing the model, we consider an example of a constellation schema that will be used throughout the remainder of the paper. This schema is composed of the fact Sales, which has two measures, quantity and amount, and which can be analysed along three dimensions, Time, Product, and Geography. The Geography dimension is composed of two hierarchies: HGeo, a three-level hierarchy (composed of the attributes City, Region and Country), and HArea, a two-level hierarchy (City, Area). Figure 1 depicts this multidimensional schema using graphical notations. Note that HGeo is a non-covering hierarchy because some cities do not belong to any region, whereas each region and each city belongs to one country.

⁴ http://www.w3.org/TR/vocab-data-cube/

⁵ http://www.w3.org/TR/2012/WD-vocab-data-cube-20120405/
⁶ http://www.w3.org/2004/02/skos/
⁷ http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

A dimension models an analysis axis and is composed of attributes (also called parameters or levels). These attributes are organised according to one or several hierarchies within a dimension.

Definition 1. A dimension $D$ is defined as a 4-tuple $(?d, H^D, A^D, I^D)$, where

- ?d a qb:DimensionProperty;
- $H^D = \{?h_1^D, \dots, ?h_v^D\}$ is a set of hierarchies, where ?h a skos:ConceptScheme $\wedge$ ?d qb:codeList ?h;
- $A^D = \{?a_1^D, \dots, ?a_u^D\}$ is a set of attributes, where ?a a rdfs:Class $\wedge$ ?a rdfs:subClassOf skos:Concept $\wedge$ ($\exists\, ?h \in H^D$ : ?a skos:inScheme ?h);
- $I^D = \{I_1^D, \dots, I_p^D\}$ is a set of dimension instances, where $I_j^D = \{?i_{j1}^D, \dots, ?i_{jk}^D\}$ is a set of attribute instances, where ?i a ?a $\wedge$ ?a $\in A^D$ $\wedge$ $k \le |A^D|$.

Each hierarchy corresponds to a skos:ConceptScheme, and the attributes of dimensions are modelled as subclasses of skos:Concept and instances of rdfs:Class. Attributes represent levels of granularity within a dimension. As hierarchies in the version of the RDF Data Cube Vocabulary we use are defined as SKOS hierarchies, each hierarchical level refers to a skos:Concept in the SKOS hierarchy. In order to link the instances of attributes to each hierarchical level, defining the attributes as instances of rdfs:Class allows the property rdf:type to be used to state that an instance belongs to a specific level of the hierarchy⁸.

As stated above, hierarchies represent a particular vision (perspective) of a dimension, where each attribute represents one data granularity according to which measures can be analysed. Following the constellation model in [12], weak attributes (attributive properties) complete the parameter semantics. As one dimension may have multiple hierarchies of attributes, an attribute can have several direct ancestors, each one belonging to a specific hierarchy.

Definition 2. A hierarchy $H$ of a dimension $D$ is defined as a 3-tuple $(?h, P^H, Weak^H)$, where

- ?h a skos:ConceptScheme;
- $P^H = \langle ?p_1^H, \dots, ?p_v^H \rangle$ is an ordered set of attributes (parameters), representing levels of granularity within a dimension, where ?p a rdfs:Class $\wedge$ ?p rdfs:subClassOf skos:Concept $\wedge$ ?p skos:inScheme ?h $\wedge$ ?h skos:hasTopConcept ?p, where ?h skos:hasTopConcept $?x_{top}$ $\wedge$ $\exists\, C = \langle ?x_1, \dots, ?x_n \rangle$ : $n < |P^H|$ $\wedge$ $?x_1$ skos:broader $?x_2$ $\wedge$ $?x_2$ skos:broader $?x_3$ $\wedge \dots \wedge$ $?x_n$ skos:broader $?x_{top}$;
- $Weak^H : P^H \to 2^{A^D \setminus P^H}$ is a function associating each parameter with one or more weak attributes, where $Weak^H(?p) = \{?a_1, \dots, ?a_n\}$, with ?a a rdfs:Class $\wedge$ ?a rdfs:subClassOf skos:Concept $\wedge$ ?a skos:related ?p.

A fact reflects the information that has to be analysed according to dimensions and that is modelled through one or several indicators (measures).

Definition 3. A fact $F$ is defined as a 3-tuple $(?ds, M^F, I^F)$, where

- ?ds a qb:DataSet;
- $M^F = \{f_1(?m_1^F), \dots, f_w(?m_w^F)\}$ is a set of measures, each associated with an aggregate function $f$, where ?m a qb:MeasureProperty;
- $I^F = \{?i_1^F, \dots, ?i_q^F\}$ is a set of fact instances, where ?i is a 3-tuple $(?o, (?dv_1, ?dv_2, \dots, ?dv_n), (?mv_1, ?mv_2, \dots, ?mv_m))$, where ?o a qb:Observation $\wedge$ ?o qb:dataSet ?ds : $\forall i \in [1..n]$, $\exists\, ?d \in Star(F)$ : ?o ?d $?dv_i$ $\wedge$ $\forall j \in [1..m]$, $\exists\, ?m \in M^F$ : ?o ?m $?mv_j$.
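As an illustration only, the following Python sketch shows how the fact instances of Definition 3 could be collected from a list of triples. It is not part of the approach described in this paper: the in-memory triple list, the function name `fact_instances`, the IRIs `ex:o1`, `ex:o2`, `ex:p1`, `ex:Toulouse` and `ex:Other`, and the assumption that each observation carries a single value per property are all hypothetical.

```python
def fact_instances(triples, dataset, dimensions, measures):
    """Return (observation, dimension values, measure values) per observation.

    Assumption: each (subject, predicate) pair has a single object, so a
    plain dictionary per subject is enough for this sketch.
    """
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o

    instances = []
    for obs in sorted(by_subject):
        props = by_subject[obs]
        # Keep only observations attached to the requested data set.
        if props.get("rdf:type") != "qb:Observation":
            continue
        if props.get("qb:dataSet") != dataset:
            continue
        # Definition 3 requires a value for every linked dimension and measure;
        # a missing one raises KeyError here.
        dvs = tuple(props[d] for d in dimensions)
        mvs = tuple(props[m] for m in measures)
        instances.append((obs, dvs, mvs))
    return instances


# Hypothetical sample data: one Sales observation, one unrelated observation.
triples = [
    ("ex:o1", "rdf:type", "qb:Observation"),
    ("ex:o1", "qb:dataSet", "ex:Sales"),
    ("ex:o1", "ex:Products", "ex:p1"),
    ("ex:o1", "ex:Geography", "ex:Toulouse"),
    ("ex:o1", "ex:quantity", 3),
    ("ex:o2", "rdf:type", "qb:Observation"),
    ("ex:o2", "qb:dataSet", "ex:Other"),
]
sales = fact_instances(
    triples, "ex:Sales", ["ex:Products", "ex:Geography"], ["ex:quantity"]
)
print(sales)
```

The observation attached to another data set is skipped, so only the Sales observation is returned as a fact instance.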

⁸ An alternative to associating instances with SKOS concepts could consider XKOS [4], a recently proposed extension of SKOS that takes into account the representation of levels in concept schemes, semantic relations like isPartOf and instantiation of concepts, among other features. This will be further investigated in future work.

A constellation groups several facts, which are studied according to several dimensions, possibly shared between facts.

Definition 4. A constellation $Cs$ is defined as a 3-tuple $(F^{Cs}, D^{Cs}, Star^{Cs})$, where

- $F^{Cs} = \{?f_1, \dots, ?f_m\}$ is a set of facts, where ?f a qb:DataSet;
- $D^{Cs} = \{?d_1, \dots, ?d_n\}$ is a set of dimensions, where ?d a qb:DimensionProperty;
- $Star^{Cs} : F^{Cs} \to 2^{D^{Cs}}$ associates each fact with its linked dimensions, where ?f qb:structure ?cube $\wedge$ ?cube a qb:DataStructureDefinition $\wedge$ ?cube qb:component ?compD $\wedge$ ?compD qb:dimension ?d $\wedge$ ($\forall\, ?m \in M^f$, $\exists\, ?compM$ : ?cube qb:component ?compM $\wedge$ ?compM qb:measure ?m).
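To make the definitions above more concrete, the following Python sketch encodes the running Sales example as plain data classes. It is only a hedged illustration: the class and method names, the IRIs `ex:Time`, `ex:amount`, `ex:HArea`, `ex:City`, `ex:Region`, `ex:Country` and `ex:Area`, the empty hierarchy lists for Time and Products, and the choice of SUM for both measures are assumptions that are not stated in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    iri: str            # ?h, a skos:ConceptScheme
    parameters: tuple   # P^H, ordered from the finest level to the coarsest


@dataclass
class Dimension:
    iri: str            # ?d, a qb:DimensionProperty
    hierarchies: tuple  # H^D


@dataclass
class Fact:
    iri: str            # ?ds, a qb:DataSet
    measures: dict      # M^F: measure IRI -> aggregate function name


@dataclass
class Constellation:
    facts: dict = field(default_factory=dict)       # F^Cs, keyed by IRI
    dimensions: dict = field(default_factory=dict)  # D^Cs, keyed by IRI
    star: dict = field(default_factory=dict)        # Star^Cs: fact IRI -> dimension IRIs

    def link(self, fact, dimension):
        """Register a fact, a dimension, and the link between them."""
        self.facts[fact.iri] = fact
        self.dimensions[dimension.iri] = dimension
        self.star.setdefault(fact.iri, set()).add(dimension.iri)

    def star_of(self, fact_iri):
        """Return the dimensions linked to a fact, sorted by IRI."""
        return sorted(self.star.get(fact_iri, set()))


h_geo = Hierarchy("ex:HGeo", ("ex:City", "ex:Region", "ex:Country"))
h_area = Hierarchy("ex:HArea", ("ex:City", "ex:Area"))
geography = Dimension("ex:Geography", (h_geo, h_area))
time_dim = Dimension("ex:Time", ())      # hierarchies not detailed in the text
products = Dimension("ex:Products", ())  # hierarchies not detailed in the text
sales = Fact("ex:Sales", {"ex:quantity": "SUM", "ex:amount": "SUM"})

cs = Constellation()
for dim in (time_dim, products, geography):
    cs.link(sales, dim)
print(cs.star_of("ex:Sales"))
```

Keeping Star as a separate mapping, rather than a list stored inside each fact, mirrors the definition and lets several facts share the same dimension object.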

3 Translating OLAP Operations into SPARQL

Based on the constellation model presented above, we define a SPARQL query mechanism for performing OLAP operations directly on the RDF data. This mechanism is based on the OLAP algebra defined in [12]. This algebra is a procedural query language that provides a set of elementary operators from which more complex operations can be specified. It is based on a multidimensional table (MT) which displays data from one fact and two of its linked dimensions.

Definition 5 (Multidimensional table (MT) [12]). An MT is a 4-tuple $(S, L, C, R)$ where

- $S = (F_S, M_S)$ represents the analysed subject through a fact $F_S \in F^{Cs}$ and a set of projected measures $M_S$;
- $L = (D_L, H_L, P_L)$ represents the horizontal analysis axis, where $P_L = \langle p^{H_L}_{max}, \dots, p^{H_L}_{min} \rangle$, $H_L \in H^{D_L}$, $D_L \in Star^{Cs}(F_S)$, and $H_L$ is the current hierarchy of $D_L$;
- $C = (D_C, H_C, P_C)$ represents the vertical analysis axis, where $P_C = \langle p^{H_C}_{max}, \dots, p^{H_C}_{min} \rangle$, $H_C \in H^{D_C}$, $D_C \in Star^{Cs}(F_S)$, and $H_C$ is the current hierarchy of $D_C$;
- $R = pred_1 \wedge \dots \wedge pred_t$ is a normalised conjunction of predicates (restrictions of dimension data and fact data).
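The structure of a multidimensional table can be sketched as a small immutable record. The Python below is a hypothetical illustration rather than the data structure used in the prototype: the class names, the helper `axes_are_linked`, and the IRIs `ex:HProd`, `ex:Product`, `ex:City` and `ex:Time` are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Axis:
    dimension: str               # D_L or D_C
    hierarchy: str               # current hierarchy H_L or H_C
    parameters: Tuple[str, ...]  # displayed parameters P_L or P_C


@dataclass(frozen=True)
class MultidimensionalTable:
    fact: str                              # F_S
    measures: Tuple[Tuple[str, str], ...]  # M_S as (aggregate function, measure IRI)
    line: Axis                             # L, horizontal axis
    column: Axis                           # C, vertical axis
    restriction: Tuple[str, ...] = ()      # R, conjunction of predicates


def axes_are_linked(mt, star):
    """Check that D_L and D_C both belong to Star(F_S)."""
    linked = star.get(mt.fact, set())
    return mt.line.dimension in linked and mt.column.dimension in linked


# Star mapping of the running example (the Time IRI is assumed).
star = {"ex:Sales": {"ex:Time", "ex:Products", "ex:Geography"}}
mt = MultidimensionalTable(
    fact="ex:Sales",
    measures=(("SUM", "ex:quantity"),),
    line=Axis("ex:Products", "ex:HProd", ("ex:Product",)),
    column=Axis("ex:Geography", "ex:HGeo", ("ex:City",)),
)
print(axes_are_linked(mt, star))
```

Making the record immutable fits the algebra, where every operator returns a new table instead of modifying its input.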

The algebraic operators take as input a source MT, noted $MT_{SRC} = (S_{SRC}, L_{SRC}, C_{SRC}, R_{SRC})$, and produce an output MT, noted $MT_{RES} = (S_{RES}, L_{RES}, C_{RES}, R_{RES})$. Each $MT_{RES}$ can be further manipulated using operators of the same algebra. In the scope of this paper, we focus on the minimal core of operators, namely DISPLAY (for defining a first MT), DRILLDOWN and ROLLUP (for moving the analysis detail along a hierarchy), SELECT (for selecting data of a multidimensional schema), and ROTATE (for replacing an analysis axis by another one). We assume that a single data set is queried. For the formal definition of each OLAP operator, the reader can refer to [12]. Here, we describe how each operation is expressed in terms of SPARQL queries. Aggregations and optimisations are out of the scope of this paper.

As Figure 2 depicts, an initial multidimensional table MT is built from a constellation $Cs$ using the operator DISPLAY, where DISPLAY($F_S$, $M_S$, $D_L$, $H_L$, $D_C$, $H_C$) = $MT_{RES}$, with $MT_{RES} = (S_{RES}, L_{RES}, C_{RES}, R_{RES})$. This operation displays the root parameters of each hierarchy (where all observations refer to the lowest level of the dimension hierarchy, i.e., root level or level 1). The process of generating a SPARQL query (Table 1) corresponding to DISPLAY considers the following set of nested operations:

1. Identify the instances of the fact $F_S$ to be displayed;
2. Retrieve the values of the root parameters $P_{L1}$ (attribute on the horizontal analysis axis) and $P_{C1}$ (attribute on the vertical analysis axis) of the dimensions $D_L$ and $D_C$, respectively;
3. Retrieve the value $mv_i$ of each measure $m_i$;
4. Group the measure values $mv_i$ by $P_{L1}$ and $P_{C1}$;
5. Calculate the aggregations by applying the corresponding aggregation functions $Agg_i$ to the measure values $mv_i$.

Table 1. SPARQL query for DISPLAY: generic form, followed by an example on the Sales fact.

    SELECT ?PL1 ?PC1 (Agg_i(?mv_i) AS ?mes_i)
    WHERE {
      ?obs rdf:type qb:Observation .
      ?obs qb:dataSet IRI(F_S) .
      ?obs IRI(D_L) ?PL1 .
      ?obs IRI(D_C) ?PC1 .
      ?obs IRI(m_i) ?mv_i .
    }
    GROUP BY ?PL1 ?PC1

    SELECT ?prodId ?city (SUM(?qty) AS ?qtySales)
    WHERE {
      ?obs rdf:type qb:Observation .
      ?obs qb:dataSet ex:Sales .
      ?obs ex:Products ?prodId .
      ?obs ex:Geography ?city .
      ?obs ex:quantity ?qty .
    }
    GROUP BY ?prodId ?city
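The DISPLAY steps can be sketched as a small query builder. The Python below is a minimal illustration under stated assumptions, not the SPARQL Generator of the prototype: the function name `display_query` and the variable names `?pl1`, `?pc1`, `?mv` and `?mes` are hypothetical, a single measure is handled for brevity, PREFIX declarations are omitted, and the data set property is written `qb:dataSet` as in the W3C vocabulary.

```python
def display_query(fact, measure, agg, dim_line, dim_col):
    """Build the SPARQL text of DISPLAY on the root level of both axes."""
    return "\n".join([
        f"SELECT ?pl1 ?pc1 ({agg}(?mv) AS ?mes)",
        "WHERE {",
        "  ?obs rdf:type qb:Observation .",  # step 1: fact instances
        f"  ?obs qb:dataSet {fact} .",
        f"  ?obs {dim_line} ?pl1 .",         # step 2: root parameter of D_L
        f"  ?obs {dim_col} ?pc1 .",          # step 2: root parameter of D_C
        f"  ?obs {measure} ?mv .",           # step 3: measure value
        "}",
        "GROUP BY ?pl1 ?pc1",                # steps 4 and 5: group, then aggregate
    ])


query = display_query(
    "ex:Sales", "ex:quantity", "SUM", "ex:Products", "ex:Geography"
)
print(query)
```

Called with the IRIs of the running example, the builder yields a query with the same shape as the example of Table 1, up to variable names.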

From an $MT_{RES}$ resulting from a DISPLAY operation, the operations ROLLUP and DRILLDOWN modify the analysis precision by manipulating the hierarchical levels of the dimensions. In ROLLUP($MT_{SRC}$, $D$, $Lvl_{sup}$) = $MT_{RES}$, $D \in \{D_L, D_C\}$ is the dimension on which the operation is applied and $Lvl_{sup}$ is a coarser-granularity level used in $MT_{RES}$, where $MT_{RES} = (S_{SRC}, L_{RES}, C_{RES}, R_{SRC})$. Conversely, for DRILLDOWN($MT_{SRC}$, $D$, $Lvl_{inf}$) = $MT_{RES}$, $D$ is the dimension on which the operation is applied, $Lvl_{inf}$ is a lower attribute in the current hierarchy $H$ of $D$, and $MT_{RES} = (S_{SRC}, L_{RES}, C_{RES}, R_{SRC})$. For the SPARQL query generation, the operation consists of positioning the hierarchical level of the current hierarchy with respect to a parameter of level $n-1$ (by means of skos:broader). As an example, consider ROLLUP($MT_{SRC}$, $D_L$, $Lvl_{sup}$), supposing that for the dimension $D_C^{MT_{SRC}}$, the hierarchy $H_C^{MT_{SRC}}$ is positioned at the root parameter. The operations to be performed are the following (Table 2):

1. Identify the instances of the fact $F^{MT_{SRC}}$ in $MT_{SRC}$ to be displayed;
2. Retrieve the values of the root parameters $P_{L1}$ and $P_{C1}$ of the dimensions $D_L^{MT_{SRC}}$ and $D_C^{MT_{SRC}}$;
3. Retrieve the value $mv_i$ of each measure $m_i^{MT_{SRC}}$;
4. From $P_{L1}$, move upward in the hierarchy $H_L^{MT_{SRC}}$ until reaching the level $Lvl_{sup}$ (level $n$) through the predicate skos:broader, ensuring that each new parameter accessed is part of the hierarchy $H_L^{MT_{SRC}}$ (via the predicate skos:inScheme). This allows for navigating through the correct hierarchy in the case where the dimension has multiple hierarchies. Hence, reaching the parameter of level $n$ requires navigating through $n-1$ parameters;
5. Retrieve the value $mv_i$ of each measure $m_i^{MT_{SRC}}$;
6. Group the measure values $mv_i$ by $P_{C1}$ and $Lvl_{sup}$;
7. Calculate the aggregations by applying the corresponding aggregation functions $Agg_i$ to the measure values $mv_i$;
8. Display the aggregated measure values with the values of $P_{C1}$ and $Lvl_{sup}$.

Table 2. SPARQL query for ROLLUP: generic form, followed by an example on the Sales fact.

    SELECT ?PC1 ?PSup (Agg_i(?mv_i) AS ?mes_i)
    WHERE {
      ?obs rdf:type qb:Observation .
      ?obs qb:dataSet IRI(F_MTSRC) .
      ?obs IRI(DC_MTSRC) ?PC1 .
      ?obs IRI(DL_MTSRC) ?PL1 .
      ?PL1 skos:broader ?PL2 .
      ?PL2 skos:inScheme IRI(HL_MTSRC) .
      ...
      ?PL{n-1} skos:broader ?PSup .
      ?PSup skos:inScheme IRI(HL_MTSRC) .
      ?PSup rdf:type IRI(Lvl_sup) .
      ?obs IRI(m_i_MTSRC) ?mv_i .
    }
    GROUP BY ?PC1 ?PSup

    SELECT ?prodId ?region (SUM(?qty) AS ?qtySales)
    WHERE {
      ?obs rdf:type qb:Observation .
      ?obs qb:dataSet ex:Sales .
      ?obs ex:Products ?prodId .
      ?obs ex:Geography ?city .
      ?city skos:broader ?region .
      ?region skos:inScheme ex:HGeo .
      ?region rdf:type ex:region .
      ?obs ex:quantity ?qty .
    }
    GROUP BY ?prodId ?region
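The navigation through n-1 skos:broader links can be sketched as a loop that emits one pair of triple patterns per hop. The Python below is an illustrative sketch only: the function name `rollup_query`, its parameters and the variable names `?pl1` to `?pln` are assumptions, a single measure is handled, PREFIX declarations are omitted, and the data set property is written `qb:dataSet` as in the W3C vocabulary.

```python
def rollup_query(fact, measure, agg, dim_col, dim_line, hierarchy, level, n):
    """Build a ROLLUP query moving the line axis from level 1 up to level n."""
    if n < 2:
        raise ValueError("ROLLUP needs a target level n >= 2")
    lines = [
        f"SELECT ?pc1 ?pl{n} ({agg}(?mv) AS ?mes)",
        "WHERE {",
        "  ?obs rdf:type qb:Observation .",
        f"  ?obs qb:dataSet {fact} .",
        f"  ?obs {dim_col} ?pc1 .",
        f"  ?obs {dim_line} ?pl1 .",
    ]
    # One skos:broader hop per level, n - 1 hops in total; skos:inScheme keeps
    # the navigation inside the current hierarchy.
    for k in range(1, n):
        lines.append(f"  ?pl{k} skos:broader ?pl{k + 1} .")
        lines.append(f"  ?pl{k + 1} skos:inScheme {hierarchy} .")
    lines += [
        f"  ?pl{n} rdf:type {level} .",
        f"  ?obs {measure} ?mv .",
        "}",
        f"GROUP BY ?pc1 ?pl{n}",
    ]
    return "\n".join(lines)


# City -> region, as in the example of Table 2 (n = 2, one hop).
query = rollup_query(
    "ex:Sales", "ex:quantity", "SUM",
    "ex:Products", "ex:Geography", "ex:HGeo", "ex:region", 2,
)
print(query)
```

With n = 2 the loop runs once, which reproduces the single city-to-region hop of the example; with n = 3 it would add a second hop towards the country level.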

Note that for the DRILLDOWN operation, the evaluation principle is similar to that of ROLLUP. When manipulating the hierarchies within dimensions, it is important to take into account a special case where, at the instance level, some hierarchical levels have no associated instance. This is the case of non-covering hierarchies. For example, in the geographical hierarchy HGeo, data at the instance level may contain some cities which are not associated with any region or state (for instance, Vatican City is considered a city-state which is not associated with a region). Suppose that HGeo is a non-covering hierarchy, and one wants to analyse product sales by country. The aggregations then have to be done by country regardless of the actual hierarchical level on which the countries are positioned, i.e., (a) countries positioned at the third level (instances that respect the hierarchy specification in the scheme); (b) countries positioned at the second level (countries which have a state or a city but not both); (c) countries positioned at the first level (countries without states and cities, like Monaco). To cover all these cases, we combine the corresponding graph patterns using the SPARQL UNION operator. An example of a query that takes non-covering hierarchies into account, derived from the previous ROLLUP query (ROLLUP($MT_{SRC}$, $D_L$, $Lvl_{sup}$), where $Lvl_{sup}$ is the parameter of level $n$ in the scheme), is presented in Table 3.

Table 3. SPARQL query for ROLLUP on a non-covering hierarchy: generic form, followed by an example on the Sales fact.

    SELECT ?PC1 ?PSup (Agg_i(?mv_i) AS ?mes_i)
    WHERE {
      ?obs rdf:type qb:Observation .
      ?obs qb:dataSet IRI(F_MTSRC) .
      ?obs IRI(m_i_MTSRC) ?mv_i .
      ?PC1 rdf:type IRI(PC1) .
      ?obs IRI(DC_MTSRC) ?PC1 .
      ?PSup rdf:type IRI(Lvl_sup) .
      { ?obs IRI(DL_MTSRC) ?PL1 .
        ?PL1 skos:broader ?PL2 .
        ?PL2 skos:inScheme IRI(HL_MTSRC) .
        ...
        ?PL{n-1} skos:broader ?PSup . }
      UNION
      { ?obs IRI(DL_MTSRC) ?PL1 .
        ?PL1 skos:broader ?PL2 .
        ?PL2 skos:inScheme IRI(HL_MTSRC) .
        ...
        ?PL{n-2} skos:broader ?PSup . }
      ...
      UNION
      { ?obs IRI(DL_MTSRC) ?PSup . }
    }
    GROUP BY ?PC1 ?PSup

    SELECT ?prodId ?country (SUM(?qty) AS ?SalesQty)
    WHERE {
      ?obs rdf:type qb:Observation .
      ?obs qb:dataSet ex:Sales .
      ?obs ex:quantity ?qty .
      ?obs ex:Products ?prodId .
      ?country rdf:type ex:Country .
      { ?obs ex:Geography ?geo1 .
        ?geo1 skos:broader ?geo2 .
        ?geo2 skos:inScheme ex:HGeo .
        ?geo2 skos:broader ?country . }
      UNION
      { ?obs ex:Geography ?geo1 .
        ?geo1 skos:broader ?country . }
      UNION
      { ?obs ex:Geography ?country . }
    }
    GROUP BY ?prodId ?country
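The UNION pattern for non-covering hierarchies can be generated by emitting one alternative per possible path length, from n-1 skos:broader hops down to zero. The Python sketch below illustrates this idea under stated assumptions: the function names `branch` and `noncovering_rollup_query`, their parameters and the variable names `?pl1`, ..., `?sup` are hypothetical, a single measure is handled, PREFIX declarations are omitted, and the data set property is written `qb:dataSet` as in the W3C vocabulary.

```python
def branch(dim_line, hierarchy, hops):
    """One alternative path: `hops` skos:broader steps from the observation to ?sup."""
    if hops == 0:
        # The observation refers directly to the target level.
        return f"{{ ?obs {dim_line} ?sup . }}"
    names = [f"?pl{k}" for k in range(1, hops + 1)] + ["?sup"]
    parts = [f"?obs {dim_line} ?pl1 ."]
    for child, parent in zip(names, names[1:]):
        parts.append(f"{child} skos:broader {parent} .")
        if parent != "?sup":
            # Intermediate parameters must stay in the current hierarchy.
            parts.append(f"{parent} skos:inScheme {hierarchy} .")
    return "{ " + " ".join(parts) + " }"


def noncovering_rollup_query(fact, measure, agg, dim_col, dim_line,
                             hierarchy, level, n):
    """ROLLUP to level n, accepting instances that skip intermediate levels."""
    branches = [branch(dim_line, hierarchy, hops) for hops in range(n - 1, -1, -1)]
    return "\n".join([
        f"SELECT ?pc1 ?sup ({agg}(?mv) AS ?mes)",
        "WHERE {",
        "  ?obs rdf:type qb:Observation .",
        f"  ?obs qb:dataSet {fact} .",
        f"  ?obs {measure} ?mv .",
        f"  ?obs {dim_col} ?pc1 .",
        f"  ?sup rdf:type {level} .",
        "  " + "\n  UNION\n  ".join(branches),
        "}",
        "GROUP BY ?pc1 ?sup",
    ])


# Sales by country on HGeo (n = 3), as in the example of Table 3.
query = noncovering_rollup_query(
    "ex:Sales", "ex:quantity", "SUM",
    "ex:Products", "ex:Geography", "ex:HGeo", "ex:Country", 3,
)
print(query)
```

For n = 3 the builder produces three alternatives (two hops, one hop, no hop), matching cases (a), (b) and (c) discussed above.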

To change the analysis criteria, the ROTATE operation replaces one analysis axis by another, or the current hierarchy by another one within the same dimension: ROTATE($MT_{SRC}$, $D_{old}$, $D_{new}$, $H_k^{D_{new}}$) = $MT_{RES}$, where $D_{old} \in \{D_L, D_C\}$ is the dimension on which the operation is applied, $D_{new}$ is the dimension replacing $D_{old}$, $H_k^{D_{new}}$ is the current hierarchy of $D_{new}$, and $MT_{RES} = (S_{SRC}, L_{RES}, C_{RES}, R_{SRC})$. $D_{old} = D_{new}$ when only the current hierarchy is to be replaced. The hierarchical level of the modified axis corresponds to the root parameter of this hierarchy. The process of generating the SPARQL query for this operation is similar to that of DISPLAY, except that the unmodified axis keeps its current level of detail. When a hierarchy rotation is performed, the corresponding axis is positioned on the root level of the new hierarchy.
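The effect of ROTATE on the table structure can be sketched as follows. This Python fragment is a hypothetical illustration rather than the prototype's code: the simplified `Axis` and `MT` records, the function `rotate`, and the IRIs `ex:HProd`, `ex:Time`, `ex:HTime` and `ex:HArea` are assumptions, and levels are represented by an integer where 1 denotes the root parameter.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Axis:
    dimension: str
    hierarchy: str
    level: int = 1  # 1 = root parameter of the current hierarchy


@dataclass(frozen=True)
class MT:
    fact: str
    line: Axis
    column: Axis


def rotate(mt, d_old, d_new, h_new):
    """Replace the axis on d_old by (d_new, h_new), positioned at the root level.

    The other axis is left untouched, so it keeps its current level of detail.
    """
    new_axis = Axis(d_new, h_new, level=1)
    if mt.line.dimension == d_old:
        return replace(mt, line=new_axis)
    if mt.column.dimension == d_old:
        return replace(mt, column=new_axis)
    raise ValueError(f"{d_old} is not a displayed axis")


mt = MT(
    "ex:Sales",
    line=Axis("ex:Products", "ex:HProd"),
    column=Axis("ex:Geography", "ex:HGeo", level=2),
)
# Dimension rotation: Products is replaced by Time on the line axis.
by_time = rotate(mt, "ex:Products", "ex:Time", "ex:HTime")
# Hierarchy rotation: same dimension, HGeo replaced by HArea.
by_area = rotate(mt, "ex:Geography", "ex:Geography", "ex:HArea")
print(by_time)
print(by_area)
```

The rotated table can then be passed to a DISPLAY-like query builder, using the recorded level for the unmodified axis.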

Finally, the SELECT operation (i.e., SLICE and DICE in the common OLAP terminology) removes the data which do not satisfy a condition. A condition can be applied to dimension attribute values or to fact measure values: SELECT($MT_{SRC}$, $pred$) = $MT_{RES}$, where $pred = pred_1 \wedge \dots \wedge pred_t$ is a selection predicate on the fact $F_S$ and/or its linked dimensions $D_i \in Star^{Cs}(F_S)$. The SPARQL query implementing this operation can be obtained by integrating the restriction condition into the query producing the initial multidimensional table ($MT_{SRC}$). This can be achieved using the SPARQL FILTER operator, which applies a logical condition to filter the results of queries.
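Integrating a restriction into an existing query can be sketched as a simple text transformation that places a FILTER before the closing brace of the WHERE clause. The Python below is a minimal sketch under stated assumptions: the function name `select_query`, the instance IRI `ex:Toulouse` and the sample predicates are hypothetical, the base query is assumed to contain no brace after the one closing its WHERE clause, and the filter is applied to individual observation values rather than to aggregated results.

```python
def select_query(base_query, predicates):
    """Add FILTER (pred_1 && ... && pred_t) at the end of the WHERE clause."""
    if not predicates:
        return base_query
    # Split on the last closing brace, assumed to close the WHERE clause.
    head, sep, tail = base_query.rpartition("}")
    if not sep:
        raise ValueError("query has no closing brace")
    condition = " && ".join(f"({p})" for p in predicates)
    return f"{head}  FILTER ({condition})\n}}{tail}"


base = "\n".join([
    "SELECT ?prodId ?city (SUM(?qty) AS ?qtySales)",
    "WHERE {",
    "  ?obs rdf:type qb:Observation .",
    "  ?obs qb:dataSet ex:Sales .",
    "  ?obs ex:Products ?prodId .",
    "  ?obs ex:Geography ?city .",
    "  ?obs ex:quantity ?qty .",
    "}",
    "GROUP BY ?prodId ?city",
])
filtered = select_query(base, ["?city = ex:Toulouse", "?qty > 10"])
print(filtered)
```

Each predicate is wrapped in parentheses before being joined with `&&`, so that the conjunction stays well-formed whatever operators the individual predicates contain.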

4 Prototype and experiments

In this section, we present the prototype we have implemented to validate the conceptual approach, and the experiments carried out using this prototype. The prototype has been implemented using the Microsoft .NET Framework and the dotNetRDF API⁹. To allow the system to load the multidimensional schema modelling the analysis needs, it is necessary to specify the data structure definitions (qb:DataStructureDefinition) and the hierarchical specification of dimensions according to the model described above. This operation (step 1) is carried out by the first principal module, the Multidimensional Model Loader. Once the schema is loaded, the manipulation of OLAP queries becomes possible (step 2). The second module is the SPARQL Generator. It is responsible for generating SPARQL queries from the OLAP manipulations specified as input (step 3). Using the dotNetRDF API, the generated SPARQL queries are run on the data set (step 4). Query results are then presented to the user (step 5). The RDF/XML, Turtle, and N3 syntaxes are supported by the prototype. Figure 3 shows a screenshot of our prototype. The user interface allows the user to specify the analysis needs and shows the results returned after querying the data sources. The generated SPARQL queries are also shown to the user together with their processing time.

⁹ http://www.dotnetrdf.org/

We have conducted our experiments on an RDF data set about the annual producer price of industrial products (CA 1996) from the Statistical Office of the Republic of Serbia¹⁰, which has 789 instances of attributes and 156 observations. It contains one temporal dimension and one geographical dimension (about Serbia). The hierarchical structure of the data is given in associated dictionaries providing a temporal hierarchy and information on regions and municipalities of Serbia. Initially, observation values (measures) are given at the Year level for the temporal dimension and at the country level for the geographical dimension (only one country: Serbia). We have modified the observation values according to different levels of the dimension hierarchies in order to carry out our experiments on manipulating hierarchies and to make some non-covering data available. This first data set contains only two dimensions; however, OLAP operations such as ROTATE require more than two dimensions for performing dimension rotations, and more than one hierarchy per dimension for performing hierarchy rotations. Hence, a second data set has been used. It has 69888 observations and 1191 instances of attributes. This data set has been obtained by converting a synthetic multidimensional database generated in our team to RDF format. The conversion tool is currently limited to this specific database. We plan to develop a generic version of it to handle any multidimensional data set.

¹⁰ http://wiki.planet-data.eu/web/Annual producer price indicatores of industrial products CA 1996 from Statistical Office of the Republic of Serbia

The evaluation of our approach has been limited to the manipulation of OLAP operations on the two data sets described above. More specifically, we evaluated the adequacy and the correctness of the results from a sequence of applied operations. Starting from a DISPLAY operation, we are able to correctly aggregate data according to a specific level of granularity, rotate dimensions and hierarchies, or select parts of a multidimensional table. We did not evaluate the runtime of the operations with respect to the number of observations in the data sets. Further evaluation on different data sets has to be carried out.

5 Related work

Traditionally, OLAP analysis operates on data obtained from multiple and heterogeneous sources. In order to organise data coming from these sources in a multidimensional structure, a pipeline of extraction, transformation and loading (ETL) is usually carried out. In [6], an ETL module transforms RDF data (described using the RDF Data Cube Vocabulary) into a multidimensional model. The resulting structure is further manipulated using the Mondrian OLAP system and MDX queries. Hence, the OLAP manipulations are performed on the multidimensional data source and not directly on the RDF data collection. The drawback of this approach is that the ETL process has to be repeated in order to propagate changes in the raw data. In order to overcome this drawback, further proposals [14, 3, 8] deal with the direct manipulation of RDF data via SPARQL queries. In [14], an approach for online analysis of RDF triples has been proposed, where a specific storage system has been designed. The aim is to efficiently manage large collections of RDF data. A specific query mechanism extends SPARQL, but multidimensional modelling of RDF data is not considered in this solution. In [3], the authors define an RDF vocabulary called Open Cubes (OC) for the multidimensional modelling of RDF data. OC provides a set of classes and properties to model the different structures of the multidimensional model (dimensions, attributes, measures), including hierarchical relationships between attributes of dimensions. From RDF collections described using OC, different OLAP manipulations can be performed directly via queries expressed in SPARQL. Although this solution is based on a multidimensional modelling of RDF data and allows OLAP operations to be expressed in terms of SPARQL, its main limitation is the use of a specific and non-standard vocabulary. The RDF Data Cube Vocabulary, in contrast, is supported by the W3C, which explains its use by several collections of published statistical data intended for subsequent analysis.

Tools for supporting the publication of these statistical data have been proposed. This is the case of OLAP2Cube and CSV2Cube presented in [13], which allow the generation of RDF collections using the RDF Data Cube Vocabulary from multidimensional databases implemented on relational data. In [15], a process for identifying data sources for publishing statistical linked data following the RDF Data Cube Vocabulary is also presented. Domain ontologies are used to provide a semantic meaning to the data cube.

Comparing the two main categories of approaches presented above, the approaches based on direct manipulation of RDF data ([14, 3, 8]) are more advantageous in terms of flexibility and adaptation to the specificities of data published on the Web. However, the main drawback of [14] and [3] is related to the use of non-standard vocabularies. With regard to the work described in [8], although it is based on the RDF Data Cube Vocabulary, it takes into account neither hierarchical structures with multiple levels nor multiple hierarchies within a dimension. Our approach supports these hierarchical notions. Contrary to [7], we do not implement any kind of aggregation for optimising query execution.

6 Final remarks and future work

This paper has presented a formalisation of a multidimensional model in terms of RDF data described using the RDF Data Cube, SKOS and RDFS vocabularies. On the basis of this model, we defined a mechanism for translating common OLAP operations into SPARQL queries. Two important aspects addressed in this paper are the ability to represent multiple hierarchies in a dimension and the ability to handle cases where the hierarchical structures are not fully covered at the instance level (a common case in real data). We implemented a prototype in order to experiment with and validate our proposal. A weak point of our work is the evaluation, which has been mainly based on the correctness of query results from a sequence of applied OLAP operations. There are several opportunities for future work. First, we plan to study the ability to express, through SPARQL queries, more advanced OLAP manipulations using the operators described in [5]. Second, we intend to focus on performance and optimisation of query execution by means of pre-aggregations. Third, RDF data basically represent resources referenced by links on the Web. A point to study would be the ability to exploit these interconnections between resources in order to automatically associate more data and extend the information available in the initial data sources. Finally, XKOS could be further investigated in our formalisation.

References

1. Chaudhuri and Dayal. An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1):65-74, Mar. 1997.
2. E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate, 1993.
3. Etcheverry and A. A. Vaisman. Enhancing OLAP Analysis with Web Cubes. In ESWC, pages 469-483, 2012.
4. Gillman, Cotton, and Jaques. eXtended Knowledge Organization System (XKOS). METIS, Work Session on Statistical Metadata, Geneva, May 2013.
5. Gray, Chaudhuri, Bosworth, Layman, Reichart, Venkatrao, Pellow, and Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1):29-53, 1997.
6. Kämpgen and Harth. Transforming statistical linked data for use in OLAP systems. In I-SEMANTICS, pages 33-40, 2011.
7. Kämpgen and Harth. No Size Fits All? Running the Star Schema Benchmark with SPARQL and RDF Aggregate Views. In ESWC, pages 290-304, 2013.
8. Kämpgen, S. O'Riain, and Harth. Interacting with Statistical Linked Data via OLAP Operations. In Proceedings of Interacting with Linked Data (ILD 2012), Workshop co-located with the 9th ESWC, pages 36-49, 2012.
9. Kimball. The data warehouse toolkit: practical techniques for building dimensional data warehouses. John Wiley & Sons, Inc., New York, NY, USA, 1996.
10. Malinowski and E. Zimányi. OLAP Hierarchies: A Conceptual Perspective. In Advanced Information Systems Engineering, pages 477-491. 2004.
11. T. B. Pedersen, C. S. Jensen, and C. E. Dyreson. A foundation for capturing and querying complex multidimensional data. Inf. Syst., 26(5):383-423, July 2001.
12. Ravat, Teste, Tournier, and G. Zurfluh. Algebraic and graphic languages for OLAP manipulations. IJDWM, 4(1):17-46, 2008.
13. P. E. R. Salas, F. M. D. Mota, K. K. Breitman, M. A. Casanova, Martin, and Auer. Publishing statistical data on the web. Int. J. Semantic Computing, 6(4):373-388, 2012.
14. B. Wu, D. Wu, H. Zhong, H. Jin, P. Yuan, and P. Liu. Scalable online analysis of semantic web data. Semantic Web Challenge, 2010.
15. Zancanaro, L. D. Pizzol, R. de Moura Speroni, J. L. Todesco, and F. O. Gauthier. Publishing multidimensional statistical linked data. In Proceedings of the Fifth International Conference on Information, Process, and Knowledge Management, pages 290-304, 2013.