<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Containment and Complementarity Relationships in Multidimensional Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marios Meimaris</string-name>
          <email>m.meimaris@imis.athena-innovation.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Papastefanatos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for the Management of Information Systems, Research Center “Athena”</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked Open Data (LOD) cloud can act as a source of remote multidimensional datasets which are seemingly disparate, but are modeled under common directives and thus often share a common meta-model, dimensions and measures, as well as external codelists. This gives them a latent measure of relatedness that is independent of the publishers' initial intentions, but a derivative of the motivations behind LOD. In this paper we identify the constituents of relatedness between multidimensional LOD data points (observations) modeled with the Data Cube vocabulary, that often exhibit overlapping values both at the schema and at the data level. Treating hierarchies as first-class citizens, we consider observation relatedness in two aspects, namely containment and complementarity, for which we provide formal definitions and representational semantics. Finally, we present a methodology for computing these types of relatedness and we provide an evaluation over real-world datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Data Cube</kwd>
        <kwd>Multidimensional Data</kwd>
        <kwd>Data Enrichment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, more and more bodies such as governments, statistical authorities, public
and private organizations, research and health centers publish information in the form
of multidimensional Linked Open Data (LOD) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] in very different domains, such
as census and statistical data, socioeconomic and demographic indicators, clinical
trials and health data, environmental and finance data. The abundance of such
datasets enables third parties to access, exploit and combine published data,
eventually leading to the generation of new quantifiable insights and knowledge capable of
influencing policy-building and decision-making.
      </p>
      <p>
        Modelling-wise, multidimensional data are traditionally represented as cubes of
observations that are instantiated over a fixed set of dimensions and measures. In the
LOD paradigm, W3C has proposed the Data Cube Vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a recommendation
for modelling and publishing multidimensional datasets. However, one difference
between closed-world multidimensional data stores, such as OLAP databases, and
LOD datasets is that proper LOD publishing techniques across remote data providers
will lead to the existence of common or shared terms between seemingly independent
multidimensional datasets, given that reusability and interoperability are prime drivers
in the semantic web. Practically, this means that ontologies, codelists and hierarchies
that are commonly used in the LOD cloud are likely to appear across different and
disparate LOD multidimensional datasets.
provide extended/enriched information. More specifically, we extend the notion of
schema complement, defined in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and apply it at the instance level of observations
in order to annotate whether two observations can complement each other’s
information. Then, we provide a technique for computing these properties as follows: first
all observations are placed in an occurrence matrix that represents them as data
vectors in a multidimensional feature space encoding both schema and data information
along with dimension hierarchies. The occurrence matrix is used to compute the
complementarity matrix, which enables us to derive complementarity between
observations. Then, the occurrence matrix is transformed into a set of k containment
matrices of size |O|×|O|, where k is the number of distinct dimensions. Full and partial containment
for all pairs of observations are obtained by adding all matrices together.
      </p>
      <p>Contributions. In short, the contributions of this work are as follows: (a) We
define new relationships between individual observations and extend the Data Cube
terms with properties for representing: full and partial observation containment
between two observations as derivatives of the hierarchical relationships between their
dimension values, and observation complementarity as a means of comparison and
correlation of different measures; (b) we provide an efficient technique for computing
these properties based on occurrence and similarity matrices; finally (c) we evaluate
our techniques over real-world multidimensional datasets.</p>
      <p>This paper is structured as follows: section 2 provides background knowledge and
related work, section 3 defines the new properties, section 4 provides the techniques
for computing them, while section 5 proposes possible extensions to Data Cube
Vocabulary. Finally, section 6 presents an evaluation of our approach and section 7
concludes the paper and presents future directions of this work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Generally, the problem of finding related multidimensional observations has been
addressed within the contexts of Linked Data and Online Analytical Processing
(OLAP) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Data mining techniques in OLAP are known as Online Analytical
Mining (OLAM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. OLAM research works study problems such as
clustering/classifying OLAP cubes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], detecting outliers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], performing intelligent
aggregations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and building recommender systems for OLAP sessions based on either query
formulation or observation similarity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Applications of these approaches aim to
enable discovery of latent knowledge, promote exploratory analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], improve
OLAP query efficiency [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and so on. In this context, Aligon et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] study the
problem of finding similarities between OLAP sessions, i.e., sequences of queries that are
applied online. To this end, they define similarity functions by conducting user-based
analysis, and they compute the similarity of sessions by decomposing the session queries’
features and then using the Levenshtein distance, Dice’s coefficient, term
frequency-inverse document frequency (tf-idf) and the Smith-Waterman algorithm. They find
the latter to be the best performing measure for their purposes.
      </p>
      <p>
        Baikousi et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provide distance functions categorized over their relation with
the hierarchy space. Similarly to our approach, they consider hierarchies to be of
prime importance in the problem and base their distance functions on hierarchies.
Finally, they summarize the hierarchical distances with two approaches, simple
summation and the Hausdorff distance, and they find that both approaches are equally
effective. Hsu et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] apply multidimensional scaling methods (MDS) and
hierarchical clustering (HC) in order to find similarity between OLAP reports of the same
cube. They formally define the problem and its constraints, such as identifying when
two OLAP reports are comparable, and conclude that a combination of MDS and HC
yields the best results.
      </p>
      <p>
        In the context of Linked Data, similarity or relatedness between entities has been a
main component of entity resolution, record linkage and interlinking [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ][
        <xref ref-type="bibr" rid="ref18">18</xref>
        ][
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
These approaches deal with discovering links between RDF nodes from different
datasets in efficient ways by using distance-based techniques. Statistical linked open
data have been addressed by [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] in the context of online analysis and exploration,
and in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as a use case scenario for data source contextualization. To the best of our
knowledge, this is the first work that addresses the definition, representation and
computation of relationships between individual multidimensional LOD observations.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem definition</title>
      <p>We consider that the problem space is composed of n datasets modeled according to, and validated
against, the integrity constraints imposed by the Data Cube Vocabulary. A dataset is
composed of its schema (i.e. dimensions, measures and attribute definitions), and its data
(i.e. observations). The values in dimensions are provided by a fixed set of coded lists
(code-value pairs) that are hierarchically structured in levels. Flat coded lists, i.e.,
simple enumerations, are considered to be hierarchies with exactly one level. These
are presented formally in the following:</p>
      <p>Definition 1 (Cube Structure): Let D={D1, …, Dn} be the set of all input datasets.
A dataset Di ∈ D is composed of the set of observations, Oi, and the set of schema
definitions, Si, and O={O1, …, On} and S={S1, …, Sn} are the sets of all observations
and schema definitions in D. Furthermore, a schema Si consists of the sets of
dimension properties Pi and measure properties Mi defined in Di, i.e., Si={Pi,Mi}. Let
<inline-formula><tex-math><![CDATA[P=\bigcup_{i=1}^{n} P_i]]></tex-math></inline-formula> and
<inline-formula><tex-math><![CDATA[M=\bigcup_{i=1}^{n} M_i]]></tex-math></inline-formula> be the sets of all k distinct dimension
and l distinct measure properties in D. Any pj ∈ P, mj ∈ M can belong to more than one Si, as
dimension and measure properties are reused among sources. An observation o ∈ Oi is
an entity that instantiates all dimension and measure properties defined in Si. The
value that observation oi has for dimension pj is denoted <inline-formula><tex-math><![CDATA[h_i^j]]></tex-math></inline-formula>.</p>
      <p>Definition 2 (Coded list terms): Each dimension property pj ∈ P takes values from a
fixed coded list, i.e., a set of code-value pairs, C(pj)={c(pj)1, …, c(pj)m}, j=1..k (for
simplicity we write cji instead of c(pj)i). The coded list defines a hierarchy such that
when cji ⥼ cjm, then cji is an ancestor of cjm. Furthermore, we define cjroot as the top
concept in the coded list of pj, i.e., an ancestor of all other terms in the coded list, such
that cjroot ⥼ cji for every cji ∈ C(pj). Every coded list term is an ancestor of itself, i.e., cji ⥼ cji.</p>
      <p>In Figure 2, we present sample coded lists for the dimensions present in the motivating
example of Fig. 1.</p>
      <p>[Figure 2: Sample coded lists for the dimensions refArea (Europe: Greece, Italy, Athens, Rome; Americas: USA, Texas, Austin), refPeriod (All: 2001, 2011, Jan2011, Feb2011) and sex (Total: Male, Female).]</p>
      <p>
Next we provide the definitions for containment and complementarity properties.
Similarly to [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], we apply the notion of complementarity between two
observations for denoting whether these are comparable, i.e., they have the same dimension
values but measure different phenomena. This is represented by the following.
      </p>
      <p>Definition 3 (Observation Complement): Given two observations oa and ob and
their dimensions Pa and Pb, oa is observation complement to ob when:
<disp-formula><tex-math><![CDATA[(P_a \subseteq P_b) \wedge (\forall p_i \in P_a : h_a^i = h_b^i) \wedge (\forall p_j \in P_b \setminus P_a : h_b^j = c_{j\,\mathrm{root}})]]></tex-math></disp-formula>
where <inline-formula><tex-math><![CDATA[h_a^i]]></tex-math></inline-formula> denotes the value of observation oa for dimension pi.
We denote this relationship with Compl(oa, ob) or equivalently oa Compl ob. Def. 3
states that the dimensions in ob must be a superset of the dimensions in oa and the
common dimensions must have the same values. All other dimension values of ob
must be equal to the root of the dimension hierarchy, thus providing no further
specialization. For example, an observation measuring poverty in Greece per year is
observation complement with an observation measuring the population in Greece per
year for all genders.</p>
      <p>Furthermore, a containment relationship captures whether an observation measure
is an aggregation of the measures of the contained observations. For example, an
observation measuring the population of Greece implicitly contains all observations
measuring the population of sub-regions of Greece. We distinguish between full and
partial containment. The former denotes that all contained observations can be
combined in a roll-up operation for being observation complement with the containing
one, while the latter denotes that both contained and containing observation must be
rolled-up on their disjoint dimensions for being observation complement. These are
presented in the following definition.</p>
      <p>Definition 4 (Partial and full containment): Given two observations oa and ob,
their dimensions Pa and Pb and their measures Ma and Mb, partial containment
between oa and ob exists when:
<disp-formula><tex-math><![CDATA[(M_a \cap M_b \neq \emptyset) \wedge (P_a \subseteq P_b) \wedge (\exists p_i \in P_a : h_a^i \mathrel{⥼} h_b^i)]]></tex-math></disp-formula>
where <inline-formula><tex-math><![CDATA[h_a^i]]></tex-math></inline-formula> denotes the value of observation oa for dimension pi.
An observation oa partially contains ob when (i) there is one Mi shared between oa and
ob, (ii) the dimensions of oa are a subset of the dimensions of ob and (iii) there exists at
least one dimension whose value for oa is a hierarchical ancestor of the respective
dimension value for ob. We denote this as Contpartial(oa, ob) or equivalently oa Contpartial
ob. Similarly, full containment between oa and ob exists when:
<disp-formula><tex-math><![CDATA[(M_a \cap M_b \neq \emptyset) \wedge (P_a \subseteq P_b) \wedge (\forall p_i \in P_a : h_a^i \mathrel{⥼} h_b^i)]]></tex-math></disp-formula>
That is, an observation oa fully contains ob when (i) there is one Mi shared between oa
and ob, (ii) the dimensions of oa are a subset of the dimensions of ob and (iii) values of
all dimensions for oa are hierarchical ancestors of the respective dimension values for
ob. We denote this with Contfull(oa, ob) or equivalently oa Contfull ob. Observe by
definition that the containment property is not symmetric and that given oa Contfull ob we
cannot derive that ob Contfull oa. Based on the above definitions, our problem can be
outlined as follows.</p>
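      <p>The three conditions of Definition 4 can be checked directly on a pair of observations, as the following minimal sketch illustrates. The dictionary layout of an observation, the toy hierarchies (including the root term World) and all function names are assumptions made for this illustration only.</p>
      <preformat><![CDATA[
```python
# Hypothetical coded lists: term -> parent (None marks the root term).
PARENTS = {
    "refArea": {"World": None, "Europe": "World", "Greece": "Europe", "Athens": "Greece"},
    "sex": {"Total": None, "Male": "Total", "Female": "Total"},
}


def is_ancestor_or_self(ancestor, term, parents):
    """True if `ancestor` is met while walking from `term` up to the root."""
    while term is not None:
        if term == ancestor:
            return True
        term = parents[term]
    return False


def _prerequisites(oa, ob):
    # (i) at least one shared measure, (ii) dimensions of oa are a subset of those of ob.
    shares_measure = bool(set(oa["measures"]) & set(ob["measures"]))
    return shares_measure and set(oa["dims"]) <= set(ob["dims"])


def partial_containment(oa, ob, hierarchies=PARENTS):
    # (iii) at least one value of oa is an ancestor of the respective value of ob.
    return _prerequisites(oa, ob) and any(
        is_ancestor_or_self(v, ob["dims"][d], hierarchies[d]) for d, v in oa["dims"].items()
    )


def full_containment(oa, ob, hierarchies=PARENTS):
    # (iii) every value of oa is an ancestor of the respective value of ob.
    return _prerequisites(oa, ob) and all(
        is_ancestor_or_self(v, ob["dims"][d], hierarchies[d]) for d, v in oa["dims"].items()
    )


greece = {"dims": {"refArea": "Greece"}, "measures": ["population"]}
athens_male = {"dims": {"refArea": "Athens", "sex": "Male"}, "measures": ["population"]}
greece_female = {"dims": {"refArea": "Greece", "sex": "Female"}, "measures": ["population"]}

print(full_containment(greece, athens_male))            # Greece is an ancestor of Athens
print(partial_containment(greece_female, athens_male))  # refArea matches, sex does not
print(full_containment(greece_female, athens_male))
```
]]></preformat>
      <p>Because every term is an ancestor of itself, equal dimension values also satisfy condition (iii) in this sketch.</p>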
      <p>Problem: Given a set of source datasets D, and O the set of observations in D, for
each pair of observations oi, oj ∈ O, i≠j, assess whether a) oi Contfull oj, b) oi Contpartial
oj and c) oi Compl oj. In the following section, we provide our techniques for computing
these properties.</p>
    </sec>
    <sec id="sec-4">
      <title>Computing containment and complementarity properties</title>
      <p>Our technique for computing containment and complementarity properties treats
observations as data vectors in a multidimensional feature space composed of all
schema definitions and coded list values in D. In addition, the feature set is enriched
with all ancestor values in the hierarchy of each coded list, up to the highest common
ancestor of all values in D. This is represented by an occurrence matrix that captures
the occurrence (1 or 0) of each dimension, measure definition and coded list value in the set
of observations. The occurrence matrix is used for calculating complementarity
properties between two observations. It is then used for constructing containment
matrices, which are in turn used for calculating containment properties.</p>
      <sec id="sec-4-1">
        <title>4.1 Constructing the Occurrence Matrix</title>
        <p>Each oi defines a bit vector of |C|+|P|+|M| dimensions, and all oi ∈ O yield an
|O|×(|C|+|P|+|M|) occurrence matrix OM that consists of the following sub-matrices:</p>
        <list list-type="bullet">
          <list-item>
            <p>OMC is the |O|×|C| matrix defined by the occurrences of coded list values in the
respective dimension values of all observations. Each value cji ∈ C(pj) corresponding
to dimension pj is treated as a feature, i.e., a column in OMC. Hierarchical
containment is encoded into OMC using a bottom-up algorithm that places a value of 1
in column cji if the value of the dimension pj of oa is equal to the feature cji and
then gives the value of 1 to all columns corresponding to the ancestors of cji. Finally,
we fill with 1’s the cjroot of all observations that do not contain pj in their schema.</p>
          </list-item>
          <list-item>
            <p>OMP and OMM are the |O|×|P| and |O|×|M| matrices defined by the occurrences of
dimension and measure properties in each observation. Each dimension and
measure definition is considered as a feature and is marked with 1 if oa contains it and 0
if not. The measure values are not taken into account.</p>
          </list-item>
        </list>
        <p>Therefore, OM = [OMC, OMP, OMM].</p>
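        <p>The construction of OM can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the observation layout, the column labels, the toy hierarchies (including the root term World) and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

Hierarchy = Dict[str, Optional[str]]  # coded-list term -> parent (None marks the root)


def ancestors_or_self(term: str, hierarchy: Hierarchy) -> List[str]:
    """Walk bottom-up from a term to the root of its coded list."""
    chain = []
    current: Optional[str] = term
    while current is not None:
        chain.append(current)
        current = hierarchy[current]
    return chain


def root_of(hierarchy: Hierarchy) -> str:
    return next(t for t, parent in hierarchy.items() if parent is None)


def occurrence_matrix(observations, hierarchies, measures) -> Tuple[List[str], List[List[int]]]:
    dims = sorted(hierarchies)
    columns = [f"{d}={t}" for d in dims for t in sorted(hierarchies[d])]  # OMC
    columns += [f"dim:{d}" for d in dims]                                  # OMP
    columns += [f"measure:{m}" for m in measures]                          # OMM
    rows = []
    for obs in observations:
        ones = set()
        for d in dims:
            if d in obs["dims"]:
                ones.add(f"dim:{d}")
                # Mark the value itself and all of its ancestors.
                ones.update(f"{d}={t}" for t in ancestors_or_self(obs["dims"][d], hierarchies[d]))
            else:
                # Dimension absent from the schema: mark only the root term.
                ones.add(f"{d}={root_of(hierarchies[d])}")
        ones.update(f"measure:{m}" for m in obs["measures"])
        rows.append([1 if c in ones else 0 for c in columns])
    return columns, rows


hierarchies = {
    "refArea": {"World": None, "Europe": "World", "Greece": "Europe",
                "Italy": "Europe", "Athens": "Greece"},
    "sex": {"Total": None, "Male": "Total", "Female": "Total"},
}
observations = [
    {"dims": {"refArea": "Athens", "sex": "Male"}, "measures": ["population"]},
    {"dims": {"refArea": "Greece"}, "measures": ["poverty"]},
]
columns, rows = occurrence_matrix(observations, hierarchies, ["population", "poverty"])
print(columns)
for row in rows:
    print(row)
```
]]></preformat>
        <p>In this sketch the second observation has no sex dimension, so only the root column sex=Total is set for it, mirroring the last step of the bottom-up algorithm.</p>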
        <p>The occurrence matrix OM of the example of Fig. 1, given the hierarchies shown in Figure 2, is shown
in Table 2. The sub-matrix OMC can be further broken down into separate sub-matrices,
one for each coded list, i.e., OMC = [OMC1, …, OMCk], where OMCi is a sub-matrix that
represents occurrences for all values of dimension pi and k=|P|. Therefore OM becomes
[OMC1,…, OMCk, OMP, OMM].</p>
        <p>[Table: occurrence matrix of the example, with one row for each of the observations obs11, obs12, obs21, obs22, obs31, obs32 and obs33.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Pair-wise observation containment</title>
        <p>Containment matrices. A containment matrix CMi is an |O|×|O| bit matrix that
captures pair-wise containment relationships between observations for
dimension pi. If CMi[oa,ob]=1, then observations oa and ob are hierarchically related for
dimension pi ∈ P (e.g. oa refers to Greece and ob refers to Europe), while CMi[oa,ob]=0
holds otherwise.</p>
        <p>
          Computation of containment matrices. To compute a containment matrix CMi, we
first take each sub-matrix OMCi in OMC and apply a containment function sf to all
pairs of observations oa and ob, treating the rows of oa and ob in OMCi as two
bit vectors, a and b, respectively. Two observations are hierarchically related if
the bit-wise AND operation between their corresponding bit vectors yields one of the
two bit vectors, as shown in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Given this, we define sf for a pair of observations oa,
ob and their bit vectors a, b in OMCi as the following conditional function:
<disp-formula><tex-math><![CDATA[sf(o_a, o_b)|_{OMC_i} = \begin{cases} 1, & \text{if } a \wedge b = b \\ 0, & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
The notation sf(oa, ob)|OMCi means that we apply sf to oa and ob in OMCi; if the AND between a
and b gives the bit vector b, then oa is contained by ob. If a=b then the relationship still
holds. By applying this function to each sub-matrix OMCi we acquire a set of k
containment matrices of |O|×|O| dimensions, CM1, …, CMk, each capturing
containment information for a given dimension property. Then, addition of all CMi yields the
Overall Containment Matrix OCM, which holds full and partial relationships in the
form of normalized similarities between pairs of observations as follows:
<disp-formula><tex-math><![CDATA[OCM = \frac{1}{k} \sum_{i=1}^{k} CM_i]]></tex-math></disp-formula>
OCM retains values in the range [0,1]. Full containment is given when two
observations have similarity 1. Since sf(oa, ob)=1 states that oa is contained by ob,
ob Contfull oa iff OCM[oa, ob]=1. Partial containment
is given when two observations have similarity between 0 and 1, i.e., ob Contpartial oa
iff OCM[oa, ob]&gt;0. Finally, no containment between observations exists when
OCM[oa, ob]=0.
        </p>
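        <p>A minimal sketch of the function sf and of the derivation of OCM is given below. The bit vectors are hypothetical toy rows of two sub-matrices (refArea with columns Athens, Greece, Europe and sex with columns Male, Total), and dividing by the number of dimensions k is an assumption about how the similarities are normalized. With sf as stated in the text, an entry [a][b] equal to 1 means that every dimension value of ob is an ancestor of, or equal to, the respective value of oa.</p>
        <preformat><![CDATA[
```python
from typing import List

BitVector = List[int]


def sf(a: BitVector, b: BitVector) -> int:
    """Return 1 if the bit-wise AND of a and b yields b, otherwise 0."""
    return 1 if [x & y for x, y in zip(a, b)] == list(b) else 0


def containment_matrix(block: List[BitVector]) -> List[List[int]]:
    """CM_i for one sub-matrix OMC_i: sf applied to every pair of rows."""
    return [[sf(a, b) for b in block] for a in block]


def overall_containment(cms: List[List[List[int]]]) -> List[List[float]]:
    """Cell-wise sum of all CM_i, normalized by the number of dimensions k (assumed)."""
    k, n = len(cms), len(cms[0])
    return [[sum(cm[r][c] for cm in cms) / k for c in range(n)] for r in range(n)]


# Rows: obs0 = (Athens, Male), obs1 = (Greece, Total), obs2 = (Greece, Male).
omc_ref_area = [[1, 1, 1], [0, 1, 1], [0, 1, 1]]  # columns: Athens, Greece, Europe
omc_sex = [[1, 1], [0, 1], [1, 1]]                # columns: Male, Total

cms = [containment_matrix(omc_ref_area), containment_matrix(omc_sex)]
ocm = overall_containment(cms)
for row in ocm:
    print(row)
```
]]></preformat>
        <p>In this toy example ocm[0][1] is 1.0, because (Greece, Total) is an ancestor of (Athens, Male) on both dimensions, while ocm[1][2] is 0.5, because the two observations agree on refArea only.</p>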
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Pair-wise observation complementarity</title>
        <p>Following the definition of observation complementarity, we use OM to assess
whether the dimension values of pairs of observations are the same, given that
the dimensions of one observation are a subset of the dimensions of the other. To check
whether this subset condition holds between two observations oa and ob, we
apply a bit-wise AND to the bit vectors of oa and ob in OMP, in the same way as when
computing the CM matrices, by using function sf(oa, ob). This is because values in OMP
capture whether an observation has a dimension in its schema. Given this, we want to
assess whether two bit vectors in OMP are related via a containment property, which
justifies the use of function sf, this time on the matrix of the dimension properties.
Then, we check if the dimension values of the two observations are equal:
<disp-formula><tex-math><![CDATA[cf(o_a, o_b) = \begin{cases} 1, & \text{if } sf(o_a, o_b)|_{OMP} = 1 \wedge a = b \\ 0, & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
Here sf(oa, ob)|OMP means that we apply sf to oa and ob in OMP, while a and b are their bit
vectors in OMC. This results in an |O|×|O| Complementarity matrix that holds the value
1 for complementary observations and 0 otherwise. Since sf(oa, ob)=1 on OMP means that the
dimensions of ob are a subset of the dimensions of oa, ob Compl oa iff
Complementarity[oa, ob]&gt;0.</p>
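        <p>The complementarity check can be sketched in the same style. The function name cf follows the text, while the toy bit vectors and the column layout are assumptions: OMP has the columns refArea, refPeriod and sex, and OMC has the columns Greece, Europe, 2011, All, Male and Total. With sf as stated in Section 4.2 (the AND yields b), the schema check succeeds when the dimensions of ob are a subset of those of oa.</p>
        <preformat><![CDATA[
```python
from typing import List

BitVector = List[int]


def sf(a: BitVector, b: BitVector) -> int:
    """Return 1 if the bit-wise AND of a and b yields b, otherwise 0."""
    return 1 if [x & y for x, y in zip(a, b)] == list(b) else 0


def cf(omp_a: BitVector, omp_b: BitVector, omc_a: BitVector, omc_b: BitVector) -> int:
    """1 if sf holds on the OMP rows and the OMC rows are identical, otherwise 0."""
    return 1 if sf(omp_a, omp_b) == 1 and omc_a == omc_b else 0


# Hypothetical rows. The poverty observation has no sex dimension, so only the
# root column "Total" is set for it in OMC.
poverty = {"omp": [1, 1, 0], "omc": [1, 1, 1, 1, 0, 1]}           # Greece, 2011
population_total = {"omp": [1, 1, 1], "omc": [1, 1, 1, 1, 0, 1]}  # Greece, 2011, Total
population_male = {"omp": [1, 1, 1], "omc": [1, 1, 1, 1, 1, 1]}   # Greece, 2011, Male


def complementarity(oa, ob) -> int:
    return cf(oa["omp"], ob["omp"], oa["omc"], ob["omc"])


print(complementarity(population_total, poverty))  # same values, sex at root level
print(complementarity(population_male, poverty))   # sex is specialized to Male
print(complementarity(poverty, population_total))  # schema check fails in this order
```
]]></preformat>
        <p>Filling the root column for missing dimensions is what makes the equality test on OMC sufficient: the extra dimension of the richer observation must sit at the root for the two rows to coincide.</p>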
        <p>Observe that the time complexity of computing containment and complementarity
matrices for N observations is O(N²); further optimization is left as future work. Two
approaches to improve running time are (i) parallelism and (ii) reducing the search
space by taking into account characteristics of the incoming schemata. For example, if
we can conclude that there can be no containment and/or complementarity between
observations of Di and Dj just by examining their schema, we do not need to perform
any computations on observation pairs between Di and Dj.</p>
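        <p>One possible form of such a schema-level filter is sketched below; it is not part of the evaluated technique. It assumes that the prerequisites of Definitions 3 and 4 (dimension subset, shared measure) are tested once per dataset pair. The Schema class, the function names and the three example schemata are hypothetical.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Schema:
    dims: FrozenSet[str]
    measures: FrozenSet[str]


def may_contain(si: Schema, sj: Schema) -> bool:
    """Prerequisites of Definition 4: a shared measure and dims(Di) subset of dims(Dj)."""
    return bool(si.measures & sj.measures) and si.dims <= sj.dims


def may_complement(si: Schema, sj: Schema) -> bool:
    """Prerequisite of Definition 3: dims(Di) subset of dims(Dj)."""
    return si.dims <= sj.dims


# Hypothetical schemata of three datasets.
schemas = {
    "D1": Schema(frozenset({"refArea", "refPeriod", "unit", "age"}), frozenset({"poverty"})),
    "D2": Schema(frozenset({"refArea", "refPeriod", "unit"}), frozenset({"poverty"})),
    "D3": Schema(frozenset({"refArea", "refPeriod", "sex", "unit"}), frozenset({"population"})),
}

# Only these ordered dataset pairs need pair-wise computations on their observations.
candidates = [
    (i, j)
    for i in schemas
    for j in schemas
    if may_contain(schemas[i], schemas[j]) or may_complement(schemas[i], schemas[j])
]
print(candidates)
```
]]></preformat>
        <p>In this example four of the nine ordered dataset pairs are discarded before any observation is compared.</p>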
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Complementarity and containment properties in Data Cube</title>
      <p>We propose simple extensions to the Data Cube Vocabulary such that
complementarity and containment, full and partial, between observations can be represented. We
define three properties, containment, partialContainment and fullContainment, where
partialContainment is a sub-property of the generic containment property, and
fullContainment is a sub-property of partialContainment, which reflects the fact that full
containment is a specialization of partial containment. As an example, a relationship
Contpartial(oa, ob) is then modelled as shown in the top part of Figure 3. The
containment relationship becomes a blank node of the appropriate type and is reified to
include information on ob and other possible metadata on the relationship. Similarly,
complementarity is denoted with the property imis:complement, as shown in the
bottom part of Figure 3.
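</p>
      <p>[Figure 3: Representation of containment (top) and complementarity (bottom) between observations. Labels recoverable from the bottom part: imis:complement, _:bn2, imis:observation (twice), ex:obs_b and ex:obs_c.]</p>
      <p>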
In this section we present the evaluation of our approach over real-world statistical
datasets. Our experiments were performed using Java and Apache Jena for handling
the RDF models and creating the matrices, and R for computations on the matrices,
application of the functions sf and cf and so on.</p>
      <p>Datasets. Four datasets have been used. D1 and D2 measure poverty in two
different sets of EU sub-regions and periods, D3 measures population in three EU countries
and their first-level sub-regions, and D4 measures households with internet access in
seven EU countries and their sub-regions. They exhibit an overlap of 5 dimensions
(location, time, sex, unit and age) and 3 measures (poverty, population and
households with internet access). The mappings between the coded lists for the common
dimension values have been manually created. Observe that creating the mappings for
different dimension values is orthogonal to our approach; many approaches
from the fields of entity resolution and LOD interlinking can be applied. The datasets
are either downloaded as RDF or converted using a conversion script written in Java.
The Eurostat Linked Data Wrapper<sup>1</sup>, the Eurostat database<sup>2</sup> and World Bank<sup>3</sup> were used as
sources. The datasets were pre-processed to include data about EU countries, as well
as selected sub-regions based on official EU geo classifications (NUTS<sup>4</sup>), for various
time periods, taken from the Gregorian calendar classification of data.gov.uk<sup>5</sup>. The
dataset structures are summarized in Table 2.
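</p>
      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Dimensions and measures of the input datasets. The last three columns (poverty, internet, population) are the measures.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset (# of obs.)</th><th>refArea</th><th>refPeriod</th><th>sex</th><th>unit</th><th>age</th><th>poverty</th><th>internet</th><th>population</th></tr>
          </thead>
          <tbody>
            <tr><td>D1 (539)</td><td>85 regions, 20 countries</td><td>2004-2011</td><td>N/A</td><td>Yes</td><td>Yes</td><td>Yes</td><td>N/A</td><td>N/A</td></tr>
            <tr><td>D2 (1693)</td><td>293 regions, 33 countries</td><td>2003-2010</td><td>N/A</td><td>Yes</td><td>N/A</td><td>Yes</td><td>N/A</td><td>N/A</td></tr>
            <tr><td>D3 (629)</td><td>42 regions, 3 countries</td><td>2009-2013</td><td>M, F, Total</td><td>Yes</td><td>N/A</td><td>N/A</td><td>N/A</td><td>Yes</td></tr>
            <tr><td>D4 (316)</td><td>65 regions, 7 countries</td><td>2009-2013</td><td>N/A</td><td>N/A</td><td>N/A</td><td>N/A</td><td>Yes</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>1</sup> http://estatwrap.ontologycentral.com/</p>
      <p><sup>2</sup> http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database</p>
      <p><sup>3</sup> http://data.worldbank.org/</p>
      <p><sup>4</sup> http://nuts.geovocab.org/</p>
      <p><sup>5</sup> http://datahub.io/dataset/data-gov-uk-time-intervals</p>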
      <p>Computing containment and complementarity properties over the observations of
the four datasets resulted in the creation of multiple relationships summed up in Table
3. The results do not include self-containing or self-complementing observations. We
have defined three metrics, full, partial and compl. For a pair of datasets Di and Dj,
they measure the total number of pairs that exhibit full containment, partial
containment and observation complementarity respectively, as a percentage of the total
possible number of pairs in the given two datasets, minus the diagonal. As can be seen,
most new relationships are partial containments, which is a reasonable result given
that it is the most weakly defined relationship in terms of its prerequisites. The
strictest relationship, observation complementarity, resulted in linking 0.03% of the total
possible observation pairs. Sample observations participating in the newly created
relationships can be seen in Figure 4. The created links are modeled after our proposed
vocabulary and uploaded in RDF form to an OpenLink Virtuoso store<sup>6</sup>.
</p>
      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Results of new relationships for test datasets D1...D4. Each row reports, for a pair of datasets, the total number of observation pairs exhibiting each relationship, as well as the percentage over all possible pair-wise combinations of observations for that pair.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset pair</th><th>full</th><th>partial</th><th>compl</th></tr>
          </thead>
          <tbody>
            <tr><td>D1, D1</td><td>647 (0.31%)</td><td>34.3k (16.32%)</td><td>N/A</td></tr>
            <tr><td>D1, D2</td><td>605 (0.02%)</td><td>605k (14.83%)</td><td>1238 (0.04%)</td></tr>
            <tr><td>D1, D3</td><td>N/A</td><td>N/A</td><td>N/A</td></tr>
            <tr><td>D1, D4</td><td>N/A</td><td>N/A</td><td>328 (0.05%)</td></tr>
            <tr><td>D2, D2</td><td>3370 (0.14%)</td><td>378k (14.83%)</td><td>N/A</td></tr>
            <tr><td>D2, D3</td><td>N/A</td><td>N/A</td><td>204 (0.004%)</td></tr>
            <tr><td>D2, D4</td><td>N/A</td><td>N/A</td><td>218 (0.005%)</td></tr>
            <tr><td>D3, D3</td><td>1k (0.26%)</td><td>261k (65.9%)</td><td>N/A</td></tr>
            <tr><td>D3, D4</td><td>N/A</td><td>N/A</td><td>592 (0.07%)</td></tr>
            <tr><td>D4, D4</td><td>437 (0.17%)</td><td>22.2k (22.3%)</td><td>N/A</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Discussion. The relatedness properties that we have defined yield interesting
information on how existing observations can be combined across datasets as well as in
the same source dataset. The advantages of this work are two-fold: first, it creates new
information by combining existing facts (complementarity), and second, it creates a
containment graph among observations that helps exploration, aggregation and
discovery of nearby multidimensional observations. The definitions that we have
provided can also be studied in terms of their impact at the dataset level. For example, D3 has
65.9% partial containment within itself, with only 0.26% full containment. This hints that
the structure of the dataset’s hierarchies is such that there are a few top level concepts
in comparison to lower-level concepts, and that also the depth of the hierarchies is not
very large. As a matter of fact, this is true for D3 as it addresses 3 countries and 42
sub-regions, all of them lying on the same level. Figure 4 shows an example of an
observation _:b134 that exhibits both complementarity and containment with _:b432
and _:b135 resp. Complementarity shows that we can combine the measures of two
observations set on the same dimension values in order to complement their
information. Two complementary observations can be joined on their common dimension
values and combined into a new observation, whose schema contains the union of
their dimension and measure sets. Figure 4 shows how we can combine population
and households with internet access because they are both set on region EL1 in 2010,
although _:b432 does not involve the gender dimension.</p>
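      <p>The join of two complementary observations described above can be sketched as follows, using the values of _:b134 and _:b432 from Figure 4. The dictionary layout, the function name combine and the measure names population and internetHouseholds (taken from the dataset names) are assumptions for illustration. The sketch presumes that the pair has already been found complementary and only guards against conflicting values on shared dimensions.</p>
      <preformat><![CDATA[
```python
def combine(oa, ob):
    """Join two complementary observations on their common dimension values."""
    shared = oa["dims"].keys() & ob["dims"].keys()
    conflicts = sorted(d for d in shared if oa["dims"][d] != ob["dims"][d])
    if conflicts:
        raise ValueError(f"observations disagree on dimensions: {conflicts}")
    return {
        # The new schema is the union of the dimension and measure sets.
        "dims": {**oa["dims"], **ob["dims"]},
        "measures": {**oa["measures"], **ob["measures"]},
    }


b134 = {
    "dims": {"sex": "sex:Total", "refArea": "nuts2008:EL1", "refPeriod": "uk:2010"},
    "measures": {"population": 3590447},
}
b432 = {
    "dims": {"refArea": "nuts2008:EL1", "refPeriod": "uk:2010"},
    "measures": {"internetHouseholds": 31.5},
}

merged = combine(b134, b432)
print(merged)
```
]]></preformat>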
      <preformat>_:b134
    a qb:Observation ;
    qb:dataSet _:population ;
    sdmx-dimension:sex sex:Total ;
    sdmx-dimension:refArea nuts2008:EL1 ;
    sdmx-dimension:refPeriod uk:2010 ;
    sdmx-measure:obsValue 3,590,447 .

_:b432
    a qb:Observation ;
    qb:dataSet _:internetHouseholds ;
    sdmx-dimension:refArea nuts2008:EL1 ;
    sdmx-dimension:refPeriod uk:2010 ;
    sdmx-measure:obsValue 31.5 .

_:b135
    a qb:Observation ;
    qb:dataSet _:population ;
    sdmx-dimension:sex sex:Male ;
    sdmx-dimension:refArea nuts2008:EL11 ;
    sdmx-dimension:refPeriod uk:2010 ;
    sdmx-measure:obsValue 302,855 .</preformat>
      <p>[Figure 4: Observation _:b134 is linked to _:b432 by observation complement and to _:b135 by fullContainment.]</p>
      <p>
In this paper, we have presented a novel approach for identifying and modeling
relationships between observations of multidimensional linked open data. We have
defined three new properties, namely full and partial observation containment and
observation complementarity between two observations as derivatives of the hierarchical
relationships between their dimension values, and as a means of comparison and
correlation of their different measures. We have proposed a possible extension on the
Data Cube terms for representing these properties and we have provided an evaluation
of our approach over real-world statistical datasets.</p>
      <p>A future direction concerns techniques for grouping and combining observations
into new datasets based on their containment and complementarity properties.
Another direction is the incorporation of the Data Cube attributes in our algorithms as these
often represent valuable information like units of measure, observation status or
folded dimensions and can be used for creating dimension value mappings at the
preprocessing stage. Finally, we will explore techniques for optimizing the computation
time and scale up the overall performance of our approach.</p>
      <p>Acknowledgement. This work has been co-financed by the EU project DIACHRON
and by EU and Greek national funds through the Operational Program
"Competitiveness and Entrepreneurship" (OPCE ΙΙ) of the National Strategic Reference Framework
(NSRF) - Research Funding Program: KRIPIS.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aligon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.
          <year>2014</year>
          .
          <article-title>"Similarity measures for OLAP sessions."</article-title>
          <source>Knowledge and Information Systems</source>
          <volume>39</volume>
          , no.
          <issue>2</issue>
          :
          <fpage>463</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.
          <year>2013</year>
          .
          <article-title>"The RDF Data Cube Vocabulary."</article-title>
          <source>W3C Recommendation</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Villazón-Terrazas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          et al.
          <year>2011</year>
          .
          <article-title>"Methodological guidelines for publishing government linked data."</article-title>
          <source>In Linking Government Data</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>49</lpage>
          . Springer New York.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>K.C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.Z.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>"Techniques for finding similarity knowledge in OLAP reports."</article-title>
          <source>Expert Systems with Applications</source>
          <volume>38</volume>
          , no.
          <issue>4</issue>
          :
          <fpage>3743</fpage>
          -
          <lpage>3756</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sathe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>"Intelligent rollups in multidimensional OLAP data."</article-title>
          <source>In VLDB</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>531</fpage>
          -
          <lpage>540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Giacometti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.
          <year>2009</year>
          .
          <article-title>"Query recommendations for OLAP discovery driven analysis."</article-title>
          <source>In Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP</source>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>88</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.
          <year>1999</year>
          .
          <article-title>"Constraint-based, multidimensional data mining."</article-title>
          <source>Computer</source>
          <volume>32</volume>
          , no.
          <issue>8</issue>
          :
          <fpage>46</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>"Outlier detection for high dimensional data."</article-title>
          <source>In ACM SIGMOD Record</source>
          , vol.
          <volume>30</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Markl</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          et al.
          <year>1999</year>
          .
          <article-title>"Improving OLAP performance by multidimensional hierarchical clustering."</article-title>
          <source>In Database Engineering and Applications, 1999. IDEAS'99. International Symposium Proceedings</source>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>177</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dayal</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>"An overview of data warehousing and OLAP technology."</article-title>
          <source>ACM SIGMOD Record</source>
          <volume>26</volume>
          , no.
          <issue>1</issue>
          :
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vassiliadis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sellis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>"A survey of logical models for OLAP databases."</article-title>
          <source>ACM SIGMOD Record</source>
          <volume>28</volume>
          , no.
          <issue>4</issue>
          :
          <fpage>64</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Aït-Kaci</surname>
          </string-name>
          et al.
          <year>1989</year>
          .
          <article-title>"Efficient implementation of lattice operations."</article-title>
          <source>ACM Transactions on Programming Languages and Systems (TOPLAS)</source>
          <volume>11</volume>
          , no.
          <issue>1</issue>
          :
          <fpage>115</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          et al.
          <year>2013</year>
          .
          <article-title>"Linked statistical data analysis."</article-title>
          <source>Semantic Web Challenge</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.
          <year>2013</year>
          .
          <article-title>"Discovering related data sources in data-portals."</article-title>
          <source>In Proceedings of the First International Workshop on Semantic Statistics, co-located with the International Semantic Web Conference</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Das Sarma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.
          <year>2012</year>
          .
          <article-title>"Finding related tables."</article-title>
          <source>In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data</source>
          , pp.
          <fpage>817</fpage>
          -
          <lpage>828</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Baikousi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          et al.
          <year>2011</year>
          .
          <article-title>"Similarity measures for multidimensional data."</article-title>
          <source>In Data Engineering (ICDE), 2011 IEEE 27th International Conference on</source>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>"LIMES: a time-efficient approach for large-scale link discovery on the web of data."</article-title>
          <source>In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three</source>
          , pp.
          <fpage>2312</fpage>
          -
          <lpage>2317</lpage>
          . AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Volz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.
          <year>2009</year>
          .
          <article-title>"Silk - A Link Discovery Framework for the Web of Data."</article-title>
          <source>LDOW</source>
          <volume>538</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          et al.
          <year>2011</year>
          .
          <article-title>"DBpedia spotlight: shedding light on the web of documents."</article-title>
          <source>In Proceedings of the 7th International Conference on Semantic Systems</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>