<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enabling Self-Service BI on Document Stores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed L. Chouder</string-name>
          <email>m_chouder@esi.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Rizzi</string-name>
          <email>stefano.rizzi@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachid Chalal</string-name>
          <email>r_chalal@esi.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI, University of Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LMCS, ESI</institution>
          ,
          <addr-line>Algiers</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The growing use of document stores has resulted in vast amounts of semi-structured data holding precious information, which could be profitably integrated into existing BI systems. Unfortunately, due to their schemaless nature, document stores are hardly accessible via direct OLAP querying. In this paper we propose an interactive, schema-on-read approach for finding multidimensional structures in document stores aimed at enabling OLAP querying in the context of self-service BI. Our approach works in three steps: multidimensional enrichment, querying, and OLAP enabling; the validation of user queries and the detection of multidimensional structures is based on the mining of approximate functional dependencies from data. The efficiency of our approach is discussed with reference to real datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Over the past decade, companies have been adopting NoSQL
databases to deal with the huge volumes of data
manipulated by modern applications. NoSQL systems have emerged
as an alternative to relational database management
systems in many implementations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They can be
classified based on their data model, the most popular categories
being key-value, wide-column, graph-based, and
document-oriented; document-oriented databases (briefly, document
stores) are probably the most widely adopted so far. The
common features of document stores are horizontal
scalability on commodity hardware and the lack of an explicit
schema (according to the "data first, schema later or never"
paradigm). This schemaless nature provides a flexible data
model that has attracted developers seeking to avoid the
restrictions posed by the relational model.
      </p>
      <p>Document stores usually collect nested, denormalized, and
hierarchical documents in the form of collections; the
documents within the same collection may present a structural
variety due to schema flexibility and data evolution.
Documents are self-describing and mainly encoded using the
semi-structured data format JSON (JavaScript Object
Notation), which is also popular as an exchange format and
widely adopted in modern web APIs.</p>
      <p>
        The growing use of document stores and the dominance
of JSON have resulted in vast amounts of semi-structured
data holding precious information, which could be profitably
integrated into existing business intelligence (BI) systems
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Indeed, multidimensional modeling and analysis have
been recognized to be an effective way for conducting
analytics over big NoSQL data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Unfortunately, due to their
schemaless nature, document stores are hardly accessible via
direct OLAP querying. Recent efforts in this area propose
SQL-like query interfaces (e.g., Apache Drill, Spark SQL,
Presto), which provide access to schemaless and nested data
while offering the possibility of using traditional BI tools.
However, none of these solutions supports multidimensional
querying or OLAP over document stores.
      </p>
      <p>
        In this paper we propose an interactive, schema-on-read
approach for finding multidimensional structures in
document stores aimed at enabling OLAP querying in the
context of self-service BI. Differently from a schema-on-write
approach, which would force a fixed structure on data and
load them into a data warehouse, a schema-on-read
approach leaves data unchanged in their structure until they
are accessed by the user [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. We claim that, when querying
document stores for self-service BI, a schema-on-read
approach should be preferred to a schema-on-write approach.
The main reason is that document stores handle high
volumes of data with varied structure, which questions, on
the one hand, the ability of traditional ETL to process such
data and, on the other hand, the possibility of storing them
into a data warehouse with a fixed schema.
      </p>
      <p>To discover the information necessary for identifying
multidimensional concepts despite the lack of structure, we
resort to data distributions, i.e., we mine approximate
functional dependencies (AFDs) and use them to automate the
design process. More specifically, our approach takes as
input a collection of nested documents and operates in three
phases.
1. Multidimensional Enrichment. In this phase, a
schema is extracted from the collection and enriched
with some basic multidimensional knowledge. This is
done by tentatively classifying attributes into
dimensions and measures, and by relating each measure to
the subset of dimensions that determine its
granularity, so as to create a draft multidimensional schema.
2. Querying. Now the user can formulate a
multidimensional query by picking dimensions and measures
of interest. This query is checked by accessing data to
ensure its multidimensional validity and correct
summarization. If the query is found to be well-formed, it
is executed. Otherwise, the multidimensional schema
is refined and some possible querying alternatives are
proposed to the user.
3. OLAP Enabling. This iterative phase enables the
user to further explore data by running an OLAP
session. To this end, local portions of multidimensional
hierarchies (consisting of roll-up and drill-down
relationships for each level involved in the user query) are
built by mining AFDs from data. The user can now
apply an OLAP operator (e.g., roll-up or drill-down)
to iteratively create a new query on the collection.
The advantages of the proposed approach are twofold. First,
it enables decision makers to formulate multidimensional
queries and create reports on-the-fly on document stores in
a self-service fashion, i.e., without any support from ICT
people. This reduces the time and effort spent in traditional
BI settings. Second, our approach also works when nested
structures are not present within the input data (e.g., for
flat documents), which is relevant because a large number
of non-nested datasets are available in JSON format on the
web (e.g., more than 11000 collections on www.data.gov).
The approach has been implemented on top of MongoDB,
one of the most popular document stores.</p>
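      <p>The AFD mining that underpins phases 1 and 3 can be sketched as follows. This is a minimal illustration with names of our own choosing, assuming the common g3-style error measure (the fraction of records that would have to be discarded for the dependency to hold exactly); it is not the exact algorithm of the implementation:</p>
      <preformat>
```python
from collections import Counter, defaultdict
from operator import le  # le(a, b) means "a is at most b"

def afd_error(records, lhs, rhs):
    """g3-style error of the dependency lhs -&gt; rhs: the minimum fraction of
    records to discard so that each lhs value determines a single rhs value."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec[lhs]][rec[rhs]] += 1
    # For each lhs value, keep only the records agreeing on its most frequent rhs.
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return 1.0 - kept / len(records)

def holds_approximately(records, lhs, rhs, threshold=0.05):
    # Accept the dependency as an AFD when its error is at most the threshold.
    return le(afd_error(records, lhs, rhs), threshold)
```
      </preformat>
      <p>For instance, over slightly noisy documents a dependency such as city to country is still accepted as an AFD at a permissive threshold, while an exact FD check would reject it.</p>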
      <p>The rest of this paper is organized as follows. Section 2
discusses the related work and Section 3 presents document
stores along with the working example. Our approach is
detailed in Section 4 and experimentally evaluated in
Section 5. Section 6 concludes the paper and gives some future
directions for research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Supply-driven multidimensional design. The
automation of multidimensional modeling has been widely
explored in the area of data warehouse design. Supply-driven
approaches automate the discovery of multidimensional
concepts by a thorough analysis of the source data. The first
approaches proposed algorithms to build multidimensional
schemata starting from Entity/Relationship diagrams or
relational schemata [
        <xref ref-type="bibr" rid="ref12 ref19 ref8">8, 19, 12</xref>
        ], while further approaches
considered more expressive conceptual sources such as UML
class diagrams [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and ontologies [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Furthermore, [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
discussed the use of FDs for automating multidimensional
modeling from ontologies. In particular, similarly to the
querying phase of our approach, [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] proposes an algorithm
to check the multidimensional validity of an SQL cube query
and to derive the underlying multidimensional schema. This
algorithm identifies multidimensional concepts from the
structure of the query itself and by following foreign keys in the
relational schema.
      </p>
      <p>In these approaches, multidimensional modeling is done at
design time, mainly based on FDs expressed in the source
schemata as foreign keys or many-to-one relationships.
Conversely, our approach is meant to be used at query time
and automates the discovery of multidimensional concepts
by mining FDs from data, as required by the schemaless
context.</p>
      <p>
        Multidimensional design from non-relational data.
Our approach closely relates to previous approaches for
multidimensional modeling from semi-structured data [
        <xref ref-type="bibr" rid="ref13 ref28 ref9">9, 13, 28</xref>
        ]
in XML format, which is similar to JSON. These approaches
take as input DTDs or XML schemata that provide rich
information about XML documents (e.g., multiplicities, data
types), so they cannot operate directly on XML data not
having a schema specification. In particular, the work in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
builds multidimensional schemata starting from an XML
schema but, in some cases, data is accessed to discover FDs
that are not expressed in the schema. Similarly, starry vault
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a recent approach that mines FDs for multidimensional
modeling from data vaults. These are databases
characterized by a specific data model tailored to provide historical
storage in presence of schema evolutions. The main idea is
to mine approximate and temporal FDs to cope with the
issue of noisy and time-varying data, which may result in
hidden FDs.
      </p>
      <p>All the above-mentioned approaches are similar in that
they define multidimensional schemata at design time using
structural and additional information extracted from data
(i.e., FDs). In contrast, our approach operates at query
time and mostly relies on data distributions, while structural
information is only used (when available) to intelligently
reduce the search space.</p>
      <p>
        OLAP on linked data. Recent works in this space
propose to directly perform OLAP-like analysis over semantic
web data. Exploratory OLAP has been defined as the
process that discovers, collects, integrates, and analyzes these
external data on the fly [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Some works in this direction
address the problem of building multidimensional hierarchies
from linked data [
        <xref ref-type="bibr" rid="ref21 ref27">21, 27</xref>
        ], which is closely related to the
third phase of our approach. Specifically, iMOLD [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is
an instance-based approach that operates at query time for
exploratory OLAP on linked data; it aims at finding
multidimensional patterns representing roll-up relationships, in
order to enable users to extend corporate data cubes with
new hierarchies extracted from linked data. Similarly, [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
proposes an algorithm for discovering hierarchies from
dimensions present in statistical linked data represented
using the RDF Data Cube (QB) vocabulary (www.w3.org/TR/
vocab-data-cube/). Like starry vault, this algorithm mines
AFDs (there called quasi-FDs) to cope with the presence
of imperfect and partial data.
      </p>
      <p>The algorithm that we propose for building hierarchies
differs from these works in that, to ensure good performance,
it looks for local portions of hierarchies for the levels involved
in the user queries, thus acting in an on-demand fashion.</p>
      <p>SQL on schemaless data. Several solutions have emerged
in the industry to enable SQL querying of schemaless data
for analytic purposes. These solutions can be classified into
two categories: (1) relational systems enabling the storage
and management of schemaless data, and (2) systems
designed as an SQL interface for schemaless data.</p>
      <p>
        Solutions in the first category include RDBMSs that
support storage and querying of schemaless data to be used
along with relational data in one system [
        <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
        ] (e.g.,
Oracle, IBM DB2, and PostgreSQL). Current systems do not
impose a fixed relational schema to store and query data,
but derive and maintain a dynamic schema to be used for
schema-on-read SQL/JSON querying instead [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Other
solutions that fall into this category are virtual adapters that
expose a relational view of the data in a document store, to
be used in common BI tools; an example is the MongoDB
BI connector (www.mongodb.com/products/bi-connector).
      </p>
      <p>
        The second line of solutions comprises SQL-like interfaces
that offer extensions to query schemaless data persisted in
document stores or as JSON files (e.g., in Hadoop). Spark
SQL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] has been designed for relational data processing
of native Spark distributed datasets in addition to diverse
data sources and formats (including JSON). In order to run
SQL queries on schemaless data in Spark SQL, a schema
is automatically inferred beforehand. Apache Drill (drill.
apache.org) is a distributed SQL engine that can join data
from multiple data sources, including Hadoop and NoSQL
databases. It has built-in support for JSON with
columnar in-memory execution. Drill is also able to dynamically
discover a schema during query processing, which allows it
to handle schemaless data without defining a schema upfront.
Finally, Presto (prestodb.io) is a distributed SQL engine,
developed at Facebook for interactive queries, that can also
combine data from multiple kinds of data sources,
including relational and non-relational databases. When querying
schemaless data, Presto tries to discover data types, but it
may need a schema to be defined manually in order to run
queries. All the above-mentioned engines differ in their
architecture, support for ANSI SQL, and extensions to SQL to
handle schemaless and nested data. These engines can
connect to MongoDB, which gave us various architectural
options for implementation. However, none of them supports
multidimensional or OLAP querying over document stores.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. DOCUMENT STORES</title>
      <p>
        A document store holds a set of databases, each organizing
the storage of documents in the form of collections in a way
similar to rows and tables in relational databases. A
document, also called object, consists of a set of key-value pairs
(or elds) mainly encoded in JSON format (www.json.org).
JSON has many variants that are used for storage and
optimization purposes such as BSON (www.bsonspec.org) in
MongoDB and, very recently, OSON in the Oracle DBMS
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Keys are always strings, while values can have the
following types: primitive (number, string, Boolean), object (or
sub-document), and array of atomic values or objects.
      </p>
      <p>
        From a conceptual point of view, a typical collection
consists of a set of business entities connected through two kinds
of relationships: nesting and references. Nesting consists of
embedding objects arbitrarily within other objects, which
corresponds to two types of relationships: to-one in the case
of a nested object and to-many in the case of an array of
nested objects. On the other hand, references are similar
to foreign keys in relational databases and can be expressed
manually or using a specific mechanism such as $ref in
MongoDB. However, when references are used, getting data
requires joins, which are not supported in document stores; so,
nesting is most often used instead. Since a document store
has a flexible schema, there are a variety of ways to represent
data; nevertheless, the way data is modeled may
affect performance and data retrieval patterns.
<preformat>
games:
{ "_id" : "52f2a70eddbd75540aba7c06",
  "date" : ISODate("1989-03-27"),
  "teams" :
  [ { "name" : "Detroit Pistons",
      "abbreviation" : "DET",
      "score" : 90,
      "home" : true,
      "won" : 1,
      "results" : { "points" : 90, ... },
      "players" :
      [ { "player" : "Isiah Thomas",
          "points" : 30, ... },
        ... ] },
    { "name" : "Dallas Mavericks",
      "abbreviation" : "DAL",
      "score" : 77,
      "home" : false,
      "won" : 0,
      "results" : { "points" : 77, ... },
      "players" :
      [ { "player" : "Rolando Blackman",
          "points" : 23, ... },
        ... ] } ] }
(a)

games:
{ "_id" : ...,
  "date" : ...,
  "hometeam" : { ... },
  "guestteam" :
  { "name" : "Dallas Mavericks",
    "abbreviation" : "DAL",
    "score" : 77,
    "home" : false,
    "won" : 0,
    "results" : { "points" : 77, ... },
    "players" :
    [ { "player" : "Rolando Blackman",
        "points" : 23, ... },
      ... ] } }
(b)
</preformat>
Example 1. In the scope of this paper, we focus on
collections of JSON documents that logically represent the same
business entities, while expecting that their structure may
vary. In particular, we will use as a working example a
collection that models NBA games (adapted from [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]), where a
document represents one game between two teams along with
players and team results. In the sample document shown in
Figure 1a the main object is games, which holds the array
of objects teams (to-many), which in turn holds the object
results (to-one) and the array of objects players (to-many).
Figure 1b shows an alternative JSON representation of the
same domain; the relationship between game and teams is
modeled using two separate sub-documents (home and guest
team) instead of an array. Note that a simple query that
computes the average score by team would be much easier to
implement using the first representation, and it would also
perform better, since in the second representation it would
require two separate queries.
      </p>
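      <p>To make the difference concrete, the average-score query can be sketched in plain Python standing in for a MongoDB aggregation (the documents are abridged versions of those in Figure 1, with a second game added, and the function names are ours). With representation (a) a single unwind of the teams array suffices, corresponding to an $unwind plus $group pipeline, while representation (b) forces two traversals whose partial results must be unioned:</p>
      <preformat>
```python
from collections import defaultdict

# Abridged documents in the representation of Figure 1a (array of teams).
games_a = [
    {"date": "1989-03-27",
     "teams": [{"name": "Detroit Pistons", "score": 90},
               {"name": "Dallas Mavericks", "score": 77}]},
    {"date": "1989-03-29",
     "teams": [{"name": "Detroit Pistons", "score": 102},
               {"name": "Boston Celtics", "score": 95}]},
]

def avg_score_by_team_array(games):
    # One pass: unwind the teams array and group by team name.
    totals = defaultdict(lambda: [0, 0])
    for game in games:
        for team in game["teams"]:
            totals[team["name"]][0] += team["score"]
            totals[team["name"]][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}

def avg_score_by_team_split(games):
    # Representation of Figure 1b: both sub-documents must be visited
    # and their partial results unioned.
    totals = defaultdict(lambda: [0, 0])
    for game in games:
        for key in ("hometeam", "guestteam"):
            team = game[key]
            totals[team["name"]][0] += team["score"]
            totals[team["name"]][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}
```
      </preformat>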
    </sec>
    <sec id="sec-4">
      <title>4. APPROACH</title>
      <p>As already discussed, our approach for enabling self-service
BI on document stores includes three phases:
multidimensional enrichment, querying, and OLAP enabling; these phases
are discussed in detail in the following subsections.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Multidimensional Enrichment</title>
      <p>
        The aim of this phase is to define a draft multidimensional
schema on which the user will be able to formulate queries.
To this end, we extract the schema of the input collection
and enrich it with basic multidimensional knowledge.
Several works have addressed schema extraction and discovery
from document stores [
        <xref ref-type="bibr" rid="ref14 ref25 ref29">25, 29, 14</xref>
        ]. These works focus on the
structural variety of documents within the same collection
caused by their schemaless nature and by the evolution of
data; for instance, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposes to extract a JSON schema
(www.json-schema.org) similar to a JSON document and
infer attribute and type occurrence frequencies to
emphasize the structural variety. In [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], all the schema versions
in a collection are extracted and stored in a repository to be
used within a schema management framework. In order to
have a single view of the distinct schemata in a collection,
the authors propose a relaxed form of schema, called
skeleton, that better captures the core and prominent fields and
filters out infrequent ones. This would be useful for us since
we rely on a single view of the collection; however, the
computational cost for building skeletons is high, making them
unsuitable in real-time contexts.
      </p>
      <p>
        Since schema discovery is not the focus of this paper, we
adopt a simple algorithm that builds a tree-like schema
including all the fields that appear in the documents [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
algorithm works in a single pass over the data to infer a tree
structure of fields and their corresponding types (fields with
varying types are generalized to String).
      </p>
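      <p>A single-pass extractor in this spirit can be sketched as follows (an illustrative reconstruction rather than the exact cited algorithm; arrays of atomic values and null values are ignored for brevity):</p>
      <preformat>
```python
def infer_cschema(documents):
    """Single pass over the collection: merge every document's fields into
    one tree mapping each key either to a type name or to a nested sub-tree."""
    schema = {}
    for doc in documents:
        merge_into(schema, doc)
    return schema

def merge_into(node, obj):
    for key, value in obj.items():
        if isinstance(value, dict):            # nested object: recurse
            sub = node.setdefault(key, {})
            if isinstance(sub, dict):
                merge_into(sub, value)
            else:                              # previously seen as primitive
                node[key] = "String"
        elif isinstance(value, list):          # array: merge its nested objects
            sub = node.setdefault(key, {})
            if isinstance(sub, dict):
                for item in value:
                    if isinstance(item, dict):
                        merge_into(sub, item)
        else:                                  # primitive value
            type_name = type(value).__name__
            if node.setdefault(key, type_name) != type_name:
                node[key] = "String"           # conflicting types: generalize
```
      </preformat>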
      <p>The tree-like collection schema that will be the input for
the subsequent steps of our approach is defined as follows.</p>
      <p>Definition 1 (Collection Schema). Given a
collection D, its schema (briefly, c-schema) is a tree S = (K, E)
where:</p>
      <sec id="sec-5-1">
        <title>K is the union of the sets of keys appearing in the documents of D;</title>
        <p>r ∈ K is the distinguished root node and is labelled with
the collection name;</p>
        <sec id="sec-5-1-1">
          <title>E is a set of arcs that includes (i) an arc r → k for</title>
          <p>each key/value pair in the root r, and (ii) an arc k1 →
k2 iff k2 appears as a key in an array of nested objects
having key k1.</p>
          <p>Figure 2 shows the c-schema to which the document of
Figure 1a conforms. A path from the root key to a leaf
key is called an attribute. In denoting attributes we use
the dot notation, omitting the root key (since a single
collection is involved) but including the keys corresponding to
nested objects; so, for instance, the c-schema in Figure 2
contains attributes id, date, teams.name, teams.results.points,
teams.players.player, teams.players.points, etc. The depth of
an attribute a is the number of arcs included in its path,
and is denoted by depth(a) (e.g., depth(id) = 1); the set of
attributes at depth ℓ is denoted by Attrs(ℓ).</p>
          <p>[Figure 2: the c-schema of the games collection. Root games with children id, date, and teams; teams with children name, abbreviation, home, score, results.points, and players; players with children player and points.]</p>
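          <p>Given the extracted tree, the attributes of Definition 1 and their depths can be enumerated with a short sketch (names are ours; we assume the c-schema is held as a nested dictionary in which only arrays of nested objects appear as sub-trees, while to-one nested objects have been flattened into dotted leaf keys such as results.points, consistently with Figure 2):</p>
          <preformat>
```python
def attributes(schema, prefix="", depth=1):
    """Map each attribute (its dotted path, root key omitted) to its depth.
    Assumes to-one nested objects were flattened into dotted leaf keys, so
    only arrays of nested objects open a deeper level (one more arc)."""
    result = {}
    for key, value in schema.items():
        if prefix:
            path = prefix + "." + key
        else:
            path = key
        if isinstance(value, dict):   # an array of nested objects
            result.update(attributes(value, path, depth + 1))
        else:                         # a leaf field with a primitive type
            result[path] = depth
    return result
```
          </preformat>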
          <p>After the c-schema S of a collection has been defined, a
corresponding multidimensional schema has to be derived
from it. Since measures at different granularities can
possibly be included in the documents (e.g., in our example,
points at the player level and score at the team level), the
definition we provide relates each single measure to a specific
set of dimensions.</p>
          <p>Definition 2 (Md-Schema). A multidimensional
schema (briefly, md-schema) is a triple M = (H, M, g)
where:</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>H is a finite set of hierarchies; each hierarchy hi ∈ H</title>
          <p>is associated with a set Li of categorical levels and a
roll-up partial order ⪰i on Li. Each element G ∈ 2^L, where
L = ∪i Li and at most one level in G is taken from
each hierarchy hi, is called a group-by set of H.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>M is a finite set of numerical measures.</title>
        <p>g is a function relating each measure in M to its
dimensions, i.e., to the set of levels that determine its
granularity: g : M → G, where G is the set of all
group-by sets of H.</p>
        <p>At first, a draft md-schema Mdraft is built from S by
tentatively labelling all attributes of types date, string, and
Boolean as dimensions (i.e., as levels of different hierarchies),
and all attributes of types integer and decimal as measures.
Date dimensions are then decomposed into different
categorical levels giving rise to standard temporal hierarchies (e.g.,
date ⪰ month ⪰ year). Note that no further attempt to
discover hierarchies is made at this stage, since discovering the
FDs involving all attributes would be computationally too
expensive. As a consequence, in Mdraft two levels l and l′
might be erroneously modeled as two separate dimensions,
while they should actually be part of the same hierarchy
because l ⪰ l′.</p>
        <p>The user can contribute to this step by manually changing
the label of some attributes, since in some cases a numeric
attribute can be used as a dimension, and a non-numeric
attribute can be used as a measure. The first situation has
no impact on the following steps, since a group-by set can
also include numeric attributes. Conversely, the second
situation requires that the values of the non-numeric attribute
be transformed into numerical values so as to be supported
by aggregation functions.</p>
        <p>
          To complete the definition of Mdraft, the granularity
mapping g must be built. We recall from [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] that an md-schema
should comply with the multidimensional space arrangement
constraint, stating that each instance of a measure is related
to one instance of each dimension. In the c-schema S,
attributes in Attrs(ℓ) are related by a to-many multiplicity to
those in Attrs(ℓ+1), so a measure cannot be related to
levels having a higher depth. Therefore, g(m) is set by
connecting each measure m to the group-by set G that contains
all levels l ∈ L such that depth(m) ≥ depth(l).
        </p>
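        <p>Following this rule, the granularity mapping can be sketched directly from the attribute depths (illustrative names; we assume attributes have already been split into dimensions and measures by type):</p>
        <preformat>
```python
from operator import ge  # ge(a, b) means "a is at least b"

def granularity_mapping(measures, dimensions, depth):
    """Relate each measure m to the group-by set containing every level l
    whose depth does not exceed depth(m)."""
    g = {}
    for m in measures:
        g[m] = {l for l in dimensions if ge(depth[m], depth[l])}
    return g
```
        </preformat>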
        <p>Example 2. In this example and in the following ones
we will refer to the representation in Figure 1a. Attribute
teams.score is numerical, so it is assumed to be a
measure. It is depth(teams.score) = 2, therefore teams.score
is tentatively associated in Mdraft with group-by set G =
{teams.name, teams.abbreviation, teams.home, id, date}. All
the measures that have the same depth as teams.score
(e.g., teams.won and teams.results.points) are also
associated with G. Similarly, measure teams.players.points
has depth 3, so it is associated with group-by set
G′ = {teams.players.player, teams.name, teams.abbreviation,
teams.home, id, date}.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4.2 Querying</title>
      <p>This phase is aimed at supporting a non-expert user in
formulating a well-formed multidimensional query. In a
schema-on-write approach, queries are formulated on a complete
md-schema whose correctness is ensured by the designer.
Conversely, in a schema-on-read approach (our case), the
md-schema is defined at read time by the query itself. Though
this approach gives more flexibility, because each user can
"force" her own multidimensional view onto the data, it
requires a further check to ensure that the FDs implied by the
query are not contradicted by data; so, querying becomes
an iterative process where the underlying md-schema is
progressively refined together with the query.</p>
      <p>Definition 3 (Md-Query). A multidimensional query
(briefly, md-query) q on md-schema M = (H, M, g) is a
triple q = (Gq, Mq, α) where</p>
      <sec id="sec-6-1">
        <title>Gq ∈ G is the md-query group-by set;</title>
      </sec>
      <sec id="sec-6-2">
        <title>Mq  M is the set of required measures;</title>
        <p>α is a function that associates each measure m ∈ Mq
with an aggregation operator.</p>
        <p>
          Starting from the draft md-schema Mdraft, the user
formulates an md-query q by choosing one to three dimensions
(Gq), one or more measures of interest (Mq), and an
aggregation operator for each measure. However, to be
considered well-formed, q (and the md-schema q is
formulated on) should comply with the following constraints [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]:
(1) The base integrity constraint, stating that the levels
in the group-by set are orthogonal, i.e., functionally
independent of each other.
(2) The summarization integrity constraint, which requires
disjointness (the measure instances to be aggregated
are partitioned by the group-by instances),
completeness (the union of these partitions constitutes the
entire set), and compatibility (the aggregation operator
chosen for each measure is compatible with the type
of that measure) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>How to carry out these validity checks is discussed in the
following subsections.</p>
        <sec id="sec-6-2-1">
          <title>Example 3</title>
          <p>A possible md-query on the draft md-schema described in Example 2 is the one characterized by Gq = {teams.players.player, date}, Mq = {teams.score}, and Sum as the aggregation operator. Unfortunately, aggregating on teams.score would result in double counting, since each instance of teams.score is related to multiple instances of teams.players.player (a team has several players). So, this query violates disjointness and is not well-formed.</p>
        </sec>
        <sec id="sec-6-2-2">
          <p>The pseudo-code of the querying phase is sketched in
Algorithm 1. It starts from the md-query q and takes as input
both the c-schema and the md-schema. The output of the
algorithm is a valid md-query, based on a refined md-schema,
to be used in the next phase. Algorithm 1 works as
follows. Firstly, procedure CheckMeasures drops from Mq the
measures, if any, that are not compatible with Gq based on
their granularity (Line 1). If no measure is left, some
alternative (compatible) measures are suggested by
RecommendMeasures (Lines 2-5). If some measures Mrec are found,
the user interacts to approve their inclusion in the md-query
(Line 5). Then, if the md-query group-by set includes either
2 or 3 levels (Lines 7-11), procedure CheckGroupBy is called
to drop from Gq the levels, if any, that are not compliant
with the base integrity constraint and to update M
accordingly (Line 8). If just one level is left in Gq, procedure
RecommendLevels is called to look for additional group-by
levels (Line 10) and possibly include them in the md-query
(Line 11). Finally, procedure CheckSummarization checks
the aggregation operators (Line 12). We allow the resulting
md-query to be executed with a single group-by level, but
not without measures (Lines 6 and 13); in the latter case,
the user is asked to choose a new group-by set (Line 15).</p>
          <p>4.2.1 Checking Measures</p>
          <p>The goal of this procedure is to ensure that all the
measures in Mq can be correctly aggregated at group-by set Gq;
specifically, our goal is to avoid that for some m ∈ Mq there
is a to-many relationship between m and a level l ∈ Gq,
because this would violate disjointness and lead to double
counting when executing q. The measure check is based on the
g component of M, which expresses the granularity of each
measure.</p>
          <p>To explain how this is done, we observe that the roll-up partial orders on the hierarchies H of M induce a partial order ⪰H on the set G of the group-by sets of H, defined as the product order of the single roll-up orders (see Footnote 1). Measure m is compatible with Gq iff g(m) ⪰H Gq; indeed, if this condition is satisfied, the granularity expressed by the group-by set is coarser than the one at which m is defined, meaning that m can be safely aggregated at Gq. The measures that are found not to be compatible with Gq are removed from Mq.</p>
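<p>To make the compatibility test concrete, here is a minimal Python sketch (not the paper's implementation; the hierarchy contents are illustrative), where each hierarchy is given as a list of levels ordered from finest to coarsest:</p>

```python
# Illustrative sketch of the measure-compatibility test g(m) >=_H Gq,
# based on the product order of the per-hierarchy roll-up orders.
# Each hierarchy is a list of levels ordered from finest to coarsest.
HIERARCHIES = {
    "game": ["id", "date", "month", "year"],
    "team": ["teams.players.player", "teams.name"],
}

def is_compatible(granularity, group_by):
    """granularity and group_by map a hierarchy name to one of its levels.
    m is compatible with Gq iff, in every hierarchy constrained by Gq,
    the group-by level is at least as coarse as the measure's level."""
    for hierarchy, gb_level in group_by.items():
        m_level = granularity.get(hierarchy)
        if m_level is None:
            continue  # the measure poses no constraint on this hierarchy
        levels = HIERARCHIES[hierarchy]
        if levels.index(gb_level) < levels.index(m_level):
            return False  # Gq asks for a finer grain than m provides
    return True

# Example 4 revisited: teams.score is defined per game and per team,
# so it cannot be grouped by the finer teams.players.player level.
g_score = {"game": "id", "team": "teams.name"}
```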
          <p>Example 4. Considering again the query in Example 3, it is not the case that g(teams.score) ⪰H Gq (because teams.players.player is not in g(teams.score)). Then, q is not well-formed and teams.score is dropped from Mq. Footnote 1: The product order of n total orders is a partial order on the Cartesian product of the n totally ordered sets, such that (x1, …, xn) ⪰ (y1, …, yn) iff xi ⪰ yi for i = 1, …, n.</p>
          <p>
            This check is performed at the schema level, so some exceptions may arise when the relationship between a measure m and Gq is hidden in the data. This may happen when arrays are not used to model to-many relationships, or when m depends either on a level not present in Gq or on a subset of the levels in Gq [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. In these cases, m may be non-additive or even non-aggregable on Gq, so the user's knowledge of the application domain is required to manually fix summarization (see Section 4.2.5).
          </p>
          <p>4.2.2 Checking Group-by Set</p>
          <p>This process is aimed at ensuring completeness and base
integrity.</p>
          <p>
            The completeness condition is violated when, for some level l ∈ Gq, there is no instance corresponding to one or more instances of a measure in Mq; from a conceptual point of view, l is classified as optional [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. In a schema-on-write approach, this problem is fixed by aggregating all "dangling" measures into an ad-hoc group of l. Similarly, in our schema-on-read approach, these instances are grouped into a null instance of l, thus restoring the completeness condition; for instance, the $ifNull operator of MongoDB could be used to replace null values with an ad-hoc one when executing q.
          </p>
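<p>The null-instance repair can be sketched in plain Python as follows (illustrative only; the journal/cites attributes are made up for the example, and in MongoDB the same effect could be obtained with $ifNull at query time):</p>

```python
# Illustrative sketch of the completeness repair: documents where the
# group-by level is missing are sent to an ad-hoc "(null)" instance,
# so that no measure instance dangles.
from collections import defaultdict

def group_with_null_instance(documents, level, measure):
    groups = defaultdict(float)
    for doc in documents:
        groups[doc.get(level, "(null)")] += doc[measure]
    return dict(groups)

docs = [
    {"journal": "J1", "cites": 10},
    {"journal": "J1", "cites": 5},
    {"cites": 3},  # dangling: no journal value
]
```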
          <p>Base integrity requires that the levels in Gq are mutually orthogonal. So, from this point of view, md-query q is valid only if, for each pair of levels l, l' ∈ Gq, there is no FD between l and l', i.e., there is a many-to-many relationship between them. An FD is a to-one relationship, which we will denote with l → l' to emphasize that the values of l functionally determine the values of l'. Checking base integrity requires accessing the data in order to retrieve AFDs between the levels in Gq that are not related by a to-many multiplicity in the c-schema (see Section 4.3 for more details about AFDs).</p>
          <p>We recall that 1 ≤ |Gq| ≤ 3:</p>
          <p>1. If |Gq| = 1, orthogonality trivially holds.</p>
          <p>2. If |Gq| = 2, the relationship between the two levels in Gq is checked. If it turns out to be many-to-many (e.g., if Gq = {date, teams.name}), the base integrity constraint is met and both levels remain in Gq. If it turns out to be many-to-one (e.g., if Gq = {id, date}), then there is a roll-up relationship between the two levels in Gq. Only the level at the "many" side (id) remains in Gq, and the md-schema is modified by placing these levels in the same hierarchy (id ⪰ date). A special case is when the relationship is one-to-one (e.g., if Gq = {teams.name, teams.abbreviation}); here, one of the levels is kept in Gq while the other is considered as a descriptive attribute.</p>
          <p>
            3. If |Gq| = 3, there are three possibilities: (a) If many-to-many relationships are found between each pair of levels, then the constraint is met and all levels remain in Gq (e.g., if Gq = {date, teams.name, teams.players.player}). (b) Let many-to-many relationships be found between two pairs of levels (l, l'), (l, l''), and a many-to-one relationship be found between the remaining pair (l', l'') (e.g., if Gq = {teams.name, id, date}, since id → date). In this case, l' and l'' are placed in the same hierarchy in the md-schema (l' ⪰ l''), and l'' is removed from Gq. (c) Finally, when l' → l and l'' → l: (i) if the relationship between l' and l'' is many-to-many, they are both retained in Gq; conceptually, l is a convergence [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] and the md-schema must be updated by adding l' ⪰ l and l'' ⪰ l; (ii) if the relationship between l' and l'' is many-to-one, then the three levels in Gq belong to the same hierarchy (l' ⪰ l'' ⪰ l) and only the level with the highest cardinality, l', is kept in Gq.
          </p>
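<p>All the pairwise tests above reduce to classifying the multiplicity between two levels from the data, which can be sketched as follows (an illustrative Python fragment, not the paper's implementation):</p>

```python
# Illustrative sketch: classify the multiplicity between two levels
# from the data, as needed by the base-integrity check.
def classify(documents, l1, l2):
    pairs = {(d[l1], d[l2]) for d in documents}
    fd_12 = len({a for a, _ in pairs}) == len(pairs)  # l1 -> l2 holds
    fd_21 = len({b for _, b in pairs}) == len(pairs)  # l2 -> l1 holds
    if fd_12 and fd_21:
        return "one-to-one"     # one level becomes a descriptive attribute
    if fd_12:
        return "many-to-one"    # roll-up: l1 is the "many" side
    if fd_21:
        return "one-to-many"
    return "many-to-many"       # orthogonal: both levels stay in Gq

games = [
    {"id": 1, "date": "d1", "team": "A"},
    {"id": 1, "date": "d1", "team": "B"},
    {"id": 2, "date": "d1", "team": "A"},
]
```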
          <p>4.2.3 Recommending Measures</p>
          <p>When there is no compatible measure left in Mq, this procedure looks for alternative measures. Specifically, it returns all the measures in the md-schema that are compatible with Gq, i.e., all m ∈ M such that g(m) ⪰H Gq.</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>Example 5</title>
          <p>With reference to Example 4, procedure RecommendMeasures returns teams.players.points, since g(teams.players.points) ⪰H Gq.</p>
          <p>4.2.4 Recommending Levels</p>
          <p>Additional group-by levels are recommended when the user had originally selected two or three group-by levels, but only one of them was left in Gq after checking the group-by set. For a given candidate level l, this requires checking that the base integrity constraint is met between l and some other level in M, and that the measure check is not violated. The first level found is proposed to the user, who can choose to use it or to keep looking for other levels (up to a maximum of three levels in the group-by set).</p>
          <p>Example 6. Let Gq = {id, date} and Mq = {teams.players.points}. The relationship between the levels in Gq is many-to-one, so date is removed. In this case, to recommend levels we may look for a many-to-many relationship between id and teams.name, teams.abbreviation, teams.home, and teams.players.player. Since teams.name does not violate the measure check and has a many-to-many relationship with id, it is recommended to the user as a possible group-by level.</p>
          <p>4.2.5 Checking Summarization</p>
          <p>
            The summarization condition states that measures cannot be aggregated along hierarchies using arbitrary aggregation operators. In principle, checking summarization would require knowing the measure category (flow, stock, or value-per-unit), the type of the group-by levels (temporal or non-temporal), and the aggregation operator (Sum, Avg, Min, Max, Count, etc.) [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
Knowing the measure category and the type of the group-by levels could be used to recommend some aggregation operators, but the user would still have to choose among a number of potential options; besides, correctly classifying a measure into flow, stock, or value-per-unit may be hard for non-skilled users. Therefore, we prefer to let users directly pick an aggregation operator for each measure in Mq.
          </p>
          <p>Summarization is violated when a measure m ∈ Mq has finer granularity than another measure m' ∈ Mq, since m would be double-counted when executing q. In this case, the user is warned that she will get erroneous results, and she may choose to change the aggregation operator for m (e.g., using Min, Max, or Avg instead of Sum will give the correct result) or to drop m from Mq.</p>
        </sec>
        <sec id="sec-6-2-4">
          <title>Example 7</title>
          <p>Let Mq = {teams.score, teams.players.pts}. These measures have different granularities, so if the user has chosen to aggregate teams.score (which has finer granularity) using Sum, she will get double counting. Then she can either change the aggregation operator for teams.score to Avg or drop teams.score from Mq.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4.3 OLAP Enabling</title>
      <p>The goal of this phase is to refine the md-schema by discovering some hierarchies, so that the user is enabled to interact with the data in an OLAP fashion. Completely building all the hierarchies would require mining all FDs between levels, which would be computationally too expensive. For this reason, we only build local portions of hierarchies for the levels in the group-by set of the previously-formulated md-query q. Besides, again for complexity reasons, we only mine simple FDs (i.e., those relating single attributes rather than attribute sets), which are the most common in multidimensional schemata.</p>
      <p>Specifically, the idea is to mine, for each level l ∈ Gq for which no roll-up relationships have been discovered yet, the FDs of either type l → l' (to enable a roll-up of l) or l' → l (to enable a drill-down of l). Then, if the user applies a roll-up or drill-down, a new md-query q' is formed and the process is iterated to further extend the hierarchies. Remarkably, using FDs to discover hierarchies guarantees that q' satisfies the conditions discussed earlier, specifically disjointness and completeness.</p>
      <p>
        FDs are not explicitly modeled in a document store, so we must resort to the data. Since document stores host large amounts of data, we can reasonably assume that the mined FDs are representative enough of the application domain. However, schemaless data commonly present errors and missing values which may hide some FDs. The tool we adopt to cope with this issue are approximate FDs (AFDs) [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], which "almost hold" on the data, instead of traditional ones. Unfortunately, this entails an approximate satisfaction of the disjointness and completeness conditions, possibly leading to non-disjoint and incomplete hierarchies.
      </p>
      <p>
        Definition 4 (Approximate FD). Given two levels l and l', let strength(l, l') denote the ratio between the number of unique values of l and the number of unique value pairs of (l, l'). We will say that AFD l ⇝ l' holds if strength(l, l') is not lower than a user-defined threshold [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <sec id="sec-7-1">
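<p>Definition 4 can be computed directly, as in the following minimal sketch (illustrative; the 0.95 threshold is an arbitrary example value, not one prescribed by the paper):</p>

```python
# Direct computation of Definition 4: strength(l, l') is the number of
# distinct l values over the number of distinct (l, l') pairs.
def strength(documents, l, l1):
    distinct_l = {d[l] for d in documents}
    distinct_pairs = {(d[l], d[l1]) for d in documents}
    return len(distinct_l) / len(distinct_pairs)

def afd_holds(documents, l, l1, threshold=0.95):
    return strength(documents, l, l1) >= threshold

# An exact FD gives strength 1.0; a single dirty document lowers it.
clean = [{"id": i, "date": "d%d" % (i // 2)} for i in range(20)]
dirty = clean + [{"id": 0, "date": "dX"}]
```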
        <p>At a given time during a user's session, let M be the current version of the md-schema and q be the last md-query formulated. Then, the search space for mining AFDs includes the levels that are not in Gq and are not involved in roll-up relationships in M. To reduce the search complexity we take into account the structural clues provided by the c-schema. Specifically, let l and l' be two levels such that depth(l) &lt; depth(l'); as already stated, the attributes at the depth of l are related by a to-many multiplicity to those at the depth of l', so l ⇝ l' cannot hold. For this reason, checking whether l and l' should be placed in the same hierarchy only requires checking whether l' ⇝ l. In addition, we avoid checking trivial and transitive AFDs, since we explore one hierarchy at a time. Finally, the result of all the AFD checks performed is stored in a meta-data repository, to be used to avoid checking the same AFDs twice.</p>
        <p>In the following subsections, roll-up and drill-down discovery are detailed.</p>
        <p>4.3.1 Roll-up Discovery</p>
        <p>Let l ∈ Gq; discovering possible roll-ups for l requires mining the AFDs of type l ⇝ l', with l' ∈ L∖Gq and depth(l') ≤ depth(l). Let R be the set of all right-hand sides for these AFDs. To avoid useless checks, the AFDs whose right-hand side l' has higher cardinality than l are not checked, since they clearly cannot hold. One exception is when |l| = |l'| or |l| ≈ |l'| (due to approximation): in this case both l' ⇝ l and l ⇝ l' must be checked, since the relationship between l and l' may be one-to-one (so that l' is a descriptive attribute for l). These bidirectional checks are made at the end of the process to prioritize the discovery of "true" hierarchies. The AFDs in R are checked until one is found to hold, say l ⇝ l'. Then, the md-schema M is updated by placing these levels in the same hierarchy (l ⪰ l') and the user is notified of the possibility of performing a roll-up from l.</p>
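<p>The candidate-pruning loop just described can be sketched as follows (illustrative Python; card and strength stand for count-distinct queries on the store, and the eps threshold is an example value):</p>

```python
# Illustrative sketch of the candidate-pruning loop of roll-up discovery.
def card(docs, a):
    return len({d[a] for d in docs})

def strength(docs, l, l1):
    return len({d[l] for d in docs}) / len({(d[l], d[l1]) for d in docs})

def discover_rollup(docs, l, candidates, eps=1.0):
    deferred = []
    for l1 in candidates:
        if card(docs, l1) > card(docs, l):
            continue                 # l ~> l1 cannot hold: skip the check
        if card(docs, l1) == card(docs, l):
            deferred.append(l1)      # possible one-to-one: check later
        elif strength(docs, l, l1) >= eps:
            return ("rollup", l1)    # place l above l1 in the md-schema
    for l1 in deferred:              # bidirectional checks, done last
        if strength(docs, l, l1) >= eps and strength(docs, l1, l) >= eps:
            return ("descriptive", l1)
    return None

games = [{"id": i, "date": "d%d" % (i // 2),
          "name": "t%d" % (i % 4), "abbr": "a%d" % (i % 4)} for i in range(8)]
```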
        <p>Example 8. Let Gq = {id, teams.name} and Mq = {teams.score, teams.won}. The search space for level id only includes date, since it is the only level for which depth(date) ≤ depth(id). Since |id| &gt; |date|, id ⇝ date is checked. This is found to be true, so the md-schema is updated with id ⪰ date. The search space for teams.name consists of teams.abbreviation and teams.home; by querying the data we find that |teams.name| = |teams.abbreviation| &gt; |teams.home|. Firstly, teams.name ⇝ teams.home is checked, and found not to hold. Then, teams.name ⇝ teams.abbreviation and teams.abbreviation ⇝ teams.name are checked; both are found to hold, giving rise to a descriptive attribute.</p>
        <p>4.3.2 Drill-Down Discovery</p>
        <p>Here the set of candidate levels for checking AFDs of type l' ⇝ l includes the levels in L∖Gq whose depth is greater than or equal to that of l, i.e., such that depth(l') ≥ depth(l). As in roll-up discovery, an AFD whose left-hand side l' has lower cardinality than l is not checked, since it cannot hold. If l' ⇝ l is found to hold, l' ⪰ l is added to M and the user is notified of the possibility of drilling down from l.</p>
        <p>Note that drilling down from l to l' could produce a new md-query q' whose group-by set, Gq' = (Gq ∖ {l}) ∪ {l'}, is not compatible with some of the measures in Mq' = Mq. Since checking compatibility is computationally cheaper than checking AFDs (as explained in Section 4.2.1, it can be done based on the partial order of group-by sets, without accessing the data), the AFD for a candidate level l' is checked only if Gq' is found to be compatible with at least one of the measures in Mq'. Of course, if l' ⇝ l is found to hold and the user decides to drill down to l', the incompatible measures in Mq' must be dropped.</p>
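<p>The compatibility-first filtering of drill-down candidates can be sketched as follows (illustrative; is_compatible and card are stand-ins for the measure check of Section 4.2.1 and for a count-distinct query):</p>

```python
# Illustrative sketch: cheap filters applied to drill-down candidates
# before the expensive AFD check.
def drilldown_candidates(l, candidates, group_by, measures,
                         is_compatible, card):
    kept = []
    for l1 in candidates:
        if card(l1) < card(l):
            continue  # l1 ~> l cannot hold
        new_gb = (group_by - {l}) | {l1}
        # run the (expensive) AFD check only if some measure survives
        if any(is_compatible(m, new_gb) for m in measures):
            kept.append(l1)
    return kept

cards = {"teams.name": 30, "teams.home": 2, "teams.players.player": 400}
gb = {"id", "teams.name"}
```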
        <sec id="sec-7-1-1">
          <title>Example 9</title>
          <p>Consider again Example 8. The search space for id includes date, teams.home, and teams.players.player (we recall that teams.abbreviation has already been added as a descriptive attribute). Since id has the highest cardinality, there are no AFDs to check. On the other hand, the search space for teams.name consists of teams.home and teams.players.player, where |teams.home| &lt; |teams.name| &lt; |teams.players.player|. AFD teams.home ⇝ teams.name is not checked because of its cardinality; the same holds for teams.players.player ⇝ teams.name, since Gq' = {id, teams.players.player} is not compatible with the measures in Mq'.</p>
          <p>
            The md-schema resulting after OLAP enabling in Examples 8 and 9 is shown in DFM notation [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] in Figure 3. Note that, at the end of the user's session, md-schemata can be stored for reuse and sharing.
          </p>
          <p>[Figure 3: the resulting md-schema in DFM notation, with fact Game, measures Teams.score and Teams.won, hierarchy id ⪰ Date ⪰ Month ⪰ Year, and level Teams.name with descriptive attribute Teams.abbreviation.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTAL EVALUATION</title>
      <p>We evaluate the performance of our approach using two real-world datasets:</p>
      <p>
        Games. The working-example dataset has been collected by Sports Reference LLC [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. It contains around 32K nested documents representing NBA games in the period 1985–2013. Each document represents a game between two teams with at least 11 players each. It contains 46 attributes; 40 of them are numeric and represent team and player results.
      </p>
      <p>DBLP. This dataset contains 5.4M documents scraped from DBLP (dblp.uni-trier.de/xml/) in XML format and converted into JSON. Documents are flat and represent eight kinds of publications, including conference proceedings, journal articles, books, theses, etc. So, they have common attributes such as title, author, type, and year, and uncommon ones such as journal and booktitle.</p>
      <p>Implementation. Each dataset has been loaded as a collection in a local MongoDB instance running on a machine with an Intel Core i5 CPU at 3.3 GHz, 4 GB of RAM, and Ubuntu 14.04. We have focused on the parts of our approach that require accessing the data to retrieve the multiplicities of inter-level relationships by means of count-distinct queries, since the purely algorithmic parts have no relevant computational complexity. Query execution can be delegated either to the document store (MongoDB in our architecture) or to query engines such as Spark SQL, Drill, and Presto. Based on the tests we performed, these engines are significantly slower than native querying, plausibly because of the different query execution plans. In addition, these engines sometimes return incorrect results when it comes to count-distinct queries involving nested attributes. Therefore, we defined two queries based on the native query language of MongoDB: 1. The first one (Q_AFD) checks the AFD between a pair of levels and returns its approximation using the formula of Definition 4. 2. The second one (Q_card) returns the cardinality of a level.</p>
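<p>As an illustration, a query like Q_AFD could be phrased as a MongoDB aggregation pipeline along the following lines (a hypothetical sketch, not necessarily the authors' actual query; which array to unwind depends on the nesting of the checked levels):</p>

```python
# Hypothetical MongoDB formulation of an AFD-strength query: one pipeline
# returning #distinct l values / #distinct (l, l1) pairs (Definition 4).
def afd_pipeline(l, l1, unwind=None):
    stages = []
    if unwind:
        # flatten a nested array (e.g. "teams") before grouping
        stages.append({"$unwind": "$" + unwind})
    stages += [
        {"$group": {"_id": {"l": "$" + l, "r": "$" + l1}}},   # distinct pairs
        {"$group": {"_id": None,
                    "pairs": {"$sum": 1},
                    "lefts": {"$addToSet": "$_id.l"}}},
        {"$project": {"strength":
                      {"$divide": [{"$size": "$lefts"}, "$pairs"]}}},
    ]
    return stages

pipeline = afd_pipeline("teams.name", "teams.abbreviation", unwind="teams")
# e.g. db.games.aggregate(pipeline) with pymongo
```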
      <p>Methodology. In the previous section, we showed how our approach validates md-queries and iteratively derives an md-schema that satisfies the multidimensional constraints. In this section, we consider five md-queries on the Games dataset, q1 to q5, and four on the DBLP dataset, q6 to q9 (the group-by sets of all queries are shown in Table 1). Then, we measure the time required by the querying and OLAP-enabling phases. We also discuss the efficiency of our algorithm based on the number of performed and skipped checks in each phase. The results obtained are presented below.</p>
    </sec>
    <sec id="sec-9">
      <title>5.1 Querying</title>
      <p>To evaluate the efficiency of the validation of an md-query, we measure the time of the group-by check, since it is the only step that requires accessing the data (see Footnote 2). In Table 1 we show, for our nine md-queries, the number of checks avoided using the c-schema (#Avoided) and the total time for validating the md-query (t_querying). In order to measure t_querying, the AFD checks required by a single md-query are executed in parallel. We can easily conclude that (i) our approach effectively reduces the number of checks by relying on the dataset structure, and (ii) the md-query validation time depends on the number of levels in the group-by set and on their depths. Remarkably, the md-query validation time is still reasonable (max 50 secs) even with a large dataset, which makes our approach suitable for real-time usage.</p>
      <p>[Table 1: for each md-query q, its group-by set Gq, the number of avoided checks (#Avoided), and its validation time t_querying. Games: q1 = {id, teams.name}, #Avoided 1/2; q2 = {date, teams.name}, 1/2; q3 = {id, teams.name, teams.home}, 2/6; q4 = {teams.players.player, date, teams.name}, 3/6; q5 = {teams.players.player, teams.name, teams.home}, 2/6. DBLP: q6 = {journal, year}, 0/2; q7 = {year, type}, 0/2; q8 = {author, year}, 1/2; q9 = {type, year, author}, 2/6.]</p>
      <p>For OLAP enabling, we measure the time to check each single AFD and the time to retrieve the cardinality of each level. Tables 2 and 3 show, for each AFD checked between a pair of levels, the number of processed documents, the AFD strength, and the checking time. The difference in the number of documents (or the data size) for a given dataset is due to the flattening operation that is performed when executing Q_AFD. The results clearly show that the querying time t_AFD depends on the levels and their depths. This is apparent for the AFDs involving the author attribute in Table 3, which have a higher execution time. This is due to the amount of memory allowed by MongoDB for group-by queries, which is limited to 100 MB; when a query does not fit in memory, temporary files must be written to disk, which dramatically slows execution down.</p>
      <p>Footnote 2: The time for actually executing each md-query is not considered here, since the optimization of md-queries is out of the scope of this paper.</p>
      <p>[Tables 2 and 3: AFDs checked during OLAP enabling. Table 2 (Games) covers the AFDs among id, date, teams.name, teams.abbreviation, teams.home, and teams.players.player (e.g., id ⇝ date, teams.name ⇝ teams.abbreviation, teams.players.player ⇝ teams.name), over 32K to 644K processed documents. Table 3 (DBLP) covers the AFDs among year, type, journal, booktitle, and author (e.g., year ⇝ type, author ⇝ year), over 5.4M to 12M processed documents. For each AFD, the tables report the number of processed documents, the AFD strength, and the checking time t_AFD.]</p>
      <p>CONCLUSION</p>
      <p>In this paper we have proposed an interactive schema-on-read approach for enabling OLAP on document stores. To this end, AFDs are mined and used to automate the discovery of multidimensional structures. The user interaction is limited to the selection of multidimensional concepts and to a proper choice of the aggregation operators. After validating user queries from the multidimensional point of view and refining the underlying multidimensional schema, we adopt a smart strategy to efficiently build local portions of hierarchies, aimed at enabling OLAP-style user interaction in the form of roll-ups and drill-downs.</p>
      <p>Overall, the experiments we conducted show that the performance of our approach is in line with the requirements of real-time user interaction. However, some relevant issues still need to be explored and are part of our future work. To improve performance, we plan to further optimize the proposed algorithms. Besides, to increase effectiveness, the evolution of data and schemata in a collection must be considered; we intend to address these issues by searching for temporal FDs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Darmont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Etcheverry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Mazon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trujillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Vossen</surname>
          </string-name>
          .
          <article-title>Fusion cubes: towards self-service business intelligence</article-title>
          .
          <source>International Journal of Data Warehousing and Mining</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <volume>66</volume>
          –
          <fpage>88</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Berlanga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nebot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Aramburu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          .
          <article-title>Using semantic web technologies for exploratory OLAP: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <volume>571</volume>
          –
          <fpage>588</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaftan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          .
          <article-title>Spark SQL: Relational data processing in Spark</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <volume>1383</volume>
          –
          <fpage>1394</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cattell</surname>
          </string-name>
          .
          <article-title>Scalable SQL and NoSQL data stores</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <volume>12</volume>
          –
          <fpage>27</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chasseur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Patel</surname>
          </string-name>
          .
          <article-title>Enabling JSON document stores in relational systems</article-title>
          .
          <source>In Proc. WebDB</source>
          , pages
          <volume>1</volume>
          –
          <issue>6</issue>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuzzocrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellatreche</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Data warehousing and OLAP over big data: current challenges and future research directions</article-title>
          .
          <source>Proc. DOLAP</source>
          , pages
          <volume>67</volume>
          –
          <fpage>70</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Graziani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>Starry vault: Automating multidimensional modeling from data vaults</article-title>
          .
          <source>In Proc. ADBIS</source>
          , pages
          <volume>137</volume>
          –
          <fpage>151</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>The dimensional fact model: A conceptual model for data warehouses</article-title>
          .
          <source>International Journal of Cooperative Information Systems</source>
          ,
          <volume>7</volume>
          (
          <issue>2-3</issue>
          ):
          <volume>215</volume>
          –
          <fpage>247</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Vrdoljak</surname>
          </string-name>
          .
          <article-title>Data warehouse design from XML sources</article-title>
          .
          <source>In Proc. DOLAP</source>
          , pages
          <volume>40</volume>
          –
          <fpage>47</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huhtala</surname>
          </string-name>
          , J. Karkkainen, P. Porkka, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          .
          <article-title>TANE: An efficient algorithm for discovering functional and approximate dependencies</article-title>
          .
          <source>Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Markl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Aboulnaga</surname>
          </string-name>
          .
          <article-title>CORDS: Automatic discovery of correlations and soft functional dependencies</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>647</fpage>
          -
          <lpage>658</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Holmgren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>Discovering multidimensional structure in relational data</article-title>
          .
          <source>In Proc. DaWaK</source>
          , pages
          <fpage>138</fpage>
          -
          <lpage>148</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Møller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>Specifying OLAP cubes on XML data</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          ,
          <volume>17</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>255</fpage>
          -
          <lpage>280</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Störl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <article-title>Schema extraction and structural outlier detection for JSON-based NoSQL data stores</article-title>
          .
          <source>In Proc. BTW</source>
          , pages
          <fpage>425</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Lenz</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Shoshani</surname>
          </string-name>
          .
          <article-title>Summarizability in OLAP and statistical databases</article-title>
          .
          <source>In Proc. SSDBM</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>143</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z. H.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gawlick</surname>
          </string-name>
          .
          <article-title>Management of flexible schema data in RDBMSs - opportunities and limitations for NoSQL</article-title>
          .
          <source>In Proc. CIDR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z. H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammerschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McMahon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Closing the functional and performance gap between SQL and NoSQL</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>227</fpage>
          -
          <lpage>238</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Mazón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lechtenbörger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Trujillo</surname>
          </string-name>
          .
          <article-title>A survey on summarizability issues in multidimensional modeling</article-title>
          .
          <source>Data and Knowledge Engineering</source>
          ,
          <volume>68</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1452</fpage>
          -
          <lpage>1469</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Phipps</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Automating data warehouse conceptual schema design and evaluation</article-title>
          .
          <source>In Proc. DMDW</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Prat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Akoka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Comyn-Wattiau</surname>
          </string-name>
          .
          <article-title>A UML-based data warehouse design method</article-title>
          .
          <source>Decision Support Systems</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1449</fpage>
          -
          <lpage>1473</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <article-title>Towards exploratory OLAP on linked data</article-title>
          .
          <source>In Proc. SEBD</source>
          , pages
          <fpage>86</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <article-title>Multidimensional design by examples</article-title>
          .
          <source>In Proc. DaWaK</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <article-title>Automating multidimensional design from ontologies</article-title>
          .
          <source>In Proc. DOLAP</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Muro</surname>
          </string-name>
          .
          <article-title>Discovering functional dependencies for multidimensional design</article-title>
          .
          <source>In Proc. DOLAP</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Morales</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>García Molina</surname>
          </string-name>
          .
          <article-title>Inferring versioned schemas from NoSQL databases and its applications</article-title>
          .
          <source>In Proc. ER</source>
          , pages
          <fpage>467</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Valeri</surname>
          </string-name>
          .
          <uri>https://thecodebarbarian.wordpress.com</uri>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaisman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Etcheverry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomsen</surname>
          </string-name>
          .
          <article-title>Dimensional enrichment of statistical linked open data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>40</volume>
          :
          <fpage>22</fpage>
          -
          <lpage>51</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vrdoljak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Banek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>Designing web warehouses from XML schemas</article-title>
          .
          <source>In Proc. DaWaK</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Schema management for document stores</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>8</volume>
          (
          <issue>9</issue>
          ):
          <fpage>922</fpage>
          -
          <lpage>933</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>