<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Variety-Aware OLAP of Document-Oriented Databases</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Enrico</forename><surname>Gallinucci</surname></persName>
							<email>enrico.gallinucci2@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">DISI-Univ. of Bologna</orgName>
								<address>
									<settlement>Cesena</settlement>
									<country>Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Golfarelli</surname></persName>
							<email>matteo.golfarelli@unibo.it</email>
							<affiliation key="aff1">
								<orgName type="institution">DISI-Univ. of Bologna</orgName>
								<address>
									<settlement>Cesena</settlement>
									<country>Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
							<email>stefano.rizzi@unibo.it</email>
							<affiliation key="aff2">
								<orgName type="institution">DISI-Univ. of Bologna</orgName>
								<address>
									<settlement>Bologna</settlement>
									<country>Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Variety-Aware OLAP of Document-Oriented Databases</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4B178103C71D19E11FED0D2CD32D2F7D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Schemaless databases, and document-oriented databases in particular, are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms. However, the absence of a unique schema adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas.</p><p>In this paper we propose an original approach to OLAP on collections stored in document-oriented databases. The basic idea is to stop fighting against schema variety and welcome it as an inherent source of information wealth in schemaless sources. Our approach builds on four stages: schema extraction, schema integration, FD enrichment, and querying; these stages are discussed in detail in the paper. To make users aware of the impact of schema variety, we propose a set of indicators related for instance to query completeness and precision.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Recent years have witnessed an erosion of the relational DBMS predominance to the benefit of DBMSs based on alternative representation models (e.g., document-oriented and graph-based) which adopt a schemaless representation for data. Schemaless databases are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms; typical schema variants within a collection consist of missing or additional attributes, different names or types for an attribute, and different structures for instances <ref type="bibr" target="#b8">[9]</ref>. The absence of a unique schema grants flexibility to operational applications but adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas. Dealing with this complexity while adopting a classical data warehouse design approach would require a notable effort to understand the rules that drove the use of alternative schemas, plus an integration activity to identify a common schema to be adopted for analysis, which is quite hard when no documentation is available. Furthermore, since new schema variations are often introduced, a continuous evolution of both the ETL process and the cube schemas would be needed.</p><p>In this paper we propose an original approach to multidimensional querying and OLAP on schemaless sources, in particular on collections stored in document-oriented databases (DODs) such as MongoDB. The basic idea is to stop fighting against data heterogeneity and schema variety, and to welcome it as an inherent source of information wealth in schemaless sources. So, instead of trying to hide this variety, we show it to users (basically, data scientists). Our main contributions can be summarized as follows.</p><p>• To the best of our knowledge, this is the first approach to propose a form of approximated OLAP analyses on document-oriented databases that embraces and exploits the inherent variety of documents. • Multidimensional querying and OLAP are carried out directly on the data source, without materializing any cube or data warehouse. • We adopt an inclusive solution to integration, i.e., the user can include a concept in a query even if it is present in a subset of documents only. We cover both inter-schema and intra-schema variety; specifically, we cope with missing attributes, different levels of detail in instances, and different attribute naming. • Our approach to the reformulation of multidimensional queries on heterogeneous documents is grounded in a formal framework <ref type="bibr" target="#b10">[11]</ref>, which ensures its correctness and completeness. • We propose a set of indicators to make the user aware of the level of completeness and precision of the query result.</p><p>Remarkably, this is not yet another paper on multidimensional modeling from non-traditional data sources. Indeed, our goal is not to design a single "sharp" schema where source attributes are either included or absent, but rather to enable OLAP querying on some sort of "soft" schema where each source attribute is present to some extent.</p><p>The paper outline is as follows. After giving an overview of our approach in Section 2, in Sections 3, 4, 5, and 6 we describe its four stages, namely schema extraction, schema integration, FD enrichment, and querying. Then, in Section 7 we discuss the related literature, and finally in Section 8 we draw the conclusions. An appendix completes the paper by discussing the correctness of our query reformulation framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH OVERVIEW</head><p>Figure <ref type="figure" target="#fig_0">1</ref> gives an overview of the approach: in blue, the different stages of the approach; on the right, the metadata produced/consumed by each stage. Remarkably, all schema-related concepts are stored as metadata, so no transformation has to be done on source data. User interaction is required at most stages. Although the picture suggests a sequential execution of the stages, it simply outlines the ordering for the first iteration. In the scenario that we envision, the user starts by analyzing the first results provided by the system, then iteratively injects additional knowledge into the different stages to refine the metadata and improve the querying effectiveness. We now briefly describe each stage.</p><p>Schema extraction (Section 3). The goal of this stage is to identify the set of distinct local schemas that occur inside a collection of documents. To this end we provide a tree-like definition for schemas which models arrays by considering the union of the schemas of their elements. This is a completely automatic stage which requires no interaction with the user.</p><p>Schema integration (Section 4). At this stage we rely on inter-schema mappings and schema integration techniques to determine a (tree-like) global schema that gives the user a single and comprehensive description of the contents of the collection. In principle, this stage could be completely automated. In practice, the best results can be obtained through a semi-automatic approach that allows users to manually validate/refine the mappings proposed by the system. As of now, we rely on the user to manually provide inter-schema mappings, from which the global schema is derived.</p><p>FD enrichment (Section 5). Traditional OLAP analyses are carried out on multidimensional cubes. 
To enable the OLAP experience in our setting, a multidimensional representation of the collection must be derived from the global schema. In particular, we introduce the notion of dependency graph, i.e., a graph that provides a multidimensional view of the global schema in terms of the functional dependencies (FDs) between its attributes. Some FDs can be inferred from the structure of the schema, others by analyzing data; given the expected schema variety, we specifically look for approximate FDs.</p><p>Querying (Section 6). The last stage consists in delivering the OLAP experience to the user by enabling the formulation of multidimensional queries on the dependency graph and their execution on the collection. First of all, each formulated query is validated against the requirements of well-formedness proposed in the literature <ref type="bibr" target="#b18">[19]</ref>. Then, the query is translated to the query language of the DOD and reformulated into multiple queries, one for each local schema in the collection; the results presented to the user are obtained by merging the results of the single local queries. To make the user aware of the impact of schema variety in terms of quality and reliability of the results, we show her a set of indicators related to query completeness and precision.</p><p>The motivating example that we use throughout the paper is based on a real-world collection of workout sessions.</p><p>[ { "_id" : ObjectId("54a4332f44cfc02424f961d4"),
    "User" : { "FullName" : "John Smith", "Age" : 42 },
    "StartedOn" : ISODate("2017-06-15T10:20:44.000Z"),
    "Facility" : { "Name" : "PureGym Piccadilly", "Chain" : "PureGym" },
    "SessionType" : "RunningProgram",
    "DurationMins" : 90,
    "Exercises" : [
      { "Type" : "Leg press", "ExCalories" : 28,
        "Sets" : [ { "Reps" : 14, "Weight" : 60 }, ... ] },
      { "Type" : "Tapis roulant" }, ... ] },
  ... ]</p><p>Figure <ref type="figure">2</ref>: An excerpt of the WorkoutSession collection obtained from a worldwide company selling fitness equipment.</p><p>Figure <ref type="figure">2</ref> shows a sample document in the collection, organized according to three nesting levels: (1) The first level contains information about the session, including the user, the facility in which the session took place, the date, and the total duration in minutes. (2) The Exercises array contains an object for every exercise carried out during the session, with information on the type of exercise and the total calories. (3) The Sets array contains an object for every set that the exercise was split into. For example, the "leg press" exercise has been done in multiple sets, the first of which comprises 14 repetitions with a weight of 60 kilograms, for a total of 28 calories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">SCHEMA EXTRACTION</head><p>The goal of this stage is to introduce a notion of (local) schema for a document, to be used in the integration stage to determine a (global) schema for a collection and then, in the FD enrichment stage, to derive an OLAP-compliant representation of the collection itself. The document is the central concept of a DOD; it encapsulates and encodes data in some standard format. The most widely adopted format is currently JSON, which we will use as a reference in this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3.1 (Document and Collection).</head><p>A document 𝑑 is a JSON object. An object is formed by a set of key/value pairs (aka fields); a key is a string, while a value can be either a primitive value (i.e., a number, a string, or a Boolean), an array of values, an object, or null. A collection 𝐷 is an array of documents.</p><p>Example 3.2. Figure <ref type="figure">2</ref> shows a document excerpted from the WorkoutSession collection; it contains numbers (e.g., Age), strings (e.g., Chain), objects (e.g., User), and arrays (e.g., Exercises). Conceptually, a session is done by a user at a facility; it includes a list of exercises, each possibly comprising several sets. □</p><p>Since there is no explicit representation of schemas in documents, multiple definitions are possible for the schemas of collections and documents, with different levels of conciseness and precision. The main difference among these definitions lies in how they cope with inter-document variety and intra-document variety.</p><p>• Inter-document variety impacts the definition of the schema for a collection, as it concerns the presence of documents with different fields. This issue is usually dealt with in one of two ways: either by defining the schema of the collection as the union/intersection <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b23">24]</ref> of the most frequent fields, or by keeping track of every different schema <ref type="bibr" target="#b19">[20]</ref>. Our work mixes the above-mentioned approaches in that it builds a global schema starting from local schemas. • Intra-document variety impacts the definition of the schema for a document, and is mainly related to the presence in a document of a heterogeneous array. For instance, an array of objects can mix objects with different fields (e.g., the first object of the Exercises array in Figure <ref type="figure">2</ref> contains fields that are missing from the second one). In this work we adopt a simple representation that, like in <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b14">15]</ref>, considers the union of the values contained in the array.</p><p>We start by giving a "structural" definition of a schema as a tree, then we reuse it to define the schema of a document and, in Section 4, the schema of a collection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3.3 (Schema).</head><p>A schema is a directed tree 𝑠 = (𝐹, 𝐴) where 𝐹 is a set of fields and 𝐴 is a set of arcs representing the relationships between arrays and the contained fields. In particular,</p><p>(1) 𝐹 = 𝐹 𝑎𝑟𝑟 ∪ 𝐹 𝑝𝑟𝑖𝑚 , where 𝐹 𝑎𝑟𝑟 is a set of array fields (including the root 𝑟 of 𝑠) and 𝐹 𝑝𝑟𝑖𝑚 is a set of primitive fields; (2) 𝐴 includes arcs from fields in 𝐹 𝑎𝑟𝑟 to fields in 𝐹 𝑎𝑟𝑟 ∪ 𝐹 𝑝𝑟𝑖𝑚 .</p><p>Each field 𝑓 ∈ 𝐹 has a name, 𝑘𝑒𝑦(𝑓 ), a unique pathname (obtained by concatenating the names of the fields along the path from 𝑟 to 𝑓 , with the exclusion of 𝑟), and a type, 𝑡𝑦𝑝𝑒(𝑓 ) (𝑡𝑦𝑝𝑒(𝑓 ) ∈ {number, string, Boolean} for all 𝑓 ∈ 𝐹 𝑝𝑟𝑖𝑚 , 𝑡𝑦𝑝𝑒(𝑓 ) = array for all 𝑓 ∈ 𝐹 𝑎𝑟𝑟 ). Given field 𝑓 ̸ = 𝑟, we denote with 𝑎𝑟𝑟(𝑓 ) the array 𝑎 ∈ 𝐹 𝑎𝑟𝑟 such that (𝑎, 𝑓 ) ∈ 𝐴.</p><p>To define the schema of a specific document we need to add identifiers to arrays. We denote with 𝑖𝑑(𝑎) the primitive field that identifies an object within array 𝑎. Documents always contain an identifier, 𝑖𝑑(𝑟) = _id. Conversely, array objects may not contain such a field, but they can still be uniquely identified by their positional index within the array. Therefore, given array 𝑎, 𝑖𝑑(𝑎) can be recursively defined as the concatenation of 𝑖𝑑(𝑎𝑟𝑟(𝑎)) and the positional index within 𝑎; it is 𝑘𝑒𝑦(𝑖𝑑(𝑎)) = _id and 𝑡𝑦𝑝𝑒(𝑖𝑑(𝑎)) = string. Definition 3.4 (Schema of a Document). Given document 𝑑 ∈ 𝐷, the schema of 𝑑 is the schema 𝑠(𝑑) = (𝐹 𝑎𝑟𝑟 ∪ 𝐹 𝑝𝑟𝑖𝑚 , 𝐴) such that (1) 𝐹 𝑎𝑟𝑟 includes a field for each array in 𝑑, labelled with the corresponding key and type, plus a root 𝑟 labelled with the name of 𝐷 and with type array.
(2) 𝐹 𝑝𝑟𝑖𝑚 includes (i) a field for each primitive in 𝑑, and (ii) a field for each 𝑖𝑑(𝑎) with 𝑎 ∈ 𝐹 𝑎𝑟𝑟 , 𝑎 ̸ = 𝑟; every field is labelled with its corresponding key and type (keys of primitives within an object field are "flattened", i.e., prefixed with the object's key);</p><p>(3) 𝐴 includes (i) an arc (𝑟, 𝑓 ) for each field 𝑓 such that 𝑘𝑒𝑦(𝑓 ) appears as a key in the root level of 𝑑, and (ii) an arc (𝑎, 𝑓 ) iff 𝑘𝑒𝑦(𝑓 ) appears as a key in an object of array 𝑎.</p><p>Example 3.5. Figure <ref type="figure" target="#fig_1">3</ref>.a shows the schema of the document represented in Figure <ref type="figure">2</ref>, part of the WorkoutSession collection (from now on, abbreviated as WS). Each array is represented as a box, with its child primitives listed below (numeric primitives are in italics). Object fields are prefixed with the object key (e.g., Facility.Chain). The vertical lines between boxes represent inter-array arcs, with the root WS on top. It is 𝑎𝑟𝑟(Exercises.Type) = Exercises and 𝑖𝑑(Exercises) = Exercises._id. □</p><p>Given collection 𝐷, we denote with 𝑆(𝐷) the set of distinct schemas of the documents in 𝐷 (where two fields in the schemas of two documents are considered equal if they have the same pathname and the same type).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p><formula>𝑆(𝐷) = ⋃︁ 𝑑∈𝐷 𝑠(𝑑)</formula></p><p>Given 𝑠 ∈ 𝑆(𝐷), we denote with 𝐷 𝑠 the set of documents 𝑑 ∈ 𝐷 such that 𝑠(𝑑) = 𝑠.</p></div>
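<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration, schema extraction can be sketched in a few lines of Python. This is a simplified sketch of Definitions 3.3 and 3.4 under assumptions of ours: a schema is encoded as a set of (pathname, type) pairs rather than as an explicit tree, object keys are flattened with a dot, an array contributes the union of the schemas of its elements, and array identifiers are omitted; all function names are illustrative.</p>

```python
# Simplified sketch of schema extraction: a schema is a set of
# (pathname, type) pairs; arrays contribute the union of the
# schemas of their elements (array identifiers omitted for brevity).

def json_type(v):
    if isinstance(v, bool):
        return "Boolean"
    if isinstance(v, (int, float)):
        return "number"
    return "string"

def schema_of(value, prefix=""):
    """Return the set of (pathname, type) fields of a JSON fragment."""
    fields = set()
    if isinstance(value, dict):
        for key, v in value.items():
            path = key if prefix == "" else prefix + "." + key
            fields |= schema_of(v, path)   # object keys are flattened
    elif isinstance(value, list):
        fields.add((prefix, "array"))
        for element in value:              # union of the element schemas
            fields |= schema_of(element, prefix)
    elif value is not None:
        fields.add((prefix, json_type(value)))
    return fields

def distinct_schemas(collection):
    """S(D): group the documents of D by their schema."""
    groups = {}
    for d in collection:
        groups.setdefault(frozenset(schema_of(d)), []).append(d)
    return groups
```

<p>For the document of Figure 2, schema_of yields, among others, the fields (User.Age, number), (Exercises, array), and (Exercises.ExCalories, number), the latter coming from the first object of the Exercises array only.</p></div>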
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SCHEMA INTEGRATION</head><p>The goal of this stage is to integrate the distinct, local schemas extracted from 𝐷 to obtain a single and comprehensive view of the collection, i.e., a global schema, and its mappings with each local schema. The global schema can be incrementally built using one of the methodologies discussed in <ref type="bibr" target="#b1">[2]</ref>; for instance, adopting a ladder integration strategy, by (i) taking one local schema as the global schema; (ii) iteratively taking each other local schema, finding its mappings onto the global schema, and updating the global schema accordingly. However, notice that some mappings may be missed by adopting a purely incremental strategy (i.e., a second iteration on the local schemas may be required). A survey of the techniques that can be used for finding mappings is provided in <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b17">18]</ref>.</p><p>A mapping is defined as follows:  where 𝜑1 is the identity function while 𝜑2 is a function that concatenates two strings. □</p><p>A transcoding function transforms values of a set of fields into values of another set of fields; it is needed for each primitive mapping to enable query reformulation in the presence of selection predicates, as well as to enable the results obtained from all documents to be integrated (see Appendix). On the other hand, array mappings are not associated with a transcoding function, because arrays are just containers and do not have values themselves.</p><p>Due to the already mentioned inter-document variety, a field 𝑓 of the global schema may not be available in every local schema (e.g., Facility.Chain is absent in the second schema in Figure <ref type="figure" target="#fig_1">3</ref>); therefore we need a measure of the support of 𝑓 with respect to the different schemas in collection 𝐷. 
Intuitively, given the nested structure of documents, the support of 𝑓 could be defined as the percentage of times that 𝑓 occurs among the objects of 𝑎𝑟𝑟(𝑓 ). However, due to the fact that 𝑓 may occur at different depths in different documents (e.g., if 𝑓 = Exercises.ExCalories in the global schema, 𝑎𝑟𝑟(𝑓 ) is Exercises in the schema of Figure <ref type="figure" target="#fig_1">3</ref>.a and Exercises.Sets in the schema of Figure <ref type="figure" target="#fig_1">3</ref>.c), this measure must be computed locally to each schema and then aggregated to get a global measure. Thus, we define the global support of 𝑓 as the weighted average of the local supports calculated on the distinct schemas. Definition 4.4 (Local Support of a Field). Given a document schema 𝑠 = (𝐹, 𝐴), the local support of a field 𝑓 ∈ 𝐹 is recursively defined as:</p><formula xml:id="formula_0">𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) = 1, if 𝑓 ≡ 𝑟; 𝑝𝑒𝑟𝑐(𝑓 ) • 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑎𝑟𝑟(𝑓 ), 𝑠), otherwise</formula><p>where 𝑝𝑒𝑟𝑐(𝑓 ) is the percentage of objects of 𝑎𝑟𝑟(𝑓 ) which include 𝑓 .</p><p>Note that the support of 𝑓 is weighted on the support of its array 𝑎𝑟𝑟(𝑓 ); this is because, for instance, 𝑓 may occur in every object of 𝑎𝑟𝑟(𝑓 ) but 𝑎𝑟𝑟(𝑓 ) may be missing for some object of 𝑎𝑟𝑟(𝑎𝑟𝑟(𝑓 )). As a result, it is always 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) ≤ 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑎𝑟𝑟(𝑓 ), 𝑠). Definition 4.5 (Global Support of a Field). Given collection 𝐷 and the set of distinct schemas 𝑆(𝐷), the global support of a field 𝑓 ∈ 𝐹 is:</p><formula xml:id="formula_1">𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓 ) = ∑︁ 𝑠∈𝑆(𝐷) 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) • |𝐷 𝑠 | / |𝐷|</formula><p>where |𝐷 𝑠 | is the number of documents with schema 𝑠 and |𝐷| is the overall number of documents. </p></div>
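<div xmlns="http://www.tei-c.org/ns/1.0"><p>Definitions 4.4 and 4.5 can be sketched as follows. Here, purely for illustration, a schema is encoded as a dict mapping each field to a pair (perc, parent array), with the root named "root"; this flat encoding and all names are our assumptions, not part of the approach.</p>

```python
# Sketch of local and global support (Definitions 4.4-4.5).
# A schema maps each field to (perc, parent_array); "root" is the root.

def loc_supp(field, schema):
    """locSupp(f, s): product of perc values along the path to the root."""
    if field == "root":
        return 1.0
    perc, parent = schema[field]
    return perc * loc_supp(parent, schema)

def glo_supp(field, schemas, sizes):
    """gloSupp(f): average of local supports weighted by |D_s| / |D|."""
    total = sum(sizes.values())
    supp = 0.0
    for name, schema in schemas.items():
        if field in schema:                  # absent field: support 0
            supp += loc_supp(field, schema) * sizes[name] / total
    return supp
```

<p>For instance, if Exercises.ExCalories appears in half of the Exercises objects of a schema covering 80% of the documents and is absent elsewhere, its global support is 0.5 · 0.8 = 0.4.</p></div>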
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">FD ENRICHMENT</head><p>The goal of this stage is to propose a multidimensional view of the global schema to enable OLAP analyses. The main informative gap to be filled to this end is the identification of hierarchies, which in turn relies on the identification of FDs between fields in the global schema. While in relational databases FDs are represented at the schema level by means of primary and referential integrity constraints, the same is not true in DODs. Yet, identifiers are present in DODs: each collection has its (explicit) _id field and, as discussed in Section 3, every nested object has its own (implicit) identifier (i.e., 𝑖𝑑(𝑎) with 𝑎 ∈ 𝐹 𝑎𝑟𝑟 ). The presence of these identifiers implies the existence of some FDs, which we call intensional as they can be derived from the global schema without looking at the data. In particular, given global schema 𝑔(𝐷) = (𝐹 𝑎𝑟𝑟 ∪ 𝐹 𝑝𝑟𝑖𝑚 , 𝐴) and array 𝑎 ∈ 𝐹 𝑎𝑟𝑟 , we can infer that:</p><p>• 𝑖𝑑(𝑎) → 𝑓 for every 𝑓 ∈ 𝐹 𝑝𝑟𝑖𝑚 such that 𝑎𝑟𝑟(𝑓 ) = 𝑎, i.e., the identifier of 𝑎 determines the value of every primitive in 𝑎 (e.g., _id → SessionType); • if 𝑎 ̸ = 𝑟, then 𝑖𝑑(𝑎) → 𝑖𝑑(𝑎𝑟𝑟(𝑎)), i.e., the identifier of 𝑎 determines the identifier of 𝑎𝑟𝑟(𝑎) (e.g., Exercises._id → _id); this is trivial, since 𝑖𝑑(𝑎𝑟𝑟(𝑎)) is part of 𝑖𝑑(𝑎).</p><p>In practice, additional FDs can exist between primitive nodes, though they cannot be inferred from the schema; so, they can only be found by checking the data. More precisely, since DODs may contain incomplete and faulty data, we have to look for approximate FDs (AFDs), i.e., FDs that "mostly" hold on data, as done for instance in <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b22">23]</ref>.</p><p>Definition 5.1 (Approximate Functional Dependency). 
Given two fields 𝑓 and 𝑓 ′ , let 𝑎𝑐𝑐(𝑓, 𝑓 ′ ) ∈ [0..1] denote the ratio between the number of unique values of 𝑓 and the number of unique values of (𝑓, 𝑓 ′ ). We will say that AFD 𝑓 ⇝ 𝑓 ′ holds if 𝑎𝑐𝑐(𝑓, 𝑓 ′ ) ≥ 𝜖, where 𝜖 is a user-defined threshold <ref type="bibr" target="#b13">[14]</ref>.</p><p>To detect AFDs and create hierarchies accordingly, some approaches that were recently devised in the literature (e.g., <ref type="bibr" target="#b9">[10]</ref> and <ref type="bibr" target="#b5">[6]</ref>) can be reused, possibly coupled with traditional approaches to multidimensional modeling based on FDs (e.g., <ref type="bibr" target="#b22">[23]</ref>). Interestingly, in <ref type="bibr" target="#b5">[6]</ref> the number of checks to be made for AFD detection is effectively reduced thanks to the intensional FDs provided by the global schema. Note that, differently from <ref type="bibr" target="#b5">[6]</ref>, in our approach we consider inter-document variety, so the queries that check for AFDs must be reformulated from the global schema on each local schema. How this can be done is discussed in the Appendix. Definition 5.2 (Dependency Graph). Given the global schema 𝑔(𝐷) = (𝐹, 𝐴) and an (acyclic) set of (A)FDs Γ, the dependency graph is a couple ℳ = (𝐹 𝑝𝑟𝑖𝑚 , ⪰) where 𝐹 𝑝𝑟𝑖𝑚 is the set of primitive nodes in 𝐹 and ⪰ is a roll-up partial order of 𝐹 𝑝𝑟𝑖𝑚 derived from Γ. In particular, 𝑓𝑗 ⪰ 𝑓 𝑘 (i.e., 𝑓𝑗 is a predecessor of</p><formula xml:id="formula_2">𝑓 𝑘 in ⪰) if either 𝑓𝑗 ⇝ 𝑓 𝑘 ∈ Γ or 𝑓𝑗 → 𝑓 𝑘 ∈ Γ.</formula><p>The differences between a dependency graph and the global schema it is derived from are that (1) the global schema is a tree, the dependency graph is a DAG; (2) arrays are not present in the dependency graph, but their id's are; (3) arcs express (A)FDs in the dependency graph, syntactical containment in the global schema; (4) differently from the global schema, the dependency graph can include arcs between primitive fields.</p><p>Example 5.3. 
Figure <ref type="figure">4</ref> shows the dependency graph for our working example. Each primitive field is represented as a circle whose color reflects the field's global support (the lighter the tone, the lower the support). Identifiers (e.g., _id) are shown in bold. Directed arrows represent the (A)FDs in Γ; for instance, it is _id → Facility.Name (FDs are shown in black) and Facility.Name ⇝ Facility.Chain (AFDs are shown in gray). Note that, in this case, the dependency graph is a tree, because in the global schema of Figure <ref type="figure" target="#fig_1">3</ref>.b arrays are nested within each other. A different situation is the one shown in Figure <ref type="figure" target="#fig_5">5</ref>, where the collection includes documents with two arrays at the same level, so the dependency graph is not a tree. □</p></div>
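<div xmlns="http://www.tei-c.org/ns/1.0"><p>The acc measure of Definition 5.1 admits a direct implementation. The sketch below checks an AFD on flattened rows (e.g., the rows obtained after unwinding the arrays); the row encoding and all names are illustrative assumptions of ours.</p>

```python
# Sketch of AFD detection (Definition 5.1): f ~> f' holds when the
# number of distinct f values divided by the number of distinct
# (f, f') pairs is at least a user-defined threshold epsilon.

def acc(rows, f, g):
    values, pairs = set(), set()
    for r in rows:
        if f in r and g in r:
            values.add(r[f])
            pairs.add((r[f], r[g]))
    return len(values) / len(pairs) if pairs else 0.0

def afd_holds(rows, f, g, epsilon=0.95):
    return acc(rows, f, g) >= epsilon
```

<p>If every facility name maps to exactly one chain, acc equals 1 and the FD is exact; a few conflicting pairs lower acc below 1, and the dependency is retained only while acc stays above the threshold.</p></div>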
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">QUERYING</head><p>In this section we describe the final querying stage. We start by providing the definition of an OLAP query and discussing its correctness from a multidimensional standpoint (Section 6.1). Then, we discuss the execution of a query, which mainly involves its translation into the MongoDB language and the reformulation from the global schema to the local schemas (Section 6.2). Finally, we introduce a set of indicators to evaluate a query in the context of an OLAP session (Section 6.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Query Formulation</head><p>First of all, we define an OLAP query as follows. Definition 6.1 (OLAP query). Given dependency graph ℳ = (𝐹 𝑝𝑟𝑖𝑚 , ⪰), an OLAP query on ℳ is a quadruple 𝑞 = ⟨𝐺, 𝑝, 𝑚, 𝜙⟩ where:</p><p>• 𝐺 is the query group-by set, i.e., a non-empty set of fields in 𝐹 𝑝𝑟𝑖𝑚 such that for all couples 𝑓𝑗, 𝑓 𝑘 in 𝐺 it is 𝑓𝑗 ̸ ⪰ 𝑓 𝑘 ; • 𝑝 is an (optional) selection predicate; it is a conjunction of Boolean predicates, each involving a field in 𝐹 𝑝𝑟𝑖𝑚 ; • 𝑚 ∈ 𝐹 𝑝𝑟𝑖𝑚 is the query measure, i.e., the numerical field to be aggregated; • 𝜙 is the operator to be used for aggregation (e.g., avg, sum);</p></div>
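<div xmlns="http://www.tei-c.org/ns/1.0"><p>Checking the conditions of Definition 6.1 (e.g., that no two group-by levels are related by the roll-up order) requires a reachability test on the dependency graph. A minimal sketch, with the (A)FD arcs given as a set of pairs and names taken from the running example, could be:</p>

```python
# Sketch of the roll-up test on the dependency graph: f_j precedes
# f_k when f_k is reachable from f_j along the (A)FD arcs.

def rolls_up_to(arcs, fj, fk):
    """True if fj equals fk or fk is reachable from fj."""
    frontier, seen = [fj], set()
    while frontier:
        f = frontier.pop()
        if f == fk:
            return True
        if f not in seen:
            seen.add(f)
            frontier.extend(g for (h, g) in arcs if h == f)
    return False
```

<p>With the arcs of Figure 4, rolls_up_to would report, e.g., that _id precedes Facility.Chain but not vice versa, so the two fields cannot appear together in a group-by set.</p></div>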
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Validity check of an OLAP query</head><p>Input: a dependency graph ℳ = (𝐹 𝑝𝑟𝑖𝑚 , ⪰), an OLAP query 𝑞 = ⟨𝐺, 𝑝, 𝑚, 𝜙⟩. Output: a validity status. 1: 𝑤𝑎𝑟𝑛 ← false 2: for each 𝑓 ∈ 𝐺 do 3: if 𝑖𝑑(𝑎𝑟𝑟(𝑚)) ̸ ⪰ 𝑖𝑑(𝑎𝑟𝑟(𝑓 )) then 4: 𝑤𝑎𝑟𝑛 ← true ◁ Disjointness failed 5: if 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓 ) &lt; 1 then 6: 𝑤𝑎𝑟𝑛 ← true ◁ Completeness failed 7: if 𝑤𝑎𝑟𝑛 then 8: return "warning" 9: else 10: return "valid"</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>• there exists in ℳ one single field 𝑓 such that 𝑓 ⪰ 𝑓 ′ for all other fields 𝑓 ′ mentioned in 𝑞 (either in 𝐺, 𝑝, or 𝑚).</p><p>We will refer to all the fields in 𝐺 and 𝑝 as the query levels. Field 𝑓 is called the fact of 𝑞 (denoted 𝑓 𝑎𝑐𝑡(𝑞)) and corresponds to the coarsest granularity of ℳ on which 𝑞 can be formulated. An example of a case in which a fact cannot be determined is the one in Figure <ref type="figure" target="#fig_5">5</ref>, with 𝐺 = {Classes.Name, Exercises.Type}. In <ref type="bibr" target="#b18">[19]</ref> the authors outline the constraints that must hold for an OLAP query to be considered well-formed, namely the base integrity constraint (stating that the levels in the group-by set must be functionally independent of each other) and the summarization integrity constraint <ref type="bibr" target="#b15">[16]</ref>, which in turn requires disjointness (the measure instances to be aggregated are partitioned by the group-by instances), completeness (the union of these partitions constitutes the entire set), and compatibility (the aggregation operator chosen for each measure is compatible with the type of that measure). Remarkably, Definition 6.1 already ensures that queries meet the base integrity constraint (because the query group-by set cannot include fields related by (A)FDs). As to the summarization integrity constraint, since the goal of our approach is to enable an immediate querying of data with no cleaning beforehand, we adopt a "soft" approach to avoid being too restrictive. So, after each query has been formulated by the user, it undergoes a check (sketched in Algorithm 1) that can possibly return some warnings to inform the user of potentially incorrect results. Specifically, the disjointness constraint ensures that the granularity of the measure is not coarser than that of the group-by levels (line 3); if this does not hold, the same instance of 𝑚 will be double-counted for multiple instances of the group-by set <ref type="bibr" target="#b18">[19]</ref>. The completeness constraint ensures that the levels in the group-by set have full global support (line 5); this constraint is easily contradicted, as it clashes with the schemaless nature of DODs. Finally, the compatibility constraint is not considered at all, since its verification would require properly categorizing measures (i.e., flow, stock, and value-per-unit) and levels (i.e., temporal and non-temporal), but this information can hardly be inferred from the schema or even provided by the user <ref type="bibr" target="#b5">[6]</ref>.</p></div>
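<div xmlns="http://www.tei-c.org/ns/1.0"><p>Algorithm 1 can be rendered as a short executable sketch. The roll-up test and the support lookup are passed in as callables, and all names are illustrative:</p>

```python
# Executable sketch of Algorithm 1 (validity check of an OLAP query).

def check_query(group_by, measure, id_of_arr, rolls_up, glo_supp):
    """Return "valid", or "warning" if disjointness/completeness fail."""
    warn = False
    for f in group_by:
        # disjointness: id(arr(m)) must roll up to id(arr(f))
        if not rolls_up(id_of_arr(measure), id_of_arr(f)):
            warn = True
        # completeness: level f must have full global support
        if 1.0 > glo_supp(f):
            warn = True
    return "warning" if warn else "valid"
```

<p>As in the algorithm, failing either constraint does not reject the query: the user only receives a warning that the result may be incomplete or double-counted.</p></div>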
<div xmlns="http://www.tei-c.org/ns/1.0"><p>As previously mentioned, a query fails the completeness constraint if one or more levels in the group-by set do not have full support. This issue is strictly related to the one of incomplete hierarchies in data warehouse design. The related work proposes three alternative strategies to replace missing values in a hierarchy level 𝑙𝑗: balancing by exclusion (i.e., replacing all missing values with a single value "Other"), downward balancing (replacing with values from the closest level 𝑙 𝑘 such that 𝑙 𝑘 ⪰ 𝑙𝑗), and upward balancing (replacing with values from the closest level 𝑙 𝑘 such that 𝑙𝑗 ⪰ 𝑙 𝑘 ) <ref type="bibr" target="#b11">[12]</ref>. Although these strategies are originally meant to be applied when populating a data warehouse from an operational source, they can be directly applied at query time, e.g., by using the $ifNull operator in MongoDB, which allows replacing a missing value in a field with a custom value or with the value of another field. Thus, when a query fails the completeness constraint, we ask the user to indicate the desired strategy to replace missing values in the levels without full support.</p></div>
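<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three balancing strategies can be mimicked outside MongoDB as well. The sketch below works on flat dicts, with the finer/coarser neighbour levels passed explicitly; it is a simplified stand-in for the $ifNull-based rewriting, and all names are ours.</p>

```python
# Sketch of balancing for a level with missing values: exclusion,
# downward (borrow from the finer level), upward (borrow from coarser).

def balance(doc, level, strategy, finer=None, coarser=None):
    """Return the value to use for level in doc, mimicking $ifNull."""
    if doc.get(level) is not None:
        return doc[level]
    if strategy == "exclusion":
        return "Other"
    if strategy == "downward":
        return doc.get(finer)
    if strategy == "upward":
        return doc.get(coarser)
    raise ValueError("unknown strategy: " + strategy)
```

<p>For example, a session whose Facility.City is missing can be grouped under "Other", under its Facility.Name (downward), or under the value of a coarser level (upward), depending on the strategy chosen by the user.</p></div>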
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Query Execution</head><p>Once a query has been formulated by the user on the dependency graph corresponding to the global schema, it has to be reformulated on each local schema to effectively cope with inter-document variety. How this can be done is discussed in the Appendix. In the remainder to this subsection we explain how, after reformulation, each single query obtained can be translated to MongoDB.</p><p>OLAP queries are translated to MongoDB according to its aggregation framework, which allows to declare a multi-stage pipeline of transformations to be carried out on the documents of a collection. The most important stages are: $match (to apply predicate selections), $project (to apply transformations to the single fields), $unwind (to unfold an array by creating a different document for every object inside the array), $group (to group the documents and calculate aggregated values).</p><p>Given query 𝑞 = ⟨𝐺, 𝑝, 𝑚, 𝜙⟩ on ℳ and global schema 𝑔(𝐷) = (𝐹, 𝐴), the translation of 𝑞 into the MongoDB language is done as follows:</p><p>(1) For every array 𝑎 in 𝑔(𝐷), 𝑎 ̸ = 𝑟, for which there is a field 𝑓 mentioned in 𝑞 such that 𝑓 𝑎𝑐𝑡(𝑞) ⪰ 𝑖𝑑(𝑎) ⪰ 𝑓 , an $unwind stage is defined; the order of this stages reflects the order of the arrays in 𝑔(𝐷), beginning from the one closest to 𝑟.</p><p>(2) If 𝑝 ̸ = ∅, a $match stage is defined listing every selection predicate.</p><p>(3) A $project stage is defined to keep only the fields that are required for the following stages, i.e., 𝑚 and every group-by level. If there is one (or more) incomplete level 𝑓 ∈ 𝐺 (i.e., such that 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓 ) &lt; 1), the replacement of the missing values of 𝑓 is done at this stage, in accordance with the balancing strategy chosen by the user. Additionally, a new field named balanced is added and valued true if any of the projected fields has been affected by the balancing strategy, false otherwise. 
(4) A $group stage is defined including the fields that identify a group (i.e., every level 𝑓 ∈ 𝐺 plus the balanced field), the measure 𝑚 to be aggregated, and its aggregation function 𝜙. Additionally, two new measures named count and count-m are added to count, respectively, the number of aggregated objects and the number of aggregated objects that actually contain a value for 𝑚.</p><p>The query-independent fields balanced, count, and count-m are needed to calculate the indicators of the query, which will be discussed in Section 6.3.</p><p>Example 6.4. The MongoDB query obtained from 𝑞1 considering a downward balancing strategy is the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>db.WS.aggregate({</head><p>{ $unwind: "$Exercises" }, { $unwind: "$Exercises.Sets" }, { $match: { "User.Age": { $gte: 60 } } }, { $project: { "Facility.City": { $ifNull:</p><p>["$FacilityCity","$FacilityName"] } }, "Exercises.Type": 1, "Exercises.Sets.Weight": 1, "balanced": { $cond: ["$FacilityCity",false,true] } } }, { $group: { " id": { "FacilityCity","$FacilityCity", "ExercisesType","$Exercises.Type", "balanced","$balanced" }, "Exercises.Sets.Weight": { $avg: "$Exercises.Sets.Weight" }, "count":</p><formula xml:id="formula_3">{ $sum: 1 }, "count-m": { $sum: { $cond: ["$Exercises.Sets.Weight",1,0] } } } } } □ 6.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Query Evaluation and Evolution</head><p>In our schemaless scenario, the evaluation of the query results cannot transcend from the evaluation of the query itself. In particular, it is important to understand the coverage of the query with respect to the collection (which may be influenced by the support of the fields, the quality of the mappings, and the selectivity of the selection predicate), as well as the reliability of the results. For these reason, we introduce some indicators to evaluate the quality of an OLAP query after it has been executed. Let 𝐸 be the set of distinct groups returned by query 𝑞; each group 𝑒 ∈ 𝐸 includes |𝑒| objects (measured by the count field as of Section 6.2), of which |𝑒|𝑚 (measured by the count-m field) have a value for 𝑚.</p><p>Selectivity. This indicator measures the selectivity of the selection predicates in 𝑞:</p><formula xml:id="formula_4">𝑠𝑒𝑙(𝑞) = ∑︀ 𝑒∈𝐸 |𝑒| |𝑓 𝑎𝑐𝑡(𝑞)|</formula><p>Completeness. This indicator is built on the concept of completeness previously introduced. The idea is to show the percentage of the queried objects that have not been affected by the balancing strategies (which steps in when the value of a level is null or does not exists):</p><formula xml:id="formula_5">𝑐𝑜𝑚𝑝𝑙(𝑞) = ∑︀ 𝑒∈𝐸,!𝑏𝑎𝑙𝑎𝑛𝑐𝑒𝑑(𝑒) |𝑒| ∑︀ 𝑒∈𝐸 |𝑒|</formula><p>where 𝑏𝑎𝑙𝑎𝑛𝑐𝑒𝑑(𝑒) is true if 𝑒 has been balanced, false otherwise (as stated by the balanced field introduced in Section 6.2).</p><p>Group precision. While the absence of full support on levels can be overcome by the balancing strategies, nothing can be done when it involves the query measure. In this case, the precision of the aggregated value returned for each group is determined by the percentage of aggregated objects that actually contain a value for the measure. 
Thus, the precision of a group 𝑒 is</p><formula xml:id="formula_6">𝑝𝑟𝑒𝑐(𝑒) = |𝑒|𝑚 |𝑒|</formula><p>Consistently with an OLAP scenario, a query can evolve into another through the application of an OLAP operation; the resulting sequence of queries is called an OLAP session. In particular, the permitted operations are the following.</p><p>• The replacement of the query measure with a different one, or the selection of a different aggregation operator. If a new measure is chosen, a new validity check is required to verify whether the disjointness requirement still holds. • The addition/removal/modification of a selection predicate. This operation has no impact on the validity of the query. • The roll-up (or drill-down) of one of the group-by levels, which leads to replacing a level 𝑓 with a level 𝑓′ such that 𝑓 ⪰ 𝑓′ (or 𝑓′ ⪰ 𝑓).</p><p>Roll-ups and drill-downs imply a navigation of the dependency graph along the relationships between 𝑓 and 𝑓′, which represent (A)FDs. From a multidimensional standpoint, the navigation of an AFD with accuracy lower than 1 leads to a violation of the roll-up semantics, i.e., the results of the second query will not be a correct composition (or decomposition) of the results of the first query. This happens because the FD does not strictly hold in some cases, which compromises the correctness of the aggregation. Thus, we evaluate the impact of these operations by means of another indicator:</p><p>Accuracy. This indicator quantifies the accuracy of the aggregated results of a query during an OLAP session with respect to the results obtained from the previous query. Given query 𝑞, let 𝑞′ be the query resulting from a roll-up (or drill-down) of 𝑞 from level 𝑓 to 𝑓′, and let Γ′ ⊆ Γ be the set of AFDs in the path between 𝑓 and 𝑓′. Then, the accuracy of 𝑞′ with respect to 𝑞 is</p><formula xml:id="formula_7">𝑎𝑐𝑐(𝑞′, 𝑞) = ∏︁ 𝛾∈Γ′ 𝑎𝑐𝑐(𝛾)</formula></div>
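Given the result groups, the indicators above are straightforward to compute; the following Node.js sketch is our illustration (group objects carry the count, count-m, and balanced fields of Section 6.2, and factSize stands for |fact(q)|):

```javascript
// Post-execution indicators of Section 6.3, computed from the result groups.
function indicators(groups, factSize) {
  const total = groups.reduce((s, e) => s + e.count, 0);
  const unbalanced = groups.filter(e => !e.balanced)
                           .reduce((s, e) => s + e.count, 0);
  return {
    sel: total / factSize,                     // sel(q): selectivity of the predicates
    compl: unbalanced / total,                 // compl(q): share of objects untouched by balancing
    prec: groups.map(e => e.countM / e.count)  // prec(e) for each group
  };
}
```

For instance, two groups of 60 and 40 objects (the latter balanced) out of 200 fact instances give sel = 0.5 and compl = 0.6.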
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED LITERATURE</head><p>The rise of NoSQL stores has captured a lot of interest from the research community, which has proposed a variety of approaches to deal with the schemaless feature. In particular, most of the recent works focus on the widely adopted JSON format and on key/value repositories in general.</p><p>A first distinction lies in how each work approaches the problem of schema discovery. Some works aim at providing a comprehensive view of the schema variety in JSON documents; e.g., <ref type="bibr" target="#b19">[20]</ref> proposes a reverse engineering process to derive a versioned schema model, where multiple versions of the same field are created for every intensional variation detected in the collection. Other works provide a more concise representation that tends to hide schema variety. For instance, <ref type="bibr" target="#b23">[24]</ref> couples a clustering technique with schema matching techniques to identify a skeleton containing the smallest set of core fields, while <ref type="bibr" target="#b0">[1]</ref> adopts regular expressions to model the variability of a field type. Our work is closer to the latter group, although our global schema captures the entire variety of fields and enables the user to choose the fields to focus on, while assisting her with quality indicators of the final queries. Several free tools have also been released to perform schema detection on different platforms (MongoDB, ElasticSearch, Couchbase, Apache Drill), although they are mostly limited to collecting the union of the fields. 
In a previous work <ref type="bibr" target="#b8">[9]</ref> we followed a different approach and devised a schema profiling algorithm that explains the schema variety in a collection in terms of the extensional values found in the documents (e.g., it could find that different schemas depend on the different values for SessionType).</p><p>The most distinguishing feature of our approach is the definition of a multidimensional representation of the schema in order to enable OLAP analyses directly on the DOD. From this point of view, a work closely related to ours is <ref type="bibr" target="#b5">[6]</ref>, which proposes a schema-on-read approach for OLAP queries over DODs. This is done by building a multidimensional schema from the union of fields found in the collection; then, the OLAP experience is proposed at query time, where suggestions for roll-up and drill-down operations are provided given the query formulated by the user. Differently from our approach, <ref type="bibr" target="#b5">[6]</ref> exclusively focuses on the multidimensional representation of JSON data and overlooks the schemaless property of DODs: in particular, inter-document variety is considered only in terms of fields with varying support (thus no schema integration is performed), and no support is given to the user to evaluate the coverage and accuracy of queries. Also, AFD detection is carried out on demand only after the user has written a query, thus it only impacts the OLAP experience. Another similar work is <ref type="bibr" target="#b6">[7]</ref>, which proposes a MapReduce-based algorithm to compute OLAP cubes on columnar stores. The approach is meant to work on a data warehouse (i.e., a database already comprising facts and dimensions); besides, it is limited to the computation of the cubes, while the OLAP querying aspect is mentioned as future work. 
Also <ref type="bibr" target="#b3">[4]</ref> aims at delivering the OLAP experience, but its operational data source is a graph-based database, whose data model is entirely different from that of DODs. Finally, <ref type="bibr" target="#b12">[13]</ref> builds on <ref type="bibr" target="#b23">[24]</ref> to propose a complete architecture that ingests NoSQL data and provides schema-on-read functionalities, but without mentioning multidimensional enrichment and OLAP analyses.</p><p>Since schema variety in a collection often consists of different representations of the same data (e.g., due to schema evolution or to the ingestion of data from different sources), the problem of schema discovery is often coupled with schema matching algorithms. <ref type="bibr" target="#b17">[18]</ref> provides a comprehensive summary of the different techniques envisioned for generic schema matching (ranging from the relational world to ontologies and XML documents); it is mentioned as a baseline reference in <ref type="bibr" target="#b23">[24]</ref>, while <ref type="bibr" target="#b7">[8]</ref> starts from there to define its own algorithm for schema matching on NoSQL stores based on subtree matching. In <ref type="bibr" target="#b20">[21]</ref> a tool is presented to automatically identify evolution in the schema of instances in NoSQL databases: once a schema change is detected, the tool either updates the database instances to enforce schema consistency or provides code to deal with the issue on the application side. This structured approach differs from our schema-on-read scenario, which transparently handles schema differences and avoids updating the original data.</p><p>Several works have focused on bringing NoSQL back to the relational world. 
<ref type="bibr" target="#b16">[17]</ref> discusses an approach to provide schema-on-read capabilities for flexible schema data stored on RDBMSs; this is done by mapping the document structure on different tables and by providing a data guide as the union of every possible field at any level. Differently from our approach, no advanced schema matching mechanism is provided. <ref type="bibr" target="#b4">[5]</ref> proposes an algorithm to provide a generic relational encoding of arbitrary JSON documents; in particular, documents are stored in ternary relations that contain rows for every key in every document (i.e., each row stores the document id, the key name, and the key value). A more sophisticated algorithm is proposed in <ref type="bibr" target="#b7">[8]</ref>, where normalized relational schemas are automatically generated from NoSQL stores. It relies on AFD detection to build relationships between entities and it provides its own schema matching algorithm. Based on this approach, a vision for a new paradigm called adaptive schema databases has been proposed in <ref type="bibr" target="#b21">[22]</ref>; it is a conceptual framework that devises global schemas as time-evolving and user-dependent relational views that are mapped to local schemas via probabilistic mappings -whereas mappings are deterministic in our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSIONS</head><p>In this paper we have presented an original approach to OLAP on DODs. Our current implementation relies on a prototype that separately handles the different stages. In As future work, we plan to build a fully-functioning implementation, as well as to thoroughly evaluate the performance and scalability of the approach. Also, we plan to switch from a single machine to a multi-node cluster and to consider schema profiling techniques <ref type="bibr" target="#b8">[9]</ref> to enhance the support given to the user at query time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX</head><p>Not only inter-schema mappings enable the definition of a global schema, they also allow a MongoDB query formulated on the global schema to be reformulated on each local schema, which is necessary in two situations: (i) when the collection is queried to detect AFDs (Section 5) and (ii) when the user issues an OLAP query on the collection (Section 6). The query reformulation algorithm we adopt here is the one proposed by <ref type="bibr" target="#b10">[11]</ref> in the context of business intelligence networks (BINs); it enables the rewriting of a query from a source multidimensional schema to a target multidimensional schema and has been proved to be complete and provide all certain answers to the query. In this section we discuss why that algorithm can be reused to safely rewrite queries in both situations (i) and (ii). To this end we need to prove that the data schemas, the interschema mappings, and the queries which we consider in our work are a particular case of those used as a reference in the BIN context. Data schema. The reference schema in the BIN context is a classical multidimensional schema featuring a fact, a set of hierarchies (each made of levels), and a set of measures (each coupled with an aggregation operator). The dependency graph of Definition 5.2 can be thought of as a sort of "multi-fact" multidimensional schema with no explicit distinction between levels and measures. However, when an OLAP query is formulated as in Definition 6.1, exactly one fact is implicitly determined, group-by levels are explicitly distinguished from measures, and an aggregation operator is coupled to each measure. So, from the data schema point of view, there is no difference between the context of BINs and the one of this paper. Mappings. The primitive mappings of Definition 4 can be expressed, according to the BIN terminology, using either same or equi-level predicates. 
same predicates are used for measures, and can be annotated with an expression; since in Definition 6.1 measures are required to be numerical, the associated transcodings must be translatable into an expression. equi-level predicates are used for levels, and can be directly annotated with a transcoding. Remarkably, in <ref type="bibr" target="#b10">[11]</ref> these two types of mappings are called exact since they enable non-approximate query reformulations. Note that array mappings are not used for query reformulation but only for determining the global schema, so they are not considered here.</p><p>Queries. An OLAP query (Definition 6.1) has a group-by set, a (conjunctive) selection predicate, and a measure. A BIN query has a group-by set, a (conjunctive) selection predicate, and an expression involving one or more measures. By simply picking a single measure and the identity expression, situation (ii) is addressed. As to situation (i), i.e., querying aimed at checking AFDs, we remark that the query for checking AFD 𝑙 ⇝ 𝑙′ can be expressed as a BIN query with group-by set {𝑙, 𝑙′} and a dummy measure, on whose result a simple COUNT DISTINCT is then executed.</p><p>Based on the considerations above, we can state that an OLAP query on the global schema can be correctly reformulated into a set of local queries, one on each local schema. Then, each local query is separately executed on the DOD; specifically, each query must target only the documents that belong to a specific local schema 𝑠. This is done in two steps. First, the information about which document has which schema (obtained in the schema extraction stage) is stored in a separate collection (called WorkoutSession-schemas in our example) in the following form: a document is created for every schema 𝑠 ∈ 𝑆(𝐷), containing an array ids with the id of every document 𝑑 ∈ 𝐷𝑠. Then, the query on schema 𝑠 is executed by joining it with the list of identifiers in WorkoutSession-schemas. 
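The schema-targeting step just described can be sketched as follows (a Node.js simulation of the id-based join, our illustration; in MongoDB itself this amounts to a $match with $in on the ids array):

```javascript
// Restrict a query to the documents of one local schema s, using the ids
// stored in the WorkoutSession-schemas collection (in-memory sketch).
function docsOfSchema(collection, schemasCollection, schemaName) {
  const entry = schemasCollection.find(e => e.schema === schemaName);
  const ids = new Set(entry ? entry.ids : []);  // one document per schema, with an ids array
  // equivalent to db.WS.find({ _id: { $in: entry.ids } })
  return collection.filter(d => ids.has(d._id));
}
```
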
Finally, a post-processing activity is required to integrate the results coming from the different local queries.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Approach overview</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The schema of the JSON document in Figure 2 (a), another schema of the same collection (c), and the global schema (b)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Definition 4 . 1 (</head><label>41</label><figDesc>Mapping). Given two schemas 𝑠𝑖 and 𝑠𝑗, a mapping from 𝑠𝑖 to 𝑠𝑗 can be either • an array mapping with form ⟨𝑎, 𝑎 ′ ⟩, where 𝑎 ∈ 𝐹 𝑎𝑟𝑟 𝑖 and 𝑎 ′ ∈ 𝐹 𝑎𝑟𝑟 𝑗 ; • a primitive mapping with form ⟨𝑃, 𝑃 ′ , 𝜑⟩, where 𝑃 ⊆ 𝐹 𝑝𝑟𝑖𝑚 𝑖 , 𝑃 ′ ⊆ 𝐹 𝑝𝑟𝑖𝑚 𝑗 , and 𝜑 is a transcoding function, 𝜑 : 𝐷𝑜𝑚(𝑃 ) → 𝐷𝑜𝑚(𝑃 ′ ).The definition of the global schema for a collection is based on the inter-schema mappings determined.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Definition 4 . 2 (Example 4 . 3 .</head><label>4243</label><figDesc>Global Schema). Given collection 𝐷 and the corresponding set of schemas 𝑆(𝐷) = {𝑠1, . . . , 𝑠𝑛}, the global schema of 𝐷 is a schema 𝑔(𝐷) = (𝐹, 𝐴) where(1) for every 𝑠𝑖 ∈ 𝑆(𝐷) there is a mapping ⟨𝑟𝑖, 𝑟⟩ between the roots of 𝑠𝑖 and 𝑔(𝐷); (2) every field 𝑓 in each 𝑠𝑖 is involved in at least one mapping onto the fields of 𝑔(𝐷); (3) every field 𝑓 in 𝑔(𝐷) is involved in at least one mapping with some 𝑠𝑖. Figure 3 shows two sample schemas from the WS collection (a and c) and the corresponding global schema 𝑔(𝐷) (b); mappings are represented with dotted lines. An example of array mapping from local schema (c) to global schema (b) is ⟨Series, Exercises.Sets⟩ Examples of primitive mappings are ⟨{Date}, {StartedOn}, 𝜑1⟩ ⟨{FirstName, LastName}, {User.FullName}, 𝜑2⟩ ⟨{Series.ExType}, {Exercise.Type}, 𝜑1⟩</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Example 4 . 6 .</head><label>46</label><figDesc>In our working example, let the collection have 100 documents (i.e., |𝐷| = 100) evenly distributed between 𝑠1 and 𝑠2 (i.e., |𝐷 𝑠 1 | = |𝐷 𝑠 2 | = 50). Let 𝑓 = Facility.City occur 40 times in 𝑠1 and 20 times in 𝑠2; then, 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠1) = 40 50 * 1 = 0.8, 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠2) = 20 50 * 1 = 0.4 and 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓 ) = 0.8 * 0.5 + 0.4 * 0.5 = 0.6. □</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Excerpt of the dependency graph (left) in presence of alternative documents (right)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Example 6 . 2 .</head><label>62</label><figDesc>The following query, 𝑞1, measures the average amount of weight lifted by elderly athletes per city and type of exercise: 𝑞1 = ⟨ {Facility.City, Exercises.Type}, User.Age ≥ 60, Exercises.Sets.Weight, avg ⟩ It is 𝑓𝑎𝑐𝑡(𝑞1) = Exercises.Sets._id. □</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Example 6 . 3 .</head><label>63</label><figDesc>Query 𝑞1 passes the validity check of Algorithm 1 with a completeness warning, because 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(Facility.City) &lt; 1. On the other hand, 𝑞1 meets the disjointness constraint because 𝑖𝑑(𝑎𝑟𝑟(Facility.City)) = _id, 𝑖𝑑(𝑎𝑟𝑟(Exercises.Type)) = Exercises._id, 𝑖𝑑(𝑎𝑟𝑟(Exercises.Sets.Weight)) = Exercises.Sets._id, Exercises.Sets._id ⪰ _id, and Exercises.Sets._id ⪰ Exercises._id.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 :</head><label>1</label><figDesc>Execution times for schema extraction</figDesc><table><row><cell cols="2"># records DB size</cell><cell>Time</cell></row><row><cell>5 K</cell><cell>2 MB</cell><cell>4 sec</cell></row><row><cell>50 K</cell><cell>20 MB</cell><cell>33 sec</cell></row><row><cell>500 K</cell><cell cols="2">197 MB 6 min</cell></row><row><cell>5 M</cell><cell cols="2">1.7 GB 60 min</cell></row><row><cell cols="3">particular, we use a customized version of the free tool va-</cell></row><row><cell cols="3">riety.js for schema extraction on MongoDB collections; we</cell></row><row><cell cols="3">rely on the BIN framework [11] to handle schema mappings</cell></row><row><cell cols="3">and query reformulation (see Appendix); AFD detection</cell></row><row><cell cols="3">is carried out by a simple Javascript algorithm, which de-</cell></row><row><cell cols="3">termines the presence of AFDs between couples of fields</cell></row><row><cell cols="3">by means of count distinct queries, adopting a smart ex-</cell></row><row><cell cols="3">ploration strategy that reduces the search space like in [6];</cell></row><row><cell cols="3">finally, OLAP queries are manually formulated. Our refer-</cell></row><row><cell cols="3">ence real-world collection is stored on a single machine (i7</cell></row><row><cell cols="3">CPU, 32GB RAM) and contains 5M workout sessions with</cell></row><row><cell cols="3">6 different local schemas (mostly due to missing attributes),</cell></row><row><cell cols="3">35M exercises and 85M sets. Table 1 shows the execution</cell></row><row><cell cols="3">times for the schema extraction phase. 
Times are consis-</cell></row><row><cell cols="3">tent with those of related approaches that perform schema</cell></row></table><note>extraction on JSON datasets, such as<ref type="bibr" target="#b0">[1]</ref>; also, we note that time increases linearly with the size of the database. Given the low number of schemas, mappings have been manually defined. A sample OLAP query 𝑞 that groups documents by Facility.Chain (global support 0.38) to obtain the average amount of Exercises.ExCalories (global support 0.69) returns the following indicator values: 𝑠𝑒𝑙(𝑞) = 1 (as there is no filtering), 𝑐𝑜𝑚𝑝𝑙(𝑞) = 0.33 (lower than the group-by set support), and an average 𝑝𝑟𝑒𝑐(𝑒) of 0.99. The time for executing 𝑞 on MongoDB is about 3 minutes. Drilling down to Facility.Name (global support 1) increases 𝑐𝑜𝑚𝑝𝑙(𝑞) to 1, while the accuracy of the new query with respect to 𝑞 is 0.98.</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Schema Inference for Massive JSON Datasets</title>
		<author>
			<persName><forename type="first">Mohamed</forename><forename type="middle">Amine</forename><surname>Baazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Houssem</forename><surname>Ben Lahmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Ghelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Sartiani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. EDBT</title>
				<meeting>EDBT<address><addrLine>Venice, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="222" to="233" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A Comparative Analysis of Methodologies for Database Schema Integration</title>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Batini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maurizio</forename><surname>Lenzerini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shamkant</forename><surname>Navathe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comput. Surveys</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="323" to="364" />
			<date type="published" when="1986">1986. 1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Generic schema matching, ten years later</title>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jayant</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erhard</forename><surname>Rahm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="695" to="701" />
			<date type="published" when="2011">2011. 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">NoSQL Graphbased OLAP Analysis</title>
		<author>
			<persName><forename type="first">Arnaud</forename><surname>Castelltort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anne</forename><surname>Laurent</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. KDIR</title>
				<meeting>KDIR<address><addrLine>Rome, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="217" to="224" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Enabling JSON Document Stores in Relational Systems</title>
		<author>
			<persName><forename type="first">Craig</forename><surname>Chasseur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yinan</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jignesh</forename><forename type="middle">M</forename><surname>Patel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. WebDB</title>
				<meeting>WebDB<address><addrLine>New York, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Mohamed</forename><forename type="middle">Lamine</forename><surname>Chouder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rachid</forename><surname>Chalal</surname></persName>
		</author>
		<title level="m">EXODuS: Exploratory OLAP over Document Stores</title>
				<imprint/>
	</monogr>
	<note>Inf. Syst. In press.</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Building OLAP Cubes from Columnar NoSQL Data Warehouses</title>
		<author>
			<persName><forename type="first">Khaled</forename><surname>Dehdouh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. MEDI</title>
				<meeting>MEDI<address><addrLine>Almería, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Discala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">J</forename><surname>Abadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
				<meeting>SIGMOD<address><addrLine>San Francisco, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="295" to="310" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Schema Profiling of Document Stores</title>
		<author>
			<persName><forename type="first">Enrico</forename><surname>Gallinucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matteo</forename><surname>Golfarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SEBD</title>
				<meeting>SEBD<address><addrLine>Squillace Lido, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Starry Vault: Automating Multidimensional Modeling from Data Vaults</title>
		<author>
			<persName><forename type="first">Matteo</forename><surname>Golfarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><surname>Graziani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ADBIS</title>
				<meeting>ADBIS</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="137" to="151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">OLAP query reformulation in peer-to-peer data warehousing</title>
		<author>
			<persName><forename type="first">Matteo</forename><surname>Golfarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Federica</forename><surname>Mandreoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wilma</forename><surname>Penzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elisa</forename><surname>Turricchia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="393" to="411" />
			<date type="published" when="2012">2012. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">Matteo</forename><surname>Golfarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
		<title level="m">Data warehouse design: Modern principles and methodologies</title>
				<imprint>
			<publisher>McGraw-Hill, Inc</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Constance: An Intelligent Data Lake System</title>
		<author>
			<persName><forename type="first">Rihan</forename><surname>Hai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandra</forename><surname>Geisler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christoph</forename><surname>Quix</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
				<meeting>SIGMOD<address><addrLine>San Francisco, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2097" to="2100" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies</title>
		<author>
			<persName><forename type="first">Ihab</forename><forename type="middle">F</forename><surname>Ilyas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Volker</forename><surname>Markl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><forename type="middle">J</forename><surname>Haas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashraf</forename><surname>Aboulnaga</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
				<meeting>SIGMOD</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="647" to="658" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Discovering implicit schemas in JSON data</title>
		<author>
			<persName><forename type="first">Javier</forename><forename type="middle">Luis</forename><surname>Cánovas Izquierdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordi</forename><surname>Cabot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICWE</title>
				<meeting>ICWE</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="68" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Summarizability in OLAP and Statistical Data Bases</title>
		<author>
			<persName><forename type="first">Hans-Joachim</forename><surname>Lenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arie</forename><surname>Shoshani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Ninth International Conference on Scientific and Statistical Database Management</title>
				<meeting>Ninth International Conference on Scientific and Statistical Database Management</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Management of Flexible Schema Data in RDBMSs -Opportunities and Limitations for NoSQL</title>
		<author>
			<persName><forename type="first">Zhen</forename><forename type="middle">Hua</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dieter</forename><surname>Gawlick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. CIDR. Asilomar</title>
				<meeting>CIDR. Asilomar<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A survey of approaches to automatic schema matching</title>
		<author>
			<persName><forename type="first">Erhard</forename><surname>Rahm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB J</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Multidimensional Design by Examples</title>
		<author>
			<persName><forename type="first">Oscar</forename><surname>Romero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alberto</forename><surname>Abelló</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. DaWaK</title>
				<meeting>DaWaK<address><addrLine>Krakow, Poland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="85" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Inferring Versioned Schemas from NoSQL Databases and Its Applications</title>
		<author>
			<persName><forename type="first">Diego</forename><surname>Sevilla Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Severino</forename><surname>Feliciano Morales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jesús</forename><surname>García Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ER</title>
				<meeting>ER</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="467" to="480" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Finding and Fixing Type Mismatches in the Evolution of Object-NoSQL Mappings</title>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eduardo</forename><surname>Cunha De Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Cerqueus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leandro</forename><surname>Batista De Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pedro</forename><surname>Holanda</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Workshops EDBT/ICDT</title>
				<meeting>Workshops EDBT/ICDT</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Adaptive Schema Databases</title>
		<author>
			<persName><forename type="first">William</forename><surname>Spoth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bahareh</forename><forename type="middle">Sadat</forename><surname>Arab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><forename type="middle">S</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dieter</forename><surname>Gawlick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adel</forename><surname>Ghoneimy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Boris</forename><surname>Glavic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Beda</forename><forename type="middle">Christoph</forename><surname>Hammerschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oliver</forename><surname>Kennedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Seokki</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhen</forename><forename type="middle">Hua</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xing</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ying</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. CIDR</title>
				<meeting>CIDR<address><addrLine>Chaminade, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Designing Web Warehouses from XML Schemas</title>
		<author>
			<persName><forename type="first">Boris</forename><surname>Vrdoljak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marko</forename><surname>Banek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. DaWaK</title>
				<meeting>DaWaK</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="89" to="98" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Schema management for document stores</title>
		<author>
			<persName><forename type="first">Lanjun</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuo</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juwei</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Limei</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oktie</forename><surname>Hassanzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jia</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="922" to="933" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
