=Paper= {{Paper |id=Vol-2572/paper19 |storemode=property |title=To Each His Own: Accommodating Data Variety by a Multimodel Star Schema |pdfUrl=https://ceur-ws.org/Vol-2572/paper19.pdf |volume=Vol-2572 |authors=Sandro Bimonte,Yassine Hifdi,Mohammed Maliari,Patrick Marcel,Stefano Rizzi |dblpUrl=https://dblp.org/rec/conf/dolap/BimonteHMMR20 }} ==To Each His Own: Accommodating Data Variety by a Multimodel Star Schema== https://ceur-ws.org/Vol-2572/paper19.pdf
                     To Each His Own: Accommodating Data Variety
                             by a Multimodel Star Schema
                 Sandro Bimonte                                               Yassine Hifdi                                 Mohammed Maliari
     University Clermont, TSCF, INRAE                          LIFAT Laboratory, University Tours                               ENSA
              Aubiere, France                                             Blois, France                                     Tangier, Maroc
         sandro.bimonte@inrae.fr                                    hifdiyassine@gmail.com                             mohammedmaliari@gmail.com

                                               Patrick Marcel                                        Stefano Rizzi
                                  LIFAT Laboratory, University Tours                         DISI, University of Bologna
                                             Blois, France                                          Bologna, Italy
                                     Patrick.Marcel@univ-tours.fr                              stefano.rizzi@unibo.it
ABSTRACT                                                                                 for the right data type is essential to grant good storage and anal-
Recent approaches adopt multimodel databases (MMDBs) to                                  ysis performance. Traditionally, each DBMS has been conceived
natively handle the variety issues arising from the increasing                           for handling a specific data type; for example, relational DBMSs
amounts of heterogeneous data (structured, semi-structured, graph-                       for structured data, document-based DBMSs for semi-structured
based, etc.) made available. However, when it comes to analyzing                         data, etc. Therefore, when an application requires different data
these data, traditional data warehouses (DWs) and OLAP systems                           types, two solutions are actually possible: (i) integrating all data
fall short because they rely on relational Database Management                           into a single DBMS, or (ii) using two or more DBMSs together.
Systems (DBMSs) for storage and querying, thus constraining                              The former solution presents serious drawbacks: first of all, some
data variety into the rigidity of a structured schema. This pa-                          types of data cannot be stored and analyzed (e.g., the pure re-
per provides a preliminary investigation of the performance of                           lational model does not support the storage of images, XML,
an MMDB when used to store multidimensional data for OLAP                                arrays, etc. [29]); besides, even when data can be converted and
analysis. A multimodel DW would store each of its elements                               stored in the target DBMS, querying performances could be un-
according to its native model; among the benefits we envision for                        satisfactory. The latter approach (known as polyglot persistence
this solution, that of bridging the architectural gap between data                       [16]) presents important challenges as well, namely, technically
lakes and DWs, that of reducing the cost for ETL data transfor-                          managing more DBMSs, complex query languages, inadequate
mations, and that of ensuring better flexibility, extensibility, and                     performance optimization, etc. Therefore, Multimodel databases
evolvability thanks to the use of schemaless models. To support                          (MMDBs) have recently been proposed to overcome these issues.
our investigation we present an implementation, based on the                             A MMDB is a DBMS that natively supports different data types
UniBench benchmark dataset, that extends a star schema with                              under a single query language to grant performance, scalability,
JSON, XML, spatial, and key-value data; we also define a sample                          and fault tolerance [21]. Remarkably, using a single platform for
OLAP workload and use it to test the performance of our solution                         multimodel data promises to deliver several benefits to users
and compare it with that of a classical star schema. As expected,                        besides that of providing a unified query interface; namely, it will
the full-relational implementation performs better, but we believe                       simplify query operations, reduce development and maintenance
that this gap could be balanced by the benefits of multimodel in                         issues, speed up development, and eliminate migration problems
dealing with variety. Finally, we give our perspective view of the                       [21]. Examples of MMDBs are PostgreSQL and ArangoDB. Post-
research on this topic.                                                                  greSQL supports the row-oriented, column-oriented, key-value,
                                                                                         and document-oriented data models, offering XML, HSTORE,
                                                                                         JSON/JSONB data types for storage. ArangoDB supports the
                                                                                         graph-based, key-value, and document-oriented data models.
1     INTRODUCTION
                                                                                            Handling variety while granting at the same time volume and
Big Data is notoriously characterized by (at least) the 3 V’s: vol-                      velocity is even more complex in Data Warehouses (DWs) and
ume, velocity, and variety. To handle velocity and volume, some                          OLAP systems. Indeed, warehoused data result from the integra-
distributed file system-based storage (such as Hadoop) and new                           tion of huge volumes of heterogeneous data, and OLAP requires
Database Management Systems (DBMSs) have been proposed.                                  very good performances for data-intensive analytical queries [20].
In particular, four main categories of NoSQL databases have                              Traditional DW architectures rely on a single, relational DBMS
been proposed [2]: key-value, extensible record, graph-based,                            for storage and querying1 . To offer better support to volume
and document-based.                                                                      while maintaining velocity, some recent works propose the usage
   Although NoSQL DBMSs have successfully proved to support                              of NoSQL DBMSs; for example, [8] relies on a document-based
the volume and velocity features, variety is still a challenge [21].                     DBMS, and [5] on a column-based DBMS. NoSQL proposals for
Indeed, several practical applications (e.g. retail, agriculture, etc.)                  DWs are based on a single data model, and all data are trans-
ask for collecting and analyzing data of different types: structured                     formed to fit with that model (document, graph, etc.). Overall,
(e.g., relational tables), semi-structured (e.g., XML and JSON), and                     although these approaches offer interesting results in terms of
unstructured (such as text, images, etc.). Using the right DBMS                          volume and velocity, they have been mainly conceived and tested
                                                                                         for structured data, without taking into account variety.
© Copyright 2020 for this paper held by its author(s). Published in the proceedings of
DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT               1 More precisely, this is true for so-called ROLAP architectures. In MOLAP architec-
2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution           tures, data are stored in multidimensional arrays. Finally, in HOLAP architectures,
4.0 International (CC BY 4.0).                                                           a MOLAP and a ROLAP systems are coupled.
    Furthermore, to facilitate OLAP querying, DWs are normally                 Key-Value                 Relational            XML
based on the multidimensional model, which introduces the con-               Ranking and         Customers
                                                                              feedback                                        Invoices
cepts of facts, dimensions, and measures to analyze data, so source
                                                                                                             Vendors
data must be forcibly transformed to fit a multidimensional logi-
                                                                                     Graph
cal schema following a so-called schema-on-write approach. Since                    Social
                                                                                   networks
this is not always painless because of the schemaless nature of                                                   Orders
some source data, some recent work (such as [12]) propose to                                                               Products
                                                                                                 RegUsers
directly rewrite OLAP queries over document stores that are not                                                   JSON

organized according to the multidimensional model, following a
schema-on-read approach (i.e., the multidimensional schema is                   Figure 1: Overview of the UniBench data
not decided at design time and forced in a DW, but decided by
each single user at querying time). However, even this approach
relies on a single DBMS.
                                                                       logical model for column-based DWs has been proposed by [5]
    An interesting direction towards a solution for effectively
                                                                       and [7] to address volume scalability. In [28], transformation
handling the 3 V’s in DW and OLAP systems is represented by
                                                                       rules for DW implementation in graph-based DBMSs have been
MMDBs. A multimodel data warehouse (MMDW) can store data
                                                                       proposed for better handling social network data. To the best of
according to the multidimensional model and, at the same time,
                                                                       our knowledge, only [22] presents a benchmark for comparing
let each of its elements be natively represented through the most
                                                                       NoSQL DW proposals; specifically, this benchmark is applied to
appropriate model. Among the benefits we envision for MMDWs,
                                                                       MongoDB and Hbase. Some works also study the usage of XML
that of bridging the architectural gap between data lakes and
                                                                       DBMSs for warehousing XML data [24]. Although XML DWs
DWs, that of reducing the cost for ETL data transformations, and
                                                                       represent a first effort towards native storage of semi-structured
that of ensuring better flexibility, extensibility, and evolvability
                                                                       data, their querying performances do not scale well with size,
thanks to the use of schemaless models.
                                                                       and compression techniques must be adopted [4].
    In this paper we conduct a preliminary investigation of the
                                                                          Among all these proposals, it is hard to champion one logical
performance of MMDWs to store multidimensional data. To this
                                                                       and physical implementation for NoSQL and XML DWs, since
end we introduce a logical schema for MMDWs and its implemen-
                                                                       no approach clearly outperforms the other on the 3 V’s. More-
tation on PostgreSQL, which gives native multimodel support.
                                                                       over, these single-model proposals do not address other issues
Our schema extends the classical star schema introducing semi-
                                                                       related to warehousing big data, such as reducing the cost of ETL,
structured (JSON, XML, and key-value) data in all the multidi-
                                                                       evolution and improving flexibility.
mensional elements; thus, it goes in the direction of coupling the
                                                                          Recently, some approaches to execute OLAP queries directly
pros of schema-on-write approaches (mainly, good performances
                                                                       against NoSQL data sources were proposed. In [12], a schema-
and simple query formulation with no need for query rewriting)
                                                                       on-read approach to automatically extract facts and hierarchies
with those of schema-on-read approaches (higher flexibility in
                                                                       from document data stores and trigger OLAP queries is proposed.
ad-hoc querying).
                                                                       A similar approach is presented in [17]; there, schema variety is
    Due to the lack of a benchmark for multimodel data warehouse,
                                                                       explicitly taken into account by choosing not to design a single
in this paper we propose our own OLAP workload to evaluate
                                                                       crisp schema where source fields are either included or absent,
the performance of our proposal, which we also test against a
                                                                       but rather to enable an OLAP experience on some sort of “soft”
full-relational implementation on PostgreSQL. To the best of our
                                                                       schema where each source field is present to some extent. In
knowledge, no benchmark dataset for DW (either relational or
                                                                       the same direction, [13] proposes a MapReduce-based algorithm
NoSQL) supports variety; thus, for the experiments we use the
                                                                       to compute OLAP cubes on column stores, while [6] aims at
schema and data provided by UniBench [30], a benchmark for
                                                                       delivering the OLAP experience over a graph-based database.
MMDBs that well represents variety.
                                                                          The approaches mentioned above rely on a single-model data-
    The paper outline is as follows. After discussing the related
                                                                       base. Conversely, [19] proposes a pay-as-you-go approach which
literature in Section 2, in Section 3 we present the UniBench case
                                                                       enables OLAP queries against a polystore supporting relational,
study. Sections 4 and 5 introduce our logical schema for MMDWs
                                                                       document, and column data models by hiding heterogeneity be-
and the related OLAP workload, respectively. Section 6 shows
                                                                       hind a dataspace layer. Data integration is carried out on-the-fly
the results of the experiments we made, while Section 7 presents
                                                                       using a set of mappings. Even this approach can be classified as
our vision of future MMDW research. Finally, in Section 8 we
                                                                       schema-on-read; the focus is on query rewriting against hetero-
draw the conclusions.
                                                                       geneous databases and not on the performances of the approach.

2   RELATED WORK                                                       3   CASE STUDY: UNIBENCH
Some recent work concerns warehousing and OLAP using NoSQL             UniBench is a benchmark for multimodel databases proposed
DBMSs of different kinds. In [11], three different logical mod-        in [30]. It includes a retail dataset composed of relational, XML,
els are proposed, using 1 or N document collections to store           JSON, key-value, and graph data as shown in Figure 1, which
data in document-based DBMSs and highlighting the utility of           makes it a good representative for variety. However, UniBench
nested document and array types [10]. The same authors also            was not conceived for OLAP queries. Since our goal is to han-
investigate how to handle complex hierarchies and summariz-            dle variety with specific reference to DWs, we had to derive a
ability issues with document-based DWs [9]. The introduction           multidimensional schema from UniBench. This adaptation re-
of spatial data in document-based DWs has been discussed in            quired some modifications, including the addition of descriptive
[15], which proposes a new spatial multidimensional model to           attributes (e.g., LastName), which allows to better test the ef-
avoid redundancy of spatial data and improve performances. A           fectiveness of the proposed approach; as a consequence, some
                                    Location              Vendor
                                                  Store
                                                                                 that results from applying this guideline to the conceptual schema
                        Year                              Price
                   Month
                                      Image               Comment                in Figure 2. It can be described as follows:
                           Day                      Product                           • The fact table, Fact_Order, has one tuple for each order
                                     Order
                                                                                        and references the order customer and date via foreign
                                                                                        keys. Each tuple includes a JSON document that stores
                                 TotalPrice                   Rating
                                                                                        the totalPrice measure and an array of orderlines, each
                             Customer
                        FirstName              CreationDate
                                                                                        specifying a product.
                       LastName                Birthday                               • The customer dimension table, Dim_Customer, specifies
                    UsedBrowser                 Gender                                  each customer’s data in the form of XML documents.
                                                                                      • The temporal dimension table, Dim_Date, stores in each
Figure 2: Multidimensional schema for UniBench (the                                     tuple a JSON document with the order date; to enable use-
DFM notation [18] is used)                                                              ful aggregations, it also stores the corresponding month
                                                                                        and year.
                                                                                      • The product dimension table, Dim_Product, for each prod-
additional data had to be (randomly) generated. The resulting
                                                                                        uct stores its location (as a spatial attribute), vendor, and
schema represents the Order fact; as shown in Figure 2 it presents
                                                                                        store, as well as a JSON document with the product name
three dimensions:
                                                                                        (title), price, and image. Each product also has a Feedback
     • A Time dimension with levels Day, Month, and Year.                               attribute that stores all its ratings in key-value form, with
     • A Product dimension with one hierarchy including level                           the customer code as a key.
       Store and some descriptive attributes (e.g., Vendor). Inter-                   • As shown in Figure 2, each order refers to several prod-
       estingly, stores are described by a spatial level, Location.                     ucts. To model this non-strict hierarchy, rather than opt-
       The cloud symbol in the schema denotes that a product can                        ing for the classical relational solution (a many-to-many
       have some additional descriptive attributes not specified                        bridge table [18]), we established a connection between
       at design time.                                                                  the InfoOrder document stored in the fact table and the
     • A Costumer dimension in which two hierarchies are rooted:                        Dim_Product dimension table via the asin attribute.
       one with level Gender, one with UsedBrowser. The cus-
       tomer also has some descriptive attributes, e.g., LastName.               An example of instances of the fact table and of the product
                                                                                 dimension table are shown in Figure 4.
Attribute Rating is cross-dimensional, i.e., its value is jointly de-
                                                                                    The cloud symbol in Figure 2 denotes that the product di-
termined by Product and Customer (a customer can rate several
                                                                                 mension can include some additional attributes not specified
products). The fact has one measure, TotalPrice.
                                                                                 at design time (hence, not included in the JSON schema). For
   Finally, since an order is associated to many products, a many-
                                                                                 instance, some InfoPrdt documents will have an EU attribute
to-many relationship is set between the fact and the product
                                                                                 precising the category of product according to the EU classifica-
dimension (non-strict hierarchy).2
                                                                                 tion (see Figure 5), while some InfoOrder documents will have a
                                                                                 brand attribute.
4    A MULTIMODEL STAR SCHEMA FOR
     UNIBENCH
In this section we present a MultiModel, MultiDimensional (in
                                                                                 5    AN OLAP WORKLOAD FOR UNIBENCH
short, M3 D) logical schema for the Order fact introduced above.                 The workload we introduce to test our M3 D schema is inspired by
Essentially, we use a classical star schema with a fact and di-                  that of the classical SSB benchmark [23], itself loosely based on
mension tables, extended with semi-structured data in JSON and                   the TPC-H benchmark. The SSB workload is meant to function-
XML form, and with spatial data. Starting from a star schema                     ally cover the different types of star schema queries while varying
has several clear advantages: (i) the star schema is supported                   fact table selectivity. SSB queries are organized in 4 flights, where
by all OLAP servers and already in use in a huge number of en-                   each flight is a list of 3 to 4 queries. Query flight 1 has restrictions
terprise DWs; (ii) the best practices for designing a star schema                on only 1 dimension, flight 2 has restrictions on 2 dimensions,
from a conceptual schema are well understood and commonly                        flight 3 on 3, and flight 4 represents a what-if sequence of the
adopted by practitioners; (iii) fact-dimension relationships are                 OLAP type. We adopt the same approach, while at the same time
ruled by foreign keys so their consistence is natively checked                   classifying queries according to the usage of relational (R)/non
by the DBMS; (iv) performance optimization of star schema has                    relational (NR) measures, relational/non relational group-by lev-
been long studied and practiced at both the logical (e.g., via view              els, and relational/non relational selection levels. We also add
materialization) and the physical (e.g., via indexing) level.                    a parameter representing the type of join: relational means a
   Clearly, several possible alternatives arise for modeling the                 join using two relational attributes, while JSON means a join
Order fact with an extended star schema. Defining a set of best                  between a JSON attribute and a relational one. Q10 and Q12 use
practices for designing an M3 D schema that achieves the best                    selection attributes that are not part of the JSON schema (brand
trade-off between the five advantages listed in Section 1 is out of              and EU, respectively). In Table 1, which presents the workload,
the scope of this paper; so, we opted for designing the schema                   “by” introduces a group-by and “for” a selection.
based on a simple guideline: preserve as much as possible the
source data variety, i.e., minimizing the transformations to be                  6    MULTIMODEL VS. FULL-RELATIONAL
applied to UniBench source data. Figure 3 shows the M3 D schema
                                                                                 In this section we give a preliminary assessment of the effec-
2 We have not considered the graph data of UniBench, since the PostgreSQL DBMS   tiveness and efficiency of MMDWs as compared to those of a
used for implementation does not support them natively.                          classical relational implementation.
                                                          Date                                            Dim_Date
                                 { "$schema": "http://json-schema.org/draft-03/schema#",
                                   "id": "#",                                                  Id_Date : Int
                                   "type": "object",                                               Date    : JSON
                                   "properties":
                                   { "orderDate":                                                  Month : Date
                                      { "type": "string" }}}                                       Year    : Date



                                           Dim_Product
                 Feedback                                                                                Fact_Order                                               Dim_Customer
                                     Asin     : String
              key : id_cust              Feedback : Hstore                                     Id_Order : String                                      Id_Cust : String
              value : rating
                                         InfoPrdt : JSON                                       Id_Cust : String                                           InfoCust : XML
                                         Location : Geo                                        Id_Date : Int
                                         Store    : String                                         InfoOrder : JSON
                                         Vendor : String

                                                      InfoPrdt                                                   InfoOrder                                                 InfoCust
                               { "$schema": "http://json-schema.org/draft-03/schema#",     { "$schema": "http://json-schema.org/draft-03/schema#",
                                                                                                                                                       
                                 "id": "#",                                                  "id": "#",
                                                                                                                                                       
                                 { "title":    { "type": "string" },                         { "totalPrice":                                           
                                    "price": { "type": "number" },                              { "type": "number" },
                                                                                                                                                       
                                    "imhUrl": { "type": "string" }}}}                           "orderLine":                                           
                                                                                                { "type": "array",                                     
                                                                                                     "items":                                          
                                                                                                     { "id": "0",                                      
                                                                                                         "type": "object",
                                                                                                                                                       
                                                                                                         "properties":
                                                                                                         { "productId": { "type": "string" },
                                                                                                              "asin":      { "type": "string" },
                                                                                                              "title":     { "type": "string" }}}}}}




 Figure 3: Multimodel star schema in PostgreSQL (solid and dashed lines represent foreign key relationships and implicit
 relationships between relational and JSON attributes, respectively; in italics, level properties)




Dim prod WITH EU



                                 Figure 4: Sample instances of Fact_Order (top) and Dim_Product (bottom)

          {    "title":  "5 LED Bicyle Rear Tail",                                                            6.1          Efficiency
               "price": 8.26,
               "imhUrl": "http://ecx.images-amazon.com/SY300.jpg",                                            We have implemented the M3 D schema using PostgreSQL with
               "EU":     "Electronics" }}}}
                                                                                                              its JSON, XML, key-value, and spatial native storage. Data used to
                                                                                                              feed dimensions and facts has been extracted from the UniBench
 Figure 5: An InfoPrdt document including an extra-                                                           benchmark [30]. Specifically, we have 745 dates (|Dim_Date|),
 schema attribute, EU                                                                                         9,949 customers (|Dim_Customer|), 10,116 products (|Dim_Product|),
                                                                                                              and 640,000 orders (|Fact_Order|).
                                                        Table 1: OLAP queries on the Order fact

       Query     Measure     Group-by     Selection    Join   Query                                                                 Numb. of selections        NR types
        Q0         R            R             R         R     Number of orders by months forQ2:  given months and years                      2                    —
        Q1         R            R             R       JSON    Number of orders by months forselect     count(distinct
                                                                                                 given stores  and years o.Id_Order) as NumberOfOrders,
                                                                                                                                             2            d.Month
                                                                                                                                                                JSON
        Q2         R            R            NR       JSON    Number of orders by months forfrom given(years
                                                                                                        selectandId_Order,
                                                                                                                   rating    products->>'asin’2 as Asin     JSON,key-value
        Q3         R            NR            R         R     Number of orders by months, gender forfrom given Fact_Order
                                                                                                                 years        o,             1                   XML
        Q4         R            NR           NR       JSON    Number of orders by months, gender for given products                          1
                                                                                                        jsonb_array_elements(o.InfoOrder->'orderLine')      asJSON,XML
                                                                                                                                                               products
        Q5         NR           R             R       JSON    Total price by year for given stores                                           1                  JSON
        Q6         NR           R            NR         R     Total price by year for given genders ) as op, Fact_Order o, Dim_Date d,1                       JSON,XML
        Q7         NR           NR            R       JSON                                            ( select
                                                              Total price by year, gender for given stores        skeys(p.Feedback) as Id_Cust,
                                                                                                           and years                         2 Asin, svals(p.Feedback)
                                                                                                                                                                JSON as Rating
        Q8         NR           NR           NR         R                                               from and
                                                              Total price by year of birth for given browsers   Dim_Product
                                                                                                                     genders     p           2                   XML
        Q9         NR           NR           NR       JSON    Total price by date, customer for given)months,
                                                                                                        as cp ratings, stores                3              JSON,key-value
        Q10        NR           NR         NR opt.      R                                       where
                                                              Total price by date for given months,     d.Year='2020'
                                                                                                     genders,   brands and Rating>4 and o.Id_Date=d.Id_Date
                                                                                                                                             3                JSON,XML
        Q11        NR           NR           NR       JSON    Total price by date, customer forand
                                                                                                 given months, genders, ratings
                                                                                                     o.Id_Cust=cp.Id_Cust                    3
                                                                                                                               and o.Id_Order=op.Id_OrderJSON,XML,key-value
                                                                                                                                                           and cp.Asin=op.Asin
        Q12        NR           NR         NR opt.    JSON    Total price by costumer for given EU values                                    1                JSON,XML
                                                                                           group by d.Month


Q2:                                                                                        Q2:
select count(distinct o.Id_Order) as NumberOfOrders, d.Month                               select count(distinct Id_Order) as NumberOfOrders, d.Month
from ( select Id_Order, products->>'asin’ as Asin                                          from Fact_Order o, Dim_Date d, Bridge_Cust_Prod cp, Bridge_Ord_Prod op
        from Fact_Order o,                                                                 where d.Year='2020' and Rating>4 and o.Id_Date=d.Id_Date
        jsonb_array_elements(o.InfoOrder->'orderLine') as products                         and o.Id_Cust=cp.Id_Cust and o.Id_Order=op.Id_Order and cp.Asin=op.Asin
      ) as op, Fact_Order o, Dim_Date d,                                                   group by d.Month
     ( select skeys(p.Feedback) as Id_Cust, Asin, svals(p.Feedback) as Rating
        from Dim_Product p
      ) as cp                                                                             Figure 7: SQL formulation of query Q2 in PostgreSQL over
where d.Year='2020' and Rating>4 and o.Id_Date=d.Id_Date                                  the FR schema
and o.Id_Cust=cp.Id_Cust and o.Id_Order=op.Id_Order and cp.Asin=op.Asin
group by d.Month
                                                                                                                                 Dim_Date

 Q2:
Figure     6: SQL formulation of query Q2 in PostgreSQL over                                                              Id_Date : Int
 select count(distinct Id_Order) as NumberOfOrders, d.Month                                                                   Date    : Date
the   M  3 D schema                                                                                                           Month : Date
 from Fact_Order o, Dim_Date d, Bridge_Cust_Prod cp, Bridge_Ord_Prod op                        Bridge_Ord_Prod
                                                                                                                              Year    : Date
where d.Year='2020' and Rating>4 and o.Id_Date=d.Id_Date                                    Id_Order: String
and o.Id_Cust=cp.Id_Cust and o.Id_Order=op.Id_Order and cp.Asin=op.Asin                     Asin    : String                                               Dim_Customer
group by d.Month                                                                                                                                            Id_Cust      :   String
                                                                                                                                Fact_Order                      FirstName    :   String
                                                                                                 Dim_Product                                                    LastName     :   String
   All the OLAP queries proposed in Section 5 have been suc-                                                              Id_Order : String
                                                                                                                                                                Gender       :   String
                                                                                            Asin    :   String        Id_Cust : String
                                                                                                                                                                Birthday     :   Date
cessfully formulated and executed over the M3 D schema, which                                   Title   :   String        Id_Date : Int                     CreationDate :   Date
                                                                                                Price   :   Double            TotalPrice : Double               BrowserUsed :    String
confirms the feasibility of using PostgreSQL as a platform for stor-                            ImhUrl :    String
                                                                                                Location:   Geo
ing and querying MMDWs. Figure 6 shows the SQL formulation                                      Store :     String           Bridge_Cust_Prod
                                                                                                Vendor :    String
of a sample query in PostgreSQL; note that attributes Feedback                                                            Id_Cust : String
of type key-value and InfoOrder of type JSON are retrieved as                                                             Asin    : String
                                                                                                                                 Rating : Int
table views to be used for a join or a selection.
   In the following, we present some experiments aimed at quan-
                                                                                               Figure 8: Full-relational star schema in PostgreSQL
titatively comparing the querying performances of the M3 D
schema and those of a full-relational star schema (from now
on, FR). For the FR schema we used two bridge tables as shown                             Table 2: Performance of benchmark queries (in millisec-
in Figure 8. The first one, Bridge_Ord_Prod, stores the many-                             onds)
to-many relationship between an order and its products. The
second one, Bridge_Cust_Prod, is necessary to store the Rating                                                             Query      M3 D          FR
                                                                                                                            Q0        253           310
cross-dimensional attribute. Noticeably, attributes EU and brand                                                            Q1        712           633
are not included here since, as explained in Section 4, they were                                                           Q2        2509          1161
                                                                                                                            Q3        3996          1023
not known at design time. Clearly, unless some (costly) evolution                                                           Q4        1049          175
of the schema is carried out, these attributes cannot be loaded                                                             Q5        1437          714
                                                                                                                            Q6        1034          197
and they cannot be used for querying. The FR schema is also                                                                 Q7        1902          711
                                                                                                                            Q8        817           131
implemented in PostgreSQL; Figure 7 shows the SQL formulation                                                               Q9        187           732
of query Q2 over the FR schema. A comparison between Figures                                                                Q10       1660          —
                                                                                                                            Q11       2726          517
6 and 7 suggests that the formulation over the M3 D schema is                                                               Q12       1165          —
more complex; however, we wish to emphasize that there is no
real difficulty in formulating queries on an MMDW in compar-
ison to a traditional star schema, except that some knowledge                             on a Core i5 with 4 CPUs @2.3GHz laptop with 16 GB RAM and
of the DBMS-specific operators to manipulate key-value, JSON,                             SSD running MacOS Mojave.
and XML types is required.                                                                   Table 2 shows the query execution in milliseconds against
   For both implementations, B+trees have been used to index                              both implementations. Note that Q10 and Q12 cannot be exe-
relational attributes. For the M3 D schema, some tests were done                          cuted on the FR schema because they use attributes (brand and
in order to find the best optimization plan for the workload                              EU, respectively) that were not known at design time so they
queries. The results we report below use the following: (i) a Gist                        are not part of that schema. Not surprisingly, the full-relational
index is used on the Feedback hstore attribute; (ii) B+trees and                          implementation outperforms the multimodel implementation
Gin indexes are used on JSON attributes. All tests have been run                          over most queries. This can partly be explained by recalling that
                      Table 3: Storage size                                    ad-hoc queries to satisfy situational analysis needs [1].
                                                                               Besides, mixing different models allows, in an MMDW, to
                  Table              M3 D      FR                              achieve higher flexibility in the modeling solutions taken,
                  Fact_Order         603 MB    41 MB
                  Dim_Product        4160 kB   3896 kB                         for instance when dealing with many-to-many relation-
                  Dim_Customer       3280 kB   824 kB
                  Dim_Date           56 kB     40 kB
                                                                               ships.
                  Bridge_Ord_Prod    —         55 MB                         • Evolution. While the multimodel implementation is par-
                  Bridge_Cust_Prod   —         9864 kB
                                                                               tially schemaless, so it inherently supports evolution, the
                                                                               situation with the full-relational implementation is quite
PostgreSQL was originally born as a relational DBMS, so semi-                  different. In fact, even adding a couple of simple levels
structured and complex data querying is not fully optimized yet.               (as EU and brand in our case study) requires, at the very
In particular, PostgreSQL lacks specific optimization structures               least, changing the relational schema of one or more tables,
adapted to XML data, thus, the InfoCust attribute cannot be prop-              editing the ETL procedures, and migrating the data from
erly indexed; this impacts queries Q3, Q4, Q6, Q8, and Q10. M3 D               the old schema to the new one. A more complex evolu-
is also penalized by the necessity to have a JSON attribute in the             tion, e.g., one involving a new many-to-many relationship,
fact table to be joined with a dimension table (namely, InfoOrder).            would have even more impact because it would require
Additionally, the fact table in M3 D is quite larger than the one in           creating new tables. In case users ask for a full versioning
the FR schema, which results in slower star joins (even using the              of the schemata, the effort would be greater still. An M3 D
JSONB type instead of JSON, the improvement is very small). The                schema represents a good trade-off here because most
only case where the multimodel implementation significantly                    evolutions can be handled seamlessly with no impact on
outperforms the full-relational one is Q9; this is due to the use of           tables and ETL; clearly, a more invasive evolution (such as
a bridge table in the relational implementation and specifically               adding a new dimension or measure) would still require
to the fact that, despite the presence of indexes, the optimizer               a change to the relational part of the schema and to the
uses sequential scan to access the bridge table.                               ETL.
   Table 3 shows the storage size of both implementations. Un-
surprisingly, the relational implementation is more sober than          7    A PERSPECTIVE ON MMDW RESEARCH
the multimodel one.
                                                                        The experiments we conduct in this work are encouraging enough
   Though devising complete guidelines and best practices for
                                                                        to set a short- and mid-term perspectives of the research on
multimodel design is out of the scope of this paper, we observe
                                                                        MMDWs. The advantages we envision for MMDWs can be sum-
that:
                                                                        marized as follows:
    (1) the relational model is still more efficient, so it should be
        used, during logical design, for the data sources that can          (1) An MMDW will natively and efficiently support OLAP
        be smoothly transformed into relational form (i.e., those               querying over large volumes of multimodel and multidi-
        whose transformation does not entail loss of information                mensional data, thus ensuring support to both volume,
        content and can be accommodated within the time frame                   velocity, and variety.
        of ETL);                                                            (2) Storing data in their native model means reducing the
    (2) conversely, the data sources that hardly fit into the fixed             data transformations required; hence, the effort for writing
        structure of a relational schema, e.g., because their schema            (time-consuming and error-prone) ETL procedures will be
        is not completely known in advance, should be left in their             reduced in MMDWs, and the freshness of data in the DWs
        native form.                                                            will be increased.
                                                                            (3) MMDWs will bridge the architectural gap between data
6.2    Effectiveness                                                            lakes and DWs. A data lake ingests heterogeneously-
In this section we provide a qualitative comparison of the two                  structured raw data from various sources and stores them
solutions in terms of effectiveness from three points of view:                  in their native format, enabling their processing according
     • Transformation. The full-relational implementation required              to changing requirements [25]. Differently from DWs, data
       all the UniBench data to be translated in relational form                lakes support storage of any kind of data with low-cost
       according to the star schema in Figure 8. While in the M3 D              design, provide increasing analysis capabilities, and offer
       schema the dimension and fact tables are fed with JSON                   an improvement in data ingestion; however, analysis tasks
       data with simple INSERT queries, in the FR schema more                   are more complex and time-consuming since a schema-on-
       steps are required. For instance, just to feed the bridge                read approach must be followed. We believe MMDWs will
       table using the ETL Talend tool we need (i) a job for read-              offer an effective architectural trade-off by enabling both
       ing the JSON collection (tFileInputJSOn); (ii) a loop JSON               OLAP multidimensional analyses and ad-hoc analytics on
       query to read the array of products of each InfoOrder doc-               the same repository.
       ument; (iii) a job for reading the Dim_Product dimension             (4) Schema evolution is a crucial issue in traditional DW ar-
       table; and finally (iv) a join operation. This means that                chitectures, since modifying relational schemata to accom-
       transformations may require a significant time and can be                modate new user requirements is a complex and expensive
       error-prone, so they may be unsuitable in specific settings              task. MMDWs can store schemaless data, so they will en-
       such as those of real-time DWs.                                          sure a more effective support to schema evolution [27].
     • Flexibility. Differently from the FR schema, the M3 D one            (5) Again thanks to their support of schemaless data, higher
       preserves the data variety existing in the data sources. This            flexibility and extensibility will be granted, which will
       is particularly relevant for instance in self-service busi-              enhance analysis capabilities thus generating added value
       ness intelligence scenarios, where data scientist will write             for users [3].
   (6) More specifically, key-value stores on the one hand,             8    CONCLUSION
       and the array constructs supported by document-based             Handling big data variety, volume, and velocity is an important
       databases on the other, provide an alternative solution          challenge for decision-making information systems. On the one
       to model many-to-many relationships appearing in some            hand, data lakes have been proposed to ensure flexible storage of
       multidimensional schemata.                                       raw data, but at the price of making analyses more complex. On
   In our short-term research agenda on MMDWs we mainly                 the other hand, classical DW architectures provide an efficient
plan to verify and quantify these benefits via an extensive set         framework for analyzing transformed and integrated data, but
of experiments based on a more comprehensive case study. This           they fall short in natively handling data variety. Motivated by the
will require, for instance, to measure the effort for writing ETL       emerging trend of MMDBs, in this work we have investigated
procedures to transform all data according to a single model;           the feasibility of a multimodel approach to DW based on an
to assess the increase in querying expressiveness achieved by           extension of the well-known star schema with schemaless data
MMDWs in function of the amount of data variety; to simulate            as dimensions and facts. Our experiments are encouraging as
dynamic settings so as to evaluate the saving in dealing with           they show that all queries of our multimodel tailored OLAP
schema evolution. In order to overcome performance limitations          workload can run over the proposed multimodel star schema in
described in the previous section, we think also that new exper-        acceptable time compared to a full-relational implementation.
iments are mandatory on another multimodel DBMS such as                 Based on these first results, we have presented many short- and
Oracle, which provides other types of implementation for non            mid-term research perspectives on MMDW.
relational data, and also distributed storage and computation.
   In the mid-term, the preliminary work we presented in this           ACKNOWLEDGEMENT
paper opens several research issues:                                    This work was partially supported by the French National Re-
                                                                        search Agency as part of the “Investissements d’Avenir” through
    • Multidimensional design from MMDBs. The existing data-
                                                                        the IDEX-ISITE initiative CAP 20-25 (ANR-16-IDEX-0001), and
      driven approaches to multidimensional design are based
                                                                        the project VGI4bio (ANR-17-CE04-0012).
      on detecting functional dependencies in single-model data
      sources, namely, relational, XML, linked-open data, JSON
      [26]. Using a multimodel data source for design requires
                                                                        REFERENCES
                                                                         [1] Alberto Abelló, Jérôme Darmont, Lorena Etcheverry, Matteo Golfarelli, Jose-
      integrating different techniques into a synergic methodol-             Norberto Mazón, Felix Naumann, Torben Bach Pedersen, Stefano Rizzi, Juan
      ogy.                                                                   Trujillo, Panos Vassiliadis, and Gottfried Vossen. 2013. Fusion Cubes: Towards
                                                                             Self-Service Business Intelligence. IJDWM 9, 2 (2013), 66–88.
    • Conceptual models. Existing conceptual models for DWs              [2] Paolo Atzeni, Francesca Bugiotti, and Luca Rossi. 2014. Uniform access to
      are mostly aimed at designing multidimensional schemata                NoSQL systems. Inf. Syst. 43 (2014), 117–133.
      with fixed structure. To take full advantage of the flexibility    [3] Nabila Berkani, Ladjel Bellatreche, Selma Khouri, and Carlos Ordonez. 2019.
                                                                             Value-driven Approach for Designing Extended Data Warehouses. In Proc.
      ensured by MMDWs, new models capable of coping with                    DOLAP@EDBT/ICDT. Lisbon, Portugal.
      schemaless data (as naively done with the cloud symbol             [4] Doulkifli Boukraâ, Mohammed Amin Bouchoukh, and Omar Boussaïd. 2015.
      in Figure 2) are needed.                                               Efficient Compression and Storage of XML OLAP Cubes. IJDWM 11, 3 (2015),
                                                                             1–25.
    • Best practices for logical design. In presence of variety, sev-    [5] Mohamed Boussahoua, Omar Boussaid, and Fadila Bentayeb. 2017. Logical
      eral alternatives emerge for the logical representation of             Schema for Data Warehouse on Column-Oriented NoSQL Databases. In Proc.
                                                                             DEXA. Lyon, France, 247–256.
      dimensions and facts [14]. Indeed, some combinations of            [6] Arnaud Castelltort and Anne Laurent. 2014. NoSQL Graph-based OLAP
      models may be better than others when coupled with star                Analysis. In Proc. KDIR. Rome, Italy, 217–224.
      schemata. A specific set of guidelines for logical design of       [7] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan
                                                                             Tournier. 2015. Implementation of Multidimensional Databases in Column-
      MMDWs is thus needed to find the best trade-off between                Oriented NoSQL Systems. In Proc. ADBIS. Poitiers, France, 79–91.
      performances, fidelity to source schemata, extensibility,          [8] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ro-
      and evolvability; this should also include the issues related          nan Tournier. 2015. Implementation of Multidimensional Databases with
                                                                             Document-Oriented NoSQL. In Proc. DaWaK. Valencia, Spain, 379–390.
      to view materialization.                                           [9] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan
    • OLAP benchmark. Effectively benchmarking MMDBs [21]                    Tournier. 2016. Document-Oriented Data Warehouses: Complex Hierarchies
                                                                             and Summarizability. In Proc. UNet. Casablanca, Morocco, 671–683.
      and non relational DBMSs [22] is still a challenge. Pro-          [10] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan
      viding a benchmark for MMDWs is a further challenge,                   Tournier. 2016. Document-oriented data warehouses: Models and extended
      since it requires defining a dataset representative of DW              cuboids, extended cuboids in oriented document. In Proc. RCIS. Grenoble,
                                                                             France, 1–11.
      volume and multimodel variety, as well as a full range of         [11] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan
      representative OLAP queries over this dataset.                         Tournier. 2016. Document-oriented Models for Data Warehouses - NoSQL
    • Indexing. PostgreSQL offers different types of indexes over            Document-oriented for Data Warehouses. In Proc. ICEIS. Rome, Italy, 142–149.
                                                                        [12] Mohamed Lamine Chouder, Stefano Rizzi, and Rachid Chalal. 2019. EXODuS:
      MMDBs, e.g., B-trees, hash, GiST (for geo data), GIN (for              Exploratory OLAP over Document Stores. Inf. Syst. 79 (2019), 44–57.
      document and hstore data), etc. Ad hoc indexing strategies        [13] Khaled Dehdouh. 2016. Building OLAP Cubes from Columnar NoSQL Data
                                                                             Warehouses. In Proc. MEDI. Almería, Spain.
      will have to be devised, in presence of variety, to cope with     [14] Ibtisam Ferrahi, Sandro Bimonte, and Kamel Boukhalfa. 2017. A Model &
      the specific features of multidimensional data and OLAP                DBMS Independent Benchmark for Data Warehouses. In Proc. EDA. Lyon,
      queries.                                                               France, 101–110.
                                                                        [15] Ibtisam Ferrahi, Sandro Bimonte, Myoung-Ah Kang, and Kamel Boukhalfa.
    • OLAP tools. Last but not least, more sophisticated OLAP                2017. Design and Implementation of Falling Star - A Non-Redundant Spatio-
      tools are required to let users benefit from the additional            Multidimensional Logical Model for Document Stores. In Proc. ICEIS. Porto,
      flexibility introduced by MMDWs while ensuring good                    Portugal, 343–350.
                                                                        [16] Vijay Gadepally, Peinan Chen, Jennie Duggan, Aaron J. Elmore, Brandon
      performances. Specifically, there is a need for devising               Haynes, Jeremy Kepner, Samuel Madden, Tim Mattson, and Michael Stone-
      techniques to automatically generate efficient SQL queries             braker. 2016. The BigDAWG polystore system and architecture. In Proc. HPEC.
                                                                             Waltham, MA, USA, 1–6.
      over MMDWs from the (MDX-like or graphical) language              [17] Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi. 2019. Approximate
      used by the front-end.                                                 OLAP of document-oriented databases: A variety-aware approach. Inf. Syst.
     85 (2019), 114–130.                                                            [25] Franck Ravat and Yan Zhao. 2019. Data Lakes: Trends and Perspectives. In
[18] Matteo Golfarelli and Stefano Rizzi. 2009. Data Warehouse Design: Modern            Proc. DEXA. Linz, Austria, 304–313.
     Principles and Methodologies. McGraw-Hill, Inc., New York, NY, USA.            [26] Oscar Romero and Alberto Abelló. 2009. A Survey of Multidimensional
[19] Hamdi Ben Hamadou, Enrico Gallinucci, and Matteo Golfarelli. 2019. Answer-          Modeling Methodologies. IJDWM 5, 2 (2009), 1–23.
     ing GPSJ Queries in a Polystore: a Dataspace-Based Approach. In Proc. ER.      [27] Stefanie Scherzinger, Meike Klettke, and Uta Störl. 2013. Managing Schema
     Salvador de Bahia, Brazil.                                                          Evolution in NoSQL Data Stores. In Proc. DBPL. Riva del Garda, Italy.
[20] Ralph Kimball and Margy Ross. 2002. The data warehouse toolkit: the complete   [28] Amal Sellami, Ahlem Nabli, and Faïez Gargouri. 2018. Transformation of Data
     guide to dimensional modeling, 2nd Edition. Wiley.                                  Warehouse Schema to NoSQL Graph Data Base. In Proc. ISDA. Vellore, India,
[21] Jiaheng Lu and Irena Holubová. 2019. Multi-model Databases: A New Journey           410–420.
     to Handle the Variety of Data. ACM Comput. Surv. 52, 3 (2019), 55:1–55:38.     [29] Takeyuki Shimura, Masatoshi Yoshikawa, and Shunsuke Uemura. 1999. Stor-
[22] Mohammed El Malki, Arlind Kopliku, Essaid Sabir, and Olivier Teste. 2018.           age and Retrieval of XML Documents Using Object-Relational Databases. In
     Benchmarking Big Data OLAP NoSQL Databases. In Proc. UNet. Hammamet,                Proc. DEXA. Florence, Italy, 206–217.
     Tunisia, 82–94.                                                                [30] Chao Zhang, Jiaheng Lu, Pengfei Xu, and Yuxing Chen. 2018. UniBench: A
[23] Patrick E. O’Neil, Elizabeth J. O’Neil, Xuedong Chen, and Stephen Revilak.          Benchmark for Multi-model Database Management Systems. In Proc. TPCTC.
     2009. The Star Schema Benchmark and Augmented Fact Table Indexing. In               Rio de Janeiro, Brazil, 7–23.
     Proc. TPCTC. Lyon, France, 237–252.
[24] Zoubir Ouaret, Rachid Chalal, and Omar Boussaid. 2013. An overview of XML
     warehouse design approaches and techniques. IJICoT 2, 2/3 (2013), 140–170.