=Paper=
{{Paper
|id=Vol-3931/paper1
|storemode=property
|title=Impact Study of NoSQL Refactoring in SkyServer Database
|pdfUrl=https://ceur-ws.org/Vol-3931/paper1.pdf
|volume=Vol-3931
|authors=Enrico Gallinucci,Matteo Golfarelli,Wafaa Radwan,Gabriel Zarate,Alberto Abelló
|dblpUrl=https://dblp.org/rec/conf/dolap/GallinucciGRZA25
}}
==Impact Study of NoSQL Refactoring in SkyServer Database==
Impact Study of NoSQL Refactoring in SkyServer Database
Enrico Gallinucci¹,*, Matteo Golfarelli¹, Wafaa Radwan², Gabriel Zarate³ and Alberto Abelló³
¹ University of Bologna, Cesena, Italy
² Jawwal Telecommunications, Ramallah, Palestine
³ Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract
Data modeling in NoSQL databases is notoriously complex and driven by multiple, possibly conflicting requirements. Researchers have proposed methodologies to optimize the schema design of a given domain for a given workload; however, due to the agile environment in which NoSQL databases are usually employed, both domain and workload are frequently subject to change and evolution, possibly neutralizing the benefits of optimization. When this happens, the benefits of a new optimal schema design must be weighed against the costs of migrating the data. In this work, we empirically show the benefits of schema redesign in a real, publicly available database. In particular, we identify multiple snapshots (in terms of domain extension and querying workload) in the 20+ years of evolution of SkyServer, demonstrate how NoSQL schema optimization at a given time can later backfire, and evaluate the conditions under which data migration becomes beneficial. This leads us to define the foundations and challenges of a framework for continuous NoSQL database refactoring, with the goal of helping DBAs and data engineers decide if, when, and how a NoSQL database should be refactored to restore schema design optimality.
Keywords
NoSQL database, Database refactoring, Data modeling, Data migration
1. Introduction

Database design has been studied for many years in relational databases, but its automation has not been achieved yet. Moreover, the advent of NoSQL databases since the early 2010s has added complexity to the problem by offering alternative data models: key-value, wide-column, document-based, and graph [1]. Among these, the first three are also known as aggregate-oriented data models, as they encourage the modeling of tuples as complex objects, embedding all the data required to answer a query and minimizing the need to compute joins (thus avoiding the costly operation of transferring data between nodes) [1]. For this reason, the traditional domain-driven data modeling strategies typically used in relational databases [2] are abandoned in favor of workload-driven strategies, where tuples are modeled (i.e., their schema is designed) depending on the queries that the database is bound to answer.¹ Notice that we do not use the term NoSQL to name a family of tools, but a family of models, as a synonym of "co-relational" in [3], which can then be implemented in any tool, including an object-relational one like PostgreSQL.
Several research papers have proposed methodologies to obtain the optimal schema design, especially for the wide-column and document-based data models (as the key-value model does not leave much room for alternative modeling strategies). As fully discussed in Section 2, these methodologies typically rely on a conceptual model (CM) of the domain (e.g., a UML class diagram) and a set of queries to be answered (a.k.a. workload). Their goal is to find a target database schema design that minimizes query answering times. This is often achieved by indexing, partitioning, and replicating data in multiple tables (or collections) with different contents to accommodate different queries.
As sketched in Figure 1, we focus on what happens next, i.e., after a schema design has been chosen and the system/application is in production. For multiple reasons, the conditions considered at design time continuously evolve (e.g., new data must be stored, new queries appear, or existing queries are executed at different rates), overturning the fitness of schema designs to the optimization problem. Consequently, the database should be refactored to the schema design that proves to be optimal under the new conditions. Intuitively, the sweet spot of interesting solutions comprises the ones showing the most gain with minimum effort (i.e., those in blue in Figure 1). However, refactoring a database can be costly from multiple perspectives (design and execution of the migration process in the first place), and the trade-off between the benefits of refactoring and its effort should be carefully evaluated. Moreover, the evaluation of database refactoring should not be a once-in-a-while activity: inspired by the DevOps philosophy of continuous evolution in an agile software development environment, database refactoring should be treated as a continuous problem as well. It is known that the performance of query execution can improve by migrating the corresponding data between DBMSs, even when the migration time is included [4, 5].

Figure 1: Intuition of the research problem. The domain evolves (CMcur = CM0 + ΔCM) and the workload evolves (wcur = w0 + Δw), so the optimality of the implemented schema s0 drops while scur = s0; a refactoring recommender must then decide whether it is worth refactoring to another schema, trading the gain of each alternative against its effort (the initial choice of the optimal schema s0 is covered by the state of the art).

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
* Corresponding author.
enrico.gallinucci@unibo.it (E. Gallinucci); matteo.golfarelli@unibo.it (M. Golfarelli); wafaa.radwan@jawwal.ps (W. Radwan); gabriel.zarate@estudiantat.upc.edu (G. Zarate); alberto.abello@upc.edu (A. Abelló)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ In domain-driven design, workload information is used as well, but it is not the main driver.
Figure 2: Conceptual models of the main tables in SDSS SkyServer: (a) R1 (CMR1), (b) R8 (CMR8), (c) R18 (CMR18). The diagrams relate SpecObjAll (with satellite tables such as galSpecInfo, galSpecLine, sppParams, sppLines, zooSpec, and PlateX) and PhotoObjAll (with Photoz, PhotozRF, Field, and Frame) through one-to-one, one-to-many, and optional associations.
Our experiments, based on real astronomic data, show that, although the corresponding data migration takes some days of execution, schema optimization reduces query cost by an order of magnitude and consequently pays off in the long term (notice that we are not considering here the effort of application code evolution). Thus, a structured and automated approach is even more crucial to ensure the feasibility of continuous evolution.
The main contributions we provide in this paper are:
1. An experimental setting that allows us to analyze database refactoring (not considering changes in application code).
2. A detailed empirical analysis of the performance impact of schema evolution in the SkyServer database.
3. A proposal for a framework able to explore schema designs alternative to the current one and to give recommendations based on the evaluation of the trade-off between the migration effort and the gain under different optimization criteria.
The outline of the paper is as follows. The related literature is presented in Section 2. Section 3 introduces our use case. Section 4 explains and exemplifies the motivation behind the research problem. Section 5 defines the experimental setting. Section 6 presents the evaluation of SkyServer, grounded on which we define our framework, described in Section 7. Conclusions are drawn in Section 8.

2. Related work

The workload-driven nature of NoSQL data modeling has been established since the dawn of NoSQL databases [1]. Indeed, results suggest that schema alternatives affect database performance in the different NoSQL models [6]. Over the last decade, researchers have worked to support DBAs and data engineers in the complicated task of finding the best logical model for a given workload. The most recent existing works mainly differ in (i) focusing on a single [7, 8, 9] or multiple data models [10, 11, 12, 13, 14], (ii) considering only the conceptual model of the data [11] or including workload queries, with [7, 12, 13, 14, 8] or without query frequencies [10, 15, 9], and (iii) directly generating one [10, 11, 14, 9] or more target schemas [15], or evaluating several of them based on a single criterion [12, 15, 9] or on multiple criteria [7, 13, 8]. The common factor between all these works is the limited focus on the initial design of a logical schema (i.e., none of them considers the challenge of implementing such a schema by refactoring an existing one).
Research work on database evolution also started in the relational world and then propagated to the NoSQL side, where the schemaless characteristic² makes databases more easily subject to schema changes, which highly impacts their performance. Researchers have looked for patterns in the evolution of schemas in both relational [16] and NoSQL databases [17, 18] (and beyond [19]). Recent efforts to support and/or automate the management of schema evolution have been directed toward keeping track of different schema versions [20], propagating manually-defined schema modification operations (SMOs) to the database [21] and to queries [22], and evaluating multiple strategies to apply schema changes to the data [23, 24]. Overall, this is still an open research field, and none of the mentioned works goes in the direction of recommending if, how, and/or when a (NoSQL) database should be refactored. Recommendations for (relational) database refactoring have been given, but they mostly focus on finding and resolving issues such as inconsistencies [25] and anti-patterns [26]. More recently, [18] proposes a first approach to migration strategy planning for NoSQL databases, but still without deciding whether migrating is worthwhile, or how to do it.

3. SkyServer case study

SkyServer³ is a publicly available relational database designed to map the cosmos, made available by the Sloan Digital Sky Survey (SDSS) organization. Over the years, it has integrated more and more data in successive extensions. Access to SDSS data is provided via a web interface, where users can query and download data through SQL or through interfaces designed for both professional astronomers and educational purposes.
Besides the astronomical data themselves, the server also makes public the SkyServer Traffic Log,⁴ which captures statistics on the SQL queries being executed. This includes columns such as theTime (datetime of the query), webserver (URL of the server), winname (Windows name of the server), clientIP (client's IP address), and sql (the SQL statement executed), among others. It also captures performance metrics like elapsed (query execution time), busy (CPU time used by the query), and rows (number of rows returned by the query), providing a comprehensive snapshot of server activity over the last two decades (since 2003). Thus, we analyzed three different database schemas and the corresponding snapshots of this log as of Release 1 (December 2003), Release 8 (December 2013), and Release 18 (December 2023).

² The term refers to the fact that schema information is attached directly to each data item, thus imposing no constraint at the level of the table/collection of data.
³ https://skyserver.sdss.org/dr18
⁴ https://skyserver.sdss.org/log/en/traffic/sql.asp
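As an illustration of how such a log can be inspected, the following is a minimal SQL sketch. The SqlLog table and the column names are those described above, but the query itself is hypothetical: in particular, the error flag used to exclude unsuccessful queries is an assumption, not a documented column.

-- Hypothetical sketch: retrieve the December 2003 queries used to
-- characterize the Release 1 workload ("error" is an assumed flag).
SELECT theTime, clientIP, sql, elapsed, busy, rows
FROM SqlLog
WHERE theTime >= '2003-12-01' AND theTime < '2004-01-01'
  AND error = 0;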
The corresponding schemas (CMR1, CMR8, and CMR18) are summarized in Figure 2 and include changes in both the number of tables and attributes and in how the latter are placed in the former, but without information loss.
In these schemas, we find data captured from different regions of the sky, called Fields, where different objects are observed as PhotoObj. Spectroscopic data are also captured for each one of these astronomical objects and stored per wavelength interval into SpecObj. All measurements are done through aluminum Plates that allow individual spectrographs to be precisely plugged into the telescope through optical fibers.⁵

4. Motivation

In any kind of DBMS, the choice of the initial schema design is based on conditions (i.e., the conceptual representation of the domain and the estimated workload) that can change, either because they were not accurate or because they have evolved; this is even more so in NoSQL systems. In this section, we present a couple of small but comprehensive examples to illustrate the problems, before moving to a larger one with real data that demonstrates the true impact.

4.1. Domains evolve

Plenty of research papers show that database schemas need to evolve to accommodate changes in the domain (e.g., new information to be added, obsolete information to be removed, data type changes), from the '90s [27] to most recent times [28], in both relational [29, 16] and NoSQL databases [30, 31], looking for patterns in schema updates [16], studying the repercussions on the related application code [29], managing multiple schema versions [22], and designing frameworks to automate schema evolution [17]. An interesting pattern emerging from multiple research works [27, 29, 16] is that, in the early stages of the application lifespan, relational databases typically undergo an inflation phase, where multiple operations are carried out to add new schema information. In this sense, NoSQL databases are even more appealing due to their schemaless nature, which lets them easily accommodate schema additions and makes them more suitable for agile development [32]. Nevertheless, this does not mean that their performance is optimal regardless of how data are stored: the schema design still requires reconsideration.
The frequency of schema changes depends on the domain and application [16]: in some cases it can be pervasive (in [27], the authors found that all the tables over the 20 analyzed databases were somehow affected by the evolution process), while in others it was completely absent (in [19], 70% of the database schemas over the 195 analyzed open source software projects demonstrate the absence or very small presence of change). Nevertheless, in the latter case, the authors verify that the absence of evolution does not mean that application requirements are static, but rather that DBAs/developers are reluctant to evolve database schemas to avoid the effort. A similar insight is found in [33], which studied schema evolution in 29 data-intensive applications using either relational or NoSQL databases. The study found that complex refactoring operations are seldom carried out, due to the lack of tools to support them.
From these studies, we conclude that: (1) it is very difficult to have a perfect understanding at design time of how schema information must be modeled; (2) the conditions to modify database schemas can mature at any time; (3) there is reluctance to change a database schema once it reaches a certain maturity level, and such changes are aimed at minimizing refactoring efforts. As a result: (a) the updated schemas tend to be simple variations of the initial one, despite the choice of the latter being based on a significant degree of uncertainty at design time; (b) the pure minimization of refactoring efforts potentially leads to missing big opportunities hidden by the scarecrow of complex refactoring, steering instead towards possible antipatterns, i.e., bad practices in schema design that are intended to solve certain problems but eventually lead to other problems [34, 26] (which, in turn, will require further modifications to remodel the data).

Example 1. An exemplification of schema evolution on a document-based database is shown in Figure 3. Let CM0 be the initial conceptual schema with only one entity; database schema s0 is created with a single collection of SpecObjAll (Figure 3a shows a sample document). Later on, at time i (Figure 3b), the conceptual schema evolves to CMi to organize spectral readings into plates. To accommodate this change with minimum effort, DBAs would be inclined to evolve the database towards the schema design of si_1 or si_2, but they would probably avoid si_3, even though it might be the optimal schema, as hinted by the radar chart in Figure 3c.

Figure 3: Example of schema evolution: (a) the initial conceptual schema CM0 and the initial database schema s0; (b) the evolved conceptual schema CMi at time i and three alternative database schemas si_1 (plate attributes flattened into each spectral document, e.g., { id: "S1", tile: 122, program: "legacy", quality: "good" }), si_2 (a separate Plates collection, referenced by the SpecObjs documents through plateID), and si_3 (Plates documents embedding their SpecObjs as an array); and (c) comparison of the three potential databases on the maximization of three different objectives (storage, read performance, and write performance).

⁵ The whole catalog of tables is available at https://skyserver.sdss.org/dr18/MoreTools/browser
Inspired by [35], the chart shows a comparison between the three databases in terms of the maximization of three objectives: storage occupation, and performance (speed) of read and write queries. For storage, we assume 30 bytes for IDs, 50 bytes for strings, 8 bytes for numbers, and a ratio of 2 products per category. To estimate query performance in this example, we used the cost model by [12] and assumed two read queries (q1 and q2) and two write queries (q3 and q4) as follows:
q1 = SELECT s.* FROM SpecObjAll s WHERE s.id = ;
q2 = SELECT s.* FROM SpecObjAll s JOIN PlateX p WHERE p.id = ;
q3 = UPDATE SpecObjAll s SET s.tile = WHERE s.id = ;
q4 = UPDATE PlateX p SET p.quality = WHERE p.id = ;
The indicators are normalized on a scale from 0 (worst) to 1 (best) using the complement of the min-max normalized value. For instance, given φ(s) as the average query execution time on schema s, and x and y as the minimum and maximum values for φ(si_k), k ∈ {1, 2, 3}, the query performance for the j-th schema is calculated as (y − φ(si_j)) / (y − x). ♢
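A small worked instance of this normalization, with hypothetical execution times that are not taken from the paper: assume φ(si_1) = 10 ms, φ(si_2) = 25 ms, and φ(si_3) = 40 ms, so that x = 10 and y = 40. The query-performance scores are then
(40 − 10)/(40 − 10) = 1 for si_1, (40 − 25)/(40 − 10) = 0.5 for si_2, and (40 − 40)/(40 − 10) = 0 for si_3,
i.e., lower execution times map to higher scores.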
4.2. Workloads evolve

Similarly to schema evolution, early studies on the evolution of query workloads date back to the '80s [36] and continue to most recent times [37]. The evolution of workloads can be traced back to common patterns [38, 39].
• Changes in frequency, i.e., the same queries are executed with different frequencies and/or ratios, either with cyclic patterns (e.g., daily or monthly), with occasional spikes (e.g., due to an unexpected popularity increase of the application), or with more stable changes (e.g., due to new users from different time zones).
• Changes in queries, i.e., the existing queries are formulated differently (e.g., due to a change at the domain level), or new/old queries are added/removed (e.g., due to their addition to or removal from the application in the latest release).
When the initial schema design is chosen, the workload is assumed to be constant in terms of the queries using it and their corresponding frequencies. This simplification is understandable (if not essential) to choose a (sub-)optimal initial schema design. However, given the strong dependence of schema design optimality on the workload, the evolution of the latter can have a tremendous impact, and ignoring it can lead to a progressive distancing from the objectives that the initial schema design was originally maximizing.

Example 2. An exemplification of workload evolution on a document-based database is shown in Figure 4. Let CM be the initial conceptual model with four classes (Figure 4a) and two different database schemas s1 and s2 (Figure 4b). Assume that s1 is the one chosen at design time, given the following workload.
q1 = SELECT o.*, s.* FROM Field f JOIN PhotoObj o JOIN SpecObj s WHERE f.id = ;
q2 = SELECT p.*, s.* FROM Plate p JOIN SpecObj s WHERE p.id = ;
Figure 4c shows that the frequency of the two queries inverts from the initial deployment at t0 to the one at ti. Consequently, the optimality of the two database schemas with respect to query performance changes accordingly (Figure 4d). The storage ratio can also change (Figure 4e shows the database size in s2 divided by the size in s1) if the data grows unevenly. Reusing the storage assumptions from Example 1, the values in the figure are calculated assuming a ratio between fields and spectral readings that grows from 1:1 to 10:1 and an average number of items per order growing from 2 to 10. ♢

Figure 4: Example of workload evolution: (a) the UML conceptual model; (b) two possible database schemas, s1 (the initially chosen one) and s2; (c) change of the queries' frequency in time; (d,e) change of optimality of the two database schemas on query performance and storage occupation.

5. Experimental setting

To empirically demonstrate our point, we analyzed in detail the real effect of schema evolution on SkyServer performance. For this, we considered three points in time, corresponding to releases R1, R8, and R18, respectively. Firstly, we characterized the different workloads by identifying the most common query patterns (i.e., ignoring the mostly unique queries in the very long tail of frequencies). Then, we recreated the database at the point in time of each of the three releases and populated them with a sample of the data available in SkyServer. Finally, we measured both the cost of queries in each schema and the cost of moving the data from one schema to another. All tests have been executed on a PostgreSQL 15 instance, running on a server with an i7-8700 CPU and 64 GB of RAM. To guarantee reproducibility, all the corresponding code is publicly available on GitHub.⁶
In the following, we use numbers (1, 2, and 3) to refer to database schemas in different points in time, and Greek letters (α, β, and γ) to refer to the corresponding workloads. A summary of the experimental setup is shown in Figure 5 and detailed in the following sections; in the figure, the yellow area indicates the database schemas over which each workload is executed.

⁶ https://github.com/enricogallinucci/nosql-refactoring
Figure 5: DBs and workloads in our experimental setup (legend: workload execution; time evolution; optimized version of a DB; migration to be evaluated; SkyServer DB; optimized DB; time-evolved DB).

Table 1: Workload statistics (query patterns are characterized by their output: one row, stable; more than one row, stable; more than one row, scaling with the database size)
Work. | Queries per month | Query freq. (Hz) | 1 row (stable) | >1 rows (stable) | >1 rows (scaling)
Wα | 48,337 | 0.02 | 2 | 3 | -
Wβ | 1,352,498 | 0.57 | 6 | 3 | -
Wγ | 3,485,018 | 1.39 | 6 | 2 | 7

5.1. Schemas

The schema of each database in Figure 5 is generated as follows. Original databases (D1, D2, D3) are populated exactly as provided by SDSS. Nevertheless, to make the performance comparable after optimization, we did not implement classic 1NF, but encoded each row into a flat JSON document (without any subdocument or array) that was then stored in a relational table in PostgreSQL. Optimized databases (D1α, D2β, and D3γ) are optimized for the workload in the corresponding point in time, following the hints in [12]. More concretely, to decide on join materialization, as well as vertical and horizontal partitioning, we:
1. Followed a greedy algorithm taking each query in the order indicated by its criticality (i.e., queries with a higher value for the product of their frequency and cost were considered first) and:
(i) Created a JSON document per row considering all the tables involved in the query (i.e., join materialization or embedding).
(ii) Took from each source table only the attributes necessary for the query (i.e., vertical partitioning).
(iii) Applied the filters of the query (i.e., horizontal partitioning).
Notice that (a) once a critical query has generated some optimization, this is not undone by less critical queries, and (b) we allowed intra-table redundancies (e.g., materializing the many-to-one join between Frame and Field, which replicates field data for every frame), but not inter-table ones (i.e., once we materialize the join between PhotoObjAll and Photoz in a single table, we do not create another standalone copy of the latter).
2. Generated a separate vertical partition of the corresponding table to store all attributes not used in any query.
3. Created secondary indexes for any attribute in the selection predicates, for both the original and the optimized schemas.
Time-evolved databases (D2α, D3α, and D3β) correspond exactly to databases optimized for an obsolete workload, but extended with the concepts introduced in the new release in the form of one extra table per new concept, so that all queries can be executed. For instance, D2α is the time-evolution of D1α, which preserves the pre-existing data and optimizations for workload α, but adds the flat new tables introduced by D2. These schemas are crucial to put optimizations under the test of time and to evaluate whether their effectiveness holds upon workload evolution or whether it would be more convenient to migrate the data to the schema optimized for the new workload.

5.2. Data

The overall size of the SkyServer database (in its latest release) is approximately 5 TB and cannot be directly downloaded; consequently, we proportionally sampled the source to manage it more effectively. Sampling is based on the main table, PhotoObjAll, which originally contains 1.2 billion rows: we collected 4 samples of 100K, 200K, 500K, and 1M rows, ensuring that the samples preserve the distribution of the attributes involved in the selection predicates; then, the other tables are populated with the rows linked to the ones sampled on PhotoObjAll. To make the performance comparable across the releases, we made the four samples of the same size, independently of the size of the database at the point in time of the release.

5.3. Workload

The workload of every release was extracted from the SqlLog table of the SkyServer Traffic Log for December of the corresponding year, excluding queries that were unsuccessful or involved customer user tables. Given the nature of the service, we should notice that users are not allowed to modify the database; hence, the log contains only read queries. After parsing the queries, we extracted (1) the tables involved, (2) the columns projected, and (3) the selection predicate. Firstly, the queries were clustered based on the tables they required, and a minimum threshold of 1% was fixed for a cluster to be further considered. These initial clusters were then subdivided depending on the columns projected and the selection predicate used, filtering out subclusters with less than 0.5% of the queries, for a final count of 5, 23, and 21 clusters being considered for each release. Since we wanted to evaluate changes from one release to another, out of those clusters, we generated query patterns only for those involving tables present in more than one release. Statistics of the final workloads, including the overall query frequency (assuming uniform distribution in time) and a characterization of the included query patterns (based on the number of returned rows), are reported in Table 1.
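To make the optimization strategy of Section 5.1 concrete, the following is a minimal PostgreSQL sketch. It is hypothetical (the chosen attributes and the predicate are illustrative and not taken from the paper's repository), but it follows the described steps: a flat JSON encoding of an original table, the materialization of a critical pattern's join into one JSON document per row with only the projected attributes (vertical partitioning) and the pattern's filter (horizontal partitioning), and a secondary index on the selection attribute.

-- Flat JSON encoding of an original table (no subdocuments or arrays)
CREATE TABLE photoobjall_flat AS
SELECT to_jsonb(p) AS doc FROM photoobjall p;

-- Join materialization for a critical pattern over PhotoObjAll and Photoz
CREATE TABLE photoobj_photoz_opt AS
SELECT jsonb_build_object(
         'objID', p.objid,      -- only the attributes projected by the pattern
         'type',  p.type,
         'z',     z.z,
         'zErr',  z.zerr
       ) AS doc
FROM photoobjall p
JOIN photoz z ON z.objid = p.objid
WHERE p.mode = 1;               -- filter taken from the pattern's predicate

-- Secondary index on the attribute used in the selection predicates
CREATE INDEX ON photoobj_photoz_opt ((doc ->> 'objID'));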
6. Experimental evaluation

In our experiments, we first look at the space being used, then at the execution time of the query workload, and finally at the cost of migrating from one schema to another.

6.1. Evaluation of storage occupation

Table 2: Storage occupation (in MB) of database schemas
Scale | D1 | D1α | D2 | D2α | D2β | D3 | D3α | D3β | D3γ
100K | 683 | 704 | 849 | 875 | 853 | 885 | 892 | 882 | 878
200K | 1365 | 1408 | 1600 | 1652 | 1607 | 1672 | 1685 | 1666 | 1665
500K | 3412 | 3520 | 3852 | 3983 | 3869 | 4035 | 4065 | 4019 | 4028
1M | 6824 | 7039 | 7607 | 7868 | 7640 | 7973 | 8033 | 7941 | 7965

Table 2 shows the total storage occupation (in MB) of each database schema at every scale. Intuitively, the storage increases proportionally with the scale, though this is less evident in D2* and D3* due to some tables (Field and Frame) being independent from PhotoObjAll. Interestingly, the applied optimizations have no significant impact on storage, due to the absence of inter-table redundancies and to the low footprint of intra-table ones.

6.2. Evaluation of query execution times

Table 3: Average execution time (in ms) of a single query (Wα over the D1* schemas, Wβ over D2*, Wγ over D3*)
Scale | D1 | D1α | D2 | D2α | D2β | D3 | D3α | D3β | D3γ
100K | 73.8 | 7.6 | 27.8 | 63.7 | 2.0 | 145.9 | 571.2 | 269.7 | 45.5
200K | 71.3 | 5.4 | 26.3 | 60.2 | 1.6 | 285.4 | 1124.4 | 490.9 | 60.7
500K | 71.3 | 5.4 | 25.5 | 60.2 | 1.6 | 714.5 | 1546.4 | 1134.6 | 125.5
1M | 77.9 | 5.4 | 27.8 | 63.7 | 1.0 | 1125.2 | 3249.2 | 2304.3 | 240.6

As shown in Figure 5, the three workloads are executed over the database schemas available for the corresponding point in time (i.e., Wα, Wβ, and Wγ are respectively executed over the D1*, D2*, and D3* versions). For each combination of workload and database schema, 11K queries have been executed by preserving the frequency of each query pattern and randomly choosing values (among the existing ones) for the selection predicates; the first 1000 queries are then discarded to minimize the impact of the cold start on the cache. Table 3 shows the average execution time (in ms) of a single query, obtained by weighing the average execution time of each query pattern by the respective query frequency.
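In symbols (our formulation of the weighting just described, not an equation from the paper), if t_p is the measured average execution time of query pattern p and f_p its frequency in the workload, the reported value is the weighted mean
t̄ = Σ_p (f_p · t_p) / Σ_p f_p.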
From the results, we can make the following observations.
• The workload changes significantly across releases. This is evident not only in the average execution time of a single query (which grows considerably on Wγ), but also in the variation over different scales: as shown in Table 1, Wγ includes query patterns where the number of rows scales with the cardinality of tables, while these are not present in Wα and Wβ.
• The random choice of selection predicates slightly impacts the average execution times, especially when these are particularly low. For example, it may seem that execution times improve with the database size in D2β; however, the standard deviation in this case ranges from 1.9 to 2.8 at all scales for this database schema, so the variation is clearly not statistically significant.
• Optimizations have a huge impact on performance, with reductions of execution times ranging from 3 to 10 times across all workloads. This provides a solid justification for the need to implement optimized database schemas, also in light of the essentially unvaried storage occupation.
• Interestingly, optimizations carried out at a specific point in time do not outlive the workload and end up backfiring at later stages. As the characteristics of the workloads evolve, execution times sensibly increase due to previous optimizations losing effectiveness and becoming a liability. This nicely demonstrates the need for a continuous re-evaluation of database optimizations.

6.3. Evaluation of migration convenience

Finally, we study the convenience of database migration; as seen in Figure 5, we focus on the migrations from D2α to D2β and from D3β to D3γ (i.e., migrating the data of R8 optimized for the old R1 workload to another schema optimized for the true R8 workload; similarly for R8 and R18).
Migration convenience is evaluated by measuring the gain obtained in query performance (due to database re-optimization) against the effort taken to migrate the data. Both factors are measured in terms of time: the gain is the difference in the average execution time between two database schemas, and the effort is the time required to execute the migration scripts. The migration then becomes convenient when the (cumulated) gain overcomes the effort.
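The break-even point follows directly from these definitions (our restatement): the number of queries needed to pass the migration time is n = Tmig / Δt, where Tmig is the migration time and Δt is the time gained per query; dividing n by the workload frequency turns it into elapsed time. For instance, using the values reported below in Table 4 for the 100K sample of the D2α-to-D2β migration:
n = 55,100 ms / 62 ms ≈ 889 queries; at the Wβ frequency of 0.57 Hz (Table 1), 889 / 0.57 ≈ 1,560 s ≈ 26 min,
which matches the 893 queries and 25.9 minutes of Table 4 up to the rounding of the per-query gain.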
Table 4 summarizes the results on all sample sizes and indicates the number of queries needed to accumulate enough gain to overcome the migration effort; the same is also translated into a measure of time, based on the query frequency in the real workload, as in Table 1.

Table 4: Migration statistics
Migration | Scale | Time gain per query (ms) | Mig. time (ms) | N. queries to pass mig. time | Time to pass mig. time (min)
D2α to D2β | 100K | 62 | 55,100 | 893 | 25.9
D2α to D2β | 200K | 59 | 127,100 | 2,172 | 63.0
D2α to D2β | 500K | 61 | 319,200 | 5,236 | 151.9
D2α to D2β | 1M | 63 | 675,000 | 10,761 | 312.2
D3β to D3γ | 100K | 172 | 93,000 | 542 | 6.5
D3β to D3γ | 200K | 378 | 233,300 | 618 | 7.4
D3β to D3γ | 500K | 910 | 521,900 | 574 | 6.8
D3β to D3γ | 1M | 1,867 | 922,000 | 494 | 5.9

The results are also reported in Figures 6 and 7, which emphasize trends over logarithmic time scales: for each sample size (identified by a different color), the cumulative time gained is shown as time elapses, the migration time is shown as a flat horizontal threshold, and a star marks the turning point. Linear regression is used to estimate gain and effort on the full database size (shown as black lines).

Figure 6: Study of migration convenience from D2α to D2β (cumulative time gained (s) vs. time elapsed (s); migration-time thresholds for the 100K, 200K, 500K, and 1M samples and the estimated full-scale migration time; reference marks at 1 min., 1 hour, 1 day, 1 month).

Figure 7: Study of migration convenience from D3β to D3γ (cumulative time gained (s) vs. time elapsed (s); migration-time thresholds for the 100K, 200K, 500K, and 1M samples and the estimated full-scale migration time; reference marks at 1 min., 1 hour, 1 day, 1 month).

From these results, we derive the following takeaways.
• First, we observe that the migration time is proportional to the database size. As discussed in Section 6.1, this is not surprising given the low-to-no impact of replication. What is remarkable is the estimated migration time at the full scale, which reaches the order of multiple days. Though this estimate could easily be improved by parallelizing the migration of the different tables, it shows the importance of considering workload prediction in the refactoring recommendation: the longer the time to migrate the data, the longer the required stability of the workload (or the accuracy of the prediction) to ensure that the migration pays off.
• Since execution times for Wβ are unaffected by the database size (as discussed in Section 6.2), the gain is almost identical across all scales. As a result, the bigger the database, the more time it takes to accumulate enough gain to compensate for the migration time. Differently, in Wγ, query execution times grow with the database size, and thus the gain scales accordingly. As a result, the migration becomes convenient after only 6-7 minutes, independently of the database size.
• Interestingly, the two studies reveal radically different scenarios, where the recommendations on carrying out database refactoring diverge. In the samples, database migration is always fast and particularly convenient. In our projected estimates over the full database, the migration to D2β would be discouraged under the assumption that the workload significantly differs in the following month; differently, the migration to D3γ is shown to be convenient, even though the implications of the considerable migration time should be carefully considered before enacting the refactoring.

7. Framework overview

As we have just demonstrated, the evolution of schemas and workloads can dramatically change the optimality of the schema design chosen at design time, and refactoring the database can restore such optimality. The newly-optimal schema should be found as the one maximizing the trade-off between the benefits of a refactoring and the effort to design and execute it. This task opens up multiple research challenges, including the exploration of the search space of target schemas (potentially scaling to thousands of concepts), the quantification (and comparison) of the benefits from refactoring the database and of the efforts to design and execute the refactoring, and the prediction of changes in the workload (potentially including millions of queries). As shown in Section 2, related work mainly explores the identification of the first target solution [40, 41, 42, 7, 8, 43, 10, 11, 12, 13, 14] or devises frameworks to manage multiple schema versions and propagate manually-defined schema transformations to the database [20, 24, 31, 21, 22, 23, 24], but does not address these research questions. In this section, we pave the way in this direction, defining a framework and illustrating some novel research challenges that it should address to achieve this goal.
The outcomes of the experiments evidently suggest that (i) database optimization is neither a one-time nor an incremental activity, as some optimizations can become counterproductive in future stages, (ii) database migration can be particularly expensive and may not be worth the trouble, and (iii) as a result, the continuous evaluation of refactoring options is fundamental to guarantee maximum efficiency under evolving workloads. Hence, we propose a multi-objective optimization to continuously evaluate and recommend the refactoring of NoSQL databases.

Figure 8: Overview of the proposed framework for continuous refactoring evaluation. The Target Schema Explorer takes the current conceptual model CMcur, workload wcur, and schema scur (plus a predicted workload wpred produced by the Workload Predictor) and calls the Gain Estimator, the Effort Estimator, and the Migration Process Generator; their outputs feed the Recommender, which evaluates the candidate schemas on the gain-effort plane.

An overview of the proposed framework is shown in Figure 8. The main component is the Target Schema Explorer, which is in charge of enumerating and evaluating the possible target schemas. The enumeration requires the current schema scur, the current conceptual model CMcur, and either the current workload wcur or a prediction of a future workload wpred, calculated by the Workload Predictor. Given a possible target schema si, its evaluation is aimed at quantifying the pros and cons of carrying out the refactoring from scur. The pros are calculated by the Gain Estimator, which measures the variations of multiple quality criteria (in terms of performance, storage occupation, etc.). The cons are calculated by the Effort Estimator, which measures the work required to carry out the refactoring (in terms of designing and executing the migration process, rewriting all workload queries, etc.). The latter estimation requires as much information as possible about the migration process, which is produced by the Migration Process Generator component. Finally, the Recommender obtains from the Target Schema Explorer the list of evaluated target schemas and produces a recommendation; given the amount and diversity of criteria to measure gains and efforts, the Recommender determines the set of relevant target schemas on the Pareto front [44].
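As a minimal illustration of this last step, and under assumed relation and attribute names that are not part of any existing implementation, the Pareto front over estimated gains and efforts can be computed in SQL as the set of candidate schemas not dominated by any other candidate:

-- candidates(schema_id, gain, effort): one row per evaluated target schema
SELECT c.schema_id, c.gain, c.effort
FROM candidates c
WHERE NOT EXISTS (
    SELECT 1
    FROM candidates d
    WHERE d.gain >= c.gain AND d.effort <= c.effort  -- d is at least as good on both criteria...
      AND (d.gain > c.gain OR d.effort < c.effort)   -- ...and strictly better on at least one
);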
The benefits of advancing the state of the art in this direction are twofold. On the one hand, the Recommender can provide critical insights to make refactoring decisions with significant improvements over the current situation, based on objective criteria and a comprehensive coverage of the possible alternatives. On the other hand, the automation of this task enables its continuous adoption through the lifetime of the database and of the applications running on top of it; indeed, a continuous evaluation of database refactoring minimizes the risk of undergoing major efforts at a later time to recover from a degraded state. In both cases, complete automation is hard to achieve, as the precise measurement of gains and efforts is particularly challenging, and the selection of the "best" refactoring activity from the Pareto front requires business knowledge and strategic vision (i.e., skills that cannot be easily quantified and encoded). Thus, our proposal goes in the direction of human-in-the-loop automation: while we turn to the DevOps philosophy in the continuous application of an automated procedure to maintain a high quality level of the database, DBAs/engineers should be able to step in at critical points to contribute with their knowledge and exploit the system to make decisions and decide the path forward.
In the following, we delve into the details of each of the framework's components, discussing current implementations and presenting the research challenges that are yet to be addressed to achieve automation.
Target schemas exploration. The aggregate-data modeling style of NoSQL databases implies a huge search space of alternative schemas that could be devised in a given domain [45]. [46] shows that there are 12 different ways to logically model a conceptual relationship between two entities in a document-based database. This search space is further amplified by the practice of replicating data in multiple collections to optimize the performance of the most frequent queries (we avoided this possibility in our experiments to keep them simpler); thus, an exhaustive generation and evaluation of all possibilities is prohibitive. The challenge is worsened by the absence of a single optimization metric to drive the exploration towards convergence. In the related work, the most common approach to schema exploration (as well as the one followed in this study) simply consists in converging to a target schema through some heuristics (e.g., [12]). However, we see huge potential in multi-objective evolutionary algorithms (MOEAs), which are particularly suitable for the task of finding Pareto-optimal solutions [47] but have not yet been adopted in this context.
Gain estimation. The Gain Estimator relies on a set of Key Performance Indicators (KPIs) to quantify the (dis)advantages of migrating from the current schema to a different one under many perspectives, namely Performance (query execution time is crucial in NoSQL), Storage (redundancy is typically encouraged, but updates should not be forgotten), and Complexity (schemalessness allows quick development, but also hides mistakes in the coding). In this study, we focused on the performance evaluation, but measured it empirically. Thus far, the proposed metrics for estimation are either oversimplistic (e.g., [12] considers the number of accessed documents) or too narrow (e.g., [48] predicts execution times using an advanced database-specific model, but limited to point queries on the primary key).
Migration effort estimation. Similarly to the gain, the migration effort can be measured from multiple perspectives, namely Process design (whose cost would depend on the complexity and extension of the model), Execution (which should consider the impact on the currently-running workload), and Application update (especially relevant due to the schemaless philosophy of NoSQL). For this, we can build on top of recent work on software refactoring [49] and ETL evolution [50].
Migration process generation. The proper estimation of the migration effort also depends on how well the migration process can be predicted and how much of it can be automated. The information returned by this component can be at several levels, namely conceptual (e.g., identifying which entities are involved in the migration), logical (i.e., defining the sequence of operations that should be carried out to migrate the data), and physical (i.e., producing the scripts or application code to be executed). In any case, an optimizer should be used to make the process as efficient as possible. In our study, the process was generated and optimized manually, but automation is clearly necessary. Some proposals in this direction have been made, but they only support a limited range of schema modification operations [51] and are tied to table-to-table (or collection-to-collection) mappings [52], whereas the migration of the database needs to be considered as a whole.
Workload prediction. The capability of the Recommender to operate on a predicted future workload is a bonus feature, in the sense that the recommendation could also be given just by considering the current workload. Nonetheless, given the (possibly considerable) effort to do a migration and the (possibly continuous) evolution of the workload, the optimality of the new target schema may be lost by the time the migration is completed, as the evidence of this study has shown. For this reason, predicting (with sufficient accuracy) what the workload will be at time i + Δ allows the recommender to consider an additional variable and to possibly converge towards the optimal solutions that require that Δ time to carry out the migration. The prediction of the evolution of workload queries is a field that has recently attracted research interest [37]. However, existing works are limited to relational databases and mostly focused on supporting live tuning of the DBMS's configuration and/or resources [53].

8. Conclusions

In this paper, we have presented an impact study of NoSQL database refactoring over a real-world use case, motivating the research problem, supporting it with empirical evidence, and presenting a proposal for a refactoring recommender framework. Our research work will continue in two directions. On the one hand, we plan to further investigate the SkyServer use case to consider additional strategies for schema optimization and to incorporate workload prediction into the evaluation of migration convenience; by collecting additional information about the workload in the upcoming months, we will put the proposed optimizations under a more comprehensive test of time. On the other hand, we will work towards the implementation and automation of the proposed framework. Each module in the framework encompasses its own challenges, which can be addressed separately. Our main efforts will first be directed towards enabling a broad exploration of target schemas and defining a comprehensive method for estimating migration efforts, considering the many variables that influence this process, including application code evolution and human effort estimation. We plan to work in close collaboration with companies dealing with evolving workloads in NoSQL databases, whose support is already shown in previous work on heterogeneous and evolving datasets [54, 55].

Acknowledgments

This work has been partially supported by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033 (DOGO4ML).
References

[1] P. J. Sadalage, M. Fowler, NoSQL distilled: a brief guide to the emerging world of polyglot persistence, Pearson Education, 2013.
[2] J. L. Harrington, Relational database design and implementation, Morgan Kaufmann, 2016.
[3] E. Meijer, A co-relational model of data for large shared data banks, in: M. Mezini (Ed.), ECOOP 2011 - Object-Oriented Programming - 25th European Conference, Lancaster, UK, July 25-29, 2011, Proceedings, volume 6813 of Lecture Notes in Computer Science, Springer, 2011, p. 1. doi:10.1007/978-3-642-22655-7_1.
[4] V. Gadepally, P. Chen, J. Duggan, A. J. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, M. Stonebraker, The BigDAWG polystore system and architecture, in: 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016, Waltham, MA, USA, September 13-15, 2016, IEEE, 2016, pp. 1-6. doi:10.1109/HPEC.2016.7761636.
[5] R. Alotaibi, D. Bursztyn, A. Deutsch, I. Manolescu, S. Zampetakis, Towards scalable hybrid stores: Constraint-based rewriting to the rescue, in: P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, T. Kraska (Eds.), Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, ACM, 2019, pp. 1660-1677. doi:10.1145/3299869.3319895.
[6] N. Bansal, S. Sachdeva, L. K. Awasthi, Are NoSQL databases affected by schema?, IETE Journal of Research 70 (2024) 4770-4791. doi:10.1080/03772063.2023.2237478.
[7] V. Reniers, D. V. Landuyt, A. Rafique, W. Joosen, A workload-driven document database schema recommender (DBSR), in: G. Dobbie, U. Frank, G. Kappel, S. W. Liddle, H. C. Mayr (Eds.), Conceptual Modeling - 39th International Conference, ER 2020, Vienna, Austria, November 3-6, 2020, Proceedings, volume 12400 of Lecture Notes in Computer Science, Springer, 2020, pp. 471-484. doi:10.1007/978-3-030-62522-1_35.
[8] M. Hewasinghage, S. Nadal, A. Abelló, E. Zimányi, Automated database design for document stores with multicriteria optimization, Knowl. Inf. Syst. 65 (2023) 3045-3078. doi:10.1007/s10115-023-01828-3.
[9] M. Mozaffari, E. Nazemi, A. Eftekhari-Moghadam, CONST: continuous online NoSQL schema tuning, Softw. Pract. Exp. 51 (2021) 1147-1169. doi:10.1002/spe.2945.
[10] A. de la Vega, D. García-Saiz, C. Blanco, M. E. Zorrilla, P. Sánchez, Mortadelo: Automatic generation of NoSQL stores from platform-independent data models, Future Gener. Comput. Syst. 105 (2020) 455-474. doi:10.1016/j.future.2019.11.032.
[11] J. Mali, F. Atigui, A. Azough, N. Travers, ModelDrivenGuide: An approach for implementing NoSQL schemas, in: S. Hartmann, J. Küng, G. Kotsis, A. M. Tjoa, I. Khalil (Eds.), Database and Expert Systems Applications - 31st International Conference, DEXA 2020, Bratislava, Slovakia, September 14-17, 2020, Proceedings, Part I, volume 12391 of Lecture Notes in Computer Science, Springer, 2020, pp. 141-151. doi:10.1007/978-3-030-59003-1_9.
[12] L. Chen, A. Davoudian, M. Liu, A workload-driven method for designing aggregate-oriented NoSQL databases, Data Knowl. Eng. 142 (2022) 102089. doi:10.1016/j.datak.2022.102089.
[13] E. M. Kuszera, L. M. Peres, M. D. D. Fabro, Exploring data structure alternatives in the RDB to NoSQL document store conversion process, Inf. Syst. 105 (2022) 101941. doi:10.1016/j.is.2021.101941.
[14] N. Roy-Hubara, A. Sturm, P. Shoval, Designing NoSQL databases based on multiple requirement views, Data Knowl. Eng. 145 (2023) 102149. doi:10.1016/j.datak.2023.102149.
[15] W. Y. Mok, A conceptual model based design methodology for MongoDB databases, in: 7th International Conference on Information and Computer Technologies, ICICT 2024, Honolulu, HI, USA, March 15-17, 2024, IEEE, 2024, pp. 151-159. doi:10.1109/ICICT62343.2024.00030.
[16] I. Skoulis, P. Vassiliadis, A. V. Zarras, Growing up with stability: How open-source relational databases evolve, Inf. Syst. 53 (2015) 363-385. doi:10.1016/j.is.2015.03.009.
[17] S. Scherzinger, S. Sidortschuck, An empirical study on the design and evolution of NoSQL database schemas, in: G. Dobbie, U. Frank, G. Kappel, S. W. Liddle, H. C. Mayr (Eds.), Conceptual Modeling - 39th International Conference, ER 2020, Vienna, Austria, November 3-6, 2020, Proceedings, volume 12400 of Lecture Notes in Computer Science, Springer, 2020, pp. 441-455. doi:10.1007/978-3-030-62522-1_33.
[18] S. Fedushko, R. Malyi, Y. Syerov, P. Serdyuk, NoSQL document data migration strategy in the context of schema evolution, Data & Knowledge Engineering 154 (2024) 102369. URL: https://www.sciencedirect.com/science/article/pii/S0169023X24000934. doi:10.1016/j.datak.2024.102369.
[19] P. Vassiliadis, Profiles of schema evolution in free open source software projects, in: 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021, IEEE, 2021, pp. 1-12. doi:10.1109/ICDE51399.2021.00008.
[20] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, W. Lehner, Living in parallel realities: Co-existing schema versions with a bidirectional database evolution language, in: S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, D. Suciu (Eds.), Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, ACM, 2017, pp. 1101-1116. doi:10.1145/3035918.3064046.
[21] P. Koupil, J. Bártík, I. Holubová, MM-evocat: A tool for modelling and evolution management of multi-model data, in: M. A. Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, ACM, 2022, pp. 4892-4896. doi:10.1145/3511808.3557180.
[22] L. Caruccio, G. Polese, G. Tortora, Synchronization of queries and views upon schema evolutions: A survey, ACM Trans. Database Syst. 41 (2016) 9:1-9:41. doi:10.1145/2903726.
[23] A. Hillenbrand, M. Levchenko, U. Störl, S. Scherzinger, M. Klettke, MigCast: Putting a price tag on data model evolution in NoSQL data stores, in: P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, T. Kraska (Eds.), Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, ACM, 2019, pp. 1925-1928. doi:10.1145/3299869.3320223.
[24] I. Holubová, M. Vavrek, S. Scherzinger, Evolution management in multi-model databases, Data Knowl. Eng. 136 (2021) 101932. doi:10.1016/j.datak.2021.101932.
[25] S. Chang, V. Deufemia, G. Polese, M. Vacca, A logic framework to support database refactoring, in: R. R. Wagner, N. Revell, G. Pernul (Eds.), Database and Expert Systems Applications, 18th International Conference, DEXA 2007, Regensburg, Germany, September 3-7, 2007, Proceedings, volume 4653 of Lecture Notes in Computer Science, Springer, 2007, pp. 509-518. doi:10.1007/978-3-540-74469-6_50.
[26] P. Khumnin, T. Senivongse, SQL antipatterns detection and database refactoring process, in: T. Hochin, H. Hirata, H. Nomiya (Eds.), 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD 2017, Kanazawa, Japan, June 26-28, 2017, IEEE Computer Society, 2017, pp. 199-205. doi:10.1109/SNPD.2017.8022723.
[27] D. Sjøberg, Quantifying schema evolution, Inf. Softw. Technol. 35 (1993) 35-44. doi:10.1016/0950-5849(93)90027-Z.
[28] P. Vassiliadis, F. Shehaj, G. Kalampokis, A. V. Zarras, Joint source and schema evolution: Insights from a study of 195 FOSS projects, in: J. Stoyanovich, J. Teubner, N. Mamoulis, E. Pitoura, J. Mühlig, K. Hose, S. S. Bhowmick, M. Lissandrini (Eds.), Proceedings 26th In-
[29] … Barcelona, Spain, June 12-16, 2008, 2008, pp. 323-332.
[30] M. Klettke, U. Störl, M. Shenavai, S. Scherzinger, NoSQL schema evolution and big data migration at scale, in: J. Joshi, G. Karypis, L. Liu, X. Hu, R. Ak, Y. Xia, W. Xu, A. Sato, S. Rachuri, L. H. Ungar, P. S. Yu, R. Govindaraju, T. Suzumura (Eds.), 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington DC, USA, December 5-8, 2016, IEEE Computer Society, 2016, pp. 2764-2774. doi:10.1109/BigData.2016.7840924.
[31] A. H. Chillón, D. S. Ruiz, J. G. Molina, Towards a taxonomy of schema changes for NoSQL databases: The Orion language, in: A. K. Ghose, J. Horkoff, V. E. S. Souza, J. Parsons, J. Evermann (Eds.), Conceptual Modeling - 40th International Conference, ER 2021, Virtual Event, October 18-21, 2021, Proceedings, volume 13011 of Lecture Notes in Computer Science, Springer, 2021, pp. 176-185. doi:10.1007/978-3-030-89022-3_15.
[32] U. Störl, M. Klettke, S. Scherzinger, NoSQL schema evolution and data migration: State-of-the-art and opportunities, in: A. Bonifati, Y. Zhou, M. A. V. Salles, A. Böhm, D. Olteanu, G. H. L. Fletcher, A. Khan, B. Yang (Eds.), Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, OpenProceedings.org, 2020, pp. 655-658. doi:10.5441/002/edbt.2020.87.
[33] B. A. Muse, F. Khomh, G. Antoniol, Refactoring practices in the context of data-intensive systems, Empir. Softw. Eng. 28 (2023) 46. doi:10.1007/s10664-022-10271-x.
[34] B. Karwin, SQL Antipatterns: Avoiding the pitfalls of database programming, The Pragmatic Bookshelf, 2010.
[35] M. Athanassoulis, M. S. Kester, L. M. Maas, R. Stoica, S. Idreos, A. Ailamaki, M. Callaghan, Designing access methods: The RUM conjecture, in: E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, K. Stefanidis (Eds.), Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, OpenProceedings.org, 2016, pp. 461-466. doi:10.5441/002/edbt.2016.42.
[36] S. Salza, M. Terranova, Workload modeling for relational database systems, in: D. J. DeWitt, H. Boral (Eds.), Database Machines, Fourth International Workshop, Grand Bahama Island, March 1985, Springer, 1985, pp. 233-255.
[37] X. Huang, S. Cao, Y. Gao, X. Gao, G. Chen, Lightpro:
ternational Conference on Extending Database Tech- Lightweight probabilistic workload prediction frame-
nology, EDBT 2023, Ioannina, Greece, March 28-31, work for database-as-a-service, in: C. A. Ardagna, N. L.
2023, OpenProceedings.org, 2023, pp. 27–39. URL: Atukorala, B. Benatallah, A. Bouguettaya, F. Casati,
https://doi.org/10.48786/edbt.2023.03. doi:10.48786/ C. K. Chang, R. N. Chang, E. Damiani, C. G. Guegan,
edbt.2023.03. R. Ward, F. Xhafa, X. Xu, J. Zhang (Eds.), IEEE In-
[29] C. Curino, H. J. Moon, L. Tanca, C. Zaniolo, Schema ternational Conference on Web Services, ICWS 2022,
evolution in wikipedia - toward a web information sys- Barcelona, Spain, July 10-16, 2022, IEEE, 2022, pp. 160–
tem benchmark, in: J. Cordeiro, J. Filipe (Eds.), ICEIS 169. URL: https://doi.org/10.1109/ICWS55610.2022.
2008 - Proceedings of the Tenth International Confer- 00036. doi:10.1109/ICWS55610.2022.00036.
ence on Enterprise Information Systems, Volume DISI, [38] L. Ma, D. V. Aken, A. Hefny, G. Mezerhane, A. Pavlo,
G. J. Gordon, Query-based workload forecasting fast and elitist multiobjective genetic algorithm:
for self-driving database management systems, in: NSGA-II, IEEE Trans. Evol. Comput. 6 (2002) 182–
G. Das, C. M. Jermaine, P. A. Bernstein (Eds.), Proceed- 197. URL: https://doi.org/10.1109/4235.996017. doi:10.
ings of the 2018 International Conference on Manage- 1109/4235.996017.
ment of Data, SIGMOD Conference 2018, Houston, [48] M. Hewasinghage, A. Abelló, J. Varga, E. Zimányi,
TX, USA, June 10-15, 2018, ACM, 2018, pp. 631–645. Managing polyglot systems metadata with hy-
URL: https://doi.org/10.1145/3183713.3196908. doi:10. pergraphs, Data Knowl. Eng. 134 (2021) 101896.
1145/3183713.3196908. URL: https://doi.org/10.1016/j.datak.2021.101896.
[39] S. S. Elnaffar, P. Martin, An intelligent framework doi:10.1016/j.datak.2021.101896.
for predicting shifts in the workloads of autonomic [49] R. Rasool, A. A. Malik, Effort estimation of etl projects
database management systems, in: Proc of 2004 IEEE using forward stepwise regression, in: 2015 Interna-
International Conference on Advances in Intelligent tional Conference on Emerging Technologies (ICET),
Systems–Theory and Applications, 15-18, 2004, pp. 2015, pp. 1–6. doi:10.1109/ICET.2015.7389209.
1–8. [50] G. Papastefanatos, P. Vassiliadis, A. Simit-
[40] A. Chebotko, A. Kashlev, S. Lu, A big data modeling sis, Y. Vassiliou, Metrics for the prediction
methodology for apache cassandra, in: B. Carminati, of evolution impact in ETL ecosystems: A
L. Khan (Eds.), 2015 IEEE International Congress on case study, J. Data Semant. 1 (2012) 75–97.
Big Data, New York City, NY, USA, June 27 - July URL: https://doi.org/10.1007/s13740-012-0006-9.
2, 2015, IEEE Computer Society, 2015, pp. 238–245. doi:10.1007/s13740-012-0006-9.
URL: https://doi.org/10.1109/BigDataCongress.2015. [51] U. Störl, M. Klettke, Darwin: A data platform for
41. doi:10.1109/BigDataCongress.2015.41. schema evolution management and data migration, in:
[41] M. J. Mior, K. Salem, A. Aboulnaga, R. Liu, Nose: M. Ramanath, T. Palpanas (Eds.), Proceedings of the
Schema design for nosql applications, IEEE Trans. Workshops of the EDBT/ICDT 2022 Joint Conference,
Knowl. Data Eng. 29 (2017) 2275–2289. URL: https: Edinburgh, UK, March 29, 2022, volume 3135 of CEUR
//doi.org/10.1109/TKDE.2017.2722412. doi:10.1109/ Workshop Proceedings, CEUR-WS.org, 2022. URL: https:
TKDE.2017.2722412. //ceur-ws.org/Vol-3135/dataplat_short3.pdf.
[42] C. de Lima, R. dos Santos Mello, On proposing and eval- [52] C. Forresi, E. Gallinucci, M. Golfarelli, H. B. Hamadou,
uating a nosql document database logical approach, A dataspace-based framework for OLAP analyses in a
Int. J. Web Inf. Syst. 12 (2016) 398–417. URL: https: high-variety multistore, VLDB J. 30 (2021) 1017–1040.
//doi.org/10.1108/IJWIS-04-2016-0018. doi:10.1108/ URL: https://doi.org/10.1007/s00778-021-00682-5.
IJWIS-04-2016-0018. doi:10.1007/s00778-021-00682-5.
[43] F. Abdelhédi, A. A. Brahim, F. Atigui, G. Zurfluh, [53] V. V. Meduri, K. Chowdhury, M. Sarwat, Evaluation
Umltonosql: Automatic transformation of conceptual of machine learning algorithms in predicting the next
schema to nosql databases, in: 14th IEEE/ACS Inter- SQL query from the future, ACM Trans. Database
national Conference on Computer Systems and Appli- Syst. 46 (2021) 4:1–4:46. URL: https://doi.org/10.1145/
cations, AICCSA 2017, Hammamet, Tunisia, October 3442338. doi:10.1145/3442338.
30 - Nov. 3, 2017, IEEE Computer Society, 2017, pp. [54] E. Gallinucci, M. Golfarelli, S. Rizzi, Schema profil-
272–279. URL: https://doi.org/10.1109/AICCSA.2017. ing of document-oriented databases, Inf. Syst. 75
76. doi:10.1109/AICCSA.2017.76. (2018) 13–25. URL: https://doi.org/10.1016/j.is.2018.02.
[44] A. Ben-Tal, Characterization of pareto and lexico- 007. doi:10.1016/j.is.2018.02.007.
graphic optimal solutions, in: Multiple Criteria Deci- [55] C. Forresi, M. Francia, E. Gallinucci, M. Golfarelli,
sion Making Theory and Application: Proceedings of Streaming approach to schema profiling, in: A. Abelló,
the Third Conference Hagen/Königswinter, West Ger- P. Vassiliadis, O. Romero, R. Wrembel, F. Bugiotti,
many, August 20–24, 1979, Springer, 1980, pp. 1–11. J. Gamper, G. Vargas-Solar, E. Zumpano (Eds.), New
[45] M. Hewasinghage, N. B. Seghouani, F. Bugiotti, Mod- Trends in Database and Information Systems - ADBIS
eling strategies for storing data in distributed het- 2023 Short Papers, Doctoral Consortium and Work-
erogeneous nosql databases, in: J. Trujillo, K. C. shops: AIDMA, DOING, K-Gals, MADEISD, PeRS,
Davis, X. Du, Z. Li, T. W. Ling, G. Li, M. Lee (Eds.), Barcelona, Spain, September 4-7, 2023, Proceedings,
Conceptual Modeling - 37th International Confer- volume 1850 of Communications in Computer and In-
ence, ER 2018, Xi’an, China, October 22-25, 2018, formation Science, Springer, 2023, pp. 211–220. URL:
Proceedings, volume 11157 of Lecture Notes in Com- https://doi.org/10.1007/978-3-031-42941-5_19. doi:10.
puter Science, Springer, 2018, pp. 488–496. URL: https:// 1007/978-3-031-42941-5\_19.
doi.org/10.1007/978-3-030-00847-5_35. doi:10.1007/
978-3-030-00847-5\_35.
[46] M. Hewasinghage, S. Nadal, A. Abelló, Docdesign
2.0: Automated database design for document stores
with multi-criteria optimization, in: Y. Velegrakis,
D. Zeinalipour-Yazti, P. K. Chrysanthis, F. Guerra
(Eds.), Proceedings of the 24th International Con-
ference on Extending Database Technology, EDBT
2021, Nicosia, Cyprus, March 23 - 26, 2021, Open-
Proceedings.org, 2021, pp. 674–677. URL: https://
doi.org/10.5441/002/edbt.2021.81. doi:10.5441/002/
edbt.2021.81.
[47] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A