=Paper=
{{Paper
|id=Vol-3931/paper1
|storemode=property
|title=Impact Study of NoSQL Refactoring in SkyServer Database
|pdfUrl=https://ceur-ws.org/Vol-3931/paper1.pdf
|volume=Vol-3931
|authors=Enrico Gallinucci,Matteo Golfarelli,Wafaa Radwan,Gabriel Zarate,Alberto Abelló
|dblpUrl=https://dblp.org/rec/conf/dolap/GallinucciGRZA25
}}
==Impact Study of NoSQL Refactoring in SkyServer Database==
Impact Study of NoSQL Refactoring in SkyServer Database
Enrico Gallinucci¹,*, Matteo Golfarelli¹, Wafaa Radwan², Gabriel Zarate³ and Alberto Abelló³
¹ University of Bologna, Cesena, Italy
² Jawwal Telecommunications, Ramallah, Palestine
³ Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract
Data modeling in NoSQL databases is notoriously complex and driven by multiple, possibly conflicting requirements. Researchers have proposed methodologies to optimize the schema design of a given domain for a given workload; however, due to the agile environment in which NoSQL databases are usually employed, both domain and workload are frequently subject to change and evolution, possibly neutralizing the benefits of optimization. When this happens, the benefits of a new optimal schema design must be weighed against the costs of migrating the data. In this work, we empirically show the benefits of schema redesign in a real, publicly available database. In particular, we identify multiple snapshots (in terms of domain extension and querying workload) in the 20+ years of evolution of SkyServer, demonstrate how NoSQL schema optimization at a given time can later backfire, and evaluate the conditions under which data migration becomes beneficial. This leads us to define the foundations and challenges of a framework for continuous NoSQL database refactoring, with the goal of helping DBAs and data engineers decide if, when, and how a NoSQL database should be refactored to restore schema design optimality.
Keywords
NoSQL database, Database refactoring, Data modeling, Data migration
1. Introduction

Database design has been studied for many years in relational databases, but its automation has not been achieved yet. Moreover, the advent of NoSQL databases since the early 2010s has added complexity to the problem by offering alternative data models: key-value, wide-column, document-based, and graph [1]. Among these, the first three are also known as aggregate-oriented data models, as they encourage the modeling of tuples as complex objects, embedding all the data required to answer a query and minimizing the need to compute joins (thus avoiding the costly operation of transferring data between nodes) [1]. For this reason, the traditional domain-driven data modeling strategies typically used in relational databases [2] are abandoned in favor of workload-driven strategies, where tuples are modeled (i.e., their schema is designed) depending on the queries that the database is bound to answer.¹ Notice that we do not use the term NoSQL to name a family of tools, but a family of models, as a synonym of "co-relational" in [3], which can then be implemented in any tool, including an object-relational one like PostgreSQL.
Several research papers have proposed methodologies to obtain the optimal schema design, especially for the wide-column and document-based data models (as the key-value model does not leave much room for alternative modeling strategies). As fully discussed in Section 2, these methodologies typically rely on a conceptual model (CM) of the domain (e.g., a UML class diagram) and a set of queries to be answered (a.k.a. workload). Their goal is to find a target database schema design that minimizes query answering times. This is often achieved by indexing, partitioning, and replicating data in multiple tables (or collections) with different contents to accommodate different queries.
As sketched in Figure 1, we focus on what happens next, i.e., after a schema design has been chosen and the system/application is in production. For multiple reasons, the conditions considered at design time continuously evolve (e.g., new data must be stored, new queries appear, or existing queries are executed at different rates), overturning the fitness of schema designs to the optimization problem. Consequently, the database should be refactored to the schema design that proves to be optimal under the new conditions. Intuitively, the sweet spot of interesting solutions comprises the ones showing the most gain with minimum effort (i.e., those in blue in Figure 1). However, refactoring a database can be costly from multiple perspectives (design and execution of the migration process in the first place), and the trade-off between the benefits of refactoring and its effort should be carefully evaluated. Moreover, the evaluation of database refactoring should not be a once-in-a-while activity: inspired by the DevOps philosophy of continuous evolution in an agile software development environment, database refactoring should be treated as a continuous problem as well. It is known that the performance of query execution can improve by migrating the corresponding data between DBMSs, even when the migration time is included [4, 5].

Figure 1: Intuition of the research problem. The domain evolves (CMcur = CM0 + ΔCM) and the workload evolves (wcur = w0 + Δw), so the optimality of the implemented schema s0 drops while scur = s0; a refactoring recommender must then decide whether it is worth refactoring to another schema, trading the gain of each alternative against its effort (the initial choice of the optimal schema s0 is covered by the state of the art).

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
* Corresponding author.
enrico.gallinucci@unibo.it (E. Gallinucci); matteo.golfarelli@unibo.it (M. Golfarelli); wafaa.radwan@jawwal.ps (W. Radwan); gabriel.zarate@estudiantat.upc.edu (G. Zarate); alberto.abello@upc.edu (A. Abelló)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ In domain-driven design, workload information is used as well, but it is not the main driver.
Figure 2: Conceptual models of the main tables in SDSS SkyServer: (a) R1 (CMR1), (b) R8 (CMR8), (c) R18 (CMR18). The diagrams relate SpecObjAll (with satellite tables such as galSpecInfo, galSpecLine, sppParams, sppLines, zooSpec, and PlateX) and PhotoObjAll (with Photoz, PhotozRF, Field, and Frame) through one-to-one, one-to-many, and optional associations.
Our experiments, based on real astronomic data, show that, although the corresponding data migration takes some days of execution, schema optimization reduces query cost by an order of magnitude and consequently pays off in the long term (notice that we are not considering here the effort of application code evolution). Thus, a structured and automated approach is even more crucial to ensure the feasibility of continuous evolution.
The main contributions we provide in this paper are:
1. An experimental setting that allows us to analyze database refactoring (not considering changes in application code).
2. A detailed empirical analysis of the performance impact of schema evolution in the SkyServer database.
3. A proposal for a framework able to explore schema designs alternative to the current one and to give recommendations based on the evaluation of the trade-off between the migration effort and the gain under different optimization criteria.
The outline of the paper is as follows. The related literature is presented in Section 2. Section 3 introduces our use case. Section 4 explains and exemplifies the motivation behind the research problem. Section 5 defines the experimental setting. Section 6 presents the evaluation of SkyServer, grounded on which we define our framework, described in Section 7. Conclusions are drawn in Section 8.

2. Related work

The workload-driven nature of NoSQL data modeling has been established since the dawn of NoSQL databases [1]. Indeed, results suggest that schema alternatives affect database performance in the different NoSQL models [6]. Over the last decade, researchers have worked to support DBAs and data engineers in the complicated task of finding the best logical model for a given workload. The most recent existing works mainly differ in (i) focusing on a single [7, 8, 9] or multiple data models [10, 11, 12, 13, 14], (ii) considering only the conceptual model of the data [11] or including workload queries, with [7, 12, 13, 14, 8] or without query frequencies [10, 15, 9], and (iii) directly generating one [10, 11, 14, 9] or more target schemas [15], or evaluating several of them based on a single criterion [12, 15, 9] or on multiple criteria [7, 13, 8]. The common factor between all these works is the limited focus on the initial design of a logical schema (i.e., none of them considers the challenge of implementing such a schema by refactoring an existing one).
Research work on database evolution also started in the relational world and then propagated to the NoSQL side, where the schemaless characteristic² makes databases more easily subject to schema changes, which highly impacts their performance. Researchers have looked for patterns in the evolution of schemas in both relational [16] and NoSQL databases [17, 18] (and beyond [19]). Recent efforts to support and/or automate the management of schema evolution have been directed toward keeping track of different schema versions [20], propagating manually-defined schema modification operations (SMOs) to the database [21] and to queries [22], and evaluating multiple strategies to apply schema changes to the data [23, 24]. Overall, this is still an open research field, and none of the mentioned works goes in the direction of recommending if, how, and/or when a (NoSQL) database should be refactored. Recommendations for (relational) database refactoring have been given, but they mostly focus on finding and resolving issues such as inconsistencies [25] and anti-patterns [26]. More recently, [18] proposes a first approach to migration strategy planning for NoSQL databases, but still without deciding whether migrating is worthwhile, or how to do it.

3. SkyServer case study

SkyServer³ is a publicly available relational database designed to map the cosmos, made available by the Sloan Digital Sky Survey (SDSS) organization. Over the years, it has integrated more and more data in successive extensions. Access to SDSS data is provided via a web interface, where users can query and download data through SQL or through interfaces designed for both professional astronomers and educational purposes.
Besides the astronomical data themselves, the server also makes public the SkyServer Traffic Log,⁴ which captures statistics on the SQL queries being executed. This includes columns such as theTime (datetime of the query), webserver (URL of the server), winname (Windows name of the server), clientIP (client's IP address), and sql (the SQL statement executed), among others. It also captures performance metrics like elapsed (query execution time), busy (CPU time used by the query), and rows (number of rows returned by the query), providing a comprehensive snapshot of server activity over the last two decades (since 2003). Thus, we analyzed three different database schemas and the corresponding snapshots of this log as of Release 1 (December 2003), Release 8 (December 2013), and Release 18 (December 2023).

² The term refers to the fact that schema information is attached directly to each data item, thus imposing no constraint at the level of the table/collection of data.
³ https://skyserver.sdss.org/dr18
⁴ https://skyserver.sdss.org/log/en/traffic/sql.asp
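As an illustration of how such a log can be inspected, the following is a minimal SQL sketch. The SqlLog table and the column names are those described above, but the query itself is hypothetical: in particular, the error flag used to exclude unsuccessful queries is an assumption, not a documented column.

-- Hypothetical sketch: retrieve the December 2003 queries used to
-- characterize the Release 1 workload ("error" is an assumed flag).
SELECT theTime, clientIP, sql, elapsed, busy, rows
FROM SqlLog
WHERE theTime >= '2003-12-01' AND theTime < '2004-01-01'
  AND error = 0;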
The corresponding schemas (CMR1, CMR8, and CMR18) are summarized in Figure 2 and include changes in both the number of tables and attributes and in how the latter are placed in the former, but without information loss.
In these schemas, we find data captured from different regions of the sky, called Fields, where different objects are observed as PhotoObj. Spectroscopic data are also captured for each one of these astronomical objects and stored per wavelength interval into SpecObj. All measurements are done through aluminum Plates that allow individual spectrographs to be precisely plugged into the telescope through optical fibers.⁵

4. Motivation

In any kind of DBMS, the choice of the initial schema design is based on conditions (i.e., the conceptual representation of the domain and the estimated workload) that can change, either because they were not accurate or because they have evolved; this is even more so in NoSQL systems. In this section, we present a couple of small but comprehensive examples to illustrate the problems, before moving to a larger one with real data that demonstrates the true impact.

4.1. Domains evolve

Plenty of research papers show that database schemas need to evolve to accommodate changes in the domain (e.g., new information to be added, obsolete information to be removed, data type changes), from the '90s [27] to most recent times [28], in both relational [29, 16] and NoSQL databases [30, 31], looking for patterns in schema updates [16], studying the repercussions on the related application code [29], managing multiple schema versions [22], and designing frameworks to automate schema evolution [17]. An interesting pattern emerging from multiple research works [27, 29, 16] is that, in the early stages of the application lifespan, relational databases typically undergo an inflation phase, where multiple operations are carried out to add new schema information. In this sense, NoSQL databases are even more appealing due to their schemaless nature, which lets them easily accommodate schema additions and makes them more suitable for agile development [32]. Nevertheless, this does not mean that their performance is optimal regardless of how data are stored: the schema design still requires reconsideration.
The frequency of schema changes depends on the domain and application [16]: in some cases it can be pervasive (in [27], the authors found that all the tables over the 20 analyzed databases were somehow affected by the evolution process), while in others it was completely absent (in [19], 70% of the database schemas over the 195 analyzed open source software projects demonstrate the absence or very small presence of change). Nevertheless, in the latter case, the authors verify that the absence of evolution does not mean that application requirements are static, but rather that DBAs/developers are reluctant to evolve database schemas to avoid the effort. A similar insight is found in [33], which studied schema evolution in 29 data-intensive applications using either relational or NoSQL databases. The study found that complex refactoring operations are seldom carried out, due to the lack of tools to support them.
From these studies, we conclude that: (1) it is very difficult to have a perfect understanding at design time of how schema information must be modeled; (2) the conditions to modify database schemas can mature at any time; (3) there is reluctance to change a database schema once it reaches a certain maturity level, and such changes are aimed at minimizing refactoring efforts. As a result: (a) the updated schemas tend to be simple variations of the initial one, despite the choice of the latter being based on a significant degree of uncertainty at design time; (b) the pure minimization of refactoring efforts potentially leads to missing big opportunities hidden by the scarecrow of complex refactoring, steering instead towards possible antipatterns, i.e., bad practices in schema design that are intended to solve certain problems but eventually lead to other problems [34, 26] (which, in turn, will require further modifications to remodel the data).

Example 1. An exemplification of schema evolution on a document-based database is shown in Figure 3. Let CM0 be the initial conceptual schema with only one entity; database schema s0 is created with a single collection of SpecObjAll (Figure 3a shows a sample document). Later on, at time i (Figure 3b), the conceptual schema evolves to CMi to organize spectral readings into plates. To accommodate this change with minimum effort, DBAs would be inclined to evolve the database towards the schema design of si_1 or si_2, but they would probably avoid si_3, even though it might be the optimal schema, as hinted by the radar chart in Figure 3c.

Figure 3: Example of schema evolution: (a) the initial conceptual schema CM0 and the initial database schema s0; (b) the evolved conceptual schema CMi at time i and three alternative database schemas si_1 (plate attributes flattened into each spectral document, e.g., { id: "S1", tile: 122, program: "legacy", quality: "good" }), si_2 (a separate Plates collection, referenced by the SpecObjs documents through plateID), and si_3 (Plates documents embedding their SpecObjs as an array); and (c) comparison of the three potential databases on the maximization of three different objectives (storage, read performance, and write performance).

⁵ The whole catalog of tables is available at https://skyserver.sdss.org/dr18/MoreTools/browser
Inspired by [35], the chart shows a comparison between the three databases in terms of the maximization of three objectives: storage occupation, and performance (speed) of read and write queries. For storage, we assume 30 bytes for IDs, 50 bytes for strings, 8 bytes for numbers, and a ratio of 2 products per category. To estimate query performance in this example, we used the cost model by [12] and assumed two read queries (q1 and q2) and two write queries (q3 and q4) as follows:
q1 = SELECT s.* FROM SpecObjAll s WHERE s.id = ;
q2 = SELECT s.* FROM SpecObjAll s JOIN PlateX p WHERE p.id = ;
q3 = UPDATE SpecObjAll s SET s.tile = WHERE s.id = ;
q4 = UPDATE PlateX p SET p.quality = WHERE p.id = ;
The indicators are normalized on a scale from 0 (worst) to 1 (best) using the complement of the min-max normalized value. For instance, given φ(s) as the average query execution time on schema s, and x and y as the minimum and maximum values for φ(si_k), k ∈ {1, 2, 3}, the query performance for the j-th schema is calculated as (y − φ(si_j)) / (y − x). ♢
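A small worked instance of this normalization, with hypothetical execution times that are not taken from the paper: assume φ(si_1) = 10 ms, φ(si_2) = 25 ms, and φ(si_3) = 40 ms, so that x = 10 and y = 40. The query-performance scores are then
(40 − 10)/(40 − 10) = 1 for si_1, (40 − 25)/(40 − 10) = 0.5 for si_2, and (40 − 40)/(40 − 10) = 0 for si_3,
i.e., lower execution times map to higher scores.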
4.2. Workloads evolve

Similarly to schema evolution, early studies on the evolution of query workloads date back to the '80s [36] and continue to most recent times [37]. The evolution of workloads can be traced back to common patterns [38, 39].
• Changes in frequency, i.e., the same queries are executed with different frequencies and/or ratios, either with cyclic patterns (e.g., daily or monthly), with occasional spikes (e.g., due to an unexpected popularity increase of the application), or with more stable changes (e.g., due to new users from different time zones).
• Changes in queries, i.e., the existing queries are formulated differently (e.g., due to a change at the domain level), or new/old queries are added/removed (e.g., due to their addition to or removal from the application in the latest release).
When the initial schema design is chosen, the workload is assumed to be constant in terms of the queries using it and their corresponding frequencies. This simplification is understandable (if not essential) to choose a (sub-)optimal initial schema design. However, given the strong dependence of schema design optimality on the workload, the evolution of the latter can have a tremendous impact, and ignoring it can lead to a progressive distancing from the objectives that the initial schema design was originally maximizing.

Example 2. An exemplification of workload evolution on a document-based database is shown in Figure 4. Let CM be the initial conceptual model with four classes (Figure 4a) and two different database schemas s1 and s2 (Figure 4b). Assume that s1 is the one chosen at design time, given the following workload.
q1 = SELECT o.*, s.* FROM Field f JOIN PhotoObj o JOIN SpecObj s WHERE f.id = ;
q2 = SELECT p.*, s.* FROM Plate p JOIN SpecObj s WHERE p.id = ;
Figure 4c shows that the frequency of the two queries inverts from the initial deployment at t0 to the one at ti. Consequently, the optimality of the two database schemas with respect to query performance changes accordingly (Figure 4d). The storage ratio can also change (Figure 4e shows the database size in s2 divided by the size in s1) if the data grows unevenly. Reusing the storage assumptions from Example 1, the values in the figure are calculated assuming a ratio between fields and spectral readings that grows from 1:1 to 10:1 and an average number of items per order growing from 2 to 10. ♢

Figure 4: Example of workload evolution: (a) the UML conceptual model; (b) two possible database schemas, s1 (the initially chosen one) and s2; (c) change of the queries' frequency in time; (d,e) change of optimality of the two database schemas on query performance and storage occupation.

5. Experimental setting

To empirically demonstrate our point, we analyzed in detail the real effect of schema evolution on SkyServer performance. For this, we considered three points in time, corresponding to releases R1, R8, and R18, respectively. Firstly, we characterized the different workloads by identifying the most common query patterns (i.e., ignoring the mostly unique queries in the very long tail of frequencies). Then, we recreated the database at the point in time of each of the three releases and populated them with a sample of the data available in SkyServer. Finally, we measured both the cost of queries in each schema and the cost of moving the data from one schema to another. All tests have been executed on a PostgreSQL 15 instance, running on a server with an i7-8700 CPU and 64 GB of RAM. To guarantee reproducibility, all the corresponding code is publicly available on GitHub.⁶
In the following, we use numbers (1, 2, and 3) to refer to database schemas in different points in time, and Greek letters (α, β, and γ) to refer to the corresponding workloads. A summary of the experimental setup is shown in Figure 5 and detailed in the following sections; in the figure, the yellow area indicates the database schemas over which each workload is executed.

⁶ https://github.com/enricogallinucci/nosql-refactoring
Figure 5: DBs and workloads in our experimental setup (legend: workload execution; time evolution; optimized version of a DB; migration to be evaluated; SkyServer DB; optimized DB; time-evolved DB).

Table 1: Workload statistics (query patterns are characterized by their output: one row, stable; more than one row, stable; more than one row, scaling with the database size)
Work. | Queries per month | Query freq. (Hz) | 1 row (stable) | >1 rows (stable) | >1 rows (scaling)
Wα | 48,337 | 0.02 | 2 | 3 | -
Wβ | 1,352,498 | 0.57 | 6 | 3 | -
Wγ | 3,485,018 | 1.39 | 6 | 2 | 7

5.1. Schemas

The schema of each database in Figure 5 is generated as follows. Original databases (D1, D2, D3) are populated exactly as provided by SDSS. Nevertheless, to make the performance comparable after optimization, we did not implement classic 1NF, but encoded each row into a flat JSON document (without any subdocument or array) that was then stored in a relational table in PostgreSQL. Optimized databases (D1α, D2β, and D3γ) are optimized for the workload in the corresponding point in time, following the hints in [12]. More concretely, to decide on join materialization, as well as vertical and horizontal partitioning, we:
1. Followed a greedy algorithm taking each query in the order indicated by its criticality (i.e., queries with a higher value for the product of their frequency and cost were considered first) and:
(i) Created a JSON document per row considering all the tables involved in the query (i.e., join materialization or embedding).
(ii) Took from each source table only the attributes necessary for the query (i.e., vertical partitioning).
(iii) Applied the filters of the query (i.e., horizontal partitioning).
Notice that (a) once a critical query has generated some optimization, this is not undone by less critical queries, and (b) we allowed intra-table redundancies (e.g., materializing the many-to-one join between Frame and Field, which replicates field data for every frame), but not inter-table ones (i.e., once we materialize the join between PhotoObjAll and Photoz in a single table, we do not create another standalone copy of the latter).
2. Generated a separate vertical partition of the corresponding table to store all attributes not used in any query.
3. Created secondary indexes for any attribute in the selection predicates, for both the original and the optimized schemas.
Time-evolved databases (D2α, D3α, and D3β) correspond exactly to databases optimized for an obsolete workload, but extended with the concepts introduced in the new release in the form of one extra table per new concept, so that all queries can be executed. For instance, D2α is the time-evolution of D1α, which preserves the pre-existing data and optimizations for workload α, but adds the flat new tables introduced by D2. These schemas are crucial to put optimizations under the test of time and to evaluate whether their effectiveness holds upon workload evolution or whether it would be more convenient to migrate the data to the schema optimized for the new workload.

5.2. Data

The overall size of the SkyServer database (in its latest release) is approximately 5 TB and cannot be directly downloaded; consequently, we proportionally sampled the source to manage it more effectively. Sampling is based on the main table, PhotoObjAll, which originally contains 1.2 billion rows: we collected 4 samples of 100K, 200K, 500K, and 1M rows, ensuring that the samples preserve the distribution of the attributes involved in the selection predicates; then, the other tables are populated with the rows linked to the ones sampled on PhotoObjAll. To make the performance comparable across the releases, we made the four samples of the same size, independently of the size of the database at the point in time of the release.

5.3. Workload

The workload of every release was extracted from the SqlLog table of the SkyServer Traffic Log for December of the corresponding year, excluding queries that were unsuccessful or involved customer user tables. Given the nature of the service, we should notice that users are not allowed to modify the database; hence, the log contains only read queries. After parsing the queries, we extracted (1) the tables involved, (2) the columns projected, and (3) the selection predicate. Firstly, the queries were clustered based on the tables they required, and a minimum threshold of 1% was fixed for a cluster to be further considered. These initial clusters were then subdivided depending on the columns projected and the selection predicate used, filtering out subclusters with less than 0.5% of the queries, for a final count of 5, 23, and 21 clusters being considered for each release. Since we wanted to evaluate changes from one release to another, out of those clusters, we generated query patterns only for those involving tables present in more than one release. Statistics of the final workloads, including the overall query frequency (assuming uniform distribution in time) and a characterization of the included query patterns (based on the number of returned rows), are reported in Table 1.
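To make the optimization strategy of Section 5.1 concrete, the following is a minimal PostgreSQL sketch. It is hypothetical (the chosen attributes and the predicate are illustrative and not taken from the paper's repository), but it follows the described steps: a flat JSON encoding of an original table, the materialization of a critical pattern's join into one JSON document per row with only the projected attributes (vertical partitioning) and the pattern's filter (horizontal partitioning), and a secondary index on the selection attribute.

-- Flat JSON encoding of an original table (no subdocuments or arrays)
CREATE TABLE photoobjall_flat AS
SELECT to_jsonb(p) AS doc FROM photoobjall p;

-- Join materialization for a critical pattern over PhotoObjAll and Photoz
CREATE TABLE photoobj_photoz_opt AS
SELECT jsonb_build_object(
         'objID', p.objid,      -- only the attributes projected by the pattern
         'type',  p.type,
         'z',     z.z,
         'zErr',  z.zerr
       ) AS doc
FROM photoobjall p
JOIN photoz z ON z.objid = p.objid
WHERE p.mode = 1;               -- filter taken from the pattern's predicate

-- Secondary index on the attribute used in the selection predicates
CREATE INDEX ON photoobj_photoz_opt ((doc ->> 'objID'));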
6. Experimental evaluation

In our experiments, we first look at the space being used, then at the execution time of the query workload, and finally at the cost of migrating from one schema to another.

6.1. Evaluation of storage occupation

Table 2: Storage occupation (in MB) of database schemas
Scale | D1 | D1α | D2 | D2α | D2β | D3 | D3α | D3β | D3γ
100K | 683 | 704 | 849 | 875 | 853 | 885 | 892 | 882 | 878
200K | 1365 | 1408 | 1600 | 1652 | 1607 | 1672 | 1685 | 1666 | 1665
500K | 3412 | 3520 | 3852 | 3983 | 3869 | 4035 | 4065 | 4019 | 4028
1M | 6824 | 7039 | 7607 | 7868 | 7640 | 7973 | 8033 | 7941 | 7965

Table 2 shows the total storage occupation (in MB) of each database schema at every scale. Intuitively, the storage increases proportionally with the scale, though this is less evident in D2* and D3* due to some tables (Field and Frame) being independent from PhotoObjAll. Interestingly, the applied optimizations have no significant impact on storage, due to the absence of inter-table redundancies and to the low footprint of intra-table ones.

6.2. Evaluation of query execution times

Table 3: Average execution time (in ms) of a single query (Wα over the D1* schemas, Wβ over D2*, Wγ over D3*)
Scale | D1 | D1α | D2 | D2α | D2β | D3 | D3α | D3β | D3γ
100K | 73.8 | 7.6 | 27.8 | 63.7 | 2.0 | 145.9 | 571.2 | 269.7 | 45.5
200K | 71.3 | 5.4 | 26.3 | 60.2 | 1.6 | 285.4 | 1124.4 | 490.9 | 60.7
500K | 71.3 | 5.4 | 25.5 | 60.2 | 1.6 | 714.5 | 1546.4 | 1134.6 | 125.5
1M | 77.9 | 5.4 | 27.8 | 63.7 | 1.0 | 1125.2 | 3249.2 | 2304.3 | 240.6

As shown in Figure 5, the three workloads are executed over the database schemas available for the corresponding point in time (i.e., Wα, Wβ, and Wγ are respectively executed over the D1*, D2*, and D3* versions). For each combination of workload and database schema, 11K queries have been executed by preserving the frequency of each query pattern and randomly choosing values (among the existing ones) for the selection predicates; the first 1000 queries are then discarded to minimize the impact of the cold start on the cache. Table 3 shows the average execution time (in ms) of a single query, obtained by weighing the average execution time of each query pattern by the respective query frequency.
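In symbols (our formulation of the weighting just described, not an equation from the paper), if t_p is the measured average execution time of query pattern p and f_p its frequency in the workload, the reported value is the weighted mean
t̄ = Σ_p (f_p · t_p) / Σ_p f_p.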
From the results, we can make the following observations.
• The workload changes significantly across releases. This is evident not only in the average execution time of a single query (which grows considerably on Wγ), but also in the variation over different scales: as shown in Table 1, Wγ includes query patterns where the number of rows scales with the cardinality of tables, while these are not present in Wα and Wβ.
• The random choice of selection predicates slightly impacts the average execution times, especially when these are particularly low. For example, it may seem that execution times improve with the database size in D2β; however, the standard deviation in this case ranges from 1.9 to 2.8 at all scales for this database schema, so the variation is clearly not statistically significant.
• Optimizations have a huge impact on performance, with reductions of execution times ranging from 3 to 10 times across all workloads. This provides a solid justification for the need to implement optimized database schemas, also in light of the essentially unvaried storage occupation.
• Interestingly, optimizations carried out at a specific point in time do not outlive the workload and end up backfiring at later stages. As the characteristics of the workloads evolve, execution times sensibly increase due to previous optimizations losing effectiveness and becoming a liability. This nicely demonstrates the need for a continuous re-evaluation of database optimizations.

6.3. Evaluation of migration convenience

Finally, we study the convenience of database migration; as seen in Figure 5, we focus on the migrations from D2α to D2β and from D3β to D3γ (i.e., migrating the data of R8 optimized for the old R1 workload to another schema optimized for the true R8 workload; similarly for R8 and R18).
Migration convenience is evaluated by measuring the gain obtained in query performance (due to database re-optimization) against the effort taken to migrate the data. Both factors are measured in terms of time: the gain is the difference in the average execution time between two database schemas, and the effort is the time required to execute the migration scripts. The migration then becomes convenient when the (cumulated) gain overcomes the effort.
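The break-even point follows directly from these definitions (our restatement): the number of queries needed to pass the migration time is n = Tmig / Δt, where Tmig is the migration time and Δt is the time gained per query; dividing n by the workload frequency turns it into elapsed time. For instance, using the values reported below in Table 4 for the 100K sample of the D2α-to-D2β migration:
n = 55,100 ms / 62 ms ≈ 889 queries; at the Wβ frequency of 0.57 Hz (Table 1), 889 / 0.57 ≈ 1,560 s ≈ 26 min,
which matches the 893 queries and 25.9 minutes of Table 4 up to the rounding of the per-query gain.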
Table 4 summarizes the results on all sample sizes and indicates the number of queries needed to accumulate enough gain to overcome the migration effort; the same is also translated into a measure of time, based on the query frequency in the real workload, as in Table 1.

Table 4: Migration statistics
Migration | Scale | Time gain per query (ms) | Mig. time (ms) | N. queries to pass mig. time | Time to pass mig. time (min)
D2α to D2β | 100K | 62 | 55,100 | 893 | 25.9
D2α to D2β | 200K | 59 | 127,100 | 2,172 | 63.0
D2α to D2β | 500K | 61 | 319,200 | 5,236 | 151.9
D2α to D2β | 1M | 63 | 675,000 | 10,761 | 312.2
D3β to D3γ | 100K | 172 | 93,000 | 542 | 6.5
D3β to D3γ | 200K | 378 | 233,300 | 618 | 7.4
D3β to D3γ | 500K | 910 | 521,900 | 574 | 6.8
D3β to D3γ | 1M | 1,867 | 922,000 | 494 | 5.9

The results are also reported in Figures 6 and 7, which emphasize trends over logarithmic time scales: for each sample size (identified by a different color), the cumulative time gained is shown as time elapses, the migration time is shown as a flat horizontal threshold, and a star marks the turning point. Linear regression is used to estimate gain and effort on the full database size (shown as black lines).

Figure 6: Study of migration convenience from D2α to D2β (cumulative time gained (s) vs. time elapsed (s); migration-time thresholds for the 100K, 200K, 500K, and 1M samples and the estimated full-scale migration time; reference marks at 1 min., 1 hour, 1 day, 1 month).

Figure 7: Study of migration convenience from D3β to D3γ (cumulative time gained (s) vs. time elapsed (s); migration-time thresholds for the 100K, 200K, 500K, and 1M samples and the estimated full-scale migration time; reference marks at 1 min., 1 hour, 1 day, 1 month).

From these results, we derive the following takeaways.
• First, we observe that the migration time is proportional to the database size. As discussed in Section 6.1, this is not surprising given the low-to-no impact of replication. What is remarkable is the estimated migration time at the full scale, which reaches the order of multiple days. Though this estimate could easily be improved by parallelizing the migration of the different tables, it shows the importance of considering workload prediction in the refactoring recommendation: the longer the time to migrate the data, the longer the required stability of the workload (or the accuracy of the prediction) to ensure that the migration pays off.
• Since execution times for Wβ are unaffected by the database size (as discussed in Section 6.2), the gain is almost identical across all scales. As a result, the bigger the database, the more time it takes to accumulate enough gain to compensate for the migration time. Differently, in Wγ, query execution times grow with the database size, and thus the gain scales accordingly. As a result, the migration becomes convenient after only 6-7 minutes, independently of the database size.
• Interestingly, the two studies reveal radically different scenarios, where the recommendations on carrying out database refactoring diverge. In the samples, database migration is always fast and particularly convenient. In our projected estimates over the full database, the migration to D2β would be discouraged under the assumption that the workload significantly differs in the following month; differently, the migration to D3γ is shown to be convenient, even though the implications of the considerable migration time should be carefully considered before enacting the refactoring.

7. Framework overview

As we have just demonstrated, the evolution of schemas and workloads can dramatically change the optimality of the schema design chosen at design time, and refactoring the database can restore such optimality. The newly-optimal schema should be found as the one maximizing the trade-off between the benefits of a refactoring and the effort to design and execute it. This task opens up multiple research challenges, including the exploration of the search space of target schemas (potentially scaling to thousands of concepts), the quantification (and comparison) of the benefits from refactoring the database and of the efforts to design and execute the refactoring, and the prediction of changes in the workload (potentially including millions of queries). As shown in Section 2, related work mainly explores the identification of the first target solution [40, 41, 42, 7, 8, 43, 10, 11, 12, 13, 14] or devises frameworks to manage multiple schema versions and propagate manually-defined schema transformations to the database [20, 24, 31, 21, 22, 23, 24], but does not address these research questions. In this section, we pave the way in this direction, defining a framework and illustrating some novel research challenges that it should address to achieve this goal.
The outcomes of the experiments evidently suggest that (i) database optimization is neither a one-time nor an incremental activity, as some optimizations can become counterproductive in future stages, (ii) database migration can be particularly expensive and may not be worth the trouble, and (iii) as a result, the continuous evaluation of refactoring options is fundamental to guarantee maximum efficiency under evolving workloads. Hence, we propose a multi-objective optimization to continuously evaluate and recommend the refactoring of NoSQL databases.

Figure 8: Overview of the proposed framework for continuous refactoring evaluation. The Target Schema Explorer takes the current conceptual model CMcur, workload wcur, and schema scur (plus a predicted workload wpred produced by the Workload Predictor) and calls the Gain Estimator, the Effort Estimator, and the Migration Process Generator; their outputs feed the Recommender, which evaluates the candidate schemas on the gain-effort plane.

An overview of the proposed framework is shown in Figure 8. The main component is the Target Schema Explorer, which is in charge of enumerating and evaluating the possible target schemas. The enumeration requires the current schema scur, the current conceptual model CMcur, and either the current workload wcur or a prediction of a future workload wpred, calculated by the Workload Predictor. Given a possible target schema si, its evaluation is aimed at quantifying the pros and cons of carrying out the refactoring from scur. The pros are calculated by the Gain Estimator, which measures the variations of multiple quality criteria (in terms of performance, storage occupation, etc.). The cons are calculated by the Effort Estimator, which measures the work required to carry out the refactoring (in terms of designing and executing the migration process, rewriting all workload queries, etc.). The latter estimation requires as much information as possible about the migration process, which is produced by the Migration Process Generator component. Finally, the Recommender obtains from the Target Schema Explorer the list of evaluated target schemas and produces a recommendation; given the amount and diversity of criteria to measure gains and efforts, the Recommender determines the set of relevant target schemas on the Pareto front [44].
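As a minimal illustration of this last step, and under assumed relation and attribute names that are not part of any existing implementation, the Pareto front over estimated gains and efforts can be computed in SQL as the set of candidate schemas not dominated by any other candidate:

-- candidates(schema_id, gain, effort): one row per evaluated target schema
SELECT c.schema_id, c.gain, c.effort
FROM candidates c
WHERE NOT EXISTS (
    SELECT 1
    FROM candidates d
    WHERE d.gain >= c.gain AND d.effort <= c.effort  -- d is at least as good on both criteria...
      AND (d.gain > c.gain OR d.effort < c.effort)   -- ...and strictly better on at least one
);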
The benefits of advancing the state of the art in this direction are twofold. On the one hand, the Recommender can provide critical insights to make refactoring decisions with significant improvements over the current situation, based on objective criteria and a comprehensive coverage of the possible alternatives. On the other hand, the automation of this task enables its continuous adoption through the lifetime of the database and of the applications running on top of it; indeed, a continuous evaluation of database refactoring minimizes the risk of undergoing major efforts at a later time to recover from a degraded state. In both cases, complete automation is hard to achieve, as the precise measurement of gains and efforts is particularly challenging, and the selection of the "best" refactoring activity from the Pareto front requires business knowledge and strategic vision (i.e., skills that cannot be easily quantified and encoded). Thus, our proposal goes in the direction of human-in-the-loop automation: while we turn to the DevOps philosophy in the continuous application of an automated procedure to maintain a high quality level of the database, DBAs/engineers should be able to step in at critical points to contribute with their knowledge and exploit the system to make decisions and decide the path forward.
In the following, we delve into the details of each of the framework's components, discussing current implementations and presenting the research challenges that are yet to be addressed to achieve automation.
Target schemas exploration. The aggregate-data modeling style of NoSQL databases implies a huge search space of alternative schemas that could be devised in a given domain [45]. [46] shows that there are 12 different ways to logically model a conceptual relationship between two entities in a document-based database. This search space is further amplified by the practice of replicating data in multiple collections to optimize the performance of the most frequent queries (we avoided this possibility in our experiments to keep them simpler); thus, an exhaustive generation and evaluation of all possibilities is prohibitive. The challenge is worsened by the absence of a single optimization metric to drive the exploration towards convergence. In the related work, the most common approach to schema exploration (as well as the one followed in this study) simply consists in converging to a target schema through some heuristics (e.g., [12]). However, we see huge potential in multi-objective evolutionary algorithms (MOEAs), which are particularly suitable for the task of finding Pareto-optimal solutions [47] but have not yet been adopted in this context.
Gain estimation. The Gain Estimator relies on a set of Key Performance Indicators (KPIs) to quantify the (dis)advantages of migrating from the current schema to a different one under many perspectives, namely Performance (query execution time is crucial in NoSQL), Storage (redundancy is typically encouraged, but updates should not be forgotten), and Complexity (schemalessness allows quick development, but also hides mistakes in the coding). In this study, we focused on the performance evaluation, but measured it empirically. Thus far, the proposed metrics for estimation are either oversimplistic (e.g., [12] considers the number of accessed documents) or too narrow (e.g., [48] predicts execution times using an advanced database-specific model, but limited to point queries on the primary key).
Migration effort estimation. Similarly to the gain, the migration effort can be measured from multiple perspectives, namely Process design (whose cost would depend on the complexity and extension of the model), Execution (which should consider the impact on the currently-running workload), and Application update (especially relevant due to the schemaless philosophy of NoSQL). For this, we can build on top of recent work on software refactoring [49] and ETL evolution [50].
Migration process generation. The proper estimation of the migration effort also depends on how well the migration process can be predicted and how much of it can be automated. The information returned by this component can be at several levels, namely conceptual (e.g., identifying which entities are involved in the migration), logical (i.e., defining the sequence of operations that should be carried out to migrate the data), and physical (i.e., producing the scripts or application code to be executed). In any case, an optimizer should be used to make the process as efficient as possible. In our study, the process was generated and optimized manually, but automation is clearly necessary. Some proposals in this direction have been made, but they only support a limited range of schema modification operations [51] and are tied to table-to-table (or collection-to-collection) mappings [52], whereas the migration of the database needs to be considered as a whole.
Workload prediction. The capability of the Recommender to operate on a predicted future workload is a bonus feature, in the sense that the recommendation could also be given just by considering the current workload. Nonetheless, given the (possibly considerable) effort to do a migration and the (possibly continuous) evolution of the workload, the optimality of the new target schema may be lost by the time the migration is completed, as the evidence of this study has shown. For this reason, predicting (with sufficient accuracy) what the workload will be at time i + Δ allows the recommender to consider an additional variable and to possibly converge towards the optimal solutions that require that Δ time to carry out the migration. The prediction of the evolution of workload queries is a field that has recently attracted research interest [37]. However, existing works are limited to relational databases and mostly focused on supporting live tuning of the DBMS's configuration and/or resources [53].

8. Conclusions

In this paper, we have presented an impact study of NoSQL database refactoring over a real-world use case, motivating the research problem, supporting it with empirical evidence, and presenting a proposal for a refactoring recommender framework. Our research work will continue in two directions. On the one hand, we plan to further investigate the SkyServer use case to consider additional strategies for schema optimization and to incorporate workload prediction into the evaluation of migration convenience; by collecting additional information about the workload in the upcoming months, we will put the proposed optimizations under a more comprehensive test of time. On the other hand, we will work towards the implementation and automation of the proposed framework. Each module in the framework encompasses its own challenges, which can be addressed separately. Our main efforts will first be directed towards enabling a broad exploration of target schemas and defining a comprehensive method for estimating migration efforts, considering the many variables that influence this process, including application code evolution and human effort estimation. We plan to work in close collaboration with companies dealing with evolving workloads in NoSQL databases, whose support is already shown in previous work on heterogeneous and evolving datasets [54, 55].

Acknowledgments

This work has been partially supported by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033 (DOGO4ML).
References

[1] P. J. Sadalage, M. Fowler, NoSQL distilled: a brief guide to the emerging world of polyglot persistence, Pearson Education, 2013.
[2] J. L. Harrington, Relational database design and implementation, Morgan Kaufmann, 2016.
[3] E. Meijer, A co-relational model of data for large shared data banks, in: M. Mezini (Ed.), ECOOP 2011 - Object-Oriented Programming - 25th European Conference, Lancaster, UK, July 25-29, 2011, Proceedings, volume 6813 of Lecture Notes in Computer Science, Springer, 2011, p. 1. doi:10.1007/978-3-642-22655-7_1.
[4] V. Gadepally, P. Chen, J. Duggan, A. J. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, M. Stonebraker, The BigDAWG polystore system and architecture, in: 2016 IEEE High Performance Extreme Computing Conference, HPEC 2016, Waltham, MA, USA, September 13-15, 2016, IEEE, 2016, pp. 1-6. doi:10.1109/HPEC.2016.7761636.
[5] R. Alotaibi, D. Bursztyn, A. Deutsch, I. Manolescu, S. Zampetakis, Towards scalable hybrid stores: Constraint-based rewriting to the rescue, in: P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, T. Kraska (Eds.), Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, ACM, 2019, pp. 1660-1677. doi:10.1145/3299869.3319895.
[6] N. Bansal, S. Sachdeva, L. K. Awasthi, Are NoSQL databases affected by schema?, IETE Journal of Research 70 (2024) 4770-4791. doi:10.1080/03772063.2023.2237478.
[7] V. Reniers, D. V. Landuyt, A. Rafique, W. Joosen, A workload-driven document database schema recommender (DBSR), in: G. Dobbie, U. Frank, G. Kappel, S. W. Liddle, H. C. Mayr (Eds.), Conceptual Modeling - 39th International Conference, ER 2020, Vienna, Austria, November 3-6, 2020, Proceedings, volume 12400 of Lecture Notes in Computer Science, Springer, 2020, pp. 471-484. doi:10.1007/978-3-030-62522-1_35.
[8] M. Hewasinghage, S. Nadal, A. Abelló, E. Zimányi, Automated database design for document stores with multicriteria optimization, Knowl. Inf. Syst. 65 (2023) 3045-3078. doi:10.1007/s10115-023-01828-3.
[9] M. Mozaffari, E. Nazemi, A. Eftekhari-Moghadam, CONST: continuous online NoSQL schema tuning, Softw. Pract. Exp. 51 (2021) 1147-1169. doi:10.1002/spe.2945.
[10] A. de la Vega, D. García-Saiz, C. Blanco, M. E. Zorrilla, P. Sánchez, Mortadelo: Automatic generation of NoSQL stores from platform-independent data models, Future Gener. Comput. Syst. 105 (2020) 455-474. doi:10.1016/j.future.2019.11.032.
[11] J. Mali, F. Atigui, A. Azough, N. Travers, ModelDrivenGuide: An approach for implementing NoSQL schemas, in: S. Hartmann, J. Küng, G. Kotsis, A. M. Tjoa, I. Khalil (Eds.), Database and Expert Systems Applications - 31st International Conference, DEXA 2020, Bratislava, Slovakia, September 14-17, 2020, Proceedings, Part I, volume 12391 of Lecture Notes in Computer Science, Springer, 2020, pp. 141-151. doi:10.1007/978-3-030-59003-1_9.
[12] L. Chen, A. Davoudian, M. Liu, A workload-driven method for designing aggregate-oriented NoSQL databases, Data Knowl. Eng. 142 (2022) 102089. doi:10.1016/j.datak.2022.102089.
[13] E. M. Kuszera, L. M. Peres, M. D. D. Fabro, Exploring data structure alternatives in the RDB to NoSQL document store conversion process, Inf. Syst. 105 (2022) 101941. doi:10.1016/j.is.2021.101941.
[14] N. Roy-Hubara, A. Sturm, P. Shoval, Designing NoSQL databases based on multiple requirement views, Data Knowl. Eng. 145 (2023) 102149. doi:10.1016/j.datak.2023.102149.
[15] W. Y. Mok, A conceptual model based design methodology for MongoDB databases, in: 7th International Conference on Information and Computer Technologies, ICICT 2024, Honolulu, HI, USA, March 15-17, 2024, IEEE, 2024, pp. 151-159. doi:10.1109/ICICT62343.2024.00030.
[16] I. Skoulis, P. Vassiliadis, A. V. Zarras, Growing up with stability: How open-source relational databases evolve, Inf. Syst. 53 (2015) 363-385. doi:10.1016/j.is.2015.03.009.
[17] S. Scherzinger, S. Sidortschuck, An empirical study on the design and evolution of NoSQL database schemas, in: G. Dobbie, U. Frank, G. Kappel, S. W. Liddle, H. C. Mayr (Eds.), Conceptual Modeling - 39th International Conference, ER 2020, Vienna, Austria, November 3-6, 2020, Proceedings, volume 12400 of Lecture Notes in Computer Science, Springer, 2020, pp. 441-455. doi:10.1007/978-3-030-62522-1_33.
[18] S. Fedushko, R. Malyi, Y. Syerov, P. Serdyuk, NoSQL document data migration strategy in the context of schema evolution, Data & Knowledge Engineering 154 (2024) 102369. URL: https://www.sciencedirect.com/science/article/pii/S0169023X24000934. doi:10.1016/j.datak.2024.102369.
[19] P. Vassiliadis, Profiles of schema evolution in free open source software projects, in: 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021, IEEE, 2021, pp. 1-12. doi:10.1109/ICDE51399.2021.00008.
[20] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, W. Lehner, Living in parallel realities: Co-existing schema versions with a bidirectional database evolution language, in: S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, D. Suciu (Eds.), Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, ACM, 2017, pp. 1101-1116. doi:10.1145/3035918.3064046.
[21] P. Koupil, J. Bártík, I. Holubová, MM-evocat: A tool for modelling and evolution management of multi-model data, in: M. A. Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, ACM, 2022, pp. 4892-4896. doi:10.1145/3511808.3557180.
[22] L. Caruccio, G. Polese, G. Tortora, Synchronization of queries and views upon schema evolutions: A survey, ACM Trans. Database Syst. 41 (2016) 9:1-9:41. doi:10.1145/2903726.
[23] A. Hillenbrand, M. Levchenko, U. Störl, S. Scherzinger, M. Klettke, MigCast: Putting a price tag on data model evolution in NoSQL data stores, in: P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, T. Kraska (Eds.), Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, ACM, 2019, pp. 1925-1928. doi:10.1145/3299869.3320223.
[24] I. Holubová, M. Vavrek, S. Scherzinger, Evolution management in multi-model databases, Data Knowl. Eng. 136 (2021) 101932. doi:10.1016/j.datak.2021.101932.
[25] S. Chang, V. Deufemia, G. Polese, M. Vacca, A logic framework to support database refactoring, in: R. R. Wagner, N. Revell, G. Pernul (Eds.), Database and Expert Systems Applications, 18th International Conference, DEXA 2007, Regensburg, Germany, September 3-7, 2007, Proceedings, volume 4653 of Lecture Notes in Computer Science, Springer, 2007, pp. 509-518. doi:10.1007/978-3-540-74469-6_50.
[26] P. Khumnin, T. Senivongse, SQL antipatterns detection and database refactoring process, in: T. Hochin, H. Hirata, H. Nomiya (Eds.), 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD 2017, Kanazawa, Japan, June 26-28, 2017, IEEE Computer Society, 2017, pp. 199-205. doi:10.1109/SNPD.2017.8022723.
[27] D. Sjøberg, Quantifying schema evolution, Inf. Softw. Technol. 35 (1993) 35-44. doi:10.1016/0950-5849(93)90027-Z.
[28] P. Vassiliadis, F. Shehaj, G. Kalampokis, A. V. Zarras, Joint source and schema evolution: Insights from a study of 195 FOSS projects, in: J. Stoyanovich, J. Teubner, N. Mamoulis, E. Pitoura, J. Mühlig, K. Hose, S. S. Bhowmick, M. Lissandrini (Eds.), Proceedings 26th In-
[29] … Barcelona, Spain, June 12-16, 2008, 2008, pp. 323-332.
[30] M. Klettke, U. Störl, M. Shenavai, S. Scherzinger, NoSQL schema evolution and big data migration at scale, in: J. Joshi, G. Karypis, L. Liu, X. Hu, R. Ak, Y. Xia, W. Xu, A. Sato, S. Rachuri, L. H. Ungar, P. S. Yu, R. Govindaraju, T. Suzumura (Eds.), 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington DC, USA, December 5-8, 2016, IEEE Computer Society, 2016, pp. 2764-2774. doi:10.1109/BigData.2016.7840924.
[31] A. H. Chillón, D. S. Ruiz, J. G. Molina, Towards a taxonomy of schema changes for NoSQL databases: The Orion language, in: A. K. Ghose, J. Horkoff, V. E. S. Souza, J. Parsons, J. Evermann (Eds.), Conceptual Modeling - 40th International Conference, ER 2021, Virtual Event, October 18-21, 2021, Proceedings, volume 13011 of Lecture Notes in Computer Science, Springer, 2021, pp. 176-185. doi:10.1007/978-3-030-89022-3_15.
[32] U. Störl, M. Klettke, S. Scherzinger, NoSQL schema evolution and data migration: State-of-the-art and opportunities, in: A. Bonifati, Y. Zhou, M. A. V. Salles, A. Böhm, D. Olteanu, G. H. L. Fletcher, A. Khan, B. Yang (Eds.), Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, OpenProceedings.org, 2020, pp. 655-658. doi:10.5441/002/edbt.2020.87.
[33] B. A. Muse, F. Khomh, G. Antoniol, Refactoring practices in the context of data-intensive systems, Empir. Softw. Eng. 28 (2023) 46. doi:10.1007/s10664-022-10271-x.
[34] B. Karwin, SQL Antipatterns: Avoiding the pitfalls of database programming, The Pragmatic Bookshelf, 2010.
[35] M. Athanassoulis, M. S. Kester, L. M. Maas, R. Stoica, S. Idreos, A. Ailamaki, M. Callaghan, Designing access methods: The RUM conjecture, in: E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, K. Stefanidis (Eds.), Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, OpenProceedings.org, 2016, pp. 461-466. doi:10.5441/002/edbt.2016.42.
[36] S. Salza, M. Terranova, Workload modeling for relational database systems, in: D. J. DeWitt, H. Boral (Eds.), Database Machines, Fourth International Workshop, Grand Bahama Island, March 1985, Springer, 1985, pp. 233-255.
[37] X. Huang, S. Cao, Y. Gao, X. Gao, G. Chen, Lightpro:
ternational Conference on Extending Database Tech- Lightweight probabilistic workload prediction frame-
nology, EDBT 2023, Ioannina, Greece, March 28-31, work for database-as-a-service, in: C. A. Ardagna, N. L.
2023, OpenProceedings.org, 2023, pp. 27–39. URL: Atukorala, B. Benatallah, A. Bouguettaya, F. Casati,
https://doi.org/10.48786/edbt.2023.03. doi:10.48786/ C. K. Chang, R. N. Chang, E. Damiani, C. G. Guegan,
edbt.2023.03. R. Ward, F. Xhafa, X. Xu, J. Zhang (Eds.), IEEE In-
[29] C. Curino, H. J. Moon, L. Tanca, C. Zaniolo, Schema ternational Conference on Web Services, ICWS 2022,
evolution in wikipedia - toward a web information sys- Barcelona, Spain, July 10-16, 2022, IEEE, 2022, pp. 160–
tem benchmark, in: J. Cordeiro, J. Filipe (Eds.), ICEIS 169. URL: https://doi.org/10.1109/ICWS55610.2022.
2008 - Proceedings of the Tenth International Confer- 00036. doi:10.1109/ICWS55610.2022.00036.
ence on Enterprise Information Systems, Volume DISI, [38] L. Ma, D. V. Aken, A. Hefny, G. Mezerhane, A. Pavlo,
G. J. Gordon, Query-based workload forecasting fast and elitist multiobjective genetic algorithm:
for self-driving database management systems, in: NSGA-II, IEEE Trans. Evol. Comput. 6 (2002) 182–
G. Das, C. M. Jermaine, P. A. Bernstein (Eds.), Proceed- 197. URL: https://doi.org/10.1109/4235.996017. doi:10.
ings of the 2018 International Conference on Manage- 1109/4235.996017.
ment of Data, SIGMOD Conference 2018, Houston, [48] M. Hewasinghage, A. Abelló, J. Varga, E. Zimányi,
TX, USA, June 10-15, 2018, ACM, 2018, pp. 631–645. Managing polyglot systems metadata with hy-
URL: https://doi.org/10.1145/3183713.3196908. doi:10. pergraphs, Data Knowl. Eng. 134 (2021) 101896.
1145/3183713.3196908. URL: https://doi.org/10.1016/j.datak.2021.101896.
[39] S. S. Elnaffar, P. Martin, An intelligent framework doi:10.1016/j.datak.2021.101896.
for predicting shifts in the workloads of autonomic [49] R. Rasool, A. A. Malik, Effort estimation of etl projects
database management systems, in: Proc of 2004 IEEE using forward stepwise regression, in: 2015 Interna-
International Conference on Advances in Intelligent tional Conference on Emerging Technologies (ICET),
Systems–Theory and Applications, 15-18, 2004, pp. 2015, pp. 1–6. doi:10.1109/ICET.2015.7389209.
1–8. [50] G. Papastefanatos, P. Vassiliadis, A. Simit-
[40] A. Chebotko, A. Kashlev, S. Lu, A big data modeling sis, Y. Vassiliou, Metrics for the prediction
methodology for apache cassandra, in: B. Carminati, of evolution impact in ETL ecosystems: A
L. Khan (Eds.), 2015 IEEE International Congress on case study, J. Data Semant. 1 (2012) 75–97.
Big Data, New York City, NY, USA, June 27 - July URL: https://doi.org/10.1007/s13740-012-0006-9.
2, 2015, IEEE Computer Society, 2015, pp. 238–245. doi:10.1007/s13740-012-0006-9.
URL: https://doi.org/10.1109/BigDataCongress.2015. [51] U. Störl, M. Klettke, Darwin: A data platform for
41. doi:10.1109/BigDataCongress.2015.41. schema evolution management and data migration, in:
[41] M. J. Mior, K. Salem, A. Aboulnaga, R. Liu, Nose: M. Ramanath, T. Palpanas (Eds.), Proceedings of the
Schema design for nosql applications, IEEE Trans. Workshops of the EDBT/ICDT 2022 Joint Conference,
Knowl. Data Eng. 29 (2017) 2275–2289. URL: https: Edinburgh, UK, March 29, 2022, volume 3135 of CEUR
//doi.org/10.1109/TKDE.2017.2722412. doi:10.1109/ Workshop Proceedings, CEUR-WS.org, 2022. URL: https:
TKDE.2017.2722412. //ceur-ws.org/Vol-3135/dataplat_short3.pdf.
[42] C. de Lima, R. dos Santos Mello, On proposing and eval- [52] C. Forresi, E. Gallinucci, M. Golfarelli, H. B. Hamadou,
uating a nosql document database logical approach, A dataspace-based framework for OLAP analyses in a
Int. J. Web Inf. Syst. 12 (2016) 398–417. URL: https: high-variety multistore, VLDB J. 30 (2021) 1017–1040.
//doi.org/10.1108/IJWIS-04-2016-0018. doi:10.1108/ URL: https://doi.org/10.1007/s00778-021-00682-5.
IJWIS-04-2016-0018. doi:10.1007/s00778-021-00682-5.
[43] F. Abdelhédi, A. A. Brahim, F. Atigui, G. Zurfluh, [53] V. V. Meduri, K. Chowdhury, M. Sarwat, Evaluation
Umltonosql: Automatic transformation of conceptual of machine learning algorithms in predicting the next
schema to nosql databases, in: 14th IEEE/ACS Inter- SQL query from the future, ACM Trans. Database
national Conference on Computer Systems and Appli- Syst. 46 (2021) 4:1–4:46. URL: https://doi.org/10.1145/
cations, AICCSA 2017, Hammamet, Tunisia, October 3442338. doi:10.1145/3442338.
30 - Nov. 3, 2017, IEEE Computer Society, 2017, pp. [54] E. Gallinucci, M. Golfarelli, S. Rizzi, Schema profil-
272–279. URL: https://doi.org/10.1109/AICCSA.2017. ing of document-oriented databases, Inf. Syst. 75
76. doi:10.1109/AICCSA.2017.76. (2018) 13–25. URL: https://doi.org/10.1016/j.is.2018.02.
[44] A. Ben-Tal, Characterization of pareto and lexico- 007. doi:10.1016/j.is.2018.02.007.
graphic optimal solutions, in: Multiple Criteria Deci- [55] C. Forresi, M. Francia, E. Gallinucci, M. Golfarelli,
sion Making Theory and Application: Proceedings of Streaming approach to schema profiling, in: A. Abelló,
the Third Conference Hagen/Königswinter, West Ger- P. Vassiliadis, O. Romero, R. Wrembel, F. Bugiotti,
many, August 20–24, 1979, Springer, 1980, pp. 1–11. J. Gamper, G. Vargas-Solar, E. Zumpano (Eds.), New
[45] M. Hewasinghage, N. B. Seghouani, F. Bugiotti, Mod- Trends in Database and Information Systems - ADBIS
eling strategies for storing data in distributed het- 2023 Short Papers, Doctoral Consortium and Work-
erogeneous nosql databases, in: J. Trujillo, K. C. shops: AIDMA, DOING, K-Gals, MADEISD, PeRS,
Davis, X. Du, Z. Li, T. W. Ling, G. Li, M. Lee (Eds.), Barcelona, Spain, September 4-7, 2023, Proceedings,
Conceptual Modeling - 37th International Confer- volume 1850 of Communications in Computer and In-
ence, ER 2018, Xi’an, China, October 22-25, 2018, formation Science, Springer, 2023, pp. 211–220. URL:
Proceedings, volume 11157 of Lecture Notes in Com- https://doi.org/10.1007/978-3-031-42941-5_19. doi:10.
puter Science, Springer, 2018, pp. 488–496. URL: https:// 1007/978-3-031-42941-5\_19.
doi.org/10.1007/978-3-030-00847-5_35. doi:10.1007/
978-3-030-00847-5\_35.
[46] M. Hewasinghage, S. Nadal, A. Abelló, Docdesign
2.0: Automated database design for document stores
with multi-criteria optimization, in: Y. Velegrakis,
D. Zeinalipour-Yazti, P. K. Chrysanthis, F. Guerra
(Eds.), Proceedings of the 24th International Con-
ference on Extending Database Technology, EDBT
2021, Nicosia, Cyprus, March 23 - 26, 2021, Open-
Proceedings.org, 2021, pp. 674–677. URL: https://
doi.org/10.5441/002/edbt.2021.81. doi:10.5441/002/
edbt.2021.81.
[47] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A