Impact Study of NoSQL Refactoring in SkyServer Database

Enrico Gallinucci1,*, Matteo Golfarelli1, Wafaa Radwan2, Gabriel Zarate3 and Alberto Abelló3
1 University of Bologna, Cesena, Italy
2 Jawwal Telecommunications, Ramallah, Palestine
3 Universitat Politècnica de Catalunya, Barcelona, Spain

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain.
* Corresponding author: enrico.gallinucci@unibo.it (E. Gallinucci); matteo.golfarelli@unibo.it (M. Golfarelli); wafaa.radwan@jawwal.ps (W. Radwan); gabriel.zarate@estudiantat.upc.edu (G. Zarate); alberto.abello@upc.edu (A. Abelló).
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Data modeling in NoSQL databases is notoriously complex and driven by multiple, possibly conflicting requirements. Researchers have proposed methodologies to optimize the schema design of a given domain for a given workload; however, due to the agile environment in which NoSQL databases are usually employed, both domain and workload are frequently subject to change and evolution, possibly neutralizing the benefits of optimization. When this happens, the benefits of a new optimal schema design must be weighed against the costs of migrating the data. In this work, we empirically show the benefits of schema redesign in a real, publicly available database. In particular, we identify multiple snapshots (in terms of domain extension and querying workload) in the 20+ years of evolution of SkyServer, demonstrate how NoSQL schema optimization at a given time can later backfire, and evaluate the conditions under which data migration becomes beneficial. This leads us to define the foundations and challenges of a framework for continuous NoSQL database refactoring, with the goal of helping DBAs and data engineers decide if, when, and how a NoSQL database should be reconsidered to restore schema design optimality.

Keywords
NoSQL database, Database refactoring, Data modeling, Data migration

1. Introduction

Database design has been studied for many years in relational databases, but its automation has not been achieved yet. Moreover, the advent of NoSQL databases since the early 2010s has added complexity to the problem by offering alternative data models: key-value, wide-column, document-based, and graph [1]. Among these, the first three are also known as aggregate-oriented data models, as they encourage the modeling of tuples as complex objects, embedding all the data required to answer a query and minimizing the need to compute joins (thus avoiding the costly operation of transferring data between nodes) [1]. For this reason, the traditional domain-driven data modeling strategies typically used in relational databases [2] are abandoned in favor of workload-driven strategies, where tuples are modeled (i.e., their schema is designed) depending on the queries that the database is bound to answer. (In domain-driven design, workload information is used as well, but it is not the main driver.) Notice that we do not use the term NoSQL to name a family of tools, but a family of models, as a synonym of "co-relational" in [3], which can then be implemented in any tool, including an object-relational one like PostgreSQL.

Several research papers have proposed methodologies to obtain the optimal schema design, especially on the wide-column and document-based data models (as the key-value model does not leave much room for alternative modeling strategies). As fully discussed in Section 2, these methodologies typically rely on a conceptual model (CM) of the domain (e.g., a UML class diagram) and a set of queries to be answered (a.k.a. workload). Their goal is to find a target database schema design that minimizes query answering times. This is often achieved by indexing, partitioning, and replicating data in multiple tables (or collections) with different contents to accommodate different queries.

[Figure 1: Intuition of the research problem. The domain evolves (CM_cur = CM_0 + ΔCM) and the workload evolves (w_cur = w_0 + Δw); the optimality of the initially implemented schema s_0 (Objective 1, covered by the state of the art) drops over time, and a refactoring recommender must decide whether the gain of refactoring to another schema is worth the effort (Objective 2).]

As sketched in Figure 1, we focus on what happens next (i.e., after a schema design has been chosen and the system/application is in production). For multiple reasons, the conditions considered at design time are continuously evolving (e.g., new data must be stored, new queries appear, or existing queries are executed at different rates), overturning the fitness of schema designs to the optimization problem. Consequently, the database should be refactored to the schema design that proves to be optimal under the new conditions. Intuitively, the sweet spot of interesting solutions is given by those showing the most gain with minimum effort (i.e., those in blue in Figure 1). However, refactoring a database can be costly from multiple perspectives (design and execution of the migration process in the first place), and the trade-off between the benefits of refactoring and its effort should be carefully evaluated. Moreover, the evaluation of database refactoring should not be a once-in-a-while activity: inspired by the DevOps philosophy of continuous evolution in an agile software development environment, database refactoring should be treated as a continuous problem as well. It is known that the performance of query execution can improve by migrating the corresponding data between DBMSs, even when the migration time is included [4, 5]. Our experiments, based on real astronomic data, show that, although the corresponding data migration takes some days of execution, schema optimization reduces query cost by an order of magnitude and consequently pays off in the long term (notice that we are not considering here the effort of application code evolution). Thus, a structured and automated approach is even more crucial to ensure the feasibility of continuous evolution.
The main contributions we provide in this paper are:

1. An experimental setting that allows analyzing database refactoring (not considering changes in application code).

2. A detailed empirical analysis of the performance impact of schema evolution in the SkyServer database.

3. A framework proposal able to explore schema designs alternative to the current one and to give recommendations based on the evaluation of the trade-off between the migration effort and the gain under different optimization criteria.

The outline of the paper is as follows. The related literature is presented in Section 2. Section 3 introduces our use case. Section 4 explains and exemplifies the motivation behind the research problem. Section 5 defines the experimental setting. Section 6 presents the evaluation of SkyServer, grounded on which we define our framework, described in Section 7. Conclusions are drawn in Section 8.

2. Related work

The workload-driven nature of NoSQL data modeling has been established since the dawn of NoSQL databases [1]. Indeed, results suggest that schema alternatives affect database performance in all NoSQL models [6]. Over the last decade, researchers have worked to support DBAs and data engineers in the complicated task of finding the best logical model for a given workload. The most recent existing works mainly differ in (i) focusing on a single [7, 8, 9] or multiple data models [10, 11, 12, 13, 14]; (ii) considering only the conceptual model of the data [11] or also including workload queries, with [7, 12, 13, 14, 8] or without query frequencies [10, 15, 9]; and (iii) directly generating one [10, 11, 14, 9] or more target schemas [15], or evaluating several of them based on a single criterion [12, 15, 9] or on multiple ones [7, 13, 8]. The common factor among all these works is the limited focus on the initial design of a logical schema (i.e., none of them considers the challenge of implementing such a schema by refactoring an existing one).

Research work on database evolution also started in the relational world and then propagated to the NoSQL side, where the schemaless characteristic (the term refers to the fact that schema information is attached directly to each data item, thus imposing no constraint at the level of the table/collection of data) makes databases more easily subject to schema changes, which highly impacts their performance. Researchers have looked for patterns in the evolution of schemas in both relational [16] and NoSQL databases [17, 18] (and beyond [19]). Recent efforts to support and/or automate the management of schema evolution have been directed toward keeping track of different schema versions [20], propagating manually-defined schema modification operations (SMOs) to the database [21] and to queries [22], and evaluating multiple strategies to apply schema changes to the data [23, 24]. Overall, this is still an open research field, and none of the mentioned works goes in the direction of recommending if, how, and/or when a (NoSQL) database should be refactored. Recommendations for (relational) database refactoring have been given, but mostly focused on finding and resolving issues such as inconsistencies [25] and anti-patterns [26]. More recently, [18] proposes a first approach to migration strategy planning for NoSQL databases, but still without deciding whether migrating is worthwhile, or how to do it.

3. SkyServer Case study

SkyServer (https://skyserver.sdss.org/dr18) is a publicly available relational database designed to map the cosmos, made available by the Sloan Digital Sky Survey (SDSS) organization. Over the years, it has integrated more and more data in successive extensions. Access to SDSS data is provided via a web interface, where users can query and download data through either SQL or interfaces designed for both professional astronomers and educational purposes.

Besides the astronomical data themselves, the server also makes public the SkyServer Traffic Log (https://skyserver.sdss.org/log/en/traffic/sql.asp), which captures statistics on the SQL queries being executed. This includes columns such as theTime (datetime of the query), webserver (URL of the server), winname (Windows name of the server), clientIP (client's IP address), and sql (the SQL statement executed), among others. It also captures performance metrics like elapsed (query execution time), busy (CPU time used by the query), and rows (number of rows returned by the query), providing a comprehensive snapshot of server activity over the last two decades (since 2003). Thus, we analyzed three different database schemas and the corresponding snapshots of this log as in Release 1 (December 2003), Release 8 (December 2013), and Release 18 (December 2023).
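To make the log-based analysis concrete, the sketch below shows one way the per-statement statistics used later in Section 5.3 could be pulled from the Traffic Log. It is a minimal illustration, assuming the log has been exported to a CSV file with the column names listed above; the file name is hypothetical, and the real extraction pipeline is more involved.

    import pandas as pd

    # Load a CSV export of the SkyServer Traffic Log (hypothetical file name).
    log = pd.read_csv("sqllog_2023-12.csv", parse_dates=["theTime"])

    # Per-statement statistics: occurrence count, mean execution time, mean
    # result size. Grouping by the raw SQL text is a crude first step; the
    # clustering of Section 5.3 groups by tables/columns/predicates instead.
    stats = (log.groupby("sql")
                .agg(n=("sql", "size"),
                     avg_elapsed=("elapsed", "mean"),
                     avg_rows=("rows", "mean"))
                .sort_values("n", ascending=False))

    # Overall query frequency in Hz, assuming a uniform distribution in time.
    seconds = (log["theTime"].max() - log["theTime"].min()).total_seconds()
    print(len(log) / seconds, "queries/s")
    print(stats.head(10))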
The corresponding schemas (CM_R1, CM_R8, and CM_R18) are summarized in Figure 2 and include changes in both the number of tables and attributes and in how the latter are placed in the former, but without information loss.

[Figure 2: Conceptual models of the main tables in SDSS SkyServer: (a) CM_R1, (b) CM_R8, (c) CM_R18. The models involve, among others, the tables PhotoObjAll, SpecObjAll, Photoz, PhotozRF, Field, Frame, PlateX, galSpecInfo, galSpecLine, sppParams, sppLines, and zooSpec.]

In these schemas, we find data captured from different regions of the sky, called Fields, where different objects are observed as PhotoObj. Spectroscopic data are also captured for each of these astronomical objects and stored per wavelength interval into SpecObj. All measurements are done through aluminum Plates that allow individual spectrographs to be precisely plugged into the telescope through optical fibers. (The whole catalog of tables is available at https://skyserver.sdss.org/dr18/MoreTools/browser.)

4. Motivation

In any kind of DBMS, the choice of the initial schema design is based on conditions (i.e., the conceptual representation of the domain and the estimated workload) that can change, either because they were not accurate or because they have evolved; this is even more so in NoSQL systems. In this section, we present a couple of small but comprehensive examples to illustrate the problems, before moving to a larger one with real data that demonstrates the true impact.

4.1. Domains evolve

Plenty of research papers show that database schemas need to evolve to accommodate changes in the domain (e.g., new information to be added, obsolete information to be removed, data type changes), from the '90s [27] to most recent times [28], in both relational [29, 16] and NoSQL databases [30, 31], looking for patterns in schema updates [16], studying the repercussions on the related application code [29], managing multiple schema versions [22], and designing frameworks to automate schema evolution [17]. An interesting pattern emerging from multiple research works [27, 29, 16] is that, in the early stages of the application lifespan, relational databases typically undergo an inflation phase, where multiple operations are carried out to add new schema information. In this sense, NoSQL databases are even more appealing due to their schemaless nature, which lets them easily accommodate schema additions and makes them more suitable for agile development [32]. Nevertheless, this does not mean that their performance is optimal regardless of how data is stored: schema design still requires periodic reconsideration.

The frequency of schema changes depends on the domain and application [16]: in some cases it can be pervasive (in [27], the authors found that all the tables of the 20 analyzed databases were somehow affected by the evolution process), while in others it is almost completely absent (in [19], 70% of the database schemas of the 195 analyzed open source software projects demonstrate the absence, or a very small presence, of change). Nevertheless, in the latter case, the authors verify that the absence of evolution does not mean that application requirements are static, but rather that DBAs/developers are reluctant to evolve database schemas to avoid the effort. A similar insight is found in [33], which studied schema evolution in 29 data-intensive applications using either relational or NoSQL databases. The study found that complex refactoring operations are seldom carried out, due to the lack of tools to support them.

From these studies, we conclude that: (1) it is very difficult to have a perfect understanding at design time of how schema information must be modeled; (2) the conditions that call for modifying database schemas can mature at any time; (3) there is reluctance to change a database schema once it reaches a certain maturity level, and such changes are aimed at minimizing refactoring efforts. As a result: (a) the updated schemas tend to be simple variations of the initial one, despite the choice of the latter being based on a significant degree of uncertainty at design time; (b) the pure minimization of refactoring efforts potentially leads to missing big opportunities hidden by the scarecrow of complex refactoring, steering instead towards possible antipatterns, i.e., bad practices in schema design that are intended to solve certain problems but eventually lead to other problems [34, 26] (which, in turn, will require further modifications to remodel the data).

[Figure 3: Example of schema evolution: (a) the initial conceptual schema CM_0 and the initial database schema s_0; (b) the evolved conceptual schema CM_i at time i and three alternative database schemas s_i_1, s_i_2, and s_i_3; (c) comparison of the three potential databases on the maximization of three different objectives (storage, read performance, write performance).]

Example 1. An exemplification of schema evolution on a document-based database is shown in Figure 3. Let CM_0 be the initial conceptual schema with only one entity; database schema s_0 is created with a single collection of SpecObjAll (Figure 3a shows a sample document). Later on, at time i (Figure 3b), the conceptual schema evolves to CM_i to organize spectral readings into plates. To accommodate this change with minimum effort, DBAs would be inclined to evolve the database towards the schema design of s_i_1 or s_i_2, but they would probably avoid s_i_3, even though it might be the optimal schema, as hinted by the radar chart in Figure 3c. Inspired by [35], the chart shows a comparison between the three databases in terms of the maximization of three objectives: storage occupation, and the performance (speed) of read and write queries. For storage, we assume 30 bytes for IDs, 50 bytes for strings, 8 bytes for numbers, and a ratio of 2 spectral readings per plate. To estimate query performance in this example, we used the cost model by [12] and assumed two read queries (q1 and q2) and two write queries (q3 and q4), as follows:

q1 = SELECT s.* FROM SpecObjAll s WHERE s.id = ;
q2 = SELECT s.* FROM SpecObjAll s JOIN PlateX p WHERE p.id = ;
q3 = UPDATE SpecObjAll s SET s.tile = WHERE s.id = ;
q4 = UPDATE PlateX p SET p.quality = WHERE p.id = ;

The indicators are normalized on a scale from 0 (worst) to 1 (best) using the complement of the min-max normalized value. For instance, given φ(s) as the average query execution time on schema s, and x and y as the minimum and maximum values of φ(s_i_k), k ∈ {1, 2, 3}, the query performance for the j-th schema is calculated as (y − φ(s_i_j)) / (y − x). ♢
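The normalization above is easy to reproduce. The following minimal sketch computes the performance indicator for the three candidate schemas, using invented average execution times rather than the values behind Figure 3c.

    # Complement of min-max normalization: 1 for the best (fastest) schema,
    # 0 for the worst (slowest) one.
    def performance_indicator(avg_times):
        x, y = min(avg_times.values()), max(avg_times.values())
        return {s: (y - t) / (y - x) for s, t in avg_times.items()}

    # Hypothetical average read times (ms) for the three alternatives.
    phi = {"s_i_1": 12.0, "s_i_2": 5.0, "s_i_3": 2.5}
    print(performance_indicator(phi))
    # {'s_i_1': 0.0, 's_i_2': 0.74, 's_i_3': 1.0}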
4.2. Workloads evolve

Similarly to schema evolution, early studies on the evolution of query workloads date back to the '80s [36] and continue to most recent times [37]. The evolution of workloads can be traced back to common patterns [38, 39].

• Changes in frequency (i.e., the same queries are executed with different frequencies and/or ratios), either with cyclic patterns (e.g., daily or monthly), with occasional spikes (e.g., due to an unexpected popularity increase of the application), or with more stable changes (e.g., due to new users from different time zones).

• Changes in queries (i.e., the existing queries are formulated differently, e.g., due to a change at the domain level), or new/old queries are added/removed (e.g., due to their addition to or removal from the application in the latest release).

When the initial schema design is chosen, the workload is assumed to be constant in terms of the queries using it and their corresponding frequencies. This simplification is understandable (if not essential) to choose a (sub-)optimal initial schema design. However, given the strong dependence of schema design optimality on the workload, the evolution of the latter can have a tremendous impact, and ignoring it can lead to a progressive distancing from the objectives that the initial schema design was originally maximizing.

Example 2. An exemplification of workload evolution on a document-based database is shown in Figure 4. Let CM be the initial conceptual model with four classes (Figure 4a), and consider two different database schemas, s1 and s2 (Figure 4b). Assume that s1 is the one chosen at design time, given the following workload:

q1 = SELECT o.*, s.* FROM Field f JOIN PhotoObj o JOIN SpecObj s WHERE f.id = ;
q2 = SELECT p.*, s.* FROM Plate p JOIN SpecObj s WHERE p.id = ;

Figure 4c shows that the frequency of the two queries inverts from the initial deployment at t0 to the one at ti. Consequently, the optimality of the two database schemas with respect to query performance changes accordingly (Figure 4d). The storage ratio can also change (Figure 4e shows the database size in s2 divided by the size in s1) if the data grows unevenly. Reusing the storage assumptions from Example 1, the values in the figure are calculated assuming a ratio between fields and spectral readings that grows from 1:1 to 10:1 and an average number of spectral readings per plate growing from 2 to 10. ♢

[Figure 4: Example of workload evolution: (a) the conceptual model; (b) two possible database schemas, s1 (the initially chosen one) and s2; (c) change of the queries' frequency in time; (d, e) change of optimality of the two database schemas on query performance and storage occupation.]
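The inversion in Figure 4d can be reproduced with a back-of-the-envelope model: the expected cost of a schema is the frequency-weighted sum of its per-query costs. The per-query costs below are invented for illustration and are not taken from the paper.

    # Expected workload cost of a schema under a given query-frequency mix.
    def expected_cost(freqs, costs):
        return sum(freqs[q] * costs[q] for q in freqs)

    # Hypothetical per-query costs (ms): s1 favors q1, s2 favors q2.
    costs = {"s1": {"q1": 2.0, "q2": 20.0},
             "s2": {"q1": 15.0, "q2": 3.0}}

    for freqs in ({"q1": 0.9, "q2": 0.1},    # mix at time t0
                  {"q1": 0.1, "q2": 0.9}):   # inverted mix at time ti
        best = min(costs, key=lambda s: expected_cost(freqs, costs[s]))
        print(freqs, "->", best)             # s1 wins at t0, s2 wins at ti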
5. Experimental setting

To empirically demonstrate our point, we analyzed in detail the real effect of schema evolution on SkyServer performance. For this, we considered three points in time, corresponding to releases R1, R8, and R18. Firstly, we characterized the different workloads by identifying the most common query patterns (i.e., ignoring the mostly unique queries in the very long tail of frequencies). Then, we recreated the database at the point in time of each of the three releases and populated them with a sample of the data available in SkyServer. Finally, we measured both the cost of queries in each schema and the cost of moving the data from one schema to another. All tests have been executed on a PostgreSQL 15 instance, running on a server with an i7-8700 CPU and 64 GB of RAM. To guarantee reproducibility, all the corresponding code is publicly available on GitHub (https://github.com/enricogallinucci/nosql-refactoring).

In the following, we use numbers (1, 2, and 3) to refer to database schemas at different points in time, and Greek letters (α, β, and γ) to refer to the corresponding workloads. A summary of the experimental setup is shown in Figure 5 and detailed in the following sections; in the figure, the yellow area indicates the database schemas over which each workload is executed.

[Figure 5: DBs and workloads in our experimental setup. For each release, the original SkyServer DB is flanked by its workload-optimized version; time-evolved DBs carry an old optimization forward to a later release, and arrows mark the migrations to be evaluated.]

5.1. Schemas

The schema of each database in Figure 5 is generated as follows. Original databases (D1, D2, D3) are populated exactly as provided by SDSS. Nevertheless, to make the performance comparable after optimization, we did not implement classic 1NF, but encoded each row into a flat JSON document (without any subdocument or array) that was then stored in a relational table in PostgreSQL. Optimized databases (D1α, D2β, and D3γ) are optimized for the workload at the corresponding point in time, following the hints in [12]. More concretely, to decide on join materialization, as well as vertical and horizontal partitioning, we proceeded as follows (see the sketch after this list):

1. Followed a greedy algorithm taking each query in the order indicated by its criticality (i.e., queries with a higher value for the product of their frequency and cost were considered first) and:
   (i) Created a JSON document per row considering all the tables involved in the query (i.e., join materialization or embedding).
   (ii) Took from each source table only the attributes necessary for the query (i.e., vertical partitioning).
   (iii) Applied the filters of the query (i.e., horizontal partitioning).
   Notice that (a) once a critical query has generated some optimization, this is not undone by less critical queries, and (b) we allowed intra-table redundancies (e.g., materializing the many-to-one join between Frame and Field, which replicates field data for every frame), but not inter-table ones (i.e., once we materialize the join between PhotoObjAll and Photoz in a single table, we do not create another standalone copy of the latter).

2. Generated a separate vertical partition of the corresponding table to store all attributes not used in any query.

3. Created secondary indexes for every attribute in the selection predicates, for both the original and the optimized schemas.

Time-evolved databases (D2α, D3α, and D3β) correspond exactly to databases optimized for an obsolete workload, but extended with the concepts introduced in the new release in the form of one extra table per new concept, so that all queries can be executed. For instance, D2α is the time-evolution of D1α, which preserves the pre-existing data and the optimizations for workload α, but adds the flat new tables introduced by D2. These schemas are crucial to put optimizations under the test of time and to evaluate whether their effectiveness holds upon workload evolution, or whether it would be more convenient to migrate the data to the schema optimized for the new workload.
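A minimal sketch of the greedy design step is given below, under strong simplifying assumptions: each query is reduced to the tables it joins, the attributes it touches, and a criticality score, and the output is just the list of materialized collections. The actual pipeline (available in the GitHub repository) operates on real SQL and data; all names here are illustrative.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Query:
        tables: frozenset    # tables joined by the query
        attrs: frozenset     # attributes projected or filtered on
        criticality: float   # frequency x cost

    def design_schema(queries, all_attrs):
        """Greedy join materialization + vertical partitioning (simplified)."""
        collections, used = [], set()
        for q in sorted(queries, key=lambda q: q.criticality, reverse=True):
            # A collection created for a more critical query is never undone;
            # reuse it if it already covers this query's tables and attributes.
            if any(q.tables <= c["tables"] and q.attrs <= c["attrs"]
                   for c in collections):
                continue
            collections.append({"tables": set(q.tables), "attrs": set(q.attrs)})
            used |= q.attrs
        # Step 2: leftover attributes go to a separate vertical partition.
        collections.append({"tables": set(), "attrs": set(all_attrs) - used})
        return collections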
5.2. Data

The overall size of the SkyServer database (in its latest release) is approximately 5 TB, and it cannot be directly downloaded; consequently, we proportionally sampled the source to manage it more effectively. Sampling is based on the main table, PhotoObjAll, which originally contains 1.2 billion rows: we collected 4 samples of 100K, 200K, 500K, and 1M rows, ensuring that the samples preserve the distribution of the attributes involved in the selection predicates; then, the other tables are populated with the rows linked to the ones sampled from PhotoObjAll. To make the performance comparable across the releases, we made the four samples of the same size, independently of the size of the database at the point in time of the release.

5.3. Workload

The workload of every release was extracted from the SqlLog table of the SkyServer Traffic Log for December of the corresponding year, excluding queries that were unsuccessful or that involved customer user tables. Given the nature of the service, we should notice that users are not allowed to modify the database; hence, the log contains only read queries. After parsing the queries, we extracted (1) the tables involved, (2) the columns projected, and (3) the selection predicate. Firstly, the queries were clustered based on the tables they required, and a minimum threshold of 1% was fixed for a cluster to be further considered. These initial clusters were then subdivided depending on the columns projected and the selection predicate used, filtering out subclusters with less than 0.5% of the queries, for a final count of 5, 23, and 21 clusters being considered for the three releases, respectively (a minimal sketch of this clustering is shown after Table 1). Since we wanted to evaluate changes from one release to another, out of those clusters we generated query patterns only for those involving tables present in more than one release. Statistics of the final workloads, including the overall query frequency (assuming a uniform distribution in time) and a characterization of the included query patterns (based on the number of returned rows), are reported in Table 1.

Table 1: Workload statistics

                                            N. query patterns by output
  Work.   Queries/month   Freq. (Hz)   1 row (stable)   >1 rows (stable)   >1 rows (scaling)
  Wα             48,337         0.02                2                  3                   -
  Wβ          1,352,498         0.57                6                  3                   -
  Wγ          3,485,018         1.39                6                  2                   7
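The two-level clustering just described can be sketched as follows; the representation of a parsed query as a (tables, columns, predicate) triple is an assumption made for illustration, while the thresholds are the ones stated above.

    from collections import Counter

    def cluster_queries(parsed, table_thr=0.01, pattern_thr=0.005):
        """parsed: one (tables, columns, predicate) triple per query,
        with tables/columns as frozensets and predicate as a string."""
        n = len(parsed)
        # Level 1: cluster by the set of tables; keep clusters above 1%.
        by_tables = Counter(t for t, _, _ in parsed)
        # Level 2: subdivide by projected columns and predicate; keep >= 0.5%.
        by_pattern = Counter(parsed)
        return [p for p, c in by_pattern.items()
                if by_tables[p[0]] / n >= table_thr and c / n >= pattern_thr]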
Since we alizing the many-to-one join between Frame and Field, wanted to evaluate changes from one release to another, which replicates field data for every frame), but not inter- out of those clusters, we generated query patterns only for table ones (i.e., once we materialize the join between those involving tables present in more than one release. PhotoObjAll and Photoz in a single table, we do not Statistics of the final workloads, including overall query create another standalone copy of the later). frequency (assuming uniform distribution in time) and a 2. Generated a separate vertical partition of the correspond- characterization of the included query patterns (based on ing table to store all attributes not used in any query. the number of returned rows), are reported in Table 1. 3. Created secondary indexes for any attribute in the selec- tion predicates for both the original as well as optimized schemas. 6. Experimental evaluation Time-evolved databases (𝐷2𝛼 , 𝐷3𝛼 , and 𝐷3𝛽 ) correspond In our experiments, we first look at the space being used, exactly to databases optimized for an obsolete workload, but then the execution time of the query workload, and finally, extended with the concepts introduced in the new release in the cost of migrating from one schema to another. the form of one extra table per new concept, so all queries can be executed. For instance, 𝐷2𝛼 is the time-evolution of 6.1. Evaluation of storage occupation 𝐷1𝛼 , which preserves the pre-existing data and optimiza- tions for workload 𝛼, but adds the flat new tables introduced Table 2 shows the total storage occupation (in MB) of each by 𝐷2 . These schemas are crucial to put optimizations un- database schema on every scale. Intuitively, the storage in- der the test of time and evaluate whether effectiveness is creases proportionally with the scale - though this is less ev- held upon workload evolution or if it would be more conve- ident in 𝐷2* and 𝐷3* due to some tables (Field and Frame) nient to migrate the data to the schema optimized for the being independent from PhotoObjAll. Interestingly, the Scale 𝐷1 𝐷1𝛼 𝐷2 𝐷2𝛼 𝐷2𝛽 𝐷3 𝐷3𝛼 𝐷3𝛽 𝐷3𝛾 107 1 min. 1 hour 1 day 1 month 100K 683 704 849 875 853 885 892 882 878 200K 1365 1408 1600 1652 1607 1672 1685 1666 1665 106 500K 3412 3520 3852 3983 3869 4035 4065 4019 4028 Mig. time (full) (est.) Cumulative time gained (s) 1M 6824 7039 7607 7868 7640 7973 8033 7941 7965 105 les sca All Table 2 Storage occupation (in MB) of database schemas 104 𝑊𝛼 𝑊𝛽 𝑊𝛾 103 Mig. time (1M) Scale Mig. time (500K) 𝐷1 𝐷1𝛼 𝐷2 𝐷2𝛼 𝐷2𝛽 𝐷3 𝐷3𝛼 𝐷3𝛽 𝐷3𝛾 Mig. time (200K) 100K 73.8 7.6 27.8 63.7 2.0 145.9 571.2 269.7 45.5 102 Mig. time (100K) 200K 71.3 5.4 26.3 60.2 1.6 285.4 1124.4 490.9 60.7 500K 71.3 5.4 25.5 60.2 1.6 714.5 1546.4 1134.6 125.5 101 1 1M 77.9 5.4 27.8 63.7 1.0 1125.2 3249.2 2304.3 240.6 10 102 103 104 105 106 Time elapsed (s) Table 3 Figure 6: Study of migration convenience from 𝐷2𝛼 to 𝐷2𝛽 Average execution time (in ms) of a single query applied optimizations have no significant impact on storage, 107 1M 1 min. 1 hour 1 day due to the absence of inter-table redundancies and to a low K 500 K footprint of intra-table ones. 106 Mig. time (full) (est.) 200 K 100 Cumulative time gained (s) ) 105 (est. 6.2. Evaluation of query execution times Full 1 month As shown in Figure 5, the three workloads are executed over 104 the database schemas available for the corresponding point in time (i.e., 𝑊 𝛼 , 𝑊 𝛽 , and 𝑊 𝛾 are respectively executed 103 Mig. time (1M) Mig. 
6.2. Evaluation of query execution times

As shown in Figure 5, the three workloads are executed over the database schemas available at the corresponding point in time (i.e., Wα, Wβ, and Wγ are respectively executed over the D1*, D2*, and D3* versions). For each combination of workload and database schema, 11K queries have been executed, preserving the frequency of each query pattern and randomly choosing values (among the existing ones) for the selection predicates; the first 1000 queries are then discarded to minimize the impact of the cold start on the cache. Table 3 shows the average execution time (in ms) of a single query, obtained by weighing the average execution time of each query pattern by the respective query frequency (a minimal sketch of this weighting is shown after the observations below).

Table 3: Average execution time (in ms) of a single query

                 Wα               Wβ                          Wγ
  Scale     D1    D1α      D2    D2α    D2β       D3     D3α     D3β     D3γ
  100K     73.8    7.6    27.8   63.7    2.0    145.9   571.2   269.7    45.5
  200K     71.3    5.4    26.3   60.2    1.6    285.4  1124.4   490.9    60.7
  500K     71.3    5.4    25.5   60.2    1.6    714.5  1546.4  1134.6   125.5
  1M       77.9    5.4    27.8   63.7    1.0   1125.2  3249.2  2304.3   240.6

From the results, we can make the following observations.

• The workload changes significantly across releases. This is evident not only in the average execution time of a single query (which grows considerably on Wγ), but also in the variation over different scales: as shown in Table 1, Wγ includes query patterns where the number of returned rows scales with the cardinality of the tables, while these are not present in Wα and Wβ.

• The random choice of selection predicates slightly impacts the average execution times, especially when these are particularly low. For example, it may seem that execution times improve with the database size in D2β; however, the standard deviation in this case ranges from 1.9 to 2.8 across all scales for this database schema, so the variation is clearly not statistically significant.

• Optimizations have a huge impact on performance, with reductions of execution times ranging from 3 to 10 times across all workloads. This provides a solid justification for the need to implement optimized database schemas, also in light of the essentially unvaried storage occupation.

• Interestingly, optimizations carried out at a specific point in time do not outlive the workload and end up backfiring at later stages. As the characteristics of the workloads evolve, execution times sensibly increase due to previous optimizations losing effectiveness and becoming a liability. This nicely demonstrates the need for a continuous re-evaluation of database optimizations.
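For clarity, this is how the single-query averages of Table 3 are obtained from the per-pattern measurements; the two patterns below are invented numbers, not values from the experiments.

    # Workload-weighted average of per-pattern execution times.
    def weighted_avg_ms(patterns):
        """patterns: list of (frequency_share, avg_time_ms) pairs."""
        total = sum(f for f, _ in patterns)
        return sum(f * t for f, t in patterns) / total

    # E.g., a frequent cheap pattern and a rare expensive one.
    print(weighted_avg_ms([(0.95, 2.0), (0.05, 150.0)]))  # 9.4 ms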
6.3. Evaluation of migration convenience

Finally, we study the convenience of database migration; as seen in Figure 5, we focus on the migrations from D2α to D2β and from D3β to D3γ (i.e., migrating the data of R8 from a schema optimized for the old R1 workload to another schema optimized for the true R8 workload, and similarly for R8 and R18).

Migration convenience is evaluated by measuring the gain obtained in query performance (due to database re-optimization) against the effort taken to migrate the data. Both factors are measured in terms of time: the gain is the difference in the average execution time between the two database schemas, and the effort is the time required to execute the migration scripts. The migration then becomes convenient when the (cumulated) gain overcomes the effort. Table 4 summarizes the results at all sample sizes and indicates the number of queries needed to accumulate enough gain to overcome the migration effort; the same is also translated into a measure of time, based on the query frequency in the real workload, as in Table 1 (the computation is sketched after the takeaways below). The results are also reported in Figures 6 and 7, which emphasize trends over logarithmic time scales: for each sample size (identified by a different color), the cumulative time gained is shown as time elapses, the migration time is shown as a flat horizontal threshold, and a star marks the turning point. Linear regression is used to estimate gain and effort at the full database size (shown as black lines).

Table 4: Migration statistics

  Migration     Scale   Time gain per   Mig. time   N. queries to    Time to pass
                        query (ms)      (ms)        pass mig. time   mig. time (min)
  D2α to D2β    100K          62           55,100          893            25.9
                200K          59          127,100        2,172            63.0
                500K          61          319,200        5,236           151.9
                1M            63          675,000       10,761           312.2
  D3β to D3γ    100K         172           93,000          542             6.5
                200K         378          233,300          618             7.4
                500K         910          521,900          574             6.8
                1M         1,867          922,000          494             5.9

[Figure 6: Study of migration convenience from D2α to D2β.]
[Figure 7: Study of migration convenience from D3β to D3γ.]

From these results, we derive the following takeaways.

• First, we observe that the migration time is proportional to the database size. As discussed in Section 6.1, this is not surprising given the low-to-no impact of replication. What is remarkable is the estimated migration time at the full scale, which reaches the order of multiple days. Though this estimate could easily be reduced by parallelizing the migration of the different tables, it shows the importance of considering workload prediction in the refactoring recommendation: the longer the time to migrate the data, the longer the required stability of the workload (or accuracy of the prediction) to ensure that the migration pays off.

• Since execution times for Wβ are unaffected by the database size (as discussed in Section 6.2), the gain is almost identical across all scales. As a result, the bigger the database, the more time it takes to accumulate enough gain to compensate for the migration time. Differently, in Wγ, query execution times grow with the database size, and thus the gain scales accordingly. As a result, the migration becomes convenient after only 6-7 minutes, independently of the database size.

• Interestingly, the two studies reveal radically different scenarios, where the recommendations on whether to carry out database refactoring diverge. In the samples, database migration is always fast and particularly convenient. In our projected estimates over the full database, the migration to D2β would be discouraged under the assumption that the workload significantly differs in the following month; differently, the migration to D3γ is shown to be convenient, even though the implications of the considerable migration time should be carefully considered before enacting the refactoring.
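The break-even figures in Table 4 follow directly from the definitions above, as the short sketch below shows; small rounding differences (e.g., 889 vs. the reported 893 queries) are expected, since the table is computed from unrounded measurements.

    import math

    def break_even(gain_ms_per_query, migration_ms, freq_hz):
        """Queries (and wall-clock minutes) before a migration pays off."""
        n = math.ceil(migration_ms / gain_ms_per_query)
        return n, n / freq_hz / 60.0

    # 100K-scale migration from D2a to D2b, with W_beta running at 0.57 Hz.
    print(break_even(62, 55_100, 0.57))   # ~889 queries, ~26 minutes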
7. Framework overview

As we have just demonstrated, the evolution of schemas and workloads can dramatically change the optimality of the schema design chosen at design time, and refactoring the database can restore such optimality. The newly-optimal schema should be found as the one maximizing the trade-off between the benefits of a refactoring and the effort to design and execute it. This task opens multiple research challenges, including the exploration of the search space of target schemas (potentially scaling to thousands of concepts), the quantification (and comparison) of the benefits of refactoring the database and of the efforts to design and execute the refactoring, and the prediction of changes in the workload (potentially including millions of queries). As shown in Section 2, related works mainly explore the identification of the first target solution [40, 41, 42, 7, 8, 43, 10, 11, 12, 13, 14] or devise frameworks to manage multiple schema versions and propagate manually-defined schema transformations to the database [20, 24, 31, 21, 22, 23], but they do not address these research questions. In this section, we pave the way in this direction, defining a framework and illustrating some novel research challenges that it should address to achieve this goal.

The outcomes of the experiments evidently suggest that (i) database optimization is neither a one-time nor an incremental activity, as some optimizations can become counterproductive in future stages; (ii) database migration can be particularly expensive and may not be worth the trouble; and (iii) as a result, the continuous evaluation of refactoring options is fundamental to guarantee maximum efficiency under evolving workloads. Hence, we propose a multi-objective optimization approach to continuously evaluate and recommend the refactoring of NoSQL databases.

An overview of the proposed framework is shown in Figure 8. The main component is the Target Schema Explorer, which is in charge of enumerating and evaluating the possible target schemas. The enumeration requires the current schema s_cur, the current conceptual model CM_cur, and either the current workload w_cur or a prediction of a future workload w_pred, calculated by the Workload Predictor. Given a possible target schema s_i, its evaluation is aimed at quantifying the pros and cons of carrying out the refactoring from s_cur. The pros are calculated by the Gain Estimator, which measures the variations of multiple quality criteria (in terms of performance, storage occupation, etc.). The cons are calculated by the Effort Estimator, which measures the work required to carry out the refactoring (in terms of designing and executing the migration process, rewriting all workload queries, etc.). The latter estimation requires as much information as possible about the migration process, which is produced by the Migration Process Generator component. Finally, the Recommender obtains from the Target Schema Explorer the list of evaluated target schemas and produces a recommendation; given the amount and diversity of criteria to measure gains and efforts, the Recommender determines the set of relevant target schemas on the Pareto front [44].

[Figure 8: Overview of the proposed framework for continuous refactoring evaluation. The Target Schema Explorer takes CM_cur, s_cur, and w_cur (or w_pred from the Workload Predictor) as input and calls the Gain Estimator, the Effort Estimator, and the Migration Process Generator; the Recommender selects the relevant target schemas on the gain/effort Pareto front.]

The benefits of advancing the state of the art in this direction are twofold. On the one hand, the Recommender can provide critical insights to make refactoring decisions with significant improvements over the current situation, based on objective criteria and a comprehensive coverage of the possible alternatives. On the other hand, the automation of this task enables its continuous adoption throughout the lifetime of the database and of the applications running on top of it; indeed, a continuous evaluation of database refactoring minimizes the risk of undergoing major efforts at a later time to recover from a degraded state.
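Since the Recommender reduces candidate schemas to their non-dominated subset, a minimal Pareto-front computation is sketched below; the candidate tuples are hypothetical, and the gain and effort units are arbitrary.

    def pareto_front(candidates):
        """candidates: (schema, gain, effort) tuples; keep the non-dominated
        ones, i.e., those for which no other candidate has a gain at least
        as high and an effort at least as low, with one strict inequality."""
        return [(s, g, e) for s, g, e in candidates
                if not any(g2 >= g and e2 <= e and (g2 > g or e2 < e)
                           for _, g2, e2 in candidates)]

    cands = [("s_a", 50, 10), ("s_b", 80, 30), ("s_c", 40, 25), ("s_d", 80, 12)]
    print(pareto_front(cands))  # s_a and s_d survive; s_b and s_c are dominated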
In both cases, complete automation is hard to achieve, as the precise measurement of gains and efforts is particularly challenging, and the selection of the "best" refactoring activity from the Pareto front requires business knowledge and strategic vision (i.e., skills that cannot be easily quantified and encoded). Thus, our proposal goes in the direction of human-in-the-loop automation: while we turn to the DevOps philosophy in the continuous application of an automated procedure to maintain a high quality level of the database, DBAs/engineers should be able to step in at critical points to contribute their knowledge and exploit the system to make decisions and decide the path forward.

In the following, we delve into the details of each of the framework's components, discussing current implementations and presenting the research challenges that are yet to be addressed to achieve automation.

Target schemas exploration. The aggregate-data modeling style of NoSQL databases implies a huge search space of alternative schemas that could be devised for a given domain [45]. [46] shows that there are 12 different ways to logically model a conceptual relationship between two entities in a document-based database. This search space is further amplified by the practice of replicating data in multiple collections to optimize the performance of the most frequent queries (we avoided this possibility in our experiments to keep them simpler); thus, an exhaustive generation and evaluation of all possibilities is prohibitive. The challenge is worsened by the absence of a single optimization metric to drive the exploration towards convergence. In the related work, the most common approach to schema exploration (as well as the one followed in this study) simply consists in converging to a target schema through some heuristics (e.g., [12]). However, we see huge potential in multi-objective evolutionary algorithms (MOEAs), which are particularly suitable for the task of finding Pareto-optimal solutions [47] but have not yet been adopted in this context.
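A back-of-the-envelope count illustrates why exhaustive exploration is prohibitive; assuming the modeling choices for different relationships are independent (a simplification), the space grows exponentially with the number of relationships.

    # Size of the design space when each of R conceptual relationships can be
    # modeled in k alternative ways (k = 12 for a binary relationship in a
    # document store, per the count cited above), ignoring replication.
    def design_space(num_relationships, options_per_relationship=12):
        return options_per_relationship ** num_relationships

    print(design_space(3))    # 1,728 alternatives for just 3 relationships
    print(design_space(10))   # ~6.2e10: exhaustive evaluation is hopeless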
Gain estimation. The Gain Estimator relies on a set of Key Performance Indicators (KPIs) to quantify the (dis)advantages of migrating from the current schema to a different one under many perspectives, namely Performance (query execution time is crucial in NoSQL), Storage (redundancy is typically encouraged, but updates should not be forgotten), and Complexity (schemalessness allows quick development, but also hides mistakes in the code). In this study, we focused on the performance evaluation, but measured it empirically. Thus far, the metrics proposed for estimation are either oversimplistic (e.g., [12] considers the number of accessed documents) or too narrow (e.g., [48] predicts execution times using an advanced database-specific model, but limited to point queries on the primary key).

Migration effort estimation. Similarly to the gain, the migration effort can be measured from multiple perspectives, namely Process design (whose cost would depend on the complexity and extension of the model), Execution (which should consider the impact on the currently-running workload), and Application update (especially relevant due to the schemaless philosophy of NoSQL). For this, we can build on top of recent work on software refactoring [49] and ETL evolution [50].

Migration process generation. The proper estimation of migration efforts also depends on how well the migration process can be predicted and how much it can be automated. The information returned by this component can be at several levels, namely conceptual (e.g., identifying which entities are involved in the migration), logical (i.e., defining the sequence of operations that should be carried out to migrate the data), and physical (i.e., producing the scripts or application code to be executed). In any case, an optimizer should be used to make the process as efficient as possible. In our study, the process was generated and optimized manually, but automation is clearly necessary. Some proposals in this direction have been made, but they only support a limited range of schema modification operations [51] and are tied to table-to-table (or collection-to-collection) mappings [52], whereas the migration of the database needs to be considered as a whole.

Workload prediction. The capability of the recommender to operate on a predicted future workload is a bonus feature, in the sense that the recommendation could also be given just by considering the current workload. Nonetheless, given the (possibly considerable) effort to carry out a migration and the (possibly continuous) evolution of the workload, the optimality of the new target schema may be lost by the time the migration is completed, as the evidence of this study has shown. For this reason, predicting (with sufficient accuracy) what the workload will be at time i + Δ allows the recommender to consider an additional variable and to possibly converge towards the optimal solutions that require that Δ time to carry out the migration. The prediction of the evolution of workload queries is a field that has recently attracted research interest [37]. However, existing works are limited to relational databases and mostly focused on supporting live tuning of the DBMS's configuration and/or resources [53].

8. Conclusions

In this paper, we have presented an impact study of NoSQL database refactoring over a real-world use case, motivating the research problem, supporting it with empirical evidence, and presenting a proposal for a refactoring recommender framework. Our research work will continue in two directions. On the one hand, we plan to further investigate the SkyServer use case to consider additional strategies for schema optimization and to incorporate workload prediction into the migration convenience evaluation; by collecting additional information about the workload in the upcoming months, we will put the proposed optimizations under a more comprehensive test of time. On the other hand, we will work towards the implementation and automation of the proposed framework. Each module in the framework encompasses its own challenges, which can be addressed separately. Our main efforts will first be directed towards enabling a broad exploration of target schemas and defining a comprehensive method for estimating migration efforts, considering the many variables that influence this process, including application code evolution and human effort estimation. We plan to work in close collaboration with companies dealing with evolving workloads in NoSQL databases, whose support is already shown in previous work on heterogeneous and evolving datasets [54, 55].

Acknowledgments

This work has been partially supported by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033 (DOGO4ML).
References

[1] P. J. Sadalage, M. Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pearson Education, 2013.
[2] J. L. Harrington, Relational Database Design and Implementation, Morgan Kaufmann, 2016.
[3] E. Meijer, A co-relational model of data for large shared data banks, in: ECOOP 2011 - Object-Oriented Programming, volume 6813 of LNCS, Springer, 2011, p. 1. doi:10.1007/978-3-642-22655-7_1.
[4] V. Gadepally, P. Chen, J. Duggan, A. J. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, M. Stonebraker, The BigDAWG polystore system and architecture, in: IEEE HPEC 2016, IEEE, 2016, pp. 1-6. doi:10.1109/HPEC.2016.7761636.
[5] R. Alotaibi, D. Bursztyn, A. Deutsch, I. Manolescu, S. Zampetakis, Towards scalable hybrid stores: Constraint-based rewriting to the rescue, in: SIGMOD 2019, ACM, 2019, pp. 1660-1677. doi:10.1145/3299869.3319895.
[6] S. S. Neha Bansal, L. K. Awasthi, Are NoSQL databases affected by schema?, IETE Journal of Research 70 (2024) 4770-4791. doi:10.1080/03772063.2023.2237478.
[7] V. Reniers, D. V. Landuyt, A. Rafique, W. Joosen, A workload-driven document database schema recommender (DBSR), in: ER 2020, volume 12400 of LNCS, Springer, 2020, pp. 471-484. doi:10.1007/978-3-030-62522-1_35.
[8] M. Hewasinghage, S. Nadal, A. Abelló, E. Zimányi, Automated database design for document stores with multicriteria optimization, Knowl. Inf. Syst. 65 (2023) 3045-3078. doi:10.1007/s10115-023-01828-3.
[9] M. Mozaffari, E. Nazemi, A. Eftekhari-Moghadam, CONST: continuous online NoSQL schema tuning, Softw. Pract. Exp. 51 (2021) 1147-1169. doi:10.1002/spe.2945.
[10] A. de la Vega, D. García-Saiz, C. Blanco, M. E. Zorrilla, P. Sánchez, Mortadelo: Automatic generation of NoSQL stores from platform-independent data models, Future Gener. Comput. Syst. 105 (2020) 455-474. doi:10.1016/j.future.2019.11.032.
[11] J. Mali, F. Atigui, A. Azough, N. Travers, ModelDrivenGuide: An approach for implementing NoSQL schemas, in: DEXA 2020, Part I, volume 12391 of LNCS, Springer, 2020, pp. 141-151. doi:10.1007/978-3-030-59003-1_9.
[12] L. Chen, A. Davoudian, M. Liu, A workload-driven method for designing aggregate-oriented NoSQL databases, Data Knowl. Eng. 142 (2022) 102089. doi:10.1016/j.datak.2022.102089.
[13] E. M. Kuszera, L. M. Peres, M. D. D. Fabro, Exploring data structure alternatives in the RDB to NoSQL document store conversion process, Inf. Syst. 105 (2022) 101941. doi:10.1016/j.is.2021.101941.
[14] N. Roy-Hubara, A. Sturm, P. Shoval, Designing NoSQL databases based on multiple requirement views, Data Knowl. Eng. 145 (2023) 102149. doi:10.1016/j.datak.2023.102149.
[15] W. Y. Mok, A conceptual model based design methodology for MongoDB databases, in: ICICT 2024, IEEE, 2024, pp. 151-159. doi:10.1109/ICICT62343.2024.00030.
[16] I. Skoulis, P. Vassiliadis, A. V. Zarras, Growing up with stability: How open-source relational databases evolve, Inf. Syst. 53 (2015) 363-385. doi:10.1016/j.is.2015.03.009.
[17] S. Scherzinger, S. Sidortschuck, An empirical study on the design and evolution of NoSQL database schemas, in: ER 2020, volume 12400 of LNCS, Springer, 2020, pp. 441-455. doi:10.1007/978-3-030-62522-1_33.
[18] S. Fedushko, R. Malyi, Y. Syerov, P. Serdyuk, NoSQL document data migration strategy in the context of schema evolution, Data Knowl. Eng. 154 (2024) 102369. doi:10.1016/j.datak.2024.102369.
[19] P. Vassiliadis, Profiles of schema evolution in free open source software projects, in: ICDE 2021, IEEE, 2021, pp. 1-12. doi:10.1109/ICDE51399.2021.00008.
[20] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, W. Lehner, Living in parallel realities: Co-existing schema versions with a bidirectional database evolution language, in: SIGMOD 2017, ACM, 2017, pp. 1101-1116. doi:10.1145/3035918.3064046.
[21] P. Koupil, J. Bártík, I. Holubová, MM-evocat: A tool for modelling and evolution management of multi-model data, in: CIKM 2022, ACM, 2022, pp. 4892-4896. doi:10.1145/3511808.3557180.
[22] L. Caruccio, G. Polese, G. Tortora, Synchronization of queries and views upon schema evolutions: A survey, ACM Trans. Database Syst. 41 (2016) 9:1-9:41. doi:10.1145/2903726.
[23] A. Hillenbrand, M. Levchenko, U. Störl, S. Scherzinger, M. Klettke, MigCast: Putting a price tag on data model evolution in NoSQL data stores, in: SIGMOD 2019, ACM, 2019, pp. 1925-1928. doi:10.1145/3299869.3320223.
[24] I. Holubová, M. Vavrek, S. Scherzinger, Evolution management in multi-model databases, Data Knowl. Eng. 136 (2021) 101932. doi:10.1016/j.datak.2021.101932.
[25] S. Chang, V. Deufemia, G. Polese, M. Vacca, A logic framework to support database refactoring, in: DEXA 2007, volume 4653 of LNCS, Springer, 2007, pp. 509-518. doi:10.1007/978-3-540-74469-6_50.
[26] P. Khumnin, T. Senivongse, SQL antipatterns detection and database refactoring process, in: SNPD 2017, IEEE Computer Society, 2017, pp. 199-205. doi:10.1109/SNPD.2017.8022723.
[27] D. Sjøberg, Quantifying schema evolution, Inf. Softw. Technol. 35 (1993) 35-44. doi:10.1016/0950-5849(93)90027-Z.
[28] P. Vassiliadis, F. Shehaj, G. Kalampokis, A. V. Zarras, Joint source and schema evolution: Insights from a study of 195 FOSS projects, in: EDBT 2023, OpenProceedings.org, 2023, pp. 27-39. doi:10.48786/edbt.2023.03.
[29] C. Curino, H. J. Moon, L. Tanca, C. Zaniolo, Schema evolution in Wikipedia - toward a web information system benchmark, in: ICEIS 2008, Volume DISI, 2008, pp. 323-332.
[30] M. Klettke, U. Störl, M. Shenavai, S. Scherzinger, NoSQL schema evolution and big data migration at scale, in: IEEE BigData 2016, IEEE Computer Society, 2016, pp. 2764-2774. doi:10.1109/BigData.2016.7840924.
[31] A. H. Chillón, D. S. Ruiz, J. G. Molina, Towards a taxonomy of schema changes for NoSQL databases: The Orion language, in: ER 2021, volume 13011 of LNCS, Springer, 2021, pp. 176-185. doi:10.1007/978-3-030-89022-3_15.
[32] U. Störl, M. Klettke, S. Scherzinger, NoSQL schema evolution and data migration: State-of-the-art and opportunities, in: EDBT 2020, OpenProceedings.org, 2020, pp. 655-658. doi:10.5441/002/edbt.2020.87.
[33] B. A. Muse, F. Khomh, G. Antoniol, Refactoring practices in the context of data-intensive systems, Empir. Softw. Eng. 28 (2023) 46. doi:10.1007/s10664-022-10271-x.
[34] B. Karwin, SQL Antipatterns: Avoiding the Pitfalls of Database Programming, The Pragmatic Bookshelf, 2010.
[35] M. Athanassoulis, M. S. Kester, L. M. Maas, R. Stoica, S. Idreos, A. Ailamaki, M. Callaghan, Designing access methods: The RUM conjecture, in: EDBT 2016, OpenProceedings.org, 2016, pp. 461-466. doi:10.5441/002/edbt.2016.42.
[36] S. Salza, M. Terranova, Workload modeling for relational database systems, in: Database Machines, Fourth International Workshop, Springer, 1985, pp. 233-255.
[37] X. Huang, S. Cao, Y. Gao, X. Gao, G. Chen, LightPro: Lightweight probabilistic workload prediction framework for database-as-a-service, in: ICWS 2022, IEEE, 2022, pp. 160-169. doi:10.1109/ICWS55610.2022.00036.
[38] L. Ma, D. V. Aken, A. Hefny, G. Mezerhane, A. Pavlo, G. J. Gordon, Query-based workload forecasting for self-driving database management systems, in: SIGMOD 2018, ACM, 2018, pp. 631-645. doi:10.1145/3183713.3196908.
[39] S. S. Elnaffar, P. Martin, An intelligent framework for predicting shifts in the workloads of autonomic database management systems, in: Proc. of the 2004 IEEE International Conference on Advances in Intelligent Systems - Theory and Applications, 2004, pp. 1-8.
[40] A. Chebotko, A. Kashlev, S. Lu, A big data modeling methodology for Apache Cassandra, in: 2015 IEEE International Congress on Big Data, IEEE Computer Society, 2015, pp. 238-245. doi:10.1109/BigDataCongress.2015.41.
[41] M. J. Mior, K. Salem, A. Aboulnaga, R. Liu, NoSE: Schema design for NoSQL applications, IEEE Trans. Knowl. Data Eng. 29 (2017) 2275-2289. doi:10.1109/TKDE.2017.2722412.
[42] C. de Lima, R. dos Santos Mello, On proposing and evaluating a NoSQL document database logical approach, Int. J. Web Inf. Syst. 12 (2016) 398-417. doi:10.1108/IJWIS-04-2016-0018.
[43] F. Abdelhédi, A. A. Brahim, F. Atigui, G. Zurfluh, UMLtoNoSQL: Automatic transformation of conceptual schema to NoSQL databases, in: AICCSA 2017, IEEE Computer Society, 2017, pp. 272-279. doi:10.1109/AICCSA.2017.76.
[44] A. Ben-Tal, Characterization of Pareto and lexicographic optimal solutions, in: Multiple Criteria Decision Making Theory and Application, Springer, 1980, pp. 1-11.
[45] M. Hewasinghage, N. B. Seghouani, F. Bugiotti, Modeling strategies for storing data in distributed heterogeneous NoSQL databases, in: ER 2018, volume 11157 of LNCS, Springer, 2018, pp. 488-496.
[47] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2002) 182-197. doi:10.1109/4235.996017.
[48] M. Hewasinghage, A. Abelló, J. Varga, E. Zimányi, Managing polyglot systems metadata with hypergraphs, Data Knowl. Eng. 134 (2021) 101896. doi:10.1016/j.datak.2021.101896.
[49] R. Rasool, A. A. Malik, Effort estimation of ETL projects using forward stepwise regression, in: ICET 2015, 2015, pp. 1-6. doi:10.1109/ICET.2015.7389209.
[50] G. Papastefanatos, P. Vassiliadis, A. Simitsis, Y. Vassiliou, Metrics for the prediction of evolution impact in ETL ecosystems: A case study, J. Data Semant. 1 (2012) 75-97. doi:10.1007/s13740-012-0006-9.
[51] U. Störl, M. Klettke, Darwin: A data platform for schema evolution management and data migration, in: Proceedings of the Workshops of the EDBT/ICDT 2022 Joint Conference, volume 3135 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3135/dataplat_short3.pdf.
[52] C. Forresi, E. Gallinucci, M. Golfarelli, H. B. Hamadou, A dataspace-based framework for OLAP analyses in a high-variety multistore, VLDB J. 30 (2021) 1017-1040. doi:10.1007/s00778-021-00682-5.
[53] V. V. Meduri, K. Chowdhury, M. Sarwat, Evaluation of machine learning algorithms in predicting the next SQL query from the future, ACM Trans. Database Syst. 46 (2021) 4:1-4:46. doi:10.1145/3442338.
[54] E. Gallinucci, M. Golfarelli, S. Rizzi, Schema profiling of document-oriented databases, Inf. Syst. 75 (2018) 13-25. doi:10.1016/j.is.2018.02.007.
[55] C. Forresi, M. Francia, E. Gallinucci, M. Golfarelli, Streaming approach to schema profiling, in: ADBIS 2023 Short Papers, Doctoral Consortium and Workshops, volume 1850 of CCIS, Springer, 2023, pp. 211-220. doi:10.1007/978-3-031-42941-5_19.
URL: https:// 1007/978-3-031-42941-5\_19. doi.org/10.1007/978-3-030-00847-5_35. doi:10.1007/ 978-3-030-00847-5\_35. [46] M. Hewasinghage, S. Nadal, A. Abelló, Docdesign 2.0: Automated database design for document stores with multi-criteria optimization, in: Y. Velegrakis, D. Zeinalipour-Yazti, P. K. Chrysanthis, F. Guerra (Eds.), Proceedings of the 24th International Con- ference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021, Open- Proceedings.org, 2021, pp. 674–677. URL: https:// doi.org/10.5441/002/edbt.2021.81. doi:10.5441/002/ edbt.2021.81. [47] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A