Heterogeneity in NoSQL Databases — Challenges of Handling Schema-less Data

Mark Lukas Möller¹, Dominique Hausler¹, Sebastian Strasser¹, Tanja Auge¹ and Meike Klettke¹

¹ Faculty of Computer Science and Data Science, University of Regensburg, Germany

Abstract
The schema flexibility of database management systems is often seen as an advantage because it makes it easy to store all kinds of different data. Schema-less database systems (such as JSON or graph databases) allow structurally different data to be stored in the same database. They also allow storing different variants of data or data evolving over time. However, their use is much more complicated compared to relational databases. In this paper, we show the impact of heterogeneity on two data processing steps: query execution and evolution operations, including their composition and data transformation. From this, we derive three proposals: (i) storing data with partial schema management, (ii) vertical partitioning, and (iii) the usage of multi-model databases. In all cases, regular and irregular parts are distinguished in order to mitigate the effects of heterogeneity.

Keywords
NoSQL databases, graph databases, heterogeneity, schema-less systems, query execution, schema evolution, schema management, vertical partitioning, multi-model databases, position paper

LWDA'23: Lernen, Wissen, Daten, Analysen, October 09–11, 2023, Marburg, Germany

1. Introduction

For a long time, the relational data model [1] with its regular structure, predefined fixed schema, and regular tuples was the standard model in databases. Subsequent data models (XML, JSON, and graph data) have been developed with the aim of storing heterogeneous datasets in the same database. For this, the database management system allows either flexible schemas, schema definitions which are only optional, or even completely schema-less data. In the latter case, no schema constraints are checked by the database management system at all.

The benefits of these systems (XML, JSON, and graph databases) were highly valued because they can store all kinds of data. The disadvantages are equally obvious: when using schema-less or schema-flexible databases, all structural variants of the data must be considered in all database components and also in the downstream applications that access the database. This leads to additional challenges in many database tasks. Significant effort was put into solving the challenges arising from the flexible schemas of NoSQL databases, e.g., schema design [2] and modeling [3], data integration [4], and data querying [5].
In our own work on XML databases and JSON data collections [6, 7, 8, 9] and in our ongoing work on graph databases, we have seen that many procedures already developed for the relational world are far more complicated to implement for schema-flexible database management systems. When thinking about optimizing schema modifications, our initial presumption was that Schema Modification Operators (SMOs) similar to those suggested in [10] can be composed. For example, if a new attribute A is inserted into a table t and later renamed to B, this has the same effect as if the attribute had been inserted with the name B in the first place. In the relational world, for a table t which contains neither an attribute A nor an attribute B, we can guarantee that the two operations on the left-hand side generate the same table as the operation on the right-hand side:

(alter table t add column A) + (alter table t rename column A to B) → alter table t add column B

Based on this experience from relational databases, we developed a common SMO composition rule for JSON databases, where E is an entity type and A and B are properties in E:

add E.A + rename E.A to B → add E.B

When these SMOs (add and rename) are applied to a JSON document, they are translated into data migration operations. Because of schema-lessness, i.e., the lack of a schema, we have to consider the different variants of the datasets. Figure 1 illustrates this for an entity type Students and the properties institution and university. Here, we see that the step-wise execution of the two SMOs add and rename produces a different output than the composed execution. This means that the composition rule does not hold for heterogeneous datasets. The example shows that in NoSQL databases we always have to consider the possibility of heterogeneity.

[Figure 1 contrasts the step-wise execution (add institution="UR", then rename institution to university) with the composed execution (add university="UR") on two Student documents, one without an institution property and one already carrying institution="FhG"; the two executions produce different results for the latter document.]
Figure 1: Composition of two SMOs on a heterogeneous dataset
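To make the effect concrete, the following minimal Python sketch (ours, not taken from the cited works) replays Figure 1 on plain dictionaries standing in for JSON documents. The ignore semantics defined in Section 3 is assumed, i.e., add and rename never overwrite an existing property; the sample documents are hypothetical.

import copy

# Two Student documents as in Figure 1; the second one already carries
# the heterogeneous property "institution".
anne  = {"id": 1, "family name": "Schmidt", "profession": "student"}
maria = {"id": 2, "family name": "Schulz", "profession": "student",
         "institution": "FhG"}

def add(entities, prop, value):
    # ignore semantics: never overwrite an existing property
    for e in entities:
        e.setdefault(prop, value)

def rename(entities, old, new):
    # ignore semantics: rename only if the target name is still free
    for e in entities:
        if old in e and new not in e:
            e[new] = e.pop(old)

step_wise = copy.deepcopy([anne, maria])
add(step_wise, "institution", "UR")             # second document keeps "FhG"
rename(step_wise, "institution", "university")  # -> university = "FhG"

composed = copy.deepcopy([anne, maria])
add(composed, "university", "UR")               # the composed SMO add E.university

print(step_wise[1]["university"])  # FhG
print(composed[1]["university"])   # UR ("institution": "FhG" is kept as well)

The step-wise and the composed plan diverge exactly on the document that already carried the institution property, which is the heterogeneity effect shown in Figure 1.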
In this paper, we present the impact of heterogeneity on query execution and on evolution operations, including their composition, based on these experiences. We will use four different heterogeneity classes (HCs), ranging from very regular (HC1) to very irregular (HC4) (see Section 2). In Section 3, we show how database subtasks have to be adapted to these heterogeneity classes. We start by querying heterogeneous data and show why it can deliver unwanted results. Next, we define the semantics of database evolution operations for the different HCs. Then, we describe why the composition of evolution operations becomes more complicated the higher the heterogeneity level is. Finally, we briefly explain data transformation and query rewriting for the different heterogeneity classes. Based on these observations, we propose different data storage solutions for hybrid data consisting of homogeneous and heterogeneous parts (Section 4) and close with some ideas for future work.

2. Heterogeneity Classes

Due to the schema flexibility of NoSQL data, we can distinguish different levels of heterogeneity. Data can be present in a completely homogeneous structure where each entity consists of the same properties. However, it is also possible for properties to occur only in some entities. In this case, the properties are considered optional and the entities of the entity type are called heterogeneous. We introduced four heterogeneity classes (HCs) which influence the complexity of SMOs [7]:

HC1: All entities of a database are homogeneous, which means they have the same implicit structure. Note: In the case of multi-type operations, i.e., operations using data from two entity types, only 1:1 or 1:n relationships occur, without dangling tuples.

HC2: Multi-type operations with 1:1 and 1:n relationships can create dangling tuples, i.e., join operations with datasets without a join partner.

HC3: Multi-type operations with n:1 and n:m relationships can create dangling tuples.

HC4: Databases can contain entities with different structures, i.e., two entities from the same entity type can have different properties. Note: This heterogeneity class is the most flexible one. Every possible variant of the data must be considered in all operations.

Figure 2: Classification of the different Heterogeneity Classes

If we do not have further information, we always have to assume HC4 because there are neither schema constraints nor semantic constraints which can be guaranteed. In contrast, if HC1 can be assumed, querying and evolving databases is much easier. In Section 3, we will demonstrate these two cases.

3. Handling of Heterogeneous Data

When databases are stored without a fixed schema, they can vary structurally from entity to entity. Also, checking for semantic constraints is not part of the data insertion process. Thus, in database management systems containing foreign key references, so-called dangling tuples can occur. In this section, we will briefly outline a few cases that illustrate the impact of structural heterogeneity and dangling tuples when using or querying data.

Using and Querying Homogeneous and Heterogeneous Data

To demonstrate the effects of heterogeneity, let us look at a concrete example implemented in MongoDB¹. As seen in Figure 3, the university database contains entities with different structures. For instance, the attribute cId in the Student documents and the attribute courseId in the Course document with "title":"Python" describe the identical attribute but have different property² names. Another case of heterogeneity is the optionality of the attribute profession. Furthermore, the Student document with "id":3 lacks a courses attribute. All in all, the database can be classified as HC4 following the taxonomy defined in Section 2.

¹ MongoDB: https://www.mongodb.com/
² Attributes and properties are used interchangeably as they describe the same element in different database management systems.

Student collection:
{ "pid": 1, "name": "Schmidt", "profession": "student", "university": "UR",
  "courses": [ { "cId": "MEI-M01", "semester": "SoSe_22" } ] }
{ "id": 2, "name": "Schulz", "institution": "UR",
  "courses": [ { "cId": "MEI-M01" }, { "cId": "MEI-M05" } ] }
{ "id": 3, "name": "Huber", "profession": "student", "institution": "UR" }

Course collection:
{ "cId": "MEI-M01", "title": "Java", "ECTS": 6 }
{ "courseId": "MEI-M05", "title": "Python" }

Figure 3: Example for heterogeneous datasets (Student and Course documents)
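The examples below can be reproduced with a few lines of Python. The following sketch (ours; database and collection names are our assumptions) loads the Figure 3 documents into a local MongoDB instance using pymongo:

from pymongo import MongoClient

# Load the Figure 3 example documents into a local MongoDB instance
# (assumes MongoDB is running on localhost:27017).
db = MongoClient("mongodb://localhost:27017")["university"]
db.Student.insert_many([
    {"pid": 1, "name": "Schmidt", "profession": "student", "university": "UR",
     "courses": [{"cId": "MEI-M01", "semester": "SoSe_22"}]},
    {"id": 2, "name": "Schulz", "institution": "UR",
     "courses": [{"cId": "MEI-M01"}, {"cId": "MEI-M05"}]},
    {"id": 3, "name": "Huber", "profession": "student", "institution": "UR"},
])
db.Course.insert_many([
    {"cId": "MEI-M01", "title": "Java", "ECTS": 6},
    {"courseId": "MEI-M05", "title": "Python"},
])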
db.getCollection("Student").find( { institution: "UR" } )

Query 1: Querying the Student documents with institution="UR" (in MongoDB)

db.Student.aggregate([
  { $lookup: {
      from: "Course",
      localField: "courses.cId",
      foreignField: "cId",
      as: "courses" } }
])

Query 2: Joining the collections Student and Course by property cId (in MongoDB)

Example 1: Query 1 filters all Students by the condition institution="UR". The result, shown in Query result 1, contains the Student documents with "id":2 and "id":3. Due to the variation in the attribute names (institution or university), the Student document with "pid":1 is not in the result. Users unaware of the heterogeneity in their data could expect an output of all three entities.

{ id: 2, name: "Schulz", institution: "UR",
  courses: [ { cId: "MEI-M01" }, { cId: "MEI-M05" } ] }
{ id: 3, name: "Huber", profession: "student", institution: "UR" }

Query result 1: Results of Query 1

Example 2: Query 2 shows a join between the Student and Course documents. It yields one course each for the Student documents with "pid":1 and "id":2. This is the result of performing the lookup function, which corresponds to a left outer join with the condition Student.courses.cId = Course.cId. Additionally, the Student document with "id":2 references a second course with "cId":"MEI-M05". This entry has no join partner because the corresponding Course document uses the divergent attribute name courseId; it therefore represents a dangling tuple.

The results of the join between the Student and Course documents are summarized in Query result 2. In this result, the Student document with "id":3 has a courses attribute containing "courseId":"MEI-M05". This unexpected result is caused by the structural heterogeneity of the Course document. The lookup function sets the value of the localField (in our example, for the Student document with "id":3, which has no courses attribute) and of the foreignField (for the Course document without a cId attribute) to NULL. In the query execution, both values are NULL [11] and are therefore joined. To avoid this unwanted effect and address the problem of missing values being set to NULL, one can introduce a match stage before executing the lookup function. It filters for Student documents containing a non-empty courses attribute and is implemented in MongoDB as follows: { $match: { "courses": { "$exists": true, "$ne": [] } } }.

{ pid: 1, name: "Schmidt", profession: "student", university: "UR",
  courses: [ { cId: "MEI-M01", title: "Java", ECTS: 6 } ] }
{ id: 2, name: "Schulz", institution: "UR",
  courses: [ { cId: "MEI-M01", title: "Java", ECTS: 6 } ] }
{ id: 3, name: "Huber", profession: "student", institution: "UR",
  courses: [ { title: "Python", courseId: "MEI-M05" } ] }

Query result 2: Results of Query 2

Both example queries show cases where schema and integrity constraints cannot be guaranteed and HC4 must be assumed. This leads to individual documents not being found by the queries; concretely, the Student document with "pid":1 in Query 1 and the Course document with "courseId":"MEI-M05" in Query 2. This is the case both for divergent property names and for dangling tuples.
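As a sketch (ours, reusing the database handle db from the loading snippet above), the guarded variant of Query 2 reads as follows in pymongo; the match stage removes the Student documents whose missing courses attribute would otherwise be NULL-joined:

# Guarded join: exclude Student documents without a non-empty courses
# attribute before the lookup, so that missing values are never NULL-joined.
pipeline = [
    {"$match": {"courses": {"$exists": True, "$ne": []}}},
    {"$lookup": {"from": "Course", "localField": "courses.cId",
                 "foreignField": "cId", "as": "courses"}},
]
for student in db.Student.aggregate(pipeline):
    print(student)  # the document with "id":3 is no longer in the result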
The heterogeneity of the data must be considered not only in queries, but also in evolution. In the following, we will emphasize the difficulties heterogeneity brings in this context.

Semantics of Evolution Operations

In databases which are used over a long time, evolution operations like add, delete, rename, move, copy, split, and merge must be performed. These operations are suggested in several approaches for relational databases [12], XML [13], and JSON databases [14, 15, 16], and for multi-model systems in [17]. The HCs of the data affect how complex the evolution operations become. These heterogeneity classes hold for all schema-less databases; in the following, we define the operations for JSON databases. If we can start from data in HC1 (regular, relational-like data, no dangling tuples), the operation add is defined as follows:

HC1: add E.A = x:
  ∀ e_i ∈ E: {..} → {.., "A": "x"}.

It specifies that a new property A with value x is added to all entities e_i of entity type E. For a detailed definition, we refer to [7]. If we execute the same operation on a dataset in HC4, we need to distinguish two cases:

HC4: add E.A = x:
  A ∈ e_i: {.., "A": "a", ..} → {.., "A": "a", ..},
  A ∉ e_i: {..} → {.., "A": "x"}.

The attribute A can either already exist in the entity or not; therefore, both cases must be considered. There are two different semantics of the operation: overwrite and ignore. In case a property is present and we are adding a property with the same key, overwrite, as the name already says, overwrites the value in the database. When following the ignore semantics, the operation is rejected, keeping the previous value. The definition of add E.A in HC4 above uses the ignore semantics.

In a similar way, we define a rename operation. Whereas in HC1 a single case suffices, for HC4 we have to distinguish four cases:

HC1: rename E.A to B:
  ∀ e_i ∈ E: {.., "A": "x", ..} → {.., "B": "x", ..}.

HC4: rename E.A to B:
  A ∈ e_i ∧ B ∉ e_i: {.., "A": "x", ..} → {.., "B": "x", ..},
  A ∈ e_i ∧ B ∈ e_i: {.., "A": "a", "B": "b", ..} → {.., "A": "a", "B": "b", ..},
  A ∉ e_i ∧ B ∈ e_i: {.., "B": "x", ..} → {.., "B": "x", ..},
  A ∉ e_i ∧ B ∉ e_i: {..} → {..}.

We want to mention that the definition for HC1 corresponds to the first case of the definition for HC4. Again, we distinguish between the ignore and overwrite semantics. The definition of rename E.A to B in HC4 above uses the ignore semantics, which avoids overwriting already existing properties. For the overwrite semantics, which replaces already existing properties, the definition of rename E.A to B has to be adapted. For details, we refer to [7].
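The difference between the two semantics can be stated in a few lines of Python; this sketch (ours) applies the HC4 rename to a single entity represented as a dict:

def rename_ignore(e, a, b):
    # ignore semantics: an existing target property B stays untouched
    if a in e and b not in e:
        e[b] = e.pop(a)
    return e

def rename_overwrite(e, a, b):
    # overwrite semantics: an existing target property B is replaced
    if a in e:
        e[b] = e.pop(a)
    return e

entity = {"A": "a", "B": "b"}
print(rename_ignore(dict(entity), "A", "B"))     # {'A': 'a', 'B': 'b'}
print(rename_overwrite(dict(entity), "A", "B"))  # {'B': 'a'}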
Most evolution approaches also offer so-called multi-type operations like move, copy, split, and merge for complex restructuring. Here, we also see large differences in the definitions of the semantics for the different heterogeneity classes. We pick the move operation for HC1 and HC4 as an example. We define move E1.A to E2.B where E1.C1 = E2.C2 for HC1:

∀ e_i ∈ E1, ∀ e_j ∈ E2:
  e_i: {.., "A": "x", "C1": "c", ..}, e_j: {.., "C2": "c", ..}
  → e_i: {.., "C1": "c", ..}, e_j: {.., "B": "x", "C2": "c", ..}.

The definition of the move operation looks a lot more complex when we assume that the datasets are in HC4. The following conditions have to be considered:

• A ∈ e_i vs. A ∉ e_i
• B ∈ e_j vs. B ∉ e_j
• C1 ∈ e_i vs. C1 ∉ e_i
• C2 ∈ e_j vs. C2 ∉ e_j

Combining these conditions, 2⁴ = 16 different cases result. All cases where the join condition is not valid, i.e., C1 ∉ e_i ∨ C2 ∉ e_j, need not be considered further. We only have to distinguish and define four different cases in the semantics of the move operation:

• (A ∈ e_i, B ∈ e_j, C1 ∈ e_i, C2 ∈ e_j),
• (A ∉ e_i, B ∈ e_j, C1 ∈ e_i, C2 ∈ e_j),
• (A ∈ e_i, B ∉ e_j, C1 ∈ e_i, C2 ∈ e_j),
• (A ∉ e_i, B ∉ e_j, C1 ∈ e_i, C2 ∈ e_j).

For the definition of all evolution operations in all heterogeneity classes, we again refer to [7].

Composition of Operations for Heterogeneous Data

When multiple evolution operations need to be performed (e.g., on legacy data that has to be updated over several versions), an optimization should be applied. An obvious method is the composition of operations. In Figure 1, a first example of composing an add and a subsequent rename operation was already given. There are many other possible compositions that can be performed if the data is in HC1. Some examples are:

add E.A + rename E.A to B → add E.B
add E.A + delete E.A → ∅
add E.A + rename E.A to B + delete E.B → ∅
rename E.A to B + rename E.B to C → rename E.A to C
move E1.A to E2 on E1.K = E2.F + rename E2.A to B → move E1.A to E2.B on E1.K = E2.F
rename E.A to B + delete E.B → delete E.A
copy E1.A to E2 on E1.K = E2.F + delete E2.A → ∅

Figure 4: Composition of evolution operations

We can distinguish three abstract classes of compositions: (i) adding and renaming can be composed to a modified add operation, (ii) several rename operations can be combined, and (iii) rename operations followed by a delete can be composed to a modified delete operation. In case we can assume HC1, the composed operations (right-hand side of the equations in Figure 4) have the same effect as a step-wise execution (left-hand side).

Data migration operations are generated for each evolution operation. Consequently, reducing the number of evolution operations significantly improves performance. Let us assume a NoSQL database with 10000 entities and five evolution operations. In the step-wise execution, we have to apply the data migration operations to 5 · 10000 entities. If we combine the operations into two evolution operations, then we need to perform the corresponding data migration operations on only 2 · 10000 entities. In [9], we have shown that operation caching improves the run-time significantly. Therefore, a composition of evolution operations is recommended.

In [8], we have defined which compositions are possible for all combinations of the evolution operations add, delete, rename, move, and copy. This article was published some years before we defined the HCs for NoSQL databases; we used pre- and postconditions instead. For example, for the composition of the operations add E.A + rename E.A to B (see Figure 1), we have set the preconditions A ∈ E and B ∉ E. Preconditions define the presence or absence of properties and guarantee certain dataset characteristics, so that the composition to add E.B in the example can be performed. Each composition can only be applied if all preconditions hold. The postconditions define the conditions that hold after the execution of an evolution operation. When applying several evolution operations, these conditions describe the status of the dataset after each operation execution. They are needed to determine how evolution operations can be combined. Otherwise, i.e., in heterogeneity classes other than HC1, no composition is possible and the evolution operations have to be applied step-wise instead.
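The following Python sketch (ours; the operation tuples and the compose function are hypothetical, not the mechanism of [8]) illustrates such a precondition-guarded composition for the add/rename pattern:

# Compose add E.A + rename E.A to B into add E.B, but only when HC1 can
# be assumed and the precondition "B not in E" is guaranteed; otherwise
# the step-wise plan is kept.
def compose(ops, guaranteed_absent, hc1):
    result, i = [], 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (hc1 and nxt
                and cur[0] == "add" and nxt[0] == "rename"
                and cur[1] == nxt[1]               # rename targets the added property
                and nxt[2] in guaranteed_absent):  # precondition: B not in E
            result.append(("add", nxt[2], cur[2]))
            i += 2
        else:
            result.append(cur)
            i += 1
    return result

ops = [("add", "institution", "UR"), ("rename", "institution", "university")]
print(compose(ops, guaranteed_absent={"university"}, hc1=True))
# [('add', 'university', 'UR')]
print(compose(ops, guaranteed_absent={"university"}, hc1=False))
# step-wise plan unchanged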
If we do not apply preconditions in the composition, we have to assume HC4. Thus, all structural variants have to be defined in the composition. We want to explain this for the following evolution operations:

move E1.A to E2 on E1.K = E2.F + rename E2.A to B → move E1.A to E2.B on E1.K = E2.F.

In the previous paragraphs, we defined the move operation and the rename operation for HC4. For both operations, we distinguished four cases. If we compose both operations, we therefore have to define 4 · 4 = 16 cases. We want to mention that, besides the large effort of defining all cases separately, a composition is not possible in all of them. This short example shows how high the effort for compositions in HC4 is.

Database Transformation and Query Rewriting

Database heterogeneity has to be considered in all database tasks. Two further examples concern the transformation of heterogeneous data into other formats and query rewriting. Data transformation has to consider all variants of the data. Assume that the source database has heterogeneous structures. If we define a mapping between the source and the target format, the transformations must be defined for all variants. For instance, if we use the local-as-view (LaV) approach, originally developed for relational data in [18], this leads to considerably larger transformation scripts because these scripts have to cover all variants of the input data.

If we assume data in different formats, we have to use query rewriting. This is necessary either for versioned databases or in data integration, where we query against the global schema and translate the queries into the structure of the local schemas. For both tasks, the number of different cases that have to be defined depends on the heterogeneity class [6].

4. Database Designs for Heterogeneous Data

In summary, as expected, all database tasks are more complex to perform for heterogeneous databases. Nevertheless, we cannot avoid heterogeneity: countless applications show that not all data is structured in a regular way, and the various developments of schema-less database management systems confirm this need. How can one manage this balancing act between regular and irregular data? Our conclusion from the application examples listed in this article is very simple: we propose hybrid methods storing a maximum amount of data under schema control. For the rest of the data, we need the ability to store irregular data.

Several applications following this hybrid approach have been presented in the literature. They deal with the conjunction of relational databases with document stores [19] or graph databases [20]. These hybrid approaches can be implemented in three different ways (see Figure 5): (a) within one NoSQL or graph database with partial schema control, (b) as a set of connected NoSQL or graph databases (some with schema control and the others without), or (c) in a multi-model database combining one relational database and NoSQL databases.

JSON and graph databases with partial schema control. NoSQL database management systems like JSON and graph database systems have added options to check schema constraints (visualized in Figure 5(a)). A prominent example is MongoDB, where users can specify a JSON schema for a specific MongoDB collection [21]. With this specification, document validation against a predefined schema is possible [22].
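As a sketch (ours; the collection name and the concrete schema are assumptions, and the database handle db is reused from the loading snippet above), such partial schema control can be set up via pymongo by attaching a $jsonSchema validator that covers only the regular kernel of the Student documents:

# Partial schema control: validate the regular kernel of the documents;
# additional, irregular properties (e.g. courses) remain allowed because
# additionalProperties is not restricted.
student_schema = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "institution"],
        "properties": {
            "name": {"bsonType": "string"},
            "institution": {"bsonType": "string"},
        },
    }
}
db.create_collection("ValidatedStudent", validator=student_schema)
db.ValidatedStudent.insert_one(
    {"name": "Huber", "institution": "UR", "profession": "student"})  # accepted
# db.ValidatedStudent.insert_one({"name": "Meier"})  # rejected: no institution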
Neo4J, as an example of a graph database, follows a similar approach, making it possible to establish partial schema control by defining an individual schema. Cypher offers existence, type, and uniqueness constraints, which are available for properties of either nodes or relationships.

[Figure 5 sketches the three hybrid storage methods: (a) JSON documents with partial schema control, (b) connected collections of JSON documents with the central documents under schema control (vertical partitioning), and (c) a multi-model database combining a relational database with JSON databases.]
Figure 5: Overview of the three proposed hybrid storage methods

Furthermore, key constraints can be defined for whole entities such as nodes or relationships [23]. The constraints are checked upon creation or manipulation of data and can be used to avoid heterogeneity of the data. In Figure 6, a Neo4J example is shown. Here, a key constraint constraint_student on the Student nodes with the properties id, name, profession, and institution is defined. The graph is based on the dataset of Figure 3. When the constraint is violated, an error message is generated. In Neo4J, the command SHOW CONSTRAINTS can be used to list all defined constraints.

[Figure 6 shows the university graph database (the Student nodes Schmidt, Schulz, and Huber connected to the Course nodes Java and Python via ATTENDED and HAS relationships), the key constraint definition, the error message raised on a violation, and the output of SHOW CONSTRAINTS.]
Figure 6: Constraint definition and checking for the university graph database (in Neo4J)
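A sketch (ours) of how the constraint_student definition from Figure 6 can be issued through the official Neo4j Python driver; the connection details are assumptions, and node key constraints require the Neo4j Enterprise Edition:

from neo4j import GraphDatabase

# Create the key constraint from Figure 6 on all Student nodes
# (Neo4j 5 Cypher syntax).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(
        "CREATE CONSTRAINT constraint_student IF NOT EXISTS "
        "FOR (s:Student) "
        "REQUIRE (s.id, s.name, s.profession, s.institution) IS NODE KEY"
    )
    # A violating insert, e.g. a Student node without institution,
    # would now raise an error:
    # session.run("CREATE (:Student {id: 4, name: 'Meier'})")
    for record in session.run("SHOW CONSTRAINTS"):
        print(record)
driver.close()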
Connected JSON or Graph databases. In this storage method, we propose vertical partitioning, i.e., splitting the attributes into groups and distributing these groups over different tables [24]. In our case, the regular and the irregular parts of a database are stored separately in NoSQL (JSON or graph) databases. For the regular components, a NoSQL database with schema control is used; this can be seen as the kernel of the dataset. The other portions contain the heterogeneous parts of the datasets, which can be imagined as satellites surrounding the kernel. The connection between the different databases is realized by references (see Figure 5(b)). The same approach can also be used for graph databases. Looking at Figure 6, one example of this approach would be to store the essential information of a Student node, as covered by the constraint constraint_student, under schema control, matching the kernel. Additional optional information for each Student node (e.g., a photo or their credit points) would be stored separately in the graph database.

Multi-model database. The storage method visualized in Figure 5(c) is quite similar to the previous one. Instead of storing the regular parts in a NoSQL database under schema control (see Figure 5(b)), a relational database is used. A relational database follows the schema-first approach, which guarantees schema constraints and offers the opportunity to define further semantic constraints. Similarly to the approach shown in Figure 5(b), the heterogeneous parts are kept in JSON or graph databases. The connections between the datasets are realized by references or other inter-model linkage constraints.

5. Conclusion and Future Work

In recent years, the need to store heterogeneous data has become increasingly apparent. This resulted in the development of different generations of database systems that allow for heterogeneous data (XML, JSON, and graph databases). However, we conclude that one should keep the heterogeneous parts as small as possible and store everything else relationally. With such an approach, we can achieve a balance in data storage between the desirability of storing large chunks of homogeneous data in a regular way and the need to store heterogeneous data separately.

NoSQL databases like wide-column stores (Cassandra [25]) and JSON databases (MongoDB [21]) allow a (partial) schema definition. Therefore, they can be used for the hybrid approaches defined in Section 4. As graph databases such as Neo4J [23] allow the definition of semantic constraints, they can also make use of the suggested storage methods.

In the past, reverse engineering approaches have been developed to derive the structures [16, 26, 27, 28] and semantic constraints [29] from NoSQL data. There is still a need for such components. Schema management for graphs is also an active research topic [30]. The derivation of semantic constraints from graph data and multi-model data is part of our future work. Another interesting research direction is the consideration of hybrid databases in schema optimization. The challenge is to distinguish between data that is better suited to be stored in relational databases and data that is better stored in nonrelational databases. Here, different factors like storage size or query performance have to be considered.

Acknowledgments

This article is published in the scope of the project "NoSQL Schema Evolution und Big Data Migration at Scale", which is funded by the Deutsche Forschungsgemeinschaft (DFG) under grant no. 385808805.

References

[1] E. F. Codd, Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4 (1979) 397–434.
[2] M. Shah, A. Kothari, S. Patel, Influence of Schema Design in NoSQL Document Stores, in: Mobile Computing and Sustainable Informatics, Springer Singapore, 2022, pp. 435–452.
[3] A. Abubakar Imam, S. Basri, R. Ahmad, J. Watada, M. González-Aparicio, M. A. Almomani, Data Modeling Guidelines for NoSQL Document-Store Databases, International Journal of Advanced Computer Science and Applications 9 (2018).
[4] S. Amghar, S. Cherdal, S. Mouline, Data Integration and NoSQL Systems: A State of the Art, in: BDIoT, ACM, 2019, pp. 16:1–16:6.
[5] H. B. Hamadou, F. Ghozzi, A. Péninou, O. Teste, Towards Schema-independent Querying on Document Data Stores, in: DOLAP, volume 2062 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.
[6] M. L. Möller, M. Klettke, A. Hillenbrand, U. Störl, Query Rewriting for Continuously Evolving NoSQL Databases, in: ER, volume 11788 of LNCS, Springer, 2019, pp. 213–221.
[7] M. L. Möller, M. Klettke, U. Störl, Keeping NoSQL Databases Up to Date - Semantics of Evolution Operations and their Impact on Data Quality, in: LWDA, volume 2454 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 25–37.
[8] M. Klettke, U. Störl, M. Shenavai, S. Scherzinger, NoSQL schema evolution and big data migration at scale, in: IEEE BigData, IEEE Computer Society, 2016, pp. 2764–2774.
[9] U. Störl, A. Tekleab, M. Klettke, S. Scherzinger, In for a Surprise When Migrating NoSQL Data, in: ICDE, IEEE Computer Society, 2018, p. 1662.
[10] C. Curino, H. J. Moon, C. Zaniolo, Graceful database schema evolution: the PRISM workbench, Proc. VLDB Endow. 1 (2008) 761–772.
[11] MongoDB, Inc., MongoDB documentation – $lookup (aggregation), https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/, 2023. Accessed: 2023-07-31.
[12] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, W. Lehner, Logical Data Independence in the 21st Century - Co-Existing Schema Versions with InVerDa, CoRR abs/1608.05564 (2016).
[13] M. Polák, M. Chytil, K. Jakubec, V. Kudelas, P. Piják, M. Necaský, I. Holubová, Data and Query Adaptation Using DaemonX, Comput. Informatics 34 (2015) 99–137.
[14] S. Scherzinger, M. Klettke, U. Störl, Managing Schema Evolution in NoSQL Data Stores, in: DBPL, 2013.
[15] A. H. Chillón, D. S. Ruiz, J. G. Molina, Towards a Taxonomy of Schema Changes for NoSQL Databases: The Orion Language, in: ER, volume 13011 of LNCS, Springer, 2021, pp. 176–185.
[16] P. Suárez-Otero, M. J. Mior, M. J. S. Cabal, J. Tuya, CoDEvo: Column family database evolution using model transformations, J. Syst. Softw. 203 (2023) 111743.
[17] P. Koupil, J. Bártík, I. Holubová, MM-evocat: A Tool for Modelling and Evolution Management of Multi-Model Data, in: CIKM, ACM, 2022, pp. 4892–4896.
[18] J. D. Ullman, Information integration using logical views, Theoretical Computer Science 239 (2000) 189–210.
[19] G. Ongo, G. P. Kusuma, Hybrid Database System of MySQL and MongoDB in Web Application Development, in: ICIMTech, 2018, pp. 256–260.
[20] H. Vyawahare, P. Karde, V. Thakare, A Hybrid Database Approach Using Graph and Relational Database, in: RICE, 2018, pp. 1–4.
[21] MongoDB, Inc., MongoDB documentation – Specify JSON Schema Validation, https://www.mongodb.com/docs/manual/core/schema-validation/specify-json-schema/, 2023. Accessed: 2023-07-13.
[22] F. Pezoa, J. L. Reutter, F. Suárez, M. Ugarte, D. Vrgoc, Foundations of JSON Schema, in: WWW, ACM, 2016, pp. 263–273.
[23] Neo4J, Inc., Neo4J Docs – Constraints, https://neo4j.com/docs/cypher-manual/current/constraints/, 2023. Accessed: 2023-07-24.
[24] S. B. Navathe, S. Ceri, G. Wiederhold, J. Dou, Vertical Partitioning Algorithms for Database Design, ACM Trans. Database Syst. 9 (1984) 680–710.
[25] Apache Cassandra, Cassandra Docs – Defining Database Schema, https://cassandra.apache.org/doc/latest/cassandra/data_modeling/data_modeling_schema.html, 2023. Accessed: 2023-07-28.
[26] D. S. Ruiz, S. F. Morales, J. G. Molina, Inferring Versioned Schemas from NoSQL Databases and Its Applications, in: ER, volume 9381 of LNCS, Springer, 2015, pp. 467–480.
[27] M. Klettke, H. Awolin, U. Störl, D. Müller, S. Scherzinger, Uncovering the evolution history of data lakes, in: IEEE BigData, IEEE Computer Society, 2017, pp. 2462–2471.
[28] P. Koupil, S. Hricko, I. Holubová, A universal approach for multi-model schema inference, J. Big Data 9 (2022) 97.
[29] S. Klessinger, M. Klettke, U. Störl, S. Scherzinger, Extracting JSON Schemas with tagged unions, in: DEco@VLDB, volume 3306 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 27–40.
[30] A. Bonifati, The Quest for Schemas in Graph Databases (keynote), in: DOLAP, volume 3369 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 1–2.