<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Keeping NoSQL Databases up to date - Semantics of Evolution Operations and their Impact on Data Quality</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mark</forename><forename type="middle">Lukas</forename><surname>Möller</surname></persName>
							<email>mark.moeller2@uni-rostock.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Rostock</orgName>
								<address>
									<settlement>Rostock</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Meike</forename><surname>Klettke</surname></persName>
							<email>meike.klettke@uni-rostock.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Rostock</orgName>
								<address>
									<settlement>Rostock</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Uta</forename><surname>Störl</surname></persName>
							<email>uta.stoerl@h-da.de</email>
							<affiliation key="aff1">
								<orgName type="institution">Darmstadt University of Applied Sciences</orgName>
								<address>
									<settlement>Darmstadt</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Keeping NoSQL Databases up to date - Semantics of Evolution Operations and their Impact on Data Quality</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2AAF8E7D5E3140E8E420E08ACF6725C2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T18:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>NoSQL Schema Evolution</term>
					<term>Schema Evolution Operation</term>
					<term>Data Heterogeneity Classes</term>
					<term>Data Quality</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Evolving a NoSQL database schema regularly involves migrating datasets into new schema versions. NoSQL databases store datasets in different heterogeneity classes (HCs) that can be characterized by their degree of regularity and the cardinality of various entity types. In this article, we present the semantics of NoSQL evolution operations and their corresponding data migration operations while distinguishing different NoSQL HCs. One use case of NoSQL evolution operations is improving the actuality and completeness of data, which is especially relevant given the ever-expanding volume of data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In agile software development environments, source code is changed frequently, which can also entail changes to the data in a database. To deal with schema changes, schema evolution operations adapt the data to the new structure. While schema evolution has been studied in detail for relational databases <ref type="bibr" target="#b11">[13]</ref>, these approaches cannot be transferred directly to NoSQL, since characteristics such as data heterogeneity have to be taken into account.</p><p>The majority of NoSQL database systems can be used for storing datasets with different characteristics:</p><p>1. No or limited schema control: In NoSQL, neither schema information nor semantic constraints have to be defined before the datasets are stored. Thus, datasets with different structures can be stored even within the same collection, which may lead to heterogeneous data.</p><p>2. Regularity of data: NoSQL databases are often populated by applications or object mappers, which produce data structures that are checked for consistency. In these cases, well-structured data with at least an implicit schema is stored in the NoSQL database.</p><p>3. Versioned datasets: In other applications, regular datasets are generated with a certain structure, yet this structure changes frequently over time. Consequently, the NoSQL database becomes heterogeneous since it contains datasets in different versions within the same collection.</p><p>All datasets that are used over long periods of time must be able to evolve. To transform pre-existing stored data into a new structure, efficient schema evolution operations are required that can cope with heterogeneity and cardinalities and that update and cleanse the data to ensure a high level of data quality. First, we introduce different degrees of NoSQL heterogeneity.
Figure <ref type="figure" target="#fig_0">1</ref>(a) visualizes the three dimensions that have to be considered. The first dimension (x-axis in Figure <ref type="figure" target="#fig_0">1</ref>(a)) describes the existence of dangling tuples. Our evolution language includes two multi-type operations, move and copy. Both operations specify matching conditions between entities. In this context, dangling tuples are entities without a matching partner with respect to a multi-type operation.</p><p>The second dimension describes the cardinalities between kinds that are affected by multi-type operations. Because NoSQL databases do not check semantic constraints in advance, it is necessary to distinguish whether all entities have a matching partner or whether dangling tuples exist. If matching partners exist, it is important to determine the number of partners, referred to as cardinality.</p><p>The last dimension concerns the heterogeneity of entities of the same version. Here we distinguish between datasets in which all entities of the same version have homogeneous or heterogeneous structures (z-axis in Figure <ref type="figure" target="#fig_0">1(a)</ref>). We derive different heterogeneity classes (HCs) per schema evolution operation, ranging from the most structured datasets with 1:1 cardinalities up to unstructured datasets with arbitrary cardinalities.</p><p>HC1: In this class, the operation affects datasets in the same or different structural versions (e.g., when lazy migration approaches are used), yet all datasets in the same version have exactly the same structure. Multi-type operations presume 1:1 cardinalities only, and no dangling tuples are allowed between two kinds of a matching condition.</p><p>HC2: The second class extends HC1 by 1:n cardinalities. Therefore, it is necessary to deal with dangling tuples.</p><p>HC3: The third class extends HC2 to arbitrary cardinalities.
Additional strategies are required for determining the property values of entities affected by multi-entity operations with n:m cardinalities.</p><p>HC4: The fourth class represents NoSQL databases that can have different structures within the same version. Consequently, optional properties can occur that are present in some entities of a given version and missing in other entities of the same version.</p><p>A schema evolution operation against a NoSQL database must be able to cope with all variants of input datasets. This article makes the following contributions:</p><p>- We have introduced four different heterogeneity classes (HC1-HC4) for NoSQL above. Based on these heterogeneity classes, we define the operational semantics and data migration for a NoSQL evolution language in Section 3. We show that the evolution operations can be simplified for certain HCs.</p><p>- We discuss the impact of schema evolution operations on data quality in Section 4, namely on data actuality, data completeness, and data consistency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Foundations</head><p>Our NoSQL evolution language contains three single-type operations, add, delete and rename, and two multi-type operations, move and copy. The operations are defined for the evolution of the schema, and the entailed data migration operations can be derived from them. The schema evolution and data migration operations are used to bring entities into the latest structural version. First, we introduce the semantic foundations.</p><p>Data with an equal or similar set of properties is called an entity type or a kind. A kind named $A$ consists of a schema and a set of entities and is defined as</p><formula xml:id="formula_0">K_A = (S_A, E_A).</formula><p>The schema $S_A$ is defined as a set of property names, $S_A = \{A_1, \dots, A_n\}$. The set of entities $E_A$ of $K_A$ over the schema $S_A$ is defined as $E_A := \{e_1, \dots, e_m\}$, where $m$ is the number of entities and each entity $e_i$ in $E_A$ consists of up to $n$ properties (also referred to as attributes) $a_{ij}$ with $i \in \{1, \dots, m\}$ and $j \in \{1, \dots, n\}$. Formally, $e_i = \{a_{ij} \mid j \in \{1, \dots, n\}\}$.</p><p>Here, $i$ is the index of the $i$-th entity of $E_A$ and $j$ is the index of the $j$-th property of that entity. Each property $a_{ij}$ consists of a property name and a property value:</p><formula xml:id="formula_1">a_{ij} = (A_{ij} : v_{ij}) \in S_{A_i} \times D_{A_i}</formula><p>Example. To illustrate the definitions, consider a subset of a research project database that stores information about research stations, the name of the funder of the project, and the budget. The kind is called project and is defined as $K_{project} = (S_{project}, E_{project})$, where $S_{project}$ = {"p_id", "station_name", "funder", "budget"}. $E_{project}$ is the set of entities of the kind project; it contains two entities ($e_1$ and $e_2$). A valid set of data $E_{project}$ is:</p><p>$E_{project}$ = { {("p_id": 1), ("station_name": "Ocean"), ("funder": "DFG"), ("budget": "5 Mil")}, {("p_id": 2), ("station_name": "Baltic Sea")} }</p><p>For the evolution operations, it is necessary to check whether an entity contains a property with a certain name, regardless of its value. Because properties are stored as tuples and not as a set, the operator $\in^*$ is defined, which evaluates whether a property with a given name is present in an entity. For this purpose, we define a projection operation that projects onto the property name:</p><formula xml:id="formula_2">\begin{array}{l} \pi_A := S_{A_i} \times D_{A_i} \rightarrow S_{A_i} \text{ with } (A_{ij}, v_{ij}) \mapsto A_{ij}. \\ \text{Based on this projection, the } \in^* \text{ operator is defined:} \\ X \in^* e_i :\Leftrightarrow \exists a_{ij} \in e_i : X \in \pi_A(a_{ij}), \text{ and} \\ X \in^* E_A :\Leftrightarrow \forall e_i \in E_A : X \in^* e_i. \end{array}</formula><p>Reconsider the previous example. Here, "station_name" $\in^* e_1$ is True while "location" $\in^* e_2$ is False.</p><p>The dot notation is introduced for reading the value of a given property name and is particularly needed to express matching conditions for multi-entity operations. The following notation is introduced:</p><formula xml:id="formula_3">\forall X \in^* e_i : e_i.X := \pi_v(a_{ij}) \text{ with } \pi_v := S_{A_i} \times D_{A_i} \rightarrow D_{A_i} \text{ with } (A_{ij}, v_{ij}) \mapsto v_{ij}.</formula><p>In the example, $e_1$.station_name evaluates to "Ocean" and $e_2$.station_name evaluates to "Baltic Sea", while $e_1$.location throws an exception.</p><p>Because of migration and the different schema versions it involves, the same kind is inspected at different points in time. For this purpose, a version notation is introduced in the form of a version number in square brackets. For instance, $S_{A[10]} = \{A_1, \dots, A_n\}^{[10]}$ describes the schema of kind $A$ at schema version 10.
In the abstract notation for the evolution and migration operations, $[v_A]$ and $[v_B]$ are used for the version information of the kinds $K_A$ and $K_B$.</p><p>In general, $S_A$ can be derived by iterating over all entities of $E_A$ and collecting all attribute names. Nevertheless, $S_A$ is stored as well to support the query rewriting approach presented in <ref type="bibr" target="#b8">[9]</ref>.</p></div>
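<div xmlns="http://www.tei-c.org/ns/1.0"><p>The definitions of this section can be illustrated with a minimal Python sketch. It is not part of the evolution language itself: the function names (pi_name, pi_value, has_property, all_have_property, dot, derive_schema) are assumptions chosen for illustration, and entities are modelled as tuples of (name, value) pairs, using the project example from above.</p><p><code lang="python">
```python
# Illustrative sketch only; all names below are assumptions, not the paper's API.

# An entity is modelled as a tuple of (property name, property value) pairs.
e1 = (("p_id", 1), ("station_name", "Ocean"), ("funder", "DFG"), ("budget", "5 Mil"))
e2 = (("p_id", 2), ("station_name", "Baltic Sea"))
E_project = [e1, e2]


def pi_name(prop):
    """Projection of a property onto its name."""
    return prop[0]


def pi_value(prop):
    """Projection of a property onto its value."""
    return prop[1]


def has_property(name, entity):
    """Element-star operator for one entity: is a property with this name present?"""
    return any(pi_name(prop) == name for prop in entity)


def all_have_property(name, entities):
    """Element-star operator for a set of entities: present in every entity?"""
    return all(has_property(name, entity) for entity in entities)


def dot(entity, name):
    """Dot notation: read the value stored under a property name."""
    for prop in entity:
        if pi_name(prop) == name:
            return pi_value(prop)
    raise KeyError(name)  # the property does not exist in this entity


def derive_schema(entities):
    """Derive the schema by collecting the property names of all entities."""
    return {pi_name(prop) for entity in entities for prop in entity}
```
</code></p><p>In this sketch, dot(e1, "location") raises a KeyError, which mirrors the exception mentioned for reading a missing property, and derive_schema(E_project) collects the four property names of the example schema.</p></div>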
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Semantics of the Evolution Operations</head><p>In this section, we define the semantics of the evolution operations on regular structures and structured datasets and then extend them to irregular structures and heterogeneous datasets. The evolution operations were first introduced as EBNF and as a NoSQL programming language in <ref type="bibr" target="#b12">[14]</ref> and have been extended continuously since then. The chosen evolution operations add, rename, delete, move and copy represent a set of frequent schema evolution operations in open source applications (cf. <ref type="bibr" target="#b2">[3]</ref>). The effort for data migration increases with the HC. To define the concrete heterogeneity classes, pre- and postconditions are used to determine the regularity of the data. The pre- and postconditions are inspired by the Hoare triple and are comparable to the concept of design by contract. Operations are only executed if the preconditions are fulfilled; otherwise they are rejected. After the execution of an operation, the postconditions are guaranteed.</p><p>In the following, we define the single-type operation add and the multi-type operation move. The semantics of the complete evolution language is given in <ref type="bibr" target="#b9">[10]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Heterogeneity Class 1</head><p>Operations in HC1 assume that all entities of a kind in a dataset have the same schema within the same schema version. Hence, datasets cannot contain optional properties. For multi-type operations, this class can only cope with matches of 1:1 cardinality.</p><p>Each of the operations evolves the schema and migrates the entities into the new version. On the instance level, the operation modifies the data structure and updates the affected instances. The effects of the evolution operations therefore have to be defined on both the schema level and the instance level. Evolution operations are defined as rules, where the left side of a rule describes the schema or instances before the operation and the right side describes the schema or instances after the operation. Every rule has a precondition, which needs to hold before the operation. If the precondition is not fulfilled, the operation is not executed. The postconditions are fulfilled after the operation and become important for the chaining of operations and for the examination of data quality.</p><p>The Add Operation. This operation adds a property to all entities of a kind. The operation specifies the kind, the new property name and, additionally, the default property value. In HC1, the add operation is defined as:</p><formula xml:id="formula_4">\begin{array}{l} \textbf{add } A.X = d \\ \text{precond}: \{X \notin S_{A[v_A]}\} \\ S_A(A_1, \dots, A_n)^{[v_A]} \rightarrow S_A(X, A_1, \dots, A_n)^{[v_A+1]} \\ \forall e_i \in E_A : \left(e_i(a_1, \dots, a_n)^{[v_A]} \rightarrow e_i((X : d), a_1, \dots, a_n)^{[v_A+1]}\right) \\ \text{postcond}: \{X \in S_{A[v_A+1]}\} \end{array}</formula><p>First, the operation verifies that the precondition is fulfilled, which states that the property name must not be present in the schema of $K_A$ in version $v_A$. The next line describes the schema evolution of $K_A$. In version $v_A$, schema $S_A$ consists of $n$ properties $A_1, \dots, A_n$.
After the operation, in version $v_A + 1$, the schema consists of $n + 1$ properties including the added property named $X$. The following line describes the instance-level modification of each entity of $K_A$. Each entity consists of the properties $a_1$ to $a_n$ and additionally the new property $(X : d)$, where $X$ is the name of the added property and $d$ is the default value. After the modification of the schema and the entity migration, the postcondition holds, which states that the property name $X$ is part of $S_A$ in version $v_A + 1$. As a variant of the given semantics, it is possible to add a property without a default value: add A.X. In this case, the property $(X : \bot)$ is added, where $\bot$ represents a Null value.</p><p>The Move Operation. The multi-type operation move transfers a property from the entities of one kind (termed the source kind) to the entities of a different kind (termed the target kind). To execute a multi-type operation, a matching condition between both kinds is mandatory. In HC1, the matching cardinality is assumed to be 1:1, which entails bijectivity: every entity of the source kind has exactly one match with an entity of the target kind, and vice versa. This also presumes that the value of the matching condition is unique for each entity and that no entity on either the source side or the target side lacks a matching partner. Consequently, multi-entity operations in HC1 are restricted to kinds with the same number of entities.</p><p>In HC1, the semantics of the move operation is defined as follows:</p><formula xml:id="formula_5">\begin{array}{l} \textbf{move } A.X \textbf{ to } B.Z \textbf{ where } A.K = B.F \\ \text{precond}: \{X \in S_{A[v_A]}, Z \notin S_{B[v_B]}\} \\ S_A(X, K, A_3, \dots, A_n)^{[v_A]} \rightarrow S_A(K, A_3, \dots, A_n)^{[v_A+1]} \\ S_B(F, B_2, \dots, B_m)^{[v_B]} \rightarrow S_B(Z, F, B_2, \dots, B_m)^{[v_B+1]} \end{array}</formula><formula xml:id="formula_6">\begin{array}{l} \forall e_i \in E_A, e_j \in E_B, e_i.K = e_j.F : \\ \left(e_i((X : x), (K : k), a_{i3}, \dots, a_{in})^{[v_A]} \wedge e_j((F : k), b_{j2}, \dots, b_{jm})^{[v_B]} \rightarrow e_i((K : k), a_{i3}, \dots, a_{in})^{[v_A+1]} \wedge e_j((Z : x), (F : k), b_{j2}, \dots, b_{jm})^{[v_B+1]}\right) \\ \text{postcond}: \{X \notin S_{A[v_A+1]}, Z \in S_{B[v_B+1]}\} \end{array}</formula><p>Besides the matching condition, the source and target kinds as well as the property names are specified. Here, these are $K_A$ with the property $X$ and $K_B$ with the property $Z$. The move operation implicitly performs a rename operation if the property names of the source and target kinds differ. In the where clause, the matching condition is explicitly specified.</p><p>Before the operation, $S_A$ of $K_A$ contains the property name $X$, while $S_B$ of $K_B$ does not contain $Z$. On the schema level, the moved property $X$ is no longer present in $S_A$ after the operation has been executed. Instead, $S_B$ now contains $Z$. During the operation, all entities $e_i$ and $e_j$ are modified. The property $(X : x)$ is no longer present in any entity of $K_A$, while $(Z : x)$ is part of each entity of $K_B$. The same symbol $x$ on the left-hand and right-hand sides of the rule indicates the same property value: the operation transfers the value from the source kind to the target kind without modification. The matching condition between both kinds is likewise represented by the same property value $k$ ($(K : k)$ for $e_i$ and $(F : k)$ for $e_j$).</p></div>
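<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two HC1 rules can be read operationally. The following Python sketch is one possible reading under stated assumptions, not the authors' implementation: entities are plain dictionaries, schemas are lists of property names, version bookkeeping is omitted, and the function names add_hc1 and move_hc1 are hypothetical. Preconditions are checked before any change, and a violated precondition rejects the operation.</p><p><code lang="python">
```python
# Illustrative sketch only; add_hc1 and move_hc1 are hypothetical names.

def add_hc1(schema, entities, name, default=None):
    """HC1 add: the property must not exist yet; every entity receives it."""
    if name in schema:
        raise ValueError("precondition violated: property already in schema")
    new_schema = [name, *schema]
    new_entities = [{name: default, **entity} for entity in entities]
    return new_schema, new_entities


def move_hc1(src_schema, src_entities, tgt_schema, tgt_entities, x, z, k, f):
    """HC1 move of property x to property z, matching on source.k == target.f."""
    if x not in src_schema or z in tgt_schema:
        raise ValueError("precondition violated")
    src_keys = [entity[k] for entity in src_entities]
    tgt_keys = [entity[f] for entity in tgt_entities]
    # 1:1 cardinality: matching values are unique and no entity is dangling.
    one_to_one = (
        len(set(src_keys)) == len(src_keys)
        and len(set(tgt_keys)) == len(tgt_keys)
        and set(src_keys) == set(tgt_keys)
    )
    if not one_to_one:
        raise ValueError("matching condition is not 1:1")
    value_by_key = {entity[k]: entity[x] for entity in src_entities}
    new_src_schema = [p for p in src_schema if p != x]
    new_tgt_schema = [z, *tgt_schema]
    new_src = [{p: v for p, v in e.items() if p != x} for e in src_entities]
    new_tgt = [{z: value_by_key[e[f]], **e} for e in tgt_entities]
    return new_src_schema, new_src, new_tgt_schema, new_tgt
```
</code></p><p>Passing None as the default corresponds to the variant without a default value, where a Null value is stored. Using different names for x and z reproduces the implicit rename described above.</p></div>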
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Heterogeneity Classes 2 and 3</head><p>In heterogeneity classes 2 and 3, we assume structurally homogeneous data within the same version. However, cardinalities are extended to 1:n in HC2 and to m:n in HC3. Thus, it is necessary to deal with dangling tuples and multiple matches. Since HC4 inherits the characteristics of HC2 and HC3, we explain the properties and challenges of these HCs in the next section. Furthermore, both HCs are discussed in detail in <ref type="bibr" target="#b9">[10]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Heterogeneity Class 4</head><p>Evolution operations in HC4 cover the most complicated NoSQL databases, considering all structural variants. This HC includes schema heterogeneity and multi-entity operations of arbitrary cardinalities.</p><p>An example of the challenges in HC4 is given in Figure <ref type="figure" target="#fig_2">2</ref>. Here, an add operation is executed and affects two entities of $K_{project}$. For the entity with the id 2, it has to be decided whether the value of funder is overwritten with the default value or preserved. The semantics is extended by the additional keywords overwrite and ignore, which implement conflict resolution strategies.</p><p>For heterogeneous data, optionality has to be denoted in the semantics, especially in the preconditions, since it is not known whether a certain property occurs in all entities of a kind. Optional properties are labeled with a question mark. For example, $X? \in S_A$ states that $X$ is an optional property in the schema of kind $A$ that may or may not appear in an entity. The semantics has to deal with both cases. On the schema level, the notation $S_A(X?)$ is used analogously.</p><p>The Add Operation. The definition of the add operation is given below. Here, the overwrite approach is used, which adds the property with the specified default value to entities without that property. For entities that already contain the property before the operation, the existing property value is overwritten by the operation's default value.</p><p>In contrast to HC1, a distinction is made between global conditions, which hold for the schema and all entities affected by the evolution operation, and case conditions, which only hold for a subset of the entities affected by the operation.
The definition of the evolution operation is divided into two cases: The first case defines the operation for all datasets in which $X$ is not present. A property named $X$ is added with the default value $d$. The second case defines the operation for the datasets that already contain $X$. The existing value of the property $X$ is overwritten with the default value $d$. As in HC1, this operation can also be defined without a default value.</p><p>Please note that in HC4 all properties that do not directly affect the operation (here: $A_2, \dots, A_n$) are considered optional. For improved readability, optionality is only denoted for properties that are affected by the evolution operation (here: $X$).</p><formula xml:id="formula_7">\begin{array}{l} \textbf{add overwrite } A.X = d \\ \text{global precond}: \{X? \in S_A\} \\ S_A(X?, A_2, \dots, A_n)^{[v_t]} \rightarrow S_A(X, A_2, \dots, A_n)^{[v_{t+1}]} \\ \forall e_i \in E_{A[v_t]} : \\ \quad \text{case}: X \notin^* e_{i[v_t]} \\ \quad\quad \text{case precond}: \{X \notin^* e_{i[v_t]}\} \\ \quad\quad e_i(a_{i2}, \dots, a_{in})^{[v_t]} \rightarrow e_i((X : d), a_{i2}, \dots, a_{in})^{[v_{t+1}]} \\ \quad\quad \text{case postcond}: \{X \in^* e_{i[v_{t+1}]}\} \\ \quad \text{case}: X \in^* e_{i[v_t]} \\ \quad\quad \text{case precond}: \{X \in^* e_{i[v_t]}\} \\ \quad\quad e_i((X : x), a_{i2}, \dots, a_{in})^{[v_t]} \rightarrow e_i((X : d), a_{i2}, \dots, a_{in})^{[v_{t+1}]} \\ \quad\quad \text{case postcond}: \{X \in^* e_{i[v_{t+1}]}\} \\ \text{global postcond}: \{X \in S_{A[v_{t+1}]}\} \end{array}</formula><p>The Move Operation. The definition of the move operation is more difficult because it has to be defined for two kinds (source and target). The semantics has to cope with both heterogeneity and arbitrary cardinalities, and even 1:1 matches entail complex problems. Let us extend the introduced example by a second kind called metadata, which consists at least of a property called m_id. Some entities of $K_{metadata}$ contain the property station_name as well. The database administrator wants to evolve the database schema by moving station_name from $K_{metadata}$ to $K_{project}$.
Since project contains the property station_name as well, determining the property value is not trivial. Figure <ref type="figure" target="#fig_3">3</ref> depicts the cases that can occur for the move operation in HC4 with a matching cardinality of 1:1. In the first case, station_name is present in the corresponding entity of $K_{metadata}$ but not in $K_{project}$, and can be moved easily. In the second case, station_name is not present in $K_{metadata}$ but is present in $K_{project}$. For both conflict resolution strategies introduced above, the pre-existing value of station_name is preserved. In the third case, station_name is present neither in $K_{metadata}$ nor in $K_{project}$; here, a property with an empty value is introduced. In the last case, station_name is part of both entities, and its value depends on the conflict resolution strategy. All cases have to be handled by the semantics of the move operation in HC4.</p><p>On the schema level, it is established that the property datetimestamp is removed from the source kind, while the property datetimestamp is contained in the target kind.</p><p>For all entities of the source kind without a matching partner, the property is removed, and all entities of the target kind without a matching partner are assigned the property with a Null value.</p><p>The formal semantics of the move overwrite operation for HC4 is given in the Appendix of this paper. The semantics of all other single-type and multi-type operations of the NoSQL evolution language and their different conflict resolution approaches are described in <ref type="bibr" target="#b9">[10]</ref>.</p></div>
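<div xmlns="http://www.tei-c.org/ns/1.0"><p>The case distinctions of HC4 can be sketched in Python as follows. This is a simplified illustration under assumptions, not the formal semantics: entities are dictionaries, a missing key models a missing optional property, None models the Null value, and the function names add_hc4 and move_pair_hc4 are hypothetical. For the last move case, the sketch assumes that overwrite takes the value of the source entity and ignore keeps the value of the target entity.</p><p><code lang="python">
```python
# Illustrative sketch only; function names and the reading of case 4 are assumptions.

STRATEGIES = ("overwrite", "ignore")


def add_hc4(entities, name, default=None, strategy="overwrite"):
    """HC4 add: the property may already exist in some of the entities."""
    if strategy not in STRATEGIES:
        raise ValueError("unknown conflict resolution strategy")
    migrated = []
    for entity in entities:
        new_entity = dict(entity)
        # Missing property: always add it. Existing property: only overwrite.
        if name not in entity or strategy == "overwrite":
            new_entity[name] = default
        migrated.append(new_entity)
    return migrated


def move_pair_hc4(source, target, name, strategy="overwrite"):
    """Move `name` between one matched pair of entities (1:1 match)."""
    if strategy not in STRATEGIES:
        raise ValueError("unknown conflict resolution strategy")
    new_source = {p: v for p, v in source.items() if p != name}
    new_target = dict(target)
    in_source, in_target = name in source, name in target
    if in_source and not in_target:
        new_target[name] = source[name]  # case 1: plain move
    elif not in_source and not in_target:
        new_target[name] = None  # case 3: property with an empty value
    elif in_source and in_target and strategy == "overwrite":
        new_target[name] = source[name]  # case 4 with overwrite
    # Case 2 and case 4 with ignore: the pre-existing target value is kept.
    return new_source, new_target
```
</code></p><p>After either function has run, every returned target entity contains the property, which corresponds to the global postcondition that the property is no longer optional in the new schema version.</p></div>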
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Increased Data Quality through Schema Evolution</head><p>Quality of data entails several characteristics such as data completeness, data actuality, and data consistency (cf. <ref type="bibr" target="#b15">[17]</ref>). Schema evolution can be applied to refresh the datasets and, at the same time, to increase the data quality and in some cases decrease the HC. Both aspects are sketched in the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Actuality</head><p>The main focus of the evolution lies on updating datasets and migrating them into the latest version. In our previous work, we have introduced methods for eager data migration (immediately after a new version is introduced), lazy migration (on demand, when datasets are accessed), and hybrid strategies <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b7">[8]</ref>. In all cases, datasets are transformed into the current schema version. Legacy datasets can thus be updated and transformed into the current structure, which guarantees data actuality.</p></div>
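<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between eager and lazy migration can be made concrete with a small Python sketch. It is a simplified illustration and not the migration methods of the cited work: the _version field, the MIGRATIONS table, the two example migration steps and all function names are assumptions introduced here.</p><p><code lang="python">
```python
# Illustrative sketch only; the store layout and all names are assumptions.

LATEST_VERSION = 3

# Hypothetical migration steps, keyed by the version they start from.
MIGRATIONS = {
    1: lambda entity: {**entity, "funder": None},  # v1 to v2: add funder
    2: lambda entity: {k: v for k, v in entity.items() if k != "budget"},  # v2 to v3: delete budget
}


def migrate(entity):
    """Apply all outstanding migration steps to a single entity."""
    migrated = dict(entity)
    for version in range(migrated["_version"], LATEST_VERSION):
        migrated = MIGRATIONS[version](migrated)
        migrated["_version"] = version + 1
    return migrated


def migrate_eagerly(store):
    """Eager migration: rewrite every stored entity right away."""
    for key in list(store):
        store[key] = migrate(store[key])


def read_lazily(store, key):
    """Lazy migration: migrate an entity only when it is accessed."""
    store[key] = migrate(store[key])
    return store[key]
```
</code></p><p>With lazy migration, entities that are never read stay in their old version, which is why datasets in different structural versions can coexist within one collection.</p></div>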
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Completeness and Data Consistency</head><p>The NoSQL evolution operations presented in this paper never increase the heterogeneity of the databases. After the data migration operations have been executed, the databases always remain in the same HC as the source datasets. Moreover, the operations can be used to increase the regularity and completeness of NoSQL databases and, in some cases, to reduce the heterogeneity class and thus improve the data quality. In the following, we discuss this in more detail for the individual heterogeneity classes.</p><p>Reconsidering the given semantics, it is evident that data in heterogeneity class 1 or 2 always remains in its class due to the restrictions imposed by the pre- and postconditions, the heterogeneity and the matching conditions. In both HCs there are no optional properties, and operations always affect all entities of a kind. Consequently, it is impossible to transform data without optional properties into schema-heterogeneous data. For multi-type operations with the same matching condition, the cardinality remains the same, even for chained operations.</p><p>In HC3, the same argument holds for optional properties. Regarding cardinalities, data in HC3 also remains in this HC for two multi-entity operations with the same matching condition. Nevertheless, the conflict resolution approaches provide an advantage. Consider two kinds with an n:1 relation (encompassed in HC3), e.g., two entities of $K_{metadata}$ (caused by duplicates) belong to a single entity of $K_{project}$. Selecting data from both kinds using a join operation normally returns two result rows. Evolving the database by moving all properties from the entities of $K_{metadata}$ to $K_{project}$ using overwrite or ignore results in a single concrete value for each property moved to the entity of $K_{project}$.
Nevertheless, depending on the application, it might be a downside of both strategies that a subset of the property values is lost after the move or copy operation. A better solution can be to generate an array of values that collects the values of all matching partners while still decreasing heterogeneity.</p><p>For data in HC4, it can be possible to transform the data into a lower heterogeneity class. Only HC4 copes with optional properties. Consider $K_{project}$ from the example in Section 2, where the only optional property budget does not exist in every entity. After an add operation (with an arbitrary conflict resolution approach) on $K_{project}$, all entities have a homogeneous schema. Hence, evolution operations can be used in this way to increase the schema homogeneity of NoSQL datasets.</p></div>
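<div xmlns="http://www.tei-c.org/ns/1.0"><p>Both observations of this section can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than a defined operation of the evolution language: move_collect is a hypothetical variant of move that gathers the values of all matching partners in a list, is_homogeneous is a hypothetical helper that checks whether all entities share the same set of property names, and the duplicate station name used in the usage note is made up.</p><p><code lang="python">
```python
# Illustrative sketch only; move_collect and is_homogeneous are hypothetical helpers.

def move_collect(sources, targets, name, src_key, tgt_key):
    """Move `name` to each target and collect the values of all matching partners."""
    new_targets = []
    for target in targets:
        values = [
            source[name]
            for source in sources
            if name in source
            and src_key in source
            and tgt_key in target
            and source[src_key] == target[tgt_key]
        ]
        new_targets.append({**target, name: values})
    new_sources = [{p: v for p, v in s.items() if p != name} for s in sources]
    return new_sources, new_targets


def is_homogeneous(entities):
    """True if all entities have exactly the same set of property names."""
    return len({frozenset(entity) for entity in entities}) in (0, 1)
```
</code></p><p>For two metadata entities with m_id 1 and the station names "Ocean" and "Atlantic" that both match the project entity with p_id 1, move_collect stores both values in one list, so no value is lost. Applying is_homogeneous before and after adding a missing budget property shows the step from a heterogeneous to a homogeneous schema.</p></div>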
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>The main aspect of this paper is the semantics of NoSQL schema evolution operations and data migration for different heterogeneity classes. Additionally, we presented the impact of evolution operations on data quality. In this section, we present approaches and concepts related to ours.</p><p>In <ref type="bibr" target="#b6">[7]</ref>, the authors present an approach for schema mapping. Similar to our semantics, a mapping consists of a source schema, a target schema, and a set of formulas of some logic over both schemas. The formalism used to describe database dependencies is tuple-generating dependencies (TGDs) (see also <ref type="bibr" target="#b10">[12]</ref>, <ref type="bibr" target="#b0">[1]</ref>).</p><p>In <ref type="bibr" target="#b4">[5]</ref>, several schema versions are maintained within a single relational database. That publication defines a language for bidirectional schema evolution with forward and backward delta code generation to support multiple versions of an application while maintaining only one database with co-existing schema versions.</p><p>In <ref type="bibr" target="#b13">[15]</ref>, Schildgen presents the language NotaQL for transforming NoSQL data and uses this language to overcome different kinds of heterogeneity.</p><p>Data quality is a long-studied field in relational database theory and covers a broad range of characteristics, such as data homogeneity, data correctness, and data completeness. An overview of data integration steps and tools in practice is given in <ref type="bibr" target="#b3">[4]</ref>. In <ref type="bibr">[11]</ref>, Naumann describes research directions and challenges of data quality and classifies different data profiling subtasks. Duplicate elimination and coping with redundancy can be part of a data cleansing process (cf. <ref type="bibr" target="#b1">[2]</ref>).
Our semantics eliminates multiple values for properties that are affected by schema evolution operations by using the overwrite or ignore approach. This avoids duplicating records that differ in only one property value. In contrast to other data cleansing approaches, we focus on transforming NoSQL datasets into the current version while at the same time increasing the regularity of the databases. The transformation process is described by the evolution operations.</p><p>In our research project Darwin <ref type="bibr" target="#b14">[16]</ref>, we implemented the evolution for MongoDB, Cassandra and CouchDB for the single- and multi-type operations in HC1 and HC2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Summary and Future Work</head><p>In agile development environments, data structures change often, which necessitates the definition of schema evolution operations. For efficient schema evolution, operations were defined that take the characteristics of NoSQL data, such as data heterogeneity, into account.</p><p>In this article, we introduced NoSQL heterogeneity classes, which relate to the complexity of operations. As a subset of our schema evolution language, we presented the semantics of the single-type operation add and the multi-type operation move for different HCs. We have shown the complexity of the operations in different heterogeneity classes and why evolving the schema makes it possible to improve data quality under certain conditions. Storing completely unstructured and heterogeneous data is very uncommon, even in the NoSQL world, because applications often require a certain schema for reading and processing data. Hence, datasets are stored homogeneously. Data in higher heterogeneity classes requires sophisticated evolution and migration operations. In this article, we have shown that the presented semantics is able to migrate data into a lower heterogeneity class when certain requirements are met.</p><p>In the future, we plan to extend the semantics by introducing further schema evolution operations. The current operations were chosen based on an analysis of schema changes in open-source applications like Wikipedia (cf. <ref type="bibr" target="#b2">[3]</ref>). Further operations such as split and merge are possible and useful as well. We plan to estimate and benchmark the impact of schema heterogeneity and low data quality for various scenarios, such as schema evolution or query rewriting in environments where data is lazily migrated, as examined in <ref type="bibr" target="#b8">[9]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. 
NoSQL Heterogeneity Classes</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>whereby $S_{A_i} \subseteq S_A$ and $D_{A_i} \subseteq D_A$. Here, $S_{A_i} \times D_{A_i}$ represents the domain of the property.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Execution of the add operation on heterogeneous data in HC4</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Emerging cases of the move operation with 1:1 matching cardinalities and heterogeneous data</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgements. This article is published within the scope of the project "NoSQL Schema Evolution und Big Data Migration at Scale", which is funded by the Deutsche Forschungsgemeinschaft (DFG) under project number 385808805. Special thanks go to Stefanie Scherzinger, Andrea Hillenbrand, Dennis Marten, Tanja Auge, and Hannes Grunert for their support, their comments on this work, and several discussions. We thank all reviewers for their constructive feedback.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Appendix</head><p>Move Overwrite Semantics in Heterogeneity Class 4: $S_A(X?, K?, A_3?, \ldots, A_n?)[v_a] \rightarrow S_A(K?, A_3?, \ldots, A_n?)[v_a + 1]$ $S_B(F?, B_2?, \ldots, B_m?)$ </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Abiteboul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vianu</surname></persName>
		</author>
		<ptr target="http://webdam.inria.fr/Alice/" />
		<title level="m">Foundations of Databases</title>
				<imprint>
			<publisher>Addison-Wesley</publisher>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
	</analytic>
	<monogr>
		<title level="m">DEXA &apos;99</title>
		<title level="s">Proc. LNCS</title>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">J M</forename><surname>Bench-Capon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Soda</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Tjoa</surname></persName>
		</editor>
		<meeting><address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="volume">1677</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Schema Evolution in Wikipedia -Toward a Web Information System Benchmark</title>
		<author>
			<persName><forename type="first">C</forename><surname>Curino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Moon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tanca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zaniolo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICEIS&apos;08</title>
				<meeting>ICEIS&apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Beauty and the beast: The theory and practice of information integration</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Haas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDT. Springer LNCS</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="volume">4353</biblScope>
			<biblScope unit="page" from="28" to="43" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Living in Parallel Realities -Co-Existing Schema Versions with a Bidirectional Database Evolution Language</title>
		<author>
			<persName><forename type="first">K</forename><surname>Herrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Voigt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rausch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
				<meeting>SIGMOD</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">NoSQL Schema Evolution and Big Data Migration at Scale</title>
		<author>
			<persName><forename type="first">M</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Störl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shenavai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Big Data 2016</title>
				<meeting><address><addrLine>Washington DC</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Schema mappings, data exchange, and metadata management</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Kolaitis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PODS</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="61" to="75" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Datenevolutions-und Migrationsstrategien in NoSQL-Datenbanken</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Möller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Grundlagen von Datenbanken. CEUR Workshop Proc</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">2126</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Query Rewriting for Continuously Evolving NoSQL Databases</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Möller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hillenbrand</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">accepted for ER2019</title>
				<meeting><address><addrLine>Salvador, Brazil</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Möller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Störl</surname></persName>
		</author>
		<title level="m" type="main">Formal Semantics of NoSQL Evolution Operations under different Heterogeneity Levels</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>Rostock University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. Report</note>
</biblStruct>

<biblStruct>
	<analytic>
		<title level="a" type="main">Data profiling revisited</title>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="40" to="49" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The Complexity of Evaluating Tuple Generating Dependencies</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pichler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Skritek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDT 2011</title>
				<meeting><address><addrLine>Uppsala</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Schema Evolution in Database Systems -An Annotated Bibliography</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Roddick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="35" to="40" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Managing Schema Evolution in NoSQL Data Stores</title>
		<author>
			<persName><forename type="first">S</forename><surname>Scherzinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Störl</surname></persName>
		</author>
		<idno>abs/1308.0514</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. DBPL CoRR</title>
				<meeting>DBPL CoRR</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Heterogenität überwinden mit der Datentransformationssprache NotaQL</title>
		<author>
			<persName><forename type="first">J</forename><surname>Schildgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deßloch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Datenbank-Spektrum</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="5" to="15" />
			<date type="published" when="2016-03">Mar 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Curating variational data in application development</title>
		<author>
			<persName><forename type="first">U</forename><surname>Störl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tekleab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDE</title>
				<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1605" to="1608" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Data Quality in Context</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Strong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="103" to="110" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
