<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Keeping NoSQL Databases up to date - Semantics of Evolution Operations and their Impact on Data Quality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Lukas Möller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meike Klettke</string-name>
          <email>meike.klettke@uni-rostock.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uta Störl</string-name>
          <email>uta.stoerl@h-da.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Darmstadt University of Applied Sciences</institution>
          ,
          <addr-line>Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Rostock</institution>
          ,
          <addr-line>Rostock</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evolving a NoSQL database schema regularly involves migrating datasets into new schema versions. NoSQL databases store datasets in different heterogeneity classes (HCs) that can be characterized by their degree of regularity and by the cardinalities between entity types. In this article, we present the semantics of NoSQL evolution operations and their corresponding data migration operations while distinguishing different NoSQL HCs. One use case of NoSQL evolution operations is the improvement of actuality and completeness of data, which is especially relevant given the ever-expanding volume of data.</p>
      </abstract>
      <kwd-group>
        <kwd>NoSQL Schema Evolution</kwd>
        <kwd>Schema Evolution Operation</kwd>
        <kwd>Data Heterogeneity Classes</kwd>
        <kwd>Data Quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In agile software development environments, source code is changed frequently,
which can also entail changes to the data in a database. In order to deal with
schema changes, schema evolution operations adapt the data to the new
structure. While schema evolution has been studied in detail for relational databases
in the past [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], these approaches cannot be directly transferred to NoSQL since
characteristics like data heterogeneity have to be taken into account.
      </p>
      <p>The majority of NoSQL database systems can be used for storing datasets
with different characteristics:</p>
      <list list-type="order">
        <list-item><p>No or limited schema control: In NoSQL, neither schema information nor
semantic constraints have to be defined before the datasets are actually stored.
Thus, datasets with different structures can be stored even within
the same collection, which may lead to heterogeneous data.</p></list-item>
        <list-item><p>Regularity of data: Oftentimes, NoSQL databases are generated by
applications or object mappers, resulting in data structures that are checked in
terms of data consistency. In these cases, well-structured data is stored in
NoSQL databases that have at least an implicit schema.</p></list-item>
        <list-item><p>Versioned datasets: In other applications, regular datasets are generated
with a certain structure, yet this structure changes frequently over time.
Consequently, the NoSQL database becomes heterogeneous since it contains
datasets in different versions within the same collection.</p></list-item>
      </list>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>All datasets that are used over long periods of time have to be able to evolve.
In order to transform pre-existing stored data into a new structure,
efficient schema evolution operations are required that can cope with problems
of heterogeneity and cardinalities and that update and cleanse the data to ensure
a high level of data quality.</p>
      <p>Fig. 1. (a) Three dimensions of the four NoSQL HCs: dangling tuples, cardinalities (1:1, 1:n, n:1, m:n), and structure (homogeneous or heterogeneous within one version). (b) Heterogeneity classes of the evolution operations add, delete, rename, move, and copy.</p>
      <p>First, we introduce different degrees of NoSQL heterogeneity.
Figure 1(a) visualizes the three dimensions that have to be considered. The first
dimension (x-axis in Figure 1(a)) describes the existence of dangling tuples. Our
evolution language includes two multi-type operations, move and copy. Both
operations specify matching conditions between entities. In this context, dangling
tuples are entities without a matching partner with regard to a multi-type
operation.</p>
      <p>The second dimension describes the cardinalities between kinds that are
affected by multi-type operations. Because NoSQL databases do not check
semantic constraints in advance, it is necessary to differentiate whether all entities
have a matching partner or whether dangling tuples exist. If matching partners exist, it
is important to determine the number of partners, referred to as the cardinality.</p>
      <p>The last dimension concerns the heterogeneity of entities of the same version.
Here we distinguish between datasets in which all entities of the same version
have homogeneous or heterogeneous structures (z-axis in Figure 1(a)). We derive
different heterogeneity classes (HCs) per schema evolution operation, ranging
from the most structured datasets with 1:1 cardinalities up to unstructured
datasets with arbitrary cardinalities.</p>
      <p>HC1: In this class, the operation affects datasets in the same or in different
structural versions (e.g., when lazy migration approaches are used), yet all
datasets in the same version have exactly the same structure. Multi-type
operations presume 1:1 cardinalities only, and no dangling tuples are allowed
between the two kinds of a matching condition.</p>
      <p>HC2: The second class extends HC1 by 1:n cardinalities. Therefore, it is
required to deal with dangling tuples.</p>
      <p>HC3: The third class extends HC2 to arbitrary cardinalities. Additional
strategies are required for determining property values of entities affected by
multi-type operations with m:n cardinalities.</p>
      <p>HC4: The fourth class represents NoSQL databases that can have different
structures within the same version. Consequently, optional properties can
occur that are available in some entities of a concrete version and missing
in other entities of the same version.</p>
      <p>A schema evolution operation against a NoSQL database must be able to cope
with all variants of input datasets. This article makes the following contributions:</p>
      <list list-type="bullet">
        <list-item><p>We have introduced four different heterogeneity classes (HC1-HC4)
for NoSQL above. Based on these heterogeneity classes, we define the operational
semantics and data migration for a NoSQL evolution language in Section 3.
We show that for certain HCs the evolution operations can be simplified.</p></list-item>
        <list-item><p>We discuss the impact of schema evolution operations on data quality in
Section 4, namely on data actuality, data completeness, and data consistency.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Foundations</title>
      <p>Our NoSQL evolution language contains three single-type operations (add, delete,
and rename) and two multi-type operations (move and copy). The operations are
defined for the evolution of the schema, and the entailed data migration operations
can be derived from them. The schema evolution and data migration operations are used to
bring entities into the latest structural version. We first introduce the semantic
foundations.</p>
      <p>Data with an equal or similar set of properties is called an entity type or a
kind. A kind named A consists of a schema and a set of entities and is defined
as K<sub>A</sub> = (S<sub>A</sub>, E<sub>A</sub>).</p>
      <p>The schema S<sub>A</sub> is defined as a set of property names, S<sub>A</sub> = {A<sub>1</sub>, …, A<sub>n</sub>}. The
set of entities E<sub>A</sub> of K<sub>A</sub> over the schema S<sub>A</sub> is defined as E<sub>A</sub> := {e<sub>1</sub>, …, e<sub>m</sub>},
where m is the number of entities and each entity e<sub>i</sub> in E<sub>A</sub>
consists of up to n properties (also referred to as attributes) a<sub>ij</sub> with
i ∈ {1, …, m} and j ∈ {1, …, n}. Formally, e<sub>i</sub> = {a<sub>ij</sub> | j ∈ {1, …, n}}.</p>
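      <p>Here, i is the index of the i-th entity of E<sub>A</sub> and j is the index of the j-th property
of the corresponding entity. Each property a<sub>ij</sub> consists of a property name
and a property value: a<sub>ij</sub> = (A<sub>ij</sub> : v<sub>ij</sub>) ∈ S<sub>A<sub>i</sub></sub> × D<sub>A<sub>i</sub></sub>, whereby S<sub>A<sub>i</sub></sub> ⊆ S<sub>A</sub> and
D<sub>A<sub>i</sub></sub> ⊆ D<sub>A</sub>. Here, S<sub>A<sub>i</sub></sub> × D<sub>A<sub>i</sub></sub> represents the domain of the property.</p>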
      <p>Example. To illustrate the definitions, let us consider a subset of a
research project database that stores
information about research stations, the name of the funder of the project, and the
budget. The kind is called project and is defined as K<sub>project</sub> = (S<sub>project</sub>, E<sub>project</sub>),
whereby S<sub>project</sub> = {"p_id", "station_name", "funder", "budget"}. E<sub>project</sub> is the set
of entities; here it contains two entities (e<sub>1</sub> and e<sub>2</sub>) of the kind project. A valid set
of data E<sub>project</sub> is:
Eproject = {
{("p_id": 1), ("station_name": "Ocean"), ("funder": "DFG"), ("budget": "5 Mil")},
{("p_id": 2), ("station_name": "Baltic Sea")}
}</p>
      <p>For the evolution operations, it is necessary to check whether an entity contains
a property with a certain name, regardless of its value. Because properties are
stored as tuples and not as plain names, an operator ∈ is defined that evaluates whether
a property with a given name is available in a given entity. For this purpose, we define
a projection that projects onto the property name: π<sub>A</sub> := S<sub>A<sub>i</sub></sub> × D<sub>A<sub>i</sub></sub> →
S<sub>A<sub>i</sub></sub> with (A<sub>ij</sub>, v<sub>ij</sub>) ↦ A<sub>ij</sub>. Based on this projection, the ∈ operator is defined as
X ∈ e<sub>i</sub> :⇔ ∃a<sub>ij</sub> ∈ e<sub>i</sub> : X ∈ π<sub>A</sub>(a<sub>ij</sub>), and X ∈ E<sub>A</sub> :⇔ ∀e<sub>i</sub> ∈ E<sub>A</sub> : X ∈ e<sub>i</sub>.</p>
      <p>Reconsider the previous example. Here, "station_name" ∈ e<sub>1</sub> is True while
"location" ∈ e<sub>2</sub> is False.</p>
      <p>The dot notation is introduced for reading the value of a property with a given
name and is particularly needed in order to express matching conditions for
multi-type operations. It is defined as follows:
∀X ∈ e<sub>i</sub> : e<sub>i</sub>.X := π<sub>v</sub>(a<sub>ij</sub>) with π<sub>v</sub> := S<sub>A<sub>i</sub></sub> × D<sub>A<sub>i</sub></sub> → D<sub>A<sub>i</sub></sub> with (A<sub>ij</sub>, v<sub>ij</sub>) ↦ v<sub>ij</sub>.</p>
      <p>In the example, e<sub>1</sub>.station_name evaluates to "Ocean" and e<sub>2</sub>.station_name
evaluates to "Baltic Sea", while e<sub>1</sub>.location throws an exception.</p>
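      <p>The ∈ operator and the dot notation can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an entity is modelled as a dictionary from property names to values; the function names has_property, in_all_entities, and dot are hypothetical and not part of the evolution language.</p>
      <preformat><![CDATA[
```python
from typing import Any

# Assumption of this sketch: an entity is a dict {property name: property value}.
Entity = dict[str, Any]

e1: Entity = {"p_id": 1, "station_name": "Ocean", "funder": "DFG", "budget": "5 Mil"}
e2: Entity = {"p_id": 2, "station_name": "Baltic Sea"}
E_project = [e1, e2]


def has_property(name: str, entity: Entity) -> bool:
    """X in e_i: the entity holds a property named X, regardless of its value."""
    return name in entity


def in_all_entities(name: str, entities: list[Entity]) -> bool:
    """X in E_A: every entity of the kind holds a property named X."""
    return all(has_property(name, e) for e in entities)


def dot(entity: Entity, name: str) -> Any:
    """e_i.X: read the value of property X; raises KeyError if X is missing."""
    return entity[name]
```
]]></preformat>
      <p>In this sketch, has_property("location", e2) is False and dot(e1, "location") raises a KeyError, which mirrors the exception mentioned above.</p>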
      <p>
        Because of migration, a kind can exist in different schema versions, so the same kind
is inspected at different points in time. For this purpose, a version notation is
introduced in which the version is given in square brackets. For instance, S<sub>A</sub>[10] = {A<sub>1</sub>, …, A<sub>n</sub>}[10]
describes the schema of kind A at schema version 10. In the abstract notation
for the evolution and migration operations, [v<sub>A</sub>] and [v<sub>B</sub>] are used for the version
information of the kinds K<sub>A</sub> and K<sub>B</sub>.
      </p>
      <p>
        Generally, S<sub>A</sub> can be derived by iterating over all entities of E<sub>A</sub> and collecting all
attribute names. Nevertheless, S<sub>A</sub> is stored as well to support the query rewriting
approach presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
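      <p>Deriving a schema from the stored entities can be sketched as follows. The sketch again assumes that entities are plain dictionaries; derive_schema is a hypothetical helper name, and keeping the first-seen order of property names is a choice made only for readability.</p>
      <preformat><![CDATA[
```python
def derive_schema(entities):
    """Collect the property names of all entities, keeping first-seen order."""
    schema = []
    for entity in entities:
        for name in entity:
            if name not in schema:
                schema.append(name)
    return schema


# Example data taken from the project kind introduced above.
E_project = [
    {"p_id": 1, "station_name": "Ocean", "funder": "DFG", "budget": "5 Mil"},
    {"p_id": 2, "station_name": "Baltic Sea"},
]
S_project = derive_schema(E_project)
```
]]></preformat>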
    </sec>
    <sec id="sec-3">
      <title>Semantics of the Evolution Operations</title>
      <p>
        In this section, we define the semantics of the evolution operations on regular
structures and structured datasets, and we extend them to irregular
structures and heterogeneous datasets. The evolution operations were first
introduced as EBNF and as a NoSQL programming language in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and have been
continuously extended since then. The chosen evolution operations add, rename,
delete, move, and copy represent a set of frequent schema evolution operations
in open-source applications (cf. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). The effort for data migration increases
with the HC. In order to define the concrete heterogeneity classes,
pre- and postconditions are used to determine the regularity of the data. The
pre- and postconditions are inspired by the Hoare triple and are
comparable to the concept of design by contract. Operations are only executed if the
preconditions are fulfilled; otherwise they are rejected. After the execution
of an operation, the postconditions are guaranteed.
      </p>
      <p>
        In the following, we define the single-type operation add and the multi-type
operation move. The semantics of the complete evolution language is given in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Heterogeneity Class 1</title>
        <p>Operations in HC1 assume that, in a dataset, all entities of a kind have the same
schema within the same schema version. Hence, datasets with optional properties
are not possible. For multi-type operations, this class can only
cope with matches of 1:1 cardinality.</p>
        <p>Each of the operations evolves the schema and migrates the entities into the
new version. On the instance level, the operation modifies the data structure
and updates the affected instances. The effects of the evolution operations therefore have to
be defined on the schema level and on the instance level. Evolution operations are
defined as rules in which the left side describes the schema or instances
before the operation and the right side describes the schema or instances after
the operation. Every rule has a precondition that needs to hold before the
operation. If the precondition is not fulfilled, the operation is not executed. The
postconditions are fulfilled after the operation and become important for
the chaining of operations and for the examination of the data quality.</p>
        <p>The Add Operation. This operation adds a property to all entities of a kind.
The operation specifies the kind, the new property name, and, additionally, the
default property value. In HC1, the add operation is defined as:</p>
        <p>
          <disp-formula>
            <tex-math><![CDATA[
\begin{array}{l}
\triangleright\ \texttt{add A.X = d} \\
\mathit{precond}: \{X \notin S_A[v_A]\} \\
S_A(A_1, \ldots, A_n)[v_A] \rightarrow S_A(X, A_1, \ldots, A_n)[v_A+1] \\
\forall e_i \in E_A: \big(e_i(a_1, \ldots, a_n)[v_A] \rightarrow e_i((X : d), a_1, \ldots, a_n)[v_A+1]\big) \\
\mathit{postcond}: \{X \in S_A[v_A+1]\}
\end{array}
]]></tex-math>
          </disp-formula>
        </p>
        <p>First, the operation verifies that the precondition is fulfilled, which states that
the name of the property must not be present in the schema of K<sub>A</sub> in
version v<sub>A</sub>. The rule for S<sub>A</sub> describes the schema evolution of K<sub>A</sub>. In version v<sub>A</sub>,
the schema S<sub>A</sub> consists of the n properties A<sub>1</sub>, …, A<sub>n</sub>. After the operation, in version
v<sub>A</sub> + 1, the schema consists of n + 1 properties including the added property
named X. The rule for e<sub>i</sub> describes the instance-level modification of each entity
of K<sub>A</sub>. Each entity consists of the properties a<sub>1</sub> to a<sub>n</sub> and additionally the new
property (X : d), whereby X is the name of the added property and d is the
default value. After the modification of the schema and the entity migration,
the postcondition holds, which states that the property name X is part of S<sub>A</sub> in
version v<sub>A</sub> + 1. As a variant of the given semantics, it is possible to add a
property without a default value: add A.X. In this case, the property (X : ⊥) is
added, whereby ⊥ represents a Null value.</p>
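        <p>The HC1 add rule can be paraphrased in a short Python sketch. It is an illustration under assumptions, not the implementation used by the authors: a kind is modelled as a dictionary with the hypothetical keys "schema", "version", and "entities", None stands in for the Null value, and add_hc1 and PreconditionError are names invented for this example.</p>
        <preformat><![CDATA[
```python
class PreconditionError(Exception):
    """Raised when an evolution operation is rejected (hypothetical name)."""


def add_hc1(kind, name, default=None):
    """Sketch of 'add A.X = d' in HC1; None models the Null value."""
    # precond: X must not be part of the schema in version v_A
    if name in kind["schema"]:
        raise PreconditionError(f"property {name!r} already exists")
    new_kind = {
        # schema rule: S_A(A1..An)[vA] -> S_A(X, A1..An)[vA+1]
        "schema": [name] + kind["schema"],
        "version": kind["version"] + 1,
        # instance rule: every entity receives the property (X : d)
        "entities": [{name: default, **e} for e in kind["entities"]],
    }
    # postcond: X is part of the schema in version v_A + 1
    assert name in new_kind["schema"]
    return new_kind


# Homogeneous example data (values are illustrative assumptions).
K_project = {
    "schema": ["p_id", "station_name"],
    "version": 1,
    "entities": [
        {"p_id": 1, "station_name": "Ocean"},
        {"p_id": 2, "station_name": "Baltic Sea"},
    ],
}
K_project_v2 = add_hc1(K_project, "funder", "DFG")
```
]]></preformat>
        <p>The function returns a new kind instead of modifying its input, which keeps the state before and after the operation visible side by side, as in the rule notation.</p>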
        <p>The Move Operation. The multi-type operation move transfers a property
from the entities of one kind (the source kind) to the entities of a different
kind (the target kind). To execute a multi-type operation, a matching
condition between both kinds is mandatory. In HC1, the matching cardinality is
assumed to be 1:1, which entails bijectivity: every entity of the source kind
has exactly one match with an entity of the target kind, and vice versa. This
also presumes that the value of the matching condition is unique for each entity
and that there is no entity on either the source side or the target side
without a matching partner. Consequently, multi-type operations in HC1 are
restricted to kinds with the same number of entities.</p>
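        <p>In HC1, the semantics of the move operation is defined as follows:</p>
        <p>
          <disp-formula>
            <tex-math><![CDATA[
\begin{array}{l}
\triangleright\ \texttt{move A.X to B.Z where A.K = B.F} \\
\mathit{precond}: \{X \in S_A[v_A],\ Z \notin S_B[v_B]\} \\
S_A(X, K, A_3, \ldots, A_n)[v_A] \rightarrow S_A(K, A_3, \ldots, A_n)[v_A+1] \\
S_B(F, B_2, \ldots, B_m)[v_B] \rightarrow S_B(Z, F, B_2, \ldots, B_m)[v_B+1] \\
\forall e_i \in E_A,\ e_j \in E_B,\ e_i.K = e_j.F: \\
\quad \big(e_i((X : x), (K : k), a_{i3}, \ldots, a_{in})[v_A] \wedge e_j((F : k), b_{j2}, \ldots, b_{jm})[v_B] \\
\quad \rightarrow e_i((K : k), a_{i3}, \ldots, a_{in})[v_A+1] \wedge e_j((Z : x), (F : k), b_{j2}, \ldots, b_{jm})[v_B+1]\big) \\
\mathit{postcond}: \{X \notin S_A[v_A+1],\ Z \in S_B[v_B+1]\}
\end{array}
]]></tex-math>
          </disp-formula>
        </p>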
        <p>Besides the matching condition, the source and target kinds as well as the
property names are specified. Here, these are K<sub>A</sub> with the property X and K<sub>B</sub>
with the property Z. The move operation implicitly realizes a rename operation if
the property names in the source and target kinds are different. The matching
condition is specified explicitly in the where clause.</p>
        <p>Before the operation, S<sub>A</sub> of K<sub>A</sub> contains the property name X, while S<sub>B</sub> of
K<sub>B</sub> does not contain Z. On the schema level, the moved property
X is no longer present in S<sub>A</sub> after the execution of the operation. Instead, S<sub>B</sub>
now contains Z. During the operation, all entities e<sub>i</sub> and e<sub>j</sub> are modified. The
property (X : x) is no longer present in any entity of K<sub>A</sub>, while (Z : x)
is part of each entity of K<sub>B</sub>. The same symbol x on the left-hand and on the right-hand
side of the rule denotes the same property value: the operation transfers the value
from the source kind to the target kind without modification. The
matching condition between both kinds is likewise represented by the same property
value k ((K : k) for e<sub>i</sub> and (F : k) for e<sub>j</sub>).</p>
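        <p>The HC1 move rule can be sketched in Python in the same spirit. The representation of a kind as a dictionary with the keys "schema", "version", and "entities", the function name move_hc1, and the example values are assumptions made for this illustration only. The check for a bijective match reflects the 1:1 restriction described above.</p>
        <preformat><![CDATA[
```python
def move_hc1(source, target, x, z, k, f):
    """Sketch of 'move A.X to B.Z where A.K = B.F' in HC1."""
    # precond: X in S_A[vA] and Z not in S_B[vB]
    if x not in source["schema"] or z in target["schema"]:
        raise ValueError("precondition violated")
    src_keys = [e[k] for e in source["entities"]]
    tgt_keys = [e[f] for e in target["entities"]]
    # HC1 presumes a 1:1 match: unique keys and no dangling tuples
    bijective = (
        len(set(src_keys)) == len(src_keys)
        and len(set(tgt_keys)) == len(tgt_keys)
        and set(src_keys) == set(tgt_keys)
    )
    if not bijective:
        raise ValueError("HC1 requires a bijective (1:1) match")
    moved = {e[k]: e[x] for e in source["entities"]}
    new_source = {
        "schema": [a for a in source["schema"] if a != x],
        "version": source["version"] + 1,
        "entities": [
            {a: v for a, v in e.items() if a != x} for e in source["entities"]
        ],
    }
    new_target = {
        "schema": [z] + target["schema"],
        "version": target["version"] + 1,
        # the value x is transferred unchanged to the matching partner
        "entities": [{z: moved[e[f]], **e} for e in target["entities"]],
    }
    return new_source, new_target


K_metadata = {
    "schema": ["station_name", "m_id"],
    "version": 1,
    "entities": [
        {"station_name": "Ocean", "m_id": 1},
        {"station_name": "Baltic Sea", "m_id": 2},
    ],
}
K_project = {
    "schema": ["p_id", "funder"],
    "version": 1,
    "entities": [{"p_id": 1, "funder": "DFG"}, {"p_id": 2, "funder": "DFG"}],
}
new_metadata, new_project = move_hc1(
    K_metadata, K_project, "station_name", "station_name", "m_id", "p_id"
)
```
]]></preformat>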
      </sec>
      <sec id="sec-3-2">
        <title>Heterogeneity Classes 2 and 3</title>
        <p>
          In heterogeneity classes 2 and 3, we assume structurally homogeneous data
within the same version; however, cardinalities are extended to 1:n in HC2 and
to m:n in HC3. Thus, it is necessary to deal with dangling tuples and multiple
matches. Since HC4 inherits the characteristics of HC2 and HC3,
we explain the properties and challenges of these HCs in the next
section. Furthermore, both HCs are discussed in detail in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Heterogeneity Class 4</title>
        <p>Evolution operations in HC4 cover the most complicated NoSQL databases,
considering all structural variants. This HC includes schema heterogeneity and
multi-type operations of arbitrary cardinalities.</p>
        <p>An example of the challenges in HC4 is given in Figure 2. Here, an add operation
is executed that affects two entities of K<sub>project</sub>. For the entity with the id 2,
which already contains the property funder, it has to be decided whether the value of funder is
overwritten with the default value or preserved. The semantics is therefore extended
by the additional keywords overwrite and ignore, which implement
conflict resolution strategies.</p>
        <p>For heterogeneous data, it is necessary to denote optionality in the semantics,
especially in the preconditions, since it is not known whether a certain property
occurs in all entities of a kind. Optional properties are labeled with a question
mark. For example, X ∈<sup>?</sup> S<sub>A</sub> states that X is an optional property in the schema
of kind A and may or may not appear in an entity. Both cases have to be
handled in the semantics. On the schema level, the notation S<sub>A</sub>(X<sup>?</sup>) is used
analogously.</p>
        <p>The Add Operation. The definition of the add operation is given below. Here,
the overwrite approach is used, which adds the property and the specified default
value to entities without that property. For entities that already contain the
property before the operation, the existing property value is overwritten
by the operation's default value.</p>
        <p>In contrast to HC1, a distinction is made between global conditions, which
hold for the schema and all entities affected by the evolution operation, and case
conditions, which hold only for a subset of the entities affected by the operation.</p>
        <p>Fig. 2. Execution of the add operation (add project.funder = "M-V") on heterogeneous data in HC4</p>
        <p>The definition of the evolution operation is divided into two cases. The first
case defines the operation for all datasets in which X is not available: a property
named X is added with the default value d. The second case defines the operation
for the datasets that already contain X: the existing value of the property X
is overwritten with the default value d. Analogously to HC1, this operation can
also be defined without a default value.</p>
        <p>Please note that in HC4 all properties that do not directly affect the operation
(here: A<sub>2</sub>, …, A<sub>n</sub>) are considered optional. For improved readability, the
notation for optionality is only given for properties that are affected by the
evolution operation (here: X).</p>
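        <p>The two cases and the two conflict resolution strategies can be sketched in Python as follows. The sketch assumes dictionaries as entities and None as the Null value; add_hc4 is a hypothetical name, and the behaviour of ignore (the pre-existing value is preserved) follows the description of the strategies given above. The example values only loosely follow Figure 2.</p>
        <preformat><![CDATA[
```python
def add_hc4(entities, name, default=None, strategy="overwrite"):
    """Sketch of 'add overwrite A.X = d' and 'add ignore A.X = d' in HC4."""
    if strategy not in ("overwrite", "ignore"):
        raise ValueError("unknown conflict resolution strategy")
    migrated = []
    for entity in entities:
        new_entity = dict(entity)
        if name not in entity:
            # case 1: X is not available, so (X : d) is added
            new_entity[name] = default
        elif strategy == "overwrite":
            # case 2 (overwrite): the existing value is replaced by d
            new_entity[name] = default
        # case 2 (ignore): the existing value is preserved
        migrated.append(new_entity)
    # case postconditions: X is contained in every migrated entity
    assert all(name in e for e in migrated)
    return migrated


# Heterogeneous example data (illustrative assumption).
E_project = [
    {"p_id": 1},
    {"p_id": 2, "funder": "DFG"},
]
overwritten = add_hc4(E_project, "funder", "M-V", strategy="overwrite")
ignored = add_hc4(E_project, "funder", "M-V", strategy="ignore")
```
]]></preformat>
        <p>With either strategy, every migrated entity contains the property afterwards; the strategies differ only in the value held by entities that already contained it.</p>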
        <p>
          <disp-formula>
            <tex-math><![CDATA[
\begin{array}{l}
\triangleright\ \texttt{add overwrite A.X = d} \\
\mathit{global\ precond}: \{X \in^{?} S_A\} \\
S_A(X^{?}, A_2, \ldots, A_n)[v_t] \rightarrow S_A(X, A_2, \ldots, A_n)[v_t+1] \\
\forall e_i \in E_A[v_t]: \\
\quad \mathit{case}: X \notin e_i[v_t]
\left\{ \begin{array}{l}
\mathit{case\ precond}: \{X \notin e_i[v_t]\} \\
e_i(a_{i2}, \ldots, a_{in})[v_t] \rightarrow e_i((X : d), a_{i2}, \ldots, a_{in})[v_t+1] \\
\mathit{case\ postcond}: \{X \in e_i[v_t+1]\}
\end{array} \right. \\
\quad \mathit{case}: X \in e_i[v_t]
\left\{ \begin{array}{l}
\mathit{case\ precond}: \{X \in e_i[v_t]\} \\
e_i((X : x), a_{i2}, \ldots, a_{in})[v_t] \rightarrow e_i((X : d), a_{i2}, \ldots, a_{in})[v_t+1] \\
\mathit{case\ postcond}: \{X \in e_i[v_t+1]\}
\end{array} \right. \\
\mathit{global\ postcond}: \{X \in S_A[v_t+1]\}
\end{array}
]]></tex-math>
          </disp-formula>
        </p>
        <p>The Move Operation. The definition of the move operation is more difficult
because it has to be defined for two kinds (source and target). The semantics has
to cope with both heterogeneity and arbitrary cardinalities,
and even 1:1 matches entail complex problems. Let us extend the introduced
example by a second kind called metadata, which consists at least of a
property called m_id. Some entities of K<sub>metadata</sub> contain the property station_name
as well. The database administrator wants to evolve the database schema by
moving station_name from K<sub>metadata</sub> to K<sub>project</sub>. Since project contains the
property station_name as well, determining the property value is not trivial.
Figure 3 depicts the cases that can occur with a matching cardinality of 1:1 for the
move operation in HC4. The first case describes a match where station_name is
available in the corresponding entity of K<sub>metadata</sub> but not in K<sub>project</sub>, so that it can
be moved easily. The second case describes a match where station_name is not available
in K<sub>metadata</sub> but is available in K<sub>project</sub>. For both introduced conflict resolution strategies,
the pre-existing value of station_name is preserved. The third case describes
a match where station_name is available neither in K<sub>metadata</sub> nor in K<sub>project</sub>; here, a
property with an empty value is introduced. The last case describes a match where
station_name is part of both entities. In this case, the value of station_name
depends on the conflict resolution strategy. All cases have to be handled
by the semantics of the move operation in HC4.</p>
        <p>Fig. 3. Operation: move metadata.station_name to project.station_name where metadata.m_id = project.p_id</p>
        <p>On the schema level, the moved property (station_name in the example) is
removed from the source kind, while the property is contained in
the target kind.</p>
        <p>For all entities of the source kind without a matching partner, the property
is removed, and each entity of the target kind without a matching partner
is assigned the property with a Null value.</p>
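        <p>The four 1:1 cases and the treatment of dangling tuples described above can be sketched in Python. This is a simplified illustration restricted to 1:1 matches and is not the formal semantics from the Appendix. Entities are assumed to be dictionaries, None models the Null value, move_hc4 is a hypothetical name, all example values are invented, and keeping a pre-existing value in an unmatched target entity is an additional assumption of this sketch.</p>
        <preformat><![CDATA[
```python
def move_hc4(source, target, x, z, k, f, strategy="overwrite"):
    """Sketch of 'move A.X to B.Z where A.K = B.F' in HC4 for 1:1 matches."""
    if strategy not in ("overwrite", "ignore"):
        raise ValueError("unknown conflict resolution strategy")
    # Index the source entities by the value of the matching property K.
    source_by_key = {s[k]: s for s in source if k in s}
    new_target = []
    for t in target:
        new_t = dict(t)
        partner = source_by_key.get(t.get(f))
        if partner is not None and x in partner:
            # cases 1 and 4: the source holds X; an existing target value
            # is replaced only under the overwrite strategy
            if z not in t or strategy == "overwrite":
                new_t[z] = partner[x]
        elif z not in t:
            # case 3 and dangling target entities: introduce a Null value
            new_t[z] = None
        # case 2: only the target holds the property, its value is preserved
        new_target.append(new_t)
    # The property is removed from all source entities, matched or not.
    new_source = [{a: v for a, v in s.items() if a != x} for s in source]
    return new_source, new_target


E_metadata = [
    {"m_id": 1, "station_name": "Ocean"},
    {"m_id": 2},
    {"m_id": 3},
    {"m_id": 4, "station_name": "North Sea"},
]
E_project = [
    {"p_id": 1},
    {"p_id": 2, "station_name": "Baltic Sea"},
    {"p_id": 3},
    {"p_id": 4, "station_name": "Arctic"},
]
args = ("station_name", "station_name", "m_id", "p_id")
src_o, tgt_o = move_hc4(E_metadata, E_project, *args, strategy="overwrite")
src_i, tgt_i = move_hc4(E_metadata, E_project, *args, strategy="ignore")
```
]]></preformat>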
        <p>
          The formal semantics of the move overwrite operation for HC4 is given in
the Appendix of this paper. The semantics of all other single-type and
multi-type operations of the NoSQL evolution language and their different conflict
resolution approaches are described in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Increased Data Quality through Schema Evolution</title>
      <p>
        Quality of data entails several characteristics such as data completeness, data
actuality, and data consistency (cf. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). Schema evolution can be applied to
refresh the datasets and, at the same time, to increase the data quality and, in some
cases, to decrease the HC. Both aspects are sketched in the following.
      </p>
      <p>
        Data Actuality. The main focus of the evolution lies on updating datasets and
migrating them into the latest version. In our previous work, we have introduced
methods for eager data migration (immediately after introducing a new
version), lazy migration (on demand, when datasets are accessed), and hybrid
strategies [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In all cases, datasets are transformed into the current schema
version. Legacy datasets can thus be updated and transformed into the
current structure, which guarantees data actuality.
      </p>
      <sec id="sec-4-1">
        <title>Data Completeness and Data Consistency</title>
        <p>The NoSQL evolution operations presented in this paper never increase the heterogeneity of the databases.
After executing the data migration operations, the databases always remain in the
same HC as the source datasets or move to a lower one. Moreover, the operations can be used to
increase the regularity and completeness of the NoSQL databases and, in some cases, to
reduce the heterogeneity class and thereby improve the data quality. In the following,
we present this in more detail for the individual heterogeneity classes.</p>
        <p>Reconsidering the given semantics, it is evident that data in heterogeneity
class 1 or 2 always remains in its class because of the restrictions imposed by the pre- and
postconditions, the heterogeneity, and the matching conditions. In both HCs,
there are no optional properties and operations always affect all entities of a kind.
Consequently, it is impossible to transform data without optional properties into
schema-heterogeneous data. For multi-type operations with the same matching
condition, the cardinality remains the same, even for chained operations.</p>
        <p>In HC3, the same argumentation holds for optional properties. Regarding
cardinalities, data in HC3 also remains in this HC for two multi-type
operations with the same matching condition. Nevertheless, the conflict resolution
approaches provide an advantage. Consider two kinds with an n:1 relation
(encompassed in HC3), e.g., two entities of K<sub>metadata</sub> (caused by duplicates) that belong
to a single entity of K<sub>project</sub>. Selecting data from both kinds using a join
operation normally returns two result rows. Evolving the database by moving
all properties from the entities of K<sub>metadata</sub> to K<sub>project</sub> using overwrite or
ignore yields exactly one value for each property moved to the entity
of K<sub>project</sub>. Depending on the application, however, it might be a downside
of both strategies that a subset of the property values is lost after the move
or copy operation. A better solution can be the generation of an array of values
that collects the values of all matching partners while decreasing heterogeneity.</p>
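        <p>The idea of collecting the values of all matching partners in an array can be sketched as follows. This variant is only suggested in the text and is not part of the defined semantics; the function name move_collect, the dictionary representation of entities, and the example values are assumptions made for this illustration, and the sketch presumes that the target entities do not yet contain the moved property.</p>
        <preformat><![CDATA[
```python
def move_collect(source, target, x, z, k, f):
    """Sketch: move X to the target and collect all partner values in a list."""
    values_by_key = {}
    for s in source:
        if k in s and x in s:
            values_by_key.setdefault(s[k], []).append(s[x])
    # Every target entity receives a list, which is empty without a partner.
    new_target = [{**t, z: list(values_by_key.get(t.get(f), []))} for t in target]
    new_source = [{a: v for a, v in s.items() if a != x} for s in source]
    return new_source, new_target


# Two metadata entities (duplicates) belong to one project entity (n:1).
E_metadata = [
    {"m_id": 1, "station_name": "Ocean"},
    {"m_id": 1, "station_name": "Ocean Station"},
]
E_project = [{"p_id": 1}, {"p_id": 2}]
new_metadata, new_project = move_collect(
    E_metadata, E_project, "station_name", "station_names", "m_id", "p_id"
)
```
]]></preformat>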
        <p>Data in HC4 can, in some cases, be transformed into lower
heterogeneity classes. Only HC4 copes with optional properties. Consider K<sub>project</sub>
from the example in Section 2, where the optional property budget does not
exist in every entity. After an add operation on budget (with either conflict resolution
approach), every entity of K<sub>project</sub> contains this property; once no optional property
remains, all entities have a homogeneous schema. Hence, evolution
operations can be used in this way to increase the schema homogeneity of
NoSQL datasets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>This paper mainly deals with the semantics of NoSQL schema
evolution operations and data migration for different heterogeneity classes.
Additionally, we presented the impact of evolution operations on data quality. In
this section, we present approaches and concepts related to ours.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors present an approach for schema mapping. Similar to our
semantics, a mapping consists of a source schema, a target schema, and a set of
formulas of some logic over both schemas. The formalism used to describe database
dependencies is tuple-generating dependencies (TGDs) (see also [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], several schema versions are maintained within a single relational
database. That publication defines a language for bidirectional schema evolution and
for generating forward and backward delta code in order to support multiple
versions of an application while maintaining only one database with co-existing
schema versions.
      </p>
      <p>
        Schildgen presents in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] the language NotaQL for transforming NoSQL data
and uses this language to overcome different kinds of heterogeneity.
      </p>
      <p>
        Data quality is a long-studied field in relational database theory and covers
a broad range of characteristics, such as data homogeneity, data correctness, and
data completeness. An overview of data integration steps and tools in practice
is given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Naumann describes in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] research directions and challenges of
data quality and classifies different data profiling subtasks. Duplicate
elimination and coping with redundancy can be part of a data cleansing
process (cf. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Our presented semantics eliminates multiple values for properties
that are affected by schema evolution operations by using the overwrite or ignore
approach. This avoids the duplication of records that differ in only one property
value. In contrast to other data cleansing approaches, we focus on the
transformation of NoSQL datasets into the current version while, at the same time, increasing
the regularity of the databases. The transformation process is described by the
evolution operations.
      </p>
      <p>
        In our research project Darwin [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we realized the evolution for MongoDB,
Cassandra and CouchDB for the single- and multi-type operations in HC1 and
HC2.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Summary and Future Work</title>
      <p>In agile development environments, data structures are changed often, which
necessitates the definition of schema evolution operations. For efficient schema
evolution, schema evolution operations were defined that take the characteristics
of NoSQL data, such as data heterogeneity, into account.</p>
      <p>In this article, we introduced NoSQL heterogeneity classes, which relate to
the complexity of operations. As a subset of our schema evolution language, we
presented the semantics of the single-type operation add and the multi-type
operation move for different HCs. We have shown the complexity of the operations in
different heterogeneity classes and why evolving the schema makes it possible to
improve data quality under certain conditions. Storing completely unstructured and
heterogeneous data is very uncommon, even in the NoSQL world: applications often
require a certain schema for reading and processing data. Hence, datasets are
usually stored homogeneously. Data in higher heterogeneity classes require more
sophisticated evolution and migration operations. In this article, we have shown that the
presented semantics can migrate data into a lower heterogeneity class when certain
requirements are met.</p>
      <p>
        In the future, we plan to extend the semantics by introducing further schema
evolution operations. The current operations were chosen based on an analysis
of schema changes in open-source applications such as Wikipedia (cf. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Further
operations such as split and merge are possible and useful as well. We also plan
to estimate and benchmark the impact of schema heterogeneity and low data
quality in various scenarios, such as schema evolution or query rewriting in
environments where data is migrated lazily, as examined in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Acknowledgements This article is published within the scope of the project
"NoSQL Schema Evolution und Big Data Migration at Scale", which is funded
by the Deutsche Forschungsgemeinschaft (DFG) under the number 385808805.
Special thanks go to Stefanie Scherzinger, Andrea Hillenbrand, Dennis Marten,
Tanja Auge, and Hannes Grunert for their support, comments on this work, and
several discussions. We thank all reviewers for their constructive feedback.</p>
      <sec id="sec-6-1">
        <title>Move Overwrite Semantics in Heterogeneity Class 4</title>
        <p>move overwrite A.X to B.Z where A.K = B.F</p>
        <p>
          <disp-formula>
            <tex-math>
\begin{array}{l}
S_A(X?, K?, A_3?, \dots, A_n?)[v_a] \rightarrow S_A(K?, A_3?, \dots, A_n?)[v_a+1] \\
S_B(F?, B_2?, \dots, B_m?)[v_b] \rightarrow S_B(Z, F?, B_2?, \dots, B_m?)[v_b+1] \\[6pt]
\forall e_i \in E_A,\ e_j \in E_B,\ e_i.K = e_j.F: \\
\left\{
\begin{array}{l}
\text{case: } Z \notin e_j[v_b] \
\left\{
\begin{array}{l}
\text{case precond: } \{X \in e_i[v_a] \wedge Z \notin e_j[v_b]\} \\
(e_i((X:x),(K:k),a_{i_3},\dots,a_{i_n})[v_a] \wedge e_j((F:k),b_{j_2},\dots,b_{j_m})[v_b]) \\
\quad \rightarrow e_i((X:x),(K:k),a_{i_3},\dots,a_{i_n})[v_a] \wedge e_j((Z:x),(F:k),b_{j_2},\dots,b_{j_m})[v_b] \\
\text{case postcond: } \{X \in e_i[v_a] \wedge Z \in e_j[v_b]\}
\end{array}
\right. \\[6pt]
\text{case: } Z \in e_j[v_b] \
\left\{
\begin{array}{l}
\text{case precond: } \{X \in e_i[v_a] \wedge Z \in e_j[v_b]\} \\
(e_i((X:x),(K:k),a_{i_3},\dots,a_{i_n})[v_a] \wedge e_j((Z:z),(F:k),b_{j_2},\dots,b_{j_m})[v_b]) \\
\quad \rightarrow e_i((X:x),(K:k),a_{i_3},\dots,a_{i_n})[v_a] \wedge e_j((Z:x),(F:k),b_{j_2},\dots,b_{j_m})[v_b] \\
\text{case postcond: } \{X \in e_i[v_a] \wedge Z \in e_j[v_b]\}
\end{array}
\right. \\[6pt]
e_i((X:x),(K:k),a_{i_3},\dots,a_{i_n})[v_a] \rightarrow e_i((K:k),a_{i_3},\dots,a_{i_n})[v_a+1] \\
e_j[v_b] \rightarrow e_j[v_b+1] \\
\text{case postcond: } \{X \notin e_i[v_a+1] \wedge Z \in e_j[v_b+1]\}
\end{array}
\right. \\[6pt]
(e_i((K:k),a_{i_3},\dots,a_{i_n})[v_a] \rightarrow e_i((K:k),a_{i_3},\dots,a_{i_n})[v_a+1]) \\
(e_j((Z:z),(F:k),b_{j_2},\dots,b_{j_m})[v_b] \rightarrow e_j((Z:z),(F:k),b_{j_2},\dots,b_{j_m})[v_b+1]) \\
(e_j((F:k),b_{j_2},\dots,b_{j_m})[v_b] \rightarrow e_j((Z:\bot),(F:k),b_{j_2},\dots,b_{j_m})[v_b+1]) \\[6pt]
\text{global postcond: } \{X \notin S_A[v_a+1],\ Z \in S_B[v_b+1]\}
\end{array}
            </tex-math>
          </disp-formula>
        </p>
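        <p>The effect of these rules on concrete data can be illustrated with a minimal Python sketch. It is not the implementation used in Darwin. It assumes that entities are plain dictionaries, that None stands in for the null value, and that version tags are left out. The function name move_overwrite and its parameters are hypothetical. Two further assumptions are ours: if several entities of A join with the same entity of B, the last one wins, and X is also dropped from entities of A without a join partner, which is how we read the global postcondition.</p>
        <preformat>
```python
from copy import deepcopy


def move_overwrite(entities_a, entities_b, x="X", z="Z", k="K", f="F"):
    """Sketch of 'move overwrite A.X to B.Z where A.K = B.F'.

    Returns migrated copies of both entity lists; the inputs stay untouched.
    """
    new_a = [deepcopy(e) for e in entities_a]
    new_b = [deepcopy(e) for e in entities_b]

    for ej in new_b:
        if f in ej:
            for ei in new_a:
                # Join partner: e_i carries X and K, and e_i.K = e_j.F.
                if x in ei and k in ei and ei[k] == ej[f]:
                    # Overwrite: the moved value replaces any existing Z.
                    ej[z] = ei[x]
        # Z is mandatory in the new schema of B; None stands in for null.
        ej.setdefault(z, None)

    # X is removed from the entities of A only after all joins are done.
    for ei in new_a:
        ei.pop(x, None)

    return new_a, new_b


if __name__ == "__main__":
    a = [{"X": 1, "K": "k1"}, {"K": "k2"}]
    b = [{"F": "k1", "Z": 5}, {"F": "k2"}]
    print(move_overwrite(a, b))
```
        </preformat>
        <p>In the small example at the end of the sketch, the first entity of B has its existing Z overwritten by the moved value, while the second one has a join partner without X and therefore receives the null value for Z. Afterwards no entity of A contains X and every entity of B contains Z, which corresponds to the global postcondition.</p>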
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abiteboul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hull</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vianu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          : Foundations of Databases. Addison-Wesley (
          <year>1995</year>
          ), http://webdam.inria.fr/Alice/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bench-Capon</surname>
            ,
            <given-names>T.J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tjoa</surname>
          </string-name>
          , A.M. (eds.): DEXA '
          <fpage>99</fpage>
          ,
          <string-name>
            <surname>Florence</surname>
          </string-name>
          , Italy,
          <source>Proc. LNCS</source>
          , vol.
          <volume>1677</volume>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Curino</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moon</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanca</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaniolo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Schema Evolution in Wikipedia - Toward a Web Information System Benchmark</article-title>
          .
          <source>In: Proc. ICEIS</source>
          '
          <volume>08</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          :
          <article-title>Beauty and the beast: The theory and practice of information integration</article-title>
          .
          <source>In: ICDT. Springer LNCS</source>
          , vol.
          <volume>4353</volume>
          , pp.
          <volume>28</volume>
          –
          <fpage>43</fpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Herrmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voigt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rausch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Living in Parallel Realities: Co-Existing Schema Versions with a Bidirectional Database Evolution Language</article-title>
          .
          <source>In: Proc. SIGMOD</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Klettke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Storl, U.,
          <string-name>
            <surname>Shenavai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>NoSQL Schema Evolution and Big Data Migration at Scale</article-title>
          .
          <source>In: IEEE Big Data</source>
          <year>2016</year>
          , Washington DC. IEEE (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kolaitis</surname>
          </string-name>
          , P.G.:
          <article-title>Schema mappings, data exchange, and metadata management</article-title>
          .
          <source>In: PODS</source>
          . pp.
          <volume>61</volume>
          –
          <fpage>75</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Moller, M.L.:
          <article-title>Datenevolutions- und Migrationsstrategien in NoSQL-Datenbanken</article-title>
          .
          <source>In: Grundlagen von Datenbanken. CEUR Workshop Proc.</source>
          , vol.
          <volume>2126</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Moller,
          <string-name>
            <given-names>M.L.</given-names>
            ,
            <surname>Klettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hillenbrand</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , et al.:
          <article-title>Query Rewriting for Continuously Evolving NoSQL Databases (</article-title>
          <year>2019</year>
          ),
          <article-title>accepted for ER2019, Salvador</article-title>
          , Brazil
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Moller,
          <string-name>
            <given-names>M.L.</given-names>
            ,
            <surname>Klettke</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , Storl, U.:
          <article-title>Formal Semantics of NoSQL Evolution Operations under di erent Heterogeneity Levels (</article-title>
          <year>2018</year>
          ),
          <source>Tech. Report</source>
          , Rostock University
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Data profiling revisited</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>42</volume>
          (
          <issue>4</issue>
          ),
          <volume>40</volume>
          –
          <fpage>49</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pichler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skritek</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The Complexity of Evaluating Tuple Generating Dependencies</article-title>
          .
          <source>In: ICDT</source>
          <year>2011</year>
          , Uppsala,
          <year>2011</year>
          . ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Roddick</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          :
          <article-title>Schema Evolution in Database Systems - An Annotated Bibliography</article-title>
          .
          <source>SIGMOD record 21(4)</source>
          ,
          <volume>35</volume>
          –
          <fpage>40</fpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Scherzinger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klettke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Storl, U.:
          <article-title>Managing schema evolution in nosql data stores</article-title>
          .
          <source>Proc. DBPL, CoRR abs/1308.0514</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Schildgen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Deßloch, S.:
          <article-title>Heterogenität überwinden mit der Datentransformationssprache NotaQL</article-title>
          .
          <source>Datenbank-Spektrum</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <volume>5</volume>
          –
          <fpage>15</fpage>
          (Mar
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Storl, U., Muller,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Tekleab</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , et al.:
          <article-title>Curating variational data in application development</article-title>
          .
          <source>In: ICDE</source>
          . pp.
          <volume>1605</volume>
          –
          <fpage>1608</fpage>
          . IEEE Computer Society (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Strong</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.Y.</given-names>
          </string-name>
          :
          <article-title>Data Quality in Context</article-title>
          .
          <source>Commun. ACM</source>
          <volume>40</volume>
          (
          <issue>5</issue>
          ),
          <volume>103</volume>
          –
          <fpage>110</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>