<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributed Datastores: Towards Probabilistic Approach for Estimation of Reliability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kyrylo Rukkas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Galyna Zholtkevych</string-name>
          <email>galynazholtkevych1991@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>V.N. Karazin Kharkiv National University</institution>
          ,
          <addr-line>4 Svobody Sqr., 61022, Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper focuses on the contradiction that follows from Brewer's Conjecture for distributed datastores: the need to deliver high-quality data to the user requires a guarantee of consistency, availability and stability of the system at the same time, but Brewer's Conjecture claims that this is impossible. To overcome this contradiction, the paper suggests estimating statistically the violation of these guarantees. To implement this idea, the interdependencies between the guarantees and indicators of information quality are considered, different kinds of models specifying the general architecture and behaviour of datastores are described, and, finally, the basic metrics of guarantee ensuring are defined. This consideration allows us to formulate several problems that have both theoretical aspects and engineering applications for improving the technology of distributed data processing.</p>
      </abstract>
      <kwd-group>
        <kwd>distributed datastore</kwd>
        <kwd>CAP-theorem</kwd>
        <kwd>information quality</kwd>
        <kwd>statistic metrics</kwd>
        <kwd>simulation</kwd>
        <kwd>DistributedDataWarehousing</kwd>
        <kwd>ConcurrentComputation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Nowadays, any kind of human activity requires an infrastructure to support
efficiently the corresponding process of decision making. A modern answer to such
a requirement is an information system, i.e. an integrated set of components
to collect, store and process data, and to deliver information, knowledge, and
digital products. Today the development trend of Information and Communication
Technology is the wide use of networking technologies. Therefore a typical modern
information system is a distributed data processing system with a distributed
datastore.</p>
      <p>
        It is now generally accepted that a distributed datastore should guarantee
the following properties: consistency (C), availability (A), and partition tolerance
(P). They are discussed in papers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but this discussion is too implicit.
These works were critically reviewed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The criticism is based on the fact
that nobody had so far given explicit and rigorous definitions of these guarantees.
Taking this remark into account, we accept the following understanding as the
starting point of our research:
consistency means that all replicas of any data unit contain the same data at
the same time point;
availability means that a datastore eventually returns a response (either a
success report or a failure notification) to each request;
and, finally, partition tolerance characterises the ability of a datastore to
continue to operate despite arbitrary message losses or failure of part of the
system (sometimes the more general concept fault-tolerance is used to refer to
this ability).
      </p>
      <p>
        In the ideal case a distributed datastore should provide all these guarantees.
In contrast to relational datastores such as SQL databases, which satisfy the ACID
properties and thereby ensure system safety, non-relational datastores do not provide
complete safety for the information system. As is known, Brewer's conjecture [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
sometimes called the CAP theorem, says that it is impossible to maintain
simultaneously all three CAP-guarantees in an asynchronous or partially synchronous
network and to maintain safety in this way. Taking into account that
synchronisation of a distributed datastore essentially decreases system throughput, we study
the consequences of Brewer's conjecture for asynchronous distributed datastores.
      </p>
      <p>A lot of research has been devoted to the consideration of CAP-guarantees.
As it is impossible to implement all three of them, we suggest a new approach:
to characterise stochastically whether each of these requirements is fulfilled or not
(see Sec. 5) and to do the best for the consumer: provide them with the mechanisms used
in different datastores for restoring the validity of the CAP-guarantees.</p>
      <p>This supposition leads us to the need to develop a simulation framework for
supporting numerical experiments to study these mechanisms.</p>
      <p>As we said above, the principal objective of the information system design
process is to provide a consumer with high-quality information just in time.
Therefore we should understand how information quality (IQ) and the CAP-guarantees
are connected. In order to clarify this interconnection, in Sec. 2 we give a model of
information quality and determine those of its indicators that depend on providing the
CAP-guarantees.</p>
      <p>In Sec. 4 we present the conceptual model of a distributed datastore proposed
as a background for the framework to be developed. In Sec. 3 we
briefly discuss the models for maintaining the consistency property in distributed
systems. And, finally, in Sec. 5 we give rigorous definitions for the stochastic criteria
of providing the CAP-guarantees, which are based on the presented conceptual
model.</p>
      <p>Further, to avoid cumbersome formulations, we shall say "CAP-properties"
instead of "ensuring CAP-guarantees".</p>
      <p>2 Interrelations between the IQ Model and CAP-Properties</p>
      <p>
        The problem of identifying an Information Quality (IQ) model was widely described
and discussed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In our opinion, this discussion was compactly
summarized in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The information represented in these sources correlates with
the data quality standards in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Below we give the list of IQ indicators according
to that paper:
accessibility characterises the extent of availability and
fast retrievability of information;
appropriate amount of information denotes whether the scope of
information is appropriate for the task at hand;
believability means that the information is considered true and credible;
completeness evaluates whether information is sufficiently broad and deep to
solve the task at hand;
concise representation characterises the compactness of data representation;
consistent representation means that all the information processes operate
with the same data;
ease-of-manipulation envisages that information is easily manipulated and
applied to different tasks;
free-of-error means that information is correct and reliable;
interpretability characterises that information is in appropriate languages,
symbols and units, and that the definitions are clear;
objectivity denotes that information is considered unbiased and impartial;
relevancy envisages that information is applicable and helpful for the task at
hand;
reputation determines that information should be highly regarded in terms
of its source or content;
security is satisfied if access to information is restricted appropriately in
accordance with the access rights;
timeliness is fulfilled if information is up-to-date for the task at hand;
understandability means that information is well comprehended;
value-added assumes that information is beneficial and provides advantages
from its use.
      </p>
      <p>
        These indicators are used to define different views of an information system.
We stress that the consideration is given from two points of view: product
and service quality. These quality indicators are also classified taking into
account different views, namely, system specifications and consumer expectations.
Therefore they have been grouped in the two-dimensional Table 1 presented below.
(See a more complete description of this classification in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>Focusing on our subject, we consider only those indicators that depend
on CAP-properties. Let us explain the reasons that have motivated us to take
exactly these indicators of the IQ model.</p>
      <p>Obviously, by definition, consistency has a direct impact on the
consistent representation indicator. Also, consistency ensures correct
information, thus the free-of-error indicator directly depends on it. Furthermore, if
consistency is fulfilled, there may be a lot of replicas in the system, which
decreases the concise representation indicator. And one can easily see that
the consistency definition involves up-to-date information, so timeliness
immediately depends on consistency. Consistency does not have an influence
on accessibility, because if consistent information is received, it does not
necessarily mean that it was retrieved quickly and easily.</p>
      <p>Now let us turn to the availability interrelations. Firstly, reasoning from the
availability definition (see Sec. 1), accessibility directly depends on the
availability measurement. Secondly, the free-of-error indicator definition involves
reliable data that is delivered just in time, so it also has a direct dependence on
availability. Evidently, availability does not impact the consistent representation
and concise representation indicators; nor is it tightly connected with the
timeliness indicator. Further in this paper we shall define availability more
strictly: we shall take availability to be the mean time between a request and
the response to it. It follows from that definition that improving
availability increases the speed of retrieving data, but it does not mean that these
data are up-to-date.</p>
      <p>The last group of interrelations concerns partition tolerance. It has an
impact on the consistent representation, free-of-error, accessibility and timeliness
quality indicators. The point is that perfect fulfilment of partition tolerance
ensures a successful response from the distributed datastore. In its turn, a
successful response must always contain consistent, correct, reliable and
up-to-date information; otherwise the system's answer is counted as an error message.
Therefore a greater probability of getting a successful response gives higher values
of the indicators mentioned above.</p>
      <p>Evidently, partition tolerance does not have any influence on the concise
representation indicator, in contrast to consistency.</p>
      <p>We summarize this connection in Table 2.</p>
      <p>The endorsement of some interrelations and the discovery of their behaviour
can be obtained by further experiments. The rest of the indicators of the IQ model
depend on the quality of the information system that provides information
data units, and we are not interested in them.</p>
      <p>3 Brief Overview of Models for Distributed Systems</p>
      <p>Distributed datastores are various, therefore it is necessary to have tools for
their analysis. The simplest classification is related to the ratio of the numbers of read and write
operations. This classification should be reasonable from the point of
view of the three pillars that maintain all distributed systems: consistency,
availability, and partition tolerance. Hence the list is as follows:
- systems with domination of read operations (decision support or retrieval
systems);
- systems with domination of write operations (online transaction processing
systems);
- systems without explicit domination of read or write operations (business
systems).</p>
      <p>This classification is based on the quantity of read and write operations. It is
important for us, because each type of system may require different
mechanisms.</p>
      <p>In this section we describe a way to verify consistency and maintain the
probability of its fulfilment, following [10, Chapter 7]. An appropriate way to
increase this probability is replication: making copies of new data units at each
node and keeping them updated.</p>
      <p>Traditionally, consistency is discussed in the context of read and write
operations in distributed systems. Following the book mentioned above, there are
two consistency models for different distributed systems: data-centric and
client-centric ones. They are applied to different types of distributed systems. The first
one, the data-centric consistency model, is a model for wide usage: the online transaction
and business systems mentioned above. It involves many types of consistency,
such as continuous, sequential, causal and a combined one called grouping
operations. The protocols for these consistency types are more complex than the protocols
in the second model. That is because the data-centric model should be usable for
systems where a lot of write operations may occur and spoil all the data units.
For instance, imagine that two employees in a company use the same datastore.
They need to modify some file. Unfortunately, it turns out that the two
employees download the file from the data storage and modify it in different ways and
in different places. The first one uploads the file back, and later the second one does
the same. The problem is that the file has been modified in different places and
there will be some conflicts. This is the simplest example, but if this datastore is
distributed over many servers, there are a lot of copies of files in the data storage,
and if more people use the same distributed system, this causes a complete mess in
the storage. Therefore the protocols for controlling consistency should prevent
such errors when a lot of write operations may occur.</p>
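      <p>The file-conflict scenario above can be illustrated with optimistic versioning; this is a sketch of our own choosing, not a mechanism from the paper: a write is accepted only if it was based on the replica's current version.</p>

```python
# Illustrative only: optimistic versioning to detect the write-write
# conflict described above (our sketch, not the paper's mechanism).

def make_store():
    # each key maps to a (version, value) pair
    return {}

def read(store, key):
    return store.get(key, (0, None))

def write(store, key, based_on_version, value):
    """Accept the write only if it was based on the current version;
    otherwise report a conflict instead of silently overwriting."""
    current_version, _ = store.get(key, (0, None))
    if based_on_version != current_version:
        return False  # conflict: another write happened in between
    store[key] = (current_version + 1, value)
    return True
```

      <p>Two employees who both read version 0 cannot both write back: the second write is rejected and must be reconciled instead of silently losing the first edit.</p>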
      <p>The second one, the client-centric model, is used for retrieval systems, where
mostly read operations occur and write ones are rare. Thus it is very costly and
not really necessary to use complex protocols. That is why for this model there
exists a type of consistency called eventual consistency, which guarantees for
the distributed system that eventually all the data units become the
same and consistency is fulfilled. Protocols for the client-centric model have
also been developed (see more in [10, Chapter 7]).</p>
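      <p>The idea of eventual consistency can be illustrated by a toy anti-entropy pass in which replicas of one data unit pairwise keep the newer timestamped value; this is a sketch of the idea only, not a protocol from [10].</p>

```python
# Minimal sketch of eventual consistency: each replica of a data unit
# holds a (timestamp, value) pair, and replicas repeatedly reconcile
# pairwise, keeping the newer write. Given enough reconciliation rounds
# and no new writes, all replicas converge to the same value.

def reconcile(replica_a, replica_b):
    """Merge two replicas of one data unit: both keep the newer value."""
    newer = max(replica_a, replica_b)  # compares (timestamp, value) pairs
    return newer, newer

def anti_entropy(replicas, rounds):
    """Pairwise reconciliation passes over a chain of replicas."""
    for _ in range(rounds):
        for i in range(len(replicas) - 1):
            replicas[i], replicas[i + 1] = reconcile(replicas[i], replicas[i + 1])
    return replicas
```
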
      <p>Now that we have described different types of distributed datastores and
established the need to use different models and protocols for the datastores
we focus on, we can go to the next section, where we specify the architecture
and important behavioural elements of a distributed datastore.</p>
      <p>4 Conceptual and Behavioural Model of a Distributed Datastore</p>
      <p>Above we described the generally accepted model of information quality, which
determines the indicators needed for our research purpose, and described the
models that are used for a distributed datastore to satisfy consistency guarantees. But
in order to come to the estimation step for the distributed datastore guarantees,
obviously, we should also discuss the model of a distributed datastore itself.</p>
      <p>Therefore in this section we represent the model of such systems in two views:
structural and behavioural. The structural one is given below (Fig. 1).</p>
      <p>The main component of our system is a distributed datastore. By definition
it is a set of nodes (servers) connected by links. Each node
may have one or more neighbours. That is why this entity is composed of nodes
and links. Obviously, each link is two-sided, and a node entity may participate in many
incidences.</p>
      <p>Every node stores data units as replicas. If a node receives a new data unit,
it finds the data unit with the same identifier and replaces the old replica.</p>
      <p>Nodes are classified into ready, busy and dead ones. If a node is busy or dead,
it ignores all messages, so in this case the request is lost. That is why
the behaviour in this case is trivial.</p>
      <p>The behaviour of a ready node is represented below, in the activity diagram
(see Fig. 2).</p>
      <p>[Fig. 1 class diagram: a DistributedDatastore is composed of Node and Link classes (incidence association); Replicas are allocated to Nodes; DataUnits are replicated into Replicas; a Node receives requests and sends Responses.]</p>
      <p>To present the behaviour in a clear and understandable way, we separate the
more complex handle-request method of a node from the general behavioural view and
show it in Fig. 3.</p>
      <p>5 Distributed Datastores: Basic Prerequisites and Metrics</p>
      <p>As said above, the leading idea of this paper is to suggest an approach for
estimating probabilistic metrics of the CAP-properties.</p>
      <p>It is generally accepted (see [9]) that a distributed datastore is a computer
network where information is stored on more than one node, often in a replicated
fashion.</p>
      <p>Moreover, it is important to mention that a researcher has the possibility to
observe datastore events in the sequence given by physical time, while each
datastore node considers the sequence of events with respect to its local time
only. We also assume that the datastores under study meet the eventual consistency
requirement [10]: after an isolated read request for any data unit,
all its replicas eventually have the same value.</p>
      <p>[Fig. 2, node behaviour: receive request; search for the required dataUnit on the node; if the dataUnit has been found, handle the request, otherwise redirect the request. Fig. 3, handle request: update the required dataUnit on this node.]</p>
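      <p>The ready-node behaviour from Fig. 2 and Fig. 3 can be sketched as follows; the names and the redirect-to-a-neighbour policy are our assumptions for illustration, not taken from the paper.</p>

```python
# Sketch of a ready node handling an update request: search for the
# data unit locally; if found, handle the request (update the replica);
# otherwise redirect to a neighbour that holds a replica. A busy or
# dead node would ignore the message, so the request would be lost.

def handle_request(node, request, neighbours):
    unit_id, new_value = request
    if unit_id in node["replicas"]:
        # dataUnit has been found: update the replica on this node
        node["replicas"][unit_id] = new_value
        return ("handled", node["name"])
    for neighbour in neighbours:
        if unit_id in neighbour["replicas"]:
            # dataUnit has not been found locally: redirect the request
            neighbour["replicas"][unit_id] = new_value
            return ("redirected", neighbour["name"])
    return ("lost", None)  # no replica reachable: the request is lost
```
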
      <p>Let us present a short case study in order to understand the motivation for
applying the probabilistic approach in an asynchronous distributed datastore.</p>
      <p>Let us suppose that there is an unreliable node in a distributed system that
fails and recovers over some time. At the initialisation moment, when the nodes
are established, each node has a probability distribution of its recovery time
and of its time to failure. In a distributed system, the links that bind nodes may also fail.
To be able to calculate the probability of partition tolerance fulfilment, we need
to take these distributions into account in our research.</p>
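      <p>The case study can be sketched as a small simulation that samples a node's time-to-failure and recovery time and estimates the fraction of time the node is alive; the exponential distributions and rates below are illustrative assumptions, not ones given in the paper.</p>

```python
# Sketch: a node alternates between "up" periods (time to failure) and
# "down" periods (recovery time), each drawn from a given distribution.
# We estimate the long-run fraction of time the node is alive.
import random

def uptime_fraction(mean_up, mean_down, cycles, rng):
    up_total, down_total = 0.0, 0.0
    for _ in range(cycles):
        up_total += rng.expovariate(1.0 / mean_up)      # time to failure
        down_total += rng.expovariate(1.0 / mean_down)  # recovery time
    return up_total / (up_total + down_total)
```

      <p>With mean time to failure 9 and mean recovery time 1, the estimate approaches 0.9, i.e. the node is alive about 90% of the time.</p>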
      <p>Now we are ready to describe how we apply the probabilistic approach
in our study by giving rigorous definitions of the CAP-properties. To do this,
we describe a distributed datastore using the following mathematical model.</p>
      <p>Our model is a tuple (N, L, ∂, D, r), where</p>
      <p>N is a finite set whose elements correspond to the nodes of a
distributed datastore;
L is a finite set whose elements correspond to the links of a distributed
datastore;
∂ : L → 2^N is a mapping that associates each link with the two nodes that it
connects;</p>
      <p>D is a finite set whose elements correspond to the stored data units;
r : D → 2^N is a mapping that associates each data unit d with the subset of
nodes that store its replica.</p>
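      <p>As a minimal sketch, the tuple (N, L, ∂, D, r) can be encoded as plain data, with hypothetical element names of our own, together with a small well-formedness check.</p>

```python
# The model tuple (N, L, ∂, D, r) as plain Python data (names are ours).

nodes = {"n1", "n2", "n3"}            # N: nodes of the datastore
links = {"l1", "l2"}                  # L: links between nodes
endpoints = {"l1": {"n1", "n2"},      # ∂ : L → 2^N, two endpoints per link
             "l2": {"n2", "n3"}}
data_units = {"d1", "d2"}             # D: stored data units
replicas = {"d1": {"n1", "n3"},       # r : D → 2^N, replica allocation
            "d2": {"n2"}}

def check_model(nodes, links, endpoints, data_units, replicas):
    """Well-formedness: every link joins exactly two existing nodes and
    every data unit is replicated on at least one existing node."""
    links_ok = all(len(endpoints[l]) == 2 and endpoints[l] <= nodes
                   for l in links)
    repl_ok = all(replicas[d] and replicas[d] <= nodes for d in data_units)
    return links_ok and repl_ok
```
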
      <p>Thus, firstly, let us introduce the consistency metric. Taking into account
that consistency is the property that for each data unit all its replicas have the
same value, we shall consider the probability that all data units in the distributed
datastore are consistent at a certain time moment. Therefore we define the
consistency metric in the following manner.</p>
      <p>Definition 1 (Consistency metric). Let χ_t ∈ {0, 1} be the random variable
that represents one of two events at a time point t ≥ 0:
χ_t = 0 corresponds to the event "there exists a data unit that is not
consistent at the time point t", and
χ_t = 1 corresponds to the event "all data units are consistent at the time
point t".</p>
      <p>Then the consistency metric C(t) at time point t is defined by the formula
C(t) = Pr(χ_t = 1).
(1)</p>
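      <p>The consistency metric of Definition 1 can be estimated by simulation as the fraction of sampled datastore states in which every data unit's replicas coincide; the toy state generator in the test is our assumption, serving only to exercise the estimator.</p>

```python
# Monte Carlo sketch of the consistency metric: estimate the probability
# that all data units are consistent at time t as the fraction of
# simulated runs whose snapshot at t is fully consistent.
import random

def consistent(snapshot):
    """Indicator: 1 run is consistent iff, for each data unit,
    all replica values coincide (snapshot maps unit -> list of values)."""
    return all(len(set(values)) == 1 for values in snapshot.values())

def estimate_consistency(run_snapshot, trials, rng):
    """Fraction of sampled snapshots in which every unit is consistent."""
    hits = sum(consistent(run_snapshot(rng)) for _ in range(trials))
    return hits / trials
```
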
      <p>The second metric estimates the availability guarantee. We propose to
measure the extent of availability as the mean time between two events: the request has
been received by the datastore and the corresponding response 1 has been sent
by the datastore. More formally:
1 This response is either a successful report on the request or an error message.
Definition 2. Let τ_t be the time interval between a request received at the time
point t and the corresponding response. Then the mean response time is defined
by the formula</p>
      <p>T(t) = E[τ_t].
(2)</p>
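      <p>The mean response time of Definition 2 can be estimated as the sample mean of observed request/response intervals; a minimal sketch under that assumption:</p>

```python
# Sample estimate of the mean response time: average the durations
# between each request's receive timestamp and its response timestamp.

def mean_response_time(intervals):
    """intervals: list of (receive_time, respond_time) pairs."""
    durations = [respond - receive for receive, respond in intervals]
    return sum(durations) / len(durations)
```
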
      <p>Finally, to estimate the third guarantee, partition tolerance, we
consider the ability of a datastore to survive network partitions in such a way that the
performance of the datastore does not suffer. This definition is more complicated.
Let us consider some time point t. At this instant some nodes are alive, while other
ones have failed. We denote by N_t^a the set of alive nodes (N_t^a ⊆ N). Similarly, we
denote by L_t^a the set of alive links that connect alive nodes.</p>
      <p>Definition 3. We shall say that a data unit d ∈ D is reachable from a node
n ∈ N_t^a at time point t if there exists a path in the graph (N_t^a, L_t^a) from n to
some n′ ∈ N_t^a ∩ r(d).</p>
      <p>Now we can introduce the metric for partition tolerance using the previous
definition.</p>
      <p>Definition 4. Let π_t ∈ {0, 1} be the random variable that represents one of two
events at a time point t ≥ 0:
π_t = 0 corresponds to the event "there exists a data unit that is not
reachable from some alive node at time point t", and
π_t = 1 corresponds to the event "all data units are reachable from any
alive node at time point t".</p>
      <p>Then the partition tolerance metric P(t) at time point t is defined by the formula
P(t) = Pr(π_t = 1).
(3)</p>
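      <p>The reachability condition of Definitions 3 and 4 can be checked on a concrete alive subgraph by breadth-first search; the adjacency-set encoding of alive links is our choice for this sketch.</p>

```python
# Sketch of Definitions 3-4: a data unit is reachable from an alive node
# if the graph of alive nodes and alive links contains a path from that
# node to some node holding a replica of the unit; the partition
# indicator is 1 iff this holds for every data unit and every alive node.
from collections import deque

def reachable(start, targets, alive_nodes, alive_edges):
    """BFS over the alive subgraph from `start` towards any target node."""
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node in targets:
            return True
        for nxt in alive_edges.get(node, ()):
            if nxt in alive_nodes and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def partition_indicator(alive_nodes, alive_edges, replicas):
    """1 if every data unit is reachable from every alive node, else 0."""
    return int(all(
        reachable(n, set(replicas[d]) & alive_nodes, alive_nodes, alive_edges)
        for d in replicas for n in alive_nodes))
```

      <p>Estimating P(t) then amounts to sampling alive-node/alive-link configurations from the failure distributions discussed above and averaging this indicator.</p>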
      <p>
        6 Conclusion
In this paper we have started studying the problem of quality for distributed
datastores. We have proposed an approach to measuring the quality properties
of a datastore. To this end we have described the CAP-properties and have built a
metric system for CAP-property estimation. We have described the indicators
of information quality and have found interrelations between information
quality and distributed datastores' properties, basing ourselves on [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>In order to have a view of the datastore and be able to work with it, we
have built its structural and behavioural model, and based on this knowledge
we have specified probabilistic characteristics for CAP-property measurement.</p>
      <p>Thus, the following steps towards the problem set in the paper have been done:
- formulating an understanding of what the CAP-guarantees for
distributed datastores really are;
- describing the information quality indicators;
- investigating the connection between the CAP-guarantees and the information
quality model;
- building the structural and behavioural models of a distributed datastore;
- forming "CAP-metrics" as the main tool for studying the quality of
distributed datastores.</p>
      <p>These steps enable us to study the fulfilment of the CAP-properties using
the following background: fault-tolerance mechanisms for asynchronous
systems, concurrent programming, algorithms of data propagation in distributed
systems, possibly some issues of internet strategy, Queueing Theory, and
Mathematical Statistics. Fault-tolerance protocols can be used to devise algorithms
for maintaining the partition tolerance guarantee. Obviously, a researcher should
have a good background in Graph Theory, Probability Theory and Concurrent
Programming in asynchronous distributed systems. Also, as we need to
present experimental results, it is necessary to simulate the model of a distributed
datastore.</p>
      <p>Summing up, we can now set the following problems:
- to simulate the model of a distributed datastore;
- to study different mechanisms for estimating the degree of ensuring each
guarantee, applying the specified metrics;
- to analyse the discovered mechanisms and their composition;
- to integrate these mechanisms with a simulated asynchronous distributed
datastore;
- to estimate theoretically the total time complexity of all provided
mechanisms;
- to carry out the experimental estimation.</p>
      <p>Acknowledgment. The authors express deep gratitude to Grygoriy Zholtkevych
and Iryna Zaretska for their help and their criticism.</p>
      <p>9. Pessach, Ya.: Distributed Storage: Concepts, Algorithms, and Implementations.
CreateSpace Independent Publishing Platform (2013)
10. Tanenbaum, A.S., Van Steen, M.: Distributed Systems: Principles and Paradigms
(2nd Edition). Prentice-Hall, Inc. (2006)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brewer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>CAP Twelve Years Later: How the "Rules" Have Changed</article-title>
          .
          <source>Computer, IEEE Computer Society</source>
          . Vol.
          <volume>45</volume>
          , No.
          <issue>2</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Burgess</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deconstructing the 'CAP theorem' for CM and DevOps</article-title>
          . http://markburgess.org/blog_cap.html (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. CRG. Information Quality Survey: Administrators Guide. Cambridge Research Group, Cambridge, MA (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>English</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          : Information Quality Applied:
          <article-title>Best Practices for Improving Business Information, Processes and Systems</article-title>
          . Wiley (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lynch</surname>
          </string-name>
          , N.:
          <article-title>Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services</article-title>
          .
          <source>ACM SIGACT News</source>
          . Vol.
          <volume>33</volume>
          , No.
          <issue>2</issue>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>59</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. ISO 8000-1:
          <fpage>2011</fpage>
          .
          <string-name>
            <given-names>Data</given-names>
            <surname>Quality</surname>
          </string-name>
          .
          <source>International Organisation for Standardization</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kahn</surname>
            ,
            <given-names>B.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strong</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , R.Y.:
          <article-title>Information Quality Benchmarks: Product and Service Performance</article-title>
          .
          <source>Comm. ACM</source>
          . Vol.
          <volume>45</volume>
          , No.
          <issue>4</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kovac</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pipino</surname>
            ,
            <given-names>L.L.</given-names>
          </string-name>
          :
          <article-title>Total data quality management: the case of IRI</article-title>
          .
          <source>Conf. on Information Quality</source>
          (Cambridge, MA), pp.
          <fpage>63</fpage>
          -
          <lpage>79</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>