    Distributed Datastores: Towards Probabilistic
       Approach for Estimation of Reliability

                    Kyrylo Rukkas and Galyna Zholtkevych

                     V.N. Karazin Kharkiv National University
                     4, Svobody Sqr., 61022, Kharkiv, Ukraine
                        galynazholtkevych1991@gmail.com



      Abstract. This paper focuses on the contradiction that follows from
      Brewer's Conjecture for distributed datastores: delivering high-quality
      data to the user requires guaranteeing consistency, availability and
      partition tolerance of the system at the same time, but Brewer's
      Conjecture claims that this is impossible. To overcome this contradiction,
      the paper suggests estimating statistically the violation of these
      guarantees. To implement this idea, the interdependencies between the
      guarantees and indicators of information quality are considered, different
      kinds of models specifying the general architecture and behaviour of
      datastores are described, and, finally, the basic metrics of guarantee
      fulfilment are defined. This consideration allows us to formulate several
      problems that have both theoretical aspects and engineering applications
      for the improvement of the technology of distributed data processing.

      Keywords: distributed datastore, CAP-theorem, information quality,
      statistical metrics, simulation
      Key Terms: DistributedDataWarehousing, ConcurrentComputation


1    Introduction

Nowadays, any kind of human activity requires an infrastructure to efficiently
support the corresponding process of decision making. The modern answer to this
requirement is an information system: an integrated set of components to
collect, store and process data, and to deliver information, knowledge, and
digital products. Today the development trend in Information and Communication
Technology is the wide use of networking technologies. Therefore a typical
modern information system is a distributed data processing system with a
distributed datastore.
    Now it is generally accepted that a distributed datastore should guarantee
the following properties: consistency (C), availability (A), and partition
tolerance (P). They are discussed in papers [5] and [1], but this discussion is
too implicit. These works were critically reviewed in [2]. This criticism is
based on the fact that nobody has so far given explicit and rigorous definitions
of these guarantees. Taking this remark into account, we accept the following
understanding as the starting point of our research:
consistency means that all replicas of any data unit contain the same data at
    the same time point;
availability means that a datastore eventually returns a response (either a
    success report or a failure notification) to each request;
and finally, partition tolerance characterizes the ability of a datastore to
    continue to operate despite arbitrary message losses or failure of part of
    the system (sometimes the more general concept of fault tolerance is used
    to refer to this ability).
    In the ideal case a distributed datastore should provide all these
guarantees. In contrast to relational datastores such as SQL databases, which
satisfy the ACID properties and thus ensure system safety, non-relational
datastores do not provide complete safety for the information system. As is
known, Brewer's conjecture [5], sometimes called the CAP theorem, states that it
is impossible to maintain all three CAP-guarantees simultaneously in an
asynchronous or partially synchronous network and maintain safety in this way.
Taking into account that synchronization of a distributed datastore essentially
decreases system throughput, we study the consequences of Brewer's conjecture
for asynchronous distributed datastores.
    A lot of research has been devoted to the consideration of the
CAP-guarantees. As it is impossible to implement all three of them, we suggest a
new approach: to characterise stochastically whether each of these requirements
is fulfilled (see Sec. 5) and to do the best for a consumer: provide him with
the mechanisms used in different datastores for restoring the validity of the
CAP-guarantees.
    This supposition leads us to the need to develop a simulation framework for
supporting numerical experiments to study these mechanisms.
    As we said above, the principal objective of an information system design
process is to provide a consumer with high-quality information just in time.
Therefore we should understand how information quality (IQ) and the
CAP-guarantees are connected. In order to clarify this interconnection, we give
a model of information quality and determine its indicators that depend on
providing the CAP-guarantees in Sec. 2.
    In Sec. 3 we briefly discuss the models for maintaining the consistency
property in distributed systems. In Sec. 4 we present the conceptual model of a
distributed datastore proposed as a background for the framework that should be
developed. And finally, in Sec. 5 we give rigorous definitions for the
stochastic criteria of providing the CAP-guarantees, based on the presented
conceptual model.
    Further, to avoid cumbersome formulations, we shall say "CAP-properties"
instead of "ensuring the CAP-guarantees".


2   Interrelations between IQ model and CAP-properties

The problem of identifying an Information Quality (IQ) model was widely
described and discussed in [3], [8], [4]. In our opinion, this discussion was
compactly summarized in [7]. The information represented in these sources
correlates with the data quality standards in [6]. Below we give the list of IQ
indicators according to that paper:
accessibility is the indicator that characterises the extent of availability and
    fast retrievability of information;
appropriate amount of information denotes whether the scope of infor-
    mation is appropriate for the task at hand;
believability means that the information is considered as true and credible;
completeness evaluates whether information is sufficiently broad and deep to
    solve the task at hand;
concise representation characterises the compactness of data representation;
consistent representation means that all the information processes operate
    with the same data;
ease-of-manipulation envisages that information is easily manipulated and
    applied to different tasks;
free-of-error means that information is correct and reliable;
interpretability means that information is in appropriate languages,
    symbols and units, and the definitions are clear;
objectivity denotes that information is considered as unbiased and impartial;
relevancy envisages that information is applicable and helpful for the task at
    hand;
reputation determines that information should be highly regarded in terms
    of its source or content;
security is satisfied if the access to information is restricted appropriately in
    accordance with the access rights;
timeliness is fulfilled if information is up-to-date for the task at hand;
understandability means that information is well-comprehended;
value-added assumes that information is beneficial and provides advantages
    from its use.

These indicators are used to define different views of an information system.
We stress that the consideration is given from two points of view: product
quality and service quality. These quality indicators are also classified from
different views, namely, system specifications and consumer expectations.
Therefore they have been grouped in the two-dimensional Table 1 presented below
(see [7] for a more complete description of this classification).
    Focusing on our subject, we consider only those indicators that depend on
the CAP-properties. Let us explain the reasons that motivated us to take exactly
these indicators of the IQ model.
    Obviously, by definition, consistency has a direct impact on the consistent
representation indicator. Also, consistency ensures correct information; thus
the free-of-error indicator directly depends on it. Furthermore, maintaining
consistency requires keeping many replicas in the system, which decreases the
concise representation indicator. One can also easily see that the consistency
definition involves up-to-date information, so timeliness immediately depends on
consistency. Consistency does not have an influence
                        Table 1. Information Quality Model

                  Conforms to Specifications     Meets or Exceeds
                                                 Consumer Expectations
Product Quality   Sound Information:             Useful Information:
                  – Free-of-Error                – Appropriate Amount
                  – Concise Representation       – Relevancy
                  – Completeness                 – Understandability
                  – Consistent Representation    – Interpretability
                                                 – Objectivity
Service Quality   Dependable Information:        Usable Information:
                  – Timeliness                   – Believability
                  – Security                     – Accessibility
                                                 – Ease-of-Manipulation
                                                 – Reputation
                                                 – Value-Added


on the accessibility indicator, because if consistent information is received,
it does not necessarily mean that it was retrieved quickly and easily.
    Now let us turn to the availability interrelations. Firstly, reasoning from
the availability definition (see Sec. 1), accessibility directly depends on the
availability measurement. Secondly, the free-of-error indicator definition
involves reliable data that is delivered just in time, so it also directly
depends on availability. Evidently, availability does not impact the consistent
representation and concise representation indicators; neither is it tightly
connected with the timeliness indicator. Further in this paper we shall define
availability more strictly: as the mean time between a request and the response
to it. It follows from that definition that improving availability increases the
speed of retrieving data, but it does not mean that these data are up-to-date.
    The last group of interrelations concerns partition tolerance. It has an
impact on the consistent representation, free-of-error, accessibility and
timeliness quality indicators. The point is that perfect fulfilment of partition
tolerance ensures a successful response from a distributed datastore. In its
turn, a successful response must always contain consistent, correct, reliable
and up-to-date information; otherwise the system's answer is counted as an error
message. Therefore a greater probability of getting a successful response yields
higher values of the indicators mentioned above.
    Evidently, partition tolerance does not have any influence on the concise
representation indicator, in contrast to consistency.
    We summarize this connection in Table 2.
    Confirmation of some of these interrelations and the discovery of their
behaviour can be obtained by further experiments. The remaining indicators of
the IQ model
        Table 2. Interrelations between IQ Indicators and CAP-requirements

                             Consistency    Availability    Partition Tolerance
Consistent representation         +                                  +
Free-of-Error                     +              +                   +
Concise representation            +
Accessibility                                    +                   +
Timeliness                        +                                  +



depend on the quality of the information system that provides data units, and
we are not interested in them here.


3   Brief Overview of Models for Distributed Systems

Distributed datastores are diverse; therefore it is necessary to have tools for
their analysis. The simplest classification is related to the ratio of read to
write operations. This classification should be reasonable from the point of
view of the three pillars that maintain all distributed systems: consistency,
availability, and partition tolerance. Hence the list is as follows:
 – Systems with domination of read operations (decision support or retrieval
   systems);
 – Systems with domination of write operations (online transaction processing
   systems);
 – Systems without explicit domination of read or write operations (business
   systems).
This classification is based on the quantity of read and write operations. It is
important for us because each type of system may require different mechanisms.
    In this section we describe the way to verify consistency and maintain the
probability of its fulfilment, following [10, Chapter 7]. An appropriate way to
increase this probability is replication: making copies of new data units at
each node and keeping them updated.
    Traditionally, consistency is discussed in the context of read and write
operations in distributed systems. Following the book mentioned above, there are
two consistency models for different distributed systems: the data-centric and
client-centric ones. They are applied to different types of distributed systems.
The first one, the data-centric consistency model, is a model for wide usage:
the online transaction and business systems mentioned above. It involves many
types of consistency, such as continuous, sequential, causal, and a combined one
called grouping operations. The protocols for these consistency types are more
complex than the protocols in the second model. That is because the data-centric
model should be usable for systems where a lot of write operations may occur and
corrupt the data units. For instance, imagine that two employees in a company
use the same datastore. They need to modify some file. Unfortunately, it turns
out that the two employees download the file from the data storage and modify it
in different ways and in different places. The first one uploads the file back
and later the second one does the same. The problem is that the file has been
modified in different places, so there will be conflicts. This is the simplest
example, but if this datastore is distributed over many servers, there are many
copies of files in the data storage, and if more people use the same distributed
system, it causes a complete mess in the storage. Therefore the protocols for
controlling consistency should prevent such errors when a lot of write
operations may occur; a minimal sketch of this lost-update scenario is given
below.
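
    The following Python fragment illustrates the scenario just described. It is
a deliberately naive single-store model of our own; the file name and values are
purely illustrative.

    store = {"report.txt": "v0"}          # shared datastore with one file

    copy_a = store["report.txt"]          # employee A downloads the file
    copy_b = store["report.txt"]          # employee B downloads the same file

    copy_a += " + A's changes"            # A modifies it in one place
    copy_b += " + B's changes"            # B modifies it in another place

    store["report.txt"] = copy_a          # A uploads the file back
    store["report.txt"] = copy_b          # B uploads later: A's changes are lost

    assert "A's changes" not in store["report.txt"]   # the lost-update conflict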
    The second one, the client-centric model, is used for retrieval systems,
where mostly read operations occur and write ones are rare. Thus it is costly
and unnecessary to use complex protocols there. That is why for this model there
exists a type of consistency called eventual consistency, which guarantees that
eventually all the replicas of a data unit become the same and consistency is
fulfilled. Protocols for the client-centric model have also been devised (see
more in [10, Chapter 7]); a sketch of eventual convergence follows this
paragraph.
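
    As an illustration only (not a protocol from [10]), the following Python
sketch shows how replicas can converge under a timestamped last-writer-wins
exchange; all names and the pairing scheme here are our own assumptions.

    import itertools

    # three replicas of the same store: key -> (timestamp, value)
    replicas = [{"x": (0, "old")} for _ in range(3)]

    def write(replica, key, ts, value):
        replica[key] = (ts, value)        # local write with a logical timestamp

    def anti_entropy(a, b):
        # both replicas keep the newest version of every key they know about
        for key in set(a) | set(b):
            newest = max(a.get(key, (-1, None)), b.get(key, (-1, None)))
            a[key] = b[key] = newest

    write(replicas[0], "x", 1, "new")     # an update reaches one replica only
    for a, b in itertools.combinations(replicas, 2):
        anti_entropy(a, b)                # pairwise exchange rounds

    assert all(r["x"] == (1, "new") for r in replicas)   # eventual consistency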
    Having described the different types of distributed datastores and
established the need for different models and protocols for the datastores we
focus on, we can proceed to the next section, where we specify the architecture
and the important behavioural elements of a distributed datastore.


4   Conceptual and Behavioural Model of a Distributed
    Datastore

Above we described the generally accepted model of information quality that
determines the indicators needed for our research purpose, and the models used
by a distributed datastore to satisfy the consistency guarantees. But in order
to come to the estimation step for the distributed datastore guarantees, we
should obviously also discuss the model of a distributed datastore itself.
    Therefore in this section we represent the model of such systems in two
views: a structural and a behavioural one. The structural view is given below
(Fig. 1).
    The main component of our system is the distributed datastore. By
definition it is a set of nodes (servers) connected by links. Each node may have
one or more neighbours. That is why this entity is composed of nodes and links.
Obviously, each link is two-sided, and a node may be incident to many links.
    Every node stores data units as replicas. If a node receives a new data
unit, it finds the data unit with the same identifier and replaces the old
replica.
    Nodes are classified into ready, busy and dead ones. If a node is busy or
dead, it ignores all messages, so in this case the request will be lost. That is
why the behaviour in this case is trivial.
    The behaviour of a ready node is represented in the activity diagram below
(see Fig. 2).
    [Fig. 1. Conceptual model: a class diagram. A DistributedDatastore receives
Requests and sends Responses (one Response to each Request); it is composed of
Nodes (2..m) and Links (1..n), where the incidence association connects each
Link with exactly 2 Nodes. Each Node allocates one or more Replicas, and the
replication association relates one or more Replicas to each DataUnit.]


    To present the behaviour in a clear and understandable way, we separate the
more complex handle request method of a node from the general behavioural view
and show it in Fig. 3.
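
    To make the activity diagrams in Figs. 2 and 3 concrete, here is a minimal
Python sketch of a node under our own assumptions: the class and field names are
illustrative, the broadcast is simplified to direct neighbours, and "send
response" is collapsed into the return value.

    class Node:
        def __init__(self, node_id, neighbours):
            self.node_id = node_id
            self.neighbours = neighbours        # nodes reachable via links
            self.replicas = {}                  # dataUnit id -> value
            self.state = "ready"                # ready | busy | dead

        def receive(self, request, visited=None):
            visited = set() if visited is None else visited
            if self.state != "ready" or self.node_id in visited:
                return None                     # busy/dead nodes ignore messages
            visited.add(self.node_id)
            if request["unit"] in self.replicas:
                return self.handle(request)     # dataUnit found: handle request
            for n in self.neighbours:           # dataUnit not found: redirect
                response = n.receive(request, visited)
                if response is not None:
                    return response
            return {"status": "failure"}

        def handle(self, request):
            unit = request["unit"]
            if request["kind"] == "write":
                self.replicas[unit] = request["value"]   # update required dataUnit
                self.broadcast_update(unit)     # propagate to dataUnit replicas
                return {"status": "success"}
            return {"status": "success", "value": self.replicas[unit]}   # read

        def broadcast_update(self, unit):
            for n in self.neighbours:           # simplified: neighbours only
                if unit in n.replicas:
                    n.replicas[unit] = self.replicas[unit]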


5   Distributed Datastores: Basic Prerequisites and
    Metrics

As said above, the leading idea of this paper is to suggest an approach for
estimating probabilistic metrics of the CAP-properties.
    It is generally accepted (see [9]) that a distributed datastore is a
computer network where information is stored on more than one node, often in a
replicated fashion.
    Moreover, it is important to mention that a researcher has the possibility
to observe datastore events in a sequence according to physical time, while each
datastore node considers the sequence of events with respect to its local time
only.
    [Fig. 2. Behaviour of node: an activity diagram. The node receives a
request and searches for the required dataUnit on the node; if the dataUnit has
been found, the node handles the request; if the dataUnit has not been found,
the node redirects the request.]


    [Fig. 3. Behaviour of the handle request method: an activity diagram. For a
write request the node updates the required dataUnit on this node, sends the
response and broadcasts the update to all dataUnit replicas; for a read request
the node reads the required dataUnit and sends the response.]
    Also we assume that the datastores under study meet the eventual
consistency requirement [10]. It means that after an isolated update of any data
unit, all its replicas eventually have the same value.
    Let us present a short case study in order to understand the motivation for
applying the probabilistic approach to an asynchronous distributed datastore.
    Suppose that there is an unreliable node in a distributed system that fails
and recovers over time. At the initialization moment, when the nodes are
established, each node is assigned a probability distribution of its recovery
time and its time to failure. In a distributed system, the links that bind the
nodes may also fail. To be able to calculate the probability of partition
tolerance fulfilment, we need to take these distributions into account in our
research; a sketch of such a failure/recovery timeline is given below.
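
    For instance, one might draw the failure and recovery times from exponential
distributions, as in the following Python sketch; the distributions, parameters
and function names are purely our assumption, since the paper leaves them
unspecified.

    import random

    def node_timeline(mean_uptime=100.0, mean_downtime=5.0, horizon=1000.0):
        """Yield (time, new_state) events for one unreliable node."""
        t, alive = 0.0, True
        while t < horizon:
            mean = mean_uptime if alive else mean_downtime
            t += random.expovariate(1.0 / mean)   # exponential holding time
            alive = not alive
            yield t, "alive" if alive else "failed"

    for event in node_timeline(horizon=300.0):
        print(event)                              # e.g. (87.3, 'failed'), ...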
    Now we are ready to describe how we apply the probabilistic approach in our
study by giving rigorous definitions of the CAP-properties. To do this, we
describe a distributed datastore using the following mathematical model.
    Our model is a tuple (N, L, ∂, D, r), where
 – N is a finite set whose elements correspond to the nodes of a distributed
   datastore;
 – L is a finite set whose elements correspond to the links of a distributed
   datastore;
 – ∂ : L → 2^N is a mapping that associates each link with the two nodes that
   it connects;
 – D is a finite set whose elements correspond to the stored data units;
 – r : D → 2^N is a mapping that associates each data unit d with the subset of
   nodes that store its replicas.
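
    One straightforward way to encode this tuple for simulation purposes is
sketched below in Python; the field names are our own, since the paper defines
only the abstract structure.

    from dataclasses import dataclass
    from typing import Dict, FrozenSet, Set

    @dataclass
    class DatastoreModel:
        nodes: Set[str]                    # N: the set of nodes
        links: Dict[str, FrozenSet[str]]   # L together with the map ∂ : L → 2^N
        data_units: Set[str]               # D: the stored data units
        replicas: Dict[str, Set[str]]      # r : D → 2^N, nodes holding replicas

    model = DatastoreModel(
        nodes={"n1", "n2", "n3"},
        links={"l1": frozenset({"n1", "n2"}), "l2": frozenset({"n2", "n3"})},
        data_units={"d1"},
        replicas={"d1": {"n1", "n3"}},
    )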
    Thus, firstly, let us introduce the consistency metric. Taking into account
that consistency is the property that for each data unit all its replicas have
the same value, we shall consider the probability that all data units in the
distributed datastore are consistent at a certain time moment. Therefore we
define the consistency metric in the following manner.
Definition 1 (Consistency metric). Let σ_t ∈ {0, 1} be the random variable
that represents one of two events at time point t ≥ 0:
    σ_t = 0     corresponds to the event "there exists a data unit that is not
                consistent at the time point t" and
    σ_t = 1     corresponds to the event "all data units are consistent at the
                time point t".
Then the consistency metric C(t) at time point t is defined by the formula

                                  C(t) = Pr(σ_t = 1) .                          (1)
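
    In a simulation framework, C(t) can be estimated as the fraction of
independent runs in which σ_t = 1, as in this Python sketch; simulate_until and
its state interface are hypothetical placeholders for the framework yet to be
built.

    def estimate_consistency(simulate_until, t, runs=1000):
        """Monte Carlo estimate of C(t) = Pr(sigma_t = 1), formula (1)."""
        hits = 0
        for _ in range(runs):
            state = simulate_until(t)    # one random run of the datastore up to t
            consistent = all(            # sigma_t = 1 iff every data unit's
                len({state.value(node, d)               # replicas all hold
                     for node in state.replica_nodes(d)}) == 1   # one value
                for d in state.data_units()
            )
            hits += consistent
        return hits / runs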

    The second metric estimates the availability guarantee. We propose to
measure the extent of availability as the mean time between two events: the
request has been received by the datastore and the corresponding response
(either a successful report on the request or an error message) has been sent by
the datastore. More formally:
Definition 2. Let τ_t be the time interval between the receipt of a request at
the time point t and the corresponding response. Then the mean response time is
defined by the formula
                                  T(t) = E[τ_t] .                               (2)
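
    Correspondingly, T(t) can be estimated as a sample mean over repeated
simulated requests; simulate_request is again a hypothetical hook of our own.

    def estimate_mean_response_time(simulate_request, t, runs=1000):
        """Sample-mean estimate of T(t) = E[tau_t], formula (2)."""
        samples = [simulate_request(t) for _ in range(runs)]   # one tau_t per run
        return sum(samples) / len(samples)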
    Finally, to estimate the third guarantee, partition tolerance, we consider
the ability of a datastore to survive network partitions in such a way that the
performance of the datastore does not suffer. This definition is more
complicated. Let us consider some time point t. At this instant some nodes are
alive, but other ones have failed. We denote by N_t^a the set of alive nodes
(N_t^a ⊂ N). Similarly, we denote by L_t^a the set of alive links that connect
alive nodes.
Definition 3. We shall say that a data unit d ∈ D is reachable from a node
n ∈ N_t^a at time point t if there exists a path in the graph (N_t^a, L_t^a)
from n to some n′ ∈ N_t^a ∩ r(d).
Now we can introduce the metric for partition tolerance using the previous
definition.
Definition 4. Let ζ_t ∈ {0, 1} be the random variable that represents one of
two events at time point t ≥ 0:
    ζ_t = 0     corresponds to the event "there exists a data unit that is not
                reachable from some alive node at time point t" and
    ζ_t = 1     corresponds to the event "all data units are reachable from any
                alive node at time point t".
Then the partition tolerance metric P(t) at time point t is defined by the formula

                                  P(t) = Pr(ζ_t = 1) .                          (3)
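
    The event ζ_t = 1 can be checked on a simulated state by a breadth-first
search over the alive subgraph; P(t) is then estimated, like C(t), as the
fraction of runs in which the check succeeds. A sketch under the same
hypothetical interfaces as above (links are modelled as two-node sets, matching
the encoding of ∂):

    from collections import deque

    def reachable(alive_nodes, alive_links, start, replica_nodes):
        """Is some node of replica_nodes reachable from start in (N_t^a, L_t^a)?"""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            if node in replica_nodes:
                return True
            for endpoints in alive_links:          # each link is a 2-node set
                if node in endpoints:
                    for nxt in endpoints - {node}:
                        if nxt in alive_nodes and nxt not in seen:
                            seen.add(nxt)
                            queue.append(nxt)
        return False

    def zeta(alive_nodes, alive_links, replicas):
        """zeta_t = 1 iff every data unit is reachable from every alive node."""
        return int(all(reachable(alive_nodes, alive_links, n, r)
                       for n in alive_nodes for r in replicas.values()))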


6   Conclusion

In this paper we have started studying the problem of quality for distributed
datastores. We have proposed an approach to measuring the quality properties of
a datastore: we have described the CAP-properties and built a system of metrics
for their estimation. We have described the indicators of information quality
and found the interrelations between information quality and the properties of
distributed datastores, based on [7].
    In order to have a view of the datastore and be able to work with it, we
have built its structural and behavioural models, and based on this knowledge,
we have specified probabilistic characteristics for measuring the
CAP-properties.
    Thus, the following steps towards solving the problem set in the paper have
been taken:
 – Formulating an understanding of what the CAP-guarantees for distributed
   datastores really are;
 – Describing the information quality indicators;
 – Investigating the connection between the CAP-guarantees and the information
   quality model;
 – Building the structural and behavioural models for a distributed datastore;
 – Forming the "CAP-metrics" as the main tool for studying the quality of
   distributed datastores.
    These steps give us the possibility to study the fulfilment of the
CAP-properties using the following background: fault-tolerance mechanisms for
asynchronous systems, concurrent programming, algorithms of data propagation in
distributed systems, possibly some issues of internet strategy, Queueing Theory,
and Mathematical Statistics. Fault-tolerance protocols can be used to devise
algorithms for maintaining the partition tolerance guarantee. Obviously, a
researcher should have a good background in Graph Theory, Probability Theory and
Concurrent Programming in asynchronous distributed systems. Also, as we need to
present experimental results, it is necessary to simulate the model of a
distributed datastore.
    Summing up, we can now set the following problems:
 – To simulate the model of a distributed datastore;
 – To study different mechanisms to estimate the degree of ensuring each
   guarantee, applying the specified metrics;
 – To analyse the discovered mechanisms and their compositions;
 – To integrate these mechanisms with the simulated asynchronous distributed
   datastore;
 – To estimate theoretically the total time complexity of all the provided
   mechanisms;
 – To carry out the experimental estimation.

Acknowledgment. The authors express deep gratitude to Grygoriy Zholtkevych
and Iryna Zaretska for their help and their criticism.


References
 1. Brewer, E.: CAP Twelve Years Later: How the "Rules" Have Changed. Computer,
    IEEE Computer Society. Vol. 45, No. 2 (2012)
 2. Burgess, M.: Deconstructing the 'CAP theorem' for CM and DevOps. http://
    markburgess.org/blog_cap.html (2012)
 3. CRG. Information Quality Survey: Administrators Guide. Cambridge Research
    Group, Cambridge, MA (1997)
 4. English, L.P.: Information Quality Applied: Best Practices for Improving Business
    Information, Processes and Systems. Wiley (2009)
 5. Gilbert, S., Lynch, N.: Brewer's Conjecture and the Feasibility of Consistent,
    Available, Partition-Tolerant Web Services. ACM SIGACT News. Vol. 33, No. 2,
    pp. 51–59 (2002)
 6. ISO 8000-1:2011. Data Quality. International Organization for Standardization
    (2011)
 7. Kahn, B.K., Strong, D.M., Wang, R.Y.: Information Quality Benchmarks: Product
    and Service Performance. Comm. ACM. Vol. 45, No. 4 (2002)
 8. Kovac, R., Lee, Y.W., Pipino, L.L.: Total data quality management: the case of
    IRI. Conf. on Information Quality (Cambridge, MA), pp. 63–79 (1997)
 9. Pessach, Ya.: Distributed Storage: Concepts, Algorithms, and Implementations.
    CreateSpace Independent Publishing Platform (2013)
10. Tanenbaum, A.S., Van Steen, M.: Distributed systems. Principles and Paradigms
    (2nd Edition). Prentice-Hall, Inc. (2006)