<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Simplifying Simulation of Distributed Datastores Based on Statistical Estimating CAP-Constraint Violation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Sobol</string-name>
          <email>viktor.pdt@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Mathematics and Computer Science, V.N. Karazin Kharkiv National University 4</institution>
          ,
          <addr-line>Svobody Sqr., Kharkiv, 61022</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>42</fpage>
      <lpage>49</lpage>
      <abstract>
        <p>Running software in a distributed manner is a common practice nowadays. This approach produces a lot of new challenges which should be thought in advance. This paper is the next step on understanding systems such as distributed datastores by using statistical estimation for violation of guarantees from Brewer's Conjecture. The paper focuses on finding ways for simplification of theoretical and practical modelling of a system of a distributed datastore. Considering that real-world distributed datastore consists of nodes with a different distribution of fail and recovery time it is proposed to substitute every node of distributed datastore with nodes with one common distribution of fail and one common distribution of recovery time. The verification of the approach is done by modelling systems and statistically comparing their violation of guarantees from Brewer's Conjecture. The results allow us to define cases where we can substitute one system with another without losing perception of its behaviour.</p>
      </abstract>
      <kwd-group>
        <kwd>Distributed datastore</kwd>
        <kwd>Partition-tolerance</kwd>
        <kwd>CAP-theorem</kwd>
        <kwd>statistic metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Modern databases store data with multiple copies and often replicate it to
different geographical regions to support the efficient decision-making process.
However, processing and storing data in a distributed environment produce a lot of
new challenges and limitations which didn’t exist in a single-computer world.</p>
      <p>
        Some of those limitations were formalized as Brewer’s Conjecture [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or more
often it is referred as CAP-theorem. CAP-theorem is often used as a tool to
reason about trade-offs in practical systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However such reasoning provides
some ambiguities and was reviewed by Martin Kleppmann [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] showing that it
isn’t enough to justify system behaviour by relying only on CAP theorem and
he recommends to avoid using it for making design decisions.
      </p>
      <p>
        Paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes an alternative probabilistic approach to measure CAP
properties of a distributed datastore. The proposed way opens up the
opportunities for studying CAP properties fulfilment to give a good assumption about
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
the behaviour of practical systems. As authors noted, for future research of
the proposed approach it is needed to represent experimental results hence the
model of a distributed datastore has to be simulated. From the definition,
distributed datastore consists of a set of nodes connected by links. Each node has
the probabilistic distribution of recovering time and time of failure. The
mentioned above consideration was taken from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Real-world datastores consist of
different nodes following from that the distribution of fail and recovery time is
different for every node. The way of simplification explored in this paper is to
simulate a system of a distributed datastore with nodes having the same
probabilistic distribution of fail and recovery time and compare statistical estimation
for violation with a system which consists of nodes with different characteristics
and to define cases where mentioned substitution won’t change the whole system
behaviour.
      </p>
      <p>Obviously links that bind nodes may fail too but this is outside of the scope
of the next analysis and it is assumed that every link if always works correctly.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Distributed datastore model</title>
      <p>
        To conduct a simulation mathematical model of a distributed datastore defined
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was taken. The model by itself is a tuple (N, L, ∂, D, r), where:
N
L
∂ : L → 2N
D
r : D → 2N
is a finite set whose elements corresponds to nodes of a
distributed datastore;
is a finite set whose elements corresponds to links of a
distributed datastore;
is a mapping which associates each link with two nodes that
it connects;
is a finite set whose elements corresponds to stored data units;
is a mapping which associates each data unit with a subset of
nodes that store its replica;
      </p>
      <p>
        Experiments in this paper focus on estimation of the third guarantee from
Brewer’s Conjecture — partition tolerance. The metric used to estimate partition
tolerance is based on Definition 4 in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as a random variable that represent two
possible events at a time point t &gt; 0
ζt = 0
ζt = 1
there exist data unit which is not reachable by some node from
subset of alive nodes;
all data units are reachable from any alive node;
      </p>
      <p>Figure 1 is an example of ζt = 1 as clearly every alive node is able to reach any
other alive node (a non-alive node is represented by a red-filled circle). Figure 2
is an example of ζt = 0. As node 1 is not able to reach any other alive node.</p>
      <p>It is assumed that the distribution of fail and recovery time for every node
obeys Poisson distribution. As we can treat time in our experiments as a discrete
quantity, events — failure and recovery as a series of events with the known
average time but the exact time is random. From practical examples, it may be
the hardware failures, periodic releases of newer versions of software, etc. Saying
that some node has a different distribution comparing with some other node it
is understood as both of them are having Poisson distribution but parameters
of Poisson distribution are different.</p>
      <p>In order, to simplify future simulating of a distributed datastore and
theoretical reasoning it is proposed to substitute probabilistic distribution of every
node with one common distribution. Empirical results to answer a question —
when ζt for a model of distributed datastore with every node having a different
probabilistic distribution of fail and recovery time is not statistically different
from the model with nodes having the same distribution are presented below.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Experiment description</title>
      <p>As was mentioned above in the set of experiments, time is considered as a discrete
quantity.</p>
      <p>Required steps before every experiment run:
– Generation of graph (N, L) which corresponds to nodes and links of a
distributed datastore:
– Generation of parameters for Poisson distribution of fail and recovery time
for every node from graph (N, L) for system having different parameters of
distribution — Soriginal
– Generation of parameters for Poisson distribution of fail and recovery time
for every node from graph (N, L) for system having different parameters of
distribution — Sideal
– Define discrete time series — T</p>
      <p>
        To generate a graph, WattsStrogatz model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was used. The model produces
a random graph with small-world properties what is well represent real-world
systems especially distributed across different geographical locations.
WattsStrogatz model is using next parameter defined below to generate a graph:
probability of rewiring in WattsStrogatz model;
degree in WattsStrogatz model;
      </p>
      <p>The next steps are taken to define parameters for both systems — Soriginal
and Sideal:
– define E(f ail) and E(recovery) for Sideal. Mentioned values are the input
data of an experiment;
– for every node from Soriginal calculate random values — deviationrecovery
and deviationfail based on normal distribution N (0, E(recovery)/2) and
N (0, E(f ail) ∗ (deviation) respectively. Define E(f ail − deviationfail) and
E(recovery − deviationrecovery) for Soriginal. Where deviation is an input
parameter of an experiment;</p>
      <p>
        During experiment run for every t ∈ T the value for ζ is recorded based on
Definition 4 from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for both system, Soriginal and Sideal.
      </p>
      <p>The process is repeated for different randomly generated network structures
with the same parameters and then values for random variable ζ is aggregated
by the number of times when the system had ζ = 1 and frequency of this
systems. The mentioned aggregations are stored separately for Soriginal and
Sideal. Pearson test(χ2) with p − value &gt; 0.05 is applied to obtained data to
get a conclusion about the statistical difference in systems behaviour. Every
aggregated result is stored in a bucket of size — S based on a number of times.
Bucketing is done in order to eliminate inaccuracies in cases when time series is
long enough and the difference between two values is small enough to be treated
as one value for χ2 test.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Experiment results</title>
      <p>In total 5000 different systems were generated and run. Below are the results of
the most notable cases.
4.1</p>
      <sec id="sec-4-1">
        <title>Experiment 1</title>
        <p>Parameters used in this experiment are shown in Table 1.</p>
        <p>
          Distributions of a results are shown in Figure 3 and Figure 4. Shapiro-Wilk
Test [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was applied to these distributions with a positive outcome. The results
were aggregated using different bucket size — S and χ2 test was applied. The
outcome is shown in Table 2
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Experiment 2</title>
        <p>Parameters for experiment 2 are presented in Table 3. Distributions of tested
systems are shown in Figure 5 and Figure 6. Positive outcome is received on
Shapiro-Wilk test. χ2 test results are presented in Table 3. In this case it is
possible to conclude that big E(f ail) in comparison to E(recovery) highly
contributes to χ2 test results.</p>
        <p>
          Parameters
Name Value
P r 0.5
K 5
E(f ail) 300
E(recovery) 30
deviation 0.5
The parameters used in experiment 3 are shown in Table 5. Distribution results
for this experiment are shown in Figure 7 and Figure 8. Shapiro-Wilk Test [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
was applied to the mentioned distribution as well and gave a positive outcome.
χ2 test results can be seen in Table 6. Good results from χ2 can be explained
by small deviation value.
5
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Result interpretation</title>
      <p>
        Based on experiment results, it is possible to specify the next parameters which
results of χ2 test depends on hence answer to a question — are the system
statistically equal based on Definition 4 from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>– The ratio of expected values of fail and recovery time:
– The average deviations of all components considering that ν(N, L) - is a
number of vertices in graph (N, L):</p>
      <p>AR =</p>
      <p>E(recovery)</p>
      <p>E(f ail)</p>
      <p>Experiments showed that the result of χ2 test has an inverse relation to the
specified above parameters. The bucket size — S has a direct relation to χ2 test,
as we treat close to each other values as one value for χ2 test.</p>
      <p>Based on relation each parameter has to the end result it is possible to define
one proportion that represents the influence of each parameter.</p>
      <p>S</p>
      <p>AR ∗ AD ∗ C′(β)
6</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        The study described in the paper was focused on improving and simplifying
tackling challenges set up in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The proposed in the paper relation (4) based on empirical data gives a
practically useful tool concerning to model a real-world system (referred in this paper
as Soriginal) as a system with nodes having the same distribution (referred in
the paper as Sideal.
(3)
(4)</p>
      <p>This relation is grounded by a big set of experiments was conducted however
it can be improved using coefficients in order to make the influence of each of
the parameters more explicit.</p>
      <p>
        We plan to use relation (4) for studying different mechanisms to estimate the
degree of ensuring each CAP-guarantee applying metrics specified in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>The author thanks to Prof. Grygoriy Zholtkevych for his supervision and
support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Barrat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weigt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>On the properties of small-world network models</article-title>
          .
          <source>Eur. Phys. J. B</source>
          <volume>13</volume>
          ,
          <issue>547560</issue>
          (
          <year>2000</year>
          ). https://doi.org/10.1007/s100510050067
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rukkas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zholtkevych</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Distributed Datastores: Towards Probabilistic Approach for Estimation of Reliability</article-title>
          . V.N. Karazin Kharkiv National University,
          <fpage>523</fpage>
          -
          <lpage>534</lpage>
          (
          <year>2015</year>
          ) http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>1356</volume>
          /paper 51.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Watts</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strogatz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Collective dynamics of small-world networks</article-title>
          .
          <source>Nature</source>
          <volume>393</volume>
          ,
          <issue>440442</issue>
          (
          <year>1998</year>
          ). https://doi.org/10.1038/30918
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Biometrika</surname>
          </string-name>
          , Volume
          <volume>52</volume>
          ,
          <string-name>
            <surname>Issue</surname>
          </string-name>
          3-
          <issue>4</issue>
          ,
          <year>December 1965</year>
          , Pages 591611, https://doi.org/10.1093/biomet/52.3-
          <fpage>4</fpage>
          .
          <fpage>591</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Seth</given-names>
            <surname>Gilbert</surname>
          </string-name>
          and Nancy Lynch, ”
          <article-title>Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services”</article-title>
          ,
          <source>ACM SIGACT News</source>
          , Volume
          <volume>33</volume>
          <issue>Issue 2</issue>
          (
          <year>2002</year>
          ), pg. 5159. https://doi.org/10.1145/564585.564601
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Kleppmann</surname>
          </string-name>
          , ”
          <article-title>A Critique of the CAP Theorem”</article-title>
          https://arxiv.org/pdf/1509.05393.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Karl Pearson F.R.S</surname>
          </string-name>
          . X.
          <article-title>On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling Pages 157-</article-title>
          175 https://doi.org/10.1080/14786440009463897
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>