<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Index-Requisite Data Diagnostics In Information Management Systems</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Aerospace University “Kharkiv Aviation Institute”</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Informational Management Systems (IMS) which are based on legacy systems have a significant problem of dirty data. The data cleansing problem solution in such systems usually starts with the search of similar tuples' clusters. After that for each cluster the reference tuple should be formed for saving in a data warehouse of IMS. Moreover, fail tuples should be returned to the source subsystem with the indication of error location, i. e. concrete invalid requisite. The necessary of such a deep diagnosis determined by the following fact: the reference tuple can be not just one of the existent, but as well the combination of several different tuples requisites. Considering one obtained cluster of similar tuples, a certain multiset can be composed from all of the certain attribute values. The paper represents the method of the multiset's diagnostic in terms of faultless and correctability, based on the majority principle. The method provides the minimum time required for establishing the fact of multiset's incorrectness, moreover it allow defining valid (reference) and failed elements of the multiset.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Cleansing</kwd>
        <kwd>Diagnostics</kwd>
        <kwd>Similar Tuples</kwd>
        <kwd>Reference Requisite</kwd>
        <kwd>Multiset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Informational Management Systems (IMS) which are based on legacy systems have a
significant problem of dirty data. The data cleansing problem solution in such systems
usually starts with the search of similar tuples' clusters [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. After that for each cluster
the reference tuple should be formed for saving in a data warehouse of IMS.
      </p>
      <p>
        Therefore, fast index-requisite diagnostics method's development based on the
hashing and cyclic codes [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ] is quite perspective way of data cleaning problem
solution. The main principals of the rational control, which have successful
implementation in system engineering [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], should be investigated on the real functioning database
of IMS.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Statement</title>
      <p>Let consider the situation when one cluster tuples consist of three requisites. Then
the probability p(BB) of the event BB , which means that one of two data tuples has
single or double error in one requisite and second tuple – in another requisite, can be
calculated as:
p(BB) =p(D3 2 * ( p( A1 ) p(D2 ) ) * p(D1 ) p( A2 ) p(D3 ) + p( A1 ) p(D2 ) p(D3 ) *
* p(D1 ) p(D2 ) p( A3 ) * p(D1 ) p( A2 ) p(D3 ) + p(D1 ) p(D2 ) p( A3 )),
where p( Ai ) = Lip c (1 −p c )Li −1 + CL2i (1 −p c )Li −2 are probabilities of requisites fails with
single or double error, p(Di ) = (1 −p c )Li are probabilities of requisites fails absence,
Li are the average lengths of the requisites values, i = 1, 3 , p c is the possibility of the
requisite symbol distortion. For example, when L1 = 8 , L2 = 6 , L3 = 10 and p c =
10−2 p(BB) ≈ 0.025 .</p>
      <p>Considering only one of the clusters obtained similar tuples proceeds to the formal
statement of the problem.</p>
      <p>Let R1 is the considered cluster, i.e. set including q ∈ N tuples. For every attribute
ρ</p>
      <p>
        ( ρ = 1, h ) of the tuple the corresponding multiset M ρ = {sm1ρ , sm2ρ ,..., smqρ }
should be formed. Then, based on the principle of majority voting (two among three,
three among five etc.) is widely used in diagnosing technical systems [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], it is
possible to give definitions of multiset М ρ correctness.
      </p>
      <p>Definition 1. A multiset M ρ ( M ρ &gt; 2 ) is faultless, if all elements are equal, i.e.</p>
      <p>CORRECT ( M ρ ) = {∀i ∈{1,.., q}∀j ∈{1,.., q} (i ≠ j) ⇒ (smiρ = sm jρ )} ,
where CORRECT ( M ρ ) is a Boolean predicate, ⇒ is an implication operator.</p>
      <p>Definition 2. A multiset M ρ ( M ρ &gt; 2 ) is correctable, if more than half, but not
all of its elements are equal, i.e.</p>
      <p>CORRECTED ( M ρ ) ={∃M ρ' ={sm1'ρ , sm2'ρ ,..., smz'ρ } ⊂ M ρ (z &gt; q / 2) ∧ (z &lt; q) ∧
∧ (∀i ∈{1,..., z}∀j ∈{1,..., z} (i ≠ j) ⇒ (smi'ρ =sm'jρ ))},
where CORRECTED ( M ρ ) is a Boolean predicate, M ρ' is a subset of M ρ .</p>
      <p>Consequently, an element smip ∈ M ρ is a reference tuple, if smiρ ∈ M ρ' , as well
as an element smiρ ∈ M ρ is an failed tuple, if smiρ ∉ M ρ' . It is also obviously, in the
case if M ρ is correctable M ρ' is unique. In addition, if in M ρ are equal no more than
half of the elements, M ρ is not faultless and not correctable.</p>
      <p>The objective of this paper is to represent the method of the multiset’s
diagnostic, which ensures minimal time of M ρ correctness establishment, based on the given
above definitions, and allow to locate, if it is necessary, reference and failed elements
of M ρ .</p>
      <p>Let consider the obviously easiest approach, i.e. pairwise comparison of all the
M ρ elements smiρ and sm jρ for (i ≠ j) . To do this, we assign the multiset M ρ
anq
other multiset CNTρ = {cnt1ρ , cnt2ρ ,..., cntqρ } , such that cntiρ = 1 + ∑ eqij , i = 1, q
j=1
j≠i
1, if smiρ = sm jρ ;
where eqij =  If cnt1ρ = q then M ρ is faultless, or M ρ is
cor0 , othrwise.
rectable if ∃i ∈{1,..., q} : cntip &gt; q / 2 , i.e. there is a reference element. For example,
if M ρ = {' Иванов ', ' Ивашов ', ' Иванов '} , then
CNTρ = {2, 1, 2} . sm1ρ , sm3ρ
are
reference elements, sm2ρ is a failed element.</p>
      <p>Cpc = ( q −1 + q − 2 + ... + 1) L =</p>
      <p>
        The performance estimation of the pairwise comparisons method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by counting
the number of symbols comparisons is followed. Let each element smiρ ∈ M ρ
comprises L characters ( L &gt;&gt; 1 ). Then the maximum number of character comparisons
q2 − q q2 − q
      </p>
      <p>2 2
is the maximum time of the multiset M p of pairwise comparison elements, t pc is the
runtime of the two characters comparison. Since the diagnosing second stage
maxi q2 − q 
mum runtime is proportional to q then Tcpc ≈  L + q  t pc where Tcpc is the
max 2 
imum runtime of the diagnostic procedure based on the method of pairwise requisite’s
comparison.</p>
      <p>
        A significant improvement of this method is using of a conversion key
(hash) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which sends requisites to the array indexes (memory address):
L . Consequently, Tpc ≈
      </p>
      <sec id="sec-2-1">
        <title>L ⋅ t pc , where Tpc</title>
      </sec>
      <sec id="sec-2-2">
        <title>H : smip → aip</title>
        <p>(2)
where H is a transformation (mapping), aip is an array index corresponding to an
element smip of the multiset M p .</p>
        <p>The main challenge deal with the key conversion is that the set of possible
values is much greater than the set of possible memory locations. Therefore it is
necessary to choose a mapping H , which allow:
─ to detect common errors of source data, entering by human-operator in input fields,
i.e., reference argument smip and argument smrp , failed with any of the specified
error, always guarantee different results aip ≠ arp ;
─ to establish definitely the difference of smip and smrp
in the case of different
results aip ≠ arp ;
─ to produce the same addresses for random source elements with the difference of
arbitrarily small probability;
─ to conclude that the probability smip ≠ smrp is arbitrarily small for the equal
addresses ( aip = arp ).</p>
        <p>
          When mapping H is chosen, the problem of the M p correctness establishment
and the reference and failed M p elements search can be effectively solved by using
diagnostic models [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] linking errors indirect signs with the direct ones.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Choice Of The Requisites-To-Indices Reflection</title>
      <p>
        The stated problem should be considered as a task of error-correcting coding theory:
construction of the predetermined code with detecting ability for transmitting discrete
information through a noisy channel [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Indeed, ( smip , aip ) can be regarded as permissible sequence of redundancy code,
where smip are data bits, aip = H (smip ) are checking bits. However, in contrast to
transmission over the communication channel, whereby both data bits and checking
bits could be corrupted because of possible noise, in this case only information bits
could be changed, i.e. smip is received as smrp ≠ smip .</p>
      <p>
        The most reliable are the cyclic codes having high detecting ability and widely
used in practice because of less complicated coding/decoding devices schemes in
comparison with other coding techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Constructing the cyclic code for a given
number u of data bits the shortest length of code combinations w is determined to
provide a predetermined multiplicity of error detection. This problem is reduced to the
determination of needed generating polynomial G(x) of degrees w − u .
      </p>
      <p>For cyclic codes data bits transformation to test bits has the form as following:
H ( M ip (x)) =(M ip (x) + xw−u ) mod G(x) ,
(3)
where M ip (x) is polynomial of a dummy variable x corresponding to the data bits,
smip , mod is an operator getting remainder of the polynomials division.</p>
      <p>Thus, it is need to choose a mapping H such as:
1. If H (M rp (x)) ≠ H (M ip (x)) , then M rp (x) ≠ M ip (x) .
2. For frequently occurring error classes Eз and M rp (x)
=Mip (x) + Eз a combination
( smip , aip ) is excepted, i.e. H (M rp (x)) ≠ H (M ip (x)) .
3. For random noise M rp (x)</p>
      <p>=Mip (x) + Ec the probability of (mrp , aip ) permission is
arbitrarily small, i.e. p(H (M rp (x))</p>
      <p>=(M H ip (x))) → 0 where Ec is some random
noise.
4. The equity smip = smrp is ensured from H (M rp (x)) = H (M ip (x)) with a probability
close to one.</p>
      <p>It is necessary to show further that the first requirement is satisfied if
deg(G(x)) &gt; 0 where deq(G(x)) is a generator polynomial G(x) degree. If
H (M rp (x)) ≠ H (M ip (x)) , then (M rp (x)xw−u ) mod G(x) ≠ (M ip (x)xw−u ) mod G(x) .
Using the distributive property of the operator mod , it can be obtained
(M rp (x)xw−u + M ip (x)xw−u ) mod G(x) ≠ 0 . Let's supposed, that
M rp (x)xw−u + M ip (x)xw−u =then 0 , 0 mod G(x) ≠ 0 is a contradiction, since the
condition deg(G(x)) &gt; 0 and the definition of 0 mod G(x) = 0 . Consequently,
M rp (x)xw−u + M ip (x)xw−u ≠ 0 , M rp (x)xw−u ≠ M ip (x)xw−u and M rp (x) ≠ M ip (x) .</p>
      <p>Before finding out conditions that satisfy the second requirement, it is need to
introduce auxiliary statements.</p>
      <p>Statement 1. If A(x) mod G(x) ≠ 0 and B(x) mod G(x) ≠ 0 , then
( A(x)B(x)) mod G(x) ≠ 0 .</p>
      <p>Proof. Let's supposed, that A(x) mod G(x) ≠ 0 and B(x) mod G(x) ≠ 0 , but
( A(x)B(x)) mod G(x) = 0 , then A(x) ≠ W (x)G(x) , where W (x) is a polynomial.
Multiplying both sides of this inequality by B(x) , it could be obtained
A(x)B(x) ≠ W (x)G(x)B(x) . Next, let take the reminder by dividing both sides by
G(x) as ( A(x)B(x)) mod G(x) ≠ (W (x)G(x)B(x)) mod G(x) , then 0 ≠ 0 is a
contradiction. Consequently, ( A(x)B(x)) mod G(x) ≠ 0 , Q.E.D.</p>
      <p>Statement 2. If G(x) = xc + ... +1 then G ( x)W ( x) = xd + ... + x f + ... +α , where
c, d , f ∈ N , d &gt; f , d = c + deg(W (x)) ,α ∈{0,1} , W (x) is a polynomial.</p>
      <p>Proof. Let represent G(x) as sequence of ones. Then multiplication by modulo 2
of G(x) and W (x) may be considered as a modulo 2 addition with a shift:
G ( x) ×W ( x) =W ( x)1 ⊕W ( x)2 ⊕ ... ⊕W ( x)n , where ⊕ is operation of addition by
modulo 2, W ( x)i are right shifts by i of W ( x) .</p>
      <p>It should be mentioned that the lowest significant bit of the first term and the
highest significant bit of the last term are not compensated, therefore, G(x)W (x)
represented in the form xd + ... + x f + ... +α , Q.E.M.</p>
      <p>Statement 3. If G(x) = xw−u + ... +1 , then for any single error E(x) = xi .
i ∈{w − u,..., w −1} , such that M rp (x) =Mip (x) + E(x) , H (M rp (x)) ≠ H (M ip (x)) is
performed.</p>
      <p>Proof. Considered conditions H (M rp (x)) =((M ip (x) + xi )xw−u ) mod G(x) and
H (M ip (x)) = ((M ip (x))xw−u ) mod G(x) , let’s supposed that condition
H (M rp (x)) = H (M ip (x)) is true. Then equality
((M ip (x) + xi )xw−u ) mod G(x) =(x)xw−u (M ip ) mod G(x) is true as well, from which it
is followed that (xi+w−u ) mod G(x) = 0 and, therefore, xi+w−u = G(x)W (x) . On the
other hand, according to statement 2 G(x)W (x) = xd + ... + x f + ... +α and
xi+w−u ≠ xd + ... + x f + ... +α . Consequently, H (M rp (x)) ≠ H (M ip (x)) , Q.E.M.</p>
      <p>Statement 4. If G(x) = xw−u + ... +1 , then for packet type error
E(x) = xi + ... + xi− p+1 , p ≤ w − u , i ∈{w − u + p −1,..., w −1} , for which
M rp (x) =Mip (x) + E(x) is true, H (M rp (x)) ≠ H (M ip (x)) is performed.</p>
      <p>Proof. Let's supposed that H (M rp (x)) = H (M ip (x)) , then
((M ip (x) + xi + ... + xi− p+1 )xw−u ) mod G(x) =(M ip (x)xw−u ) mod G(x) , hence
(xw−u+i− p+1 (x p−1 + ... +1)) mod G(x) =0 . Each of the factors: xw−u+i− p+1 is not evenly
divisible by G(x) ; x p−1 + ... +1 also not divisible by G(x) , because of the
p −1 &lt; w − u and therefore, (xw−u+i− p+1 (x p−1 + ... +1)) mod G(x) ≠ 0 . It is a
contradiction, and hence, H (M rp (x)) ≠ H (M ip (x)) .</p>
      <p>Statement 4. If for requisite it is used 8 bits to represent one character, smrp
differ from smip by any single transcription and G(x) = xw−u + ... +1 , where w − u ≥ 8 ,
then arp ≠ aip .</p>
      <p>Proof. Any single transcription can be represented as E(x) = xi + ... + xi− p+1 ,
where p ≤ 8 . Consequently, in accordance with the statement 3,
H (M rp (x)) ≠ H (M ip (x)) and therefore arp ≠ aip .</p>
      <p>Statement 5. If for requisite it is used 8 bits to represent one character, smrp
differ from smip by any transposition or double transcription of adjacent characters and
G(x) = xw−u + ... +1 , where w − u ≥ 16 , then arp ≠ aip .</p>
      <p>Proof. Any transposition or double transcription of adjacent symbols can be
represented as E(x) = xi + ... + xi− p+1 , where p ≤ 16 . Consequently, in accordance with
the statement 3, H (M rp (x)) ≠ H (M ip (x)) and therefore arp ≠ aip .</p>
      <p>Considering the third requirement for independent input of two values smip and
smrp , let’s supposed that all valid requisites smip are equal and H uniformly send
them to the full range of possible addresses aip . In this case, each aip corresponds to
2(u−(w−u)) smip . Then there 2u 2u options independent of input values smip and smrp ,
among which:
a) 2u identical values input options, i.e., ( smip =smrp , aip =arp );
b) 2u (2u − 2u−(w−u) ) options, in which errors are detected, i.e.
( smip ≠ smrp , aip ≠ arp ) ;
c)
2u (2u−(w−u) −1)
options,
in
which
errors
are
not
detected,
i.e.
( smip ≠ smrp , aip =arp) .</p>
      <p>Further there are described computations of the probability of different outcomes
independent input values smip and smrp . The probability of entering identical values
is p(smip
=smrp , aip</p>
      <p>=arp ) =22u2uu =21u . The probability of the case, in which error
p(smip ≠ smrp , aip =arp ) =218 − 2148 ≈ 0, 004 s</p>
      <p>2u (2u − 2u−(w−u) )
p(smip ≠ smrp , aip ≠ arp ) = u u</p>
      <p>2 2
which error is not detected, is p(smip ≠ smrp , aip
For</p>
      <p>example, when w = 56,
p(smip ≠ smrp , aip = arp ) = 2116 − 2148 ≈ 1, 5 ⋅10−5 .
is
detected,</p>
      <p>is
=1 − 2w1−u . The probability of the case, in</p>
      <p>2u (2u−(w−u) −1)
=arp ) = u u</p>
      <p>2 2</p>
      <p>Let consider now independent input information elements smip , smrp and smsp ,
each with equal probability takes any of the valid values and is transformed uniformly
on the entire range of possible addresses. As previously, each aip corresponds to
2(u−(w−u)) smip . Totally, there are 2 2 2</p>
      <p>u u u values for smip , smrp and smsp inputs,
among which:
─ 2u identical values input options, i.e., ( smip =smrp =smsp , aip =arp =asp );
─ the cases, in which errors are not detected, are following:
a)
c)
e)
g)
aip
=arp , aip
=asp , arp</p>
      <p>=asp ;
smip = smrp , s mip ≠ smsp , s mrp ≠ smsp ,
aip
=arp , aip
=asp , arp
=a.</p>
      <p>sp
smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ,
─ the cases, in which errors are detected, are following:
aip ≠ arp , aip ≠ asp , arp ≠ asp ;
smip ≠ smrp , s mip = smsp , s mrp ≠ smsp ,</p>
      <p>=asp;
smip = smrp , s mip ≠ smsp , s mrp ≠ smsp ,
b)
d)
aip ≠ arp , aip = asp , arp ≠ asp ; aip = arp , aip ≠ asp , arp ≠ asp ,
i.e. case a includes 2u (2u − 2u−(w−u) )(2u − 2u−(w−u) − 2u−(w−u) ) options; cases b, c, d –
2u (2u − 2u−(w−u) ) options.</p>
      <p>Further there are described computations of probabilities of the different
diagnoses with independent input values requisites smip . smrp and smsp . The probability of
entering identical values of requisites is equal to 223uu = 212u . The probability of the
case, when the error are skipped, is equal to
+ 2u (2u−(w−u) −1)(2u−(w−u2)3−u 2) + 3 ⋅ 2u (2u−(w−u) −1) =  2w1−u − 21u   3 − 2w1−u−1 + 21u . The
probability of error detection by comparing indices obtained for the three requisites is
equal to 2u (2u − 2u−(w−u) )(22u3−u 2u−(w−u) − 2u−(w−u) ) + 3 ⋅ 2u (2u2−3u2u−(w−u) )
=
3 ⋅ 2u (2u − 2u−(w−u) )(2u−(w−u) −1)
23u
+
of
=1 − 2w1−u  1 − 2w1−u−1 + 23u  . For example, when w = 64,
the case, when an error is
skipped,
=  2116 − 2148   3 − 2115 + 2148  ≈ 4, 5 ⋅10−5 .
u = 48 , the probability
p [a ∨ b ∨ c ∨ d ] =</p>
      <p>
        Considering the fourth requirement in the case of independent input of two
values smip and smrp , it should be calculated probability of case, if aip = arp , then
smip = smrp . The Bayes' formula [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] allows to calculate posteriori conditional
probability of the presence of unconditional priori one.
      </p>
      <p>Let the event ER1 is equal smip = smrp , event ER2 – smip ≠ smrp , event EI –
p(ER1 ) p(EI | ER1 )
aip = arp . Then p(ER1 | EI ) =
2
∑ p(ERi ) p(EI | ERi )
i=1
smrp , which consist of L characters, are independently entered by two
humanoperators based on the same original document. Then p(ER1 ) can be calculated as the
probability of error-free entry of two requisites, i.e. p(ER1 ) = (1 −p c )2L , where p c is
the possibility of mistakes in the human information (errors per symbol), hence
p(ER2 ) =1 − (1 −p c )2L . For example, for L = 6 and p c = 10−2 p(ER1) ≈ 0, 88 . It is
obvious that p(EI | ER1 ) =p(aip =arp | mip =mrp ) =1 , as, for equal requisites
(M ip (x) = M rp (x)) it is impossible to obtain different indexes
. Let's supposed, that smip and
(H (M ip (x)) ≠ H (M rp (x))) ,according to (3). For calculations of
=arp | smip ≠ smrp ) it is possible to use the fact of equal
probap(EI | ER2 ) =p(aip
bilities of all admissible smip , smrp and their uniform mapping to the corresponding
smip ≠ smrp , s mip = smsp , s mrp ≠ smsp ,
event
(1 −p c )2L ⋅1
(1 −p c )2L ⋅1 + (1 − (1 −p c )2L )  2u−(w−u) −1 
 2u −1 
. For example, for u = 48,
smip
smip
–
ues smip , smrp
w = 64, L = 6 p(ER1 | EI ) ≈ 0, 999998 .</p>
      <p>Considering the fourth requirement in the case of independent input of three
valand smsp , it is necessary to calculate the probability of
=smrp
=smrp
=smsp , when aip =arp =asp . Let the event E3R1 is equal values
=smsp , event E3R2 – smip ≠ smrp , s mip ≠ smsp , s mrp
=smsp, event E3R3</p>
      <p>E3R4
E3R5
–
–
Then
ranges aip , arp .</p>
      <p>According
to</p>
      <p>the
p(aip =arp | smip ≠ smrp ) =
p(ER1 | EI ) =
p(aip
p(smip ≠ smrp )
formula of conditional
=arp , smip ≠ smrp ) = 2u−(w−u) −1</p>
      <p>.
smip = smrp , s mip ≠ smsp , s mrp ≠ smsp ,
smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ,
event</p>
      <p>E3I
event
–
aip =arp =asp .
p(E3R1 | E3I ) = 5
∑ p(E3Ri ) p(E3I | E3Ri )
i=1
events E3Ri , i = 1, 5 according to the binomial law, assuming independence of errors
in separate characters, as following: p(E3R1 ) = (1−p c )3L .</p>
      <p>p(E3R1 ) p(E3I | E3R1 ) . Let calculate the a priori probability of
p(E3R2 ) = p(E3R3 ) = p(E3R4 ) = (1−p c )2L (1− (1 −p c )L ) , p(E3R5 ) = (1− (1 −p c )L )3 .
For example, for L = 6 and p c = 10−2 p(E3R1 ) ≈ 0,83 , p(E3R2 ) ≈ 0, 05 ,
p(E3R5 ) ≈ 0, 0002 .</p>
      <p>As previously, calculation of the conditional probabilities p(E3I | E3Ri ) is
based on the conditions of the equal probability of all admissible smip , smrp and
smsp and uniformity of transformation to corresponding ranges aip , arp and asp . So,
p(E3I | E3R1 ) = 1 due to the fact that the mapping H each value smip sends to no
more than one aip and hence H is a function. The conditional probabilities of
coincidence of codes in the case of only one failed requisite are following:
p(E3I | E3R2 ) =p(E3I | E3R3 ) =p(E3I | E3R4 ) =
= p(aip =par(psm=ipas≠p,ssmmripp ,≠smsimpr≠p,ssmmspip, s≠msrmpsp =,ssmmsrpp) =smsp ) = 2u−2(wu−−u)1−1 . The
conditional probability of indices coincidence in the case of three different requisites input
=
=
p(E3I | E3R5 )
(2u−(w−u) −1)(2u−(w−u) − 2)</p>
      <p>(2u −1)(2u − 2)
smip
=smrp</p>
      <p>=smsp
p(E3R1 | E3I ) =
humanoperator
is
=smrp p(aip =asp, smip ≠ , s mip ≠ smsp , s mrp ≠ smsp )
=a
rp
p(smip ≠ smrp , smip ≠ smsp , smrp ≠ smsp )
following:
. Thus, the posterior probability of identity requisites
provided
that
aip =arp =asp ,
can
be
calculated
as
.
(1−p c )3L ⋅1
2u−(w−u) −1 + (1− (1−p c )L )3 ⋅ (2u−(w−u) −1)(2u−(w−u) − 2)</p>
      <p>2u −1 (2u −1)(2u − 2)
(1−p c )3L ⋅1+ 3⋅ (1−p c )2L (1− (1−p c )L ) ⋅
For example, for u =48, w</p>
      <p>=64, L =6 p(E3R1 | E3I ) ≈ 0, 999997 .</p>
      <p>
        The standard CRC-CCITT polynomial G1 (x) = x16 + x12 + x5 +1 and CRC-16
G2 (x) = x16 + x15 + x2 +1 are commonly used to increase the reliability of information
transmission in computer networks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is obvious that they satisfy the first and
second requirements, as deq(Gi (x)) &gt; 0 and Gi (x) = xw−u + ... +1 , where w − u ≥ 16 .
i = 1, 2 . Furthermore, they may be represented as a product of polynomials of lower
degree, for example, G1 (x) = (x +1) ⋅ (x15 + x14 + x13 + x12 + x4 + x3 + x2 + x +1) , and
are not irreducible. Therefore, the codes constructed based on the G1(x) and G2 (x)
does not refer to cyclic, but inherit all the capabilities of error detection, the inherent
cyclic codes, including the ability of uniform mapping the possible keys smip , smrp
..., smsp to the corresponding ranges aip , arp ..., asp . Therefore, assuming that each of
the elements smip . smrp ..., smsp with equal probability takes any of the permissible
values, then G1(x) and G2 (x) satisfies the third and fourth requirements.
      </p>
      <p>Choosing the best alternative was carried out using the method of weighted sum.
Natural when forming the weighting factors will have an idea of ranking weights
according densities classes most common error.</p>
      <p>Thus, the code based on the polynomial G1 (x) = x16 + x12 + x5 +1 will have the
best total controlling ability relative to the most common classes of errors in the data
on the names of employees of the KhAI University
4</p>
    </sec>
    <sec id="sec-4">
      <title>Diagnostic Data Model</title>
      <p>
        According to the signal-parametric approach to control systems diagnostic [
        <xref ref-type="bibr" rid="ref4 ref8">4,8</xref>
        ],
the diagnostic models are defined as mathematical constructions linking indirect
signs with direct reasons of the fault. In our case, diagnostic data model (DMD) is
named a mathematical construction that relates indirect indications of the data lines
with errors, the DMD must be of the form

where ΔD is an indirect indication of the presence of failed data; D , D are direct
functions of signs of error and the reference data, respectively. For any DMD, the
conditions of diagnosability must also be fulfilled, i.e. the possibility of an
unambiguous establishment of the fact of the presence of failed data on indirect signs.
      </p>
      <p>Let’s create the DMD to identify and search for a place of failed requisites in the
multiset Mρ . Let Aρ = {a1ρ , a2ρ ,..., aqρ } be multiset indices calculated for the initial
requisites, and</p>
      <p>G(x) = x16 + x12 + x5 +1 , let</p>
      <p>D
be row vector of dimension
[0,..., 216 −1] such that D[aiρ ] =| Aρ ∩ Aiρ | where Aiρ = {aiρ ,aiρ,..., aiρ } . Then the

q− раз
equation, characterized by the absence of failed requisites in Mρ will have the form

D[a1ρ ] = q , i.e. all indexes are the same. If, however, Mρ contains failed requisites,
the D [a1ρ ] =| Aρ ∩ A1ρ | . Thus, the DMD to detect failed data in Mρ looks as:
</p>
      <p>Ddet Dρ =D [a1ρ ] − D[a1ρ ] =| Aρ ∩ A1ρ | −q ,
where Ddet Dρ is an indirect indication of the presence of failed data in Mρ . If
Ddet Dρ ≡ 0 then Mρ is error-free, or Mρ contains failed information.</p>
      <p>To find a place in the wrong requisite Mρ DMD will be as follows:</p>
      <p>
DD = D − D ,
(4)
(5)
(6)

D pl Diρ = D [aiρ ] − D[aiρ ] = | Aρ ∩ Aiρ | −
q</p>
      <p>−α ,
2
where D pl Diρ is an indirect indication of the presence of failed data in the requisite
1 q
smiρ . α ∈[ ; −1] , and if D pl Diρ &lt; 0 , then smiρ is faulty requisite, otherwise smiρ
2 2
is a reference requisite.</p>
      <p>The performance of the method of index-requisite diagnosis was evaluated. In
this case the first stage is filling a row vector D . It can be assumed to be proportional
to the value q . It is assumed that the calculation of indices occur before data cleaning
process. As for the time of the second stage, it coincides with the time of the second
stage in the case of pairwise comparisons requisites. Maximum wait time for
diagnostic procedures on the basis of the method of index-requisite diagnosis
Tобщ.инд.рекв ≈ 2 * q *tcр.в . The overall performance of the method of index-requisite
(q −1) * L + 2
diagnosis of redundant information in times higher than the performance
4
of the method based on pairwise comparison requisites. For example, when q = 3 and</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Deep diagnostics data is the basis for the following problem solution of data recovery.
Determining, based on the principle of majority, error and reference values for each
attribute it is possible automatic replacement of standard errors. In addition, the failed
attributes should be corrected in the source subsystem. Since the change in the
original data in the data warehouse is technically impossible, human-operator should be
informed about the error occurred to ensure the quality of subsystem data. Such
notification must include the failed attribute, reference attribute, as well as the record ID,
for example, last name, first name, etc. If the source subsystem allows working with a
clipboard, the failed value could be replaced by correct one automatically.</p>
      <p>If it is impossible to find the reference and failed values for the attribute, for
example, if there are two different requisites and diagnostic model cannot detect the place,
it is concluded that both requisites are incorrect. Decision-making is entrusted to the
system administrator, which can redirect the problem to the operators.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chukhray</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havrylenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Proximate Objects Probabilistic Searching Method</article-title>
          .
          <source>Advances in Intelligent Systems and Computing</source>
          ,
          <volume>1113</volume>
          AISC,
          <fpage>219</fpage>
          -
          <lpage>227</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cormen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leiserson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rivest</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Introduction to Algorithms</article-title>
          . 3rd edn. The MIT Press, 1292 p., (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Borda</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Fundamentals in Information Theory and Coding</article-title>
          .
          <source>2011th edn</source>
          , Springer, 485 p. (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kulik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Rational intellectualization of the aircraft control: Resources-saving safety improvement</article-title>
          .
          <source>Studies in Systems, Decision and Control</source>
          ,
          <volume>173</volume>
          -
          <fpage>192</fpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Martínez</given-names>
            <surname>Bastida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.P.</given-names>
            ,
            <surname>Havrylenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Chukhray</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Developing a selfregulation environment in an open learning model with higher fidelity assessment</article-title>
          .
          <source>Communications in Computer and Information Science</source>
          ,
          <volume>826</volume>
          ,
          <fpage>112</fpage>
          -
          <lpage>131</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tanenbaum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wetherall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : Computer Networks,
          <volume>5</volume>
          <fpage>edn</fpage>
          , Pearson,
          <volume>960</volume>
          p. (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Fundamentals of Probability. With Stochastic Processes. 4th edn</article-title>
          , CRC Press, (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Martínez</given-names>
            <surname>Bastida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.P.</given-names>
            ,
            <surname>Gavrilenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.V.</given-names>
            ,
            <surname>Chukhray</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.G.</surname>
          </string-name>
          :
          <article-title>Developing a pedagogical intervention support based on Bayesian networks</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>1844</year>
          ,
          <fpage>265</fpage>
          -
          <lpage>272</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>