Index-Requisite Data Diagnostics In Information
                                    Management Systems

                        Andrey Chukhray1 [0000-0002-8075-3664] and Olena Havrylenko2 [0000-0001-5227-9742]
                    1
                      National Aerospace University “Kharkiv Aviation Institute”, Kharkiv, Ukraine
                                               achukhray@gmail.com
                    2
                      National Aerospace University “Kharkiv Aviation Institute”, Kharkiv, Ukraine
                                              o.havrylenko@khai.edu


                         Abstract. Informational Management Systems (IMS) which are based on lega-
                         cy systems have a significant problem of dirty data. The data cleansing problem
                         solution in such systems usually starts with the search of similar tuples' clusters.
                         After that for each cluster the reference tuple should be formed for saving in a
                         data warehouse of IMS. Moreover, fail tuples should be returned to the source
                         subsystem with the indication of error location, i. e. concrete invalid requisite.
                         The necessary of such a deep diagnosis determined by the following fact: the
                         reference tuple can be not just one of the existent, but as well the combination
                         of several different tuples requisites. Considering one obtained cluster of similar
                         tuples, a certain multiset can be composed from all of the certain attribute val-
                         ues. The paper represents the method of the multiset's diagnostic in terms of
                         faultless and correctability, based on the majority principle. The method pro-
                         vides the minimum time required for establishing the fact of multiset's incor-
                         rectness, moreover it allow defining valid (reference) and failed elements of the
                         multiset.

                         Keywords: Data Cleansing, Diagnostics, Similar Tuples, Reference Requisite,
                         Multiset.


                1        Introduction

                Informational Management Systems (IMS) which are based on legacy systems have a
                significant problem of dirty data. The data cleansing problem solution in such systems
                usually starts with the search of similar tuples' clusters [1]. After that for each cluster
                the reference tuple should be formed for saving in a data warehouse of IMS.
                   Therefore, fast index-requisite diagnostics method's development based on the
                hashing and cyclic codes [2,3] is quite perspective way of data cleaning problem solu-
                tion. The main principals of the rational control, which have successful implementa-
                tion in system engineering [4], should be investigated on the real functioning database
                of IMS.


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                           2       Problem Statement

    Let consider the situation when one cluster tuples consist of three requisites. Then
 the probability p ( BB ) of the event BB , which means that one of two data tuples has
 single or double error in one requisite and second tuple – in another requisite, can be
 calculated as:
=    p ( BB) 2 * ( p ( A1 ) p ( D2 ) p ( D3 ) * p ( D1 ) p ( A2 ) p ( D3 ) + p ( A1 ) p ( D2 ) p ( D3 ) *
                               * p ( D1 ) p ( D2 ) p ( A3 ) * p ( D1 ) p ( A2 ) p ( D3 ) + p ( D1 ) p ( D2 ) p ( A3 )),
                           where p ( Ai ) = Lip c (1 − p c ) Li −1 + CLi2 (1 − p c ) Li − 2 are probabilities of requisites fails with
                           single or double error, p ( Di )= (1 − p c ) Li are probabilities of requisites fails absence,
                            Li are the average lengths of the requisites values, i = 1,3 , π c is the possibility of the
                           requisite symbol distortion. For example, when L1 = 8 , L2 = 6 , L3 = 10 and π c =
                           10−2 p ( BB) ≈ 0.025 .
                              Considering only one of the clusters obtained similar tuples proceeds to the formal
                           statement of the problem.
                              Let R1 is the considered cluster, i.e. set including q ∈ N tuples. For every attribute
                           ρ ( ρ = 1, h ) of the tuple the corresponding multiset M ρ = {sm1ρ , sm2 ρ ,..., smq ρ }
                           should be formed. Then, based on the principle of majority voting (two among three,
                           three among five etc.) is widely used in diagnosing technical systems [4, 5], it is pos-
                           sible to give definitions of multiset М ρ correctness.
                                 Definition 1. A multiset M ρ ( M ρ > 2 ) is faultless, if all elements are equal, i.e.
                                 CORRECT ( M ρ ) =           {∀i ∈ {1,.., q}∀j ∈ {1,.., q} (i ≠ j) ⇒ (sm = sm )} ,        iρ   jρ


                           where CORRECT ( M ρ ) is a Boolean predicate, ⇒ is an implication operator.
                                 Definition 2. A multiset M ρ ( M ρ > 2 ) is correctable, if more than half, but not
                           all of its elements are equal, i.e.
                                              { M ρ' ={sm1' ρ , sm2' ρ ,..., smz' ρ } ⊂ M ρ ( z > q / 2) ∧ ( z < q) ∧
                           CORRECTED ( M ρ ) =∃

                           ∧ ( ∀i ∈ {1,..., z}∀j ∈ {1,..., z} (i ≠ j ) ⇒ ( smi'ρ =sm'j ρ ) ) ,       }
                           where CORRECTED ( M ρ ) is a Boolean predicate, M ρ' is a subset of M ρ .
                                 Consequently, an element smip ∈ M ρ is a reference tuple, if smi ρ ∈ M ρ' , as well
                           as an element smi ρ ∈ M ρ is an failed tuple, if smi ρ ∉ M ρ' . It is also obviously, in the
                           case if M ρ is correctable M ρ' is unique. In addition, if in M ρ are equal no more than
                           half of the elements, M ρ is not faultless and not correctable.
                                 The objective of this paper is to represent the method of the multiset’s diagnos-
                           tic, which ensures minimal time of M ρ correctness establishment, based on the given
above definitions, and allow to locate, if it is necessary, reference and failed elements
of M ρ .
    Let consider the obviously easiest approach, i.e. pairwise comparison of all the
M ρ elements smi ρ and sm j ρ for (i ≠ j ) . To do this, we assign the multiset M ρ an-
                                                                               q
other multiset CNTρ = {cnt1ρ , cnt2 ρ ,..., cntq ρ } , such that cnti ρ = 1 + ∑ eqij , i = 1, q
                                                                               j =1
                                                                               j ≠i

                        = sm j ;
             1, if smi rr
where eqij =                    If cnt1ρ = q then M ρ is faultless, or M ρ is cor-
             0, othrwise .
rectable if ∃i ∈ {1,..., q} : cntip > q / 2 , i.e. there is a reference element. For example,
if M ρ = {' Иванов ', ' Ивашов ', ' Иванов '} , then    CNTρ = {2, 1, 2} . sm1ρ , sm3 ρ      are
reference elements, sm2 ρ is a failed element.
     The performance estimation of the pairwise comparisons method [2] by counting
the number of symbols comparisons is followed. Let each element smi ρ ∈ M ρ com-
prises L characters ( L >> 1 ). Then the maximum number of character comparisons
                                       q2 − q                         q2 − q
 C pc = ( q − 1 + q − 2 + ... + 1) L =        L . Consequently, Tpc ≈        L ⋅ t pc , where Tpc
                                         2                              2
is the maximum time of the multiset M p of pairwise comparison elements, t pc is the
runtime of the two characters comparison. Since the diagnosing second stage maxi-
                                              q2 − q       
mum runtime is proportional to q then Tcpc ≈         L + q  t pc where Tcpc is the max-
                                              2            
imum runtime of the diagnostic procedure based on the method of pairwise requisite’s
comparison.
     A significant improvement of this method is using of a conversion key
(hash) [2], which sends requisites to the array indexes (memory address):

                                      H : smip → aip                                         (2)

where H is a transformation (mapping), aip is an array index corresponding to an
element smip of the multiset M p .
      The main challenge deal with the key conversion is that the set of possible val-
ues is much greater than the set of possible memory locations. Therefore it is neces-
sary to choose a mapping H , which allow:
─ to detect common errors of source data, entering by human-operator in input fields,
  i.e., reference argument smip and argument smrp , failed with any of the specified
   error, always guarantee different results aip ≠ arp ;
─ to establish definitely the difference of smip and smrp             in the case of different
   results aip ≠ arp ;
─ to produce the same addresses for random source elements with the difference of
  arbitrarily small probability;
─ to conclude that the probability smip ≠ smrp is arbitrarily small for the equal ad-
    dresses ( aip = arp ).

      When mapping H is chosen, the problem of the M p correctness establishment
and the reference and failed M p elements search can be effectively solved by using
diagnostic models [4] linking errors indirect signs with the direct ones.


3       The Choice Of The Requisites-To-Indices Reflection

The stated problem should be considered as a task of error-correcting coding theory:
construction of the predetermined code with detecting ability for transmitting discrete
information through a noisy channel [3].
   Indeed, ( smip , aip ) can be regarded as permissible sequence of redundancy code,
where smip are data bits, aip = H ( smip ) are checking bits. However, in contrast to
transmission over the communication channel, whereby both data bits and checking
bits could be corrupted because of possible noise, in this case only information bits
could be changed, i.e. smip is received as smrp ≠ smip .
   The most reliable are the cyclic codes having high detecting ability and widely
used in practice because of less complicated coding/decoding devices schemes in
comparison with other coding techniques [3]. Constructing the cyclic code for a given
number u of data bits the shortest length of code combinations w is determined to
provide a predetermined multiplicity of error detection. This problem is reduced to the
determination of needed generating polynomial G ( x) of degrees w − u .
   For cyclic codes data bits transformation to test bits has the form as following:

                         H ( M=
                              ip ( x ) )   ( M ( x) + x ) mod G( x) ,
                                                  ip
                                                        w−u
                                                                                       (3)

where M ip ( x) is polynomial of a dummy variable x corresponding to the data bits,
smip , mod is an operator getting remainder of the polynomials division.
    Thus, it is need to choose a mapping H such as:

1. If H ( M rp ( x)) ≠ H ( M ip ( x)) , then M rp ( x) ≠ M ip ( x) .
2. For frequently occurring error classes Eз and M=
                                                  rp ( x ) M ip ( x) + Eз a combination
    ( sm , a ) is excepted, i.e. H (M ( x)) ≠ H (M ( x)) .
        ip   ip                              rp               ip

3. For random noise M=
                     rp ( x ) M ip ( x) + Ec the probability of ( mrp , aip ) permission is
                                =
    arbitrarily small, i.e. p ( H ( M rp ( x)) H ( M ip ( x))) → 0 where Ec is some random
    noise.
4. The equity smip = smrp is ensured from H ( M rp ( x)) = H ( M ip ( x)) with a probability
   close to one.

  It is necessary to show further that the first requirement is satisfied if
deg (G ( x)) > 0 where deq (G ( x)) is a generator polynomial G ( x) degree. If
H ( M rp ( x)) ≠ H ( M ip ( x)) , then ( M rp ( x) x w −u ) mod G ( x) ≠ ( M ip ( x) x w −u ) mod G ( x) .
Using the distributive property of the operator mod , it can be obtained
( M rp ( x) x w −u + M ip ( x) x w −u ) mod G ( x) ≠ 0 . Let's supposed, that
                                      0 , then 0 mod G ( x) ≠ 0 is a contradiction, since the con-
M rp ( x) x w − u + M ip ( x) x w − u =
dition    deg(G ( x)) > 0                 and the definition of                  0 mod G ( x) = 0 . Consequently,
              w−u                   w−u                       w−u
M rp ( x) x         + M ip ( x) x         ≠ 0 , M rp ( x) x         ≠ M ip ( x) x w −u and M rp ( x) ≠ M ip ( x) .
     Before finding out conditions that satisfy the second requirement, it is need to in-
troduce auxiliary statements.
     Statement 1. If          A( x) mod G ( x) ≠ 0   and    B( x) mod G ( x) ≠ 0 , then
( A( x) B ( x)) mod G ( x) ≠ 0 .
      Proof. Let's supposed, that A( x) mod G ( x) ≠ 0 and B( x) mod G ( x) ≠ 0 , but
( A( x) B ( x)) mod G ( x) = 0 , then A( x) ≠ W ( x)G ( x) , where W ( x) is a polynomial.
Multiplying both sides of this inequality by B ( x) , it could be obtained
 A( x) B( x) ≠ W ( x)G ( x) B ( x) . Next, let take the reminder by dividing both sides by
G ( x) as ( A( x) B( x)) mod G ( x) ≠ (W ( x)G ( x) B( x)) mod G ( x) , then 0 ≠ 0 is a contra-
diction. Consequently, ( A( x) B ( x)) mod G ( x) ≠ 0 , Q.E.D.
      Statement 2. If G ( x) = x c + ... + 1 then G ( x ) W ( x ) = x d + ... + x f + ... + α , where
c, d , f ∈ N , d > f , d= c + deg(W ( x)) , α ∈ {0,1} , W ( x) is a polynomial.
       Proof. Let represent G ( x) as sequence of ones. Then multiplication by modulo 2
of G ( x) and W ( x) may be considered as a modulo 2 addition with a shift:
G ( x ) × W (=
             x ) W ( x )1 ⊕ W ( x )2 ⊕ ... ⊕ W ( x )n , where ⊕ is operation of addition by
modulo 2, W ( x )i are right shifts by i of W ( x ) .
     It should be mentioned that the lowest significant bit of the first term and the
highest significant bit of the last term are not compensated, therefore, G ( x)W ( x)
represented in the form x d + ... + x f + ... + α , Q.E.M.
     Statement 3. If G ( x) = x w −u + ... + 1 , then for any single error E ( x) = x i .
i ∈ {w − u ,..., w − 1} , such that M=rp ( x )  M ip ( x) + E ( x) , H ( M rp ( x)) ≠ H ( M ip ( x)) is
performed.
                                      =
     Proof. Considered conditions H ( M rp ( x )) (( M ip ( x) + x i ) x w −u ) mod G ( x) and
H ( M ip ( x)) = (( M ip ( x)) x w − u ) mod G ( x) ,               let’s        supposed         that       condition
H ( M rp ( x)) = H ( M ip ( x))                      is                  true.             Then                equality
(( M ip ( x) + x i ) x w − u ) mod G ( x) =
                                          ( M ip ( x) x w − u ) mod G ( x) is true as well, from which it
is followed that ( x i + w −u ) mod G ( x) = 0 and, therefore, x i + w −u = G ( x)W ( x) . On the
other hand, according to statement 2                                                          G ( x)W ( x) = x d + ... + x f + ... + α             and
x   i + w−u
              ≠ x + ... + x + ... + α . Consequently, H ( M rp ( x)) ≠ H ( M ip ( x)) , Q.E.M.
                      d                     f


          Statement                       4.         If             G ( x) = x w −u + ... + 1 ,   then          for    packet             type   error
E ( x) = x + ... + x
                  i
                           ,      p ≤ w−u ,
                                         i − p +1
                                                    i ∈ {w − u + p − 1,..., w − 1} ,  for                                                        which
M= rp ( x ) M ip ( x) + E ( x) is true, H ( M rp ( x)) ≠ H ( M ip ( x)) is performed.
          Proof.                        Let's                  supposed               that        H ( M rp ( x)) = H ( M ip ( x)) ,               then
                                                 i − p +1          w−u                                 w−u
(( M ip ( x) + x + ... + x   i
                                                            )x           ) mod G ( x) =( M ip ( x) x         ) mod G ( x) ,                      hence
     w − u + i − p +1            p −1                                                                                 w − u + i − p +1
(x                      (x              + ... + 1)) mod G ( x) =0 . Each of the factors: x                                               is not evenly
                                                            p −1
divisible by G ( x) ; x                                            + ... + 1 also not divisible by G ( x) , because of the
 p − 1 < w − u and therefore, ( x w −u + i − p +1 ( x p −1 + ... + 1)) mod G ( x) ≠ 0 . It is a contradic-
tion, and hence, H ( M rp ( x)) ≠ H ( M ip ( x)) .
          Statement 4. If for requisite it is used 8 bits to represent one character, smrp dif-
                                                   ) x w −u + ... + 1 , where w − u ≥ 8 ,
fer from smip by any single transcription and G ( x=
then arp ≠ aip .
     Proof. Any single transcription can be represented as E ( x) = x i + ... + x i − p +1 ,
where        p ≤ 8 . Consequently, in accordance with the statement 3,
H ( M rp ( x)) ≠ H ( M ip ( x)) and therefore arp ≠ aip .
          Statement 5. If for requisite it is used 8 bits to represent one character, smrp dif-
fer from smip by any transposition or double transcription of adjacent characters and
     ) x w −u + ... + 1 , where w − u ≥ 16 , then arp ≠ aip .
G ( x=
      Proof. Any transposition or double transcription of adjacent symbols can be rep-
resented as E ( x) = x i + ... + x i − p +1 , where p ≤ 16 . Consequently, in accordance with
the statement 3, H ( M rp ( x)) ≠ H ( M ip ( x)) and therefore arp ≠ aip .
          Considering the third requirement for independent input of two values smip and
smrp , let’s supposed that all valid requisites smip are equal and H uniformly send
them to the full range of possible addresses aip . In this case, each aip corresponds to
2( u − ( w − u )) smip . Then there 2u 2u options independent of input values smip and smrp ,
among which:
    a) 2u identical values input options, i.e.,=       =
                                               ( smip smrp , aip arp );
          b)            2u (2u − 2u − ( w −u ) )                         options,     in     which           errors      are         detected,     i.e.
( sm ≠ sm , a ≠ a ) ;
       ip               rp         ip           rp
          c)       2u (2u − ( w −u ) − 1)          options,   in        which      errors            are       not     detected,       i.e.
  ( sm ≠ sm , a =
        ip      a ).   rp   ip        rp

      Further there are described computations of the probability of different outcomes
 independent input values smip and smrp . The probability of entering identical values
                                    2u       1
         =
 is p ( smip smrp ,=
                   aip a=
                        rp )         =             . The probability of the case, in which error
                                   2u 2u 2u
                                 1     1
  p ( smip ≠ smrp , aip =arp ) = 8 − 48 ≈ 0, 004 s                  is              detected,             is
                                 2 2
                                2u (2u − 2u − ( w −u ) )      1
  p ( smip ≠ smrp , aip ≠ arp ) =       u u
                                                         =
                                                         1 − w −u . The probability of the case, in
                                       22                   2
                                                                        2u (2u − ( w −u ) − 1)   1      1
 which error is not detected, is p ( smip ≠ smrp , aip =        arp ) = u u                    = w−u
                                                                                                     − u .
                                                                             22                2       2
 For       example,        when        w = 56,           u = 48        and          w = 64,         u = 48
                               1    1
  p ( smip ≠ smrp , aip = arp ) =−     ≈ 1,5 ⋅10−5 .
                              216 248
          Let consider now independent input information elements smip , smrp and smsp ,
 each with equal probability takes any of the valid values and is transformed uniformly
 on the entire range of possible addresses. As previously, each aip corresponds to
  2( u − ( w − u )) smip . Totally, there are 2u 2u 2u values for smip , smrp and smsp inputs,
 among which:

 ─ 2u identical values input options, i.e., ( sm=
                                                ip sm=
                                                     rp smsp , a=
                                                                ip a=
                                                                    rp asp );
 ─ the cases, in which errors are not detected, are following:

       smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ,                      smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ,
 a)                                                                b)                                                              .
      aip ≠ arp , aip ≠ asp , arp =
                                  asp ;                                 aip ≠ arp , aip = asp , arp ≠ asp ;
       smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ,                  smip ≠ smrp , smip ≠ smsp , s mrp ≠ smsp ,
 c)                                                            , d)
      aip = arp , aip ≠ asp , arp ≠ asp ;                      =    aip a= rp , aip  a=sp , arp  asp ;
   smip ≠ smrp , s mip ≠ smsp , s mrp =smsp ,    smip ≠ smrp , s mip = smsp , s mrp ≠ smsp ,
e)                                            f)
=  aip a= rp , aip  a=sp , arp   asp ;       =   aip a= rp , aip  a=sp , arp   asp ;
   smip = smrp , s mip ≠ smsp , s mrp ≠ smsp ,
g)
=  aip a=rp , aip   a=sp , arp   asp .
                                                                                              u − ( w−u )
 i.e.        cases          a,       b,        c   include    2u (2u − 2u − ( w −u ) )(2                    − 1)     options;      g     –
         u − ( w−u )             u − ( w−u )                                  u − ( w−u )
   u
  2 (2                 − 1)(2                  − 2) options; e, f – 2 (2  u
                                                                                            − 1) options.

 ─ the cases, in which errors are detected, are following:
     smip ≠ smrp , smip ≠ smsp , s mrp ≠ smsp ,                         smip ≠ smrp , s mip ≠ smsp , s mrp =
                                                                                                           smsp ,
a)                                                                 b)
     aip ≠ arp , aip ≠ asp , arp ≠ asp ;                                aip ≠ arp , aip ≠ asp , arp =
                                                                                                    asp ;
     smip ≠ smrp , s mip = smsp , s mrp ≠ smsp ,                        smip = smrp , s mip ≠ smsp , s mrp ≠ smsp ,
c)                                                                 d)
     aip ≠ arp , aip = asp , arp ≠ asp ;                                aip = arp , aip ≠ asp , arp ≠ asp ,
                                             u − ( w−u )             u − ( w−u )
i.e. case a includes 2 (2 − 2  u        u
                                                           )(2 − 2
                                                               u
                                                                                   − 2u − ( w −u ) ) options; cases b, c, d –
 2u (2u − 2u − ( w −u ) ) options.
      Further there are described computations of probabilities of the different diagno-
ses with independent input values requisites smip . smrp and smsp . The probability of
                                                                               2u       1
entering identical values of requisites is equal to                                 =       . The probability of the
                                                                               23u 2 2 u
                                                                                3 ⋅ 2u (2u − 2u − ( w −u ) )(2u − ( w −u ) − 1)
case, when the error are skipped, is equal to                                                                                   +
                                                                                                    23u
   2u (2u − ( w −u ) − 1)(2u − ( w −u ) − 2) + 3 ⋅ 2u (2u − ( w −u ) − 1)  1            1              1          1 
+                                                                           =  w −u − u   3 − w −u −1 + u  . The
                                                                              2        2                         2 
                                        3u
                                     2                                                                2
probability of error detection by comparing indices obtained for the three requisites is
               2u (2u − 2u − ( w −u ) )(2u − 2u − ( w −u ) − 2u − ( w −u ) ) 3 ⋅ 2u (2u − 2u − ( w −u ) )
equal to                                                                    +                             =
                                           23u                                          23u
          1              1             3 
=  1 − w −u 1 − w −u −1 + u  . For example, when w = 64, u = 48 , the probability
       2               2              2  
of      the           case,         when          an        error           is      skipped,             p [a ∨ b ∨ c ∨ d ] =
   1    1       1    1 
=  16 − 48  3 − 15 + 48  ≈ 4,5 ⋅10−5 .
  2    2       2    2 
     Considering the fourth requirement in the case of independent input of two val-
ues smip and smrp , it should be calculated probability of case, if aip = arp , then
smip = smrp . The Bayes' formula [7] allows to calculate posteriori conditional proba-
bility of the presence of unconditional priori one.
      Let the event ER1 is equal smip = smrp , event ER2 – smip ≠ smrp , event EI –
                                                      p ( ER1 ) p ( EI | ER1 )
aip = arp . Then p ( ER1 | EI ) =                 2
                                                                                      . Let's supposed, that smip and
                                                 ∑ p( ERi ) p( EI | ERi )
                                                 i =1

smrp , which consist of L characters, are independently entered by two human-
operators based on the same original document. Then p ( ER1 ) can be calculated as the
probability of error-free entry of two requisites, i.e. p ( ER1 )= (1 − p c ) 2 L , where π c is
the possibility of mistakes in the human information (errors per symbol), hence
 p ( ER2 ) =1 − (1 − p c ) 2 L . For example, for L = 6 and π c = 10−2 p ( ER1) ≈ 0,88 . It is
obvious that p ( EI | ER         =1)  p (=
                                         aip arp | =
                                                   mip m=
                                                        rp ) 1 , as, for equal requisites
( M ip ( x) = M rp ( x))           it       is          impossible             to        obtain      different      indexes
( H ( M ip ( x)) ≠ H ( M rp ( x))) ,according                       to    (3).        For      calculations           of
         =
p ( EI | ER2) p=
               (aip arp | smip ≠ smrp ) it is possible to use the fact of equal proba-
bilities of all admissible smip , smrp and their uniform mapping to the corresponding
ranges aip , arp .
       According           theto    formula      of    conditional          probability                             [7],
                           = p (aip arp , smip ≠ smrp ) 2  u − ( w−u )
                                                                       −1
p (aip= arp | smip ≠ smrp=
                         )                              =                 .                                       Then
                                   p ( smip ≠ smrp )         2   u
                                                                   −  1
                                            (1 − p c ) 2 L ⋅1
p ( ER1 | EI ) =                                                               . For example, for u = 48,
                                                         2u − ( w − u ) − 1 
              (1 − p c ) 2 L ⋅1 + (1 − (1 − p c ) 2 L )                     
                                                         2 −1 
                                                                 u


 w = 64, L = 6 p ( ER1 | EI ) ≈ 0,999998 .
     Considering the fourth requirement in the case of independent input of three val-
ues smip , smrp and smsp , it is necessary to calculate the probability of
 =
smip  =
     smrp smsp , when a=
                       ip a=
                           rp asp . Let the event E 3R1 is equal values
 =
smip  =
     smrp smsp , event E 3R2 – smip ≠ smrp , s mip ≠ smsp , s mrp =
                                                                  smsp , event E 3R3
–            smip ≠ smrp , s mip = smsp , s mrp ≠ smsp ,                          event             E 3R4              –
smip = smrp , s mip ≠ smsp , s mrp ≠ smsp ,                               event                E 3R5                   –
smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ,                 event          E 3I    –        a=
                                                                                             ip a=
                                                                                                 rp asp .         Then
                            p ( E 3R1 ) p ( E 3I | E 3R1 )
p ( E 3R1 | E 3I ) =    5
                                                                     . Let calculate the a priori probability of
                       ∑ p ( E 3R ) p ( E 3I | E 3R )
                       i =1
                                        i                       i


events E 3Ri , i = 1,5 according to the binomial law, assuming independence of errors
in        separate                 characters,            as             following:          p ( E 3R1 )= (1 − p c )3 L .
p ( E 3R2 ) = p ( E 3R3 ) = p ( E 3R4 ) = (1 − p c ) 2 L (1 − (1 − p c ) L ) , p ( E 3R5 ) = (1 − (1 − p c ) L )3 .
For example, for                   L=6       and      π c = 10−2
                                                 p ( E 3R1 ) ≈ 0,83 , p ( E 3R2 ) ≈ 0, 05 ,
 p ( E 3R5 ) ≈ 0, 0002 .
       As previously, calculation of the conditional probabilities p ( E 3I | E 3Ri ) is
based on the conditions of the equal probability of all admissible smip , smrp and
smsp and uniformity of transformation to corresponding ranges aip , arp and asp . So,
p ( E 3I | E 3R1 ) = 1 due to the fact that the mapping H each value smip sends to no
more than one aip and hence H is a function. The conditional probabilities of coin-
cidence of codes in the case of only one failed requisite are following:
 p=
  ( E 3I | E 3R2 ) p=
                    ( E 3I | E 3R3 ) p=
                                      ( E 3I | E 3R4 )
     p (aip =
            arp =  asp , smip ≠ smrp , s mip ≠ smsp , smrp =smsp ) 2u − ( w −u ) − 1
=                                                                 =                  . The condi-
            p ( smip ≠ smrp , smip ≠ smsp , smrp =   smsp )           2u − 1
tional probability of indices coincidence in the case of three different requisites input
by                 human-                  operator                   is                                      following:
                    p (aip =
                           arp =  asp , smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp )
 p ( E 3I | E 3R5 ) =
                           p ( smip ≠ smrp , smip ≠ smsp , smrp ≠ smsp )
  (2u − ( w −u ) − 1)(2u − ( w −u ) − 2)
=                                        . Thus, the posterior probability of identity requisites
          (2u − 1)(2u − 2)
 =
sm ip      =
          sm   rp    smsp          provided that a=   ip  a=rp asp , can be calculated as
 p ( E 3R1 | E 3I ) =
                                                 (1 − π c )3 L ⋅1
=                                                                                                                           .
                                                2u − ( w − u ) − 1                L 3 (2
                                                                                         u − ( w−u )
                                                                                                     − 1)(2u − ( w−u ) − 2)
  (1 − ππππ
       c ) ⋅ 1 + 3 ⋅ (1 − c ) (1 − (1 − c ) ) ⋅
          3L                 2L            L
                                                                   + (1 − (1 − c ) ) ⋅
                                                   2u − 1                                      (2u − 1)(2u − 2)
For example, for  =    u 48, =  w 64, = L 6 p ( E 3R1 | E 3I ) ≈ 0,999997 .
       The standard CRC-CCITT polynomial G1 ( x) = x16 + x12 + x 5 + 1 and CRC-16 -
 G2 ( x) = x16 + x15 + x 2 + 1 are commonly used to increase the reliability of information
transmission in computer networks [6]. It is obvious that they satisfy the first and
second requirements, as deq (Gi ( x)) > 0 and Gi ( x=  ) x w −u + ... + 1 , where w − u ≥ 16 .
i = 1, 2 . Furthermore, they may be represented as a product of polynomials of lower
degree, for example, G1 ( x) = ( x + 1) ⋅ ( x15 + x14 + x13 + x12 + x 4 + x 3 + x 2 + x + 1) , and
are not irreducible. Therefore, the codes constructed based on the G1 ( x) and G2 ( x)
does not refer to cyclic, but inherit all the capabilities of error detection, the inherent
cyclic codes, including the ability of uniform mapping the possible keys smip , smrp
..., smsp to the corresponding ranges aip , arp ..., asp . Therefore, assuming that each of
the elements smip . smrp ..., smsp with equal probability takes any of the permissible
values, then G1 ( x) and G2 ( x) satisfies the third and fourth requirements.
      Choosing the best alternative was carried out using the method of weighted sum.
Natural when forming the weighting factors will have an idea of ranking weights
according densities classes most common error.
      Thus, the code based on the polynomial G1 ( x) = x16 + x12 + x 5 + 1 will have the
best total controlling ability relative to the most common classes of errors in the data
on the names of employees of the KhAI University


4         Diagnostic Data Model

     According to the signal-parametric approach to control systems diagnostic [4,8],
the diagnostic models are defined as mathematical constructions linking indirect
signs with direct reasons of the fault. In our case, diagnostic data model (DMD) is
named a mathematical construction that relates indirect indications of the data lines
with errors, the DMD must be of the form
                                                      
                                            DD = D − D ,                                     (4)
                                                                                 
where ΔD is an indirect indication of the presence of failed data; D          , D are direct
functions of signs of error and the reference data, respectively. For any DMD, the
conditions of diagnosability must also be fulfilled, i.e. the possibility of an unambigu-
ous establishment of the fact of the presence of failed data on indirect signs.
   Let’s create the DMD to identify and search for a place of failed requisites in the
multiset M ρ . Let Aρ = {a1ρ , a2 ρ ,..., aq ρ } be multiset indices calculated for the initial
requisites, and G ( x) = x16 + x12 + x 5 + 1 , let            D    be row vector of dimension
[0,..., 2 − 1] such that D[ai=
       16
                             ρ ] | Aρ ∩ Ai ρ | where Ai ρ = {ai ρ , ai ρ ,..., ai ρ } . Then the
                                                                         
                                                                                    q − раз

equation, characterized by the absence of failed requisites in M ρ will have the form
 
 D[a1ρ ] = q , i.e. all indexes are the same. If, however, M ρ contains failed requisites,
the D [a1=
          ρ ] | Aρ ∩ A1ρ | . Thus, the DMD to detect failed data in M ρ looks as:

                                                 
                           D det Dρ =D [a1ρ ] − D[a1ρ ] =| Aρ ∩ A1ρ | − q ,                  (5)

where D det Dρ is an indirect indication of the presence of failed data in M ρ . If
D det Dρ ≡ 0 then M ρ is error-free, or M ρ contains failed information.
   To find a place in the wrong requisite M ρ DMD will be as follows:

                                                                          q
                        D pl Di ρ = D [ai ρ ] − D[ai ρ ] = | Aρ ∩ Ai ρ | − − a ,             (6)
                                                                           2
where D pl Di ρ is an indirect indication of the presence of failed data in the requisite
                1 q
 smi ρ . α ∈ [ ; − 1] , and if D pl Di ρ < 0 , then smi ρ is faulty requisite, otherwise smi ρ
                2 2
is a reference requisite.
       The performance of the method of index-requisite diagnosis was evaluated. In
this case the first stage is filling a row vector D . It can be assumed to be proportional
to the value q . It is assumed that the calculation of indices occur before data cleaning
process. As for the time of the second stage, it coincides with the time of the second
stage in the case of pairwise comparisons requisites. Maximum wait time for diagnos-
tic procedures on the basis of the method of index-requisite diagnosis -
Tобщ.инд. рекв ≈ 2 * q * tcр.в . The overall performance of the method of index-requisite
                                    (q − 1) * L + 2
diagnosis of redundant information in               times higher than the performance
                                           4
of the method based on pairwise comparison requisites. For example, when q = 3 and
        Tобщ.поп.ср
L=8                     ≈ 4.5 .
       Tобщ.инд. рекв
5      Conclusion

Deep diagnostics data is the basis for the following problem solution of data recovery.
Determining, based on the principle of majority, error and reference values for each
attribute it is possible automatic replacement of standard errors. In addition, the failed
attributes should be corrected in the source subsystem. Since the change in the origi-
nal data in the data warehouse is technically impossible, human-operator should be
informed about the error occurred to ensure the quality of subsystem data. Such noti-
fication must include the failed attribute, reference attribute, as well as the record ID,
for example, last name, first name, etc. If the source subsystem allows working with a
clipboard, the failed value could be replaced by correct one automatically.
    If it is impossible to find the reference and failed values for the attribute, for exam-
ple, if there are two different requisites and diagnostic model cannot detect the place,
it is concluded that both requisites are incorrect. Decision-making is entrusted to the
system administrator, which can redirect the problem to the operators.


References
 1. Chukhray, A., Havrylenko, O.: Proximate Objects Probabilistic Searching Meth-
    od. Advances in Intelligent Systems and Computing, 1113 AISC, 219-227
    (2020).
 2. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. 3rd
    edn. The MIT Press, 1292 p., (2009).
 3. Borda, M.: Fundamentals in Information Theory and Coding. 2011th edn, Spring-
    er, 485 p. (2011).
 4. Kulik, A.: Rational intellectualization of the aircraft control: Resources-saving
    safety improvement. Studies in Systems, Decision and Control, 173-192 (2017).
 5. Martínez Bastida, J.P., Havrylenko, O., Chukhray, A.: Developing a self-
    regulation environment in an open learning model with higher fidelity assess-
    ment. Communications in Computer and Information Science, 826, 112-131
    (2018).
 6. Tanenbaum, A., Wetherall, D.: Computer Networks, 5 edn, Pearson, 960 p.
    (2012).
 7. Ghahramani, S.: Fundamentals of Probability. With Stochastic Processes. 4th edn,
    CRC Press, (2018).
 8. Martínez Bastida, J.P., Gavrilenko, E.V., Chukhray, A.G.: Developing a pedagog-
    ical intervention support based on Bayesian networks. CEUR Workshop Proceed-
    ings, 1844, 265-272 (2017).