
Index-Requisite Data Diagnostics In Information Management Systems

National Aerospace University “Kharkiv Aviation Institute”, Kharkiv, Ukraine


Information Management Systems (IMS) that are based on legacy systems suffer significantly from dirty data. Data cleansing in such systems usually starts with a search for clusters of similar tuples. After that, a reference tuple should be formed for each cluster and saved in the IMS data warehouse. Moreover, failed tuples should be returned to the source subsystem with an indication of the error location, i.e. the concrete invalid requisite. Such deep diagnosis is necessary because the reference tuple may be not just one of the existing tuples but also a combination of requisites from several different tuples. For one obtained cluster of similar tuples, a multiset can be composed from all values of a certain attribute. The paper presents a method for diagnosing such a multiset in terms of faultlessness and correctability, based on the majority principle. The method minimizes the time required to establish that a multiset is incorrect and, moreover, allows the valid (reference) and failed elements of the multiset to be identified.

Keywords: Data Cleansing, Diagnostics, Similar Tuples, Reference Requisite, Multiset

Introduction

Therefore, the development of a fast index-requisite diagnostics method based on hashing and cyclic codes [2, 3] is a promising way to solve the data cleansing problem. The main principles of rational control, which have been successfully implemented in systems engineering [4], should be investigated on a real, functioning IMS database.

Problem Statement

Consider the situation in which the tuples of one cluster consist of three requisites. Then the probability $p(BB)$ of the event $BB$, meaning that one of two data tuples has a single or double error in one requisite and the second tuple has such an error in another requisite, can be calculated as

$$p(BB) = 2\,\bigl(p(A_1)p(D_2)p(D_3)\cdot p(D_1)p(A_2)p(D_3) + p(A_1)p(D_2)p(D_3)\cdot p(D_1)p(D_2)p(A_3) + p(D_1)p(A_2)p(D_3)\cdot p(D_1)p(D_2)p(A_3)\bigr),$$

where $p(A_i) = L_i p_c (1-p_c)^{L_i-1} + C_{L_i}^{2}\, p_c^{2} (1-p_c)^{L_i-2}$ are the probabilities that a requisite fails with a single or double error, $p(D_i) = (1-p_c)^{L_i}$ are the probabilities that a requisite contains no errors, $L_i$ are the average lengths of the requisite values, $i = \overline{1,3}$, and $p_c$ is the probability of distortion of a requisite symbol. For example, when $L_1 = 8$, $L_2 = 6$, $L_3 = 10$ and $p_c = 10^{-2}$, $p(BB) \approx 0.025$.
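The numerical example above can be checked with a short script. This is only a sketch: the function names are illustrative and not taken from the paper, and the double-error term is assumed to carry the factor $p_c^2$ of the binomial law.

```python
from itertools import combinations
from math import comb, prod


def p_error(length, p_c):
    """Probability of a single or double symbol error in a requisite (p(A_i))."""
    single = length * p_c * (1 - p_c) ** (length - 1)
    double = comb(length, 2) * p_c ** 2 * (1 - p_c) ** (length - 2)
    return single + double


def p_clean(length, p_c):
    """Probability that a requisite contains no errors (p(D_i))."""
    return (1 - p_c) ** length


def p_bb(lengths, p_c):
    """Probability that two tuples are damaged in two different requisites."""
    # Probability that a tuple is damaged in requisite i only.
    only = [
        p_error(li, p_c) * prod(p_clean(lj, p_c) for j, lj in enumerate(lengths) if j != i)
        for i, li in enumerate(lengths)
    ]
    # The factor 2 accounts for which of the two tuples carries which error.
    return 2 * sum(a * b for a, b in combinations(only, 2))


if __name__ == "__main__":
    print(round(p_bb([8, 6, 10], 1e-2), 3))
```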

Considering only one of the obtained clusters of similar tuples, we proceed to the formal statement of the problem.

Let $R_1$ be the considered cluster, i.e. a set of $q \in N$ tuples. For every attribute $\rho$ ($\rho = \overline{1,h}$) of the tuple, the corresponding multiset $M_\rho = \{sm_{1\rho}, sm_{2\rho}, ..., sm_{q\rho}\}$ should be formed. Then, based on the principle of majority voting (two out of three, three out of five, etc.), which is widely used in diagnosing technical systems [4, 5], it is possible to define the correctness of the multiset $M_\rho$.

Definition 1. A multiset $M_\rho$ ($|M_\rho| > 2$) is faultless if all its elements are equal, i.e.

$$CORRECT(M_\rho) = \{\forall i \in \{1,...,q\}\ \forall j \in \{1,...,q\}\ (i \neq j) \Rightarrow (sm_{i\rho} = sm_{j\rho})\},$$

where $CORRECT(M_\rho)$ is a Boolean predicate and $\Rightarrow$ is the implication operator.

Definition 2. A multiset $M_\rho$ ($|M_\rho| > 2$) is correctable if more than half, but not all, of its elements are equal, i.e.

$$CORRECTED(M_\rho) = \{\exists M'_\rho = \{sm'_{1\rho}, sm'_{2\rho}, ..., sm'_{z\rho}\} \subset M_\rho\ (z > q/2) \wedge (z < q) \wedge (\forall i \in \{1,...,z\}\ \forall j \in \{1,...,z\}\ (i \neq j) \Rightarrow (sm'_{i\rho} = sm'_{j\rho}))\},$$

where $CORRECTED(M_\rho)$ is a Boolean predicate and $M'_\rho$ is a subset of $M_\rho$.

Consequently, an element $sm_{i\rho} \in M_\rho$ is a reference element if $sm_{i\rho} \in M'_\rho$, and a failed element if $sm_{i\rho} \notin M'_\rho$. Obviously, if $M_\rho$ is correctable, then $M'_\rho$ is unique. In addition, if no more than half of the elements of $M_\rho$ are equal, $M_\rho$ is neither faultless nor correctable.
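Definitions 1 and 2 translate directly into a majority check. The sketch below is an illustration only: the function name and the status labels are assumptions, and the most frequent value is found with the standard library `Counter` rather than with the index-based method developed later in the paper.

```python
from collections import Counter


def classify_multiset(multiset):
    """Classify a multiset of requisite values by Definitions 1 and 2.

    Returns (status, reference_value); reference_value is None when
    no value is shared by more than half of the elements.
    """
    q = len(multiset)
    if q <= 2:
        raise ValueError("the definitions require more than two elements")
    value, count = Counter(multiset).most_common(1)[0]
    if count == q:
        return "faultless", value      # Definition 1: all elements are equal
    if count > q / 2:
        return "correctable", value    # Definition 2: a strict majority agrees
    return "uncorrectable", None       # neither faultless nor correctable


if __name__ == "__main__":
    print(classify_multiset(["Ivanov", "Ivashov", "Ivanov"]))
```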

The objective of this paper is to present a method of multiset diagnostics that, based on the definitions given above, minimizes the time needed to establish the correctness of $M_\rho$ and allows, if necessary, the reference and failed elements of $M_\rho$ to be located.

Consider the most obvious approach, i.e. pairwise comparison of all elements $sm_{i\rho}$ and $sm_{j\rho}$ of $M_\rho$ for $i \neq j$. To do this, we associate with the multiset $M_\rho$ another multiset $CNT_\rho = \{cnt_{1\rho}, cnt_{2\rho}, ..., cnt_{q\rho}\}$ such that

$$cnt_{i\rho} = 1 + \sum_{j=1,\ j \neq i}^{q} eq_{ij}, \quad i = \overline{1,q},$$

where $eq_{ij} = 1$ if $sm_{i\rho} = sm_{j\rho}$ and $eq_{ij} = 0$ otherwise. If $cnt_{1\rho} = q$, then $M_\rho$ is faultless; otherwise $M_\rho$ is correctable if $\exists i \in \{1,...,q\}: cnt_{i\rho} > q/2$, i.e. there is a reference element. For example, if $M_\rho = \{$'Иванов', 'Ивашов', 'Иванов'$\}$, then $CNT_\rho = \{2, 1, 2\}$; $sm_{1\rho}$ and $sm_{3\rho}$ are reference elements, and $sm_{2\rho}$ is a failed element.
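The construction of $CNT_\rho$ can be sketched as follows. The function name is illustrative; each unordered pair is compared once and the result is credited to both elements, which gives the same counters as the sum over $j \neq i$ while making the number of element comparisons, $(q^2 - q)/2$, explicit.

```python
def pairwise_counts(multiset):
    """Build CNT for a multiset by comparing every unordered pair once.

    Returns (counts, comparisons), where counts[i] is the number of
    elements equal to multiset[i], including the element itself.
    """
    q = len(multiset)
    counts = [1] * q
    comparisons = 0
    for i in range(q):
        for j in range(i + 1, q):
            comparisons += 1
            if multiset[i] == multiset[j]:
                counts[i] += 1
                counts[j] += 1
    return counts, comparisons


if __name__ == "__main__":
    cnt, n = pairwise_counts(["Иванов", "Ивашов", "Иванов"])
    q = len(cnt)
    reference = [i for i, c in enumerate(cnt) if c > q / 2]
    print(cnt, n, reference)
```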

The performance of the pairwise comparison method [2] is estimated below by counting the number of symbol comparisons. Let each element $sm_{i\rho} \in M_\rho$ comprise $L$ characters ($L \gg 1$). Then the maximum number of character comparisons is

$$C_{pc} = (q-1 + q-2 + ... + 1)\,L = \frac{q^2-q}{2}\,L.$$

Consequently, $T_{pc} \approx \frac{q^2-q}{2}\,L \cdot t_{pc}$, where $T_{pc}$ is the maximum time of pairwise comparison of the elements of the multiset $M_\rho$ and $t_{pc}$ is the runtime of a comparison of two characters. Since the maximum runtime of the second stage of diagnosing is proportional to $q$, $T_{cpc} \approx \left(\frac{q^2-q}{2}\,L + q\right) t_{pc}$, where $T_{cpc}$ is the maximum runtime of the diagnostic procedure based on the method of pairwise comparison of requisites.

A significant improvement of this method is the use of a key conversion (hash) [2], which maps requisites to array indexes (memory addresses):

$$H: sm_{i\rho} \rightarrow a_{i\rho}, \quad (2)$$

where $H$ is a transformation (mapping) and $a_{i\rho}$ is the array index corresponding to the element $sm_{i\rho}$ of the multiset $M_\rho$.

The main challenge of key conversion is that the set of possible values is much greater than the set of possible memory locations. Therefore, it is necessary to choose a mapping $H$ that allows:
─ common errors in source data entered by a human operator into input fields to be detected, i.e. a reference argument $sm_{i\rho}$ and an argument $sm_{r\rho}$ failed with any of the specified errors always give different results $a_{i\rho} \neq a_{r\rho}$;
─ the difference between $sm_{i\rho}$ and $sm_{r\rho}$ to be established definitively in the case of different results $a_{i\rho} \neq a_{r\rho}$;
─ the same addresses to be produced for different random source elements only with arbitrarily small probability;
─ the conclusion that the probability of $sm_{i\rho} \neq sm_{r\rho}$ is arbitrarily small for equal addresses ($a_{i\rho} = a_{r\rho}$).

Once the mapping $H$ is chosen, the problem of establishing the correctness of $M_\rho$ and searching for its reference and failed elements can be solved effectively by using diagnostic models [4] that link indirect signs of errors with direct ones.

The Choice Of The Requisites-To-Indices Mapping

The stated problem should be considered as a task of error-correcting coding theory: the construction of a code with a predetermined detecting ability for transmitting discrete information through a noisy channel [3].

Indeed, $(sm_{i\rho}, a_{i\rho})$ can be regarded as a permissible sequence of a redundant code, where $sm_{i\rho}$ are the data bits and $a_{i\rho} = H(sm_{i\rho})$ are the check bits. However, in contrast to transmission over a communication channel, where both data bits and check bits can be corrupted by noise, in this case only the data bits can change, i.e. $sm_{i\rho}$ is received as $sm_{r\rho} \neq sm_{i\rho}$.

Cyclic codes are the most reliable: they have a high detecting ability and are widely used in practice because their coding/decoding devices are less complicated than those of other coding techniques [3]. When a cyclic code is constructed for a given number $u$ of data bits, the shortest length $w$ of the code combinations that provides a predetermined multiplicity of error detection is determined. This problem is reduced to finding the required generator polynomial $G(x)$ of degree $w - u$.

For cyclic codes, the transformation of data bits to check bits has the following form:

$$H(M_{i\rho}(x)) = (M_{i\rho}(x)\,x^{w-u}) \bmod G(x), \quad (3)$$

where $M_{i\rho}(x)$ is the polynomial in a dummy variable $x$ corresponding to the data bits $sm_{i\rho}$, and $\bmod$ is the operator that returns the remainder of polynomial division.
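Transformation (3) can be sketched with integers used as bit vectors of polynomial coefficients over GF(2). The function names are illustrative, and the default generator `0x11021` (the bit pattern of $x^{16} + x^{12} + x^5 + 1$) together with the big-endian, 8-bits-per-character reading of the requisite are assumptions made for the example.

```python
def poly_mod(value, generator):
    """Remainder of dividing polynomial `value` by `generator` over GF(2).

    Both polynomials are integers whose bit k is the coefficient of x**k.
    """
    degree = generator.bit_length() - 1
    while value.bit_length() > degree:
        # Align the generator with the leading term and cancel it (XOR is
        # both addition and subtraction modulo 2).
        shift = value.bit_length() - generator.bit_length()
        value ^= generator << shift
    return value


def requisite_index(data, generator=0x11021):
    """Index H(M) = (M(x) * x**(w-u)) mod G(x) for a requisite given as bytes."""
    check_bits = generator.bit_length() - 1          # w - u
    message = int.from_bytes(data, "big")            # M(x), most significant bit first
    return poly_mod(message << check_bits, generator)


if __name__ == "__main__":
    print(hex(requisite_index(b"Ivanov")), hex(requisite_index(b"Ivashov")))
```

With these assumptions the mapping coincides with a 16-bit CRC computed with a zero initial value and no bit reflection.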

Thus, it is necessary to choose a mapping $H$ such that:
1. If $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$, then $M_{r\rho}(x) \neq M_{i\rho}(x)$.
2. For frequently occurring error classes $E_{з}$ and $M_{r\rho}(x) = M_{i\rho}(x) + E_{з}$, the combination $(sm_{r\rho}, a_{i\rho})$ is rejected, i.e. $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$.
3. For random noise $E_c$ and $M_{r\rho}(x) = M_{i\rho}(x) + E_c$, the probability that the combination $(sm_{r\rho}, a_{i\rho})$ is permitted is arbitrarily small, i.e. $p(H(M_{r\rho}(x)) = H(M_{i\rho}(x))) \rightarrow 0$.
4. The equality $sm_{i\rho} = sm_{r\rho}$ follows from $H(M_{r\rho}(x)) = H(M_{i\rho}(x))$ with a probability close to one.

It is shown below that the first requirement is satisfied if $\deg(G(x)) > 0$, where $\deg(G(x))$ is the degree of the generator polynomial $G(x)$. If $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$, then $(M_{r\rho}(x)x^{w-u}) \bmod G(x) \neq (M_{i\rho}(x)x^{w-u}) \bmod G(x)$. Using the distributive property of the $\bmod$ operator, we obtain $(M_{r\rho}(x)x^{w-u} + M_{i\rho}(x)x^{w-u}) \bmod G(x) \neq 0$. Suppose that $M_{r\rho}(x)x^{w-u} + M_{i\rho}(x)x^{w-u} = 0$; then $0 \bmod G(x) \neq 0$, which is a contradiction, since $\deg(G(x)) > 0$ and, by definition, $0 \bmod G(x) = 0$. Consequently, $M_{r\rho}(x)x^{w-u} + M_{i\rho}(x)x^{w-u} \neq 0$, $M_{r\rho}(x)x^{w-u} \neq M_{i\rho}(x)x^{w-u}$ and $M_{r\rho}(x) \neq M_{i\rho}(x)$.

Before finding the conditions that satisfy the second requirement, it is necessary to introduce auxiliary statements.

Statement 1. If $A(x) \bmod G(x) \neq 0$ and $B(x) \bmod G(x) \neq 0$, then $(A(x)B(x)) \bmod G(x) \neq 0$.

Proof. Suppose that $A(x) \bmod G(x) \neq 0$ and $B(x) \bmod G(x) \neq 0$, but $(A(x)B(x)) \bmod G(x) = 0$. Then $A(x) \neq W(x)G(x)$, where $W(x)$ is a polynomial. Multiplying both sides of this inequality by $B(x)$, we obtain $A(x)B(x) \neq W(x)G(x)B(x)$. Next, taking the remainder of dividing both sides by $G(x)$ gives $(A(x)B(x)) \bmod G(x) \neq (W(x)G(x)B(x)) \bmod G(x)$, i.e. $0 \neq 0$, which is a contradiction. Consequently, $(A(x)B(x)) \bmod G(x) \neq 0$, Q.E.D.

Statement 2. If $G(x) = x^c + ... + 1$, then $G(x)W(x) = x^d + ... + x^f + ... + \alpha$, where $c, d, f \in N$, $d > f$, $d = c + \deg(W(x))$, $\alpha \in \{0, 1\}$, and $W(x)$ is a polynomial.

Proof. Let us represent $G(x)$ by the sequence of its ones. Then the modulo 2 multiplication of $G(x)$ and $W(x)$ can be considered as a modulo 2 addition with shifts: $G(x) \times W(x) = W(x)_1 \oplus W(x)_2 \oplus ... \oplus W(x)_n$, where $\oplus$ is the operation of addition modulo 2 and $W(x)_i$ are the right shifts of $W(x)$ by $i$.

It should be noted that the least significant bit of the first term and the most significant bit of the last term are not cancelled; therefore, $G(x)W(x)$ can be represented in the form $x^d + ... + x^f + ... + \alpha$, Q.E.D.

Statement 3. If $G(x) = x^{w-u} + ... + 1$, then for any single error $E(x) = x^i$, $i \in \{w-u, ..., w-1\}$, such that $M_{r\rho}(x) = M_{i\rho}(x) + E(x)$, the inequality $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$ holds.

Proof. Consider $H(M_{r\rho}(x)) = ((M_{i\rho}(x) + x^i)x^{w-u}) \bmod G(x)$ and $H(M_{i\rho}(x)) = (M_{i\rho}(x)x^{w-u}) \bmod G(x)$, and suppose that $H(M_{r\rho}(x)) = H(M_{i\rho}(x))$. Then the equality $((M_{i\rho}(x) + x^i)x^{w-u}) \bmod G(x) = (M_{i\rho}(x)x^{w-u}) \bmod G(x)$ holds as well, from which it follows that $(x^{i+w-u}) \bmod G(x) = 0$ and, therefore, $x^{i+w-u} = G(x)W(x)$. On the other hand, according to Statement 2, $G(x)W(x) = x^d + ... + x^f + ... + \alpha$ and $x^{i+w-u} \neq x^d + ... + x^f + ... + \alpha$. Consequently, $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$, Q.E.D.

Statement 4. If $G(x) = x^{w-u} + ... + 1$, then for a packet type error $E(x) = x^i + ... + x^{i-p+1}$, $p \leq w-u$, $i \in \{w-u+p-1, ..., w-1\}$, such that $M_{r\rho}(x) = M_{i\rho}(x) + E(x)$, the inequality $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$ holds.

Proof. Suppose that $H(M_{r\rho}(x)) = H(M_{i\rho}(x))$. Then $((M_{i\rho}(x) + x^i + ... + x^{i-p+1})x^{w-u}) \bmod G(x) = (M_{i\rho}(x)x^{w-u}) \bmod G(x)$, hence $(x^{w-u+i-p+1}(x^{p-1} + ... + 1)) \bmod G(x) = 0$. Neither factor is divisible by $G(x)$: $x^{w-u+i-p+1}$ is not divisible by $G(x)$, and $x^{p-1} + ... + 1$ is not divisible by $G(x)$ because $p - 1 < w - u$. Therefore, by Statement 1, $(x^{w-u+i-p+1}(x^{p-1} + ... + 1)) \bmod G(x) \neq 0$. This is a contradiction, and hence $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$.

Statement 5. If 8 bits are used to represent one character of a requisite, $sm_{r\rho}$ differs from $sm_{i\rho}$ by any single transcription, and $G(x) = x^{w-u} + ... + 1$, where $w - u \geq 8$, then $a_{r\rho} \neq a_{i\rho}$.

Proof. Any single transcription can be represented as a packet type error $E(x) = x^i + ... + x^{i-p+1}$, where $p \leq 8$. Consequently, in accordance with the statement on packet type errors (Statement 4), $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$ and therefore $a_{r\rho} \neq a_{i\rho}$.

Statement 6. If 8 bits are used to represent one character of a requisite, $sm_{r\rho}$ differs from $sm_{i\rho}$ by any transposition or double transcription of adjacent characters, and $G(x) = x^{w-u} + ... + 1$, where $w - u \geq 16$, then $a_{r\rho} \neq a_{i\rho}$.

Proof. Any transposition or double transcription of adjacent symbols can be represented as a packet type error $E(x) = x^i + ... + x^{i-p+1}$, where $p \leq 16$. Consequently, in accordance with the statement on packet type errors (Statement 4), $H(M_{r\rho}(x)) \neq H(M_{i\rho}(x))$ and therefore $a_{r\rho} \neq a_{i\rho}$.
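The two statements on character-level errors can be checked empirically for a sample requisite. The sketch below is an illustration under stated assumptions: it uses the standard library function `binascii.crc_hqx` (a 16-bit CRC with the polynomial $x^{16} + x^{12} + x^5 + 1$) with a zero initial value as the mapping $H$, a Latin-1 encoding to get 8 bits per character, and a hypothetical sample name and alphabet.

```python
import binascii
import string


def index_of(text):
    """16-bit index of a requisite, 8 bits per character (assumed encoding)."""
    return binascii.crc_hqx(text.encode("latin-1"), 0)


def single_transcriptions(text, alphabet):
    """All strings that differ from `text` in exactly one character."""
    for i, ch in enumerate(text):
        for repl in alphabet:
            if repl != ch:
                yield text[:i] + repl + text[i + 1:]


def adjacent_transpositions(text):
    """All strings obtained by swapping two different adjacent characters."""
    for i in range(len(text) - 1):
        if text[i] != text[i + 1]:
            yield text[:i] + text[i + 1] + text[i] + text[i + 2:]


def undetected_errors(reference, alphabet=string.ascii_letters):
    """Distorted strings whose index coincides with that of the reference."""
    ref_index = index_of(reference)
    variants = list(single_transcriptions(reference, alphabet))
    variants += list(adjacent_transpositions(reference))
    return [v for v in variants if index_of(v) == ref_index]


if __name__ == "__main__":
    print(undetected_errors("Ivanov"))
```

An empty result for a sample string is consistent with the statements but does not replace their proofs.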

Considering the third requirement for the independent input of two values $sm_{i\rho}$ and $sm_{r\rho}$, suppose that all valid requisites $sm_{i\rho}$ are equally probable and that $H$ maps them uniformly onto the full range of possible addresses $a_{i\rho}$. In this case, each $a_{i\rho}$ corresponds to $2^{u-(w-u)}$ values $sm_{i\rho}$. Then there are $2^u \cdot 2^u$ options for the independent input of the values $sm_{i\rho}$ and $sm_{r\rho}$, among which:
a) $2^u$ options of identical input values, i.e. ($sm_{i\rho} = sm_{r\rho}$, $a_{i\rho} = a_{r\rho}$);
b) $2^u(2^u - 2^{u-(w-u)})$ options in which errors are detected, i.e. ($sm_{i\rho} \neq sm_{r\rho}$, $a_{i\rho} \neq a_{r\rho}$);
c) $2^u(2^{u-(w-u)} - 1)$ options in which errors are not detected, i.e. ($sm_{i\rho} \neq sm_{r\rho}$, $a_{i\rho} = a_{r\rho}$).

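The probabilities of the different outcomes of the independent input of the values $sm_{i\rho}$ and $sm_{r\rho}$ are computed below. The probability of entering identical values is

$$p(sm_{i\rho} = sm_{r\rho}, a_{i\rho} = a_{r\rho}) = \frac{2^u}{2^u 2^u} = \frac{1}{2^u}.$$

The probability of the case in which an error is detected is

$$p(sm_{i\rho} \neq sm_{r\rho}, a_{i\rho} \neq a_{r\rho}) = \frac{2^u(2^u - 2^{u-(w-u)})}{2^u 2^u} = 1 - \frac{1}{2^{w-u}}.$$

The probability of the case in which an error is not detected is

$$p(sm_{i\rho} \neq sm_{r\rho}, a_{i\rho} = a_{r\rho}) = \frac{2^u(2^{u-(w-u)} - 1)}{2^u 2^u} = \frac{1}{2^{w-u}} - \frac{1}{2^u}.$$

For example, when $u = 48$ and $w = 56$, $p(sm_{i\rho} \neq sm_{r\rho}, a_{i\rho} = a_{r\rho}) = \frac{1}{2^{8}} - \frac{1}{2^{48}} \approx 0.004$; when $u = 48$ and $w = 64$, $p(sm_{i\rho} \neq sm_{r\rho}, a_{i\rho} = a_{r\rho}) = \frac{1}{2^{16}} - \frac{1}{2^{48}} \approx 1.5 \cdot 10^{-5}$.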
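For small parameters, the three option counts can be verified by exhaustive enumeration. In the sketch below the mapping `value % 2**(w - u)` is a hypothetical stand-in for a uniform $H$ (it is not the cyclic code mapping), chosen only because it sends exactly $2^{u-(w-u)}$ values to each index.

```python
from itertools import product


def count_outcomes(u, w):
    """Enumerate all pairs of u-bit values and count the three outcomes.

    Returns (identical, detected, undetected).
    """
    check_bits = w - u

    def index_of(value):
        # Hypothetical uniform mapping onto 2**(w-u) indices.
        return value % (2 ** check_bits)

    identical = detected = undetected = 0
    for first, second in product(range(2 ** u), repeat=2):
        if first == second:
            identical += 1
        elif index_of(first) != index_of(second):
            detected += 1
        else:
            undetected += 1
    return identical, detected, undetected


if __name__ == "__main__":
    u, w = 4, 6
    n, k = 2 ** u, 2 ** (u - (w - u))
    print(count_outcomes(u, w), (n, n * (n - k), n * (k - 1)))
```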

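Now consider the independent input of the information elements $sm_{i\rho}$, $sm_{r\rho}$ and $sm_{s\rho}$, each of which takes any of the valid values with equal probability and is mapped uniformly onto the entire range of possible addresses. As previously, each $a_{i\rho}$ corresponds to $2^{u-(w-u)}$ values $sm_{i\rho}$. In total, there are $2^u 2^u 2^u$ options for the inputs $sm_{i\rho}$, $sm_{r\rho}$ and $sm_{s\rho}$, among which:
─ $2^u$ options of identical input values, i.e. ($sm_{i\rho} = sm_{r\rho} = sm_{s\rho}$, $a_{i\rho} = a_{r\rho} = a_{s\rho}$);
─ the cases in which errors are detected: all three values differ and all three indices differ ($a_{i\rho} \neq a_{r\rho}$, $a_{i\rho} \neq a_{s\rho}$, $a_{r\rho} \neq a_{s\rho}$), which includes $2^u(2^u - 2^{u-(w-u)})(2^u - 2^{u-(w-u)} - 2^{u-(w-u)})$ options; exactly two values are equal and the index of the third value differs from theirs (for example, $sm_{i\rho} = sm_{r\rho}$, $sm_{i\rho} \neq sm_{s\rho}$, $sm_{r\rho} \neq sm_{s\rho}$, $a_{i\rho} = a_{r\rho}$, $a_{i\rho} \neq a_{s\rho}$, $a_{r\rho} \neq a_{s\rho}$), which includes $2^u(2^u - 2^{u-(w-u)})$ options for each of the three possible pairs;
─ the cases in which errors are skipped: all three values differ and exactly two indices coincide, which includes $2^u(2^u - 2^{u-(w-u)})(2^{u-(w-u)} - 1)$ options for each of the three possible pairs; all three values differ and all three indices coincide ($sm_{i\rho} \neq sm_{r\rho}$, $sm_{i\rho} \neq sm_{s\rho}$, $sm_{r\rho} \neq sm_{s\rho}$, $a_{i\rho} = a_{r\rho} = a_{s\rho}$), which includes $2^u(2^{u-(w-u)} - 1)(2^{u-(w-u)} - 2)$ options; exactly two values are equal and all three indices coincide (for example, $sm_{i\rho} = sm_{r\rho}$, $sm_{i\rho} \neq sm_{s\rho}$, $sm_{r\rho} \neq sm_{s\rho}$, $a_{i\rho} = a_{r\rho} = a_{s\rho}$), which includes $2^u(2^{u-(w-u)} - 1)$ options for each of the three possible pairs.

The probabilities of the different diagnoses for the independent input of the requisite values $sm_{i\rho}$, $sm_{r\rho}$ and $sm_{s\rho}$ are computed below. The probability of entering identical requisite values is equal to $\frac{2^u}{2^{3u}} = \frac{1}{2^{2u}}$. The probability of the case in which an error is skipped is equal to

$$\frac{3 \cdot 2^u(2^u - 2^{u-(w-u)})(2^{u-(w-u)} - 1)}{2^{3u}} + \frac{2^u(2^{u-(w-u)} - 1)(2^{u-(w-u)} - 2) + 3 \cdot 2^u(2^{u-(w-u)} - 1)}{2^{3u}} = \left(\frac{1}{2^{w-u}} - \frac{1}{2^u}\right)\left(3 - \frac{1}{2^{w-u-1}} + \frac{1}{2^u}\right).$$

The probability of error detection by comparing the indices obtained for the three requisites is equal to

$$\frac{2^u(2^u - 2^{u-(w-u)})(2^u - 2^{u-(w-u)} - 2^{u-(w-u)}) + 3 \cdot 2^u(2^u - 2^{u-(w-u)})}{2^{3u}} = \left(1 - \frac{1}{2^{w-u}}\right)\left(1 - \frac{1}{2^{w-u-1}} + \frac{3}{2^u}\right).$$

For example, when $w = 64$ and $u = 48$, the probability of the case in which an error is skipped is $\left(\frac{1}{2^{16}} - \frac{1}{2^{48}}\right)\left(3 - \frac{1}{2^{15}} + \frac{1}{2^{48}}\right) \approx 4.5 \cdot 10^{-5}$.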
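The three-input probabilities can be evaluated exactly with rational arithmetic, which also makes it easy to check that the identical, detected and skipped outcomes exhaust all $2^{3u}$ options. The function name is illustrative; the grouping of the option counts follows the formulas above.

```python
from fractions import Fraction


def three_input_probabilities(u, w):
    """Exact probabilities (identical, detected, skipped) for three inputs."""
    n = 2 ** u                    # number of admissible requisite values
    k = 2 ** (u - (w - u))        # number of values mapped to one index
    total = n ** 3
    identical = Fraction(n, total)
    # All indices differ, or two values are equal and the third index differs.
    detected = Fraction(n * (n - k) * (n - 2 * k) + 3 * n * (n - k), total)
    # At least two indices coincide although the corresponding values differ.
    skipped = Fraction(
        3 * n * (n - k) * (k - 1) + n * (k - 1) * (k - 2) + 3 * n * (k - 1),
        total,
    )
    return identical, detected, skipped


if __name__ == "__main__":
    identical, detected, skipped = three_input_probabilities(48, 64)
    print(float(identical), float(detected), float(skipped))
    print(identical + detected + skipped == 1)
```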

Considering the fourth requirement in the case of the independent input of two values $sm_{i\rho}$ and $sm_{r\rho}$, it is necessary to calculate the probability that $sm_{i\rho} = sm_{r\rho}$ if $a_{i\rho} = a_{r\rho}$. Bayes' formula [7] allows the posterior conditional probability to be calculated from the unconditional prior probabilities.

Let the event $ER_1$ denote $sm_{i\rho} = sm_{r\rho}$, the event $ER_2$ denote $sm_{i\rho} \neq sm_{r\rho}$, and the event $EI$ denote $a_{i\rho} = a_{r\rho}$. Then

$$p(ER_1 | EI) = \frac{p(ER_1)\,p(EI | ER_1)}{\sum_{i=1}^{2} p(ER_i)\,p(EI | ER_i)}.$$

Suppose that $sm_{i\rho}$ and $sm_{r\rho}$, which consist of $L$ characters, are independently entered by two human operators based on the same original document. Then $p(ER_1)$ can be calculated as the probability of error-free entry of two requisites, i.e. $p(ER_1) = (1-p_c)^{2L}$, where $p_c$ is the probability of a human input mistake (errors per symbol); hence $p(ER_2) = 1 - (1-p_c)^{2L}$. For example, for $L = 6$ and $p_c = 10^{-2}$, $p(ER_1) \approx 0.88$. Obviously, $p(EI | ER_1) = p(a_{i\rho} = a_{r\rho} | sm_{i\rho} = sm_{r\rho}) = 1$, since for equal requisites ($M_{i\rho}(x) = M_{r\rho}(x)$) it is impossible to obtain different indexes ($H(M_{i\rho}(x)) \neq H(M_{r\rho}(x))$) according to (3). To calculate $p(EI | ER_2) = p(a_{i\rho} = a_{r\rho} | sm_{i\rho} \neq sm_{r\rho})$, it is possible to use the fact that all admissible $sm_{i\rho}$, $sm_{r\rho}$ are equally probable and are mapped uniformly onto the corresponding ranges $a_{i\rho}$, $a_{r\rho}$. According to the formula of conditional probability,

$$p(a_{i\rho} = a_{r\rho} | sm_{i\rho} \neq sm_{r\rho}) = \frac{p(a_{i\rho} = a_{r\rho}, sm_{i\rho} \neq sm_{r\rho})}{p(sm_{i\rho} \neq sm_{r\rho})} = \frac{2^{u-(w-u)} - 1}{2^u - 1}.$$

Then

$$p(ER_1 | EI) = \frac{(1-p_c)^{2L} \cdot 1}{(1-p_c)^{2L} \cdot 1 + (1 - (1-p_c)^{2L})\,\frac{2^{u-(w-u)} - 1}{2^u - 1}}.$$

For example, for $u = 48$, $w = 64$, $L = 6$, $p(ER_1 | EI) \approx 0.999998$.

Considering the fourth requirement in the case of the independent input of three values $sm_{i\rho}$, $sm_{r\rho}$ and $sm_{s\rho}$, it is necessary to calculate the probability that $sm_{i\rho} = sm_{r\rho} = sm_{s\rho}$ when $a_{i\rho} = a_{r\rho} = a_{s\rho}$. Let the event $E3R_1$ denote $sm_{i\rho} = sm_{r\rho} = sm_{s\rho}$; the event $E3R_2$ denote $sm_{i\rho} \neq sm_{r\rho}$, $sm_{i\rho} \neq sm_{s\rho}$, $sm_{r\rho} = sm_{s\rho}$; the event $E3R_3$ denote $sm_{i\rho} \neq sm_{r\rho}$, $sm_{i\rho} = sm_{s\rho}$, $sm_{r\rho} \neq sm_{s\rho}$; the event $E3R_4$ denote $sm_{i\rho} = sm_{r\rho}$, $sm_{i\rho} \neq sm_{s\rho}$, $sm_{r\rho} \neq sm_{s\rho}$; the event $E3R_5$ denote $sm_{i\rho} \neq sm_{r\rho}$, $sm_{i\rho} \neq sm_{s\rho}$, $sm_{r\rho} \neq sm_{s\rho}$; and the event $E3I$ denote $a_{i\rho} = a_{r\rho} = a_{s\rho}$. Then

$$p(E3R_1 | E3I) = \frac{p(E3R_1)\,p(E3I | E3R_1)}{\sum_{i=1}^{5} p(E3R_i)\,p(E3I | E3R_i)}.$$

Let us calculate the a priori probabilities of the events $E3R_i$, $i = \overline{1,5}$, according to the binomial law, assuming independence of errors in separate characters, as follows: $p(E3R_1) = (1-p_c)^{3L}$, $p(E3R_2) = p(E3R_3) = p(E3R_4) = (1-p_c)^{2L}(1 - (1-p_c)^L)$, $p(E3R_5) = (1 - (1-p_c)^L)^3$. For example, for $L = 6$ and $p_c = 10^{-2}$, $p(E3R_1) \approx 0.83$, $p(E3R_2) \approx 0.05$, $p(E3R_5) \approx 0.0002$.

As previously, the calculation of the conditional probabilities $p(E3I | E3R_i)$ is based on the equal probability of all admissible $sm_{i\rho}$, $sm_{r\rho}$ and $sm_{s\rho}$ and on the uniformity of their mapping onto the corresponding ranges $a_{i\rho}$, $a_{r\rho}$ and $a_{s\rho}$. So, $p(E3I | E3R_1) = 1$, because the mapping $H$ sends each value $sm_{i\rho}$ to no more than one $a_{i\rho}$ and hence $H$ is a function. The conditional probabilities of coincidence of the codes in the case of only one failed requisite are as follows:

$$p(E3I | E3R_2) = p(E3I | E3R_3) = p(E3I | E3R_4) = \frac{2^{u-(w-u)} - 1}{2^u - 1}.$$

The conditional probability of coincidence of the indices in the case of the input of three different requisites is as follows:

$$p(E3I | E3R_5) = \frac{p(a_{i\rho} = a_{r\rho} = a_{s\rho}, sm_{i\rho} \neq sm_{r\rho}, sm_{i\rho} \neq sm_{s\rho}, sm_{r\rho} \neq sm_{s\rho})}{p(sm_{i\rho} \neq sm_{r\rho}, sm_{i\rho} \neq sm_{s\rho}, sm_{r\rho} \neq sm_{s\rho})} = \frac{(2^{u-(w-u)} - 1)(2^{u-(w-u)} - 2)}{(2^u - 1)(2^u - 2)}.$$

Thus, the posterior probability that the requisites are identical, provided that $a_{i\rho} = a_{r\rho} = a_{s\rho}$, can be calculated as

$$p(E3R_1 | E3I) = \frac{(1-p_c)^{3L} \cdot 1}{(1-p_c)^{3L} \cdot 1 + 3 \cdot (1-p_c)^{2L}(1 - (1-p_c)^L) \cdot \frac{2^{u-(w-u)} - 1}{2^u - 1} + (1 - (1-p_c)^L)^3 \cdot \frac{(2^{u-(w-u)} - 1)(2^{u-(w-u)} - 2)}{(2^u - 1)(2^u - 2)}}.$$

For example, for $u = 48$, $w = 64$, $L = 6$, $p(E3R_1 | E3I) \approx 0.999997$.
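Both posterior probabilities can be reproduced numerically. The sketch below follows the formulas of this section term by term; the function names are illustrative, and the default $p_c = 10^{-2}$ is the value used in the examples.

```python
def posterior_two(u, w, length, p_c=1e-2):
    """p(ER1 | EI): the two requisites are equal given equal indices."""
    n, k = 2 ** u, 2 ** (u - (w - u))
    prior_equal = (1 - p_c) ** (2 * length)
    prior_differ = 1 - prior_equal
    collide = (k - 1) / (n - 1)          # p(EI | ER2)
    return prior_equal / (prior_equal + prior_differ * collide)


def posterior_three(u, w, length, p_c=1e-2):
    """p(E3R1 | E3I): all three requisites are equal given equal indices."""
    n, k = 2 ** u, 2 ** (u - (w - u))
    clean = (1 - p_c) ** length          # one requisite entered without errors
    prior_all_equal = clean ** 3
    prior_one_failed = clean ** 2 * (1 - clean)   # each of E3R2, E3R3, E3R4
    prior_all_differ = (1 - clean) ** 3
    collide_one = (k - 1) / (n - 1)
    collide_all = (k - 1) * (k - 2) / ((n - 1) * (n - 2))
    denominator = (
        prior_all_equal
        + 3 * prior_one_failed * collide_one
        + prior_all_differ * collide_all
    )
    return prior_all_equal / denominator


if __name__ == "__main__":
    print(f"{posterior_two(48, 64, 6):.6f}")
    print(f"{posterior_three(48, 64, 6):.6f}")
```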

The standard polynomials CRC-CCITT $G_1(x) = x^{16} + x^{12} + x^5 + 1$ and CRC-16 $G_2(x) = x^{16} + x^{15} + x^2 + 1$ are commonly used to increase the reliability of information transmission in computer networks [6]. Obviously, they satisfy the first and second requirements, since $\deg(G_i(x)) > 0$ and $G_i(x) = x^{w-u} + ... + 1$, where $w - u \geq 16$, $i = 1, 2$. Furthermore, they can be represented as a product of polynomials of lower degree, for example, $G_1(x) = (x + 1) \cdot (x^{15} + x^{14} + x^{13} + x^{12} + x^4 + x^3 + x^2 + x + 1)$, and thus are not irreducible. Therefore, the codes constructed on the basis of $G_1(x)$ and $G_2(x)$ do not belong to the cyclic codes, but they inherit all the error detection capabilities inherent in cyclic codes, including the ability to map the possible keys $sm_{i\rho}$, $sm_{r\rho}$, ..., $sm_{s\rho}$ uniformly onto the corresponding ranges $a_{i\rho}$, $a_{r\rho}$, ..., $a_{s\rho}$. Therefore, assuming that each of the elements $sm_{i\rho}$, $sm_{r\rho}$, ..., $sm_{s\rho}$ takes any of the permissible values with equal probability, $G_1(x)$ and $G_2(x)$ satisfy the third and fourth requirements.
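The factorization of $G_1(x)$ given above can be verified by carry-less multiplication over GF(2). In this sketch polynomials are encoded as integers whose bit $k$ is the coefficient of $x^k$; the helper names are illustrative.

```python
def gf2_multiply(a, b):
    """Product of two polynomials over GF(2) (carry-less multiplication)."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # add the current shifted copy of a modulo 2
        a <<= 1
        b >>= 1
    return result


def poly_from_exponents(*exponents):
    """Integer encoding of the polynomial with the given nonzero terms."""
    value = 0
    for e in exponents:
        value |= 1 << e
    return value


G1 = poly_from_exponents(16, 12, 5, 0)      # CRC-CCITT
G2 = poly_from_exponents(16, 15, 2, 0)      # CRC-16
X_PLUS_1 = poly_from_exponents(1, 0)
COFACTOR = poly_from_exponents(15, 14, 13, 12, 4, 3, 2, 1, 0)

if __name__ == "__main__":
    print(gf2_multiply(X_PLUS_1, COFACTOR) == G1)
    # A polynomial over GF(2) is divisible by x + 1 exactly when it has an
    # even number of terms, which holds for G2 as well.
    print(bin(G2).count("1") % 2 == 0)
```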

The best alternative was chosen using the weighted sum method. When forming the weighting factors, it is natural to rank the weights according to the frequencies of the most common error classes.

Thus, the code based on the polynomial $G_1(x) = x^{16} + x^{12} + x^5 + 1$ has the best total controlling ability with respect to the most common classes of errors in the data on the names of employees of the KhAI University.

Diagnostic Data Model

According to the signal-parametric approach to the diagnostics of control systems [4, 8], diagnostic models are defined as mathematical constructions linking indirect signs with the direct causes of a fault. In our case, a diagnostic data model (DMD) is a mathematical construction that relates indirect indications of failed data to the errors. The DMD must be of the form

$$\Delta D = \tilde{D} - \hat{D}, \quad (4)$$

where $\Delta D$ is an indirect indication of the presence of failed data, and $\tilde{D}$, $\hat{D}$ are direct functions of the signs of the error and of the reference data, respectively. For any DMD, the conditions of diagnosability must also be fulfilled, i.e. it must be possible to establish unambiguously the presence of failed data from indirect signs.

Let us create the DMD to detect and locate failed requisites in the multiset $M_\rho$. Let $A_\rho = \{a_{1\rho}, a_{2\rho}, ..., a_{q\rho}\}$ be the multiset of indices calculated for the initial requisites with $G(x) = x^{16} + x^{12} + x^5 + 1$, and let $D$ be a row vector with the index range $[0, ..., 2^{16} - 1]$ such that $D[a_{i\rho}] = |A_\rho \cap A_{i\rho}|$, where $A_{i\rho} = \{a_{i\rho}, a_{i\rho}, ..., a_{i\rho}\}$ consists of $q$ copies of $a_{i\rho}$. Then the equation characterizing the absence of failed requisites in $M_\rho$ has the form $\hat{D}[a_{1\rho}] = q$, i.e. all indexes are the same. If, however, $M_\rho$ contains failed requisites, then $\tilde{D}[a_{1\rho}] = |A_\rho \cap A_{1\rho}|$. Thus, the DMD for detecting failed data in $M_\rho$ is

$$\Delta_{det} D_\rho = \tilde{D}[a_{1\rho}] - \hat{D}[a_{1\rho}] = |A_\rho \cap A_{1\rho}| - q, \quad (5)$$

where $\Delta_{det} D_\rho$ is an indirect indication of the presence of failed data in $M_\rho$. If $\Delta_{det} D_\rho \equiv 0$, then $M_\rho$ is error-free; otherwise $M_\rho$ contains failed information.

To locate a failed requisite in $M_\rho$, the DMD is as follows:

$$\Delta_{pl} D_{i\rho} = \tilde{D}[a_{i\rho}] - \hat{D}[a_{i\rho}] = |A_\rho \cap A_{i\rho}| - \frac{q}{2} - \alpha, \quad (6)$$

where $\Delta_{pl} D_{i\rho}$ is an indirect indication of the presence of failed data in the requisite $sm_{i\rho}$, $\alpha \in [\frac{1}{2}; \frac{q}{2} - 1]$, and if $\Delta_{pl} D_{i\rho} < 0$, then $sm_{i\rho}$ is a faulty requisite; otherwise $sm_{i\rho}$ is a reference requisite.
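The detection and location models can be sketched together. Several choices below are assumptions made for illustration: `binascii.crc_hqx` with a zero initial value stands in for the mapping $H$ with $G(x) = x^{16} + x^{12} + x^5 + 1$, the requisites are encoded with 8 bits per character (`cp1251`), a `Counter` replaces the dense row vector $D$ of $2^{16}$ cells, and $\alpha = 1/2$ is used.

```python
import binascii
from collections import Counter


def diagnose_by_index(requisites, alpha=0.5, encoding="cp1251"):
    """Index-requisite diagnosis of one multiset of requisite values.

    Returns (detection, failed), where detection is the value of the
    detection model (0 means no failed data was found) and failed is the
    list of positions whose location model value is negative.
    """
    q = len(requisites)
    # Indices a_i computed for the initial requisites.
    indices = [binascii.crc_hqx(s.encode(encoding), 0) for s in requisites]
    # Sparse stand-in for the row vector D: D[a] = multiplicity of a.
    d = Counter(indices)
    detection = d[indices[0]] - q
    location = [d[a] - q / 2 - alpha for a in indices]
    failed = [i for i, value in enumerate(location) if value < 0]
    return detection, failed


if __name__ == "__main__":
    print(diagnose_by_index(["Иванов", "Ивашов", "Иванов"]))
    print(diagnose_by_index(["Иванов", "Иванов", "Иванов"]))
```

Filling the counter takes one pass over the $q$ indices and the location step takes another, so no pairwise comparison of requisites is performed.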

The performance of the index-requisite diagnosis method was evaluated as follows. The first stage is filling the row vector $D$; its time can be assumed to be proportional to $q$. It is assumed that the indices are calculated before the data cleaning process. The time of the second stage coincides with the time of the second stage in the case of pairwise comparison of requisites. The maximum runtime of the diagnostic procedure based on the index-requisite diagnosis method is $T_{total.ind.req} \approx 2 \cdot q \cdot t_{pc}$. Overall, the performance of the index-requisite diagnosis method for redundant information is $\frac{(q-1) \cdot L + 2}{4}$ times higher than the performance of the method based on pairwise comparison of requisites. For example, when $q = 3$ and
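The performance ratio follows from dividing the two runtime estimates, and it can be evaluated for any $q$ and $L$. In the sketch below the function names are illustrative and the value $L = 6$ is a hypothetical choice for the demonstration, not a figure from the paper.

```python
def pairwise_runtime(q, length, t_pc=1.0):
    """Maximum runtime of diagnosis by pairwise comparison of requisites."""
    return ((q * q - q) / 2 * length + q) * t_pc


def index_runtime(q, t_pc=1.0):
    """Maximum runtime of index-requisite diagnosis (two passes over q indices)."""
    return 2 * q * t_pc


def speedup(q, length):
    """How many times the index-requisite method is faster."""
    return pairwise_runtime(q, length) / index_runtime(q)


if __name__ == "__main__":
    q, length = 3, 6   # hypothetical cluster size and requisite length
    print(speedup(q, length), ((q - 1) * length + 2) / 4)
```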

Conclusion

Deep data diagnostics is the basis for solving the subsequent problem of data recovery. Once the erroneous and reference values of each attribute have been determined on the basis of the majority principle, typical errors can be replaced automatically. In addition, the failed attributes should be corrected in the source subsystem. Since changing the original data from the data warehouse is technically impossible, the human operator should be informed about the error to ensure the quality of the subsystem data. Such a notification must include the failed attribute, the reference attribute, and the record identifier, for example, last name, first name, etc. If the source subsystem allows working with a clipboard, the failed value can be replaced by the correct one automatically.

If it is impossible to find the reference and failed values for an attribute, for example, if there are two different requisites and the diagnostic model cannot locate the error, it is concluded that both requisites are incorrect. The decision is then entrusted to the system administrator, who can redirect the problem to the operators.

References

1. Chukhray, A., Havrylenko, O.: Proximate Objects Probabilistic Searching Method. Advances in Intelligent Systems and Computing, 1113 AISC, 219-227 (2020).

2. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. 3rd edn. The MIT Press, 1292 p. (2009).

3. Borda, M.: Fundamentals in Information Theory and Coding. 2011th edn. Springer, 485 p. (2011).

4. Kulik, A.: Rational intellectualization of the aircraft control: Resources-saving safety improvement. Studies in Systems, Decision and Control, 173-192 (2017).

5. Martínez Bastida, J.P., Havrylenko, O., Chukhray, A.: Developing a self-regulation environment in an open learning model with higher fidelity assessment. Communications in Computer and Information Science, 826, 112-131 (2018).

6. Tanenbaum, A., Wetherall, D.: Computer Networks. 5th edn. Pearson, 960 p. (2012).

7. Ghahramani, S.: Fundamentals of Probability. With Stochastic Processes. 4th edn. CRC Press (2018).

8. Martínez Bastida, J.P., Gavrilenko, E.V., Chukhray, A.G.: Developing a pedagogical intervention support based on Bayesian networks. CEUR Workshop Proceedings, 1844, 265-272 (2017).