Index-Requisite Data Diagnostics In Information Management Systems Andrey Chukhray1 [0000-0002-8075-3664] and Olena Havrylenko2 [0000-0001-5227-9742] 1 National Aerospace University “Kharkiv Aviation Institute”, Kharkiv, Ukraine achukhray@gmail.com 2 National Aerospace University “Kharkiv Aviation Institute”, Kharkiv, Ukraine o.havrylenko@khai.edu Abstract. Informational Management Systems (IMS) which are based on lega- cy systems have a significant problem of dirty data. The data cleansing problem solution in such systems usually starts with the search of similar tuples' clusters. After that for each cluster the reference tuple should be formed for saving in a data warehouse of IMS. Moreover, fail tuples should be returned to the source subsystem with the indication of error location, i. e. concrete invalid requisite. The necessary of such a deep diagnosis determined by the following fact: the reference tuple can be not just one of the existent, but as well the combination of several different tuples requisites. Considering one obtained cluster of similar tuples, a certain multiset can be composed from all of the certain attribute val- ues. The paper represents the method of the multiset's diagnostic in terms of faultless and correctability, based on the majority principle. The method pro- vides the minimum time required for establishing the fact of multiset's incor- rectness, moreover it allow defining valid (reference) and failed elements of the multiset. Keywords: Data Cleansing, Diagnostics, Similar Tuples, Reference Requisite, Multiset. 1 Introduction Informational Management Systems (IMS) which are based on legacy systems have a significant problem of dirty data. The data cleansing problem solution in such systems usually starts with the search of similar tuples' clusters [1]. After that for each cluster the reference tuple should be formed for saving in a data warehouse of IMS. Therefore, fast index-requisite diagnostics method's development based on the hashing and cyclic codes [2,3] is quite perspective way of data cleaning problem solu- tion. The main principals of the rational control, which have successful implementa- tion in system engineering [4], should be investigated on the real functioning database of IMS. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 Problem Statement Let consider the situation when one cluster tuples consist of three requisites. Then the probability p ( BB ) of the event BB , which means that one of two data tuples has single or double error in one requisite and second tuple – in another requisite, can be calculated as: = p ( BB) 2 * ( p ( A1 ) p ( D2 ) p ( D3 ) * p ( D1 ) p ( A2 ) p ( D3 ) + p ( A1 ) p ( D2 ) p ( D3 ) * * p ( D1 ) p ( D2 ) p ( A3 ) * p ( D1 ) p ( A2 ) p ( D3 ) + p ( D1 ) p ( D2 ) p ( A3 )), where p ( Ai ) = Lip c (1 − p c ) Li −1 + CLi2 (1 − p c ) Li − 2 are probabilities of requisites fails with single or double error, p ( Di )= (1 − p c ) Li are probabilities of requisites fails absence, Li are the average lengths of the requisites values, i = 1,3 , π c is the possibility of the requisite symbol distortion. For example, when L1 = 8 , L2 = 6 , L3 = 10 and π c = 10−2 p ( BB) ≈ 0.025 . Considering only one of the clusters obtained similar tuples proceeds to the formal statement of the problem. Let R1 is the considered cluster, i.e. set including q ∈ N tuples. For every attribute ρ ( ρ = 1, h ) of the tuple the corresponding multiset M ρ = {sm1ρ , sm2 ρ ,..., smq ρ } should be formed. Then, based on the principle of majority voting (two among three, three among five etc.) is widely used in diagnosing technical systems [4, 5], it is pos- sible to give definitions of multiset М ρ correctness. Definition 1. A multiset M ρ ( M ρ > 2 ) is faultless, if all elements are equal, i.e. CORRECT ( M ρ ) = {∀i ∈ {1,.., q}∀j ∈ {1,.., q} (i ≠ j) ⇒ (sm = sm )} , iρ jρ where CORRECT ( M ρ ) is a Boolean predicate, ⇒ is an implication operator. Definition 2. A multiset M ρ ( M ρ > 2 ) is correctable, if more than half, but not all of its elements are equal, i.e. { M ρ' ={sm1' ρ , sm2' ρ ,..., smz' ρ } ⊂ M ρ ( z > q / 2) ∧ ( z < q) ∧ CORRECTED ( M ρ ) =∃ ∧ ( ∀i ∈ {1,..., z}∀j ∈ {1,..., z} (i ≠ j ) ⇒ ( smi'ρ =sm'j ρ ) ) , } where CORRECTED ( M ρ ) is a Boolean predicate, M ρ' is a subset of M ρ . Consequently, an element smip ∈ M ρ is a reference tuple, if smi ρ ∈ M ρ' , as well as an element smi ρ ∈ M ρ is an failed tuple, if smi ρ ∉ M ρ' . It is also obviously, in the case if M ρ is correctable M ρ' is unique. In addition, if in M ρ are equal no more than half of the elements, M ρ is not faultless and not correctable. The objective of this paper is to represent the method of the multiset’s diagnos- tic, which ensures minimal time of M ρ correctness establishment, based on the given above definitions, and allow to locate, if it is necessary, reference and failed elements of M ρ . Let consider the obviously easiest approach, i.e. pairwise comparison of all the M ρ elements smi ρ and sm j ρ for (i ≠ j ) . To do this, we assign the multiset M ρ an- q other multiset CNTρ = {cnt1ρ , cnt2 ρ ,..., cntq ρ } , such that cnti ρ = 1 + ∑ eqij , i = 1, q j =1 j ≠i = sm j ; 1, if smi rr where eqij =  If cnt1ρ = q then M ρ is faultless, or M ρ is cor- 0, othrwise . rectable if ∃i ∈ {1,..., q} : cntip > q / 2 , i.e. there is a reference element. For example, if M ρ = {' Иванов ', ' Ивашов ', ' Иванов '} , then CNTρ = {2, 1, 2} . sm1ρ , sm3 ρ are reference elements, sm2 ρ is a failed element. The performance estimation of the pairwise comparisons method [2] by counting the number of symbols comparisons is followed. Let each element smi ρ ∈ M ρ com- prises L characters ( L >> 1 ). Then the maximum number of character comparisons q2 − q q2 − q C pc = ( q − 1 + q − 2 + ... + 1) L = L . Consequently, Tpc ≈ L ⋅ t pc , where Tpc 2 2 is the maximum time of the multiset M p of pairwise comparison elements, t pc is the runtime of the two characters comparison. Since the diagnosing second stage maxi-  q2 − q  mum runtime is proportional to q then Tcpc ≈  L + q  t pc where Tcpc is the max-  2  imum runtime of the diagnostic procedure based on the method of pairwise requisite’s comparison. A significant improvement of this method is using of a conversion key (hash) [2], which sends requisites to the array indexes (memory address): H : smip → aip (2) where H is a transformation (mapping), aip is an array index corresponding to an element smip of the multiset M p . The main challenge deal with the key conversion is that the set of possible val- ues is much greater than the set of possible memory locations. Therefore it is neces- sary to choose a mapping H , which allow: ─ to detect common errors of source data, entering by human-operator in input fields, i.e., reference argument smip and argument smrp , failed with any of the specified error, always guarantee different results aip ≠ arp ; ─ to establish definitely the difference of smip and smrp in the case of different results aip ≠ arp ; ─ to produce the same addresses for random source elements with the difference of arbitrarily small probability; ─ to conclude that the probability smip ≠ smrp is arbitrarily small for the equal ad- dresses ( aip = arp ). When mapping H is chosen, the problem of the M p correctness establishment and the reference and failed M p elements search can be effectively solved by using diagnostic models [4] linking errors indirect signs with the direct ones. 3 The Choice Of The Requisites-To-Indices Reflection The stated problem should be considered as a task of error-correcting coding theory: construction of the predetermined code with detecting ability for transmitting discrete information through a noisy channel [3]. Indeed, ( smip , aip ) can be regarded as permissible sequence of redundancy code, where smip are data bits, aip = H ( smip ) are checking bits. However, in contrast to transmission over the communication channel, whereby both data bits and checking bits could be corrupted because of possible noise, in this case only information bits could be changed, i.e. smip is received as smrp ≠ smip . The most reliable are the cyclic codes having high detecting ability and widely used in practice because of less complicated coding/decoding devices schemes in comparison with other coding techniques [3]. Constructing the cyclic code for a given number u of data bits the shortest length of code combinations w is determined to provide a predetermined multiplicity of error detection. This problem is reduced to the determination of needed generating polynomial G ( x) of degrees w − u . For cyclic codes data bits transformation to test bits has the form as following: H ( M= ip ( x ) ) ( M ( x) + x ) mod G( x) , ip w−u (3) where M ip ( x) is polynomial of a dummy variable x corresponding to the data bits, smip , mod is an operator getting remainder of the polynomials division. Thus, it is need to choose a mapping H such as: 1. If H ( M rp ( x)) ≠ H ( M ip ( x)) , then M rp ( x) ≠ M ip ( x) . 2. For frequently occurring error classes Eз and M= rp ( x ) M ip ( x) + Eз a combination ( sm , a ) is excepted, i.e. H (M ( x)) ≠ H (M ( x)) . ip ip rp ip 3. For random noise M= rp ( x ) M ip ( x) + Ec the probability of ( mrp , aip ) permission is = arbitrarily small, i.e. p ( H ( M rp ( x)) H ( M ip ( x))) → 0 where Ec is some random noise. 4. The equity smip = smrp is ensured from H ( M rp ( x)) = H ( M ip ( x)) with a probability close to one. It is necessary to show further that the first requirement is satisfied if deg (G ( x)) > 0 where deq (G ( x)) is a generator polynomial G ( x) degree. If H ( M rp ( x)) ≠ H ( M ip ( x)) , then ( M rp ( x) x w −u ) mod G ( x) ≠ ( M ip ( x) x w −u ) mod G ( x) . Using the distributive property of the operator mod , it can be obtained ( M rp ( x) x w −u + M ip ( x) x w −u ) mod G ( x) ≠ 0 . Let's supposed, that 0 , then 0 mod G ( x) ≠ 0 is a contradiction, since the con- M rp ( x) x w − u + M ip ( x) x w − u = dition deg(G ( x)) > 0 and the definition of 0 mod G ( x) = 0 . Consequently, w−u w−u w−u M rp ( x) x + M ip ( x) x ≠ 0 , M rp ( x) x ≠ M ip ( x) x w −u and M rp ( x) ≠ M ip ( x) . Before finding out conditions that satisfy the second requirement, it is need to in- troduce auxiliary statements. Statement 1. If A( x) mod G ( x) ≠ 0 and B( x) mod G ( x) ≠ 0 , then ( A( x) B ( x)) mod G ( x) ≠ 0 . Proof. Let's supposed, that A( x) mod G ( x) ≠ 0 and B( x) mod G ( x) ≠ 0 , but ( A( x) B ( x)) mod G ( x) = 0 , then A( x) ≠ W ( x)G ( x) , where W ( x) is a polynomial. Multiplying both sides of this inequality by B ( x) , it could be obtained A( x) B( x) ≠ W ( x)G ( x) B ( x) . Next, let take the reminder by dividing both sides by G ( x) as ( A( x) B( x)) mod G ( x) ≠ (W ( x)G ( x) B( x)) mod G ( x) , then 0 ≠ 0 is a contra- diction. Consequently, ( A( x) B ( x)) mod G ( x) ≠ 0 , Q.E.D. Statement 2. If G ( x) = x c + ... + 1 then G ( x ) W ( x ) = x d + ... + x f + ... + α , where c, d , f ∈ N , d > f , d= c + deg(W ( x)) , α ∈ {0,1} , W ( x) is a polynomial. Proof. Let represent G ( x) as sequence of ones. Then multiplication by modulo 2 of G ( x) and W ( x) may be considered as a modulo 2 addition with a shift: G ( x ) × W (= x ) W ( x )1 ⊕ W ( x )2 ⊕ ... ⊕ W ( x )n , where ⊕ is operation of addition by modulo 2, W ( x )i are right shifts by i of W ( x ) . It should be mentioned that the lowest significant bit of the first term and the highest significant bit of the last term are not compensated, therefore, G ( x)W ( x) represented in the form x d + ... + x f + ... + α , Q.E.M. Statement 3. If G ( x) = x w −u + ... + 1 , then for any single error E ( x) = x i . i ∈ {w − u ,..., w − 1} , such that M=rp ( x ) M ip ( x) + E ( x) , H ( M rp ( x)) ≠ H ( M ip ( x)) is performed. = Proof. Considered conditions H ( M rp ( x )) (( M ip ( x) + x i ) x w −u ) mod G ( x) and H ( M ip ( x)) = (( M ip ( x)) x w − u ) mod G ( x) , let’s supposed that condition H ( M rp ( x)) = H ( M ip ( x)) is true. Then equality (( M ip ( x) + x i ) x w − u ) mod G ( x) = ( M ip ( x) x w − u ) mod G ( x) is true as well, from which it is followed that ( x i + w −u ) mod G ( x) = 0 and, therefore, x i + w −u = G ( x)W ( x) . On the other hand, according to statement 2 G ( x)W ( x) = x d + ... + x f + ... + α and x i + w−u ≠ x + ... + x + ... + α . Consequently, H ( M rp ( x)) ≠ H ( M ip ( x)) , Q.E.M. d f Statement 4. If G ( x) = x w −u + ... + 1 , then for packet type error E ( x) = x + ... + x i , p ≤ w−u , i − p +1 i ∈ {w − u + p − 1,..., w − 1} , for which M= rp ( x ) M ip ( x) + E ( x) is true, H ( M rp ( x)) ≠ H ( M ip ( x)) is performed. Proof. Let's supposed that H ( M rp ( x)) = H ( M ip ( x)) , then i − p +1 w−u w−u (( M ip ( x) + x + ... + x i )x ) mod G ( x) =( M ip ( x) x ) mod G ( x) , hence w − u + i − p +1 p −1 w − u + i − p +1 (x (x + ... + 1)) mod G ( x) =0 . Each of the factors: x is not evenly p −1 divisible by G ( x) ; x + ... + 1 also not divisible by G ( x) , because of the p − 1 < w − u and therefore, ( x w −u + i − p +1 ( x p −1 + ... + 1)) mod G ( x) ≠ 0 . It is a contradic- tion, and hence, H ( M rp ( x)) ≠ H ( M ip ( x)) . Statement 4. If for requisite it is used 8 bits to represent one character, smrp dif- ) x w −u + ... + 1 , where w − u ≥ 8 , fer from smip by any single transcription and G ( x= then arp ≠ aip . Proof. Any single transcription can be represented as E ( x) = x i + ... + x i − p +1 , where p ≤ 8 . Consequently, in accordance with the statement 3, H ( M rp ( x)) ≠ H ( M ip ( x)) and therefore arp ≠ aip . Statement 5. If for requisite it is used 8 bits to represent one character, smrp dif- fer from smip by any transposition or double transcription of adjacent characters and ) x w −u + ... + 1 , where w − u ≥ 16 , then arp ≠ aip . G ( x= Proof. Any transposition or double transcription of adjacent symbols can be rep- resented as E ( x) = x i + ... + x i − p +1 , where p ≤ 16 . Consequently, in accordance with the statement 3, H ( M rp ( x)) ≠ H ( M ip ( x)) and therefore arp ≠ aip . Considering the third requirement for independent input of two values smip and smrp , let’s supposed that all valid requisites smip are equal and H uniformly send them to the full range of possible addresses aip . In this case, each aip corresponds to 2( u − ( w − u )) smip . Then there 2u 2u options independent of input values smip and smrp , among which: a) 2u identical values input options, i.e.,= = ( smip smrp , aip arp ); b) 2u (2u − 2u − ( w −u ) ) options, in which errors are detected, i.e. ( sm ≠ sm , a ≠ a ) ; ip rp ip rp c) 2u (2u − ( w −u ) − 1) options, in which errors are not detected, i.e. ( sm ≠ sm , a = ip a ). rp ip rp Further there are described computations of the probability of different outcomes independent input values smip and smrp . The probability of entering identical values 2u 1 = is p ( smip smrp ,= aip a= rp ) = . The probability of the case, in which error 2u 2u 2u 1 1 p ( smip ≠ smrp , aip =arp ) = 8 − 48 ≈ 0, 004 s is detected, is 2 2 2u (2u − 2u − ( w −u ) ) 1 p ( smip ≠ smrp , aip ≠ arp ) = u u = 1 − w −u . The probability of the case, in 22 2 2u (2u − ( w −u ) − 1) 1 1 which error is not detected, is p ( smip ≠ smrp , aip = arp ) = u u = w−u − u . 22 2 2 For example, when w = 56, u = 48 and w = 64, u = 48 1 1 p ( smip ≠ smrp , aip = arp ) =− ≈ 1,5 ⋅10−5 . 216 248 Let consider now independent input information elements smip , smrp and smsp , each with equal probability takes any of the valid values and is transformed uniformly on the entire range of possible addresses. As previously, each aip corresponds to 2( u − ( w − u )) smip . Totally, there are 2u 2u 2u values for smip , smrp and smsp inputs, among which: ─ 2u identical values input options, i.e., ( sm= ip sm= rp smsp , a= ip a= rp asp ); ─ the cases, in which errors are not detected, are following: smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp , smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp , a) b) . aip ≠ arp , aip ≠ asp , arp = asp ; aip ≠ arp , aip = asp , arp ≠ asp ; smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp , smip ≠ smrp , smip ≠ smsp , s mrp ≠ smsp , c) , d) aip = arp , aip ≠ asp , arp ≠ asp ; = aip a= rp , aip a=sp , arp asp ; smip ≠ smrp , s mip ≠ smsp , s mrp =smsp , smip ≠ smrp , s mip = smsp , s mrp ≠ smsp , e) f) = aip a= rp , aip a=sp , arp asp ; = aip a= rp , aip a=sp , arp asp ; smip = smrp , s mip ≠ smsp , s mrp ≠ smsp , g) = aip a=rp , aip a=sp , arp asp . u − ( w−u ) i.e. cases a, b, c include 2u (2u − 2u − ( w −u ) )(2 − 1) options; g – u − ( w−u ) u − ( w−u ) u − ( w−u ) u 2 (2 − 1)(2 − 2) options; e, f – 2 (2 u − 1) options. ─ the cases, in which errors are detected, are following: smip ≠ smrp , smip ≠ smsp , s mrp ≠ smsp , smip ≠ smrp , s mip ≠ smsp , s mrp = smsp , a) b) aip ≠ arp , aip ≠ asp , arp ≠ asp ; aip ≠ arp , aip ≠ asp , arp = asp ; smip ≠ smrp , s mip = smsp , s mrp ≠ smsp , smip = smrp , s mip ≠ smsp , s mrp ≠ smsp , c) d) aip ≠ arp , aip = asp , arp ≠ asp ; aip = arp , aip ≠ asp , arp ≠ asp , u − ( w−u ) u − ( w−u ) i.e. case a includes 2 (2 − 2 u u )(2 − 2 u − 2u − ( w −u ) ) options; cases b, c, d – 2u (2u − 2u − ( w −u ) ) options. Further there are described computations of probabilities of the different diagno- ses with independent input values requisites smip . smrp and smsp . The probability of 2u 1 entering identical values of requisites is equal to = . The probability of the 23u 2 2 u 3 ⋅ 2u (2u − 2u − ( w −u ) )(2u − ( w −u ) − 1) case, when the error are skipped, is equal to + 23u 2u (2u − ( w −u ) − 1)(2u − ( w −u ) − 2) + 3 ⋅ 2u (2u − ( w −u ) − 1)  1 1  1 1  + =  w −u − u   3 − w −u −1 + u  . The 2 2  2  3u 2 2 probability of error detection by comparing indices obtained for the three requisites is 2u (2u − 2u − ( w −u ) )(2u − 2u − ( w −u ) − 2u − ( w −u ) ) 3 ⋅ 2u (2u − 2u − ( w −u ) ) equal to + = 23u 23u  1  1 3  = 1 − w −u 1 − w −u −1 + u  . For example, when w = 64, u = 48 , the probability  2  2 2  of the case, when an error is skipped, p [a ∨ b ∨ c ∨ d ] =  1 1  1 1  =  16 − 48  3 − 15 + 48  ≈ 4,5 ⋅10−5 . 2 2  2 2  Considering the fourth requirement in the case of independent input of two val- ues smip and smrp , it should be calculated probability of case, if aip = arp , then smip = smrp . The Bayes' formula [7] allows to calculate posteriori conditional proba- bility of the presence of unconditional priori one. Let the event ER1 is equal smip = smrp , event ER2 – smip ≠ smrp , event EI – p ( ER1 ) p ( EI | ER1 ) aip = arp . Then p ( ER1 | EI ) = 2 . Let's supposed, that smip and ∑ p( ERi ) p( EI | ERi ) i =1 smrp , which consist of L characters, are independently entered by two human- operators based on the same original document. Then p ( ER1 ) can be calculated as the probability of error-free entry of two requisites, i.e. p ( ER1 )= (1 − p c ) 2 L , where π c is the possibility of mistakes in the human information (errors per symbol), hence p ( ER2 ) =1 − (1 − p c ) 2 L . For example, for L = 6 and π c = 10−2 p ( ER1) ≈ 0,88 . It is obvious that p ( EI | ER =1) p (= aip arp | = mip m= rp ) 1 , as, for equal requisites ( M ip ( x) = M rp ( x)) it is impossible to obtain different indexes ( H ( M ip ( x)) ≠ H ( M rp ( x))) ,according to (3). For calculations of = p ( EI | ER2) p= (aip arp | smip ≠ smrp ) it is possible to use the fact of equal proba- bilities of all admissible smip , smrp and their uniform mapping to the corresponding ranges aip , arp . According theto formula of conditional probability [7], = p (aip arp , smip ≠ smrp ) 2 u − ( w−u ) −1 p (aip= arp | smip ≠ smrp= ) = . Then p ( smip ≠ smrp ) 2 u − 1 (1 − p c ) 2 L ⋅1 p ( ER1 | EI ) = . For example, for u = 48,  2u − ( w − u ) − 1  (1 − p c ) 2 L ⋅1 + (1 − (1 − p c ) 2 L )    2 −1  u w = 64, L = 6 p ( ER1 | EI ) ≈ 0,999998 . Considering the fourth requirement in the case of independent input of three val- ues smip , smrp and smsp , it is necessary to calculate the probability of = smip = smrp smsp , when a= ip a= rp asp . Let the event E 3R1 is equal values = smip = smrp smsp , event E 3R2 – smip ≠ smrp , s mip ≠ smsp , s mrp = smsp , event E 3R3 – smip ≠ smrp , s mip = smsp , s mrp ≠ smsp , event E 3R4 – smip = smrp , s mip ≠ smsp , s mrp ≠ smsp , event E 3R5 – smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp , event E 3I – a= ip a= rp asp . Then p ( E 3R1 ) p ( E 3I | E 3R1 ) p ( E 3R1 | E 3I ) = 5 . Let calculate the a priori probability of ∑ p ( E 3R ) p ( E 3I | E 3R ) i =1 i i events E 3Ri , i = 1,5 according to the binomial law, assuming independence of errors in separate characters, as following: p ( E 3R1 )= (1 − p c )3 L . p ( E 3R2 ) = p ( E 3R3 ) = p ( E 3R4 ) = (1 − p c ) 2 L (1 − (1 − p c ) L ) , p ( E 3R5 ) = (1 − (1 − p c ) L )3 . For example, for L=6 and π c = 10−2 p ( E 3R1 ) ≈ 0,83 , p ( E 3R2 ) ≈ 0, 05 , p ( E 3R5 ) ≈ 0, 0002 . As previously, calculation of the conditional probabilities p ( E 3I | E 3Ri ) is based on the conditions of the equal probability of all admissible smip , smrp and smsp and uniformity of transformation to corresponding ranges aip , arp and asp . So, p ( E 3I | E 3R1 ) = 1 due to the fact that the mapping H each value smip sends to no more than one aip and hence H is a function. The conditional probabilities of coin- cidence of codes in the case of only one failed requisite are following: p= ( E 3I | E 3R2 ) p= ( E 3I | E 3R3 ) p= ( E 3I | E 3R4 ) p (aip = arp = asp , smip ≠ smrp , s mip ≠ smsp , smrp =smsp ) 2u − ( w −u ) − 1 = = . The condi- p ( smip ≠ smrp , smip ≠ smsp , smrp = smsp ) 2u − 1 tional probability of indices coincidence in the case of three different requisites input by human- operator is following: p (aip = arp = asp , smip ≠ smrp , s mip ≠ smsp , s mrp ≠ smsp ) p ( E 3I | E 3R5 ) = p ( smip ≠ smrp , smip ≠ smsp , smrp ≠ smsp ) (2u − ( w −u ) − 1)(2u − ( w −u ) − 2) = . Thus, the posterior probability of identity requisites (2u − 1)(2u − 2) = sm ip = sm rp smsp provided that a= ip a=rp asp , can be calculated as p ( E 3R1 | E 3I ) = (1 − π c )3 L ⋅1 = . 2u − ( w − u ) − 1 L 3 (2 u − ( w−u ) − 1)(2u − ( w−u ) − 2) (1 − ππππ c ) ⋅ 1 + 3 ⋅ (1 − c ) (1 − (1 − c ) ) ⋅ 3L 2L L + (1 − (1 − c ) ) ⋅ 2u − 1 (2u − 1)(2u − 2) For example, for = u 48, = w 64, = L 6 p ( E 3R1 | E 3I ) ≈ 0,999997 . The standard CRC-CCITT polynomial G1 ( x) = x16 + x12 + x 5 + 1 and CRC-16 - G2 ( x) = x16 + x15 + x 2 + 1 are commonly used to increase the reliability of information transmission in computer networks [6]. It is obvious that they satisfy the first and second requirements, as deq (Gi ( x)) > 0 and Gi ( x= ) x w −u + ... + 1 , where w − u ≥ 16 . i = 1, 2 . Furthermore, they may be represented as a product of polynomials of lower degree, for example, G1 ( x) = ( x + 1) ⋅ ( x15 + x14 + x13 + x12 + x 4 + x 3 + x 2 + x + 1) , and are not irreducible. Therefore, the codes constructed based on the G1 ( x) and G2 ( x) does not refer to cyclic, but inherit all the capabilities of error detection, the inherent cyclic codes, including the ability of uniform mapping the possible keys smip , smrp ..., smsp to the corresponding ranges aip , arp ..., asp . Therefore, assuming that each of the elements smip . smrp ..., smsp with equal probability takes any of the permissible values, then G1 ( x) and G2 ( x) satisfies the third and fourth requirements. Choosing the best alternative was carried out using the method of weighted sum. Natural when forming the weighting factors will have an idea of ranking weights according densities classes most common error. Thus, the code based on the polynomial G1 ( x) = x16 + x12 + x 5 + 1 will have the best total controlling ability relative to the most common classes of errors in the data on the names of employees of the KhAI University 4 Diagnostic Data Model According to the signal-parametric approach to control systems diagnostic [4,8], the diagnostic models are defined as mathematical constructions linking indirect signs with direct reasons of the fault. In our case, diagnostic data model (DMD) is named a mathematical construction that relates indirect indications of the data lines with errors, the DMD must be of the form  DD = D − D , (4)  where ΔD is an indirect indication of the presence of failed data; D  , D are direct functions of signs of error and the reference data, respectively. For any DMD, the conditions of diagnosability must also be fulfilled, i.e. the possibility of an unambigu- ous establishment of the fact of the presence of failed data on indirect signs. Let’s create the DMD to identify and search for a place of failed requisites in the multiset M ρ . Let Aρ = {a1ρ , a2 ρ ,..., aq ρ } be multiset indices calculated for the initial requisites, and G ( x) = x16 + x12 + x 5 + 1 , let D be row vector of dimension [0,..., 2 − 1] such that D[ai= 16 ρ ] | Aρ ∩ Ai ρ | where Ai ρ = {ai ρ , ai ρ ,..., ai ρ } . Then the    q − раз equation, characterized by the absence of failed requisites in M ρ will have the form  D[a1ρ ] = q , i.e. all indexes are the same. If, however, M ρ contains failed requisites, the D [a1= ρ ] | Aρ ∩ A1ρ | . Thus, the DMD to detect failed data in M ρ looks as:  D det Dρ =D [a1ρ ] − D[a1ρ ] =| Aρ ∩ A1ρ | − q , (5) where D det Dρ is an indirect indication of the presence of failed data in M ρ . If D det Dρ ≡ 0 then M ρ is error-free, or M ρ contains failed information. To find a place in the wrong requisite M ρ DMD will be as follows:  q D pl Di ρ = D [ai ρ ] − D[ai ρ ] = | Aρ ∩ Ai ρ | − − a , (6) 2 where D pl Di ρ is an indirect indication of the presence of failed data in the requisite 1 q smi ρ . α ∈ [ ; − 1] , and if D pl Di ρ < 0 , then smi ρ is faulty requisite, otherwise smi ρ 2 2 is a reference requisite. The performance of the method of index-requisite diagnosis was evaluated. In this case the first stage is filling a row vector D . It can be assumed to be proportional to the value q . It is assumed that the calculation of indices occur before data cleaning process. As for the time of the second stage, it coincides with the time of the second stage in the case of pairwise comparisons requisites. Maximum wait time for diagnos- tic procedures on the basis of the method of index-requisite diagnosis - Tобщ.инд. рекв ≈ 2 * q * tcр.в . The overall performance of the method of index-requisite (q − 1) * L + 2 diagnosis of redundant information in times higher than the performance 4 of the method based on pairwise comparison requisites. For example, when q = 3 and Tобщ.поп.ср L=8 ≈ 4.5 . Tобщ.инд. рекв 5 Conclusion Deep diagnostics data is the basis for the following problem solution of data recovery. Determining, based on the principle of majority, error and reference values for each attribute it is possible automatic replacement of standard errors. In addition, the failed attributes should be corrected in the source subsystem. Since the change in the origi- nal data in the data warehouse is technically impossible, human-operator should be informed about the error occurred to ensure the quality of subsystem data. Such noti- fication must include the failed attribute, reference attribute, as well as the record ID, for example, last name, first name, etc. If the source subsystem allows working with a clipboard, the failed value could be replaced by correct one automatically. If it is impossible to find the reference and failed values for the attribute, for exam- ple, if there are two different requisites and diagnostic model cannot detect the place, it is concluded that both requisites are incorrect. Decision-making is entrusted to the system administrator, which can redirect the problem to the operators. References 1. Chukhray, A., Havrylenko, O.: Proximate Objects Probabilistic Searching Meth- od. Advances in Intelligent Systems and Computing, 1113 AISC, 219-227 (2020). 2. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. 3rd edn. The MIT Press, 1292 p., (2009). 3. Borda, M.: Fundamentals in Information Theory and Coding. 2011th edn, Spring- er, 485 p. (2011). 4. Kulik, A.: Rational intellectualization of the aircraft control: Resources-saving safety improvement. Studies in Systems, Decision and Control, 173-192 (2017). 5. Martínez Bastida, J.P., Havrylenko, O., Chukhray, A.: Developing a self- regulation environment in an open learning model with higher fidelity assess- ment. Communications in Computer and Information Science, 826, 112-131 (2018). 6. Tanenbaum, A., Wetherall, D.: Computer Networks, 5 edn, Pearson, 960 p. (2012). 7. Ghahramani, S.: Fundamentals of Probability. With Stochastic Processes. 4th edn, CRC Press, (2018). 8. Martínez Bastida, J.P., Gavrilenko, E.V., Chukhray, A.G.: Developing a pedagog- ical intervention support based on Bayesian networks. CEUR Workshop Proceed- ings, 1844, 265-272 (2017).