Estimation of Error Rates for FCA-based Knowledge Discovery

Dmitry V. Vinogradov1,2[0000−0001−5761−4706]
1 FRC Computer Science and Control RAS, Moscow 119333, Russia vinogradov.d.w@gmail.com
2 Russian State University for Humanities, Moscow 125993, Russia http://isdwiki.rsuh.ru

Abstract. In this paper we present an approach for estimating the error rate of the bitset representation of object descriptions in FCA-based knowledge discovery. Errors of this kind lead to the overfitting phenomenon. The key technique of our approach is based on the Möbius function on finite partially ordered sets, introduced by G.-C. Rota.

Keywords: FCA · JSM-method · Möbius function · error rates.

Introduction

The ‘JSM-method of automatic hypotheses generation’ is an approach to Knowledge Discovery proposed by V.K. Finn (see [2], [3]). Its initial goal was to formalize the ‘Inductive Logic’ of John Stuart Mill (proposed in 1843, around the time George Boole introduced Boolean logic) using Boolean algebra and many-valued logic. The approach was later extended to predict the target property of test examples with the help of generated causes of the property (also called ‘hypotheses’). It is in fact a Machine Learning technique, but its high computational complexity (see [7]) is an obstacle to scalability.

Later the JSM-method was extended to arbitrary (lower semi-)lattices of object descriptions, where the similarity operation on object descriptions (the wedge operation) satisfies the usual idempotent, commutative, and associative laws. In [1], [7] FCA-based techniques [4] were applied to the JSM-method. FCA provides a very efficient representation of training objects by means of bitsets (fixed-length strings of bits) with bit-wise multiplication as the similarity operation on them.

The author [12] investigated the ‘overfitting’ phenomenon for the JSM-method by means of so-called ‘phantom’ or ‘accidental’ hypotheses.
In practice they occur when generated hypotheses are contained in descriptions of examples of the opposite sign (so-called ‘counter-examples’) that do not possess the target property. The JSM-method rejects such hypotheses by means of the ‘forbidding counter-example test’ (FCET); however, the remaining hypotheses can be contained in test examples and erroneously classify them as having the target property, hence causing the ‘overfitting’. This phenomenon was experimentally detected by RSUH student L.A. Yakimova within her master project under the supervision of the author. Another approach to the overfitting of FCA-based Machine Learning was invented and studied in [8].

Partially supported by RFBR grant 18-29-03063mk.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The analysis of such situations reveals that FCET can fail to reject some phantom hypotheses if the data contains errors in the values of some attributes. This paper considers a possibility to estimate the error rates in attribute values, taking into account the lattice structures on them. Without loss of generality, we restrict ourselves to the case of a single target attribute. The general case is reduced to this one by assuming the independence of the error rates of values of different attributes.

1 Basic definitions and results

1.1 Bitset Encoder Algorithm

A (finite) context is a triple (G, M, I), where G and M are finite sets and I ⊆ G × M. The elements of G and M are called objects and attributes, respectively. As usual, we write gIm instead of ⟨g, m⟩ ∈ I to denote that object g has attribute m. For A ⊆ G and B ⊆ M, define

A′ = {m ∈ M | ∀g ∈ A (gIm)},   (1)
B′ = {g ∈ G | ∀m ∈ B (gIm)};   (2)

so A′ is the set of attributes common to all the objects in A, and B′ is the set of objects possessing all the attributes in B. The maps (·)′ : A ↦ A′ and (·)′ : B ↦ B′ are called the derivation operators (polars) of the context (G, M, I).
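The derivation operators (1)–(2) can be sketched directly. The following is a minimal illustration on a toy context; the function names (`derive_objects`, `derive_attrs`) and the data are assumptions of this sketch, not from the paper.

```python
# Minimal sketch of the derivation operators (polars); the toy context and
# the function names are illustrative only.

def derive_objects(A, I, M):
    """A' = set of attributes common to all objects in A, cf. (1)."""
    return {m for m in M if all((g, m) in I for g in A)}

def derive_attrs(B, I, G):
    """B' = set of objects possessing all attributes in B, cf. (2)."""
    return {g for g in G if all((g, m) in I for m in B)}

G = {"g1", "g2"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c")}

print(derive_objects({"g1", "g2"}, I, M))   # attributes common to g1 and g2
print(derive_attrs({"b"}, I, G))            # objects having attribute b
```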
A concept of the context (G, M, I) is defined to be a pair (A, B), where A ⊆ G, B ⊆ M, A′ = B, and B′ = A. The first component A of the concept (A, B) is called the extent of the concept, and the second component B is called its intent. The set of all concepts of the context (G, M, I) is denoted by L(G, M, I).

Let (G, M, I) be a context. For concepts (A₁, B₁) and (A₂, B₂) in L(G, M, I) we write (A₁, B₁) ≤ (A₂, B₂) if B₁ ⊆ B₂. The relation ≤ is a partial order on L(G, M, I).

An element x ∈ L of a finite lattice ⟨L, ∧, ∨⟩ is called ∨-irreducible if x ≠ ∅ and for all y, z ∈ L, y < x and z < x imply y ∨ z < x. An element x ∈ L of a finite lattice ⟨L, ∧, ∨⟩ is called ∧-irreducible if x ≠ ∅ and for all y, z ∈ L, x < y and x < z imply x < y ∧ z.

If subsets of attributes {gᵢ}′ ⊂ M and {gⱼ}′ ⊂ M are the intents of objects gᵢ ∈ G and gⱼ ∈ G, respectively, with corresponding bitsets, then the derivation operator on a pair of objects corresponds to bit-wise multiplication, since {gᵢ, gⱼ}′ = {gᵢ}′ ∩ {gⱼ}′. Moreover, the polars correspond to iterated bit-wise multiplication (in arbitrary order) of the corresponding bitset-represented objects and attributes, respectively. The last remark is important, since bit-wise multiplication is a basic operation of modern CPUs and GPGPUs. In terms of the JSM-method, the binary operation ∩ : 2^M × 2^M → 2^M is called the ‘similarity operation’. This operation defines a lower semilattice on subsets of attributes.

Descriptions of objects can combine discrete and continuous attributes. Here we restrict ourselves to the discrete case only; we will consider the continuous case in another paper. The set of objects’ descriptions must be a part of an extended set F of ‘fragments’ supplied with a binary ‘similarity’ operation ∧ : F × F → F, which is idempotent (x ∧ x = x), commutative (x ∧ y = y ∧ x), and associative (x ∧ (y ∧ z) = (x ∧ y) ∧ z). Moreover, there is a minimal element ∅ satisfying x ∧ ∅ = ∅ (the so-called ‘trivial fragment’).
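The bitset representation just described can be sketched with Python integers playing the role of bitsets; the fixed attribute order and all names below are assumptions of this illustration.

```python
# Sketch: intents as integer bitsets, so that set intersection becomes a
# single bit-wise AND instruction; the attribute list is illustrative.

ATTRS = ["a", "b", "c", "d"]              # fixed attribute order

def encode(attrs):
    """Encode a subset of ATTRS as an integer bitset."""
    return sum(1 << i for i, m in enumerate(ATTRS) if m in attrs)

def decode(bits):
    return {m for i, m in enumerate(ATTRS) if bits >> i & 1}

g1 = encode({"a", "b", "d"})              # intent {g1}'
g2 = encode({"b", "c", "d"})              # intent {g2}'
similarity = g1 & g2                      # {g1, g2}' = {g1}' ∩ {g2}'
print(decode(similarity))
```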
In FCA this construction is realized by means of pattern structures [5]. We convert a fragment into a subset of attributes in such a way that the similarity operation becomes set-theoretic intersection (and bit-wise multiplication on the corresponding bitsets). We reformulate the Basic Theorem of FCA in the following form to construct such an algorithm for encoding the values of each attribute; we then form their concatenation to encode whole object descriptions.

Theorem 1. [4] For every finite lattice ⟨L, ∧, ∨⟩ let G be a (super)set of all ∧-irreducible elements and M be a (super)set of all ∨-irreducible elements. For gIm ⇔ g ≥ m the formal context (G, M, I) generates L(G, M, I), which is isomorphic to the original lattice ⟨L, ∧, ∨⟩.

In [13] we used this theorem to prove the correctness of the following algorithm with respect to the property that the similarity operation between values corresponds to bit-wise multiplication between their codes:

Data: set V of values of the current attribute (n = |V|)
Result: matrix B whose rows are the bitset codes of the values
V := topological_sort(V);      // topological sorting
T := order_matrix;             // transitive closure of the cover relation
∀i [Del[i] := false];          // deleted columns
for (index = 2; index < n; ++index) do
    for (indx = 1; indx < index; ++indx) do
        for (ndx = 0; ndx < indx; ++ndx) do
            if (T[·][V[index]] == T[·][V[indx]] & T[·][V[ndx]]) then
                Del[V[index]] := true;
            end
        end
    end
end
for (index = 2; index < n; ++index) do
    for (indx = 1; indx < index; ++indx) do
        if ¬Del[indx] then
            B[indx][index] := T[index][indx];
        end
    end
end
Algorithm 1: Encoder Algorithm

For instance, when we consider the famous Mushroom Data Set [11] from the Machine Learning Repository at the University of California, Irvine, the following values of ‘spore print color’ are available: black (k), brown (n), buff (b), chocolate (c), green (r), orange (o), purple (u), white (w), and yellow (y). Let us concentrate on the black, brown, buff, chocolate, and yellow values.
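A runnable sketch of Algorithm 1 follows, applied to the five spore print colors just listed. It assumes the value poset is given by a predicate `leq(x, y)` meaning "x ≤ y"; all function and variable names are illustrative, not from the paper's implementation.

```python
# A runnable sketch of Algorithm 1 (Encoder Algorithm); names illustrative.
from graphlib import TopologicalSorter

def encode_values(values, leq):
    # Step 1: topological sorting -- smaller elements come first.
    deps = {v: [w for w in values if w != v and leq(w, v)] for v in values}
    V = list(TopologicalSorter(deps).static_order())
    n = len(V)
    # Context of the order >=: T[i][j] == 1 iff V[i] >= V[j].
    T = [[int(leq(V[j], V[i])) for j in range(n)] for i in range(n)]
    col = lambda j: [T[i][j] for i in range(n)]
    # Steps 2-3: a column equal to the bit-wise product of two earlier
    # columns corresponds to a join-reducible element and is deleted.
    deleted = [False] * n
    for j in range(2, n):
        for j1 in range(1, j):
            for j2 in range(j1):
                if col(j) == [x & y for x, y in zip(col(j1), col(j2))]:
                    deleted[j] = True
    # Step 4: rows of the reduced context are the bitset codes.
    kept = [j for j in range(n) if not deleted[j]]
    return {V[i]: "".join(str(T[i][j]) for j in kept) for i in range(n)}

# The five spore print colors ordered as in Fig. 1 (c = k v n, b = n v y).
ORDER = {("k", "c"), ("n", "c"), ("n", "b"), ("y", "b")}
codes = encode_values(["k", "n", "y", "c", "b"],
                      lambda x, y: x == y or (x, y) in ORDER)
print(codes)
```

The returned codes reproduce the reduced context computed by hand below: the similarity of any two values is the bit-wise AND of their codes.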
The corresponding semi-lattice is shown in Fig. 1.

Fig. 1. A fragment of the spore print color values semi-lattice: c covers k and n, b covers n and y, and ∅ is the bottom element

We added the ∅ value to denote the absence of similarity of spore print colors between several training examples that generate some hypotheses. The order corresponds to the “to be more specific/general” relation between values. For example, buff is brown-yellow and chocolate is black-brown. Hence the similarity between mushrooms with chocolate spore print (c) and ones with buff spore print (b) has the brown color (n) of spore print in common.

The algorithm has the following steps:

1. Topological sorting of the elements of the semilattice.
2. In the context of the order ≥, look for columns that coincide with the bit-wise multiplication of previous ones (every such column corresponds to a ∨-reducible element).
3. All found (∨-reducible) columns are removed.
4. The rows of the reduced context form the bitset representations of the corresponding values.

We sort the values as k < n < c < y < b in correspondence with the partial order on them. Then Theorem 1 gives a big context corresponding to ≥:

values  k n c y b
k       1 0 0 0 0
n       0 1 0 0 0
c       1 1 1 0 0
y       0 0 0 1 0
b       0 1 0 1 1

The column ‘c’ is equal to the product of columns ‘k’ and ‘n’, since c = k ∨ n. Hence it is reducible. The other ∨-reducible element of the lattice is ‘b’ (again, column ‘b’ is the product of columns ‘n’ and ‘y’). The rows of the reduced context

values  k n y
k       1 0 0
n       0 1 0
c       1 1 0
y       0 0 1
b       0 1 1

are the bitset representations of the corresponding colors of spore print.

The encoding of the whole description is a concatenation of the bitset representations of the values of its features in some fixed order. Since the bitset representations of different values of the same feature have the same length, the bit-wise multiplication of whole representations reduces to the bit-wise multiplications of the corresponding parts, and the similarity between objects is given by the component-wise similarity of their intents.
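The concatenation scheme can be sketched as follows, using the spore print codes from the reduced context above together with a purely hypothetical 2-bit second feature (an assumption of this sketch, not from the dataset):

```python
# Sketch of concatenated encodings: the spore print codes come from the
# reduced context above; the 2-bit second feature is hypothetical.

SPORE = {"k": 0b100, "n": 0b010, "c": 0b110, "y": 0b001, "b": 0b011}

def describe(spore_color, second_bits):
    # concatenation: 3 spore print bits, then 2 bits of the other feature
    return (SPORE[spore_color] << 2) | second_bits

m1 = describe("c", 0b11)
m2 = describe("b", 0b01)
similarity = m1 & m2        # one bit-wise multiplication for the whole code
assert similarity >> 2 == SPORE["n"]    # spore print part: c ∧ b = n
assert similarity & 0b11 == 0b01        # second feature part, independently
```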
1.2 Overfitting Phenomenon for JSM-method

We begin with a demonstration of the phenomenon by applying the JSM-method to a school problem in geometry. The task is to learn sufficient conditions for a convex polygon to be circumscribed by a circle and to predict this property for test examples. Hence there are two target classes: the positive one (with a possible circle around the figure) and the negative one.

The training sample contains a regular triangle, a rectangular triangle, a square, an isosceles trapezoid, and a diamond (the last figure is the negative example; the rest are positive training examples). The test sample contains an isosceles triangle, a rectangle, and a deltoid. We consider the most general case of the corresponding polygon; hence, for instance, the isosceles trapezoid has bases of different sizes and differs from the rectangle.

We represent each polygon by a subset of attributes from the following list: (a) the figure is a triangle; (b) the figure is a quadrangle; (c) the figure has a right angle; (d) the figure has a pair of equal length sides; (e) all sides of the figure have the same length; (f) the figure has a pair of parallel sides; (g) the figure has a pair of equal angles; (h) all angles of the figure are equal; (i) the sum of the opposite angles of the quadrangle is equal to π.

Hence the training context (G₊, M = {a, b, c, d, e, f, g, h, i}, I) is

training objects      a b c d e f g h i
regular triangle      1 0 0 1 1 0 1 1 0
rectangular triangle  1 0 1 0 0 0 0 0 0
square                0 1 1 1 1 1 1 1 1
isosceles trapezoid   0 1 0 1 0 1 1 0 1

This context generates the following concepts with extents of cardinality > 1:

⟨{regular triangle, rectangular triangle}, {a}⟩,
⟨{regular triangle, square}, {d, e, g, h}⟩,
⟨{regular triangle, square, isosceles trapezoid}, {d, g}⟩,
⟨{square, isosceles trapezoid}, {b, d, f, g, i}⟩.

The first concept ⟨{a}′, {a}⟩ corresponds to the well-known geometric theorem “Each triangle can be circumscribed by a circle”.
The second one expresses the famous fact about regular polygons: “Vertices of a regular polygon lie on a circle”. This concept has the form ⟨{e, h}′, {e, h}′′⟩. The fourth concept, represented as ⟨{i}′, {i}′′⟩, corresponds to the well-known geometric theorem “Every quadrangle with the sum of opposite angles equal to π can be circumscribed by a circle”. The third concept is a ‘phantom’, because its extent contains two types of objects: a regular triangle with the first ‘real’ cause, and quadrangles with the sum of opposite angles equal to π (the square and the isosceles trapezoid). Its intent consists of ‘accidentally common’ attributes. Luckily, the forbidding counter-example test (FCET) procedure rejects this concept because of the counter-example

counter-example  a b c d e f g h i
diamond          0 1 0 1 1 1 1 0 0

The situation is very subtle, since the 4th concept also has two types of objects in its extent; however, it corresponds to a true geometric fact! We consider this concept a ‘real cause’, as opposed to the ‘phantom’ concept 3.

The test sample Gτ contains the test objects

test objects        a b c d e f g h i
isosceles triangle  1 0 0 1 0 0 1 0 0
rectangle           0 1 1 1 0 1 1 1 1
deltoid             0 1 0 1 0 0 1 0 0

The JSM-method predicts the first test object (isosceles triangle) positively through the 1st concept. The second test object (rectangle) is classified positively by applying the 4th concept. The JSM-method would incorrectly predict the target property of the last test case if the 3rd hypothesis were not rejected. This is exactly the phenomenon of overfitting: the hypothesis is consistent with the training sample, but it leads to the incorrect classification of test examples.

A similar situation occurs in experiments with real data using the JSM-method. For example, consider an application of the JSM-method to the study of the toxicity of substituted nitrobenzenes [6]. The data was collected by pharmacologists from the University of Liverpool. RSUH student Anastasia S.
Oparysheva detected a suspicious phenomenon when she analyzed the results of this experiment with respect to overfitting, within her undergraduate project [9] under the supervision of the author.

Fig. 2. Concept suspicious to be a phantom

Concept 28 (a hypothetical cause of toxicity), with an extent consisting of 2 elements (training examples 37 and 39), has the form shown in Fig. 2. However, example 37 together with six other training examples generates alternative hypothesis 3. Similarly, example 39 contains alternative concept 41 with a 16-element extent. There is a plausible assertion that hypothesis 28 is a ‘phantom’ arising from an accidental coincidence of fragments between two training examples, each of which contains a different ‘real cause’ (example 37 has ‘real cause’ 3, and example 39 contains ‘real cause’ 41).

This case was not unique: more than 10 percent of the concepts without counter-examples exhibit the same behavior. The aim of her study was to investigate empirically the overfitting phenomenon, which had previously been studied theoretically by the author in [12]. We present some results from that article below.

An attribute is called essential if it appears in some ‘real cause’. Here a ‘cause’ is a set of attributes. Assume for the sake of simplicity that the two ‘real causes’ have no common attributes. The other attributes are called accompanying. Hence we partition the set M of all attributes into three subsets: the first ‘real cause’, the second one, and the accompanying attributes.

The term ‘real cause’ corresponds to a generator of the intent: {a} for the 1st concept and {i} for the 4th concept in our initial illustrative example. However, in the mathematical study below it means just a special subset of attributes other than the accompanying ones; the latter form the building blocks of ‘phantom’ concepts. So the word ‘real’ in ‘real causes’ carries no special meaning: it is simply a term introduced to distinguish the group of essential attributes from the accompanying ones.
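Returning to the illustrative geometry example, the FCET rejection of the phantom concept can be sketched as follows. The intents and attribute sets are taken from the example above; the numeric keys and variable names are this sketch's own.

```python
# Sketch of the FCET check on the geometry example: a hypothesis is rejected
# when its intent is contained in the description of a counter-example.

HYPOTHESES = {
    1: {"a"},                        # "each triangle can be circumscribed"
    2: {"d", "e", "g", "h"},         # regular polygons
    3: {"d", "g"},                   # the 'phantom' concept
    4: {"b", "d", "f", "g", "i"},    # opposite angles sum to pi
}
DIAMOND = {"b", "d", "e", "f", "g"}  # the counter-example's attributes
DELTOID = {"b", "d", "g"}            # the third test object

surviving = {k: v for k, v in HYPOTHESES.items() if not v <= DIAMOND}
assert 3 not in surviving            # FCET rejects the phantom concept

# Without FCET the phantom would classify the deltoid positively:
assert HYPOTHESES[3] <= DELTOID
assert not any(v <= DELTOID for v in surviving.values())
```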
Assume for the sake of simplicity that counter-examples do not contain any essential attribute. Now we introduce a probabilistic model to simultaneously generate the accompanying attributes for a pair of training examples and m counter-examples.

Denote the number of counter-examples by m and the number of accompanying attributes by n. It is clear that the accompanying attributes of the training objects and the counter-examples form a (2 + m) × n binary matrix. It contains N = (2 + m) · n bits. The accompanying attributes are generated by a Bernoulli series of N tests.

A Bernoulli series of N tests is the probability distribution on {0, 1}^N with

P(x_1 = δ_1, . . . , x_N = δ_N) = ∏_{j=1}^{N} p^{δ_j} · (1 − p)^{1−δ_j},

where 0 < p < 1. The number p > 0 is called the probability of success x_j = 1 in test j.

Then we set all the attributes of the first “real” cause to 1 and all the attributes of the second “real” cause to 0, and add the accompanying attributes of the first training example to obtain the first training example itself. We generate the second training example by setting the attributes of the first “real” cause to 0 and those of the second one to 1. All counter-examples have 0 in the positions corresponding to both “real” causes. As a result we obtain a 2 × |M| context and a list of m counter-examples. What is the probability of generating a concept with a 2-element extent without any counter-example from the list?

Theorem 2. If the number n of random attributes tends to infinity, the probability of success equals p = a/n, and there are m = b · √n counter-examples, then the probability of an accidental formal concept with a 2-element extent and without any counter-example tends to 1 − e^{−a} − a · e^{−a} · [1 − e^{−b·√a}] in the limit.

Note that even the smaller number 1 − e^{−a} − a · e^{−a} is positive, since it coincides with the probability that a Poisson variable Y_a with mean a takes a value Y_a > 1.

Recently Lyudmila A.
Yakimova, a former master student of the Russian State University for Humanities, made experimental studies [15] of the behavior of Machine Learning procedures based on FCA. She also detected the essential overfitting phenomenon. For example, on the Mushroom Data Set [11] the JSM-method generates several ‘phantom’ concepts and, as a consequence, their use resulted in the wrong classification of toadstools as edible mushrooms.

Another result of Yakimova’s study is a higher rate of ‘phantom’ concepts than its estimate by the theorem. The reason lies in the different frequencies of appearance of different attributes. Moreover, Yakimova’s experiments did not detect the overfitting phenomenon for the VKF-method of Machine Learning based on FCA [14].

2 Error Rates for Values

2.1 Problem Explanation

While checking the condition of forbidding counter-examples, the similarity of some training examples can be contained in the description of a counter-example. The JSM-method rejects such similarities; however, some suspicious hypotheses may be missed if some values of attributes were entered erroneously. Can we estimate the rate of such errors?

Consider again Fig. 1. Assume that an expert mistakenly replaces the buff spore print color (b) by the yellow one (y) for some counter-example. Then the similarity with a mushroom with chocolate spore print, which has the brown spore print color (n) in common, can no longer be found in the counter-example, so the procedure saves the hypothesis. If such a similarity is a phantom, it leads to overfitting.

It is clear that a value w ∈ V frequently replaces a value v ∈ V when w ≤ v. The case of a totally fatal mistake is ignored in this study. Denote the rate of such errors by r(v|w). The problem is that there does not exist a way to discover v: we see only w as the value entered by the expert. To resolve this difficulty, the Möbius function from the incidence algebra is used.
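The error scenario above can be sketched with the spore print bitset codes from Section 1.1; `contains` is an illustrative name for the FCET containment check, not from the paper.

```python
# Sketch of the error scenario: recording buff (b) as yellow (y) in a
# counter-example makes the brown similarity (n) invisible to FCET.

SPORE = {"k": 0b100, "n": 0b010, "c": 0b110, "y": 0b001, "b": 0b011}

hypothesis = SPORE["n"]          # brown: similarity of some positive examples

true_counter = SPORE["b"]        # the counter-example really has buff...
entered_counter = SPORE["y"]     # ...but yellow was entered by mistake

def contains(description, hyp):
    """FCET containment check on bitsets."""
    return description & hyp == hyp

print(contains(true_counter, hypothesis))     # FCET would reject
print(contains(entered_counter, hypothesis))  # hypothesis survives the error
```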
2.2 Möbius functions on finite partially ordered sets

In the fundamental work [10] Gian-Carlo Rota introduced the definition of the Möbius function on (locally) finite partially ordered sets. It is the working tool of our approach. Below we recall some key concepts and results of this theory.

Consider the set of real-valued functions of two variables on V with the property f(x, y) = 0 if x ≰ y. It has the structure of an associative algebra over the real field if we define the product h = f · g of such functions as

h(x, y) = Σ_{z: x≤z≤y} f(x, z) · g(z, y).   (3)

Addition and multiplication by constants are defined in the natural way. This structure is called the incidence algebra of the given poset.

This algebra has the identity element δ(x, y) (the Kronecker delta): δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise. The zeta function ζ(x, y) is the element of the incidence algebra such that ζ(x, y) = 1 if x ≤ y and ζ(x, y) = 0 otherwise. It has an inverse element µ(x, y), the Möbius function. The proof of the next statement is a trivial check.

Proposition 1. The function defined by induction as µ(x, x) = 1 and

µ(x, y) = − Σ_{z: x≤z<y} µ(x, z)   (4)

is the inverse of the zeta function, i.e., it is the Möbius function µ(x, y).
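A minimal sketch of the recursion (4) on the spore print poset of Fig. 1 (with the bottom element written as "0"), including a check that the resulting µ inverts ζ in the incidence algebra; all names are this sketch's own.

```python
# Sketch: the Möbius function computed by recursion (4) on a small poset,
# with a check that it inverts the zeta function via the product (3).
from itertools import product

ELEMS = ["0", "k", "n", "y", "c", "b"]          # poset of Fig. 1 plus bottom
LEQ = {(x, x) for x in ELEMS}
LEQ |= {("0", x) for x in ELEMS}
LEQ |= {("k", "c"), ("n", "c"), ("n", "b"), ("y", "b")}

def mu(x, y, _cache={}):
    if (x, y) not in LEQ:
        return 0
    if x == y:
        return 1
    if (x, y) not in _cache:
        _cache[(x, y)] = -sum(mu(x, z) for z in ELEMS
                              if (x, z) in LEQ and (z, y) in LEQ and z != y)
    return _cache[(x, y)]

# (zeta * mu)(x, y) = sum over x <= z <= y of mu(z, y) = delta(x, y)
for x, y in product(ELEMS, repeat=2):
    conv = sum(mu(z, y) for z in ELEMS if (x, z) in LEQ and (z, y) in LEQ)
    assert conv == (1 if x == y else 0)

print(mu("0", "c"))   # µ(0, c) = 1
```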