Small Overfitting Probability in Minimization of Empirical Risk for FCA-based Machine Learning

Dmitry V. Vinogradov
Dorodnicyn Computing Center, Federal Research Center for Computer Science and Control, Russian Academy of Sciences, Moscow 119333, Russia

Abstract
The main result of the paper is a small upper bound on the probability of overfitting in FCA-based Machine Learning in the simplest case of a Boolean algebra without counter-examples. This Machine Learning approach uses a set of randomly generated formal concepts to predict test examples. The well-known Vapnik–Chervonenkis technique of empirical risk minimization determines the number of generated concepts. Estimates of the mixing and stopping times of several probabilistic algorithms based on Markov chains lead to a plausible assumption of a uniform independent distribution of elements of the Boolean algebra. Under this assumption the main theorem proves that the probability of overfitting on a fixed fraction of test examples tends to zero faster than exponentially as the number of attributes goes to infinity.

Keywords
FCA, Markov chain, empirical risk, overfitting, Boolean algebra

Published in Sergei O. Kuznetsov, Amedeo Napoli, Sebastian Rudolph (Eds.): The 10th International Workshop "What can FCA do for Artificial Intelligence?", FCA4AI 2022, co-located with IJCAI-ECAI 2022, July 23 2022, Vienna, Austria, Proceedings, pp. 41–49.
Corresponding author: vinogradov.d.w@gmail.com, http://ccas.ru, ORCID 0000-0001-5761-4706.

1. Introduction

Formal Concept Analysis (FCA) [1] is a popular means of data analysis in the case of small samples. However, the applicability of FCA to Big Data faces several obstacles:

• In the worst case the number of hypotheses is exponential in the size of the formal context (for instance, in the case of a Boolean algebra, see the next section).
• Many problems of FCA [4] belong to the famous classes of NP- and #P-complete problems of computational complexity.
• There is a positive probability of the appearance of "accidental" concepts, which corresponds to the overfitting phenomenon [9].

The paper [8] introduces a Markov chain approach to probabilistic generation of formal concepts in order to construct a Machine Learning system based on FCA (FCAML). Recently the FCAML system has applied the coupling Markov chain to generate a random sample of concepts. Each run of this chain terminates with probability 1. Earlier the system used a monotonic Markov chain, which in the case of a Boolean algebra corresponds to the famous Lazy Random Walk. The paper [10] discusses the Induction procedure that generalizes training examples into hypotheses about causes of the property under study while forbidding counter-examples. Finally, the system predicts the target class of each test example by Analogy reasoning. The main result of [10] gives a number of hypotheses sufficient to predict the target class with a given level of confidence. This framework is dual to the famous one of V.N. Vapnik and A.Ya. Chervonenkis based on the dimension of a class of classifiers. However, V.N. Vapnik and A.Ya. Chervonenkis also developed another approach to choosing a good hypothesis: the best one must minimize the empirical risk (the fraction of wrongly predicted training examples).
The main problem in this approach is to estimate the fraction of test examples with a prediction error. We provide a partial answer to this question in the simplest case of a Boolean algebra without counter-examples.

2. Background

2.1. Basic definitions and facts

Here we recall some basic definitions and facts of Machine Learning based on Formal Concept Analysis (FCAML) in the particular case of a Boolean algebra. The most general situation is considered in [10]. The book [1] is the best source of information about Formal Concept Analysis itself.

The smallest (formal) context for the n-dimensional Boolean algebra is a triple (G, M, ≠), where G = {g_1, ..., g_n} is a set of coatoms (objects), M = {m_1, ..., m_n} is a set of binary attributes, and ≠ ⊆ G × M is the relation defined by g_i ≠ m_j ⇔ i ≠ j.

A concept of the Boolean algebra context (G, M, ≠) is defined to be a pair (A, B), where A ⊆ G, B ⊆ M, and A = {g_i ∈ G : ∀m_j ∈ B [i ≠ j]}. The first component A of the concept (A, B) is called the extent of the concept, and the second component B is called its intent. It is clear that the lattice of concepts coincides with the Boolean algebra of all subsets B ⊆ M; we consider the partial order on it dual to set inclusion. This observation establishes the first obstacle to the applicability of FCA to Big Data mentioned in the Introduction: while the formal context occupies only n^2 bits, the full description of the n-dimensional Boolean algebra requires n · 2^n bits of memory.

Proposition 1. For the context (G, M, ≠) the corresponding Boolean algebra of concepts has (∅, M) as the bottom element and (G, ∅) as the top element. In other words, for every concept (A, B) the following inequalities hold:

    (∅, M) ≤ (A, B) ≤ (G, ∅).    (1)

Definition 1. For a concept (A, B), g_i ∈ G, and m_j ∈ M define

    CbO((A, B), g_i) = (A, B)                  if g_i ∈ A,
                       (A ∪ {g_i}, B ∖ {m_i})  if g_i ∉ A;    (2)

    CbO((A, B), m_j) = (A, B)                  if m_j ∈ B,
                       (A ∖ {g_j}, B ∪ {m_j})  if m_j ∉ B.    (3)

We call these operations CbO because the first one is used in the Close-by-One (CbO) algorithm to generate all formal concepts of an arbitrary context; see [3] for details. The monotonicity properties of the introduced operations are summarized in the following lemma.

Lemma 1. Let (G, M, ≠) be a context, (A, B) and (C, D) be concepts for it, g ∈ G, and m ∈ M. Then

    (A, B) ≤ (C, D) ⇒ CbO((A, B), g) ≤ CbO((C, D), g),    (4)
    (A, B) ≤ (C, D) ⇒ CbO((A, B), m) ≤ CbO((C, D), m).    (5)

Initially the system used the monotonic Markov chain algorithm as the core of the probabilistic approach to Machine Learning based on FCA.

    Data: context (G, M, ≠)
    Result: random concept (A, B)
    V := G ⊔ M;
    (A, B) := (∅, M);
    for (i := 0; i < T; i++) do
        select a random element v ∈ V;
        (A, B) := CbO((A, B), v);
    end
    Algorithm 1: Monotonic Markov chain

The main difficulty with the monotonic Markov chain in the general case is the absence of a good estimate of the trajectory length T needed to reach an approximately stationary distribution. However, the case of a Boolean algebra was investigated successfully by the author; the next subsection contains the key results about it. Later the coupling Markov chain algorithm described below appeared, which has a natural stopping moment. Now it is the workhorse of our approach.
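Since for this context a concept is determined by its intent B ⊆ M, Algorithm 1 admits a very short illustrative implementation. The following sketch is ours, not the FCAML code; the names cbo_step and monotonic_chain are hypothetical. It encodes the elements of V = G ⊔ M by the numbers 0, ..., 2n−1 (objects first, attributes second) and keeps only the intent, the extent being implicit as A = {g_i : m_i ∉ B}.

    import random

    def cbo_step(B, v, n):
        # One CbO operation of Definition 1 applied to the intent B.
        # v < n encodes the object g_{v+1}; v >= n encodes the attribute m_{v-n+1}.
        if v < n:
            B.discard(v)        # if g is already in A nothing changes; otherwise m leaves B
        else:
            B.add(v - n)        # if m is already in B nothing changes; otherwise m joins B

    def monotonic_chain(n, T, rng=random):
        # Algorithm 1 for the Boolean-algebra context: T steps from the bottom concept.
        B = set(range(n))       # start at (∅, M)
        for _ in range(T):
            cbo_step(B, rng.randrange(2 * n), n)
        return B                # intent of the resulting random concept

By Proposition 2 below, T ≥ (1/2) · n · (log n + c) steps already bring the distribution of the output close to uniform. The coupling chain of Algorithm 2, shown next, removes the need to choose T in advance.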
    Data: context (G, M, ≠)
    Result: random concept (A, B)
    V := G ⊔ M;
    (A, B) := (∅, M);
    (C, D) := (G, ∅);
    while (A ≠ C) do
        select a random element v ∈ V;
        (A, B) := CbO((A, B), v);
        (C, D) := CbO((C, D), v);
    end
    Algorithm 2: Coupling Markov chain

The algorithm terminates when the upper and lower concepts coincide. The fact that the ordering (A, B) ≤ (C, D) is preserved at every intermediate step of the while loop of Algorithm 2 follows from Lemma 1.

Now we describe Machine Learning based on FCA (the FCAML method) for our setting (a Boolean algebra without counter-examples). See the seminal paper [10] for a description of the general scheme of the FCAML method. The reader can learn the classical deterministic FCA-based approach to Machine Learning from Kuznetsov [5]. Our technique uses the probabilistic Algorithm 2 to compute a random subset of concepts.

As usual, there are two sets of objects called the training sample G = {g_1, ..., g_n} and the test sample X = {o_1, ..., o_n}, respectively. The set X contains examples whose target class is to be predicted (so-called test objects). From the training sample the program generates a formal context (G, M, ≠), where M = {m_1, ..., m_n}. After that the program applies the coupling Markov chain (Algorithm 2) to generate a given number N of random concepts (A, B).

    Data: number N of concepts to generate
    Result: random sample S of formal concepts
    S := ∅; i := 0;
    while (i < N) do
        generate a concept (A, B) by Algorithm 2;
        S := S ∪ {(A, B)};
        i := i + 1;
    end
    Algorithm 3: Inductive generalization

The FCAML method replaces a time-consuming deterministic algorithm for generating all concepts (for instance, Close-by-One) by a probabilistic one that randomly generates the prescribed number of concepts. The goal of the Markov chain approach is to select a random sample of formal concepts without computing the (possibly exponentially large) set of all concepts.

How should the number N of concepts for the coupling Markov chain be selected? There are two different approaches, both based on ideas of V.N. Vapnik and A.Ya. Chervonenkis. The approach promoted by K.V. Vorontsov is empirical risk minimization.

    Data: random sample S of concepts
    Result: empirical risk value k
    G := training examples; k := 0;
    for (g ∈ G) do
        covered := false;
        for (⟨A, B⟩ ∈ S) do
            if (B ⊆ {g}′) then
                covered := true;
                break;
            end
        end
        if (not covered) then k := k + 1/n;
    end
    Algorithm 4: Calculation of empirical risk

In the Boolean algebra case it is possible to select N sufficiently large to make the empirical risk equal to 0. Up to a permutation, we may assume without loss of generality that the first n objects are included in the training sample and the last n objects form the test sample. Let N FCA Machine Learning hypotheses be generated by the coupling Markov chain from the training sample for the Boolean algebra, where N is chosen sufficiently large to obtain η = 0. The uniformity of the stationary distribution on hypotheses allows us to construct each hypothesis h_j = (A_j, B_j) (where 1 ≤ j ≤ N) from an independent Bernoulli sequence E_j = (ε_{j,1}, ..., ε_{j,n}) as A_j = {g_i : ε_{j,i} = 0} and B_j = {f_i : ε_{j,i} = 1}.

Finally, the FCAML program predicts the target class of the test examples and computes the test risk. Consider the set F = {f_1, ..., f_n} of binary features. For each 1 ≤ i ≤ n introduce a test example g_{n+i} with the intent {g_{n+i}}′ = {f_j ∈ F : j ≠ i}.
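To make the Bernoulli representation above concrete, here is an illustrative sketch (ours, under the same index encoding as before; the names coupling_chain and inductive_generalization are hypothetical) of the coupling chain of Algorithm 2 and of the sampling loop of Algorithm 3.

    import random

    def coupling_chain(n, rng=random):
        # Algorithm 2 for the Boolean-algebra context (an illustrative sketch).
        # B is the intent of the lower concept, D the intent of the upper one.
        B = set(range(n))            # lower start (∅, M)
        D = set()                    # upper start (G, ∅)
        while B != D:                # the stopping condition A ≠ C of Algorithm 2
            v = rng.randrange(2 * n) # uniform choice from V = G ⊔ M
            i = v % n
            if v < n:                # object g_{i+1}: CbO drops m_{i+1} from both intents
                B.discard(i); D.discard(i)
            else:                    # attribute m_{i+1}: CbO adds m_{i+1} to both intents
                B.add(i); D.add(i)
        return frozenset(B)          # B = D is the intent of the generated concept

    def inductive_generalization(n, N, rng=random):
        # Algorithm 3: a sample of N random concepts, kept as a list of intents.
        return [coupling_chain(n, rng) for _ in range(N)]

In this sketch a coordinate of the two intents is fixed by the last CbO operation that touches it, and "add" and "drop" are equally likely, which agrees with the uniform independent distribution assumed in the sequel; the loop stops once every coordinate has been touched, which is the coupon-collector pattern behind Proposition 3 below.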
, πœ–π‘—,𝑛 ) determines the corresponding hypoth- esis β„Žπ‘— = (𝐴𝑗 , 𝐡𝑗 ), where 𝐴𝑗 = {π‘œπ‘– : πœ–π‘—,𝑖 = 0} and 𝐡𝑗 = {𝑓𝑖 : πœ–π‘—,𝑖 = 1}. Data: random sample 𝑆 = {(𝐴1 , 𝐡1 ), . . . , (𝐴𝑁 , 𝐡𝑁 )} of concepts Result: number of erroneous predicted test examples 𝑋 :=test examples; π‘Ÿ = 0; for (𝑔 ∈ 𝑋) do for (𝑗 := 0; 𝑗 < 𝑁 ; 𝑗 + +) do if (𝐡𝑗 βŠ† {𝑔}β€² ) then π‘Ÿ := π‘Ÿ + 1; end break; end end Algorithm 5: Calculation of fraction of erroneous predictions 2.2. Approximate Uniformity of Random Subsets Algorithm 1 in the case of Boolean algebra coincides with famous Lazy Random Walk on Boolean hypercube. Lemma 2. Stationary distribution πœ‹ of Lazy Random Walk is uniform. The simplest proof of Lemma 2 uses reversibility of the corresponding Markov chain with Balance equations with respect to uniform distribution πœ‹. Definition 2. Total variation distance between distributions πœ‡ = βŸ¨πœ‡π‘– : 𝑔𝑖 ∈ 𝐺⟩ and πœ‹ = βŸ¨πœ‹π‘– : 1 βˆ‘οΈ€ 𝑔𝑖 ∈ 𝐺⟩ on finite space 𝐺 is defined as the half of 𝐿1 -metric, i.e. β€–πœ‡ βˆ’ πœ‹β€–π‘‡ 𝑉 = 2 Β· 𝑔𝑖 ∈𝐺 |πœ‡π‘– βˆ’ πœ‹π‘– |. Lemma 3. β€–πœ‡ βˆ’ πœ‹β€–π‘‡ 𝑉 = maxπ΄βŠ†πΊ |πœ‡(𝐴) βˆ’ πœ‹(𝐴)|. This Lemma is a direct consequence of Definition 2. Proposition 2. For Lazy Random Walk let 1 πœ‡(0) = P [𝑋𝑑+1 = 𝑔𝑖 | 𝑋𝑑 = 𝑔𝑖 ] = , 2 1 πœ‡(𝑒𝑗 ) = P [𝑋𝑑+1 = (𝑔𝑖 βŠ• 𝑒𝑗 ) | 𝑋𝑑 = 𝑔𝑖 ] = , 2𝑛 and πœ‡ = 0 otherwise, and let πœ‹ be the uniform distribution. Then )οΈ€2 1 (︁ π‘’βˆ’π‘ )︁ β€–πœ‡*𝑑 βˆ’ πœ‹β€–π‘‡ 𝑉 (οΈ€ ≀ Β· 𝑒 βˆ’1 . 4 holds for 𝑑 β‰₯ 12 Β· 𝑛 Β· (log 𝑛 + 𝑐). This proposition is analogue of result of Diakonis [2] and it was proved by the author during his research on monotonic Markov chain (Algorithm 1). Comparison of Algorithms 1 and 2 gives assertion that the lower component of the coupling Markov chain coincides with a state of the monotonic Markov chain. The next two propositions estimate mean length E𝑇 of trajectory of coupling Markov chain (Algorithm 2) and proves the strong concentration of trajectory length 𝑇 around the mean E𝑇 . The author proved them during research on coupling Markov chain (Algorithm 2). Proposition 3. For 𝑛-dimensional Boolean algebra 𝑛 βˆ‘οΈ 𝑛 1 E𝑇 = β‰ˆ 𝑛 Β· ln(𝑛) + 𝑛 Β· 𝛾 + . (6) 𝑗 2 𝑗=1 Proposition 4. For 𝑛-dimensional Boolean algebra P [𝑇 β‰₯ (1 + πœ€) Β· 𝑛 Β· ln(𝑛)] β†’ 0, (7) when 𝑛 β†’ ∞ for any πœ€ > 0. Statements of Propositions 2 and 3 and Lemma 4 imply the assertion that outputs of Algorithm 2 are approximately uniformly distributed. Since each trajectory depends only on (pseudo-)random number generator, these outputs are independent. But random subsets of binary attributes can be generated by Bernoulli sequences. It provides possibility of direct probabilistic computations. In the next section these considerations lead to main result of the paper. 3. Main result Any set of hypotheses about causes of the target property can be considered as a classifier: if a test example includes at least one hypothesis, then the classifier will predict the target class positively; if none of the hypothetical reasons is embedded in a test example, then this example is classified negatively. The method of minimizing empirical risk proposed by V.N. Vapnik and A.Ya. Chervonenkis [7] consists in choosing algorithms for which the classification of training examples contains the minimum number of errors (empirical, or observed, risk). In our case, there will always be classifiers (sets of hypotheses) whose empirical risk is zero. We restrict ourselves to these situations only. 
On the other hand, the risk of making a mistake when predicting test examples remains. Following K.V. Vorontsov [11], we randomly divide the objects into two groups: training and test examples. For simplicity, let us assume that the number of objects is even and the split is into halves. This assumption does not reduce generality, since the central binomial coefficient is the largest one.

Let us denote the empirical risk by η and introduce the prediction risk θ = r/n as the fraction of incorrectly predicted test examples. We are interested in the probability P[η = 0, θ = δ] when the objects are divided evenly into training and test samples. Since all partitions are equiprobable, we may assume without loss of generality that the first n objects are included in the training sample and the last n objects form the test sample.

Let N FCAML hypotheses be generated by the coupling Markov chain from the training sample for the Boolean algebra. If the trajectories are long enough, then the distribution of the hypotheses is (almost) uniform and they are independent. Uniformity follows from the fast mixing to the uniform stationary distribution, and independence follows from the independence of the Markov chain trajectories that generate the FCAML hypotheses.

Denote the generated hypotheses by H = {h_1, h_2, ..., h_N} and form a table like

    G \ H          h_1   h_2   ...   h_N
    g_1             0     1    ...    0
    ⋮               ⋮     ⋮           ⋮
    g_n             0     0    ...    1
    g_{n+1}         0     0    ...    0
    ⋮               0     0    ...    0
    g_{(1+δ)n}      0     0    ...    0
    g_{(1+δ)n+1}    1     0    ...    0
    ⋮               ⋮     ⋮           ⋮
    g_{2n}          0     1    ...    1

Here a 1 indicates that the given hypothesis is included in the given example, i.e. the hypothesis predicts the example correctly (positively). To reach the empirical risk η = 0, each of the first n rows must contain at least one 1. Due to the uniform distribution and independence of the hypotheses, the cells of the table form an independent Bernoulli series with success probability 1/2.

Lemma 4. If the number of hypotheses satisfies N ≥ (1 + σ) · log_2 n for some σ > 0, then lim_{n→∞} P[η = 0] = 1.

Proof. Each of the first n rows independently contains at least one 1 with probability 1 − 2^{−N}, hence P[η = 0] = (1 − 2^{−N})^n and

    1 ≥ lim_{n→∞} (1 − 2^{−N})^n = lim_{n→∞} [(1 − 2^{−N})^{2^N}]^{n·2^{−N}} =
      = lim_{n→∞} [e^{−1}]^{n·2^{−N}} ≥ lim_{n→∞} e^{−1/n^σ} = 1.

To obtain the fraction θ = δ of erroneous predictions on the test examples one has to select δ·n rows of the lower half that contain zeroes only (there are C(n, δ·n) = n!/((δ·n)!·((1−δ)·n)!) ways to do this), while the remaining (2−δ)·n rows (all training rows and the rest of the test rows) must each contain at least one 1. The table above corresponds to the choice of the rows g_{n+1}, ..., g_{(1+δ)·n}. The task is to estimate

    P = C(n, δ·n) · (1 − 2^{−N})^{(2−δ)·n} · (2^{−N})^{δ·n}

as n → ∞.

Lemma 5 (Stirling's formula). n! ≈ √(2πn) · (n/e)^n as n → ∞.

Lemma 6 (Entropy inequality). −δ · ln δ − (1 − δ) · ln(1 − δ) ≤ ln 2.

Theorem 1. If N ≥ (1 + σ) · log_2 n, then for n → ∞

    P ≤ (1/√(2πδ(1−δ))) · exp{−(1 + σ) · δ · n · ln n + ln 2 · n − (ln n)/2}.

Proof. The second factor of P = C(n, δ·n) · (1 − 2^{−N})^{(2−δ)·n} · (2^{−N})^{δ·n} does not exceed 1.
Stirling's formula and the Entropy inequality imply

    P ≤ [√(2πn) · n^{δ·n} · n^{(1−δ)·n} · e^{δ·n} · e^{(1−δ)·n} · (2^{−N})^{δ·n}] / [e^n · √(2πδ·n) · √(2π(1−δ)·n) · δ^{δ·n} · n^{δ·n} · (1−δ)^{(1−δ)·n} · n^{(1−δ)·n}]
      ≤ 2^{−δ·n·(1+σ)·log_2 n} / [√(2πδ(1−δ)·n) · e^{δ·n·ln δ} · e^{(1−δ)·n·ln(1−δ)}]
      = [e^{−δ·(1+σ)·n·ln n} · e^{n·(−δ·ln δ − (1−δ)·ln(1−δ))}] / [√(2πδ(1−δ)) · √n]
      ≤ [e^{−δ·(1+σ)·n·ln n} · e^{ln 2·n}] / [√(2πδ(1−δ)) · √n].

Conclusions

The main theorem of the paper provides a small upper bound on the probability of overfitting in FCA-based Machine Learning in the simplest case of a Boolean algebra without counter-examples. It states that the probability of overfitting on a fixed fraction of test examples tends to zero faster than exponentially as the number of attributes goes to infinity. An interesting alternative to our research is the work of T.P. Makhalova and S.O. Kuznetsov [6], where the classifiers form a lattice.

Acknowledgements. The author would like to thank his colleagues at the Dorodnicyn Computing Center of the Russian Academy of Sciences and especially his PhD student L.A. Yakimova for collaboration and helpful discussions. The author is grateful to the anonymous reviewers for improving the style of presentation.

References

[1] Ganter B. and Wille R. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, 1999.
[2] Diaconis P. Group Representations in Probability and Statistics. IMS Lecture Notes – Monograph Series, Vol. 11. Hayward (CA): Institute of Mathematical Statistics, 1988.
[3] Kuznetsov S.O. A Fast Algorithm for Computing all Intersections of Objects in a Finite Semi-Lattice. Automatic Documentation and Mathematical Linguistics, Vol. 27, no. 5, 1993, pp. 11–21.
[4] Kuznetsov S.O. Complexity of Learning in Concept Lattices from Positive and Negative Examples. Discrete Applied Mathematics, no. 142(1–3), 2004, pp. 111–125.
[5] Kuznetsov S.O. Machine Learning and Formal Concept Analysis. Proc. 2nd International Conference on Formal Concept Analysis, Springer LNAI, Vol. 2961, 2004, pp. 287–312.
[6] Makhalova T.P. and Kuznetsov S.O. On Overfitting of Classifiers Making a Lattice. Proc. 14th International Conference on Formal Concept Analysis, Springer LNAI, Vol. 10308, 2017, pp. 184–197.
[7] Vapnik V.N. Statistical Learning Theory. Wiley-Interscience, 1998.
[8] Vinogradov D.V. A Markov Chain Approach to Random Generation of Formal Concepts. Proc. of Workshop Formal Concept Analysis Meets Information Retrieval (FCAIR 2013), CEUR Workshop Proceedings, Vol. 977, 2013, pp. 127–133.
[9] Vinogradov D.V. Accidental Formal Concepts in the Presence of Counterexamples. Proc. of Workshop on Formal Concept Analysis for Knowledge Discovery (FCA4KD 2017), CEUR Workshop Proceedings, Vol. 1921, 2017, pp. 104–112.
[10] Vinogradov D.V. FCA-based Approach to Machine Learning. Proc. of Workshop on Formal Concept Analysis for Artificial Intelligence (FCA4AI 2019), CEUR Workshop Proceedings, Vol. 2529, 2019, pp. 57–64.
[11] Vorontsov K.V. and Ivahnenko A. Tight Combinatorial Generalization Bounds for Threshold Conjunction Rules. Proc. of 4th International Conference on Pattern Recognition and Machine Intelligence, 2011, pp. 66–73.