Small Overfitting Probability in
Minimization of Empirical Risk for
FCA-based Machine Learning
Dmitry V. Vinogradov1,*
1
 Dorodnicyn Computing Center, Federal Research Center for Computer Science and Control, Russian Academy of
Sciences, Moscow 119333, Russia


                                         Abstract
                                         The main result of the paper is a small upper bound on the probability of overfitting in FCA-based
                                         Machine Learning in the simplest case of a Boolean algebra without counter-examples. This Machine
                                         Learning approach uses a set of randomly generated formal concepts to predict test examples. The
                                         well-known Vapnik–Chervonenkis technique of empirical risk minimization determines the number of
                                         generated concepts. Estimates of the mixing and stopping times of several probabilistic algorithms based
                                         on Markov chains lead to a plausible assumption of a uniform independent distribution of the elements
                                         of the Boolean algebra. Under this assumption the main theorem proves that the probability of overfitting
                                         on a fixed fraction of test examples tends to zero faster than exponentially as the number of attributes
                                         goes to infinity.

                                         Keywords
                                         FCA, Markov chain, empirical risk, overfitting, Boolean algebra




1. Introduction
Formal Concept Analysis (FCA) [1] is a popular means of data analysis in the case of small samples.
  However, the applicability of FCA to Big Data faces several obstacles:
                  β€’ In the worst case there is an exponentially large number of hypotheses with respect to
                    the size of the formal context (for instance, the case of a Boolean algebra, see the next
                    section).
                  β€’ Many problems of FCA [4] belong to the famous classes of 𝑁𝑃- and #𝑃-complete problems
                    of computational complexity.
                  β€’ There is a positive probability that β€œaccidental” concepts appear, which corresponds to
                    the overfitting phenomenon [9].
  The paper [8] introduced a Markov chain approach to the probabilistic generation of formal concepts
in order to construct a Machine Learning system based on FCA (FCAML). Recently the FCAML system
Published in Sergei O. Kuznetsov, Amedeo Napoli, Sebastian Rudolph (Eds.): The 10th International Workshop "What
can FCA do for Artificial Intelligence?", FCA4AI 2022, co-located with IJCAI-ECAI 2022, July 23 2022, Vienna, Austria,
Proceedings, pp. 41–49.
* Corresponding author.
Email: vinogradov.d.w@gmail.com (D. V. Vinogradov)
URL: http://ccas.ru (D. V. Vinogradov)
ORCID: 0000-0001-5761-4706 (D. V. Vinogradov)
                                       Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
has applied the coupling Markov chain to generate a random sample of concepts. Each run of this
chain terminates with probability 1. Earlier the system used a monotonic Markov chain that
corresponds to the famous Lazy Random Walk in the case of a Boolean algebra. The paper [10]
discusses the Induction procedure, which generalizes training examples into hypotheses about
causes of the property under research while forbidding counter-examples. Finally, the system
predicts the target class of each test example by Analogy reasoning.
   The main result of the paper [10] gives a sufficient number of hypotheses to predict a target class
with a given level of confidence. That framework is dual to the famous one of V.N. Vapnik and
A.Ya. Chervonenkis, which is based on the dimension of a class of classifiers.
   However, V.N. Vapnik and A.Ya. Chervonenkis also developed another approach to choosing a
good hypothesis: the best one must minimize the empirical risk (the fraction of wrongly predicted
training examples). The main problem in this approach is to estimate the fraction of test examples
with prediction errors. We provide a partial answer to this question in the simplest case of a Boolean
algebra without counter-examples.


2. Background
2.1. Basic definitions and facts
Here we recall some basic definitions and facts about Machine Learning based on Formal Concept
Analysis (FCAML) in the particular case of a Boolean algebra. The most general situation is considered
in [10]. The book [1] is the best source of information about Formal Concept Analysis itself.
   The smallest (formal) context for the 𝑛-dimensional Boolean algebra is a triple (𝐺, 𝑀, ΜΈ=),
where 𝐺 = {𝑔1 , . . . , 𝑔𝑛 } is a set of coatoms (objects), 𝑀 = {π‘š1 , . . . , π‘šπ‘› } is a set of binary
attributes, and ΜΈ= βŠ† 𝐺 Γ— 𝑀 is the relation defined by 𝑔𝑖 ΜΈ= π‘šπ‘— ⇔ 𝑖 ΜΈ= 𝑗.
   A concept of the Boolean algebra context (𝐺, 𝑀, ΜΈ=) is defined to be a pair (𝐴, 𝐡), where
𝐴 βŠ† 𝐺, 𝐡 βŠ† 𝑀 , and 𝐴 = {𝑔𝑖 ∈ 𝐺 : βˆ€π‘šπ‘— ∈ 𝐡 [𝑖 ΜΈ= 𝑗]}. The first component 𝐴 of the concept
(𝐴, 𝐡) is called the extent of the concept, and the second component 𝐡 is called its intent. It is
clear that the lattice of concepts coincides with the Boolean algebra of all subsets 𝐡 βŠ† 𝑀 . We
consider the partial order on it dual to set inclusion.
   This observation demonstrates the first obstacle to the applicability of FCA to Big Data from the
Introduction: while the formal context occupies only 𝑛^2 bits, the full description of the 𝑛-dimensional
Boolean algebra requires 𝑛 Β· 2^𝑛 bits of memory (for 𝑛 = 30 this is already roughly 3 Β· 10^10 bits).
Proposition 1. For the context (𝐺, 𝑀, ΜΈ=) the corresponding Boolean algebra of concepts has (βˆ…, 𝑀 ) as
the bottom element and (𝐺, βˆ…) as the top element. In other words, for every concept (𝐴, 𝐡) the
following inequalities hold:
                                  (βˆ…, 𝑀 ) ≀ (𝐴, 𝐡) ≀ (𝐺, βˆ…).                                    (1)
Definition 1. For a concept (𝐴, 𝐡), 𝑔𝑖 ∈ 𝐺, and π‘šπ‘— ∈ 𝑀 define

                 𝐢𝑏𝑂((𝐴, 𝐡), 𝑔𝑖 ) = (𝐴, 𝐡)                      if 𝑔𝑖 ∈ 𝐴,
                                     (𝐴 βˆͺ {𝑔𝑖 }, 𝐡 βˆ– {π‘šπ‘– })      if 𝑔𝑖 βˆ‰ 𝐴;                        (2)

                 𝐢𝑏𝑂((𝐴, 𝐡), π‘šπ‘— ) = (𝐴, 𝐡)                      if π‘šπ‘— ∈ 𝐡,
                                     (𝐴 βˆ– {𝑔𝑗 }, 𝐡 βˆͺ {π‘šπ‘— })      if π‘šπ‘— βˆ‰ 𝐡.                        (3)
   We call these operations CbO because the first one is used in the Close-by-One (CbO) algorithm
to generate all the formal concepts of an arbitrary context, see [3] for details.
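   The following minimal Python sketch mirrors formulas (2) and (3) for the Boolean-algebra context
(the names cbo_object and cbo_attribute and the index-set representation of concepts are our own
illustrative choices, not part of the FCAML implementation):

    from typing import FrozenSet, Tuple

    Concept = Tuple[FrozenSet[int], FrozenSet[int]]   # (extent A, intent B) as sets of indices

    def cbo_object(concept: Concept, i: int) -> Concept:
        """CbO((A, B), g_i): add object g_i to the extent and drop m_i from the intent."""
        A, B = concept
        if i in A:                    # g_i is already in the extent: nothing changes
            return concept
        return (A | {i}, B - {i})     # formula (2) for the context (G, M, !=)

    def cbo_attribute(concept: Concept, j: int) -> Concept:
        """CbO((A, B), m_j): add attribute m_j to the intent and drop g_j from the extent."""
        A, B = concept
        if j in B:                    # m_j is already in the intent: nothing changes
            return concept
        return (A - {j}, B | {j})     # formula (3) for the context (G, M, !=)

   In this particular context objects and attributes share the indices 1, . . . , 𝑛, which is why a single
integer suffices to refer to both 𝑔𝑖 and π‘šπ‘– .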
   The monotonicity properties of the introduced operations are summarized in the following Lemma.

Lemma 1. Let (𝐺, 𝑀, ΜΈ=) be a context, (𝐴, 𝐡) and (𝐢, 𝐷) be concepts for it, 𝑔 ∈ 𝐺, and π‘š ∈ 𝑀 .
Then

                (𝐴, 𝐡) ≀ (𝐢, 𝐷) β‡’ 𝐢𝑏𝑂((𝐴, 𝐡), 𝑔) ≀ 𝐢𝑏𝑂((𝐢, 𝐷), 𝑔),                          (4)
               (𝐴, 𝐡) ≀ (𝐢, 𝐷) β‡’ 𝐢𝑏𝑂((𝐴, 𝐡), π‘š) ≀ 𝐢𝑏𝑂((𝐢, 𝐷), π‘š).                           (5)

   Initially the system used the monotonic Markov chain algorithm as the core of the probabilistic
approach to Machine Learning based on FCA.
   Data: context (𝐺, 𝑀, ΜΈ=)
   Result: random concept (𝐴, 𝐡)
   𝑉 := 𝐺 βŠ” 𝑀 ; (𝐴, 𝐡) := (βˆ…, 𝑀 );
   for (𝑖 := 0; 𝑖 < 𝑇 ; 𝑖 + +) do
       select random element 𝑣 ∈ 𝑉 ;
       (𝐴, 𝐡) := 𝐢𝑏𝑂((𝐴, 𝐡), 𝑣);
   end
                           Algorithm 1: Monotonic Markov chain
   The main difficulty with the monotonic Markov chain in the general case is the absence of a good
estimate of the trajectory length 𝑇 needed to reach an approximately stationary distribution. However,
the case of the Boolean algebra was investigated successfully by the author; the next subsection
contains the key results about it.
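   For the Boolean-algebra context a concept can be encoded by its intent 𝐡 alone, since the extent is
then 𝐴 = {𝑔𝑖 : π‘šπ‘– βˆ‰ 𝐡}. Under that encoding the following Python sketch of Algorithm 1 (the name
monotonic_chain and the intent-only encoding are our assumptions made for brevity) is exactly the
Lazy Random Walk discussed in Section 2.2:

    import random

    def monotonic_chain(n: int, T: int, seed: int = 0) -> frozenset:
        """Algorithm 1 for the context (G, M, !=): start at the bottom concept (empty set, M)
        and apply T random CbO operations; a concept is encoded by its intent B."""
        rng = random.Random(seed)
        B = set(range(1, n + 1))          # the bottom concept has intent M = {1, ..., n}
        for _ in range(T):
            v = rng.randint(1, 2 * n)     # pick a random element of V = G + M
            if v <= n:                    # v encodes object g_v: CbO removes m_v from the intent
                B.discard(v)
            else:                         # v encodes attribute m_(v-n): CbO adds it to the intent
                B.add(v - n)
        return frozenset(B)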
   Later the coupling Markov chain algorithm described below appeared; it has a natural stopping
moment. Now it is the workhorse of our approach.
   Data: context (𝐺, 𝑀, ΜΈ=)
   Result: random concept (𝐴, 𝐡)
   𝑉 := 𝐺 βŠ” 𝑀 ; (𝐴, 𝐡) := (βˆ…, 𝑀 ); (𝐢, 𝐷) := (𝐺, βˆ…);
   while (𝐴 ̸= 𝐢) do
       select random element 𝑣 ∈ 𝑉 ;
       (𝐴, 𝐡) := 𝐢𝑏𝑂((𝐴, 𝐡), 𝑣);
       (𝐢, 𝐷) := 𝐢𝑏𝑂((𝐢, 𝐷), 𝑣);
   end
                            Algorithm 2: Coupling Markov chain
   The algorithm terminates when the upper and lower concepts coincide. The preservation of the
ordering (𝐴, 𝐡) ≀ (𝐢, 𝐷) at every intermediate step of the while loop of Algorithm 2 follows from
Lemma 1.
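   A minimal Python sketch of Algorithm 2 under the same intent-only encoding as above (again, the
name coupling_chain and the encoding are illustrative assumptions, not the actual FCAML code). The
two copies of the chain are driven by the same random choices, so by Lemma 1 the lower copy stays
below the upper one until they merge:

    import random

    def coupling_chain(n: int, seed: int = 0) -> frozenset:
        """Algorithm 2 for the context (G, M, !=): run the lower chain from (empty set, M) and
        the upper chain from (G, empty set) with shared randomness until they coincide.
        Concepts are encoded by their intents; the intent of the resulting concept is returned."""
        rng = random.Random(seed)
        lower = set(range(1, n + 1))   # intent of the bottom concept
        upper = set()                  # intent of the top concept
        while lower != upper:
            v = rng.randint(1, 2 * n)  # the same element v is applied to both chains
            if v <= n:                 # CbO with object g_v removes m_v from the intent
                lower.discard(v)
                upper.discard(v)
            else:                      # CbO with attribute m_(v-n) adds it to the intent
                lower.add(v - n)
                upper.add(v - n)
        return frozenset(lower)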
   Now we describe Machine Learning based on FCA (the FCAML-method) for our setting (a Boolean
algebra without counter-examples). See the seminal paper [10] for a description of the general scheme
of the FCAML-method.
   The reader can learn the classical deterministic FCA-based approach to Machine Learning
from Kuznetsov [5]. Our technique uses probabilistic Algorithm 2 for computing a random
subset of concepts.
   As usual, there are two sets of objects, the training sample 𝐺 = {𝑔1 , . . . , 𝑔𝑛 } and the test sample
𝑋 = {π‘œ1 , . . . , π‘œπ‘› }. The set 𝑋 contains the examples whose target class is to be predicted (so-called
test objects).
   From the training sample the program generates a formal context (𝐺, 𝑀, ΜΈ=), where 𝑀 =
{π‘š1 , . . . , π‘šπ‘› }. After that the program applies the coupling Markov chain (Algorithm 2) to
generate a given number 𝑁 of random concepts (𝐴, 𝐡).
   Data: number 𝑁 of concepts to generate
   Result: random sample 𝑆 of formal concepts
   𝑆 := βˆ…; 𝑖 := 0;
   while (𝑖 < 𝑁 ) do
         Generate concept (𝐴, 𝐡) by Algorithm 2;
         𝑆 := 𝑆 βˆͺ {(𝐴, 𝐡)};
         𝑖 := 𝑖 + 1;
   end
                                 Algorithm 3: Inductive generalization
   The FCAML-method replaces a time-consuming deterministic algorithm (for instance, β€œClose-by-
One”) that generates all concepts by a probabilistic one that generates a prescribed number of concepts
at random. The goal of the Markov chain approach is to select a random sample of formal
concepts without computing the (possibly exponentially large) set of all concepts.
   How should the number 𝑁 of concepts for the coupling Markov chain be selected? There are two
different approaches, both based on ideas of V.N. Vapnik and A.Ya. Chervonenkis. The approach
promoted by K.V. Vorontsov is empirical risk minimization.
   Data: random sample 𝑆 of concepts
   Result: empirical risk value π‘˜
   𝐺 := training examples; π‘˜ := 0;
   for (𝑔 ∈ 𝐺) do
         𝑐 := 0;
         for (⟨𝐴, 𝐡⟩ ∈ 𝑆) do
               if (𝐡 βŠ† {𝑔}β€² ) then
                    𝑐 := 1;
                    break;
               end
         end
         if (𝑐 = 0) then
               π‘˜ := π‘˜ + 1/𝑛;
         end
   end
                              Algorithm 4: Calculation of empirical risk
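   A minimal Python sketch of this computation (the helper name empirical_risk is ours; it assumes the
convention of this paper that a training example is predicted wrongly exactly when no hypothesis intent
is included in its intent, all intents being represented as sets of attribute indices):

    def empirical_risk(hypotheses, training_intents):
        """Fraction of training examples covered by no hypothesis.
        hypotheses is a list of intents B_j, training_intents a list of intents {g}'."""
        n = len(training_intents)
        errors = sum(
            1 for intent in training_intents
            if not any(B <= intent for B in hypotheses)   # no hypothesis is included in {g}'
        )
        return errors / n

   For the Boolean-algebra context the intent of 𝑔𝑖 is the set {1, . . . , 𝑛} βˆ– {𝑖}.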
   In the Boolean algebra case it is possible to select a sufficiently large 𝑁 to make the empirical
risk equal to 0.
   After a permutation we may assume, without loss of generality, that the first 𝑛 objects form the
training sample and the last 𝑛 objects form the test sample.
   Let 𝑁 FCAML hypotheses be generated by the coupling Markov chain from the training sample for
the Boolean algebra, where 𝑁 is chosen sufficiently large to obtain πœ‚ = 0.
   Consider the set 𝐹 = {𝑓1 , . . . , 𝑓𝑛 } of binary features. For each 1 ≀ 𝑖 ≀ 𝑛 introduce the test
example 𝑔𝑛+𝑖 with the intent {𝑔𝑛+𝑖 }β€² = {𝑓𝑗 ∈ 𝐹 : 𝑗 ΜΈ= 𝑖}.
   The uniformity of the stationary distribution allows us to represent each hypothesis β„Žπ‘— =
(𝐴𝑗 , 𝐡𝑗 ) (where 1 ≀ 𝑗 ≀ 𝑁 ) by an independent Bernoulli sequence 𝐸𝑗 = (πœ–π‘—,1 , . . . , πœ–π‘—,𝑛 ) with
𝐴𝑗 = {𝑔𝑖 : πœ–π‘—,𝑖 = 0} and 𝐡𝑗 = {𝑓𝑖 : πœ–π‘—,𝑖 = 1}.
   Finally, the FCAML program predicts the target class of each test example and computes the test
risk.
   Data: random sample 𝑆 = {(𝐴1 , 𝐡1 ), . . . , (𝐴𝑁 , 𝐡𝑁 )} of concepts
   Result: number π‘Ÿ of erroneously predicted test examples
   𝑋 := test examples; π‘Ÿ := 0;
   for (𝑔 ∈ 𝑋) do
       𝑐 := 0;
       for (𝑗 := 1; 𝑗 ≀ 𝑁 ; 𝑗 + +) do
           if (𝐡𝑗 βŠ† {𝑔}β€² ) then
               𝑐 := 1;
               break;
           end
       end
       if (𝑐 = 0) then
           π‘Ÿ := π‘Ÿ + 1;
       end
   end
                Algorithm 5: Calculation of the number of erroneous predictions
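   A matching Python sketch (the helper name test_errors is ours; an erroneous prediction is counted
when no hypothesis intent is included in the test intent, which is the convention of the table in
Section 3):

    def test_errors(hypotheses, test_intents):
        """Number of test examples covered by no hypothesis (erroneous predictions)."""
        return sum(
            1 for intent in test_intents
            if not any(B <= intent for B in hypotheses)
        )

   The prediction risk of Section 3 is then πœƒ = π‘Ÿ/𝑛, where π‘Ÿ is the returned count.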

2.2. Approximate Uniformity of Random Subsets
Algorithm 1 in the case of the Boolean algebra coincides with the famous Lazy Random Walk on the
Boolean hypercube.

Lemma 2. Stationary distribution πœ‹ of Lazy Random Walk is uniform.

  The simplest proof of Lemma 2 uses the reversibility of the corresponding Markov chain, i.e. the
detailed balance equations with respect to the uniform distribution πœ‹.

Definition 2. The total variation distance between distributions πœ‡ = βŸ¨πœ‡π‘– : 𝑔𝑖 ∈ 𝐺⟩ and πœ‹ = βŸ¨πœ‹π‘– :
𝑔𝑖 ∈ 𝐺⟩ on a finite space 𝐺 is defined as half of the 𝐿1 -metric, i.e.

                           β€–πœ‡ βˆ’ πœ‹β€–π‘‡ 𝑉 = (1/2) Β· βˆ‘_{𝑔𝑖 ∈𝐺} |πœ‡π‘– βˆ’ πœ‹π‘– |.

Lemma 3. β€–πœ‡ βˆ’ πœ‹β€–π‘‡ 𝑉 = maxπ΄βŠ†πΊ |πœ‡(𝐴) βˆ’ πœ‹(𝐴)|.

  This Lemma is a direct consequence of Definition 2.

Proposition 2. For the Lazy Random Walk let

                         πœ‡(0) = P [𝑋𝑑+1 = 𝑔𝑖 | 𝑋𝑑 = 𝑔𝑖 ] = 1/2,

                         πœ‡(𝑒𝑗 ) = P [𝑋𝑑+1 = (𝑔𝑖 βŠ• 𝑒𝑗 ) | 𝑋𝑑 = 𝑔𝑖 ] = 1/(2𝑛),

and πœ‡ = 0 otherwise, and let πœ‹ be the uniform distribution. Then

                         (β€–πœ‡^{*𝑑} βˆ’ πœ‹β€–π‘‡ 𝑉 )^2 ≀ (1/4) Β· (𝑒^{𝑒^{βˆ’π‘}} βˆ’ 1)

holds for 𝑑 β‰₯ (1/2) Β· 𝑛 Β· (log 𝑛 + 𝑐).
  This proposition is an analogue of a result of Diaconis [2]; it was proved by the author during his
research on the monotonic Markov chain (Algorithm 1).
  A comparison of Algorithms 1 and 2 shows that the lower component of the coupling Markov
chain coincides with a state of the monotonic Markov chain.
  The next two propositions estimate the mean length E𝑇 of a trajectory of the coupling Markov chain
(Algorithm 2) and prove the strong concentration of the trajectory length 𝑇 around the mean E𝑇 .
The author proved them during his research on the coupling Markov chain (Algorithm 2).
Proposition 3. For the 𝑛-dimensional Boolean algebra

                       E𝑇 = βˆ‘_{𝑗=1}^{𝑛} 𝑛/𝑗 β‰ˆ 𝑛 Β· ln(𝑛) + 𝑛 Β· 𝛾 + 1/2.                         (6)

Proposition 4. For the 𝑛-dimensional Boolean algebra
                                 P [𝑇 β‰₯ (1 + πœ€) Β· 𝑛 Β· ln(𝑛)] β†’ 0                               (7)
as 𝑛 β†’ ∞, for any πœ€ > 0.
   Propositions 2, 3, and 4 imply that the outputs of Algorithm 2 are approximately uniformly
distributed.
   Since each trajectory depends only on the (pseudo-)random number generator, these outputs are
independent.
   Therefore random subsets of binary attributes can be modeled by Bernoulli sequences, which makes
direct probabilistic computations possible. In the next section these considerations lead to the
main result of the paper.


3. Main result
Any set of hypotheses about causes of the target property can be considered as a classifier: if a
test example includes at least one hypothesis, then the classifier predicts the target class
positively; if none of the hypothesized causes is embedded in a test example, then the example
is classified negatively.
   The method of empirical risk minimization proposed by V.N. Vapnik and A.Ya. Chervonenkis
[7] consists in choosing classifiers for which the classification of training examples contains
the minimum number of errors (the empirical, or observed, risk). In our case there will always
be classifiers (sets of hypotheses) whose empirical risk is zero; we restrict ourselves to these
situations only. On the other hand, the risk of making a mistake when predicting test examples
remains.
   Following K.V. Vorontsov [11], we randomly divide the objects into two groups: training and
test examples. For simplicity, let us assume that the number of objects is even and the splitting
is done in half. This assumption does not reduce generality, since the middle binomial coefficient
is the largest one.
   Let us denote the empirical risk by πœ‚ and introduce the prediction risk πœƒ = π‘Ÿ/𝑛 as the fraction of
incorrectly predicted test examples. We are interested in the probability P [πœ‚ = 0, πœƒ = 𝛿] when
the objects are divided evenly into training and test samples.
   Since the probabilities are equal for each partition, we can assume without loss of generality
that the first 𝑛 objects form the training sample and the last 𝑛 objects form the test
sample.
   Let 𝑁 FCAML hypotheses be generated by the coupling Markov chain from the training
sample for the Boolean algebra. If the trajectories are long enough, then the distribution
of hypotheses will be (almost) uniform and independent. Uniformity follows from the fast mixing
to the uniform stationary distribution, and independence follows from the independence of the
Markov chain trajectories generating the FCAML hypotheses.
   Denote the generated hypotheses by 𝐻 = {β„Ž1 , β„Ž2 , . . . , β„Žπ‘ } and form a table like

                                 𝐺|𝐻                β„Ž1    β„Ž2      ...    β„Žπ‘
                                 𝑔1                 0     1      ...     0
                                 Β·Β·Β·                Β·Β·Β·   Β·Β·Β·     ...    Β·Β·Β·
                                 𝑔𝑛                 0     0      ...     1
                                 𝑔𝑛+1               0     0      ...     0
                                 Β·Β·Β·                Β·Β·Β·   Β·Β·Β·     ...    Β·Β·Β·
                                 𝑔(1+𝛿)𝑛            0     0      ...     0
                                 𝑔(1+𝛿)𝑛+1          1     0      ...     0
                                 Β·Β·Β·                Β·Β·Β·   Β·Β·Β·     ...    Β·Β·Β·
                                 𝑔2𝑛                0     1      ...     1
   Here 1 corresponds to the inclusion of the given hypothesis into the given example, i.e. the
hypothesis predicts the example correctly (positively).
   To reach the empirical risk πœ‚ = 0, each of the first 𝑛 rows must contain at least one 1.
   Due to the uniform distribution and independence of the hypotheses, the corresponding cells
form an independent Bernoulli series with success probability 1/2.
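   Under this Bernoulli model the quantities πœ‚ and πœƒ can be simulated directly; the following sketch
(our own illustration of the model, not part of the FCAML system) estimates P [πœ‚ = 0] and the mean
error fraction πœƒ for chosen 𝑛 and 𝑁 :

    import math
    import random

    def simulate(n: int, N: int, runs: int = 2000, seed: int = 0):
        """Fill a 2n x N Bernoulli(1/2) table; a row is predicted correctly iff it contains a 1.
        Returns the empirical P[eta = 0] and the mean test-error fraction theta given eta = 0."""
        rng = random.Random(seed)
        zero_risk, thetas = 0, []
        for _ in range(runs):
            # a row of all zeros has probability 2**(-N); the rows are independent
            train_ok = all(any(rng.random() < 0.5 for _ in range(N)) for _ in range(n))
            if train_ok:
                zero_risk += 1
                errors = sum(1 for _ in range(n)
                             if not any(rng.random() < 0.5 for _ in range(N)))
                thetas.append(errors / n)
        return zero_risk / runs, (sum(thetas) / len(thetas) if thetas else None)

    n = 256
    N = math.ceil(1.5 * math.log2(n))   # N = (1 + sigma) * log2(n) with sigma = 0.5
    print(simulate(n, N))               # P[eta = 0] should be close to 1 by Lemma 4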
Lemma 4. If the number of hypotheses satisfies 𝑁 β‰₯ (1 + 𝜎) logβ‚‚ 𝑛 for some 𝜎 > 0, then limπ‘›β†’βˆž P [πœ‚ = 0] = 1.
Proof. Note that the choice of 𝑁 gives 𝑛 Β· 2^{βˆ’π‘} ≀ 𝑛^{βˆ’πœŽ}. Hence

   1 β‰₯ lim_{π‘›β†’βˆž} (1 βˆ’ 2^{βˆ’π‘})^{𝑛} = lim_{π‘›β†’βˆž} [(1 βˆ’ 2^{βˆ’π‘})^{2^{𝑁}}]^{𝑛·2^{βˆ’π‘}}
     = lim_{π‘›β†’βˆž} [𝑒^{βˆ’1}]^{𝑛·2^{βˆ’π‘}} β‰₯ lim_{π‘›β†’βˆž} 𝑒^{βˆ’1/𝑛^{𝜎}} = 1.




  To achieve πœƒ = 𝛿 fraction of erroneous     predictions of test examples it needs to select 𝛿 Β· 𝑛
rows of the lower half (there are 𝛿·𝑛
                                  (οΈ€ 𝑛 )οΈ€
                                          ways to do this) containing zeroes only, the rest rows
can contains ones somewhere. Table above corresponds to the situation with choice of rows
𝑔𝑛+1 , . . . , 𝑔(1+𝛿)·𝑛 .
                                                 )οΈ€(2βˆ’π›Ώ)𝑛 (οΈ€ βˆ’π‘ )︀𝛿·𝑛
  The task is to estimate P = 𝛿·𝑛
                              (οΈ€ 𝑛 )οΈ€ (οΈ€
                                     Β· 1 βˆ’ 2βˆ’π‘           Β· 2          when 𝑛 β†’ ∞.
Lemma 5 (Stirling’s formula). 𝑛! β‰ˆ √(2πœ‹π‘›) Β· (𝑛/𝑒)^{𝑛} for 𝑛 β†’ ∞.
Lemma 6 (Entropy inequality). βˆ’π›Ώ Β· ln 𝛿 βˆ’ (1 βˆ’ 𝛿) Β· ln(1 βˆ’ 𝛿) ≀ ln 2.

Theorem 1. Let 𝑁 β‰₯ (1 + 𝜎) logβ‚‚ 𝑛. Then, as 𝑛 β†’ ∞,

        P ≀ (1/√(2πœ‹π›Ώ(1 βˆ’ 𝛿))) Β· exp{βˆ’(1 + 𝜎) Β· 𝛿 Β· 𝑛 Β· ln 𝑛 + ln 2 Β· 𝑛 βˆ’ (ln 𝑛)/2}.
                                                )οΈ€(2βˆ’π›Ώ)𝑛 (οΈ€ βˆ’π‘ )︀𝛿·𝑛
Proof. The second factor of P = 𝛿·𝑛
                               (οΈ€ 𝑛 )οΈ€ (οΈ€
                                      Β· 1 βˆ’ 2βˆ’π‘         Β· 2          does not exceed 1. Stirling’s
formula and Entropy inequality imply
                 √                                             )︀𝛿·𝑛
              2πœ‹ Β· 𝑛 Β· 𝑛𝛿·𝑛 Β· 𝑛(1βˆ’π›Ώ)·𝑛 Β· 𝑒𝛿·𝑛 Β· 𝑒(1βˆ’π›Ώ)·𝑛 Β· 2βˆ’π‘
                                                          (οΈ€
  P≀     √         βˆšοΈ€                                                        ≀
     𝑒𝑛 Β· 2πœ‹π›Ώ Β· 𝑛 Β· 2πœ‹(1 βˆ’ 𝛿) Β· 𝑛 Β· 𝛿 𝛿·𝑛 Β· 𝑛𝛿·𝑛 Β· (1 βˆ’ 𝛿)(1βˆ’π›Ώ)·𝑛 Β· 𝑛(1βˆ’π›Ώ)·𝑛
                                     2βˆ’π›ΏΒ·π‘›Β·(1+𝜎)Β·log2 𝑛
                        ≀ βˆšοΈ€                                              =
                            2πœ‹π›Ώ(1 βˆ’ 𝛿) Β· 𝑛 Β· 𝑒𝛿·𝑛·ln 𝛿 Β· 𝑒(1βˆ’π›Ώ)·𝑛·ln(1βˆ’π›Ώ)
                            π‘’βˆ’π›ΏΒ·(1+𝜎)·𝑛·ln 𝑛 Β· 𝑒𝑛·(βˆ’π›ΏΒ·ln π›Ώβˆ’(1βˆ’π›Ώ)Β·ln(1βˆ’π›Ώ))   π‘’βˆ’π›ΏΒ·(1+𝜎)·𝑛·ln 𝑛 Β· 𝑒ln 2·𝑛
                        =              βˆšοΈ€                 √               ≀  βˆšοΈ€                 √
                                          2πœ‹π›Ώ(1 βˆ’ 𝛿) Β· 𝑛                        2πœ‹π›Ώ(1 βˆ’ 𝛿) Β· 𝑛




Conclusions
The main theorem of the paper provides a small upper bound on the probability of overfitting in
FCA-based Machine Learning in the simplest case of a Boolean algebra without counter-examples.
It states that the probability of overfitting on a fixed fraction of test examples tends to zero faster
than exponentially as the number of attributes goes to infinity. An interesting alternative to our
research is the work of T.P. Makhalova and S.O. Kuznetsov [6], where the classifiers form a lattice.


Acknowledgements
The author would like to thank his colleagues at the Dorodnicyn Computing Center of the Russian
Academy of Sciences and especially his PhD student L.A. Yakimova for collaboration and helpful
discussions.
   The author is grateful to the anonymous reviewers for improving the style of presentation.


References
 [1] Ganter B. and Wille R. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag,
     1999
 [2] Diaconis P. Group Representations in Probability and Statistics. IMS Lecture Notes –
     Monograph Series, Vol. 11. – Hayward (CA): Institute of Mathematical Statistics, 1988
 [3] Kuznetsov S.O. A Fast Algorithm for Computing all Intersections of Objects in a Finite
     Semi-Lattice. Automatic Documentation and Mathematical Linguistics. Vol. 27. – no. 5. –
     1993. – pp. 11–21
 [4] Kuznetsov S.O. Complexity of Learning in Concept Lattices from Positive and Negative
     Examples. Discrete Applied Mathematics, no. 142(1-3). – 2004. – pp. 111–125
 [5] Kuznetsov S.O. Machine Learning and Formal Concept Analysis. Proc. 2nd International
     Conference on Formal Concept Analysis: Springer LNAI, Vol. 2961. – 2004. – pp. 287–312
 [6] Makhalova T.P. and Kuznetsov S.O. On Overfitting of Classifiers Making a Lattice. Proc.
     14th International Conference on Formal Concept Analysis: Springer LNAI, Vol. 10308. –
     2017. – pp. 184–197
 [7] Vapnik V.N. Statistical Learning Theory, Wiley-Interscience, 1998
 [8] Vinogradov D.V. A Markov Chain Approach to Random Generation of Formal Concepts.
     Proc. of Workshop Formal Concept Analysis Meets Information Retrieval (FCAIR 2013): CEUR
     Workshop Proceedings, Vol. 977. – 2013. – p. 127–133
 [9] Vinogradov D.V. Accidental Formal Concepts in the Presence of Counterexamples. Proc.
     of Workshop on Formal Concept Analysis for Knowledge Discovery (FCA4KD 2017): CEUR
     Workshop Proceedings, Vol. 1921. – 2017. – p. 104–112
[10] Vinogradov D.V. FCA-based Approach to Machine Learning. Proc. of Workshop on Formal
     Concept Analysis for Artificial Intelligence (FCA4AI 2019): CEUR Workshop Proceedings,
     Vol. 2529. – 2019. – p. 57–64
[11] Vorontsov K.V. and Ivahnenko A. Tight Combinatorial Generalization Bounds for Threshold
     Conjunction Rules. Proc. of 4th International Conference on Pattern Recognition and Machine
     Intelligence. – 2011. – p. 66–73