=Paper=
{{Paper
|id=Vol-3233/paper5
|storemode=property
|title=Small Overfitting Probability in Minimization of Empirical Risk for FCA-based Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-3233/paper5.pdf
|volume=Vol-3233
|authors=Dmitry V. Vinogradov
|dblpUrl=https://dblp.org/rec/conf/ijcai/Vinogradov22
}}
==Small Overfitting Probability in Minimization of Empirical Risk for FCA-based Machine Learning==
Dmitry V. Vinogradov
Dorodnicyn Computing Center, Federal Research Center for Computer Science and Control, Russian Academy of Sciences, Moscow 119333, Russia
Abstract
The main result of the paper provides a small upper bound on the probability of overfitting in FCA-based Machine Learning in the simplest case of Boolean algebra without counter-examples. This Machine Learning approach uses a set of randomly generated formal concepts to predict test examples. The well-known Vapnik–Chervonenkis technique of empirical risk minimization determines the number of generated concepts. Estimates of the mixing and stopping times of several probabilistic algorithms based on Markov chains lead to a plausible assumption of a uniform independent distribution on the elements of the Boolean algebra. In this case the main theorem proves that the probability of overfitting on a fixed fraction of test examples tends to zero faster than exponentially when the number of attributes goes to infinity.
Keywords
FCA, Markov chain, empirical risk, overfitting, Boolean algebra
1. Introduction
Formal Concept Analysis (FCA) [1] is a popular means of data analysis in the case of small samples. However, the applicability of FCA to Big Data faces several obstacles:
• In the worst case there is an exponentially large number of hypotheses with respect to the size of a formal context (for instance, the case of Boolean algebra, see the next section).
• Many problems of FCA [4] belong to the famous classes of NP- and #P-complete problems of computational complexity.
• There is a positive probability of the appearance of "accidental" concepts, which corresponds to the overfitting phenomenon [9].
The paper [8] introduces a Markov chain approach to probabilistic generation of formal concepts in order to construct a Machine Learning system based on FCA (FCAML). More recently the FCAML-system
uses the coupling Markov chain to generate a random sample of concepts. Each run of this chain terminates with probability 1. Earlier the system used a monotonic Markov chain that corresponds to the famous Lazy Random Walk in the case of Boolean algebra. The paper [10] discusses the Induction procedure for the generalization of training examples into hypotheses about causes of the property under research, while forbidding counter-examples. Finally, the system predicts the target class of each test example by Analogy reasoning.
The main result of paper [10] gives a number of hypotheses sufficient to predict the target class with a given level of confidence. The framework is dual to the famous one of V.N. Vapnik and A.Ya. Chervonenkis with respect to the Vapnik–Chervonenkis dimension of a class of classifiers.
However, V.N. Vapnik and A.Ya. Chervonenkis also developed another approach to choosing a good hypothesis: the best one must minimize the empirical risk (the fraction of wrongly predicted training examples). The main problem in this approach is to estimate the fraction of test examples with a prediction error. We provide a partial answer to this question in the simplest case of Boolean algebra without counter-examples.
2. Background
2.1. Basic definitions and facts
Here we recall some basic definitions and facts about Machine Learning based on Formal Concept Analysis (FCAML) in the particular case of Boolean algebra. The most general situation is considered in [10]. The book [1] is the best source of information about Formal Concept Analysis itself.
The smallest (formal) context for the $n$-dimensional Boolean algebra is a triple $(G, M, \neq)$, where $G = \{g_1, \ldots, g_n\}$ is a set of coatoms (objects), $M = \{m_1, \ldots, m_n\}$ is a set of binary attributes, and $\neq \subseteq G \times M$ is the relation defined by $g_i \neq m_j \Leftrightarrow i \neq j$.
A concept of the Boolean algebra context $(G, M, \neq)$ is defined to be a pair $(A, B)$, where $A \subseteq G$, $B \subseteq M$, and $A = \{g_i \in G : \forall m_j \in B \ [i \neq j]\}$. The first component $A$ of the concept $(A, B)$ is called the extent of the concept, and the second component $B$ is called its intent. It is clear that the lattice of concepts coincides with the Boolean algebra of all subsets $B \subseteq M$. We consider the partial order on it that is dual to set inclusion.
This observation proves the first obstacle to the applicability of FCA to Big Data mentioned in the Introduction: while the formal context occupies only $n^2$ bits, the full description of the $n$-dimensional Boolean algebra requires $n \cdot 2^n$ bits of memory.
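As a rough illustration of this gap, the following Python snippet (added here for illustration only) compares the two quantities for several values of $n$:

# Illustration: memory for the formal context (n*n bits) versus a full listing
# of all concepts of the n-dimensional Boolean algebra (n * 2**n bits).
for n in (10, 20, 30):
    context_bits = n * n          # n objects x n attributes incidence table
    lattice_bits = n * 2 ** n     # 2**n concepts, each intent encoded by n bits
    print(f"n={n}: context = {context_bits} bits, all concepts = {lattice_bits} bits")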
Proposition 1. For the context $(G, M, \neq)$ the corresponding Boolean algebra of concepts has $(\emptyset, M)$ as the bottom element and $(G, \emptyset)$ as the top element. In other words, for every concept $(A, B)$ the following inequalities hold:
$$(\emptyset, M) \leq (A, B) \leq (G, \emptyset). \qquad (1)$$
Definition 1. For a concept $(A, B)$, $g_i \in G$, and $m_j \in M$ define
$$CbO((A, B), g_i) = \begin{cases} (A, B) & g_i \in A \\ (A \cup \{g_i\}, B \setminus \{m_i\}) & g_i \notin A \end{cases} \qquad (2)$$
$$CbO((A, B), m_j) = \begin{cases} (A, B) & m_j \in B \\ (A \setminus \{g_j\}, B \cup \{m_j\}) & m_j \notin B \end{cases} \qquad (3)$$
We call these operations CbO because the first one is used in the Close-by-One (CbO) algorithm to generate all the formal concepts of an arbitrary context; see [3] for details.
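For the Boolean algebra context these operations admit a very short implementation. The following Python sketch (an illustration; the representation and function names are chosen here, not taken from the paper) mirrors Definition 1 with a concept stored as a pair of index sets:

# Sketch of the CbO operations of Definition 1 for the Boolean algebra context
# (G, M, !=): objects and attributes are identified with indices 0..n-1, and a
# concept is a pair (A, B) of frozensets with A the complement of B.

def cbo_object(concept, i):
    """CbO((A, B), g_i): add object g_i to the extent, drop m_i from the intent."""
    A, B = concept
    if i in A:
        return concept
    return (A | {i}, B - {i})

def cbo_attribute(concept, j):
    """CbO((A, B), m_j): add attribute m_j to the intent, drop g_j from the extent."""
    A, B = concept
    if j in B:
        return concept
    return (A - {j}, B | {j})

# Example: the bottom concept (empty extent, full intent) for n = 4.
n = 4
bottom = (frozenset(), frozenset(range(n)))
print(cbo_object(bottom, 2))   # -> (frozenset({2}), frozenset({0, 1, 3}))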
The monotonicity properties of the introduced operations are summarized in the following lemma.
Lemma 1. Let $(G, M, \neq)$ be a context, $(A, B)$ and $(C, D)$ be concepts for it, $g \in G$, and $m \in M$. Then
$$(A, B) \leq (C, D) \Rightarrow CbO((A, B), g) \leq CbO((C, D), g), \qquad (4)$$
$$(A, B) \leq (C, D) \Rightarrow CbO((A, B), m) \leq CbO((C, D), m). \qquad (5)$$
Initially the system used the monotonic Markov chain algorithm as the core of the probabilistic approach to Machine Learning based on FCA.
Data: context (G, M, ≠)
Result: random concept (A, B)
X := G ⊔ M; (A, B) := (∅, M);
for (i := 0; i < T; i++) do
    select random element v ∈ X;
    (A, B) := CbO((A, B), v);
end
Algorithm 1: Monotonic Markov chain
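A minimal Python sketch of Algorithm 1 for the Boolean algebra case (an illustration; the trajectory length T must be supplied by the caller, and the state is stored as the intent only) may look as follows:

import random

def monotonic_chain(n, T, seed=None):
    """Sketch of Algorithm 1: T steps of the monotonic Markov chain (Lazy Random
    Walk) on the n-dimensional Boolean algebra. The state is the intent B as a
    set of attribute indices; the extent is its complement."""
    rng = random.Random(seed)
    B = set(range(n))                 # start at the bottom concept (emptyset, M)
    for _ in range(T):
        v = rng.randrange(2 * n)      # random element of G ⊔ M (objects first)
        if v < n:                     # v is object g_v: CbO removes m_v from the intent
            B.discard(v)
        else:                         # v is attribute m_{v-n}: CbO adds it to the intent
            B.add(v - n)
    A = set(range(n)) - B
    return A, B

print(monotonic_chain(8, 50, seed=0))

Each step leaves the state unchanged with probability 1/2 and flips a fixed coordinate with probability 1/(2n), which is exactly the Lazy Random Walk discussed in Section 2.2.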
The main difficulty with the monotonic Markov chain in the general case is the absence of a good estimate of the length $T$ of its trajectory needed to reach an approximately stationary distribution. However, the case of Boolean algebra was investigated successfully by the author; the next subsection contains the key results about it.
Later the coupling Markov chain algorithm described below appeared, which has a natural stopping moment. Now it is the workhorse of our approach.
Data: context (G, M, ≠)
Result: random concept (A, B)
X := G ⊔ M; (A, B) := (∅, M); (C, D) := (G, ∅);
while (A ≠ C) do
    select random element v ∈ X;
    (A, B) := CbO((A, B), v);
    (C, D) := CbO((C, D), v);
end
Algorithm 2: Coupling Markov chain
The algorithm terminates when the upper and lower concepts coincide. The fact that the ordering $(A, B) \leq (C, D)$ is preserved at every intermediate step of the while loop of Algorithm 2 follows from Lemma 1.
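A corresponding Python sketch of Algorithm 2 (again an illustration in the intent-only representation used above) shows both the coupling and the natural stopping moment:

import random

def coupling_chain(n, seed=None):
    """Sketch of Algorithm 2 for the n-dimensional Boolean algebra: run the lower
    chain from (emptyset, M) and the upper chain from (G, emptyset) with the same
    random choices until the two concepts coincide."""
    rng = random.Random(seed)
    lower = set(range(n))   # intent of the lower concept, starts as all of M
    upper = set()           # intent of the upper concept, starts empty
    steps = 0
    while lower != upper:
        v = rng.randrange(2 * n)
        steps += 1
        if v < n:                     # object g_v: remove m_v from both intents
            lower.discard(v); upper.discard(v)
        else:                         # attribute m_{v-n}: add it to both intents
            lower.add(v - n); upper.add(v - n)
    B = lower
    A = set(range(n)) - B
    return (A, B), steps

concept, steps = coupling_chain(16, seed=1)
print(concept, steps)

In this representation the stopping time is the coupon-collector time over the $n$ attribute positions, which is consistent with Propositions 3 and 4 below.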
Now we present Machine Learning based on FCA (the FCAML-method) for our setting (Boolean algebra without counter-examples). See the seminal paper [10] for a description of the general scheme of the FCAML-method.
The reader can learn the classical deterministic FCA-based approach to Machine Learning from Kuznetsov [5]. Our technique uses the probabilistic Algorithm 2 to compute a random subset of concepts.
As usual, there are two sets of objects, called the training sample $G = \{g_1, \ldots, g_n\}$ and the test sample $O = \{o_1, \ldots, o_n\}$, respectively. The set $O$ contains the examples whose target class is to be predicted (so-called test objects).
From the training sample the program generates a formal context $(G, M, \neq)$, where $M = \{m_1, \ldots, m_n\}$. After that the program applies the coupling Markov chain (Algorithm 2) to generate a given number $N$ of random concepts $(A, B)$.
Data: number N of concepts to generate
Result: random sample S of formal concepts
S := ∅; i := 0;
while (i < N) do
    generate concept (A, B) by Algorithm 2;
    S := S ∪ {(A, B)};
    i := i + 1;
end
Algorithm 3: Inductive generalization
The FCAML-method replaces a time-consuming deterministic algorithm for the generation of all concepts (for instance, Close-by-One) by a probabilistic one that randomly generates a prescribed number of concepts. The goal of the Markov chain approach is to select a random sample of formal concepts without computing the (possibly exponentially large) set of all concepts.
How should the number $N$ of concepts for the coupling Markov chain be selected? There are two different approaches, both based on ideas of V.N. Vapnik and A.Ya. Chervonenkis. The approach promoted by K.V. Vorontsov is empirical risk minimization.
Data: random sample S of concepts
Result: empirical risk value R
G := training examples; R := 0;
for (g ∈ G) do
    covered := false;
    for ((A, B) ∈ S) do
        if (B ⊆ {g}') then
            covered := true;
            break;
        end
    end
    if (not covered) then
        R := R + 1/n;
    end
end
Algorithm 4: Calculation of empirical risk
In the Boolean algebra case it is possible to select a sufficiently large $N$ to make the empirical risk equal to 0.
By applying a permutation we can assume, without reducing generality, that the first $n$ objects were included in the training sample and the last $n$ objects form the test sample.
Let $N$ FCA Machine Learning hypotheses be generated using the coupling Markov chain from a training sample for Boolean algebra, where $N$ is chosen sufficiently large to obtain $R = 0$.
The uniformity of the stationary distribution on hypotheses allows one to construct a hypothesis $h_k = (A_k, B_k)$ (where $1 \leq k \leq N$) from an independent Bernoulli sequence $E_k = (\epsilon_{k,1}, \ldots, \epsilon_{k,n})$ as $A_k = \{g_i : \epsilon_{k,i} = 0\}$ and $B_k = \{m_i : \epsilon_{k,i} = 1\}$.
Finally, the FCAML program predicts the target class of the test examples and computes the test risk.
Consider the set $F = \{f_1, \ldots, f_n\}$ of binary features. For each $1 \leq j \leq n$ introduce the test example $o_{n+j}$ with the intent $\{o_{n+j}\}' = \{f_i \in F : i \neq j\}$. As above, an independent Bernoulli sequence $E_k = (\epsilon_{k,1}, \ldots, \epsilon_{k,n})$ determines the corresponding hypothesis $h_k = (A_k, B_k)$.
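Under the uniformity assumption the hypotheses and the test intents can be simulated directly. The following Python sketch (an illustration; the function names are chosen here) constructs them as sets of indices:

import random

def random_hypothesis(n, rng):
    """A uniformly random hypothesis h_k = (A_k, B_k) built from an independent
    Bernoulli(1/2) sequence: B_k = {i : eps_i = 1}, A_k = {i : eps_i = 0}."""
    eps = [rng.randint(0, 1) for _ in range(n)]
    B = {i for i, e in enumerate(eps) if e == 1}
    A = set(range(n)) - B
    return A, B

def test_intent(n, j):
    """Intent {o_{n+j}}' = {f_i : i != j} of the j-th test example, 0 <= j < n."""
    return set(range(n)) - {j}

rng = random.Random(42)
n, N = 20, 12
hypotheses = [random_hypothesis(n, rng) for _ in range(N)]
print(hypotheses[0], test_intent(n, 3))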
Data: random sample S = {(A_1, B_1), ..., (A_N, B_N)} of concepts
Result: number k of erroneously predicted test examples
O := test examples; k := 0;
for (o ∈ O) do
    covered := false;
    for (i := 1; i ≤ N; i++) do
        if (B_i ⊆ {o}') then
            covered := true;
            break;
        end
    end
    if (not covered) then
        k := k + 1;
    end
end
Algorithm 5: Calculation of the fraction of erroneous predictions
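The counting performed by Algorithms 4 and 5 can be sketched in Python as follows (an illustration under the Boolean algebra construction above, where the training intent $\{g_i\}'$ equals $M \setminus \{m_i\}$ and the test intent of $o_{n+j}$ omits only $f_j$):

import random

def empirical_risk(hypotheses, n):
    """Algorithm 4: fraction of training examples g_i whose intent M \\ {m_i}
    contains no hypothesis intent B_k (i.e. wrongly predicted)."""
    errors = 0
    for i in range(n):
        intent = set(range(n)) - {i}
        if not any(B <= intent for (_A, B) in hypotheses):
            errors += 1
    return errors / n

def test_error_count(hypotheses, n):
    """Algorithm 5: number k of test examples o_{n+j}, with intent {f_i : i != j},
    covered by no hypothesis and hence predicted erroneously."""
    return sum(
        1 for j in range(n)
        if not any(B <= (set(range(n)) - {j}) for (_A, B) in hypotheses)
    )

rng = random.Random(0)
n, N = 20, 12
hypotheses = []
for _ in range(N):
    B = {i for i in range(n) if rng.random() < 0.5}
    hypotheses.append((set(range(n)) - B, B))
print(empirical_risk(hypotheses, n), test_error_count(hypotheses, n))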
2.2. Approximate Uniformity of Random Subsets
Algorithm 1 in the case of Boolean algebra coincides with the famous Lazy Random Walk on the Boolean hypercube.
Lemma 2. The stationary distribution $\pi$ of the Lazy Random Walk is uniform.
The simplest proof of Lemma 2 uses the reversibility of the corresponding Markov chain and the detailed balance equations with respect to the uniform distribution $\pi$.
Definition 2. The total variation distance between distributions $P = \langle P_g : g \in G \rangle$ and $Q = \langle Q_g : g \in G \rangle$ on a finite space $G$ is defined as half of the $L_1$-metric, i.e. $\|P - Q\|_{TV} = \frac{1}{2} \cdot \sum_{g \in G} |P_g - Q_g|$.
Lemma 3. $\|P - Q\|_{TV} = \max_{A \subseteq G} |P(A) - Q(A)|$.
This lemma is a direct consequence of Definition 2.
Proposition 2. For the Lazy Random Walk let
$$P(0) = \mathbb{P}\left[X_{t+1} = X_t \mid X_t\right] = \frac{1}{2}, \qquad P(m_j) = \mathbb{P}\left[X_{t+1} = X_t \triangle \{m_j\} \mid X_t\right] = \frac{1}{2n},$$
and $P = 0$ otherwise, and let $\pi$ be the uniform distribution. Then
$$\left(\|P^{*t} - \pi\|_{TV}\right)^2 \leq \frac{1}{4} \cdot \left(e^{e^{-c}} - 1\right)$$
holds for $t \geq \frac{1}{2} \cdot n \cdot (\log n + c)$.
This proposition is an analogue of a result of Diaconis [2]; it was proved by the author during his research on the monotonic Markov chain (Algorithm 1).
A comparison of Algorithms 1 and 2 shows that the lower component of the coupling Markov chain coincides with a state of the monotonic Markov chain.
The next two propositions estimate the mean length $\mathbb{E}[T]$ of a trajectory of the coupling Markov chain (Algorithm 2) and prove the strong concentration of the trajectory length $T$ around the mean $\mathbb{E}[T]$. The author proved them during his research on the coupling Markov chain (Algorithm 2).
Proposition 3. For the $n$-dimensional Boolean algebra
$$\mathbb{E}[T] = \sum_{j=1}^{n} \frac{n}{j} \approx n \cdot \ln(n) + n \cdot \gamma + \frac{1}{2}. \qquad (6)$$
Proposition 4. For the $n$-dimensional Boolean algebra
$$\mathbb{P}\left[T \geq (1 + \varepsilon) \cdot n \cdot \ln(n)\right] \to 0 \qquad (7)$$
as $n \to \infty$, for any $\varepsilon > 0$.
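A quick numerical sanity check of the approximation in (6) (an illustration added here) compares the exact harmonic sum with its asymptotic form:

import math

# Compare E[T] = sum_{j=1..n} n/j with n*ln(n) + n*gamma + 1/2 (Euler-Mascheroni gamma).
gamma = 0.5772156649015329
for n in (10, 100, 1000):
    exact = sum(n / j for j in range(1, n + 1))
    approx = n * math.log(n) + n * gamma + 0.5
    print(f"n={n}: exact={exact:.3f}, approx={approx:.3f}")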
The statements of Propositions 2, 3, and 4 imply that the outputs of Algorithm 2 are approximately uniformly distributed. Since each trajectory depends only on the (pseudo-)random number generator, these outputs are independent.
Random subsets of binary attributes can therefore be generated by Bernoulli sequences, which makes direct probabilistic computations possible. In the next section these considerations lead to the main result of the paper.
3. Main result
Any set of hypotheses about causes of the target property can be considered as a classifier: if at least one hypothesis is included in a test example, then the classifier predicts the target class positively; if none of the hypothetical causes is embedded in a test example, then this example is classified negatively.
The method of minimizing empirical risk proposed by V.N. Vapnik and A.Ya. Chervonenkis [7] consists in choosing algorithms for which the classification of training examples contains the minimum number of errors (the empirical, or observed, risk). In our case there will always be classifiers (sets of hypotheses) whose empirical risk is zero; we restrict ourselves to these situations only. On the other hand, the risk of making a mistake in predicting test examples remains.
Following K.V. Vorontsov [11], we randomly divide the objects into two groups: training and test examples. For simplicity, let us assume that the number of objects is even and that the split is done in half. This assumption does not reduce generality, since the central binomial coefficient is the largest one.
Let us denote the empirical risk by $R$ and introduce the prediction risk as the fraction $\nu = k/n$ of incorrectly predicted test examples. We are interested in the probability $\mathbb{P}[R = 0, \nu = \delta]$ when the objects are evenly divided in half into training and test samples.
Since the probabilities are equal for each partition, we can assume without reducing generality that the first $n$ objects were included in the training sample and the last $n$ objects form the test sample.
Let $N$ FCAML hypotheses be generated using the coupling Markov chain from a training sample for Boolean algebra. If the trajectories are chosen long enough, then the distribution of the hypotheses will be (almost) uniform and independent. Uniformity follows from the fast mixing to the uniform stationary distribution, and independence follows from the independence of the Markov chain trajectories generating the FCAML hypotheses.
Denote the generated hypotheses by $H = \{h_1, h_2, \ldots, h_N\}$ and form a table like

G \ H           h_1   h_2   ...   h_N
g_1              0     1    ...    0
...             ...   ...   ...   ...
g_n              0     0    ...    1
g_{n+1}          0     0    ...    0
...             ...   ...   ...   ...
g_{(1+δ)n}       0     0    ...    0
g_{(1+δ)n+1}     1     0    ...    0
...             ...   ...   ...   ...
g_{2n}           0     1    ...    1
Here 1 corresponds to the inclusion of the given hypothesis in the given example, i.e. the hypothesis predicts the example correctly (positively).
To reach the empirical risk $R = 0$, each of the first $n$ rows must contain at least one 1.
Due to the uniform distribution and independence of the hypotheses, the corresponding cells form independent Bernoulli trials with success probability $\frac{1}{2}$.
Lemma 4. If the number of hypotheses satisfies $N \geq (1 + \varepsilon) \log_2 n$ for some $\varepsilon > 0$, then $\lim_{n \to \infty} \mathbb{P}[R = 0] = 1$.
Proof. Each of the $n$ training rows contains at least one 1 with probability $1 - 2^{-N}$, independently of the others, so $\mathbb{P}[R = 0] = (1 - 2^{-N})^n$. Hence
$$1 \geq \lim_{n \to \infty} \left(1 - 2^{-N}\right)^{n} = \lim_{n \to \infty} \left[\left(1 - 2^{-N}\right)^{2^{N}}\right]^{n \cdot 2^{-N}} = \lim_{n \to \infty} \left[e^{-1}\right]^{n \cdot 2^{-N}} \geq \lim_{n \to \infty} e^{-1/n^{\varepsilon}} = 1.$$
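The convergence in Lemma 4 can also be observed numerically; the following snippet (an illustration, with the arbitrary choice $\varepsilon = 0.5$) evaluates $\mathbb{P}[R = 0] = (1 - 2^{-N})^n$ for $N = \lceil (1 + \varepsilon) \log_2 n \rceil$:

import math

# P[R = 0] = (1 - 2**(-N))**n for N hypotheses and n training rows of independent
# Bernoulli(1/2) cells; take N = ceil((1 + eps) * log2(n)).
eps = 0.5
for n in (10, 100, 1000, 10_000):
    N = math.ceil((1 + eps) * math.log2(n))
    p_zero_risk = (1 - 2 ** (-N)) ** n
    print(f"n={n}, N={N}: P[R=0] ~= {p_zero_risk:.4f}")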
To achieve the fraction $\nu = \delta$ of erroneous predictions on test examples one needs to select $\delta \cdot n$ rows of the lower half containing only zeroes (there are $\binom{n}{\delta \cdot n}$ ways to do this), while the remaining rows may contain ones anywhere. The table above corresponds to the choice of rows $g_{n+1}, \ldots, g_{(1+\delta) \cdot n}$. For a fixed choice, the selected $\delta \cdot n$ rows are all zero with probability $\left(2^{-N}\right)^{\delta \cdot n}$, while each of the other $(2 - \delta) \cdot n$ rows (the $n$ training rows and the remaining $(1 - \delta) \cdot n$ test rows) contains at least one 1 with probability $1 - 2^{-N}$.
The task is therefore to estimate
$$P = \binom{n}{\delta \cdot n} \cdot \left(1 - 2^{-N}\right)^{(2 - \delta) \cdot n} \cdot \left(2^{-N}\right)^{\delta \cdot n}$$
when $n \to \infty$.
Lemma 5 (Stirling's formula). $n! \sim \sqrt{2 \pi n} \cdot n^n \cdot e^{-n}$ as $n \to \infty$.
Lemma 6 (Entropy inequality). $-\delta \cdot \ln \delta - (1 - \delta) \cdot \ln(1 - \delta) \leq \ln 2$.
Theorem 1. For $n \to \infty$ and $N \geq (1 + \varepsilon) \log_2 n$,
$$P \leq \frac{1}{\sqrt{2 \pi \delta (1 - \delta)}} \cdot \exp\left\{ -(1 + \varepsilon) \cdot \delta \cdot n \cdot \ln n + \ln 2 \cdot n - \frac{\ln n}{2} \right\}.$$
Proof. The second factor of $P = \binom{n}{\delta \cdot n} \cdot \left(1 - 2^{-N}\right)^{(2 - \delta) \cdot n} \cdot \left(2^{-N}\right)^{\delta \cdot n}$ does not exceed 1. Stirling's formula and the entropy inequality imply
$$P \leq \frac{\sqrt{2 \pi n} \cdot n^{\delta n} \cdot n^{(1 - \delta) n} \cdot e^{\delta n} \cdot e^{(1 - \delta) n} \cdot \left(2^{-N}\right)^{\delta n}}{e^{n} \cdot \sqrt{2 \pi \delta n} \cdot \sqrt{2 \pi (1 - \delta) n} \cdot \delta^{\delta n} \cdot n^{\delta n} \cdot (1 - \delta)^{(1 - \delta) n} \cdot n^{(1 - \delta) n}} \leq \frac{2^{-\delta \cdot n \cdot (1 + \varepsilon) \cdot \log_2 n}}{\sqrt{2 \pi \delta (1 - \delta) \cdot n} \cdot e^{\delta \cdot n \cdot \ln \delta} \cdot e^{(1 - \delta) \cdot n \cdot \ln(1 - \delta)}} =$$
$$= \frac{e^{-\delta \cdot (1 + \varepsilon) \cdot n \cdot \ln n} \cdot e^{n \cdot (-\delta \cdot \ln \delta - (1 - \delta) \cdot \ln(1 - \delta))}}{\sqrt{2 \pi \delta (1 - \delta) \cdot n}} \leq \frac{e^{-\delta \cdot (1 + \varepsilon) \cdot n \cdot \ln n} \cdot e^{\ln 2 \cdot n}}{\sqrt{2 \pi \delta (1 - \delta) \cdot n}}. \qquad \square$$
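To see how fast the bound of Theorem 1 decays, one can evaluate its logarithm numerically (an illustration with the arbitrary choices $\delta = 0.1$ and $\varepsilon = 0.5$):

import math

# Logarithm of the Theorem 1 bound: -(1+eps)*delta*n*ln(n) + ln(2)*n - ln(n)/2
# minus ln(sqrt(2*pi*delta*(1-delta))); the n*ln(n) term dominates for large n.
delta, eps = 0.1, 0.5
for n in (100, 1000, 10_000):
    log_bound = (-(1 + eps) * delta * n * math.log(n) + math.log(2) * n
                 - 0.5 * math.log(n) - 0.5 * math.log(2 * math.pi * delta * (1 - delta)))
    print(f"n={n}: log P <= {log_bound:.1f}")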
Conclusions
The main theorem of the paper provides a small upper bound on the probability of overfitting in FCA-based Machine Learning in the simplest case of Boolean algebra without counter-examples. It states that the probability of overfitting on a fixed fraction of test examples tends to zero faster than exponentially when the number of attributes goes to infinity. An interesting alternative to our research is the work of T.P. Makhalova and S.O. Kuznetsov [6], where classifiers form a lattice.
Acknowledgements
The author would like to thank his colleagues at the Dorodnicyn Computing Center of the Russian Academy of Sciences and especially his PhD student L.A. Yakimova for collaboration and helpful discussions.
The author is grateful to anonymous reviewers for improving the style of presentation.
References
[1] Ganter Bernard and Wille Rudolf. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, 1999.
[2] Diaconis Persi. Group Representations in Probability and Statistics. IMS Lecture Notes – Monograph Series, Vol. 11. Hayward (CA): Institute of Mathematical Statistics, 1988.
[3] Kuznetsov S.O. A Fast Algorithm for Computing all Intersections of Objects in a Finite Semi-Lattice. Automatic Documentation and Mathematical Linguistics, Vol. 27, no. 5, 1993, pp. 11–21.
[4] Kuznetsov S.O. Complexity of Learning in Concept Lattices from Positive and Negative Examples. Discrete Applied Mathematics, no. 142(1–3), 2004, pp. 111–125.
[5] Kuznetsov S.O. Machine Learning and Formal Concept Analysis. Proc. 2nd International Conference on Formal Concept Analysis: Springer LNAI, Vol. 2961, 2004, pp. 287–312.
[6] Makhalova T.P. and Kuznetsov S.O. On Overfitting of Classifiers Making a Lattice. Proc. 14th International Conference on Formal Concept Analysis: Springer LNAI, Vol. 10308, 2017, pp. 184–197.
[7] Vapnik V.N. Statistical Learning Theory. Wiley-Interscience, 1998.
[8] Vinogradov D.V. A Markov Chain Approach to Random Generation of Formal Concepts. Proc. of Workshop Formal Concept Analysis Meets Information Retrieval (FCAIR 2013): CEUR Workshop Proceedings, Vol. 977, 2013, pp. 127–133.
[9] Vinogradov D.V. Accidental Formal Concepts in the Presence of Counterexamples. Proc. of Workshop on Formal Concept Analysis for Knowledge Discovery (FCA4KD 2017): CEUR Workshop Proceedings, Vol. 1921, 2017, pp. 104–112.
[10] Vinogradov D.V. FCA-based Approach to Machine Learning. Proc. of Workshop on Formal Concept Analysis for Artificial Intelligence (FCA4AI 2019): CEUR Workshop Proceedings, Vol. 2529, 2019, pp. 57–64.
[11] Vorontsov K.V. and Ivahnenko A. Tight Combinatorial Generalization Bounds for Threshold Conjunction Rules. Proc. of 4th International Conference on Pattern Recognition and Machine Intelligence, 2011, pp. 66–73.