Towards Fast Finding Optimal Short Classifiers

Egor Dudyrev, Sergei O. Kuznetsov
HSE University, 20 Myasnitskaya St, Moscow, 101000, Russian Federation
eo.dudyrev@hse.ru (E. Dudyrev); skuznetsov@hse.ru (S. O. Kuznetsov)

Abstract
Studies on Explainable Artificial Intelligence show that a model should be small in order to be human-understandable. The restriction on the size of a model drastically reduces the space of possible solutions. Nevertheless, many rule learning models still rely on greedy algorithms for generating ensembles of decision trees. This paper discusses FCA-inspired mathematical and engineering techniques to efficiently find optimal short binary classifiers, i.e., classifiers that consist of no more than three binary attributes and are optimal w.r.t. F1 score.

Keywords
Supervised Machine Learning, Explainable Artificial Intelligence, Formal Concept Analysis

Published in Sergei O. Kuznetsov, Amedeo Napoli, Sebastian Rudolph (Eds.): The 10th International Workshop "What can FCA do for Artificial Intelligence?", FCA4AI 2022, co-located with IJCAI-ECAI 2022, July 23, 2022, Vienna, Austria, Proceedings, pp. 23–34.

1. Introduction
Studies on Explainable Artificial Intelligence show that a model should be small in order to be human-understandable. Modern learning models such as Gradient Boosting over Decision Trees [1] or Random Forest [2] are universally recognized as mainstream state-of-the-art solutions for binary classification tasks. However, such models are too complex to be analyzed and to be used in trust-requiring scenarios, even with the help of XAI [3, 4]. This is a reason for the rising tendency of rejecting complex black-box models in favor of small explainable ones [5, 6].

Besides making models highly explainable, restricting the size of a model also drastically limits the space of all possible models. Thus, one can search for a globally optimal model instead of approximating it with locally optimal ones. This paper considers models operating on at most three binary attributes. In order to find an optimal model meeting these conditions, one should iterate through all combinations of up to three of the given binary attributes. However, such a brute-force algorithm suffers from combinatorial explosion as the number of attributes grows. This paper develops both mathematical and engineering techniques, based on Formal Concept Analysis (FCA) [7], that make this brute-force algorithm more efficient.

The challenge of constructing simple logical models has been addressed in many areas of previous research. In the early 1960s, work on formal logic led to the inception of logic programming and rule-learning algorithms [8, 9]. The latter, including algorithms such as Skope-Rules and RuleFit [10], often rely on greedy approaches to extract short rules from more complex models (such as large decision trees). Contrary to these greedy approaches, which result in locally optimal models, this paper tackles the problem of finding an optimal model of a given size.

Another reason to develop and use short models is that they make a good benchmark for big black-box models.
Indeed, if a short model achieves the same prediction quality as a big black box, there is no reason to use the latter.

Extensive research shows that a human can operate with no more than three plus or minus one ideas in their head simultaneously [11, 12]. In this paper we concentrate on finding rules whose premises contain no more than three attributes, for two reasons. First, since this is our first approach to the problem, we should try to solve its easiest version. Second, the choice of the number three is justified by cognitive studies. Thus, short rules consisting of no more than three attributes represent a specific class of explainable machine learning models.

The structure of the paper is as follows. Section 2 presents the theoretical background used throughout the paper. Section 3 describes the mathematical techniques for optimizing the algorithm by minimizing the number of required operations. Complementing this, Section 4 discusses the engineering approaches that optimize the algorithm by maximizing the computation speed. Section 5 merges all the discussed techniques together into one algorithm, and Section 6 presents the experimental results of this algorithm. Finally, Section 7 concludes the paper.

2. Theoretical Background
This section introduces the definitions we use throughout the paper. First, we provide the basic terms of Formal Concept Analysis used to describe rule models. Second, we describe the space of premises that contains the machine learning models discussed in the paper. Third, we describe the basics of binary classification in FCA notation.

2.1. Formal Concept Analysis
A formal context 𝐾 describes the dataset to build the model on. It is a triple 𝐾 = (𝐺, 𝑀, 𝐼) where 𝐺 is a set of objects (rows in a dataset), 𝑀 is a set of attributes (columns in a dataset), and 𝐼 ⊆ 𝐺 × 𝑀 is a relation between objects and attributes. The prime (′) operators match a subset of objects 𝐴 ⊆ 𝐺 with the subset of attributes 𝐵 ⊆ 𝑀 such that the objects from 𝐴 are "described" by the attributes from 𝐵, and vice versa:

$$A' = \{m \in M \mid \forall g \in A : g I m\}, \qquad B' = \{g \in G \mid \forall m \in B : g I m\} \quad (1)$$

Given a subset of attributes 𝐵 ⊆ 𝑀, the subset of objects 𝐴 = 𝐵′ is called the extent of 𝐵. Dually, the subset of attributes 𝐵 = 𝐴′ is called the intent of 𝐴.

2.2. Premises
In this subsection we define a premise as a combination of attributes of a formal context joined by conjunction, disjunction, and negation operations. The notion of a premise is necessary for the following subsections.

Definition 1. The premise space ℙ is the set of all combinations of attributes from 𝑀 constructed with conjunction ∧, disjunction ∨, and negation operations, i.e., ℙ is a set such that:

$$1)\ M \subset \mathbb{P}; \qquad 2)\ \forall p, q \in \mathbb{P} :\ p \wedge q,\ p \vee q,\ \bar{p} \in \mathbb{P} \quad (2)$$

Each premise 𝑝 ∈ ℙ corresponds to a subset of objects 𝑝′ ⊆ 𝐺 called the extent of the premise. Extents of conjunctions, disjunctions, and negations are defined as follows:

$$(p \wedge q)' = p' \cap q', \qquad (p \vee q)' = p' \cup q', \qquad (\bar{p})' = G \setminus p' \quad (3)$$

This paper is specifically interested in premises consisting of no more than three attributes. Generally, the set of premises 𝑃ₙ constructed from 𝑛 attributes can be described in the following way:

$$P_1 = M \cup \{\bar{m} \mid m \in M\}$$
$$P_n = \bigcup_{i=1}^{\lfloor n/2 \rfloor} \{p \wedge q,\ p \vee q,\ \overline{p \wedge q},\ \overline{p \vee q} \mid p \in P_i,\ q \in P_{n-i}\}, \quad n > 1 \quad (4)$$

Uniting the sets of premises 𝑃ₙ for each natural number 𝑛, we obtain the premise space ℙ = ⋃ₙ∈ℕ 𝑃ₙ. In what follows, we say that a premise 𝑝 ∈ ℙ has size 𝑛 if it belongs to 𝑃ₙ.
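To make these definitions concrete, the following minimal sketch implements the prime operators of Equation 1 and the extent operations of Equation 3 on a toy context. The context and all names are illustrative; they are not part of the paper's implementation.

```python
# A minimal sketch of the prime operators (Eq. 1) and extent operations
# (Eq. 3) on a toy formal context; the data and names are illustrative.
G = {"g1", "g2", "g3", "g4"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g3", "b"), ("g3", "c")}

def extent(B):
    """B' : all objects that have every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

def intent(A):
    """A' : all attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

# Extent arithmetic of Eq. 3: conjunction, disjunction, negation.
ext_a, ext_b = extent({"a"}), extent({"b"})
conj = ext_a & ext_b            # (a ∧ b)' = a' ∩ b'
disj = ext_a | ext_b            # (a ∨ b)' = a' ∪ b'
neg = G - ext_a                 # (¬a)'    = G \ a'

print(extent({"b"}))            # {'g1', 'g2', 'g3'} (set order may vary)
print(intent({"g1", "g2"}))     # {'b'}
```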
2.3. Binary classification
Binary classification is a machine learning task in which a model is asked to predict whether an object 𝑔 belongs to the "positive" or the "negative" class, given the object's description (a subset of attributes). The model is obtained from a provided training context (dataset) 𝐾 = (𝐺, 𝑀, 𝐼) with predefined positive objects 𝐺₊ ⊂ 𝐺 and negative objects 𝐺₋ = 𝐺 ∖ 𝐺₊. The first step in constructing a good binary classifier is to find a model, operating on the set of attributes 𝑀, that efficiently separates the positive objects 𝐺₊ from the negative objects 𝐺₋. After that, the model can be applied to a test context 𝐾ₜₑₛₜ = (𝐺ₜₑₛₜ, 𝑀, 𝐼ₜₑₛₜ) to predict its unknown positive and negative objects.

This paper studies binary classifiers of the form "if premise 𝑝 ∈ ℙ is true, then object 𝑔 is predicted positive, otherwise object 𝑔 is predicted negative". The prediction quality of a premise 𝑝 ∈ ℙ on a training dataset is measured by comparing the given sets of positive objects 𝐺₊ and negative objects 𝐺₋ with the sets of positive 𝐺ₚ⁺ and negative 𝐺ₚ⁻ objects predicted by the premise 𝑝. Note that the set 𝐺ₚ⁺ is exactly the extent of the premise 𝑝: 𝐺ₚ⁺ = 𝑝′.

$$TP_p = G_+ \cap G_p^+ = G_+ \cap p' \qquad FP_p = G_- \cap G_p^+ = G_- \cap p'$$
$$FN_p = G_+ \cap G_p^- = G_+ \setminus p' \qquad TN_p = G_- \cap G_p^- = G_- \setminus p' \quad (5)$$

For the sake of brevity, we use lowercase letters to denote the cardinalities of the so-called true positives 𝑇𝑃ₚ, false positives 𝐹𝑃ₚ, false negatives 𝐹𝑁ₚ, and true negatives 𝑇𝑁ₚ. Likewise, we use 𝑦₊, 𝑦₋ to denote the cardinalities of the set of positive objects 𝐺₊ and the set of negative objects 𝐺₋ respectively:

$$tp_p = |TP_p|, \quad fp_p = |FP_p|, \quad fn_p = |FN_p|, \quad tn_p = |TN_p|, \quad y_+ = |G_+|, \quad y_- = |G_-| \quad (6)$$

Among the most widely used quality scores for binary classification are precision (𝑝𝑟𝑒𝑐), recall (𝑟𝑒𝑐), and their harmonic mean called the F1 score (𝐹1). Their definitions are as follows:

$$prec(p) = \frac{tp_p}{|p'|}, \qquad rec(p) = \frac{tp_p}{y_+}, \qquad F1(p) = 2\,\frac{prec(p) \cdot rec(p)}{prec(p) + rec(p)} \quad (7)$$

This paper focuses on finding the premise 𝑝* of size no larger than 3 having the maximal F1 score:

$$p^* = \operatorname*{arg\,max}_{p \in \bigcup_{n=1}^{3} P_n} F1(p) \quad (8)$$

The presented techniques can be easily adjusted to other quality measures.

3. Minimizing the number of comparisons
3.1. F1 score optimization
The definition of the F1 score as a harmonic mean of precision and recall makes the possible optimization strategies obscure. This subsection reduces the task of maximizing the F1 score to maximizing the numbers of true positives and true negatives.

Proposition 1. The F1 score 𝐹1(𝑝) is comonotonic to the Jaccard score 𝐽(𝑝) (denoted by ∼), where the latter is the Jaccard similarity coefficient between the set of positive objects 𝐺₊ and the set of objects predicted positive 𝐺ₚ⁺ = 𝑝′:

$$F1(p) \sim J(p) = \frac{|G_+ \cap p'|}{|G_+ \cup p'|} \quad (9)$$

Proof. Let us express the Jaccard score in terms of the true positives 𝑡𝑝ₚ and true negatives 𝑡𝑛ₚ. Since 𝐺₊ ∪ 𝑝′ = 𝐺 ∖ 𝑇𝑁ₚ, we have |𝐺₊ ∪ 𝑝′| = |𝐺| − 𝑡𝑛ₚ, so

$$J(p) = \frac{|G_+ \cap p'|}{|G_+ \cup p'|} = \frac{tp_p}{|G| - tn_p} \quad (10)$$

Now we express the F1 score in terms of 𝑡𝑝ₚ and 𝑡𝑛ₚ as well:

$$F1(p) = 2\,\frac{prec(p) \cdot rec(p)}{prec(p) + rec(p)} = \frac{2\,tp_p}{y_+ + |p'|} = \frac{2\,tp_p}{(|G| - fp_p - tn_p) + (tp_p + fp_p)} = \frac{2\,tp_p}{|G| + tp_p - tn_p} = \frac{2}{1 + \frac{|G| - tn_p}{tp_p}} = \frac{2}{1 + \frac{1}{J(p)}} \quad (11)$$

Since 𝐹1(𝑝) = 2𝐽(𝑝)/(1 + 𝐽(𝑝)) is a monotonically increasing function of 𝐽(𝑝), we obtain the relation 𝐹1(𝑝) ∼ 𝐽(𝑝). ∎

Because the F1 score is monotonic with respect to the Jaccard score, the F1 optimization problem can be viewed as the problem of optimizing the fraction 𝑡𝑝ₚ/(|𝐺| − 𝑡𝑛ₚ), i.e., maximizing the numbers of true positives and true negatives:

$$\operatorname*{arg\,max}_{p \in \mathbb{P}} F1(p) = \operatorname*{arg\,max}_{p \in \mathbb{P}} J(p) = \operatorname*{arg\,max}_{p \in \mathbb{P}} \frac{tp_p}{|G| - tn_p} \quad (12)$$
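As a quick numerical check of this reduction, the following sketch computes both scores from set-valued extents and verifies the identity 𝐹1 = 2𝐽/(1 + 𝐽) together with Equation 10. The toy data and function names are ours, purely for illustration.

```python
# A small numerical check of Proposition 1: F1(p) = 2·J(p) / (1 + J(p)).
# Extents are plain Python sets here; data and names are illustrative.
def f1_score(pos, ext):
    prec = len(pos & ext) / len(ext)
    rec = len(pos & ext) / len(pos)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def jaccard(pos, ext):
    return len(pos & ext) / len(pos | ext)

pos, ext = {0, 1, 2, 3}, {2, 3, 4}
f1, j = f1_score(pos, ext), jaccard(pos, ext)
assert abs(f1 - 2 * j / (1 + j)) < 1e-12    # the comonotonicity identity

tp = len(pos & ext)                          # 2
tn = 10 - len(pos | ext)                     # 5, for |G| = 10
assert abs(j - tp / (10 - tn)) < 1e-12       # Eq. 10: J = tp / (|G| - tn)
```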
3.2. The effect of logical operations on the Jaccard score
This subsection discusses how the conjunction and disjunction operations affect the Jaccard score. That is, given the Jaccard scores of two premises 𝑝, 𝑞 ∈ ℙ, can we expect the premises 𝑝 ∧ 𝑞 and 𝑝 ∨ 𝑞 to have higher or lower Jaccard score values?

First, let us express the true positives and true negatives of premises constructed by conjunction and disjunction through the true positives and true negatives of the original premises:

$$TP_{p \wedge q} = G_+ \cap (p' \cap q') = TP_p \cap TP_q \qquad TP_{p \vee q} = G_+ \cap (p' \cup q') = TP_p \cup TP_q$$
$$TN_{p \wedge q} = G_- \setminus (p' \cap q') = TN_p \cup TN_q \qquad TN_{p \vee q} = G_- \setminus (p' \cup q') = TN_p \cap TN_q \quad (13)$$

Thus, the conjunction operation shrinks the set of true positives while expanding the set of true negatives. Conversely, the disjunction operation expands the set of true positives while shrinking the set of true negatives.

Now we derive the infimum and the supremum of the cardinalities of the true positives and true negatives. To do so, we incorporate two observations based on Equation 13. First, we use the cardinality restrictions of the set intersection and union operations. Second, since the number of true positives 𝑡𝑝 is limited by the number of positive objects 𝑦₊, for two premises 𝑝, 𝑞 ∈ ℙ whose sum 𝑡𝑝ₚ + 𝑡𝑝_𝑞 exceeds 𝑦₊, the corresponding sets of true positives must share at least 𝑡𝑝ₚ + 𝑡𝑝_𝑞 − 𝑦₊ objects (the analogous statement holds for the number of true negatives 𝑡𝑛 and the number of negative objects 𝑦₋):

$$\max(tp_p + tp_q - y_+,\, 0) \le tp_{p \wedge q} \le \min(tp_p, tp_q) \qquad \max(tn_p, tn_q) \le tn_{p \wedge q} \le \min(tn_p + tn_q,\, y_-)$$
$$\max(tp_p, tp_q) \le tp_{p \vee q} \le \min(tp_p + tp_q,\, y_+) \qquad \max(tn_p + tn_q - y_-,\, 0) \le tn_{p \vee q} \le \min(tn_p, tn_q) \quad (14)$$

Thus, the Jaccard score bounds are:

$$\frac{\max(tp_p + tp_q - y_+,\, 0)}{|G| - \max(tn_p, tn_q)} \le J(p \wedge q) \le \frac{\min(tp_p, tp_q)}{|G| - \min(tn_p + tn_q,\, y_-)}$$
$$\frac{\max(tp_p, tp_q)}{|G| - \max(tn_p + tn_q - y_-,\, 0)} \le J(p \vee q) \le \frac{\min(tp_p + tp_q,\, y_+)}{|G| - \min(tn_p, tn_q)} \quad (15)$$

Figure 1 visualizes these bounds on the TruePositive–TrueNegative plane for an abstract formal context with 40% of the objects being positive. Maximizing the Jaccard score, shown by the increasing grey colour gradient, requires us to reach the upper-right corner of the plane. However, the conjunction and disjunction operations over premises 𝑝, 𝑞 ∈ ℙ can only "move" the premises towards the upper-left (𝑝 ∨ 𝑞) and lower-right (𝑝 ∧ 𝑞) corners.

[Figure 1: The Jaccard score bounds for conjunction and disjunction operations on abstract premises 𝑝, 𝑞 of an abstract formal context. Axes: TrueNegative (horizontal) and TruePositive (vertical); the grey colour intensity represents the value of the Jaccard score at each point of the plane.]

The derived bounds for the conjunction and disjunction operations are loose: they do not say whether the newly formed premise will have a higher or a lower Jaccard score. A more precise estimate is only possible by intersecting the extents of the premises 𝑝, 𝑞. In that case, however, one can immediately compute the resulting Jaccard score itself rather than an estimate of it.
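To make the bound arithmetic of Equation 15 concrete, here is a small sketch where each premise is summarised by its (𝑡𝑝, 𝑡𝑛) pair. The function names, interface, and numbers are our own illustration, not the paper's code.

```python
# A sketch of the Jaccard bounds of Eq. 15, with each premise summarised
# by its (tp, tn) pair; names and interface are illustrative.
def jaccard_conj_bounds(tp_p, tn_p, tp_q, tn_q, y_pos, y_neg):
    n = y_pos + y_neg                  # |G|
    lo = max(tp_p + tp_q - y_pos, 0) / (n - max(tn_p, tn_q))
    hi = min(tp_p, tp_q) / (n - min(tn_p + tn_q, y_neg))
    return lo, hi

def jaccard_disj_bounds(tp_p, tn_p, tp_q, tn_q, y_pos, y_neg):
    n = y_pos + y_neg
    lo = max(tp_p, tp_q) / (n - max(tn_p + tn_q - y_neg, 0))
    hi = min(tp_p + tp_q, y_pos) / (n - min(tn_p, tn_q))
    return lo, hi

# Two abstract premises in a context with 40 positive, 60 negative objects.
print(jaccard_conj_bounds(25, 35, 20, 40, 40, 60))   # ≈ (0.083, 0.500)
print(jaccard_disj_bounds(25, 35, 20, 40, 40, 60))   # ≈ (0.294, 0.615)
```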
Proposition 2. Given a threshold 𝜃 ∈ ℝ and a premise 𝑝 ∈ ℙ, one can identify whether the Jaccard score of any premise 𝑝 ∧ 𝑞, 𝑝 ∨ 𝑞 ∈ ℙ will not exceed the threshold 𝜃 using the following implications:

$$tp_p \le y_+ \theta \implies J(p \wedge q) \le \theta,\ \forall q \in \mathbb{P}$$
$$tn_p \le |G| - \frac{y_+}{\theta} \implies J(p \vee q) \le \theta,\ \forall q \in \mathbb{P} \quad (16)$$

Proof. Let us relax the upper bounds of Equation 15 so that they depend only on the true positives and true negatives of the premise 𝑝 ∈ ℙ:

$$J(p \wedge q) \le \frac{\min(tp_p, tp_q)}{|G| - \min(tn_p + tn_q,\, y_-)} \le \frac{tp_p}{|G| - y_-} = \frac{tp_p}{y_+} \quad (17)$$

$$J(p \vee q) \le \frac{\min(tp_p + tp_q,\, y_+)}{|G| - \min(tn_p, tn_q)} \le \frac{y_+}{|G| - tn_p} \quad (18)$$

Now we compare the obtained fractions with the threshold 𝜃:

$$\frac{tp_p}{y_+} \le \theta \iff tp_p \le y_+ \theta, \qquad \frac{y_+}{|G| - tn_p} \le \theta \iff tn_p \le |G| - \frac{y_+}{\theta} \quad (19)$$

Thus we arrive at the implications stated in the proposition:

$$tp_p \le y_+ \theta \implies J(p \wedge q) \le \theta,\ \forall q \in \mathbb{P} \quad (20)$$
$$tn_p \le |G| - \frac{y_+}{\theta} \implies J(p \vee q) \le \theta,\ \forall q \in \mathbb{P} \quad (21)$$
∎

4. Engineering to maximize the speed of comparisons
The mathematical techniques above help to minimize the number of rules to compare. However, the number of such rules is still high. This section concentrates on engineering techniques that make each comparison faster.

4.1. Extents and bit arrays
In this subsection we use two important facts about extents: (i) different premises may correspond to the same extent, i.e., ∃𝑝, 𝑞 ∈ ℙ : 𝑝 ≠ 𝑞, 𝑝′ = 𝑞′; and (ii) any prediction quality measure of a premise 𝑝 relies only on the premise extent 𝑝′ (see Eq. 5).

Different premises 𝑝, 𝑞 ∈ ℙ, 𝑝 ≠ 𝑞 may correspond to the same extent 𝑝′ = 𝑞′ ⊆ 𝐺 for many reasons. First, the premises can be logically equivalent: e.g., $(\overline{p \wedge q})' = (\bar{p} \vee \bar{q})'$ by De Morgan's laws. Second, if one premise is less general than another, 𝑝′ ⊂ 𝑞′, then their conjunction corresponds to the extent 𝑝′ and their disjunction to the extent 𝑞′. Lastly, different premises may correspond to the same extent due to the specific characteristics of the formal context 𝐾: e.g., it may happen that the extent of the conjunction of two premises 𝑝, 𝑞 ∈ ℙ equals the extent of a third premise 𝑟 ∈ ℙ ∖ {𝑝, 𝑞}: (𝑝 ∧ 𝑞)′ = 𝑟′.

Since the computation of prediction quality measures relies on the extents of premises, the many different premises corresponding to the same extent all have the same prediction quality (on the training context 𝐾). Thus, we propose to search for the optimal extent instead of the optimal premise. To formalize this idea, let us define the sets of extents 𝐸ₙ:

$$E_1 = \{p' \mid p \in P_1\} = \bigcup_{m \in M} \{m', (\bar{m})'\}$$
$$E_n = \bigcup_{i=1}^{\lfloor n/2 \rfloor} \{a \cap b,\ a \cup b \mid a \in E_i,\ b \in E_{n-i}\} \setminus \bigcup_{i=1}^{n-1} E_i, \quad n > 1 \quad (22)$$

The proposed definition of the sets of extents ensures that any extent 𝑒 ∈ 𝐸ₙ is generated by a premise of size at least 𝑛:

$$\forall n, i \in \mathbb{N},\ e \in E_n,\ p \in P_i :\ e = p' \implies n \le i \quad (23)$$

Thus the optimization problem becomes the following:

$$e^* = \operatorname*{arg\,max}_{e \in \bigcup_{n=1}^{3} E_n} F1(e) \quad (24)$$

where the F1 score function is slightly modified to take an extent rather than a premise as its argument.

Last but not least, an extent 𝑒, being a subset of the objects 𝐺, can be represented and stored in a computer as a bit mask (a tuple of bits) of length |𝐺|, where each bit indicates whether the corresponding object is in the extent or not. Conjunction and disjunction become bitwise operations on bit masks, which are among the most efficient operations performed by modern binary computers. So the use of extents instead of premises not only reduces the number of comparisons, it also greatly accelerates each comparison.
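The following sketch illustrates this representation using plain Python integers as bit masks. The paper does not prescribe a particular implementation, so all details here are assumptions of ours.

```python
# A sketch of extents as bit masks, using Python integers (one bit per
# object). The paper does not fix an implementation; these details are ours.
n_objects = 8
FULL = (1 << n_objects) - 1            # the extent covering the whole set G

ext_p = 0b00001111                     # p' : objects 0..3
ext_q = 0b00111100                     # q' : objects 2..5

conj = ext_p & ext_q                   # (p ∧ q)' = p' ∩ q'
disj = ext_p | ext_q                   # (p ∨ q)' = p' ∪ q'
neg = FULL & ~ext_p                    # (¬p)'    = G \ p'

pos = 0b01010101                       # bit mask of the positive objects G+
tp = bin(pos & conj).count("1")        # |G+ ∩ (p ∧ q)'|
tn = bin(FULL & ~(pos | conj)).count("1")   # |G- \ (p ∧ q)'|
print(tp, tn)                          # 1 3
```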
4.2. Array operations
This subsection describes a trick that is well known among data science practitioners; we cover it here for completeness. The inequalities of Proposition 2 allow us to skip the processing of many conjunctions 𝑝 ∧ 𝑞 and disjunctions 𝑝 ∨ 𝑞 based on the characteristics of the initial premises 𝑝, 𝑞 ∈ ℙ and a prediction quality threshold 𝜃. However, these characteristics still have to be computed. To compute these numerical characteristics efficiently, we use the NumPy [13] package for Python, which is specifically designed to process large volumes of numerical data through compiled code.

5. The proposed algorithm
Here is the pseudo-code of the algorithm for finding the best 𝑘 premises of size no larger than 3 (a simplified code sketch follows below):
• Step 1. Find all extents of size 1 whose conjunctions or disjunctions may still exceed the quality threshold:
  1. Compute all extents 𝐸₁;
  2. Set the quality threshold 𝜃 to the minimal quality of the 𝑘 best extents from 𝐸₁;
  3. Filter out the extents from 𝐸₁ that satisfy both inequalities of Prop. 2;
• Step 2. Find all extents of size 2 whose conjunctions or disjunctions may still exceed the quality threshold:
  1. Compute all extents 𝐸₂ based on the filtered set 𝐸₁, keeping, for each extent 𝑒 ∈ 𝐸₂, the pair of extents 𝑎, 𝑏 ∈ 𝐸₁ and the operation (∧, ∨) used to construct it;
  2. Set the quality threshold 𝜃 to the minimal quality of the 𝑘 best extents from 𝐸₁ ∪ 𝐸₂;
  3. Filter out the extents from 𝐸₁, 𝐸₂ that satisfy both inequalities of Prop. 2;
• Step 3. Find only the best extents of size 3. For each pair of extents 𝑒₁, 𝑒₂ from the filtered 𝐸₁, 𝐸₂:
  1. If either of the extents 𝑒₁, 𝑒₂ satisfies both inequalities of Prop. 2, proceed to the next pair; otherwise:
  2. Compute and measure the prediction quality of the conjunction 𝑒₁ ∩ 𝑒₂;
  3. Compute and measure the prediction quality of the disjunction 𝑒₁ ∪ 𝑒₂;
  4. Update 𝜃 if needed;
• Step 4. Reconstruct the premises corresponding to the best 𝑘 extents using the kept information about extents and operations.

The time complexity of this algorithm is 𝑂(|𝑀|³), where 𝑀 is the set of attributes of the formal context 𝐾. Asymptotically, this is the same time complexity as that of the brute-force algorithm testing all premises of size at most three. However, the use of extents, together with the reduced number of combinations, minimizes the practical running time of the algorithm.
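Below is a compact, illustrative Python sketch of Steps 1–3 for bit-mask extents (cf. Section 4.1). It simplifies the paper's algorithm: premise bookkeeping (Step 4) and NumPy vectorization are omitted, and all function names are ours, not the authors' implementation.

```python
import heapq
from itertools import combinations

def jaccard(ext, pos):
    """J(e) for bit-mask extents: |G+ ∩ e| / |G+ ∪ e|."""
    union = bin(pos | ext).count("1")
    return bin(pos & ext).count("1") / union if union else 0.0

def may_beat(ext, theta, pos, n, y_pos):
    """Prop. 2 pruning: False iff no conjunction and no disjunction
    with `ext` can reach a Jaccard score above `theta`."""
    if theta == 0:
        return True
    tp = bin(pos & ext).count("1")
    tn = n - bin(pos | ext).count("1")
    return tp > y_pos * theta or tn > n - y_pos / theta

def best_extents(attr_extents, pos, n, k=10):
    full = (1 << n) - 1
    y_pos = bin(pos).count("1")
    # Step 1: size-1 extents (attribute extents and their negations).
    E1 = list(set(attr_extents) | {full & ~e for e in attr_extents})
    top = heapq.nlargest(k, ((jaccard(e, pos), e) for e in E1))
    theta = top[-1][0] if len(top) >= k else 0.0
    E1 = [e for e in E1 if may_beat(e, theta, pos, n, y_pos)]
    # Step 2: size-2 extents from the filtered E1; update theta, re-filter.
    E2 = list({op for a, b in combinations(E1, 2)
               for op in (a & b, a | b)} - set(E1))
    top = heapq.nlargest(k, top + [(jaccard(e, pos), e) for e in E2])
    theta = top[-1][0] if len(top) >= k else 0.0
    E1 = [e for e in E1 if may_beat(e, theta, pos, n, y_pos)]
    E2 = [e for e in E2 if may_beat(e, theta, pos, n, y_pos)]
    # Step 3: score size-3 combinations on the fly, keeping only the top k.
    for a in E1:
        if not may_beat(a, theta, pos, n, y_pos):
            continue
        for b in E2:
            if not may_beat(b, theta, pos, n, y_pos):
                continue
            for e in (a & b, a | b):
                j = jaccard(e, pos)
                if j > theta:
                    top = heapq.nlargest(k, top + [(j, e)])
                    theta = top[-1][0] if len(top) >= k else 0.0
    return top  # list of (jaccard, extent) pairs, best first
```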
6. Experiments
This section applies the proposed algorithm in practice. First, we study the statistics of the number of comparisons the algorithm has to make. Second, we roughly compare the prediction quality of short models with that of black-box models to show that there are cases where the former perform as efficiently as the latter.

The algorithm is run on the real-world Myocard dataset [14] from the UCI repository. The dataset contains 1700 objects and 124 attributes. The task behind the Myocard dataset is to predict whether a hospital patient will or will not have chronic heart failure, based on the patient's data. We did not succeed in running the algorithm in reasonable time (i.e., hours to days) on bigger datasets due to the combinatorial explosion. However, we consider the Myocard dataset big enough to test the algorithm.

Table 1: Statistics for algorithm iterations

premise size             1          2           3
# premises               1752       3.07e+06    1.08e+10
# ext. combinations      1644       2.53e+06    1.49e+09
# ext. combs. to test    1644       2.53e+06    1.17e+09
# new extents            1644       9.44e+05    9.55e+08
# extents to keep        1644       9.44e+05    259
computation time         7.08 ms    6.55 s      1.06 h

6.1. Number of comparisons
Table 1 shows that the use of the bounds from Prop. 2 does not filter out many extent combinations. For example, for premise size 3, the bounds filter out only 21.5% of the extent combinations (from 1.49e+09 down to 1.17e+09). Although such a percentage is better than nothing, it still leaves 1.17 billion extent combinations to test. However, the use of bit arrays allows us to test these 1.17 billion extent combinations in only about an hour on a laptop with 8 GB of RAM.

Legend for Table 1:
• # premises: the number of premises of a given size; this is the combinatorially computed maximal number of iterations of the algorithm;
• # ext. combinations: the number of extent combinations resulting in a premise of a given size;
• # ext. combs. to test: the number of extent combinations that can still result in a good prediction quality (filtered by Prop. 2);
• # new extents: the number of newly generated extents resulting from testing the extent combinations;
• # extents to keep: the number of extents kept in memory; for sizes 1 and 2 we keep all the extents, for size 3 we keep only the extents with high prediction quality;
• computation time: the time it took to process all combinations for a given premise size.

6.2. Prediction quality of short rules
The simplicity of short rule models allows us to fully describe some of the obtained models in the paper. The following list provides the best short models (w.r.t. the F1 score, defined in the standard way for binary classification) obtained on the Myocard dataset:
• Premise size 1: F1 score = 0.401426. Premise: there is data on the use of painkillers in the intensive care unit on the third day of the hospital period.
• Premise size 2: F1 score = 0.448819. Premise: (had a chronic heart failure) OR (has diabetes mellitus in the anamnesis).
• Premise size 3: F1 score = 0.473786. Premise: (has data on the use of opioid drugs in the intensive care unit on the third day of the hospital period) AND ((had chronic heart failure) OR (age ≥ 66)).
• XGBoost model: F1 score = 0.464000. The model contains 100 decision trees of max depth 6.
• CatBoost model: F1 score = 0.434783. The model contains 1000 decision trees of depth 6.

We can also show all the obtained short rule models in the TruePositive–TrueNegative space.

[Figure 2: Short models' prediction quality in the TruePositive–TrueNegative space. The prediction quality of the default XGBoost and CatBoost models is shown as a reference.]

Figure 2 shows the prediction quality of the obtained short models on the TruePositive–TrueNegative scale. It should be noted that the points corresponding to the complex black-box gradient boosting models are not far from those of the short models. Thus, it is not reasonable to use complex black boxes on the Myocard dataset, since simple short models offer the same prediction quality.

7. Conclusion
In this paper we have presented some preliminary results on finding an optimal rule with an antecedent consisting of no more than three binary attributes. We reformulated the F1 score optimization task in terms of true positive and true negative predictions. We computed upper and lower bounds on the Jaccard score for premises obtained with conjunction and disjunction operations. We also covered an FCA-inspired technique of iterating over the extents of premises in order to minimize the computation time. In future work we plan to develop sharper lower and upper bounds on the Jaccard score for premises constructed with conjunction and disjunction operations.
We also plan to consider other logical operations that could increase the prediction quality of the rules while keeping the number of attributes used the same.

Acknowledgments
The article was prepared within the framework of the Basic Research Program at HSE University, RF.

References
[1] J. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics 29 (2001) 1189–1232. doi:10.2307/2699986.
[2] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32. doi:10.1023/A:1010933404324.
[3] C. Molnar, Interpretable Machine Learning, 2019. https://christophm.github.io/interpretable-ml-book/.
[4] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (2020) 82–115.
[5] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1 (2019) 206–215. doi:10.1038/s42256-019-0048-x.
[6] O. Pianykh, E. Dudyrev, A. Sharp, G. Gusev, S. O. Kuznetsov, I. Semenkov, Human knowledge models: Learning applied knowledge from the data, 2022. Unpublished.
[7] B. Ganter, R. Wille, Formal Concept Analysis, Springer, Berlin, 1999.
[8] M. M. Bongard, The Recognition Problem, Technical Report, Foreign Technology Division, Wright-Patterson AFB, Ohio, 1968.
[9] R. S. Michalski, Discovering classification rules using variable-valued logic system VL1, 1973.
[10] J. H. Friedman, B. E. Popescu, Predictive learning via rule ensembles, The Annals of Applied Statistics 2 (2008) 916–954.
[11] N. Cowan, The magical number 4 in short-term memory: A reconsideration of mental storage capacity, Behavioral and Brain Sciences 24 (2001) 87–114.
[12] G. S. Halford, R. Baker, J. E. McCredden, J. D. Bain, How many variables can humans process?, Psychological Science 16 (2005) 70–76.
[13] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, et al., Array programming with NumPy, Nature 585 (2020) 357–362. doi:10.1038/s41586-020-2649-2.
[14] S. E. Golovenkin, J. Bac, A. Chervov, E. M. Mirkes, Y. V. Orlova, E. Barillot, A. N. Gorban, A. Zinovyev, Trajectories, bifurcations, and pseudo-time in large clinical datasets: Applications to myocardial infarction and diabetes data, GigaScience 9 (2020) giaa128.