Towards Fast Finding Optimal Short Classifiers

Egor Dudyrev, Sergei O. Kuznetsov
HSE University, 20 Myasnitskaya St, Moscow, 101000, Russian Federation
eo.dudyrev@hse.ru (E. Dudyrev); skuznetsov@hse.ru (S. O. Kuznetsov)

Abstract
Studies on Explainable Artificial Intelligence show that a model should be small in order to be human-understandable. The restriction on the size of a model drastically reduces the space of possible solutions. Nevertheless, many rule learning models still rely on greedy algorithms for generating ensembles of decision trees. This paper discusses FCA-inspired mathematical and engineering techniques to efficiently find optimal short binary classifiers, i.e., classifiers that consist of no more than three binary attributes and are optimal w.r.t. F1 score.

Keywords
Supervised Machine Learning, Explainable Artificial Intelligence, Formal Concept Analysis

Published in Sergei O. Kuznetsov, Amedeo Napoli, Sebastian Rudolph (Eds.): The 10th International Workshop "What can FCA do for Artificial Intelligence?", FCA4AI 2022, co-located with IJCAI-ECAI 2022, July 23, 2022, Vienna, Austria, Proceedings, pp. 23–34.

1. Introduction
Studies on Explainable Artificial Intelligence show that a model should be small in order to be human-understandable. Modern learning models such as Gradient Boosting over Decision Trees [1] or Random Forest [2] are universally recognized as mainstream state-of-the-art solutions for binary classification tasks. However, such models are too complex to be analyzed and to be used in trust-requiring scenarios, even with the help of XAI [3, 4]. This is a reason for the rising tendency of rejecting complex black-box models in favor of small explainable ones [5, 6].

Besides making models highly explainable, restricting the size of a model also drastically limits the space of all possible models. Thus, one can search for a globally optimal model instead of approximating it with locally optimal ones. This paper considers models operating on at most three binary attributes. In order to find an optimal model meeting these conditions, one should iterate through all combinations of up to three of the given binary attributes. However, such a brute-force algorithm suffers from combinatorial explosion as the number of attributes grows. This paper develops both mathematical and engineering techniques, based on Formal Concept Analysis (FCA) [7], that make this brute-force algorithm more efficient.

The challenge of constructing simple logical models has been addressed in many areas of previous research. In the early 1960s, work on formal logic led to the inception of logic programming and rule-learning algorithms [8, 9]. The latter, including algorithms such as Skope-Rules and RuleFit [10], often rely on greedy approaches to extract short rules from more complex models (such as large decision trees). Contrary to these greedy approaches, which result in locally optimal models, this paper tackles the problem of finding an optimal model of a given size.

Another reason to develop and use short models is that they make a good benchmark for big black-box models.
Indeed, if a short model achieves the same prediction quality as a big black box, there is no reason to use the latter.

Extensive research shows that a human can operate with no more than three plus or minus one ideas in their head simultaneously [11, 12]. In this paper we concentrate on finding rules whose premises contain no more than three attributes, for two reasons. First, since this is our first approach to the problem, we should try to solve its easiest version. Second, the choice of the number three is justified by cognitive studies. Thus, short rules consisting of no more than three attributes represent a specific class of explainable machine learning models.

The structure of the paper is as follows. Section 2 presents the theoretical background used throughout the paper. Section 3 describes the mathematical techniques for optimizing the algorithm by minimizing the number of required operations. Complementing this, Section 4 discusses the engineering approaches that optimize the algorithm by maximizing the computation speed. Section 5 merges all the discussed techniques together into one algorithm, and Section 6 presents the experimental results of this algorithm. Finally, Section 7 concludes the paper.

2. Theoretical Background
This section introduces the definitions we use throughout the paper. First, we provide the basic terms of Formal Concept Analysis used to describe rule models. Second, we describe the space of premises that contains the machine learning models discussed in the paper. Third, we describe the basics of binary classification in FCA notation.

2.1. Formal Concept Analysis
A formal context 𝐾 describes the dataset to build the model on. It is a triple 𝐾 = (𝐺, 𝑀, 𝐼) where 𝐺 is a set of objects (rows in a dataset), 𝑀 is a set of attributes (columns in a dataset), and 𝐼 ⊆ 𝐺 × 𝑀 is a relation between objects and attributes. The prime (′) operators match a subset of objects 𝐴 ⊆ 𝐺 with the subset of attributes 𝐵 ⊆ 𝑀 such that the objects from 𝐴 are "described" by the attributes from 𝐵, and vice versa:

$$A' = \{m \in M \mid \forall g \in A : g I m\}, \qquad B' = \{g \in G \mid \forall m \in B : g I m\} \quad (1)$$

Given a subset of attributes 𝐵 ⊆ 𝑀, the subset of objects 𝐴 = 𝐵′ is called the extent of 𝐵. Dually, the subset of attributes 𝐵 = 𝐴′ is called the intent of 𝐴.

2.2. Premises
In this subsection we define a premise as a combination of attributes of a formal context joined by conjunction, disjunction, and negation operations. The notion of a premise is necessary for the following subsections.

Definition 1. The premise space ℙ is the set of all combinations of attributes from 𝑀 constructed with conjunction ∧, disjunction ∨, and negation operations, i.e., ℙ is a set such that:

$$1)\ M \subset \mathbb{P}; \qquad 2)\ \forall p, q \in \mathbb{P} :\ p \wedge q,\ p \vee q,\ \bar{p} \in \mathbb{P} \quad (2)$$

Each premise 𝑝 ∈ ℙ corresponds to a subset of objects 𝑝′ ⊆ 𝐺 called the extent of the premise. Extents of conjunctions, disjunctions, and negations are defined as follows:

$$(p \wedge q)' = p' \cap q', \qquad (p \vee q)' = p' \cup q', \qquad (\bar{p})' = G \setminus p' \quad (3)$$

This paper is specifically interested in premises consisting of no more than three attributes. Generally, the set of premises 𝑃ₙ constructed from 𝑛 attributes can be described in the following way:

$$P_1 = M \cup \{\bar{m} \mid m \in M\}$$
$$P_n = \bigcup_{i=1}^{\lfloor n/2 \rfloor} \{p \wedge q,\ p \vee q,\ \overline{p \wedge q},\ \overline{p \vee q} \mid p \in P_i,\ q \in P_{n-i}\}, \quad n > 1 \quad (4)$$

Uniting the sets of premises 𝑃ₙ for each natural number 𝑛, we obtain the premise space ℙ = ⋃ₙ∈ℕ 𝑃ₙ. In what follows, we say that a premise 𝑝 ∈ ℙ has size 𝑛 if it belongs to 𝑃ₙ.
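To make these definitions concrete, the following minimal sketch implements the prime operators of Equation 1 and the extent operations of Equation 3 on a toy context. The context and all names are illustrative; they are not part of the paper's implementation.

```python
# A minimal sketch of the prime operators (Eq. 1) and extent operations
# (Eq. 3) on a toy formal context; the data and names are illustrative.
G = {"g1", "g2", "g3", "g4"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g3", "b"), ("g3", "c")}

def extent(B):
    """B' : all objects that have every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

def intent(A):
    """A' : all attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

# Extent arithmetic of Eq. 3: conjunction, disjunction, negation.
ext_a, ext_b = extent({"a"}), extent({"b"})
conj = ext_a & ext_b            # (a ∧ b)' = a' ∩ b'
disj = ext_a | ext_b            # (a ∨ b)' = a' ∪ b'
neg = G - ext_a                 # (¬a)'    = G \ a'

print(extent({"b"}))            # {'g1', 'g2', 'g3'} (set order may vary)
print(intent({"g1", "g2"}))     # {'b'}
```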
2.3. Binary classification
Binary classification is a machine learning task in which a model is asked to predict whether an object 𝑔 belongs to the "positive" or the "negative" class, given the object's description (a subset of attributes). The model is obtained from a provided training context (dataset) 𝐾 = (𝐺, 𝑀, 𝐼) with predefined positive objects 𝐺₊ ⊂ 𝐺 and negative objects 𝐺₋ = 𝐺 ∖ 𝐺₊. The first step in constructing a good binary classifier is to find a model, operating on the set of attributes 𝑀, that efficiently separates the positive objects 𝐺₊ from the negative objects 𝐺₋. After that, the model can be applied to a test context 𝐾ₜₑₛₜ = (𝐺ₜₑₛₜ, 𝑀, 𝐼ₜₑₛₜ) to predict its unknown positive and negative objects.

This paper studies binary classifiers of the form "if premise 𝑝 ∈ ℙ is true, then object 𝑔 is predicted positive, otherwise object 𝑔 is predicted negative". The prediction quality of a premise 𝑝 ∈ ℙ on a training dataset is measured by comparing the given sets of positive objects 𝐺₊ and negative objects 𝐺₋ with the sets of positive 𝐺ₚ⁺ and negative 𝐺ₚ⁻ objects predicted by the premise 𝑝. Note that the set 𝐺ₚ⁺ is exactly the extent of the premise 𝑝: 𝐺ₚ⁺ = 𝑝′.

$$TP_p = G_+ \cap G_p^+ = G_+ \cap p' \qquad FP_p = G_- \cap G_p^+ = G_- \cap p'$$
$$FN_p = G_+ \cap G_p^- = G_+ \setminus p' \qquad TN_p = G_- \cap G_p^- = G_- \setminus p' \quad (5)$$

For the sake of brevity, we use lowercase letters to denote the cardinalities of the so-called true positives 𝑇𝑃ₚ, false positives 𝐹𝑃ₚ, false negatives 𝐹𝑁ₚ, and true negatives 𝑇𝑁ₚ. Likewise, we use 𝑦₊, 𝑦₋ to denote the cardinalities of the set of positive objects 𝐺₊ and the set of negative objects 𝐺₋ respectively:

$$tp_p = |TP_p|, \quad fp_p = |FP_p|, \quad fn_p = |FN_p|, \quad tn_p = |TN_p|, \quad y_+ = |G_+|, \quad y_- = |G_-| \quad (6)$$

Among the most widely used quality scores for binary classification are precision (𝑝𝑟𝑒𝑐), recall (𝑟𝑒𝑐), and their harmonic mean called the F1 score (𝐹1). Their definitions are as follows:

$$prec(p) = \frac{tp_p}{|p'|}, \qquad rec(p) = \frac{tp_p}{y_+}, \qquad F1(p) = 2\,\frac{prec(p) \cdot rec(p)}{prec(p) + rec(p)} \quad (7)$$

This paper focuses on finding the premise 𝑝* of size no larger than 3 having the maximal F1 score:

$$p^* = \operatorname*{arg\,max}_{p \in \bigcup_{n=1}^{3} P_n} F1(p) \quad (8)$$

The presented techniques can be easily adjusted to other quality measures.

3. Minimizing the number of comparisons
3.1. F1 score optimization
The definition of the F1 score as a harmonic mean of precision and recall makes the possible optimization strategies obscure. This subsection reduces the task of maximizing the F1 score to maximizing the numbers of true positives and true negatives.

Proposition 1. The F1 score 𝐹1(𝑝) is comonotonic to the Jaccard score 𝐽(𝑝) (denoted by ∼), where the latter is the Jaccard similarity coefficient between the set of positive objects 𝐺₊ and the set of objects predicted positive 𝐺ₚ⁺ = 𝑝′:

$$F1(p) \sim J(p) = \frac{|G_+ \cap p'|}{|G_+ \cup p'|} \quad (9)$$

Proof. Let us express the Jaccard score in terms of the true positives 𝑡𝑝ₚ and true negatives 𝑡𝑛ₚ. Since 𝐺₊ ∪ 𝑝′ = 𝐺 ∖ 𝑇𝑁ₚ, we have |𝐺₊ ∪ 𝑝′| = |𝐺| − 𝑡𝑛ₚ, so

$$J(p) = \frac{|G_+ \cap p'|}{|G_+ \cup p'|} = \frac{tp_p}{|G| - tn_p} \quad (10)$$

Now we express the F1 score in terms of 𝑡𝑝ₚ and 𝑡𝑛ₚ as well:

$$F1(p) = 2\,\frac{prec(p) \cdot rec(p)}{prec(p) + rec(p)} = \frac{2\,tp_p}{y_+ + |p'|} = \frac{2\,tp_p}{(|G| - fp_p - tn_p) + (tp_p + fp_p)} = \frac{2\,tp_p}{|G| + tp_p - tn_p} = \frac{2}{1 + \frac{|G| - tn_p}{tp_p}} = \frac{2}{1 + \frac{1}{J(p)}} \quad (11)$$

Since 𝐹1(𝑝) = 2𝐽(𝑝)/(1 + 𝐽(𝑝)) is a monotonically increasing function of 𝐽(𝑝), we obtain the relation 𝐹1(𝑝) ∼ 𝐽(𝑝). ∎

Because the F1 score is monotonic with respect to the Jaccard score, the F1 optimization problem can be viewed as the problem of optimizing the fraction 𝑡𝑝ₚ/(|𝐺| − 𝑡𝑛ₚ), i.e., maximizing the numbers of true positives and true negatives:

$$\operatorname*{arg\,max}_{p \in \mathbb{P}} F1(p) = \operatorname*{arg\,max}_{p \in \mathbb{P}} J(p) = \operatorname*{arg\,max}_{p \in \mathbb{P}} \frac{tp_p}{|G| - tn_p} \quad (12)$$
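As a quick numerical check of this reduction, the following sketch computes both scores from set-valued extents and verifies the identity 𝐹1 = 2𝐽/(1 + 𝐽) together with Equation 10. The toy data and function names are ours, purely for illustration.

```python
# A small numerical check of Proposition 1: F1(p) = 2·J(p) / (1 + J(p)).
# Extents are plain Python sets here; data and names are illustrative.
def f1_score(pos, ext):
    prec = len(pos & ext) / len(ext)
    rec = len(pos & ext) / len(pos)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def jaccard(pos, ext):
    return len(pos & ext) / len(pos | ext)

pos, ext = {0, 1, 2, 3}, {2, 3, 4}
f1, j = f1_score(pos, ext), jaccard(pos, ext)
assert abs(f1 - 2 * j / (1 + j)) < 1e-12    # the comonotonicity identity

tp = len(pos & ext)                          # 2
tn = 10 - len(pos | ext)                     # 5, for |G| = 10
assert abs(j - tp / (10 - tn)) < 1e-12       # Eq. 10: J = tp / (|G| - tn)
```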
3.2. The effect of logical operations on the Jaccard score
This subsection discusses how the conjunction and disjunction operations affect the Jaccard score. That is, given the Jaccard scores of two premises 𝑝, 𝑞 ∈ ℙ, can we expect the premises 𝑝 ∧ 𝑞 and 𝑝 ∨ 𝑞 to have higher or lower Jaccard score values?

First, let us express the true positives and true negatives of premises constructed by conjunction and disjunction through the true positives and true negatives of the original premises:

$$TP_{p \wedge q} = G_+ \cap (p' \cap q') = TP_p \cap TP_q \qquad TP_{p \vee q} = G_+ \cap (p' \cup q') = TP_p \cup TP_q$$
$$TN_{p \wedge q} = G_- \setminus (p' \cap q') = TN_p \cup TN_q \qquad TN_{p \vee q} = G_- \setminus (p' \cup q') = TN_p \cap TN_q \quad (13)$$

Thus, the conjunction operation shrinks the set of true positives while expanding the set of true negatives. Conversely, the disjunction operation expands the set of true positives while shrinking the set of true negatives.

Now we derive the infimum and the supremum of the cardinalities of the true positives and true negatives. To do so, we incorporate two observations based on Equation 13. First, we use the cardinality restrictions of the set intersection and union operations. Second, since the number of true positives 𝑡𝑝 is limited by the number of positive objects 𝑦₊, for two premises 𝑝, 𝑞 ∈ ℙ whose sum 𝑡𝑝ₚ + 𝑡𝑝_𝑞 exceeds 𝑦₊, the corresponding sets of true positives must share at least 𝑡𝑝ₚ + 𝑡𝑝_𝑞 − 𝑦₊ objects (the analogous statement holds for the number of true negatives 𝑡𝑛 and the number of negative objects 𝑦₋):

$$\max(tp_p + tp_q - y_+,\, 0) \le tp_{p \wedge q} \le \min(tp_p, tp_q) \qquad \max(tn_p, tn_q) \le tn_{p \wedge q} \le \min(tn_p + tn_q,\, y_-)$$
$$\max(tp_p, tp_q) \le tp_{p \vee q} \le \min(tp_p + tp_q,\, y_+) \qquad \max(tn_p + tn_q - y_-,\, 0) \le tn_{p \vee q} \le \min(tn_p, tn_q) \quad (14)$$

Thus, the Jaccard score bounds are:

$$\frac{\max(tp_p + tp_q - y_+,\, 0)}{|G| - \max(tn_p, tn_q)} \le J(p \wedge q) \le \frac{\min(tp_p, tp_q)}{|G| - \min(tn_p + tn_q,\, y_-)}$$
$$\frac{\max(tp_p, tp_q)}{|G| - \max(tn_p + tn_q - y_-,\, 0)} \le J(p \vee q) \le \frac{\min(tp_p + tp_q,\, y_+)}{|G| - \min(tn_p, tn_q)} \quad (15)$$

Figure 1 visualizes these bounds on the TruePositive–TrueNegative plane for an abstract formal context with 40% of the objects being positive. Maximizing the Jaccard score, shown by the increasing grey colour gradient, requires us to reach the upper-right corner of the plane. However, the conjunction and disjunction operations over premises 𝑝, 𝑞 ∈ ℙ can only "move" the premises towards the upper-left (𝑝 ∨ 𝑞) and lower-right (𝑝 ∧ 𝑞) corners.

[Figure 1: The Jaccard score bounds for conjunction and disjunction operations on abstract premises 𝑝, 𝑞 of an abstract formal context. Axes: TrueNegative (horizontal) and TruePositive (vertical); the grey colour intensity represents the value of the Jaccard score at each point of the plane.]

The derived bounds for the conjunction and disjunction operations are loose: they do not say whether the newly formed premise will have a higher or a lower Jaccard score. A more precise estimate is only possible by intersecting the extents of the premises 𝑝, 𝑞. In that case, however, one can immediately compute the resulting Jaccard score itself rather than an estimate of it.
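To make the bound arithmetic of Equation 15 concrete, here is a small sketch where each premise is summarised by its (𝑡𝑝, 𝑡𝑛) pair. The function names, interface, and numbers are our own illustration, not the paper's code.

```python
# A sketch of the Jaccard bounds of Eq. 15, with each premise summarised
# by its (tp, tn) pair; names and interface are illustrative.
def jaccard_conj_bounds(tp_p, tn_p, tp_q, tn_q, y_pos, y_neg):
    n = y_pos + y_neg                  # |G|
    lo = max(tp_p + tp_q - y_pos, 0) / (n - max(tn_p, tn_q))
    hi = min(tp_p, tp_q) / (n - min(tn_p + tn_q, y_neg))
    return lo, hi

def jaccard_disj_bounds(tp_p, tn_p, tp_q, tn_q, y_pos, y_neg):
    n = y_pos + y_neg
    lo = max(tp_p, tp_q) / (n - max(tn_p + tn_q - y_neg, 0))
    hi = min(tp_p + tp_q, y_pos) / (n - min(tn_p, tn_q))
    return lo, hi

# Two abstract premises in a context with 40 positive, 60 negative objects.
print(jaccard_conj_bounds(25, 35, 20, 40, 40, 60))   # ≈ (0.083, 0.500)
print(jaccard_disj_bounds(25, 35, 20, 40, 40, 60))   # ≈ (0.294, 0.615)
```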
Proposition 2. Given a threshold 𝜃 ∈ ℝ and a premise 𝑝 ∈ ℙ, one can identify whether the Jaccard score of any premise 𝑝 ∧ 𝑞, 𝑝 ∨ 𝑞 ∈ ℙ will not exceed the threshold 𝜃 using the following implications:

$$tp_p \le y_+ \theta \implies J(p \wedge q) \le \theta,\ \forall q \in \mathbb{P}$$
$$tn_p \le |G| - \frac{y_+}{\theta} \implies J(p \vee q) \le \theta,\ \forall q \in \mathbb{P} \quad (16)$$

Proof. Let us relax the upper bounds of Equation 15 so that they depend only on the true positives and true negatives of the premise 𝑝 ∈ ℙ:

$$J(p \wedge q) \le \frac{\min(tp_p, tp_q)}{|G| - \min(tn_p + tn_q,\, y_-)} \le \frac{tp_p}{|G| - y_-} = \frac{tp_p}{y_+} \quad (17)$$

$$J(p \vee q) \le \frac{\min(tp_p + tp_q,\, y_+)}{|G| - \min(tn_p, tn_q)} \le \frac{y_+}{|G| - tn_p} \quad (18)$$

Now we compare the obtained fractions with the threshold 𝜃:

$$\frac{tp_p}{y_+} \le \theta \iff tp_p \le y_+ \theta, \qquad \frac{y_+}{|G| - tn_p} \le \theta \iff tn_p \le |G| - \frac{y_+}{\theta} \quad (19)$$

Thus we arrive at the implications stated in the proposition:

$$tp_p \le y_+ \theta \implies J(p \wedge q) \le \theta,\ \forall q \in \mathbb{P} \quad (20)$$
$$tn_p \le |G| - \frac{y_+}{\theta} \implies J(p \vee q) \le \theta,\ \forall q \in \mathbb{P} \quad (21)$$
∎

4. Engineering to maximize the speed of comparisons
The mathematical techniques above help to minimize the number of rules to compare. However, the number of such rules is still high. This section concentrates on engineering techniques that make each comparison faster.

4.1. Extents and bit arrays
In this subsection we use two important facts about extents: (i) different premises may correspond to the same extent, i.e., ∃𝑝, 𝑞 ∈ ℙ : 𝑝 ≠ 𝑞, 𝑝′ = 𝑞′; and (ii) any prediction quality measure of a premise 𝑝 relies only on the premise extent 𝑝′ (see Eq. 5).

Different premises 𝑝, 𝑞 ∈ ℙ, 𝑝 ≠ 𝑞 may correspond to the same extent 𝑝′ = 𝑞′ ⊆ 𝐺 for many reasons. First, the premises can be logically equivalent: e.g., $(\overline{p \wedge q})' = (\bar{p} \vee \bar{q})'$ by De Morgan's laws. Second, if one premise is less general than another, 𝑝′ ⊂ 𝑞′, then their conjunction corresponds to the extent 𝑝′ and their disjunction to the extent 𝑞′. Lastly, different premises may correspond to the same extent due to the specific characteristics of the formal context 𝐾: e.g., it may happen that the extent of the conjunction of two premises 𝑝, 𝑞 ∈ ℙ equals the extent of a third premise 𝑟 ∈ ℙ ∖ {𝑝, 𝑞}: (𝑝 ∧ 𝑞)′ = 𝑟′.

Since the computation of prediction quality measures relies on the extents of premises, the many different premises corresponding to the same extent all have the same prediction quality (on the training context 𝐾). Thus, we propose to search for the optimal extent instead of the optimal premise. To formalize this idea, let us define the sets of extents 𝐸ₙ:

$$E_1 = \{p' \mid p \in P_1\} = \bigcup_{m \in M} \{m', (\bar{m})'\}$$
$$E_n = \bigcup_{i=1}^{\lfloor n/2 \rfloor} \{a \cap b,\ a \cup b \mid a \in E_i,\ b \in E_{n-i}\} \setminus \bigcup_{i=1}^{n-1} E_i, \quad n > 1 \quad (22)$$

The proposed definition of the sets of extents ensures that any extent 𝑒 ∈ 𝐸ₙ is generated by a premise of size at least 𝑛:

$$\forall n, i \in \mathbb{N},\ e \in E_n,\ p \in P_i :\ e = p' \implies n \le i \quad (23)$$

Thus the optimization problem becomes the following:

$$e^* = \operatorname*{arg\,max}_{e \in \bigcup_{n=1}^{3} E_n} F1(e) \quad (24)$$

where the F1 score function is slightly modified to take an extent rather than a premise as its argument.

Last but not least, an extent 𝑒, being a subset of the objects 𝐺, can be represented and stored in a computer as a bit mask (a tuple of bits) of length |𝐺|, where each bit indicates whether the corresponding object is in the extent or not. Conjunction and disjunction become bitwise operations on bit masks, which are among the most efficient operations performed by modern binary computers. So the use of extents instead of premises not only reduces the number of comparisons, it also greatly accelerates each comparison.
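The following sketch illustrates this representation using plain Python integers as bit masks. The paper does not prescribe a particular implementation, so all details here are assumptions of ours.

```python
# A sketch of extents as bit masks, using Python integers (one bit per
# object). The paper does not fix an implementation; these details are ours.
n_objects = 8
FULL = (1 << n_objects) - 1            # the extent covering the whole set G

ext_p = 0b00001111                     # p' : objects 0..3
ext_q = 0b00111100                     # q' : objects 2..5

conj = ext_p & ext_q                   # (p ∧ q)' = p' ∩ q'
disj = ext_p | ext_q                   # (p ∨ q)' = p' ∪ q'
neg = FULL & ~ext_p                    # (¬p)'    = G \ p'

pos = 0b01010101                       # bit mask of the positive objects G+
tp = bin(pos & conj).count("1")        # |G+ ∩ (p ∧ q)'|
tn = bin(FULL & ~(pos | conj)).count("1")   # |G- \ (p ∧ q)'|
print(tp, tn)                          # 1 3
```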
4.2. Array operations
This subsection describes a trick that is well known among data science practitioners; we cover it here for completeness. The inequalities of Proposition 2 allow us to skip the processing of many conjunctions 𝑝 ∧ 𝑞 and disjunctions 𝑝 ∨ 𝑞 based on the characteristics of the initial premises 𝑝, 𝑞 ∈ ℙ and a prediction quality threshold 𝜃. However, these characteristics still have to be computed. To compute these numerical characteristics efficiently, we use the NumPy [13] package for Python, which is specifically designed to process large volumes of numerical data through compiled code.

5. The proposed algorithm
Here is the pseudo-code of the algorithm for finding the best 𝑘 premises of size no larger than 3 (a simplified code sketch follows below):
• Step 1. Find all extents of size 1 whose conjunctions or disjunctions may still exceed the quality threshold:
  1. Compute all extents 𝐸₁;
  2. Set the quality threshold 𝜃 to the minimal quality of the 𝑘 best extents from 𝐸₁;
  3. Filter out the extents from 𝐸₁ that satisfy both inequalities of Prop. 2;
• Step 2. Find all extents of size 2 whose conjunctions or disjunctions may still exceed the quality threshold:
  1. Compute all extents 𝐸₂ based on the filtered set 𝐸₁, keeping, for each extent 𝑒 ∈ 𝐸₂, the pair of extents 𝑎, 𝑏 ∈ 𝐸₁ and the operation (∧, ∨) used to construct it;
  2. Set the quality threshold 𝜃 to the minimal quality of the 𝑘 best extents from 𝐸₁ ∪ 𝐸₂;
  3. Filter out the extents from 𝐸₁, 𝐸₂ that satisfy both inequalities of Prop. 2;
• Step 3. Find only the best extents of size 3. For each pair of extents 𝑒₁, 𝑒₂ from the filtered 𝐸₁, 𝐸₂:
  1. If either of the extents 𝑒₁, 𝑒₂ satisfies both inequalities of Prop. 2, proceed to the next pair; otherwise:
  2. Compute and measure the prediction quality of the conjunction 𝑒₁ ∩ 𝑒₂;
  3. Compute and measure the prediction quality of the disjunction 𝑒₁ ∪ 𝑒₂;
  4. Update 𝜃 if needed;
• Step 4. Reconstruct the premises corresponding to the best 𝑘 extents using the kept information about extents and operations.

The time complexity of this algorithm is 𝑂(|𝑀|³), where 𝑀 is the set of attributes of the formal context 𝐾. Asymptotically, this is the same time complexity as that of the brute-force algorithm testing all premises of size at most three. However, the use of extents, together with the reduced number of combinations, minimizes the practical running time of the algorithm.
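Below is a compact, illustrative Python sketch of Steps 1–3 for bit-mask extents (cf. Section 4.1). It simplifies the paper's algorithm: premise bookkeeping (Step 4) and NumPy vectorization are omitted, and all function names are ours, not the authors' implementation.

```python
import heapq
from itertools import combinations

def jaccard(ext, pos):
    """J(e) for bit-mask extents: |G+ ∩ e| / |G+ ∪ e|."""
    union = bin(pos | ext).count("1")
    return bin(pos & ext).count("1") / union if union else 0.0

def may_beat(ext, theta, pos, n, y_pos):
    """Prop. 2 pruning: False iff no conjunction and no disjunction
    with `ext` can reach a Jaccard score above `theta`."""
    if theta == 0:
        return True
    tp = bin(pos & ext).count("1")
    tn = n - bin(pos | ext).count("1")
    return tp > y_pos * theta or tn > n - y_pos / theta

def best_extents(attr_extents, pos, n, k=10):
    full = (1 << n) - 1
    y_pos = bin(pos).count("1")
    # Step 1: size-1 extents (attribute extents and their negations).
    E1 = list(set(attr_extents) | {full & ~e for e in attr_extents})
    top = heapq.nlargest(k, ((jaccard(e, pos), e) for e in E1))
    theta = top[-1][0] if len(top) >= k else 0.0
    E1 = [e for e in E1 if may_beat(e, theta, pos, n, y_pos)]
    # Step 2: size-2 extents from the filtered E1; update theta, re-filter.
    E2 = list({op for a, b in combinations(E1, 2)
               for op in (a & b, a | b)} - set(E1))
    top = heapq.nlargest(k, top + [(jaccard(e, pos), e) for e in E2])
    theta = top[-1][0] if len(top) >= k else 0.0
    E1 = [e for e in E1 if may_beat(e, theta, pos, n, y_pos)]
    E2 = [e for e in E2 if may_beat(e, theta, pos, n, y_pos)]
    # Step 3: score size-3 combinations on the fly, keeping only the top k.
    for a in E1:
        if not may_beat(a, theta, pos, n, y_pos):
            continue
        for b in E2:
            if not may_beat(b, theta, pos, n, y_pos):
                continue
            for e in (a & b, a | b):
                j = jaccard(e, pos)
                if j > theta:
                    top = heapq.nlargest(k, top + [(j, e)])
                    theta = top[-1][0] if len(top) >= k else 0.0
    return top  # list of (jaccard, extent) pairs, best first
```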
6. Experiments
This section applies the proposed algorithm in practice. First, we study the statistics of the number of comparisons the algorithm has to make. Second, we roughly compare the prediction quality of short models with that of black-box models to show that there are cases where the former perform as efficiently as the latter.

The algorithm is run on the real-world Myocard dataset [14] from the UCI repository. The dataset contains 1700 objects and 124 attributes. The task behind the Myocard dataset is to predict whether a hospital patient will or will not have chronic heart failure, based on the patient's data. We did not succeed in running the algorithm in reasonable time (i.e., hours to days) on bigger datasets due to the combinatorial explosion. However, we consider the Myocard dataset big enough to test the algorithm.

Table 1: Statistics for algorithm iterations

premise size             1          2           3
# premises               1752       3.07e+06    1.08e+10
# ext. combinations      1644       2.53e+06    1.49e+09
# ext. combs. to test    1644       2.53e+06    1.17e+09
# new extents            1644       9.44e+05    9.55e+08
# extents to keep        1644       9.44e+05    259
computation time         7.08 ms    6.55 s      1.06 h

6.1. Number of comparisons
Table 1 shows that the use of the bounds from Prop. 2 does not filter out many extent combinations. For example, for premise size 3, the bounds filter out only 21.5% of the extent combinations (from 1.49e+09 down to 1.17e+09). Although such a percentage is better than nothing, it still leaves 1.17 billion extent combinations to test. However, the use of bit arrays allows us to test these 1.17 billion extent combinations in only about an hour on a laptop with 8 GB of RAM.

Legend for Table 1:
• # premises: the number of premises of a given size; this is the combinatorially computed maximal number of iterations of the algorithm;
• # ext. combinations: the number of extent combinations resulting in a premise of a given size;
• # ext. combs. to test: the number of extent combinations that can still result in a good prediction quality (filtered by Prop. 2);
• # new extents: the number of newly generated extents resulting from testing the extent combinations;
• # extents to keep: the number of extents kept in memory; for sizes 1 and 2 we keep all the extents, for size 3 we keep only the extents with high prediction quality;
• computation time: the time it took to process all combinations for a given premise size.

6.2. Prediction quality of short rules
The simplicity of short rule models allows us to fully describe some of the obtained models in the paper. The following list provides the best short models (w.r.t. the F1 score, defined in the standard way for binary classification) obtained on the Myocard dataset:
• Premise size 1: F1 score = 0.401426. Premise: there is data on the use of painkillers in the intensive care unit on the third day of the hospital period.
• Premise size 2: F1 score = 0.448819. Premise: (had a chronic heart failure) OR (has diabetes mellitus in the anamnesis).
• Premise size 3: F1 score = 0.473786. Premise: (has data on the use of opioid drugs in the intensive care unit on the third day of the hospital period) AND ((had chronic heart failure) OR (age ≥ 66)).
• XGBoost model: F1 score = 0.464000. The model contains 100 decision trees of max depth 6.
• CatBoost model: F1 score = 0.434783. The model contains 1000 decision trees of depth 6.

We can also show all the obtained short rule models in the TruePositive–TrueNegative space.

[Figure 2: Short models' prediction quality in the TruePositive–TrueNegative space. The prediction quality of the default XGBoost and CatBoost models is shown as a reference.]

Figure 2 shows the prediction quality of the obtained short models on the TruePositive–TrueNegative scale. It should be noted that the points corresponding to the complex black-box gradient boosting models are not far from those of the short models. Thus, it is not reasonable to use complex black boxes on the Myocard dataset, since simple short models offer the same prediction quality.

7. Conclusion
In this paper we have presented some preliminary results on finding an optimal rule with an antecedent consisting of no more than three binary attributes. We reformulated the F1 score optimization task in terms of true positive and true negative predictions. We computed upper and lower bounds on the Jaccard score for premises obtained with conjunction and disjunction operations. We also covered an FCA-inspired technique of iterating over the extents of premises in order to minimize the computation time. In future work we plan to develop sharper lower and upper bounds on the Jaccard score for premises constructed with conjunction and disjunction operations.
We also plan to consider other logical operations that could increase the prediction quality of the rules while keeping the number of attributes used the same.

Acknowledgments
The article was prepared within the framework of the Basic Research Program at HSE University, RF.

References
[1] J. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics 29 (2001) 1189–1232. doi:10.2307/2699986.
[2] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32. doi:10.1023/A:1010933404324.
[3] C. Molnar, Interpretable Machine Learning, 2019. https://christophm.github.io/interpretable-ml-book/.
[4] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (2020) 82–115.
[5] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1 (2019) 206–215. doi:10.1038/s42256-019-0048-x.
[6] O. Pianykh, E. Dudyrev, A. Sharp, G. Gusev, S. O. Kuznetsov, I. Semenkov, Human knowledge models: Learning applied knowledge from the data, 2022. Unpublished.
[7] B. Ganter, R. Wille, Formal Concept Analysis, Springer, Berlin, 1999.
[8] M. M. Bongard, The Recognition Problem, Technical Report, Foreign Technology Division, Wright-Patterson AFB, Ohio, 1968.
[9] R. S. Michalski, Discovering classification rules using variable-valued logic system VL1, 1973.
[10] J. H. Friedman, B. E. Popescu, Predictive learning via rule ensembles, The Annals of Applied Statistics 2 (2008) 916–954.
[11] N. Cowan, The magical number 4 in short-term memory: A reconsideration of mental storage capacity, Behavioral and Brain Sciences 24 (2001) 87–114.
[12] G. S. Halford, R. Baker, J. E. McCredden, J. D. Bain, How many variables can humans process?, Psychological Science 16 (2005) 70–76.
[13] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, et al., Array programming with NumPy, Nature 585 (2020) 357–362. doi:10.1038/s41586-020-2649-2.
[14] S. E. Golovenkin, J. Bac, A. Chervov, E. M. Mirkes, Y. V. Orlova, E. Barillot, A. N. Gorban, A. Zinovyev, Trajectories, bifurcations, and pseudo-time in large clinical datasets: Applications to myocardial infarction and diabetes data, GigaScience 9 (2020) giaa128.