Interval Pattern Concept Lattice as a Classifier Ensemble

Yury Kashnitsky, Sergei O. Kuznetsov

National Research University Higher School of Economics, Moscow, Russia
{ykashnitsky,skuznetsov}@hse.ru

Abstract. Decision tree learning is one of the most popular classification techniques. However, by its nature it is a greedy approach to finding a classification hypothesis that optimizes some information-based criterion. It is very fast but may lead to suboptimal classification hypotheses. Moreover, while decision trees are easily interpretable, ensembles of trees (random forests and gradient-boosted trees) are not, which is crucial in some domains, such as medical diagnostics or bank credit scoring. In the case of such "small but important data" problems one is not obliged to perform a greedy search for classification hypotheses, and therefore alternatives to decision tree learning may be considered. In this paper, we propose an FCA-based classification technique where each test instance is classified with a set of the best (in terms of some information-based criterion) classification rules. In a set of benchmarking experiments, the proposed strategy is compared with decision tree and nearest neighbor learning.

Keywords: machine learning, classification, decision tree learning, formal concept analysis, pattern structures

1 Introduction

The classification task in machine learning aims to use some historical data (a training set) to predict unknown discrete variables in unseen data (a test set). While there are dozens of popular methods for solving the classification problem, there is usually an accuracy-interpretability trade-off when choosing a method for a particular task. Neural networks, random forests and ensemble techniques (boosting, bagging, stacking, etc.) are known to outperform simple methods in difficult tasks. Kaggle competitions also bear testimony to that: usually, winners resort to ensemble techniques, mainly to gradient boosting [13]. The mentioned algorithms are widespread in application scenarios where classification performance is the main objective. In Optical Character Recognition, voice recognition, information retrieval and many other tasks we are typically satisfied with a trained model if it has a low generalization error.

However, in many applications we need a model to be interpretable as well as accurate. Classification rules built from data and examined by experts may be justified or validated. In medical diagnostics, when making highly responsible decisions (e.g., predicting whether a patient has cancer), experts prefer to extract readable rules from a machine learning model in order to "understand" it and justify the decision. In credit scoring, for instance, applying ensemble techniques can be very effective, but the model is often required to have "sound business logic", that is, to be interpretable [10].

In what follows, we introduce some notions from Formal Concept Analysis (FCA) [5] and provide a technique to express decision tree learning in terms of a search for a hypothesis in a concept lattice (Section 3). In Section 4, we propose an algorithm whose design guarantees that each test object is classified with a rule at least as good (in terms of some criterion such as information gain or Gini impurity) as the one used by a decision tree.
Finally, we discuss the results of experiments with several popular datasets (Section 5), draw conclusions and outline directions for further work on the proposed ideas.

2 Pattern Structures and Projections

Pattern structures are a natural extension of Formal Concept Analysis to objects with arbitrary partially ordered descriptions [4].

Definition 1. Let G be a set (of objects), let (D, ⊓) be a meet-semilattice (of all possible object descriptions) and let δ : G → D be a mapping between objects and descriptions. The set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice (D_δ, ⊓) of (D, ⊓) if every subset X of δ(G) has an infimum ⊓X in (D, ⊓). A pattern structure is a triple (G, (D, ⊓), δ), provided that the set δ(G) generates a complete subsemilattice (D_δ, ⊓) [4, 9].

Definition 2. Patterns are elements of D. Patterns are naturally ordered by the subsumption relation ⊑: given c, d ∈ D, one has c ⊑ d ⇔ c ⊓ d = c. The operation ⊓ is also called a similarity operation.

A pattern structure (G, (D, ⊓), δ) gives rise to the following derivation operators (·)^□:

    A^□ = ⊓_{g ∈ A} δ(g)              for A ⊆ G,
    d^□ = {g ∈ G | d ⊑ δ(g)}          for d ∈ (D, ⊓).

Pairs (A, d) satisfying A ⊆ G, d ∈ D, A^□ = d, and A = d^□ are called pattern concepts of (G, (D, ⊓), δ). As in classical FCA, pattern concepts form a pattern concept lattice. In case it is too computationally demanding to build the whole lattice, projections are used to simplify object descriptions and speed up the construction of a pattern concept lattice.

Definition 3. A projection [4] of a semilattice (D, ⊓) is a kernel function ψ : D → D, i.e., for all x, y ∈ D:

– x ⊑ y ⇒ ψ(x) ⊑ ψ(y) (monotonicity)
– ψ(x) ⊑ x (contractivity)
– ψ(ψ(x)) = ψ(x) (idempotence)

3 Interval Pattern Structure Projections

The theoretical part of the proposed approach is based on Formal Concept Analysis and pattern structures, in particular on Interval Pattern Structures [6], which provide a way to apply FCA techniques to data with numeric attributes. Unfortunately, the size of the concept lattice is usually too large for it to be used efficiently in learning [11]. Hence, we introduce a so-called discretizing projection on interval pattern structures, which helps to build more general object descriptions based on numeric attributes.

Definition 4. Let (G, (D, ⊓), δ) be an interval pattern structure. Let T_i = {τ_{i1}, ..., τ_{it_i}}, i = 1, ..., m, be m sets of real numbers, where m is the number of intervals in each description d ∈ D. Then

    ψ(⟨[a_i, b_i]⟩_{i ∈ [1,m]}) = ⟨[max{τ | τ ∈ T_i ∪ {−∞, +∞}, τ ≤ a_i}, min{τ | τ ∈ T_i ∪ {−∞, +∞}, τ ≥ b_i}]⟩_{i ∈ [1,m]}

is called a discretizing projection of the semilattice (D, ⊓). The discretizing projection of Def. 4 is indeed a projection in the sense of Def. 3.

Example 1. Consider a toy dataset with only 4 objects and 1 numeric attribute, as shown in Fig. 1 (left). In order to apply decision tree learning to some classification task with this dataset, one would apply some discretization method to produce binary attributes from attribute a. Consider the discretization shown in Fig. 1 (middle). The corresponding concept lattice is shown in the same figure on the right-hand side.

Fig. 1. A toy many-valued context (left), its discretization (middle) and the corresponding concept lattice (right, diagram not reproduced here):

        a  | a ≤ 1.5 | a ≤ 3.5 | a ≥ 1.5 | a ≥ 3.5
    1   1  |    ×    |    ×    |         |
    2   2  |         |    ×    |    ×    |
    3   3  |         |    ×    |    ×    |
    4   4  |         |         |    ×    |    ×

The mapping

    ψ([a, b]) = [max{τ | τ ∈ T⁺, τ ≤ a}, min{τ | τ ∈ T⁺, τ ≥ b}]   with   T⁺ = {−∞, 1.5, 3.5, +∞}

is a projection of the semilattice built for a context that arises from the interordinal scaling of the initial many-valued numerical context.
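To make Def. 4 concrete, here is a minimal Python sketch of the discretizing projection applied to interval descriptions; the function name and the data layout (a description as a list of (a_i, b_i) pairs, thresholds as a list of sets T_i) are our own illustration rather than part of any reference implementation.

```python
def discretizing_projection(description, thresholds):
    """A sketch of the discretizing projection psi from Def. 4.

    description -- a list of (a_i, b_i) interval pairs, one per numeric attribute
    thresholds  -- a list of threshold sets T_i, one per attribute
    """
    projected = []
    for (a, b), T in zip(description, thresholds):
        grid = sorted(set(T) | {float("-inf"), float("inf")})
        low = max(t for t in grid if t <= a)   # largest grid point not above a
        high = min(t for t in grid if t >= b)  # smallest grid point not below b
        projected.append((low, high))
    return projected

# Example 1: a single attribute a with T+ = {1.5, 3.5}
print(discretizing_projection([(2, 3)], [{1.5, 3.5}]))  # [(1.5, 3.5)]
print(discretizing_projection([(1, 1)], [{1.5, 3.5}]))  # [(-inf, 1.5)]
```

Applying the function twice yields the same result, which illustrates the idempotence required by Def. 3.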
See [8] for more details on the link between interordinal scaling and interval pattern structures. The pattern concept lattice corresponding to the discretizing projection ψ([a, b]) is isomorphic to the concept lattice of the discretized context shown in Fig. 1 (middle). Introducing a discretizing projection is a general way to express any discretization procedure (an essential part of decision tree learning algorithms) in terms of FCA.

4 Learning with Pattern Concept Lattices

For classification tasks with complex data we propose Algorithm 1. The main idea is to find, for each test instance, classification rules that maximize some information criterion (Gini index, pairwise mutual information, etc.). In the case of interval pattern structures, the algorithm by its design guarantees that each test instance is classified with a rule at least as good (in terms of the information criterion) as the one used by a decision tree. We apply a modification of the Close-by-One algorithm [7] to build all pattern concepts, which form the search space for classification rules.

Let PS_train = (G_train, ((D, ⊓), c_train), δ_train) and PS_test = (G_test, (D, ⊓), δ_test) be two pattern structures corresponding to the training and test sets in a classification task. Let CbO_PS(PS, min_supp) be the algorithm used to find all pattern concepts of a pattern structure PS with support greater than or equal to min_supp. Let inf : D × c_train → R be an information criterion used to rate classification rules (we use Gini impurity by default). Finally, let min_supp and n_rules be the parameters of the algorithm: the minimal support of each classification rule's premise and the number of rules used to predict each test instance's class attribute. With these designations, the main steps of the proposed algorithm are as follows (a Python sketch of the rule-ranking and voting steps is given after the list):

1. Initialize a list of predicted labels c_test for the test instances and a dictionary r_test of classification rules for each test instance.
2. Calculate the proportion of positive objects in the training set: f_pos = |c⁺_train| / |G_train|, where c⁺_train ⊆ G_train denotes the set of positive training objects.
3. With the CbO_PS algorithm, find S, a dictionary of all pattern concepts of the pattern structure PS_train with support greater than or equal to min_supp; along the way, calculate the value of the criterion inf for each concept intent (intents are the keys of S, criterion values are its values).
4. Sort S by its values.
5. For each test instance g_t ∈ G_test:
   – Find the first n_rules concept intents from the sorted S such that (A_i, d_i) ∈ S and d_i ⊑ δ_test(g_t), i = 1, ..., n_rules.
   – For each such "top-ranked" concept intent d_i, determine f_i⁺, the proportion of positive objects in its extent: f_i⁺ = |d_i^□ ∩ c⁺_train| / |d_i^□|.
   – Thus, form {d_i → f_i⁺}_{i ∈ [1, n_rules]}, a set of classification rules for g_t. Set r_test[t] equal to this set of rules.
   – Predict the class attribute of g_t as the indicator of the average confidence of the rules in r_test[t] being greater than or equal to the proportion of positive objects in the training set:

        c_test[t] = [ Σ_{i=1}^{n_rules} f_i⁺ ≥ f_pos · n_rules ].
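As an illustration, below is a minimal Python sketch of the rule-ranking and voting steps above, assuming the pattern concepts have already been mined with the required minimal support and that descriptions are lists of (low, high) interval pairs as in the previous sketch; the function names, the ranking direction and the fallback for the case of no applicable rules are our own assumptions rather than the exact implementation used in the experiments.

```python
def covers(premise, description):
    """d is subsumed by delta(g) for interval patterns: every premise interval
    contains the corresponding interval of the object description."""
    return all(lo <= a and b <= hi
               for (lo, hi), (a, b) in zip(premise, description))

def predict_one(description, concepts, f_pos, n_rules=10):
    """Classify one test object with its n_rules best applicable rules.

    concepts -- iterable of (intent, criterion_value, f_plus) triples, where
                f_plus is the share of positive objects in the concept's
                extent, computed on the training set.
    Returns the predicted label and the set of rules used.
    """
    # Step 4: rank concepts by the information criterion
    # (assumption: lower Gini impurity of the extent is better).
    ranked = sorted(concepts, key=lambda c: c[1])
    # Step 5: keep the first n_rules intents that cover the test object.
    rules = [(intent, f_plus) for intent, _, f_plus in ranked
             if covers(intent, description)][:n_rules]
    if not rules:
        # Assumption: fall back to the majority class if no rule applies.
        return int(f_pos >= 0.5), []
    # Vote: indicator that the average rule confidence is at least f_pos.
    avg_confidence = sum(f for _, f in rules) / len(rules)
    return int(avg_confidence >= f_pos), rules
```

Algorithm 1 below additionally records r_test[t] for every test object and mines the concepts with CbO_PS beforehand.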
Algorithm 1 Concept Lattice-Based Rule-learner (CoLiBRi)

Input: PS_train = (G_train, ((D, ⊓), c_train), δ_train);
       PS_test = (G_test, (D, ⊓), δ_test);
       min_supp ∈ R⁺, n_rules ∈ N;
       CbO_PS(PS, min_supp) : PS → S;
       inf : D × c_train → R;
       sort(S, inf) : S → S
Output: c_test, r_test

c_test = ∅, r_test = ∅
f_pos = |c⁺_train| / |G_train|
S = {(A, d) : inf(d, c_train) | A ⊆ G_train, d ∈ D, A^□ = d, d^□ = A, |A| ≥ min_supp} = CbO_PS(PS_train, min_supp)
S = sort(S, inf)
for g_t ∈ G_test do
    {d_i}_{i ∈ [1, n_rules]} = first n_rules intents d such that (A, d) ∈ S and d ⊑ δ_test(g_t)
    f_i⁺ = |d_i^□ ∩ c⁺_train| / |d_i^□|
    r_test[t] = {d_i → f_i⁺}_{i ∈ [1, n_rules]}
    c_test[t] = [ Σ_{i=1}^{n_rules} f_i⁺ ≥ f_pos · n_rules ]
end for

In the case of a classification task with numeric attributes we apply the same Algorithm 1 to interval pattern structures. To make it tractable, we apply it to the projections ψ(PS_train) and ψ(PS_test) of the training and test interval pattern structures, where ψ(PS) is the discretizing pattern structure projection defined in Def. 4.

5 Experiments

We compare the proposed classification algorithm (denoted "CoLiBRi" for "Concept Lattice-Based Rule-learner") with Scikit-learn [12] implementations of CART [2], Random Forest [1] and kNN on several datasets from the UCI machine learning repository.¹

¹ http://repository.seasr.org/Datasets/UCI/csv/

dataset        | DT acc | RF acc | kNN acc | CoLiBRi acc | DT F1 | RF F1 | kNN F1 | CoLiBRi F1
audiology      |  0.75  |  0.80  |  0.63   |   0.79*     | 0.71  | 0.74  | 0.58   |   0.74
balance-scale  |  0.63  |  0.66  |  0.76   |   0.65      | 0.58  | 0.63  | 0.75   |   0.61
breast_cancer  |  0.70  |  0.74  |  0.73   |   0.76      | 0.45  | 0.42  | 0.38   |   0.44*
car            |  0.75  |  0.78* |  0.71   |   0.79      | 0.75  | 0.76  | 0.71   |   0.76
hayes-roth     |  0.84* |  0.83* |  0.49   |   0.86      | 0.84* | 0.82  | 0.49   |   0.85
lymph          |  0.80  |  0.83  |  0.86   |   0.83      | 0.77  | 0.85  | 0.84*  |   0.84*
mol_bio_prom   |  0.78  |  0.83  |  0.83   |   0.82*     | 0.78  | 0.84  | 0.80   |   0.83*
nursery        |  0.64  |  0.65  |  0.72   |   0.65      | 0.62  | 0.62  | 0.70   |   0.62
primary_tumor  |  0.41  |  0.46  |  0.41   |   0.45*     | 0.37  | 0.41  | 0.37   |   0.40*
solar_flare    |  0.70* |  0.70* |  0.63   |   0.72      | 0.67  | 0.69* | 0.60   |   0.71
soybean        |  0.91* |  0.91* |  0.92   |   0.91*     | 0.91* | 0.93  | 0.92*  |   0.91*
spect_train    |  0.61  |  0.69  |  0.68*  |   0.70      | 0.34  | 0.36* | 0.23   |   0.38
tic-tac-toe    |  0.79  |  0.79  |  0.85   |   0.78      | 0.82  | 0.86  | 0.89   |   0.85

Table 1. Accuracies and F1-scores in classification experiments with the UCI machine learning datasets. "DT acc" and "DT F1" stand for the average 5-run 5-fold CV accuracy and F1 score of the CART algorithm, and so on; "CoLiBRi F1" stands for the average 5-run 5-fold CV F1 score of the proposed CoLiBRi algorithm.

We used Gini impurity as the criterion for rule selection and MDL [3] for continuous feature discretization. CART, Random Forest and kNN parameters (min_samples_leaf ∈ [1, 10] for the tree-based algorithms and k ∈ {1, 2, 5, 15, 30, 50} for kNN) were chosen by stratified 5-fold cross-validation. We built 10 trees for each instance of the Random Forest classifier. The parameter min_supp for CoLiBRi was taken equal to CART's min_samples_leaf for each dataset. We used n_rules = 10 classification rules to vote for each test instance's label. The described algorithms were implemented in Python 2.7.3 and run on a 4-CPU machine with 4 GB RAM.

The results are presented in Table 1. Each entry stands for the average metric (accuracy or F1-score) over 5 runs of 5-fold cross-validation. In the table, the algorithm with the best performance on each metric is boldfaced; other algorithms whose performance is not statistically distinguishable from the best one at p = 0.05 (paired t-tests over the 5 runs) are marked with an asterisk.
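For reference, the baseline parameter search described above can be sketched with scikit-learn roughly as follows. This uses the current scikit-learn API rather than the Python 2.7.3 environment of the original experiments, the model-selection metric is assumed to be accuracy, and data loading is omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def tune_baselines(X, y, seed=0):
    """Grid-search the baseline parameters from the experimental setup:
    min_samples_leaf in [1, 10] for the tree-based models, k for kNN."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    candidates = {
        "DT": (DecisionTreeClassifier(random_state=seed),
               {"min_samples_leaf": list(range(1, 11))}),
        "RF": (RandomForestClassifier(n_estimators=10, random_state=seed),
               {"min_samples_leaf": list(range(1, 11))}),
        "kNN": (KNeighborsClassifier(),
                {"n_neighbors": [1, 2, 5, 15, 30, 50]}),
    }
    best = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, scoring="accuracy", cv=cv)
        search.fit(X, y)
        best[name] = (search.best_params_, search.best_score_)
    return best
```

The 5-run averages reported in Table 1 would then correspond to repeating such a cross-validation with five different random seeds.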
Other algorithm’s whose performance is not statistically distinguishable from the best algorithm at p = 0.05 using paired t-tests on the 5 runs are *’ed. The best parameters for each algorithm are mentioned in Table 2. As it can be seen, the proposed approach performs better than CART and is statistically indistinguishable from RF on most of the datasets. Surprising enough, kNN seems to be the best-performer (on average over all datasets) in terms of accuracy but not F1-score. Conclusions and further work In this paper, we have shown how searching for classification hypotheses in a formal concept lattice may yield accurate results while providing interpretable classification rules. Further we plan to test the proposed strategy in classification tasks such as predicting biological activity (toxicology, mutagenicity, etc.) and telecom client dataset DT min_samples_leaf RF min_samples_leaf kNN k audiology 1 1 2 balance-scale 6 1 50 breast cancer 4 3 5 car 3 2 5 hayses-roth 3 1 15 lymph 1 1 5 mol-bio-prom 3 3 5 nursery 3 4 50 primary tumor 4 4 30 solar flare 3 1 30 soybean 1 1 2 spect train 9 5 10 tic-tac-toe 10 3 10 Table 2. Best parameters in classification experiments with the UCI ma- chine learning datasets. CoLiBRi’s min_supp is taken equal to CART’s min_samples_leaf for each dataset. satisfaction where objects have complex descriptions (graphs and sequences cor- respondingly). We also plan to introduce some randomization in mining rules for each test instance (as it is done with random forests) in order to further improve the classification quality. References [1] L. Breiman. “Random Forests”. In: Machine Learning 45.1 (Oct. 2001), pp. 5–32. issn: 0885-6125. [2] L. Breiman et al. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984. [3] U. M. Fayyad and K. B. Irani. “Multi-Interval Discretization of Continuous- Valued Attributes for Classification Learning.” In: IJCAI. 1993, pp. 1022– 1029. [4] B. Ganter and S. O. Kuznetsov. “Pattern Structures and Their Projec- tions”. In: Conceptual Structures: Broadening the Base. Ed. by Harry Del- ugach and Gerd Stumme. Vol. 2120. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 2001, pp. 129–142. [5] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Founda- tions. 1st. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 1997. [6] M. Kaytoue et al. “Mining gene expression data with pattern structures in formal concept analysis”. en. In: Information Sciences 181.10 (May 2011), pp. 1989–2001. [7] S. O. Kuznetsov. “A fast algorithm for computing all intersections of ob- jects from an arbitrary semilattice”. In: Nauchno-Tekhnicheskaya Infor- matsiya Seriya 2-Informatsionnye Protsessy I Sistemy 1 (1993), pp. 17– 20. [8] S. O. Kuznetsov. “Fitting Pattern Structures to Knowledge Discovery in Big Data”. In: Formal Concept Analysis, 11th International Conference, ICFCA 2013, Dresden, Germany, May 21-24, 2013. Proceedings. Vol. 7880. Lecture Notes in Computer Science. Springer, 2013, pp. 254–266. [9] S. O. Kuznetsov. “Scalable Knowledge Discovery in Complex Data with Pattern Structures”. In: PReMI. Ed. by Pradipta Maji et al. Vol. 8251. Lecture Notes in Computer Science. Springer, 2013, pp. 30–39. [10] X. Li and Y. Zhong. “An Overview of Personal Credit Scoring: Techniques and Future Work”. In: International Journal of Intelligence Science 2.4A (2012), pp. 181–189. [11] A. Masyutin, Y. Kashnitsky, and S. O. Kuznetsov. 
"Lazy classification with interval pattern structures: Application to credit scoring". In: Proceedings of the 4th International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI 2015), co-located with the International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, July 25, 2015. Vol. 1430. 2015, pp. 43–54.
[12] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[13] G. Tsoumakas et al. "WISE 2014 Challenge: Multi-label Classification of Print Media Articles to Topics". In: 15th International Conference on Web Information Systems Engineering (WISE 2014), Proceedings, Part II. Vol. 8787. Lecture Notes in Computer Science. Springer, 2014, pp. 541–548.