Metric Generalization and Modification of Classification Algorithms Based on Formal Concept Analysis

Evgeny Kolmakov
Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russia

Abstract. FCA-based classifiers can deal with nonbinary data representation in different ways: they either use it directly or binarize it. The algorithms that binarize data use metric information from the initial feature space only through scaling (the feature binarization procedure). A metric approach in this area significantly reduces the number of classification refusals and provides additional information that can be used for classifier training. In this paper we propose an approach which generalizes some of the existing FCA classification methods and allows one to modify them. Unlike other algorithms, the proposed classifier model uses the initial metric information together with order-theoretic object-attribute dependencies.

Keywords: classification, pattern recognition, formal concept analysis

1 Introduction

Formal concept analysis (FCA) is a branch of applied lattice theory that allows one to formalize some machine learning models. It provides tools for various tasks in many domains of computer science, such as knowledge representation and management and data mining, including classification and clustering. Many FCA-based classification algorithms are known [6]. One of the particular features of FCA methods is that an object x ∈ X is described by binary attributes. However, in many cases the attributes can be, e.g., real numbers, graphs, etc. There are classification methods that use a nonbinary representation directly, e.g. the works on pattern structures [4], [5], but many classifiers use it only after a scaling procedure. Scaling is the transformation of the initial feature space F into the Boolean cube B^n, and it leads to a significant loss of the metric information provided by the space F. In this paper we propose generalizations and modifications of several FCA-based classifiers that use scaling, by introducing a new classifier model based on class estimates. It generalizes the straight hypotheses-based algorithm [1] and both GALOIS classification procedures [3]. We also define a pseudometric on an arbitrary finite lattice, which is based on ideas from the Rulearner rule induction algorithm [2] and therefore has an intelligible interpretation in terms of formal concepts and the concept lattice.

In what follows we keep to standard lattice theory and FCA definitions. Here we briefly recall some basic definitions and classifiers and introduce the notation used further. Let G and M be arbitrary sets called the set of objects and the set of attributes respectively, and let I ⊆ G × M be a binary relation. The triple K = (G, M, I) is called a formal context. The following mappings (·)′ define a Galois connection between the sets 2^G and 2^M partially ordered by set-theoretic inclusion:

A′ = {m ∈ M | gIm for all g ∈ A},   B′ = {g ∈ G | gIm for all m ∈ B}.

A pair (A, B) such that A ⊆ G, B ⊆ M and A′ = B, B′ = A is called a formal concept of K with formal extent A and formal intent B. For an object g ∈ G we write g′ instead of {g}′. Define the "projection" mappings ext : (A, B) ↦ A and int : (A, B) ↦ B. The formal concepts of a given context K form a complete lattice denoted by B(K); it is called the concept lattice of the context K.
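For concreteness, here is a minimal Python sketch of the derivation operators and the formal concept condition. The toy context and all identifiers are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the derivation operators (·)' of a formal context,
# assuming the incidence relation I is given as a set of (object, attribute)
# pairs. Names and the toy context are illustrative only.

def up(A, I, M):
    """A' = attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def down(B, I, G):
    """B' = objects having all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

def is_formal_concept(A, B, I, G, M):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return up(A, I, M) == B and down(B, I, G) == A

# Toy context: objects g1, g2; attributes m1, m2, m3.
G = {"g1", "g2"}
M = {"m1", "m2", "m3"}
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2"), ("g2", "m3")}

print(up({"g1", "g2"}, I, M))                          # {'m2'}
print(is_formal_concept({"g1", "g2"}, {"m2"}, I, G, M))  # True
```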
Let ⟨L, ∧, ∨⟩ be a lattice and x ∈ L. By x▽ (x△) we denote the order ideal (filter) generated by x. By At(L), J(L) and M(L) we denote the set of all atoms, join-irreducible and meet-irreducible elements of L respectively. A function f : L → R is called supermodular if f(x) + f(y) ≤ f(x ∨ y) + f(x ∧ y) for all x, y ∈ L. A concept C is called consistent if all objects in ext(C) belong to the same class.

Both GALOIS classification procedures are described in [3]. GALOIS(1) calculates the similarity Γ_C(x) between an object x and each consistent concept C; then x is assigned to the class corresponding to the concept C with the highest value of Γ_C(x). GALOIS(2) finds all consistent concepts C satisfying int(C) ⊆ x′; then x is assigned to the class that is most frequent among these concepts.

Let K = (G, M, I) be a context and w ∉ M a target attribute. The input data for classification may be described by three contexts w.r.t. w: the positive context K+ = (G+, M, I+), the negative context K− = (G−, M, I−) and the undefined context Kτ = (Gτ, M, Iτ). Here G+, G− and Gτ are the sets of positive, negative and undefined objects respectively, and Iε ⊆ Gε × M, where ε ∈ {+, −, τ}, are binary relations that define structural attributes. The Galois operators in these contexts are denoted by (·)+, (·)− and (·)τ respectively. A formal concept of the positive context is called a positive concept; negative and undefined concepts are defined similarly. If the intent B+ of a positive concept (A+, B+) is not contained in the intent g− of any negative example g ∈ G−, then it is called a positive hypothesis with respect to the property w. A positive intent B+ is called falsified if B+ ⊆ g− for some negative example g ∈ G−. Negative hypotheses are defined similarly. By the "hypotheses-based classifier" we mean the classification procedure from [1], which can be described as follows. If the description of an unclassified object g ∈ Gτ contains a positive hypothesis and no negative one, it is classified positively; similarly for the negative case. If g′ contains no positive or negative hypothesis (insufficient data) or contains both positive and negative hypotheses (inconsistent data), then no classification is made.

2 Generalization and Modification of Algorithms

The common drawback of the FCA-based classifiers that use binary features after scaling is that they discard the metric structure of the initial feature space. The main idea of this paper is to use this metric information together with the order-theoretic relations between objects and attributes provided by a concept lattice. It is important that the spaces F and B^n with their additional structures (the metric and the formal context) are used simultaneously, which provides more possibilities for classifier training.

2.1 Metric estimates

Denote by H+ and H− the sets of concepts constructed from a training set whose intents are positive and negative hypotheses respectively. We assume that (F, ρ) is a metric space, and let S(x, A) be a similarity measure (based on the metric of F, see examples in Section 3) between an object x and a set of objects A. Let us define the estimates for the positive and negative classes:

Γ+(x) = Σ_{C ∈ H+} I(x, C) S(x, ext(C)),   Γ−(x) = Σ_{C ∈ H−} I(x, C) S(x, ext(C)),

where I(x, C) = [int(C) ⊆ x′] and [·] is the indicator function. Then the classifier has the following form:

a(x) = sign Γ(x) = sign(Γ+(x) − Γ−(x)).

Proposition 1. If the hypotheses-based classifier correctly predicts the class label of an object x, then a(x) = sign Γ(x) predicts the same label.

Compared with the hypotheses-based classifier, the number of classification refusals is reduced, but the total error rate can increase.
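A minimal sketch of how these estimates and the classifier a(x) could be computed follows. The concept representation (extent as a set of feature vectors, intent as a set of binary attributes), the particular similarity function and all identifiers are illustrative assumptions.

```python
# Sketch of the metric-estimate classifier from Section 2.1.
# A hypothesis concept is assumed to be a pair (extent, intent), where the
# extent is a set of training feature vectors and the intent is a set of
# binary attributes; x_intent is x' after scaling. Names are illustrative.

import math

def similarity(x, extent, rho):
    """Example S(x, ext(C)): 1 / (1 + distance to the closest extent object)."""
    return 1.0 / (1.0 + min(rho(x, g) for g in extent))

def estimate(x_intent, x_features, hypotheses, rho):
    """Γ(x) = Σ_C [int(C) ⊆ x'] · S(x, ext(C))."""
    total = 0.0
    for extent, intent in hypotheses:
        if intent <= x_intent:                  # indicator [int(C) ⊆ x']
            total += similarity(x_features, extent, rho)
    return total

def classify(x_intent, x_features, pos_hyps, neg_hyps, rho):
    """a(x) = sign(Γ+(x) − Γ−(x)); 0 is read as a refusal."""
    gamma = (estimate(x_intent, x_features, pos_hyps, rho)
             - estimate(x_intent, x_features, neg_hyps, rho))
    return (gamma > 0) - (gamma < 0)

def euclid(a, b):
    """Euclidean metric in the initial feature space F."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```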
2.2 Analogy with algorithms based on estimate calculations

To calculate the estimates in the method above we use the sets of positive and negative hypotheses, i.e. special subsets of the concept lattice. Such a calculation of estimates can be generalized to arbitrary subsets of the concept lattice that characterize individual classes y ∈ Y. Let C be a set of concepts which we call the support concepts system. Suppose that each concept from C characterizes exactly one class y ∈ Y, that is, C = ⊔_{y ∈ Y} C_y, where Y is the set of classes. Then the estimate of an object x for the class y is defined as follows:

Γ_y(x) = Σ_{C ∈ C_y} S(x, C).

The classifier has the form a(x) = arg max_{y ∈ Y} Γ_y(x). Estimates of this type are similar to the estimates used in estimate calculation methods [7], and the sets C_y are analogues of the support sets. Consider specific examples of the support concepts system C and the similarity measure S(x, C), and the corresponding classifiers:

1. C = H+ ⊔ H− is the union of the positive and negative hypotheses sets, and S(x, C) = [int(C) ⊆ x′] Ŝ(x, ext(C)), where Ŝ(x, ext(C)) is a given similarity measure. The corresponding classifier was described above.
2. C = ⊔_{y ∈ Y} C_y is the set of consistent concepts. If S(x, C) = |(M \ int(C)) ∪ x′| we obtain a modified GALOIS(1) algorithm. If S(x, C) = [int(C) ⊆ x′] we obtain the GALOIS(2) algorithm.

2.3 Analogy with metric classifiers

Let C = {C1, ..., Cn} be the support concepts system. Suppose that a distance measure ρ is given on the space F. Sort C in increasing order of the distance ρ(x, Ci) between the object x and the concepts Ci:

ρ(x, C_x^(1)) ≤ ρ(x, C_x^(2)) ≤ ... ≤ ρ(x, C_x^(n)),

where C_x^(i) is the i-th neighbour of x among C and y_x^(i) is the class characterized by the concept C_x^(i). Define the estimate of the object x for the class y as

Γ_y(x) = Σ_{i=1}^{n} w_i(x) [y_x^(i) = y],

where w_i(x) is the weight of the i-th neighbour of x (a positive function, non-increasing in i). The defined estimates are completely analogous to the estimates of metric classifiers, except that the neighbours here are not objects but support concepts. Thus, by choosing suitable weights w_i(x) we obtain analogues of all known metric classifiers (kNN, Parzen window, potential functions and others), but in terms of concepts; a sketch is given after the following examples:

– w_i(x) = [i ≤ k] gives the k nearest neighbours method;
– w_i(x) = [i ≤ k] w_i gives the weighted k nearest neighbours method (w_i depends only on the neighbour index);
– w_i(x) = K(ρ(x, C_x^(i)) / h(x)) gives the Parzen window method (K(z) is a non-increasing positive-valued function defined on [0, 1], and h(x) is the window width).

All the proposed methods are generalizations of the existing ones and can be used to modify them. They use both the metric information from F and the object-attribute dependencies provided by the concept lattice, which allows one to reduce the number of classification refusals and the error rate.
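Below is a minimal sketch of the concept-neighbour estimate with plain kNN weights. The concept representation as (extent, intent, label) triples, the function rho_to_concept standing for ρ(x, C), and all other names are illustrative assumptions.

```python
# Sketch of the concept-neighbour estimates from Section 2.3.
# Support concepts are assumed to be triples (extent, intent, label);
# rho_to_concept(x, c) plays the role of ρ(x, C), e.g. the distance from x
# to the closest object of the extent. Names are illustrative.

def knn_weight(i, k):
    """w_i(x) = [i ≤ k]: plain k-nearest-neighbours weights over concepts."""
    return 1.0 if i <= k else 0.0

def concept_knn_classify(x, concepts, rho_to_concept, k, classes):
    """a(x) = argmax_y Σ_i w_i(x)·[y_x^(i) = y] over the support concepts."""
    # Order the support concepts by their distance to x (neighbour order).
    ordered = sorted(concepts, key=lambda c: rho_to_concept(x, c))
    scores = {y: 0.0 for y in classes}
    for i, (_extent, _intent, label) in enumerate(ordered, start=1):
        scores[label] += knn_weight(i, k)
    return max(scores, key=scores.get)
```

Swapping knn_weight for a Parzen-window kernel or distance-dependent weights yields the other metric-classifier analogues mentioned above.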
2.4 Pseudometric on the set of concepts

Another approach that uses the notion of similarity in FCA algorithms is to define a distance function on the set of concepts. In the Rulearner algorithm [2] the most important characteristics of a concept lattice element u are the value of the function cover(u) = |J(L) ∩ u▽| and the set M(L) ∩ u△; lattice elements are compared on the basis of these characteristics. In the case of a reduced context this reflects the fact that every concept is characterized by its extent (distinct objects correspond to join-irreducible elements) and its intent (distinct attributes correspond to meet-irreducible elements). Thus, cover(u) corresponds to the number of objects from the training set covered by the concept u, and M(L) ∩ u△ corresponds to the attributes characterizing u. We use these observations to define a distance function on an arbitrary finite lattice. Due to the propositions dual to Theorems 3.1 and 3.3 from [8], the following theorem holds.

Theorem 1. Let ⟨L, ∧, ∨⟩ be a lattice and f : L → R an isotone supermodular function. Then d_f(x, y) = f(x) + f(y) − 2f(x ∧ y) defines a pseudometric on this lattice.

Consider an arbitrary finite lattice ⟨L, ∧, ∨⟩, a non-empty subset D ⊆ L and the function f : L → Z+ defined as f(x) = |D(x)|, where D(x) = D ∩ x▽.

Proposition 2. The function f(x) is isotone and supermodular.

Proof. The isotone property of f follows from the chain of implications

x ≤ y ⇒ x▽ ⊆ y▽ ⇒ D(x) ⊆ D(y) ⇒ f(x) = |D(x)| ≤ |D(y)| = f(y).

To prove supermodularity consider

f(x) + f(y) = |D(x)| + |D(y)| = |D(x) ∪ D(y)| + |D(x) ∩ D(y)| ≤ f(x ∨ y) + f(x ∧ y).

For the last inequality observe that D(x) ∪ D(y) ⊆ D(x ∨ y) follows from the inclusions

x ≤ x ∨ y ⇒ D(x) ⊆ D(x ∨ y),   y ≤ x ∨ y ⇒ D(y) ⊆ D(x ∨ y),

and the equality D(x) ∩ D(y) = D(x ∧ y) follows from x▽ ∩ y▽ = (x ∧ y)▽. □

Thus, according to the theorem above, the function f(x) induces the pseudometric d_f(x, y) on the lattice defined by the equality

d_f(x, y) = f(x) + f(y) − 2f(x ∧ y).

The value d_f(x, y) has a simple interpretation.

Proposition 3. d_f(x, y) = |D(x) ⊕ D(y)|, where A ⊕ B = (A \ B) ∪ (B \ A).

Proof. From the proof of the proposition above, the equality D(x) ∩ D(y) = D(x ∧ y) holds. Consider the chain of equalities:

f(x) + f(y) − 2f(x ∧ y) = |D(x)| + |D(y)| − 2|D(x ∧ y)|
  = |D(x)| + |D(y)| − 2|D(x) ∩ D(y)|
  = |D(x) ∪ D(y)| + |D(x) ∩ D(y)| − 2|D(x) ∩ D(y)|
  = |D(x) ∪ D(y)| − |D(x) ∩ D(y)| = |D(x) ⊕ D(y)|. □

Corollary 1. If ⟨L, ∧, ∨⟩ is a finite Boolean algebra and D is the set of all atoms of L, then d_f(x, y) is exactly the Hamming distance.

In order to compare formal concepts it is reasonable to choose D = J(L) or D = At(L). In terms of this pseudometric, two concepts are the closer, the fewer object concepts are covered by only one of them. Moreover, for D = J(L) the functions cover(u) and d_f(x, y) are tied: cover(u) = d_f(u, ⋀L), the distance from u to the bottom element of the lattice. One drawback of the defined distance measure is that the number of elements from D covered by x ∧ y is not taken into account, which in some cases may lead to inadequate distance estimates. Possible modifications:

1. Various normalizations that take this number into account, e.g.
   d(x, y) = |D(x) ⊕ D(y)| / |D(x) ∪ D(y)|.
2. Weighting the elements of D: e.g. let D = J(L) and let w_e be the proportion of the hypotheses covering the object concept e = (g″, g′). Then d(x, y) takes the form
   d(x, y) = Σ_{e ∈ D(x) ⊕ D(y)} w_e.

The distance between concepts can be applied to modify the classification algorithms mentioned above. For example, let an object x be classified with the hypotheses-based algorithm, and suppose there are exactly two positive hypotheses H1+, H2+ and two negative hypotheses H1−, H2− applicable to x. In this case the algorithm refuses to classify x. Suppose we know the concept distances d(H1+, H2+) and d(H1−, H2−), and that d(H1+, H2+) ≫ d(H1−, H2−). Then it is natural to classify x as positive, because distant concepts (in terms of the proposed measure) are less "correlated" (they cover many distinct object concepts), and hence their answers are more significant. The distance between concepts can also be used to reduce the size of the concepts system used by a classifier (e.g. the set of consistent concepts) in order to improve its generalization ability, reduce overfitting and remove concepts based on noisy data.
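A minimal computational sketch of the pseudometric and its two modifications follows. Representing a lattice element directly by the set D(x) of join-irreducible elements below it is an illustrative assumption; all names are hypothetical.

```python
# Sketch of the pseudometric from Section 2.4 on a finite lattice.
# A lattice element x is represented here simply by the set D(x) = D ∩ x▽
# of join-irreducible elements it covers; this representation is an
# illustrative assumption.

def d_f(Dx, Dy):
    """d_f(x, y) = |D(x) ⊕ D(y)| (Proposition 3)."""
    return len(Dx ^ Dy)                      # symmetric difference

def d_normalized(Dx, Dy):
    """Normalized variant: |D(x) ⊕ D(y)| / |D(x) ∪ D(y)|."""
    union = Dx | Dy
    return len(Dx ^ Dy) / len(union) if union else 0.0

def d_weighted(Dx, Dy, w):
    """Weighted variant: Σ_{e ∈ D(x) ⊕ D(y)} w_e."""
    return sum(w[e] for e in Dx ^ Dy)

# Example: two concepts covering the object concepts {g1, g2} and {g2, g3}.
print(d_f({"g1", "g2"}, {"g2", "g3"}))           # 2
print(d_normalized({"g1", "g2"}, {"g2", "g3"}))  # 0.666...
```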
3 Experiments

In this section the experimental results are presented. The algorithms were tested on two data sets from the UCI Machine Learning Repository [9]: the SPECT and SPECTF Heart Data Set (a training set of 80 objects and a test set of 187 objects; 22 binary attributes in SPECT, 44 real-valued attributes in SPECTF) and the Liver Disorders Data Set (a training set of 150 objects and a test set of 195 objects; 6 real-valued attributes, 30 binary attributes after scaling). The tested algorithms were GALOIS(1, 2), Rulearner, the straight hypotheses-based algorithm, the modified GALOIS(1) (described by the second example in Section 2.2) and the modified hypotheses-based algorithm with metric estimates (described in Section 2.1, with different similarity functions). The Euclidean metric ρ(x, y) was used in the space F in both experiments. The similarity function is S(x, C) = K(ρ(x, C), a), where K(r, a) and ρ(x, C) are chosen among the following functions:

K1(r, a) = 1 / (1 + exp(ar)),   K2(r, a) = 1 / (r + a),

ρ1(x, C) = inf_{c ∈ C} ρ(x, c),   ρ2(x, C) = (1/|C|) Σ_{c ∈ C} ρ(x, c),   ρ3(x, C) = sup_{c ∈ C} ρ(x, c).
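For concreteness, these similarity components could be implemented as in the sketch below; the function and parameter names are illustrative assumptions, and the extent of a concept is assumed to be a set of feature vectors.

```python
# Sketch of the similarity components used in the experiments.
# `extent` is the set of training objects of a concept (feature vectors);
# rho is the Euclidean metric in F. Names are illustrative.

import math

def rho(x, c):
    """Euclidean metric in the initial feature space F."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def rho1(x, extent):                 # ρ1: distance to the closest object
    return min(rho(x, c) for c in extent)

def rho2(x, extent):                 # ρ2: mean distance to the extent
    return sum(rho(x, c) for c in extent) / len(extent)

def rho3(x, extent):                 # ρ3: distance to the farthest object
    return max(rho(x, c) for c in extent)

def K1(r, a):                        # K1(r, a) = 1 / (1 + exp(a·r))
    return 1.0 / (1.0 + math.exp(a * r))

def K2(r, a):                        # K2(r, a) = 1 / (r + a)
    return 1.0 / (r + a)

def S(x, extent, a):
    """Example similarity S(x, C) = K(ρ(x, C), a), here with K1 and ρ2."""
    return K1(rho2(x, extent), a)
```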
We use the following notation: νc is the proportion of classified objects, νr = 1 − νc is the proportion of classification refusals, et is the total error rate (refusals counted as errors), and er is the error rate among classified objects.

Table 1. SPECT and SPECTF Heart Data Set. Experimental results.

Algorithm                        νc      νr      et      er
GALOIS(1)                        1       0       0.1604  0.1604
Modified GALOIS(1)               1       0       0.0856  0.0856
GALOIS(2)                        1       0       0.0802  0.0802
Rulearner                        0.7487  0.2513  0.2727  0.0286
Hypotheses-based                 0.5936  0.4064  0.6150  0.1842
K = K1, a = 0.0125, ρ = ρ1       0.8021  0.1979  0.3155  0.1467
K = K1, a = 0.0125, ρ = ρ2       0.8021  0.1979  0.2888  0.1133
K = K1, a = 0.0125, ρ = ρ3       0.8021  0.1979  0.2834  0.1067
K = K1, a = 1, ρ = ρ2            0.7273  0.2727  0.3422  0.0956
K = K2, a = 1, ρ = ρ1            0.8021  0.1979  0.2941  0.1200
K = K2, a = 1, ρ = ρ2            0.8021  0.1979  0.3209  0.1533

Table 2. Liver Disorders Data Set. Experimental results.

Algorithm                        νc      νr      et      er
GALOIS(1)                        1       0       0.4605  0.4605
Modified GALOIS(1)               1       0       0.5590  0.5590
GALOIS(2)                        1       0       0.4359  0.4359
Rulearner                        0.9795  0.0205  0.4564  0.4450
Hypotheses-based                 0.2923  0.7077  0.8256  0.4035
K = K1, a = 1, ρ = ρ1            0.8821  0.1179  0.5231  0.4593
K = K1, a = 0.01, ρ = ρ2         0.8974  0.1026  0.5436  0.4914
K = K1, a = 0.25, ρ = ρ3         0.8872  0.1128  0.5385  0.4798
K = K2, a = 200, ρ = ρ1          0.8974  0.1026  0.4769  0.4171
K = K2, a = 150, ρ = ρ2          0.8974  0.1026  0.4564  0.3943
K = K2, a = 150, ρ = ρ3          0.8974  0.1026  0.4667  0.4057

The aim of the experiments was to compare FCA classification methods, not to achieve a low error rate on particular tasks. Hence we used a simple scaling procedure: all attributes were normalized to the interval [0, 1] and then interval-based nominal scaling was applied (with 5 intervals per attribute). This explains the high error rate of all classifiers in the second task. Individual scaling for each task (e.g. scaling with intervals of varying size) may significantly reduce the error rate, but this problem is outside the scope of this work. From the results above we may conclude that for the modifications of the hypotheses-based algorithm the number of refusals is substantially reduced together with the total error rate et. The modified GALOIS(1) classifier improved on GALOIS(1) on the first data set and performed worse on the second. This may be due to the different nature of the binary data descriptions: in the first case the 22 binary attributes were obtained from 44 real-valued ones by a complex binarization procedure, while in the second case this procedure was very simple. The choice of K(r, a) and ρ(x, C) affects only et but not νr, hence careful selection of these functions may improve classification quality.

4 Conclusions

In this paper we have formally described and experimentally studied a new approach to classification that uses both the metric information provided by the initial feature space and the order-theoretic object-attribute dependencies. We have also defined a pseudometric on an arbitrary finite lattice which has an intelligible interpretation in terms of concepts and hence can be used for comparing concepts in order to improve FCA classification methods. Further developments can focus on studying the classifiers obtained from the proposed model by fixing the support concepts system C and the similarity measure S(x, C), and on the possibility of choosing a support concepts system that allows one to construct only a part of the concept lattice.

References

1. S.O. Kuznetsov: Complexity of Learning in Concept Lattices from Positive and Negative Examples. Discrete Applied Mathematics, Vol. 142(1-3), pp. 111-125, 2004.
2. M. Sahami: Learning Classification Rules Using Lattices. In: N. Lavrac, S. Wrobel (eds.), Proc. ECML 1995, Heraclion, Crete, Greece, April 1995, pp. 343-346.
3. C. Carpineto, G. Romano: GALOIS: An Order-Theoretic Approach to Conceptual Clustering. In: Proc. ICML 1993, Amherst, USA, July 1993, pp. 33-40.
4. M. Kaytoue, S.O. Kuznetsov, A. Napoli, S. Duplessis: Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis. Information Sciences, Vol. 181, Issue 10, pp. 1989-2001, 2011.
5. S.O. Kuznetsov: Scalable Knowledge Discovery in Complex Data with Pattern Structures. In: P. Maji, A. Ghosh, M.N. Murty, K. Ghosh, S.K. Pal (eds.), Proc. 5th International Conference on Pattern Recognition and Machine Intelligence (PReMI 2013), Lecture Notes in Computer Science, Vol. 8251, Springer, pp. 30-41, 2013.
6. O. Prokasheva, A. Onishchenko, S. Gurov: Classification Methods Based on Formal Concept Analysis. In: FCAIR 2013: Formal Concept Analysis Meets Information Retrieval, workshop co-located with the 35th European Conference on Information Retrieval (ECIR 2013), March 24, 2013, Moscow, Russia. National Research University Higher School of Economics, pp. 95-104. ISSN 1613-0073.
7. Yu.I. Zhuravlev: An Algebraic Approach to Recognition and Classification Problems. In: Problems of Cybernetics, Issue 33, Nauka, Moscow, 1978, pp. 5-68 [in Russian].
8. D.A. Simovici: Betweenness, Metrics and Entropies in Lattices. In: Proc. 38th International Symposium on Multiple-Valued Logic, May 22-24, 2008, Dallas, TX, USA. IEEE Computer Society, pp. 26-31. ISSN 0195-623X.
9. K. Bache, M. Lichman: UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.