Metric Generalization and Modification of Classification Algorithms Based on Formal Concept Analysis

Evgeny Kolmakov
Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russia

Abstract. FCA-based classifiers can deal with nonbinary data representation in different ways: they either use it directly or binarize it. The algorithms that binarize data use metric information from the initial feature space only through scaling (the feature binarization procedure). A metric approach in this area significantly reduces the number of classification refusals and provides additional information that can be used for classifier training. In this paper we propose an approach which generalizes some of the existing FCA classification methods and allows one to modify them. Unlike other algorithms, the proposed classifier model uses the initial metric information together with order-theoretic object-attribute dependencies.

Keywords: classification, pattern recognition, formal concept analysis

1 Introduction

Formal concept analysis (FCA) is a branch of applied lattice theory that allows one to formalize some machine learning models. It provides tools for various tasks in many domains of computer science, such as knowledge representation and management and data mining, including classification and clustering. Many FCA-based classification algorithms are known [6]. One of the particular features of FCA methods is that an object x ∈ X is described by binary attributes. However, in many cases the attributes can be, e.g., real numbers, graphs, etc. There are classification methods that use a nonbinary representation directly, e.g. the works on pattern structures [4], [5], but many classifiers use it only after a scaling procedure. Scaling is the transformation of the initial feature space F into the Boolean cube B^n, and it leads to a significant loss of the metric information provided by the space F. In this paper we propose generalizations and modifications of several FCA-based classifiers that use scaling, by introducing a new classifier model based on class estimates. It generalizes the straight hypotheses-based algorithm [1] and both GALOIS classification procedures [3]. We also define a pseudometric on an arbitrary finite lattice, which is based on ideas from the Rulearner rule induction algorithm [2] and therefore has an intelligible interpretation in terms of formal concepts and the concept lattice.

In what follows we keep to standard lattice theory and FCA definitions. Here we briefly recall some basic definitions and classifiers and introduce the notation used further. Let G and M be arbitrary sets called the set of objects and the set of attributes respectively, and let I ⊆ G × M be a binary relation. The triple K = (G, M, I) is called a formal context. The following mappings (·)′ define a Galois connection between the sets 2^G and 2^M partially ordered by set-theoretic inclusion:

A′ = {m ∈ M | gIm for all g ∈ A},   B′ = {g ∈ G | gIm for all m ∈ B}.

A pair (A, B) such that A ⊆ G, B ⊆ M and A′ = B, B′ = A is called a formal concept of K with formal extent A and formal intent B. For an object g ∈ G we write g′ instead of {g}′. Define the "projection" mappings ext : (A, B) ↦ A and int : (A, B) ↦ B. The formal concepts of a given context K form a complete lattice denoted by B(K); it is called the concept lattice of the context K.
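For concreteness, here is a minimal Python sketch of the derivation operators and the formal concept condition. The toy context and all identifiers are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the derivation operators (·)' of a formal context,
# assuming the incidence relation I is given as a set of (object, attribute)
# pairs. Names and the toy context are illustrative only.

def up(A, I, M):
    """A' = attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def down(B, I, G):
    """B' = objects having all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

def is_formal_concept(A, B, I, G, M):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return up(A, I, M) == B and down(B, I, G) == A

# Toy context: objects g1, g2; attributes m1, m2, m3.
G = {"g1", "g2"}
M = {"m1", "m2", "m3"}
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2"), ("g2", "m3")}

print(up({"g1", "g2"}, I, M))                          # {'m2'}
print(is_formal_concept({"g1", "g2"}, {"m2"}, I, G, M))  # True
```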
Let ⟨L, ∧, ∨⟩ be a lattice and x ∈ L. By x▽ (x△) we denote the order ideal (filter) generated by x. By At(L), J(L) and M(L) we denote the set of all atoms, join-irreducible and meet-irreducible elements of L respectively. A function f : L → R is called supermodular if f(x) + f(y) ≤ f(x ∨ y) + f(x ∧ y) for all x, y ∈ L. A concept C is called consistent if all objects in ext(C) belong to the same class.

Both GALOIS classification procedures are described in [3]. GALOIS(1) calculates the similarity Γ_C(x) between an object x and each consistent concept C; then x is assigned to the class corresponding to the concept C with the highest value of Γ_C(x). GALOIS(2) finds all consistent concepts C satisfying int(C) ⊆ x′; then x is assigned to the class that is most frequent among these concepts.

Let K = (G, M, I) be a context and w ∉ M a target attribute. The input data for classification may be described by three contexts w.r.t. w: the positive context K+ = (G+, M, I+), the negative context K− = (G−, M, I−) and the undefined context Kτ = (Gτ, M, Iτ). Here G+, G− and Gτ are the sets of positive, negative and undefined objects respectively, and Iε ⊆ Gε × M, where ε ∈ {+, −, τ}, are binary relations that define structural attributes. The Galois operators in these contexts are denoted by (·)+, (·)− and (·)τ respectively. A formal concept of the positive context is called a positive concept; negative and undefined concepts are defined similarly. If the intent B+ of a positive concept (A+, B+) is not contained in the intent g− of any negative example g ∈ G−, then it is called a positive hypothesis with respect to the property w. A positive intent B+ is called falsified if B+ ⊆ g− for some negative example g ∈ G−. Negative hypotheses are defined similarly. By the "hypotheses-based classifier" we mean the classification procedure from [1], which can be described as follows. If the description of an unclassified object g ∈ Gτ contains a positive hypothesis and no negative one, it is classified positively; similarly for the negative case. If g′ contains no positive or negative hypothesis (insufficient data) or contains both positive and negative hypotheses (inconsistent data), then no classification is made.

2 Generalization and Modification of Algorithms

The common drawback of the FCA-based classifiers that use binary features after scaling is that they discard the metric structure of the initial feature space. The main idea of this paper is to use this metric information together with the order-theoretic relations between objects and attributes provided by a concept lattice. It is important that the spaces F and B^n with their additional structures (the metric and the formal context) are used simultaneously, which provides more possibilities for classifier training.

2.1 Metric estimates

Denote by H+ and H− the sets of concepts constructed from a training set whose intents are positive and negative hypotheses respectively. We assume that (F, ρ) is a metric space, and let S(x, A) be a similarity measure (based on the metric of F, see examples in Section 3) between an object x and a set of objects A. Let us define the estimates for the positive and negative classes:

Γ+(x) = Σ_{C ∈ H+} I(x, C) S(x, ext(C)),   Γ−(x) = Σ_{C ∈ H−} I(x, C) S(x, ext(C)),

where I(x, C) = [int(C) ⊆ x′] and [·] is the indicator function. Then the classifier has the following form:

a(x) = sign Γ(x) = sign(Γ+(x) − Γ−(x)).

Proposition 1. If the hypotheses-based classifier correctly predicts the class label of an object x, then a(x) = sign Γ(x) predicts the same label.

Compared with the hypotheses-based classifier, the number of classification refusals is reduced, but the total error rate can increase.
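A minimal sketch of how these estimates and the classifier a(x) could be computed follows. The concept representation (extent as a set of feature vectors, intent as a set of binary attributes), the particular similarity function and all identifiers are illustrative assumptions.

```python
# Sketch of the metric-estimate classifier from Section 2.1.
# A hypothesis concept is assumed to be a pair (extent, intent), where the
# extent is a set of training feature vectors and the intent is a set of
# binary attributes; x_intent is x' after scaling. Names are illustrative.

import math

def similarity(x, extent, rho):
    """Example S(x, ext(C)): 1 / (1 + distance to the closest extent object)."""
    return 1.0 / (1.0 + min(rho(x, g) for g in extent))

def estimate(x_intent, x_features, hypotheses, rho):
    """Γ(x) = Σ_C [int(C) ⊆ x'] · S(x, ext(C))."""
    total = 0.0
    for extent, intent in hypotheses:
        if intent <= x_intent:                  # indicator [int(C) ⊆ x']
            total += similarity(x_features, extent, rho)
    return total

def classify(x_intent, x_features, pos_hyps, neg_hyps, rho):
    """a(x) = sign(Γ+(x) − Γ−(x)); 0 is read as a refusal."""
    gamma = (estimate(x_intent, x_features, pos_hyps, rho)
             - estimate(x_intent, x_features, neg_hyps, rho))
    return (gamma > 0) - (gamma < 0)

def euclid(a, b):
    """Euclidean metric in the initial feature space F."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```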
2.2 Analogy with algorithms based on estimate calculations

To calculate the estimates in the method above we use the sets of positive and negative hypotheses, i.e. special subsets of the concept lattice. Such a calculation of estimates can be generalized to arbitrary subsets of the concept lattice that characterize individual classes y ∈ Y. Let C be a set of concepts which we call the support concepts system. Suppose that each concept from C characterizes exactly one class y ∈ Y, that is, C = ⊔_{y ∈ Y} C_y, where Y is the set of classes. Then the estimate of an object x for the class y is defined as follows:

Γ_y(x) = Σ_{C ∈ C_y} S(x, C).

The classifier has the form a(x) = arg max_{y ∈ Y} Γ_y(x). Estimates of this type are similar to the estimates used in estimate calculation methods [7], and the sets C_y are analogues of the support sets. Consider specific examples of the support concepts system C and the similarity measure S(x, C), and the corresponding classifiers:

1. C = H+ ⊔ H− is the union of the positive and negative hypotheses sets, and S(x, C) = [int(C) ⊆ x′] Ŝ(x, ext(C)), where Ŝ(x, ext(C)) is a given similarity measure. The corresponding classifier was described above.
2. C = ⊔_{y ∈ Y} C_y is the set of consistent concepts. If S(x, C) = |(M \ int(C)) ∪ x′| we obtain a modified GALOIS(1) algorithm. If S(x, C) = [int(C) ⊆ x′] we obtain the GALOIS(2) algorithm.

2.3 Analogy with metric classifiers

Let C = {C1, ..., Cn} be the support concepts system. Suppose that a distance measure ρ is given on the space F. Sort C in increasing order of the distance ρ(x, Ci) between the object x and the concepts Ci:

ρ(x, C_x^(1)) ≤ ρ(x, C_x^(2)) ≤ ... ≤ ρ(x, C_x^(n)),

where C_x^(i) is the i-th neighbour of x among C and y_x^(i) is the class characterized by the concept C_x^(i). Define the estimate of the object x for the class y as

Γ_y(x) = Σ_{i=1}^{n} w_i(x) [y_x^(i) = y],

where w_i(x) is the weight of the i-th neighbour of x (a positive function, non-increasing in i). The defined estimates are completely analogous to the estimates of metric classifiers, except that the neighbours here are not objects but support concepts. Thus, by choosing suitable weights w_i(x) we obtain analogues of all known metric classifiers (kNN, Parzen window, potential functions and others), but in terms of concepts; a sketch is given after the following examples:

– w_i(x) = [i ≤ k] gives the k nearest neighbours method;
– w_i(x) = [i ≤ k] w_i gives the weighted k nearest neighbours method (w_i depends only on the neighbour index);
– w_i(x) = K(ρ(x, C_x^(i)) / h(x)) gives the Parzen window method (K(z) is a non-increasing positive-valued function defined on [0, 1], and h(x) is the window width).

All the proposed methods are generalizations of the existing ones and can be used to modify them. They use both the metric information from F and the object-attribute dependencies provided by the concept lattice, which allows one to reduce the number of classification refusals and the error rate.
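Below is a minimal sketch of the concept-neighbour estimate with plain kNN weights. The concept representation as (extent, intent, label) triples, the function rho_to_concept standing for ρ(x, C), and all other names are illustrative assumptions.

```python
# Sketch of the concept-neighbour estimates from Section 2.3.
# Support concepts are assumed to be triples (extent, intent, label);
# rho_to_concept(x, c) plays the role of ρ(x, C), e.g. the distance from x
# to the closest object of the extent. Names are illustrative.

def knn_weight(i, k):
    """w_i(x) = [i ≤ k]: plain k-nearest-neighbours weights over concepts."""
    return 1.0 if i <= k else 0.0

def concept_knn_classify(x, concepts, rho_to_concept, k, classes):
    """a(x) = argmax_y Σ_i w_i(x)·[y_x^(i) = y] over the support concepts."""
    # Order the support concepts by their distance to x (neighbour order).
    ordered = sorted(concepts, key=lambda c: rho_to_concept(x, c))
    scores = {y: 0.0 for y in classes}
    for i, (_extent, _intent, label) in enumerate(ordered, start=1):
        scores[label] += knn_weight(i, k)
    return max(scores, key=scores.get)
```

Swapping knn_weight for a Parzen-window kernel or distance-dependent weights yields the other metric-classifier analogues mentioned above.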
2.4 Pseudometric on the set of concepts

Another approach that uses the notion of similarity in FCA algorithms is to define a distance function on the set of concepts. In the Rulearner algorithm [2] the most important characteristics of a concept lattice element u are the value of the function cover(u) = |J(L) ∩ u▽| and the set M(L) ∩ u△; lattice elements are compared on the basis of these characteristics. In the case of a reduced context this reflects the fact that every concept is characterized by its extent (distinct objects correspond to join-irreducible elements) and its intent (distinct attributes correspond to meet-irreducible elements). Thus, cover(u) corresponds to the number of objects from the training set covered by the concept u, and M(L) ∩ u△ corresponds to the attributes characterizing u. We use these observations to define a distance function on an arbitrary finite lattice. Due to the propositions dual to Theorems 3.1 and 3.3 from [8], the following theorem holds.

Theorem 1. Let ⟨L, ∧, ∨⟩ be a lattice and f : L → R an isotone supermodular function. Then d_f(x, y) = f(x) + f(y) − 2f(x ∧ y) defines a pseudometric on this lattice.

Consider an arbitrary finite lattice ⟨L, ∧, ∨⟩, a non-empty subset D ⊆ L and the function f : L → Z+ defined as f(x) = |D(x)|, where D(x) = D ∩ x▽.

Proposition 2. The function f(x) is isotone and supermodular.

Proof. The isotone property of f follows from the chain of implications

x ≤ y ⇒ x▽ ⊆ y▽ ⇒ D(x) ⊆ D(y) ⇒ f(x) = |D(x)| ≤ |D(y)| = f(y).

To prove supermodularity consider

f(x) + f(y) = |D(x)| + |D(y)| = |D(x) ∪ D(y)| + |D(x) ∩ D(y)| ≤ f(x ∨ y) + f(x ∧ y).

For the last inequality observe that D(x) ∪ D(y) ⊆ D(x ∨ y) follows from the inclusions

x ≤ x ∨ y ⇒ D(x) ⊆ D(x ∨ y),   y ≤ x ∨ y ⇒ D(y) ⊆ D(x ∨ y),

and the equality D(x) ∩ D(y) = D(x ∧ y) follows from x▽ ∩ y▽ = (x ∧ y)▽. □

Thus, according to the theorem above, the function f(x) induces the pseudometric d_f(x, y) on the lattice defined by the equality

d_f(x, y) = f(x) + f(y) − 2f(x ∧ y).

The value d_f(x, y) has a simple interpretation.

Proposition 3. d_f(x, y) = |D(x) ⊕ D(y)|, where A ⊕ B = (A \ B) ∪ (B \ A).

Proof. From the proof of the proposition above, the equality D(x) ∩ D(y) = D(x ∧ y) holds. Consider the chain of equalities:

f(x) + f(y) − 2f(x ∧ y) = |D(x)| + |D(y)| − 2|D(x ∧ y)|
  = |D(x)| + |D(y)| − 2|D(x) ∩ D(y)|
  = |D(x) ∪ D(y)| + |D(x) ∩ D(y)| − 2|D(x) ∩ D(y)|
  = |D(x) ∪ D(y)| − |D(x) ∩ D(y)| = |D(x) ⊕ D(y)|. □

Corollary 1. If ⟨L, ∧, ∨⟩ is a finite Boolean algebra and D is the set of all atoms of L, then d_f(x, y) is exactly the Hamming distance.

In order to compare formal concepts it is reasonable to choose D = J(L) or D = At(L). In terms of this pseudometric, two concepts are the closer, the fewer object concepts are covered by only one of them. Moreover, for D = J(L) the functions cover(u) and d_f(x, y) are tied: cover(u) = d_f(u, ⋀L), the distance from u to the bottom element of the lattice. One drawback of the defined distance measure is that the number of elements from D covered by x ∧ y is not taken into account, which in some cases may lead to inadequate distance estimates. Possible modifications:

1. Various normalizations that take this number into account, e.g.
   d(x, y) = |D(x) ⊕ D(y)| / |D(x) ∪ D(y)|.
2. Weighting the elements of D: e.g. let D = J(L) and let w_e be the proportion of the hypotheses covering the object concept e = (g″, g′). Then d(x, y) takes the form
   d(x, y) = Σ_{e ∈ D(x) ⊕ D(y)} w_e.

The distance between concepts can be applied to modify the classification algorithms mentioned above. For example, let an object x be classified with the hypotheses-based algorithm, and suppose there are exactly two positive hypotheses H1+, H2+ and two negative hypotheses H1−, H2− applicable to x. In this case the algorithm refuses to classify x. Suppose we know the concept distances d(H1+, H2+) and d(H1−, H2−), and that d(H1+, H2+) ≫ d(H1−, H2−). Then it is natural to classify x as positive, because distant concepts (in terms of the proposed measure) are less "correlated" (they cover many distinct object concepts), and hence their answers are more significant. The distance between concepts can also be used to reduce the size of the concepts system used by a classifier (e.g. the set of consistent concepts) in order to improve its generalization ability, reduce overfitting and remove concepts based on noisy data.
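A minimal computational sketch of the pseudometric and its two modifications follows. Representing a lattice element directly by the set D(x) of join-irreducible elements below it is an illustrative assumption; all names are hypothetical.

```python
# Sketch of the pseudometric from Section 2.4 on a finite lattice.
# A lattice element x is represented here simply by the set D(x) = D ∩ x▽
# of join-irreducible elements it covers; this representation is an
# illustrative assumption.

def d_f(Dx, Dy):
    """d_f(x, y) = |D(x) ⊕ D(y)| (Proposition 3)."""
    return len(Dx ^ Dy)                      # symmetric difference

def d_normalized(Dx, Dy):
    """Normalized variant: |D(x) ⊕ D(y)| / |D(x) ∪ D(y)|."""
    union = Dx | Dy
    return len(Dx ^ Dy) / len(union) if union else 0.0

def d_weighted(Dx, Dy, w):
    """Weighted variant: Σ_{e ∈ D(x) ⊕ D(y)} w_e."""
    return sum(w[e] for e in Dx ^ Dy)

# Example: two concepts covering the object concepts {g1, g2} and {g2, g3}.
print(d_f({"g1", "g2"}, {"g2", "g3"}))           # 2
print(d_normalized({"g1", "g2"}, {"g2", "g3"}))  # 0.666...
```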
3 Experiments

In this section the experimental results are presented. The algorithms were tested on two data sets from the UCI Machine Learning Repository [9]: the SPECT and SPECTF Heart Data Set (a training set of 80 objects and a test set of 187 objects; 22 binary attributes in SPECT, 44 real-valued attributes in SPECTF) and the Liver Disorders Data Set (a training set of 150 objects and a test set of 195 objects; 6 real-valued attributes, 30 binary attributes after scaling). The tested algorithms were GALOIS(1, 2), Rulearner, the straight hypotheses-based algorithm, the modified GALOIS(1) (described by the second example in Section 2.2) and the modified hypotheses-based algorithm with metric estimates (described in Section 2.1, with different similarity functions). The Euclidean metric ρ(x, y) was used in the space F in both experiments. The similarity function is S(x, C) = K(ρ(x, C), a), where K(r, a) and ρ(x, C) are chosen among the following functions:

K1(r, a) = 1 / (1 + exp(ar)),   K2(r, a) = 1 / (r + a),

ρ1(x, C) = inf_{c ∈ C} ρ(x, c),   ρ2(x, C) = (1/|C|) Σ_{c ∈ C} ρ(x, c),   ρ3(x, C) = sup_{c ∈ C} ρ(x, c).
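For concreteness, these similarity components could be implemented as in the sketch below; the function and parameter names are illustrative assumptions, and the extent of a concept is assumed to be a set of feature vectors.

```python
# Sketch of the similarity components used in the experiments.
# `extent` is the set of training objects of a concept (feature vectors);
# rho is the Euclidean metric in F. Names are illustrative.

import math

def rho(x, c):
    """Euclidean metric in the initial feature space F."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def rho1(x, extent):                 # ρ1: distance to the closest object
    return min(rho(x, c) for c in extent)

def rho2(x, extent):                 # ρ2: mean distance to the extent
    return sum(rho(x, c) for c in extent) / len(extent)

def rho3(x, extent):                 # ρ3: distance to the farthest object
    return max(rho(x, c) for c in extent)

def K1(r, a):                        # K1(r, a) = 1 / (1 + exp(a·r))
    return 1.0 / (1.0 + math.exp(a * r))

def K2(r, a):                        # K2(r, a) = 1 / (r + a)
    return 1.0 / (r + a)

def S(x, extent, a):
    """Example similarity S(x, C) = K(ρ(x, C), a), here with K1 and ρ2."""
    return K1(rho2(x, extent), a)
```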
We use the following notation: νc is the proportion of classified objects, νr = 1 − νc is the proportion of classification refusals, et is the total error rate (refusals counted as errors), and er is the error rate among classified objects.

Table 1. SPECT and SPECTF Heart Data Set. Experimental results.

Algorithm                        νc      νr      et      er
GALOIS(1)                        1       0       0.1604  0.1604
Modified GALOIS(1)               1       0       0.0856  0.0856
GALOIS(2)                        1       0       0.0802  0.0802
Rulearner                        0.7487  0.2513  0.2727  0.0286
Hypotheses-based                 0.5936  0.4064  0.6150  0.1842
K = K1, a = 0.0125, ρ = ρ1       0.8021  0.1979  0.3155  0.1467
K = K1, a = 0.0125, ρ = ρ2       0.8021  0.1979  0.2888  0.1133
K = K1, a = 0.0125, ρ = ρ3       0.8021  0.1979  0.2834  0.1067
K = K1, a = 1, ρ = ρ2            0.7273  0.2727  0.3422  0.0956
K = K2, a = 1, ρ = ρ1            0.8021  0.1979  0.2941  0.1200
K = K2, a = 1, ρ = ρ2            0.8021  0.1979  0.3209  0.1533

Table 2. Liver Disorders Data Set. Experimental results.

Algorithm                        νc      νr      et      er
GALOIS(1)                        1       0       0.4605  0.4605
Modified GALOIS(1)               1       0       0.5590  0.5590
GALOIS(2)                        1       0       0.4359  0.4359
Rulearner                        0.9795  0.0205  0.4564  0.4450
Hypotheses-based                 0.2923  0.7077  0.8256  0.4035
K = K1, a = 1, ρ = ρ1            0.8821  0.1179  0.5231  0.4593
K = K1, a = 0.01, ρ = ρ2         0.8974  0.1026  0.5436  0.4914
K = K1, a = 0.25, ρ = ρ3         0.8872  0.1128  0.5385  0.4798
K = K2, a = 200, ρ = ρ1          0.8974  0.1026  0.4769  0.4171
K = K2, a = 150, ρ = ρ2          0.8974  0.1026  0.4564  0.3943
K = K2, a = 150, ρ = ρ3          0.8974  0.1026  0.4667  0.4057

The aim of the experiments was to compare FCA classification methods, not to achieve a low error rate on particular tasks. Hence we used a simple scaling procedure: all attributes were normalized to the interval [0, 1] and then interval-based nominal scaling was applied (with 5 intervals per attribute). This explains the high error rate of all classifiers in the second task. Individual scaling for each task (e.g. scaling with intervals of varying size) may significantly reduce the error rate, but this problem is outside the scope of this work. From the results above we may conclude that for the modifications of the hypotheses-based algorithm the number of refusals is substantially reduced together with the total error rate et. The modified GALOIS(1) classifier improved on GALOIS(1) on the first data set and performed worse on the second. This may be due to the different nature of the binary data descriptions: in the first case the 22 binary attributes were obtained from 44 real-valued ones by a complex binarization procedure, while in the second case this procedure was very simple. The choice of K(r, a) and ρ(x, C) affects only et but not νr, hence careful selection of these functions may improve classification quality.

4 Conclusions

In this paper we have formally described and experimentally studied a new approach to classification that uses both the metric information provided by the initial feature space and the order-theoretic object-attribute dependencies. We have also defined a pseudometric on an arbitrary finite lattice which has an intelligible interpretation in terms of concepts and hence can be used for comparing concepts in order to improve FCA classification methods. Further developments can focus on studying the classifiers obtained from the proposed model by fixing the support concepts system C and the similarity measure S(x, C), and on the possibility of choosing a support concepts system that allows one to construct only a part of the concept lattice.

References

1. S.O. Kuznetsov: Complexity of Learning in Concept Lattices from Positive and Negative Examples. Discrete Applied Mathematics, Vol. 142(1-3), pp. 111-125, 2004.
2. M. Sahami: Learning Classification Rules Using Lattices. In: N. Lavrac, S. Wrobel (eds.), Proc. ECML 1995, Heraclion, Crete, Greece, April 1995, pp. 343-346.
3. C. Carpineto, G. Romano: GALOIS: An Order-Theoretic Approach to Conceptual Clustering. In: Proc. ICML 1993, Amherst, USA, July 1993, pp. 33-40.
4. M. Kaytoue, S.O. Kuznetsov, A. Napoli, S. Duplessis: Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis. Information Sciences, Vol. 181, Issue 10, pp. 1989-2001, 2011.
5. S.O. Kuznetsov: Scalable Knowledge Discovery in Complex Data with Pattern Structures. In: P. Maji, A. Ghosh, M.N. Murty, K. Ghosh, S.K. Pal (eds.), Proc. 5th International Conference on Pattern Recognition and Machine Intelligence (PReMI 2013), Lecture Notes in Computer Science, Vol. 8251, Springer, pp. 30-41, 2013.
6. O. Prokasheva, A. Onishchenko, S. Gurov: Classification Methods Based on Formal Concept Analysis. In: FCAIR 2013: Formal Concept Analysis Meets Information Retrieval, workshop co-located with the 35th European Conference on Information Retrieval (ECIR 2013), March 24, 2013, Moscow, Russia. National Research University Higher School of Economics, pp. 95-104. ISSN 1613-0073.
7. Yu.I. Zhuravlev: An Algebraic Approach to Recognition and Classification Problems. In: Problems of Cybernetics, Issue 33, Nauka, Moscow, 1978, pp. 5-68 [in Russian].
8. D.A. Simovici: Betweenness, Metrics and Entropies in Lattices. In: Proc. 38th International Symposium on Multiple-Valued Logic, May 22-24, 2008, Dallas, TX, USA. IEEE Computer Society, pp. 26-31. ISSN 0195-623X.
9. K. Bache, M. Lichman: UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.