=Paper=
{{Paper
|id=Vol-2062/paper6
|storemode=property
|title=Binary Classification With Hypergraph Case-Based Reasoning
|pdfUrl=https://ceur-ws.org/Vol-2062/paper06.pdf
|volume=Vol-2062
|authors=Alexandre Quemy
|dblpUrl=https://dblp.org/rec/conf/dolap/Quemy18
}}
==Binary Classification With Hypergraph Case-Based Reasoning==
Alexandre Quemy
IBM Software Lab
Cracow, Poland
Faculty of Computing, Poznań University of Technology
Poznań, Poland
aquemy@pl.ibm.com
ABSTRACT

Binary classification is one of the most common problems in machine learning. It consists in predicting whether a given element belongs to a particular class. In this paper, a new algorithm for binary classification is proposed using a hypergraph representation. Each element to be classified is partitioned according to its interactions with the training set. For each class, the total support is calculated as a convex combination of the evidence strength of the elements of the partition. The evidence measure is pre-computed using the hypergraph induced by the training set and iteratively adjusted through a training phase. The method does not require structured information: each case is represented by a set of agnostic information atoms. Empirical validation demonstrates its high potential on a wide range of well-known datasets, and the results are compared to the literature. The time complexity is given and empirically validated. The capacity of the method to provide good accuracy with few examples is also studied.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the 21st International Conference on Extending Database Technology (EDBT) (March 26-29, 2018, Vienna, Austria) on CEUR-WS.org (ISBN 978-3-89318-078-3). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

In many real-life situations, one tries to take a decision based on previous similar situations. Each situation is described by a certain amount of information, either collected by an expert according to the relevance of the information, or automatically by some sensors or algorithms. Those situations share similarities on which to build analogies or counter-examples in order to take a new decision. Conversely, in general, if two situations do not share any common characteristic, they are totally independent, i.e. it is impossible to infer one's outcome from the other's. The purpose of supervised machine learning algorithms is to exploit the available information and the interactions between past cases or examples in order to build a model or infer the key rules needed to take correct decisions.

Due to the large variety of concrete situations that can be reduced to binary classification, it is one of the most studied problems in machine learning. In this paper, we investigate the problem of binary prediction in a supervised setting, i.e. given a history of previous situations labeled with the correct output.

This paper contributes to binary classification with a new algorithm called Hypergraph Case-Based Reasoning (HCBR). The idea is to create a hypergraph where each element of the training set is a hyperedge and the vertices are the features describing the elements. The intersections between edges create a partition for each case, and we model the support for each class as a convex combination of the elements of this partition. Each of those elements is valued according to its importance w.r.t. the set of all the hyperedges it belongs to and their respective labels.

The well-known "no free lunch" theorem states that there is no unique algorithm that can outperform all the others in supervised learning. However, the case description and representation play a preponderant role in performance, and algorithms are very often constrained by being designed for a specific representation that may not be suitable for a given problem or user data. The proposed method is totally agnostic about the representation (with the exception that the current version must work on discrete information) and does not require structured representations. For instance, it can work on atomic representations such as Bag-of-Words representations of texts, where an atom is a single word (or n-gram). In this case, two elements would not have the same number of atoms and it may be hard to properly define the case domain.

The plan of the paper is as follows: in Section 3 the problem of binary classification is formalized in the particular case of a finite countable set of information. The contribution of this paper is divided into two parts: Section 4 defines the HCBR algorithm and Section 5 presents empirical results on 8 datasets from the UC Irvine Machine Learning Repository (UCI, http://archive.ics.uci.edu/ml/index.php) and LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). The paper ends in Section 6 with a discussion about the results, future work, and possible improvements.

2 CLASSIFICATION AND HYPERGRAPH

Hypergraphs generalize graphs and can be used to represent higher-order relations, while graphs are limited to binary relations. A hypergraph is defined by a set of vertices and a collection of hyperedges, where each hyperedge is a subset of this set of vertices. Therefore, a graph is a special case of hypergraph in which each hyperedge contains exactly two vertices. We formally introduce hypergraphs in Section 4.1. Recently, hypergraphs have been used as a data representation, and some classification algorithms on hypergraphs have been proposed. The vast majority of approaches model the objects to classify as the set of vertices and construct the hyperedges as the representation of a metric. This conventional approach is known as the neighborhood-based hypergraph. The metric relies on some assumptions about the data or on a specific representation (e.g. Zernike moments and histograms of oriented gradients (HOG) to measure the distance between images in [16]), and for each vertex a hyperedge is created to represent its k-nearest neighbors [12]. The problem of classification on a hypergraph consists in labeling some unlabeled vertices, given a training set, such that all vertices in a same hyperedge have the same label. As all the vertices are known a priori, the problem is part of transductive learning. To learn the labels, the standard approach is to minimize a cost function based on a hypergraph equivalent of the graph Laplacian [16, 25] with a structural risk:

C(x) = x^T Δ x + µ ||x − y||^2    (1)
where Δ is the hypergraph Laplacian, µ > 0 a regularization factor and ||.|| a norm. The vector y represents the initial labels for all vertices, with y_i = 0 for unlabeled vertices and a negative (resp. positive) value for label -1 (resp. 1).

On the contrary, the method proposed in this paper models the elements to classify as the hyperedges, and the vertices as the different components of those elements. As far as we know, there is no previous work that uses this modeling choice. In addition, it does not require knowing all the elements before building the model: our approach is inductive. While the literature focuses on the method used to build the hyperedges, the proposed framework has a straightforward hypergraph construction and focuses instead on model selection.

3 PROBLEM DEFINITION

Consider a countable finite space of information F and its σ-algebra P(F) defined as the powerset of F. We call case an element of P(F). In practice, it is very likely that only a subset of P(F) may appear (for instance if two features encode two contradictory propositions, or if every case has the same number of features). The real class of any case is defined by the unknown measurable mapping:

J : P(F) → {0, 1}
    x ↦ J(x)

Given an example set X of n cases, the classification problem consists in minimizing the prediction errors over all the elements of P(F), that is to say, finding a decision mapping J̄ : P(F) → {0, 1} such that:

∀X ∈ P(F)^n,  min_{J̄}  Σ_{x ∈ P(F)} 1_{J(x) ≠ J̄(x)}    (2)

In other words, for any possible training set, we want to find an estimation of the exact mapping J; it is thus a functional problem. Notice that in this paper we do not consider uncertainty: if two situations are described the same way in F, then they have the same label.

4 HYPERGRAPH CASE-BASED REASONING

4.1 Representation And Projection

Before describing HCBR, we recall the definitions of a hypergraph and of an induced hypergraph. For more results on hypergraphs, we refer the reader to [3].

Definition 4.1 (Hypergraph). A hypergraph is defined by H = (V, X) with V a set of vertices and X the hyperedges, such that ∀x ∈ X, x ⊆ V.

A hypergraph can be viewed as a collection X of subsets of a given set of vertices V. It is sometimes convenient to define a hypergraph solely by a collection of sets. In this case, the set of vertices, denoted V_X, is implicitly defined as the union of the edges.

Definition 4.2 (Induced Subhypergraph). The subhypergraph H[A] induced by A ⊆ V is defined by H[A] = (A, X_A) with X_A = {x ∩ A | x ∩ A ≠ ∅}.

A set of examples X can be seen as a hypergraph H = (F, X), i.e. such that each example is a hyperedge (Figure 1). In practice, we do not need to know F, as we can always use the implicit hypergraph H = (F_X, X) where F_X is the restriction of F to the features that appear in the sample X.

Figure 1: The family E = {e_i}_i forms the partition obtained by the union of the projections of the cases and represents how the three cases share information.

Definition 4.3 (Projection defined by a subhypergraph). The projection operator π_H over a hypergraph H = (V, X) for any A ⊆ V is defined by π_H(A) = {e ⊆ A | ∃X' ⊆ X_A, e = ∩_{x ∈ X'} x}.

For a case x, the operator π takes the subhypergraph H[x] and returns the partition of the features of x defined by the intersection family induced by H[x]. Each element of π_H(x) is a (sub)set of features. For instance, in Figure 1, π_H(x_1) = {e_1, e_2, e_3, e_6}, and in Figure 2 the projection of x (in yellow) on H is represented by the family {e'_i}_i. For convenience, for a given H = (V_X, X), we denote by E_H = {e_i}_{i=1}^m = ∪_{x ∈ X} π_H(x) the partition of V_X obtained by the intersections of the edges. This partition is unique to a hypergraph.

We call discretionary features of x (w.r.t. H), denoted D_x, the (possibly empty) set of features that are not in any element of the projection π_H(x). They can be interpreted as the features of x that do not belong to any hyperedge. If a hypergraph induced by a set of examples represents a knowledge base at a given moment, the discretionary features of x are new pieces of information that were never encountered before. For instance, considering the hypergraph composed of x_1 and x_2 as illustrated by Figure 1, the set of discretionary features of x_3 is e_4. In Figure 2, the yellow case x has no discretionary features: all its features are present in at least one example.

Figure 2: The projection of x on H is represented by the family {e'_i}_i. Under the projection lies a graph representation of x with the partition elements {e_i}_i and their respective connections to the cases {x_i}_i; in particular, D_x = ∅.
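In code, the partition E_H, the projection and the discretionary features can be obtained directly from the case family: two features fall into the same element of E_H exactly when they appear in the same set of hyperedges. A minimal sketch (the function names are ours; features are assumed to be hashable atoms, e.g. the name=value identifiers of Section 5):

```python
from collections import defaultdict

def build_partition(cases):
    """E_H: group features by the exact set of hyperedges (cases)
    containing them; two features share an element of the partition
    iff they always appear together."""
    membership = defaultdict(set)
    for i, case in enumerate(cases):
        for f in case:
            membership[f].add(i)
    groups = defaultdict(set)
    for f, case_ids in membership.items():
        groups[frozenset(case_ids)].add(f)
    return list(groups.values())

def projection(x, partition):
    """pi_H(x): the elements of the partition intersecting x,
    restricted to x (the family {e'_i} of Figure 2 for a new case)."""
    return [e & x for e in partition if e & x]

def discretionary(x, partition):
    """D_x: the features of x that appear in no hyperedge."""
    seen = set().union(*partition) if partition else set()
    return x - seen
```

With the hypergraph of Figure 1 built from x_1 and x_2 only, `discretionary(x3, ...)` returns the features of e_4, as in the text.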
For any set of features x ⊆ F, we define d_H(x) = {x' ∈ X | x ∩ x' ≠ ∅}, the set of edges sharing some features with x. In particular, the set d_H(x) can be split into d_H^(1)(x) and d_H^(0)(x) depending on the label of the case, defined by d_H^(l)(x) = {x' ∈ d_H(x) | J(x') = l}. |d_H(x)| corresponds to the number of cases that x intersects, while |d_H^(l)(x)| counts the number of times the class l appears among this set of intersecting cases. Note that if x ∉ X and |d_H(x)| = 0, then x = D_x, i.e. the case x is in relation with no case in X. In the hypergraph literature, |d(x)| is called the degree of a hyperedge and its domain of definition is restricted to X. From now on, we consider only the specific hypergraph generated by the set of examples X and, for the sake of readability, we drop the subscript H.

4.2 Model Space

In many binary classification approaches, such as SVM or logistic regression, the model space consists of the set of hyperplanes of a given inner product space, usually R^M. A hyperplane in R^M is uniquely defined by M + 1 parameters. For an input vector x, its margin m(w, x) is defined as its distance to the hyperplane defined by w; it is negative if x is wrongly classified, and positive otherwise. Thus, the problem consists in finding the parametrization that minimizes a loss function summed over the training set. For instance, the Perceptron algorithm minimizes the 0-1 loss, SVM the hinge loss, and logistic regression the log loss. These functions are defined by (3):

L_01(w, x) = 1_{m(w,x) ≤ 0}
L_hinge(w, x) = max(0, 1 − m(w, x))    (3)
L_log(w, x) = ln(1 + e^{−m(w,x)})
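Since each loss in (3) depends on the input only through the margin, they can be written as one-argument functions; a minimal sketch:

```python
from math import exp, log

# The three losses of (3), as functions of the margin m = m(w, x).
def loss_01(m):
    return 1.0 if m <= 0 else 0.0

def loss_hinge(m):
    return max(0.0, 1.0 - m)

def loss_log(m):
    return log(1.0 + exp(-m))
```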
In this paper, we relax the three implicit constraints on the input vector space:

(1) we do not assume any inner product or metric embedded with the input vector space,
(2) we do not assume a fixed cardinality of the input vectors, so that it is possible to build a model and make predictions on incomplete systems (for instance with values missing in some rows of a database, or with Bag-of-Words representations),
(3) we do not assume anything about the concrete representation of the features, while in most classification methods all the features belong to the same space, often R.

As a counterpart, HCBR relies on two assumptions: (i) the correct class of an input vector is the result of its features only, and (ii) if two input vectors x and x' do not share any feature, they are independent, i.e. x cannot help to understand the correct class of x' and vice versa. As a result, HCBR produces only local models, because if a new input vector is independent of all the examples, it is impossible to generate a prediction. On the contrary, a hyperplane model is global in the sense that it can produce a prediction for any element of R^M. A concrete situation in which such assumptions are natural is a justice trial. Cases are composed of some elements, and the correct label is the result of a reasoning that may use analogies or counter-examples with a set of past cases on top of the legal texts. However, if a judge or lawyer wants to use x to justify the outcome of x', then x' must have similarities with x.

Let us formally define the model space. Given the hypergraph H = (F, X) defined by a training set X and its associated partition E = {e_i}_{i=1}^m, the relation between an example x and its class is modeled by

s_{w,µ}(x) = Σ_{i=1}^m w(e_i, x) µ(e_i)
Σ_{i=1}^m w(e_i, x_j) = 1,  ∀ 1 ≤ j ≤ n    (4)
Σ_{i=1}^m µ(e_i) = 1

where w(e_i, x_j) ≥ 0 models the importance of e_i in x_j, and µ(e_i) the support of e_i for class 1 w.r.t. the whole training set. For this reason, we call µ the intrinsic strength of e_i. Assumption (ii) implies that if e_i ∩ x = ∅ then w(e_i, x) = 0. For readability, we write s in place of s_{w,µ}.

The classification rule consists in selecting the class with the highest total support:

∀x ∈ P(F),  J̄(x) = 1 if s(x) > 0,
                    0 if s(x) ≤ 0.    (R1)

Our goal is to select w and µ such that the classification rule (R1) provides a good solution to the original problem (2). For this purpose, we proceed in three steps:

(1) define w and µ to capture as much information as possible from E for any X (Section 4.3),
(2) train the model to adjust the intrinsic strength on a specific X using the classification rule (R1) (Section 4.4),
(3) refine the classification rule (R1) to take into account the local nature of the model (Section 4.4).

A summary and high-level view of HCBR is given by Algorithm 1.

Algorithm 1 HCBR (high-level view)
1: Build H and E from X.
2: Calculate w and µ on E.
3: Adjust µ with training Algorithm 2 on X using rule (R1).
4: for each x in the test set do
5:   Calculate the projection π(x).
6:   Calculate the support s(x) using the projection.
7:   Predict using the updated rule (R2).
8: end for

While SVM aims at separating the data using a single hyperplane in the original input vector space, HCBR tries to explain the decision for each case by a convex combination expressed in a lower-dimensional space E generated by the data. In addition, the convex combinations depend on each other, since the elements of E are by definition the features resulting from the case intersections. For any case, the decision rule is a combination of the strengths of the elements of its projection on the hypergraph.
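Algorithm 1 reduces to a very small driver once the individual steps are available. The sketch below mirrors it; strength, weight, train and decide_r2 are hypothetical helpers fleshed out in the sketches of Sections 4.3 and 4.4 below (our names, not the paper's):

```python
def hcbr(train_cases, train_labels, test_cases, k, delta=1):
    """Driver mirroring Algorithm 1 (a sketch, not the reference
    implementation)."""
    E, mu0, mu1 = strength(train_cases, train_labels)        # steps 1-2
    mu = train(train_cases, train_labels, E, [mu0, mu1], k)  # step 3
    F_X = set().union(*train_cases)
    p = sum(train_labels) / len(train_labels)                # prevalence of 1
    preds = []
    for x in test_cases:                                     # steps 4-8
        s = [sum(weight(e, x, E) * mu[l][j] for j, e in enumerate(E))
             for l in (0, 1)]
        preds.append(decide_r2(x, F_X, delta, p, s[1], s[0]))
    return preds
```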
4.3 Model Selection

In this section, we define how to concretely select and calculate w and µ for a given hypergraph. To ease the notation, and for practical reasons, the measure µ is provided w.r.t. the intersection family E of the hypergraph induced by the training set X. However, µ can be defined over any partition of F.

For x and x' in P(F), a basic quantitative measure of the importance of x' w.r.t. x can be expressed by |x ∩ x'| / |x|, i.e. how much x' overlaps with x. This measure is a sort of potential for an analogy with x: potential, because it does not account for the individual importance of the features, and simply captures the idea that the larger a subset of features in a case, the higher the chance it holds the features that are important to decide the outcome.

Let us consider E and an example x ∈ X.
Definition 4.4 (Intrinsic strength w.r.t. x). The intrinsic strength of e ∈ E w.r.t. x ∈ X is defined, for l ∈ {0, 1}, by

S^(l)(e, x) = ( |d^(l)(e)| |x ∩ e| / |x| ) / Σ_{e_j ∈ E} ( |d^(l)(e_j)| |x ∩ e_j| / |x| )
            = |d^(l)(e)| |x ∩ e| / Σ_{e_j ∈ E} |d^(l)(e_j)| |x ∩ e_j|    (5)

In particular, for a given x ∈ X, S^(l)(e_i, x) = 0 if e_i is not part of the projection of x on H. Also, Σ_{e ∈ E} S^(l)(e, x) = 1, which can be interpreted as how much e accounts for the set of labels l the element x holds in its projection. The more cases of class l that e_i belongs to, the higher S^(l)(e_i, x) is. Conversely, for a fixed number of cases, the more e_i describes x, the higher S^(l)(e_i, x) is. As ∀e_i ∈ E, |d(e_i)| > 0, we have either S^(1)(e_i, x) ≠ 0 or S^(0)(e_i, x) ≠ 0. We have S^(l)(e_i, x) = 0 only when all the cases in which e_i appears are labeled with the opposite class l̄. For S^(l)(e_i, x) = 1, we need both the unanimity of labels among the cases to which e_i belongs, and e_i = x. The relation e_i = x implies that x does not share any feature with any other example, or that x is included in another example.

Definition 4.5 (Intrinsic strength w.r.t. a hypergraph H). The intrinsic strength of e ∈ E w.r.t. H = (F_X, X) is defined, for l ∈ {0, 1}, by

S^(l)(e) = ( |e| / Σ_{e' ∈ E} |e'| ) Σ_{x ∈ d^(l)(e)} S^(l)(e, x)
         = ( |e| / |F_X| ) Σ_{x ∈ d^(l)(e)} S^(l)(e, x)    (6)

The measure S^(l)(e) is simply the sum of the relevance of e for all the examples it belongs to. The more cases e belongs to, the more information it has to support one class or the other. As E represents the sets of features that always appear together, we favor the larger e ∈ E, as they hold more information to explain a decision. The normalized version is defined by:

∀l ∈ {1, 0},  µ^(l)(e) = S^(l)(e) / Σ_{e' ∈ E} S^(l)(e')    (7)

Finally, the measure µ is simply defined as the difference between the strengths of both classes:

µ(e) = µ^(1)(e) − µ^(0)(e)    (8)

For any case x, the projection π(x) on the hypergraph intersects some elements of E. We use this intersection to define w by

w(e, x) = |x ∩ e| / |x \ D_x|    (9)

i.e. we modulate the strength of an element of the partition by its importance in x w.r.t. set inclusion. In a sense, the projection on the hypergraph provides all the possible analogies with the training set, each seen either as support for a particular outcome l or as a counter-example. The intrinsic strength of each element of this partition is a measure of "how much it can be considered as a counter-example or an analogy", taking into account simultaneously all the analogies between the examples that are in relation with the considered case. It favors the importance of the relation (the more features it shares, the stronger the analogy) but also the quality of the relation (the more examples sharing the same class, the stronger the analogy).

Example: Consider the hypergraph in Figure 1 made of x_1, x_2 and x_3, arbitrarily labeled with resp. 1, 0 and 1. Arbitrarily, we assume the cardinals of the elements of E to be #e = (2, 1, 2, 3, 1, 2, 3), such that the cardinals of the cases are #x = (7, 8, 7) and |F_X| = 14. The values of |d^(l)(e)| can be summarized by the vectors #d^(0) = (0, 0, 1, 0, 1, 1, 1) and #d^(1) = (1, 2, 2, 1, 1, 1, 0). Let us calculate S^(1)(e_3):

S^(1)(e_3, x_1) = (2 × 2) / (2 × 1 + 1 × 2 + 2 × 2 + 1 × 2) = 4/10
S^(1)(e_3, x_2) = (2 × 2) / (2 × 2 + 1 × 1 + 1 × 2 + 0 × 3) = 4/7
S^(1)(e_3, x_3) = (2 × 2) / (2 × 1 + 2 × 2 + 3 × 1 + 1 × 1) = 4/10

which we interpret as e_3 being responsible for 4/10 of the support toward class 1 in x_1 and x_3, and for 4/7 in x_2. This leads to

S^(1)(e_3) = (2/14) (4/10 + 4/7 + 4/10) ≃ 0.1959

Similarly, we calculate the support for each e and both labels, which we summarize in the following vectors:

S^(1) ≃ (0.0286, 0.0286, 0.1959, 0.0643, 0.0173, 0.0694, 0.0000)
S^(0) ≃ (0.0000, 0.0000, 0.2024, 0.0000, 0.0327, 0.1071, 0.0804)

After normalization, we obtain the intrinsic strength:

µ ≃ (0.0707, 0.0707, 0.0060, 0.1591, −0.0345, −0.0818, −0.1901)^T

Let us evaluate the model on the three examples:

s(x_1) = (2/7) µ(e_1) + (1/7) µ(e_2) + (2/7) µ(e_3) + (2/7) µ(e_6) ≃ 0.0086
s(x_2) = (2/8) µ(e_3) + (1/8) µ(e_5) + (2/8) µ(e_6) + (3/8) µ(e_7) ≃ −0.0946
s(x_3) = (1/7) µ(e_2) + (2/7) µ(e_3) + (3/7) µ(e_4) + (1/7) µ(e_5) ≃ 0.0751

As a result, x_1 and x_3 are labeled 1 and x_2 is labeled 0: all three cases are correctly labeled. The highest supports are obtained for cases x_2 and x_3, while the support for x_1 is one order of magnitude lower than for x_3. This is because the discretionary features of x_3 are larger, while its intersection with x_2 is smaller than that of x_1 (3/8 of x_3 against 4/7 of x_1).

Consider a new case x as described in Figure 2. Its support is given by s(x) = Σ_{e ∈ π(x)} w(e, x) µ(e) with π(x) = {e_1, e_2, e_3, e_5, e_6, e_7}. It is impossible for x to be classified as 1, because the highest support would be obtained for a maximal intersection with e_1, e_2, e_3 and a minimal one with e_5, e_6 and e_7, such that s(x) = (1/8)(2µ(e_1) + µ(e_2) + 2µ(e_3) + µ(e_5) + µ(e_6) + µ(e_7)) ≃ −0.0103 < 0. This can be explained by the fact that the support for 1 is provided by a larger set of features (11 features versus 8). On top of that, the intersections with positive cases (e_2 and e_3) are too small (1 for e_2 compared to e.g. 3 for e_7) or also include negative cases (e_3).
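The computation of (5)-(8) can be checked against this example in a few lines of Python. The sketch below is ours, not the authors' implementation; note that, to recover the numbers of the worked example, the outer sum of (6) has to run over every case intersecting e, whatever its label:

```python
from collections import defaultdict

def build_partition(cases):
    # E: features grouped by the exact set of cases containing them
    membership = defaultdict(set)
    for i, c in enumerate(cases):
        for f in c:
            membership[f].add(i)
    groups = defaultdict(set)
    for f, ids in membership.items():
        groups[frozenset(ids)].add(f)
    return list(groups.values())

def strength(cases, labels):
    """Intrinsic strength (5)-(7): returns E and the normalized
    per-class vectors mu0, mu1; mu of (8) is their difference."""
    E = build_partition(cases)
    n_features = len(set().union(*cases))                       # |F_X|
    d = [{i for i, c in enumerate(cases) if e & c} for e in E]  # d(e)
    dl = {l: [sum(1 for i in ids if labels[i] == l) for ids in d]
          for l in (0, 1)}                                      # |d^(l)(e)|
    S = {l: [0.0] * len(E) for l in (0, 1)}
    for l in (0, 1):
        for k, e in enumerate(E):
            for i in d[k]:  # following the worked example: all cases in d(e)
                den = sum(dl[l][j] * len(E[j] & cases[i]) for j in range(len(E)))
                if den:
                    S[l][k] += dl[l][k] * len(e & cases[i]) / den  # eq. (5)
            S[l][k] *= len(e) / n_features                         # eq. (6)
    mu = {l: [v / sum(S[l]) for v in S[l]] for l in (0, 1)}        # eq. (7)
    return E, mu[0], mu[1]

# Worked example of Figure 1: e1..e7 with sizes (2, 1, 2, 3, 1, 2, 3)
e = [set(range(a, a + s)) for a, s in
     zip((0, 2, 3, 5, 8, 9, 11), (2, 1, 2, 3, 1, 2, 3))]
x1, x2, x3 = e[0]|e[1]|e[2]|e[5], e[2]|e[4]|e[5]|e[6], e[1]|e[2]|e[3]|e[4]
E, mu0, mu1 = strength([x1, x2, x3], [1, 0, 1])
# [m1 - m0 for m1, m0 in zip(mu1, mu0)] recovers, up to the ordering of E,
# mu = (0.0707, 0.0707, 0.0060, 0.1591, -0.0345, -0.0818, -0.1901)
```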
4.4 Training and Decision

In this section, we give an algorithm that adjusts the intrinsic strengths µ(e_i) in order to minimize a hinge loss, and we update the classification rule to take into account the fact that the model cannot provide predictions over the whole of F.
Once the model is built, it can be evaluated on the training set. Analogously to SVM, we define the margin as the distance to the correct class, i.e. m(w, µ, x) = s_{w,µ}(x) |J(x) − J̄(x)|. To improve the pertinence of the strengths of the elements of E, we use the iterative procedure described by Algorithm 2 to minimize the hinge loss over the training set X. When, and only when, the classification of a case x is wrong, it modifies each element of its projection by lowering its strength for the wrong class and increasing it for the proper class. The margin is split between the elements of the projection w.r.t. their respective weights in x, i.e. w(e, x). If a case x is wrongly classified, it is due to the cases intersecting it. Indeed, if x did not intersect any other example, its projection would be itself, its support toward the wrong class would be 0, and the support toward the real class would be positive; in other words, x would be correctly classified. Thus, the idea is not to directly bring the support of x to the correct class, but to gradually adjust the weights such that the neighbors are modified enough for x to become correctly classified. In particular, the procedure is sensitive to the order in which the cases are considered: a modification of the strength of any e ∈ E impacts all the cases in which it appears and potentially changes the predicted class for those cases. In addition, there is no guarantee of convergence. Future work will focus on the optimal order, or on a modification schema such that the algorithm converges. Investigating other objective functions is also considered.

Algorithm 2 Model training
Input:
- X: training set
- y: correct labels for X
- k: number of training iterations
- µ^(1), µ^(0): weights calculated with Definition 4.5
Output:
- modified vectors µ^(1), µ^(0)
1: for k iterations do
2:   for x_i ∈ X do
3:     ȳ_i ← J̄(x_i)
4:     if ȳ_i ≠ y_i then
5:       for e ∈ π(x_i) do
6:         µ^(y_i)(e) ← µ^(y_i)(e) + w(e, x_i)|µ(e)|
7:         µ^(ȳ_i)(e) ← µ^(ȳ_i)(e) − w(e, x_i)|µ(e)|
8:       end for
9:     end if
10:  end for
11: end for
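A direct reading of Algorithm 2 in Python might look as follows; this is a sketch under our data layout (E as a list of feature sets, mu as a pair of strength vectors), with the sensitivity to case order and the lack of a convergence guarantee exactly as discussed above:

```python
def weight(e, x, E):
    """w(e, x) of eq. (9): relative size of the overlap, with the
    discretionary features of x excluded from the denominator."""
    covered = x & set().union(*E)                 # x \ D_x
    return len(e & x) / len(covered) if covered else 0.0

def train(cases, labels, E, mu, k):
    """Algorithm 2: redistribute strength on misclassified cases.
    mu = [mu0, mu1], modified in place and returned."""
    for _ in range(k):
        for x, y in zip(cases, labels):
            s = sum(weight(e, x, E) * (mu[1][j] - mu[0][j])
                    for j, e in enumerate(E))
            y_hat = 1 if s > 0 else 0             # rule (R1)
            if y_hat != y:
                for j, e in enumerate(E):
                    if e & x:                     # e in pi(x)
                        delta = weight(e, x, E) * abs(mu[1][j] - mu[0][j])
                        mu[y][j] += delta         # toward the correct class
                        mu[y_hat][j] -= delta     # away from the wrong one
    return mu
```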
The measure µ is defined as the difference of support for both classes. Thus, by linearity, we can rewrite

s(x) = Σ_{i=1}^m w(e_i, x) µ^(1)(e_i) − Σ_{i=1}^m w(e_i, x) µ^(0)(e_i) = s^(1)(x) − s^(0)(x)    (10)

This form is convenient because we can control how much evidence we need to support a specific class, using the following constraints and a family η of four hyperparameters:

s^(0)(x) > max( (η̄_0 / (1 − η̄_0)) s^(1)(x), η_0 ) ≥ 0    (C0)
s^(1)(x) > max( (η̄_1 / (1 − η̄_1)) s^(0)(x), η_1 ) ≥ 0    (C1)

with η_0, η_1 ∈ R_+ and η̄_0, η̄_1 ∈ [0, 1]. The constraints on η_0 and η_1 define a minimal amount of support toward class 0 and class 1 respectively, while η̄_0 and η̄_1 require the support toward a class to be significantly higher than the support for the other class. As µ is normalized over the partition E, the values of η_0 and η_1 must be set w.r.t. the hypergraph. On the contrary, η̄_1 and η̄_0 can be set independently of the hypergraph.

Those constraints may be used to design a decision rule for new cases depending on the application or the dataset. The most generic decision rule is as follows:

J̃(x) = 1   if s(x) > 0 and C_1
        0   if s(x) ≤ 0 and C_0
        l_1 if s(x) > 0 and not C_1
        l_0 if s(x) ≤ 0 and not C_0    (12)

where l_1 and l_0 are two labels. A representation is given in Figure 3.

Figure 3: Representation of the updated decision rule (12).

These hyperparameters are intended to model the "burden of proof". For instance, in a trial, one is presumed innocent until proven guilty, which implies that the support for the class "guilty" must be beyond a reasonable doubt (where the term reasonable is defined by the jurisprudence of the applicable country). In case η_0 = η_1 = η̄_0 = η̄_1 = 0 (and l_0 = 0, l_1 = 1), the decision rule is equivalent to the original one defined by (R1). In case x has too many discretionary features, this classification rule is likely to be irrelevant: the intersection between x and F_X is too small to hold enough information and make strong analogies with x. To overcome this drawback, P(F) is split into two subsets:

• F_1 = {x ∈ P(F) | |x ∩ F_X| ≥ δ}, δ ∈ N
• F_2 = P(F) \ F_1

F_1 corresponds to the elements that share enough features with the examples. An alternative may be considered, using F_1 = {x ∈ P(F) | |D_x| / |x| ≤ δ}, δ ∈ [0, 1]; in this case, F_1 contains the elements for which enough information is provided by the examples. From our preliminary tests, the best choice depends on the dataset structure.

Finally, the decision rule for new cases is built as follows:

J̄(x) = J̃(x) if x ∈ F_1
        o_x  if x ∈ F_2    (R2)

where o_x is one draw of a random variable following a Bernoulli law with parameter p = |{x ∈ X | J(x) = 1}| / |X|, i.e. the prevalence of 1 in X. This assumes that X correctly represents P(F) (or that the prevalence does not change over time, for sequential problems in which the new cases are generated by an unknown random measure). The rationale is that if, for a case x, it is not possible to exploit the model built on the hypergraph, then we can still consider that J acts as a Bernoulli random variable and use a maximum likelihood estimation of p. In a sense, this extends the local model to the entire input vector space P(F).
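Rules (12) and (R2) fit in a few lines; a hedged sketch (our names; it assumes η̄ < 1 so the ratio stays finite, and takes p as the maximum likelihood estimate mentioned above):

```python
import random

def decide(s1, s0, eta=(0.0, 0.0), eta_bar=(0.0, 0.0), l0=0, l1=1):
    """Rule (12): keep the sign of s(x) = s1 - s0 only if the winning
    support passes the absolute (eta) and relative (eta_bar)
    thresholds; otherwise fall back to the labels l0/l1."""
    if s1 - s0 > 0:
        c1 = s1 > max(eta_bar[1] / (1 - eta_bar[1]) * s0, eta[1])
        return 1 if c1 else l1
    c0 = s0 > max(eta_bar[0] / (1 - eta_bar[0]) * s1, eta[0])
    return 0 if c0 else l0

def decide_r2(x, F_X, delta, p, s1, s0, **kwargs):
    """Rule (R2): apply rule (12) on F_1, otherwise draw
    o_x ~ Bernoulli(p), with p the prevalence of class 1."""
    if len(x & F_X) >= delta:                  # x in F_1
        return decide(s1, s0, **kwargs)
    return 1 if random.random() < p else 0    # x in F_2
```

With all thresholds at 0 and l0 = 0, l1 = 1, decide() degenerates into rule (R1), as stated in the text.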
4.5 Complexity

Model Building: Given X ∈ P(F)^n, constructing E_H can be done in O(Σ_{x ∈ X} |x|) by using a partition refinement data structure [18]. Given x ∈ X, calculating the family {S(e, x)}_{e ∈ E_H} can be done in O(|x|), by asking for each feature of x the element e it belongs to and maintaining the size of each e during the construction of E_H. Thus, calculating {S(e, x)}_{e ∈ E_H} for all x ∈ X can be done in O(Σ_{x ∈ X} |x|). On m-uniform hypergraphs (when all cases are described by m features) with n hyperedges, this becomes O(mn). Calculating {S(e)}_{e ∈ E_H} and µ can be done in O(|E_H|), because it requires iterating over E_H. An obvious upper bound on |E_H| is |F_X|, i.e. the number of vertices in the hypergraph. The worst-case cardinal of E_H is reached when each x ∈ X intersects all the others and none of them is strictly a subset of any other. Thus, |E_H| ≤ min(2^n − 1, |F_X|).

Learning Phase: For each wrongly classified x ∈ X, the training requires at most O(|x|) steps (the maximal cardinal of π(x)). The worst-case scenario is when the model wrongly classifies every x ∈ X. Thus, the learning phase worst-case complexity is O(k Σ_{x ∈ X} |x|), and on m-uniform hypergraphs it becomes O(kmn).

Model Query: For a case x ∈ P(F), the projection can be done in O(|x|). Calculating the classification rule also requires at most O(|x|) (the maximal cardinal of π(x)).
5 EXPERIMENTS

The experiments are broken down into two parts: the comparison with literature results in terms of classification performance, and the intrinsic performance (computation time and influence of the parameters).

For the first part, we measured the confusion matrix obtained over all the runs, but also after each prediction. From this confusion matrix, we calculated the standard performance indicators: accuracy, recall, specificity, precision, negative prediction value, F_1-score and Matthews correlation coefficient. Denoting by TP the number of true positives, TN the true negatives, FP the false positives and FN the false negatives, the F_1-score and the Matthews correlation coefficient are defined by:

F_1 = 2TP / (2TP + FP + FN)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The F_1-score and the Matthews correlation coefficient respectively belong to [0, 1] and [−1, 1], and the closer to 1, the better. Both take into account false positives and false negatives. In particular, the Matthews correlation coefficient is well adapted to binary classification on unbalanced datasets (with a prevalence far from 0.5). However, as many studies do not report them, we base our comparison with literature results on the accuracy, defined by:

ACC = (TP + TN) / (TP + TN + FP + FN)
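All the indicators above derive from the four confusion-matrix counts; a minimal sketch:

```python
from math import sqrt

def indicators(tp, tn, fp, fn):
    """Accuracy, F1 and Matthews correlation coefficient, as defined
    in Section 5, from one confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0  # guard degenerate matrices
    return acc, f1, mcc
```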
For the second part, we studied the computation time depending on the number of cases and the size of each case, in order to validate the worst-case complexity given in Section 4.5. We also studied the influence of the training set size on the accuracy.

The integrality of the data used for the experiments, as well as the scripts used to transform them and analyze the results, are available in the HCBR GitHub repository (https://github.com/aquemy/HCBR), and the whole experimental campaign, starting from the raw data, can be reproduced in "one click".

5.1 Classification performance

To validate the approach, 8 datasets for binary classification have been used. They are available either from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) or provided with LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html): adult, audiology, breasts, heart, mushrooms, phishing, skin and splice. For each dataset, the original features (name=value) are converted into a unique identifier, and the union of all such identifiers constitutes the information set F considered by the algorithm. Notice that there is no need to remove the rows with empty values. The dataset audiology initially contains several classes corresponding to several ear abnormalities; they are grouped to obtain two classes (normal ear and abnormal ear). Table 1 describes each dataset. The minimal, maximal and average sizes give information about the case sizes (notice that some cases have missing values in the adult, heart and mushrooms datasets). The unique features are the number of (name=value) pairs in the original dataset. In addition, two datasets have at least one real-valued attribute, as indicated by the column "Real" in Table 1.

The validation was made using a 10-fold cross-validation: each dataset was split into 10 equally sized samples; 90% was used as training set and the remaining 10% to test the classification. Each subsample is used once as the testing set, and the final metrics are calculated as the average over the 10 runs. The training set was not rebalanced and kept the original prevalence. The number of training steps k was adjusted by manual trial and error, the family η was set to η_0 = η_1 = η̄_0 = η̄_1 = 0, and δ was set to 1. Further work will focus on automatic parameter tuning.

Figure 4: Evolution of the confusion matrix during the prediction phase for the phishing dataset.
Table 1: Datasets description.

            Cases    Total Features   Unique   Min. Size   Max. Size   Average Size   Real
adult       32561    418913           118      10          13          12.87          No
audiology   200      13624            376      70          70          70             No
breasts     699      5512             80       8           8           8              No
heart       270      3165             344      12          13          12.99          Yes
mushrooms   8124     162374           106      20          20          20             No
phishing    11055    319787           808      29          29          29             No
skin        245057   734403           768      3           3           3              Yes
splice      3175     190263           237      60          60          60             No
The average confusion matrix obtained for each dataset is shown in Table 2, and the performance indicators are reported in Table 3. The proposed algorithm performs very well on a wide range of datasets, as reported by Tables 2 and 3, in particular when they contain strong predictors, as is the case for mushrooms. The accuracy ranges from 0.8206 (adult) to 1 (mushrooms), while the F_1-score is bounded by 0.8653 (heart) and 1 (mushrooms). On adult, the accuracy is only 6% higher than the prevalence: a model consisting in returning 1 for any point would be only 6% worse. This relatively poor performance in learning the underlying decision mapping is better reflected by the Matthews correlation coefficient of 0.51.

The false positives and false negatives are balanced for each dataset, despite a huge variation in the prevalence (between 20% and 64%, cf. Table 2), which may be a desirable property depending on the application. In addition, the results were obtained with non-balanced training sets, which are known to be a problem for many machine learning algorithms. Moreover, the performance indicators remain stable during the whole prediction phase, as shown in Figure 4 for the dataset phishing (similar results are observed for all datasets).

For a given case, the support is a metric of confidence in the prediction, as illustrated in Figure 5. In general, the wrongly classified cases have a smaller difference between the evidence for each class, which confirms the interest of the hyperparameters η and η̄ used in (R2).

Figure 5: Difference between the weights assigned to both classes for each decision on phishing (average). Similar results are observed for all datasets.

To compare the results of the proposed method, we explored the best results from the literature for each dataset. The comparison on audiology is not relevant due to the transformation into a two-class problem. The results are summarized in Table 4. In [9], 5 rule-based classification techniques dedicated to medical databases are compared and achieve at best 95.85% and 82.96% accuracy resp. on the breast and heart datasets. Among bayesian approaches, [13] demonstrated 97.35% (breast) and 83.00% (heart) accuracy. A 5-layer neural network with fuzzy inference rules achieved 87.78% on heart [21], while a k-NN algorithm reached 99.96% on mushrooms [8]. The best alternative among 6 rule-based classification methods achieved 95.84% on breast and 100.00% on mushrooms [11]. Using 80% of phishing as training set, an adaptive neural network achieved an average accuracy of 93.76% (among 6 alternatives), with the best run at 94.90% [23]. Still on phishing, [22] proposes to combine several classifiers and reaches 97.75% accuracy for the best hybrid model (and demonstrates 97.58% for a Random Forest classifier). On adult, a comparison of several classifiers (naive bayes, decision tree, ...) demonstrated at most 86.25% accuracy [14], while a Support Vector Machine approach reached 85.35% [15]. On splice, a method using Fuzzy Decision Trees [4] reaches 94.10% accuracy, and a neural network combined with boosting [5] 97.54%. On breast, Support Vector Machine approaches reached resp. 96.87%, 98.53% and 99.51% accuracy [1, 7, 19], against 99.26% and 97.36% for neural-network based techniques [17, 24], 98.1% for a bayesian network method [10], and 94.74% using Decision Trees [20]. On skin, [5] reports 98.94% accuracy against 99.68% for a Decision Tree based method [6]. The best result, as far as we know, is 99.92%, obtained by a Generalized Linear Model [2] (with an 80% training set).

The accuracy of HCBR is slightly lower than the best results from the literature (adult 82.06% against 86.25%, breast 96.96% against 99.51%, heart 85.77% against 87.78%, phishing 96.05% against 97.75%, splice 94.43% against 97.54%, skin 98.65% against 99.92%). We explain this by at least two factors. First, the best methods on a given dataset are often dedicated to this dataset, with ad-hoc or engineered parts, which is not the case of HCBR. Secondly, the hyperparameter family η has not been tuned but, as we show in the next section, it may have a decisive impact on performance. HCBR performed better than the Bayes classifiers in two thirds of the cases. The Bayes classifier performs better on breast by about 1%, which represents less than one case wrongly classified. Similar results are observed with Decision Trees. However, the 1% difference on skin represents an average of 7 cases misclassified in favor of Bayes. HCBR performs better than rule-based approaches (or gives similar results, on mushrooms, with an accuracy of 1) in the four considered references on three different datasets. Notice that, combined with a neural network, a rule-based approach achieves the state-of-the-art result on the heart dataset. Except for phishing, neural networks return better results (0.46 more cases correctly classified on average for breast, 71 for skin and almost 10 for splice). Last, SVM gives better results in all three cases, but appears as the best result only on two datasets.
Table 2: Average confusion matrix obtained with a 10-fold cross-validation.

            TP        FN      FP      TN        Prevalence
adult       2182.40   295.30  288.50  488.80    0.7586
audiology   12.70     0.10    0.00    6.20      0.6048
breast      23.00     1.40    0.70    43.90     0.3338
heart       12.40     1.80    1.90    9.90      0.5107
mushrooms   390.60    0.00    0.00    420.40    0.4804
phishing    595.40    23.80   19.80   465.00    0.5562
skin        4886.30   132.40  199.40  19286.90  0.2075
splice      155.70    9.10    8.50    142.70    0.5164

5.2 Intrinsic performance

In this section, we evaluate the capacity of HCBR to build an efficient model (w.r.t. the accuracy measure) with few examples. We also study the influence of the two main parameters (number of examples and size of the examples) on the computation time.

Training set size: To evaluate HCBR, we used a standard 90-10 dataset split in the previous section. Additionally, we studied the accuracy depending on the split, from 1% to 99%. For each split value, we performed 100 runs with random splits and averaged the accuracy. As shown in Figure 6, HCBR reaches its maximal accuracy with only about 15% of the dataset used as examples. The dataset adult shows constant results from 1%, while for audiology more than 40% is required to achieve over 90% accuracy. Despite the datasets having very different sizes (e.g. skin is 350 times larger than breast), the trajectories look the same. Further work should focus on determining the optimal training set size depending on the size of the cases and the underlying measure used to generate them in P(F) (as it influences the induced hypergraph, it influences the strength measure and thus the decisions). Additional experiments need to be performed to fairly compare those results to existing algorithms. However, it is commonly accepted that non-linear classifiers require very large training sets.

Figure 6: Accuracy depending on the percentage of the dataset used as training set.

Computation Time: We generated a casebase of n cases of size m, such that case i is described by {i, ..., i + m}, i.e. each case is partitioned into m elements (one discretionary feature). This is the worst-case scenario in terms of the size of E if m < n, because the family grows exponentially as a function of m or n. We split the computation time into constructing the hypergraph (and determining the intersection family) and calculating the strength of the partition. The results are illustrated in Figure 7. When n increases with m fixed, the partition grows exponentially and thus an exponential curve is expected for the strength computation. On the contrary, building the hypergraph can be done in linear time for fixed m. When n is fixed and m increases, constructing the hypergraph is still doable in linear time, as expected. Interestingly, calculating the strength has two phases: if m ≤ n, increasing m exponentially increases the time (because E increases exponentially), but if m > n, increasing m cannot result in an exponential growth of the computation time (because E grows linearly).

Figure 7: At the top, computation time to build the model (hypergraph construction + strength calculation) depending on n (m = 10); at the bottom, depending on m (n = 100). Case i is described by {i, ..., i + m}, such that each case is partitioned into m elements (one discretionary feature). Right scale for building and left scale for strength.
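The synthetic casebase is straightforward to reproduce; a sketch timing only the hypergraph/partition construction (build_partition is the helper from the Section 4.1 sketch, and the strength timing is analogous):

```python
import time

def synthetic_casebase(n, m):
    # case i = {i, ..., i + m}: consecutive cases overlap, maximizing
    # the number of intersections in E when m < n
    return [set(range(i, i + m + 1)) for i in range(n)]

def bench(n, m):
    cases = synthetic_casebase(n, m)
    start = time.perf_counter()
    E = build_partition(cases)     # hypergraph/partition construction
    return len(E), time.perf_counter() - start
```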
Hyperparameters η: We used a 90-10 split and set η_0 = η_1 to ease the visualization. Instead of using the decision function defined by (12), we did not produce a prediction when the constraints C_1 or C_0 were not respected. This can be viewed as creating a third class, unknown, for which we consider that no decision can be produced.
Table 3: Average performances obtained with a 10-fold cross-validation.

            Accuracy (std. dev.)   Recall   Specificity   Precision   Neg. Pred. Value   F_1 score   Matthews corr. coef.
adult       0.8206 (0.0094)        0.8832   0.6233        0.8808      0.6290             0.8820      0.5081
audiology   0.9947 (0.0166)        1.0000   0.9875        0.9917      1.0000             0.9957      0.9896
breasts     0.9696 (0.0345)        0.9691   0.9676        0.9479      0.9844             0.9575      0.9344
heart       0.8577 (0.0943)        0.8695   0.8437        0.8699      0.8531             0.8653      0.7178
mushrooms   1.0000 (0.0000)        1.0000   1.0000        1.0000      1.0000             1.0000      1.0000
phishing    0.9605 (0.0081)        0.9680   0.9514        0.9615      0.9590             0.9647      0.9199
skin        0.9865 (0.0069)        0.9608   0.9932        0.9736      0.9898             0.9672      0.9587
splice      0.9443 (0.0124)        0.9478   0.9398        0.9450      0.9441             0.9463      0.8884
Table 4: Previous literature results, measured as the highest accuracy obtained by the authors.

Dataset     Ref.   Type                          Accuracy
adult       [14]   Many classifiers              86.25%
            [15]   SVM                           85.35%
            HCBR                                 82.06%
breast      [1]    SVM                           99.51%
            [17]   Neural Network                99.26%
            [19]   SVM                           98.53%
            [10]   Bayes                         98.1%
            [24]   Neural Network                97.36%
            [13]   Bayes                         97.35%
            HCBR                                 96.96%
            [7]    SVM                           96.87%
            [9]    Rule-based                    95.85%
            [11]   Rule-based                    95.84%
            [20]   Decision Tree                 94.74%
heart       [21]   Neural Network + Rule-based   87.78%
            HCBR                                 85.77%
            [13]   Bayes                         83.00%
            [9]    Rule-based                    82.96%
mushrooms   [11]   Rule-based                    100.00%
            HCBR                                 100.00%
            [8]    k-NN                          99.96%
phishing    [22]   Ensemble                      97.75%
            [22]   Random Forest                 97.58%
            HCBR                                 96.05%
            [23]   Neural Network                94.90%
skin        [2]    Generalized Linear Model      99.92%
            [6]    Decision Tree                 99.68%
            [5]    Neural Network + Boosting     98.94%
            HCBR                                 98.65%
splice      [5]    Neural Network + Boosting     97.54%
            HCBR                                 94.43%
            [4]    (fuzzy) Decision Tree         94.10%
Figure 8: Influence of η on the phishing dataset.

We measured the accuracy and the test set size ratio for which a prediction was produced, for different values of η := η_0 = η_1. If J̄ correctly approximates J, increasing η should increase the accuracy while the test set ratio should remain high. Additionally, we plot the test set ratio as a function of the accuracy and calculate the Pareto frontier (the points such that improving one component would degrade the other), which represents the best compromises between accuracy and ratio. The closer the points are to (1, 1), the better. A Pareto frontier consisting of (1, 1) represents the perfect model (e.g. reached on the mushrooms dataset). Figures 8, 9, 10 and 11 provide the results for the two best and the two worst datasets, and Figure 12 shows all four Pareto frontiers. As expected, the results are better on phishing and breast. On phishing, breast and heart, the accuracy globally increases with η, while on adult the accuracy slightly decreases, indicating a poor influence of the hyperparameters and model.

Notice that for certain values of η it is possible to reach 100% accuracy with heart, while it is not with breast. Also, for high values of η, we observe a fall in accuracy for breast. We suspect these two phenomena appear because we used the same value for η_0 and η_1.

More work is required to fully study the influence of the hyperparameters (η_0, η_1, η̄_0, η̄_1) and how to select l_0 and l_1. We believe this is the key to improving the overall performance, and that it is possible to derive the best values from the training set. Also, a comparison with binary classification methods that provide a prediction confidence metric is necessary.
Figure 9: Influence of η on the breast dataset.

Figure 10: Influence of η on the heart dataset.

Figure 11: Influence of η on the adult dataset.

Figure 12: Pareto frontiers comparison.

6 CONCLUSION

This paper presented HCBR, a method for binary classification using a hypergraph representation of the information and building a convex combination over the induced partition to determine the support for each class. The general framework introduced by (4) is instantiated in Section 4.3, where the support is determined using all the interactions between the hyperedges. Beyond this specific implementation, one can imagine different model selection methods, e.g. using some assumptions on the data.

However, being totally agnostic about the data representation is convenient compared to many methods such as SVM. It allows combining information from multiple sources by simply stacking the information. It does not require transforming the data to fit the algorithm, often by designing specific ad-hoc metrics.

HCBR has been tested on 8 well-known datasets and demonstrated accuracies similar to the best results from the literature. The algorithm has shown strong stability in terms of accuracy and true positive and negative rates during the whole prediction phase. We showed that the difference in class support is a good confidence indicator for the prediction. We demonstrated the capacity to obtain a good accuracy using very few examples from the dataset (10% to 15% of the training set) without balanced classes. This last property is very important for robustness, as in practice datasets are rarely balanced. Finally, we empirically validated the exponential worst-case complexity.

This proof of concept raises many questions and offers many axes of improvement. First of all, it seems relatively easy to extend the method to several classes, but this needs an additional empirical validation. As most of the computational effort is spent calculating the class support, adding more classes will linearly increase the computation time; thus, a faster algorithm or an approximation of the main measure should be investigated. The solution may come from exploring the feature selection capacity of HCBR. Indeed, by the hypergraph construction, it may be possible to remove from the partition some elements that do not participate enough (e.g. that do not appear in enough cases at the same time), reducing the computation time. Additionally, we plan to investigate how to generate an explanation for each prediction, and one for the decision function J itself, using not only the convex combination and the strength of the partition elements, but also the links between cases, in a similar way to how a lawyer may use past cases to make analogies or counter-examples. We also work on an online version of HCBR where the hypergraph is constructed case after case, including forgetting some old past cases (which would allow handling non-stationary environments). It seems also possible not only to add new examples dynamically, but also new vertices (i.e. adding some pertinent information to some cases) without regenerating the whole model from scratch.

Empirically, further experiments should focus on more unstructured datasets (for instance for text classification). As stated previously, strategies for hyperparameter tuning are also a priority. Last but not least, we would like to answer some questions: is it possible to find a method to adjust the strength measure such that the accuracy on the training set is 1? Can we provide quality bounds depending on the initial hypergraph, and thus on the dataset? How should continuous values be handled?
ACKNOWLEDGMENT

The author warmly thanks Pr. Robert Wrembel, Poznan University of Technology, and Dr. Jean-François Puget, IBM Analytics, for their useful suggestions and advice to improve this paper.

REFERENCES

[1] M. F. Akay. 2009. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications 36, 2 (2009), 3240–3247.
[2] S. Basterrech, A. Mesa, and N.-T. Dinh. 2015. Generalized Linear Models Applied for Skin Identification in Image Processing. In Intelligent Data Analysis and Applications. Springer, 97–107.
[3] C. Berge. 1984. Hypergraphs: combinatorics of finite sets. Vol. 45. Elsevier.
[4] R. B. Bhatt, G. Sharma, A. Dhall, and S. Chaudhury. 2009. Efficient Skin Region Segmentation Using Low Complexity Fuzzy Decision Tree Model. In 2009 Annual IEEE India Conference. 1–4.
[5] F. Ö. Çatak. 2017. Classification with boosting of extreme learning machine over arbitrarily partitioned data. Soft Computing 21, 9 (2017), 2269–2281.
[6] M. T. Cazzolato and M. X. Ribeiro. 2013. A statistical decision tree algorithm for medical data stream mining. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems. 389–392.
[7] H.-L. Chen, B. Yang, J. Liu, and D.-Y. Liu. 2011. A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis. Expert Systems with Applications 38, 7 (2011), 9014–9022.
[8] S. Das. 2001. Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 74–81.
[9] R. P. Datta and S. Saha. 2016. Applying rule-based classification techniques to medical databases: an empirical study. International Journal of Business Intelligence and Systems Engineering 1, 1 (2016), 32–48.
[10] A. Fallahi and S. Jafari. 2011. An expert system for detection of breast cancer using data preprocessing and bayesian network. International Journal of Advanced Science and Technology 34 (2011), 65–70.
[11] W. Hadi, G. Issa, and A. Ishtaiwi. 2017. ACPRISM: Associative classification based on PRISM algorithm. Information Sciences 417, Supplement C (2017), 287–300.
[12] Y. Huang, Q. Liu, S. Zhang, and D. N. Metaxas. 2010. Image retrieval via probabilistic hypergraph ranking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 3376–3383.
[13] L. Jiang. 2012. Learning Instance Weighted Naive Bayes from Labeled and Unlabeled Data. J. Intell. Inf. Syst. 38, 1 (2012), 257–268.
[14] G. Kou, Y. Lu, Y. Peng, and Y. Shi. 2012. Evaluation of classification algorithms using MCDM and rank correlation. International Journal of Information Technology & Decision Making 11, 01 (2012), 197–225.
[15] Y.-J. Lee and O. L. Mangasarian. 2001. SSVM: A Smooth Support Vector Machine for Classification. Computational Optimization and Applications 20, 1 (2001), 5–22.
[16] Lifan Su, Yue Gao, Xibin Zhao, Hai Wan, Ming Gu, and Jiaguang Sun. 2017. Vertex-Weighted Hypergraph Learning for Multi-View Object Classification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 2779–2785.
[17] A. Marcano-Cedeño, J. Quintanilla-Domínguez, and D. Andina. 2011. WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Systems with Applications 38, 8 (2011), 9573–9579.
[18] R. Paige and R. E. Tarjan. 1987. Three Partition Refinement Algorithms. SIAM J. Comput. 16, 6 (1987), 973–989.
[19] K. Polat and S. Güneş. 2007. Breast cancer diagnosis using least square support vector machine. Digital Signal Processing 17, 4 (2007), 694–701.
[20] J. R. Quinlan. 1996. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4 (1996), 77–90.
[21] A. M. Sagir and S. Sathasivam. 2017. A Hybridised Intelligent Technique for the Diagnosis of Medical Diseases. Pertanika Journal of Science & Technology 25, 2 (2017).
[22] M. A. U. H. Tahir, S. Asghar, A. Zafar, and S. Gillani. 2016. A Hybrid Model to Detect Phishing-Sites Using Supervised Learning Algorithms. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI). 1126–1133.
[23] F. Thabtah, R. M. Mohammad, and L. McCluskey. 2016. A dynamic self-structuring neural network model to combat phishing. In 2016 International Joint Conference on Neural Networks (IJCNN). 4221–4226.
[24] E. D. Übeyli. 2007. Implementing automated diagnostic systems for breast cancer detection. Expert Systems with Applications 33, 4 (2007), 1054–1062.
[25] Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. 2007. Learning with Hypergraphs: Clustering, Classification, and Embedding. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.). MIT Press, 1601–1608.