<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Binary Classification With Hypergraph Case-Based Reasoning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alexandre Quemy IBM Software Lab Cracow, Poland Faculty of Computing, Poznań University of Technology Poznań</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Binary classification is one of the most common problems in machine learning. It consists in predicting whether a given element belongs to a particular class. In this paper, a new algorithm for binary classification is proposed, using a hypergraph representation. Each element to be classified is partitioned according to its interactions with the training set. For each class, the total support is calculated as a convex combination of the evidence strengths of the elements of the partition. The evidence measure is pre-computed using the hypergraph induced by the training set and iteratively adjusted through a training phase. The method does not require structured information, each case being represented by a set of agnostic information atoms. Empirical validation demonstrates its high potential on a wide range of well-known datasets, and the results are compared to the literature. The time complexity is given and empirically validated. Its capacity to provide good accuracy with few examples is also studied.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
<p>In many real-life situations, one tries to make a decision based on previous similar situations. Each situation is described by a certain amount of information, either collected by an expert according to its relevance, or automatically by sensors or algorithms. Those situations share similarities from which to draw analogies or counter-examples in order to make a new decision. Conversely, in general, if two situations do not share any common characteristic, then they are totally independent, i.e. it is impossible to infer one's outcome from the other's. The purpose of supervised machine learning algorithms is to exploit the available information and the interactions between past cases or examples in order to build a model or infer the key rules for making correct decisions.</p>
      <p>Due to the large variety of concrete situations that can be
reduced to binary classification, it is one of the most studied
problems in machine learning. In this paper, we investigate the
problem of binary prediction under a supervised setting, i.e. given
a history of previous situations labeled with the correct output.</p>
      <p>This paper contributes to binary classification with a new
algorithm called Hypergraph Case-Based Reasoning (HCBR). The
idea is to create a hypergraph where each element of the training
set is a hyperedge and vertices are represented by the features
describing the elements. The intersections between edges create
a partition for each case, and we model the support for each class
as a convex combination of the elements of this partition. Each
of those elements is valued according to its importance w.r.t. to
the set of all the hyperedges it belongs to and their respective
labels.</p>
<p>The well-known "no free lunch" theorem states that no single algorithm can outperform all the others in supervised learning. However, the case description and representation play a preponderant role in performance, yet algorithms are very often constrained by being designed for a specific representation that may not be suitable for a given problem or user data. The proposed method is totally agnostic about the representation1 and does not require structured representations. For instance, it can work on atomic representations such as Bag-of-Words representations for texts, where an atom is a single word (or n-gram). In this case, two cases would not have the same number of atoms and it may be hard to properly define the case domain.</p>
<p>The plan of the paper is as follows: in Section 3, the problem of binary classification is formalized in the particular case of a finite countable set of information. The contribution of this paper is divided into two parts: Section 4 defines the HCBR algorithm and Section 5 presents empirical results on 8 datasets from the UC Irvine Machine Learning Repository (UCI)2 and LIBSVM3. The paper ends in Section 6 with a discussion about the results, future work, and possible improvements.</p>
<p>
        CLASSIFICATION AND HYPERGRAPH
Hypergraphs generalize graphs and can be used to represent higher-order relations, while graphs are limited to binary relations. A hypergraph is defined by a set of vertices and a collection of hyperedges, where each hyperedge is a subset of this set of vertices. Therefore, a graph is a special case of hypergraph for which each hyperedge contains exactly two vertices. We formally introduce hypergraphs in Section 4.1. Recently, hypergraphs have been used as a data representation, and some classification algorithms on hypergraphs have been proposed. The vast majority of approaches model the objects to classify as the set of vertices and construct the hyperedges as the representation of a metric. This conventional approach is known as neighborhood-based hypergraph. The metric relies on some assumptions on the data or a specific representation (e.g. Zernike moments and histograms of oriented gradients (HOG) to measure the distance between images in [16]), and for each vertex, a hyperedge is created to represent its k-nearest neighbors [12]. The problem of classification on hypergraphs consists in labeling some unlabeled vertices given a training set, such that all vertices in a same hyperedge have the same label. As all the vertices are known a priori, the problem is part of transductive learning. To learn the labels, the standard approach is to minimize a cost function based on a hypergraph equivalent of the graph Laplacian [16, 25] with a structural risk:
C(x) = x^t ∆ x + µ ||x − y||^2
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
where ∆ is the hypergraph Laplacian, µ &gt; 0 a regularization factor and ||.|| a norm. The vector y represents the initial labels for all vertices, with yi = 0 for unlabeled vertices, and a negative (resp. positive) value for label -1 (resp. 1).
1With the exception that the current version must work on discrete information.
2http://archive.ics.uci.edu/ml/index.php
3https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
</p>
<p>On the contrary, the method proposed in this paper models the elements to classify as the hyperedges, and the vertices as the different components of those elements. As far as we know, there is no previous work that uses this modeling choice. In addition, it does not require knowing all the elements before building the model: our approach is inductive. While the literature focuses on the method to build the hyperedges, the proposed framework has a straightforward hypergraph construction and rather focuses on model selection.</p>
    </sec>
    <sec id="sec-2">
      <title>PROBLEM DEFINITION</title>
<p>Consider a countable finite space of information F and its σ-algebra P(F), defined as the powerset of F. We call case an element of P(F). In practice, it is very likely that only a subset of P(F) may appear (for instance, if two features encode two contradictory propositions, or if every case has the same number of features). The real class of any case is defined by the unknown measurable mapping:</p>
      <p>J : P(F) → {0, 1}</p>
      <p>x 7→ J (x )</p>
<p>Given an example set X of n cases, the classification problem consists in minimizing the prediction errors over all the elements of P(F), that is to say, finding a decision mapping J̄ : P(F) → {0, 1} such that:</p>
      <p>
        ∀X ∈ P(F)^n, min_J̄ Σ_{x ∈ P(F)} 1_{J(x) ≠ J̄(x)}
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
      </p>
      <p>In other words, for any possible training set, we want to find an estimation of the exact mapping J, and thus it is a functional problem. Notice that in this paper we do not consider uncertainty: if two situations have the same description in F, then they have the same label.</p>
      <p>HYPERGRAPH CASE-BASED REASONING</p>
      <p>Representation And Projection
Before describing HCBR, we recall the definition of a hypergraph
and induced hypergraph. For more results on hypergraphs, we
refer the reader to [3].</p>
<p>Definition 4.1 (Hypergraph). A hypergraph is defined by H = (V, X) with V a set of vertices and X a collection of hyperedges such that ∀x ∈ X, x ⊆ V.</p>
      <p>A hypergraph can be viewed as a collection of subsets X of a
given set of vertices V . It is sometimes convenient to define a
hypergraph solely by a collection of sets. In this case, the set of
vertices, denoted VX , is implicitly defined as the union of edges.</p>
<p>Definition 4.2 (Induced Subhypergraph). The subhypergraph H[A] induced by A ⊆ V is defined by H[A] = (A, X_A) with X_A = {x ∩ A | x ∩ A ≠ ∅}.</p>
      <p>A set of examples X can be seen as a hypergraph H = (F, X ),
i.e. such that each example is a hyperedge (Figure 1). In practice,
we do not need to know F as we can always use the implicit
hypergraph H = (FX , X ) where FX is the restriction of F to the
features that appear in the sample X .</p>
<p>For a case x, the operator π takes the subhypergraph H[x] and returns the partition of the features of x defined by the intersection family induced by H[x]. Each element of π_H(x) is a (sub)set of features. For instance, in Figure 1, π_H(x1) = {e1, e2, e3, e6}, and in Figure 2 the projection of x (in yellow) on H is represented by the family {e′_i}_i. For convenience, for a given H = (V_X, X), we denote by E_H = {e_i}_{i=1}^m the set ∪_{x ∈ X} π_H(x), that is to say, the partition of V_X obtained by the intersection of the edges. This partition is unique to a hypergraph.</p>
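<p>As an illustration of how E_H and the projection π can be obtained, here is a naive sketch in Python (the function names are ours, not from the paper): two features belong to the same element of the partition if and only if they appear in exactly the same set of examples.

```python
from collections import defaultdict

def partition(examples):
    """Compute E_H: two features fall in the same cell iff they belong
    to exactly the same set of examples (hyperedges)."""
    membership = defaultdict(set)   # feature -> indices of examples containing it
    for i, x in enumerate(examples):
        for f in x:
            membership[f].add(i)
    cells = defaultdict(set)        # edge signature -> cell of the partition
    for f, edges in membership.items():
        cells[frozenset(edges)].add(f)
    return [frozenset(c) for c in cells.values()]

def projection(x, cells):
    """pi_H(x): the non-empty intersections of case x with the cells."""
    return [c & x for c in cells if c & x]
```

For X = {{a, b, c}, {c, d}}, partition returns the three cells {a, b}, {c} and {d}, and the projection of a new case {b, c, z} is [{b}, {c}], z being a discretionary feature.</p>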
      <p>We call discretionary features of x (w.r.t. H ) the set (possibly
empty) of features that are not in any element of the projection
πH (x ), denoted Dx . It can be interpreted as the features of x that
do not belong to any hyperedge. If a hypergraph induced by a
set of examples represents a knowledge base at a given moment,
the discretionary features of x are new pieces of information that
were never encountered before. For instance, considering the
hypergraph composed of x1 and x2 as illustrated by Figure 1, the
set of discretionary features of x3 is e4. In Figure 2, the yellow case x has no discretionary features: all its features are present in at least one example.</p>
<p>[Figure 1: Hypergraph induced by the examples x1, x2, x3 over the feature space F, with the partition elements e1 to e7.]</p>
<p>[Figure 2: Projection of a new case x (in yellow) onto the hypergraph formed by x1, x2, x3, yielding the family {e′_i}.]</p>
<p>
        For any set of features x ⊆ F, we define d_H(x) = {x′ ∈ X | x ∩ x′ ≠ ∅}, the set of edges sharing some features with x. In particular, the set d_H(x) can be split into d_H^(1)(x) and d_H^(0)(x) depending on the label of the case, defined by d_H^(l)(x) = {x′ ∈ d_H(x) | J(x′) = l}. |d_H(x)| corresponds to the number of cases the case x intersects, while |d_H^(l)(x)| counts the number of times the class l appears among this set of intersecting cases. Note that if x ∉ X and |d_H(x)| = 0, then x = D_x, i.e. the case x is in relation with no case in X. In the hypergraph literature, |d(x)| is called the degree of a hyperedge and its domain of definition is restricted to X.
      </p>
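<p>The sets d_H(x), d_H^(1)(x) and d_H^(0)(x) can be computed directly from their definitions; a minimal sketch (names are ours):

```python
def degrees(x, examples, labels):
    """d(x): the examples sharing at least one feature with x,
    split by label into d^(1)(x) and d^(0)(x)."""
    d = [i for i, ex in enumerate(examples) if ex & x]
    d1 = [i for i in d if labels[i] == 1]
    d0 = [i for i in d if labels[i] == 0]
    return d, d1, d0
```

|d_H(x)| is then simply len(d), the number of intersecting cases.</p>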
<p>
        From now on, we consider only the specific hypergraph generated by the set of examples X. For the sake of readability, we drop the subscript H. In many binary classification approaches, such as SVM or logistic regression, the model space consists of the set of hyperplanes of a given inner product space, usually R^M. A hyperplane in R^M is uniquely defined by M + 1 parameters. For an input vector x, its margin m(w, x) is defined as its distance to the hyperplane defined by w, and is negative in case x is wrongly classified, positive otherwise. Thus, the problem consists in finding the best parametrization to minimize a loss function summed over the training set. For instance, the Perceptron algorithm minimizes the 0-1 loss, SVM the hinge loss, and logistic regression the log loss. Those functions are defined by (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ).
      </p>
<p>
        L01(w, x) = 1_{m(w,x) ≤ 0}
Lhinge(w, x) = max(0, 1 − m(w, x))
Llog(w, x) = ln(1 + e^{−m(w,x)})
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
      </p>
<p>
        In this paper, we relax three implicit constraints on the input vector space:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) we do not assume any inner product or metric embedded with the input vector space,
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) we do not fix the cardinality of the input vectors, such that it is possible to build a model and make predictions on incomplete systems (for instance, missing values in some rows of a database, or a Bag-of-Words representation),
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) we do not assume anything about the concrete representation of the features, while in most classification methods all the features belong to the same space, often R.
      </p>
<p>As a counterpart, HCBR relies on two assumptions: (i) the correct class of an input vector is the result of its features only, and (ii) if two input vectors x and x′ do not share any feature, they are independent, i.e. x cannot help to understand the correct class of x′ and vice versa. As a result, HCBR produces only local models, because if a new input vector is independent of all the examples, it is impossible to generate a prediction. On the contrary, a hyperplane model is global in the sense that it can produce a prediction for any element of R^M. A concrete situation for which such assumptions are natural is a justice trial. Cases are composed of some elements, and the correct label is the result of a reasoning that may use analogies or counter-examples with a set of past cases on top of the legal texts. However, if a judge or lawyer wants to use x to justify the outcome of x′, then x′ must have similarities with x.</p>
<p>
        Let us formally define the model space. Given the hypergraph H = (F, X) defined by a training set X and its associated partition E = {e_i}_{i=1}^m, the relation between an example x and its class is modeled by
s_{w,µ}(x) = Σ_{i=1}^m w(e_i, x) µ(e_i)
Σ_{i=1}^m w(e_i, x_j) = 1, ∀1 ≤ j ≤ n
Σ_{i=1}^m µ(e_i) = 1
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
where w(e_i, x_j) ≥ 0 models the importance of e_i in x_j, and µ(e_i) the support of e_i for class 1 w.r.t. the whole training set. For this reason, we call µ the intrinsic strength of e_i. Assumption (ii) implies that if e_i ∩ x = ∅ then w(e_i, x) = 0. For readability, we write s in place of s_{w,µ}.
      </p>
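<p>A minimal sketch of the model above: the support of a case is the convex combination of the intrinsic strengths of the cells its projection hits. Here w(e, x) is taken as the share of the non-discretionary features of x covered by e; the names are ours, and the concrete choice of w and µ is the object of Section 4.3.

```python
def support(x, cells, mu):
    """s_{w,mu}(x) = sum_e w(e, x) * mu(e), where mu maps each cell of
    the partition E to its intrinsic strength, and w(e, x) is the share
    of the non-discretionary features of x covered by e."""
    seen = frozenset().union(*cells)   # features known from the training set
    covered = x & seen                 # x \ D_x
    if not covered:
        return 0.0                     # x is independent of every example
    return sum(len(x & e) / len(covered) * mu[e] for e in cells)
```

With two cells {a, b} and {c} of strengths 0.5 and −0.25, the case {a, c, z} gets support (1/2)·0.5 + (1/2)·(−0.25) = 0.125.</p>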
<p>The classification rule consists in selecting the class with the highest total support.</p>
<p>
        ∀x ∈ P(F), J̄(x) = 1 if s(x) &gt; 0, and J̄(x) = 0 if s(x) ≤ 0. (R1)
Our goal is to select w and µ such that the classification rule (R1) provides a good solution to the original problem (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ). For this purpose, we proceed in three steps:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) define w and µ to capture as much information as possible from E for any X (Section 4.3),
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) train the model to adjust the intrinsic strength on a specific X using the classification rule (R1) (Section 4.4),
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) refine the classification rule (R1) to take into account the local nature of the model (Section 4.4).
      </p>
      <p>A summary and high level view of HCBR is given by Algorithm 1.
Algorithm 1 HCBR (High level view)
1: Build H and E from X .
2: Calculate w and µ on E.
3: Adjust µ with training algorithm 2 on X using rule (R1)
4: for each x in test set do
5: Calculate the projection π (x ).
6: Calculate the support s(x ) using the projection.
7: Predict using the updated rule (R2).
8: end for</p>
<p>While SVM aims at separating the data using a single hyperplane in the original input vector space, HCBR tries to explain the decision for each case by a convex combination expressed in a lower-dimensional space E generated by the data. In addition, the convex combinations depend on each other, since the elements of E are by definition the features resulting from the case intersections. For any case, the decision rule is a combination of the strengths of the elements of its projection on the hypergraph.</p>
    </sec>
    <sec id="sec-3">
      <title>Model Selection</title>
      <p>In this section, we define how to concretely select and calculate
w and µ for a given hypergraph. To ease the notation and for
practical reasons, the measure µ is provided w.r.t. the
intersection family E of the hypergraph induced by the training set X .
However, µ can be defined over any partition of F.</p>
<p>For x and x′ in P(F), a basic quantitative measure of the importance of x′ w.r.t. x can be expressed by |x ∩ x′| / |x|, i.e. how much x′ overlaps with x. This measure is a sort of potential for an analogy with x: potential, because it does not account for the individual importance of the features and simply holds the idea that the larger a subset of features in a case, the higher the chance it holds important features to decide the outcome.</p>
      <p>Let us consider E and an example x ∈ X .</p>
<p>Definition 4.4 (Intrinsic strength w.r.t. x). The intrinsic strength of e ∈ E w.r.t. x ∈ X is defined, ∀l ∈ {0, 1}, by
S^(l)(e, x) = (|d^(l)(e)| |x ∩ e| / |x|) / (Σ_{e_j ∈ E} |d^(l)(e_j)| |x ∩ e_j| / |x|)
            = |d^(l)(e)| |x ∩ e| / (Σ_{e_j ∈ E} |d^(l)(e_j)| |x ∩ e_j|)
In particular, for a given x ∈ X, S^(l)(e_i, x) = 0 if e_i is not part of the projection of x on H. Also, Σ_{e ∈ E} S^(l)(e, x) = 1, which can be interpreted as how much e accounts for the set of labels l that the element x holds in its projection. The more e_i belongs to cases with the same class l, the higher S^(l)(e_i, x) is. Conversely, for a fixed number of cases, the more e_i describes x, the higher S^(l)(e_i, x) is. As ∀e_i ∈ E, |d(e_i)| &gt; 0, we have either S^(1)(e_i, x) ≠ 0 or S^(0)(e_i, x) ≠ 0. We have S^(l)(e_i, x) = 0 only when all the cases in which e_i appears are labeled with the opposite class l̄. For S^(l)(e_i, x) = 1, it needs both the unanimity of labels for the cases to which e_i belongs and e_i = x. The relation e_i = x implies that x does not share any feature with any other example, or that x is included in another example.</p>
<p>Definition 4.5 (Intrinsic strength w.r.t. a hypergraph H). The intrinsic strength of e ∈ E w.r.t. H = (F_X, X) is defined, ∀l ∈ {1, 0}, by
S^(l)(e) = (|e| / Σ_{e′ ∈ E} |e′|) Σ_{x ∈ d(e)} S^(l)(e, x) = (|e| / |F_X|) Σ_{x ∈ d(e)} S^(l)(e, x)
The measure S^(l)(e) is simply the sum of the relevance over all the examples e belongs to. The more e belongs to several cases, the more information it has to support one class or another. As E represents the sets of features that always appear together, we favor the larger e ∈ E, as they hold more information to explain a decision. The normalized version is defined by:
∀l ∈ {1, 0}, µ^(l)(e) = S^(l)(e) / Σ_{e′ ∈ E} S^(l)(e′)</p>
<p>Finally, the measure µ is simply defined by the difference between the strengths of both classes:
µ(e) = µ^(1)(e) − µ^(0)(e)</p>
<p>
        For any case x, the projection π(x) on the hypergraph intersects with some elements of E. We use this intersection to define w by
w(e, x) = |x ∩ e| / |x \ D_x|
(
        <xref ref-type="bibr" rid="ref9">9</xref>
        )
i.e. we modulate the strength of an element of the partition by its importance in x w.r.t. set inclusion. In a sense, the projection on the hypergraph provides all the possible analogies with the training set, each either seen as a support for a particular outcome l or as a counter-example. The intrinsic strength of each element of this partition is a measure of “how much it can be considered as a counter-example or an analogy”, taking into account simultaneously all the analogies between the examples that are in relation with the considered case. It favors the importance of the relation (the more features it shares, the stronger the analogy) but also the quality of the relation (the more examples sharing the same class, the stronger the analogy).
      </p>
<p>
        Example: Consider the hypergraph in Figure 1 made of x1, x2 and x3, arbitrarily labeled with resp. 1, 0 and 1. Arbitrarily, we assume the cardinals of the elements of E to be #e = (2, 1, 2, 3, 1, 2, 3), such that the cardinals of the cases are #x = (7, 8, 7) and |F_X| = 14. The values of |d^(l)(e)| can be summarized by the vectors #d^(0) = (0, 0, 1, 0, 1, 1, 1) and #d^(1) = (1, 2, 2, 1, 1, 1, 0). Let us calculate S^(1)(e3):
      </p>
<p>
        S^(1)(e3, x1) = (2 × 2) / (1 × 2 + 2 × 1 + 2 × 2 + 1 × 2) = 4/10
S^(1)(e3, x2) = (2 × 2) / (2 × 2 + 1 × 1 + 1 × 2 + 0 × 3) = 4/7
S^(1)(e3, x3) = (2 × 2) / (2 × 1 + 2 × 2 + 1 × 3 + 1 × 1) = 4/10
which we interpret as e3 being responsible for 4/10 of the support toward class 1 in x1 and x3, and for 4/7 in x2. This leads to
S^(1)(e3) = (2/14)(4/10 + 4/7 + 4/10) ≃ 0.1959
      </p>
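<p>The arithmetic of this example can be checked with a direct transcription of Definition 4.4; the projections of x2 and x3 below ({e3, e5, e6, e7} and {e2, e3, e4, e5}) are our reconstruction from Figure 1 and the #d vectors:

```python
# |e_i| and |d^(1)(e_i)| for e_1..e_7, as given in the example
size = {1: 2, 2: 1, 3: 2, 4: 3, 5: 1, 6: 2, 7: 3}
d1 = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 6: 1, 7: 0}
# Projections pi(x_i); pi(x1) is given in the text, the others are inferred
proj = {"x1": [1, 2, 3, 6], "x2": [3, 5, 6, 7], "x3": [2, 3, 4, 5]}

def S1(e, x):
    """S^(1)(e, x): relative strength of cell e in case x (Definition 4.4)."""
    return d1[e] * size[e] / sum(d1[j] * size[j] for j in proj[x])

# Aggregate over the three cases containing e3, as in the text
S1_e3 = size[3] / 14 * (S1(3, "x1") + S1(3, "x2") + S1(3, "x3"))
```

S1(3, "x1") and S1(3, "x3") both evaluate to 4/10, S1(3, "x2") to 4/7, and S1_e3 to approximately 0.1959, matching the computation above.</p>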
<p>Similarly, we calculate the support for each e and both labels, and summarize them into vectors. As a result, x1 and x3 are labeled 1 and x2 is labeled 0: all three cases are correctly labeled. The highest support is given for cases x2 and x3, while the support for x1 is one order of magnitude lower than for x3. This is because the discretionary features of x3 are larger, while its intersection with x2 is lower than that of x1 (3/8 of x3 against 4/7 of x1).</p>
<p>Consider a new case x as described in Figure 2. Its support is given by s(x) = Σ_{e ∈ π(x)} w(e, x) µ(e) with π(x) = {e1, e2, e3, e5, e6, e7}. It is impossible for x to be classified as 1, because the highest support would be obtained for a maximal intersection with e1, e2, e3 and a minimal one with e5, e6 and e7, such that s(x) = (1/8)(2µ(e1) + µ(e2) + 2µ(e3) + µ(e5) + µ(e6) + µ(e7)) ≃ −0.0103 &lt; 0. This can be explained by the fact that the support for 1 is provided by a larger set of features (11 features versus 8). On top of that, the intersections between positive cases (e2 and e3) are either too small (1 for e2 compared to e.g. 3 for e7) or also include negative cases (e3).</p>
    </sec>
    <sec id="sec-4">
      <title>4.4 Training and Decision</title>
<p>In this section, we give an algorithm to adjust the intrinsic strength µ(e_i) in order to minimize a hinge loss, and we update the classification rule to take into account the fact that the model cannot provide predictions over the whole of F.</p>
<p>Once the model is built, it can be evaluated on the training set. Analogously to SVM, we define the margin as the distance to the correct class, i.e. m(w, µ, x) = s_{w,µ}(x) |J(x) − J̄(x)|. To improve the pertinence of the strength of the elements of E, we use the iterative procedure described by Algorithm 2 to minimize the hinge loss over the training set X. When, and only when, the classification of a case x is wrong, it modifies each element of its projection by lowering its strength for the wrong class and increasing it for the proper class. The margin is split between the elements of the projection w.r.t. their respective weights in x, i.e. w(e, x). If a case x is wrongly classified, it is due to the cases intersecting with it. Indeed, if x did not intersect with any other example, its projection would be itself, its support toward the wrong class would be 0, and its support for the real class would be positive. In other words, x would be correctly classified. Thus, the idea is not to directly bring the support of x to the correct class, but to gradually adjust the weights such that the neighbors are modified enough for x to be correctly classified. In particular, the procedure is sensitive to the order in which the cases are considered: a modification in the strength of any e ∈ E impacts all the cases in which it appears and potentially changes the predicted class for those cases. In addition, there is no guarantee of convergence. Future work will focus on the optimal order or a modification schema such that the algorithm converges. Investigating other objective functions is also considered.</p>
<p>
        Algorithm 2 Model training
Input:
- X: training set
- y: correct labels for X
- k: number of training iterations
- µ^(1), µ^(0): weights calculated with Definition 4.5
Output:
- Modified vectors µ^(1), µ^(0)
1: for k iterations do
2:   for xi ∈ X do
3:     y¯i ← J̄(xi)
4:     if y¯i ≠ yi then
5:       for e ∈ π(xi) do
6:         µ^(yi)(e) ← µ^(yi)(e) + w(e, xi)|µ(e)|
7:         µ^(y¯i)(e) ← µ^(y¯i)(e) − w(e, xi)|µ(e)|
8:       end for
9:     end if
10:   end for
11: end for
      </p>
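<p>Algorithm 2 can be transcribed as follows; `proj`, `w` and `predict` are hypothetical callbacks standing for π, the weights, and the rule (R1), and lines 6 and 7 are read as in-place updates of the cell strength µ(e):

```python
def train(X, y, mu1, mu0, proj, w, predict, k=1):
    """Sketch of Algorithm 2: when a case is misclassified, move strength
    mass for every cell of its projection from the predicted class toward
    the true class, proportionally to w(e, x); |mu(e)| = |mu1[e] - mu0[e]|."""
    mu = {1: mu1, 0: mu0}
    for _ in range(k):
        for x, label in zip(X, y):
            pred = predict(x, mu1, mu0)
            if pred == label:
                continue
            for e in proj(x):
                delta = w(e, x) * abs(mu1[e] - mu0[e])
                mu[label][e] += delta   # raise support for the true class
                mu[pred][e] -= delta    # lower support for the wrong class
    return mu1, mu0
```

Note that the update is not guaranteed to converge, as discussed above: changing µ(e) affects every case whose projection contains e.</p>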
<p>
        The measure µ is defined as the difference of the supports for both classes. Thus, by linearity, we can rewrite
s(x) = s^(1)(x) − s^(0)(x) = Σ_{i=1}^m w(e_i, x) µ^(1)(e_i) − Σ_{i=1}^m w(e_i, x) µ^(0)(e_i)
This form is convenient because we can control how much evidence we need to support a specific class, using the following constraints and a family η of four hyperparameters:
s^(0)(x) &gt; max( (η¯0 / (1 − η¯0)) s^(1)(x), η0 ) ≥ 0 (C0)
s^(1)(x) &gt; max( (η¯1 / (1 − η¯1)) s^(0)(x), η1 ) ≥ 0 (C1)
(
        <xref ref-type="bibr" rid="ref10">10</xref>
        )
with η0, η1 ∈ R+ and η¯0, η¯1 ∈ [0, 1]. The constraints on η0 and η1 define a minimal amount of support toward class 0 and class 1 respectively, while η¯0 and η¯1 require the support toward a class to be significantly higher than the support for the other class. As µ is normalized over the partition E, the values of η0 and η1 must be set w.r.t. the hypergraph. On the contrary, η¯1 and η¯0 can be set independently of the hypergraph.
      </p>
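<p>The constraints are simple threshold checks; for instance, (C1) can be written as follows (the argument names are ours):

```python
def satisfies_C1(s1, s0, eta1, eta1_bar):
    """(C1): the support for class 1 must exceed both an absolute floor
    eta1 and a margin over the class-0 support controlled by eta1_bar,
    taken here in [0, 1) to keep the ratio finite."""
    return s1 > max(eta1_bar / (1 - eta1_bar) * s0, eta1) >= 0
```

For example, with s^(1)(x) = 0.5, s^(0)(x) = 0.1, η1 = 0.05 and η¯1 = 0.5, the threshold is max(0.1, 0.05) = 0.1 and the constraint holds.</p>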
      <p>Those constraints may be used to design a decision rule for
new cases depending on the application or the dataset. The most
generic decision rule is as follows:</p>
<p>
        J̃(x) = 1 if s(x) &gt; 0 and C1; 0 if s(x) ≤ 0 and C0; l1 if s(x) &gt; 0 and not C1; l0 if s(x) ≤ 0 and not C0
(
        <xref ref-type="bibr" rid="ref12">12</xref>
        )
where l1, l0 are two labels. A representation is given by Figure 3. Those hyperparameters are intended to model the “burden of proof”. For instance, in a trial, one is assumed innocent until proven guilty, which implies that the support for the class "guilty" must be beyond a reasonable doubt (where the term reasonable is defined by the jurisprudence of the applicable country). In case η0 = η1 = η¯0 = η¯1 = 0 (and l0 = 0 and l1 = 1), the decision rule is equivalent to the original one defined by (R1). In case x has too many discretionary features, this classification rule is likely to be irrelevant. Indeed, the intersection between x and F_X is too small to hold enough information and make strong analogies with x. To overcome this drawback, P(F) is split into two subsets:
• F1 = {x ∈ P(F) | |x ∩ F_X| ≥ δ}, ∀δ ∈ N
• F2 = P(F) \ F1
F1 corresponds to the elements that share enough features with the examples. An alternative may be considered by
      </p>
<p>using F1 = {x ∈ P(F) | |D_x| / |x| ≤ δ}, ∀δ ∈ [0, 1]. In this case, F1 contains the elements for which we have enough information provided by the examples. From our preliminary tests, the choice depends on the dataset structure.</p>
<p>Finally, the decision rule for new cases is built as follows:
J̄(x) = J̃(x) if x ∈ F1, and J̄(x) = o_x if x ∈ F2 (R2)
where o_x is one draw from a random variable following a Bernoulli law with parameter p = |{x ∈ X | J(x) = 1}| / |X|, i.e. the prevalence of 1 in X. This assumes that X correctly represents P(F) (or that the prevalence does not change over time for sequential problems in which the new cases are generated by an unknown random measure). The rationale behind this is that if, for a case x, it is not possible to exploit the model built on the hypergraph, then we can still consider that J acts as a Bernoulli random variable and use a maximum likelihood estimate for p. In a sense, it extends the local model to the entire input vector space P(F).</p>
      <p>Model Building: Given X ∈ P(F)^n, constructing E_H can be done in O(Σ_{x ∈ X} |x|) by using a Partition Refinement data structure [18]. Given x ∈ X, calculating the family {S(e, x)}_{e ∈ E_H} can be done in O(|x|) by asking, for each feature of x, the e it belongs to, and maintaining the size of each e during the construction of E_H. Thus, calculating {S(e, x)}_{e ∈ E_H} for all x ∈ X can be done in O(Σ_{x ∈ X} |x|). On m-uniform hypergraphs (when all cases are described with m features) with n hyperedges, it becomes O(mn).</p>
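<p>The fallback rule (R2) can be sketched as follows; `J_tilde` stands for the hypergraph rule (12) and `prevalence` for the maximum likelihood estimate of p (names are ours):

```python
import random

def decide(x, F_X, delta, J_tilde, prevalence, rng=random.Random(0)):
    """(R2): use the hypergraph rule when x shares at least delta features
    with the training features F_X, otherwise draw from a Bernoulli law
    whose parameter is the prevalence of class 1 in the training set."""
    if len(x & F_X) >= delta:                      # x in F1
        return J_tilde(x)
    return 1 if rng.random() < prevalence else 0   # x in F2
```

The sketch uses the |x ∩ F_X| ≥ δ membership test; the |D_x| / |x| ≤ δ variant discussed above only changes the condition.</p>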
<p>Calculating {S(e)}_{e ∈ E_H} and µ can be done in O(|E_H|), because it requires iterating over E_H. An obvious upper bound on |E_H| is |F_X|, i.e. the number of vertices in the hypergraph. The worst-case cardinal of E_H is reached when each x ∈ X intersects with all the others and none of them is strictly a subset of any other. Thus, |E_H| ≤ min(2^n − 1, |F_X|).</p>
<p>Learning Phase: For each wrongly classified x ∈ X, the training requires at most O(|x|) steps (maximal cardinal of π(x)). The worst-case scenario is when the model wrongly classifies every x ∈ X. Thus, the learning phase worst-case complexity is O(k Σ_{x ∈ X} |x|), and on m-uniform hypergraphs it becomes O(kmn).</p>
      <p>Model Query: For a case x ∈ P(F), the projection can be done in O(|x|). Calculating the classification rule also requires at most O(|x|) (maximal cardinal of π(x)).</p>
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTS</title>
      <p>The experiments are broken down into two parts: the comparison
with literature results in terms of classification performance,
and intrinsic performance (computation time and influence of
parameters).</p>
      <p>For the first part, we measured the confusion matrix obtained
over all the runs but also after each prediction. From this
confusion matrix, we calculated the standard performance indicators:
accuracy, recall, specificity, precision, negative predictive value,
F1-score and Matthews correlation coefficient. Denoting by TP
the number of true positives, TN the true negatives, FP the false
positives and FN the false negatives, the F1-score and Matthews
correlation coefficient are defined by:</p>
      <p>F1 = 2TP / (2TP + FP + FN)</p>
      <p>MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))</p>
      <p>The F1-score and Matthews correlation coefficient respectively
belong to [0, 1] and [−1, 1] and, the closer to 1, the better. Both
take into account false positives and false negatives. In
particular, the Matthews correlation coefficient is well suited for binary
classification on unbalanced datasets (with a prevalence far from
0.5). However, as many studies do not report them, we will base
our comparison with literature results on the accuracy defined
by:</p>
      <p>ACC = (TP + TN) / (TP + TN + FP + FN)</p>
      <p>For the second part, we studied the computation time
depending on the number of cases and the size of each case in order to
validate the worst-case complexity given in Section 4.5. We also
studied the influence of the training set size on the accuracy.</p>
      <p>All the data used for the experiments, as well as
the scripts to transform them and analyze the results, are available
in the HCBR Github repository4, and the whole experimental
campaign starting from the raw data can be reproduced in "one
click".</p>
    </sec>
    <sec id="sec-6">
      <title>Classification performance</title>
      <p>To validate the approach, 8 datasets for binary classification
have been used. They are available either from the UCI Machine
Learning Repository5 or provided with LIBSVM6: adult,
audiology, breast, heart, mushrooms, phishing, skin and
splice. For each dataset, the original features (name=value)
are converted into a unique identifier, and the union of all such
identifiers constitutes the information set F considered by the
algorithm. Notice that there is no need to remove the rows with empty
values. The dataset audiology initially contains several classes
corresponding to several ear abnormalities. They are grouped
to obtain two classes (normal ear and abnormal ear). Table 1
describes each dataset. The minimal, maximal and average size give
information about the case sizes (notice some cases have missing
values for the adult, heart and mushrooms datasets). The unique
features are the number of distinct (name=value) pairs in the original dataset.
In addition, two datasets have at least one real-valued attribute,
as indicated by the column "Real" in Table 1.</p>
      <p>The validation was made using 10-fold cross-validation: each
dataset has been split into 10 equal-sized samples, 90% has been
used as training set and the remaining 10% to test the
classification. Each subsample is used once as testing set and the final
metrics are calculated as the average of the 10 runs. The training
set was not rebalanced and kept the original prevalence. The number
of training steps k was adjusted with a manual trial-and-error
approach, the family η was set to η0 = η1 = η¯0 = η¯1 = 0 and δ
was set to 1. Further work will focus on automatic parameter
tuning. The average confusion matrix obtained for each dataset
is shown in Table 2. The performance indicators are reported
in Table 3. The proposed algorithm performs very well on a
wide range of datasets, as reported by Tables 2 and 3, in particular
when they contain strong predictors, as is the case for mushroom.
The accuracy ranges from 0.8206 (adult) to 1
(mushrooms) while the F1-score is bounded by 0.8653 (heart) and
1 (mushrooms). On adult, the accuracy is only 6% higher than
the prevalence: a model consisting in returning 1 for any point
would be only 6% worse. This relatively poor performance in
learning the underlying decision mapping is better reflected by
the Matthews correlation coefficient of 0.51.
4https://github.com/aquemy/HCBR
5https://archive.ics.uci.edu/ml/index.php
6https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html</p>
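The cross-validation protocol described above can be sketched independently of the classifier; `fit` and `predict` below are placeholders for any model interface, not HCBR's actual API:

```python
import random

def kfold_accuracy(cases, labels, fit, predict, k=10, seed=0):
    """Plain k-fold cross-validation: each fold serves once as test
    set and the accuracy is averaged over the k runs."""
    idx = list(range(len(cases)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for fold in folds:
        test = set(fold)
        train_idx = [i for i in idx if i not in test]
        model = fit([cases[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, cases[i]) == labels[i] for i in fold)
        accs.append(hits / len(fold))
    return sum(accs) / k
```

Note that, as in the experiments, the splits are not rebalanced: each fold keeps roughly the original prevalence of the shuffled data.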
      <p>The false positives and false negatives are balanced for each
dataset, despite a huge variation in the prevalence (between 20%
and 64%, cf. Table 2), which may be a desirable property depending
on the application. In addition, the results were obtained with
non-balanced training sets, which are known to be a problem for
many machine learning algorithms. Moreover, the performance
indicators remain stable during the whole prediction phase, as
shown in Figure 4 for the dataset phishing (similar results are
observed for all datasets).</p>
      <p>For a given case, the support is a metric of confidence in the
prediction, as illustrated in Figure 5. In general, the wrongly
classified cases have a smaller difference between the evidence
for each class, which confirms the interest of the hyperparameters
η and η¯ used in (R2).</p>
      <p>To compare the results of the proposed method, we explored
the best results from the literature for each dataset. The
comparison on audiology is not relevant due to the
transformation into a two-class problem. The results are summarized in
Table 4. In [9], 5 rule-based classification techniques dedicated to
medical databases are compared and achieve at best 95.85% and
82.96% accuracy resp. on the breast and heart datasets.
Comparing Bayesian approaches, [13] demonstrated 97.35% (breast) and
83.00% (heart) accuracy. A 5-layer neural network with fuzzy
inference rules achieved 87.78% on heart [21] while a k-NN
algorithm reached 99.96% on mushrooms [8]. The best alternative
among 6 rule-based classification methods achieved 95.84% on
breast and 100.00% on mushroom [11]. Using 80% of phishing
as training set, an adaptive neural network achieved an
average accuracy of 93.76% (among 6 alternatives) with the best
run at 94.90% [23]. Still on phishing, [22] proposes to combine
several classifiers and reaches 97.75% accuracy for the best hybrid
model (and demonstrates 97.58% for a Random Forest classifier).
On adult, the comparison of several classifiers (naive Bayes,
decision tree, ...) demonstrated at most 86.25% accuracy [14] while
a Support Vector Machine approach reached 85.35% [15]. On
splice, a method using Fuzzy Decision Trees [4] reaches 94.10%
accuracy and a neural network combined with boosting [5] 97.54%.
On breast, Support Vector Machine approaches reached resp.
96.87%, 98.53%, 99.51% accuracy [1, 7, 19], 99.26% and 97.36% for
neural-network-based techniques [17, 24], 98.1% for a Bayesian
network method [10], and 94.74% using Decision Trees [20]. On
skin, [5] reports 98.94% accuracy against 99.68% for a Decision
Tree based method [6]. The best result, as far as we know, is
99.92%, obtained by a Generalized Linear Model [2] (with an 80%
training set).</p>
      <p>The accuracy is slightly lower than the best results from the
literature (adult 82.06% against 86.25%, breast 96.96% against
99.51%, heart 85.77% against 87.78%, phishing 96.05% against
97.75%, splice 94.43% against 97.54%, skin 98.65% against 99.92%).
We explain this by at least two factors. First, the best methods
on a given dataset are often dedicated to this dataset, with ad-hoc
or engineered parts, which is not the case of HCBR. Secondly,
the hyperparameter family η has not been tuned but, as we
will show in the next section, it might have a decisive impact on
performance. HCBR performed better than the Bayes classifier in
two thirds of the cases. The Bayes classifier performs better on breast
by ~1%, which represents less than one case wrongly classified.
Similar results are observed with Decision Trees. However, the
1% difference on skin represents an average of 7 cases
misclassified in favor of Bayes. HCBR performs better than
rule-based approaches (or gives similar results on mushrooms
with an accuracy of 1) in the four considered references on three
different datasets. Notice that, combined with a Neural Network,
a rule-based approach achieves the state-of-the-art result on the heart dataset.
Except for phishing, Neural Networks return better results (0.46
more cases with correct classification on average for breast, 71
for skin and almost 10 for splice). Last, SVM gives better results
in all three cases, but appears as the best result in only two datasets.</p>
      <p>In this section, we evaluate the capacity of HCBR to build an
efficient model (w.r.t. the accuracy measure) with few examples.
We also study the influence of the two main parameters (number
of examples and size of the examples) on the computation time.
Training set size: To evaluate HCBR, we used a standard 90-10
dataset split in the previous section. Additionally, we studied the
accuracy depending on the split, from 1% to 99%. For each split
value, we performed 100 runs with random splits and averaged
the accuracy. As shown in Figure 6, HCBR reaches its maximal
accuracy with about 15% of the dataset as examples. The dataset
adult shows constant results from 1%, while for audiology
more than 40% is required to achieve over 90% accuracy. Despite
the datasets having different sizes (e.g. skin is 350 times larger than
breast), the trajectories look the same. Further work should
focus on determining the optimal training set size depending on
the size of cases and the underlying measure used to generate them
in P(F) (as it influences the induced hypergraph, it influences
the strength measure and thus the decisions). Additional
experiments need to be performed to fairly compare those results to
existing algorithms. However, it is commonly accepted that
non-linear classifiers require very large training sets.</p>
      <p>Computation Time: We generated a casebase of n cases of size
m such that case i is described by {i, ..., i + m}, i.e., each case
is partitioned into m elements (one discretionary feature). This
is the worst-case scenario in terms of the size of E if m &lt; n,
because the family grows exponentially as a function of m or n.
We split the computation time into constructing the hypergraph
(and determining the intersection family) and calculating the
strength of the partition. The results are illustrated in Figure 7.
By increasing n with a fixed m, the partition grows exponentially
and thus, an exponential curve is expected for the
strength computation. On the contrary, building the hypergraph
can be done in linear time when m is fixed. When n is fixed and m
increases, constructing the hypergraph is still doable in linear
time, as expected. Interestingly, calculating the strength has two
phases: if m ≤ n, increasing m exponentially increases the time
(because E exponentially increases) but if m &gt; n, increasing m
cannot result in an exponential growth of the computation time
(because E grows linearly).</p>
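The synthetic casebase can be generated as below. For simplicity, the sketch counts the elements of the induced partition by grouping features with identical membership patterns; this is a proxy for the intersection family E actually measured in the timing experiments, not the full measurement code:

```python
from collections import defaultdict

def worst_case_casebase(n, m):
    """Casebase used for the timing experiments: case i holds the
    consecutive features {i, ..., i + m}, so the overlap between
    cases slides by one feature at a time."""
    return [set(range(i, i + m + 1)) for i in range(n)]

def partition_size(cases):
    """Number of elements in the induced partition: features grouped
    by the exact set of cases containing them."""
    sig = defaultdict(set)
    for i, case in enumerate(cases):
        for f in case:
            sig[f].add(i)
    return len({frozenset(s) for s in sig.values()})

# With n=5, m=3 every feature has a distinct membership pattern,
# so the partition has n + m = 8 elements.
size = partition_size(worst_case_casebase(5, 3))
```

With this sliding-window construction, no two features share the same membership pattern, making the partition as fine as the feature set allows.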
      <p>
        Hyperparameters η: We used a 90-10 split and set η0 = η1 to
ease the visualization. Instead of using the decision function
defined by (12), we did not produce a prediction if the constraints C1
or C0 were not respected. It can be viewed as creating a third class,
unknown, for which we consider we cannot produce a decision.
We measured the accuracy and the test set ratio for which a
prediction has been produced for different values of η := η0 = η1.
If J¯ correctly approximates J, increasing η should increase the
accuracy while the test set ratio should remain high. Additionally,
we plot the test set ratio as a function of the accuracy and
calculate the Pareto frontier7, which represents the best compromises
between accuracy and ratio. The closer the points are to (1, 1), the better it
is. A Pareto frontier consisting of (1, 1) represents the perfect
model (e.g. reached on the mushroom dataset). Figures 8, 9, 10 and 11
provide the results for the best and worst two datasets. Figure
12 shows all four Pareto frontiers. As expected, the results
are better on phishing and breast. On phishing, breast and
heart, the accuracy globally increases with η, while on heart
the accuracy slightly decreases, indicating poor influence of the
hyperparameters and model.
7Points such that improving one component would degrade the other one.
      </p>
      <p>Notice that for certain values of η it is possible to reach 100%
accuracy on heart while it is not on breast. Also, for high
values of η, we observe a fall in accuracy for breast. We suspect
these two phenomena appear because we used the same value
for η0 and η1.</p>
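The thresholded decision and the two axes of the Pareto plot can be sketched as follows; the symmetric margin rule on the support difference is a simplifying assumption standing in for the exact constraints C1 and C0:

```python
def accuracy_and_ratio(margins, labels, eta):
    """Accuracy over decided cases and fraction of cases decided:
    a case whose support margin |s1 - s0| is below eta is left as
    the third class 'unknown' and excluded from the accuracy."""
    decided = [(m, y) for m, y in zip(margins, labels) if abs(m) >= eta]
    if not decided:
        return 0.0, 0.0
    hits = sum((m >= 0) == (y == 1) for m, y in decided)
    return hits / len(decided), len(decided) / len(labels)

# Raising eta trades coverage against accuracy on the decided cases.
margins = [0.9, -0.8, 0.1, -0.05]   # s1 - s0 for four test cases
labels = [1, 0, 0, 1]
```

On this toy example, eta = 0 decides every case with 50% accuracy, while eta = 0.3 abstains on the two low-margin cases and reaches 100% accuracy on the remaining half.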
      <p>More work is required to fully study the influence of the
hyperparameters (η0, η1, η¯0, η¯1) and how to select l0 and l1. We believe
this is the key to improving the overall performance, and that it is possible
to derive the best values from the training set. Also, a
comparison with binary classification methods that provide a prediction
confidence metric is necessary.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>
        This paper presented HCBR, a method for binary classification
using a hypergraph representation of information and building a
convex combination over the induced partition to determine
the support for each class. The general framework introduced
by (4) is instantiated in Section 4.3, where the support is
determined using all the interactions between the hyperedges. Beyond
this specific implementation, one can imagine different model
selection methods, e.g. using some assumptions on
the data.
      </p>
      <p>However, being totally agnostic to the data representation is
convenient compared to many methods such as SVM. It allows
combining information from multiple sources by simply stacking
the information. It does not require transforming the data to fit
the algorithm, often by designing specific ad-hoc metrics.</p>
      <p>HCBR has been tested on 8 well-known datasets and
demonstrated accuracies similar to the best results
from the literature. The algorithm has shown strong stability
in terms of accuracy, true positive and true negative rates during the
whole prediction phase. We showed that the difference in class
support is a good confidence indicator for the prediction. We
demonstrated the capacity to obtain good accuracy using very
few examples from the dataset (10% to 15% of the training set)
without balanced classes. This last property is very important for
robustness as, in practice, datasets are rarely balanced. Finally,
we empirically validated the exponential worst-case complexity.</p>
      <p>This proof of concept raises many questions and offers many
improvement axes. First of all, it seems relatively easy to extend
the method to several classes, but this needs additional empirical
validation. As most of the computational effort lies in calculating
the class support, adding more classes will linearly increase the
computation time; thus, working on a faster algorithm or an
approximation of the main measure should be investigated. The
solution may come from exploring the feature selection capacity
of HCBR. Indeed, by the hypergraph construction, it may be
possible to remove from the partition some elements that do not
participate enough (e.g. not being in enough cases at the same
time), reducing the computation time. Additionally, we plan to
investigate how to generate an explanation for each prediction,
and one for the decision function J itself, using not only the
convex combination and the strength of the partition elements,
but also the links between cases, in a similar way a lawyer may
use past cases to make analogies or counter-examples. We also
work on an online version of HCBR where the hypergraph is
constructed case after case, including forgetting some old past
cases (which would allow handling non-stationary environments).
It also seems possible not only to add new examples dynamically,
but also some vertices (i.e. adding some pertinent information to
some cases) without regenerating the whole model from scratch.</p>
      <p>Empirically, further experiments should focus on more
unstructured datasets (for instance text classification). As stated
previously, strategies for hyperparameter tuning are also a
priority. Last but not least, we would like to answer some questions:
is it possible to find a method to adjust the strength measure such
that the accuracy on the training set is 1? Can we provide some
quality bounds depending on the initial hypergraph and thus the
dataset? How to handle continuous values?</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENT</title>
      <p>The author warmly thanks Pr. Robert Wrembel, Poznan
University of Technology, and Dr. Jean-François Puget, IBM Analytics,
for their useful suggestions and advice to improve this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Akay</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Support vector machines combined with feature selection for breast cancer diagnosis</article-title>
          .
          <source>Expert systems with applications 36</source>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>3240</fpage>
          -
          <lpage>3247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Mesa</surname>
            <given-names>A. Dinh N.-T.</given-names>
          </string-name>
          <string-name>
            <surname>Basterrech</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Generalized Linear Models Applied for Skin Identification in Image Processing</article-title>
          .
          <source>In Intelligent Data Analysis and Applications</source>
          . Springer,
          <fpage>97</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Berge</surname>
          </string-name>
          .
          <year>1984</year>
          .
          <article-title>Hypergraphs: combinatorics of finite sets</article-title>
          . Vol.
          <volume>45</volume>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Sharma</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Dhall-A. Chaudhury S. Bhatt</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Efficient Skin Region Segmentation Using Low Complexity Fuzzy Decision Tree Model</article-title>
          .
          <source>In 2009 Annual IEEE India Conference. 1-4.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F. Ö.</given-names>
            <surname>Çatak</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Classification with boosting of extreme learning machine over arbitrarily partitioned data</article-title>
          .
          <source>Soft Computing</source>
          <volume>21</volume>
          ,
          <issue>9</issue>
          (
          <year>2017</year>
          ),
          <fpage>2269</fpage>
          -
          <lpage>2281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ribeiro</surname>
            <given-names>M. X.</given-names>
          </string-name>
          <string-name>
            <surname>Cazzolato</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>A statistical decision tree algorithm for medical data stream mining</article-title>
          .
          <source>In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems</source>
          .
          <volume>389</volume>
          -
          <fpage>392</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Yang</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Liu-J. Liu D.-Y. Chen</surname>
            ,
            <given-names>H.-L.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis</article-title>
          .
          <source>Expert Systems with Applications 38</source>
          ,
          <issue>7</issue>
          (
          <year>2011</year>
          ),
          <fpage>9014</fpage>
          -
          <lpage>9022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01)</source>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Saha</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Datta</surname>
            ,
            <given-names>R. P.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Applying rule-based classification techniques to medical databases: an empirical study</article-title>
          .
          <source>International Journal of Business Intelligence and Systems Engineering</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>32</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jafari</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Fallahi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>An expert system for detection of breast cancer using data preprocessing and bayesian network</article-title>
          .
          <source>International Journal of Advanced Science and Technology</source>
          <volume>34</volume>
          (
          <year>2011</year>
          ),
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Issa</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Ishtaiwi A. Hadi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>ACPRISM: Associative classification based on PRISM algorithm</article-title>
          .
          <source>Information Sciences 417</source>
          ,
          <string-name>
            <surname>Supplement</surname>
            <given-names>C</given-names>
          </string-name>
          (
          <year>2017</year>
          ),
          <fpage>287</fpage>
          -
          <lpage>300</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Liu</surname>
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang S.-Metaxas</surname>
            <given-names>D. N.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Image retrieval via probabilistic hypergraph ranking</article-title>
          .
          <source>In IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>3376</fpage>
          -
          <lpage>3383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Learning Instance Weighted Naive Bayes from Labeled and Unlabeled Data</article-title>
          .
          <source>J. Intell. Inf. Syst</source>
          .
          <volume>38</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>257</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lu</surname>
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Peng Y.-Shi</surname>
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Evaluation of classification algorithms using MCDM and rank correlation</article-title>
          .
          <source>International Journal of Information Technology &amp; Decision Making</source>
          <volume>11</volume>
          ,
          <issue>01</issue>
          (
          <year>2012</year>
          ),
          <fpage>197</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Mangasarian</surname>
            <given-names>O.L.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.-J.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>SSVM: A Smooth Support Vector Machine for Classification</article-title>
          .
          <source>Computational Optimization and Applications</source>
          <volume>20</volume>
          ,
          <issue>1</issue>
          (
          <year>2001</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Xibin</given-names>
            <surname>Zhao-Hai Wan Ming Gu Jiaguang Sun Lifan Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>VertexWeighted Hypergraph Learning for Multi-View Object Classification</article-title>
          .
          <source>In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17</source>
          .
          <fpage>2779</fpage>
          -
          <lpage>2785</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Quintanilla-Domínguez J.-Andina</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Marcano-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>WBCD breast cancer database classification applying artificial metaplasticity neural network</article-title>
          .
          <source>Expert Systems with Applications 38</source>
          ,
          <issue>8</issue>
          (
          <year>2011</year>
          ),
          <fpage>9573</fpage>
          -
          <lpage>9579</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Tarjan</surname>
            <given-names>R. E.</given-names>
          </string-name>
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>1987</year>
          .
          <article-title>Three Partition Refinement Algorithms</article-title>
          .
          <source>SIAM J. Comput. 16</source>
          ,
          <issue>6</issue>
          (
          <year>1987</year>
          ),
          <fpage>973</fpage>
          -
          <lpage>989</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Güneş</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Polat</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Breast cancer diagnosis using least square support vector machine</article-title>
          .
          <source>Digital Signal Processing 17</source>
          ,
          <issue>4</issue>
          (
          <year>2007</year>
          ),
          <fpage>694</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Improved use of continuous attributes in C4. 5</article-title>
          .
          <source>Journal of artificial intelligence research 4</source>
          (
          <year>1996</year>
          ),
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sagir</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sathasivam</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Hybridised Intelligent Technique for the Diagnosis of Medical Diseases</article-title>
          .
          <source>Pertanika Journal of Science &amp; Technology</source>
          <volume>25</volume>
          ,
          <issue>2</issue>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M. A. U. H.</given-names>
            <surname>Tahir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Asghar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zafar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gillani</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>A Hybrid Model to Detect Phishing-Sites Using Supervised Learning Algorithms</article-title>
          .
          <source>In 2016 International Conference on Computational Science and Computational Intelligence (CSCI)</source>
          .
          <fpage>1126</fpage>
          -
          <lpage>1133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>McCluskey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thabtah</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>A dynamic self-structuring neural network model to combat phishing</article-title>
          .
          <source>In 2016 International Joint Conference on Neural Networks (IJCNN)</source>
          .
          <fpage>4221</fpage>
          -
          <lpage>4226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Übeyli</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Implementing automated diagnostic systems for breast cancer detection</article-title>
          .
          <source>Expert Systems with Applications 33</source>
          ,
          <issue>4</issue>
          (
          <year>2007</year>
          ),
          <fpage>1054</fpage>
          -
          <lpage>1062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Dengyong</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jiayuan</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Learning with Hypergraphs: Clustering, Classification, and Embedding</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>19</volume>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Platt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          (Eds.). MIT Press,
          <fpage>1601</fpage>
          -
          <lpage>1608</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>