=Paper=
{{Paper
|id=Vol-2062/paper6
|storemode=property
|title=Binary Classification With Hypergraph Case-Based Reasoning
|pdfUrl=https://ceur-ws.org/Vol-2062/paper06.pdf
|volume=Vol-2062
|authors=Alexandre Quemy
|dblpUrl=https://dblp.org/rec/conf/dolap/Quemy18
}}
==Binary Classification With Hypergraph Case-Based Reasoning==
Alexandre Quemy
IBM Software Lab
Cracow, Poland
Faculty of Computing, Poznań University of Technology
Poznań, Poland
aquemy@pl.ibm.com
ABSTRACT

Binary classification is one of the most common problems in machine learning. It consists in predicting whether a given element belongs to a particular class. In this paper, a new algorithm for binary classification is proposed using a hypergraph representation. Each element to be classified is partitioned according to its interactions with the training set. For each class, the total support is calculated as a convex combination of the evidence strength of the elements of the partition. The evidence measure is pre-computed using the hypergraph induced by the training set and iteratively adjusted through a training phase. The method does not require structured information: each case is represented by a set of agnostic information atoms. Empirical validation demonstrates its high potential on a wide range of well-known datasets, and the results are compared to the literature. The time complexity is given and empirically validated. The capacity of the method to provide good accuracy with few examples is also studied.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the 21st International Conference on Extending Database Technology (EDBT) (March 26-29, 2018, Vienna, Austria) on CEUR-WS.org (ISBN 978-3-89318-078-3). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

In many real-life situations, one tries to take a decision based on previous similar situations. Each situation is described by a certain amount of information, either collected by an expert according to the relevance of the information, or automatically by some sensors or algorithms. Those situations share similarities on which to build analogies or counter-examples in order to take a new decision. Conversely, in general, if two situations do not share any common characteristic, they are totally independent, i.e. it is impossible to infer one's outcome from the other's. The purpose of supervised machine learning algorithms is to exploit the available information and the interactions between past cases or examples in order to build a model or infer the key rules needed to take correct decisions.

Due to the large variety of concrete situations that can be reduced to binary classification, it is one of the most studied problems in machine learning. In this paper, we investigate the problem of binary prediction in a supervised setting, i.e. given a history of previous situations labeled with the correct output.

This paper contributes to binary classification with a new algorithm called Hypergraph Case-Based Reasoning (HCBR). The idea is to create a hypergraph where each element of the training set is a hyperedge and the vertices are the features describing the elements. The intersections between edges create a partition for each case, and we model the support for each class as a convex combination of the elements of this partition. Each of those elements is valued according to its importance w.r.t. the set of all the hyperedges it belongs to and their respective labels.

The well-known "no free lunch" theorem states that there is no unique algorithm that can outperform all the others in supervised learning. However, the case description and representation play a preponderant role in performance, and algorithms are very often constrained by being designed for a specific representation that may not be suitable for a given problem or user data. The proposed method is totally agnostic about the representation (with the exception that the current version must work on discrete information) and does not require structured representations. For instance, it can work on atomic representations such as Bag-of-Words representations of texts, where an atom is a single word (or n-gram). In this case, two elements would not have the same number of atoms and it may be hard to properly define the case domain.

The plan of the paper is as follows: in Section 3 the problem of binary classification is formalized in the particular case of a finite countable set of information. The contribution of this paper is divided into two parts: Section 4 defines the HCBR algorithm and Section 5 presents empirical results on 8 datasets from the UC Irvine Machine Learning Repository (UCI, http://archive.ics.uci.edu/ml/index.php) and LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). The paper ends in Section 6 with a discussion about the results, future work, and possible improvements.

2 CLASSIFICATION AND HYPERGRAPH

Hypergraphs generalize graphs and can be used to represent higher-order relations, while graphs are limited to binary relations. A hypergraph is defined by a set of vertices and a collection of hyperedges, where each hyperedge is a subset of this set of vertices. Therefore, a graph is a special case of hypergraph in which each hyperedge contains exactly two vertices. We formally introduce hypergraphs in Section 4.1. Recently, hypergraphs have been used as a data representation, and some classification algorithms on hypergraphs have been proposed. The vast majority of approaches model the objects to classify as the set of vertices and construct the hyperedges as the representation of a metric. This conventional approach is known as the neighborhood-based hypergraph. The metric relies on some assumptions about the data or on a specific representation (e.g. Zernike moments and histograms of oriented gradients (HOG) to measure the distance between images in [16]), and for each vertex a hyperedge is created to represent its k-nearest neighbors [12]. The problem of classification on a hypergraph consists in labeling some unlabeled vertices, given a training set, such that all vertices in a same hyperedge have the same label. As all the vertices are known a priori, the problem is part of transductive learning. To learn the labels, the standard approach is to minimize a cost function based on a hypergraph equivalent of the graph Laplacian [16, 25] with a structural risk:

C(x) = x^T Δ x + µ ||x − y||^2    (1)
where Δ is the hypergraph Laplacian, µ > 0 a regularization factor and ||.|| a norm. The vector y represents the initial labels for all vertices, with y_i = 0 for unlabeled vertices and a negative (resp. positive) value for label -1 (resp. 1).

On the contrary, the method proposed in this paper models the elements to classify as the hyperedges, and the vertices as the different components of those elements. As far as we know, there is no previous work that uses this modeling choice. In addition, it does not require knowing all the elements before building the model: our approach is inductive. While the literature focuses on the method used to build the hyperedges, the proposed framework has a straightforward hypergraph construction and focuses instead on model selection.

3 PROBLEM DEFINITION

Consider a countable finite space of information F and its σ-algebra P(F) defined as the powerset of F. We call case an element of P(F). In practice, it is very likely that only a subset of P(F) may appear (for instance if two features encode two contradictory propositions, or if every case has the same number of features). The real class of any case is defined by the unknown measurable mapping:

J : P(F) → {0, 1}
    x ↦ J(x)

Given an example set X of n cases, the classification problem consists in minimizing the prediction errors over all the elements of P(F), that is to say, finding a decision mapping J̄ : P(F) → {0, 1} such that:

∀X ∈ P(F)^n,  min_{J̄}  Σ_{x ∈ P(F)} 1_{J(x) ≠ J̄(x)}    (2)

In other words, for any possible training set, we want to find an estimation of the exact mapping J; it is thus a functional problem. Notice that in this paper we do not consider uncertainty: if two situations are described the same way in F, then they have the same label.

4 HYPERGRAPH CASE-BASED REASONING

4.1 Representation And Projection

Before describing HCBR, we recall the definitions of a hypergraph and of an induced hypergraph. For more results on hypergraphs, we refer the reader to [3].

Definition 4.1 (Hypergraph). A hypergraph is defined by H = (V, X) with V a set of vertices and X the hyperedges, such that ∀x ∈ X, x ⊆ V.

A hypergraph can be viewed as a collection X of subsets of a given set of vertices V. It is sometimes convenient to define a hypergraph solely by a collection of sets. In this case, the set of vertices, denoted V_X, is implicitly defined as the union of the edges.

Definition 4.2 (Induced Subhypergraph). The subhypergraph H[A] induced by A ⊆ V is defined by H[A] = (A, X_A) with X_A = {x ∩ A | x ∩ A ≠ ∅}.

A set of examples X can be seen as a hypergraph H = (F, X), i.e. such that each example is a hyperedge (Figure 1). In practice, we do not need to know F, as we can always use the implicit hypergraph H = (F_X, X) where F_X is the restriction of F to the features that appear in the sample X.

Figure 1: The family E = {e_i}_i forms the partition obtained by the union of the projections of the cases and represents how the three cases share information.

Definition 4.3 (Projection defined by a subhypergraph). The projection operator π_H over a hypergraph H = (V, X) for any A ⊆ V is defined by π_H(A) = {e ⊆ A | ∃X' ⊆ X_A, e = ∩_{x ∈ X'} x}.

For a case x, the operator π takes the subhypergraph H[x] and returns the partition of the features of x defined by the intersection family induced by H[x]. Each element of π_H(x) is a (sub)set of features. For instance, in Figure 1, π_H(x_1) = {e_1, e_2, e_3, e_6}, and in Figure 2 the projection of x (in yellow) on H is represented by the family {e'_i}_i. For convenience, for a given H = (V_X, X), we denote by E_H = {e_i}_{i=1}^m = ∪_{x ∈ X} π_H(x) the partition of V_X obtained by the intersections of the edges. This partition is unique to a hypergraph.

We call discretionary features of x (w.r.t. H), denoted D_x, the (possibly empty) set of features that are not in any element of the projection π_H(x). They can be interpreted as the features of x that do not belong to any hyperedge. If a hypergraph induced by a set of examples represents a knowledge base at a given moment, the discretionary features of x are new pieces of information that were never encountered before. For instance, considering the hypergraph composed of x_1 and x_2 as illustrated by Figure 1, the set of discretionary features of x_3 is e_4. In Figure 2, the yellow case x has no discretionary features: all its features are present in at least one example.

Figure 2: The projection of x on H is represented by the family {e'_i}_i. Under the projection lies a graph representation of x with the partition elements {e_i}_i and their respective connections to the cases {x_i}_i; in particular, D_x = ∅.
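In code, the partition E_H, the projection and the discretionary features can be obtained directly from the case family: two features fall into the same element of E_H exactly when they appear in the same set of hyperedges. A minimal sketch (the function names are ours; features are assumed to be hashable atoms, e.g. the name=value identifiers of Section 5):

```python
from collections import defaultdict

def build_partition(cases):
    """E_H: group features by the exact set of hyperedges (cases)
    containing them; two features share an element of the partition
    iff they always appear together."""
    membership = defaultdict(set)
    for i, case in enumerate(cases):
        for f in case:
            membership[f].add(i)
    groups = defaultdict(set)
    for f, case_ids in membership.items():
        groups[frozenset(case_ids)].add(f)
    return list(groups.values())

def projection(x, partition):
    """pi_H(x): the elements of the partition intersecting x,
    restricted to x (the family {e'_i} of Figure 2 for a new case)."""
    return [e & x for e in partition if e & x]

def discretionary(x, partition):
    """D_x: the features of x that appear in no hyperedge."""
    seen = set().union(*partition) if partition else set()
    return x - seen
```

With the hypergraph of Figure 1 built from x_1 and x_2 only, `discretionary(x3, ...)` returns the features of e_4, as in the text.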
For any set of features x ⊆ F, we define d_H(x) = {x' ∈ X | x ∩ x' ≠ ∅}, the set of edges sharing some features with x. In particular, the set d_H(x) can be split into d_H^(1)(x) and d_H^(0)(x) depending on the label of the case, defined by d_H^(l)(x) = {x' ∈ d_H(x) | J(x') = l}. |d_H(x)| corresponds to the number of cases that x intersects, while |d_H^(l)(x)| counts the number of times the class l appears among this set of intersecting cases. Note that if x ∉ X and |d_H(x)| = 0, then x = D_x, i.e. the case x is in relation with no case in X. In the hypergraph literature, |d(x)| is called the degree of a hyperedge and its domain of definition is restricted to X. From now on, we consider only the specific hypergraph generated by the set of examples X and, for the sake of readability, we drop the subscript H.

4.2 Model Space

In many binary classification approaches, such as SVM or logistic regression, the model space consists of the set of hyperplanes of a given inner product space, usually R^M. A hyperplane in R^M is uniquely defined by M + 1 parameters. For an input vector x, its margin m(w, x) is defined as its distance to the hyperplane defined by w; it is negative if x is wrongly classified, and positive otherwise. Thus, the problem consists in finding the parametrization that minimizes a loss function summed over the training set. For instance, the Perceptron algorithm minimizes the 0-1 loss, SVM the hinge loss, and logistic regression the log loss. These functions are defined by (3):

L_01(w, x) = 1_{m(w,x) ≤ 0}
L_hinge(w, x) = max(0, 1 − m(w, x))    (3)
L_log(w, x) = ln(1 + e^{−m(w,x)})
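Since each loss in (3) depends on the input only through the margin, they can be written as one-argument functions; a minimal sketch:

```python
from math import exp, log

# The three losses of (3), as functions of the margin m = m(w, x).
def loss_01(m):
    return 1.0 if m <= 0 else 0.0

def loss_hinge(m):
    return max(0.0, 1.0 - m)

def loss_log(m):
    return log(1.0 + exp(-m))
```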
In this paper, we relax the three implicit constraints on the input vector space:

(1) we do not assume any inner product or metric embedded with the input vector space,
(2) we do not assume a fixed cardinality of the input vectors, so that it is possible to build a model and make predictions on incomplete systems (for instance with values missing in some rows of a database, or with Bag-of-Words representations),
(3) we do not assume anything about the concrete representation of the features, while in most classification methods all the features belong to the same space, often R.

As a counterpart, HCBR relies on two assumptions: (i) the correct class of an input vector is the result of its features only, and (ii) if two input vectors x and x' do not share any feature, they are independent, i.e. x cannot help to understand the correct class of x' and vice versa. As a result, HCBR produces only local models, because if a new input vector is independent of all the examples, it is impossible to generate a prediction. On the contrary, a hyperplane model is global in the sense that it can produce a prediction for any element of R^M. A concrete situation in which such assumptions are natural is a justice trial. Cases are composed of some elements, and the correct label is the result of a reasoning that may use analogies or counter-examples with a set of past cases on top of the legal texts. However, if a judge or lawyer wants to use x to justify the outcome of x', then x' must have similarities with x.

Let us formally define the model space. Given the hypergraph H = (F, X) defined by a training set X and its associated partition E = {e_i}_{i=1}^m, the relation between an example x and its class is modeled by

s_{w,µ}(x) = Σ_{i=1}^m w(e_i, x) µ(e_i)
Σ_{i=1}^m w(e_i, x_j) = 1,  ∀ 1 ≤ j ≤ n    (4)
Σ_{i=1}^m µ(e_i) = 1

where w(e_i, x_j) ≥ 0 models the importance of e_i in x_j, and µ(e_i) the support of e_i for class 1 w.r.t. the whole training set. For this reason, we call µ the intrinsic strength of e_i. Assumption (ii) implies that if e_i ∩ x = ∅ then w(e_i, x) = 0. For readability, we write s in place of s_{w,µ}.

The classification rule consists in selecting the class with the highest total support:

∀x ∈ P(F),  J̄(x) = 1 if s(x) > 0,
                    0 if s(x) ≤ 0.    (R1)

Our goal is to select w and µ such that the classification rule (R1) provides a good solution to the original problem (2). For this purpose, we proceed in three steps:

(1) define w and µ to capture as much information as possible from E for any X (Section 4.3),
(2) train the model to adjust the intrinsic strength on a specific X using the classification rule (R1) (Section 4.4),
(3) refine the classification rule (R1) to take into account the local nature of the model (Section 4.4).

A summary and high-level view of HCBR is given by Algorithm 1.

Algorithm 1 HCBR (high-level view)
1: Build H and E from X.
2: Calculate w and µ on E.
3: Adjust µ with training Algorithm 2 on X using rule (R1).
4: for each x in the test set do
5:   Calculate the projection π(x).
6:   Calculate the support s(x) using the projection.
7:   Predict using the updated rule (R2).
8: end for

While SVM aims at separating the data using a single hyperplane in the original input vector space, HCBR tries to explain the decision for each case by a convex combination expressed in a lower-dimensional space E generated by the data. In addition, the convex combinations depend on each other, since the elements of E are by definition the features resulting from the case intersections. For any case, the decision rule is a combination of the strengths of the elements of its projection on the hypergraph.
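Algorithm 1 reduces to a very small driver once the individual steps are available. The sketch below mirrors it; strength, weight, train and decide_r2 are hypothetical helpers fleshed out in the sketches of Sections 4.3 and 4.4 below (our names, not the paper's):

```python
def hcbr(train_cases, train_labels, test_cases, k, delta=1):
    """Driver mirroring Algorithm 1 (a sketch, not the reference
    implementation)."""
    E, mu0, mu1 = strength(train_cases, train_labels)        # steps 1-2
    mu = train(train_cases, train_labels, E, [mu0, mu1], k)  # step 3
    F_X = set().union(*train_cases)
    p = sum(train_labels) / len(train_labels)                # prevalence of 1
    preds = []
    for x in test_cases:                                     # steps 4-8
        s = [sum(weight(e, x, E) * mu[l][j] for j, e in enumerate(E))
             for l in (0, 1)]
        preds.append(decide_r2(x, F_X, delta, p, s[1], s[0]))
    return preds
```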
4.3 Model Selection

In this section, we define how to concretely select and calculate w and µ for a given hypergraph. To ease the notation, and for practical reasons, the measure µ is provided w.r.t. the intersection family E of the hypergraph induced by the training set X. However, µ can be defined over any partition of F.

For x and x' in P(F), a basic quantitative measure of the importance of x' w.r.t. x can be expressed by |x ∩ x'| / |x|, i.e. how much x' overlaps with x. This measure is a sort of potential for an analogy with x: potential, because it does not account for the individual importance of the features, and simply captures the idea that the larger a subset of features in a case, the higher the chance it holds the features that are important to decide the outcome.

Let us consider E and an example x ∈ X.
Definition 4.4 (Intrinsic strength w.r.t. x). The intrinsic strength of e ∈ E w.r.t. x ∈ X is defined, for l ∈ {0, 1}, by

S^(l)(e, x) = ( |d^(l)(e)| |x ∩ e| / |x| ) / Σ_{e_j ∈ E} ( |d^(l)(e_j)| |x ∩ e_j| / |x| )
            = |d^(l)(e)| |x ∩ e| / Σ_{e_j ∈ E} |d^(l)(e_j)| |x ∩ e_j|    (5)

In particular, for a given x ∈ X, S^(l)(e_i, x) = 0 if e_i is not part of the projection of x on H. Also, Σ_{e ∈ E} S^(l)(e, x) = 1, which can be interpreted as how much e accounts for the set of labels l the element x holds in its projection. The more cases of class l that e_i belongs to, the higher S^(l)(e_i, x) is. Conversely, for a fixed number of cases, the more e_i describes x, the higher S^(l)(e_i, x) is. As ∀e_i ∈ E, |d(e_i)| > 0, we have either S^(1)(e_i, x) ≠ 0 or S^(0)(e_i, x) ≠ 0. We have S^(l)(e_i, x) = 0 only when all the cases in which e_i appears are labeled with the opposite class l̄. For S^(l)(e_i, x) = 1, we need both the unanimity of labels among the cases to which e_i belongs, and e_i = x. The relation e_i = x implies that x does not share any feature with any other example, or that x is included in another example.

Definition 4.5 (Intrinsic strength w.r.t. a hypergraph H). The intrinsic strength of e ∈ E w.r.t. H = (F_X, X) is defined, for l ∈ {0, 1}, by

S^(l)(e) = ( |e| / Σ_{e' ∈ E} |e'| ) Σ_{x ∈ d^(l)(e)} S^(l)(e, x)
         = ( |e| / |F_X| ) Σ_{x ∈ d^(l)(e)} S^(l)(e, x)    (6)

The measure S^(l)(e) is simply the sum of the relevance of e for all the examples it belongs to. The more cases e belongs to, the more information it has to support one class or the other. As E represents the sets of features that always appear together, we favor the larger e ∈ E, as they hold more information to explain a decision. The normalized version is defined by:

∀l ∈ {1, 0},  µ^(l)(e) = S^(l)(e) / Σ_{e' ∈ E} S^(l)(e')    (7)

Finally, the measure µ is simply defined as the difference between the strengths of both classes:

µ(e) = µ^(1)(e) − µ^(0)(e)    (8)

For any case x, the projection π(x) on the hypergraph intersects some elements of E. We use this intersection to define w by

w(e, x) = |x ∩ e| / |x \ D_x|    (9)

i.e. we modulate the strength of an element of the partition by its importance in x w.r.t. set inclusion. In a sense, the projection on the hypergraph provides all the possible analogies with the training set, each seen either as support for a particular outcome l or as a counter-example. The intrinsic strength of each element of this partition is a measure of "how much it can be considered as a counter-example or an analogy", taking into account simultaneously all the analogies between the examples that are in relation with the considered case. It favors the importance of the relation (the more features it shares, the stronger the analogy) but also the quality of the relation (the more examples sharing the same class, the stronger the analogy).

Example: Consider the hypergraph in Figure 1 made of x_1, x_2 and x_3, arbitrarily labeled with resp. 1, 0 and 1. Arbitrarily, we assume the cardinals of the elements of E to be #e = (2, 1, 2, 3, 1, 2, 3), such that the cardinals of the cases are #x = (7, 8, 7) and |F_X| = 14. The values of |d^(l)(e)| can be summarized by the vectors #d^(0) = (0, 0, 1, 0, 1, 1, 1) and #d^(1) = (1, 2, 2, 1, 1, 1, 0). Let us calculate S^(1)(e_3):

S^(1)(e_3, x_1) = (2 × 2) / (2 × 1 + 1 × 2 + 2 × 2 + 1 × 2) = 4/10
S^(1)(e_3, x_2) = (2 × 2) / (2 × 2 + 1 × 1 + 1 × 2 + 0 × 3) = 4/7
S^(1)(e_3, x_3) = (2 × 2) / (2 × 1 + 2 × 2 + 3 × 1 + 1 × 1) = 4/10

which we interpret as e_3 being responsible for 4/10 of the support toward class 1 in x_1 and x_3, and for 4/7 in x_2. This leads to

S^(1)(e_3) = (2/14) (4/10 + 4/7 + 4/10) ≃ 0.1959

Similarly, we calculate the support for each e and both labels, which we summarize in the following vectors:

S^(1) ≃ (0.0286, 0.0286, 0.1959, 0.0643, 0.0173, 0.0694, 0.0000)
S^(0) ≃ (0.0000, 0.0000, 0.2024, 0.0000, 0.0327, 0.1071, 0.0804)

After normalization, we obtain the intrinsic strength:

µ ≃ (0.0707, 0.0707, 0.0060, 0.1591, −0.0345, −0.0818, −0.1901)^T

Let us evaluate the model on the three examples:

s(x_1) = (2/7) µ(e_1) + (1/7) µ(e_2) + (2/7) µ(e_3) + (2/7) µ(e_6) ≃ 0.0086
s(x_2) = (2/8) µ(e_3) + (1/8) µ(e_5) + (2/8) µ(e_6) + (3/8) µ(e_7) ≃ −0.0946
s(x_3) = (1/7) µ(e_2) + (2/7) µ(e_3) + (3/7) µ(e_4) + (1/7) µ(e_5) ≃ 0.0751

As a result, x_1 and x_3 are labeled 1 and x_2 is labeled 0: all three cases are correctly labeled. The highest supports are obtained for cases x_2 and x_3, while the support for x_1 is one order of magnitude lower than for x_3. This is because the discretionary features of x_3 are larger, while its intersection with x_2 is smaller than that of x_1 (3/8 of x_3 against 4/7 of x_1).

Consider a new case x as described in Figure 2. Its support is given by s(x) = Σ_{e ∈ π(x)} w(e, x) µ(e) with π(x) = {e_1, e_2, e_3, e_5, e_6, e_7}. It is impossible for x to be classified as 1, because the highest support would be obtained for a maximal intersection with e_1, e_2, e_3 and a minimal one with e_5, e_6 and e_7, such that s(x) = (1/8)(2µ(e_1) + µ(e_2) + 2µ(e_3) + µ(e_5) + µ(e_6) + µ(e_7)) ≃ −0.0103 < 0. This can be explained by the fact that the support for 1 is provided by a larger set of features (11 features versus 8). On top of that, the intersections with positive cases (e_2 and e_3) are too small (1 for e_2 compared to e.g. 3 for e_7) or also include negative cases (e_3).
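The computation of (5)-(8) can be checked against this example in a few lines of Python. The sketch below is ours, not the authors' implementation; note that, to recover the numbers of the worked example, the outer sum of (6) has to run over every case intersecting e, whatever its label:

```python
from collections import defaultdict

def build_partition(cases):
    # E: features grouped by the exact set of cases containing them
    membership = defaultdict(set)
    for i, c in enumerate(cases):
        for f in c:
            membership[f].add(i)
    groups = defaultdict(set)
    for f, ids in membership.items():
        groups[frozenset(ids)].add(f)
    return list(groups.values())

def strength(cases, labels):
    """Intrinsic strength (5)-(7): returns E and the normalized
    per-class vectors mu0, mu1; mu of (8) is their difference."""
    E = build_partition(cases)
    n_features = len(set().union(*cases))                       # |F_X|
    d = [{i for i, c in enumerate(cases) if e & c} for e in E]  # d(e)
    dl = {l: [sum(1 for i in ids if labels[i] == l) for ids in d]
          for l in (0, 1)}                                      # |d^(l)(e)|
    S = {l: [0.0] * len(E) for l in (0, 1)}
    for l in (0, 1):
        for k, e in enumerate(E):
            for i in d[k]:  # following the worked example: all cases in d(e)
                den = sum(dl[l][j] * len(E[j] & cases[i]) for j in range(len(E)))
                if den:
                    S[l][k] += dl[l][k] * len(e & cases[i]) / den  # eq. (5)
            S[l][k] *= len(e) / n_features                         # eq. (6)
    mu = {l: [v / sum(S[l]) for v in S[l]] for l in (0, 1)}        # eq. (7)
    return E, mu[0], mu[1]

# Worked example of Figure 1: e1..e7 with sizes (2, 1, 2, 3, 1, 2, 3)
e = [set(range(a, a + s)) for a, s in
     zip((0, 2, 3, 5, 8, 9, 11), (2, 1, 2, 3, 1, 2, 3))]
x1, x2, x3 = e[0]|e[1]|e[2]|e[5], e[2]|e[4]|e[5]|e[6], e[1]|e[2]|e[3]|e[4]
E, mu0, mu1 = strength([x1, x2, x3], [1, 0, 1])
# [m1 - m0 for m1, m0 in zip(mu1, mu0)] recovers, up to the ordering of E,
# mu = (0.0707, 0.0707, 0.0060, 0.1591, -0.0345, -0.0818, -0.1901)
```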
4.4 Training and Decision

In this section, we give an algorithm that adjusts the intrinsic strengths µ(e_i) in order to minimize a hinge loss, and we update the classification rule to take into account the fact that the model cannot provide predictions over the whole of F.
Once the model is built, it can be evaluated on the training set. Analogously to SVM, we define the margin as the distance to the correct class, i.e. m(w, µ, x) = s_{w,µ}(x) |J(x) − J̄(x)|. To improve the pertinence of the strengths of the elements of E, we use the iterative procedure described by Algorithm 2 to minimize the hinge loss over the training set X. When, and only when, the classification of a case x is wrong, it modifies each element of its projection by lowering its strength for the wrong class and increasing it for the proper class. The margin is split between the elements of the projection w.r.t. their respective weights in x, i.e. w(e, x). If a case x is wrongly classified, it is due to the cases intersecting it. Indeed, if x did not intersect any other example, its projection would be itself, its support toward the wrong class would be 0, and the support toward the real class would be positive; in other words, x would be correctly classified. Thus, the idea is not to directly bring the support of x to the correct class, but to gradually adjust the weights such that the neighbors are modified enough for x to become correctly classified. In particular, the procedure is sensitive to the order in which the cases are considered: a modification of the strength of any e ∈ E impacts all the cases in which it appears and potentially changes the predicted class for those cases. In addition, there is no guarantee of convergence. Future work will focus on the optimal order, or on a modification schema such that the algorithm converges. Investigating other objective functions is also considered.

Algorithm 2 Model training
Input:
- X: training set
- y: correct labels for X
- k: number of training iterations
- µ^(1), µ^(0): weights calculated with Definition 4.5
Output:
- modified vectors µ^(1), µ^(0)
1: for k iterations do
2:   for x_i ∈ X do
3:     ȳ_i ← J̄(x_i)
4:     if ȳ_i ≠ y_i then
5:       for e ∈ π(x_i) do
6:         µ^(y_i)(e) ← µ^(y_i)(e) + w(e, x_i)|µ(e)|
7:         µ^(ȳ_i)(e) ← µ^(ȳ_i)(e) − w(e, x_i)|µ(e)|
8:       end for
9:     end if
10:  end for
11: end for
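A direct reading of Algorithm 2 in Python might look as follows; this is a sketch under our data layout (E as a list of feature sets, mu as a pair of strength vectors), with the sensitivity to case order and the lack of a convergence guarantee exactly as discussed above:

```python
def weight(e, x, E):
    """w(e, x) of eq. (9): relative size of the overlap, with the
    discretionary features of x excluded from the denominator."""
    covered = x & set().union(*E)                 # x \ D_x
    return len(e & x) / len(covered) if covered else 0.0

def train(cases, labels, E, mu, k):
    """Algorithm 2: redistribute strength on misclassified cases.
    mu = [mu0, mu1], modified in place and returned."""
    for _ in range(k):
        for x, y in zip(cases, labels):
            s = sum(weight(e, x, E) * (mu[1][j] - mu[0][j])
                    for j, e in enumerate(E))
            y_hat = 1 if s > 0 else 0             # rule (R1)
            if y_hat != y:
                for j, e in enumerate(E):
                    if e & x:                     # e in pi(x)
                        delta = weight(e, x, E) * abs(mu[1][j] - mu[0][j])
                        mu[y][j] += delta         # toward the correct class
                        mu[y_hat][j] -= delta     # away from the wrong one
    return mu
```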
The measure µ is defined as the difference of support for both classes. Thus, by linearity, we can rewrite

s(x) = Σ_{i=1}^m w(e_i, x) µ^(1)(e_i) − Σ_{i=1}^m w(e_i, x) µ^(0)(e_i) = s^(1)(x) − s^(0)(x)    (10)

This form is convenient because we can control how much evidence we need to support a specific class, using the following constraints and a family η of four hyperparameters:

s^(0)(x) > max( (η̄_0 / (1 − η̄_0)) s^(1)(x), η_0 ) ≥ 0    (C0)
s^(1)(x) > max( (η̄_1 / (1 − η̄_1)) s^(0)(x), η_1 ) ≥ 0    (C1)

with η_0, η_1 ∈ R_+ and η̄_0, η̄_1 ∈ [0, 1]. The constraints on η_0 and η_1 define a minimal amount of support toward class 0 and class 1 respectively, while η̄_0 and η̄_1 require the support toward a class to be significantly higher than the support for the other class. As µ is normalized over the partition E, the values of η_0 and η_1 must be set w.r.t. the hypergraph. On the contrary, η̄_1 and η̄_0 can be set independently of the hypergraph.

Those constraints may be used to design a decision rule for new cases depending on the application or the dataset. The most generic decision rule is as follows:

J̃(x) = 1   if s(x) > 0 and C_1
        0   if s(x) ≤ 0 and C_0
        l_1 if s(x) > 0 and not C_1
        l_0 if s(x) ≤ 0 and not C_0    (12)

where l_1 and l_0 are two labels. A representation is given in Figure 3.

Figure 3: Representation of the updated decision rule (12).

These hyperparameters are intended to model the "burden of proof". For instance, in a trial, one is presumed innocent until proven guilty, which implies that the support for the class "guilty" must be beyond a reasonable doubt (where the term reasonable is defined by the jurisprudence of the applicable country). In case η_0 = η_1 = η̄_0 = η̄_1 = 0 (and l_0 = 0, l_1 = 1), the decision rule is equivalent to the original one defined by (R1). In case x has too many discretionary features, this classification rule is likely to be irrelevant: the intersection between x and F_X is too small to hold enough information and make strong analogies with x. To overcome this drawback, P(F) is split into two subsets:

• F_1 = {x ∈ P(F) | |x ∩ F_X| ≥ δ}, δ ∈ N
• F_2 = P(F) \ F_1

F_1 corresponds to the elements that share enough features with the examples. An alternative may be considered, using F_1 = {x ∈ P(F) | |D_x| / |x| ≤ δ}, δ ∈ [0, 1]; in this case, F_1 contains the elements for which enough information is provided by the examples. From our preliminary tests, the best choice depends on the dataset structure.

Finally, the decision rule for new cases is built as follows:

J̄(x) = J̃(x) if x ∈ F_1
        o_x  if x ∈ F_2    (R2)

where o_x is one draw of a random variable following a Bernoulli law with parameter p = |{x ∈ X | J(x) = 1}| / |X|, i.e. the prevalence of 1 in X. This assumes that X correctly represents P(F) (or that the prevalence does not change over time, for sequential problems in which the new cases are generated by an unknown random measure). The rationale is that if, for a case x, it is not possible to exploit the model built on the hypergraph, then we can still consider that J acts as a Bernoulli random variable and use a maximum likelihood estimation of p. In a sense, this extends the local model to the entire input vector space P(F).
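Rules (12) and (R2) fit in a few lines; a hedged sketch (our names; it assumes η̄ < 1 so the ratio stays finite, and takes p as the maximum likelihood estimate mentioned above):

```python
import random

def decide(s1, s0, eta=(0.0, 0.0), eta_bar=(0.0, 0.0), l0=0, l1=1):
    """Rule (12): keep the sign of s(x) = s1 - s0 only if the winning
    support passes the absolute (eta) and relative (eta_bar)
    thresholds; otherwise fall back to the labels l0/l1."""
    if s1 - s0 > 0:
        c1 = s1 > max(eta_bar[1] / (1 - eta_bar[1]) * s0, eta[1])
        return 1 if c1 else l1
    c0 = s0 > max(eta_bar[0] / (1 - eta_bar[0]) * s1, eta[0])
    return 0 if c0 else l0

def decide_r2(x, F_X, delta, p, s1, s0, **kwargs):
    """Rule (R2): apply rule (12) on F_1, otherwise draw
    o_x ~ Bernoulli(p), with p the prevalence of class 1."""
    if len(x & F_X) >= delta:                  # x in F_1
        return decide(s1, s0, **kwargs)
    return 1 if random.random() < p else 0    # x in F_2
```

With all thresholds at 0 and l0 = 0, l1 = 1, decide() degenerates into rule (R1), as stated in the text.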
4.5 Complexity

Model Building: Given X ∈ P(F)^n, constructing E_H can be done in O(Σ_{x ∈ X} |x|) by using a partition refinement data structure [18]. Given x ∈ X, calculating the family {S(e, x)}_{e ∈ E_H} can be done in O(|x|), by asking for each feature of x the element e it belongs to and maintaining the size of each e during the construction of E_H. Thus, calculating {S(e, x)}_{e ∈ E_H} for all x ∈ X can be done in O(Σ_{x ∈ X} |x|). On m-uniform hypergraphs (when all cases are described by m features) with n hyperedges, this becomes O(mn). Calculating {S(e)}_{e ∈ E_H} and µ can be done in O(|E_H|), because it requires iterating over E_H. An obvious upper bound on |E_H| is |F_X|, i.e. the number of vertices in the hypergraph. The worst-case cardinal of E_H is reached when each x ∈ X intersects all the others and none of them is strictly a subset of any other. Thus, |E_H| ≤ min(2^n − 1, |F_X|).

Learning Phase: For each wrongly classified x ∈ X, the training requires at most O(|x|) steps (the maximal cardinal of π(x)). The worst-case scenario is when the model wrongly classifies every x ∈ X. Thus, the learning phase worst-case complexity is O(k Σ_{x ∈ X} |x|), and on m-uniform hypergraphs it becomes O(kmn).

Model Query: For a case x ∈ P(F), the projection can be done in O(|x|). Calculating the classification rule also requires at most O(|x|) (the maximal cardinal of π(x)).
5 EXPERIMENTS

The experiments are broken down into two parts: the comparison with literature results in terms of classification performance, and the intrinsic performance (computation time and influence of the parameters).

For the first part, we measured the confusion matrix obtained over all the runs, but also after each prediction. From this confusion matrix, we calculated the standard performance indicators: accuracy, recall, specificity, precision, negative prediction value, F_1-score and Matthews correlation coefficient. Denoting by TP the number of true positives, TN the true negatives, FP the false positives and FN the false negatives, the F_1-score and the Matthews correlation coefficient are defined by:

F_1 = 2TP / (2TP + FP + FN)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The F_1-score and the Matthews correlation coefficient respectively belong to [0, 1] and [−1, 1], and the closer to 1, the better. Both take into account false positives and false negatives. In particular, the Matthews correlation coefficient is well adapted to binary classification on unbalanced datasets (with a prevalence far from 0.5). However, as many studies do not report them, we base our comparison with literature results on the accuracy, defined by:

ACC = (TP + TN) / (TP + TN + FP + FN)
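All the indicators above derive from the four confusion-matrix counts; a minimal sketch:

```python
from math import sqrt

def indicators(tp, tn, fp, fn):
    """Accuracy, F1 and Matthews correlation coefficient, as defined
    in Section 5, from one confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0  # guard degenerate matrices
    return acc, f1, mcc
```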
For the second part, we studied the computation time depending on the number of cases and the size of each case, in order to validate the worst-case complexity given in Section 4.5. We also studied the influence of the training set size on the accuracy.

The integrality of the data used for the experiments, as well as the scripts used to transform them and analyze the results, are available in the HCBR GitHub repository (https://github.com/aquemy/HCBR), and the whole experimental campaign, starting from the raw data, can be reproduced in "one click".

5.1 Classification performance

To validate the approach, 8 datasets for binary classification have been used. They are available either from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) or provided with LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html): adult, audiology, breasts, heart, mushrooms, phishing, skin and splice. For each dataset, the original features (name=value) are converted into a unique identifier, and the union of all such identifiers constitutes the information set F considered by the algorithm. Notice that there is no need to remove the rows with empty values. The dataset audiology initially contains several classes corresponding to several ear abnormalities; they are grouped to obtain two classes (normal ear and abnormal ear). Table 1 describes each dataset. The minimal, maximal and average sizes give information about the case sizes (notice that some cases have missing values in the adult, heart and mushrooms datasets). The unique features are the number of (name=value) pairs in the original dataset. In addition, two datasets have at least one real-valued attribute, as indicated by the column "Real" in Table 1.

The validation was made using a 10-fold cross-validation: each dataset was split into 10 equally sized samples; 90% was used as training set and the remaining 10% to test the classification. Each subsample is used once as the testing set, and the final metrics are calculated as the average over the 10 runs. The training set was not rebalanced and kept the original prevalence. The number of training steps k was adjusted by manual trial and error, the family η was set to η_0 = η_1 = η̄_0 = η̄_1 = 0, and δ was set to 1. Further work will focus on automatic parameter tuning.

Figure 4: Evolution of the confusion matrix during the prediction phase for the phishing dataset.
Table 1: Datasets description.

            Cases    Total Features   Unique   Min. Size   Max. Size   Average Size   Real
adult       32561    418913           118      10          13          12.87          No
audiology   200      13624            376      70          70          70             No
breasts     699      5512             80       8           8           8              No
heart       270      3165             344      12          13          12.99          Yes
mushrooms   8124     162374           106      20          20          20             No
phishing    11055    319787           808      29          29          29             No
skin        245057   734403           768      3           3           3              Yes
splice      3175     190263           237      60          60          60             No
The average confusion matrix obtained for each dataset is shown in Table 2, and the performance indicators are reported in Table 3. The proposed algorithm performs very well on a wide range of datasets, as reported by Tables 2 and 3, in particular when they contain strong predictors, as is the case for mushrooms. The accuracy ranges from 0.8206 (adult) to 1 (mushrooms), while the F_1-score is bounded by 0.8653 (heart) and 1 (mushrooms). On adult, the accuracy is only 6% higher than the prevalence: a model consisting in returning 1 for any point would be only 6% worse. This relatively poor performance in learning the underlying decision mapping is better reflected by the Matthews correlation coefficient of 0.51.

The false positives and false negatives are balanced for each dataset, despite a huge variation in the prevalence (between 20% and 64%, cf. Table 2), which may be a desirable property depending on the application. In addition, the results were obtained with non-balanced training sets, which are known to be a problem for many machine learning algorithms. Moreover, the performance indicators remain stable during the whole prediction phase, as shown in Figure 4 for the dataset phishing (similar results are observed for all datasets).

For a given case, the support is a metric of confidence in the prediction, as illustrated in Figure 5. In general, the wrongly classified cases have a smaller difference between the evidence for each class, which confirms the interest of the hyperparameters η and η̄ used in (R2).

Figure 5: Difference between the weights assigned to both classes for each decision on phishing (average). Similar results are observed for all datasets.

To compare the results of the proposed method, we explored the best results from the literature for each dataset. The comparison on audiology is not relevant due to the transformation into a two-class problem. The results are summarized in Table 4. In [9], 5 rule-based classification techniques dedicated to medical databases are compared and achieve at best 95.85% and 82.96% accuracy resp. on the breast and heart datasets. Among bayesian approaches, [13] demonstrated 97.35% (breast) and 83.00% (heart) accuracy. A 5-layer neural network with fuzzy inference rules achieved 87.78% on heart [21], while a k-NN algorithm reached 99.96% on mushrooms [8]. The best alternative among 6 rule-based classification methods achieved 95.84% on breast and 100.00% on mushrooms [11]. Using 80% of phishing as training set, an adaptive neural network achieved an average accuracy of 93.76% (among 6 alternatives), with the best run at 94.90% [23]. Still on phishing, [22] proposes to combine several classifiers and reaches 97.75% accuracy for the best hybrid model (and demonstrates 97.58% for a Random Forest classifier). On adult, a comparison of several classifiers (naive bayes, decision tree, ...) demonstrated at most 86.25% accuracy [14], while a Support Vector Machine approach reached 85.35% [15]. On splice, a method using Fuzzy Decision Trees [4] reaches 94.10% accuracy, and a neural network combined with boosting [5] 97.54%. On breast, Support Vector Machine approaches reached resp. 96.87%, 98.53% and 99.51% accuracy [1, 7, 19], against 99.26% and 97.36% for neural-network based techniques [17, 24], 98.1% for a bayesian network method [10], and 94.74% using Decision Trees [20]. On skin, [5] reports 98.94% accuracy against 99.68% for a Decision Tree based method [6]. The best result, as far as we know, is 99.92%, obtained by a Generalized Linear Model [2] (with an 80% training set).

The accuracy of HCBR is slightly lower than the best results from the literature (adult 82.06% against 86.25%, breast 96.96% against 99.51%, heart 85.77% against 87.78%, phishing 96.05% against 97.75%, splice 94.43% against 97.54%, skin 98.65% against 99.92%). We explain this by at least two factors. First, the best methods on a given dataset are often dedicated to this dataset, with ad-hoc or engineered parts, which is not the case of HCBR. Secondly, the hyperparameter family η has not been tuned but, as we show in the next section, it may have a decisive impact on performance. HCBR performed better than the Bayes classifiers in two thirds of the cases. The Bayes classifier performs better on breast by about 1%, which represents less than one case wrongly classified. Similar results are observed with Decision Trees. However, the 1% difference on skin represents an average of 7 cases misclassified in favor of Bayes. HCBR performs better than rule-based approaches (or gives similar results, on mushrooms, with an accuracy of 1) in the four considered references on three different datasets. Notice that, combined with a neural network, a rule-based approach achieves the state-of-the-art result on the heart dataset. Except for phishing, neural networks return better results (0.46 more cases correctly classified on average for breast, 71 for skin and almost 10 for splice). Last, SVM gives better results in all three cases, but appears as the best result only on two datasets.
Table 2: Average confusion matrix obtained with a 10-fold cross-validation.

            TP        FN      FP      TN        Prevalence
adult       2182.40   295.30  288.50  488.80    0.7586
audiology   12.70     0.10    0.00    6.20      0.6048
breast      23.00     1.40    0.70    43.90     0.3338
heart       12.40     1.80    1.90    9.90      0.5107
mushrooms   390.60    0.00    0.00    420.40    0.4804
phishing    595.40    23.80   19.80   465.00    0.5562
skin        4886.30   132.40  199.40  19286.90  0.2075
splice      155.70    9.10    8.50    142.70    0.5164

5.2 Intrinsic performance

In this section, we evaluate the capacity of HCBR to build an efficient model (w.r.t. the accuracy measure) with few examples. We also study the influence of the two main parameters (number of examples and size of the examples) on the computation time.

Training set size: To evaluate HCBR, we used a standard 90-10 dataset split in the previous section. Additionally, we studied the accuracy depending on the split, from 1% to 99%. For each split value, we performed 100 runs with random splits and averaged the accuracy. As shown in Figure 6, HCBR reaches its maximal accuracy with only about 15% of the dataset used as examples. The dataset adult shows constant results from 1%, while for audiology more than 40% is required to achieve over 90% accuracy. Despite the datasets having very different sizes (e.g. skin is 350 times larger than breast), the trajectories look the same. Further work should focus on determining the optimal training set size depending on the size of the cases and the underlying measure used to generate them in P(F) (as it influences the induced hypergraph, it influences the strength measure and thus the decisions). Additional experiments need to be performed to fairly compare those results to existing algorithms. However, it is commonly accepted that non-linear classifiers require very large training sets.

Figure 6: Accuracy depending on the percentage of the dataset used as training set.

Computation Time: We generated a casebase of n cases of size m, such that case i is described by {i, ..., i + m}, i.e. each case is partitioned into m elements (one discretionary feature). This is the worst-case scenario in terms of the size of E if m < n, because the family grows exponentially as a function of m or n. We split the computation time into constructing the hypergraph (and determining the intersection family) and calculating the strength of the partition. The results are illustrated in Figure 7. When n increases with m fixed, the partition grows exponentially and thus an exponential curve is expected for the strength computation. On the contrary, building the hypergraph can be done in linear time for fixed m. When n is fixed and m increases, constructing the hypergraph is still doable in linear time, as expected. Interestingly, calculating the strength has two phases: if m ≤ n, increasing m exponentially increases the time (because E increases exponentially), but if m > n, increasing m cannot result in an exponential growth of the computation time (because E grows linearly).

Figure 7: At the top, computation time to build the model (hypergraph construction + strength calculation) depending on n (m = 10); at the bottom, depending on m (n = 100). Case i is described by {i, ..., i + m}, such that each case is partitioned into m elements (one discretionary feature). Right scale for building and left scale for strength.
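The synthetic casebase is straightforward to reproduce; a sketch timing only the hypergraph/partition construction (build_partition is the helper from the Section 4.1 sketch, and the strength timing is analogous):

```python
import time

def synthetic_casebase(n, m):
    # case i = {i, ..., i + m}: consecutive cases overlap, maximizing
    # the number of intersections in E when m < n
    return [set(range(i, i + m + 1)) for i in range(n)]

def bench(n, m):
    cases = synthetic_casebase(n, m)
    start = time.perf_counter()
    E = build_partition(cases)     # hypergraph/partition construction
    return len(E), time.perf_counter() - start
```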
Hyperparameters η: We used a 90-10 split and set η_0 = η_1 to ease the visualization. Instead of using the decision function defined by (12), we did not produce a prediction when the constraints C_1 or C_0 were not respected. This can be viewed as creating a third class, unknown, for which we consider that no decision can be produced.
Table 3: Average performances obtained with a 10-fold cross-validation.

            Accuracy (std. dev.)   Recall   Specificity   Precision   Neg. Pred. Value   F_1 score   Matthews corr. coef.
adult       0.8206 (0.0094)        0.8832   0.6233        0.8808      0.6290             0.8820      0.5081
audiology   0.9947 (0.0166)        1.0000   0.9875        0.9917      1.0000             0.9957      0.9896
breasts     0.9696 (0.0345)        0.9691   0.9676        0.9479      0.9844             0.9575      0.9344
heart       0.8577 (0.0943)        0.8695   0.8437        0.8699      0.8531             0.8653      0.7178
mushrooms   1.0000 (0.0000)        1.0000   1.0000        1.0000      1.0000             1.0000      1.0000
phishing    0.9605 (0.0081)        0.9680   0.9514        0.9615      0.9590             0.9647      0.9199
skin        0.9865 (0.0069)        0.9608   0.9932        0.9736      0.9898             0.9672      0.9587
splice      0.9443 (0.0124)        0.9478   0.9398        0.9450      0.9441             0.9463      0.8884
Table 4: Previous literature results, measured as the highest accuracy obtained by the authors.

Dataset     Ref.   Type                          Accuracy
adult       [14]   Many classifiers              86.25%
            [15]   SVM                           85.35%
            HCBR                                 82.06%
breast      [1]    SVM                           99.51%
            [17]   Neural Network                99.26%
            [19]   SVM                           98.53%
            [10]   Bayes                         98.1%
            [24]   Neural Network                97.36%
            [13]   Bayes                         97.35%
            HCBR                                 96.96%
            [7]    SVM                           96.87%
            [9]    Rule-based                    95.85%
            [11]   Rule-based                    95.84%
            [20]   Decision Tree                 94.74%
heart       [21]   Neural Network + Rule-based   87.78%
            HCBR                                 85.77%
            [13]   Bayes                         83.00%
            [9]    Rule-based                    82.96%
mushrooms   [11]   Rule-based                    100.00%
            HCBR                                 100.00%
            [8]    k-NN                          99.96%
phishing    [22]   Ensemble                      97.75%
            [22]   Random Forest                 97.58%
            HCBR                                 96.05%
            [23]   Neural Network                94.90%
skin        [2]    Generalized Linear Model      99.92%
            [6]    Decision Tree                 99.68%
            [5]    Neural Network + Boosting     98.94%
            HCBR                                 98.65%
splice      [5]    Neural Network + Boosting     97.54%
            HCBR                                 94.43%
            [4]    (fuzzy) Decision Tree         94.10%
Figure 8: Influence of η on the phishing dataset.

We measured the accuracy and the test set size ratio for which a prediction was produced, for different values of η := η_0 = η_1. If J̄ correctly approximates J, increasing η should increase the accuracy while the test set ratio should remain high. Additionally, we plot the test set ratio as a function of the accuracy and calculate the Pareto frontier (the points such that improving one component would degrade the other), which represents the best compromises between accuracy and ratio. The closer the points are to (1, 1), the better. A Pareto frontier consisting of (1, 1) represents the perfect model (e.g. reached on the mushrooms dataset). Figures 8, 9, 10 and 11 provide the results for the two best and the two worst datasets, and Figure 12 shows all four Pareto frontiers. As expected, the results are better on phishing and breast. On phishing, breast and heart, the accuracy globally increases with η, while on adult the accuracy slightly decreases, indicating a poor influence of the hyperparameters and model.

Notice that for certain values of η it is possible to reach 100% accuracy with heart, while it is not with breast. Also, for high values of η, we observe a fall in accuracy for breast. We suspect these two phenomena appear because we used the same value for η_0 and η_1.

More work is required to fully study the influence of the hyperparameters (η_0, η_1, η̄_0, η̄_1) and how to select l_0 and l_1. We believe this is the key to improving the overall performance, and that it is possible to derive the best values from the training set. Also, a comparison with binary classification methods that provide a prediction confidence metric is necessary.
Figure 9: Influence of η on the breast dataset.

Figure 10: Influence of η on the heart dataset.

Figure 11: Influence of η on the adult dataset.

Figure 12: Pareto frontiers comparison.

6 CONCLUSION

This paper presented HCBR, a method for binary classification using a hypergraph representation of the information and building a convex combination over the induced partition to determine the support for each class. The general framework introduced by (4) is instantiated in Section 4.3, where the support is determined using all the interactions between the hyperedges. Beyond this specific implementation, one can imagine different model selection methods, e.g. using some assumptions on the data.

However, being totally agnostic about the data representation is convenient compared to many methods such as SVM. It allows combining information from multiple sources by simply stacking the information. It does not require transforming the data to fit the algorithm, often by designing specific ad-hoc metrics.

HCBR has been tested on 8 well-known datasets and demonstrated accuracies similar to the best results from the literature. The algorithm has shown strong stability in terms of accuracy and true positive and negative rates during the whole prediction phase. We showed that the difference in class support is a good confidence indicator for the prediction. We demonstrated the capacity to obtain a good accuracy using very few examples from the dataset (10% to 15% of the training set) without balanced classes. This last property is very important for robustness, as in practice datasets are rarely balanced. Finally, we empirically validated the exponential worst-case complexity.

This proof of concept raises many questions and offers many axes of improvement. First of all, it seems relatively easy to extend the method to several classes, but this needs an additional empirical validation. As most of the computational effort is spent calculating the class support, adding more classes will linearly increase the computation time; thus, a faster algorithm or an approximation of the main measure should be investigated. The solution may come from exploring the feature selection capacity of HCBR. Indeed, by the hypergraph construction, it may be possible to remove from the partition some elements that do not participate enough (e.g. that do not appear in enough cases at the same time), reducing the computation time. Additionally, we plan to investigate how to generate an explanation for each prediction, and one for the decision function J itself, using not only the convex combination and the strength of the partition elements, but also the links between cases, in a similar way to how a lawyer may use past cases to make analogies or counter-examples. We also work on an online version of HCBR where the hypergraph is constructed case after case, including forgetting some old past cases (which would allow handling non-stationary environments). It seems also possible not only to add new examples dynamically, but also new vertices (i.e. adding some pertinent information to some cases) without regenerating the whole model from scratch.

Empirically, further experiments should focus on more unstructured datasets (for instance for text classification). As stated previously, strategies for hyperparameter tuning are also a priority. Last but not least, we would like to answer some questions: is it possible to find a method to adjust the strength measure such that the accuracy on the training set is 1? Can we provide quality bounds depending on the initial hypergraph, and thus on the dataset? How should continuous values be handled?
ACKNOWLEDGMENT

The author warmly thanks Pr. Robert Wrembel, Poznan University of Technology, and Dr. Jean-François Puget, IBM Analytics, for their useful suggestions and advice to improve this paper.

REFERENCES

[1] M. F. Akay. 2009. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications 36, 2 (2009), 3240–3247.
[2] S. Basterrech, A. Mesa, and N.-T. Dinh. 2015. Generalized Linear Models Applied for Skin Identification in Image Processing. In Intelligent Data Analysis and Applications. Springer, 97–107.
[3] C. Berge. 1984. Hypergraphs: combinatorics of finite sets. Vol. 45. Elsevier.
[4] R. B. Bhatt, G. Sharma, A. Dhall, and S. Chaudhury. 2009. Efficient Skin Region Segmentation Using Low Complexity Fuzzy Decision Tree Model. In 2009 Annual IEEE India Conference. 1–4.
[5] F. Ö. Çatak. 2017. Classification with boosting of extreme learning machine over arbitrarily partitioned data. Soft Computing 21, 9 (2017), 2269–2281.
[6] M. T. Cazzolato and M. X. Ribeiro. 2013. A statistical decision tree algorithm for medical data stream mining. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems. 389–392.
[7] H.-L. Chen, B. Yang, J. Liu, and D.-Y. Liu. 2011. A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis. Expert Systems with Applications 38, 7 (2011), 9014–9022.
[8] S. Das. 2001. Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 74–81.
[9] R. P. Datta and S. Saha. 2016. Applying rule-based classification techniques to medical databases: an empirical study. International Journal of Business Intelligence and Systems Engineering 1, 1 (2016), 32–48.
[10] A. Fallahi and S. Jafari. 2011. An expert system for detection of breast cancer using data preprocessing and bayesian network. International Journal of Advanced Science and Technology 34 (2011), 65–70.
[11] W. Hadi, G. Issa, and A. Ishtaiwi. 2017. ACPRISM: Associative classification based on PRISM algorithm. Information Sciences 417, Supplement C (2017), 287–300.
[12] Y. Huang, Q. Liu, S. Zhang, and D. N. Metaxas. 2010. Image retrieval via probabilistic hypergraph ranking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 3376–3383.
[13] L. Jiang. 2012. Learning Instance Weighted Naive Bayes from Labeled and Unlabeled Data. J. Intell. Inf. Syst. 38, 1 (2012), 257–268.
[14] G. Kou, Y. Lu, Y. Peng, and Y. Shi. 2012. Evaluation of classification algorithms using MCDM and rank correlation. International Journal of Information Technology & Decision Making 11, 01 (2012), 197–225.
[15] Y.-J. Lee and O. L. Mangasarian. 2001. SSVM: A Smooth Support Vector Machine for Classification. Computational Optimization and Applications 20, 1 (2001), 5–22.
[16] Lifan Su, Yue Gao, Xibin Zhao, Hai Wan, Ming Gu, and Jiaguang Sun. 2017. Vertex-Weighted Hypergraph Learning for Multi-View Object Classification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 2779–2785.
[17] A. Marcano-Cedeño, J. Quintanilla-Domínguez, and D. Andina. 2011. WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Systems with Applications 38, 8 (2011), 9573–9579.
[18] R. Paige and R. E. Tarjan. 1987. Three Partition Refinement Algorithms. SIAM J. Comput. 16, 6 (1987), 973–989.
[19] K. Polat and S. Güneş. 2007. Breast cancer diagnosis using least square support vector machine. Digital Signal Processing 17, 4 (2007), 694–701.
[20] J. R. Quinlan. 1996. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4 (1996), 77–90.
[21] A. M. Sagir and S. Sathasivam. 2017. A Hybridised Intelligent Technique for the Diagnosis of Medical Diseases. Pertanika Journal of Science & Technology 25, 2 (2017).
[22] M. A. U. H. Tahir, S. Asghar, A. Zafar, and S. Gillani. 2016. A Hybrid Model to Detect Phishing-Sites Using Supervised Learning Algorithms. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI). 1126–1133.
[23] F. Thabtah, R. M. Mohammad, and L. McCluskey. 2016. A dynamic self-structuring neural network model to combat phishing. In 2016 International Joint Conference on Neural Networks (IJCNN). 4221–4226.
[24] E. D. Übeyli. 2007. Implementing automated diagnostic systems for breast cancer detection. Expert Systems with Applications 33, 4 (2007), 1054–1062.
[25] Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. 2007. Learning with Hypergraphs: Clustering, Classification, and Embedding. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.). MIT Press, 1601–1608.