<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Less is Better: An Energy-Based Approach to Case Base Competence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Esteban Marquer</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fadi Badra</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie - Jeanne Lesot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Couceiro</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Leake</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne Université</institution>
          ,
          <addr-line>CNRS, LIP6, F-75005 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Luddy School of Informatics, Computing, and Engineering, Indiana University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, LIMICS, Sorbonne Université, INSERM</institution>
          ,
          <addr-line>F-93000, Bobigny</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Lorraine</institution>
          ,
          <addr-line>CNRS, Loria, Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper revisits the notion of case base competence in the light of recent advances in the modeling of analogical reasoning, based on the idea of similarity transfer from a situation space to an outcome space. For that we consider the CoAT indicator, that measures the compatibility between two similarity measures on a case base, and use it to define an intrinsic measure of competence of a case base with respect to a reference set. Initial experimental results show that the proposed competence measure correlates with the performance of the CoAT prediction algorithm. In fact, our preliminary results seem to indicate that, under some initial conditions, our competence based model can fit any classification boundary. We then revisit the notions of case competence and locality, and show that some source cases may degrade the overall case base competence while others may improve it, and that a given source case may have disparate influence on diferent regions of the case space.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Competence models</kwd>
        <kwd>case base compression</kwd>
        <kwd>energy-based models</kwd>
        <kwd>case base maintenance</kwd>
        <kwd>case-based classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Case bases are one of the main sources of knowledge used in case-based reasoning (CBR), along
with similarity knowledge, adaptation knowledge and domain knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Case acquisition
and maintenance therefore constitute crucial steps in the knowledge engineering process of
a CBR system. Acquiring and maintaining cases may be expensive, and case storage capacity
may be limited or constrained by selection criteria. Crafting a case base for a given task thus
requires addressing questions such as “which cases should be included in the case base?”, which
can alternatively be expressed as “which cases are the most competent?”, where the definition
of the competence notion can be seen as the formalization of this issue.
      </p>
      <p>
        Such questions have been extensively studied in the literature on case base maintenance
(e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7 ref8">2, 3, 4, 5, 6, 7, 8</xref>
        ]). Most competence models assume that problems are solved by a
Nearest Neighbor algorithm (often with  = 1) augmented by case adaptation, and are strongly
influenced by the way this algorithm works. For instance, it is often assumed that only the
most similar cases may contribute to solving a new case, provided that they are themselves
adaptable. The competence of a case is typically assessed by computing its coverage, i.e., the
set of cases that it may contribute to solving. Yet beyond approaches based on the -Nearest
Neighbor algorithm, no case-based prediction algorithm actually complies with this assumption.
Algorithms such as CCBI [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], PossIBL [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], or CoAT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], take into account the similarities with
all source cases in order to make a prediction. Here we examine competence criteria suitable
for such approaches.
      </p>
      <p>
        Competence methods yield guidance for case deletion to minimize the competence loss in the
case base compression process [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However, determining a suitable notion of competence
over the whole case base is challenging and, to our knowledge, no theoretical guarantees exist
to relate the competence of a case base and the performance of a corresponding CBR system.
      </p>
      <p>
        Recent advances in the modeling of analogical transfer shed new light on the case base
competence problem. It has been shown [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] that all case-based prediction methods share the
common inference principle based on the transfer of similarity knowledge from the situation
space, in which the cases are described, to an outcome space, in which their attached solutions
are described. This transfer can then be achieved by optimizing a measure of compatibility
between two similarity measures, respectively associated with each of the two spaces. The
latter idea motivated the case-based prediction method CoAT [
        <xref ref-type="bibr" rid="ref10 ref12 ref13">10, 12, 13</xref>
        ] that relies on the
optimization of a global compatibility indicator between two similarity measures on the case
base. In this framework, the predicted outcome is the one that entails the least increase in
the value of the CoAT indicator. The latter can be interpreted as an intrinsic indicator of the
optimality of the case-based setting for the task at hand. Preliminary results showed it can be
used to assess the quality of similarity measures or of solutions.
      </p>
      <p>
        In this paper, we further explore this indicator to address the problem of case competence.
The main idea is to exploit this indicator to define an intrinsic measure of competence of a
case base, which could be used later to obtain theoretical guarantees on the link between the
competence of a case and the performance of a case-based classifier or predictor. To do so,
we first propose to interpret the CoAT global indicator in an energy-based framework [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Then the competence of a source case can be intuitively related to its ability to reduce the
energy of correct outcomes and to increase the energy of incorrect outcomes. For instance, in a
classification setting, this would mean the case is lowering the energy of the good class, and
increasing the energy of all others.
      </p>
      <p>This new approach to the problem of case (base) competence has two noteworthy
consequences. First, rather than taking the traditional case base maintenance view of considering
only the nearest cases, it considers the compatibility of all cases in the case base. Intuitively,
this could be important for deletion scenarios, because case bases with lower energy should
provide more stable results when cases are deleted. Second, this makes clear the potential
ramification that case deletion could either decrease or increase system performance: cases may
be competent (increased overall performance) w.r.t. a given class, while entailing competence
degradation (decreased overall performance) w.r.t. another class.</p>
      <p>To support the latter and establish the relation between competence and performance of
a case-based system, we propose two loss functions (see Sec. 4): one that corresponds to the
intuitive notion of competence (i.e., counting positively the energy of correct outcomes and
negatively the energy of incorrect ones), and the other inspired by the hinge loss. We perform
a comparative study of the two and observe that the latter is preferable to the former. Thus
focusing on the hinge loss, we conduct several empirical studies to assess both the performance
and robustness of this CoAT-based competence notion (see Sec. 5) in various initial settings.
These experiments also indicate the potential use of our competence notion for case base
compression and maintenance purposes. Furthermore, they support that our competence-based
framework can produce surrogate models capable of approximating diferent classification
boundaries.</p>
      <p>The paper is organized as follows. In Sec. 2 we briefly discuss well-known approaches to
case base maintenance, and recall key definitions from related work on case competence. The
definition of the CoAT indicator is recapped in Sec. 3 and then used in Sec. 4 to propose a new
definition of case base competence and of competence of an individual case. We present several
empirical studies in Sec. 5 to support our performance and robustness claims, and to analyze
the behavior of our approach both quantitatively and qualitatively. Sec. 6 concludes the paper
and discusses several perspectives for future work.</p>
      <p>Main contributions. The main contributions of the paper are the following:
• We introduce an energy-based framework that relies on the optimization of the CoAT
indicator as a measure of similarity compatibility, and which is used to propose new
measures of case base competence w.r.t. diferent loss functions.
• We show empirically that thus defined, the case base competence is tightly linked to the
performance of case-based models, which constitutes a promising step towards theoretical
guarantees of performance.
• We propose fine-grained competence notions, namely, w.r.t. individual source cases and
w.r.t. individual reference cases, which can be used to identify areas of “expertise” of
cases in a case-based system.
• We present an empirical study to assess the robustness of the proposed approach w.r.t.
to diferent case base initializations and reference case sets, followed by an iterative
and qualitative analysis that shows the potential of the proposed approach for case base
maintenance and compression, as well as for fitting any classification boundary.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Basic Background and Motivation</title>
      <p>This section briefly presents the notation used throughout the paper and recaps the definition
of the case base maintenance task in the CBR setting.</p>
      <sec id="sec-2-1">
        <title>2.1. Key Definitions</title>
        <p>Let  denote an input space, and ℛ an output space. An element of  is called a situation, and an
element of ℛ is called an outcome, or a result. A set  = {(1, 1), . . . , (, )} of elements
in  × ℛ is called a case base. An element  = (, ) ∈  is called a source case. In addition,
the spaces  and ℛ are respectively equipped with the similarity measures   and  ℛ, that
respectively denote the similarity measure on situations and on outcomes. Let  ⊂  × ℛ be a
set of cases called a reference set, and  = (, ) ∈  be a reference case. We will write (, ^)
to denote a potential case constructed by keeping the same situation  ∈ , but choosing a
diferent outcome ^ ∈ ℛ, ^ ̸=  for the case.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Case Base Maintenance</title>
        <p>
          Case base maintenance revises the contents or organization of a case base to improve
performance, and is a longstanding research area for CBR (e.g., [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]). Much of this work has studied
case base compression by case deletion [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Compression eforts were initially motivated by the
desire to control retrieval costs and respect storage constraints. Advances in computational
power have reduced some of these concerns in practice [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] but compression remains useful
when eficiency is paramount and for reasons such as reducing the number of cases for a
knowledge engineer to maintain. Deletion of cases may remove the knowledge required to
solve particular problems, motivating maintenance work focused on retention of case base
competence, which Smyth [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and McKenna define as the range of problems a CBR system can
successfully solve.
        </p>
        <p>
          Case base compression strategies are often deletion-based, aimed at successively removing
cases whose loss will least harm competence. Estimates of case competence contributions are
commonly done based on the existing cases in the case base, under the representativeness
assumption [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that the case base is a good predictor of the distribution of future problems. This
assumption is expected to hold for domains well suited to CBR, when the case base is suficiently
mature, though may be endangered by problem drift (e.g., [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]). Estimation of expected case
competence contributions is commonly based on considering relationships between cases and
their nearest neighbors in the case base, favoring cases that have high coverage of other cases
and low reachability, i.e., that are recoverable from fewer cases [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>This paper presents a maintenance perspective that is novel in three ways. First, rather than
emphasizing the relationship of cases to nearby neighbors, the core of the approach is a global
optimization of a case base energy function. Second, rather than using a global approximation
of future problems, it defines competence with respect to specific reference sets. Third, it
questions the assumption that case base compression entails competence loss and illustrates
that compression may actually enhance performance, providing a new motivation for case base
compression.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The CoAT Method</title>
      <p>
        The CoAT method [
        <xref ref-type="bibr" rid="ref10 ref12 ref13">10, 12, 13</xref>
        ] performs analogical transfer by minimizing a global indicator of
compatibility between two similarity measures. In this section, we recall the definition of this
indicator, and show that it can be seen as an energy function.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Definition of the CoAT Indicator</title>
        <p>The compatibility of  ℛ with   for a given case base  is measured globally on the case
base , by a global indicator denoted Γ(  ,  ℛ, ). The latter takes an ordinal point of
view on the whole case base , by checking if the order induced by  ℛ is the same as the one
induced by   . The following continuity constraint is tested on each triple of cases (0, ,  ),
with 0 = (0, 0),  = (, ), and  = ( ,  ):
if   (0, ) ≥   (0,  ), then  ℛ(0, ) ≥  ℛ(0,  ).
()</p>
        <p>Constraint () expresses that whenever a situation  is more similar to situation 0 than to
situation  , this order should be preserved on outcomes. A triple (0, ,  ) does not satisfy ()
if case  is more similar to case 0 than to case  for situations, but less similar for outcomes, i.e.,
when   (0, ) ≥   (0,  ) and  ℛ(0, ) &lt;  ℛ(0,  ). Such a violation of the constraint
is called an inversion of similarity. The indicator Γ(  ,  ℛ, ) counts the total number of
inversions of similarity observed on a case base :
Γ(  ,  ℛ, ) = |{((0, 0), (, ), ( ,  )) ∈  ×  ×  such that
  (0, ) ≥   (0,  ) and  ℛ(0, ) &lt;  ℛ(0,  )}|.</p>
        <p>For a new situation , the transfer inference consists in finding the outcome  that leads to
the new case  = (, ) that minimizes the value of the indicator:
 = arg min Γ(  ,  ℛ,  ∪ {(, )}).</p>
        <p>∈ℛ
(1)
An important aspect to notice is that the CoAT method makes use of the whole case base ,
and not only the most similar case(s), in order to predict the outcome of the new case.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. An Energy Function View on the CoAT Method</title>
        <p>After briefly reminding the principles of the energy-based framework for solving machine
learning tasks, this section proposes to interpret the CoAT optimization of the global indicator
in this setting.</p>
        <p>
          Energy-based models. Inspired from statistical physics, energy-based models [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] specify
a probability distribution (;  ) = −  ()/ ∫︀ −  () via a parameterized scalar-valued
function  () called an energy function. In its conditional version, the definition of an energy
function  :  ×  →− R associates to each pair (, ) ∈  ×  a scalar value  (, ) that
represents the compatibility between the input  and the output  under the set of parameters  .
The energy function  takes low values when  is compatible with , and higher values when
 and  are less compatible. The goal of the energy-based inference is to find, among a set of
outputs , the output * ∈  that minimizes the value of the energy function:
* = arg min  (, ).
        </p>
        <p>∈</p>
        <p>
          Given a family of energy functions  (, ) indexed by a set of parameters  , the goal of the
learning step is to optimize the  parameters in order to “push down” (i.e., assign lower energy
values to) the points on the energy surface that are around the training samples, and to “pull
up” all other points. Contrastive divergence [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is a common learning strategy that, given a
numerical hyperparameter  , consists in optimizing a contrastive loss function such as the hinge
loss, which is defined, for a training sample (, ) and a generated out of distribution sample
(, ^) by: ℓ(,  , , ^) = max(0,  +  (, ) −  (, ^)). The hinge loss associates a
loss value to a training sample (, ) whenever its energy is not lower by at least a margin 
than the energy of the incorrect sample (, ^).
        </p>
        <p>The CoAT indicator as an energy function. The CoAT case-based prediction method can
be interpreted in the energy-based model framework, in which the energy  (, ) of any new
case (, ) is given by the value of the Γ indicator when the case is added to the case base, i.e.,
 (, ) = Γ(  ,  ℛ,  ∪ {(, )}).</p>
        <p>The input space  is the situation space  and the output space  is the outcome space ℛ.
The energy function  :  × ℛ →− R measures the compatibility of the outcome similarities
with the added situation similarities when the potential new case ^ = (, ) is added to the
case base. The energy function  is parameterized by  = (  ,  ℛ, ) which includes the
case base . The goal of the energy-based inference is to find, among the set of potential
outcomes  ∈ ℛ, the outcome  that minimizes the value of the energy function, and (1) can
be reformulated as:
 = arg min  (, ).</p>
        <p>∈ℛ</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Measuring Competence</title>
      <p>In this section we introduce new case (base) competence measures using the previous
energybased framework of the CoAT indicator, w.r.t. diferent loss functions. We then propose
finegrained variants of competence, at individual and reference case levels.</p>
      <sec id="sec-4-1">
        <title>4.1. Idea of the Method</title>
        <p>In the CoAT energy-based model, the energy function  (, ) is used to compute a (scalar)
energy value for each potential outcome  of the new case . The diference between the
energy of the predicted outcome and the lowest energy of all other outcomes can be interpreted
as a measure of prediction confidence. Therefore, our goal is to capture the idea that the
competence of a case base should be related to its ability to maximize the prediction confidence,
by decreasing the energy of the correct outcome of a new case and increasing the energy of
incorrect outcomes.</p>
        <p>We consider two diferent loss functions of the underlying energy-based model that take as
input, besides the  = (  ,  ℛ, ) parameters of the energy function, that are considered
to be fixed, an auxiliary set of reference cases  . The intuition is that, if   and  ℛ are fixed,
optimizing such a loss function should allow us to learn the right case base  for the task, i.e.,
address the case base maintenance issue.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Competence of a Case Base</title>
        <p>
          This section discusses two definitions of the competence of a case base  with respect to a
reference set  , that are defined from two diferent loss functions of the energy-based model.
MCE loss competence. The first definition of competence we propose, denoted 
relies on the notion of the minimum classification error loss ℓ [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] classically used in
the energy-based framework. More precisely,  computes the average value, across the
reference set, of this loss ℓ that is defined as the diference between the energy of the
correct outcome and the minimum energy of a reference case if it were assigned a diferent
outcome:
 (,  ) = −
1
        </p>
        <p>∑︁ ℓ (, ),
| | ∈
where ℓ (, ) =  (, ) − min^̸=  (, ^). For a correctly classified instance,
ℓ (, ) is a negative value whose magnitude can be interpreted as the prediction
confidence of CoAT, as mentioned previously. For an incorrectly classified instance, ℓ (, )
is a positive value that allows to measure the extent of the error, i.e., how much the true class is
missed. As a consequence, the lower the ℓ (, ) value, the better and, due to the - sign
in the definition of  (,  ), the greater the  (,  ), the better, i.e., the more
competent  is w.r.t.  .</p>
        <p>Hinge loss competence. The hinge loss competence ℎ modifies the minimum
classification error loss by integrating an additional parameter, denoted by  , that corresponds to a
margin. The values of ℓ (, ) that are lower than −  (corresponding to the instances
with high prediction confidence) are not taken into account and not allowed to compensate for
the misclassified instances:
ℎ(,  ) = −
1</p>
        <p>∑︁ ℓℎ(, ),
| | ∈
where ℓℎ(, ) = max(0,  + ℓ (, )).</p>
        <p>Comparison between the competence metrics.  is close to a direct translation of the
notion of competence described in Subsec. 4.1. However, in  , the negative contributions
to competence (incorrect predictions) and the positive ones (correct predictions) can cancel
each other out. In other words, a high increase of the confidence for correctly predicted class
can compensate for a lot of small misclassifications. This scaling issue between negative and
positive contributions to  is avoided in ℎ as only the negative contributions (to a
margin) are accounted for.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fine-Grained Competence: Case Level and Expertise Areas</title>
        <p>This section proposes to break down the case base competence at a more refined level,
considering the individual source and reference cases levels.
Proposed definitions. We first propose to define the competence of a source case  =
(, ) ∈  w.r.t. a reference set  as the loss of competence that would happen if this source
case was deleted from the case base:</p>
        <p>(, ,  ) = (,  ) − ( ∖ {},  ).</p>
        <sec id="sec-4-3-1">
          <title>As for any competence measure, the greater, the better.</title>
          <p>At an even finer level, we define the notion of competence locally as the contribution of a
source case  = (, ) ∈  on each individual reference case  ∈  . Indeed, the competence
(, ,  ) over the reference set  equals the sum over  of the loss (e.g., ℓℎ or ℓ ):
the above-defined competence of a case  ∈  w.r.t.  can be expressed as
(, ,  ) =
1</p>
          <p>∑︁  (, )
| | ∈
where  (, ) = ℓ(, ) − ℓ( ∖ {}, ). This notion of case influence entails
the idea of locality: all source cases contribute to the competence of the case base, but each source
case may contribute diferently on diferent regions of space. This enables the identification of
regions of expertise of a source case. Since the loss distinguishes between correct and incorrect
classification, case influence can also be used to identify those regions where a source case can
improve the performance from those where performance is degraded.</p>
          <p>Illustrative example. Figure 1 ofers a visualization of the case competence and influence
considering two source cases, circled in red, of the Half Moon dataset (see Subsec. 5.1), and
the map of their influence values. The left figure shows the least competent source case
((, ,  ) = − 4.680) which degrades the performance on many reference cases (dark red
regions) and contributes negatively to the overall competence. This case would be the first
one to be removed in a case deletion strategy. The right figure shows the most competent case
((, ,  ) = 4.020), which contributes positively to the competence of the case base: it
improves the performance of a large set of reference cases (green regions). Interestingly, this
case also harms the performance for some references of the opposing class.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>Algorithm 1 Case deletion procedure</title>
          <p>Require: An initial case base  and a reference set 
while || &gt; 0 do
 = arg min∈ (, ,  )
 =  ∖ {}
end while
Case deletion procedure. Source case competence can be applied in a case deletion
procedure, as described by Algorithm 1: at each iteration, the source case  that contributes
least to the competence of the case base  w.r.t. the reference set  is deleted from the case
base. To observe the efects of successive deletions this algorithm is exhaustive, but in practice
deletion would repeat only until a stopping criterion is reached (e.g., desired compression).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>We investigate experimentally the properties of the proposed case deletion procedure and
competence definitions, in particular examining their correlation with the classification performance
of the CoAT prediction algorithm. We also provide a stability analysis, as well as a qualitative
analysis of the results.</p>
      <sec id="sec-5-1">
        <title>5.1. Considered Artificial Datasets</title>
        <p>The experiments are performed in a binary classification setting with three synthetic 2D datasets
generated from 3 distributions coined Line, Ring, and Half Moon and respectively illustrated in
Fig. 2a, 2b, and 2c.</p>
        <p>
          The Line data are drawn from a uniform distribution defined on [
          <xref ref-type="bibr" rid="ref2">0, 2</xref>
          ] × [
          <xref ref-type="bibr" rid="ref3">0, 3</xref>
          ]. They are
labeled according to the arbitrary chosen line  () = −  + 2.5, and noise is added by randomly
switching the label, with a probability of 20%, for cases within a 0.3 distance to the boundary.
        </p>
        <p>For the Ring data, two classes are defined a concentric rings of radii 25 and 50. For each
class, points are randomly sampled using polar coordinates, drawing the angle from a uniform
distribution on [0, 2 ] and radius from a normal distribution  ( = ,  = 10), where
 ∈ {25, 50} is the radius of the class. The theoretical decision boundary for the Ring data is
the circle of radius 32.5.</p>
        <p>The Half Moon dataset is generated with “make_moons” function from the Scikit-Learn
library1 with a noise of 0.2. The distribution is composed of two halves of a circle, one of which
is shifted laterally by the radius. Each half-circle corresponds to a class.</p>
        <p>In all three cases,  = R2 and the associated similarity is a decreasing function of the standard
Euclidean distance   (, ) = exp(− 2(, )); the outcome space is ℛ = {0, 1} equipped
with  ℛ(, ) = 1 if  =  and 0, otherwise. Note that the three data distributions are more
or less compatible with   due to their geometry. In that regard, the limitations of   help
understand the performance of our approach when the similarity is not as good as it could be.</p>
        <sec id="sec-5-1-1">
          <title>1https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Protocol</title>
        <p>For each of the three data distributions considered, we generate 1000 samples that we split into
20 non-overlapping subsets of 50 cases, each one being balanced in terms of classes. We separate
them in 2 groups: 10 serve as initial case bases 1, . . . , 10 and the others as reference
sets 1, . . . , 10. Fig. 2a, 2b, and 2c display the overall 1000 samples together with their label,
showing in light colors the reference sets.</p>
        <p>For each pair (,  ), we apply the proposed compression algorithm. After each removal
step, the classification results obtained by the CoAT algorithm applied with the current  are
assessed by the macro F1 on all reference cases ⋃︀=1..10 .</p>
        <p>Fig. 2d, 2e and 2f show the evolution of this macro F1 criterion during the case
deletion procedure, comparing the two proposed competence measures  (, ,  ) and
ℎ(, ,  ); the shade corresponding to the 95% confidence interval over the 100
combinations of initial case base and references. In Fig. 2g, 2h, and 2i, each line shows the results for
one of the 10 case bases and the shades show the 95% confidence interval over the 10 reference
sets. Reciprocally, in Fig. 2j, 2k, and 2l, each line corresponds to a reference set and the shades
correspond to the 95% confidence interval over the 10 initial case bases.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results</title>
        <p>We study the behavior of the case-based models resulting from compression by performing
both quantitative and qualitative analyses.
5.3.1. Competence Definitions and Correlation with Performance
In the second row of Fig. 2, we compare the evolution of the macro F1 when using either
 or ℎ for case competence in the compression process. With  , F1 remains at
its maximum slightly longer, so a few more cases can be removed. However, with  , F1
remains at its initial value throughout the process and does not reach as high values as with
ℎ. For instance, on Ring, F1 reaches a value close to 60% for  and 85% for ℎ.</p>
        <p>Complementary experiments, whose curves are omitted for brevity, examine the evolution of
the case base competence during the compression process. They show that, as desired, the case
base competence remains constant or increases during compression when using ℎ. On the
other hand, using  causes an increasingly faster decrease of competence, making it a poor
choice for case base compression. Also, by comparing the decrease when using  with F1,
which is almost constant, it becomes striking that  is not directly correlated with predictive
performance, and this is problematic in our vision of competence. As mentioned in Subsec. 4.2,
prediction successes and failures are considered at the same time in  , but higher 
could be an expression of higher confidence in already well predicted cases, of fewer errors, or
of less confident errors. In that regard, ℎ is more suitable as a competence measure as it
measures how confident the model is in its errors, and thus higher ℎ corresponds to fewer
or less confident errors, which directly translates to higher performance.</p>
        <p>The experiments described hereafter consider only ℎ, as it is more interesting in terms
of performance, stability across datasets, and is a better fit for the notion of competence.
(a) Line distribution
(b) Ring distribution
(c) Half Moon distribution
(d)  &amp; ℎ, Line
(e)  &amp; ℎ, Ring
(f)  &amp; ℎ, Half Moon
(g) Initial cases, Line
(h) Initial cases, Ring
(i) Initial cases, Half Moon
(j) Reference cases, Line
(k) Reference cases, Ring
(l) Reference cases, Half Moon
5.3.2. Competence of the Case and Impact on Performance
Looking at the F1 evolution (see Fig. 2g to 2l) shows that no matter the initial cases in the case
base or the reference cases used, the same trend of performance can be observed: (i) a raise, (ii)
a plateau, and finally (iii) a faster and faster decrease. As our algorithm removes the cases by
order of increasing competence, it appears that (i) corresponds to incompetent cases, (ii) to cases
that are neither competent nor incompetent, and (iii) to competent cases. Going further, during
phase (i) removing cases improves the performance, meaning that the removed cases were
“polluting” the case base. In (ii), the removed cases neither harm nor benefit the performance of
the case base, as such they can be considered redundant w.r.t. the remaining cases. The cases
that remain are the most competent and useful ones, and in (iii) they are removed by order of
increasing competence, leading to sharper and sharper drops in performance.</p>
        <p>
          This behavior is similar to the footprint deletion procedure from Smyth and Keane [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], with
the auxiliary, spanning, and support cases removed in (ii), and pivotal cases removed in (iii).
Compared to [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], our procedure is more powerful as it can handle case bases that do not properly
ift the distribution of the data, as harmful cases are removed in priority in (i).
        </p>
        <p>As we observe a striking parallel between the performance change and the compression
step, it appears that ℎ suits the intuition of competence, since the step at which a case is
removed during compression is proportional to its competence. Furthermore, this general trend
provides empirical guarantees that the maximum performance is reached just before the first
significant decrease in performance, meaning we can stop the process as soon as we detect such
a decrease.
5.3.3. Robustness of the Compression
Robustness w.r.t. the initialization of the case base. Fig. 2g, 2h, and 2i show that the
initial cases in the case base change the initial performance and time needed to converge to the
general trend of performance. In extreme cases of poor initial performance, the convergence
might be delayed until after performance starts to decrease, as can be seen in Fig. 2i for the
lower of the two groups of case bases.</p>
        <p>By analyzing the distribution of each set of initial cases (not shown here for brevity), we
observe that not having enough cases in a particular area of the distribution (i.e., having holes
in the case base in important places) causes the case base to have dificulties to reach the best
performance. We were able to confirm this efect by manually removing cases in parts of the
distribution, in experiments omitted here for brevity. Conversely, if we manually make one class
over-represented, the performance is not damaged as much, as the cases in the over-represented
class are redundant and are removed in the plateau (ii).</p>
        <p>From these results, the initial cases harm the best performance only when the initial
performance is too poor (leading to converging too slowly to reach the best state) or when there are
no cases in an important area of the boundary.</p>
        <p>Robustness w.r.t. the reference cases. The cases used to measure the competence can
have a critical impact on the best performance reached. If there is no major gap between the
distribution of references and the true distribution of the data, the maximal performance can
be harmed but is still in the same range as the other references, as can be seen with the cyan
references in Fig. 2g and 2h. However, the efect of the references becomes striking when we
manually create holes in the distribution of references, in experiments omitted for brevity. In
that setting, the case base becomes biased towards the incorrect distribution of the references.
5.3.4. Qualitative Analysis
Fig. 3 displays, for each dataset and for a single initialization and reference, 3 steps of the case
deletion procedure: the initial case base (first column), after 10 deletion steps (second column),
and after 30 deletion steps (third column). In each figure, the red and blue dots represent the
remaining source cases, and the crosses and the triangles represent the references (triangles
and crosses respectively mean correct and incorrect prediction). The least competent source
case  that will be deleted is circled in red.</p>
        <p>The colored map in the figure represents CoAT’s predictions for new cases across the space,
with the color matching the predicted class and the saturation corresponding to the confidence
(i.e., the energy diference between the two outcomes, see Subsec. 4.1). In that manner, it is
possible to identify the decision boundary of the compressed case base. At the end of the process,
CoAT’s decision frontier meets the theoretical classification boundary of the distribution, even
for Half Moon, which has a relatively complex boundary. The decision frontier induced by the
compressed case base is thereby able to closely approximate the ideal classification boundary.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Discussion</title>
        <p>The compression process using ℎ is able to reduce the number of cases in the case base
to 40% (Ring) or even 20% (Line and Half Moon) of its initial size, while strictly improving
performance. While our current experiments only cover binary classification, our approach
is designed to handle any kind of nominal data in the outcome space. Further work on the
approach will include multi-class classification and real-world data.</p>
        <p>The robustness experiments show that the initial case base is not a major factor in the
peak performance, as long as there are enough cases in the important regions of the situation
space. However, it is important to have a proper set of reference cases, as the distribution
of the references is closely matched by the compressed case base. If the reference cases are
not representative of the true distribution of the data, then the compressed case base is not
guaranteed to match the true distribution. To summarize, it is useful to focus on the quality
of the reference (i.e., how representative of the actual distribution they are) and on having
suficient initial cases for the case base, even if their quality is rather poor, as long as they cover
enough of the distribution for the intended purpose of the model.</p>
        <p>Additionally, we obtain empirical evidence of the benefits of ℎ over  , and the
performance of the case base measured by ℎ correlates to CoAT’s prediction performance.
The question of whether this measure of competence is compatible with other CBR processes
than CoAT remains open, in particular since CoAT and our competence measure are based on
the same energy function. The ordering of cases—based on their competence—may change after
a case is removed, as our competence measure involves the rest of the case base. This might have
an efect on the compression process, but our energy-based approach to competence may ofer
theoretical guaranties or bounds on those changes. If the competence of a case remains stable
when removing another case, we can speed up convergence by removing cases by batches.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper introduced an energy-based approach to measuring the competence of a case base
for machine learning tasks such as case prediction and classification. This competence approach
difers from prior approaches proposed in the literature as it relies on the optimization of a
global compatibility indicator between two similarity measures, one on the situation space and
the other on the outcome space.</p>
      <p>We show empirically that this notion of competence is tightly related to performance for a
case-based classification task, in the sense that the competence of a source case is positively
correlated to its ability to reduce the energy of correct outcomes and to increase the energy
of incorrect outcomes. We analyze both quantitatively and qualitatively the behavior of this
competence-based approach on diferent datasets (with substantially diferent distributions) and
taking into account diferent classification frontiers and loss functions. Moreover, we analyze
its robustness with respect to diferent reference and initial cases.</p>
      <p>These results suggest the strong potential of this energy-based framework for guiding case
base maintenance, providing an alternative to existing methods. One of the main diferences
is that it employs a global approach by considering the competence of a case base as a whole,
rather than a local approach as it is often the case in the literature (where only nearest neighbors
are considered). The empirical and thorough comparison between the former and the latter will
constitute one of the topics to be investigated in a future contribution.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This research was partially supported by the ANR project “Analogies: from theory to tools
and applications” (AT2TA), ANR-22-CE23-0023 and by the ANR project “Similarity Measure
Learning for Analogical Transfer” (SMeLT), ANR-22-CE23-0032. David Leake’s work was funded
in part by the Department of the Navy, Ofice of Naval Research (Award N00014-19-1-2655).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Richter</surname>
          </string-name>
          , Knowledge Containers,
          <source>in: Readings in Case-Based Reasoning</source>
          ,
          <year>2003</year>
          . URL: https://www.researchgate.net/publication/225070310_Knowledge_Containers.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Arshadi</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Jurisica</surname>
          </string-name>
          ,
          <article-title>Maintaining case-based reasoning systems: A machine learning approach</article-title>
          , in: P. Funk,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>González-Calero</surname>
          </string-name>
          (Eds.),
          <source>Advances in Case-Based Reasoning, 7th European Conference, ECCBR'04'</source>
          , volume
          <volume>3155</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2004</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bridge</surname>
          </string-name>
          ,
          <article-title>On Dataset Complexity for Case Base Maintenance</article-title>
          , in: A.
          <string-name>
            <surname>Ram</surname>
          </string-name>
          , N. Wiratunga (Eds.),
          <source>Case-Based Reasoning Research and Development</source>
          , volume
          <volume>6880</volume>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <source>Combining and Choosing Case Base Maintenance Algorithms, Ph.D. thesis</source>
          , University College Cork,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Keane</surname>
          </string-name>
          ,
          <article-title>Remembering to forget: A competence-preserving case deletion policy for case-based reasoning systems</article-title>
          ,
          <source>in: Proc. of the 13th Int. Joint Conf. on Artificial Intelligence IJCAI</source>
          , Morgan Kaufmann,
          <year>1995</year>
          , pp.
          <fpage>377</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          , E. McKenna,
          <article-title>Competence models and the maintenance problem</article-title>
          ,
          <source>Computational Intelligence</source>
          <volume>17</volume>
          (
          <year>2001</year>
          )
          <fpage>235</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. C. K.</given-names>
            <surname>Shiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Transferring case knowledge to adaptation knowledge: An approach for case-base maintenance</article-title>
          ,
          <source>Comput. Intell</source>
          .
          <volume>17</volume>
          (
          <year>2001</year>
          )
          <fpage>295</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Remembering to add: Competence-preserving case-addition policies for case base maintenance</article-title>
          ,
          <source>in: Proc. of the 15th Int. Joint Conf. on Artificial Intelligence</source>
          , IJCAI, Morgan Kaufmann,
          <year>1999</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          ,
          <article-title>Credible case-based inference using similarity profiles</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>19</volume>
          (
          <year>2007</year>
          )
          <fpage>847</fpage>
          -
          <lpage>858</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Badra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Dataset</given-names>
            <surname>Complexity</surname>
          </string-name>
          <article-title>Measure for Analogical Transfer</article-title>
          ,
          <source>in: Proc. of the 29th Int. Joint Conf. on Artificial Intelligence IJCAI</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1601</fpage>
          -
          <lpage>1607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Badra</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-J. Lesot</surname>
          </string-name>
          ,
          <string-name>
            <surname>Case-Based Prediction - A Survey</surname>
          </string-name>
          , IJAR
          <volume>158</volume>
          (
          <year>2023</year>
          )
          <fpage>108920</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Badra</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-J. Lesot</surname>
          </string-name>
          ,
          <article-title>Theoretical and Experimental Study of a Complexity Measure for Analogical Transfer</article-title>
          , in: ICCBR,
          <year>2022</year>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Badra</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-J. Lesot</surname>
          </string-name>
          , CoAT-APC:
          <article-title>When Analogical Proportion-based Classification Meets Case-Based Prediction</article-title>
          ,
          <source>in: Proc. of the workshop on Analogies: from Theory to Applications</source>
          ATA@ICCBR,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , S. Chopra,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          <article-title>Tutorial on Energy-Based Learning</article-title>
          , in: Predicting Structured Data, MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Leake</surname>
          </string-name>
          ,
          <article-title>Maintaining case-based reasoners: Dimensions and directions</article-title>
          ,
          <source>Computational Intelligence</source>
          <volume>17</volume>
          (
          <year>2001</year>
          )
          <fpage>196</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Houeland</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Aamodt,</surname>
          </string-name>
          <article-title>The utility problem for lazy learners - towards a non-eager approach</article-title>
          , in: I.
          <string-name>
            <surname>Bichindaritz</surname>
          </string-name>
          , S. Montani (Eds.),
          <source>Case-Based Reasoning Research and Development, ICCBR 2010</source>
          , Springer,
          <year>2010</year>
          , pp.
          <fpage>141</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Leake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schack</surname>
          </string-name>
          ,
          <article-title>The problem drift problem and first steps towards addressing it for CBR</article-title>
          ,
          <source>in: Case-Based Reasoning Research and Development, ICCBR 2023</source>
          , Springer,
          <year>2023</year>
          . In press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-W.</given-names>
            <surname>Teh</surname>
          </string-name>
          ,
          <source>Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation, Cognitive Science 30</source>
          (
          <year>2006</year>
          )
          <fpage>725</fpage>
          -
          <lpage>731</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>