Conformal sets in neural network regression?

                                          Radim Demut1 and Martin Holeňa2
                                1
                                    Faculty of Nuclear Sciences and Physical Engineering
                                           Czech Technical University in Prague
                                                      demut@seznam.cz
                                             2
                                                Institute of Computer Science
                                       Academy of Sciences of the Czech Republic
                                                     martin@cs.cas.cz

Abstract. This paper is concerned with predictive regions these predictors are not suitable for neural network
in regression models, especially neural networks. We use  regression, therefore, we also introduce inductive con-
the concept of conformal prediction (CP) to construct re- formal predictors where the prediction rule is updated
gions which satisfy given confidence level. Conformal pre-only after a given number of new examples has arrived
diction outputs regions, which are automatically valid, but
                                                          and a calibration set is used.
their width and therefore usefulness depends on the used
                                                              In order to define a conformal predictor we need
nonconformity measure. A nonconformity measure should
tell us how different a given example is with respect to
                                                          a suitable nonconformity measure. A nonconformity
other examples. We define nonconformity measures based    measure should tell us how different a given example
on some reliability estimates such as variance of a baggedis with respect to other examples. In chapter 3, we in-
                                                          troduce two reliability estimates: variance of a bagged
model or local modeling of prediction error. We also present
                                                          model and local modeling of prediction error. We use
results of testing CP based on different nonconformity mea-
sures showing their usefulness and comparing them to tra- these reliability estimates in chapter 4 to define nor-
ditional confidence intervals.                            malized nonconformity measures. Some other reliabil-
                                                          ity estimates could be used, e.g. sensitivity analysis or
                                                          density based reliability estimate.
1 Introduction                                                In chapter 5, we use CP, based on nonconformity
                                                          measures defined in chapter 4, on testing data to com-
This paper is concerned with predictive regions for re- pare our conformal regions with traditional confidence
gression models, especially neural networks. We often intervals and with conformal intervals where these tra-
want to know not only the label y of a new object, but ditional confidence intervals are used to construct the
also how accurate the prediction is. Could the real la- nonconformity measure.
bel be very far from our prediction or is our prediction
very accurate? It is possible to use traditional confi-
                                                          2 Conformal prediction
dence intervals to answer this question but they do not
work very well with highly nonlinear regression models
                                                          We assume that we have an infinite sequence of pairs
such as neural networks. We use conformal prediction
to solve this problem and construct some accurate and                       (x1 , y1 ), (x2 , y2 ), . . . ,     (1)
useful prediction regions.
    We introduce conformal prediction (CP) in chap- called examples. Each example (xi , yi ) consists of an
ter 2. Conformal prediction does not output single la- object xi and its label yi . The objects are elements of
bel but a set of labels Γ ε . The size of the prediction a measurable space X called the object space and the
set depends on a significance level ε which we want labels are elements of a measurable space Y called the
to achieve. Significance level is under some conditions label space. Moreover, we assume that X is non-empty
the probability that our prediction lies outside the set. and that the σ-algebra on Y is different from {∅, Y}.
The set is smaller for larger ε. If we have some predic- We denote zi := (xi , yi ) and we set
tion rule, we will call it simple predictor and we can
use it to construct conformal predictor. We introduce                             Z := X × Y                    (2)
transductive conformal predictors where the predic-
                                                          and call Z the example space. Thus the infinite data se-
tion rule is updated after a new example arrives. But
                                                          quence (??) is an element of the measurable space Z∞ .
 ?
   This work was supported by the Grant Agency of             Our standard assumption is that the examples are
   the Czech Technical University in Prague, grant No.    chosen  independently from some probability distribu-
   SGS12/157/OHK4/2T/14 and the Czech Science Foun- tion Q on Z, i.e. the infinite data sequence (??) is
   dation grant 201/08/0802.                              drawn from the power probability distribution Q∞ on
18      Radim Demut, Martin Holeňa

Z∞ . Usually we need only slightly weaker assumption              Formally, a confidence predictor is a measurable
that the infinite data sequence (??) is drawn from a function
distribution P on Z∞ that is exchangeable, that means                          Γ : Z∗ × X × (0, 1) → 2Y                      (7)
that every n ∈ IN, every permutation π of {1, . . . , n},
                                                              that satisfies (??) for all n ∈ IN, all incomplete data se-
and every measurable set E ⊆ Z∞ fulfill
                                                              quences x1 , y1 , . . . , xn−1 , yn−1 , xn and all significance
                              ∞
      P {(z1 , z2 , . . .) ∈ Z : (z1 , . . . , zn ) ∈ E} =    levels ε1 ≥ ε2 .
                             ∞                                    Whether Γ makes an error on the nth trial of the
     P {(z1 , z2 , . . .) ∈ Z : (zπ(1) , . . . , zπ(n) ) ∈ E}
                                                              data sequence ω = (x1 , y1 , x2 , y2 , . . .) at significance
    We denote Z∗ the set of all finite sequences of ele- level ε can be represented by a number that is one in
ments of Z, Zn the set of all sequences of elements of Z case of an error and zero in case of no error
of length n. The order in which old examples appear
                                                                                         
                                                                                          1 if yn ∈/ Γ ε (x1 , y1 , . . . ,
should not make any difference. In order to formalize
                                                                                         
                                                                                         
this point we need the concept of a bag. A bag of size                errεn (Γ, ω) :=         xn−1 , yn−1 , xn ) ,           (8)
                                                                                          0 otherwise ,
                                                                                         
n ∈ IN is a collection of n elements some of which may
                                                                                         
be identical. To identify a bag we must say what ele-
ments it contains and how many times each of these and the number of errors during the first n trials is
elements is repeated. We write \z1 , . . . , zn / for the bag                                   n
consisting of elements z1 , . . . , zn , some of which may
                                                                                               X
                                                                             Errεn (Γ, ω) :=       errεi (Γ, ω) .   (9)
be identical with each other. We write Z(n) for the                                            i=1
set of all bags of size n of elements of a measurable
space Z. We write Z(∗) for the set of all bags of ele-             If ω is drawn from an exchangeable probability
ments of Z.                                                   distribution P , the number errεn (Γ, ω) is the realized
                                                              value of a random variable, which we may designate
                                                              errεn (Γ, P ). We say that confidence predictor Γ is con-
2.1 Confidence predictors
                                                              servatively valid if for any exchangeable probability
We assume that at the nth trial we have firstly only distribution P on Z∞ there exist two families
the object xn and only later we get the label yn . If we
want to predict yn , we need a simple predictor                              (ξn(ε) : ε ∈ (0, 1), n = 1, 2, . . .) (10)

                     D : Z∗ × X → Y .                      (3)   and
                                                                                 (ηn(ε) : ε ∈ (0, 1), n = 1, 2, . . .)   (11)
For any sequence of old examples x1 , y1 , . . . , xn−1 ,
yn−1 ∈ Z∗ and any new object xn , it gives of {0, 1}-valued variables such that
D(x1 , y1 , . . . , xn−1 , yn−1 , xn ) ∈ Y as its prediction for                        (ε) (ε)
                                                                  – for a fixed ε, ξ1 , ξ2 , . . . is a sequence of indepen-
the new label yn .                                                  dent Bernoulli random variables with parameter
    Instead of merely choosing a single element of Y                ε;
as our prediction for yn , we want to give subsets of Y                                     (ε)
                                                                  – for all n and ε, ηn ≤ ξn ;
                                                                                                    (ε)
large enough that we can be confident that yn will fall           – the joint distribution of errεn (Γ, P ), ε ∈ (0, 1), n =
in them, while also giving smaller subsets in which we              1, 2, . . ., coincides with the joint distribution of
are less confident. An algorithm that predicts in this               (ε)
                                                                    ηn , ε ∈ (0, 1), n = 1, 2, . . ..
sense requires additional input ε ∈ (0, 1), which we
call significance level, the complementary value 1 − ε
is called confidence level. Given all these inputs               2.2 Transductive conformal predictors

               x1 , y1 , . . . , xn−1 , yn−1 , xn , ε      (4)   A nonconformity measure is a measurable mapping

an algorithm Γ that interests us outputs a subset                                    A : Z(∗) × Z → IR .                   (12)

              Γ ε (x1 , y1 , . . . , xn−1 , yn−1 , xn )     To each possible bag of old examples and each possible
                                                           (5)
                                                            new example, A assigns a numerical score indicating
of Y. We require this subset to shrink as ε is increased how different the new example is from the old ones.
that means it holds                                         It is sometimes convenient to consider separately how
                                                            a nonconformity measure deals with bags of different
         Γ ε1 (x1 , y1 , . . . , xn−1 , yn−1 , xn ) ⊆       sizes. If A is a nonconformity measure, for each n =
             Γ ε2 (x1 , y1 , . . . , xn−1 , yn−1 , xn ) (6) 1, 2, . . . we define a function

whenever ε1 ≥ ε2 .                                                                  An : Z(n−1) × Z → IR                   (13)
                                                                                                             Conformal sets    19

as the restriction of A to Z(n−1) × Z. The sequence A discrepancy measure is a measurable function
(An : n ∈ IN), which we abbreviate to (An ) will also
be called a nonconformity measure.                                                            ∆ : Y × Y → IR .         (21)
    Given a nonconformity measure (An ) and a bag
\z1 , . . . , zn / we can compute the nonconformity score Given a simple predictor D and a discrepancy mea-
                                                                    sure ∆ we define functions (An ) as follows: for any
        αi := An (\z1 , . . . , zi−1 , zi+1 , . . . zn /, zi ) (14) ((x1 , y1 ), . . . , (xn , yn )) ∈ Z∗ , the values

for each example zi in the bag. Because a nonconfor-              αi = An (\(x1 , y1 ), . . . , (xi−1 , yi−1 ),
mity measure (An ) may be scaled however we like, the              (xi+1 , yi+1 ), . . . , (xn , yn )/, (xi , yi )) (22)
numerical value of αi does not, by itself, tell us how
unusual (An ) finds zi to be. For that we define p-value are defined according to (??) and (??) by the formula
for zi as
                                                                       αi := ∆(yi , D\z1 ,...,zn / (xi ))           (23)
                 |{j = 1, . . . , n : αj ≥ αi }|
           p :=                                  .  (15)
                                 n                       and the formula
    We define transductive conformal predictor (TCP)          αi := ∆(yi , D\z1 ,...,zi−1 ,zi+1 ,...,zn / (xi )) , (24)
by a nonconformity measure (An ) as a confidence pre-
dictor Γ obtained by setting                             respectively. It can be easily checked that in both
              ε                                          cases (An ) form a nonconformity measure.
            Γ (x1 , y1 , . . . , xn−1 , yn−1 , xn ) (16)

equal to the set of all labels y ∈ Y such that                      2.3    Inductive conformal predictors
       |{i = 1, . . . , n : αi (y) ≥ αn (y)}|
                                              >ε ,             (17) In TCP, we need to compute the p-value (??) for all
                           n                                        labels y ∈ Y to determine the set Γ ε . In the case of
where                                                               regression, we have Y = IR and it is not possible to
                                                                    try each y ∈ Y. Sometimes it is possible to generally
  αi (y) := An (\(x1 , y1 ), . . . , (xi−1 , yi−1 ),                solve equations αi (y) ≥ αn (y) with respect to y, and
                                                                    therefore determine the set Γ ε . But if we use neural
            (xi+1 , yi+1 ), . . . , (xn−1 , yn−1 ), (xn , y)/,
                                                                    networks as simple predictor, we do not know the gen-
            (xi , yi )) , ∀i = 1, . . . , n − 1 ,                   eral form of the simple predictor, i.e. we do not know
  αn (y) := An (\(x1 , y1 ), . . . , (xn−1 , yn−1 )/, (xn , y)) .   a functional relationship between the training set and
                                                                    the trained network, because random influences en-
    We now remind an important property of TCP. ter the training algorithm. Hence, we cannot solve the
The proof of the following theorem can be found in [?]. equations αi (y) ≥ αn (y), and it is not possible to use
                                                                    TCP. Even if the equations can be solved, it can be
Theorem 1. All conformal predictors are conserva-
                                                                    very computationally inefficient.
tively valid.
                                                                        To avoid this problem we can use inductive confor-
    If we are given a simple predictor (??) whose out- mal predictor (ICP). To define ICP from a nonconfor-
put does not depend on the order in which the old mity measure (An ) we fix a finite or infinite increasing
examples are presented, than the simple predictor D sequence of positive integers m1 , m2 , . . . (called update
defines a prediction rule D\z1 ,...,zn / : X → Y by the trials). If the sequence is finite we add one more mem-
formula                                                             ber equal to infinity at the end of the sequence. We
                                                                    need more than m1 training examples. Then we find
           D\z1 ,...,zn / (x) := D(z1 , . . . , zn , x) .      (18) k such that mk < n ≤ mk+1 . The ICP is determined
                                                                    by (An ) and the sequence m1 , m2 , . . . of update trials
A natural measure of nonconformity of zi is the devi- is defined to be the confidence predictor Γ such that
ation of the predicted label                                        the prediction set

                    ybi := D\z1 ,...,zn / (xi )              (19)                 Γ ε (x1 , y1 , . . . , xn−1 , yn−1 , xn )   (25)
from the true label yi . We can also use the deleted                is equal to the set of all labels y ∈ Y such that
prediction defined as
                                                                          |{j = mk + 1, . . . , n : αj ≥ αn (y)}|
           yb(i) := D\z1 ,...,zi−1 ,zi+1 ,...,zn / (xi ) .   (20)                                                 >ε ,        (26)
                                                                                       n − mk
20       Radim Demut, Martin Holeňa

where the nonconformity scores are defined by                       3.2   Local modeling of prediction error

  αj := Amk +1 (\(x1 , y1 ), . . . , (xmk , ymk )/, (xj , yj )) , We find k nearest neighbors of an unlabeled exam-
                                                                  ple x in the training set, therefore, we have a set
         for j = mk + 1, . . . , n − 1                       (27) N = {(x , y ), . . . , (x , y )} of nearest neighbors. We
                                                                           1 1             k k
  αn := Amk +1 (\(x1 , y1 ), . . . , (xmk , ymk )/,               define the estimate denoted CNK for an unlabeled ex-
         (xn , y)) .                                         (28) ample x as the difference between the average label of
                                                                  the nearest neighbors and the example’s prediction y
    The proof of the following theorem can be found (using the model that was generated on all learning
in [?].                                                           examples)
                                                                                                Pk
Theorem 2. All ICPs are conservatively valid.                                                         yi
                                                                                 CNK(x) := i=1 − y .                    (33)
    For ICP combining (??) with (??) and (??) we get                                                k
                                                                  The dependence on x on the right hand side of the
        Al+1 ( \(x1 , y1 ), . . . , (xl , yl )/, (x, y))          previous equation is implicit, but both the prediction y
                = ∆(y, D\(x1 ,y1 ),...,(xl ,yl ),(x,y)/ (x)) (29) and the selection of nearest neighbors depends on x.

and
                                                                    4     Normalized nonconformity
         Al+1 ( \(x1 , y1 ), . . . , (xl , yl )/, (x, y))                 measures
                  = ∆(y, D\(x1 ,y1 ),...,(xl ,yl )/ (x)) , (30) We will follow a similar approach as is used in the
                                                                 article [?], but we will incorporate the reliability es-
respectively. When we define A by (??), we can see
                                                                 timates from previous chapter and use it for neural
that the ICP requires recomputing the prediction rule
                                                                 network regression.
only at the update trials m1 , m2 , . . .. We will use the
                                                                     We will use ICP with only one update trial. Let us
simplest case, where there is only one update trial m1 ,
                                                                 have training set of size l, where l > m1 . We will split
therefore, we compute the prediction rule only once.
                                                                 it into two sets, the proper training set T of size m1
                                                                 (we will further write m) and the calibration set C of
3 Reliability estimates                                          size q = l − m. We will use the proper training set
                                                                 for creating the simple predictor D\(x1 ,y1 ),...,(xm ,ym )/ .
In this chapter we are interested in different ap- The calibration set is used for calculating the p-value
proaches to estimate the reliability of individual pre- of new test examples. It is good to first normalize the
dictions in regression.                                          data (i.e. subtract the mean and divide data by sample
                                                                 variance).
                                                                     We will denote ri any of the previously defined re-
3.1 Variance of a bagged model                                   liability estimates in the point xi with given simple
We are given a learning set L = {(x1 , y1 ), . . . , (xn , yn )} predictor D. We compute ri for all points in the cali-
and take repeated bootstrap samples L(i) , i = 1, . . . , m bration set and define Ri for any given point xi as
of size d from the learning set, i.e. for i = 1, . . . , m                                       ri
                                                                               Ri :=                          .          (34)
we randomly choose d points from the original learn-                                  median{rj : rj ∈ C}
ing set L with the return and put them in L(i) . The
                                                                 We define a discrepancy measure (??) as
number of points d can be chosen arbitrary. We in-
duce a new model on each of these bootstrap sam-                                                  y1 − y2
       (i)                                                                       ∆(y1 , y2 ) :=             ,            (35)
ples L . Each of the models yields a prediction Ki (x),                                           γ + Ri
i = 1, . . . , m for a considered input x. The label of the where parameter γ ≥ 0 controls the sensitivity to
example x is predicted by averaging the individual pre- changes of R . Then, we get the nonconformity score
                                                                               i
dictions                     Pm
                                   Ki (x)                                                       y − ŷi
                    K(x) := i=1            .               (31)                     αi (y) =              .              (36)
                                  m                                                            γ + Ri
We call this procedure bootstrap aggregating or bag-                 We sort nonconformity scores of the calibration ex-
ging. The reliability estimate of a bagged model is de- amples in descending order
fined as the prediction variance
                                                                                 α(m+1) ≥ . . . ≥ α(m+q) ,               (37)
                            m
                         1 X
      BAGV(x) :=              (Ki (x) − K(x))2 .           (32) and denote
                        m i=1                                                         s = bε(q + 1)c .                   (38)
                                                                                                        Conformal sets        21

Proposition 1. The prediction set Γ ε of the new test The value of this function ϑ in the point (x1 , x2 , x3 ,
example xl+g (where xl+g is from the infinite se- x4 , x5 ) can be expressed as
quence (??)) given the nonconformity score (??) is
equal to the interval                                           ϑ(x1 , x2 , x3 , x4 , x5 ) = −A(x1 , x2 )
                                                                   −B(x2 , x3 )C(x3 , x4 , x5 ) ,         (44)
hŷl+g − α(m+s) (γ + Rl+g ), ŷl+g + α(m+s) (γ + Rl+g )i .
                                                     (39) where
                                                                        A(x1 , x2 ) = 0.6g(x1 − 0.35, x2 − 0.35)
Proof. To compute the prediction set Γ ε of the new
test example xl+g we need to find all y ∈ Y such that                                  +0.75g(x1 − 0.1, x2 − 0.1)
for the p-value it holds                                                               +g(x1 − 0.35, x2 − 0.1)
                                                                        B(x2 , x3 ) = 0.4g(x2 − 0.1, x3 − 0.3)
      p(y) =
                                                                     C(x3 , x4 , x5 ) = 5 + 25[1 − {1 + (x3 − 0.3)2
      |{i = m + 1, . . . , m + q, l + g : αi ≥ αl+g (y)}|
                             q+1                                                       +(x4 − 0.15)2 + (x5 − 0.1)2 }1/2 ]
                                                                                            p
      >ε .                                               (40)                g(a, b) = 100 − (100a)2 + (100b)2
                                                                                                 p
                                                                                              sin (100a)2 + (100b)2
We multiply the inequality by q + 1 and then it is                                     +50 p                              .
                                                                                             (100a)2 + (100b)2 + (0.01)2
equivalent to
                                                                    Moreover, the input vectors must satisfy following con-
    |{i = m + 1, . . . , m + q, l + g : αi ≥ αl+g (y)}| >           ditions
                                               bε(q + 1)c (41)        5
                                                                      X
                                                                            xi = 1   and xi ∈ [0, 1], for i = 1, . . . , 5 . (45)
and this inequality holds if and only if                              i=1

                                       y − ŷl+g                        We repeated the following procedure five times for
            α(m+s) ≥ αl+g (y) =                      .       (42)
                                       γ + Rl+g                     region with significance level 0.1 and five times for
                                                                    region with significance level 0.05.
From (??) follows the assertion of the proposition.
                                                                     – Randomly generate 600 points satisfying the con-
                                                                       ditions (??).
5     Simulation                                                     – Compute the function values of function ϑ in these
                                                                       points.
                                                                     – Normalize data (i.e. subtract the mean and divide
We carried out a simulation to test the normalized                     data by sample variance)
nonconformity measures based on different reliability                – Split this set of points into a training set of
estimates. We used neural networks with radial ba-                     500 points and a testing set of 100 points.
sis functions (RBF networks) as our regression models                – Split the training set into a proper training set of
with Gaussian used as the basis function. Therefore,                   401 points and a calibration set of 99 points (then,
the output of the RBF network f : IRn → IR has the                     we divide the p-value in (??) by 100).
form                                                                 – Split the proper training set on training set for fit-
                     N                                                 ting the RBF network and the validation set. Fit
                     X
                           πi exp −βi ||x − ci ||2                     the RBF network with 1, 2, 3, 4 and 5 hidden neu-
                                 
           f (x) =                                       ,   (43)
                     i=1                                               rons ten times using the Matlab function lsqcurve-
                                                                       fit.
where N is the number of neurons in the hidden layer,                – Choose the RBF network with the smallest error
ci is the center vector for neuron i, βi determines the                on the validation set for each number of hidden
width of the ith neuron and πi are the weights of the                  neurons.
linear output neuron. RBF networks are universal ap-                 – Compute the prediction sets for each of the
proximators on a compact subset of IRn . This means                    100 testing points for each number of hidden neu-
that a RBF network with enough hidden neurons can                      rons.
approximate any continuous function with arbitrary                   – Transform data and predictive regions back to the
precision.                                                             original size (i.e. multiply by the original sample
    We used a benchmark function similar to some em-                   variance and add the original mean)
pirical functions encountered in chemistry to carry out              – Determine if the original point lies in our predic-
our experiment. This function was introduced in [?].                   tion sets.
22      Radim Demut, Martin Holeňa

The initial values of parameters πi were set as mean of      predictive regions is always slightly higher than the
the response vector, initial values of βi were set as the    confidence level.
mean of the standard deviation of the components of              Results for predictive regions based on the local
training data points. The centers ci were set randomly.      modeling of prediction errors depend a little bit on the
    We also computed confidence intervals using Mat-         count of nearest neighbors. These intervals are valid
lab function nlpredci (denoted Conf Int). The Jaco-          for all numbers of neighbors, but the tightest inter-
bian can be computed exactly, because the form of the        vals were achieved for two neighbors. The difference
RBF network is known and differentiable. Therefore,          between using five or ten neighbors is not too big but
we supply the function nlpredci with this Jacobian. We       lower number of neighbors works better in our model.
also use the width of this interval as another reliability   This is probably caused by our data and it seems that
estimate for our normalized nonconformity measure.           only a few neighbors are relevant to our prediction.
    We compare normalized nonconformity measures             These regions are also the easiest and fastest to com-
based on the following reliability estimates: the local      pute.
modeling of prediction errors using nearest neighbors            The best results among all predictive regions are
(CNK), the variance of a bagged model (BAGV) and             achieved by those based on a variance of a bagged
the width of confidence intervals (CONF).                    model. These regions are the tightest of all tested and
    The variance of a bagged model was computed for          they do not vary as much as those based on confidence
number of different models m = 10 and the bootstrap          intervals. These regions also maintain the validity. The
samples were as big as the original sample.                  drawback of these regions is that we need to fit a lot of
    The CNK estimates were computed for number of            additional models which takes a lot of time in the case
neighbors k = 2, 5, 10.                                      of neural network regression. But if time and compu-
                                                             tational efficiency is not a problem then this method
    We present the results of testing CP based on dif-
                                                             produces best regions.
ferent nonconformity measures in Figures ??, ?? and
??. There is a boxplot of all labels in Figure ?? to com-
pare the range of all labels with the width of different
predictive regions. Figures ?? and ?? show boxplots of             450
the width of prediction regions for significance levels
                                                                   400
ε = 0.1 and ε = 0.05, respectively. It is not only in-
                                                                   350
teresting whether the intervals are small enough, but
they should also be valid. The percentage of labels in-            300

side the predictive regions are in Tables ?? and ?? for            250

significance levels ε = 0.1 and ε = 0.05, respectively.            200

    The results for traditional confidence inter-                  150
vals computed by Matlab function nlpredci are not                  100
shown in the figures, because these results are very
                                                                    50
different from the others. The median width for these
intervals lies between 1010 and 1014 for all counts of                                     1

neurons. This is probably because of the highly non-
                                                                            Fig. 1. Boxplot of all labels.
linear character of neural nets, while nlpredci is based
on linearization. Moreover, during the computation of
these intervals a Jacobian matrix must be inverted but
this matrix was very often ill conditioned, therefore,
the results for confidence intervals are not too reliable.
    Despite what was said in the previous paragraph,          Neurons    CNK2      CNK5        CNK10   BAGV    CONF
the predictive regions based on the width of confidence          2        91.0      91.2        90.0    91.4    92.4
                                                                 3        92.6      92.4        92.6    94.0    93.6
intervals produce sensible results. But these prediction
                                                                 4        92.2      90.4        90.0    90.4    90.2
regions show highest inconsistency between different
                                                                 5        94.6      92.6        90.0    90.2    90.8
neuron counts and have highest number of very large              6        92.8      89.8        91.8    91.6    91.8
intervals. These regions produce sometimes very good
results, but they are probably very dependent on the         Table 1. Percentage of labels inside predictive regions for
actual fit of the neural network and their results are       ε = 0.1.
not as consistent as the results of the other methods.
However, we can see in Tables ?? and ?? that these
intervals are valid as the percentage of labels inside
                                                            Conformal sets   23


                           Neurons: 2


100

  0
      CNK2    CNK5           CNK10          BAGV     CONF
                           Neurons: 3

100

  0
      CNK2    CNK5           CNK10          BAGV     CONF

                           Neurons: 4

100
  0
      CNK2    CNK5           CNK10          BAGV     CONF
                           Neurons: 5

100

  0
      CNK2    CNK5           CNK10          BAGV     CONF
             Fig. 2. Interval widths for ε = 0.1.


                           Neurons: 2
400
200
  0
      CNK2    CNK5          CNK10           BAGV     CONF
                           Neurons: 3
400
200
  0
      CNK2    CNK5           CNK10          BAGV     CONF

                           Neurons: 4
400
200
  0
      CNK2    CNK5           CNK10          BAGV     CONF

                           Neurons: 5
400
200
  0
      CNK2    CNK5           CNK10          BAGV     CONF
             Fig. 3. Interval widths for ε = 0.05.
24       Radim Demut, Martin Holeňa

    Neurons   CNK2    CNK5      CNK10     BAGV      CONF
       2       95.8    96.8      96.8      96.2      95.6
       3       97.2    96.8      96.8      97.8      95.6
       4       96.2    96.4      96.6      97.4      97.0
       5       97.2    97.6      96.4      96.4      97.6
       6       97.2    98.0      96.4      97.4      96.8

Table 2. Percentage of labels inside predictive regions for
ε = 0.05.


6     Conclusion
We presented several methods for computing predic-
tive regions in neural network regressions. These meth-
ods are based on the inductive conformal prediction
with novel nonconformity measures proposed in this
paper. Those measures use reliability estimates to de-
termine how different a given example is with respect
to other examples. We compared our new predictive
regions with traditional confidence intervals on test-
ing data. The confidence intervals did not perform
very well, the intervals were too large, it was probably
caused by the high nonlinearity of radial basis neu-
ral networks. Predictive regions which used the width
of confidence intervals as the nonconformity measure
gave much better results. But those results were not as
consistent as the results of the other methods. Predic-
tive regions based on the local modeling of prediction
errors gave us good results and the computation of the
regions was very fast. A smaller number of neighbors
gave better results for these regions. The best results
were achieved by the regions based on the variance of
a bagged model. The only drawback of this method is
that a lot of models must be fitted and it is, therefore,
computationally very inefficient.


References
1. Z. Bosnic, Kononenko, I.: Comparison of approaches
   for estimating reliability of individual regression predic-
   tions. Data & Knowledge Engineering, 2008, 504–516.
2. A. Gammerman, G. Shafer, V. Vovk: Algorithmic learn-
   ing in a random world. Springer Science+Business Me-
   dia, 2005.
3. E. Uusipaikka: Confidence intervals in generalized re-
   gression models. Chapman & Hall, 2009.
4. H. Papadopoulos, V. Vovk, A. Gammerman: Regression
   conformal prediction with nearest neighbours. Journal of
   Artificial Intelligence Research 40, 2011, 815–840.
5. S. Valero, E. Argente, et al.: DoE framework for catalyst
   development based on soft computing techniques. Com-
   puters and Chemical Engineering 33(1), 2009, 225–238.