Conformal sets in neural network regression? Radim Demut1 and Martin Holeňa2 1 Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague demut@seznam.cz 2 Institute of Computer Science Academy of Sciences of the Czech Republic martin@cs.cas.cz Abstract. This paper is concerned with predictive regions these predictors are not suitable for neural network in regression models, especially neural networks. We use regression, therefore, we also introduce inductive con- the concept of conformal prediction (CP) to construct re- formal predictors where the prediction rule is updated gions which satisfy given confidence level. Conformal pre-only after a given number of new examples has arrived diction outputs regions, which are automatically valid, but and a calibration set is used. their width and therefore usefulness depends on the used In order to define a conformal predictor we need nonconformity measure. A nonconformity measure should tell us how different a given example is with respect to a suitable nonconformity measure. A nonconformity other examples. We define nonconformity measures based measure should tell us how different a given example on some reliability estimates such as variance of a baggedis with respect to other examples. In chapter 3, we in- troduce two reliability estimates: variance of a bagged model or local modeling of prediction error. We also present model and local modeling of prediction error. We use results of testing CP based on different nonconformity mea- sures showing their usefulness and comparing them to tra- these reliability estimates in chapter 4 to define nor- ditional confidence intervals. malized nonconformity measures. Some other reliabil- ity estimates could be used, e.g. sensitivity analysis or density based reliability estimate. 1 Introduction In chapter 5, we use CP, based on nonconformity measures defined in chapter 4, on testing data to com- This paper is concerned with predictive regions for re- pare our conformal regions with traditional confidence gression models, especially neural networks. We often intervals and with conformal intervals where these tra- want to know not only the label y of a new object, but ditional confidence intervals are used to construct the also how accurate the prediction is. Could the real la- nonconformity measure. bel be very far from our prediction or is our prediction very accurate? It is possible to use traditional confi- 2 Conformal prediction dence intervals to answer this question but they do not work very well with highly nonlinear regression models We assume that we have an infinite sequence of pairs such as neural networks. We use conformal prediction to solve this problem and construct some accurate and (x1 , y1 ), (x2 , y2 ), . . . , (1) useful prediction regions. We introduce conformal prediction (CP) in chap- called examples. Each example (xi , yi ) consists of an ter 2. Conformal prediction does not output single la- object xi and its label yi . The objects are elements of bel but a set of labels Γ ε . The size of the prediction a measurable space X called the object space and the set depends on a significance level ε which we want labels are elements of a measurable space Y called the to achieve. Significance level is under some conditions label space. Moreover, we assume that X is non-empty the probability that our prediction lies outside the set. and that the σ-algebra on Y is different from {∅, Y}. The set is smaller for larger ε. If we have some predic- We denote zi := (xi , yi ) and we set tion rule, we will call it simple predictor and we can use it to construct conformal predictor. We introduce Z := X × Y (2) transductive conformal predictors where the predic- and call Z the example space. Thus the infinite data se- tion rule is updated after a new example arrives. But quence (??) is an element of the measurable space Z∞ . ? This work was supported by the Grant Agency of Our standard assumption is that the examples are the Czech Technical University in Prague, grant No. chosen independently from some probability distribu- SGS12/157/OHK4/2T/14 and the Czech Science Foun- tion Q on Z, i.e. the infinite data sequence (??) is dation grant 201/08/0802. drawn from the power probability distribution Q∞ on 18 Radim Demut, Martin Holeňa Z∞ . Usually we need only slightly weaker assumption Formally, a confidence predictor is a measurable that the infinite data sequence (??) is drawn from a function distribution P on Z∞ that is exchangeable, that means Γ : Z∗ × X × (0, 1) → 2Y (7) that every n ∈ IN, every permutation π of {1, . . . , n}, that satisfies (??) for all n ∈ IN, all incomplete data se- and every measurable set E ⊆ Z∞ fulfill quences x1 , y1 , . . . , xn−1 , yn−1 , xn and all significance ∞ P {(z1 , z2 , . . .) ∈ Z : (z1 , . . . , zn ) ∈ E} = levels ε1 ≥ ε2 . ∞ Whether Γ makes an error on the nth trial of the P {(z1 , z2 , . . .) ∈ Z : (zπ(1) , . . . , zπ(n) ) ∈ E} data sequence ω = (x1 , y1 , x2 , y2 , . . .) at significance We denote Z∗ the set of all finite sequences of ele- level ε can be represented by a number that is one in ments of Z, Zn the set of all sequences of elements of Z case of an error and zero in case of no error of length n. The order in which old examples appear   1 if yn ∈/ Γ ε (x1 , y1 , . . . , should not make any difference. In order to formalize   this point we need the concept of a bag. A bag of size errεn (Γ, ω) := xn−1 , yn−1 , xn ) , (8)  0 otherwise ,  n ∈ IN is a collection of n elements some of which may  be identical. To identify a bag we must say what ele- ments it contains and how many times each of these and the number of errors during the first n trials is elements is repeated. We write \z1 , . . . , zn / for the bag n consisting of elements z1 , . . . , zn , some of which may X Errεn (Γ, ω) := errεi (Γ, ω) . (9) be identical with each other. We write Z(n) for the i=1 set of all bags of size n of elements of a measurable space Z. We write Z(∗) for the set of all bags of ele- If ω is drawn from an exchangeable probability ments of Z. distribution P , the number errεn (Γ, ω) is the realized value of a random variable, which we may designate errεn (Γ, P ). We say that confidence predictor Γ is con- 2.1 Confidence predictors servatively valid if for any exchangeable probability We assume that at the nth trial we have firstly only distribution P on Z∞ there exist two families the object xn and only later we get the label yn . If we want to predict yn , we need a simple predictor (ξn(ε) : ε ∈ (0, 1), n = 1, 2, . . .) (10) D : Z∗ × X → Y . (3) and (ηn(ε) : ε ∈ (0, 1), n = 1, 2, . . .) (11) For any sequence of old examples x1 , y1 , . . . , xn−1 , yn−1 ∈ Z∗ and any new object xn , it gives of {0, 1}-valued variables such that D(x1 , y1 , . . . , xn−1 , yn−1 , xn ) ∈ Y as its prediction for (ε) (ε) – for a fixed ε, ξ1 , ξ2 , . . . is a sequence of indepen- the new label yn . dent Bernoulli random variables with parameter Instead of merely choosing a single element of Y ε; as our prediction for yn , we want to give subsets of Y (ε) – for all n and ε, ηn ≤ ξn ; (ε) large enough that we can be confident that yn will fall – the joint distribution of errεn (Γ, P ), ε ∈ (0, 1), n = in them, while also giving smaller subsets in which we 1, 2, . . ., coincides with the joint distribution of are less confident. An algorithm that predicts in this (ε) ηn , ε ∈ (0, 1), n = 1, 2, . . .. sense requires additional input ε ∈ (0, 1), which we call significance level, the complementary value 1 − ε is called confidence level. Given all these inputs 2.2 Transductive conformal predictors x1 , y1 , . . . , xn−1 , yn−1 , xn , ε (4) A nonconformity measure is a measurable mapping an algorithm Γ that interests us outputs a subset A : Z(∗) × Z → IR . (12) Γ ε (x1 , y1 , . . . , xn−1 , yn−1 , xn ) To each possible bag of old examples and each possible (5) new example, A assigns a numerical score indicating of Y. We require this subset to shrink as ε is increased how different the new example is from the old ones. that means it holds It is sometimes convenient to consider separately how a nonconformity measure deals with bags of different Γ ε1 (x1 , y1 , . . . , xn−1 , yn−1 , xn ) ⊆ sizes. If A is a nonconformity measure, for each n = Γ ε2 (x1 , y1 , . . . , xn−1 , yn−1 , xn ) (6) 1, 2, . . . we define a function whenever ε1 ≥ ε2 . An : Z(n−1) × Z → IR (13) Conformal sets 19 as the restriction of A to Z(n−1) × Z. The sequence A discrepancy measure is a measurable function (An : n ∈ IN), which we abbreviate to (An ) will also be called a nonconformity measure. ∆ : Y × Y → IR . (21) Given a nonconformity measure (An ) and a bag \z1 , . . . , zn / we can compute the nonconformity score Given a simple predictor D and a discrepancy mea- sure ∆ we define functions (An ) as follows: for any αi := An (\z1 , . . . , zi−1 , zi+1 , . . . zn /, zi ) (14) ((x1 , y1 ), . . . , (xn , yn )) ∈ Z∗ , the values for each example zi in the bag. Because a nonconfor- αi = An (\(x1 , y1 ), . . . , (xi−1 , yi−1 ), mity measure (An ) may be scaled however we like, the (xi+1 , yi+1 ), . . . , (xn , yn )/, (xi , yi )) (22) numerical value of αi does not, by itself, tell us how unusual (An ) finds zi to be. For that we define p-value are defined according to (??) and (??) by the formula for zi as αi := ∆(yi , D\z1 ,...,zn / (xi )) (23) |{j = 1, . . . , n : αj ≥ αi }| p := . (15) n and the formula We define transductive conformal predictor (TCP) αi := ∆(yi , D\z1 ,...,zi−1 ,zi+1 ,...,zn / (xi )) , (24) by a nonconformity measure (An ) as a confidence pre- dictor Γ obtained by setting respectively. It can be easily checked that in both ε cases (An ) form a nonconformity measure. Γ (x1 , y1 , . . . , xn−1 , yn−1 , xn ) (16) equal to the set of all labels y ∈ Y such that 2.3 Inductive conformal predictors |{i = 1, . . . , n : αi (y) ≥ αn (y)}| >ε , (17) In TCP, we need to compute the p-value (??) for all n labels y ∈ Y to determine the set Γ ε . In the case of where regression, we have Y = IR and it is not possible to try each y ∈ Y. Sometimes it is possible to generally αi (y) := An (\(x1 , y1 ), . . . , (xi−1 , yi−1 ), solve equations αi (y) ≥ αn (y) with respect to y, and therefore determine the set Γ ε . But if we use neural (xi+1 , yi+1 ), . . . , (xn−1 , yn−1 ), (xn , y)/, networks as simple predictor, we do not know the gen- (xi , yi )) , ∀i = 1, . . . , n − 1 , eral form of the simple predictor, i.e. we do not know αn (y) := An (\(x1 , y1 ), . . . , (xn−1 , yn−1 )/, (xn , y)) . a functional relationship between the training set and the trained network, because random influences en- We now remind an important property of TCP. ter the training algorithm. Hence, we cannot solve the The proof of the following theorem can be found in [?]. equations αi (y) ≥ αn (y), and it is not possible to use TCP. Even if the equations can be solved, it can be Theorem 1. All conformal predictors are conserva- very computationally inefficient. tively valid. To avoid this problem we can use inductive confor- If we are given a simple predictor (??) whose out- mal predictor (ICP). To define ICP from a nonconfor- put does not depend on the order in which the old mity measure (An ) we fix a finite or infinite increasing examples are presented, than the simple predictor D sequence of positive integers m1 , m2 , . . . (called update defines a prediction rule D\z1 ,...,zn / : X → Y by the trials). If the sequence is finite we add one more mem- formula ber equal to infinity at the end of the sequence. We need more than m1 training examples. Then we find D\z1 ,...,zn / (x) := D(z1 , . . . , zn , x) . (18) k such that mk < n ≤ mk+1 . The ICP is determined by (An ) and the sequence m1 , m2 , . . . of update trials A natural measure of nonconformity of zi is the devi- is defined to be the confidence predictor Γ such that ation of the predicted label the prediction set ybi := D\z1 ,...,zn / (xi ) (19) Γ ε (x1 , y1 , . . . , xn−1 , yn−1 , xn ) (25) from the true label yi . We can also use the deleted is equal to the set of all labels y ∈ Y such that prediction defined as |{j = mk + 1, . . . , n : αj ≥ αn (y)}| yb(i) := D\z1 ,...,zi−1 ,zi+1 ,...,zn / (xi ) . (20) >ε , (26) n − mk 20 Radim Demut, Martin Holeňa where the nonconformity scores are defined by 3.2 Local modeling of prediction error αj := Amk +1 (\(x1 , y1 ), . . . , (xmk , ymk )/, (xj , yj )) , We find k nearest neighbors of an unlabeled exam- ple x in the training set, therefore, we have a set for j = mk + 1, . . . , n − 1 (27) N = {(x , y ), . . . , (x , y )} of nearest neighbors. We 1 1 k k αn := Amk +1 (\(x1 , y1 ), . . . , (xmk , ymk )/, define the estimate denoted CNK for an unlabeled ex- (xn , y)) . (28) ample x as the difference between the average label of the nearest neighbors and the example’s prediction y The proof of the following theorem can be found (using the model that was generated on all learning in [?]. examples) Pk Theorem 2. All ICPs are conservatively valid. yi CNK(x) := i=1 − y . (33) For ICP combining (??) with (??) and (??) we get k The dependence on x on the right hand side of the Al+1 ( \(x1 , y1 ), . . . , (xl , yl )/, (x, y)) previous equation is implicit, but both the prediction y = ∆(y, D\(x1 ,y1 ),...,(xl ,yl ),(x,y)/ (x)) (29) and the selection of nearest neighbors depends on x. and 4 Normalized nonconformity Al+1 ( \(x1 , y1 ), . . . , (xl , yl )/, (x, y)) measures = ∆(y, D\(x1 ,y1 ),...,(xl ,yl )/ (x)) , (30) We will follow a similar approach as is used in the article [?], but we will incorporate the reliability es- respectively. When we define A by (??), we can see timates from previous chapter and use it for neural that the ICP requires recomputing the prediction rule network regression. only at the update trials m1 , m2 , . . .. We will use the We will use ICP with only one update trial. Let us simplest case, where there is only one update trial m1 , have training set of size l, where l > m1 . We will split therefore, we compute the prediction rule only once. it into two sets, the proper training set T of size m1 (we will further write m) and the calibration set C of 3 Reliability estimates size q = l − m. We will use the proper training set for creating the simple predictor D\(x1 ,y1 ),...,(xm ,ym )/ . In this chapter we are interested in different ap- The calibration set is used for calculating the p-value proaches to estimate the reliability of individual pre- of new test examples. It is good to first normalize the dictions in regression. data (i.e. subtract the mean and divide data by sample variance). We will denote ri any of the previously defined re- 3.1 Variance of a bagged model liability estimates in the point xi with given simple We are given a learning set L = {(x1 , y1 ), . . . , (xn , yn )} predictor D. We compute ri for all points in the cali- and take repeated bootstrap samples L(i) , i = 1, . . . , m bration set and define Ri for any given point xi as of size d from the learning set, i.e. for i = 1, . . . , m ri Ri := . (34) we randomly choose d points from the original learn- median{rj : rj ∈ C} ing set L with the return and put them in L(i) . The We define a discrepancy measure (??) as number of points d can be chosen arbitrary. We in- duce a new model on each of these bootstrap sam- y1 − y2 (i) ∆(y1 , y2 ) := , (35) ples L . Each of the models yields a prediction Ki (x), γ + Ri i = 1, . . . , m for a considered input x. The label of the where parameter γ ≥ 0 controls the sensitivity to example x is predicted by averaging the individual pre- changes of R . Then, we get the nonconformity score i dictions Pm Ki (x) y − ŷi K(x) := i=1 . (31) αi (y) = . (36) m γ + Ri We call this procedure bootstrap aggregating or bag- We sort nonconformity scores of the calibration ex- ging. The reliability estimate of a bagged model is de- amples in descending order fined as the prediction variance α(m+1) ≥ . . . ≥ α(m+q) , (37) m 1 X BAGV(x) := (Ki (x) − K(x))2 . (32) and denote m i=1 s = bε(q + 1)c . (38) Conformal sets 21 Proposition 1. The prediction set Γ ε of the new test The value of this function ϑ in the point (x1 , x2 , x3 , example xl+g (where xl+g is from the infinite se- x4 , x5 ) can be expressed as quence (??)) given the nonconformity score (??) is equal to the interval ϑ(x1 , x2 , x3 , x4 , x5 ) = −A(x1 , x2 ) −B(x2 , x3 )C(x3 , x4 , x5 ) , (44) hŷl+g − α(m+s) (γ + Rl+g ), ŷl+g + α(m+s) (γ + Rl+g )i . (39) where A(x1 , x2 ) = 0.6g(x1 − 0.35, x2 − 0.35) Proof. To compute the prediction set Γ ε of the new test example xl+g we need to find all y ∈ Y such that +0.75g(x1 − 0.1, x2 − 0.1) for the p-value it holds +g(x1 − 0.35, x2 − 0.1) B(x2 , x3 ) = 0.4g(x2 − 0.1, x3 − 0.3) p(y) = C(x3 , x4 , x5 ) = 5 + 25[1 − {1 + (x3 − 0.3)2 |{i = m + 1, . . . , m + q, l + g : αi ≥ αl+g (y)}| q+1 +(x4 − 0.15)2 + (x5 − 0.1)2 }1/2 ] p >ε . (40) g(a, b) = 100 − (100a)2 + (100b)2 p sin (100a)2 + (100b)2 We multiply the inequality by q + 1 and then it is +50 p . (100a)2 + (100b)2 + (0.01)2 equivalent to Moreover, the input vectors must satisfy following con- |{i = m + 1, . . . , m + q, l + g : αi ≥ αl+g (y)}| > ditions bε(q + 1)c (41) 5 X xi = 1 and xi ∈ [0, 1], for i = 1, . . . , 5 . (45) and this inequality holds if and only if i=1 y − ŷl+g We repeated the following procedure five times for α(m+s) ≥ αl+g (y) = . (42) γ + Rl+g region with significance level 0.1 and five times for region with significance level 0.05. From (??) follows the assertion of the proposition. – Randomly generate 600 points satisfying the con- ditions (??). 5 Simulation – Compute the function values of function ϑ in these points. – Normalize data (i.e. subtract the mean and divide We carried out a simulation to test the normalized data by sample variance) nonconformity measures based on different reliability – Split this set of points into a training set of estimates. We used neural networks with radial ba- 500 points and a testing set of 100 points. sis functions (RBF networks) as our regression models – Split the training set into a proper training set of with Gaussian used as the basis function. Therefore, 401 points and a calibration set of 99 points (then, the output of the RBF network f : IRn → IR has the we divide the p-value in (??) by 100). form – Split the proper training set on training set for fit- N ting the RBF network and the validation set. Fit X πi exp −βi ||x − ci ||2 the RBF network with 1, 2, 3, 4 and 5 hidden neu-  f (x) = , (43) i=1 rons ten times using the Matlab function lsqcurve- fit. where N is the number of neurons in the hidden layer, – Choose the RBF network with the smallest error ci is the center vector for neuron i, βi determines the on the validation set for each number of hidden width of the ith neuron and πi are the weights of the neurons. linear output neuron. RBF networks are universal ap- – Compute the prediction sets for each of the proximators on a compact subset of IRn . This means 100 testing points for each number of hidden neu- that a RBF network with enough hidden neurons can rons. approximate any continuous function with arbitrary – Transform data and predictive regions back to the precision. original size (i.e. multiply by the original sample We used a benchmark function similar to some em- variance and add the original mean) pirical functions encountered in chemistry to carry out – Determine if the original point lies in our predic- our experiment. This function was introduced in [?]. tion sets. 22 Radim Demut, Martin Holeňa The initial values of parameters πi were set as mean of predictive regions is always slightly higher than the the response vector, initial values of βi were set as the confidence level. mean of the standard deviation of the components of Results for predictive regions based on the local training data points. The centers ci were set randomly. modeling of prediction errors depend a little bit on the We also computed confidence intervals using Mat- count of nearest neighbors. These intervals are valid lab function nlpredci (denoted Conf Int). The Jaco- for all numbers of neighbors, but the tightest inter- bian can be computed exactly, because the form of the vals were achieved for two neighbors. The difference RBF network is known and differentiable. Therefore, between using five or ten neighbors is not too big but we supply the function nlpredci with this Jacobian. We lower number of neighbors works better in our model. also use the width of this interval as another reliability This is probably caused by our data and it seems that estimate for our normalized nonconformity measure. only a few neighbors are relevant to our prediction. We compare normalized nonconformity measures These regions are also the easiest and fastest to com- based on the following reliability estimates: the local pute. modeling of prediction errors using nearest neighbors The best results among all predictive regions are (CNK), the variance of a bagged model (BAGV) and achieved by those based on a variance of a bagged the width of confidence intervals (CONF). model. These regions are the tightest of all tested and The variance of a bagged model was computed for they do not vary as much as those based on confidence number of different models m = 10 and the bootstrap intervals. These regions also maintain the validity. The samples were as big as the original sample. drawback of these regions is that we need to fit a lot of The CNK estimates were computed for number of additional models which takes a lot of time in the case neighbors k = 2, 5, 10. of neural network regression. But if time and compu- tational efficiency is not a problem then this method We present the results of testing CP based on dif- produces best regions. ferent nonconformity measures in Figures ??, ?? and ??. There is a boxplot of all labels in Figure ?? to com- pare the range of all labels with the width of different predictive regions. Figures ?? and ?? show boxplots of 450 the width of prediction regions for significance levels 400 ε = 0.1 and ε = 0.05, respectively. It is not only in- 350 teresting whether the intervals are small enough, but they should also be valid. The percentage of labels in- 300 side the predictive regions are in Tables ?? and ?? for 250 significance levels ε = 0.1 and ε = 0.05, respectively. 200 The results for traditional confidence inter- 150 vals computed by Matlab function nlpredci are not 100 shown in the figures, because these results are very 50 different from the others. The median width for these intervals lies between 1010 and 1014 for all counts of 1 neurons. This is probably because of the highly non- Fig. 1. Boxplot of all labels. linear character of neural nets, while nlpredci is based on linearization. Moreover, during the computation of these intervals a Jacobian matrix must be inverted but this matrix was very often ill conditioned, therefore, the results for confidence intervals are not too reliable. Despite what was said in the previous paragraph, Neurons CNK2 CNK5 CNK10 BAGV CONF the predictive regions based on the width of confidence 2 91.0 91.2 90.0 91.4 92.4 3 92.6 92.4 92.6 94.0 93.6 intervals produce sensible results. But these prediction 4 92.2 90.4 90.0 90.4 90.2 regions show highest inconsistency between different 5 94.6 92.6 90.0 90.2 90.8 neuron counts and have highest number of very large 6 92.8 89.8 91.8 91.6 91.8 intervals. These regions produce sometimes very good results, but they are probably very dependent on the Table 1. Percentage of labels inside predictive regions for actual fit of the neural network and their results are ε = 0.1. not as consistent as the results of the other methods. However, we can see in Tables ?? and ?? that these intervals are valid as the percentage of labels inside Conformal sets 23 Neurons: 2 100 0 CNK2 CNK5 CNK10 BAGV CONF Neurons: 3 100 0 CNK2 CNK5 CNK10 BAGV CONF Neurons: 4 100 0 CNK2 CNK5 CNK10 BAGV CONF Neurons: 5 100 0 CNK2 CNK5 CNK10 BAGV CONF Fig. 2. Interval widths for ε = 0.1. Neurons: 2 400 200 0 CNK2 CNK5 CNK10 BAGV CONF Neurons: 3 400 200 0 CNK2 CNK5 CNK10 BAGV CONF Neurons: 4 400 200 0 CNK2 CNK5 CNK10 BAGV CONF Neurons: 5 400 200 0 CNK2 CNK5 CNK10 BAGV CONF Fig. 3. Interval widths for ε = 0.05. 24 Radim Demut, Martin Holeňa Neurons CNK2 CNK5 CNK10 BAGV CONF 2 95.8 96.8 96.8 96.2 95.6 3 97.2 96.8 96.8 97.8 95.6 4 96.2 96.4 96.6 97.4 97.0 5 97.2 97.6 96.4 96.4 97.6 6 97.2 98.0 96.4 97.4 96.8 Table 2. Percentage of labels inside predictive regions for ε = 0.05. 6 Conclusion We presented several methods for computing predic- tive regions in neural network regressions. These meth- ods are based on the inductive conformal prediction with novel nonconformity measures proposed in this paper. Those measures use reliability estimates to de- termine how different a given example is with respect to other examples. We compared our new predictive regions with traditional confidence intervals on test- ing data. The confidence intervals did not perform very well, the intervals were too large, it was probably caused by the high nonlinearity of radial basis neu- ral networks. Predictive regions which used the width of confidence intervals as the nonconformity measure gave much better results. But those results were not as consistent as the results of the other methods. Predic- tive regions based on the local modeling of prediction errors gave us good results and the computation of the regions was very fast. A smaller number of neighbors gave better results for these regions. The best results were achieved by the regions based on the variance of a bagged model. The only drawback of this method is that a lot of models must be fitted and it is, therefore, computationally very inefficient. References 1. Z. Bosnic, Kononenko, I.: Comparison of approaches for estimating reliability of individual regression predic- tions. Data & Knowledge Engineering, 2008, 504–516. 2. A. Gammerman, G. Shafer, V. Vovk: Algorithmic learn- ing in a random world. Springer Science+Business Me- dia, 2005. 3. E. Uusipaikka: Confidence intervals in generalized re- gression models. Chapman & Hall, 2009. 4. H. Papadopoulos, V. Vovk, A. Gammerman: Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research 40, 2011, 815–840. 5. S. Valero, E. Argente, et al.: DoE framework for catalyst development based on soft computing techniques. Com- puters and Chemical Engineering 33(1), 2009, 225–238.