<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Overestimation learning with guarantees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrien Gauffriau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>François Malgouyres</string-name>
          <email>Francois.Malgouyres@math.univ-toulouse.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mélanie Ducoffe</string-name>
          <email>melanie.ducoffeg@irt-saintexupery.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRT Saint Exupéry</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut de Mathématiques de Toulouse; UMR 5219, Université de Toulouse;</institution>
          <addr-line>CNRS; UPS IMT F-31062 Toulouse Cedex 9</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe a complete method that learns a neural network which is guaranteed to over-estimate a reference function on a given domain. The neural network can then be used as a surrogate for the reference function. The method involves two steps. In the first step, we construct an adaptive set of Majoring Points. In the second step, we optimize a well-chosen neural network to over-estimate the Majoring Points. In order to extend the guarantee on the Majoring Points to the whole domain, we necessarily have to make an assumption on the reference function. In this study, we assume that the reference function is monotonic. We provide experiments on synthetic and real problems. The experiments show that the density of the Majoring Points concentrates where the reference function varies. The learned over-estimations are both guaranteed to overestimate the reference function and empirically shown to provide good approximations of it. Experiments on real data show that the method makes it possible to use the surrogate function in embedded systems for which an underestimation is critical, when computing the reference function requires too many resources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We want to learn parameters w*, b* of a neural network f_{w*,b*} such that
f_{w*,b*}(x) ≥ f(x), for all x ∈ D.   (1)</p>
      <p>Inspired by critical systems applications, we require (1) to
be guaranteed by a formal theorem. We call f_{w*,b*} the
surrogate and call f the model.</p>
      <p>In the typical application we have in mind, f is the
result of a deterministic but complex phenomenon. It is
difficult to compute and its evaluation requires intensive
computations or extensive memory consumption. We consider
two settings. In the first setting, we can evaluate f(x) for
any x ∈ D. In the second setting, we only know a data set
(x_i, f(x_i))_{i=1..n} of size n ∈ ℕ.</p>
      <p>The constructed surrogate f_{w*,b*} makes it possible to rapidly
evaluate an approximation of f at any point of D. Of course,
we want the surrogate f_{w*,b*} to approximate f well; but,
most importantly, we want f_{w*,b*} to overestimate f. In the
typical scenario we have in mind, f models the
consumption of some resource for a critical action. Underestimating
the model may cause a (potentially deadly) hazard that will
not be acceptable for certification authorities, especially in
aeronautics (EASA or FAA). The applications include, for
instance: the estimation of the braking distance of an
autonomous vehicle; the number of kilometers that can be
traveled; the maximum load a structure can carry; the
network traveling time, etc. Guaranteed overestimation can also
be used to guarantee the fairness of a score, when
underestimation on a sub-population leads to an unfair surrogate
function. Providing such a fairness or performance guarantee can
be seen as a niche activity, but it alleviates the limitation that
restrains the massive industrialization of neural networks in
critical applications where passengers' lives are at stake.</p>
      <p>Finally, the adaptation of the method to construct an
under-estimation of f is straightforward and, for ease of
presentation, is not presented in the paper. Of course, the
combination of an over-estimation and an under-estimation
permits, for any x, the construction of an interval in which f(x)
is guaranteed to be.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <p>
        The most standard type of guarantees in machine learning
aims at upper-bounding the risk by upper-bounding the
generalization error (see
        <xref ref-type="bibr" rid="ref2">(Anthony and Bartlett 1999)</xref>
        for an
overview). We would like to highlight that the risk measures
an average cost that does not exclude failures. Moreover, at
the time of writing, the bounds on the generalization
error are very pessimistic when compared to the performance
observed in practice
        <xref ref-type="bibr" rid="ref11 ref25">(Harvey, Liaw, and Mehrabian 2017;
Bartlett et al. 2019)</xref>
        and the recent overview
        <xref ref-type="bibr" rid="ref13 ref31 ref4">(Jakubovitz,
Giryes, and Rodrigues 2019)</xref>
        . In particular, neural networks
are commonly trained with far fewer examples than
the theory requires.
      </p>
      <p>
        Other works provide guarantees on the neural network
performance
        <xref ref-type="bibr" rid="ref19 ref24 ref7">(Mirman, Gehr, and Vechev 2018; Boopathy
et al. 2019; Katz et al. 2017)</xref>
        that can be used to exclude
failures (in our setting, the under-estimation of f). The
philosophy of these works is to analyze the output of the
learning phase in order to provide guarantees of robustness.
Another line of research is to impose constraints or to optimize
robustness criteria during the learning phase
        <xref ref-type="bibr" rid="ref28 ref9">(Fischer et al.
2019; Raghunathan, Steinhardt, and Liang 2018)</xref>
        . This does
not make it possible to guarantee an over-estimation property.
      </p>
      <p>
        Finally, another direction makes probabilistic
assumptions and provides confidence scores
        <xref ref-type="bibr" rid="ref10 ref20 ref26 ref31 ref4">(Gal and Ghahramani
2016; Pearce et al. 2018; Tagasovska and Lopez-Paz 2019;
Keren, Cummins, and Schuller 2018)</xref>
        . The confidence score
does not necessarily provide a formal guarantee.
      </p>
      <p>Compared to these approaches, the guarantee on the
surrogate f_{w*,b*} is due to the attention given to the construction
of the learning problem and to the hypotheses on the function
f. To the best of our knowledge, and setting aside trivial
overestimations with poor performance, we are not aware of
any other construction of a surrogate f_{w*,b*} that is
guaranteed to overestimate the model f.</p>
    </sec>
    <sec id="sec-3">
      <title>Hypotheses and restrictions</title>
      <p>In order to extend the overestimation property to the whole
domain, we obviously cannot consider an arbitrary
function f. In this paper, we restrict our analysis to real-valued
and finite non-decreasing functions. The extension to
vector-valued functions is straightforward. The extension to
non-increasing functions is straightforward too. We
deliberately use these restrictive hypotheses to simplify the
presentation. Furthermore, many extensions of the principles of
the method described in the paper to other classes of
non-monotone functions are possible.</p>
      <p>Formally, for a compact domain D ⊂ ℝ^d, a function g :
D → ℝ is non-decreasing if and only if, for all x, x′ ∈ D,
x ≤ x′ (componentwise) implies g(x) ≤ g(x′).   (2)</p>
      <p>
        Other authors have already studied the approximation of
non-decreasing functions with neural networks
        <xref ref-type="bibr" rid="ref21 ref29 ref8">(Lang 2005;
Daniels and Velikova 2010; Sill 1998)</xref>
        . In our context, the
monotonic hypothesis is motivated by the fact that
monotone functions are ubiquitous in industrial applications and in
physics. Moreover, for a given application where a function
f′ : ℝ^{d′} → ℝ needs to be approximated, it is sometimes
possible to extract features with a function g : ℝ^{d′} → ℝ^d
such that the function f : ℝ^d → ℝ, satisfying f′ = f ∘ g,
is monotone.
      </p>
      <p>Another restriction is that the method cannot be applied
when the dimension d is large and f varies in all directions.
In fact, one contribution of the paper is to design algorithms
that alleviate this problem and apply the method for larger d,
but d must in any case remain moderate for most functions f.</p>
      <p>These hypotheses make it possible to obtain global guarantees that
hold on the whole domain D, while weaker hypotheses on
f only lead to local guarantees, holding in the close vicinity
of the learning examples.</p>
    </sec>
    <sec id="sec-4">
      <title>Organization of the paper</title>
      <p>In the next section, we define Majoring Points and describe a
grid-based and an adaptive algorithm to construct them. The
algorithms are adapted to the case where the function f can be evaluated
at any new input, as well as to the case where we only know a dataset
(x_i, f(x_i))_{i=1..n}. In Section 3, we describe a strategy to
construct monotonic, over-estimating neural networks.
Finally, in Section 4, we provide experiments showing that the
Majoring Points are located in the regions where f varies
strongly or is discontinuous. We also illustrate on synthetic
examples that the method can accurately approximate f,
while being guaranteed to overestimate it. An example on
real data illustrates that the use of over-estimating neural
networks reduces the memory required to compute an
over-estimation and thus makes it possible to embed it in a real-world
application.</p>
      <sec id="sec-4-1">
        <title>Majoring Points</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Introduction</title>
    </sec>
    <sec id="sec-6">
      <title>Definition 1. Majoring Points</title>
      <p>Let (a_i, b_i)_{i=1..m} ∈ (ℝ^d × ℝ)^m. Let D ⊂ ℝ^d be compact
and let f : ℝ^d → ℝ be finite and non-decreasing. We say
that (a_i, b_i)_{i=1..m} are Majoring Points for f if and only if,
for any non-decreasing g : ℝ^d → ℝ:
if g(a_i) ≥ b_i for all i = 1..m, then g(x) ≥ f(x) for all x ∈ D.</p>
      <p>When f is upper-bounded on D, the existence of such
Majoring Points is established by considering m = 1 and:
• a_1 such that a_1 ≤ x for all x ∈ D,
• b_1 = sup_{x∈D} f(x).</p>
      <p>However, for most functions f, any non-decreasing g such
that g(a_1) ≥ b_1 is a poor approximation of f. The goal,
when constructing Majoring Points of f, is to have b_i ≥ f(a_i)
while keeping b_i close to f(a_i), for all i = 1..m. The number of
points m should remain sufficiently small to make the
optimization of the neural network manageable.</p>
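To make Definition 1 concrete, here is a minimal Python sketch (the function f, the domain discretization, and all names are illustrative assumptions, not from the paper) that builds the trivial single Majoring Point described above and checks the overestimation implication for one non-decreasing g:

```python
import numpy as np

# Illustrative non-decreasing reference function on D = [0, 1]^2.
def f(x):
    return x[0] + np.sqrt(x[1])

# Discretize D and build the trivial Majoring Point: a1 <= x for all x in D,
# b1 = sup_{x in D} f(x).  Any non-decreasing g with g(a1) >= b1 then
# satisfies g(x) >= g(a1) >= b1 >= f(x) on D.
grid = np.array([(u, v) for u in np.linspace(0, 1, 21)
                 for v in np.linspace(0, 1, 21)])
a1 = grid.min(axis=0)               # componentwise lower bound of D
b1 = max(f(x) for x in grid)        # sup of f over the discretized D

# A (poor) non-decreasing surrogate satisfying g(a1) >= b1: the constant b1.
g = lambda x: b1

assert g(a1) >= b1
assert all(g(x) >= f(x) for x in grid)   # overestimation on the whole grid
```

As the text notes, this constant surrogate is a very poor approximation of f; the rest of the section is about choosing many points a_i with b_i close to f(a_i).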
    </sec>
    <sec id="sec-7">
      <title>Constructing Majoring Points using a cover</title>
      <p>For any y and y′ ∈ ℝ^d with y ≤ y′, we define the
hyper-rectangle between y and y′ by
R_{y,y′} = {x ∈ ℝ^d | y ≤ x &lt; y′}.
Using hyper-rectangles, we define the considered covers.</p>
    </sec>
    <sec id="sec-8">
      <title>Definition 2. Cover</title>
      <p>Let (y_i)_{i=1..m} and (y_i′)_{i=1..m} in ℝ^d be such that y_i ≤ y_i′,
for all i = 1..m. We say that (y_i, y_i′)_{i=1..m} covers D if and
only if D ⊂ ∪_{i=1..m} R_{y_i,y_i′}.</p>
      <p>For any cover C = (y_i, y_i′)_{i=1..m}, we define the function
f_C(x) = min_{i : x ∈ R_{y_i,y_i′}} f(y_i′), for all x ∈ D.   (3)
Notice that, since C is a cover, {i | x ∈ R_{y_i,y_i′}} ≠ ∅ and
f_C(x) is well-defined. We can establish that the function f_C
overestimates f over D:
f_C(x) ≥ f(x), for all x ∈ D.   (4)</p>
      <p>Indeed, for any x ∈ D, there exists i such that x ∈ R_{y_i,y_i′}
and f_C(x) = f(y_i′). Therefore, since f is non-decreasing
and x ≤ y_i′, we have
f(x) ≤ f(y_i′) = f_C(x).</p>
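The function f_C can be sketched directly from its definition (3); the toy cover and the example function below are illustrative assumptions:

```python
import numpy as np

def f(x):
    return float(np.sum(x))          # toy non-decreasing reference function

def f_cover(x, cover):
    """f_C(x) = min over rectangles [y, y') containing x of f(y')."""
    vals = [f(y_hi) for (y_lo, y_hi) in cover
            if np.all(y_lo <= x) and np.all(x < y_hi)]
    return min(vals)                 # non-empty because C covers D

# A cover of D = [0, 1)^2 by four half-open rectangles of side 0.5.
cover = [(np.array([a, b]), np.array([a + 0.5, b + 0.5]))
         for a in (0.0, 0.5) for b in (0.0, 0.5)]

x = np.array([0.3, 0.7])
assert f_cover(x, cover) >= f(x)     # overestimation property (4)
```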
    </sec>
    <sec id="sec-9">
      <title>Proposition 1. A cover defines Majoring Points</title>
      <p>Let D ⊂ ℝ^d be compact. Let C = (y_i, y_i′)_{i=1..m} be
a cover of D and let f : ℝ^d → ℝ be finite and
non-decreasing. The family (y_i, f(y_i′))_{i=1..m} are Majoring
Points for f.</p>
      <p>Proof. We consider a non-decreasing function g such that,
for all i = 1..m,
g(y_i) ≥ f(y_i′).   (5)</p>
      <sec id="sec-9-1">
        <title>We need to prove that g(x) ≥ f(x), for all x ∈ D (6)</title>
        <p>Ideally, the cover would be the
solution of an optimization problem taking into account these two
properties. However, the optimization would be intractable,
and we only describe a heuristic algorithm that constructs a
cover.</p>
        <p>We construct Majoring Points according to two scenarios:
• We can evaluate f(x) for all x ∈ ℝ^d. We call the
Majoring Points generated according to this setting Function
Adapted Majoring Points.
• We have a dataset (x_i, f(x_i))_{i=1..n}. It is not possible to
have access to more training points. We call the Majoring
Points generated according to this setting Data Adapted
Majoring Points. In order to define them, we consider the
function f~ : ℝ^d → ℝ defined, for all x ∈ R_{ymin,ymax},
by
f~(x) = min_{i : x_i ≥ x} f(x_i).   (8)</p>
        <p>Since f is non-decreasing, we have
f~(x) ≥ f(x), for all x ∈ R_{ymin,ymax}.</p>
        <p>At the core of the constructions described below is the
construction of a cover C = (y_i, y_i′)_{i=1..m}.</p>
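A minimal sketch of the data-based majorant f~ of (8); the dataset and the toy function are illustrative assumptions:

```python
import numpy as np

def f_tilde(x, xs, fs):
    """Data-based overestimation f~(x): minimum of f(x_i) over the data
    points x_i that dominate x componentwise (x_i >= x); assumes at
    least one such point exists."""
    mask = np.all(xs >= x, axis=1)
    return fs[mask].min()

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=(200, 2))
fs = xs.sum(axis=1)                   # toy non-decreasing f(x) = x1 + x2

x = np.array([0.2, 0.3])
assert f_tilde(x, xs, fs) >= x.sum()  # f~(x) >= f(x) since f is non-decreasing
```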
        <p>Grid Based Majoring Points We consider an accuracy
parameter ε &gt; 0 and a norm ‖·‖ on ℝ^d. We define
nmax = ⌈ ‖ymax − ymin‖ / ε ⌉.</p>
        <p>A simple way to define Majoring Points is to generate
a cover made of equally spaced points between ymin and
ymax.</p>
        <p>To establish (6), we consider x ∈ D and i such that x ∈ R_{y_i,y_i′}.
Using respectively that g is non-decreasing, (5), the
definition of f_C (3), and (4), we obtain the following sequence of
inequalities:
g(x) ≥ g(y_i) ≥ f(y_i′) ≥ f_C(x) ≥ f(x).</p>
        <p>The function f_C can be computed rapidly using a
lookup table, but this requires storing (y_i, f(y_i′))_{i=1..m}. This can be
prohibitive in some applications.</p>
        <p>To deal with this scenario, we describe in Section 3 a
method to construct a neural network such that f_{w*,b*} is
non-decreasing and satisfies f_{w*,b*}(y_i) ≥ f(y_i′), for all
i = 1..m. According to the proposition, such a network
provides a guaranteed over-estimation of f, whose computation
is rapid. The resources required to store w* and b* are
independent of m. We show in the experiments that they can be
several orders of magnitude smaller. This makes it possible
to embed the overestimating neural network when
embedding the Majoring Points is not possible. This advantage
comes at the price of a loss of accuracy, as f_{w*,b*}(x) ≥ f_C(x)
(f_{w*,b*} has the role of g in (6)).</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Majoring Points construction algorithm</title>
      <p>Introduction In this section, we describe algorithmic
strategies to adapt Majoring Points to the function f.
Throughout the section, we assume that we know ymin and
ymax ∈ ℝ^d such that
D ⊂ R_{ymin,ymax}.   (7)</p>
      <p>The goal is to build a cover such that f_C is close to f and
m is not too large.</p>
      <p>We can also construct Grid Based Majoring Points when
the function f cannot be evaluated but a dataset
(x_i, f(x_i))_{i=1..n} is available, by replacing f by f~ in the
definition below, see (8).</p>
      <p>The Grid Based Majoring Points are mostly given for
pedagogical reasons. The number of values f(x) that need to be
evaluated and the number of Majoring Points defining the
objective of the optimization of the neural network are both
equal to nmax^d = O(ε^{−d}). This scales poorly with d, which
restrains the application of the Grid Based Majoring Points to
small d, whatever the function f.</p>
      <p>Setting
r = (ymax − ymin) / nmax ∈ ℝ^d
and, for i_0, …, i_{d−1} ∈ {0, …, nmax − 1} and i = Σ_{k=0}^{d−1} i_k nmax^k, we set
y_i = ymin + (i_0 r_1, …, i_{d−1} r_d),
y_i′ = y_i + r.
Notice that the parameter r defining the grid satisfies
‖r‖ ≤ ε. Given this cover, the Function Adapted Majoring
Points (a_i, b_i)_{i=0..nmax^d−1} are defined, for i = 0..nmax^d − 1, by
a_i = y_i and b_i = f(y_i′).   (9)</p>
      <p>When f does not vary much in some areas, the Majoring
Points in these areas are useless. This is what motivates the
adaptive algorithm developed in the next section.</p>
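The grid-based construction can be sketched as follows; the toy function and parameter values are illustrative assumptions:

```python
import numpy as np

def grid_majoring_points(f, y_min, y_max, eps):
    """Grid Based Majoring Points: a_i = y_i (cell lower corner),
    b_i = f(y_i') with y_i' = y_i + r (cell upper corner)."""
    d = len(y_min)
    n_max = int(np.ceil(np.linalg.norm(y_max - y_min) / eps))
    r = (y_max - y_min) / n_max
    points = []
    for idx in np.ndindex(*([n_max] * d)):
        y = y_min + np.array(idx) * r
        points.append((y, f(y + r)))     # (a_i, b_i) with b_i = f(y_i + r)
    return points

f = lambda x: float(x.sum())             # toy non-decreasing function
pts = grid_majoring_points(f, np.array([0.0, 0.0]), np.array([1.0, 1.0]), 0.5)

assert len(pts) == int(np.ceil(np.sqrt(2) / 0.5)) ** 2   # nmax^d cells
assert all(b >= f(a) for a, b in pts)    # b_i >= f(a_i) since f non-decreasing
```

The nmax^d cell count makes the O(ε^{−d}) scaling discussed above explicit.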
    </sec>
    <sec id="sec-11">
      <title>Adaptive Majoring Points</title>
      <p>Below, we describe a Dichotomy Algorithm that generates an adaptive
cover with regard to the variations of the function f (or f~).
It begins with a cover made of the single hyper-rectangle
R_{ymin,ymax}. Then it decomposes every hyper-rectangle of
the current cover that has not reached the desired accuracy.
The decomposition is repeated until all the hyper-rectangles
of the current cover have the desired accuracy (see
Algorithm 1).</p>
      <p>Initially, we have D ⊂ R_{ymin,ymax}. For any
hyper-rectangle R_{y,y′}, denoting r = (y′ − y)/2, the decomposition
replaces R_{y,y′} by its sub-parts as defined by
{ R_{y+(s_1 r_1,…,s_d r_d), y+((s_1+1) r_1,…,(s_d+1) r_d)} |
(s_1, …, s_d) ∈ {0, 1}^d }.   (10)</p>
      <p>(Hence the term ‘dichotomy algorithm’.) The union of the
sub-parts equals the initial hyper-rectangle. Therefore, the
cover remains a cover after the decomposition. The final C
is a cover of D.</p>
      <p>We consider a norm ‖·‖ on ℝ^d and real parameters ε &gt; 0,
ε_f &gt; 0 and n_p ∈ ℕ. We stop decomposing a
hyper-rectangle if a notion of accuracy is satisfied. The notion of
accuracy depends on whether we can compute f or not.
• When we can evaluate f(x): the accuracy of R_{y,y′} is
defined by the test
f(y′) − f(y) ≤ ε_f or ‖y′ − y‖ ≤ ε.   (11)
• When we only know a dataset (x_i, f(x_i))_{i=1..n}: the
accuracy of R_{y,y′} is defined by the test
f~(y′) − f~(y) ≤ ε_f or ‖y′ − y‖ ≤ ε or |{i | x_i ∈ R_{y,y′}}| ≤ n_p,   (12)
where |·| is the cardinal of the set.</p>
      <p>We stop decomposing if a given accuracy is reached:
• when f(y′) − f(y) ≤ ε_f or f~(y′) − f~(y) ≤ ε_f: this
happens where the function f varies only slightly;
• when ‖y′ − y‖ ≤ ε: this happens where the function f varies
strongly;
• when |{i | x_i ∈ R_{y,y′}}| ≤ n_p: this happens when the
number of samples in R_{y,y′} does not permit improving
the approximation of f by f~ in its sub-parts.</p>
      <p>The cover is constructed according to Algorithm 1. This
algorithm is guaranteed to stop after at most n′max =
⌈log2(‖ymax − ymin‖ / ε)⌉ iterations of the ‘while loop’. In the
worst case, every hyper-rectangle of the current cover is
decomposed into 2^d hyper-rectangles. Therefore, in the
worst case, the algorithm creates
2^{d n′max} = O(ε^{−d})
hyper-rectangles. The worst-case complexity bound is
similar to the complexity of the Grid Based Majoring Points.
However, depending on f, the number of hyper-rectangles
generated by the algorithm can be much smaller than this
worst-case complexity bound. The smoother the function f,
the fewer hyper-rectangles are generated.</p>
      <p>Algorithm 1 Adaptive cover construction
Require: ε : distance below which we stop decomposing
Require: ε_f : target upper bound for the error in f
Require: n_p : number of examples in a decomposable hyper-rectangle
Require: inputs needed to compute f (resp. f~)
Require: ymin, ymax : points satisfying (7)
  C ← {R_{ymin,ymax}}
  t ← 0
  while t ≠ 1 do
    t ← 1
    C′ ← ∅
    for R_{y,y′} ∈ C do
      if R_{y,y′} satisfies (11) (resp. (12)) then
        C′ ← C′ ∪ {R_{y,y′}}
      else
        t ← 0
        for all sub-parts R of R_{y,y′} (see (10)) do
          C′ ← C′ ∪ {R}
        end for
      end if
    end for
    C ← C′
  end while
  return C</p>
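A compact Python sketch of Algorithm 1 in the function-evaluation setting, using test (11); the names and the example step function are illustrative assumptions, and the dataset variant with test (12) is analogous:

```python
import numpy as np

def adaptive_cover(f, y_min, y_max, eps, eps_f):
    """Dichotomy algorithm sketch: decompose a hyper-rectangle [y, y')
    unless f(y') - f(y) <= eps_f or ||y' - y|| <= eps (test (11))."""
    cover = [(np.asarray(y_min, float), np.asarray(y_max, float))]
    done = False
    while not done:
        done, new_cover = True, []
        for y, y2 in cover:
            if f(y2) - f(y) <= eps_f or np.linalg.norm(y2 - y) <= eps:
                new_cover.append((y, y2))       # accurate enough: keep
            else:
                done = False
                r = (y2 - y) / 2                # split into 2^d sub-parts
                for s in np.ndindex(*([2] * len(y))):
                    lo = y + np.array(s) * r
                    new_cover.append((lo, lo + r))
        cover = new_cover
    return cover

f = lambda x: float(x[0] > 0.5)   # step function: fine cells near the jump
cover = adaptive_cover(f, [0.0, 0.0], [1.0, 1.0], eps=0.2, eps_f=0.5)

# Every retained hyper-rectangle satisfies the accuracy test (11).
assert all(f(y2) - f(y) <= 0.5 or np.linalg.norm(y2 - y) <= 0.2
           for y, y2 in cover)
```

On this step function, only the cells straddling the discontinuity keep being split, illustrating how the cover adapts to the variations of f.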
      <sec id="sec-11-1">
        <title>Overestimating neural networks</title>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Monotonic Neural Networks</title>
      <p>
        In this section, we recall known results on the
approximation of non-decreasing functions with neural networks
having non-negative weights and non-decreasing
activation functions
        <xref ref-type="bibr" rid="ref21 ref29 ref8">(Lang 2005; Daniels and Velikova 2010; Sill
1998)</xref>
        .
      </p>
    </sec>
    <sec id="sec-13">
      <title>Proposition 2. Sufficient condition to get a non-decreasing network</title>
      <p>For any neural network such that:
• its activation functions are non-decreasing,
• its weights w are non-negative,
the function f_{w,b} defined by the neural network is
non-decreasing.</p>
      <p>The conditions are sufficient but not necessary. We can
think of simple non-decreasing neural networks with both
positive and negative weights. However, as stated in the next
theorem, neural networks with non-negative weights are
universal approximators of non-decreasing functions.</p>
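Proposition 2 can be checked empirically on a small randomly initialized network with non-negative weights; the sizes and initialization below are illustrative, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# A one-hidden-layer MLP with non-negative weights and a non-decreasing
# activation (tanh): by Proposition 2 it defines a non-decreasing function.
W1, b1 = rng.uniform(0, 1, (8, 3)), rng.normal(size=8)
W2, b2 = rng.uniform(0, 1, (1, 8)), rng.normal(size=1)

def net(x):
    h = np.tanh(W1 @ x + b1)
    return float(W2 @ h + b2)

# Empirical monotonicity check: x <= x' (componentwise) => net(x) <= net(x').
for _ in range(100):
    x = rng.uniform(-1, 1, 3)
    xp = x + rng.uniform(0, 1, 3)      # xp >= x componentwise
    assert net(xp) >= net(x)
```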
    </sec>
    <sec id="sec-14">
      <title>Theorem 1. Universality of neural networks with non-negative weights (Daniels and Velikova 2010)</title>
      <p>Let D ⊂ ℝ^d be compact. For any continuous
non-decreasing function g : D → ℝ and any δ &gt; 0, there
exists a feed-forward neural network with d hidden layers,
a non-decreasing activation function, non-negative weights
w and bias b such that
|g(x) − f_{w,b}(x)| ≤ δ, for all x ∈ D,
where f_{w,b} is the function defined by the neural network.</p>
      <p>
        Notice that, in
        <xref ref-type="bibr" rid="ref8">(Daniels and Velikova 2010)</xref>
        , the neural
network constructed in the proof of Theorem 1 involves a
Heaviside activation function. The choice of the activation
function is important. For instance, with a convex activation
function (like ReLU), the function defined by the neural
network with non-negative weights is convex
        <xref ref-type="bibr" rid="ref1 ref25">(Amos, Xu, and
Kolter 2017)</xref>
        and may approximate arbitrarily poorly a
well-chosen non-decreasing non-convex function.
      </p>
      <p>Theorem 1 guarantees that a well-chosen and optimized
neural network with non-negative weights can approximate
with any required accuracy the smallest non-decreasing
majorant of f_C, as defined in (3).</p>
    </sec>
    <sec id="sec-15">
      <title>Learning the neural network</title>
      <p>We consider a feed-forward neural network of depth h. The
hidden layers are of width l. The weights are denoted by w,
and we restrict the search to non-negative weights: w ≥ 0.
The bias is denoted by b. We consider, for a parameter
θ &gt; 0, the activation function
σ_θ(t) = tanh(t/θ), for all t ∈ ℝ.</p>
      <p>
        We consider an asymmetric loss function in order to
penalize underestimation more than overestimation:
ℓ_{λ−,λ+,τ,p}(t) = λ+ (t − τ)² if t ≥ τ, and ℓ_{λ−,λ+,τ,p}(t) = λ− |t − τ|^p if t &lt; τ,
where the parameters (λ+, λ−, τ) ∈ ℝ³ are non-negative
and p ∈ ℕ. Notice that asymmetric loss functions have
already been used to penalize either under-estimation or
over-estimation (Yao and Tong 1996)
        <xref ref-type="bibr" rid="ref16 ref31 ref4">(Julian, Kochenderfer, and
Owen 2019)</xref>
        .
      </p>
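A sketch of such an asymmetric loss; the parameter values λ+, λ−, τ, p below are illustrative, not those tuned in the paper:

```python
import numpy as np

def asym_loss(t, lam_plus=1.0, lam_minus=100.0, tau=0.1, p=2):
    """Asymmetric loss: quadratic above the offset tau, heavily weighted
    |t - tau|^p below it, so underestimation (t below tau) costs much more."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= tau,
                    lam_plus * (t - tau) ** 2,
                    lam_minus * np.abs(t - tau) ** p)

# Underestimating by 0.5 costs far more than overestimating by 0.5.
assert asym_loss(-0.5) > asym_loss(0.7)
assert float(asym_loss(0.1)) == 0.0      # exactly at the offset: zero loss
```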
      <p>Given Majoring Points (a_i, b_i)_{i=1..m}, we define, for all w
and b,
E(w, b) = Σ_{i=1}^m ℓ_{λ−,λ+,τ,p}(f_{w,b}(a_i) − b_i).</p>
      <p>
        The parameters of the network optimize
(w*, b*) ∈ argmin_{w ≥ 0, b} E(w, b).   (13)
The function E is smooth but non-convex. In order to solve
(13), we apply a projected stochastic gradient algorithm
        <xref ref-type="bibr" rid="ref5">(Bianchi and Jakubowicz 2012)</xref>
        . The projection onto the
constraint w ≥ 0 is obtained by canceling its negative entries.
As is often the case with neural networks, we cannot guarantee that the
algorithm converges to a global minimizer.
      </p>
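The projection step of the projected gradient iteration is simply a clamp of the weights at zero. A minimal sketch of one iteration follows; the weight shapes are illustrative and the gradients are random placeholders, not actual gradients of E:

```python
import numpy as np

def project_nonnegative(weights):
    """Projection onto the constraint w >= 0: cancel negative entries."""
    return [np.maximum(w, 0.0) for w in weights]

# One projected-gradient iteration on illustrative weight matrices.
rng = np.random.default_rng(2)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
grads = [rng.normal(size=w.shape) for w in weights]   # placeholder gradients

lr = 0.1
weights = project_nonnegative([w - lr * g for w, g in zip(weights, grads)])

assert all((w >= 0).all() for w in weights)   # feasibility after projection
```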
      <sec id="sec-15-1">
        <title>Guaranteeing f_{w*,b*}(a_i) ≥ b_i</title>
        <p>The parameter τ ≥ 0 is an offset parameter. Increasing τ
leads to poorer approximations. We show in the following
proposition that τ can be arbitrarily small if the other
parameters are properly chosen.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>Proposition 3. Guaranteed overestimation of the samples</title>
      <p>Let τ &gt; 0, λ− &gt; 0, p &gt; 0. If the neural network is
sufficiently large and if δ is sufficiently small, then
f_{w*,b*}(a_i) ≥ b_i, for all i = 1..m.   (14)</p>
      <p>Proof. Given Theorem 1, when the network is sufficiently large and
δ is sufficiently small, there exist a bias b0 and non-negative
weights w0 such that, for all i = 1..m,
|f_{w0,b0}(a_i) − (b_i + τ)| ≤ δ.</p>
      <sec id="sec-16-1">
        <title>End of proof</title>
        <p>Therefore,
E(w*, b*) ≤ E(w0, b0) ≤ Σ_{i=1}^m max(λ− δ^p, λ+ δ²) = m max(λ− δ^p, λ+ δ²).   (15)
For δ sufficiently small, the right-hand side is smaller than λ− τ^p.</p>
        <p>If, by contradiction, we assume that there exists i0 such that
f_{w*,b*}(a_{i0}) &lt; b_{i0},
then we must have</p>
        <p>E(w*, b*) ≥ ℓ_{λ−,λ+,τ,p}(f_{w*,b*}(a_{i0}) − b_{i0}) &gt; λ− τ^p.
This contradicts (15) and we conclude that (14) holds.</p>
        <p>
          The proposition guarantees that, for a large network with
a sufficiently small approximation error, we are sure to overestimate
the target. Because
Theorem 1 does not provide a configuration (depth, width,
activation function) that permits to approximate any
function with a given accuracy, it is not possible to provide such a
configuration for the parameters in Proposition 3. However,
given a configuration and given weights w* and bias b*
returned by an algorithm, it is possible to test whether (14) holds. If
it does not hold, it is always possible to increase the width,
depth, etc., and redo the optimization. Theorem 1 and
Proposition 3, combined with properties of the landscape of large
networks such as
          <xref ref-type="bibr" rid="ref25">(Nguyen and Hein 2017)</xref>
          , guarantee that
such a strategy stops after a finite number of optimization
procedures.
        </p>
        <sec id="sec-16-1-1">
          <title>Experiments</title>
          <p>In this section, we compare the results of several learning
strategies on two synthetic experiments, with d = 1 and 2,
and on a real dataset from the avionics industry. The synthetic
experiments illustrate the method; the real
dataset shows that the method makes it possible to construct a surrogate
of a critical function that can be embedded while the true
critical function cannot.</p>
          <p>The Python code that was used to generate the
experiments, as well as additional experiments, is provided
with the submission and will be made available in an
open-source repository.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>Methods to be compared</title>
      <p>The architecture of the network is the same for all
experiments and contains 4 fully connected layers with 64 neurons
in each layer. The memory requirement to store the network
is 64d + 4·64² + 5·64 = 64d + 16704 floating-point numbers.
The size of the input layer is d. The size of the output layer
is 1. We compare:
• The Δ-baseline: for a parameter Δ ≥ 0, it is a simple
neural network with an ℓ2 loss. It is trained:
– on the points (a_i, f(a_i) + Δ)_{i=1..m}, when f can be
computed;
– on the modified dataset (x_i, f(x_i) + Δ)_{i=1..n}, when f
cannot be computed.</p>
      <p>The Δ-baseline is in general not guaranteed to provide an
overestimation. The 0-baseline is expected to provide a
better approximation of the true function f than the other
methods, but it fails to always overestimate it.</p>
      <p>If Δ is such that f(a_i) + Δ ≥ b_i, for all i = 1..m, the
Δ-baseline is guaranteed to provide an overestimation.
• The Overestimating Neural Network (ONN): a
neural network whose parameters solve (13) for the
parameters τ, λ+, λ− and p coarsely tuned for each experiment,
and, depending on the context, the Grid Based Majoring
Points (ONN with GMP), the Function Adapted
Majoring Points (ONN with FMP), or the Data Adapted Majoring
Points (ONN with DMP).</p>
      <p>We always take θ = 1. The size of the network and
the values of λ+, λ−, τ and p always make it possible to have
f_{w*,b*}(a_i) ≥ b_i, for all i = 1..m. Therefore, as
demonstrated in the previous sections, f_{w*,b*} is guaranteed to
overestimate f.</p>
    </sec>
    <sec id="sec-18">
      <title>Evaluation metrics</title>
      <p>In this section, we define the metrics that are used
to compare the methods. Some metrics use a test dataset
(x′_i, f(x′_i))_{i=1..n′}.</p>
      <p>For the 1D synthetic example, we take n′ = 100000 and,
for the industrial example, we take n′ = 75000. In both
cases the x′_i are iid according to the uniform distribution
on D. We consider the Majoring Approximation Error
(MAE) defined by
MAE = (1/m) Σ_{i=1}^m (b_i − f(a_i)),
the Root Mean Square Error (RMSE) defined by
RMSE = ( (1/n′) Σ_{i=1}^{n′} (f_{w*,b*}(x′_i) − f(x′_i))² )^{1/2},
the Overestimation Proportion (OP) defined by
OP = (100/n′) |{i | f_{w*,b*}(x′_i) ≥ f(x′_i)}|,
and we remind whether the method guarantees that f_{w*,b*}
overestimates f (Formal Guarantee, FG).</p>
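The metrics can be sketched as follows; the toy reference function, surrogate, and function names are illustrative assumptions:

```python
import numpy as np

def metrics(surrogate, f, a, b, x_test):
    """MAE on the Majoring Points, RMSE and Overestimation Proportion (OP)
    of the surrogate on a test sample (signatures are illustrative)."""
    mae = float(np.mean(b - f(a)))
    err = surrogate(x_test) - f(x_test)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    op = 100.0 * np.mean(surrogate(x_test) >= f(x_test))
    return mae, rmse, op

f = lambda x: x                      # toy 1D reference function
surrogate = lambda x: x + 0.1        # overestimates f everywhere
a = np.linspace(0, 1, 5)
b = f(a) + 0.2                       # majoring values b_i >= f(a_i)
x_test = np.linspace(0, 1, 101)

mae, rmse, op = metrics(surrogate, f, a, b, x_test)
assert abs(mae - 0.2) < 1e-12
assert abs(rmse - 0.1) < 1e-12
assert op == 100.0                   # FG holds for this toy surrogate
```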
      <p>For the experiments on the 1D synthetic example, the
methods are also evaluated by visual inspection.</p>
      <sec id="sec-18-1">
        <title>1D synthetic experiment</title>
        <p>The 1D synthetic experiment aims at overestimating the
function f1 defined over [−10, 10] by (16).</p>
        <p>The function f1, f_C, f_{w*,b*} for Grid Based Majoring
Points, and the 0.2-baseline are displayed on Figure 1, on
the interval [−4.5, 2]. The function f1, the Grid Based
Majoring Points, f_{w*,b*} for Grid Based Majoring
Points, and the 0.2-baseline are displayed on Figure 2, on the
interval [0.8, 1.2]. The function f1, the Data Adapted
Majoring Points, the sample (x_i, f1(x_i))_{i=1..n} and f_{w*,b*} for
Data Adapted Majoring Points are displayed on Figure 3,
on the interval [−5.5, 0].</p>
        <p>We clearly see that the adaptive Majoring Points
aggregate in a dyadic manner in the vicinity of the discontinuities.
We also see on Figure 2 how Majoring Points make it possible to
anticipate the discontinuity and construct an overestimation.
The density of Majoring Points depends on the slope of f1.
This reduces the approximation error. Also, f_{w*,b*}
passes in the vicinity of the Majoring Points and provides a
good approximation that overestimates f.</p>
        <p>The MAE, RMSE, OP and FG are provided in Table 1 for
the 0-baseline, the 0.5-baseline, f_C and the ONN for the three
types of Majoring Points. We see that the RMSE worsens as
we impose guarantees and approach the realistic scenario
where the surrogate is easy to compute and does not require
too much memory.</p>
      </sec>
      <sec id="sec-18-2">
        <title>2D synthetic experiment</title>
        <p>The same phenomena described in the 1D synthetic
experiment occur in dimension 2. We only illustrate here the
difference between the Grid Based Majoring Points, the
Function Adapted Majoring Points and the Data Adapted Majoring
Points for the function defined, for all (x, y) ∈ [0, 15]², by
g(x, y) = f1(√(x² + y²) − 10),
where f1 is the function defined in (16).</p>
        <p>We display, on Figure 5, the Function Adapted Majoring
Points returned by the Dichotomy Algorithm. The density
of points is correlated with the amplitude of the gradient of
the function. Algorithm 1 reduces the number of
Majoring Points. For instance, for the 2D synthetic example
with ε = 0.1, the Grid Based Majoring Points count 22500
points. The Function Adapted Majoring Points count 14898
points for ε = 0.5, and 3315 for ε = 2. The Data Adapted
Majoring Points count 8734 points for ε = 0.5, and 1864
for ε = 2.</p>
        <p>On Figure 4, we represent the dataset and the cover
obtained using Algorithm 1 for the 2D synthetic example. The
inputs a_i of the corresponding Majoring Points are displayed
on Figure 5.</p>
      </sec>
    </sec>
    <sec id="sec-19">
      <title>Industrial application</title>
      <p>The method developed in this paper provides formal
guarantees of overestimation, that is, safety guarantees directly
applicable in critical embedded systems.</p>
      <p>[Figure: (a) complete domain; (b) zoom on the discontinuity.]</p>
      <p>
        The construction of surrogate functions is an important
subject in industry
        <xref ref-type="bibr" rid="ref14 ref23 ref6">(Lathuilière et al. 2019; Biannic et al.
2016; Jian et al. 2017; Sudakov et al. 2019)</xref>
        . In this work,
we consider an industrial and heavy simulation code
that has six inputs (d = 6) and one output, and that represents
a complex physical phenomenon of an aircraft. The output is
a non-decreasing function. During the flight, given flight
conditions x, the output f(x) is compared to a threshold and
the result of the test launches an action. When we replace f by
the overestimating surrogate f_{w,b}, the airplane launches
the action less often. However, the airplane only launches
the action when the action is guaranteed to be safe.
      </p>
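      <p>The safety argument can be sketched as follows (all names are hypothetical, and the direction of the threshold test is our assumption, not stated in the paper). If the action is launched when the monitored value is below the threshold, then an overestimating surrogate can only suppress launches relative to f, never trigger one that f would not.</p>

```python
def should_launch(value, threshold):
    # Hypothetical convention: the action is launched when the
    # monitored value does not exceed the threshold.
    return value <= threshold

def f(x):
    # Hypothetical stand-in for the heavy reference computation.
    return 2.0 * x

def surrogate(x):
    # Hypothetical overestimating surrogate: surrogate(x) >= f(x) for all x,
    # so should_launch(surrogate(x), T) implies should_launch(f(x), T).
    return 2.0 * x + 0.3
```

      <p>Whenever the surrogate authorizes the action, the reference function would have authorized it too; the converse can fail, which is exactly the "launches less often" behavior described above.</p>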
      <p>The industrial dataset contains n = 150000 examples on
a static grid and another set of 150000 examples sampled according
to the uniform distribution on the whole domain. For each
input, the reference computation code is used to generate
the associated true output.</p>
      <p>We compare the 0-baseline and the 300-baseline with the ONN
learned on Grid Based Majoring Points and on Data Adapted
Majoring Points. All the methods are learned on the
static grid except the ONN with Data Adapted Majoring Points.
Table 2 presents the metrics for the different methods.
The results are acceptable for the application, and the memory
required to store and embed the neural network is 17088
floating-point numbers. This is one order of magnitude smaller than
the size of the dataset.</p>
      <sec id="sec-19-1">
        <title>Conclusion - Future Work</title>
        <p>We presented a method that makes it possible to formally guarantee
that a prediction of a monotonic neural network will always
be in an area that preserves the safety of a system. This is
achieved by the construction of the network, the use
of Majoring Points and the learning phase, which frees us
from a massive testing phase that is long
and costly while providing fewer guarantees.</p>
        <p>Our work has limitations on the functions that can be
safely approximated, but it is a first step toward a safe use
of neural networks in critical applications. Nevertheless, it
can already be used in simple safety-critical systems that
verify our hypotheses. Future work will look at the
possibility of relaxing the monotonicity hypothesis.
Another direction of improvement is to build smarter
algorithms that require fewer Majoring Points thanks to a better
adaptation to the structure of the function. In particular, this
should make it possible to apply the method to functions whose
input space is of larger dimension, when they have the proper
structure.</p>
      </sec>
      <sec id="sec-19-2">
        <title>Acknowledgements</title>
        <p>This project received funding from the French ”Investing for
the Future – PIA3” program within the Artificial and
Natural Intelligence Toulouse Institute (ANITI) under the grant
agreement ANR-19-PI3A-0004. The authors gratefully
acknowledge the support of the DEEL project (https://www.deel.ai/).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Amos</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kolter</surname>
            ,
            <given-names>J. Z.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Input Convex Neural Networks</article-title>
          .
          <source>In Proceedings of the 34th International Conference on Machine Learning - Volume 70</source>
          , ICML'17,
          <fpage>146</fpage>
          -
          <lpage>155</lpage>
          . JMLR.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Anthony</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P. L.</given-names>
          </string-name>
          <year>1999</year>
          .
          <source>Neural Network Learning: Theoretical Foundations</source>
          . Cambridge University Press. ISBN 0-521-57353-X. URL http://www.stat.berkeley.edu/~bartlett/nnl/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          404 pp. Reprinted
          <year>2001</year>
          ,
          <year>2002</year>
          .
          <source>Paperback edition 2009; ISBN 978-0-521-11862-0.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          Bartlett, P. L.; Harvey, N.; Liaw, C.; and Mehrabian, A.
          <year>2019</year>
          .
          <article-title>Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>20</volume>
          (
          <issue>63</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Bianchi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jakubowicz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Convergence of a multi-agent projected stochastic gradient algorithm for nonconvex optimization</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          <volume>58</volume>
          (
          <issue>2</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Biannic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Hardier, G.; Roos, C.; Seren, C.; and Verdier, L.
          <year>2016</year>
          .
          <article-title>Surrogate models for aircraft flight control: some off-line and embedded applications</article-title>
          .
          <source>AerospaceLab Journal</source>
          (
          <volume>12</volume>
          ):
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Boopathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; Weng, T.-W.; Chen, P.-Y.; Liu, S.; and Daniel, L.
          <year>2019</year>
          .
          <article-title>CNN-Cert: An efficient framework for certifying robustness of convolutional neural networks</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <fpage>3240</fpage>
          -
          <lpage>3247</lpage>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Daniels</surname>
            , H.; and Velikova,
            <given-names>M.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Monotone and partially monotone neural networks</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>
          <volume>21</volume>
          (
          <issue>6</issue>
          ):
          <fpage>906</fpage>
          -
          <lpage>917</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Balunovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Drachsler-Cohen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Gehr,
          <string-name>
            <surname>T.</surname>
          </string-name>
          ; Zhang,
          <string-name>
            <surname>C.</surname>
          </string-name>
          ; and Vechev,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>DL2: Training and Querying Neural Networks with Logic</article-title>
          .
          <source>In International Conference on Machine Learning.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Gal</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference</article-title>
          .
          <source>In 4th International Conference on Learning Representations (ICLR) workshop track.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Harvey</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liaw</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Mehrabian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Nearly-tight VC-dimension bounds for piecewise linear neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>In Conference on Learning Theory</source>
          ,
          <fpage>1064</fpage>
          -
          <lpage>1068</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Jakubovitz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Giryes, R.; and Rodrigues,
          <string-name>
            <surname>M. R.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Generalization error in deep learning</article-title>
          .
          <source>In Compressed Sensing and Its Applications</source>
          ,
          <fpage>153</fpage>
          -
          <lpage>193</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Jian</surname>
            ,
            <given-names>Z.-D.</given-names>
          </string-name>
          ; Chang, H.-J.; Hsu, T.-s.; and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.-W.</given-names>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>Learning from Simulated World-Surrogates Construction with Deep Neural Network</article-title>
          .
          <source>In SIMULTECH</source>
          ,
          <fpage>83</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Julian</surname>
            ,
            <given-names>K. D.</given-names>
          </string-name>
          ; Kochenderfer,
          <string-name>
            <surname>M. J.;</surname>
          </string-name>
          and Owen,
          <string-name>
            <surname>M. P.</surname>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>Deep Neural Network Compression for Aircraft Collision Avoidance Systems</article-title>
          .
          <source>Journal of Guidance, Control, and Dynamics</source>
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>598</fpage>
          -
          <lpage>608</lpage>
          . ISSN 1533-3884. doi: 10.2514/1.G003724.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>URL http://dx.doi.org/10.2514/1.G003724.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Barrett,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Dill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            ;
            <surname>Julian</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          ; and Kochenderfer,
          <string-name>
            <surname>M. J.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Reluplex: An efficient SMT solver for verifying deep neural networks</article-title>
          .
          <source>In International Conference on Computer Aided Verification</source>
          ,
          <fpage>97</fpage>
          -
          <lpage>117</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Keren</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Cummins, N.; and
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Calibrated prediction intervals for neural network regressors</article-title>
          .
          <source>IEEE Access</source>
          <volume>6</volume>
          :
          <fpage>54033</fpage>
          -
          <lpage>54041</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Monotonic Multi-layer Perceptron Networks as Universal Approximators</article-title>
          . In Duch, W.; Kacprzyk, J.; Oja, E.; and Zadrożny, S., eds.,
          <source>Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005</source>
          ,
          <fpage>31</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-540-28756-8.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Lathuilière</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Mesejo, P.; Alameda-Pineda, X.; and Horaud, R.
          <year>2019</year>
          .
          <article-title>A comprehensive analysis of deep regression</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Mirman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gehr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and Vechev,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Differentiable Abstract Interpretation for Provably Robust Neural Networks</article-title>
          . In Dy, J.; and
          <string-name>
            <surname>Krause</surname>
          </string-name>
          , A., eds.,
          <source>Proceedings of the 35th International Conference on Machine Learning</source>
          , volume
          <volume>80</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <fpage>3578</fpage>
          -
          <lpage>3586</lpage>
          . Stockholmsmässan, Stockholm, Sweden: PMLR. URL http://proceedings.mlr.press/v80/mirman18b.html.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ; and Hein,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>The loss surface of deep and wide neural networks</article-title>
          .
          <source>In Proceedings of the 34th International Conference on Machine Learning-</source>
          Volume
          <volume>70</volume>
          ,
          <fpage>2603</fpage>
          -
          <lpage>2612</lpage>
          . JMLR. org.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Pearce</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brintrup</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Neely</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach</article-title>
          .
          <source>In Proceedings of the 35th International Conference on Machine Learning</source>
          , ICML,
          <fpage>4072</fpage>
          -
          <lpage>4081</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Raghunathan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steinhardt</surname>
            , J.; and Liang,
            <given-names>P.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Certified Defenses against Adversarial Examples</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Sill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Monotonic Networks</article-title>
          . In
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          ; Kearns,
          <string-name>
            <given-names>M. J.;</given-names>
            and
            <surname>Solla</surname>
          </string-name>
          , S. A., eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>10</volume>
          ,
          <fpage>661</fpage>
          -
          <lpage>667</lpage>
          . MIT Press. URL http://papers.nips.cc/paper/1358-monotonic-networks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          Sudakov et al. 2019.
          <article-title>Artificial Neural Network Surrogate Modeling of Oil Reservoir: a Case Study</article-title>
          .
          <source>In International Symposium on Neural Networks</source>
          ,
          <fpage>232</fpage>
          -
          <lpage>241</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Tagasovska</surname>
          </string-name>
          , N.; and
          <string-name>
            <surname>Lopez-Paz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Single-Model Uncertainties for Deep Learning</article-title>
          . arXiv preprint arXiv:1811.00908. To appear in NeurIPS.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>