Overestimation learning with guarantees

                           Adrien Gauffriau1 * François Malgouyres2† , Mélanie Ducoffe1‡
                                                 1
                                             IRT Saint Exupéry, Toulouse, France
                       2
                     Institut de Mathématiques de Toulouse ; UMR5219 Université de Toulouse ; CNRS ;
                                         UPS IMT F-31062 Toulouse Cedex 9, France
           {adrien.gauffriau, melanie.ducoffe}@irt-saintexupery.com, Francois.Malgouyres@math.univ-toulouse.fr


                            Abstract                                   Inspired by critical systems applications, we require (1) to
  We describe a complete method that learns a neural network
                                                                       be guaranteed by a formal theorem. We call fw∗ ,b∗ the sur-
  which is guaranteed to over-estimate a reference function on         rogate and call f the model.
  a given domain. The neural network can then be used as a                In the typical application we have in mind, f is the re-
  surrogate for the reference function.                                sult of a deterministic but complex phenomenon. It is dif-
  The method involves two steps. In the first step, we construct       ficult to compute and its evaluation requires intensive com-
  an adaptive set of Majoring Points. In the second step, we           putations or extensive memory consumption. We consider
  optimize a well-chosen neural network to over-estimate the           two settings. In the first setting, we can evaluate f (x) for
  Majoring Points.                                                     any x ∈ D. In the second setting, we only know a data set
  In order to extend the guarantee on the Majoring Points to the       (xi , f (xi ))i=1..n of size n ∈ N.
  whole domain, we necessarily have to make an assumption                 The constructed surrogate fw∗ ,b∗ permits to rapidly eval-
  on the reference function. In this study, we assume that the         uate an approximation of f at any point of D. Of course,
  reference function is monotonic.                                     we want the surrogate fw∗ ,b∗ to approximate f well; but
  We provide experiments on synthetic and real problems. The           most importantly we want fw∗ ,b∗ to overestimate f . In the
  experiments show that the density of the Majoring Points con-        typical scenario we have in mind, f models the consump-
  centrate where the reference function varies. The learned            tion of some resource for a critical action. Underestimating
  over-estimations are both guaranteed to overestimate the ref-
  erence function and are proven empirically to provide good
                                                                       the model may cause a (potentially deadly) hazard that will
  approximations of it.                                                not be acceptable for certification authorities especially in
  Experiments on real data show that the method makes it pos-
                                                                       aeronautics (EASA or FAA). The applications include for
  sible to use the surrogate function in embedded systems for          instance: - the estimation of the braking distance of an au-
  which an underestimation is critical; when computing the ref-        tonomous vehicle; - the number of kilometers that can be
  erence function requires too many resources.                         traveled; - the maximum load a structure can carry, - the net-
                                                                       work traveling time etc Guaranteed overestimation can also
                                                                       be used to guarantee the fairness of a score, when underesti-
                      1    Introduction                                mation on a sub-population leads to an unfair surrogate func-
Overestimation guarantee                                               tion. Providing such fairness or performance guarantee can
In this paper, we consider a real value finite function f de-          be seen as a niche activity, but it alleviates the limitation that
fined on a compact domain D and describe a method that                 restrains the massive industrialization of neural networks in
finds optimal weights w∗ and bias b∗ such that the neural              critical applications where passengers lives are at stake.
network fw∗ ,b∗ both provides a good approximation of f :                 Finally, the adaptation of the method to construct an
                           fw∗ ,b∗ ∼ f,                                under-estimation of f is straightforward and, for ease of pre-
                                                                       sentation, not exposed in the paper. Of course, the combina-
and is guaranteed to overestimate f over D:                            tion of both an over-estimation and an under-estimation per-
            fw∗ ,b∗ (x) ≥ f (x)        for all x ∈ D.           (1)    mits, for any x, the construction of an interval in which f (x)
                                                                       is guaranteed to be.
   * seconded from Airbus Operations, Toulouse
   †
     This work has benefited from the AI Interdisciplinary Institute   Related works
ANITI. ANITI is funded by the French ”Investing for the Future –
PIA3” program under the Grant agreement n°ANR-19-PI3A-0004.            The most standard type of guarantees in machine learning
   ‡
     seconded from Airbus AI Research, Toulouse                        aim at upper-bounding the risk by upper-bounding the gen-
Copyright © 2021 for this paper by its authors. Use permitted under    eralization error (see (Anthony and Bartlett 1999) for an
Creative Commons License Attribution 4.0 International (CC BY          overview). We would like to highlight that the risk measures
4.0).                                                                  an average cost that does not exclude failures. Moreover, at
the writing of this paper, the bounds on the generalization er-       Another restriction is that the method cannot be applied
ror are very pessimistic when compared to the performance          when the dimension d is large and f varies in all directions.
observed in practice (Harvey, Liaw, and Mehrabian 2017;            In fact, one contribution of the paper is to design algorithms
Bartlett et al. 2019) and the recent overview (Jakubovitz,         to alleviate this problem and apply the method for larger d,
Giryes, and Rodrigues 2019). In particular, neural network         but d must anyway remain moderate for most function f .
are commonly learned with much less examples than re-                 These hypotheses permit to have global guarantee that
quired by the theory.                                              hold on the whole domain D while weaker hypotheses on
    Other works provide guarantees on the neural network           f only lead to local guarantees, holding in the close vicinity
performance (Mirman, Gehr, and Vechev 2018; Boopathy               of the learning examples.
et al. 2019; Katz et al. 2017) that can be used to exclude
failures (in our setting, the under-estimation of f ). The phi-    Organization of the paper
losophy of these works is to analyze the output of the learn-      In the next section, we define Majoring Points and describe a
ing phase in order to provide guarantees of robustness. An-        grid based and an adaptive algorithm to construct them. The
other line of research, is to impose constraints or optimize       algorithms are adapted when the function f can be evaluated
robustness criteria during the learning phase (Fischer et al.      at any new input as well as when we only know a dataset
2019; Raghunathan, Steinhardt, and Liang 2018). This does          (xi , f (xi ))i=1..n . In Section 3, we describe a strategy to
not permit to guarantee an over-estimation property.               construct monotonic, over-estimating neural networks. Fi-
    Finally, another direction makes probabilistic assump-         nally, in Section 4, we provide experiments showing that the
tions and provides confidence scores (Gal and Ghahramani           Majoring Points are located in the regions where f varies
2016; Pearce et al. 2018; Tagasovska and Lopez-Paz 2019;           strongly or is discontinuous. We also illustrate on synthetic
Keren, Cummins, and Schuller 2018). The confidence score           examples that the method can accurately approximate f ,
does not necessarily provide a formal guarantee.                   while being guaranteed to overestimate it. An example on
    Compared to these approaches, the guarantee on the sur-        real data illustrates that the use of over-estimating neural net-
rogate fw∗ ,b∗ is due to the attention given to the construction   works permits to reduce the memory required to compute an
of the learning problem and the hypotheses on the function         over-estimation and thus permits to embed it, in a real world
f . To the best of our knowledge, and not considering trivial      application.
overestimation with poor performances, we are not aware of
any other construction of a surrogate fw∗ ,b∗ that is guaran-                        2    Majoring Points
teed to overestimate the model f .                                 Introduction
Hypotheses and restrictions                                        Definition 1. Majoring Points
                                                                                                       m
In order to extend the overestimation property to the whole            Let (ai , bi )i=1..m ∈ Rd × R . Let D ⊂ Rd be compact
domain we can obviously not consider any arbitrary func-           and let f : Rd −→ R be finite and non-decreasing. We say
tion f . In this paper, we restrict our analysis to real-valued    that (ai , bi )i=1..m are Majoring Points for f if and only if
and finite non-decreasing functions. The extension to vec-         for any non-decreasing g : Rd −→ R :
tor valued function is straightforward. The extension to              If g(ai ) ≥ bi , ∀i = 1..m, then g(x) ≥ f (x), ∀x ∈ D.
non-increasing functions is straightforward too. We delib-             When f is upper-bounded on D, the existence of such ma-
erately use these restrictive hypotheses to simplify the pre-      joring points is established by considering: m = 1,
sentation. Furthermore, many extensions of the principles of
the method described in the paper to other classes of non-          • a1 such that a1 ≤ x for all x ∈ D,
monotone functions are possible.                                    • b1 = supx∈D f (x).
   Formally, for a compact domain D ⊂ Rd , a function g :          However, for most function f , any non-decreasing g such
D −→ R is non-decreasing if and only if                            that g(a1 ) ≥ b1 is a poor approximation of f . The goal,
              g(x) ≤ g(x0 )       for all x ≤ x0                   when constructing Majoring Points of f is to have bi ∼
                                                                   f (ai ), while bi ≥ f (ai ), for all i = 1..m. The number of
where the partial order on Rd is defined for any x =               points m should remain sufficiently small to make the opti-
(xk )k=1..d and x0 = (x0k )k=1..d ∈ Rd by                          mization of the neural network manageable.
      x ≤ x0 if and only if xk ≤ x0k , for all k = 1..d.
                                                                   Constructing Majoring Points using a cover
   Other authors have already studied the approximation of         For any y and y 0 ∈ Rd , with y ≤ y 0 , we define the hyper-
non-decreasing functions with neural networks (Lang 2005;          rectangle between y and y 0 by
Daniels and Velikova 2010; Sill 1998). In our context, the
monotonic hypothesis is motivated by the fact that mono-                           Ry,y0 = {x ∈ Rd |y ≤ x < y 0 }.
tone function are ubiquitous in industrial applications and in     Using hyper-rectangles, we define the considered covers.
physics. Moreover, for a given application, where a function       Definition 2. Cover
        0
f 0 : Rd −→ R needs to be approximated. It is sometimes               Let (yi )i=1..m and (yi0 )i=1..m in Rd be such that yi0 ≥ yi ,
                                                     0
possible to extract features with a function g : Rd −→ Rd          for all i = 1..m. We say that (yi , yi0 )i=1..m covers D if and
                             d                      0
such that the function f : R −→ R, satisfying f = f ◦ g,           only if
is monotone.                                                                             D ⊂ ∪m   i=1 Ryi ,yi0 .                (2)
  For any cover C = (yi , yi0 )i=1..m , we define the function                  solution of an optimization taking into account these two
                                                                                properties. However, the optimization would be intractable
       fC (x) =      min           f (yi0 )        , for all x ∈ D.      (3)
                  i : x∈Ry ,y0                                                  and we only describe an heuristic algorithm that construct a
                           i   i
                                                                                cover.
Notice that, since C is a cover, {i|x ∈ Ryi ,yi0 } =6 ∅ and                       We construct Majoring Points according to two scenarios:
fC (x) is well-defined. We can establish that the function fC                   • We can generate f (x) for all x ∈ Rd . We call the Ma-
overestimates f over D:                                                           joring Points generated according to this setting Function
             For all x ∈ D,               fC (x) ≥ f (x).                (4)      Adapted Majoring Points.
                                                                                • We have a dataset (xi , f (xi ))i=1..n . It is not possible to
Indeed, for any x ∈ D, there exists i such that x ∈ Ryi ,yi0
                                                                                  have access to more training points. We call the Majoring
and fC (x) = f (yi0 ). Therefore, since f is non-decreasing                       Points generated according to this setting Data Adapted
and x ≤ yi0 , we have                                                             Majoring Points. In order to define them, we consider the
                    f (x) ≤ f (yi0 ) = fC (x).                                    function f˜ : Rd −→ R defined, for all x ∈ Rymin ,ymax ,
                                                                                  by
Proposition 1. A cover defines Majoring Points
                                                                                                    f˜(x) = min f (xi ).                     (8)
  Let D ⊂ Rd be compact. Let C = (yi , yi0 )i=1..m be                                                          i:xi ≥x
a cover of D and let f : Rd −→ R be finite and                                    Since f is non-decreasing, we have
non-decreasing. The family (yi , f (yi0 ))i=1..m are Majoring
Points for f .                                                                             f˜(x) ≥ f (x)       for all x ∈ Rymin ,ymax .
Proof. We consider a non-decreasing function g such that,                       At the core of the constructions described below is the con-
                                                                                struction of a cover C = (yi , yi0 )i=1..m .
            for all i = 1..m,                 g(yi ) ≥ f (yi0 ).         (5)
                                                                                Grid Based Majoring Points We consider an accuracy
We need to prove that,                                                          parameter ε > 0 and a norm k.k on Rd . We define
              for all x ∈ D,              g(x) ≥ f (x).                                                 kymax − ymin k
                                                                                                 nmax = d              e.
To do so, we consider x ∈ D and i such that x ∈ R           .         yi ,yi0                                  ε
Using respectively that g is non-decreasing, (5), the defini-                     A simple way to define Majoring Points is to generate
tion of fC (3) and (4), we obtain the following sequence of                     a cover made of equally spaced points between ymin and
inequalities:                                                                   ymax . Setting
          g(x) ≥ g(yi ) ≥ f (yi0 ) ≥ fC (x) ≥ f (x).                     (6)                        ymax − ymin
                                                                                                r=                ∈ Rd
                                                                                                        nmax
                                                                                and for all i0 , · · · , id−1 ∈ {1, . . . , N − 1}, we set i =
   The function fC can be computed rapidly using a look-                        Pd−1       k
up table but requires storing (yi , f (yi0 ))i=1..m . This can be                 k=0 ik N and
prohibitive in some applications.
                                                                                           
                                                                                                yi = ymin + (i0 r1 , . . . , id−1 rd )
   To deal with this scenario, we describe in Section 3 a                                       yi0 = yi + r
method to construct a neural network such that fw∗ ,b∗ is
non-decreasing and satisfies fw∗ ,b∗ (yi ) ≥ f (yi0 ), for all                  Notice that, the parameter r defining the grid satisfies
i = 1..m. According to the proposition, such a network pro-                     krk ≤ ε. Given the cover, the Function Adapted Majoring
vides a guaranteed over-estimation of f , whose computation                     Points (ai , bi )i=0..N d −1 are defined, for i = 0..ndmax − 1, by
is rapid. The resources required to store w∗ and b∗ are in-                                                
                                                                                                              ai = yi
dependent of m. We show in the experiments that it can be                                                                                      (9)
                                                                                                              bi = f (yi0 )
several orders of magnitude smaller. This makes it possible
to embed the overestimating neural network when embed-                          We can also construct a Grid Based Majoring Points when
ding the Majoring Points is not be possible. This advantage                     the function f cannot be evaluated but a dataset
comes at the price of loss of accuracy as fw∗ ,b∗ (x) ≥ fC (x)                  (xi , f (xi ))i=1..n is available by replacing in the above defi-
(fw∗ ,b∗ (x) has the role of g in (6)).                                         nition f by f˜, see (8).
                                                                                   The Grid Based Majoring Points are mostly given for ped-
Majoring Points construction algorithm                                          agogical reasons. The number of values f (x) that need to be
Introduction In this section, we describe algorithmic                           evaluated and the number of Majoring Points defining the
strategies to adapt Majoring Points to the function f .                         objective of the optimization of the neural network are both
Throughout the section, we assume that we know ymin and                         equal to ndmax = O(ε−d ). It scales poorly with d and this re-
ymax ∈ Rd such that                                                             strains the application of the Grid Based Majoring Points to
                                                                                small d, whatever the function f .
                       D ⊂ Rymin ,ymax .                                 (7)
                                                                                   When f does not vary much in some areas, the Majoring
The goal is to build a cover such that fC is close to f and                     Points in this area are useless. This is what motivates the
m is not too large. Ideally, the cover can be expressed as the                  adaptive algorithms developed in the next sections.
Adaptive Majoring Points Bellow, we describe a Di-                   Algorithm 1 Adaptive cover construction
chotomy Algorithm that permits to generate an adaptive               Require: ε : Distance below which we stop decomposing
cover with regard to the variations of the function f (or f˜).       Require: εf : Target upper bound for the error in f
It begins with a cover made of the single hyper-rectangle            Require: np : number of examples in a decomposable
Rymin ,ymax . Then it decomposes every hyper-rectangle of              hyper-rectangle
the current cover that have not reached the desired accuracy.        Require: Inputs needed to compute f (resp. f˜)
The decomposition is repeated until all the hyper-rectangle          Require: ymin , ymax : Points satisfying (7)
of the current cover have the desired accuracy (see Algo-
rithm 1).                                                              C ← {Rymin ,ymax }
   Initially, we have D ⊂ Rymin ,ymax . For any hyper-                 t←0
                                       0
rectangle Ry,y0 , denoting r = y 2−y , the decomposition re-           while t 6= 1 do
places Ry,y0 by its sub-parts as defined by                              t←1
                                                                        C0 ← ∅
     Ry+(s1 r1,...,sd rd ),y+(s1 +1)r1,...,(sd +1)rd ) |                 for Ry,y0 ∈ C do
                               (s1 , . . . , sd ) ∈ {0, 1}d . (10)          if Ry,y0 satisfies (11) (resp. (12)) then
                                                                               C 0 ← C 0 ∪ {Ry,y0 }
(Hence the term ‘dichotomy algorithm’. ) The union of the                   else
sub-parts equal the initial hyper-rectangle. Therefore, the                    t←0
cover remains a cover after the decomposition. The final C                     for all sub-parts R of Ry,y0 (see (10)) do
is a cover of D.                                                                   C 0 ← C 0 ∪ {R}
   We consider a norm k.k on Rd and real parameters ε >                        end for
0, εf > 0 and np ∈ N. We stop decomposing an hyper-                         end if
rectangle if a notion of accuracy is satisfied. The notion of            end for
accuracy depends on whether we can compute f or not.                     C ← C0
 • When we can evaluate f (x): The accuracy of Ry,y0 is de-            end while
   fined by the test                                                   return C
       f (y 0 ) − f (y) ≤ εf     or      ky 0 − yk ≤ ε       (11)
• When we only know a dataset (xi , f (xi ))i=1..n : The ac-                3   Overestimating neural networks
  curacy of Ry,y0 is defined by the test                             Monotonic Neural Networks
       f˜(y 0 ) − f˜(y) ≤ εf                                         In this section, we remind known result on the approxi-
    
                                  or   ky 0 − yk ≤ ε (12)
       or        |{i|xi ∈ Ry,y }| ≤ np
                              0
                                                                     mation of non-decreasing functions with neural networks
                                                                     having non-negative weights and non-decreasing activa-
   where |.| is the cardinal of the set.                             tion functions (Lang 2005; Daniels and Velikova 2010; Sill
We stop decomposing if a given accuracy is reached :                 1998).
• when f (y 0 ) − f (y) ≤ εf or f˜(y 0 ) − f˜(y) ≤ εf : This         Proposition 2. Sufficient condition to get a non-
   happens where the function f varies only slightly.                decreasing network
                                                                        For any neural network such that:
• when ky 0 − yk: This happens where the function f varies
   strongly.                                                          • its activation functions are non-decreasing
                                                                      • its weights w are non-negative
• when |{i|xi ∈ Ry,y0 }| ≤ np : This happens when the
   number of samples in Ry,y0 does not permit to improve             the function fw,b defined by the neural network is non-
   the approximation of f by f˜ in its sub-parts.                    decreasing.
   The cover is constructed according to Algorithm 1. This              The conditions are sufficient but not necessary. We can
algorithm                      to stop after at most n0max =         think of simple non-decreasing neural network with both
        is guaranteed                                              positive and negative weights. However, as stated in the next
dlog2 kymax −y ε
                  min k
                          e iteration of the ‘while loop’. In the    theorem, neural networks with non-negative weights are uni-
worst case, every hyper-rectangle of the current cover is de-        versal approximators of non-decreasing functions.
composed into 2d hyper-rectangles. Therefore, the worst-             Theorem 1. Universality of neural networks with non-
case complexity of the algorithm creates                             negative weights (Daniels and Velikova 2010)
                          0
                      2dnmax = O(ε−d )                                  Let D ⊂ Rd be compact. For any continuous non-
                                                                     decreasing function g : D → R . For any η > 0, there
hyper-rectangles. The worst-case complexity bound is sim-            exist a feed-forward neural network with d hidden layers,
ilar to the complexity of the Grid Based Majoring Points.            a non-decreasing activation function, non-negative weights
However, depending on f , the number of hyper-rectangles             w∗ and bias b∗ such that
generated by the algorithm can be much smaller than this
worst-case complexity bound. The smoother the function f ,                  |g(x) − fw∗ ,b∗ (x)| ≤ η      , for all x ∈ D,
the less hyper-rectangles are generated.                             where fw∗ ,b∗ is the function defined by the neural network.
   Notice that, in (Daniels and Velikova 2010), the neural         Proof. Since α− β p > 0, there exists η > 0 such that
network constructed in the proof of Theorem 1 involves a
Heaviside activation function. The choice of the activation                      m max(α− η p , α+ η 2 ) ≤ α− β p .
function is important. For instance, with a convex activation      Given Theorem 1, when the network is sufficiently large and
function (like ReLU), the function defined by the neural net-      θ is sufficiently small there exist a bias b0 and non-negative
work with non-negative weights is convex (Amos, Xu, and            weights w0 such that for all i = 1..m:
Kolter 2017) and may approximate arbitrary poorly a well-
chosen non-decreasing non-convex function.                                         |fw0 ,b0 (ai ) − (bi + β)| ≤ η.
   Theorem 1 guarantees that a well-chosen and optimized           Therefore,
neural network with non-negative weights can approximate
with any required accuracy the smallest non-decreasing ma-                  E(w∗ , b∗ ) ≤ E(w0 , b0 )
jorant1 of fC , as defined in (3).                                                        m
                                                                                          X
                                                                                        ≤    max(α− η p , α+ η 2 )
Learning the neural network                                                                     i=1
We consider a feed-forward neural network of depth h. The                                        − p
                                                                                           ≤ α β .                              (15)
hidden layers are of width l. The weights are denoted by w
and we will restrict the search to non-negative weights: w ≥       If by contradiction we assume that there exists i0 such that
0. The bias is denoted by b. We consider, for a parameter                                fw∗ ,b∗ (ai0 ) < bi0
θ > 0, the activation function
                            t                                      then we must have
             σ(t) = tanh( )         for all t ∈ R.
                            θ                                        E(w∗ , b∗ ) ≥ lβ,α+ ,α− ,p (fw∗ ,b∗ (ai0 ) − bi0 ) > α− β p .
   We consider an asymmetric loss function in order to pe-
nalize more underestimation                                        This contradicts (15) and we conclude that (14) holds.
                            than   overestimation
                             α+ (t − β)2 if t ≥ β                     The proposition guarantees that for a large network, with
        lβ,α+ ,α− ,p (t) =
                             α− |t − β|p if t < β                  θ small, we are sure to overestimate the target. Because
where the parameters (α+ , α− , β) ∈ R3 are non-negative           Theorem 1 does not provide a configuration (depth, width,
and p ∈ N. Notice that asymmetric loss functions have al-          activation function) that permits to approximate any func-
ready been used to penalize either under-estimation or over-       tion with an accuracy η, it is not possible to provide such a
estimation (Yao and Tong 1996) (Julian, Kochenderfer, and          configuration for the parameters in Proposition 3. However,
Owen 2019).                                                        given a configuration and given weights w∗ and bias b∗ re-
   Given Majoring Points (ai , bi )i=1..m , we define, for all w   turned by an algorithm, it is possible to test if (14) holds. If
and b                                                              is does not hold, it is always possible to increase the width,
                     m                                             depth, etc and redo the optimization. Theorem 1 and Propo-
                                                                   sition 3, combined with properties of the landscape of large
                     X
         E(w, b) =       lβ,α+ ,α,p (fw,b (ai ) − bi ).
                          i=1
                                                                   networks such as (Nguyen and Hein 2017), guarantee that
                                                                   such a strategy stops after a finite number of optimization
The parameters of the network optimize
                                                                   procedure.
                 argminw≥0,b E(w, b).                  (13)
The function E is smooth but non-convex. In order to solve                             4    Experiments
(13), we apply a projected stochastic gradient algorithm           In this section, we compare the results of several learning
(Bianchi and Jakubowicz 2012). The projection on the con-          strategies on two synthetic experiments with d = 1 and 2
straint w ≥ 0 is obtained by canceling its negative entries.       and on a real dataset from avionic industry. The synthetic
As often with neural network, we cannot guarantee that the         experiments permit to illustrate the method; the latter real
algorithm converges to a global minimizer.                         dataset show that the method permits to construct a surrogate
                                                                   of a critical function that can be embedded while the true
Guaranteeing fw∗ ,b∗ (ai ) ≥ bi                                    critical function cannot.
The parameter β ≥ 0 is an offset parameter. Increasing β              The python codes that have been used to generate the ex-
leads to poorer approximations. We show in the following           periments, as well as additional experiments, are provided
proposition that β can be arbitrarily small, if the other pa-      with the submission and will be made available on an open
rameters are properly chosen.                                      source deposit.
Proposition 3. Guaranteed overestimation of the sam-
ples                                                               Methods to be compared
   Let β > 0, α− > 0, p > 0. If the neural network is              The architecture of the network is the same for all experi-
sufficiently large and if θ is sufficiently small, then            ments and contains 4 fully connected layers with 64 neurons
            fw∗ ,b∗ (ai ) ≥ bi       for all i = 1..m,   (14)      in each layer. The memory requirement to store the network
                                                                   is 64×d+4×642 +5∗64 = 64d+16705 floating numbers.
for any w and b∗ solving (13).
           ∗
                                                                   The size of the input layer is d. The size of the output layer
   1
       Notice fC is not necessarily non-decreasing.                is 1. We compare:
• The δ-baseline: For a parameter δ ≥ 0, it is a simple
  neural network, with an `2 loss. It is trained:
  – on the points (ai , f (ai )+δ)i=1..m , when f can be com-
    puted;
  – on the modified dataset (xi , f (xi ) + δ)i=1..n , when f
    cannot be computed.
  The δ-baseline is in general not guaranteed to provide an
  overestimation. The 0-baseline is expected to provide a
  better approximation of the true function f than the other
  methods but it fails to always overestimating it.
  If δ is such that f (ai ) + δ ≥ bi , for all i = 1..m, the
  δ-baseline is guaranteed to provide an overestimation.        Figure 1: 1D synthetic experiment with Grid Based Major-
• The Overestimating Neural Network (ONN): Is a neu-            ing Points
  ral network whose parameters solve (13) for the param-
  eters β, α+ , α and p coarsely tuned for each experiment,
  and depending on the context, the Grid Based Majoring
  Points (ONN with GMP), the Function Adapted Major-
  ing Points (ONN with FMP), the Grid Based Majoring
  Points (ONN with DMP).
  We always take θ = 1. The size of the network and
  the values of s β, α+ , α and p always permit to have
  fw∗ ,b∗ (ai ) ≥ bi , for all i = 1..m. Therefore, as demon-
  strated in the previous sections, fw∗ ,b∗ is guaranteed to
  overestimate f .

  Evaluation metrics
                                                                Figure 2: Discontinuity of 1D synthetic experiment with
  We are defining in this section the metrics that are used     Grid Based Majoring Points
  to compare the methods. Some metrics use a test dataset
  (x0i , f (x0i ))i=1..n0 .
  For the 1D synthetic example, we take n0 = 100000 and
  for the industrial example, we take n0 = 75000. In both                 
                                                                           3x + 3 sin(x) − 4    if x ∈ [−10; −1)
  cases the x0i are iid according to the uniform distribution
                                                                 f1 (x) =   −sign(x).x2 + sin(x) if x ∈ [−1; 1]
  in D. We consider the Majoring Approximation Error                       x + cos(x) + 10      if x ∈ (1; 10]
  (MAE) defined by
                                                                                                                (16)
                                m
                           1 X
                M AE =           (bi − f (ai ));                   The function f1 , fC , fw∗ ,b∗ for Grid Based Majoring
                           m i=1
                                                                Points and the 0.2-baseline are displayed on Figure 1, on
  the Root Mean Square Error (RMSE) defined by                  the interval [−4.5, −2]. The function f1 f , the Grid Based
                                                                Majoring Points and fw∗ ,b∗ for Grid Based Majoring
                                                 12           Points and the 0.2-baseline are displayed on Figure 2, on the
                 n0
           1                                                   interval [0.8, 1.2]. The function f1 , the Data Adapted Ma-
                X
                    (fw∗ ,b∗ (x0i ) − f (x0i ))2  ;            joring Points, the sample (xi , f1 (xi ))i=1..n and fw∗ ,b∗ for
             n0 i=1
                                                                Data Adapted Majoring Points are displayed on Figure 3,
  the Overestimation proportion (OP) defined by                 on the interval [−5.5, 0].
                                                                   We clearly see that the adaptive Majoring Points aggre-
               100                                              gate in dyadic manner in the vicinity of the discontinuities.
                   |{i|fw∗ ,b∗ (x0i ) ≥ f (x0i )}| ;
                n0                                              We also see on Figure 2 how Majoring Points permit to an-
                                                                ticipate the discontinuity and construct an overestimation.
  and remind if the method guarantees fw∗ ,b∗ overesti-
                                                                The density of Majoring Points depends on the slope of f1 .
  mates f Formal Guarantee (FG).
                                                                This permits to reduce the approximation error. Also, fw∗ ,b∗
  For the experiments on the 1D synthetic example the           passes in the vicinity of the Majoring Points and provides a
  methods are also evaluated using visual inspection.           good approximation that overestimates f .
                                                                   The MAE, RMSE, OP and FG are provided in Table 1 for
1D synthetic experiment                                         the 0-baseline, the 0.5-baseline, fC and the ONN for three
The 1D synthetic experiment aims at overestimating the          types of Majoring Points. We see that the RMSE worsen as
function f1 defined over [−10, 10] by                           we impose guarantees and approach the realistic scenario
                                                                     (a) Complete domain          (b) Zoom on the discontinuity

                                                                              Figure 4: Data Majoring Points grid

Figure 3: 1D synthetic experiment with Data Adapted Ma-
joring Points

                     n     MAE     RMSE        OP     FG
 0-baseline         200    N/A      -.04     51.9 %   NO
 0.5-baseline       200     0.5     0.59     99.5%    NO
 fC                 200    N/A      0.10     100 %    YES
 ONN with GMP       200    0.24     0.23      100%    YES
 ONN with FMP       200    0.23     0.55      100%    YES
 ONN with DMP       500    0.82     0.60      100%    YES

        Table 1: Metrics - 1D synthetic experiment
                                                                     (a) Complete domain          (b) Zoom on the discontinuity

                                                                Figure 5: Function Adapted Majoring Points on synthetic
where the surrogate is easy to compute and does not require     2D f .
too much memory consumption.

2D synthetic experiment
                                                                   The construction of surrogate functions is an important
The same phenomenon described on 1D synthetic exper-            subject in industry (Lathuilière et al. 2019; Biannic et al.
iment occur in dimension 2. We only illustrate here the         2016; Jian et al. 2017; Sudakov et al. 2019). In this work,
difference between the Grid Based Majoring Points, Func-        we are considering an industrial and heavy simulation code
tion Adapted Majoring Points and Data Adapted Majoring          that has six inputs d = 6 and one output and that represents
Points for the function                                         a complex physic phenomenon of an aircraft. The output is
                                     p                        an non-decreasing function. During the flight, given flight
   ∀(x, y) ∈ [0; 15]2   g(x, y) = f1    x2 + y 2 − 10           conditions x the output f (x) is compared to a threshold and
                                                                the result of the test launch an action. When we replace f by
where f1 is the function defined in (16).                       the overestimating surrogate fw∗ ,b∗ , the airplane launches
   We display, on Figure 5, the Function Adapted Majoring       the action less often. However, the airplane only launches
Points returned by the Dichotomy Algorithm. The density         the action when the action is guaranteed to be safe.
of points is correlated with the amplitude of the gradient of      The industrial dataset contains n = 150000 examples on
the function. Algorithm 1 permit to diminish the number of      a static grid and another set of 150000 sampled according
Majoring Points. For instance, for the 2D synthetic example     to the uniform distribution on the whole domain. For each
for ε = 0.1, the Grid Based Majoring Points counts 22500        inputs, the reference computation code is used to generate
points. The Function Adapted Majoring Points counts 14898       the associated true output.
points, for ε = 0.5, and 3315, for ε = 2. The Data Adapted
                                                                   We compare 0-baseline,300-baseline with the ONN
Majoring Points counts 8734 points, for ε = 0.5, and 1864,
                                                                learned on Grid Based Majoring Points and Data Adapted
for ε = 2.
                                                                Majoring Points methods. All the methods are learned on the
   On Figure 4, we represent the dataset and the cover ob-
tained using Algorithm 1 for the synthetic 2D example. The
inputs ai of the corresponding Majoring Points are displayed         Method         n      nmaj   RMSE     MAE      OP     FG
on Figure 5.                                                       0-baseline      150k    150k     3.3     N/A    65.1%   NO
                                                                  300-baseline     150k    150k    302.6   300.0   100%    NO
Industrial application                                           ONN with GMP      150k    150k    309.3   262.7   100%    YES
The method developed in this paper provides formal guar-         ONN with DMP      150k0   110k    445.7    N/A     N/A    YES
antees of overestimation that are safety guarantee directly
applicable in critical embedded systems.                                  Table 2: Results on the industrial dataset
static grid except OON with Data Adapted Majoring Points.          Biannic, J.; Hardier, G.; Roos, C.; Seren, C.; and Verdier,
The table 2 presents the metrics for the different methods.        L. 2016. Surrogate models for aircraft flight control: some
The results are acceptable for the application and memory          off-line and embedded applications. AerospaceLab Journal
requirement to store and embed the neural network is 17088         (12): pages–1.
floating numbers. It is one order of magnitude smaller than        Boopathy, A.; Weng, T.-W.; Chen, P.-Y.; Liu, S.; and Daniel,
the size of the dataset.                                           L. 2019. Cnn-cert: An efficient framework for certifying
                                                                   robustness of convolutional neural networks. In Proceed-
              5    Conclusion - Future Work                        ings of the AAAI Conference on Artificial Intelligence, vol-
We presented a method that enables to formally guarantee           ume 33, 3240–3247.
that a prediction of a monotonic neural network will always        Daniels, H.; and Velikova, M. 2010. Monotone and partially
be in an area that preserves the safety of a system. This is       monotone neural networks. IEEE Transactions on Neural
achieved by the construction of the network, the utilization       Networks 21(6): 906–917.
of majoring points and the learning phase, which allows us         Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.;
to free ourselves from a massive testing phase that is long        Zhang, C.; and Vechev, M. 2019. DL2:- Training and Query-
and costly while providing fewer guarantees.                       ing Neural Networks with Logic. In International Confer-
   Our work have limitations on functions that can be a            ence on Machine Learning.
safely approximate, but this is a first step toward a safe use
of neural networks in critical applications. Nevertheless, this    Gal, Y.; and Ghahramani, Z. 2016. Bayesian Convolutional
can already be used in simple safety critical systems that         Neural Networks with Bernoulli Approximate Variational
verify our hypotheses. Future works will look on possibil-         Inference. In 4th International Conference on Learning Rep-
ity to leverage the utilization of the monotonic hypothesis.       resentations (ICLR) workshop track.
Another direction of improvement is to build smarter algo-         Harvey, N.; Liaw, C.; and Mehrabian, A. 2017. Nearly-tight
rithms that require less majoring points thanks to a better        VC-dimension bounds for piecewise linear neural networks.
adaptation to the structure of the function. In particular, this   In Conference on Learning Theory, 1064–1068.
should permit to apply the method to functions whose is in-        Jakubovitz, D.; Giryes, R.; and Rodrigues, M. R. 2019. Gen-
put space is of larger dimension, when they have the proper        eralization error in deep learning. In Compressed Sensing
structure.                                                         and Its Applications, 153–193. Springer.
                                                                   Jian, Z.-D.; Chang, H.-J.; Hsu, T.-s.; and Wang, D.-W. 2017.
                     Acknowledgements                              Learning from Simulated World-Surrogates Construction
This project received funding from the French ”Investing for       with Deep Neural Network. In SIMULTECH, 83–92.
the Future – PIA3” program within the Artificial and Natu-         Julian, K. D.; Kochenderfer, M. J.; and Owen, M. P. 2019.
ral Intelligence Toulouse Institute (ANITI) under the grant        Deep Neural Network Compression for Aircraft Collision
agreement ANR-19-PI3A-0004. The authors gratefully ac-             Avoidance Systems. Journal of Guidance, Control, and Dy-
knowledge the support of the DEEL project.2                        namics 42(3): 598–608. ISSN 1533-3884. doi:10.2514/1.
                                                                   g003724. URL http://dx.doi.org/10.2514/1.G003724.
                              References                           Katz, G.; Barrett, C.; Dill, D. L.; Julian, K.; and Kochender-
Amos, B.; Xu, L.; and Kolter, J. Z. 2017. Input Convex             fer, M. J. 2017. Reluplex: An efficient SMT solver for veri-
Neural Networks. In Proceedings of the 34th International          fying deep neural networks. In International Conference on
Conference on Machine Learning - Volume 70, ICML’17,               Computer Aided Verification, 97–117. Springer.
146–155. JMLR.org.                                                 Keren, G.; Cummins, N.; and Schuller, B. 2018. Calibrated
Anthony, M.; and Bartlett, P. L. 1999. Neural Network              prediction intervals for neural network regressors. IEEE Ac-
Learning: Theoretical Foundations. Cambridge Univer-               cess 6: 54033–54041.
sity Press. ISBN 0 521 57353 X. URL http://www.stat.               Lang, B. 2005. Monotonic Multi-layer Perceptron Networks
berkeley.edu/∼bartlett/nnl/index.html. 404pp. ISBN 978-            as Universal Approximators. In Duch, W.; Kacprzyk, J.; Oja,
0-521-57353-X. Reprinted 2001, 2002. Paperback edition             E.; and Zadrożny, S., eds., Artificial Neural Networks: For-
2009; ISBN 978-0-521-11862-0.                                      mal Models and Their Applications – ICANN 2005, 31–37.
Bartlett, P. L.; Harvey, N.; Liaw, C.; and Mehrabian, A.           Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-
2019. Nearly-tight VC-dimension and Pseudodimension                3-540-28756-8.
Bounds for Piecewise Linear Neural Networks. Journal of            Lathuilière, S.; Mesejo, P.; Alameda-Pineda, X.; and Ho-
Machine Learning Research 20(63): 1–17.                            raud, R. 2019. A comprehensive analysis of deep regres-
Bianchi, P.; and Jakubowicz, J. 2012. Convergence of a             sion. IEEE transactions on pattern analysis and machine
multi-agent projected stochastic gradient algorithm for non-       intelligence .
convex optimization. IEEE transactions on automatic con-           Mirman, M.; Gehr, T.; and Vechev, M. 2018. Differ-
trol 58(2): 391–405.                                               entiable Abstract Interpretation for Provably Robust Neu-
                                                                   ral Networks. In Dy, J.; and Krause, A., eds., Pro-
   2
       https://www.deel.ai/                                        ceedings of the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Machine Learn-
ing Research, 3578–3586. Stockholmsmässan, Stockholm
Sweden: PMLR. URL http://proceedings.mlr.press/v80/
mirman18b.html.
Nguyen, Q.; and Hein, M. 2017. The loss surface of deep
and wide neural networks. In Proceedings of the 34th In-
ternational Conference on Machine Learning-Volume 70,
2603–2612. JMLR. org.
Pearce, T.; Brintrup, A.; Zaki, M.; and Neely, A. 2018.
High-Quality Prediction Intervals for Deep Learning: A
Distribution-Free, Ensembled Approach. In Proceedings
of the 35th International Conference on Machine Learning,
ICML, 4072–4081.
Raghunathan, A.; Steinhardt, J.; and Liang, P. 2018. Certi-
fied Defenses against Adversarial Examples.
Sill, J. 1998. Monotonic Networks. In Jordan, M. I.; Kearns,
M. J.; and Solla, S. A., eds., Advances in Neural Information
Processing Systems 10, 661–667. MIT Press. URL http://
papers.nips.cc/paper/1358-monotonic-networks.pdf.
Sudakov, O.; Koroteev, D.; Belozerov, B.; and Burnaev, E.
2019. Artificial Neural Network Surrogate Modeling of Oil
Reservoir: a Case Study. In International Symposium on
Neural Networks, 232–241. Springer.
Tagasovska, N.; and Lopez-Paz, D. 2019.       Single-
Model Uncertainties for Deep Learning. arXiv preprint
arXiv:1811.00908, To appear in NeurIPS .
Yao, Q.; and Tong, H. 1996. Asymmetric least squares re-
gression estimation: a nonparametric approach. Journal of
nonparametric statistics 6(2-3): 273–292.