TEDL: A Two-stage Evidential Deep Learning Method for
Classification Uncertainty Quantification
Xue Li¹,*, Wei Shen² and Denis Charles²
¹ 1045 La Avenida St, Mountain View, CA, 94043, United States
² 555 110th Ave NE, Bellevue, WA, 98004, United States


Abstract
In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep learning models in classification tasks, inspired by our findings in experimenting with the Evidential Deep Learning (EDL) method, a recently proposed uncertainty quantification approach based on the Dempster-Shafer theory. More specifically, we observe that EDL tends to yield inferior AUC compared with models learnt by cross-entropy loss and is highly sensitive in training. Such sensitivity is likely to cause unreliable uncertainty estimation, making it risky for practical applications. To mitigate both limitations, we propose a simple yet effective two-stage learning approach based on our analysis of the likely reasons for such sensitivity, with the first stage learning from cross-entropy loss, followed by a second stage learning from EDL loss. We also reformulate the EDL loss by replacing ReLU with ELU to avoid the Dying ReLU issue. Extensive experiments are carried out on training corpora of various sizes collected from a large-scale commercial search engine, demonstrating that the proposed two-stage learning framework can increase AUC significantly and greatly improve training robustness.

Keywords
uncertainty quantification, search ads recommendation, deep learning, classification, BERT, TwinBERT


DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
Email: xeli@microsoft.com (X. Li); sashen@microsoft.com (W. Shen); cdx@microsoft.com (D. Charles)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Uncertainty quantification of deep learning models has been a hot topic in the community ever since the rise of deep learning, and the demand for effective uncertainty quantification methods has become increasingly urgent over the past decade as deep learning continues to reshape many industries. Search recommendation, as perhaps the most radically reshaped industry, often relies on many different deep learning models to give accurate recommendations, which makes uncertainty quantification especially important since unreliable predictions could accumulate in the system and finally lead to inaccurate or even embarrassing recommendation results.

To make machine learning models aware of their own prediction confidence, many uncertainty quantification approaches have been proposed [1], including single deterministic methods, Bayesian methods and ensemble methods, among which the single deterministic methods can be further grouped into internal or external methods depending on whether additional components are required for uncertainty estimation. We present a brief review of this topic in Section 2. In this paper, we are particularly interested in single deterministic methods, especially internal approaches, since such methods typically need only a single forward pass on a deterministic network to estimate uncertainty, and hence do not require stochastic DNNs or ensemble models, making both training and inference more efficient.

More specifically, instead of considering the model outputs as a pointwise maximum-a-posteriori (MAP) estimation, internal single deterministic methods usually interpret model outputs as parameters of a prior distribution over all the possible predictions, and then give predictions by taking the expected value over the prior distribution. For classification tasks, the Dirichlet distribution is often chosen as the prior since it is the conjugate prior of the categorical distribution. Meanwhile, statistical distance metrics such as Kullback-Leibler (KL) divergence are often included in their loss functions due to the need to optimize over parameters of distributions [2, 3].

However, the efficiency of such methods comes at a cost. As mentioned in [1], they are typically more sensitive to training settings such as initialization, hyper-parameters and training data, which is what we observed when applying EDL [2], a recently proposed single deterministic method, to practical scenarios.

To be more specific, in our experiments we identify several issues in the EDL method. Firstly, as shown in Figure 2, when applied to binary classification tasks, the ROC AUC achieved by the EDL method is significantly lower than that obtained by cross-entropy loss, and this gap cannot be bridged by simply adding more training samples. Secondly, EDL tends to be sensitive to initialization and some hyper-parameters, where improper settings may lead to significantly degraded AUC and unreliable
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Figure 1: A schematic illustration of the proposed TEDL method. (a) The original EDL method transforms the model outputs
to strictly positive values using ReLU activation and learns to quantify uncertainty via the EDL loss in Equation (1), which
yields inferior AUC and is sensitive to training. (b) The proposed TEDL method employs a two-stage learning strategy to
decompose the original problem into two easier sub-problems and tackle one at a time: the first stage learns to make good
pointwise estimations via cross-entropy loss; and then the second stage will learn to quantify uncertainty using the pointwise
estimation as anchor points, with ReLU replaced by ELU to avoid the Dying ReLU issue.
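
The two-stage scheme in Figure 1 can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the implementation used in this paper: the toy two-class linear model on scalar inputs, the four-sample dataset, the finite-difference gradient descent routine `train`, the learning rates and step counts, and in particular the mapping `alpha = ELU(z) + 2` (evidence `ELU(z) + 1`, plus one), since the exact ELU reformulation is not spelled out here. The KL regularizer of Equation (1) is omitted for brevity, so stage 2 minimizes only the sum-of-squares term of Equation (2).

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def elu(z):
    # ELU with unit scale (assumed setting).
    return z if z > 0 else math.exp(z) - 1.0

def logits(params, x):
    # Toy two-class linear model on a scalar input: z_k = w_k * x + b_k.
    w0, b0, w1, b1 = params
    return [w0 * x + b0, w1 * x + b1]

def ce_loss(params, data):
    # Stage 1 objective: mean cross-entropy on softmax outputs.
    return -sum(math.log(softmax(logits(params, x))[y]) for x, y in data) / len(data)

def alphas(params, x):
    # Assumed mapping: evidence = ELU(z) + 1 > 0, alpha = evidence + 1 > 1.
    return [elu(z) + 2.0 for z in logits(params, x)]

def edl_loss(params, data):
    # Stage 2 objective: mean of the per-sample loss in Equation (2), KL term omitted.
    total = 0.0
    for x, y in data:
        a = alphas(params, x)
        s = sum(a)
        for j, aj in enumerate(a):
            yj = 1.0 if j == y else 0.0
            total += (yj - aj / s) ** 2 + aj * (s - aj) / (s * s * (s + 1.0))
    return total / len(data)

def train(loss_fn, params, data, lr, steps, eps=1e-5):
    # Plain gradient descent with central-difference gradients.
    params = list(params)
    for _ in range(steps):
        grad = []
        for k in range(len(params)):
            up = params[:k] + [params[k] + eps] + params[k + 1:]
            down = params[:k] + [params[k] - eps] + params[k + 1:]
            grad.append((loss_fn(up, data) - loss_fn(down, data)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # made-up (input, label) pairs
init = [0.0, 0.0, 0.0, 0.0]

# Stage 1: learn a pointwise estimate with cross-entropy loss.
stage1 = train(ce_loss, init, data, lr=0.5, steps=200)
# Stage 2: start from the stage-1 weights and continue with the EDL-style loss.
stage2 = train(edl_loss, stage1, data, lr=0.02, steps=200)

print("cross-entropy:", ce_loss(init, data), "->", ce_loss(stage1, data))
print("EDL loss:", edl_loss(stage1, data), "->", edl_loss(stage2, data))
```

Stage 2 starts from the stage-1 weights instead of a fresh initialization, which is the only structural difference from training with the EDL loss alone.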


uncertainty estimation.

To see this more clearly, in Figure 5 (the orange curve) we summarize the per-epoch ROC AUC obtained in EDL training with different 𝜆, a hyper-parameter controlling how close the Dirichlet prior is to a uniform distribution. As we can see, the AUC of EDL suffers in the beginning under all four settings, and in some cases (for example when 𝜆 = 0.5) there are no signs of improvement at all. In cases where AUC does improve, the final AUC is still significantly lower than that from the proposed method (the green curve). On the other hand, consider evaluating AUC on validation samples with uncertainty lower than a certain threshold: if the learnt uncertainty is of high quality, smaller thresholds should indicate higher confidence, and hence should be associated with higher AUC. However, this is not always the case for EDL, as shown in the first row of Figure 6. Besides, we also observe that when a large 𝜆 is used (for example 𝜆 = 0.75), there is a higher risk of running into the Dying ReLU problem where all outputs are zero, leading to an AUC that is similar to a random guess. All these issues make it risky to apply methods like EDL to real-world applications.

To fix these issues, we first present an analysis of their likely causes in Section 3, and based on this analysis, we further propose TEDL, short for Two-stage Evidential Deep Learning, as a simple but effective training framework to mitigate all the aforementioned issues at once. As we will see in Section 3, the basic idea of TEDL is to transform the difficult uncertainty quantification problem into two sub-problems that are much easier to tackle, i.e., 1) finding a reasonably good pointwise estimation of the categorical distribution, and 2) leveraging this pointwise estimation as an anchor point for estimating the Dirichlet prior of the categorical distribution, based on which we can quantify uncertainty.

The overall training framework of TEDL is illustrated in Figure 1, where two stages are needed: in the first stage, we train our classification model with cross-entropy loss, in order to obtain a model that is able to output reasonable pointwise estimations of the categorical distribution. Then in the second stage, we initialize the model from the weights obtained in the previous stage, and go through the same training corpus by learning with the reformulated EDL loss where ReLU is replaced by ELU. As shown in Section 4, compared with the EDL baseline, TEDL achieves higher AUC across all evaluation settings and effectively avoids the risk of running into the Dying ReLU problem. More importantly, TEDL also shows significantly improved robustness towards training settings, making it more reliable for practical applications.

It is also worth mentioning that we name our proposed method after EDL mainly for the convenience of experimentation, as it was proposed recently and is easy to implement with code open-sourced by the authors. However, our analysis in Section 3 also applies to other single deterministic uncertainty quantification methods suffering from similar issues, and hence the two-stage learning framework we propose in this paper could be readily extended to those methods as well.

2. Related Works

Interest in uncertainty estimation dates back to the days even before the rise of deep learning, yielding a large body of literature on this topic. Based on whether a model ensemble is used and whether the model is stochastic, uncertainty quantification methods can be roughly grouped into three categories: single deterministic methods, Bayesian neural networks and ensemble methods. Please refer to [1] for a comprehensive survey.

Single deterministic methods [4, 5] estimate uncertainty based on one single forward pass within a deterministic network, and can be further split into external approaches [6, 7] and internal approaches [2, 3, 8] depending on whether an additional method is used for deriving the uncertainty estimation. Methods in this category typically have lower requirements on computational resources since neither stochastic networks nor model ensembles are needed, but suffer from sensitivity to initialization and parameters compared with other categories. The proposed TEDL method in this paper, as well as the original EDL method, both fall into this category.

Bayesian neural networks cover all kinds of stochastic DNNs, including methods based on variational inference [9, 10, 11, 12, 13, 14, 15], sampling methods [16, 17, 18, 19], and Laplace approximation [20, 21, 22]. Methods in this category usually have higher computational complexity in both the training and inference phases due to stochastic sampling.

Ensemble methods [23, 24, 25, 26] combine the predictions from several different deterministic networks at inference. Methods in this category typically have higher requirements on both memory and computational resources at the inference phase.

The proposed method also relates to the concept of two-stage learning, which bears similarity to transfer learning but has some subtle differences. Transfer learning generally refers to the procedure that transfers knowledge obtained from different but related source domains to target domains, usually to reduce the training data required on the target domains. [27] gives a comprehensive survey on transfer learning. In contrast, although two-stage learning [28, 29] also consists of two consecutive stages, these two stages are often conducted on the same data. In a typical two-stage learning setting, the second stage is the final stage that yields the desired output, while the first stage serves as a preparation step. Given such differences, the proposed method should be categorized as two-stage learning.

3. Approach

3.1. A Recap on EDL Uncertainty Quantification

The basic idea of the EDL method is to treat the softmax output as the pointwise estimation of the categorical distribution, and to place a Dirichlet prior over the distribution of all possible softmax outputs. Then, following the Dempster-Shafer theory, assuming we have $K$ categories and $\alpha_i = \langle \alpha_{i1}, \ldots, \alpha_{iK} \rangle$ is the parameter of a Dirichlet distribution for the classification of sample $i$, the authors propose to replace softmax with ReLU and represent the Dirichlet parameter as $\alpha_i = f(x_i|\Theta) + 1$, where $\Theta$ represents the network parameters and $f(x_i|\Theta)$ is the ReLU output. The $\alpha_{ij}$ here also represents the subjective opinion collected from sample $i$ and category $j$, and $S_i = \sum_{j=1}^{K} \alpha_{ij}$ is referred to as the Dirichlet strength. Note that $S_i$ is inversely proportional to uncertainty: a larger $S_i$ indicates that more evidence is collected for sample $i$, and hence lower uncertainty.

Based on the above assumptions, the EDL loss is defined as below:

$$\mathcal{L}(\Theta) = \sum_{i=1}^{N} \mathcal{L}_i(\Theta) + \lambda_t \mathcal{L}_{KL} \tag{1}$$

where $\mathcal{L}_i(\Theta)$ is formulated as the expected value of a basic loss and $\mathcal{L}_{KL}$ represents regularization. According to the authors of [2], the EDL method appears relatively more stable when the sum of squares loss is used as the basic loss, as below:

$$\begin{aligned}
\mathcal{L}_i(\Theta) &= \sum_{j=1}^{K} (y_{ij} - \hat{p}_{ij})^2 + \frac{\hat{p}_{ij}(1 - \hat{p}_{ij})}{S_i + 1} \\
&= \sum_{j=1}^{K} \left(y_{ij} - \frac{\alpha_{ij}}{S_i}\right)^2 + \frac{\alpha_{ij}(S_i - \alpha_{ij})}{S_i^2 (S_i + 1)}
\end{aligned} \tag{2}$$

where $y_{ij}$ and $\hat{p}_{ij} = \alpha_{ij} / S_i$ denote the class label and the expected probability for sample $i$ and class $j$, respectively.

Meanwhile, the above loss function is further regularized by minimizing the KL divergence between the estimated Dirichlet distribution $D(p_i|\tilde{\alpha}_i)$ and the uniform distribution, as below:

$$\mathcal{L}_{KL} = \sum_{i=1}^{N} KL\left[ D(p_i|\tilde{\alpha}_i) \,\|\, D(p_i|\langle 1, \ldots, 1 \rangle) \right] \tag{3}$$

The coefficient $\lambda_t$ in Equation (1) is heuristically set to increase with epoch $t$ (zero-based), i.e., $\lambda_t = \min(1.0, t \cdot \lambda)$ where $\lambda = 0.1$. Note that we denote the per-epoch increment as $\lambda$. For brevity, we will treat $\lambda$ rather than $\lambda_t$ as the hyper-parameter henceforth, since $\lambda_t$ is determined only by $\lambda$.

3.2. A Closer Look into the EDL Method

Equation (1) can be split into two parts: the first part is Equation (2), which is designed to estimate the Dirichlet prior, and the second part is the regularization term in Equation (3), derived from KL divergence. Next, we take a closer look at these two parts in turn to understand the cause of sensitivity.

As we mentioned previously, unlike cross-entropy loss, which is designed to learn the pointwise estimation of the categorical distribution as a MAP estimate, the loss function in Equation (2) is derived to learn the parameters of a Dirichlet prior distribution over all the possible predictions. Therefore, the pointwise estimation should also be covered by the Dirichlet prior distribution. This perspective highlights the huge gap
Figure 2: AUC comparison between cross-entropy loss, EDL loss and TEDL loss, evaluated on the same validation data. All three methods are trained on training corpora with 1M, 5M, 50M and 500M samples, respectively. In all the training settings, the EDL method achieves inferior AUC compared to cross-entropy loss, while the proposed TEDL method yields AUC comparable to cross-entropy, outperforming EDL significantly.


in terms of how difficult the optimization problems behind these two loss functions are, especially given that obtaining a good MAP estimation is already a hard problem in many applications. This perspective also highlights the importance of sufficiently large training data, as it would be meaningless to model a distribution without sufficient samples.

Meanwhile, the KL divergence also makes optimization more complicated since it is not Lipschitz smooth. More precisely, a function $f$ is said to be Lipschitz smooth if and only if there exists a finite value $L$ such that

$$\|\nabla f(a) - \nabla f(b)\| \le L \cdot \|a - b\| \tag{4}$$

for all $a$ and $b$. In other words, the gradient of $f$ should exist and its rate of change should be bounded by a finite value $L$. However, the regularization term in Equation (3) does not satisfy this condition since its gradient will go to infinity when $D(p_i|\tilde{\alpha}_i) \to 0$: even though $\tilde{\alpha}_i$ is guaranteed to be positive, $p_i$ may still become very close to zero when a certain $\tilde{\alpha}_i$ is extremely large, leading to very large gradients and hence unstable training.

In summary, internal single deterministic methods are trying to optimize an inherently difficult problem, with potentially ill-conditioned loss functions due to the existence of the KL divergence.

3.3. The Proposed Two-stage Learning Framework

Having analyzed the possible reasons causing training sensitivity, a more important question is how we could fix such issues and make training more stable. At first glance, this appears to be infeasible since we can neither bypass distribution modeling nor drop the terms related to KL divergence in loss functions. In this paper, we propose an alternative approach, which can fix both issues with a simple yet effective strategy: decomposing the original problem into two sub-problems and tackling one at a time, leading to a two-stage learning method as illustrated in Figure 1. Compared with the original EDL method, the only cost introduced by TEDL is a preparation stage learning from the cross-entropy loss; however, as we will see in Section 4, this cost pays off well given the significant AUC increase and greatly improved robustness in training.

So why does such a simple strategy work? On one hand, the first stage in TEDL learns a pointwise estimation of the categorical distribution, which is a much easier problem compared with modeling the entire distribution and requires far fewer training samples. Then in the second stage, since the model is initialized from the weights obtained in stage 1, it amounts to modeling the prior distribution using the pointwise estimation as anchor points, which is much easier than modeling the prior from scratch, if we can assume that the pointwise estimation is close to the expected value of the prior. This assumption should hold easily for most practical applications; otherwise we would not be able to apply internal single deterministic methods at all, since the expected value of the prior distribution would be unlikely to yield meaningful predictions in that case.

On the other hand, by learning from cross-entropy loss, we can effectively avoid assigning extremely small values to $p_i$, given that softmax involves exponential operations and there is no point in pushing model outputs before softmax to extremely large values. That means, when softmax is replaced by ELU later in stage 2, we are unlikely to see extremely large $\tilde{\alpha}_i$ values.

4. Experiments

4.1. Implementation Details

All experiments throughout this paper are conducted on a binary classification task, with the goal of predicting whether a <query, ad> pair is relevant or not. Both the training (1.4M) and validation (100K) samples are sampled from a large-scale commercial search engine, with human-provided relevance labels. In order to examine the impact of the size of the training data, we further create a synthetic training set with soft labels, by sampling a large corpus and running inference with an ensemble of BERT [30] models fine-tuned on the human-labeled training set, similar to what is done in knowledge distillation [31]. This allows us to experiment on a much larger scale, without breaking any assumptions in the EDL method. Without further
Figure 3: ROC AUC vs. uncertainty thresholds with 1M, 5M, 50M and 500M training corpus, respectively, and 𝜆 = 0.1.
The first row is for EDL, while the second row is for TEDL. This figure shows that under a relatively small 𝜆, the quality of
uncertainty learnt by both EDL and TEDL improves as training proceeds.
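
The threshold-based evaluation behind this figure can be sketched in plain Python as follows. The helper names `roc_auc` and `auc_by_threshold` and the six toy samples are hypothetical, and the pairwise AUC computation is a simple quadratic-time stand-in for a library routine; it is meant as an illustration of the protocol, not the evaluation code used in the paper.

```python
def roc_auc(labels, scores):
    # Probability that a random positive is scored above a random negative
    # (ties count as one half). Returns None if one class is missing.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_by_threshold(labels, scores, uncertainties, thresholds):
    # For each threshold, keep only samples whose uncertainty is below it
    # and evaluate ROC AUC on that subset.
    result = {}
    for t in thresholds:
        keep = [i for i, u in enumerate(uncertainties) if u < t]
        result[t] = roc_auc([labels[i] for i in keep], [scores[i] for i in keep])
    return result

# Made-up validation samples: the confident ones are ranked correctly,
# the uncertain ones are not.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.4, 0.6]
uncertainties = [0.05, 0.05, 0.2, 0.2, 0.6, 0.6]

curve = auc_by_threshold(labels, scores, uncertainties, [0.1, 0.3, 1.0])
for threshold, auc in curve.items():
    print(threshold, auc)
```

In this toy data the AUC drops once the high-uncertainty samples are admitted, which is the pattern a well-calibrated uncertainty estimate is expected to produce.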


Figure 4: Uncertainty distribution of EDL (first row) and TEDL (second row), learnt on 1M, 5M, 50M and 500M training samples with 𝜆 = 0.1.
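
As a rough illustration of the quantity whose distribution is plotted here, the sketch below derives a per-sample uncertainty from Dirichlet parameters. It assumes the definition used in [2], where uncertainty is the number of classes divided by the Dirichlet strength; the function names, the bin count and the example values are hypothetical.

```python
def dirichlet_summary(alpha):
    # alpha: Dirichlet parameters for one sample, each assumed to be >= 1.
    num_classes = len(alpha)
    strength = sum(alpha)                      # Dirichlet strength S
    probs = [a / strength for a in alpha]      # expected class probabilities
    uncertainty = num_classes / strength       # assumed definition from [2]
    return probs, strength, uncertainty

def histogram(values, bins=5):
    # Equal-width histogram over [0, 1]; a value of exactly 1.0 goes to the last bin.
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

# Made-up Dirichlet parameters for four binary-classification samples.
samples = [[1.0, 1.0], [9.0, 1.0], [1.0, 3.0], [20.0, 20.0]]
uncertainties = [dirichlet_summary(a)[2] for a in samples]

print(uncertainties)
print(histogram(uncertainties))
```

A sample with no evidence (all parameters equal to one) gets the maximum uncertainty of 1, while large parameters push the uncertainty towards zero even when the two classes are equally likely.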


clarification, we will henceforth refer to this synthetic training set as our training corpus, and experiments will be conducted on subsets sampled from this synthetic training set, with 1M, 5M, 50M and 500M samples, respectively.

In addition, in this paper we use TwinBERT [32, 33] as our deep classification model, which uses two BERT encoders to encode the query and the ad respectively, and then calculates their relevance score by cosine similarity. We choose this model mainly for its simplicity and efficiency, and the conclusions of this paper should hold for other model architectures as well,
Figure 5: Comparison of ROC AUC for EDL and TEDL, learnt on 1M training corpus with different 𝜆. Compared to EDL,
TEDL not only achieves higher ROC AUC, but also shows improved robustness towards 𝜆, especially when 𝜆 = 0.75 where
EDL method runs into the Dying ReLU problem.
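
The Dying ReLU failure mode mentioned in the caption can be illustrated with a small Python sketch. The logit values are made up, and the ELU with unit scale is an assumption: when every pre-activation is negative, ReLU yields zero evidence with zero gradient, whereas ELU keeps a small non-zero gradient.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU; zero on the whole negative half-line.
    return 1.0 if z > 0 else 0.0

def elu(z):
    # ELU with unit scale (an assumed setting).
    return z if z > 0 else math.exp(z) - 1.0

def elu_grad(z):
    # Derivative of ELU; strictly positive everywhere.
    return 1.0 if z > 0 else math.exp(z)

logits = [-3.0, -0.5]  # made-up pre-activations, all negative

relu_evidence = [relu(z) for z in logits]
relu_alpha = [e + 1.0 for e in relu_evidence]   # alpha = evidence + 1
relu_grads = [relu_grad(z) for z in logits]
elu_grads = [elu_grad(z) for z in logits]

print("ReLU evidence:", relu_evidence, "alpha:", relu_alpha, "grads:", relu_grads)
print("ELU grads:", elu_grads)
```

With ReLU the Dirichlet parameters collapse to all ones and no gradient flows back through the activation, so the outputs cannot recover; with ELU the gradient is small but never exactly zero.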


since no particular assumptions about model architectures are made in the proposed TEDL method.

In terms of metrics, since we are working on a binary classification task, we use ROC AUC to evaluate the prediction performance (in our experiments PR AUC shows a very similar trend to ROC AUC). Meanwhile, to measure the quality of uncertainty, we follow the approach in [2]: we first split our validation data using different uncertainty thresholds, and then evaluate ROC AUC on each individual subset. For example, when the threshold is 0.1, ROC AUC is calculated only on validation samples with uncertainty lower than 0.1. Therefore, if uncertainty is properly quantified, we should expect higher ROC AUC at lower thresholds, since this is the subset that our model is more confident about. This way, we can plot a curve of ROC AUC vs. uncertainty thresholds.

4.2. Results and Analysis

4.2.1. Classification Performance evaluated by ROC AUC

Figure 2 summarizes the per-epoch ROC AUC of models learnt by cross-entropy loss, the EDL method and the proposed TEDL method, with 1M, 5M, 50M and 500M training samples respectively. In all these settings, we consistently observe that the ROC AUC from the EDL method is much lower than that from cross-entropy loss, while the proposed TEDL method is able to achieve performance comparable to cross-entropy loss, outperforming EDL significantly.

In addition, if we look into ROC AUC measured on different epochs in Figure 2, we can also see that TEDL is much more stable than EDL, especially when the training corpus is relatively small.

4.2.2. Quality of Uncertainty

As mentioned previously, we measure the quality of the learnt uncertainty by plotting a curve of ROC AUC vs. uncertainty thresholds, as shown in Figure 3, where the first row corresponds to EDL and the second row to TEDL. By comparing plots from different epochs, we can see that the quality of uncertainty learnt by both EDL and TEDL steadily improves over the training process, and the improvement patterns of EDL and TEDL are very similar. However, this only happens when a relatively small 𝜆 is used. Later in Section 4.3 we will see that compared with EDL, TEDL is much more robust towards 𝜆. We also plot the distribution of uncertainty in each training epoch, as shown in Figure 4, where TEDL also looks similar to EDL when 𝜆 is relatively small, but later in Section 4.3 we will see their difference when 𝜆 gets larger.

4.3. Sensitivity towards Hyper-parameters

So far all the results we report are obtained under mild conditions with 𝜆 = 0.1. However, as we mentioned in Section 1, 𝜆 and the number of training epochs may have a dramatic impact on EDL, and hence it is necessary to examine how robust TEDL is towards these two hyper-parameters.

4.3.1. ROC AUC

Figure 5 compares the ROC AUC obtained by the EDL and TEDL methods under different 𝜆 values. Similar to Figure 2, TEDL consistently outperforms EDL, and is more stable when more training epochs are used. In particular, when 𝜆 = 0.75 we observe the Dying ReLU problem in EDL, which inspires us to replace ReLU with ELU in TEDL.

4.3.2. Quality of Uncertainty

Figure 6 and Figure 7 compare the quality of uncertainty learnt by the EDL and TEDL methods under different 𝜆 values. Compared with Figure 3 and Figure 4, the quality of uncertainty learnt by EDL degrades dramatically when a larger 𝜆 is used, as shown in the cases where 𝜆 = 0.25 and 𝜆 = 0.5. By contrast, for TEDL, both its plots of ROC AUC vs. uncertainty and its uncertainty distribution look very similar to what we observed for 𝜆 = 0.1, demonstrating significantly improved robustness towards 𝜆.

5. Conclusion

In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep classification models. TEDL
Figure 6: Comparison of ROC AUC vs. uncertainty for EDL (first row) and TEDL (second row), learnt on 1M training corpus
with different 𝜆, where TEDL shows significantly better robustness.
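
To make the role of 𝜆 concrete, here is a hedged Python sketch of the annealing rule for the KL coefficient described in Section 3.1, together with the KL divergence between a Dirichlet distribution and the uniform Dirichlet that appears in Equation (3). The closed form and the series-based `digamma` helper are standard results rather than code from this paper, and all function names are hypothetical.

```python
import math

def annealed_weight(epoch, lam):
    # lambda_t = min(1.0, t * lambda), with a zero-based epoch index t.
    return min(1.0, epoch * lam)

def digamma(x):
    # Shift x upwards with psi(x) = psi(x + 1) - 1/x, then apply the
    # asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def kl_to_uniform(alpha):
    # KL[ Dir(alpha) || Dir(1, ..., 1) ] for strictly positive alpha.
    k = len(alpha)
    s = sum(alpha)
    log_norm = math.lgamma(s) - math.lgamma(k) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * (digamma(a) - digamma(s)) for a in alpha)

# Larger lambda values reach the full KL weight of 1.0 within fewer epochs.
for lam in (0.1, 0.25, 0.5, 0.75):
    print(lam, [annealed_weight(t, lam) for t in range(5)])

# The penalty is zero for the uniform Dirichlet and grows with the parameters.
for alpha in ([1.0, 1.0], [2.0, 1.0], [5.0, 1.0]):
    print(alpha, kl_to_uniform(alpha))
```

With 𝜆 = 0.75 the regularizer is already at full weight by the third epoch (𝑡 = 2), whereas 𝜆 = 0.1 ramps up gradually, which is consistent with the larger settings being the harsher ones in the experiments.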


Figure 7: Comparison of uncertainty distribution for EDL (first row) and TEDL (second row), learnt on 1M training corpus
with different 𝜆, where TEDL shows significantly better robustness.


contains two stages: the first stage learns from cross-entropy   mercial search engine, which demonstrates that compared with
loss to obtain a good point estimate of the Dirichlet prior      EDL, the proposed TEDL not only achieves higher AUC, but
distribution, and then the second stage learns to quantify un-   also shows improved robustness towards hyper-parameters.
certainty via the reformulated EDL loss. We conduct extensive    As future work, the uncertainty learnt by TEDL may be lever-
experiments using training corpus sampled from a real com-       aged to develop active learning algorithms.
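To make the evaluation summarized above more concrete, the following is a minimal sketch, not the authors' implementation, of how a Dirichlet-based uncertainty and a curve of ROC AUC vs. uncertainty thresholds could be computed. Several points are assumptions: evidence is taken as ELU(logit) + 1 so that it stays positive (the paper's exact reformulated EDL loss is not reproduced here), uncertainty follows the standard EDL definition u = K / S, and each threshold is read as "keep only samples whose uncertainty is at or below the threshold". The helper names `dirichlet_uncertainty`, `roc_auc`, and `auc_vs_uncertainty` are hypothetical.

```python
import numpy as np


def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def dirichlet_uncertainty(logits):
    """Map logits of shape (N, K) to expected probabilities and uncertainty."""
    logits = np.asarray(logits, dtype=float)
    evidence = elu(logits) + 1.0      # assumption: shifted ELU keeps evidence > 0
    alpha = evidence + 1.0            # Dirichlet parameters, as in standard EDL
    strength = alpha.sum(axis=1, keepdims=True)   # Dirichlet strength S
    probs = alpha / strength                      # expected class probabilities
    uncertainty = logits.shape[1] / strength[:, 0]  # u = K / S, in (0, 1)
    return probs, uncertainty


def roc_auc(labels, scores):
    """Binary ROC AUC via the rank-sum statistic, using average ranks for ties."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    n_pos = int((labels == 1).sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")           # AUC is undefined with a single class
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size)
    i = 0
    while i < scores.size:
        j = i
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # average 1-based rank
        i = j + 1
    u_stat = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def auc_vs_uncertainty(labels, pos_scores, uncertainty, thresholds):
    """ROC AUC restricted to samples with uncertainty <= t, for each t."""
    labels = np.asarray(labels)
    pos_scores = np.asarray(pos_scores, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    curve = []
    for t in thresholds:
        keep = uncertainty <= t
        curve.append(roc_auc(labels[keep], pos_scores[keep]))
    return np.array(curve)
```

Under this reading, a well-behaved uncertainty estimate should give higher AUC at low thresholds, since only confident predictions are kept, with AUC approaching the overall value as the threshold grows towards 1.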
References                                                               ume 118, Springer Science & Business Media, 2012.
                                                                    [18] M. Welling, Y. W. Teh, Bayesian learning via stochastic
 [1] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt,            gradient langevin dynamics, in: International Confer-
     J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al.,        ence on Machine Learning, 2011, pp. 681–688.
     A survey of uncertainty in deep neural networks, arXiv         [19] C. Nemeth, P. Fearnhead, Stochastic gradient markov
     preprint arXiv:2107.03342 (2021).                                   chain monte carlo, Journal of the American Statistical
 [2] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learn-           Association 116 (2021) 433–450.
     ing to quantify classification uncertainty, in: Advances       [20] T. Salimans, D. P. Kingma, Weight normalization: A
     in Neural Information Processing Systems, volume 31,                simple reparameterization to accelerate training of deep
     2018, pp. 3183–3193.                                                neural networks, in: Advances in Neural Information
 [3] A. Malinin, M. Gales, Predictive uncertainty estimation             Processing Systems, volume 29, 2016, pp. 901–909.
     via prior networks, in: Advances in Neural Information         [21] J. Lee, M. Humt, J. Feng, R. Triebel, Estimating model
     Processing Systems, volume 31, 2018, pp. 7047–7058.                 uncertainty of neural networks in sparse information
 [4] J. Nandy, W. Hsu, M. L. Lee, Towards maximizing the rep-            form, in: International Conference on Machine Learning,
     resentation gap between in-domain & out-of-distribution             PMLR, 2020, pp. 5702–5713.
     examples, in: Advances in Neural Information Process-          [22] H. Ritter, A. Botev, D. Barber, A scalable laplace approxi-
     ing Systems, volume 33, 2020, pp. 9239–9250.                        mation for neural networks, in: International Conference
 [5] M. Możejko, M. Susik, R. Karczewski, Inhibited softmax              on Learning Representations, volume 6, 2018.
     for uncertainty estimation in neural networks, arXiv           [23] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and
     preprint arXiv:1810.01861 (2018).                                   scalable predictive uncertainty estimation using deep en-
 [6] J. Lee, G. AlRegib, Gradients as a measure of uncertainty           sembles, in: Advances in Neural Information Processing
     in neural networks, in: International Conference on                 Systems, volume 30, 2017, pp. 6404–6416.
     Image Processing, IEEE, 2020, pp. 2416–2420.                   [24] H. Guo, H. Liu, R. Li, C. Wu, Y. Guo, M. Xu, Margin &
 [7] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Klein-             diversity based ordering ensemble pruning, Neurocom-
     berg, S. Mullainathan, J. Kleinberg, Direct uncertainty             puting 275 (2018) 237–246.
     prediction for medical second opinions, in: International      [25] W. G. Martinez, Ensemble pruning via quadratic margin
     Conference on Machine Learning, PMLR, 2019, pp. 5281–               maximization, IEEE Access 9 (2021) 48931–48951.
     5290.                                                          [26] J. Lindqvist, A. Olmin, F. Lindsten, L. Svensson, A general
 [8] T. Ramalho, M. Miranda, Density estimation in represen-             framework for ensemble distribution distillation, in:
     tation space to predict model uncertainty, in: Interna-             International Workshop on Machine Learning for Signal
     tional Workshop on Engineering Dependable and Secure                Processing, IEEE, 2020, pp. 1–6.
     Machine Learning Systems, Springer, 2020, pp. 84–96.           [27] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong,
 [9] G. E. Hinton, D. Van Camp, Keeping the neural net-                  Q. He, A comprehensive survey on transfer learning,
     works simple by minimizing the description length of                Proceedings of the IEEE 109 (2021) 43–76.
     the weights, in: Computational Learning Theory, 1993,          [28] V. Dang, M. Bendersky, W. B. Croft, Two-stage learning
     pp. 5–13.                                                           to rank for information retrieval, in: European Confer-
[10] Y. Gal, Z. Ghahramani, Dropout as a bayesian approxi-               ence on Information Retrieval, Springer, 2013, pp. 423–
     mation: Representing model uncertainty in deep learn-               434.
     ing, in: International Conference on Machine Learning,         [29] F. A. Khan, A. Gumaei, A. Derhab, A. Hussain, A novel
     PMLR, 2016, pp. 1050–1059.                                          two-stage deep learning model for efficient network in-
[11] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra,             trusion detection, IEEE Access 7 (2019) 30373–30385.
     Weight uncertainty in neural network, in: International        [30] J. D. M.-W. C. Kenton, L. K. Toutanova, Bert: Pre-training
     Conference on Machine Learning, PMLR, 2015, pp. 1613–               of deep bidirectional transformers for language under-
     1622.                                                               standing, in: Proceedings of NAACL-HLT, 2019, pp.
[12] A. Graves, Practical variational inference for neural net-          4171–4186.
     works, in: Advances in Neural Information Processing           [31] X. Li, Z. Luo, H. Sun, J. Zhang, W. Han, X. Chu, L. Zhang,
     Systems, volume 24, 2011, pp. 2348–2356.                            Q. Zhang, Learning fast matching models from weak
[13] C. Louizos, K. Ullrich, M. Welling, Bayesian compression            annotations, in: Proceedings of the Web Conference,
     for deep learning, in: Advances in Neural Information               2019, pp. 2985–2991.
     Processing Systems, volume 30, 2017, pp. 3288–3298.            [32] W. Lu, J. Jiao, R. Zhang, Twinbert: Distilling knowledge
[14] D. Rezende, S. Mohamed, Variational inference with nor-             to twin-structured compressed bert models for large-
     malizing flows, in: International Conference on Machine             scale retrieval, in: Proceedings of the ACM International
     Learning, PMLR, 2015, pp. 1530–1538.                                Conference on Information & Knowledge Management,
[15] D. Barber, C. M. Bishop, Ensemble learning in bayesian              2020, pp. 2645–2652.
     neural networks, Nato ASI Series F Computer and Sys-           [33] J. Zhu, Y. Cui, Y. Liu, H. Sun, X. Li, M. Pelger, T. Yang,
     tems Sciences 168 (1998) 215–238.                                   L. Zhang, R. Zhang, H. Zhao, Textgnn: Improving text
[16] R. M. Neal, An improved acceptance procedure for the                encoder via graph neural network in sponsored search,
     hybrid monte carlo algorithm, Journal of Computational              in: Proceedings of the Web Conference, 2021, pp. 2848–
     Physics 111 (1994) 194–203.                                         2857.
[17] R. M. Neal, Bayesian learning for neural networks, vol-