<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TEDL: A Two-stage Evidential Deep Learning Method for Classification Uncertainty Quantification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xue Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Shen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Charles</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>1045 La Avenida St, Mountain View</institution>
          ,
          <addr-line>CA, 94043</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>555 110th Ave NE</institution>
          ,
          <addr-line>Bellevue, WA, 98004</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep learning models in classification tasks, inspired by our findings in experimenting with the Evidential Deep Learning (EDL) method, a recently proposed uncertainty quantification approach based on the Dempster-Shafer theory. More specifically, we observe that EDL tends to yield inferior AUC compared with models learnt by cross-entropy loss, and is highly sensitive in training. Such sensitivity is likely to cause unreliable uncertainty estimation, making it risky for practical applications. To mitigate both limitations, we propose a simple yet effective two-stage learning approach based on our analysis of the likely reasons causing such sensitivity, with the first stage learning from cross-entropy loss, followed by a second stage learning from EDL loss. We also re-formulate the EDL loss by replacing ReLU with ELU to avoid the Dying ReLU issue. Extensive experiments are carried out on training corpora of varied sizes collected from a large-scale commercial search engine, demonstrating that the proposed two-stage learning framework can increase AUC significantly and greatly improve training robustness.</p>
      </abstract>
      <kwd-group>
<kwd>uncertainty quantification</kwd>
        <kwd>search ads recommendation</kwd>
        <kwd>deep learning</kwd>
        <kwd>classification</kwd>
        <kwd>BERT</kwd>
        <kwd>TwinBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. * Corresponding author. Email: xeli@microsoft.com (X. Li); sashen@microsoft.com (W. Shen); cdx@microsoft.com (D. Charles).</p>
<p>In particular, EDL is highly sensitive to the number of training epochs and some hyper-parameters, where improper settings may lead to significantly degraded AUC and unreliable uncertainty estimation.</p>
<p>To see this more clearly, in Figure 5 (the orange curve) we summarize the per-epoch ROC AUC obtained in EDL training with different values of δ, a hyper-parameter controlling how close the Dirichlet prior is to a uniform distribution. As we can see, the AUC of EDL suffers in the beginning under all four settings, and in some cases (for example, when δ = 0.5) there is no sign of improvement at all. In cases where AUC does improve, the final AUC is still significantly lower than that of the proposed method (the green curve). On the other hand, consider evaluating AUC on validation samples with uncertainty lower than a certain threshold: if the learnt uncertainty is of high quality, smaller thresholds should indicate higher confidence, and hence should be associated with higher AUC. However, this is not always the case for EDL, as shown in the first row of Figure 6. Besides, we also observe that when a large δ is used (for example, δ = 0.75), there is a higher risk of running into the Dying ReLU problem, where all outputs are zero, leading to an AUC similar to a random guess. All these issues make it risky to apply methods like EDL in real-world applications.</p>
      <p>To fix these issues, we first present an analysis of the likely reasons causing the above issues in Section 3, and based on this analysis, we further propose TEDL, short for Two-stage Evidential Deep Learning, a simple but effective training framework that mitigates all the aforementioned issues in a single shot. As we will see in Section 3, the basic idea of TEDL is to transform the difficult uncertainty quantification problem into two sub-problems that are much easier to tackle, i.e., 1) finding a reasonably good pointwise estimation of the categorical distribution, and 2) leveraging this pointwise estimation as an anchor point for estimating the Dirichlet prior of the categorical distribution, based on which we can quantify uncertainty.</p>
      <p>The overall training framework of TEDL is illustrated in Figure 1, where two stages are needed: in the first stage, we train our classification model with cross-entropy loss, in order to obtain a model that is able to output reasonable pointwise estimations of the categorical distribution. Then, in the second stage, we initialize the model from the weights obtained in the previous stage, and go through the same training corpus learning with the reformulated EDL loss, where ReLU is replaced by ELU. As shown in Section 4, compared with the EDL baseline, TEDL achieves higher AUC across all evaluation settings and effectively avoids the risk of running into the Dying ReLU problem. More importantly, TEDL also shows significantly improved robustness towards training settings, making it more reliable for practical applications.</p>
      <p>It is also worth mentioning that we name our proposed method after EDL mainly for the convenience of experimentation, as EDL was proposed recently and is easy to implement with the code open-sourced by the authors. However, our analysis in Section 3 also applies to other single deterministic uncertainty quantification methods suffering from similar issues, and hence the two-stage learning framework we propose in this paper could be readily extended to those methods as well.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related Works</title>
      <p>The interest in uncertainty estimation dates back to the days even before the rise of deep learning, entailing a large body of literature on this topic. Based on whether model ensembles are used and whether the model is stochastic, uncertainty quantification methods can be roughly grouped into three categories: single deterministic methods, Bayesian neural networks, and ensemble methods. Please refer to [1] for a comprehensive survey.</p>
      <p>Single deterministic methods [4, 5] estimate uncertainty based on one single forward pass within a deterministic network, and can be further split into external approaches [6, 7] and internal approaches [2, 3, 8], depending on whether an additional method is used for deriving the uncertainty estimation. Methods in this category typically have lower requirements on computational resources, since neither stochastic networks nor model ensembles are needed, but suffer from sensitivity to initialization and parameters compared with the other categories. The proposed TEDL method, as well as the original EDL method, both fall into this category.</p>
      <p>Bayesian neural networks cover all kinds of stochastic DNNs, including methods based on variational inference [9, 10, 11, 12, 13, 14, 15], sampling methods [16, 17, 18, 19], and Laplace approximation [20, 21, 22]. Methods in this category usually have higher computational complexity in both the training and inference phases due to stochastic sampling.</p>
      <p>Ensemble methods [23, 24, 25, 26] combine the predictions from several different deterministic networks at inference. Methods in this category typically have higher requirements on both memory and computational resources at the inference phase.</p>
      <p>The proposed method also relates to the concept of two-stage learning, which bears similarity to transfer learning but has some subtle differences. Transfer learning generally refers to the procedure that transfers knowledge obtained from different but related source domains to target domains, usually to reduce the training data required on the target domains; [27] gives a comprehensive survey on transfer learning. In contrast, two-stage learning [28, 29], although it also consists of two consecutive stages, is often conducted on the same data. In a typical two-stage learning setting, the second stage is the final stage that yields the desired output, while the first stage serves as a preparation step. Given such differences, the proposed method should be categorized as two-stage learning.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Approach</title>
      <sec id="sec-2-1">
        <title>3.1. A Recap on EDL Uncertainty Quantification</title>
        <p>The basic idea of the EDL method is to treat the softmax output as a pointwise estimation of the categorical distribution, and to place a Dirichlet prior over the distribution of all possible softmax outputs. Following the Dempster-Shafer theory, assume we have K categories and let α_i = ⟨α_i1, …, α_iK⟩ be the parameter of the Dirichlet distribution for the classification of sample i. The authors of [2] propose to replace softmax with ReLU and represent the Dirichlet parameter as α_i = f(x_i | Θ) + 1, where Θ represents the network parameters and f(x_i | Θ) is the ReLU output. The evidence e_ij = f(x_i | Θ)_j represents the subjective opinion collected from sample i for category j, and S_i = ∑_{j=1}^{K} α_ij is referred to as the Dirichlet strength. Note that S_i is inversely proportional to uncertainty: a larger S_i indicates that more evidence is collected for sample i, and hence lower uncertainty.</p>
        <p>Based on the above assumptions, the EDL loss is defined as below: ℒ(Θ) = ∑_{i=1}^{N} ℒ_i(Θ) + λ_t ℒ_reg (1), where ℒ_i(Θ) is formulated as the expected value of a basic loss and ℒ_reg represents regularization. According to the authors of [2], the EDL method appears relatively more stable when the sum of squares loss is used as the basic loss: ℒ_i(Θ) = ∑_{j=1}^{K} (y_ij − p̂_ij)² + p̂_ij(1 − p̂_ij)/(S_i + 1) = ∑_{j=1}^{K} (y_ij − α_ij/S_i)² + α_ij(S_i − α_ij)/(S_i²(S_i + 1)) (2), where y_ij and p̂_ij = α_ij/S_i denote the class label and the expected probability for sample i and class j, respectively.</p>
        <p>Meanwhile, the above loss function is further regularized by minimizing the KL divergence between the estimated Dirichlet distribution D(p_i | α̃_i) and the uniform distribution: ℒ_reg = ∑_{i=1}^{N} KL[D(p_i | α̃_i) ‖ D(p_i | ⟨1, …, 1⟩)] (3), where α̃_i = y_i + (1 − y_i) ⊙ α_i is the Dirichlet parameter after the evidence of the ground-truth class is removed. The coefficient λ_t in Equation (1) is heuristically set to increase with the (zero-based) epoch t, i.e., λ_t = min(1.0, t · δ) with δ = 0.1, where δ denotes the per-epoch increment. For brevity, we will treat δ rather than λ_t as the hyper-parameter henceforth, since λ_t is determined only by δ.</p>
      </sec>
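<p>To make Equations (1)-(3) concrete, below is a minimal numpy sketch of the EDL loss (an illustration under our reading of [2], not the authors' released implementation); it assumes one-hot labels and uses the standard closed form of the KL divergence between a Dirichlet and the uniform Dirichlet:</p>
<preformat>
```python
import numpy as np
from scipy.special import gammaln, digamma

def edl_loss(evidence, y, lam):
    """EDL loss of Eq. (1): the sum-of-squares term of Eq. (2) plus the
    lam-weighted KL regularizer of Eq. (3).
    evidence: (N, K) non-negative outputs f(x|Theta); y: one-hot labels."""
    alpha = evidence + 1.0                    # Dirichlet parameters
    S = alpha.sum(axis=1, keepdims=True)      # Dirichlet strengths S_i
    p_hat = alpha / S                         # expected probabilities
    # Eq. (2): squared error plus the variance of the Dirichlet mean
    sq = ((y - p_hat) ** 2).sum(axis=1)
    var = (p_hat * (1.0 - p_hat) / (S + 1.0)).sum(axis=1)
    # Eq. (3): KL to the uniform Dirichlet, computed on alpha_tilde,
    # i.e. alpha with the ground-truth evidence removed
    at = y + (1.0 - y) * alpha
    St = at.sum(axis=1)
    kl = (gammaln(St) - gammaln(at).sum(axis=1) - gammaln(float(at.shape[1]))
          + ((at - 1.0) * (digamma(at) - digamma(St[:, None]))).sum(axis=1))
    return (sq + var).sum() + lam * kl.sum()
```
</preformat>
<p>With zero evidence, α̃_i is all ones, so the KL term vanishes and λ_t has no effect; the regularizer only penalizes evidence assigned to wrong classes.</p>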
      <sec id="sec-2-3">
        <title>3.2. A Closer Look into the EDL Method</title>
        <p>Equation (1) can be split into two parts: the first part is Equation (2), which is designed to estimate the Dirichlet prior, and the second part is the regularization term in Equation (3), derived from the KL divergence. Next, we take a closer look at these two parts respectively to understand the cause of sensitivity.</p>
        <p>As we mentioned previously, unlike cross-entropy loss, which is designed to learn the pointwise estimation of the categorical distribution as a MAP estimate, the loss function in Equation (2) is derived to learn the parameter of a Dirichlet prior distribution over all the possible predictions. Therefore, the pointwise estimation should also be covered by the Dirichlet prior distribution. This perspective highlights the huge gap in terms of how difficult the optimization problems behind these two loss functions are, especially given that obtaining a good MAP estimation is already a hard problem in many applications. This perspective also highlights the importance of sufficiently large training data, as it would be meaningless to model a distribution without sufficient samples.</p>
        <p>In the meanwhile, the KL divergence also makes optimization more complicated, since it is not Lipschitz smooth. More precisely, a function f is said to be Lipschitz smooth if and only if there exists a finite value L such that ‖∇f(x₁) − ∇f(x₂)‖ &lt; L · ‖x₁ − x₂‖ (4). In other words, the gradient of f should exist and its variation should be bounded by a finite value L. However, the regularization term in Equation (3) does not satisfy this condition, since its gradient goes to infinity when D(p_i | α̃_i) → 0: even though α̃_ij is guaranteed to be positive, the density may still become very close to zero when a certain α̃_ij is extremely large, leading to very large gradients and hence unstable training.</p>
        <p>In summary, internal single deterministic methods are trying to optimize an inherently difficult problem, with potentially ill-conditioned loss functions due to the existence of the KL divergence.</p>
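<p>Throughout the paper, the uncertainty used for thresholding derives from the Dirichlet strength discussed in Section 3.1; in the subjective-logic view adopted by EDL [2], the uncertainty mass is u_i = K / S_i. A small sketch:</p>
<preformat>
```python
import numpy as np

def uncertainty_mass(evidence):
    # Subjective-logic uncertainty used by EDL [2]: u = K / S, where
    # S = sum_j alpha_j is the Dirichlet strength; more collected
    # evidence means a larger S and hence lower uncertainty
    alpha = np.asarray(evidence, dtype=float) + 1.0
    return alpha.shape[-1] / alpha.sum(axis=-1)

print(uncertainty_mass([0.0, 0.0]))   # no evidence: u = 1.0 (maximal)
print(uncertainty_mass([8.0, 0.0]))   # strong evidence: u = 0.2
```
</preformat>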
      </sec>
      <sec id="sec-2-4">
        <title>3.3. The Proposed Two-stage Learning Framework</title>
        <p>Having analyzed the possible reasons causing training sensitivity, a more important question is how we could fix such issues and make training more stable. At first glance, this appears to be infeasible, since we can neither bypass distribution modeling nor drop the terms related to KL divergence in the loss functions. In this paper, we propose an alternative approach, which can fix both issues with a simple yet effective strategy: decomposing the original problem into two sub-problems and tackling one at a time, leading to a two-stage learning method as illustrated in Figure 1. Compared with the original EDL method, the only cost introduced by TEDL is a preparation stage learning from the cross-entropy loss; however, as we will see in Section 4, such cost is well paid off given the significant AUC increase and greatly improved robustness in training.</p>
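<p>The staging itself can be sketched as follows; this is purely structural, with train_epoch standing in for one pass over the training corpus (the function and its arguments are illustrative placeholders, not the authors' code):</p>
<preformat>
```python
def train_epoch(model, loss, lam=None):
    # placeholder for one real optimization pass; here we only record
    # which loss (and annealing coefficient) each epoch would use
    model["log"].append((loss, lam))

def tedl_train(model, epochs_per_stage, delta=0.1):
    # Stage 1: cross-entropy, to obtain a good pointwise estimation
    for _ in range(epochs_per_stage):
        train_epoch(model, loss="cross_entropy")
    # Stage 2: same corpus, weights carried over from stage 1, and the
    # reformulated EDL loss (ELU evidence) with annealed KL coefficient
    for t in range(epochs_per_stage):
        train_epoch(model, loss="edl_elu", lam=min(1.0, t * delta))
    return model

model = tedl_train({"log": []}, epochs_per_stage=3)
```
</preformat>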
        <p>So why does such a simple strategy work? On one hand, the first stage in TEDL learns a pointwise estimation of the categorical distribution, which is a much easier problem compared with modeling the entire distribution and requires much fewer training samples. Then, in the second stage, since the model is initialized from the weights obtained in stage 1, it amounts to modeling the prior distribution using the pointwise estimation as an anchor point, which is much easier than modeling the prior from scratch, provided that the pointwise estimation is close to the expected value of the prior. This assumption should easily hold for most practical applications; otherwise, we would not be able to apply internal single deterministic methods at all, since the expected value of the prior distribution would be unlikely to yield meaningful predictions in that case.</p>
        <p>On the other hand, by learning from cross-entropy loss first, we can effectively avoid assigning extremely small values to D(p_i | α̃_i): softmax involves exponential operations, so there is no point in pushing the model outputs before softmax to extremely large values. That means, when softmax is replaced by ELU later in stage 2, we are unlikely to see extremely large α̃_ij values.</p>
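<p>The contrast between the two activation choices can be seen in a few lines; note that keeping the Dirichlet parameter positive via α = ELU(z) + 1 is our reading of the reformulation, not a detail spelled out above:</p>
<preformat>
```python
import numpy as np

def relu_evidence(z):
    # original EDL: negative logits give exactly zero evidence and zero
    # gradient, which is the Dying ReLU failure mode
    return np.maximum(z, 0.0)

def elu_evidence(z):
    # TEDL reformulation: ELU keeps a non-zero gradient for negative
    # logits, so units can recover; since ELU(z) stays above -1, the
    # Dirichlet parameter alpha = elu_evidence(z) + 1 remains positive
    return np.where(np.greater(z, 0.0), z, np.exp(z) - 1.0)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu_evidence(z))   # the first three entries collapse to 0.0
print(elu_evidence(z))    # negative logits keep graded values above -1
```
</preformat>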
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Implementation Details</title>
        <p>All experiments throughout this paper are conducted on a binary classification task, with the goal of predicting whether a &lt;query, ad&gt; pair is relevant or not. Both the training (1.4M) and validation samples (100K) are sampled from a large-scale commercial search engine, with human-provided relevance labels. In order to examine the impact of the size of the training data, we further create a synthetic training set with soft labels, by sampling a large corpus and running inference with an ensemble of BERT [30] models fine-tuned on the human-labeled training set, similar to what we do in knowledge distillation [31].</p>
        <p>This allows us to experiment at a much larger scale without breaking any assumptions in the EDL method. Without further clarification, we will henceforth refer to this synthetic training set as our training corpus, and experiments will be conducted on subsets sampled from it, with 1M, 5M, 50M and 500M samples, respectively.</p>
        <p>In addition, in this paper we use TwinBERT [32, 33] as our deep classification model, which uses two BERT encoders to encode the query and the ad respectively, and then calculates their relevance score by cosine similarity. We choose this model mainly for its simplicity and efficiency; the conclusions of this paper should hold for other model architectures as well, since no particular assumptions about model architectures are made in the proposed TEDL method.</p>
        <p>In terms of metrics, since we are working on a binary classification task, we use ROC AUC to evaluate the prediction performance (in our experiments, PR AUC shows a very similar trend to ROC AUC). Meanwhile, to measure the quality of uncertainty, we follow the approach in [2]: we first split our validation data using different uncertainty thresholds, and then evaluate ROC AUC on each individual subset. For example, when the threshold is 0.1, ROC AUC is calculated only on validation samples with uncertainty lower than 0.1. Therefore, if uncertainty is properly quantified, we should expect higher ROC AUC at lower thresholds, since these are the subsets that our model is more confident about. This way, we can plot a curve of ROC AUC vs. uncertainty thresholds.</p>
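<p>The threshold-sweep protocol described above can be sketched as follows (a self-contained illustration using a simple rank-based AUC, not our evaluation code):</p>
<preformat>
```python
import numpy as np

def roc_auc(labels, scores):
    # rank-based ROC AUC: probability that a random positive is scored
    # above a random negative (no tie handling, for illustration only)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def auc_below_thresholds(labels, scores, unc, thresholds):
    # protocol of [2]: restrict evaluation to samples whose uncertainty
    # is below each threshold; well-quantified uncertainty should give
    # higher AUC at lower thresholds
    result = {}
    for t in thresholds:
        keep = np.less(unc, t)
        result[t] = roc_auc(labels[keep], scores[keep])
    return result
```
</preformat>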
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Results and Analysis</title>
        <sec id="sec-3-2-1">
          <title>4.2.1. Classification Performance evaluated by ROC AUC</title>
          <p>ROC AUC is steadily improved over the training process, and the improving patterns for EDL and TEDL are very similar. However, this only happens when a relatively small δ is used; later, in Section 4.3, we will see that compared with EDL, TEDL is much more robust towards δ. We also plot the distribution of uncertainty in each training epoch, as shown in Figure 4, where TEDL also looks similar to EDL when δ is relatively small, but later in Section 4.3 we will see their difference when δ gets larger.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Sensitivity towards Hyper-parameters</title>
        <p>So far, all the results we report are obtained under mild conditions with δ = 0.1; however, as we mentioned in Section 1, δ and the number of training epochs may have a dramatic impact on EDL, and hence it is necessary to examine how robust TEDL is towards these two hyper-parameters.</p>
        <sec id="sec-3-3-1">
          <title>4.3.1. ROC AUC</title>
        </sec>
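<p>For reference, the annealing schedule from Section 3.1 responds to δ as below; a large δ switches the KL regularizer to full strength within the first couple of epochs, which is exactly the regime where EDL becomes fragile:</p>
<preformat>
```python
def annealing_coefficient(epoch, delta):
    # lambda_t = min(1.0, t * delta): the KL weight grows by delta per
    # (zero-based) epoch and saturates at 1.0
    return min(1.0, epoch * delta)

# a mild delta = 0.1 ramps up slowly; delta = 0.5 saturates by epoch 2
print([annealing_coefficient(t, 0.1) for t in range(6)])
print([annealing_coefficient(t, 0.5) for t in range(6)])
```
</preformat>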
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep classification models. TEDL contains two stages: the first stage learns from cross-entropy loss to obtain a good point estimate of the Dirichlet prior distribution, and then the second stage learns to quantify uncertainty via the reformulated EDL loss. We conduct extensive experiments using training corpora sampled from a real commercial search engine, which demonstrate that, compared with EDL, the proposed TEDL not only achieves higher AUC, but also shows improved robustness towards hyper-parameters. As future work, the uncertainty learnt by TEDL may be leveraged to develop active learning algorithms.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., A survey of uncertainty in deep neural networks, arXiv preprint arXiv:2107.03342 (2021).</p>
      <p>[2] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, in: Advances in Neural Information Processing Systems, volume 31, 2018, pp. 3183–3193.</p>
      <p>[3] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, in: Advances in Neural Information Processing Systems, volume 31, 2018, pp. 7047–7058.</p>
      <p>[4] J. Nandy, W. Hsu, M. L. Lee, Towards maximizing the representation gap between in-domain &amp; out-of-distribution examples, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9239–9250.</p>
      <p>[5] M. Możejko, M. Susik, R. Karczewski, Inhibited softmax for uncertainty estimation in neural networks, arXiv preprint arXiv:1810.01861 (2018).</p>
      <p>[6] J. Lee, G. AlRegib, Gradients as a measure of uncertainty in neural networks, in: International Conference on Image Processing, IEEE, 2020, pp. 2416–2420.</p>
      <p>[7] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, J. Kleinberg, Direct uncertainty prediction for medical second opinions, in: International Conference on Machine Learning, PMLR, 2019, pp. 5281–5290.</p>
      <p>[8] T. Ramalho, M. Miranda, Density estimation in representation space to predict model uncertainty, in: International Workshop on Engineering Dependable and Secure Machine Learning Systems, Springer, 2020, pp. 84–96.</p>
      <p>[9] G. E. Hinton, D. Van Camp, Keeping the neural networks simple by minimizing the description length of the weights, in: Computational Learning Theory, 1993, pp. 5–13.</p>
      <p>[10] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.</p>
      <p>[11] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural network, in: International Conference on Machine Learning, PMLR, 2015, pp. 1613–1622.</p>
      <p>[12] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, volume 24, 2011, pp. 2348–2356.</p>
      <p>[13] C. Louizos, K. Ullrich, M. Welling, Bayesian compression for deep learning, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 3288–3298.</p>
      <p>[14] D. Rezende, S. Mohamed, Variational inference with normalizing flows, in: International Conference on Machine Learning, PMLR, 2015, pp. 1530–1538.</p>
      <p>[15] D. Barber, C. M. Bishop, Ensemble learning in bayesian neural networks, Nato ASI Series F Computer and Systems Sciences 168 (1998) 215–238.</p>
      <p>[16] R. M. Neal, An improved acceptance procedure for the hybrid monte carlo algorithm, Journal of Computational Physics 111 (1994) 194–203.</p>
      <p>[17] R. M. Neal, Bayesian learning for neural networks, volume 118, Springer Science &amp; Business Media, 2012.</p>
      <p>[18] M. Welling, Y. W. Teh, Bayesian learning via stochastic gradient langevin dynamics, in: International Conference on Machine Learning, 2011, pp. 681–688.</p>
      <p>[19] C. Nemeth, P. Fearnhead, Stochastic gradient markov chain monte carlo, Journal of the American Statistical Association 116 (2021) 433–450.</p>
      <p>[20] T. Salimans, D. P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, in: Advances in Neural Information Processing Systems, volume 29, 2016, pp. 901–909.</p>
      <p>[21] J. Lee, M. Humt, J. Feng, R. Triebel, Estimating model uncertainty of neural networks in sparse information form, in: International Conference on Machine Learning, PMLR, 2020, pp. 5702–5713.</p>
      <p>[22] H. Ritter, A. Botev, D. Barber, A scalable laplace approximation for neural networks, in: International Conference on Learning Representations, volume 6, 2018.</p>
      <p>[23] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 6404–6416.</p>
      <p>[24] H. Guo, H. Liu, R. Li, C. Wu, Y. Guo, M. Xu, Margin &amp; diversity based ordering ensemble pruning, Neurocomputing 275 (2018) 237–246.</p>
      <p>[25] W. G. Martinez, Ensemble pruning via quadratic margin maximization, IEEE Access 9 (2021) 48931–48951.</p>
      <p>[26] J. Lindqvist, A. Olmin, F. Lindsten, L. Svensson, A general framework for ensemble distribution distillation, in: International Workshop on Machine Learning for Signal Processing, IEEE, 2020, pp. 1–6.</p>
      <p>[27] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (2021) 43–76.</p>
      <p>[28] V. Dang, M. Bendersky, W. B. Croft, Two-stage learning to rank for information retrieval, in: European Conference on Information Retrieval, Springer, 2013, pp. 423–434.</p>
      <p>[29] F. A. Khan, A. Gumaei, A. Derhab, A. Hussain, A novel two-stage deep learning model for efficient network intrusion detection, IEEE Access 7 (2019) 30373–30385.</p>
      <p>[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.</p>
      <p>[31] X. Li, Z. Luo, H. Sun, J. Zhang, W. Han, X. Chu, L. Zhang, Q. Zhang, Learning fast matching models from weak annotations, in: Proceedings of the Web Conference, 2019, pp. 2985–2991.</p>
      <p>[32] W. Lu, J. Jiao, R. Zhang, Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval, in: Proceedings of the ACM International Conference on Information &amp; Knowledge Management, 2020, pp. 2645–2652.</p>
      <p>[33] J. Zhu, Y. Cui, Y. Liu, H. Sun, X. Li, M. Pelger, T. Yang, L. Zhang, R. Zhang, H. Zhao, Textgnn: Improving text encoder via graph neural network in sponsored search, in: Proceedings of the Web Conference, 2021, pp. 2848–2857.</p>
  </body>
  <back>
    <ref-list />
  </back>
</article>