TEDL: A Two-stage Evidential Deep Learning Method for Classification Uncertainty Quantification

Xue Li1,*, Wei Shen2 and Denis Charles2

1 1045 La Avenida St, Mountain View, CA, 94043, United States
2 555 110th Ave NE, Bellevue, WA, 98004, United States

* Corresponding author.
xeli@microsoft.com (X. Li); sashen@microsoft.com (W. Shen); cdx@microsoft.com (D. Charles)

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep learning models in classification tasks, inspired by our findings in experimenting with the Evidential Deep Learning (EDL) method, a recently proposed uncertainty quantification approach based on the Dempster-Shafer theory. More specifically, we observe that EDL tends to yield inferior AUC compared with models learnt by cross-entropy loss and is highly sensitive in training. Such sensitivity is likely to cause unreliable uncertainty estimation, making it risky for practical applications. To mitigate both limitations, we propose a simple yet effective two-stage learning approach based on our analysis of the likely reasons causing such sensitivity, with the first stage learning from cross-entropy loss, followed by a second stage learning from EDL loss. We also re-formulate the EDL loss by replacing ReLU with ELU to avoid the Dying ReLU issue. Extensive experiments are carried out on training corpora of varied sizes collected from a large-scale commercial search engine, demonstrating that the proposed two-stage learning framework can increase AUC significantly and greatly improve training robustness.

Keywords

uncertainty quantification, search ads recommendation, deep learning, classification, BERT, TwinBERT

1. Introduction

Uncertainty quantification of deep learning models has been a hot topic in the community ever since the rise of deep learning, and the demand for effective uncertainty quantification methods has become increasingly urgent over the past decade as deep learning continues to reshape many industries. Search recommendation, perhaps the most radically reshaped industry, often relies on many different deep learning models to give accurate recommendations. This makes uncertainty quantification especially important, since unreliable predictions can accumulate in the system and finally lead to inaccurate or even embarrassing recommendation results.

To make machine learning models aware of their own prediction confidence, many uncertainty quantification approaches have been proposed [1], including single deterministic methods, Bayesian methods and ensemble methods. Single deterministic methods can be further grouped into internal or external methods, depending on whether additional components are required for uncertainty estimation. We present a brief review of this topic in Section 2. In this paper, we are particularly interested in single deterministic methods, especially internal approaches, since such methods typically need only a single forward pass on a deterministic network to estimate uncertainty and hence do not require stochastic DNNs or ensemble models, making both training and inference more efficient.

More specifically, instead of treating the model outputs as a pointwise maximum-a-posteriori (MAP) estimate, internal single deterministic methods usually interpret model outputs as parameters of a prior distribution over all possible predictions, and then make a prediction by taking the expected value over the prior distribution. For classification tasks, the Dirichlet distribution is often chosen as the prior since it is the conjugate prior of the categorical distribution. Meanwhile, statistical distance metrics such as the Kullback-Leibler (KL) divergence are often included in their loss functions, due to the need to optimize over parameters of distributions [2, 3].

However, the efficiency of such methods comes at a cost. As mentioned in [1], they are typically more sensitive to training settings such as initialization, hyper-parameters and training data, which is what we observed when applying EDL [2], a recently proposed single deterministic method, to practical scenarios.

To be more specific, in our experiments we identify several issues in the EDL method. Firstly, as shown in Figure 2, when applied to binary classification tasks, the ROC AUC achieved by the EDL method is significantly lower than that obtained by cross-entropy loss, and this gap cannot be bridged by simply adding more training samples. Secondly, EDL tends to be sensitive to initialization and some hyper-parameters, where improper settings may lead to significantly degraded AUC and unreliable uncertainty estimation.

To see this more clearly, in Figure 5 (the orange curve) we summarize the per-epoch ROC AUC obtained in EDL training with different $\lambda$, a hyper-parameter controlling how close the Dirichlet prior is to a uniform distribution. As we can see, the AUC of EDL suffers in the beginning under all four settings, and in some cases (for example when $\lambda = 0.5$) there are no signs of improvement at all. In cases where AUC does improve, the final AUC is still significantly lower than that of the proposed method (the green curve). On the other hand, consider evaluating AUC on validation samples with uncertainty lower than a certain threshold: if the learnt uncertainty is of high quality, smaller thresholds should indicate higher confidence, and hence should be associated with higher AUC. However, this is not always the case for EDL, as shown in the first row of Figure 6. Besides, we also observe that when a large $\lambda$ is used (for example $\lambda = 0.75$), there is a higher risk of running into the Dying ReLU problem where all outputs are zero, leading to an AUC similar to that of a random guess. All these issues make it risky to apply methods like EDL in real-world applications.

To fix these issues, we first present in Section 3 an analysis of their likely causes, and based on this analysis we propose TEDL, short for Two-stage Evidential Deep Learning, as a simple but effective training framework to mitigate all the aforementioned issues in a single shot. As we will see in Section 3, the basic idea of TEDL is to transform the difficult uncertainty quantification problem into two sub-problems that are much easier to tackle: 1) finding a reasonably good pointwise estimation of the categorical distribution, and 2) leveraging this pointwise estimation as an anchor point for estimating the Dirichlet prior of the categorical distribution, based on which we can quantify uncertainty.

The overall training framework of TEDL is illustrated in Figure 1, where two stages are needed. In the first stage, we train our classification model with cross-entropy loss, in order to obtain a model that is able to output reasonable pointwise estimations of the categorical distribution. In the second stage, we initialize the model from the weights obtained in the previous stage, and go through the same training corpus by learning with the reformulated EDL loss where ReLU is replaced by ELU. As shown in Section 4, compared with the EDL baseline, TEDL achieves higher AUC across all evaluation settings and effectively avoids the risk of running into the Dying ReLU problem. More importantly, TEDL also shows significantly improved robustness to training settings, making it more reliable for practical applications.

It is also worth mentioning that we name our proposed method after EDL mainly for convenience of experimentation, as EDL was proposed recently and is easy to implement with the code open-sourced by its authors. However, our analysis in Section 3 also applies to other single deterministic uncertainty quantification methods suffering from similar issues, and hence the two-stage learning framework we propose in this paper could be readily extended to those methods as well.

Figure 1: A schematic illustration of the proposed TEDL method. (a) The original EDL method transforms the model outputs to non-negative values using ReLU activation and learns to quantify uncertainty via the EDL loss in Equation (1), which yields inferior AUC and is sensitive to training. (b) The proposed TEDL method employs a two-stage learning strategy to decompose the original problem into two easier sub-problems and tackle one at a time: the first stage learns to make good pointwise estimations via cross-entropy loss, and the second stage learns to quantify uncertainty using the pointwise estimation as anchor points, with ReLU replaced by ELU to avoid the Dying ReLU issue.

2. Related Works

Interest in uncertainty estimation dates back to the days before the rise of deep learning, resulting in a large body of literature on this topic.
Based on whether a model ensemble is used and whether the model is stochastic, uncertainty quantification methods can be roughly grouped into three categories: single deterministic methods, Bayesian neural networks and ensemble methods. Please refer to [1] for a comprehensive survey.

Single deterministic methods [4, 5] estimate uncertainty based on a single forward pass within a deterministic network, and can be further split into external approaches [6, 7] and internal approaches [2, 3, 8], depending on whether an additional method is used for deriving the uncertainty estimation. Methods in this category typically have lower requirements on computational resources since neither stochastic networks nor model ensembles are needed, but suffer from sensitivity to initialization and parameters compared with the other categories. Both the proposed TEDL method and the original EDL method fall into this category.

Bayesian neural networks cover all kinds of stochastic DNNs, including methods based on variational inference [9, 10, 11, 12, 13, 14, 15], sampling methods [16, 17, 18, 19], and Laplace approximation [20, 21, 22]. Methods in this category usually have higher computational complexity in both the training and inference phases due to stochastic sampling.

Ensemble methods [23, 24, 25, 26] combine the predictions from several different deterministic networks at inference. Methods in this category typically have higher requirements on both memory and computational resources at the inference phase.

The proposed method also relates to the concept of two-stage learning, which bears similarity to transfer learning but has some subtle differences. Transfer learning generally refers to the procedure that transfers knowledge obtained from different but related source domains to target domains, usually to reduce the training data required on the target domains; [27] gives a comprehensive survey on transfer learning. In contrast, two-stage learning [28, 29] also consists of two consecutive stages, but these two stages are often conducted on the same data. In a typical two-stage learning setting, the second stage is the final stage that yields the desired output, while the first stage serves as a preparation step. Given such differences, the proposed method should be categorized as two-stage learning.

3. Approach

3.1. A Recap on EDL Uncertainty Quantification

The basic idea of the EDL method is to treat the softmax output as the pointwise estimation of the categorical distribution, and to place a Dirichlet prior over the distribution of all possible softmax outputs. Then, following the Dempster-Shafer theory, assume we have $K$ categories and $\alpha_i = \langle \alpha_{i1}, \dots, \alpha_{iK} \rangle$ is the parameter of a Dirichlet distribution for the classification of sample $i$. The authors propose to replace softmax with ReLU and represent the Dirichlet parameter as $\alpha_i = f(x_i|\Theta) + 1$, where $\Theta$ represents the network parameters and $f(x_i|\Theta)$ is the ReLU output. The $\alpha_{ij}$ here also represents the subjective opinion collected from sample $i$ and category $j$, and $S_i = \sum_{j=1}^{K} \alpha_{ij}$ is referred to as the Dirichlet strength. Note that $S_i$ is inversely proportional to uncertainty: a larger $S_i$ indicates that more evidence is collected for sample $i$, and hence lower uncertainty.

Based on the above assumptions, the EDL loss is defined as below:

$$\mathcal{L}(\Theta) = \sum_{i=1}^{N} \mathcal{L}_i(\Theta) + \lambda_t \mathcal{L}_{KL} \tag{1}$$

where $\mathcal{L}_i(\Theta)$ is formulated as the expected value of a basic loss and $\mathcal{L}_{KL}$ represents regularization. According to the authors of [2], the EDL method appears relatively more stable when the sum of squares loss is used as the basic loss, as below:

$$\mathcal{L}_i(\Theta) = \sum_{j=1}^{K} (y_{ij} - \hat{p}_{ij})^2 + \frac{\hat{p}_{ij}(1 - \hat{p}_{ij})}{S_i + 1} = \sum_{j=1}^{K} \left(y_{ij} - \frac{\alpha_{ij}}{S_i}\right)^2 + \frac{\alpha_{ij}(S_i - \alpha_{ij})}{S_i^2 (S_i + 1)} \tag{2}$$

where $y_{ij}$ and $\hat{p}_{ij}$ denote the class label and expectation for sample $i$ and class $j$, respectively.

Meanwhile, the above loss function is further regularized by minimizing the KL divergence between the estimated Dirichlet distribution $D(p_i|\tilde{\alpha}_i)$ and the uniform distribution, as below:

$$\mathcal{L}_{KL} = \sum_{i=1}^{N} KL\left[D(p_i|\tilde{\alpha}_i) \,\|\, D(p_i|\langle 1, \dots, 1 \rangle)\right] \tag{3}$$

The coefficient $\lambda_t$ in Equation (1) is heuristically set to increase with epoch $t$ (zero-based), i.e., $\lambda_t = \min(1.0, t \cdot \lambda)$ where $\lambda = 0.1$. Note that we denote the per-epoch increment as $\lambda$. For brevity, we will treat $\lambda$ rather than $\lambda_t$ as the hyper-parameter henceforth, since $\lambda_t$ is determined only by $\lambda$.

3.2. A Closer Look into the EDL Method

Equation (1) can be split into two parts: the first part is Equation (2), which is designed to estimate the Dirichlet prior, and the second part is the regularization term in Equation (3), derived from KL divergence. Next, we take a closer look at these two parts in turn to understand the cause of sensitivity.

As we mentioned previously, unlike cross-entropy loss, which is designed to learn the pointwise estimation of the categorical distribution as a MAP estimate, the loss function in Equation (2) is derived to learn the parameter of a Dirichlet prior distribution over all possible predictions. Therefore, the pointwise estimation should also be covered by the Dirichlet prior distribution. This perspective highlights the huge gap in terms of how difficult the optimization problems behind these two loss functions are, especially given that obtaining a good MAP estimation is already a hard problem in many applications. This perspective also highlights the importance of sufficiently large training data, as it would be meaningless to model a distribution without sufficient samples.
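The loss in Equations (1)-(3) can be sketched for a single sample in plain Python, which makes the two parts discussed above easy to inspect separately. This is a minimal sketch rather than the authors' implementation: the function names and the `digamma` helper are ours, the closed form of the Dirichlet KL divergence is the standard one, and the construction of $\tilde{\alpha}_i$ (removing the evidence of the labelled class before regularizing) follows our reading of [2], since the text above does not spell it out.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def sum_of_squares_loss(y, alpha):
    """Equation (2) for one sample: squared error plus variance term."""
    s = sum(alpha)  # Dirichlet strength S_i
    loss = 0.0
    for y_j, a_j in zip(y, alpha):
        p_j = a_j / s  # expected probability of class j
        loss += (y_j - p_j) ** 2 + p_j * (1.0 - p_j) / (s + 1.0)
    return loss


def kl_to_uniform_dirichlet(alpha_tilde):
    """One summand of Equation (3): KL[D(p|alpha_tilde) || D(p|<1,...,1>)]."""
    k = len(alpha_tilde)
    s = sum(alpha_tilde)
    log_norm = math.lgamma(s) - math.lgamma(k)
    log_norm -= sum(math.lgamma(a) for a in alpha_tilde)
    return log_norm + sum(
        (a - 1.0) * (digamma(a) - digamma(s)) for a in alpha_tilde
    )


def annealing_coefficient(epoch, lam=0.1):
    """lambda_t = min(1.0, t * lambda) with a zero-based epoch t."""
    return min(1.0, epoch * lam)


def edl_loss(y, alpha, epoch, lam=0.1):
    """Per-sample version of Equation (1)."""
    # Assumption (from our reading of [2]): alpha_tilde keeps only the
    # evidence assigned to the classes other than the labelled one.
    alpha_tilde = [y_j + (1.0 - y_j) * a_j for y_j, a_j in zip(y, alpha)]
    regularizer = kl_to_uniform_dirichlet(alpha_tilde)
    return sum_of_squares_loss(y, alpha) + annealing_coefficient(epoch, lam) * regularizer
```

For example, `edl_loss([1.0, 0.0], [3.0, 2.0], epoch=0)` reduces to the sum-of-squares part alone because $\lambda_0 = 0$, while later epochs add a growing share of the KL term.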
Figure 2: AUC comparison between cross-entropy loss, EDL loss and TEDL loss, evaluated on the same validation data. All three methods are learnt on training corpora with 1M, 5M, 50M and 500M samples, respectively. In all the training settings, the EDL method achieves inferior AUC compared to cross-entropy loss, while the proposed TEDL method yields AUC comparable to cross-entropy, outperforming EDL significantly.

Meanwhile, the KL divergence also makes optimization more complicated since it is not Lipschitz smooth. More precisely, a function $f$ is said to be Lipschitz smooth if and only if there exists a finite value $L$ such that

$$\|\nabla f(a) - \nabla f(b)\| \le L \cdot \|a - b\| \tag{4}$$

In other words, the gradient of $f$ should exist and its rate of change should be bounded by a finite value $L$. However, the regularization term in Equation (3) does not satisfy this condition since its gradient will go to infinity when $D(p_i|\tilde{\alpha}_i) \to 0$: even though $\tilde{\alpha}_i$ is guaranteed to be positive, $p_i$ may still become very close to zero when a certain component of $\tilde{\alpha}_i$ is extremely large, leading to very large gradients and hence unstable training.

In summary, internal single deterministic methods are trying to optimize an inherently difficult problem, with potentially ill-conditioned loss functions due to the existence of KL divergence.

3.3. The Proposed Two-stage Learning Framework

Having analyzed the possible reasons causing training sensitivity, a more important question is how we could fix such issues and make training more stable. At first glance, this appears to be infeasible since we can neither bypass distribution modeling nor drop the terms related to KL divergence in loss functions. In this paper, we propose an alternative approach, which can fix both issues with a simple yet effective strategy: decomposing the original problem into two sub-problems and tackling one at a time, leading to a two-stage learning method as illustrated in Figure 1. Compared with the original EDL method, the only cost introduced by TEDL is a preparation stage learning from the cross-entropy loss; however, as we will see in Section 4, this cost is well paid off given the significant AUC increase and greatly improved robustness in training.

So why does such a simple strategy work? On one hand, the first stage in TEDL learns a pointwise estimation of the categorical distribution, which is a much easier problem compared with modeling the entire distribution and requires far fewer training samples. Then in the second stage, since the model is initialized from the weights obtained in stage 1, it amounts to modeling the prior distribution using the pointwise estimation as anchor points, which is much easier than modeling the prior from scratch, provided we can assume that the pointwise estimation is close to the expected value of the prior. This assumption should hold easily for most practical applications; otherwise we would not be able to apply internal single deterministic methods at all, since the expected value of the prior distribution would be unlikely to yield meaningful predictions in that case.

On the other hand, by learning from cross-entropy loss, we can effectively avoid assigning extremely small values to $p_i$, given that softmax involves exponential operations and there is no point in pushing model outputs before softmax to extremely large values. That means, when softmax is replaced by ELU later in stage 2, it is unlikely for us to see extremely large $\tilde{\alpha}_i$ values.

Figure 3: ROC AUC vs. uncertainty thresholds with 1M, 5M, 50M and 500M training corpus, respectively, and $\lambda = 0.1$. The first row is for EDL, while the second row is for TEDL. This figure shows that under a relatively small $\lambda$, the quality of uncertainty learnt by both EDL and TEDL improves as training proceeds.

Figure 4: Uncertainty distribution of EDL (first row) and TEDL (second row), learnt on 1M, 5M, 50M and 500M training samples with $\lambda = 0.1$.

4. Experiments

4.1. Implementation Details

All experiments throughout this paper are conducted on a binary classification task, with the goal of predicting whether a query-ad pair is relevant or not. Both the training (1.4M) and validation (100K) samples are sampled from a large-scale commercial search engine, with human-provided relevance labels. In order to examine the impact of the size of the training data, we further create a synthetic training set with soft labels, by sampling a large corpus and running inference with an ensemble of BERT [30] models fine-tuned on the human-labeled training set, similar to what is done in knowledge distillation [31]. This allows us to experiment on a much larger scale, without breaking any assumptions in the EDL method. Unless otherwise stated, we will henceforth refer to this synthetic training set as our training corpus, and experiments will be conducted on subsets sampled from this synthetic training set, with 1M, 5M, 50M and 500M samples, respectively.
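To make the two-stage schedule of Section 3.3 concrete, the toy sketch below trains a two-class linear model on one-dimensional inputs: stage 1 minimizes cross-entropy, and stage 2 starts from the stage-1 weights and minimizes the sum-of-squares EDL loss with $\alpha = \mathrm{ELU}(z) + 1$. Everything here is an illustrative assumption rather than the setup used in our experiments: the model, the data, the finite-difference optimizer and all function names are hypothetical, and the KL regularizer of Equation (3) is left out for brevity.

```python
import math


def elu(z):
    """ELU activation with unit scale; its output lies in (-1, inf)."""
    return z if z > 0 else math.exp(z) - 1.0


def ce_loss(params, data):
    """Stage 1 objective: mean softmax cross-entropy of a 2-class linear model."""
    w0, b0, w1, b1 = params
    total = 0.0
    for x, label in data:
        z = [w0 * x + b0, w1 * x + b1]
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_norm - z[label]
    return total / len(data)


def edl_sos_loss(params, data):
    """Stage 2 objective: mean sum-of-squares EDL loss, alpha = ELU(z) + 1."""
    w0, b0, w1, b1 = params
    total = 0.0
    for x, label in data:
        alpha = [elu(w0 * x + b0) + 1.0, elu(w1 * x + b1) + 1.0]
        s = sum(alpha)
        for j, a in enumerate(alpha):
            p = a / s
            y = 1.0 if j == label else 0.0
            total += (y - p) ** 2 + p * (1.0 - p) / (s + 1.0)
    return total / len(data)


def descend(loss_fn, params, data, steps=200, lr=0.5, eps=1e-5):
    """Finite-difference gradient descent that accepts only improving steps."""
    params = list(params)
    current = loss_fn(params, data)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grad.append((loss_fn(bumped, data) - current) / eps)
        candidate = [p - lr * g for p, g in zip(params, grad)]
        candidate_loss = loss_fn(candidate, data)
        if candidate_loss < current:
            params, current = candidate, candidate_loss
        else:
            lr *= 0.5  # step too large: shrink it and try again
    return params, current


def train_two_stage(data, init=(0.0, 0.0, 0.0, 0.0)):
    """Stage 1 on cross-entropy, then stage 2 on EDL from the stage-1 weights."""
    stage1_params, _ = descend(ce_loss, init, data)
    stage2_start = edl_sos_loss(stage1_params, data)
    stage2_params, stage2_end = descend(edl_sos_loss, stage1_params, data)
    return stage1_params, stage2_params, stage2_start, stage2_end
```

The only coupling between the stages is the initialization: `stage1_params` is passed unchanged as the starting point of the second `descend` call, which mirrors the anchor-point argument of Section 3.3.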
In addition, in this paper we use TwinBERT [32, 33] as our deep classification model, which uses two BERT encoders to encode the query and the ad respectively, and then calculates their relevance score by cosine similarity. We choose this model mainly for its simplicity and efficiency, and the conclusions of this paper should hold for other model architectures as well, since no particular assumptions about model architectures are made in the proposed TEDL method.

In terms of metrics, since we are working on a binary classification task, we use ROC AUC to evaluate the prediction performance (in our experiments PR AUC shows a very similar trend to ROC AUC). Meanwhile, to measure the quality of uncertainty, we follow the approach in [2]: we first split our validation data using different uncertainty thresholds, and then evaluate ROC AUC on each individual subset. For example, when the threshold is 0.1, ROC AUC is calculated only on validation samples with uncertainty lower than 0.1. Therefore, if uncertainty is properly quantified, we should expect higher ROC AUC at lower thresholds, since this is the subset that our model is more confident about. This way, we can plot a curve of ROC AUC vs. uncertainty thresholds.

Figure 5: Comparison of ROC AUC for EDL and TEDL, learnt on 1M training corpus with different $\lambda$. Compared to EDL, TEDL not only achieves higher ROC AUC, but also shows improved robustness to $\lambda$, especially when $\lambda = 0.75$ where the EDL method runs into the Dying ReLU problem.

4.2. Results and Analysis

4.2.1. Classification Performance evaluated by ROC AUC

Figure 2 summarizes the per-epoch ROC AUC of models learnt by cross-entropy loss, the EDL method and the proposed TEDL method, with 1M, 5M, 50M and 500M training samples respectively. In all these settings, we consistently observe that the ROC AUC from the EDL method is much lower than that from cross-entropy loss, while the proposed TEDL method is able to achieve performance comparable to cross-entropy loss, outperforming EDL significantly.

In addition, if we look into the ROC AUC measured at different epochs in Figure 2, we can also see that TEDL is much more stable than EDL, especially when the training corpus is relatively small.

4.2.2. Quality of Uncertainty

As mentioned previously, we measure the quality of the learnt uncertainty by plotting a curve of ROC AUC vs. uncertainty thresholds, as shown in Figure 3, where the first row corresponds to EDL and the second row to TEDL. By comparing plots from different epochs, we can see that the quality of uncertainty learnt by both EDL and TEDL steadily improves over the training process, and the improvement patterns for EDL and TEDL are very similar. However, this only happens when a relatively small $\lambda$ is used. Later in Section 4.3 we will see that, compared with EDL, TEDL is much more robust to $\lambda$. We also plot the distribution of uncertainty in each training epoch, as shown in Figure 4, where TEDL also looks similar to EDL when $\lambda$ is relatively small; Section 4.3 shows how they differ when $\lambda$ gets larger.

4.3. Sensitivity towards Hyper-parameters

So far all the results we report are obtained under mild conditions with $\lambda = 0.1$. However, as we mentioned in Section 1, $\lambda$ and the number of training epochs may have a dramatic impact on EDL, and hence it is necessary to examine how robust TEDL is to these two hyper-parameters.

4.3.1. ROC AUC

Figure 5 compares the ROC AUC obtained by the EDL and TEDL methods under different $\lambda$ values. Similar to Figure 2, TEDL consistently outperforms EDL, and is more stable when more training epochs are used. In particular, when $\lambda = 0.75$ we observe the Dying ReLU problem in EDL, which inspires us to replace ReLU by ELU in TEDL.

4.3.2. Quality of Uncertainty

Figure 6 and Figure 7 compare the quality of uncertainty learnt by the EDL and TEDL methods under different $\lambda$ values. Compared with Figure 3 and Figure 4, the uncertainty quality learnt by EDL degrades dramatically when a larger $\lambda$ is used, as shown in the cases where $\lambda = 0.25$ and $\lambda = 0.5$. By contrast, for TEDL, both its plots of ROC AUC vs. uncertainty and its uncertainty distribution look very similar to what we observed for $\lambda = 0.1$, demonstrating significantly improved robustness to $\lambda$.
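The uncertainty-quality measure used throughout Section 4 (ROC AUC restricted to validation samples whose uncertainty is below a threshold) can be sketched as follows. This is an illustrative sketch with hypothetical function names, not the evaluation code behind the figures; the quadratic-time pairwise AUC is chosen for readability and would need a rank-based implementation for validation sets of realistic size.

```python
def roc_auc(labels, scores):
    """ROC AUC by pairwise comparison; ties count as half a win.

    Returns None when the subset contains only one class, because
    the AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_uncertainty_threshold(labels, scores, uncertainties, thresholds):
    """For each threshold, evaluate AUC on samples with uncertainty below it."""
    curve = []
    for t in thresholds:
        keep = [i for i, u in enumerate(uncertainties) if u < t]
        auc = roc_auc([labels[i] for i in keep], [scores[i] for i in keep])
        curve.append((t, len(keep), auc))  # threshold, subset size, AUC
    return curve
```

Each returned tuple also carries the subset size, since an AUC computed on very few low-uncertainty samples should be read with caution.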
Figure 6: Comparison of ROC AUC vs. uncertainty for EDL (first row) and TEDL (second row), learnt on 1M training corpus with different $\lambda$, where TEDL shows significantly better robustness.

Figure 7: Comparison of uncertainty distribution for EDL (first row) and TEDL (second row), learnt on 1M training corpus with different $\lambda$, where TEDL shows significantly better robustness.

5. Conclusion

In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep classification models. TEDL contains two stages: the first stage learns from cross-entropy loss to obtain a good point estimate of the Dirichlet prior distribution, and the second stage learns to quantify uncertainty via the reformulated EDL loss. We conduct extensive experiments using training corpora sampled from a real commercial search engine, which demonstrate that, compared with EDL, the proposed TEDL not only achieves higher AUC, but also shows improved robustness to hyper-parameters. As future work, the uncertainty learnt by TEDL may be leveraged to develop active learning algorithms.

References

[1] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., A survey of uncertainty in deep neural networks, arXiv preprint arXiv:2107.03342 (2021).
[2] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, in: Advances in Neural Information Processing Systems, volume 31, 2018, pp. 3183–3193.
[3] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, in: Advances in Neural Information Processing Systems, volume 31, 2018, pp. 7047–7058.
[4] J. Nandy, W. Hsu, M. L. Lee, Towards maximizing the representation gap between in-domain & out-of-distribution examples, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9239–9250.
[5] M. Możejko, M. Susik, R. Karczewski, Inhibited softmax for uncertainty estimation in neural networks, arXiv preprint arXiv:1810.01861 (2018).
[6] J. Lee, G. AlRegib, Gradients as a measure of uncertainty in neural networks, in: International Conference on Image Processing, IEEE, 2020, pp. 2416–2420.
[7] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, J. Kleinberg, Direct uncertainty prediction for medical second opinions, in: International Conference on Machine Learning, PMLR, 2019, pp. 5281–5290.
[8] T. Ramalho, M. Miranda, Density estimation in representation space to predict model uncertainty, in: International Workshop on Engineering Dependable and Secure Machine Learning Systems, Springer, 2020, pp. 84–96.
[9] G. E. Hinton, D. Van Camp, Keeping the neural networks simple by minimizing the description length of the weights, in: Computational Learning Theory, 1993, pp. 5–13.
[10] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[11] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural network, in: International Conference on Machine Learning, PMLR, 2015, pp. 1613–1622.
[12] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, volume 24, 2011, pp. 2348–2356.
[13] C. Louizos, K. Ullrich, M. Welling, Bayesian compression for deep learning, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 3288–3298.
[14] D. Rezende, S. Mohamed, Variational inference with normalizing flows, in: International Conference on Machine Learning, PMLR, 2015, pp. 1530–1538.
[15] D. Barber, C. M. Bishop, Ensemble learning in Bayesian neural networks, Nato ASI Series F Computer and Systems Sciences 168 (1998) 215–238.
[16] R. M. Neal, An improved acceptance procedure for the hybrid Monte Carlo algorithm, Journal of Computational Physics 111 (1994) 194–203.
[17] R. M. Neal, Bayesian learning for neural networks, volume 118, Springer Science & Business Media, 2012.
[18] M. Welling, Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in: International Conference on Machine Learning, 2011, pp. 681–688.
[19] C. Nemeth, P. Fearnhead, Stochastic gradient Markov chain Monte Carlo, Journal of the American Statistical Association 116 (2021) 433–450.
[20] T. Salimans, D. P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, in: Advances in Neural Information Processing Systems, volume 29, 2016, pp. 901–909.
[21] J. Lee, M. Humt, J. Feng, R. Triebel, Estimating model uncertainty of neural networks in sparse information form, in: International Conference on Machine Learning, PMLR, 2020, pp. 5702–5713.
[22] H. Ritter, A. Botev, D. Barber, A scalable Laplace approximation for neural networks, in: International Conference on Learning Representations, volume 6, 2018.
[23] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 6404–6416.
[24] H. Guo, H. Liu, R. Li, C. Wu, Y. Guo, M. Xu, Margin & diversity based ordering ensemble pruning, Neurocomputing 275 (2018) 237–246.
[25] W. G. Martinez, Ensemble pruning via quadratic margin maximization, IEEE Access 9 (2021) 48931–48951.
[26] J. Lindqvist, A. Olmin, F. Lindsten, L. Svensson, A general framework for ensemble distribution distillation, in: International Workshop on Machine Learning for Signal Processing, IEEE, 2020, pp. 1–6.
[27] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (2021) 43–76.
[28] V. Dang, M. Bendersky, W. B. Croft, Two-stage learning to rank for information retrieval, in: European Conference on Information Retrieval, Springer, 2013, pp. 423–434.
[29] F. A. Khan, A. Gumaei, A. Derhab, A. Hussain, A novel two-stage deep learning model for efficient network intrusion detection, IEEE Access 7 (2019) 30373–30385.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[31] X. Li, Z. Luo, H. Sun, J. Zhang, W. Han, X. Chu, L. Zhang, Q. Zhang, Learning fast matching models from weak annotations, in: Proceedings of the Web Conference, 2019, pp. 2985–2991.
[32] W. Lu, J. Jiao, R. Zhang, TwinBERT: Distilling knowledge to twin-structured compressed BERT models for large-scale retrieval, in: Proceedings of the ACM International Conference on Information & Knowledge Management, 2020, pp. 2645–2652.
[33] J. Zhu, Y. Cui, Y. Liu, H. Sun, X. Li, M. Pelger, T. Yang, L. Zhang, R. Zhang, H. Zhao, TextGNN: Improving text encoder via graph neural network in sponsored search, in: Proceedings of the Web Conference, 2021, pp. 2848–2857.