<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TEDL: A Two-stage Evidential Deep Learning Method for Classification Uncertainty Quantification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xue Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Shen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Charles</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>1045 La Avenida St, Mountain View</institution>
          ,
          <addr-line>CA, 94043</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>555 110th Ave NE</institution>
          ,
          <addr-line>Bellevue, WA, 98004</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep learning models in classification tasks, inspired by our findings in experimenting with the Evidential Deep Learning (EDL) method, a recently proposed uncertainty quantification approach based on the Dempster-Shafer theory. More specifically, we observe that EDL tends to yield inferior AUC compared with models learnt by cross-entropy loss, and is highly sensitive in training. Such sensitivity is likely to cause unreliable uncertainty estimation, making it risky for practical applications. To mitigate both limitations, we propose a simple yet effective two-stage learning approach based on our analysis of the likely reasons causing such sensitivity, with the first stage learning from cross-entropy loss, followed by a second stage learning from EDL loss. We also re-formulate the EDL loss by replacing ReLU with ELU to avoid the Dying ReLU issue. Extensive experiments are carried out on training corpora of varied sizes collected from a large-scale commercial search engine, demonstrating that the proposed two-stage learning framework can increase AUC significantly and greatly improve training robustness.</p>
      </abstract>
      <kwd-group>
<kwd>uncertainty quantification</kwd>
        <kwd>search ads recommendation</kwd>
        <kwd>deep learning</kwd>
        <kwd>classification</kwd>
        <kwd>BERT</kwd>
        <kwd>TwinBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. * Corresponding author. Email: xeli@microsoft.com (X. Li); sashen@microsoft.com (W. Shen); cdx@microsoft.com (D. Charles).</p>
<p>In particular, EDL is highly sensitive to the number of training epochs and some hyper-parameters, where improper settings may lead to significantly degraded AUC and unreliable uncertainty estimation.</p>
<p>To see this more clearly, in Figure 5 (the orange curve) we summarize the per-epoch ROC AUC obtained in EDL training with different values of δ, a hyper-parameter controlling how close the Dirichlet prior is to a uniform distribution. As we can see, the AUC of EDL suffers in the beginning under all four settings, and in some cases (for example, when δ = 0.5) there is no sign of improvement at all. In cases where AUC does improve, the final AUC is still significantly lower than that of the proposed method (the green curve). On the other hand, consider evaluating AUC on validation samples with uncertainty lower than a certain threshold: if the learnt uncertainty is of high quality, smaller thresholds should indicate higher confidence, and hence should be associated with higher AUC. However, this is not always the case for EDL, as shown in the first row of Figure 6. Besides, we also observe that when a large δ is used (for example, δ = 0.75), there is a higher risk of running into the Dying ReLU problem, where all outputs are zero, leading to an AUC similar to a random guess. All these issues make it risky to apply methods like EDL in real-world applications.</p>
      <p>To fix these issues, we first present an analysis of the likely reasons causing the above issues in Section 3, and based on this analysis, we further propose TEDL, short for Two-stage Evidential Deep Learning, a simple but effective training framework that mitigates all the aforementioned issues in a single shot. As we will see in Section 3, the basic idea of TEDL is to transform the difficult uncertainty quantification problem into two sub-problems that are much easier to tackle, i.e., 1) finding a reasonably good pointwise estimation of the categorical distribution, and 2) leveraging this pointwise estimation as an anchor point for estimating the Dirichlet prior of the categorical distribution, based on which we can quantify uncertainty.</p>
      <p>The overall training framework of TEDL is illustrated in Figure 1, where two stages are needed: in the first stage, we train our classification model with cross-entropy loss, in order to obtain a model that is able to output reasonable pointwise estimations of the categorical distribution. Then, in the second stage, we initialize the model from the weights obtained in the previous stage, and go through the same training corpus learning with the reformulated EDL loss, where ReLU is replaced by ELU. As shown in Section 4, compared with the EDL baseline, TEDL achieves higher AUC across all evaluation settings and effectively avoids the risk of running into the Dying ReLU problem. More importantly, TEDL also shows significantly improved robustness towards training settings, making it more reliable for practical applications.</p>
      <p>It is also worth mentioning that we name our proposed method after EDL mainly for the convenience of experimentation, as EDL was proposed recently and is easy to implement with the code open-sourced by the authors. However, our analysis in Section 3 also applies to other single deterministic uncertainty quantification methods suffering from similar issues, and hence the two-stage learning framework we propose in this paper could be readily extended to those methods as well.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related Works</title>
      <p>The interest in uncertainty estimation dates back to the days even before the rise of deep learning, entailing a large body of literature on this topic. Based on whether model ensembles are used and whether the model is stochastic, uncertainty quantification methods can be roughly grouped into three categories: single deterministic methods, Bayesian neural networks, and ensemble methods. Please refer to [1] for a comprehensive survey.</p>
      <p>Single deterministic methods [4, 5] estimate uncertainty based on one single forward pass within a deterministic network, and can be further split into external approaches [6, 7] and internal approaches [2, 3, 8], depending on whether an additional method is used for deriving the uncertainty estimation. Methods in this category typically have lower requirements on computational resources, since neither stochastic networks nor model ensembles are needed, but suffer from sensitivity to initialization and parameters compared with the other categories. The proposed TEDL method, as well as the original EDL method, both fall into this category.</p>
      <p>Bayesian neural networks cover all kinds of stochastic DNNs, including methods based on variational inference [9, 10, 11, 12, 13, 14, 15], sampling methods [16, 17, 18, 19], and Laplace approximation [20, 21, 22]. Methods in this category usually have higher computational complexity in both the training and inference phases due to stochastic sampling.</p>
      <p>Ensemble methods [23, 24, 25, 26] combine the predictions from several different deterministic networks at inference. Methods in this category typically have higher requirements on both memory and computational resources at the inference phase.</p>
      <p>The proposed method also relates to the concept of two-stage learning, which bears similarity to transfer learning but has some subtle differences. Transfer learning generally refers to the procedure that transfers knowledge obtained from different but related source domains to target domains, usually to reduce the training data required on the target domains; [27] gives a comprehensive survey on transfer learning. In contrast, two-stage learning [28, 29], although it also consists of two consecutive stages, is often conducted on the same data. In a typical two-stage learning setting, the second stage is the final stage that yields the desired output, while the first stage serves as a preparation step. Given such differences, the proposed method should be categorized as two-stage learning.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Approach</title>
      <sec id="sec-2-1">
        <title>3.1. A Recap on EDL Uncertainty Quantification</title>
        <p>The basic idea of the EDL method is to treat the softmax output as a pointwise estimation of the categorical distribution, and to place a Dirichlet prior over the distribution of all possible softmax outputs. Following the Dempster-Shafer theory, assume we have K categories and let α_i = ⟨α_i1, …, α_iK⟩ be the parameter of the Dirichlet distribution for the classification of sample i. The authors of [2] propose to replace softmax with ReLU and represent the Dirichlet parameter as α_i = f(x_i | Θ) + 1, where Θ represents the network parameters and f(x_i | Θ) is the ReLU output. The evidence e_ij = f(x_i | Θ)_j represents the subjective opinion collected from sample i for category j, and S_i = ∑_{j=1}^{K} α_ij is referred to as the Dirichlet strength. Note that S_i is inversely proportional to uncertainty: a larger S_i indicates that more evidence is collected for sample i, and hence lower uncertainty.</p>
        <p>Based on the above assumptions, the EDL loss is defined as below: ℒ(Θ) = ∑_{i=1}^{N} ℒ_i(Θ) + λ_t ℒ_reg (1), where ℒ_i(Θ) is formulated as the expected value of a basic loss and ℒ_reg represents regularization. According to the authors of [2], the EDL method appears relatively more stable when the sum of squares loss is used as the basic loss: ℒ_i(Θ) = ∑_{j=1}^{K} (y_ij − p̂_ij)² + p̂_ij(1 − p̂_ij)/(S_i + 1) = ∑_{j=1}^{K} (y_ij − α_ij/S_i)² + α_ij(S_i − α_ij)/(S_i²(S_i + 1)) (2), where y_ij and p̂_ij = α_ij/S_i denote the class label and the expected probability for sample i and class j, respectively.</p>
        <p>Meanwhile, the above loss function is further regularized by minimizing the KL divergence between the estimated Dirichlet distribution D(p_i | α̃_i) and the uniform distribution: ℒ_reg = ∑_{i=1}^{N} KL[D(p_i | α̃_i) ‖ D(p_i | ⟨1, …, 1⟩)] (3), where α̃_i = y_i + (1 − y_i) ⊙ α_i is the Dirichlet parameter after the evidence of the ground-truth class is removed. The coefficient λ_t in Equation (1) is heuristically set to increase with the (zero-based) epoch t, i.e., λ_t = min(1.0, t · δ) with δ = 0.1, where δ denotes the per-epoch increment. For brevity, we will treat δ rather than λ_t as the hyper-parameter henceforth, since λ_t is determined only by δ.</p>
      </sec>
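<p>To make Equations (1)-(3) concrete, below is a minimal numpy sketch of the EDL loss (an illustration under our reading of [2], not the authors' released implementation); it assumes one-hot labels and uses the standard closed form of the KL divergence between a Dirichlet and the uniform Dirichlet:</p>
<preformat>
```python
import numpy as np
from scipy.special import gammaln, digamma

def edl_loss(evidence, y, lam):
    """EDL loss of Eq. (1): the sum-of-squares term of Eq. (2) plus the
    lam-weighted KL regularizer of Eq. (3).
    evidence: (N, K) non-negative outputs f(x|Theta); y: one-hot labels."""
    alpha = evidence + 1.0                    # Dirichlet parameters
    S = alpha.sum(axis=1, keepdims=True)      # Dirichlet strengths S_i
    p_hat = alpha / S                         # expected probabilities
    # Eq. (2): squared error plus the variance of the Dirichlet mean
    sq = ((y - p_hat) ** 2).sum(axis=1)
    var = (p_hat * (1.0 - p_hat) / (S + 1.0)).sum(axis=1)
    # Eq. (3): KL to the uniform Dirichlet, computed on alpha_tilde,
    # i.e. alpha with the ground-truth evidence removed
    at = y + (1.0 - y) * alpha
    St = at.sum(axis=1)
    kl = (gammaln(St) - gammaln(at).sum(axis=1) - gammaln(float(at.shape[1]))
          + ((at - 1.0) * (digamma(at) - digamma(St[:, None]))).sum(axis=1))
    return (sq + var).sum() + lam * kl.sum()
```
</preformat>
<p>With zero evidence, α̃_i is all ones, so the KL term vanishes and λ_t has no effect; the regularizer only penalizes evidence assigned to wrong classes.</p>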
      <sec id="sec-2-3">
        <title>3.2. A Closer Look into the EDL Method</title>
        <p>Equation (1) can be split into two parts: the first part is Equation (2), which is designed to estimate the Dirichlet prior, and the second part is the regularization term in Equation (3), derived from the KL divergence. Next, we take a closer look at these two parts respectively to understand the cause of sensitivity.</p>
        <p>As we mentioned previously, unlike cross-entropy loss, which is designed to learn the pointwise estimation of the categorical distribution as a MAP estimate, the loss function in Equation (2) is derived to learn the parameter of a Dirichlet prior distribution over all the possible predictions. Therefore, the pointwise estimation should also be covered by the Dirichlet prior distribution. This perspective highlights the huge gap in terms of how difficult the optimization problems behind these two loss functions are, especially given that obtaining a good MAP estimation is already a hard problem in many applications. This perspective also highlights the importance of sufficiently large training data, as it would be meaningless to model a distribution without sufficient samples.</p>
        <p>In the meanwhile, the KL divergence also makes optimization more complicated, since it is not Lipschitz smooth. More precisely, a function f is said to be Lipschitz smooth if and only if there exists a finite value L such that ‖∇f(x₁) − ∇f(x₂)‖ &lt; L · ‖x₁ − x₂‖ (4). In other words, the gradient of f should exist and its variation should be bounded by a finite value L. However, the regularization term in Equation (3) does not satisfy this condition, since its gradient goes to infinity when D(p_i | α̃_i) → 0: even though α̃_ij is guaranteed to be positive, the density may still become very close to zero when a certain α̃_ij is extremely large, leading to very large gradients and hence unstable training.</p>
        <p>In summary, internal single deterministic methods are trying to optimize an inherently difficult problem, with potentially ill-conditioned loss functions due to the existence of the KL divergence.</p>
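<p>Throughout the paper, the uncertainty used for thresholding derives from the Dirichlet strength discussed in Section 3.1; in the subjective-logic view adopted by EDL [2], the uncertainty mass is u_i = K / S_i. A small sketch:</p>
<preformat>
```python
import numpy as np

def uncertainty_mass(evidence):
    # Subjective-logic uncertainty used by EDL [2]: u = K / S, where
    # S = sum_j alpha_j is the Dirichlet strength; more collected
    # evidence means a larger S and hence lower uncertainty
    alpha = np.asarray(evidence, dtype=float) + 1.0
    return alpha.shape[-1] / alpha.sum(axis=-1)

print(uncertainty_mass([0.0, 0.0]))   # no evidence: u = 1.0 (maximal)
print(uncertainty_mass([8.0, 0.0]))   # strong evidence: u = 0.2
```
</preformat>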
      </sec>
      <sec id="sec-2-4">
        <title>3.3. The Proposed Two-stage Learning Framework</title>
        <p>Having analyzed the possible reasons causing training sensitivity, a more important question is how we could fix such issues and make training more stable. At first glance, this appears to be infeasible, since we can neither bypass distribution modeling nor drop the terms related to KL divergence in the loss functions. In this paper, we propose an alternative approach, which can fix both issues with a simple yet effective strategy: decomposing the original problem into two sub-problems and tackling one at a time, leading to a two-stage learning method as illustrated in Figure 1. Compared with the original EDL method, the only cost introduced by TEDL is a preparation stage learning from the cross-entropy loss; however, as we will see in Section 4, such cost is well paid off given the significant AUC increase and greatly improved robustness in training.</p>
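<p>The staging itself can be sketched as follows; this is purely structural, with train_epoch standing in for one pass over the training corpus (the function and its arguments are illustrative placeholders, not the authors' code):</p>
<preformat>
```python
def train_epoch(model, loss, lam=None):
    # placeholder for one real optimization pass; here we only record
    # which loss (and annealing coefficient) each epoch would use
    model["log"].append((loss, lam))

def tedl_train(model, epochs_per_stage, delta=0.1):
    # Stage 1: cross-entropy, to obtain a good pointwise estimation
    for _ in range(epochs_per_stage):
        train_epoch(model, loss="cross_entropy")
    # Stage 2: same corpus, weights carried over from stage 1, and the
    # reformulated EDL loss (ELU evidence) with annealed KL coefficient
    for t in range(epochs_per_stage):
        train_epoch(model, loss="edl_elu", lam=min(1.0, t * delta))
    return model

model = tedl_train({"log": []}, epochs_per_stage=3)
```
</preformat>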
        <p>So why does such a simple strategy work? On one hand, the first stage in TEDL learns a pointwise estimation of the categorical distribution, which is a much easier problem compared with modeling the entire distribution and requires much fewer training samples. Then, in the second stage, since the model is initialized from the weights obtained in stage 1, it amounts to modeling the prior distribution using the pointwise estimation as an anchor point, which is much easier than modeling the prior from scratch, provided that the pointwise estimation is close to the expected value of the prior. This assumption should easily hold for most practical applications; otherwise, we would not be able to apply internal single deterministic methods at all, since the expected value of the prior distribution would be unlikely to yield meaningful predictions in that case.</p>
        <p>On the other hand, by learning from cross-entropy loss first, we can effectively avoid assigning extremely small values to D(p_i | α̃_i): softmax involves exponential operations, so there is no point in pushing the model outputs before softmax to extremely large values. That means, when softmax is replaced by ELU later in stage 2, we are unlikely to see extremely large α̃_ij values.</p>
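<p>The contrast between the two activation choices can be seen in a few lines; note that keeping the Dirichlet parameter positive via α = ELU(z) + 1 is our reading of the reformulation, not a detail spelled out above:</p>
<preformat>
```python
import numpy as np

def relu_evidence(z):
    # original EDL: negative logits give exactly zero evidence and zero
    # gradient, which is the Dying ReLU failure mode
    return np.maximum(z, 0.0)

def elu_evidence(z):
    # TEDL reformulation: ELU keeps a non-zero gradient for negative
    # logits, so units can recover; since ELU(z) stays above -1, the
    # Dirichlet parameter alpha = elu_evidence(z) + 1 remains positive
    return np.where(np.greater(z, 0.0), z, np.exp(z) - 1.0)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu_evidence(z))   # the first three entries collapse to 0.0
print(elu_evidence(z))    # negative logits keep graded values above -1
```
</preformat>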
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Implementation Details</title>
        <p>All experiments throughout this paper are conducted on a binary classification task, with the goal of predicting whether a &lt;query, ad&gt; pair is relevant or not. Both the training (1.4M) and validation samples (100K) are sampled from a large-scale commercial search engine, with human-provided relevance labels. In order to examine the impact of the size of the training data, we further create a synthetic training set with soft labels, by sampling a large corpus and running inference with an ensemble of BERT [30] models fine-tuned on the human-labeled training set, similar to what we do in knowledge distillation [31].</p>
        <p>This allows us to experiment at a much larger scale without breaking any assumptions in the EDL method. Without further clarification, we will henceforth refer to this synthetic training set as our training corpus, and experiments will be conducted on subsets sampled from it, with 1M, 5M, 50M and 500M samples, respectively.</p>
        <p>In addition, in this paper we use TwinBERT [32, 33] as our deep classification model, which uses two BERT encoders to encode the query and the ad respectively, and then calculates their relevance score by cosine similarity. We choose this model mainly for its simplicity and efficiency; the conclusions of this paper should hold for other model architectures as well, since no particular assumptions about model architectures are made in the proposed TEDL method.</p>
        <p>In terms of metrics, since we are working on a binary classification task, we use ROC AUC to evaluate the prediction performance (in our experiments, PR AUC shows a very similar trend to ROC AUC). Meanwhile, to measure the quality of uncertainty, we follow the approach in [2]: we first split our validation data using different uncertainty thresholds, and then evaluate ROC AUC on each individual subset. For example, when the threshold is 0.1, ROC AUC is calculated only on validation samples with uncertainty lower than 0.1. Therefore, if uncertainty is properly quantified, we should expect higher ROC AUC at lower thresholds, since these are the subsets that our model is more confident about. This way, we can plot a curve of ROC AUC vs. uncertainty thresholds.</p>
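<p>The threshold-sweep protocol described above can be sketched as follows (a self-contained illustration using a simple rank-based AUC, not our evaluation code):</p>
<preformat>
```python
import numpy as np

def roc_auc(labels, scores):
    # rank-based ROC AUC: probability that a random positive is scored
    # above a random negative (no tie handling, for illustration only)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def auc_below_thresholds(labels, scores, unc, thresholds):
    # protocol of [2]: restrict evaluation to samples whose uncertainty
    # is below each threshold; well-quantified uncertainty should give
    # higher AUC at lower thresholds
    result = {}
    for t in thresholds:
        keep = np.less(unc, t)
        result[t] = roc_auc(labels[keep], scores[keep])
    return result
```
</preformat>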
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Results and Analysis</title>
        <sec id="sec-3-2-1">
          <title>4.2.1. Classification Performance evaluated by ROC AUC</title>
          <p>ROC AUC is steadily improved over the training process, and the improving patterns for EDL and TEDL are very similar. However, this only happens when a relatively small δ is used; later, in Section 4.3, we will see that compared with EDL, TEDL is much more robust towards δ. We also plot the distribution of uncertainty in each training epoch, as shown in Figure 4, where TEDL also looks similar to EDL when δ is relatively small, but later in Section 4.3 we will see their difference when δ gets larger.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Sensitivity towards Hyper-parameters</title>
        <p>So far, all the results we report are obtained under mild conditions with δ = 0.1; however, as we mentioned in Section 1, δ and the number of training epochs may have a dramatic impact on EDL, and hence it is necessary to examine how robust TEDL is towards these two hyper-parameters.</p>
        <sec id="sec-3-3-1">
          <title>4.3.1. ROC AUC</title>
        </sec>
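<p>For reference, the annealing schedule from Section 3.1 responds to δ as below; a large δ switches the KL regularizer to full strength within the first couple of epochs, which is exactly the regime where EDL becomes fragile:</p>
<preformat>
```python
def annealing_coefficient(epoch, delta):
    # lambda_t = min(1.0, t * delta): the KL weight grows by delta per
    # (zero-based) epoch and saturates at 1.0
    return min(1.0, epoch * delta)

# a mild delta = 0.1 ramps up slowly; delta = 0.5 saturates by epoch 2
print([annealing_coefficient(t, 0.1) for t in range(6)])
print([annealing_coefficient(t, 0.5) for t in range(6)])
```
</preformat>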
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this paper, we propose TEDL, a two-stage learning approach to quantify uncertainty for deep classification models. TEDL contains two stages: the first stage learns from cross-entropy loss to obtain a good point estimate of the Dirichlet prior distribution, and then the second stage learns to quantify uncertainty via the reformulated EDL loss. We conduct extensive experiments using training corpora sampled from a real commercial search engine, which demonstrate that, compared with EDL, the proposed TEDL not only achieves higher AUC, but also shows improved robustness towards hyper-parameters. As future work, the uncertainty learnt by TEDL may be leveraged to develop active learning algorithms.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., A survey of uncertainty in deep neural networks, arXiv preprint arXiv:2107.03342 (2021).</p>
      <p>[2] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, in: Advances in Neural Information Processing Systems, volume 31, 2018, pp. 3183–3193.</p>
      <p>[3] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, in: Advances in Neural Information Processing Systems, volume 31, 2018, pp. 7047–7058.</p>
      <p>[4] J. Nandy, W. Hsu, M. L. Lee, Towards maximizing the representation gap between in-domain &amp; out-of-distribution examples, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9239–9250.</p>
      <p>[5] M. Możejko, M. Susik, R. Karczewski, Inhibited softmax for uncertainty estimation in neural networks, arXiv preprint arXiv:1810.01861 (2018).</p>
      <p>[6] J. Lee, G. AlRegib, Gradients as a measure of uncertainty in neural networks, in: International Conference on Image Processing, IEEE, 2020, pp. 2416–2420.</p>
      <p>[7] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, J. Kleinberg, Direct uncertainty prediction for medical second opinions, in: International Conference on Machine Learning, PMLR, 2019, pp. 5281–5290.</p>
      <p>[8] T. Ramalho, M. Miranda, Density estimation in representation space to predict model uncertainty, in: International Workshop on Engineering Dependable and Secure Machine Learning Systems, Springer, 2020, pp. 84–96.</p>
      <p>[9] G. E. Hinton, D. Van Camp, Keeping the neural networks simple by minimizing the description length of the weights, in: Computational Learning Theory, 1993, pp. 5–13.</p>
      <p>[10] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.</p>
      <p>[11] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural network, in: International Conference on Machine Learning, PMLR, 2015, pp. 1613–1622.</p>
      <p>[12] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, volume 24, 2011, pp. 2348–2356.</p>
      <p>[13] C. Louizos, K. Ullrich, M. Welling, Bayesian compression for deep learning, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 3288–3298.</p>
      <p>[14] D. Rezende, S. Mohamed, Variational inference with normalizing flows, in: International Conference on Machine Learning, PMLR, 2015, pp. 1530–1538.</p>
      <p>[15] D. Barber, C. M. Bishop, Ensemble learning in bayesian neural networks, Nato ASI Series F Computer and Systems Sciences 168 (1998) 215–238.</p>
      <p>[16] R. M. Neal, An improved acceptance procedure for the hybrid monte carlo algorithm, Journal of Computational Physics 111 (1994) 194–203.</p>
      <p>[17] R. M. Neal, Bayesian learning for neural networks, volume 118, Springer Science &amp; Business Media, 2012.</p>
      <p>[18] M. Welling, Y. W. Teh, Bayesian learning via stochastic gradient langevin dynamics, in: International Conference on Machine Learning, 2011, pp. 681–688.</p>
      <p>[19] C. Nemeth, P. Fearnhead, Stochastic gradient markov chain monte carlo, Journal of the American Statistical Association 116 (2021) 433–450.</p>
      <p>[20] T. Salimans, D. P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, in: Advances in Neural Information Processing Systems, volume 29, 2016, pp. 901–909.</p>
      <p>[21] J. Lee, M. Humt, J. Feng, R. Triebel, Estimating model uncertainty of neural networks in sparse information form, in: International Conference on Machine Learning, PMLR, 2020, pp. 5702–5713.</p>
      <p>[22] H. Ritter, A. Botev, D. Barber, A scalable laplace approximation for neural networks, in: International Conference on Learning Representations, volume 6, 2018.</p>
      <p>[23] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 6404–6416.</p>
      <p>[24] H. Guo, H. Liu, R. Li, C. Wu, Y. Guo, M. Xu, Margin &amp; diversity based ordering ensemble pruning, Neurocomputing 275 (2018) 237–246.</p>
      <p>[25] W. G. Martinez, Ensemble pruning via quadratic margin maximization, IEEE Access 9 (2021) 48931–48951.</p>
      <p>[26] J. Lindqvist, A. Olmin, F. Lindsten, L. Svensson, A general framework for ensemble distribution distillation, in: International Workshop on Machine Learning for Signal Processing, IEEE, 2020, pp. 1–6.</p>
      <p>[27] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (2021) 43–76.</p>
      <p>[28] V. Dang, M. Bendersky, W. B. Croft, Two-stage learning to rank for information retrieval, in: European Conference on Information Retrieval, Springer, 2013, pp. 423–434.</p>
      <p>[29] F. A. Khan, A. Gumaei, A. Derhab, A. Hussain, A novel two-stage deep learning model for efficient network intrusion detection, IEEE Access 7 (2019) 30373–30385.</p>
      <p>[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.</p>
      <p>[31] X. Li, Z. Luo, H. Sun, J. Zhang, W. Han, X. Chu, L. Zhang, Q. Zhang, Learning fast matching models from weak annotations, in: Proceedings of the Web Conference, 2019, pp. 2985–2991.</p>
      <p>[32] W. Lu, J. Jiao, R. Zhang, Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval, in: Proceedings of the ACM International Conference on Information &amp; Knowledge Management, 2020, pp. 2645–2652.</p>
      <p>[33] J. Zhu, Y. Cui, Y. Liu, H. Sun, X. Li, M. Pelger, T. Yang, L. Zhang, R. Zhang, H. Zhao, Textgnn: Improving text encoder via graph neural network in sponsored search, in: Proceedings of the Web Conference, 2021, pp. 2848–2857.</p>
  </body>
  <back>
    <ref-list />
  </back>
</article>