From Black-box to White-box: Examining Confidence Calibration under different
      Franziska Schwaiger 1 , Maximilian Henne 1 , Fabian Küppers 2 , Felippe Schmoeller Roza 1 ,
                               Karsten Roscher 1 , Anselm Haselhoff 2
                                 Fraunhofer Institute for Cognitive Systems IKS, Munich, Germany
                                     Ruhr West University of Applied Sciences, Bottrop, Germany
                      {franziska.schwaiger, maximilian.henne, felippe.schmoeller.da.roza}@iks.fraunhofer.de
                                        {fabian.kueppers, anselm.haselhoff}@hs-ruhrwest.de

Abstract

Confidence calibration is a major concern when applying artificial neural networks in safety-critical applications. Since most research in this area has focused on classification in the past, confidence calibration in the scope of object detection has gained more attention only recently. Based on previous work, we study the miscalibration of object detection models with respect to image location and box scale. Our main contribution is to additionally consider the impact of box selection methods like non-maximum suppression on calibration. We investigate the default intrinsic calibration of object detection models and how it is affected by these post-processing techniques. For this purpose, we distinguish between black-box calibration with non-maximum suppression and white-box calibration with raw network outputs. Our experiments reveal that post-processing highly affects confidence calibration. We show that non-maximum suppression has the potential to degrade initially well-calibrated predictions, leading to overconfident and thus miscalibrated models.

1   Introduction

Modern deep neural networks achieve remarkable results on various tasks, but it is a well-known issue that these networks fail to provide reliable estimates about the correctness of predictions in many cases (Niculescu-Mizil and Caruana 2005; Guo et al. 2017). A network outputs a score attached to each prediction that can be interpreted as the probability of correctness. Such a model is well-calibrated if the observed accuracy matches the estimated confidence scores. However, recent work has shown that these confidence scores neither represent the actual observed accuracy in classification (Niculescu-Mizil and Caruana 2005; Naeini, Cooper, and Hauskrecht 2015; Guo et al. 2017) nor the observed precision in object detection (Küppers et al. 2020). Calibrated confidence estimates integrated in safety-critical applications like autonomous driving can provide valuable additional information with respect to situational awareness and can reduce the risk of hazards resulting from functional insufficiencies by decreasing the space of unknown unsafe scenarios, which is a critical part for the safety of the intended functionality (SOTIF ISO/PAS 21448).

In the past, most research in this area has focused on classification (Naeini, Cooper, and Hauskrecht 2015; Kull, Silva Filho, and Flach 2017; Guo et al. 2017; Seo, Seo, and Han 2019; Mukhoti et al. 2020), whereas calibration in object detection has recently gained more attention (Neumann, Zisserman, and Vedaldi 2018; Feng et al. 2019; Küppers et al. 2020). Object detection is a joint task of classification and regression of the predictions’ position and scale. Recent work has shown that the regression branch of object detection models also affects confidence calibration (Küppers et al. 2020). However, the observable detections of a model are commonly processed by non-maximum suppression (NMS) and/or thresholded by a certain confidence score. In this work, our goal is to investigate the influence of such post-processing techniques on the model calibration. For this purpose, we adapt common object detection models and examine their miscalibration before NMS on the one hand (white-box scenario). In this way, we have access to the raw predictions of a network and are thus able to examine the network’s calibration properties by default. On the other hand, we further apply NMS with increasing intersection over union (IoU) thresholds (black-box scenario), which varies the number of boxes that are suppressed. Changing the parameters of the NMS enables us to examine to what extent the models are intrinsically calibrated and how this is affected by post-processing techniques. The problem setting is illustrated in Fig. 1. Furthermore, we use a Faster R-CNN architecture (Ren et al. 2015) that uses the cross-entropy loss during training and compare it to a RetinaNet (Lin et al. 2017) that uses a focal loss. It is already known that models trained with focal loss produce much less confident predictions (Mukhoti et al. 2020). This enables us to further investigate the effect of post-processing methods by comparing the default calibration properties of both model architectures with and without NMS.

This work is structured as follows: we give a review of the current state-of-the-art research in confidence calibration in Section 2. We further give a definition of white-box and
[Figure 1: diagram with the stages Input images, Detector, Detections, and NMS; white-box calibration is attached before NMS and black-box calibration after NMS.]

Figure 1: Typically, non-maximum suppression (NMS) is applied to all detections of a detection model to fuse and reduce redundant bounding boxes. In our work, we investigate how NMS affects confidence calibration. Thus, we study the difference in calibration before NMS (white-box) and afterward (black-box).
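
The suppression step in Figure 1 can be illustrated with a minimal sketch of greedy NMS in plain Python. This is an illustration under stated assumptions, not the implementation used by the detection framework in this work: the function names `iou` and `greedy_nms` are hypothetical, and boxes are assumed to be given as corner coordinates (x1, y1, x2, y2) rather than as center, height, and width.

```python
def iou(a, b):
    """IoU of two boxes in the assumed corner format (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive greedy NMS.

    Boxes are visited in order of decreasing confidence. A box is kept
    only if its IoU with every previously kept box does not exceed
    iou_threshold; otherwise it is treated as redundant and suppressed.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores, iou_threshold=0.5))
    print(greedy_nms(boxes, scores, iou_threshold=0.75))
```

With a higher IoU threshold fewer boxes are suppressed, so the set of detections moves from the black-box setting toward the raw white-box predictions.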

2   Related Work

Numerous methods have been developed in the past to address the miscalibration of neural networks. Among the first post-processing calibration methods were histogram binning (Zadrozny and Elkan 2001), isotonic regression (Zadrozny and Elkan 2002), Bayesian binning (Naeini, Cooper, and Hauskrecht 2015), and Platt scaling (Platt 1999), whereas more recently temperature scaling (Guo et al. 2017), beta calibration (Kull, Silva Filho, and Flach 2017), and Dirichlet calibration (Kull et al. 2019) have been developed to tackle miscalibration in the scope of classification. In object detection models, dealing with miscalibration presents a different set of challenges and was first addressed by (Neumann, Zisserman, and Vedaldi 2018), who proposed an additional model output to be utilized as a regularizing temperature applied to the remaining logits. Recently, (Küppers et al. 2020) have studied the effect of position and scale of detected objects on miscalibration and concluded that calibration also depends on the regression output of a detection model. They further provide a framework to include position and scale information into a calibration mapping.

For the task of classification, a common way to measure miscalibration is to use the expected calibration error (ECE), proposed by (Naeini, Cooper, and Hauskrecht 2015), which uses a binning scheme to measure the gap between observed frequency and average confidence. (Kumar, Liang, and Ma 2019) show in their work that the common ECE underestimates the true calibration error in some cases and provide a differentiable upper bound called maximum mean calibration error (MMCE) that can also be used during model training as a second regularization term. For measuring miscalibration in object detection tasks, an extension of the ECE called detection expected calibration error (D-ECE) was proposed by (Küppers et al. 2020), consisting of a multidimensional binning scheme to assess the miscalibration over all predicted features of an object detection model.

Since the standard cross-entropy loss is prone to favor overly confident predictions, further research directions investigate how to directly obtain well-calibrated models after training. Besides the previously mentioned MMCE, the authors in (Pereyra et al. 2017) introduce a regularization term penalizing highly confident predictions. In contrast, (Müller, Kornblith, and Hinton 2019) show that label smoothing yields good probabilities after training. Recently, (Mukhoti et al. 2020) investigate the effects of focal loss, originally proposed as a loss term for RetinaNet (Lin et al. 2017), on confidence calibration. They show that using focal loss in conjunction with an adaptive parameter significantly improves the confidence calibration of classification models. We also observe that using focal loss prevents overconfident predictions in our experiments on the RetinaNet with standard hyperparameters (Lin et al. 2017). While well-known object detection models like Faster R-CNN (Ren et al. 2015) commonly tend to output overconfident predictions, the probability scores of a RetinaNet rather underestimate the observed frequency.

3   Defining Confidence Calibration for Object Detection Models

In this section, we describe the definition of black-box and white-box calibration. The idea behind this distinction is to analyze the impact of bounding-box post-processing on calibration.

An object detector takes an image as input x and outputs predictions in form of a class label y ∈ Y with corresponding confidence score p ∈ [0, 1] and bounding box r = (c_x, c_y, h, w) ∈ R^J, with (c_x, c_y) being the center position, (h, w) the box height and width and J the size of the used box encoding. The authors in (Küppers et al. 2020) propose a confidence calibration that not only considers the confidence score p but also includes the box information r. The performance of a detector is thus evaluated by matching its predictions (ŷ, p̂, r̂) with the ground-truth annotations, where m = 1 denotes a matched box and m = 0 a mismatch. More formally, perfect calibration in the scope
of object detection is defined by

$$\underbrace{\mathbb{P}(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r)}_{\text{precision given } p, y, r} = \underbrace{p}_{\text{confidence}}, \quad \forall p \in [0, 1],\ y \in \mathcal{Y},\ r \in \mathbb{R}^J. \tag{1}$$

As the detections rarely match the ground-truth perfectly, true positives (TP, m = 1) and false positives (FP, m = 0) are obtained by comparing the IoU to a fixed threshold τ. TP and FP correspond to boxes with IoU ≥ τ and IoU < τ, respectively. The process of inference is commonly followed by a non-maximum suppression since an object detection model outputs a huge amount of mostly less confident and redundant detections. On the one hand, we can apply the definition of calibration given by Equation 1 to the raw outputs of a detector without any post-processing. We denote this case as the white-box calibration case for the remainder of this paper. On the other hand, we can also view the NMS as part of the detector and treat the output of the NMS as our desired calibration target. This is denoted as black-box calibration.

In (Küppers et al. 2020), the detection expected calibration error (D-ECE) is defined as an extension of the commonly used ECE (Naeini, Cooper, and Hauskrecht 2015) for object detection tasks. The D-ECE also includes the box information r by partitioning the space of each variable k into N_k equally spaced bins. The total amount of bins is given by $N_{\text{total}} = \prod_{k=1}^{K} N_k$ and the D-ECE is defined as

$$\text{D-ECE}_K = \sum_{n=1}^{N_{\text{total}}} \frac{|I(n)|}{|D|} \cdot \left| \text{prec}(n) - \text{conf}(n) \right|, \tag{2}$$

where I(n) is the set of all samples in a single bin and |D| the total amount of samples, while prec(n) and conf(n) denote the average precision and confidence within each bin, respectively. We use this metric to measure miscalibration in both cases: For white-box, we consider all possible box predictions whereas for black-box only the winning boxes after NMS are considered. This is explained in more detail in the following section.

4   Experimental Evaluation

In order to analyze the confidence calibration under different conditions, we use the COCO 2017 validation dataset (Lin et al. 2014) with a random split of 70% and 30% for training and testing the calibration, respectively.

Evaluation Protocol

We perform both black-box and white-box calibration by following the evaluation protocol of (Küppers et al. 2020) and use their provided calibration framework. The final calibration results are obtained as an average over 20 independent training and testing results. For inference, we use a pre-trained RetinaNet (Lin et al. 2017) and a Faster R-CNN (Ren et al. 2015) model provided by the Detectron2 framework (Wu et al. 2019). While the classification branch of Faster R-CNN is trained by cross-entropy loss, RetinaNet uses a focal loss that focuses training on hard examples with low confidence. On the other hand, good predictions with high confidence are weighted less during training, which in turn leads to less confident predictions (Lin et al. 2017; Mukhoti et al. 2020). Our experiments are restricted to the predictions of class person.

To study the effect of non-maximum suppression, we apply different IoU thresholds to merge boxes denoted by NMS@{0.5, 0.75, 0.9}. In the white-box case without NMS, we use the raw predictions for measuring and performing calibration on the one hand. On the other hand, we further adopt top-k box selection where only the k bounding boxes with the highest confidence are kept using k = 1000. This is the common case during inference to reduce low-confidence and mostly redundant predictions. Following (Küppers et al. 2020), the predictions of all models are obtained by inference with a probability threshold of 0.3, which means discarding all predictions with a confidence score less than this threshold. As the relative amount of predictions per image with low confidence score is significantly higher than the relative amount of the remaining predictions, this probability threshold ensures that the D-ECE is not dominated by these low-confidence samples.

For confidence calibration, we use multivariate histogram binning (Zadrozny and Elkan 2001; Küppers et al. 2020) as a fast and reliable calibration method. We also evaluate several setups with different subsets of box information to evaluate the effect of the used feature set. We either use the confidence only, also including the box centers (ĉ_x, ĉ_y) or box scales (ĥ, ŵ), or we use all features for measuring and performing calibration. For the histogram-based calibration, we use 15 bins for confidence only, N_k = 5 bins for (p̂, ĉ_x, ĉ_y) and (p̂, ĥ, ŵ), and N_k = 3 when using all available features. In contrast, for D-ECE computation we use 20 bins for confidence only, N_k = 8 bins for (p̂, ĉ_x, ĉ_y) and (p̂, ĥ, ŵ), and N_k = 5 when using all available information. We increase the robustness of the D-ECE calculation by also neglecting bins with fewer than 8 samples (Küppers et al. 2020).

In Tables 2 and 3, the results for black-box and white-box calibration for RetinaNet and Faster R-CNN are presented, respectively. Three different IoU threshold values of τ = {0.5, 0.6, 0.75} are considered to match predictions with ground-truth annotations. In the tables, each cell presents the D-ECE for the baseline (without calibration) and the corresponding D-ECE after histogram-based calibration (HB). Tables 2 and 3 show the results of the black-box models with varying strength of NMS as well as the calibration results for the white-box case without NMS. The D-ECE is evaluated with different additional box information: The first column shows the confidence-only calibration, the second and third columns the calibration with box centers and box scales, and the last columns show the results for the calibration with all box information considered. The best D-ECE scores are highlighted for each set of features and IoU value across all variants.
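
The D-ECE of Equation 2, together with the rule of neglecting sparsely populated bins, can be sketched in plain Python as follows. This is a simplified illustration, not the calibration framework of (Küppers et al. 2020): the names `bin_index` and `d_ece` are hypothetical, every feature is assumed to be rescaled to [0, 1] (for example relative box centers), and |D| is assumed to count all samples, including those that fall into neglected bins.

```python
from collections import defaultdict


def bin_index(value, n_bins):
    """Map a value in [0, 1] to one of n_bins equally spaced bins."""
    idx = int(value * n_bins)
    return min(max(idx, 0), n_bins - 1)


def d_ece(features, matched, bins_per_feature, min_samples=8):
    """Sketch of the detection expected calibration error (Equation 2).

    features:         list of tuples; the first entry is the confidence,
                      further entries are box features scaled to [0, 1]
                      (this scaling is an assumption of the sketch).
    matched:          list of 0/1 labels (1 = matched box, 0 = mismatch).
    bins_per_feature: number of bins N_k for each feature dimension.
    min_samples:      bins with fewer samples are neglected.
    """
    cells = defaultdict(list)
    for feat, m in zip(features, matched):
        key = tuple(bin_index(v, n) for v, n in zip(feat, bins_per_feature))
        cells[key].append((feat[0], m))

    total = len(features)  # assumed |D|: all samples, also neglected ones
    error = 0.0
    for samples in cells.values():
        if len(samples) < min_samples:
            continue  # neglect sparsely populated bins
        conf = sum(p for p, _ in samples) / len(samples)
        prec = sum(m for _, m in samples) / len(samples)
        error += len(samples) / total * abs(prec - conf)
    return error


if __name__ == "__main__":
    # Ten detections with confidence 0.9 of which only five are matched.
    feats = [(0.9,)] * 10
    labels = [1] * 5 + [0] * 5
    print(d_ece(feats, labels, bins_per_feature=(20,)))
```

For the confidence-only setting described above, this sketch would be called with `bins_per_feature=(20,)`; for the setting with box centers, with `(8, 8, 8)` and features ordered as (p̂, ĉ_x, ĉ_y).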
                 (p̂)    (p̂, cx, cy)    (p̂, h, w)    full
NMS@0.5            0          1             29        528
NMS@0.75           0          0             20        485
NMS@0.9            0          0             12        435
Without NMS        0          0              9        414
Baseline          20        256            256       1024

Table 1: Amount of neglected bins within D-ECE calculation of the three black-box models and the white-box model for Faster R-CNN (Ren et al. 2015; Wu et al. 2019). A similar amount of bins is also neglected during the examinations for RetinaNet (Lin et al. 2017; Wu et al. 2019).

[Figure 2 panels: (a) RetinaNet, confidence histogram (top) and reliability diagram (bottom), D-ECE = 23.851%; (b) RetinaNet, D-ECE w.r.t. image location (relative cx, relative cy), D-ECE = 22.963%; (c) Faster R-CNN, confidence histogram (top) and reliability diagram (bottom), D-ECE = 21.444%; (d) Faster R-CNN, D-ECE w.r.t. image location (relative cx, relative cy), D-ECE = 19.268%.]

Figure 2: Confidence histogram and reliability diagram (left) and position-dependent heatmap (right) for RetinaNet (Lin et al. 2017; Wu et al. 2019) (top row) and Faster R-CNN (Ren et al. 2015; Wu et al. 2019) (bottom row) after white-box calibration and then further application of non-maximum suppression with NMS@0.5.

For Faster R-CNN, we observe that the white-box model is better calibrated by default than the black-box models in most cases. In contrast, we observe the opposite behavior for the RetinaNet model. Therefore, we further study the calibration properties of those networks by inspecting their reliability diagrams shown in Fig. 3 for the black-box and white-box cases. The RetinaNet white-box model without NMS offers underconfident predictions, which is a known property of models trained by focal loss (Lin et al. 2017). After NMS, a particular behavior can be observed in Fig. 3e with overconfident predictions in the low confidence interval (p̂ < 0.5) and underconfident predictions in the high confidence interval (p̂ > 0.5). Also, when comparing the calibrated results shown in Fig. 3b and 3f, it is evident that the calibration for the white-box model leads to a better D-ECE score (0.981% versus 1.210%). In contrast, Faster R-CNN outputs reasonably well calibrated predictions before NMS (D-ECE = 4.198%, Fig. 3c) but is highly overconfident after NMS (D-ECE = 20.527%, Fig. 3g). Again, we observe that the white-box D-ECE score is better than that of the black-box model after calibration has been applied (0.861% versus 1.615%, Fig. 3d and 3h).

We also study the effect of position-dependent miscalibration as in (Küppers et al. 2020), shown in Fig. 4. We compare the white-box and black-box models before and after calibration for each object detector. These figures allow us to analyze whether calibration is influenced by the position of predicted bounding boxes. All images show a tendency of higher miscalibration close to the borders. That may be caused by the difficulty of correctly detecting objects which are cropped out of the frame. However, this is of minor relevance considering that most of the positional discrepancies are mitigated after calibration in all cases.

As shown in Tables 2 and 3, the calibration for the white-box model performs better than the calibration for the black-box model for the first and second columns. The opposite happens when including the box scales into the computation of the D-ECE. Here, the black-box model with NMS@0.5 provides the best results. A possible explanation for this observation could be that by increasing the NMS value, the number of samples also increases from 4,229 and 4,496 to 117,292 and 37,355 for RetinaNet and Faster R-CNN, respectively. As expected, the more we go in the white-box direction, the fewer predictions are discarded. Having more samples for the miscalibration computation also means that there are possibly more samples within each bin, leading to a more robust miscalibration estimation (Kumar, Liang, and Ma 2019). As previously mentioned, bins with fewer than 8 samples are neglected for the computation of the D-ECE. The total amount of neglected bins for each configuration is listed in Table 1. Especially when using all available information for calibration (full case), more and more bins are left out when going from white-box (bottom) to black-box (top), resulting in fewer bins contributing to the miscalibration score.

A critical question is how to integrate white-box calibration into the object detection pipeline. As demonstrated in the previous results, NMS has a significant impact on the calibration, affecting the precision as well as the confidence scores of the detections. It has been shown that NMS has the potential to degrade the calibration results. Therefore, we investigate the calibration properties of the detection models that are processed by an NMS with histogram-based calibration beforehand. The results are shown in Fig. 2: It can be seen that calibrating before NMS leads to higher miscalibration once NMS is applied. The confidence is calibrated with respect to the detections before NMS, but NMS also affects the precision, so the detection model becomes overconfident in both cases. In order to preserve good calibrations from the white-box method, alternative box suppression methods should be investigated. One option would be to integrate the confidence calibration with the box merging strategies compared by (Roza et al. 2020), such as weighted box fusion and variance voting, and test how such methods influence the model calibration.
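
The histogram-based calibration used throughout these experiments can be sketched as a lookup table over multidimensional bins. The following is a minimal illustration under assumptions, not the framework used in this work: the class name `HistogramBinning` and its methods are hypothetical, features are assumed to lie in [0, 1] with the confidence as first entry, and bins that received no training samples fall back to the uncalibrated confidence.

```python
class HistogramBinning:
    """Sketch of multivariate histogram binning for detection confidences."""

    def __init__(self, bins_per_feature):
        # Number of equally spaced bins N_k per feature dimension.
        self.bins_per_feature = tuple(bins_per_feature)
        self.bin_precision = {}

    def _key(self, feat):
        # Bin index of every feature value; values are assumed in [0, 1].
        return tuple(
            min(max(int(v * n), 0), n - 1)
            for v, n in zip(feat, self.bins_per_feature)
        )

    def fit(self, features, matched):
        """Store the observed precision (share of matched boxes) per bin."""
        counts, hits = {}, {}
        for feat, m in zip(features, matched):
            key = self._key(feat)
            counts[key] = counts.get(key, 0) + 1
            hits[key] = hits.get(key, 0) + m
        self.bin_precision = {k: hits[k] / counts[k] for k in counts}
        return self

    def transform(self, features):
        """Replace each confidence by the precision of its bin.

        Bins never seen during fit keep the original confidence
        (a fallback chosen for this sketch).
        """
        return [
            self.bin_precision.get(self._key(feat), feat[0])
            for feat in features
        ]


if __name__ == "__main__":
    # Features are (confidence, relative cx); three of four boxes match.
    train = [(0.9, 0.5)] * 4
    labels = [1, 1, 1, 0]
    calibrator = HistogramBinning(bins_per_feature=(15, 5)).fit(train, labels)
    print(calibrator.transform([(0.9, 0.5), (0.1, 0.5)]))
```

In the setup examined in Fig. 2, such a map would be fitted on the raw white-box predictions and applied to the confidence scores before NMS selects the final boxes.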
[Figure 3 panels, each with a confidence histogram (% of samples) above a reliability diagram (precision):
RetinaNet: (a) uncalibrated white-box model, D-ECE = 22.913%; (b) calibrated white-box model, D-ECE = 0.981%; (e) uncalibrated black-box model, D-ECE = 10.350%; (f) calibrated black-box model, D-ECE = 1.210%.
Faster R-CNN: (c) uncalibrated white-box model, D-ECE = 4.198%; (d) calibrated white-box model, D-ECE = 0.861%; (g) uncalibrated black-box model, D-ECE = 20.527%; (h) calibrated black-box model, D-ECE = 1.615%.]

Figure 3: Confidence histograms and reliability diagrams of the miscalibration for RetinaNet (left) and Faster R-CNN (right) black-box (NMS@0.5) and white-box (without NMS) models with IoU@0.6 before and after histogram-based calibration.

[Figure 4 panels, each a heatmap of the D-ECE over relative cx (horizontal axis, 0.0 to 1.0) and relative cy (vertical axis, 0.0 to 1.0), titled "Default D-ECE" for uncalibrated and "After Histogram Binning" for calibrated models, with a color scale from 0 to 5 (×1e-1):
RetinaNet: (a) uncalibrated white-box model, D-ECE = 22.992%; (b) calibrated white-box model, D-ECE = 5.671%; (e) uncalibrated black-box model, D-ECE = 10.894%; (f) calibrated black-box model, D-ECE = 6.620%.
Faster R-CNN: (c) uncalibrated white-box model, D-ECE = 7.631%; (d) calibrated white-box model, D-ECE = 5.998%; (g) uncalibrated black-box model, D-ECE = 15.975%; (h) calibrated black-box model, D-ECE = 7.156%.]

Figure 4: Position-dependent miscalibration of the RetinaNet (left) and Faster R-CNN (right) black-box (NMS@0.5) and white-box (without NMS) models with IoU@0.6, before and after the histogram-based calibration.
                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline    16.200      12.004         14.963      13.478
                   HB           1.636       6.335          2.775       5.391
        IoU@0.6    Baseline    20.862      15.303         18.743      16.546
                   HB           1.673       6.091          2.991       5.743
        IoU@0.75   Baseline    31.659      24.684         28.864      25.765
                   HB           1.436       6.095          3.110       5.704
        (a) Black-box calibration with NMS@0.5, |D| = 4,229.

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline    15.448      15.750         14.246      12.586
                   HB           1.388       6.123          4.252       6.665
        IoU@0.6    Baseline     3.435       7.486          6.710       7.064
                   HB           1.441       6.273          4.192       6.444
        IoU@0.75   Baseline    20.980      20.840         20.041      17.504
                   HB           1.227       4.847          3.315       4.974
        (b) Black-box calibration with NMS@0.75, |D| = 7,923.

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline    30.748      30.672         30.436      29.427
                   HB           1.212       5.290          3.671       6.686
        IoU@0.6    Baseline    21.773      21.954         21.612      21.350
                   HB           1.195       5.717          3.981       7.647
        IoU@0.75   Baseline     3.057       6.907          8.489      10.143
                   HB           1.367       5.675          4.468       7.847
        (c) Black-box calibration with NMS@0.9, |D| = 20,005.

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline    28.027      28.127         28.014      28.176
                   HB           0.855       4.947          2.895       6.114
        IoU@0.6    Baseline    23.097      23.290         23.118      23.482
                   HB           1.033       5.331          3.306       6.912
        IoU@0.75   Baseline     8.487      10.190         10.190      11.992
                   HB           1.132       5.892          4.207       8.266
        (d) White-box calibration without NMS, |D| = 117,292.

Table 2: D-ECE results (%) for RetinaNet (Lin et al. 2017; Wu et al. 2019) at different IoU thresholds. Each column reports the
baseline D-ECE and the D-ECE after histogram-based (HB) calibration for one subset of features: confidence only (p̂),
confidence with box centers (p̂, cx, cy), confidence with box scales (p̂, h, w), or all features (full). The D-ECE scores of
different columns should not be compared with each other, since each column uses a different subset of the data for D-ECE
measurement and calibration.
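
To make the structure of the table entries more concrete, the following minimal Python sketch computes a D-ECE over an arbitrary feature subset and applies histogram binning on the same bins. It is an illustration under stated assumptions, not the implementation used for the reported results: all function names are hypothetical, features are assumed to be normalized to [0, 1], bins are equal-width, and the toy data is invented for the example. For brevity the sketch fits and evaluates on the same detections, whereas calibration is typically fitted on a separate split.

```python
from collections import defaultdict


def bin_index(features, n_bins):
    """Map a tuple of features in [0, 1] to a tuple of equal-width bin indices."""
    return tuple(min(int(f * n_bins), n_bins - 1) for f in features)


def d_ece(features, matched, n_bins=5):
    """Size-weighted |precision - mean confidence| over all occupied bins.

    features: list of tuples; the first entry is the confidence p_hat,
              optionally followed by normalized box features (cx, cy, h, w).
    matched:  list of 0/1 flags; 1 if the detection matches a ground-truth
              box at the chosen IoU threshold.
    """
    bins = defaultdict(list)
    for feat, m in zip(features, matched):
        bins[bin_index(feat, n_bins)].append((feat[0], m))
    total = len(features)
    error = 0.0
    for members in bins.values():
        conf = sum(p for p, _ in members) / len(members)
        prec = sum(m for _, m in members) / len(members)
        error += len(members) / total * abs(prec - conf)
    return error


def fit_histogram_binning(features, matched, n_bins=5):
    """Learn one calibrated confidence (the bin precision) per occupied bin."""
    hits = defaultdict(list)
    for feat, m in zip(features, matched):
        hits[bin_index(feat, n_bins)].append(m)
    return {b: sum(ms) / len(ms) for b, ms in hits.items()}


def apply_histogram_binning(table, features, n_bins=5):
    """Replace each confidence by its bin precision; unseen bins stay unchanged."""
    calibrated = []
    for feat in features:
        p = table.get(bin_index(feat, n_bins), feat[0])
        calibrated.append((p,) + tuple(feat[1:]))
    return calibrated


# Toy data (invented): (p_hat, cx) per detection and its match flag.
feats = [(0.9, 0.1), (0.9, 0.1), (0.9, 0.9), (0.9, 0.9)]
matched = [1, 1, 0, 0]

conf_only = [(f[0],) for f in feats]          # the "(p_hat)" column
table = fit_histogram_binning(feats, matched)  # position-aware calibration
calibrated = apply_histogram_binning(table, feats)

print("confidence-only D-ECE:", d_ece(conf_only, matched))
print("position-aware D-ECE: ", d_ece(feats, matched))
print("after HB:             ", d_ece(calibrated, matched))
```

On this toy data the confidence-only D-ECE is about 0.4 and the position-aware D-ECE about 0.5, which drops to 0 after position-aware histogram binning. Because the binning scheme changes with the feature subset, the scores of different feature subsets are not directly comparable.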

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline     7.781       9.060          7.829       7.168
                   HB           1.789       6.186          2.947       4.960
        IoU@0.6    Baseline     9.370      10.041          9.033       7.810
                   HB           1.564       6.075          3.105       5.142
        IoU@0.75   Baseline    31.659      24.684         28.864      25.765
                   HB           1.436       6.095          3.110       5.704
        (a) Black-box calibration with NMS@0.5, |D| = 4,496.

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline     7.597       9.927         10.828       9.804
                   HB           1.523       6.968          3.778       6.134
        IoU@0.6    Baseline    16.100      15.226         16.417      14.933
                   HB           1.343       6.323          3.490       5.610
        IoU@0.75   Baseline    34.634      32.535         31.861      27.883
                   HB           1.123       4.878          3.018       4.996
        (b) Black-box calibration with NMS@0.75, |D| = 7,231.

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline     7.323      10.431         10.042      10.318
                   HB           1.354       6.697          4.062       7.121
        IoU@0.6    Baseline     7.499      10.328         11.622      11.630
                   HB           1.184       6.383          4.050       7.141
        IoU@0.75   Baseline    25.689      25.539         25.792      25.002
                   HB           1.139       5.478          4.126       6.908
        (c) Black-box calibration with NMS@0.9, |D| = 17,742.

                                (p̂)     (p̂, cx, cy)    (p̂, h, w)     full
        IoU@0.5    Baseline     6.914       9.619          8.638      10.061
                   HB           1.038       5.234          3.206       6.239
        IoU@0.6    Baseline     4.592       7.720          8.540       9.548
                   HB           1.099       5.523          3.603       6.959
        IoU@0.75   Baseline    13.067      13.883         15.658      16.462
                   HB           0.999       5.996          4.505       8.652
        (d) White-box calibration without NMS, |D| = 37,355.

Table 3: D-ECE results (%) for Faster R-CNN (Ren et al. 2015; Wu et al. 2019) before and after histogram-based (HB)
calibration using different IoU thresholds for NMS. The structure of this table is comparable to Tab. 2.
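
The black-box settings in Tab. 2 and Tab. 3 differ only in the IoU threshold used for NMS, and the number of remaining detections |D| grows with that threshold. The short Python sketch below illustrates this effect with a standard greedy NMS. This is an assumption made for illustration, since the exact NMS implementation of the detectors is not described here; the function names, the (x1, y1, x2, y2) box format and the toy boxes are likewise hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return the indices kept by greedy NMS, ordered by descending score.

    A box is suppressed if its IoU with any already kept, higher-scoring
    box exceeds iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Toy detections (invented): three overlapping boxes and one isolated box.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 0, 10, 6), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7, 0.6]

for threshold in (0.5, 0.75, 0.9):
    kept = greedy_nms(boxes, scores, threshold)
    print(f"NMS@{threshold}: {len(kept)} detections kept")
```

In this example a higher NMS threshold suppresses fewer overlapping boxes, so more detections reach the calibration step. Omitting NMS entirely (the white-box setting) corresponds to keeping every box.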
