<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Black-box to White-box: Examining Confidence Calibration under different Conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Franziska Schwaiger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Henne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Küppers</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felippe Schmoeller Roza</string-name>
          <email>felippe.schmoeller.da.roza@iks.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karsten Roscher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anselm Haselhoff</string-name>
          <email>anselm.haselhoff@hs-ruhrwest.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Cognitive Systems IKS</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ruhr West University of Applied Sciences</institution>
          ,
          <addr-line>Bottrop</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Confidence calibration is a major concern when applying artificial neural networks in safety-critical applications. Since most research in this area has focused on classification in the past, confidence calibration in the scope of object detection has gained more attention only recently. Based on previous work, we study the miscalibration of object detection models with respect to image location and box scale. Our main contribution is to additionally consider the impact of box selection methods like non-maximum suppression on calibration. We investigate the default intrinsic calibration of object detection models and how it is affected by these post-processing techniques. For this purpose, we distinguish between black-box calibration with non-maximum suppression and white-box calibration with raw network outputs. Our experiments reveal that post-processing highly affects confidence calibration. We show that non-maximum suppression has the potential to degrade initially well-calibrated predictions, leading to overconfident and thus miscalibrated models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Modern deep neural networks achieve remarkable results on various tasks, but it is a well-known issue that these networks in many cases fail to provide reliable estimates about the correctness of their predictions
        <xref ref-type="bibr" rid="ref15 ref2">(Niculescu-Mizil and Caruana 2005; Guo et al. 2017)</xref>
        . A network outputs a score attached to each prediction that can be interpreted as the probability of correctness. Such a model is well-calibrated if the observed accuracy matches the estimated confidence scores. However, recent work has shown that these confidence scores neither represent the actual observed accuracy in classification
        <xref ref-type="bibr" rid="ref13 ref15 ref2">(Niculescu-Mizil and Caruana 2005; Naeini, Cooper, and Hauskrecht 2015; Guo et al. 2017)</xref>
        nor the observed precision in object detection (Küppers et al. 2020). Calibrated confidence estimates integrated into safety-critical applications like autonomous driving can provide valuable additional information with respect to situational awareness and can reduce the risk of hazards resulting from functional insufficiencies by decreasing the space of unknown unsafe scenarios, which is a critical part of the safety of the intended functionality (SOTIF, ISO/PAS 21448).
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>
        In the past, most research in this area has focused on classification
        <xref ref-type="bibr" rid="ref11 ref13 ref2 ref22 ref5">(Naeini, Cooper, and Hauskrecht 2015; Kull, Silva Filho, and Flach 2017; Guo et al. 2017; Seo, Seo, and Han 2019; Mukhoti et al. 2020)</xref>
        , whereas calibration in object detection has only recently gained more attention
        <xref ref-type="bibr" rid="ref1 ref14">(Neumann, Zisserman, and Vedaldi 2018; Feng et al. 2019; Küppers et al. 2020)</xref>
        . Object detection is a joint task of classification and regression of the predictions’ position and scale. Recent work has shown that the regression branch of object detection models also affects confidence calibration (Küppers et al. 2020). However, the observable detections of a model are commonly processed by non-maximum suppression (NMS) and/or thresholded by a certain confidence score. In this work, our goal is to investigate the influence of such post-processing techniques on model calibration. For this purpose, we adapt common object detection models and examine their miscalibration before NMS on the one hand (white-box scenario). In this way, we have access to the raw predictions of a network and are thus able to examine the network’s default calibration properties. On the other hand, we further apply NMS with increasing intersection over union (IoU) thresholds (black-box scenario), which varies the number of boxes that are suppressed. Changing the parameters of the NMS enables us to examine to what extent the models are intrinsically calibrated and how this is affected by post-processing techniques. An illustrative representation of the problem setting is given in Fig. 1. Furthermore, we use a Faster R-CNN architecture
        <xref ref-type="bibr" rid="ref19">(Ren et al. 2015)</xref>
        that uses the cross-entropy loss during training and compare it to a RetinaNet
        <xref ref-type="bibr" rid="ref9">(Lin et al. 2017)</xref>
        that uses a focal loss. It is already known that models trained with focal loss produce much less confident predictions
        <xref ref-type="bibr" rid="ref11">(Mukhoti et al. 2020)</xref>
        . This enables us to further investigate the effect of post-processing methods by comparing the default calibration properties of both model architectures with and without NMS.
      </p>
      <p>This work is structured as follows: we review the current state-of-the-art research in confidence calibration in Section 2. We then define white-box and black-box calibration and describe our calibration targets in Section 3. In Section 4 we present our experimental results, and in Section 5 we discuss our findings.</p>
      <p>[Fig. 1: Input images are passed to the detector. White-box calibration operates on the raw detections before NMS; black-box calibration operates on the final detections after NMS.]</p>
      <sec id="sec-1-4">
        <title>2 Related Work</title>
        <p>
          Numerous methods have been developed in the past to address the miscalibration of neural networks. Among the first post-processing calibration methods were histogram binning
          <xref ref-type="bibr" rid="ref24">(Zadrozny and Elkan 2001)</xref>
          , isotonic regression
          <xref ref-type="bibr" rid="ref25">(Zadrozny and Elkan 2002)</xref>
          , Bayesian binning
          <xref ref-type="bibr" rid="ref13">(Naeini, Cooper, and Hauskrecht 2015)</xref>
          , and Platt scaling
          <xref ref-type="bibr" rid="ref17">(Platt 1999)</xref>
          , whereas more recently temperature scaling
          <xref ref-type="bibr" rid="ref2">(Guo et al. 2017)</xref>
          , beta calibration
          <xref ref-type="bibr" rid="ref5">(Kull, Silva Filho, and
Flach 2017)</xref>
          , and Dirichlet calibration
          <xref ref-type="bibr" rid="ref4">(Kull et al. 2019)</xref>
          have
been developed to tackle miscalibration in the scope of
classification. In object detection models, dealing with
miscalibration presents a different set of challenges and was first
addressed by
          <xref ref-type="bibr" rid="ref14">(Neumann, Zisserman, and Vedaldi 2018)</xref>
          , who
proposed an additional model output to be utilized as a
regularizing temperature applied to the remaining logits.
Recently, (Küppers et al. 2020) studied the effect of position and scale of detected objects on miscalibration and concluded that calibration also depends on the regression output of a detection model. They further provide a framework to include position and scale information into a calibration mapping.
        </p>
        <p>
          For the task of classification, a common way to measure miscalibration is the expected calibration error (ECE), proposed by
          <xref ref-type="bibr" rid="ref13">(Naeini, Cooper, and Hauskrecht 2015)</xref>
          , which uses a binning scheme to measure the gap between observed frequency and average confidence.
          <xref ref-type="bibr" rid="ref6">(Kumar, Liang, and Ma 2019)</xref>
          show in their work that the common ECE underestimates the true calibration error in some cases, and provide a differentiable upper bound called the maximum mean calibration error (MMCE) that can also be used during model training as a second regularization term. For measuring miscalibration in object detection tasks, an extension of the ECE called the detection expected calibration error (D-ECE) was proposed by (Küppers et al. 2020); it uses a multidimensional binning scheme to assess the miscalibration over all predicted features of an object detection model.
        </p>
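As a minimal illustration of the binning scheme behind the ECE (our own sketch, not the implementation used in the cited works; the function name and equal-width binning are assumptions):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Classification ECE: weighted average gap between observed accuracy
    and mean confidence over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # map each confidence to an equal-width bin index in [0, n_bins - 1]
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples
    return ece
```

For example, ten predictions at confidence 0.9 of which only five are correct yield an ECE of 0.4, while ten predictions at confidence 0.8 with eight correct are perfectly calibrated.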
        <p>
          Since the standard cross-entropy loss is prone to favor overly confident predictions, further research directions investigate how to directly obtain well-calibrated models after training. Besides the previously mentioned MMCE, the authors in
          <xref ref-type="bibr" rid="ref16">(Pereyra et al. 2017)</xref>
          introduce a regularization term penalizing highly confident predictions. In contrast,
          <xref ref-type="bibr" rid="ref12">(Müller, Kornblith, and Hinton 2019)</xref>
          show that label smoothing yields good probabilities after training. Recently,
          <xref ref-type="bibr" rid="ref11">(Mukhoti et al. 2020)</xref>
          investigated the effects of focal loss, originally proposed as a loss term for RetinaNet
          <xref ref-type="bibr" rid="ref9">(Lin et al. 2017)</xref>
          , on confidence calibration. They show that using focal loss in conjunction with an adaptive parameter significantly improves the confidence calibration of classification models. In our experiments, we also observe that using focal loss prevents overconfident predictions on the RetinaNet with standard hyperparameters
          <xref ref-type="bibr" rid="ref9">(Lin et al. 2017)</xref>
          . While well-known object detection models like Faster R-CNN
          <xref ref-type="bibr" rid="ref19">(Ren et al. 2015)</xref>
          commonly tend to output overconfident predictions, the probability scores of a RetinaNet rather underestimate the observed frequency.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Defining Confidence Calibration for Object Detection Models</title>
      <p>In this section, we define black-box and white-box calibration. The idea behind this distinction is to analyze the impact of bounding-box post-processing on calibration.</p>
      <p>An object detector takes an image x as input and outputs predictions in the form of a class label y ∈ Y with corresponding confidence score p ∈ [0, 1] and bounding box r = (c_x, c_y, h, w) ∈ ℝ^J, with (c_x, c_y) being the center position, (h, w) the box height and width, and J the size of the used box encoding. The authors in (Küppers et al. 2020) propose a confidence calibration that not only considers the confidence score p but also includes the box information r. The performance of a detector is thus evaluated by matching its predictions (ŷ, p̂, r̂) with the ground-truth annotations, where m = 1 denotes a matched box and m = 0 a mismatch. More formally, perfect calibration in the scope of object detection is defined by</p>
      <p>P(M = 1 | P̂ = p, Ŷ = y, R̂ = r) = p,  ∀ p ∈ [0, 1], y ∈ Y, r ∈ ℝ^J,  (1)</p>
      <p>where the left-hand side is the precision given (p, y, r) and the right-hand side p is the estimated confidence.</p>
      <p>As the detections rarely match the ground truth perfectly, true positives (TP, m = 1) and false positives (FP, m = 0) are obtained by comparing the IoU to a fixed threshold τ: TP and FP correspond to boxes with IoU ≥ τ and IoU &lt; τ, respectively. The process of inference is commonly followed by a non-maximum suppression, since an object detection model outputs a huge amount of mostly low-confidence and redundant detections. On the one hand, we can apply the definition of calibration given by Equation 1 to the raw outputs of a detector without any post-processing. We denote this case as white-box calibration for the remainder of this paper. On the other hand, we can also view the NMS as part of the detector and treat the output of the NMS as our desired calibration target. This is denoted as black-box calibration.</p>
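To make the TP/FP assignment concrete, the following sketch (our own illustration, not the code of the cited framework; all names are assumptions) greedily matches detections, in order of confidence, to ground-truth boxes via the IoU threshold, using the (c_x, c_y, h, w) box encoding from above:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (cx, cy, h, w)."""
    def to_corners(cx, cy, h, w):
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    ax1, ay1, ax2, ay2 = to_corners(*box_a)
    bx1, by1, bx2, by2 = to_corners(*box_b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truth, iou_threshold=0.5):
    """Assign m = 1 (TP) or m = 0 (FP) to each (confidence, box) detection
    by greedily matching against unused ground-truth boxes."""
    matched, used = [], set()
    for conf, box in sorted(detections, key=lambda d: -d[0]):
        best_iou, best_gt = 0.0, None
        for i, gt_box in enumerate(ground_truth):
            if i in used:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_gt = overlap, i
        if best_iou >= iou_threshold:
            used.add(best_gt)
            matched.append((conf, box, 1))  # TP: IoU >= threshold
        else:
            matched.append((conf, box, 0))  # FP: IoU < threshold
    return matched
```

A detection exactly on a ground-truth box gets m = 1, while a non-overlapping one gets m = 0.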
      <p>
        In (Küppers et al. 2020), the detection expected calibration error (D-ECE) is defined as an extension of the commonly used ECE
        <xref ref-type="bibr" rid="ref13">(Naeini, Cooper, and Hauskrecht 2015)</xref>
        for object detection tasks. The D-ECE also includes the box information r by partitioning the space of each variable k into N_k equally spaced bins. The total amount of bins is given by N_total = ∏_{k=1}^{K} N_k, and the D-ECE is defined as
      </p>
      <p>D-ECE_K = Σ_{n=1}^{N_total} (|I(n)| / |D|) · |prec(n) − conf(n)|,  (2)</p>
      <p>where I(n) is the set of all samples in a single bin and |D| the total amount of samples, while prec(n) and conf(n) denote the average precision and confidence within each bin, respectively. We use this metric to measure miscalibration in both cases: for the white-box case, we consider all possible box predictions, whereas for the black-box case only the winning boxes after NMS are considered. This is explained in more detail in the following section.</p>
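A minimal sketch of Equation 2 (our own illustration, assuming the confidence and box features are pre-scaled to [0, 1], an equal number of bins per dimension, and normalization over the retained samples when sparse bins are neglected):

```python
import numpy as np

def d_ece(features, matched, bins_per_dim, min_samples=8):
    """D-ECE sketch: features is an (N, K) array whose first column is the
    confidence p and whose remaining columns are box features, each scaled
    to [0, 1]; matched holds m in {0, 1} per detection."""
    features = np.asarray(features, dtype=float)
    matched = np.asarray(matched, dtype=float)
    n, k = features.shape
    # flat index of the K-dimensional bin each sample falls into
    idx = np.zeros(n, dtype=int)
    for d in range(k):
        b = np.minimum((features[:, d] * bins_per_dim).astype(int),
                       bins_per_dim - 1)
        idx = idx * bins_per_dim + b
    total, err = 0, 0.0
    for bin_id in np.unique(idx):
        mask = idx == bin_id
        count = int(mask.sum())
        if count < min_samples:  # neglect sparsely populated bins
            continue
        prec = matched[mask].mean()      # precision within the bin
        conf = features[mask, 0].mean()  # average confidence within the bin
        err += count * abs(prec - conf)
        total += count
    # weight |I(n)| / |D| restricted to retained samples (a choice we make)
    return err / total if total else 0.0
```

With confidence as the only feature, this reduces to the ECE over precision instead of accuracy: ten detections at confidence 0.9 of which five are matched give a D-ECE of 0.4.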
    </sec>
    <sec id="sec-4">
      <title>4 Experimental Evaluation</title>
      <p>
        In order to analyze the confidence calibration under different
conditions, we use the COCO 2017 validation dataset
        <xref ref-type="bibr" rid="ref8">(Lin
et al. 2014)</xref>
        with a random split of 70% and 30% for training
and testing the calibration, respectively.
      </p>
      <sec id="sec-4-1">
        <title>Evaluation Protocol</title>
        <p>
          We perform both black-box and white-box calibration by following the evaluation protocol of (Küppers et al. 2020) and use their provided calibration framework. The final calibration results are obtained as an average over 20 independent training and testing runs. For inference, we use a pretrained RetinaNet
          <xref ref-type="bibr" rid="ref9">(Lin et al. 2017)</xref>
          and a Faster R-CNN
          <xref ref-type="bibr" rid="ref19">(Ren et al. 2015)</xref>
          model provided by the Detectron2 framework
          <xref ref-type="bibr" rid="ref23">(Wu et al. 2019)</xref>
          . While the classification branch of the latter model is trained with a cross-entropy loss, the former uses a focal loss, which makes it possible to focus on hard, low-confidence examples during training. Good predictions with high confidence, on the other hand, receive less weight during training, which in turn leads to less confident predictions
          <xref ref-type="bibr" rid="ref11 ref9">(Lin et al. 2017; Mukhoti et al. 2020)</xref>
          . Our experiments are restricted to predictions of the class person.
        </p>
        <p>To study the effect of non-maximum suppression, we apply different IoU thresholds to merge boxes, denoted by NMS@{0.5, 0.75, 0.9}. In the white-box case without NMS, we use the raw predictions for measuring and performing calibration on the one hand. On the other hand, we further adopt top-k box selection, where only the k = 1000 bounding boxes with the highest confidence are kept. This is the common case during inference to reduce low-confidence and mostly redundant predictions. Following (Küppers et al. 2020), the predictions of all models are obtained by inference with a probability threshold of 0.3, which means discarding all predictions with a confidence score below this threshold. As the relative amount of predictions per image with a low confidence score is significantly higher than the relative amount of the remaining predictions, this probability threshold ensures that the D-ECE is not dominated by these low-confidence samples.</p>
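The post-processing steps above can be sketched as greedy NMS with the confidence threshold of 0.3 and the top-k limit folded in (our own illustration, assuming corner-format boxes; names and signatures are not from the cited frameworks):

```python
def iou(a, b):
    # IoU of two boxes in (x1, y1, x2, y2) corner format
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5, score_threshold=0.3, top_k=1000):
    """Greedy NMS: keep the highest-scoring box, suppress every remaining
    box whose IoU with an already-kept box exceeds iou_threshold."""
    candidates = sorted((d for d in detections if d[0] >= score_threshold),
                        key=lambda d: -d[0])
    kept = []
    for conf, box in candidates:
        if all(iou(box, kb) <= iou_threshold for _, kb in kept):
            kept.append((conf, box))
        if len(kept) == top_k:
            break
    return kept
```

Raising the IoU threshold (NMS@0.75, NMS@0.9) suppresses fewer boxes, moving the output toward the white-box case.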
        <p>
          For confidence calibration, we use multivariate histogram binning
          <xref ref-type="bibr" rid="ref24">(Zadrozny and Elkan 2001; Küppers et al. 2020)</xref>
          as a fast and reliable calibration method. We also evaluate several setups with different subsets of box information to evaluate the effect of the used feature set: we either use the confidence only, additionally include the box centers (ĉ_x, ĉ_y) or box scales (ĥ, ŵ), or use all features for measuring and performing calibration. For the histogram-based calibration, we use 15 bins for confidence only, N_k = 5 bins for (p̂, ĉ_x, ĉ_y) and (p̂, ĥ, ŵ), and N_k = 3 when using all available features. In contrast, for the D-ECE computation we use 20 bins for confidence only, N_k = 8 bins for (p̂, ĉ_x, ĉ_y) and (p̂, ĥ, ŵ), and N_k = 5 when using all available information. We increase the robustness of the D-ECE calculation by neglecting bins with fewer than 8 samples (Küppers et al. 2020).
        </p>
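The multivariate histogram binning used here can be sketched as follows (our own illustration under the same [0, 1]-scaled feature assumption as above; the class name and the fallback to the raw confidence for bins unseen during fitting are our choices, not part of the cited framework):

```python
import numpy as np

class HistogramBinningCalibrator:
    """Multivariate histogram binning: a calibrated confidence is the
    observed precision of the training bin a sample falls into."""

    def __init__(self, bins_per_dim=5):
        self.bins_per_dim = bins_per_dim
        self.precision = {}

    def _bin_ids(self, features):
        features = np.asarray(features, dtype=float)
        idx = np.zeros(len(features), dtype=int)
        for d in range(features.shape[1]):
            b = np.minimum((features[:, d] * self.bins_per_dim).astype(int),
                           self.bins_per_dim - 1)
            idx = idx * self.bins_per_dim + b
        return idx

    def fit(self, features, matched):
        # store per-bin precision (fraction of matched boxes, m = 1)
        matched = np.asarray(matched, dtype=float)
        idx = self._bin_ids(features)
        for bin_id in np.unique(idx):
            self.precision[bin_id] = matched[idx == bin_id].mean()
        return self

    def predict(self, features):
        # fall back to the raw confidence for bins unseen during fitting
        idx = self._bin_ids(features)
        conf = np.asarray(features, dtype=float)[:, 0]
        return np.array([self.precision.get(i, c) for i, c in zip(idx, conf)])
```

Fitting on detections at confidence 0.9 of which only half are matched maps that bin's confidence down to the observed precision of 0.5.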
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>In Tables 2 and 3, the results for black-box and white-box calibration are presented for RetinaNet and Faster R-CNN, respectively. Three different IoU threshold values τ = {0.5, 0.6, 0.75} are considered to match predictions with ground-truth annotations. In the tables, each cell presents the D-ECE for the baseline (without calibration) and the corresponding D-ECE after histogram-based calibration (HB). Tables 2 and 3 show the results of the black-box models with varying strength of NMS as well as the calibration results for the white-box case without NMS.</p>
        <p>The D-ECE is evaluated with different additional box information: the first column shows the confidence-only calibration, the second and third columns the calibration with box centers and box scales, and the last columns show the results for the calibration with all box information considered. The best D-ECE scores are highlighted for each set of features and IoU value across all variants.</p>
        <p>
          For Faster R-CNN, we observe that the white-box model is consistently better calibrated by default than the black-box models in most cases. In contrast, we observe the opposite behavior for the RetinaNet model. Therefore, we further study the calibration properties of those networks by inspecting their reliability diagrams, shown in Fig. 3 for the black-box and white-box cases. The RetinaNet white-box model without NMS yields underconfident predictions, which is a known property of models trained by focal loss
          <xref ref-type="bibr" rid="ref9">(Lin et al. 2017)</xref>
          . After NMS, a particular behavior can be observed in Fig. 3e, with overconfident predictions in the low-confidence interval (p̂ &lt; 0.5) and underconfident predictions in the high-confidence interval (p̂ &gt; 0.5). Also, when comparing the calibrated results shown in Fig. 3b and 3f, it is evident that the calibration for the white-box model leads to a better D-ECE score. In contrast, Faster R-CNN outputs reasonably well-calibrated predictions before NMS but is highly overconfident after NMS. Again, we observe that the white-box D-ECE score is much better compared to the black-box model after calibration has been applied.
        </p>
        <p>We also study the effect of position-dependent miscalibration as in (Küppers et al. 2020), shown in Fig. 4. We compare the white-box and black-box models before and after calibration for each object detector. These figures allow us to analyze whether calibration is influenced by the position of predicted bounding boxes. All images show a tendency of higher miscalibration close to the borders. This may be caused by the difficulty of correctly detecting objects that are partially cropped by the image border. However, this is of minor relevance considering that most of the positional discrepancies are mitigated after calibration in all cases.</p>
        <p>As shown in Tables 2 and 3, the calibration for the white-box model performs better than the calibration for the black-box model for the first and second columns. The opposite happens when including the box scales into the computation of the D-ECE; here, the black-box model with NMS@0.5 provides the best results. A possible explanation for this observation could be that, by increasing the NMS value, the number of samples also increases from 4,229 and 4,496 to 117,292 and 37,355 for RetinaNet and Faster R-CNN, respectively. As expected, the more we go in the white-box direction, the fewer predictions are discarded. Having more samples for the miscalibration computation also means that there are possibly more samples within each bin, leading to a more robust miscalibration estimation (Kumar, Liang, and Ma 2019). As previously mentioned, bins with fewer than 8 samples are neglected for the computation of the D-ECE.</p>
        <p>[Figure 3: confidence histograms (top) and reliability diagrams (bottom), together with D-ECE w.r.t. image location; recoverable panel D-ECE values: (a) 23.851%, (c) 21.444%, (d) 19.268%.]</p>
        <p>The total amount of neglected bins for each configuration is shown in Table 1. Especially when using all available information for calibration (full case), more and more bins are left out when going from white-box (bottom) to black-box (top), resulting in fewer bins contributing to the miscalibration score.</p>
        <p>A critical question is how to integrate white-box calibration into the object detection pipeline. As demonstrated in the previous results, NMS has a significant impact on calibration, affecting the precision as well as the confidence scores of the detections. It has been shown that NMS has the potential to degrade the calibration results. Therefore, we investigate the calibration properties of detection models that are calibrated by histogram binning before being processed by NMS. The results are shown in Fig. 2: it can be seen that applying NMS still leads to higher miscalibration, even though the confidence is calibrated before NMS.</p>
        <p>However, as NMS also affects the precision, the detection model becomes overconfident in both cases. In order to preserve the good calibration obtained by the white-box method, alternative box suppression methods should be investigated. One option would be to integrate the confidence calibration with the box merging strategies compared by (Roza et al. 2020), such as weighted box fusion and variance voting, and to test how such methods influence the model calibration.</p>
        <p>[Figure residue, reliability over relative box position. Faster R-CNN: (a) uncalibrated white-box model, D-ECE = 22.913%; (b) calibrated white-box model, D-ECE = 0.981%; (c) uncalibrated white-box model, D-ECE = 4.198%; (d) calibrated white-box model, D-ECE = 0.861%.]</p>
        <p>[(a) uncalibrated white-box model with default D-ECE = 22.992%; (b) after histogram binning, D-ECE = 5.671%; (c) uncalibrated white-box model with default D-ECE = 7.631%; (d) after histogram binning, D-ECE = 5.998%.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>In this paper, we analyzed the influence of box suppression methods on confidence calibration for object detection models. To do so, we adapted models without box suppression, denoted as white-box models, in contrast to the commonly used black-box approach. We performed histogram-based calibration for both the black-box and white-box scenarios on the COCO dataset. We found that the initial calibration of detection models is highly impacted by NMS. Additionally, we observed that calibration also depends on the architecture of the object detection model: for RetinaNet, the model predictions are underconfident before applying NMS, whereas for Faster R-CNN the white-box model outputs quite well-calibrated detections that become overconfident after NMS.</p>
      <p>Knowing that miscalibration depends not only on the classification outputs but also on the regression output for the bounding boxes, we performed histogram-based calibration using different subsets of the output data. For the confidence-only and (p̂, ĉ_x, ĉ_y) cases, the white-box model outperforms the black-box models, while the black-box models show slightly better results in the other scenarios.</p>
      <p>While the white-box calibration has given good results, the most effective integration of white-box calibration methods into existing object detectors utilizing NMS remains an open issue. As shown by the results in this paper, the NMS layer affects the results by producing different calibration profiles before and after the suppression. Consistent with the further results presented in this paper, the calibrated detections obtained by the white-box models deteriorated after NMS for both RetinaNet and Faster R-CNN. However, we think this problem can be solved by using other suppression methods that consider a larger set of the overall better-calibrated boxes than NMS does.</p>
      <p>For future work, we suggest alternatives to the standard NMS method to verify whether they can lead to better-calibrated object detectors. One option would be to integrate the confidence calibration with the box merging strategies compared by (Roza et al. 2020), such as box averaging, weighted box fusion, or variance voting.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was funded by the Bavarian Ministry for
Economic Affairs, Regional Development and Energy as part of
a project to support the thematic development of the Institute
for Cognitive Systems and within the Intel Collaborative
Research Institute Safe Automated Vehicles.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rosenbaum</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; Gläser, C.;
          <string-name>
            <surname>Timm</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dietmayer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Can We Trust You? On Calibration of a Probabilistic Object Detector for Autonomous Driving</article-title>
          . arXiv abs/1909.12358.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pleiss</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Sun,
          <string-name>
            <given-names>Y.</given-names>
            ; and
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. Q.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>On Calibration of Modern Neural Networks</article-title>
          .
          <source>In Proceedings of the 34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          . PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Kull</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nieto</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          ; Kängsepp, M.; Silva Filho,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Song</surname>
          </string-name>
          , H.; and Flach,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <volume>12316</volume>
          -
          <fpage>12326</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kull</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; Silva Filho, T.; and Flach,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers</article-title>
          .
          <source>In Artificial Intelligence and Statistics</source>
          ,
          <volume>623</volume>
          -
          <fpage>631</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          ; and
          <string-name><surname>Ma</surname>, <given-names>T.</given-names></string-name>
          <year>2019</year>
          .
          <article-title>Verified Uncertainty Calibration</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          ,
          <fpage>3792</fpage>
          -
          <lpage>3803</lpage>
          . Curran Associates, Inc. URL http://papers.nips.cc/paper/8635-verified-uncertainty-calibration.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>Küppers</surname>, <given-names>F.</given-names></string-name>
          ;
          <string-name><surname>Kronenberger</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Shantia</surname>, <given-names>A.</given-names></string-name>
          ; and
          <string-name><surname>Haselhoff</surname>, <given-names>A.</given-names></string-name>
          <year>2020</year>
          .
          <article-title>Multivariate Confidence Calibration for Object Detection</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source>
          ,
          <fpage>326</fpage>
          -
          <lpage>327</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Lin</surname>, <given-names>T.</given-names></string-name>
          ;
          <string-name><surname>Maire</surname>, <given-names>M.</given-names></string-name>
          ;
          <string-name><surname>Belongie</surname>, <given-names>S. J.</given-names></string-name>
          ;
          <string-name><surname>Bourdev</surname>, <given-names>L. D.</given-names></string-name>
          ;
          <string-name><surname>Girshick</surname>, <given-names>R. B.</given-names></string-name>
          ;
          <string-name><surname>Hays</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Perona</surname>, <given-names>P.</given-names></string-name>
          ;
          <string-name><surname>Ramanan</surname>, <given-names>D.</given-names></string-name>
          ;
          <string-name><surname>Dollár</surname>, <given-names>P.</given-names></string-name>
          ; and
          <string-name><surname>Zitnick</surname>, <given-names>C. L.</given-names></string-name>
          <year>2014</year>
          .
          <article-title>Microsoft COCO: Common Objects in Context</article-title>
          .
          <source>CoRR abs/1405.0312</source>
          . URL http://arxiv.org/abs/1405.0312.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Lin</surname>, <given-names>T.-Y.</given-names></string-name>
          ;
          <string-name><surname>Goyal</surname>, <given-names>P.</given-names></string-name>
          ;
          <string-name><surname>Girshick</surname>, <given-names>R.</given-names></string-name>
          ;
          <string-name><surname>He</surname>, <given-names>K.</given-names></string-name>
          ; and
          <string-name><surname>Dollár</surname>, <given-names>P.</given-names></string-name>
          <year>2017</year>
          .
          <article-title>Focal Loss for Dense Object Detection</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision (ICCV)</source>
          ,
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Mukhoti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kulharia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sanyal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Golodetz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name><surname>Torr</surname>, <given-names>P. H.</given-names></string-name>
          ; and
          <string-name><surname>Dokania</surname>, <given-names>P. K.</given-names></string-name>
          <year>2020</year>
          .
          <article-title>Calibrating Deep Neural Networks using Focal Loss</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><surname>Müller</surname>, <given-names>R.</given-names></string-name>
          ;
          <string-name><surname>Kornblith</surname>, <given-names>S.</given-names></string-name>
          ; and
          <string-name><surname>Hinton</surname>, <given-names>G. E.</given-names></string-name>
          <year>2019</year>
          .
          <article-title>When does label smoothing help?</article-title>
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <fpage>4694</fpage>
          -
          <lpage>4703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name><surname>Naeini</surname>, <given-names>M.</given-names></string-name>
          ;
          <string-name><surname>Cooper</surname>, <given-names>G.</given-names></string-name>
          ; and
          <string-name><surname>Hauskrecht</surname>, <given-names>M.</given-names></string-name>
          <year>2015</year>
          .
          <article-title>Obtaining Well Calibrated Probabilities Using Bayesian Binning</article-title>
          .
          <source>In Proceedings of the 29th AAAI Conference on Artificial Intelligence</source>
          ,
          <fpage>2901</fpage>
          -
          <lpage>2907</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Relaxed Softmax: Efficient Confidence Auto-Calibration for Safe Pedestrian Detection</article-title>
          .
          <source>In Workshop on Machine Learning for Intelligent Transportation Systems (NIPS).</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Niculescu-Mizil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name><surname>Caruana</surname>, <given-names>R.</given-names></string-name>
          <year>2005</year>
          .
          <article-title>Predicting Good Probabilities with Supervised Learning</article-title>
          .
          <source>In Proceedings of the 22nd International Conference on Machine Learning</source>
          ,
          <fpage>625</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Pereyra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name><surname>Tucker</surname>, <given-names>G.</given-names></string-name>
          ;
          <string-name><surname>Chorowski</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Kaiser</surname>, <given-names>Ł.</given-names></string-name>
          ; and
          <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>
          <year>2017</year>
          .
          <article-title>Regularizing Neural Networks by Penalizing Confident Output Distributions</article-title>
          .
          <source>CoRR</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods</article-title>
          .
          <source>In Advances in Large Margin Classifiers</source>
          ,
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name><surname>Girshick</surname>, <given-names>R. B.</given-names></string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>CoRR abs/1506.01497</source>
          . URL http://arxiv.org/abs/1506.01497.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          2020.
          <article-title>Assessing Box Merging Strategies and Uncertainty Estimation Methods in Multimodel Object Detection</article-title>
          . In Beyond mAP:
          <article-title>Reassessing the Evaluation of Object Detectors @ECCV, -</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name><surname>Seo</surname>, <given-names>P. H.</given-names></string-name>
          ; and
          <string-name><surname>Han</surname>, <given-names>B.</given-names></string-name>
          <year>2019</year>
          .
          <article-title>Learning for Single-Shot Confidence Calibration in Deep Neural Networks Through Stochastic Inferences</article-title>
          .
          <source>In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kirillov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Massa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name><surname>Lo</surname>, <given-names>W.-Y.</given-names></string-name>
          ; and
          <string-name><surname>Girshick</surname>, <given-names>R.</given-names></string-name>
          <year>2019</year>
          . Detectron2. URL https://github.com/facebookresearch/detectron2.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Zadrozny</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Elkan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning (ICML)</source>
          ,
          <fpage>609</fpage>
          -
          <lpage>616</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Zadrozny</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Elkan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Transforming Classifier Scores into Accurate Multiclass Probability Estimates</article-title>
          .
          <source>In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada</source>
          ,
          <fpage>694</fpage>
          -
          <lpage>699</lpage>
          . doi: 10.1145/775047.775151. URL https://doi.org/10.1145/775047.775151.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>