From Black-box to White-box: Examining Confidence Calibration under different Conditions Franziska Schwaiger 1 , Maximilian Henne 1 , Fabian Küppers 2 , Felippe Schmoeller Roza 1 , Karsten Roscher 1 , Anselm Haselhoff 2 1 Fraunhofer Institute for Cognitive Systems IKS, Munich, Germany 2 Ruhr West University of Applied Sciences, Bottrop, Germany {franziska.schwaiger, maximilian.henne, felippe.schmoeller.da.roza}@iks.fraunhofer.de {fabian.kueppers, anselm.haselhoff}@hs-ruhrwest.de Abstract In the past, most research in this area has focused on Confidence calibration is a major concern when applying ar- classification (Naeini, Cooper, and Hauskrecht 2015; Kull, tificial neural networks in safety-critical applications. Since Silva Filho, and Flach 2017; Guo et al. 2017; Seo, Seo, and most research in this area has focused on classification in the Han 2019; Mukhoti et al. 2020), whereas calibration in ob- past, confidence calibration in the scope of object detection ject detection has recently gained more attention (Neumann, has gained more attention only recently. Based on previous Zisserman, and Vedaldi 2018; Feng et al. 2019; Küppers work, we study the miscalibration of object detection models et al. 2020). Object detection is a joint task of classifica- with respect to image location and box scale. Our main con- tion and regression of the predictions’ position and scale. tribution is to additionally consider the impact of box selec- Recent work has shown that the regression branch of ob- tion methods like non-maximum suppression to calibration. ject detection models also affects confidence calibration We investigate the default intrinsic calibration of object de- (Küppers et al. 2020). However, the observable detections tection models and how it is affected by these post-processing techniques. For this purpose, we distinguish between black- of a model are commonly processed by non-maximum sup- box calibration with non-maximum suppression and white- pression (NMS) and/or thresholded by a certain confidence box calibration with raw network outputs. Our experiments score. In this work, our goal is to investigate the influence of reveal that post-processing highly affects confidence calibra- such post-processing techniques on the model calibration. tion. We show that non-maximum suppression has the poten- For this purpose, we adapt common object detection mod- tial to degrade initially well-calibrated predictions, leading to els and examine their miscalibration before NMS on the one overconfident and thus miscalibrated models. hand (white-box scenario). In this way, we have access to the raw predictions of a network and are thus able to exam- 1 Introduction ine the network’s calibration properties by default. On the Modern deep neural networks achieve remarkable results on other hand, we further apply NMS with increasing intersec- various tasks but it is a well-known issue that these net- tion over union (IoU) thresholds (black-box scenario), which works fail to provide reliable estimates about the correctness varies the number of boxes that are suppressed. Changing of predictions in many cases (Niculescu-Mizil and Caru- the parameters of the NMS enables us to examine to what ana 2005; Guo et al. 2017). A network outputs a score extent the models are intrinsically calibrated and how this is attached to each prediction that can be interpreted as the affected by postprocessing techniques. An illustrative rep- probability of correctness. Such a model is well-calibrated resentation for the problem setting is demonstrated in Fig. if the observed accuracy matches the estimated confidence 1. Furthermore, we use a Faster R-CNN architecture (Ren scores. However, recent work has shown that these con- et al. 2015) that uses the cross entropy loss during training fidence scores neither represent the actual observed accu- and compare it to a RetinaNet (Lin et al. 2017) that uses a racy in classification (Niculescu-Mizil and Caruana 2005; focal loss. It is already known that models trained with fo- Naeini, Cooper, and Hauskrecht 2015; Guo et al. 2017) nor cal loss produce much less confident predictions (Mukhoti the observed precision in object detection (Küppers et al. et al. 2020). This enables us to further investigate the ef- 2020). Calibrated confidence estimates integrated in safety- fect of post-processing methods by comparing the default critical applications like autonomous driving can provide calibration properties of both model architectures with and valuable additional information with respect to situational without NMS. awareness and can reduce the risk of hazards resulting from functional insufficiencies by decreasing the space of un- This work is structured as follows: we give a review of known unsafe scenarios which is a critical part for the safety the current state-of-the-art research in confidence calibration of the intended functionality (SOTIF ISO/PAS 21448). in Section 2. We further give a definition of white-box and Copyright c 2021 for this paper by its authors. Use permitted un- black-box calibration and a description of our calibration tar- der Creative Commons License Attribution 4.0 International (CC gets in Section 3. In Section 4, our experimental results are BY 4.0). demonstrated and in Section 5 we discuss our findings. Input images Detections Detector NMS White-box Black-box Calibration Calibration Figure 1: Typically a non-maximum suppression (NMS) is applied to all detections of a detection model to fuse and reduce redundant bounding boxes. In our work, we investigate how NMS affects confidence calibration. Thus, we study the difference in calibration before NMS (white-box) and afterward (black-box). 2 Related Work vestigate how to directly obtain well-calibrated models after training. Besides the previously mentioned MMCE, the au- Numerous methods have been developed in the past to ad- thors in (Pereyra et al. 2017) introduce a regularization term dress the miscalibration of neural networks. One of the first penalizing high confident predictions. In contrast, (Müller, representatives of post-processing calibration methods has Kornblith, and Hinton 2019) show that label smoothing been histogram binning (Zadrozny and Elkan 2001), iso- yields good probabilities after training. Recently, (Mukhoti tonic regression (Zadrozny and Elkan 2002), Bayesian bin- et al. 2020) investigate the effects of focal loss, originally ning (Naeini, Cooper, and Hauskrecht 2015), and Platt scal- proposed as a loss term for RetinaNet (Lin et al. 2017), ing (Platt 1999), whereas more recently temperature scaling on confidence calibration. They show that using focal loss (Guo et al. 2017), beta calibration (Kull, Silva Filho, and in conjunction with an adaptive parameter significantly im- Flach 2017), and Dirichlet calibration (Kull et al. 2019) have proves the confidence calibration of classification models. been developed to tackle miscalibration in the scope of clas- We also observe that using focal loss prevents overconfi- sification. In object detection models, dealing with miscali- dent predictions in our experiments on the RetinaNet with bration presents a different set of challenges and was first ad- standard hyperparameters (Lin et al. 2017). While well- dressed by (Neumann, Zisserman, and Vedaldi 2018), who known object detection models like Faster-RCNN (Ren et al. proposed an additional model output to be utilized as a reg- 2015) commonly tend to output overconfident predictions, ularizing temperature applied to the remaining logits. Re- the probability scores of a RetinaNet rather underestimate cently, (Küppers et al. 2020) have studied the effect of posi- the observed frequency. tion and scale of detected objects to miscalibration and con- cluded that calibration also depends on the regression out- put of a detection model. They further provide a framework to include position and scale information into a calibration 3 Defining Confidence Calibration for Object mapping. Detection Models For the task of classification, a common way to mea- sure miscalibration is to adapt the expected calibration er- In this section, we describe the definition of black-box and ror (ECE), proposed by (Naeini, Cooper, and Hauskrecht white-box calibration. The idea behind this distinction is to 2015), which uses a binning scheme to measure the gap be- analyze the impact of bounding-box postprocessing on cali- tween observed frequency and average confidence. (Kumar, bration. Liang, and Ma 2019) show in their work that the common An object detector takes an image as input x and out- ECE underestimates the true calibration error in some cases puts predictions in form of a class label y ∈ Y with cor- and provide a differentiable upper bound called maximum responding confidence score p ∈ [0, 1] and bounding box mean calibration error (MMCE) that can also be used dur- r = (cx , cx , h, w) ∈ RJ , with (cx , cy ) being the center ing model training as a second regularization term. For mea- position, (h, w) the box height and width and J the size suring miscalibration in object detection tasks, an extension of the used box encoding. The authors in (Küppers et al. of the ECE called detection expected calibration error (D- 2020) propose a confidence calibration that not only consid- ECE) was proposed by (Küppers et al. 2020), consisting of a ers the confidence score p but also includes the box infor- multidimensional binning scheme to assess the miscalibra- mation r. The performance of a detector is thus evaluated by tion over all predicted features of an object detection model. matching its predictions (ŷ, p̂, r̂) with the ground-truth an- Since the standard cross-entropy loss is prone to favor notations, where m = 1 denotes a matched box and m = 0 overly confident predictions, further research directions in- a mismatch. More formally, perfect calibration in the scope of object detection is defined by uses a focal loss that enables to focus on hard examples dur- ing training with low confidence. On the other hand, good P(M = 1|P̂ = p, Ŷ = y, R̂ = r) = p , (1) predictions with high confidence are less weighted during training that in turn leads to less confident predictions (Lin | {z } |{z} precision given p,y,r confidence et al. 2017; Mukhoti et al. 2020). Our experiments are re- ∀p ∈ [0, 1], y ∈ Y, r ∈ RJ . stricted to the predictions of class person. As the detections rarely match the ground-truth perfectly, To study the effect of non-maximum suppression, we ap- true positives (TP, m = 1) and false positives (FP, m = 0) ply different IoU thresholds to merge boxes denoted by are obtained by comparing the IoU to a fixed threshold τ . TP NMS@{0.5, 0.75, 0.9}. In the white-box case without NMS, and FP correspond to boxes with IoU ≥ τ and IoU < τ , re- we use the raw predictions for measuring and performing spectively. The process of inference is commonly followed calibration on the one hand. On the other hand, we further by a non-maximum suppression since an object detection adopt top-k box selection where only k bounding boxes with model outputs a huge amount of mostly less confident and the highest confidence are kept using k = 1000. This is the redundant detections. On the one hand, we can consider the common case during inference to reduce low confidential definition of calibration given by Equation 1 to the raw out- and mostly redundant predictions. Following (Küppers et al. puts of a detector without any post-processing. We denote 2020), the predictions of all models are obtained by infer- this case as the white-box calibration case for the following ence with a probability threshold of 0.3 which means dis- of this paper. On the other hand, we can also view the NMS carding all predictions with a confidence score less than this as part of the detector and treat the output of the NMS as threshold. As the relative amount of predictions per image our desired calibration target. This is denoted as black-box with low confidence score is significantly higher than the calibration. relative amount of the remaining predictions, this probabil- ity threshold ensures that the D-ECE is not dominated by In (Küppers et al. 2020), the detection expected calibra- these low confidence samples. tion error (D-ECE) is defined as an extension of the com- monly used ECE (Naeini, Cooper, and Hauskrecht 2015) for For confidence calibration, we use multivariate histogram object detection tasks. The D-ECE also includes the box in- binning (Zadrozny and Elkan 2001; Küppers et al. 2020) for formation r by partitioning the space of each variable k into calibration as a fast and reliable calibration method. We also Nk equally spaced bins. The total amount of bins is given by evaluate several setups with different subsets of box infor- QK mation to evaluate the effect of the used feature set. We ei- Ntotal = k=1 Nk and the D-ECE is defined as ther use the confidence only, also including the box centers NX total (cˆx , cˆy ) or box scales (h, w), or we use all features for mea- |I(n)| suring and performing calibration. For the histogram-based D-ECEK = · |prec(n) − conf(n)|, (2) n=1 |D| calibration, we use 15 bins for confidence only, Nk = 5 bins for (p̂, cˆx , cˆy ) and (p̂, ĥ, ŵ), and Nk = 3 when using all where I(n) is the set of all samples in a single bin and |D| available features. In contrast, for D-ECE computation we the total amount of samples, while prec(n) and conf(n) de- use 20 bins for confidence only, Nk = 8 bins for (p̂, cˆx , cˆy ) note the average precision and confidence within each bin, and (p̂, ĥ, ŵ), and Nk = 5 when using all available informa- respectively. We use this metric to measure miscalibration in tion. We increase the robustness of the D-ECE calculation both cases: For white-box, we consider all possible box pre- by also neglecting bins with less than 8 samples (Küppers dictions whereas for black-box only the winning boxes after et al. 2020). NMS are considered. This is explained in more detail in the following section. Results 4 Experimental Evaluation In Tables 2 and 3, the results for black-box and white- box calibration for RetinaNet and Faster R-CNN are pre- In order to analyze the confidence calibration under different sented, respectively. Three different IoU threshold values conditions, we use the COCO 2017 validation dataset (Lin of τ = {0.5, 0.6, 0.75} are considered to match predic- et al. 2014) with a random split of 70% and 30% for training tions with ground-truth annotations. In the tables, each cell and testing the calibration, respectively. presents the D-ECE for the baseline (without calibration) and the corresponding D-ECE after histogram-based cali- Evaluation Protocol bration (HB). The Tables 2 and 3 show the results of the We perform both black-box and white-box calibration by black-box models with varying strength of NMS as well as following the evaluation protocol of (Küppers et al. 2020) the calibration results for the white-box case without NMS. and use their provided calibration framework. The final cal- The D-ECE is evaluated with different additional box infor- ibration results are obtained as an average over 20 indepen- mation: The first column shows the confidence only calibra- dent training and testing results. For inference, we use a pre- tion, the second and third columns the calibration with box trained RetinaNet (Lin et al. 2017) and a Faster R-CNN (Ren centers and box scales, and the last columns show the results et al. 2015) model provided by the Detectron2 framework for the calibration with all box information considered. The (Wu et al. 2019). While the classification branch of the for- best D-ECE scores are highlighted for each set of features mer model is trained by cross entropy loss, the latter one and IoU value across all variants. Confidence Histogram (top) D-ECE Reliability Diagram (bottom) w.r.t. image location 1e-1 (p̂) (p̂, cx , cy ) (p̂, h, w) full 1.0 5 % of samples NMS@0.5 0 1 29 528 4 NMS@0.75 0 0 20 485 relative cy 3 NMS@0.9 0 0 12 435 Without NMS 0 0 9 414 2 Precision Baseline 20 256 256 1024 1 Table 1: Amount of neglected bins within D-ECE calcula- 0.0 1.0 0 relative cx tion of the three black-box models and the white-box model (a) D-ECE = 23.851% for Faster R-CNN (Ren et al. 2015; Wu et al. 2019). A simi- (b) D-ECE = 22.963% D-ECE lar amount of bins is also neglected during the examinations Confidence Histogram (top) w.r.t. image location 1e-1 1.0 5 for RetinaNet (Lin et al. 2017; Wu et al. 2019). Reliability Diagram (bottom) % of samples 4 relative cy 3 For Faster R-CNN, we observe that the white-box model calibrates consistently better by default than the black-box 2 models in most cases. In contrast, we observe the oppo- Precision site behavior for the RetinaNet model. Therefore, we fur- 1 ther study the calibration properties of those networks by 0 0.0 1.0 inspecting their reliability diagrams shown in Fig. 3 for relative cx the black-box and white-box cases. The RetinaNet white- (c) D-ECE = 21.444% (d) D-ECE = 19.268% box model without NMS offers underconfident predictions which is a known property of models trained by focal loss Figure 2: Confidence histogram and reliability diagram (left) (Lin et al. 2017). After NMS, a particular behavior can be and position-dependent heatmap (right) for RetinaNet (Lin observed in Fig. 3e with overconfident predictions in the et al. 2017; Wu et al. 2019) (top row) and Faster RCNN (Ren low confidence interval (p̂ < 0.5) and underconfident pre- et al. 2015; Wu et al. 2019) (bottom row) after white-box cal- dictions in the high confidence interval (p̂ > 0.5). Also, ibration and then further application of non-maximum sup- when comparing the calibrated results shown in Fig. 3b and pression with NMS@0.5. 3f, it is evident that the calibration for the white-box model leads to a better D-ECE score. In contrast, Faster R-CNN outputs reasonably well calibrated predictions before NMS but is highly overconfident after NMS. Again, we observe Ma 2019). As previously mentioned, bins with less than 8 that the white-box D-ECE score is much better compared to samples are neglected for the computation of the D-ECE. the black-box model after calibration has been applied. The total amount of neglected bins for each configuration We also study the effect of position-dependent miscali- is illustrated in Table 1. Especially using all available infor- bration as in (Küppers et al. 2020), shown in Fig. 4. We mation for calibration (full case), more and more bins are compare the white-box and black-box models before and left out when going from white-box (bottom) to black-box after calibration for each object detector. These figures al- (top) resulting in less bins contributing to the miscalibration low to analyze if calibration is influenced by the position score. of predicted bounding boxes. All images show a tendency of higher miscalibration close to the borders. That may be A critical question arises how to integrate white-box cal- caused by the difficulty of detecting objects correctly which ibration into the object detection pipeline. As demonstrated are cropped out of the frame. However, this is of minor rel- in the previous results, NMS has a significant impact in the evance considering that most of the positional discrepancies calibration affecting the precision as well as the confidence are mitigated after calibration in all cases. scores of the detections. It has been shown that NMS has the As shown in Tables 2 and 3, the calibration for the white- potential to degrade the calibration results. Therefore, we box model performs better than the calibration for the black- investigate the calibration properties of the detection models box model for the first and second columns. The opposite that are processed by a NMS with histogram-based calibra- happens when including the box scales into the computation tion beforehand. The results are shown in Fig. 2: It can be of the D-ECE. Here, the black-box model with NMS@0.5 seen that calibration before NMS leads to higher miscali- provides the best results. A possible explanation for this ob- bration as the confidence is calibrated before NMS as well. servation could be, that by increasing the NMS value, the However, as NMS also affects the precision, the detection number of samples also increases from 4,229 and 4,496 to model gets too overconfident in both cases. In order to pre- 117,292 and 37,355 for RetinaNet and Faster R-CNN, re- serve good calibrations from the white-box method, alterna- spectively. As expected, the more we go in the white-box tive box suppression methods should be investigated. One direction, the less predictions are discarded. Having more option would be to integrate the confidence calibration with samples for the miscalibration computation also means that the box merging strategies compared by (Roza et al. 2020), there are possibly more samples within each bin leading to such as weighted box fusion and variance voting and test a more robust miscalibration estimation (Kumar, Liang, and how such methods influence the model calibration. % of samples Precision RetinaNet Faster R-CNN (a) Uncalibrated white-box (b) Calibrated white-box model (c) Uncalibrated white-box (d) Calibrated white-box model: model with D-ECE = 22.913% with D-ECE = 0.981% model with D-ECE = 4.198% D − ECE = 0.861% % of samples Precision (e) Uncalibrated black-box (f) Calibrated black-box model (g) Uncalibrated black-box (h) Calibrated black-box model model with D-ECE = 10.350% with D-ECE = 1.210% model with D-ECE = 20.527% with D-ECE = 1.615% Figure 3: Confidence histograms and reliability diagrams of the miscalibration for RetinaNet (left) and Faster R-CNN (right) black-box (NMS@0.5) and white-box (without NMS) models with IoU@0.6 before and after histogram-based calibration. RetinaNet Faster R-CNN Default D-ECE 1e-1 After Histogram Binning 1e-1 Default D-ECE 1e-1 After Histogram Binning 1e-1 1.0 5 1.0 5 1.0 5 1.0 5 4 4 4 4 relative cy relative cy relative cy 3 3 3 relative cy 3 2 2 2 2 1 1 1 1 0 0 0 0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 relative cx relative cx relative cx relative cx (a) Uncalibrated white-box (b) Calibrated white-box model (c) Uncalibrated white-box (d) Calibrated white-box model model with D-ECE = 22.992% with D-ECE = 5.671% model with D-ECE = 7.631% with D-ECE = 5.998% Default D-ECE 1e-1 After Histogram Binning 1e-1 Default D-ECE 1e-1 After Histogram Binning 1e-1 1.0 5 1.0 5 1.0 5 1.0 5 4 4 4 4 relative cy relative cy relative cy relative cy 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 relative cx relative cx relative cx relative cx (e) Uncalibrated black-box (f) Calibrated black-box model (g) Uncalibrated black-box (h) Calibrated black-box model model with D-ECE = 10.894% with D-ECE = 6.620% model with D-ECE = 15.975% with D-ECE = 7.156% Figure 4: Position-dependent miscalibration of the RetinaNet (left) and Faster R-CNN (right) black-box (NMS@0.5) and white- box (without NMS) models with IoU@0.6, before and after the histogram-based calibration. (p̂) (p̂, cx , cy ) (p̂, h, w) full (p̂) (p̂, cx , cy ) (p̂, h, w) full IoU@0.5 IoU@0.5 Baseline 16.200 12.004 14.963 13.478 Baseline 15.448 15.750 14.246 12.586 HB 1.636 6.335 2.775 5.391 HB 1.388 6.123 4.252 6.665 IoU@0.6 IoU@0.6 Baseline 20.862 15.303 18.743 16.546 Baseline 3.435 7.486 6.710 7.064 HB 1.673 6.091 2.991 5.743 HB 1.441 6.273 4.192 6.444 IoU@0.75 IoU@0.75 Baseline 31.659 24.684 28.864 25.765 Baseline 20.980 20.840 20.041 17.504 HB 1.436 6.095 3.110 5.704 HB 1.227 4.847 3.315 4.974 (a) Black-box calibration with NMS@0.5, |D| = 4, 229. (b) Black-box calibration NMS@0.75, |D| = 7, 923 (p̂) (p̂, cx , cy ) (p̂, h, w) full (p̂) (p̂, cx , cy ) (p̂, h, w) full IoU@0.5 IoU@0.5 Baseline 30.748 30.672 30.436 29.427 Baseline 28.027 28.127 28.014 28.176 HB 1.212 5.290 3.671 6.686 HB 0.855 4.947 2.895 6.114 IoU@0.6 IoU@0.6 Baseline 21.773 21.954 21.612 21.350 Baseline 23.097 23.290 23.118 23.482 HB 1.195 5.717 3.981 7.647 HB 1.033 5.331 3.306 6.912 IoU@0.75 IoU@0.75 Baseline 3.057 6.907 8.489 10.143 Baseline 8.487 10.190 10.190 11.992 HB 1.367 5.675 4.468 7.847 HB 1.132 5.892 4.207 8.266 (c) Black-box calibration with NMS@0.9, |D| = 20, 005 (d) White-box calibration without NMS, |D| = 117, 292 Table 2: D-ECE results (%) for RetinaNet (Lin et al. 2017; Wu et al. 2019) for different IoU scores. Each column shows the baseline D-ECE and the calibrated one using histogram-based (HB) calibration with different subsets using either confidence only p̂, including the box centers (cx , cy ), box scales (h, w) or using all features. Note that comparing the D-ECE scores of columns to each other is not applicable since different subsets of data have been used for D-ECE measurement and calibration. (p̂) (p̂, cx , cy ) (p̂, h, w) full (p̂) (p̂, cx , cy ) (p̂, h, w) full IoU@0.5 IoU@0.5 Baseline 7.781 9.060 7.829 7.168 Baseline 7.597 9.927 10.828 9.804 HB 1.789 6.186 2.947 4.960 HB 1.523 6.968 3.778 6.134 IoU@0.6 IoU@0.6 Baseline 9.370 10.041 9.033 7.810 Baseline 16.100 15.226 16.417 14.933 HB 1.564 6.075 3.105 5.142 HB 1.343 6.323 3.490 5.610 IoU@0.75 IoU@0.75 Baseline 31.659 24.684 28.864 25.765 Baseline 34.634 32.535 31.861 27.883 HB 1.436 6.095 3.110 5.704 HB 1.123 4.878 3.018 4.996 (a) Black-box calibration with NMS@0.5, |D| = 4, 496. (b) Black-box calibration with NMS@0.75, |D| = 7, 231. (p̂) (p̂, cx , cy ) (p̂, h, w) full (p̂) (p̂, cx , cy ) (p̂, h, w) full IoU@0.5 IoU@0.5 Baseline 7.323 10.431 10.042 10.318 Baseline 6.914 9.619 8.638 10.061 HB 1.354 6.697 4.062 7.121 HB 1.038 5.234 3.206 6.239 IoU@0.6 IoU@0.6 Baseline 7.499 10.328 11.622 11.630 Baseline 4.592 7.720 8.540 9.548 HB 1.184 6.383 4.050 7.141 HB 1.099 5.523 3.603 6.959 IoU@0.75 IoU@0.75 Baseline 25.689 25.539 25.792 25.002 Baseline 13.067 13.883 15.658 16.462 HB 1.139 5.478 4.126 6.908 HB 0.999 5.996 4.505 8.652 (c) Black-box calibration with NMS@0.9, |D| = 17, 742. (d) White-box calibration without NMS, |D| = 37, 355. Table 3: D-ECE results (%) for Faster R-CNN (Ren et al. 2015; Wu et al. 2019) before and after histogram-based (HB) calibration using different IoU thresholds for NMS. The structure of this table is comparable to Tab. 2. 5 Conclusion Kull, M.; Nieto, M. P.; Kängsepp, M.; Silva Filho, T.; Song, H.; and Flach, P. 2019. Beyond temperature scaling: Ob- In this paper, we analyzed the influence of box suppres- taining well-calibrated multi-class probabilities with Dirich- sion methods on confidence calibration for object detec- let calibration. In Advances in Neural Information Processing tion models. To do so, we adapt models without box sup- Systems, 12316–12326. pression methods denoted as white-box models, contrasting to the black-box approach commonly suggested. We per- Kull, M.; Silva Filho, T.; and Flach, P. 2017. Beta calibra- tion: a well-founded and easily implemented improvement on formed histogram-based calibration for both black-box and logistic calibration for binary classifiers. In Artificial Intelli- white-box scenarios on the COCO dataset. We found that gence and Statistics, 623–631. the initial calibration of detection models is highly impacted by NMS. Additionally, we observed that calibration also de- Kumar, A.; Liang, P. S.; and Ma, T. 2019. Verified pends on the architecture of the object detection model. For Uncertainty Calibration. In Advances in Neural Infor- mation Processing Systems 32, 3792–3803. Curran Asso- RetinaNet, the model predictions are underconfident before ciates, Inc. URL http://papers.nips.cc/paper/8635-verified- applying NMS whereas, for Faster R-CNN, the white-box uncertainty-calibration.pdf. model outputs quite well calibrated detections that become overconfident after NMS. Küppers, F.; Kronenberger, J.; Shantia, A.; and Haselhoff, A. Knowing that the miscalibration not only depends on the 2020. Multivariate Confidence Calibration for Object Detec- tion. In Proceedings of the IEEE/CVF Conference on Com- classification outputs but also on the regression output for puter Vision and Pattern Recognition Workshops, 326–327. the bounding boxes, we performed histogram-based calibra- tion using different subsets of the output data. For the con- Lin, T.; Maire, M.; Belongie, S. J.; Bourdev, L. D.; Girshick, fidence only and (p̂, cx , cy ) case, the white-box model out- R. B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zit- performs the black-box models while the black-box models nick, C. L. 2014. Microsoft COCO: Common Objects in Con- text. CoRR abs/1405.0312. URL http://arxiv.org/abs/1405. present slightly better results on the other scenarios. 0312. While the white-box calibration has given good results, the most effective integration of white-box calibration meth- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. ods in existing object detectors utilizing NMS remains as Focal Loss for Dense Object Detection. In Proceedings of the an open issue. As shown by the results in this paper, the IEEE International Conference on Computer Vision (ICCV), 2980–2988. NMS layer affects the results by giving different calibration profiles before and after the suppression. Corroborating with Mukhoti, J.; Kulharia, V.; Sanyal, A.; Golodetz, S.; Torr, further results presented in this paper, the calibrated detec- P. H.; and Dokania, P. K. 2020. Calibrating Deep Neural Net- tions obtained by the white-box models deteriorated after works using Focal Loss. In Advances in Neural Information NMS for both RetinaNet and Faster R-CNN. However, we Processing Systems. think this problem can be solved by using other suppression Müller, R.; Kornblith, S.; and Hinton, G. E. 2019. When does methods which consider a larger set of the overall better cal- label smoothing help? In Advances in Neural Information ibrated boxes than NMS. Processing Systems, 4694–4703. For future work we suggest alternative applications to the Naeini, M.; Cooper, G.; and Hauskrecht, M. 2015. Obtain- standard NMS method to verify if they can lead to better ing Well Calibrated Probabilities Using Bayesian Binning. In calibrated object detectors. One option would be to integrate Proceedings of the 29th AAAI Conference on Artificial Intel- the confidence calibration with box merging strategies com- ligence, 2901–2907. pared by (Roza et al. 2020), such as box averaging, weighted Neumann, L.; Zisserman, A.; and Vedaldi, A. 2018. Re- box fusion or variance voting. laxed Softmax: Efficient Confidence Auto-Calibration for Safe Pedestrian Detection. In Workshop on Machine Learn- Aknowledgements ing for Intelligent Transportation Systems (NIPS). This work was funded by the Bavarian Ministry for Eco- Niculescu-Mizil, A.; and Caruana, R. 2005. Predicting Good nomic Affairs, Regional Development and Energy as part of Probabilities with Supervised Learning. In Proceedings of the 22nd International Conference on Machine Learning, 625– a project to support the thematic development of the Institute 632. for Cognitive Systems and within the Intel Collaborative Re- search Institute Safe Automated Vehicles. Pereyra, G.; Tucker, G.; Chorowski, J.; Łukasz Kaiser; and Hinton, G. 2017. Regularizing Neural Networks by Penaliz- ing Confident Output Distributions. CoRR . References Platt, J. 1999. Probabilistic Outputs for Support Vector Ma- Feng, D.; Rosenbaum, L.; Gläser, C.; Timm, F.; and Diet- chines and Comparisons to Regularized Likelihood Methods. mayer, K. 2019. Can We Trust You? On Calibration of a Advances in Large Margin Classifiers 61–74. Probabilistic Object Detector for Autonomous Driving. ArXiv abs/1909.12358. Ren, S.; He, K.; Girshick, R. B.; and Sun, J. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. Proposal Networks. CoRR abs/1506.01497. URL http://arxiv. On Calibration of Modern Neural Networks. In Proceedings org/abs/1506.01497. of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, Roza, F. S.; Henne, M.; Roscher, K.; and Gunnemann, S. 1321–1330. PMLR. 2020. Assessing Box Merging Strategies and Uncertainty Estimation Methods in Multimodel Object Detection. In Be- yond mAP: Reassessing the Evaluation of Object Detectors @ECCV, –. Seo, S.; Seo, P. H.; and Han, B. 2019. Learning for Single-Shot Confidence Calibration in Deep Neural Net- works Through Stochastic Inferences. In The IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR). Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.-Y.; and Girshick, R. 2019. Detectron2. https://github.com/facebookresearch/ detectron2. Zadrozny, B.; and Elkan, C. 2001. Obtaining Calibrated Prob- ability Estimates from Decision Trees and Naive Bayesian Classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 609–616. Zadrozny, B.; and Elkan, C. 2002. Transforming Classi- fier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada, 694–699. doi:10.1145/775047. 775151. URL https://doi.org/10.1145/775047.775151.