<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Practical Evaluation of Active Learning Approaches for Object Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Schneegans</string-name>
          <email>jschneegans@uni-kassel.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten Bieshaar</string-name>
          <email>maarten.bieshaar@de.bosch.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernhard Sick</string-name>
          <email>bsick@uni-kassel.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Embedded Systems, University of Kassel</institution>
          ,
          <addr-line>Kassel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Robert Bosch GmbH, Corporate Research</institution>
          ,
          <addr-line>Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The supervised training of deep learning models typically requires vast amounts of annotated data. With active learning, the annotation process can be made much more efficient by intelligently selecting the most valuable batches of samples to annotate and train on. Those samples are selected based on their utility regarding the training algorithm. In this work, we examine a wide range of such selection criteria for the task of object detection as performed by the widely applied Faster R-CNN model. We focus on the large and diverse BDD100K autonomous driving dataset, paying special attention to evaluating the model's performance with respect to the dataset's meta information. Furthermore, we distinguish between approaches that select samples based on aleatoric or epistemic uncertainty. A selection of evaluation measures that cover specific error sources and the overall model performance suggests that there is little difference between the individual active learning approaches, even with regard to their specialized focus on different model parts and the object detection sub-tasks of localization and classification. We conclude with a detailed discussion of the implied mechanisms regarding the active learning approaches that seem to affect model performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data annotation is costly both in time and resources (human and
computational). The theoretical advantages of using active learning lie in a more efficient
data annotation process by intelligently selecting a subset of samples that is
thought to be most useful to train the machine learning model on. In this work,
we perform a practical examination of active learning strategies, gaining insights
into why certain approaches perform better than others regarding the task of
2D object detection. This is done on the very large BDD100K [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] dataset, which
is one of the most diverse autonomous driving datasets in terms of scenarios,
weather, uncommon objects, and other attributes, applying the popular Faster
R-CNN model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We describe the active learning process as a cycle of
iteratively selecting a batch of samples to be annotated and training the Faster
R-CNN model on the annotated portion of the data, cf. Figure 1. One batch
consists of a set of images, each image containing one or more objects. The
selection process consists of three parts, cf. Figure 1.
      </p>
      <p>© 2022 for this paper by its authors. Use permitted under CC BY 4.0.</p>
      <p>[Figure 1: The active learning cycle, comprising data acquisition and annotation and a selection step built from a utility function, an aggregation function, and a selection strategy.]</p>
      <p>
        The three parts are: i) a utility function, which estimates the
usefulness of each object, ii) an aggregation function, which aggregates the
usefulness for a complete image, and iii) a selection strategy, which selects the set
of images deemed most useful. The examined variety of active learning
strategies are based on the sub-tasks performed by the Faster R-CNN model, namely
the separation of the annotated objects and the background, and the precise
classification and localization of those objects. Furthermore, we consider and
compare utility functions based on aleatoric and epistemic measures of
uncertainty facilitated by Monte-Carlo dropout [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Multiple aggregation functions,
e.g., mean and quantiles, are tested for each utility function to summarize the
utilities over a whole image. We restrict the selection strategy to simply select
the top k samples, i.e., images, with the highest utilities as aggregated from the
individual object utilities for each image.
      </p>
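      <p>As an illustrative sketch (not the authors' implementation), the interplay of the utility function, the aggregation function, and the top-k selection strategy described above could look as follows; all image names and utility values are hypothetical:</p>

```python
# Hedged sketch of the selection step: per-object utilities are aggregated
# per image, then the top-k images are chosen for annotation.

def aggregate_mean(object_utilities):
    """Aggregation function: mean utility over all objects in one image."""
    return sum(object_utilities) / len(object_utilities)

def select_top_k(image_utilities, k):
    """Selection strategy: the k images with the highest aggregated utility."""
    ranked = sorted(image_utilities, key=image_utilities.get, reverse=True)
    return ranked[:k]

# Hypothetical per-object utilities for three unlabeled images.
per_object = {
    "img_a": [0.9, 0.8],        # two objects the model is very uncertain about
    "img_b": [0.1, 0.2, 0.3],
    "img_c": [0.6, 0.4],
}
aggregated = {name: aggregate_mean(u) for name, u in per_object.items()}
selected = select_top_k(aggregated, k=2)
```
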
      <p>Our goal is to evaluate the practicality and benefits of utilizing active learning
strategies in the training of large object detection models and to provide
practical insights into the design of the corresponding machine learning pipeline.
We present a novel utility function facilitated by the box predictions and their
intersection-over-union (IOU) and experiment with approaches based on the
different sub-tasks executed by the object detection model, i.e., utilizing the
objectness, class, and box predictions of the model. We see a lack of a thorough,
realistic and foremost practical evaluation of various active learning approaches
for the task of object detection. In this work, we aim to close this gap and
compare the actual annotation cost of each active learning approach and discuss the
practicality and performance given the required computational effort. Moreover,
we can get further insights into why certain active learning approaches perform
better than others by examining the selection of meta-attributes, e.g. weather,
time-of-day, scene, etc., of the BDD100K dataset.</p>
      <p>In the following Section 2, we give an overview of active learning approaches
as a whole and those specifically aimed at the task of object detection.
Section 3 introduces our methodology, including the model setup and a thorough
introduction of the examined active learning strategies. In Section 4, we provide
further information regarding the dataset, experimental design, and evaluation
measures. Section 5 discusses the experimental results, after which we summarize
our findings and present directions for future work in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Active learning methods deal with the selection of data samples for annotation
and subsequent model training. Much published work on the topic of active
learning is concerned with proposing specific utility functions or selection
criteria. Often these are based on a Bayesian Neural Network approach or
Monte-Carlo sampling approaches through dropout [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Popular examples are
uncertainty sampling [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and entropy based ones [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], e.g. BALD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
BatchBALD [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Siddhant et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] present a large-scale empirical study on deep
active learning approaches, concluding that BALD can significantly outperform
other approaches, using uncertainty estimates provided either by Dropout or
Bayes-by-Backprop. Most techniques are designed for classification tasks, and only
recently has the spectrum of approaches been widened to encompass regression tasks
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10,11,12,13</xref>
        ]. For a survey on further aspects to consider in active learning, e.g.
cost types and annotator performance, see [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. As we do not consider any
temporal information in the object detection tasks, we do not consider stream-based
active learning methods, but instead focus on a variety of pool-based utility
functions building on uncertainty estimation. Methods for query-synthesis, i.e. the
generation of novel samples to annotate, are also beyond the scope of this work.
      </p>
      <p>
        Advancing active learning methodologies towards more complex prediction
tasks, e.g., object detection and localization, requires more sophisticated active
learning approaches. Brust et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] select images in an uncertainty-based
approach using bounding box and class metrics. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the uncertainties of both
classification and bounding box predictions are utilized, as well. Roy et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
use a query by committee approach and the disagreement between the
convolutional layers in the object detector backbone. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] investigates continual
learning aspects of an ensemble-based method incorporating both classification
and localization aspects for 2D and 3D object detection. Multiple Instance
Active Learning for Object Detection [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] adapts an adversarial training procedure
to select informative images for detector training by observing instance-level
uncertainty. However, this approach implicitly assumes that there is a dominating
object in each image, so that it can attach a single label to each image (as in
image classification). Haussmann et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] evaluate the use of active learning
on a large-scale object detection dataset for autonomous driving, albeit with
a different choice of models, active learning strategies, and dataset.
      </p>
      <p>
The Faster R-CNN model is one of the most widely used object detection
models due to its good performance and many readily available implementations.
Since the original publication of the Faster R-CNN model many improvements to
its architectural design were proposed [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Since most of these approaches add
more complexity to the models with minuscule performance improvements, we
only utilize an additional feature pyramid network [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], whose multi-scale feature
maps will take part in our active learning approaches. Aghdam et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] perform
active learning for object detection by aggregating different pixel-level scores on
the output of a convolutional neural network, which bears resemblance to our
application of utility functions on the objectness maps predicted by the region
proposal network inside the Faster R-CNN.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>This section describes the required preliminaries and individual parts of the
applied machine learning model and introduces the examined active learning
approaches. First, we briefly describe the applied object detection model, i.e.
a modified Faster R-CNN, which we augment with dropout layers to perform
uncertainty estimations, i.e. Monte-Carlo Dropout. Then the active learning
approaches, consisting of individual utility functions, aggregation functions, and
selection strategies, are introduced.</p>
      <sec id="sec-3-1">
        <title>Faster R-CNN</title>
        <p>
          The Faster R-CNN model is one of the most widely used object detection
models due to its reliable performance and readily available implementations; but
due to its two-stage approach it is also one of the slowest. Accordingly,
incorporating active learning in the training pipeline is a natural match to reduce
training times. The Faster R-CNN consists of three main parts: a ResNet [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]
backbone, a region proposal network (RPN), and the classification and regression
heads. Fig. 2 shows the three components of the model, the augmentation via
the dropout layers, as well as exemplary predictions for each sub-task utilized in
the active learning approaches.
        </p>
        <p>
          The backbone used for feature extraction consists of a ResNet50 as
implemented by the torchvision framework [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], followed by a feature pyramid network
(FPN) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] to better handle objects of different scales. The FPN extracts
features at five different scales and three aspect ratios (1:1, 1:2, and 2:1), which are
all input to the region proposal network.
        </p>
        <p>The region proposal network consists of convolutional layers and performs
a foreground and background classification and an initial rough localization of
potential objects. Due to the use of the FPN, this is performed at five scales (32²,
64², 128², 256², and 512² pixels) and three aspect ratios, resulting in a total of
15 objectness maps containing pixel-wise binary classifications. The objectness
is treated as one of three model outputs, which are further utilized in the active
learning approaches.</p>
        <p>Based on the binary classification performed by the RPN, a set of highest
scoring object proposals is selected. Together with the features extracted by the
backbone the selected object proposals are subsequently processed in separate
heads for the final object classification and localization, i.e. box prediction. Those
are the second and third model outputs on which the active learning approaches
are applied.</p>
        <p>
          The dimensions of the last few layers of the Faster R-CNN need to be adjusted
to the specific learning problem posed by the dataset, i.e. the thirteen object
classes considered. The dimensions of the final fully connected layers are adjusted
accordingly. We start each experiment on a pretrained Faster R-CNN model on
the COCO [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] object detection dataset to facilitate faster learning.
        </p>
        <p>[Figure 2: The three components of the modified Faster R-CNN: the ResNet backbone with feature pyramid network, the dropout-augmented region proposal network, and the classification (classes) and box regression heads.]</p>
        <p>
Uncertainty Estimation Most active learning approaches are based on
uncertainty estimates provided by a probabilistic model or sampled from an
augmented model [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Typically a distinction between aleatoric and epistemic
uncertainty is done [
          <xref ref-type="bibr" rid="ref28 ref29">28,29</xref>
          ]. Aleatoric uncertainty measures the uncertainty inherent
in the data, produced by, e.g. noise, or in regards to the application it might
also encompass sources of unpredictability such as motion blur or dirty lenses.
Epistemic uncertainty measures the uncertainty of the model itself about its
predictions and is typically harder to compute. To perform active learning this
second kind of uncertainty is more useful, because one desires to select those
samples, which the model struggles with, given the assumption that those samples
provide the most benefit during training. The (pseudo-)probabilities produced
by a neural network do not capture the epistemic uncertainty [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], therefore,
the model architecture needs to be extended via an appropriate uncertainty
estimation technique. We will utilize Monte-Carlo dropout as proposed in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
and add respective dropout layers to the Faster R-CNN. More specifically they
are added to the convolutional layers of the RPN and the classification and
regression heads. Dropout refers to the CNN specific 2D variant that zeros-out
entire channels, or in abstraction complete features. The model output can thus
be sampled via multiple forward passes to estimate the epistemic uncertainty
about the predictions. We draw 10 samples, i.e., forward passes, to maintain a
reasonable inference time during the active learning cycles. The dropout layers
are also kept active in those approaches that do not rely on the epistemic
uncertainty estimates to avoid biasing the results, because we observed slightly lower
performance while utilizing dropout, and we want to investigate the performance
differences based on the utility functions and not due to adding dropout.
        </p>
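        <p>The Monte-Carlo dropout sampling described above can be illustrated with a toy model; the linear head, feature values, and dropout rate below are illustrative stand-ins, not the actual Faster R-CNN layers:</p>

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward_with_dropout(features, weights, p, rng):
    """One stochastic forward pass: whole features are zeroed out with
    probability p (mimicking channel-wise 2D dropout), followed by a
    linear head and softmax."""
    scale = 1.0 / (1.0 - p)  # inverted-dropout rescaling of kept features
    kept = [f * scale if rng.random() > p else 0.0 for f in features]
    logits = [sum(w * x for w, x in zip(row, kept)) for row in weights]
    return softmax(logits)

def mc_dropout(features, weights, T=10, p=0.5, seed=0):
    """Keep dropout active at inference and draw T samples; the spread of
    the sampled predictions reflects the epistemic uncertainty."""
    rng = random.Random(seed)
    return [forward_with_dropout(features, weights, p, rng) for _ in range(T)]

samples = mc_dropout([1.0, -0.5, 2.0], [[0.3, 0.1, -0.2], [-0.1, 0.4, 0.5]], T=10)
mean_prediction = [sum(s[k] for s in samples) / len(samples) for k in range(2)]
```
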
      </sec>
      <sec id="sec-3-2">
        <title>Active Learning Approaches for Object Detection</title>
        <p>The term active learning encompasses strategies to select a subset of a given
dataset with the goal of reducing costs in data annotation and model training.
It does so in an iterative process of selecting and annotating data, and training
on the available annotated data. We term one of those iterations as a cycle. Since
we are working with the already annotated BDD100K dataset the annotation
process is simulated by taking the annotations of the selected data into account.</p>
        <p>For the application of object detection an active learning strategy consists
of three main parts: a utility function that estimates the utility of an object or
image, an aggregation function that aggregates the utilities of all objects in an
image, and a selection strategy that selects the k most useful images.</p>
        <p>Accordingly, one cycle consists of the model predicting object locations and
classes, the application of the utility function, the application of the
aggregation function, the application of the selection strategy, the annotation of the
selected data, i.e., moving data from the unlabeled dataset to the labeled dataset,
the retraining of the model based on the labeled dataset, and the check of a stopping
criterion.</p>
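        <p>The cycle enumerated above can be sketched as a loop; <monospace>annotate</monospace>, <monospace>train</monospace>, and <monospace>utility_fn</monospace> are placeholder stand-ins for the simulated annotation, the Faster R-CNN training, and a utility/aggregation pipeline:</p>

```python
def active_learning_loop(unlabeled, annotate, train, utility_fn, k, cycles):
    """Sketch of pool-based active learning: score the unlabeled pool,
    'annotate' the top-k samples (simulated by moving them to the labeled
    pool), retrain, and repeat for a fixed number of cycles."""
    labeled = []
    model = None  # stands in for the pretrained model at the zeroth cycle
    for _ in range(cycles):
        if not unlabeled:
            break  # stopping criterion: nothing left to annotate
        ranked = sorted(unlabeled, key=lambda s: utility_fn(model, s), reverse=True)
        for sample in ranked[:k]:
            unlabeled.remove(sample)
            labeled.append(annotate(sample))
        model = train(labeled)  # retrained from the pretrained state each cycle
    return model, labeled

# Toy run: higher sample ids pretend to be more useful.
model, labeled = active_learning_loop(
    unlabeled=list(range(10)),
    annotate=lambda s: (s, f"label_{s}"),
    train=lambda data: len(data),  # placeholder 'model'
    utility_fn=lambda m, s: s,
    k=2,
    cycles=3,
)
```
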
        <p>The initial condition (zeroth cycle) consists of an unlabeled dataset and the
newly initialized (pretrained) model. Stopping criteria can be based on the
amount of data that can be annotated, which might be restricted by available
resources, e.g. financial budget, or based on the model performance, e.g., when
a desired performance is reached or when the training saturates. Since we do
not explicitly consider a fixed budget in the utility functions, we simply set the
number of active learning cycles to 30 (based on the model convergence during
the experiments) and compare different approaches based on the model
performances over those iterations. The design of cost-sensitive utility functions, which
explicitly consider the sample costs when estimating their utility, is still an
open research topic.</p>
        <p>Utility Functions Given the model predictions about the object classes and
locations, a utility function ascribes a usefulness to each object in the unlabeled
dataset. Additionally, in case of the objectness predictions for an image, i.e. the
feature maps showing the foreground-background classifications performed by
the RPN, the utility of the entire image can be estimated directly. We further
consider an approach utilizing all three predictions, i.e. objectness, class and
location, combining multiple utility functions. The approaches based on the object
classes consist of the following measures:</p>
        <p>
          The normalized entropy provides a measure of the lack of model confidence
based on the class predictions. It is obtained through normalization of the
Shannon entropy [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] H over the maximum possible entropy log(K), which is reached
by a uniform distribution. Formally it is defined as
η(p̂) = H / H_max = −(1 / log K) ∑_{k=1}^{K} p̂_k log(p̂_k),  (1)
        </p>
        <p>
where p̂ are the class predictions and K the number of classes. The predicted
class probabilities p̂_k = σ_K(x)_k are given by applying the softmax function to
the logits, i.e., the classification output of the model. The normalized entropy
produces a high value when there is a strong disagreement between the different
classes, i.e., when the distribution over the predicted classes approaches
uniformity. Conversely, this entropy measure will be low when there is a single class
holding most of the distribution's mass, i.e., when the model is confident.</p>
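        <p>A direct transcription of the normalized entropy of Eq. (1) reads as follows; the guard against zero probabilities is an implementation detail, not part of the formula:</p>

```python
import math

def normalized_entropy(p_hat):
    """Normalized Shannon entropy: H divided by the maximum possible
    entropy log(K), so the result lies in [0, 1]."""
    K = len(p_hat)
    H = -sum(p * math.log(p) for p in p_hat if p > 0.0)
    return H / math.log(K)

# A uniform distribution yields 1.0, a one-hot prediction yields 0.0.
uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
confident = normalized_entropy([1.0, 0.0, 0.0, 0.0])
```
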
        <p>
          BALD [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], is also an entropy based measure. It aims to select samples that are
expected to maximize the information gained about the model parameters [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
BALD specifically utilizes the epistemic uncertainty of the model by sampling
the model output via the applied Monte-Carlo dropout technique. The sampled
predictions are clustered according to their location via an agglomerative
clustering based on a distance threshold of 0.5 regarding their box IOU. The BALD
utility function can then be applied to each cluster, i.e. each predicted object.
We utilize a modified version of the BALD utility function that is normalized,
facilitating an unbiased combination of utility functions. Given a dataset D and
model M with parameters ωt as one of T random dropout configurations, the
computationally tractable approximation of the utility function used during the
experiments is formally given by
        </p>
        <p>
α(x|ω) = (1 / log K) [ −∑_k ( (1/T) ∑_t p̂_{t,k} ) log( (1/T) ∑_t p̂_{t,k} ) + (1/T) ∑_{k,t} p̂_{t,k} log(p̂_{t,k}) ],  (2)
where p̂_{t,k} is the probability of input x predicted by the model with parameters
ω_t to take on class k, i.e., p̂_{t,k} = σ_K(M_{ω_t}(x))_k, given by the class predictions
of the cluster. As with the entropy measure above, we normalize the BALD
equation by the maximum possible entropy log(K).</p>
        <p>We will also apply both the normalized entropy and BALD utility functions
to the objectness maps produced by the RPN.</p>
        <p>As a utility function based on the object box regression, we propose a novel
measure based on the Intersection-Over-Union (IOU). Similar to BALD we first
need to cluster the proposed boxes per object before we calculate the IOU of
each box within a cluster to the cluster mean, i.e., the mean box of the cluster.
Subsequently, those IOU values are averaged. Because we require the utility
function to signify higher uncertainty with higher values, we invert the expression
by calculating one minus the mean IOU. The expected IOU (eIOU) is thus formalized
as
b̄_c = (1 / |B_c|) ∑_{b∈B_c} b,  (3)
        </p>
        <p>
eIOU(B_c) = 1 − (1 / |B_c|) ∑_{b∈B_c} IOU(b, b̄_c),  (4)
where B_c is the set of predicted boxes per cluster c, and b̄_c the mean box of the
cluster. The expected IOU is normalized by definition, since the IOU itself is
normalized; this is advantageous compared to using the total or generalized
variance of a cluster, because it is not influenced by the position and size of
the object proposals, which should not bias the utility function.</p>
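        <p>Equations (3) and (4) can be sketched as follows for axis-aligned boxes; the (x1, y1, x2, y2) box format is an assumption for illustration:</p>

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0

def expected_iou(cluster):
    """eIOU utility: one minus the mean IOU of each sampled box in a
    cluster to the cluster's mean box. Tight clusters score near 0;
    scattered clusters (uncertain localization) score higher."""
    n = len(cluster)
    mean_box = tuple(sum(b[i] for b in cluster) / n for i in range(4))
    return 1.0 - sum(iou(b, mean_box) for b in cluster) / n
```
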
        <p>In order to evaluate a utility function utilizing all three model outputs, i.e.,
objectness, classes and boxes, we define a combined measure consisting of the
normalized BALD approach applied to the objectness and the class, together
with the eIOU.</p>
        <p>The presented selection of utility functions comprises the most general and
widely applied approaches based on estimated model uncertainties with the
addition of a similarly inspired box-based version, i.e. the expected IOU. Having
defined the utility functions, we can measure the utility per predicted object, or
pixel in case of the objectness maps.</p>
        <p>Aggregation Functions An aggregation function summarizes the output
produced by a utility function to describe the utility of a complete sample, i.e. an
entire image.</p>
        <p>Intuitively, the mean over the utility function output provides a measure of
the average utility in annotating a certain sample. The median is not a good
option, as it is not influenced by outliers, e.g., objects the model is especially
uncertain about, but we want the utility measure to be explicitly influenced by
those parts of the image, assuming that these outliers are particularly interesting
and useful.</p>
        <p>Accordingly, applying the max function gives further priority to especially
high values of the utilities produced by the utility function.</p>
        <p>However, simply applying the max aggregation function to the utility
functions applied to the objectness maps would not work well, due to the fact that the
objectness maps almost always contain maximum values of 1; thus every image
would be ascribed the same maximum utility. To solve this issue we ignore those
very high values by only considering the 95th and 99th percentiles.</p>
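        <p>A sketch of the percentile-based aggregation; the nearest-rank percentile and the toy objectness values are illustrative simplifications:</p>

```python
def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]); a simplified stand-in for,
    e.g., numpy.percentile."""
    s = sorted(values)
    idx = min(len(s) - 1, round(q / 100.0 * (len(s) - 1)))
    return s[idx]

def aggregate_objectness(pixel_utilities, how):
    """Aggregate per-pixel utilities of an objectness map. Percentiles
    ignore the (almost always saturated) maximum values."""
    if how == "max":
        return max(pixel_utilities)
    if how == "p95":
        return percentile(pixel_utilities, 95)
    if how == "p99":
        return percentile(pixel_utilities, 99)
    raise ValueError(how)

# A hypothetical objectness-map utility: a few saturated pixels at 1.0.
pixels = [0.3] * 995 + [1.0] * 5
```

        <p>Here the max aggregation saturates at 1.0 for nearly every image, while the percentiles still reflect the bulk of the map and thus discriminate between images.</p>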
        <p>Selection Strategies The selection strategy decides which of the samples from
the unlabeled dataset are selected for annotation, given the aggregated utilities
inferred through the application of a utility and aggregation function. We want
to train the model on those samples that are deemed most useful; naturally,
the selection strategy simply consists of taking the top k over all unlabeled
samples, selecting those with the maximum aggregated utility per
image. Depending on the utilized utility functions those samples are also the
ones the model is most uncertain about.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Methodology</title>
      <p>This section details preliminary information regarding the dataset, the
experimental setup, and the applied evaluation measures necessary to investigate the
object detection as well as the active learning performances.</p>
      <p>Dataset The experiments are performed on the BDD100K dataset, which is
one of the largest object detection datasets in the autonomous driving domain. It
contains a variety of scenarios, sceneries, and annotated objects. Due to varying
conditions such as time-of-day, weather, as well as noise, motion blur, and lens
flares, the dataset poses a challenge towards current machine learning models.
This naturally befits the use of active learning techniques to select different and
useful samples, with the goal of reducing both annotation costs and training
time.</p>
      <p>There are five object categories with overall 13 classes: bike: bicycle,
motorcycle; person: pedestrian, rider; vehicle: bus, car, truck; distractor: other person,
other vehicle, trailer, train; signal: traffic light, traffic sign.</p>
      <p>We perform experiments on all of the 13 classes as well as an easier subset
of three classes summarized by bike, person, and vehicle, which supported the
results on the larger set of classes. Additionally, we investigate the connection
between the available meta-attributes with the active learning approaches to see
if the approaches display certain preferences in selecting data samples in regards
to these attributes: weather: clear, foggy, overcast, partly cloudy, rainy, snowy,
undefined; scene: city street, gas stations, highway, parking lot, residential,
tunnel, undefined; time-of-day: dawn/dusk, daytime, night, undefined; occluded:
False, True; truncated: False, True.</p>
      <p>Experimental Design Our goal is to evaluate the individual active learning
approaches, consisting of combinations of the introduced acquisitions functions,
aggregation functions, and selection strategy. For each of those combinations we
train a Faster R-CNN model for 30 active learning cycles. The pretrained model
gets trained from scratch in each cycle so as not to overfit on the data annotated
the earliest, which we observed upon initial experimentation. Each experiment
is performed twice, to make sure the experimental results and discussion thereof
are reliable.</p>
      <p>The BDD100K dataset contains 100,000 images; however, the annotations of
the original test set are not available, so we took the last 10,000 images from the
training set to form an annotated test set. The splits are illustrated in Figure 3
with the original validation set, containing 10k images.</p>
      <p>
        Through preliminary experiments, the main hyper-parameters were
determined. Those include: learning rate = 1e-5, batch size = 20, epochs = 10 (per
cycle), and Mish activation functions. We utilize the Ranger optimizer [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] for
faster convergence, which incorporates AdaBelief, RAdam, Lookahead, and
Gradient Centralization. An acquisition size of 512, i.e. the number of acquired
samples after each cycle, was determined to balance a reasonable annotation cost
and training progress on the growing annotated dataset.
      </p>
      <p>[Figure 3: Dataset splits: 60k training, 20k unlabeled, 10k test, and 10k validation images.]</p>
      <p>This means the final
models (at cycle 30) were trained on 30 · 512 = 15,360 images, which corresponds to
25.6% of the training data.</p>
      <p>During training we utilize image augmentation by alteration of the brightness
and contrast, or by adding Gaussian noise. The augmentations are applied with a
random probability, order, and intensity (within previously determined bounds).</p>
      <p>Performance Measures We distinguish between two kinds of measures that
evaluate the object detection performance and further aid the investigation of
the active learning approaches, respectively.</p>
      <p>
        The most commonly applied evaluation measure for the task of object
detection is the mean Average Precision (mAP). It describes the area under the
precision-recall curve derived from the statistics of the model predictions. Since
the mAP score only provides a single number, we additionally apply separate
measures to evaluate each of 6 possible kinds of errors: class error, location error,
class and location error, duplicate predictions, background prediction, missed
objects, as proposed by [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. A predicted box is considered correct if its IOU with
the ground-truth box is higher than 0.5. To be able to compare class and location
errors independently, an additional lower IOU threshold is needed, so that if a
box is in the wrong location the classes can still be reasonably compared. This
lower IOU threshold is set to 0.1, as suggested by [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Arguably, predictions
recognized as class or location errors could also be counted as duplicate predictions
if they can be matched to a ground-truth box, but since we want to count every
prediction only once, only otherwise correct predictions count towards duplicate
errors. By investigating the individual error sources, we can, for example, examine
whether approaches based on the predicted classes produce fewer classification errors, or
whether approaches based on the box predictions perform better in the localization
sub-task.
      </p>
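The two-threshold matching rule above can be sketched in a few lines. This is a simplified, single-match illustration of the idea, not the actual TIDE implementation [31]; the function names are hypothetical.

```python
# A prediction is correct at IoU above 0.5; a lower threshold of 0.1 lets
# class errors be separated from location errors when the box is misplaced.

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

def categorize(pred_cls, gt_cls, overlap, fg=0.5, bg=0.1):
    # Simplified single-match version of the TIDE error categories.
    if overlap >= fg:
        return "correct" if pred_cls == gt_cls else "class_error"
    if overlap >= bg:
        return "location_error" if pred_cls == gt_cls else "class_and_location_error"
    return "background_error"
```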
      <p>To evaluate the performance of the active learning strategies, we are not
only interested in the final model performances after 30 cycles, but also in the
annotation costs (as measured by the number of annotated objects), which are
often neglected in the current literature, and in the learning behavior over all
active learning cycles. Notably, while the same number of samples, i.e., images,
is selected in each cycle, different numbers of objects in the selected images
lead to different annotation costs for the active learning approaches. Splitting
the performance evaluation according to the available attributes is very useful
for investigating whether a model performs better on samples that are considered
more difficult, e.g., when the time-of-day is night. Another assumption often made
is that difficult samples are the most useful to train on and that active learning
approaches based on uncertainty measures are supposed to select those difficult
samples. To investigate whether this is the case, we sort all samples by their average
mAP score over all models and compare them with the selection by the models.</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>This section presents the results of the experiments and discusses the various
insights into the applied active learning approaches. This encompasses three main
parts: the learning behavior, the final model performances, and the acquisition
characteristics.</p>
      <p>The different approaches are each abbreviated by three letters, indicating
the type of sub-task predictions used for the active learning approach, the utility
function, and the aggregation function. For example, cbm is the approach
comprised of the BALD utility function applied to the predicted object classes and
aggregated by the mean aggregation function. See all abbreviations in Table 1.
rdm denotes random sampling, and all stands for the approach that utilizes all
three kinds of predictions, applying the normalized BALD utility function to the
class and objectness predictions, and the eIOU to the boxes.</p>
      <p>We first compare the various active learning approaches by considering their
performance on the test set. Fig. 4 shows the mAP performance for each cycle,
averaged over both random seeds. Remember that each cycle adds 512 images to
the labeled portion of the training set and in each cycle the models are trained
anew. With this in mind, we can see that training on more data consistently improves
the model performances for all approaches. However, the convergence
behavior is similar for all approaches, i.e., no approach provides considerably
faster learning based on the selected data, and they all end up with around the
same performance after 30 cycles. We performed the same set of experiments with
a reduced set of annotations containing only three classes (vehicle, pedestrian, and
cyclist), resulting in very much the same performance and learning behavior.</p>
      <p>
        Second, we compare the performance of the trained models after 30 cycles.
Fig. 5 shows the performance of each approach, sorted by the mAP score on the
test set, and the different error sources according to the TIDE measures [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
Again, the numbers represent the average over both random seeds. Generally,
we observe no significant differences between the active learning approaches.
Most perform slightly better than random sampling. The approaches based on
the objectness seem to perform best; the reason for this will be explored in
the next subsection. There is no clear winner among the utility functions or
the aggregation functions. The performance of models trained via utility functions
based on class or box predictions does not show a clear correlation with their specific
sub-tasks, as shown by the detailed evaluation of the different kinds of errors.
      </p>
      <p>To investigate which kind of images each active learning approach acquires,
we compare the selection for each model and for each cycle for the following
object- and image-level attributes: class label, box size (5 bins, logarithmic),
and the attributes presented by the BDD100K dataset. Additionally, we will
assess whether the approaches select especially difficult samples, as measured by the
average mAP score of the models.</p>
      <p>Fig. 4. Learning curves showing the mAP performances per cycle on the test set.
All approaches show similar learning behavior, although cem, cex, and bim perform
slightly worse than the other approaches.</p>
      <p>Fig. 6 shows the distribution of attributes in the training set (black dashed)
from which the active learning approaches iteratively select a subset to train
on. The attribute distributions of the selected subsets are shown for each cycle,
whereby lower cycles are blue and higher cycles are red. One would assume that
the approaches should select a higher proportion of attributes that occur less
often in the dataset, given the assumption that those samples are probably more
difficult for the models. If this were the case, one should see a balancing between
the individual instances of the attributes, respectively.
      <p>Regarding the label distributions, the approaches mostly adhere to the
training distributions, with the exception of bim and cbm, which even exaggerate the
label imbalances by selecting many images with cars in them. The approaches
based on the objectness maps show more promising behavior by selecting fewer
cars and more pedestrians, with the exception of the approaches utilizing the
max aggregation function. The reason is that they practically reduce to
random sampling due to the effect explained in Sec. 3.2. The box size
selection is very consistent between the approaches, oversampling small boxes.
The weather selection looks very similar to the label attribute, with the cem
approach also oversampling the most prominent weather instance (clear).
Interestingly, the scene attribute shows an inverse behavior compared to the label and
weather distributions. bim and cbm select more images showing residential and
fewer showing city street. cem and cex have a similar preference towards highway
scenes. Contrary to the label selection, the objectness based approaches
oversample city street scenes and avoid highway scenes; we will discover why when
we look at the number of acquired objects. Regarding the time-of-day, most
approaches exaggerate the day- and night-time imbalance in the distribution by
selecting primarily day-time images, while some seem to prefer night-time images
(bim, cem). The distributions of the occlusion and truncation attributes show
no significant behavior, except for some approaches, as apparent in the figure.</p>
      <p>Fig. 5. Model performance after 30 cycles, sorted by the mAP score on the left. FN
and TPgt are in proportion to the number of ground truths, all else in proportion to the
number of predictions. There is no score threshold applied to the predictions, which is
why the percentage TPpred seems relatively low. FN: false negatives, TP: true positives,
CLS: class error, LOC: location error, CLS+LOC: class and location error, DUP: duplicate
predictions, BKG: false positives/background predictions.</p>
      <p>Contrary to the assumption, we generally observe little balancing and the
attribute distributions of the selected images mostly vary around the training
data distribution. We make the general observation that if there are deviations
from the training distributions, they are more pronounced in the early cycles
(blue), and the selection distributions are closing in on the training distributions
towards later cycles (red), often matching them in the final cycles. For the
attribute instances with very few samples we barely see any selection; enabling the
approaches to oversample during sample selection might help in those cases. The
attribute distributions of the data selected by random sampling (rdm, last row)
are consistent with the overall data distribution, as expected.</p>
      <p>Another kind of attribute often implicitly talked about is the difficulty of
the samples, based on the assumption that more difficult samples are
particularly useful for model training and should thus be selected by active learning
approaches, especially by uncertainty based ones if we relate uncertainty to
difficulty. To check this assumption, we sorted all images in the training set according
to their average mAP score over all models to estimate their difficulty.</p>
      <p>Fig. 7 shows the four kinds of behaviors observed when we look at the image
selections in each cycle over the mAP score. To create the depiction, all 60k
images in the training set, sorted by their average mAP score, were binned into
128 bins, shown along the x-axis (low to high mAP, from left to right). This
includes the selection from both random seeds. The rows depict the cycles (early
to late, from bottom to top). The approaches based on the class predictions and
the BALD utility function, as well as the proposed box based eIOU approach,
select difficult examples in the earlier cycles, which then diffuses towards the
region of average difficulty after the first few cycles. cbx maintained a slightly
stronger preference towards high mAP scores throughout all cycles. Class-based
approaches utilizing the normalized entropy utility function have a very strong
tendency to sample very easy or very difficult images, as estimated by the mAP
score. Here, the focus tapers off towards later cycles as well. The approaches
based on the objectness maps show more independent distributions over the
mAP score, with a slight tendency not to sample very difficult images. There
is barely any variation over the cycles. Lastly, some approaches show similarly
random behavior to random sampling. Notably, all of those approaches use the
max aggregation function. As already mentioned, these approaches assign
the same maximum utility to most images, leading to a random selection.</p>
      <p>Fig. 7. The four kinds of selection behaviors expressed by the active learning
approaches with regard to the difficulty of the available images, estimated by the average
mAP score of each image. It shows a two-dimensional histogram indicating the number
of images in that region of the sorted mAP score. The mAP score is sorted from left
to right, i.e., from less difficult to most difficult.</p>
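The binning behind this depiction can be sketched as follows. This is an illustrative sketch, not the authors' plotting code; `selection_histogram` and its inputs are hypothetical names.

```python
# Sketch: images are sorted by average mAP score and assigned to 128
# difficulty bins; for each cycle we count the selected images per bin,
# yielding one histogram row per cycle.

NUM_BINS = 128

def selection_histogram(sorted_image_ids, selections_per_cycle):
    bin_size = len(sorted_image_ids) / NUM_BINS
    rank = {img_id: i for i, img_id in enumerate(sorted_image_ids)}
    hist = []
    for selected in selections_per_cycle:   # one row per acquisition cycle
        row = [0] * NUM_BINS
        for img_id in selected:
            row[min(NUM_BINS - 1, int(rank[img_id] / bin_size))] += 1
        hist.append(row)
    return hist
```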
      <p>Finally, we compare the performance of each approach with the annotation
costs of their acquired images, as this shows the actual practical applicability of
the approaches. The performance is estimated by the mAP score of the trained
models after 30 cycles. The annotation costs are estimated by the number of
objects contained in the acquired images, because annotation companies usually
calculate the annotation effort and costs per object. Fig. 8 shows both the
performance and the annotation costs for each approach, relative to random sampling.
The approaches are sorted by their performance.</p>
      <p>We observe a strong correlation between the annotation costs, i.e., the number
of objects, and the model performance. If we allow for a small decrease in
performance compared to random sampling, the annotation costs can be reduced
immensely, depending on the approach. In contrast, to achieve a slight increase
in performance, the annotation costs rise by about 40%. Notably, we see that the
objectness based approaches consistently acquire images with more objects.</p>
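The relative comparison against random sampling can be sketched in a few lines. This is an illustrative sketch; `relative_to_random` and the result structure are hypothetical, not from the paper's code.

```python
# Sketch: express each approach's final mAP and annotation cost (number of
# annotated objects) relative to the random-sampling baseline "rdm".

def relative_to_random(results, baseline="rdm"):
    base = results[baseline]
    return {
        name: {
            "rel_map": r["map"] / base["map"],
            "rel_cost": r["objects"] / base["objects"],
        }
        for name, r in results.items()
    }
```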
      <p>This is consistent with our previous observations that the objectness based
approaches preferably select day-time city street scenes with many pedestrian labels,
compared to the usually high number of car labels. Otherwise, the
results do not seem to correlate with the results depicted in Fig. 6 or Fig. 7. If
we compare the annotation costs with the more detailed TIDE scores, we notice
that the model performance correlates well with the overall number of errors.</p>
      <p>Overall, a high number of objects is beneficial to the model performance,
and the objectness based approaches represent a proxy to find images
containing many objects. We attribute this to the observation that the objectness maps
produce high values around object edges, because there the uncertainty whether
a pixel in the objectness map corresponds to an object is very high.
Accordingly, the more objects there are in the image, the more edges there will be, and the
higher the aggregated utilities are. This naturally leads to the question whether
simpler approaches, such as an edge detector, could provide a good basis for active
learning strategies, reducing the computational effort by a large margin. To
further note: random sampling takes drastically less compute effort, because one
does not need to estimate the uncertainties, e.g., by sampling a model multiple
times.</p>
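Such an edge-based utility could be sketched as follows. This is a purely hypothetical illustration of the idea, not an approach evaluated in this work; images are modeled as nested lists of grayscale values and the threshold is an assumed value.

```python
# Sketch: score an image by counting pixels whose gradient magnitude exceeds
# a threshold; more edges suggest more objects, hence higher utility.

def edge_utility(gray, threshold=32):
    h, w = len(gray), len(gray[0])
    count = 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = abs(gray[y][x + 1] - gray[y][x])   # horizontal gradient
            gy = abs(gray[y + 1][x] - gray[y][x])   # vertical gradient
            if gx + gy > threshold:                 # crude edge test
                count += 1
    return count  # more edges, roughly more objects, higher utility
```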
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We applied a variety of active learning approaches to the task of object detection
on a large and varied autonomous driving dataset. The approaches, comprising
combinations of multiple utility functions and aggregation functions, utilized
different kinds of model predictions based on the sub-tasks performed by the
Faster R-CNN model. Overall, the approaches performed similarly, but showed
some differences in how they functioned. An investigation into the attributes of
the selected images led to the observation that the objectness based approaches
perform an elaborate proxy-task to estimate the number of objects per image.
A main insight is the clear correlation between the number of objects in the selected
images and the performance of the models trained on them.</p>
      <p>It remains questionable whether the uncertainty based approaches evaluated in this
work justify the added complexity in implementation and computational
costs compared to random sampling. Therefore, active learning approaches must
further strive to be applicable to complex, real world datasets and difficult
learning tasks such as object detection. However, we discovered a promising direction
of utilizing more primitive and efficient proxy-tasks, e.g., estimating the number
of objects per image, to base active learning approaches on.</p>
      <p>The assumption that, for example, night-time images are more difficult and
should thus be selected by active learning approaches could not be confirmed.
This either means that the assumption is not true, which could be further
verified by looking at the error scores of individual images with the respective
attributes, or that the approaches simply do not select samples according to the
assumption.</p>
      <p>The experiments can be extended to include more datasets and a wider range
of active learning approaches, since more utility functions and active learning
strategies have been proposed in the time since the experiments were conducted.
Likewise, the application domain as well as the object detection task further
warrant the additional use of other sensor modalities, e.g., Lidar or Radar. We
also did not consider temporal information, e.g., video data, for stream based
active learning.</p>
    </sec>
    <sec id="sec-7">
      <title>Code Availability</title>
      <p>The code base of this work is available to reproduce, verify, or extend the
experiments conducted for this work under https://git.ies.uni-kassel.de/
public_code/a_practical_evaluation_of_active_learning_approaches_for_
object_detection.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>This work results from the project KI Data Tooling (19A20001O) funded by the
German Federal Ministry for Economic Affairs and Climate Action (BMWK).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          , and T. Darrell, “
          <article-title>BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning</article-title>
          ,”
          <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pp.
          <fpage>2633</fpage>
          -
          <lpage>2642</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , “
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          ,
          <source>” IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>39</volume>
          , pp.
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Islam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          , “
          <article-title>Deep Bayesian active learning with image data,”</article-title>
          <source>in 34th International Conference on Machine Learning, ICML 2017</source>
          , vol.
          <volume>3</volume>
          ,
          <issue>2017</issue>
          , pp.
          <fpage>1923</fpage>
          -
          <lpage>1932</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          , “
          <article-title>Dropout as a Bayesian Approximation: Appendix,”</article-title>
          <source>in 33rd International Conference on Machine Learning, ICML 2016</source>
          , vol.
          <volume>3</volume>
          ,
          <issue>2016</issue>
          , pp.
          <fpage>1661</fpage>
          -
          <lpage>1680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          , “
          <article-title>A sequential algorithm for training text classifiers</article-title>
          ,”
          <source>in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>1994</year>
          ,
          <year>1994</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          , “
          <source>A Mathematical Theory of Communication,” Bell System Technical Journal</source>
          , vol.
          <volume>27</volume>
          , no.
          <issue>6</issue>
          ,
          <issue>10</issue>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          ,
          <fpage>623</fpage>
          -
          <lpage>656</lpage>
          ,
          <year>1948</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huszár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          , “
          <article-title>Bayesian Active Learning for Classification and Preference Learning,” ArXiv</article-title>
          , vol.
          <source>abs/1112.5</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirsch</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. van Amersfoort</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          , “
          <article-title>BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning</article-title>
          ,
          <source>” in Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>32</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          , “
          <article-title>Deep Bayesian Active Learning for Natural Language Processing: Results of a Large-Scale Empirical Study,” ArXiv</article-title>
          , vol. abs/
          <year>1808</year>
          .0,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>D. Wu</surname>
            ,
            <given-names>C. T.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>and J.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
          </string-name>
          , “
          <article-title>Active learning for regression using greedy sampling,”</article-title>
          <source>Information Sciences</source>
          , vol.
          <volume>474</volume>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>D. Wu</surname>
          </string-name>
          , “
          <article-title>Pool-Based Sequential Active Learning for Regression,”</article-title>
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          , vol.
          <volume>30</volume>
          , pp.
          <fpage>1348</fpage>
          -
          <lpage>1359</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. C. Käding, E. Rodner,
          <string-name>
            <given-names>A.</given-names>
            <surname>Freytag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mothes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Barz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Denzler</surname>
          </string-name>
          , “
          <article-title>Active learning for regression tasks with expected model output changes</article-title>
          ,
          <source>” in British Machine Vision Conference</source>
          <year>2018</year>
          ,
          <string-name>
            <surname>BMVC</surname>
          </string-name>
          <year>2018</year>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>J. Goetz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Tewari</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Zimmerman</surname>
          </string-name>
          , “
          <article-title>Active Learning for Non-Parametric Regression Using Purely Random Trees</article-title>
          ,” in NeurIPS,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>M. Herde</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Huseljic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Sick</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Calma</surname>
          </string-name>
          , “
          <article-title>A survey on cost types, interaction schemes, and annotator performance models in selection algorithms for active learning in classification</article-title>
          ,
          <source>” IEEE Access</source>
          , vol.
          <volume>9</volume>
          , pp.
          <volume>166</volume>
          <fpage>970</fpage>
          -
          <lpage>166</lpage>
          989,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Brust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Käding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Denzler</surname>
          </string-name>
          , “
          <article-title>Active learning for deep object detection,”</article-title>
          <source>in VISIGRAPP 2019 - Proceedings of the 14th International Joint Conference on Computer Vision</source>
          , Imaging and
          <source>Computer Graphics Theory and Applications</source>
          , vol.
          <volume>5</volume>
          ,
          <issue>2019</issue>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>C. C. Kao</surname>
            ,
            <given-names>T. Y.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Sen</surname>
          </string-name>
          , and M. Y. Liu, “
          <article-title>Localization-Aware Active Learning for Object Detection</article-title>
          ,
          <source>” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , vol.
          <volume>11366</volume>
          LNCS,
          <year>2019</year>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Unmesh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Namboodiri</surname>
          </string-name>
          , “
          <article-title>Deep active learning for object detection</article-title>
          ,” in
          <source>British Machine Vision Conference 2018, BMVC 2018</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tatsch</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Knoll</surname>
          </string-name>
          , “
          <article-title>Advanced active learning strategies for object detection</article-title>
          ,” in
          <source>Intelligent Vehicles Symposium (IV)</source>
          , Las Vegas, NV,
          <year>2020</year>
          , pp.
          <fpage>871</fpage>
          -
          <lpage>876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>T.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          , “
          <article-title>Multiple instance active learning for object detection</article-title>
          ,” in
          <source>Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , virtual,
          <year>2021</year>
          , pp.
          <fpage>5330</fpage>
          -
          <lpage>5339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>E.</given-names>
            <surname>Haussmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chitta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ivanecký</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koumchatzky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Farabet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          , “
          <article-title>Scalable Active Learning for Object Detection</article-title>
          ,” in
          <source>IEEE Intelligent Vehicles Symposium (IV)</source>
          , Las Vegas, NV, USA,
          <year>2020</year>
          , pp.
          <fpage>1430</fpage>
          -
          <lpage>1435</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          , “
          <article-title>Cascade R-CNN: Delving Into High Quality Object Detection</article-title>
          ,”
          <source>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>6154</fpage>
          -
          <lpage>6162</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , “
          <article-title>Feature pyramid networks for object detection</article-title>
          ,” in
          <source>Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>936</fpage>
          -
          <lpage>944</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>H.</given-names>
            <surname>Habibi Aghdam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Lopez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>van de Weijer</surname>
          </string-name>
          , “
          <article-title>Active learning for deep detection neural networks</article-title>
          ,” in
          <source>International Conference on Computer Vision (ICCV)</source>
          , Seoul, South Korea,
          <year>2019</year>
          , pp.
          <fpage>3671</fpage>
          -
          <lpage>3679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , “
          <article-title>Deep residual learning for image recognition</article-title>
          ,” in
          <source>Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>S.</given-names>
            <surname>Marcel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          , “
          <article-title>Torchvision the machine-vision package of torch</article-title>
          ,” in
          <source>Proceedings of the 18th ACM International Conference on Multimedia, ser. MM '10</source>
          . New York, NY, USA: Association for Computing Machinery,
          <year>2010</year>
          , pp.
          <fpage>1485</fpage>
          -
          <lpage>1488</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          , “
          <article-title>Microsoft COCO: Common objects in context</article-title>
          ,” in
          <source>European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , “
          <article-title>A survey of deep active learning</article-title>
          ,”
          <source>ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>54</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>D.</given-names>
            <surname>Huseljic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herde</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kottke</surname>
          </string-name>
          , “
          <article-title>Separation of aleatoric and epistemic uncertainty in deterministic deep neural networks</article-title>
          ,” in
          <source>25th International Conference on Pattern Recognition (ICPR)</source>
          . IEEE,
          <year>2021</year>
          , pp.
          <fpage>9172</fpage>
          -
          <lpage>9179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          , “
          <article-title>What uncertainties do we need in Bayesian deep learning for computer vision</article-title>
          ?” in
          <source>Conference on Neural Information Processing Systems (NIPS)</source>
          , Long Beach, CA,
          <year>2017</year>
          , pp.
          <fpage>5574</fpage>
          -
          <lpage>5584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>L.</given-names>
            <surname>Wright</surname>
          </string-name>
          , “
          <article-title>Ranger-deep-learning-optimizer</article-title>
          ,”
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>D.</given-names>
            <surname>Bolya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Foley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          , “
          <article-title>TIDE: A general toolbox for identifying object detection errors</article-title>
          ,”
          <source>ArXiv</source>
          , vol. abs/2008.08115,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>