Safety-aware Active Learning with Perceptual Ambiguity and Severity Assessment

Prajit T Rajendran1,*, Guillaume Ollier1, Huascar Espinoza2, Morayo Adedjouma1, Agnes Delaborde3 and Chokri Mraidha1

1 CEA, List, F-91120, Palaiseau, France
2 KDT JU, Avenue de la Toison d'Or 56-60, 1060 Brussels, Belgium
3 Laboratoire National de Metrologie et d'Essais, Trappes, France

Abstract
Deep Neural Networks (DNN) used in self-driving cars need large data coverage and labelling to manage all potential hazards in safety-critical scenarios. Active learning approaches make use of automated data selection and labelling to build diverse datasets at lower human cost and with higher accuracy. Traditional active learning methods consider the uncertainty of the model predictions and the diversity of the data points for query selection. However, they are not optimal at capturing many critical data points that are potentially risky with respect to safety considerations. In this position paper, we propose a novel approach that uses human feedback related to perceptual data ambiguity and a criticality score linked to system-level safety assessment. This approach includes a continual learning model that learns to identify corner cases and blindspots with high impact on potential risk, and combines them with uncertainty-sampling and diversity-sampling models to create a safety-aware acquisition function for active learning.

Keywords
Safety, Active learning, Autonomous driving, Human-in-the-loop learning

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria
* Corresponding author.
$ prajit.thazhurazhikath@cea.fr (P. T. Rajendran); guillaume.ollier@cea.fr (G. Ollier); Huascar.Espinoza@kdt-ju.europa.eu (H. Espinoza); morayo.adedjouma@cea.fr (M. Adedjouma); agnes.delaborde@lne.fr (A. Delaborde); chokri.mraidha@cea.fr (C. Mraidha)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Self-driving cars are increasingly employing various deep learning-based components in their technology stack. These components require tremendous amounts of data to reach a significant level of performance [1]. Deep Neural Networks (DNN) generally perform poorly when they come across previously unseen data. A DNN model trained on only a homogeneous set of images from a particular scenario would perform well only in that scenario and under-perform in most other situations. This is a major concern for the safety assessment of self-driving vehicle systems [2]. In a traffic light classification task, for instance, the more diverse the scenarios the DNN module encounters in training, the wider its safe operation region [3].

Typically, the labels to train such modules are provided by humans [4]. Curating a large dataset with millions of human labels is painfully time consuming and expensive. Active learning is a powerful technique that attempts to maximize a model's performance gain while annotating the fewest samples possible. This process usually considers factors such as uncertainty and diversity to generate a query list for the human [5]. Active learning has shown impressive performance gains over random selection in many self-driving perception tasks.

While there have been emerging efforts to improve active learning for complex scenarios, little attention has been given to active learning for safety-critical features. One example of these features is the detection of ambiguous data points when the self-driving car is in a safety-critical situation. An example of ambiguity could be an image used to train a traffic light detection system wherein there is a red light for traffic intending to turn right and a green light for the straight-moving traffic. This image could be delegated to the human to annotate if it is deemed to have a high impact on potential risk.

This position paper proposes a novel approach that uses human feedback related to perceptual data ambiguity and a criticality score. This criticality score, which is linked to the exposure and severity factors of a typical safety assessment, helps to characterize the criticality context of corner cases and blindspots with high impact on potential risk. In a limited query budget scenario, the perceptual ambiguity level and criticality level obtained during the annotation process, along with uncertainty and diversity measurements, help in selecting the images with the highest impact on potential risk. This position paper is a preliminary step towards deeper research into how human-in-the-loop feedback can help in a safety-aware active learning approach.

2. Background and Related Works

2.1. Motivation

A modular driving system typically consists of several components with specific functions collaborating to achieve the intended driving behaviour. There are also end-to-end driving systems, but these are usually entirely made up of opaque blackbox models, and it is therefore not feasible to certify their functional safety. Learning-enabled components making use of blackbox machine learning models are notorious in this aspect due to their lack of transparency. Failures or unsafe behaviour at the component level can potentially compromise the safety of the entire system unless there are exhaustive system-level measures to tackle them, and thus it is important to ensure that the component is trained in a manner that minimizes its vulnerability to unknown situations. The presence of a human in the loop could help in mitigating some of these vulnerabilities by identifying certain blindspots undetected by the trained models and by assessing the severity level of the consequences of mispredictions by the trained models. In situations of limited query budget and training time, the paradigm of active learning could assist in selecting the most safety-relevant data points by analyzing the blindspot vulnerabilities of the component. In this work, we focus on improving the data selection and training of a traffic light classification component in a modular driving system.

2.2. Active Learning

Active learning is a process of eliciting training data from annotators to determine the right data to put in front of people when you don't have the budget or time for human feedback on all your data. This is especially true in datasets for autonomous driving, which could have millions of hours of data available for training. More than the raw quantity of the data used, the quality, diversity and usability of the data are the important parameters to assure optimum performance and safety of the deployed models. The deep neural networks responsible for self-driving functions require exhaustive training, and the data needs to cover new and uncertain situations in order to tackle the problem of unknown unknowns. Unknown unknowns are data points for which the AI model provides a wrong prediction with a high degree of confidence. Such points are dangerous because they are immune to detection by uncertainty measures, which are often used as a proxy metric to test models' weaknesses. The combination of data annotation and curation poses a major challenge to deploying deep learning models in autonomous systems, and active learning helps by automatically finding the relevant data points to query the human, building better datasets in a fraction of the time, with less cost and more accuracy [6]. In this work, we focus on pool-based active learning, where we have a small set of labelled data available and a large set of unlabelled data which needs to be labelled within a certain query budget.

Figure 1: Block diagram of the active learning process

Some of the sampling strategies in active learning are as follows [7]:

• Random sampling is a strategy where we pick random samples from the unlabeled pool of data as query points for the human to label. This is usually used just as a baseline, as it does not have an intelligent strategy to select the query points.
• Uncertainty sampling is the set of strategies for identifying unlabeled items that are near a decision boundary in the trained model. This approach picks out the data points with a higher predictive uncertainty, and is thereby reflective of the blindspots of the trained model.
• Diversity sampling is the set of strategies for identifying unlabeled items that are underrepresented or unknown to the machine learning model (for instance, features that are not common in the training data, or are under-represented in real-world demographics).

The simplest approach in the literature, as illustrated in [8], is to select examples based on distances in the feature space. In [9], diversity is measured using a similarity matrix made using the Gaussian kernel of the distance between two points. [10] makes use of entropy as a metric of uncertainty. [11] makes use of the information density of the candidate instance obtained from the input space for the remaining unlabeled instances. [12] and [13] use ensemble and Bayesian methods respectively to approximate uncertainty. [14] proposes heuristic methods to balance the uncertainty and the representativeness of the selected sample, considering the redundancy between selected samples. [15] argues that the initial model does not have a good performance, so the queries generated by it are also likely to be inefficient.
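To make the pool-based setting concrete, the round structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score_fn` stands in for any acquisition function (uncertainty, diversity, or the safety-aware combination proposed later), and `oracle` stands in for the human annotator; both names are hypothetical.

```python
def select_queries(scores, budget):
    """Pick the indices of the top-`budget` unlabelled points by acquisition score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

def pool_based_round(labelled, unlabelled, budget, score_fn, oracle):
    """One round of pool-based active learning: score the pool, query the human,
    and move the queried points into the labelled set."""
    scores = [score_fn(x) for x in unlabelled]
    picked = set(select_queries(scores, budget))
    labelled = labelled + [(unlabelled[i], oracle(unlabelled[i]))
                           for i in sorted(picked)]
    unlabelled = [x for i, x in enumerate(unlabelled) if i not in picked]
    return labelled, unlabelled

# Toy usage: points are scalars, and the "model" is most uncertain near 0.5,
# so the acquisition score is the closeness to that decision boundary.
score = lambda x: 1.0 - abs(x - 0.5)
oracle = lambda x: int(x > 0.5)
lab, unlab = pool_based_round([], [0.1, 0.48, 0.9, 0.52], budget=2,
                              score_fn=score, oracle=oracle)
# The two points nearest the boundary (0.48 and 0.52) are queried first.
```

In a real loop, the model would be retrained on the updated labelled set after each round before the next query list is generated.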
In [16], it is proposed to include knowledge from unlabeled images by adding unsupervised and semi-supervised methods to enhance the performance. The authors in [17] proposed to use a binary classifier to predict whether an image is from the labeled or unlabeled pool, using the concept of adversarial learning. In [18], a semi-supervised active learning approach is proposed wherein contention points are determined by making use of both the informativeness and the adaptive probabilistic label of the unlabelled points based on the hypothesis of the current model.

2.3. Blindspots and Corner Cases

Blindspots are the deficiencies present in a model which may be detrimental to its performance and adaptability to unknown and uncertain situations [19]. In active learning, data points falling under these blindspots can be specifically picked to query a human oracle. There can be various categories of blindspots:

• Model Blindspots: The set of data points, and the feature regions they enclose, in which the model is highly uncertain about or unsure of its predicted label constitute the model blindspots. It is possible to identify model blindspots using the prediction uncertainty of data points. Data points for which the model has a prediction with a high entropy fall under this category.
• Data Blindspots: The areas of the feature space that are not covered in the training set constitute the data blindspots. Diversity is one of the aspects that help in uncovering these blindspots. An example could be a dataset with images only recorded in daytime. An image taken at night time would be very distant from the images that the model has seen before, and even if the model's output prediction has a low entropy, it cannot be fully trusted.
• Human-identified Blindspots: The model blindspots reveal the underconfidence and knowledge gaps of the trained model, and the data blindspots explore the diversity of the data. However, there may be more conceptual aspects in the dataset which are not covered under either of the above categories of blindspots. For example, consider an image in the training set of a traffic light classification system wherein there are two visible traffic lights: one for left-moving traffic, and the other for straight-moving traffic. If the ego vehicle is in the rightmost lane, a human looking at the image can see that the vehicle could not possibly turn left, so only the signal light for straight-moving traffic is relevant for the scene. This, however, is an ambiguous situation that could be potentially difficult to classify without conceptual knowledge about the traffic, which a blackbox model may not necessarily possess. Such blindspots can be identified with the help of a human-in-the-loop.
• Safety Blindspots: Data points whose misclassification by the specific trained model at the component level could compromise the safety of the system which the component is a part of constitute safety blindspots.

3. Proposed Method

In contexts which are subjective in nature, or when human contextual knowledge plays a major role, current active learning methods based purely on model knowledge do not tend to perform well [2]. Safety in particular is a complex concept involving other environmental and situational factors. Since the onus in active learning is on a particular component, one cannot discuss safety as such, because it is a system-level concept. However, it is possible to think about the safety implications of a mislabelled or ambiguous data point. A human-in-the-loop can help in identifying certain conceptual blindspots which are not covered under the model and data blindspots discussed in the section above. Although human-in-the-loop involves effort in terms of labelling, active learning acquisition functions ensure that only a fraction of the data points, those most critical according to the chosen criterion, have to be labelled by the humans, thereby solving the scalability issue. Human bias is always a factor in labelling, but classic methods in active learning such as inter-annotator agreement can be used to mitigate this problem.

3.1. Perceptual Ambiguity

Data points which the annotator perceives to be potentially ambiguous could be rejected and removed from the training set. However, a black-and-white approach of reject and accept is not suitable in many cases, such as traffic-related tasks. Many data points could be slightly ambiguous yet interesting to include in the dataset for diversity and task relevance. Conservatively rejecting all data points the annotator perceives to be slightly ambiguous leads to less diversity in the training set. These constitute human-identified blindspots and provide additional information for data selection. Thus, it would be useful to quantify the level of ambiguity and underconfidence that the annotator feels for each data point as very low, low, medium, high or very high. A secondary model can be trained to predict the level of perceptual ambiguity with the help of human feedback, and this could assist in better data selection for active learning querying under a limited budget. We propose Table 1 as a reference for the annotators:

Table 1
Perceptual ambiguity levels

Level      Explanation
Very low   Unambiguous image, label easy to identify
Low        Distracting features but easy to classify
Medium     Some ambiguities in identifying the label
High       Occlusions and ambiguities, hard to classify
Very high  Corner case with safety implications

Figure 2: Ambiguous class labels and distracting features

3.2. Criticality Assessment

We consider the safety awareness of the data labelling process through the concept of criticality assessment, and thereby aim to tackle the safety blindspots discussed above. The idea behind it is to estimate the importance of a specific image regarding a task according to the global risk it could represent for a system facing that task. The global risk is here the combination of two factors: the severity, i.e., the estimated safety consequences if the system fails the task, and the exposure, i.e., the estimated probability of this failure. In the context of traffic light classification, the severity concerns the expected consequences if the traffic light is misclassified; it will depend on which class is misclassified (i.e., green light misclassified as red/orange light, or red/orange light misclassified as green light) and the different visible environmental parameters which can contribute to possible accidents (e.g., pedestrian crossing, road intersection). The exposure is estimated by detecting the different visible factors that could cause the misclassification (e.g., camera obstruction or corruption, weather conditions). We focus here on the risk assessment at the component level, without considering the whole system's capabilities and interactions with the other components and subsystems. To include this active learning approach in a complete safety engineering process, the requirements identified in the preliminary analysis shall be considered to adapt this score. A first question to estimate the severity level is presented to the human annotator. We formulate the question as "How do you estimate the consequences on accident risk if the automated driving system misclassifies this traffic light?" (as shown in Figure 5), with the possible answers "Negligible", "Light", "Severe", and "Fatal".
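The severity-then-exposure questionnaire described in this section can be sketched as a small scoring routine. The answer-to-value mapping and the per-factor exposure values below are illustrative placeholders; in the proposed approach the exposure values are defined by safety experts, not by the rater.

```python
# Hypothetical mapping from severity answers to numeric values; only
# "Negligible" is fixed to zero by the questionnaire design.
SEVERITY = {"Negligible": 0, "Light": 1, "Severe": 2, "Fatal": 3}

def criticality_score(severity_answer, factor_present, factor_exposure):
    """Combine the rater's severity answer with expert-defined exposure values.

    factor_present : booleans marking which visible factors the rater reported.
    factor_exposure: expert-assigned exposure value for each factor.
    """
    s = SEVERITY[severity_answer]
    if s == 0:
        # A "Negligible" answer short-circuits the exposure questions.
        return 0
    exposure = sum(f * e for f, e in zip(factor_present, factor_exposure))
    return exposure * s  # (sum_k f_k * e_k) * s

# Example: the rater reports fog and camera glare but no obstruction
# (factor list and exposure values are made up for illustration).
score = criticality_score("Severe", [True, True, False], [0.3, 0.2, 0.5])
# (0.3 + 0.2) * 2 = 1.0
```

The multiplicative structure ensures that an image only receives a non-zero criticality when both a non-negligible severity and at least one visible exposure factor are reported.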
We associate each of these answers with a value (zero for "Negligible"), and if the human rater does not select the answer "Negligible", i.e., if the severity score is higher than zero, we ask another question for the exposure estimation: "Can you see any factor that might hinder the identification of this traffic light?", with the answers "Yes" and "No". If the rater answers "No", the exposure value is zero. Else, we ask additional questions to identify these factors. Each factor is associated with an exposure value defined beforehand by the expert and not visible to the human rater. We can then compute the criticality score with the formula

$\left(\sum_{k=1}^{n} f_k \cdot e_k\right) \cdot s$

where $n$ is the number of identified factors, $f$ is a boolean vector which represents the presence/absence of each factor, $e$ is a vector which represents the exposure value of each factor, and $s$ is the severity score.

Figure 3: Ambiguous class labels and conceptual understanding

Consider Figure 2, from the traffic light detection dataset presented in [20]. There are two traffic lights visible in the image, which is a source of ambiguity. Additionally, at night the tail lights of traffic ahead may constitute distracting features which may affect the label prediction. In Figure 3, also from the same dataset, one can see that there is again an ambiguity in the class label at first sight. However, considering that the ego vehicle is in the middle lane, with proper conceptual knowledge it can be presumed that the traffic light for straight-moving traffic is the relevant one.

3.3. Continual Learning Model for Perceptual Ambiguity and Criticality

A continual learning approach would be suitable in a human-in-the-loop environment, where the human can initially provide labels and eventually a simple model (different from the main component that is being trained) would be able to replace the human when it reaches a sufficient level of performance. Note that the continual learning model's misclassifications would only affect the data selection and not the predictions of the main component directly. Along with providing the class labels, the human annotator can be asked to provide the perceptual ambiguity and severity level associated with the data point. Thus, there can be two separate continual learning models attached to the main component model: one to predict the perceptual ambiguity and one to predict the severity level of the data point. The model used here could be a shallow neural network taking the intermediate features from the main component model as input.

An issue with the continual learning approach is catastrophic forgetting, when the model updates itself constantly and forgets what it learnt before. To avoid this, it is necessary to maintain the best representation set of what the model already knows, so that when the model is re-trained it can also include this representation set. In this work, we make use of a buffer called the familiarity buffer for this purpose. The familiarity buffer holds a representation of the data points for which the model predicts the perceptual ambiguity or the criticality accurately. When the model encounters data points wherein there is a mismatch between the model prediction and the human feedback, those data points are added to the unfamiliarity buffer. When the unfamiliarity buffer is full, the continual learning model is retrained with the contents of both the familiarity and unfamiliarity buffers. After the re-training, the familiarity buffer of size 'n' is updated: from the contents of both buffers, the most diverse 'n' data points are chosen to repopulate the familiarity buffer. Finally, the unfamiliarity buffer is emptied.

3.4. Uncertainty and Diversity

The model blindspots and data blindspots can be captured by uncertainty and diversity respectively. They can be calculated as follows:

• Uncertainty-based querying: In uncertainty-based querying, the model's uncertainty about its predictions is used as a metric for selecting query points [10]. The model predictions typically contain probability scores associated with each class label. In the ideal scenario, the model should allocate a probability of one to the correct label and zero to all the incorrect labels. Thus, entropy can be used as a measure of the self-evaluated confidence of the model in its own predictions. Zero entropy means that the model is perfectly confident in its prediction, while an entropy of one is the level of maximum doubt. The entropy of a model with $c$ classes, with each class $i$ having a probability $p_i$, is defined as follows:

$\text{Entropy} = -\sum_{i=1}^{c} p_i \log(p_i)$ (1)

Thus, data points with a higher entropy are those with a higher level of uncertainty attached. The queries can be generated such that the most uncertain data points are shown to the human for review. In this work, we use an ensemble of models as in [21] to generate the average predictive entropy.

• Diversity-based querying: The diversity-based querying approach aims to include the data points most different from what the model has previously seen [9]. For this, one should store a representation of the training data that the model has been trained on. An ideal candidate for this is the distribution of the features at an intermediate layer of the prediction model. The distribution of the features of a fully connected (FC) layer in the later layers of a convolutional neural network could be computed for the training data points and then compared with each new data point to obtain a distance score. In this work, we consider an FC layer with $N$ neurons and compute the means and variances of the output values from that layer for all training data points, as a new variant of the existing distance-based acquisition functions for diversity such as in [8]. Then, for each new data point, we calculate the Z-score for each of the $N$ features $f_1$ to $f_N$ and consider their average:

$Z\text{-}score = \frac{1}{N} \sum_{i=1}^{N} \frac{f_i - \mu_i}{\sigma_i}$ (2)

The higher the Z-score, the more distant the new data point is from the known distribution. In this approach, the queries would be generated such that the data points with a higher average Z-score are shown to the human for labelling.
4. Proposed Evaluation Framework

4.1. Planned Experiment

The first step in the active learning process is training the initial model using the available pool of labelled data. This model serves as a starting point to generate queries from the unlabelled set. The large pool of unlabelled data is divided randomly into various chunks. Each of these chunks shall be labelled in a particular round of active learning [22]. In the first round of active learning, the pre-trained model is used to generate a query list of the data points to be reviewed and labelled by the human. The selection criterion for the query points is the major challenge in active learning, and it depends on the mode of active learning selected, as explained above. After all the data points in the first round of active learning are labelled successfully, the model is re-trained with the updated set of labelled data, and the next chunk of unlabelled data is selected for the second round of active learning. This process continues till all data points are labelled. During the labelling process, the annotators are tasked with providing the class label, perceptual ambiguity level and severity level of each data point on a graphical user interface, as shown in Figure 5. If the data point has a high severity and ambiguity level, additional questions can be asked of the annotators to determine the associated criticality score, as mentioned above.

Figure 4: Detailed block diagram of the proposed approach

Figure 5: Graphical user interface

4.2. Active Learning Acquisition Functions

In order to demonstrate the effectiveness of the proposed approach, we propose to perform the experiment with the following combinations of acquisition functions:

• Random: In this mode, N% of images are randomly selected from the subset of unlabelled data in a particular round, and are assigned to the human to label.
• Uncertainty: In this mode, the top N% of images with the highest average entropy are assigned to the human to label.
• Diversity: In this mode, the top N% of images with the highest average Z-score from the current distribution are assigned to the human to label.
• Perceptual Ambiguity: In this mode, the top N% of images with the highest perceptual ambiguity scores are assigned to the human to label.
• Criticality: In this mode, the top N% of images with the highest criticality scores are assigned to the human to label.
• Combined: In this mode, the top N% of images with the highest combined average of entropy, Z-score, criticality and perceptual ambiguity score are assigned to the human to label.
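The Combined mode above can be sketched as a single ranking over the four per-point scores. The sketch assumes each score has already been normalised to [0, 1] (the paper does not specify a normalisation scheme), and the dictionary field names are hypothetical.

```python
def combined_acquisition(points, top_fraction=0.1):
    """Rank pool points by the average of their four (pre-normalised) scores
    and return the top fraction as the query list for the human."""
    def combined(p):
        return (p["entropy"] + p["z_score"]
                + p["criticality"] + p["ambiguity"]) / 4
    ranked = sorted(points, key=combined, reverse=True)
    budget = max(1, int(len(ranked) * top_fraction))
    return ranked[:budget]

# Toy pool of three images with made-up scores.
pool = [
    {"id": 0, "entropy": 0.2, "z_score": 0.1, "criticality": 0.0, "ambiguity": 0.1},
    {"id": 1, "entropy": 0.9, "z_score": 0.7, "criticality": 0.8, "ambiguity": 0.6},
    {"id": 2, "entropy": 0.4, "z_score": 0.3, "criticality": 0.2, "ambiguity": 0.2},
]
queries = combined_acquisition(pool, top_fraction=0.34)
# Image 1, with the highest combined average, is queried first.
```

A plain average weights the four criteria equally; a weighted combination tuned to the safety requirements would be a natural variant to test in the planned experiment.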
4.3. Evaluation Metrics

We propose to use the following evaluation metrics to compare the safety and performance of the proposed approach with those of the pre-existing ones:

4.3.1. F1-score

When there is an imbalance in the number of data points in different classes, accuracy might not be a good metric for prediction performance. In this case, the F1-score, which accounts for both type-I and type-II errors, would be a better metric:

$F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (3)

4.3.2. Uncertainty Reduction

The goal of training a model is to generalize its knowledge over the assigned task and therefore perform well on the unseen test set. As mentioned above, entropy is a good measure of the prediction power of a model when the label probabilities are available. Therefore, we can use entropy over the test set as one of the measures of how the model uncertainty is reduced. Note that while the reduction of uncertainty is good, it has to be viewed in tandem with other metrics such as accuracy, precision, recall or F1-score.

4.3.3. Query Relevance

In each round of active learning, N% of the data is selected as query points to be shown to the human. It is necessary to measure whether the selected points are indeed the best ones. One way to do this is to measure the difference in the relevant scores (an average of the uncertainty, diversity, criticality and perceptual ambiguity scores for each point) between the human-labelled points and the auto-labelled points. The larger the difference between these sets, the more relevant the selected query points. For the random mode, the query relevance is expected to be the lowest, because the points are selected randomly without considering their relevance in active learning.

4.3.4. Safety-weighted Accuracy

To consider the importance of each input data point to the safety relevancy of a machine learning model's training, we can reuse the accuracy metric used to evaluate the performance of classification models and adapt it to criticality aspects. Given the safety requirements identified through Hazard Analysis and Risk Assessment (HARA) methods and all the relevant Operating Conditions (OCs) visible in the input data, a safety expert identifies the possible hazardous scenarios that could be caused by misclassification of this input data (with a minimum probability of occurrence), and weights the score associated with this input according to the visible risk. The OCs are any relevant parameters that describe the system's usage scenarios, including environmental conditions, dynamic elements, and scenery. As in the criticality assessment presented in Section 3.2, the risk evaluation is decomposed into severity and exposure factors. We estimate for each input the Safety Integrity Level (SIL), presented in the IEC 61508 [23] standard, from the severity and exposure scores and the risk matrix. We then give each input an integer score between one and four, and we compute the model safety-weighted accuracy as follows:

$\frac{\sum_{k=1}^{n} sil_k \cdot c_k}{\sum_{k=1}^{n} sil_k}$

With $n$ the number of predictions, $sil$ the vector with the SIL scores for all inputs, and $c$ a vector with the values of the classification correctness (one if the classification is correct and zero otherwise).
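The safety-weighted accuracy is a one-line computation once the SIL scores are assigned; a minimal sketch, assuming SIL scores are already available per input:

```python
def safety_weighted_accuracy(sil, correct):
    """Accuracy where each prediction is weighted by its SIL score (1..4).

    sil     : per-input Safety Integrity Level scores assigned by the expert.
    correct : 1 if the input was classified correctly, else 0.
    Implements (sum_k sil_k * c_k) / (sum_k sil_k).
    """
    assert len(sil) == len(correct)
    return sum(s * c for s, c in zip(sil, correct)) / sum(sil)

# Misclassifying the single SIL-4 input hurts far more than a SIL-1 input:
acc = safety_weighted_accuracy([1, 1, 4], [1, 1, 0])
# (1 + 1 + 0) / 6, i.e. about 0.33, although 2 of 3 predictions are correct.
```

The contrast with plain accuracy (2/3 here) is the point of the metric: errors on safety-critical inputs dominate the score.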
5. Conclusion

In this paper, we introduced the concepts of perceptual ambiguity and criticality, and proposed a model which learns to predict these through continuous feedback from a human in the loop. The proposed approach is aimed at tackling blindspots not covered under current approaches dealing with uncertainty and diversity sampling methods. An experiment was designed with the goal of testing such a model trained to perform traffic light detection. The work is still in an early stage, and the next steps include performing an active learning experiment on a large scale with several volunteers, linking the definition of criticality to concrete safety metrics in the industry, development of other evaluation metrics, and testing alternate designs of the continual learning model.

Acknowledgments

This work is partially funded by TAILOR, an ICT-48 Network of AI Research Excellence Centers funded by the EU Horizon 2020 research and innovation programme under grant agreement No 952215.

References

[1] H. M. Eraqi, M. N. Moustafa, J. Honer, End-to-end deep learning for steering autonomous vehicles considering temporal dependencies, arXiv preprint arXiv:1710.03804 (2017).
[2] S. Mohseni, M. Pitale, V. Singh, Z. Wang, Practical solutions for machine learning safety in autonomous vehicles, CoRR abs/1912.09630 (2019). URL: http://arxiv.org/abs/1912.09630.
[3] D. Wang, X. Ma, X. Yang, Tl-gan: Improving traffic light recognition via data synthesis for autonomous driving, arXiv preprint arXiv:2203.15006 (2022).
[4] J. Geary, H. Gouk, S. Ramamoorthy, Active altruism learning and information sufficiency for autonomous driving, arXiv preprint arXiv:2110.04580 (2021).
[5] E. Haussmann, M. Fenzi, K. Chitta, J. Ivanecky, H. Xu, D. Roy, A. Mittel, N. Koumchatzky, C. Farabet, J. M. Alvarez, Scalable active learning for object detection, in: 2020 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2020, pp. 1430–1435.
[6] C.-C. Kao, T.-Y. Lee, P. Sen, M.-Y. Liu, Localization-aware active learning for object detection, in: Asian Conference on Computer Vision, Springer, 2018, pp. 506–522.
[7] R. Monarch, Human-in-the-Loop Machine Learning, Manning Publications Co., 2021.
[8] Y. Geifman, R. El-Yaniv, Deep active learning over the long tail, CoRR abs/1711.00941 (2017). URL: http://arxiv.org/abs/1711.00941.
[9] G. Wang, J.-N. Hwang, C. Rose, F. Wallace, Uncertainty sampling based active learning with diversity constraint by sparse selection, in: 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), IEEE, 2017, pp. 1–6.
[10] A. J. Joshi, F. Porikli, N. Papanikolopoulos, Multi-class active learning for image classification, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2372–2379.
[11] X. Li, Y. Guo, Adaptive active learning for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 859–866.
[12] W. H. Beluch, T. Genewein, A. Nürnberger, J. M. Köhler, The power of ensembles for active learning in image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9368–9377.
[13] Y. Gal, R. Islam, Z. Ghahramani, Deep Bayesian active learning with image data, in: International Conference on Machine Learning, PMLR, 2017, pp. 1183–1192.
[14] T. He, S. Zhang, J. Xin, P. Zhao, J. Wu, X. Xian, C. Li, Z. Cui, An active learning approach with uncertainty, representativeness, and diversity, The Scientific World Journal 2014 (2014).
[15] O. Siméoni, Robust image representation for classification, retrieval and object discovery, Ph.D. thesis, Université Rennes 1, 2020.
[16] O. Siméoni, M. Budnik, Y. Avrithis, G. Gravier, Rethinking deep active learning: Using unlabeled data at model training, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 1220–1227.
[17] D. Gissin, S. Shalev-Shwartz, Discriminative active learning, arXiv preprint arXiv:1907.06347 (2019).
[18] I. Muslea, S. Minton, C. A. Knoblock, Active + semi-supervised learning = robust multi-view learning, in: ICML, volume 2, Citeseer, 2002, pp. 435–442.
[19] R. Ramakrishnan, E. Kamar, B. Nushi, D. Dey, J. Shah, E. Horvitz, Overcoming blind spots in the real world: Leveraging complementary abilities for joint execution, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 6137–6145.
[20] X. Yang, J. Yan, X. Yang, J. Tang, W. Liao, T. He, Scrdet++: Detecting small, cluttered and rotated objects via instance-level feature denoising and rotation loss smoothing, arXiv preprint arXiv:2004.13316 (2020).
[21] R. Rahaman, et al., Uncertainty quantification and deep ensembles, Advances in Neural Information Processing Systems 34 (2021).
[22] R. Ganti, A. Gray, Upal: Unbiased pool based active learning, in: Artificial Intelligence and Statistics, PMLR, 2012, pp. 422–431.
[23] International Electrotechnical Commission, Functional safety of electrical/electronic/programmable electronic safety-related systems, 2010.