Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving Edwin Gamboa1,* , Alejandro Libreros1 , Matthias Hirth1 and Dan Dubiner2 1 User-centric Analysis of Multimedia Data Group, TU Ilmenau, Ehrenbergstraße 29, Ilmenau, 98693, Germany 2 Scalehub Gmbh, Heidbergstraße 100, Norderstedt, 22846, Germany Abstract Large and high-curated training data is required for Artificial Intelligence (AI) models to perform robustly and reliably. However, training data is scarce since its production normally requires manual expert annotation, which limits scalability. Crowdsourced micro-tasking can help to overcome this challenge, as it offers access to a global workforce that might enable high-scalable annotation of visual data in a cost-time effective way. Therefore, we aim to develop a workflow based on Human-AI collaboration that shall enable large-scale annotations of image data for autonomous driving systems. In this paper, we present the first steps towards this goal, in particular, a Human-AI approach for identifying cars. We assess the feasibility of this collaboration via three scenarios, each one representing different traffic and weather conditions. We found that crowdworkers improved the AI’s work by identifying more than 40% of the missing cars. Crowdworkers’ contribution was key in challenging situations in which identifying a car depended on context. Keywords Human-AI collaboration, Crowdsourcing, Micro-tasking, Autonomous driving, Anonymous annotation 1. Introduction particular objectives, and existing data sets do not meet high-scale purposes, therefore, learning from those data Autonomous driving is one of the most promising ap- is difficult [4]. Hence, machine learning models perform proaches to support smart mobility by reducing the asso- poorly in high-scale cases, leading to severe limitations ciated risks of human behavior and driving fatigue [1]. that make object identification for autonomous driving Key enablers for autonomous driving systems are sets of still an open problem [5]. sensors installed in the vehicle to monitor the vehicle’s In this paper, we present our first steps towards a environment. Then, prediction and estimation models Human-AI collaboration to enable fast and highly reliable use the sensor data to understand the current driving sit- labeling of camera images in the context of autonomous uation and decide upon appropriate actions. The models driving. We find that the image data and the required must be highly accurate and have low processing time to labels exhibit domain-specific challenges, and we illus- minimize the risks of threatening road actors’ lives [2]. trate how to consider these challenges in the design of Supervised learning outperforms classical identification the crowdsourcing workflow. An AI model supports the algorithms in this field of application [3]. However, a crowdworkers with pre-annotations of the images to re- supervised identification model needs large amounts of duce their workload and cope with a large amount of training data to later identify objects in a robust, accurate, data. The workflow is evaluated in a user study with and reliable way. A high-accurate model for identifying crowdworkers who annotated almost 400 real-world im- objects in the street must consider different scenarios ages. Our results show that the workflow combines the such as rain, sun, sunset, night, and seasons, and each of strengths of automated pre-annotation and manual hu- them with particular settings related to, e.g., luminosity man refinement using scalable, public micro-tasking. and reflectance. Still, the availability of public, accurate, reliable, and, especially, massive data sets is scarce for 2. Related Work HIL-DC2022: ACM CIKM 2022 Workshop Human-In-The-Loop Data For the past decade, attempts have been made to explore Curation, October 22, 2022, Atlanta, Georgia * Corresponding author. better ways to combine human-computer approaches $ edwin.gamboa@tu-ilmenau.de (E. Gamboa); to optimize image annotation [6]. Cheng et al. [7] jose.libreros@tu-ilmenau.de (A. Libreros); have classified automatic image annotation as generative matthias.hirth@tu-ilmenau.de (M. Hirth); model-based, nearest neighbor-based, discriminative, tag dan.dubiner@scalehub.com (D. Dubiner) completion-based, and deep learning-based. One com-  0000-0002-5037-5279 (E. Gamboa); 0000-0002-5434-5464 (A. Libreros); 0000-0002-1359-363X (M. Hirth); 0000-0001-8077-0387 mon limitation of automatic image annotation is that (D. Dubiner) those methods suppose availability of annotations, i.e., © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). they address the problem of having different probability CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) (a) Daylight city. (b) Nightlight city. (c) Rainy highway. Figure 1: Sample images of the investigated scenarios. distributions in the confidence of a set of identifications. ing this high diversity of scenarios, it seems likely that Despite recent advances, the lack of trustworthiness of there are cases in which an AI delivers better results than machine learning models has been shown [8]. Thus, the crowdsourcing workers and vice versa. In the following, problem of retrieving missing objects is still open. To we will show this with concrete examples and illustrate address this gap, manual annotations have been used, but the advantages of collaboration between AI and crowd- this approach is limited for scalability purposes due to workers in this use case. We employ three self-collected the scarce availability of experts. In this context, crowd- videos representing different, typical street scenarios to sourcing has the potential to enable high-scalable anno- assess the performance of the collaboration. A sample tations and produce reliable training data for AI mod- frame of each video is shown in Figure 1. First, a Day- els [9, 10, 8]. Heim [11] presents a cost-time analysis of light city video (Figure 1a), in which light conditions are manual segmentation for organs with experts and crowd- ideal, but the image contains a lot of objects typical of a workers. Results show that domain experts achieved big city. Second, a Nightlight city video (Figure 1b) of a approximately 0.1 segmentations per hour vs. 35 seg- small city, in which light conditions are most challenging. mentations from crowdworkers during the same time. Lastly, a Rainy highway video (Figure 1c), in which traffic Similarly, different works have employed crowdsourcing is smooth, crowds of cars are infrequent, but the visual for the annotation of large datasets [12, 13]. Also, Boor- quality is affected by the rain. We randomly selected boor et al. [14] showed how quality can be maximized 399 frames, 133 from the daylight video, 133 from the in the case of lung nodule detection, and Hu et al. [8] rainy highway, and 133 from the nightlight video for our have demonstrated that crowdsourcing might reduce the evaluation. identification bias in challenging scenes. Nevertheless, crowdsourced micro-tasking implies challenges related to the variance in annotation quality, which is mainly 4. Study Design related to the workers’ lack of domain knowledge [9, 11]. This section presents the design process of the annota- Thus, a collaboration between AI and crowdsourcing tion task, the steps that crowdworkers performed when might be feasible for addressing these issues as demon- accessing it, and the process to evaluate the Human-AI strated in the medical field. However, to the best of our collaboration. knowledge, this collaboration has not been studied in the context of autonomous driving considering different driving and weather scenarios. 4.1. Task Design Fully annotating a video in the context of autonomous 3. Problem Statement driving is rather complex, since such a task requires an- notating different objects, e.g., cars, pedestrians, traffic One of the main problems with the annotation of images signs, and other obstacles, frame by frame. Our first goal for autonomous driving is the diversity of scenarios that is to identify the main challenges of the annotation task may emerge. The driving situation can be highly differ- itself and address the multi-object annotation problem ent depending on the street environment, i.e., a highway later. Thus, we initially concentrate on the annotation or a narrow street inside a city, and vary in terms of, of cars only. This annotation process can be further de- e.g., available driving space, number and type of other composed into a three-steps task, i.e., (1) Crowdworkers road users, available signs, and traffic lights. Addition- identify cars not detected by the AI, (2) crowdworkers ally, numerous environmental factors such as lighting identify wrong AI- and crowd-based annotations, and and weather conditions have to be considered. Consider- (3) crowdworkers fix the wrong annotations. In this paper, we focus on the first step. We decided less than 5% of the frame height, are red highlighted in to request crowdworkers to use bounding boxing for the the task UI. If the crowdworker does not resize the small annotation instead of other methods like polygon enclos- annotations, the system informs the worker and deletes ing, or free drawing to reduce workload. Other, more the boxes. Before annotating each frame the workers are sophisticated, techniques like marking background/fore- shown a 2-seconds video containing the 10 preceding ground via simple clicks were discarded since it might frames. The goal of this video is to give context and lead to high heterogeneity in the results [9]. We decided support decision-making in case a crowdworker is not to use YOLOv3 [15] for the pre-annotation of the images sure whether an object is a car. This video can be replayed since it has demonstrated high performance for traffic anytime during the annotation. contexts with low computational cost. Also, YOLO tends to predict fewer false positives than other state-of-the- 4.3. Evaluation Procedure art object identification architectures like R-CNN, using pre-trained models [16]. Two experts manually inspected all frames to assess the We designed the task’s instructions following guide- quality of the YOLO annotations and the contribution of lines for crowdsourcing and usable texts. We used illus- the crowdworkers to the annotation quality. The number trated instructions minimizing visual complexity [17], of correct and incorrect YOLO identifications, the number together with short sentences using simple English [18, of missing identifications, and the number of correct and 19, 13]. Also, we included examples of wrong and right incorrect crowdworkers’ identifications were registered. annotations [11, 17]. The instructions and the User In- Using the expert annotations, we calculate precision, re- terface (UI) annotation mechanisms were iteratively im- call, and F1-score to get more rigorous information about proved using the Crowdsourced Thinking Aloud Protocol the behavior of each model. method as proposed in [20]. 5. Evaluation 4.2. Task Procedure We collected the crowdworkers’ annotations via the Ama- Training. As recommended by different works [18, 9], zon Mechanical Turk platform on July 12, 2022. The training tasks should be included to bring crowdworkers crowdworkers could carry out the annotation tasks as closer to the task domain and filter unreliable workers many times as desired. In total, 14 crowdworkers anno- out. In particular, gold standard data can be used in which tated all frames in 1 hour and 16 minutes. different complexity cases are trained. In the rest of this section, we present the results of In the training task, we show crowdworkers three ran- our study in three main parts. First, we analyze YOLO’s domly selected images, with different complexity levels. performance in terms of the identified cars in the frames. The complexity levels depended on the number of cars to Then, the contribution of the crowdworkers to YOLO’s be annotated, the amount of AI annotations, and the pres- work is assessed. Finally, we combined the identifications ence of cars that are hard to identify, e.g., very distanced carried out by both YOLO and the crowdworkers and or partially visible cars. Each training task includes addi- assessed the performance of this collaboration. tional hints relevant to the current frame and based on the workers’ performance, e.g., highlighting missing cars after each try until all expected cars are annotated. Once 1,00 the training task is successfully passed, crowdworkers 0,90 can complete the annotation task. Quick instructions are 0,80 visible during the whole process and crowdworkers can 0,70 0,60 go back to the detailed instructions anytime they want. F1-Score 0,50 Main Task Crowdworkers have to annotate five, ran- 0,40 domly selected frames. We asked them to draw boxes 0,30 around that the system, i.e., YOLO, did not find. To make 0,20 the completion criteria clear, we ask them to annotate 0,10 a maximum of 10 cars. To annotate only relevant cars 0,00 Daylight city Nightlight city Rainy highway All scenarios in each frame, the crowdworkers should consider the Yolo Crowdworkers Yolo+Crowdworkers following conditions: (1) The box should contain a car and fit its size. (2) Each box should contain only one car. Figure 2: Performance of the identifications made by YOLO, (3) The box should contain a big enough car, i.e., the car’s the crowdworkers, and both combined. The crowdworkers’ height is greater than 5% of the frame height. performance is based on the cars not identified by YOLO. When no cars are found, the worker can continue to the next frame. Annotated boxes that are too small, i.e., 5.1. YOLO Performance 5.3. Collaboration Performance We found that the YOLO’s best performance is achieved To assess the performance of the proposed collabora- in the Rainy highway scenario. In this case, YOLO reaches tion, we combine the identifications made by YOLO with a precision of 0.97 and managed to identify 81% of the those from crowdworkers. The best results for the col- cars, with an F1-Score of 0.88. Meanwhile, a moderate laboration are in the Rainy highway scenario, in which performance is observed in the Daylight city scenario, the share of identified cars increased to 93%, achieving a in which only 56% of cars are identified (Precision=0.95), 12-percentage-point increase. Here, precision decreased resulting in an F1-Score of 0.70. Finally, the most chal- slightly to 0.93, while the F1-Score increased to 0.93. This lenging scenario for YOLO is the Nightlight city. In this is somehow expected since the YOLO results were al- case, only 32% of the cars were identified although a ready really good. In contrast, the Nightlight city sce- precision of 0.99 is achieved. This behavior leads to an nario received the most significant contribution from F1-Score of 0.49. Analyzing YOLO’s performance by com- crowdworkers. In this case, the share of identified cars bining all scenarios, we observe rather moderate results in increased to 76%, meaning that 44% of the cars were the number of identified cars. Although most of YOLO’s identified by crowdworkers. The precision of the col- identifications were actually cars (Precision=0.96), YOLO laboration decreased again to 0.98, but the F1-Score was identified only 55% of the cars correctly. Resulting in an significantly increased to 0.86. This confirms again the F1-Score of 0.70 as shown in Figure 2. ability of crowdworkers to make decisions, where an AI YOLO’s performance suggests a rather conservative might be not enough trained. Finally, the Daylight city behavior, in which only most certain cars are identified, scenario remains the most challenging since the identi- thus achieving high precision, but not identifying a high fied cars rate increased to 69%, i.e., 13-percentage-point proportion of cars, maybe due to difficult or untrained after the crowdworkers’ participation. The precision also context conditions, e.g., crowds of cars, low brightness, decreased a little bit to 0.94, however, the F1-Score in- too small cars, etc. Our results also confirm YOLO’s creased to 0.79. The results for all scenarios combined difficulty to find cars in night conditions as in [21]. showed that the collaboration increased the share of iden- tified cars in all frames to 75%. Thus, the crowdworkers 5.2. Crowdworkers’ Performance contributed 20% of all the cars to be identified. Although the precision decreased to 0.95, the F1-Score increased to The crowdworkers’ contribution is studied by consid- 0.84. The decrease in precision can be due to the non-cars ering only the cars that were not identified by YOLO vehicles annotated by crowdworkers. since they received pre-annotated frames. We found that crowdworkers perform better in the Nightlight city sce- nario. In this case, they reached a precision of 0.97 and 6. Discussion and Conclusion identified 65% of the missing cars. Thereby, resulting in an F1-Score of 0.78. In the Rainy highway scenario, The success of autonomous driving vehicles relies heav- the crowdworkers’ precision decreased to 0.75 although ily on well-trained AI models used to understand the they managed to identify 61% of the missing cars (F1- current driving situation and take appropriate actions. Score=0.68). In this case, the false positives resulted from To train such models, an extensive amount of labeled trucks or construction vehicles identified as cars by the data is required. In this work, we studied the feasibility crowdworkers. The most challenging scenario for crowd- of a Human-AI collaboration via crowdsourcing for car workers was the Daylight city. Here, the precision was identification as the first step towards a scalable pipeline 0.92, but the workers only identified 29% of the missing for creating such labeled data. For this, we employed cars, which reduces the F-Score to 0.43. In this case, we YOLOv3 to pre-annotate frames of three different sce- observed that crowdworkers tend to skip objects that narios that exhibit different image quality and traffic are in the middle of car crowds, e.g., in lines of parked conditions. Then, we asked a group of crowdworkers to vehicles. When analyzing all scenarios combined, similar refine the AI-achieved annotations via a micro-task. to YOLO, the crowdworkers’ precision was high, i.e., 0.92, Our results showed that YOLO performed effectively but they managed to identify only 45% of the missing in a rainy highway scenario, in which the cars are driving cars, which leads to an F-Score of 0.61. in two directions and no crowds of cars are observed in Similar to YOLO, the crowdworkers’ performance a frame. A more moderate performance was observed in seems to be modest. The biggest issues for crowdwork- a daylight city scenario that constantly exhibited dense ers were finding missing cars in crowded scenarios, and crowds of multi-direction parked and moving cars, i.e., avoiding annotating other types of vehicles as cars. The implying different perspectives and proximity. However, second issue is less critical since in a driving situation YOLO’s performance was rather low in a nightlight city this is actually desired. scenario in which poor light conditions represent an ad- ditional constraint. Thus, it confirms the limitations of AI models in challenging contexts such the city scenar- and Neuroscience 2018 (2018) 1–13. doi:10.1155/ ios. On the other hand, the crowdworkers obtained the 2018/7068349. best results in the worst YOLO scenario, i.e., nightlight [4] H. Su, J. Deng, L. Fei-Fei, Crowdsourc- city, contributing almost half of the car identifications ing annotations for visual object detec- and demonstrating their ability to make decisions based tion, Uniwersytet śla̧ski (2012) 40–46. URL: on the scene’s hints. In the case of the rainy highway, https://collaborate.princeton.edu/en/publications/ the crowdworkers retrieved a significant amount of re- crowdsourcing-annotations-for-visual-object-detection. maining cars, which were normally the most distant cars. doi:10.2/JQUERY.MIN.JS. Lastly, the daylight city scenario also represented a chal- [5] H. Ning, R. Yin, A. Ullah, F. Shi, A Survey lenge for the crowdworkers. This might be related to on Hybrid Human-Artificial Intelligence for Au- the effort required to find partially hidden cars in dense tonomous Driving, IEEE Transactions on Intelli- parking locations. gent Transportation Systems 23 (2022) 6011–6026. The results show that a Human-AI collaboration might doi:10.1109/TITS.2021.3074695. be feasible and scalable to save human effort by having [6] L. Wenyin, S. T. Dumais, Y. Sun, H. Zhang, M. Czer- pre-annotated data and reacting to untrained or chal- winski, B. A. Field, others, Semi-Automatic Image lenging scenarios by taking advantage of crowdworkers’ Annotation., in: Interact, volume 1, 2001, pp. 326– ability to make decisions based on context. Nevertheless, 333. to achieve fully annotated frames further mechanisms [7] Q. Cheng, Q. Zhang, P. Fu, C. Tu, S. Li, A survey should be investigated. For instance, a AI active learn- and analysis on automatic image annotation, Pat- ing scheme using crowdworkers contribution, and the tern Recognition 79 (2018) 242–259. doi:10.1016/ inclusion of more crowdworkers per frame. Additionally, j.patcog.2018.02.017. automatic active learning for frequent crowdworkers can [8] X. Hu, H. Wang, A. Vegesana, S. Dube, K. Yu, G. Kao, be AI-supported, under a personalized training scheme S.-H. Chen, Y.-H. Lu, G. K. Thiruvathukal, M. Yin, based on their behavior. Finally, further steps for the Crowdsourcing Detection of Sampling Biases in detection and fixing of wrong identifications, e.g., as pro- Image Datasets, in: Proceedings of The Web Con- posed in [22], and for addressing multi-object scenarios ference 2020, ACM, New York, NY, USA, 2020, pp. should be investigated. 2955–2961. doi:10.1145/3366423.3380063. [9] A. Carlier, A. Salvador, X. Giró-i Nieto, O. Mar- ques, V. Charvillat, Click’n’Cut: Crowd- Acknowledgments sourced Interactive Segmentation with Object Can- didates, in: 3rd International ACM Workshop on This work was carried out under the project Segmenta- Crowdsourcing for Multimedia (CrowdMM), Or- tion of visual media (Computer Vision) for cloud-based lando, Florida (USA), 2014. URL: http://dx.doi.org/ processing co-financed by the program ProFIT Branden- 10.1145/2660114.2660125. doi:10.1145/2660114. burg of the Ministry of Economic and European Affairs of 2660125. the State of Brandenburg in Germany and the European [10] X. Wang, L. Mudie, C. J. Brady, Crowdsourcing: An Regional Development Fund. overview and applications to ophthalmology, 2016. doi:10.1097/ICU.0000000000000251. References [11] E. Heim, Large-scale medical image an- notation with quality-controlled crowd- [1] S. Davies, Interconnected sensor networks and sourcing (2018). URL: http://archiv.ub. decision-making self-driving car control algorithms uni-heidelberg.de/volltextserver/id/eprint/24641. in smart sustainable urbanism, Contemp. Read- doi:10.11588/HEIDOK.00024641. ings L. & Soc. Just. 12 (2020) 88. doi:10.22381/ [12] M. Amgad, H. Elfandy, H. Hussein, L. A. Atteya, CRLSJ122202010. M. A. T. Elsebaie, L. S. Abo Elnasr, R. A. Sakr, [2] T. Brell, R. Philipsen, M. Ziefle, Suspi- H. S. E. Salem, A. F. Ismail, A. M. Saad, J. Ahmed, cious minds? – users’ perceptions of au- M. A. T. Elsebaie, M. Rahman, I. A. Ruhban, N. M. tonomous and connected driving, Theoret- Elgazar, Y. Alagha, M. H. Osman, A. M. Alhusseiny, ical Issues in Ergonomics Science 20 (2019) M. M. Khalaf, A.-A. F. Younes, A. Abdulkarim, D. M. 301–331. URL: https://www.tandfonline.com/doi/ Younes, A. M. Gadallah, A. M. Elkashash, S. Y. Fala, abs/10.1080/1463922X.2018.1485985. doi:10.1080/ B. M. Zaki, J. Beezley, D. R. Chittajallu, D. Man- 1463922X.2018.1485985. they, D. A. Gutman, L. A. D. Cooper, Structured [3] A. Voulodimos, N. Doulamis, A. Doulamis, E. Pro- crowdsourcing enables convolutional segmentation topapadakis, Deep Learning for Computer Vi- of histology images, Bioinformatics 35 (2019) 3461– sion: A Brief Review, Computational Intelligence 3467. doi:10.1093/bioinformatics/btz083. [13] S. Ørting, A. Doyle, A. van Hilten, M. Hirth, O. Inel, C. R. Madan, P. Mavridis, H. Spiers, V. Cheplygina, A Survey of Crowdsourcing in Medical Image Anal- ysis, 2019. doi:10.15346/hc.v7i1.1. [14] S. Boorboor, S. Nadeem, J. H. Park, K. Baker, A. Kauf- man, Crowdsourcing lung nodules detection and annotation, in: Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Appli- cations, volume 10579, International Society for Optics and Photonics, SPIE, 2018, pp. 342–348. URL: https://doi.org/10.1117/12.2292563. doi:10.1117/ 12.2292563. [15] J. Redmon, A. Farhadi, YOLO v.3, Tech report (2018) 1–6. URL: https://pjreddie.com/media/files/papers/ YOLOv3.pdf. [16] J. Du, Understanding of Object Detection Based on CNN Family and YOLO, Journal of Physics: Con- ference Series 1004 (2018) 012029. doi:10.1088/ 1742-6596/1004/1/012029. [17] S. Khanna, A. Ratan, J. Davis, W. Thies, Evaluating and improving the usability of Mechanical Turk for low-income workers in India, in: ACM Symposium on Computing for Development, ACM DEV ’10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 1–10. doi:10.1145/1926180. 1926195. [18] T. Hossfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, P. Tran-Gia, Best practices for qoe crowdtesting: Qoe assessment with crowdsourcing, IEEE Transactions on Multimedia 16 (2014) 541–558. doi:10.1109/TMM.2013.2291663. [19] S. Krug, Don’t make me think!: Web & Mobile Us- ability: Das intuitive Web, mitp Professional, MITP Verlags GmbH & Company KG, 2018. URL: https: //books.google.de/books?id=e-VIDwAAQBAJ. [20] E. Gamboa, R. Galda, C. Mayas, M. Hirth, The Crowd Thinks Aloud: Crowdsourcing Usability Testing with the Thinking Aloud Method, in: HCI International 2021 - Late Breaking Papers: De- sign and User Experience, Springer International Publishing, Cham, 2021, pp. 24–39. doi:10.1007/ 978-3-030-90238-4{\_}3. [21] C. Tung, M. R. Kelleher, R. J. Schlueter, B. Xu, Y.-H. Lu, G. K. Thiruvathukal, Y.-K. Chen, Y. Lu, Large- Scale Object Detection of Images from Network Cameras in Variable Ambient Lighting Conditions, in: 2019 IEEE Conference on Multimedia Informa- tion Processing and Retrieval (MIPR), IEEE, 2019, pp. 393–398. doi:10.1109/MIPR.2019.00080. [22] C. Tessier, F. Dehais, Authority Management and Conflict Solving in Human-Machine Systems., Aerospace Lab (2012) p–1.