<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Physics: Con</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.15346/hc.v7i1.1</article-id>
      <title-group>
        <article-title>Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edwin Gamboa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Libreros</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Hirth</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Dubiner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Scalehub GmbH</institution>
          ,
          <addr-line>Heidbergstraße 100, Norderstedt, 22846</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>User-centric Analysis of Multimedia Data Group, TU Ilmenau</institution>
          ,
          <addr-line>Ehrenbergstraße 29, Ilmenau, 98693</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>10579</volume>
      <fpage>342</fpage>
      <lpage>348</lpage>
      <abstract>
        <p>Large and highly curated training data is required for Artificial Intelligence (AI) models to perform robustly and reliably. However, training data is scarce since its production normally requires manual expert annotation, which limits scalability. Crowdsourced micro-tasking can help to overcome this challenge, as it offers access to a global workforce that might enable highly scalable annotation of visual data in a cost- and time-effective way. Therefore, we aim to develop a workflow based on Human-AI collaboration that shall enable large-scale annotation of image data for autonomous driving systems. In this paper, we present the first steps towards this goal, in particular, a Human-AI approach for identifying cars. We assess the feasibility of this collaboration via three scenarios, each one representing different traffic and weather conditions. We found that crowdworkers improved the AI’s work by identifying more than 40% of the missing cars. Crowdworkers’ contribution was key in challenging situations in which identifying a car depended on context.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-AI collaboration</kwd>
        <kwd>Crowdsourcing</kwd>
        <kwd>Micro-tasking</kwd>
        <kwd>Autonomous driving</kwd>
        <kwd>Anonymous annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Autonomous driving is one of the most promising approaches to support smart mobility by reducing the associated risks of human behavior and driving fatigue [1]. Key enablers for autonomous driving systems are sets of sensors installed in the vehicle to monitor the vehicle’s environment. Prediction and estimation models then use the sensor data to understand the current driving situation and decide upon appropriate actions. The models must be highly accurate and have low processing time to minimize the risks of threatening road actors’ lives [2]. Supervised learning outperforms classical identification algorithms in this field of application [3]. However, a supervised identification model needs large amounts of training data to later identify objects in a robust, accurate, and reliable way. A highly accurate model for identifying objects in the street must consider different scenarios such as rain, sun, sunset, night, and seasons, each of them with particular settings related to, e.g., luminosity and reflectance. Still, the availability of public, accurate, reliable, and, especially, massive data sets is scarce for particular objectives, and existing data sets do not meet high-scale purposes; therefore, learning from those data is difficult [4]. Hence, machine learning models perform poorly in high-scale cases, leading to severe limitations that make object identification for autonomous driving still an open problem [5].</p>
      <p>In this paper, we present our first steps towards a Human-AI collaboration to enable fast and highly reliable labeling of camera images in the context of autonomous driving. We find that the image data and the required labels exhibit domain-specific challenges, and we illustrate how to consider these challenges in the design of the crowdsourcing workflow. An AI model supports the crowdworkers with pre-annotations of the images to reduce their workload and cope with a large amount of data. The workflow is evaluated in a user study with crowdworkers who annotated almost 400 real-world images. Our results show that the workflow combines the strengths of automated pre-annotation and manual human refinement using scalable, public micro-tasking.</p>
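      <p>The workflow outlined above, in which an AI model pre-annotates each frame and crowdworkers add the cars it missed, can be sketched minimally as follows. The (x, y, w, h) box format, the helper names, and the 0.5 overlap threshold are illustrative assumptions, not the authors’ implementation.</p>
      <preformat>
```python
# Minimal sketch of the workflow described above: an AI model
# pre-annotates each frame and crowdworkers add boxes for the cars
# the model missed; both sets are then merged. The (x, y, w, h) box
# format, the helper names, and the 0.5 overlap threshold are
# illustrative assumptions, not the authors' implementation.

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def merge_annotations(ai_boxes, crowd_boxes, iou_threshold=0.5):
    """Keep all AI pre-annotations; add crowd boxes that are new."""
    merged = list(ai_boxes)
    for box in crowd_boxes:
        duplicate = any(iou(box, kept) >= iou_threshold for kept in merged)
        if not duplicate:
            merged.append(box)
    return merged
```
      </preformat>
      <p>A crowd box that largely overlaps an existing pre-annotation is treated as a duplicate and dropped; everything else is added to the frame’s label set.</p>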
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>… distributions in the confidence of a set of identifications. Despite recent advances, the lack of trustworthiness of machine learning models has been shown [8]. Thus, the problem of retrieving missing objects is still open. To address this gap, manual annotations have been used, but this approach is limited for scalability purposes due to the scarce availability of experts. In this context, crowdsourcing has the potential to enable highly scalable annotations and produce reliable training data for AI models [9, 10, 8]. Heim [11] presents a cost-time analysis of manual segmentation of organs by experts and crowdworkers. Results show that domain experts achieved approximately 0.1 segmentations per hour vs. 35 segmentations from crowdworkers during the same time. Similarly, different works have employed crowdsourcing for the annotation of large datasets [12, 13]. Also, Boorboor et al. [14] showed how quality can be maximized in the case of lung nodule detection, and Hu et al. [8] have demonstrated that crowdsourcing might reduce the identification bias in challenging scenes. Nevertheless, crowdsourced micro-tasking implies challenges related to the variance in annotation quality, which is mainly related to the workers’ lack of domain knowledge [9, 11].</p>
      <p>Thus, a collaboration between AI and crowdsourcing might be feasible for addressing these issues, as demonstrated in the medical field. However, to the best of our knowledge, this collaboration has not been studied in the context of autonomous driving considering different driving and weather scenarios.</p>
    </sec>
    <sec id="sec-problem">
      <title>3. Problem Statement</title>
      <p>One of the main problems with the annotation of images for autonomous driving is the diversity of scenarios that may emerge. The driving situation can be highly different depending on the street environment, i.e., a highway or a narrow street inside a city, and vary in terms of, e.g., available driving space, number and type of other road users, available signs, and traffic lights. Additionally, numerous environmental factors such as lighting and weather conditions have to be considered. Considering this high diversity of scenarios, it seems likely that there are cases in which an AI delivers better results than crowdsourcing workers and vice versa. In the following, we will show this with concrete examples and illustrate the advantages of collaboration between AI and crowdworkers in this use case. We employ three self-collected videos representing different, typical street scenarios to assess the performance of the collaboration. A sample frame of each video is shown in Figure 1. First, a Daylight city video (Figure 1a), in which light conditions are ideal, but the image contains a lot of objects typical of a big city. Second, a Nightlight city video (Figure 1b) of a small city, in which light conditions are most challenging. Lastly, a Rainy highway video (Figure 1c), in which traffic is smooth, crowds of cars are infrequent, but the visual quality is affected by the rain. For our evaluation, we randomly selected 399 frames: 133 from the daylight video, 133 from the nightlight video, and 133 from the rainy highway video.</p>
      <p>Figure 1: Sample frames of the three scenarios: (a) Daylight city, (b) Nightlight city, (c) Rainy highway.</p>
    </sec>
    <sec id="sec-study">
      <title>4. Study Design</title>
      <p>This section presents the design process of the annotation task, the steps that crowdworkers performed when accessing it, and the process to evaluate the Human-AI collaboration.</p>
      <sec id="sec-study-1">
        <title>4.1. Task Design</title>
        <p>Fully annotating a video in the context of autonomous driving is rather complex, since such a task requires annotating different objects, e.g., cars, pedestrians, traffic signs, and other obstacles, frame by frame. Our first goal is to identify the main challenges of the annotation task itself and address the multi-object annotation problem later. Thus, we initially concentrate on the annotation of cars only. This annotation process can be further decomposed into a three-step task, i.e., (1) crowdworkers identify cars not detected by the AI, (2) crowdworkers identify wrong AI- and crowd-based annotations, and (3) crowdworkers fix the wrong annotations.</p>
        <p>In this paper, we focus on the first step. We decided to request crowdworkers to use bounding boxes for the annotation instead of other methods like polygon enclosing or free drawing, to reduce workload. Other, more sophisticated, techniques like marking background/foreground via simple clicks were discarded since they might lead to high heterogeneity in the results [9]. We decided to use YOLOv3 [15] for the pre-annotation of the images since it has demonstrated high performance in traffic contexts with low computational cost. Also, YOLO tends to predict fewer false positives than other state-of-the-art object identification architectures like R-CNN when using pre-trained models [16].</p>
        <p>We designed the task’s instructions following guidelines for crowdsourcing and usable texts. We used illustrated instructions minimizing visual complexity [17], together with short sentences using simple English [18, 19, 13]. Also, we included examples of wrong and right annotations [11, 17]. The instructions and the User Interface (UI) annotation mechanisms were iteratively improved using the Crowdsourced Thinking Aloud Protocol method as proposed in [20].</p>
      </sec>
      <sec id="sec-study-2">
        <title>4.2. Task Procedure</title>
        <p>Training. As recommended by different works [18, 9], training tasks should be included to bring crowdworkers closer to the task domain and to filter unreliable workers out. In particular, gold standard data can be used in which different complexity cases are trained.</p>
        <p>In the training task, we show crowdworkers three randomly selected images with different complexity levels. The complexity levels depend on the number of cars to be annotated, the amount of AI annotations, and the presence of cars that are hard to identify, e.g., very distant or partially visible cars. Each training task includes additional hints relevant to the current frame and based on the workers’ performance, e.g., highlighting missing cars after each try until all expected cars are annotated. Once the training task is successfully passed, crowdworkers can complete the annotation task. Quick instructions are visible during the whole process, and crowdworkers can go back to the detailed instructions anytime they want.</p>
        <p>Main Task. Crowdworkers have to annotate five randomly selected frames. We asked them to draw boxes around cars that the system, i.e., YOLO, did not find. To make the completion criteria clear, we ask them to annotate a maximum of 10 cars. To annotate only relevant cars in each frame, the crowdworkers should consider the following conditions: (1) The box should contain a car and fit its size. (2) Each box should contain only one car. (3) The box should contain a big enough car, i.e., the car’s height is greater than 5% of the frame height.</p>
        <p>When no cars are found, the worker can continue to the next frame. Annotated boxes that are too small, i.e., less than 5% of the frame height, are highlighted in red in the task UI. If the crowdworker does not resize the small annotations, the system informs the worker and deletes the boxes. Before annotating each frame, the workers are shown a 2-second video containing the 10 preceding frames. The goal of this video is to give context and support decision-making in case a crowdworker is not sure whether an object is a car. This video can be replayed anytime during the annotation.</p>
      </sec>
      <sec id="sec-study-3">
        <title>4.3. Evaluation Procedure</title>
        <p>Two experts manually inspected all frames to assess the quality of the YOLO annotations and the contribution of the crowdworkers to the annotation quality. The number of correct and incorrect YOLO identifications, the number of missing identifications, and the number of correct and incorrect crowdworkers’ identifications were registered. Using the expert annotations, we calculate precision, recall, and F1-score to get more rigorous information about the behavior of each model.</p>
      </sec>
    </sec>
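    <p>The evaluation procedure of Section 4.3 derives precision, recall, and F1-score from expert-verified counts. As a minimal illustration (the function and variable names are ours, not from the paper), the metrics follow from the counts of correct identifications (true positives), incorrect identifications (false positives), and missing cars (false negatives):</p>
    <preformat>
```python
# Sketch of the metrics from the evaluation procedure: precision,
# recall, and F1-score computed from expert-verified counts of correct
# identifications (tp), incorrect identifications (fp), and missing
# cars (fn). The function name is ours, not from the paper.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```
    </preformat>
    <p>For instance, counts that yield a precision of 0.97 and a recall of about 0.81 give an F1-score of roughly 0.88, which matches the values reported for YOLO in the Rainy highway scenario.</p>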
    <sec id="sec-3">
      <title>5. Evaluation</title>
      <p>We collected the crowdworkers’ annotations via the Amazon Mechanical Turk platform on July 12, 2022. The crowdworkers could carry out the annotation tasks as many times as desired. In total, 14 crowdworkers annotated all frames in 1 hour and 16 minutes.</p>
      <p>In the rest of this section, we present the results of our study in three main parts. First, we analyze YOLO’s performance in terms of the identified cars in the frames. Then, the contribution of the crowdworkers to YOLO’s work is assessed. Finally, we combined the identifications carried out by both YOLO and the crowdworkers and assessed the performance of this collaboration.</p>
      <p>Figure 2: F1-score of Yolo, Crowdworkers, and Yolo+Crowdworkers in the Daylight city, Nightlight city, and Rainy highway scenarios, and in all scenarios combined.</p>
      <sec id="sec-3-1">
        <title>5.1. YOLO Performance</title>
        <p>We found that YOLO’s best performance is achieved in the Rainy highway scenario. In this case, YOLO reaches a precision of 0.97 and managed to identify 81% of the cars, with an F1-score of 0.88. Meanwhile, a moderate performance is observed in the Daylight city scenario, in which only 56% of the cars are identified (precision = 0.95), resulting in an F1-score of 0.70. Finally, the most challenging scenario for YOLO is the Nightlight city. In this case, only 32% of the cars were identified, although a precision of 0.99 is achieved. This behavior leads to an F1-score of 0.49. Analyzing YOLO’s performance by combining all scenarios, we observe rather moderate results in the number of identified cars. Although most of YOLO’s identifications were actually cars (precision = 0.96), YOLO identified only 55% of the cars correctly, resulting in an F1-score of 0.70, as shown in Figure 2.</p>
        <p>YOLO’s performance suggests a rather conservative behavior, in which only the most certain cars are identified, thus achieving high precision but not identifying a high proportion of the cars, maybe due to difficult or untrained context conditions, e.g., crowds of cars, low brightness, too small cars, etc. Our results also confirm YOLO’s difficulty to find cars in night conditions, as in [21].</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Crowdworkers’ Performance</title>
        <p>The crowdworkers’ contribution is studied by considering only the cars that were not identified by YOLO, since the crowdworkers received pre-annotated frames. We found that crowdworkers perform better in the Nightlight city scenario. In this case, they reached a precision of 0.97 and identified 65% of the missing cars, resulting in an F1-score of 0.78. In the Rainy highway scenario, the crowdworkers’ precision decreased to 0.75, although they managed to identify 61% of the missing cars (F1-score = 0.68). In this case, the false positives resulted from trucks or construction vehicles identified as cars by the crowdworkers. The most challenging scenario for crowdworkers was the Daylight city. Here, the precision was 0.92, but the workers only identified 29% of the missing cars, which reduces the F1-score to 0.43. In this case, we observed that crowdworkers tend to skip objects that are in the middle of car crowds, e.g., in lines of parked vehicles. When analyzing all scenarios combined, similar to YOLO, the crowdworkers’ precision was high, i.e., 0.92, but they managed to identify only 45% of the missing cars, which leads to an F1-score of 0.61.</p>
        <p>Similar to YOLO, the crowdworkers’ performance seems to be modest. The biggest issues for crowdworkers were finding missing cars in crowded scenarios and avoiding annotating other types of vehicles as cars. The second issue is less critical, since in a driving situation this behavior is actually desired.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.3. Collaboration Performance</title>
        <p>To assess the performance of the proposed collaboration, we combine the identifications made by YOLO with those from the crowdworkers. The best results for the collaboration are in the Rainy highway scenario, in which the share of identified cars increased to 93%, a 12-percentage-point increase. Here, precision decreased slightly to 0.93, while the F1-score increased to 0.93. This is somewhat expected, since the YOLO results were already really good. In contrast, the Nightlight city scenario received the most significant contribution from the crowdworkers. In this case, the share of identified cars increased to 76%, meaning that 44% of the cars were identified by crowdworkers. The precision of the collaboration decreased again to 0.98, but the F1-score was significantly increased to 0.86. This confirms again the ability of crowdworkers to make decisions where an AI might not be trained enough. Finally, the Daylight city scenario remains the most challenging, since the rate of identified cars increased to 69%, i.e., 13 percentage points after the crowdworkers’ participation. The precision also decreased slightly to 0.94; however, the F1-score increased to 0.79. The results for all scenarios combined showed that the collaboration increased the share of identified cars in all frames to 75%. Thus, the crowdworkers contributed 20% of all the cars to be identified. Although the precision decreased to 0.95, the F1-score increased to 0.84. The decrease in precision can be due to the non-car vehicles annotated by crowdworkers.</p>
      </sec>
    </sec>
    <sec id="sec-discussion">
      <title>6. Discussion and Conclusion</title>
      <p>The success of autonomous driving vehicles relies heavily on well-trained AI models used to understand the current driving situation and take appropriate actions. To train such models, an extensive amount of labeled data is required. In this work, we studied the feasibility of a Human-AI collaboration via crowdsourcing for car identification as the first step towards a scalable pipeline for creating such labeled data. For this, we employed YOLOv3 to pre-annotate frames of three different scenarios that exhibit different image quality and traffic conditions. Then, we asked a group of crowdworkers to refine the AI-achieved annotations via a micro-task.</p>
      <p>Our results showed that YOLO performed effectively in a rainy highway scenario, in which the cars are driving in two directions and no crowds of cars are observed in a frame. A more moderate performance was observed in a daylight city scenario that constantly exhibited dense crowds of multi-direction parked and moving cars, i.e., implying different perspectives and proximity. However, YOLO’s performance was rather low in a nightlight city scenario, in which poor light conditions represent an additional constraint. Thus, it confirms the limitations of AI models in challenging contexts such as the city scenarios. On the other hand, the crowdworkers obtained the best results in the worst YOLO scenario, i.e., the nightlight city, contributing almost half of the car identifications and demonstrating their ability to make decisions based on the scene’s hints. In the case of the rainy highway, the crowdworkers retrieved a significant amount of the remaining cars, which were normally the most distant ones. Lastly, the daylight city scenario also represented a challenge for the crowdworkers. This might be related to the effort required to find partially hidden cars in dense parking locations.</p>
      <p>The results show that a Human-AI collaboration might be feasible and scalable, saving human effort by having pre-annotated data and reacting to untrained or challenging scenarios by taking advantage of the crowdworkers’ ability to make decisions based on context. Nevertheless, to achieve fully annotated frames, further mechanisms should be investigated, for instance, an AI active learning scheme using the crowdworkers’ contribution, and the inclusion of more crowdworkers per frame. Additionally, automatic active learning for frequent crowdworkers can be AI-supported, under a personalized training scheme based on their behavior. Finally, further steps for the detection and fixing of wrong identifications, e.g., as proposed in [22], and for addressing multi-object scenarios should be investigated.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was carried out under the project Segmentation of visual media (Computer Vision) for cloud-based processing, co-financed by the program ProFIT Brandenburg of the Ministry of Economic and European Affairs of the State of Brandenburg in Germany and the European Regional Development Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>