<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Physics: Con</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.15346/hc.v7i1.1</article-id>
      <title-group>
        <article-title>Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edwin Gamboa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Libreros</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Hirth</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Dubiner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Scalehub GmbH</institution>
          ,
          <addr-line>Heidbergstraße 100, Norderstedt, 22846</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>User-centric Analysis of Multimedia Data Group, TU Ilmenau</institution>
          ,
          <addr-line>Ehrenbergstraße 29, Ilmenau, 98693</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>10579</volume>
      <fpage>342</fpage>
      <lpage>348</lpage>
      <abstract>
        <p>Large and highly curated training data is required for Artificial Intelligence (AI) models to perform robustly and reliably. However, training data is scarce since its production normally requires manual expert annotation, which limits scalability. Crowdsourced micro-tasking can help to overcome this challenge, as it offers access to a global workforce that might enable highly scalable annotation of visual data in a cost- and time-effective way. Therefore, we aim to develop a workflow based on Human-AI collaboration that shall enable large-scale annotation of image data for autonomous driving systems. In this paper, we present the first steps towards this goal, in particular, a Human-AI approach for identifying cars. We assess the feasibility of this collaboration via three scenarios, each one representing different traffic and weather conditions. We found that crowdworkers improved the AI’s work by identifying more than 40% of the missing cars. Crowdworkers’ contribution was key in challenging situations in which identifying a car depended on context.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-AI collaboration</kwd>
        <kwd>Crowdsourcing</kwd>
        <kwd>Micro-tasking</kwd>
        <kwd>Autonomous driving</kwd>
        <kwd>Anonymous annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Autonomous driving is one of the most promising approaches to support smart mobility by reducing the associated risks of human behavior and driving fatigue [1]. Key enablers for autonomous driving systems are sets of sensors installed in the vehicle to monitor the vehicle’s environment. Prediction and estimation models then use the sensor data to understand the current driving situation and decide upon appropriate actions. The models must be highly accurate and have low processing time to minimize the risks of threatening road actors’ lives [2]. Supervised learning outperforms classical identification algorithms in this field of application [3]. However, a supervised identification model needs large amounts of training data to later identify objects in a robust, accurate, and reliable way. A highly accurate model for identifying objects in the street must consider different scenarios such as rain, sun, sunset, night, and seasons, each of them with particular settings related to, e.g., luminosity and reflectance. Still, the availability of public, accurate, reliable, and, especially, massive data sets is scarce for particular objectives, and existing data sets do not meet high-scale purposes; therefore, learning from those data is difficult [4]. Hence, machine learning models perform poorly in high-scale cases, leading to severe limitations that make object identification for autonomous driving still an open problem [5].</p>
      <p>In this paper, we present our first steps towards a Human-AI collaboration to enable fast and highly reliable labeling of camera images in the context of autonomous driving. We find that the image data and the required labels exhibit domain-specific challenges, and we illustrate how to consider these challenges in the design of the crowdsourcing workflow. An AI model supports the crowdworkers with pre-annotations of the images to reduce their workload and cope with a large amount of data. The workflow is evaluated in a user study with crowdworkers who annotated almost 400 real-world images. Our results show that the workflow combines the strengths of automated pre-annotation and manual human refinement using scalable, public micro-tasking.</p>
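      <p>The workflow outlined above, in which an AI model pre-annotates each frame and crowdworkers add the cars it missed, can be sketched minimally as follows. The (x, y, w, h) box format, the helper names, and the 0.5 overlap threshold are illustrative assumptions, not the authors’ implementation.</p>
      <preformat>
```python
# Minimal sketch of the workflow described above: an AI model
# pre-annotates each frame and crowdworkers add boxes for the cars
# the model missed; both sets are then merged. The (x, y, w, h) box
# format, the helper names, and the 0.5 overlap threshold are
# illustrative assumptions, not the authors' implementation.

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def merge_annotations(ai_boxes, crowd_boxes, iou_threshold=0.5):
    """Keep all AI pre-annotations; add crowd boxes that are new."""
    merged = list(ai_boxes)
    for box in crowd_boxes:
        duplicate = any(iou(box, kept) >= iou_threshold for kept in merged)
        if not duplicate:
            merged.append(box)
    return merged
```
      </preformat>
      <p>A crowd box that largely overlaps an existing pre-annotation is treated as a duplicate and dropped; everything else is added to the frame’s label set.</p>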
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>… distributions in the confidence of a set of identifications. Despite recent advances, the lack of trustworthiness of machine learning models has been shown [8]. Thus, the problem of retrieving missing objects is still open. To address this gap, manual annotations have been used, but this approach is limited for scalability purposes due to the scarce availability of experts. In this context, crowdsourcing has the potential to enable highly scalable annotations and produce reliable training data for AI models [9, 10, 8]. Heim [11] presents a cost-time analysis of manual segmentation of organs by experts and crowdworkers. Results show that domain experts achieved approximately 0.1 segmentations per hour vs. 35 segmentations from crowdworkers during the same time. Similarly, different works have employed crowdsourcing for the annotation of large datasets [12, 13]. Also, Boorboor et al. [14] showed how quality can be maximized in the case of lung nodule detection, and Hu et al. [8] have demonstrated that crowdsourcing might reduce the identification bias in challenging scenes. Nevertheless, crowdsourced micro-tasking implies challenges related to the variance in annotation quality, which is mainly related to the workers’ lack of domain knowledge [9, 11].</p>
      <p>Thus, a collaboration between AI and crowdsourcing might be feasible for addressing these issues, as demonstrated in the medical field. However, to the best of our knowledge, this collaboration has not been studied in the context of autonomous driving considering different driving and weather scenarios.</p>
    </sec>
    <sec id="sec-problem">
      <title>3. Problem Statement</title>
      <p>One of the main problems with the annotation of images for autonomous driving is the diversity of scenarios that may emerge. The driving situation can be highly different depending on the street environment, i.e., a highway or a narrow street inside a city, and vary in terms of, e.g., available driving space, number and type of other road users, available signs, and traffic lights. Additionally, numerous environmental factors such as lighting and weather conditions have to be considered. Considering this high diversity of scenarios, it seems likely that there are cases in which an AI delivers better results than crowdsourcing workers and vice versa. In the following, we will show this with concrete examples and illustrate the advantages of collaboration between AI and crowdworkers in this use case. We employ three self-collected videos representing different, typical street scenarios to assess the performance of the collaboration. A sample frame of each video is shown in Figure 1. First, a Daylight city video (Figure 1a), in which light conditions are ideal, but the image contains a lot of objects typical of a big city. Second, a Nightlight city video (Figure 1b) of a small city, in which light conditions are most challenging. Lastly, a Rainy highway video (Figure 1c), in which traffic is smooth, crowds of cars are infrequent, but the visual quality is affected by the rain. For our evaluation, we randomly selected 399 frames: 133 from the daylight video, 133 from the nightlight video, and 133 from the rainy highway video.</p>
      <p>Figure 1: Sample frames of the three scenarios: (a) Daylight city, (b) Nightlight city, (c) Rainy highway.</p>
    </sec>
    <sec id="sec-study">
      <title>4. Study Design</title>
      <p>This section presents the design process of the annotation task, the steps that crowdworkers performed when accessing it, and the process to evaluate the Human-AI collaboration.</p>
      <sec id="sec-study-1">
        <title>4.1. Task Design</title>
        <p>Fully annotating a video in the context of autonomous driving is rather complex, since such a task requires annotating different objects, e.g., cars, pedestrians, traffic signs, and other obstacles, frame by frame. Our first goal is to identify the main challenges of the annotation task itself and address the multi-object annotation problem later. Thus, we initially concentrate on the annotation of cars only. This annotation process can be further decomposed into a three-step task, i.e., (1) crowdworkers identify cars not detected by the AI, (2) crowdworkers identify wrong AI- and crowd-based annotations, and (3) crowdworkers fix the wrong annotations.</p>
        <p>In this paper, we focus on the first step. We decided to request crowdworkers to use bounding boxes for the annotation instead of other methods like polygon enclosing or free drawing, to reduce workload. Other, more sophisticated, techniques like marking background/foreground via simple clicks were discarded since they might lead to high heterogeneity in the results [9]. We decided to use YOLOv3 [15] for the pre-annotation of the images since it has demonstrated high performance in traffic contexts with low computational cost. Also, YOLO tends to predict fewer false positives than other state-of-the-art object identification architectures like R-CNN when using pre-trained models [16].</p>
        <p>We designed the task’s instructions following guidelines for crowdsourcing and usable texts. We used illustrated instructions minimizing visual complexity [17], together with short sentences using simple English [18, 19, 13]. Also, we included examples of wrong and right annotations [11, 17]. The instructions and the User Interface (UI) annotation mechanisms were iteratively improved using the Crowdsourced Thinking Aloud Protocol method as proposed in [20].</p>
      </sec>
      <sec id="sec-study-2">
        <title>4.2. Task Procedure</title>
        <p>Training. As recommended by different works [18, 9], training tasks should be included to bring crowdworkers closer to the task domain and to filter unreliable workers out. In particular, gold standard data can be used in which different complexity cases are trained.</p>
        <p>In the training task, we show crowdworkers three randomly selected images with different complexity levels. The complexity levels depend on the number of cars to be annotated, the amount of AI annotations, and the presence of cars that are hard to identify, e.g., very distant or partially visible cars. Each training task includes additional hints relevant to the current frame and based on the workers’ performance, e.g., highlighting missing cars after each try until all expected cars are annotated. Once the training task is successfully passed, crowdworkers can complete the annotation task. Quick instructions are visible during the whole process, and crowdworkers can go back to the detailed instructions anytime they want.</p>
        <p>Main Task. Crowdworkers have to annotate five randomly selected frames. We asked them to draw boxes around cars that the system, i.e., YOLO, did not find. To make the completion criteria clear, we ask them to annotate a maximum of 10 cars. To annotate only relevant cars in each frame, the crowdworkers should consider the following conditions: (1) The box should contain a car and fit its size. (2) Each box should contain only one car. (3) The box should contain a big enough car, i.e., the car’s height is greater than 5% of the frame height.</p>
        <p>When no cars are found, the worker can continue to the next frame. Annotated boxes that are too small, i.e., less than 5% of the frame height, are highlighted in red in the task UI. If the crowdworker does not resize the small annotations, the system informs the worker and deletes the boxes. Before annotating each frame, the workers are shown a 2-second video containing the 10 preceding frames. The goal of this video is to give context and support decision-making in case a crowdworker is not sure whether an object is a car. This video can be replayed anytime during the annotation.</p>
      </sec>
      <sec id="sec-study-3">
        <title>4.3. Evaluation Procedure</title>
        <p>Two experts manually inspected all frames to assess the quality of the YOLO annotations and the contribution of the crowdworkers to the annotation quality. The number of correct and incorrect YOLO identifications, the number of missing identifications, and the number of correct and incorrect crowdworkers’ identifications were registered. Using the expert annotations, we calculate precision, recall, and F1-score to get more rigorous information about the behavior of each model.</p>
      </sec>
    </sec>
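    <p>The evaluation procedure of Section 4.3 derives precision, recall, and F1-score from expert-verified counts. As a minimal illustration (the function and variable names are ours, not from the paper), the metrics follow from the counts of correct identifications (true positives), incorrect identifications (false positives), and missing cars (false negatives):</p>
    <preformat>
```python
# Sketch of the metrics from the evaluation procedure: precision,
# recall, and F1-score computed from expert-verified counts of correct
# identifications (tp), incorrect identifications (fp), and missing
# cars (fn). The function name is ours, not from the paper.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```
    </preformat>
    <p>For instance, counts that yield a precision of 0.97 and a recall of about 0.81 give an F1-score of roughly 0.88, which matches the values reported for YOLO in the Rainy highway scenario.</p>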
    <sec id="sec-3">
      <title>5. Evaluation</title>
      <p>We collected the crowdworkers’ annotations via the Amazon Mechanical Turk platform on July 12, 2022. The crowdworkers could carry out the annotation tasks as many times as desired. In total, 14 crowdworkers annotated all frames in 1 hour and 16 minutes.</p>
      <p>In the rest of this section, we present the results of our study in three main parts. First, we analyze YOLO’s performance in terms of the identified cars in the frames. Then, the contribution of the crowdworkers to YOLO’s work is assessed. Finally, we combined the identifications carried out by both YOLO and the crowdworkers and assessed the performance of this collaboration.</p>
      <p>Figure 2: F1-score of Yolo, Crowdworkers, and Yolo+Crowdworkers in the Daylight city, Nightlight city, and Rainy highway scenarios, and in all scenarios combined.</p>
      <sec id="sec-3-1">
        <title>5.1. YOLO Performance</title>
        <p>We found that YOLO’s best performance is achieved in the Rainy highway scenario. In this case, YOLO reaches a precision of 0.97 and managed to identify 81% of the cars, with an F1-score of 0.88. Meanwhile, a moderate performance is observed in the Daylight city scenario, in which only 56% of the cars are identified (precision = 0.95), resulting in an F1-score of 0.70. Finally, the most challenging scenario for YOLO is the Nightlight city. In this case, only 32% of the cars were identified, although a precision of 0.99 is achieved. This behavior leads to an F1-score of 0.49. Analyzing YOLO’s performance by combining all scenarios, we observe rather moderate results in the number of identified cars. Although most of YOLO’s identifications were actually cars (precision = 0.96), YOLO identified only 55% of the cars correctly, resulting in an F1-score of 0.70, as shown in Figure 2.</p>
        <p>YOLO’s performance suggests a rather conservative behavior, in which only the most certain cars are identified, thus achieving high precision but not identifying a high proportion of the cars, maybe due to difficult or untrained context conditions, e.g., crowds of cars, low brightness, too small cars, etc. Our results also confirm YOLO’s difficulty to find cars in night conditions, as in [21].</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Crowdworkers’ Performance</title>
        <p>The crowdworkers’ contribution is studied by considering only the cars that were not identified by YOLO, since the crowdworkers received pre-annotated frames. We found that crowdworkers perform better in the Nightlight city scenario. In this case, they reached a precision of 0.97 and identified 65% of the missing cars, resulting in an F1-score of 0.78. In the Rainy highway scenario, the crowdworkers’ precision decreased to 0.75, although they managed to identify 61% of the missing cars (F1-score = 0.68). In this case, the false positives resulted from trucks or construction vehicles identified as cars by the crowdworkers. The most challenging scenario for crowdworkers was the Daylight city. Here, the precision was 0.92, but the workers only identified 29% of the missing cars, which reduces the F1-score to 0.43. In this case, we observed that crowdworkers tend to skip objects that are in the middle of car crowds, e.g., in lines of parked vehicles. When analyzing all scenarios combined, similar to YOLO, the crowdworkers’ precision was high, i.e., 0.92, but they managed to identify only 45% of the missing cars, which leads to an F1-score of 0.61.</p>
        <p>Similar to YOLO, the crowdworkers’ performance seems to be modest. The biggest issues for crowdworkers were finding missing cars in crowded scenarios and avoiding annotating other types of vehicles as cars. The second issue is less critical, since in a driving situation this behavior is actually desired.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.3. Collaboration Performance</title>
        <p>To assess the performance of the proposed collaboration, we combine the identifications made by YOLO with those from the crowdworkers. The best results for the collaboration are in the Rainy highway scenario, in which the share of identified cars increased to 93%, a 12-percentage-point increase. Here, precision decreased slightly to 0.93, while the F1-score increased to 0.93. This is somewhat expected, since the YOLO results were already really good. In contrast, the Nightlight city scenario received the most significant contribution from the crowdworkers. In this case, the share of identified cars increased to 76%, meaning that 44% of the cars were identified by crowdworkers. The precision of the collaboration decreased again to 0.98, but the F1-score was significantly increased to 0.86. This confirms again the ability of crowdworkers to make decisions where an AI might not be trained enough. Finally, the Daylight city scenario remains the most challenging, since the rate of identified cars increased to 69%, i.e., 13 percentage points after the crowdworkers’ participation. The precision also decreased slightly to 0.94; however, the F1-score increased to 0.79. The results for all scenarios combined showed that the collaboration increased the share of identified cars in all frames to 75%. Thus, the crowdworkers contributed 20% of all the cars to be identified. Although the precision decreased to 0.95, the F1-score increased to 0.84. The decrease in precision can be due to the non-car vehicles annotated by crowdworkers.</p>
      </sec>
    </sec>
    <sec id="sec-discussion">
      <title>6. Discussion and Conclusion</title>
      <p>The success of autonomous driving vehicles relies heavily on well-trained AI models used to understand the current driving situation and take appropriate actions. To train such models, an extensive amount of labeled data is required. In this work, we studied the feasibility of a Human-AI collaboration via crowdsourcing for car identification as the first step towards a scalable pipeline for creating such labeled data. For this, we employed YOLOv3 to pre-annotate frames of three different scenarios that exhibit different image quality and traffic conditions. Then, we asked a group of crowdworkers to refine the AI-achieved annotations via a micro-task.</p>
      <p>Our results showed that YOLO performed effectively in a rainy highway scenario, in which the cars are driving in two directions and no crowds of cars are observed in a frame. A more moderate performance was observed in a daylight city scenario that constantly exhibited dense crowds of multi-direction parked and moving cars, i.e., implying different perspectives and proximity. However, YOLO’s performance was rather low in a nightlight city scenario, in which poor light conditions represent an additional constraint. Thus, it confirms the limitations of AI models in challenging contexts such as the city scenarios. On the other hand, the crowdworkers obtained the best results in the worst YOLO scenario, i.e., the nightlight city, contributing almost half of the car identifications and demonstrating their ability to make decisions based on the scene’s hints. In the case of the rainy highway, the crowdworkers retrieved a significant amount of the remaining cars, which were normally the most distant ones. Lastly, the daylight city scenario also represented a challenge for the crowdworkers. This might be related to the effort required to find partially hidden cars in dense parking locations.</p>
      <p>The results show that a Human-AI collaboration might be feasible and scalable, saving human effort by having pre-annotated data and reacting to untrained or challenging scenarios by taking advantage of the crowdworkers’ ability to make decisions based on context. Nevertheless, to achieve fully annotated frames, further mechanisms should be investigated, for instance, an AI active learning scheme using the crowdworkers’ contribution, and the inclusion of more crowdworkers per frame. Additionally, automatic active learning for frequent crowdworkers can be AI-supported, under a personalized training scheme based on their behavior. Finally, further steps for the detection and fixing of wrong identifications, e.g., as proposed in [22], and for addressing multi-object scenarios should be investigated.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was carried out under the project Segmentation of visual media (Computer Vision) for cloud-based processing, co-financed by the program ProFIT Brandenburg of the Ministry of Economic and European Affairs of the State of Brandenburg in Germany and the European Regional Development Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>