<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Experience Report on Requirements-Driven Model-Based Synthetic Vision Testing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Murschitz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Zendel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Humenberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Sulzbachner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gustavo Fernández Domínguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work we show how to specifically sample domain parameters for a certain system under test (SUT) in order to create corresponding test data, find the system's limits of operation, and discover its flaws. The SUT is part of an aerial sense and avoid system that performs aerial object detection in a video stream. To generate synthetic test data, we first define the variability range of the test data based on real-world observations and capture the problem specification as requirements. Then, synthetic test data with explicit situational and domain coverage is generated, and we show how it can be used to identify problems within the tested system. Next, we show how to specifically sample domain parameters to create test data that allows us to find the operational limits of the system under test. Finally, we verify the gained insights, and therefore the methodology, in two ways: (i) by comparing the evaluation results to results obtained with real-world data, and (ii) by identifying the reasons for certain shortcomings based on the tested SUT's internals.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Nowadays, computer vision (CV) systems are increasingly used in
applications which can either be potentially harmful to humans or
serve as a safety measure to prevent accidents. CV systems are considered
hard to test due to the high complexity of the algorithms, the variety of
inputs, and the large number of possible results and internal states.
It is of utmost importance that the CV and testing communities settle
on a common standard regarding testing procedures and concepts to
safely open up new application fields.</p>
      <p>In our opinion one of the most neglected points in vision testing
is that CV applications have to be tested as a whole: the system and
the environment it is working in. Normally, there are two ways to
improve the result quality of any real-world CV application: (i) to
optimize the vision system, and (ii) to increasingly control the scene.
While the first goal is concerned with camera optics components and
algorithms, the second goal restricts the number of possible inputs
(i.e. the domain) and, thus, reduces applicability. As a result, the
requirements that are to be satisfied by a CV application have to include
both: domain aspects as well as functional aspects.</p>
      <p>
        During the course of this work we will focus on the specific
application of aerial obstacle detection as the system to be tested (see
Figure 1): a single camera mounted on a small propeller-driven light
aircraft produces a continuous image stream. The SUT’s goal is
to detect other planes in this image stream which are on a potential
impact course. These detections are then the input for a sense and avoid
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] controller. In order to formulate the requirements, it is important
to analyze the system with its full purpose in mind, even though this
work only tests the sense part.
      </p>
      <p>
        Naturally, a failure of such a system leads to dangerous situations;
hence rigorous testing is obligatory. To follow the idea of test-driven
development (see e.g. Boehm et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), the test data is designed at
an early stage, when a complete system is unavailable (neither
hardware nor software). At this early development stage, there exists a general
idea and a basic set of requirements about what the system should be
capable of, and some potential internals (methods used by the
algorithm) are already determined. Testing a system based on such
definitions is called Gray Box Testing, in contrast to White Box Testing,
where the entire code of the system is available, and Black Box
Testing, where only the general functionality of the algorithm is available
to the testers.
      </p>
      <p>The research objectives of this work are:
to design synthetic test data based on the SUT specification
together with domain restrictions in order to reveal possible flaws of
the SUT and to answer system-design-relevant questions (the concrete
formalization of the domain model in a specific domain language, and
many details of the creation of test data from this model, are omitted
because they would exceed the scope of this paper),
to analyze the SUT’s performance for certain partitions of this test
data and derive a number of insights into the SUT’s weaknesses,
to show how operational limits can be determined by the approach,
and to verify whether the approach is feasible and leads to
reasonable results by comparing the gained insights to real-world
examples and analyzing the SUT’s internals.</p>
      <p>The paper is organized as follows: Section 2 summarizes the related
work and discusses how the proposed procedure differs from other
approaches. Section 3 presents the approach for designing test data
based on the requirements of a concrete system. Section 4 describes the
general procedure to create the actual test data and Section 5 presents
the obtained test results. Section 6 shows how the presented approach is
validated, Section 7 summarizes the findings, and finally
Section 8 concludes the topic and gives an outlook.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Many advances in computer vision are closely related to available
labeled data. For example, increasingly difficult test data allowed
the community to analyze new approaches and rate them. Moreover,
training data is constantly growing in variability, which is driving
the currently impressive advances in deep-learning-based vision.
Increasingly complex and larger real-world test data sets require,
depending on the ground truth type, either a considerable amount of
annotation work or complex vision-independent measurements (e.g.
Light Detection And Ranging (LiDAR)). To manage manual
annotation effort, several strategies have been developed: many works
approach the issue by crowdsourcing technologies [
        <xref ref-type="bibr" rid="ref10 ref16 ref19 ref20 ref8">8, 20, 10, 16, 19</xref>
        ]
and/or semi-supervised methods [
        <xref ref-type="bibr" rid="ref29 ref3">3, 29</xref>
        ]. Other works fully
concentrate on synthetic test data, due to the reduced effort in ground truth
generation [
        <xref ref-type="bibr" rid="ref12 ref13 ref24 ref27 ref4 ref6">27, 13, 12, 24, 4, 6</xref>
        ].
      </p>
      <p>
        There is an ongoing discussion about real-world versus synthetic
test datasets, since it is not entirely clear if synthetic test data can
replace real-world test data. Naturally, on the one hand, testing
aims to reflect real-world behavior as closely as possible. On the
other hand, today’s demand for test data and especially for training
data often cannot (at least with manageable effort) be fulfilled by
real-world data sets [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Biedermann et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] present evaluations
of Advanced Driver Assistance Systems (ADAS) on their fully
synthetic, physically accurate rendered COnGRATS dataset and claim
that their synthetic data is not “simpler” than real-world data.
      </p>
      <p>
        Ros et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] generate synthetic data for semantic segmentation
of urban scenes that can be used to test and train systems. Their test
data design is based on the requirements of visual ADAS working in
an urban environment. It is especially tailored for training deep
convolutional neural networks, which need data that is sufficiently
diverse to learn many parameters. The authors argue that pixel-based
human-made annotations (e.g. ImageNet [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]) are still a driving
factor in system development, but are also simply too expensive for
more complex applications, such as those required for ADAS.
      </p>
      <p>
        Regarding the domain of aerial vehicles (AVs), Ribeiro and
Oliveira [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] show a system to test a roll attitude autopilot
system by controlling a flight simulator named X-Plane with
Matlab/Simulink, but they do not deal with domain or situational
coverage.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Vision Testing Requirements</title>
      <p>
        Kondermann [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] first analyzes how ground truth is currently
generated and further elaborates on the current shortcomings of ground
truth data design. He sees the aim of performance analysis in
understanding under which circumstances a given algorithm is suitable for
a given task, and finds requirements engineering to be the key for
better suited ground truth and test data. Finally, he presents a
requirements analysis for stereo ground truth.
      </p>
      <p>
        Regarding computer vision robustness, an extensive list of
situational requirements has been presented by Zendel et al. [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. They
perform a risk analysis method called Hazard and Operability Study
(HAZOP) and apply it to CV for the first time. This method is
employed to generate a checklist for test data and includes many
potentially performance-hampering vision situations (criticalities).
Therefore, the checklist constitutes a robustness coverage metric, which
can be used to validate test data.
      </p>
      <p>
        Model-based testing is a well established method to generate a
suite of test cases from requirements [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ]. It aims at automatically
generating input and expected results (ground truth) for a given
system or software under test, such that a certain test purpose can be
achieved. This purpose defines the capabilities or properties of the
system to be tested. In order to enable a detailed specification of what
has to be tested and to measure test progress, a test metric or coverage
criterion is introduced.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Performance Metrics</title>
      <p>
        In computer vision evaluations, a performance metric has to
be chosen depending on the application type (e.g. object
detection (OD), object tracking (OT), or event detection (ED)). Several
sets of performance measures were proposed for different
applications: VACE [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (OD and OT), CLEAR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (OD and OT), CLEAR
MOT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (OT), and the information theoretic measures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (OT) and
CREDS [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] (ED). These metrics consist of different scores
evaluating the performance with respect to the number of true positives, true
negatives, false positives (or false alarms), false negatives (or missed
detections), deviation errors or fragmentation depending on the target
to be evaluated. Examples for more exact pixel-based comparisons
besides the simple bounding box overlap measure include the Hoover
Method [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as well as the Multi Object Maximum Overlap
Matching (MOMOM) of Özdemir et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The Hoover Method [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
was originally proposed to calculate a performance measure for
correct detections, over-segmentation, under-segmentation, missed
detections and noise. MOMOM solves the basic problem of finding the
best way to assign target object instances in the GT to detection
instances found in the SUT’s output, by modelling this as an
optimization problem. The underlying assumption is that the best matching is
the one that globally maximizes the overlapping regions of GT and
SUT result. The results of MOMOM are a classification of
detections into correct, under, over, and missed detections (see Figure 2).
It also reports false alarms, i.e. SUT results that do not have
overlapping pixels with any GT object instance. After classification, the
actual performance measure based on the matches can be calculated.
Özdemir et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] use a performance measure that is sensitive to
object shape and boundary fragmentation errors.
      </p>
    </sec>
    <sec id="sec-5">
      <title>TEST DESIGN</title>
      <p>To test a vision system, we first specify its objective (the task to be
solved) and its domain (the world it has to operate in) as specifically as
possible. Potential sources of input for such a design phase are
standards, functional descriptions, requirements for the SUT, and known
issues. In this case study, a group of testers and developers agreed on
such definitions and decided on the type of ground truth, and
therefore on what constitutes a test case (see Figure 4). Then the group
defined a number of performance influencing questions which are
desired to be answered by the evaluation. The most relevant, and
therefore presented questions are:
How do different backgrounds influence the number of false
detections? (answered in Section 5.1)
How does the type of an observed approaching aerial object
influence the missed detections? (answered in Section 5.2)
How do detections depend on the approaching aerial flight path
and distance? (answered in Section 5.3)
One recurring key principle in deriving insights from evaluations
is to find equivalence classes (of objects and situations) that can
easily be analyzed. One possible data partition into equivalence classes
of situations is scenarios. In this case, they are based on flight
maneuvers (see Section 3.3).</p>
      <p>In the following subsections we show the exemplary definition
of the SUT and its objective (Section 3.1), the domain
it has to operate in (Section 3.2), and the maneuver-based scenarios
(Section 3.3).</p>
    </sec>
    <sec id="sec-6">
      <title>System under Test</title>
      <p>The SUT tested in this work is part of a collision
avoidance system that assists pilots in order to increase safety in public
airspace (usually referred to as “Sense and Avoid”, see Figure 1). The
role of this collision avoidance system is to identify non-cooperative
airborne objects (gliders, ultralights, etc.) which cannot be detected
by existing collision avoidance systems. Before designing the test
data, various general methodologies have been discussed on how to
tackle the problem: statistical models of sky region intensity can
reveal anomalous objects in sky regions; gradient-based methods can
hint at initial candidate flying objects; by determining
the ego-motion, objects that move in relation to the camera and
background can be revealed; finally, template-based trained aerial
object detectors are also a possible solution. The results of any of these
methods are regions within the image that can be tracked.</p>
      <p>
        This work concentrates on testing the detector step only, since
evaluations of tracking are more complicated (see e.g. Kristan et
al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]) and would exceed the scope of this short paper. The SUT
used for our evaluation is a prototype in development and represents one
possible way to solve the problem. The evaluation of this early
prototype is a good showcase for the testing approach. However, the
proposed strategies and methods for test data generation can be applied
to other detectors as well.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Domain</title>
      <p>A domain definition specifies what is allowed and what is excluded
within the closed world segment the algorithm has to operate in.
Vision systems of today often have to operate in rather uncontrolled
outdoor environments. A closed world description might not be able
to include the entire range of possible situations that can occur.
However, by explicitly defining the closed world, we at least exactly know
the limitations of the test data. Also no coverage estimation is
possible without defining the variables (their number) and their maximum
and minimum bounds.</p>
      <p>We define such a world by two main ingredients: (i) Rules of
existence and co-existence: which classes (e.g. which objects) exist, and
by what parameters they are described. (ii) Relation rules (between
object classes or instances) are usually represented as constraints on
their parameters.</p>
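      <p>To make these two ingredients concrete, the following minimal sketch (in Python; class names, parameter names, and numeric ranges are illustrative placeholders rather than the actual domain language used in this project) encodes existence rules as per-class parameter bounds and relation rules as predicates over a concrete scene.</p>
      <preformat>
# Minimal sketch of a domain model: existence rules as parameter bounds,
# relation rules as predicates over a concrete parameter assignment (a scene).
# All names and numeric ranges are illustrative only.

DOMAIN = {
    "classes": {
        # each class lists its parameters with (min, max) bounds
        "aerial_object": {"speed_kmh": (60.0, 900.0), "heading_deg": (0.0, 360.0)},
        "sut_plane": {"speed_kmh": (100.0, 200.0), "altitude_m": (200.0, 2000.0)},
    },
    "relations": [
        # e.g. planes may only fly in directions similar to their forward orientation
        lambda scene: all(abs(o["heading_deg"] - o["orientation_deg"]) &lt; 90.0
                          for o in scene["aerial_objects"]),
    ],
}

def scene_is_valid(scene):
    """Check a concrete scene against all relation rules of the domain."""
    return all(rule(scene) for rule in DOMAIN["relations"])

example_scene = {"aerial_objects": [{"heading_deg": 10.0, "orientation_deg": 15.0}]}
print(scene_is_valid(example_scene))  # True
</preformat>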
      <p>Part of this definition is that we define limits for parameters. In
this case study we derive the following limits from the requirements
of the SUT itself:</p>
      <p>Altitude range: 200 – 2,000 m, hence no takeoff and landing</p>
      <p>Flight speed: 100 – 200 km/h</p>
      <p>Good sight: daylight only, no flights during thunderstorms and
rain, no flights within clouds</p>
      <p>[Figure 3: (a) types of flying objects (e.g. Schleicher ASW 20, Robin DR400, Eurocopter EC135, goose); (b) sky types (empty, mono, gradient, clouds 1–3); (c) ground types (rural, village, city)]</p>
      <p>For the purpose of this work, a rather limited domain was defined. It
consists of all possible combinations of: 0 to 3 aerial objects (out of six
possible types, see Figure 3(a); the element depicted as a single goose is
only one representative of the used swarm of geese), a sky characterized
by one out of six sky types (see Figure 3(b)), and a ground of one out of
three types (see Figure 3(c)). Every type of aerial object is characterized
not only by its appearance, but also by the rules it is subject to (e.g. all
planes can only fly in directions similar to their forward orientation and
not backwards; extreme winds are excluded). They all have speed ranges
characteristic of the respective object. Parameters such as direction, the
actual speed of the flying objects, and the speed of the SUT-plane are
free, but bounded within known limits. Finally, in order to generate each
test case, the entities are all integrated into a scene by combining a
skydome (to represent the sky), a more or less flat ground model, and the
respective flying objects (see Figure 4).</p>
    </sec>
    <sec id="sec-8">
      <title>Scenarios</title>
      <p>[Figure 4: an example scene composed of a skydome, a ground model, and flying objects]</p>
      <p>A scenario is a definition of how the previously defined closed
world works in a specific situation. As such, a scenario can be
employed to explicitly restrict the scenes (which are generated by it) to
a certain subset of possible scenes.</p>
      <p>The scenarios considered in this case study are maneuver-based:
Head on: the AV is on a head-on collision course with the SUT-plane
Converging: the flight paths of the AV and the SUT-plane intersect
Take Over: the AV is overtaking the SUT-plane
Following: the SUT-plane follows the AV
Parallel: the AV and the SUT-plane have parallel flight paths
Null: no AVs are visible</p>
      <p>Within a scenario, the constraints on entities are constant for all test
cases. For example, in a takeover scenario the starting point of any
AV has to be behind the starting point of the SUT. The flight
direction is limited as well, so that the path of the AV crosses the field of
view of the SUT-plane. Coverage with respect to these situational
aspects is by definition guaranteed simply by defining numerous of
these scenarios, such as takeover, converging, or head-on, and
generating test data for them, provided there is at least one test case for
each scenario.</p>
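      <p>To illustrate how a scenario layers additional constraints on top of the domain rules, the following sketch encodes the take-over scenario described above as a predicate on the free parameters. The field names and the simple heading-based stand-in for the field-of-view test are hypothetical; the actual constraints are part of the omitted domain formalization.</p>
      <preformat>
# Sketch: a scenario as an extra constraint on the free parameters of a scene.
# Geometry and field names are simplified placeholders, not the project's model.

def is_takeover(scene):
    """Take Over: every AV starts behind the SUT-plane, is fast enough to pass it,
    and its flight path crosses the SUT-plane's (crudely approximated) field of view."""
    sut = scene["sut_plane"]
    for av in scene["aerial_objects"]:
        starts_behind = av["start_x"] &lt; sut["start_x"]
        can_pass = av["speed_kmh"] &gt; sut["speed_kmh"]
        crosses_fov = abs(av["heading_deg"] - sut["heading_deg"]) &lt; 45.0
        if not (starts_behind and can_pass and crosses_fov):
            return False
    return True

# Example: one candidate parameter assignment sampled from the domain
scene = {
    "sut_plane": {"start_x": 0.0, "speed_kmh": 150.0, "heading_deg": 0.0},
    "aerial_objects": [{"start_x": -500.0, "speed_kmh": 870.0, "heading_deg": 5.0}],
}
print(is_takeover(scene))  # True for this sample
</preformat>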
      <p>A scenario only limits the possibilities according to certain
constraints, but it still contains numerous variable parameters such as:
which AV is used, where it starts exactly, or the exact direction of
movement. All these free parameters span a high dimensional
parameter space. Many of those parameters are continuous; a test case
on the other hand is only a discrete sample of the parameter space.
      </p>
    </sec>
    <sec id="sec-9">
      <title>TEST DATA GENERATION</title>
      <p>
        Describing the entire procedure of test data generation from a
domain description and respective models in detail would exceed the
scope of this work. Some general methodologies are discussed in
the following (for more details see Zendel et al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]). The variables
of scene elements (defined in Section 3.2) span a parameter space.
Since some variables are parameters over a continuous range,
sampling is required. With the objective of optimal parameter coverage
in mind, a smart sampling is advantageous. Such sampling is
accomplished by employing low discrepancy (see e.g. Matousek et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ])
as a sampling method. The problem is stated as follows: With a given
number of sampling points (=single test cases), how is it possible to
minimize the volumes of untested regions in the multi-dimensional
parameter space while observing varying conditions for each of the
parameters (e.g. minimum/maximum limits and different data types)
with respect to the domain model rules? This is accomplished by
applying the following generation strategy:
1. The domain definition is modeled in description logic as a
Satisfiability Modulo Theory (SMT) model. With an SMT model it
is possible to check for satisfiability, find solutions, test solutions,
and evaluate if they fit the model (see e.g. Moura et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]).
2. Samples are taken in the parameter space according to low
discrepancy sampling.
3. The samples are evaluated for validity by testing them in
conjunction with the SMT model of the domain.
4. Steps 2 and 3 are repeated until a valid solution is found. If this
fails for a certain number of times, a solution is generated from
the SMT model. If even this fails, the domain model is invalid and
not satisfiable under the requested circumstances.
5. The test case is created with the chosen selection of parameter
values. It represents the initial setup of a scene as well as all
variables needed to simulate its progression within the time frame.
A dedicated rendering and post-processing pipeline generates
images and GT labelings for specified moments in time (e.g. a
sequence of 100 frames at 20 fps). An example is given in Figure 4.
      </p>
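      <p>The sampling part of this strategy (steps 2 to 4) can be summarized in a short sketch. It is a simplification: a hand-rolled Halton sequence provides the low-discrepancy samples, and a plain Python predicate stands in for the SMT model; the actual pipeline checks candidates against the SMT formulation and, after repeated failures, asks the solver itself for a solution.</p>
      <preformat>
# Sketch of steps 2-4: low-discrepancy sampling of the parameter space and
# rejection of samples that violate the domain model. The validity predicate
# is a stand-in for the SMT model used in the actual pipeline.

def halton(index, base):
    """One coordinate of a Halton low-discrepancy sequence."""
    result, f = 0.0, 1.0
    while index &gt; 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

# parameter space: name -&gt; (min, max); values are illustrative only
BOUNDS = {"av_speed_kmh": (60.0, 900.0), "av_heading_deg": (0.0, 360.0),
          "sut_speed_kmh": (100.0, 200.0)}
PRIMES = [2, 3, 5]  # one prime base per parameter dimension

def sample(index):
    """Map the index-th low-discrepancy point into the bounded parameter space."""
    return {name: lo + halton(index, p) * (hi - lo)
            for (name, (lo, hi)), p in zip(BOUNDS.items(), PRIMES)}

def is_valid(params):
    """Stand-in for checking a sample against the SMT domain model."""
    return params["av_speed_kmh"] &gt; params["sut_speed_kmh"]

def next_test_case(start_index, max_tries=1000):
    for i in range(start_index, start_index + max_tries):
        candidate = sample(i)
        if is_valid(candidate):          # step 3: test against the domain model
            return candidate             # step 5: becomes the initial scene setup
    raise RuntimeError("no valid sample found; fall back to the SMT solver")

print(next_test_case(1))
</preformat>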
    </sec>
    <sec id="sec-9-1">
      <title>TEST RESULTS</title>
        <p>
          [Figure 5: false alarms per ground-sky combination (a) after 5 initialization frames, (b) after 5 frames with atmospheric effects]
In order to compare the SUT output and the ground truth we need an
appropriate metric. The choice of metric must be based on the task
and the corresponding requirements. We chose an adaptation of Multi
Object Maximum Overlap Matching (MOMOM) of Özdemir et
al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] tailored to the needs of aerial object detection applications.
Not all of the detections’ instance classes distinguished by MOMOM
(correct-, missed-, over-, under-detections and false alarms; see
Figure 2) are relevant for sense and avoid. Over-detections are not
considered a problem, only missed detections and false alarms are
erroneous outputs. Also, in comparison to the original MOMOM, the
actual shape of the detected objects bears no relevance, but position
and size are important.
        </p>
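        <p>The adapted matching can be illustrated with a small sketch. It is our own simplified stand-in, not the original MOMOM implementation: boxes replace pixel regions and a greedy pass approximates the global maximum-overlap assignment, but it reports exactly the two error classes relevant here, missed detections and false alarms, while over-detections are not counted as errors.</p>
        <preformat>
# Simplified stand-in for the adapted matching: boxes are (x, y, w, h) tuples,
# overlap is bounding-box intersection area, and matching is greedy rather than
# the global maximum-overlap optimization of the original MOMOM.

def overlap(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)

def evaluate_frame(gt_boxes, detections):
    """Return the two error counts relevant for sense and avoid."""
    matched_gt = set()
    false_alarms = 0
    for det in detections:
        overlaps = [(overlap(det, gt), i) for i, gt in enumerate(gt_boxes)]
        best_area, best_i = max(overlaps, default=(0, None))
        if best_area &gt; 0:
            matched_gt.add(best_i)   # over-detections are not counted as errors
        else:
            false_alarms += 1        # detection overlapping no GT object
    missed = len(gt_boxes) - len(matched_gt)
    return {"missed_detections": missed, "false_alarms": false_alarms}

# Example: one GT aircraft, one correct detection and one spurious one
print(evaluate_frame([(100, 80, 20, 12)], [(102, 82, 18, 10), (400, 300, 8, 8)]))
</preformat>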
        <p>In the following, evaluations and corresponding datasets for this
case study are presented: the background evaluation (Section 5.1),
the foreground / target object evaluation (Section 5.2), and finally an
example for parametric analysis (Section 5.3).
      </p>
      </sec>
    <sec id="sec-10">
      <title>Background Dependency Evaluation</title>
      <p>To evaluate how the background influences the number of false
alarms, we designed a test data set that does not show any aerial
objects (Null test) and varies ground and sky types (see Figure 3(b)
and Figure 3(c)). For each ground-sky combination, 5 different
flight paths, each consisting of 40 consecutive frames from a
moving virtual camera at 10 frames per second, are generated. For an
example see Figure 4.</p>
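      <p>The structure of this Null-test set can be made explicit with a short enumeration sketch (the type names follow Figure 3; the actual camera paths are produced by the generation pipeline of Section 4 and are only identified by an index here).</p>
      <preformat>
from itertools import product

# Null-test set: no aerial objects, every ground-sky combination, 5 flight
# paths of 40 frames each at 10 fps. Type names follow Figure 3.
GROUND_TYPES = ["rural", "village", "city"]
SKY_TYPES = ["empty", "mono", "gradient", "clouds1", "clouds2", "clouds3"]
PATHS_PER_COMBINATION = 5
FRAMES_PER_SEQUENCE = 40

sequences = [(ground, sky, path_id)
             for ground, sky in product(GROUND_TYPES, SKY_TYPES)
             for path_id in range(PATHS_PER_COMBINATION)]

print(len(sequences))                        # 90 sequences
print(len(sequences) * FRAMES_PER_SEQUENCE)  # 3600 frames in total
</preformat>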
      <p>In a first preliminary experiment, the initialization time is
determined by analyzing the false alarm dependency on the frame number.
It reveals that the number of false alarms significantly drops after four
frames. Thus, we decided not to evaluate any frames with an index
lower than five in the following evaluations to avoid overemphasizing
the startup phase.</p>
      <p>In a second evaluation (for the same test data), we analyze the
number of false alarms per ground-sky combination, without
considering initialization artifacts (see Figure 5(a)). The following
observations are made:
(O1) In general, the background model of the SUT is capable of
modeling backgrounds with the tested variance.
(O2) False alarms occur mostly in scenarios where the ground
model contains a city scene.
(O3) The number of false alarms is reduced if simulated
atmospheric effects are present (see Figure 5(b) versus Figure 5(a)).
(O4) The remaining false alarms are located at hard edges of
mountains in the terrain.</p>
      <p>The question arising from Observation (O2) is: why does the city
lead to significantly more false alarms than the other terrain
elements? There are two possible reasons: (i) the SUT’s background
model is sensitive to elements of unexpected high frequencies or
significant change in appearance in the background. (ii) The SUT
benefits from two effects, which in real-world data reduce the impact of
the ground on the captured image: Firstly, ground is blurred due to
limited depth of field. And secondly, light-scattering particles in the
atmosphere reduce image saturation. The first reason (i) would mean
that this is a shortcoming of the SUT, while the second reason (ii)
means the test data is in that respect “harder” than real-world data.</p>
      <p>However, to get a less biased test result, the remaining dataset was
computed with a simulated atmospheric effect (results are shown in
Figure 5(b)).</p>
      <p>From the perspective of hazard analysis, Observation (O4) exhibits
no problematic behavior, but a positive side effect. Mountains and
their hard edges are definitely dangerous objects for an airplane.
      </p>
    </sec>
    <sec id="sec-11">
      <title>Flying Object Dependency Evaluation</title>
      <p>The influence of the flying object’s shape and characteristics is
analyzed by investigating their effect on the number of missed
detections. The therefore synthesized test set is in general comparable to
the previous one, but has a single object within the field of view
(FOV) of the camera in at least one frame of each test sequence. In
each sequence, the object is entering the FOV and leaving the FOV
over time. The variables are the starting and end position of the
objects and their speed. Due to the nature of this test set (objects inside
and outside of the FOV), it must be filtered before analysis: (a) the
frame number must be greater or equal to five (see previous section),
(b) the flying object must cover a minimal amount of pixels. The
corresponding requirements to pass the filter are: a minimal number
of pixels of 152 = 225 which corresponds to 0:062% of the
image (image size 752x480), and a frame number larger than 5. Figure
6(a) shows the remaining number of frames after filtering for each
object-scenario combination.</p>
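      <p>The filtering step amounts to a simple per-frame predicate; the following sketch uses the thresholds given above (the field names are hypothetical).</p>
      <preformat>
# Frame filter for the flying-object evaluation: skip the initialization phase
# and frames in which the object covers too few pixels.
IMAGE_WIDTH, IMAGE_HEIGHT = 752, 480
MIN_OBJECT_PIXELS = 15 * 15            # 225 pixels, about 0.062% of the image
MIN_FRAME_INDEX = 5                    # ignore the initialization frames

def keep_frame(frame_index, object_pixel_count):
    """True if the frame is used for the missed-detection statistics."""
    return frame_index &gt;= MIN_FRAME_INDEX and object_pixel_count &gt;= MIN_OBJECT_PIXELS

print(round(MIN_OBJECT_PIXELS / (IMAGE_WIDTH * IMAGE_HEIGHT) * 100, 3))  # 0.062
print(keep_frame(frame_index=7, object_pixel_count=300))                  # True
</preformat>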
      <p>The zeros in the takeover scenario in Figure 6(a) show that not
all scenarios are possible with all flying objects. They have
different speeds, which approximate their real-world cruising speeds. The
A340 plane is the only one faster than the SUT, hence the only one
that can pass the SUT in a takeover scenario. Furthermore, in order
to analyze the effect of object type on missed detections, we analyze
the relative number of frames with at least one missed detection (according
to the MOMOM matching optimization results; see
Figure 6(b)). The following observations can be made:
(O5) In general, slower objects are less likely to be detected than
faster ones.
(O6) Thin structures can be missed: the object “goose”, which
actually consists of several geese flying in a swarm, is not detected
at all. The individual geese are usually very small. The ASW20
poses a similar problem: it has a thin wing profile and therefore
its silhouette has thin parts.
(O7) The big and slow balloon confuses the system. It is the slowest
of all these objects.
(O8) The SUT performs much better in the converging scenario than in
the other scenarios.</p>
      <p>[Figure 6: (a) coverage: number of frames per object-scenario combination; (b) relative number of frames with missed detections]
Our initial hypothesis defining parallel flights as problematic did
not prove to be correct. There is a minor reduction in performance,
but it is not significant in comparison to the other scenarios.
Regarding Observation (O8), the converging scenario is the least restricted
and, at least within our definition of the domain, the most
probable one (see coverage in Figure 6(a)). It only requires the flight
paths to intersect within the SUT’s field of view.</p>
    </sec>
    <sec id="sec-12">
      <title>Parametric Analysis</title>
      <p>Along with coverage of situations and entities (as described in the
previous sections), our tool chain also supports parametric analysis.
Since the head-on situation is the most critical, it was decided to do a
parametric evaluation on the flight path for such a scenario. Test data
was generated containing a single plane of the type Airbus A340. It
was set up to fly towards the SUT with constant speed from various
directions (determined by low discrepancy sampling). In Figure 7,
the dependency of missed detections in relation to the flight path is
visualized. Obviously, the missed detections increase with distance,
as can be seen in both the XZ-plane and XY-plane projections. One
cannot draw reliable conclusions from the YZ-plane, but the XY-plane
clearly shows decreased detection rates close to the Y=0 line.
(O9) Planes heading directly towards the SUT-AV pose an
additional difficulty for the SUT.</p>
    </sec>
    <sec id="sec-13">
      <title>VALIDATION OF THE APPROACH</title>
      <p>In the previous section, we established a list of 9 observations about
the system’s behavior in the defined domain. The question remains
whether those results are only valid for the synthetic test cases or
whether they generalize to real-world situations. In the following, we show
two methods of evaluating the testing approach for its
applicability to the SUT at hand: (i) find similar situations in real-world data
recorded from an actual aerial vehicle and compare the results, (ii)
find details in the SUT’s algorithm that underpin the validity of
the observations.</p>
      <p>Regarding the general observation judging the background as
sufficient (Observation (O1)): the available real-world data contains
various cloud formations that did not influence the performance
significantly.</p>
      <p>Regarding Observations (O2) and (O4), where mountains and cities
have been detected as flying objects: unfortunately, the available
captured test data does not contain any sequences with mountains.</p>
      <p>However, it contains houses (see Figure 8(b); the original image cannot
be published for legal reasons, therefore Figure 8(b) shows an exemplary
similar scene), which pose a similar problem in low-height flights. The
SUT reacted with false positives for those scenes.</p>
      <p>[Figure 8: (a) hovering drone example; (b) house example]</p>
        <p>Observing that low speed AVs are a potential problem
(Observation (O5)), a real-world situation was sought out in captured video
material. The phenomenon could be found in a video segment
showing a hovering drone (see Figure 8(a)). This video material was not
annotated since the segment was not considered critical before the
synthetic test run. It would have been difficult to find the segment of
some hundred frames in a couple of gigabytes of data.</p>
        <p>The problem with thin and/or disconnected silhouettes
(Observation (O6)) can be explained by the internals of the SUT: in order to
make it more robust, the SUT has a filtering step for small noise,
which ignores such small elements. Therefore the goose swarm,
which is not connected, and where each individual goose is too small
to lead to detection on its own, is considered to be noise. This is
simply a camera-resolution-dependent effect, hence the testers
recommended that the system developers increase the camera resolution.</p>
        <p>Observation (O7), that huge slow objects like the balloon can confuse
the system, can be explained by the general approach of the SUT of
treating the task as foreground-background segmentation. If an object
covers the majority of the pixels of the camera image, it is assumed
to be background. This causes a missed detection as well as a number
of false alarms in the background.</p>
        <p>Observation (O9), where one plane flies toward the SUT on a
head-on collision course, could not be found in the real-world data, but it
can be explained similarly to the previous Observation (O6).</p>
        <p>Observation (O3), concerning the atmospheric effects, was already
discussed in the previous section. Observation (O8), establishing that
the system performs best in the converging scenario, could not be
verified by either of the two means.</p>
        <p>Whenever we test with synthetic test data we initially have to make
sure that the data is “sufficiently realistic”, meaning that the behavior
of the algorithm with synthetic data is similar to its real-world behavior.
We propose to compare results for synthetic and real-world data in
an initial phase and adapt the synthetic data until similar behavior is
reached (see Figure 5(b)).</p>
        <p>We also have to ensure that we measure performance fairly and
according to the test objectives (e.g. do not measure initialization
phases).</p>
        <p>What became clear during the evaluation of the approach is that it
is time-consuming, often infeasible, and sometimes even impossible
(without risk to human life) to find specific situations in real-world
data. Even if the data is available, it can mean that one has to sift
through gigabytes of mostly unannotated video material.</p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION &amp; OUTLOOK</title>
      <p>During the course of this work we presented a case study on how to
design test data based on SUT requirements and a definition of the
environment the tested system has to operate in. One key element was
to identify several performance-influencing questions that allow for
deep insights into the shortcomings of the SUT. For each of these
questions test data was created, evaluations have been performed, and
observations on the tested system’s behavior have been made. Finally,
the validation of this testing approach could verify 8 of 9
observations with either real-world examples that cause the same erroneous
behavior, or explanations based on the SUT’s internals. Therefore
this test case generation procedure, combined with an appropriate
evaluation method, leads to interpretable and valid results. During
the course of this work the completeness of the generated test data
could not be determined. The question remains: how many erroneous
behaviors have not been found by this synthetic-test-data-based
evaluation?</p>
      <p>In our opinion, carefully planned test data (that includes functional
and domain aspects) is vital to any complete assessment of a CV
application, and synthetic test data was the most feasible solution for
the application at hand.</p>
    </sec>
    <sec id="sec-15">
      <title>ACKNOWLEDGEMENTS</title>
      <p>Special thanks go to our recently retired project leader Wolfgang
Herzner. Financial support was provided by the Austrian Ministry
for Transport, Innovation and Technology (BMVIT) in the scope of
the TAKE OFF program within the DVKUP and RPA-AI projects.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Angelov</surname>
          </string-name>
          , Sense and Avoid in UAS: Research and Applications, John Wiley &amp; Sons,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bernardin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Stiefelhagen</surname>
          </string-name>
          , '
          <article-title>Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics'</article-title>
          ,
          <source>EURASIP Journal on Image and Video Processing</source>
          , (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Biaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Despiegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Herold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Beiler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gentric</surname>
          </string-name>
          , '
          <article-title>Semisupervised Evaluation of Face Recognition in Videos'</article-title>
          ,
          <source>in Proceedings of the International Workshop on Video and Image Ground Truth in Computer Vision Applications</source>
          , VIGTA '
          <fpage>13</fpage>
          , New York, NY, USA, (
          <year>2013</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Biedermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ochs</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Mester</surname>
          </string-name>
          , '
          <article-title>Evaluating visual ADAS components on the COnGRATS dataset'</article-title>
          ,
          <source>in 2016 IEEE Intelligent Vehicles Symposium (IV)</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>B.</given-names><surname>Boehm</surname></string-name>
          ,
          <string-name><given-names>H. D.</given-names><surname>Rombach</surname></string-name>
          , and
          <string-name><given-names>M. V.</given-names><surname>Zelkowitz</surname></string-name>
          ,
          <source>Foundations of Empirical Software Engineering: The Legacy of Victor R. Basili</source>
          , Springer Science &amp; Business Media,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>D. J.</given-names><surname>Butler</surname></string-name>
          ,
          <string-name><given-names>J.</given-names><surname>Wulff</surname></string-name>
          ,
          <string-name><given-names>G. B.</given-names><surname>Stanley</surname></string-name>
          , and
          <string-name><given-names>M. J.</given-names><surname>Black</surname></string-name>
          , '
          <article-title>A naturalistic open source movie for optical flow evaluation'</article-title>
          ,
          <source>in Computer Vision - ECCV 2012</source>
          , Springer, (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Karunanithi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lott</surname>
          </string-name>
          , G. Patton, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Horowitz</surname>
          </string-name>
          , '
          <article-title>Model-based testing in practice'</article-title>
          . ACM, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. Di</given-names>
            <surname>Salvo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Giordano</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Kavasidis</surname>
          </string-name>
          , '
          <article-title>A Crowdsourcing Approach to Support Video Annotation'</article-title>
          ,
          <source>in Proceedings of the International Workshop on Video and Image Ground Truth in Computer Vision Applications</source>
          , VIGTA '
          <fpage>13</fpage>
          , New York, NY, USA, (
          <year>2013</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>A. C.</given-names><surname>Dias Neto</surname></string-name>
          ,
          <string-name><given-names>R.</given-names><surname>Subramanyan</surname></string-name>
          ,
          <string-name><given-names>M.</given-names><surname>Vieira</surname></string-name>
          , and
          <string-name><given-names>G. H.</given-names><surname>Travassos</surname></string-name>
          ,
          <article-title>'A Survey on Model-based Testing Approaches: A Systematic Review'</article-title>
          ,
          <source>in Proceedings of the 1st ACM International Workshop on Empirical Assessment of Software Engineering Languages and Technologies: Held in Conjunction with the 22Nd IEEE/ACM International Conference on Automated Software Engineering (ASE)</source>
          <year>2007</year>
          , WEASELTech '
          <fpage>07</fpage>
          , New York, NY, USA, (
          <year>2007</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Donath</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondermann</surname>
          </string-name>
          , '
          <article-title>How Good is Crowdsourcing for Optical Flow Ground Truth Generation?'</article-title>
          ,
          <string-name>
            <surname>submitted to</surname>
            <given-names>CVPR</given-names>
          </string-name>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Edward</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Matthew</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          , '
          <article-title>An information theoretic approach for tracker performance evaluation'</article-title>
          ,
          <source>in 2009 IEEE 12th International Conference on Computer Vision</source>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Haeusler</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondermann</surname>
          </string-name>
          , 'Synthesizing Real World Stereo Challenges', in Pattern Recognition, eds.,
          <string-name>
            <surname>Joachim</surname>
            <given-names>Weickert</given-names>
          </string-name>
          ,
          <source>Matthias Hein, and Bernt Schiele, Lecture Notes in Computer Science</source>
          , Springer Berlin Heidelberg, (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Haltakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ilic</surname>
          </string-name>
          , '
          <article-title>Framework for Generation of Synthetic Ground Truth Data for Driver Assistance Applications'</article-title>
          , in Pattern Recognition, eds.,
          <string-name>
            <surname>Joachim</surname>
            <given-names>Weickert</given-names>
          </string-name>
          ,
          <source>Matthias Hein, and Bernt Schiele, Lecture Notes in Computer Science</source>
          , Springer Berlin Heidelberg, (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>A.</given-names><surname>Hoover</surname></string-name>
          ,
          <string-name><given-names>G.</given-names><surname>Jean-Baptiste</surname></string-name>
          ,
          <string-name><given-names>X.</given-names><surname>Jiang</surname></string-name>
          ,
          <string-name><given-names>P. J.</given-names><surname>Flynn</surname></string-name>
          ,
          <string-name><given-names>H.</given-names><surname>Bunke</surname></string-name>
          ,
          <string-name><given-names>D. B.</given-names><surname>Goldgof</surname></string-name>
          ,
          <string-name><given-names>K.</given-names><surname>Bowyer</surname></string-name>
          ,
          <string-name><given-names>D. W.</given-names><surname>Eggert</surname></string-name>
          ,
          <string-name><given-names>A.</given-names><surname>Fitzgibbon</surname></string-name>
          , and
          <string-name><given-names>R. B.</given-names><surname>Fisher</surname></string-name>
          , '
          <article-title>An experimental comparison of range image segmentation algorithms'</article-title>
          ,
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on, (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kasturi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goldgof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Soundararajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Manohar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bowers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Boonstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Korzhova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , '
          <article-title>Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol'</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Á.</given-names>
            <surname>Kiss</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Szirányi</surname>
          </string-name>
          , '
          <article-title>Evaluation of Manually Created Ground Truth for Multi-view People Localization'</article-title>
          ,
          <source>in Proceedings of the International Workshop on Video and Image Ground Truth in Computer Vision Applications</source>
          , VIGTA '
          <fpage>13</fpage>
          , New York, NY, USA, (
          <year>2013</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondermann</surname>
          </string-name>
          , '
          <article-title>Ground Truth Design Principles: An Overview'</article-title>
          ,
          <source>in Proceedings of the International Workshop on Video and Image Ground Truth in Computer Vision Applications</source>
          , VIGTA '
          <fpage>13</fpage>
          , New York, NY, USA, (
          <year>2013</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kristan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Leonardis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vojir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pflugfelder</surname>
          </string-name>
          , G. Fernandez,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nebehay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porikli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Cehovin</surname>
          </string-name>
          , '
          <article-title>A Novel Performance Evaluation Methodology for Single-Target Trackers'</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          , '
          <article-title>Microsoft COCO: Common Objects in Context'</article-title>
          ,
          <source>in Computer Vision - ECCV 2014</source>
          , eds., David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, Lecture Notes in Computer Science, Springer International Publishing, (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Maier-Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mersmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. G.</given-names>
            <surname>Kenngott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Preukschas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wekerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Helfert</surname>
          </string-name>
          , and others, '
          <article-title>Crowdsourcing for reference correspondence generation in endoscopic images'</article-title>
          ,
          <source>in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2014</source>
          , Springer, (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Matousek</surname>
          </string-name>
          ,
          <article-title>Geometric discrepancy: An illustrated guide</article-title>
          , Springer, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Meister</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondermann</surname>
          </string-name>
          , '
          <article-title>Real versus realistically rendered scenes for optical flow evaluation'</article-title>
          ,
          <source>in 2011 14th ITG Conference on Electronic Media Technology (CEMT)</source>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moura</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Bjørner</surname>
          </string-name>
          , '
          <article-title>Z3: An Efficient SMT Solver'</article-title>
          ,
          <source>in Tools and Algorithms for the Construction and Analysis of Systems</source>
          , eds.,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          and Jakob Rehof, Lecture Notes in Computer Science, Springer Berlin Heidelberg, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Onkarappa</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Sappa</surname>
          </string-name>
          , '
          <article-title>Synthetic sequences and ground-truth flow field generation for algorithm validation'</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Özdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aksoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eckert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pesaresi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ehrlich</surname>
          </string-name>
          , '
          <article-title>Performance measures for object detection evaluation'</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. M. F.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          , '
          <article-title>UAV autopilot controllers test platform using Matlab/Simulink and X-Plane'</article-title>
          ,
          <source>in 2010 IEEE Frontiers in Education Conference (FIE)</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sellart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Materzynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vazquez</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          , '
          <article-title>The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes'</article-title>
          ,
          <source>in CVPR</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>'ImageNet Large Scale Visual Recognition Challenge'</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          , '
          <article-title>Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora'</article-title>
          ,
          <source>in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Herzner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Murschitz</surname>
          </string-name>
          , '
          <article-title>VITRO - Model based vision testing for robustness'</article-title>
          ,
          <source>in 2013 44th International Symposium on Robotics (ISR)</source>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Murschitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Humenberger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Herzner</surname>
          </string-name>
          , '
          <article-title>CVHAZOP: Introducing Test Data Validation for Computer Vision'</article-title>
          ,
          <source>in Proceedings of the IEEE International Conference on Computer Vision</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ziliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Velastin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marcenaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kelliher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bruneaut</surname>
          </string-name>
          , '
          <article-title>Performance evaluation of event detection solutions: the CREDS experience'</article-title>
          ,
          <source>in IEEE Conference on Advanced Video and Signal Based Surveillance</source>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>