        Tenth International Workshop Modelling and Reasoning in Context (MRC) – 13.07.2018 – Stockholm, Sweden




     Attribute Dissection of Urban Road Scenes for Efficient Dataset Integration

                                             Jiman Kim, Chanjong Park
                                                  Samsung Research,
                                                 Samsung Electronics
                                       {jiman14.kim, cj710.park}@samsung.com


                          Abstract

Semantic scene segmentation, or scene parsing, is very useful for high-level scene recognition. To improve the performance of scene segmentation, the quantity and quality of the datasets used for a deep network's learning are important. In other words, we need to consider various external environments and various variations of the predefined objects in terms of image characteristics. In recent years, many datasets for semantic scene segmentation focused on autonomous driving have been released. However, since only quantitative analysis of each dataset is provided, it is difficult to establish an efficient learning strategy that considers the image characteristics of objects. We present definitions of three frame attributes and five object attributes, and analyze their statistical distributions to provide qualitative information about datasets that are to be merged. We also propose an integrated dataset configuration that can exploit the advantages of each dataset for deep-network learning after class matching. As a result, we can build new integrated datasets that are optimized for the scene complexity and object properties of the environment by considering the statistical characteristics of each dataset.

1   Introduction

Scene understanding requires information such as the objects that are present in the scene, their characteristics, and the relationships among them. Semantic segmentation provides information on the location and type of objects by dividing an image into regions that include predefined objects. To apply scene-segmentation functions to autonomous vehicles, many road scene-centered datasets have been released. Representative datasets labeled at the pixel level are CamVid [Brostow et al., 2008; 2009], Cityscapes [Cordts et al., 2015; 2016], SYNTHIA [Ros et al., 2016], GTA-V [Richter et al., 2016], and Mapillary [Neuhold et al., 2017]. The papers that explain each dataset provide statistical information on the image-data collection environment, area, and device, the amount of images, and the relative proportions of objects. Papers [Perazzi et al., 2016] that compare the scene-parsing accuracy of several state-of-the-art algorithms focus on the algorithms' advantages and disadvantages, rather than on the characteristics of the image data. However, the datasets define different categories and exhibit different image characteristics of the instances in them, so efficient learning of a deep network requires detailed analysis of various attributes of the data. For example, some datasets may contain many small objects, some may contain densely distributed objects, and others may show deformed objects. By using these image characteristics, a network can be developed that has excellent specialization for a specific environment and object, or that has excellent generality in a general environment by combining characteristics.

   In this paper, we analyze the CamVid, Cityscapes, SYNTHIA, GTA-V, and Mapillary datasets quantitatively, based on two criteria. First, we analyze image-centric criteria such as the average number of categories in the image, the average number of objects, and the average proportion of the image that is a road region. Second, we analyze object-centric criteria, such as the average spatial density of objects in the image, their average size, average shape, average color, and average position in the image. This analysis provides good insight into ways to train deep networks. We also propose a new set of integrated classes that can be used commonly among datasets, and a method to construct an integrated dataset. The integrated dataset contributes to improving the generality of a deep network by including various road environments and object characteristics. This paper has the following structure. Section 2 introduces papers related to published datasets. Section 3 summarizes each dataset and proposes image- and object-centric attributes. Section 4 proposes a new integrated dataset created by performing class alliance. Section 5 provides a detailed comparative analysis of the proposed attributes, and suggests insights for constructing integrated datasets. Section 6 summarizes all findings and contributions of this paper.

2   Related Work

Road scene-centric public datasets for pixel-level semantic segmentation have been released (Table 1, Fig. 1) with papers that explain them. CamVid [Brostow et al., 2008; 2009] was the first dataset that had semantic labels of object classes for each pixel.
The images were acquired from the perspective of a driving automobile; they are divided into 32 semantic classes with manually-annotated labels. To reduce the effort of the person who must label objects, the authors proposed joint tracking of keypoints and regions; this method propagates the label information to the 100 subsequent frames. The set includes the camera's 3D pose in each frame and has a software tool that users can use to label their additional images. The publicly available datasets [Martin et al., 2001; Fei-Fei et al., 2006; Bileschi, ; Shotton et al., 2006; Smeaton et al., 2006; Griffin et al., ; Yao et al., 2007; Russell et al., ] released before CamVid have polygon-level labels, not pixel-level labels, and they were obtained from fixed CCTV-style cameras. The paper provides statistical information on the percentage of the objects in the image for each sequence and the number of occurrences.

   Cityscapes [Cordts et al., 2015; 2016] is a large-scale dataset that includes complex real-world urban scenes. Cityscapes has image data that are labelled at the pixel level and instance level. The images were acquired from 50 cities to include a variety of road environments. The authors provided the results of statistical analysis between datasets by grouping 30 classes into eight categories. The results describe the number and relative ratios of annotated pixels of each class, the annotation density, the distribution of the number of traffic-related instances per image, and the distribution of the number of vehicles according to distance.

   SYNTHIA [Ros et al., 2016] is a dataset of synthetic images obtained from a virtual world (the Unity development platform) [Technologies, ]. The images were captured from multiple viewpoints by using two multi-cameras, each with four monocular cameras, mounted on a virtual car. The images include different seasons, weather, and illumination conditions. The captured images were annotated with 11 predefined classes. In experiments, the authors showed that combining a real dataset with the SYNTHIA dataset dramatically increases the accuracy of semantic segmentation.

   Grand Theft Auto V (GTA-V) [Richter et al., 2016] consists of images captured from a computer game. The authors proposed a method to quickly generate semantic label maps. Each image is automatically divided into patches, which are then merged using MTS (mesh, texture, shader) information. For each patch, a semantic class is manually assigned. Within a brief space of time, these methods yield far more pixel labeling than previous datasets. When virtual images generated by the proposed method were added to real-world images, segmentation accuracy was greatly improved even though a large number of real-world images were replaced by virtual images. The related paper provided statistical information on the number of labeled pixels, the annotation density, and the time and speed of labeling.

   Mapillary [Neuhold et al., 2017] is the dataset that contains the most real-world images, and the largest number (66) of categories to consider. The images were captured by photographers of different experience levels on various imaging devices. The considered regions are Europe, North and South America, Asia, Africa, and Oceania, and the scenes include urban, countryside, and off-road scenes. Manual annotation was performed using polygons by specialized image annotators. Statistical analyses performed by the authors include image resolution, focal length, number of images taken with the devices used for image acquisition, region where the image was acquired, number of instances per class, number of objects per image, number of traffic-regulation objects per image, and number of traffic participants per image.

   These papers mainly analyzed how often each class appeared in each image. They also focused on the number and proportions of major classes that are closely related to traffic. If detailed and organized information on the frame and object side could be obtained, it would improve the learning efficiency of deep networks. Therefore, in this work, we perform a detailed characterization of each dataset to derive insight. Also, to enable simultaneous use of two or more datasets with different class numbers and types, we define a commonly usable set of classes and propose a way to efficiently combine datasets.

3   Attribute Analysis

Scene segmentation or scene parsing at the pixel level, which extracts the boundaries of many kinds of objects, solves object detection and localization simultaneously. To achieve high accuracy of pixel-level segmentation, large-scale datasets are required; they must include a variety of shape and appearance variations of static objects (background) and dynamic objects (foreground). Therefore, construction of datasets that focus on road scenes has led to increased resolution and number of images, and to an increased variety of environments.

Trends of Datasets. The datasets constructed and released for this same goal are described in Table 1. Higher resolution and a larger amount of images are two common trends in constructing road scene-centric datasets. The increase in the resolution of the collected images is closely related to pixel-level accuracy. In addition, virtual-environment tools have been used to collect a large number of images in a short time. In particular, the volume of real images in the Mapillary dataset was increased sharply by a community-led service for sharing street-level photographs. Diversification of the environments that the images represent has yielded datasets from different regions and environments, and recently-constructed datasets include increasing diversity of regions and of environmental conditions. Real images are much more difficult to obtain than virtual images, and the continental, regional, and environmental conditions in which the images are acquired have become very diverse in the Cityscapes and Mapillary datasets. Each dataset has different properties (Table 1). The CamVid dataset was the first dataset that focused on road scenes; it contains many lane-clear highway images. The Cityscapes dataset includes images that are specific to European urban scenes. The SYNTHIA dataset has many virtual images with multiple seasons. The GTA-V dataset's virtual images are extremely realistic, and its effects are richly controllable. The Mapillary dataset contains the largest number of images, collected in the broadest variety of regions.
  Name                  Year        Class   Resolution       Image (Training/Validation/Test)                 Description
                                                             RGB                     GT
  CamVid                2008, 2009  32      960 × 720        701                     701                      Real Image,
                                            Unified Size     (367/101/233)           (367/101/233)            Normal Light/Weather
  Cityscapes (fine)     2015, 2016  30      2,048 × 1,024    5,000                   3,475                    Real Image (50 Cities),
                                            Unified Size     (2,975/500/1,525)       (2,975/500/-)            Normal Light/Weather
  SYNTHIA (cityscapes)  2016        23      1,280 × 760      9,400                   9,400                    Virtual Image,
                                            Unified Size     (9,400)                 (9,400)                  Dynamic Light/Weather
  GTA-V                 2016        34      1,914 × 1,052    24,966                  24,966                   Virtual Image,
                                            Unified Size     (24,966)                (24,966)                 Dynamic Light/Weather
  Mapillary             2017        66      3,420 × 2,480    25,000                  20,000                   Real Image (6 Continents),
                                            Averaged Size    (18,000/2,000/5,000)    (18,000/2,000/-)         Dynamic Light/Weather
  Integration           2018        30      2,048 × 1,024    65,067                  58,542                   Real/Virtual Image,
                                            Unified Size     (41,961/6,038/17,068)   (41,961/6,038/10,543)    Dynamic Light/Weather

Table 1: Quantitative summary of various datasets for semantic scene segmentation. The number of classes of each dataset includes the 'void' class. GTA-V and Mapillary contain images of different sizes. The images of SYNTHIA and GTA-V are not divided into training, validation, and test sets. Recently released datasets include more classes and higher-resolution images. Also, many virtual tools that simulate road-scene environments have been released to increase the number of virtual images under various conditions, because collecting real images has a high cost. The integrated dataset is based on the Cityscapes dataset (Table 3) but has many more images.




Figure 1: Example images of the five datasets. First row: randomly-selected original images (RGB) from each dataset; second row: ground-truth images that correspond to each original image. The CamVid and Mapillary datasets provide ground-truth color values for each class; the Cityscapes, SYNTHIA, and GTA-V datasets also provide label images with a different integer index value assigned to the pixels of each class. Each dataset includes different types of urban road scenes, and various types and sizes of objects.


Attribute Definition. We defined two types of criteria to specifically analyze the attributes of the five representative datasets for road-scene segmentation from an image-frame (still-shot) perspective, setting aside the collection method and environment. One is the set of metrics for each image frame, and the other is the set of object metrics (Table 2). For each metric, we computed the mean value and its distribution. The analyzed information about image complexity and object diversity can be utilized to construct new datasets with different goals. The metrics that explain scene complexity from an image-frame perspective are class diversity, object density, and road diversity. Class diversity means the distribution of the number of all classes appearing per frame; from it, the diversity of objects in a scene can be determined. Object density means the distribution of the total number of all objects appearing per frame, and explains how many objects concentrate in a scene. Road diversity means the distribution of the relative ratio of road area to building area; from it, we can estimate whether the scene is a highway or a city center. The metrics that explain the extrinsic variability of objects of each class from an object perspective are class density, object size variability, object shape variability [Collins et al., 2001], object intensity (one-channel color) variability, and geometrical position variability. Class density means the distribution of the number of objects of a specific class per frame, and represents the number of objects of the class that exist in a scene. Object size/shape/intensity (one-channel color) variability means the distributions of the external appearance of objects of a specific class, and shows how the appearance of objects varies across scenes. Geometrical position variability means the distribution of object positions in scenes; it explains which positions are the major regions of interest in scenes. A minimal computational sketch of these attributes is given after Table 2.

4   Dataset Integration

Class Alliance for Scene Integration. Each country has different object attributes, road-surface properties, rules of the road, traffic patterns (traffic signs and signals), and climate conditions. If the image characteristics used for learning and testing of deep neural networks are different, this diversity is a major cause of degradation of the accuracy of semantic scene segmentation. Quickly constructing a single dataset that includes all varieties of road scenes is a real challenge, so the most reasonable approach is to efficiently integrate the released datasets collected in different regions.
                    Attributes                          Definition                                    Explanation
Frame Attributes    Class Diversity                     (1/N) Σ #(Classes)                            how diverse the objects in a scene are
                    Object Density                      (1/N) Σ #(Objects)                            how many objects concentrate in a scene
                    Road Diversity                      (1/N) Σ max((Area_R - Area_B)/Area_R, 0)      how diverse the road scenes are
Object Attributes   Class Density                       (1/N) Σ #(Objects_i)                          how many objects of class i exist in a scene
                    Object Size Variability             (1/N) Σ Size_i                                how the size of objects varies across scenes
                    Object Shape Variability            (1/N) Σ Dispersedness_i                       how the shape of objects varies across scenes
                    Object Intensity Variability        (1/N) Σ Intensity_i                           how the intensity of objects varies across scenes
                    Geometrical Position Variability    (1/N) Σ (Xc_i, Yc_i)                          where the object most likely appears

Table 2: Summary of frame/object attributes. We propose three frame attributes and five object attributes. Frame attributes are used to analyze the scene complexity of each dataset. Object attributes are used to understand the extrinsic variability of objects of each class. N: total number of image frames of each dataset; #(·): number of corresponding items; i: index of a specific class. Area_R and Area_B: area of road and building in each image frame, respectively. If Area_R < Area_B, then road diversity is set to 0.
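To make the attribute definitions concrete, the following minimal sketch computes the three frame attributes for a single label map. It is our illustrative reading of Table 2, not released code: it assumes a 2D integer label map, treats connected components as "objects", and uses assumed placeholder ids for the road and building classes.

    import numpy as np
    from scipy import ndimage  # one convenient choice for connected components

    ROAD_ID, BUILDING_ID = 1, 15  # assumed label ids; real encodings differ per dataset

    def frame_attributes(label_map):
        """Per-frame values of the three frame attributes in Table 2.

        label_map: 2D integer array, one class id per pixel. The dataset-level
        attributes are the means of these per-frame values over all N frames.
        """
        classes = np.unique(label_map)
        class_diversity = len(classes)  # #(Classes) present in this frame

        # #(Objects): count connected components of every class as objects.
        object_density = sum(ndimage.label(label_map == c)[1] for c in classes)

        # max((Area_R - Area_B) / Area_R, 0); defined as 0 when no road is labeled.
        area_r = np.count_nonzero(label_map == ROAD_ID)
        area_b = np.count_nonzero(label_map == BUILDING_ID)
        road_diversity = max((area_r - area_b) / area_r, 0.0) if area_r else 0.0

        return class_diversity, object_density, road_diversity

The object attributes can be accumulated in the same pass by restricting the loop to one class and recording each component's pixel count (size), centroid (geometrical position), and mean gray value (intensity).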


Models created with this integrated dataset, which fully represents the diversity of road scenes, can be a very good choice for the initial model needed to create a model optimized for a specific environment.

   To build an integrated dataset, we consider the 30 classes and 8 groups of the Cityscapes dataset. The authors of the Cityscapes dataset selected 30 classes in the road scene and grouped them semantically by referring to WordNet [Miller, 1995]. Cityscapes is a reasonable basis for performing matching between the classes of datasets because it has about the average number of classes among the five datasets considered here. We performed a semantic comparison between the classes considered by each dataset, ranging from 11 to 66, and the classes defined in the Cityscapes dataset (Table 3). Usually, fewer than 30 classes from each of the other datasets correspond to a superclass of Cityscapes's classes, and 1:1 matching with the most suitable Cityscapes class was accomplished without any division. If the number of classes is > 30, they are usually subclasses of the 30 classes in Cityscapes, so we matched the classes in the other datasets to the semantically higher class among the Cityscapes classes. In this way, the images in each dataset can be unified to construct a large-scale dataset of N images, which represents more than M urban environments. For the integrated dataset based on the common classes, we performed image-based and object-based analysis (Section 3), and observed how the characteristics changed (Section 5).

Sampling Methods for Image Integration. To build an integrated dataset, the images from each dataset must be mixed appropriately after the classes are unified. In this paper, we propose six image-sampling methods to combine images from different datasets. The second and third are aimed at balancing the numbers of images between datasets, and the fourth through sixth are aimed at building an integrated dataset that is optimized for a specific purpose. A sketch of the image counts produced by these schemes follows the list.

  • Naive Integration: The simplest method to integrate datasets is to merge all the images in the datasets into a unified image size. This method retains the original image data of each dataset, but naturally, the characteristics of the datasets with large image quantities become dominant.

  • Randomized Undersampling: Undersampling is one of the most commonly used methods to match the number of images among classes or among datasets [Buda et al., 2017; Haixiang et al., 2016; Drummond and Holte, 2003]. It randomly selects, from each dataset, a number of images equal to the number of images in the smallest dataset. The integrated dataset consists of min(N_m) × M images, where N_m is the number of images of the mth dataset and M is the number of datasets. Undersampling is an intuitive and easy-to-use sampling method, but it has the drawback of not being able to exploit the large number of residual images.

  • Randomized Oversampling: Oversampling is another frequently-used method [Buda et al., 2017; Haixiang et al., 2016; Janowczyk and Madabhushi, ; Jaccard et al., 2017]. It randomly selects images, allowing duplicates, from each dataset until every dataset contributes as many images as the largest one. The integrated dataset consists of max(N_m) × M images, where N_m is the number of images in the mth dataset and M is the number of datasets. Overfitting may occur in some cases [Chawla et al., ; Wang et al., ], but variations exist to reduce this problem [Chawla et al., ; Han et al., ; Shen et al., 2016]. Oversampling is the most common method to get the largest number of images for training.

  • Diversity Oriented Sampling: In this method, the larger the average number of classes contained in the images of a dataset, the more of its images are reflected in the integrated dataset. The integrated dataset consists of Σ_{m=1}^{M} (w_m^{CD} × max(N_m)) images, where w_m^{CD} = CD_m / Σ_{m=1}^{M} CD_m is the weight of the mth dataset, CD_m is the average class diversity of the mth dataset, and M is the number of datasets. The maximum number of images that can be sampled from one dataset is limited to max(N_m). This sampling method enables construction of an integrated dataset that best adapts to the variety of static/dynamic backgrounds in a target environment. As a variation of diversity-oriented sampling, an integrated dataset may be constructed by selecting only images that include more classes than an average number desired by the user.
       Cityscapes: Base    CamVid                                        SYNTHIA             GTA-V                       Mapillary
       01. Road            Road, Road Shoulder, Lane Markings Drivable   Road, Lanemarking   Road                        Road, Pothole, Lane, Service Lane, General Lane Marking
       02. Sidewalk        Sidewalk                                      Sidewalk            Sidewalk                    Sidewalk, Pedestrian Area, Curb, Curb Cut
       03. Parking         Parking Block                                 Parking Slot        -                           Parking
       04. Rail Track      -                                             -                   -                           Rail Track
       05. Person          Child, Pedestrian                             Pedestrian          Person                      Person
       06. Rider           Bicyclist                                     Rider               Rider                       Bicyclist, Motorcyclist, Other Rider
       07. Car             Car                                           Car                 Car                         Car
       08. Truck           SUV/Pickup Truck                              Truck               Truck                       Truck
       09. Bus             Truck/Bus                                     Bus                 Bus                         Bus
       10. On Rails        Train                                         Train               Train                       On Rails
       11. Motorcycle      Motorcycle/Scooter                            Motorcycle          Motorcycle                  Motorcycle
       12. Bicycle         -                                             Bicycle             Bicycle                     Bicycle
       13. Caravan         -                                             -                   -                           Caravan
       14. Trailer         -                                             -                   Trailer                     Trailer
       15. Building        Building                                      Building            Building                    Building
       16. Wall            Wall                                          Wall                Wall                        Wall
       17. Fence           Fence                                         Fence               Fence                       Fence
       18. Guardrail       -                                             -                   Guardrail                   Guardrail, Barrier
       19. Bridge          Bridge                                        -                   Bridge                      Bridge
       20. Tunnel          Tunnel                                        -                   Tunnel                      Tunnel
       21. Pole            Column/Pole                                   Pole                Pole                        Pole, Utility Pole, Street Light, Traffic Sign Frame
       22. Pole Group      -                                             -                   -                           -
       23. Traffic Sign    Sign/Symbol                                   Traffic Sign        Traffic Sign                Traffic Sign Front
       24. Traffic Light   Traffic Light                                 Traffic Light       Traffic Light               Traffic Light
       25. Vegetation      Tree, Vegetation Misc                         Vegetation          Vegetation                  Vegetation
       26. Terrain         -                                             Terrain             Terrain                     Terrain, Sand
       27. Sky             Sky                                           Sky                 Sky                         Sky
       28. Ground          Non-Drivable                                  -                   -                           Crosswalk Plain, Crosswalk Zebra, Water
       29. Dynamic         Animal, Cart/Luggage/Pram, Other Moving       -                   -                           Bird, Animal, Trash Can, Boat, Wheeled Slow, Other Vehicle
       30. Static          Archway, Misc Text, Traffic Cone, Void        Road-Work, Void     Ego Vehicle, Static, Void   Ego Vehicle, Car Mount, Mountain, Snow, Banner, Billboard,
                                                                                                                         CCTV Camera, Traffic Sign Back, Catch Basin, Manhole,
                                                                                                                         Fire Hydrant, Bench, Bike Rack, Junction Box, Mailbox,
                                                                                                                         Phone Booth, Unlabeled



Table 3: Class matching table of the Cityscapes dataset (8 categories, 30 classes) and the other datasets: Object (01-24) and Nature (25-30). We assigned the classes of the four datasets (CamVid, SYNTHIA, GTA-V, Mapillary) to the 30 classes by referring to the class definitions of the Cityscapes dataset. All datasets share most of Cityscapes's classes. In particular, important classes (road, human, vehicle, traffic sign/light) are included in all datasets. GTA-V does not provide class descriptions, so we manually checked the class names (we could not find 8 classes). The Mapillary dataset divides the 'static' class into many sub-classes.
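Class alliance then reduces to a per-pixel id remapping. The sketch below only shows the mechanism; the source ids on the left are made-up placeholders for a few CamVid classes, not the datasets' real encodings.

    import numpy as np

    # Hypothetical source ids -> unified Cityscapes-based ids (rows of Table 3).
    CAMVID_TO_UNIFIED = {
        0: 1,   # Road          -> 01. Road
        1: 1,   # Road Shoulder -> 01. Road
        2: 2,   # Sidewalk      -> 02. Sidewalk
        3: 5,   # Child         -> 05. Person
        4: 5,   # Pedestrian    -> 05. Person
        5: 30,  # Void          -> 30. Static
    }

    def remap_labels(label_map, table):
        """Rewrite every pixel's class id according to a matching table."""
        lut = np.zeros(max(table) + 1, dtype=np.int64)  # lookup table over source ids
        for src, dst in table.items():
            lut[src] = dst
        return lut[label_map]  # vectorized per-pixel remapping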


  • Density Oriented Sampling: An integrated dataset can be built that is optimized for the object density of the target environment. An integrated dataset that closely represents the images of a dense dataset consists of Σ_{m=1}^{M} (w_m^{OD} × max(N_m)) images, where w_m^{OD} = OD_m / Σ_{m=1}^{M} OD_m is the weight of the mth dataset, OD_m is the average object density of the mth dataset, and M is the number of datasets. A modified method constructs an integrated dataset by selecting only images whose density is higher than an average density that the user desires.

  • Target Oriented Sampling: If the goal is to extract a specific target object accurately, the integrated dataset must have images that contain as many of the target objects as possible in one scene. In addition, constructing a training set with a uniform distribution over each object attribute enables generation of a model that is insensitive to the attributes of the target object; that is, a model insensitive to changes in object attributes can be built if the training dataset is constructed by selecting images so that each attribute has an even distribution, with as many variations as possible.
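The per-dataset image counts implied by these schemes are easy to write down. The sketch below is a minimal illustration under our own simplifications (each dataset reduced to its image count N_m and, for the weighted schemes, one attribute value such as CD_m or OD_m); it is not the authors' implementation.

    def sampling_counts(n_images, weights=None, mode="naive"):
        """Images drawn from each of the M datasets under one integration scheme.

        n_images: list of N_m, the image count of each dataset.
        weights:  per-dataset attribute values (average class diversity CD_m for
                  diversity-oriented sampling, average object density OD_m for
                  density-oriented sampling).
        """
        if mode == "naive":      # merge everything; large datasets dominate
            return list(n_images)
        if mode == "under":      # min(N_m) images from every dataset
            return [min(n_images)] * len(n_images)
        if mode == "over":       # max(N_m) images per dataset, duplicates allowed
            return [max(n_images)] * len(n_images)
        if mode == "weighted":   # diversity- or density-oriented sampling
            total = sum(weights)
            # w_m * max(N_m) images per dataset, capped at max(N_m)
            return [min(round(w / total * max(n_images)), max(n_images))
                    for w in weights]
        raise ValueError("unknown mode: " + mode)

    # Example with the image counts used in Section 5:
    print(sampling_counts([701, 3475, 9400, 24966, 20000], mode="under"))
    # -> [701, 701, 701, 701, 701]

Target-oriented sampling has no closed-form count: it filters images by the attributes of the target object, so it would operate on per-image attribute values rather than dataset totals.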
5   Experiments

Six datasets were used for the analysis of frame and object attributes: CamVid (701 images), Cityscapes (3,475 images), SYNTHIA (9,400 images), GTA-V (24,966 images), Mapillary (20,000 images), and the integrated dataset proposed in Section 4 (58,542 images). Frame attributes were evaluated for each image frame, regardless of class, and object attributes were evaluated individually for each class in each dataset.

   The three frame attributes indicate the number and variety of objects that are present in the image frames of each dataset. First, the attribute values were individually calculated from the image frames (Table 2), and the distribution of each attribute was expressed as a histogram (Fig. 2, Fig. 3). To compare the variances of the distributions, we normalized each histogram to [0, 1] by dividing each bin by the maximum bin value (see the sketch below). We used the naïve integration method to construct the integrated dataset. Image size or resolution does not significantly affect the frame attributes, but image size can affect the object attributes. However, the variance of an attribute's distribution and the image resolution itself are characteristics of the image-acquisition devices used in each dataset, so we displayed the original distributions without performing image-size normalization, then compared their variances by considering the absolute size ranges. We used the target-oriented sampling method to construct another integrated dataset and computed the object attributes based on the Cityscapes dataset's image size.
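The histogram normalization used above is a one-liner; a minimal sketch, assuming a flat array of per-frame attribute values:

    import numpy as np

    def normalized_histogram(values, bins=50):
        """Histogram scaled so the largest bin equals 1, which lets distributions
        from datasets with very different image counts share one [0, 1] axis."""
        hist, edges = np.histogram(values, bins=bins)
        return hist / hist.max(), edges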



Figure 2: Histograms of frame attributes. (a) Distribution of class diversity for the six datasets, including the integrated dataset. Horizontal black line: number of classes of each dataset; red line: position of the mean value. Considering that the number of classes in the Mapillary dataset is about twice that of the other datasets, all datasets exhibit a similar variance. (b) Distribution of object density. Images in the virtual-image datasets generally contain more, and more varied, objects than the other datasets. (c) Distribution of road diversity. Most datasets contain images whose road area is smaller than the building area on average; i.e., many urban scenes with buildings rather than highways or countryside. Images in the Cityscapes dataset have the most diverse road areas.

5.1   Relative Analysis of Frame Attributes

Class Diversity. The Mapillary dataset contains the largest number of classes per frame on average, but this result occurs because the number of classes defined in the Mapillary dataset is much higher than for any other dataset. If we unify the number of classes to the minimum and calculate the relative ratio, the SYNTHIA dataset includes the largest number of classes per frame on average, and the remaining datasets include an average of 15 classes per frame. The variance of the GTA-V dataset is the largest, which means that the classes present in one frame are the most diverse, from the smallest number to the largest.

Object Density. The vertical range of object density was larger than expected; the reason is that a segmentation label is also assigned to every small segment that is only part of an object. Average object density varies slightly among datasets. The GTA-V dataset contains an average of 230 object segments. On average, datasets that contain virtual images contain more objects in a scene than datasets that contain real images. Thus, we can utilize a virtual-image dataset to increase the complexity of the scene.

Road Diversity. When the road diversity is calculated, it is set to 0 when no road segment is present, or when the building area is larger than the road area. Most of the images have road diversity = 0 (Fig. 2(c)); i.e., many road scenes include numerous buildings, or do not have an area that is labeled as road. This result indicates that all datasets contain many images that were captured in an urban environment rather than on the highway. Except for the zero bin, the Cityscapes dataset evenly covers the roadscapes of various areas.

Integrated Dataset. In class diversity, our integrated dataset shows the most typical normal distribution, in which the mean value is in the middle of the number of classes. Most experiments assume that the normal distribution is the most common. In class density, the integrated dataset is closest to the normal distribution after GTA-V, and the value of each point in the distribution is high because the number of images is much larger in the integrated dataset than in each of the component datasets. This observation means that the integrated dataset that we propose is more advantageous than the component datasets for learning scene-segmentation models. The road diversity of the integrated dataset represents the common characteristics of the other datasets. In summary, the image-complexity properties of the integrated dataset are not biased to one side, but show roughly the average characteristics of the five component datasets. Depending on the complexity of the field in which the dataset is to be applied, the weight of the dataset that has the corresponding complexity can be increased to create a new integrated dataset that is optimized for a specific research field. For example, if the scene includes a complex environment where a large number of objects appear, the weights can be increased for virtual-image datasets such as SYNTHIA and GTA-V.

5.2   Attribute Analysis of Important Objects

To analyze the object attributes, we selected four objects that are important in the driving situation: persons and cars, as the objects that are involved in the most serious damage in a collision; and traffic lights and traffic signs, which provide the most essential information for driving.

Class Density. The density distributions of persons and traffic lights were even in SYNTHIA and GTA-V, and the density distributions of cars and traffic signs were similar in most datasets. Class density has a higher average value in virtual-image datasets than in real-image datasets, as is true of the object density among the frame attributes.

Object Size Variability. The Cityscapes and Mapillary datasets include variously-sized instances of people, vehicles, traffic lights, and signs. It is useful to use these two datasets for segmentation that is less sensitive to scale changes of the object.

Object Shape Variability. The shape complexities of the important objects do not change much, regardless of dataset. The Cityscapes dataset and the Mapillary dataset have large variances in size, but small variances in shape. This result means that the morphological characteristics of each object do not depend on the size or scale of the image. For extremely small or large instances, the detail of appearance can vary widely, and most datasets include histogram bins for such cases. Sometimes, relatively large traffic lights and traffic signs appear in virtual-image datasets.




Figure 3: Histograms of object attributes for important objects. (a) Distributions of class density. Each object has a diversity of densities in each dataset. (b) Distributions of object size variability. The Cityscapes dataset and the Mapillary dataset contain objects of the most diverse scales. (c) Distributions of object shape variability. The variability of object shape in all datasets is not large; i.e., few images contain extremely large or small objects. (d) Distributions of object intensity variability. More recently constructed datasets show richer colors for each important object. (e) Distributions of geometrical position variability (row); horizontal line: range of image height. (f) Distributions of geometrical position variability (column); horizontal line: range of image width. All important objects exist in various width ranges in most datasets. Analysis of the integrated dataset is described in Section 5.


Object Intensity Variability. Instead of considering each of the RGB values, we consider the intensity value obtained by converting all images to gray images. The average intensity values are calculated in each object region and represented as a histogram. For all important objects, the SYNTHIA, GTA-V, and Mapillary datasets contain instances of much more varied color than the CamVid and Cityscapes datasets. The difference occurs because SYNTHIA, GTA-V, and Mapillary were constructed more recently than CamVid and Cityscapes, and therefore cover more images and more environmental conditions. The SYNTHIA and GTA-V datasets' tools can change various attributes of objects and backgrounds, and the Mapillary dataset was photographed on six continents, so colors vary widely.

Geometrical Position Variability: Row. The last two columns of Fig. 3 show distributions that represent the row and column (col) in which each object appears in the image. The horizontal lines of the histograms represent the image-resolution range (height, width) of each dataset. Persons and traffic signs are mainly located at the middle height of the image, whereas cars and traffic lights are mainly located in the upper part of the image. The SYNTHIA dataset contains more objects at various heights than do the other datasets.

Geometrical Position Variability: Column. In all datasets, most objects exist at various locations from left to right of the image. In particular, the Cityscapes dataset and the Mapillary dataset include many cases in which objects are uniformly present in all column ranges; the range of rows in which important objects exist is limited, but the column range is relatively varied. A dataset with an even distribution of object locations implies a diversity of situations or scenarios.

Integrated Dataset. The distributions of the integrated dataset lie within the range of characteristics of the component datasets. This characteristic holds for the four object attributes (density, size, shape, intensity), so the integrated dataset is much more useful than the component datasets for training bigger models, because the number of objects it contains is much larger. In the integrated dataset, the spatial positions of objects within an image are more uniform, and the absolute numbers of objects at all horizontal and vertical positions are much larger, than in the component datasets. To build a specialized integrated dataset with a specific range of density, size, shape, intensity, and position values for other objects of interest, including the important objects, the ratio of items from each dataset can be adjusted appropriately. For example, if the goal is to segment human regions reliably regardless of size and color, the ratio of the Mapillary dataset in the integrated dataset can be increased.
6   Conclusion

Published datasets for use in semantic scene segmentation have different characteristics, such as the number of classes that have been defined and labeled, the image size, the range of regions in which the images were obtained, the realism of the graphics, and the diversity of the landscapes. Therefore, to learn a deep neural network, many images that include various characteristics should be acquired. In this paper, we compared the basic information of five representative datasets, then analyzed their distribution characteristics by defining three frame attributes and five object attributes. We also performed class matching to construct new datasets that incorporate these five datasets. Statistical results show that the image complexity of the virtual-image datasets (SYNTHIA, GTA-V) is relatively higher than that of the real-image datasets, and that the Cityscapes dataset includes a variety of road scenes. In addition, for certain important objects, the datasets with flat distribution ranges differ for each attribute, so the proportional contribution of each dataset to the integrated dataset should be optimized to best match the situation of the research field to which it is to be applied. In the future, we will analyze how the method of constructing integrated datasets affects segmentation accuracy, and will study how to train deep neural networks by using the integrated datasets to improve accuracy.

References

[Bileschi, ] S. Bileschi. CBCL StreetScenes: towards scene understanding in still images. Technical report, MIT.

[Brostow et al., 2008] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. 2008.

[Brostow et al., 2009] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters, 30(2):88-97, 2009.

[Buda et al., 2017] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. 2017.

[Chawla et al., ] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16.

[Collins et al., 2001] R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456-1477, 2001.

[Cordts et al., 2015] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset. 2015.

[Cordts et al., 2016] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. 2016.

[Drummond and Holte, 2003] C. Drummond and R. C. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. 2003.

[Fei-Fei et al., 2006] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594-611, 2006.

[Griffin et al., ] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, Caltech.

[Haixiang et al., 2016] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications, 73(1):220-239, 2016.

[Han et al., ] H. Han, W. Y. Wang, and B. H. Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing.

[Jaccard et al., 2017] N. Jaccard, T. W. Rogers, E. J. Morton, and L. D. Griffin. Detection of concealed cars in complex cargo X-ray imagery using deep learning. Journal of X-Ray Science and Technology, 25(3):323-339, 2017.

[Janowczyk and Madabhushi, ] A. Janowczyk and A. Madabhushi. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7.

[Martin et al., 2001] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. 2001.

[Miller, 1995] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[Neuhold et al., 2017] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. 2017.

[Perazzi et al., 2016] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. 2016.

[Richter et al., 2016] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: ground truth from computer games. 2016.

[Ros et al., 2016] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. 2016.

[Russell et al., ] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1).

[Shen et al., 2016] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. 2016.

[Shotton et al., 2006] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. 2006.

[Smeaton et al., 2006] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. 2006.

[Technologies, ] U. Technologies. Unity development platform. Technical report.

[Wang et al., ] K. J. Wang, B. Makond, K. H. Chen, and K. M. Wang. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Applied Soft Computing, 20.

[Yao et al., 2007] B. Yao, X. Yang, and S. C. Zhu. Introduction to a large-scale general purpose ground truth database: methodology, annotation tool and benchmarks. 2007.