        Tenth International Workshop Modelling and Reasoning in Context (MRC) – 13.07.2018 – Stockholm, Sweden




     Attribute Dissection of Urban Road Scenes for Efficient Dataset Integration

                                             Jiman Kim, Chanjong Park
                                                  Samsung Research,
                                                 Samsung Electronics
                                       {jiman14.kim, cj710.park}@samsung.com


                          Abstract

Semantic scene segmentation, or scene parsing, is very useful for high-level scene recognition. To improve the performance of scene segmentation, the quantity and quality of the datasets used for a deep network's learning are important. In other words, we need to consider various external environments and various variations of the predefined objects in terms of image characteristics. In recent years, many datasets for semantic scene segmentation focused on autonomous driving have been released. However, since only quantitative analysis of each dataset is provided, it is difficult to establish an efficient learning strategy that considers the image characteristics of objects. We present definitions of three frame attributes and five object attributes, and analyze their statistical distributions to provide qualitative information about datasets that are to be merged. We also propose an integrated dataset configuration that can exploit the advantages of each dataset for deep-network learning after class matching. As a result, we can build new integrated datasets that are optimized for the scene complexity and object properties of the environment by considering the statistical characteristics of each dataset.

1   Introduction

Scene understanding requires information such as the objects that are present in the scene, their characteristics, and the relationships among them. Semantic segmentation provides information on the location and type of objects by dividing an image into regions that include predefined objects. To apply scene-segmentation functions to autonomous vehicles, many road scene-centered datasets have been released. Representative datasets labeled at the pixel level are CamVid [Brostow et al., 2008; 2009], Cityscapes [Cordts et al., 2015; 2016], SYNTHIA [Ros et al., 2016], GTA-V [Richter et al., 2016], and Mapillary [Neuhold et al., 2017]. The papers that explain each dataset provide statistical information on the image-data collection environment, area, and device, the amount of images, and the relative proportions of objects. Papers [Perazzi et al., 2016] that compare the scene-parsing accuracy of several state-of-the-art algorithms focus on the algorithms' advantages and disadvantages, rather than on the characteristics of the image data. However, the datasets define different categories and exhibit different image characteristics of the instances in them, so efficient learning of a deep network requires detailed analysis of various attributes of the data. For example, some datasets may contain many small objects, some may contain densely distributed objects, and others may show deformed objects. By using these image characteristics, a network can be developed that has excellent specialization for a specific environment and object, or that has excellent generality in a general environment by combining characteristics.

   In this paper, we analyze the CamVid, Cityscapes, SYNTHIA, GTA-V, and Mapillary datasets quantitatively, based on two criteria. First, we analyze image-centric criteria such as the average number of categories in the image, the average number of objects, and the average proportion of the image that is a road region. Second, we analyze object-centric criteria, such as the average spatial density of objects in the image, their average size, average shape, average color, and average position in the image. This analysis provides good insight into ways to train deep networks. We also propose a new set of integrated classes that can be used commonly among datasets, and a method to construct an integrated dataset. The integrated dataset contributes to improving the generality of a deep network by including various road environments and object characteristics. This paper has the following structure. Section 2 introduces papers related to published datasets. Section 3 summarizes each dataset and proposes image- and object-centric attributes. Section 4 proposes a new integrated dataset created by performing class alliance. Section 5 provides a detailed comparative analysis of the proposed attributes, and suggests insights for constructing integrated datasets. Section 6 summarizes all findings and contributions of this paper.

2   Related Work

Road scene-centric public datasets for pixel-level semantic segmentation have been released (Table 1, Fig. 1) with papers that explain them. CamVid [Brostow et al., 2008; 2009] was the first dataset that had semantic labels of object classes for each pixel.
The images were acquired from the perspective of a driving automobile; they are divided into 32 semantic classes with manually-annotated labels. To reduce the effort of the person who must label objects, the authors proposed joint tracking of keypoints and regions; this method propagates the label information to the 100 subsequent frames. The set includes the camera's 3D pose in each frame and has a software tool that users can use to label their additional images. The publicly available datasets [Martin et al., 2001; Fei-Fei et al., 2006; Bileschi, ; Shotton et al., 2006; Smeaton et al., 2006; Griffin et al., ; Yao et al., 2007; Russell et al., ] released before CamVid have polygon-level labels, not pixel-level labels, and they were obtained from fixed CCTV-style cameras. The paper provides statistical information on the percentage of the objects in the image for each sequence and the number of occurrences.

   Cityscapes [Cordts et al., 2015; 2016] is a large-scale dataset that includes complex real-world urban scenes. Cityscapes has image data that are labelled at the pixel level and instance level. The images were acquired from 50 cities to include a variety of road environments. The authors provided the results of statistical analysis between datasets by grouping 30 classes into eight categories. The results describe the number and relative ratios of annotated pixels of each class, the annotation density, the distribution of the number of traffic-related instances per image, and the distribution of the number of vehicles according to distance.

   SYNTHIA [Ros et al., 2016] is a dataset of synthetic images obtained from a virtual world (the Unity development platform) [Technologies, ]. The images were captured from multiple viewpoints by using two multi-cameras, each with four monocular cameras, mounted on a virtual car. The images include different seasons, weather, and illumination conditions. The captured images were annotated with 11 predefined classes. In experiments, the authors showed that combining a real dataset with the SYNTHIA dataset dramatically increases the accuracy of semantic segmentation.

   Grand Theft Auto V (GTA-V) [Richter et al., 2016] consists of images captured from a computer game. The authors proposed a method to quickly generate semantic label maps. Each image is automatically divided into patches, which are then merged using MTS (mesh, texture, shader) information. For each patch, a semantic class is manually assigned. Within a brief space of time, these methods yield far more pixel labeling than previous datasets. When virtual images generated by the proposed method were added to real-world images, segmentation accuracy was greatly improved even though a large number of real-world images were replaced by virtual images. The related paper provided statistical information on the number of labeled pixels, the annotation density, and the time and speed of labeling.

   Mapillary [Neuhold et al., 2017] is the dataset that contains the most real-world images, and the largest number (66) of categories to consider. The images were captured by photographers of different experience levels on various imaging devices. The considered regions are Europe, North and South America, Asia, Africa, and Oceania, and the scenes include urban, countryside, and off-road scenes. Manual annotation was performed using polygons by specialized image annotators. Statistical analyses performed by the authors include image resolution, focal length, number of images taken with the devices used for image acquisition, region where the image was acquired, number of instances per class, number of objects per image, number of traffic-regulation objects per image, and number of traffic participants per image.

   These papers mainly analyzed how often each class appeared in each image. They also focused on the number and proportions of major classes that are closely related to traffic. If detailed and organized information on the frame and object side could be obtained, it would improve the learning efficiency of deep networks. Therefore, in this work, we perform a detailed characterization of each dataset to derive insight. Also, to enable simultaneous use of two or more datasets with different class numbers and types, we define a commonly usable set of classes and propose a way to efficiently combine datasets.

3   Attribute Analysis

Scene segmentation or scene parsing at the pixel level, which extracts the boundaries of many kinds of objects, solves object detection and localization simultaneously. To achieve high accuracy of pixel-level segmentation, large-scale datasets are required; they must include a variety of shape and appearance variations of static objects (background) and dynamic objects (foreground). Therefore, construction of datasets that focus on road scenes has led to increased resolution and number of images, and to an increased variety of environments.

Trends of Datasets. The datasets constructed and released for this same goal are described in Table 1. Higher resolution and a larger amount of images are two common trends in constructing road scene-centric datasets. The increase in the resolution of the collected images is closely related to pixel-level accuracy. In addition, virtual-environment tools have been used to collect a large number of images in a short time. In particular, the volume of real images in the Mapillary dataset was increased sharply by a community-led service for sharing street-level photographs. Diversification of the environments that the images represent has yielded datasets from different regions and environments, and recently-constructed datasets include increasing diversity of regions and of environmental conditions. Real images are much more difficult to obtain than virtual images, and the continental, regional, and environmental conditions in which the images are acquired have become very diverse in the Cityscapes and Mapillary datasets. Each dataset has different properties (Table 1). The CamVid dataset was the first dataset that focused on road scenes; it contains many lane-clear highway images. The Cityscapes dataset includes images that are specific to European urban scenes. The SYNTHIA dataset has many virtual images with multiple seasons. The GTA-V dataset's virtual images are extremely realistic, and its effects are richly controllable. The Mapillary dataset contains the largest number of images, collected in the broadest variety of regions.
  Name                  Year        Class   Resolution       Image (Training/Validation/Test)                 Description
                                                             RGB                     GT
  CamVid                2008, 2009  32      960 × 720        701                     701                      Real Image,
                                            Unified Size     (367/101/233)           (367/101/233)            Normal Light/Weather
  Cityscapes (fine)     2015, 2016  30      2,048 × 1,024    5,000                   3,475                    Real Image (50 Cities),
                                            Unified Size     (2,975/500/1,525)       (2,975/500/-)            Normal Light/Weather
  SYNTHIA (cityscapes)  2016        23      1,280 × 760      9,400                   9,400                    Virtual Image,
                                            Unified Size     (9,400)                 (9,400)                  Dynamic Light/Weather
  GTA-V                 2016        34      1,914 × 1,052    24,966                  24,966                   Virtual Image,
                                            Unified Size     (24,966)                (24,966)                 Dynamic Light/Weather
  Mapillary             2017        66      3,420 × 2,480    25,000                  20,000                   Real Image (6 Continents),
                                            Averaged Size    (18,000/2,000/5,000)    (18,000/2,000/-)         Dynamic Light/Weather
  Integration           2018        30      2,048 × 1,024    65,067                  58,542                   Real/Virtual Image,
                                            Unified Size     (41,961/6,038/17,068)   (41,961/6,038/10,543)    Dynamic Light/Weather

Table 1: Quantitative summary of various datasets for semantic scene segmentation. The number of classes of each dataset includes the 'void' class. GTA-V and Mapillary contain images of different sizes. The images of SYNTHIA and GTA-V are not divided into training, validation, and test sets. Recently released datasets include more classes and higher-resolution images. Also, many virtual tools that simulate road-scene environments have been released to increase the number of virtual images under various conditions, because collecting real images has a high cost. The integrated dataset is based on the Cityscapes dataset (Table 3) but has many more images.




Figure 1: Example images of the five datasets. First row: randomly-selected original images (RGB) from each dataset; second row: ground-truth images that correspond to each original image. The CamVid and Mapillary datasets provide ground-truth color values for each class; the Cityscapes, SYNTHIA, and GTA-V datasets also provide label images with a different integer index value assigned to the pixels of each class. Each dataset includes different types of urban road scenes, and various types and sizes of objects.


Attribute Definition. We defined two types of criteria to specifically analyze the attributes of the five representative datasets for road-scene segmentation from an image-frame (still-shot) perspective, setting aside the collection method and environment. One is the set of metrics for each image frame, and the other is the set of object metrics (Table 2). For each metric, we computed the mean value and its distribution. The analyzed information about image complexity and object diversity can be utilized to construct new datasets with different goals. The metrics that explain scene complexity from an image-frame perspective are class diversity, object density, and road diversity. Class diversity means the distribution of the number of all classes appearing per frame; from it, the diversity of objects in a scene can be determined. Object density means the distribution of the total number of all objects appearing per frame, and explains how many objects concentrate in a scene. Road diversity means the distribution of the relative ratio of road area to building area; from it, we can estimate whether the scene is a highway or a city center. The metrics that explain the extrinsic variability of objects of each class from an object perspective are class density, object size variability, object shape variability [Collins et al., 2001], object intensity (one-channel color) variability, and geometrical position variability. Class density means the distribution of the number of objects of a specific class per frame, and represents the number of objects of the class that exist in a scene. Object size/shape/intensity (one-channel color) variability means the distributions of the external appearance of objects of a specific class, and shows how the appearance of objects varies across scenes. Geometrical position variability means the distribution of object positions in scenes; it explains which positions are the major regions of interest in scenes. A minimal computational sketch of these attributes is given after Table 2.

4   Dataset Integration

Class Alliance for Scene Integration. Each country has different object attributes, road-surface properties, rules of the road, traffic patterns (traffic signs and signals), and climate conditions. If the image characteristics used for learning and testing of deep neural networks are different, this diversity is a major cause of degradation of the accuracy of semantic scene segmentation. Quickly constructing a single dataset that includes all varieties of road scenes is a real challenge, so the most reasonable approach is to efficiently integrate the released datasets collected in different regions.
                    Attributes                          Definition                                    Explanation
Frame Attributes    Class Diversity                     (1/N) Σ #(Classes)                            how diverse the objects in a scene are
                    Object Density                      (1/N) Σ #(Objects)                            how many objects concentrate in a scene
                    Road Diversity                      (1/N) Σ max((Area_R - Area_B)/Area_R, 0)      how diverse the road scenes are
Object Attributes   Class Density                       (1/N) Σ #(Objects_i)                          how many objects of class i exist in a scene
                    Object Size Variability             (1/N) Σ Size_i                                how the size of objects varies across scenes
                    Object Shape Variability            (1/N) Σ Dispersedness_i                       how the shape of objects varies across scenes
                    Object Intensity Variability        (1/N) Σ Intensity_i                           how the intensity of objects varies across scenes
                    Geometrical Position Variability    (1/N) Σ (Xc_i, Yc_i)                          where the object most likely appears

Table 2: Summary of frame/object attributes. We propose three frame attributes and five object attributes. Frame attributes are used to analyze the scene complexity of each dataset. Object attributes are used to understand the extrinsic variability of objects of each class. N: total number of image frames of each dataset; #(·): number of corresponding items; i: index of a specific class. Area_R and Area_B: area of road and building in each image frame, respectively. If Area_R < Area_B, then road diversity is set to 0.
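To make the attribute definitions concrete, the following minimal sketch computes the three frame attributes for a single label map. It is our illustrative reading of Table 2, not released code: it assumes a 2D integer label map, treats connected components as "objects", and uses assumed placeholder ids for the road and building classes.

    import numpy as np
    from scipy import ndimage  # one convenient choice for connected components

    ROAD_ID, BUILDING_ID = 1, 15  # assumed label ids; real encodings differ per dataset

    def frame_attributes(label_map):
        """Per-frame values of the three frame attributes in Table 2.

        label_map: 2D integer array, one class id per pixel. The dataset-level
        attributes are the means of these per-frame values over all N frames.
        """
        classes = np.unique(label_map)
        class_diversity = len(classes)  # #(Classes) present in this frame

        # #(Objects): count connected components of every class as objects.
        object_density = sum(ndimage.label(label_map == c)[1] for c in classes)

        # max((Area_R - Area_B) / Area_R, 0); defined as 0 when no road is labeled.
        area_r = np.count_nonzero(label_map == ROAD_ID)
        area_b = np.count_nonzero(label_map == BUILDING_ID)
        road_diversity = max((area_r - area_b) / area_r, 0.0) if area_r else 0.0

        return class_diversity, object_density, road_diversity

The object attributes can be accumulated in the same pass by restricting the loop to one class and recording each component's pixel count (size), centroid (geometrical position), and mean gray value (intensity).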


Models created with this integrated dataset, which fully represents the diversity of road scenes, can be a very good choice for the initial model needed to create a model optimized for a specific environment.

   To build an integrated dataset, we consider the 30 classes and 8 groups of the Cityscapes dataset. The authors of the Cityscapes dataset selected 30 classes in the road scene and grouped them semantically by referring to WordNet [Miller, 1995]. Cityscapes is a reasonable basis for performing matching between the classes of datasets because it has about the average number of classes among the five datasets considered here. We performed a semantic comparison between the classes considered by each dataset, ranging from 11 to 66, and the classes defined in the Cityscapes dataset (Table 3). Usually, fewer than 30 classes from each of the other datasets correspond to a superclass of Cityscapes's classes, and 1:1 matching with the most suitable Cityscapes class was accomplished without any division. If the number of classes is > 30, they are usually subclasses of the 30 classes in Cityscapes, so we matched the classes in the other datasets to the semantically higher class among the Cityscapes classes. In this way, the images in each dataset can be unified to construct a large-scale dataset of N images, which represents more than M urban environments. For the integrated dataset based on the common classes, we performed image-based and object-based analysis (Section 3), and observed how the characteristics changed (Section 5).

Sampling Methods for Image Integration. To build an integrated dataset, the images from each dataset must be mixed appropriately after the classes are unified. In this paper, we propose six image-sampling methods to combine images from different datasets. The second and third are aimed at balancing the numbers of images between datasets, and the fourth through sixth are aimed at building an integrated dataset that is optimized for a specific purpose. A sketch of the image counts produced by these schemes follows the list.

  • Naive Integration: The simplest method to integrate datasets is to merge all the images in the datasets into a unified image size. This method retains the original image data of each dataset, but naturally, the characteristics of the datasets with large image quantities become dominant.

  • Randomized Undersampling: Undersampling is one of the most commonly used methods to match the number of images among classes or among datasets [Buda et al., 2017; Haixiang et al., 2016; Drummond and Holte, 2003]. It randomly selects, from each dataset, a number of images equal to the number of images in the smallest dataset. The integrated dataset consists of min(N_m) × M images, where N_m is the number of images of the mth dataset and M is the number of datasets. Undersampling is an intuitive and easy-to-use sampling method, but it has the drawback of not being able to exploit the large number of residual images.

  • Randomized Oversampling: Oversampling is another frequently-used method [Buda et al., 2017; Haixiang et al., 2016; Janowczyk and Madabhushi, ; Jaccard et al., 2017]. It randomly selects images, allowing duplicates, from each dataset until every dataset contributes as many images as the largest one. The integrated dataset consists of max(N_m) × M images, where N_m is the number of images in the mth dataset and M is the number of datasets. Overfitting may occur in some cases [Chawla et al., ; Wang et al., ], but variations exist to reduce this problem [Chawla et al., ; Han et al., ; Shen et al., 2016]. Oversampling is the most common method to get the largest number of images for training.

  • Diversity Oriented Sampling: In this method, the larger the average number of classes contained in the images of a dataset, the more of its images are reflected in the integrated dataset. The integrated dataset consists of Σ_{m=1}^{M} (w_m^{CD} × max(N_m)) images, where w_m^{CD} = CD_m / Σ_{m=1}^{M} CD_m is the weight of the mth dataset, CD_m is the average class diversity of the mth dataset, and M is the number of datasets. The maximum number of images that can be sampled from one dataset is limited to max(N_m). This sampling method enables construction of an integrated dataset that best adapts to the variety of static/dynamic backgrounds in a target environment. As a variation of diversity-oriented sampling, an integrated dataset may be constructed by selecting only images that include more classes than an average number desired by the user.
       Cityscapes: Base    CamVid                                        SYNTHIA             GTA-V                       Mapillary
       01. Road            Road, Road Shoulder, Lane Markings Drivable   Road, Lanemarking   Road                        Road, Pothole, Lane, Service Lane, General Lane Marking
       02. Sidewalk        Sidewalk                                      Sidewalk            Sidewalk                    Sidewalk, Pedestrian Area, Curb, Curb Cut
       03. Parking         Parking Block                                 Parking Slot        -                           Parking
       04. Rail Track      -                                             -                   -                           Rail Track
       05. Person          Child, Pedestrian                             Pedestrian          Person                      Person
       06. Rider           Bicyclist                                     Rider               Rider                       Bicyclist, Motorcyclist, Other Rider
       07. Car             Car                                           Car                 Car                         Car
       08. Truck           SUV/Pickup Truck                              Truck               Truck                       Truck
       09. Bus             Truck/Bus                                     Bus                 Bus                         Bus
       10. On Rails        Train                                         Train               Train                       On Rails
       11. Motorcycle      Motorcycle/Scooter                            Motorcycle          Motorcycle                  Motorcycle
       12. Bicycle         -                                             Bicycle             Bicycle                     Bicycle
       13. Caravan         -                                             -                   -                           Caravan
       14. Trailer         -                                             -                   Trailer                     Trailer
       15. Building        Building                                      Building            Building                    Building
       16. Wall            Wall                                          Wall                Wall                        Wall
       17. Fence           Fence                                         Fence               Fence                       Fence
       18. Guardrail       -                                             -                   Guardrail                   Guardrail, Barrier
       19. Bridge          Bridge                                        -                   Bridge                      Bridge
       20. Tunnel          Tunnel                                        -                   Tunnel                      Tunnel
       21. Pole            Column/Pole                                   Pole                Pole                        Pole, Utility Pole, Street Light, Traffic Sign Frame
       22. Pole Group      -                                             -                   -                           -
       23. Traffic Sign    Sign/Symbol                                   Traffic Sign        Traffic Sign                Traffic Sign Front
       24. Traffic Light   Traffic Light                                 Traffic Light       Traffic Light               Traffic Light
       25. Vegetation      Tree, Vegetation Misc                         Vegetation          Vegetation                  Vegetation
       26. Terrain         -                                             Terrain             Terrain                     Terrain, Sand
       27. Sky             Sky                                           Sky                 Sky                         Sky
       28. Ground          Non-Drivable                                  -                   -                           Crosswalk Plain, Crosswalk Zebra, Water
       29. Dynamic         Animal, Cart/Luggage/Pram, Other Moving       -                   -                           Bird, Animal, Trash Can, Boat, Wheeled Slow, Other Vehicle
       30. Static          Archway, Misc Text, Traffic Cone, Void        Road-Work, Void     Ego Vehicle, Static, Void   Ego Vehicle, Car Mount, Mountain, Snow, Banner, Billboard,
                                                                                                                         CCTV Camera, Traffic Sign Back, Catch Basin, Manhole,
                                                                                                                         Fire Hydrant, Bench, Bike Rack, Junction Box, Mailbox,
                                                                                                                         Phone Booth, Unlabeled



Table 3: Class matching table of the Cityscapes dataset (8 categories, 30 classes) and the other datasets: Object (01-24) and Nature (25-30). We assigned the classes of the four datasets (CamVid, SYNTHIA, GTA-V, Mapillary) to the 30 classes by referring to the class definitions of the Cityscapes dataset. All datasets share most of Cityscapes's classes. In particular, important classes (road, human, vehicle, traffic sign/light) are included in all datasets. GTA-V does not provide class descriptions, so we manually checked the class names (we could not find 8 classes). The Mapillary dataset divides the 'static' class into many sub-classes.
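Class alliance then reduces to a per-pixel id remapping. The sketch below only shows the mechanism; the source ids on the left are made-up placeholders for a few CamVid classes, not the datasets' real encodings.

    import numpy as np

    # Hypothetical source ids -> unified Cityscapes-based ids (rows of Table 3).
    CAMVID_TO_UNIFIED = {
        0: 1,   # Road          -> 01. Road
        1: 1,   # Road Shoulder -> 01. Road
        2: 2,   # Sidewalk      -> 02. Sidewalk
        3: 5,   # Child         -> 05. Person
        4: 5,   # Pedestrian    -> 05. Person
        5: 30,  # Void          -> 30. Static
    }

    def remap_labels(label_map, table):
        """Rewrite every pixel's class id according to a matching table."""
        lut = np.zeros(max(table) + 1, dtype=np.int64)  # lookup table over source ids
        for src, dst in table.items():
            lut[src] = dst
        return lut[label_map]  # vectorized per-pixel remapping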


  • Density Oriented Sampling: An integrated dataset can be built that is optimized for the object density of the target environment. An integrated dataset that closely represents the images of a dense dataset consists of Σ_{m=1}^{M} (w_m^{OD} × max(N_m)) images, where w_m^{OD} = OD_m / Σ_{m=1}^{M} OD_m is the weight of the mth dataset, OD_m is the average object density of the mth dataset, and M is the number of datasets. A modified method constructs an integrated dataset by selecting only images whose density is higher than an average density that the user desires.

  • Target Oriented Sampling: If the goal is to extract a specific target object accurately, the integrated dataset must have images that contain as many of the target objects as possible in one scene. In addition, constructing a training set with a uniform distribution over each object attribute enables generation of a model that is insensitive to the attributes of the target object; that is, a model insensitive to changes in object attributes can be built if the training dataset is constructed by selecting images so that each attribute has an even distribution, with as many variations as possible.
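The per-dataset image counts implied by these schemes are easy to write down. The sketch below is a minimal illustration under our own simplifications (each dataset reduced to its image count N_m and, for the weighted schemes, one attribute value such as CD_m or OD_m); it is not the authors' implementation.

    def sampling_counts(n_images, weights=None, mode="naive"):
        """Images drawn from each of the M datasets under one integration scheme.

        n_images: list of N_m, the image count of each dataset.
        weights:  per-dataset attribute values (average class diversity CD_m for
                  diversity-oriented sampling, average object density OD_m for
                  density-oriented sampling).
        """
        if mode == "naive":      # merge everything; large datasets dominate
            return list(n_images)
        if mode == "under":      # min(N_m) images from every dataset
            return [min(n_images)] * len(n_images)
        if mode == "over":       # max(N_m) images per dataset, duplicates allowed
            return [max(n_images)] * len(n_images)
        if mode == "weighted":   # diversity- or density-oriented sampling
            total = sum(weights)
            # w_m * max(N_m) images per dataset, capped at max(N_m)
            return [min(round(w / total * max(n_images)), max(n_images))
                    for w in weights]
        raise ValueError("unknown mode: " + mode)

    # Example with the image counts used in Section 5:
    print(sampling_counts([701, 3475, 9400, 24966, 20000], mode="under"))
    # -> [701, 701, 701, 701, 701]

Target-oriented sampling has no closed-form count: it filters images by the attributes of the target object, so it would operate on per-image attribute values rather than dataset totals.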
5   Experiments

Six datasets were used for the analysis of frame and object attributes: CamVid (701 images), Cityscapes (3,475 images), SYNTHIA (9,400 images), GTA-V (24,966 images), Mapillary (20,000 images), and the integrated dataset proposed in Section 4 (58,542 images). Frame attributes were evaluated for each image frame, regardless of class, and object attributes were evaluated individually for each class in each dataset.

   The three frame attributes indicate the number and variety of objects that are present in the image frames of each dataset. First, the attribute values were individually calculated from the image frames (Table 2), and the distribution of each attribute was expressed as a histogram (Fig. 2, Fig. 3). To compare the variances of the distributions, we normalized each histogram to [0, 1] by dividing each bin by the maximum bin value (see the sketch below). We used the naïve integration method to construct the integrated dataset. Image size or resolution does not significantly affect the frame attributes, but image size can affect the object attributes. However, the variance of an attribute's distribution and the image resolution itself are characteristics of the image-acquisition devices used in each dataset, so we displayed the original distributions without performing image-size normalization, then compared their variances by considering the absolute size ranges. We used the target-oriented sampling method to construct another integrated dataset and computed the object attributes based on the Cityscapes dataset's image size.
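The histogram normalization used above is a one-liner; a minimal sketch, assuming a flat array of per-frame attribute values:

    import numpy as np

    def normalized_histogram(values, bins=50):
        """Histogram scaled so the largest bin equals 1, which lets distributions
        from datasets with very different image counts share one [0, 1] axis."""
        hist, edges = np.histogram(values, bins=bins)
        return hist / hist.max(), edges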



Figure 2: Histograms of frame attributes. (a) Distribution of class diversity for the six datasets, including the integrated dataset. Horizontal black line: number of classes of each dataset; red line: position of the mean value. Considering that the number of classes in the Mapillary dataset is about twice that of the other datasets, all datasets exhibit a similar variance. (b) Distribution of object density. Images in the virtual-image datasets generally contain more, and more varied, objects than the other datasets. (c) Distribution of road diversity. Most datasets contain images whose road area is smaller than the building area on average; i.e., many urban scenes with buildings rather than highways or countryside. Images in the Cityscapes dataset have the most diverse road areas.

5.1   Relative Analysis of Frame Attributes

Class Diversity. The Mapillary dataset contains the largest number of classes per frame on average, but this result occurs because the number of classes defined in the Mapillary dataset is much higher than for any other dataset. If we unify the number of classes to the minimum and calculate the relative ratio, the SYNTHIA dataset includes the largest number of classes per frame on average, and the remaining datasets include an average of 15 classes per frame. The variance of the GTA-V dataset is the largest, which means that the classes present in one frame are the most diverse, from the smallest number to the largest.

Object Density. The vertical range of object density was larger than expected; the reason is that a segmentation label is also assigned to every small segment that is only part of an object. Average object density varies slightly among datasets. The GTA-V dataset contains an average of 230 object segments. On average, datasets that contain virtual images contain more objects in a scene than datasets that contain real images. Thus, we can utilize a virtual-image dataset to increase the complexity of the scene.

Road Diversity. When the road diversity is calculated, it is set to 0 when no road segment is present, or when the building area is larger than the road area. Most of the images have road diversity = 0 (Fig. 2(c)); i.e., many road scenes include numerous buildings, or do not have an area that is labeled as road. This result indicates that all datasets contain many images that were captured in an urban environment rather than on the highway. Except for the zero bin, the Cityscapes dataset evenly covers the roadscapes of various areas.

Integrated Dataset. In class diversity, our integrated dataset shows the most typical normal distribution, in which the mean value is in the middle of the number of classes. Most experiments assume that the normal distribution is the most common. In class density, the integrated dataset is closest to the normal distribution after GTA-V, and the value of each point in the distribution is high because the number of images is much larger in the integrated dataset than in each of the component datasets. This observation means that the integrated dataset that we propose is more advantageous than the component datasets for learning scene-segmentation models. The road diversity of the integrated dataset represents the common characteristics of the other datasets. In summary, the image-complexity properties of the integrated dataset are not biased to one side, but show roughly the average characteristics of the five component datasets. Depending on the complexity of the field in which the dataset is to be applied, the weight of the dataset that has the corresponding complexity can be increased to create a new integrated dataset that is optimized for a specific research field. For example, if the scene includes a complex environment where a large number of objects appear, the weights can be increased for virtual-image datasets such as SYNTHIA and GTA-V.

5.2   Attribute Analysis of Important Objects

To analyze the object attributes, we selected four objects that are important in the driving situation: persons and cars, as the objects that are involved in the most serious damage in a collision; and traffic lights and traffic signs, which provide the most essential information for driving.

Class Density. The density distributions of persons and traffic lights were even in SYNTHIA and GTA-V, and the density distributions of cars and traffic signs were similar in most datasets. Class density has a higher average value in virtual-image datasets than in real-image datasets, as is true of the object density among the frame attributes.

Object Size Variability. The Cityscapes and Mapillary datasets include variously-sized instances of people, vehicles, traffic lights, and signs. It is useful to use these two datasets for segmentation that is less sensitive to scale changes of the object.

Object Shape Variability. The shape complexities of the important objects do not change much, regardless of dataset. The Cityscapes dataset and the Mapillary dataset have large variances in size, but small variances in shape. This result means that the morphological characteristics of each object do not depend on the size or scale of the image. For extremely small or large instances, the detail of appearance can vary widely, and most datasets include histogram bins for such cases. Sometimes, relatively large traffic lights and traffic signs appear in virtual-image datasets.




Figure 3: Histograms of object attributes for important objects. (a) Distributions of class density. Each object has a diversity of densities in each dataset. (b) Distributions of object size variability. The Cityscapes dataset and the Mapillary dataset contain objects of the most diverse scales. (c) Distributions of object shape variability. The variability of object shape in all datasets is not large; i.e., few images contain extremely large or small objects. (d) Distributions of object intensity variability. More recently constructed datasets show richer colors for each important object. (e) Distributions of geometrical position variability (row); horizontal line: range of image height. (f) Distributions of geometrical position variability (column); horizontal line: range of image width. All important objects exist in various width ranges in most datasets. Analysis of the integrated dataset is described in Section 5.


Object Intensity Variability. Instead of considering each of the RGB values, we consider the intensity value obtained by converting all images to gray images. The average intensity values are calculated in each object region and represented as a histogram. For all important objects, the SYNTHIA, GTA-V, and Mapillary datasets contain instances of much more varied color than the CamVid and Cityscapes datasets. The difference occurs because SYNTHIA, GTA-V, and Mapillary were constructed more recently than CamVid and Cityscapes, and therefore cover more images and more environmental conditions. The SYNTHIA and GTA-V datasets' tools can change various attributes of objects and backgrounds, and the Mapillary dataset was photographed on six continents, so colors vary widely.

Geometrical Position Variability: Row. The last two columns of Fig. 3 show distributions that represent the row and column (col) in which each object appears in the image. The horizontal lines of the histograms represent the image-resolution range (height, width) of each dataset. Persons and traffic signs are mainly located at the middle height of the image, whereas cars and traffic lights are mainly located in the upper part of the image. The SYNTHIA dataset contains more objects at various heights than do the other datasets.

Geometrical Position Variability: Column. In all datasets, most objects exist at various locations from left to right of the image. In particular, the Cityscapes dataset and the Mapillary dataset include many cases in which objects are uniformly present in all column ranges; the range of rows in which important objects exist is limited, but the column range is relatively varied. A dataset with an even distribution of object locations implies a diversity of situations or scenarios.

Integrated Dataset. The distributions of the integrated dataset lie within the range of characteristics of the component datasets. This characteristic holds for the four object attributes (density, size, shape, intensity), so the integrated dataset is much more useful than the component datasets for training bigger models, because the number of objects it contains is much larger. In the integrated dataset, the spatial positions of objects within an image are more uniform, and the absolute numbers of objects at all horizontal and vertical positions are much larger, than in the component datasets. To build a specialized integrated dataset with a specific range of density, size, shape, intensity, and position values for other objects of interest, including the important objects, the ratio of items from each dataset can be adjusted appropriately. For example, if the goal is to segment human regions reliably regardless of size and color, the ratio of the Mapillary dataset in the integrated dataset can be increased.
6   Conclusion

Published datasets for use in semantic scene segmentation have different characteristics, such as the number of classes that have been defined and labeled, the image size, the range of regions in which the images were obtained, the realism of the graphics, and the diversity of the landscapes. Therefore, to learn a deep neural network, many images that include various characteristics should be acquired. In this paper, we compared the basic information of five representative datasets, then analyzed their distribution characteristics by defining three frame attributes and five object attributes. We also performed class matching to construct new datasets that incorporate these five datasets. Statistical results show that the image complexity of the virtual-image datasets (SYNTHIA, GTA-V) is relatively higher than that of the real-image datasets, and that the Cityscapes dataset includes a variety of road scenes. In addition, for certain important objects, the datasets with flat distribution ranges differ for each attribute, so the proportional contribution of each dataset to the integrated dataset should be optimized to best match the situation of the research field to which it is to be applied. In the future, we will analyze how the method of constructing integrated datasets affects segmentation accuracy, and will study how to train deep neural networks by using the integrated datasets to improve accuracy.

References

[Bileschi, ] S. Bileschi. CBCL StreetScenes: towards scene understanding in still images. Technical report, MIT.

[Brostow et al., 2008] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. 2008.

[Brostow et al., 2009] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters, 30(2):88-97, 2009.

[Buda et al., 2017] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. 2017.

[Chawla et al., ] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16.

[Collins et al., 2001] R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456-1477, 2001.

[Cordts et al., 2015] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset. 2015.

[Cordts et al., 2016] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. 2016.

[Drummond and Holte, 2003] C. Drummond and R. C. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. 2003.

[Fei-Fei et al., 2006] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594-611, 2006.

[Griffin et al., ] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, Caltech.

[Haixiang et al., 2016] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications, 73(1):220-239, 2016.

[Han et al., ] H. Han, W. Y. Wang, and B. H. Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing.

[Jaccard et al., 2017] N. Jaccard, T. W. Rogers, E. J. Morton, and L. D. Griffin. Detection of concealed cars in complex cargo X-ray imagery using deep learning. Journal of X-Ray Science and Technology, 25(3):323-339, 2017.

[Janowczyk and Madabhushi, ] A. Janowczyk and A. Madabhushi. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7.

[Martin et al., 2001] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. 2001.

[Miller, 1995] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[Neuhold et al., 2017] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. 2017.

[Perazzi et al., 2016] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. 2016.

[Richter et al., 2016] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: ground truth from computer games. 2016.

[Ros et al., 2016] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. 2016.

[Russell et al., ] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1).

[Shen et al., 2016] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. 2016.

[Shotton et al., 2006] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. 2006.

[Smeaton et al., 2006] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. 2006.

[Technologies, ] U. Technologies. Unity development platform. Technical report.

[Wang et al., ] K. J. Wang, B. Makond, K. H. Chen, and K. M. Wang. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Applied Soft Computing, 20.

[Yao et al., 2007] B. Yao, X. Yang, and S. C. Zhu. Introduction to a large-scale general purpose ground truth database: methodology, annotation tool and benchmarks. 2007.