=Paper=
{{Paper
|id=Vol-2134/paper01
|storemode=property
|title=Attribute Dissection of Urban Road Scenes for Efficient Dataset Integration
|pdfUrl=https://ceur-ws.org/Vol-2134/paper01.pdf
|volume=Vol-2134
|authors=Jiman Kim,Chanjong Park
|dblpUrl=https://dblp.org/rec/conf/ijcai/KimP18
}}
==Attribute Dissection of Urban Road Scenes for Efficient Dataset Integration==
Tenth International Workshop Modelling and Reasoning in Context (MRC) – 13.07.2018 – Stockholm, Sweden

Jiman Kim, Chanjong Park
Samsung Research, Samsung Electronics
{jiman14.kim, cj710.park}@samsung.com

Abstract

Semantic scene segmentation, or scene parsing, is very useful for high-level scene recognition. To improve the performance of scene segmentation, the quantity and quality of the datasets used to train deep networks are important. In other words, we need to consider various external environments and various variations of the predefined objects in terms of image characteristics. In recent years, many datasets for semantic scene segmentation focused on autonomous driving have been released. However, since only quantitative analysis of each dataset is provided, it is difficult to establish an efficient learning strategy that considers the image characteristics of objects. We present definitions of three frame attributes and five object attributes, and analyze their statistical distributions to provide qualitative information about datasets that are to be merged. We also propose an integrated dataset configuration that can exploit the advantages of each dataset for deep network learning after class matching. As a result, we can build new integrated datasets that are optimized for the scene complexity and object properties of the environment by considering the statistical characteristics of each dataset.

1 Introduction

Scene understanding requires information such as the objects that are present in the scene, their characteristics, and the relationships among them. Semantic segmentation provides information on the location and type of objects by dividing an image into regions that contain predefined objects. To apply scene segmentation to autonomous vehicles, many road scene-centered datasets have been released. Representative datasets labeled at the pixel level are CamVid [Brostow et al., 2008; 2009], Cityscapes [Cordts et al., 2015; 2016], SYNTHIA [Ros et al., 2016], GTA-V [Richter et al., 2016], and Mapillary [Neuhold et al., 2017]. The papers that describe each dataset provide statistical information on the image collection environment, area, and devices, the number of images, and the relative proportions of objects. Papers [Perazzi et al., 2016] that compare the scene parsing accuracy of several state-of-the-art algorithms focus on the algorithms' advantages and disadvantages, rather than on the characteristics of the image data. However, the datasets differ in the categories they define and in the image characteristics of the instances contained in them, so efficient learning of a deep network requires detailed analysis of various attributes of the data. For example, some datasets may contain many small objects, some may contain densely distributed objects, and others may show deformed objects. By using these image characteristics, a network can be developed that is highly specialized for a specific environment and object, or that generalizes well to a broad environment by combining characteristics.

In this paper, we analyze the CamVid, Cityscapes, SYNTHIA, GTA-V, and Mapillary datasets quantitatively, based on two kinds of criteria. First, we analyze image-centric criteria such as the average number of categories in the image, the average number of objects, and the average proportion of the image that is a road region. Second, we analyze object-centric criteria, such as the average spatial density of objects in the image, their average size, average shape, average color, and average position in the image. This analysis provides good insight into ways to train deep networks. We also propose a new set of integrated classes that can be used commonly among datasets, and a method to construct an integrated dataset. The integrated dataset contributes to improving the generality of deep networks by including various road environments and object characteristics. This paper has the following structure. Section 2 introduces papers related to published datasets.
Section 3 summarizes each dataset and proposes image- and object-centric attributes. Section 4 proposes a new integrated dataset constructed by class alliance. Section 5 provides a detailed comparative analysis of the proposed attributes, and suggests insights for constructing integrated datasets. Section 6 summarizes all findings and contributions of this paper.

2 Related Work

Road scene-centric public datasets for pixel-level semantic segmentation have been released (Table 1, Fig. 1), together with papers that describe them. CamVid [Brostow et al., 2008; 2009] was the first dataset to provide a semantic object-class label for each pixel. The images were acquired from the perspective of a driving automobile; they are divided into 32 semantic classes with manually-annotated labels. To reduce the labeling effort, the authors proposed joint tracking of keypoints and regions; this method propagates the label information to the 100 subsequent frames. The set includes the camera's 3D pose in each frame and has a software tool that users can use to label their own additional images. The publicly available datasets [Martin et al., 2001; Fei-Fei et al., 2006; Bileschi; Shotton et al., 2006; Smeaton et al., 2006; Griffin et al.; Yao et al., 2007; Russell et al.] that preceded CamVid have polygon-level rather than pixel-level labels, and their images were obtained from fixed CCTV-style cameras. The CamVid paper provides statistical information on the percentage of the image occupied by objects in each sequence, and on the number of occurrences.

Cityscapes [Cordts et al., 2015; 2016] is a large-scale dataset that includes complex real-world urban scenes.
Cityscapes has image data that are labelled at the pixel level and at the instance level. The images were acquired in 50 cities to cover a variety of road environments. The authors provided statistical analyses by grouping the 30 classes into eight categories. The results describe the number and relative ratios of annotated pixels of each class, the annotation density, the distribution of the number of traffic-related instances per image, and the distribution of the number of vehicles as a function of distance.

SYNTHIA [Ros et al., 2016] is a dataset of synthetic images obtained from a virtual world built on the Unity development platform [Technologies]. The images were captured from multiple viewpoints by two multi-camera rigs of four monocular cameras each, mounted on a virtual car. The images cover different seasons, weather, and illumination conditions. The captured images were annotated with 11 predefined classes. In experiments, the authors showed that combining a real dataset with the SYNTHIA dataset dramatically increases the accuracy of semantic segmentation.

Grand Theft Auto V (GTA-V) [Richter et al., 2016] consists of images captured from a computer game. The authors proposed a method to generate semantic label maps quickly: each image is automatically divided into patches, which are then merged using MTS (mesh, texture, shader) information, and a semantic class is manually assigned to each patch. In a brief space of time, this method yields far more pixel labeling than previous datasets. When virtual images generated by the proposed method were added to real-world images, segmentation accuracy improved greatly, even when a large number of real-world images were replaced by virtual ones.
The related paper provided statistical information on the number of labeled pixels, the annotation density, and the time and speed of labeling.

Mapillary [Neuhold et al., 2017] is the dataset that contains the most real-world images and the largest number (66) of categories. The images were captured by photographers of varying experience using a variety of imaging devices. The covered regions span Europe, North and South America, Asia, Africa, and Oceania, and the scenes include urban, countryside, and off-road scenes. Manual annotation was performed with polygons by specialized image annotators. Statistical analyses performed by the authors include image resolution, focal length, number of images taken with each acquisition device, region where the images were acquired, number of instances per class, number of objects per image, number of traffic-regulation objects per image, and number of traffic participants per image.

These papers mainly analyzed how often each class appeared in each image. They also focused on the number and proportions of the major classes that are closely related to traffic. If detailed and well-organized information on the frame and object sides were available, it would improve the learning efficiency of deep networks. Therefore, in this work, we perform a detailed characterization of each dataset to derive such insight. Also, to enable simultaneous use of two or more datasets with different class counts and types, we define a set of commonly usable classes and propose a way to combine datasets efficiently.

3 Attribute Analysis

Scene segmentation, or scene parsing, at the pixel level extracts the boundaries of many kinds of objects and thus solves object detection and localization simultaneously. To achieve high pixel-level segmentation accuracy, large-scale datasets are required; they must include a variety of shape and appearance variations of static objects (backgrounds) and dynamic objects (foregrounds). Therefore, the construction of datasets that focus on road scenes has moved toward increased resolution and numbers of images, and toward an increased variety of environments.

Trends of Dataset. The datasets constructed and released for this goal are described in Table 1. Higher resolution and a larger number of images are two common trends in the construction of road scene-centric datasets. The increase in the resolution of the collected images is closely related to pixel-level accuracy. In addition, virtual-environment tools have been used to collect a large number of images in a short time. In particular, the volume of real images in the Mapillary dataset was increased sharply by a community-led service for sharing street-level photographs. Diversification of the environments that the images represent has yielded datasets from different regions and environments, and recently-constructed datasets include increasing diversity of regions and of environmental conditions. Real images are much more difficult to obtain than virtual images, and the continental, regional, and environmental conditions in which the images are acquired have become very diverse in the Cityscapes and Mapillary datasets. Each dataset has different properties (Table 1). The CamVid dataset was the first dataset focused on road scenes; it contains many clear-lane highway images. The Cityscapes dataset includes images that are specific to European urban scenes. The SYNTHIA dataset has many virtual images covering multiple seasons. The GTA-V dataset's virtual images are extremely realistic, and its effects are richly controllable. The Mapillary dataset contains the largest number of images, collected in the broadest variety of regions.

| Name | Year | Classes | Resolution | RGB Images (Train/Val/Test) | GT Images (Train/Val/Test) | Description |
|------|------|---------|------------|-----------------------------|----------------------------|-------------|
| CamVid | 2008, 2009 | 32 | 960 × 720 (unified size) | 701 (367/101/233) | 701 (367/101/233) | Real image, normal light/weather |
| Cityscapes (fine) | 2015, 2016 | 30 | 2,048 × 1,024 (unified size) | 5,000 (2,975/500/1,525) | 3,475 (2,975/500/–) | Real image (50 cities), normal light/weather |
| SYNTHIA (cityscapes) | 2016 | 23 | 1,280 × 760 (unified size) | 9,400 | 9,400 | Virtual image, dynamic light/weather |
| GTA-V | 2016 | 34 | 1,914 × 1,052 (unified size) | 24,966 | 24,966 | Virtual image, dynamic light/weather |
| Mapillary | 2017 | 66 | 3,420 × 2,480 (averaged size) | 25,000 (18,000/2,000/5,000) | 20,000 (18,000/2,000/–) | Real image (6 continents), dynamic light/weather |
| Integration | 2018 | 30 | 2,048 × 1,024 (unified size) | 65,067 (41,961/6,038/17,068) | 58,542 (41,961/6,038/10,543) | Real/virtual image, dynamic light/weather |

Table 1: Quantitative summary of various datasets for semantic scene segmentation. The number of classes of each dataset includes the 'void' class. GTA-V and Mapillary contain images of different sizes. The images of SYNTHIA and GTA-V are not divided into training, validation, and test sets. Recently released datasets include more classes and higher-resolution images. Also, many virtual tools that simulate road-scene environments have been released to increase the number of virtual images under various conditions, because collecting real images has a high cost. The integrated dataset is based on the Cityscapes dataset (Table 3) but has many more images.

Figure 1: Example images of the five datasets. First row: randomly-selected original images (RGB) from each dataset; second row: the corresponding ground-truth images. The CamVid and Mapillary datasets provide ground-truth color values for each class; the Cityscapes, SYNTHIA, and GTA-V datasets also provide label images in which a different integer index value is assigned to the pixels of each class. Each dataset includes different types of urban road scenes, and various types and sizes of objects.
Attribute Definition. We defined two types of criteria to analyze the attributes of the five representative road-scene segmentation datasets from an image-frame (still-shot) perspective, leaving aside the collection method and environment. One type comprises metrics computed per image frame, and the other comprises object metrics (Table 2). For each metric, we computed the mean value and its distribution. The resulting information on image complexity and object diversity can be used to construct new datasets with different goals.

Metrics that explain scene complexity from an image-frame perspective are class diversity, object density, and road diversity. Class diversity is the distribution of the number of distinct classes appearing per frame; from it, the diversity of objects in a scene can be determined. Object density is the distribution of the total number of objects appearing per frame, and explains how many objects concentrate in a scene. Road diversity is the distribution of the relative ratio of road area to building area; from it, we can estimate whether a scene is a highway or a city center.

Metrics that explain the extrinsic variability of objects of each class from an object perspective are class density, object size variability, object shape variability [Collins et al., 2001], object intensity (one-channel color) variability, and geometrical position variability. Class density is the distribution of the number of objects of a specific class per frame, and represents how many objects of that class exist in a scene. Object size/shape/intensity variability are the distributions of the external appearance of objects of a specific class, and show how object appearance varies across scenes. Geometrical position variability is the distribution of object positions in scenes; it explains which positions are the major regions of interest.

| Attribute type | Attribute | Definition | Explanation |
|----------------|-----------|------------|-------------|
| Frame | Class Diversity | $\frac{1}{N}\sum \#(\text{Classes})$ | how diverse the objects in a scene are |
| Frame | Object Density | $\frac{1}{N}\sum \#(\text{Objects})$ | how many objects concentrate in a scene |
| Frame | Road Diversity | $\frac{1}{N}\sum \max\left(\frac{Area_R - Area_B}{Area_R}, 0\right)$ | how diverse the road scenes are |
| Object | Class Density | $\frac{1}{N}\sum \#(\text{Objects}_i)$ | how many objects of the class exist in a scene |
| Object | Object Size Variability | $\frac{1}{N}\sum Size_i$ | how the size of the object varies across scenes |
| Object | Object Shape Variability | $\frac{1}{N}\sum Dispersedness_i$ | how the shape of the object varies across scenes |
| Object | Object Intensity Variability | $\frac{1}{N}\sum Intensity_i$ | how the intensity of the object varies across scenes |
| Object | Geometrical Position Variability | $\frac{1}{N}\sum (X_{c_i}, Y_{c_i})$ | where the object is most likely to appear |

Table 2: Summary of frame/object attributes. We propose three frame attributes and five object attributes. Frame attributes are used to analyze the scene complexity of each dataset; object attributes are used to understand the extrinsic variability of the objects of each class. $N$: total number of image frames of each dataset; $i$: index over the objects of the corresponding class. $Area_R$ and $Area_B$: area of road and building in each image frame, respectively. If $Area_R < Area_B$, road diversity is set to 0.
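As a concrete illustration of Table 2, the three frame attributes can be computed per frame from a ground-truth label map. Below is a minimal Python sketch, not the authors' implementation: it assumes integer-indexed label images (as provided by Cityscapes, SYNTHIA, and GTA-V), hypothetical class indices `ROAD_ID` and `BUILDING_ID`, and connected-component counting as a proxy for the per-frame object count.

```python
import numpy as np
from scipy import ndimage

# Hypothetical class indices; the real values depend on each dataset's label definition.
ROAD_ID, BUILDING_ID = 0, 11

def frame_attributes(label_map):
    """Compute the three frame attributes of Table 2 for one label image."""
    class_ids = np.unique(label_map)
    class_diversity = len(class_ids)          # number of distinct classes in the frame

    # Object density: every connected segment of every class counts as one object,
    # which matches the observation in Section 5 that partial segments are labeled separately.
    object_density = 0
    for c in class_ids:
        _, n_segments = ndimage.label(label_map == c)
        object_density += n_segments

    # Road diversity: relative surplus of road area over building area,
    # clipped to 0 when no road is labeled or buildings dominate.
    area_r = np.count_nonzero(label_map == ROAD_ID)
    area_b = np.count_nonzero(label_map == BUILDING_ID)
    road_diversity = max((area_r - area_b) / area_r, 0.0) if area_r > 0 else 0.0

    return class_diversity, object_density, road_diversity

# The dataset-level attribute is the mean over all N frames, e.g.:
#   values = np.array([frame_attributes(lm) for lm in label_maps])
#   class_div, obj_den, road_div = values.mean(axis=0)
```

The per-frame values are then binned into histograms, and each histogram is normalized to [0, 1] by its maximum bin, as described in Section 5.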
4 Dataset Integration

Class Alliance for Scene Integration. Each country has different object attributes, road-surface properties, rules of the road, traffic patterns (traffic signs and signals), and climate conditions. If the image characteristics used for training and testing a deep neural network differ, this diversity is a major cause of degraded semantic scene segmentation accuracy. Quickly constructing a dataset that includes all varieties of road scenes is a real challenge, so the most reasonable approach is to efficiently integrate the released datasets collected in different regions. A model trained on such an integrated dataset, which fully represents the diversity of road scenes, can be a very good initial model from which to create a model optimized for a specific environment.

To build an integrated dataset, we consider the 30 classes and 8 groups of the Cityscapes dataset. The authors of the Cityscapes dataset selected 30 classes for the road scene and grouped them semantically by referring to WordNet [Miller, 1995]. Cityscapes is a reasonable basis for matching classes between datasets because it has roughly the average number of classes among the five datasets considered here. We performed a semantic comparison between the classes of each dataset, which range in number from 11 to 66, and the classes defined in the Cityscapes dataset (Table 3). Usually, fewer than 30 classes of each of the other datasets correspond to superclasses of Cityscapes's classes, but 1:1 matching with the most suitable Cityscapes class was accomplished without any division. If a dataset has more than 30 classes, the extra classes are usually subclasses of the 30 Cityscapes classes, so we matched them to the semantically higher Cityscapes class. In this way, the images of each dataset can be unified to construct a large-scale dataset of N images that represents more than M urban environments. For the integrated dataset based on the common classes, we performed the image-based and object-based analysis of Section 3 and observed how the characteristics changed (Section 5).
| Cityscapes: Base | CamVid | SYNTHIA | GTA-V | Mapillary |
|------------------|--------|---------|-------|-----------|
| 01. Road | Road, Road Shoulder, Lane Markings Drivable | Road, Lanemarking | Road | Road, Pothole, Lane, Service Lane, General Lane Marking |
| 02. Sidewalk | Sidewalk | Sidewalk | Sidewalk | Sidewalk, Pedestrian Area, Curb, Curb Cut |
| 03. Parking | Parking Block | Parking Slot | – | Parking |
| 04. Rail Track | – | – | – | Rail Track |
| 05. Person | Child, Pedestrian | Pedestrian | Person | Person |
| 06. Rider | Bicyclist | Rider | Rider | Bicyclist, Motorcyclist, Other Rider |
| 07. Car | Car | Car | Car | Car |
| 08. Truck | SUV/Pickup Truck | Truck | Truck | Truck |
| 09. Bus | Truck/Bus | Bus | Bus | Bus |
| 10. On Rails | Train | Train | Train | On Rails |
| 11. Motorcycle | Motorcycle/Scooter | Motorcycle | Motorcycle | Motorcycle |
| 12. Bicycle | – | Bicycle | Bicycle | Bicycle |
| 13. Caravan | – | – | – | Caravan |
| 14. Trailer | – | – | Trailer | Trailer |
| 15. Building | Building | Building | Building | Building |
| 16. Wall | Wall | Wall | Wall | Wall |
| 17. Fence | Fence | Fence | Fence | Fence |
| 18. Guardrail | – | – | Guardrail | Guardrail, Barrier |
| 19. Bridge | Bridge | – | Bridge | Bridge |
| 20. Tunnel | Tunnel | – | Tunnel | Tunnel |
| 21. Pole | Column/Pole | Pole | Pole | Pole, Utility Pole, Street Light, Traffic Sign Frame |
| 22. Pole Group | – | – | – | – |
| 23. Traffic Sign | Sign/Symbol | Traffic Sign | Traffic Sign | Traffic Sign Front |
| 24. Traffic Light | Traffic Light | Traffic Light | Traffic Light | Traffic Light |
| 25. Vegetation | Tree, Vegetation Misc | Vegetation | Vegetation | Vegetation |
| 26. Terrain | – | Terrain | Terrain | Terrain, Sand |
| 27. Sky | Sky | Sky | Sky | Sky |
| 28. Ground | Non-Drivable | – | – | Crosswalk Plain, Crosswalk Zebra, Water |
| 29. Dynamic | Animal, Cart/Luggage/Pram, Other Moving | – | – | Bird, Animal, Trash Can, Boat, Wheeled Slow, Other Vehicle |
| 30. Static | Archway, Misc Text, Traffic Cone, Void | Road-Work, Void | Ego Vehicle, Static, Void | Ego Vehicle, Car Mount, Mountain, Snow, Banner, Billboard, CCTV Camera, Traffic Sign Back, Catch Basin, Manhole, Fire Hydrant, Bench, Bike Rack, Junction Box, Mailbox, Phone Booth, Unlabeled |

Table 3: Class matching table of the Cityscapes dataset (8 categories, 30 classes) and the other datasets: Object (01–24) and Nature (25–30). We assigned the classes of the four other datasets (CamVid, SYNTHIA, GTA-V, Mapillary) to the 30 classes by referring to the class definitions of the Cityscapes dataset.
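Operationally, the class alliance of Table 3 reduces to a per-dataset lookup table that remaps source label indices to the 30 unified Cityscapes-based classes. The sketch below is illustrative only: the source index values are hypothetical placeholders, since the real ones come from each dataset's label definition, and here unmatched indices fall through to class 30 ('Static'), which in Table 3 absorbs each dataset's void/unlabeled classes.

```python
import numpy as np

# Illustrative fragment of the Table 3 mapping for one source dataset.
# The source indices (left) are hypothetical placeholders.
SOURCE_TO_UNIFIED = {
    13: 1,   # Road     -> 01. Road
    23: 1,   # Lane     -> 01. Road
    15: 2,   # Sidewalk -> 02. Sidewalk
    9:  2,   # Curb     -> 02. Sidewalk
    19: 5,   # Person   -> 05. Person
    55: 7,   # Car      -> 07. Car
}
STATIC = 30  # unmatched labels fall through to '30. Static'

def remap_labels(label_map, table, default=STATIC):
    """Remap a source label image to the unified 30-class index space."""
    lut = np.full(256, default, dtype=np.uint8)  # assumes at most 256 source classes
    for src, dst in table.items():
        lut[src] = dst
    return lut[label_map]                        # vectorized per-pixel lookup
```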
All datasets share most of Cityscapes's classes; in particular, the important classes (road, human, vehicle, traffic sign/light) are included in all datasets. GTA-V does not publish class information, so we checked class names manually (we could not find 8 classes). The Mapillary dataset divides the 'static' class into many sub-classes.

Sampling Methods for Image Integration. To build an integrated dataset, the images from each dataset must be mixed appropriately after the classes are unified. In this paper, we propose six image-sampling methods to combine images from different datasets; the resulting per-dataset image budgets are sketched in code after this list. The second and third methods aim at balancing the numbers of images between datasets, and the fourth through sixth aim at building an integrated dataset that is optimized for a specific purpose.

• Naive Integration: The simplest method to integrate datasets is to merge all the images of all datasets at a unified image size. This method retains the original image data of each dataset, but naturally the characteristics of datasets with large image quantities become dominant.

• Randomized Undersampling: Undersampling is one of the most commonly used methods to match the number of images among classes or among datasets [Buda et al., 2017; Haixiang et al., 2016; Drummond and Holte, 2003]. It randomly selects from each dataset a number of images equal to the size of the smallest dataset. The integrated dataset consists of $\min(N_m) \times M$ images, where $N_m$ is the number of images of the $m$-th dataset and $M$ is the number of datasets. Undersampling is intuitive and easy to use, but it has the drawback of not exploiting the large number of residual images.

• Randomized Oversampling: Oversampling is another frequently-used method [Buda et al., 2017; Haixiang et al., 2016; Janowczyk and Madabhushi; Jaccard et al., 2017]. It randomly selects images, allowing duplicates, so that every dataset matches the dataset that has the largest number of images. The integrated dataset consists of $\max(N_m) \times M$ images, where $N_m$ is the number of images in the $m$-th dataset and $M$ is the number of datasets. Overfitting may occur in some cases [Chawla et al.; Wang et al.], but variations exist that reduce this problem [Chawla et al.; Han et al.; Shen et al., 2016]. Oversampling is the most common method to obtain the largest number of training images.

• Diversity Oriented Sampling: In this method, the larger the average number of classes contained in a dataset's images, the more of its images are reflected in the integrated dataset. The integrated dataset consists of $\sum_{m=1}^{M} w_m^{CD} \times \max(N_m)$ images, where $w_m^{CD} = CD_m / \sum_{m=1}^{M} CD_m$ is the weight of the $m$-th dataset, $CD_m$ is its average class diversity, and $M$ is the number of datasets. The number of images that can be sampled from any dataset is thus limited to $\max(N_m)$. This sampling method enables construction of an integrated dataset that best adapts to the variety of static/dynamic backgrounds of the target environment. As a variation, an integrated dataset may be constructed by selecting only images that contain more classes than an average number specified by the user.

• Density Oriented Sampling: An integrated dataset can also be built that is optimized for the object density of the target environment. An integrated dataset that closely represents the images of a dense dataset consists of $\sum_{m=1}^{M} w_m^{OD} \times \max(N_m)$ images, where $w_m^{OD} = OD_m / \sum_{m=1}^{M} OD_m$ is the weight of the $m$-th dataset, $OD_m$ is its average object density, and $M$ is the number of datasets. A modified method constructs the integrated dataset by selecting only images whose density exceeds an average density that the user desires.

• Target Oriented Sampling: If the goal is to extract a specific target object accurately, the integrated dataset must have images that contain as many of the target objects as possible in one scene. In addition, constructing the training set with a uniform distribution over each object attribute, that is, with as many variations as possible, yields a model that is insensitive to changes in the attributes of the target object.
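The image budgets of these sampling methods can be written down in a few lines. The sketch below uses the ground-truth image counts from Table 1; the class-diversity values are placeholders for illustration, not measured results.

```python
import numpy as np

# Images with ground truth per dataset: CamVid, Cityscapes, SYNTHIA, GTA-V, Mapillary.
n = np.array([701, 3475, 9400, 24966, 20000])
M = len(n)

under = n.min() * M     # randomized undersampling: min(N_m) x M images in total
over  = n.max() * M     # randomized oversampling:  max(N_m) x M images in total

# Diversity-oriented sampling: weight each dataset by its average class diversity CD_m.
cd = np.array([15.0, 15.0, 17.0, 15.0, 16.0])        # placeholder CD_m values
w_cd = cd / cd.sum()                                  # w_m = CD_m / sum(CD_m)
per_dataset = np.round(w_cd * n.max()).astype(int)    # contribution of each dataset
# The weights sum to 1, so the total equals max(N_m), and no dataset exceeds max(N_m).

# Density-oriented sampling is identical with the average object density OD_m
# substituted for CD_m.
```

Where a dataset's budget exceeds its actual image count, images would be drawn with duplicates, as in oversampling.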
5 Experiments

Six datasets were used for the analysis of frame and object attributes: CamVid (701 images), Cityscapes (3,475 images), SYNTHIA (9,400 images), GTA-V (24,966 images), Mapillary (20,000 images), and the integrated dataset proposed in Section 4 (58,542 images). Frame attributes were evaluated for each image frame, regardless of class, and object attributes were evaluated individually for each class in each dataset.

The three frame attributes indicate the number and variety of objects present in the image frames of each dataset. The attribute values were calculated individually from the image frames (Table 2), and the distribution of each attribute was expressed as a histogram (Fig. 2, Fig. 3). To compare the variances of the distributions, we normalized each histogram to [0, 1] by dividing each bin by the maximum bin value. We used the naive integration method to construct the integrated dataset for this analysis. Image size or resolution does not significantly affect the frame attributes, but image size can affect the object attributes. However, the variance of an attribute's distribution and the image resolution itself are characteristics of the image-acquisition devices used for each dataset, so we display the original distributions without image-size normalization and compare their variances by considering the absolute size ranges. We used the target-oriented sampling method to construct another integrated dataset and computed the object attributes based on the Cityscapes dataset's image size.

Figure 2: Histograms of frame attributes. (a) Distribution of class diversity for the six datasets, including the integrated dataset. Horizontal black line: number of classes of each dataset; red line: position of the mean value. Considering that the number of classes in the Mapillary dataset is twice that of the other datasets, all datasets exhibit a similar variance. (b) Distribution of object density. Images in the virtual-image datasets generally contain more, and more varied, objects than the other datasets. (c) Distribution of road diversity. On average, most datasets contain images in which the road area is smaller than the building area, i.e., many urban scenes with buildings rather than highways or countryside. Images in the Cityscapes dataset have the most diverse road area.

5.1 Relative Analysis of Frame Attributes

Class Diversity. The Mapillary dataset contains the largest number of classes per frame on average, but this result occurs because the number of classes defined in the Mapillary dataset is much higher than in any other dataset. If we unify the number of classes to the minimum and calculate the relative ratio, the SYNTHIA dataset includes the largest number of classes per frame on average, and the remaining datasets include an average of 15 classes per frame. The variance of the GTA-V dataset is the largest, which means that the number of classes present in one frame varies the most, from the smallest to the largest.

Object Density. The range of object density was larger than expected; the reason is that a segmentation label is also assigned to every small segment that is only part of an object. The average object density varies slightly among datasets; the GTA-V dataset contains an average of 230 object segments per image. On average, the datasets that contain virtual images contain more objects per scene than the datasets that contain real images. Thus, a virtual-image dataset can be used to increase the complexity of the scenes.

Road Diversity. Road diversity is set to 0 when no road segment is present or when the building area is larger than the road area. Most of the images have road diversity = 0 (Fig. 2(c)); i.e., many road scenes include numerous buildings, or have no area labeled as road. This result indicates that all datasets contain many images captured in urban environments rather than on highways. Except for the zero bin, the Cityscapes dataset evenly covers the roadscapes of various areas.

Integrated Dataset. In class diversity, our integrated dataset shows the most typical normal distribution, in which the mean value is in the middle of the range of class counts; most learning experiments assume such a normal distribution. In object density, the integrated dataset is the closest to a normal distribution after GTA-V, and the value of each point in the distribution is high because the integrated dataset contains many more images than any component dataset. This observation means that the proposed integrated dataset is more advantageous than the component datasets for learning scene segmentation models. The road diversity of the integrated dataset represents the common characteristics of the other datasets. In summary, the image-complexity properties of the integrated dataset are not biased to one side, but show approximately the average characteristics of the five component datasets. Depending on the complexity of the field in which the dataset is to be applied, the weight of the dataset that has the corresponding complexity can be increased to create a new integrated dataset optimized for a specific research field. For example, if the target scenes constitute a complex environment where a large number of objects appear, the weights of virtual-image datasets such as SYNTHIA and GTA-V can be increased.
5.2 Attribute Analysis of Important Objects

To analyze the object attributes, we selected four objects that are important in driving situations: persons and cars, as the objects that suffer the most serious damage in a collision, and traffic lights and traffic signs, which provide the most essential information for driving.

Figure 3: Histograms of object attributes for the important objects. (a) Distributions of class density; each object has a diversity of densities in each dataset. (b) Distributions of object size variability; the Cityscapes and Mapillary datasets contain objects of the most diverse scales. (c) Distributions of object shape variability; the variability of object shape is not large in any dataset, i.e., few images contain extremely large or small objects. (d) Distributions of object intensity variability; more recent datasets show richer colors for each important object. (e) Distributions of geometrical position variability (row); horizontal line: range of image height. (f) Distributions of geometrical position variability (column); horizontal line: range of image width. All important objects exist over various width ranges in most datasets. The analysis of the integrated dataset is described in Section 5.

Class Density. The density distributions of persons and traffic lights were even in SYNTHIA and GTA-V, and the density distributions of cars and traffic signs were similar in most datasets. Class density has a higher average value in the virtual-image datasets than in the real-image datasets, as is also true of the object-density frame attribute.

Object Size Variability. The Cityscapes and Mapillary datasets include variously-sized instances of people, vehicles, traffic lights, and signs. These two datasets are therefore useful for segmentation that must be less sensitive to changes in object scale.

Object Shape Variability. The shape complexities of the important objects do not change much, regardless of the dataset. The Cityscapes and Mapillary datasets have large variances in size, but small variances in shape. This result means that the morphological characteristics of each object do not depend on the size or scale of the image. For extremely small or large instances, the detail of the appearance can vary widely, and most datasets include histogram bins for such cases. Occasionally, relatively large traffic lights and traffic signs appear in the virtual-image datasets.

Object Intensity Variability. Instead of considering each of the RGB values, we consider the intensity value obtained by converting all images to gray images. The average intensity value is calculated in each object region and represented as a histogram. For all important objects, the SYNTHIA, GTA-V, and Mapillary datasets contain instances of much more varied color than the CamVid and Cityscapes datasets. The difference occurs because SYNTHIA, GTA-V, and Mapillary were constructed more recently than CamVid and Cityscapes, and therefore cover more images and more environmental conditions. The SYNTHIA and GTA-V tools can change various attributes of objects and backgrounds, and the Mapillary dataset was photographed on six continents, so its colors vary widely.
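The intensity attribute described above can be sketched as follows; the luma coefficients used for the gray conversion are the common ITU-R BT.601 weights, an assumption, since the text only says that images are converted to gray.

```python
import numpy as np
from scipy import ndimage

def object_mean_intensities(rgb, label_map, class_id):
    """Mean gray intensity of each connected object region of class_id in one frame."""
    # Standard luma conversion (assumed; the paper only specifies 'gray images').
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    regions, n = ndimage.label(label_map == class_id)
    if n == 0:
        return np.array([])
    # One mean intensity per object region; these values are then histogrammed
    # over all frames of a dataset.
    return ndimage.mean(gray, labels=regions, index=np.arange(1, n + 1))
```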
Geometrical Position Variability: Row. The last two columns of Fig. 3 show distributions of the row and column (col) at which each object appears in the image. The horizontal lines of the histograms represent the image resolution range (height, width) of each dataset. Persons and traffic signs are mainly located at the middle height of the image, whereas cars and traffic lights are mainly located in the upper part of the image. The SYNTHIA dataset contains more objects at various heights than the other datasets.

Geometrical Position Variability: Column. In all datasets, most objects exist at various locations from the left to the right of the image. In particular, the Cityscapes and Mapillary datasets include many cases in which objects are uniformly present over the whole column range. The range of rows in which the important objects exist is limited, but the column range is relatively varied. A dataset with an even distribution of object locations implies a diversity of situations or scenarios.

Integrated Dataset. The distributions of the integrated dataset lie within the range of characteristics of the component datasets. This holds for four object attributes (density, size, shape, intensity), and because the integrated dataset contains many more objects than the component datasets, it is much more useful than they are for training bigger models. In the integrated dataset, the spatial position of objects within an image is more uniform, and the absolute number of objects at all horizontal and vertical positions is much larger, than in the component datasets. To build a specialized integrated dataset with a specific range of density, size, shape, intensity, and position values for other objects of interest, including the important objects, the ratio of images taken from each dataset can be adjusted appropriately. For example, if the goal is to segment human regions reliably regardless of size and color, the ratio of the Mapillary dataset in the integrated dataset can be increased.

6 Conclusion

Published datasets for semantic scene segmentation have different characteristics, such as the number of classes that have been defined and labeled, the image size, the range of regions in which the images were obtained, the realism of the graphics, and the diversity of the landscapes. Therefore, to train a deep neural network, many images that cover various characteristics should be acquired. In this paper, we compared the basic information of five representative datasets, then analyzed their distribution characteristics by defining three frame attributes and five object attributes. We also performed class matching to construct new datasets that incorporate these five datasets. The statistical results show that the image complexity of the virtual-image datasets (SYNTHIA, GTA-V) is relatively higher than that of the real-image datasets, and that the Cityscapes dataset includes a wide variety of road scenes. In addition, for certain important objects, the datasets with flat distribution ranges differ for each attribute, so the proportional contribution of each dataset to the integrated dataset should be optimized to best match the situation of the research field to which it is to be applied. In the future, we will analyze how the method of constructing integrated datasets affects segmentation accuracy, and will study how to train deep neural networks with the integrated datasets to improve accuracy.
References

[Bileschi] S. Bileschi. CBCL StreetScenes: towards scene understanding in still images. Technical report, MIT.

[Brostow et al., 2008] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. 2008.

[Brostow et al., 2009] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.

[Buda et al., 2017] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. 2017.

[Chawla et al.] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16.

[Collins et al., 2001] R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456–1477, 2001.

[Cordts et al., 2015] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset. 2015.

[Cordts et al., 2016] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. 2016.

[Drummond and Holte, 2003] C. Drummond and R. C. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. 2003.

[Fei-Fei et al., 2006] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

[Griffin et al.] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, Caltech.

[Haixiang et al., 2016] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications, 73(1):220–239, 2016.

[Han et al.] H. Han, W. Y. Wang, and B. H. Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing.

[Jaccard et al., 2017] N. Jaccard, T. W. Rogers, E. J. Morton, and L. D. Griffin. Detection of concealed cars in complex cargo X-ray imagery using deep learning. Journal of X-Ray Science and Technology, 25(3):323–339, 2017.

[Janowczyk and Madabhushi] A. Janowczyk and A. Madabhushi. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7.

[Martin et al., 2001] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. 2001.

[Miller, 1995] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[Neuhold et al., 2017] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. 2017.

[Perazzi et al., 2016] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. 2016.

[Richter et al., 2016] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: ground truth from computer games. 2016.

[Ros et al., 2016] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. 2016.

[Russell et al.] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1).

[Shen et al., 2016] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. 2016.

[Shotton et al., 2006] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. 2006.
[Smeaton et al., 2006] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. 2006.

[Technologies] Unity Technologies. Unity development platform. Technical report.

[Wang et al.] K. J. Wang, B. Makond, K. H. Chen, and K. M. Wang. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Applied Soft Computing, 20.

[Yao et al., 2007] B. Yao, X. Yang, and S. C. Zhu. Introduction to a large-scale general purpose ground truth database: methodology, annotation tool and benchmarks. 2007.