    Global-Local Feature Fusion for Image Classification of Flood
              Affected Roads from Social Multimedia
                                       Benjamin Bischke1, 2 , Patrick Helber1, 2 , Andreas Dengel1, 2
                                                             1 TU Kaiserslautern, Germany
                                       2 German Research Center for Artificial Intelligence (DFKI), Germany


Copyright held by the owner/author(s).
MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This paper presents the solution of the DFKI team for the Multimedia Satellite Task 2018 at MediaEval. We address the challenge of classifying social multimedia with respect to road passability during flooding events. Information about road passability is an important aspect of emergency response and has not been well studied in the past. In this paper, we primarily investigate visual classification based on global, local, and global-local fused image features. We show that local features of objects can be used efficiently for road passability classification and achieve results with local features comparable to those with global features. When we fused global and local visual features, we did not achieve a significant improvement over global features alone, but we see a lot of potential for future research in this direction.

Figure 1: Local image features of objects provide strong evidence for the classification of road passability (left: passable, right: non-passable).

1 INTRODUCTION
The Multimedia Satellite Task 2018 [3] continues to focus on flooding events as in last year's Task 2017 [2], since, among high-impact natural disasters, flooding events represent, according to the United Nations Office for the Coordination of Humanitarian Affairs, the most common type of disaster worldwide. The task looks at road passability, namely whether or not it is possible to travel through a flooded region. This work focuses on social multimedia and is based on the benchmark dataset, which contains 7,387 tweets with accompanying images and labels for evidence of road passability as well as the actual road passability (passable vs. non-passable).

2 APPROACH
Our solution for classifying tweets with respect to road passability follows a two-step approach. We first categorize all images that provide evidence for road passability during a flooding event and then classify the relevant images with respect to road passability. Our approach is based only on the visual modality, since we could not obtain any meaningful results by taking the metadata of the tweets (e.g. text, location) into consideration.

2.1 Evidence classification of flood passability
The approach for the evidence classification of images relies on last year's solution [1] for the Multimedia Satellite Task 2017 [2]. The goal of that challenge was to retrieve all images from a Flickr dataset that provide evidence of a flooding event. We applied a pre-trained CNN to obtain the feature representation of images and used an SVM with a radial basis function (RBF) kernel as classifier. One important insight of our approach was the importance of the dataset on which the network was pre-trained. We achieved a significant improvement when relying on a network that was trained on scene-level information rather than on object classes as in the ImageNet dataset. Building upon this approach, we evaluated models pre-trained on different datasets containing scene-level and object-level classes for the visual classification of flood passability evidence. We achieved the best results on our internal validation set with features extracted from a Wide-ResNet38 pre-trained on Places365 [8], and obtained an improvement of 3% over the features of a ResNet152 pre-trained on ImageNet [6]. These findings are in line with the insights from last year's solution [1].

2.2 Flood passability image classification
In this paper we investigate three strategies for the road passability classification of images. We use an SVM (RBF kernel) as classifier and visual features based on the following approaches:
     (1) Global features of CNNs pre-trained on Places365 [8], ImageNet [6] and the Visual Sentiment Ontology (VSO) [4]
     (2) Local features of objects extracted with Faster R-CNN [7] pre-trained on Pascal VOC [5]
     (3) Fusion of global and local features

Global Features.
We follow the same approach as described in Section 2.1 and extract global image features with pre-trained CNNs. We analyzed models pre-trained on the ImageNet [6], Places365 [8] and VSO [4] datasets and obtained the best results with scene-level features (VSO and Places365) on the internal validation set (see Table 1).

Local Features.
In our second strategy we investigated local image features. Our hypothesis is that local features corresponding to objects and their surrounding context, such as the cars, persons, and traffic signs shown in Figure 1, provide strong evidence for the discrimination of road passability. We trained the object detection network Faster R-CNN [7] on the Pascal VOC dataset [5] and applied it to the images of the provided Twitter dataset. Whenever Faster R-CNN identified an instance of one of the classes C = {bus, boat, person, car},
we cropped a small patch out of the image based on the bounding box of the particular object. We combined the patches of the bus, car and boat classes into one dataset and resized all patches to the same size of 224x224 pixels. The three classes covered 45% of the images in the development set with at least one object. Based on the created dataset, we trained a CNN with two road passability classes that follows the same architecture as LeNet, with a kernel size of 7x7 on all convolutional layers. In the case that Faster R-CNN detected multiple objects in an image, we followed a late-fusion approach, in which we calculated the mean of the predictions and mapped values above 0.5 to the passable road class.
   We tried to classify the 3275 patches belonging to objects of the person class, but our classifier was not significantly better than random guessing. By visual inspection we noticed that there is a lot of variation in the image patches of persons, which also made it very difficult for us to classify single patches with respect to road passability. The dataset contains, for example, images with persons being fully visible (evidence of passability) and at the same time persons walking in water at hip height (evidence of no passability). Since we were not able to achieve sufficient results for patches of persons, we suppressed this class in our current approach and leave it open for future research.

Global-Local Feature Fusion.
In our third strategy, we combined global and local features. We extracted the global features as described in Section 2.1 and appended to this vector the prediction based on the local features from Section 2.2. In case no local feature could be extracted from the image, we appended a special label to the vector. The resulting feature vector was classified with an SVM (RBF kernel) as described in Section 2.1.

3 EXPERIMENTS AND RESULTS
We first evaluated our three approaches on the internal validation set. Table 1 shows the results for the classification of road passability using the F1-score as metric.

Table 1: Road passability classification based on global, local and fused features on the internal validation set

    Approach                                               F1 score
    ImageNet ResNet101 (global)                             73.29%
    ImageNet ResNet152 (global)                             75.98%
    VSO X-ResNet50 Adjective (global)                       79.16%
    VSO X-ResNet50 Noun (global)                            78.97%
    Places365 Wide-ResNet38 (global)                        77.31%
    Local features based on Faster R-CNN                    77.24%
    Places365 Wide-ResNet38 (global) + local features       78.60%
    VSO X-ResNet50 Adjective (global) + local features      78.45%

In the table, we can see that (1) for global features the best results are achieved with scene-level features from the VSO, followed by Places365 and then ImageNet. (2) The classification using local features, at 77.24%, performed similarly to the global features (which range between 73.29% and 79.16%) and better than features extracted from ImageNet pre-trained models. However, it is also worth mentioning that this comparison is not completely fair, since the dataset using local features was smaller, as not every image contained the local features. (3) For the global-local feature fusion, we see a small improvement when using the features of the Places365 Wide-ResNet38 and a decrease in performance when using the VSO X-ResNet50 Adjective features.
   The final results on the private test set are shown in Table 2. Run 1 contains the results for the global feature VSO X-ResNet50 Adjective, run 2 for the same feature fused with the local predictions, and run 3 for features from the Places365 Wide-ResNet38 fused with the local predictions. The official metric is the average of the F1-scores of (C1) images with evidence and passable roads as well as (C2) images with evidence and non-passable roads.

Table 2: Results on the internal test set for the F1-score of evidence vs. no evidence for passability (row 1) and the average F1-score of evidence passable and evidence non passable (row 2)

                                  Run 1     Run 2     Run 3
    evidence vs. no evidence     87.70%    87.70%    87.70%
    passable vs. non passable    65.21%    64.96%    66.48%

In the table, we can see that the fusion of local and global feature information slightly decreased the results for the VSO X-ResNet50 Adjective feature, whereas for the Places365 Wide-ResNet38 feature the opposite can be observed, similar to the internal validation set. The results show that the global-local fusion neither significantly improved nor worsened the results. We believe that an improvement can be achieved with a more sophisticated fusion strategy and better local features. The resizing of extracted patches to 224x224 pixels for different objects could have a negative influence on the classification, since the aspect ratio of objects can get distorted. A deeper network for classifying local patches could additionally improve the results.

4 CONCLUSION
In this paper, we presented our approach for the Multimedia Satellite Task 2018 at MediaEval. In line with previous research [1], we also observed the advantages of scene-level features compared to object-related features when classifying images with respect to road passability. We could confirm our hypothesis in this work and showed that local features of a few object instances corresponding to classes in Pascal VOC can be used for the visual classification of road passability. We achieved results with local features comparable to those with global features and see a lot of potential to improve our current approach.
   When we fused global and local visual features, we did not achieve a significant improvement over global features alone. We strongly believe that better and more general local features, which can be extracted from more than only half of the images, play an important role in this context. We will continue this work with additional classes that are not covered by the Pascal VOC dataset. One direction would be to extract semantic segmentation classes from a model pre-trained on the Cityscapes dataset. This dataset contains additional classes, such as traffic signs and poles, that could be important for road passability classification as well.

ACKNOWLEDGMENTS
The authors would like to thank NVIDIA for their support within the NVAIL program. Additionally, this work was supported by the BMBF project DeFuseNN (01IW17002).
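The patch extraction, late fusion, global-local fusion, and evaluation steps described in Sections 2.2 and 3 can be sketched as follows. This is a minimal, self-contained illustration, not the authors' implementation: the helper names (`crop_and_resize`, `late_fuse`, `NO_LOCAL`) are hypothetical, synthetic random arrays stand in for the CNN feature extraction and Faster R-CNN detections, and scikit-learn is assumed for the SVM and the F1-score.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)


def crop_and_resize(image, box, size=224):
    """Crop a bounding box (x0, y0, x1, y1) from an HxWx3 image and
    resize the patch to size x size via nearest-neighbour sampling,
    analogous to the 224x224 patches of Section 2.2."""
    x0, y0, x1, y1 = box
    patch = image[y0:y1, x0:x1]
    ys = np.linspace(0, patch.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, size).astype(int)
    return patch[ys][:, xs]


def late_fuse(patch_probs):
    """Late fusion over the per-patch predictions of one image: mean of
    the probabilities, mapped to the passable class (1) above 0.5."""
    return 1 if np.mean(patch_probs) > 0.5 else 0


# Synthetic stand-ins: 200 images with 512-d "global" CNN features and
# a binary passability label (1 = passable, 0 = non-passable).
n, d = 200, 512
global_feats = rng.normal(size=(n, d))
labels = rng.integers(0, 2, size=n)

# Local prediction per image: a random number of detected patches stands
# in for Faster R-CNN; images without any detection get a sentinel value
# (the "special label" appended in the third strategy).
NO_LOCAL = -1.0
local_preds = []
for _ in range(n):
    k = int(rng.integers(0, 4))  # number of bus/car/boat patches found
    if k > 0:
        local_preds.append(float(late_fuse(rng.random(k))))
    else:
        local_preds.append(NO_LOCAL)
local_preds = np.asarray(local_preds)

# Global-local fusion: append the local prediction (or sentinel) to the
# global feature vector, then classify with an SVM (RBF kernel).
fused_feats = np.hstack([global_feats, local_preds[:, None]])
split = 150
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(fused_feats[:split], labels[:split])
pred = clf.predict(fused_feats[split:])

# Metric from Section 3: average of the F1-scores of the passable and
# the non-passable class.
f1_passable = f1_score(labels[split:], pred, pos_label=1)
f1_non_passable = f1_score(labels[split:], pred, pos_label=0)
avg_f1 = (f1_passable + f1_non_passable) / 2.0
print(f"average F1: {avg_f1:.4f}")
```

Appending a sentinel value for images without detections keeps the fused feature vector at a fixed length, which a single SVM requires; a learned fusion layer would be one of the "more sophisticated fusion strategies" mentioned above.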


REFERENCES
[1] Benjamin Bischke, Prakriti Bhardwaj, Aman Gautam, Patrick Helber,
    Damian Borth, and Andreas Dengel. 2017. Detection of flooding events
    in social multimedia and satellite imagery using deep neural networks.
    In Proceedings of the Working Notes Proceeding MediaEval Workshop,
    Dublin, Ireland. 13–15.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan
    Venkat, Andreas Dengel, and Damian Borth. The Multimedia Satellite
    Task at MediaEval 2017: Emergency Response for Flooding Events.
    In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin,
    Ireland.
[3] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and
    Damian Borth. The Multimedia Satellite Task at MediaEval 2018:
    Emergency Response for Flooding Events. In Proc. of the MediaEval
    2018 Workshop (Oct. 29-31, 2018). Sophia-Antipolis, France.
[4] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu
    Chang. 2013. Large-scale visual sentiment ontology and detectors
    using adjective noun pairs. In Proceedings of the 21st ACM international
    conference on Multimedia. ACM, 223–232.
[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014.
    Rich feature hierarchies for accurate object detection and semantic
    segmentation. In Proceedings of the IEEE conference on computer vision
    and pattern recognition. 580–587.
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Im-
    agenet classification with deep convolutional neural networks. In
    Advances in neural information processing systems. 1097–1105.
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster
    r-cnn: Towards real-time object detection with region proposal net-
    works. In Advances in neural information processing systems. 91–99.
[8] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio
    Torralba. 2018. Places: A 10 million image database for scene recogni-
    tion. IEEE transactions on pattern analysis and machine intelligence 40,
    6 (2018), 1452–1464.