=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_36
|storemode=property
|title=Exploiting Local Semantic Concepts for Flooding-related Social Image Classification
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_36.pdf
|volume=Vol-2283
|authors=Zhengyu Zhao,Martha Larson,Nelleke Oostdijk
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ZhaoLO18
}}
==Exploiting Local Semantic Concepts for Flooding-related Social Image Classification==
<pdf width="1500px">https://ceur-ws.org/Vol-2283/MediaEval_18_paper_36.pdf</pdf>
<pre>
                             Exploiting Local Semantic Concepts
                       for Flooding-related Social Image Classification
                                                Zhengyu Zhao, Martha Larson, Nelleke Oostdijk
                                                            Radboud University, Netherlands
                                                  z.zhao@cs.ru.nl,m.larson@cs.ru.nl,n.oostdijk@let.ru.nl
ABSTRACT
In this paper, we present an approach to identifying images that depict
passable and non-passable roads in a collection of flood-related tweet
images. Our key insight is that local information from domain-specific
concepts (‘boat’, ‘person’ and ‘car’) can be exploited to help determine
whether an image depicts a location that is passable. We use concept
detection as the basis for features that encode local information. We use
conventional features, i.e., presence of concepts and visual features
extracted from the concept region, but also a novel light-weight feature,
i.e., the aspect ratio of the bounding box. Experimental results show that
integrating local semantic information yields slightly better performance
than using only an image-level CNN representation. Text features are not
competitive.

Figure 1: Image examples showing the contrast in visual properties of
‘person’ and ‘car’ in non-passable (left column) vs. passable (right
column) classes.
Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

1    INTRODUCTION
Despite achieving impressive performance in various visual recognition
tasks, convolutional neural network (CNN) representations do not fully
capture local-level discriminative information when only trained at a
single scale, i.e., an input size of 224x224 for most conventional CNNs. In
order to complement global CNN features, recent work on fine-grained object
classification [5, 6, 8, 12] and scene recognition [2, 9–11] has also tried
to exploit discriminative information from local semantic regions. Building
on these insights, we demonstrate here that the task of differentiating two
road conditions (passable vs. non-passable) [1] also benefits from local
semantic information. Our starting point is the observation that images
with similar global appearance have distinguishable local patterns, as
shown in Figure 1. Intuitively, we consider that three specific concepts
(‘boat’, ‘person’ and ‘car’) will show different properties in the context
of road passability. Moreover, in our exploratory experiments, we observed
that the images containing the three concept classes account for a large
proportion (46%) of the passability-relevant images. As shown in Figure 2,
the images with these three concepts span the entire passability-relevant
dev-set without any specific bias related to time order, as reflected by
the numerical order of the tweet IDs. These two observations indicate that
using local information from these concepts is not accidental but generally
applicable.

[Figure 2 plot: x-axis: images ranked in ascending order of numerical tweet
ID (0 to 2000); y-axis: quantities (0 to 80); legend: boat, person, car.]
Figure 2: Distribution of the three concept classes over the whole subset
of passability-relevant images in the dev-set.

2    APPROACH
We start with a light-weight approach that uses only text information. By
manual inspection of the patterns in the dev-set, we created a set of rules
that apply to a vocabulary that has been annotated with basic
part-of-speech and semantic-word class information. On the basis of these
rules, we create a set of ngrams, which represent strings of lexical items
that we would expect to occur in tweets related to road passability.
Whenever any created ngram is encountered in the text, the associated class
label is assigned (either passable or non-passable). As we mostly target
texts indicating that roads are not passable, there are only a few ngrams
that yield the label passable. If no ngram matches, the image is regarded
as not relevant to road passability.
    For the visual-based approach, the basic pipeline is hierarchical
classification with two SVM classifiers. The first classifier is applied to
differentiate the images that are relevant to road passability from the
others. Here we only use image-level features extracted from a
ResNet50-based CNN model, which is pre-trained on the large-scale
scene-centric database Places2 [13]. Exploratory experiments on the dev-set
showed that this option performed better than using the object-centric
ImageNet [3] as pre-training data. Then, the second classifier further
sorts the images that have been classified as relevant into passable and
non-passable classes. Here, we use both Places2 and ImageNet as
pre-training data, which results in better performance than using only one
of them. This result suggests that scene-level and object-level
discriminative information complement each other for differentiating
passable vs. non-passable images.
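
The control flow of this two-stage pipeline can be sketched as below. This
is a minimal illustration, not the authors' implementation: the two
callables stand in for the trained SVM decision functions, the toy
stand-ins and label strings are hypothetical, and concatenating the
Places2-based and ImageNet-based features in the second stage is an
assumption, since the text does not say how the two are combined.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]


def hierarchical_label(
    scene_feat: Features,
    object_feat: Features,
    is_relevant: Callable[[Features], bool],
    is_passable: Callable[[List[float]], bool],
) -> str:
    """Two-stage cascade: relevance first, then passability."""
    # Stage 1 sees only the scene-centric (Places2-style) features.
    if not is_relevant(scene_feat):
        return "not-relevant"
    # Stage 2 sees both feature sets; concatenation is an assumption.
    combined = list(scene_feat) + list(object_feat)
    return "passable" if is_passable(combined) else "non-passable"


# Toy stand-ins for the two trained SVM decision functions (hypothetical).
def toy_relevance(feat: Features) -> bool:
    return sum(feat) > 1.0


def toy_passable(feat: List[float]) -> bool:
    return feat[-1] > 0.5
```

Only images that pass the first stage reach the second classifier, so an
error in the relevance stage cannot be corrected later in the cascade.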
[Figure 3 plot: x-axis: recall (0 to 1); y-axis: precision (0.4 to 1);
curves for T1 and T3, with arrows marking T1=1.37 and T3=0.30.]
Figure 3: Precision-recall curves with varying values for T1 (person) and
T3 (car). The arrows mark the positions at which the specific values were
set.

    Alternatively, we add a pre-filtering step before the second classifier
that allows test images containing the three concepts to be treated
differently. We adopt the state-of-the-art YOLOv3 [7], which is pre-trained
on the union of the VOC2007 and VOC2012 trainval sets [4], for automatic
concept detection. In order to capture differences accurately, we exclude
the image candidates that have incomplete bounding boxes in the image area,
or a confidence score below 0.9. When multiple instances are detected in
one image, we use the average values of their features as the final
feature.
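
The filtering and averaging step can be sketched as follows. This is a
sketch under stated assumptions rather than the authors' code: the
Detection record and function names are hypothetical, and an "incomplete"
bounding box is interpreted here as one that touches or crosses the image
border. The per-instance feature is passed in as a callable; the
height-width aspect ratio that the paper goes on to use serves as the
example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    """One detector output (hypothetical record layout)."""
    label: str
    confidence: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def keep_detection(det: Detection, img_w: float, img_h: float,
                   min_conf: float = 0.9) -> bool:
    """Keep confident boxes that lie strictly inside the image."""
    # Assumption: a box touching or crossing the border is "incomplete".
    inside = (det.x_min > 0 and det.y_min > 0
              and det.x_max < img_w and det.y_max < img_h)
    return inside and det.confidence >= min_conf


def aspect_ratio(det: Detection) -> float:
    """Height divided by width of the bounding box."""
    return (det.y_max - det.y_min) / (det.x_max - det.x_min)


def mean_feature(dets: List[Detection], img_w: float, img_h: float,
                 label: str,
                 feature: Callable[[Detection], float]) -> Optional[float]:
    """Average a per-instance feature over the kept boxes of one concept."""
    kept = [d for d in dets
            if d.label == label and keep_detection(d, img_w, img_h)]
    if not kept:
        return None  # no usable instance of this concept in the image
    return sum(feature(d) for d in kept) / len(kept)
```

Returning None when no box survives the filter lets the caller fall back
to the image-level classifier for that image.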
    Since ‘boat’ is not a conventional means to pass a road, the presence
of any boat in the image indicates that the road is very likely to be
non-passable. We therefore use the +/- presence of ‘boat’ in an image as a
feature. The experiments on the dev-set show that boats can be detected in
46 of 1179 non-passable images, and only in 5 of 951 passable images.
    The subtle differences in local information can also be encoded by a
single value derived from concept bounding boxes. Specifically, we look at
the height-width aspect ratio of the bounding box, since we observed that a
person or car is more likely to be stuck in water on a non-passable road,
resulting in a lower aspect ratio. In this paper, we set two empirical
thresholds for ‘person’. We classify images with aspect ratios lower than
the first threshold (T1=1.37) as non-passable, and images with aspect
ratios higher than the second threshold (T2=2.98) as passable. Since the
aspect ratio of the front/back view of a car could plausibly have a
relatively high value, we only apply one threshold (T3=0.30) for ‘car’. We
classify images with aspect ratios lower than this threshold as
non-passable. Figure 3 shows the precision-recall curves of passable vs.
non-passable classification as T1 and T3 change. For better visualization,
we balanced the number of images from the two classes by upsampling the
minority class.
    Furthermore, we conjecture that the local information could also be
learned by a CNN based on the visual content enclosed by the concept
bounding box. We apply this for ‘car’, for which the subtle differences of
appearance are not well reflected by the aspect ratios as described above.

3    EXPERIMENTS
3.1    Run submissions
Run 1 is our text-based approach. Runs 2, 3 and 4 only use visual
information and also use SVM classifiers for a two-stage classification. We
use the same method for the first stage of each of these three runs. For
run 2, in the second stage, only image-level features are leveraged. For
run 3, in the second stage, we add a pre-filtering step, which uses the +/-
presence of ‘boat’ and the aspect ratio-based method for both ‘person’ and
‘car’ as local features. Run 4 follows the same process as run 3, but
instead of aspect ratios, we use deep visual features extracted from the
bounding box region of ‘car’ to train an SVM classifier for pre-filtering.
Note that no local features for ‘boat’ and ‘car’ are used for this run.

Table 1: Average test-set scores for evidence vs. non-evidence (Ave-F1_1)
and passable vs. non-passable (Ave-F1_2)

            Run 1     Run 2     Run 3     Run 4
Ave-F1_1    0.3260    0.8758    0.8758    0.8758
Ave-F1_2    0.1286    0.6313    0.6389    0.6388

3.2    Experimental analysis
Table 1 shows the evaluation results of our 4 runs. Since the annotation of
‘road passability’ is based on visual inspection of the images associated
with the tweets, it is not surprising that the tweet text did not make a
strong contribution. In particular, we noticed that people often discuss in
the tweet whether it is legally allowed to pass a road rather than whether
the road is physically passable. Also, the text does not necessarily
pertain to the type of the image or what is depicted in the image. For the
visual information, we observe that exploiting additional local information
with either of the two methods yields slightly better performance (Ave-F1_2
of 0.6389 for run 3 and 0.6388 for run 4, vs. 0.6313 for run 2).

4    CONCLUSION
In this paper, a new approach was proposed to capture local-level
information from specific semantic concepts for better identification of
Twitter images that depict passable and non-passable roads. Specifically,
we explored two types of features based on the output of the concept
detector: a light-weight summary, i.e., the aspect ratio of the bounding
box, and visual features derived from the bounding box region. From the
analysis of the text-based approach, we concluded that the text information
might be useful if, in the future, we look at other aspects of evidence
about road passability.
Multimedia Satellite Task

REFERENCES
 [1] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and
     Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018.
     In Proc. of the MediaEval 2018 Workshop, Sophia Antipolis, France, 29-31
     October 2018.
 [2] Xiaojuan Cheng, Jiwen Lu, Jianjiang Feng, Bo Yuan, and Jie Zhou.
     2018. Scene recognition with objectness. Pattern Recognition 74 (2018),
     474–487.
 [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
     2009. ImageNet: A large-scale hierarchical image database. In IEEE
     Conference on Computer Vision and Pattern Recognition (CVPR). IEEE,
     248–255.
 [4] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI
     Williams, John Winn, and Andrew Zisserman. 2015. The PASCAL
     visual object classes challenge: A retrospective. International journal
     of computer vision (IJCV) 111, 1 (2015), 98–136.
 [5] Xiangteng He, Yuxin Peng, and Junjie Zhao. 2017. Fine-grained
     discriminative localization via saliency-guided Faster R-CNN. In ACM
     International Conference on Multimedia (ACM MM). 627–635.
 [6] Shaoli Huang, Zhe Xu, Dacheng Tao, and Ya Zhang. 2016. Part-stacked
     CNN for fine-grained visual categorization. In IEEE Conference on
     Computer Vision and Pattern Recognition (CVPR). 1173–1182.
 [7] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental
     improvement. arXiv preprint arXiv:1804.02767 (2018).
 [8] Xiu-Shen Wei, Chen-Wei Xie, Jianxin Wu, and Chunhua Shen. 2018.
     Mask-CNN: Localizing parts and selecting descriptors for fine-grained
     bird species categorization. Pattern Recognition 76 (2018), 704–714.
 [9] Ruobing Wu, Baoyuan Wang, Wenping Wang, and Yizhou Yu. 2015.
     Harvesting discriminative meta objects with deep CNN features for
     scene classification. In International Conference of Computer Vision
     (ICCV). 1287–1295.
[10] Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu. 2017.
     Hybrid CNN and dictionary-based models for scene recognition and
     domain adaptation. IEEE Transactions on Circuits and Systems for Video
     Technology (TCSVT) 27 (2017), 1263–1274.
[11] Zhengyu Zhao and Martha Larson. 2018. From Volcano to Toyshop:
     Adaptive Discriminative Region Discovery for Scene Recognition. In
     ACM International Conference on Multimedia (ACM MM).
[12] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. 2017. Learning
     multi-attention convolutional neural network for fine-grained image
     recognition. International Conference of Computer Vision (ICCV) (2017),
     5219–5227.
[13] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio
     Torralba. 2018. Places: A 10 million image database for scene
     recognition. IEEE Transactions on Pattern Analysis and Machine
     Intelligence (TPAMI) 40 (2018), 1452–1464.

</pre>