=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_36
|storemode=property
|title=Exploiting Local Semantic Concepts for Flooding-related Social Image Classification
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_36.pdf
|volume=Vol-2283
|authors=Zhengyu Zhao,Martha Larson,Nelleke Oostdijk
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ZhaoLO18
}}
==Exploiting Local Semantic Concepts for Flooding-related Social Image Classification==
Zhengyu Zhao, Martha Larson, Nelleke Oostdijk
Radboud University, Netherlands
z.zhao@cs.ru.nl,m.larson@cs.ru.nl,n.oostdijk@let.ru.nl
ABSTRACT
In this paper, we present an approach to identifying images that depict passable and non-passable roads in a collection of flood-related tweet images. Our key insight is that local information from domain-specific concepts (‘boat’, ‘person’ and ‘car’) can be exploited to help determine whether an image depicts a location that is passable. We use concept detection as the basis for features that encode local information. We use conventional features, i.e., presence of concepts and visual features extracted from the concept region, but also a novel light-weight feature, i.e., the aspect ratio of the bounding box. Experimental results show that integrating local semantic information yields slightly better performance than using only an image-level CNN representation. Text features are not competitive.
Figure 1: Image examples showing the contrast in visual properties of ‘person’ and ‘car’ in non-passable (left column) vs. passable (right column) classes.
Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France
1 INTRODUCTION
Despite achieving impressive performance in various visual recognition tasks, convolutional neural network (CNN) representations do not fully capture local-level discriminative information when trained only at a single scale, i.e., input size of 224x224 for most conventional CNNs. In order to complement global CNN features, recent work on fine-grained object classification [5, 6, 8, 12] and scene recognition [2, 9–11] has also tried to exploit discriminative information from local semantic regions. Building on these insights, we demonstrate here that the task of differentiating two road conditions (passable vs. non-passable) [1] also benefits from local semantic information. Our starting point is the observation that images with similar global appearance have distinguishable local patterns, as shown in Figure 1. Intuitively, we consider that three specific concepts (‘boat’, ‘person’ and ‘car’) will show different properties in the context of road passability. Moreover, based on our exploratory experiments, we observed that the images containing the three concept classes account for a large proportion (46%) of the passability-relevant images. As shown in Figure 2, the images with these three concepts span the entire passability-relevant dev-set without any specific bias related to time order, which is reflected by the numerical order of the tweet ID. These two observations indicate that using local information from these concepts is not accidental but can be generally applicable.
Figure 2: Distribution of the three concept classes over the whole subset of passability-relevant images in the dev-set. (x-axis: images ranked in ascending order of numerical tweet ID; y-axis: quantities; series: boat, person, car.)
2 APPROACH
We start with a light-weight approach that uses only text information. By manual inspection of the patterns in the dev-set, we created a set of rules that apply to a vocabulary that has been annotated with basic part-of-speech and semantic-word class information. On the basis of these rules, we create a set of ngrams, which represent strings of lexical items that we would expect to occur in tweets related to road passability. Whenever any created ngram is encountered in the text, the associated class label is assigned (either passable or non-passable). As we target mostly texts indicating that roads are not passable, there are only a few ngrams that yield the label passable. If no ngram matches, the image is regarded as not relevant to road passability.
For the visual-based approach, the basic pipeline is hierarchical classification with two SVM classifiers. The first classifier is applied to differentiate the images that are relevant to road passability from the others. Here we only use image-level features extracted from a ResNet50-based CNN model, which is pre-trained on the large-scale
scene-centric database Places2 [13]. Exploratory experiments on the dev-set showed that this option performed better than using the object-centric ImageNet [3] as pre-training data. The second classifier then further separates the images that have been classified as relevant into passable and non-passable classes. Here, we use both Places2 and ImageNet as pre-training data, which results in better performance than using only one of them. This result suggests that scene-level and object-level discriminative information complement each other for differentiating passable vs. non-passable images.
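The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the implementation used for the submitted runs: the function and argument names are hypothetical, the two classifiers are passed in as plain callables standing in for the trained SVMs, and combining the Places2-based and ImageNet-based features by concatenation is an assumption, since the fusion method is not specified here.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def hierarchical_predict(
    places_feats: Sequence[Vector],
    imagenet_feats: Sequence[Vector],
    relevance_clf: Callable[[Vector], bool],
    passability_clf: Callable[[Vector], bool],
) -> List[str]:
    """Two-stage labelling: relevance first, passability only for relevant images."""
    labels: List[str] = []
    for scene_vec, object_vec in zip(places_feats, imagenet_feats):
        # Stage 1 sees only the scene-centric (Places2-style) features.
        if not relevance_clf(scene_vec):
            labels.append("not_relevant")
            continue
        # Stage 2 sees both feature types; concatenation is an assumption.
        fused = list(scene_vec) + list(object_vec)
        labels.append("passable" if passability_clf(fused) else "non_passable")
    return labels


# Toy stand-ins for the two trained SVMs (hypothetical decision rules).
def toy_relevance(vec: Vector) -> bool:
    return vec[0] > 0.5


def toy_passability(vec: Vector) -> bool:
    return sum(vec) > 2.0


demo_labels = hierarchical_predict(
    places_feats=[[0.9, 0.8], [0.1, 0.2], [0.7, 0.1]],
    imagenet_feats=[[0.6, 0.4], [0.9, 0.9], [0.2, 0.3]],
    relevance_clf=toy_relevance,
    passability_clf=toy_passability,
)
print(demo_labels)
```

Images rejected by the first stage never reach the second classifier, so errors made in the relevance stage propagate directly to the passability labels.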
Figure 3: Precision-Recall curves with varying values for T1 (person) and T3 (car), where the arrows point out the positions where the specific values were set. (x-axis: Recall; y-axis: Precision.)
Alternatively, we add a pre-filtering step before the second classifier that allows test images containing the three concepts to be treated differently. We adopt the state-of-the-art YOLOv3 [7], which is pre-trained on the union of the VOC2007 and VOC2012 trainval sets [4], for automatic concept detection. In order to capture differences accurately, we exclude the image candidates that have incomplete bounding boxes in the image area, or a confidence score below 0.9. When multiple instances are detected in one image, we use the average values of their features as the final feature.
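The filtering and averaging steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code behind the runs: the `Detection` record and all function names are hypothetical, a box is treated as "incomplete" when it touches or crosses the image border, and features are averaged separately per concept. Only the 0.9 confidence cut-off and the three concept names come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

MIN_CONFIDENCE = 0.9                   # cut-off stated in the text
CONCEPTS = ("boat", "person", "car")   # the three concepts of interest


@dataclass
class Detection:
    """Hypothetical record for one detector output."""
    label: str
    confidence: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    features: Sequence[float]


def is_complete(det: Detection, img_w: float, img_h: float) -> bool:
    # Assumption: a box touching or crossing the border is "incomplete".
    return det.x_min > 0 and det.y_min > 0 and det.x_max < img_w and det.y_max < img_h


def averaged_features(
    dets: Sequence[Detection], img_w: float, img_h: float
) -> Dict[str, List[float]]:
    """Drop unreliable detections, then average the features per concept."""
    kept: Dict[str, List[List[float]]] = {}
    for det in dets:
        if det.label not in CONCEPTS:
            continue
        if det.confidence < MIN_CONFIDENCE or not is_complete(det, img_w, img_h):
            continue
        kept.setdefault(det.label, []).append(list(det.features))
    # Element-wise mean over all kept instances of the same concept.
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in kept.items()
    }


demo_dets = [
    Detection("person", 0.95, 10, 10, 50, 110, [1.0, 2.0]),
    Detection("person", 0.92, 60, 20, 100, 140, [3.0, 4.0]),
    Detection("car", 0.85, 20, 60, 120, 100, [5.0, 6.0]),   # low confidence
    Detection("boat", 0.97, 0, 30, 80, 90, [7.0, 8.0]),     # touches the border
]
demo_result = averaged_features(demo_dets, img_w=200, img_h=150)
print(demo_result)
```

In this toy example only the two ‘person’ detections survive the filter, so the image ends up with a single averaged ‘person’ feature and no ‘car’ or ‘boat’ feature.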
Since ‘boat’ is not a conventional means to pass a road, the presence of any boat in the image indicates the road is very likely to be non-passable. So we use the +/- presence of ‘boat’ in an image as a feature. The experiments on the dev-set show that boats can be detected in 46 of 1179 non-passable images, and only in 5 of 951 passable images.
The subtle differences in local information can also be encoded by a single value derived from concept bounding boxes. Specifically, we look at the height-width aspect ratio of the bounding box, since we observed that a person or car is more likely to be stuck in water on a non-passable road, resulting in a lower aspect ratio. In this paper, we set two empirical thresholds for ‘person’. We classify images with aspect ratios lower than the first threshold (T1=1.37) as non-passable, and images with aspect ratios higher than the second threshold (T2=2.98) as passable. Since the aspect ratio of the front/back view of a car can plausibly be relatively high, we only apply one threshold (T3=0.30). We classify images with aspect ratios lower than this threshold as non-passable. Figure 3 shows the precision-recall curves of passable vs. non-passable classification as T1 and T3 change. For better visualization, we balanced the number of images from the two classes by upsampling the minority class.
Furthermore, we conjecture that the local information could also be learned by a CNN based on the visual content enclosed by the concept bounding box. We apply this for ‘car’, for which the subtle differences of appearance are not well reflected by the aspect ratios as described above.
3 EXPERIMENTS
3.1 Run submissions
Run 1 is our text-based approach. Runs 2, 3 and 4 only use visual information and also use SVM classifiers for a two-stage classification. We use the same method for the first stage of each of these three runs. For run 2, in the second stage, only image-level features are leveraged. For run 3, in the second stage, we add a pre-filtering step, which uses the +/- presence of ‘boat’ and the aspect ratio-based method for both ‘person’ and ‘car’ as local features. Run 4 follows the same process as in run 3, but instead of aspect ratios, we use deep visual features extracted from the bounding box region of ‘car’ to train an SVM classifier for pre-filtering. Note that no local features for ‘boat’ and ‘car’ are used for this run.
Table 1: Average test-set scores for evidence vs. non-evidence (Ave-F1_1) and passable vs. non-passable (Ave-F1_2)
           Run 1    Run 2    Run 3    Run 4
Ave-F1_1   0.3260   0.8758   0.8758   0.8758
Ave-F1_2   0.1286   0.6313   0.6389   0.6388
3.2 Experimental analysis
Table 1 shows the evaluation results of our 4 runs. Since the annotation of ‘road passability’ is based on visual inspection of the images associated with the tweets, it is not surprising that the tweet text did not make a strong contribution. In particular, we noticed that people often discuss in the tweet whether it is legally allowed to pass a road rather than whether the road is physically passable. Also, the text does not necessarily pertain to the type of the image or what is depicted in the image. For the visual information, we can observe that slightly better performance could be achieved by exploiting additional local information in the two methods that we applied.
4 CONCLUSION
In this paper, a new approach was proposed to capture local-level information from specific semantic concepts for better identification of Twitter images that depict passable and non-passable roads. Specifically, we explored two different types of features based on the light-weight summary of the output of the concept detector, i.e., aspect ratio of the bounding box, or visual features derived from the bounding box. From the analysis of the text-based approach, we concluded that the text information might be useful if, in the future, we were to look at other aspects of evidence about road passability.
REFERENCES
[1] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018. In Proc. of the MediaEval 2018 Workshop, Sophia Antipolis, France, 29-31 October 2018.
[2] Xiaojuan Cheng, Jiwen Lu, Jianjiang Feng, Bo Yuan, and Jie Zhou. 2018. Scene recognition with objectness. Pattern Recognition 74 (2018), 474–487.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
[4] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2015. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV) 111, 1 (2015), 98–136.
[5] Xiangteng He, Yuxin Peng, and Junjie Zhao. 2017. Fine-grained discriminative localization via saliency-guided Faster R-CNN. In ACM International Conference on Multimedia (ACM MM). 627–635.
[6] Shaoli Huang, Zhe Xu, Dacheng Tao, and Ya Zhang. 2016. Part-stacked CNN for fine-grained visual categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1173–1182.
[7] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[8] Xiu-Shen Wei, Chen-Wei Xie, Jianxin Wu, and Chunhua Shen. 2018. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition 76 (2018), 704–714.
[9] Ruobing Wu, Baoyuan Wang, Wenping Wang, and Yizhou Yu. 2015. Harvesting discriminative meta objects with deep CNN features for scene classification. In International Conference on Computer Vision (ICCV). 1287–1295.
[10] Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu. 2017. Hybrid CNN and dictionary-based models for scene recognition and domain adaptation. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 27 (2017), 1263–1274.
[11] Zhengyu Zhao and Martha Larson. 2018. From Volcano to Toyshop: Adaptive Discriminative Region Discovery for Scene Recognition. In ACM International Conference on Multimedia (ACM MM).
[12] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. 2017. Learning multi-attention convolutional neural network for fine-grained image recognition. International Conference on Computer Vision (ICCV) (2017), 5219–5227.
[13] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (2018), 1452–1464.