<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Local Semantic Concepts for Flooding-related Social Image Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhengyu Zhao</string-name>
          <email>z.zhao@cs.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <email>m.larson@cs.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nelleke Oostdijk</string-name>
          <email>n.oostdijk@let.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In this paper, we present an approach to identifying images that depict passable and non-passable roads in a collection of flood-related tweet images. Our key insight is that the local information from domain-specific concepts ('boat', 'person' and 'car') can be exploited to help determine whether an image depicts a location that is passable. We use concept detection as the basis for features that encode local information. We use conventional features, i.e., presence of concepts and visual features extracted from the concept region, but also a novel light-weight feature, i.e., the aspect ratio of the bounding box. Experimental results show that integrating local semantic information yields slightly better performance than using only an image-level CNN representation. Text features are not competitive.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Despite achieving impressive performance in various visual
recognition tasks, convolutional neural network (CNN) representations
do not fully capture local-level discriminative information when
only trained at a single scale, i.e., input size of 224x224 for most
conventional CNNs. In order to complement global CNN features,
recent work on fine-grained object classification [
        <xref ref-type="bibr" rid="ref12 ref5 ref6 ref8">5, 6, 8, 12</xref>
        ] and
scene recognition [
        <xref ref-type="bibr" rid="ref10 ref11 ref2 ref9">2, 9–11</xref>
        ] has also tried to exploit discriminative
information from local semantic regions. Building on these insights,
we demonstrate that the task of differentiating two road
conditions (passable vs. non-passable) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] also benefits from local
semantic information. Our starting point is the observation that
images with similar global appearance have differentiable local
patterns, as shown in Figure 1. Intuitively, we expect that three
specific concepts (‘boat’, ‘person’ and ‘car’) will show different
properties in the context of road passability. Moreover, in our
exploratory experiments, we observed that the images containing
the three concept classes account for a large proportion (46%) of
the passability-relevant images. As shown in Figure 2, the images
with these three concepts are spread over the entire passability-relevant
dev-set without any specific bias related to time order, which is
reflected by the numerical order of the tweet ID. These two
observations indicate that using local information from these concepts
is not accidental but generally applicable.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>[Figure 2. y-axis: Quantities; x-axis: Images ranked in ascending order of numerical tweet ID.]</p>
      <p>We start with a light-weight approach that uses only text
information. By manual inspection of the patterns in the dev-set, we created
a set of rules that apply to a vocabulary that has been annotated
with basic part-of-speech and semantic-word class information.
On the basis of these rules, we create a set of ngrams, which
represent strings of lexical items that we would expect to occur in
tweets related to road passability. Whenever any created ngram is
encountered in the text, the associated class label is assigned (either
passable or non-passable). As we mostly target texts indicating
that roads are not passable, only a few ngrams yield
the label passable. If no ngram matches, the image is
regarded as not relevant to road passability.</p>
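      <p>The matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the rule set used for the text-based run: the ngrams in NGRAM_LABELS and the function label_tweet are hypothetical, and tokenization is reduced to lower-casing and whitespace splitting.</p>
      <preformat><![CDATA[
```python
# Hypothetical ngram-to-label table. The actual rule-derived ngrams are not
# listed in this paper; these entries are placeholders for illustration only.
NGRAM_LABELS = {
    ("road", "closed"): "non-passable",
    ("street", "flooded"): "non-passable",
    ("road", "open"): "passable",
}

NOT_RELEVANT = "not-relevant"


def label_tweet(text):
    """Return the label of the first ngram found, trying longer ngrams first.

    If no ngram from the table occurs in the text, the tweet (and hence its
    image) is treated as not relevant to road passability.
    """
    tokens = text.lower().split()
    lengths = sorted({len(ngram) for ngram in NGRAM_LABELS}, reverse=True)
    for n in lengths:
        for i in range(len(tokens) - n + 1):
            label = NGRAM_LABELS.get(tuple(tokens[i:i + n]))
            if label is not None:
                return label
    return NOT_RELEVANT
```
]]></preformat>
      <p>With this placeholder table, a tweet such as "Main road closed due to flooding" would be labelled non-passable, while a tweet without any listed ngram would be labelled not relevant.</p>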
      <p>
        For the visual-based approach, the basic pipeline is hierarchical
classification with two SVM classifiers. The first classifier is applied
to differentiate the images that are relevant to road passability from
the others. Here we only use image-level features extracted from a
ResNet50-based CNN model, which is pre-trained on the large-scale
scene-centric database Places2 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Exploratory experiments on
the dev-set showed that this option performed better than using the
object-centric ImageNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as pre-training data. The second
classifier then further classifies the images that have been classified as
relevant into the passable or non-passable class. Here, we use both
Places2 and ImageNet as the pre-training data, which results in better
performance than using only one of them. This result suggests that
scene-level and object-level discriminative information
complement each other for differentiating passable vs. non-passable
images.
      </p>
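      <p>The control flow of this two-stage classification can be sketched as below. The sketch is not the trained system: is_relevant and is_passable are hypothetical callables standing in for the two SVM classifiers, and concatenating the scene-level and object-level features for the second stage is an assumption about how the two representations are combined.</p>
      <preformat><![CDATA[
```python
def classify_hierarchical(scene_feat, object_feat, is_relevant, is_passable):
    """Two-stage cascade: relevance first, passability only for relevant images.

    scene_feat  -- image-level features from a scene-centric (Places2) model
    object_feat -- image-level features from an object-centric (ImageNet) model
    is_relevant -- hypothetical stand-in for the first SVM (uses scene features)
    is_passable -- hypothetical stand-in for the second SVM (uses both)
    """
    # Stage 1: separate passability-relevant images from all others.
    if not is_relevant(scene_feat):
        return "not-relevant"

    # Stage 2 (assumption: simple concatenation of the two feature vectors).
    combined = list(scene_feat) + list(object_feat)
    return "passable" if is_passable(combined) else "non-passable"
```
]]></preformat>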
      <p>
        Alternatively, we add a pre-filtering step before the second
classifier that allows test images containing the three concepts to be
treated differently. We adopt the state-of-the-art YOLOv3 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
is pre-trained on the union of VOC2007 and VOC2012 trainval
set [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for automatic concept detection. In order to capture
differences accurately, we exclude the image candidates whose
bounding boxes are incomplete within the image area or have a confidence score
below 0.9. When multiple instances are detected in one image, we
use the average values of their features as the final feature.
      </p>
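      <p>The filtering and averaging described above can be sketched as follows. The detection format (a dict with label, score and box) and both function names are hypothetical, and treating a box that touches the image border as incomplete is our assumption about how incomplete bounding boxes are identified; only the 0.9 confidence threshold and the three concepts come from the text.</p>
      <preformat><![CDATA[
```python
CONCEPTS = {"boat", "person", "car"}
MIN_CONFIDENCE = 0.9  # confidence threshold stated in the text


def keep_detection(det, img_w, img_h):
    """Keep a detection of a target concept that is confident and complete.

    det is a hypothetical dict:
        {"label": str, "score": float, "box": (x_min, y_min, x_max, y_max)}
    Assumption: a box touching or crossing the image border is incomplete.
    """
    x_min, y_min, x_max, y_max = det["box"]
    inside = x_min > 0 and y_min > 0 and x_max < img_w and y_max < img_h
    return (
        det["label"] in CONCEPTS
        and det["score"] >= MIN_CONFIDENCE
        and inside
    )


def mean_feature(vectors):
    """Element-wise average of equally sized feature vectors, one per instance."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]
```
]]></preformat>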
      <p>Since a boat is not a conventional means of travelling along a road, the
presence of any boat in the image indicates that the road is very likely
to be non-passable. We therefore use the presence or absence of ‘boat’ in an image
as a feature. The experiments on the dev-set show that boats are
detected in 46 of 1179 non-passable images, and in only 5 of 951
passable images.</p>
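      <p>To make these counts easier to interpret, the short calculation below derives the detection rate per class and the share of boat images that are non-passable. It uses only the four counts reported above; the variable names are ours.</p>
      <preformat><![CDATA[
```python
# Dev-set counts reported in the text.
boat_in_non_passable, total_non_passable = 46, 1179
boat_in_passable, total_passable = 5, 951

# Fraction of each class in which a boat was detected.
rate_non_passable = boat_in_non_passable / total_non_passable
rate_passable = boat_in_passable / total_passable

# Share of boat images that are non-passable, i.e., the precision of the
# rule "boat detected => non-passable" on the dev-set.
precision = boat_in_non_passable / (boat_in_non_passable + boat_in_passable)

print(f"{rate_non_passable:.1%} {rate_passable:.1%} {precision:.1%}")
# prints: 3.9% 0.5% 90.2%
```
]]></preformat>
      <p>The feature therefore fires rarely, but when it does, it points to the non-passable class in about nine out of ten dev-set images.</p>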
      <p>The subtle differences in local information can also be encoded
by a single value derived from concept bounding boxes. Specifically,
we look at the height-width aspect ratio of the bounding box, since
we observed that a person or car is more likely to be stuck
in water on a non-passable road, resulting in a lower aspect ratio.
In this paper, we set two empirical thresholds for ‘person’. We
classify images with aspect ratios lower than the first threshold
(T1=1.37) as non-passable, and images with aspect ratios higher
than the second threshold (T2=2.98) as passable. Since the aspect
ratio of the front/back view of a car can plausibly be
relatively high, we only apply one threshold for ‘car’ (T3=0.30). We
classify images with aspect ratios lower than this threshold as
non-passable. Figure 3 shows the precision-recall curves of passable
vs. non-passable classification as T1 and T3 change. For better
visualization, we balanced the number of images from the two
classes by upsampling the minority class.</p>
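      <p>The threshold rules above can be written down compactly as in the sketch below. The thresholds are the ones given in the text; the function names, the box format, and the convention of returning None when no rule fires (so that the image is passed on to the second classifier) are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
T1, T2 = 1.37, 2.98  # empirical 'person' thresholds from the text
T3 = 0.30            # empirical 'car' threshold from the text


def aspect_ratio(box):
    """Height divided by width of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return (y_max - y_min) / (x_max - x_min)


def prefilter(concept, ratio):
    """Return a label when a threshold rule fires, otherwise None.

    Assumption: None means the pre-filter makes no decision and the image
    is left to the second-stage classifier.
    """
    if concept == "person":
        if ratio < T1:
            return "non-passable"
        if ratio > T2:
            return "passable"
    elif concept == "car" and ratio < T3:
        return "non-passable"
    return None
```
]]></preformat>
      <p>When several instances of a concept are kept in one image, the ratio passed to prefilter would be the average over those instances.</p>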
      <p>Furthermore, we conjecture that the local information could also
be learned by a CNN based on the visual content enclosed by the
concept bounding box. We apply this for ‘car’, for which the subtle
differences in appearance are not well reflected by the aspect ratios
described above.</p>
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-4">
      <title>Run submissions</title>
      <p>Run 1 is our text-based approach. Runs 2, 3 and 4 use only
visual information and use SVM classifiers for a two-stage
classification. We use the same method for the first stage of each of
these three runs. For run 2, in the second stage, only image-level
features are leveraged. For run 3, in the second stage, we add a
pre-filtering step, which uses the presence or absence of ‘boat’ and the aspect
ratio-based method for both ‘person’ and ‘car’ as local features. Run
4 follows the same process as run 3, but instead of aspect ratios,
we use deep visual features extracted from the bounding box region
of ‘car’ to train an SVM classifier for pre-filtering. Note that no local
features for ‘boat’ and ‘car’ are used for this run.</p>
    </sec>
    <sec id="sec-5">
      <title>Experimental analysis</title>
      <p>Table 1 shows the evaluation results of our 4 runs. Since the
annotation of ‘road passability’ is based on visual inspection of the images
associated with the tweets, it is not surprising that the tweet text did
not make a strong contribution. In particular, we noticed that people
often discuss in the tweet whether it is legally allowed to pass a
road rather than whether the road is physically passable. Also, the
text does not necessarily pertain to the type of the image or what is
depicted in the image. For the visual information, we observe
that slightly better performance is achieved by exploiting
additional local information with the two methods that we applied.</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>In this paper, we proposed a new approach that captures local-level
information from specific semantic concepts for better identification
of Twitter images that depict passable and non-passable roads.
Specifically, we explored two different types of features based on
the output of the concept detector: a light-weight summary, i.e., the
aspect ratio of the bounding box, and visual features derived from
the bounding box. From the analysis of the text-based approach,
we conclude that the text information might be useful if, in the
future, we look at other aspects of evidence about road
passability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Patrick Helber,
          <string-name>
            <given-names>Zhengyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          , Jens de Bruijn, and
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The Multimedia Satellite Task at MediaEval 2018</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          , Sophia Antipolis, France, 29-31 October 2018.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Xiaojuan</given-names>
            <surname>Cheng</surname>
          </string-name>
          , Jiwen Lu, Jianjiang Feng, Bo Yuan, and
          <string-name>
            <given-names>Jie</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Scene recognition with objectness</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>74</volume>
          (
          <year>2018</year>
          ),
          <fpage>474</fpage>
          -
          <lpage>487</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wei Dong, Richard Socher,
          <string-name>
            <given-names>Li-Jia</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . IEEE,
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Everingham</surname>
          </string-name>
          , SM Ali Eslami, Luc Van Gool,
          Christopher KI Williams
          , John Winn, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The PASCAL visual object classes challenge: A retrospective</article-title>
          .
          <source>International Journal of Computer Vision (IJCV)</source>
          <volume>111</volume>
          ,
          <issue>1</issue>
          (
          <year>2015</year>
          ),
          <fpage>98</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Xiangteng</given-names>
            <surname>He</surname>
          </string-name>
          ,
          Yuxin Peng, and
          <string-name>
            <given-names>Junjie</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Fine-grained discriminative localization via saliency-guided Faster R-CNN</article-title>
          .
          <source>In ACM International Conference on Multimedia (ACM MM)</source>
          .
          <fpage>627</fpage>
          -
          <lpage>635</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Shaoli</given-names>
            <surname>Huang</surname>
          </string-name>
          , Zhe Xu,
          <string-name>
            <given-names>Dacheng</given-names>
            <surname>Tao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ya</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Part-stacked CNN for fine-grained visual categorization</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>1173</fpage>
          -
          <lpage>1182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Redmon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Farhadi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Yolov3: An incremental improvement</article-title>
          .
          <source>arXiv preprint arXiv:1804.02767</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Xiu-Shen</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chen-Wei</given-names>
            <surname>Xie</surname>
          </string-name>
          , Jianxin Wu, and
          <string-name>
            <given-names>Chunhua</given-names>
            <surname>Shen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>76</volume>
          (
          <year>2018</year>
          ),
          <fpage>704</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ruobing</given-names>
            <surname>Wu</surname>
          </string-name>
          , Baoyuan Wang,
          <string-name>
            <given-names>Wenping</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yizhou</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Harvesting discriminative meta objects with deep CNN features for scene classification</article-title>
          .
          <source>In International Conference on Computer Vision (ICCV)</source>
          .
          <fpage>1287</fpage>
          -
          <lpage>1295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Guo-Sen</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xu-Yao</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Shuicheng Yan, and Cheng-Lin Liu.
          <year>2017</year>
          .
          <article-title>Hybrid CNN and dictionary-based models for scene recognition and domain adaptation</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)</source>
          <volume>27</volume>
          (
          <year>2017</year>
          ),
          <fpage>1263</fpage>
          -
          <lpage>1274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Zhengyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>From Volcano to Toyshop: Adaptive Discriminative Region Discovery for Scene Recognition</article-title>
          .
          <source>In ACM International Conference on Multimedia (ACM MM).</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Heliang</given-names>
            <surname>Zheng</surname>
          </string-name>
          , Jianlong Fu, Tao Mei, and
          <string-name>
            <given-names>Jiebo</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Learning multi-attention convolutional neural network for fine-grained image recognition</article-title>
          .
          <source>In International Conference on Computer Vision (ICCV)</source>
          (
          <year>2017</year>
          ),
          <fpage>5219</fpage>
          -
          <lpage>5227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.
          <year>2018</year>
          .
          <article-title>Places: A 10 million image database for scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)</source>
          <volume>40</volume>
          (
          <year>2018</year>
          ),
          <fpage>1452</fpage>
          -
          <lpage>1464</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>