<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Flood Event Analysis Based on Pose Estimation and Water-related Scene Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khanh-An C. Quan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tan-Cong Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinh-Tiep Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Social Sciences and Humanities</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>In this paper, we describe our approach for the Multimedia Satellite Task: Emergency Response for Flooding Events at the MediaEval 2019 Challenge. Specifically, for the Multimodal Flood Level Estimation subtask, we combine ResNet-50 trained on the Places365 dataset as a feature extractor, OpenPose for pose estimation, and Mask R-CNN for segmentation to predict whether an image contains at least one person standing in water above the knee. Our approach achieved the highest result in the Multimodal Flood Level Estimation subtask.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In this Multimedia Satellite Task, we take part in two subtasks: Image-based News Topic Disambiguation (INTD) and Multimodal Flood Level Estimation (MFLE). For the first subtask, we propose using EfficientNet features [<xref ref-type="bibr" rid="ref8">8</xref>] to train a water-related image classifier. For the second subtask, we use both EfficientNet and ResNet-50 features. We then employ Faster R-CNN [<xref ref-type="bibr" rid="ref7">7</xref>] to detect whether there are people in the image. We also combine the binary mask from Mask R-CNN [<xref ref-type="bibr" rid="ref3">3</xref>] with the pose from OpenPose [<xref ref-type="bibr" rid="ref2">2</xref>] to predict whether the image contains at least one person standing in water above the knee. In addition, we implement a language model that uses the content and title of the article containing the image. We evaluate our method with the F1 score. Full details of the challenge tasks can be found in [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>INTD subtask</title>
      <p>First, the input image is segmented to obtain the background.
We then use the EfficientNet architecture to extract image features
from both the original image and the background image. For each
image, we take the features from multiple convolutional layers and
concatenate them. This gives two feature vectors of the same size,
one for the original image and one for the background image. Finally,
we concatenate these two vectors and feed the result into
fully connected layers to estimate the final result.</p>
    </sec>
    <sec id="sec-4">
      <title>MFLE subtask</title>
      <p>For the second subtask, our proposed method consists of four stages:
water-related scene recognition, person detection, pose estimation,
and prediction based on paired mask and pose. The system pipeline
for this subtask is shown in Figure 1.</p>
      <p>
        In the first stage, we label all images of the training set with one of
two categories: water-related or non-water-related scene. We then
take the output of the average pooling layer of ResNet-50
trained on Places365 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as visual features and feed them to a neural
network that classifies whether the image scene is related to water.
We also apply the visual model from the first subtask (described
in detail in Section 2.1, INTD subtask) to the images classified as
non-water-related, to ensure that water-related images are not omitted.
All water-related images are carried over to the next stage, and the
remaining images are labeled as class 0.
      </p>
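      <p>The decision logic of this first stage can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the two classifiers are stood in for by hypothetical callables, scene_model and intd_model, that each return a probability that an image is water-related, and the 0.5 threshold is an assumed value.</p>
      <preformat><![CDATA[
```python
def first_stage(images, scene_model, intd_model, threshold=0.5):
    """Split images into (water_related, class_0).

    scene_model and intd_model are hypothetical callables that map an
    image to a probability of being water-related. The threshold is an
    assumption made for this sketch.
    """
    water_related, class_0 = [], []
    for image in images:
        if scene_model(image) >= threshold:
            # Accepted by the scene classifier (ResNet-50 features).
            water_related.append(image)
        elif intd_model(image) >= threshold:
            # Second check with the model from the first subtask, so that
            # water-related images are not omitted.
            water_related.append(image)
        else:
            # Everything else is labeled as class 0.
            class_0.append(image)
    return water_related, class_0
```
]]></preformat>
      <p>Only the images in the first returned list would be passed on to person detection in this sketch.</p>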
      <p>In the second stage, we apply Faster R-CNN to the water-related
images to filter out those that do not contain a person. Poses are
nevertheless estimated with OpenPose on both the positive and the
negative images, so that swimming persons can be detected in the
following stages.</p>
      <p>
        In the third stage, we use OpenPose with the COCO output
format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to estimate the poses of all people (Figure 2). We also
compute a pose bounding box from the output keypoints.
After that, we train a WaterClassifier network to predict a
water/non-water label. We crop the bottom third of every image in
the training set and manually sort the crops into water and non-water
images. We then extract visual features using ResNet-50 trained
on Places365 and train a neural network on them to predict the
water/non-water label.
      </p>
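      <p>The two geometric steps of this stage, the pose bounding box and the bottom-third crop, can be sketched as follows. This is a minimal illustration, assuming each keypoint is an (x, y, confidence) triple and that undetected keypoints carry a confidence of 0; the function names and the min_confidence parameter are hypothetical and are not taken from the original system.</p>
      <preformat><![CDATA[
```python
def pose_bounding_box(keypoints, min_confidence=0.0):
    """Return (x_min, y_min, x_max, y_max) over the detected keypoints.

    keypoints is a list of (x, y, confidence) triples. Keypoints whose
    confidence is not above min_confidence are treated as undetected.
    Returns None when no keypoint was detected.
    """
    points = [(x, y) for x, y, c in keypoints if c > min_confidence]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def bottom_third_box(width, height):
    """Return (left, top, right, bottom) covering the bottom third of an image."""
    return (0, height - height // 3, width, height)
```
]]></preformat>
      <p>The box returned by bottom_third_box can be handed to any image library's crop routine to produce the training crops for the water/non-water classifier.</p>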
      <p>In the last stage, we make a prediction based on the paired
mask and pose. First, to detect all swimming persons in the image,
we select the poses that contain only keypoints from the shoulders
upwards (including the arms). In most swimming cases, OpenPose gives
very good results. After extracting these poses, we crop a 50 x 100
pixel area below the pose bounding box computed in the previous
step and feed it into the WaterClassifier to predict whether the person is
swimming. We observed that in some cases a swimming person with
only the head and upper part visible is missed or misclassified by
Faster R-CNN. Therefore, we also apply this step to the images with a
negative result from the person detection stage, to make sure we do
not miss any swimming person.</p>
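      <p>The swimming check described above can be sketched as follows. This is an illustration under several assumptions, not the authors' code: keypoints follow the 18-point COCO ordering used by OpenPose (indices 2 and 5 for the shoulders, 8 to 13 for hips, knees, and ankles), the 50 x 100 area is taken to be 50 pixels wide and 100 pixels tall, and the crop is centred horizontally under the pose bounding box. The function names and the confidence threshold are hypothetical.</p>
      <preformat><![CDATA[
```python
# Keypoint indices in the 18-point COCO layout (assumed ordering).
SHOULDERS = (2, 5)
LOWER_BODY = (8, 9, 10, 11, 12, 13)  # hips, knees, ankles


def is_upper_body_only(keypoints, min_confidence=0.1):
    """True if a shoulder is detected but no hip, knee, or ankle is."""
    detected = {i for i, (_, _, c) in enumerate(keypoints) if c > min_confidence}
    has_shoulder = any(i in detected for i in SHOULDERS)
    has_lower = any(i in detected for i in LOWER_BODY)
    return has_shoulder and not has_lower


def crop_below(pose_box, image_width, image_height, crop_w=50, crop_h=100):
    """Return a box of at most crop_w x crop_h directly under pose_box.

    The box is centred horizontally on the pose (an assumption) and
    clipped to the image. Returns None if nothing is left after clipping.
    """
    x_min, _, x_max, y_max = pose_box
    center_x = (x_min + x_max) / 2
    left = int(max(0, center_x - crop_w / 2))
    right = int(min(image_width, left + crop_w))
    top = int(min(image_height, y_max))
    bottom = int(min(image_height, top + crop_h))
    if right <= left or bottom <= top:
        return None
    return (left, top, right, bottom)
```
]]></preformat>
      <p>In this sketch, the cropped area of each upper-body-only pose would then be passed to the WaterClassifier; a water prediction would mark the person as swimming.</p>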
      <p>We also use Mask R-CNN to obtain the binary mask and bounding box
(bbox) of each person in the image. We then pair the pose and binary
mask of each person in every water-related image that contains at
least one person. We compute the IoU score between the mask bbox and
the pose bbox for all pose and mask pairs in the image, and then match
poses and masks in order of decreasing IoU, so that each pose is
assigned at most one mask and vice versa. We also eliminate cases where
the person is on a vehicle or boat by removing the paired mask and pose
from the image. After matching the pose and mask of each person in each
image, we handle four special flooded cases: the knee keypoint lies
outside the person (Figure 2.c); the hip keypoint lies outside the
person (Figure 2.b); the keypoints fit the person, but the thighs are
very short compared to the upper body (Figure 2.d); and the knee
keypoint lies close to the submerged part (Figure 2.e). For each case,
we crop a suitable rectangular area. Features are extracted from all of
these cropped areas using ResNet-50 trained on Places365 and used as
the input of the WaterClassifier, which predicts whether the person’s
knee is above the water or not. All remaining images, which are either
not classified as water or do not meet any of the above cases, are
assigned to class 0.</p>
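      <p>The pairing of poses and masks by IoU can be sketched as a greedy one-to-one matching. This is a minimal illustration, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the function names and the min_iou cut-off are hypothetical and are not taken from the original system.</p>
      <preformat><![CDATA[
```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_poses_to_masks(pose_boxes, mask_boxes, min_iou=0.0):
    """Greedily pair poses and masks from the highest IoU to the lowest.

    Each pose gets at most one mask and each mask at most one pose.
    Pairs whose IoU is not above min_iou (an assumed cut-off) are dropped.
    Returns a list of (pose_index, mask_index, iou) tuples.
    """
    candidates = sorted(
        ((iou(p, m), i, j)
         for i, p in enumerate(pose_boxes)
         for j, m in enumerate(mask_boxes)),
        key=lambda t: t[0],
        reverse=True,
    )
    used_poses, used_masks, pairs = set(), set(), []
    for score, i, j in candidates:
        if score <= min_iou:
            break  # all remaining candidates overlap even less
        if i in used_poses or j in used_masks:
            continue
        used_poses.add(i)
        used_masks.add(j)
        pairs.append((i, j, score))
    return pairs
```
]]></preformat>
      <p>Poses or masks that remain unmatched are simply absent from the returned list, which leaves the handling of such cases to the caller.</p>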
      <p>
        For this subtask, we also implement a language model based on
the article’s content and title. We use GloVe [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to
represent each word of the preprocessed text as a 300-dimensional
vector, and employ both an LSTM and a CNN to extract text features.
In the first module, we use a bidirectional LSTM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with 2 layers of 512 nodes each.
In the second module, we use a CNN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with 3 layers of increasing
kernel size (3, 4, and 5). After the title and content of the article are
passed through these two modules, we combine their outputs at the output
layer and feed the result into fully connected layers for classification.
      </p>
      <p>[Table: F1-Score for Run 1, Run 2, and Run 3]</p>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
    </sec>
    <sec id="sec-6">
      <title>Submitted runs</title>
      <p>For the INTD subtask, we submitted 3 runs:</p>
      <p>- Run 1: We randomly split the training set into training and
validation sets with a 9:1 ratio. After training, we also train for a few
more epochs on the entire training and validation set before predicting
on the test set.</p>
      <p>- Run 2: Same model as Run 1, with additional photos from the
training set of the MFLE subtask.</p>
      <p>- Run 3: Same model as Run 2, with some thresholds adjusted.</p>
      <p>For the MFLE subtask, we submitted 5 runs:</p>
      <p>- Run 1: The model described in Section 2.2 (MFLE subtask).</p>
      <p>- Run 2: The text model described at the end of Section 2.2.</p>
      <p>- Run 3: A combination of the results of Run 1 and Run 2, for class 1 only.</p>
      <p>- Run 4, Run 5: Same as Run 1 and Run 3, with some thresholds
of the visual model adjusted.</p>
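      <p>The random 9:1 split used for Run 1 of the INTD subtask can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the fixed seed, and the rounding of the validation size are assumptions made for this sketch.</p>
      <preformat><![CDATA[
```python
import random


def split_train_val(items, val_ratio=0.1, seed=0):
    """Randomly split items into (train, val) with roughly a 9:1 ratio.

    The seed is fixed only to make the sketch reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]
```
]]></preformat>
      <p>After model selection on the validation part, both parts can be concatenated again for the additional training epochs mentioned for Run 1.</p>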
    </sec>
    <sec id="sec-7">
      <title>Results and Analysis</title>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSION AND OUTLOOK</title>
      <p>In this paper, we combine ResNet-50 trained on the Places365
dataset and EfficientNet as feature extractors, OpenPose
for pose estimation, and Mask R-CNN for segmentation to predict
whether an image contains at least one person standing in water above
the knee. Our method shows promising results and achieved the highest
rank in the MFLE subtask of the challenge. In future work, we
think we can improve both the water-related image classifier and the
water classifier to increase accuracy.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is supported by the Vingroup Innovation Foundation (VINIF)
under project code VINIF.2019.DA19. We would like to thank AIOZ Pte
Ltd for supporting our team with computing infrastructure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Benjamin</given-names> <surname>Bischke</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>Helber</surname></string-name>,
          <string-name><given-names>Erkan</given-names> <surname>Basar</surname></string-name>,
          <string-name><given-names>Simon</given-names> <surname>Brugman</surname></string-name>,
          <string-name><given-names>Zhengyu</given-names> <surname>Zhao</surname></string-name>, and
          <string-name><given-names>Konstantin</given-names> <surname>Pogorelov</surname></string-name>.
          <year>2019</year>.
          <article-title>The Multimedia Satellite Task at MediaEval 2019: Flood Severity Estimation</article-title>.
          In <source>Proc. of the MediaEval 2019 Workshop</source>
          (Oct. 27-29, 2019). Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Zhe</given-names> <surname>Cao</surname></string-name>,
          <string-name><given-names>Gines</given-names> <surname>Hidalgo</surname></string-name>,
          <string-name><given-names>Tomas</given-names> <surname>Simon</surname></string-name>,
          <string-name><given-names>Shih-En</given-names> <surname>Wei</surname></string-name>, and
          <string-name><given-names>Yaser</given-names> <surname>Sheikh</surname></string-name>.
          <year>2016</year>.
          <article-title>Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields</article-title>.
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (<year>2016</year>),
          <fpage>1302</fpage>-<lpage>1310</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Kaiming</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>Georgia</given-names> <surname>Gkioxari</surname></string-name>,
          <string-name><given-names>Piotr</given-names> <surname>Dollár</surname></string-name>, and
          <string-name><given-names>Ross B.</given-names> <surname>Girshick</surname></string-name>.
          <year>2017</year>.
          <article-title>Mask R-CNN</article-title>.
          <source>2017 IEEE International Conference on Computer Vision (ICCV)</source>
          (<year>2017</year>),
          <fpage>2980</fpage>-<lpage>2988</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Tsung-Yi</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Maire</surname></string-name>,
          <string-name><given-names>Serge J.</given-names> <surname>Belongie</surname></string-name>,
          <string-name><given-names>Lubomir D.</given-names> <surname>Bourdev</surname></string-name>,
          <string-name><given-names>Ross B.</given-names> <surname>Girshick</surname></string-name>,
          <string-name><given-names>James</given-names> <surname>Hays</surname></string-name>,
          <string-name><given-names>Pietro</given-names> <surname>Perona</surname></string-name>,
          <string-name><given-names>Deva</given-names> <surname>Ramanan</surname></string-name>,
          <string-name><given-names>Piotr</given-names> <surname>Dollár</surname></string-name>, and
          <string-name><given-names>C. Lawrence</given-names> <surname>Zitnick</surname></string-name>.
          <year>2014</year>.
          <article-title>Microsoft COCO: Common Objects in Context</article-title>.
          In <source>ECCV</source>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Jeffrey</given-names> <surname>Pennington</surname></string-name>,
          <string-name><given-names>Richard</given-names> <surname>Socher</surname></string-name>, and
          <string-name><given-names>Christopher</given-names> <surname>Manning</surname></string-name>.
          <year>2014</year>.
          <article-title>GloVe: Global vectors for word representation</article-title>.
          In <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>.
          <fpage>1532</fpage>-<lpage>1543</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Shaoqing</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>Kaiming</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>Ross</given-names> <surname>Girshick</surname></string-name>, and
          <string-name><given-names>Jian</given-names> <surname>Sun</surname></string-name>.
          <year>2015</year>.
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>.
          In <source>Advances in Neural Information Processing Systems</source> 28,
          C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc.,
          <fpage>91</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Mingxing</given-names> <surname>Tan</surname></string-name> and
          <string-name><given-names>Quoc V</given-names> <surname>Le</surname></string-name>.
          <year>2019</year>.
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>.
          <source>arXiv preprint arXiv:1905.11946</source>
          (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.
          <year>2017</year>
          .
          <article-title>Places: A 10 million Image Database for Scene Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Peng</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>Zhenyu</given-names> <surname>Qi</surname></string-name>,
          <string-name><given-names>Suncong</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>Jiaming</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Hongyun</given-names> <surname>Bao</surname></string-name>, and
          <string-name><given-names>Bo</given-names> <surname>Xu</surname></string-name>.
          <year>2016</year>.
          <article-title>Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling</article-title>.
          <source>arXiv preprint arXiv:1611.06639</source>
          (<year>2016</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>