Flood Event Analysis base on Pose Estimation and Water-related
                      Scene Recognition
                    Khanh-An C.Quan1 , Tan-Cong Nguyen 2 , Vinh-Tiep Nguyen1 , Minh-Triet Tran3
                                                    1 University of Information Technology, VNU-HCM
                                               2 University of Social Sciences and Humanities, VNU-HCM
                                                             3 University of Science, VNU-HCM

                    15520006@gm.uit.edu.vn,ntcong@hcmussh.edu.vn,tiepnv@uit.edu.vn,tmtriet@fit.hcmus.edu.vn

ABSTRACT
In this paper, we describe our approach for the Multimedia Satellite
Task: Emergency Response for Flooding Events at the MediaEval
2019 Challenge. Specifically, for the Multimodal Flood Level Esti-
mation subtask, we employ a combination of ResNet-50 trained on
Places365 dataset for features extractor, OpenPose for pose esti-
mation and Mask R-CNN for segmentation to predict if an image
has at least one person standing in water above the knee. Our ap-
proach has achieved the highest results for Multimodal Flood Level
Estimation subtask.


1     INTRODUCTION
                                                                                          Figure 1: Overview of our MFLE pipeline
In this Multimedia Satellite Task, we take part in two subtasks:
Image-based News Topic Disambiguation (INTD) and Multimodal
Flood Level Estimation (MFLE). We propose using EfficientNet                      For the first stages, we label all the images of the training set into
features [8] for training a water-related image classifier in the first        two categories: water-related and non-water-related scene. Then,
subtask. For the second task, we use both EfficientNet and ResNet-             we use the result of the average pooling layer from ResNet-50
50 features. Then, we employ Faster R-CNN[7] to detect if there                trained on Places365[9] as visual features to combine with a neural
are people in the image. We also combine binary mask from Mask                 network to classify whether the image scene related to water or
R-CNN [3] and pose from OpenPose [2] to predict whether the                    not. We also employ the visual model from the first task (described
image contains at least one person standing in water above the                 in detail in section 2.1) on the non-water-related images to ensure
knee. We also implement a language model for article’s content and             that the water-related images are not omitted. All water-related
title contains the image. To evaluate our method, we use F1 score.             images will be carried over to the next stage and the remaining
Full details of the challenge tasks can be found in [1].                       images will be labeled as class 0.
                                                                                  For the second stage, we use Faster R-CNN to eliminate images
2 APPROACH                                                                     that do not contain a person inside on water-related images. Both
2.1 INTD subtask                                                               positive and negative images will be estimated pose by OpenPose
                                                                               to detect the swimming person in the next stage.
Firstly, the input image will be segmented to get the background.
                                                                                  In the third stage, we use the OpenPose with COCO output
After that, we use EfficientNet architecture to extract image features
                                                                               format [5] to estimate the poses of all people (figure 2). We also
of both the original image and background image. We use the
                                                                               calculate the pose bounding box based on the output keypoints.
extracted features on multiple convolution layer and concatenate
                                                                               After that, we train a WaterClassifier network to predict the label of
them together. By using the original image and the background
                                                                               water/non-water. We crop 1⁄3 area from the bottom of all images in
image we have two extracted image features of the same size. Finally,
                                                                               the training set vertically and divide manually into water/non-water
we concatenate these two features together and feed into fully-
                                                                               image. Then we extract visual features using ResNet-50 trained
connected layers to estimate the final result.
                                                                               on Places365 and train a neural network to predict the label of
                                                                               water/non-water.
2.2     MFLE subtask
                                                                                  For the last stage, we will make a prediction based on the paired
For the second task, our proposed method contains four stages:                 mask and pose. Firstly, to detect all the swimming persons that
water-related scene recognition, person detection, pose estimation             contain in the image we extract poses with shoulders upwards
and prediction based on paired mask and pose. Our system pipeline              only (including arms). In most swimming case, OpenPose gives
for this subtask shown in the Figure 1.                                        very well result. After extract poses, we crop 50 x 100 pixels areas
Copyright 2019 for this paper by its authors. Use
                                                                               below the pose bounding box that calculated from the previous
permitted under Creative Commons License Attribution                           step then feed into WaterClassifier to predict whether a person is
4.0 International (CC BY 4.0).                                                 swimming or not. According to the observation, we realized that
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France                                                                     Khanh-An et al.


                                                                                                                 Runs     F1-Score
                                                                                       Runs    F1-Score
                                                                                                                 Run 1     0.8831
                                                                                      Run 1     0.8850           Run 2     0.5341
                                                                                      Run 2     0.8603           Run 3     0.7484
                                                                                      Run 3     0.8757           Run 4     0.8746
                                                                                              (a)
                                                                                                                 Run 5     0.7419
                                                                                                                         (b)


                                                                           Figure 3: Results of (a) INTD subtask and (b) MFLE subtask.


Figure 2: All cases of the result predicted by OpenPose                    into these two modules, we summarize their output on the output
(Red/yellow circle illustrates the knee/hip keypoint meet                  layer and feed in the full connected layers to classify.
the conditions of each case). (a) Swimming person with pose
from the neck upward, (b) Hip keypoint outside of the per-                 3 RESULTS AND ANALYSIS
son, (c) Knee keypoint outside of the person, (d) Keypoint
fit with the person but the ratio of thighs is very small com-             3.1 Submitted runs
pared to the upper body, (e) Knee keypoint close to the sub-               For the INTD subtask, we have submitted 3 runs, as below:
merged part, (f) Normal pose of the person not submerged                      - Run 1: Randomly split the train set with 9:1 ratio into train
                                                                           and val set. After training, we also run more some epochs on the
                                                                           entire training and validation set before predicting on the test set.
in some cases swimming persons that only have head and upward                 - Run 2: Same model as Run 1 with additional photos in the
cannot be detected or misclassified by Faster R-CNN. Therefore, we         training set of MFLE subtask.
also applied this to the negative result from the person detection            - Run 3: Same model as Run 2 but adjusted some threshold.
stage to make sure we do not miss any swimming person.                     For the MFLE subtask, we have submitted 5 runs, as below:
   We also use Mask R-CNN to get binary mask and bounding box                 - Run 1: Model described at Section 2.2.
(bbox) of each person in the images. Then, we conducted a pairing             - Run 2: Text model described at the end of the Section 2.2.
between pose and binary mask of each person in all the water-                 - Run 3: Combine the results Run 1 and 2 with class 1 only.
related images that have at least one person. We calculate the IoU            - Run 4, Run 5: Same as Run 1 and Run 3 but adjusted some
score of bounding box mask and bbox pose of all pose and binary            threshold of visual model.
mask pairs included in the image. After that, we match pose and
mask with IoU score from high to low with each pose having only
one mask and vice versa. We also eliminate cases where the person          3.2    Results and Analysis
is on a vehicle or boat removing the paired mask and pose from the         Figure 3.a presents results of three runs for the INTD subtask. In
image. After matching pose and mask of each person in each image,          the first Run, the model has 0.887 score. But in run 2 and Run 3,
we conduct resolve special flooded cases: Knee keypoint outside of         by using extra dataset from Subtask 2, the performance is reduced,
the person (Figure 2.c), Hip keypoint outside of the person (Figure        may be due to distribution of two datasets are different.
2.b), Keypoint fit with the person but the ratio of thighs is very small      As shown in the Figure 3.b, Run 1 obtains the best result for the
compared to the upper body (Figure 2.d), Knee keypoint close to            MFLE subtask. The text features at Run 2 does seem to achieve
the submerged part (Figure 2.e) by crop a rectangular area suitable        average results. The results of Run 4 and Run 5 are only adjusted
rectangle for each case. All the rectangular areas cropped from            at some threshold so there is no big difference compared with Run
the above cases will be extract features using ResNet-50 trained           1 and Run 3.
on Place365 as the input of the WaterClassifier to predict whether
the person’s knee above the water or not. All remaining images
                                                                           4     CONCLUSION AND OUTLOOK
not classified with water or that do not meet the above cases are
classified in class 0.                                                     In this paper, we employ a combination of ResNet-50 trained on
   For this subtask, we also implement language model base on              Places365 dataset and EfficientNet for features extractor, OpenPose
the article’s content and title. We employ both LSTM and CNN to            for pose estimation and Mask R-CNN for segmentation to predict
extract features of preprocessed text. Then, we use GloVe [6] to           an image has at least one person standing in water above the knee.
represent each word by a 300-dim vector. In the first module, we           Our methods show potential results and achieve the highest rank
use Bidirectional LSTM [10] with 2 layers with 512 nodes of each.          at the MFLE subtask at the challenge. For the future works, we
In the second module, we use CNN [4] 3 layers with increasing              think we can improve both water-related image classifier and water
kernel size of 3,4,5. After the title and content of the article are put   classifier to increase accuracy.
The 2019 Multimedia Satellite Task                                                   MediaEval’19, 27-29 October 2019, Sophia Antipolis, France


ACKNOWLEDGMENTS                                                                     Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common
Research is supported by Vingroup Innovation Foundation (VINIF)                     Objects in Context. In ECCV.
                                                                                [6] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014.
in project code VINIF.2019.DA19. We would like to thank AIOZ Pte
                                                                                    Glove: Global vectors for word representation. In Proceedings of the
Ltd for supporting our team with computing infrastructure.                          2014 conference on empirical methods in natural language processing
                                                                                    (EMNLP). 1532–1543.
REFERENCES                                                                      [7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster
 [1] Benjamin Bischke, Patrick Helber, Erkan Basar, Simon Brugman,                  R-CNN: Towards Real-Time Object Detection with Region Proposal
     Zhengyu Zhao, and Konstantin Pogorelov. The Multimedia Satel-                  Networks. In Advances in Neural Information Processing Systems 28,
     lite Task at MediaEval 2019: Flood Severity Estimation. In Proc. of the        C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett
     MediaEval 2019 Workshop (Oct. 27-29, 2019). Sophia Antipolis, France.          (Eds.). Curran Associates, Inc., 91–99.
 [2] Zhe Cao, Gines Hidalgo, Tomá imon, Shih-En Wei, and Yaser Sheikh.          [8] Mingxing Tan and Quoc V Le. 2019. EfficientNet: Rethinking
     2016. Realtime Multi-person 2D Pose Estimation Using Part Affinity             Model Scaling for Convolutional Neural Networks. arXiv preprint
     Fields. 2017 IEEE Conference on Computer Vision and Pattern Recogni-           arXiv:1905.11946 (2019).
     tion (CVPR) (2016), 1302–1310.                                             [9] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio
 [3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017.        Torralba. 2017. Places: A 10 million Image Database for Scene Recog-
     Mask R-CNN. 2017 IEEE International Conference on Computer Vision              nition. IEEE Transactions on Pattern Analysis and Machine Intelligence
     (ICCV) (2017), 2980–2988.                                                      (2017).
 [4] Yoon Kim. 2014. Convolutional neural networks for sentence classifi-      [10] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao,
     cation. arXiv preprint arXiv:1408.5882 (2014).                                 and Bo Xu. 2016. Text classification improved by integrating bidi-
 [5] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev,            rectional LSTM with two-dimensional max pooling. arXiv preprint
     Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr               arXiv:1611.06639 (2016).