Ensembled Convolutional Neural Network Models for Retrieving Flood Relevant Tweets

Yu Feng, Sergiy Shebotnov, Claus Brenner, Monika Sester
Institute of Cartography and Geoinformatics, Leibniz University Hannover, Germany
{yu.feng,claus.brenner,monika.sester}@ikg.uni-hannover.de, shebotnov@gmail.com

Copyright held by the owner/author(s).
MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
Social media, which provides instant textual and visual information exchange, plays a more important role in emergency response than ever before. Many researchers nowadays focus on disaster monitoring using crowdsourcing. The interpretation and retrieval of such information significantly influences the efficiency of these applications. This paper presents the method proposed by team EVUS-ikg for the Multimedia Satellite Task at MediaEval 2018. We focused only on the subtask "flood classification for social multimedia". A supervised learning method with an ensemble of 10 Convolutional Neural Networks (CNNs) was applied to classify the tweets in the benchmark.

1 INTRODUCTION
Crowdsourcing is a rapidly developing method for acquiring information from many users in real time. Many applications nowadays focus on monitoring natural disaster events such as earthquakes, fires, and flooding. The retrieved information can improve situation awareness for citizens. At the same time, it helps rescuers provide a better emergency response. Flooding is one of the topics that attracts a lot of attention. With the development of information retrieval and deep learning techniques, deep learning models can achieve much better performance for visual and textual information understanding than approaches based on pre-defined keywords for extracting flood-relevant information.

In our previous work [5], a method that considers the predictions of separately trained text and image classifiers was used to extract flood and heavy rainfall relevant information from Twitter data. However, an end-to-end classification approach, which can directly fuse the information, seems more attractive. Some of the teams [1, 8, 10] in the Multimedia Satellite Task [2] at MediaEval 2017 already achieved end-to-end solutions. Well-performing end-to-end classifiers have been trained on Flickr data with binary labels (evidence and no evidence of flooding).

More information regarding the floods, such as their severity, is still desired. The Multimedia Satellite Task at MediaEval 2018 [3] provided binary labels for the tweets (evidence and no evidence of road passability). For the tweets with evidence, the benchmark dataset additionally provided labels for road passability. Most tweets in this dataset were labeled as no evidence (3,685 tweets). The numbers of tweets labeled as passable and not passable are 946 and 1,179, respectively.

2 APPROACH
In this section, our approach is introduced. All models were trained using the TensorFlow and Keras frameworks. We randomly selected 10% of the given dataset (582 tweets) as an independent internal test set. Moreover, 60 tweets per label were randomly selected and used as a validation set. All remaining tweets were used to train the models. The network architectures and parameters were tuned and compared internally based on the performance of the models on the internal test set. The validation set was used for early stopping during training. Data augmentation, such as rotation, shift, and zoom, was also performed during the training process.
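As an illustration of this augmentation step, the following is a minimal sketch using Keras' ImageDataGenerator; the parameter values, image size, and directory layout are assumptions, since the exact settings are not reported here.

    # Hypothetical sketch of rotation/shift/zoom augmentation with Keras;
    # parameter values and paths are illustrative, not the submitted settings.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(
        rotation_range=20,       # random rotations (degrees)
        width_shift_range=0.1,   # random horizontal shifts
        height_shift_range=0.1,  # random vertical shifts
        zoom_range=0.2,          # random zoom
        rescale=1.0 / 255.0,
    )

    # Assumed directory layout: one sub-folder per class label.
    train_generator = train_datagen.flow_from_directory(
        "data/train",
        target_size=(299, 299),
        batch_size=32,
        class_mode="categorical",
    )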
Run 1 allows only visual information to be used for the classification task. The pre-trained models DenseNet201 [6], InceptionV3 [12], and InceptionResNetV2 [11] were used as basic feature extractors. They were all trained on the ImageNet dataset and achieved top-5 accuracies of 0.936, 0.937, and 0.953, respectively. We froze the weights of these pre-trained models and concatenated the nodes of the layers before their output layers. This was followed by two dense layers with batch normalization and 50% dropout, and an output of three nodes with the softmax function. The architecture of our model is shown in Figure 1.

[Figure 1 omitted: input image → frozen DenseNet201 (1920 × 1), InceptionV3 (2048 × 1), and InceptionResNetV2 (1536 × 1) features → concatenation (5504 × 1) → FC (1024 × 1) → FC (128 × 1) → softmax (3 × 1)]

Figure 1: CNN model using visual information only
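A minimal Keras sketch of this architecture is given below; the layer sizes follow Figure 1, while the input resolution, activations, optimizer, and per-model image preprocessing are assumptions.

    # Sketch of the Run 1 classifier (layer sizes from Figure 1); input size,
    # activations, optimizer and preprocessing are assumptions.
    from tensorflow.keras import Input, Model, layers
    from tensorflow.keras.applications import (DenseNet201, InceptionV3,
                                               InceptionResNetV2)

    image = Input(shape=(299, 299, 3))

    # Frozen ImageNet backbones as feature extractors; global average pooling
    # yields 1920-, 2048- and 1536-dimensional feature vectors, respectively.
    backbones = [
        DenseNet201(weights="imagenet", include_top=False, pooling="avg"),
        InceptionV3(weights="imagenet", include_top=False, pooling="avg"),
        InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg"),
    ]
    for backbone in backbones:
        backbone.trainable = False

    features = layers.Concatenate()([b(image) for b in backbones])  # 5504-d

    x = layers.Dense(1024, activation="relu")(features)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
    output = layers.Dense(3, activation="softmax")(x)

    visual_model = Model(image, output)
    visual_model.compile(optimizer="adam", loss="categorical_crossentropy",
                         metrics=["accuracy"])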
Run 2 allows only metadata information to be used for the classification task. In this case, only the user-provided tweet texts are used as inputs. We applied a classic CNN model [7] for natural language processing, combined with fastText word embeddings [4]. This embedding contains one million word vectors trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset [9]. Each word vector has 300 dimensions. Since most of the texts in the training dataset contain no more than 21 words after pre-processing (e.g., removing stop words, URLs, and emojis), we limited the maximum allowed sentence length n to 21. For sentences with fewer than 21 words, we used zero padding to obtain a fixed-size input (21 × 300) for the network. Convolutional filters were then applied to this embedding matrix to extract feature maps for each sentence. After experiments with different filter sizes, the combination of filter sizes 1, 3, and 5 performed best on our internal test set. The architecture of the model is shown in Figure 2.

[Figure 2 omitted: text → embedding (n × 300) → three parallel branches of Conv1D (n × 64) → MaxPool (10 × 64) → Conv1D (10 × 64) → global MaxPool (1 × 64) → concatenation (192 × 1) → softmax (3 × 1)]

Figure 2: CNN model using metadata information
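The following is a minimal Keras sketch of this text CNN, assuming the tweets have already been converted to zero-padded 21 × 300 fastText embedding matrices; the branch widths follow Figure 2, while activations and the optimizer are assumptions.

    # Sketch of the Run 2 text CNN (Figure 2): three parallel branches with
    # kernel sizes 1, 3 and 5 on pre-computed fastText embeddings.
    from tensorflow.keras import Input, Model, layers

    MAX_LEN, EMB_DIM = 21, 300                    # zero-padded fastText input
    text = Input(shape=(MAX_LEN, EMB_DIM))

    branches = []
    for kernel_size in (1, 3, 5):
        x = layers.Conv1D(64, kernel_size, padding="same", activation="relu")(text)
        x = layers.MaxPooling1D(pool_size=2)(x)   # 21 -> 10 time steps
        x = layers.Conv1D(64, kernel_size, padding="same", activation="relu")(x)
        x = layers.GlobalMaxPooling1D()(x)        # 64-d per branch
        branches.append(x)

    features = layers.Concatenate()(branches)     # 192-d
    output = layers.Dense(3, activation="softmax")(features)

    text_model = Model(text, output)
    text_model.compile(optimizer="adam", loss="categorical_crossentropy",
                       metrics=["accuracy"])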
Run 3 allows fused metadata and visual information to be used for tweet classification. Pre-trained ImageNet models were used as visual feature extractors in the same way as in Run 1. We cut the visual model at the dense layer with 1024 nodes. For the metadata information, we used the same CNN structure as presented in Run 2. The output of the text classifier was the dense layer with 192 nodes. We simply concatenated the features from both modalities and used the softmax function to derive an output of three nodes.
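A sketch of this fusion could look as follows; it assumes visual_features_model and text_features_model are the Run 1 and Run 2 networks already cut at their 1024-node and 192-node layers, respectively.

    # Sketch of the Run 3 late-fusion model; `visual_features_model` and
    # `text_features_model` are assumed to end at the 1024- and 192-node layers.
    from tensorflow.keras import Model, layers

    fused = layers.Concatenate()([visual_features_model.output,   # 1024-d
                                  text_features_model.output])    # 192-d
    output = layers.Dense(3, activation="softmax")(fused)

    fusion_model = Model(
        [visual_features_model.input, text_features_model.input], output)
    fusion_model.compile(optimizer="adam", loss="categorical_crossentropy",
                         metrics=["accuracy"])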
Run 4 is a general run, in which we only used the visual information. The same architecture as in Run 1 was used. Since flood-related tweets make up only a very small proportion of the tweets users send daily, we wanted to investigate whether introducing a large amount of negatively labeled images can improve the robustness of the classifiers for the current task. Therefore, the 3,349 negatively labeled Flickr photos from the Multimedia Satellite Task at MediaEval 2017 were used as an extra data source to improve the robustness of the model against tweets with no evidence.

Run 5 is also a general run, in which we only used the visual information. The same architecture as in Run 1 was used. We observed that many of the positively labeled images from the Multimedia Satellite Task at MediaEval 2017 depicted severe flood situations. Therefore, in this run we assigned label (2), evidence "not passable", to these 1,916 positively labeled photos. Together with the 3,349 negatively labeled photos, we trained an image classifier. In this way, we introduced imprecisely labeled training examples. Our intention was to investigate how much the performance of the classifiers is affected by introducing more data with imprecise labels.

Since the random initialization of the weights in neural networks often leads to unstable model performance, each model in Runs 1, 2, and 3 was trained 10 times on the same training set. We regard these 10 models as weak classifiers, ensemble their predictions, and take the majority vote of all 10 predictions as the final prediction. In this way, we expect the ensemble learning to improve the robustness of the classifiers. For Runs 4 and 5, the models were trained only once due to the much longer training time.
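A minimal sketch of this majority voting, assuming the 10 trained models are collected in a list models and x_test holds the prepared test inputs, is shown below.

    # Hypothetical majority-vote ensembling over 10 independently trained models.
    import numpy as np

    def majority_vote(models, x_test, num_classes=3):
        # Collect the hard class decision of every weak classifier: shape (10, N).
        votes = np.stack([np.argmax(m.predict(x_test), axis=1) for m in models])
        # Per sample, return the class receiving the most votes.
        return np.array([np.bincount(votes[:, i], minlength=num_classes).argmax()
                         for i in range(votes.shape[1])])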

3 RESULTS AND DISCUSSION
For the subtask "flood classification for social multimedia", we first tested the performance of the classifiers on our internal test set. The results are shown in Table 1. The evaluation on the private test set, which was provided by the organizers, is shown in Table 2. The metric used for evaluation is the F1-score averaged over the classes (1) evidence, passable and (2) evidence, not passable. Additionally, we also list the F1-scores of both classes separately, as well as the accuracy, on our internal test set.

Table 1: Evaluation on internal test set

                 Run 1     Run 2     Run 3     Run 4     Run 5
  Avg. F1-score  66.94%    41.05%    61.23%    54.67%    49.89%
  F1-score (1)   61.08%    43.37%    57.14%    53.02%    33.10%
  F1-score (2)   72.80%    38.73%    65.32%    56.31%    66.67%
  Accuracy       81.93%    55.33%    75.90%    75.73%    75.90%

Table 2: Evaluation on private test set

                 Run 1     Run 2     Run 3     Run 4     Run 5
  Avg. F1-score  64.35%    32.81%    59.49%    52.16%    51.59%

From the results above, the classifiers show generally similar performance on the internal and the private test set. The visual classifier (Run 1) achieves the best performance of all runs. The models trained only on metadata information do not achieve good performance on either test set. According to the evaluation, the fusion of both models did not bring any significant improvement. From the results of Runs 4 and 5, we noticed that introducing more negative examples, or more examples with imprecise labels, leads to significantly worse performance. Therefore, we conclude that the balance of the training examples plays an important role in classifier performance.

Since the textual descriptions from users rarely address the severity of the flood situation, it is very hard to achieve reasonable classification performance based only on textual information. Textual information may contain informative words or phrases regarding flood evidence. However, distinguishing whether the road is passable from a single tweet text, which is at most 280 characters long, is a challenging task. In our case, we concluded that introducing textual information did not help for the current task.

4 CONCLUSIONS AND OUTLOOK
In this paper, an ensemble of CNN models was trained for retrieving flood-relevant tweets. Our best model is the one trained only on visual information. Using only metadata, the classifiers were not able to produce high-quality predictions. Relying on photographs is reasonable, since nowadays people are likely to share photographs to document their current situation, rather than detailed textual descriptions. The analysis of video sequences would be a promising extension for extracting more information regarding flooding events, such as rainfall intensity, flow speed, or even water depth.

ACKNOWLEDGMENTS
The authors would like to acknowledge the support from the BMBF-funded research project "EVUS - Real-Time Prediction of Pluvial Floods and Induced Water Contamination in Urban Areas" (BMBF, 03G0846A). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a GeForce Titan X GPU used for this research.

REFERENCES
 [1] Benjamin Bischke, Prakriti Bhardwaj, Aman Gautam, Patrick Helber,
     Damian Borth, and Andreas Dengel. 2017. Detection of flooding events
     in social multimedia and satellite imagery using deep neural networks.
     In Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin,
     Ireland.
 [2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan
     Venkat, Andreas Dengel, and Damian Borth. 2017. The multime-
     dia satellite task at MediaEval 2017: Emergency response for flooding
     events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017).
     Dublin, Ireland.
 [3] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and
     Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018:
     Emergency Response for Flooding Events. In Proc. of the MediaEval
     2018 Workshop (Oct. 29-31, 2018). Sophia-Antipolis, France.
 [4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov.
     2017. Enriching Word Vectors with Subword Information. Transactions
     of the Association for Computational Linguistics 5 (2017), 135–146.
 [5] Yu Feng and Monika Sester. 2018. Extraction of pluvial flood relevant
     volunteered geographic information (VGI) by deep learning from
     user generated texts and photos. ISPRS International Journal of Geo-
     Information 7, 2 (2018), 39.
 [6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Wein-
     berger. 2017. Densely Connected Convolutional Networks. In The
     IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [7] Yoon Kim. 2014. Convolutional neural networks for sentence classifi-
     cation. arXiv preprint arXiv:1408.5882 (2014).
 [8] Laura Lopez-Fuentes, Joost van de Weijer, Marc Bolanos, and Harald
     Skinnemoen. 2017. Multi-modal deep learning approach for flood
     detection. In Proc. of the MediaEval 2017 Workshop (Sept. 13–15, 2017).
     Dublin, Ireland.
 [9] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch,
     and Armand Joulin. 2018. Advances in Pre-Training Distributed Word
     Representations. In Proceedings of the International Conference on Lan-
     guage Resources and Evaluation (LREC 2018).
[10] Keiller Nogueira, Samuel G Fadel, Ícaro C Dourado, Javier AV Muñoz,
     Otávio AB Penatti, Rodrigo Tripodi Calumby, Lin Li, and Jefersson Alex
     dos Santos. 2017. Data-Driven Flood Detection using Neural Networks.
     In Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin,
     Ireland.
[11] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A
     Alemi. 2017. Inception-v4, inception-resnet and the impact of residual
     connections on learning. In AAAI, Vol. 4. 12.
[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
     Zbigniew Wojna. 2016. Rethinking the inception architecture for
     computer vision. In Proceedings of the IEEE conference on computer
     vision and pattern recognition. 2818–2826.