=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_27
|storemode=property
|title=Ensembled Convolutional Neural Network Models for Retrieving Flood Relevant Tweets
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_27.pdf
|volume=Vol-2283
|authors=Yu Feng,Sergiy Shebotnov,Claus Brenner,Monika Sester
|dblpUrl=https://dblp.org/rec/conf/mediaeval/FengSBS18
}}
==Ensembled Convolutional Neural Network Models for Retrieving Flood Relevant Tweets==
Yu Feng, Sergiy Shebotnov, Claus Brenner, Monika Sester
Institute of Cartography and Geoinformatics, Leibniz University Hannover, Germany
{yu.feng,claus.brenner,monika.sester}@ikg.uni-hannover.de, shebotnov@gmail.com

ABSTRACT

Social media, which provides instant textual and visual information exchange, plays a more important role in emergency response than ever before. Many researchers nowadays focus on disaster monitoring using crowdsourcing. The interpretation and retrieval of such information significantly influence the efficiency of these applications. This paper presents the method proposed by team EVUS-ikg for the Multimedia Satellite Task at MediaEval 2018. We focused only on the subtask "flood classification for social multimedia". A supervised learning method with an ensemble of 10 Convolutional Neural Networks (CNNs) was applied to classify the tweets in the benchmark.

1 INTRODUCTION

Crowdsourcing is a rapidly developing method for acquiring information from many users in real time. Many applications nowadays focus on monitoring natural disaster events such as earthquakes, fires and flooding. The retrieved information can improve situation awareness for citizens and, at the same time, helps rescuers to provide a better emergency response. Flooding is one of the topics that attracts a lot of attention. With the development of information retrieval and deep learning techniques, deep learning models can achieve much better performance for visual and textual information understanding than extraction based on pre-defined keywords.

In our previous work [5], a method which considers the predictions from separately trained text and image classifiers was used to extract flood and heavy rainfall relevant information from Twitter data. However, an end-to-end classification approach, which can directly fuse the information, seems more attractive. Some of the teams [1, 8, 10] from the Multimedia Satellite Task [2] at MediaEval 2017 have already achieved end-to-end solutions, and well-performing end-to-end classifiers have been trained on Flickr data with binary labels (evidence and no evidence of flooding).

More information regarding the floods, such as their severity, is still desired. The Multimedia Satellite Task at MediaEval 2018 [3] provided binary labels for the tweets (evidence and no evidence of road passability). For the tweets with road passability evidence, the benchmark dataset also provided passability labels. Most tweets in this dataset were labeled as no evidence (3,685 tweets); the numbers of tweets labeled as passable and not passable are 946 and 1,179, respectively.

2 APPROACH

In this section, our approach is introduced. All models were trained using the TensorFlow and Keras frameworks. We randomly selected 10% of the given dataset (582 tweets) as an independent internal test set. Moreover, 60 tweets from each label were randomly selected and used as a validation set, and all remaining tweets were used to train the models. The network architectures and parameters were tuned and compared internally based on the performance of the models on the internal test set; the validation set was used for early stopping during training. Data augmentation, such as rotation, shift and zoom, was also performed during the training process.

Run 1 allows only visual information to be used for the classification task. The pre-trained models DenseNet201 [6], InceptionV3 [12] and InceptionResNetV2 [11] were used as basic feature extractors. They were all trained on the ImageNet dataset and achieved a top-5 accuracy of 0.936, 0.937 and 0.953, respectively. We froze the weights of these pre-trained models and concatenated the nodes of the layers before their output layers (1920 + 2048 + 1536 = 5504 features). This is followed by two dense layers (1024 and 128 nodes) with batch normalization and a dropout of 50%, producing an output of three nodes with the softmax function. The architecture of our model is shown in Figure 1.

Figure 1: CNN model using visual information only (image input; frozen DenseNet201, InceptionV3 and InceptionResNetV2 feature extractors; concatenation of 5504 features; fully connected layers with 1024 and 128 nodes; 3-node softmax output).
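For illustration, a minimal Keras sketch of this architecture is given below. The layer sizes follow Figure 1, while the input resolution (224 x 224), the ReLU activations and the Adam optimizer are assumptions made for this example rather than settings confirmed in the paper.

```python
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import DenseNet201, InceptionV3, InceptionResNetV2

# Shared image input; the 224 x 224 resolution is an assumption for this sketch.
inp = Input(shape=(224, 224, 3))

# Three ImageNet-pretrained backbones used as frozen feature extractors.
feats = []
for Backbone in (DenseNet201, InceptionV3, InceptionResNetV2):
    base = Backbone(include_top=False, weights="imagenet", input_tensor=inp)
    base.trainable = False                        # freeze the pre-trained weights
    feats.append(layers.GlobalAveragePooling2D()(base.output))

# Concatenated features (1920 + 2048 + 1536 = 5504), followed by two dense
# layers with batch normalization and 50% dropout, as in Figure 1.
x = layers.concatenate(feats)
for units in (1024, 128):
    x = layers.Dense(units, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
out = layers.Dense(3, activation="softmax")(x)    # no evidence / passable / not passable

visual_model = Model(inp, out)
visual_model.compile(optimizer="adam", loss="categorical_crossentropy",
                     metrics=["accuracy"])
```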
Run 2 allows only metadata information to be used for the classification task. In this case, only the user-provided tweet texts are used as inputs. We applied a classic CNN model for natural language processing [7], combined with the word embeddings of fastText [4]. This embedding contains one million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset [9]; each word vector has 300 dimensions. Since most of the texts in the training dataset are no longer than 21 words after pre-processing (e.g. removing stop words, URLs and emojis), we limited the maximum allowed sentence length n to 21. For sentences with fewer than 21 words, we used zero padding to obtain a fixed-size input (21 x 300) for the network. Convolutional filters were then applied to this embedding matrix to extract feature maps for each sentence. After experiments with different filter sizes, the combination of filter sizes 1, 3 and 5 performed best on our internal test set. The architecture of the model is shown in Figure 2.

Figure 2: CNN model using metadata information (n x 300 embedded text input; three parallel branches with filter sizes 1, 3 and 5, each consisting of a Conv1D layer, max pooling, a second Conv1D layer and global max pooling with 64 filters; concatenation into a 192-dimensional feature vector; 3-node softmax output).
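The text branch can be sketched in Keras as follows. The filter sizes, the 64 filters per branch and the 192-dimensional concatenated feature follow Figure 2; the pooling size, activations and optimizer are assumptions for this illustration, and the tweets are assumed to be already mapped to their fastText vectors and zero-padded to 21 words.

```python
from tensorflow.keras import Input, Model, layers

MAX_WORDS, EMB_DIM = 21, 300
# One padded fastText embedding matrix per tweet (21 x 300).
inp = Input(shape=(MAX_WORDS, EMB_DIM))

# Three parallel branches with filter sizes 1, 3 and 5; each branch ends in a
# 64-dimensional global-max-pooled feature vector (3 x 64 = 192 features total).
branches = []
for ksize in (1, 3, 5):
    x = layers.Conv1D(64, ksize, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(64, ksize, padding="same", activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    branches.append(x)

features = layers.concatenate(branches)           # 192-dimensional text feature
out = layers.Dense(3, activation="softmax")(features)

text_model = Model(inp, out)
text_model.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
```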
Run 3 allows fused metadata-visual information to be used for tweet classification. Pre-trained models on ImageNet were used as visual feature extractors in the same way as in Run 1, and we cut the visual model at the dense layer with 1024 nodes. For the metadata information, we used the same CNN structure as presented in Run 2; the output of the text classifier is the layer with 192 nodes. We simply concatenated the features from both modalities and used the softmax function to derive an output of 3 nodes.

Run 4 is a general run, where we only used the visual information, with the same architecture as in Run 1. Since flood related tweets make up only a very small proportion of the daily user-sent tweets, we wanted to investigate whether introducing a large amount of negatively labeled images can improve the robustness of the classifiers for the current task. Therefore, the 3,349 negatively labeled Flickr photos from the Multimedia Satellite Task at MediaEval 2017 were used as an extra data source to improve the robustness of the model against tweets with no evidence.

Run 5 is also a general run, where we only used the visual information, again with the same architecture as in Run 1. We observed that many of the positively labeled images from the Multimedia Satellite Task at MediaEval 2017 depict severe flood situations. Therefore, in this run we assigned label (2), evidence "not passable", to these 1,916 positively labeled photos and trained an image classifier on them together with the 3,349 negatively labeled photos. In this way, we introduced imprecisely labeled training examples; our intention was to investigate how much the performance of the classifiers is affected by introducing more data, but with imprecise labels.

Since the random initialization of the weights in neural networks often leads to unstable performance of the models, each model in Runs 1, 2 and 3 was trained 10 times on the same training set. We regard these 10 models as weak classifiers, ensemble their predictions and take the majority vote of all 10 predictions as the final prediction. In this way, we hope that ensemble learning improves the robustness of the classifiers. For Runs 4 and 5, the models were trained only once due to the much longer training time.
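The majority voting over the 10 weak classifiers could be implemented as sketched below (a hypothetical helper, not the authors' code; it assumes each trained model returns class probabilities of shape (num_tweets, 3)).

```python
import numpy as np

def ensemble_predict(models, x):
    """Predict with every weak classifier and return the majority-vote label per tweet."""
    # One predicted label per model and tweet, shape (10, num_tweets).
    votes = np.stack([m.predict(x).argmax(axis=1) for m in models])
    # Count the votes for the 3 classes per tweet and pick the most frequent one.
    counts = np.apply_along_axis(lambda v: np.bincount(v, minlength=3), 0, votes)
    return counts.argmax(axis=0)                  # shape (num_tweets,)
```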
3 RESULTS AND DISCUSSION

For the subtask "flood classification for social multimedia", we first tested the performance of the classifiers on our internal test set; the results are shown in Table 1. The evaluation on the private test set, which was provided by the organizers, is shown in Table 2. The metric used for evaluation is the averaged F1-score of the classes (1) evidence, passable and (2) evidence, not passable. Additionally, we list the F1-scores separately for both classes, as well as the accuracy, on our internal test set.

Table 1: Evaluation on internal test set

                 Run 1     Run 2     Run 3     Run 4     Run 5
Avg. F1-score    66.94%    41.05%    61.23%    54.67%    49.89%
F1-score (1)     61.08%    43.37%    57.14%    53.02%    33.10%
F1-score (2)     72.80%    38.73%    65.32%    56.31%    66.67%
Accuracy         81.93%    55.33%    75.90%    75.73%    75.90%

Table 2: Evaluation on private test set

                 Run 1     Run 2     Run 3     Run 4     Run 5
Avg. F1-score    64.35%    32.81%    59.49%    52.16%    51.59%

From the results above, the classifiers show generally similar performance on the internal and the private test set. The visual classifier (Run 1) achieves the best performance compared to all other runs. The models trained only on metadata information do not achieve good performance on either test set, and the fusion of both modalities did not yield any significant improvement according to the evaluation. From the results of Runs 4 and 5, we notice that introducing more negative examples, or more examples with imprecise labels, leads to significantly worse performance. We therefore conclude that the balance of the training examples plays an important role for the classifier performance.

Since the textual descriptions from users rarely address the severity of the flood situation, it is very hard to achieve a reasonable classification performance based on textual information alone. Textual information may contain informative words or phrases regarding flood evidence; however, distinguishing whether the road is passable or not from a single tweet text, which is at most 280 characters, is a challenging task. In our case, we conclude that introducing textual information did not help for the current task.

4 CONCLUSIONS AND OUTLOOK

In this paper, an ensemble of CNN models was trained for retrieving flood relevant tweets. Our best model is the one trained only on visual information; using only metadata, the classifiers were not able to produce high quality predictions. Relying on photographs is reasonable, since nowadays people are more likely to share photographs to describe their current situation than detailed textual descriptions. The analysis of video sequences would be a promising extension for extracting more information about flooding events, such as rainfall intensity, flow speed or even water depth.

ACKNOWLEDGMENTS

The authors would like to acknowledge the support of the BMBF funded research project "EVUS - Real-Time Prediction of Pluvial Floods and Induced Water Contamination in Urban Areas" (BMBF, 03G0846A). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a GeForce Titan X GPU used for this research.

REFERENCES

[1] Benjamin Bischke, Prakriti Bhardwaj, Aman Gautam, Patrick Helber, Damian Borth, and Andreas Dengel. 2017. Detection of Flooding Events in Social Multimedia and Satellite Imagery using Deep Neural Networks. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proc. of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135-146.
[5] Yu Feng and Monika Sester. 2018. Extraction of Pluvial Flood Relevant Volunteered Geographic Information (VGI) by Deep Learning from User Generated Texts and Photos. ISPRS International Journal of Geo-Information 7, 2 (2018), 39.
[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882 (2014).
[8] Laura Lopez-Fuentes, Joost van de Weijer, Marc Bolanos, and Harald Skinnemoen. 2017. Multi-modal Deep Learning Approach for Flood Detection. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[9] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[10] Keiller Nogueira, Samuel G. Fadel, Ícaro C. Dourado, Javier A. V. Muñoz, Otávio A. B. Penatti, Rodrigo Tripodi Calumby, Lin Li, and Jefersson Alex dos Santos. 2017. Data-Driven Flood Detection using Neural Networks. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[11] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI, Vol. 4. 12.
[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818-2826.