Flood detection from social multimedia and satellite images using ensemble and transfer learning with CNN architectures

Danielle Dias, Ulisses Dias
University of Campinas, Brazil
danielle.dias@ic.unicamp.br, ulisses@ft.unicamp.br

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
In this paper we explore deep convolutional neural networks pre-trained on ImageNet, along with a transfer learning mechanism, to detect whether an area's road access has been affected by a flood. We worked on two tasks with different datasets. The first dataset contains images from social media, and the goal is to identify direct evidence for passability of roads by conventional means. The second dataset contains high-resolution satellite imagery of partially flooded areas, and the goal is to identify sections of roads that are potentially blocked. For both tasks, we used visual information only, and our best models achieved averaged F1-Scores of 64.81% on the first task and 73.27% on the second task.

1 INTRODUCTION
Flooding events demand fast response. Rescue and medical teams should move fast to the affected points and bring victims to safety in a timely manner. Unfortunately, floods may render roads impassable. Automatic road passability recognition aids the support planning that mitigates the impact of disasters.
The "Multimedia Satellite Task 2018" studies the problem of road passability classification, namely whether or not it is possible to travel through a flooded region. Two tasks were proposed depending on the source of information. In the first task, we should take advantage of the high popularity of social media and filter the information that provides direct evidence for passability of roads. In the second task, we receive high-resolution satellite imagery of partially flooded areas and the goal is to identify whether it is possible to go from a point A to a point B. More details can be found in the task overview [1].

2 APPROACH
The dataset for the social media task consists of 7,387 images and the dataset for the remote sensing task consists of 1,664 satellite images. As the size of each dataset is limited, we decided to use transfer learning in both cases, with a similar workflow: images are received as input, pre-trained convolutional neural networks (CNNs) are used for feature extraction (Section 2.1), artificial neural networks (ANNs) predict labels (Section 2.2), and an ensemble is built from the individual classifiers (Section 2.3).
While in the social media subtask images are the only source of information, in the remote sensing subtask we receive the images along with two points A and B. The question is whether or not we can go from point A to point B. Thus, we preprocess the images to embed these points within the image (Section 2.4).

2.1 Transfer Learning Mechanism
Many advanced CNN architectures have been trained on ImageNet and are currently available. We selected 10 of them as feature extractors: DenseNet121 [5], DenseNet169 [5], DenseNet201 [5], InceptionResNetV2 [8], InceptionV3 [9], MobileNet [3], ResNet50 [4], VGG16 [7], VGG19 [7], and Xception [2]. We also studied whether global-feature-based approaches extracted with Lire [6] could provide any significant advantage over the pre-trained models, but since no improvement was achieved, we decided to use only features extracted from CNNs.
We replaced the architectures' prediction layers with new ANN models, which are responsible for returning the classification labels. For the social media subtask, the output labels account for (i) no evidence, (ii) evidence/not passable, and (iii) evidence/passable. For the remote sensing subtask, the output labels account for (i) passable and (ii) not passable. Accordingly, ANNs in the former subtask have three output units, whereas ANNs in the latter have only two. The output layers use the softmax activation function.

2.2 Prediction Layer Models
Two approaches performed best in our 5-fold cross-validation analysis of prediction layers. They are hereby called Model1 and Model2.
Model1 is an ANN with a single hidden layer of 512 nodes. Each node uses ReLU as activation function. We added a Dropout layer with a dropout ratio of 50% after the hidden layer, and l2 regularization to prevent overfitting.
Model2 is an ANN with two hidden layers. The first has 2048 nodes and the second has 128. Nodes in the hidden layers use ReLU as activation function and l2 regularization. We dropped out 80% of the connections between the input layer and the first hidden layer, and also applied a dropout ratio of 50% to each hidden layer.
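As a concrete illustration of Sections 2.1 and 2.2, the sketch below wires one of the ten backbones (DenseNet201) to the two prediction heads in Keras. The paper does not report the l2 strength, optimizer, or training schedule, so those values are assumptions.

```python
# Minimal sketch of the transfer learning pipeline: a frozen ImageNet
# backbone extracts features; Model1/Model2 predict the labels.
import numpy as np
from tensorflow.keras import Sequential, layers, regularizers
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input

# Frozen backbone; global average pooling yields one feature vector
# per image (1920 dimensions for DenseNet201).
backbone = DenseNet201(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3) with values in [0, 255]."""
    return backbone.predict(preprocess_input(images), verbose=0)

n_features = backbone.output_shape[-1]
l2 = regularizers.l2(1e-4)  # regularization strength is an assumption

# Model1: one 512-unit ReLU hidden layer, 50% dropout, softmax output
# (3 units for the social media subtask, 2 for the satellite subtask).
model1 = Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),
])

# Model2: 80% dropout on the input connections, then 2048- and 128-unit
# hidden layers, each followed by 50% dropout.
model2 = Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dropout(0.8),
    layers.Dense(2048, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),
])
model1.compile(optimizer="adam", loss="categorical_crossentropy")
model2.compile(optimizer="adam", loss="categorical_crossentropy")
```

The same heads are trained on top of each of the ten backbones; only the input dimension changes.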
2.3 Ensemble
We have 10 CNN architectures to extract features and 2 ANN architectures for prediction. Therefore, for each image we have 20 class predictions; each prediction is a vector of three floating-point numbers in the social media task and two floating-point numbers in the remote sensing task.
To create an ensemble, we concatenate the class predictions and use logistic regression to map the resulting vector to the output classes: a 20 × 3 = 60-dimensional vector to 3 classes in the social media task, and a 20 × 2 = 40-dimensional vector to 2 classes in the remote sensing task.
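The stacking step can be summarized in a few lines of scikit-learn. The sketch below uses random stand-ins for the 20 base classifiers' softmax outputs purely to make it runnable; in practice these arrays would come from the CNN+ANN pairs above, with the meta-learner fit on out-of-fold predictions.

```python
# Sketch of the ensemble: concatenate the 20 softmax vectors per image
# and let logistic regression choose the final label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_models, n_classes = 100, 20, 20, 3  # social media task

# Stand-ins for the base classifiers' softmax outputs, one array of
# shape (n_images, n_classes) per CNN+ANN pair.
train_preds = [rng.dirichlet(np.ones(n_classes), n_train) for _ in range(n_models)]
test_preds = [rng.dirichlet(np.ones(n_classes), n_test) for _ in range(n_models)]
y_train = rng.integers(0, n_classes, n_train)  # placeholder labels

# Concatenation gives the 20 x 3 = 60-dimensional stacked vector.
X_train = np.concatenate(train_preds, axis=1)
X_test = np.concatenate(test_preds, axis=1)

meta = LogisticRegression(max_iter=1000)  # solver settings assumed
meta.fit(X_train, y_train)
final_labels = meta.predict(X_test)
```

For the remote sensing task the same code applies with two classes, giving a 40-dimensional stacked vector.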
2.4 Preprocessing Satellite Images
In Figure 1 we illustrate the steps to preprocess the satellite images. Figure 1(a) shows one of the images in the development dataset. Figures 1(b-d) have marks added for illustrative purposes and to clarify the explanation; these marks were not added to the images provided to the CNNs. Blue and red marks represent the input points A and B, respectively. Our ultimate goal is to place these points in fixed locations, so the model can learn how to find a path between them. We describe each step below (a code sketch follows the figure caption).
(1) We first compute a point C which is halfway between A and B, as shown by the yellow mark in Figure 1(b).
(2) We compute a circumference centered in C that has the distance between A and B as its diameter. Observe in Figure 1(b) that only a small area of the image is inside the circumference, and that the area outside is not helpful to answer whether there is a path between A and B. This observation holds in several cases.
(3) We crop the image to keep only the circumscribed square, as shown in Figure 1(c).
(4) We rotate the image around C to place A on the left side, where the circumference touches the circumscribed square, and B on the right side counterpart, as shown in Figure 1(d).

[Figure 1: (a) Original, (b) Marks Added, (c) Cropped, (d) Rotated. Preprocessing steps using the points A and B. In (a) we show an original image as given by the organizers. In (b) we add blue, red, and yellow marks to represent the points A, B, and the midpoint between A and B, respectively. We also draw a circumference centered in the midpoint with diameter equal to the distance between A and B. In (c) we crop the image to include the entire circumference. In (d) we rotate the image to place A on the left side and B on the right side.]
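A minimal sketch of this preprocessing in OpenCV, assuming A and B are given in pixel coordinates. It performs the rotation before the crop, which is equivalent inside the circle of interest (the square's corners may differ from the paper's crop-then-rotate order); border handling is simplified.

```python
import cv2
import numpy as np

def preprocess(image, A, B):
    """Embed A and B at fixed locations (Section 2.4): rotate the image
    around the midpoint C so the segment AB becomes horizontal with A on
    the left, then crop the square of side |AB| centered at C. A and B
    are (x, y) pixel coordinates; points near the border may yield a
    clipped crop in this simplified sketch.
    """
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    C = (A + B) / 2.0                         # step (1): midpoint
    d = float(np.linalg.norm(B - A))          # step (2): circle diameter
    # Steps (3)-(4) combined: rotating by the angle of the A->B vector
    # maps A to the left and B to the right of the horizontal through C.
    angle = np.degrees(np.arctan2(B[1] - A[1], B[0] - A[0]))
    M = cv2.getRotationMatrix2D((float(C[0]), float(C[1])), angle, 1.0)
    h, w = image.shape[:2]
    rotated = cv2.warpAffine(image, M, (w, h))
    half = int(round(d / 2.0))
    cx, cy = int(round(C[0])), int(round(C[1]))
    return rotated[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
```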
3 RESULTS AND ANALYSIS
During the training phase, we evaluated the models using 5-fold cross-validation. We selected the four best models and the ensemble to submit to the organizers, who performed their analysis on unseen data and reported the results back to us. Table 1 and Table 2 present the results using the averaged F1-Score metric for the social media and remote sensing subtasks, respectively.

Table 1: Evaluation results for the flood classification task from social multimedia images. The best result was achieved by the ensemble.

CNN Arch.      ANN Arch.   Averaged F1-Score (%)
DenseNet201    Model1      62.82
VGG19          Model1      60.92
ResNet50       Model1      62.93
DenseNet169    Model1      62.91
Ensemble       -           64.81

Table 2: Evaluation results for the flood classification task from satellite imagery. The best result was achieved by DenseNet121 with Model1.

CNN Arch.      ANN Arch.   Averaged F1-Score (%)
MobileNet      Model1      56.82
MobileNet      Model2      68.63
InceptionV3    Model2      62.69
DenseNet121    Model1      73.27
Ensemble       -           71.72

In the social media task, the ensemble produced the best results, obtaining an F1-Score of 64.81% against the 62.93% yielded by ResNet50, the best individual model. In the remote sensing task, the ensemble achieved 71.72%, while the best individual model, DenseNet121, reached 73.27%. We believe there is room for improvement if we tune the ensemble again, or if we replace the logistic regressor with other classification methods.

4 CONCLUSION
We used pre-trained CNNs as a starting point to create models that predict whether it is possible to travel through a flooded area. We combined features extracted from 10 CNNs with 2 ANN-based models for prediction, then built an ensemble by concatenating the predicted classes and using logistic regression to map them to a new output. This ensemble achieved the best results in the social media task, but not in the remote sensing task. Our results support the idea that transfer learning and ensembling are promising approaches for both tasks.

ACKNOWLEDGMENTS
We thank CAPES, CNPq (grant 400487/2016-0), and FAPESP (grant 2015/11937-9).

REFERENCES
[1] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proc. of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
[2] François Chollet. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR abs/1610.02357 (2016). http://arxiv.org/abs/1610.02357
[3] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs/1704.04861 (2017). http://arxiv.org/abs/1704.04861
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
[5] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993 (2016). http://arxiv.org/abs/1608.06993
[6] Mathias Lux and Savvas A. Chatzichristofis. 2008. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08). ACM, New York, NY, USA, 1085-1088.
[7] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
[8] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR abs/1602.07261 (2016). http://arxiv.org/abs/1602.07261
[9] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. CoRR abs/1512.00567 (2015). http://arxiv.org/abs/1512.00567