=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_48
|storemode=property
|title=Transfer Learning and Mixed Input Deep Neural Networks for Estimating Flood Severity in News Content
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_48.pdf
|volume=Vol-2670
|authors=Pierrick Bruneau,Thomas Tamisier
|dblpUrl=https://dblp.org/rec/conf/mediaeval/BruneauT19
}}
==Transfer Learning and Mixed Input Deep Neural Networks for Estimating Flood Severity in News Content==
Transfer Learning and Mixed Input Deep Neural Networks for Estimating Flood Severity in News Content

Pierrick Bruneau, Thomas Tamisier
LIST, Luxembourg
{pierrick.bruneau,thomas.tamisier}@list.lu

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper describes deep learning approaches which use textual and visual features for flood severity detection in news content. In the context of the MediaEval 2019 Multimedia Satellite task, we test the value of transferring models pre-trained on large related corpora, as well as the improvement brought by dual branch models that combine embeddings output from mixed textual and visual inputs.

1 INTRODUCTION
Identifying news items related to a catastrophic event such as a flood, and assessing the severity of the event using the collected information can provide timely input to support the victims. The News Image Topic Disambiguation (NITD) and Multimodal Flood Level Estimation (MFLE) subtasks of the Multimedia Satellite (MMSat) MediaEval 2019 task foster the application of machine learning research to this context. The MMSat overview paper [2] discloses further information on this matter. To address these subtasks, we do not propose any specialized model (e.g. the combined use of pose detection and occlusion detection [3] that could be used for the MFLE subtask). Rather, we reuse existing general-purpose text and image classification models. In particular, the value of adapting pre-trained models to these subtasks is estimated in this work.

Part of the literature on multimodal neural networks aims at learning similarities between modalities such as image and text, e.g. for automatic image captioning [8]. In the present work, multimodality is understood as the joint usage of several modalities (i.e. text and image) as a means to improve prediction capabilities. In other words, embeddings derived from each modality are merged, and fed forward to a sigmoid function typical of classification models [1, 9]. Runs submitted to the MFLE subtask are meant to measure the improvement brought by such multimodality. In the remainder, after shortly introducing the addressed subtasks, implementation rationale and details are presented, and the respective experimental results are disclosed and commented.

2 DATA
The NITD subtask aims at predicting the flood-relatedness of news articles using their featured images as input. The training set contains 5180 images, with ∼10.1% flood-related images. The test set contains 1296 images. The MFLE subtask aims at classifying news articles w.r.t. flood severity using both the news text and featured images as input. The training set features 4932 news articles, with ∼3.2% of instances from the positive class (i.e. high severity). The test set features 1234 articles. For details about the subtasks, e.g. regarding the annotation of training and test sets, the reader may refer to the workshop overview paper [2].

3 PROPOSED APPROACH
For given training and validation sets, each model in this work was trained for 50 epochs. Model selection was performed by monitoring a validation metric at epoch end, and retaining the best model w.r.t. this metric. As both subtasks are significantly imbalanced, we used the F1 metric instead of the accuracy. We optimized the binary cross-entropy loss using the Adam solver [7]. Batches of 32 elements were used. For each run, we used stratified 5-fold cross-validation, hence retaining 20% of validation data in each fold. A model ensemble was built using the model selected for each fold. Majority voting or score averaging was selected depending on the F1-Score on the full training data. For handling class imbalance in the context of neural networks, we used instance weighting. In the Multimedia Satellite overview paper [2], runs that use only the provided textual and visual information are distinguished from those that can use any external information. In the remainder, pre-trained models (e.g. pre-trained word embeddings or convolutional models pre-trained using third-party image collections) are understood as external information: runs using textual or visual information only were obtained with models trained from random initializations.
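To make the training protocol concrete, the following is a minimal sketch, not the authors' code, assuming a Keras/scikit-learn implementation: stratified 5-fold cross-validation, Adam with binary cross-entropy, class weighting for the imbalance, and per-epoch validation F1 monitoring to retain the best model of each fold. The helper names (`build_model`, `F1Checkpoint`, `train_fold_ensemble`) are hypothetical; `build_model` stands for a factory returning a compiled Keras model with a sigmoid output.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight


class F1Checkpoint(tf.keras.callbacks.Callback):
    """Compute the validation F1-Score at epoch end and remember the best weights."""

    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val
        self.best_f1, self.best_weights = -1.0, None

    def on_epoch_end(self, epoch, logs=None):
        y_pred = (self.model.predict(self.x_val, verbose=0) > 0.5).astype(int).ravel()
        f1 = f1_score(self.y_val, y_pred)
        if f1 > self.best_f1:
            self.best_f1, self.best_weights = f1, self.model.get_weights()


def train_fold_ensemble(build_model, x, y, epochs=50, batch_size=32):
    """Train one model per stratified fold; return the best model of each fold."""
    # Instance weighting to compensate for class imbalance.
    weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
    class_weight = dict(enumerate(weights))
    models = []
    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True).split(x, y):
        model = build_model()  # hypothetical factory: compiled with Adam + binary cross-entropy
        ckpt = F1Checkpoint(x[val_idx], y[val_idx])
        model.fit(x[train_idx], y[train_idx],
                  epochs=epochs, batch_size=batch_size,
                  class_weight=class_weight, callbacks=[ckpt], verbose=0)
        model.set_weights(ckpt.best_weights)  # keep the epoch with the best validation F1
        models.append(model)
    return models
```

The five retained models can then be combined by majority voting or score averaging, whichever yields the higher F1-Score on the full training data.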
3.1 Textual Information
Classification of textual content is usually carried out by considering the text as a sequence of words, and using recurrent neural models such as the LSTM [6] for classifying these sequential inputs. We compared the baseline LSTM to several variants (e.g. Attention-based BiLSTM [17], Multi-Head Attention model [14]), but no option yielded results significantly better than the baseline LSTM with random initialization. Hence the baseline LSTM model was the only one considered for textual processing in the MFLE subtask. This setting is suitable for MFLE run 2, as it does not require any pretraining. For the latter run, we performed a grid search for hyperparameters. In the end, we retained 50 for the fixed text size, 100 for the hidden vector size, and 32 for the word embedding size. Taking inspiration from data augmentation techniques used with images (see Section 3.2), we tried to augment the training set by setting a random offset on the extracted textual sequences (instead of always taking the 50 first words in the text). We did not observe improved results by doing so. We hypothesize that the starting words in a text carry a lot of its overall meaning.
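The baseline text model with the retained hyperparameters could look as follows; this is a sketch under assumed Keras conventions, not the authors' code. `build_text_model` and `vocab_size` are illustrative names, the latter standing for the tokenizer vocabulary size.

```python
import tensorflow as tf


def build_text_model(vocab_size, text_len=50, embed_dim=32, hidden_dim=100):
    """Baseline LSTM classifier over fixed-length word index sequences."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(text_len,)),
        # Word embeddings learned from scratch (no pre-trained vectors, as in MFLE run 2).
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        tf.keras.layers.LSTM(hidden_dim),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```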
3.2 Visual Information
For image classification in both subtasks, we focused on 3 well-known model architectures: InceptionV3 [13], MobileNetV2 [11] and VGG16 [12]. InceptionV3 has served as a building block for many recent contributions (e.g. [4, 9, 10]). Parameter sets pre-trained on the data from the ImageNet challenge [5] are also widely available. VGG16 yields performance close to the state of the art on the Places365 scene recognition task [16]. As NITD and MFLE can be understood as recognizing certain types of scenes, we hypothesize that transferring a VGG16 model pre-trained on Places365 can be valuable. MobileNetV2 has a comparatively small number of parameters (∼2M, when InceptionV3 defines ∼20M parameters, and VGG16 ∼130M), and is hence more suitable for being trained from scratch. For all models, we rescaled images to 224 × 224 pixels. We applied image augmentation methods commonly used in the above-mentioned papers, i.e. each image in the training batches is modified by a combination of random transformations. When no external data can be used (i.e. run 1 of NITD and MFLE), we used randomly initialized MobileNetV2 models. InceptionV3 models pre-trained using the ImageNet dataset were also used. In a first stage, only the last dense layer was trained while freezing all other layers (NITD run 2). The last 2 convolutional layers were then fine-tuned (NITD run 3). Also, we jointly trained the two last fully connected layers of a pre-trained VGG16 model (NITD run 4). We tried to fine-tune the large fully connected antepenultimate layer and the last convolutional layer after this first stage (NITD run 5).
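The two-stage adaptation of InceptionV3 could be sketched as below, under assumed Keras conventions (the function names are illustrative, not the authors' code): the first stage trains only a new dense head on top of frozen ImageNet weights (as in NITD run 2), the second unfreezes the last convolutional layers for fine-tuning (as in NITD run 3).

```python
import tensorflow as tf


def build_inception_classifier(img_size=224):
    """Stage 1: a sigmoid head on top of a frozen ImageNet-pre-trained InceptionV3."""
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(img_size, img_size, 3))
    for layer in base.layers:
        layer.trainable = False  # freeze the whole convolutional base
    out = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model, base


def unfreeze_last_conv_layers(model, base, n_conv=2):
    """Stage 2: fine-tune the last convolutional layers together with the head."""
    conv_layers = [l for l in base.layers if isinstance(l, tf.keras.layers.Conv2D)]
    for layer in conv_layers[-n_conv:]:
        layer.trainable = True
    # Re-compile with a small learning rate so fine-tuning does not erase the pre-trained weights.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
    return model
```

The same pattern applies to the VGG16 runs, with the fully connected layers of the classifier head unfrozen instead of convolutional ones.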
3.3 Mixed Input
[Figure 1: Generic mixed input model architecture.]

Figure 1 shows the generic architecture for our mixed input models. The proposed multimodal approach is very similar to that proposed in [9]. The embedding of textual and visual models is taken as their penultimate layer output. For MFLE run 3, we reused the previously trained LSTM (run 2) and MobileNetV2 (run 1), and trained only the additional fully connected layers. Similarly, for run 4 we used a pre-trained InceptionV3 model, adapted and fine-tuned to the MFLE subtask data as described in Section 3.2. Run 5 used a pre-trained and adapted VGG16 model instead. The textual and visual models performing the best w.r.t. all training data were selected. Ensembles of 5 mixed input models were trained from this common basis. To balance the influence of text and image, we defined a bottleneck fully connected layer that reduces the generally high-dimensional convolutional embedding (e.g. 2048 for InceptionV3) to size 100, the same size as the LSTM hidden vector.
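A minimal sketch of this generic mixed input architecture, under assumed Keras conventions, is given below: the penultimate-layer embeddings of the two branches are merged, with a bottleneck dense layer reducing the convolutional embedding (e.g. 2048 for InceptionV3) to the LSTM hidden size of 100. `text_branch` and `image_branch` stand for previously trained models with their classification layer removed; the exact number of fusion layers is an assumption of this sketch, not taken from the paper.

```python
import tensorflow as tf


def build_mixed_model(text_branch, image_branch, bottleneck_dim=100):
    """Dual-branch model: merge text and image embeddings, then classify with a sigmoid."""
    # Freeze the pre-trained branches so that only the fusion layers are trained
    # (as done for MFLE run 3 with the LSTM and MobileNetV2 models).
    for branch in (text_branch, image_branch):
        branch.trainable = False
    text_emb = text_branch.output                        # LSTM hidden vector, size 100
    img_emb = tf.keras.layers.Dense(bottleneck_dim, activation="relu")(
        image_branch.output)                             # bottleneck, e.g. 2048 -> 100
    merged = tf.keras.layers.Concatenate()([text_emb, img_emb])
    hidden = tf.keras.layers.Dense(100, activation="relu")(merged)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
    model = tf.keras.Model([text_branch.input, image_branch.input], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

Only the dense layers after the merge are updated here; for runs 4 and 5 the visual branch is additionally adapted and fine-tuned as described in Section 3.2.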
Table 1: F1-Scores (%) for NITD and MFLE subtasks. We refer to Sections 3.1, 3.2 and 3.3 for details about specific runs. ∅ indicates the model has been trained from a random initialization.

NITD:
* Run 1 (MNV2, ∅): 85.1
* Run 2 (InceptionV3, ImageNet): 81.0
* Run 3 (InceptionV3, fine-tuned): 79.6
* Run 4 (VGG16, Places365): 89.0
* Run 5 (VGG16, fine-tuned): 89.6

MFLE:
* Run 1 (MNV2, ∅): 56.6
* Run 2 (LSTM, ∅): 56.5
* Run 3 (MNV2 & LSTM, ∅): 57.6
* Run 4 (IV3 & LSTM, ImageNet): 67.1
* Run 5 (VGG16 & LSTM, Places365): 66.0

4 ANALYSIS
The F1-Scores resulting from our runs are displayed in Table 1. For the NITD subtask, VGG16 with fine-tuning gets the best results (F1 = 89.6%). VGG16 models performed better than the others by a significant margin. This means NITD is close to a scene recognition task. However, fine-tuning brought at best a minor improvement (+0.6% for VGG16, runs 4 and 5), and at worst a performance degradation (-1.4% for InceptionV3, runs 2 and 3). Perhaps surprisingly, MobileNetV2 trained from scratch offers solid performance, in between InceptionV3 and VGG16. For the MFLE subtask, the mixed input model benefits from a small positive combination effect between modalities (+1.0% between runs 1 and 3). Using pre-trained vision models (runs 4 and 5) yields a significant performance boost (≈10%). The best results are obtained with the combination of LSTM and InceptionV3 (run 4). As all images in the MFLE data set depict flooded scenes, we can hypothesize that object-level features are more relevant than scene-level features.

[Figure 2: Class Activation Maps for true positives (TP), false positives (FP) and false negatives (FN) in both subtasks.]

Figure 2 displays Class Activation Maps [15] of examples with high positive or negative activations w.r.t. InceptionV3 models trained on both subtasks. We see that the positive class is frequently associated with the detection of water patterns. This possibly explains the limited performance in MFLE, as these patterns are not very discriminative there. On the other hand, false negatives are often associated with misleading elements in the image (e.g. a microphone for NITD, a pile of rubbish for MFLE).

5 CONCLUSION
In this paper, we tested several approaches to the detection of flood severity in multimodal news content. We highlighted the relevance of considering closely related tasks for pre-training, rather than general-purpose image datasets such as ImageNet. Mixed-input architectures in the MFLE task yielded an improvement w.r.t. modalities taken separately, but this improvement was limited in comparison to the influence of using relevant pre-trained models.

REFERENCES
[1] N. Audebert, C. Herold, K. Slimani, and C. Vidal. 2019. Multimodal deep networks for text and image-based document classification. arXiv:1907.06370 [cs] (2019).
[2] B. Bischke, P. Helber, S. Brugman, E. Basar, Z. Zhao, M. Larson, and K. Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop.
[3] A. Bulat and G. Tzimiropoulos. 2016. Human pose estimation via Convolutional Part Heatmap Regression. In European Conference on Computer Vision. 717–732.
[4] P. Chen, Y. Sharma, H. Zhang, J. Yi, and C. Hsieh. 2018. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. In Thirty-Second AAAI Conference on Artificial Intelligence.
[5] J. Deng, W. Dong, R. Socher, L. Li, L. Kai, and F. Li. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. 248–255.
[6] A. Graves. 2012. Supervised Sequence Labelling. In Supervised Sequence Labelling with Recurrent Neural Networks. Springer Berlin Heidelberg, 5–13.
[7] D. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference for Learning Representations.
[8] R. Kiros, R. Salakhutdinov, and R. Zemel. 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. In NIPS Deep Learning Workshop.
[9] L. Lopez-Fuentes, J. van de Weijer, M. Bolaños, and H. Skinnemoen. 2017. Multi-modal Deep Learning Approach for Flood Detection. In Proc. of the MediaEval 2017 Workshop.
[10] R. Poplin, A. Varadarajan, K. Blumer, Y. Liu, M. McConnell, G. Corrado, L. Peng, and D. Webster. 2018. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2, 3 (2018), 158–164.
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[12] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference for Learning Representations.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30. 5998–6008.
[15] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. 2016. Learning Deep Features for Discriminative Localization. In IEEE Conference on Computer Vision and Pattern Recognition. 2921–2929.
[16] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. 2018. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2018), 1452–1464.
[17] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu. 2016. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 207–212.