7 Essential Principles to Make Multimodal Sentiment Analysis Work in the Wild

Björn W. Schuller1,2
1 Department of Computing, Imperial College London, United Kingdom
2 Chair of Complex & Intelligent Systems, University of Passau, Germany
bjoern.schuller@imperial.ac.uk

Abstract

Sentiment Analysis is increasingly carried out on multimodal data such as videos taken in everyday environments. This requires robust processing across languages and cultures when aiming for mining of opinions from the 'many'. Here, seven key principles are laid out to ensure a high performance of an according automatic approach.

1 Introduction

Sentiment Analysis (SA) recently found its way beyond pure text analysis [Cambria et al., 2015], as sentiment is increasingly expressed also via video 'micro blogs', short clips, or other forms. Such multimodal data is usually recorded 'in the wild', thus challenging today's automatic analysers. For example, one's video-posted opinion on a movie may contain scenes of this movie, requiring subject tracking, and music in the background may need to be overcome for speech recognition and voice analysis. Here, I provide 'essential principles' to make multimodal SA work despite such challenges.

2 The Principles

Seven selected recommendations to make a multimodal SA system 'ready for the wild' are given, each with a short statement:

Make it Multimodal – But Truly. Multimodal SA is often carried out in a late-fusion manner, as, e.g., (spoken) language, acoustics, and video analysis operate on different time levels and monomodal analysers prevail. However, recent advances in synergistic fusion allow for further exploitation of heterogeneous information streams, such as analysis of cross-modal behaviour synchrony to reveal, e.g., regulation.

Make it Robust. Robustness is obviously key when handling real-world data. Effective denoising and dereverberation can these days be achieved by data-driven approaches such as (hierarchical) deep learning. Beyond that, recognition of occlusions, background noises, and the like should be used in the fusion to dynamically adjust the weights given to the modalities.

Train it on Big Data. A major bottleneck for SA beyond textual analysis is the lack of suited 'big' (ideally multimodal) training data. While data is usually 'out there' (such as videos on the net), it is the labels that are lacking. Recent cooperative learning approaches, such as dynamic active learning and semi-supervised learning combined with (gamified) crowdsourcing, can help to efficiently build a large training corpus. Smart pre-selection of suited material from large resources can further improve efficiency.

Make it Adapt. It has repeatedly been shown that in-domain training improves SA [Cambria et al., 2015]. Recent transfer learning provides a range of solutions to adapt to a new domain even if only little and/or unlabelled data exists from it. Subject adaptation is another key aspect.

Make it Context-aware. Temporal context modelling can be learnt, e.g., by LSTM recurrent networks. Additional exploitation of knowledge sources can help resolve ambiguities. Also, automatic recognition of further traits and states of the subject expressing sentiment, such as age, gender, ethnicity, personality, emotion, or health state, can add important information regarding the sentiment expressed.

Make it Multilingual. It seems obvious that multilingualism is an issue for text-based SA. However, it is just as much an issue for acoustic analysis, and can in principle even influence video-based SA due to, e.g., varying lip movements. In fact, languages are often even mixed in real-world statements.

Make it Multicultural. Cross-cultural SA has been researched comparably little, albeit it is clearly of crucial importance. Just as for multilingualism, a key requirement will be sufficient learning data. Then, models can be switched and transferred across languages and cultures.

3 Conclusion

Seven major requirements were highlighted on the way to truly robust multimodal sentiment analysis in adverse conditions in today's 'big data' era. Beyond these, a number of further issues need to be addressed to best exploit automated sentiment analysis, such as the provision of meaningful confidence measures and optimal exploitation of 'internal' confidences.

Acknowledgments

The author was supported by the EU's H2020 Programme via the IAs # 644632 (MixedEmotions) and # 645094 (SEWA).

References

[Cambria et al., 2015] E. Cambria, B. Schuller, and Y. Xia. New Avenues in Opinion Mining and Sentiment Analysis. In Proc. IJCAI 2015, Buenos Aires, 2015.

Proceedings of the 4th Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2016), IJCAI 2016, page 1, New York City, USA, July 10, 2016.