
New York City, USA, July

7 Essential Principles to Make Multimodal Sentiment Analysis Work in the Wild

Björn W. Schuller

Chair of Complex & Intelligent Systems, University of Passau, Germany
Department of Computing, Imperial College London, United Kingdom
bjoern.schuller@imperial.ac.uk

2015

10 2016

Sentiment Analysis (SA) has recently found its way beyond pure text analysis [Cambria et al., 2015], as sentiment is increasingly also expressed via video 'micro blogs', short clips, or other forms. Such multimodal data is usually recorded 'in the wild', which challenges today's automatic analysers. For example, a video-posted opinion on a movie may contain scenes from that movie, requiring subject tracking, and background music may need to be overcome for speech recognition and voice analysis. Here, I provide 'essential principles' to make multimodal SA work despite such challenges.

Sentiment Analysis is increasingly carried out on multimodal data such as videos taken in everyday environments. This requires robust processing across languages and cultures when aiming to mine opinions from the ‘many’. Here, seven key principles are laid out to ensure high performance of a corresponding automatic approach. Seven selected recommendations to make a multimodal SA system ‘ready for the wild’ are given, each with a short statement:

Make it Multimodal – But Truly. Multimodal SA is often carried out in a late-fusion manner, as, e.g., (spoken) language, acoustics, and video analysis operate on different time levels and monomodal analysers prevail. However, recent advances in synergistic fusion allow for further exploitation of heterogeneous information streams, such as analysis of crossmodal behaviour synchrony to reveal, e.g., regulation.

Make it Robust. Robustness is obviously key to handling real-world data. Effective denoising and dereverberation can these days be achieved by data-driven approaches such as (hierarchical) deep learning. Beyond that, recognition of occlusions, background noise, and the like should be used in the fusion to dynamically adjust the weights given to the modalities.
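The idea of dynamically adjusting modality weights can be illustrated with a minimal sketch of reliability-weighted late fusion. This is not the author's method, only one simple way to realise the principle. It assumes that each monomodal analyser outputs a sentiment score in [-1, 1] and that a separate detector (for occlusion, background noise, and so on) supplies a reliability in [0, 1] per modality. The function name `fuse_sentiment` and the modality keys are hypothetical.

```python
from typing import Dict


def fuse_sentiment(scores: Dict[str, float],
                   reliability: Dict[str, float]) -> float:
    """Reliability-weighted average of per-modality sentiment scores.

    scores:      assumed per-modality sentiment in [-1, 1]
    reliability: assumed per-modality reliability in [0, 1], e.g. lowered
                 when occlusion or background noise is detected
    """
    # Missing or negative reliabilities are treated as zero weight.
    weights = {m: max(reliability.get(m, 0.0), 0.0) for m in scores}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no reliable modality available")
    return sum(scores[m] * weights[m] for m in scores) / total


# Hypothetical scores from three monomodal analysers.
scores = {"text": 0.6, "audio": -0.2, "video": 0.4}

# All modalities trusted equally.
clean = fuse_sentiment(scores, {"text": 1.0, "audio": 1.0, "video": 1.0})

# Background music detected: the audio channel is switched off.
noisy = fuse_sentiment(scores, {"text": 1.0, "audio": 0.0, "video": 1.0})
```

With equal reliabilities the result is the plain average of the three scores. Once the audio reliability drops to zero, the fused score depends on text and video only, so a corrupted channel no longer distorts the decision.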

Train it on Big Data. A major bottleneck for SA beyond textual analysis is the lack of suitable ‘big’ (ideally multimodal) training data. While data is usually ‘out there’ (such as videos on the net), it is the labels that are lacking. Recent cooperative learning approaches such as by dynamic active learning and