7 Essential Principles to Make Multimodal Sentiment Analysis Work in the Wild

Björn W. Schuller1,2
1 Department of Computing, Imperial College London, United Kingdom
2 Chair of Complex & Intelligent Systems, University of Passau, Germany
bjoern.schuller@imperial.ac.uk

Abstract

Sentiment Analysis is increasingly carried out on multimodal data such as videos taken in everyday environments. This requires robust processing across languages and cultures when aiming for mining of opinions from the 'many'. Here, seven key principles are laid out to ensure a high performance of an according automatic approach.

1 Introduction

Sentiment Analysis (SA) recently found its way beyond pure text analysis [Cambria et al., 2015], as sentiment is increasingly expressed also via video 'micro blogs', short clips, or other forms. Such multimodal data is usually recorded 'in the wild', thus challenging today's automatic analysers. For example, one's video-posted opinion on a movie may contain scenes of this movie, requiring subject tracking, and music in the background may need to be overcome for speech recognition and voice analysis. Here, I provide 'essential principles' to make multimodal SA work despite such challenges.

2 The Principles

Seven selected recommendations to make a multimodal SA system 'ready for the wild' are given, each with a short statement:

Make it Multimodal – But Truly. Multimodal SA is often carried out in a late-fusion manner, as, e.g., (spoken) language, acoustics, and video analysis operate on different time levels and monomodal analysers prevail. However, recent advances in synergistic fusion allow for further exploitation of heterogeneous information streams, such as analysis of cross-modal behaviour synchrony to reveal, e.g., regulation.

Make it Robust. Robustness is obviously key when handling real-world data. Effective denoising and dereverberation can these days be achieved by data-driven approaches such as (hierarchical) deep learning. Beyond that, recognition of occlusions, background noises, and the like should be used in the fusion to dynamically adjust the weights given to the modalities.

Train it on Big Data. A major bottleneck for SA beyond textual analysis is the lack of suited 'big' (ideally multimodal) training data. While data is usually 'out there' (such as videos on the net), it is the labels that are lacking. Recent cooperative learning approaches, such as dynamic active learning and semi-supervised learning combined with (gamified) crowdsourcing, can help to efficiently build a large training corpus. Smart pre-selection of suited material from large resources can further improve efficiency.

Make it Adapt. It has repeatedly been shown that in-domain training improves SA [Cambria et al., 2015]. Recent transfer learning provides a range of solutions to adapt to a new domain even if only little and/or unlabelled data exists from it. Subject adaptation is another key aspect.

Make it Context-aware. Temporal context modelling can be learnt, e.g., by LSTM recurrent networks. Additional exploitation of knowledge sources can help resolve ambiguities. Also, automatic recognition of further traits and states of the subject expressing sentiment, such as age, gender, ethnicity, personality, emotion, or health state, can add important information regarding the sentiment expressed.

Make it Multilingual. It seems obvious that multilingualism is an issue for text-based SA. However, it is just as much an issue for acoustic analysis, and can in principle even influence video-based SA due to, e.g., varying lip movements. In fact, languages are often even mixed in real-world statements.

Make it Multicultural. Cross-cultural SA has been researched comparably little, albeit it is clearly of crucial importance. Just as for multilingualism, a key requirement will be sufficient learning data. Then, models can be switched and transferred across languages and cultures.

3 Conclusion

Seven major requirements were highlighted on the way to truly robust multimodal sentiment analysis in adverse conditions in today's 'big data' era. Beyond these, a number of further issues need to be addressed to best exploit automated sentiment analysis, such as the provision of meaningful confidence measures and optimal exploitation of 'internal' confidences.

Acknowledgments

The author was supported by the EU's H2020 Programme via the IAs # 644632 (MixedEmotions) and # 645094 (SEWA).

References

[Cambria et al., 2015] E. Cambria, B. Schuller, and Y. Xia. New Avenues in Opinion Mining and Sentiment Analysis. In Proc. IJCAI 2015, Buenos Aires, 2015.

Proceedings of the 4th Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2016), IJCAI 2016, page 1, New York City, USA, July 10, 2016.