<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>New York City, USA, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>7 Essential Principles to Make Multimodal Sentiment Analysis Work in the Wild</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Björn W. Schuller</string-name>
          <email>bjoern.schuller@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Complex &amp; Intelligent Systems, University of Passau</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>10</volume>
      <issue>2016</issue>
      <abstract>
        <p>Sentiment Analysis (SA) has recently found its way beyond pure text analysis [Cambria et al., 2015], as sentiment is also increasingly expressed via video 'micro blogs', short clips, or other forms. Such multimodal data is usually recorded 'in the wild', thus challenging today's automatic analysers. For example, one's video-posted opinion on a movie may contain scenes of this movie, requiring subject tracking, and music in the background may need to be overcome for speech recognition and voice analysis. Here, I provide 'essential principles' to make multimodal SA work despite such challenges.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Sentiment Analysis is increasingly carried out on
multimodal data such as videos taken in
everyday environments. This requires robust
processing across languages and cultures when aiming to
mine opinions from the ‘many’. Here, seven
key principles are laid out to ensure high
performance of a corresponding automatic approach.
Each of these seven selected recommendations for making a multimodal SA
system ‘ready for the wild’ is given with a short statement:</p>
      <p>Make it Multimodal – But Truly. Multimodal SA is
often carried out in a late-fusion manner because, e.g., (spoken)
language, acoustics, and video analysis operate on different time
levels and monomodal analysers prevail. However, recent
advances in synergistic fusion allow for further exploitation of
heterogeneous information streams, such as analysis of
crossmodal behaviour synchrony to reveal, e.g., regulation.</p>
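      <p>As a minimal sketch of what a crossmodal synchrony cue could look like, the following example computes a sliding-window Pearson correlation between two frame-aligned descriptor streams. The function names, the choice of Pearson correlation, and the toy descriptors (audio energy and a smile score) are illustrative assumptions, not a method prescribed here; real streams would first need resampling to a common frame rate.</p>
      <preformatted><![CDATA[
```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def windowed_synchrony(stream_a, stream_b, window, hop):
    """Correlate two frame-aligned streams over sliding windows."""
    if len(stream_a) != len(stream_b):
        raise ValueError("streams must share a common frame rate and length")
    starts = range(0, len(stream_a) - window + 1, hop)
    return [pearson(stream_a[s:s + window], stream_b[s:s + window]) for s in starts]


# Hypothetical per-frame descriptors, assumed to be already time-aligned.
audio_energy = [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1]
smile_score = [0.2, 0.4, 0.6, 0.8, 0.1, 0.2, 0.3, 0.4]

# One synchrony value per window; these could be fed to a fusion stage
# as an additional feature alongside the monomodal predictions.
sync = windowed_synchrony(audio_energy, smile_score, window=4, hop=4)
```
]]></preformatted>
      <p>In this toy input the two streams rise together in the first window and move in opposite directions in the second, so the synchrony value changes sign between the windows.</p>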
      <p>Make it Robust. Robustness is obviously key to handling
real-world data. Effective denoising and dereverberation can
these days be reached by data-driven approaches such as
(hierarchical) deep learning. Beyond that, recognition of occlusions,
background noises, and the like should be used in the fusion to
dynamically adjust the weights given to the modalities.</p>
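      <p>The idea of dynamically adjusting modality weights can be sketched as a reliability-weighted late fusion. This is only an illustration under stated assumptions: the function name, the score range of [-1, 1], and the per-modality reliability estimates in [0, 1] (which would in practice come from occlusion or noise detectors) are all hypothetical.</p>
      <preformatted><![CDATA[
```python
def fuse(scores, reliability):
    """Reliability-weighted average of per-modality sentiment scores.

    scores: dict mapping modality name to a sentiment score in [-1, 1]
    reliability: dict mapping modality name to an estimated reliability in [0, 1]
    """
    total = sum(reliability[m] for m in scores)
    if total == 0:
        raise ValueError("no reliable modality available")
    return sum(scores[m] * reliability[m] for m in scores) / total


# Hypothetical monomodal predictions for one video segment.
scores = {"text": 0.5, "audio": -0.5, "video": 1.0}

# All modalities trusted equally.
clean = fuse(scores, {"text": 1.0, "audio": 1.0, "video": 1.0})

# Face occluded: the video analyser's weight is driven to zero.
occluded = fuse(scores, {"text": 1.0, "audio": 1.0, "video": 0.0})
```
]]></preformatted>
      <p>With the video weight at zero, the fused score depends only on the text and audio predictions, so a corrupted modality cannot dominate the result.</p>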
      <p>Train it on Big Data. A major bottleneck for SA beyond
textual analysis is the lack of suitable ‘big’ (ideally multimodal)
training data. While data is usually ‘out there’ (such as videos
on the net), it is the labels that are lacking. Recent cooperative
learning approaches such as by dynamic active learning and</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>