<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Examining Multimodal Characteristics of Video to Understand User Engagement</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fahim A. Salim</string-name>
          <email>salimf@tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Killian Levacher</string-name>
          <email>killian.levacher@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Conlan</string-name>
          <email>owen.conlan@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nick Campbell</string-name>
          <email>nick@tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNGL/ADAPT Centre, Trinity College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Video content is being produced in ever increasing quantities and offers a potentially highly diverse source for personalizable content. A key characteristic of quality video content is the engaging experience it offers for end users. This paper explores how different characteristics of a video, e.g. face detection, paralinguistic features in the audio track, extracted from different modalities in the video, can impact how users rate and thereby engage with the video. These characteristics can further be used to help segment videos in a personalized and contextually aware manner. Initial experimental results from the study presented in this paper are encouraging.</p>
      </abstract>
      <kwd-group>
        <kwd>Personalization</kwd>
        <kwd>Multimodality</kwd>
        <kwd>Video Analysis</kwd>
        <kwd>Paralinguistic</kwd>
        <kwd>User Engagement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Videos are among the most versatile and multimodal forms of content
we consume on a regular basis, and they are available in ever increasing quantities.
It is therefore useful to identify engagement in videos automatically for a variety
of applications. For example, in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the presenter performed a statistical analysis of TED talks to
derive a metric for an optimal TED talk based on user ratings,
while [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] build video recommender systems based on users' viewing
habits and commenting patterns.
      </p>
      <p>
        Each kind of video engages users differently, i.e. engagement with content is
context dependent [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In our context, by engagement we mean the elaborate
feedback system used by the raters of TED videos, described in detail in section
2.
      </p>
      <p>We believe that it is possible to automatically extract quantifiable multimodal features
from a video presentation and correlate them with user
engagement criteria for a variety of interesting applications. Additionally, extracting and
indexing those features and their correlation with engagement could benefit
content adaptation, contextualized search and non-sequential video slicing.
This paper discusses an initial multimodal analysis experiment. In order to test
our hypothesis, we extracted features from TED videos and correlated them with
user feedback scores available on the TED website.</p>
    </sec>
    <sec id="sec-2">
      <title>Current Study</title>
      <p>
        The TED website asks viewers to describe a video in terms of particular words
instead of a simple like or dislike option. A user can choose up to three words
from a choice of 14 words (listed in table 1) to rate a video. This makes our
problem more interesting, for we do not have a crisp binary feedback to learn
from, but rather a fuzzy description of what viewers thought about a particular
video. This rating system gives us a detailed insight into user
engagement with the video presentation. Since it is information volunteered by
users in terms of semantically positive and negative words, it provides a good basis
for analyzing the relevant factors of engagement described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Among the 14 rating criteria in table 1 provided to the user, we
identified 9 as positive words, 4 as negative (shown in
bold) and 1 as neutral (shown in italic).</p>
      <p>As seen in Table 1, ratings tend to be overwhelmingly positive. In order to
normalize them we used the following definition: for a video to
be considered "Beautiful" or "Persuasive" etc., its count for that rating word must exceed
the average count for that particular rating word. With this, TED talks
were categorized as "Beautiful" and "not Beautiful", "Inspiring" and "not Inspiring",
"Persuasive" and "not Persuasive", etc., giving two classes for classification for each
of the 14 rating words.</p>
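      <p>The thresholding described above can be sketched as follows (a minimal illustration; the function name and the rating counts are hypothetical, not data from the study):</p>

```python
def binarize_ratings(rating_counts):
    """Turn raw per-video counts for one rating word into binary labels.

    A video is labelled True (e.g. "Beautiful") when its count exceeds
    the average count for that rating word across all videos, and
    False ("not Beautiful") otherwise.
    """
    average = sum(rating_counts.values()) / len(rating_counts)
    return {video: count > average for video, count in rating_counts.items()}

# Hypothetical counts for the "Beautiful" rating word (average = 65.0):
labels = binarize_ratings({"talk_a": 120, "talk_b": 15, "talk_c": 60})
```

      <p>Applying the same per-word threshold to each of the 14 rating words yields 14 independent binary classification tasks.</p>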
      <p>
        To perform our experiment we extracted, for each video, the number of
seconds for which there was a close-up shot of the speaker, a distant shot, or no
speaker on the screen. For non-visual features we looked at the number of
laughter and applause events (counted from the transcribed audio track, i.e. the subtitle files)
and the laughter-to-applause ratio within TED talks. We use Haar cascades [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] from the
OpenCV library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to detect whether the speaker is on the screen, and a
simple Python script to obtain the laughter and applause counts, since TED talks come
with subtitle files.
      </p>
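      <p>The laughter and applause counting can be sketched as follows (a minimal sketch; the exact cue strings and the sample transcript are assumptions, based on the common subtitle convention of parenthesised non-speech cues):</p>

```python
import re

def count_cues(subtitle_text):
    """Count laughter and applause cues in a subtitle transcript.

    Assumes non-speech events appear as parenthesised cues such as
    "(Laughter)" or "(Applause)" in the subtitle file.
    """
    laughter = len(re.findall(r"\(laughter\)", subtitle_text, re.IGNORECASE))
    applause = len(re.findall(r"\(applause\)", subtitle_text, re.IGNORECASE))
    ratio = laughter / applause if applause else 0.0
    return laughter, applause, ratio

# A tiny hypothetical SRT fragment:
srt = """1
00:00:01,000 --> 00:00:03,000
So I tried it myself.
(Laughter)

2
00:00:04,000 --> 00:00:06,000
(Applause) Thank you. (Laughter)
"""
```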
      <p>
        To correlate features with user ratings and uncover potential patterns, we
utilized the WEKA toolkit, which allows easy access to a suite of machine learning
techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We used the logistic regression algorithm with
tenfold cross-validation for our analysis of 1340 TED talk videos, to see
how feature values affected user ratings. We tested on both the percentage count
and the actual count for each rating.
      </p>
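      <p>The tenfold cross-validation procedure can be illustrated with a generic sketch (plain Python, not the WEKA implementation used in the study):</p>

```python
def tenfold_splits(items, k=10):
    """Yield (train, test) lists for k-fold cross-validation.

    The items are dealt into k folds; each fold serves once as the
    held-out test set while the remaining folds form the training set.
    The classifier's accuracy is then averaged over the k runs.
    """
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With 1340 talks, each fold holds out 134 talks for testing.
splits = list(tenfold_splits(list(range(1340))))
```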
    </sec>
    <sec id="sec-3">
      <title>Experiment Results and Discussion</title>
      <p>Our aim is to assess the value of the multimodality of video content. We
experimented by removing our visual features to see whether this would affect the correct
classification of videos for the ratings. Figure 1 shows that the accuracy of
correctly classified instances increased with the inclusion of visual features for the
majority of rating words (7 to be precise), while for 3 it remained equal and for
4 rating words it actually decreased.</p>
      <p>The results of our study are interesting in many regards. Firstly, they are a
preliminary step towards our thesis about the value of different modalities within
a video stream. Another interesting aspect of this study is that all the features
were extracted automatically, i.e. no manual annotation was performed. So any
model based on our feature set could easily be applied to new content, and any
advancement in computer vision and paralinguistic analysis technology would
help improve our model. Such a model could become a component of
personalization systems to enhance contextual queries.</p>
      <p>Our current approach, however, is not without limitations. The biggest of
these is that it cannot be used for all types of video, such as movies and sports
videos. Another limitation is that we have analyzed each
video as a whole unit, i.e. we simply do not have information on which portion
of a video was more "Funny", "Beautiful" or "Long-winded" compared to other
portions.</p>
      <p>For further investigation we would like to extend our signal extraction to
a more focused set of features. We are planning to extract more visual features to
see what impact they have on user ratings. In addition to visual features we
would also like to bring paralinguistic features from the audio stream into the
fold. Most importantly, we are planning to examine the correlation between
linguistic features of TED talks and their corresponding user ratings. This extracted
meta-data and engagement analysis will feed into a model to create multimodal
segments in a personalized and contextually aware manner.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This research is supported by Science Foundation Ireland through the CNGL
Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie)
at School of Computer Science and Statistics, Trinity College, Dublin.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Attfield</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piwowarski</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Towards a science of user engagement</article-title>
          .
          <source>WSDM Workshop on User Modelling for Web Applications</source>
          , Hong Kong (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bradski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The OpenCV Library</article-title>
          .
          <source>Dr. Dobb's Journal of Software Tools</source>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brezeale</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cook</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          :
          <article-title>Learning video preferences using visual features and closed captions</article-title>
          .
          <source>IEEE Multimed</source>
          .
          <volume>16</volume>
          ,
          <fpage>39</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lienhart</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maydt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>An extended set of Haar-like features for rapid object detection</article-title>
          .
          <source>Proceedings. Int. Conf. Image Process</source>
          .
          <volume>1</volume>
          ,
          <fpage>900</fpage>
          -
          <lpage>903</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA data mining software</article-title>
          .
          <source>ACM SIGKDD Explor</source>
          .
          <volume>11</volume>
          ,
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          :
          <article-title>Examining the generalizability of the User Engagement Scale (UES) in exploratory search</article-title>
          .
          <source>Inf. Process. Manag</source>
          .
          <volume>49</volume>
          ,
          <fpage>1092</fpage>
          -
          <lpage>1107</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Cross domain recommendation based on multi-type media fusion</article-title>
          .
          <source>Neurocomputing</source>
          .
          <volume>127</volume>
          ,
          <fpage>124</fpage>
          -
          <lpage>134</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wernicke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <source>Lies, damned lies and statistics (about TEDTalks)</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>