<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context of Experience - MediaEval submission of ITEC / AAU</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Polyxeni Sgouroglou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tarek Markus Abdel Aziz</string-name>
          <email>tabdelaz@edu.aau.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Lux</string-name>
          <email>mlux@itec.aau.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Klagenfurt University</institution>
          <addr-line>Universitätsstrasse 65-67, Klagenfurt</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>People want to be entertained. However, context influences what people actually find entertaining. The MediaEval 2016 Context of Experience Task [2] focuses on automated methods to find the right content for a specific viewing situation, or more specifically, which movies are good to watch on an airplane. In this paper we present our approach to automatically suggesting movies from a list of candidates by means of visual data as well as meta data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Movies for entertainment are big business. Worldwide TV
and video revenue in 2015 is estimated at 286.17 billion
USD<sup>1</sup>. With new shows, series and movies every year, there
is a huge library of content to choose from. Especially in the
confined space of an airplane seat and for the duration of a
long-distance flight, video entertainment is well received by
passengers. Many companies therefore offer on-board
entertainment systems, where passengers can choose from multiple
videos to entertain themselves without disturbing other
passengers. While there is of course no one-size-fits-all solution, the
general hypothesis of the task is that some videos are better
suited for watching on an airplane than others. Many
different factors can influence such a
decision, e.g. whether a movie is still watchable on a small, low
contrast screen, or whether there are scenes which are potentially
offensive to neighboring passengers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. OUR APPROACH</title>
      <p>While distinguishing positive from negative reviews of
films is quite easy for humans, determining the suitability
of a film for watching on an airplane appears to be a
non-trivial task even for humans. It becomes even harder when
humans are asked to decide about the suitability of films
while being outside the specific context. For a person who is
not in the airplane context it is rather difficult to imagine
the exact emotional impact of a film watched during a flight.
Mood, different tastes, stress levels during the flight and
anxiety problems are only some of the factors that could
influence a passenger’s decision. Hence, lists proposed by
websites about what should or should not be watched on an
airplane remain controversial. However, the majority of
passengers share the same goal, which is to pass time
pleasantly. Therefore, there are some characteristics based on
common sense that intuitively make a film unsuitable for
watching on an airplane for the majority of passengers,
for instance films about airplane crashes, very violent films,
or films with a high level of nudity.</p>
      <p><sup>1</sup> https://www.statista.com/statistics/259985/global-filmed-entertainment-revenue/, last visited 2016-09-27</p>
      <p>For the MediaEval 2016 Context of Experience Task
we submitted four runs, of which the first two are visual-only
runs and the latter two investigate text features. For
the text part of our experiments we chose to work with
meta data, more specifically with plots, synopses and
plot keywords of the films, since we consider them the
most descriptive and concise types of meta data for deciding
on suitability in our watching situation. Our intuition
is that other types of meta data such as genre, language
etc. are not sufficient to characterize a film as unsuitable for
watching on an airplane.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Classification based on visual features</title>
      <p>We used two different visual components, posters
and trailers, for the classification of a movie. The first run
– Run 1 – uses only posters, and the second run – Run 2 –
uses only trailers. For both runs we used the same machine
learning techniques.</p>
      <p>Most people determine the genre or other special features
of a movie by looking at its poster. For instance, red fonts
are a common hint that the movie includes
horrifying or bloody scenes. Such derived features also play an
important role in movie selection on airplanes, because
each movie title is presented together with its
poster. The best way to develop an algorithm that
decides from a poster whether a movie is suitable or not is to use
machine learning techniques.</p>
      <p>We decided to use deep learning. It is usually employed
with large datasets, but our development set was too small to
train a network from scratch. Therefore, we only extracted the
vectors of the last hidden layer of a pre-trained deep neural
network model, which can then be used as input to a separate
machine learning system [4].</p>
      <p>
        We used the deep learning framework Caffe [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
developed by the Berkeley Vision and Learning Center. It can be
used to load one of the pre-trained models, which are
provided by community members and researchers. We used
the BVLC GoogLeNet model [3]. It has 22 layers and is
trained on the ImageNet-2014 dataset to detect 1 000
different classes of images. For each poster from the development
set we extracted a vector with 1024 dimensions from the
layer pool5/7x7_s1. It is the last hidden layer and contains
the high-level information produced by
the network.
      </p>
      <p>As classifier we built a Support Vector Machine (SVM)
with scikit-learn<sup>2</sup>, a free software machine learning library.
The advantages of SVMs are that they are effective in high
dimensional spaces and in cases where the number of input
vectors is smaller than the number of dimensions, as in
our situation. We used a linear kernel function, and the soft
margin parameter C was set to 10. The result is Run 1.</p>
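      <p>A minimal sketch of this classification step, assuming scikit-learn's SVC with a linear kernel and C set to 10 as stated above. The feature matrix and labels are random placeholders standing in for the 1024-dimensional poster vectors; the sizes and the names n_train, X_train and y_train are illustrative assumptions, not the data or code of the submitted run.</p>
      <preformat>
```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data (assumption): random vectors stand in for the
# 1024-dimensional poster features taken from the last hidden layer.
rng = np.random.default_rng(0)
n_train, n_test, n_dims = 40, 10, 1024
X_train = rng.normal(size=(n_train, n_dims))
y_train = np.array([0, 1] * (n_train // 2))  # 1 = suitable, 0 = unsuitable
X_test = rng.normal(size=(n_test, n_dims))

# Linear kernel and soft margin parameter C = 10, as described in the text.
classifier = SVC(kernel="linear", C=10)
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
print(predictions.shape)
```
      </preformat>
      <p>With far fewer training vectors than dimensions, as in this toy setup, a linear kernel is usually sufficient to separate the training data, which matches the motivation given above.</p>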
      <p>We proceeded similarly with the trailers. From
every trailer we extracted 200 frames with ffmpeg<sup>3</sup>, a
software for handling multimedia data and streams. Again using
the BVLC GoogLeNet model, we extracted the vectors from
the last hidden layer and concatenated all 200 vectors of
a trailer. The result was a vector with 204 800 dimensions.
We trained the SVM with these vectors from the development set
and afterwards classified the vectors from the test set.
The result is Run 2.</p>
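      <p>The dimensionality of the trailer vector follows from 200 frames times 1024 values per frame. A small sketch of the concatenation in plain Python; the per-frame vectors are dummy placeholders and the function name concatenate_frame_vectors is hypothetical.</p>
      <preformat>
```python
FRAMES_PER_TRAILER = 200   # frames extracted per trailer (from the text)
FEATURES_PER_FRAME = 1024  # size of the last hidden layer (from the text)


def concatenate_frame_vectors(frame_vectors):
    """Join per-frame feature vectors, in frame order, into one flat vector."""
    trailer_vector = []
    for vector in frame_vectors:
        if len(vector) != FEATURES_PER_FRAME:
            raise ValueError("unexpected feature vector length")
        trailer_vector.extend(vector)
    return trailer_vector


# Dummy features (assumption): frame i is filled with the value float(i).
frames = [[float(i)] * FEATURES_PER_FRAME for i in range(FRAMES_PER_TRAILER)]
trailer_vector = concatenate_frame_vectors(frames)
print(len(trailer_vector))  # 204800
```
      </preformat>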
    </sec>
    <sec id="sec-4">
      <title>2.2 Text based classification</title>
      <p>We first obtain the plots from the baseline text data, which
the organizers provide as XML files for the whole dataset of 318
films. We parse the XML files and access the title and the
plot of each film. Then we preprocess the
plots by case folding, tokenization, removal of non-alphanumeric
characters, stop word removal based on Google’s stop word list (to
reduce computational and space complexity) and stemming
with the PorterStemmer of the Natural Language Toolkit
(nltk)<sup>4</sup>. Finally, we store the plot tokens for further
use. After preprocessing and feature vectorization the
training corpus contains 1 972 distinct terms.</p>
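      <p>The preprocessing chain can be sketched roughly as follows. The stop word set is a tiny illustrative subset (not Google's list), and the stemmer is passed in as a function so that the sketch runs without extra dependencies; with nltk installed, PorterStemmer().stem could be supplied for the stem argument. The function name preprocess_plot is hypothetical.</p>
      <preformat>
```python
import re

# Tiny illustrative stop word set (assumption); the text uses Google's list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess_plot(plot, stem=lambda token: token):
    """Case-fold, strip non-alphanumerics, tokenize, drop stop words, stem."""
    folded = plot.casefold()
    # Replace every run of non-alphanumeric characters by a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", folded)
    tokens = cleaned.split()
    return [stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess_plot("The pilot is forced to land the plane, on a frozen lake!")
print(tokens)  # ['pilot', 'forced', 'land', 'plane', 'frozen', 'lake']
```
      </preformat>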
      <p>In both runs we employ a two-step classification. In the
first step the text features are used to determine with a
Naive Bayes classifier whether a movie is good to watch on an
airplane or not. In case of a positive match, i.e. the film is
classified as being good to watch on an airplane, we
compare the terms of the text features to a list of terms
that, in our opinion, intuitively make movies unsuitable for being
watched on an airplane: {airplane crash, airplane attack,
hijack, hijacking, air force one, bomb, terrorist, kidnap, abuse,
fascism}. If the text features of a positive example match
at least two terms of this list, the classification result is changed to
the negative class. This allows us to focus
more on precision than on recall and to reduce the number
of false positives. Experiments on the
development data set have shown that this significantly improves
precision while reducing recall only marginally.</p>
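      <p>The second step can be sketched as a simple override rule. This is an illustrative reading of the rule, not the code of the submitted runs: the function names are hypothetical, and matching whole words in the case-folded plot text (rather than in stemmed tokens) is an assumption.</p>
      <preformat>
```python
import re

# Terms listed in the text as making a film unsuitable for an airplane.
UNSUITABLE_TERMS = [
    "airplane crash", "airplane attack", "hijack", "hijacking",
    "air force one", "bomb", "terrorist", "kidnap", "abuse", "fascism",
]


def count_matched_terms(text):
    """Count how many distinct list terms occur as whole words in the text."""
    folded = text.casefold()
    return sum(
        1 for term in UNSUITABLE_TERMS
        if re.search(r"\b" + re.escape(term) + r"\b", folded)
    )


def apply_term_override(predicted_suitable, text, min_matches=2):
    """Flip a positive prediction to negative if enough terms match."""
    if predicted_suitable and count_matched_terms(text) >= min_matches:
        return False
    return predicted_suitable


plot = "A terrorist plants a bomb on a transatlantic flight."
print(apply_term_override(True, plot))  # False
```
      </preformat>
      <p>Because only positive predictions can be flipped, the rule can remove false positives but never adds new positives, which is why it trades recall for precision.</p>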
      <p>Our first text based run – Run 3 – contains predictions
based on the baseline text features, expanded by a set of text
features we obtained from relevant web pages. The number
of features used is 5 000, ordered by term frequency. The first
1 972 are features extracted from the baseline tokenized XML
plots, while the remaining 3 028 are features extracted from our
downloaded, preprocessed and tokenized web results. For
creating the features we use the CountVectorizer, scikit-learn’s bag of
words tool. As analyzer parameter we use
unigrams. As classification method we use Naive Bayes for
multinomial models. In the final prediction vector, if a film
that is already classified as suitable contains at least two of
the list’s terms, we change the classification to unsuitable
for watching on an airplane.</p>
      <p><sup>2</sup> http://scikit-learn.org/stable/, last visited 2016-10-06</p>
      <p><sup>3</sup> https://ffmpeg.org/, last visited 2016-10-06</p>
      <p><sup>4</sup> http://www.nltk.org/, last visited 2016-10-06</p>
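      <p>A minimal sketch of the bag-of-words classification in Run 3, using scikit-learn's CountVectorizer and MultinomialNB. The four training plots and their labels are made up for illustration; max_features=5000 and ngram_range=(1, 1) mirror the 5 000 unigram features mentioned in the text, but the exact parameters of the submitted runs are an assumption.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus (assumption): 1 = suitable, 0 = unsuitable.
train_plots = [
    "funny family comedy about a road trip",
    "romantic comedy with a happy wedding",
    "brutal war drama with bloody battles",
    "horror story about a bloody massacre",
]
train_labels = [1, 1, 0, 0]

# Unigram bag of words, limited to the most frequent terms.
vectorizer = CountVectorizer(ngram_range=(1, 1), max_features=5000)
X_train = vectorizer.fit_transform(train_plots)

# Naive Bayes for multinomially distributed term counts.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

X_new = vectorizer.transform(["a happy family comedy", "a bloody war"])
predictions = classifier.predict(X_new).tolist()
print(predictions)  # [1, 0]
```
      </preformat>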
      <p>Our second text based run – Run 4 – is based on the 1 972
baseline text features, while the remaining 3 028 features
are obtained from plot keywords and synopses of the Internet
Movie Database (IMDb)<sup>5</sup>. All other characteristics
remain as in Run 3.</p>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS &amp; DISCUSSION</title>
      <p>In the ground truth 137 out of 225 movies were classified
as suitable for watching on an airplane (positive examples).
The remaining 88 were negative examples. Table 1 shows
the results of our runs. Classifying all movies as positive would
theoretically lead to a precision of 0.6089 and a recall of
1.0. In that sense our approach of minimizing the
number of false positives was not well chosen in terms of
evaluation numbers. However, the meta data based runs (3
and 4) as well as the poster based run (1) are better than
the naive classifier. Surprisingly, the visual-only runs
perform comparably well in relation to the meta data based
ones. The poster itself – with the given method – seems to
carry enough information for classification.</p>
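      <p>The baseline figures quoted above follow directly from the class counts. A short check in plain Python; the F1 value is not stated in the text and is derived here only for illustration.</p>
      <preformat>
```python
positives = 137  # suitable movies in the ground truth
negatives = 88   # unsuitable movies in the ground truth

# A classifier that labels every movie as positive:
true_positives = positives
false_positives = negatives
false_negatives = 0

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.6089 1.0 0.7569
```
      </preformat>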
      <p>In this work we have applied convolutional neural
networks as well as meta data based methods for classification.
Although the ground truth leans towards positive examples
and our approach focuses on minimizing false positives, we
think the chosen methods provide interesting results,
especially considering that the applied methods are
well-known approaches, i.e. Naive Bayes classifiers, SVMs,
CNNs, TF*IDF, etc., which are not tuned to the use case apart from the
list of inappropriate concepts and can surely be extended to
better fit the use case.</p>
      <p><sup>5</sup> http://www.imdb.com/, last visited 2016-09-29</p>
      <p>[2] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen,
and C. Griwodz. The MediaEval 2016 context of
experience task: Recommending videos suiting a
watching situation. In Proceedings of the MediaEval
2016 Workshop, 2016.</p>
      <p>[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. CoRR,
abs/1409.4842, 2014.</p>
      <p>[4] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu,
Y. Zhang, and J. Li. Deep learning for content-based
image retrieval: A comprehensive study. In Proceedings
of the 22nd ACM International Conference on
Multimedia, MM ’14, pages 157–166, New York, NY,
USA, 2014. ACM.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          . Caffe:
          <article-title>Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>