Context of Experience – MediaEval submission of ITEC / AAU

Polyxeni Sgouroglou, Tarek Markus Abdel Aziz, Mathias Lux
Klagenfurt University
Universitätsstrasse 65-67, Klagenfurt, Austria
{psgourog,tabdelaz}@edu.aau.at, mlux@itec.aau.at

ABSTRACT
People want to be entertained. However, context influences what people actually find entertaining. The MediaEval 2016 Context of Experience Task [2] focuses on automated methods to find the right content for a specific viewing situation, or more specifically, which movies are good to watch on an airplane. In this paper we present our approach to automatically suggesting movies from a list of possible ones by means of visual data as well as meta data.

1. INTRODUCTION
Movies for entertainment are big business. Worldwide TV and video revenue in 2015 is estimated at 286.17 billion USD¹. With new shows, series and movies every year, there is a huge library of content to choose from. Especially in the confined space of an airplane seat and for the duration of a long distance flight, video entertainment is well received by passengers. Many companies therefore offer on-board entertainment systems where passengers can choose from multiple videos to entertain themselves without disturbing other passengers. While there is of course no one-size-fits-all solution, the general hypothesis of the task is that some videos are better suited for watching on an airplane than others. Of course there are many different factors that can influence such a decision, e.g. whether a movie is still watchable on a small, low-contrast screen, or whether there are scenes which are potentially offensive to neighboring passengers.

While distinguishing positive from negative reviews of films is a quite easy process for humans, determining the suitability of a film for watching on an airplane is a non-trivial task even for humans. It becomes even harder when humans are asked to decide about the suitability of films while being outside the specific context. For a human who remains outside the airplane context it is rather difficult to anticipate the exact emotional impact of a film watched during a flight. Mood, differing tastes, stress levels during the flight, and anxiety problems are only some of the factors that could influence a passenger's decision. Hence, lists proposed by websites about what should or should not be watched on an airplane remain controversial. However, the majority of passengers share the same goal, which is to pleasantly pass the time. Therefore, there are some characteristics based on common sense that intuitively make a film unsuitable for watching on an airplane for the majority of passengers, including films about airplane crashes, very violent ones, those with a high level of nudity, etc.

For the MediaEval 2016 Context of Experience Task we submitted four runs, of which the first two are visual-only runs and the latter two investigate text features. For the text part of our experiments we chose to work with meta data, more specifically with plots, synopses and plot keywords of the films, since we consider these the most descriptive and concise types of meta data for deciding suitability in our watching situation. Our intuition is that other types of meta data, such as genre, language, etc., are not sufficient to characterize a film as unsuitable for watching on an airplane.

2. OUR APPROACH
2.1 Classification based on visual features
We used two different visual components, posters and trailers, for the classification of a movie. The first run – Run 1 – uses only posters, and the second run – Run 2 – uses only trailers. For both runs we used the same machine learning techniques.

Most people determine the genre or other special features of a movie by looking at its poster. For instance, red fonts are sometimes a hint that the movie includes horrifying or bloody scenes. Such derived features also play an important role in movie selection on airplanes, because next to every movie title the corresponding poster is presented. The best way to develop an algorithm that determines from a poster whether a movie is suitable or not is to use machine learning techniques.

We decided to use deep learning. It is usually employed for large datasets, but our development set was too small for training a network from scratch. Therefore, we extracted only the vectors of the last hidden layer from a pre-trained deep neural network model, which can then be used as input to a separate machine learning system [4].

We used the deep learning framework Caffe [1], developed by the Berkeley Vision and Learning Center. It can be used to load one of the pre-trained models provided by community members and researchers. We used the BVLC GoogLeNet model [3]. It has 22 layers and is trained on the ImageNet-2014 dataset to detect 1 000 different classes of images. For each poster from the development set we extracted a vector with 1 024 dimensions from the layer pool5/7x7_s1. It is the last hidden layer and contains all the high-level information created by the network.
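As an illustration, the following minimal sketch shows how such a pool5/7x7_s1 activation can be read out with pycaffe. The model file names, the mean values and the preprocessing details are assumptions made for the sketch; the paper does not prescribe an exact extraction script.

```python
# Minimal sketch of the poster feature extraction with pycaffe.
# 'deploy.prototxt' and 'bvlc_googlenet.caffemodel' are placeholders
# for the BVLC GoogLeNet definition and weight files.
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'bvlc_googlenet.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))        # HxWxC -> CxHxW
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))  # assumed ImageNet BGR means
transformer.set_raw_scale('data', 255)              # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))     # RGB -> BGR

def poster_vector(image_path):
    """Return the 1024-dimensional pool5/7x7_s1 activation for one poster."""
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['pool5/7x7_s1'].data[0].flatten()  # shape: (1024,)
```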
As classifier we built a Support Vector Machine (SVM) with scikit-learn², a free machine learning library. The advantages of SVMs are that they are effective in high dimensional spaces and in cases where the number of input vectors is smaller than the number of dimensions, as in our situation. We used a linear kernel function, and the soft margin parameter C was set to 10 (see the sketch below). The result is Run 1.

We proceeded similarly with the trailers. From every trailer we extracted 200 frames with ffmpeg³, a software suite for handling multimedia data and streams. We again extracted the vectors of the last hidden layer with the BVLC GoogLeNet model and concatenated all 200 vectors of a trailer. The result was a vector with 204 800 dimensions. We trained the SVM with these vectors from the development set and afterwards classified the vectors from the test set. The result is Run 2.
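The frame extraction can be sketched as follows. The fps-based sampling rate and the reuse of the hypothetical poster_vector() helper from the sketch above are illustrative choices; the paper only states that 200 frames per trailer were extracted.

```python
# Sketch of the trailer feature assembly under the assumptions above.
import glob
import subprocess
import numpy as np

def trailer_vector(trailer_path, frame_dir, n_frames=200):
    """Return one concatenated 200 x 1024 = 204 800-dimensional vector."""
    subprocess.run(['ffmpeg', '-i', trailer_path,
                    '-vf', 'fps=1',              # illustrative sampling rate
                    '-frames:v', str(n_frames),  # stop after n_frames images
                    frame_dir + '/frame_%03d.png'], check=True)
    frames = sorted(glob.glob(frame_dir + '/frame_*.png'))[:n_frames]
    # poster_vector() is the pool5/7x7_s1 extractor from the poster sketch.
    return np.concatenate([poster_vector(f) for f in frames])
```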
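The SVM stage referenced above, shared by Runs 1 and 2, amounts to a few lines with scikit-learn; a minimal sketch with the stated linear kernel and C = 10, where the argument names are placeholders:

```python
from sklearn.svm import SVC

def classify(dev_vectors, dev_labels, test_vectors):
    """Train on development vectors (one row per movie), predict the test set."""
    clf = SVC(kernel='linear', C=10)  # linear kernel, soft margin C = 10
    clf.fit(dev_vectors, dev_labels)
    return clf.predict(test_vectors)
```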
2.2 Text based classification
We first obtain the plots from the baseline text data provided by the organizers as XML files for the whole dataset of 318 films. We parse the XML files and access the title and the plot of each film. We then process the plots by casefolding, tokenization, removal of non-alphanumeric characters, stopping based on Google's stop word list to reduce computational and space complexity, and stemming with the PorterStemmer of the Natural Language Toolkit (nltk)⁴; a sketch of this chain is given at the end of this section. Finally, we store the XML plot tokens for further use. After preprocessing and feature vectorization the training corpus contains 1 972 distinct terms.

In both runs we employ a two-step classification (see the second sketch at the end of this section). In the first step the text features are used to determine with a Naive Bayes classifier whether movies are good to watch on an airplane or not. In case of a positive match, i.e. the film is classified as being good to watch on an airplane, we compare the terms of the text features to a list of – in our opinion intuitive – terms that make movies unsuitable for being watched on an airplane: {airplane crash, airplane attack, hijack, hijacking, air force one, bomb, terrorist, kidnap, abuse, fascism}. If the text features of a positive example match terms of this list twice, the classification result is changed to the negative example class. This basically allows us to focus more on precision than on recall and to reduce the number of false positive hits. Experiments based on the development data set have shown that this significantly improves precision, while reducing recall only marginally.

Our first text based run – Run 3 – contains predictions based on the baseline text features, expanded by a set of text features we obtained from relevant web pages. The number of features used is 5 000, ordered by term frequency. The first 1 972 are features extracted from the baseline tokenized XML plots, while the remaining 3 028 are features extracted from our downloaded, preprocessed and tokenized web results. For creating features the CountVectorizer – scikit-learn's bag of words tool – is used, with unigrams as analyzer setting. As classification method we use Naive Bayes for multinomial models. In the final prediction vector, if a film that is already classified as suitable contains at least two of the list's terms, we change its classification to unsuitable for watching on an airplane.

Our second text based run – Run 4 – is based on the same 1 972 baseline text features, while the remaining 3 028 features are obtained from plot keywords and synopses from the Internet Movie Database (IMDb)⁵. All other characteristics remain as in Run 3.
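A minimal sketch of the preprocessing chain described above; the small stop word set is a stand-in for Google's list used in the paper.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'to', 'in'}  # stand-in for Google's list

def preprocess(plot_text):
    """Casefold, tokenize, drop non-alphanumeric material and stop words, stem."""
    # Lowercasing plus an alphanumeric-only regex covers casefolding,
    # tokenization and non-alphanumeric removal in one step.
    tokens = re.findall(r'[a-z0-9]+', plot_text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
```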
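The two-step scheme can be sketched as follows. For brevity the sketch fits a single CountVectorizer over the training documents instead of combining the 1 972 baseline terms with the 3 028 expansion features, and it matches the term list by plain substring counting; both are simplifications of the procedure described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

UNSUITABLE_TERMS = ['airplane crash', 'airplane attack', 'hijack', 'hijacking',
                    'air force one', 'bomb', 'terrorist', 'kidnap', 'abuse',
                    'fascism']

def two_step_classify(train_docs, train_labels, test_docs):
    """Step 1: multinomial Naive Bayes on unigram counts (capped at 5 000 features).
    Step 2: flip positives that match the unsuitable-term list at least twice.
    train_docs / test_docs: one preprocessed string per movie; labels: 1 = suitable."""
    vectorizer = CountVectorizer(ngram_range=(1, 1), max_features=5000)
    X_train = vectorizer.fit_transform(train_docs)
    clf = MultinomialNB().fit(X_train, train_labels)
    predictions = list(clf.predict(vectorizer.transform(test_docs)))

    for i, doc in enumerate(test_docs):
        if predictions[i] == 1:  # step 1 says: suitable
            hits = sum(doc.count(term) for term in UNSUITABLE_TERMS)
            if hits >= 2:
                predictions[i] = 0  # override: unsuitable for an airplane
    return predictions
```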
3. RESULTS & DISCUSSION
In the ground truth 137 out of 225 movies were classified as suitable for watching on an airplane (positive examples); the remaining 88 were negative examples. Table 1 shows the results of our runs. Classifying all movies as positive would theoretically lead to a precision of 137/225 ≈ 0.6089 and a recall of 1.0. In that sense our approach of focusing on minimizing the number of false positives was not well chosen in terms of evaluation numbers. However, the meta data based runs (3 and 4) as well as the poster based run (1) are better than this naive classifier. Surprisingly enough, the visual-only runs perform comparably well in relation to the meta data based ones. The poster itself – with the given method – seems to carry enough information for classification.

Table 1: Results of the submitted runs, giving true and false positives and negatives, precision (P), recall (R) and F1.

Run   TP   FP   TN   FN   P       R       F1
1     87   52   35   49   0.6259  0.6397  0.6327
2     92   60   27   44   0.6053  0.6765  0.6389
3     92   55   32   44   0.6259  0.6765  0.6502
4     88   54   33   48   0.6197  0.6471  0.6331

4. CONCLUSIONS
In this work we have applied convolutional neural networks as well as meta data based methods for classification. Although the ground truth leans towards positive examples and our approach focuses on minimizing false positives, we think the chosen methods provide interesting results, especially when considering that the applied methods are well-known approaches, i.e. Naive Bayes classifiers, SVMs, CNNs, TF*IDF, etc., not tuned to the use case beyond the list of inappropriate concepts, and can surely be extended to better fit the use case.

5. REFERENCES
[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[2] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen, and C. Griwodz. The MediaEval 2016 context of experience task: Recommending videos suiting a watching situation. In Proceedings of the MediaEval 2016 Workshop, 2016.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[4] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 157–166, New York, NY, USA, 2014. ACM.

Footnotes:
¹ https://www.statista.com/statistics/259985/global-filmed-entertainment-revenue/, last visited 2016-09-27
² http://scikit-learn.org/stable/, last visited 2016-10-06
³ https://ffmpeg.org/, last visited 2016-10-06
⁴ http://www.nltk.org/, last visited 2016-10-06
⁵ http://www.imdb.com/, last visited 2016-09-29

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.