Simula @ MediaEval 2016 Context of Experience Task

Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Carsten Griwodz
Simula Research Laboratory and University of Oslo
{konstantin, michael, paalh, griff}@simula.no

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

This paper presents our approach for the Context of Multimedia Experience Task of the MediaEval 2016 Benchmark. We present different analyses of the given data using different subsets of the data sources and combinations of them. Our approach provides a baseline evaluation indicating that metadata-based approaches work well, but also that visual features can provide useful information for the given problem.

1. INTRODUCTION

In this paper we present our solutions for the Context of Experience Task: recommending videos suiting a watching situation [10], which is part of the MediaEval 2016 Benchmark. The task's main purpose is to explore multimedia content that is watched in a certain situation. This situation can be seen as the context in which the multimedia content is consumed. The use case for the task is watching movies during a flight.

The hypothesis is that watching movies in a specific context will change the preferences of the viewers. This is related to similar hypotheses in the field of recommender systems, as presented for example in [12, 13], where context is also an important influencing factor. Nevertheless, it is also closely related to the field of quality of experience [9, 8, 4], because the context during a flight, such as loud noises and other distractions, can play an important role in which movies viewers choose to watch.

Participants of the Context of Experience task are asked to classify a list of movies into two classes, namely +goodonairplane or -goodonairplane. To tackle this problem we propose three different approaches. All three methods use information extracted directly from the movies, or the metadata containing information about the movies, in combination with a machine-learning-based classifier.

The remainder of the paper is organized as follows. First, we give a detailed explanation of our three approaches and the classification algorithm that we used. This is followed by a description of the experimental setup and the results. Finally, we draw a conclusion.

2. APPROACHES

In this section we describe our three proposed runs in more detail. For all runs we use the same classification algorithm to get the final class.

The classification algorithm that we used for all three runs is the PART algorithm [2], which is based on PARTial decision trees. PART relies on decision lists and uses a separate-and-conquer approach to create them. In each iteration, PART builds a partial decision tree, finds the best leaf in that tree, and turns it into a rule. This is repeated until the best set of rules for the given data has been found. The advantage of PART is that it is very simple. The simplicity is achieved by using rule-based learning and decision finding that does not require global optimization. A possible disadvantage of the algorithm is that the rule sets are rather large compared to other decision-based algorithms such as C4.5 [7] or RIPPER [1]. Nevertheless, for our use case this is not important because the dataset is rather small [11]. For all our runs we use the WEKA machine learning library's implementation of PART with the provided (optimal) standard settings [3].
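To make the classification step concrete, the following minimal Java sketch trains and evaluates PART through the WEKA API with its default settings, as used for all our runs. The file names train.arff and test.arff are placeholders: the sketch assumes the task data has already been converted to WEKA's ARFF format, and that the class attribute (+goodonairplane / -goodonairplane) is the last attribute; neither assumption is specified in the original pipeline description.

import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PartBaseline {
    public static void main(String[] args) throws Exception {
        // Placeholder files: the task data converted to ARFF format.
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        // The class attribute is assumed to be the last one.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        PART part = new PART(); // default ("standard") settings, as in our runs
        part.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(part, test);
        System.out.println(eval.toSummaryString());
        // F1-score for the positive class, assuming +goodonairplane has index 0.
        System.out.println("F1 (+goodonairplane): " + eval.fMeasure(0));
    }
}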
2.1 Metadata

For the metadata-only approach we used only the metadata provided by the task dataset. We limited the metadata to the following attributes: rating, country, language, year, runtime, Rotten Tomatoes score, IMDB score, Metacritic score, and genre. We pre-processed and transformed rating, language, country and genre into numeric values for the classification. The scores from the different movie scoring pages were normalized to a scale from 1.0 to 10.0. If a value was missing in the dataset, we manually searched for the information on the Internet and replaced it with what we found. If we could not find ratings for all scoring services, we used 5.0 (the average score) as the value.

2.2 Visual Information

For the visual data we downloaded the trailers from the provided links and extracted all frames. From each frame we extracted different visual features and combined them into one feature vector for the classification (with a dimension of 3,866 values). For the visual features, we decided to use several different global features. The features that we used for this work are: joint histogram, JPEG coefficient histogram, Tamura, fuzzy opponent histogram, simple color histogram, fuzzy color histogram, rotation invariant local binary patterns, fuzzy color and texture histogram, local binary patterns and opponent histogram, PHOG, rank and opponent histogram, color layout, CEDD, Gabor, opponent histogram, edge histogram, scalable color and JCD. All the features were extracted using the LIRE open source library [5]. A detailed description of all features can be found in [6].
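As an illustration of the extraction step, the following Java sketch computes two of the global features listed above for a single frame and concatenates them into one vector. It is a sketch only: it assumes the pre-1.0 LIRE API, where the global features live in net.semanticmetadata.lire.imageanalysis and expose extract() and getDoubleHistogram(); class and method names differ in later LIRE versions. The full 3,866-dimensional vector would concatenate all of the listed features in the same manner.

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import net.semanticmetadata.lire.imageanalysis.CEDD;
import net.semanticmetadata.lire.imageanalysis.Tamura;

public class FrameFeatures {
    // Extracts two of the global features named above from one video frame
    // and concatenates them into a single vector.
    public static double[] extract(File frame) throws Exception {
        BufferedImage img = ImageIO.read(frame);

        CEDD cedd = new CEDD();
        cedd.extract(img);
        double[] ceddHist = cedd.getDoubleHistogram();

        Tamura tamura = new Tamura();
        tamura.extract(img);
        double[] tamuraHist = tamura.getDoubleHistogram();

        // Concatenate; repeating this for all listed features yields the
        // combined per-frame feature vector used for classification.
        double[] combined = new double[ceddHist.length + tamuraHist.length];
        System.arraycopy(ceddHist, 0, combined, 0, ceddHist.length);
        System.arraycopy(tamuraHist, 0, combined, ceddHist.length, tamuraHist.length);
        return combined;
    }
}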
2.3 Metadata and Visual Information Combined

For the final run we combined the metadata with the visual feature information. To combine the visual information with the metadata, we first ran the classifier on the visual information with a modification so that the output was not a binary decision but a probability for each class. These probabilities are then added to the metadata as two additional features (the probability of being negative and of being positive). The extended feature vector is then used for finding the final class. This can be seen as a kind of late fusion approach, which the literature generally regards as performing better than early fusion [14].
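A minimal sketch of this fusion step is given below, again using the WEKA API. It assumes that the visual-feature instances and the metadata instances describe the same movies in the same order, that meta's class index has not been set yet, and that a PART model has already been trained on the visual training data; the attribute names p_good and p_not_good are illustrative, not taken from the original setup.

import weka.classifiers.rules.PART;
import weka.core.Attribute;
import weka.core.Instances;

public class LateFusion {
    // Appends the visual classifier's class probabilities to the metadata
    // instances, producing the extended feature vector described above.
    public static Instances fuse(PART visualModel, Instances visual, Instances meta)
            throws Exception {
        int base = meta.numAttributes() - 1; // insert before the class attribute
        meta.insertAttributeAt(new Attribute("p_good"), base);
        meta.insertAttributeAt(new Attribute("p_not_good"), base + 1);
        meta.setClassIndex(meta.numAttributes() - 1); // class stays last

        for (int i = 0; i < meta.numInstances(); i++) {
            // Class membership probabilities from the visual-only classifier.
            double[] dist = visualModel.distributionForInstance(visual.instance(i));
            meta.instance(i).setValue(base, dist[0]);
            meta.instance(i).setValue(base + 1, dist[1]);
        }
        return meta; // train and evaluate PART on this extended vector
    }
}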
3. EXPERIMENTAL SETUP

The dataset provided by the task contains 318 movies in total, split into a training set and a test set. The test set contains 223 movies. For each run we calculated the F1-score, precision and recall. For the trailers, only links were provided, and we had to download them. Furthermore, the posters of the movies were also provided, but we did not use them in our approaches. Apart from the movies, we also used the provided metadata. We did not collect any additional data such as full-length movies, and we did not use the pre-extracted visual, text and audio features.

The goal of the task was, as mentioned before, to automatically identify whether a movie is suitable to be watched during a flight or not. We assessed three different methods executed in three runs. An overview of the conducted runs can be found in Table 1, where we provide a summarized overview and a short description of each method. The organizers also provided a baseline for comparison based on a simple random tree algorithm (last row in the tables).

Table 1: The configuration of our three submitted runs for the task. R1 combines visual and metadata, R2 uses only the metadata, and R3 uses only the visual data for classification. The last row shows the baseline provided by the organizers.

Run      | Description
R1       | Metadata and visual data combined
R2       | Metadata only
R3       | Visual data only
Baseline | All available metadata

4. RESULTS

Table 2 gives a detailed overview of the results in terms of true positives, false positives, true negatives and false negatives achieved by our runs and the baseline. Table 3 depicts the official results of the task metrics for our runs and the baseline; these metrics follow directly from the counts in Table 2 (for example, for R3: precision = 133/(133+77) ≈ 0.6333 and recall = 133/(133+3) ≈ 0.9779).

Table 2: Detailed results for each run and the baseline regarding true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).

Run      | TP  | FP | TN | FN
R1       | 101 | 65 | 22 | 35
R2       | 127 | 83 |  4 |  9
R3       | 133 | 77 | 10 |  3
Baseline |  78 | 46 | 41 | 58

Table 3: MediaEval 2016 Context of Experience Task official results.

Run      | F1-score | Precision | Recall
R1       | 0.6688   | 0.6084    | 0.7426
R2       | 0.7341   | 0.6047    | 0.9338
R3       | 0.7687   | 0.6333    | 0.9779
Baseline | 0.6000   | 0.6290    | 0.5735

All three runs outperformed the baseline significantly. R1, which used metadata and visual information at the same time, had the lowest performance of our runs. This was surprising to us, since we had expected this approach to perform best. A reason for the weak performance could be the way we combined the different features. The second best of our runs is R2, which uses metadata only. This is not surprising, since metadata is well known to perform well, and in general better than content-based classification. R3 was the best performing approach and even outperformed the metadata approach, which was not expected. It seems that for the use case of watching movies on a flight, the visual features of the movie play an important role. The reason could be that movies with brighter colors are preferred. Nevertheless, we have to investigate this in more detail before drawing a final conclusion.

5. CONCLUSION

This paper presented three approaches for the Context of Experience task, which were able to classify movies into two subsets: suitable or not suitable to be watched on an airplane. The results and insights gained by evaluating our different methods indicate that there is a difference in what people would like to watch during a flight, and that this difference is detectable to a certain extent by automatic analysis of metadata and content-based information. Nevertheless, we clearly see the need to extend this work with multiple and larger datasets. Additionally, it might be important to collect user opinions not via crowdsourcing but from actually travelling people.

6. ACKNOWLEDGMENT

This work has been funded by the FRINATEK project "Efficient Execution of Large Workloads on Elastic Heterogeneous Resources" (EONS, project number 231687) of the Norwegian Research Council (NFR).

7. REFERENCES

[1] W. W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.
[2] E. Frank and I. H. Witten. Generating accurate rule sets without global optimization. In J. Shavlik, editor, Fifteenth International Conference on Machine Learning, pages 144–151. Morgan Kaufmann, 1998.
[3] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[4] P. Lebreton, A. Raake, M. Barkowsky, and P. Le Callet. Evaluating complex scales through subjective ranking. In Proc. of QoMEX. IEEE, 2014.
[5] M. Lux. LIRE: Open source image retrieval in Java. In Proceedings of the 21st ACM International Conference on Multimedia, pages 843–846. ACM, 2013.
[6] M. Lux and O. Marques. Visual Information Retrieval Using Java and LIRE. Synthesis Lectures on Information Concepts, Retrieval, and Services, 5(1):1–112, 2013.
[7] J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[8] J. A. Redi, Y. Zhu, H. de Ridder, and I. Heynderickx. How passive image viewers became active multimedia users. In Visual Signal Quality Assessment. Springer, 2015.
[9] U. Reiter, K. Brunnström, K. De Moor, M.-C. Larabi, M. Pereira, A. Pinheiro, J. You, and A. Zgank. Factors influencing quality of experience. In Quality of Experience. Springer, 2014.
[10] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen, and C. Griwodz. The MediaEval 2016 context of experience task: Recommending videos suiting a watching situation. In Proceedings of the MediaEval 2016 Workshop, 2016.
[11] M. Riegler, M. Larson, C. Spampinato, P. Halvorsen, M. Lux, J. Markussen, K. Pogorelov, C. Griwodz, and H. Stensland. Right inflight? A dataset for exploring the automatic prediction of movies suitable for a watching situation. In Proc. of MMSys. ACM, 2016.
[12] A. Said, S. Berkovsky, and E. W. De Luca. Putting things in context: Challenge on context-aware movie recommendation. In Proc. of CAMRa. ACM, 2010.
[13] A. Said, S. Berkovsky, and E. W. De Luca. Group recommendation in context. In Proc. of CAMRa. ACM, 2011.
[14] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402. ACM, 2005.