<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MediaEval 2018 Movie Recommendation Task: Recommending Movies Using Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yashar Deldjoo</string-name>
          <email>deldjooy@acm.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihai Gabriel Constantin</string-name>
          <email>mgconstantin@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Athanasios Dritsas</string-name>
          <email>a.dritsas@student.tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Politecnico di Milano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In this paper we introduce the MediaEval 2018 task Recommending Movies Using Content. It focuses on predicting overall scores that users give to movies, i.e., average rating (representing overall appreciation of the movies by the viewers) and the rating variance/standard deviation (representing agreement/disagreement between users) using audio, visual and textual features derived from selected movie scenes. We release a dataset of movie clips consisting of 7K clips for 800 unique movies. In the paper, we present the challenge, the dataset and ground truth creation, the evaluation protocol and the requested runs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Recent years have witnessed a dramatic rise in the generation of
video content. Video recommender systems (RS) play an
important role in helping users of online streaming services cope
with information overload. Video recommender systems are
traditionally powered either by collaborative filtering (CF) models,
which leverage the correlations between users’ consumption
patterns, or by content-based filtering (CBF) approaches, typically based
on textual metadata that is either editorial (e.g., genre, cast, director) or
user-generated (e.g., tags, reviews) [
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ].
      </p>
      <p>
        The goal of the MediaEval Movie Recommendation Task is to use
content-based audio, visual and metadata features and their
multimodal combinations to predict how a movie will be received by its
viewers, namely by predicting the global average rating given by users and the standard
deviation of ratings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The task uses movie clips as input instead of
full-length movies, which makes it more versatile and effective, as
clips are more easily available than full movies. The task has two
main useful outcomes. First, techniques that predict the average
ratings users give to movies can be exploited
by producers and investors to decide whether or not to pursue the
production of similar movies. Second, and more importantly,
the task lays the groundwork for CBF movie recommendation,
where recommendations are tailored to match the individual
preferences of users regarding the audio-visual content and the descriptive
metadata. Regarding the latter, the current MediaEval task looks into
predicting the variance of the ratings; predicting it correctly
implies that the prediction system can differentiate between
the preferences of different users or groups of users, an ability that can be
exploited by current CBF movie recommender systems. In contrast
to the de facto CF approach widely adopted by the RS community,
the CBF approach can handle the item cold-start problem,
in which newly added items lack sufficient interactions (impeding
the usability of the CF approach), and can also help systems respect
user privacy [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. This paper presents an overview of the task, the
features provided by the organizers, a description of the ground
truth and evaluation methods as well as the required runs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>TASK DESCRIPTION</title>
      <p>Task participants must create an automatic system that predicts
the average rating that users will assign to a movie (representing
the overall appreciation of the movie by the audience) as well as
the rating variance (representing the agreement or disagreement
between user ratings)<sup>1</sup>. The input to the system is a set of audio, visual,
and text features derived from selected movie scenes (movie clips).</p>
      <p>
        The novelty of this task is that it uses movie clips instead of
movie trailers, which most previous work in both the
multimedia and recommendation fields has used [
        <xref ref-type="bibr" rid="ref11 ref4 ref6">4, 6, 11</xref>
        ]. Movie trailers
are, for the most part, free samples of a film that are packaged to
communicate a feeling of the movie’s story. Their main goal is to
convince the audience to come back for more when the film opens
in theaters. For this reason, trailers are usually packed with
thrills and chills. Movie clips, however, focus on a particular scene
and display it at the natural pace of the movie. The two
media types communicate different information to their viewers
and can evoke different emotions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which in turn strongly affect
the users’ perception and appreciation, i.e., ratings, of the movie. To
give an example, compare, for the movie "Beautiful Girls" (1996),
the official trailer,<sup>2</sup> a movie clip (A girl named Marty),<sup>3</sup> and another
movie clip (Ice skating with Marty),<sup>4</sup> all taken from the same movie.
      </p>
    </sec>
    <sec id="sec-3">
      <title>DATA</title>
      <p>
        Participants are supplied with audio and visual features extracted
from movie clips as well as associated metadata (genre and tag
labels). These content features resemble the content features of
our recently released movie trailer dataset MMTF-14K [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
However, unlike in MMTF-14K, in the movie clip dataset used in the
MediaEval task at hand, each movie can be associated with several
clips.
        <fn><label>1</label><p>Note that participants are in fact required to predict the standard deviation of ratings (cf. Section 5); for readability, we use the term “variance” instead of standard deviation.</p></fn>
        <fn><label>2</label><p>https://www.youtube.com/watch?v=yfQ5ONwWxI8</p></fn>
        <fn><label>3</label><p>https://www.youtube.com/watch?v=4K8M2EVnoKc</p></fn>
        <fn><label>4</label><p>https://www.youtube.com/watch?v=M-h1ERyxbQ0</p></fn>
      </p>
      <p>
        The complete development set (devset) provides features
computed from 5562 clips corresponding to 632 unique movies, while
the test set (testset) provides features for 1315 clips corresponding to 159
unique movies from the well-known MovieLens 20M dataset
(ml-20m) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The task uses the user ratings from the ml-20m
dataset to calculate the ground truth, namely the per-movie
global average rating and rating variance. The YouTube IDs of the
clips are also encoded in the names of the clips. For example,
000000094_2Vam2a4r9vo represents a clip in the dataset with the
ml-ID 94 and the YouTube ID 2Vam2a4r9vo<sup>5</sup>. Each movie has on
average about 8.5 associated clips, where this value is calculated over
both the devset and testset. The content descriptors are organized
in three categories, described next.
      </p>
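      <p>The clip naming scheme can be unpacked with a few lines of Python. The following is a minimal sketch, assuming every clip name follows the pattern of the single example given in the text (a zero-padded ml-ID, an underscore, then the YouTube ID); the function names parse_clip_name and group_clips_by_movie are hypothetical and not part of the released dataset.</p>
      <preformat>
```python
from collections import defaultdict


def parse_clip_name(clip_name):
    """Split a clip name such as '000000094_2Vam2a4r9vo' into (ml_id, youtube_id).

    Assumption: the name is a zero-padded ml-ID, one underscore, then the YouTube ID.
    """
    # Split on the first underscore only, since YouTube IDs may contain '_' themselves.
    ml_part, youtube_id = clip_name.split("_", 1)
    return int(ml_part), youtube_id


def group_clips_by_movie(clip_names):
    """Map each ml-ID to the list of YouTube IDs of its clips."""
    clips = defaultdict(list)
    for name in clip_names:
        ml_id, youtube_id = parse_clip_name(name)
        clips[ml_id].append(youtube_id)
    return dict(clips)


ml_id, youtube_id = parse_clip_name("000000094_2Vam2a4r9vo")
print(ml_id, youtube_id)  # 94 2Vam2a4r9vo
```
      </preformat>
      <p>Grouping clips by ml-ID in this way is a natural first step, since each movie is associated with several clips while the ground truth is defined per movie.</p>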
    </sec>
    <sec id="sec-4">
      <title>Metadata</title>
      <p>
        The metadata descriptors (found in the folder named Metadata) are
provided as two CSV files containing genre and user-generated tag
features associated with each movie. For ease of use, the metadata features come
in a pre-computed numerical format instead of the original textual
format. The metadata descriptors are exactly the
same as those of our MMTF-14K trailer dataset [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Audio features</title>
      <p>
        The Audio descriptors (found in the folder named Audio) are
contained in two sub-folders: block-level features (BLF) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and
i-vector features [
        <xref ref-type="bibr" rid="ref16 ref17 ref8">8, 16, 17</xref>
        ]. The BLF data includes the raw features
of the 6 sub-components (sub-features) that describe various audio
aspects: spectral aspects (spectral pattern, delta spectral pattern,
variance delta spectral pattern), harmonic aspects (correlation
pattern), rhythmic aspects (logarithmic fluctuation pattern), and tonal
aspects (spectral contrast pattern). The i-vector features, which describe
timbre, are provided for different values of two parameters: the Gaussian mixture model
(GMM) parameter, equal to (16, 32, 64, 256, 512), and the total variability dimension
(tvDim), equal to (10, 20, 40, 200, 400). The Block level features folder
has two subfolders, "All" and "Component6": the former contains
the super-vector created by concatenating all 6 sub-components,
while the latter contains the raw feature vectors of the sub-components
in separate CSV files. The i-vector features folder contains
individual CSV files for each of the possible combinations of the two
parameters GMM and tvDim.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Visual features</title>
      <p>
        The Visual descriptors (found in the folder named Visual) are
contained in two sub-folders: Aesthetic visual features [
        <xref ref-type="bibr" rid="ref13 ref9">9, 13</xref>
        ] and
Deep AlexNet Fc7 features [
        <xref ref-type="bibr" rid="ref12 ref2">2, 12</xref>
        ], each of them including
different aggregation and fusion schemes for the two types of visual
features. Both feature types are aggregated using four basic
statistical methods, each corresponding to a different sub-folder,
that compute a video-level feature vector from frame-level vectors:
the average value across all frames (denoted "Avg"), the average
value and variance ("AvgVar"), the median value ("Med"), and finally the
median and median absolute deviation ("MedMad").
Each of the four aggregation sub-folders of the Aesthetic visual
features folder contains CSV files for three types of fusion
methods: early fusion of all the components (denoted All), early fusion
of components according to their type (color-based components,
denoted Type3Color; object-based components, Type3Object; and
texture-based components, Type3Texture), and finally each of the 26 individual
components with no early fusion scheme (for example, the colorfulness
component is denoted Feat26Colorfulness), resulting in a
total of 30 files in each sub-folder. For the AlexNet features,
we use the output values extracted from the fc7
layer. For this reason, no supplementary early fusion scheme is
required or possible, and only one CSV file is present inside each
of the four aggregation folders.
        <fn><label>5</label><p>https://www.youtube.com/watch?v=2Vam2a4r9vo</p></fn>
      </p>
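      <p>The four aggregation schemes can be illustrated with a short Python sketch. It is not the organizers' extraction code: the function names are hypothetical, the use of the population variance and the ordering of the concatenated statistics (location first, then spread) are assumptions, and frame-level vectors are represented as plain lists of floats.</p>
      <preformat>
```python
from statistics import fmean, median, pvariance


def _columns(frames):
    """Transpose a list of frame-level vectors into per-dimension columns."""
    return list(zip(*frames))


def aggregate_avg(frames):
    """'Avg': mean of each feature dimension across all frames."""
    return [fmean(col) for col in _columns(frames)]


def aggregate_avg_var(frames):
    """'AvgVar': per-dimension means followed by per-dimension variances.

    Assumption: population variance; the text does not say which estimator is used.
    """
    cols = _columns(frames)
    return [fmean(c) for c in cols] + [pvariance(c) for c in cols]


def aggregate_med(frames):
    """'Med': median of each feature dimension across all frames."""
    return [median(col) for col in _columns(frames)]


def _mad(col):
    """Median absolute deviation of one feature dimension."""
    m = median(col)
    return median(abs(x - m) for x in col)


def aggregate_med_mad(frames):
    """'MedMad': per-dimension medians followed by median absolute deviations."""
    cols = _columns(frames)
    return [median(c) for c in cols] + [_mad(c) for c in cols]


# Three frames with two feature dimensions each (toy values).
frames = [[1.0, 10.0], [2.0, 20.0], [6.0, 30.0]]
print(aggregate_avg(frames))      # [3.0, 20.0]
print(aggregate_med_mad(frames))  # [2.0, 20.0, 1.0, 10.0]
```
      </preformat>
      <p>Note that "AvgVar" and "MedMad" yield vectors twice as long as "Avg" and "Med", because each dimension contributes both a location and a spread statistic.</p>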
    </sec>
    <sec id="sec-7">
      <title>RUN DESCRIPTION</title>
      <p>Every team can submit up to 4 runs: 2 runs predicting the
rating average and 2 runs predicting the rating standard deviation. For each score type, the
first run is expected to contain the predictions of the best
unimodal approach (using visual, audio, or metadata information) and
the second run those of a hybrid approach that considers all modalities. Note
that in all these runs, teams should consider how to temporally
aggregate clip-level information into movie-level information (each
movie is assigned about 8 clips on average). This task is novel in two
regards. First, the dataset includes movie clips instead of trailers,
thereby covering a wider variety of a movie’s aspects by showing
different kinds of scenes. Second, including information about the
ratings’ variance makes it possible to assess users’ agreement and to uncover
polarizing movies.</p>
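      <p>One simple way to turn clip-level information into movie-level information is to average the feature vectors of all clips belonging to the same movie. The sketch below illustrates only this baseline; it is one possible choice rather than a prescribed method, and the function name movie_level_features and the (ml_id, vector) input layout are assumptions made for illustration.</p>
      <preformat>
```python
from collections import defaultdict
from statistics import fmean


def movie_level_features(clip_features):
    """Average clip-level vectors per movie.

    clip_features: iterable of (ml_id, feature_vector) pairs, one per clip.
    Returns a dict mapping ml_id to the mean feature vector over its clips.
    """
    per_movie = defaultdict(list)
    for ml_id, vector in clip_features:
        per_movie[ml_id].append(vector)
    return {
        ml_id: [fmean(dim) for dim in zip(*vectors)]
        for ml_id, vectors in per_movie.items()
    }


clips = [(94, [1.0, 2.0]), (94, [3.0, 4.0]), (1, [5.0, 6.0])]
print(movie_level_features(clips))  # {94: [2.0, 3.0], 1: [5.0, 6.0]}
```
      </preformat>
      <p>Alternatives such as predicting a score per clip and then averaging the predictions per movie are equally plausible; the choice is left to participants.</p>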
    </sec>
    <sec id="sec-8">
      <title>GROUND TRUTH AND EVALUATION</title>
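      <p>Participants’ runs are evaluated on how well they predict users’
overall ratings. For this we use the standard error metric
root-mean-square error (RMSE) between the predicted scores and the
actual scores according to the ground truth (as given in the
MovieLens 20M dataset):
<disp-formula><tex-math>\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(s_i - \hat{s}_i\right)^2}</tex-math></disp-formula>
where <inline-formula><tex-math>N</tex-math></inline-formula> is the
number of scores in the test set on which the system is validated,
<inline-formula><tex-math>s_i</tex-math></inline-formula> is the actual score given by users to item <inline-formula><tex-math>i</tex-math></inline-formula>, and <inline-formula><tex-math>\hat{s}_i</tex-math></inline-formula> is the predicted
score. Two types of scores are considered for evaluation:
(1) average ratings and
(2) standard deviation of ratings.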
The standard deviation of ratings is chosen to measure the
agreement or disagreement between user ratings, thereby laying the
groundwork for personalized recommendation. Note
that during the test data release, participants are provided only
with the IDs of the test movie clips, for which they are expected to predict
both of the above scores.</p>
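      <p>The ground truth and the evaluation metric can be sketched in Python as follows. This is an illustrative sketch rather than the official evaluation script: the function names are hypothetical, ratings are assumed to be available as (ml_id, rating) pairs, and the population standard deviation is assumed because the text does not state which estimator is used.</p>
      <preformat>
```python
import math
from collections import defaultdict
from statistics import fmean, pstdev


def ground_truth(ratings):
    """Compute per-movie (average rating, rating standard deviation).

    ratings: iterable of (ml_id, rating) pairs.
    Assumption: population standard deviation (statistics.pstdev).
    """
    per_movie = defaultdict(list)
    for ml_id, rating in ratings:
        per_movie[ml_id].append(rating)
    return {m: (fmean(r), pstdev(r)) for m, r in per_movie.items()}


def rmse(actual, predicted):
    """Root-mean-square error between two equally long score sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("score lists must be non-empty and of equal length")
    squared_errors = [(s - p) ** 2 for s, p in zip(actual, predicted)]
    return math.sqrt(fmean(squared_errors))


truth = ground_truth([(1, 4.0), (1, 2.0), (2, 5.0)])
print(truth)  # {1: (3.0, 1.0), 2: (5.0, 0.0)}

# One RMSE is computed per score type; here, for the average ratings.
actual_avg = [truth[m][0] for m in (1, 2)]
print(rmse(actual_avg, [4.0, 3.0]))
```
      </preformat>
      <p>In this toy example the two squared errors are 1 and 4, so the RMSE for the average ratings is the square root of 2.5.</p>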
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS</title>
      <p>The 2018 Movie Recommendation Task provides a unified
framework for evaluating participants’ approaches to the prediction of
movie ratings using movie clips and audio, visual, and
metadata features, as well as their hybrid combinations. Details regarding
the methods and results of each individual run can be found in the
working note papers of the MediaEval 2018 workshop proceedings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Charu C</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Content-based recommender systems</article-title>
          .
          <source>In Recommender systems</source>
          . Springer,
          <fpage>139</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Content description for Predicting image Interestingness</article-title>
          .
          <source>In 2017 International Symposium on Signals, Circuits and Systems (ISSCS)</source>
          . IEEE,
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Video recommendation by exploiting the multimedia content</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . Italy.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          , Mihai Gabriel Constantin, Hamid Eghbal-Zadeh, Markus Schedl, Bogdan Ionescu, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>AudioVisual Encoding of Multimedia Content to Enhance Movie Recommendations</article-title>
          .
          <source>In Proceedings of the Twelfth ACM Conference on Recommender Systems</source>
          . ACM. https://doi.org/10.1145/3240323.3240407
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          , Mihai Gabriel Constantin, Bogdan Ionescu, Markus Schedl, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>MMTF-14K: A Multifaceted Movie Trailer Dataset for Recommendation and Retrieval</article-title>
          .
          <source>In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys 2018)</source>
          . Amsterdam, the Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          , Mehdi Elahi, Massimo Quadrana, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Using Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation</article-title>
          .
          <source>International Journal of Multimedia Information Retrieval</source>
          (
          <year>2018</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          , Markus Schedl, Paolo Cremonesi, and
          <string-name>
            <given-names>Gabriella</given-names>
            <surname>Pasi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Content-Based Multimedia Recommendation Systems: Definition and Application Domains</article-title>
          .
          <source>In Proceedings of the 9th Italian Information Retrieval Workshop (IIR 2018)</source>
          . Rome, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Hamid</given-names>
            <surname>Eghbal-Zadeh</surname>
          </string-name>
          , Bernhard Lehner, Markus Schedl, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>I-Vectors for Timbre-Based Music Similarity and Music Artist Classification</article-title>
          .
          <source>In ISMIR</source>
          .
          <fpage>554</fpage>
          -
          <lpage>560</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Andreas F</given-names>
            <surname>Haas</surname>
          </string-name>
          , Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A Sandin, Jennifer E Smith,
          <string-name>
            <given-names>Mark JA</given-names>
            <surname>Vermeij</surname>
          </string-name>
          , and others.
          <year>2015</year>
          .
          <article-title>Can we measure beauty? Computational evaluation of coral reef aesthetics</article-title>
          .
          <source>PeerJ</source>
          <volume>3</volume>
          (
          <year>2015</year>
          ),
          e1390
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F Maxwell</given-names>
            <surname>Harper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joseph A</given-names>
            <surname>Konstan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The MovieLens datasets: History and context</article-title>
          .
          <source>ACM Transactions on Interactive Intelligent Systems (TiiS)</source>
          <volume>5</volume>
          ,
          <issue>4</issue>
          (
          <year>2016</year>
          ),
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yimin</given-names>
            <surname>Hou</surname>
          </string-name>
          , Ting Xiao, Shu Zhang, Xi Jiang,
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xintao</given-names>
            <surname>Hu</surname>
          </string-name>
          , Junwei Han,
          <string-name>
            <given-names>Lei</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L Stephen</given-names>
            <surname>Miller</surname>
          </string-name>
          , Richard Neupert, and others.
          <year>2016</year>
          .
          <article-title>Predicting movie trailer viewer's "like/dislike" via learned shot editing patterns</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>7</volume>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>29</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Congcong</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tsuhan</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Aesthetic visual quality assessment of paintings</article-title>
          .
          <source>IEEE Journal of selected topics in Signal Processing 3</source>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>236</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Marich</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Marketing to moviegoers: a handbook of strategies and tactics</article-title>
          . SIU Press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ricci</surname>
          </string-name>
          , Lior Rokach, and
          <string-name>
            <given-names>Bracha</given-names>
            <surname>Shapira</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Recommender systems: introduction and challenges</article-title>
          .
          <source>In Recommender systems handbook</source>
          . Springer,
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Schedl</surname>
          </string-name>
          , Hamed Zamani,
          <string-name>
            <given-names>Ching-Wei</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yashar Deldjoo, and
          <string-name>
            <given-names>Mehdi</given-names>
            <surname>Elahi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Current challenges and visions in music recommender systems research</article-title>
          .
          <source>International Journal of Multimedia Information Retrieval</source>
          <volume>7</volume>
          ,
          <issue>2</issue>
          (
          <year>2018</year>
          ),
          <fpage>95</fpage>
          -
          <lpage>116</lpage>
          . https://doi.org/10.1007/s13735-018-0154-2
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Seyerlehner</surname>
          </string-name>
          , Markus Schedl,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Knees</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Reinhard</given-names>
            <surname>Sonnleitner</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A Refined Block-level Feature Set for Classification, Similarity and Tag Prediction</article-title>
          .
          <source>In 7th Annual Music Information Retrieval Evaluation eXchange (MIREX 2011)</source>
          . Miami, FL, USA.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>