<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Content-Based Multimedia Recommendation Systems: Definition and Application Domains</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yashar Deldjoo</string-name>
          <email>deldjooy@acm.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Cremonesi</string-name>
          <email>paolo.cremonesi@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>pasi@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Milano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2</volume>
      <fpage>8</fpage>
      <lpage>30</lpage>
      <abstract>
<p>The goal of this work is to formally provide a general definition of a multimedia recommendation system (MMRS), in particular a content-based MMRS (CB-MMRS), and to shed light on different applications of multimedia content for solving a variety of tasks related to recommendation. We would like to clarify that multimedia recommendation is not only about recommending a particular media type (e.g., music, video); rather, there exists a variety of other applications in which the analysis of multimedia input can be usefully exploited to provide recommendations of various kinds of information.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The World Wide Web is a huge resource of digital multimedia information. In the early years of the WWW, the available digital resources mainly consisted of texts. For this reason, the first search engines and, later, content-based recommender systems relied merely on text analysis. Nowadays, the information available on the Web is provided in several different media types, which include text, audio, video, and images. Moreover, different media types can co-exist in documents such as, for example, Web pages. Vertical search engines and recommender systems have been developed to cope with the problem of accessing or recommending specific media objects. While some media types are not related to others (e.g., texts), other media types, such as videos, can be considered structured entities, possibly composed of other media types; for example, a movie is a video object composed of a sequence of images and of an audio stream, and can possibly also carry text (subtitles). The aim of this paper is twofold: on the one hand, we propose a general definition of content-based multimedia recommender system (CB-MMRS), which comprises both systems working with one media type (vertical approach) and systems working with multiple media types (e.g., videos when exploiting the composite of image, audio, and textual information). Moreover, we propose a general recommendation model for composite media objects, where the recommendation relies on the computation of distinct utility values, one for each media type in the composite object, and a final utility is computed by aggregating these values. This can pave the way for novel recommendation techniques. As a second contribution, we discuss a variety of tasks where MM content can be exploited for effective recommendation, and we categorize them along different axes.</p>
    </sec>
    <sec id="sec-2">
      <title>Content-Based Multimedia Recommendation Systems</title>
      <p>
        We characterize a content-based multimedia recommendation system (CB-MMRS)
by specifying its main components.
1. Multimedia Items: In the literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
], a multimedia item (aka multimedia object or document) refers to an item which can be a text, image, audio, or video. Formally, a multimedia item I in its most general form is represented by the triple I = (I_V, I_A, I_T), in which I_V, I_A, and I_T refer to the visual, aural, and textual components (aka modalities), respectively. While text can be considered an atomic media type (meaning it consists of a single textual modality I_T), other media types, including audio, image, and video, can be either atomic or composite, as in the latter case they may contain multiple modalities. For example, an audio item that represents a performance of a classical music piece can be seen as an atomic media type (using I_A). On the other hand, a pop song with lyrics can be regarded as composite (using I_A and I_T); an image of a scene is atomic (using I_V), while an image of a news article is composite (using I_V and I_T); finally, a silent movie is atomic (using I_V), while a movie with sound is composite (using I_A, I_V, and/or I_T). We still use the term multimedia item when referring to all these media types regardless of whether they are atomic or composite. A CB-MMRS is a system that is able to store and manage MM items.
      </p>
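<p>The triple representation above can be sketched as a small data structure; the following is an illustrative example (class and field names are our own, not from the literature):</p>

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Illustrative sketch of the triple I = (I_V, I_A, I_T); names are invented.
@dataclass
class MultimediaItem:
    visual: Optional[Sequence[float]] = None   # I_V
    aural: Optional[Sequence[float]] = None    # I_A
    textual: Optional[str] = None              # I_T

    @property
    def modalities(self):
        # Which of the three modalities are actually present in this item.
        present = []
        if self.visual is not None:
            present.append("V")
        if self.aural is not None:
            present.append("A")
        if self.textual is not None:
            present.append("T")
        return present

    @property
    def is_atomic(self):
        # Atomic = a single modality; composite = more than one.
        return len(self.modalities) == 1

silent_movie = MultimediaItem(visual=[0.2, 0.8])
pop_song = MultimediaItem(aural=[0.5, 0.1], textual="lyrics ...")
assert silent_movie.is_atomic and not pop_song.is_atomic
```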
      <sec id="sec-2-1">
<title>2. Multimedia Content-Based Representation</title>
        <p>Developing a CB-MMRS relies on content-based (CB) descriptions according to distinct modalities (I_V, I_A, I_T). These CB descriptors are usually extracted by applying some form of signal processing specific to each modality, and are described based on specific features. Examples of such features for images are color and texture; for text they include words and n-grams.</p>
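<p>As a toy illustration of such content-based features (the extractors below are simplistic stand-ins, not the descriptors used in practice), a grey-level histogram can serve as a visual feature vector and unigram counts as a textual one:</p>

```python
from collections import Counter

# Toy extractors: a normalized grey-level histogram as f_V and
# unigram counts over a fixed vocabulary as f_T. Purely illustrative.
def visual_features(pixels, bins=4):
    # pixels: grey values in [0, 1); returns f_V in R^bins
    hist = [0.0] * bins
    for p in pixels:
        hist[min(int(p * bins), bins - 1)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def textual_features(text, vocabulary):
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocabulary]

f_V = visual_features([0.1, 0.15, 0.8, 0.9])
f_T = textual_features("fast fast energetic", ["fast", "energetic", "slow"])
assert f_V == [0.5, 0.0, 0.0, 0.5]
assert f_T == [2.0, 1.0, 0.0]
```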
        <p>
A CB-MMRS is a system that is able to process MM items and represent each modality in terms of a feature vector f_m = [f_1, f_2, ..., f_{n_m}] ∈ ℝ^{n_m}, where m ∈ {V, A, T} represents the visual, aural, or textual modality.
3. Recommendation Model: A CB-MMRS adopts a (personalized) recommendation model and provides suggestions for items by measuring the interest of the user in the CB characteristics of the items at hand [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
Given a target user u, to whom the recommendation will be provided, and a collection of MM items I = {I_1, I_2, ..., I_{|I|}}, the task of MM recommendation is to identify the MM item i* that satisfies

i* = argmax_i R(u, I_i),   I_i ∈ I   (1)

(Note that we set aside end-to-end learning performed by deep neural networks, in which the intermediate step of feature extraction is not done explicitly.) In Eq. (1), R(u, I_i) is the estimated utility of item I_i for the user u, on the basis of which the items are ranked [
          <xref ref-type="bibr" rid="ref3">3</xref>
]. The utility can only be estimated by the RS to judge how much an item is worth recommending, and its prediction lies at the core of a recommendation model. The utility estimation (or prediction) is based on a particular recommendation model, e.g., collaborative filtering (CF) or content-based filtering (CBF), which typically involves knowledge about users, items, and the core utility function itself [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
A comparison of such systems is provided in Table 1. While the RS community has long considered CF-MMRS, or CB-MMRS using pure (textual) metadata, as the only forms of MMRS, in this paper we focus our attention on CB-MMRS exploiting different media types and their constituent modalities. Depending on the number of modalities leveraged, we can categorize CB-MMRS as unimodal or multimodal. For example, a unimodal CB-MMRS can produce satisfactory results for the recommendation of text (e.g., a piece of news), but not for image, audio, and video. Users' diverse information needs are more likely to be satisfied by multimodal recommendation mechanisms. In a multimodal CB-MMRS, the estimated utility of item I_i for the user u can be decomposed into several modality-specific utilities computed across each modality in a MM item:
        </p>
<p>R(u, I_i) = F(R_m(u, I_i)),   m ∈ {V, A, T}   (2)

where R_m(u, I_i) denotes the utility of item I_i for user u with regard to modality m ∈ {V, A, T}, and F is an aggregation function of the estimated utilities for each modality. Based on the semantics of the aggregation, different functions can be employed, each implying a particular interpretation of the affected process.</p>
        <p>
Aggregation operators can be roughly classified as conjunctive, disjunctive, and averaging [
          <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
          ]. Conjunctive operators include the minimum (min) and
functions that are upper-bounded by the minimum
        </p>
<p>R(u, I_i) ≤ min(R_m(u, I_i)),   ∀m ∈ {V, A, T}   (3)

Disjunctive operators include the maximum (max) and those functions lower-bounded by the maximum. Based on the choice of aggregation operator, different aggregation values are obtained.</p>
<p>We give an illustrative example to clarify the importance of these aggregation operators. Suppose a MMRS should recommend a movie to a user. The system further knows that the user is likely to watch a fast-paced movie filmed with abrupt camera shot changes (visual), with rapid music tempo (audio), and with textual keywords that describe the movie as energetic or fast (textual). In such a case, if we set the aggregation function to min, the system will follow a conservative/pessimistic approach and will require all three modalities (audio, visual, and textual) to exhibit the aforementioned properties before the corresponding item can be considered a good candidate for recommendation. Conversely, the max operator adopts an optimistic approach and requires only one of the three modalities to include the desired property, making the utility of such an item higher. Therefore, these two aggregation operators have distinct semantics, which can be leveraged depending on the particular recommendation application at hand. The min and max functions set a lower bound and an upper bound, respectively, for averaging aggregation operators (e.g., arithmetic mean, geometric mean, or harmonic mean). For instance, in the field of multimedia information retrieval (MMIR), it is common to use the weighted linear combination</p>
<p>R(u, I_i) = Σ_m w_m R_m(u, I_i)   (4)</p>
        <p>where w_m is a weight factor indicating the importance of modality m. When we focus our attention on a specific modality, the problem is similar to a standard CBF problem, in which a linear model, among others, can be used; for example</p>
        <p>R_m(u, I_i) = Σ_j w_{m_j} R_{m_j}(u, f_{m_j})   (5)</p>
        <p>where w_{m_j} is a weight factor indicating the importance of f_{m_j}, the j-th feature in modality m. Equations 4 and 5 are called the inter-modality and intra-modality fusion functions in MMIR. The application of different aggregation operators to CB-MMRS, and to MMRS in general, remains open for exploration in future work.</p>
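<p>A minimal sketch of this fusion scheme, with invented weights and utility values, might look as follows:</p>

```python
# Sketch of intra-modality fusion (a weighted sum of feature-level
# utilities) followed by inter-modality aggregation (weighted sum,
# conjunctive min, or disjunctive max). All numbers are illustrative.

def intra_modality(feature_utilities, feature_weights):
    # R_m(u, I_i) = sum_j w_mj * R_mj(u, f_mj)
    return sum(w * r for w, r in zip(feature_weights, feature_utilities))

def inter_modality(modal_utilities, modal_weights=None, mode="avg"):
    vals = list(modal_utilities.values())
    if mode == "min":   # conjunctive: every modality must score well
        return min(vals)
    if mode == "max":   # disjunctive: one good modality suffices
        return max(vals)
    # weighted linear combination: R(u, I_i) = sum_m w_m * R_m(u, I_i)
    return sum(modal_weights[m] * r for m, r in modal_utilities.items())

R = {"V": intra_modality([0.9, 0.5], [0.6, 0.4]),  # R_V = 0.74
     "A": 0.3,
     "T": 0.8}
assert inter_modality(R, mode="min") == 0.3
assert inter_modality(R, mode="max") == 0.8
weights = {"V": 0.5, "A": 0.3, "T": 0.2}
assert abs(inter_modality(R, weights) - 0.62) < 1e-9
```

<p>Swapping min for max changes the semantics from pessimistic (all modalities must match) to optimistic (any modality may match), exactly as in the movie example above.</p>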
      </sec>
    </sec>
<sec id="sec-3">
      <title>Multimedia Content for Tasks Related to Recommender Systems</title>
      <p>
Multimedia content can be leveraged in RS that recommend a media type or a non-media item to the user. Multimedia content can also be exploited for certain tasks that are related to RS but are not directly part of the core recommendation approach or item model. Examples include the exploitation of webcam videos to identify the target user's emotional state [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or in general her head/posture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
which in turn can be used to personalize recommendations [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ]. Another
example is the use of audio content features to model transitions or learn sequences
from music playlists, e.g., continuously increasing energy level of songs in a
playlist. Such information can then be used for automatic playlist generation
or continuation [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. The former example relates to the use of multimedia content in context-aware recommender systems, the latter to its use in sequence recommendation. We explore these different dimensions in the following.
      </p>
      <sec id="sec-4-1">
<title>3.1 Approaches that recommend multimedia items</title>
        <p>
The primary and foremost application of multimedia content is constituted by MMRS, i.e., systems that recommend a particular media type to the user. In CB-MMRS, the media types constituting the input and the output of the system are the same (e.g., music recommendation based on acoustic music content plus the target user's preferences); however, in some applications the two media types can also be different, e.g., recommending music for a given image with regard to the evoked emotions. We explore these categories of recommendation in the following:
– Audio recommendation: As for audio recommendation, the most common application is music recommendation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Examples of common audio
features exploited in the music domain include energy, timbre, tempo, tonality,
and more abstract ones based on deep learning [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
– Image recommendation: In the image domain, interesting examples include recommending clothes (in the fashion industry) and paintings (e.g., in the tourism industry), among others. As for clothes recommendation, there exists huge potential in the fashion industry, mainly for the economic value, to build personalized fashion RS. These systems can be built by taking into account metadata, reviews, previous purchasing patterns, and the visual appearance of products. Such recommendation can be done in two manners: (1) finding objects that can be seen as alternatives to a given image provided by the user (such as two pairs of jeans) and (2) finding objects that may be complementary (such as recommending a pair of jeans matching a shirt). For example, [
          <xref ref-type="bibr" rid="ref11">11</xref>
] proposed a CB-MMRS that provides personalized recommendations for a given clothing image by considering the visual appearance of clothes. The proposed system exploits visual features based on convolutional neural networks (pre-trained on 1.2M ImageNet images) and uses a metric learning approach to find the visual similarity between a query image and complementary items (the second scenario). Some research works in the RS community [
          <xref ref-type="bibr" rid="ref12">12</xref>
] have criticized the above work for treating the recommendation problem as a visual retrieval problem, disregarding users' historical feedback on items as well as factors beyond the visual dimension. The main novelty of [
          <xref ref-type="bibr" rid="ref11">11</xref>
], besides focusing on a novel clothes-recommendation scenario, is examining the visual appearance of the items under investigation to overcome the 'cold start' problem.
– Video recommendation: In the video domain, examples of target items include movies, TV series, movie clips, trailers, and user-generated content. In [
          <xref ref-type="bibr" rid="ref13 ref14 ref15 ref16">13,14,15,16</xref>
], the authors propose a video RS that exploits visual features capturing mise-en-scène (stylistic aspects of a movie) and incorporate them into different CBF and CBF+CF systems, showing that their proposed system can replace similar systems using genre metadata and user-generated tags (in some cases). The authors show the possibility of utilizing such style-based movie recommender systems in a real system [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] also for children [
          <xref ref-type="bibr" rid="ref18">18</xref>
] or combined with the user's directly specified need in the form of a query by visual example [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>
          A newer version of the authors' work [
          <xref ref-type="bibr" rid="ref20">20</xref>
] proposed advanced audio and visual descriptors originating from multimedia signal processing, combined in a novel rank-aware hybridization approach, to significantly improve the quality of traditional metadata-based RS.
        </p>
      </sec>
      <sec id="sec-4-2">
<title>3.2 Approaches that use multimedia items as input</title>
        <p>
Multimedia content can not only be used for a particular media-item recommendation (as illustrated above), but there also exist other applications in which a MM item is used only as the input of the system, while another form of information is recommended as output. As listed in Table 1, we call such systems MM-driven RS to highlight that the output can be a non-media item. An example of such an application is provided below:
– POI recommendation by analyzing user-generated photos: [
          <xref ref-type="bibr" rid="ref21">21</xref>
] proposed a personalized travel recommendation system that leverages rich and freely available community-contributed photos and considers demographic information such as gender, age, and race in user profiles in order to provide effective personalized travel recommendations. The authors show that considering such attributes is effective for travel recommendation, especially as a promising aspect for personalization. To this end, they discuss the correlation between travel patterns and people's characteristics using information-theoretic measures.
        </p>
      </sec>
      <sec id="sec-4-3">
<title>3.3 Other approaches</title>
        <p>
An interesting but less-investigated area of research in CB-MMRS is recommending a piece of media (e.g., music) based on its association with other media (e.g., images) with regard to the evoked emotions, user-generated tags, or other catalysts. For instance, [
          <xref ref-type="bibr" rid="ref22">22</xref>
] proposed a MMRS that automatically suggests music based on a set of images (paintings). The motivation is that the affective content of paintings, when harmonized with music, can be effective for creating a fine-art slideshow, referred to as an emotion-based impressionism slideshow. Emotion is used as the main enabler to find the association between the paintings (input to the system) and the music (the output), which is done using the Mixed Media Graph (MMG) model [
          <xref ref-type="bibr" rid="ref23">23</xref>
]. The proposed method uses a variety of visual features based on color, light, and texture from the painting images, as well as acoustic features such as melody, rhythm, and tempo from the music, where both categories of features are known to affect emotions.
        </p>
        <p>
Visual contextual advertising is another closely related application field of multimedia content, in which the particular multimedia item currently being consumed by the user (e.g., an image or video) becomes the target for recommending advertisements. The goal here is to build a semantic match between two heterogeneous multimedia sources (e.g., the content of an image and an advertisement in textual form). [
          <xref ref-type="bibr" rid="ref24">24</xref>
] proposed a visual contextual advertising system that suggests the most relevant advertisement for a given image without building a textual relation between the two heterogeneous sources (i.e., it disregards the tags associated with images). The authors mention that there exist two main approaches to visual contextual advertising: (1) based on image annotation, and (2) based on a feature translation model. In the first case, a model is trained on a selection of labeled images and leveraged to predict text at test time given a new image. Manual labeling of the items is required, which makes the approach error-prone and labor-intensive. The second approach builds a bridge between the visual and textual feature spaces through a translation model, leveraging a language model to estimate the relevance of each advertisement w.r.t. a given target image. In [
          <xref ref-type="bibr" rid="ref24">24</xref>
] the authors propose a knowledge-driven cross-media semantic matching framework that leverages two large, high-quality knowledge sources: ImageNet (for images) and Wikipedia (for text). The image-advertisement match is established in the respective knowledge sources.
        </p>
      </sec>
      <sec id="sec-4-4">
<title>3.4 Context-aware Recommendation</title>
        <p>
In context-aware or situation-aware recommender systems, which often enhance information about user-item interactions by considering time [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], user
activity [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], or weather [
          <xref ref-type="bibr" rid="ref27">27</xref>
], among others, multimedia data can be used to create intermediate representations of items and match them with similar representations of context or users to effect recommendations. We exemplify this idea with the recent research topic of emotion-based matching, more precisely, using emotion information to match items to users and items to context entities.
        </p>
      </sec>
      <sec id="sec-4-5">
<title>Emotion-based matching of items and users</title>
        <p>
          Here, the goal is to select items that match the target user's affective state. Eliciting the user's emotional state can be effected by requesting explicit feedback or by analyzing multimedia material, for instance, user-generated text [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], speech [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], or facial expressions
in video [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], or a combination of audio and visual cues in video [
          <xref ref-type="bibr" rid="ref30 ref31">30,31</xref>
]. Likewise, describing items by affective terms can also be approached via content analysis. For instance, in the music domain, this task is commonly known as music emotion recognition (MER) [
          <xref ref-type="bibr" rid="ref32">32</xref>
]. Both tasks, i.e., inferring emotions from users and from items, come with their particular challenges, for instance, high variation in the intensity of users' facial expressions, or the subjectivity of perceived emotions when creating ground-truth annotations of items. An even harder task, however, is to connect users with items in the affectively intended way. To do so, knowing the target user's intent is crucial.
        </p>
        <p>
In the music domain, three fundamental intents or purposes of music listening have been identified [
          <xref ref-type="bibr" rid="ref33">33</xref>
]: self-awareness (e.g., stimulating people's reflection on their identity), social relatedness (e.g., feeling closeness to friends and expressing identity), and arousal and mood regulation (e.g., managing emotions).
Several studies have found that affect regulation is the most important reason why people listen to music [
          <xref ref-type="bibr" rid="ref33 ref34">33,34</xref>
]. However, the ways in which music preferences vary as a function of a listener's emotion and listening intent, and the affective impact of listening to a certain emotionally laden music piece, are still not well understood, and are further influenced by other psychological aspects such as the listener's personality [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-6">
<title>Emotion-based matching of items and context entities</title>
        <p>
          This task entails the establishment of relationships between items and contextual aspects. Affective information for items can again be elicited by multimedia analysis; that of contextual entities, in this scenario most commonly location [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] or
weather [
          <xref ref-type="bibr" rid="ref37">37</xref>
], by explicit user annotations. The recommender system then regards the emotion assigned to the contextual entity as a proxy for the user's emotion, and matches items and users correspondingly.
        </p>
        <p>
          To give an example, the system proposed in [
          <xref ref-type="bibr" rid="ref36">36</xref>
] recommends music pieces for locations (places of interest such as monuments), using emotions as intermediate representations of both. Based on online questionnaires, a limited set of places of interest is assigned affective labels, as are music pieces. Since the number of potentially suited music pieces is, however, much larger than the number of interesting locations, an audio content-based auto-tagger for music [
          <xref ref-type="bibr" rid="ref38">38</xref>
] is trained on a small set of annotated pieces and is subsequently used to predict the emotion tags for the remaining, unlabeled pieces. Recommendations for a target user at a given location are then made by ranking all music pieces according to the overlap (Jaccard coefficient) between their affective labels and those of the location.
        </p>
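<p>This ranking step can be sketched as follows; the emotion tag sets below are invented for illustration and differ from the actual annotated data:</p>

```python
# Score each music piece by the Jaccard coefficient between its emotion
# tags and the location's affective labels, then rank descending.
# All tag sets are made-up illustrative examples.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

location_tags = {"solemn", "calm"}
pieces = {"piece1": {"calm", "tender"},
          "piece2": {"energetic", "happy"},
          "piece3": {"solemn", "calm", "dark"}}

ranked = sorted(pieces,
                key=lambda p: jaccard(pieces[p], location_tags),
                reverse=True)
assert ranked[0] == "piece3"   # overlap 2/3 beats 1/3 and 0
```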
      </sec>
      <sec id="sec-4-7">
<title>3.5 Sequence Recommendation</title>
        <p>
          In certain domains, recommendation of coherent or meaningful item sequences
is preferred over recommendation of unordered item sets [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]. Examples include
recommending online courses or exercises for e-learning, video clips in media
streaming services, and automatic music playlist generation or continuation.
        </p>
        <p>Recommending sequences of music pieces, i.e., playlists, is special for several
reasons, most importantly the typically short duration and consumption time,
the likely preference for repeated item consumption, and the strong emotional
impact of music (cf. Section 3.4).</p>
        <p>
          Approaches to automatic playlist generation or continuation can either learn
directly from the sequences of items used for training, for instance, via sequential
pattern mining [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ], Markov models [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], or recurrent neural networks [
          <xref ref-type="bibr" rid="ref42 ref43">42,43</xref>
          ].
Alternatively or additionally, such approaches can also take content features
into account. In the music domain, these descriptors may include tempo (beats
per minute), timbre, or rhythm patterns and can be extracted through audio
processing techniques. Other features relevant to describe music can be extracted
from images like album covers [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] or video clips [
          <xref ref-type="bibr" rid="ref45">45</xref>
], which renders the task a multimedia content analysis problem. Using these content descriptors, playlists can either be created by computing similarities between songs, albums, or artists, or by defining constraints and creating the playlist in a way that fulfills these (as much as possible). The former approach aims at building coherent playlists in which consecutive tracks sound as similar as possible, e.g., [
          <xref ref-type="bibr" rid="ref46 ref47">46,47</xref>
]. The latter approach allows one to define target characteristics, such as increasing tempo, high diversity of artists, or a fixed start and end song [
          <xref ref-type="bibr" rid="ref48 ref49">48,49</xref>
          ].
        </p>
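<p>The former, coherence-oriented strategy can be sketched as greedily continuing a playlist with the audio-wise most similar track (cosine similarity over invented feature vectors):</p>

```python
import math

# Greedy similarity-based playlist continuation: repeatedly append the
# candidate track whose audio feature vector is most similar (cosine)
# to the current track. Feature vectors are invented toy values.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

tracks = {"t1": [0.9, 0.1], "t2": [0.8, 0.2], "t3": [0.1, 0.9]}

def continue_playlist(seed, candidates, length=2):
    playlist, current = [seed], seed
    pool = set(candidates) - {seed}
    for _ in range(length):
        if not pool:
            break
        nxt = max(pool, key=lambda t: cosine(tracks[current], tracks[t]))
        playlist.append(nxt)
        pool.discard(nxt)
        current = nxt
    return playlist

assert continue_playlist("t1", tracks) == ["t1", "t2", "t3"]
```

<p>A constraint-based variant would instead score whole candidate playlists against target characteristics (e.g., monotonically increasing tempo) and keep the best-scoring one.</p>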
        <p>
By additionally combining sequence recommendation with context-aware recommendation (cf. Section 3.4), playlists can be created based on hybrid methods that integrate the context of the listener and content-based similarity [
          <xref ref-type="bibr" rid="ref50 ref51">50,51</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
<p>In this work, we proposed a general definition of content-based multimedia recommender system (CB-MMRS). Moreover, we proposed a general recommendation model for composite media objects, where the recommendation relies on the computation of distinct utility values (one for each media object) and a final utility computed by aggregating these values. Finally, we presented a variety of different applications where MM content is used not only as the target product, but is also analyzed as input to the system or used to model the user, in order to provide recommendations of various kinds of information.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Elleithy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Advanced techniques in computing sciences and software engineering</article-title>
          . Springer Science &amp; Business
          <string-name>
            <surname>Media</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapira</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Recommender Systems: Introduction and Challenges</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          . Springer (
          <year>2015</year>
          )
          <volume>1</volume>
–
          <fpage>34</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          :
          <article-title>An introduction to recommender systems</article-title>
          .
          <source>In: Recommender Systems</source>
          . Springer (
          <year>2016</year>
          )
          <fpage>1</fpage>
          –
          <lpage>28</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Marrara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viviani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Aggregation operators in information retrieval</article-title>
          .
          <source>Fuzzy Sets and Systems</source>
          <volume>324</volume>
          (
          <year>2017</year>
          )
          <fpage>3</fpage>
          –
          <lpage>19</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tzeng</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>Multiple attribute decision making: methods and applications</article-title>
          . CRC press (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ebrahimi Kahou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michalski</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Recurrent neural networks for emotion recognition in video</article-title>
          .
          <source>In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ICMI '15</source>
          , New York, NY, USA, ACM (
          <year>2015</year>
          )
          <fpage>467</fpage>
          –
          <lpage>474</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atani</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          :
          <article-title>A low-cost infrared-optical head tracking solution for virtual 3d audio environment using the nintendo wii-remote</article-title>
          .
          <source>Entertainment Computing</source>
          <volume>12</volume>
          (
          <year>2016</year>
          )
          <fpage>9</fpage>
          –
          <lpage>27</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Andjelkovic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Donovan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>MoodPlay: Interactive mood-based music discovery and recommendation</article-title>
          .
          <source>In: Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization. UMAP '16</source>
          , New York, NY, USA, ACM (
          <year>2016</year>
          )
          <fpage>275</fpage>
          –
          <lpage>279</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tkalcic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burnik</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Odic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tasic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Emotion-aware recommender systems – a framework and a case study</article-title>
          . In Markovski, S., Gusev, M., eds.:
          <source>ICT Innovations</source>
          <year>2012</year>
          , Berlin, Heidelberg, Springer Berlin Heidelberg (
          <year>2013</year>
          )
          <fpage>141</fpage>
          –
          <lpage>150</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elahi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Current challenges and visions in music recommender systems research</article-title>
          .
          <source>International Journal of Multimedia Information Retrieval</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>McAuley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Targett</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Den Hengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM (
          <year>2015</year>
          )
          <fpage>43</fpage>
          –
          <lpage>52</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAuley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>VBPR: Visual Bayesian personalized ranking from implicit feedback</article-title>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elahi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garzotto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piazzolla</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quadrana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Content-based video recommendation system based on stylistic visual features</article-title>
          .
          <source>Journal on Data Semantics</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ) (
          <year>2016</year>
          )
          <fpage>99</fpage>
          –
          <lpage>113</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quadrana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The effect of different video summarization models on the quality of video recommendation based on low-level visual features</article-title>
          .
          <source>In: Content-Based Multimedia Indexing (CBMI), 15th International Workshop on</source>
          , ACM (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elahi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garzotto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piazzolla</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Recommending movies based on mise-en-scene design</article-title>
          .
          <source>In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems</source>
          , ACM (
          <year>2016</year>
          )
          <fpage>1540</fpage>
          –
          <lpage>1547</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elahi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quadrana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using visual features based on mpeg-7 and deep learning for movie recommendation</article-title>
          .
          <source>International Journal of Multimedia Information Retrieval</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Elahi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bakhshandegan Moghaddam</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cella</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cereda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Exploring the semantic gap for movie recommendations</article-title>
          .
          <source>In: Proceedings of the Eleventh ACM Conference on Recommender Systems</source>
          , ACM (
          <year>2017</year>
          )
          <fpage>326</fpage>
          –
          <lpage>330</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fra</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paladini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anghileri</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuncil</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garzotta</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Enhancing children's experience with recommendation systems</article-title>
          .
          <source>In: Workshop on Children and Recommender Systems (KidRec'17)-11th ACM Conference of Recommender Systems</source>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fra</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Letting users assist what to watch: An interactive query-by-example movie recommendation system</article-title>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Deldjoo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>MMTF-14K: A multifaceted movie trailer feature dataset for recommendation and retrieval</article-title>
          .
          <source>In: Proceedings of the 9th ACM Multimedia Systems Conference</source>
          , ACM (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>W.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>H.Y.M.</given-names>
          </string-name>
          :
          <article-title>Personalized travel recommendation by mining people attributes from community-contributed photos</article-title>
          .
          <source>In: Proceedings of the 19th ACM international conference on Multimedia, ACM</source>
          (
          <year>2011</year>
          )
          <fpage>83</fpage>
          –
          <lpage>92</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shan</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          :
          <article-title>Emotion-based impressionism slideshow with automatic music accompaniment</article-title>
          .
          <source>In: Proceedings of the 15th ACM international conference on Multimedia, ACM</source>
          (
          <year>2007</year>
          )
          <fpage>839</fpage>
          –
          <lpage>842</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duygulu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic multimedia crossmodal correlation discovery</article-title>
          .
          <source>In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , ACM
          (
          <year>2004</year>
          )
          <fpage>653</fpage>
          –
          <lpage>658</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A semantic approach to recommending text advertisements for images</article-title>
          .
          <source>In: Proceedings of the sixth ACM conference on Recommender systems, ACM</source>
          (
          <year>2012</year>
          )
          <fpage>179</fpage>
          –
          <lpage>186</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Resa</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sordo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Rocking around the clock eight days a week: an exploration of temporal patterns of music listening</article-title>
          .
          <source>In: Proceedings of the ACM Conference on Recommender Systems: Workshop on Music Recommendation and Discovery (WOMRAD 2010)</source>
          . (
          <year>2010</year>
          )
          <fpage>7</fpage>
          –
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenblum</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Context-aware Mobile Music Recommendation for Daily Activities</article-title>
          .
          <source>In: Proceedings of the 20th ACM International Conference on Multimedia, Nara</source>
          , Japan, ACM
          (
          <year>2012</year>
          )
          <fpage>99</fpage>
          –
          <lpage>108</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Pettijohn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carter</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Music for the seasons: Seasonal music preferences in college students</article-title>
          .
          <source>Current Psychology</source>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
          –
          <lpage>18</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asad</surname>
            ,
            <given-names>M.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Afroz</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nath</surname>
            ,
            <given-names>R.P.D.</given-names>
          </string-name>
          :
          <article-title>Emotion extraction from real time chat messenger</article-title>
          .
          <source>In: 2014 International Conference on Informatics, Electronics &amp; Vision (ICIEV)</source>
          . (May
          <year>2014</year>
          )
          <fpage>1</fpage>
          –
          <lpage>5</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Erdal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Kachele,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Schwenker</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <article-title>In: Emotion Recognition in Speech with Deep Learning Architectures</article-title>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <fpage>298</fpage>
          –
          <lpage>311</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Kaya</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , G F.:
          <article-title>Video-based emotion recognition in the wild using deep transfer learning and score fusion</article-title>
          .
          <source>Image and Vision Computing</source>
          <volume>65</volume>
          (
          <year>2017</year>
          )
          <fpage>66</fpage>
          –
          <lpage>75</lpage>
          . Special Issue: Multimodal Sentiment Analysis and Mining in the Wild.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Noroozi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marjanovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Njegus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalera</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anbarjafari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Audio-visual emotion recognition in video clips</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          –
          <lpage>1</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          :
          <article-title>Machine recognition of music emotion: A review</article-title>
          .
          <source>Transactions on Intelligent Systems and Technology</source>
          <volume>3</volume>
          (
          <issue>3</issue>
          ) (May
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33. Schafer, T.,
          <string-name>
            <surname>Sedlmeier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Städtler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huron</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The psychological functions of music listening</article-title>
          .
          <source>Frontiers in Psychology</source>
          <volume>4</volume>
          (
          <issue>511</issue>
          ) (
          <year>2013</year>
          )
          <fpage>1</fpage>
          –
          <lpage>34</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Lonsdale</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>North</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>Why do we listen to music? A uses and gratifications analysis</article-title>
          .
          <source>British Journal of Psychology</source>
          <volume>102</volume>
          (
          <issue>1</issue>
          ) (
          <year>February 2011</year>
          )
          <fpage>108</fpage>
          –
          <lpage>134</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Ferwerda</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tkalčič</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Personality &amp; Emotional States: Understanding Users' Music Listening Needs</article-title>
          .
          <source>In: Extended Proceedings of the 23rd International Conference on User Modeling, Adaptation and Personalization (UMAP</source>
          <year>2015</year>
          ), Dublin, Ireland (June–
          <year>July 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Kaminskas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Location-aware Music Recommendation Using Auto-Tagging and Hybrid Matching</article-title>
          .
          <source>In: Proceedings of the 7th ACM Conference on Recommender Systems (RecSys)</source>
          , Hong Kong, China (October
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Coviello</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kramer</surname>
            ,
            <given-names>A.D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marlow</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franceschetti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christakis</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fowler</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Detecting emotional contagion in massive social networks</article-title>
          .
          <source>PLOS ONE</source>
          <volume>9</volume>
          (
          <issue>3</issue>
          ) (March
          <year>2014</year>
          )
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Seyerlehner</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sonnleitner</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>From Improved Auto-taggers to Improved Music Similarity Measures</article-title>
          .
          <source>In: Proceedings of the 10th International Workshop on Adaptive Multimedia Retrieval (AMR</source>
          <year>2012</year>
          ), Copenhagen, Denmark (
          <year>October 2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Quadrana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jannach</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Sequence-aware recommender systems</article-title>
          .
          <source>CoRR abs/1802.08452</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>E.H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciou</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Mining mobile application sequential patterns for usage prediction</article-title>
          .
          <source>In: 2014 IEEE International Conference on Granular Computing (GrC)</source>
          (October
          <year>2014</year>
          )
          <fpage>185</fpage>
          –
          <lpage>190</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turnbull</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Playlist prediction via metric embedding</article-title>
          .
          <source>In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>ACM</source>
          (
          <year>2012</year>
          )
          <fpage>714</fpage>
          –
          <lpage>722</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <surname>Hidasi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karatzoglou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baltrunas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tikk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Session-based recommendations with recurrent neural networks</article-title>
          .
          <source>CoRR abs/1511.06939</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <surname>Vall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quadrana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widmer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Importance of Song Context in Music Playlists</article-title>
          .
          <source>In: Proceedings of the Poster Track of the 11th ACM Conference on Recommender Systems (RecSys)</source>
          , Como, Italy (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <surname>Lībeks</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turnbull</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>You can judge an artist by an album cover: Using images for music annotation</article-title>
          .
          <source>IEEE MultiMedia</source>
          <volume>18</volume>
          (
          <issue>4</issue>
          ) (April
          <year>2011</year>
          )
          <fpage>30</fpage>
          –
          <lpage>37</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          45.
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Harnessing music-related visual stereotypes for music information retrieval</article-title>
          .
          <source>ACM Trans. Intell. Syst. Technol</source>
          .
          <volume>8</volume>
          (
          <issue>2</issue>
          ) (
          <year>October 2016</year>
          )
          <fpage>20:1</fpage>
          –
          <lpage>20:21</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          46.
          <string-name>
            <surname>Pohle</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knees</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pampalk</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widmer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>“Reinventing the Wheel”: A Novel Approach to Music Player Interfaces</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>9</volume>
          (
          <year>2007</year>
          )
          <fpage>567</fpage>
          –
          <lpage>575</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          47.
          <string-name>
            <surname>Knees</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pohle</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widmer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Combining Audio-based Similarity with Web-based Data to Accelerate Automatic Music Playlist Generation</article-title>
          .
          <source>In: Proceedings of the 8th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR)</source>
          , Santa Barbara, CA, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          48.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turnbull</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Playlist prediction via metric embedding</article-title>
          .
          <source>In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '12</source>
          , New York, NY, USA, ACM (
          <year>2012</year>
          )
          <volume>714</volume>
          {
          <fpage>722</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          49.
          <string-name>
            <surname>Flexer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schnitzer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gasser</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widmer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Playlist Generation Using Start and End Songs</article-title>
          .
          <source>In: Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR)</source>
          , Philadelphia, PA, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          50.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Just-for-Me: An Adaptive Personalization System for Location-Aware Social Music Recommendation</article-title>
          .
          <source>In: Proceedings of the 4th ACM International Conference on Multimedia Retrieval (ICMR)</source>
          , Glasgow, UK (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          51.
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coyle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Towards a Personal Automatic Music Playlist Generation Algorithm: The Need for Contextual Information</article-title>
          .
          <source>In: Proceedings of the 2nd International Audio Mostly Conference: Interaction with Sound</source>
          , Ilmenau, Germany (
          <year>2007</year>
          )
          <fpage>84</fpage>
          –
          <lpage>89</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>