<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Keyword-Based TV Program Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Experimental Setup</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Christian Wartena</institution>
          ,
          <addr-line>Wout Slakhorst, Martin Wibbels, Zeno Gantner, Christoph Freudenthaler, Chris Newell, Lars Schmidt-Thieme Novay, Enschede</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Notwithstanding the success of collaborative filtering algorithms for item recommendation, there are still situations in which there is a need for content-based recommendation, especially in new-item scenarios, e.g. in streaming broadcasting. Since video content is hard to analyze, we use documents describing the videos to compute item similarities. We do not use the descriptions directly, but use their keywords as an intermediate level of representation. We argue that a nearest-neighbor approach relying on unrestricted keywords deserves a special definition of similarity that also takes word similarities into account. We define such a similarity measure as a divergence measure of smoothed keyword distributions. The smoothing is done on the basis of co-occurrence probabilities of the present keywords. Thus co-occurrence similarity of words is also taken into account. We have evaluated keyword-based recommendations with a dataset collected by the BBC and on a subset of the MovieLens dataset augmented with plot descriptions from IMDB. Our main conclusions are (1) that keyword-based rating predictions can be very effective for some types of items, and (2) that rating predictions are significantly better if we do not only take into account the overlap of keywords between two documents, but also the mutual similarities between keywords.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>clearly separated. Moreover, this offers the possibility
to integrate information from different sources,
including human classification, and allows correction of faulty
analyses, which might be important for many
organizations.</p>
      <p>Content-based recommendation relies on the ability to
compute similarities between items based on their
content. Classical methods use the overlap of words (either
keywords or all words in the documents/descriptions),
expressed by a correlation coefficient, like the Jaccard
coefficient, or by the cosine similarity, to define the
similarity between items. However, two items might have
very similar content but use a different vocabulary to
describe it. If we restrict the description of an item to a
few keywords, the problem becomes even more severe.
Especially when keywords are not restricted to a set of
standardized terms, it might be the case that two items
have a considerable overlap in content but are described
by completely disjoint sets of keywords. Thus we expect
that recommendations could be improved if we are able
to include keyword similarities in the definition of item
similarities.</p>
      <p>We compute similarities between keywords by
comparing their co-occurrence distributions. For words in
texts it is a well-studied phenomenon that semantic and
syntactic similarities can be computed by comparing the
contexts in which they appear. In other words:
appearing in similar contexts is a better indication of
similarity than direct co-occurrence. For keywords we
expect the same behavior, since they are extracted from
the (rather short) texts. In each text one synonym of a
word is likely to be dominant and selected as a keyword.
In other documents different synonyms of the keyword
will appear in similar contexts.</p>
      <p>Since we can use the same collection of keyword-annotated
items as we use for recommendation, the
keyword-to-keyword similarities can be integrated easily into the
item-item similarities. We consider a Markov chain on
items and keywords, with transitions from items to
keywords, representing the probabilities of terms to be a
keyword for a given item, and transitions from keywords
to items, representing the probabilities for each
document to be annotated with a given tag. Now the
co-occurrence distribution of a keyword is obtained by a
two-step Markov chain evolution starting with a
keyword. Keyword similarities are determined by
comparing their co-occurrence distributions. Item similarities
are obtained by comparing the keyword distributions that
arise from a one-step Markov chain evolution. By a
three-step evolution starting with a document we
incorporate the co-occurrence distributions of the keywords
into a kind of smoothed keyword distribution of the item.
When these smoothed distributions are compared, the
co-occurrence similarity of keywords is included in the
item-item similarity.</p>
      <p>
        We have evaluated recommendations based on the
keywords with a dataset collected by the BBC and with
viewing data from MovieLens combined with plot
descriptions from IMDB. For the BBC dataset we have
the original editorial synopses and a collection of
related web pages. From both sets of texts we have
extracted keywords by two different methods. For all sets
of keywords in the BBC dataset we see a clear
improvement of recommendation results when keyword
similarities are included in the computation of item-item
similarities. Moreover, we see that keyword-based
recommendation gives very good results, comparable to or slightly
better than those obtained by state-of-the-art
collaborative filtering recommenders. Further observations from
the experiments with this dataset are that the keywords
extracted using a co-occurrence-based technique
introduced in [
        <xref ref-type="bibr" rid="ref19">20</xref>
        ] give better results than the keywords
extracted on the basis of their tf.idf value, and that the
related websites give rise to better keywords than the
original descriptions.
      </p>
      <p>In contrast to the BBC data, for the MovieLens
dataset keyword-based recommendation is not able to
predict useful ratings at all. This might be explained
by the fact that keywords try to define the topic of an
item. In a homogeneous database of movies it is likely
that the topic is not a key factor determining the user's
appreciation of the movie.</p>
      <p>Our main conclusions are, first, that it matters how the
keywords are extracted and which texts are used, and,
second, that the similarity measure is very
important: recommendation results are significantly better if
we do not only take into account the overlap of keywords
between two documents, but also the mutual similarities
between keywords.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Co-occurrence-Based Similarity</title>
        <p>
          The idea that words can be described in terms of the
contexts in which they appear, and hence the idea that word
similarities can be derived by comparing these contexts,
has a long tradition in linguistics and is stated e.g. by
Zellig Harris [
          <xref ref-type="bibr" rid="ref4">5</xref>
          ]. The concept has become known as the
distributional hypothesis. Various formalizations of the
idea differ considerably in the way the context of a word
is defined. Co-occurrence distributions arise from
approaches that do not use grammatical structure. Schütze
and Pedersen [
          <xref ref-type="bibr" rid="ref15">16</xref>
          ] suggest that one could construct a
vector of co-occurrence probabilities from a complete word
co-occurrence matrix, where co-occurrences are counted
in a fixed-size window. The cosine similarity of these
vectors then provides a similarity measure. However,
they did not pursue this approach because it was
computationally too expensive. The approach that is most
similar to ours is that of Lindén and
Piitulainen [
          <xref ref-type="bibr" rid="ref9">10</xref>
          ], who take all words in any dependency
relation to the word under consideration as its context.
Then the probability distribution over the words in the
context is computed. Finally, the Jensen-Shannon
divergence is used to compare these distributions.
        </p>
        <p>
          This approach is very much the same as the query
language models used in pseudo-relevance methods in
information retrieval, as formulated e.g. by [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ] and [
          <xref ref-type="bibr" rid="ref20">21</xref>
          ].
In these approaches, first all documents containing the
query term are retrieved. Then the average distribution
of words in the documents is computed, which in this
approach is called the query language model. Finally,
documents are ranked according to the similarity of the
document distribution to the query language model.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Keyword Extraction</title>
        <p>
          Extracting keywords from a text is closely related to
ranking the words in the text by their relevance for the
text. To a first approximation, the best keywords are
the most relevant words in the text. Determining the
right weighting scheme for words in a text has been a central area
of research since the late 1960s ([
          <xref ref-type="bibr" rid="ref14">15</xref>
          ]). In 1972 Sparck
Jones (reprinted as [
          <xref ref-type="bibr" rid="ref16">17</xref>
          ]) proposed a weighting for the
specificity of a term that has become known as tf.idf. This
measure is still dominant in determining the relevance
of potential keywords for a text. However, keywords are
not simply the most specific words of a text, and other
factors may also play a role in keyword selection. Frank
et al. [
          <xref ref-type="bibr" rid="ref3">4</xref>
          ], Turney [
          <xref ref-type="bibr" rid="ref18">19</xref>
          ] and subsequently many
others have used machine learning approaches to keyword
extraction to integrate such features.
        </p>
        <p>
          The relevance measure used below was introduced by
Wartena et al. [
          <xref ref-type="bibr" rid="ref19">20</xref>
          ], where it was shown that this
measure gives good results for keyword extraction.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Keyword-Based Recommendation</title>
        <p>
          As noted e.g. by [
          <xref ref-type="bibr" rid="ref1">2</xref>
          ], popular collaborative filtering
algorithms are not suited for TV program recommendation,
as the new-item problem is very prevalent here. For new
items, content-based recommendation has to be used. In
content-based recommendation approaches it is common
to base recommendations on the words found in textual
descriptions of the items. Usually tf.idf weights or
information gain are used ([
          <xref ref-type="bibr" rid="ref11">12</xref>
          ]) to determine the relevance
of words. Words with low weights are usually removed,
but still a relatively large number of words (100 or more
[
          <xref ref-type="bibr" rid="ref11">12</xref>
          ]) is used to represent the text. Furthermore,
not all highly relevant words can serve as
keywords, which often are required to be noun phrases. Thus
this approach differs significantly from a keyword-based
approach.
        </p>
        <p>
          Recently, there has been considerable interest in using
social tags for recommendation. Tags are in many respects
similar to keywords, but also have a number of different
characteristics. In most tagged collections the assigners of
the tags are the same people that we want to compute
recommendations for. Thus most approaches try to
capture the tagging behavior of users to improve
recommendations. One of the first papers that integrates
tag-based similarities in a nearest-neighbors recommender
is by Tso-Sutter et al. [
          <xref ref-type="bibr" rid="ref17">18</xref>
          ]. Liang et al. [
          <xref ref-type="bibr" rid="ref8">9</xref>
          ] also use
a nearest-neighbor approach for tag-based
recommendation. Most other approaches, like the one of Firan et
al. [
          <xref ref-type="bibr" rid="ref2">3</xref>
          ], build user profiles from tags and base
recommendations on these profiles.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Markov Chains on Items and (Key)words</title>
      <p>We use the distributions of terms over items for two
different purposes: first, we consider the distribution of all
terms occurring in the texts to select a few key terms to
represent each document. In a second stage we consider
the distribution of keywords over items. We have to keep
in mind that we are dealing with different sets of terms in
the two cases. The concepts and techniques used are,
however, the same.</p>
      <p>Consider a set of n term occurrences (e.g. words or
multi-words), each being an instance of a term t in T =
{t1, …, tm}, and each occurring in a source document d
in a corpus D = {d1, …, dM}. Let n(d, t) be the number
of occurrences of term t in d, n(t) = Σ_d n(d, t) the
number of occurrences of term t, N(d) = Σ_t n(d, t) the
number of term occurrences in d, and n the total number
of term occurrences in the entire collection.</p>
      <p>We define three (conditional) probability distributions:
q(t) = n(t) / n on T, (1)
Q(d|t) = n(d, t) / n(t) on D, (2)
q(t|d) = n(d, t) / N(d) on T. (3)
Probability distributions on D and T will be denoted by
P, p with various sub- and superscripts.</p>
      <p>Consider a Markov chain on T ∪ D having only transitions
T → D with transition probabilities Q(d|t) and
transitions D → T with transition probabilities q(t|d).
Given a term distribution p(t), we compute the one-step
Markov chain evolution. This gives us a document
distribution P_p(d):
P_p(d) = Σ_t Q(d|t) p(t). (4)
Likewise, given a document distribution P(d), the
one-step Markov chain evolution yields the term distribution
p_P(t) = Σ_d q(t|d) P(d). (5)</p>
      <p>Since P(d) gives the probability to find a term
occurrence in document d, p_P is the weighted average of the
term distributions in the documents. Combining these,
i.e. running the Markov chain twice, every term
distribution gives rise to a new term distribution:
p̄(t) = p_{P_p}(t) = Σ_{t′,d} q(t|d) Q(d|t′) p(t′). (6)
For some term z, starting from the degenerate term
distribution p_z(t) = δ_{tz} (1 if t = z and 0 otherwise), we get
the distribution of co-occurring terms, or co-occurrence
distribution, p̄_z:
p̄_z(t) = Σ_{d,t′} q(t|d) Q(d|t′) p_z(t′) = Σ_d q(t|d) Q(d|z). (7)
This distribution is the weighted average of the term
distributions of the documents containing z, where the weight
is the probability Q(d|z) that an instance of term z has
source d. If we compute term similarities by
comparing their co-occurrence distributions (rather than the
source distributions Q(d|z)), we base the similarity on
the contexts in which a word occurs, as intended in the
distributional hypothesis.</p>
      <p>Likewise, we obtain a term distribution if we run the
Markov chain three times, starting from the degenerate
document distribution P_d(i) = δ_{id}:
p̄_d(t) = p_{P_{p_{P_d}}}(t) = Σ_{d′,t′,d″} q(t|d′) Q(d′|t′) q(t′|d″) P_d(d″) (8)
= Σ_{d′,t′} q(t|d′) Q(d′|t′) q(t′|d) = Σ_z q(z|d) p̄_z(t). (9)
The distribution p̄_d can be seen as a smoothed version
of the term distribution q(t|d) of document d, in which
co-occurrence information of the words is integrated. Thus, if we
compare documents using these smoothed distributions, we
also take into account co-occurrence-based word
similarities.</p>
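      <p>As an illustration of the constructions above, the following minimal Python sketch (not the authors' implementation; the toy corpus and all names are invented) computes the co-occurrence distribution of equation (7) and the smoothed distribution of equation (9) directly from raw counts:</p>

```python
from collections import Counter, defaultdict

# Invented toy corpus: each document is a list of term occurrences.
docs = {"d1": ["cook", "food", "chef"],
        "d2": ["food", "recipe", "chef", "chef"],
        "d3": ["news", "politics"]}

n_dt = {d: Counter(ts) for d, ts in docs.items()}     # n(d, t)
n_t = Counter(t for ts in docs.values() for t in ts)  # n(t)

def Q(d, t):  # Q(d|t) = n(d, t) / n(t), eq. (2)
    return n_dt[d][t] / n_t[t]

def q(t, d):  # q(t|d) = n(d, t) / N(d), eq. (3)
    return n_dt[d][t] / sum(n_dt[d].values())

def cooccurrence(z):
    """Two-step evolution from term z: p_bar_z(t) = sum_d q(t|d) Q(d|z), eq. (7)."""
    p = defaultdict(float)
    for d in docs:
        w = Q(d, z)  # probability that an instance of z has source d
        if w > 0:
            for t in n_dt[d]:
                p[t] += q(t, d) * w
    return p

def smoothed_doc_distribution(d):
    """Three-step evolution from d: p_bar_d(t) = sum_z q(z|d) p_bar_z(t), eq. (9)."""
    p = defaultdict(float)
    for z in n_dt[d]:
        for t, v in cooccurrence(z).items():
            p[t] += q(z, d) * v
    return p

p_chef = cooccurrence("chef")
assert abs(sum(p_chef.values()) - 1.0) < 1e-9  # still a probability distribution
```

      <p>Both evolutions return proper probability distributions, which is a useful sanity check when implementing the Markov chain.</p>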
    </sec>
    <sec id="sec-4">
      <title>Keyword Extraction</title>
      <p>For all items in our datasets a short textual description is
available. We extract words from these texts to represent
them as vectors in a word space. We can either use
all words (after removing stop words) or only a small
selection.</p>
      <p>For keyword extraction we compare two different
extraction methods. Both methods are based on ranking
words and selecting the k top-ranked words. The first
method uses standard tf.idf ranking. The tf.idf value of
a term t in a document d is defined as
tf.idf(t, d) = n(d, t) · log(M / df(t)), (10)
where n(d, t) is the number of occurrences of t in d, and
df(t) is the number of documents d′ for which n(d′, t) &gt; 0.</p>
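      <p>The tf.idf ranking described above can be sketched in a few lines of Python; this is an illustrative reconstruction with an invented toy corpus, not the pipeline used in the paper:</p>

```python
import math
from collections import Counter

# Invented toy corpus: each document is a list of (stemmed) words.
docs = {"d1": ["cook", "food", "chef", "food"],
        "d2": ["food", "recipe", "chef"],
        "d3": ["news", "politics", "food"]}

M = len(docs)                                             # corpus size
df = Counter(t for ts in docs.values() for t in set(ts))  # document frequency

def tf_idf(t, d):
    # tf.idf(t, d) = n(d, t) * log(M / df(t)), eq. (10)
    return docs[d].count(t) * math.log(M / df[t])

def top_keywords(d, k=2):
    # Rank the terms of document d and keep the k top-ranked ones.
    return sorted(set(docs[d]), key=lambda t: tf_idf(t, d), reverse=True)[:k]
```

      <p>Note that a term occurring in every document (here "food") gets weight zero, so it can never be selected as a keyword.</p>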
      <p>
        The second method uses the hypothesis that the
co-occurrence distribution of a good keyword is a good
estimator of the term distribution of the document. Thus
the suitability of a word as a keyword can be predicted
by comparing the co-occurrence distribution of the word
with the term distribution of the document. There are various options
to compute the similarity between two distributions. In
[
        <xref ref-type="bibr" rid="ref19">20</xref>
        ] it was shown that the following correlation coefficient
gives the best results:
r(z, d) = Σ_t (p̄_d(t) − q(t)) (p̄_z(t) − q(t)) / ( √(Σ_t (p̄_d(t) − q(t))²) · √(Σ_t (p̄_z(t) − q(t))²) ). (11)
This coefficient captures the idea that two distributions
are similar if they diverge in the same way from the
background distribution q. The coefficient is in fact the
cosine of the residual co-occurrence distribution of the
term and the residual smoothed term distribution of the
document, obtained after subtracting the background term
distribution. Note that the "residual" probabilities can be
negative, and hence r(z, d) can also become negative. For
keyword extraction we will not only use the coefficient
for ranking, but we will also require that the correlation
coefficient defined in equation (11) is positive.</p>
      <p>
        The different keyword extraction strategies are
implemented in a UIMA1 text analysis pipeline. All words in
the text are stemmed using the tagger/lemmatizer from
[
        <xref ref-type="bibr" rid="ref5">6</xref>
        ] and annotated by the Stanford part-of-speech tagger
([1]). To compute co-occurrence distributions, all open-class
words are taken into account.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Keyword-Based Recommendation</title>
      <p>
        The recommendation strategy we use is a
straightforward k-nearest-neighbor approach for recommendation
([
        <xref ref-type="bibr" rid="ref12">13</xref>
        ]). Content-based k-nearest-neighbor approaches are
similar to classical collaborative filtering algorithms, but
the similarity measure between items is based on the
content of the items and not on the ratings. The rating we
predict for a user and an item is the weighted average
of all items rated by the user, where more similar items
get greater weights. To be precise, let Iu be the set of all
items rated by user u; then the predicted rating R(u, i)
of u for item i is defined by
      </p>
      <p>R(u, i) = Σ_{j ∈ Iu} sim(i, j) R(u, j) / Σ_{j ∈ Iu} sim(i, j). (12)
We use two different keyword-based similarity measures
for items. The first measure is a smoothed Jaccard coefficient:
sim(i, j) = α + |Ki ∩ Kj| / |Ki ∪ Kj|, (13)
where Ki is the set of keywords of item i. The additional
parameter α ensures that each item is taken into account,
even if its set of keywords is disjoint from that of the item for
which a rating has to be predicted. Thus, items which
do not overlap with any other items rated by the user</p>
      <sec id="sec-5-1">
        <title>1http://incubator.apache.org/uima/</title>
      </sec>
      <sec id="sec-5-2">
        <title>2http://www.mymediaproject.org</title>
        <p>get the user average as the prediction. If a very large
value is taken for α, the predicted rating will always be
the user average. Some initial experiments suggest that
a value of about 0.1 yields the best results.</p>
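      <p>The predictor of equations (12) and (13) can be sketched as follows; the data layout and all names are invented for illustration, and α names the smoothing parameter:</p>

```python
def jaccard_sim(Ki, Kj, alpha=0.1):
    # sim(i, j) = alpha + |Ki ∩ Kj| / |Ki ∪ Kj|, eq. (13)
    union = Ki | Kj
    return alpha + (len(Ki & Kj) / len(union) if union else 0.0)

def predict_rating(user_ratings, keywords, i, alpha=0.1):
    """R(u, i) of eq. (12): similarity-weighted average of the user's ratings.
    user_ratings: {item: rating} for one user; keywords: {item: set of keywords}."""
    sims = {j: jaccard_sim(keywords[i], keywords[j], alpha)
            for j in user_ratings}
    total = sum(sims.values())
    return sum(s * user_ratings[j] for j, s in sims.items()) / total
```

      <p>With α &gt; 0, items whose keyword sets are disjoint from i still contribute, so a user with no overlapping rated items simply receives their average rating as the prediction.</p>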
        <p>Since all keywords are drawn from an unrestricted
vocabulary, it might be the case that two texts are tagged
with similar or strongly related words but not with
exactly the same words. Thus we should not only check
whether the same keywords are used, but also how
strongly the keywords are related. As argued before,
this can be done by comparing co-occurrence
distributions: the co-occurrence distribution can be seen as a
proxy for the semantics of a word. The whole text now
has to be represented by the average of the co-occurrence
distributions of all its keywords. This new distribution
is in fact a smoothed version of the original keyword
distribution of the document. The similarity between two
items i and j is now given by
sim(i, j) = α + 1 − JSD(p̄_i ‖ p̄_j), (14)
where again α = 0.1, and JSD is the Jensen-Shannon
divergence.</p>
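      <p>Equation (14) can be sketched as follows, assuming the smoothed keyword distributions are available as plain dicts; the base-2 logarithm is an assumption that keeps the divergence (and hence the similarity) bounded:</p>

```python
import math

def _kl(p, m):
    # Kullback-Leibler divergence of p from m, skipping zero-probability terms.
    return sum(pv * math.log2(pv / m[t]) for t, pv in p.items() if pv > 0)

def jsd(p, q):
    """Jensen-Shannon divergence of two distributions given as dicts."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * _kl({t: p.get(t, 0.0) for t in terms}, m) + \
           0.5 * _kl({t: q.get(t, 0.0) for t in terms}, m)

def sim(p_i, p_j, alpha=0.1):
    # sim(i, j) = alpha + 1 - JSD(p_i || p_j), eq. (14)
    return alpha + 1.0 - jsd(p_i, p_j)
```

      <p>With base-2 logarithms the JSD lies in [0, 1], so the similarity ranges from α (disjoint distributions) to α + 1 (identical distributions).</p>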
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <sec id="sec-6-1">
        <title>Data Sets</title>
      </sec>
      <sec id="sec-6-2">
        <title>BBC Broadcast Data</title>
        <p>The first dataset, used to test our hypothesis that
kNN-based rating prediction will benefit from including
co-occurrence information in the computation of item similarity, was
collected in a user study at the BBC. BBC programming
provides a very interesting use case for keyword-based
recommendation, since the BBC does not have a static
database of items, like the movie databases on which
much of the research on recommendation was done, but
a stream of items. Here, in fact, each item that we want
to predict ratings for is a new item. Content-based
recommendation might be very useful in this situation. For
all items an editorial description and one or more web
pages are available.</p>
        <p>The BBC data was collected during field trials of
the MyMedia project2 on recommender systems.
An audience research panel was asked to rate all content
items they watched during the field trial. In parallel,
media server logs were analyzed to determine the viewing
behavior of a larger superset of users. The characteristics
of the dataset are described in Table 1.</p>
        <p>Every content item in the BBC dataset has a related
web page or website. This meant that two descriptions
were available for each item:
1. Original editorial descriptions typically 30 to 200
words in length.</p>
        <p>2. Website text typically 200 to 4000 words in length.
The website text was obtained automatically using some
knowledge about the rough HTML structure of the web
sites. Note that some content items have very brief
descriptions and a simple, single web page associated with
them whereas other items have longer descriptions and
a substantial website. Where items were part of an
ongoing series the web site frequently includes information
about the complete series, rather than information about
an individual episode.</p>
        <p>We have extracted keywords from all texts by
stemming and applying the two weighting schemes discussed above.
Since we only extract nouns and verbs as keywords, and
we also exclude person names, as far as properly
identified, fewer than ten keywords were found for a number
of items. For all texts that are long enough, 10
keywords were extracted. When extracting keywords using
the correlation defined in equation (11) we also restrict the set of
possible keywords to those terms that have a positive
correlation. Thus the number of keywords extracted here
is sometimes lower than 10, even if 10 nouns are present
in the text. The average number of keywords assigned
and the total number of unique keywords used are given
in Table 2.</p>
      </sec>
      <sec id="sec-6-3">
        <title>MovieLens Dataset</title>
        <p>
          The second dataset we have used is derived from the
10 Million rating dataset from MovieLens ([
          <xref ref-type="bibr" rid="ref10">11</xref>
          ]). We
have augmented this dataset with the plot descriptions
of the movies from IMDB ([
          <xref ref-type="bibr" rid="ref6">7</xref>
          ]). For a lot of movies the
available plots are very short and uninformative. Thus
we restricted the dataset to the movies having plots of
at least 200 words. The characteristics of the dataset
are described in Table 1. The number of keywords per
item and the total number of unique keywords are given
in Table 2.
        </p>
        <p>Compared to the BBC dataset, we see that this
dataset is much denser: the number of users and items
is smaller, whereas there are many more ratings.</p>
        <p>The goal of the experiment is twofold. First, we want
to know whether extracted keywords provide a viable
resource on which to base recommendations. Second,
we want to test whether the similarity measure
defined in (14) gives better rating predictions than the
Jaccard coefficient (13). To test the latter hypothesis, for
each set of keywords we compute predictions using both
measures. In order to test the first hypothesis, we
compare the keyword-based rating predictions to predictions
from other algorithms. We use the following baselines:
1. user average,
2. item average,
3. collaborative filtering, and
4. genre- and series-based prediction.</p>
        <p>Item average (i.e. for a user-item pair we predict the
average rating other users have assigned to that item)
provides a nice baseline in the experiment, but is not
an alternative to content-based recommendation in real
scenarios, since it cannot be applied to new items. User
average (i.e. for a user-item pair we predict the average
rating the user has given to other items) is also a good
baseline, but not useful in real life, since it does not help
a user to make any choices. Collaborative filtering
provides a very strong baseline and in some sense gives the
limit we want to reach. However, it is only applicable in
the static experiment and not in the streaming broadcast
scenario discussed above. For collaborative filtering
we have used a state-of-the-art matrix factorization
implementation.3 For the genre-based recommendation we
use the same algorithm as for the keyword-based
recommendation. To do so, we simply treat the genre labels as
keywords. In the experiment with the BBC dataset there
are a lot of series. We expect that series-based
recommendation might give very good results, since it is likely
that someone who likes some episodes of a series will also
like the remaining episodes. Series can easily be
identified, since in almost all cases all items of a series have the
same title. By using the title of each item as a keyword
we get a series-based recommender. Since we use α = 1,
for all items that do not belong to a series already rated
by the user we predict the user average. Given the good
results of genre-based recommendation in earlier
experiments, we also use genres and the combination of genres
and title for content-based recommendation.</p>
        <p>
          For evaluation we have done a leave-one-out
experiment: each rating is predicted using all ratings except
the one that has to be predicted. Since the recommender
does not need any training of a model (except the
co-occurrence distributions of the keywords), this is a very
feasible approach. For collaborative filtering we use
a different protocol, since for each split a new model has
to be trained; the result given here is obtained using
10-fold cross-validation. We use biased matrix factorization
from the MyMediaLite package, http://ismll.de/mymedialite [
          <xref ref-type="bibr" rid="ref13">14</xref>
          ]. Interpreting the results requires
some caution, because the matrix factorization models
were trained using roughly 10% smaller datasets.
        </p>
        <p>[Tables 3 and 4 list the evaluated configurations: the keyword sets (web or original texts with tf.idf or co-occurrence extraction, genres, title, and genres + title for the BBC data; plot keywords, original keywords and genres for MovieLens), each combined with the Jaccard and JSD distance measures, together with the user average, item average and matrix factorization (MF) baselines.]</p>
      </sec>
      <sec id="sec-6-4">
        <title>Results</title>
        <p>As is common for rating prediction, we use the root
mean square error (RMSE) as evaluation measure. The
results in terms of RMSE are given in Table 3 and Table
4 for the BBC and MovieLens datasets, respectively.</p>
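      <p>For reference, the RMSE over a set of predicted/actual rating pairs is simply:</p>

```python
import math

def rmse(predicted, actual):
    """Root mean square error over paired rating lists."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))
```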
        <p>The first remarkable fact is that keyword-based rating
prediction gives very good results on the BBC dataset,
but cannot improve on the item average baseline in the
case of the MovieLens/IMDB data. This result is not
very surprising. Keywords mainly give the topic of the
program or the movie plot. Whether someone likes a
movie might depend on the genre, the director, the
actors, etc., but probably not on the topic of the plot.
Nevertheless, we see that keyword-based
recommendation can indeed be very useful, since it clearly
outperforms simple baselines like user or item average. As
expected, the series- (title-) and genre-based recommenders
perform very well. However, the best keyword-based
recommenders perform equally well. Surprisingly, the
content-based recommenders also perform as well as the
matrix factorization. The conclusion for our first
hypothesis therefore is that keyword-based
recommendation can be very useful for a dataset in which the topic
of the item matters and for which no other suitable
metadata, such as genre or series information, is available.</p>
        <p>With regard to our second question, whether the
inclusion of keyword co-occurrence information in the
definition of item similarity is useful, we see that in almost all
cases our new distance measure gives better results than
the standard measure. Only the genre-based results are
poorer. We have to say, however, that the measure was
not intended for use with such clearly defined concepts
as genres. It is meant to solve problems with (near-)
synonyms in a set of freely selected keywords.</p>
        <p>
          Furthermore, we observe that the co-occurrence-based
keywords perform better than the tf.idf-based keywords.
Thus the results also provide more evidence to support
the conclusions of a comparison between the two
methods in previous work ([
          <xref ref-type="bibr" rid="ref19">20</xref>
          ]). Finally, we see that the
keywords extracted from the related material perform better
than the keywords extracted from the original
descriptions. This is somewhat surprising, as on closer inspection
one gets the impression that the keywords extracted from
the original descriptions contain fewer mistakes and less noise.
However, the main effect seems to be that there are a
lot of items for which the original descriptions are too
short and yield too few keywords.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper we have investigated keyword-based rating
prediction. Keywords constitute a useful level of
description of an item, since keywords can be assigned by
humans or extracted automatically from one or more texts.
We have shown that for some datasets keyword-based
rating predictions give very good results, comparable to
state-of-the-art collaborative filtering methods. We have
hypothesized that the reason lies in the nature of the
dataset and the relevance of the topic of the item for the
appreciation of the item. It remains a question for
future research to apply keyword-based rating prediction
to more datasets to verify this hypothesis.</p>
      <p>We have argued that a nearest-neighbor approach relying
on unrestricted keywords requires a special definition of
nearness that also takes word similarities into account.
We have defined such a similarity measure as a divergence
measure of smoothed keyword distributions, where the
smoothing is done on the basis of the co-occurrence
probabilities of the keywords. In our experiments this
measure gives better results than the Jaccard coefficient
for all sets of keywords considered.</p>
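      <p>The idea of comparing smoothed keyword distributions can be illustrated as follows. This is a simplified sketch, not the paper's exact formulas: part of each keyword's probability mass is spread over its co-occurring words, and the resulting distributions are compared with the symmetric Jensen-Shannon divergence; the mixing weight <monospace>alpha</monospace> is an assumed parameter.</p>
      <preformat>
```python
import math
from collections import Counter

def smoothed_distribution(keywords, cooccurrence, alpha=0.5):
    """Keyword distribution of one item, smoothed with co-occurrence mass.

    keywords:      list of keywords assigned to the item
    cooccurrence:  dict keyword -> Counter of keywords it co-occurs with
    alpha:         fraction of the mass kept on the original keyword
    """
    dist = Counter()
    for k in keywords:
        dist[k] += alpha / len(keywords)
        cooc = cooccurrence.get(k, Counter())
        total = sum(cooc.values())
        if total:
            # Spread the remaining mass over co-occurring keywords.
            for w, c in cooc.items():
                dist[w] += (1.0 - alpha) * c / (total * len(keywords))
        else:
            dist[k] += (1.0 - alpha) / len(keywords)
    return dist

def jensen_shannon(p, q):
    """Symmetric divergence between two distributions (0 = equal, 1 = disjoint)."""
    keys = set(p) | set(q)
    def kl_to_mixture(a, b):
        return sum(a[k] * math.log2(2.0 * a[k] / (a[k] + b[k]))
                   for k in keys if a[k] > 0)
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)
```
      </preformat>
      <p>With this smoothing, two items tagged with frequently co-occurring synonyms such as "film" and "movie" obtain overlapping distributions and hence a low divergence, whereas their plain keyword sets would not overlap at all.</p>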
      <p>Other findings are that the keywords extracted from
the related web pages lead to better recommendation
results than the keywords extracted from the original
abstracts; the main reason seems to be that the abstracts
are in many cases too short to extract an optimal number
of relevant keywords. Finally, we see that the keywords
obtained by comparing co-occurrence distributions lead to
better recommendation results than the keywords extracted
using a standard tf.idf relevance measure.</p>
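      <p>For reference, the tf.idf baseline can be sketched as follows. This is a generic minimal version (the smoothed idf and the top-n cut-off are our own choices, not the exact extractor used in the experiments):</p>
      <preformat>
```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, top_n=5):
    """Rank the terms of one document by tf.idf against a background corpus.

    doc_tokens: list of tokens of the target document
    corpus:     list of token lists (the background collection)
    """
    tf = Counter(doc_tokens)
    df = Counter()
    for d in corpus:
        df.update(set(d))  # document frequency of each term
    n_docs = len(corpus)
    def idf(t):
        # Smoothed inverse document frequency.
        return math.log((1 + n_docs) / (1 + df[t]))
    scores = {t: tf[t] * idf(t) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```
      </preformat>
      <p>Terms that occur in almost every background document (stop words, genre labels) receive an idf close to zero and are ranked below rarer, more topical terms, even when their raw frequency in the document is higher.</p>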
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was funded by the European Commission
FP7 project MyMedia under grant agreement no. 215006.
We thank the anonymous reviewers for their valuable
feedback.</p>
      <p>[1] Stanford part-of-speech tagger.
http://nlp.stanford.edu/software/tagger.shtml.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cotter</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          . PTV:
          <article-title>Intelligent personalised TV guides</article-title>
          .
          <source>In AAAI/IAAI</source>
          , pages
          <volume>957</volume>
          –
          <fpage>964</fpage>
          . AAAI Press / The MIT Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Firan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Paiu</surname>
          </string-name>
          .
          <article-title>The benefit of using tag-based profiles</article-title>
          . In
          <string-name>
            <given-names>V. A. F.</given-names>
            <surname>Almeida</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          , editors,
          <source>LA-WEB</source>
          , pages
          <volume>32</volume>
          –
          <fpage>41</fpage>
          . IEEE Computer Society,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Paynter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutwin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Nevill-Manning</surname>
          </string-name>
          .
          <article-title>Domain-specific keyphrase extraction</article-title>
          . In T. Dean, editor,
          <source>IJCAI</source>
          , pages
          <volume>668</volume>
          –
          <fpage>673</fpage>
          . Morgan Kaufmann,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harris</surname>
          </string-name>
          . Distributional structure.
          <source>Word</source>
          ,
          <volume>10</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>146</fpage>
          –
          <fpage>162</fpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hepple</surname>
          </string-name>
          .
          <article-title>Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers</article-title>
          .
          <source>In ACL</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[7] http://www.imdb.com.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Document language models, query models, and risk minimization for information retrieval</article-title>
          . In
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Kraft</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , editors,
          <source>SIGIR</source>
          , pages
          <volume>111</volume>
          –
          <fpage>119</fpage>
          . ACM,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nayak</surname>
          </string-name>
          , and L.-T. Weng.
          <article-title>Personalized recommender systems integrating social tags and item taxonomy</article-title>
          .
          <source>In Web Intelligence</source>
          , pages
          <fpage>540</fpage>
          –
          <fpage>547</fpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Linden</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Piitulainen</surname>
          </string-name>
          .
          <article-title>Discovering synonyms and other related words</article-title>
          .
          <source>CompuTerm</source>
          <year>2004</year>
          , pages
          <fpage>63</fpage>
          –
          <fpage>70</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[11] http://www.grouplens.org/system/files/README_10M100K.html.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Pazzani</surname>
          </string-name>
          .
          <article-title>A framework for collaborative, content-based and demographic filtering</article-title>
          .
          <source>Artif. Intell. Rev.</source>
          ,
          <volume>13</volume>
          (
          <issue>5-6</issue>
          ):
          <volume>393</volume>
          –
          <fpage>408</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Pazzani</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Billsus</surname>
          </string-name>
          .
          <article-title>Content-based recommendation systems</article-title>
          .
          <source>In The Adaptive Web: Methods and Strategies of Web Personalization, volume 4321 of Lecture Notes in Computer Science</source>
          , pages
          <volume>325</volume>
          –
          <fpage>341</fpage>
          . Springer-Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <article-title>Online-updating regularized kernel matrix factorization models for large-scale recommender systems</article-title>
          .
          <source>In RecSys '08: Proceedings of the 2008 ACM Conference on Recommender Systems. ACM</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          .
          <article-title>Term weighting approaches in automatic text retrieval</article-title>
          .
          <source>Technical report</source>
          , Cornell University,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>A cooccurrence-based thesaurus and two applications to information retrieval</article-title>
          .
          <source>In Proceedings of the RIAO Conference</source>
          , pages
          <volume>266</volume>
          –
          <fpage>274</fpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Spärck Jones</surname>
          </string-name>
          .
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of Documentation</source>
          ,
          <volume>60</volume>
          :
          <fpage>493</fpage>
          –
          <fpage>502</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K. H. L.</given-names>
            <surname>Tso-Sutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Balby Marinho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <article-title>Tag-aware recommender systems by fusion of collaborative filtering algorithms</article-title>
          . In R. L. Wainwright and H. Haddad, editors,
          <source>SAC</source>
          , pages
          <fpage>1995</fpage>
          –
          <fpage>1999</fpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Turney</surname>
          </string-name>
          .
          <article-title>Learning algorithms for keyphrase extraction</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <volume>303</volume>
          –
          <fpage>336</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wartena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brussee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Slakhorst</surname>
          </string-name>
          .
          <article-title>Keyword extraction using word co-occurrence</article-title>
          .
          <source>In DEXA Workshops</source>
          , pages
          <volume>54</volume>
          –
          <fpage>58</fpage>
          . IEEE Computer Society,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>Model-based feedback in the language modeling approach to information retrieval</article-title>
          .
          <source>In CIKM</source>
          , pages
          <volume>403</volume>
          –
          <fpage>410</fpage>
          . ACM,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>