Improving Novelty in Streaming Recommendation Using a Context Model∗

Doina Alexandra Dumitrescu, Simone Santini
Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
{doina.dumitrescu, simone.santini}@uam.es

ABSTRACT
In recent years there has been an increasing research interest in novelty/diversity detection in Information Retrieval and in Recommendation Systems. We propose a model that increases the novelty of recommendations using a context user profile that was created automatically using self-organizing maps. Our system was evaluated on the Reuters Corpus Volume 1, and our experiments show that filtering the recommended items using a novelty score derived from the context-based user profile provides better search results in terms of the novel information that is presented to the user.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Metrics, Experimentation

Keywords
Novelty, Diversity, User Model, Context

1. INTRODUCTION
The plethora of information available on the internet about virtually anything, and the duplication of this information due to the significant replication of the sources, has given great relevance to the issues of novelty and diversity in retrieval and recommendation systems [2, 5]. The definition of novelty and diversity generally accepted in information retrieval is that diversity helps the server deal with query ambiguity, while novelty helps the user deal with query underspecification [8]. Consider a query composed of the word manhattan. The query is ambiguous because the server doesn't know whether the user wants to know something about the neighborhood in New York, about the cocktail, or about the Indian tribe: the query has several possible interpretations, and whoever formulated the query is, in general, interested in only one of them. Without further specification, the safest bet for the server is to provide a diverse result set, that is, one that covers all these topics, possibly in an amount proportional to the estimated a priori user interest (one can assume, for instance, that more people are interested in New York than in the Indian tribe). Even if the query is not ambiguous and the results are about what the user really meant (say, the neighborhood in New York), the user would want every document to be novel, that is, to provide information that is not present in other documents. A highly informative text about the history and the human landscape of Manhattan is probably a very good first result, but another document almost identical to the first is not an equally desirable second result: the information that it contains has already been seen. That is, every interpretation of a query contains several aspects, and the person who formulated it will be interested, to some degree, in all of them.

These considerations highlight an important difference between diversity and novelty. Diversity is something that the user wouldn't really want: the person interested in the manhattan cocktail would be very happy to receive results only about the cocktail. Diversity is needed mainly by the server, which doesn't know enough about the user's true intentions. Novelty, on the other hand, is something the user wants, since she will be interested in all the aspects of her query. Novelty is a way to avoid redundancy, while diversity is a way to deal with the server's ignorance.

If we have access to a model of the user's interests, then it is possible to reduce ambiguity and, therefore, to reduce the need for diversity, concentrating only on increasing novelty.
∗This work was supported by the Ministerio de Educación y Ciencia under grant N. TIN2011-28538-C02, Novelty, diversity, context and time: new dimensions in next-generation information retrieval and recommender systems.

CARS-2012, September 9, 2012, Dublin, Ireland.
Copyright is held by the author/owner(s).

We present a method for increasing novelty that relies on a context model built on texts previously seen by the user. The scenario that we use here to exemplify the working of our method is that of the recommendation of news, and the context is composed of past news that the user has considered interesting. This is just one possible way in which context can be gathered. Our model is very general and can be used to represent text or multimedia data. In past work, we have used, for example, the contents of the user's hard disk to represent his main interests [7]. In this paper we show and evaluate a modification to our model with the purpose of increasing the context coverage of the recommendations, that is, the fraction of the user context that the recommended items will cover, trying to avoid items that refer to parts of the context already covered by previously seen items.

2. STREAM RECOMMENDATION
One important distinction between this work and others on novelty and diversity in recommendation is that we are dealing with a stream of news, that is, with a continuous, potentially infinite, interaction, while typical recommendation systems work based on sessions: the user chooses an item or requests a recommendation, and a finite set of results is shown. This operation may be repeated several times, as the user selects several items and receives correspondingly changing recommendations. No work, to the best of our knowledge, has considered novelty in a session-less situation, in which a countable stream of items arrives and a system has to decide whether each item is relevant for the user at that moment.

Note that, differently from the case of a closed session, we don't have access to the whole data base of items that can be recommended. We don't even have access to the items that have already been shown, since in a stream their number is unbounded. All we can do, at any given time, is take a decision considering only how the past history of items has modified the user context. The system is shown schematically in figure 1. We have a stream of items (news, in our test scenario) that arrive continuously to the user. The user is described by the context model C (the details of this representation will be given in the next section), and each news item is compared with the context and assigned a score. Whenever the score is above a certain threshold (or a predefined criterion is met), the item is presented to the user. At the same time, the context undergoes a first kind of change, which we call a short term change. This change (used to improve the novelty of the recommended items) is characterized by a time constant β. After a time, depending on β, the change provoked by any individual item disappears.

This is one important difference between our model and the session based ones. Let us assume that a person is interested in politics. News about the election of the new French president will undoubtedly be of interest to her. However, she will probably not be interested in receiving many similar news items about the election one after the other. The effect of the short term change will be to "deactivate", for a time, the part of the context that deals with this kind of news, so that further news about the French elections will not, cœteris paribus, be considered as interesting. However, this situation should not be permanent: after a while (a few hours, a day, ...), further news about the French president will again be considered interesting, and should be recommended.

Figure 1: Schema of our stream recommendation system.

In addition to this short term change, the context evolves in a permanent and irreversible way, again under the influence of the documents read (as well as, possibly, based on other elements such as the documents that the user creates on her computer, or the emails that she writes). A number of strategies can be used for this evolution: the user can be asked to mark certain documents of special interest, the system can present only a summary and permanently enter into the context all the documents whose complete text is accessed, etc. We have considered the evolution of the context based on documents of interest elsewhere [7], and we shall not consider it here. From the point of view of this paper, therefore, the context, defined as in the following section, is fixed, and only the short term changes, needed to increase novelty, are considered.

3. THE USER MODEL
There is convincing evidence that a user model can help attain a higher precision in information retrieval and recommendation systems [1, 6]. In this paper, we are interested in using such a model to achieve novelty, that is, to obtain recommendations that span the whole gamut of the user's interests. The context model that we use was presented (without the extension to enforce novelty that we present here) in [7], to which the reader is referred for details; here we shall only give a brief description. The basis of the model is a set of documents, which can be composed of the documents on which the user is working at the time or, in our case, of news items marked in the past. The model doesn't change depending on the source of the documents. In the following, in order to make the tests self-contained, we shall always assume that the context is based on a set of news items that supposedly the user has seen in the past and has found relevant.

All the documents in the context are considered as a single "macro-document" and processed accordingly. On this macro-document we perform stop-word removal and stemming. Each term t is assigned a weight w_t using tf-idf weighting [1]. We use the standard vector model representation of information retrieval, in which each different term corresponds to an axis in the word space. We consider n-grams, constituted of n consecutive terms that appear in the document. If the n-gram p_k is composed of the terms t_{u_1}, ..., t_{u_n}, with weights w_{u_1}, ..., w_{u_n}, then its representation in the word space is given by the point

    p_k = (1 / sqrt(Σ_{i=1}^{n} w_{u_i}²)) (…, w_{u_1}, …, w_{u_2}, …, w_{u_n}, …)

in the T-dimensional term space. Note that all points are normalized, so they are in reality points in the manifold S^{T−1} (the unit sphere in R^T). A context is represented by a set of these n-gram representations, what we call a point cloud representation. That is, a context is a finite set of points C ⊆ S^{T−1}.

In the word space, we lay a self-organizing map, using a modification of WEBSOM [3]. The map is a grid of elements called neurons, each one of which is a point in the word space and is identified by two integer indices; that is, a neuron is given as

    [µν] = ([µν]_1, ..., [µν]_T)'    1 ≤ µ ≤ N, 1 ≤ ν ≤ M    (1)

The map is discrete and two-dimensional¹, with the 4-neighborhood topology. The neurons are immersed in the T-dimensional word space ([µν] ∈ R^T), and in our version of WEBSOM their weights are normalized. That is, throughout the algorithm we enforce Σ_{t=1}^{T} [µν]_t² = 1 for all map indices µν. The neurons are at the same time elements of the discrete grid and points in S^{T−1}. As points in the discrete grid, the relevant distance between two neurons is their graph distance:

    δ([ζξ], [µν]) = |ζ − µ| + |ξ − ν|    (2)

As points in S^{T−1}, it is possible to determine the similarity between a neuron and any other point p ∈ S^{T−1} as

    s([µν], p) = Σ_{i=1}^{T} [µν]_i p_i    (3)

¹The map can be k-dimensional; however, maps with k > 4 are rarely used because of the combinatorial explosion in the number of neurons, which would lead to over-fitting most training sets. We have used mostly maps with k = 2, and we shall consider only this case here, mainly because it avoids notational complexities.

On this map we define a neighborhood function h(t, n), which depends on two parameters t, n ∈ N; n is the graph distance between two given neurons, and t is a time parameter that increases as learning proceeds. The function h(t, n) represents the "degree of neighborhood-ness" of two neurons at a distance n at time t. We assume that 0 ≤ h(t, n) ≤ 1, that h(t, 0) = 1, and that h is monotonically decreasing in n and t. The degree to which neuron [ζξ] belongs to the neighborhood of neuron [µν] at time t is given by h(t, δ([ζξ], [µν])). In addition to the neighborhood, we define a positive learning parameter α(t), t ∈ N, monotonically decreasing with t.

In order to create the map for a context C, all the points in it are presented to the map, and the training algorithm is applied. We call the presentation of a point p ∈ C an event of learning, and the presentation of all the points of C an epoch. Learning consists of a number of epochs, counted by a counter t. The neurons of the map are at first spread randomly in the word space; then, for each event consisting of the presentation of the point p, the neuron with the highest similarity to p is found:

    [∗] = arg max_{[µν]} s(p, [µν])    (4)

The neuron [∗] and all its neighbors are shifted towards p. The amount of this shift depends on the learning parameter α and on the distance from [∗] on the map:

    ∀[µν]  [µν] ← [µν] + α(t) h(t, δ([∗], [µν])) · (p − [µν])    (5)

Finally, all neurons are re-normalized in order to maintain them on the sphere S^{T−1}:

    ∀[µν]  [µν] ← [µν] / sqrt(Σ_i [µν]_i²)    (6)

4. NOVELTY IN NEWS FILTERING
If we don't consider novelty, we can filter the news items as they arrive simply by representing each of them as a point u ∈ S^{T−1} (considering it as a document and using the same information retrieval techniques that we have used to represent the context), and determining the maximal similarity between u and the neurons of the map:

    r = max_{[µν]} s([µν], u)    (7)

The items that yield a value of r above a certain threshold are considered interesting and presented to the user. As we have mentioned in the introduction, the problem with this simple solution is that very similar news items can all have a high score because they are similar to the same neuron (that is, to the same part of the user context), although the presentation of more than one adds little to the information content of the first. To increase the diversity of the results, we rewrite (7) adding an interest factor w_{µν} to each neuron, and computing the relevance of the news item u as

    r = max_{[µν]} w_{µν} s([µν], u)    (8)

Let us call [∗] the "winning" neuron for a given item (the neuron closest to it, viz. the neuron for which the maximum of eq. (8) is attained). Whenever an item is recommended, the interest factor of each neuron is updated according to the equation

    w_{µν} ← λ w_{µν}                 if [µν] = [∗]
    w_{µν} ← min{1, (1 + β) w_{µν}}   otherwise    (9)

with 0 < λ < 1 and β > 0. According to this equation, the interest of the neuron closest to a recommended news item is reduced by a factor λ, reducing its contribution to (8) and therefore making it less likely to win again. This means that further news items close to the one that caused this neuron to win will be less likely to be recommended. If the neuron is no longer the winner for any item, its interest will be increased by a factor (1 + β) with each recommended item, until the value 1 is restored (or until the neuron wins again, in which case its interest factor will again be reduced). If the neuron wins only once, its relaxation time (the number of items necessary for it to go back to 1) is

    t_r = − log λ / log(1 + β)    (10)

If we want to achieve effective diversity, this time should be of the order of magnitude of the number of neurons in the network, but not greater, since if neurons win too often their interest factor will decrease, on average, as a function of time. A reasonable choice is to set t_r ≈ K/10, where K is the number of neurons. Expressing, say, β as a function of λ and t_r, we obtain

    β = (1/λ)^{1/t_r} − 1    (11)

so, for the desired recovery span, we have

    β = (1/λ)^{10/K} − 1    (12)

In this schema, we only reduce the interest factor of the neuron [∗]. In our tests, we also tried another schema, in which we reduce (by a progressively smaller amount) the interest factor of neurons in the neighborhood of the winner, using the same function h that we have used in the training algorithm. In this case, the update of the interest factor is done with the function

    w_{µν} ← λ h(0, δ([µν], [∗])) w_{µν}   if δ([µν], [∗]) < τ
    w_{µν} ← min{1, (1 + β) w_{µν}}        otherwise    (13)

where h is the neighborhood function and τ is a suitable threshold (in our experiments, we set τ = 2).

5. TESTS
We consider two measures. The first one is the standard precision (the fraction of recommended documents that are relevant). Given a context and a set of recommendations, we expect that the use of novelty will somewhat decrease precision (if there are two very relevant documents that are almost identical, a system that considers novelty will recommend only one of them).
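To make the interest-factor dynamics of the previous section concrete, the update rule (9) and the relaxation time (10) can be sketched in a few lines of Python (our own illustration rather than the authors' implementation; the values λ = 0.1 and β = 0.5 are arbitrary):

```python
import math

def update_interest(w, winner, lam=0.1, beta=0.5):
    # Eq. (9): the winning neuron is damped by a factor lambda;
    # every other neuron recovers by a factor (1 + beta), capped at 1.
    return [lam * wi if i == winner else min(1.0, (1.0 + beta) * wi)
            for i, wi in enumerate(w)]

def relaxation_time(lam=0.1, beta=0.5):
    # Eq. (10): number of items needed for a neuron that won once
    # to recover an interest factor of 1.
    return -math.log(lam) / math.log(1.0 + beta)

# Neuron 0 wins once; a different neuron wins all following items.
w = update_interest([1.0, 1.0, 1.0], winner=0)
steps = 0
while w[0] < 1.0:
    w = update_interest(w, winner=2)
    steps += 1
```

With λ = 0.1 and β = 0.5, eq. (10) gives t_r = −log 0.1 / log 1.5 ≈ 5.7, so the damped neuron is back to full interest after six further recommendations, in agreement with the simulation.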
Precision, then, is a sort of quality control: we do accept a certain reduction of precision, but we want to keep it under control, to avoid losing too much result quality. The second measure is a determination of novelty. We measure, for a given number of recommended items, the number of neurons that win. If D is the number of documents recommended and, during the recommendation of these documents, n_w different neurons are winners (i.e., they are the neurons that achieve the maximum in eq. (8)), then the coverage is defined as V = n_w/D (0 < V ≤ 1). Note that n_w < K, so that if the documents arrive in a stream (D → ∞), in the long run this value will tend to zero. In order to obtain a significant value, we always consider sets of recommended documents smaller than the number of neurons in the network.

Figure 3: Coverage vs. number of recommended items (simple weighting).

We began by creating a context. To this end, we used 5% of the news items contained in the sub-categories (of MCAT) M11 (EQUITY MARKETS) and M13 (MONEY MARKETS), for a total of about 5000 documents. Once the context is in place, we create a stream of news with all the elements in the Reuters data base and use our method for recommending items that are close to the context. As a ground truth, we consider as relevant all the items that are in the context categories (M11 and M13) and in their parent category (MCAT), but not those that are in the sibling categories of M11 and M13. We stop the stream when 500 items have been recommended (this is necessary in order to get significant coverage values, as we mentioned before). Each item is analyzed and represented as a point u ∈ S^{T−1} (see section 4).
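As an illustrative aside, this representation can be sketched as follows (a simplified sketch, not the authors' code: plain term-frequency weights over a toy vocabulary, normalized onto the unit sphere and compared via the dot product of eq. (3)):

```python
import math
from collections import Counter

def to_unit_point(tokens, vocab):
    # Simple weighting: raw term frequencies over a fixed vocabulary,
    # normalized to unit length so the item lies on the sphere S^(T-1).
    tf = Counter(t for t in tokens if t in vocab)
    v = [float(tf[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def similarity(neuron, point):
    # Eq. (3): dot product, i.e. cosine similarity for unit vectors.
    return sum(a * b for a, b in zip(neuron, point))

vocab = ["market", "equity", "bond", "election"]  # hypothetical vocabulary
u = to_unit_point("equity market news equity".split(), vocab)
```

Here "equity" appears twice and "market" once, so u = (1/√5, 2/√5, 0, 0); any unit point has similarity 1 with itself, and out-of-vocabulary tokens ("news") are ignored.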
The weights of the terms that compose each item are determined according to two different schemas: simple weighting (without an inverse document frequency (idf) term: the weight of a word is calculated using the normalized term frequency (tf)), and tf-idf (the normalized frequency of a word is divided by the logarithm of its frequency in the British National Corpus (BNC) word frequency list²).

²Available at http://www.natcorp.ox.ac.uk/

Figures 2–5 show the results of our tests. Each figure contains three curves: the first is obtained by filtering the items without taking novelty into account (keeping the interest factor of each neuron always equal to 1); for the second, we used the winner-only adaptation (9), with λ = 0.1 and a relaxation time long enough that once a neuron has won, its interest factor will remain low for the duration of the test; the third is obtained using the neighborhood adaptation (13), with τ = 2. The values of λ and τ are not critical, and some preliminary tests on limited data sets allowed us to select reasonable values.

To validate the proposed approach, we conducted our experiments using 2.5 gigabytes of uncompressed news stories from Reuters Corpus Volume 1 (RCV1-v1) [4]. The Reuters corpus is a collection of 806,791 news stories, each one associated with some metadata. In our experiments, we used the text and the topic categories each news document belongs to. The RCV1-v1 test collection contains 117 topics in a three-level hierarchy with, at the top, four categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Market). Fourteen of these categories contain no document, and in our experiments we used only the 103 that do.

Figures 2 and 3 show the precision and the coverage V for filtering done with simple weighting (without an idf term). The precision drops only marginally when considering novelty, but the number of neurons that "win" is increased by more than an order of magnitude, confirming that the results are relative to a broader portion of the user context.

Figure 2: Precision vs. number of recommended items (simple weighting).

Figures 4 and 5 show the same results using a weighting scheme with the idf term for the news items. The results are qualitatively the same, although we observe a general drop in precision. We have not investigated the causes of this drop, which we suspect are related to the small number of words in the items. Note that in the range 200–500 results, the system with novelty actually achieves a better precision than the simple system. We haven't thoroughly investigated this phenomenon, but it is possible that the simple system gets "stuck" in groups of similar documents from other categories with a similar (and relatively high) value of measured relevance. Introducing novelty in this case would reduce the presence of these groups.

Figure 4: Precision vs. number of recommended items (tf-idf weighting).

Figure 5: Coverage vs. number of recommended items (tf-idf weighting).

We must also remark that the MCAT category is quite difficult to filter, as many of the significant terms that appear in it are part of the economics jargon, and appear in other categories as well, such as ECAT. This explains the relatively low values of precision that we obtain, independently of the use of the method for increasing novelty. This might also explain why in some cases the novelty system attains a higher precision than the one based only on relevance: there are probably parts of the context more ambiguous than others, since they are sensitive to words that appear in categories other than MCAT. With the novelty system, once an item has activated these parts, they become "desensitized," and cease to produce irrelevant results.

6. CONCLUSION
In this paper, we have presented a method for increasing the novelty of a recommendation system that receives a stream, as opposed to the session-based recommender systems common in the literature. We have shown that a model of the user context can be useful to increase the novelty of the recommended results without having to present results irrelevant for the user. The fact that we receive a potentially infinite stream of items poses new challenges for novelty-increasing systems, as redundancy, which is taken to be a constant in session-based systems, "fades away" in the case of a stream: a result very similar to one already received will be considered redundant in the short term, but may become interesting again in the long term. To deal with this problem, we have introduced an interest factor in our model that is reduced when a certain area of the context is activated but that, if the area is no longer activated, recovers in a time that can be controlled by design.

7. REFERENCES
[1] J. Allan et al. Challenges in information retrieval and language modeling: report of a workshop held at the Center for Intelligent Information Retrieval. SIGIR Forum, 37(1):31–47, Apr. 2003.
[2] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 482–490, New York, NY, USA, 2004. ACM.
[3] S. Kaski. Computationally efficient approximation of a probabilistic model for document representation in the WEBSOM full-text analysis method. Neural Processing Letters, 5(2), 1997.
[4] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397, Dec. 2004.
[5] X. Li and B. W. Croft. An information-pattern-based approach to novelty detection. Information Processing & Management, 44(3):1159–1188, May 2008.
[6] G. Pasi. Contextual search: Issues and challenges. In A. Holzinger and K.-M. Simonic, editors, Information Quality in e-Health, volume 7058 of Lecture Notes in Computer Science, pages 23–30. Springer Berlin / Heidelberg, 2011. DOI 10.1007/978-3-642-25364-5_3.
[7] S. Santini and A. Dumitrescu. Context as a non-ontological determinant of semantics. In Proceedings of the 3rd International Conference on Semantics and Digital Media Technologies, 2008.
[8] R. Song, Z. Luo, J.-Y. Nie, Y. Yu, and H.-W. Hon. Identification of ambiguous queries in web search. Information Processing & Management, 45(2):216–229, 2009.