Improving Novelty in Streaming Recommendation Using a Context Model∗

Doina Alexandra Dumitrescu, Simone Santini
Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
{doina.dumitrescu, simone.santini}@uam.es

ABSTRACT
In recent years there has been an increasing research interest in novelty/diversity detection in Information Retrieval and in Recommendation Systems. We propose a model that increases the novelty of recommendations using a context user profile that was created automatically using self-organizing maps. Our system was evaluated on the Reuters Corpus Volume 1, and our experiments show that filtering the recommended items using a novelty score derived from the context-based user profile provides better search results in terms of the novel information that is presented to the user.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Metrics, Experimentation

Keywords
Novelty, Diversity, User Model, Context

1. INTRODUCTION
The plethora of information available on the internet about virtually anything, and the duplication of this information due to the significant replication of the sources, has given great relevance to the issues of novelty and diversity in retrieval and recommendation systems [2, 5]. The definition of novelty and diversity generally accepted in information retrieval is that diversity helps the server deal with query ambiguity, while novelty helps the user deal with query underspecification [8]. Consider a query composed of the word manhattan. The query is ambiguous because the server doesn't know whether the user wants to know something about the neighborhood in New York, about the cocktail, or about the Indian tribe: the query has several possible interpretations, and whoever formulated the query is, in general, interested in only one of them. Without further specification, the safest bet for the server is to provide a diverse result set, that is, one that covers all these topics, possibly in an amount proportional to the estimated a priori user interest (one can assume, for instance, that more people are interested in New York than in the Indian tribe). Even if the query is not ambiguous and the results are about what the user really meant (say, the neighborhood in New York), the user would want every document to be novel, that is, to provide information that is not present in other documents. A highly informative text about the history and the human landscape of Manhattan is probably a very good first result, but another document almost identical to the first is not an equally desirable second result: the information that it contains has already been seen. That is, every interpretation of a query contains several aspects, and the person who formulated it will be interested, to some degree, in all of them.

These considerations highlight an important difference between diversity and novelty. Diversity is something that the user wouldn't really want: the person interested in the manhattan cocktail would be very happy to receive results only about the cocktail. Diversity is needed mainly by the server, which doesn't know enough about the user's true intentions. Novelty, on the other hand, is something the user wants, since she will be interested in all the aspects of her query. Novelty is a way to avoid redundancy, while diversity is a way to deal with the server's ignorance.

If we have access to a model of the user's interests, then it is possible to reduce ambiguity and, therefore, to reduce the need for diversity, concentrating only on increasing novelty.
∗This work was supported by the Ministerio de Educación y Ciencia under grant N. TIN2011-28538-C02, Novelty, diversity, context and time: new dimensions in next-generation information retrieval and recommender systems.

CARS-2012, September 9, 2012, Dublin, Ireland.
Copyright is held by the author/owner(s).

We present a method for increasing novelty that relies on a context model built on texts previously seen by the user. The scenario that we use here to exemplify the working of our method is that of the recommendation of news, and the context is composed of past news that the user has considered interesting. This is just one possible way in which context can be gathered. Our model is very general and can be used to represent text or multimedia data. In past work, we have used, for example, the contents of the user's hard disk to represent his main interests [7]. In this paper we show and evaluate a modification to our model with the purpose of increasing the context coverage of the recommendations, that is, the fraction of the user context that the recommended items will cover, trying to avoid items that refer to parts of the context already covered by previously seen items.

2. STREAM RECOMMENDATION
One important distinction between this work and others on novelty and diversity in recommendation is that we are dealing with a stream of news, that is, with a continuous, potentially infinite, interaction, while typical recommendation systems work based on sessions: the user chooses an item or requests a recommendation, and a finite set of results is shown. This operation may be repeated several times, as the user selects several items and receives correspondingly changing recommendations. No work, to the best of our knowledge, has considered novelty in a session-less situation, in which a countable stream of items arrives and a system has to decide whether each item is relevant for the user at that moment.

Note that, differently from the case of a closed session, we don't have access to the whole data base of items that can be recommended. We don't even have access to the items that have already been shown, since in a stream their number is unbounded. All we can do, at any given time, is take a decision considering only how the past history of items has modified the user context. The system is shown schematically in figure 1. We have a stream of items (news, in our test scenario) that arrive continuously to the user. The user is described by the context model C (the details of this representation will be given in the next section), and each news item is compared with the context and assigned a score. Whenever the score is above a certain threshold (or a predefined criterion is met), the item is presented to the user. At the same time, the context undergoes a first kind of change, which we call a short term change. This change (used to improve the novelty of the recommended items) is characterized by a time constant β. After a time, depending on β, the change provoked by any individual item disappears.

This is one important difference between our model and the session based ones. Let us assume that a person is interested in politics. News about the election of the new French president will undoubtedly be of interest to her. However, she will probably not be interested in receiving many similar news items about the election one after the other. The effect of the short term change will be to "deactivate", for a time, the part of the context that deals with this kind of news, so that further news about the French elections will not, cœteris paribus, be considered as interesting. However, this situation should not be permanent: after a while (a few hours, a day, ...), further news about the French president will again be considered interesting, and should be recommended.

Figure 1: Schema of our stream recommendation system.

In addition to this short term change, the context evolves in a permanent and irreversible way, again under the influence of the documents read (as well as, possibly, based on other elements such as the documents that the user creates on her computer, or the emails that she writes). A number of strategies can be used for this evolution: the user can be asked to mark certain documents of special interest, the system can present only a summary and permanently enter into the context all the documents whose complete text is accessed, etc. We have considered the evolution of the context based on documents of interest elsewhere [7], and we shall not consider it here. From the point of view of this paper, therefore, the context, defined as in the following section, is fixed, and only the short term changes, needed to increase novelty, are considered.

3. THE USER MODEL
There is convincing evidence that a user model can help attain a higher precision in information retrieval and recommendation systems [1, 6]. In this paper, we are interested in using such a model to achieve novelty, that is, to obtain recommendations that span the whole gamut of the user's interests. The context model that we use was presented (without the extension to enforce novelty that we present here) in [7], to which the reader is referred for details; here we shall only give a brief description. The basis of the model is a set of documents, which can be composed of the documents on which the user is working at the time or, in our case, of news items marked in the past. The model doesn't change depending on the source of the documents. In the following, in order to make the tests self-contained, we shall always assume that the context is based on a set of news items that supposedly the user has seen in the past and has found relevant.

All the documents in the context are considered as a single "macro-document" and processed accordingly. On this macro-document we perform stop-word removal and stemming. Each term t is assigned a weight w_t using tf-idf weighting [1]. We use the standard vector model representation of information retrieval, in which each different term corresponds to an axis in the word space. We consider n-grams, constituted of n consecutive terms that appear in the document. If the n-gram p_k is composed of the terms t_{u_1}, ..., t_{u_n}, with weights w_{u_1}, ..., w_{u_n}, then its representation in the word space is given by the point

    p_k = (1 / sqrt(Σ_{i=1}^{n} w_{u_i}²)) (…, w_{u_1}, …, w_{u_2}, …, w_{u_n}, …)

in the T-dimensional term space. Note that all points are normalized, so they are in reality points in the manifold S^{T−1} (the unit sphere in R^T). A context is represented by a set of these n-gram representations, what we call a point cloud representation. That is, a context is a finite set of points C ⊆ S^{T−1}.

In the word space, we lay a self-organizing map, using a modification of WEBSOM [3]. The map is a grid of elements called neurons, each one of which is a point in the word space and is identified by two integer indices; that is, a neuron is given as

    [µν] = ([µν]_1, ..., [µν]_T)'    1 ≤ µ ≤ N, 1 ≤ ν ≤ M    (1)

The map is discrete and two-dimensional¹, with the 4-neighborhood topology. The neurons are immersed in the T-dimensional word space ([µν] ∈ R^T), and in our version of WEBSOM their weights are normalized. That is, throughout the algorithm we enforce Σ_{t=1}^{T} [µν]_t² = 1 for all map indices µν. The neurons are at the same time elements of the discrete grid and points in S^{T−1}. As points in the discrete grid, the relevant distance between two neurons is their graph distance:

    δ([ζξ], [µν]) = |ζ − µ| + |ξ − ν|    (2)

As points in S^{T−1}, it is possible to determine the similarity between a neuron and any other point p ∈ S^{T−1} as

    s([µν], p) = Σ_{i=1}^{T} [µν]_i p_i    (3)

¹The map can be k-dimensional; however, maps with k > 4 are rarely used because of the combinatorial explosion in the number of neurons, which would lead to over-fitting most training sets. We have used mostly maps with k = 2, and we shall consider only this case here, mainly because it avoids notational complexities.

On this map we define a neighborhood function h(t, n), which depends on two parameters t, n ∈ N; n is the graph distance between two given neurons, and t is a time parameter that increases as learning proceeds. The function h(t, n) represents the "degree of neighborhood-ness" of two neurons at a distance n at time t. We assume that 0 ≤ h(t, n) ≤ 1, that h(t, 0) = 1, and that h is monotonically decreasing in n and t. The degree to which neuron [ζξ] belongs to the neighborhood of neuron [µν] at time t is given by h(t, δ([ζξ], [µν])). In addition to the neighborhood, we define a positive learning parameter α(t), t ∈ N, monotonically decreasing with t.

In order to create the map for a context C, all the points in it are presented to the map, and the training algorithm is applied. We call the presentation of a point p ∈ C an event of learning, and the presentation of all the points of C an epoch. Learning consists of a number of epochs, counted by a counter t. The neurons of the map are at first spread randomly in the word space; then, for each event consisting of the presentation of the point p, the neuron with the highest similarity to p is found:

    [∗] = arg max_{[µν]} s(p, [µν])    (4)

The neuron [∗] and all its neighbors are shifted towards p. The amount of this shift depends on the learning parameter α and on the distance from [∗] on the map:

    ∀[µν]  [µν] ← [µν] + α(t) h(t, δ([∗], [µν])) · (p − [µν])    (5)

Finally, all neurons are re-normalized in order to maintain them on the sphere S^{T−1}:

    ∀[µν]  [µν] ← [µν] / sqrt(Σ_i [µν]_i²)    (6)

4. NOVELTY IN NEWS FILTERING
If we don't consider novelty, we can filter the news items as they arrive simply by representing each of them as a point u ∈ S^{T−1} (considering it as a document and using the same information retrieval techniques that we have used to represent the context), and determining the maximal similarity between u and the neurons of the map:

    r = max_{[µν]} s([µν], u)    (7)

The items that yield a value of r above a certain threshold are considered interesting and presented to the user. As we have mentioned in the introduction, the problem with this simple solution is that very similar news items can all have a high score because they are similar to the same neuron (that is, to the same part of the user context), although the presentation of more than one adds little to the information content of the first. To increase the diversity of the results, we rewrite (7) adding an interest factor w_{µν} to each neuron, and computing the relevance of the news item u as

    r = max_{[µν]} w_{µν} s([µν], u)    (8)

Let us call [∗] the "winning" neuron for a given item (the neuron closest to it, viz. the neuron for which the maximum of eq. (8) is attained). Whenever an item is recommended, the interest factor of each neuron is updated according to the equation

    w_{µν} ← λ w_{µν}                 if [µν] = [∗]
    w_{µν} ← min{1, (1 + β) w_{µν}}   otherwise    (9)

with 0 < λ < 1 and β > 0. According to this equation, the interest of the neuron closest to a recommended news item is reduced by a factor λ, reducing its contribution to (8) and therefore making it less likely to win again. This means that further news items close to the one that caused this neuron to win will be less likely to be recommended. If the neuron is no longer the winner for any item, its interest will be increased by a factor (1 + β) with each recommended item, until the value 1 is restored (or until the neuron wins again, in which case its interest factor will again be reduced). If the neuron wins only once, its relaxation time (the number of items necessary for it to go back to 1) is

    t_r = − log λ / log(1 + β)    (10)

If we want to achieve effective diversity, this time should be of the order of magnitude of the number of neurons in the network, but not greater, since if neurons win too often their interest factor will decrease, on average, as a function of time. A reasonable choice is to set t_r ≈ K/10, where K is the number of neurons. Expressing, say, β as a function of λ and t_r, we obtain

    β = (1/λ)^{1/t_r} − 1    (11)

so, for the desired recovery span, we have

    β = (1/λ)^{10/K} − 1    (12)

In this schema, we only reduce the interest factor of the neuron [∗]. In our tests, we also tried another schema, in which we reduce (by a progressively smaller amount) the interest factor of neurons in the neighborhood of the winner, using the same function h that we have used in the training algorithm. In this case, the update of the interest factor is done with the function

    w_{µν} ← λ h(0, δ([µν], [∗])) w_{µν}   if δ([µν], [∗]) < τ
    w_{µν} ← min{1, (1 + β) w_{µν}}        otherwise    (13)

where h is the neighborhood function and τ is a suitable threshold (in our experiments, we set τ = 2).

5. TESTS
We consider two measures. The first one is the standard precision (the fraction of recommended documents that are relevant). Given a context and a set of recommendations, we expect that the use of novelty will somewhat decrease precision (if there are two very relevant documents that are almost identical, a system that considers novelty will recommend only one of them).
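To make the interest-factor dynamics of the previous section concrete, the update rule (9) and the relaxation time (10) can be sketched in a few lines of Python (our own illustration rather than the authors' implementation; the values λ = 0.1 and β = 0.5 are arbitrary):

```python
import math

def update_interest(w, winner, lam=0.1, beta=0.5):
    # Eq. (9): the winning neuron is damped by a factor lambda;
    # every other neuron recovers by a factor (1 + beta), capped at 1.
    return [lam * wi if i == winner else min(1.0, (1.0 + beta) * wi)
            for i, wi in enumerate(w)]

def relaxation_time(lam=0.1, beta=0.5):
    # Eq. (10): number of items needed for a neuron that won once
    # to recover an interest factor of 1.
    return -math.log(lam) / math.log(1.0 + beta)

# Neuron 0 wins once; a different neuron wins all following items.
w = update_interest([1.0, 1.0, 1.0], winner=0)
steps = 0
while w[0] < 1.0:
    w = update_interest(w, winner=2)
    steps += 1
```

With λ = 0.1 and β = 0.5, eq. (10) gives t_r = −log 0.1 / log 1.5 ≈ 5.7, so the damped neuron is back to full interest after six further recommendations, in agreement with the simulation.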
Precision, then, is a sort of quality control: we do accept a certain reduction of precision, but we want to keep it under control, to avoid losing too much result quality. The second measure is a determination of novelty. We measure, for a given number of recommended items, the number of neurons that win. If D is the number of documents recommended and, during the recommendation of these documents, n_w different neurons are winners (i.e., they are the neurons that achieve the maximum in eq. (8)), then the coverage is defined as V = n_w/D (0 < V ≤ 1). Note that n_w < K, so that if the documents arrive in a stream (D → ∞), in the long run this value will tend to zero. In order to obtain a significant value, we always consider sets of recommended documents smaller than the number of neurons in the network.

Figure 3: Coverage vs. number of recommended items (simple weighting).

We began by creating a context. To this end, we used 5% of the news items contained in the sub-categories (of MCAT) M11 (EQUITY MARKETS) and M13 (MONEY MARKETS), for a total of about 5000 documents. Once the context is in place, we create a stream of news with all the elements in the Reuters data base and use our method for recommending items that are close to the context. As a ground truth, we consider as relevant all the items that are in the context categories (M11 and M13) and in their parent category (MCAT), but not those that are in the sibling categories of M11 and M13. We stop the stream when 500 items have been recommended (this is necessary in order to get significant coverage values, as we mentioned before). Each item is analyzed and represented as a point u ∈ S^{T−1} (see section 4).
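As an illustrative aside, this representation can be sketched as follows (a simplified sketch, not the authors' code: plain term-frequency weights over a toy vocabulary, normalized onto the unit sphere and compared via the dot product of eq. (3)):

```python
import math
from collections import Counter

def to_unit_point(tokens, vocab):
    # Simple weighting: raw term frequencies over a fixed vocabulary,
    # normalized to unit length so the item lies on the sphere S^(T-1).
    tf = Counter(t for t in tokens if t in vocab)
    v = [float(tf[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def similarity(neuron, point):
    # Eq. (3): dot product, i.e. cosine similarity for unit vectors.
    return sum(a * b for a, b in zip(neuron, point))

vocab = ["market", "equity", "bond", "election"]  # hypothetical vocabulary
u = to_unit_point("equity market news equity".split(), vocab)
```

Here "equity" appears twice and "market" once, so u = (1/√5, 2/√5, 0, 0); any unit point has similarity 1 with itself, and out-of-vocabulary tokens ("news") are ignored.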
The weights of the terms that compose each item are determined according to two different schemas: simple weighting (without an inverse document frequency (idf) term: the weight of a word is calculated using the normalized term frequency (tf)), and tf-idf (the normalized frequency of a word is divided by the logarithm of its frequency in the British National Corpus (BNC) word frequency list²).

²Available at http://www.natcorp.ox.ac.uk/

Figures 2–5 show the results of our tests. Each figure contains three curves: the first is obtained by filtering the items without taking novelty into account (keeping the interest factor of each neuron always equal to 1); for the second, we used the winner-only adaptation (9), with λ = 0.1 and a relaxation time long enough that once a neuron has won, its interest factor will remain low for the duration of the test; the third is obtained using the neighborhood adaptation (13), with τ = 2. The values of λ and τ are not critical, and some preliminary tests on limited data sets allowed us to select reasonable values.

To validate the proposed approach, we conducted our experiments using 2.5 gigabytes of uncompressed news stories from Reuters Corpus Volume 1 (RCV1-v1) [4]. The Reuters corpus is a collection of 806,791 news stories, each one associated with some metadata. In our experiments, we used the text and the topic categories each news document belongs to. The RCV1-v1 test collection contains 117 topics in a three-level hierarchy with, at the top, four categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Market). Fourteen of these categories contain no document, and in our experiments we used only the 103 that do.

Figures 2 and 3 show the precision and the coverage V for filtering done with simple weighting (without an idf term). The precision drops only marginally when considering novelty, but the number of neurons that "win" is increased by more than an order of magnitude, confirming that the results are relative to a broader portion of the user context.

Figure 2: Precision vs. number of recommended items (simple weighting).

Figures 4 and 5 show the same results using a weighting scheme with the idf term for the news items. The results are qualitatively the same, although we observe a general drop in precision. We have not investigated the causes of this drop, which we suspect are related to the small number of words in the items. Note that in the range 200–500 results, the system with novelty actually achieves a better precision than the simple system. We haven't thoroughly investigated this phenomenon, but it is possible that the simple system gets "stuck" in groups of similar documents from other categories with a similar (and relatively high) value of measured relevance. Introducing novelty in this case would reduce the presence of these groups.

Figure 4: Precision vs. number of recommended items (tf-idf weighting).

Figure 5: Coverage vs. number of recommended items (tf-idf weighting).

We must also remark that the MCAT category is quite difficult to filter, as many of the significant terms that appear in it are part of the economics jargon, and appear in other categories as well, such as ECAT. This explains the relatively low values of precision that we obtain, independently of the use of the method for increasing novelty. This might also explain why in some cases the novelty system attains a higher precision than the one based only on relevance: there are probably parts of the context more ambiguous than others, since they are sensitive to words that appear in categories other than MCAT. With the novelty system, once an item has activated these parts, they become "desensitized," and cease to produce irrelevant results.

6. CONCLUSION
In this paper, we have presented a method for increasing the novelty of a recommendation system that receives a stream, as opposed to the session-based recommender systems common in the literature. We have shown that a model of the user context can be useful to increase the novelty of the recommended results without having to present results irrelevant for the user. The fact that we receive a potentially infinite stream of items poses new challenges for novelty-increasing systems, as redundancy, which is taken to be a constant in session-based systems, "fades away" in the case of a stream: a result very similar to one already received will be considered redundant in the short term, but may become interesting again in the long term. To deal with this problem, we have introduced an interest factor in our model that is reduced when a certain area of the context is activated but that, if the area is no longer activated, recovers in a time that can be controlled by design.

7. REFERENCES
[1] J. Allan et al. Challenges in information retrieval and language modeling: report of a workshop held at the Center for Intelligent Information Retrieval. SIGIR Forum, 37(1):31–47, Apr. 2003.
[2] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 482–490, New York, NY, USA, 2004. ACM.
[3] S. Kaski. Computationally efficient approximation of a probabilistic model for document representation in the WEBSOM full-text analysis method. Neural Processing Letters, 5(2), 1997.
[4] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397, Dec. 2004.
[5] X. Li and B. W. Croft. An information-pattern-based approach to novelty detection. Information Processing & Management, 44(3):1159–1188, May 2008.
[6] G. Pasi. Contextual search: Issues and challenges. In A. Holzinger and K.-M. Simonic, editors, Information Quality in e-Health, volume 7058 of Lecture Notes in Computer Science, pages 23–30. Springer Berlin / Heidelberg, 2011. DOI 10.1007/978-3-642-25364-5_3.
[7] S. Santini and A. Dumitrescu. Context as a non-ontological determinant of semantics. In Proceedings of the 3rd International Conference on Semantics and Digital Media Technologies, 2008.
[8] R. Song, Z. Luo, J.-Y. Nie, Y. Yu, and H.-W. Hon. Identification of ambiguous queries in web search. Information Processing & Management, 45(2):216–229, 2009.