A Signal-Based Approach to News Recommendation

Sirian Caldarelli

sirian.caldarelli@gmail.com

Davide Feltoni Gurini

feltoni@dia.uniroma3.it

Alessandro Micarelli

micarel@dia.uniroma3.it

Giuseppe Sansonetti

Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy

2016

In this paper, we describe our research activity on an approach to personalized news recommendation that captures the temporal dynamics of the active user's interests. In such a recommender, the user profile explicitly involves the time dimension in representing her interests and preferences. Each user interest is represented as a signal, thus characterizing its evolution over time. To this end, a signal processing technique (i.e., the discrete wavelet transform) is adopted to represent and analyze such signals. Furthermore, we report the experimental results of a very preliminary comparative evaluation on a dataset available online. Such results seem encouraging, thus spurring us to continue developing our approach.

Keywords: News recommendation; user profiling; temporal dynamics

1. INTRODUCTION

With the development of electronics and Internet technologies, the amount of information available online has been constantly increasing. In such a scenario, users are overwhelmed and increasingly feel the need to be guided in selecting the information to pay attention to. News recommenders are a possible solution, since they help users find the information of possible interest to them. In order to provide personalized suggestions, such systems rely on a representation of the target user's interests and preferences. A vast number of user profiling techniques have been proposed and thoroughly evaluated [7]. However, representing how users' interests evolve over time remains a difficult challenge. In this paper, we apply an approach to user profiling, called bag-of-signals [2], whose aim is to represent the diversity and the time-dependent, evolving nature of users' interests. Based on this approach, we built a recommender system for news articles.

In order to assess its performance, we performed a very preliminary offline evaluation as follows. Starting from a public database, we built user profiles by extracting users' interests from the news articles linked to the content they generated on social media. More specifically, we examined users' timelines on Twitter, considering all the tweets and the related news articles in the entire observation period. Then, we extracted users' interests as concepts (e.g., topics) from those news articles and represented their evolution over time as signals. For analyzing and comparing such signals, we made use of a signal processing tool that characterizes the frequency content of any signal, along with its accurate location in the time domain. A comparative evaluation against a classic approach that completely ignores the time dependence of users' interests revealed the benefits of the proposed news recommender.

2. BAG-OF-SIGNALS MODEL

The representation of users' interests as signals requires some definitions. We define the pseudo-document related to a user $u \in U$ (with $U$ the set of all the users) and an observation period $T$ as the set of all the news articles mentioned by $u$ in the period $T$:

$$\mathit{PseudoDoc}(u, T) = \{ \mathit{news} \mid \mathit{user}(\mathit{news}) = u,\ \mathit{date}(\mathit{news}) \in T \}$$

The notation $\mathit{user}(\mathit{news}) = u$ means that the user $u$ has mentioned that particular news article, while $\mathit{date}(\mathit{news}) \in T$ means that $u$ has mentioned it in the period $T$. An extension of the bag-of-words representation, well known in Information Retrieval, is the bag-of-concepts model, where concepts instead of keywords are extracted from pseudo-documents. Concepts are entities that are semantically more significant than simple keywords. We define the bag-of-concepts user model as the following set of weighted concepts:

$$P_{BoC}(u) = \{ (c, w(u, c)) \mid c \in C,\ u \in U \}$$

where the function $w(u, c)$ gives the weight of the concept $c \in C$ for the user $u \in U$ (with $C$ and $U$ the sets of concepts and users, respectively). Then, we define the pseudo-fragment related to a user $u \in U$ in an interval $t \in T$ as the set of all the news articles mentioned by $u$ in the interval $t$:

$$\mathit{PseudoFrag}(u, t) = \{ \mathit{news} \mid \mathit{user}(\mathit{news}) = u,\ \mathit{date}(\mathit{news}) \in t \}$$

By analyzing a single pseudo-fragment related to an interval $t$, it is possible to determine the signal components for the concepts in the text fragment. A signal component $f_{u,c,t}$ related to a user $u \in U$, a concept $c \in C$, and an interval $t \in T$ is determined by the number of times the concept $c$ occurs in the pseudo-fragment $\mathit{PseudoFrag}(u, t)$, based on the weighting function $\omega(u, c, t)$:

$$f_{u,c,t} = \omega(u, c, t)$$

This function is used to reduce the impact of typical problems of Information Retrieval, which may affect the proposed model too. More specifically, $\omega(u, c, t)$ takes into account (i) the discriminating power of the concept $c$ within the time interval $t$, and (ii) the relevance of the same concept within the profile of the user $u$. We define the signal $S_{u,c}$ related to a user $u$ and a concept $c$ as the ordered set of signal components $f_{u,c,t_i}$ with $t_i \in T$:

$$S_{u,c} = [f_{u,c,t_1}, f_{u,c,t_2}, \ldots, f_{u,c,t_n}]$$

where $T$ consists of $n$ consecutive intervals $t_i$ of the same length (with $i = 1, 2, \ldots, n$). In the bag-of-concepts model, a user is represented through a set of concepts weighted according to their occurrences within the pseudo-document. In the proposed model, a user is instead represented by a set of signals related to the concepts that appear in the pseudo-fragments concerning the user. Furthermore, each signal is made up of an ordered set of signal components weighted according to the weighting function. Now, we define the bag-of-signals model of user $u \in U$ as the set of the signals related to the user $u$, where the components $f_{u,c,t}$ are determined by the weighting function $\omega(u, c, t)$:

$$P_{BoS}(u) = \{ S_{u,c} = [f_{u,c,t_1}, f_{u,c,t_2}, \ldots, f_{u,c,t_n}] \mid c \in C \}$$

Each signal contains two different kinds of information related to the concept: temporal and quantitative. Hence, signals are the elementary units of the bag-of-signals representation, and therefore they are the starting point for assessing the similarity between users.

These signals show strong discontinuities and sharp spikes. Signal processing provides an ideal tool for representing and analyzing this kind of signal: the wavelet transform [5]. Wavelets are mathematical functions that are localized both in time (space) and in scale (frequency), thus providing an accurate time-scale map of the signal. Wavelet-based analysis relies on a prototype function, the so-called mother wavelet, whose translated and scaled versions constitute the basis functions for the series expansion that represents the original signal through coefficients. Operations involving signals can, therefore, be carried out in a more streamlined and efficient way directly on the corresponding wavelet coefficients. If the mother wavelet is properly selected (in our approach we choose the Haar wavelet for its compact support, as can be seen from Figure 1), the wavelet transform allows for best capturing the signal dynamics. The wavelet transform can be computed quickly (with computational cost $O(n)$, where $n$ is the number of signal samples) by means of the fast discrete wavelet transform (DWT) [10]. Preliminary attempts at leveraging wavelet theory for music and movie recommendation tasks have been proposed [4, 3].

Having defined the bag-of-signals model for representing user profiles, we also need a method for evaluating the similarity between users. Concretely, we considered two different similarity functions, $f_1$ and $f_2$.
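
Before turning to the similarity functions, the Haar decomposition mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`haar_step`, `haar_dwt`), the orthonormal 1/sqrt(2) scaling, and the requirement that the input length be divisible by 2 at every step are assumptions made for this example. Note also that `levels` here counts decomposition steps, which may differ from the level index used in the formulas of this section.

```python
import math


def haar_step(signal):
    """One Haar step: pairwise scaled sums (approximation) and differences (detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)  # orthonormal scaling (assumption of this sketch)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Apply `levels` Haar steps to `signal`.

    Returns the final approximation coefficients and the list of detail
    bands, finest band first. Each step halves the number of samples, so
    the total work is proportional to the signal length.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details
```

For the hypothetical signal `[4, 4, 0, 0, 2, 2, 6, 6]`, two steps leave two approximation coefficients (approximately 4 and 8) that summarize the first and second half of the observation period, which is the coarse temporal shape the similarity computation can compare.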

Given two users $u_1$, $u_2$ and their corresponding profiles $P_{BoS}(u_1)$, $P_{BoS}(u_2)$ based on the bag-of-signals representation, the similarity function $f_1$ between those users is defined as follows:

$$f_1(u_1, u_2) = \frac{\sum_{c \in C_1 \cap C_2} \varepsilon(s_{u_1,c})\, \varepsilon(s_{u_2,c})\, \mathit{temp}_{level}(s_{u_1,c}, s_{u_2,c})}{\sqrt{\sum_{c \in C_1} \varepsilon^2(s_{u_1,c})}\ \sqrt{\sum_{c \in C_2} \varepsilon^2(s_{u_2,c})}}$$

where $s_{u_1,c} \in P_{BoS}(u_1)$ and $s_{u_2,c} \in P_{BoS}(u_2)$, $C_1$ and $C_2$ are the sets of the concepts related to the signals belonging to $P_{BoS}(u_1)$ and $P_{BoS}(u_2)$, the function $\varepsilon(s)$ expresses the energy of the signal $s$, and $\mathit{temp}_{level}(s_1, s_2)$ is a function that analyzes whether the signals $s_1$ and $s_2$ show similar temporal usage patterns. The importance of a signal within the profile is given by its energy. Given a discrete-time signal $s$, bounded and with real components, its energy $\varepsilon(s)$ is defined as follows:

$$\varepsilon(s) = \sum_{i=0}^{|s|} s[i]^2$$

The function $\mathit{temp}_{level}$ returns a value between 0 and 1, providing a measure of how much the concepts belonging to the two profiles have been used with similar temporal patterns. In this way, the contribution of two concepts used in the same intervals will be greater than the contribution of concepts used in different intervals. The approximation $A_l(s)$ of the signal $s$ at the $l$-th level is defined by the set of approximation coefficients of the DWT limited to the $l$-th level:

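$$A_l(s) = \{ a_{l,j} \mid j = 1, \ldots, 2^l \}$$

Given two signals $s_1$ and $s_2$ and their respective approximations at the chosen level, $A_{level}(s_1) = [a^{s_1}_{1}, \ldots, a^{s_1}_{2^l}]$ and $A_{level}(s_2) = [a^{s_2}_{1}, \ldots, a^{s_2}_{2^l}]$, the function $\mathit{temp}_{level}(s_1, s_2)$ is defined as follows:

$$\mathit{temp}_{level}(s_1, s_2) = \frac{C(s_1, s_2)}{\sqrt{C(s_1, s_1)\, C(s_2, s_2)}}$$

where

$$C(s_1, s_2) = \sum_{i=0}^{|2^l|} A_{level}(s_1)[i]\, A_{level}(s_2)[i]$$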
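
To make the definition of $f_1$ more concrete, here is a minimal Python sketch under stated assumptions. A profile is assumed to be a dictionary mapping each concept to its signal (a list of equal-length samples); the approximation is computed with an orthonormal Haar averaging step; and the names `energy`, `haar_approximation`, `correlation`, `temp_level`, and the `steps` parameter are hypothetical helpers introduced only for this illustration, not the authors' code.

```python
import math


def energy(signal):
    """Energy of a discrete signal: the sum of its squared samples."""
    return sum(x * x for x in signal)


def haar_approximation(signal, steps):
    """Haar approximation coefficients after `steps` averaging steps.

    Assumes the signal length is divisible by 2 ** steps.
    """
    approx = list(signal)
    for _ in range(steps):
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2.0)
                  for i in range(0, len(approx), 2)]
    return approx


def correlation(a, b):
    """C(s1, s2): inner product of two approximation vectors."""
    return sum(x * y for x, y in zip(a, b))


def temp_level(s1, s2, steps):
    """Normalized correlation between the Haar approximations of two signals."""
    a1 = haar_approximation(s1, steps)
    a2 = haar_approximation(s2, steps)
    denom = math.sqrt(correlation(a1, a1) * correlation(a2, a2))
    return correlation(a1, a2) / denom if denom > 0 else 0.0


def f1(profile1, profile2, steps=1):
    """Energy-weighted, time-aware similarity between two bag-of-signals profiles."""
    shared = profile1.keys() & profile2.keys()
    numerator = sum(
        energy(profile1[c]) * energy(profile2[c])
        * temp_level(profile1[c], profile2[c], steps)
        for c in shared
    )
    norm1 = math.sqrt(sum(energy(s) ** 2 for s in profile1.values()))
    norm2 = math.sqrt(sum(energy(s) ** 2 for s in profile2.values()))
    return numerator / (norm1 * norm2) if norm1 > 0 and norm2 > 0 else 0.0
```

With this sketch, two users who mention the same concept in disjoint parts of the observation period obtain a zero contribution for that concept, whereas identical profiles obtain a similarity of 1 (up to floating-point rounding).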

Given two users $u_1$, $u_2$ and their respective user profiles $P_{BoS}(u_1)$ and $P_{BoS}(u_2)$ based on the bag-of-signals representation, the similarity function $f_2$ between those users is defined as follows:

$$f_2(u_1, u_2) = \frac{\sum_{c \in C_1 \cap C_2} \sum_{i} s_{u_1,c}[i]\, s_{u_2,c}[i]}{\sqrt{\sum_{c \in C_1} \sum_{i} s_{u_1,c}[i]^2}\ \sqrt{\sum_{c \in C_2} \sum_{i} s_{u_2,c}[i]^2}}$$

where $s_{u_1,c} \in P_{BoS}(u_1)$ and $s_{u_2,c} \in P_{BoS}(u_2)$, and $C_1$ and $C_2$ are the sets of the concepts related to the signals belonging to $P_{BoS}(u_1)$ and $P_{BoS}(u_2)$.
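
The function $f_2$ amounts to a cosine similarity computed over all the samples of all the signals. A minimal Python sketch follows; as an assumption of this example, a profile is a dictionary mapping each concept to a list of samples, and signals of shared concepts have the same length.

```python
import math


def f2(profile1, profile2):
    """Cosine-style similarity over the raw samples of two bag-of-signals profiles."""
    shared = profile1.keys() & profile2.keys()
    # Sample-by-sample products, restricted to the concepts both users share.
    numerator = sum(
        x * y
        for c in shared
        for x, y in zip(profile1[c], profile2[c])
    )
    # Each norm runs over every sample of every signal in the profile.
    norm1 = math.sqrt(sum(x * x for s in profile1.values() for x in s))
    norm2 = math.sqrt(sum(x * x for s in profile2.values() for x in s))
    return numerator / (norm1 * norm2) if norm1 > 0 and norm2 > 0 else 0.0
```

Unlike $f_1$, this function compares signals interval by interval at full resolution, so two signals contribute only when they are non-zero in exactly the same intervals.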

3. EXPERIMENTAL EVALUATION

In order to perform our experimental tests, we resorted to the dataset presented and employed in [1]. This dataset was obtained by monitoring the Twitter timelines of a sample of 20,000 English-speaking users for a given time period $T$. From the original sample, the authors selected only those 1619 users who posted at least ten tweets per month and at least 20 tweets in the whole observation period, thus gathering more than two million tweets. From the news articles mentioned in such tweets, concepts (i.e., entities, types, and topics) were extracted through the OpenCalais web service (footnote 2). We associated each concept with the creation time of the corresponding tweet, in order to localize it in time. The whole observation period $T$ was about three months, so we used the tweets (and the linked news articles) of the first two months as the training dataset, and the remaining tweets as the testing dataset. After that, the evaluation procedure was as follows (see Figure 2).

Training phase: (i) the news articles linked to the tweets belonging to the training dataset were retrieved; (ii) the concepts extracted from those news articles were considered; (iii) a bag-of-signals profile was built for each user, using the concepts obtained in the previous step; (iv) for each user, a list of the users most similar to her was returned.
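
The profile-building step of the training phase can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be a list of (timestamp, concept) pairs for one user, and raw occurrence counts stand in for the weighting function $\omega(u, c, t)$, whose exact form is not given here. The function name `build_bag_of_signals` and its parameters are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_bag_of_signals(mentions, start, interval, n_intervals):
    """Build one user's bag-of-signals profile.

    mentions    -- iterable of (timestamp, concept) pairs for the user
    start       -- beginning of the observation period T
    interval    -- length of each interval t_i (a timedelta)
    n_intervals -- number n of consecutive intervals in T

    Returns a dict mapping each concept to a list of n_intervals values.
    Raw counts are used as a placeholder for the weighting function.
    """
    signals = defaultdict(lambda: [0.0] * n_intervals)
    for timestamp, concept in mentions:
        index = (timestamp - start) // interval  # integer interval index
        if 0 <= index < n_intervals:             # ignore mentions outside T
            signals[concept][index] += 1.0
    return dict(signals)
```

A replacement weighting that accounts for the discriminating power of a concept within an interval and for its relevance within the profile could be applied to the counts afterwards, without changing the structure of the profile.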

Testing phase: (i) the news articles linked to the tweets belonging to the testing dataset were retrieved; (ii) a pseudo-document for each user was generated from those news articles; (iii) all the pseudo-documents were indexed using the open source Lucene platform (footnote 3), as proposed in [6]; (iv) for each pseudo-document, a list of the pseudo-documents most similar to it was returned.

Footnote 2: http://www.opencalais.com/
Footnote 3: https://lucene.apache.org/

The performance of the recommender system was assessed in terms of the normalized version of Discounted Cumulative Gain (nDCG) [8, 9]. nDCG is usually truncated at a particular rank level to emphasize the importance of the first retrieved documents. The measure is defined as follows:

$$nDCG@n = \frac{DCG@n}{IDCG@n} \qquad (1)$$

where the Discounted Cumulative Gain (DCG) accumulates the graded relevance of the first $n$ results, discounted according to their rank position [8, 9], $rel_i$ is the graded relevance of the $i$-th result (i.e., 0 = non-significant, 1 = significant, and 2 = very significant), and the Ideal DCG (IDCG) for a query corresponds to the DCG measure where scores are re-sorted in monotonically decreasing order, that is, the maximum possible DCG value over that query. nDCG is often used to evaluate search engine algorithms and other techniques whose goal is to order a subset of items in such a way that highly relevant documents are placed at the top of the list, while less important ones are moved lower. Basically, higher values of nDCG mean that the system output gets closer to the ideal ranked output.

Figure 3 shows the experimental results obtained considering the two similarity functions $f_1$ and $f_2$ introduced above, and the function $S_1$ proposed in [6], which was obtained by indexing the contents of all the news articles using Lucene. The first two approaches, which consider the evolution of interests over time, outperform the last one, which ignores the temporal dimension. Figure 4 reports the best results (i.e., those obtained through the $f_1$ similarity function) when varying the nature of the concepts represented as signals in the user profile. As we could expect, bag-of-signals user profiles representing entities as signals allow the news recommender to obtain the best performance. In fact, the maximum numbers of topics and types extracted by OpenCalais are 18 and 39, respectively. On the contrary, there is no limit on the number of entities extracted from news articles. In the dataset used, a bag-of-signals user profile with entities as signals can have more than 3500 represented concepts. Hence, the smaller amount of information in the case of topics and types brought about worse results than those obtained using entities.
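
For concreteness, the nDCG@n measure used in this evaluation can be sketched in Python as follows. The DCG variant is an assumption of this example: the sketch uses the common formulation in which the relevance of the result at rank $i$ is divided by $\log_2(i + 1)$, which may differ from the exact variant adopted in the experiments. The function names are hypothetical.

```python
import math


def dcg_at_n(relevances, n):
    """DCG@n with a log2(rank + 1) discount (assumed variant)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:n], start=1)
    )


def ndcg_at_n(relevances, n):
    """nDCG@n: DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances` lists the graded relevance (0, 1, or 2) of the results
    in the order returned by the recommender.
    """
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0
```

A ranking that already lists results in decreasing order of relevance scores 1, while one that pushes the most relevant results toward the bottom scores strictly less.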

4. CONCLUSIONS

In this paper, we have presented a news recommender system based on the bag-of-signals user model, which leverages signal processing techniques to represent not only the number of occurrences of the informative entities (concepts), but also the related temporal usage patterns. The bag-of-signals user model involves modeling the user's interests through a set of signals and adopting suitably defined similarity functions. More specifically, for signal analysis and representation we employ wavelets because of their main characteristic, time-frequency localization. In practice, the discrete wavelet transform allows us to effectively analyze the sampled signals with different time windows.

Although the experimental results on a dataset available online are positive, this work is still at a preliminary stage and leaves much room for future developments. For instance, the similarity function is an open issue that should be further investigated. Starting from the bag-of-signals model, we could explore new functions that consider the same data in a different way, developing new aspects and using other tools from the signal processing domain. Moreover, we intend to test our news recommender on real news datasets. Finally, another interesting development could involve sentiment analysis. Concretely, we propose to add a further module to the described news recommender to extract the positive, negative, or neutral opinion expressed by the user about a given concept. In this way, the profile may take into account not only the level and the temporal localization of users' interests, but also their nature.

5. REFERENCES

[1] F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Analyzing temporal dynamics in Twitter profiles for personalized recommendations in the social web. In Proceedings of the 3rd International Web Science Conference, WebSci '11. ACM, 2011.

[2] G. Arru, D. Feltoni Gurini, F. Gasparetti, A. Micarelli, and G. Sansonetti. Signal-based user recommendation on Twitter. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13 Companion, pages 941–944, New York, NY, USA, 2013. ACM.

[3] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti. Context-aware movie recommendation based on signal processing and machine learning. In Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, CAMRa '11, pages 5–10, New York, NY, USA, 2011. ACM.

[4] F. Gasparetti, C. Biancalana, A. Micarelli, A. Miola, and G. Sansonetti. Wavelet-based music recommendation. In K.-H. Krempels and J. Cordeiro, editors, WEBIST 2012 - Proceedings of the 8th International Conference on Web Information Systems and Technologies, Porto, Portugal, 18-21 April, 2012, pages 399–402. SciTePress, 2012.

[5] A. Graps. An introduction to wavelets. IEEE Computational Science and Engineering, 2(2), 1995.

[6] J. Hannon, M. Bennett, and B. Smyth. Recommending Twitter users to follow using content and collaborative filtering approaches. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 199–206. ACM, 2010.

[7] M. Harandi and J. A. Gulla. Survey of user profiling in news recommender systems. In J. A. Gulla, B. Yu, Ö. Özgöbek, and N. Shabib, editors, Proceedings of the 3rd International Workshop on News Recommendation and Analytics (INRA 2015) co-located with 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, September 20, 2015, volume 1542 of CEUR Workshop Proceedings, pages 20–26. CEUR-WS.org, 2015.

[8] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '00, pages 41–48, New York, NY, USA, 2000. ACM.

[9] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.

[10] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-11(7):674–693, July 1989.