
Estimating the Value of Multi-Dimensional Data Sets in Context-based Recommender Systems

Panagiotis Adamopoulos

Alexander Tuzhilin

Department of Information, Operations and Management Sciences, Leonard N. Stern School of Business, New York University, USA

2014

We propose a method for estimating the expected economic value of multi-dimensional data sets in recommender systems and illustrate the proposed approach using a unique data set combining implicit and explicit ratings with rich content as well as spatio-temporal contextual dimensions and social network data.

Keywords: Business Value, Context, Dataset

1. INTRODUCTION

Although collaborative filtering (CF) recommender systems (RSes) have been very successful during the last decades, they have certain limitations; traditional RSes operate in the two-dimensional User × Item space and do not take into consideration additional contextual information, such as time and location, that may be crucial in many applications. At the same time, data related to social networks and other informative dimensions is widely available nowadays but it usually comes at significant monetary cost and/or engineering effort. Hence, data should be treated as an investment and the expected costs and benefits of acquiring and using it should be carefully considered and evaluated.

In this paper, we illustrate how we can estimate the expected economic value (gain or loss) of such multi-dimensional data sets and translate the added predictive power into monetary units (such as U.S. dollars). This approach has important implications since determining the expected monetary value of data sets or specific sets of features can lead to better and more profitable managerial decisions through more informed and data-driven decision making in the future. Besides, the proposed approach can be used to derive even more useful evaluation metrics in the field of RSes.

In the rest of the paper, we first use the matrix factorization framework to show how various dimensions can be incorporated into a single model for recommendations and then discuss how the added predictive power of the induced model translates into monetary value for businesses. Then, we introduce a novel multi-dimensional data set and illustrate the aforementioned approach. Due to space limitations, we focus on the task of item prediction; this method can be extended to rating prediction as well.

2. MODEL

We build a (hybrid) model incorporating the extra information of temporal, social and location dynamics as well as the content of items, using a feature-based factorization model [2]. In particular, the prediction score $\hat{y}_{u,i}$ is modeled as:

$$\hat{y}_{u,i} = \mu + \sum_{g \in G} \gamma_g b_g + \sum_{m \in M} \alpha_m b^u_m + \sum_{n \in N} \beta_n b^i_n + \Big( \sum_{m \in M} \alpha_m p_m \Big)^T \Big( \sum_{n \in N} \beta_n q_n \Big),$$

where $\mu$ is the base score of the predictions, $G$, $M$, $N$ the index sets of global features, user features, and item features, respectively, $\gamma$, $\alpha$, $\beta$ the corresponding feature vectors, and $\gamma_g$, $\alpha_m$, $\beta_n$ the feature values. In the specific example presented in the rest of this paper, the global features include the location and temporal information (context) of the rating events, the item features the content information of the items, and the user features the social network information of the users (see Section 5). In addition, a vector of latent factors is included as well. The model can be further extended in order to incorporate social relationships of the users or other relevant information.

To estimate the model (i.e., the feature weights $b_g$, $b^u_m$, $b^i_n$ and factors $p_m$, $q_n$), we use the logistic function as activation function and the negative log-likelihood as loss function:

$$\text{Loss} = -\sum_{u,i} \Big( r_{u,i} \ln f(\hat{y}_{u,i}) + (1 - r_{u,i}) \ln\big(1 - f(\hat{y}_{u,i})\big) \Big) + \text{regularization},$$

where $f(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}$ and $r_{u,i} \in \{0, 1\}$ the true rating.
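As a minimal sketch of the prediction score and the logistic loss described in this section, the following pure-Python code evaluates the score for a single user-item pair and the corresponding negative log-likelihood term. The function names, argument layout, and toy numbers are assumptions made for illustration and are not the API of the toolkit in [2]; regularization is omitted.

```python
import math

def dot(xs, ys):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(xs, ys))

def weighted_sum(values, vectors):
    """Return sum_k values[k] * vectors[k], i.e. a combined latent factor."""
    dim = len(vectors[0])
    return [sum(v * vec[d] for v, vec in zip(values, vectors)) for d in range(dim)]

def predict_score(mu, gamma, alpha, beta, b_g, b_u, b_i, P, Q):
    """Feature-based factorization score for one (user, item) pair.

    gamma, alpha, beta: global, user, and item feature values.
    b_g, b_u, b_i: the matching feature weights.
    P, Q: latent factor vectors, one per user feature / item feature.
    """
    linear = dot(gamma, b_g) + dot(alpha, b_u) + dot(beta, b_i)
    interaction = dot(weighted_sum(alpha, P), weighted_sum(beta, Q))
    return mu + linear + interaction

def sigmoid(y):
    """Logistic activation f(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def log_loss(r, y_hat):
    """Negative log-likelihood of a binary rating r given the raw score y_hat."""
    f = sigmoid(y_hat)
    return -(r * math.log(f) + (1 - r) * math.log(1 - f))

# Toy (hypothetical) example: 1 global, 2 user, and 1 item feature, 2 latent factors.
y_hat = predict_score(
    mu=0.5,
    gamma=[1.0], alpha=[1.0, 0.5], beta=[1.0],
    b_g=[0.1], b_u=[0.2, -0.2], b_i=[0.3],
    P=[[1.0, 0.0], [0.0, 2.0]], Q=[[0.5, 0.5]],
)
loss_if_positive = log_loss(1, y_hat)
loss_if_negative = log_loss(0, y_hat)
```

In this toy setting the score is positive, so the loss is smaller when the true rating is 1 than when it is 0, which is the behavior the negative log-likelihood is meant to reward.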

3. DATA

Similar to [3], we construct a new data set, titled "ConcertTweets", based on publicly available and well-structured tweets referring to music concerts [1]. This data set is collected and analyzed in real time using the Twitter streaming API. We decided to collect, use, and release this data set because it contains rich feature dimensions as well as novel and relevant activity from a domain of significant academic and business interest. As of June 2014, this data set contains information on 30,178 distinct Twitter users and 100,000 personal ratings, both implicit and explicit, referring to more than 50,000 concerts of 13,578 music artists and bands.

The unique characteristics of our data set allow reconciling it and linking it to popular databases leveraging rich semantic information, such as the musical genres of the artists. Besides, both the geolocation information of the concert and the user (as publicly disclosed based on the application settings, self-reported by the user, or inferred based on the detailed meta-data about the time zone of the location of the user) are included. Other characteristics of this data set that allow for more thorough and extensive (both offline and online) experimentation are the combination of implicit (i.e., $r_{u,i} \in \{\text{`Yes'}, \text{`Maybe'}, \text{`No'}\}$) and explicit (i.e., $r_{u,i} \in \{0.5, 1.0, \ldots, 5.0\}$) ratings, the presence of popular and recent events, and the availability of the timestamp information for both the item (i.e., concert) and the corresponding rating event. In addition, this data set includes information about the social presence of the users (e.g., number of followers, timeline, etc.) and can be easily extended to include their social network. Finally, using the unique Twitter user identifiers, this data set can be further enriched with cross-domain (e.g., books, movies) user activity [4].

4. BUSINESS VALUE

Working within the CF framework of RSes, we assume that data related to implicit and explicit ratings is already available and part of the baseline recommender. Hence, we illustrate how we can estimate the added economic value of data sets related to additional contextual dimensions. We also assume that either the complete data set or an initial representative sample from the additional dimensions is available in order to conduct the initial analysis before the decision to acquire the full data set and/or incorporate it into the production RS. Then, using the cost-benefit information of the business for the specific recommendation task (as in Table 1), we can estimate the expected value of predictions with and without using the additional dimensions. In particular, the added value per instance (i.e., rating tuple) for an additional dimension is estimated as:

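$$\text{Value} = p(U) \cdot \Delta\text{Recall} \cdot b(R,U) + p(U) \cdot (-\Delta\text{Recall}) \cdot c(NR,U) + p(NU) \cdot (\Delta\text{Specificity}) \cdot c(R,NU),$$

where $\Delta\text{Recall} = \text{Recall}_{RS'} - \text{Recall}_{RS}$, $\Delta\text{Specificity} = \text{Specificity}_{RS} - \text{Specificity}_{RS'}$, $RS$ the baseline recommender (or "random" predictions) and $RS'$ the recommender with the extended set of contextual dimensions.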
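The per-instance estimate above can be sketched in a few lines of Python. This is an illustrative sketch under two stated assumptions that are not spelled out in the text: costs are entered as signed payoffs (zero or negative), so that a reduction in an error rate yields a positive gain, and $p(NU) = 1 - p(U)$. The function name, parameter names, and the example numbers are hypothetical.

```python
def added_value_per_instance(p_used, recall_base, recall_ext,
                             spec_base, spec_ext,
                             benefit_ru, cost_nru, cost_rnu):
    """Expected added value per rating tuple of the extended recommender RS'.

    Assumption: cost_nru and cost_rnu are signed payoffs (<= 0), and
    p(NU) is taken to be 1 - p(U).
    """
    p_not_used = 1.0 - p_used
    delta_recall = recall_ext - recall_base  # Recall_RS' - Recall_RS
    delta_spec = spec_base - spec_ext        # Specificity_RS - Specificity_RS'
    return (p_used * delta_recall * benefit_ru
            + p_used * (-delta_recall) * cost_nru
            + p_not_used * delta_spec * cost_rnu)

# Hypothetical numbers: half of the items are used, recall and specificity
# both improve from 0.60 to 0.70, correct predictions carry no benefit, and
# each wrong prediction costs 100 units.
value = added_value_per_instance(
    p_used=0.5,
    recall_base=0.60, recall_ext=0.70,
    spec_base=0.60, spec_ext=0.70,
    benefit_ru=0.0, cost_nru=-100.0, cost_rnu=-100.0,
)
```

With these inputs the estimate is about 10 units per instance, which equals the error cost (100) times the 0.10 gain in accuracy. The result can then be compared against the per-instance cost of acquiring and engineering the additional data.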

Equivalently, for the task of top-N recommendations: Value = p(U) [...] b(R,U) [...] p(U) [...] c(NR,U) [...] p(NU) [...] c(R,NU).

Similarly, the above approach is extended to the ranking task, using the area under the ROC curve, as well as applications with non-zero benefit for true negatives (i.e., not recommended and not used items) and variable costs.

Given the expected value of the additional dimensions introduced to the RS, we can then estimate whether adding such factors justifies the engineering cost and effort as well as the potential monetary cost of acquiring the data.

5. RESULTS

In the conducted experiments, we consider as positive instances ($r_{u,i} := 1$) all the items with an explicit rating equal to or greater than 4.0 or an implicit rating indicating that the user attended (i.e., labeled as `Yes') or might attend (i.e., `Maybe') the event; items with ratings less than 4.0 or events that a user will not attend (i.e., `No') are considered negatives ($r_{u,i} := 0$). In addition, for each user we randomly select an equal number of non-rated items as negative examples in order to increase the accuracy of our predictions. Moreover, we employ a holdout evaluation scheme with 80/20 random splits into training and test sets without filtering any ratings and we evaluate each model on the classification task in terms of accuracy. Also, we set the L2 regularization parameters at 0.004 and the constant bias for prediction at 0.5. The learning rate for stochastic gradient descent is 0.015.

Table 2: Accuracy and E[Value] per user-item pair for the models MF, MF + Item Content, MF + User Social Network data, MF + Location-based features, MF + Temporal features, and MF + All features.
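The labeling rule, the sampling of non-rated items, and the 80/20 holdout split described in this section can be sketched as follows. This is an illustrative sketch rather than the experimental code: the function names and the toy data are hypothetical, and "an equal number" is read here as one sampled non-rated item per rated item of each user, which is an assumption.

```python
import random

def to_binary_label(rating):
    """Map an explicit (numeric) or implicit ('Yes'/'Maybe'/'No') rating to 0/1."""
    if isinstance(rating, str):
        return 1 if rating in ("Yes", "Maybe") else 0
    return 1 if rating >= 4.0 else 0

def sample_unrated_negatives(user_ratings, all_items, rng):
    """For each user, draw as many non-rated items as the user has ratings."""
    negatives = []
    for user, rated in user_ratings.items():
        candidates = sorted(set(all_items) - set(rated))
        k = min(len(rated), len(candidates))
        negatives += [(user, item, 0) for item in rng.sample(candidates, k)]
    return negatives

def holdout_split(instances, rng, train_fraction=0.8):
    """Random split of the instances into training and test sets."""
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy (hypothetical) data: two users, five concerts.
user_ratings = {"u1": {"c1": 4.5, "c2": "No"}, "u2": {"c3": "Maybe"}}
all_items = ["c1", "c2", "c3", "c4", "c5"]
rng = random.Random(42)

rated = [(u, i, to_binary_label(r))
         for u, items in user_ratings.items() for i, r in items.items()]
negatives = sample_unrated_negatives(user_ratings, all_items, rng)
train, test = holdout_split(rated + negatives, rng)
```

Fixing the random seed keeps the sampled negatives and the split reproducible across runs, which matters when comparing model variants on the same test set.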

For the various specifications of the factorization model of Section 2, apart from i) the basic model (MF) which includes 128 latent factors, we used ii) the content information of the 50 most frequent music genres of the artists as item features, iii) the social presence of the users (i.e., number of followers, friends, statuses posted, and tweets favorited) as user features, iv) spatial information of the 50 most popular locations and whether the user is located in the same geographical region with the event (locality) as global features, v) the temporal information (i.e., `Friday', `Saturday', `Other') of the event again as global features, and vi) an integrated model combining all the aforementioned features.
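One possible way to encode the contextual (global) features listed above is sketched below: a vocabulary of the most frequent values, a three-way day-of-week bucket, and a locality flag. The function names, the feature-name scheme, and the region labels are hypothetical and chosen only for illustration; the paper does not specify the exact encoding.

```python
from collections import Counter
from datetime import date

def top_k_vocabulary(values, k=50):
    """Index the k most frequent values (e.g., genres or locations)."""
    most_common = [v for v, _ in Counter(values).most_common(k)]
    return {v: idx for idx, v in enumerate(most_common)}

def temporal_bucket(event_date):
    """Bucket an event date into 'Friday', 'Saturday', or 'Other'."""
    weekday = event_date.weekday()  # Monday is 0, so Friday is 4 and Saturday is 5
    return {4: "Friday", 5: "Saturday"}.get(weekday, "Other")

def global_features(event_date, event_location, user_region, event_region,
                    location_vocab):
    """Sparse global (context) features as a {name: value} dictionary."""
    features = {"day=" + temporal_bucket(event_date): 1.0}
    if event_location in location_vocab:
        features["loc=" + event_location] = 1.0
    features["local"] = 1.0 if user_region == event_region else 0.0
    return features

# Toy (hypothetical) example.
location_vocab = top_k_vocabulary(["New York", "London", "New York"], k=50)
example = global_features(date(2014, 6, 20), "New York",
                          user_region="US-East", event_region="US-East",
                          location_vocab=location_vocab)
```

Locations outside the vocabulary simply contribute no location feature, so rare venues fall back to the temporal and locality signals.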

Table 2 shows the experimental results using a cost of 100 units for wrong predictions and zero cost for correct predictions. We see that the various dimensions of this data set have very different monetary value and that the contextual information of location is the most informative dimension in this application offering significant return on investment. Even though the highest accuracy was achieved using the integrated model, the business value should be further considered and compared against the engineering effort and the monetary cost related to additional data.

6. CONCLUSIONS

In this paper, we propose a method for estimating the expected economic value of multi-dimensional data sets in RSes and illustrate the proposed approach using a unique data set combining implicit and explicit ratings with rich content, spatio-temporal contextual dimensions, and social network profiles. This approach can lead to better and more profitable managerial decisions as well as more useful evaluation metrics. As part of the future work, we plan to extend the proposed approach to the task of rating prediction as well as estimate the value of different dimensions in various recommendation domains and settings.

[1] Adamopoulos. ConcertTweets: A Multi-Dimensional Data Set for Recommender Systems Research. http://people.stern.nyu.edu/padamopo/data/concertTweets.html.

[2] Chen, Zhang, Lu, et al. SVDFeature: a toolkit for feature-based collaborative filtering. JMLR, 2012.

[3] Dooms et al. MovieTweetings: a movie rating dataset collected from Twitter. In CrowdRec at RecSys, 2013.

[4] Dooms et al. Mining cross-domain rating datasets from structured data on Twitter. In MSM at WWW, 2014.