<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimating the Value of Multi-Dimensional Data Sets in Context-based Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Panagiotis Adamopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Tuzhilin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information, Operations and Management Sciences, Leonard N. Stern School of Business, New York University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>We propose a method for estimating the expected economic value of multi-dimensional data sets in recommender systems and illustrate the proposed approach using a unique data set combining implicit and explicit ratings with rich content as well as spatio-temporal contextual dimensions and social network data.</p>
      </abstract>
      <kwd-group>
        <kwd>Business Value</kwd>
        <kwd>Context</kwd>
        <kwd>Dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>Although collaborative filtering (CF) recommender systems (RSes) have been very successful during the last decades, they have certain limitations: traditional RSes operate in the two-dimensional User × Item space and do not take into consideration additional contextual information, such as time and location, that may be crucial in many applications. At the same time, data related to social networks and other informative dimensions is widely available nowadays, but it usually comes at significant monetary cost and/or engineering effort. Hence, data should be treated as an investment, and the expected costs and benefits of acquiring and using it should be carefully considered and evaluated.</p>
      <p>In this paper, we illustrate how we can estimate the expected economic value (gain or loss) of such multi-dimensional data sets and translate the added predictive power into monetary units (such as U.S. dollars). This approach has important implications, since determining the expected monetary value of data sets or specific sets of features can lead to better and more profitable managerial decisions through more informed and data-driven decision making in the future. Besides, the proposed approach can be used to derive even more useful evaluation metrics in the field of RSes.</p>
      <p>In the rest of the paper, we first use the matrix factorization framework to show how various dimensions can be incorporated into a single model for recommendations and then discuss how the added predictive power of the induced model translates into monetary value for businesses. Then, we introduce a novel multi-dimensional data set and illustrate the aforementioned approach. Due to space limitations, we focus on the task of item prediction; this method can be extended to rating prediction as well.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MODEL</title>
      <p>
        We build a (hybrid) model incorporating the extra information of temporal, social and location dynamics as well as the content of items, using a feature-based factorization model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, the prediction score <inline-formula><tex-math>\hat{y}_{u,i}</tex-math></inline-formula> is modeled as:
      </p>
      <disp-formula>
        <tex-math>
          \hat{y}_{u,i} = \mu + \sum_{g \in G} \gamma_g b_g + \sum_{m \in M} \alpha_m b^{u}_m + \sum_{n \in N} \beta_n b^{i}_n + \left( \sum_{m \in M} \alpha_m p_m \right)^{T} \left( \sum_{n \in N} \beta_n q_n \right)
        </tex-math>
      </disp-formula>
      <p>where <inline-formula><tex-math>\mu</tex-math></inline-formula> is the base score of the predictions; <inline-formula><tex-math>G, M, N</tex-math></inline-formula> are the index sets of global features, user features, and item features, respectively; <inline-formula><tex-math>\gamma, \alpha, \beta</tex-math></inline-formula> are the corresponding feature vectors; and <inline-formula><tex-math>\gamma_g, \alpha_m, \beta_n</tex-math></inline-formula> are the feature values. In the specific example presented in the rest of this paper, the global features include the location and temporal information (context) of the rating events, the item features the content information of the items, and the user features the social network information of the users (see Section 5). In addition, a vector of latent factors is included as well. The model can be further extended in order to incorporate social relationships of the users or other relevant information.</p>
      <p>To estimate the model (i.e., the feature weights <inline-formula><tex-math>b_g, b^{u}_m, b^{i}_n</tex-math></inline-formula> and factors <inline-formula><tex-math>p_m, q_n</tex-math></inline-formula>), we use the logistic function as activation function and the negative log-likelihood as loss function:</p>
      <disp-formula>
        <tex-math>
          \mathrm{Loss} = -\sum_{u,i} \left( r_{u,i} \ln f(\hat{y}_{u,i}) + (1 - r_{u,i}) \ln\left(1 - f(\hat{y}_{u,i})\right) \right) + \mathrm{regularization},
        </tex-math>
      </disp-formula>
      <p>where <inline-formula><tex-math>f(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}</tex-math></inline-formula> and <inline-formula><tex-math>r_{u,i} \in \{0, 1\}</tex-math></inline-formula> is the true rating.</p>
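      <p>The prediction score and the per-instance loss can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the SVDFeature implementation: the function names, the dictionary-based feature representation, and the toy numbers are assumptions made for this example, and the regularization term is omitted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def predict_score(mu, gamma, alpha, beta, b_g, b_u, b_i, P, Q):
    """Feature-based factorization score for one (user, item) event.

    gamma, alpha, beta: dicts mapping feature index -> feature value for the
    global, user, and item features (hypothetical sparse representation).
    b_g, b_u, b_i: dicts mapping feature index -> scalar weight.
    P, Q: dicts mapping user/item feature index -> latent factor vector.
    """
    # Linear part: weighted sums over global, user, and item features.
    linear = (sum(v * b_g[g] for g, v in gamma.items())
              + sum(v * b_u[m] for m, v in alpha.items())
              + sum(v * b_i[n] for n, v in beta.items()))
    # Factor part: inner product of the aggregated user and item factors.
    k = len(next(iter(P.values())))
    p = [sum(v * P[m][f] for m, v in alpha.items()) for f in range(k)]
    q = [sum(v * Q[n][f] for n, v in beta.items()) for f in range(k)]
    return mu + linear + sum(pf * qf for pf, qf in zip(p, q))


def logistic(y_hat):
    """Logistic activation f(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y_hat))


def log_loss(r, y_hat):
    """Negative log-likelihood of one binary rating r in {0, 1}."""
    f = logistic(y_hat)
    return -(r * math.log(f) + (1 - r) * math.log(1.0 - f))


# Toy example with one feature of each kind and two latent factors.
score = predict_score(
    mu=0.5,
    gamma={"friday": 1.0}, alpha={"u1": 1.0}, beta={"rock": 1.0},
    b_g={"friday": 0.1}, b_u={"u1": 0.2}, b_i={"rock": -0.1},
    P={"u1": [1.0, 2.0]}, Q={"rock": [0.5, 0.25]},
)
print(score, log_loss(1, score))
```
]]></preformat>
      <p>In this toy setting the score is the base score plus the three weighted feature sums plus the inner product of the aggregated factors; a positive rating with a high score then yields a small loss, while the same score for a negative rating yields a large one.</p>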
    </sec>
    <sec id="sec-4">
      <title>DATA</title>
      <p>
        Similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we construct a new data set, titled "ConcertTweets", based on publicly available and well-structured tweets referring to music concerts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This data set is collected and analyzed in real time using the Twitter streaming API. We decided to collect, use, and release this data set because it contains rich feature dimensions as well as novel and relevant activity from a domain of significant academic and business interest. As of June 2014, this data set contains information on 30,178 distinct Twitter users and 100,000 personal ratings, both implicit and explicit, referring to more than 50,000 concerts of 13,578 music artists and bands.
      </p>
      <p>
        The unique characteristics of our data set allow it to be reconciled with and linked to popular databases that provide rich semantic information, such as the musical genres of the artists. Besides, the geolocation of both the concert and the user (as publicly disclosed based on the application settings, self-reported by the user, or inferred from the detailed meta-data about the time zone of the user's location) is included. Other characteristics of this data set that allow for more thorough and extensive (both offline and online) experimentation are the combination of implicit (i.e., <inline-formula><tex-math>r_{u,i} \in \{\text{`Yes'}, \text{`Maybe'}, \text{`No'}\}</tex-math></inline-formula>) and explicit (i.e., <inline-formula><tex-math>r_{u,i} \in \{0.5, 1.0, \ldots, 5.0\}</tex-math></inline-formula>) ratings, the presence of popular and recent events, and the availability of the timestamp information for both the item (i.e., concert) and the corresponding rating event. In addition, this data set includes information about the social presence of the users (e.g., number of followers, timeline, etc.) and can be easily extended to include their social network. Finally, using the unique Twitter user identifiers, this data set can be further enriched with cross-domain (e.g., books, movies) user activity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>BUSINESS VALUE</title>
      <p>Working within the CF framework of RSes, we assume that data related to implicit and explicit ratings is already available and part of the baseline recommender. Hence, we illustrate how we can estimate the added economic value of data sets related to additional contextual dimensions. We also assume that either the complete data set or an initial representative sample from the additional dimensions is available in order to conduct the initial analysis before the decision to acquire the full data set and/or incorporate it into the production RS. Then, using the cost-benefit information of the business for the specific recommendation task (as in Table 1), we can estimate the expected value of predictions with and without using the additional dimensions. In particular, the added value per instance (i.e., rating tuple) for an additional dimension is estimated as:</p>
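      <disp-formula>
        <tex-math>
          \mathrm{Value} = p(U) \cdot \Delta\mathrm{Recall} \cdot b(R,U) + p(U) \cdot (-\Delta\mathrm{Recall}) \cdot c(NR,U) + p(NU) \cdot (\Delta\mathrm{Specificity}) \cdot c(R,NU),
        </tex-math>
      </disp-formula>
      <p>where <inline-formula><tex-math>\Delta\mathrm{Recall} = \mathrm{Recall}_{RS'} - \mathrm{Recall}_{RS}</tex-math></inline-formula>, <inline-formula><tex-math>\Delta\mathrm{Specificity} = \mathrm{Specificity}_{RS} - \mathrm{Specificity}_{RS'}</tex-math></inline-formula>, <inline-formula><tex-math>RS</tex-math></inline-formula> is the baseline recommender (or "random" predictions), and <inline-formula><tex-math>RS'</tex-math></inline-formula> is the recommender with the extended set of contextual dimensions.</p>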
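      <p>As a worked illustration of the per-instance value, the short sketch below evaluates the expression for hypothetical numbers. Several things here are assumptions made for the example rather than statements from the paper: the function and parameter names, the reading that costs enter as negative values (so that avoiding an error adds value), the identity p(NU) = 1 - p(U), and all input figures. Recall is differenced as new minus baseline and specificity as baseline minus new, following the definitions given with the formula.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def added_value_per_instance(p_u, recall_base, recall_new,
                             spec_base, spec_new,
                             b_r_u, c_nr_u, c_r_nu):
    """Added value per rating tuple of an extended recommender RS'.

    p_u: prior probability that an item is used; p(NU) is taken as 1 - p_u.
    b_r_u: benefit of a recommended and used item.
    c_nr_u, c_r_nu: costs of a missed item and of a wrong recommendation,
    entered as negative values (assumption of this sketch).
    """
    p_nu = 1.0 - p_u
    d_recall = recall_new - recall_base   # Recall_RS' - Recall_RS
    d_spec = spec_base - spec_new         # Specificity_RS - Specificity_RS'
    return (p_u * d_recall * b_r_u
            + p_u * (-d_recall) * c_nr_u
            + p_nu * d_spec * c_r_nu)


# Hypothetical inputs: recall improves from 0.70 to 0.80, specificity from
# 0.60 to 0.65, no benefit for hits, and a cost of 100 units per error.
value = added_value_per_instance(
    p_u=0.5, recall_base=0.70, recall_new=0.80,
    spec_base=0.60, spec_new=0.65,
    b_r_u=0.0, c_nr_u=-100.0, c_r_nu=-100.0,
)
print(round(value, 6))  # approximately 7.5 units per instance
```
]]></preformat>
      <p>With zero benefit and equal error costs, as in this example, the added value reduces to the cost magnitude multiplied by the gain in accuracy (here 100 times 0.075), which makes the link between predictive power and monetary units explicit.</p>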
      <p>Equivalently, for the task of top-N recommendations:
Value = p(U)</p>
      <p>b(R,U)
p(U)</p>
      <p>c(NR,U)
p(NU)
c(R,NU):</p>
      <p>Similarly, the above approach is extended to the ranking task, using the area under the ROC curve, as well as applications with non-zero benefit for true negatives (i.e., not recommended and not used items) and variable costs.</p>
      <p>Given the expected value of the additional dimensions introduced to the RS, we can then estimate whether adding such factors justifies the engineering cost and effort as well as the potential monetary cost of acquiring the data.</p>
    </sec>
    <sec id="sec-6">
      <title>RESULTS</title>
      <p>In the conducted experiments, we consider as positive instances (<inline-formula><tex-math>r_{u,i} := 1</tex-math></inline-formula>) all the items with an explicit rating equal to or greater than 4.0 or an implicit rating indicating that the user attended (i.e., labeled as 'Yes') or might attend (i.e., 'Maybe') the event; items with ratings less than 4.0 or events that a user will not attend (i.e., 'No') are considered negatives (<inline-formula><tex-math>r_{u,i} := 0</tex-math></inline-formula>). In addition, for each user we randomly select an equal number of non-rated items as negative examples in order to increase the accuracy of our predictions. Moreover, we employ a holdout evaluation scheme with 80/20 random splits into training and test sets without filtering any ratings, and we evaluate each model as a classification task based on accuracy. Also, we set the L2 regularization parameters at 0.004 and the constant bias for prediction at 0.5. The learning rate for stochastic gradient descent is 0.015.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Model</th><th>Accuracy</th><th>E[Value] per user-item pair</th></tr>
          </thead>
          <tbody>
            <tr><td>MF</td><td/><td/></tr>
            <tr><td>MF + Item Content</td><td/><td/></tr>
            <tr><td>MF + User Social Network data</td><td/><td/></tr>
            <tr><td>MF + Location-based features</td><td/><td/></tr>
            <tr><td>MF + Temporal features</td><td/><td/></tr>
            <tr><td>MF + All features</td><td/><td/></tr>
          </tbody>
        </table>
      </table-wrap>
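      <p>The data preparation described above (binarizing ratings, sampling non-rated items as negatives, and an 80/20 holdout split) could be sketched as follows. This is a minimal sketch under stated assumptions: the helper names and data layout are hypothetical, "an equal number" is read as equal to the number of items the user rated, and the seeded random generator is used only to make the example reproducible.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def to_binary_label(explicit=None, implicit=None):
    """Map an explicit (0.5-5.0) or implicit ('Yes'/'Maybe'/'No') rating to 0/1."""
    if explicit is not None:
        return 1 if explicit >= 4.0 else 0
    return 1 if implicit in ("Yes", "Maybe") else 0


def sample_negatives(rated_items, all_items, rng):
    """Pick as many non-rated items as the user has rated items (assumption)."""
    candidates = sorted(set(all_items) - set(rated_items))
    k = min(len(rated_items), len(candidates))
    return rng.sample(candidates, k)


def holdout_split(instances, rng, train_fraction=0.8):
    """Randomly split instances into training and test sets."""
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
labels = [to_binary_label(explicit=4.5), to_binary_label(implicit="No")]
negatives = sample_negatives(["c1", "c2"], ["c1", "c2", "c3", "c4", "c5"], rng)
train, test = holdout_split(range(10), rng)
print(labels, len(negatives), len(train), len(test))
```
]]></preformat>
      <p>Sampled negatives receive the label 0 and are added to the user's instances before the split, so that both the training and the test set contain a mix of observed and sampled examples.</p>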
      <p>For the various specifications of the factorization model of Section 2, apart from i) the basic model (MF), which includes 128 latent factors, we used ii) the content information of the 50 most frequent music genres of the artists as item features, iii) the social presence of the users (i.e., number of followers, friends, statuses posted, and tweets favorited) as user features, iv) spatial information of the 50 most popular locations and whether the user is located in the same geographical region as the event (locality) as global features, v) the temporal information (i.e., 'Friday', 'Saturday', 'Other') of the event, again as global features, and vi) an integrated model combining all the aforementioned features.</p>
      <p>Table 2 shows the experimental results using a cost of 100 units for wrong predictions and zero cost for correct predictions. We see that the various dimensions of this data set have very different monetary value and that the contextual information of location is the most informative dimension in this application, offering significant return on investment. Even though the highest accuracy was achieved using the integrated model, the business value should be further considered and compared against the engineering effort and the monetary cost related to additional data.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSIONS</title>
      <p>In this paper, we propose a method for estimating the expected economic value of multi-dimensional data sets in RSes and illustrate the proposed approach using a unique data set combining implicit and explicit ratings with rich content, spatio-temporal contextual dimensions, and social network profiles. This approach can lead to better and more profitable managerial decisions as well as more useful evaluation metrics. As part of the future work, we plan to extend the proposed approach to the task of rating prediction as well as estimate the value of different dimensions in various recommendation domains and settings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Adamopoulos</surname>
          </string-name>
          .
          <article-title>ConcertTweets: A Multi-Dimensional Data Set for Recommender Systems Research</article-title>
          . http://people.stern.nyu.edu/padamopo/data/concertTweets.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          , et al.
          <article-title>SVDFeature: a toolkit for feature-based collaborative filtering</article-title>
          .
          <source>JMLR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dooms</surname>
          </string-name>
          et al.
          <article-title>MovieTweetings: a movie rating dataset collected from Twitter</article-title>
          .
          <source>In CrowdRec at RecSys</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dooms</surname>
          </string-name>
          et al.
          <article-title>Mining cross-domain rating datasets from structured data on Twitter</article-title>
          .
          <source>In MSM at WWW</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>