<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WMRB: Learning to Rank in a Scalable Batch Training Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kuan Liu</string-name>
          <email>kuanl@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prem Natarajan</string-name>
          <email>pnataraj@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
<p>We propose a new learning-to-rank algorithm, named Weighted Margin-Rank Batch loss (WMRB), to extend the popular Weighted Approximate-Rank Pairwise (WARP) loss. WMRB uses a new rank estimator and an efficient batch training algorithm. The approach allows more accurate item-rank approximation and explicit use of parallel computation to accelerate training. In three item recommendation tasks, WMRB consistently outperforms WARP and other baselines. Moreover, WMRB shows clear time-efficiency advantages as data scale increases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 INTRODUCTION</title>
<p>Rank-based learning algorithms have been widely applied to recommendation problems. One prominent example is the Weighted Approximate-Rank Pairwise (WARP) loss [2]. It achieves high precision at the top of a predicted ranked list rather than averaged high precision over the entire list. However, it does not scale to large item sets in practice due to its intrinsic online learning fashion. In this work, we address the limitation by proposing a novel algorithm and empirically demonstrating its advantages in both accuracy and time efficiency.</p>
    </sec>
    <sec id="sec-2">
<title>2 BACKGROUND</title>
<p>Notation. Let x denote a user, y an item, and Y the entire item set. y_x denotes the items interacted with by user x, and ȳ_x ≡ Y \ y_x is the irrelevant item set. We omit the subscript x when there is no ambiguity. f_y(x) denotes the model score. The rank of item y is defined as</p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math><![CDATA[r_y = \mathrm{rank}_y(f, x, y) = \sum_{\bar{y} \in \bar{y}_x} \mathbb{I}[f_y(x) \le f_{\bar{y}}(x)],]]></tex-math>
      </disp-formula>
      <p>where I is the indicator function. Finally, |t|+ ≡ max(t, 0), t ∈ R.</p>
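      <p>For concreteness, a minimal NumPy sketch of the exact rank in (1); the function name and data layout are our own illustration, not the paper's code:</p>
      <preformat><![CDATA[
import numpy as np

def exact_rank(scores, y, relevant):
    """Exact rank of item y per Eq. (1): the number of irrelevant
    items scored at least as high as the positive item.

    scores   : 1-D array of model scores f(x) over the whole item set Y
    y        : index of the item of interest
    relevant : indices of items the user interacted with (y_x)
    """
    irrelevant = np.setdiff1d(np.arange(len(scores)), relevant)
    return int(np.sum(scores[irrelevant] >= scores[y]))
]]></preformat>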
      <p>
        Weighted Approximate-Rank Pairwise loss (WARP) [2] develops an online approximation of item ranks. Its critical component is an iterative sampling procedure: for each user-item pair (x, y), sample y′ ∈ Y uniformly with replacement until 1 + f_y′(x) &lt; f_y(x) is violated. It estimates item ranks by
      </p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math><![CDATA[r_y \approx \mathrm{rank}_y^{warp}(f, x, y) = \left\lfloor \frac{|Y| - 1}{N} \right\rfloor,]]></tex-math>
      </disp-formula>
      <p>where N is the number of samples drawn to find the violating example.</p>
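      <p>The sampling procedure can be sketched in Python as follows (our own illustration; relevant is assumed to be a set of item indices):</p>
      <preformat><![CDATA[
import numpy as np

def warp_rank_estimate(scores, y, relevant, rng, max_trials=None):
    """WARP's online rank estimate per Eq. (2): sample items uniformly
    with replacement until one violates the margin, then return
    floor((|Y| - 1) / N), where N is the number of trials used."""
    n_items = len(scores)
    max_trials = max_trials or n_items - 1
    for n in range(1, max_trials + 1):
        y_prime = rng.integers(n_items)
        # Violation: an irrelevant item comes within a margin of 1
        # of the positive item's score.
        if y_prime not in relevant and 1 + scores[y_prime] >= scores[y]:
            return (n_items - 1) // n
    return 0  # no violating item found within the budget; treat rank as 0
]]></preformat>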
      <fig id="fig1">
        <caption><p>Figure 1: Relative STDs of rank estimators vs. rank position p (p = r/N): warp compared with sampled-wmrb at sampling ratios |Z|/|Y| = 0.001, 0.01, and 0.1.</p></caption>
      </fig>
      <p>
        WARP then incurs a rank-sensitive loss as in the Ordered Weighted Average [3], i.e.,
      </p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math><![CDATA[\Phi_{owa}(r_y) = \sum_{j=1}^{r_y} \alpha_j, \qquad \alpha_1 \ge \alpha_2 \ge \dots \ge 0,]]></tex-math>
      </disp-formula>
      <p>where the non-increasing α series controls the sensitivity to ranks.</p>
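      <p>A small sketch of the OWA loss in the same style; the choice alpha_j = 1/j used below is a common option in the WARP literature, not something this paper prescribes:</p>
      <preformat><![CDATA[
import numpy as np

def owa_loss(rank, alphas):
    """Ordered Weighted Average loss per Eq. (3):
    Phi_owa(r) = sum of the first r terms of a non-increasing alpha series."""
    return float(np.sum(alphas[:rank]))

alphas = 1.0 / np.arange(1, 1001)        # alpha_j = 1/j emphasizes top ranks
assert np.all(np.diff(alphas) <= 0)      # non-increasing, as Eq. (3) requires
]]></preformat>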
    </sec>
    <sec id="sec-3">
<title>3 WEIGHTED MARGIN-RANK BATCH LOSS</title>
      <p>
        Limitations of WARP. We first point out several limitations of WARP. 1) The rank estimator in (2) is not only biased<xref ref-type="fn" rid="fn1"><sup>1</sup></xref> but also has large variance. As simulated in Fig. 1, the online estimator (warp) has a large relative variance, especially when items have high ranks (small p). 2) Its low updating frequency prolongs training in practice, because it can take a large number of sampling iterations to find a violating item, especially after the beginning stage of training. 3) The intrinsically sequential manner of WARP prevents full utilization of available parallel computation (e.g., GPUs), making it hard to train large or flexible models.
      </p>
      <sec id="sec-3-1">
        <title>3.1 Proposed Approach</title>
        <p>
          We address the limitations by combining a rank-sensitive loss with batch training. We first define the (sampled) margin rank as
        </p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math><![CDATA[\mathrm{rank}_y^{wmrb}(f, x, y) = \frac{|Y|}{|Z|} \sum_{y' \in Z} |1 - f_y(x) + f_{y'}(x)|_+ \, \mathbb{I}(y' \in \bar{y}),]]></tex-math>
        </disp-formula>
        <p>
          where Z is a randomly sampled subset of Y (without replacement).
        </p>
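        <p>The bias and variance claims can be spot-checked with a rough Monte Carlo simulation in the spirit of Fig. 1. The sketch below idealizes both estimators with 0/1 violations (no margins), and all constants are our own:</p>
        <preformat><![CDATA[
import numpy as np

rng = np.random.default_rng(0)
N, r = 10_000, 100       # |Y| and the true rank; p = r / N = 0.01
trials = 2_000

# warp-style: geometric sampling until one of the r "violating" items is hit.
warp_est = np.array([(N - 1) // rng.geometric(r / N) for _ in range(trials)])

# sampled-wmrb-style: count violations in a fixed subset Z, scale by |Y|/|Z|.
z = int(0.01 * N)
wmrb_est = np.array([(N / z) * rng.hypergeometric(r, N - r, z)
                     for _ in range(trials)])

for name, est in [("warp", warp_est), ("sampled-wmrb", wmrb_est)]:
    print(name, "mean:", est.mean(), "relative STD:", est.std() / r)
]]></preformat>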
        <p>
          While the margin rank defined in (4) is not the exact rank in (1), it characterizes the overall score violation by irrelevant items; the margin loss is often used to approximate the indicator function. Moreover, (4) can be readily computed in a batch manner: the model scores between a user and a batch of items, f_y′(x) for all y′ ∈ Y, are first computed; the margin losses of the violating items are then summed up.
        </p>
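        <p>A vectorized NumPy sketch of this batch computation (the paper's implementation is in TensorFlow; the names here are illustrative):</p>
        <preformat><![CDATA[
import numpy as np

def margin_rank_batch(pos_score, batch_scores, irrelevant_mask, n_items):
    """Sampled margin rank per Eq. (4) in one vectorized pass.

    pos_score       : f_y(x), score of the positive item
    batch_scores    : f_{y'}(x) for every y' in the sampled subset Z
    irrelevant_mask : boolean array, True where y' is irrelevant
    """
    margins = np.maximum(0.0, 1.0 - pos_score + batch_scores)  # hinge |.|_+
    scale = n_items / len(batch_scores)                        # |Y| / |Z|
    return scale * float(np.sum(margins * irrelevant_mask))
]]></preformat>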
        <p>
          We then design a rank-sensitive loss function. Note that the margin rank r_y = rank_y^wmrb(f, x, y) is a non-negative real number rather than an integer as in (3). We define a differentiable loss function that incurs a "rank-weighted" loss as follows:
        </p>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math><![CDATA[L_{wmrb}(x, y) = \Phi_{wmrb}(r_y) = \log(r_y + 1).]]></tex-math>
        </disp-formula>
        <p>
          By noting that Φ′′(r) = −1/(1+r)² &lt; 0, the loss is more sensitive to small r, thus mimicking the property of (3).
        </p>
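        <p>Putting (4) and (5) together in the same illustrative style:</p>
        <preformat><![CDATA[
import numpy as np

def wmrb_loss(margin_rank):
    """Rank-weighted loss per Eq. (5): Phi(r) = log(r + 1).

    Phi'(r) = 1/(1+r) and Phi''(r) = -1/(1+r)^2 < 0, so a unit of extra
    score violation costs more when the margin rank is small (top of the
    list), mimicking the rank sensitivity of Eq. (3)."""
    return np.log1p(margin_rank)
]]></preformat>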
        <p>Compared to WARP, WMRB replaces the sequential sampling procedure with batch computations, resulting in a different rank approximation and loss function. Per user-item pair, WMRB involves more computation, which is compensated by easy utilization of parallel computing. WMRB also updates model parameters much more frequently than WARP, which updates the parameters for only one user-item pair after many sampling iterations.</p>
        <p>WMRB has an unbiased estimator of the margin rank, and its different sampling scheme results in smaller variances. The simulation in Fig. 1 shows that sampled-wmrb has much smaller variance than warp as long as |Z|/|Y| is not too small.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 RESULTS</title>
<p>We validate WMRB on three datasets: XING (http://2016.recsyschallenge.com/), Yelp (https://www.yelp.com/dataset_challenge, downloaded in Feb. 2017), and MovieLens-20m (https://grouplens.org/datasets/movielens/20m/). The tasks are to recommend to users job posts, Yelp businesses, and movies, respectively. We assess the quality of recommendations by comparing the models' recommendations to ground-truth interactions split from the datasets. We report recommendation accuracies under the metrics Precision@5, Recall@30, and NDCG@30, as well as training time. The dataset statistics are listed in Table 2.</p>
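      <p>For reference, one common convention for these metrics with binary relevance (the paper does not spell out its exact variants, so the details below are assumptions):</p>
      <preformat><![CDATA[
import numpy as np

def precision_at_k(ranked_items, ground_truth, k=5):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(y in ground_truth for y in ranked_items[:k]) / k

def recall_at_k(ranked_items, ground_truth, k=30):
    """Fraction of relevant items recovered in the top-k."""
    return sum(y in ground_truth for y in ranked_items[:k]) / len(ground_truth)

def ndcg_at_k(ranked_items, ground_truth, k=30):
    """Binary-relevance NDCG@k with log2 position discounting."""
    gains = np.array([float(y in ground_truth) for y in ranked_items[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))
    ideal_hits = min(len(ground_truth), len(gains))
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0
]]></preformat>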
      <p>We compare WMRB to different methods. POP recommends items purely based on popularity. WARP and A-WARP are implemented in LightFM [1]; A-WARP differs from vanilla WARP by incorporating available attributes. CE uses a cross-entropy loss function and is a batch-training based algorithm implemented by ourselves. CE and WMRB incorporate attributes as in A-WARP.</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption><p>Training time to convergence (Tconv) of LightFM (A-WARP) and WMRB on the three datasets.</p></caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>LightFM Tconv</th><th>WMRB Tconv</th></tr>
          </thead>
          <tbody>
            <tr><td>ML-20m</td><td>1.2 h</td><td>3.3 h</td></tr>
            <tr><td>Yelp</td><td>5.0 h</td><td>3.9 h</td></tr>
            <tr><td>XING</td><td>31.2 h</td><td>20.7 h</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Accuracy. Table 1 reports accuracy comparisons of the different models. We highlight two observations. First, WMRB consistently outperforms WARP and A-WARP; for instance, the relative improvements are 8.6%, 18.6%, and 9.8% on Recall@30. Second, with both being batch-based methods, WMRB clearly wins over CE, indicating the effectiveness of the rank-sensitive loss.</p>
      <p>Time efficiency. Table 3 reports dataset complexity and training time of the different models. To measure complexity, we report the total number of model parameters and the average number of attributes per user-item pair. While we make an effort to speed up each method,<xref ref-type="fn" rid="fn5"><sup>5</sup></xref> we are most interested in how running time changes with different data scales given fixed models and configurations.</p>
      <p>From Table 3, WMRB is slower than LightFM on ML-20m. It catches up quickly on Yelp, where the data scale increases. On XING, which has the largest model size and complexity, WMRB is much faster. The trend clearly shows the scalability advantages of batch-training based WMRB.</p>
    </sec>
    <sec id="sec-6">
      <title>5 CONCLUSION</title>
      <p>In this work, we propose a simple yet effective approach to scaling up learning-to-rank algorithms. It implements the idea of an ordered weighted average loss in a batch training algorithm. Our preliminary empirical study shows its effectiveness.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Maciej</given-names>
            <surname>Kula</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Metadata Embeddings for User and Item Cold-start Recommendations</article-title>
          .
          <source>arXiv preprint arXiv:1507.08439</source>
          (
          <year>2015</year>
          ). http://arxiv.org/abs/1507.08439
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Samy</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Usunier</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Large scale image annotation: learning to rank with joint word-image embeddings</article-title>
          .
          <source>Machine Learning 81, 1</source>
          (
          <year>2010</year>
          ),
          <fpage>21</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ronald R.</given-names>
            <surname>Yager</surname>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>On ordered weighted averaging aggregation operators in multicriteria decision making</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics 18</source>
          ,
          <issue>1</issue>
          (
          <year>1988</year>
          ),
          <fpage>183</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
    <fn-group>
      <fn id="fn1">
        <label>1</label>
        <p>Suppose an item has rank r in a population of size N, and let p = r/N. The expectation of the estimator in (2) is approximately p + \sum_{k=2}^{N} (1/k) p(1-p)^{k-1} &gt; p, so it seriously overestimates the rank when r is small.</p>
      </fn>
      <fn id="fn5">
        <label>5</label>
        <p>WMRB is implemented in TensorFlow and runs on a single GPU (GeForce GTX TITAN X). LightFM runs asynchronous SGD in Cython with 5 cores (Intel Xeon CPU, 3.30GHz).</p>
      </fn>
    </fn-group>
  </back>
</article>