WMRB: Learning to Rank in a Scalable Batch Training Approach

Kuan Liu (Information Sciences Institute, University of Southern California) kuanl@usc.edu
Prem Natarajan (Information Sciences Institute, University of Southern California) pnataraj@isi.edu

ABSTRACT
We propose a new learning to rank algorithm, named Weighted Margin-Rank Batch loss (WMRB), to extend the popular Weighted Approximate-Rank Pairwise loss (WARP). WMRB uses a new rank estimator and an efficient batch training algorithm. The approach allows more accurate item rank approximation and explicit utilization of parallel computation to accelerate training. In three item recommendation tasks, WMRB consistently outperforms WARP and other baselines. Moreover, WMRB shows clear time efficiency advantages as data scale increases.

KEYWORDS
Learning to Rank, Batch Training, Scalability

ACM Reference format:
Kuan Liu and Prem Natarajan. 2017. WMRB: Learning to Rank in a Scalable Batch Training Approach. In Proceedings of RecSys 2017 Posters, Como, Italy, August 27-31, 2 pages.

[Figure 1: Relative standard deviations of two types of rank estimators at different item ranks (rank position p = r/N, log scale). The simulation uses item set size N = 100,000. 'warp' (online) uses estimator (2); 'sampled-wmrb q' (sampled batch) uses estimator (4), where q = |Z|/|Y| ∈ {0.001, 0.01, 0.1}.]

1 INTRODUCTION
Rank based learning algorithms have been widely applied to recommendation problems. One prominent example is the Weighted Approximate-Rank Pairwise loss (WARP) [2]. It achieves high precision at the top of the predicted ranked list rather than an averaged high precision over the entire list. However, it does not scale to large item sets in practice due to its intrinsically online learning fashion. In this work, we address this limitation by proposing a novel algorithm and empirically demonstrating its advantages in both accuracy and time efficiency.

2 BACKGROUND
Notation. Let x denote a user, y an item, and Y the entire item set. yx denotes the items interacted with by user x; ȳx ≡ Y \ yx is the irrelevant item set. We omit the subscript x when there is no ambiguity. fy(x) denotes the model score. The rank of item y is defined as

    ry = ranky(f, x, y) = Σ_{ȳ∈ȳx} I[fy(x) ≤ fȳ(x)],    (1)

where I is the indicator function. Finally, |t|+ ≡ max(t, 0) for t ∈ ℝ.

Weighted Approximate-Rank Pairwise loss (WARP) develops an online approximation of item ranks. Its critical component is an iterative sampling approximation procedure: for each user-item pair (x, y), sample y′ ∈ Y uniformly with replacement until the margin condition 1 + fy′(x) < fy(x) is violated. WARP then estimates the item rank by

    ry ≈ ranky^warp(f, x, y) = ⌊(|Y| − 1) / N⌋,    (2)

where N is the number of sampling trials needed to find the violating example. It then incurs a rank-sensitive loss as in the Ordered Weighted Average [3], i.e.,

    Φ^owa(ry) = Σ_{j=1}^{ry} αj,    α1 ≥ α2 ≥ … ≥ 0,    (3)

where the non-increasing α series controls the sensitivity to ranks.

3 WEIGHTED MARGIN-RANK BATCH LOSS
Limitations of WARP. We first point out several limitations of WARP. 1) The rank estimator in (2) is not only biased¹ but also has large variance. As simulated in Fig. 1, the online estimator ('warp') has a large relative variance, especially when items have high ranks (small p). 2) Its low updating frequency results in prolonged training time in practice, because it can take a large number of sampling iterations to find a violating item, especially after the beginning stage of training. 3) The intrinsically sequential manner of WARP prevents full utilization of available parallel computation (e.g., GPUs), making it hard to train large or flexible models.

3.1 Proposed Approach
We address these limitations by combining a rank-sensitive loss with batch training. We first define the (sampled) margin rank as

    ranky^wmrb(f, x, y) = (|Y| / |Z|) Σ_{y′∈Z} |1 − fy(x) + fy′(x)|+ · I(y′ ∈ ȳ),    (4)

where Z is a subset of Y randomly sampled without replacement.
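To make the contrast between the two estimators concrete, they can be sketched as follows. This is a minimal NumPy simulation in the spirit of Fig. 1, not the paper's implementation; the function names, the toy Gaussian score distribution, and the sampling cap are our own assumptions.

```python
import numpy as np

def warp_rank_estimate(scores, pos, rng):
    """Online WARP estimator (2): sample items uniformly with
    replacement until one violates the margin 1 + f_y'(x) < f_y(x),
    then return floor((|Y| - 1) / N) for N sampling trials."""
    Y = len(scores)
    for n in range(1, Y):
        j = rng.integers(Y)
        if 1.0 + scores[j] > scores[pos]:    # violating item found
            return (Y - 1) // n
    return 0                                 # no violator found within the cap

def wmrb_rank_estimate(scores, pos, q, rng):
    """Sampled margin-rank estimator (4): rescaled sum of hinge
    margins over a random subset Z drawn without replacement."""
    Y = len(scores)
    Z = rng.choice(Y, size=max(1, int(q * Y)), replace=False)
    Z = Z[Z != pos]                          # I(y' in ȳ): drop the relevant item
    hinge = np.maximum(1.0 - scores[pos] + scores[Z], 0.0)
    return Y / len(Z) * hinge.sum()

# Toy comparison: repeat both estimators for one fixed item and
# compare relative standard deviations (cf. Fig. 1).
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
pos = int(np.argsort(scores)[-100])          # an item of moderate rank
warp_est = [warp_rank_estimate(scores, pos, rng) for _ in range(300)]
wmrb_est = [wmrb_rank_estimate(scores, pos, 0.1, rng) for _ in range(300)]
rel_std = lambda v: np.std(v) / np.mean(v)
# rel_std(warp_est) is typically several times rel_std(wmrb_est)
```

Note that the two quantities estimate different targets (the true rank in (1) versus the margin rank in (4)); the point of the simulation, as in Fig. 1, is only the relative spread of repeated estimates.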
¹ Suppose an item has rank r in a population of N items, and let p = r/N. The expectation of the estimator in (2) is approximately p + Σ_{k=2}^{N} (1/k) p(1 − p)^{k−1} > p. It seriously overestimates the rank when r is small.

RecSys 2017 Poster Proceedings, August 27-31, Como, Italy

Table 1: Recommendation accuracy comparisons (in %). Best results are in bold (e.g., 12.6). WMRB outperforms pairwise based methods as well as the batch based method CE.

                         XING                      Yelp                      ML-20m
             P@5   R@30   NDCG@30     P@5   R@30   NDCG@30     P@5   R@30   NDCG@30
-   POP      0.5    2.7     1.3       0.3    0.9     0.5       6.2   10.0     8.5
Pairwise
    WARP     2.6    8.3     5.6       1.3    4.4     2.5       9.8   14.2    13.4
    A-WARP   2.6   11.6     6.7       1.3    4.3     2.5      10.1   13.3    13.5
Batch
    CE       2.5   12.3     6.5       1.4    4.5     2.6       9.6   14.3    13.2
    WMRB     3.0   12.6     7.2       1.5    5.1     2.9      10.2   14.6    13.9

Table 2: Dataset statistics. U: users; I: items; S: interactions.

Data     |U|         |I|       |S_train|    |S_test|
XING     1,500,000   327,002   2,338,766    484,237
Yelp     1,029,433   144,073   1,917,218    213,460
ML-20m   138,493     27,278    7,899,901    2,039,972

Table 3: Dataset complexities and training time comparisons (T_epoch: average epoch time; T_conv: total training time). With increasing data scales, WMRB shows time efficiency advantages over the pairwise based implementation.

Datasets   # of Param.   # of Attr.   LightFM T_epoch   LightFM T_conv   WMRB T_epoch   WMRB T_conv
ML-20m     4.6M          11           7m                1.2h             22m            3.3h
Yelp       9.3M          19           10m               5.0h             9m             3.9h
XING       12.1M         33           94m               31.2h            24m            20.7h

While the margin rank defined in (4) is not the exact rank in (1), it characterizes the overall score violation of irrelevant items; the margin loss is often used to approximate the indicator function. Moreover, (4) can be readily computed in a batch manner: the model scores between a user and a batch of items, fy′(x) ∀y′ ∈ Y, are first computed; the margin losses of the violating items are then summed up.
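This batch computation, together with the rank-weighted loss log(r + 1) that Section 3.1 defines, can be sketched in vectorized NumPy. This is a stand-in for the paper's TensorFlow implementation; the array shapes, helper names, and the simplification of one positive item per user are our own assumptions.

```python
import numpy as np

def wmrb_margin_rank(F, pos_items, pos_mask, Z_idx):
    """Batch margin rank (4), one positive item per user.

    F         : (B, |Y|) score matrix f_y(x) for a batch of B users
    pos_items : (B,) index of the positive item y for each user
    pos_mask  : (B, |Y|) True where the item is relevant to the user
    Z_idx     : indices of the sampled candidate subset Z
    """
    B, Y = F.shape
    f_pos = F[np.arange(B), pos_items]                     # (B,)
    hinge = np.maximum(1.0 - f_pos[:, None] + F[:, Z_idx], 0.0)
    irrelevant = ~pos_mask[:, Z_idx]                       # I(y' in ȳ)
    return (Y / len(Z_idx)) * (hinge * irrelevant).sum(axis=1)

def wmrb_loss(F, pos_items, pos_mask, Z_idx):
    """Rank-weighted loss (5): log(margin rank + 1), batch mean.
    log1p is concave, so lowering a small margin rank reduces the
    loss more than lowering a large one (top-of-list sensitivity)."""
    r = wmrb_margin_rank(F, pos_items, pos_mask, Z_idx)
    return np.log1p(r).mean()

# Tiny worked example: 2 users, 4 items, Z = all items.
F = np.array([[2.0, 0.5, 1.5, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
pos_items = np.array([0, 1])
pos_mask = np.zeros_like(F, dtype=bool)
pos_mask[[0, 1], pos_items] = True
Z_idx = np.arange(4)
r = wmrb_margin_rank(F, pos_items, pos_mask, Z_idx)        # → [0.5, 3.0]
```

The entire computation is a handful of dense matrix operations over the batch, which is what lets it run on a GPU without the per-sample sequential loop that WARP requires.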
We then design a rank-sensitive loss function. Note that the margin rank ry = ranky^wmrb(f, x, y) is a non-negative real number rather than an integer as in (3). We define a differentiable loss function that incurs a "rank weighted" loss as follows:

    L^wmrb(x, y) = Φ^wmrb(ry) = log(ry + 1).    (5)

By noting that Φ′′(r) = −1/(1 + r)² < 0, the loss is more sensitive at small r, thus mimicking the property in (3).

Comparison to WARP. WMRB replaces the sequential sampling procedure with batch computations, which results in a different rank approximation and loss function. Per user-item pair, WMRB involves more computation, but this is compensated by easy utilization of parallel computing. WMRB also updates model parameters much more frequently than WARP, which updates the parameters of only one user-item pair after many sampling iterations. Finally, WMRB has an unbiased estimator of the margin rank, and its different sampling scheme results in smaller variances: the simulation in Fig. 1 shows that sampled-wmrb has much smaller variance than warp as long as |Z|/|Y| is not too small.

4 RESULTS
We validate WMRB on three datasets: XING², Yelp³, and MovieLens-20m⁴. The tasks are to recommend to users job posts, Yelp businesses, and movies, respectively. We assess the quality of recommendation by comparing the models' recommendations to ground-truth interactions split from the datasets. We report recommendation accuracies under the metrics Precision@5, Recall@30, and NDCG@30, as well as training time. The dataset statistics are listed in Table 2.

We compare WMRB to several methods. POP recommends items purely based on popularity. WARP and A-WARP are implemented in LightFM [1]; A-WARP differs from vanilla WARP by incorporating available attributes. CE uses a Cross-Entropy loss function and is a batch training based algorithm implemented by ourselves. CE and WMRB incorporate attributes as in A-WARP.

Accuracy. Table 1 reports accuracy comparisons of the different models. We highlight two observations. First, WMRB consistently outperforms WARP and A-WARP. For instance, the relative improvements are 8.6%, 18.6%, and 9.8% on Recall@30. Second, with both being batch based methods, WMRB wins over CE clearly, indicating the effectiveness of the rank-sensitive loss.

Time efficiency. Table 3 reports dataset complexity and training time of the different models. To measure complexity, we report the total number of model parameters and the average number of attributes per user-item pair. While we make an effort to speed up each method⁵, we are most interested in how running time changes with different data scales given fixed models and configurations.

From Table 3, WMRB is slower than LightFM on ML-20m. It catches up quickly on Yelp, where the data scale increases. On XING, which has the largest model size and complexity, WMRB is much faster. The trend clearly shows the scalability advantages of batch training based WMRB.

5 CONCLUSION
In this work, we propose a simple yet effective approach to scale up learning to rank algorithms. It implements the idea of the ordered weighted average loss in a batch training algorithm. Our preliminary empirical study shows its effectiveness.

REFERENCES
[1] Maciej Kula. 2015. Metadata Embeddings for User and Item Cold-start Recommendations. arXiv preprint arXiv:1507.08439 (2015). http://arxiv.org/abs/1507.08439
[2] Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning 81, 1 (2010), 21–35.
[3] Ronald R. Yager. 1988. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics 18, 1 (1988), 183–190.

² http://2016.recsyschallenge.com/
³ https://www.yelp.com/dataset_challenge. Downloaded in Feb 17.
⁵ WMRB is implemented in TensorFlow and runs on a single GPU (GeForce GTX TITAN X).
LightFM runs asynchronous SGD on Cython with 5 cores (Intel Xeon CPU 3.30GHz).
⁴ https://grouplens.org/datasets/movielens/20m/