WMRB: Learning to Rank in a Scalable Batch Training Approach

Kuan Liu (Information Sciences Institute, University of Southern California) kuanl@usc.edu
Prem Natarajan (Information Sciences Institute, University of Southern California) pnataraj@isi.edu

ABSTRACT
We propose a new learning to rank algorithm, named Weighted Margin-Rank Batch loss (WMRB), to extend the popular Weighted Approximate-Rank Pairwise loss (WARP). WMRB uses a new rank estimator and an efficient batch training algorithm. The approach allows more accurate item rank approximation and explicit utilization of parallel computation to accelerate training. In three item recommendation tasks, WMRB consistently outperforms WARP and other baselines. Moreover, WMRB shows clear time efficiency advantages as data scale increases.

KEYWORDS
Learning to Rank, Batch Training, Scalability

ACM Reference format:
Kuan Liu and Prem Natarajan. 2017. WMRB: Learning to Rank in a Scalable Batch Training Approach. In Proceedings of RecSys 2017 Posters, Como, Italy, August 27-31, 2 pages.

[Figure 1: Relative standard deviations of two types of rank estimators at different item ranks (rank position p = r/N, log scale). The simulation uses item set size N = 100,000. 'warp' (online) uses estimator (2); 'sampled-wmrb q' (sampled batch) uses estimator (4), where q = |Z|/|Y| ∈ {0.001, 0.01, 0.1}.]

1 INTRODUCTION
Rank based learning algorithms have been widely applied to recommendation problems. One prominent example is the Weighted Approximate-Rank Pairwise loss (WARP) [2]. It achieves high precision at the top of the predicted ranked list rather than an averaged high precision over the entire list. However, it does not scale to large item sets in practice due to its intrinsically online learning fashion. In this work, we address this limitation by proposing a novel algorithm and empirically demonstrating its advantages in both accuracy and time efficiency.

2 BACKGROUND
Notation. Let x denote a user, y an item, and Y the entire item set. yx denotes the items interacted with by user x; ȳx ≡ Y \ yx is the irrelevant item set. We omit the subscript x when there is no ambiguity. fy(x) denotes the model score. The rank of item y is defined as

    ry = ranky(f, x, y) = Σ_{ȳ∈ȳx} I[fy(x) ≤ fȳ(x)],    (1)

where I is the indicator function. Finally, |t|+ ≡ max(t, 0) for t ∈ ℝ.

Weighted Approximate-Rank Pairwise loss (WARP) develops an online approximation of item ranks. Its critical component is an iterative sampling approximation procedure: for each user-item pair (x, y), sample y′ ∈ Y uniformly with replacement until the margin condition 1 + fy′(x) < fy(x) is violated. WARP then estimates the item rank by

    ry ≈ ranky^warp(f, x, y) = ⌊(|Y| − 1) / N⌋,    (2)

where N is the number of sampling trials needed to find the violating example. It then incurs a rank-sensitive loss as in the Ordered Weighted Average [3], i.e.,

    Φ^owa(ry) = Σ_{j=1}^{ry} αj,    α1 ≥ α2 ≥ … ≥ 0,    (3)

where the non-increasing α series controls the sensitivity to ranks.

3 WEIGHTED MARGIN-RANK BATCH LOSS
Limitations of WARP. We first point out several limitations of WARP. 1) The rank estimator in (2) is not only biased¹ but also has large variance. As simulated in Fig. 1, the online estimator ('warp') has a large relative variance, especially when items have high ranks (small p). 2) Its low updating frequency results in prolonged training time in practice, because it can take a large number of sampling iterations to find a violating item, especially after the beginning stage of training. 3) The intrinsically sequential manner of WARP prevents full utilization of available parallel computation (e.g., GPUs), making it hard to train large or flexible models.

3.1 Proposed Approach
We address these limitations by combining a rank-sensitive loss with batch training. We first define the (sampled) margin rank as

    ranky^wmrb(f, x, y) = (|Y| / |Z|) Σ_{y′∈Z} |1 − fy(x) + fy′(x)|+ · I(y′ ∈ ȳ),    (4)

where Z is a subset of Y randomly sampled without replacement.
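To make the contrast between the two estimators concrete, they can be sketched as follows. This is a minimal NumPy simulation in the spirit of Fig. 1, not the paper's implementation; the function names, the toy Gaussian score distribution, and the sampling cap are our own assumptions.

```python
import numpy as np

def warp_rank_estimate(scores, pos, rng):
    """Online WARP estimator (2): sample items uniformly with
    replacement until one violates the margin 1 + f_y'(x) < f_y(x),
    then return floor((|Y| - 1) / N) for N sampling trials."""
    Y = len(scores)
    for n in range(1, Y):
        j = rng.integers(Y)
        if 1.0 + scores[j] > scores[pos]:    # violating item found
            return (Y - 1) // n
    return 0                                 # no violator found within the cap

def wmrb_rank_estimate(scores, pos, q, rng):
    """Sampled margin-rank estimator (4): rescaled sum of hinge
    margins over a random subset Z drawn without replacement."""
    Y = len(scores)
    Z = rng.choice(Y, size=max(1, int(q * Y)), replace=False)
    Z = Z[Z != pos]                          # I(y' in ȳ): drop the relevant item
    hinge = np.maximum(1.0 - scores[pos] + scores[Z], 0.0)
    return Y / len(Z) * hinge.sum()

# Toy comparison: repeat both estimators for one fixed item and
# compare relative standard deviations (cf. Fig. 1).
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
pos = int(np.argsort(scores)[-100])          # an item of moderate rank
warp_est = [warp_rank_estimate(scores, pos, rng) for _ in range(300)]
wmrb_est = [wmrb_rank_estimate(scores, pos, 0.1, rng) for _ in range(300)]
rel_std = lambda v: np.std(v) / np.mean(v)
# rel_std(warp_est) is typically several times rel_std(wmrb_est)
```

Note that the two quantities estimate different targets (the true rank in (1) versus the margin rank in (4)); the point of the simulation, as in Fig. 1, is only the relative spread of repeated estimates.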
¹ Suppose an item has rank r in a population of N items, and let p = r/N. The expectation of the estimator in (2) is approximately p + Σ_{k=2}^{N} (1/k) p(1 − p)^{k−1} > p. It seriously overestimates the rank when r is small.

RecSys 2017 Poster Proceedings, August 27-31, Como, Italy

Table 1: Recommendation accuracy comparisons (in %). Best results are in bold (e.g., 12.6). WMRB outperforms pairwise based methods as well as the batch based method CE.

                         XING                      Yelp                      ML-20m
             P@5   R@30   NDCG@30     P@5   R@30   NDCG@30     P@5   R@30   NDCG@30
-   POP      0.5    2.7     1.3       0.3    0.9     0.5       6.2   10.0     8.5
Pairwise
    WARP     2.6    8.3     5.6       1.3    4.4     2.5       9.8   14.2    13.4
    A-WARP   2.6   11.6     6.7       1.3    4.3     2.5      10.1   13.3    13.5
Batch
    CE       2.5   12.3     6.5       1.4    4.5     2.6       9.6   14.3    13.2
    WMRB     3.0   12.6     7.2       1.5    5.1     2.9      10.2   14.6    13.9

Table 2: Dataset statistics. U: users; I: items; S: interactions.

Data     |U|         |I|       |S_train|    |S_test|
XING     1,500,000   327,002   2,338,766    484,237
Yelp     1,029,433   144,073   1,917,218    213,460
ML-20m   138,493     27,278    7,899,901    2,039,972

Table 3: Dataset complexities and training time comparisons (T_epoch: average epoch time; T_conv: total training time). With increasing data scales, WMRB shows time efficiency advantages over the pairwise based implementation.

Datasets   # of Param.   # of Attr.   LightFM T_epoch   LightFM T_conv   WMRB T_epoch   WMRB T_conv
ML-20m     4.6M          11           7m                1.2h             22m            3.3h
Yelp       9.3M          19           10m               5.0h             9m             3.9h
XING       12.1M         33           94m               31.2h            24m            20.7h

While the margin rank defined in (4) is not the exact rank in (1), it characterizes the overall score violation of irrelevant items; the margin loss is often used to approximate the indicator function. Moreover, (4) can be readily computed in a batch manner: the model scores between a user and a batch of items, fy′(x) ∀y′ ∈ Y, are first computed; the margin losses of the violating items are then summed up.
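This batch computation, together with the rank-weighted loss log(r + 1) that Section 3.1 defines, can be sketched in vectorized NumPy. This is a stand-in for the paper's TensorFlow implementation; the array shapes, helper names, and the simplification of one positive item per user are our own assumptions.

```python
import numpy as np

def wmrb_margin_rank(F, pos_items, pos_mask, Z_idx):
    """Batch margin rank (4), one positive item per user.

    F         : (B, |Y|) score matrix f_y(x) for a batch of B users
    pos_items : (B,) index of the positive item y for each user
    pos_mask  : (B, |Y|) True where the item is relevant to the user
    Z_idx     : indices of the sampled candidate subset Z
    """
    B, Y = F.shape
    f_pos = F[np.arange(B), pos_items]                     # (B,)
    hinge = np.maximum(1.0 - f_pos[:, None] + F[:, Z_idx], 0.0)
    irrelevant = ~pos_mask[:, Z_idx]                       # I(y' in ȳ)
    return (Y / len(Z_idx)) * (hinge * irrelevant).sum(axis=1)

def wmrb_loss(F, pos_items, pos_mask, Z_idx):
    """Rank-weighted loss (5): log(margin rank + 1), batch mean.
    log1p is concave, so lowering a small margin rank reduces the
    loss more than lowering a large one (top-of-list sensitivity)."""
    r = wmrb_margin_rank(F, pos_items, pos_mask, Z_idx)
    return np.log1p(r).mean()

# Tiny worked example: 2 users, 4 items, Z = all items.
F = np.array([[2.0, 0.5, 1.5, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
pos_items = np.array([0, 1])
pos_mask = np.zeros_like(F, dtype=bool)
pos_mask[[0, 1], pos_items] = True
Z_idx = np.arange(4)
r = wmrb_margin_rank(F, pos_items, pos_mask, Z_idx)        # → [0.5, 3.0]
```

The entire computation is a handful of dense matrix operations over the batch, which is what lets it run on a GPU without the per-sample sequential loop that WARP requires.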
We then design a rank-sensitive loss function. Note that the margin rank ry = ranky^wmrb(f, x, y) is a non-negative real number rather than an integer as in (3). We define a differentiable loss function that incurs a "rank weighted" loss as follows:

    L^wmrb(x, y) = Φ^wmrb(ry) = log(ry + 1).    (5)

By noting that Φ′′(r) = −1/(1 + r)² < 0, the loss is more sensitive at small r, thus mimicking the property in (3).

Comparison to WARP. WMRB replaces the sequential sampling procedure with batch computations, which results in a different rank approximation and loss function. Per user-item pair, WMRB involves more computation, but this is compensated by easy utilization of parallel computing. WMRB also updates model parameters much more frequently than WARP, which updates the parameters of only one user-item pair after many sampling iterations. Finally, WMRB has an unbiased estimator of the margin rank, and its different sampling scheme results in smaller variances: the simulation in Fig. 1 shows that sampled-wmrb has much smaller variance than warp as long as |Z|/|Y| is not too small.

4 RESULTS
We validate WMRB on three datasets: XING², Yelp³, and MovieLens-20m⁴. The tasks are to recommend to users job posts, Yelp businesses, and movies, respectively. We assess the quality of recommendation by comparing the models' recommendations to ground-truth interactions split from the datasets. We report recommendation accuracies under the metrics Precision@5, Recall@30, and NDCG@30, as well as training time. The dataset statistics are listed in Table 2.

We compare WMRB to several methods. POP recommends items purely based on popularity. WARP and A-WARP are implemented in LightFM [1]; A-WARP differs from vanilla WARP by incorporating available attributes. CE uses a Cross-Entropy loss function and is a batch training based algorithm implemented by ourselves. CE and WMRB incorporate attributes as in A-WARP.

Accuracy. Table 1 reports accuracy comparisons of the different models. We highlight two observations. First, WMRB consistently outperforms WARP and A-WARP. For instance, the relative improvements are 8.6%, 18.6%, and 9.8% on Recall@30. Second, with both being batch based methods, WMRB wins over CE clearly, indicating the effectiveness of the rank-sensitive loss.

Time efficiency. Table 3 reports dataset complexity and training time of the different models. To measure complexity, we report the total number of model parameters and the average number of attributes per user-item pair. While we make an effort to speed up each method⁵, we are most interested in how running time changes with different data scales given fixed models and configurations.

From Table 3, WMRB is slower than LightFM on ML-20m. It catches up quickly on Yelp, where the data scale increases. On XING, which has the largest model size and complexity, WMRB is much faster. The trend clearly shows the scalability advantages of batch training based WMRB.

5 CONCLUSION
In this work, we propose a simple yet effective approach to scale up learning to rank algorithms. It implements the idea of the ordered weighted average loss in a batch training algorithm. Our preliminary empirical study shows its effectiveness.

REFERENCES
[1] Maciej Kula. 2015. Metadata Embeddings for User and Item Cold-start Recommendations. arXiv preprint arXiv:1507.08439 (2015). http://arxiv.org/abs/1507.08439
[2] Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning 81, 1 (2010), 21–35.
[3] Ronald R. Yager. 1988. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics 18, 1 (1988), 183–190.

² http://2016.recsyschallenge.com/
³ https://www.yelp.com/dataset_challenge. Downloaded in Feb 17.
⁵ WMRB is implemented in TensorFlow and runs on a single GPU (GeForce GTX TITAN X).
LightFM runs asynchronous SGD on Cython with 5 cores (Intel Xeon CPU 3.30GHz).
⁴ https://grouplens.org/datasets/movielens/20m/