Multi-objective Relevance Ranking Michinari Momma Alireza Bagheri Garakani Yi Sun michi@amazon.com alirezg@amazon.com yisun@amazon.com Amazon.com Inc. Amazon.com Inc. Amazon.com Inc. Seattle, WA Seattle, WA Seattle, WA ABSTRACT input features to a model. For example, Amazon Search [18, 19] In this paper, we introduce an Augmented Lagrangian based method has a large number of features to capture relevance with a feature in a search relevance ranking algorithm to incorporate the multi- repository consisting of product features such as sales and customer dimensional nature of relevance and business constraints, both of reviews, queries / context features such as query specificity or cus- which are the requirements for building relevance ranking mod- tomer status, as well as textual similarity features between query els in production. The off-the-shelf solutions cannot handle such and products. complex objectives and therefore, modelers are left hand-tuning Business constraints are additional requirements in production of parameters that have only indirect impact to the objectives, at- modeling. Some are derived from existing relevance metrics proven tempting to incorporate multiple objectives (MO) in a model. This to be effective over time and are sought to retain in model refresh to process is time-consuming and tends to face sub-optimality. The ensure avoiding churn [14]. Some are derived from relevance such proposed method is designed to systematically solve the MO prob- as latency to ensure quick search responses, and some are strategic lem in a constrained optimization framework, which is integrated and examples include minimum %-gain to consider experimentation with a popular Boosting algorithm and is, by all means, a novel / launch and avoiding adult items for media products. Reduction contribution. Furthermore, we propose a procedure to specify the of search defects that are the search results that do not match the constraints to achieve business goals and the exploration scales query in various aspects, can be considered for both as it gives linearly in the number of constraints, while existing methodology better customer experience and business requirement being strict can scale exponentially. The experimental results show that the as to ensure the quality of search results. An example of search method successfully builds models that achieve MO criteria much defect is showing a cheap zirconium ring for a query “diamond more efficiently th an ex isting me thods. Th e po tential im pact in- ring”, which could give customers the impression that the search is cludes significant reduction in model development time and allows broken, or the e-commerce site is more like a flea market, damaging for automation of model refresh even with presence of several MO brand image of the service [19]. criteria, in real world production system scale with hundreds of As discussed above, relevance ranking modeling faces challenges millions of records. of dealing with multiple metrics of relevance and / or business con- straints, i.e., multiple-objectives (MO). Off-the-shelf machine learn- KEYWORDS ing solutions [5, 10, 13], cannot handle such complicated objectives Learning to rank; Multi-objective ranking optimization; Product in a systematic manner. Although the effect can be limited, one may search; Web search try to employ either over-weighting (OW) [7] by exemplars and/or by search impressions, a set of query-product pairs for a session ACM Reference Format: on a given day, to influence the objective function in a desired way. Michinari Momma, Alireza Bagheri Garakani, and Yi Sun. 2019. Multi- Tuning of weight values is required in this approach and it can objective Relevance Ranking. In Proceedings of the SIGIR 2019 Workshop suffer combinatorial exploration to search for a best combination of on eCommerce (SIGIR 2019 eCom), 8 pages. weightings, which becomes prohibitive as the number of objectives grows. 1 INTRODUCTION To address the issue, we propose a constrained optimization Relevance in information retrieval (IR) has been extensively studied method applied to the Gradient Boosting Tree (GBT). Specifically, and applied to various areas such as web search, product search and we introduce a constraint optimization in the LambdaMART algo- recommendation, etc. The concept of relevance is multi- rithm, the most prominent method for relevance modeling today. dimensional, dynamic and evolutional [1, 15]. In product search, a LambdaMART is based on GBT [8] using CART [2] and employs ranking of products is modeled by customer’s historical optimization on IR metrics such as the normalized cumulative dis- behavior. Typically, signals such as purchase, add to cart or click counted gains (NDCG) or the mean reciprocal rank (MRR). One of are used as a target variable and models are tuned to optimize our goals is to propose a practical approach to the MO problem / rankings based on such behaviors. To address the multi- task in a production setting, which is constrained by memory and dimensional nature of relevance, various features that represent model size, i.e., the number to trees, and efficiency of the learn- relevance dimensions are used as ing algorithm to avoid regression of latency in scoring and model development timeline. Copyright © 2019 by the paper’s authors. Copying permitted for private and academic purposes. In order to meet the requirements and limitations, adaptation In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): of the Augmented Lagrangian (AL) method [17] to LambdaMART Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at (AL-LM) is proposed. AL converts the original constrained prob- http://ceur-ws.org lem into an unconstrained problem by adding penalty terms that SIGIR 2019 eCom, July 2019, Paris, France Michinari Momma, et al. penalize constraint violations. The Lagrange multipliers, i.e., dual In terms of application of constrained optimization to more variables are estimated at each iteration. The advantage of AL for generic problems, Bayesian Optimization (BO) with constraints incorporating into Boosting framework is its simplicity and smooth- [9, 11] could be applicable. However, in our problem, objective func- ness. With AL, we introduce dual variables in Boosting. The dual tions and gradients are available, even if it is a surrogate function, variables are iteratively optimized and fit well within the Boosting and exploiting gradients in optimization should be more efficient. iterations. The Boosting objective function is replaced by the un- Incorporation of BO on top of our approach to fine tune metrics or constrained AL problem and the gradient is readily derived using hyper parameters in Boosting component could be an interesting the LambdaMART gradients. With the gradient and updates of dual direction in the future. variables, we solve the optimization problem by jointly iterating AL and Boosting steps. To the best of our knowledge, our work is 3 FORMULATION AND ALGORITHM the first to explicitly introduce constrained optimization problem In this section, we provide formulation and an algorithm of the in Boosting and the first to apply it search relevance problems. proposed method in details. As a method for production modeling, Although in this paper, the proposed method is implemented in there are some requirements in design: latency in scoring and the a Boosting algorithm and applied for relevance modeling (rank- computational cost in training. For the former, we should avoid ing problems), it is naturally applicable to other algorithms such constructing large number of trees and complex structure of each as logistic regressions, SVM, and neural networks with various individual tree as it translates into latency degradation. For the lat- applications in classification and regression domain. ter, we should avoid large number of iterations, or nested iterations, Our code is currently incorporated in GBM in R. We plan to as the objective function evaluation can be expensive in search implement the algorithm into XGBoost and / or LightGBM and application. make them publicly available. Section 2 introduces existing work in multi-objective optimiza- tion in relevance ranking. Section 3 gives formulations and algo- 3.1 LambdaMART objective and gradient rithm of our proposed method. Section 4 illustrates experimental First, let us review the LambdaMART formulation. Suppose we have results and Section 5 concludes this paper. a set of queries Q = {q} and an index set of documents (products in product search) associated with each query I q . A document is de- noted by di with an index i ∈ I q . A set of pairs of document indices 2 RELATED WORK by P q = {(i, j)} with the relation Ri ▷q R j : di is more relevant than The MO problem in search relevance literature has been a popular d j for a given query q. Suppose also we have a single cost function research topic and there have been two major directions: combining to optimize, referred to as primary cost. Given a relevance model f , multiple objectives in a single model and aggregate multiple models for a query and a document, a score of the document is computed tuned for each objective. As for the single model approach, Dong et as a function of its input features that could be query dependent: q q al. [7] proposes an over-weighting model, based on GBrank [22], to si = f (x i ). A probability of the relevance relationship is modeled adjust importance weighing of examples coming from different data by a sigmoid function: sources, for incorporating recency into the web search relevance Prob(Ri ▷q R j ) = Prob (i, j) ∈ P q  ranking. Svore et al. [20] optimizes both human-labeled relevance   −1 and click, with the former being prioritized. Their objectives are (1)  q q  −σ s i −s j based on NDCG that are modeled by the LambdaMART-loss. The = 1+e combined objective function is a weighted linear combination of the two, which is in fact popular in literature [16, 21]. where σ is a parameter to determine the shape of the sigmoid Another popular approach is to use aggregation of multiple function. LambdaMART [3, 16] is based on the pairwise cost that rankers. Dai et al. [6] proposes a hybrid approach of label aggrega- is the cross-entropy with a rank-dependent weight to incorporate tion and mixture of experts for achieving recency and general rank- importance of high ranks. For a pair (i, j) ∈ P q , the cross entropy ing objectives. Kang et al. [12] relies on editorial grading for each is given by metric and relative preference to optimize both label aggregation   q q  and model aggregation in a supervised manner. Each component pm q q pm,q −σ s i −s j   c si , s j = ∆Z i, j log 1 + e (2) model is trained for each objective. They apply the method for optimizing three objectives that are general ranking (i.e. matching), pm,q distance and reputation. These methods require separate models for where ∆Z i, j is the difference of a metric such as NDCG between each objective and are not directly applicable in our setting where the (i, j) pair in one order and one that is flipped. For a metric like we deploy a single model in production. NDCG, relevant documents in higher ranked position gets a high pm,q Most of the existing algorithms rely on heuristics, such as weight- value and that in lower a low value. Therefore, the weight ∆Z i, j ing or linear combination of objectives / models, and require man- tends to be large when a high ranked position is involved in the pair. ual tuning of hyper-parameters, which becomes prohibitive as the For a query, the total cost c(s q ) with s q being a vector containing number of objectives becomes large. Our approach, in contrast, au- all documents for the query, i.e., s q = [s 1 , .., s I q ], is given by tomatically yields a model that satisfies minimal requirements over q q c pm s q = c pm si , s j Õ   MO’s, alleviating the effort of hand-tuning of hyper-parameters,  (3) which makes a clear distinction from the existing methods. (i, j)∈P q Multi-objective Relevance Ranking SIGIR 2019 eCom, July 2019, Paris, France The primary objective function of LambdaMART is a sum over all 3.2 Incorporating Augmented Lagrangian queries: method in Boosting q q C pm (s) = c pm (si , s j ) = c pm (s q ) Õ Õ Õ (4) Suppose all objectives are given in terms of cost functions. Just like q ∈Q (i, j)∈P q q ∈Q the primary objective function reviewed in 3.1, we use surrogated cost functions to optimize the metrics we desire. For example, sup- where s is a concatenated vector containing scores for all queries: pose we want to set minimum criteria in NDCG. In this case, we {s q |q ∈ Q }. use LambdaMART costs for NDCG and set the cost no greater than Boosting in LambdaMART is based on the GBT, which gener- the given upper-bound (UB) b, i.e., C t (s) ≤ b t , t = 1, . . . ,T . Given ates a linear combination of base functions that are the decision the constraints represented in terms of cost functions, we have the trees, which are learned to fit the gradient. Typically, in GBT, once following constraint optimization problem: a tree is constructed, the function value of each leaf is computed by an estimate of the Newton step for the node. More formally, min C pm (s) s.t . C t (s) ≤ b t , t = 1, ...,T . (9) s at an nt h iteration, leaf nodes {Rnl }lL=1 , where L is the number The Lagrangian is written by of leaves of a tree, are generated by CART for given input fea- T tures of a query and documents, and the gradient as a target: L (s, α ) = C pm (s) + α t C t (s) − b t Õ  q q (10) {(x i , дn (f (x i )))}q ∈Q, i ∈I q , with дn being the gradient with respect t to a score. The function value for each node is given as: γnl = where α = α 1 , ..., α T is a vector of dual variables. The Lagrangian      2  −1 q Í pm /∂ s q Í − x q ∈Rnl ∂ 2C pm /∂si q x ∈R nl ∂C i . Therefore, is solved by minimizing with respect to the primal variables s i i and maximizing with respect to the dual variables α . AL itera- the score, or the prediction function is given as follows: tively solves the constraint optimization while alleviating non- N Õ L smoothness in α arising in the dual. In our problem, the Lagrangian q q q q Õ si = f (x i ) = f 0 (x i ) + γnl δ (x i ∈ Rnl ) (5) at iteration k is written as follows: n=1 l =1 T Lk (s, α ) = C pm (s) + α t C t (s) − b t Õ q  where f 0 (x i ) is an initial base function that could be provided t by prior knowledge and δ is the Kronecker delta. To derive the (11) T 2 gradient, it is handy to rewrite the query cost as follows: 1  t α − α kt −1 Õ − q 2µ k c pm (s q ) = c pm (si ) t Õ i ∈I q where α kt −1 is a solution in the previous iteration and a constant in (6) the current iteration k. µ kt is a sufficiently large constant associated q q q q ª c pm (si , s j ) + c pm (s j , si )® Õ© Õ Õ = ­ with each dual variable α t . Note that the last term is newly added i ∈I q « j:(i, j)∈P q j:(j,i)∈P q ¬ as compared with Eq. (10) and it gives proximal minimization with pm q pm q q pm q q iterates α kt −1 , to make the optimization smooth. with c (si ) ≡ j:(i, j)∈P q c (si , s j ) + j:(j,i)∈P q c (s j , si ). Í Í We maximize the Lagrangian with respect to α ≥ 0 and minimize The first term computes the pairwise cost between di and oth- with respect to s. ers that are less relevant than di . The second term computes that max min Lk (s, α ) (12) between di and others that are more relevant than di . The gra- α ≥0 s dient used in LambdaMART [3, 4] is readily computed by taking From the stationary condition ∂Lk /∂α t = 0, we obtain the update pm,q derivative with respect to the current score. By defining λi ≡ formula for α:   −1 α kt = max 0, µ kt C t (s) − b t + α kt −1 .   q q   pm pm,q pm,q σ s −s  ∂cq /∂si and λi j ≡ −σ ∆Z i, j 1+e i j , we have (13) the gradient formula in terms of λ’s. At an iteration k, if the constraint t is not satisfied, i.e., C t (s) > b t , we have α t > α t , which means the Lagrange multiplier pm,q pm,q pm,q k k−1 Õ Õ λi = λi j − λ ji (7) α t increases unless the constraint is already satisfied. Intuitively, j:(i, j)∈P q j:(j,i)∈P q we can consider it as weighting that is adjusted each iteration to   2 overweight an unsatisfied constraint that is associated the cost we pm,q q Similarly, the second order derivative, defined as ρ i ≡ ∂ 2c pm /∂ si want to improve. Note if a constraint should be strictly satisfied at is given as follows: optimality, α t should take value 0 to maximize the Lagrangian. If pm,q Õ pm,q pm,q  pm,q  a constraint should be satisfied with equality, α t can take a finite ρi = σ2 ∆Z i, j ρi j 1 − ρi j value. By restricting the solution to the former case, we can push j:(i, j)∈P q the dual variable to 0 whenever the constraint is satisfied: (8) pm,q pm,q pm,q Õ   − σ2 ∆Z i, j ρ ji 1 − ρ ji if C t (s) − b t < 0 ( t 0, j:(j,i)∈P q αk = t t t t (14) µ k C (s) − b + α k −1 , otherwise    −1 Intuitively, we can avoid unnecessary iterations to find out α = 0  q q  pm,q σ s −s where we define ρ i j ≡ 1+e i j . by simply pushing α to zero whenever the constraint is met. Our SIGIR 2019 eCom, July 2019, Paris, France Michinari Momma, et al. preliminary study shows the update Eq. (14) works better than that Input: Number of trees N , number of leaves per tree L, in Eq. (13) in both primary and sub-objectives. Throughout this learning rate η, AL parameter µ t . Initial Lagrange paper, we adopt Eq. (14) as the update scheme for dual variables. multiplier estimate α 0t = 0, t = 1, ..,T . Given initial As for the primal variables, the first order derivatives are given BaseModel as follows: foreach q ∈ Q do q q ! f 0 (x i ) = BaseModel(x i ), i ∈ I q ∂Lk Õ pm,q Õ t t,q Õ q /* If BaseModel is empty, set f 0 (x i ) = 0 q q = λi + α λt = λi (15) */ ∂si q ∈Q,i ∈I q t q ∈Q,i ∈I q end q pm,q t,q for n=1 to N do where we define λi ≡ λi + t α t λi . Similarly, by defining foreach q ∈ Q, i ∈ I q do Í q pm,q t,q q pm,q ÍT t λt,q , ρt ≡ ρi + t α t ρ i , the second order derivative with respect λi = λi + t =1 α n−1 Í i to a score is simply given as follows: q pm,q ÍT t ρ t,q ρi = ρi + t =1 α n−1 i ! end ∂ 2 Lk Õ pm,q Õ t t,q Õ q  2 = ρi + α ρi = ρ i (16) {Rnl }lL=1 q q q ∂ si q ∈Q,i ∈I q t q ∈Q,i ∈I q /* Create an L leaf tree on {(x i , λi )}q ∈Q,i ∈I q */ q One interpretation of the AL method is a penalty method on Í q λ q∈Q, x i ∈R nl i constraint violations. To see it clearly, we can plug the stationary γnl = − Í q q ρ q∈Q, x i ∈R nl i condition Eq. (13) into Eq. (11), yielding /* Assign leaf values on Newton step estimate Lk (s, α ) = C pm (s) + α t C t (s) − b t Õ */  t ∈ {α t >0} foreach q ∈ Q, i ∈ I q do q q q   µ kt (17) fn (x i ) = fn−1 (x i ) + η l γnl δ x i ∈ Rnl /* Take Í t C (s) − b t Õ 2 + + const. step with learning rate η */ 2 t ∈ {α t >0} for t=1 to T do t It is evident that it penalizes constraint violations associated with  C (s) with Compute cost α t > 0 with quadratic function. Therefore, setting a high value q q {si = fn x i }q ∈Q,i ∈I q of µ kt imposes the constraint more strictly but on the other hand, Update α nt via Eq. (13) or Eq. (14) setting a too high value may introduce a non-smoothness behavior end which AL tries to avoid. end end 3.3 An approach to Augmented Lagrangian Algorithm 1: AL-LambdaMART (AL-LM) with LambdaMART Both AL and LambdaMART algorithms run by iterations. We pro- pose a simple approach to integrate the two; Conduct an AL step calculation to update α at each Boosting iteration, which means 4.1 Dataset and objectives the AL iteration index k is set equal to the Boosting iteration index The dataset consists of search queries, input features (e.g., query, n. As for the AL parameter µ kt , we take a fixed policy on itera- product, and query-product dependent features such as product tion: µ kt = µ t . A variant of algorithm would increase the value as sales, customer review, textual matches between a query and prod- iterations proceeds. Algorithm 1 shows the steps of the algorithm. ucts), as well as customer’s purchase decision. We follow the basic Most notably, the additional component to the original Lamb- modeling practice described in [18, 19]. We collect training data for daMART at the n-th iteration step is the gradient computation: a month worth of the data followed by a week worth of the data q q λi ,ρ i and the update of α nt . The modification to existing solvers for evaluation. For this study, we focus on purchase as a primary such as GBM in R [10], XGBoost [5] , and LightGBM [13] should objective. be a minimal effort. As for sub-objectives, we identify four. The first one (t1) is to surface set of products that have relatively good quality and popu- 4 EXPERIMENTS lar to a certain customer segment (e.g., customers with subscrip- In this section, we analyze the algorithm by a numerical study. The tions or customers who are more interested in trending products.); dataset we use is a product search data collected and generated the insight being promoting popular products of high quality for in an e-commerce service. We first study how to set the optimiza- such a segment while others still have the impressions on such tion parameter µ and the cost upper-bounds to ultimately achieve products. The 2nd – 4th ones are products that contain additional modeling goals using smaller sampled dataset. Then we apply the benefit/service to such a customer segment; 2nd (t2) offers some settings to the full dataset for building a production-ready model. lowest benefit to the customer of the three and 3rd (t3) and 4th (t4) Our prototyping code for AL-LM is currently incorporated in GBM offer superior benefits in an increasing order. Generally, benefits in R and used throughout of this section. are hierarchical and coverage is inversely proportional to it. Multi-objective Relevance Ranking SIGIR 2019 eCom, July 2019, Paris, France Table 1: Baseline unconstrained results. (a) Costs w.r.t. #iterations C_t1 C_t2 b_t1 b_t2 C_t1 C_t2 C_pr C_pr purchase t1 t2 0.08 0.25 cost NDCG cost NDCG cost NDCG 0.07 0.2 train 0.106 0.826 0.038 0.932 0.047 0.842 0.06 0.15 Constraint Costs valid 0.133 0.809 0.038 0.930 0.045 0.845 Primary Cost 0.05 0.1 0.04 Table 2: Upper bounds (b t 1 , b t 2 ) set by cost reduction rate 0.03 0.05 from baseline. 0.02 0 0.01 -0.05 %-cost reduction 0 -0.1 5% 10% 20% 30% 40% 50% 1 11 21 31 41 51 61 71 81 91 t1 0.036 0.034 0.030 0.026 0.023 0.019 Iteration t2 0.045 0.043 0.038 0.033 0.028 0.024 (b) AL update: 𝛼 alpha_t1 alpha_t2 3.5 4.2 Multi-objective ranking illustration 3 2.5 log(𝛼 +1) First, we show how the algorithm progresses by iterations. In this 2 study, we use two constraints for simplicity. We randomly sample 1.5 1 20K search impressions from the dataset and split into 50% for train- 0.5 ing and validation sets. We run the algorithm up to 100 iterations for 0 1 11 21 31 41 51 61 71 81 91 illustration purpose. The UB b t are set based on the unconstrained Iteration baseline; we run the unconstraint problem first, which is exactly the same as the original LambdaMART algorithm. Table 1 shows cost and NDCG for the purchase, t1 and t2. Then the UB values are Figure 1: (a) Cost curves in AL-LM. µ is set to 10K. Solid curves set based on a cost reduction rate, such as 5, 10, . . . , 50%, from the represent training results and dotted validation. Solid lines unconstrained baselines. Table 2 shows actual values of UB’s used represent UB’s. (b) how α changes over iterations shown in in the experiments. log-scale. 4.2.1 Cost curve illustration. Figure 1 (a) illustrates curves of costs Table 3: Relative margin by UB reduction rates and µ, for t1 that are the objectives in the problem. The value of α nt is also shown in Figure 1 (b). We set UB for 20% cost reduction against the uncon- strained baseline and set µ = 10K, which is a setting that is found to UB reduction rate (%) µ data be large enough as studied in next subsection. In cost curves, solid 5% 10% 20% 30% 40% 50% tr 19.28 13.87 6.40 3.72 2.45 -0.28 curves correspond to the cost values on the training data and dotted 10000 val 21.13 15.30 7.33 4.20 3.24 0.31 curves those on the validation data. Note solid lines represent the tr 17.81 13.22 6.62 0.02 0.70 -0.09 UB’s. As seen in the Figure 1 (a), the initial few iterations try to 1000 val 19.05 13.84 6.87 1.00 1.03 0.80 satisfy constraints aggressively. Then spend a number of iterations tr 18.31 14.02 5.80 -0.06 -0.65 -1.66 in the over-satisfied region, gradually challenging the UB’s. The 100 val 18.82 13.90 5.98 0.94 -0.05 -1.17 behavior of α can be seen in (b). α is pushed to zero when the tr 13.08 12.15 6.97 -0.39 -1.56 -3.28 constraint is met during the iterations. Then when a constraint 10 ts 14.25 13.17 7.08 -0.55 -1.28 -2.98 violation occurs, a finite value of α kicks in again. At iteration 80, α t 2 actually gets 0.79, as seen in Figure 1 (b). Validation results are also shown as dotted curves in Figure 1 (a). Constraint satisfaction value of µ, i.e., µ = 1K or 10K can satisfy the constraints. For above in training tend to be generalizable to that in validation set. 40% reduction, even µ = 10K suffers from infeasibility. As µ = 10K satisfies most constraints for the cost reduction rate for the training 4.2.2 Optimization parameter µ. In AL, the optimization parameter data, a sensible choice for the value of µ would be at least 10K. µ kt can be any sufficiently large value. In our experiments, we set µ nt to be a constant across all iterations and constraint. We vary 4.2.3 Cost reduction and NDCG gain. Now, we look at the cost the value from (10, 100, 1K, 10K) and see how the constraints are reduction rates and NDCG gains over the unconstrained cases as satisfied, over different cost UB reduction rates that ranges from 5% baseline. Table 5 shows for each objective, how the attained cost through 50%, see Table 1 for the values used for each objective. As a reduction rates and NDCG gains vary with different UB settings. metric to measure constrained satisfaction, we use relative margin Note for the constraints, the cost values are consistent with Table 4 that is defined as (b t − C t )/b t . Negative value means constraint and 3, as they are computed based on values in Table 5. violation. In Table 4 and 3, obviously, the more UB values are set As we have tighter UB values, cost associated with the constraint aggressively, the less chance the constraints can be satisfied. For is reduced, which improves the NDCG gain. For the primary ob- both constraints, when UB’s are set 5%, all constraints are met even jective, purchase, the behavior is opposite, which is all expected for µ is as small as 10. However, as UB is set up to 30%, only larger as we are tightening constraints. An important observation is the SIGIR 2019 eCom, July 2019, Paris, France Michinari Momma, et al. Table 4: Relative margin by UB reduction rates and µ, for t2 Table 6: Production scale modeling results in %-gain. For over-weighting (OW), 7 models achieve +1% gain criteria out UB reduction rate (%) of 144 trials in grid search. Min, max and avg performances µ data 5% 10% 20% 30% 40% 50% of the 7 models are shown. For AL-LM, one denotes use of tr 3.09 2.01 0.65 6.19 -0.98 -0.18 estimated UB’s (one-shot model) and adjust applying adjust- 10000 val 3.34 2.11 1.50 7.22 0.39 0.29 ment over the one-shot model. The bold letter shows statisti- tr 2.16 1.31 0.10 0.07 -0.45 -1.15 cally significance in comparing OW (max) and AL-ML (one). 1000 val 1.97 1.51 0.63 0.92 0.61 -0.90 tr 2.61 -0.09 0.46 -1.14 -0.97 -2.10 pur- t1 t2 t3 t4 100 val 2.57 -0.51 0.96 -0.59 0.00 -1.54 chase @5 @22 @5 @22 @5 @22 @5 @22 tr 0.70 -0.46 -1.57 -1.64 -3.20 -5.26 OW (min) -0.10 2.59 1.98 1.15 1.00 1.66 1.19 2.01 1.54 10 OW (max) -0.03 3.52 2.67 1.66 1.41 2.33 1.67 2.74 2.08 ral 0.47 -0.04 -1.55 -1.50 -2.64 -5.24 OW (avg) -0.05 2.90 2.20 1.27 1.10 1.79 1.30 2.26 1.68 AL-LM (one) -0.07 3.21 1.83 4.28 3.08 3.38 2.29 4.19 2.77 Table 5: Cost reduction (%) and NDCG-gain for each objec- AL-LM (adj.) -0.03 2.12 1.26 3.13 2.16 2.73 1.97 3.36 2.64 tive, by different UB’s, with µ=10K training / upper-bound reduction% validation / upper-bound reduction% 5% 10% 20% 30% 40% 50% 5% 10% 20% 30% 40% 50% 4.3 Production-scale modeling Cost red. 24.28 23.87 26.40 33.72 42.45 49.72 26.13 25.30 27.33 34.20 43.24 50.31 4.3.1 Applying AL-ML in the full dataset. In this subsection, we t1 NDCG 1.62 1.57 1.74 2.32 3.06 3.74 1.76 1.66 1.81 2.34 3.16 3.74 Cost red. 8.09 12.01 20.65 36.19 39.02 49.82 8.34 12.11 21.50 37.22 40.39 50.29 show results on a larger sampled data to simulate a production t2 NDCG 0.92 1.68 3.49 6.58 7.09 9.30 1.06 1.87 3.46 6.76 7.21 9.34 modeling with more sub-objectives. To this end, we increase the pur- Cost red. 0.02 -0.22 -0.02 -1.25 -2.54 -5.18 0.21 0.18 0.29 -0.82 -1.88 -3.60 samples to ∼1MM search impressions for the training and ∼500K chase NDCG -0.23 -0.01 -0.21 -0.63 -1.03 -1.58 0.18 0.18 0.11 -0.20 -0.23 -0.96 search impressions for the hold-out evaluation data sets and use the full four sub-objectives. Each sub-objective is computed by a binary target value, just like the purchase target. The goal of the modeling Cost reduction rate vs. NDCG gain (train) Cost reduction rate vs. NDCG gain (valid.) is to achieve at least 1% gain in the number of products with the t1 t2 Linear (t1) Linear (t2) t1 t2 Linear (t1) Linear (t2) 10% 10% positive sub-objective value, i.e., presence of products eligible for y = 0.181x y = 0.180x some customer benefit in top-5 and 22 ranks, while minimizing the 8% 8% impact on the purchase NDCG. We use relationship between the metrics (top-K) and cost as illustrated in Figure 2 to estimate the 6% 6% values of UB’s by using smaller number of samples. In other words, 4% 4% as long as we know the relationship between the metric value and y = 0.071x y = 0.071x cost value, we can achieve the goal of +1% gain by just setting the 2% 2% UB’s. Namely, we choose 14% reduction for b t 1 and 1/0.181 = 5.5% 0% 0% reduction for b t 2 and adjust if the constraint is too tight to satisfy, 0% 10% 20% 30% 40% 50% 60% 0% 10% 20% 30% 40% 50% 60% or over-satisfied. Note as we find t3 and t4 follows very similar pattern as t2, we use the same setting as t2 to them. Figure 2: Relationship between cost reduction and NDCG 4.3.2 OW method as a baseline. As a baseline method to com- gain pare against, we tune models by over-weighting (OW) over the sub-objectives. Basically, in the OW method, we identify search impressions that contain products with the positive sub-objective validation results are consistent with those of training, which means value for applying over-weightings. We introduce weights on the the constraint satisfaction in training generalizes well at least for pairwise cost computation so that we can influence the cost func- the dataset examined. Note constraint t2 is over-satisfied with large tion depending on products that matches a query. For example, margin in cost. This is due to the correlation between the target if a product has a value one in the target t1, i.e., the product is values associated with t1 and t2, as that of t1 is available only if eligible for some benefit, and we want to optimize t1 over other the target value of t2 is positive, which partially explains t1 is a sub-objectives, we put high weight on the cost associated with t1 so tighter constraint to be satisfied than t2. the ranking is more likely to optimized for the search impressions The relationship between the cost reduction and the NDCG gains containing products with t1 as a feature. When there are multiple is illustrated in Figure 2. The two metrics are quite consistent and sub-objectives, we need to combinatorially tune multiple weighting fit well by linear lines. This means, we can estimate cost reduction schemes to concurrently achieve multiple objectives. value for a given NDCG gain requirement, by using the approx- The weight values are tuned first by manually finding a reason- imately liner relationship. Once we have the cost reduction rate, ably good weighting parameter ranges and run grid search to fine we can set it as the UB. This observation is quite useful in practice tune. As there are four dimensional space to explore and too time where NDCG gain values would likely be the criteria for offline consuming for the problem size, we only search weighting parame- modeling. ters for t1 and t2; we rely on correlation between t2 and t3 or t4 to Multi-objective Relevance Ranking SIGIR 2019 eCom, July 2019, Paris, France optimize overall metrics. Despite the search space reduction, we REFERENCES end up building 144+ models. [1] Pia Borlund. 2003. The Concept of Relevance in IR. J. Am. Soc. Inf. Sci. Technol. 54, 10 (Aug. 2003), 913–925. https://doi.org/10.1002/asi.10286 [2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA. 4.3.3 Results. Table 6 shows the results on the evaluation set from [3] Chris J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An OW and AL-LM, which are measured by %-gain with respect to Overview. Technical Report. https://www.microsoft.com/en-us/research/ publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/ the unconstrained baseline. For OW, only 7 out of 144 trials over [4] Christopher J. Burges, Robert Ragno, and Quoc V. Le. 2007. Learning to Rank the grid are found as feasible solutions. We report min, max and with Nonsmooth Cost Functions. In Advances in Neural Information Processing avд among the solutions. For AL-LM, we already have estimates of Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.). MIT Press, 193– 200. http://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost- each constraint from Figure 2. A similar preliminary experiment functions.pdf gives estimate of other cost UB’s. [5] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting Both methods achieve the 1% gain criteria for all metrics. While System. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, AL-LM achieves higher gains for t2 – t4 and purchase NDCG is 785–794. https://doi.org/10.1145/2939672.2939785 flat, the model shows slight over satisfaction and we apply some [6] Na Dai, Milad Shokouhi, and Brian D. Davison. 2011. Learning to Rank for adjustment (relax UB’s by 20%: 14% reduction for b t 1 to 11% and Freshness and Relevance. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11). 4%), yielding slight improvement on purchase (insignificant) with ACM, New York, NY, USA, 95–104. https://doi.org/10.1145/2009916.2009933 less margin on t1. In terms of efficiency, AL-LM is a clear win [7] Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruiqiang Zhang, Karolina Buchner, Ciya Liao, and Fernando Diaz. 2010. Towards Recency Ranking as it requires only two model builds, one for getting the baseline in Web Search. In Proceedings of the Third ACM International Conference on cost value and the other by applying cost reduction UB’s estimated Web Search and Data Mining (WSDM ’10). ACM, New York, NY, USA, 11–20. by smaller sample. We could put efforts on adjustment if the one- https://doi.org/10.1145/1718487.1718490 [8] Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting shot model has issues, which can be resolved slight tightening or machine. Ann. Statist. 29, 5 (10 2001), 1189–1232. https://doi.org/10.1214/aos/ relaxing constraints. OW, on the other hand, requires exploration 1013203451 of the weighting schemes and the success rate is 7/144 = 4.8%, [9] Jacob R. Gardner, Matt J. Kusner, Zhixiang Xu, Kilian Q. Weinberger, and John P. Cunningham. 2014. Bayesian Optimization with Inequality Constraints. which requires 21 trials on average and 61 trials at 95% confidence. In Proceedings of the 31st International Conference on International Conference Addition of further sub-objectives, such as search defect and on Machine Learning - Volume 32 (ICML’14). JMLR.org, II–937–II–945. http: //dl.acm.org/citation.cfm?id=3044805.3044997 adult products tend to be much less correlated with the constraints [10] Ridgeway Greg. 2013. gbm: Generalized Boosted Regression Models. https://cran.r- we studied in this paper, which means more combinatorial explo- project.org/web/packages/gbm/index.html ration (exponential in T ) is needed for existing methodologies. AL- [11] José Miguel Hernández-Lobato, Michael A. Gelbart, Ryan P. Adams, Matthew W. Hoffman, and Zoubin Ghahramani. 2016. A General Framework for Constrained LM is the choice for MO modeling, as it only requires setting UB’s Bayesian Optimization Using Information-based Search. J. Mach. Learn. Res. 17, and some adjustment around the one-shot model: (linear in T ). 1 (Jan. 2016), 5549–5601. http://dl.acm.org/citation.cfm?id=2946645.3053442 [12] Changsung Kang, Xuanhui Wang, Yi Chang, and Belle Tseng. 2012. Learning to Rank with Multi-aspect Relevance for Vertical Search. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12). ACM, 5 CONCLUSION AND FURTHER WORK New York, NY, USA, 453–462. https://doi.org/10.1145/2124295.2124350 [13] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, In this paper, we introduced an Augmented Lagrangian based Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting method to address challenges with the multi-objective optimization Decision Tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett in relevance modeling. Specifically, we incorporated a constraint (Eds.). Curran Associates, Inc., 3146–3154. http://papers.nips.cc/paper/6907- optimization method into Boosting so that modelers can use it very lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf [14] Mahdi Milani Fard, Quentin Cormier, Kevin Canini, and Maya Gupta. 2016. easily as an extension to the LambdaMART, one of the most pop- Launch and Iterate: Reducing Prediction Churn. In Advances in Neural Information ular Boosting methods in this domain. The explicit introduction Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and of constrained optimization is a novel contribution of this paper R. Garnett (Eds.). Curran Associates, Inc., 3179–3187. http://papers.nips.cc/ paper/6053-launch-and-iterate-reducing-prediction-churn.pdf and thus its application to relevance ranking is novel. Experimental [15] S. Mizzaro. 1998. How many Relevances in Information Retrieval? Interacting results showed the proposed AL-LM can indeed efficiently resolve With Computers 10, 3 (1998), 305–322. the MO problem. The impact of the outcome would be not limited [16] Phong Nguyen, John Dines, and Jan Krasnodebski. 2017. A Multi-Objective Learning to re-Rank Approach to Optimize Online Marketplaces for Multiple to reducing the model development lead time; modelers can make Stakeholders. CoRR abs/1708.00651 (2017). arXiv:1708.00651 http://arxiv.org/ more effort in developing new features by using the time saved abs/1708.00651 [17] J. Nocedal and S. Wright. 2006. Numerical Optimization (2 ed.). and an automated model refresh can be realizable by incorporating Springer. http://books.google.com.tr/books?id=VbHYoSyelFcC,/bib/ various constraints automatically. In fact, we have been quite suc- nocedal/nocedal2006numerical/%28Springer%20series%20in%20operations% cessful in apply the multiple objective models built with AL-LM to 20research%29%20Jorge%20Nocedal%2C%20Stephen%20Wright-Numerical% 20Optimization-Springer%20%282006%29.pdf,http://www.bioinfo.org.cn/ modeling projects in production. ~wangchao/maa/Numerical_Optimization.pdf Our code is currently incorporated in GBM in R. We plan to [18] Daria Sorokina. 2016. Amazon Search: The Joy of Ranking Products. implement the algorithm into XGBoost and / or LightGBM and https://mlconf.com/mlconf-2016-sf/. [19] Daria Sorokina and Erick Cantu-Paz. 2016. Amazon Search: The Joy of Ranking make it publicly available. Products. In Proceedings of the 39th International ACM SIGIR Conference on Re- Although we applied the method on relevance modeling, it is search and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, USA, 459–460. https://doi.org/10.1145/2911451.2926725 applicable to wider problem domains, i.e., classification and regres- [20] Krysta M. Svore, Maksims N. Volkovs, and Chris J.C. Burges. 2011. Learn- sions, and other machine learning methods such as neural networks, ing to Rank with Multiple Objective Functions, In Proceedings of WWW etc. Extension to wider application and methodology is a future 2011. https://www.microsoft.com/en-us/research/publication/learning-to-rank- with-multiple-objective-functions/ work as well. SIGIR 2019 eCom, July 2019, Paris, France Michinari Momma, et al. [21] Lidan Wang, Paul N. Bennett, and Kevyn Collins-Thompson. 2012. Robust Rank- Application to Learning Ranking Functions for Web Search. In Ad- ing Models via Risk-sensitive Optimization. In Proceedings of the 35th International vances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, ACM SIGIR Conference on Research and Development in Information Retrieval (SI- Y. Singer, and S. T. Roweis (Eds.). Curran Associates, Inc., 1697–1704. GIR ’12). ACM, New York, NY, USA, 761–770. https://doi.org/10.1145/2348283. http://papers.nips.cc/paper/3305-a-general-boosting-method-and-its- 2348385 application-to-learning-ranking-functions-for-web-search.pdf [22] Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, and Gordon Sun. 2008. A General Boosting Method and its