Multi-objective Relevance Ranking
               Michinari Momma                                      Alireza Bagheri Garakani                                 Yi Sun
                 michi@amazon.com                                         alirezg@amazon.com                          yisun@amazon.com
                  Amazon.com Inc.                                            Amazon.com Inc.                           Amazon.com Inc.
                     Seattle, WA                                                Seattle, WA                               Seattle, WA
ABSTRACT                                                                              input features to a model. For example, Amazon Search [18, 19]
In this paper, we introduce an Augmented Lagrangian based method                      has a large number of features to capture relevance with a feature
in a search relevance ranking algorithm to incorporate the multi-                     repository consisting of product features such as sales and customer
dimensional nature of relevance and business constraints, both of                     reviews, queries / context features such as query specificity or cus-
which are the requirements for building relevance ranking mod-                        tomer status, as well as textual similarity features between query
els in production. The off-the-shelf solutions cannot handle such                     and products.
complex objectives and therefore, modelers are left hand-tuning                           Business constraints are additional requirements in production
of parameters that have only indirect impact to the objectives, at-                   modeling. Some are derived from existing relevance metrics proven
tempting to incorporate multiple objectives (MO) in a model. This                     to be effective over time and are sought to retain in model refresh to
process is time-consuming and tends to face sub-optimality. The                       ensure avoiding churn [14]. Some are derived from relevance such
proposed method is designed to systematically solve the MO prob-                      as latency to ensure quick search responses, and some are strategic
lem in a constrained optimization framework, which is integrated                      and examples include minimum %-gain to consider experimentation
with a popular Boosting algorithm and is, by all means, a novel                       / launch and avoiding adult items for media products. Reduction
contribution. Furthermore, we propose a procedure to specify the                      of search defects that are the search results that do not match the
constraints to achieve business goals and the exploration scales                      query in various aspects, can be considered for both as it gives
linearly in the number of constraints, while existing methodology                     better customer experience and business requirement being strict
can scale exponentially. The experimental results show that the                       as to ensure the quality of search results. An example of search
method successfully builds models that achieve MO criteria much                       defect is showing a cheap zirconium ring for a query “diamond
more efficiently th an ex isting me thods. Th e po tential im pact in-                ring”, which could give customers the impression that the search is
cludes significant reduction in model development time and allows                     broken, or the e-commerce site is more like a flea market, damaging
for automation of model refresh even with presence of several MO                      brand image of the service [19].
criteria, in real world production system scale with hundreds of                          As discussed above, relevance ranking modeling faces challenges
millions of records.                                                                  of dealing with multiple metrics of relevance and / or business con-
                                                                                      straints, i.e., multiple-objectives (MO). Off-the-shelf machine learn-
KEYWORDS                                                                              ing solutions [5, 10, 13], cannot handle such complicated objectives
Learning to rank; Multi-objective ranking optimization; Product                       in a systematic manner. Although the effect can be limited, one may
search; Web search                                                                    try to employ either over-weighting (OW) [7] by exemplars and/or
                                                                                      by search impressions, a set of query-product pairs for a session
ACM Reference Format:                                                                 on a given day, to influence the objective function in a desired way.
Michinari Momma, Alireza Bagheri Garakani, and Yi Sun. 2019. Multi-
                                                                                      Tuning of weight values is required in this approach and it can
objective Relevance Ranking. In Proceedings of the SIGIR 2019 Workshop
                                                                                      suffer combinatorial exploration to search for a best combination of
on eCommerce (SIGIR 2019 eCom), 8 pages.
                                                                                      weightings, which becomes prohibitive as the number of objectives
                                                                                      grows.
1 INTRODUCTION                                                                            To address the issue, we propose a constrained optimization
Relevance in information retrieval (IR) has been extensively studied                  method applied to the Gradient Boosting Tree (GBT). Specifically,
and applied to various areas such as web search, product search and                   we introduce a constraint optimization in the LambdaMART algo-
recommendation, etc. The concept of relevance is multi-                               rithm, the most prominent method for relevance modeling today.
dimensional, dynamic and evolutional [1, 15]. In product search, a                    LambdaMART is based on GBT [8] using CART [2] and employs
ranking of products is modeled by customer’s historical                               optimization on IR metrics such as the normalized cumulative dis-
behavior. Typically, signals such as purchase, add to cart or click                   counted gains (NDCG) or the mean reciprocal rank (MRR). One of
are used as a target variable and models are tuned to optimize                        our goals is to propose a practical approach to the MO problem /
rankings based on such behaviors. To address the multi-                               task in a production setting, which is constrained by memory and
dimensional nature of relevance, various features that represent                      model size, i.e., the number to trees, and efficiency of the learn-
relevance dimensions are used as                                                      ing algorithm to avoid regression of latency in scoring and model
                                                                                      development timeline.
Copyright © 2019 by the paper’s authors. Copying permitted for private and academic
purposes.                                                                                 In order to meet the requirements and limitations, adaptation
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.):                        of the Augmented Lagrangian (AL) method [17] to LambdaMART
Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at   (AL-LM) is proposed. AL converts the original constrained prob-
http://ceur-ws.org
                                                                                      lem into an unconstrained problem by adding penalty terms that
SIGIR 2019 eCom, July 2019, Paris, France                                                                                       Michinari Momma, et al.


penalize constraint violations. The Lagrange multipliers, i.e., dual         In terms of application of constrained optimization to more
variables are estimated at each iteration. The advantage of AL for        generic problems, Bayesian Optimization (BO) with constraints
incorporating into Boosting framework is its simplicity and smooth-       [9, 11] could be applicable. However, in our problem, objective func-
ness. With AL, we introduce dual variables in Boosting. The dual          tions and gradients are available, even if it is a surrogate function,
variables are iteratively optimized and fit well within the Boosting      and exploiting gradients in optimization should be more efficient.
iterations. The Boosting objective function is replaced by the un-        Incorporation of BO on top of our approach to fine tune metrics or
constrained AL problem and the gradient is readily derived using          hyper parameters in Boosting component could be an interesting
the LambdaMART gradients. With the gradient and updates of dual           direction in the future.
variables, we solve the optimization problem by jointly iterating
AL and Boosting steps. To the best of our knowledge, our work is          3     FORMULATION AND ALGORITHM
the first to explicitly introduce constrained optimization problem
                                                                          In this section, we provide formulation and an algorithm of the
in Boosting and the first to apply it search relevance problems.
                                                                          proposed method in details. As a method for production modeling,
Although in this paper, the proposed method is implemented in
                                                                          there are some requirements in design: latency in scoring and the
a Boosting algorithm and applied for relevance modeling (rank-
                                                                          computational cost in training. For the former, we should avoid
ing problems), it is naturally applicable to other algorithms such
                                                                          constructing large number of trees and complex structure of each
as logistic regressions, SVM, and neural networks with various
                                                                          individual tree as it translates into latency degradation. For the lat-
applications in classification and regression domain.
                                                                          ter, we should avoid large number of iterations, or nested iterations,
   Our code is currently incorporated in GBM in R. We plan to
                                                                          as the objective function evaluation can be expensive in search
implement the algorithm into XGBoost and / or LightGBM and
                                                                          application.
make them publicly available.
   Section 2 introduces existing work in multi-objective optimiza-
tion in relevance ranking. Section 3 gives formulations and algo-         3.1     LambdaMART objective and gradient
rithm of our proposed method. Section 4 illustrates experimental          First, let us review the LambdaMART formulation. Suppose we have
results and Section 5 concludes this paper.                               a set of queries Q = {q} and an index set of documents (products in
                                                                          product search) associated with each query I q . A document is de-
                                                                          noted by di with an index i ∈ I q . A set of pairs of document indices
2    RELATED WORK                                                         by P q = {(i, j)} with the relation Ri ▷q R j : di is more relevant than
The MO problem in search relevance literature has been a popular          d j for a given query q. Suppose also we have a single cost function
research topic and there have been two major directions: combining        to optimize, referred to as primary cost. Given a relevance model f ,
multiple objectives in a single model and aggregate multiple models       for a query and a document, a score of the document is computed
tuned for each objective. As for the single model approach, Dong et       as a function of its input features that could be query dependent:
                                                                           q         q
al. [7] proposes an over-weighting model, based on GBrank [22], to        si = f (x i ). A probability of the relevance relationship is modeled
adjust importance weighing of examples coming from different data         by a sigmoid function:
sources, for incorporating recency into the web search relevance
                                                                                           Prob(Ri ▷q R j ) = Prob (i, j) ∈ P q
                                                                                                                                 
ranking. Svore et al. [20] optimizes both human-labeled relevance
                                                                                                                                  −1
and click, with the former being prioritized. Their objectives are                                                                              (1)
                                                                                                                       
                                                                                                                          q   q
                                                                                                              
                                                                                                                    −σ s i −s j
based on NDCG that are modeled by the LambdaMART-loss. The                                                  = 1+e
combined objective function is a weighted linear combination of
the two, which is in fact popular in literature [16, 21].                 where σ is a parameter to determine the shape of the sigmoid
    Another popular approach is to use aggregation of multiple            function. LambdaMART [3, 16] is based on the pairwise cost that
rankers. Dai et al. [6] proposes a hybrid approach of label aggrega-      is the cross-entropy with a rank-dependent weight to incorporate
tion and mixture of experts for achieving recency and general rank-       importance of high ranks. For a pair (i, j) ∈ P q , the cross entropy
ing objectives. Kang et al. [12] relies on editorial grading for each     is given by
metric and relative preference to optimize both label aggregation                                                                      
                                                                                                                                 q    q
                                                                                                                     
and model aggregation in a supervised manner. Each component                             pm     q q          pm,q           −σ s i −s j
                                                                                                      
                                                                                     c         si , s j = ∆Z i, j log 1 + e                         (2)
model is trained for each objective. They apply the method for
optimizing three objectives that are general ranking (i.e. matching),                pm,q
distance and reputation. These methods require separate models for        where ∆Z i, j is the difference of a metric such as NDCG between
each objective and are not directly applicable in our setting where       the (i, j) pair in one order and one that is flipped. For a metric like
we deploy a single model in production.                                   NDCG, relevant documents in higher ranked position gets a high
                                                                                                                                                   pm,q
    Most of the existing algorithms rely on heuristics, such as weight-   value and that in lower a low value. Therefore, the weight ∆Z i, j
ing or linear combination of objectives / models, and require man-        tends to be large when a high ranked position is involved in the pair.
ual tuning of hyper-parameters, which becomes prohibitive as the          For a query, the total cost c(s q ) with s q being a vector containing
number of objectives becomes large. Our approach, in contrast, au-        all documents for the query, i.e., s q = [s 1 , .., s I q ], is given by
tomatically yields a model that satisfies minimal requirements over
                                                                                                                            q q
                                                                                              c pm s q =          c pm si , s j
                                                                                                            Õ                      
MO’s, alleviating the effort of hand-tuning of hyper-parameters,
                                                                                                      
                                                                                                                                                     (3)
which makes a clear distinction from the existing methods.                                                   (i, j)∈P q
Multi-objective Relevance Ranking                                                                                               SIGIR 2019 eCom, July 2019, Paris, France


The primary objective function of LambdaMART is a sum over all                            3.2    Incorporating Augmented Lagrangian
queries:                                                                                         method in Boosting
                                     q q
         C pm (s) =           c pm (si , s j ) =   c pm (s q )
                    Õ Õ                          Õ
                                                               (4)                        Suppose all objectives are given in terms of cost functions. Just like
                          q ∈Q (i, j)∈P q                      q ∈Q                       the primary objective function reviewed in 3.1, we use surrogated
                                                                                          cost functions to optimize the metrics we desire. For example, sup-
where s is a concatenated vector containing scores for all queries:                       pose we want to set minimum criteria in NDCG. In this case, we
{s q |q ∈ Q }.                                                                            use LambdaMART costs for NDCG and set the cost no greater than
    Boosting in LambdaMART is based on the GBT, which gener-                              the given upper-bound (UB) b, i.e., C t (s) ≤ b t , t = 1, . . . ,T . Given
ates a linear combination of base functions that are the decision                         the constraints represented in terms of cost functions, we have the
trees, which are learned to fit the gradient. Typically, in GBT, once                     following constraint optimization problem:
a tree is constructed, the function value of each leaf is computed
by an estimate of the Newton step for the node. More formally,                                          min C pm (s) s.t . C t (s) ≤ b t , t = 1, ...,T .            (9)
                                                                                                         s
at an nt h iteration, leaf nodes {Rnl }lL=1 , where L is the number                       The Lagrangian is written by
of leaves of a tree, are generated by CART for given input fea-                                                                  T
tures of a query and documents, and the gradient as a target:                                           L (s, α ) = C pm (s) +         α t C t (s) − b t
                                                                                                                                 Õ                         
    q            q                                                                                                                                                  (10)
{(x i , дn (f (x i )))}q ∈Q, i ∈I q , with дn being the gradient with respect                                                     t
to a score. The function value for each node is given as: γnl =
                                                                                          where α = α 1 , ..., α T is a vector of dual variables. The Lagrangian
                                                                                                                  
                                                        2  −1
                              q Í                 pm /∂ s q
   Í
− x q ∈Rnl ∂ 2C pm /∂si                 q
                                       x ∈R nl ∂C         i        . Therefore,           is solved by minimizing with respect to the primal variables s
        i                                 i
                                                                                          and maximizing with respect to the dual variables α . AL itera-
the score, or the prediction function is given as follows:
                                                                                          tively solves the constraint optimization while alleviating non-
                                               N Õ
                                                 L                                        smoothness in α arising in the dual. In our problem, the Lagrangian
                 q         q          q                          q
                                               Õ
                si = f (x i ) = f 0 (x i ) +              γnl δ (x i ∈ Rnl )       (5)    at iteration k is written as follows:
                                               n=1 l =1                                                                          T
                                                                                                        Lk (s, α ) = C pm (s) +     α t C t (s) − b t
                                                                                                                                Õ
                 q
                                                                                                                                                      
where f 0 (x i ) is an initial base function that could be provided
                                                                                                                                   t
by prior knowledge and δ is the Kronecker delta. To derive the                                                                                                      (11)
                                                                                                                    T                  2
gradient, it is handy to rewrite the query cost as follows:                                                            1  t
                                                                                                                           α − α kt −1
                                                                                                                    Õ
                                                                                                                  −
                                q                                                                                     2µ k
      c pm (s q ) =      c pm (si )                                                                                 t
                     Õ

                         i ∈I q                                                           where α kt −1 is a solution in the previous iteration and a constant in
                                                                                    (6)   the current iteration k. µ kt is a sufficiently large constant associated
                                       q q                              q q ª
                                c pm (si , s j ) +              c pm (s j , si )®
           Õ© Õ                                        Õ
            =    ­                                                                        with each dual variable α t . Note that the last term is newly added
          i ∈I q « j:(i, j)∈P q                    j:(j,i)∈P q                  ¬         as compared with Eq. (10) and it gives proximal minimization with
       pm    q                         pm      q q                          pm    q q     iterates α kt −1 , to make the optimization smooth.
with c (si ) ≡ j:(i, j)∈P q c (si , s j ) + j:(j,i)∈P q c (s j , si ).
                       Í                                  Í
                                                                                             We maximize the Lagrangian with respect to α ≥ 0 and minimize
The first term computes the pairwise cost between di and oth-
                                                                                          with respect to s.
ers that are less relevant than di . The second term computes that
                                                                                                                      max min Lk (s, α )                       (12)
between di and others that are more relevant than di . The gra-                                                      α ≥0 s
dient used in LambdaMART [3, 4] is readily computed by taking                             From the stationary condition ∂Lk /∂α t = 0, we obtain the update
                                                                                pm,q
derivative with respect to the current score. By defining λi                         ≡    formula for α:
                                                                       −1
                                                                                                       α kt = max 0, µ kt C t (s) − b t + α kt −1 .
                                                           
                                                               q   q
                                                                                                                                                
  pm                pm,q                 pm,q             σ s −s
                                                                                                                                       
∂cq /∂si and λi j           ≡ −σ ∆Z i, j            1+e i j                  , we have                                                                 (13)

the gradient formula in terms of λ’s.                                                    At an iteration k, if the constraint t is not satisfied, i.e., C t (s) >
                                                                                         b t , we have α t > α t , which means the Lagrange multiplier
             pm,q              pm,q                            pm,q                                      k      k−1
                        Õ                                  Õ
            λi    =           λi j    −                       λ ji              (7)      α t increases unless the constraint is already satisfied. Intuitively,
                        j:(i, j)∈P q           j:(j,i)∈P q                               we can consider it as weighting that is adjusted each iteration to
                                                                                     2 overweight an unsatisfied constraint that is associated the cost we
                                                              pm,q                  q
Similarly, the second order derivative, defined as ρ i               ≡ ∂ 2c pm /∂ si     want to improve. Note if a constraint should be strictly satisfied at
is given as follows:                                                                     optimality, α t should take value 0 to maximize the Lagrangian. If
          pm,q
                           Õ             pm,q pm,q
                                                          
                                                                   pm,q
                                                                                        a constraint should be satisfied with equality, α t can take a finite
         ρi      = σ2                ∆Z i, j   ρi j         1 − ρi j                     value. By restricting the solution to the former case, we can push
                      j:(i, j)∈P q                                                       the dual variable to 0 whenever the constraint is satisfied:
                                                                                (8)
                                        pm,q pm,q                  pm,q
                           Õ                                           
                 − σ2                ∆Z i, j   ρ ji         1 − ρ ji                                                                if C t (s) − b t < 0
                                                                                                        (
                                                                                                   t      0,
                      j:(j,i)∈P q                                                                 αk = t t             t     t                                (14)
                                                                                                          µ k C (s) − b + α k −1 , otherwise
                                                                                                                         
                                                   −1
                                                                                         Intuitively, we can avoid unnecessary iterations to find out α = 0
                                          
                                             q q
                                
                    pm,q               σ s −s
where we define ρ i j        ≡ 1+e i j                  .
                                                                                         by simply pushing α to zero whenever the constraint is met. Our
SIGIR 2019 eCom, July 2019, Paris, France                                                                                              Michinari Momma, et al.


preliminary study shows the update Eq. (14) works better than that                    Input: Number of trees N , number of leaves per tree L,
in Eq. (13) in both primary and sub-objectives. Throughout this                                 learning rate η, AL parameter µ t . Initial Lagrange
paper, we adopt Eq. (14) as the update scheme for dual variables.                               multiplier estimate α 0t = 0, t = 1, ..,T . Given initial
   As for the primal variables, the first order derivatives are given                           BaseModel
as follows:                                                                           foreach q ∈ Q do
                                                                                                 q                  q
                                             !                                            f 0 (x i ) = BaseModel(x i ), i ∈ I q
     ∂Lk        Õ         pm,q
                                 Õ
                                       t t,q
                                                    Õ         q                            /* If BaseModel is empty, set f 0 (x i ) = 0
                                                                                                                                            q
        q =              λi    +     α λt      =             λi   (15)                                                                                  */
      ∂si    q ∈Q,i ∈I q          t              q ∈Q,i ∈I q                          end
                     q       pm,q           t,q
                                                                                      for n=1 to N do
where we define λi ≡ λi            + t α t λi . Similarly, by defining                    foreach q ∈ Q, i ∈ I q do
                                     Í
 q      pm,q             t,q                                                                      q     pm,q ÍT         t λt,q ,
ρt ≡ ρi       + t α t ρ i , the second order derivative with respect                            λi = λi      + t =1 α n−1
                Í
                                                                                                                             i
to a score is simply given as follows:                                                              q    pm,q ÍT         t ρ t,q
                                                                                                  ρi = ρi     + t =1 α n−1      i
                                                !                                         end
    ∂ 2 Lk        Õ           pm,q
                                     Õ
                                          t t,q
                                                       Õ          q
       2 =                ρi     +    α ρi      =             ρ i (16)                  {Rnl }lL=1
        q                                                                                                                              q q
   ∂ si        q ∈Q,i ∈I q            t             q ∈Q,i ∈I q
                                                                                           /* Create an L leaf tree on {(x i , λi )}q ∈Q,i ∈I q
                                                                                           */                     q
  One interpretation of the AL method is a penalty method on                                          Í
                                                                                                            q      λ
                                                                                                     q∈Q, x i ∈R nl i
constraint violations. To see it clearly, we can plug the stationary                      γnl = − Í                  q
                                                                                                            q      ρ
                                                                                                     q∈Q, x i ∈R nl i
condition Eq. (13) into Eq. (11), yielding
                                                                                           /* Assign leaf values on Newton step estimate
         Lk (s, α ) = C pm (s) +           α t C t (s) − b t
                                     Õ
                                                                                           */
                                                             

                                           t ∈ {α t >0}                                   foreach q ∈ Q, i ∈ I q do
                                                                                                    q             q                 q
                                                                                                                                          
                                          µ kt                               (17)             fn (x i ) = fn−1 (x i ) + η l γnl δ x i ∈ Rnl /* Take
                                                                                                                         Í
                                                   t
                                               C (s) − b t
                              Õ                              2
                       +                                          + const.                     step with learning rate η */
                                          2
                           t ∈ {α t >0}
                                                                                              for t=1 to T do
                                                                                                                t
It is evident that it penalizes constraint violations associated with                                        C (s) with
                                                                                                  Compute cost
α t > 0 with quadratic function. Therefore, setting a high value                                      q            q
                                                                                                    {si = fn x i }q ∈Q,i ∈I q
of µ kt imposes the constraint more strictly but on the other hand,                            Update α nt via Eq. (13) or Eq. (14)
setting a too high value may introduce a non-smoothness behavior                            end
which AL tries to avoid.
                                                                                         end
                                                                                      end
3.3     An approach to Augmented Lagrangian                                                Algorithm 1: AL-LambdaMART (AL-LM)
        with LambdaMART
Both AL and LambdaMART algorithms run by iterations. We pro-
pose a simple approach to integrate the two; Conduct an AL step
calculation to update α at each Boosting iteration, which means                     4.1    Dataset and objectives
the AL iteration index k is set equal to the Boosting iteration index
                                                                                    The dataset consists of search queries, input features (e.g., query,
n. As for the AL parameter µ kt , we take a fixed policy on itera-
                                                                                    product, and query-product dependent features such as product
tion: µ kt = µ t . A variant of algorithm would increase the value as               sales, customer review, textual matches between a query and prod-
iterations proceeds. Algorithm 1 shows the steps of the algorithm.                  ucts), as well as customer’s purchase decision. We follow the basic
   Most notably, the additional component to the original Lamb-                     modeling practice described in [18, 19]. We collect training data for
daMART at the n-th iteration step is the gradient computation:                      a month worth of the data followed by a week worth of the data
  q q
λi ,ρ i and the update of α nt . The modification to existing solvers               for evaluation. For this study, we focus on purchase as a primary
such as GBM in R [10], XGBoost [5] , and LightGBM [13] should                       objective.
be a minimal effort.                                                                   As for sub-objectives, we identify four. The first one (t1) is to
                                                                                    surface set of products that have relatively good quality and popu-
4     EXPERIMENTS                                                                   lar to a certain customer segment (e.g., customers with subscrip-
In this section, we analyze the algorithm by a numerical study. The                 tions or customers who are more interested in trending products.);
dataset we use is a product search data collected and generated                     the insight being promoting popular products of high quality for
in an e-commerce service. We first study how to set the optimiza-                   such a segment while others still have the impressions on such
tion parameter µ and the cost upper-bounds to ultimately achieve                    products. The 2nd – 4th ones are products that contain additional
modeling goals using smaller sampled dataset. Then we apply the                     benefit/service to such a customer segment; 2nd (t2) offers some
settings to the full dataset for building a production-ready model.                 lowest benefit to the customer of the three and 3rd (t3) and 4th (t4)
Our prototyping code for AL-LM is currently incorporated in GBM                     offer superior benefits in an increasing order. Generally, benefits
in R and used throughout of this section.                                           are hierarchical and coverage is inversely proportional to it.
Multi-objective Relevance Ranking                                                                                                                     SIGIR 2019 eCom, July 2019, Paris, France


               Table 1: Baseline unconstrained results.
                                                                                                                                 (a) Costs w.r.t. #iterations
                                                                                                           C_t1         C_t2      b_t1         b_t2       C_t1        C_t2   C_pr       C_pr
                    purchase              t1               t2                                       0.08                                                                            0.25
                  cost NDCG          cost NDCG        cost NDCG                                     0.07                                                                            0.2
       train     0.106    0.826     0.038    0.932   0.047    0.842                                 0.06
                                                                                                                                                                                    0.15


                                                                                 Constraint Costs
       valid     0.133    0.809     0.038    0.930   0.045    0.845


                                                                                                                                                                                            Primary Cost
                                                                                                    0.05
                                                                                                                                                                                    0.1
                                                                                                    0.04
Table 2: Upper bounds (b t 1 , b t 2 ) set by cost reduction rate                                   0.03
                                                                                                                                                                                    0.05

from baseline.                                                                                      0.02
                                                                                                                                                                                    0

                                                                                                    0.01                                                                            -0.05
                               %-cost reduction                                                       0                                                                             -0.1
                  5%       10%   20%     30%    40%        50%                                              1      11      21     31      41       51    61      71     81   91
           t1    0.036    0.034 0.030 0.026 0.023         0.019                                                                                Iteration

           t2    0.045    0.043 0.038 0.033 0.028         0.024                                                                          (b) AL update: 𝛼
                                                                                                                                          alpha_t1       alpha_t2
                                                                                                     3.5
4.2     Multi-objective ranking illustration                                                           3
                                                                                                     2.5


                                                                                     log(𝛼 +1)
First, we show how the algorithm progresses by iterations. In this                                     2
study, we use two constraints for simplicity. We randomly sample                                     1.5
                                                                                                       1
20K search impressions from the dataset and split into 50% for train-                                0.5
ing and validation sets. We run the algorithm up to 100 iterations for                                 0
                                                                                                            1      11      21     31      41       51    61      71     81   91
illustration purpose. The UB b t are set based on the unconstrained                                                                            Iteration
baseline; we run the unconstraint problem first, which is exactly
the same as the original LambdaMART algorithm. Table 1 shows
cost and NDCG for the purchase, t1 and t2. Then the UB values are           Figure 1: (a) Cost curves in AL-LM. µ is set to 10K. Solid curves
set based on a cost reduction rate, such as 5, 10, . . . , 50%, from the    represent training results and dotted validation. Solid lines
unconstrained baselines. Table 2 shows actual values of UB’s used           represent UB’s. (b) how α changes over iterations shown in
in the experiments.                                                         log-scale.

4.2.1 Cost curve illustration. Figure 1 (a) illustrates curves of costs
                                                                            Table 3: Relative margin by UB reduction rates and µ, for t1
that are the objectives in the problem. The value of α nt is also shown
in Figure 1 (b). We set UB for 20% cost reduction against the uncon-
strained baseline and set µ = 10K, which is a setting that is found to                                                                      UB reduction rate (%)
                                                                                                       µ          data
be large enough as studied in next subsection. In cost curves, solid                                                             5%        10% 20% 30% 40%                           50%
                                                                                                                    tr         19.28      13.87 6.40 3.72 2.45                      -0.28
curves correspond to the cost values on the training data and dotted                  10000
                                                                                                                   val         21.13      15.30 7.33 4.20 3.24                       0.31
curves those on the validation data. Note solid lines represent the
                                                                                                                    tr         17.81      13.22 6.62 0.02 0.70                      -0.09
UB’s. As seen in the Figure 1 (a), the initial few iterations try to                            1000
                                                                                                                   val         19.05      13.84 6.87 1.00 1.03                       0.80
satisfy constraints aggressively. Then spend a number of iterations
                                                                                                                    tr         18.31      14.02 5.80 -0.06 -0.65                    -1.66
in the over-satisfied region, gradually challenging the UB’s. The                                   100
                                                                                                                   val         18.82      13.90 5.98 0.94 -0.05                     -1.17
behavior of α can be seen in (b). α is pushed to zero when the                                                      tr         13.08      12.15 6.97 -0.39 -1.56                    -3.28
constraint is met during the iterations. Then when a constraint                                       10
                                                                                                                    ts         14.25      13.17 7.08 -0.55 -1.28                    -2.98
violation occurs, a finite value of α kicks in again. At iteration 80,
α t 2 actually gets 0.79, as seen in Figure 1 (b). Validation results are
also shown as dotted curves in Figure 1 (a). Constraint satisfaction        value of µ, i.e., µ = 1K or 10K can satisfy the constraints. For above
in training tend to be generalizable to that in validation set.             40% reduction, even µ = 10K suffers from infeasibility. As µ = 10K
                                                                            satisfies most constraints for the cost reduction rate for the training
4.2.2 Optimization parameter µ. In AL, the optimization parameter
                                                                            data, a sensible choice for the value of µ would be at least 10K.
µ kt can be any sufficiently large value. In our experiments, we set
µ nt to be a constant across all iterations and constraint. We vary         4.2.3 Cost reduction and NDCG gain. Now, we look at the cost
the value from (10, 100, 1K, 10K) and see how the constraints are           reduction rates and NDCG gains over the unconstrained cases as
satisfied, over different cost UB reduction rates that ranges from 5%       baseline. Table 5 shows for each objective, how the attained cost
through 50%, see Table 1 for the values used for each objective. As a       reduction rates and NDCG gains vary with different UB settings.
metric to measure constrained satisfaction, we use relative margin          Note for the constraints, the cost values are consistent with Table 4
that is defined as (b t − C t )/b t . Negative value means constraint       and 3, as they are computed based on values in Table 5.
violation. In Table 4 and 3, obviously, the more UB values are set             As we have tighter UB values, cost associated with the constraint
aggressively, the less chance the constraints can be satisfied. For         is reduced, which improves the NDCG gain. For the primary ob-
both constraints, when UB’s are set 5%, all constraints are met even        jective, purchase, the behavior is opposite, which is all expected
for µ is as small as 10. However, as UB is set up to 30%, only larger       as we are tightening constraints. An important observation is the
SIGIR 2019 eCom, July 2019, Paris, France                                                                                                                                                                      Michinari Momma, et al.


Table 4: Relative margin by UB reduction rates and µ, for t2                                                                                    Table 6: Production scale modeling results in %-gain. For
                                                                                                                                                over-weighting (OW), 7 models achieve +1% gain criteria out
                                                               UB reduction rate (%)                                                            of 144 trials in grid search. Min, max and avg performances
                     µ         data
                                            5%              10% 20% 30% 40%                                              50%                    of the 7 models are shown. For AL-LM, one denotes use of
                                 tr        3.09             2.01 0.65 6.19 -0.98                                        -0.18                   estimated UB’s (one-shot model) and adjust applying adjust-
            10000
                                val        3.34             2.11 1.50 7.22 0.39                                          0.29                   ment over the one-shot model. The bold letter shows statisti-
                                 tr        2.16             1.31 0.10 0.07 -0.45                                        -1.15                   cally significance in comparing OW (max) and AL-ML (one).
              1000
                                val        1.97             1.51 0.63 0.92 0.61                                         -0.90
                                 tr        2.61            -0.09 0.46 -1.14 -0.97                                       -2.10                                    pur-          t1                 t2                 t3                 t4
                   100
                                val        2.57            -0.51 0.96 -0.59 0.00                                        -1.54                                   chase   @5          @22    @5          @22    @5          @22    @5          @22
                                 tr        0.70            -0.46 -1.57 -1.64 -3.20                                      -5.26                     OW (min)      -0.10   2.59        1.98   1.15        1.00   1.66        1.19   2.01        1.54
                    10                                                                                                                            OW (max)      -0.03   3.52        2.67   1.66        1.41   2.33        1.67   2.74        2.08
                                ral        0.47            -0.04 -1.55 -1.50 -2.64                                      -5.24
                                                                                                                                                  OW (avg)      -0.05   2.90        2.20   1.27        1.10   1.79        1.30   2.26        1.68
                                                                                                                                                 AL-LM (one)    -0.07   3.21        1.83   4.28        3.08   3.38        2.29   4.19        2.77
Table 5: Cost reduction (%) and NDCG-gain for each objec-                                                                                        AL-LM (adj.)   -0.03   2.12        1.26   3.13        2.16   2.73        1.97   3.36        2.64
tive, by different UB’s, with µ=10K

                              training / upper-bound reduction%                              validation / upper-bound reduction%
                              5% 10% 20% 30% 40% 50%                                         5% 10% 20% 30% 40% 50%                             4.3    Production-scale modeling
       Cost red. 24.28 23.87 26.40 33.72 42.45 49.72 26.13 25.30 27.33 34.20 43.24 50.31                                                        4.3.1 Applying AL-ML in the full dataset. In this subsection, we
 t1    NDCG       1.62 1.57 1.74 2.32 3.06 3.74 1.76 1.66 1.81 2.34 3.16 3.74
       Cost red. 8.09 12.01 20.65 36.19 39.02 49.82 8.34 12.11 21.50 37.22 40.39 50.29
                                                                                                                                                show results on a larger sampled data to simulate a production
 t2    NDCG       0.92 1.68 3.49 6.58 7.09 9.30 1.06 1.87 3.46 6.76 7.21 9.34                                                                   modeling with more sub-objectives. To this end, we increase the
 pur- Cost red. 0.02 -0.22 -0.02 -1.25 -2.54 -5.18 0.21 0.18 0.29 -0.82 -1.88 -3.60                                                             samples to ∼1MM search impressions for the training and ∼500K
 chase NDCG -0.23 -0.01 -0.21 -0.63 -1.03 -1.58 0.18 0.18 0.11 -0.20 -0.23 -0.96
                                                                                                                                                search impressions for the hold-out evaluation data sets and use the
                                                                                                                                                full four sub-objectives. Each sub-objective is computed by a binary
                                                                                                                                                target value, just like the purchase target. The goal of the modeling
            Cost reduction rate vs. NDCG gain (train)                                Cost reduction rate vs. NDCG gain (valid.)
                                                                                                                                                is to achieve at least 1% gain in the number of products with the
              t1         t2       Linear (t1)         Linear (t2)                       t1       t2     Linear (t1)         Linear (t2)
 10%                                                                      10%                                                                   positive sub-objective value, i.e., presence of products eligible for
                                            y = 0.181x                                                            y = 0.180x                    some customer benefit in top-5 and 22 ranks, while minimizing the
 8%                                                                       8%
                                                                                                                                                impact on the purchase NDCG. We use relationship between the
                                                                                                                                                metrics (top-K) and cost as illustrated in Figure 2 to estimate the
 6%                                                                       6%
                                                                                                                                                values of UB’s by using smaller number of samples. In other words,
 4%                                                                       4%                                                                    as long as we know the relationship between the metric value and
                                            y = 0.071x                                                            y = 0.071x
                                                                                                                                                cost value, we can achieve the goal of +1% gain by just setting the
 2%                                                                       2%
                                                                                                                                                UB’s. Namely, we choose 14% reduction for b t 1 and 1/0.181 = 5.5%
 0%                                                                       0%
                                                                                                                                                reduction for b t 2 and adjust if the constraint is too tight to satisfy,
       0%      10%       20%       30%          40%      50%        60%         0%       10%      20%    30%          40%     50%         60%
                                                                                                                                                or over-satisfied. Note as we find t3 and t4 follows very similar
                                                                                                                                                pattern as t2, we use the same setting as t2 to them.
Figure 2: Relationship between cost reduction and NDCG                                                                                          4.3.2 OW method as a baseline. As a baseline method to com-
gain                                                                                                                                            pare against, we tune models by over-weighting (OW) over the
                                                                                                                                                sub-objectives. Basically, in the OW method, we identify search
                                                                                                                                                impressions that contain products with the positive sub-objective
validation results are consistent with those of training, which means                                                                           value for applying over-weightings. We introduce weights on the
the constraint satisfaction in training generalizes well at least for                                                                           pairwise cost computation so that we can influence the cost func-
the dataset examined. Note constraint t2 is over-satisfied with large                                                                           tion depending on products that matches a query. For example,
margin in cost. This is due to the correlation between the target                                                                               if a product has a value one in the target t1, i.e., the product is
values associated with t1 and t2, as that of t1 is available only if                                                                            eligible for some benefit, and we want to optimize t1 over other
the target value of t2 is positive, which partially explains t1 is a                                                                            sub-objectives, we put high weight on the cost associated with t1 so
tighter constraint to be satisfied than t2.                                                                                                     the ranking is more likely to optimized for the search impressions
    The relationship between the cost reduction and the NDCG gains                                                                              containing products with t1 as a feature. When there are multiple
is illustrated in Figure 2. The two metrics are quite consistent and                                                                            sub-objectives, we need to combinatorially tune multiple weighting
fit well by linear lines. This means, we can estimate cost reduction                                                                            schemes to concurrently achieve multiple objectives.
value for a given NDCG gain requirement, by using the approx-                                                                                      The weight values are tuned first by manually finding a reason-
imately liner relationship. Once we have the cost reduction rate,                                                                               ably good weighting parameter ranges and run grid search to fine
we can set it as the UB. This observation is quite useful in practice                                                                           tune. As there are four dimensional space to explore and too time
where NDCG gain values would likely be the criteria for offline                                                                                 consuming for the problem size, we only search weighting parame-
modeling.                                                                                                                                       ters for t1 and t2; we rely on correlation between t2 and t3 or t4 to
Multi-objective Relevance Ranking                                                                                  SIGIR 2019 eCom, July 2019, Paris, France


optimize overall metrics. Despite the search space reduction, we        REFERENCES
end up building 144+ models.                                             [1] Pia Borlund. 2003. The Concept of Relevance in IR. J. Am. Soc. Inf. Sci. Technol.
                                                                             54, 10 (Aug. 2003), 913–925. https://doi.org/10.1002/asi.10286
                                                                         [2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and
                                                                             Regression Trees. Wadsworth and Brooks, Monterey, CA.
4.3.3 Results. Table 6 shows the results on the evaluation set from      [3] Chris J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An
OW and AL-LM, which are measured by %-gain with respect to                   Overview. Technical Report. https://www.microsoft.com/en-us/research/
                                                                             publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
the unconstrained baseline. For OW, only 7 out of 144 trials over        [4] Christopher J. Burges, Robert Ragno, and Quoc V. Le. 2007. Learning to Rank
the grid are found as feasible solutions. We report min, max and             with Nonsmooth Cost Functions. In Advances in Neural Information Processing
avд among the solutions. For AL-LM, we already have estimates of             Systems 19, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.). MIT Press, 193–
                                                                             200. http://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-
each constraint from Figure 2. A similar preliminary experiment              functions.pdf
gives estimate of other cost UB’s.                                       [5] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting
   Both methods achieve the 1% gain criteria for all metrics. While          System. In Proceedings of the 22Nd ACM SIGKDD International Conference on
                                                                             Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA,
AL-LM achieves higher gains for t2 – t4 and purchase NDCG is                 785–794. https://doi.org/10.1145/2939672.2939785
flat, the model shows slight over satisfaction and we apply some         [6] Na Dai, Milad Shokouhi, and Brian D. Davison. 2011. Learning to Rank for
adjustment (relax UB’s by 20%: 14% reduction for b t 1 to 11% and            Freshness and Relevance. In Proceedings of the 34th International ACM SIGIR
                                                                             Conference on Research and Development in Information Retrieval (SIGIR ’11).
4%), yielding slight improvement on purchase (insignificant) with            ACM, New York, NY, USA, 95–104. https://doi.org/10.1145/2009916.2009933
less margin on t1. In terms of efficiency, AL-LM is a clear win          [7] Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruiqiang Zhang,
                                                                             Karolina Buchner, Ciya Liao, and Fernando Diaz. 2010. Towards Recency Ranking
as it requires only two model builds, one for getting the baseline           in Web Search. In Proceedings of the Third ACM International Conference on
cost value and the other by applying cost reduction UB’s estimated           Web Search and Data Mining (WSDM ’10). ACM, New York, NY, USA, 11–20.
by smaller sample. We could put efforts on adjustment if the one-            https://doi.org/10.1145/1718487.1718490
                                                                         [8] Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting
shot model has issues, which can be resolved slight tightening or            machine. Ann. Statist. 29, 5 (10 2001), 1189–1232. https://doi.org/10.1214/aos/
relaxing constraints. OW, on the other hand, requires exploration            1013203451
of the weighting schemes and the success rate is 7/144 = 4.8%,           [9] Jacob R. Gardner, Matt J. Kusner, Zhixiang Xu, Kilian Q. Weinberger, and
                                                                             John P. Cunningham. 2014. Bayesian Optimization with Inequality Constraints.
which requires 21 trials on average and 61 trials at 95% confidence.         In Proceedings of the 31st International Conference on International Conference
   Addition of further sub-objectives, such as search defect and             on Machine Learning - Volume 32 (ICML’14). JMLR.org, II–937–II–945. http:
                                                                             //dl.acm.org/citation.cfm?id=3044805.3044997
adult products tend to be much less correlated with the constraints     [10] Ridgeway Greg. 2013. gbm: Generalized Boosted Regression Models. https://cran.r-
we studied in this paper, which means more combinatorial explo-              project.org/web/packages/gbm/index.html
ration (exponential in T ) is needed for existing methodologies. AL-    [11] José Miguel Hernández-Lobato, Michael A. Gelbart, Ryan P. Adams, Matthew W.
                                                                             Hoffman, and Zoubin Ghahramani. 2016. A General Framework for Constrained
LM is the choice for MO modeling, as it only requires setting UB’s           Bayesian Optimization Using Information-based Search. J. Mach. Learn. Res. 17,
and some adjustment around the one-shot model: (linear in T ).               1 (Jan. 2016), 5549–5601. http://dl.acm.org/citation.cfm?id=2946645.3053442
                                                                        [12] Changsung Kang, Xuanhui Wang, Yi Chang, and Belle Tseng. 2012. Learning to
                                                                             Rank with Multi-aspect Relevance for Vertical Search. In Proceedings of the Fifth
                                                                             ACM International Conference on Web Search and Data Mining (WSDM ’12). ACM,
5    CONCLUSION AND FURTHER WORK                                             New York, NY, USA, 453–462. https://doi.org/10.1145/2124295.2124350
                                                                        [13] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma,
In this paper, we introduced an Augmented Lagrangian based                   Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting
method to address challenges with the multi-objective optimization           Decision Tree. In Advances in Neural Information Processing Systems 30, I. Guyon,
                                                                             U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett
in relevance modeling. Specifically, we incorporated a constraint            (Eds.). Curran Associates, Inc., 3146–3154. http://papers.nips.cc/paper/6907-
optimization method into Boosting so that modelers can use it very           lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
                                                                        [14] Mahdi Milani Fard, Quentin Cormier, Kevin Canini, and Maya Gupta. 2016.
easily as an extension to the LambdaMART, one of the most pop-               Launch and Iterate: Reducing Prediction Churn. In Advances in Neural Information
ular Boosting methods in this domain. The explicit introduction              Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and
of constrained optimization is a novel contribution of this paper            R. Garnett (Eds.). Curran Associates, Inc., 3179–3187. http://papers.nips.cc/
                                                                             paper/6053-launch-and-iterate-reducing-prediction-churn.pdf
and thus its application to relevance ranking is novel. Experimental    [15] S. Mizzaro. 1998. How many Relevances in Information Retrieval? Interacting
results showed the proposed AL-LM can indeed efficiently resolve             With Computers 10, 3 (1998), 305–322.
the MO problem. The impact of the outcome would be not limited          [16] Phong Nguyen, John Dines, and Jan Krasnodebski. 2017. A Multi-Objective
                                                                             Learning to re-Rank Approach to Optimize Online Marketplaces for Multiple
to reducing the model development lead time; modelers can make               Stakeholders. CoRR abs/1708.00651 (2017). arXiv:1708.00651 http://arxiv.org/
more effort in developing new features by using the time saved               abs/1708.00651
                                                                        [17] J. Nocedal and S. Wright. 2006.              Numerical Optimization (2 ed.).
and an automated model refresh can be realizable by incorporating            Springer.             http://books.google.com.tr/books?id=VbHYoSyelFcC,/bib/
various constraints automatically. In fact, we have been quite suc-          nocedal/nocedal2006numerical/%28Springer%20series%20in%20operations%
cessful in apply the multiple objective models built with AL-LM to           20research%29%20Jorge%20Nocedal%2C%20Stephen%20Wright-Numerical%
                                                                             20Optimization-Springer%20%282006%29.pdf,http://www.bioinfo.org.cn/
modeling projects in production.                                             ~wangchao/maa/Numerical_Optimization.pdf
   Our code is currently incorporated in GBM in R. We plan to           [18] Daria Sorokina. 2016.         Amazon Search: The Joy of Ranking Products.
implement the algorithm into XGBoost and / or LightGBM and                   https://mlconf.com/mlconf-2016-sf/.
                                                                        [19] Daria Sorokina and Erick Cantu-Paz. 2016. Amazon Search: The Joy of Ranking
make it publicly available.                                                  Products. In Proceedings of the 39th International ACM SIGIR Conference on Re-
   Although we applied the method on relevance modeling, it is               search and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY,
                                                                             USA, 459–460. https://doi.org/10.1145/2911451.2926725
applicable to wider problem domains, i.e., classification and regres-   [20] Krysta M. Svore, Maksims N. Volkovs, and Chris J.C. Burges. 2011. Learn-
sions, and other machine learning methods such as neural networks,           ing to Rank with Multiple Objective Functions, In Proceedings of WWW
etc. Extension to wider application and methodology is a future              2011. https://www.microsoft.com/en-us/research/publication/learning-to-rank-
                                                                             with-multiple-objective-functions/
work as well.
SIGIR 2019 eCom, July 2019, Paris, France                                                                                                      Michinari Momma, et al.


[21] Lidan Wang, Paul N. Bennett, and Kevyn Collins-Thompson. 2012. Robust Rank-            Application to Learning Ranking Functions for Web Search.           In Ad-
     ing Models via Risk-sensitive Optimization. In Proceedings of the 35th International   vances in Neural Information Processing Systems 20, J. C. Platt, D. Koller,
     ACM SIGIR Conference on Research and Development in Information Retrieval (SI-         Y. Singer, and S. T. Roweis (Eds.). Curran Associates, Inc., 1697–1704.
     GIR ’12). ACM, New York, NY, USA, 761–770. https://doi.org/10.1145/2348283.            http://papers.nips.cc/paper/3305-a-general-boosting-method-and-its-
     2348385                                                                                application-to-learning-ranking-functions-for-web-search.pdf
[22] Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke
     Chen, and Gordon Sun. 2008.            A General Boosting Method and its