<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhou Qin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Yuan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pratik Lahiri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenyang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon.com, Inc. 550 Terry Ave N, Seattle</institution>
          ,
          <addr-line>Washington 98109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>eCom'24: ACM SIGIR Workshop on eCommerce</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In a typical e-commerce setting, Content Ranking Optimization (CRO) mechanisms are employed to surface content on the search page to fulfill customers' shopping missions. CRO commonly utilizes models such as contextual deep bandits to independently rank content at different positions, e.g., one optimizer dedicated to organic search results and another to sponsored results. However, this regional optimization approach does not necessarily translate to whole-page optimization, e.g., maximizing revenue at the top of the page may inadvertently diminish the revenue of lower positions. In this paper, we propose a reinforcement learning based method for whole-page ranking to jointly optimize across all positions by: 1) shifting from position-level optimization to whole-page-level optimization to achieve an overall optimized ranking; 2) applying reinforcement learning to optimize for the cumulative rewards instead of the instant reward. We formulate page-level CRO as a cooperative Multi-agent Markov Decision Process, and address it with the novel Multi-Agent Deep Deterministic Policy Gradient (MADDPG) model. MADDPG supports a flexible and scalable joint optimization framework by adopting a “centralized training and decentralized execution” approach. Extensive experiments demonstrate that MADDPG scales to a 2.5 billion action space in the public Mujoco environment, and outperforms the deep bandits model by 25.7% on the offline CRO data set from a leading e-commerce company. We foresee that this novel multi-agent optimization is applicable to similar joint optimization problems in the field of information retrieval.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Deep Reinforcement Learning</kwd>
        <kwd>Actor-Critic</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A traditional e-commerce search results page can consist of several positions (a.k.a. “slots”),
e.g., shown as the blue boxes in Figure 1. Such an explicit design maintains a
consistent shopping experience for customers and is easy to maintain. Each position
accommodates several rankable pieces of content. Different positions can also serve different
purposes: top-position content (e.g., position 1) includes highly relevant content or a
dedicated module answering or clarifying customers’ questions; middle positions can help
customers navigate, narrow down, or discover related products and categories; bottom
positions may have lower relevance, but aim to inspire customers to either diverge from the
original intent or explore their search further. Content providers can schedule their
content to show at particular positions, based on different content properties. When
multiple pieces of content compete for the same position, a content optimizer (a.k.a. ranker)
is often designed to determine which pieces of content are the most helpful to surface.</p>
      <p>
        In common use cases, content ranking optimization (CRO) can be formalized as a ranking
problem, such as building contextual deep learning models to optimize for certain objective
functions, e.g., revenue, profit, or click-through rate (CTR) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or, more often, as a multi-objective optimization problem, i.e., optimizing several
metrics simultaneously [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In real-world use cases, positions on the search results page are subject to certain
restrictions: 1) some positions are designated to fixed content providers; 2) different positions
hold contents from different content providers, which are independent from each other; 3)
combining multiple positions can lead to a huge action space. These restrictions have resulted
in independent ranking decisions for different positions, meaning CRO does not take other
positions’ information into consideration.
      </p>
      <p>Figure 1: Content Positions (a search bar above content positions 1–6 on the page).</p>
      <p>In this paper, we propose to utilize a multi-agent deep
reinforcement learning (RL) solution (MADDPG) to tackle the CRO problem, which aims to 1)
achieve joint whole-page-level contextual optimization across multiple positions, by learning
the interactions among positions and generating whole-page content combinations; 2) optimize
for long-term benefits instead of instant benefits; 3) improve customer experience to better
fulfill customers’ shopping missions. The optimizer of each position is referred to as an “agent”,
and the contents of each position are referred to as “actions”. All agents act cooperatively to
receive collective whole-page-level rewards. We utilize full RL to optimize for the cumulative
future rewards over the entire customer journey, instead of heuristic instant rewards. We believe
this novel method is applicable to other joint optimization applications in e-commerce.</p>
      <p>The rest of the paper is organized as follows: Section 2 describes the related work used in the
industry; Section 3 describes the CRO problem formulation; Section 4 describes the proposed
solution; Section 5 describes the evaluation and results; and Section 6 summarizes our work
and discusses future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Reinforcement learning has gained significant attention in the field of e-commerce for
ranking-related problems, where the goal is to present customers with a ranked list of items to
maximize customer engagement and conversions. Specifically, we focus on work addressing the
high-dimensional action space in the ranking scenario.</p>
      <p>
        Multi-Variate Testing (MVT) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] tackles the multi-dimensional joint optimization problem and is
widely adopted in e-commerce companies. It builds on top of the multi-armed bandits (MAB)
model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and adopts the hill-climbing algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to efficiently explore the high-dimensional
action space. One concern with applying MVT is latency: hill-climbing in MVT requires several
rounds of inference, while CRO typically utilizes neural network (NN) based models to estimate
the objective, motivated by reducing manual feature engineering effort and accommodating
advanced features such as embeddings. As a result, inference with NN models is expensive, and
it is intractable to meet the latency requirement when combining NN models with MVT.
      </p>
      <p>
        Hierarchical RL (HRL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was proposed to solve the joint optimization problem on an
e-commerce search page. Optimization is decomposed into 2 sub-tasks: 1) a source selector
that decides which sources should be selected for the current page; and 2) an item presenter
that decides the presentation order of the items selected from the chosen sources at each
position on the page. The model fully utilizes the sequential characteristics of user behaviors
across different pages to decide both the sources and items to be presented. The item presenter
is similar to content ranking in CRO, but a hierarchical structure is not required in CRO, so
HRL is not practical for our setting. However, this approach inspires the joint optimization
between page layout optimization (PLO) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and content ranking, where PLO decides the page template at a higher level and
CRO decides the contents given the selected page layout at a lower level.
      </p>
      <p>
        Reinforcement learning has also been applied in different ranking methodologies, e.g.,
listwise recommendation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and page-wise recommendation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The page-wise solution generates
items and the strategy to display them on a two-dimensional page.
      </p>
      <p>In CRO, we are the first to apply MADDPG to the whole-page search ranking problem,
providing a scalable solution for an industrial setting with tailored modifications.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem</title>
      <sec id="sec-3-1">
        <title>3.1. Combinatorial Optimization</title>
        <p>We formalize content ranking optimization (CRO) as a combinatorial optimization problem,
where the major challenge is dealing with a combinatorial action space. In CRO, an action
is a piece of content. With more new positions and contents, the action space grows exponentially.
Formally, for a CRO environment with $N$ positions and $|A_i|$ discrete actions at each position
$i$, there will be $\prod_{i=1}^{N} |A_i|$ possible joint actions to consider. Table 1 lists the
existing contents, the possible combinations, and the actual maximum combinations observed for
a single search request (the actual maximum is lower than the possible number because contents
have different traffic coverage).</p>
        <p>
          Such a large action space is intractable for conventional single-agent single-dimensional RL
algorithms, such as DQN [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], since it is difficult to explore efficiently [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. One solution is to
delegate the optimization responsibility to lower-level agents, where each agent optimizes its
own dimension, and agents have a way to communicate [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] with other agents to cooperate.
As a result, the action space grows linearly for each agent. We formulate CRO as multi-agent
RL due to: 1) optimality: this work evaluates both single-agent RL and multi-agent RL in
Section 5, and demonstrates that the multi-agent RL algorithm performs optimally, stably,
and scalably in a CRO-like scenario; 2) flexibility: multi-agent RL applies a “centralized training
and decentralized execution” approach [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], where each agent is not required to access another
agent’s observation at execution time. This supports a flexible system structure enabling
joint optimization across different applications; for instance, we can deploy trained agents to
different services for online execution.
        </p>
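        <p>To make the scale contrast concrete, a minimal Python sketch (with hypothetical
per-position candidate counts, not production numbers) compares the multiplicative growth of the
joint action space with the linear growth seen by the individual agents:</p>
        <preformat>
# Hypothetical candidate counts for 3 positions; the joint space multiplies,
# while the per-agent view only adds one dimension per position.
import math

per_position_actions = [20, 15, 10]              # illustrative counts

joint_actions = math.prod(per_position_actions)  # single-agent view: 3,000 joint actions
per_agent_actions = sum(per_position_actions)    # multi-agent view: 45 actions in total

print(joint_actions, per_agent_actions)          # 3000 45
        </preformat>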
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multi-agent MDP in CRO</title>
        <p>
          In this part, we formalize the CRO problem as a multi-agent MDP. We consider a multi-agent
extension of an MDP called Markov games [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] with the tuple $M = (N, A, S, R, T, O, \gamma)$. Here, $N$ is the number of agents;
$A = \{A_1, A_2, ..., A_N\}$ is the discrete set of actions for the agents; $S$ is the state
space; $R$ is the reward function; and $O = \{O_1, O_2, ..., O_N\}$ is the set of observations
for each agent. To choose actions, each agent $i$ uses a stochastic policy
$\pi_i : O_i \times A_i \to [0, 1]$, and the next state is produced according to the state
transition function $T : S \times A_1 \times A_2 \times ... \times A_N \to S$. Each agent $i$
obtains rewards as a function of the state and the agent’s action, $r_i : S \times A_i \to \mathbb{R}$,
and receives a local observation correlated with the state, $o_i : S \to O_i$. The goal of each
agent $i$ is to learn a policy $\pi_i : O_i \to A_i$ which maximizes its own total expected
return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$, where $\gamma$ is a discount factor and $T$ is the
time horizon. Notably, since CRO is a fully cooperative environment, at each time step $t$ all
agents observe the same joint reward $r_1 = r_2 = ... = r_N = r$, so the return looks like
$r_1 \cdot 0.99^0 + r_2 \cdot 0.99^1 + r_3 \cdot 0.99^2 + ...$ for $\gamma = 0.99$. The concept
mapping in CRO is as follows:
• environment: the e-commerce website and customers;
• episode: a customer session on the same day;
• $S$: a customer’s shopping state, which represents the global context, such as region,
query, device, membership status;
• terminal state: when the session is abandoned (no more interactions) at the end of the day;
• $T$: the state transition probability;
• agents: position agents that choose content at their respective positions, such as
top/middle/bottom/side positions;
• $A_i$: the content candidates at position $i$;
• $O_i$: the observation of position $i$, consisting of two parts: the global observation, and
the local observation regarding the content features at position $i$, such as CTRs, ads bids, etc.;
• $r$: the reward which CRO receives from the environment, discussed in detail in Section 4.2.
        </p>
        <p>At each time step, a customer issues a request to CRO; CRO observes the customer’s shopping
state, chooses a content combination to display to the customer, and then receives the rewards
from the session. CRO’s goal is to learn the policy $\pi_i$ for each agent $i$ which maximizes
the expected return across all sessions.</p>
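        <p>For illustration, the following is a hedged sketch of one logged transition under this
formulation; the field names are ours, not the production schema:</p>
        <preformat>
# One transition of the cooperative Markov game: a shared joint reward,
# per-agent local observations, and one content choice per position agent.
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: List[float]               # global shopping context (region, query, device, ...)
    observations: List[List[float]]  # local observation o_i per position agent
    actions: List[int]               # content index a_i chosen by each agent
    reward: float                    # joint reward r_1 = r_2 = ... = r_N = r
    next_state: List[float]          # context after the customer interaction
    done: bool                       # True once the session is abandoned
        </preformat>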
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Mix-offline RL System in CRO</title>
        <p>It is intractable to collect business rewards instantly online, which is the key limitation
preventing fully online RL. As Figure 2 shows, CRO is defined as a mix-offline RL system in the
current scenario: every day, a trained agent is deployed online to interact with the environment;
at the end of the day, the trajectories are collected to train the agent, and the newly trained
agent is then deployed online to act. CRO is thus closer to an offline RL scenario, except that
the agent can still interact with the environment on a daily cadence.</p>
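        <p>The daily cycle can be summarized by the sketch below; the helper functions are
stand-ins for the deployment and logging systems, not their actual APIs:</p>
        <preformat>
# A hedged sketch of the mix-offline loop: deploy, log, retrain, redeploy.
def deploy(agent):                # push the trained agent online for the day
    pass

def collect_trajectories():       # pull the day's logged (x, a, r, x', done) tuples
    return []

def train(agent, trajectories):   # one offline training pass over the new data
    return agent

agent = object()                  # placeholder for an initialized MADDPG agent
for day in range(7):              # e.g., one week of daily cycles
    deploy(agent)
    agent = train(agent, collect_trajectories())
        </preformat>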
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Modeling</title>
      <sec id="sec-4-1">
        <title>4.1. MADDPG</title>
        <p>
          We utilize the MADDPG model [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to solve the CRO problem. MADDPG is a multi-agent
version of actor-critic method (MAAC) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. It adopts the framework of centralized training
with decentralized execution approach [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. A key characteristic of MADDPG is utilizing a
fully observable critic which involves the observations and actions of all agents to ease joint
training. Each agent trains a Deep Deterministic Policy Gradient [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] model, which concurrently
learns a Q-function and a policy. The actor    () with policy weights   observes the local
observations , while the critic  is allowed to access the observations, actions, and the target
policies of all agents during training. The critic of each agent concatenates all state-actions
together as the input and uses the local reward to obtain the corresponding Q-value. Formally,
consider an environment with  agents and policies parameterized by  = { 1, ...,   }, and
 ( ) = E[] is:
 = { 1, ...,   } as the set of agent policies. The gradient of the expected return for agent i,
[︃
∇   ( ) = E∼  ,∼   ∇  log  (|) (, 1, ...,  )
where  (, 1, ...,  ) is a centralized action-value function that takes the actions of all
agents, 1, ...,  as input, and the concatenated state information  = (1, ...,  ), and outputs
the Q-value for agent . We then extend the work with deterministic policies   with parameters
 , so the gradient can be written as:
        </p>
        <p>[︃
∇   ( ) = E,∼  ∇   (|)∇  (, , ...,  )|= ())

updated as follows:</p>
        <p>Here the experience replay bufer  contains the tuples (, ′ , 1,
...,  , ), recording the experience of all agents. The centralized action-value function  is
[︃
( ) = E,,,′ ( (, , ...,  ) − ( +   ′ (′ , ′1, ..., ′ ))|′= ′( )
2
]︃
where  ′ =
{  1′ , ...,</p>
        <p>′ } is the set of target policies with delayed parameters  ′.
state regardless of the changes in the policy of other agents.</p>
        <p>
          This method resolves the non-stationarity issue [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], as  (′ |, 1, ...,  ,  1, ...,   ) =
 (′ |, 1, ...,  ,  1′, ...,  ′ ) for any   ̸=  ′, since the environment returns the same
next
        </p>
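        <p>A minimal PyTorch sketch of the centralized critic update in Eq. (3) follows; the
network sizes and shapes are illustrative, not the paper's production configuration:</p>
        <preformat>
# Centralized critic: sees all agents' observations and actions; trained with
# a TD target from the target critic, then tau-soft updated.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 16, 10

class CentralizedCritic(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = N_AGENTS * (OBS_DIM + ACT_DIM)   # concatenated x and a_1..a_N
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, obs_all, act_all):
        return self.net(torch.cat([obs_all, act_all], dim=-1))

critic, target_critic = CentralizedCritic(), CentralizedCritic()
target_critic.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.001

def critic_update(obs, act, rew, next_obs, next_act, done):
    # TD target y = r + gamma * Q'(x', a'_1..a'_N), a' from the target policies
    with torch.no_grad():
        y = rew + gamma * (1 - done) * target_critic(next_obs, next_act)
    loss = nn.functional.mse_loss(critic(obs, act), y)
    opt.zero_grad(); loss.backward(); opt.step()
    # tau-soft update of the target network
    for p, tp in zip(critic.parameters(), target_critic.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
        </preformat>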
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Reward</title>
        <p>Rewards are critical to an RL system. In our setting, we formulate the reward as a
weighted sum of business reward components (including purchases, clicks, and abandonments):
$$r = r_1 + r_2 + w_3 \cdot r_3 + w_4 \cdot r_4 + w_5 \cdot r_5 \quad (4)$$
where the $r_k$ are the component rewards and the $w_k$ are their weights.</p>
        <p>
          Clicks and Abandonments are part of the reward for two reasons: 1) we aim to
encourage more engagement in the session and reduce session abandonments, which are
unbiased indicators of long-term value; 2) they enable reward shaping [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to
stabilize the learning, since purchases are sparse. Figure 3 shows how the rewards are
collected from the session. $s_t$ represents the customer state, $a_t$ represents the content
combination displayed to the customer, and reward $r_t$ is the total reward received between
impressions (time steps). $s_0$ is the terminal state. From this episode, 4 transitions are
collected: $(s_1, a_1, r_1, s_2, 0)$, $(s_2, a_2, r_2, s_3, 0)$, $(s_3, a_3, r_3, s_4, 0)$,
$(s_4, a_4, r_4, s_0, 1)$.
        </p>
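        <p>The transition collection above can be reproduced with a short sketch; the states are
symbolic and the rewards illustrative:</p>
        <preformat>
# Build (s_t, a_t, r_t, s_{t+1}, done) tuples from one logged session,
# terminating in the terminal state s0 as in Figure 3.
episode = [("s1", "a1", 1.0), ("s2", "a2", 0.0), ("s3", "a3", 0.0), ("s4", "a4", 2.0)]

transitions = []
for t, (s, a, r) in enumerate(episode):
    last = t == len(episode) - 1
    next_s = "s0" if last else episode[t + 1][0]      # s0 is the terminal state
    transitions.append((s, a, r, next_s, int(last)))  # done flag on the last step

# -&gt; (s1,a1,r1,s2,0), (s2,a2,r2,s3,0), (s3,a3,r3,s4,0), (s4,a4,r4,s0,1)
        </preformat>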
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Exploration</title>
        <p>
          We sample actions from the Gumbel-Softmax [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] action distribution with a temperature of 1.0 for each agent. This is a conventional
noise-based exploration method in RL; with a higher temperature, the model encourages more
exploration. For new and unrecognized contents, we assign a fixed probability to ensure they
can be explored.
        </p>
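        <p>A minimal sketch of this exploration scheme follows, assuming a hypothetical boolean
mask that marks new contents and an illustrative fixed exploration probability:</p>
        <preformat>
# Gumbel-Softmax sampling (temperature 1.0) per agent, with a fixed chance of
# forcing a new/unseen content so it gets explored.
import torch
import torch.nn.functional as F

def sample_action(logits, new_content_mask, temperature=1.0, explore_p=0.05):
    y = F.gumbel_softmax(logits, tau=temperature, hard=True)  # one-hot sample
    if new_content_mask.any() and torch.rand(()) &lt; explore_p:
        idx = new_content_mask.nonzero().flatten()            # new-content indices
        choice = idx[torch.randint(len(idx), (1,))]
        y = F.one_hot(choice, num_classes=logits.shape[-1]).float().squeeze(0)
    return y
        </preformat>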
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>
        We conduct extensive evaluations both in a public Mujoco environment [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and on a CRO
offline data set.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Evaluation On HalfCheetah-v2</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Environment</title>
          <p>
            We choose the Mujoco [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] environment HalfCheetah-v2 [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] to simulate the CRO scenario.
The robot in this environment has six limbs, with each limb representing one continuous action
$a \in [-1.0, 1.0]$. The goal is to train agents to control these limbs cooperatively to move
faster and receive higher rewards. Since CRO has a discrete action space (contents), we need to
discretize the continuous space into a discrete space to bridge the gap. For instance, when the
per-dimension action size is set to 5, we discretize the action space to $[-1.0, -0.5, 0.0, 0.5, 1.0]$,
and at every time step the agents interact with the environment using one of those discretized
values. If the per-dimension action size is 5, the joint action space expands to $5^6 = 15{,}625$;
if the action size is 50, the joint action space reaches $50^6 \approx 15.6$ billion, which
simulates an extremely complex cooperative scenario.
          </p>
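          <p>The discretization can be sketched as follows; the bin counts match the paper's
setting, while the helper function is ours:</p>
          <preformat>
# Map each agent's discrete bin index back to a continuous torque in [-1, 1];
# with K bins per limb and 6 limbs the joint space is K**6.
import numpy as np

K = 5
bins = np.linspace(-1.0, 1.0, K)   # [-1.0, -0.5, 0.0, 0.5, 1.0]
joint_space = K ** 6               # 15,625 joint actions (50**6 = 15,625,000,000)

def to_continuous(discrete_indices):
    return bins[np.asarray(discrete_indices)]

print(to_continuous([0, 2, 4, 1, 3, 2]))  # [-1.  0.  1. -0.5  0.5  0.]
          </preformat>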
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Benchmarks</title>
          <p>
            We leverage the following models as benchmarks:
• DQN: the classical DQN model. We flatten the multi-dimensional action space into a single
dimension;
• CQL: Conservative Q-learning [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], a tweaked DQN model with a conservative Q function. CQL is designed to address the
over-estimation issue [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] in an offline RL setting using an off-policy algorithm. We also flatten the action
space into a single dimension;
• BranchingDQN: the action branching architecture [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] was proposed to address the large action space problem in single-agent RL. The key
insight is that, to solve problems in a combinatorial action space, it is possible to optimize
each action dimension with a degree of independence. A shared network module computes a latent
representation of the input state that is then propagated to several action branches. Each branch
is responsible for controlling an individual degree of freedom, and the concatenation of the
selected sub-actions results in a joint-action tuple;
• BranchingCQL: a conservative version of BranchingDQN to fit an offline RL scenario;
• MADDPG: this paper’s proposed model.
          </p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Online Setting</title>
          <p>
            We run models in HalfCheetah-v2 in an online fashion with discrete per-dimension action
sizes of [5, 10, 25, 50], resulting in action spaces of [15K, 1M, 244M, 15.6B]. Models are run
3 times with 1,000 episodes each, and mean rewards are averaged. The CQL and DQN models cannot
be tested with action size &gt; 10, due to out-of-memory issues on the GPU. All policy and critic
networks have the same 2 MLP layers with 256 hidden units and the ReLU activation function [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]. Target networks are updated by the $\tau$-soft update method [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] where $\tau = 0.001$. For training CQL, DQN, and Branching*, we utilize an
epsilon-greedy [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] strategy as the behavioral policy [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] with an exponential decay $\epsilon = 0.01 + 0.99 \cdot e^{-kt}$; for MADDPG, we
sample actions from a Gumbel-Softmax [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] action distribution. The experience replay buffer size is set to $10^5$, and the
batch size is set to 128 for all models.
          </p>
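          <p>The epsilon schedule can be written as a one-liner; the decay rate k below is
illustrative, since the exact constant is setting-specific:</p>
          <preformat>
# Exponential epsilon decay from 1.0 toward the 0.01 floor for the
# epsilon-greedy behavioral policy of the value-based baselines.
import math

def epsilon(t, k=1e-4):
    return 0.01 + 0.99 * math.exp(-k * t)
          </preformat>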
          <p>Figure 4: Model performance in the online setting (mean reward; panels (a) and (b) at per-dimension action sizes 5 and 10).</p>
          <p>The results in Figure 4 indicate that typical DQN and CQL models are intractable for handling
the large action space due to insufficient exploration, which is expected. Branching models
learn well at relatively low-dimension action sizes but do not scale as well as MADDPG. MADDPG
demonstrates consistent and optimal performance in all online experiments.</p>
        </sec>
        <sec id="sec-5-1-4">
          <title>5.1.4. Offline Setting</title>
          <p>
            The online study demonstrates that MADDPG scales well in an online setting. However, CRO
is more suitable for an offline setting. Specifically, the CRO agent is required to learn from
the trajectories collected by online policies at the end of the day. In this section, we conduct
experiments to verify whether MADDPG can perform well in an offline setting. We still apply
per-dimension action sizes of [5, 10, 25, 50] to observe its scalability. Offline learning
consists of two steps: 1) collecting trajectories from some online policies (or a random policy);
2) training the offline agent with these trajectories. To generate the trajectories, we adopt
the D4RL approach [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]: first, we train an online MADDPG agent in the HalfCheetah-v2 environment,
recording its mean rewards and saving its checkpoints until the model converges; second, we
load 3 agents: 2 agents from checkpoints with 30% and 100% performance, and 1 random agent;
then, we have the 3 agents interact with the environment (without training) to collect 3M
trajectories equally (1M expert at 100%, 1M medium at 30%, 1M random); finally, we train the
models on this data set and observe the performance trend during training. All model settings
are identical to the online setting.
          </p>
          <p>The results in Figure 5 demonstrate that 1) conservative models can learn better than DQN
in an offline setting, which aligns with their design purpose, but CQL does not scale well with
larger action sizes; 2) MADDPG scales well in the offline setting.</p>
          <p>Figure 5: Model Performance in Offline Setting (mean reward panels (a)–(d)).</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation on CRO Data Set</title>
        <p>In this section, we conduct experiments to evaluate MADDPG on the CRO data set. For simplicity,
we extract a business metric as the reward for benchmarking. We evaluate the models on desktop
traffic only and consider joint optimization for 3 positions (top/middle/bottom), as they have
the most impact and content competition.</p>
        <sec id="sec-5-2-1">
          <title>5.2.1. Trajectory Preparation</title>
          <p>Trajectory data serves as the training source for RL. We follow a standard procedure to extract
the training trajectories from CRO logs. As a result, we extracted 40M training trajectories from
desktop traffic. The trajectories are from baseline policies: a 5% random policy (generated by 5%
epsilon-greedy) and a 95% deep bandits policy. For the evaluation trajectories, we follow the same
steps but use an exploration filter to extract 4M trajectories from the random policy.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Benchmarks</title>
          <p>We leverage the following models and rewards as benchmarks:
• Random: randomly selects content combinations from the available options;
• Baseline (multi-component reward): a baseline ranker with a click-based multi-objective
optimization objective;
• Position-level bandits: identical to the baseline model structure; past search queries
are tokenized by the pre-trained SentencePiece tokenizer [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] and encoded by Embedding and LSTM [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] layers. We then concatenate the embedding with other inputs and feed it to
an MLP layer with 256 hidden units. The final output is one unit with a Mean Squared
Error (MSE) loss. This model is similar to the baseline, with the major differences in the
features and reward;
• Page-level bandits: utilizes an MLP structure with a multi-head output. Each
"head" represents one position and outputs the content probability for the available contents
at that position. Features are the same as described above: past search queries, customer
context, top position content ID, inline slot content ID, and bottom position content IDs;
• MADDPG w/ gamma=0: trained on trajectories with a discount factor of 0.0. The state
is encoded by SentencePiece, Embedding, and LSTM layers as well. Since the discount
factor is 0, this actually acts as a bandits model which optimizes for the instant reward
without considering future rewards;
• MADDPG w/ gamma=0.99: trained on trajectories with a discount factor of 0.99;
other settings are the same as described above.</p>
          <p>
            The difference between the baseline and random models demonstrates the impact of the
baseline model. The difference between position-level and page-level bandits presents the benefit
of joint optimization. The difference between “MADDPG w/ gamma=0.99” and “MADDPG w/
gamma=0” illustrates the benefit of RL over bandits.
          </p>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Metrics</title>
          <p>
            We adopt the Inverse Propensity Scoring (IPS) method [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ] to evaluate the trained policies. Given logged exploration data collected with a
logging policy $\pi_0$ and an evaluation policy $\pi_e$, the following IPS estimator of the
policy value is unbiased:
$$\hat{V}_{IPS}(\pi_e) = \frac{1}{n} \sum_{k=1}^{n} \frac{\pi_e(a_k|x_k)}{\pi_0(a_k|x_k)}\, r_k \quad (5)$$
where $n$ is the trajectory count and $r_k$ is the trajectory reward. Since the logging policy
is a random policy, $\pi_0(a_k|x_k) = \frac{1}{|A|}$; $\pi_e(a_k|x_k)$ is extracted by a softmax [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] over the output of the models. The higher the IPS, the better the policy. Due to the
high variance of the reward and the IPS calculation, we choose 3 scores for the benchmark:
1) IPS on source rewards; 2) IPS on logged rewards; 3) IPS on indicator rewards (1 when the
source reward &gt; 0, otherwise 0). The results are shown in Table 2.
          </p>
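          <p>A minimal numpy sketch of the IPS estimator in Eq. (5) follows, with illustrative
inputs: a uniform random logging policy and softmax-derived evaluation probabilities:</p>
          <preformat>
# IPS: importance-weight each logged reward by pi_e(a|x) / pi_0(a|x) and average.
import numpy as np

def ips(eval_probs, logged_probs, rewards):
    return np.mean(eval_probs / logged_probs * rewards)

n, n_actions = 1000, 8
logged_probs = np.full(n, 1.0 / n_actions)        # uniform random logging policy
model_scores = np.random.randn(n, n_actions)      # stand-in model outputs
softmax = np.exp(model_scores) / np.exp(model_scores).sum(axis=1, keepdims=True)
eval_probs = softmax[np.arange(n), 0]             # pi_e of each logged action (index 0 here)
rewards = np.random.binomial(1, 0.1, size=n).astype(float)

print(ips(eval_probs, logged_probs, rewards))
          </preformat>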
          <p>Table 2 demonstrates that MADDPG outperforms the bandits model with logged and
unlogged reward. The difference between “MADDPG with gamma=0.99” and “MADDPG with
gamma=0” shows that RL is beneficial by considering state transitions and delayed rewards.
The difference between page-level and position-level models shows that ranking holistically
is beneficial. The baseline IPS is not high, which we assume is due to the composite effect of
different objectives and the over-estimation issue. We also conducted an experiment to verify
whether MADDPG can gradually learn during training: we periodically dump the model during
training, evaluate it on the CRO evaluation data set, and observe the IPS trend. Figure 6
shows the IPS trend, demonstrating that MADDPG can learn during offline model training.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Online A/B Test Performance</title>
        <p>We implemented this approach on one of the leading e-commerce platforms and ran an A/B
test to verify its performance. We tested with discount factors of 0 (referred to as Treatment 1)
and 0.99 (referred to as Treatment 2), optimizing a multi-component reward function. After
training and deploying on online traffic, both treatments demonstrated incremental multi-million
revenue value. However, Treatment 2 resulted in significant profit losses, leading to a flat
overall reward, i.e., a larger discount factor resulted in poorer performance in our setting.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Evaluation Summary</title>
        <p>The evaluation demonstrates that the MADDPG model scales well in both online and offline
multi-agent environment settings, and is able to learn from the CRO offline data set. MADDPG
also significantly outperforms baselines like the bandits model, by 25.7% on the IPS evaluation.
We also evaluated this through an online A/B test, and the proposed solution brought incremental
multi-million revenue.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we frame page-level content ranking as a joint optimization problem, and tackle
it using the novel MADDPG model. We conduct extensive experiments on both an online Mujoco
environment and an offline CRO data set. The evaluations demonstrate that MADDPG scales
well up to a 2.5 billion action space and outperforms the baseline bandits model by 25.7%. In
the future, we will further invest in multi-agent RL in several directions: 1) explainability:
interpreting why an agent takes a certain action is important for content owners, but this
is not a trivial task, especially for neural network-based RL. Fortunately, we have several
directions to explore, such as reward decomposition [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and Shapley Q-values [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]; 2) efficient exploration: exploration is critical for RL. This paper applies a
noise-based method which may not be optimal for new contents. We will test entropy-based and
variance-based exploration methods; 3) enhanced lifelong RL: retraining new models with a saved
experience replay when encountering new contents is not ideal. We will research transfer RL to
support the onboarding of new contents without needing to retrain the models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Pi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gai</surname>
          </string-name>
          ,
          <article-title>Deep interest evolution network for click-through rate prediction</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>5941</fpage>
          -
          <lpage>5948</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Deb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sindhya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hakanen</surname>
          </string-name>
          ,
          <article-title>Multi-objective optimization</article-title>
          ,
          <source>in: Decision sciences</source>
          , CRC Press,
          <year>2016</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nassif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <article-title>An efficient bandit algorithm for realtime multivariate optimization</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1813</fpage>
          -
          <lpage>1821</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Werneck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <article-title>Multi-armed bandits in recommendation systems: A survey of the state-of-the-art and future directions</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>197</volume>
          (
          <year>2022</year>
          )
          <fpage>116669</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Takanobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Aggregating e-commerce search results from heterogeneous sources via hierarchical reinforcement learning</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1771</fpage>
          -
          <lpage>1781</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          , W. Liu,
          <article-title>Automate page layout optimization: An ofline deep q-learning approach</article-title>
          ,
          <source>in: Proceedings of the 16th ACM Conference on Recommender Systems</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>522</fpage>
          -
          <lpage>524</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for list-wise recommendations</article-title>
          ,
          <source>arXiv preprint arXiv:1801.00209</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for pagewise recommendations</article-title>
          ,
          <source>in: Proceedings of the 12th ACM conference on recommender systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Playing atari with deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1312.5602</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tavakoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kormushev</surname>
          </string-name>
          ,
          <article-title>Action branching architectures for deep reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the aaai conference on artificial intelligence</source>
          , volume
          <volume>32</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oroojlooy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hajinezhad</surname>
          </string-name>
          ,
          <article-title>A review of cooperative multi-agent deep reinforcement learning</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>53</volume>
          (
          <year>2023</year>
          )
          <fpage>13677</fpage>
          -
          <lpage>13722</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. I.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mordatch</surname>
          </string-name>
          ,
          <article-title>Multi-agent actor-critic for mixed cooperative-competitive environments</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          ,
          <article-title>Actor-critic algorithms</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>12</volume>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Continuous control with deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1509.02971</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E. L.</given-names>
            <surname>Ilboudo</surname>
          </string-name>
          ,
          <article-title>T-soft update of target network for deep reinforcement learning</article-title>
          ,
          <source>Neural Networks</source>
          <volume>136</volume>
          (
          <year>2021</year>
          )
          <fpage>63</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Policy invariance under reward transformations: Theory and application to reward shaping</article-title>
          ,
          <source>in: Icml</source>
          , volume
          <volume>99</volume>
          ,
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>1999</year>
          , pp.
          <fpage>278</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <article-title>Categorical reparameterization with gumbel-softmax</article-title>
          ,
          <source>arXiv preprint arXiv:1611.01144</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <article-title>Mujoco: A physics engine for model-based control</article-title>
          ,
          <source>in: 2012 IEEE/RSJ international conference on intelligent robots and systems</source>
          , IEEE,
          <year>2012</year>
          , pp.
          <fpage>5026</fpage>
          -
          <lpage>5033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <source>OpenAI Gym, Mujoco HalfCheetah-v2</source>
          , https://gym.openai.com/envs/HalfCheetah-v2/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Tucker,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Conservative q-learning for offline reinforcement learning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1179</fpage>
          -
          <lpage>1191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Agarap</surname>
          </string-name>
          ,
          <article-title>Deep learning using rectified linear units (relu</article-title>
          ), arXiv preprint arXiv:
          <year>1803</year>
          .
          <volume>08375</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sekhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sridharan</surname>
          </string-name>
          ,
          <article-title>Guarantees for epsilon-greedy reinforcement learning with function approximation</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4666</fpage>
          -
          <lpage>4689</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <article-title>Off-policy deep reinforcement learning without exploration</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2052</fpage>
          -
          <lpage>2062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nachum</surname>
          </string-name>
          , G. Tucker, S. Levine,
          <article-title>D4rl: Datasets for deep data-driven reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2004.07219</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kudo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <article-title>Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          ,
          <source>arXiv preprint arXiv:1808.06226</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <article-title>Introduction to propensity scores</article-title>
          ,
          <source>Respirology</source>
          <volume>19</volume>
          (
          <year>2014</year>
          )
          <fpage>625</fpage>
          -
          <lpage>635</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Juozapaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Erwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <article-title>Explainable reinforcement learning via reward decomposition</article-title>
          ,
          <source>in: IJCAI/ECAI Workshop on explainable artificial intelligence</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Shapley q-value: A local reward approach to solve global reward games</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>7285</fpage>
          -
          <lpage>7292</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>