Speeding up the Metabolism in E-commerce by Reinforcement Mechanism Design

Hua-Lin He, Alibaba Inc., Hangzhou, China, hualin.hhl@alibaba-inc.com
Chun-Xiang Pan, Alibaba Inc., Hangzhou, China, xuanran@taobao.com
Qing Da, Alibaba Inc., Hangzhou, China, daqing.dq@alibaba-inc.com
An-Xiang Zeng, Alibaba Inc., Hangzhou, China, renzhong@taobao.com

ABSTRACT
In a large E-commerce platform, all the participants compete for impressions under the allocation mechanism of the platform. Existing methods mainly focus on the short-term return based on the current observations instead of the long-term return. In this paper, we formally establish the lifecycle model for products, by defining the introduction, growth, maturity and decline stages and their transitions throughout the whole life period. Based on this model, we further propose a reinforcement learning based mechanism design framework for impression allocation, which incorporates a first principal component based permutation and a novel experience generation method, to maximize the short-term as well as the long-term return of the platform. With the power of trial-and-error, it is possible to optimize impression allocation strategies globally, which contributes to the healthy development of the participants and the platform itself. We evaluate our algorithm on a simulated environment built on one of the largest E-commerce platforms, and a significant improvement has been achieved in comparison with the baseline solutions.

CCS CONCEPTS
• Computing methodologies → Reinforcement learning; Policy iteration; • Applied computing → Online shopping;

KEYWORDS
Reinforcement Learning, Mechanism Design, E-commerce

ACM Reference Format:
Hua-Lin He, Chun-Xiang Pan, Qing Da, and An-Xiang Zeng. 2018. Speeding up the Metabolism in E-commerce by Reinforcement Mechanism Design. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 7 pages.

Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

1 INTRODUCTION
Nowadays, E-commerce platforms like Amazon or Taobao have developed into large business ecosystems consisting of millions of customers, enterprises and start-ups, and hundreds of thousands of service providers, making them a new type of economic entity rather than an enterprise platform. In such an economic entity, a major responsibility of the platform is to design economic institutions to achieve various business goals, which is exactly the field of Mechanism Design [1]. Among all the affairs of the E-commerce platform, impression allocation is one of the key strategies to achieve its business goal: products are players competing for resources under the allocation mechanism of the platform, and the platform is the game designer aiming to design a game whose outcome will be as the platform desires.

Existing work on impression allocation in the literature is mainly motivated and modeled from the perspective of supervised learning, roughly falling into the fields of information retrieval [2, 3] and recommendation [4, 5]. For these methods, a Click-Through-Rate (CTR) model is usually built based on either a ranking function or a collaborative filtering system, then impressions are allocated according to the CTR scores. However, these methods usually optimize the short-term clicks, by assuming that the properties of products are independent of the decisions of the platform, which hardly holds in the real E-commerce environment. There are also a few works that apply mechanism design to the allocation problem from an economic theory point of view, such as [6–8]. Nevertheless, these methods only work in very limited cases, such as when the participants play only once and their properties are statistically known or do not change over time, making them far from practical use in our scenario. A recent pioneering work named Reinforcement Mechanism Design [9] attempts to get rid of the unrealistic modeling assumptions of classic economic theory and to make automated optimization possible, by incorporating Reinforcement Learning (RL) techniques. It is a general framework which models the resource allocation problem over a sequence of rounds as a Markov decision process (MDP) [10], and solves the MDP with state-of-the-art RL methods. However, by defining the impression allocation over products as the action, it can hardly scale with the number of products/sellers, as shown in [11, 12]. Besides, it depends on an accurate behavioral model for the products/sellers, which is also infeasible due to the uncertainty of the real world.

Although the properties of products can not be fully observed or accurately predicted, they do share a similar pattern in terms
of development trend, as summarized in the product lifecycle theory [13, 14]. The life story of most products is a history of their passing through certain recognizable stages, namely the introduction, growth, maturity and decline stages.
    • Introduction: Also known as market development, this is when a new product is first brought to market. Sales are low and creep along slowly.
    • Growth: Demand begins to accelerate and the size of the total market expands rapidly.
    • Maturity: Demand levels off and further growth is slow.
    • Decline: The product begins to lose consumer appeal and sales drift downward.

During the lifecycle, new products arrive continuously and outdated products wither away every day, leading to a natural metabolism in the E-commerce platform. Due to insufficient statistics, new products usually attract little attention from conventional supervised learning methods, which makes the metabolism a very long process.

Inspired by the product lifecycle theory as well as the reinforcement mechanism design framework, we develop a reinforcement mechanism design that takes advantage of the product lifecycle theory. The key insight is that, with the power of trial-and-error, it is possible to recognize in advance the potentially hot products in the introduction stage as well as the potentially slow-selling products in the decline stage, so the metabolism can be sped up and the long-term efficiency can be increased with an optimal impression allocation strategy.

We formally establish the lifecycle model and formulate the impression allocation problem by regarding the global status of products as the state and the parameter adjustment of a scoring function as the action. Besides, we develop a novel framework which incorporates a first principal component based algorithm and a repeated sampling based experience generation method, as well as a shared convolutional neural network to further enhance the expressiveness and robustness. Moreover, we compare the feasibility and efficiency of the baselines and the improved algorithms in a simulated environment built on one of the largest E-commerce platforms.

The rest of the paper is organized as follows. The product lifecycle model and reinforcement learning algorithms are introduced in section 3. Then a reinforcement learning mechanism design framework is proposed in section 4. Furthermore, experimental results are analyzed in section 5. Finally, conclusions and future work are discussed in section 6.

2 RELATED WORK
Much research has been conducted on impression allocation, and it is dominated by supervised learning. In the ranking phase, a search engine aims to find good candidates and bring them to the front so that products with better performance gain more impressions. Click-through rate is one of the most common representations of product performance. Some research presents an approach to automatically optimize the retrieval quality with well-founded retrieval functions under a risk minimization framework using historical click-through data [15]. Other research proposes an unbiased estimation of document relevance by estimating the presentation probability of each document [16]. Nevertheless, both of these approaches suffer from low accuracy of click-through rate estimation because start-ups lack historical exposure data.

One of the topics most related to user impression allocation is the item cold-start problem [17], which has been extensively studied over the past decades. The research can be classified into three categories: hybrid algorithms combining collaborative filtering (CF) with content-based techniques [18, 19], bandit algorithms [20–22] and data supplement algorithms [23]. Among these, the hybrid algorithms exploit items’ properties, the bandit algorithms are designed for settings without item content and gather interactions from users effectively, and the data supplement algorithms view cold-start as a missing data problem. None of this research takes the whole product lifecycle of items into account, owing to the weakness of traditional prediction based machine learning models, resulting in a long-term imbalance between global efficiency and lifecycle optimization.

The application of reinforcement learning in commercial systems such as web recommendation and e-commerce search engines has not yet been well developed. Some attempts have been made to model the user impression allocation problem in e-commerce platforms such as Taobao.com and Amazon.com. By regarding the platform with millions of users as the environment and treating the engine allocating user impressions as the agent, a Markov Decision Process, or at least a Partially Observable Markov Decision Process, can be established. For example, a reinforcement learning capable model is established on each page status by limiting the page visit sequences to a constant number in a recommendation scene [24]. Another proposed model is established on global status by combining all the item historical representations in the platform [11]. However, both of these approaches struggle to maintain a fixed dimensionality of state observation and low-dimensional action outputs, and suffer from partial observation issues.

Recently, mechanism design has been applied to impression allocation, providing a new approach for better allocating user impressions [9, 25]. However, the former research is not suitable for real-world scenes because the output action space is too large to be practical. In this paper, a reinforcement learning based mechanism design is established for the impression allocation problem to maximize both the short-term and the long-term return of products in the platform, with a new approach to extract states from all products and to reduce the action space to a practical level.

3 PRELIMINARIES
3.1 Product Lifecycle Model
In this subsection, we establish a mathematical model of the product lifecycle with noise. At step $t$, each product has an observable attribute vector $x_t \in \mathbb{R}^d$ and an unobservable latent lifecycle state $z_t \in L$, where $d$ is the dimension of the attribute space, and $L = \{0, 1, 2, 3\}$ is the set of lifecycle stages indicating the introduction, growth, maturity and decline stages respectively. Let $p_t \in \mathbb{R}$ be the CTR and $q_t \in \mathbb{R}$ be the accumulated user impressions of the product. Without loss of generality, we assume that $p_t$ and $q_t$ are observable, i.e., they are two components of $x_t$, and that the platform allocates the impressions $u_t \in \mathbb{R}$ to the product. The dynamics of the system can be written as

$$
\begin{cases}
q_{t+1} = q_t + u_t \\
p_{t+1} = p_t + f(z_t, q_t) \\
z_{t+1} = g(x_t, z_t, t)
\end{cases}
\tag{1}
$$

where $f$ can be seen as the derivative of $p$, and $g$ is the state transition function over $L$.
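
The update in Eq. (1) can be read as a one-step simulator for a single product. The sketch below is a minimal illustration in Python under stated assumptions, not the authors' implementation: the function name `step` and the two stand-in callables `constant_increment` and `leave_intro_after_two_steps` are hypothetical placeholders for $f$ and $g$, and all numeric values are arbitrary.

```python
from typing import Callable, Sequence, Tuple


def step(
    x_t: Sequence[float],
    p_t: float,
    q_t: float,
    z_t: int,
    u_t: float,
    t: int,
    f: Callable[[int, float], float],
    g: Callable[[Sequence[float], int, int], int],
) -> Tuple[float, float, int]:
    """Advance one product by one step following Eq. (1)."""
    q_next = q_t + u_t          # accumulated impressions grow by the allocation
    p_next = p_t + f(z_t, q_t)  # CTR changes by the increment f(z_t, q_t)
    z_next = g(x_t, z_t, t)     # lifecycle stage follows the transition g
    return q_next, p_next, z_next


# Hypothetical stand-ins, chosen only to make the sketch runnable.
def constant_increment(z: int, q: float) -> float:
    """Toy f: the CTR rises by a fixed amount in the growth stage only."""
    return 0.125 if z == 1 else 0.0


def leave_intro_after_two_steps(x: Sequence[float], z: int, t: int) -> int:
    """Toy g: move from introduction (0) to growth (1) once t exceeds 2."""
    return 1 if (z == 0 and t > 2) else z


q, p, z = 0.0, 0.125, 0
for t in range(1, 6):
    x = [p, q]  # p_t and q_t are components of the attribute vector x_t
    q, p, z = step(x, p, q, z, u_t=10.0, t=t,
                   f=constant_increment, g=leave_intro_after_two_steps)
print(q, p, z)
```

Passing `f` and `g` as arguments keeps the three updates of Eq. (1) separate from any particular choice of CTR increment or stage transition, so either can be swapped without touching `step`.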


Figure 2: CTR evolution with the proposed lifecycle model (CTR versus time step, with the introduction, growth, maturity and decline stages marked).

According to the product lifecycle theory and online statistics, the derivative of the CTR can be formulated as

$$
f(z_t, q_t) =
\begin{cases}
\dfrac{(c_h - c_l)\, e^{-\delta(q_t)}}{(2 - z)\,\bigl(1 + e^{-\delta(q_t)}\bigr)^2} + \xi, & z \in \{1, 3\} \\[2ex]
\xi, & z \in \{0, 2\}
\end{cases}
\tag{2}
$$
where $\xi \sim N(0, \sigma^2)$ is Gaussian noise with zero mean and variance $\sigma^2$, $\delta(q_t) = (q_t - \tilde{q}_{t_z} - \delta_\mu)/\delta_\sigma$ is the normalized impressions accumulated from stage $z$, $\tilde{q}_{t_z}$ is the initial impressions when the product first evolves to the life stage $z$, $\delta_\mu, \delta_\sigma$ are two unobservable parameters for normalization, and $c_h, c_l \in \mathbb{R}$ are the highest CTR and the lowest CTR during the whole product lifecycle, inferred from two neural networks, respectively:

$$
c_l = h(x_t \mid \theta_l), \quad c_h = h(x_t \mid \theta_h), \tag{3}
$$

where $h(\cdot \mid \theta)$ is a neural network with the fixed parameter $\theta$, indicating that $c_l, c_h$ are unobservable but relevant to the attribute vector $x_t$. Intuitively, when the product stays in the introduction or maturity stage, the CTR is only influenced by the noise. When the product is in the growth stage, $f$ is a positive increment, making the CTR increase up to the upper bound $c_h$. A similar analysis applies to the product in the decline stage, where the factor $(2 - z)$ turns $f$ into a negative increment.

Figure 1: State transition during product lifecycle ($g$: $z = 0 \to z = 1$ when $t > t_1$; $z = 1 \to z = 2$ when $q > q_2$; $z = 1 \to z = 3$ when $t > t_2$ and $q < q_2$; $z = 2 \to z = 3$ when $t > t_3$).

Then we define the state transition function of the product lifecycle as a finite state machine, as illustrated in Fig. 1. The product starts in the initial stage $z = 0$, and enters the growth stage when the time exceeds $t_1$. During the growth stage, a product can either step into the maturity stage if its accumulated impressions $q$ reach $q_2$, or into the decline stage if the time exceeds $t_2$ while $q$ is less than $q_2$. A product in the maturity stage will finally enter the decline stage if the time exceeds $t_3$. Otherwise, the product stays in its current stage. Here, $t_1, t_2, t_3, q_2$ are the latent thresholds of products.

We simulate several products over the whole lifecycle with different latent parameters (the details can be found in the experimental settings); the CTR curves follow the trend described in Fig. 2.

3.2 Reinforcement Learning and DDPG methods
Reinforcement learning maximizes accumulated rewards by a trial-and-error approach in a sequential decision problem. The sequential decision problem is formulated as an MDP, a tuple of state space $S$, action space $A$, a conditional probability distribution $p(\cdot)$ and a scalar reward function $r = R(s, a)$, $R : S \times A \to \mathbb{R}$. For states $s_t, s_{t+1} \in S$ and action $a_t \in A$, the distribution function $p(s_{t+1} \mid s_t, a_t)$ denotes the transition probability from state $s_t$ to $s_{t+1}$ when action $a_t$ is adopted at time step $t$, and the Markov property $p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_1, a_1, \cdots, s_t, a_t)$ holds for any historical trajectory $s_1, a_1, \cdots, s_t$ arriving at state $s_t$. The future discounted return at time step $t$ is defined as $R_t^{\gamma} = \sum_{k=t}^{\infty} \gamma^{k-t} R(s_k, a_k)$, where $\gamma$ is a scalar factor representing the discount. A policy is denoted as $\pi_\theta(a_t \mid s_t)$, which is a probability distribution mapping from $S$ to $A$, where different policies are distinguished by the parameter $\theta$.

The target of the agent in reinforcement learning is to maximize the expected discounted return, and the performance objective can be denoted as

$$
\max_{\pi} J = \mathbb{E}\left[ R_1^{\gamma} \mid \pi \right]
= \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\left[ R(s, a) \right] \tag{4}
$$

where $d^{\pi}(s)$ is a discounted state distribution indicating the probability of encountering a state $s$ under the policy $\pi$. An action-value function is then obtained iteratively as

$$
Q(s_t, a_t) = \mathbb{E}\left[ R(s_t, a_t) + \gamma\, \mathbb{E}_{a \sim \pi_\theta}\left[ Q(s_{t+1}, a_{t+1}) \right] \right] \tag{5}
$$

In order to avoid calculating the gradients of the changing state distribution in continuous action space, the Deterministic Policy Gradient (DPG) method [26, 27] and the Deep Deterministic Policy Gradient [28] were brought forward. The gradient of the deterministic policy $\pi$ is

$$
\nabla_{\theta^{\mu}} J = \mathbb{E}_{s \sim d^{\mu}}\left[ \nabla_{\theta^{\mu}} Q^{w}(s, a) \right]
= \mathbb{E}_{s \sim d^{\mu}}\left[ \nabla_{\theta^{\mu}} \mu(s)\, \nabla_a Q^{w}(s, a) \big|_{a=\mu(s)} \right] \tag{6}
$$

where $\mu$ is the deep actor network approximating the policy function. The parameters of the actor network can be updated as

$$
\theta^{\mu} \leftarrow \theta^{\mu} + \alpha\, \mathbb{E}\left[ \nabla_{\theta^{\mu}} \mu(s_t)\, \nabla_a Q^{w}(s_t, a_t) \big|_{a=\mu(s)} \right] \tag{7}
$$

where $Q^{w}$ is an approximation of the action-value function called the critic network. Its parameter vector $w$ is updated according to the objective

$$
\min_{w} L = \mathbb{E}_{s \sim d^{\mu}}\left[ \bigl( y_t - Q^{w}(s_t, a_t) \bigr)^2 \right] \tag{8}
$$

where $y_t = R(s_t, a_t) + \gamma Q^{w'}(s_{t+1}, \mu'(s_{t+1}))$, $\mu'$ is the target actor network to approximate policy $\pi$, and $Q^{w'}$ is the target critic network
to approximate action-value function. The parameters w ′, θ µ are
                                                                  ′
                                                                             meaning of this formulation is the mathematical expect over all
updated softly as                                                            products in platform for the average click amount of an product
                       w ′ ← τw ′ + (1 − τ )w                                during its lifecycle, indicating the efficiency of products in the
                                                                             platform and it can be calculated accumulatively in the online
                       θ µ ← τθ µ + (1 − τ )θ µ
                         ′          ′
                                                                       (9)   environment, which can be approximately obtained by
                                                                                                                  n    i   t
4    A SCALABLE REINFORCEMENT                                                                    R(s, a) ≈
                                                                                                              1Õ 1 Õ
                                                                                                                         pi ui                      (15)
     MECHANISM DESIGN FRAMEWORK                                                                               n i ti τ =0 τ τ
In our scenario, at each step the platform observes the global information of all products and allocates impressions according to this observation and a given strategy, after which the products receive their impressions and update their attributes as well as their lifecycle stages. The platform then receives feedback that indicates how good its action was, and adjusts its strategy based on all the feedback. This procedure leads to a standard sequential decision-making problem.
   However, applying reinforcement learning to this problem encounters severe computational issues due to the high dimensionality of both the action space and the state space, especially for large n. We therefore formally model the impression allocation problem as a standard reinforcement learning problem, by regarding the global information of the platform as the state

                    s = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}                                   (10)

where n is the number of products on the platform and d is the dimension of the attribute space, and by regarding the parameter adjustment of a score function as the action,

                    a = \pi(s \mid \theta^\mu) \in \mathbb{R}^{d}                                                (11)

where π is the policy to be learned, parameterized by θ^µ. The action a is then used to calculate the scores of all products

                    o_i = \frac{1}{1 + e^{-a^T x_i}}, \quad \forall i \in \{1, 2, \ldots, n\}                   (12)

after which the impression allocation over all n products is obtained by

                    u_i = \frac{e^{o_i}}{\sum_{j=1}^{n} e^{o_j}}, \quad \forall i \in \{1, 2, \ldots, n\}       (13)

Without loss of generality, we assume that the impressions allocated at each step sum to 1, i.e., \sum_{i=1}^{n} u_i = 1. As is well known, the number of products n (billions) is far larger than the dimension of the product attributes d (thousands) on large-scale E-commerce platforms. With this definition, the dimension of the action space is reduced to d, which significantly alleviates the computational issue in previous work [12], where the dimension of the action space is n.
   The goal of the policy is to speed up the metabolism by scoring and ranking products with the product lifecycle taken into account, so that new products grow into the maturity stage as quickly as possible while the global efficiency is kept from dropping over a long period. Thus, we define the reward related to s and a as

                    R(s, a) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{t_i} \int_{t=0}^{t_i} p(t) \frac{dq(t)}{dt} \, dt \right]      (14)

where t_i is the number of time steps since the i-th product was brought to the platform, and p(t) and q(t) are the click-through rate function and the accumulated impressions of a product, respectively. The physical

   A major issue in the above model is that, in practice, there will be millions or even billions of products, so that combining all attribute vectors into a complete system state of size n × d is computationally unaffordable, as noted in [11]. A straightforward solution is to apply feature engineering to generate a low-dimensional representation of the state, s_l = G(s), where G is a pre-designed aggregator function. However, a pre-designed aggregator function is entirely subjective and depends heavily on hand-crafted features. Alternatively, we tackle this problem with a simple sampling based method. Specifically, the state is approximated by n_s products uniformly sampled from all products

                    \hat{s} = [x_1, x_2, \cdots, x_{n_s}]^T \in \mathbb{R}^{n_s \times d}                        (16)

where ŝ is the approximated state. Two issues arise with such a sampling method:
      • In which order should the n_s sampled products be arranged in ŝ, so as to achieve permutation invariance?
      • How can the bias introduced by the sampling procedure be reduced, especially when n_s is much smaller than n?
To solve these two problems, we further propose the first principal component based permutation and the repeated sampling based experiences generation, which are described in detail in the following subsections.

4.1     First Principal Component based Permutation
The order of the sampled products in the state vector has to be arranged properly, since an unsorted state matrix fluctuates severely during training, which makes it hard for the network parameters to converge. A simple way to fix the permutation is to order the products by a single dimension, such as the brought time t_i or the accumulated impressions q_i. However, such an ad-hoc method may lose information because it lacks a general principle. For example, if we sort by a feature that is almost identical across all products, the state matrix will keep fluctuating severely between observations. A more suitable solution is to sort the products in an order that preserves most of the information of all features, for which the first principal component is introduced [29]. We design a first principal component based permutation algorithm that projects each x_i onto a scalar v_i and sorts all the products according to v_i:

                    e_t = \arg\max_{\|e\| = 1} e^T s_t^T s_t e                                                   (17)

                    \hat{e} = \frac{\beta \hat{e} + (1 - \beta)(e_t - \hat{e})}{\|\beta \hat{e} + (1 - \beta)(e_t - \hat{e})\|}     (18)

                    v_i = \hat{e}^T x_i, \quad i = 1, 2, \cdots, n_s                                             (19)
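The projection-and-sort step of Eqs. 17–19 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`first_principal_component`, `soft_update`, `fpc_permute`) are hypothetical, power iteration is swapped in for "the classic PCA method" so that the example needs no external library, the attribute values are assumed non-negative (so that starting from an all-ones vector converges to the leading eigenvector), and ascending sort order is an arbitrary choice that the text does not specify.

```python
import math
import random


def gram(s):
    """Return s^T s for a row-major matrix s (n rows, each of length d)."""
    d = len(s[0])
    return [[sum(row[a] * row[b] for row in s) for b in range(d)] for a in range(d)]


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def first_principal_component(s, iters=200):
    """Approximate the unit vector e maximizing e^T s^T s e (Eq. 17).

    Power iteration is an assumption of this sketch; any eigen-solver that
    returns the leading eigenvector of s^T s would serve the same purpose.
    """
    g = gram(s)
    e = normalize([1.0] * len(g))  # hypothetical starting vector
    for _ in range(iters):
        e = normalize([sum(g_ab * e_b for g_ab, e_b in zip(g_row, e)) for g_row in g])
    return e


def soft_update(e_hat, e_t, beta):
    """Eq. 18 as printed: normalize beta * e_hat + (1 - beta) * (e_t - e_hat)."""
    mixed = [beta * h + (1.0 - beta) * (c - h) for h, c in zip(e_hat, e_t)]
    return normalize(mixed)


def fpc_permute(s_hat, e_hat):
    """Eq. 19 plus the sort: score each row by e_hat^T x_i and order rows by score."""
    scores = [sum(h * x for h, x in zip(e_hat, row)) for row in s_hat]
    order = sorted(range(len(s_hat)), key=scores.__getitem__)  # ascending: an assumption
    return [s_hat[k] for k in order], order


if __name__ == "__main__":
    random.seed(0)
    sample = [[random.random() for _ in range(4)] for _ in range(50)]  # toy n_s = 50, d = 4
    e_hat = first_principal_component(sample)
    e_hat = soft_update(e_hat, first_principal_component(sample), beta=0.999)
    ordered, order = fpc_permute(sample, e_hat)
    print(order[:5])
```

Because s^T s does not depend on the order of the rows of s, shuffling the sampled products leaves the projection vector, and hence the sorted state, unchanged up to floating-point error; this is the permutation invariance the section asks for.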
Speeding up the Metabolism in E-commerce by Reinforcement Mechanism DesignSIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA


In Eqs. 17–19, e_t is the first principal component of the system states at the current step t, obtained by the classic PCA method (Eq. 17); ê is the projection vector, softly updated by e_t (Eq. 18), with which we calculate the projected score of each product (Eq. 19). Here 0 < β < 1 is a scalar indicating the decay rate of ê. Finally, the state vector is denoted as

                    \hat{s} = [x_{k_1}, x_{k_2}, \cdots, x_{k_{n_s}}]^T                                          (20)

where k_1, k_2, \cdots, k_{n_s} is the order of the products, sorted by v_i.

4.2      Repeated Sampling based Experiences Generation
We adopt the classic experience replay technique [30, 31] to enrich experiences during the training phase, as in other reinforcement learning applications. In traditional experience replay, an experience is formulated as (s_t, a_t, r_t, s_{t+1}). However, as described above, there are theoretically C_n^{n_s} possible observations at each step, since we need to sample n_s products from all n products to approximate the global statistics. If n_s is much smaller than n, such an approximation will be inaccurate.
   To reduce this bias, we propose the repeated sampling based experiences generation. For each original experience, we repeatedly sample s_t and s_{t+1} m times each, obtaining m^2 experiences

                    (\hat{s}_t^i, a_t, r_t, \hat{s}_{t+1}^j), \quad i, j \in 1, 2, \cdots, m                     (21)

as illustrated in Fig. 3. This approach improves the stability of the observations in a noisy environment. It also helps to generate plenty of experiences when millions of repetitions are not available.

Figure 3: Classical experiences generation (left): one experience is obtained at each step by the pair (s_t, a_t, r_t, s_{t+1}); repeated sampling based experiences generation (right): m^2 experiences are obtained at each step by the pairs (ŝ_t^i, a_t, r_t, ŝ_{t+1}^j).

   It is worth noting that the repeated sampling is conducted only in the training phase. When acting in the environment, the action a_t is obtained from one randomly selected approximated state ŝ_t, i.e., a_t = π(ŝ_t^1). In fact, since a_t does not necessarily equal π(ŝ_t^i), ∀i ∈ 1, 2, \cdots, m, this can further help the network learn an invariant representation of the approximated state observations.
   The overall procedure is described in Algorithm 1. First, random sampling is used to obtain a sample of the system state. The sample is then permuted according to its projection onto the first principal component. After that, a one-step action and multiple observations are used to enrich the experience pool. Moreover, a shared convolutional neural network is applied within the actor-critic networks and the target actor-critic networks to extract features from the ordered state observation [32, 33]. Finally, the agent observes the system repeatedly and trains the actor-critic network to gradually learn an optimized policy.

Algorithm 1: The Scalable Reinforcement Mechanism Design Framework
   Initialize the parameters of the actor-critic network θ^µ, w, θ^{µ'}, w'
   Initialize the replay buffer M
   Initialize m observations ŝ_0^j
   Initialize the first principal component ê by ŝ_0
   foreach training step t do
       Select action a_t = µ(ŝ_t^1 | θ^µ)
       Execute action a_t and observe reward r_t
       foreach j ∈ 1, 2, \cdots, m do
           Sample a random subset of n_s products
           Combine an observation in the order of x_k^T ê:
               ŝ_t^j ← [x_{k_1}, x_{k_2}, \cdots, x_{k_{n_s}}]^T
           Update first principal component:
               e_t ← \arg\max_{\|e\| = 1} e^T (ŝ_t^j)^T ŝ_t^j e
               ê ← norm(β ê + (1 − β)(e_t − ê))
       end
       foreach i, j ∈ 1, 2, \cdots, m do
           M ← M ∪ {(ŝ_t^i, a_t, r_t, ŝ_{t+1}^j)}
       end
       Sample n_k transitions from M: (ŝ_k, a_k, r_k, ŝ_{k+1})
       Update critic and actor networks:
           w ← w + (α_w / n_k) \sum_k (y_k − Q_w(ŝ_k, a_k)) ∇_w Q_w(ŝ_k, a_k)
           θ^µ ← θ^µ + (α_µ / n_k) \sum_k ∇_{θ^µ} µ(ŝ_k) ∇_{a_k} Q_w(ŝ_k, a_k)
       Update the target networks:
           w' ← τ w' + (1 − τ) w
           θ^{µ'} ← τ θ^{µ'} + (1 − τ) θ^µ
   end

5     EXPERIMENTAL RESULTS
To demonstrate how the proposed approach can help improve the long-term efficiency by speeding up the metabolism, we apply the proposed reinforcement learning based mechanism design, as well as other comparison methods, to a simulated E-commerce platform built on the proposed product lifecycle model.

5.1     The Configuration
The simulation is built on the product lifecycle model proposed in Section 3.1. Among all the parameters, q_2 is uniformly sampled from [10^4, 10^6]; t_1, t_2, t_3, δ_µ, δ_σ are uniformly sampled from [5, 30], [35, 120], [60, 180], [10^4, 10^6], [2.5 × 10^3, 2.5 × 10^5] respectively; and the parameter σ is set to 0.016. The parameters c_l, c_h are generated by a fixed neural network whose parameters are uniformly sampled from [−0.5, 0.5] to model online environments, with the outputs scaled into the intervals of [0.01, 0.05] and [0.1, 0.15]

respectively. Apart from the normalized dynamic CTR p and the accumulated impressions q, the attribute vector x is uniformly sampled from [0, 1] element-wise, with dimension d = 15. All the latent parameters in the lifecycle model are assumed to be unobservable during the learning phase.
   The DDPG algorithm is adopted as the learning algorithm. The learning rates for the actor network and the critic network are 10^{-4} and 10^{-3} respectively, with the ADAM optimizer [34]. The replay buffer size is limited to 2.5 × 10^4. The most relevant parameters involved in the learning procedure are set as in Table 1.

              Table 1: Parameters in learning phase

       Param    Value    Description
       n_s      10^3     Number of products in each sample
       β        0.999    First principal component decay rate
       γ        0.99     Rewards discount factor
       τ        0.99     Target network decay rate
       m        5        Repeated observation times

   The following methods are compared:
      • CTR-A: The impressions are allocated in proportion to the CTR score.
      • T-Perm: The basic DDPG algorithm, with brought time based permutation and a fully connected network to process the state.
      • FPC: The basic DDPG algorithm, with first principal component based permutation and a fully connected network to process the state.
      • FPC-CNN: FPC with a shared two-layer convolutional neural network in the actor-critic networks.
      • FPC-CNN-EXP: FPC-CNN with the improved experiences generation method.
where CTR-A is the classic supervised learning method and the others are the methods proposed in this paper. For all the experiments, CTR-A is first applied for the first 360 steps to initialize the system into a stable status, i.e., the distribution over the different lifecycle stages is stable; then the other methods are run for another 2k steps, and the actor-critic networks are trained 12.8k times.

5.2     The Results
We first show the discounted accumulated rewards of the different methods at every step in Fig. 4. After the initialization with CTR-A, we find that the discounted accumulated reward of CTR-A itself converges to almost 100 after 360 steps (which is why 360 steps is selected for the initialization), while that of the other methods increases further with more learning steps. All FPC based algorithms beat the T-Perm algorithm, indicating that the FPC based algorithms find a more suitable permutation to arrange items, whereas the brought time based permutation loses information and therefore reaches a lower final accumulated reward. Moreover, the CNN and EXP variants extract features from observations automatically and more effectively, yielding a slight improvement in convergence speed. All three FPC based algorithms converge to the same final accumulated reward, since their state inputs share the same observation representation.

Figure 4: Performance Comparison between algorithms (discounted return versus step; curves: FPC-CNN-EXP, FPC-CNN, FPC, T-Perm).

   Then we investigate the distribution shift of the impression allocation over the 4 lifecycle stages after the training procedure of the FPC-CNN-EXP method, as shown in Fig. 5. It can be seen that the percentage of the decline stage decreases while the percentages of the introduction and maturity stages increase. By giving up the products in the decline stage, the platform avoids wasting impressions, since these products always have a low CTR. By encouraging the products in the introduction stage, it gains chances to explore more potential hot products. By supporting the products in the maturity stage, it maximizes the short-term efficiency, since these products have nearly the highest CTRs of their lifecycle.

Figure 5: Percentage of impressions allocated to different stages (percentage versus step; curves: Introduction, Growth, Maturity, Decline).

   We finally show the change of the global clicks, the rewards, and the average time for a product to grow into the maturity stage from its brought time at each step, in terms of the relative change rate compared with the CTR-A method, as shown in Fig. 6. The global average click increases by 6% while the reward improves by 30%. The gap here is probably caused by the inconsistency between the reward definition and the global average click metric. In fact, the designed reward contains some other implicit objectives related to the metabolism. To further verify this guess, we show that the average time for items to grow into the maturity stage has dropped by 26%, indicating that the metabolism is significantly


speeded up. Thus, we empirically prove that, through the proposed                           [11] Qingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang, and Yiwei Zhang. Reinforce-
reinforcement learning based mechanism design which utilizes                                     ment mechanism design for fraudulent behaviour in e-commerce. 2018.
                                                                                            [12] Qingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang, and Yiwei Zhang. Reinforce-
the lifecycle theory, the long-term efficiency can be increased by                               ment mechanism design for e-commerce. CoRR, abs/1708.07607, 2017.
speeding up the metabolism.                                                                 [13] Theodore Levitt. Exploit the product life cycle. Harvard business review, 43:81–94,
                                                                                                 1965.
                                                                                            [14] Hui Cao and Paul Folan. Product life cycle: the evolution of a paradigm and
              1.4                                                                                literature review from 1950–2009. Production Planning & Control, 23(8):641–662,
                                                                                                 2012.
                                                                                            [15] Thorsten Joachims. Optimizing search engines using clickthrough data. In
              1.2                                                                                Proceedings of the eighth ACM SIGKDD international conference on Knowledge
 percentage


                                                                                                 discovery and data mining, pages 133–142. ACM, 2002.
              1.0                                                                           [16] Georges E Dupret and Benjamin Piwowarski. A user browsing model to predict
                                                                                                 search engine click data from past observations. In Proceedings of the 31st annual
                                                                                                 international ACM SIGIR conference on Research and development in information
              0.8                                                        Click                   retrieval, pages 331–338. ACM, 2008.
                                                                         Rewards            [17] Francesco Ricci, Lior Rokach, and Bracha Shapira. Introduction to recommender
              0.6                                                        Time cost               systems handbook. In Recommender systems handbook, pages 1–35. Springer,
                                                                                                 2011.
                    200   400   600   800 1000 1200 1400 1600 1800                          [18] Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start
                                             step                                                recommendation. In Proceedings of the third ACM conference on Recommender
                                                                                                 systems, pages 21–28. ACM, 2009.
                                                                                            [19] Martin Saveski and Amin Mantrach. Item cold-start recommendations: learn-
                                                                                                 ing local collective embeddings. In Proceedings of the 8th ACM Conference on
                    Figure 6: Metabolism relative metrics                                        Recommender systems, pages 89–96. ACM, 2014.
                                                                                            [20] Jin-Hu Liu, Tao Zhou, Zi-Ke Zhang, Zimo Yang, Chuang Liu, and Wei-Min Li.
                                                                                                 Promoting cold-start items in recommender systems. PloS one, 9(12):e113457,
                                                                                                 2014.
6             CONCLUSIONS AND FUTURE WORK                                                   [21] Oren Anava, Shahar Golan, Nadav Golbandi, Zohar Karnin, Ronny Lempel, Oleg
                                                                                                 Rokhlenko, and Oren Somekh. Budget-constrained item cold-start handling in
In this paper, we propose an end-to-end general reinforcement                                    collaborative filtering recommenders via optimal design. In Proceedings of the
learning framework to improve the long-term efficiency by speed-                                 24th International Conference on World Wide Web, pages 45–54. International
ing up the metabolism. We reduce the action space to a manageable
size and then propose a permutation based on the first principal
component to give the agent a better observation of the environment
state. In addition, an improved experience generation technique is
used to enrich the experience pool, and the actor-critic network is
extended with a shared convolutional network for better state
representation. Experimental results show that our algorithms
outperform the baseline algorithms.
   For future work, one promising direction is to develop a
theoretical guarantee for the first principal component based
permutation. Another possible improvement is to introduce
nonlinearity into the scoring function for products.
                                                                                            [27] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and
REFERENCES                                                                                       Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of
                                                                                                 the 31st International Conference on Machine Learning (ICML-14), pages 387–395,
 [1] William Vickrey. Counterspeculation, auctions, and competitive sealed tenders.
                                                                                                 2014.
     The Journal of finance, 16(1):8–37, 1961.
                                                                                            [28] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
 [2] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton,
                                                                                                 Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep
     and Greg Hullender. Learning to rank using gradient descent. In Proceedings of
                                                                                                 reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
     the 22nd international conference on Machine learning, pages 89–96. ACM, 2005.
                                                                                            [29] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdis-
 [3] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to
                                                                                                 ciplinary reviews: computational statistics, 2(4):433–459, 2010.
     rank: from pairwise approach to listwise approach. In Proceedings of the 24th
                                                                                            [30] Long Ji Lin. Self-improving reactive agents based on reinforcement learning,
     international conference on Machine learning, pages 129–136. ACM, 2007.
                                                                                                 planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
 [4] Greg Linden, Brent Smith, and Jeremy York. Amazon. com recommendations:
                                                                                            [31] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis
     Item-to-item collaborative filtering. IEEE Internet computing, 7(1):76–80, 2003.
                                                                                                 Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep
 [5] Yehuda Koren and Robert Bell. Advances in collaborative filtering. In Recom-
                                                                                                 reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
     mender systems handbook, pages 77–118. Springer, 2015.
                                                                                            [32] Yu-Hu Cheng, Jian-Qiang Yi, and Dong-Bin Zhao. Application of actor-critic
 [6] Roger B Myerson. Optimal auction design. Mathematics of operations research,
                                                                                                 learning to adaptive state space construction. In Machine Learning and Cyber-
     6(1):58–73, 1981.
                                                                                                 netics, 2004. Proceedings of 2004 International Conference on, volume 5, pages
 [7] Noam Nisan and Amir Ronen. Algorithmic mechanism design. Games and
                                                                                                 2985–2990. IEEE, 2004.
     Economic Behavior, 35(1-2):166–196, 2001.
                                                                                            [33] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game
 [8] Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-
                                                                                                 with actor-critic curriculum learning. 2016.
     theoretic, and logical foundations. Cambridge University Press, 2008.
                                                                                            [34] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
 [9] Pingzhong Tang. Reinforcement mechanism design. In Early Carrer Highlights
                                                                                                 CoRR, abs/1412.6980, 2014.
     at Proceedings of the 26th International Joint Conference on Artificial Intelligence
     (IJCAI, pages 5146–5150, 2017.
[10] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov
     decision processes. Mathematics of operations research, 12(3):441–450, 1987.