<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speeding up the Metabolism in E-commerce by Reinforcement Mechanism Design</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hua-Lin He</string-name>
          <email>hualin.hhl@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chun-Xiang Pan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qing Da</string-name>
          <email>daqing.dq@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>An-Xiang Zeng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alibaba Inc.</institution>
          ,
          <addr-line>Hangzhou</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>In a large E-commerce platform, all the participants compete for impressions under the allocation mechanism of the platform. Existing methods mainly focus on the short-term return based on the current observations instead of the long-term return. In this paper, we formally establish the lifecycle model for products by defining the introduction, growth, maturity and decline stages and their transitions throughout the whole life period. Based on this model, we further propose a reinforcement learning based mechanism design framework for impression allocation, which incorporates a first principal component based permutation and a novel experiences generation method, to maximize the short-term as well as the long-term return of the platform. With the power of trial-and-error, it is possible to optimize impression allocation strategies globally, which contributes to the healthy development of the participants and of the platform itself. We evaluate our algorithm on a simulated environment built on one of the largest E-commerce platforms, and a significant improvement has been achieved in comparison with the baseline solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Mechanism Design</kwd>
        <kwd>E-commerce</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Reinforcement learning; Policy
iteration; • Applied computing → Online shopping;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Nowadays, E-commerce platforms like Amazon or Taobao have developed into large business ecosystems consisting of millions of customers, enterprises and start-ups, and hundreds of thousands of service providers, making them a new type of economic entity rather than an enterprise platform. In such an economic entity, a major responsibility of the platform is to design economic institutions to achieve various business goals, which is exactly the field of Mechanism Design [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among all the affairs of the E-commerce platform, impression allocation is one of the key strategies for achieving its business goals: products are players competing for resources under the allocation mechanism of the platform, and the platform is the game designer aiming to design a game whose outcome will be as the platform desires.
      </p>
      <p>Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA. © 2018 Copyright held by the owner/author(s).</p>
      <p>
        Existing work on impression allocation in the literature is mainly motivated and modeled from the perspective of supervised learning, roughly falling into the fields of information retrieval [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and recommendation [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. For these methods, a Click-Through-Rate (CTR) model is usually built on either a ranking function or a collaborative filtering system, and impressions are then allocated according to the CTR scores. However, these methods usually optimize short-term clicks by assuming that the properties of products are independent of the decisions of the platform, which hardly holds in a real E-commerce environment. There is also some work that applies mechanism design to the allocation problem from the point of view of economic theory, such as [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6–8</xref>
        ]. Nevertheless, these methods only work in very limited cases, for example when the participants play only once and their properties are statistically known or do not change over time, which makes them far from practical in our scenario. A recent pioneering work named Reinforcement Mechanism Design [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] attempts to get rid of the unrealistic modeling assumptions of classic economic theory and to make automated optimization possible by incorporating Reinforcement Learning (RL) techniques. It is a general framework that models the resource allocation problem over a sequence of rounds as a Markov decision process (MDP) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and solves the MDP with state-of-the-art RL methods. However, by defining the impression allocation over products as the action, it can hardly scale with the number of products/sellers, as shown in [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Besides, it depends on an accurate behavioral model of the products/sellers, which is also infeasible due to the uncertainty of the real world.
      </p>
      <p>
        Although the properties of products cannot be fully observed
or accurately predicted, they do share a similar pattern in terms
of development trend, as summarized in the product lifecycle
theory [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. The life story of most products is a history of their
passing through certain recognizable stages including introduction,
growth, maturity and decline stages.
      </p>
      <p>• Introduction: Also known as market development, this is
when a new product is first brought to market. Sales are low
and creep along slowly.
• Growth: Demand begins to accelerate and the size of the total
market expands rapidly.
• Maturity: Demand levels off and grows only slowly, if at all.
• Decline: The product begins to lose consumer appeal and
sales drift downward.</p>
      <p>During the lifecycle, new products arrive continuously and outdated
products wither away every day, leading to a natural metabolism
in the E-commerce platform. Due to insufficient statistics, new
products usually attract little attention from conventional supervised
learning methods, which makes the metabolism a very long process.</p>
      <p>Inspired by the product lifecycle theory as well as the reinforcement
mechanism design framework, we develop a reinforcement mechanism design
that takes advantage of the product lifecycle theory.
The key insight is that, with the power of trial-and-error,
it is possible to recognize in advance the potentially hot products
in the introduction stage as well as the potentially slow-selling
products in the decline stage, so that the metabolism can be sped
up and the long-term efficiency can be increased with an optimal
impression allocation strategy.</p>
      <p>We formally establish the lifecycle model and formulate the
impression allocation problem by regarding the global status of
products as the state and the parameter adjustment of a scoring
function as the action. Besides, we develop a novel framework
which incorporates a first principal component based algorithm
and a repeated sampling based experiences generation method,
as well as a shared convolutional neural network to further
enhance the expressiveness and robustness. Moreover, we compare
the feasibility and efficiency of the baselines and the improved
algorithms in a simulated environment built on one of the
largest E-commerce platforms.</p>
      <p>The rest of the paper is organized as follows. The product
lifecycle model and reinforcement learning algorithms are introduced in
section 3. Then a reinforcement learning mechanism design
framework is proposed in section 4. Furthermore, experimental results
are analyzed in section 5. Finally, conclusions and future work are
discussed in section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        Much research has been conducted on impression allocation, and it is dominated by supervised learning. In the ranking phase, a search engine aims to find good candidates and bring them to the front, so that products with better performance gain more impressions. Click-through rate is one of the most common representations of product performance. One line of research presents an approach to automatically optimize the retrieval quality of well-founded retrieval functions under a risk minimization framework using historical click-through data [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Another proposes an unbiased estimation of document relevance by estimating the presentation probability of each document [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Nevertheless, both of these approaches suffer from low accuracy of click-through rate estimation, because start-ups lack historical exposure data.
      </p>
      <p>
        One of the topics most related to user impression allocation is the item cold-start problem [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which has been extensively studied over the past decades. The research can be classified into three categories: hybrid algorithms combining collaborative filtering (CF) with content-based techniques [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ], bandit algorithms [
        <xref ref-type="bibr" rid="ref20 ref21 ref22">20–22</xref>
        ] and data supplement algorithms [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Among these, the hybrid algorithms exploit the properties of items, the bandit algorithms are designed for the setting without item content and gather interactions from users effectively, and the data supplement algorithms view cold-start as a missing-data problem. None of these approaches takes the whole product lifecycle of items into account, owing to the weakness of traditional prediction based machine learning models, which results in a long-term imbalance between global efficiency and lifecycle optimization.
      </p>
      <p>
        The application of reinforcement learning in commercial systems such as web recommendation and e-commerce search engines has not yet been well developed. Some attempts have been made to model the user impression allocation problem in e-commerce platforms such as Taobao.com and Amazon.com. By regarding the platform with millions of users as the environment and treating the engines that allocate user impressions as agents, a Markov Decision Process, or at least a Partially Observable Markov Decision Process, can be established. For example, a model suitable for reinforcement learning is established on each page status by limiting the page visit sequences to a constant number in a recommendation scene [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Another proposed model is established on the global status by combining all the historical item representations in the platform [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, both of these approaches struggle to maintain a fixed dimensionality of the state observation and low-dimensional action outputs, and they suffer from partial observation issues.
      </p>
      <p>
        Recently, mechanism design has been applied to impression allocation, providing a new approach for better allocating user impressions [
        <xref ref-type="bibr" rid="ref25 ref9">9, 25</xref>
        ]. However, the former studies are not suitable for real-world scenes because the output action space is too large to be practical. In this paper, a reinforcement learning based mechanism design is established for the impression allocation problem to maximize both the short-term and the long-term return of products in the platform, with a new approach to extract states from all products and to reduce the action space to a practical level.
      </p>
    </sec>
    <sec id="sec-4">
      <title>PRELIMINARIES</title>
    </sec>
    <sec id="sec-5">
      <title>Product Lifecycle Model</title>
      <p>In this subsection, we establish a mathematical model of the product lifecycle with noise. At step t, each product has an observable attribute vector x<sub>t</sub> ∈ ℝ<sup>d</sup> and an unobservable latent lifecycle state z<sub>t</sub> ∈ L, where d is the dimension of the attribute space, and L = {0, 1, 2, 3} is the set of lifecycle stages indicating the introduction, growth, maturity and decline stages respectively. Let p<sub>t</sub> ∈ ℝ be the CTR and q<sub>t</sub> ∈ ℝ be the accumulated user impressions of the product. Without loss of generality, we assume that p<sub>t</sub> and q<sub>t</sub> are two observable components of x<sub>t</sub>, and that the platform allocates the impressions u<sub>t</sub> ∈ ℝ to the product. The dynamics of the system can be written as
        <disp-formula><label>(1)</label><tex-math><![CDATA[\begin{cases} q_{t+1} = q_t + u_t \\ p_{t+1} = p_t + f(z_t, q_t) \\ z_{t+1} = g(x_t, z_t, t) \end{cases}]]></tex-math></disp-formula>
        where f can be seen as the derivative of p, and g is the state transition function over L.</p>
      <p>According to the product lifecycle theory and online statistics, the derivative of the CTR can be formulated as
        <disp-formula><label>(2)</label><tex-math><![CDATA[f(z_t, q_t) = \begin{cases} \dfrac{(c_h - c_l)\, e^{-\delta(q_t)}}{(2 - z)\,\bigl(1 + e^{-\delta(q_t)}\bigr)^{2}} + \xi, & z \in \{1, 3\} \\[2ex] \xi, & z \in \{0, 2\} \end{cases}]]></tex-math></disp-formula>
        where ξ ∼ N(0, σ<sup>2</sup>) is Gaussian noise with zero mean and variance σ<sup>2</sup>, <inline-formula><tex-math><![CDATA[\delta(q_t) = (q_t - \tilde{q}_{t_z} - \delta_\mu)/\delta_\sigma]]></tex-math></inline-formula> is the normalized impressions accumulated from stage z, <inline-formula><tex-math><![CDATA[\tilde{q}_{t_z}]]></tex-math></inline-formula> is the initial impressions when the product first evolves to the life stage z, δ<sub>µ</sub> and δ<sub>σ</sub> are two unobservable parameters for normalization, and c<sub>h</sub>, c<sub>l</sub> ∈ ℝ are the highest CTR and the lowest CTR during the whole product lifecycle, inferred from two neural networks, respectively:
        <disp-formula><label>(3)</label><tex-math><![CDATA[c_l = h(x_t \mid \theta_l), \qquad c_h = h(x_t \mid \theta_h),]]></tex-math></disp-formula>
        where h(·|θ) is a neural network with the fixed parameter θ, indicating that c<sub>l</sub> and c<sub>h</sub> are unobservable but relevant to the attribute vector x<sub>t</sub>. Intuitively, when the product stays in the introduction or maturity stage, the CTR is influenced only by the noise. When the product is in the growth stage, f is a positive increment, making the CTR increase up to the upper bound c<sub>h</sub>. A similar analysis can be obtained for the product in the decline stage.</p>
      <p>The state transition function g is a finite state machine over the lifecycle stages. A product starts in the introduction stage (z = 0) and enters the growth stage (z = 1) once the time exceeds t<sub>1</sub>. A product in the growth stage steps into the maturity stage (z = 2) if its accumulated impressions q reach q<sub>2</sub>, or into the decline stage (z = 3) if the time exceeds t<sub>2</sub> while q is less than q<sub>2</sub>. A product in the maturity stage will finally enter the decline stage if the time exceeds t<sub>3</sub>. Otherwise, the product stays in its current stage. Here, t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub>, q<sub>2</sub> are the latent thresholds of products.</p>
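      <p>As a rough illustration of the lifecycle dynamics above, the following sketch simulates a single product. It is only a minimal sketch under stated assumptions: c<sub>l</sub> and c<sub>h</sub> are passed in as constants instead of being produced by neural networks, the CTR is assumed to start at c<sub>l</sub>, and all names (simulate, ctr_increment, next_stage, the parameter dictionary) are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ctr_increment(z, q, q_stage_start, prm, rng):
    """f(z, q): noise only for z in {0, 2}, signed logistic slope for z in {1, 3}."""
    noise = rng.gauss(0.0, prm["sigma"])
    if z in (0, 2):
        return noise
    d = (q - q_stage_start - prm["delta_mu"]) / prm["delta_sigma"]
    s = sigmoid(d)
    # e^{-d} / (1 + e^{-d})^2 equals s * (1 - s); (2 - z) is +1 in growth, -1 in decline.
    return (prm["c_h"] - prm["c_l"]) * s * (1.0 - s) / (2 - z) + noise


def next_stage(z, q, t, prm):
    """Finite state machine g over the stages 0 (introduction) .. 3 (decline)."""
    if z == 0 and t > prm["t1"]:
        return 1
    if z == 1 and q > prm["q2"]:
        return 2
    if z == 1 and t > prm["t2"]:
        return 3
    if z == 2 and t > prm["t3"]:
        return 3
    return z


def simulate(steps, impressions_per_step, prm, seed=0):
    """Return one (t, z, q, p) record per step for a single product."""
    rng = random.Random(seed)
    # Starting the CTR at c_l is an assumption of this sketch.
    z, q, p, q_start = 0, 0.0, prm["c_l"], 0.0
    history = []
    for t in range(steps):
        history.append((t, z, q, p))
        dp = ctr_increment(z, q, q_start, prm, rng)
        z_next = next_stage(z, q, t, prm)
        q += impressions_per_step
        p += dp
        if z_next != z:
            z, q_start = z_next, q  # impressions at entry into the new stage
    return history
```
]]></preformat>
      <p>In this sketch the total CTR gain over the growth stage depends on how many impressions are allocated per step relative to δ<sub>σ</sub>; it approaches c<sub>h</sub> − c<sub>l</sub> when each step advances the normalized impressions by about one unit.</p>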
      <p>We simulate several products over the whole lifecycle with different latent parameters (the details can be found in the experimental settings); the CTR curves follow the trend shown in Fig. 2.</p>
      <p>[Fig. 2: Typical lifecycle.]</p>
    </sec>
    <sec id="sec-6">
      <title>Reinforcement Learning and DDPG</title>
      <p>Reinforcement learning maximizes accumulated rewards by a trial-and-error approach in a sequential decision problem. The sequential decision problem is formulated by an MDP as a tuple of state space S, action space A, a conditional probability distribution p(·) and a scalar reward function r = R(s, a), R : S × A → ℝ. For states s<sub>t</sub>, s<sub>t+1</sub> ∈ S and action a<sub>t</sub> ∈ A, the distribution function p(s<sub>t+1</sub> | s<sub>t</sub>, a<sub>t</sub>) denotes the transition probability from state s<sub>t</sub> to s<sub>t+1</sub> when action a<sub>t</sub> is adopted in time step t, and the Markov property p(s<sub>t+1</sub> | s<sub>t</sub>, a<sub>t</sub>) = p(s<sub>t+1</sub> | s<sub>1</sub>, a<sub>1</sub>, · · · , s<sub>t</sub>, a<sub>t</sub>) holds for any historical trajectory s<sub>1</sub>, a<sub>1</sub>, · · · , s<sub>t</sub> arriving at status s<sub>t</sub>. The future discounted return at step t is <inline-formula><tex-math><![CDATA[R_t^{\gamma} = \sum_{k=t}^{\infty} \gamma^{k-t} R(s_k, a_k)]]></tex-math></inline-formula>, where γ is a scalar factor representing the discount. A policy is denoted as π<sub>θ</sub>(a<sub>t</sub> | s<sub>t</sub>), which is a probability distribution mapping from S to A, where different policies are distinguished by the parameter θ.</p>
      <p>The target of the agent in reinforcement learning is to maximize the expected discounted return, and the performance objective can be denoted as
        <disp-formula><label>(4)</label><tex-math><![CDATA[\max_{\pi} J = \mathbb{E}\left[ R_1^{\gamma} \mid \pi \right] = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_{\theta}} \left[ R(s, a) \right]]]></tex-math></disp-formula>
        where d<sup>π</sup>(s) is a discounted state distribution indicating the possibility of encountering a state s under the policy π. An action-value function is then obtained iteratively as
        <disp-formula><label>(5)</label><tex-math><![CDATA[Q(s_t, a_t) = \mathbb{E}\left[ R(s_t, a_t) + \gamma\, \mathbb{E}_{a \sim \pi_{\theta}} \left[ Q(s_{t+1}, a_{t+1}) \right] \right]]]></tex-math></disp-formula>
      </p>
      <p>
        In order to avoid calculating the gradients of the changing state distribution in a continuous action space, the Deterministic Policy Gradient (DPG) method [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ] and the Deep Deterministic Policy Gradient (DDPG) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] were brought forward. The gradient of the deterministic policy π is
        <disp-formula><label>(6)</label><tex-math><![CDATA[\nabla_{\theta^{\mu}} J = \mathbb{E}_{s \sim d^{\mu}} \left[ \nabla_{\theta^{\mu}} Q^{w}(s, a) \right] = \mathbb{E}_{s \sim d^{\mu}} \left[ \nabla_{\theta^{\mu}} \mu(s)\, \nabla_{a} Q^{w}(s, a) \big|_{a = \mu(s)} \right]]]></tex-math></disp-formula>
        where µ is the deep actor network that approximates the policy function. The parameters of the actor network can be updated as
        <disp-formula><label>(7)</label><tex-math><![CDATA[\theta^{\mu} \leftarrow \theta^{\mu} + \alpha\, \mathbb{E}\left[ \nabla_{\theta^{\mu}} \mu(s_t)\, \nabla_{a} Q^{w}(s_t, a_t) \big|_{a = \mu(s)} \right]]]></tex-math></disp-formula>
        where Q<sup>w</sup> is an approximation of the action-value function called the critic network. Its parameter vector w is updated according to the objective
        <disp-formula><label>(8)</label><tex-math><![CDATA[\min_{w} L = \mathbb{E}_{s \sim d^{\mu}} \left[ \left( y_t - Q^{w}(s_t, a_t) \right)^{2} \right]]]></tex-math></disp-formula>
        where y<sub>t</sub> = R(s<sub>t</sub>, a<sub>t</sub>) + γ Q<sup>w′</sup>(s<sub>t+1</sub>, µ′(s<sub>t+1</sub>)), µ′ is the target actor network that approximates the policy π, and Q<sup>w′</sup> is the target critic network that approximates the action-value function. The parameters w′ and θ<sup>µ′</sup> are updated softly, that is, at each step they move slowly toward w and θ<sup>µ</sup> at a rate set by the target network decay rate.
      </p>
      <p>In our scenario, at each step, the platform observes the global
information of all the products and then allocates impressions according
to the observation and a certain strategy, after which the
products receive their impressions and update their attributes as
well as their lifecycle stages. The platform then receives
feedback to judge how good its action was, and adjusts its strategy based
on all the feedback. The above procedure leads to a standard
sequential decision making problem.</p>
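      <p>The scalar quantities used by DDPG in this section (the discounted return, the critic target y<sub>t</sub> and the soft update of the target parameters) can be sketched in a few lines. This is only an illustration on plain lists of floats, not a training loop. The convention target ← τ·target + (1 − τ)·source and the name tau are assumptions of this sketch; the text only names a target network decay rate.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(k - t) * R_k over a finite reward sequence starting at step t."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def critic_target(reward, gamma, next_q):
    """y_t = R(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1})); next_q is the target critic value."""
    return reward + gamma * next_q


def soft_update(target_params, source_params, tau):
    """Assumed convention: target <- tau * target + (1 - tau) * source."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(target_params, source_params)]
```
]]></preformat>
      <p>With a decay rate close to 1, the target parameters change slowly, which keeps the regression target y<sub>t</sub> of the critic stable between updates.</p>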
      <p>However, applying reinforcement learning to this problem
encounters severe computational issues, due to the high dimensionality
of both the action space and the state space, especially with a large n.
Thus, we formally model the impression allocation problem as a standard
reinforcement learning problem, by regarding the global
information of the platform as the state
        <disp-formula><label>(10)</label><tex-math><![CDATA[s = [x_1, x_2, \ldots, x_n]^{\mathsf{T}} \in \mathbb{R}^{n \times d}]]></tex-math></disp-formula>
        where n is the number of products in the platform and d is the dimension of the attribute space, and regarding the parameter adjustment of a score function as the action,
        <disp-formula><label>(11)</label><tex-math><![CDATA[a = \pi(s \mid \theta^{\mu}) \in \mathbb{R}^{d}]]></tex-math></disp-formula>
        where π is the policy to learn, parameterized by θ<sup>µ</sup>. The action a can be further used to calculate the scores of all products
        <disp-formula><label>(12)</label><tex-math><![CDATA[o_i = \frac{1}{1 + e^{-a^{\mathsf{T}} x_i}}, \quad \forall i \in \{1, 2, \ldots, n\}]]></tex-math></disp-formula>
        after which the result of the impression allocation over all n products can be obtained by
        <disp-formula><label>(13)</label><tex-math><![CDATA[u_i = \frac{e^{o_i}}{\sum_{i}^{n} e^{o_i}}, \quad \forall i \in \{1, 2, \ldots, n\}]]></tex-math></disp-formula>
      </p>
      <p>
        Without loss of generality, we assume that at each step the impressions allocated sum to 1, i.e., <inline-formula><tex-math><![CDATA[\sum_{i}^{n} u_i = 1]]></tex-math></inline-formula>. As is well known, the number of products n (billions) is far bigger than the dimension of the product attributes d (thousands) in large scale E-commerce platforms. By this definition, the dimension of the action space is reduced to d, significantly alleviating the computational issue in previous work [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where the dimension of the action space is n.
      </p>
      <p>The goal of the policy is to speed up the metabolism by scoring
and ranking products with the product lifecycle taken into consideration,
making new products grow into the maturity stage as quickly as
possible and keeping the global efficiency from dropping
over a long period. Thus, we define the reward related to s
and a as
        <disp-formula><label>(14)</label><tex-math><![CDATA[R(s, a) = \frac{1}{n} \sum_{i}^{n} \left[ \frac{1}{t_i} \int_{t=0}^{t_i} p(t)\, \frac{dq(t)}{dt}\, dt \right]]]></tex-math></disp-formula>
        where t<sub>i</sub> is the time step of the i-th product after being brought to the platform, and p(t), q(t) are the click-through rate function and the accumulated impressions of a product respectively. The physical meaning of this formulation is the mathematical expectation, over all products in the platform, of the average click amount of a product during its lifecycle, indicating the efficiency of the products in the platform. It can be calculated cumulatively in the online environment and approximately obtained by
        <disp-formula><label>(15)</label><tex-math><![CDATA[R(s, a) \approx \frac{1}{n} \sum_{i}^{n} \frac{1}{t_i} \sum_{\tau = 0}^{t_i} p_{\tau}^{i}\, u_{\tau}^{i}]]></tex-math></disp-formula>
      </p>
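      <p>The discrete approximation of the reward can be computed from per-product histories of CTR and allocated impressions. The sketch below is a minimal illustration with hypothetical names. It assumes every product has at least one recorded step and uses the number of recorded steps as t<sub>i</sub>, which sidesteps a division by zero for products that have just arrived.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def approximate_reward(ctr_histories, impression_histories):
    """Average over products of the per-step mean of CTR times allocated impressions.

    ctr_histories[i][k] and impression_histories[i][k] hold p and u of product i
    at its k-th step on the platform. Using len(history) as t_i is an assumption.
    """
    per_product = []
    for ctrs, impressions in zip(ctr_histories, impression_histories):
        clicks = sum(p * u for p, u in zip(ctrs, impressions))
        per_product.append(clicks / len(ctrs))
    return sum(per_product) / len(per_product)
```
]]></preformat>
      <p>In an online setting the inner sums would typically be kept as running totals per product, so that the reward can be refreshed at every step without replaying the histories.</p>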
      <p>
        A major issue in the above model is that, in practice, there will be millions or even billions of products, making the combination of all attribute vectors into a complete system state of size n × d computationally unaffordable, as noted in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A straightforward solution is to apply feature engineering techniques to generate a low dimensional representation of the state as s<sub>l</sub> = G(s), where G is a pre-designed aggregator function. However, a pre-designed aggregator function is completely subjective and depends heavily on hand-crafted features. Alternatively, we attempt to tackle this problem using a simple sampling based method. Specifically, the state is approximated by n<sub>s</sub> products uniformly sampled from all products
        <disp-formula><label>(16)</label><tex-math><![CDATA[\hat{s} = [x_1, x_2, \cdots, x_{n_s}]^{\mathsf{T}} \in \mathbb{R}^{n_s \times d}]]></tex-math></disp-formula>
        where ŝ is the approximated state. Two issues then arise with such a sampling method:
      </p>
      <p>• In which order should the n<sub>s</sub> sampled products be arranged
in ŝ, so that the representation is permutation invariant?
• How can the bias introduced by the sampling procedure be reduced,
especially when n<sub>s</sub> is much smaller than n?</p>
      <p>To solve these two problems, we further propose the first principal
component based permutation and the repeated sampling based
experiences generation, which are described in detail in the following
subsections.</p>
    </sec>
    <sec id="sec-8">
      <title>First Principal Component based Permutation</title>
      <p>
        The order of the sampled products in the state matrix has to be properly arranged, since an unsorted state matrix fluctuates severely during the training process, making the parameters of the network hard to converge. A simple way to fix the permutation is to order the products by a single dimension, such as the arrival time t<sub>i</sub> or the accumulated impressions q<sub>i</sub>. However, such an ad-hoc method may lose information due to the lack of general principles. For example, if we sort according to a feature that is almost the same among all products, the state matrix will keep fluctuating severely between observations. A suitable solution is to sort the products in an order that keeps most of the information of all features, for which the first principal component is introduced [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. We design a first principal component based permutation algorithm that projects each x<sub>i</sub> onto a scalar v<sub>i</sub> and sorts all the products according to v<sub>i</sub>:
        <disp-formula><label>(17)</label><tex-math><![CDATA[e_t = \arg\max_{\lVert e \rVert = 1} e^{\mathsf{T}} s_t^{\mathsf{T}} s_t\, e]]></tex-math></disp-formula>
        <disp-formula><label>(18)</label><tex-math><![CDATA[\hat{e} = \frac{\beta \hat{e} + (1 - \beta)(e_t - \hat{e})}{\lVert \beta \hat{e} + (1 - \beta)(e_t - \hat{e}) \rVert}]]></tex-math></disp-formula>
        <disp-formula><label>(19)</label><tex-math><![CDATA[v_i = \hat{e}^{\mathsf{T}} x_i, \quad i = 1, 2, \cdots, n_s]]></tex-math></disp-formula>
        where e<sub>t</sub> is the first principal component of the system states in the current step t, obtained by the classic PCA method as in Eq. 17, and ê is the projection vector softly updated by e<sub>t</sub> in Eq. 18, with which we calculate the projected score of each product in Eq. 19. Here 0 &lt; β &lt; 1 is a scalar indicating the decay rate of ê. Finally, the state matrix is denoted as
        <disp-formula><label>(20)</label><tex-math><![CDATA[\hat{s} = [x_{k_1}, x_{k_2}, \cdots, x_{k_{n_s}}]^{\mathsf{T}}]]></tex-math></disp-formula>
        where k<sub>1</sub>, k<sub>2</sub>, · · · , k<sub>n<sub>s</sub></sub> is the order of the products, sorted by v<sub>i</sub>.
      </p>
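      <p>A dependency-free sketch of the three steps of this subsection (first principal component, soft update of the projection vector, and sorting by the projected score) is given below. It swaps in power iteration for a full PCA routine, starts the iteration from a uniform vector (which is adequate for non-negative attributes but is an assumption), and sorts in ascending order of v<sub>i</sub>, which the text does not specify. The soft update follows Eq. 18 as printed. All function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def first_principal_component(rows, iterations=200):
    """Dominant unit eigenvector of S^T S, found by power iteration (S is not centered)."""
    d = len(rows[0])
    e = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        proj = [sum(x * w for x, w in zip(row, e)) for row in rows]           # S e
        nxt = [sum(row[j] * p for row, p in zip(rows, proj)) for j in range(d)]  # S^T (S e)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return e
        e = [v / norm for v in nxt]
    return e


def update_projection(e_hat, e_t, beta):
    """Soft update as printed in Eq. 18: normalize(beta * e_hat + (1 - beta) * (e_t - e_hat))."""
    raw = [beta * h + (1.0 - beta) * (c - h) for h, c in zip(e_hat, e_t)]
    norm = math.sqrt(sum(v * v for v in raw))
    if norm == 0.0:
        return list(e_hat)
    return [v / norm for v in raw]


def permute_by_projection(rows, e_hat):
    """Sort products by v_i = e_hat^T x_i (ascending order is an assumption)."""
    return sorted(rows, key=lambda x: sum(w * v for w, v in zip(e_hat, x)))
```
]]></preformat>
      <p>Keeping ê as a slowly moving average of e<sub>t</sub> means the sort key changes little between consecutive steps, which is what stabilizes the ordered state matrix seen by the networks.</p>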
    </sec>
    <sec id="sec-10">
      <title>Repeated Sampling based Experiences Generation</title>
      <p>
        We adopt the classic experience replay technique [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ] to enrich experiences during the training phase, just as in other reinforcement learning applications. In the traditional experience replay technique, an experience is formulated as (s<sub>t</sub>, a<sub>t</sub>, r<sub>t</sub>, s<sub>t+1</sub>). However, as described above, there are theoretically <inline-formula><tex-math><![CDATA[\binom{n}{n_s}]]></tex-math></inline-formula> possible observations at each step, since we need to sample n<sub>s</sub> products from all the n products to approximate the global statistics. If n<sub>s</sub> is much smaller than n, such an approximation will be inaccurate.
      </p>
      <p>To reduce the above bias, we propose the repeated sampling
based experiences generation. For each original experience, we
sample s<sub>t</sub> and s<sub>t+1</sub> repeatedly, m times each, to obtain m<sup>2</sup> experiences
        <disp-formula><label>(21)</label><tex-math><![CDATA[(\hat{s}_t^{\,i},\, a_t,\, r_t,\, \hat{s}_{t+1}^{\,j}), \quad i, j \in \{1, 2, \cdots, m\}]]></tex-math></disp-formula>
        as illustrated in Fig. 3. This approach improves the stability of observation in a noisy environment. It also helps to generate plenty of experiences when repeating the interaction millions of times is not feasible.</p>
      <p>[Fig. 3: Repeated sampling based experiences generation. Panel labels: (s<sub>t</sub>, a<sub>t</sub>, R<sub>t</sub>, s<sub>t+1</sub>); (o<sub>i,t</sub>, a<sub>t</sub>, R<sub>t</sub>, o<sub>j,t+1</sub>); Sampling; Batch; Sliding Pool; agent.]</p>
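      <p>The generation of m<sup>2</sup> experiences from one original transition can be sketched as follows. The helper names are hypothetical, uniform sampling without replacement is assumed, and the permutation step of the previous subsection is left out to keep the example short.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def sample_state(products, n_s, rng):
    """Approximate the global state by n_s products drawn uniformly without replacement."""
    return rng.sample(products, n_s)


def generate_experiences(products_t, action, reward, products_next, n_s, m, rng):
    """Pair m sampled states at step t with m sampled states at step t + 1."""
    states = [sample_state(products_t, n_s, rng) for _ in range(m)]
    next_states = [sample_state(products_next, n_s, rng) for _ in range(m)]
    # Every pair shares the single action and reward of the original transition.
    return [(s, action, reward, s_next) for s in states for s_next in next_states]
```
]]></preformat>
      <p>One interaction with the environment therefore yields m<sup>2</sup> replay entries, at the cost of 2m sampling operations.</p>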
      <p>It is worth noting that the repeated sampling is conducted only in
the training phase. When acting in the environment, the action
a<sub>t</sub> is obtained from a single randomly selected approximated state,
i.e., a<sub>t</sub> = π(ŝ<sub>t</sub><sup>1</sup>). In fact, since a<sub>t</sub> does not necessarily equal
π(ŝ<sub>t</sub><sup>i</sup>) for all i ∈ {1, 2, · · · , m}, this can further help to learn an invariant
representation of the approximated state observations.</p>
      <p>
        The overall procedure of the algorithm is described in Algorithm 1. First, random sampling is used to obtain a sample of the system state. The sample is then permuted by its projection onto the first principal component. After that, a one-step action and multiple observations are used to enrich the experiences in the experience pool. Moreover, a shared convolutional neural network is applied within the actor-critic networks and the target actor-critic networks to extract features from the ordered state observation [
        <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
        ].
      </p>
      <p>The simulation is built on the product lifecycle model proposed in section 3.1. Among all of the parameters, q<sub>2</sub> is uniformly sampled from [10<sup>4</sup>, 10<sup>6</sup>]; t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub>, δ<sub>µ</sub>, δ<sub>σ</sub> are uniformly sampled from [5, 30], [35, 120], [60, 180], [10<sup>4</sup>, 10<sup>6</sup>], [2.5 × 10<sup>3</sup>, 2.5 × 10<sup>5</sup>] respectively; and the parameter σ is set to 0.016. The parameters c<sub>l</sub>, c<sub>h</sub> are generated by a fixed neural network whose parameters are uniformly sampled from [−0.5, 0.5] to model online environments, with the outputs scaled into the intervals [0.01, 0.05] and [0.1, 0.15] respectively. Apart from the normalized dynamic CTR p and the accumulated impressions q, the attribute vector x is uniformly sampled element-wise from [0, 1], with the dimension d = 15. All the latent parameters in the lifecycle model are assumed to be unobservable during the learning phase.</p>
      <p>[Table 1: Parameters in the learning procedure: number of products in each sample, first principal component decay rate, rewards discount factor, target network decay rate, repeated observation times.]</p>
      <p>
        The DDPG algorithm is adopted as the learning algorithm. The learning rates for the actor network and the critic network are 10<sup>−4</sup> and 10<sup>−3</sup> respectively, with the ADAM optimizer [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. The replay buffer is limited to 2.5 × 10<sup>4</sup> experiences. The most relevant parameters involved in the learning procedure are set as in Table 1.
      </p>
      <p>We compare the following methods:</p>
      <p>• CTR-A: The impressions are allocated in proportion to the
CTR score.
• T-Perm: The basic DDPG algorithm, with arrival time
based permutation and a fully connected network to process
the state.
• FPC: The basic DDPG algorithm, with first principal
component based permutation and a fully connected network to
process the state.
• FPC-CNN: FPC with a shared two-layer convolutional
neural network in the actor-critic networks.
• FPC-CNN-EXP: FPC-CNN with the improved experiences
generation method.</p>
      <p>Here CTR-A is the classic supervised learning method and the
others are the methods proposed in this paper. For all the
experiments, CTR-A is first applied for the first 360 steps to bring the
system into a stable status, i.e., the distribution over the different
lifecycle stages is stable. The other methods are then run for
another 2k steps, and the actor-critic networks are trained 12.8k
times.</p>
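      <p>For reference, the CTR-A baseline amounts to normalizing the CTR scores into an allocation. The sketch below is illustrative only; the function name and the uniform fallback for an all-zero CTR vector are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def ctr_proportional_allocation(ctrs):
    """Allocate a unit budget of impressions in proportion to each product's CTR."""
    total = sum(ctrs)
    if total <= 0.0:
        # Assumed fallback: no click signal at all, so split the budget evenly.
        return [1.0 / len(ctrs)] * len(ctrs)
    return [c / total for c in ctrs]
```
]]></preformat>
      <p>Unlike the learned policy, this rule looks only at the current CTR, so a new product with a low observed CTR keeps receiving few impressions.</p>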
    </sec>
    <sec id="sec-11">
      <title>The Results</title>
      <p>We first show the discounted accumulated rewards of the different
methods at every step in Fig. 4. After the initialization with
CTR-A, we find that the discounted accumulated reward of CTR-A
itself converges to almost 100 after 360 steps (which is in fact
why 360 steps were selected for the initialization), while that of the other
methods can further increase with more learning steps. All
FPC based algorithms beat the T-Perm algorithm, indicating
that the FPC based algorithms find a more suitable permutation
of the items, while the arrival time based permutation leads
to a loss of information and thus to lower final accumulated
rewards. Moreover, the CNN and EXP algorithms perform better at
extracting features from observations automatically, yielding a slight
improvement in the speed of convergence. All three
FPC based algorithms converge to the same final accumulated rewards,
since their state inputs have the same observation representation.</p>
      <p>[Fig. 4: Discounted return versus step for FPC-CNN-EXP, FPC-CNN, FPC and T-Perm.]</p>
      <p>[Fig. 5: Distribution of the impression allocation over the introduction, growth, maturity and decline stages versus step.]</p>
      <p>We then investigate how the impression
allocation over the 4 lifecycle stages shifts after training
with the FPC-CNN-EXP method, as shown in Fig. 5. The
share of the decline stage decreases, while the shares
of the introduction and maturity stages increase. Giving up
products in the decline stage helps the platform avoid
wasting impressions, since these products consistently have a
low CTR. Encouraging products in the introduction stage
gives the platform the chance to discover more potential hot products.
Supporting products in the maturity stage maximizes the
short-term efficiency, since they have nearly the highest
CTRs of their lifecycle.</p>
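      <p>The distribution shift described above is a comparison of per-stage impression shares before and after training. The following is a minimal sketch, assuming impressions are logged as (stage, count) pairs; the function names and the toy counts are illustrative and are not data from the experiments.</p>
      <preformat>
```python
from collections import Counter

STAGES = ("introduction", "growth", "maturity", "decline")


def stage_shares(impressions):
    """Return the fraction of impressions allocated to each lifecycle stage."""
    totals = Counter()
    for stage, count in impressions:
        totals[stage] += count
    grand_total = sum(totals[s] for s in STAGES)
    if grand_total == 0:
        raise ValueError("no impressions recorded")
    return {s: totals[s] / grand_total for s in STAGES}


def share_shift(before, after):
    """Return the per-stage change in share (after minus before)."""
    return {s: after[s] - before[s] for s in STAGES}


if __name__ == "__main__":
    # Toy counts, made up for illustration only.
    before = stage_shares([("introduction", 10), ("growth", 20),
                           ("maturity", 30), ("decline", 40)])
    after = stage_shares([("introduction", 20), ("growth", 20),
                          ("maturity", 40), ("decline", 20)])
    for stage, delta in share_shift(before, after).items():
        print(f"{stage}: {delta:+.2f}")
```
      </preformat>
      <p>A negative entry for the decline stage together with positive entries for the introduction and maturity stages corresponds to the pattern reported for FPC-CNN-EXP.</p>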
      <p>Finally, we show the change in the global clicks, the rewards,
and the average time a product takes to grow
into the maturity stage from its brought time at each step, expressed
as the relative change rate compared with the CTR-A method, as
shown in Fig. 6. The global average click increases by 6% while the
reward improves by 30%. The gap is probably caused by the
inconsistency between the reward definition and the global average click
metric: the designed reward contains other implicit
objectives related to the metabolism. To verify this conjecture, we
show that the average time for items to grow into the maturity stage
has dropped by 26%, indicating that the metabolism is significantly
sped up. Thus, we show empirically that the proposed
reinforcement learning based mechanism design, which utilizes
the lifecycle theory, can increase the long-term efficiency by
speeding up the metabolism.</p>
      <p>[Figure: relative change rate of Click, Rewards, and Time cost.]</p>
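      <p>The relative change rate used for the comparison with CTR-A can be written as (value - baseline) / baseline. The sketch below is a minimal illustration of that formula; the function names and the numbers in the usage lines are hypothetical and are not taken from the experiments.</p>
      <preformat>
```python
def relative_change(value, baseline):
    """Return (value - baseline) / baseline, e.g. 0.2 for a 20% increase."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


def relative_change_series(method, baseline):
    """Step-by-step relative change of one metric against the baseline run."""
    if len(method) != len(baseline):
        raise ValueError("both runs must cover the same steps")
    return [relative_change(m, b) for m, b in zip(method, baseline)]


if __name__ == "__main__":
    # Made-up values: a metric of 120 against a baseline of 100, and 75 against 100.
    print(relative_change(120.0, 100.0))  # 0.2
    print(relative_change(75.0, 100.0))   # -0.25
```
      </preformat>
      <p>For a cost-like metric such as the time to reach the maturity stage, a negative relative change means an improvement over the baseline.</p>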
    </sec>
    <sec id="sec-12">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we propose an end-to-end general reinforcement
learning framework that improves the long-term efficiency by
speeding up the metabolism. We reduce the action space to a reasonable
size and then propose a first principal component based
permutation for better observation of the environment state. An
improved experience generation technique is then used to enrich
the experience pool. Moreover, the actor-critic network is improved with
a shared convolutional network for better state representation.
Experimental results show that our algorithms outperform the baseline
algorithms.</p>
      <p>For future work, one promising direction is to
develop a theoretical guarantee for the first principal component based
permutation. Another possible improvement is to introduce
nonlinearity into the scoring function for products.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>William</given-names>
            <surname>Vickrey</surname>
          </string-name>
          .
          <article-title>Counterspeculation, auctions, and competitive sealed tenders</article-title>
          .
          <source>The Journal of Finance</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>8</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>1961</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Burges</surname>
          </string-name>
          , Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and
          <string-name>
            <given-names>Greg</given-names>
            <surname>Hullender</surname>
          </string-name>
          .
          <article-title>Learning to rank using gradient descent</article-title>
          .
          <source>In Proceedings of the 22nd international conference on Machine learning</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          . ACM,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Cao</surname>
          </string-name>
          , Tao Qin,
          <string-name>
            <given-names>Tie-Yan</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Feng</given-names>
            <surname>Tsai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Learning to rank: from pairwise approach to listwise approach</article-title>
          .
          <source>In Proceedings of the 24th international conference on Machine learning</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Greg</given-names>
            <surname>Linden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Brent</given-names>
            <surname>Smith</surname>
          </string-name>
          , and Jeremy York.
          <article-title>Amazon.com recommendations: Item-to-item collaborative filtering</article-title>
          .
          <source>IEEE Internet computing</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>76</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <article-title>Advances in collaborative filtering</article-title>
          .
          <source>In Recommender systems handbook</source>
          , pages
          <fpage>77</fpage>
          -
          <lpage>118</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Roger B</given-names>
            <surname>Myerson</surname>
          </string-name>
          .
          <article-title>Optimal auction design</article-title>
          .
          <source>Mathematics of operations research</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>58</fpage>
          -
          <lpage>73</lpage>
          ,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Noam</given-names>
            <surname>Nisan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amir</given-names>
            <surname>Ronen</surname>
          </string-name>
          .
          <article-title>Algorithmic mechanism design</article-title>
          .
          <source>Games and Economic Behavior</source>
          ,
          <volume>35</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>166</fpage>
          -
          <lpage>196</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Shoham</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          .
          <article-title>Multiagent systems: Algorithmic, game-theoretic, and logical foundations</article-title>
          . Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Pingzhong</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Reinforcement mechanism design</article-title>
          .
          <source>In Early Career Highlights at Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , pages
          <fpage>5146</fpage>
          -
          <lpage>5150</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Christos H</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          and
          <string-name>
            <given-names>John N</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          .
          <article-title>The complexity of Markov decision processes</article-title>
          .
          <source>Mathematics of operations research</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ):
          <fpage>441</fpage>
          -
          <lpage>450</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Qingpeng</given-names>
            <surname>Cai</surname>
          </string-name>
          , Aris Filos-Ratsikas,
          <string-name>
            <given-names>Pingzhong</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yiwei</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Reinforcement mechanism design for fraudulent behaviour in e-commerce</article-title>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Qingpeng</given-names>
            <surname>Cai</surname>
          </string-name>
          , Aris Filos-Ratsikas,
          <string-name>
            <given-names>Pingzhong</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yiwei</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Reinforcement mechanism design for e-commerce</article-title>
          .
          <source>CoRR, abs/1708.07607</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Theodore</given-names>
            <surname>Levitt</surname>
          </string-name>
          .
          <article-title>Exploit the product life cycle</article-title>
          .
          <source>Harvard business review</source>
          ,
          <volume>43</volume>
          :
          <fpage>81</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>1965</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Hui</given-names>
            <surname>Cao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Folan</surname>
          </string-name>
          .
          <article-title>Product life cycle: the evolution of a paradigm and literature review from 1950-2009</article-title>
          .
          <source>Production Planning &amp; Control</source>
          ,
          <volume>23</volume>
          (
          <issue>8</issue>
          ):
          <fpage>641</fpage>
          -
          <lpage>662</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          . ACM,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Georges E</given-names>
            <surname>Dupret</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          .
          <article-title>A user browsing model to predict search engine click data from past observations</article-title>
          .
          <source>In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>331</fpage>
          -
          <lpage>338</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ricci</surname>
          </string-name>
          , Lior Rokach, and
          <string-name>
            <given-names>Bracha</given-names>
            <surname>Shapira</surname>
          </string-name>
          .
          <article-title>Introduction to recommender systems handbook</article-title>
          .
          <source>In Recommender systems handbook</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Seung-Taek</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Chu</surname>
          </string-name>
          .
          <article-title>Pairwise preference regression for cold-start recommendation</article-title>
          .
          <source>In Proceedings of the third ACM conference on Recommender systems</source>
          , pages
          <fpage>21</fpage>
          -
          <lpage>28</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Saveski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amin</given-names>
            <surname>Mantrach</surname>
          </string-name>
          .
          <article-title>Item cold-start recommendations: learning local collective embeddings</article-title>
          .
          <source>In Proceedings of the 8th ACM Conference on Recommender systems</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Jin-Hu</given-names>
            <surname>Liu</surname>
          </string-name>
          , Tao Zhou,
          <string-name>
            <given-names>Zi-Ke</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Zimo Yang, Chuang Liu, and
          <string-name>
            <given-names>Wei-Min</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Promoting cold-start items in recommender systems</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>9</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e113457</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Anava</surname>
          </string-name>
          , Shahar Golan, Nadav Golbandi, Zohar Karnin, Ronny Lempel, Oleg Rokhlenko, and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Somekh</surname>
          </string-name>
          .
          <article-title>Budget-constrained item cold-start handling in collaborative filtering recommenders via optimal design</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>54</lpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Michal</given-names>
            <surname>Aharon</surname>
          </string-name>
          , Oren Anava, Noa Avigdor-Elgrabli, Dana Drachsler-Cohen,
          <string-name>
            <given-names>Shahar</given-names>
            <surname>Golan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Somekh</surname>
          </string-name>
          .
          <article-title>ExcUseMe: Asking users to help in item cold-start recommendations</article-title>
          .
          <source>In Proceedings of the 9th ACM Conference on Recommender Systems</source>
          , pages
          <fpage>83</fpage>
          -
          <lpage>90</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Maksims</given-names>
            <surname>Volkovs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guangwei</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tomi</given-names>
            <surname>Poutanen</surname>
          </string-name>
          .
          <article-title>DropoutNet: Addressing cold start in recommender systems</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>4964</fpage>
          -
          <lpage>4973</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Nima</given-names>
            <surname>Taghipour</surname>
          </string-name>
          , Ahmad Kardan, and Saeed Shiry Ghidary.
          <article-title>Usage-based web recommendations: a reinforcement learning approach</article-title>
          .
          <source>In Proceedings of the 2007 ACM conference on Recommender systems</source>
          , pages
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Qingpeng</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aris</given-names>
            <surname>Filos-Ratsikas</surname>
          </string-name>
          , Chang Liu, and
          <string-name>
            <given-names>Pingzhong</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Mechanism design for personalized recommender systems</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems</source>
          , pages
          <fpage>159</fpage>
          -
          <lpage>166</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Richard S</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David A</given-names>
            <surname>McAllester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Satinder P</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yishay</given-names>
            <surname>Mansour</surname>
          </string-name>
          .
          <article-title>Policy gradient methods for reinforcement learning with function approximation</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1057</fpage>
          -
          <lpage>1063</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>David</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guy</given-names>
            <surname>Lever</surname>
          </string-name>
          , Nicolas Heess, Thomas Degris, Daan Wierstra, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          .
          <article-title>Deterministic policy gradient algorithms</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Machine Learning (ICML-14)</source>
          , pages
          <fpage>387</fpage>
          -
          <lpage>395</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Timothy P</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          , Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
          <string-name>
            <given-names>Yuval</given-names>
            <surname>Tassa</surname>
          </string-name>
          , David Silver, and
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          .
          <article-title>Continuous control with deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1509.02971</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Hervé</given-names>
            <surname>Abdi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lynne J</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Principal component analysis</article-title>
          .
          <source>Wiley interdisciplinary reviews: computational statistics</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>433</fpage>
          -
          <lpage>459</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Long-Ji</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Self-improving reactive agents based on reinforcement learning, planning and teaching</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>8</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>293</fpage>
          -
          <lpage>321</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          , Koray Kavukcuoglu, David Silver,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          , Ioannis Antonoglou, Daan Wierstra, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          .
          <article-title>Playing Atari with deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1312.5602</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Yu-Hu</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jian-Qiang</given-names>
            <surname>Yi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dong-Bin</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Application of actor-critic learning to adaptive state space construction</article-title>
          .
          <source>In Machine Learning and Cybernetics</source>
          ,
          <year>2004</year>
          . Proceedings of 2004 International Conference on, volume
          <volume>5</volume>
          , pages
          <fpage>2985</fpage>
          -
          <lpage>2990</lpage>
          . IEEE,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Yuxin</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yuandong</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <article-title>Training agent for first-person shooter game with actor-critic curriculum learning</article-title>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR, abs/1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>