<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gaussian Process Based Optimistic Knapsack Sampling with Applications to Stochastic Resource Allocation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sondre Glimsdal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ole-Christoffer Granmo</string-name>
          <email>ole.granmog@uia.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of ICT University of Agder Norway</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The stochastic non-linear fractional knapsack problem is a challenging optimization problem with numerous applications, including resource allocation. The goal is to find the most valuable mix of materials that fits within a knapsack of fixed capacity. When the value functions of the involved materials are fully known and differentiable, the most valuable mixture can be found by direct application of Lagrange multipliers. However, in many real-world applications, such as web polling, information about material value is uncertain, and in many cases missing altogether. Surprisingly, without prior information about material value, the recently proposed Learning Automata Knapsack Game (LAKG) offers arbitrarily accurate convergence towards the optimal solution, simply by interacting with the knapsack on-line. This paper introduces Gaussian Process based Optimistic Knapsack Sampling (GPOKS) - a novel model-based reinforcement learning scheme for solving stochastic fractional knapsack problems, founded on Gaussian Process (GP) enabled Optimistic Thompson Sampling (OTS). Not only does this scheme converge significantly faster than LAKG, GPOKS also incorporates GP based learning of the material values themselves, forming the basis for OTS supported balancing between exploration and exploitation. Using resource allocation in web polling as a proof-of-concept application, our empirical results show that GPOKS consistently outperforms LAKG, the current top-performer, under a wide variety of parameter settings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The Internet can be seen as a massive collection of
everchanging information, continuously evolving as web
resources are created, edited, deleted, and replaced
        <xref ref-type="bibr" rid="ref16">(Pandey,
Ramamritham, &amp; Chakrabarti 2003)</xref>
        . Obtaining adequate
information from the Internet is crucial for many tasks,
including social media analytics, counter terrorism, and
business intelligence. It is thus important that the applied search
engines and web-monitoring frameworks are able to keep
their indexes and caches complete and up-to-date.
Achieving this, of course, relies on detecting the changes that the
web resources undergo, typically by means of polling.
      </p>
      <p>
        The problem of balancing polling capacity optimally
among web resources, with limited prior information, was
essentially unsolved until the Learning Automata
Knapsack Game (LAKG) was introduced in 2006 as a generic
and adaptive solution to the so-called Stochastic Non-linear
Equality Fractional Knapsack (NEFK) Problem
        <xref ref-type="bibr" rid="ref4 ref6">(Granmo et
al. 2006)</xref>
        . Before that, the simplest and perhaps most
common polling approach was to allocate the available polling
capacity uniformly among the web resources being
monitored, polling them all with the same fixed frequency,
constrained by the available polling capacity. This uniform
polling strategy is clearly sub-optimal since web resources
evolve at different speed. For slowly changing web
resources, a high polling frequency translates into a
correspondingly large number of unfruitful polls. Conversely, for
quickly evolving web resources, a too low polling frequency
leads to potential loss of information or acting on out-dated
information. In brief, without balancing the allocation of
the available polling capacity, wasting resources polling one
resource may in turn prevent us from polling another more
attractive resource, thus degrading overall performance.
      </p>
      <p>
        A two phase strategy has been proposed to address the
latter inefficiency: In the first phase, the uniform strategy is
applied, which allows the update probability of monitored
web resources to be estimated. By treating these
probability estimates as the true ones, Lagrange multipliers can be
applied to find an allocation of capacity that is optimal for
the estimated values
        <xref ref-type="bibr" rid="ref16">(Pandey, Ramamritham, &amp; Chakrabarti
2003)</xref>
        . However, this method needs an arbitrary long
estimation phase to approach the optimal solution in the second
phase. That is, one either has to accept a sub-optimal final
solution because the update probability estimates are
inaccurate, or one must wait an extensive amount of time till
the estimates have become sufficiently accurate, allowing a
better solution in the second phase. Also note that evolving
update probabilities render the solution found with the latter
approach progressively more inaccurate.
      </p>
      <p>
        This paper introduces Gaussian Process based Optimistic
Knapsack Sampling (GPOKS) — a novel scheme for
solving stochastic knapsack problems founded on Gaussian
Process (GP)
        <xref ref-type="bibr" rid="ref18">(Rasmussen &amp; Williams 2006)</xref>
        based Thompson
Sampling (TS)
        <xref ref-type="bibr" rid="ref20 ref3 ref5 ref8">(Thompson 1933; Granmo 2010)</xref>
        , enhanced
by the principles of Optimistic TS
        <xref ref-type="bibr" rid="ref13">(May et al. 2012)</xref>
        . As
we shall see, not only does this scheme converge
significantly faster than LAKG, GPOKS also incorporates GP
based learning of the material unit values themselves,
forming the basis for TS based exploration and exploitation. This
allows GPOKS to gradually shift from estimation to
optimization, starting with pure estimation and converging
towards pure optimization.
      </p>
      <p>
        In
        <xref ref-type="bibr" rid="ref3 ref5 ref8">(Granmo 2010)</xref>
        we reported a Bayesian technique for
solving bandit like problems, revisiting the Thompson
Sampling
        <xref ref-type="bibr" rid="ref20">(Thompson 1933)</xref>
        principle pioneered in 1933. This
revisit lead to novel schemes for handling multi-armed and
dynamic (restless) bandit problems
        <xref ref-type="bibr" rid="ref10 ref3 ref5 ref8 ref9">(Granmo &amp; Berg 2010;
Gupta, Granmo, &amp; Agrawala 2011a; 2011b)</xref>
        , and
empirical results demonstrated the advantages of these techniques
over established top performers. Furthermore, we provided
theoretical results stating that the original technique is
instantaneously self-correcting and that it converges to only
pulling the optimal arm with probability as close to unity as
desired. We now expand this principle to support Thompson
Sampling for Stochastic NEFK Problems.
1.1
      </p>
      <sec id="sec-1-1">
        <title>Formal Problem Formulation</title>
        <p>In order to appreciate the qualities of the Stochastic NEFK
Problem, it is beneficial to view the problem in light of the
classical linear Fractional Knapsack (FK) Problem. Indeed,
the Stochastic NEFK Problem generalizes the latter problem
in two significant ways. Both of the two problems are briefly
defined below.</p>
        <p>The Linear Fractional Knapsack (FK) Problem: The
linear FK problem is a classical continuous optimization
problem which also has applications within the field of
resource allocation. The problem involves n materials of
different value vi per unit volume, 1 i n, where each
material is available in a certain amount xi bi. Let
fi(xi) denote the value of the amount xi of material i, i.e.,
fi(xi) = vixi. The problem is to fill a knapsack of fixed
volume c with the material mix ~x = [x1; : : : ; xn] of maximal
value Pn</p>
        <p>
          1 fi(xi)
          <xref ref-type="bibr" rid="ref1">(Black 2004)</xref>
          .
        </p>
        <p>
          The Nonlinear Equality FK (NEFK) Problem: One
important extension of the above classical problem is the
Nonlinear Equality FK problem with a separable and concave
objective function. The problem can be stated as follows
          <xref ref-type="bibr" rid="ref12">(Kellerer, Pferschy, &amp; Pisinger 2004)</xref>
          :
maximize
subject to
f (~x) = Pn
        </p>
        <p>1 fi(xi)</p>
        <p>P1n xi = c and 8i 2 f1; : : : ; ng; xi</p>
        <p>Since the objective function is considered to be concave,
the value function fi(xi) of each material is also concave.
This means that the derivatives of the material value
functions fi(xi) with respect to xi, (hereafter denoted fi0), are
non-increasing. In other words, the material value per unit
volume is no longer constant as in the linear case, but
decreases with the material amount, and so the optimization
problem becomes:
maximize
subject to
f (~x) = Pn</p>
        <p>
          1 fi(xi);
where fi(xi) = R0xi fi0(xi)dxi
P1n xi = c and 8i 2 f1; : : : ; ng; xi
0:
0:
value functions are equal, subject to the knapsack constraints
          <xref ref-type="bibr" rid="ref2">(Bretthauer &amp; Shetty 2002)</xref>
          :
f10 (x1) = = f n0(xn)
P1n xi = c and 8i 2 f1; : : : ; ng; xi
0:
        </p>
        <p>The Stochastic NEFK Problem: In this paper we
generalize the above nonlinear equality knapsack problem. First
of all, we let the material value per unit volume for any xi
be a probability function pi(xi). Furthermore, we consider
the distribution of pi(xi) to be unknown. That is, each time
an amount xi of material i is placed in the knapsack, we are
only allowed to observe an instantiation of pi(xi) at xi, and
not pi(xi) itself.1 Given this stochastic environment, we
intend to devise an on-line incremental scheme that learns the
mix of materials of maximal expected value, through a series
of informed guesses. Thus, to clarify issues, we are provided
with a knapsack of fixed volume c, which is to be filled with
a mix of n different materials. However, unlike the NEFK,
in the Stochastic NEFK Problem the unit volume value of a
material i, 1 i n, is a random quantity — it takes the
value 1 with probability pi(xi) and the value 0 with
probability 1 pi(xi), respectively. As an additional complication,
pi(xi) is nonlinear in the sense that it decreases
monotonically with xi, i.e., xi1 xi2 , pi(xi1 ) pi(xi2 ).</p>
        <p>Since unit volume values are random, we operate with
expected unit volume values rather than the actual unit volume
values. With this understanding, and the above perspective
in mind, the expected value of the amount xi of material i,
1 i n, becomes fi(xi) = R0xi pi(u)du. Accordingly,
the expected value per unit volume2 of material i becomes
fi0(xi) = pi(xi). In this stochastic and non-linear version
of the FK problem, the goal is to fill the knapsack so that
the expected value f (~x) = P1n fi(xi) of the material mix
contained in the knapsack is maximized. Thus, we aim to:
maximize
subject to
f (~x) = Pn</p>
        <p>1 fi(xi);
where fi(xi) = R0xi pi(u)du; pi(xi) = fi0(xi)</p>
        <p>P1n xi = c and 8i 2 f1; : : : ; ng; xi 0:</p>
        <p>A fascinating property of the above problem is that the
amount of information available to the decision maker is
limited — the decision maker is only allowed to observe the
current unit value of each material (either 0 or 1). That is,
each time a material mix is placed in the knapsack, the unit
value of each material is provided to the decision maker. The
actual outcome probabilities pi(xi); 1 i n, however,
remain unknown. As a result of the latter, the expected value of
the material mix must be maximized by means of
trial-anderror, i.e., by experimenting with different material mixes
and by observing the resulting random unit value outcomes.</p>
        <p>1For the sake of consistency with previous work on the
Stochastic NEFK Problem, we here model stochastic material unit values
using Bernoulli trials. However, since GPOKS is based on
Gaussian Processes, the central limit theorem opens up for addressing
a number of other distributions too. Furthermore, there exist
dedicated kernel functions for a variety of distributions.</p>
        <p>2We hereafter use fi0(xi) to denote the derivative of the
expected value function fi(xi) with respect to xi.</p>
        <p>
          Efficient solutions to the latter problem, based on the
principle of Lagrange multipliers, have been devised. In short, the
optimal value occurs when the derivatives fi0 of the material
The contributions of this paper can be summarized as
follows:
1. We combine Bayesian modeling with reinforcement
learning to provide a novel solution to the Stochastic
NEFK Problem.
3. We introduce GP based sampling mechanisms in the spirit
of Optimistic Thompson Sampling
          <xref ref-type="bibr" rid="ref13">(May et al. 2012)</xref>
          for
increased performance.
4. The resulting scheme persistently outperforms
state-ofthe-art approaches when applied to resource allocation in
web polling.
        </p>
        <p>These contributions form the first steps towards establishing
a new family of reinforcement learning schemes that
provide on-line solutions to stochastic versions of classical
optimization problems. This is achieved by carefully
designing Bayesian models that capture the nature of the
optimization problems, applying TS principles to address the
exploration/exploitation dilemma in on-line learning and control.
1.3</p>
      </sec>
      <sec id="sec-1-2">
        <title>Paper Outline</title>
        <p>
          In Section 2, we present our scheme for Gaussian
Process Based Optimistic Knapsack Sampling (GPOKS). We
start with a brief introduction to Gaussian Processes before
we propose how Gaussian Processes can enable
Thompson Sampling — the current leader when it comes to
solving Bernoulli Bandit Problems
          <xref ref-type="bibr" rid="ref3 ref5 ref8">(Granmo 2010)</xref>
          — for
exploration and exploitation when solving on-line Stochastic
NEFK problems. Then, in Section 3, we define the web
resource allocation polling problem in more detail, following
up with an evaluation of GPOKS compared with
state-ofthe-art. We conclude in Section 4 and present pointers for
further work.
        </p>
        <p>2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Gaussian Process Based Optimistic</title>
    </sec>
    <sec id="sec-3">
      <title>Knapsack Sampling (GPOKS)</title>
      <p>
        The conflict between exploration and exploitation is a
wellknown problem in reinforcement learning, and other areas
of artificial intelligence. The multi-armed bandit problem
captures the essence of this conflict, and has thus occupied
researchers for over fifty years
        <xref ref-type="bibr" rid="ref22">(Wyatt 1997)</xref>
        . In brief, an
agent sequentially pulls one of multiple arms attached to a
gambling machine, with each pull resulting in a random
reward. The reward distributions are unknown, and thus, one
must balance between exploiting existing knowledge about
the arms, and obtaining new information.
      </p>
      <p>
        We are here facing a similar problem, however, instead of
seeking the singly best material (bandit arm), we need to find
a mixture of materials, also referred to as a mixed strategy
in Game Theory. Recently, GP optimization has been
addressed from a bandit problem perspective
        <xref ref-type="bibr" rid="ref19">(Srinivas N. &amp; M.
2010)</xref>
        , allowing the GP to be explored globally with as few
evaluations as possible based on so-called upper confidence
bounds. Inspired by the success of GP based
optimization, we here propose a novel GP based model for stochastic
NEFK problems, where a collection of GPs captures the
individual material unit values. Based on the GP colletion,
Thompson Sampling is applied to sample likely
deterministic NEFK problem instances from the GPs. These, in turn,
are solved based on Lagrange Multipliers, producing a
potential solution to the problem at hand.
2.1
      </p>
      <sec id="sec-3-1">
        <title>Gaussian Processes based Representation of</title>
      </sec>
      <sec id="sec-3-2">
        <title>Material Unit Value</title>
        <p>
          A Gaussian Process (GP) is a stochastic process that
represents a function as a multivariate Gaussian distribution
          <xref ref-type="bibr" rid="ref18">(Rasmussen &amp; Williams 2006)</xref>
          . It is specified as a tuple
GP = ( (~x); K( ; )) where ( ) is the mean function,
typically assigned (~x) = ~0, and K( ; ) is a kernel that specifies
the covariance matrix for the random vector ~x. In this paper,
we use the one dimensional Squared Exponential kernel (eq.
1), configured by the hyper parameters ~ = fl; f2; n2g.
1
2l2 (xp
K(xp; xq) =
f2exp(
xq)2)) +
2
n pq
(1)
Here l is the characteristic length-scale parameter that
determines how rapidly the correlation should decay as the
distance between xp and xq increases, f2 is the signal variance
and n2 is white noise (note that pq here denotes the
Kronecker delta between xp and xq). For further information on
GPs we refer to
          <xref ref-type="bibr" rid="ref18">(Rasmussen &amp; Williams 2006)</xref>
          .
        </p>
        <p>
          By way of example, Figure 1 illustrates how the posterior
distribution over possible material unit value functions for a
given material i can be represented by means of a GP. The
x-axis measures the amount of material, xi, while the y-axis
provides the material unit value fi0(xi). The mean and 95%
confidence interval is included, as well as four samples
indicating possible candidates for fi0(xi). Note that since the
Stochastic NEFK problem deals with non-increasing unit
value functions, fi0(xi), we apply Rejection Sampling to
sample from the distribution of non-increasing functions.
Similarly, ”optimistic” sampling, as pioneered by May et al.
          <xref ref-type="bibr" rid="ref13">(May et al. 2012)</xref>
          , is realized by rejecting sampled functions
that drop below the estimated mean.
2.2
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Architectural Overview of GPOKS</title>
        <p>Figure 2 provides an architectural overview of our scheme.
As illustrated in the figure, GPOKS operates as follows:
1. A collection of GPs, one Gaussian Process, GPi, for each
material i, attempts to estimate the material unit value
functions, fi0(xi); 1 i n.
2. One candidate material unit value function, f^i0(xi); 1
i n, is then sampled from each GPi , thus applying the
TS principle of sampling functions proportionally to their
likelihoods.
3. The DET-KS component in the architectue finds the
optimal material mixture M^ = [x1; : : : ; xn] for the sampled
material unit value functions, f^i0(xi); 1 i n, using
Lagrange multipliers.
4. One of the materials is then selected by the Scheduler
component for evaluation, ensuring that each material i
is selected with a frequency that is proportional to the
amount of material, xi, assigned by M^ .
5. Finally, the Stochastic Environment, i.e., the Stochastic
NEFK, samples the true outcome probability function,
pi(xi), at xi, providing feedback vi to the corresponding
GPi, which updates its Bayesian estimate of fi0(xi).
By following the above steps our goal is to gradually
improve our ”best guesses” so that each iteration successively
brings us closer to the optimal solution of the targeted
Stochastic NEFK problem.
2.3</p>
      </sec>
      <sec id="sec-3-4">
        <title>Example Steps</title>
        <p>Figure 3 and 4 show the GP based estimates for the unit
value of two materials, f10 (x1) and f20 (x2), after only 5
material value observations. As can be seen, uncertainty about
the material unit value functions is significant, and the
estimated optimal material amounts M^ = [x^1; x^2] are far from
Web Page 1
Web Page 2
x
x
x
x</p>
        <p>x
x x x x x
x x x x x x x x
x
x
t
t
the optimal amounts M = [x1; x2].</p>
        <p>However, after 193 iterations of the GPOKS algorithm,
we observe a number of fascinating properties in Figure 5.
First of all, the Bayesian estimates of the material unit
values, f10 (x1) and f20 (x2), have become more accurate.
Furthermore, we observe that the estimated optimal material
mixture is now much closer to the optimal mixture. Finally,
observe that the uncertainty concerning f10 (x1) and f20 (x2)
varies with x1 and x2. The beauty of Thompson Sampling is
that the observations are collected with gradually increasing
exploitation, zooming in on the areas that are most likely to
contain the optimal material mixture.
Having obtained a solution to the model in which we set the
NEFK, we shall now demonstrate how we can utilize this
solution for the current problem being studied, namely, the
optimal web-polling problem.</p>
        <p>
          Web resource monitoring consists of repeatedly polling
a selection of web resources so that the user can detect
changes that occur over time. Clearly, as this task can be
prohibitively expensive, in practical applications, the
system imposes a constraint on the maximum number of web
resources that can be polled per time unit. This bound is
dictated by the governing communication bandwidth, and by
the speed limitations associated with the processing. Since
only a fraction of the web resources can be polled within
a given unit of time, the problem which the system’s
analyst encounters is one of determining which web resources
are to be polled. In such cases, a reasonable choice of
action is to choose web resources in a manner that maximizes
the number of changes detected, and the optimal allocation
of the resources involves trial-and-error. As illustrated in
Figure 6, web resources may change with varying
frequencies (that are unknown to the decision maker), and changes
appear more or less randomly. Furthermore, as argued
elsewhere,
          <xref ref-type="bibr" rid="ref4 ref4 ref6 ref6">(Granmo &amp; Oommen 2006; Granmo et al. 2006;
2007)</xref>
          , the probability that an individual web resource poll
uncovers a change on its own decreases monotonically with
the polling frequency used for that web resource.
        </p>
        <p>
          Although several nonlinear criterion functions for
measuring web monitoring performance have been proposed
in the literature (e.g., see
          <xref ref-type="bibr" rid="ref16 ref21">(Pandey, Ramamritham, &amp;
Chakrabarti 2003; Wolf et al. 2002)</xref>
          ), from a broader
viewpoint they are mainly built around the basic concept of
update detection probability, i.e., the probability that polling a
web resource results in new information being discovered.
Therefore, for the purpose of conceptual clarity, we will use
the update detection probability as the token of interest in
this paper. To further define our notion of web monitoring
performance, we consider that time is discrete with the time
interval length T to be the atomic unit of decision making. In
each time interval every single web resource i has a constant
probability qi of remaining unchanged. Furthermore, when
a web resource is updated/changed, the update is available
for detection only until the web resource is updated again.
After that, the original update is considered lost. For
instance, each time a newspaper web resource is updated,
previous news items are replaced by the most recent ones.
        </p>
        <p>In the following, we will denote the update detection
probability of a web resource i as di. Under the above
conditions, di depends on the frequency, xi, that the resource is
polled with, and is modeled using the following expression:
di(xi) = 1</p>
        <p>1
qi xi :
By way of example, consider the scenario that a web
resource remains unchanged in any single time step with
probability 0:5. Then polling the web resource uncovers new
information with probability 1 0:53 = 0:875 if the web
resource is polled every 3rd time step (i.e., with frequency
31 ) and 1 0:52 = 0:75 if the web resource is polled
every 2nd time step. As seen, increasing the polling frequency
reduces the probability of discovering new information on
each polling.</p>
        <p>Given the above considerations, our aim is to find the
resource polling frequencies ~x that maximize the expected
number of pollings uncovering new information per time
step:
maximize
subject to</p>
        <p>P1n xi di(xi)
P1n xi = c and 8i = 1; : : : ; n; xi
In order to find a solution to the above problem we must
define the Stochastic Environment that GPOKS is to
interact with. As seen in Section 2, the Stochastic Environment
consists of the unit volume value functions ff10 (x1); f20 (x2);
: : : ; f n0(xn)g, which are unknown to GPOKS. We identify
the nature of these functions by applying the principle of
Lagrange multipliers to the above maximization problem.
In short, after some simplification, it can be seen that the
following conditions characterize the optimal solution:
d1(x1) = d2(x2) = = dn(xn)
P1n xi = c and 8i = 1; : : : ; n; xi
0:
Since we are not able to observe di(xi) or qi directly, we
base our definition of ff10 (x1); f20 (x2); : : : ; f n0(xn)g on the
result of polling web resources. Briefly stated, we want
fi0(xi) to instantiate to the value 0 with probability 1 di(xi)
and to the value 1 with probability di(xi). Accordingly, if
the web resource i is polled and i has been updated since
our last polling, then we consider fi0(xi) to have been
instantiated to 1. And, if the web resource i is unchanged, we
consider fi0(xi) to have been instantiated to 0.
3.2</p>
      </sec>
      <sec id="sec-3-5">
        <title>Empirical Results</title>
        <p>
          In this section we evaluate GPOKS and compare its
performance with the currently best performing algorithm, LAKG.
While H-TRAA possesses better scalability than LAKG
          <xref ref-type="bibr" rid="ref3 ref5 ref8">(Granmo &amp; Oommen 2010)</xref>
          , for two material problems,
their performance is identical because the hierarchical setup
of H-TRAA does not come into play. For clarification we
will also include some promising variants of GPOKS. Here
follows an overview of a selection of the policies that we
have investigated:
        </p>
        <p>Uniform: The uniform policy allocates monitoring
resources uniformly across all web resources. This classical
policy can, of course, be applied directly in an unknown
environment.</p>
        <p>
          LAKG: The LAKG scheme is basically a game between
so-called Learning Automata
          <xref ref-type="bibr" rid="ref15">(Narendra &amp; Thathachar
1989)</xref>
          . They start off from a uniform policy and
gradually improves toward the optimal configuration through a
sequence of small jumps across a discretized search space.
In all our experiments the resolution of LAKG is set to 100
states.
        </p>
        <p>
          Optimal: This policy requires that update frequencies are
known, and finds the optimal solution based on the
principle of Lagrange multipliers
          <xref ref-type="bibr" rid="ref16 ref21">(Pandey, Ramamritham, &amp;
Chakrabarti 2003; Wolf et al. 2002)</xref>
          .
        </p>
        <p>GPOKS - Mean: To highlight the advantage of our
Optimistic Thompson Sampling approach, we also test a simpler
scheme where we use the mean of the GPs when estimating
the optimal solution rather than sampling functions from the
GPs.</p>
        <p>We have conducted numerous experiments using various
configurations, such as different noise parameters and
update probabilities. Here, we present a representative subset
of these, as they all show the same trend. Performance is
measured as the average accumulated number of web
resource updates found.</p>
        <p>For these experiments, we used an ensemble of 1000
independent replications, each random generator seeded with
a unique number, to maximize the precision of the reported
results. In order to provide a robust overview of the
performance of GPOKS, we investigated three radically different
update probability configurations for web resource pairs. In
the first one, q1 = 0:9=q2 = 0:1, one web resource is
updated significantly more often than the other. A more
moderate version of the latter configuration, q1 = 0:75=q2 = 0:25,
was also investigated. Furthermore, we measured
performance when the two web resources have almost equal
update probability, q1 = 0:55=q2 = 0:45. Finally, we also
investigated the robustness of GPOKS by adding
increasing amount of white-noise, (w ), to the feedback given to
GPOKS. Note that, for the sake of fairness, we applied the
same kernel hyper-parameters, = f1:0; 1:0; 0:1g, for all
the GP based strategies, without further optimization.</p>
        <p>Table 1 reports the performance of the different
policies3. As can be seen, GPOKS clearly outperforms LAKG
when facing the q1 = 0:9=q2 = 0:1 configuration, with
GPOKS detecting on average approximately 8 more updates
than LAKG over 1000 time steps. Also note how
remarkably close GPOKS gets to the optimal performance, missing
on average merely 7 web resource updates over 1000 time
steps. We observe similar results for the q1 = 0:75=q2 =
0:25 configuration. Finally, for the q1 = 0:55=q2 = 0:45
configuration, we observe that the performance of LAKG
and GPOKS becomes more similar. This can be explained
by the prior bias of LAKG, starting from a uniform
allocation of resources. This gives LAKG an advantage over
GPOKS, which are largely unbiased when it comes to prior
belief about update probabilities. Finally, notice the
performance loss caused by using the mean of the GPs
(GPOKSMean) instead of TS. This trend is further explored in
Ta3Note that all of the setups apply a small degree of white noise
(w = 0:1):
ble 2, where we increase the amount of white noise
affecting feedback. We then observe that GPOKS is surprisingly
robust towards noisy feedback compared to GPOKS-Mean.
This can be explained by the greedy nature of
GPOKSMean, which is less inclined to explore the space of
functions encompassed by the GPs, thus being more easily
mislead by noise.</p>
        <p>4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Further Work</title>
      <p>The stochastic non-linear fractional knapsack problem is a
challenging optimization problem with numerous
applications, including resource allocation. The goal is to find the
most valuable mix of materials that fits within a knapsack
of fixed capacity. When the value functions of the involved
materials are fully known and differentiable, the most
valuable mixture can be found by direct application of Lagrange
multipliers.</p>
      <p>In this paper we introduced Gaussian Process based
Optimistic Knapsack Sampling (GPOKS) — a novel
modelbased reinforcement learning scheme for solving stochastic
fractional knapsack problems. The scheme is founded on
Gaussian Process (GP) enabled Optimistic Thompson
Sampling (OTS). Our empirical results demonstrates that this
scheme converge significantly faster than LAKG.
Furthermore, GPOKS incorporates GP based learning of the
material unit values themselves, forming the basis for OTS
supported balancing between exploration and exploitation.
Using resource allocation in web polling as a proof-of-concept
application, our empirical results show that GPOKS
consistently outperforms LAKG, the current top-performer, under
a wide variety of parameter settings.</p>
      <p>In our further work, we will address games of interacting
GPOKS for solving networked and hierarchical resource
allocation problems. Furthermore, we are investigating
techniques for decomposing the GP calculations for increased
computational performance.</p>
      <sec id="sec-4-1">
        <title>Scheme</title>
        <p>Optimal
Uniform
LAKG
GPOKS
GPOKS-Mean
Optimal
Uniform
LAKG
GPOKS
GPOKS-Mean
Optimal
Uniform
LAKG
GPOKS
GPOKS-Mean
= 0:1</p>
      </sec>
      <sec id="sec-4-2">
        <title>Scheme GPOKS GPOKS-Mean</title>
        <p>p1/p2
0.75/0.25
0.75/0.25</p>
        <p>= 0:0
808.2
793.9</p>
        <p>= 0:2
804.5
787.2</p>
        <p>= 0:4
804.1
769.1</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Fractional knapsack problem</article-title>
          .
          <source>Dictionary of Algorithms and Data Structures.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bretthauer</surname>
            ,
            <given-names>K. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Shetty</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>The Nonlinear Knapsack Problem - Algorithms and Applications</article-title>
          .
          <source>European Journal of Operational Research</source>
          <volume>138</volume>
          :
          <fpage>459</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Solving NonStationary Bandit Problems by Random Sampling from Sibling Kalman Filters</article-title>
          .
          <source>In Proceedings of the Twenty Third International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems (IEAAIE</source>
          <year>2010</year>
          ),
          <fpage>199</fpage>
          -
          <lpage>208</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Oommen</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>On Allocating Limited Sampling Resources Using a Learning Automata-based Solution to the Fractional Knapsack Problem</article-title>
          .
          <source>In Proceedings of the 2006 International Intelligent Information Processing and Web Mining Conference (IIS:IIPW'06)</source>
          ,
          <source>Advances in Soft Computing</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Oommen</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Solving Stochastic Nonlinear Resource Allocation Problems Using a Hierarchy of Twofold Resource Allocation Automata</article-title>
          .
          <source>IEEE Transactions on Computers</source>
          <volume>59</volume>
          (
          <issue>4</issue>
          ):
          <fpage>545</fpage>
          -
          <lpage>560</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Oommen</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          ; Myrer,
          <string-name>
            <surname>S. A.</surname>
          </string-name>
          ; and Olsen,
          <string-name>
            <surname>M. G.</surname>
          </string-name>
          <year>2006</year>
          .
          <article-title>Determining Optimal Polling Frequency Using a Learning Automata-based Solution to the Fractional Knapsack Problem</article-title>
          .
          <source>In Proceedings of the 2006 IEEE International Conferences on Cybernetics &amp; Intelligent Systems (CIS) and Robotics</source>
          , Automation &amp;
          <string-name>
            <surname>Mechatronics (RAM).</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Oommen</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          ; Myrer,
          <string-name>
            <surname>S. A.</surname>
          </string-name>
          ; and Olsen,
          <string-name>
            <surname>M. G.</surname>
          </string-name>
          <year>2007</year>
          .
          <article-title>Learning Automata-based Solutions to the Nonlinear Fractional Knapsack Problem with Applications to Optimal Resource Allocation</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          , Part B
          <volume>37</volume>
          (
          <issue>1</issue>
          ):
          <fpage>166</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton</article-title>
          .
          <source>International Journal of Intelligent Computing and Cybernetics</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>207</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Agrawala</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011a</year>
          .
          <article-title>Successive Reduction of Arms in Multi-Armed Bandits</article-title>
          .
          <source>In Proceedings of the Thirty-first SGAI International Conference on Artificial Intelligence (SGAI</source>
          <year>2011</year>
          ). Springer.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Granmo</surname>
            ,
            <given-names>O.-C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Agrawala</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011b</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>In Proceedings of the Tenth International Conference on Machine Learning and Applications (ICMLA'11)</source>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Kellerer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; Pferschy,
          <string-name>
            <given-names>U.</given-names>
            ; and
            <surname>Pisinger</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2004</year>
          . Knapsack Problems. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>May</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Korda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Leslie</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Optimistic bayesian sampling in contextual-bandit problems</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Mach</surname>
          </string-name>
          .
          <source>Learn. Res</source>
          .
          <volume>8</volume>
          :
          <fpage>2069</fpage>
          -
          <lpage>2106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Narendra</surname>
            ,
            <given-names>K. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Thathachar</surname>
            ,
            <given-names>M. A. L.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Learning Automata: An Introduction</article-title>
          . Prentice Hall.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ramamritham</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>Monitoring the Dynamic Web to Respond to Continuous Queries</article-title>
          .
          <source>In 12th International World Wide Web Conference</source>
          ,
          <volume>659</volume>
          -
          <fpage>668</fpage>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C. K. I.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Gaussian Processes for Machine Learning</article-title>
          . The MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Srinivas N.</given-names>
            ,
            <surname>Krause</surname>
          </string-name>
          <string-name>
            <surname>A.</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. S.</surname>
          </string-name>
          , and M.,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Gaussian process optimization in the bandit setting: No regret and experimental design</article-title>
          . In Fu¨rnkranz, J., and
          <string-name>
            <surname>Joachims</surname>
          </string-name>
          , T., eds.,
          <source>Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          ,
          <fpage>1015</fpage>
          -
          <lpage>1022</lpage>
          . Haifa, Israel: Omnipress.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>W. R.</given-names>
          </string-name>
          <year>1933</year>
          .
          <article-title>On the likelihood that one unknown probability exceeds another in view of the evidence of two samples</article-title>
          .
          <source>Biometrika</source>
          <volume>25</volume>
          :
          <fpage>285</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Squillante</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sethuraman</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Ozsen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Optimal Crawling Strategies for Web Search Engines</article-title>
          .
          <source>In 11th International World Wide Web Conference</source>
          ,
          <volume>136</volume>
          -
          <fpage>147</fpage>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Wyatt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Exploration and Inference in Learning from Reinforcement</article-title>
          .
          <source>Ph.D. Dissertation</source>
          , University of Edinburgh.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>