=Paper=
{{Paper
|id=Vol-1348/maics2013_paper_15
|storemode=property
|title=Gaussian Process Based Optimistic Knapsack Sampling with Applications to Stochastic Resource Allocation
|pdfUrl=https://ceur-ws.org/Vol-1348/maics2013_paper_15.pdf
|volume=Vol-1348
|dblpUrl=https://dblp.org/rec/conf/maics/GlimsdalG13
}}
==Gaussian Process Based Optimistic Knapsack Sampling with Applications to Stochastic Resource Allocation==
Sondre Glimsdal and Ole-Christoffer Granmo
Department of ICT, University of Agder, Norway
{sondre.glimsdal, ole.granmo}@uia.no

===Abstract===
The stochastic non-linear fractional knapsack problem is a challenging optimization problem with numerous applications, including resource allocation. The goal is to find the most valuable mix of materials that fits within a knapsack of fixed capacity. When the value functions of the involved materials are fully known and differentiable, the most valuable mixture can be found by direct application of Lagrange multipliers. However, in many real-world applications, such as web polling, information about material value is uncertain, and in many cases missing altogether. Surprisingly, without prior information about material value, the recently proposed Learning Automata Knapsack Game (LAKG) offers arbitrarily accurate convergence towards the optimal solution, simply by interacting with the knapsack on-line.

This paper introduces Gaussian Process based Optimistic Knapsack Sampling (GPOKS) — a novel model-based reinforcement learning scheme for solving stochastic fractional knapsack problems, founded on Gaussian Process (GP) enabled Optimistic Thompson Sampling (OTS). Not only does this scheme converge significantly faster than LAKG, GPOKS also incorporates GP based learning of the material values themselves, forming the basis for OTS supported balancing between exploration and exploitation. Using resource allocation in web polling as a proof-of-concept application, our empirical results show that GPOKS consistently outperforms LAKG, the current top performer, under a wide variety of parameter settings.

===1 Introduction===
The Internet can be seen as a massive collection of ever-changing information, continuously evolving as web resources are created, edited, deleted, and replaced (Pandey, Ramamritham, & Chakrabarti 2003). Obtaining adequate information from the Internet is crucial for many tasks, including social media analytics, counter terrorism, and business intelligence. It is thus important that the applied search engines and web-monitoring frameworks are able to keep their indexes and caches complete and up-to-date. Achieving this, of course, relies on detecting the changes that the web resources undergo, typically by means of polling.

The problem of balancing polling capacity optimally among web resources, with limited prior information, was essentially unsolved until the Learning Automata Knapsack Game (LAKG) was introduced in 2006 as a generic and adaptive solution to the so-called Stochastic Non-linear Equality Fractional Knapsack (NEFK) Problem (Granmo et al. 2006). Before that, the simplest and perhaps most common polling approach was to allocate the available polling capacity uniformly among the web resources being monitored, polling them all with the same fixed frequency, constrained by the available polling capacity. This uniform polling strategy is clearly sub-optimal, since web resources evolve at different speeds. For slowly changing web resources, a high polling frequency translates into a correspondingly large number of unfruitful polls. Conversely, for quickly evolving web resources, a too low polling frequency leads to potential loss of information or acting on outdated information. In brief, without balancing the allocation of the available polling capacity, wasting resources polling one resource may in turn prevent us from polling another, more attractive resource, thus degrading overall performance.
A two-phase strategy has been proposed to address the latter inefficiency: In the first phase, the uniform strategy is applied, which allows the update probability of the monitored web resources to be estimated. By treating these probability estimates as the true ones, Lagrange multipliers can be applied to find an allocation of capacity that is optimal for the estimated values (Pandey, Ramamritham, & Chakrabarti 2003). However, this method needs an arbitrarily long estimation phase to approach the optimal solution in the second phase. That is, one either has to accept a sub-optimal final solution because the update probability estimates are inaccurate, or one must wait an extensive amount of time until the estimates have become sufficiently accurate, allowing a better solution in the second phase. Also note that evolving update probabilities render the solution found with the latter approach progressively more inaccurate.

This paper introduces Gaussian Process based Optimistic Knapsack Sampling (GPOKS) — a novel scheme for solving stochastic knapsack problems founded on Gaussian Process (GP) (Rasmussen & Williams 2006) based Thompson Sampling (TS) (Thompson 1933; Granmo 2010), enhanced by the principles of Optimistic TS (May et al. 2012). As we shall see, not only does this scheme converge significantly faster than LAKG, GPOKS also incorporates GP based learning of the material unit values themselves, forming the basis for TS based exploration and exploitation. This allows GPOKS to gradually shift from estimation to optimization, starting with pure estimation and converging towards pure optimization.

In (Granmo 2010) we reported a Bayesian technique for solving bandit-like problems, revisiting the Thompson Sampling (Thompson 1933) principle pioneered in 1933. This revisit led to novel schemes for handling multi-armed and dynamic (restless) bandit problems (Granmo & Berg 2010; Gupta, Granmo, & Agrawala 2011a; 2011b), and empirical results demonstrated the advantages of these techniques over established top performers. Furthermore, we provided theoretical results stating that the original technique is instantaneously self-correcting and that it converges to only pulling the optimal arm with probability as close to unity as desired. We now expand this principle to support Thompson Sampling for Stochastic NEFK Problems.
====1.1 Formal Problem Formulation====
In order to appreciate the qualities of the Stochastic NEFK Problem, it is beneficial to view the problem in light of the classical linear Fractional Knapsack (FK) Problem. Indeed, the Stochastic NEFK Problem generalizes the latter problem in two significant ways. Both problems are briefly defined below.

'''The Linear Fractional Knapsack (FK) Problem:''' The linear FK problem is a classical continuous optimization problem which also has applications within the field of resource allocation. The problem involves n materials of different value v_i per unit volume, 1 ≤ i ≤ n, where each material is available in a certain amount x_i ≤ b_i. Let f_i(x_i) denote the value of the amount x_i of material i, i.e., f_i(x_i) = v_i x_i. The problem is to fill a knapsack of fixed volume c with the material mix x = [x_1, ..., x_n] of maximal value Σ_{i=1}^n f_i(x_i) (Black 2004).

'''The Nonlinear Equality FK (NEFK) Problem:''' One important extension of the above classical problem is the Nonlinear Equality FK problem with a separable and concave objective function. The problem can be stated as follows (Kellerer, Pferschy, & Pisinger 2004):

<math>\text{maximize } f(x) = \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

Since the objective function is considered to be concave, the value function f_i(x_i) of each material is also concave. This means that the derivatives of the material value functions f_i(x_i) with respect to x_i (hereafter denoted f_i') are non-increasing. In other words, the material value per unit volume is no longer constant as in the linear case, but decreases with the material amount, and so the optimization problem becomes:

<math>\text{maximize } f(x) = \sum_{i=1}^{n} f_i(x_i), \text{ where } f_i(x_i) = \int_0^{x_i} f_i'(u)\,du, \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

Efficient solutions to the latter problem, based on the principle of Lagrange multipliers, have been devised. In short, the optimal value occurs when the derivatives f_i' of the material value functions are equal, subject to the knapsack constraints (Bretthauer & Shetty 2002):

<math>f_1'(x_1) = \cdots = f_n'(x_n), \quad \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

'''The Stochastic NEFK Problem:''' In this paper we generalize the above nonlinear equality knapsack problem. First of all, we let the material value per unit volume for any x_i be a probability function p_i(x_i). Furthermore, we consider the distribution of p_i(x_i) to be unknown. That is, each time an amount x_i of material i is placed in the knapsack, we are only allowed to observe an instantiation of p_i(x_i) at x_i, and not p_i(x_i) itself.¹ Given this stochastic environment, we intend to devise an on-line incremental scheme that learns the mix of materials of maximal expected value, through a series of informed guesses.

Thus, to clarify issues, we are provided with a knapsack of fixed volume c, which is to be filled with a mix of n different materials. However, unlike the NEFK, in the Stochastic NEFK Problem the unit volume value of a material i, 1 ≤ i ≤ n, is a random quantity — it takes the value 1 with probability p_i(x_i) and the value 0 with probability 1 − p_i(x_i), respectively. As an additional complication, p_i(x_i) is nonlinear in the sense that it decreases monotonically with x_i, i.e., x_{i1} ≤ x_{i2} ⇔ p_i(x_{i1}) ≥ p_i(x_{i2}).

Since unit volume values are random, we operate with expected unit volume values rather than the actual unit volume values. With this understanding, and the above perspective in mind, the expected value of the amount x_i of material i, 1 ≤ i ≤ n, becomes f_i(x_i) = ∫_0^{x_i} p_i(u) du. Accordingly, the expected value per unit volume² of material i becomes f_i'(x_i) = p_i(x_i). In this stochastic and non-linear version of the FK problem, the goal is to fill the knapsack so that the expected value f(x) = Σ_{i=1}^n f_i(x_i) of the material mix contained in the knapsack is maximized. Thus, we aim to:

<math>\text{maximize } f(x) = \sum_{i=1}^{n} f_i(x_i), \text{ where } f_i(x_i) = \int_0^{x_i} p_i(u)\,du \text{ and } p_i(x_i) = f_i'(x_i), \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

A fascinating property of the above problem is that the amount of information available to the decision maker is limited — the decision maker is only allowed to observe the current unit value of each material (either 0 or 1). That is, each time a material mix is placed in the knapsack, the unit value of each material is provided to the decision maker. The actual outcome probabilities p_i(x_i), 1 ≤ i ≤ n, however, remain unknown. As a result of the latter, the expected value of the material mix must be maximized by means of trial-and-error, i.e., by experimenting with different material mixes and by observing the resulting random unit value outcomes.

¹ For the sake of consistency with previous work on the Stochastic NEFK Problem, we here model stochastic material unit values using Bernoulli trials. However, since GPOKS is based on Gaussian Processes, the central limit theorem opens up for addressing a number of other distributions too. Furthermore, there exist dedicated kernel functions for a variety of distributions.

² We hereafter use f_i'(x_i) to denote the derivative of the expected value function f_i(x_i) with respect to x_i.
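To make the interaction model above concrete, the following sketch implements a minimal Stochastic NEFK environment with Bernoulli feedback. The specific decreasing probability curves p_i(x_i) (exponential decay) are illustrative assumptions of ours, not taken from the paper; only the feedback mechanism follows the formulation.

<syntaxhighlight lang="python">
import random
import math

class StochasticNEFKEnvironment:
    """Bernoulli feedback for a Stochastic NEFK instance: placing an amount x_i of
    material i yields unit value 1 with probability p_i(x_i) and 0 otherwise.
    The p_i curves below are illustrative, monotonically decreasing assumptions."""

    def __init__(self, decay_rates, seed=None):
        self.decay_rates = decay_rates          # one assumed decay rate per material
        self.rng = random.Random(seed)

    def p(self, i, x_i):
        # Assumed unit value probability: decreases monotonically with x_i.
        return math.exp(-self.decay_rates[i] * x_i)

    def observe(self, i, x_i):
        # An instantiation of p_i(x_i): 1 with probability p_i(x_i), else 0.
        return 1 if self.rng.random() < self.p(i, x_i) else 0

# Example: two materials, the second one losing value faster per unit volume.
env = StochasticNEFKEnvironment(decay_rates=[0.5, 2.0], seed=42)
print(env.observe(0, 0.3), env.observe(1, 0.3))
</syntaxhighlight>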
====1.2 Paper Contributions====
The contributions of this paper can be summarized as follows:
# We combine Bayesian modeling with reinforcement learning to provide a novel solution to the Stochastic NEFK Problem.
# We propose the first reinforcement learning scheme that combines Gaussian Processes (Rasmussen & Williams 2006) with Thompson Sampling (Thompson 1933; Granmo 2010).
# We introduce GP based sampling mechanisms in the spirit of Optimistic Thompson Sampling (May et al. 2012) for increased performance.
# The resulting scheme persistently outperforms state-of-the-art approaches when applied to resource allocation in web polling.

These contributions form the first steps towards establishing a new family of reinforcement learning schemes that provide on-line solutions to stochastic versions of classical optimization problems. This is achieved by carefully designing Bayesian models that capture the nature of the optimization problems, applying TS principles to address the exploration/exploitation dilemma in on-line learning and control.

====1.3 Paper Outline====
In Section 2, we present our scheme for Gaussian Process Based Optimistic Knapsack Sampling (GPOKS). We start with a brief introduction to Gaussian Processes, before we propose how Gaussian Processes can enable Thompson Sampling — the current leader when it comes to solving Bernoulli Bandit Problems (Granmo 2010) — for exploration and exploitation when solving on-line Stochastic NEFK problems. Then, in Section 3, we define the web resource allocation polling problem in more detail, following up with an evaluation of GPOKS compared with the state-of-the-art. We conclude in Section 4 and present pointers for further work.

===2 Gaussian Process Based Optimistic Knapsack Sampling (GPOKS)===
The conflict between exploration and exploitation is a well-known problem in reinforcement learning and other areas of artificial intelligence. The multi-armed bandit problem captures the essence of this conflict, and has thus occupied researchers for over fifty years (Wyatt 1997). In brief, an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information.

We are here facing a similar problem; however, instead of seeking the single best material (bandit arm), we need to find a mixture of materials, also referred to as a mixed strategy in Game Theory. Recently, GP optimization has been addressed from a bandit problem perspective (Srinivas et al. 2010), allowing the GP to be explored globally with as few evaluations as possible based on so-called upper confidence bounds. Inspired by the success of GP based optimization, we here propose a novel GP based model for stochastic NEFK problems, where a collection of GPs captures the individual material unit values. Based on the GP collection, Thompson Sampling is applied to sample likely deterministic NEFK problem instances from the GPs. These, in turn, are solved based on Lagrange multipliers, producing a potential solution to the problem at hand.

====2.1 Gaussian Processes based Representation of Material Unit Value====
A Gaussian Process (GP) is a stochastic process that represents a function as a multivariate Gaussian distribution (Rasmussen & Williams 2006). It is specified as a tuple GP = (µ(x), K(·, ·)), where µ(·) is the mean function, typically assigned µ(x) = 0, and K(·, ·) is a kernel that specifies the covariance matrix for the random vector x. In this paper, we use the one-dimensional Squared Exponential kernel (Eq. 1), configured by the hyper-parameters θ = {l, σ_f², σ_n²}:

<math>K(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{(x_p - x_q)^2}{2l^2}\right) + \sigma_n^2 \delta_{pq} \qquad (1)</math>

Here l is the characteristic length-scale parameter that determines how rapidly the correlation should decay as the distance between x_p and x_q increases, σ_f² is the signal variance, and σ_n² is white noise (note that δ_{pq} here denotes the Kronecker delta between x_p and x_q). For further information on GPs we refer to (Rasmussen & Williams 2006).
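As a concrete reading of Eq. (1), the sketch below builds the covariance matrix of a set of material amounts under the Squared Exponential kernel. The default hyper-parameter values mirror θ = {1.0, 1.0, 0.1} used in the experiments of Section 3.2; the function name is our own.

<syntaxhighlight lang="python">
import numpy as np

def se_kernel_matrix(xs, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Covariance matrix for the Squared Exponential kernel of Eq. (1):
    K(x_p, x_q) = sigma_f^2 * exp(-(x_p - x_q)^2 / (2 l^2)) + sigma_n^2 * delta_pq.
    The Kronecker delta term adds the noise variance on the diagonal only."""
    xs = np.asarray(xs, dtype=float)
    sq_dist = (xs[:, None] - xs[None, :]) ** 2
    return signal_var * np.exp(-sq_dist / (2.0 * length_scale ** 2)) \
        + noise_var * np.eye(len(xs))

# Covariance of the unit value at three material amounts.
print(se_kernel_matrix([0.1, 0.2, 0.8]))
</syntaxhighlight>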
By way of example, Figure 1 illustrates how the posterior distribution over possible material unit value functions for a given material i can be represented by means of a GP. The x-axis measures the amount of material, x_i, while the y-axis provides the material unit value f_i'(x_i). The mean and the 95% confidence interval are included, as well as four samples indicating possible candidates for f_i'(x_i). Note that since the Stochastic NEFK problem deals with non-increasing unit value functions f_i'(x_i), we apply Rejection Sampling to sample from the distribution of non-increasing functions. Similarly, "optimistic" sampling, as pioneered by May et al. (May et al. 2012), is realized by rejecting sampled functions that drop below the estimated mean.

Figure 1: Gaussian Process based representation of material unit value.
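The sketch below shows one way to realize the rejection step just described: draw a curve from the GP posterior over a grid of material amounts, and keep it only if it is non-increasing and, for the optimistic variant, never drops below the posterior mean. This is a minimal illustration under our own conventions (posterior supplied as a mean vector and covariance matrix), not the authors' exact implementation.

<syntaxhighlight lang="python">
import numpy as np

def sample_unit_value_curve(mean, cov, optimistic=True, max_tries=1000, rng=None):
    """Rejection-sample a candidate unit value curve f_i'(.) from a GP posterior
    given by a mean vector and covariance matrix over a grid of amounts x_i.
    Accepted curves are non-increasing; with optimistic=True they must also stay
    at or above the posterior mean, mirroring Optimistic Thompson Sampling."""
    rng = rng or np.random.default_rng()
    mean = np.asarray(mean, dtype=float)
    for _ in range(max_tries):
        f = rng.multivariate_normal(mean, cov)
        if np.any(np.diff(f) > 0):            # reject: curve increases somewhere
            continue
        if optimistic and np.any(f < mean):   # reject: curve dips below the mean
            continue
        return f
    return mean  # fallback if no sample is accepted within max_tries
</syntaxhighlight>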
====2.2 Architectural Overview of GPOKS====
Figure 2 provides an architectural overview of our scheme: a collection of GPs (GP_1, ..., GP_n) supplies sampled unit value functions to the DET-KS component, which passes a material mixture M̂ to the Scheduler, which in turn interacts with the Stochastic Environment and routes the resulting feedback v_i back to the GPs. As illustrated in the figure, GPOKS operates as follows (a code sketch of one such iteration is given at the end of this subsection):

Figure 2: GPOKS Architectural Overview.

# A collection of GPs, one Gaussian Process GP_i for each material i, attempts to estimate the material unit value functions f_i'(x_i), 1 ≤ i ≤ n.
# One candidate material unit value function f̂_i'(x_i), 1 ≤ i ≤ n, is then sampled from each GP_i, thus applying the TS principle of sampling functions proportionally to their likelihoods.
# The DET-KS component in the architecture finds the optimal material mixture M̂ = [x_1, ..., x_n] for the sampled material unit value functions f̂_i'(x_i), 1 ≤ i ≤ n, using Lagrange multipliers.
# One of the materials is then selected by the Scheduler component for evaluation, ensuring that each material i is selected with a frequency that is proportional to the amount of material, x_i, assigned by M̂.
# Finally, the Stochastic Environment, i.e., the Stochastic NEFK, samples the true outcome probability function p_i(x_i) at x_i, providing feedback v_i to the corresponding GP_i, which updates its Bayesian estimate of f_i'(x_i).

By following the above steps, our goal is to gradually improve our "best guesses" so that each iteration successively brings us closer to the optimal solution of the targeted Stochastic NEFK problem.
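For readers who want to trace the five steps in code, the sketch below wires the pieces together for a single GPOKS round, reusing se_kernel_matrix and sample_unit_value_curve from the sketches above and the StochasticNEFKEnvironment sketched after Section 1.1. The grid-based DET-KS solver, which bisects on the common derivative level (the Lagrange multiplier), is our own simplification; the helper names, hyper-parameter values, and discretization are assumptions, not the paper's implementation.

<syntaxhighlight lang="python">
import numpy as np

def cross_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Noise-free Squared Exponential cross-covariance between two sets of amounts."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return signal_var * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length_scale ** 2))

def det_ks(curves, grid, c):
    """Discretized DET-KS sketch: equalize the sampled unit value curves at a common
    level (the Lagrange multiplier) found by bisection, subject to sum(x) = c."""
    curves, grid = [np.asarray(f) for f in curves], np.asarray(grid)
    def alloc(level):
        x = []
        for f in curves:
            idx = np.where(f >= level)[0]
            x.append(grid[idx[-1]] if len(idx) else grid[0])
        return np.array(x)
    lo, hi = min(f.min() for f in curves), max(f.max() for f in curves)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if alloc(mid).sum() > c:
            lo = mid                      # allocation too big: raise the level
        else:
            hi = mid
    x = alloc(hi)
    if x.sum() == 0.0:
        x = np.ones(len(curves))
    return c * x / x.sum()                # rescale so the capacity is met exactly

def gpoks_iteration(env, observations, grid, c=1.0, noise_var=0.1, rng=None):
    """One GPOKS round (steps 1-5 of the overview). observations[i] holds the
    (x_i, v_i) feedback pairs gathered so far for material i; env.observe(i, x)
    returns the Bernoulli unit value, as in the environment sketch of Section 1.1."""
    rng = rng or np.random.default_rng()
    grid = np.asarray(grid, dtype=float)
    curves = []
    for i in range(len(observations)):
        xs = np.array([x for x, _ in observations[i]], dtype=float)
        vs = np.array([v for _, v in observations[i]], dtype=float)
        prior = cross_kernel(grid, grid)
        if len(xs) == 0:                  # no feedback yet: fall back to the GP prior
            mean, cov = np.zeros(len(grid)), prior
        else:                             # step 1: standard GP regression posterior
            K = cross_kernel(xs, xs) + noise_var * np.eye(len(xs))
            Ks = cross_kernel(xs, grid)
            sol = np.linalg.solve(K, Ks)
            mean, cov = sol.T @ vs, prior - Ks.T @ sol
        cov = cov + 1e-9 * np.eye(len(grid))   # jitter for numerical stability
        # Step 2: Thompson-sample a non-increasing, optimistic candidate curve.
        curves.append(sample_unit_value_curve(mean, cov, rng=rng))
    x_hat = det_ks(curves, grid, c)       # step 3: solve the sampled instance
    i = rng.choice(len(x_hat), p=x_hat / x_hat.sum())   # step 4: schedule a material
    v = env.observe(i, x_hat[i])          # step 5: Bernoulli feedback from the environment
    observations[i].append((x_hat[i], v))
    return x_hat
</syntaxhighlight>

A toy run under these assumptions would create env = StochasticNEFKEnvironment([0.5, 2.0], seed=1), observations = [[], []], grid = np.linspace(0.02, 1.0, 25), and call gpoks_iteration(env, observations, grid) repeatedly; the returned allocations should then concentrate around the mix where the two estimated unit value curves meet.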
====2.3 Example Steps====
Figures 3 and 4 show the GP based estimates for the unit value of two materials, f_1'(x_1) and f_2'(x_2), after only 5 material value observations. As can be seen, uncertainty about the material unit value functions is significant, and the estimated optimal material amounts M̂ = [x̂_1, x̂_2] are far from the optimal amounts M = [x_1, x_2].

Figure 3: Estimate of material unit value f_1'(x_1) after 7 observations, with optimal and estimated material amounts x_1.

Figure 4: Estimate of material unit value f_2'(x_2) after 7 observations, with optimal and estimated material amounts x_2.

However, after 193 iterations of the GPOKS algorithm, we observe a number of fascinating properties in Figure 5. First of all, the Bayesian estimates of the material unit values, f_1'(x_1) and f_2'(x_2), have become more accurate. Furthermore, we observe that the estimated optimal material mixture is now much closer to the optimal mixture. Finally, observe that the uncertainty concerning f_1'(x_1) and f_2'(x_2) varies with x_1 and x_2. The beauty of Thompson Sampling is that the observations are collected with gradually increasing exploitation, zooming in on the areas that are most likely to contain the optimal material mixture.

Figure 5: Estimate of material unit values f_1'(x_1) and f_2'(x_2) after 193 observations, with optimal and estimated material amounts x_1 and x_2.

===3 Application: Web Polling===
Having obtained a solution to the model in which we cast the NEFK, we shall now demonstrate how we can utilize this solution for the current problem being studied, namely, the optimal web-polling problem.

Web resource monitoring consists of repeatedly polling a selection of web resources so that the user can detect changes that occur over time. Clearly, as this task can be prohibitively expensive, in practical applications the system imposes a constraint on the maximum number of web resources that can be polled per time unit. This bound is dictated by the governing communication bandwidth and by the speed limitations associated with the processing. Since only a fraction of the web resources can be polled within a given unit of time, the problem which the system's analyst encounters is one of determining which web resources are to be polled. In such cases, a reasonable choice of action is to choose web resources in a manner that maximizes the number of changes detected, and the optimal allocation of the resources involves trial-and-error. As illustrated in Figure 6, web resources may change with varying frequencies (that are unknown to the decision maker), and changes appear more or less randomly. Furthermore, as argued elsewhere (Granmo & Oommen 2006; Granmo et al. 2006; 2007), the probability that an individual web resource poll uncovers a change on its own decreases monotonically with the polling frequency used for that web resource.

Figure 6: Web resource changes occurring over time. An 'x' on the time-lines denotes that the respective web resource has changed.

Although several nonlinear criterion functions for measuring web monitoring performance have been proposed in the literature (e.g., see (Pandey, Ramamritham, & Chakrabarti 2003; Wolf et al. 2002)), from a broader viewpoint they are mainly built around the basic concept of update detection probability, i.e., the probability that polling a web resource results in new information being discovered. Therefore, for the purpose of conceptual clarity, we will use the update detection probability as the token of interest in this paper.

To further define our notion of web monitoring performance, we consider time to be discrete, with the time interval length T as the atomic unit of decision making. In each time interval every single web resource i has a constant probability q_i of remaining unchanged. Furthermore, when a web resource is updated/changed, the update is available for detection only until the web resource is updated again. After that, the original update is considered lost. For instance, each time a newspaper web resource is updated, previous news items are replaced by the most recent ones.

In the following, we will denote the update detection probability of a web resource i as d_i. Under the above conditions, d_i depends on the frequency x_i with which the resource is polled, and is modeled using the following expression:

<math>d_i(x_i) = 1 - q_i^{1/x_i}.</math>

By way of example, consider the scenario that a web resource remains unchanged in any single time step with probability 0.5. Then polling the web resource uncovers new information with probability 1 − 0.5³ = 0.875 if the web resource is polled every 3rd time step (i.e., with frequency 1/3), and 1 − 0.5² = 0.75 if the web resource is polled every 2nd time step. As seen, increasing the polling frequency reduces the probability of discovering new information on each polling.

Given the above considerations, our aim is to find the resource polling frequencies x that maximize the expected number of pollings uncovering new information per time step:

<math>\text{maximize } \sum_{i=1}^{n} x_i \, d_i(x_i) \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i = 1, \ldots, n.</math>
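As a quick numerical check of this objective, the snippet below evaluates the worked example above and the uniform allocation for the q_1 = 0.9 / q_2 = 0.1 pair studied in Section 3.2; the uniform value of roughly 0.59 detected updates per time step is consistent with the Uniform row of Table 1 (590 updates over 1000 steps). The function names are ours.

<syntaxhighlight lang="python">
def detection_prob(q, x):
    """Update detection probability d_i(x_i) = 1 - q_i ** (1 / x_i)."""
    return 1.0 - q ** (1.0 / x)

def expected_detections_per_step(qs, xs):
    """Objective: sum of x_i * d_i(x_i) over all web resources."""
    return sum(x * detection_prob(q, x) for q, x in zip(qs, xs))

print(detection_prob(0.5, 1.0 / 3.0))                        # 0.875, the worked example
print(detection_prob(0.5, 0.5))                              # 0.75
print(expected_detections_per_step([0.9, 0.1], [0.5, 0.5]))  # about 0.59 per step
</syntaxhighlight>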
====3.1 GPOKS Solution====
In order to find a solution to the above problem, we must define the Stochastic Environment that GPOKS is to interact with. As seen in Section 2, the Stochastic Environment consists of the unit volume value functions {f_1'(x_1), f_2'(x_2), ..., f_n'(x_n)}, which are unknown to GPOKS. We identify the nature of these functions by applying the principle of Lagrange multipliers to the above maximization problem. In short, after some simplification, it can be seen that the following conditions characterize the optimal solution:

<math>d_1(x_1) = d_2(x_2) = \cdots = d_n(x_n), \quad \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i = 1, \ldots, n.</math>

Since we are not able to observe d_i(x_i) or q_i directly, we base our definition of {f_1'(x_1), f_2'(x_2), ..., f_n'(x_n)} on the result of polling web resources. Briefly stated, we want f_i'(x_i) to instantiate to the value 0 with probability 1 − d_i(x_i) and to the value 1 with probability d_i(x_i). Accordingly, if web resource i is polled and i has been updated since our last polling, then we consider f_i'(x_i) to have been instantiated to 1. And, if web resource i is unchanged, we consider f_i'(x_i) to have been instantiated to 0.
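Under the stated optimality conditions, the allocation can be computed directly when the q_i are known, which is what the Optimal baseline in Section 3.2 relies on. The sketch below inverts d_i(x_i) = 1 − q_i^(1/x_i) and bisects on the common detection probability; for q_1 = 0.9, q_2 = 0.1 and c = 1 it returns roughly x ≈ [0.044, 0.956] with d ≈ 0.91, in line with the Optimal row of Table 1 (909.9 detected updates over 1000 steps). The function name and the bisection tolerances are our own choices.

<syntaxhighlight lang="python">
import math

def equalize_detection_probabilities(qs, c=1.0, iterations=200):
    """Allocation with d_1(x_1) = ... = d_n(x_n) and sum(x) = c, where
    d_i(x) = 1 - q_i ** (1 / x).  Inverting d gives x = ln(q_i) / ln(1 - d), and
    the common level d is found by bisection (a sketch of the known-q optimum)."""
    def x_for(d, q):
        return math.log(q) / math.log(1.0 - d)
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iterations):
        d = 0.5 * (lo + hi)
        if sum(x_for(d, q) for q in qs) > c:
            lo = d            # total allocation exceeds capacity: demand a higher d
        else:
            hi = d
    d = 0.5 * (lo + hi)
    return [x_for(d, q) for q in qs], d

xs, d = equalize_detection_probabilities([0.9, 0.1])
print([round(x, 3) for x in xs], round(d, 3))   # approximately [0.044, 0.956] 0.91
</syntaxhighlight>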
====3.2 Empirical Results====
In this section we evaluate GPOKS and compare its performance with the currently best performing algorithm, LAKG. While H-TRAA possesses better scalability than LAKG (Granmo & Oommen 2010), for two-material problems their performance is identical, because the hierarchical setup of H-TRAA does not come into play. For clarification we will also include some promising variants of GPOKS. Here follows an overview of the policies that we have investigated:

'''Uniform:''' The uniform policy allocates monitoring resources uniformly across all web resources. This classical policy can, of course, be applied directly in an unknown environment.

'''LAKG:''' The LAKG scheme is basically a game between so-called Learning Automata (Narendra & Thathachar 1989). They start off from a uniform policy and gradually improve toward the optimal configuration through a sequence of small jumps across a discretized search space. In all our experiments the resolution of LAKG is set to 100 states.

'''Optimal:''' This policy requires that update frequencies are known, and finds the optimal solution based on the principle of Lagrange multipliers (Pandey, Ramamritham, & Chakrabarti 2003; Wolf et al. 2002).

'''GPOKS-Mean:''' To highlight the advantage of our Optimistic Thompson Sampling approach, we also test a simpler scheme where we use the mean of the GPs when estimating the optimal solution, rather than sampling functions from the GPs.

We have conducted numerous experiments using various configurations, such as different noise parameters and update probabilities. Here, we present a representative subset of these, as they all show the same trend. Performance is measured as the average accumulated number of web resource updates found.

For these experiments, we used an ensemble of 1000 independent replications, each random generator seeded with a unique number, to maximize the precision of the reported results. In order to provide a robust overview of the performance of GPOKS, we investigated three radically different update probability configurations for web resource pairs. In the first one, q_1 = 0.9 / q_2 = 0.1, one web resource is updated significantly more often than the other. A more moderate version of the latter configuration, q_1 = 0.75 / q_2 = 0.25, was also investigated. Furthermore, we measured performance when the two web resources have almost equal update probability, q_1 = 0.55 / q_2 = 0.45. Finally, we also investigated the robustness of GPOKS by adding an increasing amount of white noise, w_σ, to the feedback given to GPOKS. Note that, for the sake of fairness, we applied the same kernel hyper-parameters, θ = {1.0, 1.0, 0.1}, for all the GP based strategies, without further optimization.

Table 1 reports the performance of the different policies (note that all of the setups apply a small degree of white noise, w_σ = 0.1). As can be seen, GPOKS clearly outperforms LAKG when facing the q_1 = 0.9 / q_2 = 0.1 configuration, with GPOKS detecting on average approximately 28 more updates than LAKG over 1000 time steps. Also note how remarkably close GPOKS gets to the optimal performance, missing on average merely 7 web resource updates over 1000 time steps. We observe similar results for the q_1 = 0.75 / q_2 = 0.25 configuration. Finally, for the q_1 = 0.55 / q_2 = 0.45 configuration, we observe that the performance of LAKG and GPOKS becomes more similar. This can be explained by the prior bias of LAKG, starting from a uniform allocation of resources. This gives LAKG an advantage over GPOKS, which is largely unbiased when it comes to prior belief about update probabilities.

{| class="wikitable"
|+ Table 1: Average number of updates (Avg[#Updates]) at different times t, w_σ = 0.1
|-
! Scheme !! q_1 / q_2 !! t = 10 !! t = 100 !! t = 1000
|-
| Optimal || 0.90/0.10 || 9.1 || 91.0 || 909.9
|-
| Uniform || 0.90/0.10 || 5.9 || 59.0 || 590.0
|-
| LAKG || 0.90/0.10 || 6.0 || 71.6 || 874.9
|-
| GPOKS || 0.90/0.10 || 8.0 || 88.9 || 903.0
|-
| GPOKS-Mean || 0.90/0.10 || 8.5 || 89.7 || 902.9
|-
| Optimal || 0.75/0.25 || 8.1 || 81.2 || 812.5
|-
| Uniform || 0.75/0.25 || 6.9 || 68.8 || 687.5
|-
| LAKG || 0.75/0.25 || 6.9 || 74.1 || 793.1
|-
| GPOKS || 0.75/0.25 || 7.4 || 78.8 || 807.9
|-
| GPOKS-Mean || 0.75/0.25 || 6.6 || 69.6 || 792.2
|-
| Optimal || 0.55/0.45 || 7.5 || 75.2 || 752.5
|-
| Uniform || 0.55/0.45 || 7.5 || 74.8 || 747.5
|-
| LAKG || 0.55/0.45 || 7.5 || 74.8 || 749.8
|-
| GPOKS || 0.55/0.45 || 7.0 || 73.5 || 749.4
|-
| GPOKS-Mean || 0.55/0.45 || 5.4 || 52.8 || 725.3
|}

Finally, notice the performance loss caused by using the mean of the GPs (GPOKS-Mean) instead of TS. This trend is further explored in Table 2, where we increase the amount of white noise affecting the feedback. We then observe that GPOKS is surprisingly robust towards noisy feedback compared to GPOKS-Mean. This can be explained by the greedy nature of GPOKS-Mean, which is less inclined to explore the space of functions encompassed by the GPs, thus being more easily misled by noise.

{| class="wikitable"
|+ Table 2: The performance of GPOKS variants under different levels of white noise
|-
! Scheme !! q_1 / q_2 !! w_σ = 0.0 !! w_σ = 0.2 !! w_σ = 0.4
|-
| GPOKS || 0.75/0.25 || 808.2 || 804.5 || 804.1
|-
| GPOKS-Mean || 0.75/0.25 || 793.9 || 787.2 || 769.1
|}
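The paper does not spell out exactly how the white noise w_σ enters the feedback; a natural reading, sketched below purely as an assumption of ours, is that the 0/1 poll outcome is perturbed by zero-mean Gaussian noise of standard deviation w_σ before it reaches the GPs.

<syntaxhighlight lang="python">
import random

def noisy_poll_feedback(changed, w_sigma, rng=random):
    """Assumed noise model: the 0/1 poll outcome (1 if the web resource changed
    since the last poll) is perturbed by zero-mean Gaussian noise with standard
    deviation w_sigma before being handed to the GP as feedback v_i."""
    return (1.0 if changed else 0.0) + rng.gauss(0.0, w_sigma)

print(noisy_poll_feedback(True, 0.2))
</syntaxhighlight>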
Op- a wide variety of parameter settings. timistic bayesian sampling in contextual-bandit problems. In our further work, we will address games of interacting J. Mach. Learn. Res. 8:2069–2106. GPOKS for solving networked and hierarchical resource al- Narendra, K. S., and Thathachar, M. A. L. 1989. Learning location problems. Furthermore, we are investigating tech- Automata: An Introduction. Prentice Hall. niques for decomposing the GP calculations for increased computational performance. Pandey, S.; Ramamritham, K.; and Chakrabarti, S. 2003. Monitoring the Dynamic Web to Respond to Continuous Queries. In 12th International World Wide Web Confer- References ence, 659–668. ACM Press. Black, P. E. 2004. Fractional knapsack problem. Dictio- Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian nary of Algorithms and Data Structures. Processes for Machine Learning. The MIT Press. Bretthauer, K. M., and Shetty, B. 2002. The Nonlinear Srinivas N., Krause A., K. S., and M., S. 2010. Gaussian Knapsack Problem — Algorithms and Applications. Euro- process optimization in the bandit setting: No regret and pean Journal of Operational Research 138:459–472. experimental design. In Fürnkranz, J., and Joachims, T., Granmo, O.-C., and Berg, S. 2010. Solving Non- eds., Proceedings of the 27th International Conference on Stationary Bandit Problems by Random Sampling from Machine Learning (ICML-10), 1015–1022. Haifa, Israel: Sibling Kalman Filters. In Proceedings of the Twenty Third Omnipress. International Conference on Industrial, Engineering, and Thompson, W. R. 1933. On the likelihood that one un- Other Applications of Applied Intelligent Systems (IEA- known probability exceeds another in view of the evidence AIE 2010), 199–208. Springer. of two samples. Biometrika 25:285–294. Granmo, O.-C., and Oommen, B. J. 2006. On Al- Wolf, J. L.; Squillante, M. S.; Sethuraman, J.; and Ozsen, locating Limited Sampling Resources Using a Learning K. 2002. Optimal Crawling Strategies for Web Search En- Automata-based Solution to the Fractional Knapsack Prob- gines. In 11th International World Wide Web Conference, lem. In Proceedings of the 2006 International Intelli- 136–147. ACM Press. gent Information Processing and Web Mining Conference Wyatt, J. 1997. Exploration and Inference in Learning (IIS:IIPW’06), Advances in Soft Computing. Springer. from Reinforcement. Ph.D. Dissertation, University of Ed- Granmo, O.-C., and Oommen, B. J. 2010. Solving Stochas- inburgh. tic Nonlinear Resource Allocation Problems Using a Hier- Scheme p1 /p2 Avg[#Updates] t=10 Avg[#Updates] t=100 Avg[#Updates] t=1000 Optimal 0.90/0.10 9.1 91.0 909.9 Uniform 0.90/0.10 5.9 59.0 590.0 LAKG 0.90/0.10 6.0 71.6 874.9 GPOKS 0.90/0.10 8.0 88.9 903.0 GPOKS-Mean 0.90/0.10 8.5 89.7 902.9 Optimal 0.75/0.25 8.1 81.2 812.5 Uniform 0.75/0.25 6.9 68.8 687.5 LAKG 0.75/0.25 6.9 74.1 793.1 GPOKS 0.75/0.25 7.4 78.8 807.9 GPOKS-Mean 0.75/0.25 6.6 69.6 792.2 Optimal 0.55/0.45 7.5 75.2 752.5 Uniform 0.55/0.45 7.5 74.8 747.5 LAKG 0.55/0.45 7.5 74.8 749.8 GPOKS 0.55/0.45 7.0 73.5 749.4 GPOKS-Mean 0.55/0.45 5.4 52.8 725.3 Table 1: Average number of updates at different times, wσ = 0.1 Scheme p1 /p2 wσ = 0.0 wσ = 0.2 wσ = 0.4 GPOKS 0.75/0.25 808.2 804.5 804.1 GPOKS-Mean 0.75/0.25 793.9 787.2 769.1 Table 2: The performance of GPOKS variants under different levels of white noise