=Paper=
{{Paper
|id=Vol-1348/maics2013_paper_15
|storemode=property
|title=Gaussian Process Based Optimistic Knapsack Sampling with Applications to Stochastic Resource Allocation
|pdfUrl=https://ceur-ws.org/Vol-1348/maics2013_paper_15.pdf
|volume=Vol-1348
|dblpUrl=https://dblp.org/rec/conf/maics/GlimsdalG13
}}
==Gaussian Process Based Optimistic Knapsack Sampling with Applications to Stochastic Resource Allocation==
Sondre Glimsdal and Ole-Christoffer Granmo
Department of ICT, University of Agder, Norway
{sondre.glimsdal, ole.granmo}@uia.no

===Abstract===
The stochastic non-linear fractional knapsack problem is a challenging optimization problem with numerous applications, including resource allocation. The goal is to find the most valuable mix of materials that fits within a knapsack of fixed capacity. When the value functions of the involved materials are fully known and differentiable, the most valuable mixture can be found by direct application of Lagrange multipliers. However, in many real-world applications, such as web polling, information about material value is uncertain, and in many cases missing altogether. Surprisingly, without prior information about material value, the recently proposed Learning Automata Knapsack Game (LAKG) offers arbitrarily accurate convergence towards the optimal solution, simply by interacting with the knapsack on-line.

This paper introduces Gaussian Process based Optimistic Knapsack Sampling (GPOKS) — a novel model-based reinforcement learning scheme for solving stochastic fractional knapsack problems, founded on Gaussian Process (GP) enabled Optimistic Thompson Sampling (OTS). Not only does this scheme converge significantly faster than LAKG, GPOKS also incorporates GP based learning of the material values themselves, forming the basis for OTS supported balancing between exploration and exploitation. Using resource allocation in web polling as a proof-of-concept application, our empirical results show that GPOKS consistently outperforms LAKG, the current top performer, under a wide variety of parameter settings.

===1 Introduction===
The Internet can be seen as a massive collection of ever-changing information, continuously evolving as web resources are created, edited, deleted, and replaced (Pandey, Ramamritham, & Chakrabarti 2003). Obtaining adequate information from the Internet is crucial for many tasks, including social media analytics, counter terrorism, and business intelligence. It is thus important that the applied search engines and web-monitoring frameworks are able to keep their indexes and caches complete and up-to-date. Achieving this, of course, relies on detecting the changes that the web resources undergo, typically by means of polling.

The problem of balancing polling capacity optimally among web resources, with limited prior information, was essentially unsolved until the Learning Automata Knapsack Game (LAKG) was introduced in 2006 as a generic and adaptive solution to the so-called Stochastic Non-linear Equality Fractional Knapsack (NEFK) Problem (Granmo et al. 2006). Before that, the simplest and perhaps most common polling approach was to allocate the available polling capacity uniformly among the web resources being monitored, polling them all with the same fixed frequency, constrained by the available polling capacity. This uniform polling strategy is clearly sub-optimal, since web resources evolve at different speeds. For slowly changing web resources, a high polling frequency translates into a correspondingly large number of unfruitful polls. Conversely, for quickly evolving web resources, a too low polling frequency leads to potential loss of information or acting on outdated information. In brief, without balancing the allocation of the available polling capacity, wasting resources polling one resource may in turn prevent us from polling another, more attractive resource, thus degrading overall performance.
A two-phase strategy has been proposed to address the latter inefficiency: In the first phase, the uniform strategy is applied, which allows the update probability of the monitored web resources to be estimated. By treating these probability estimates as the true ones, Lagrange multipliers can be applied to find an allocation of capacity that is optimal for the estimated values (Pandey, Ramamritham, & Chakrabarti 2003). However, this method needs an arbitrarily long estimation phase to approach the optimal solution in the second phase. That is, one either has to accept a sub-optimal final solution because the update probability estimates are inaccurate, or one must wait an extensive amount of time until the estimates have become sufficiently accurate, allowing a better solution in the second phase. Also note that evolving update probabilities render the solution found with the latter approach progressively more inaccurate.

This paper introduces Gaussian Process based Optimistic Knapsack Sampling (GPOKS) — a novel scheme for solving stochastic knapsack problems founded on Gaussian Process (GP) (Rasmussen & Williams 2006) based Thompson Sampling (TS) (Thompson 1933; Granmo 2010), enhanced by the principles of Optimistic TS (May et al. 2012). As we shall see, not only does this scheme converge significantly faster than LAKG, GPOKS also incorporates GP based learning of the material unit values themselves, forming the basis for TS based exploration and exploitation. This allows GPOKS to gradually shift from estimation to optimization, starting with pure estimation and converging towards pure optimization.

In (Granmo 2010) we reported a Bayesian technique for solving bandit-like problems, revisiting the Thompson Sampling (Thompson 1933) principle pioneered in 1933. This revisit led to novel schemes for handling multi-armed and dynamic (restless) bandit problems (Granmo & Berg 2010; Gupta, Granmo, & Agrawala 2011a; 2011b), and empirical results demonstrated the advantages of these techniques over established top performers. Furthermore, we provided theoretical results stating that the original technique is instantaneously self-correcting and that it converges to only pulling the optimal arm with probability as close to unity as desired. We now expand this principle to support Thompson Sampling for Stochastic NEFK Problems.
====1.1 Formal Problem Formulation====
In order to appreciate the qualities of the Stochastic NEFK Problem, it is beneficial to view the problem in light of the classical linear Fractional Knapsack (FK) Problem. Indeed, the Stochastic NEFK Problem generalizes the latter problem in two significant ways. Both problems are briefly defined below.

'''The Linear Fractional Knapsack (FK) Problem:''' The linear FK problem is a classical continuous optimization problem which also has applications within the field of resource allocation. The problem involves n materials of different value v_i per unit volume, 1 ≤ i ≤ n, where each material is available in a certain amount x_i ≤ b_i. Let f_i(x_i) denote the value of the amount x_i of material i, i.e., f_i(x_i) = v_i x_i. The problem is to fill a knapsack of fixed volume c with the material mix x = [x_1, ..., x_n] of maximal value Σ_{i=1}^n f_i(x_i) (Black 2004).

'''The Nonlinear Equality FK (NEFK) Problem:''' One important extension of the above classical problem is the Nonlinear Equality FK problem with a separable and concave objective function. The problem can be stated as follows (Kellerer, Pferschy, & Pisinger 2004):

<math>\text{maximize } f(x) = \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

Since the objective function is considered to be concave, the value function f_i(x_i) of each material is also concave. This means that the derivatives of the material value functions f_i(x_i) with respect to x_i (hereafter denoted f_i') are non-increasing. In other words, the material value per unit volume is no longer constant as in the linear case, but decreases with the material amount, and so the optimization problem becomes:

<math>\text{maximize } f(x) = \sum_{i=1}^{n} f_i(x_i), \text{ where } f_i(x_i) = \int_0^{x_i} f_i'(u)\,du, \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

Efficient solutions to the latter problem, based on the principle of Lagrange multipliers, have been devised. In short, the optimal value occurs when the derivatives f_i' of the material value functions are equal, subject to the knapsack constraints (Bretthauer & Shetty 2002):

<math>f_1'(x_1) = \cdots = f_n'(x_n), \quad \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

'''The Stochastic NEFK Problem:''' In this paper we generalize the above nonlinear equality knapsack problem. First of all, we let the material value per unit volume for any x_i be a probability function p_i(x_i). Furthermore, we consider the distribution of p_i(x_i) to be unknown. That is, each time an amount x_i of material i is placed in the knapsack, we are only allowed to observe an instantiation of p_i(x_i) at x_i, and not p_i(x_i) itself.¹ Given this stochastic environment, we intend to devise an on-line incremental scheme that learns the mix of materials of maximal expected value, through a series of informed guesses.

Thus, to clarify issues, we are provided with a knapsack of fixed volume c, which is to be filled with a mix of n different materials. However, unlike the NEFK, in the Stochastic NEFK Problem the unit volume value of a material i, 1 ≤ i ≤ n, is a random quantity — it takes the value 1 with probability p_i(x_i) and the value 0 with probability 1 − p_i(x_i), respectively. As an additional complication, p_i(x_i) is nonlinear in the sense that it decreases monotonically with x_i, i.e., x_{i1} ≤ x_{i2} ⇔ p_i(x_{i1}) ≥ p_i(x_{i2}).

Since unit volume values are random, we operate with expected unit volume values rather than the actual unit volume values. With this understanding, and the above perspective in mind, the expected value of the amount x_i of material i, 1 ≤ i ≤ n, becomes f_i(x_i) = ∫_0^{x_i} p_i(u) du. Accordingly, the expected value per unit volume² of material i becomes f_i'(x_i) = p_i(x_i). In this stochastic and non-linear version of the FK problem, the goal is to fill the knapsack so that the expected value f(x) = Σ_{i=1}^n f_i(x_i) of the material mix contained in the knapsack is maximized. Thus, we aim to:

<math>\text{maximize } f(x) = \sum_{i=1}^{n} f_i(x_i), \text{ where } f_i(x_i) = \int_0^{x_i} p_i(u)\,du \text{ and } p_i(x_i) = f_i'(x_i), \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i \in \{1, \ldots, n\}.</math>

A fascinating property of the above problem is that the amount of information available to the decision maker is limited — the decision maker is only allowed to observe the current unit value of each material (either 0 or 1). That is, each time a material mix is placed in the knapsack, the unit value of each material is provided to the decision maker. The actual outcome probabilities p_i(x_i), 1 ≤ i ≤ n, however, remain unknown. As a result of the latter, the expected value of the material mix must be maximized by means of trial-and-error, i.e., by experimenting with different material mixes and by observing the resulting random unit value outcomes.

¹ For the sake of consistency with previous work on the Stochastic NEFK Problem, we here model stochastic material unit values using Bernoulli trials. However, since GPOKS is based on Gaussian Processes, the central limit theorem opens up for addressing a number of other distributions too. Furthermore, there exist dedicated kernel functions for a variety of distributions.

² We hereafter use f_i'(x_i) to denote the derivative of the expected value function f_i(x_i) with respect to x_i.
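To make the interaction model above concrete, the following sketch implements a minimal Stochastic NEFK environment with Bernoulli feedback. The specific decreasing probability curves p_i(x_i) (exponential decay) are illustrative assumptions of ours, not taken from the paper; only the feedback mechanism follows the formulation.

<syntaxhighlight lang="python">
import random
import math

class StochasticNEFKEnvironment:
    """Bernoulli feedback for a Stochastic NEFK instance: placing an amount x_i of
    material i yields unit value 1 with probability p_i(x_i) and 0 otherwise.
    The p_i curves below are illustrative, monotonically decreasing assumptions."""

    def __init__(self, decay_rates, seed=None):
        self.decay_rates = decay_rates          # one assumed decay rate per material
        self.rng = random.Random(seed)

    def p(self, i, x_i):
        # Assumed unit value probability: decreases monotonically with x_i.
        return math.exp(-self.decay_rates[i] * x_i)

    def observe(self, i, x_i):
        # An instantiation of p_i(x_i): 1 with probability p_i(x_i), else 0.
        return 1 if self.rng.random() < self.p(i, x_i) else 0

# Example: two materials, the second one losing value faster per unit volume.
env = StochasticNEFKEnvironment(decay_rates=[0.5, 2.0], seed=42)
print(env.observe(0, 0.3), env.observe(1, 0.3))
</syntaxhighlight>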
====1.2 Paper Contributions====
The contributions of this paper can be summarized as follows:
# We combine Bayesian modeling with reinforcement learning to provide a novel solution to the Stochastic NEFK Problem.
# We propose the first reinforcement learning scheme that combines Gaussian Processes (Rasmussen & Williams 2006) with Thompson Sampling (Thompson 1933; Granmo 2010).
# We introduce GP based sampling mechanisms in the spirit of Optimistic Thompson Sampling (May et al. 2012) for increased performance.
# The resulting scheme persistently outperforms state-of-the-art approaches when applied to resource allocation in web polling.

These contributions form the first steps towards establishing a new family of reinforcement learning schemes that provide on-line solutions to stochastic versions of classical optimization problems. This is achieved by carefully designing Bayesian models that capture the nature of the optimization problems, applying TS principles to address the exploration/exploitation dilemma in on-line learning and control.

====1.3 Paper Outline====
In Section 2, we present our scheme for Gaussian Process Based Optimistic Knapsack Sampling (GPOKS). We start with a brief introduction to Gaussian Processes, before we propose how Gaussian Processes can enable Thompson Sampling — the current leader when it comes to solving Bernoulli Bandit Problems (Granmo 2010) — for exploration and exploitation when solving on-line Stochastic NEFK problems. Then, in Section 3, we define the web resource allocation polling problem in more detail, following up with an evaluation of GPOKS compared with the state-of-the-art. We conclude in Section 4 and present pointers for further work.

===2 Gaussian Process Based Optimistic Knapsack Sampling (GPOKS)===
The conflict between exploration and exploitation is a well-known problem in reinforcement learning and other areas of artificial intelligence. The multi-armed bandit problem captures the essence of this conflict, and has thus occupied researchers for over fifty years (Wyatt 1997). In brief, an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information.

We are here facing a similar problem; however, instead of seeking the single best material (bandit arm), we need to find a mixture of materials, also referred to as a mixed strategy in Game Theory. Recently, GP optimization has been addressed from a bandit problem perspective (Srinivas et al. 2010), allowing the GP to be explored globally with as few evaluations as possible based on so-called upper confidence bounds. Inspired by the success of GP based optimization, we here propose a novel GP based model for stochastic NEFK problems, where a collection of GPs captures the individual material unit values. Based on the GP collection, Thompson Sampling is applied to sample likely deterministic NEFK problem instances from the GPs. These, in turn, are solved based on Lagrange multipliers, producing a potential solution to the problem at hand.

====2.1 Gaussian Processes based Representation of Material Unit Value====
A Gaussian Process (GP) is a stochastic process that represents a function as a multivariate Gaussian distribution (Rasmussen & Williams 2006). It is specified as a tuple GP = (µ(x), K(·, ·)), where µ(·) is the mean function, typically assigned µ(x) = 0, and K(·, ·) is a kernel that specifies the covariance matrix for the random vector x. In this paper, we use the one-dimensional Squared Exponential kernel (Eq. 1), configured by the hyper-parameters θ = {l, σ_f², σ_n²}:

<math>K(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{(x_p - x_q)^2}{2l^2}\right) + \sigma_n^2 \delta_{pq} \qquad (1)</math>

Here l is the characteristic length-scale parameter that determines how rapidly the correlation should decay as the distance between x_p and x_q increases, σ_f² is the signal variance, and σ_n² is white noise (note that δ_{pq} here denotes the Kronecker delta between x_p and x_q). For further information on GPs we refer to (Rasmussen & Williams 2006).
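As a concrete reading of Eq. (1), the sketch below builds the covariance matrix of a set of material amounts under the Squared Exponential kernel. The default hyper-parameter values mirror θ = {1.0, 1.0, 0.1} used in the experiments of Section 3.2; the function name is our own.

<syntaxhighlight lang="python">
import numpy as np

def se_kernel_matrix(xs, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Covariance matrix for the Squared Exponential kernel of Eq. (1):
    K(x_p, x_q) = sigma_f^2 * exp(-(x_p - x_q)^2 / (2 l^2)) + sigma_n^2 * delta_pq.
    The Kronecker delta term adds the noise variance on the diagonal only."""
    xs = np.asarray(xs, dtype=float)
    sq_dist = (xs[:, None] - xs[None, :]) ** 2
    return signal_var * np.exp(-sq_dist / (2.0 * length_scale ** 2)) \
        + noise_var * np.eye(len(xs))

# Covariance of the unit value at three material amounts.
print(se_kernel_matrix([0.1, 0.2, 0.8]))
</syntaxhighlight>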
By way of example, Figure 1 illustrates how the posterior distribution over possible material unit value functions for a given material i can be represented by means of a GP. The x-axis measures the amount of material, x_i, while the y-axis provides the material unit value f_i'(x_i). The mean and the 95% confidence interval are included, as well as four samples indicating possible candidates for f_i'(x_i). Note that since the Stochastic NEFK problem deals with non-increasing unit value functions f_i'(x_i), we apply Rejection Sampling to sample from the distribution of non-increasing functions. Similarly, "optimistic" sampling, as pioneered by May et al. (May et al. 2012), is realized by rejecting sampled functions that drop below the estimated mean.

Figure 1: Gaussian Process based representation of material unit value.
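The sketch below shows one way to realize the rejection step just described: draw a curve from the GP posterior over a grid of material amounts, and keep it only if it is non-increasing and, for the optimistic variant, never drops below the posterior mean. This is a minimal illustration under our own conventions (posterior supplied as a mean vector and covariance matrix), not the authors' exact implementation.

<syntaxhighlight lang="python">
import numpy as np

def sample_unit_value_curve(mean, cov, optimistic=True, max_tries=1000, rng=None):
    """Rejection-sample a candidate unit value curve f_i'(.) from a GP posterior
    given by a mean vector and covariance matrix over a grid of amounts x_i.
    Accepted curves are non-increasing; with optimistic=True they must also stay
    at or above the posterior mean, mirroring Optimistic Thompson Sampling."""
    rng = rng or np.random.default_rng()
    mean = np.asarray(mean, dtype=float)
    for _ in range(max_tries):
        f = rng.multivariate_normal(mean, cov)
        if np.any(np.diff(f) > 0):            # reject: curve increases somewhere
            continue
        if optimistic and np.any(f < mean):   # reject: curve dips below the mean
            continue
        return f
    return mean  # fallback if no sample is accepted within max_tries
</syntaxhighlight>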
====2.2 Architectural Overview of GPOKS====
Figure 2 provides an architectural overview of our scheme: a collection of GPs (GP_1, ..., GP_n) supplies sampled unit value functions to the DET-KS component, which passes a material mixture M̂ to the Scheduler, which in turn interacts with the Stochastic Environment and routes the resulting feedback v_i back to the GPs. As illustrated in the figure, GPOKS operates as follows (a code sketch of one such iteration is given at the end of this subsection):

Figure 2: GPOKS Architectural Overview.

# A collection of GPs, one Gaussian Process GP_i for each material i, attempts to estimate the material unit value functions f_i'(x_i), 1 ≤ i ≤ n.
# One candidate material unit value function f̂_i'(x_i), 1 ≤ i ≤ n, is then sampled from each GP_i, thus applying the TS principle of sampling functions proportionally to their likelihoods.
# The DET-KS component in the architecture finds the optimal material mixture M̂ = [x_1, ..., x_n] for the sampled material unit value functions f̂_i'(x_i), 1 ≤ i ≤ n, using Lagrange multipliers.
# One of the materials is then selected by the Scheduler component for evaluation, ensuring that each material i is selected with a frequency that is proportional to the amount of material, x_i, assigned by M̂.
# Finally, the Stochastic Environment, i.e., the Stochastic NEFK, samples the true outcome probability function p_i(x_i) at x_i, providing feedback v_i to the corresponding GP_i, which updates its Bayesian estimate of f_i'(x_i).

By following the above steps, our goal is to gradually improve our "best guesses" so that each iteration successively brings us closer to the optimal solution of the targeted Stochastic NEFK problem.
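For readers who want to trace the five steps in code, the sketch below wires the pieces together for a single GPOKS round, reusing se_kernel_matrix and sample_unit_value_curve from the sketches above and the StochasticNEFKEnvironment sketched after Section 1.1. The grid-based DET-KS solver, which bisects on the common derivative level (the Lagrange multiplier), is our own simplification; the helper names, hyper-parameter values, and discretization are assumptions, not the paper's implementation.

<syntaxhighlight lang="python">
import numpy as np

def cross_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Noise-free Squared Exponential cross-covariance between two sets of amounts."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return signal_var * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length_scale ** 2))

def det_ks(curves, grid, c):
    """Discretized DET-KS sketch: equalize the sampled unit value curves at a common
    level (the Lagrange multiplier) found by bisection, subject to sum(x) = c."""
    curves, grid = [np.asarray(f) for f in curves], np.asarray(grid)
    def alloc(level):
        x = []
        for f in curves:
            idx = np.where(f >= level)[0]
            x.append(grid[idx[-1]] if len(idx) else grid[0])
        return np.array(x)
    lo, hi = min(f.min() for f in curves), max(f.max() for f in curves)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if alloc(mid).sum() > c:
            lo = mid                      # allocation too big: raise the level
        else:
            hi = mid
    x = alloc(hi)
    if x.sum() == 0.0:
        x = np.ones(len(curves))
    return c * x / x.sum()                # rescale so the capacity is met exactly

def gpoks_iteration(env, observations, grid, c=1.0, noise_var=0.1, rng=None):
    """One GPOKS round (steps 1-5 of the overview). observations[i] holds the
    (x_i, v_i) feedback pairs gathered so far for material i; env.observe(i, x)
    returns the Bernoulli unit value, as in the environment sketch of Section 1.1."""
    rng = rng or np.random.default_rng()
    grid = np.asarray(grid, dtype=float)
    curves = []
    for i in range(len(observations)):
        xs = np.array([x for x, _ in observations[i]], dtype=float)
        vs = np.array([v for _, v in observations[i]], dtype=float)
        prior = cross_kernel(grid, grid)
        if len(xs) == 0:                  # no feedback yet: fall back to the GP prior
            mean, cov = np.zeros(len(grid)), prior
        else:                             # step 1: standard GP regression posterior
            K = cross_kernel(xs, xs) + noise_var * np.eye(len(xs))
            Ks = cross_kernel(xs, grid)
            sol = np.linalg.solve(K, Ks)
            mean, cov = sol.T @ vs, prior - Ks.T @ sol
        cov = cov + 1e-9 * np.eye(len(grid))   # jitter for numerical stability
        # Step 2: Thompson-sample a non-increasing, optimistic candidate curve.
        curves.append(sample_unit_value_curve(mean, cov, rng=rng))
    x_hat = det_ks(curves, grid, c)       # step 3: solve the sampled instance
    i = rng.choice(len(x_hat), p=x_hat / x_hat.sum())   # step 4: schedule a material
    v = env.observe(i, x_hat[i])          # step 5: Bernoulli feedback from the environment
    observations[i].append((x_hat[i], v))
    return x_hat
</syntaxhighlight>

A toy run under these assumptions would create env = StochasticNEFKEnvironment([0.5, 2.0], seed=1), observations = [[], []], grid = np.linspace(0.02, 1.0, 25), and call gpoks_iteration(env, observations, grid) repeatedly; the returned allocations should then concentrate around the mix where the two estimated unit value curves meet.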
====2.3 Example Steps====
Figures 3 and 4 show the GP based estimates for the unit value of two materials, f_1'(x_1) and f_2'(x_2), after only 5 material value observations. As can be seen, uncertainty about the material unit value functions is significant, and the estimated optimal material amounts M̂ = [x̂_1, x̂_2] are far from the optimal amounts M = [x_1, x_2].

Figure 3: Estimate of material unit value f_1'(x_1) after 7 observations, with optimal and estimated material amounts x_1.

Figure 4: Estimate of material unit value f_2'(x_2) after 7 observations, with optimal and estimated material amounts x_2.

However, after 193 iterations of the GPOKS algorithm, we observe a number of fascinating properties in Figure 5. First of all, the Bayesian estimates of the material unit values, f_1'(x_1) and f_2'(x_2), have become more accurate. Furthermore, we observe that the estimated optimal material mixture is now much closer to the optimal mixture. Finally, observe that the uncertainty concerning f_1'(x_1) and f_2'(x_2) varies with x_1 and x_2. The beauty of Thompson Sampling is that the observations are collected with gradually increasing exploitation, zooming in on the areas that are most likely to contain the optimal material mixture.

Figure 5: Estimate of material unit values f_1'(x_1) and f_2'(x_2) after 193 observations, with optimal and estimated material amounts x_1 and x_2.

===3 Application: Web Polling===
Having obtained a solution to the model in which we cast the NEFK, we shall now demonstrate how we can utilize this solution for the current problem being studied, namely, the optimal web-polling problem.

Web resource monitoring consists of repeatedly polling a selection of web resources so that the user can detect changes that occur over time. Clearly, as this task can be prohibitively expensive, in practical applications the system imposes a constraint on the maximum number of web resources that can be polled per time unit. This bound is dictated by the governing communication bandwidth and by the speed limitations associated with the processing. Since only a fraction of the web resources can be polled within a given unit of time, the problem which the system's analyst encounters is one of determining which web resources are to be polled. In such cases, a reasonable choice of action is to choose web resources in a manner that maximizes the number of changes detected, and the optimal allocation of the resources involves trial-and-error. As illustrated in Figure 6, web resources may change with varying frequencies (that are unknown to the decision maker), and changes appear more or less randomly. Furthermore, as argued elsewhere (Granmo & Oommen 2006; Granmo et al. 2006; 2007), the probability that an individual web resource poll uncovers a change on its own decreases monotonically with the polling frequency used for that web resource.

Figure 6: Web resource changes occurring over time. An 'x' on the time-lines denotes that the respective web resource has changed.

Although several nonlinear criterion functions for measuring web monitoring performance have been proposed in the literature (e.g., see (Pandey, Ramamritham, & Chakrabarti 2003; Wolf et al. 2002)), from a broader viewpoint they are mainly built around the basic concept of update detection probability, i.e., the probability that polling a web resource results in new information being discovered. Therefore, for the purpose of conceptual clarity, we will use the update detection probability as the token of interest in this paper.

To further define our notion of web monitoring performance, we consider time to be discrete, with the time interval length T as the atomic unit of decision making. In each time interval every single web resource i has a constant probability q_i of remaining unchanged. Furthermore, when a web resource is updated/changed, the update is available for detection only until the web resource is updated again. After that, the original update is considered lost. For instance, each time a newspaper web resource is updated, previous news items are replaced by the most recent ones.

In the following, we will denote the update detection probability of a web resource i as d_i. Under the above conditions, d_i depends on the frequency x_i with which the resource is polled, and is modeled using the following expression:

<math>d_i(x_i) = 1 - q_i^{1/x_i}.</math>

By way of example, consider the scenario that a web resource remains unchanged in any single time step with probability 0.5. Then polling the web resource uncovers new information with probability 1 − 0.5³ = 0.875 if the web resource is polled every 3rd time step (i.e., with frequency 1/3), and 1 − 0.5² = 0.75 if the web resource is polled every 2nd time step. As seen, increasing the polling frequency reduces the probability of discovering new information on each polling.

Given the above considerations, our aim is to find the resource polling frequencies x that maximize the expected number of pollings uncovering new information per time step:

<math>\text{maximize } \sum_{i=1}^{n} x_i \, d_i(x_i) \quad \text{subject to } \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i = 1, \ldots, n.</math>
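As a quick numerical check of this objective, the snippet below evaluates the worked example above and the uniform allocation for the q_1 = 0.9 / q_2 = 0.1 pair studied in Section 3.2; the uniform value of roughly 0.59 detected updates per time step is consistent with the Uniform row of Table 1 (590 updates over 1000 steps). The function names are ours.

<syntaxhighlight lang="python">
def detection_prob(q, x):
    """Update detection probability d_i(x_i) = 1 - q_i ** (1 / x_i)."""
    return 1.0 - q ** (1.0 / x)

def expected_detections_per_step(qs, xs):
    """Objective: sum of x_i * d_i(x_i) over all web resources."""
    return sum(x * detection_prob(q, x) for q, x in zip(qs, xs))

print(detection_prob(0.5, 1.0 / 3.0))                        # 0.875, the worked example
print(detection_prob(0.5, 0.5))                              # 0.75
print(expected_detections_per_step([0.9, 0.1], [0.5, 0.5]))  # about 0.59 per step
</syntaxhighlight>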
====3.1 GPOKS Solution====
In order to find a solution to the above problem, we must define the Stochastic Environment that GPOKS is to interact with. As seen in Section 2, the Stochastic Environment consists of the unit volume value functions {f_1'(x_1), f_2'(x_2), ..., f_n'(x_n)}, which are unknown to GPOKS. We identify the nature of these functions by applying the principle of Lagrange multipliers to the above maximization problem. In short, after some simplification, it can be seen that the following conditions characterize the optimal solution:

<math>d_1(x_1) = d_2(x_2) = \cdots = d_n(x_n), \quad \sum_{i=1}^{n} x_i = c \text{ and } x_i \ge 0 \text{ for all } i = 1, \ldots, n.</math>

Since we are not able to observe d_i(x_i) or q_i directly, we base our definition of {f_1'(x_1), f_2'(x_2), ..., f_n'(x_n)} on the result of polling web resources. Briefly stated, we want f_i'(x_i) to instantiate to the value 0 with probability 1 − d_i(x_i) and to the value 1 with probability d_i(x_i). Accordingly, if web resource i is polled and i has been updated since our last polling, then we consider f_i'(x_i) to have been instantiated to 1. And, if web resource i is unchanged, we consider f_i'(x_i) to have been instantiated to 0.
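Under the stated optimality conditions, the allocation can be computed directly when the q_i are known, which is what the Optimal baseline in Section 3.2 relies on. The sketch below inverts d_i(x_i) = 1 − q_i^(1/x_i) and bisects on the common detection probability; for q_1 = 0.9, q_2 = 0.1 and c = 1 it returns roughly x ≈ [0.044, 0.956] with d ≈ 0.91, in line with the Optimal row of Table 1 (909.9 detected updates over 1000 steps). The function name and the bisection tolerances are our own choices.

<syntaxhighlight lang="python">
import math

def equalize_detection_probabilities(qs, c=1.0, iterations=200):
    """Allocation with d_1(x_1) = ... = d_n(x_n) and sum(x) = c, where
    d_i(x) = 1 - q_i ** (1 / x).  Inverting d gives x = ln(q_i) / ln(1 - d), and
    the common level d is found by bisection (a sketch of the known-q optimum)."""
    def x_for(d, q):
        return math.log(q) / math.log(1.0 - d)
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iterations):
        d = 0.5 * (lo + hi)
        if sum(x_for(d, q) for q in qs) > c:
            lo = d            # total allocation exceeds capacity: demand a higher d
        else:
            hi = d
    d = 0.5 * (lo + hi)
    return [x_for(d, q) for q in qs], d

xs, d = equalize_detection_probabilities([0.9, 0.1])
print([round(x, 3) for x in xs], round(d, 3))   # approximately [0.044, 0.956] 0.91
</syntaxhighlight>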
====3.2 Empirical Results====
In this section we evaluate GPOKS and compare its performance with the currently best performing algorithm, LAKG. While H-TRAA possesses better scalability than LAKG (Granmo & Oommen 2010), for two-material problems their performance is identical, because the hierarchical setup of H-TRAA does not come into play. For clarification we will also include some promising variants of GPOKS. Here follows an overview of the policies that we have investigated:

'''Uniform:''' The uniform policy allocates monitoring resources uniformly across all web resources. This classical policy can, of course, be applied directly in an unknown environment.

'''LAKG:''' The LAKG scheme is basically a game between so-called Learning Automata (Narendra & Thathachar 1989). They start off from a uniform policy and gradually improve toward the optimal configuration through a sequence of small jumps across a discretized search space. In all our experiments the resolution of LAKG is set to 100 states.

'''Optimal:''' This policy requires that update frequencies are known, and finds the optimal solution based on the principle of Lagrange multipliers (Pandey, Ramamritham, & Chakrabarti 2003; Wolf et al. 2002).

'''GPOKS-Mean:''' To highlight the advantage of our Optimistic Thompson Sampling approach, we also test a simpler scheme where we use the mean of the GPs when estimating the optimal solution, rather than sampling functions from the GPs.

We have conducted numerous experiments using various configurations, such as different noise parameters and update probabilities. Here, we present a representative subset of these, as they all show the same trend. Performance is measured as the average accumulated number of web resource updates found.

For these experiments, we used an ensemble of 1000 independent replications, each random generator seeded with a unique number, to maximize the precision of the reported results. In order to provide a robust overview of the performance of GPOKS, we investigated three radically different update probability configurations for web resource pairs. In the first one, q_1 = 0.9 / q_2 = 0.1, one web resource is updated significantly more often than the other. A more moderate version of the latter configuration, q_1 = 0.75 / q_2 = 0.25, was also investigated. Furthermore, we measured performance when the two web resources have almost equal update probability, q_1 = 0.55 / q_2 = 0.45. Finally, we also investigated the robustness of GPOKS by adding an increasing amount of white noise, w_σ, to the feedback given to GPOKS. Note that, for the sake of fairness, we applied the same kernel hyper-parameters, θ = {1.0, 1.0, 0.1}, for all the GP based strategies, without further optimization.

Table 1 reports the performance of the different policies (note that all of the setups apply a small degree of white noise, w_σ = 0.1). As can be seen, GPOKS clearly outperforms LAKG when facing the q_1 = 0.9 / q_2 = 0.1 configuration, with GPOKS detecting on average approximately 28 more updates than LAKG over 1000 time steps. Also note how remarkably close GPOKS gets to the optimal performance, missing on average merely 7 web resource updates over 1000 time steps. We observe similar results for the q_1 = 0.75 / q_2 = 0.25 configuration. Finally, for the q_1 = 0.55 / q_2 = 0.45 configuration, we observe that the performance of LAKG and GPOKS becomes more similar. This can be explained by the prior bias of LAKG, starting from a uniform allocation of resources. This gives LAKG an advantage over GPOKS, which is largely unbiased when it comes to prior belief about update probabilities.

{| class="wikitable"
|+ Table 1: Average number of updates (Avg[#Updates]) at different times t, w_σ = 0.1
|-
! Scheme !! q_1 / q_2 !! t = 10 !! t = 100 !! t = 1000
|-
| Optimal || 0.90/0.10 || 9.1 || 91.0 || 909.9
|-
| Uniform || 0.90/0.10 || 5.9 || 59.0 || 590.0
|-
| LAKG || 0.90/0.10 || 6.0 || 71.6 || 874.9
|-
| GPOKS || 0.90/0.10 || 8.0 || 88.9 || 903.0
|-
| GPOKS-Mean || 0.90/0.10 || 8.5 || 89.7 || 902.9
|-
| Optimal || 0.75/0.25 || 8.1 || 81.2 || 812.5
|-
| Uniform || 0.75/0.25 || 6.9 || 68.8 || 687.5
|-
| LAKG || 0.75/0.25 || 6.9 || 74.1 || 793.1
|-
| GPOKS || 0.75/0.25 || 7.4 || 78.8 || 807.9
|-
| GPOKS-Mean || 0.75/0.25 || 6.6 || 69.6 || 792.2
|-
| Optimal || 0.55/0.45 || 7.5 || 75.2 || 752.5
|-
| Uniform || 0.55/0.45 || 7.5 || 74.8 || 747.5
|-
| LAKG || 0.55/0.45 || 7.5 || 74.8 || 749.8
|-
| GPOKS || 0.55/0.45 || 7.0 || 73.5 || 749.4
|-
| GPOKS-Mean || 0.55/0.45 || 5.4 || 52.8 || 725.3
|}

Finally, notice the performance loss caused by using the mean of the GPs (GPOKS-Mean) instead of TS. This trend is further explored in Table 2, where we increase the amount of white noise affecting the feedback. We then observe that GPOKS is surprisingly robust towards noisy feedback compared to GPOKS-Mean. This can be explained by the greedy nature of GPOKS-Mean, which is less inclined to explore the space of functions encompassed by the GPs, thus being more easily misled by noise.

{| class="wikitable"
|+ Table 2: The performance of GPOKS variants under different levels of white noise
|-
! Scheme !! q_1 / q_2 !! w_σ = 0.0 !! w_σ = 0.2 !! w_σ = 0.4
|-
| GPOKS || 0.75/0.25 || 808.2 || 804.5 || 804.1
|-
| GPOKS-Mean || 0.75/0.25 || 793.9 || 787.2 || 769.1
|}
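The paper does not spell out exactly how the white noise w_σ enters the feedback; a natural reading, sketched below purely as an assumption of ours, is that the 0/1 poll outcome is perturbed by zero-mean Gaussian noise of standard deviation w_σ before it reaches the GPs.

<syntaxhighlight lang="python">
import random

def noisy_poll_feedback(changed, w_sigma, rng=random):
    """Assumed noise model: the 0/1 poll outcome (1 if the web resource changed
    since the last poll) is perturbed by zero-mean Gaussian noise with standard
    deviation w_sigma before being handed to the GP as feedback v_i."""
    return (1.0 if changed else 0.0) + rng.gauss(0.0, w_sigma)

print(noisy_poll_feedback(True, 0.2))
</syntaxhighlight>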
Op- a wide variety of parameter settings. timistic bayesian sampling in contextual-bandit problems. In our further work, we will address games of interacting J. Mach. Learn. Res. 8:2069–2106. GPOKS for solving networked and hierarchical resource al- Narendra, K. S., and Thathachar, M. A. L. 1989. Learning location problems. Furthermore, we are investigating tech- Automata: An Introduction. Prentice Hall. niques for decomposing the GP calculations for increased computational performance. Pandey, S.; Ramamritham, K.; and Chakrabarti, S. 2003. Monitoring the Dynamic Web to Respond to Continuous Queries. In 12th International World Wide Web Confer- References ence, 659–668. ACM Press. Black, P. E. 2004. Fractional knapsack problem. Dictio- Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian nary of Algorithms and Data Structures. Processes for Machine Learning. The MIT Press. Bretthauer, K. M., and Shetty, B. 2002. The Nonlinear Srinivas N., Krause A., K. S., and M., S. 2010. Gaussian Knapsack Problem — Algorithms and Applications. Euro- process optimization in the bandit setting: No regret and pean Journal of Operational Research 138:459–472. experimental design. In Fürnkranz, J., and Joachims, T., Granmo, O.-C., and Berg, S. 2010. Solving Non- eds., Proceedings of the 27th International Conference on Stationary Bandit Problems by Random Sampling from Machine Learning (ICML-10), 1015–1022. Haifa, Israel: Sibling Kalman Filters. In Proceedings of the Twenty Third Omnipress. International Conference on Industrial, Engineering, and Thompson, W. R. 1933. On the likelihood that one un- Other Applications of Applied Intelligent Systems (IEA- known probability exceeds another in view of the evidence AIE 2010), 199–208. Springer. of two samples. Biometrika 25:285–294. Granmo, O.-C., and Oommen, B. J. 2006. On Al- Wolf, J. L.; Squillante, M. S.; Sethuraman, J.; and Ozsen, locating Limited Sampling Resources Using a Learning K. 2002. Optimal Crawling Strategies for Web Search En- Automata-based Solution to the Fractional Knapsack Prob- gines. In 11th International World Wide Web Conference, lem. In Proceedings of the 2006 International Intelli- 136–147. ACM Press. gent Information Processing and Web Mining Conference Wyatt, J. 1997. Exploration and Inference in Learning (IIS:IIPW’06), Advances in Soft Computing. Springer. from Reinforcement. Ph.D. Dissertation, University of Ed- Granmo, O.-C., and Oommen, B. J. 2010. Solving Stochas- inburgh. tic Nonlinear Resource Allocation Problems Using a Hier- Scheme p1 /p2 Avg[#Updates] t=10 Avg[#Updates] t=100 Avg[#Updates] t=1000 Optimal 0.90/0.10 9.1 91.0 909.9 Uniform 0.90/0.10 5.9 59.0 590.0 LAKG 0.90/0.10 6.0 71.6 874.9 GPOKS 0.90/0.10 8.0 88.9 903.0 GPOKS-Mean 0.90/0.10 8.5 89.7 902.9 Optimal 0.75/0.25 8.1 81.2 812.5 Uniform 0.75/0.25 6.9 68.8 687.5 LAKG 0.75/0.25 6.9 74.1 793.1 GPOKS 0.75/0.25 7.4 78.8 807.9 GPOKS-Mean 0.75/0.25 6.6 69.6 792.2 Optimal 0.55/0.45 7.5 75.2 752.5 Uniform 0.55/0.45 7.5 74.8 747.5 LAKG 0.55/0.45 7.5 74.8 749.8 GPOKS 0.55/0.45 7.0 73.5 749.4 GPOKS-Mean 0.55/0.45 5.4 52.8 725.3 Table 1: Average number of updates at different times, wσ = 0.1 Scheme p1 /p2 wσ = 0.0 wσ = 0.2 wσ = 0.4 GPOKS 0.75/0.25 808.2 804.5 804.1 GPOKS-Mean 0.75/0.25 793.9 787.2 769.1 Table 2: The performance of GPOKS variants under different levels of white noise