ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 93–101 http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, c 2016 H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker Reinforcement Learning for Automatic Online Algorithm Selection - an Empirical Study Hans Degroote1 , Bernd Bischl2 , Lars Kotthoff3 , and Patrick De Causmaecker4 1 KU Leuven, Department of Computer Science, CODeS & iMinds-ITEC, Belgium ❤❛♥s✳❞❡❣r♦♦t❡❅❦✉❧❡✉✈❡♥✳❜❡ 2 Bernd Bischl, Department of Statistics, LMU Munich 3 University of British Columbia, Department of Computer Science 4 KU Leuven, Department of Computer Science, CODeS & iMinds-ITEC, Belgium Abstract: In this paper a reinforcement learning methodol- The predictive model predicts for each instance which al- ogy for automatic online algorithm selection is introduced gorithm is likely to be best. The model is created based on and empirically tested. It is applicable to automatic algo- characteristics of the problem under consideration. These rithm selection methods that predict the performance of characteristics are called problem features. The idea is that each available algorithm and then pick the best one. The their value should be correlated with how hard a problem experiments confirm the usefulness of the methodology: is for a certain algorithm. using online data results in better performance. Automatic algorithm selection methods are metaheuris- As in many online learning settings an exploration vs. tics in the sense that they are general problem-independent exploitation trade-off, synonymously learning vs. earning strategies with a different implementation depending on trade-off, is incurred. Empirically investigating the qual- the specific problem being considered. The difference in ity of classic solution strategies for handling this trade-off implementation manifests itself in the choice of features, in the automatic online algorithm selection setting is the which are distinct for every problem. For example, the ra- secondary goal of this paper. tio of the amount of clauses over the amount of variables The automatic online algorithm selection problem can is relevant for satisfiability problems but makes no sense be modelled as a contextual multi-armed bandit problem. for graph colouring or scheduling problems. Two classic strategies for solving this problem are tested After the training phase the decision model remains in the context of automatic online algorithm selection: fixed in automatic offline algorithm selection. To decide ε-greedy and lower confidence bound. The experiments which algorithm to use to solve a new instance with, its show that a simple purely exploitative greedy strategy out- features are calculated and input into the decision model performs strategies explicitly performing exploration. which in turn returns an algorithm. This algorithm is then used to solve the instance with. 1 Introduction The observation motivating this research is that new performance data keeps being generated after the training The problem considered in this paper is automatic algo- phase every time the predictive model is used to predict rithm selection. The field of algorithm selection is mo- the best algorithm for a new instance. This data is freely tivated by the observation that a unique best algorithm available yet not used by automatic offline algorithm selec- rarely exists for many problems. Which algorithm is best tion methods. The main research question of this paper is: depends on the specific problem instance being solved. “Can online performance data be used to improve the pre- This can be illustrated by looking at the results of past dictive model underlying automatic algorithm selection?". SAT-competitions1 . There are always problem instances An interesting challenge faced in automatic online al- that are not solved by the winning algorithm (the overall gorithm selection is finding a balance between learning a best) but that other algorithms manage to solve [25]. good predictive model and making good predictions. Se- The complementarities between different algorithms lecting a predicted non-best algorithm might be better in can be leveraged through algorithm portfolios [13]. In- the long run because the information thus obtained results stead of using a single algorithm to solve a set of prob- in a better model and more accurate predictions for fu- lems, several algorithms are combined in a portfolio and ture instances, but it negatively affects the expected per- a method of selecting the most appropriate algorithm for formance on the current instance. This challenge is an each problem instance at hand is used. This selection pro- example of the exploration vs. exploitation trade-off of- cess is called automatic algorithm selection. ten faced in reinforcement learning. It is also called the State of the art approaches for automatic offline algo- learning vs. earning trade-off. rithm selection use machine learning techniques to create The automatic online algorithm selection problem can a predictive model based on large amounts of training data. be modelled as a multi-armed bandit problem, more specifically as a multi-armed bandit problem with covari- 1 http://satcompetition.org/ ates, also known as the contextual multi-armed bandit 94 H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker problem, as for each problem instance the values of a num- ber of problem characteristics are known. Two basic clas- sic strategies for solving the contextual multi-armed ban- dit problem that incorporate explicit exploration are tested and compared to the purely exploitative approach. The remainder of this paper is structured as follows. In section 2 the automatic online algorithm selection problem is defined. First the classic automatic offline algorithm se- lection problem is discussed, then the methodology for au- Figure 1: Rice’s model for algorithm selection’ tomatic online algorithm selection is presented after which the contextual multi-armed bandit problem is introduced and is shown how automatic online algorithm selection The aim is to find the feature mapping and selection can be modelled as a contextual multi-armed bandit prob- mapping that optimise the average performance. The fea- lem. In section 3 related work is discussed. The exper- tures of an instance are given, so the only leeway there is imental setting and results are presented in section 4. In which features to consider. Limitations on the possible se- section 5 some remarks about the introduced methodol- lection mappings can be imposed by the method used to ogy are made and the experimental results are discussed. create it. Future work is also discussed in section 5. The paper con- Identifying descriptive features is a time consuming cludes in section 6 process. Luckily large amounts of features have already been proposed in literature for many interesting problems. For example in [22] an overview is given of features for 2 Automatic Online Algorithm Selection the satisfiability problem and in [21] for the multi-mode 2.1 Automatic Algorithm Selection resource-constrained project scheduling problem. The automatic in automatic algorithm selection refers to Rice’s paper "The algorithm selection problem" [23] for- the way decision models are made: automatically. Super- mally introduced the algorithm selection problem. The vised learning techniques are typically used. fundamental characteristics of the problem remain un- Two broad classes of techniques can be identified. In changed up to now. In the most basic scenario identified the first fall classification-based techniques: the decision by Rice the problem is characterised by a set of instances, model directly predicts which algorithm will be best for a set of algorithms and a (set of) performance measure(s) an instance based on its features. No information about and by two mappings between these sets: a selection map- the actual quality of the algorithm is communicated. Note ping and a performance mapping. The selection mapping that in general this is not a binary but a multi-class clas- maps instances to algorithms and the performance map- sification problem, as each instance is classified as being ping maps algorithm-instance pairs to their performance- best-solved by one of an arbitrary amount of algorithms. measure(s). A typical formulation of the objective of auto- In [20] for example, the k-nearest-neighbours method is matic algorithm selection is to find the selection mapping used. Another example of the use of k-nearest-neighbours that results in the best average performance. can be found in [6], where the more complicated problem It is up to the user to identify a sensible performance of ranking algorithms (as opposed to only predicting the measure. In this paper only single-objective problems are best) is considered. Another option is to use decision trees considered. See [9] for a more formal description of what or their more powerful relative random forest, as in the characterises an acceptable performance measure for the latest version of Satzilla [26], an algorithm selector for the research in this paper. Each performance measure with satisfiability problem. totally ordered values is definitely acceptable. Misclassification is cost-sensitive in automatic algo- Rice acknowledges the need for a set of features in prac- rithm selection: classifying an instance incorrectly as be- tical applications and extends his model with this set. The ing best solved by a horrendous algorithms is worse than full model is visualised in figure 1. Note that the selection classifying it as being best solved by an algorithm only mapping now maps values from the feature space instead marginally worse than the best one. A classification-based of directly from the instance space. automatic algorithm selection technique should take this To formalise the problem statement, let Q be a prob- cost-sensitivity into account, as argued in [5]. ability distribution on the instance set I . Let f be the The second class of automatic algorithm selection tech- feature mapping (mapping an instance to a feature vec- niques consists of regression-based techniques. A regres- tor), s the selection mapping (mapping a feature vector to sion model is created for each algorithm, predicting its an algorithm) and p the performance mapping (mapping performance in function of the problem features. The al- an instance-algorithm combination to a performance mea- gorithm with the best predicted performance is selected sure). The average performance of a selection mapping to solve a new instance with. An overview of such tech- can now be defined as: niques can be found in [14]. A recent approach is de- EQ [(p(s( f (i)), i)] (1) scribed in [10]. Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study 95 Note that algorithm selection itself is a cost-sensitive mapping that optimise the average performance as defined classification problem: the goal is to classify instances in equation 1. as belonging to the algorithm that best solves them. The In the setting of this paper the instance set is defined distinction between classification and regression methods by a fixed set of benchmark instances and the distribution refers to how this classification problem is solved behind is uniform. The feature mapping is defined by consid- the scenes. ering all features available for the benchmark instances. A thorough overview of algorithm selection methodol- The problem of selecting the most informative features is ogy can be found in [24] and more recently in [17]. not considered: the feature mapping is fixed. A selection Both classes of automatic algorithm selection methods mapping is defined by considering a regression model for use the same kind of input to initialise their decision mod- each algorithm and selecting an algorithm in function of els: performance data of all algorithms on a set of train- these predicted values. The most straightforward selection ing instances. For the classification-based techniques it is mapping is to select the algorithm with the predicted best strictly necessary for the performance of all algorithms to performance. This and other options are discussed in sec- be available for each instance. Otherwise it is not possi- tion 2.4. The performance mapping used is discussed in ble to say which algorithm is best for the instance. This is section 4.1. not the case for regression-based techniques. As long as In the offline setting the selection mapping remains each algorithm’s model has access to datapoints it can be fixed. However, in the online setting it changes over time initialised, it is not necessary to know the performance of as more instances are solved. The selection at each point in each algorithm on each training instance. time depends explicitly on earlier selections. For this rea- son equation 1 cannot be used directly to formally define a general problem statement for automatic online algorithm 2.2 Solution Strategy for Automatic Online selection. Algorithm Selection In the empirical setting of this paper the quality of a so- lution to the automatic online algorithm selection problem During the online phase performance data is generated is measured as its average performance on a time-ordered every time a new instance is solved. This performance set of instances, as presented during the online phase. The data consists of the performance of the selected algorithm empirical performance measurement process is explained on the new instance. The performance data of the other in more detail in section 4 where the experimental setting algorithms on the new instance is not available. Since and results are described. this type of data can only be processed by the regression- based methods, the proposed methodology will be limited to regression-based techniques. 2.4 Contextual Multi-armed Bandits The methodology for automatic online algorithm selec- In the standard multi-armed bandit a gambler has access tion is the following. During the offline training phase an to a set of slot machines (bandits) and must decide on a initial regression model is trained for each algorithm , us- strategy in which order to pull their arms. His goal is ing training data consisting of algorithm performance on to realise as much profit as possible. Each time an arm instances described by feature values. During the online is pulled the gambler receives a random reward sampled phase the algorithm to solve the first online instance with from a distribution belonging to the selected arm. Initially is selected based on the models created during the training all distributions are unknown, but as the gambler gambles phase. The model of the selected algorithm is retrained on he obtains more information about the distributions of with the new datapoint. The performance of all other al- the available arms and can make more informed choices. gorithms on the instance remains unknown. The algorithm The central dilemma faced by the gambler is whether to solve the second online instance with is selected based to keep pulling the arm proven to be best so far or to try on the models created after having solved the first instance, another arm about which little is known and that might thus one of the models has been updated to incorporate the be better. If the other arm turns out to be more profitable performance information about the first online instance. never having explored its potential further would have lost Selection for the third online instance is influenced by the the gambler a lot of money. two previous etc. As more instances are solved, more dat- See [1] for a formal definition of the multi-armed ban- apoints are gathered and the models are expected to im- dit problem. In this paper a number of policies for pulling prove, which in turn is expected to result in better algo- arms are analysed in terms of how fast the total profit di- rithm selection. verges from the maximal profit in function of the total amount of pulls. 2.3 Automatic Online Algorithm Selection Problem The contextual multi-armed bandit problem generalises Statement the multi-armed bandit problem. To stay within the metaphor: before pulling an arm the gambler sees a con- As discussed in section 2.1, the goal of automatic algo- text vector. This context vector contains values for pre- rithm selection is to find the feature mapping and selection defined properties that describe the current situation. In 96 H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker the contextual multi-armed bandit problem the reward of selection problem where one has to select one algorithm each arm depends on the context. As in the classic multi- from a limited pre-defined set of algorithms to solve an in- armed bandit problem the gambler’s goal is to maximize stance with, has not yet been investigated. In the remain- his profit, but in order to do so he has to learn how the der of this section some related research is discussed and context vector relates to the rewards. is mentioned how it differs from this work. The automatic online algorithm selection problem is a In [11] multi-armed bandit methodology is applied to contextual multi-armed bandit problem. Each algorithm is the online learning of dynamic algorithm portfolios. Their an arm and pulling an arm is the equivalent of selecting goal differs from this paper’s. They want to learn for each an algorithm. When selecting an algorithm for an instance instance a separate algorithm portfolio while the goal here its feature values are known, which is the equivalent of is to predict one algorithm to solve the instance with. In having shown a context vector. Maximizing profit in this an algorithm portfolio a bunch of algorithms are run si- context boils down to minimizing the performance differ- multaneously. The dynamic goal is to learn the optimal ence between the selected algorithm and the actual best assignment of time slices to algorithms while the portfolio algorithm. is in use. This paper’s setting is not dynamic in this sense. A number of solution strategies for the contextual ban- Once an algorithm has been selected to solve an instance dit have been introduced and analysed in literature, such with this decision will not be come back on, even if the as LinUCB [7] where the reward is assumed to linearly algorithm appears to perform poorly on the instance. depend on the feature vector. However, for the prelimi- In [8] a new multi-armed bandit model is proposed and nary research presented in this paper three straightforward applied to search heuristic selection, a kind of algorithm and simple strategies have been implemented. selection. However, their objective differs from this pa- The first strategy that is considered is the greedy strat- per’s. In terms of algorithm selection they have access to a egy. The greedy strategy does not perform any explicit ex- number of stochastic algorithms and a budget of N trials. ploration: it always selects the algorithm that is predicted The goal is to find an as good as possible solution for one to be best. instance within the budget of N trials, whereas this paper’s The second strategy that is considered is the ε-greedy goal is to find an as good as possible solution on average strategy, which is parametrised by a value ε between 0 over many instances, each with a budget of one trial. Their and 1. The strategy is equivalent with the simple greedy stochasticity is caused by the algorithms but the instance strategy with probability (1 − ε) and selects a random al- remains fixed. In this paper stochasticity is also caused gorithm with probability ε. by the instances as at each point in time a new random The third strategy is the is the UCB strategy, short for instance is solved. upper confidence bound. It is parametrised by parameter λ In [12] a notion of online algorithm selection is intro- (with λ ≥ 0). The UCB strategy consists of calculating for duced for decision problems. They focus on the problem each algorithm its predicted performance p and the stan- of deciding how to distribute time shares over the set of dard error on this prediction e. The algorithm with highest available algorithms and make this decision on an instance value for p + e ∗ λ is selected. Since algorithms for which per instance basis. They model the problem on two levels. few datapoints exist typically have high variance on their On the upper level they use bandit methodology to decide predictions, an algorithm with predicted poor performance which time allocator to use (choosing from a uniform al- might be preferred over an algorithm with decent perfor- locator and various dynamic allocators) and on the lower mance, depending on the performance difference, variance level the algorithms are run in parallel (or simulated to run sizes and the value of λ . A higher λ -value results in more in parallel) according to the time shares predicted by the exploration. allocator selected on the higher level. Thus the arms of The equivalent of the UCB strategy for minimisation their bandit problem are ’time allocators’ and not algo- problems is the LCB strategy, short for lower confidence rithms. bound. Its selection rule is: p − e ∗ λ . Unlike the two previous strategies, which only rely on a 4 Experiments predicted value, the LCB strategy also relies on a notion of variance. Hence it can be applied only to regression meth- 4.1 Experimental Setting ods for which the variance on a prediction is calculable. A standard database with automatic algorithm selection data, called ASLIB [3], is used. This database consists of 3 Related work 17 problems, each with a number of algorithms (2-30) and instances (500-2500). The value of one or more perfor- Reinforcement learning and multi-armed bandit method- mance measures is available for each algorithm-instance ology have been applied to some related topics in auto- pair. Using this database it is possible to simulate differ- matic algorithm selection literature. However, the authors ent algorithm selection strategies without having to waste believe the setting considered in this paper, applying re- time on calculating the performance of algorithms and in- inforcement learning to the standard automatic algorithm stances. Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study 97 Information about the feature values for each instance is is achieved by the so-called virtual best solver. The vir- stored as well. The amount of features ranges from 22 to tual best solver selects for each instance the best possible 155. algorithm. It is defined only for instances for which per- In all experiments performed performance is measured formance data is available for all algorithms. Note that as how fast an algorithm solves an instance. To differenti- the PAR-score of the virtual best solver itself is not 0, it is ate between solving an instance just within the time-limit simply normalised to 0. and failing to solve an instance, time-outs are penalised The virtual best solver is artificial because it requires by multiplying them with a fixed factor. A penalty factor calculating the performance of each algorithm before se- commonly used in literature is 10. resulting in the PAR10 lecting one, hence it cannot be used in practice. However, criterion (with PAR an abbreviation for penalised average it is easy to define for an ASLIB scenario and is commonly runtime). Suppose the time-out limit is 1 hour. An un- used to evaluate the quality of an algorithm selection ap- solved instance will have a PAR score of 10 hours. In proach, for example in [25] and [15]. A score of 1 equals terms of the problem definition of automatic algorithm se- the score of the single best solver. The single best solver lection (equation 1): all results in this paper are presented corresponds to the classical notion of ’best algorithm’: it with as performance mapping applying the PAR10 crite- is the best solver on average over the entire dataset. Any rion to the stored runtime. algorithm selection strategy should improve on the single Since for all problems being considered performance is best solver to be considered useful, but the score of 1 does measured as time taken until a solution is found, they are not provide a strict upper bound and it is possible to obtain all minimisation problems. This implies specifically that scores higher than 1. An algorithm selection method with the lower confidence bound method (LCB) will be used a score higher than 1 performs worse than the single best instead of the upper confidence bound method (UCB). solver. As content management system for the experiments and To enable comparison with the current state of the art as interface to the remote cluster the R-package BatchEx- in automatic offline algorithm selection, the performance periments was used [4]. of regression random forest as reported on the ASLIB As described in section 2.2 a regression model is trained website2 is shown as a horizontal red line on each plot. for each algorithm during a training phase and these mod- LLAMA is an R-package for algorithm selection interfac- els are subsequently updated during an online phase. To ing a number of machine learning algorithms [16]. On the evaluate how well each strategy has managed to learn website the performance of some popular machine learn- models, the final model quality at the end of the online ing algorithms applied to ASLIB algorithm selection sce- phase is evaluated during a verification phase. During the narios is reported. Since regression random forest is also verification phase each strategy’s resulting model quality used in this paper’s experiments this allows comparison is evaluated by using the models to make predictions. Note with a current state of the art automatic offline algorithm that during this verification phase models are no longer up- selection method. dated and no explicit exploration is performed. For each The results are presented using box plots. The hinges strategy the basic greedy selection criterion is used. correspond to the first and third quartiles. The whiskers The set of available instances is split into three subsets extend to the highest value within a 1.5 inter-quartile range to represent three experimental phases: a set of training from the hinges. The remaining points are outliers. instances, a set of online instances and a set of verification Several parameters must be defined to run the experi- instances. ments. They are kept at a fixed value for all experiments As regression model ’regression forest’ is used. The reported in this paper. implementation from R-package randomForest [19] with • LCB λ : 1 the standard parameter values is used. The randomFor- • ε-greedy ε: 0.05 est method is interfaced through the R-package MLR [2]. • Proportion of training instances: 0.1 Note that the prediction variance reported by random for- • Proportion of online instances: 0.8 est is calculated using a bootstrap methodology. See [19] • Proportion of verification instances: 0.1 for a description of this method. • Minimal amount of instances before retraining: 16 Even though a database of performance data is used, • Amount of repitions per experiment: 10 running the experiments still proved too time-consuming for most ASLIB-scenarios. Most time is spent on retrain- For the exploration methods standard parameters were ing models. Therefore an optimisation was introduced: chosen. The proportions of training and online instances retraining the model of an algorithm is postponed until a were chosen ad hoc. The proportion of 0.1 for verification minimal amount of new datapoints is available. instances was chosen more consciously because it is stan- All results have been normalised on an per-instance ba- dard practice to evaluate models on 10% of the data. The sis before the average PAR10 performances (averaged over other parameters were also chosen ad hoc. For follow-up all repeats of the experiment) are calculated. A value of 0 studies a parameter study can be useful. is the best possible (recall that minimisation problems are 2 http://coseal.github.io/aslib-r/scenario-pages/QBF- considered, so a lower value is a better value). This score 2011/llama.html 98 H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker Only results for the QBF-2011 scenario are reported in this paper. Results for other scenarios are qualitatively similar with regards to the two research questions consid- ered3 . The QBF-2011 scenario contains performance data obtained from the quantified Boolean formula competition of 2011. The QBF-2011 scenario contains 5 algorithms, 46 features and 1368 instances, of which 1054 were solved by at least one algorithm. There are 136 training instances, 1094 online instances and 136 verification instances. The PAR10 score of the virtual best solver fluctuates around 8400 and that of the single best solver around 15300, depending on the specific split in training, online Figure 2: Boxplot summarising the answer to the question and verification instance set. Recall that the virtual best ’Is active learning useful?’. The presented data is collected solver’s score is normalised to 0 and the single best’s to 1. during the online phase 4.2 Is Automatic Online Algorithm Selection Useful for the Greedy Approach? Note that the Greedy-full-information strategy does not Adding additional data to the regression models is ex- provide a real upper bound: it is possible to perform better pected to result in better performance. To validate this hy- than this strategy as more information is not guaranteed to pothesis the performance of the most basic learning strat- always result in better predictions. egy (greedy) is compared with that of a strategy that does The plot with the results of the online phase is presented not learn. in figure 2. Online learning appears to be useful as the The greedy strategy picks the algorithm predicted to be greedy strategy outperforms the greedy-no-learning strat- best. egy. The good performance of the greedy-full-information The strategy that does not learn is called the greedy-no- strategy shows the value of having access to more infor- learning strategy and is abbreviated as greedyNL in the mation. plots. It is equivalent to the simply greedy strategy but The performance reported in figure 2 is the average per- it does not do any learning: it keeps using the models it formance over all online instances. For the first online in- learned during the training phase, never adding new data- stance the performance of the greedy-no-learning strategy points. This strategy is the strategy used by offline algo- is equal to that of the greedy strategy that does learn, but rithm selection approaches. for the last online instance the performance of the greedy The greedy-no-learning strategy uses its models to pre- strategy that does learn is expected to be better because dict the best algorithm for all online instances and its it has access to more data. The performance reported in PAR10-score is calculated on these online instances. The figure 2 is the average of these (most likely) increasing learning strategy does the same, but updates its models performances. with the data it gathers during the online phase. A third strategy is considered as well: the greedy-full- To quantify how much the greedy strategy has learned information strategy, abbreviated as greedyFI. Greedy- during the online phase, the quality of its predictions is full-information is an artificial strategy that has access to tested on a set of verification instances. During the verifi- the online information of each algorithm on all handled in- cation phase the models are no longer updated. The differ- stances. Thus not only the result of the selected algorithm ence in PAR10-score between the greedy strategy and the is used to update the models, but also the results of all other greedy-no-learning strategy is a measure for how much us- algorithms, hence the full-information. It does not have to ing the online data improves the quality of the selection. explore as it has access to all information regardless, hence The plot with the results of the verification phase is its greedy selection criterion. presented in figure 3. Note that the performance of the The Greedy-full-information strategy is introduced to greedy-full-information strategy is similar to the perfor- serve as a sort of upper bound on the performance of any mance of llama. This is expected because the bench- selection strategy. It always makes the best decision given mark performance was calculated using a 10-fold cross- the current information (pure exploitation) and it has ac- validation where performance of models trained on 90% cess to the maximal amount of information (performance of the data is measured on the remaining 10%. The mod- of all algorithms on all handled instances). Each actual els of the greedy-full-information strategy have also been selection strategy will have access to only a part of the in- trained on 90% of the data: 10% training data and 80% formation and might at times make suboptimal decisions online data. if it explores. To answer the question titling this section: automatic 3 Plots for all performed experiments are available on online algorithm selection appears to be useful for the http://www.kuleuven-kulak.be/~u0075355/Plots_ITAT_2016 greedy approach. Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study 99 Figure 3: Boxplot summarising the answer to the question Figure 4: Boxplot summarising the answer to the ques- ’Is active learning useful?’. The presented data is collected tion ’is explicit exploration useful?’. The presented data is during the verification phase collected during the online phase 4.3 Handling the Exploration vs. Exploitation Trade-off When performing reinforcement learning one is typically faced with an exploration vs. exploitation trade-off. When no online learning is performed the predicted best algo- rithm is always selected because the only reason for se- lecting an algorithm is solving the next instance as well as possible. In an online learning setting a second reason for selecting an algorithm surfaces: additional informa- tion will be obtained and this information will increase the quality of future decisions. Figure 5: Boxplot summarising the answer to the ques- Two exploration-incorporating strategies are compared tion ’is explicit exploration useful?’. The presented data is to the simple greedy approach: ε-greedy (epsGreedy on collected during the verification phase the plots) and lower confidence bound (LCB on the plots). See section 2.4 for a description of these two strategies. A first test is to compare each strategy’s performance no-learning strategy for all learning strategies. during the online phase. This measures their ability to solve the exploration vs. exploitation trade-off: do they manage to benefit from exploring more by obtaining a bet- 5 Discussion and future work ter average performance? The plot with the results of the online phase is presented The automatic online algorithm selection method pre- in figure 4. The answer appears to be negative: explicit sented in section 2.2 is inefficient. Every time a new exploration does not result in a better average performance datapoint is collected for an algorithm, the corresponding than greedy and the ε-greedy strategy even drops down to regression model is retrained from scratch using all pre- the level of the greedy-no-learning strategy. vious data and the newly obtained datapoint. If the fit- A second test is to check whether the exploration strate- ting of a model takes a long time this approach can be- gies managed to learn better models than the greedy strat- come prohibitively expensive, especially if its complexity egy by comparing their performance on the verification is influenced heavily by the amount of instances, as for data. If the exploration strategies managed to learn better each online instance a new model is trained and the mod- models they have merit as they traded off some exploita- els are trained based on an ever increasing amount of in- tion in favour of useful exploration. If this is not the case stances. Identifying and implementing more efficient up- the exploration was not useful and simply resulted in pick- dating strategies is future work. Mondrian forests [18] for ing inferior algorithms without any noticeable gain. example are an online version of random forests that could The plot with the results of the verification phase is be useful in this context. presented in figure 5. Exploration does not appear to There might be a theoretical problem with the proposed have been useful as the models learned by the ε-greedy automatic online algorithm selection method. During the and lower confidence bound strategy do not outperform online phase an algorithm’s regression model is extended the model learned by the greedy strategy. Note however only with datapoints for which the algorithm was predicted that the additional information obtained during the on- to be best. Hence the new datapoints are all clustered in line phase does result in better models than the greedy- the same region(s) of the problem domain. Note also that 100 H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker the region(s) where an algorithm is best is likely to change be to perform random or round-robin selection until suf- slightly every time a new instance is handled, as with the ficient samples have been collected for each algorithm to changing of an algorithm’s regression model all points in construct a regression model. Interesting challenges would the domain where the algorithm’s predicted performance be to include the option to add new algorithms at runtime was better than that of another algorithm’s are likely to and even identifying which kind of instances are hard for move slightly. Then again, in a sense the property that all algorithms, thereby inspiring the development of a new datapoints are mostly collected in the area where an algo- algorithm that performs well on these instances which can rithm is expected to be best is desirable. Knowing with then be added to the system. high accuracy how poorly an algorithm performs on in- Other future work consists of implementing solu- stances where it is bad is useless in this context whereas tion strategies specifically designed for the contextual accurate predictions on instances for which the algorithm multi-armed bandit problem which are more theoretically is likely to be one of the best are very relevant. However, founded, for example LinUCB [7]. note that predicting performance accurately is not the goal itself. What is important is that the actual best algorithm 6 Conclusions is the algorithm with predicted best score. The selection mapping does not change if a fixed value is added to each A reinforcement learning methodology for automatic on- performance prediction. line algorithm selection has been introduced. It is limited At the start of this project it was thought that the explicit to automatic algorithm selection methods based on perfor- exploration would be useful. Current and future work is in- mance predictions for each individual algorithm. It has vestigating why this does not appear to be the case. There been shown experimentally that the method is capable of are two main hypotheses. learning from online data and thereby improves on auto- The first hypothesis is that the amount of exploration matic offline algorithm selection methods. data collected during the online phase is negligible com- It has been shown that automatic online algorithm selec- pared to the data gathered during the training phase, thus tion can be modelled as a contextual multi-armed bandit the influence of the exploration cannot be observed. A problem. training set of 100 instances for 5 algorithms can be seen A total of three solution strategies have been imple- as a combination of 100 greedy choices and 400 explo- mented and empirically tested: an approach that always rative choices. The epsilon greedy strategy will explore greedily selects the best algorithm and two approaches 5% of the time, resulting in on average 50 new explorative that perform exploration: ε-greedy and lower confidence datapoints during an online phase of 1000 instances. This bound. The experiments suggest that the greedy strategy hypothesis is currently being investigated outperforms the explorative strategies. The second hypothesis is that exploration is already implicitly performed by the greedy strategy, rendering additional explicit exploration unnecessary. The greedy Acknowledgements method is greedy in the sense that it always selects the Work supported by the Belgian Science Policy Office best algorithm, but which the best algorithm is depends (BELSPO) in the Interuniversity Attraction Pole COMEX. from instance to instance, thus over time performance dat- (http://comex.ulb.ac.be). apoints for all algorithms are collected. In this way the The computational resources and services used in this greedy strategy implicitly explores. Investigating this hy- work were provided by the VSC (Flemish Supercomputer pothesis is future work. Center), funded by the Research Foundation - Flanders In order to better quantify the improvements realised (FWO) and the Flemish Government – department EWI during the online phase, future work is to investigate the way in which the selection model improves in detail, by not only evaluating the overall models before and after the References online phase, but also at several points during the online [1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time anal- phase and by also dropping down a level and investigating ysis of the multiarmed bandit problem. Machine learning, how the individual regression models (one for each algo- 47(2-3):235–256, 2002. rithm) evolve over time. [2] B. Bernd. mlr: A new package to conduct machine learning In future work the overhead of retraining the models experiments in r. should be explicitly considered and quantified in order to [3] B. Bischl, P. Kerschke, L. Kotthoff, M. Lindauer, Y. Mal- be able to quantify the net improvement of using the on- itsky, A. Fréchette, H. Hoos, F. Hutter, K. Leyton-Brown, line data. In the experiments here reported this overhead K. Tierney, et al. Aslib: A benchmark library for algorithm is ignored. selection. arXiv preprint arXiv:1506.02465, 2015. An interesting path for future work is te develop an al- [4] B. Bischl, M. Lang, O. Mersmann, J. Rahnenführer, and gorithm that learns how to perform automatic online al- C. Weihs. BatchJobs and BatchExperiments: Abstraction gorithm selection form scratch, without any training data mechanisms for using R in batch environments. Journal of whatsoever. A straightforward initial methodology would Statistical Software, 64(11):1–25, 2015. Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study 101 [5] B. Bischl, O. Mersmann, H. Trautmann, and M. Preuss. Al- vkar, and Y. Shoham. Understanding random sat: Be- gorithm selection based on exploratory landscape analysis yond the clauses-to-variables ratio. In Principles and Prac- and cost-sensitive learning. In Genetic and Evolutionary tice of Constraint Programming–CP 2004, pages 438–452. Computation Conference (GECCO), 2012. Springer, 2004. [6] P. B. Brazdil, C. Soares, and J. P. Da Costa. Ranking learn- [23] J. R. Rice. The algorithm selection problem. Advances in ing algorithms: Using ibl and meta-learning on accuracy Computers, 15:65–118, 1976. and time results. Machine Learning, 50(3):251–277, 2003. [24] K. A. Smith-Miles. Cross-disciplinary perspectives on [7] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contex- meta-learning for algorithm selection. ACM Computing tual bandits with linear payoff functions. In International Surveys (CSUR), 41(1):6, 2009. Conference on Artificial Intelligence and Statistics, pages [25] L. Xu, F. Hutter, H. Hoos, and K. Leyton-Brown. Evaluat- 208–214, 2011. ing component solver contributions to portfolio-based algo- [8] V. A. Cicirello and S. F. Smith. The max k-armed bandit: rithm selectors. In Theory and Applications of Satisfiability A new model of exploration applied to search heuristic se- Testing–SAT 2012, pages 228–241. Springer, 2012. lection. In AAAI, pages 1355–1361, 2005. [26] L. Xu, F. Hutter, J. Shen, H. H. Hoos, and K. Leyton- [9] H. Degroote and P. De Causmaecker. Towards a knowl- Brown. Satzilla2012: Improved algorithm selection based edge base for performance data: A formal model for per- on cost-sensitive classification models. Proceedings of SAT formance comparison. In Proceedings of the Companion Challenge, pages 57–58, 2012. Publication of the 2015 on Genetic and Evolutionary Com- putation Conference, pages 1189–1192. ACM, 2015. [10] T. Doan and J. Kalita. Selecting machine learning algo- rithms using regression models. In 2015 IEEE Interna- tional Conference on Data Mining Workshop (ICDMW), pages 1498–1505. IEEE, 2015. [11] M. Gagliolo and J. Schmidhuber. Learning dynamic al- gorithm portfolios. Annals of Mathematics and Artificial Intelligence, 47(3-4):295–328, 2006. [12] M. Gagliolo and J. Schmidhuber. Algorithm selection as a bandit problem with unbounded losses. In Interna- tional Conference on Learning and Intelligent Optimiza- tion, pages 82–96. Springer, 2010. [13] B. A. Huberman, R. M. Lukose, and T. Hogg. An Eco- nomics Approach to Hard Computational Problems. Sci- ence, 275(5296):51–54, 1997. [14] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown. Algo- rithm runtime prediction: Methods & evaluation. Artificial Intelligence, 206:79–111, 2014. [15] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz, and M. Sellmann. Algorithm selection and scheduling. In Principles and Practice of Constraint Programming–CP 2011, pages 454–469. Springer, 2011. [16] L. Kotthoff. Llama: leveraging learning to automatically manage algorithms. arXiv preprint arXiv:1306.1031, 2013. [17] L. Kotthoff. Algorithm Selection for Combinatorial Search Problems: A Survey. AI Magazine, 35(3):48–60, 2014. [18] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mon- drian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140– 3148, 2014. [19] A. Liaw and M. Wiener. Classification and regression by randomforest. R news, 2(3):18–22, 2002. [20] Y. Malitsky, A. Sabharwal, H. Samulowitz, and M. Sell- mann. Non-model-based algorithm portfolios for sat. In Theory and Applications of Satisfiability Testing-SAT 2011, pages 369–370. Springer, 2011. [21] T. Messelis and P. De Causmaecker. An automatic al- gorithm selection approach for the multi-mode resource- constrained project scheduling problem. European Journal of Operational Research, 233(3):511–528, 2014. [22] E. Nudelman, K. Leyton-Brown, H. H. Hoos, A. De-