ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 93–101
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, c 2016 H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker


            Reinforcement Learning for Automatic Online Algorithm Selection - an
                                     Empirical Study

                                 Hans Degroote1 , Bernd Bischl2 , Lars Kotthoff3 , and Patrick De Causmaecker4
                                   1 KU Leuven, Department of Computer Science, CODeS & iMinds-ITEC, Belgium
                                                          ❤❛♥s✳❞❡❣r♦♦t❡❅❦✉❧❡✉✈❡♥✳❜❡
                                              2 Bernd Bischl, Department of Statistics, LMU Munich
                                         3 University of British Columbia, Department of Computer Science
                                   4 KU Leuven, Department of Computer Science, CODeS & iMinds-ITEC, Belgium


      Abstract: In this paper a reinforcement learning methodol-          The predictive model predicts for each instance which al-
      ogy for automatic online algorithm selection is introduced          gorithm is likely to be best. The model is created based on
      and empirically tested. It is applicable to automatic algo-         characteristics of the problem under consideration. These
      rithm selection methods that predict the performance of             characteristics are called problem features. The idea is that
      each available algorithm and then pick the best one. The            their value should be correlated with how hard a problem
      experiments confirm the usefulness of the methodology:              is for a certain algorithm.
      using online data results in better performance.                       Automatic algorithm selection methods are metaheuris-
         As in many online learning settings an exploration vs.           tics in the sense that they are general problem-independent
      exploitation trade-off, synonymously learning vs. earning           strategies with a different implementation depending on
      trade-off, is incurred. Empirically investigating the qual-         the specific problem being considered. The difference in
      ity of classic solution strategies for handling this trade-off      implementation manifests itself in the choice of features,
      in the automatic online algorithm selection setting is the          which are distinct for every problem. For example, the ra-
      secondary goal of this paper.                                       tio of the amount of clauses over the amount of variables
         The automatic online algorithm selection problem can             is relevant for satisfiability problems but makes no sense
      be modelled as a contextual multi-armed bandit problem.             for graph colouring or scheduling problems.
      Two classic strategies for solving this problem are tested             After the training phase the decision model remains
      in the context of automatic online algorithm selection:             fixed in automatic offline algorithm selection. To decide
      ε-greedy and lower confidence bound. The experiments                which algorithm to use to solve a new instance with, its
      show that a simple purely exploitative greedy strategy out-         features are calculated and input into the decision model
      performs strategies explicitly performing exploration.              which in turn returns an algorithm. This algorithm is then
                                                                          used to solve the instance with.
      1    Introduction                                                      The observation motivating this research is that new
                                                                          performance data keeps being generated after the training
      The problem considered in this paper is automatic algo-             phase every time the predictive model is used to predict
      rithm selection. The field of algorithm selection is mo-            the best algorithm for a new instance. This data is freely
      tivated by the observation that a unique best algorithm             available yet not used by automatic offline algorithm selec-
      rarely exists for many problems. Which algorithm is best            tion methods. The main research question of this paper is:
      depends on the specific problem instance being solved.              “Can online performance data be used to improve the pre-
      This can be illustrated by looking at the results of past           dictive model underlying automatic algorithm selection?".
      SAT-competitions1 . There are always problem instances                 An interesting challenge faced in automatic online al-
      that are not solved by the winning algorithm (the overall           gorithm selection is finding a balance between learning a
      best) but that other algorithms manage to solve [25].               good predictive model and making good predictions. Se-
         The complementarities between different algorithms               lecting a predicted non-best algorithm might be better in
      can be leveraged through algorithm portfolios [13]. In-             the long run because the information thus obtained results
      stead of using a single algorithm to solve a set of prob-           in a better model and more accurate predictions for fu-
      lems, several algorithms are combined in a portfolio and            ture instances, but it negatively affects the expected per-
      a method of selecting the most appropriate algorithm for            formance on the current instance. This challenge is an
      each problem instance at hand is used. This selection pro-          example of the exploration vs. exploitation trade-off of-
      cess is called automatic algorithm selection.                       ten faced in reinforcement learning. It is also called the
         State of the art approaches for automatic offline algo-          learning vs. earning trade-off.
      rithm selection use machine learning techniques to create              The automatic online algorithm selection problem can
      a predictive model based on large amounts of training data.         be modelled as a multi-armed bandit problem, more
                                                                          specifically as a multi-armed bandit problem with covari-
          1 http://satcompetition.org/                                    ates, also known as the contextual multi-armed bandit
94                                                                                   H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker

     problem, as for each problem instance the values of a num-
     ber of problem characteristics are known. Two basic clas-
     sic strategies for solving the contextual multi-armed ban-
     dit problem that incorporate explicit exploration are tested
     and compared to the purely exploitative approach.
        The remainder of this paper is structured as follows. In
     section 2 the automatic online algorithm selection problem
     is defined. First the classic automatic offline algorithm se-
     lection problem is discussed, then the methodology for au-           Figure 1: Rice’s model for algorithm selection’
     tomatic online algorithm selection is presented after which
     the contextual multi-armed bandit problem is introduced
     and is shown how automatic online algorithm selection              The aim is to find the feature mapping and selection
     can be modelled as a contextual multi-armed bandit prob-        mapping that optimise the average performance. The fea-
     lem. In section 3 related work is discussed. The exper-         tures of an instance are given, so the only leeway there is
     imental setting and results are presented in section 4. In      which features to consider. Limitations on the possible se-
     section 5 some remarks about the introduced methodol-           lection mappings can be imposed by the method used to
     ogy are made and the experimental results are discussed.        create it.
     Future work is also discussed in section 5. The paper con-         Identifying descriptive features is a time consuming
     cludes in section 6                                             process. Luckily large amounts of features have already
                                                                     been proposed in literature for many interesting problems.
                                                                     For example in [22] an overview is given of features for
     2     Automatic Online Algorithm Selection                      the satisfiability problem and in [21] for the multi-mode
     2.1   Automatic Algorithm Selection                             resource-constrained project scheduling problem.
                                                                        The automatic in automatic algorithm selection refers to
     Rice’s paper "The algorithm selection problem" [23] for-        the way decision models are made: automatically. Super-
     mally introduced the algorithm selection problem. The           vised learning techniques are typically used.
     fundamental characteristics of the problem remain un-              Two broad classes of techniques can be identified. In
     changed up to now. In the most basic scenario identified        the first fall classification-based techniques: the decision
     by Rice the problem is characterised by a set of instances,     model directly predicts which algorithm will be best for
     a set of algorithms and a (set of) performance measure(s)       an instance based on its features. No information about
     and by two mappings between these sets: a selection map-        the actual quality of the algorithm is communicated. Note
     ping and a performance mapping. The selection mapping           that in general this is not a binary but a multi-class clas-
     maps instances to algorithms and the performance map-           sification problem, as each instance is classified as being
     ping maps algorithm-instance pairs to their performance-        best-solved by one of an arbitrary amount of algorithms.
     measure(s). A typical formulation of the objective of auto-     In [20] for example, the k-nearest-neighbours method is
     matic algorithm selection is to find the selection mapping      used. Another example of the use of k-nearest-neighbours
     that results in the best average performance.                   can be found in [6], where the more complicated problem
        It is up to the user to identify a sensible performance      of ranking algorithms (as opposed to only predicting the
     measure. In this paper only single-objective problems are       best) is considered. Another option is to use decision trees
     considered. See [9] for a more formal description of what       or their more powerful relative random forest, as in the
     characterises an acceptable performance measure for the         latest version of Satzilla [26], an algorithm selector for the
     research in this paper. Each performance measure with           satisfiability problem.
     totally ordered values is definitely acceptable.
                                                                        Misclassification is cost-sensitive in automatic algo-
        Rice acknowledges the need for a set of features in prac-
                                                                     rithm selection: classifying an instance incorrectly as be-
     tical applications and extends his model with this set. The
                                                                     ing best solved by a horrendous algorithms is worse than
     full model is visualised in figure 1. Note that the selection
                                                                     classifying it as being best solved by an algorithm only
     mapping now maps values from the feature space instead
                                                                     marginally worse than the best one. A classification-based
     of directly from the instance space.
                                                                     automatic algorithm selection technique should take this
        To formalise the problem statement, let Q be a prob-
                                                                     cost-sensitivity into account, as argued in [5].
     ability distribution on the instance set I . Let f be the
                                                                        The second class of automatic algorithm selection tech-
     feature mapping (mapping an instance to a feature vec-
                                                                     niques consists of regression-based techniques. A regres-
     tor), s the selection mapping (mapping a feature vector to
                                                                     sion model is created for each algorithm, predicting its
     an algorithm) and p the performance mapping (mapping
                                                                     performance in function of the problem features. The al-
     an instance-algorithm combination to a performance mea-
                                                                     gorithm with the best predicted performance is selected
     sure). The average performance of a selection mapping
                                                                     to solve a new instance with. An overview of such tech-
     can now be defined as:
                                                                     niques can be found in [14]. A recent approach is de-
                          EQ [(p(s( f (i)), i)]               (1)    scribed in [10].
Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study                                                       95

         Note that algorithm selection itself is a cost-sensitive         mapping that optimise the average performance as defined
      classification problem: the goal is to classify instances           in equation 1.
      as belonging to the algorithm that best solves them. The               In the setting of this paper the instance set is defined
      distinction between classification and regression methods           by a fixed set of benchmark instances and the distribution
      refers to how this classification problem is solved behind          is uniform. The feature mapping is defined by consid-
      the scenes.                                                         ering all features available for the benchmark instances.
         A thorough overview of algorithm selection methodol-             The problem of selecting the most informative features is
      ogy can be found in [24] and more recently in [17].                 not considered: the feature mapping is fixed. A selection
         Both classes of automatic algorithm selection methods            mapping is defined by considering a regression model for
      use the same kind of input to initialise their decision mod-        each algorithm and selecting an algorithm in function of
      els: performance data of all algorithms on a set of train-          these predicted values. The most straightforward selection
      ing instances. For the classification-based techniques it is        mapping is to select the algorithm with the predicted best
      strictly necessary for the performance of all algorithms to         performance. This and other options are discussed in sec-
      be available for each instance. Otherwise it is not possi-          tion 2.4. The performance mapping used is discussed in
      ble to say which algorithm is best for the instance. This is        section 4.1.
      not the case for regression-based techniques. As long as               In the offline setting the selection mapping remains
      each algorithm’s model has access to datapoints it can be           fixed. However, in the online setting it changes over time
      initialised, it is not necessary to know the performance of         as more instances are solved. The selection at each point in
      each algorithm on each training instance.                           time depends explicitly on earlier selections. For this rea-
                                                                          son equation 1 cannot be used directly to formally define a
                                                                          general problem statement for automatic online algorithm
      2.2   Solution Strategy for Automatic Online                        selection.
            Algorithm Selection                                              In the empirical setting of this paper the quality of a so-
                                                                          lution to the automatic online algorithm selection problem
      During the online phase performance data is generated
                                                                          is measured as its average performance on a time-ordered
      every time a new instance is solved. This performance
                                                                          set of instances, as presented during the online phase. The
      data consists of the performance of the selected algorithm
                                                                          empirical performance measurement process is explained
      on the new instance. The performance data of the other
                                                                          in more detail in section 4 where the experimental setting
      algorithms on the new instance is not available. Since
                                                                          and results are described.
      this type of data can only be processed by the regression-
      based methods, the proposed methodology will be limited
      to regression-based techniques.                                     2.4   Contextual Multi-armed Bandits
         The methodology for automatic online algorithm selec-
                                                                          In the standard multi-armed bandit a gambler has access
      tion is the following. During the offline training phase an
                                                                          to a set of slot machines (bandits) and must decide on a
      initial regression model is trained for each algorithm , us-
                                                                          strategy in which order to pull their arms. His goal is
      ing training data consisting of algorithm performance on
                                                                          to realise as much profit as possible. Each time an arm
      instances described by feature values. During the online
                                                                          is pulled the gambler receives a random reward sampled
      phase the algorithm to solve the first online instance with
                                                                          from a distribution belonging to the selected arm. Initially
      is selected based on the models created during the training
                                                                          all distributions are unknown, but as the gambler gambles
      phase. The model of the selected algorithm is retrained
                                                                          on he obtains more information about the distributions of
      with the new datapoint. The performance of all other al-
                                                                          the available arms and can make more informed choices.
      gorithms on the instance remains unknown. The algorithm
                                                                             The central dilemma faced by the gambler is whether
      to solve the second online instance with is selected based
                                                                          to keep pulling the arm proven to be best so far or to try
      on the models created after having solved the first instance,
                                                                          another arm about which little is known and that might
      thus one of the models has been updated to incorporate the
                                                                          be better. If the other arm turns out to be more profitable
      performance information about the first online instance.
                                                                          never having explored its potential further would have lost
      Selection for the third online instance is influenced by the
                                                                          the gambler a lot of money.
      two previous etc. As more instances are solved, more dat-
                                                                             See [1] for a formal definition of the multi-armed ban-
      apoints are gathered and the models are expected to im-
                                                                          dit problem. In this paper a number of policies for pulling
      prove, which in turn is expected to result in better algo-
                                                                          arms are analysed in terms of how fast the total profit di-
      rithm selection.
                                                                          verges from the maximal profit in function of the total
                                                                          amount of pulls.
      2.3   Automatic Online Algorithm Selection Problem                     The contextual multi-armed bandit problem generalises
            Statement                                                     the multi-armed bandit problem. To stay within the
                                                                          metaphor: before pulling an arm the gambler sees a con-
      As discussed in section 2.1, the goal of automatic algo-            text vector. This context vector contains values for pre-
      rithm selection is to find the feature mapping and selection        defined properties that describe the current situation. In
96                                                                                   H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker

     the contextual multi-armed bandit problem the reward of         selection problem where one has to select one algorithm
     each arm depends on the context. As in the classic multi-       from a limited pre-defined set of algorithms to solve an in-
     armed bandit problem the gambler’s goal is to maximize          stance with, has not yet been investigated. In the remain-
     his profit, but in order to do so he has to learn how the       der of this section some related research is discussed and
     context vector relates to the rewards.                          is mentioned how it differs from this work.
        The automatic online algorithm selection problem is a           In [11] multi-armed bandit methodology is applied to
     contextual multi-armed bandit problem. Each algorithm is        the online learning of dynamic algorithm portfolios. Their
     an arm and pulling an arm is the equivalent of selecting        goal differs from this paper’s. They want to learn for each
     an algorithm. When selecting an algorithm for an instance       instance a separate algorithm portfolio while the goal here
     its feature values are known, which is the equivalent of        is to predict one algorithm to solve the instance with. In
     having shown a context vector. Maximizing profit in this        an algorithm portfolio a bunch of algorithms are run si-
     context boils down to minimizing the performance differ-        multaneously. The dynamic goal is to learn the optimal
     ence between the selected algorithm and the actual best         assignment of time slices to algorithms while the portfolio
     algorithm.                                                      is in use. This paper’s setting is not dynamic in this sense.
        A number of solution strategies for the contextual ban-      Once an algorithm has been selected to solve an instance
     dit have been introduced and analysed in literature, such       with this decision will not be come back on, even if the
     as LinUCB [7] where the reward is assumed to linearly           algorithm appears to perform poorly on the instance.
     depend on the feature vector. However, for the prelimi-            In [8] a new multi-armed bandit model is proposed and
     nary research presented in this paper three straightforward     applied to search heuristic selection, a kind of algorithm
     and simple strategies have been implemented.                    selection. However, their objective differs from this pa-
        The first strategy that is considered is the greedy strat-   per’s. In terms of algorithm selection they have access to a
     egy. The greedy strategy does not perform any explicit ex-      number of stochastic algorithms and a budget of N trials.
     ploration: it always selects the algorithm that is predicted    The goal is to find an as good as possible solution for one
     to be best.                                                     instance within the budget of N trials, whereas this paper’s
        The second strategy that is considered is the ε-greedy       goal is to find an as good as possible solution on average
     strategy, which is parametrised by a value ε between 0          over many instances, each with a budget of one trial. Their
     and 1. The strategy is equivalent with the simple greedy        stochasticity is caused by the algorithms but the instance
     strategy with probability (1 − ε) and selects a random al-      remains fixed. In this paper stochasticity is also caused
     gorithm with probability ε.                                     by the instances as at each point in time a new random
        The third strategy is the is the UCB strategy, short for     instance is solved.
     upper confidence bound. It is parametrised by parameter λ          In [12] a notion of online algorithm selection is intro-
     (with λ ≥ 0). The UCB strategy consists of calculating for      duced for decision problems. They focus on the problem
     each algorithm its predicted performance p and the stan-        of deciding how to distribute time shares over the set of
     dard error on this prediction e. The algorithm with highest     available algorithms and make this decision on an instance
     value for p + e ∗ λ is selected. Since algorithms for which     per instance basis. They model the problem on two levels.
     few datapoints exist typically have high variance on their      On the upper level they use bandit methodology to decide
     predictions, an algorithm with predicted poor performance       which time allocator to use (choosing from a uniform al-
     might be preferred over an algorithm with decent perfor-        locator and various dynamic allocators) and on the lower
     mance, depending on the performance difference, variance        level the algorithms are run in parallel (or simulated to run
     sizes and the value of λ . A higher λ -value results in more    in parallel) according to the time shares predicted by the
     exploration.                                                    allocator selected on the higher level. Thus the arms of
        The equivalent of the UCB strategy for minimisation          their bandit problem are ’time allocators’ and not algo-
     problems is the LCB strategy, short for lower confidence        rithms.
     bound. Its selection rule is: p − e ∗ λ .
        Unlike the two previous strategies, which only rely on a     4     Experiments
     predicted value, the LCB strategy also relies on a notion of
     variance. Hence it can be applied only to regression meth-      4.1   Experimental Setting
     ods for which the variance on a prediction is calculable.
                                                                     A standard database with automatic algorithm selection
                                                                     data, called ASLIB [3], is used. This database consists of
     3   Related work                                                17 problems, each with a number of algorithms (2-30) and
                                                                     instances (500-2500). The value of one or more perfor-
     Reinforcement learning and multi-armed bandit method-           mance measures is available for each algorithm-instance
     ology have been applied to some related topics in auto-         pair. Using this database it is possible to simulate differ-
     matic algorithm selection literature. However, the authors      ent algorithm selection strategies without having to waste
     believe the setting considered in this paper, applying re-      time on calculating the performance of algorithms and in-
     inforcement learning to the standard automatic algorithm        stances.
Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study                                                        97

         Information about the feature values for each instance is        is achieved by the so-called virtual best solver. The vir-
      stored as well. The amount of features ranges from 22 to            tual best solver selects for each instance the best possible
      155.                                                                algorithm. It is defined only for instances for which per-
         In all experiments performed performance is measured             formance data is available for all algorithms. Note that
      as how fast an algorithm solves an instance. To differenti-         the PAR-score of the virtual best solver itself is not 0, it is
      ate between solving an instance just within the time-limit          simply normalised to 0.
      and failing to solve an instance, time-outs are penalised               The virtual best solver is artificial because it requires
      by multiplying them with a fixed factor. A penalty factor           calculating the performance of each algorithm before se-
      commonly used in literature is 10. resulting in the PAR10           lecting one, hence it cannot be used in practice. However,
      criterion (with PAR an abbreviation for penalised average           it is easy to define for an ASLIB scenario and is commonly
      runtime). Suppose the time-out limit is 1 hour. An un-              used to evaluate the quality of an algorithm selection ap-
      solved instance will have a PAR score of 10 hours. In               proach, for example in [25] and [15]. A score of 1 equals
      terms of the problem definition of automatic algorithm se-          the score of the single best solver. The single best solver
      lection (equation 1): all results in this paper are presented       corresponds to the classical notion of ’best algorithm’: it
      with as performance mapping applying the PAR10 crite-               is the best solver on average over the entire dataset. Any
      rion to the stored runtime.                                         algorithm selection strategy should improve on the single
         Since for all problems being considered performance is           best solver to be considered useful, but the score of 1 does
      measured as time taken until a solution is found, they are          not provide a strict upper bound and it is possible to obtain
      all minimisation problems. This implies specifically that           scores higher than 1. An algorithm selection method with
      the lower confidence bound method (LCB) will be used                a score higher than 1 performs worse than the single best
      instead of the upper confidence bound method (UCB).                 solver.
         As content management system for the experiments and                 To enable comparison with the current state of the art
      as interface to the remote cluster the R-package BatchEx-           in automatic offline algorithm selection, the performance
      periments was used [4].                                             of regression random forest as reported on the ASLIB
         As described in section 2.2 a regression model is trained        website2 is shown as a horizontal red line on each plot.
      for each algorithm during a training phase and these mod-           LLAMA is an R-package for algorithm selection interfac-
      els are subsequently updated during an online phase. To             ing a number of machine learning algorithms [16]. On the
      evaluate how well each strategy has managed to learn                website the performance of some popular machine learn-
      models, the final model quality at the end of the online            ing algorithms applied to ASLIB algorithm selection sce-
      phase is evaluated during a verification phase. During the          narios is reported. Since regression random forest is also
      verification phase each strategy’s resulting model quality          used in this paper’s experiments this allows comparison
      is evaluated by using the models to make predictions. Note          with a current state of the art automatic offline algorithm
      that during this verification phase models are no longer up-        selection method.
      dated and no explicit exploration is performed. For each                The results are presented using box plots. The hinges
      strategy the basic greedy selection criterion is used.              correspond to the first and third quartiles. The whiskers
         The set of available instances is split into three subsets       extend to the highest value within a 1.5 inter-quartile range
      to represent three experimental phases: a set of training           from the hinges. The remaining points are outliers.
      instances, a set of online instances and a set of verification          Several parameters must be defined to run the experi-
      instances.                                                          ments. They are kept at a fixed value for all experiments
         As regression model ’regression forest’ is used. The             reported in this paper.
      implementation from R-package randomForest [19] with                   • LCB λ : 1
      the standard parameter values is used. The randomFor-                  • ε-greedy ε: 0.05
      est method is interfaced through the R-package MLR [2].                • Proportion of training instances: 0.1
      Note that the prediction variance reported by random for-              • Proportion of online instances: 0.8
      est is calculated using a bootstrap methodology. See [19]              • Proportion of verification instances: 0.1
      for a description of this method.                                      • Minimal amount of instances before retraining: 16
         Even though a database of performance data is used,                 • Amount of repitions per experiment: 10
      running the experiments still proved too time-consuming
      for most ASLIB-scenarios. Most time is spent on retrain-               For the exploration methods standard parameters were
      ing models. Therefore an optimisation was introduced:               chosen. The proportions of training and online instances
      retraining the model of an algorithm is postponed until a           were chosen ad hoc. The proportion of 0.1 for verification
      minimal amount of new datapoints is available.                      instances was chosen more consciously because it is stan-
         All results have been normalised on an per-instance ba-          dard practice to evaluate models on 10% of the data. The
      sis before the average PAR10 performances (averaged over            other parameters were also chosen ad hoc. For follow-up
      all repeats of the experiment) are calculated. A value of 0         studies a parameter study can be useful.
      is the best possible (recall that minimisation problems are             2 http://coseal.github.io/aslib-r/scenario-pages/QBF-

      considered, so a lower value is a better value). This score         2011/llama.html
98                                                                                     H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker

        Only results for the QBF-2011 scenario are reported in
     this paper. Results for other scenarios are qualitatively
     similar with regards to the two research questions consid-
     ered3 . The QBF-2011 scenario contains performance data
     obtained from the quantified Boolean formula competition
     of 2011. The QBF-2011 scenario contains 5 algorithms,
     46 features and 1368 instances, of which 1054 were solved
     by at least one algorithm. There are 136 training instances,
     1094 online instances and 136 verification instances.
        The PAR10 score of the virtual best solver fluctuates
     around 8400 and that of the single best solver around
     15300, depending on the specific split in training, online
                                                                       Figure 2: Boxplot summarising the answer to the question
     and verification instance set. Recall that the virtual best
                                                                       ’Is active learning useful?’. The presented data is collected
     solver’s score is normalised to 0 and the single best’s to 1.
                                                                       during the online phase

     4.2     Is Automatic Online Algorithm Selection Useful
             for the Greedy Approach?
                                                                         Note that the Greedy-full-information strategy does not
     Adding additional data to the regression models is ex-            provide a real upper bound: it is possible to perform better
     pected to result in better performance. To validate this hy-      than this strategy as more information is not guaranteed to
     pothesis the performance of the most basic learning strat-        always result in better predictions.
     egy (greedy) is compared with that of a strategy that does           The plot with the results of the online phase is presented
     not learn.                                                        in figure 2. Online learning appears to be useful as the
        The greedy strategy picks the algorithm predicted to be        greedy strategy outperforms the greedy-no-learning strat-
     best.                                                             egy. The good performance of the greedy-full-information
        The strategy that does not learn is called the greedy-no-      strategy shows the value of having access to more infor-
     learning strategy and is abbreviated as greedyNL in the           mation.
     plots. It is equivalent to the simply greedy strategy but
                                                                          The performance reported in figure 2 is the average per-
     it does not do any learning: it keeps using the models it
                                                                       formance over all online instances. For the first online in-
     learned during the training phase, never adding new data-
                                                                       stance the performance of the greedy-no-learning strategy
     points. This strategy is the strategy used by offline algo-
                                                                       is equal to that of the greedy strategy that does learn, but
     rithm selection approaches.
                                                                       for the last online instance the performance of the greedy
        The greedy-no-learning strategy uses its models to pre-
                                                                       strategy that does learn is expected to be better because
     dict the best algorithm for all online instances and its
                                                                       it has access to more data. The performance reported in
     PAR10-score is calculated on these online instances. The
                                                                       figure 2 is the average of these (most likely) increasing
     learning strategy does the same, but updates its models
                                                                       performances.
     with the data it gathers during the online phase.
        A third strategy is considered as well: the greedy-full-          To quantify how much the greedy strategy has learned
     information strategy, abbreviated as greedyFI. Greedy-            during the online phase, the quality of its predictions is
     full-information is an artificial strategy that has access to     tested on a set of verification instances. During the verifi-
     the online information of each algorithm on all handled in-       cation phase the models are no longer updated. The differ-
     stances. Thus not only the result of the selected algorithm       ence in PAR10-score between the greedy strategy and the
     is used to update the models, but also the results of all other   greedy-no-learning strategy is a measure for how much us-
     algorithms, hence the full-information. It does not have to       ing the online data improves the quality of the selection.
     explore as it has access to all information regardless, hence        The plot with the results of the verification phase is
     its greedy selection criterion.                                   presented in figure 3. Note that the performance of the
        The Greedy-full-information strategy is introduced to          greedy-full-information strategy is similar to the perfor-
     serve as a sort of upper bound on the performance of any          mance of llama. This is expected because the bench-
     selection strategy. It always makes the best decision given       mark performance was calculated using a 10-fold cross-
     the current information (pure exploitation) and it has ac-        validation where performance of models trained on 90%
     cess to the maximal amount of information (performance            of the data is measured on the remaining 10%. The mod-
     of all algorithms on all handled instances). Each actual          els of the greedy-full-information strategy have also been
     selection strategy will have access to only a part of the in-     trained on 90% of the data: 10% training data and 80%
     formation and might at times make suboptimal decisions            online data.
     if it explores.                                                     To answer the question titling this section: automatic
           3 Plots
                 for all performed experiments are available     on    online algorithm selection appears to be useful for the
     http://www.kuleuven-kulak.be/~u0075355/Plots_ITAT_2016            greedy approach.
Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study                                                      99


      Figure 3: Boxplot summarising the answer to the question            Figure 4: Boxplot summarising the answer to the ques-
      ’Is active learning useful?’. The presented data is collected       tion ’is explicit exploration useful?’. The presented data is
      during the verification phase                                       collected during the online phase


      4.3   Handling the Exploration vs. Exploitation
            Trade-off

      When performing reinforcement learning one is typically
      faced with an exploration vs. exploitation trade-off. When
      no online learning is performed the predicted best algo-
      rithm is always selected because the only reason for se-
      lecting an algorithm is solving the next instance as well
      as possible. In an online learning setting a second reason
      for selecting an algorithm surfaces: additional informa-
      tion will be obtained and this information will increase the
      quality of future decisions.                                        Figure 5: Boxplot summarising the answer to the ques-
         Two exploration-incorporating strategies are compared            tion ’is explicit exploration useful?’. The presented data is
      to the simple greedy approach: ε-greedy (epsGreedy on               collected during the verification phase
      the plots) and lower confidence bound (LCB on the plots).
      See section 2.4 for a description of these two strategies.
         A first test is to compare each strategy’s performance           no-learning strategy for all learning strategies.
      during the online phase. This measures their ability to
      solve the exploration vs. exploitation trade-off: do they
      manage to benefit from exploring more by obtaining a bet-           5    Discussion and future work
      ter average performance?
         The plot with the results of the online phase is presented       The automatic online algorithm selection method pre-
      in figure 4. The answer appears to be negative: explicit            sented in section 2.2 is inefficient. Every time a new
      exploration does not result in a better average performance         datapoint is collected for an algorithm, the corresponding
      than greedy and the ε-greedy strategy even drops down to            regression model is retrained from scratch using all pre-
      the level of the greedy-no-learning strategy.                       vious data and the newly obtained datapoint. If the fit-
         A second test is to check whether the exploration strate-        ting of a model takes a long time this approach can be-
      gies managed to learn better models than the greedy strat-          come prohibitively expensive, especially if its complexity
      egy by comparing their performance on the verification              is influenced heavily by the amount of instances, as for
      data. If the exploration strategies managed to learn better         each online instance a new model is trained and the mod-
      models they have merit as they traded off some exploita-            els are trained based on an ever increasing amount of in-
      tion in favour of useful exploration. If this is not the case       stances. Identifying and implementing more efficient up-
      the exploration was not useful and simply resulted in pick-         dating strategies is future work. Mondrian forests [18] for
      ing inferior algorithms without any noticeable gain.                example are an online version of random forests that could
         The plot with the results of the verification phase is           be useful in this context.
      presented in figure 5. Exploration does not appear to                  There might be a theoretical problem with the proposed
      have been useful as the models learned by the ε-greedy              automatic online algorithm selection method. During the
      and lower confidence bound strategy do not outperform               online phase an algorithm’s regression model is extended
      the model learned by the greedy strategy. Note however              only with datapoints for which the algorithm was predicted
      that the additional information obtained during the on-             to be best. Hence the new datapoints are all clustered in
      line phase does result in better models than the greedy-            the same region(s) of the problem domain. Note also that
100                                                                                      H. Degroote, B. Bischl, L. Kotthoff, P. De Causmaecker

      the region(s) where an algorithm is best is likely to change       be to perform random or round-robin selection until suf-
      slightly every time a new instance is handled, as with the         ficient samples have been collected for each algorithm to
      changing of an algorithm’s regression model all points in          construct a regression model. Interesting challenges would
      the domain where the algorithm’s predicted performance             be to include the option to add new algorithms at runtime
      was better than that of another algorithm’s are likely to          and even identifying which kind of instances are hard for
      move slightly. Then again, in a sense the property that            all algorithms, thereby inspiring the development of a new
      datapoints are mostly collected in the area where an algo-         algorithm that performs well on these instances which can
      rithm is expected to be best is desirable. Knowing with            then be added to the system.
      high accuracy how poorly an algorithm performs on in-                 Other future work consists of implementing solu-
      stances where it is bad is useless in this context whereas         tion strategies specifically designed for the contextual
      accurate predictions on instances for which the algorithm          multi-armed bandit problem which are more theoretically
      is likely to be one of the best are very relevant. However,        founded, for example LinUCB [7].
      note that predicting performance accurately is not the goal
      itself. What is important is that the actual best algorithm
                                                                         6   Conclusions
      is the algorithm with predicted best score. The selection
      mapping does not change if a fixed value is added to each          A reinforcement learning methodology for automatic on-
      performance prediction.                                            line algorithm selection has been introduced. It is limited
         At the start of this project it was thought that the explicit   to automatic algorithm selection methods based on perfor-
      exploration would be useful. Current and future work is in-        mance predictions for each individual algorithm. It has
      vestigating why this does not appear to be the case. There         been shown experimentally that the method is capable of
      are two main hypotheses.                                           learning from online data and thereby improves on auto-
         The first hypothesis is that the amount of exploration          matic offline algorithm selection methods.
      data collected during the online phase is negligible com-             It has been shown that automatic online algorithm selec-
      pared to the data gathered during the training phase, thus         tion can be modelled as a contextual multi-armed bandit
      the influence of the exploration cannot be observed. A             problem.
      training set of 100 instances for 5 algorithms can be seen            A total of three solution strategies have been imple-
      as a combination of 100 greedy choices and 400 explo-              mented and empirically tested: an approach that always
      rative choices. The epsilon greedy strategy will explore           greedily selects the best algorithm and two approaches
      5% of the time, resulting in on average 50 new explorative         that perform exploration: ε-greedy and lower confidence
      datapoints during an online phase of 1000 instances. This          bound. The experiments suggest that the greedy strategy
      hypothesis is currently being investigated                         outperforms the explorative strategies.
         The second hypothesis is that exploration is already
      implicitly performed by the greedy strategy, rendering
      additional explicit exploration unnecessary. The greedy            Acknowledgements
      method is greedy in the sense that it always selects the           Work supported by the Belgian Science Policy Office
      best algorithm, but which the best algorithm is depends            (BELSPO) in the Interuniversity Attraction Pole COMEX.
      from instance to instance, thus over time performance dat-         (http://comex.ulb.ac.be).
      apoints for all algorithms are collected. In this way the             The computational resources and services used in this
      greedy strategy implicitly explores. Investigating this hy-        work were provided by the VSC (Flemish Supercomputer
      pothesis is future work.                                           Center), funded by the Research Foundation - Flanders
         In order to better quantify the improvements realised           (FWO) and the Flemish Government – department EWI
      during the online phase, future work is to investigate the
      way in which the selection model improves in detail, by
      not only evaluating the overall models before and after the        References
      online phase, but also at several points during the online
                                                                          [1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time anal-
      phase and by also dropping down a level and investigating
                                                                              ysis of the multiarmed bandit problem. Machine learning,
      how the individual regression models (one for each algo-                47(2-3):235–256, 2002.
      rithm) evolve over time.
                                                                          [2] B. Bernd. mlr: A new package to conduct machine learning
         In future work the overhead of retraining the models                 experiments in r.
      should be explicitly considered and quantified in order to
                                                                          [3] B. Bischl, P. Kerschke, L. Kotthoff, M. Lindauer, Y. Mal-
      be able to quantify the net improvement of using the on-                itsky, A. Fréchette, H. Hoos, F. Hutter, K. Leyton-Brown,
      line data. In the experiments here reported this overhead               K. Tierney, et al. Aslib: A benchmark library for algorithm
      is ignored.                                                             selection. arXiv preprint arXiv:1506.02465, 2015.
         An interesting path for future work is te develop an al-         [4] B. Bischl, M. Lang, O. Mersmann, J. Rahnenführer, and
      gorithm that learns how to perform automatic online al-                 C. Weihs. BatchJobs and BatchExperiments: Abstraction
      gorithm selection form scratch, without any training data               mechanisms for using R in batch environments. Journal of
      whatsoever. A straightforward initial methodology would                 Statistical Software, 64(11):1–25, 2015.
Reinforcement Learning for Automatic Online Algorithm Selection – an Empirical Study                                                           101

       [5] B. Bischl, O. Mersmann, H. Trautmann, and M. Preuss. Al-            vkar, and Y. Shoham. Understanding random sat: Be-
           gorithm selection based on exploratory landscape analysis           yond the clauses-to-variables ratio. In Principles and Prac-
           and cost-sensitive learning. In Genetic and Evolutionary            tice of Constraint Programming–CP 2004, pages 438–452.
           Computation Conference (GECCO), 2012.                               Springer, 2004.
       [6] P. B. Brazdil, C. Soares, and J. P. Da Costa. Ranking learn-   [23] J. R. Rice. The algorithm selection problem. Advances in
           ing algorithms: Using ibl and meta-learning on accuracy             Computers, 15:65–118, 1976.
           and time results. Machine Learning, 50(3):251–277, 2003.       [24] K. A. Smith-Miles. Cross-disciplinary perspectives on
       [7] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contex-               meta-learning for algorithm selection. ACM Computing
           tual bandits with linear payoff functions. In International         Surveys (CSUR), 41(1):6, 2009.
           Conference on Artificial Intelligence and Statistics, pages    [25] L. Xu, F. Hutter, H. Hoos, and K. Leyton-Brown. Evaluat-
           208–214, 2011.                                                      ing component solver contributions to portfolio-based algo-
       [8] V. A. Cicirello and S. F. Smith. The max k-armed bandit:            rithm selectors. In Theory and Applications of Satisfiability
           A new model of exploration applied to search heuristic se-          Testing–SAT 2012, pages 228–241. Springer, 2012.
           lection. In AAAI, pages 1355–1361, 2005.                       [26] L. Xu, F. Hutter, J. Shen, H. H. Hoos, and K. Leyton-
       [9] H. Degroote and P. De Causmaecker. Towards a knowl-                 Brown. Satzilla2012: Improved algorithm selection based
           edge base for performance data: A formal model for per-             on cost-sensitive classification models. Proceedings of SAT
           formance comparison. In Proceedings of the Companion                Challenge, pages 57–58, 2012.
           Publication of the 2015 on Genetic and Evolutionary Com-
           putation Conference, pages 1189–1192. ACM, 2015.
      [10] T. Doan and J. Kalita. Selecting machine learning algo-
           rithms using regression models. In 2015 IEEE Interna-
           tional Conference on Data Mining Workshop (ICDMW),
           pages 1498–1505. IEEE, 2015.
      [11] M. Gagliolo and J. Schmidhuber. Learning dynamic al-
           gorithm portfolios. Annals of Mathematics and Artificial
           Intelligence, 47(3-4):295–328, 2006.
      [12] M. Gagliolo and J. Schmidhuber. Algorithm selection
           as a bandit problem with unbounded losses. In Interna-
           tional Conference on Learning and Intelligent Optimiza-
           tion, pages 82–96. Springer, 2010.
      [13] B. A. Huberman, R. M. Lukose, and T. Hogg. An Eco-
           nomics Approach to Hard Computational Problems. Sci-
           ence, 275(5296):51–54, 1997.
      [14] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown. Algo-
           rithm runtime prediction: Methods & evaluation. Artificial
           Intelligence, 206:79–111, 2014.
      [15] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz,
           and M. Sellmann. Algorithm selection and scheduling.
           In Principles and Practice of Constraint Programming–CP
           2011, pages 454–469. Springer, 2011.
      [16] L. Kotthoff. Llama: leveraging learning to automatically
           manage algorithms. arXiv preprint arXiv:1306.1031, 2013.
      [17] L. Kotthoff. Algorithm Selection for Combinatorial Search
           Problems: A Survey. AI Magazine, 35(3):48–60, 2014.
      [18] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mon-
           drian forests: Efficient online random forests. In Advances
           in Neural Information Processing Systems, pages 3140–
           3148, 2014.
      [19] A. Liaw and M. Wiener. Classification and regression by
           randomforest. R news, 2(3):18–22, 2002.
      [20] Y. Malitsky, A. Sabharwal, H. Samulowitz, and M. Sell-
           mann. Non-model-based algorithm portfolios for sat. In
           Theory and Applications of Satisfiability Testing-SAT 2011,
           pages 369–370. Springer, 2011.
      [21] T. Messelis and P. De Causmaecker. An automatic al-
           gorithm selection approach for the multi-mode resource-
           constrained project scheduling problem. European Journal
           of Operational Research, 233(3):511–528, 2014.
      [22] E. Nudelman, K. Leyton-Brown, H. H. Hoos, A. De-