
Eindhoven University of Technology

MASTER

Evaluating a multi-armed bandit-based approach to hyperparameter optimization

van Knippenberg, M.S.

Award date: 2018


Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.


Evaluating a Multi-Armed Bandit-Based Approach to Hyperparameter Optimization

January 2018

Marijn van Knippenberg

Fujitsu Laboratories Ltd. supervisor: Dr. Akira Ura

Eindhoven University of Technology supervisor: Dr. Joaquin Vanschoren


Abstract

A recently published hyperparameter optimization method, named Hyperband, is implemented and evaluated. Hyperband casts optimization as an adaptive resource allocation problem, attempting to efficiently allocate resources among randomly sampled hyperparameter configurations. Shortcomings are identified in the setup proposed by the original authors, in the form of inefficient configuration sampling in later stages of the algorithm. An extension is proposed that aims to increase the effectiveness of these later samples by tracking knowledge of the solution space while the algorithm is running. As the algorithm progresses, knowledge increases, which offsets the increasing inefficiency of the sampling. The extension is implemented and compared to the original method, and though it shows no definite improvement, some of the results are encouraging. Finally, some ideas are presented for directing future work in relation to Hyperband.

The work reported in this document was part of an international internship at Fujitsu Laboratories Ltd. in Kawasaki, Japan, during the months of February and March of 2017.


Contents

1 Introduction
2 Background
3 Hyperband
   3.1 Successive Halving
   3.2 Approach
   3.3 Results
   3.4 Evaluation
4 Proposed Extension
   4.1 Approach
   4.2 Experiments
   4.3 Evaluation
5 Conclusion
6 Future Work
A Successive Halving Visualized
B Gaussian Process Search Space Model Examples


1 Introduction

Hyperparameter optimization is an active field of research in the Machine Learning community that aims to make model construction cheaper and easier. The rapidly growing popularity of Machine Learning has resulted in a lack of experts: people who can effectively match models to situations. An expert will create and configure an initial model based on previous experience and intuition, and then tune it until desirable performance is reached. This skill takes time and effort to build, so it is attractive to look for an automated solution.

In the larger scheme of data analysis, hyperparameter optimization is only concerned with obtaining the model that produces the most accurate predictions. Data preparation and deployment are not included. In parts of this research, selecting the best algorithm is considered part of the optimization process; the choice of algorithm is one of the hyperparameters.

Figure 1: Data analysis flowchart. Orange: pre-processing, red: model construction and tuning, green: production.


2 Background

The classical approaches to hyperparameter optimization are random search and grid search. Grid search evaluates hyperparameter configurations at set intervals for each hyperparameter, while random search samples each hyperparameter at random. The main disadvantage of grid search is that, for a fixed number of evaluations, it explores only a small number of distinct values per individual hyperparameter, whereas random search offers no guarantee that the samples are nicely distributed over the search space. Previous work shows that random search is often better than grid search, especially in terms of the time taken until the first acceptable solution is found [2].

Figure 2: Conventional hyperparameter selection patterns.
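To make the two patterns of Figure 2 concrete, the snippet below contrasts them for two hypothetical hyperparameters; the value ranges are purely illustrative.

```python
import itertools
import random

# Illustrative two-dimensional search space: learning rate and regularization strength.
lr_values = [0.001, 0.01, 0.1]
reg_values = [0.0001, 0.001, 0.01]

# Grid search: every combination of a few fixed values per hyperparameter.
grid_configs = list(itertools.product(lr_values, reg_values))

# Random search: the same number of configurations, with each hyperparameter
# drawn independently (here log-uniformly) from its range.
random.seed(0)
random_configs = [(10 ** random.uniform(-3, -1), 10 ** random.uniform(-4, -2))
                  for _ in range(len(grid_configs))]
```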

Over the years, various methods have been developed to improve hyperparameter optimization, including those based on Random Forests (SMAC) [3], Gaussian Processes (SpearMint) [7], or Kernel Density Estimation (TPE) [1].

All of these methods have reported a significant performance increase over traditional random search, but recent work shows that these gains are often negligible [5]. An accusing finger is pointed at ranking graphs, which only relate the performance of methods in relative terms, instead of showing absolute gains. The actual gains for such a ranking graph are shown on the right in Figure 3. Note that SMAC and TPE seem to perform noticeably better than random search in the ranking graph, but that the actual difference in test error is very minimal. Another interesting point that is raised is that running random search with twice as many computational resources ("random 2x") results in performance that at least matches SMAC and TPE. Although this is not a fair comparison to make, it is telling that sophisticated approaches can be rivaled so easily. This suggests that random search in itself is very much a viable search method, even when compared to those more sophisticated options. Since it shows plenty of potential, there may be merit in working on methods that apply random search more efficiently, instead of working towards methods that replace it altogether. Finally, some of the previously proposed methods are intrinsically sequential and thus difficult to implement in a concurrent manner.


Figure 3: Performance of random sampling compared to SMAC and TPE. "random 2x" denotes random search with twice the computational resources [5].

3 Hyperband

An algorithm that tries to improve upon the basic random search strategy is Hyperband [5]. By incorporating ideas from the multi-armed bandit problem, the authors claim to have devised a method that beats other hyperparameter optimization methods in many cases. This section will give an overview of the algorithm, including its background and experimental results.

3.1 Successive Halving

Hyperband is based on a more fundamental algorithm called "Successive Halving" [4]. It approaches hyperparameter optimization as a multi-armed bandit problem. In such a problem, there are multiple available solutions, and it is not immediately clear which of them is optimal. The quality of a solution can only be obtained by applying it. Moreover, the quality of a solution may change as it is applied more often, either because applying it changes it, or because applying it involves some random element. The goal is to apply solutions in such a way that the sum of the results of applying each solution is as high as possible, given only a limited number of tries (also called the "budget"). To achieve this, a balance has to be found between "exploration" and "exploitation": exploration in the sense that trying new solutions may reveal one of better quality, and exploitation in the sense that the payoff of continuing to use a known, reasonably good solution may outweigh the cost of searching for a better one. In the case of Hyperband, each solution is a hyperparameter configuration, and continuing to use a solution means continuing to train a model with that set of hyperparameters.

Given the very limited time frame of the project, only the "fixed" version of Hyperband is considered. An "infinite horizon" version is provided by Hyperband's authors for cases in which it is not possible or not desirable to set some of the limits required for the fixed version. Given a set of n initial hyperparameter configurations to consider and B available budget units to spend (hours, number of samples, etc.), the algorithm performs ⌈log₂ n⌉ iterations. The set of remaining configurations in iteration k is Sk, with S0 = [n]. In each iteration, each remaining configuration is trained with ⌊B / (|Sk| · ⌈log₂ n⌉)⌋ budget units. Next, the trained configurations are sorted by performance, and the worst half are discarded. An example state is shown in Figure 4, and a full visualization of all iterations for a single execution is shown in Appendix A. In other words, a group of hyperparameter configurations is trained on a fixed budget, with intermediate evaluations that discard the worst-performing half of the configurations, until only one configuration remains. Note that this approach works for both online and offline training methods, as the budget increases in each iteration, and it makes no assumptions about the training algorithm or hyperparameters.
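As a minimal sketch of the fixed-budget formulation just described (not the authors' reference implementation), Successive Halving could be written as follows; configurations are plain dictionaries and train_and_evaluate is a placeholder callable that spends the given budget on a configuration and returns its validation loss.

```python
import math

def successive_halving(configs, budget, train_and_evaluate):
    """Minimal sketch of fixed-budget Successive Halving.

    configs: list of hyperparameter configurations (e.g. dicts).
    budget: total budget B for this run, in abstract budget units.
    train_and_evaluate(config, units): placeholder that trains `config` with
        `units` additional budget and returns its current validation loss.
    """
    remaining = list(configs)
    iterations = max(1, math.ceil(math.log2(len(configs))))
    best = None
    for _ in range(iterations):
        # each surviving configuration receives floor(B / (|Sk| * ceil(log2 n))) units
        units = budget // (len(remaining) * iterations)
        scored = sorted(((train_and_evaluate(c, units), c) for c in remaining),
                        key=lambda pair: pair[0])
        best = scored[0]                                   # (loss, config) of the current leader
        # discard the worst-performing half
        remaining = [c for _, c in scored[: max(1, len(scored) // 2)]]
    return best
```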

Although this provides a straightforward framework for picking a configuration given a set budget, it is not free of issues. Mainly, given a setting and budget, it is not clear how large n should be. Each configuration has a "loss curve": the validation loss as a function of the budget expended on the configuration (usually the number of training samples), see Figure 5. In the discarding step of the algorithm, the progress of the loss curves of the different configurations is compared. Since the budget is fixed, a larger number of configurations means less budget for each individual configuration, which in turn means that evaluations occur earlier in the training process. If the loss curves of the configurations were similar enough in characteristics, this would be fine; however, this is usually not the case. As a result, a situation such as the one in Figure 5 may occur, where configurations are improperly discarded because the evaluation occurs too early. On the other hand, if evaluation occurs very late, budget will have been wasted on configurations that would have been discarded anyway. The challenge then becomes finding a value for n such that evaluations occur neither too early nor too late. This is what Hyperband attempts to achieve.

Figure 4: Successive Halving example. Each square represents a hyperparameter configuration, and each row represents an iteration of the algorithm. Red, crossed-out squares indicate configurations that were eliminated, while green ones indicate configurations that were passed on to the next iteration of the algorithm. In this example, the algorithm has just entered the final iteration, and is about to train a model with the single remaining configuration to determine its final performance.


Figure 5: Validation loss as a function of expended budget. The dashed lines indicate possible evaluation moments. If the number of configurations is large, the evaluation moment will be earlier, which may result in better configurations being eliminated because their loss stabilizes more slowly.

3.2 Approach

Hyperband attempts to fix Successive Halving's shortcoming simply by running it multiple times. It argues that since it is impossible to make any useful assumptions about the (relative) shapes and characteristics of the loss curves in general, the best option is to try different numbers of initial hyperparameter configurations. Each of these runs is called a "bracket" in Hyperband, and each bracket is run with a different value for n, see Figure 6. Taking different values for n results in different evaluation moments, more or less performing a "grid search" over the best value for n. Each bracket uses approximately the same, fixed budget of B units, and as such corresponds to a different trade-off between the number of initial configurations and the number of budget units allocated to each configuration.
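Building on the successive_halving sketch from Section 3.1, the bracket loop could then look roughly like the sketch below; sample_configurations is a placeholder (uniform random sampling in plain Hyperband), and halving n between brackets is one possible schedule consistent with Figure 6.

```python
def hyperband(num_brackets, budget_per_bracket, n_initial,
              sample_configurations, train_and_evaluate):
    """Rough sketch of the fixed-horizon Hyperband loop, reusing successive_halving above.

    Each bracket runs Successive Halving with roughly the same total budget but a
    different number of initial configurations.
    """
    best = None
    n = n_initial
    for _ in range(num_brackets):
        configs = sample_configurations(max(1, n))         # placeholder sampler
        loss, config = successive_halving(configs, budget_per_bracket, train_and_evaluate)
        if best is None or loss < best[0]:
            best = (loss, config)
        n //= 2                                             # later brackets: fewer, longer-trained configs
    return best
```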


Figure 6: Example of a Hyperband run. Each table represents a bracket, i.e. a single Successive Halving run. Note that through the distribution of initial configurations, the total budget used by each bracket is approximately the same. Trained configurations from the final iteration of each bracket (shown here in white) are compared to find the overall best configuration.

3.3 Results

The results posted by Hyperband are certainly interesting. Please refer to Figure 7 for some of the results. Most striking is that Hyperband is an order of magnitude faster in finding a (near-)final solution. Also note that in some of these cases, random search is very competitive. Taking all experiments from the original paper, Hyperband finds a better solution than all other methods in all cases, with the exception of the SVHN (Street View House Numbers) dataset, where it finds a better solution in one out of five cases. In all experiments, only SMAC with early stopping [3] manages to beat Hyperband in some cases. These are very promising results, especially considering the simplicity of Hyperband.


Figure 7: Comparison of Hyperband and other hyperparameter optimization methods. "bracket s=4" indicates a run of Hyperband in which all brackets use the same setup as the first bracket shown in Figure 6 (highest number of initial configurations). The hyperparameters influenced the behavior of the stochastic gradient descent algorithm (6 hyperparameters) and the regularization layers (2 hyperparameters) [5].

3.4 Evaluation

Based on the approach and the results, two issues are identified. First, as shown in the comparison graphs above, later brackets have a very minimal effect on the process of identifying the best solution, even though each bracket has the same budget cost. In most cases, early brackets find a near-best solution. In other words, exploration is rewarded much more than exploitation. Considering the way Hyperband works, this is most likely due to the fact that each consecutive bracket samples fewer and fewer hyperparameter configurations. In the last bracket, a handful of configurations is fully trained; the chance that this improves the current best solution is very small. This also brings up the second issue: absolutely no information is shared between brackets. All effort, except the current best solution, is simply discarded once a bracket is finished. If the information that all intermediate training results provide can be consolidated and then leveraged, it may be possible to make configuration sampling more informed in later brackets, effectively counteracting the decrease in meaningful progress that is caused by the smaller number of configuration samples. If these two effects cancel each other out, performance gains should increase in later brackets, which would justify including them.


4 Proposed Extension

To improve Hyperband, a method needs to be found that can gather information from training results and somehow use this information to make informed decisions about future sampling. The method proposed in this section attempts to do this by using the information to limit the search space of hyperparameter configurations. Shrinking the search space can counteract the decrease in the number of samples in later brackets; it effectively attempts to maintain sampling density. Note that this is reminiscent of, but different from, classical Thompson sampling. Where the goal of Thompson sampling is to strike an optimal balance between exploration and exploitation by directly determining in which manner configurations are sampled, the only job of the extension is to limit Hyperband's sampling whenever it samples. The actual trade-off between exploration and exploitation is performed by Hyperband itself in the form of its brackets.

Figure 8: Limiting the search space for some hyperparameters HP1 and HP2. Red: configurations with bad performance, green: configurations with good performance, gray: new configurations.

4.1 Approach

One of the issues with gathering intermediate training results is that not all results are equal: the budget expended for a result differs depending on which iteration of the Successive Halving algorithm it comes from. To be able to consolidate all these training results, a model is needed that can account for this difference.

For this, we take inspiration from SpearMint [7], an optimization method based on Gaussian Processes. In short, a Gaussian Process attempts to find a distribution over all functions that are consistent with the observed data. It starts with an initial distribution that is continuously refined as more data is provided. This gives us a non-parametric approach, i.e. the considered functions can have any number of different parameters. However, it is not feasible to consider all possible functions that can exist in a domain. To guide the Gaussian Process, a covariance function is provided that constrains the type of functions that can be considered. It generally determines how smooth functions must be, as well as their general shape. For this study, the Matérn 5/2 kernel was used. This kernel has been used successfully in the past in similar settings [7]. It is flexible enough to allow many functions to be modeled, but can be restricted sufficiently to produce useful models. Gaussian Processes can generally be created and updated quickly if their kernel function is not overly complex and, perhaps more importantly, can be sampled from very quickly. This is expected to place only a very low overhead on the total computation cost, especially compared to the computational resources invested in training and evaluating models.

The differing amounts of budget expended on training each configuration are interpreted as uncertainty regarding the loss value: as more budget units are expended to train with a certain configuration, the uncertainty about the loss of the resulting model decreases. Training with the maximum number of allowed budget units is considered to eliminate any uncertainty, even though a model could theoretically be trained ad infinitum. Additional benefits of using a Gaussian Process are the ability to update the Process iteratively (it does not have to be rebuilt for each bracket) and its speed when updating and sampling. It also makes for intuitive visualizations.
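As an illustration of this interpretation (not the exact implementation used in this work), a Gaussian Process with a Matérn 5/2 kernel can be given per-observation noise that shrinks as more budget is spent on the corresponding configuration; the observations below are made up.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Made-up observations: one hyperparameter value per row, the validation loss
# observed for it, and the fraction of the maximum budget spent on it.
X = np.array([[0.10], [0.35], [0.60], [0.90]])
losses = np.array([0.42, 0.31, 0.28, 0.55])
budget_fraction = np.array([0.25, 0.50, 1.00, 0.125])

# Less expended budget -> more observation noise; a fully trained configuration
# (fraction 1.0) is treated as (almost) noise-free.
noise = 0.05 * (1.0 - budget_fraction) + 1e-6

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=noise, normalize_y=True)
gp.fit(X, losses)

grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)
```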

Figure 9: Example of a Gaussian Process prior and posterior estimating a function. The red points are observed data with no uncertainty, the black line indicates the mean, and the gray area denotes two standard deviations (image from http://scikit-learn.org/stable/modules/gaussian_process.html).

Given a Gaussian Process part-way through a Hyperband run, it can be sampled to determine which area(s) of the search space are of most interest. More concretely, the Expected Improvement (EI) criterion [6], which has been shown to work well in situations like these, can be calculated to represent this interest. Besides its simplicity, it also handles noise and uncertainty well, making it a popular choice. It is based on the utility function u(x) = max(f0 − f(x), 0), where f0 is the lowest loss observed so far, and f(x) is the mean of the Gaussian Process at x. This function rewards values of x according to how much smaller their projected loss is under the Gaussian Process, and awards 0 if the projected loss is higher. Since the Gaussian Process is a distribution over functions, we can take the expected value E[u(x)] to arrive at the Expected Improvement.
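Under a Gaussian posterior with mean μ(x) and standard deviation σ(x), this expectation has a well-known closed form, EI(x) = (f0 − μ(x))·Φ(z) + σ(x)·φ(z) with z = (f0 − μ(x))/σ(x). A small sketch, reusing the mean and std arrays from the Gaussian Process example above:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_loss):
    """EI for minimization: E[max(best_loss - f(x), 0)] under a Gaussian posterior."""
    std = np.maximum(std, 1e-12)              # guard against zero variance at observed points
    z = (best_loss - mean) / std
    return (best_loss - mean) * norm.cdf(z) + std * norm.pdf(z)

# e.g. with the GP sketch above: ei = expected_improvement(mean, std, losses.min())
```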

Figure 10: Given some Gaussian Process with past observations (top), areas in the search space that are of most interest for future sampling can be identified (bottom). The coloring and curve in the bottom image are illustrative; examples of actual interest can be found in Appendix B.

What remains is then to use the Expected Improvement along the entire search space to determine which configurations to sample next. Simply sampling from the location where the Expected Improvement is maximized is not sufficient, since we may need to sample many points. Additionally, depending on the size of the search space, finding this point may be difficult. Sampling from only a single point (or area) also introduces the significant risk of stepping into a local optimum very early. It is therefore better to treat the EI as a probability distribution and draw samples from it that way. Since the EI can take any shape, a sampling method is needed that can sample from arbitrary distributions. Slice sampling is a Markov Chain Monte Carlo algorithm that does just that:

1. Take some starting value x0 for which f(x0) > 0.
2. Sample a value y1 uniformly between 0 and f(x0).
3. Draw a horizontal line at the sampled y1 position.
4. Sample a point x1 uniformly from the part of that line that lies under the curve, i.e. where f(x) > y1.
5. Yield x1 and repeat using x1 as the new starting value.

The intuition behind the method is that of a random walk over the area under the curve. This allows for efficient sampling of arbitrary functions, which is especially convenient for sampling the EI, as it can be expected to be very small in large areas as the Hyperband run progresses (large areas in the configuration search space will map to large losses).
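A minimal one-dimensional slice sampler along these lines is sketched below, using the common stepping-out and shrinkage procedure rather than a fixed window; f would be the Expected Improvement over a single hyperparameter.

```python
import random

def slice_sample(f, x0, num_samples, width=0.5, max_steps=50):
    """Draw samples proportional to a non-negative function f using slice sampling.

    f: unnormalized density (here: the Expected Improvement over the search space).
    x0: starting point with f(x0) > 0.
    width: initial size of the horizontal window around the current point.
    """
    samples = []
    x = x0
    for _ in range(num_samples):
        y = random.uniform(0, f(x))               # vertical level defining the slice
        # step out a window [lo, hi] until it brackets the slice at height y
        lo = x - width * random.random()
        hi = lo + width
        steps = 0
        while f(lo) > y and steps < max_steps:
            lo -= width
            steps += 1
        steps = 0
        while f(hi) > y and steps < max_steps:
            hi += width
            steps += 1
        # shrink the window until a point inside the slice is found
        while True:
            x_new = random.uniform(lo, hi)
            if f(x_new) > y:
                x = x_new
                break
            if x_new < x:
                lo = x_new
            else:
                hi = x_new
        samples.append(x)
    return samples
```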

This entire approach can be integrated into Hyperband quite simply. Initialize an empty Gaussian Process at the start of the Hyperband run and, at the end of each bracket, update the GP with the observations that were made in that bracket. Since results have to be collected at the end of each bracket anyway, this imposes no restrictions on the potential parallelization of the algorithm. Then use the Gaussian Process to sample hyperparameter configurations for the next bracket. Interestingly, the GP can easily be stored at any point for later re-use. The only unaddressed point is the issue of sampling for the first bracket. After all, at that point the GP has not yet received any observations and is effectively useless. As before, we can simply sample uniformly at random from the entire configuration search space, but we can do a bit better than that. It would be nice if the first batch of samples was spread out nicely over the search space, regardless of the number of dimensions of the search space, because this would also spread out the uncertainty in the Gaussian Process after the first Hyperband bracket. This is especially important since the first bracket samples the most configurations, and because most of those configurations will be trained only very briefly, producing a large amount of uncertainty.

Although random uniform sampling can be expected to give reasonable coverage, quasi-random sampling can actually guarantee good coverage, up to a certain number of dimensions. Quasi-random sampling is usually implemented with low-discrepancy sequences, which generate deterministic values that evenly cover the search space, regardless of the length of the sequence. Using such a sequence to replace uniform random sampling has been shown to improve over plain random search in classic machine learning settings, providing an additional benefit. In particular, Sobol sequence sampling was used in this study, based on comparisons between it, random sampling, and Latin Hypercube sampling [2].
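As one way to obtain such a sequence (assuming SciPy 1.7 or later; the bounds below are purely illustrative), the scipy.stats.qmc module provides scrambled Sobol sampling:

```python
from scipy.stats import qmc

# Two hyperparameters with illustrative ranges, e.g. log10(C) in [-3, 3]
# and log10(gamma) in [-5, 1].
sampler = qmc.Sobol(d=2, scramble=True, seed=0)
unit_samples = sampler.random(n=64)                     # 64 points evenly covering [0, 1)^2
configs = qmc.scale(unit_samples, l_bounds=[-3.0, -5.0], u_bounds=[3.0, 1.0])
```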

Figure 11: Examples of random sampling and quasi-random (low-discrepancy) sampling.


4.2 Experiments

For the experiments, two important metrics are considered: how does the extension impact accuracy, and how does it impact running time. The impact of the extension in terms of scalability was also considered, but the duration of the project and the limited computational resources limited the experiments to the two metrics above. Three datasets were used for the majority of the experiments:

Dataset     #samples   #features   #classes
Cod RNA     488565     9           2
Covertype   581012     54          2
MNIST       70000      784         10

For debugging and testing purposes, numerous smaller datasets from OpenML [9] were used. All datasets were split into appropriate training, validation, and test sets. Next, five runs of Hyperband were performed for each combination of dataset, machine learning algorithm, and either classic Hyperband or the extended version. For the machine learning algorithm, the choices were logistic regression, random forest, SVM, or a setting in which the choice among the three is an additional hyperparameter. The choice of machine learning algorithms is intended to provide coverage of differing complexity and ability.

Figure 12 shows a generated Gaussian Process when training an SVM model on the Covertype dataset.


Figure 12: Example Gaussian Process and Expected Improvement. The image on the right shows the result of an intensive grid search for hyperparameters C and gamma. Hyperband was run while only optimizing for C, covering the area in green. The top-left image shows the Gaussian Process (mean and two standard deviations in black and gray), the observations (red bars for non-fully trained configurations, with larger bars indicating greater uncertainty, i.e. configurations eliminated earlier in the process, and green points for fully trained ones), and the current best configuration as a dashed line. The bottom-left graph shows the Expected Improvement as calculated from the Gaussian Process, and the values of C that were sampled from it for the next Successive Halving run.

Sadly, none of the combinations outlined above showed any significant accuracy improvements when using extended Hyperband compared to classic Hyperband. In all cases, the difference in accuracy was less than 0.005, especially on smaller datasets. The only positive note is that while no improvements are visible, neither are there any detriments to using the extended version. Running time-wise, the extended version takes more time; it needs to maintain the Gaussian Process and perform more involved sampling, after all. The timing results (averaged over five trials) are shown in Figure 13. Note that the extra time remains fairly constant, regardless of learning algorithm or dataset. This is because Hyperband always produces approximately the same number of observations (which are used to update the Gaussian Process). Given that the total training time for these datasets usually took over thirty minutes, the running time impact of the extension is minimal.

Figure 13: Extra time taken by the extended Hyperband compared to classic Hyperband, in seconds.

4.3 Evaluation

Experiments show that the extension is able to model hyperparameter configuration search spaces, and that it can do so with only a small time penalty. The lack of accuracy improvement may be caused by one or more of the following factors:

• During code review, a significant error was found in the function that samples new hyperparameter configurations based on Expected Improvement. There was no opportunity for significant additional testing after the code review.

• The current sampling method may not suffice. Examples in Appendix B show that sampling does not always produce the best results, which becomes increasingly problematic as the number of samples decreases. This problem was present even after the code review, and may be fixed by choosing a different sampling technique, or a different approach to slice sampling (instead of the window approach).

• It is possible that, for the evaluated combinations of algorithms and datasets, Hyperband itself already finds (near-)optimal solutions. In this case, the effect of the extension would not be visible unless more challenging scenarios are considered. Fine-grained grid search results that were available at Fujitsu for some of the datasets indicate that this is likely the case for at least some of the combinations of datasets and learning algorithms.

• Perhaps the solution space is too small to observe a noticeable difference. The solution spaces were kept small to help keep experiments manageable in the short time frame of the internship. Additionally, pre-computed grid search solutions were not available for larger solution spaces.


5 Conclusion

Hyperband is a recently published hyperparameter optimization algorithm that leverages the apparent power of random sampling. Its focus lies on trying many different hyperparameter configurations and eliminating configurations as soon as possible. It lends itself to easy parallelization and seems to perform well when compared to other hyperparameter optimization techniques. The proposed extension attempts to improve the power of the Hyperband structure by leveraging intermediate results, which are ignored in the classical version. This extension has been shown to be able to model the search space, and imposes only a small time penalty on the overall Hyperband process. Sadly, the extension is not able to produce any accuracy improvements; the reasons for this are unclear and would require more extensive experiments to uncover.


6 Future Work

An observation which can be made from Hyperband, but which was not used in this study, is that many of the hyperparameter configurations evaluated in a Hyperband run are evaluated multiple times with different budget values (training time, training data size, etc.). Such a series of values can be used to estimate the loss curve of a particular configuration. If multiple series are available for different configurations, it may be possible to effectively combine these into a loss surface. In practice, this would mean adding an extra dimension to the Gaussian Process, and extending the current choice of kernel so it accurately models loss curves. Modelling loss curves with Gaussian Processes has been studied before [8], and the kernel proposed there can be combined with the Matérn 5/2 kernel to produce a Gaussian Process that models the hyperparameter search space in one direction, and loss curves in another. Sampling the resulting surface at the end of the loss curves would provide predictions of the ultimate performance of any hyperparameter configuration. This approach was briefly attempted, and showed interesting results, but lack of time prevented any further study.
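A rough sketch of this idea is given below, using an anisotropic Matérn kernel over (hyperparameter, budget) as a crude stand-in for the loss-curve kernel of [8], which is not available in scikit-learn; the observations are made up.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Made-up observations: (hyperparameter value, fraction of maximum budget) -> validation loss.
X = np.array([[0.2, 0.25], [0.2, 0.50], [0.7, 0.25], [0.7, 0.50], [0.7, 1.00]])
y = np.array([0.50, 0.42, 0.45, 0.33, 0.30])

# Anisotropic Matern 5/2 over both axes; the budget axis would ideally use the
# loss-curve kernel of [8] instead.
gp = GaussianProcessRegressor(kernel=Matern(length_scale=[1.0, 1.0], nu=2.5),
                              normalize_y=True)
gp.fit(X, y)

# Predict final performance by sampling the surface at the end of the loss curves (budget = 1.0).
grid = np.column_stack([np.linspace(0.0, 1.0, 50), np.ones(50)])
final_mean, final_std = gp.predict(grid, return_std=True)
```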

Figure 14: Left: Gaussian Processes estimating loss curves for different hyperparameter values. Right: Loss curve estimation and hyperparameter estimation combined into a single Gaussian Process model.


References

[1] Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain (2011), pp. 2546–2554.

[2] Bergstra, J., and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (2012), 281–305.

[3] Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization - 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers (2011), pp. 507–523.

[4] Jamieson, K. G., and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016 (2016), pp. 240–248.

[5] Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560 (2016).

[6] Mockus, J. On Bayesian methods for seeking the extremum and their application. In IFIP Congress (1977), pp. 195–200.

[7] Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States (2012), pp. 2960–2968.

[8] Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw Bayesian optimization. CoRR abs/1406.3896 (2014).

[9] Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. OpenML: networked science in machine learning. SIGKDD Explorations 15, 2 (2013), 49–60.


A Successive Halving Visualized

The series of images below shows the repeated steps of the Successive Halving algorithm. Please refer to the legend below for the meaning of the various shapes and colors in the images. Note that "HP1" and "HP2" stand for generic hyperparameters.


B Gaussian Process Search Space Model Examples

The Gaussian Process is visualized in the upper plot of each figure. Red bars indicate non-fully trained configurations, green points indicate fully trained ones. The blue dashed line is the current minimum over all fully trained configurations. The black line shows the estimate of the process, with the gray area showing the deviation. The lower plot in each figure shows the Expected Improvement along the Gaussian Process. The red dots in this plot indicate points sampled for the next bracket of Hyperband.

