
Based on the approach and the results, two issues are identified. First, as the comparison graphs above show, later brackets contribute very little to finding the best solution, even though each bracket has the same budget cost. In most cases, an early bracket already finds a near-best solution; in other words, exploration is rewarded much more greatly than exploitation. Considering the way Hyperband works, this is most likely because each consecutive bracket samples fewer and fewer hyperparameter configurations. In the last bracket, only a handful of configurations is trained to completion, and the chance that one of them improves on the current best solution is very small. This brings up the second issue: absolutely no information is shared between brackets. All effort, except the current best solution, is simply discarded once a bracket is finished. If the information that all intermediate training results provide can be consolidated and then leveraged, sampling in later brackets can be made more informed, effectively counteracting the decrease in meaningful progress caused by the smaller number of sampled configurations. If these two effects cancel each other out, performance gains should increase in later brackets, which would justify including them.

4 Proposed Extension

To improve Hyperband, a method is needed that can gather information from training results and use that information to make informed decisions regarding future sampling. The method proposed in this section attempts to do this by using the information to limit the search space of hyperparameter configurations. Shrinking the search space can counteract the decrease in the number of samples in later brackets; it effectively attempts to maintain sampling density. Note that this is reminiscent of, but different from, classical Thompson sampling. Where the goal of Thompson sampling is to strike an optimal balance between exploration and exploitation by directly determining in which manner configurations are sampled, the only job of the extension is to limit where Hyperband samples whenever it is sampling. The actual trade-off between exploration and exploitation is still performed by Hyperband itself in the form of its brackets.

Figure 8: Limiting the search space for some hyperparameters HP1 and HP2.

Red: configurations with bad performance, green: configurations with good performance, gray: new configurations.

4.1 Approach

One of the issues with gathering intermediate training results is that not all results are equal: the budget expended for a result differs depending on which iteration of the Successive Halving algorithm the result comes from. To be able to consolidate all these training results, a model is needed that can account for this difference.

For this, we take inspiration from SpearMint [7], an optimization method based on Gaussian Processes. In short, a Gaussian Process attempts to find a distribution over all functions that are consistent with observed data. It starts with an initial distribution that is continuously refined as more data is provided.

This gives us a non-parametric approach, i.e. the considered functions can have any number of different parameters. However, it is not feasible to consider all possible functions that can exist in a domain. To guide the Gaussian Process, a covariance function is provided that constrains the type of functions that can be considered. It generally determines how smooth functions must be, as well as their general shape. For this study, the Matérn 5/2 kernel was used. This kernel has been used successfully in similar settings in the past [7]. It is flexible enough to allow many functions to be modeled, but can be restricted sufficiently to produce useful models. Gaussian Processes can generally be created and updated quickly if their kernel function is not overly complex and, perhaps more importantly, can be sampled from very quickly. This is expected to place only a very low overhead on total computation cost, especially compared to the computational resources invested in training and evaluating models.

The differing amounts of budget expended on training each configuration are interpreted as uncertainty regarding the loss value: as more budget units are expended to train with a certain configuration, the uncertainty about the loss of the resulting model decreases. Training with the maximum number of allowed budget units is considered to eliminate all uncertainty, even though a model could theoretically be trained ad infinitum. Additional benefits of using a Gaussian Process are that it can be updated iteratively (it does not have to be rebuilt each bracket) and that updating and sampling are fast. It also lends itself to intuitive visualizations.
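The budget-as-uncertainty interpretation above can be sketched with scikit-learn's Gaussian Process regressor and a Matérn 5/2 kernel, where the per-observation noise term shrinks as more budget was spent. The observation values, `max_budget`, and `noise_scale` below are illustrative assumptions, not values from this study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

max_budget = 81      # maximum budget units per configuration (assumed)
noise_scale = 0.5    # assumed scaling of budget-based uncertainty

# (configuration value, observed loss, budget units spent) -- illustrative
observations = [
    (0.10, 0.90, 1),
    (0.45, 0.40, 9),
    (0.70, 0.25, 27),
    (0.72, 0.22, 81),  # full budget: uncertainty considered eliminated
]

X = np.array([[c] for c, _, _ in observations])
y = np.array([loss for _, loss, _ in observations])
# Less budget spent -> larger noise on that observation; full budget -> ~zero.
alpha = np.array([noise_scale * (1 - b / max_budget) for _, _, b in observations])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),  # Matérn 5/2
                              alpha=alpha + 1e-10,    # per-point noise
                              normalize_y=True)
gp.fit(X, y)

mean, std = gp.predict(np.array([[0.5]]), return_std=True)
```

Updating the model after a bracket then amounts to refitting with the enlarged observation set; the fully trained point is trusted almost exactly, while briefly trained points only loosely constrain the posterior.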

Figure 9: Example of a Gaussian Process prior and posterior estimating a function. The red points are observed data with no uncertainty, the black line indicates the mean, and the gray area denotes two standard deviations1.

Given a Gaussian Process part-way through a Hyperband run, the Gaussian Process can be sampled to determine which area(s) of the search space are of most interest. More concretely, the Expected Improvement (EI) criterion [6], which has been shown to work well in situations like these, can be calculated to represent this interest. Besides its simplicity, it also handles noise and uncertainty well, making it a popular choice. It is based on the utility function u(x) = max(f0 − f(x), 0), where f0 is the lowest loss observed so far, and f(x) is the mean of the Gaussian Process at x. This function rewards values of x according to how much smaller their projected loss is under the Gaussian Process, or awards 0 if the projected

1http://scikit-learn.org/stable/modules/gaussian_process.html

loss is higher. Since the Gaussian Process is a distribution over functions, we can take the expected value E[u(x)] to arrive at the Expected Improvement.
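For a Gaussian posterior, E[u(x)] has a well-known closed form in terms of the posterior mean and standard deviation. A minimal sketch for the minimization setting described above (function and variable names are ours, not the paper's):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f0):
    """EI = E[max(f0 - f(x), 0)] under a Gaussian posterior N(mu, sigma^2).

    mu, sigma: posterior mean and standard deviation at candidate points.
    f0: lowest loss observed so far.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ei = np.zeros_like(mu)
    mask = sigma > 0                      # EI is 0 where there is no uncertainty
    z = (f0 - mu[mask]) / sigma[mask]
    ei[mask] = (f0 - mu[mask]) * norm.cdf(z) + sigma[mask] * norm.pdf(z)
    return np.maximum(ei, 0.0)

# Points with a mean below f0, or with high uncertainty, score highest.
ei = expected_improvement(mu=[0.3, 0.5, 0.2], sigma=[0.1, 0.1, 0.05], f0=0.25)
```

Note that even a point whose mean lies above f0 receives a small positive EI when its uncertainty is large, which is exactly the behavior that keeps under-trained configurations in play.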

Figure 10: Given some Gaussian Process with past observations (top), areas in the search space that are of most interest for future sampling can be identified (bottom). The coloring and curve in the bottom image are illustrative; examples of actual interest can be found in Appendix B.

What remains is then to use the Expected Improvement along the entire search space to determine which configurations to sample next. Simply sampling from the location where the Expected Improvement is maximized is not sufficient, since we may need to sample many points. Additionally, depending on the size of the search space, finding this point may be difficult. Sampling from only a single point (or area) also introduces a significant risk of stepping into a local optimum very early. It is therefore better to treat the EI as a probability distribution and draw samples from it. Since the EI can take any shape, a sampling method is needed that can sample from arbitrary distributions. Slice sampling is a Markov Chain Monte Carlo algorithm that does just that:

1. Take some starting value x0 for which f(x0) > 0.
2. Sample a value y1 uniformly between 0 and f(x0).
3. Draw a horizontal line at the sampled y1 position.
4. Sample a point (x1, y1) from the line.
5. Yield (x1, y1) and repeat using x1.

The intuition behind the method is that of a random walk over the area under the curve. This allows for efficient sampling of arbitrary functions, which is especially useful for sampling the EI, as it can be expected to be very small in large areas as the Hyperband run progresses (large areas in the configuration search space will map to large losses).
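The steps above can be sketched as a minimal 1-D slice sampler in the style of Neal's stepping-out/shrinkage scheme; the step width `w` and the example target density are assumptions for illustration, and `f` only needs to be an unnormalized density such as the EI curve.

```python
import math
import random

def slice_sample(f, x0, n, w=0.5):
    """Draw n samples from an unnormalized 1-D density f via slice sampling."""
    samples, x = [], x0
    for _ in range(n):
        y = random.uniform(0.0, f(x))        # slice height under the curve
        # "Step out": widen an interval until both ends fall below the slice.
        left, right = x - w, x + w
        while f(left) > y:
            left -= w
        while f(right) > y:
            right += w
        # "Shrinkage": propose uniformly on the interval, shrink on rejection.
        while True:
            x_new = random.uniform(left, right)
            if f(x_new) > y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples

# Example: sample from an unnormalized Gaussian-shaped bump
# (floored so f stays strictly positive, as step 1 requires).
draws = slice_sample(lambda x: max(1e-12, math.exp(-x * x)), x0=0.0, n=200)
```

Because each iteration only evaluates `f` pointwise, the same routine works on the EI regardless of its shape, which is the property the text relies on.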

This entire approach can be integrated into Hyperband quite simply. Initialize an empty Gaussian Process at the start of the Hyperband run and, at the end of each bracket, update the GP with the observations that were made in that bracket. Since results have to be collected at the end of each bracket anyway, this imposes no restrictions on the potential parallelization of the algorithm. Then use the Gaussian Process to sample hyperparameter configurations for the next bracket. Conveniently, the GP can also be stored at any point for later re-use. The only point left unaddressed is sampling for the first bracket. After all, at that point, the GP has not yet received any observations and is effectively useless. As before, we can simply sample uniformly at random from the entire configuration search space, but we can do a bit better than that. Ideally, the first batch of samples would be spread out evenly across the search space, regardless of the number of dimensions of the search space, because this would also spread out the uncertainty in the Gaussian Process after the first Hyperband bracket. This is especially important since the first bracket samples the most configurations, and most of those configurations will be trained only very briefly, producing a large amount of uncertainty.
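The integration described above can be summarized as a small control loop. This is a sketch only: the helpers `run_bracket`, `sample_from_ei`, and `sobol_initial_samples`, and the `gp.update` interface, are hypothetical placeholders standing in for the actual implementation.

```python
def hyperband_with_gp(brackets, gp, sobol_initial_samples, sample_from_ei,
                      run_bracket):
    """Sketch of the extended Hyperband loop (helper names are hypothetical)."""
    results = []
    for i, bracket in enumerate(brackets):
        if i == 0:
            # First bracket: the GP is empty, use a space-filling design.
            configs = sobol_initial_samples(bracket.n_configs)
        else:
            # Later brackets: sample where Expected Improvement is high.
            configs = sample_from_ei(gp, bracket.n_configs)
        # Run Successive Halving; yields (config, loss, budget) triples.
        observations = run_bracket(bracket, configs)
        # One incremental GP update per bracket, so parallelism is unaffected.
        gp.update(observations)
        results.extend(observations)
    return min(results, key=lambda obs: obs[1])  # best (lowest-loss) result
```

The single update per bracket is the key design point: it matches when results are gathered anyway, so the extension adds no extra synchronization.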

Although random uniform sampling can be expected to give somewhat good coverage, pseudo-random sampling can actually guarantee good coverage, up to a certain number of dimensions. Such pseudo-random sampling is usually implemented as pseudo-random sequences, which generate deterministic values that evenly cover the search space, regardless of the length of the sequence.

Using such a sequence in place of random uniform sampling has been shown to improve over random search in classic machine learning settings, providing an additional benefit. In particular, Sobol sequence sampling was used in this study, based on comparisons between it, random sampling, and Latin Hypercube sampling [2].
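A Sobol design for the first bracket can be generated with SciPy's quasi-Monte Carlo module; the hyperparameter ranges below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc

n_configs, n_dims = 16, 2                    # powers of two suit Sobol best
sampler = qmc.Sobol(d=n_dims, scramble=True, seed=0)
unit_points = sampler.random(n_configs)      # evenly covers the unit square

# Scale to the actual search space, e.g. learning rate and momentum (assumed).
lower, upper = [1e-4, 0.0], [1e-1, 0.99]
configs = qmc.scale(unit_points, lower, upper)
```

Unlike independent uniform draws, every prefix of the sequence already covers the space evenly, which is exactly the property wanted for the large, briefly trained first bracket.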

Figure 11: Examples of random sampling and pseudo-random sampling.
