
For the experiments, two important metrics are considered: how the extension impacts accuracy, and how it impacts running time. The impact of the extension on scalability was also considered, but the duration of the project and the limited computational resources restricted the experiments to the two metrics above. Three datasets were used for the majority of the experiments:

Dataset      #samples   #features   #classes
Cod-RNA        488565           9          2
Covertype      581012          54          2
MNIST           70000         784         10

For debugging and testing purposes, numerous smaller datasets from OpenML [9] were used. All datasets were split into appropriate training, validation, and test sets. Next, five runs of Hyperband were performed for each combination of dataset, machine learning algorithm, and Hyperband variant (classic or extended). The machine learning algorithm was logistic regression, random forest, SVM, or a choice among the three, where the choice of algorithm is itself an additional hyperparameter. These algorithms were chosen to cover a range of differing complexity and ability.
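To make the experimental grid concrete, the following sketch enumerates the runs described above. The run_hyperband() function is a hypothetical placeholder, not the actual experiment code, and the labels are illustrative.

    # Hypothetical sketch of the experimental grid described above;
    # run_hyperband() is a placeholder, not the actual experiment code.
    from itertools import product

    def run_hyperband(dataset, algorithm, variant, seed):
        """Placeholder: would perform one Hyperband run and return test accuracy."""
        raise NotImplementedError

    DATASETS = ["cod-rna", "covertype", "mnist"]
    ALGORITHMS = ["logistic_regression", "random_forest", "svm", "choice_of_three"]
    VARIANTS = ["classic", "extended"]
    N_RUNS = 5

    grid = list(product(DATASETS, ALGORITHMS, VARIANTS, range(N_RUNS)))
    print(len(grid), "Hyperband runs in total")   # 3 * 4 * 2 * 5 = 120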

Figure 12 shows a generated Gaussian Process when training an SVM model on the Covertype dataset.

Figure 12: Example Gaussian Process and Expected Improvement. The image on the right shows the result of an intensive grid search for hyperparameters C and gamma; Hyperband was run while only optimizing for C, covering the area in green. The top-left image shows the Gaussian Process (mean and two standard deviations in black and gray), the observations (red bars for non-fully trained configurations, with larger bars indicating greater uncertainty because they were eliminated earlier in the process, and green points for fully trained ones), and the current best configuration as a dashed line. The bottom-left graph shows the Expected Improvement as calculated from the Gaussian Process, and the values of C that were sampled from it for the next Successive Halving run.
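As a companion to the figure, the following is a minimal sketch of fitting a Gaussian Process over a single hyperparameter and computing Expected Improvement from it. The toy observations and the use of scikit-learn are illustrative assumptions, not the project's implementation; in particular, the sketch ignores the extra uncertainty that the extension attaches to non-fully trained configurations.

    # Minimal sketch of the machinery in Figure 12 for a single hyperparameter.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    # Toy (log C, validation loss) observations for fully trained configurations.
    X_obs = np.array([[-2.0], [0.0], [1.5], [3.0]])
    y_obs = np.array([0.42, 0.31, 0.28, 0.35])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)

    def expected_improvement(X, gp, y_best):
        """EI for minimization: expected amount by which X improves on y_best."""
        mu, sigma = gp.predict(X, return_std=True)
        sigma = np.maximum(sigma, 1e-12)         # avoid division by zero
        z = (y_best - mu) / sigma
        return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    X_grid = np.linspace(-3, 4, 200).reshape(-1, 1)
    ei = expected_improvement(X_grid, gp, y_obs.min())
    print("most promising log C:", X_grid[np.argmax(ei), 0])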

Sadly, none of the combinations outlined above showed any significant accuracy improvement when using extended Hyperband compared to classic Hyperband. In all cases, the difference in accuracy was less than 0.005, especially so on the smaller datasets. The only positive note is that while no improvements are visible, neither are there any detriments to using the extended version. In terms of running time, the extended version takes more time; it needs to maintain the Gaussian Process and perform more involved sampling, after all. The timing results (averaged over five trials) are shown in Figure 13. Note that the extra time remains fairly constant, regardless of learning algorithm or dataset. This is because Hyperband always produces approximately the same number of observations (which are used to update the Gaussian Process). Given that the total training time for these datasets usually exceeded thirty minutes, the running time impact of the extension is minimal.
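The claim that the number of observations is approximately fixed follows directly from Hyperband's bracket structure [5]. The short computation below illustrates this for the example values R = 81 and eta = 3, which are not necessarily the settings used in the experiments.

    # Back-of-the-envelope check: for fixed R and eta, Hyperband's bracket
    # structure [5] fixes the number of configuration evaluations, and hence
    # the number of Gaussian Process observations, independent of dataset
    # or learning algorithm.
    from math import ceil

    R, eta = 81, 3
    s_max = 0
    while eta ** (s_max + 1) <= R:               # s_max = floor(log_eta(R))
        s_max += 1
    B = (s_max + 1) * R                          # budget per bracket

    total_evals = 0
    for s in range(s_max, -1, -1):               # one bracket per value of s
        n = ceil(B / R * eta ** s / (s + 1))     # initial configs in bracket
        for i in range(s + 1):                   # Successive Halving rounds
            total_evals += n // eta ** i         # each surviving config = 1 observation
    print(total_evals)                           # same count for every dataset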

4.3 Evaluation

Experiments show that the extension is able to model hyperparameter configuration search spaces, and that it can do so with only a small time penalty. The lack of accuracy improvement may be caused by one or more of

Figure 13: Extra time (in seconds) taken by the extended Hyperband compared to the classic Hyperband.

the following factors:

• During code review, a significant error was found in the function that samples new hyperparameter configurations based on Expected Improvement. There was no opportunity for significant additional testing after the code review.

• The current sampling method may not suffice. Examples in Appendix B show that sampling does not always produce the best results, which becomes increasingly problematic as the number of samples decreases. This problem was still present after the code review. It may be fixed by choosing a different sampling technique, or a different approach to slice sampling (instead of the window approach); a sketch of plain slice sampling is given after this list.

• It is possible that for the evaluated combinations of algorithms and datasets, Hyperband itself already finds (near-)optimal solutions. In this case, the effect of the extension would not be visible unless more challenging scenarios are considered. Fine-grained grid search results that were available at Fujitsu for some of the datasets indicate that this is likely the case for at least some of the combinations of datasets and learning algorithms.

• Perhaps the solution space is too small to observe a noticeable difference. The solution spaces were kept small to keep the experiments manageable in the short time frame of the internship. Additionally, pre-computed grid search solutions were not available for larger solution spaces.
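To make the slice sampling alternative mentioned above concrete, below is a minimal sketch of standard stepping-out slice sampling over an unnormalized density such as the Expected Improvement curve. The density f, the width w, and all constants are illustrative; this is not the project's sampler, and it replaces the window approach with the textbook stepping-out scheme.

    # Minimal stepping-out slice sampler over an unnormalized density f.
    import math, random

    def slice_sample(f, x0, n_samples, w=0.5, max_steps=50):
        """Draw samples proportional to f via stepping-out slice sampling."""
        samples, x = [], x0
        for _ in range(n_samples):
            u = f(x) * random.random()            # vertical level under the curve
            left, right = x - w, x + w
            steps = max_steps                     # step out until outside the slice
            while f(left) > u and steps > 0:
                left -= w; steps -= 1
            steps = max_steps
            while f(right) > u and steps > 0:
                right += w; steps -= 1
            while True:                           # sample, shrinking on rejection
                x_new = random.uniform(left, right)
                if f(x_new) > u:
                    x = x_new
                    break
                if x_new < x:
                    left = x_new
                else:
                    right = x_new
            samples.append(x)
        return samples

    # Example: sample from a bimodal, unnormalized "EI-like" curve.
    f = lambda x: math.exp(-(x - 1.0) ** 2) + 0.5 * math.exp(-(x + 2.0) ** 2)
    print(slice_sample(f, x0=0.0, n_samples=5))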

5 Conclusion

Hyperband is a recently published hyperparameter optimization algorithm that leverages the apparent power of random sampling. Its focus lies on trying many different hyperparameter configurations, and eliminating configurations as soon as possible. It lends itself to easy parallelization and seems to perform well compared to other hyperparameter optimization techniques. The proposed extension attempts to improve on the Hyperband structure by leveraging intermediate results, which are ignored in the classic version. This extension has been shown to be able to model the search space, and imposes only a small time penalty on the overall Hyperband process. Sadly, the extension was not able to produce any accuracy improvements; the reasons for this are unclear, and uncovering them would require more extensive experiments.

6 Future Work

An observation which can be made from Hyperband, but which was not used in this study, is that many of the hyperparameter configurations evaluated in a Hyperband run are evaluated multiple times with different budget values (training time, training data size, etc.). Such a series of values can be used to estimate the loss curve of a particular configuration. If multiple series are available for different configurations, it may be possible to effectively combine these into a loss surface. In practice, this would mean adding an extra dimension to the Gaussian Process, and extending the current choice of kernel so it accurately models loss curves. Modelling loss curves with Gaussian Processes has been studied before [8], and the kernel proposed there can be combined with the Matérn 5/2 kernel to produce a Gaussian Process that models the hyperparameter search space in one direction, and loss curves in another. Sampling the resulting surface at the end of the loss curves would provide predictions of the ultimate performance of any hyperparameter configuration. This approach was briefly attempted and showed interesting results, but lack of time prevented any further study.
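As a sketch of what such a combined kernel could look like, the code below multiplies a Matérn 5/2 kernel over the hyperparameter axis with the exponential-decay loss curve kernel of Swersky et al. [8] over the budget axis. The separable product form and all constants (alpha, beta, length scale) are assumptions made for illustration.

    # Sketch of a combined kernel over (hyperparameter, budget) pairs.
    import numpy as np

    def matern52(x1, x2, length_scale=1.0):
        """Matérn 5/2 kernel over the hyperparameter axis."""
        r = abs(x1 - x2) / length_scale
        return (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)

    def freeze_thaw(t1, t2, alpha=1.0, beta=0.5):
        """Exponential-decay loss curve kernel of Swersky et al. [8]."""
        return beta ** alpha / (t1 + t2 + beta) ** alpha

    def combined(p1, p2):
        """Product kernel: one axis models the search space, the other
        models loss curves."""
        (x1, t1), (x2, t2) = p1, p2
        return matern52(x1, x2) * freeze_thaw(t1, t2)

    # Covariance matrix for a few illustrative (log C, epoch) observations.
    points = [(-1.0, 1), (-1.0, 4), (0.5, 1), (0.5, 8)]
    K = np.array([[combined(p, q) for q in points] for p in points])
    print(np.round(K, 3))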

Figure 14: Left: Gaussian Processes estimating loss curves for different hyperparameter values. Right: loss curve estimation and hyperparameter estimation combined into a single Gaussian Process model.

References

[1] Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain. (2011), pp. 2546–2554.

[2] Bergstra, J., and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (2012), 281–305.

[3] Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization - 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers (2011), pp. 507–523.

[4] Jamieson, K. G., and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016 (2016), pp. 240–248.

[5] Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560 (2016).

[6] Mockus, J. On bayesian methods for seeking the extremum and their application. In IFIP Congress (1977), pp. 195–200.

[7] Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. (2012), pp. 2960–2968.

[8] Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw bayesian optimization. CoRR abs/1406.3896 (2014).

[9] Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. OpenML: Networked science in machine learning. SIGKDD Explorations 15, 2 (2013), 49–60.

A Successive Halving Visualized

The series of images below shows the repeated steps of the Successive Halving algorithm. Please refer to the legend below for the meaning of the various shapes and colors in the images. Note that "HP1" and "HP2" stand for generic hyperparameters.
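For reference alongside the images, a minimal sketch of the Successive Halving loop itself is given below. The evaluate() function, the elimination rate eta = 3, and the toy configurations are illustrative, not the project's implementation.

    # Minimal sketch of the Successive Halving loop visualized below.
    import random

    def successive_halving(configs, evaluate, budget=1, eta=3):
        """Train all surviving configs on a growing budget, keeping the best
        1/eta fraction each round, until one configuration remains."""
        while len(configs) > 1:
            scored = sorted(configs, key=lambda c: evaluate(c, budget))
            configs = scored[:max(1, len(configs) // eta)]  # keep the best
            budget *= eta                                   # survivors get more budget
        return configs[0]

    # Toy example: a "config" is a number, loss is noisy distance to 0.6,
    # and the noise shrinks as the budget grows.
    evaluate = lambda c, b: abs(c - 0.6) + random.gauss(0, 1.0 / b)
    best = successive_halving([random.random() for _ in range(27)], evaluate)
    print(best)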

B Gaussian Process Search Space Model Examples

The Gaussian Process is visualized in the upper plot in each figure. Red bars indicate non-fully trained configurations, green points indicate fully trained ones.

The blue dashed line is the current minimum of all fully trained configurations.

The black line shows the mean estimate of the process, with the gray area showing two standard deviations. The lower plot in each figure shows the Expected Improvement along the Gaussian Process. The red dots in this plot indicate the points sampled for the next bracket of Hyperband.
