
that for each dataset we have to tune the hyperparameters of learners because there is a relationship between the dataset and the hyperparameter configuration that produces the optimal model for that learner. The meta-model is in effect a mapping that aims to transform the dataset characteristics to the ideal hyperparameter configuration for a learner. We postulate that we can also express this relationship explicitly for each hyperparameter by using symbolic hyperparameter defaults, defaults that map dataset characteristics to a valid hyperparameter value, and find them in a data-driven way.

Symbolic hyperparameter defaults then only have to be found once for a specific algorithm, and could ideally come packaged with that algorithm.

Symbolic hyperparameter defaults may also provide insight into the relationship between the hyperparameter and the dataset. While implementation differences might influence the ideal symbolic hyperparameter default, it is still likely that the default transfers reasonably well across implementations, e.g., from mlr3's [145] to scikit-learn's [184] decision tree. The model-free approach allows the defaults to be used in any AutoML framework, e.g., for warm-starting search or transforming the search space, and with additional experiments, symbolic hyperparameter defaults might even be learned for AutoML systems themselves.
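A familiar hand-crafted example of such a default is the √p heuristic for the number of features a random forest considers at each split; a learned symbolic hyperparameter default has exactly this shape, a function of dataset characteristics rather than a constant. A minimal illustration in Python (the function name is ours):

```python
import math

def max_features_default(n_features: int) -> int:
    # A symbolic default maps a dataset characteristic (here p, the number
    # of features) to a hyperparameter value, e.g., the classic sqrt(p)
    # heuristic, instead of using one constant value for every dataset.
    return max(1, round(math.sqrt(n_features)))
```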

1.4 Thesis Outline and Contributions

In this section, we detail our contributions chapter by chapter, illustrated by the high-level overview of our contributing chapters in Figure 1.2. After providing related background information in Chapter 2, we present our contributions to answering research questions 1 through 4 in Chapters 3 through 6, respectively. The first three of those chapters directly contribute to correct and reproducible AutoML research and come with software artifacts that may be used for independent research: a modular AutoML tool, machine-readable benchmarking suites, and an AutoML benchmark, respectively. The work in Chapter 6 details a meta-learning approach to finding symbolic hyperparameter defaults, which may be used to speed up AutoML in future work.

First, we provide a more in-depth overview of the AutoML literature in Chapter 2. We give a formal definition of the AutoML problem, followed by a discussion of the different design axes of AutoML systems, such as search space design, optimization algorithms, and post-processing used in AutoML. Then, we briefly discuss some of the work outside of the typical regression and single-label classification setting.


[Figure 1.2 appears here: a diagram linking the research questions to the contributing chapters. Chapter 3 (GAMA) addresses making the implementation of novel AutoML ideas easier (Q1); Chapter 4 (OpenML Suites) addresses enabling the use of common benchmark suites (Q2); Chapter 5 (AutoML Benchmark) addresses the correct and reproducible evaluation of AutoML (Q3); and Chapter 6 (Meta-learning for Symbolic Defaults) addresses speeding up AutoML by learning from prior experiments (Q4). The chapters are grouped under the themes "Algorithm Development Support", "Building Excellent Benchmarks to Measure Progress", and "Learning Better Algorithms".]

Figure 1.2: An overview of the thesis structure. Chapters 3 through 5 detail our contributions to correct and reproducible AutoML research (Q1-Q3). Chapter 6 presents an approach to learn symbolic hyperparameter defaults, which may be used to speed up AutoML in the future (Q4).

The chapter's aim is not only to provide a better understanding of the techniques currently employed, but also to provide a stronger context for the difficulty of developing and researching AutoML systems.

In Chapter 3, we introduce our answer to Q1 in the form of the General Automated Machine learning Assistant (GAMA [103, 104]), a tool that addresses the difficulty of exploring novel ideas in AutoML. As discussed in the previous section, developing a completely new AutoML tool just to evaluate a novel idea adds considerable overhead and can lead to less informative experimental results.

Using an existing tool allows for better comparisons, but comes with a steep learning curve and the risk that the new idea is not integrated by the original authors into public releases. GAMA features a modular and flexible design, which allows researchers to easily write or modify individual components of the AutoML pipeline. This not only enables much faster iteration over new ideas but also allows for better comparisons. We review other work built with GAMA and see early signs that the modular AutoML tool is valuable for research.
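To give a flavor of the interface, the following is a minimal sketch of the intended usage based on the published interface [103, 104]; exact argument names may vary between releases:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from gama import GamaClassifier
from gama.search_methods import RandomSearch  # swap in a different search method

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GAMA follows the scikit-learn estimator interface; its modular design lets
# researchers replace individual AutoML components, such as the search method.
automl = GamaClassifier(max_total_time=300, search=RandomSearch())
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```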

In Chapter 4, we examine different platforms for sharing data and machine learning experiments, and motivate the choice to build on OpenML [258]. We build a programmatic interface to the platform called openml-python [87], which enables further automation of downstream tasks and greatly increases the ease with which reproducible experiments can be conducted. For example, it is possible to automatically download datasets alongside the meta-data needed to conduct reproducible 10-fold cross-validation. By enabling the development of comprehensive benchmarking suites on the platform [28], we allow researchers to identify collections of interesting tasks and to share them. We believe that the ease with which benchmarking suites can now be shared and reproduced greatly contributes to the use of high-quality tasks in evaluations, and we show early signs that might confirm this (Q2).
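For illustration, fetching a task and its predefined splits with openml-python looks roughly as follows (task id 31 is used here only as an example):

```python
import openml

# An OpenML task couples a dataset with a fixed evaluation procedure, so all
# users evaluate on exactly the same splits. Task 31: 10-fold cross-validation
# on the 'credit-g' dataset.
task = openml.tasks.get_task(31)
dataset = task.get_dataset()
X, y, categorical_mask, attribute_names = dataset.get_data(target=task.target_name)

# The split for fold 0 is defined on the server, not generated locally, which
# makes the 10-fold cross-validation reproducible across studies.
train_indices, test_indices = task.get_train_test_split_indices(fold=0)
```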

Next, we build on that to address Q3 and create the AutoML benchmark [100], which we present in Chapter 5. The AutoML benchmark introduces a benchmarking tool for completely automated AutoML evaluations. To achieve this, we work together with the authors of AutoML frameworks and integrate with OpenML through openml-python. We present two benchmarking suites for benchmarking AutoML frameworks, one for classification and one for regression, and survey the current AutoML landscape through a large-scale evaluation of AutoML frameworks. Since its initial presentation in 2019 [100], the AutoML community has used the benchmark extensively, both integrating new AutoML frameworks and using the suites for large-scale evaluations.

In Chapter 6, we develop a method for finding symbolic hyperparameter defaults using meta-learning [102]. We use symbolic regression to optimize symbolic hyperparameter values for multiple hyperparameters of a learner jointly, and do so for six different learners. Because symbolic regression relies on many evaluations, we use surrogate models to make optimization tractable. We compare the performance of the found default values to implementation defaults, both on the surrogate models and through experiments on real data. The automatically designed symbolic hyperparameter defaults can match hand-crafted symbolic hyperparameter defaults and outperform the current constant defaults.
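The following sketch illustrates the role of the surrogate models; all names are hypothetical, and the actual search in Chapter 6 optimizes expression trees over many meta-features rather than a single fixed formula:

```python
import numpy as np

def symbolic_default(meta: dict) -> dict:
    # One candidate expression produced by symbolic regression, mapping
    # dataset meta-features (here p, the number of features) to a value.
    return {"max_features": max(1, int(np.sqrt(meta["p"])))}

def mean_predicted_performance(default, meta_features: dict, surrogates: dict) -> float:
    # Score a candidate default across datasets without training any model:
    # each surrogate predicts the learner's performance for the configuration
    # that the symbolic default produces on that dataset.
    return float(np.mean([surrogates[name](default(meta))
                          for name, meta in meta_features.items()]))
```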

We summarize the work and discuss open challenges and future work in Chapter 7.


Chapter 2

Automated Machine Learning

In this chapter, we give a more thorough introduction to AutoML for tabular datasets. We first define the AutoML problem in Section 2.1. The most common approach to tackling the problem is to iteratively explore the search space and optionally perform a post-processing step, as visualized in Figure 2.1. For that reason, we structure the three sections following the problem statement in that order: first, we review work on search space design, then we cover search and evaluation strategies together, and finally we discuss ways to use post-processing to create a final model.

In the remainder of the chapter, we discuss the various settings in which AutoML has been researched. Subsequent chapters will detail our contributions; each of those chapters will discuss additional literature relevant to that chapter.


[Figure 2.1 appears here: a diagram of an AutoML system that takes data as input and produces a model, built from a problem definition (Section 2.1), a search space (Section 2.2), search and evaluation (Section 2.3), and post-processing (Section 2.4).]

Figure 2.1: Typical building blocks of AutoML approaches.

2.1 Problem Definition

The AutoML problem has been (re)formulated many times. There are many mathematical formulations which broadly have the same meaning as the following definition of full model selection [75]:

Given a pool of preprocessing methods, feature selection and learning algorithms, select the combination of these that obtains the lowest error for a given data set.

Mathematical definitions with a similar intent often define the problem as a direct extension of the hyperparameter optimization problem by encoding the choice of algorithms used as additional hyperparameters [27, 238], which is also known as the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem [238]. In some cases, authors explicitly make a distinction between preprocessing algorithms, which transform a dataset into another dataset, and learners, which learn to predict labels for a dataset [3, 169]. However, these definitions require a liberal interpretation to generalize across implementations. For example, the paper which introduced auto-sklearn [85] adopts the CASH [238] formulation and uses sequential model-based algorithm configuration (SMAC) [119] to tune pipelines. However, after optimizing over the pipeline space, the resulting models are combined into an ensemble as described in [48, 49], which only fits the CASH definition under a very liberal interpretation of the notion of an algorithm and indicator hyperparameters. For this reason, as far as mathematical formulations go, we prefer the interpretation of AutoML as optimizing a directed acyclic graph (DAG) of operations, as given through a series of definitions by Zöller and Huber [285], which we will use in adapted form:

Pipeline Creation Problem: Let a set of algorithms A with associated domains of hyperparameters Λ(·), a set of valid pipeline structures G, and a dataset D be given. The pipeline creation problem consists of finding a pipeline structure in combination with a joint algorithm and hyperparameter selection that minimizes the loss

\[
(g, A, \lambda) \in \operatorname*{arg\,min}_{g \in G,\ A \in \mathcal{A}^{|g|},\ \lambda \in \Lambda} \mathcal{R}(P_{g,A,\lambda}, D) \tag{2.1}
\]

within a given resource budget B,

where:

• g ∈ G is a graph from the set of all valid graphs,

• A ∈ A^{|g|} is a vector which, for each node in graph g, specifies the algorithm from the set of algorithms A,

• λ ∈ Λ is a vector specifying the hyperparameter configuration of each algorithm from the set of all possible configurations,

• and B is a resource budget, which may be given as e.g., time or iterations.

R is the empirical risk of the pipeline P_{g,A,λ} according to some evaluation procedure; for example, the root mean square error of the predictions of pipeline P_{g,A,λ} on a validation set D_v ⊂ D after being trained on D_train = D \ D_v. R may also be defined over multiple objectives, in which case a Pareto-optimal set of pipelines is to be found.
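Written out for that example, with the pipeline trained on D_train = D \ D_v and evaluated on the validation set D_v, the empirical risk is

\[
\mathcal{R}(P_{g,A,\lambda}, D) = \sqrt{\frac{1}{|D_v|} \sum_{(x, y) \in D_v} \left( P_{g,A,\lambda}(x) - y \right)^2 }.
\]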

We purposely do not define specific characteristics for D, so that the definition generalizes beyond single-label classification and regression to e.g., multi-label classification and clustering. When we refer to the AutoML problem in this work, we refer to the above definition.
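To make the abstraction concrete, a single candidate (g, A, λ) could, in scikit-learn terms, look as follows (the specific operators and hyperparameter values are only an illustration):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# One point in the search space: a linear graph g with two nodes, the
# algorithm vector A = (SimpleImputer, DecisionTreeClassifier), and a
# hyperparameter configuration λ for each of the two algorithms.
candidate = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("learn", DecisionTreeClassifier(max_depth=5, min_samples_leaf=2)),
])
# An AutoML system iterates over many such candidates and selects the one
# that minimizes the empirical risk R, e.g., estimated via cross-validation.
```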

Note that this definition is still quite narrow: it only formalizes the automated optimization of machine learning pipelines and is geared towards a quantitative assessment of final model performance. In a broader sense, AutoML systems may also be understood to automate other tasks in the 'machine learning engineering pipeline' [206, 213, 276], including exploratory data analysis, reports on model quality and interpretability, and model deployment. Santu et al. [213] define multiple levels of AutoML based on which steps are automated and, consequently, how much help a domain expert would need from an ML expert in order to produce ML models. They reference current work that automates some of these steps independently, and also provide additional directions for research, such as computer-assisted task formulation (specifying exactly what the ML model has to learn). In a qualitative comparison, Xanthopoulos et al. [276] find that multiple AutoML frameworks automate more than just pipeline design, for example by providing automated interpretability reports or data visualization, but that none of the systems cover full end-to-end automation. Additionally, they define several qualities beyond automation, such as the quality of documentation and support, or the ability to integrate with other systems. While we acknowledge that the automation of other parts of the 'machine learning engineering pipeline' is interesting and important work, this work focuses primarily on automated pipeline design.
