
3.2 The Modular AutoML Pipeline

3.2.1 Search

There are three types of optimization algorithms currently implemented in GAMA to search for optimal machine learning pipelines: random search, the bandit-based asynchronous successive halving algorithm, and an asynchronous evolutionary algorithm. We first give a brief motivation for using asynchronous algorithms and then discuss the different implemented methods in more detail below.

Asynchronous Optimization

In GAMA, we chose to incorporate asynchronous algorithms because they parallelize more efficiently than their synchronous counterparts. This is illustrated in Figure 3.1, which compares the two methods with jobs, visualized as bars, distributed over 4 workers for each method.


Figure 3.1: A visual example of sync points (e.g., generations in evolution) causing idle workers in synchronous methods. Bars represent jobs distributed over 4 workers for each method. For comparison purposes, their color represents the batch and the same total compute time is used in both methods.

The figure shows that synchronous algorithms need to wait until all jobs in a batch are finished, e.g., all individuals of a generation or in a rung are evaluated, which leaves time gaps where workers are idle. By contrast, asynchronous methods define new jobs whenever resources are available, allowing them to parallelize more effectively.

ML pipelines can vary dramatically in running time [170, 279], which means synchronous approaches may spend a lot of time waiting for stragglers to finish.

In the example, each job is given a color to represent the batch and the same total compute time is used in both methods. In reality, the asynchronous method generates new jobs based on different information than the synchronous method (e.g., only the evaluations that have finished so far), so the results would differ. Evaluating the effect of these differences on convergence time and final model quality would be interesting future work.

Interestingly, there has not been much work evaluating asynchronous optimization for AutoML. While resource utilization is higher for asynchronous algorithms, new candidate solutions are generated with different information, which might alter the end result. This has been studied outside of the AutoML context, but very little within it [151, 190]. Using asynchronous evolution has been proposed before independently by Pilát, Křen, and Neruda [190], though their evaluation is small-scale and also introduces caching of machine learning pipelines. To the best of our knowledge, none of the systems that use bandit-based optimization use an asynchronous variant.

Random Search

Random search is more effective than grid search for hyperparameter optimization [17] and may prove to be a strong baseline given a well-designed search space (as it is for certain types of Neural Architecture Search [152]). GAMA’s random search creates pipelines in three steps. First, the pipeline length is chosen uniformly at random (containing a maximum of 3 steps, by default). Then, for each step, an algorithm is chosen uniformly at random. Finally, for each algorithm, the hyperparameter configuration is chosen uniformly at random.³
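To make this concrete, a minimal sketch of the three sampling steps is shown below, assuming a hypothetical, already-discretized search space dictionary; the algorithm names, SEARCH_SPACE, and sample_random_pipeline are illustrative and do not correspond to GAMA's internal API.

```python
import random

# Hypothetical discretized search space: algorithm name -> {hyperparameter: candidate values}.
SEARCH_SPACE = {
    "StandardScaler": {},
    "PCA": {"n_components": [0.25, 0.5, 0.75, 0.99]},
    "RandomForestClassifier": {"n_estimators": [100], "max_features": [0.1, 0.5, 0.9]},
}

def sample_random_pipeline(max_steps=3):
    """Sample a pipeline: first its length, then an algorithm per step, then each configuration."""
    length = random.randint(1, max_steps)                  # step 1: pipeline length
    pipeline = []
    for _ in range(length):
        algorithm = random.choice(list(SEARCH_SPACE))      # step 2: algorithm, uniformly at random
        config = {name: random.choice(values)              # step 3: hyperparameter values
                  for name, values in SEARCH_SPACE[algorithm].items()}
        pipeline.append((algorithm, config))
    return pipeline
```

In practice, constraints such as requiring the final step to be an estimator would have to be added; the sketch only illustrates the uniform sampling itself.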

Asynchronous Successive Halving Algorithm

ASHA [151] uses multi-fidelity estimates to filter out bad pipelines early, as shown in Algorithm 2. In short, given a reduction factor η and budget parameters (b, B, s), configurations are first evaluated on a budget of b · η^s. The top 1/η configurations in rung k, corresponding to resource budget b · η^(s+k), get promoted to the next rung with a larger resource budget per pipeline, b · η^(s+k+1). In ASHA, new configurations are added to the lowest rung anytime no evaluations are scheduled for higher rungs and all pipelines in the top 1/η of their rungs have already been promoted. The minimum early stopping rate s can be used to increase the budget of the bottom rung. Because GAMA includes non-iterative algorithms in the search space, these multi-fidelity estimates are obtained by subsampling the dataset. For example, on a dataset with 1 million rows, pipelines would first be evaluated with cross-validation on 10,000 rows, the top configurations are subsequently evaluated on 100,000 rows, and the best of those pipelines are evaluated on the full dataset. Pipeline candidates are generated at random, similar to random search.
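To illustrate how the rung budgets relate to these parameters, the sketch below lists the sample sizes b · η^(s+k) per rung; the values b = 10,000, B = 1,000,000, and η = 10 are chosen to reproduce the example above and are not necessarily GAMA's defaults.

```python
def rung_budgets(b, B, eta, s=0):
    """Sample sizes per rung: rung k evaluates pipelines on b * eta**(s + k) rows, capped at B."""
    budgets, budget = [], b * eta ** s
    while budget <= B:
        budgets.append(budget)
        budget *= eta
    return budgets

print(rung_budgets(b=10_000, B=1_000_000, eta=10))  # [10000, 100000, 1000000]
```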

³ Continuous hyperparameters are currently discretized in GAMA’s search space.


Algorithm 2 Asynchronous Successive Halving Algorithm [151]

Require: minimum resource b, maximum resource B, reduction factor η, minimum early stopping rate s

1: while not stop do ▷ e.g., time, iterations
2:   for each free worker do
3:     (θ, k) ← get_job() ▷ In AutoML, θ is a ML pipeline
4:     queue_evaluation(θ, b · η^(s+k))
5:   end for
6:   for each completed job (θ, k) with loss l do
7:     Update configuration θ in rung k with loss l.
8:   end for
9: end while
10:
11: function get_job()
12:   for k = ⌊log_η(B/b)⌋ − s, . . . , 1, 0 do ▷ Promote in high rungs first
13:     candidates ← top_k(rung k, ⌊|rung k| / η⌋)
14:     promotable ← {t for t ∈ candidates if t not already promoted}
15:     if |promotable| > 0 then
16:       return promotable[0], k + 1 ▷ Always promote if possible
17:     end if
18:   end for
19:   Draw random configuration θ ▷ But grow bottom rung otherwise
20:   return θ, 0
21: end function
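A rough Python equivalent of the get_job promotion logic is sketched below. It assumes rungs is a list with one list of (loss, configuration) results per rung and promoted tracks which configurations were already promoted from each rung; unlike Algorithm 2, the sketch skips the top rung, since configurations there have already received the maximum budget. This is an illustration, not GAMA's implementation.

```python
def get_job(rungs, promoted, eta, sample_random_configuration):
    """Return a (configuration, rung) pair to evaluate next, following Algorithm 2."""
    top_rung = len(rungs) - 1
    for k in range(top_rung - 1, -1, -1):                         # promote in high rungs first
        by_loss = sorted(rungs[k], key=lambda result: result[0])  # (loss, config), lowest loss first
        candidates = [config for _, config in by_loss[: len(by_loss) // eta]]
        promotable = [c for c in candidates if c not in promoted[k]]
        if promotable:
            promoted[k].append(promotable[0])
            return promotable[0], k + 1                           # always promote if possible
    return sample_random_configuration(), 0                       # otherwise grow the bottom rung
```

A completed evaluation on rung k would then be recorded by appending its (loss, configuration) tuple to rungs[k], mirroring lines 6–8.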


Asynchronous Multi-Objective Evolutionary Algorithm

The evolutionary algorithm in GAMA is identical to the one described in [218], for which pseudo-code is presented in Algorithm 3. The queue_evaluation(p) function submits pipeline p to a queue to be evaluated on one of the worker nodes, and the get_next_evaluation() function returns whichever evaluation is done first. The algorithm maintains a single population and generates offspring from the population whenever a worker is available.

Algorithm 3 Asynchronous Evolution

Require: P_start initial pipeline designs, N_max > 0

1: for all p ∈ P_start do
2:   queue_evaluation(p) ▷ To be evaluated on a worker
3: end for
4:
5: P ← ∅
6: while not stop do ▷ E.g., time, iterations
7:   P ← P ∪ {get_next_evaluation()} ▷ Whichever is done first
8:   if |P| > N_max then
9:     P ← P \ {eliminate(P)} ▷ Remove the worst fitness
10:   end if
11:   if worker is available then
12:     queue_evaluation(create_one(P)) ▷ Create new pipeline
13:   end if
14: end while
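The main loop of Algorithm 3 maps naturally onto Python's concurrent.futures. The sketch below is a simplified illustration under that assumption; evaluate_pipeline, create_one, eliminate, and stop are placeholders for the evaluation function, offspring creation, elimination, and stopping condition, and do not correspond to GAMA's actual interfaces.

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def asynchronous_evolution(p_start, n_max, n_workers,
                           evaluate_pipeline, create_one, eliminate, stop):
    """Asynchronous evolution: replace one individual whenever an evaluation completes."""
    population = []
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        futures = {executor.submit(evaluate_pipeline, p) for p in p_start}  # lines 1-3
        while not stop() and futures:
            done, futures = wait(futures, return_when=FIRST_COMPLETED)      # line 7
            for future in done:
                population.append(future.result())
                if len(population) > n_max:
                    population.remove(eliminate(population))                # lines 8-10
                # The worker that finished is free again: queue a new offspring (lines 11-13).
                futures.add(executor.submit(evaluate_pipeline, create_one(population)))
    return population
```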

While the pseudo-code presented here only differs in form from [218], there are differences in the selection, mutation, cross-over and representation of individuals. GAMA uses genetic programming trees to represent linear ML pipelines (see Section 2.3.2), and uses the following operators to optimize them:

Elimination (line 9): Remove an individual from the worst-ranked Pareto front.

Pipeline Creation (line 12):

– Parent Selection: Binary tournament selection based on Pareto rank and crowding distance, as in NSGA-II [64] (see the sketch after this list).

– Cross-over: Exchange subtrees (e.g., a preprocessing pipeline).

– Mutation: One of the following mutations with equal probability⁴:

  – Point: Replace a terminal.
  – Point: Replace a primitive and its connected terminals.
  – Insert: Extend a ‘data’ terminal with a preprocessing subtree.
  – Shrink: Remove (part of) a preprocessing subtree.

⁴ Considering only valid mutations, e.g., you can’t shrink a tree with only a root node.
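As an illustration of the parent selection operator above, a binary tournament based on Pareto rank and crowding distance could look roughly as follows; pareto_rank and crowding_distance are assumed to be precomputed attributes of each individual, and the sketch is not taken from GAMA's code base.

```python
import random

def binary_tournament(population):
    """Select a parent: lower Pareto rank wins, ties are broken by larger crowding distance."""
    a, b = random.sample(population, 2)
    if a.pareto_rank != b.pareto_rank:
        return a if a.pareto_rank < b.pareto_rank else b
    return a if a.crowding_distance > b.crowding_distance else b
```

Two such tournaments would then yield the two parents used for cross-over.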

[Figure 3.2 plots AUC on the tasks Higgs, porto-seguro, airlines, APSFailure, kick, and numerai28_6 for TPOT, GAMA, and the best observed score.]

Figure 3.2: A comparison of TPOT and GAMA without ensembling on six binary classification tasks from the benchmark. The best observed score per fold across all frameworks in the benchmark (see Chapter 5) is also shown for reference.

Figure 3.2 shows a small-scale comparison on six tasks from the AutoML benchmark (see Chapter 5) between TPOT, which uses a synchronous (µ + λ) algorithm, and GAMA, which uses asynchronous evolution. The top and bottom three tasks are the three largest binary classification tasks with under one million and under 100 thousand rows, respectively. The best observed score for each fold across all frameworks is also shown for reference. GAMA’s search space is very similar to that of TPOT, but TPOT allows stacking in the pipeline design by using any learner as a preprocessing step and appending its predictions to the data. While we cannot draw firm conclusions because of these design differences, we think it is a promising indication that asynchronous methods lead to better pipelines being discovered within the same time budget, as GAMA improves over TPOT on tasks where a substantial improvement was shown to be possible. In the future we hope to do a principled comparison by adding synchronous evolution to GAMA.


Algorithm 4 Ensemble selection from libraries of models [49]

Require: P a set of pipelines with given loss l_p and out-of-fold predicted probabilities ŷ_p, initial ensemble size k, final ensemble size K

1: w ← [0 | p ∈ P] ▷ Initialize weight of each pipeline
2: for all p ∈ top_k(P, k) do ▷ Add k pipelines with the least loss
3:   w_p ← 1
4: end for
5:
6: for 1, . . . , K − k do
7:   L ← [Evaluate(P, w′), where w′ is w but with w_p increased by 1 | p ∈ P]
8:   Increase w_p by 1 where p := arg min_{p∈P} L_p
9: end for
10:
11: function Evaluate(P, w)
12:   return Loss incurred by prediction (1 / ‖w‖₁) Σ_{p∈P} w_p · ŷ_p
13: end function
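A compact Python sketch of this greedy procedure is given below. It assumes each pipeline's out-of-fold predicted probabilities are stored as arrays of identical shape and that loss_fn is a loss function such as sklearn.metrics.log_loss; the function and variable names are illustrative and not GAMA's own.

```python
def ensemble_selection(oof_probabilities, losses, y_true, loss_fn, k, K):
    """Greedy ensemble selection with replacement (Algorithm 4)."""
    pipelines = list(oof_probabilities)            # oof_probabilities: pipeline -> ŷ_p array
    weights = {p: 0 for p in pipelines}            # line 1
    for p in sorted(pipelines, key=lambda p: losses[p])[:k]:
        weights[p] = 1                             # lines 2-4: seed with the k best pipelines
    for _ in range(K - k):                         # lines 6-9
        def loss_if_added(extra):
            w = {p: weights[p] + (p == extra) for p in pipelines}
            blended = sum(w[p] * oof_probabilities[p] for p in pipelines) / sum(w.values())
            return loss_fn(y_true, blended)
        best = min(pipelines, key=loss_if_added)   # pipeline whose addition reduces loss most
        weights[best] += 1
    return weights
```

Because selection is done with replacement, the same pipeline can be picked in multiple iterations, which is what produces weighted ensembles.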
