
(1)


Methods of Multi-Model Consolidation, with Emphasis on the Recommended Cross Validation Approach

Huug van den Dool

CTB seminar, May 11, 2009

Acknowledgements: Malaquias Peña, Åke Johansson, Wanqiu Wang, Tony Barnston, Suranjana Saha

(2)
(3)

Traditional Anomaly Correlation

F' = F - C_obs        A' = A - C_obs

(F: Forecast, A: verifying Analysis, C_obs: observed Climatology)

AC = Σ F'A' / (Σ F'F' Σ A'A')^{1/2}

Summation is in space, or in space and time. Weighting may be involved.

C_obs is known at the time the forecast is made, i.e. determined from previous data. A (and obviously F) are not part of the sample from which C_obs is calculated.

Relationship of AC (skill) to MSE.

AC is calculated from 'raw' data.
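For concreteness, a minimal numpy sketch of this AC with the summation taken over space; the function and variable names are illustrative, and the weighting mentioned above is omitted:

```python
import numpy as np

def anomaly_correlation(F, A, C_obs):
    """Traditional anomaly correlation, summed over space.

    F, A, C_obs: 1-D arrays over grid points holding the forecast,
    the verifying analysis, and the observed climatology.
    (Area weighting, mentioned on the slide, is omitted here.)
    """
    Fp = F - C_obs   # F' = F - C_obs
    Ap = A - C_obs   # A' = A - C_obs
    return np.sum(Fp * Ap) / np.sqrt(np.sum(Fp * Fp) * np.sum(Ap * Ap))
```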

(4)

New trend due to availability of hindcast data sets:

F'' = F - C_mdl        A' = A - C_obs

and C_obs tends to be calculated from data that matches the model data.

(5)

Short-Cut Anomaly Correlation

F'' = F - C_mdl        A' = A - C_obs

AC_sc = Σ F''A' / (Σ F''F'' Σ A'A')^{1/2}

F'' = (F - C_mdl) = (F - C_obs) - (C_mdl - C_obs)

F'' = F' - (C_mdl - C_obs)        (1)

Using F'' amounts to a systematic error correction (SEC), which requires a cross-validation (CV) to be honest.

{{ Eq (1) becomes more involved if the periods for C_mdl and C_obs are not the same. }}
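A minimal sketch of Eq. (1), assuming for illustration that C_mdl is estimated as the mean of the model's own hindcast sample:

```python
import numpy as np

def short_cut_anomaly(F, model_hindcasts, C_obs):
    """Eq. (1): F'' = F' - (C_mdl - C_obs), which is just F - C_mdl.

    model_hindcasts: sample of the model's own past forecasts from
    which its climatology C_mdl is estimated (an assumption made for
    this sketch). Subtracting (C_mdl - C_obs) is the implied
    systematic error correction (SEC).
    """
    C_mdl = np.mean(model_hindcasts)
    F_prime = F - C_obs            # traditional anomaly F'
    sec = C_mdl - C_obs            # implied SEC
    return F_prime - sec           # identical to F - C_mdl
```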

(6)

Why do we need CV?

• To obtain an estimate of skill on future (independent) data.

While there is no substitute for real time forecasts on future data, a CV procedure attempts to help us out (without having to wait too long)

• Leaving N years out of a sample of M creates N independent data points. Or does it??

• Details of CV procedures used by authors are exceedingly ad-hoc and often wrong

• We recommend 3CVRE

(7)

Meaning of 3CVRE

• Leave 3 years out (3 as a minimum)

• R: Leave 3 years out, namely the test year plus two others chosen at Random (see the example on the next slide)

• E: Use an 'External' observed climatology, not an observed climatology that changes in response to leaving out a particular set of 3 years.

(8)

Example: 1981-2001. Three years left out; the first year is the test year, the other two are picked at random. (A sketch of this selection follows the table.)

years left out: 1981 1985 1989
years left out: 1982 2000 1989
years left out: 1983 1990 1998
years left out: 1984 1993 1981
years left out: 1985 1992 1995
years left out: 1986 1999 1987
years left out: 1987 1996 1989
years left out: 1988 1988 1989
years left out: 1989 1983 1992
years left out: 1990 1985 2000
years left out: 1991 1990 2001
years left out: 1992 1996 2001
years left out: 1993 1985 1995
years left out: 1994 1989 1991
years left out: 1995 1986 1996
years left out: 1996 1991 1990
years left out: 1997 1991 1990
years left out: 1998 1991 1988
years left out: 1999 2001 1995
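A minimal sketch of how such leave-3-out sets could be drawn; the function name and the use of Python's random module are illustrative assumptions, not details from the talk:

```python
import random

def draw_3cvre_sets(years, seed=1):
    """For each test year, leave out that year plus two other years
    chosen at random: the '3' and 'R' of 3CVRE. The 'E' (external
    observed climatology) enters later, at verification time."""
    rng = random.Random(seed)
    years = list(years)
    triples = []
    for test_year in years:
        others = rng.sample([y for y in years if y != test_year], 2)
        triples.append((test_year, *others))
    return triples

# e.g. draw_3cvre_sets(range(1981, 2002)) gives 21 (test, out, out) triples
```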

(9)

Why leave three out, as opposed to just one? Two very different reasons:

• The Anomaly Correlation does not change between 'raw' and CV-1-out. (This can be shown analytically.)

• CV-1-out leads to serious 'degeneracy' problems when the forecast involves a regression (as it does for MME with unequal weights) and skill is not that high to begin with (which, unfortunately, applies here).

(10)

M. Peña Mendez and H. van den Dool, 2008: Consolidation of Multi-Method Forecasts at CPC. J. Climate, 21, 6521–6538.

Unger, D., H. van den Dool, E. O'Lenic, and D. Collins, 2009: Ensemble Regression. Mon. Wea. Rev., accepted; early online release posted January 2009, DOI: 10.1175/2008MWR2605.1.

(1) CTB, (2) why do we need ‘consolidation’?

(11)

Context: Consolidation of Several Models

(12)

OFFicial Forecast(element, lead, location, initial month) = a·F_1 + b·F_2 + c·F_3 + …

Honest hindcast required, 1950-present.

Covariances (F_1, F_2), (F_1, F_3), (F_2, F_3), and (F_1, A), (F_2, A), (F_3, A) allow solution for a, b, c (element, lead, location, initial month).

(13)


CON is color blind

(14)

Apply to:

• Monthly SST, 1981-2001, 4 starts, leads 1-5
• 9 models
• Domain is the 20S-20N Pacific Ocean (gridpoints, not the Nino34 index)

M. Peña Mendez and H. van den Dool, 2008: Consolidation of Multi-Method Forecasts at CPC. J. Climate, 21, 6521–6538.

(15)

Table 1. Some information on the DEMETER-PLUS models

Acronym: D1, D2, …, D7
Full name: DEMETER models*
Layout: ensemble members: 9; leads: 0 to 5 months; initial months: Feb, May, Aug, Nov
Period: 1980-2001

Acronym: CFS
Full name: NCEP Climate Forecast System
Layout: ensemble members: 15; leads: 0 to 8 months; initial months: Jan to Dec
Period: 1981-2006

Acronym: CA
Full name: CPC Constructed Analog
Layout: ensemble members: 12; leads: -3 to 12; initial months: Jan to Dec
Period: 1956-2006

* Institutions developing these models: European Centre for Medium-Range Weather Forecasts, Max Planck Institute, Meteo-France, United Kingdom Met Office, Istituto Nazionale di Geofisica e Vulcanologia, Laboratoire d'Oceanographie Dynamique et de Climatologie, European Centre for Research and Advanced Training in Scientific Computation.

(16)

CON = Σ_{k=1}^{K} α_k SST_k

i.e. a weighted mean over K model estimates

One finds the K alphas typically by minimizing the distance between CON and observed SST.

(17)

Classic or Unconstrained Regression (UR)

The general problem of consolidation consists of finding a vector of weights, α, that minimizes the Sum of Squared Errors, SSE, given by the following expression:

SSE = (Zα - o)^T (Zα - o)        (5)

Minimization leads to Z^T Z α = Z^T o, so the weights are formally given by

α = A^{-1} b        (6)

where A = Z^T Z is the covariance matrix and b = Z^T o.

Equation (6) is the solution for the ordinary (Unconstrained) linear Regression (UR).
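As a sketch, the UR weights of Eq. (6) in numpy, assuming Z holds the K model hindcasts as columns and o the verifying observations (names are illustrative):

```python
import numpy as np

def ur_weights(Z, o):
    """Unconstrained regression weights, Eq. (6): alpha = A^{-1} b.

    Z: (n, K) matrix with the K model hindcast anomalies as columns;
    o: length-n vector of observed anomalies.
    """
    A = Z.T @ Z               # covariance matrix A = Z^T Z
    b = Z.T @ o               # b = Z^T o
    return np.linalg.solve(A, b)
```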

(18)

Why ridge regression?

One of the preferred methods, because it:

• Tries to minimize the damage due to overfitting (too many coefficients from too little data)
• Tries to handle co-linearity as much as possible
• Has a smaller difference in correlation (and MSE) between dependent and independent data

(19)

Essentially, ridging is a multiple linear regression with an additional penalty term to constrain the size of the squared weights in the minimization of SSE (5):

J = (Zα - o)^T (Zα - o) + λ α^T α        (7)

where λ, the regularization (or ridging) parameter, indicates the relative weight of the penalty term. Minimization of J leads to

α = (A + λI)^{-1} b        (8)

where I is the identity matrix.

Similarities between the ridging and Bayesian approaches for determining the weights have been discussed by Hsiang (1976) and DelSole (2007). In the Bayesian view, (8) represents the posterior mean of α, based on a normal a priori parameter distribution with mean zero and variance matrix (σ²/λ)I, where σ²I is the variance matrix of the regression residual, assumed to be normal with mean zero.
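Likewise, a minimal sketch of Eq. (8); lam stands for the ridging parameter λ, whose selection is a separate question:

```python
import numpy as np

def ridge_weights(Z, o, lam):
    """Ridge regression weights, Eq. (8): alpha = (A + lambda I)^{-1} b."""
    A = Z.T @ Z
    b = Z.T @ o
    return np.linalg.solve(A + lam * np.eye(Z.shape[1]), b)

# lam = 0 recovers UR, Eq. (6); larger lam shrinks the weights toward 0.
```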

(20)
(21)

[Figure from DelSole (2007)]

(22)

[Figure: consolidation methods labeled UR, MMA, COR, RI, RIM, RIW, and Climo]

(23)


(24)

[Figure: SEC, SEC and CV, 3CVRE]

(25)

25

25.5 .7 26.8 -.4 1981 2.45 25.9 1.1 28.1 .9 1982 2.45 23.8 -.9 27.1 -.1 1983 2.45 23.5 -1.3 26.7 -.5 1984 2.45 24.1 -.7 26.7 -.5 1985 2.45 26.0 1.3 27.4 .2 1986 2.45 26.6 1.9 28.8 1.6 1987 2.45 23.6 -1.1 25.6 -1.6 1988 2.45 26.2 1.5 26.7 -.5 1989 2.45 25.8 1.1 27.3 .1 1990 2.45 23.5 -1.2 27.9 .7 1991 2.45 24.4 -.3 27.5 .4 1992 2.45 24.4 -.3 27.6 .4 1993 2.45 23.5 -1.2 27.3 .1 1994 2.45 22.9 -1.8 27.0 -.2 1995 2.45 25.6 .9 27.1 -.1 1996 2.45 25.8 1.1 28.9 1.7 1997 2.45 23.4 -1.3 25.9 -1.2 1998 2.45 24.5 -.2 26.3 -.8 1999 2.45 25.0 .3 26.7 -.5 2000 2.45 25.2 .4 27.3 .1 2001 2.45 24.7 .0 27.2 .0 all

No CV

Mdl 4 anomaly Obs anomaly year SEC

(26)

3CVRE

Mdl 4  anomaly   Obs   anomaly   year   SEC
 25.5      .9   26.8      -.4    1981   2.62
 25.9     1.3   28.1       .9    1982   2.62
 23.8     -.9   27.1      -.1    1983   2.46
 23.5    -1.3   26.7      -.5    1984   2.44
 24.1     -.8   26.7      -.5    1985   2.32
 26.0     1.4   27.4       .2    1986   2.56
 26.6     2.0   28.8      1.6    1987   2.63
 23.6     -.8   25.6     -1.6    1988   2.73
 26.2     1.5   26.7      -.5    1989   2.48
 25.8     1.1   27.3       .1    1990   2.54
 23.5    -1.2   27.9       .7    1991   2.42
 24.4     -.3   27.5       .4    1992   2.49
 24.4     -.5   27.6       .4    1993   2.32
 23.5    -1.3   27.3       .1    1994   2.38
 22.9    -1.8   27.0      -.2    1995   2.48
 25.6      .9   27.1      -.1    1996   2.45
 25.8     1.0   28.9      1.7    1997   2.36
 23.4    -1.4   25.9     -1.2    1998   2.37
 24.5     -.3   26.3      -.8    1999   2.42
 25.0      .2   26.7      -.5    2000   2.41
 25.2      .5   27.3       .1    2001   2.50
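A sketch of how such per-year cross-validated SECs could be computed, assuming SEC = C_obs − C_mdl (consistent with the 'all' row of the No-CV table, 27.2 − 24.7 ≈ 2.45), with C_mdl recomputed for each leave-3-out sample and the observed climatology held fixed; this is an assumption about the computation, not code from the talk:

```python
import numpy as np

def cv_sec(years, mdl, C_obs_external, triples):
    """Per-test-year SEC = C_obs - C_mdl, with C_mdl recomputed from
    the years remaining after 3 are left out, and C_obs held fixed
    ('External', the E in 3CVRE).

    mdl: dict mapping year -> model value at the point considered;
    triples: (test_year, out1, out2) sets as drawn earlier.
    """
    sec = {}
    for test_year, *others in triples:
        keep = [y for y in years if y != test_year and y not in others]
        C_mdl = np.mean([mdl[y] for y in keep])
        sec[test_year] = C_obs_external - C_mdl
    return sec
```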

(27)


(28)
(29)

Conclusions MME

• MMA is an improvement over the individual models

• It is hard to improve upon an equal-weight ensemble average (MMA). Only WestPac SST shows some improvement, via ridge regression

• This is caused by (very) deficient data set length. We need 5000 years, not 25.

• Pooling gridpoints, pooling various start times and leads, throwing out 'bad' models upfront, and using all ensemble members helps

• Equal treatment for very unequal methods is ….

• RIW and COR make sense, because this is what CPC does subjectively.

• As should have been expected: UR is really bad

(30)
(31)


(32)

[Figure: AC_sc, AC_sc plus CV, and AC (raw)]

(33)

Why leave three out, as opposed to just one? Two very different reasons:

• The Anomaly Correlation does not change between 'raw' and CV-1-out. (This can be shown analytically.)

• CV-1-out leads to serious 'degeneracy' problems when the forecast involves a regression (as it does for MME with unequal weights) and skill is not that high to begin with (which, unfortunately, applies here).

(34)
(35)


(36)
(37)


Bayesian Multimodel Strategies

Linear regression leads to unstable weights for small sample sizes.

Methods for producing more stable estimates have been proposed by van den Dool and Rukhovets (1994), Kharin and Zwiers (2002), Yun et al. (2003), and Robertson et al. (2004).

These methods are special cases of a Bayesian method, each distinguished by a different set of prior assumptions (DelSole 2007).

Some reasonable prior assumptions:

R:0     Weights centered about 0 and bounded in magnitude (ridge regression)
R:MM    Weights centered about 1/K (K = # of models) and bounded in magnitude
R:MM+R  Weights centered about an optimal value and bounded in magnitude
R:S2N   Models with a small S2N (signal-to-noise) ratio tend to have small weights
LS      Weights are unconstrained (ordinary least squares)

From Jim Kinter (Feb 2009)
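One concrete reading of the R:MM prior (an assumption for this sketch, not stated on the slide): penalize (α − α₀)^T (α − α₀) with α₀ = (1/K, …, 1/K) instead of α^T α, which shrinks the weights toward equal weighting and gives α = (A + λI)^{-1} (b + λα₀):

```python
import numpy as np

def ridge_weights_toward(Z, o, lam, alpha0):
    """Ridge regression shrinking toward a prior weight vector alpha0:
    minimizing ||Z a - o||^2 + lam * ||a - alpha0||^2 gives
    a = (A + lam I)^{-1} (b + lam alpha0).

    alpha0 = 0 reproduces R:0; alpha0 = 1/K in each entry matches the
    reading of R:MM sketched here.
    """
    K = Z.shape[1]
    A = Z.T @ Z
    b = Z.T @ o
    return np.linalg.solve(A + lam * np.eye(K), b + lam * np.asarray(alpha0))

# R:MM example: alpha0 = np.full(K, 1.0 / K)
```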

(38)

If the multimodel strategy is carefully cross validated, then the simple mean beats all other investigated multimodel strategies.

Since Bayesian methods involve additional empirical parameters, proper assessment requires a two-deep cross validation procedure. This can change the conclusion about the efficacy of various Bayesian priors.

Traditional cross validation procedures are biased and incorrectly indicate that Bayesian schemes beat a simple mean.

(39)


Concluding comments CV

• CV is done because …….

• Does CV lower skill???

• CV procedures are quite complicated, full of traps. (The price we pay for impatience)

• Is there an all-purpose CV approach?

• 1-out procedures may be problematic for several reasons

• 3CVRE appears appropriate for (our) MME study.

(40)

[Figure: out to 1.5 years]
