to Censored Regression
K. Pelckmans, J. De Brabanter†, J.A.K. Suykens, B. De Moor K.U.Leuven, ESAT, SCD/sista, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
† Hogeschool KaHo Sint-Lieven, (Associatie K.U.Leuven), B-9000 Gent, Belgium
Kristiaan.Pelckmans@esat.kuleuven.be.
Abstract. This work explores the use of risk scores and Zero-estimators (Z-estimators) in learning theory. These notions are introduced in analogy with the likelihood scores of parametric probabilistic methods, but we instead study their relation with empirical risk minimization and their finite sample properties. The case of the L1-norm location estimator is discussed in some detail, and it is shown how this mechanism can be used for the design of a learning machine for regression in the context of censored observations, where the risk is not completely accessible. This work aims at the further integration of optimization theory in the analysis of learning algorithms.
1 MOTIVATION
A profound disadvantage of parametric statistical techniques is that valid inferences assume a given parametric model class containing the true parameter [6]. Robust statistics [5] relaxed this assumption somewhat by allowing the true parameter to lie in a close neighborhood of the parametric model class. Distribution-free learning theory [13, 4] takes a different viewpoint by adopting a more flexible non-parametric class of distributions, not necessarily indexed by a parameter. The objective here is the design and analysis of algorithms which result in estimates that are guaranteed to perform well on future samples generated from the same unknown stochastic model as before. Although the predictive model class must be of finite capacity [13] before effectively implementing risk minimization, minimal risk inferences are valid for almost any underlying stochastic model. We further investigate this line of thought for the task of estimating unknowns in a predictive model (θ are called the parameters of the predictive model, or in short parameters). We discuss finite sample risk guarantees of such estimates using a pre-specified loss function ℓ and for an arbitrary distribution in a non-parametric class F. To this end, the notion of risk scores is introduced and investigated, resulting in a class of learning algorithms which are named empirical Z-estimators. This idea is developed in analogy to the classical Z-estimators for maximum likelihood and M-estimators, see e.g. [12], Chapter 5. This work is restricted to the case where the parameters belong to a finite dimensional vector space.
This paper presents four main results. First, we introduce the formal notion of risk scores and risk based Z-estimators. Secondly, we show that empirical Z-estimators yield results which converge to the minimal risk estimate under certain conditions. To this end, we require that the loss function under consideration is differentiable and Lipschitz bounded, enabling a technique of quantization. This technique is advantageous over classical approaches based on covering numbers (see e.g. [13, 4, 2]) as no assumption is to be made concerning finiteness nor the existence of second moments (the support of the distribution need not be bounded). This property paves the way towards the design of robust methods. Thirdly, the case of the one norm (denoted as ℓ1) for location estimation is considered in some detail. Finally, the usefulness of empirical Z-estimators is indicated by constructing a convex learning machine for regression in the context of censored observations.
This paper is organized as follows. Section 2 discusses the notion of risk scores in the case of differentiable, Lipschitz bounded loss functions and also considers in some detail the case of the L1-norm for estimating the minimal risk location. Section 3 explores its use in the construction and analysis of a convex approach towards inference in the presence of randomly censored observations. Capital letters are used for random variables, lower case letters for deterministic values or vectors. An exception is made for the parameter vector θ̂_n minimizing the risk, which is a function of random variables and thus a random variable itself. The symbol '→P' denotes convergence in probability as n → ∞.
2 RISK SCORES
2.1 Empirical Z-estimators
Let Z be a (multivariate) random variable generated from a fixed but unknown stochastic model as represented by the distribution function F_Z ∈ F. Here, F is a non-parametric class of distributions with existing first moment. Assume further that any joint distribution in the class F is linearly independent, e.g. two random variables Z_i and Z_j are collinear with zero probability for any i < j. Let the data-samples {Z_i}_{i=1}^n be sampled iid from the same distribution, abbreviated as Z_{1:n} for notational convenience.
Definition 1 (Risk Scores) Let ℓ : R^D → R^+ in C^1(R^D, R) be a loss function measuring the discrepancy between parameters θ ∈ Θ = R^D and the fixed distribution F_Z (in some sense). (We do not consider the generalization to the constrained case where Θ ⊂ R^D.) The risk score for a given parameter vector θ is then defined as r : R^D → R in C^0(R^D, R) as follows:

r_d(θ*; Z) = ∂ℓ(θ; Z)/∂θ_d |_{θ=θ*},  ∀d = 1, . . . , D,   (1)

where, with a slight abuse of notation, the dependence of the lhs on the chosen loss function ℓ is omitted for notational convenience. The expected risk score can then be defined as

r_d(θ*; F_Z) = E[ ∂ℓ(θ; Z)/∂θ_d |_{θ=θ*} ],  ∀d = 1, . . . , D.   (2)

The expected risk score r : Θ → R^D is defined as the vector valued function r(θ; F_Z) = (r_1(θ; F_Z), . . . , r_D(θ; F_Z))^T ∈ R^D. Let the empirical counterpart r_{d,1:n} : R^D → R in C^0(R^D, R) be defined as follows:

r_{d,1:n}(θ*; Z_{1:n}) = (1/n) Σ_{i=1}^n ∂ℓ(θ; Z_i)/∂θ_d |_{θ=θ*},  ∀d = 1, . . . , D,   (3)

for the realizations {Z_i}_{i=1}^n sampled iid from F_Z. The empirical vector valued function r_{1:n} : Θ → R^D is defined as r_{1:n}(θ; Z_{1:n}) = (r_{1,1:n}(θ; Z_{1:n}), . . . , r_{D,1:n}(θ; Z_{1:n}))^T ∈ R^D.
Note the subtle notational difference between the expectation r(θ; F_Z), the random vector r(θ; Z), its empirical counterpart r_{1:n}(θ; Z_{1:n}), and the point evaluation r(θ; Z_i) of sample Z_i. As a prototype, consider the case Z = (X, Y) ⊂ R^D × R and ℓ(θ; F_{XY}) = (1/2)(θ^T X − Y)^2 for θ ∈ R^D. The corresponding expected risk score becomes E[(θ^T X − Y)X]. The following proposition summarizes some properties of this quantity.
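As a sketch of this prototype (assuming only numpy; the data generating model and variable names are purely illustrative), the empirical risk score (3) of the squared loss is the average gradient over the sample, and it is small near the risk minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 1000, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, D))
Y = X @ theta_true + 0.1 * rng.normal(size=n)

def empirical_risk_score(theta, X, Y):
    """Empirical risk score r_{1:n}(theta; Z_{1:n}) of the squared loss
    l(theta; (x, y)) = (1/2)(theta^T x - y)^2: the average gradient
    (1/n) sum_i (theta^T x_i - y_i) x_i."""
    residuals = X @ theta - Y
    return (residuals[:, None] * X).mean(axis=0)

# Near the risk minimizer the score is close to zero; far away it is not.
print(np.abs(empirical_risk_score(theta_true, X, Y)).max())
print(np.abs(empirical_risk_score(np.zeros(D), X, Y)).max())
```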
Proposition 1 (Properties of Risk Scores) Under sufficient regularity conditions on the distribution F_Z, the following equalities follow straightforwardly from the linearity of the derivative and the expectation:

1. The expectation and the derivative may be switched under the usual regularity conditions on the distribution F_Z and the loss function (the interchange of integration and differentiation is based on Lebesgue's Dominated Convergence theorem, see e.g. [10]):

r_d(θ; F_Z) = E[ ∂ℓ(θ; Z)/∂θ_d ] = ∂R(θ; F_Z)/∂θ_d,  ∀d = 1, . . . , D,   (4)

where R(θ; F_Z) = E[ℓ(θ; Z)] denotes the actual risk.

2. The empirical risk score provides an unbiased estimate of the actual risk scores:

E[r_{d,1:n}(θ; Z_{1:n})] = r_d(θ; F_Z),  ∀d = 1, . . . , D.   (5)

3. If ℓ ∈ C^1(R^D, R) is Lipschitz smooth with constant L > 0, the corresponding risk score is bounded in the interval ]−L, L[.

4. The risk score of a sample Z_i is defined as

r_d(θ_0; Z_i) = ∂ℓ(θ; Z_i)/∂θ_d |_{θ=θ_0},  ∀d = 1, . . . , D,   (6)

for all i = 1, . . . , n. The terms r_d(θ; Z_i) for fixed θ are iid random variables bounded in the interval ]−L, L[ when ℓ is Lipschitz smooth. Consider a fixed parameter θ ∈ Θ. In this case, the following probabilistic bound holds:

P( |r_d(θ; F_Z) − r_{d,1:n}(θ; Z_{1:n})| > ε ) ≤ 2 exp( −ε²n / (2L²) ),  ∀d = 1, . . . , D.   (7)

Proof: Statements 1 to 3 follow from the linearity of the expectation and the derivative operator. Statement 4 follows readily from Hoeffding's inequality (see e.g. [4]).
The principal objective of these notions is to make it possible to formalize the notion of an actual Z-estimator and its empirical counterpart.
Definition 2 (Z-estimator for Empirical Risk Minimization) An actual Z-estimator for loss ℓ and domain Θ outputs a parameter vector θ̂ ∈ Θ for a given distribution F_Z such that the following set of equalities holds:

r_d(θ̂; F_Z) = 0,  ∀d = 1, . . . , D,   (8)

for r the risk score associated with the fixed loss ℓ. An empirical Z-estimator outputs a parameter vector θ̂_n ∈ Θ for a given sample Z_{1:n} such that

r_{d,1:n}(θ̂_n; Z_{1:n}) = 0,  ∀d = 1, . . . , D.   (9)
These estimators are inspired by optimization routines which try to set the gradients of the objective to zero. The following section gives such a Z-estimator for the task of regression in the case of censored observations: here the empirical risk cannot be computed completely due to censoring. An important advantage of this kind of estimators is that in the case where the expectation of the risk is not finite (e.g. outliers of Y occur or Y is Cauchy distributed), the risk scores are still bounded and can as such be used for robust inference. Note that classical theorems on learning of real-valued functions, as e.g. in [2], typically require bounded response variables, excluding those cases. Let the risk score r_d be associated with the loss ℓ, and let the quantized risk score q_d : R^D → [−L, L] for a given quantization level Q ∈ N_0 be defined as follows:

q_d(θ; Z, Q) = (L/Q) ⌊(Q/L) r_d(θ; Z)⌋,   (10)

which has 2Q + 1 different possible output levels G_{Q,k} = kL/Q − L for k = 0, . . . , 2Q. Figure 1 exemplifies such quantization for a risk score function in the case of regression, ℓ(θ; (X, Y)) = ℓ(θ^T X − Y). Let the empirical counterpart q_{d,1:n} be defined correspondingly. Define the quantization error 0 ≤ ε_Q as sup_e |r_d(e) − q_d(e)|, which satisfies by construction ε_Q ≤ L/Q.
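The quantization (10) is a one-line operation; a minimal sketch (numpy assumed, the values of L and Q are illustrative):

```python
import numpy as np

L = 1.0  # Lipschitz constant: the risk score lies in ]-L, L[
Q = 4    # quantization level, giving 2Q + 1 output levels

def quantize(r, L=1.0, Q=4):
    """Quantized risk score q = (L/Q) * floor((Q/L) * r), taking values
    on the grid G_{Q,k} = k*L/Q - L for k = 0, ..., 2Q."""
    return (L / Q) * np.floor((Q / L) * np.asarray(r))

r = np.array([-0.99, -0.3, 0.0, 0.26, 0.99])
q = quantize(r)
print(q)
print(np.abs(r - q).max())  # quantization error, at most L/Q = 0.25
```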
Lemma 1 (Uniform Convergence of Risk Scores) Assume the loss ℓ ∈ C^1(R) is a Lipschitz bounded function with constant 0 < L < ∞. The random vector Z is assumed to occur in the loss through a linear combination, where θ ∈ Θ parameterizes this combination in a linearly independent way. Let 0 < α < 1 be a constant confidence level; then with probability higher than 1 − α, the following inequality holds:

sup_{θ∈Θ} |r(θ; F_Z) − r_{1:n}(θ; Z_{1:n})|_∞ ≤ 4L/Q + sqrt( (32L²/n) ( ln( 16D(2Q + 1) Σ_{i=0}^{D−1} C(n−1, i) ) − ln α ) ),   (11)

where C(n−1, i) denotes the binomial coefficient and we denote the rhs in short as ν(n, α; D, Q, L). Specifically, the rhs goes to zero when Q → ∞ and n → ∞ at sufficient rates. This proves uniform convergence r_{1:n} →P r.
Proof: The large deviation can be bounded as follows:

sup_{θ∈Θ} |r_{d,1:n}(θ; Z_{1:n}) − r_d(θ; F_Z)| ≤ sup_{θ∈Θ} |r_{d,1:n}(θ; Z_{1:n}) − q_{d,1:n}(θ; Z_{1:n})| + sup_{θ∈Θ} |q_{d,1:n}(θ; Z_{1:n}) − q_d(θ; F_Z)| + sup_{θ∈Θ} |q_d(θ; F_Z) − r_d(θ; F_Z)|.   (12)

Both the first and last term of the rhs can be bounded by ε_Q by linearity of the expectation operator:

|E[r_d(θ; Z)] − E[q_d(θ; Z)]| ≤ E[ |r_d(θ; Z) − q_d(θ; Z)| ] ≤ E[ε_Q] = ε_Q.   (13)

The middle term of the rhs of equation (12) is bounded by reducing the terms to a classification problem. To this end, let the indicator function I(z < c) be 1 if z < c and zero otherwise, and let the variables δ_d^Q be defined as

δ_d^Q(θ; Z_i, k) = I( q_d(θ; Z_i, Q) ≤ G_{Q,k} ) ∈ {0, 1},   (14)

for k = 0, . . . , 2Q, indicating whether a risk score r_d(θ; Z_i) is smaller than or equal to the k-th quantized level. Notice that the set {δ_d^Q(θ; Z_i, k)}_{i=1}^n for given Q, d and k is iid (by construction). Application of the union bound (as thoroughly used in e.g. [13, 4]) yields for any ε > 0:

P( sup_{θ∈Θ} | E[δ_d^Q(θ; Z, k)] − (1/n) Σ_{i=1}^n δ_d^Q(θ; Z_i, k) | ≥ ε ) ≤ 16 Σ_{i=0}^{D−1} C(n−1, i) exp( −nε²/8 ),   (15)

where the bound on the shattering number of the vector space Θ with dimension D is as described in [3, 4] under the assumption that the D variables are linearly independent. Rewriting the term q_d as q_d(θ; Z, Q) = Σ_{k=0}^{2Q} (L/Q) δ_d^Q(θ; Z, k) − L (a weighted sum of the indicators) and application of the above bound gives

P( sup_{θ∈Θ} | q_d(θ; F_Z, Q) − (1/n) Σ_{i=1}^n q_d(θ; Z_i, Q) | ≥ ε )
≤ P( sup_{θ∈Θ} (L/Q) Σ_{k=0}^{2Q} | E[δ_d^Q(θ; Z, k)] − (1/n) Σ_{i=1}^n δ_d^Q(θ; Z_i, k) | ≥ ε )
≤ P( sup_{θ∈Θ} sup_k | E[δ_d^Q(θ; Z, k)] − (1/n) Σ_{i=1}^n δ_d^Q(θ; Z_i, k) | ≥ Qε/((2Q+1)L) )
≤ 16 (2Q + 1) Σ_{i=0}^{D−1} C(n−1, i) exp( −nε² / (32L²) ),   (16)

by linearity of the terms q_d and δ_d^Q; application of the inequality P(Σ_{i=1}^n |Y_i| > ε) ≤ P(sup_i |Y_i| > ε/n); and application of the union bound over the 2Q + 1 different quantized intervals, respectively (the constant 32L² uses the approximation Q/(2Q+1) ≈ 1/2). Application of the union bound over the dimensions d = 1, . . . , D and substitution of the rhs by 0 < α gives (11).
Fig. 1. An example of a risk score (solid smooth line) as a function of the parameter θ in a regression setting where ℓ(θ; F_Z) = ℓ(θ^T X − Y). The horizontal dotted lines indicate the Q + 1 = 4 different quantized intervals R_1, R_2, R_3 and R_4. The staircase (solid line) indicates the quantized risk score as in (10). The vertical dashed lines indicate the corresponding quantization of the residuals (θ^T x − y).
At this point, we can state our main result. A loss function ℓ with associated risk score r for distribution F_Z is said to have a separable minimum in case for all ε > 0 the following inequality holds:

inf_{|θ−θ*|_∞ > ε} |r(θ; F_Z)|_∞ > 0 = |r(θ*; F_Z)|_∞.   (17)
Theorem 1 (Convergence of Empirical Z-estimators) Let the set {Z_i}_{i=1}^n be sampled iid from a fixed but unknown underlying joint distribution F_Z ∈ F possessing the regularity conditions stated previously. Assume the continuous loss function ℓ : R^D → R in C^1(R^D, R) has a separable minimum as in (17). Then r(θ*; F_Z) = 0 ⇔ θ* = arg min_{θ∈Θ} R(θ; F_Z). Moreover, any sequence of estimators {θ̂_n} with bounded deviation from the optimum r(θ̂_n; F_Z) converges to the theoretical risk minimizer: θ̂_n →P θ* with |r(θ*; F_Z)|_∞ = 0.

Proof: The proof follows from Lemma 1 by taking the limit Q → ∞ of inequality (11) for n → ∞. A formal justification can be found in Theorem 5.7 of [12].

In particular, an algorithm whose risk scores converge to zero (r_{1:n}(θ; Z_{1:n}) →P 0) effectively implements risk minimization. This theorem motivates the construction of algorithms which try to find zero risk scores based on a finite sample, as exemplified in Section 3.
Corollary 1 (Testing a Point Prediction with a Finite Sample) Consider a fixed estimate θ̂ ∈ Θ, and a Lipschitz bounded loss function ℓ ∈ C^1(R^D, R) with constant L < ∞. The probability of occurrence of the empirical risk score r_{1:n}(θ̂) under the null hypothesis that F_Z is such that |r(θ̂; F_Z)|_∞ = 0 obeys the tail inequality

P( |r_{1:n}(θ̂; Z_{1:n})|_∞ ≥ ε  |  |r(θ̂; F_Z)|_∞ = 0 ) ≤ D exp( −ε²n / (2L²) ).   (18)

Proof: This statement follows readily by application of Hoeffding's inequality as in Proposition 1.4, together with a union bound over the D dimensions.

This can be used for testing a single estimate θ̂ on an independent test-set Z_{1:n}. To see this, substitute the rhs of (18) by the confidence level 0 < α < 1: with probability smaller than α, the following inequality holds

|r_{1:n}(θ̂; Z_{1:n})|_∞ ≥ sqrt( 2L²(ln D − ln α) / n )   (19)

under the condition that r(θ̂; F_Z) = 0. Now the null-distribution is such that F_Z satisfies r(θ̂; F_Z) = 0, the test-statistic is r_{1:n}(θ̂), and the finite sample (!) test distribution has a tail bound as in the rhs of (18), with a derived critical value which is an upper-bound to the p-value as in (19) (see e.g. [6]). Testing a finite number (say 1 < m < ∞) of hypotheses {θ_1, . . . , θ_m} can be done similarly by adding a Bonferroni correction term (or equivalently, application of the union bound) such that the rhs of (19) becomes sqrt( 2L²(ln(Dm) − ln α) / n ). The following result studies the extension towards infinite sets, which occurs e.g. if one wants to have a confidence set for the result of optimization over an infinite set, say Θ = R^D.
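The finite-m test above amounts to comparing the empirical score against a critical value; a minimal sketch (numpy assumed, the parameter values are illustrative):

```python
import numpy as np

def critical_value(n, alpha, D, L, m=1):
    """Critical value sqrt(2 L^2 (ln(D m) - ln(alpha)) / n) for testing
    H0: |r(theta_hat; F_Z)|_inf = 0 with confidence level alpha, with a
    Bonferroni correction over m tested hypotheses (m = 1 recovers (19))."""
    return np.sqrt(2.0 * L**2 * (np.log(D * m) - np.log(alpha)) / n)

# Reject H0 for a fixed theta_hat when |r_{1:n}(theta_hat)|_inf exceeds c.
c = critical_value(n=500, alpha=0.05, D=3, L=1.0)
print(round(c, 4))
```

Note that the critical value grows only logarithmically in the number of hypotheses m, as expected from the union bound.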
Definition 3 (Confidence Set for a Finite Sample) Let θ* ∈ R^D be the Bayes estimate such that r(θ*; F_Z) = 0. Let 0 < α < 1 be a pre-specified confidence of covering. Given a finite sample {Z_i}_{i=1}^n sampled iid from F_Z, a nontrivial set S_{α,n} ⊂ Θ can be constructed which includes the Bayes estimate with high probability, formally

P(θ* ∈ S_{α,n}) ≥ α.   (20)

To construct this set, all vectors θ ∈ Θ should be included whose empirical risk score is not too far from the minimal actual risk score. In terms of scores, this can be formalized as follows by application of Theorem 1:

S_{α,n} = { θ ∈ Θ : |r_{1:n}(θ; Z_{1:n})|_∞ ≤ ν(n, α; D, Q, L) }.   (21)
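For a one-dimensional location problem with the squared loss, the construction (21) can be sketched on a grid (numpy assumed; the value of the bound ν is chosen purely for illustration, not computed from (11)):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(loc=0.0, scale=1.0, size=1000)

def score(theta, Z):
    """Empirical risk score of the squared loss (1/2)(theta - z)^2."""
    return np.mean(theta - Z)

nu = 0.1  # illustrative stand-in for the bound nu(n, alpha; D, Q, L)
grid = np.linspace(-2.0, 2.0, 401)
S = grid[np.abs([score(t, Z) for t in grid]) <= nu]
print(S.min(), S.max())  # an interval of width about 2*nu around the mean
```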
2.2 ℓ1LOSS FUNCTION FOR LOCATION ESTIMATION
We consider in some detail the prototypical case of estimating the minimum risk location parameter θ ∈ R of a univariate random variable Y with iid samples {Y_i}_{i=1}^n, using the ℓ1 loss. The L1-norm loss is defined as ℓ1(θ; Z) = ℓ1(θ; Y) = |θ − Y|. Since the L1-norm ℓ1 ∈ C^0(R, R) is not differentiable, we resort to a closely related loss function which belongs to C^∞(R, R). Let the loss ℓ_{a_s} be defined for any e ∈ R as

ℓ_{a_s}(e) = sqrt(e² + a_s),   (22)

with a_s > 0 an appropriate constant (see Figure 2).
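The approximation (22) and its derivative are straightforward to evaluate; a minimal sketch (numpy assumed, the value of a_s is illustrative):

```python
import numpy as np

def l_as(e, a_s=1e-4):
    """Differentiable approximation l_{a_s}(e) = sqrt(e^2 + a_s) of |e|."""
    return np.sqrt(np.asarray(e)**2 + a_s)

def r_as(e, a_s=1e-4):
    """Its derivative e / sqrt(e^2 + a_s), a smoothed sign function."""
    e = np.asarray(e)
    return e / np.sqrt(e**2 + a_s)

e = np.linspace(-1.0, 1.0, 201)
print(np.abs(l_as(e) - np.abs(e)).max())  # uniform error at most sqrt(a_s)
print(r_as(1.0))                          # close to sign(1.0) = 1
```

The uniform error sup_e |ℓ_{a_s}(e) − |e|| equals sqrt(a_s) (attained at e = 0), which makes the convergence in Lemma 2 below explicit.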
Fig. 2. The ℓ1 loss function and its discontinuous derivative (dashed lines), and the differentiable approximation ℓ_{a_s} and its respective derivative (smooth solid lines). The horizontal thin lines indicate the critical region ]−a_s^{1/4}, a_s^{1/4}[ where the approximation r_{a_s} has a distance from q_{a_s} smaller than one.
Lemma 2 (Uniform Convergence) The approximation ℓ_{a_s} converges uniformly to ℓ1 when a_s → 0, formally

lim_{a_s→0} sup_e |ℓ_{a_s}(e) − ℓ1(e)| = 0.   (23)

Moreover, the optimal estimate using the approximate norm, defined as θ̂_{a_s} = arg min_{θ∈Θ} ℓ_{a_s}(θ − Y) corresponding with ℓ_{a_s} for a_s > 0, will converge to θ̂_1, the parameter with optimal risk for the loss ℓ(·) = |·|, formally θ̂_1 = arg min_{θ∈Θ} ℓ1(θ − Y). Both will converge when a_s shrinks sufficiently fast, formally

lim_{a_s→0} θ̂_{a_s} = θ̂*,   (24)

where θ̂* = arg min_{θ∈Θ} R_{1:n}(θ; Z_{1:n}).
This result is proven in [8, 14]. Let now r_{a_s} be defined as the risk score of the differentiable loss ℓ_{a_s}, such that in the case of the univariate location estimate Z = Y ∈ R one can write

r_{a_s}(θ; Y) = ∂ℓ_{a_s}(θ − Y)/∂θ = (θ − Y) / sqrt((θ − Y)² + a_s).   (25)

The following quantized version q_{a_s} is defined for fixed 0 < a_s as follows:

q_{a_s}(θ; Y) = −1 if (θ − Y) < −a_s^{1/4};  1 if (θ − Y) > a_s^{1/4};  0 elsewhere.   (26)
A first result bounds the deviation of this quantization q_{a_s} from the continuous score r_{a_s} for any fixed θ.

Proposition 2 (Bounded Deviation of Quantization) Let any set {Y : |θ − Y| ≤ a_s^{1/4}} contain a probability mass of at most 0 ≤ p(a_s) < 1 for any parameter θ ∈ R and underlying distribution F_Y. Then the expected deviation between r_{a_s} and q_{a_s} can be bounded in terms of this probability mass for any fixed θ ∈ R as follows:

E[ |r_{a_s}(θ; Y) − q_{a_s}(θ; Y)| ] ≤ (1 − p(a_s)) ( 1 − 1/sqrt(1 + √a_s) ) + p(a_s).   (27)

Proof: This argument can be proven purely geometrically (see Figure 2). Let e be defined as e = (θ − Y). First, assume that e ≥ a_s^{1/4} (or equivalently e ≤ −a_s^{1/4}); then 1 − |r_{a_s}(θ; Y)| = 1 − |e|/sqrt(e² + a_s). This quantity can be bounded by 1 − 1/sqrt(1 + √a_s), obtained by substituting e = a_s^{1/4} and using the fact that |r_{a_s}| is monotonically increasing in |e|. When |e| < a_s^{1/4}, q_{a_s} = 0 and the deviation is bounded as |r_{a_s}(θ; Y)| < 1 by construction. This together with the definition of p(a_s) proves the inequality (27).
We now specialize this result to the parameter θ̂_n which solves the convex problem

θ̂_n = arg min_{θ∈Θ} (1/n) Σ_{i=1}^n ℓ1(θ − Y_i),   (28)

for a given iid sample Y_{1:n} of size n. This is done by relating the Lagrange multipliers to the quantization as proposed in (26). Without loss of generality, the following Proposition is stated in a deterministic setting where the samples are fixed such that Y_i = y_i for all i = 1, . . . , n. First, the dual derivation of problem (28) is given. The ℓ1 based location estimation problem can be formulated as a convex constrained programming problem by introducing slack variables {e_i}_{i=1}^n as follows:

min_{θ∈R, e∈R^n} J(e) = Σ_{i=1}^n e_i  s.t.  −e_i ≤ θ − y_i ≤ e_i,  ∀i = 1, . . . , n.   (29)

The Lagrangian becomes L(e, θ; α^+, α^−) = Σ_{i=1}^n e_i − Σ_{i=1}^n α_i^+ (e_i + θ − y_i) − Σ_{i=1}^n α_i^− (e_i − θ + y_i). The first order conditions for optimality of the Lagrangian w.r.t. the primal variables e_i and θ become

∂L/∂e_i = 0 → 1 = α_i^+ + α_i^−,  ∀i = 1, . . . , n;  ∂L/∂θ = 0 → Σ_{i=1}^n (α_i^+ − α_i^−) = 0.   (30)

Those conditions are satisfied by the strong duality property, see e.g. [9]. Substitution of (α_i^− − α_i^+) by a new variable α_i for all i = 1, . . . , n gives

−1 ≤ α_i ≤ 1,  ∀i = 1, . . . , n   (a)
Σ_{i=1}^n α_i = 0   (b)
α_i = 1 ⇔ θ − y_i > 0,  ∀i = 1, . . . , n   (c)
α_i = −1 ⇔ θ − y_i < 0,  ∀i = 1, . . . , n   (d)
−1 < α_i < 1 ⇔ θ − y_i = 0,  ∀i = 1, . . . , n   (e)   (31)

where the last three statements follow from the complementary slackness conditions.
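The primal problem (29) can be handed to any LP solver directly; a sketch assuming numpy and scipy (the helper name is illustrative). For an odd number of samples, the minimizer is the sample median:

```python
import numpy as np
from scipy.optimize import linprog

def l1_location(y):
    """Solve min_theta sum_i |theta - y_i| via the LP (29):
    min sum_i e_i  s.t.  -e_i <= theta - y_i <= e_i, e_i >= 0.
    Decision variables are (theta, e_1, ..., e_n)."""
    n = len(y)
    c = np.concatenate(([0.0], np.ones(n)))
    # theta - y_i <= e_i   rewritten as   theta - e_i <= y_i
    A_up = np.hstack([np.ones((n, 1)), -np.eye(n)])
    # y_i - theta <= e_i   rewritten as  -theta - e_i <= -y_i
    A_lo = np.hstack([-np.ones((n, 1)), -np.eye(n)])
    res = linprog(c, A_ub=np.vstack([A_up, A_lo]),
                  b_ub=np.concatenate([y, -y]),
                  bounds=[(None, None)] + [(0.0, None)] * n)
    return res.x[0]

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])
print(l1_location(y))  # the sample median, 2.5
```

Note how the outlying value 10.0 does not displace the estimate, illustrating the robustness of the bounded ℓ1 risk score.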
Proposition 3 (Lagrange Multipliers and Risk Scores) Let θ̂_n ∈ Θ be the empirical optimum obtained by minimizing problem (28), and let the vector α be defined as α = (α_1, . . . , α_n)^T ∈ R^n. For any a_s > 0, the following inequality holds:

|α_i − r_{a_s}(θ̂_n; Y_i)| ≤ 1 − 1/sqrt(1 + √a_s)  if |θ̂_n − Y_i| > a_s^{1/4};  ≤ 2  if |θ̂_n − Y_i| ≤ a_s^{1/4}.   (32)

Proof: This statement is proven by relating the quantization as defined in (26) with conditions (31.cde). The factor 2 arises from the case |θ̂_n − Y_i| = 0 and its counterpart condition (31.e).
Lemma 3 (Convergence of Empirical Scores) Under the regularity conditions on F_Y ∈ F and the Lebesgue smoothness assumption, the following inequality holds for an iid sample Y_{1:n} of F_Y:

lim_{a_s→0} E[ |r_{a_s,1:n}(θ̂_n; Y_{1:n})| ] ≤ 2D/n.   (33)

Hence the convergence θ̂_n →P arg min_{θ∈Θ} R_1(θ; F_Y) takes place.

Proof: Remark that from the conditions for optimality as given in the previous proof, the equality Σ_{i=1}^n α_i = 0 holds exactly. We take the expectation of expression (32) as in Proposition 2. Substitution of the term p(a_s) by D/n follows from the fact that, in the limit a_s → 0, at most D different points can be interpolated by any θ, hence lim_{a_s→0} p(a_s) = D/n. This can be made formal through use of the assumption of Lebesgue smooth distributions as e.g. in [12]. The first term in the rhs of expression (27) vanishes as lim_{a_s→0} (1 − 1/sqrt(1 + √a_s)) = 0, such that

lim_{n→∞} lim_{a_s→0} E[ |r_{a_s,1:n}(θ̂_n; Y_{1:n})| ] = 0.

Interchanging the limits and application of Theorem 1 and Lemma 1 proves convergence.
3 CENSORED REGRESSION
Let V be a univariate random variable representing the response corresponding with covariate X, such that the set {Z_i = (X_i, V_i)}_{i=1}^n is sampled iid from a joint distribution F_{XV} over R^D × R with F_{XV} ∈ F. Now V_i is subject to right-censoring at a point C_i, sampled iid from F_C over R. The observed response is as such Y_i = min(C_i, V_i). Let the random variable ∆ denote whether the observation (X, V) is subject to censoring. Formally, for any i = 1, . . . , n:

(X_i, V_i) ~ F_{XV};  ∆_i = I(V_i > C_i);  Y_i = min(V_i, C_i).   (34)
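The observation model (34) can be simulated in a few lines (numpy assumed; the distributions of X, V and C are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 200, 2
theta_true = np.array([1.5, -1.0])

# Latent responses V, censoring points C, observed (X, Y, Delta) as in (34).
X = rng.normal(size=(n, D))
V = X @ theta_true + 0.3 * rng.normal(size=n)
C = rng.normal(loc=1.0, scale=2.0, size=n)
Delta = (V > C).astype(int)   # Delta_i = I(V_i > C_i): 1 when censored
Y = np.minimum(V, C)          # observed response Y_i = min(V_i, C_i)

print(Delta.mean())           # fraction of right-censored observations
```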
In our setup, X, Y and ∆ are fully observed. The response V_i is only observed in case ∆_i = 0. Risk minimization in the case of censored responses targets the minimization of the actual risk R(θ; F_{XV}) through the observed random variables X and Y. Note at first that this setup is unlike the likelihood approach of the Tobit model [11, 1], which describes the data generating model including the censoring process (corresponding to F_{XY}). Our approach differs as well from the censored least absolute deviation approach (CLAD) of Powell [7], where the observed values are used to fit a model min(C_i, θ^T X_i) with an L1-norm. In our case, a standard linear programming problem solves the estimation problem, unlike the iterative optimizers needed in the referred techniques. First, we extend the risk score to the case of censored observations.
Definition 4 (Risk Score of Censored Observations) Let (X_i, Y_i) be a right-censored observation such that ∆_i = 1. Let θ ∈ Θ be a fixed parameter. The set of possible risk scores for this censored sample is then

r^∆(θ; X_i, Y_i) = { r(θ; X_i, V_i) s.t. V_i ≥ Y_i }.   (35)

If this set is a singleton, then the point is informative for the loss and the fixed parameter θ. Let for such informative censored points r^δ(θ; X_i, Y_i) be equal to r^∆(θ; X_i, Y_i), as E[r^δ(θ; X_i, Y_i)] = E[r^δ(θ; X_i, V_i)]. In case the non-singleton set r^∆(θ; X_i, Y_i) contains zero, we define r^δ(θ; X_i, Y_i) = 0, as E[r^∆(θ; X_i, V_i)] = 0 without other knowledge of the problem. The censored empirical risk score of a fixed parameter θ ∈ Θ can then be defined as

r^δ_{1:n}(θ; Z_{1:n}) = (1/n) Σ_{i:∆_i=0} r(θ; X_i, Y_i) + (1/n) Σ_{i:∆_i=1} r^δ(θ; X_i, Y_i),   (36)

in the case of a loss whose risk score takes almost surely a value in the set {−L, L}, as for ℓ1. As such, the above expansion consists of n iid terms.
Proposition 4 (Unbiased Risk Score Estimate) The expectation of r^δ_{1:n} reduces to the actual risk score:

E[ r^δ_{1:n}(θ; Z_{1:n}) ] = (1/n) E[ Σ_{i:∆_i=0} r(θ; X_i, Y_i) ] + (1/n) E[ Σ_{i:∆_i=1} r^δ(θ; X_i, Y_i) ] = r(θ; F_{XV}).   (37)

Proof: This follows from the definitions and by exploiting the iid property of the censoring mechanism.

Now the previous methodology of risk scores can be applied to give finite sample properties of an estimator setting this censored empirical risk score to zero, as in the following learning machine. This censored empirical risk score gives an improved estimate of the actual risk score with respect to one that omits all censored observations. See Fig. 3 for a schematic representation for the case of the ℓ1 loss.
Fig. 3. Schematic representation of a sample (X_i, V_i) which is right-censored with active censoring at C_i (and thus ∆_i = 1). The observed value Y_i is the minimum min(C_i, V_i) and thus equals C_i (vertical solid line) in this case. It is clearly seen that the risk score for ℓ_{a_s} (smooth solid line) is approximately one for any V_i > Y_i.
Definition 5 (Right-Censored Regression) The parameter θ ∈ R^D corresponding to the loss function ℓ describing optimally the latent variable V given its covariate X, based on the censored sample of size n, can be found by solving

(θ̂_n, ê) = arg min_{θ∈Θ, e∈R^n} (1/n) Σ_{i=1}^n ℓ1(e_i)  s.t.  θ^T x_i + e_i = y_i, ∀i : ∆_i = 0;  θ^T x_i + e_i ≥ y_i, ∀i : ∆_i = 1.   (38)
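The learning machine can be cast as one linear program; a sketch assuming numpy and scipy (the helper name and the simulated data are illustrative). The one-sided slack for censored samples penalizes predictions below the observed censoring point, which is the direction consistent with the risk scores of Definition 4 for right-censored data:

```python
import numpy as np
from scipy.optimize import linprog

def censored_l1_regression(X, Y, Delta):
    """Sketch of problem (38) as a linear program.  Uncensored samples
    (Delta_i = 0) pay |theta^T x_i - y_i|, written as u_i + v_i with
    theta^T x_i + u_i - v_i = y_i, u_i, v_i >= 0.  Censored samples
    (Delta_i = 1) pay a one-sided slack s_i >= 0 with
    theta^T x_i + s_i >= y_i (no cost for predicting above the
    censoring point)."""
    n, D = X.shape
    unc = np.flatnonzero(Delta == 0)
    cen = np.flatnonzero(Delta == 1)
    nu, nc = len(unc), len(cen)
    # variable order: theta (D entries, free), u (nu), v (nu), s (nc)
    c = np.concatenate([np.zeros(D), np.ones(2 * nu + nc)])
    A_eq = np.hstack([X[unc], np.eye(nu), -np.eye(nu), np.zeros((nu, nc))])
    # theta^T x_i + s_i >= y_i  rewritten as  -theta^T x_i - s_i <= -y_i
    A_ub = np.hstack([-X[cen], np.zeros((nc, 2 * nu)), -np.eye(nc)])
    res = linprog(c, A_ub=A_ub, b_ub=-Y[cen], A_eq=A_eq, b_eq=Y[unc],
                  bounds=[(None, None)] * D + [(0.0, None)] * (2 * nu + nc))
    return res.x[:D]

# Simulated right-censored data as in (34); theta_true is illustrative.
rng = np.random.default_rng(3)
n, D = 400, 2
theta_true = np.array([1.5, -1.0])
X = rng.normal(size=(n, D))
V = X @ theta_true + 0.2 * rng.normal(size=n)
C = rng.normal(loc=1.0, scale=2.0, size=n)
Delta = (V > C).astype(int)
Y = np.minimum(V, C)

theta_hat = censored_l1_regression(X, Y, Delta)
print(theta_hat)  # close to theta_true for moderate censoring
```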
Theorem 2 (Consistency of the ℓ1 Risk Score Estimator) This Z-estimate θ̂_n converges with high probability to the optimal risk estimate:

lim_{n→∞} θ̂_n = arg min_{θ∈Θ} R(θ; F_{XV}).   (39)

The key element of the proof is to show that the empirical risk scores r_{a_s} and the Lagrange multipliers used in the programming problem are with high probability equal for any a_s > 0 and θ ∈ Θ. The result then follows from Lemma 1. First, let the risk score corresponding to ℓ_{a_s}(θ^T X − Y) for regression be defined for all d = 1, . . . , D as

r_{a_s,d}(θ; X, Y) = ∂ℓ_{a_s}(θ^T X − Y)/∂θ_d = ℓ'_{a_s}(θ^T X − Y) X_d,   (40)

by application of the chain rule. Note at this point the relation with Proposition 2: this quantity can again be related to the Lagrange multipliers of the constrained problem.
Proposition 5 (Dual and Scores) Let θ̂_n ∈ Θ be the optimum obtained by minimizing problem (38), and let α = (α_1, . . . , α_n)^T ∈ R^n be the corresponding Lagrange multipliers. For any a_s > 0 and for non-censored samples (X_i, Y_i) (where ∆_i = 0), the following inequality holds as previously: |α_i X_i − r_{a_s}(θ̂_n; X_i, Y_i)|_∞ ≤ B(1 − 1/sqrt(1 + √a_s)) whenever |θ̂_n^T X_i − Y_i| > a_s^{1/4}, and ≤ 2B otherwise. For all censored samples (X_i, Y_i) with ∆_i = 1, the following inequality holds:

|α_i X_i − r^{δ,a_s}(θ̂_n; X_i, Y_i)|_∞ ≤ B(1 − 1/sqrt(1 + √a_s))  if θ̂_n^T X_i < Y_i − a_s^{1/4};  ≤ 2B  if |θ̂_n^T X_i − Y_i| ≤ a_s^{1/4};  = 0  if θ̂_n^T X_i > Y_i + a_s^{1/4},   (41)

where B < ∞ is fixed such that B ≥ |X|_∞.

Proof: The censored ℓ1 regression problem (38) can be formulated as a linear programming problem as follows:

min_{θ∈R^D, e∈R^n} J(e) = Σ_{i=1}^n e_i  s.t.  −e_i ≤ θ^T X_i − Y_i ≤ e_i, ∀i : ∆_i = 0;  e_i ≥ 0, Y_i − θ^T X_i ≤ e_i, ∀i : ∆_i = 1.   (42)

The Lagrangian becomes L(e, θ; α^+, α^−) = Σ_{i=1}^n e_i − Σ_{i:∆_i=0} α_i^+ (e_i + θ^T X_i − Y_i) − Σ_{i:∆_i=0} α_i^− (e_i − θ^T X_i + Y_i) − Σ_{i:∆_i=1} α_i^+ e_i − Σ_{i:∆_i=1} α_i^− (e_i + θ^T X_i − Y_i). Due to the theorem of strong duality in linear programming [9], in the optimum the first order conditions for optimality w.r.t. the primal variables e_i and θ are satisfied:

∂L/∂e_i = 0 → 1 = α_i^+ + α_i^−,  ∀i = 1, . . . , n;
∂L/∂θ_d = 0 → Σ_{i:∆_i=0} (α_i^− − α_i^+) X_{id} − Σ_{i:∆_i=1} α_i^− X_{id} = 0,  ∀d = 1, . . . , D.   (43)

Replacing (α_i^− − α_i^+) by a new variable α_i for all i such that ∆_i = 0, and setting α_i = −α_i^− when ∆_i = 1, gives:

Σ_{i=1}^n α_i X_{id} = 0,  ∀d = 1, . . . , D   (a)
−1 ≤ α_i ≤ 1,  i : ∆_i = 0   (b)
−1 ≤ α_i ≤ 0,  i : ∆_i = 1   (c)
α_i = 1 ⇔ (θ^T X_i − Y_i) > 0,  i : ∆_i = 0   (d)
α_i = −1 ⇔ (θ^T X_i − Y_i) < 0,  ∀i = 1, . . . , n   (e)
−1 < α_i < 1 ⇔ (θ^T X_i − Y_i) = 0,  i : ∆_i = 0   (f)
α_i = 0 ⇔ θ^T X_i > Y_i,  i : ∆_i = 1   (g)
−1 < α_i < 0 ⇔ (θ^T X_i − Y_i) = 0,  i : ∆_i = 1   (h)   (44)

We relate this quantity to the risk scores r^{δ,a_s}(θ; Z_i) as follows. Comparing with the definition of the risk score r_{a_s} as in (25) and conditions (44.b,d,e) shows that one can use α_i X_{id} as quantized versions of the risk score for ℓ_{a_s} when a_s → 0.
Corollary 2 (Bounded Expected Deviation) Let θ̂_n ∈ Θ be the minimizer of problem (38); then the expected deviation of the censored empirical risk score from zero is bounded as follows:

E[ |r^{a_s,δ}_{1:n}(θ̂_n; Z_{1:n})| ] ≤ (1 − p(a_s)) B ( 1 − 1/sqrt(1 + √a_s) ) + 2B p(a_s),   (45)

where p(a_s) denotes the probability mass

0 ≤ p(a_s) = sup_{θ∈Θ} P_{XV}( { (X, V) : |θ^T X − V| ≤ a_s^{1/4} } ) ≤ 1.   (46)

Given this result, Theorem 2 is proved as before in Lemma 3.
4 CONCLUSION
This paper advances results relating learning theory with notions which are indispensable in optimization theory. To this end, we outlined an approach for incorporating the notion of risk scores and Lagrange multipliers in the theory of empirical risk minimization. Convergence of empirical Z-estimators to the actual risk minimizing solution is shown. This framework relies on differentiability and Lipschitz boundedness of the loss function, but it is shown that the ideas can be extended to the non-differentiable L1-norm risk by using an appropriate approximation. Finally, it is shown how this methodology can be used to design a learning machine for regression in the context of censored observations.
Acknowledgments. This research work was carried out at the ESAT laboratory of the K.U.Leuven. Research Council KU Leuven: Concerted Research Action GOA-Mefisto 666, GOA-Ambiorics IDO, several PhD/postdoc & fellow grants; Flemish Government: Fund for Scientific Research Flanders (several PhD/postdoc grants, projects G.0407.02, G.0256.97, G.0115.01, G.0240.99, G.0197.02, G.0499.04, G.0211.05, G.0080.01, research communities ICCoS, ANMMM), AWI (Bil. Int. Collaboration Hungary/Poland), IWT (Soft4s, STWW-Genprom, GBOU-McKnow, Eureka-Impact, Eureka-FLiTE, several PhD grants); Belgian Federal Government: DWTC IUAP IV-02 (1996-2001) and IUAP V-10-29 (2002-2006), Program Sustainable Development PODO-II (CP/40); Direct contract research: Verhaert, Electrabel, Elia, Data4s, IPCOS. JS is an associate professor and BDM is a full professor at K.U.Leuven, Belgium.
References
1. T. Amemiya. Regression analysis when the dependent variable is truncated normal. Econo-metrica, 41(6):997–1016, 1973.
2. M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cam-bridge University Press, 1999.
3. T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326–334, 1965.
4. L. Devroye, L. Gy¨orfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
5. P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101, 1964.
6. E.L. Lehmann and J.P. Romano. Testing Statistical Hypotheses. Springer Series in Statistics. Springer, 2nd edition, 1986.
7. J.L. Powell. Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25:303–325, 1984.
8. C.C. Pugh. Real Mathematical Analysis. Springer-Verlag, 2002.
9. R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
10. W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 1976.
11. J. Tobin. Estimation of relationships for limited dependent variables. Econometrica, 26:24–36, 1958.
12. A. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
13. V.N. Vapnik. Statistical Learning Theory. Wiley and Sons, 1998.
14. C.R. Vogel. Computational Methods for Inverse Problems. SIAM, 2002.