to Censored Regression
K. Pelckmans, J. De Brabanter†, J.A.K. Suykens, B. De Moor K.U.Leuven, ESAT, SCD/sista, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
† Hogeschool KaHo Sint-Lieven, (Associatie K.U.Leuven), B-9000 Gent, Belgium
Kristiaan.Pelckmans@esat.kuleuven.be.
Abstract. This work explores the use of risk scores and Zero-estimators (Z-estimators) in learning theory. These notions are introduced in analogy with the likelihood scores of parametric probabilistic methods, but we instead study their relation with empirical risk minimization and their finite sample properties. The case of the L1-norm location estimator is discussed in some detail, and it is shown how this mechanism can be used for the design of a learning machine for regression in the context of censored observations, where the risk is not completely accessible. This work aims at the further integration of optimization theory in the analysis of learning algorithms.
1 MOTIVATION
A profound disadvantage of parametric statistical techniques is that valid inferences assume a given parametric model class containing the true parameter [6]. Robust statistics [5] relaxed this assumption somewhat by allowing the true parameter to lie in a close neighborhood of the parametric model class. Distribution-free learning theory [13, 4] takes a different viewpoint by adopting a more flexible non-parametric class of distributions, not necessarily indexed by a parameter. The objective here is the design and analysis of algorithms which result in estimates that are guaranteed to perform well on future samples generated from the same unknown stochastic model as before. Although the predictive model class must be of finite capacity [13] before effectively implementing risk minimization, minimal risk inferences are valid for almost any underlying stochastic model. We further investigate this line of thought for the task of estimating unknowns in a predictive model (θ are called the parameters of the predictive model, or in short parameters). We discuss finite sample risk guarantees of such estimates using a pre-specified loss function ℓ and for an arbitrary distribution in a non-parametric class F. To this end, the notion of risk scores is introduced and investigated, resulting in a class of learning algorithms which are named empirical Z-estimators. This idea is developed in analogy to the classical Z-estimators for maximum likelihood and M-estimators, see e.g. [12], Chapter 5. This work is restricted to the case where the parameters belong to a finite dimensional vector space.
This paper presents four main results. First, we introduce the formal notion of risk scores and risk based Z-estimators. Secondly, we show that empirical Z-estimators yield results which converge to the minimal risk estimate under certain conditions. To this end, we require that the loss function under consideration is differentiable and Lipschitz bounded, enabling a technique of quantization. This technique is advantageous over classical approaches based on covering numbers (see e.g. [13, 4, 2]) as no assumption is to be made concerning finiteness nor the existence of second moments (the support of the distribution need not be bounded). This property paves the way towards the design of robust methods. Thirdly, the case of the one norm (denoted as ℓ1) for location estimation is considered in some detail. Finally, the usefulness of empirical Z-estimators is indicated by constructing a convex learning machine for regression in the context of censored observations.
This paper is organized as follows. Section 2 discusses the notion of risk scores in the case of differentiable, Lipschitz bounded loss functions and also considers in some detail the case of the L1-norm for estimating the minimal risk location. Section 3 explores its use in the construction and analysis of a convex approach towards inference in the presence of randomly censored observations. Capital letters are used for random variables, lower case letters for deterministic values or vectors. An exception is made for the parameter vector θ̂_n minimizing the risk, which is a function of random variables and thus a random variable itself. The symbol '→P' denotes convergence in probability as n → ∞.
2 RISK SCORES
2.1 Empirical Z-estimators
Let Z be a (multivariate) random variable generated from a fixed but unknown stochastic model as represented by the distribution function F_Z ∈ F. Here, F is a non-parametric class of distributions with existing first moment. Assume further that any joint distribution in the class F is linearly independent, e.g. two random variables Z_i and Z_j are collinear with zero probability for any i < j. Let the data-samples {Z_i}_{i=1}^n be sampled iid from the same distribution, abbreviated as Z_{1:n} for notational convenience.
Definition 1 (Risk Scores) Let ℓ : R^D → R^+ in C^1(R^D, R) be a loss function measuring the discrepancy between parameters θ ∈ Θ = R^D and the fixed distribution F_Z (in some sense). (We do not consider the generalization to the constrained case where Θ ⊂ R^D.) The risk score for a given parameter vector θ is then defined as r : R^D → R in C^0(R^D, R) as follows:

r_d(θ*; Z) = ∂ℓ(θ; Z)/∂θ_d |_{θ=θ*},  ∀d = 1, . . . , D,   (1)

where, with a slight abuse of notation, the dependence of the lhs on the chosen loss function ℓ is omitted for notational convenience. The expected risk score can then be defined as

r_d(θ*; F_Z) = E[ ∂ℓ(θ; Z)/∂θ_d |_{θ=θ*} ],  ∀d = 1, . . . , D.   (2)

The expected risk score r : Θ → R^D is defined as the vector valued function r(θ; F_Z) = (r_1(θ; F_Z), . . . , r_D(θ; F_Z))^T ∈ R^D. Let the empirical counterpart r_{d,1:n} : R^D → R in C^0(R^D, R) be defined as follows:

r_{d,1:n}(θ*; Z_{1:n}) = (1/n) Σ_{i=1}^n ∂ℓ(θ; Z_i)/∂θ_d |_{θ=θ*},  ∀d = 1, . . . , D,   (3)

for the realizations {Z_i}_{i=1}^n sampled iid from F_Z. The empirical vector valued function r_{1:n} : Θ → R^D is defined as r_{1:n}(θ; Z_{1:n}) = (r_{1,1:n}(θ; Z_{1:n}), . . . , r_{D,1:n}(θ; Z_{1:n}))^T ∈ R^D.
Note the subtle notational difference between the expectation r(θ; F_Z), the random vector r(θ; Z), its empirical counterpart r_{1:n}(θ; Z_{1:n}), and the point evaluation r(θ; Z_i) of sample Z_i. As a prototype, consider the case Z = (X, Y) ⊂ R^D × R and ℓ(θ; F_{XY}) = (1/2)(θ^T X − Y)^2 for θ ∈ R^D. The corresponding expected risk score becomes E[(θ^T X − Y)X]. The following proposition summarizes some properties of this quantity.
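As a sketch of this prototype (assuming only numpy; the data generating model and variable names are purely illustrative), the empirical risk score (3) of the squared loss is the average gradient over the sample, and it is small near the risk minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 1000, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, D))
Y = X @ theta_true + 0.1 * rng.normal(size=n)

def empirical_risk_score(theta, X, Y):
    """Empirical risk score r_{1:n}(theta; Z_{1:n}) of the squared loss
    l(theta; (x, y)) = (1/2)(theta^T x - y)^2: the average gradient
    (1/n) sum_i (theta^T x_i - y_i) x_i."""
    residuals = X @ theta - Y
    return (residuals[:, None] * X).mean(axis=0)

# Near the risk minimizer the score is close to zero; far away it is not.
print(np.abs(empirical_risk_score(theta_true, X, Y)).max())
print(np.abs(empirical_risk_score(np.zeros(D), X, Y)).max())
```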
Proposition 1 (Properties of Risk Scores) Under sufficient regularity conditions on the distribution F_Z, the following equalities follow straightforwardly from the linearity of the derivative and the expectation:

1. The expectation and the derivative may be switched under the usual regularity conditions on the distribution F_Z and the loss function (the interchange of integration and differentiation is based on Lebesgue's Dominated Convergence theorem, see e.g. [10]):

r_d(θ; F_Z) = E[ ∂ℓ(θ; Z)/∂θ_d ] = ∂R(θ; F_Z)/∂θ_d,  ∀d = 1, . . . , D,   (4)

where R(θ; F_Z) = E[ℓ(θ; Z)] denotes the actual risk.

2. The empirical risk score provides an unbiased estimate of the actual risk scores:

E[r_{d,1:n}(θ; Z_{1:n})] = r_d(θ; F_Z),  ∀d = 1, . . . , D.   (5)

3. If ℓ ∈ C^1(R^D, R) is Lipschitz smooth with constant L > 0, the corresponding risk score is bounded in the interval ]−L, L[.

4. The risk score of a sample Z_i is defined as

r_d(θ_0; Z_i) = ∂ℓ(θ; Z_i)/∂θ_d |_{θ=θ_0},  ∀d = 1, . . . , D,   (6)

for all i = 1, . . . , n. The terms r_d(θ; Z_i) for fixed θ are iid random variables bounded in the interval ]−L, L[ when ℓ is Lipschitz smooth. Consider a fixed parameter θ ∈ Θ. In this case, the following probabilistic bound holds:

P( |r_d(θ; F_Z) − r_{d,1:n}(θ; Z_{1:n})| > ε ) ≤ 2 exp( −ε²n / (2L²) ),  ∀d = 1, . . . , D.   (7)

Proof: Statements 1 to 3 follow from the linearity of the expectation and the derivative operator. Statement 4 follows readily from Hoeffding's inequality (see e.g. [4]).
The principal objective of these notions is to make it possible to formalize the notion of an actual Z-estimator and its empirical counterpart.
Definition 2 (Z-estimator for Empirical Risk Minimization) An actual Z-estimator for loss ℓ and domain Θ outputs a parameter vector θ̂ ∈ Θ for a given distribution F_Z such that the following set of equalities holds:

r_d(θ̂; F_Z) = 0,  ∀d = 1, . . . , D,   (8)

for r the risk score associated with the fixed loss ℓ. An empirical Z-estimator outputs a parameter vector θ̂_n ∈ Θ for a given sample Z_{1:n} such that

r_{d,1:n}(θ̂_n; Z_{1:n}) = 0,  ∀d = 1, . . . , D.   (9)
These estimators are inspired by optimization routines which try to set the gradients of the objective to zero. The following section gives such a Z-estimator for the task of regression in the case of censored observations: here the empirical risk cannot be computed completely due to censoring. An important advantage of this kind of estimators is that in the case where the expectation of the risk is not finite (e.g. outliers of Y occur or Y is Cauchy distributed), the risk scores are still bounded and can as such be used for robust inference. Note that classical theorems on learning of real-valued functions, as e.g. in [2], typically require bounded response variables, excluding those cases. Let the risk score r_d be associated with the loss ℓ, and let the quantized risk score q_d : R^D → [−L, L] for a given quantization level Q ∈ N_0 be defined as follows:

q_d(θ; Z, Q) = (L/Q) ⌊(Q/L) r_d(θ; Z)⌋,   (10)

which has 2Q + 1 different possible output levels G_{Q,k} = kL/Q − L for k = 0, . . . , 2Q. Figure 1 exemplifies such quantization for a risk score function in the case of regression, ℓ(θ; (X, Y)) = ℓ(θ^T X − Y). Let the empirical counterpart q_{d,1:n} be defined correspondingly. Define the quantization error 0 ≤ ε_Q as sup_e |r_d(e) − q_d(e)|, which satisfies by construction ε_Q ≤ L/Q.
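The quantization (10) is a one-line operation; a minimal sketch (numpy assumed, the values of L and Q are illustrative):

```python
import numpy as np

L = 1.0  # Lipschitz constant: the risk score lies in ]-L, L[
Q = 4    # quantization level, giving 2Q + 1 output levels

def quantize(r, L=1.0, Q=4):
    """Quantized risk score q = (L/Q) * floor((Q/L) * r), taking values
    on the grid G_{Q,k} = k*L/Q - L for k = 0, ..., 2Q."""
    return (L / Q) * np.floor((Q / L) * np.asarray(r))

r = np.array([-0.99, -0.3, 0.0, 0.26, 0.99])
q = quantize(r)
print(q)
print(np.abs(r - q).max())  # quantization error, at most L/Q = 0.25
```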
Lemma 1 (Uniform Convergence of Risk Scores) Assume the loss ℓ ∈ C^1(R) is a Lipschitz bounded function with constant 0 < L < ∞. The random vector Z is assumed to occur in the loss through a linear combination, where θ ∈ Θ parameterizes this combination in a linearly independent way. Let 0 < α < 1 be a constant confidence level; then with probability higher than 1 − α, the following inequality holds:

sup_{θ∈Θ} |r(θ; F_Z) − r_{1:n}(θ; Z_{1:n})|_∞ ≤ 4L/Q + sqrt( (32L²/n) ( ln( 16D(2Q + 1) Σ_{i=0}^{D−1} C(n−1, i) ) − ln α ) ),   (11)

where C(n−1, i) denotes the binomial coefficient and we denote the rhs in short as ν(n, α; D, Q, L). Specifically, the rhs goes to zero when Q → ∞ and n → ∞ at sufficient rates. This proves uniform convergence r_{1:n} →P r.
Proof: The large deviation can be bounded as follows:

sup_{θ∈Θ} |r_{d,1:n}(θ; Z_{1:n}) − r_d(θ; F_Z)| ≤ sup_{θ∈Θ} |r_{d,1:n}(θ; Z_{1:n}) − q_{d,1:n}(θ; Z_{1:n})| + sup_{θ∈Θ} |q_{d,1:n}(θ; Z_{1:n}) − q_d(θ; F_Z)| + sup_{θ∈Θ} |q_d(θ; F_Z) − r_d(θ; F_Z)|.   (12)

Both the first and last term of the rhs can be bounded by ε_Q by linearity of the expectation operator:

|E[r_d(θ; Z)] − E[q_d(θ; Z)]| ≤ E[ |r_d(θ; Z) − q_d(θ; Z)| ] ≤ E[ε_Q] = ε_Q.   (13)

The middle term of the rhs of equation (12) is bounded by reducing the terms to a classification problem. To this end, let the indicator function I(z < c) be 1 if z < c and zero otherwise, and let the variables δ_d^Q be defined as

δ_d^Q(θ; Z_i, k) = I( q_d(θ; Z_i, Q) ≤ G_{Q,k} ) ∈ {0, 1},   (14)

for k = 0, . . . , 2Q, indicating whether a risk score r_d(θ; Z_i) is smaller than or equal to the k-th quantized level. Notice that the set {δ_d^Q(θ; Z_i, k)}_{i=1}^n for given Q, d and k is iid (by construction). Application of the union bound (as thoroughly used in e.g. [13, 4]) yields for any ε > 0:

P( sup_{θ∈Θ} | E[δ_d^Q(θ; Z, k)] − (1/n) Σ_{i=1}^n δ_d^Q(θ; Z_i, k) | ≥ ε ) ≤ 16 Σ_{i=0}^{D−1} C(n−1, i) exp( −nε²/8 ),   (15)

where the bound on the shattering number of the vector space Θ with dimension D is as described in [3, 4] under the assumption that the D variables are linearly independent. Rewriting the term q_d as q_d(θ; Z, Q) = Σ_{k=0}^{2Q} (L/Q) δ_d^Q(θ; Z, k) − L (a weighted sum of the indicators) and application of the above bound gives

P( sup_{θ∈Θ} | q_d(θ; F_Z, Q) − (1/n) Σ_{i=1}^n q_d(θ; Z_i, Q) | ≥ ε )
≤ P( sup_{θ∈Θ} (L/Q) Σ_{k=0}^{2Q} | E[δ_d^Q(θ; Z, k)] − (1/n) Σ_{i=1}^n δ_d^Q(θ; Z_i, k) | ≥ ε )
≤ P( sup_{θ∈Θ} sup_k | E[δ_d^Q(θ; Z, k)] − (1/n) Σ_{i=1}^n δ_d^Q(θ; Z_i, k) | ≥ Qε/((2Q+1)L) )
≤ 16 (2Q + 1) Σ_{i=0}^{D−1} C(n−1, i) exp( −nε² / (32L²) ),   (16)

by linearity of the terms q_d and δ_d^Q; application of the inequality P(Σ_{i=1}^n |Y_i| > ε) ≤ P(sup_i |Y_i| > ε/n); and application of the union bound over the 2Q + 1 different quantized intervals, respectively (the constant 32L² uses the approximation Q/(2Q+1) ≈ 1/2). Application of the union bound over the dimensions d = 1, . . . , D and substitution of the rhs by 0 < α gives (11).
Fig. 1. An example of a risk score (solid smooth line) as a function of the parameter θ in a regression setting where ℓ(θ; F_Z) = ℓ(θ^T X − Y). The horizontal dotted lines indicate the Q + 1 = 4 different quantized intervals R_1, R_2, R_3 and R_4. The staircase (solid line) indicates the quantized risk score as in (10). The vertical dashed lines indicate the corresponding quantization of the residuals (θ^T x − y).
At this point, we can state our main result. A loss function ℓ with associated risk score r for distribution F_Z is said to have a separable minimum in case for all ε > 0 the following inequality holds:

inf_{|θ−θ*|_∞ > ε} |r(θ; F_Z)|_∞ > 0 = |r(θ*; F_Z)|_∞.   (17)
Theorem 1 (Convergence of Empirical Z-estimators) Let the set {Z_i}_{i=1}^n be sampled iid from a fixed but unknown underlying joint distribution F_Z ∈ F possessing the regularity conditions stated previously. Assume the continuous loss function ℓ : R^D → R in C^1(R^D, R) has a separable minimum as in (17). Then r(θ*; F_Z) = 0 ⇔ θ* = arg min_{θ∈Θ} R(θ; F_Z). Moreover, any sequence of estimators {θ̂_n} with bounded deviation from the optimum r(θ̂_n; F_Z) converges to the theoretical risk minimizer: θ̂_n →P θ* with |r(θ*; F_Z)|_∞ = 0.

Proof: The proof follows from Lemma 1 by taking the limit Q → ∞ of inequality (11) for n → ∞. A formal justification can be found in Theorem 5.7 of [12].

In particular, an algorithm whose risk scores converge to zero (r_{1:n}(θ; Z_{1:n}) →P 0) effectively implements risk minimization. This theorem motivates the construction of algorithms which try to find zero risk scores based on a finite sample, as exemplified in Section 3.
Corollary 1 (Testing a Point Prediction with a Finite Sample) Consider a fixed estimate θ̂ ∈ Θ, and a Lipschitz bounded loss function ℓ ∈ C^1(R^D, R) with constant L < ∞. The probability of occurrence of the empirical risk score r_{1:n}(θ̂) under the null hypothesis that F_Z is such that |r(θ̂; F_Z)|_∞ = 0 obeys the tail inequality

P( |r_{1:n}(θ̂; Z_{1:n})|_∞ ≥ ε  |  |r(θ̂; F_Z)|_∞ = 0 ) ≤ D exp( −ε²n / (2L²) ).   (18)

Proof: This statement follows readily by application of Hoeffding's inequality as in Proposition 1.4, together with a union bound over the D dimensions.

This can be used for testing a single estimate θ̂ on an independent test-set Z_{1:n}. To see this, substitute the rhs of (18) by the confidence level 0 < α < 1: with probability smaller than α, the following inequality holds

|r_{1:n}(θ̂; Z_{1:n})|_∞ ≥ sqrt( 2L²(ln D − ln α) / n )   (19)

under the condition that r(θ̂; F_Z) = 0. Now the null-distribution is such that F_Z satisfies r(θ̂; F_Z) = 0, the test-statistic is r_{1:n}(θ̂), and the finite sample (!) test distribution has a tail bound as in the rhs of (18), with a derived critical value which is an upper-bound to the p-value as in (19) (see e.g. [6]). Testing a finite number (say 1 < m < ∞) of hypotheses {θ_1, . . . , θ_m} can be done similarly by adding a Bonferroni correction term (or equivalently, application of the union bound) such that the rhs of (19) becomes sqrt( 2L²(ln(Dm) − ln α) / n ). The following result studies the extension towards infinite sets, which occurs e.g. if one wants to have a confidence set for the result of optimization over an infinite set, say Θ = R^D.
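The finite-m test above amounts to comparing the empirical score against a critical value; a minimal sketch (numpy assumed, the parameter values are illustrative):

```python
import numpy as np

def critical_value(n, alpha, D, L, m=1):
    """Critical value sqrt(2 L^2 (ln(D m) - ln(alpha)) / n) for testing
    H0: |r(theta_hat; F_Z)|_inf = 0 with confidence level alpha, with a
    Bonferroni correction over m tested hypotheses (m = 1 recovers (19))."""
    return np.sqrt(2.0 * L**2 * (np.log(D * m) - np.log(alpha)) / n)

# Reject H0 for a fixed theta_hat when |r_{1:n}(theta_hat)|_inf exceeds c.
c = critical_value(n=500, alpha=0.05, D=3, L=1.0)
print(round(c, 4))
```

Note that the critical value grows only logarithmically in the number of hypotheses m, as expected from the union bound.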
Definition 3 (Confidence Set for a Finite Sample) Let θ* ∈ R^D be the Bayes estimate such that r(θ*; F_Z) = 0. Let 0 < α < 1 be a pre-specified confidence of covering. Given a finite sample {Z_i}_{i=1}^n sampled iid from F_Z, a nontrivial set S_{α,n} ⊂ Θ can be constructed which includes the Bayes estimate with high probability, formally

P(θ* ∈ S_{α,n}) ≥ α.   (20)

To construct this set, all vectors θ ∈ Θ should be included whose empirical risk score is not too far from the minimal actual risk score. In terms of scores, this can be formalized as follows by application of Theorem 1:

S_{α,n} = { θ ∈ Θ : |r_{1:n}(θ; Z_{1:n})|_∞ ≤ ν(n, α; D, Q, L) }.   (21)
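For a one-dimensional location problem with the squared loss, the construction (21) can be sketched on a grid (numpy assumed; the value of the bound ν is chosen purely for illustration, not computed from (11)):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(loc=0.0, scale=1.0, size=1000)

def score(theta, Z):
    """Empirical risk score of the squared loss (1/2)(theta - z)^2."""
    return np.mean(theta - Z)

nu = 0.1  # illustrative stand-in for the bound nu(n, alpha; D, Q, L)
grid = np.linspace(-2.0, 2.0, 401)
S = grid[np.abs([score(t, Z) for t in grid]) <= nu]
print(S.min(), S.max())  # an interval of width about 2*nu around the mean
```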
2.2 ℓ1LOSS FUNCTION FOR LOCATION ESTIMATION
We consider in some detail the prototypical case of estimating the minimum risk location parameter θ ∈ R of a univariate random variable Y with iid samples {Y_i}_{i=1}^n, using the ℓ1 loss. The L1-norm loss is defined as ℓ1(θ; Z) = ℓ1(θ; Y) = |θ − Y|. Since the L1-norm ℓ1 ∈ C^0(R, R) is not differentiable, we resort to a closely related loss function which belongs to C^∞(R, R). Let the loss ℓ_{a_s} be defined for any e ∈ R as

ℓ_{a_s}(e) = sqrt(e² + a_s),   (22)

with a_s > 0 an appropriate constant (see Figure 2).
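The approximation (22) and its derivative are straightforward to evaluate; a minimal sketch (numpy assumed, the value of a_s is illustrative):

```python
import numpy as np

def l_as(e, a_s=1e-4):
    """Differentiable approximation l_{a_s}(e) = sqrt(e^2 + a_s) of |e|."""
    return np.sqrt(np.asarray(e)**2 + a_s)

def r_as(e, a_s=1e-4):
    """Its derivative e / sqrt(e^2 + a_s), a smoothed sign function."""
    e = np.asarray(e)
    return e / np.sqrt(e**2 + a_s)

e = np.linspace(-1.0, 1.0, 201)
print(np.abs(l_as(e) - np.abs(e)).max())  # uniform error at most sqrt(a_s)
print(r_as(1.0))                          # close to sign(1.0) = 1
```

The uniform error sup_e |ℓ_{a_s}(e) − |e|| equals sqrt(a_s) (attained at e = 0), which makes the convergence in Lemma 2 below explicit.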
Fig. 2. The ℓ1 loss function and its discontinuous derivative (dashed lines), and the differentiable approximation ℓ_{a_s} and its respective derivative (smooth solid lines). The horizontal thin lines indicate the critical region ]−a_s^{1/4}, a_s^{1/4}[ where the approximation r_{a_s} has a distance from q_{a_s} smaller than one.
Lemma 2 (Uniform Convergence) The approximation ℓ_{a_s} converges uniformly to ℓ1 when a_s → 0, formally

lim_{a_s→0} sup_e |ℓ_{a_s}(e) − ℓ1(e)| = 0.   (23)

Moreover, the optimal estimate using the approximate norm, defined as θ̂_{a_s} = arg min_{θ∈Θ} ℓ_{a_s}(θ − Y) corresponding with ℓ_{a_s} for a_s > 0, will converge to θ̂_1, the parameter with optimal risk for the loss ℓ(·) = |·|, formally θ̂_1 = arg min_{θ∈Θ} ℓ1(θ − Y). Both will converge when a_s shrinks sufficiently fast, formally

lim_{a_s→0} θ̂_{a_s} = θ̂*,   (24)

where θ̂* = arg min_{θ∈Θ} R_{1:n}(θ; Z_{1:n}).
This result is proven in [8, 14]. Let now r_{a_s} be defined as the risk score of the differentiable loss ℓ_{a_s}, such that in the case of the univariate location estimate Z = Y ∈ R one can write

r_{a_s}(θ; Y) = ∂ℓ_{a_s}(θ − Y)/∂θ = (θ − Y) / sqrt((θ − Y)² + a_s).   (25)

The following quantized version q_{a_s} is defined for fixed 0 < a_s as follows:

q_{a_s}(θ; Y) = −1 if (θ − Y) < −a_s^{1/4};  1 if (θ − Y) > a_s^{1/4};  0 elsewhere.   (26)
A first result bounds the deviation of this quantization q_{a_s} from the continuous score r_{a_s} for any fixed θ.

Proposition 2 (Bounded Deviation of Quantization) Let any set {Y : |θ − Y| ≤ a_s^{1/4}} contain a probability mass of at most 0 ≤ p(a_s) < 1 for any parameter θ ∈ R and underlying distribution F_Y. Then the expected deviation between r_{a_s} and q_{a_s} can be bounded in terms of this probability mass for any fixed θ ∈ R as follows:

E[ |r_{a_s}(θ; Y) − q_{a_s}(θ; Y)| ] ≤ (1 − p(a_s)) ( 1 − 1/sqrt(1 + √a_s) ) + p(a_s).   (27)

Proof: This argument can be proven purely geometrically (see Figure 2). Let e be defined as e = (θ − Y). First, assume that e ≥ a_s^{1/4} (or equivalently e ≤ −a_s^{1/4}); then 1 − |r_{a_s}(θ; Y)| = 1 − |e|/sqrt(e² + a_s). This quantity can be bounded by 1 − 1/sqrt(1 + √a_s), obtained by substituting e = a_s^{1/4} and using the fact that |r_{a_s}| is monotonically increasing in |e|. When |e| < a_s^{1/4}, q_{a_s} = 0 and the deviation is bounded as |r_{a_s}(θ; Y)| < 1 by construction. This together with the definition of p(a_s) proves the inequality (27).
We now specialize this result to the parameter θ̂_n which solves the convex problem

θ̂_n = arg min_{θ∈Θ} (1/n) Σ_{i=1}^n ℓ1(θ − Y_i),   (28)

for a given iid sample Y_{1:n} of size n. This is done by relating the Lagrange multipliers to the quantization as proposed in (26). Without loss of generality, the following Proposition is stated in a deterministic setting where the samples are fixed such that Y_i = y_i for all i = 1, . . . , n. First, the dual derivation of problem (28) is given. The ℓ1 based location estimation problem can be formulated as a convex constrained programming problem by introducing slack variables {e_i}_{i=1}^n as follows:

min_{θ∈R, e∈R^n} J(e) = Σ_{i=1}^n e_i  s.t.  −e_i ≤ θ − y_i ≤ e_i,  ∀i = 1, . . . , n.   (29)

The Lagrangian becomes L(e, θ; α^+, α^−) = Σ_{i=1}^n e_i − Σ_{i=1}^n α_i^+ (e_i + θ − y_i) − Σ_{i=1}^n α_i^− (e_i − θ + y_i). The first order conditions for optimality of the Lagrangian w.r.t. the primal variables e_i and θ become

∂L/∂e_i = 0 → 1 = α_i^+ + α_i^−,  ∀i = 1, . . . , n;  ∂L/∂θ = 0 → Σ_{i=1}^n (α_i^+ − α_i^−) = 0.   (30)

Those conditions are satisfied by the strong duality property, see e.g. [9]. Substitution of (α_i^− − α_i^+) by a new variable α_i for all i = 1, . . . , n gives

−1 ≤ α_i ≤ 1,  ∀i = 1, . . . , n   (a)
Σ_{i=1}^n α_i = 0   (b)
α_i = 1 ⇔ θ − y_i > 0,  ∀i = 1, . . . , n   (c)
α_i = −1 ⇔ θ − y_i < 0,  ∀i = 1, . . . , n   (d)
−1 < α_i < 1 ⇔ θ − y_i = 0,  ∀i = 1, . . . , n   (e)   (31)

where the last three statements follow from the complementary slackness conditions.
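The primal problem (29) can be handed to any LP solver directly; a sketch assuming numpy and scipy (the helper name is illustrative). For an odd number of samples, the minimizer is the sample median:

```python
import numpy as np
from scipy.optimize import linprog

def l1_location(y):
    """Solve min_theta sum_i |theta - y_i| via the LP (29):
    min sum_i e_i  s.t.  -e_i <= theta - y_i <= e_i, e_i >= 0.
    Decision variables are (theta, e_1, ..., e_n)."""
    n = len(y)
    c = np.concatenate(([0.0], np.ones(n)))
    # theta - y_i <= e_i   rewritten as   theta - e_i <= y_i
    A_up = np.hstack([np.ones((n, 1)), -np.eye(n)])
    # y_i - theta <= e_i   rewritten as  -theta - e_i <= -y_i
    A_lo = np.hstack([-np.ones((n, 1)), -np.eye(n)])
    res = linprog(c, A_ub=np.vstack([A_up, A_lo]),
                  b_ub=np.concatenate([y, -y]),
                  bounds=[(None, None)] + [(0.0, None)] * n)
    return res.x[0]

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])
print(l1_location(y))  # the sample median, 2.5
```

Note how the outlying value 10.0 does not displace the estimate, illustrating the robustness of the bounded ℓ1 risk score.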
Proposition 3 (Lagrange Multipliers and Risk Scores) Let θ̂_n ∈ Θ be the empirical optimum obtained by minimizing problem (28), and let the vector α be defined as α = (α_1, . . . , α_n)^T ∈ R^n. For any a_s > 0, the following inequality holds:

|α_i − r_{a_s}(θ̂_n; Y_i)| ≤ 1 − 1/sqrt(1 + √a_s)  if |θ̂_n − Y_i| > a_s^{1/4};  ≤ 2  if |θ̂_n − Y_i| ≤ a_s^{1/4}.   (32)

Proof: This statement is proven by relating the quantization as defined in (26) with conditions (31.cde). The factor 2 arises from the case |θ̂_n − Y_i| = 0 and its counterpart condition (31.e).
Lemma 3 (Convergence of Empirical Scores) Under the regularity conditions on F_Y ∈ F and the Lebesgue smoothness assumption, the following inequality holds for an iid sample Y_{1:n} of F_Y:

lim_{a_s→0} E[ |r_{a_s,1:n}(θ̂_n; Y_{1:n})| ] ≤ 2D/n.   (33)

Hence the convergence θ̂_n →P arg min_{θ∈Θ} R_1(θ; F_Y) takes place.

Proof: Remark that from the conditions for optimality as given in the previous proof, the equality Σ_{i=1}^n α_i = 0 holds exactly. We take the expectation of expression (32) as in Proposition 2. Substitution of the term p(a_s) by D/n follows from the fact that, in the limit a_s → 0, at most D different points can be interpolated by any θ, hence lim_{a_s→0} p(a_s) = D/n. This can be made formal through use of the assumption of Lebesgue smooth distributions as e.g. in [12]. The first term in the rhs of expression (27) vanishes as lim_{a_s→0} (1 − 1/sqrt(1 + √a_s)) = 0, such that

lim_{n→∞} lim_{a_s→0} E[ |r_{a_s,1:n}(θ̂_n; Y_{1:n})| ] = 0.

Interchanging the limits and application of Theorem 1 and Lemma 1 proves convergence.
3 CENSORED REGRESSION
Let V be a univariate random variable representing the response corresponding with covariate X, such that the set {Z_i = (X_i, V_i)}_{i=1}^n is sampled iid from a joint distribution F_{XV} over R^D × R with F_{XV} ∈ F. Now V_i is subject to right-censoring at a point C_i, sampled iid from F_C over R. The observed response is as such Y_i = min(C_i, V_i). Let the random variable ∆ denote whether the observation (X, V) is subject to censoring. Formally, for any i = 1, . . . , n:

(X_i, V_i) ~ F_{XV};  ∆_i = I(V_i > C_i);  Y_i = min(V_i, C_i).   (34)
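The observation model (34) can be simulated in a few lines (numpy assumed; the distributions of X, V and C are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 200, 2
theta_true = np.array([1.5, -1.0])

# Latent responses V, censoring points C, observed (X, Y, Delta) as in (34).
X = rng.normal(size=(n, D))
V = X @ theta_true + 0.3 * rng.normal(size=n)
C = rng.normal(loc=1.0, scale=2.0, size=n)
Delta = (V > C).astype(int)   # Delta_i = I(V_i > C_i): 1 when censored
Y = np.minimum(V, C)          # observed response Y_i = min(V_i, C_i)

print(Delta.mean())           # fraction of right-censored observations
```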
In our setup, X, Y and ∆ are fully observed. The response V_i is only observed in case ∆_i = 0. Risk minimization in the case of censored responses targets the minimization of the actual risk R(θ; F_{XV}) through the observed random variables X and Y. Note at first that this setup is unlike the likelihood approach of the Tobit model [11, 1], which describes the data generating model including the censoring process (corresponding to F_{XY}). Our approach differs as well from the censored least absolute deviation approach (CLAD) of Powell [7], where the observed values are used to fit a model min(C_i, θ^T X_i) with an L1-norm. In our case, a standard linear programming problem solves the estimation problem, unlike the iterative optimizers needed in the referred techniques. First, we extend the risk score to the case of censored observations.
Definition 4 (Risk Score of Censored Observations) Let (X_i, Y_i) be a right-censored observation such that ∆_i = 1. Let θ ∈ Θ be a fixed parameter. The set of possible risk scores for this censored sample is then

r^∆(θ; X_i, Y_i) = { r(θ; X_i, V_i) s.t. V_i ≥ Y_i }.   (35)

If this set is a singleton, then the point is informative for the loss and the fixed parameter θ. Let for such informative censored points r^δ(θ; X_i, Y_i) be equal to r^∆(θ; X_i, Y_i), as E[r^δ(θ; X_i, Y_i)] = E[r^δ(θ; X_i, V_i)]. In case the non-singleton set r^∆(θ; X_i, Y_i) contains zero, we define r^δ(θ; X_i, Y_i) = 0, as E[r^∆(θ; X_i, V_i)] = 0 without other knowledge of the problem. The censored empirical risk score of a fixed parameter θ ∈ Θ can then be defined as

r^δ_{1:n}(θ; Z_{1:n}) = (1/n) Σ_{i:∆_i=0} r(θ; X_i, Y_i) + (1/n) Σ_{i:∆_i=1} r^δ(θ; X_i, Y_i),   (36)

in the case of a loss whose risk score takes almost surely a value in the set {−L, L}, as for ℓ1. As such, the above expansion consists of n iid terms.
Proposition 4 (Unbiased Risk Score Estimate) The expectation of r^δ_{1:n} reduces to the actual risk score:

E[ r^δ_{1:n}(θ; Z_{1:n}) ] = (1/n) E[ Σ_{i:∆_i=0} r(θ; X_i, Y_i) ] + (1/n) E[ Σ_{i:∆_i=1} r^δ(θ; X_i, Y_i) ] = r(θ; F_{XV}).   (37)

Proof: This follows from the definitions and by exploiting the iid property of the censoring mechanism.

Now the previous methodology of risk scores can be applied to give finite sample properties of an estimator setting this censored empirical risk score to zero, as in the following learning machine. This censored empirical risk score gives an improved estimate of the actual risk score with respect to one that omits all censored observations. See Fig. 3 for a schematic representation for the case of the ℓ1 loss.
Fig. 3. Schematic representation of a sample (X_i, V_i) which is right-censored with active censoring at C_i (and thus ∆_i = 1). The observed value Y_i is the minimum min(C_i, V_i) and thus equals C_i (vertical solid line) in this case. It is clearly seen that the risk score for ℓ_{a_s} (smooth solid line) is approximately one for any V_i > Y_i.
Definition 5 (Right-Censored Regression) The parameter θ ∈ R^D corresponding to the loss function ℓ describing optimally the latent variable V given its covariate X, based on the censored sample of size n, can be found by solving

(θ̂_n, ê) = arg min_{θ∈Θ, e∈R^n} (1/n) Σ_{i=1}^n ℓ1(e_i)  s.t.  θ^T x_i + e_i = y_i, ∀i : ∆_i = 0;  θ^T x_i + e_i ≥ y_i, ∀i : ∆_i = 1.   (38)
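The learning machine can be cast as one linear program; a sketch assuming numpy and scipy (the helper name and the simulated data are illustrative). The one-sided slack for censored samples penalizes predictions below the observed censoring point, which is the direction consistent with the risk scores of Definition 4 for right-censored data:

```python
import numpy as np
from scipy.optimize import linprog

def censored_l1_regression(X, Y, Delta):
    """Sketch of problem (38) as a linear program.  Uncensored samples
    (Delta_i = 0) pay |theta^T x_i - y_i|, written as u_i + v_i with
    theta^T x_i + u_i - v_i = y_i, u_i, v_i >= 0.  Censored samples
    (Delta_i = 1) pay a one-sided slack s_i >= 0 with
    theta^T x_i + s_i >= y_i (no cost for predicting above the
    censoring point)."""
    n, D = X.shape
    unc = np.flatnonzero(Delta == 0)
    cen = np.flatnonzero(Delta == 1)
    nu, nc = len(unc), len(cen)
    # variable order: theta (D entries, free), u (nu), v (nu), s (nc)
    c = np.concatenate([np.zeros(D), np.ones(2 * nu + nc)])
    A_eq = np.hstack([X[unc], np.eye(nu), -np.eye(nu), np.zeros((nu, nc))])
    # theta^T x_i + s_i >= y_i  rewritten as  -theta^T x_i - s_i <= -y_i
    A_ub = np.hstack([-X[cen], np.zeros((nc, 2 * nu)), -np.eye(nc)])
    res = linprog(c, A_ub=A_ub, b_ub=-Y[cen], A_eq=A_eq, b_eq=Y[unc],
                  bounds=[(None, None)] * D + [(0.0, None)] * (2 * nu + nc))
    return res.x[:D]

# Simulated right-censored data as in (34); theta_true is illustrative.
rng = np.random.default_rng(3)
n, D = 400, 2
theta_true = np.array([1.5, -1.0])
X = rng.normal(size=(n, D))
V = X @ theta_true + 0.2 * rng.normal(size=n)
C = rng.normal(loc=1.0, scale=2.0, size=n)
Delta = (V > C).astype(int)
Y = np.minimum(V, C)

theta_hat = censored_l1_regression(X, Y, Delta)
print(theta_hat)  # close to theta_true for moderate censoring
```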
Theorem 2 (Consistency of the ℓ1 Risk Score Estimator) This Z-estimate θ̂_n converges with high probability to the optimal risk estimate:

lim_{n→∞} θ̂_n = arg min_{θ∈Θ} R(θ; F_{XV}).   (39)

The key element of the proof is to show that the empirical risk scores r_{a_s} and the Lagrange multipliers used in the programming problem are with high probability equal for any a_s > 0 and θ ∈ Θ. The result then follows from Lemma 1. First, let the risk score corresponding to ℓ_{a_s}(θ^T X − Y) for regression be defined for all d = 1, . . . , D as

r_{a_s,d}(θ; X, Y) = ∂ℓ_{a_s}(θ^T X − Y)/∂θ_d = ℓ'_{a_s}(θ^T X − Y) X_d,   (40)

by application of the chain rule. Note at this point the relation with Proposition 2: this quantity can again be related to the Lagrange multipliers of the constrained problem.
Proposition 5 (Dual and Scores) Let θ̂_n ∈ Θ be the optimum obtained by minimizing problem (38), and let α = (α_1, . . . , α_n)^T ∈ R^n be the corresponding Lagrange multipliers. For any a_s > 0 and for non-censored samples (X_i, Y_i) (where ∆_i = 0), the following inequality holds as previously: |α_i X_i − r_{a_s}(θ̂_n; X_i, Y_i)|_∞ ≤ B(1 − 1/sqrt(1 + √a_s)) whenever |θ̂_n^T X_i − Y_i| > a_s^{1/4}, and ≤ 2B otherwise. For all censored samples (X_i, Y_i) with ∆_i = 1, the following inequality holds:

|α_i X_i − r^{δ,a_s}(θ̂_n; X_i, Y_i)|_∞ ≤ B(1 − 1/sqrt(1 + √a_s))  if θ̂_n^T X_i < Y_i − a_s^{1/4};  ≤ 2B  if |θ̂_n^T X_i − Y_i| ≤ a_s^{1/4};  = 0  if θ̂_n^T X_i > Y_i + a_s^{1/4},   (41)

where B < ∞ is fixed such that B ≥ |X|_∞.

Proof: The censored ℓ1 regression problem (38) can be formulated as a linear programming problem as follows:

min_{θ∈R^D, e∈R^n} J(e) = Σ_{i=1}^n e_i  s.t.  −e_i ≤ θ^T X_i − Y_i ≤ e_i, ∀i : ∆_i = 0;  e_i ≥ 0, Y_i − θ^T X_i ≤ e_i, ∀i : ∆_i = 1.   (42)

The Lagrangian becomes L(e, θ; α^+, α^−) = Σ_{i=1}^n e_i − Σ_{i:∆_i=0} α_i^+ (e_i + θ^T X_i − Y_i) − Σ_{i:∆_i=0} α_i^− (e_i − θ^T X_i + Y_i) − Σ_{i:∆_i=1} α_i^+ e_i − Σ_{i:∆_i=1} α_i^− (e_i + θ^T X_i − Y_i). Due to the theorem of strong duality in linear programming [9], in the optimum the first order conditions for optimality w.r.t. the primal variables e_i and θ are satisfied:

∂L/∂e_i = 0 → 1 = α_i^+ + α_i^−,  ∀i = 1, . . . , n;
∂L/∂θ_d = 0 → Σ_{i:∆_i=0} (α_i^− − α_i^+) X_{id} − Σ_{i:∆_i=1} α_i^− X_{id} = 0,  ∀d = 1, . . . , D.   (43)

Replacing (α_i^− − α_i^+) by a new variable α_i for all i such that ∆_i = 0, and setting α_i = −α_i^− when ∆_i = 1, gives:

Σ_{i=1}^n α_i X_{id} = 0,  ∀d = 1, . . . , D   (a)
−1 ≤ α_i ≤ 1,  i : ∆_i = 0   (b)
−1 ≤ α_i ≤ 0,  i : ∆_i = 1   (c)
α_i = 1 ⇔ (θ^T X_i − Y_i) > 0,  i : ∆_i = 0   (d)
α_i = −1 ⇔ (θ^T X_i − Y_i) < 0,  ∀i = 1, . . . , n   (e)
−1 < α_i < 1 ⇔ (θ^T X_i − Y_i) = 0,  i : ∆_i = 0   (f)
α_i = 0 ⇔ θ^T X_i > Y_i,  i : ∆_i = 1   (g)
−1 < α_i < 0 ⇔ (θ^T X_i − Y_i) = 0,  i : ∆_i = 1   (h)   (44)

We relate this quantity to the risk scores r^{δ,a_s}(θ; Z_i) as follows. Comparing with the definition of the risk score r_{a_s} as in (25) and conditions (44.b,d,e) shows that one can use α_i X_{id} as quantized versions of the risk score for ℓ_{a_s} when a_s → 0.
Corollary 2 (Bounded Expected Deviation) Let θ̂_n ∈ Θ be the minimizer of problem (38); then the expected deviation of the censored empirical risk score from zero is bounded as follows:

E[ |r^{a_s,δ}_{1:n}(θ̂_n; Z_{1:n})| ] ≤ (1 − p(a_s)) B ( 1 − 1/sqrt(1 + √a_s) ) + 2B p(a_s),   (45)

where p(a_s) denotes the probability mass

0 ≤ p(a_s) = sup_{θ∈Θ} P_{XV}( { (X, V) : |θ^T X − V| ≤ a_s^{1/4} } ) ≤ 1.   (46)

Given this result, Theorem 2 is proved as before in Lemma 3.
4 CONCLUSION
This paper advances results relating learning theory with notions which are indispensable in optimization theory. To this end, we outlined an approach for incorporating the notion of risk scores and Lagrange multipliers in the theory of empirical risk minimization. Convergence of empirical Z-estimators to the actual risk minimizing solution is shown. This framework relies on differentiability and Lipschitz boundedness of the loss function, but it is shown that the ideas can be extended to the non-differentiable L1-norm risk by using an appropriate approximation. Finally, it is shown how this methodology can be used to design a learning machine for regression in the context of censored observations.
Acknowledgments. This research work was carried out at the ESAT laboratory of the K.U.Leuven. Research Council KU Leuven: Concerted Research Action GOA-Mefisto 666, GOA-Ambiorics IDO, several PhD/postdoc & fellow grants; Flemish Government: Fund for Scientific Research Flanders (several PhD/postdoc grants, projects G.0407.02, G.0256.97, G.0115.01, G.0240.99, G.0197.02, G.0499.04, G.0211.05, G.0080.01, research communities ICCoS, ANMMM), AWI (Bil. Int. Collaboration Hungary/Poland), IWT (Soft4s, STWW-Genprom, GBOU-McKnow, Eureka-Impact, Eureka-FLiTE, several PhD grants); Belgian Federal Government: DWTC IUAP IV-02 (1996-2001) and IUAP V-10-29 (2002-2006), Program Sustainable Development PODO-II (CP/40); Direct contract research: Verhaert, Electrabel, Elia, Data4s, IPCOS. JS is an associate professor and BDM is a full professor at K.U.Leuven, Belgium.
References
1. T. Amemiya. Regression analysis when the dependent variable is truncated normal. Econo-metrica, 41(6):997–1016, 1973.
2. M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cam-bridge University Press, 1999.
3. T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326–334, 1965.
4. L. Devroye, L. Gy¨orfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
5. P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101, 1964.
6. E.L. Lehmann and J.P. Romano. Testing Statistical Hypotheses. Springer Series in Statistics. Springer, 2nd edition, 1986.
7. J.L. Powell. Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25:303–325, 1984.
8. C.C. Pugh. Real Mathematical Analysis. Springer-Verlag, 2002.
9. R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
10. W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 1976.
11. J. Tobin. Estimation of relationships for limited dependent variables. Econometrica, 26:24–36, 1958.
12. A. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
13. V.N. Vapnik. Statistical Learning Theory. Wiley and Sons, 1998.
14. C.R. Vogel. Computational Methods for Inverse Problems. SIAM, 2002.