
Sparse Inference of High Dimensional Survival Models

An Application of Differential Geometry to Excess Relative Risk Models


Vera Velt

Master’s Thesis Mathematics December 2015

First Supervisor: Prof. E.C. Wit
Second Supervisor: Dr. Fentaw Abegaz


Abstract

This thesis concerns sparse inference of high dimensional survival models. By describing the statistical model space as a Riemannian manifold, we discuss the geometrical structure of generalized linear models. We then use this geometrical interpretation to find a sparse solution path that leads to the maximum likelihood estimator. The method we suggest for finding such a sparse solution path is a direct and scale-invariant approach based on differential geometric least angle regression. For the purpose of this thesis the method is extended to excess relative risk survival models. The derived method is implemented in R and used for an analysis of data from a cancer survival study. In the analysis, a genomic signature is found that might be predictive for the prospects of an individual suffering from a specific type of ovarian cancer.


Contents

1 Introduction
2 Basic Concepts and Notation of Differential Geometry
3 Generalized Linear Models and Geometry
   3.1 The Standard Regression Model
   3.2 Parametric Families
   3.3 The Exponential Family
   3.4 Generalized Linear Models
   3.5 A Geometrical Description of a GLM
   3.6 Angles on the Tangent Space
4 The Least Angle Regression Method
   4.1 The Generalized Equiangularity Condition
   4.2 Least Angle Regression
   4.3 LARS Algorithm for GLMs
   4.4 Predictor-Corrector Algorithm
   4.5 Some Remarks
5 Excess Relative Risk Models
   5.1 Censored Survival Times
   5.2 The Hazard Function
   5.3 The Partial Likelihood Function
   5.4 The Likelihood Function
   5.5 A Survival Risk Model
   5.6 Excess Relative Risk Models
6 Data Analysis
   6.1 Background Information
   6.2 Explorative Data Analysis
   6.3 Data Analysis with dgErr
   6.4 Results
   6.5 Discussion
7 Conclusion
A Derivations
B dgErr


Chapter 1

Introduction

In this thesis we are concerned with sparse inference of high dimensional survival models. It has become increasingly important to be able to deal with high dimensional data sets, in which the number of predictors exceeds the number of observations.

There exists a relationship between gene expression and certain diseases; important examples are diabetes and different types of cancer. For some years it has been possible to collect RNA data, and nowadays doing so has become relatively inexpensive and reliable. The number of studies on gene expression is increasing and we are gaining more knowledge about this subject. Genomic data also plays a role in survival analysis, the branch of statistics concerned with finding predictors of the risk of some event, such as cancer remission.

There are a number of recent survival studies in which RNA of patients is collected and the genomic data is studied. The genomic data in these studies is typically high dimensional: the number of genes (the predictors) exceeds the number of participants (the observations). Generally, the aim of these studies is to find genomic signatures for certain diseases. These signatures may provide information on the expected survival of individual patients, and physicians can base treatments on this information or possibly develop new treatments. Examples of recent studies involving cancer survival all describe a survival data analysis to discover potentially predictive relationships between genetic data and the survival times of participants suffering from different types of cancer (Gillet et al. (2012), Jönsson et al. (2010), Loboda et al. (2011), Ross et al. (2012)).

We now know that there is a link between the survival of patients and the genomic structure of their tumors. To learn more about these diseases, it is necessary to further investigate high dimensional survival data. But dealing with such data comes with some challenges. First of all, when dealing with high dimensional data we are bound by computational limitations, and efficient algorithms are required to deal with them. Secondly, because p is larger than n it is complicated to make statistical inferences. The methods mainly used for sparse inference of such data involve penalized likelihood functions, where the penalty depends on a tuning parameter that induces sparsity. Although penalized inference can give useful information, the justification of the penalty can be problematic. Moreover, these methods are not invariant under scale transformations.


In this thesis, we discuss a more direct approach to inducing sparsity, working directly from the likelihood function. The proposed method is differential geometric least angle regression. For this purpose, we will study the geometrical structure of generalized linear models. The advantages of the method discussed in this thesis are that it is a more direct approach and that it is scale-invariant.

Generalized linear models can be described by differentiable statistical manifolds. The general idea of the method is that we want to define a solution path in our model space that leads to the maximum likelihood estimator. It is based on the basic least angle regression algorithm, which sequentially adds variables to the solution curve (Augugliaro et al. (2013)). We seek a path that increases the likelihood as fast as possible while using a minimal number of covariates. To this end the tangent residual vector on the model space is introduced. This vector is chosen such that it is perpendicular to the model at the maximum likelihood estimator. The angle between the covariates and the tangent residual vector gives a direct way to determine the importance of the individual covariates. We make use of the generalized equiangularity condition, and the algorithm used to obtain a solution path is the predictor-corrector algorithm (Augugliaro et al. (2014)).

The method can be applied to survival relative risk models. It has been extended to Cox proportional hazards models by Wit et al. (2015). For the purpose of this thesis, we extend it to a different risk model, namely the excess relative risk model. We use this method for an analysis of real data from a survival study concerning a specific type of ovarian cancer (Gillet et al. (2012)). In this study the genomic data of patients with ovarian cancer was collected and the gene-expression patterns were studied. The researchers found an 11-gene signature with predictive power for the prospects of a patient with this type of cancer. In comparison, with the proposed method we find a set of 8 genes, of which 3 genes are in line with the article.

This thesis is organized as follows. In Chapter 2, we begin with some notation and basic concepts of differential geometry. In Chapter 3, we discuss the differential geometry of a generalized linear model; we derive a geometrical description of the maximum likelihood estimator and determine angles on the tangent space, which are needed for our method. Chapter 4 presents the method for deriving a sparse solution path that leads to the maximum likelihood estimator, based on differential geometric least angle regression. In Chapter 5 we introduce survival analysis and learn how to work with censored data. In this chapter we derive the likelihood function for a survival risk model; the geometrical structure of a survival risk model is given and extended to excess relative risk models. We finish this thesis in Chapter 6 with a data analysis based on this method, applied to survival data of a study concerning ovarian cancer.


Chapter 2

Basic Concepts and Notation of Differential Geometry

In this chapter, we give a brief introduction to differential geometry. This is a subject that requires some involved notation. Our aim is to cover the differential geometric theory needed for this thesis.

Differentiation

Let Θ be a parameter space of dimension p; here, θ ∈ Θ will be a parametrization of a statistical model of interest. Let N be a different parameter space of larger dimension n > p. The points in Θ are denoted by (θ^1, ..., θ^p)^T and, similarly, the points in N are denoted by (η^1, ..., η^n)^T. Note that superscripts are used for the components, while subscripts indicate distinct points.

When x is a scalar, differentiation is denoted by df/dx(x) for a first-order derivative, d²f/dx²(x) for a second-order derivative, and similarly for higher-order derivatives. The partial derivative of f(θ) with respect to θ^m is denoted by ∂_m f; for second-order partial derivatives we write ∂²_m f. Furthermore, when η is a function of θ, the derivative of the i-th component of η with respect to the m-th component of θ is ∂_m η^i.

In order to simplify the notation for expressions with summations, one can make use of the Einstein summation convention. This convention implies summation whenever an index appears twice in an expression, once as a superscript and once as a subscript. That is, the summation in the expression Σ_m ∂_m θ^m is implied and we can denote it by ∂_m θ^m. We will make use of the Einstein summation convention throughout.

The chain rule can be denoted as follows. If we take the partial derivative of f(θ), where f(θ) = F(η(θ)), the chain rule applies and gives
\[
\partial_m f(\theta) = \sum_{i=1}^{n} \partial_i F(\eta(\theta))\, \partial_m \eta^i .
\]
This can be simplified to ∂_m f = (∂_m η^i) ∂_i F by applying the Einstein summation convention. It now becomes clear that ∂_m = (∂_m η^i) ∂_i is an expression for the chain rule itself.
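The implied summation is simply an ordinary inner product, which can be checked numerically. The following minimal sketch (plain R, with an illustrative choice of F and η and a finite-difference step size that are assumptions of this example, not part of the thesis) compares the chain-rule expression ∂_m f = (∂_m η^i) ∂_i F with a numerical derivative of f.

```r
# Chain rule as an inner product: f(theta) = F(eta(theta)).
# Illustrative choices of F and eta; any smooth functions would do.
F_fun   <- function(eta) sum(eta^2)                      # F: R^2 -> R
eta_fun <- function(theta) c(theta[1] * theta[2],        # eta: R^2 -> R^2
                             theta[1] + theta[2])

theta <- c(0.7, -1.3)

dF   <- 2 * eta_fun(theta)                               # partial_i F at eta(theta)
Jeta <- rbind(c(theta[2], theta[1]),                     # partial_m eta^i (Jacobian rows)
              c(1,        1))

chain_rule <- as.vector(t(Jeta) %*% dF)                  # (partial_m eta^i) partial_i F

# Finite-difference check of partial_m f
h <- 1e-6
numeric_grad <- sapply(1:2, function(m) {
  th <- theta; th[m] <- th[m] + h
  (F_fun(eta_fun(th)) - F_fun(eta_fun(theta))) / h
})

round(cbind(chain_rule, numeric_grad), 4)                # the two columns agree
```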


Differential Manifolds

Differential manifolds are multidimensional generalizations of surfaces and the theory of manifolds is fundamental in differential geometry. The notion of a differential manifold is necessary for extending methods of differential calculus to a space that is more general than the space Rp.

For a parameter space, we want to represent the model as faithfully as possible. For this we need to keep the same topology and smoothness structure, which can be accomplished when working with manifolds. The basic concepts of differential manifolds will now be discussed.

We begin with some terminology. A surface is a smooth p-dimensional subset of Euclidean space, where smooth means infinitely differentiable. Homeomorphisms and diffeomorphisms are two important maps that we need to introduce. A homeomorphism is a continuous, one-to-one mapping with a continuous inverse. A diffeomorphism is a smooth bijection whose inverse is also smooth. These maps are important, since they preserve the smooth structure of a surface.

We say that a topological space M is locally p-dimensional Euclidean if each point p ∈ M has a neighborhood that is homeomorphic to R^p. By a coordinate mapping, we mean a homeomorphism of the surface onto an open subset of R^p. To ensure that the choice of coordinate system is irrelevant, we say two coordinate systems ψ and φ are compatible if both φ ∘ ψ⁻¹ and ψ ∘ φ⁻¹ are smooth; that is, the transformation from ψ-coordinates to φ-coordinates is a diffeomorphism.

A maximal structure is the collection of every coordinate system that is compatible with one of its elements. We can now define a smooth manifold and an embedded submanifold. A smooth manifold M is a topological manifold together with a maximal smooth structure. When we discuss a smooth manifold, the maximal smooth structure is implicitly present, and we simply write M.

Let us now define an embedded submanifold. A mapping f : M → S of a p-dimensional manifold into an n-dimensional manifold is an embedding if it is a smooth map of rank p that is a homeomorphism onto its image f(M) ⊂ S. In that case, f(M) is an embedded submanifold of S.

The key idea of a differential geometric approach is the notion of a tangent space. A tangent space consists of tangent vectors, which should be interpreted as directions on a manifold M. A tangent vector corresponds to the directional derivative ∂/∂θ^m in a given direction, which from now on will be denoted by ∂_m for simplicity.

The tangent vector is defined as follows. Let M be a smooth manifold. A tangent vector at p ∈ M is a mapping
\[
X_p : C^\infty(M) \to \mathbb{R},
\]
such that for all f, g ∈ C^∞(M) and a, b ∈ R, it holds that

(i) X_p(a·f + b·g) = a X_p(f) + b X_p(g),
(ii) X_p(f·g) = g X_p(f) + f X_p(g).


The set of tangent vectors at p forms a p-dimensional vector space, spanned by the set {∂_m | m = 1, ..., p}. It is the tangent space of M at p, denoted by T_pM. The tangent bundle, denoted by TM, is the collection of all tangent spaces of M. That is,
\[
TM = \{\, T_pM \mid p \in M \,\},
\]
and it has the structure of a manifold; in fact, when M is of dimension p, then TM is a 2p-dimensional manifold. A tangent field X is a smooth map between manifolds,
\[
X : M \to TM, \qquad p \mapsto X(p) \in T_pM.
\]

Geometric Surfaces

An inner product on a real vector space is a function that assigns to each pair of vectors v, w in the vector space a number ⟨v, w⟩ with the following three properties:

• (bilinearity)
\[
\langle a_1 v_1 + a_2 v_2, w\rangle = a_1\langle v_1, w\rangle + a_2\langle v_2, w\rangle, \qquad
\langle v, b_1 w_1 + b_2 w_2\rangle = b_1\langle v, w_1\rangle + b_2\langle v, w_2\rangle,
\]

• (symmetry)
\[
\langle v, w\rangle = \langle w, v\rangle,
\]

• (positive definiteness)
\[
\langle v, v\rangle \ge 0, \quad\text{and}\quad \langle v, v\rangle = 0 \text{ if and only if } v = 0.
\]

The length of a vector v is defined by ||v|| = √⟨v, v⟩. Another important property is the following: two vectors v and w are orthogonal if ⟨v, w⟩ = 0.

A geometric surface M can now be defined as an abstract surface furnished with an inner product ⟨·,·⟩ on each of its tangent planes. These inner products are required to vary smoothly, in the sense that if V and W are two differentiable vector fields on the surface, then ⟨V, W⟩ is a differentiable real-valued function on the surface. Each tangent plane T_pM of M has its own inner product. The geometric structure provided by this collection of inner products can be described by a metric tensor on M. In short, a geometric surface is a surface together with a metric tensor.
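As a small illustration of these properties, any symmetric positive definite matrix defines such an inner product on a Euclidean space. The sketch below (plain R, with an arbitrary example matrix and vectors that are purely illustrative assumptions) checks symmetry and linearity, and computes the induced length and an orthogonal pair.

```r
# Inner product on R^2 induced by a symmetric positive definite matrix A:
# <v, w> = t(v) %*% A %*% w. The matrix is an arbitrary illustrative choice.
A  <- matrix(c(2, 0.5,
               0.5, 1), nrow = 2, byrow = TRUE)
ip <- function(v, w) as.numeric(t(v) %*% A %*% w)

v <- c(1, 2); w <- c(-1, 1)

ip(v, w) == ip(w, v)                      # symmetry
all.equal(ip(3 * v + 2 * w, w),           # linearity in the first argument
          3 * ip(v, w) + 2 * ip(w, w))
sqrt(ip(v, v))                            # length of v, ||v|| = sqrt(<v, v>)
ip(c(1, 0), c(-0.5, 2)) == 0              # these two vectors are orthogonal w.r.t. A
```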

The Geometry of Parametric Families

Parametric families can be described geometrically by a smooth statistical manifold.

That is, a smooth differential manifold whose points are probability measures defined on a common probability space.

How we describe a parametric family as a statistical manifold is covered in detail in the next chapter.


Chapter 3

Generalized Linear Models and Geometry

There is a relationship between parametric families of distributions and geometry. Since the geometry applies most naturally to exponential families, we begin this chapter by introducing this family of densities. We can describe our model space as a curved exponential family embedded in the space of all densities. In this chapter the generalized linear model is described. A geometrical description of a generalized linear model is given, by describing our model as a differential statistical manifold. We will derive expressions for the tangent spaces of our model and for the tangent residual vector.

This vector will be used to give a geometrical description of the maximum likelihood estimator. In the last section of this chapter, we discuss angles on the tangent space.

3.1 The Standard Regression Model

Suppose that we observe a response variable Y and p predictors or explanatory variables X_1, X_2, ..., X_p. We assume that there exists a relationship between Y and X = (X_1, X_2, ..., X_p) of the form Y = g(X) + ε. Here Y is the dependent variable, X are the independent variables and g is a fixed but unknown function. Furthermore, ε is the error term, with ε_i ∼ N(0, σ²). We are interested in making predictions for Y. The approximation can be written as
\[
\mathrm{E}(Y \mid X) = g(X, \beta),
\]
where the β's are unknown parameters. Parametric methods involve making an assumption on the functional form of g.

When we deal with standard linear regression models, the equation linking each response variable Y and the set of explanatory variables is of the form
\[
g(X, \beta) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p,
\]
and we have responses Y_1, ..., Y_n and parameters β_1, ..., β_p (Dobson and Barnett, 2011).

A question that we may ask is: do all predictors help with explaining the response Y , or is there only a subset of predictors that is useful?


3.2 Parametric Families

Let us begin by defining a parametric family in general; later on we will define the exponential family. Let y be the observed value of the random variable Y = (Y_1, ..., Y_n)^T and let θ = (θ^1, ..., θ^n)^T be a parameter vector varying over a set Θ ⊂ R^n. A parametric family of densities is then the set of functions
\[
\{\, p_Y(y \mid \theta) \mid \theta \in \Theta \,\}.
\]

The likelihood function and the log-likelihood function are denoted by L(θ) and l(θ), respectively. The method of maximum likelihood estimation finds a value θ̂, as a function of Y, that maximizes the likelihood; θ̂ is the maximum likelihood estimator (MLE).

This value maximizes the log-likelihood function as well, since the logarithm is a strictly monotonically increasing function.

Let U_i(θ) = ∂_i l(θ; Y) be a score statistic. We can construct the Fisher information matrix: this is the n × n matrix I_n(θ) with elements
\[
I_n(\theta)_{i,j} = \mathrm{E}\bigl[ U_i(\theta)\, U_j(\theta) \bigr].
\]
It is the expected value of the observed information. The Fisher information gives information about the efficiency of the maximum likelihood estimator. It determines the conditional correlation between θ^i and θ^j, and we say that two parameters θ^i and θ^j are orthogonal if the element in the i-th row and j-th column of the Fisher information matrix is zero.
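For a concrete sanity check, the sketch below (plain R, independent Poisson observations with the natural parametrization, so the scores are U_i = y_i − e^{θ_i}; the parameter values and number of replications are illustrative assumptions) approximates E[U_i U_j] by Monte Carlo and compares it with diag(µ).

```r
# Monte Carlo check of I_n(theta)_{ij} = E[U_i(theta) U_j(theta)] for independent
# Poisson observations with natural parameters theta_i, where U_i = y_i - exp(theta_i).
set.seed(1)
theta <- c(0.2, -0.5, 1.0)           # illustrative natural parameters
mu    <- exp(theta)                  # mean value parametrization
B     <- 20000                       # Monte Carlo replications

scores <- replicate(B, {
  y <- rpois(length(mu), lambda = mu)
  y - mu                             # score vector U(theta)
})                                   # 3 x B matrix of simulated scores

I_hat <- scores %*% t(scores) / B    # empirical E[U_i U_j]
round(I_hat, 2)                      # approximately diag(mu): off-diagonals near 0
round(diag(mu), 2)
```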

We can specify a family of densities in terms of an alternative parametrization {P_σ | σ ∈ Σ}. This new parametrization must respect identifiability and the parameter space: we need the mapping from Θ to Σ to be one-to-one, i.e. a homeomorphism, and the derivatives must be available.

3.3 The Exponential Family

Let Y be a random vector whose distribution belongs to the exponential family. Then we can write the probability density in the form
\[
p_Y(y \mid \theta) = \exp\{ y^T\theta - \psi(\theta) \}\, m(y), \qquad (3.1)
\]
where y ∈ R^n. The densities are defined with respect to a fixed dominating measure ν.

We call Θ the natural parameter space; it is defined as
\[
\Theta = \Bigl\{\, \theta \in \mathbb{R}^n : \int \exp\{ y^T\theta \}\, m(y)\, d\nu < \infty \,\Bigr\}.
\]
We define the exponential family as
\[
S = \{\, P_\theta : \theta \in \Theta \,\}.
\]
If there exists a P_θ in S for every θ ∈ Θ, then S is said to be of full rank. It is then called a full exponential family and it plays the role of a manifold in the space of all density functions. Furthermore, if in addition Θ is open in R^n, then S is a regular exponential family.

We define the function ψ(θ) such that the density integrates to one. It is easy to verify that ψ(θ) is then equal to
\[
\psi(\theta) = \log \int \exp\{ y^T\theta \}\, m(y)\, d\nu .
\]

From this point we make the assumption that S is a regular exponential family.


Example

It can be shown that the standard linear regression model is an example of a regular exponential family. The density of Y, conditional on the explanatory variables X, can be written in the form
\[
\exp\Bigl\{ \frac{1}{\sigma^2}\,\beta^T X^T Y \;-\; \frac{1}{2\sigma^2}\, Y^T Y
\;-\; \Bigl( \frac{1}{2\sigma^2}\,\beta^T X^T X \beta + \frac{n}{2}\log(2\pi\sigma^2) \Bigr) \Bigr\},
\]

which is of the same form as (3.1). Hence, it belongs to the exponential family.

We will now consider some essential properties of a regular exponential family. In a regular exponential family, the function ψ has derivatives of all orders on Θ. The mean function is denoted by µ = (µ(θ^1), ..., µ(θ^n))^T and is called the mean value mapping. The variance function V(·) of Y is an n × n matrix. The mean and variance functions, µ and V respectively, are given by
\[
\mu_i = \partial_i \psi, \qquad V_{ij} = \partial_i \partial_j \psi .
\]
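As a brief standard illustration (a textbook case included here for concreteness, not an example from the thesis), consider a single Poisson observation written in exponential family form:
\[
p_Y(y \mid \theta) = \exp\{\, y\theta - e^{\theta} \,\}\,\frac{1}{y!},
\qquad \psi(\theta) = e^{\theta},
\]
so that µ = ∂ψ/∂θ = e^θ and V = ∂²ψ/∂θ² = e^θ, recovering the familiar fact that the Poisson mean and variance coincide.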

The mapping µ = µ(θ) from θ ∈ Θ to µ ∈ R^n is a diffeomorphism. Hence, the family S can be indexed by µ instead of θ; we say that µ plays the role of a coordinate system. The image space we obtain is called the mean-value parameter space.

The set S can be treated as a differential manifold of dimension n. We define this space by
\[
S = \{\, p_Y(y, \mu) \,\},
\]
and we can use either θ or µ as coordinate system for S and obtain the same results.

An exponential family that is not of full rank is generally called a curved exponential family. That is, the parameter space is of dimension p, where p < n. It turns out that curved exponential families have the geometric structure of a manifold M embedded in a supermanifold S.

If µ(p) : S → R^n denotes the mean of any p ∈ S, then
\[
M = \{\, p \in S \mid \mu^i(p) = f(X_i, \beta) \text{ for some } \beta \in B, \text{ for all } i \,\},
\]
where β is an unknown parameter. By adding some structure to M, it becomes a p-dimensional subspace of S such that M is a curved exponential family. Note that M ⊂ S; we say that M is a curved space that is an embedded subfamily of S.

3.4 Generalized Linear Models

A generalized linear model, or GLM, is a flexible generalization of ordinary least squares regression. It generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Maximum likelihood is the principal method for estimation used for all generalized linear models.


In short we can describe a GLM by three components:

• y is a random observation of a random variable Y, where Y has a distribution from a natural exponential family;
• a set of parameters β, and for each random variable Y_i a vector of covariates X_i = (X_{i1}, ..., X_{ip})^T ∈ R^p;
• a link function g such that µ_i(β) = G(X_i^T β), where g = G^{-1}.

Consider a curved exponential family M, a p-dimensional subspace of S. In a GLM, the space M is a set of generalized linear functions embedded in S. The mean function f is of the special form
\[
f_i(\beta) = G(X_{im}\,\beta^m).
\]
For the mean value map we use the notation µ(β) = (G(X_1^T β), ..., G(X_n^T β))^T, and define M ⊂ S, the subspace of GLM-induced densities, as
\[
M = \{\, p \in S \mid \mu(p) = G(X^T\beta) \,\}.
\]

In words, we have now defined an ambient space S, the space of all densities. Furthermore, we have defined a model space M embedded in S, which contains all densities that our generalized linear model can reach.
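To make the mean value map concrete, the following sketch (plain R, a Poisson GLM with log link and a small simulated design matrix, both illustrative assumptions) evaluates µ(β) for a candidate β; every such µ is one point of the model space M inside the ambient space S.

```r
# Mean value map of a Poisson GLM with log link: mu(beta) = G(X beta), with G = exp.
set.seed(42)
n <- 6; p <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))   # design matrix with intercept
G <- exp                                              # inverse link for the log link

mu_map <- function(beta) G(as.vector(X %*% beta))     # one point of the model space M

mu_map(c(0.5, -0.2))    # an n-vector of means, i.e. one density in M
```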

3.5 A Geometrical Description of a GLM

We want to define the geometric structure of a GLM, using the parametric family of density functions as manifolds. A crucial element of our differential geometric approach is the concept of tangent spaces; M and S can be locally approximated by their tangent spaces.

The geometry of exponential families can be described as follows. Each point p ∈ M has two vector spaces associated with it, namely TpS and TpM. The data is represented in TpS by a vector. An inner product is defined such that if p maximizes the likelihood function, then the vector is orthogonal to TpM.

T_pS and T_pM

The tangent space to S is a vector space whose elements are score statistics. Recall that the score functions are
\[
U_i(p) = \partial_i l(\mu(\beta)),
\]
for i = 1, ..., n. They span an n-dimensional vector space. This is the tangent space of S, formally given by
\[
T_pS = \mathrm{span}\,\{\, U_i(p) \mid i = 1, ..., n \,\}.
\]


We define the tangent space of M in a similar fashion. Let p be a point in M. The log-likelihood function for p can be parameterized by β as l(f(β) | Y), where f(β) = µ(p). We denote the score functions by
\[
U_m(p) = \partial_m l(\mu(\beta)).
\]
These are p random variables that span a linear space of dimension p. The tangent space of M is again defined as the linear space spanned by these score functions,
\[
T_pM = \mathrm{span}\,\{\, U_m(p) \mid m = 1, ..., p \,\}.
\]

We represent U_m(p) by its components (∂_m f^1, ..., ∂_m f^n)^T. Figure 3.1 gives a representation of the tangent space at a point p on the curved space M; possible directions in which one can pass through p are shown as vectors originating from p.

Figure 3.1: Tangent space T_pM

It can be shown that the tangent space T_pM is a linear subspace of the tangent space T_pS. By applying the chain rule, we obtain
\[
U_m(p) = \partial_m l(f(\beta) \mid Y)
       = \partial_m f^i \,\partial_i l(\mu \mid Y)
       = \partial_m f^i \, U_i(p).
\]
We conclude that T_pM ⊂ T_pS.
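This chain-rule identity can be checked numerically for a specific model. The sketch below (plain R, a Poisson GLM with canonical log link and simulated data, all illustrative assumptions) computes the scores with respect to β directly and via U_m = ∂_m f^i U_i, i.e. as J^T U with J the Jacobian of the mean map.

```r
# For a Poisson GLM with log link: mu_i = exp(x_i' beta).
# Scores w.r.t. the means: U_i = d l / d mu_i = (y_i - mu_i) / mu_i.
# Jacobian of the mean map: J[i, m] = d mu_i / d beta_m = mu_i * x_im.
# Chain rule: U_m = sum_i J[i, m] * U_i, which reduces to X' (y - mu).
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n))
beta <- c(0.3, 0.6)
mu <- exp(as.vector(X %*% beta))
y  <- rpois(n, mu)

U_i <- (y - mu) / mu                 # basis scores spanning T_p S (mean parametrization)
J   <- mu * X                        # d mu / d beta, an n x 2 matrix
U_m_chain  <- as.vector(t(J) %*% U_i)        # U_m = (d_m f^i) U_i
U_m_direct <- as.vector(t(X) %*% (y - mu))   # score of l(beta) computed directly

all.equal(U_m_chain, U_m_direct)     # TRUE: T_p M vectors lie in T_p S
```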

By convention we use the subscripts m, n, ... to name vectors in the tangent space TpM and subscripts i, j, ... for vectors in TpS. Collecting all tangent spaces TpS for p ∈ S, we obtain T S; the tangent bundle for S. Similarly, the tangent spaces TpM for all p ∈ M form the tangent bundle T M for M.

An overview

An overview of the different spaces that we have discussed so far is depicted in figure 3.2. In the middle we have S, our ambient space, with the model space M embedded in it. These spaces can be mapped with the mean value map to obtain the mean value spaces µ_M and µ_S, shown on the left. For a point p on M we can determine the associated tangent spaces T_pM and T_pS, shown on the right.


Figure 3.2: An overview of the spaces

The inner product on a tangent space

We will now consider an inner product on the tangent space. It is defined in terms of any parametrization µ according to
\[
\langle U_i, U_j \rangle_{p(\mu)} = \mathrm{E}_\mu[\, U_i(p)\, U_j(p) \,] = I_n(\mu)_{i,j},
\]
which is well-defined and finite for all points p(µ) ∈ S.

In order to define the notion of the angle between two tangent vectors in the tangent space of S, we use the information matrix. Take two vectors in T_{p(µ)}S,
\[
\vec a = a^i U_i(p), \qquad \vec b = b^i U_i(p),
\]
where a = (a^1, ..., a^n)^T and b = (b^1, ..., b^n)^T. We define
\[
\langle \vec a, \vec b \rangle_{p(\mu)} = \mathrm{E}_\mu[\, a^i b^j U_i U_j \,]
= a^i b^j \,\mathrm{E}_\mu[\, U_i U_j \,]
= a^T I_n(\mu)\, b,
\]
where I_n(µ) is the Fisher information matrix evaluated at the mean parameter µ = µ(p). Note that I_n(µ)_{i,j} is the conditional correlation between µ^i and µ^j. We observe that the Fisher information defines an inner product by associating with each point p of S an inner product on the tangent space T_pS. The Fisher information metric, for any \(\vec a, \vec b \in T_pS\), is defined as
\[
d(\vec a, \vec b) = \langle \vec a - \vec b,\; \vec a - \vec b \rangle_p .
\]

Since T_pM is a linear subspace of T_pS, an inner product on T_pS also defines an inner product on T_pM. Consequently, we may define an inner product between a tangent vector in T_{p(µ(β))}S and one in T_{p(µ(β))}M. Consider the two vectors
\[
\vec v = v^j U_j(p), \qquad \vec w = w^m U_m(p).
\]
Note that \(\vec v \in T_{p(\mu(\beta))}S\) and \(\vec w \in T_{p(\mu(\beta))}M\). We define
\[
\langle \vec w, \vec v \rangle_{p(\mu(\beta))} = \mathrm{E}_{\mu(\beta)}[\, w^m v^j U_m U_j \,]
= w^m v^j \,\partial_m \mu^i\, \mathrm{E}_{\mu(\beta)}[\, U_i U_j \,]
= w^T \Bigl(\frac{\partial \mu(\beta)}{\partial \beta}\Bigr)^{\!T} I_n(\mu(\beta))\, v .
\]
We have now obtained a collection of inner product spaces.
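To illustrate, the sketch below (plain R) computes such an inner product and the implied angle between two tangent vectors, taking the Poisson case in the mean parametrization, where the Fisher information matrix is diagonal with entries 1/µ_i; the particular µ and vectors are illustrative assumptions.

```r
# Inner product and angle on T_p S induced by the Fisher information.
# Poisson observations in the mean parametrization: I_n(mu) = diag(1 / mu).
mu  <- c(1.5, 2.0, 0.8, 3.2)          # illustrative mean vector (a point p of S)
I_n <- diag(1 / mu)                    # Fisher information at p

ip <- function(a, b) as.numeric(t(a) %*% I_n %*% b)   # <a, b>_p = a' I_n(mu) b

a <- c(1, 0, -1, 2)
b <- c(0.5, 1, 1, -1)

ip(a, b)                               # inner product of the two tangent vectors
cos_angle <- ip(a, b) / sqrt(ip(a, a) * ip(b, b))
acos(cos_angle)                        # angle between them, in radians
```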

3.6 Angles on the Tangent Space

We need a map that relates the data to these spaces. We consider the space T_µS and take the vector with components y − µ(β) with respect to the basis U_i. We obtain what is called the tangent residual vector r(µ(β), y), expressed by
\[
r(\mu(\beta), y) = (y - \mu(\beta))^i \,\partial_i l(\mu(\beta)).
\]
We now introduce the notion of angles on the tangent space. We denote by ρ_m(β) the angle between the tangent residual vector r(µ(β), y) and the basis vector ∂_m l(µ(β)).

It turns out that the inner product of the tangent residual vector and a basis vector satisfies the identity
\[
\langle r(\mu(\beta), y), \partial_m l(\mu(\beta)) \rangle = \partial_m l(\mu(\beta)). \qquad (3.2)
\]

Proof

We show that the identity holds by calculating the inner product of r(µ(β), y) and ∂_m l(µ(β)). We observe that
\[
\begin{aligned}
\langle r(\mu(\beta), y), \partial_m l(\mu(\beta)) \rangle
&= \mathrm{E}_\mu\Bigl[ \sum_{i=1}^{n} (y_i - \mu_i)\,\partial_i l(\mu(\beta))
     \sum_{j=1}^{n} \partial_j l(\mu(\beta))\,\partial_m \mu^j \Bigr] \\
&= \sum_{i=1}^{n} \partial_m \mu^i \, \mathrm{E}_\mu\bigl[ \partial_i l(\mu(\beta))^2 \bigr]\,(y_i - \mu_i) \\
&= \sum_{i=1}^{n} \frac{\partial \mu^i}{\partial \beta^m}\,\frac{\partial \theta^i}{\partial \mu^i}\,(y_i - \mu_i) \\
&= \partial_m l(\mu(\beta)).
\end{aligned}
\]
Hence, the identity holds.

The derivative of the likelihood, ∂_m l(µ(β)), is equal to zero at the maximum likelihood estimator. Consequently, the inner product in (3.2) is equal to zero. This implies that at the MLE the tangent residual vector is perpendicular to the model space M, i.e. ρ_m(β̂) is a right angle. This is depicted in figure 3.3.

We have obtained a geometrical description of angles on the tangent space, which we can use to estimate the maximum likelihood estimator. A method for finding a sparse solution path that leads to the MLE is covered in the next chapter.


Figure 3.3: The tangent residual vector is perpendicular to the model space M
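This orthogonality can be observed directly in a fitted GLM: at the maximum likelihood estimate of a Poisson regression with canonical link, the score X^T(y − µ̂) vanishes, so the tangent residual vector is orthogonal to every basis vector of T_pM. The sketch below (plain R with simulated data, an illustrative check rather than part of the thesis) demonstrates this with the built-in glm function.

```r
# At the MLE of a canonical-link GLM, X' (y - mu_hat) = 0: the tangent residual
# vector is orthogonal to the tangent space of the model.
set.seed(11)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.4 + 0.8 * x1))          # x2 is pure noise

fit    <- glm(y ~ x1 + x2, family = poisson)
X      <- model.matrix(fit)
mu_hat <- fitted(fit)

round(as.vector(t(X) %*% (y - mu_hat)), 8)            # essentially (0, 0, 0)
```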


Chapter 4

The Least Angle Regression Method

In the previous chapter we learned about the geometry of generalized linear models and obtained a geometrical description of the maximum likelihood estimator. In this chapter we introduce methods for finding a sparse solution path β(γ) that leads to the maximum likelihood estimator. The method is based on the original least angle regression (LARS), which is used to obtain a solution path for a linear regression model; it is an algorithm that sequentially adds variables to the solution. In this chapter, we make a direct connection between the geometry of our model and least angle regression. We begin in the first section by discussing the generalized equiangularity condition. It will also be shown that the Rao score test statistic contains the same information as the angles, and we base an algorithm on this characterization of the Rao score statistic. In R, the differential geometric extension of least angle regression is implemented in the package dglars. The last part of the chapter concerns the predictor-corrector algorithm; it is a possible algorithm for dglars.

4.1 The Generalized Equiangularity Condition

Our aim is to find out which individual covariates describe the response variable best. The angle between the covariates and the tangent residual vector within the likelihood manifold provides a way to determine the importance of the individual covariates. In general, the smaller the angle between the tangent residual vector and a basis vector on the tangent space, the better.

If the basis vectors of two predictors make the same angle with the tangent residual vector, we say that the generalized equiangularity condition holds. This condition is central to least angle regression, and a representation of this situation is shown in figure 4.1. The generalized equiangularity condition is defined as
\[
\rho_{a_1}\{\mu(\beta(\gamma))\} = \rho_{a_2}\{\mu(\beta(\gamma))\}. \qquad (4.1)
\]
We want to calculate the angle ρ_m(β) on the tangent space. From the inner product between the tangent residual vector and the vector ∂_m l(β(γ)), this angle satisfies
\[
\cos[\rho_m\{\beta(\gamma)\}] =
\frac{\langle r(\mu(\beta), y),\; \partial_m l(\beta(\gamma)) \rangle}
     {\|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\;\|\partial_m l(\beta(\gamma))\|_{p\{\mu(\beta(\gamma))\}}}. \qquad (4.2)
\]


Figure 4.1: Equiangularity on the tangent space T_pM

Recall that in the previous chapter we derived the inner product of the tangent residual vector and the basis vector ∂_m l(µ(β)),
\[
\langle r(\mu(\beta), y), \partial_m l(\mu(\beta)) \rangle = \partial_m l(\mu(\beta)). \qquad (4.3)
\]
Combining equations (4.2) and (4.3) we obtain
\[
\partial_m l(\beta(\gamma))
= \cos[\rho_m\{\beta(\gamma)\}]\, \|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\, \|\partial_m l(\beta(\gamma))\|_{p\{\mu(\beta(\gamma))\}}
= \cos[\rho_m\{\beta(\gamma)\}]\, \|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\, i_m^{1/2}\{\beta(\gamma)\}.
\]
The second equality holds since \(\|\partial_m l(\beta(\gamma))\|_{p\{\mu(\beta(\gamma))\}}\) is the square root of the Fisher information i_m{β(γ)}. Hence the angle ρ_m{β(γ)} can be written as
\[
\rho_m\{\beta(\gamma)\} = \cos^{-1}\!\left(
\frac{\partial_m l(\beta(\gamma))}{\|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\; i_m^{1/2}\{\beta(\gamma)\}}
\right). \qquad (4.4)
\]

We now introduce the Rao score test statistic r^u_m, which is defined as
\[
r^u_m(\gamma) = i_m^{-1/2}\{\beta(\gamma)\}\, \partial_m l(\beta(\gamma)). \qquad (4.5)
\]
The term within the inverse cosine in equation (4.4) is proportional to this Rao score test statistic.

It can be shown that, for generalized linear models, one can equivalently use this geometric characterization of the Rao score test statistic instead of the angles. To see this relationship, we rewrite equation (4.4) such that the Rao score test statistic can be written in the form
\[
r^u_m(\gamma) = i_m^{-1/2}\{\beta(\gamma)\}\, \partial_m l(\beta(\gamma))
= \cos[\rho_m\{\beta(\gamma)\}]\, \|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}.
\]


Since the cosine is strictly decreasing on the restricted domain and ||r(µ(β), y)|| does not depend on m, we conclude that the Rao score test statistic and the angles contain the same information. Note that we prefer a large Rao score statistic, since smaller angles ρ_m{β(γ)} correspond to larger r^u_m(γ).
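As an illustration of this equivalence, the sketch below (plain R, a Poisson GLM evaluated at the intercept-only fit with simulated data; the explicit expressions for the scores, i_m and ‖r‖ are the canonical-link Poisson case, written out here as assumptions of the example) computes both the Rao score statistics and the angles ρ_m, and shows that they rank the covariates identically.

```r
# Rao score statistics and angles for a Poisson GLM at the intercept-only fit.
set.seed(3)
n <- 80; p <- 5
X  <- matrix(rnorm(n * p), nrow = n)
y  <- rpois(n, lambda = exp(0.2 + 1.0 * X[, 2]))       # only covariate 2 matters

mu     <- rep(mean(y), n)                              # intercept-only MLE: mu_i = ybar
dl     <- as.vector(t(X) %*% (y - mu))                 # partial_m l(beta)
i_m    <- as.vector(t(X^2) %*% mu)                     # Fisher information per covariate
r_norm <- sqrt(sum((y - mu)^2 / mu))                   # ||r(mu, y)||, since I_n(mu) = diag(1/mu)

rao   <- dl / sqrt(i_m)                                # r^u_m = i_m^{-1/2} d_m l
angle <- acos(abs(rao) / r_norm)                       # smaller angle <-> larger |Rao score|

order(abs(rao), decreasing = TRUE)                     # ranking by |Rao score| ...
order(angle)                                           # ... equals ranking by smallest angle
```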

4.2 Least Angle Regression

Least angle regression, or LARS, is a regression algorithm used for high dimensional data. Suppose we have a set of variables that potentially describe a response variable; our goal is to select a subset of these variables that describes the response variable best. These covariates are the explanatory variables or predictors. Typically, we have a large number of covariates to choose from and they far outnumber the observations. The LARS algorithm outputs an estimate of the collection of covariates to be included. A brief description of the LARS algorithm is given now.

We begin with only the intercept as our starting point and then find the covariate that is most correlated with the response variable. Let this covariate be X_{j1}. We proceed in the direction of this covariate. The algorithm makes the largest step possible in this direction, up to the point where there is another covariate, say X_{j2}, that is just as correlated with the current residual as X_{j1}. A graphical representation for two covariates is given in figure 4.2. Here we can see clearly that at first proceeding along X_{j1}, instead of X_{j2}, gives a better estimate of the outcome, but at some point this is no longer the case. We then proceed in the direction that is equiangular to both X_{j1} and X_{j2}. Again, we make the largest step possible in this direction, until there is yet another covariate, X_{j3}, that is as correlated with the residual as the other two covariates. We then proceed in the direction such that the covariates X_{j1}, X_{j2} and X_{j3} all make the same angle with the residual. This procedure is iterative. During the iterations of the algorithm, we update the active set A, the set of indices corresponding to the covariates that have the greatest correlations.

Figure 4.2: Least Angle Regression


4.3 LARS Algorithm for GLMs

By making use of the geometrical description derived in the previous chapter, the algorithm can be applied to the generalized linear model. The LARS algorithm specified for GLMs is as follows.

We first define our starting point. The intercept a_0 is contained in the active set. Furthermore, we introduce the concept of protected variables. These are variables that we want to keep in the model. The purpose of the algorithm is to select a set of variables that predicts the outcome best; however, some covariates may already be assumed to be predictive for the outcome, and it can be useful to keep these covariates in the model from the start. These variables are called protected variables. Let the set of protected variables be
\[
A = \{a_1, \dots, a_b\}.
\]
This is the active set at our starting point.

Let β̂(γ^(i)) be the solution curve at the i-th step of the algorithm. The starting point of the algorithm, at γ^(0), is the parameter vector
\[
\hat\beta(\gamma^{(0)}) = \bigl(\hat\beta_{\mathrm{prot}}(\gamma^{(0)}), 0, \dots, 0\bigr)^T,
\]
with a zero for each of the k non-protected variables a_1, ..., a_k, and where β̂_prot(γ^(0)) is the maximum likelihood estimate of the model containing only the protected variables. Now we can proceed with the algorithm.

The tangent residual vector r(µ(β̂_prot(γ^(0))), y) is orthogonal to the tangent space of M at p{µ(β̂_prot(γ^(0)))}.

We must find the most correlated predictor, i.e. the predictor whose basis vector has the smallest angle with the tangent residual vector. Suppose that covariate is X_{a_1}, so the basis vector ∂_{a_1} l(β̂_prot(γ^(0))) has the smallest angle with the tangent residual vector. We include the predictor in the active set, obtaining A = A ∪ {a_1}. The solution curve is now defined by
\[
\hat\beta(\gamma^{(1)}) = \bigl(\hat\beta_{\mathrm{prot}}(\gamma^{(1)}), \hat\beta_{a_1}(\gamma^{(1)})\bigr)^T,
\]
and it is chosen in such a way that the tangent residual vector is always orthogonal to our tangent space. The direction of this curve is defined by the projection of the tangent residual vector onto the basis vector ∂_{a_1} l(β̂(γ^(1))).

The curve β̂(γ) continues in this direction until a new covariate is just as correlated as X_{a_1}. Let that covariate be X_{a_2}. That is, they satisfy the equiangularity condition (4.1) on the tangent space of M at p{µ(β̂(γ^(2)))}. Recall that this condition implies that
\[
\rho_{a_1}\{\mu(\beta(\gamma^{(2)}))\} = \rho_{a_2}\{\mu(\beta(\gamma^{(2)}))\}.
\]
We add X_{a_2} to the active set, which becomes A = A ∪ {a_1, a_2}. The solution curve proceeds such that its direction is equiangular to both X_{a_1} and X_{a_2}. The new curve that we obtain is
\[
\hat\beta(\gamma^{(2)}) = \bigl(\hat\beta_{\mathrm{prot}}(\gamma^{(2)}), \hat\beta_{a_1}(\gamma^{(2)}), \hat\beta_{a_2}(\gamma^{(2)})\bigr)^T.
\]


It is defined such that the tangent residual vector is again orthogonal to the tangent space. Moreover, its direction is defined by the tangent vector that bisects the angle between the basis vectors ∂_{a_1} l(β(γ^(2))) and ∂_{a_2} l(β(γ^(2))). These steps can be repeated for the remaining covariates and a solution curve is obtained.

(a) Solution path direction based on the least angle
(b) Solution path direction based on equiangularity

Figure 4.3: Least Angle Regression for a Generalized Linear Model

Consider the solution paths in figure 4.3. In subfigure 4.3a, a graphical representation is given of the solution path based on the least angle: covariate X_{a_1} has the smallest angle with the tangent residual vector. If we proceed in this direction until the equiangularity condition is satisfied, we arrive at the situation represented in subfigure 4.3b, where we choose the direction that is equiangular to both X_{a_1} and X_{a_2}.

Let us now study the obtained solution curve β̂(γ) in terms of the Rao score test statistic defined in equation (4.5). We have γ ∈ [0, γ^(1)] with 0 ≤ γ^(k) ≤ ... ≤ γ^(2) ≤ γ^(1), and the corresponding active set A(γ) = A ∪ {a_1, a_2, ..., a_k}. The solution curve is defined such that the following two conditions hold for the Rao score test statistics:
\[
|r^u_{a_i}(\gamma)| = |r^u_{a_j}(\gamma)| \quad \text{for all } a_i, a_j \in A(\gamma), \qquad (4.6)
\]
\[
|r^u_{a^c_h}(\gamma)| < |r^u_{a_i}(\gamma)| \quad \text{for all } a^c_h \in A^c(\gamma),\ a_i \in A, \qquad (4.7)
\]
where A^c is the complement of the set A.

Condition (4.6) ensures that the generalized equiangularity condition is satisfied. Condition (4.7) must hold, since otherwise the covariate would already have been added to the active set. Recall that a larger Rao score test statistic corresponds to a smaller angle. Whenever there is a covariate a^c_h ∈ A^c with a Rao score test statistic as large as that of the covariates in the active set, i.e.
\[
|r^u_{a^c_h}(\gamma)| = |r^u_{a_i}(\gamma)| \quad \text{for all } a_i \in A(\gamma), \qquad (4.8)
\]
we add the covariate to the active set.


4.4 Predictor-Corrector Algorithm

The basic idea of the predictor-corrector algorithm is to trace a curve implicitly defined by a system of non-linear equations.

Let there be k covariates in the active set. From (4.6) it follows that
\[
|r^u_{a_1}(\gamma)| = |r^u_{a_2}(\gamma)| = \dots = |r^u_{a_k}(\gamma)|.
\]
We introduce the vector v_k = (v_{a_1}, ..., v_{a_k}) of signs of the Rao score test statistics, that is,
\[
v_{a_j} = \mathrm{sign}\bigl(r^u_{a_j}(\gamma^{(k)})\bigr).
\]

We define β̂_A(γ) by the following k + b non-linear equations:
\[
\begin{cases}
\partial_{a_1} l(\beta(\gamma)) = 0 \\
\qquad\vdots \\
\partial_{a_b} l(\beta(\gamma)) = 0 \\
r^u_{a_1}(\gamma) = v_{a_1}\cdot\gamma \\
\qquad\vdots \\
r^u_{a_k}(\gamma) = v_{a_k}\cdot\gamma,
\end{cases}
\]
where the first b equations correspond to the protected variables. We define our system by
\[
\varphi_A(\gamma) = \bigl( \partial_{a_1} l(\beta(\gamma)), \dots, \partial_{a_b} l(\beta(\gamma)),\, r^u_{a_1}(\gamma), \dots, r^u_{a_k}(\gamma) \bigr)^T.
\]

We now rewrite our system in the following way. Let v_A = (0, ..., 0, v_k)^T be a vector starting with b zeros. If we consider the equation φ̃_A(γ) = φ_A(γ) − v_A γ, we obtain a system of equations equal to zero,
\[
\begin{cases}
\partial_{a_1} l(\beta(\gamma)) = 0 \\
\qquad\vdots \\
\partial_{a_b} l(\beta(\gamma)) = 0 \\
r^u_{a_1}(\gamma) - v_{a_1}\cdot\gamma = 0 \\
\qquad\vdots \\
r^u_{a_k}(\gamma) - v_{a_k}\cdot\gamma = 0,
\end{cases} \qquad (4.9)
\]

and this can be solved by the Newton-Raphson method. Newton-Raphson is a method for approximating the root of a real-valued function. To this end, we need the derivative of φ̃_A(γ), which is given by
\[
\frac{d\tilde\varphi_A(\gamma)}{d\gamma}
= \frac{d\varphi_A(\gamma)}{d\gamma} - v_A
= \frac{\partial \varphi_A(\gamma)}{\partial \hat\beta_A}\cdot\frac{d\hat\beta_A(\gamma)}{d\gamma} - v_A. \qquad (4.10)
\]
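For the corrector step, any standard root finder can be used on the system (4.9) with γ held fixed. The sketch below (plain R) is a minimal, generic Newton-Raphson routine under the assumption that the user supplies a function phi_tilde returning the k + b residual equations of (4.9) for a given coefficient vector of the same length (so the system is square); the Jacobian is approximated by finite differences here purely for brevity, rather than assembled from equation (4.10).

```r
# A minimal Newton-Raphson sketch for the corrector step: solve phi_tilde(beta) = 0
# for the active coefficients at a fixed gamma. 'phi_tilde' is a user-supplied
# function returning the k + b equations of system (4.9). Hypothetical helper,
# not the dglars implementation.
newton_corrector <- function(phi_tilde, beta0, tol = 1e-8, max_iter = 50) {
  beta <- beta0
  for (iter in seq_len(max_iter)) {
    f <- phi_tilde(beta)
    if (max(abs(f)) < tol) break
    # numerical Jacobian of phi_tilde at beta (one column per coefficient)
    J <- sapply(seq_along(beta), function(j) {
      h <- 1e-6 * max(1, abs(beta[j]))
      beta_h <- beta; beta_h[j] <- beta_h[j] + h
      (phi_tilde(beta_h) - f) / h
    })
    beta <- beta - solve(J, f)          # Newton update
  }
  beta
}
```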


The Algorithm

The predictor-corrector algorithm is as follows. The algorithm consists of two important steps: as is to be expected, these are a predictor step and a corrector step. In the predictor step we approximate the solution curve. In the corrector step we can correct the mistake of including a wrong predictor in the set: when an incorrect variable is included, there exists a non-active variable whose Rao score test statistic is greater than γ in absolute value. The problem can be resolved by reducing the previous step size.

We start by computing β̂_prot. We then determine the first active covariate
\[
a_1 = \underset{a^c_j \in A^c}{\arg\max}\; |r^u_{a^c_j}(\hat\beta_{\mathrm{prot}})|,
\]
add it to A, and set γ = |r^u_{a_1}(β̂_prot)|. We repeat the following steps until some convergence criterion is met.

For an optimal step size Δγ_opt, we set γ = γ − Δγ_opt. The local approximation of the solution curve is
\[
\hat\beta(\gamma - \Delta\gamma_{\mathrm{opt}}) \approx \tilde\beta(\gamma - \Delta\gamma_{\mathrm{opt}})
= \hat\beta_A(\gamma) - \Bigl(\frac{\partial \varphi_A(\gamma)}{\partial \hat\beta_A}\Bigr)^{-1} v_A\, \Delta\gamma_{\mathrm{opt}}, \qquad (4.11)
\]
making use of equation (4.10). Solving this equation gives β̃_A(γ); this is the predictor step. For the corrector step, we solve the system of equations (4.9), using β̃_A(γ) as starting value.

For all covariates a^c_h in the non-active set, we compute the Rao score test statistic r^u_{a^c_h}(γ). If there is a covariate a^c_h ∈ A^c such that
\[
|r^u_{a^c_h}(\gamma)| > \gamma,
\]
then the step size was too large. We reduce the step size and go back to the predictor step. If condition (4.8) holds, we continue the algorithm and update the active set.
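To make the starting point of the algorithm concrete, the sketch below (plain R, a Poisson model with only an intercept as protected variable and simulated data; the closed-form score and information expressions are the canonical-link Poisson case, stated here as assumptions) computes β̂_prot, the Rao score statistics of the non-active covariates, the first covariate to enter the active set, and the initial value of γ.

```r
# Initialization of the predictor-corrector algorithm for a Poisson GLM,
# with the intercept as the only protected variable.
set.seed(21)
n <- 120; p <- 10
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
y <- rpois(n, lambda = exp(0.1 + 0.9 * X[, 3]))        # covariate 3 is the signal

beta_prot <- coef(glm(y ~ 1, family = poisson))        # protected (intercept-only) MLE
mu        <- rep(exp(beta_prot), n)

dl  <- as.vector(t(X) %*% (y - mu))                    # partial_m l at beta_prot
i_m <- as.vector(t(X^2) %*% mu)                        # Fisher information per covariate
rao <- dl / sqrt(i_m)                                  # Rao score statistics r^u_m

a1    <- which.max(abs(rao))                           # first covariate entering A
gamma <- abs(rao[a1])                                  # starting value of gamma
c(first_covariate = a1, gamma_start = gamma)
```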

A graphical representation of the predictor-corrector algorithm is depicted in figure 4.4. The sub-figures 4.4(a)-(d) show the Rao score paths of some covariates together with the line |r^u(γ)| = γ.

In 4.4a we see the first iteration, γ^(1). The first covariate is added to the active set, since it has the largest value of the Rao score test statistic; this point is indicated by the dot, and the value of γ at that point by the vertical dotted line. We reduce γ by Δγ, as shown in figure 4.4b. At this point, no covariate satisfies the inclusion condition (4.8). We proceed with the next step by again reducing γ by Δγ. Now there are two covariates whose Rao score test statistic exceeds γ in absolute value, as shown in 4.4c; this means that the step size was too large. We reduce the step size until only one covariate satisfies the equiangularity condition. In 4.4d that is the case and the next covariate is added to the active set; this is γ^(2). We can now proceed with step γ^(3).


(a) The first iteration of the algorithm
(b) The step is not large enough
(c) The step needs to be corrected
(d) Reduced step size

Figure 4.4: The Predictor-Corrector algorithm: Rao score paths

4.5 Some Remarks

We conclude this chapter with three remarks. The first remark concerns the optimal step size Δγ_opt. There are several methods to compute the optimal step size: we can use a fixed value, or we can relate Δγ_opt to a fixed variation in the arc length parameterization of the solution curve. Another method is to consider the step size that changes the active set. In this thesis we will not discuss this matter further; for more detailed information on the third possibility, the reader is referred to Augugliaro et al. (2013).

Secondly, in the corrector step we usually do not find the exact point where a covariate should be included in the active set. However, we do not need this exact point to know which covariate should be included: if there is only one covariate a^c_h for which |r^u_{a^c_h}(γ)| > |r^u_{a_j}(γ)| for all a_j ∈ A, we want to add that covariate to the active set, so we reduce the step size and include it. If there is more than one such covariate, we cannot conclude that they should all be included; we reduce the step size and return to the predictor step to determine which of the covariates has to be included first.


Finally, there are some computational issues. Solving the system of equations (4.9) and computing the inverse in (4.11) in the predictor step are computationally the most expensive steps. For a large number of variables in the active set (as γ decreases), computing the inverse involves very large matrices. There are ways to overcome these problems, but they will not be discussed further here.


Chapter 5

Excess Relative Risk Models

In this chapter we extend the method derived in the previous chapter to survival risk models. The focus will primarily be on the excess relative risk model, which is the main subject of this chapter. Before we describe this model, we start with an introduction to survival analysis and learn how to work with censored survival times. We derive an expression for the partial likelihood function and make the connection to generalized linear models. In the last section of this chapter, we derive expressions for the excess relative risk model which can be implemented in R.

5.1 Censored Survival Times

Survival analysis is the subject in statistics concerned with the time until an event, called failure, occurs. In engineering, this event could be the failure of a component to work properly. In medicine, it could mean the death of a patient or a relapse from remission. In general we assume there is only one event of interest that may happen.

The time from a well-defined starting point until failure is called the survival time.

The main goal of survival analysis is to assess the relationship of explanatory variables to survival data. In this thesis, we focus on survival analysis in cancer studies and in particular genomic research. We seek a relationship between survival and genomic markers to obtain more information about the prospects of an individual cancer patient, based on the genomic structure of the tumor.

The analysis of survival data can be complicated, since in some cases the survival times are unknown. Sometimes some information about the survival time of an individual is available, but we do not know the exact time; we say the data is censored. Censoring is a key problem in survival analysis, occurring mainly in medical trials, and it can have different causes.

There are three main reasons why censoring occurs:

• An individual survives the end of the study

• An individual is lost to follow-up

• An individual withdraws from the study


Example

In figure 5.1, an example of possible survival times of four individuals in a medical trial is represented graphically to illustrate censored and non-censored survival times. The study period runs from time T_0 to time T_C.

Figure 5.1: Types of censoring

Patient 1 has a survival time that we know exactly, since the time of failure is known. Patient 2 has survived past the end of the study period T_C and the time of the event D, death, is therefore not known; we only know that the survival time is at least as long as the period from the time the patient entered the study until the end of the study period. We say the data of patient 2 is right censored. Patient 3 is no longer available for contact at time T_L: the patient is lost to follow-up, or perhaps withdrawn from the study. This happens, for example, when a patient dies of a cause other than the disease under study. Patient 4 is also censored. The survival period started before the study period, and therefore the exact survival time is unknown; this happens when the disease is diagnosed before the study has started. We call this data left censored.

In this thesis we will focus solely on right censored survival times. Although the exact survival time of censored observations is unknown, the information that we have up until the time of censoring can be used in analyzing the survival data.

5.2 The Hazard Function

In survival analysis, the hazard function is one of the most important tools; it is used to define the risk of death at a specific time. The hazard function and the survival function will be defined in this section. We want to estimate and interpret survival and hazard functions from survival data. They give us information about the rate at which survival drops during follow-up. Survival and hazard functions can also be compared; we can, for example, compare a treatment group and a placebo group based on these functions.

Let T be the random variable denoting the survival time and let t be a specific value of T; note that T ≥ 0. Also, let δ be the variable indicating whether we observe the event during the study period or censoring by the end of the study period,
\[
\delta = \begin{cases} 1 & \text{if failure}, \\ 0 & \text{if censored}, \end{cases}
\]
where δ is called the status. Consider the example in the previous section: patient 1 has an exact failure time, hence the status is 'failure' and δ = 1, while for the censored patients the status is 'censored' and δ = 0.
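For a small concrete illustration (simulated numbers, not data from the thesis), right-censored survival data can be stored as a pair of vectors, the observed time and the status δ; in R, the survival package's Surv object is the usual container for such pairs.

```r
# Right-censored survival times: observed time and status delta (1 = failure, 0 = censored).
library(survival)

time   <- c(4.2, 7.5, 3.1, 9.0, 6.4)     # observed follow-up times (illustrative)
status <- c(1,   0,   1,   0,   1)       # delta: 1 if the event was observed

Surv(time, status)                        # prints 4.2, 7.5+, 3.1, 9.0+, 6.4
```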

The hazard function gives the instantaneous rate of death in an infinitesimally small time period between a specified time t and t + dt, given survival up to time t. Formally,
\[
h(t) = \lim_{dt \to 0} \frac{\Pr(t < T \le t + dt \mid T > t)}{dt}
     = \lim_{dt \to 0} \frac{\Pr(t < T \le t + dt)}{dt \cdot \Pr(T > t)}.
\]

Let f(t) be the probability density function of T. The probability of failure before a specific time t is then given by the distribution function
\[
F(t) = \Pr(T < t) = \int_0^t f(s)\, ds.
\]
The survival function gives the probability of survival beyond time t; it is the complement of F(t). That is,
\[
S(t) = \Pr(T \ge t) = 1 - F(t).
\]

The hazard function can now be written in the form
\[
h(t) = \frac{f(t)}{S(t)}. \qquad (5.1)
\]
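As a quick numerical illustration of (5.1) (a standard example, not part of the thesis), for an exponentially distributed survival time the hazard is constant and equal to the rate parameter; the sketch below checks f(t)/S(t) against that constant using base R density and distribution functions.

```r
# For T ~ Exponential(rate), h(t) = f(t) / S(t) should equal 'rate' for every t.
rate <- 0.3
t    <- c(0.5, 1, 2, 5, 10)

f <- dexp(t, rate = rate)                          # density f(t)
S <- pexp(t, rate = rate, lower.tail = FALSE)      # survival function S(t) = P(T > t)

f / S                                              # constant hazard: 0.3 at every t
```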

The hazard function can depend on a vector of covariates X(t) = (X_1(t), ..., X_p(t))^T. The hazard function of a relative risk regression model can be expressed in the form
\[
h(t, x) = h_0(t)\, \psi(X(t), \beta). \qquad (5.2)
\]
In (5.2), h_0(t) is called the baseline hazard function. The function ψ is called the relative risk function and it is defined such that it is positive for all β.

An example of a relative risk model is the model that we are focusing on, namely the excess relative risk model. For this model we have that
\[
\psi(x(t), \beta) = \prod_{k=1}^{p} \bigl( 1 + X_k(t)\,\beta_k \bigr). \qquad (5.3)
\]
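To illustrate the excess relative risk form (5.3), the sketch below (plain R with made-up covariate values and coefficients, and an arbitrary baseline hazard chosen purely for illustration) evaluates ψ(x, β) and the resulting hazard h(t, x); note that β must be such that every factor 1 + x_k β_k stays positive.

```r
# Excess relative risk: psi(x, beta) = prod_k (1 + x_k * beta_k), and
# hazard h(t, x) = h0(t) * psi(x, beta). All numbers are illustrative.
psi_err <- function(x, beta) prod(1 + x * beta)

h0 <- function(t) 0.02 * t            # an arbitrary baseline hazard, for illustration only

x    <- c(1.2, 0.0, 2.5)              # covariate values of one individual
beta <- c(0.30, 0.10, 0.05)           # coefficients; all 1 + x_k * beta_k > 0 here

psi_err(x, beta)                      # relative risk of this individual
h0(3) * psi_err(x, beta)              # hazard at time t = 3 for this individual
```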
