
Sparse Inference of High Dimensional Survival Models

An Application of Differential Geometry to Excess Relative Risk Models


Vera Velt

Master’s Thesis Mathematics December 2015

First Supervisor: Prof. E.C. Wit
Second Supervisor: Dr. Fentaw Abegaz


Abstract

This thesis concerns sparse inference of high dimensional survival models. By describing the statistical model space as a Riemannian manifold, we discuss the geometrical structure of generalized linear models. We then use this geometrical interpretation to find a sparse solution path that leads to the maximum likelihood estimator. The method we suggest for finding such a sparse solution path is a direct and scale-invariant approach based on differential geometric least angle regression. For the purpose of this thesis the method is extended to excess relative risk survival models. The derived method is implemented in R and used for an analysis of data from a cancer survival study. In the analysis, a genomic signature is found that might be predictive for the prospects of an individual suffering from a specific type of ovarian cancer.


Contents

1 Introduction
2 Basic Concepts and Notation of Differential Geometry
3 Generalized Linear Models and Geometry
   3.1 The Standard Regression Model
   3.2 Parametric Families
   3.3 The Exponential Family
   3.4 Generalized Linear Models
   3.5 A Geometrical Description of a GLM
   3.6 Angles on the Tangent Space
4 The Least Angle Regression Method
   4.1 The Generalized Equiangularity Condition
   4.2 Least Angle Regression
   4.3 LARS Algorithm for GLMs
   4.4 Predictor-Corrector Algorithm
   4.5 Some Remarks
5 Excess Relative Risk Models
   5.1 Censored Survival Times
   5.2 The Hazard Function
   5.3 The Partial Likelihood Function
   5.4 The Likelihood Function
   5.5 A Survival Risk Model
   5.6 Excess Relative Risk Models
6 Data Analysis
   6.1 Background Information
   6.2 Explorative Data Analysis
   6.3 Data Analysis with dgErr
   6.4 Results
   6.5 Discussion
7 Conclusion
A Derivations
B dgErr


Chapter 1

Introduction

In this thesis we are concerned with sparse inference of high dimensional survival models. It has become increasingly important to be able to deal with high dimensional data sets, in which the number of predictors exceeds the number of observations.

There exists a relationship between gene expression and certain diseases; important examples are diabetes and different types of cancer. For some years it has been possible to collect RNA data, and nowadays doing so has become relatively inexpensive and reliable. The number of studies on gene expression is increasing and we are gaining more knowledge about this subject. Genomic data also plays a role in survival analysis, the branch of statistics concerned with finding predictors of the risk of some event, such as cancer remission.

There are a number of recent survival studies in which RNA of patients is collected and the genomic data is studied. The genomic data in these studies is typically high dimensional: the number of genes (the predictors) exceeds the number of participants (the observations). Generally, the aim of these studies is to find genomic signatures for certain diseases. These signatures may provide information on the expected survival of individual patients, and physicians can base treatments on this information or possibly develop new treatments. Examples of recent studies involving cancer survival all describe a survival data analysis to discover potentially predictive relationships between genetic data and the survival times of participants suffering from different types of cancer (Gillet et al. (2012), Jönsson et al. (2010), Loboda et al. (2011), Ross et al. (2012)).

We now know that there is a link between the survival of patients and the genomic structure of their tumors. To learn more about these diseases, it is necessary to further investigate high dimensional survival data. But dealing with such data comes with some challenges. First of all, when dealing with high dimensional data we are bound by computational limitations, and efficient algorithms are required to deal with them. Secondly, because p is larger than n it is complicated to make statistical inferences. The methods mainly used for sparse inference of such data involve penalized likelihood functions, where the penalty depends on a tuning parameter that induces sparsity. Although penalized inference can give useful information, the justification of the penalty can be problematic. Moreover, these methods are not invariant under scale transformations.


In this thesis, we discuss a more direct approach to inducing sparsity, working directly from the likelihood function. The proposed method is differential geometric least angle regression. For this purpose, we will study the geometrical structure of generalized linear models. The advantages of the method discussed in this thesis are that it is a more direct approach and that it is scale-invariant.

Generalized linear models can be described by differentiable statistical manifolds. The general idea of the method is that we want to define a solution path in our model space that leads to the maximum likelihood estimator. It is based on the basic least angle regression algorithm, which sequentially adds variables to the solution curve (Augugliaro et al. (2013)). We seek a path that increases the likelihood as fast as possible while using a minimal number of covariates. To this end the tangent residual vector on the model space is introduced. This vector is chosen such that it is perpendicular to the model at the maximum likelihood estimator. The angle between the covariates and the tangent residual vector gives a direct way to determine the importance of the individual covariates. We make use of the generalized equiangularity condition, and the algorithm used to obtain a solution path is the predictor-corrector algorithm (Augugliaro et al. (2014)).

The method can be applied to survival relative risk models. It has been extended to Cox proportional hazards models by Wit et al. (2015). For the purpose of this thesis, we extend it to a different risk model, namely the excess relative risk model. We use this method for an analysis of real data from a survival study concerning a specific type of ovarian cancer (Gillet et al. (2012)). In this study the genomic data of patients with ovarian cancer was collected and the gene-expression patterns were studied. The researchers found an 11-gene signature with predictive power for the prospects of a patient with this type of cancer. In comparison, with the proposed method we find a set of 8 genes, of which 3 genes are in line with the article.

This thesis is organized as follows. In Chapter 2, we begin with some notation and basic concepts of differential geometry. In Chapter 3, we discuss the differential geometry of a generalized linear model; we derive a geometrical description of the maximum likelihood estimator and determine angles on the tangent space, which are needed for our method. Chapter 4 presents the method for deriving a sparse solution path that leads to the maximum likelihood estimator, based on differential geometric least angle regression. In Chapter 5 we introduce survival analysis and learn how to work with censored data. In this chapter we derive the likelihood function for a survival risk model; the geometrical structure of a survival risk model is given and extended to excess relative risk models. We finish this thesis in Chapter 6 with a data analysis based on this method, applied to survival data of a study concerning ovarian cancer.


Chapter 2

Basic Concepts and Notation of Differential Geometry

In this chapter, we give a brief introduction to differential geometry. This is a subject that requires some involved notation. Our aim is to cover the differential geometric theory needed for this thesis.

Differentiation

Let Θ be a parameter space of dimension p; here, θ ∈ Θ will be a parametrization of a statistical model of interest. Let N be a different parameter space of larger dimension n > p. The points in Θ are denoted by (θ^1, ..., θ^p)^T and, similarly, the points in N are denoted by (η^1, ..., η^n)^T. Note that superscripts are used for the components, while subscripts indicate distinct points.

When x is a scalar, differentiation is denoted by df/dx(x) for a first-order derivative, d²f/dx²(x) for a second-order derivative, and similarly for higher-order derivatives. The partial derivative of f(θ) with respect to θ^m is denoted by ∂_m f; for second-order partial derivatives we write ∂²_m f. Furthermore, when η is a function of θ, the derivative of the i-th component of η with respect to the m-th component of θ is ∂_m η^i.

In order to simplify the notation for expressions with summations, one can make use of the Einstein summation convention. This convention implies summation whenever an index appears twice in an expression, once as a superscript and once as a subscript. That is, the summation in the expression Σ_m ∂_m θ^m is implied and we can denote it by ∂_m θ^m. We will make use of the Einstein summation convention throughout.

The chain rule can be denoted as follows. If we take the partial derivative of f(θ), where f(θ) = F(η(θ)), the chain rule applies and gives
\[
\partial_m f(\theta) = \sum_{i=1}^{n} \partial_i F(\eta(\theta))\, \partial_m \eta^i .
\]
This can be simplified to ∂_m f = (∂_m η^i) ∂_i F by applying the Einstein summation convention. It now becomes clear that ∂_m = (∂_m η^i) ∂_i is an expression for the chain rule itself.
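The implied summation is simply an ordinary inner product, which can be checked numerically. The following minimal sketch (plain R, with an illustrative choice of F and η and a finite-difference step size that are assumptions of this example, not part of the thesis) compares the chain-rule expression ∂_m f = (∂_m η^i) ∂_i F with a numerical derivative of f.

```r
# Chain rule as an inner product: f(theta) = F(eta(theta)).
# Illustrative choices of F and eta; any smooth functions would do.
F_fun   <- function(eta) sum(eta^2)                      # F: R^2 -> R
eta_fun <- function(theta) c(theta[1] * theta[2],        # eta: R^2 -> R^2
                             theta[1] + theta[2])

theta <- c(0.7, -1.3)

dF   <- 2 * eta_fun(theta)                               # partial_i F at eta(theta)
Jeta <- rbind(c(theta[2], theta[1]),                     # partial_m eta^i (Jacobian rows)
              c(1,        1))

chain_rule <- as.vector(t(Jeta) %*% dF)                  # (partial_m eta^i) partial_i F

# Finite-difference check of partial_m f
h <- 1e-6
numeric_grad <- sapply(1:2, function(m) {
  th <- theta; th[m] <- th[m] + h
  (F_fun(eta_fun(th)) - F_fun(eta_fun(theta))) / h
})

round(cbind(chain_rule, numeric_grad), 4)                # the two columns agree
```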


Differential Manifolds

Differential manifolds are multidimensional generalizations of surfaces and the theory of manifolds is fundamental in differential geometry. The notion of a differential manifold is necessary for extending methods of differential calculus to a space that is more general than the space Rp.

For a parameter space, we want to represent the model as faithfully as possible. For this we need to keep the same topology and smoothness structure, which can be accomplished when working with manifolds. The basic concepts of differential manifolds will now be discussed.

We begin with some terminology. A surface is a smooth p-dimensional subset of Euclidean space, where smooth means infinitely differentiable. Homeomorphisms and diffeomorphisms are two important maps that we need to introduce. A homeomorphism is a continuous, one-to-one mapping with a continuous inverse. A diffeomorphism is a smooth bijection whose inverse is also smooth. These maps are important, since they preserve the smooth structure of a surface.

We say that a topological space M is locally p-dimensional Euclidean if each point p ∈ M has a neighborhood that is homeomorphic to R^p. By a coordinate mapping, we mean a homeomorphism of the surface onto an open subset of R^p. To ensure that the choice of coordinate system is irrelevant, we say two coordinate systems ψ and φ are compatible if both φ ∘ ψ⁻¹ and ψ ∘ φ⁻¹ are smooth; that is, the transformation from ψ-coordinates to φ-coordinates is a diffeomorphism.

A maximal structure is the collection of every coordinate system that is compatible with one of its elements. We can now define a smooth manifold and an embedded submanifold. A smooth manifold M is a topological manifold together with a maximal smooth structure. When we discuss a smooth manifold, the maximal smooth structure is implicitly present, and we simply write M.

Let us now define an embedded submanifold. A mapping f : M → S of a p-dimensional manifold into an n-dimensional manifold is an embedding if it is a smooth map of rank p that is a homeomorphism onto its image f(M) ⊂ S. In that case, f(M) is an embedded submanifold of S.

The key idea of a differential geometric approach is the notion of a tangent space. A tangent space consists of tangent vectors, which should be interpreted as directions on a manifold M. A tangent vector corresponds to the directional derivative ∂/∂θ^m in a given direction, which from now on will be denoted by ∂_m for simplicity.

The tangent vector is defined as follows. Let M be a smooth manifold. A tangent vector at p ∈ M is a mapping
\[
X_p : C^\infty(M) \to \mathbb{R},
\]
such that for all f, g ∈ C^∞(M) and a, b ∈ R, it holds that

(i) X_p(a·f + b·g) = a X_p(f) + b X_p(g),
(ii) X_p(f·g) = g X_p(f) + f X_p(g).


The set of tangent vectors at p forms a p-dimensional vector space, spanned by the set {∂_m | m = 1, ..., p}. It is the tangent space of M at p, denoted by T_pM. The tangent bundle, denoted by TM, is the collection of all tangent spaces of M. That is,
\[
TM = \{\, T_pM \mid p \in M \,\},
\]
and it has the structure of a manifold; in fact, when M is of dimension p, then TM is a 2p-dimensional manifold. A tangent field X is a smooth map between manifolds,
\[
X : M \to TM, \qquad p \mapsto X(p) \in T_pM.
\]

Geometric Surfaces

An inner product on a real vector space is a function that assigns to each pair of vectors v, w in the vector space a number ⟨v, w⟩ with the following three properties:

• (bilinearity)
\[
\langle a_1 v_1 + a_2 v_2, w\rangle = a_1\langle v_1, w\rangle + a_2\langle v_2, w\rangle, \qquad
\langle v, b_1 w_1 + b_2 w_2\rangle = b_1\langle v, w_1\rangle + b_2\langle v, w_2\rangle,
\]

• (symmetry)
\[
\langle v, w\rangle = \langle w, v\rangle,
\]

• (positive definiteness)
\[
\langle v, v\rangle \ge 0, \quad\text{and}\quad \langle v, v\rangle = 0 \text{ if and only if } v = 0.
\]

The length of a vector v is defined by ||v|| = √⟨v, v⟩. Another important property is the following: two vectors v and w are orthogonal if ⟨v, w⟩ = 0.

A geometric surface M can now be defined as an abstract surface furnished with an inner product ⟨·,·⟩ on each of its tangent planes. These inner products are required to vary smoothly, in the sense that if V and W are two differentiable vector fields on the surface, then ⟨V, W⟩ is a differentiable real-valued function on the surface. Each tangent plane T_pM of M has its own inner product. The geometric structure provided by this collection of inner products can be described by a metric tensor on M. In short, a geometric surface is a surface together with a metric tensor.
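As a small illustration of these properties, any symmetric positive definite matrix defines such an inner product on a Euclidean space. The sketch below (plain R, with an arbitrary example matrix and vectors that are purely illustrative assumptions) checks symmetry and linearity, and computes the induced length and an orthogonal pair.

```r
# Inner product on R^2 induced by a symmetric positive definite matrix A:
# <v, w> = t(v) %*% A %*% w. The matrix is an arbitrary illustrative choice.
A  <- matrix(c(2, 0.5,
               0.5, 1), nrow = 2, byrow = TRUE)
ip <- function(v, w) as.numeric(t(v) %*% A %*% w)

v <- c(1, 2); w <- c(-1, 1)

ip(v, w) == ip(w, v)                      # symmetry
all.equal(ip(3 * v + 2 * w, w),           # linearity in the first argument
          3 * ip(v, w) + 2 * ip(w, w))
sqrt(ip(v, v))                            # length of v, ||v|| = sqrt(<v, v>)
ip(c(1, 0), c(-0.5, 2)) == 0              # these two vectors are orthogonal w.r.t. A
```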

The Geometry of Parametric Families

Parametric families can be described geometrically by a smooth statistical manifold.

That is, a smooth differential manifold whose points are probability measures defined on a common probability space.

How we describe a parametric family as a statistical manifold is covered in detail in the next chapter.


Chapter 3

Generalized Linear Models and Geometry

There is a relationship between parametric families of distributions and geometry. Since the geometry applies most naturally to exponential families, we begin this chapter by introducing this family of densities. We can describe our model space as a curved exponential family embedded in the space of all densities. In this chapter the generalized linear model is described. A geometrical description of a generalized linear model is given, by describing our model as a differential statistical manifold. We will derive expressions for the tangent spaces of our model and for the tangent residual vector.

This vector will be used to give a geometrical description of the maximum likelihood estimator. In the last section of this chapter, we discuss angles on the tangent space.

3.1 The Standard Regression Model

Suppose that we observe a response variable Y and p predictors or explanatory variables X_1, X_2, ..., X_p. We assume that there exists a relationship between Y and X = (X_1, X_2, ..., X_p) of the form Y = g(X) + ε. Here Y is the dependent variable, X are the independent variables and g is a fixed but unknown function. Furthermore, ε is the error term, with ε_i ∼ N(0, σ²). We are interested in making predictions for Y. The approximation can be written as
\[
\mathrm{E}(Y \mid X) = g(X, \beta),
\]
where the β's are unknown parameters. Parametric methods involve making an assumption on the functional form of g.

When we deal with standard linear regression models, the equation linking each response variable Y and the set of explanatory variables is of the form
\[
g(X, \beta) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p,
\]
and we have responses Y_1, ..., Y_n and parameters β_1, ..., β_p (Dobson and Barnett, 2011).

A question that we may ask is: do all predictors help with explaining the response Y , or is there only a subset of predictors that is useful?


3.2 Parametric Families

Let us begin by defining a parametric family in general; later on we will define the exponential family. Let y be the observed value of the random variable Y = (Y_1, ..., Y_n)^T and let θ = (θ^1, ..., θ^n)^T be a parameter vector varying over a set Θ ⊂ R^n. A parametric family of densities is then the set of functions
\[
\{\, p_Y(y \mid \theta) \mid \theta \in \Theta \,\}.
\]

The likelihood function and the log-likelihood function are denoted by L(θ) and l(θ), respectively. The method of maximum likelihood estimation finds a value θ̂, as a function of Y, that maximizes the likelihood; θ̂ is the maximum likelihood estimator (MLE).

This value maximizes the log-likelihood function as well, since the logarithm is a strictly monotonically increasing function.

Let U_i(θ) = ∂_i l(θ; Y) be a score statistic. We can construct the Fisher information matrix: this is the n × n matrix I_n(θ) with elements
\[
I_n(\theta)_{i,j} = \mathrm{E}\bigl[ U_i(\theta)\, U_j(\theta) \bigr].
\]
It is the expected value of the observed information. The Fisher information gives information about the efficiency of the maximum likelihood estimator. It determines the conditional correlation between θ^i and θ^j, and we say that two parameters θ^i and θ^j are orthogonal if the element in the i-th row and j-th column of the Fisher information matrix is zero.
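For a concrete sanity check, the sketch below (plain R, independent Poisson observations with the natural parametrization, so the scores are U_i = y_i − e^{θ_i}; the parameter values and number of replications are illustrative assumptions) approximates E[U_i U_j] by Monte Carlo and compares it with diag(µ).

```r
# Monte Carlo check of I_n(theta)_{ij} = E[U_i(theta) U_j(theta)] for independent
# Poisson observations with natural parameters theta_i, where U_i = y_i - exp(theta_i).
set.seed(1)
theta <- c(0.2, -0.5, 1.0)           # illustrative natural parameters
mu    <- exp(theta)                  # mean value parametrization
B     <- 20000                       # Monte Carlo replications

scores <- replicate(B, {
  y <- rpois(length(mu), lambda = mu)
  y - mu                             # score vector U(theta)
})                                   # 3 x B matrix of simulated scores

I_hat <- scores %*% t(scores) / B    # empirical E[U_i U_j]
round(I_hat, 2)                      # approximately diag(mu): off-diagonals near 0
round(diag(mu), 2)
```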

We can specify a family of densities in terms of an alternative parametrization {P_σ | σ ∈ Σ}. This new parametrization must respect identifiability and the parameter space: we need the mapping from Θ to Σ to be one-to-one, i.e. a homeomorphism, and the derivatives must be available.

3.3 The Exponential Family

Let Y be a random vector whose distribution belongs to the exponential family. Then we can write the probability density in the form
\[
p_Y(y \mid \theta) = \exp\{ y^T\theta - \psi(\theta) \}\, m(y), \qquad (3.1)
\]
where y ∈ R^n. The densities are defined with respect to a fixed dominating measure ν.

We call Θ the natural parameter space; it is defined as
\[
\Theta = \Bigl\{\, \theta \in \mathbb{R}^n : \int \exp\{ y^T\theta \}\, m(y)\, d\nu < \infty \,\Bigr\}.
\]
We define the exponential family as
\[
S = \{\, P_\theta : \theta \in \Theta \,\}.
\]
If there exists a P_θ in S for every θ ∈ Θ, then S is said to be of full rank. It is then called a full exponential family and it plays the role of a manifold in the space of all density functions. Furthermore, if in addition Θ is open in R^n, then S is a regular exponential family.

We define the function ψ(θ) such that the density integrates to one. It is easy to verify that ψ(θ) is then equal to
\[
\psi(\theta) = \log \int \exp\{ y^T\theta \}\, m(y)\, d\nu .
\]

From this point we make the assumption that S is a regular exponential family.


Example

It can be shown that the standard linear regression model is an example of a regular exponential family. The density of Y, conditional on the explanatory variables X, can be written in the form
\[
\exp\Bigl\{ \frac{1}{\sigma^2}\,\beta^T X^T Y \;-\; \frac{1}{2\sigma^2}\, Y^T Y
\;-\; \Bigl( \frac{1}{2\sigma^2}\,\beta^T X^T X \beta + \frac{n}{2}\log(2\pi\sigma^2) \Bigr) \Bigr\},
\]

which is of the same form as (3.1). Hence, it belongs to the exponential family.

We will now consider some essential properties of a regular exponential family. In a regular exponential family, the function ψ has derivatives of all orders on Θ. The mean function is denoted by µ = (µ(θ^1), ..., µ(θ^n))^T and is called the mean value mapping. The variance function V(·) of Y is an n × n matrix. The mean and variance functions, µ and V respectively, are given by
\[
\mu_i = \partial_i \psi, \qquad V_{ij} = \partial_i \partial_j \psi .
\]
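As a brief standard illustration (a textbook case included here for concreteness, not an example from the thesis), consider a single Poisson observation written in exponential family form:
\[
p_Y(y \mid \theta) = \exp\{\, y\theta - e^{\theta} \,\}\,\frac{1}{y!},
\qquad \psi(\theta) = e^{\theta},
\]
so that µ = ∂ψ/∂θ = e^θ and V = ∂²ψ/∂θ² = e^θ, recovering the familiar fact that the Poisson mean and variance coincide.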

The mapping µ = µ(θ) from θ ∈ Θ to µ ∈ R^n is a diffeomorphism. Hence, the family S can be indexed by µ instead of θ; we say that µ plays the role of a coordinate system. The image space we obtain is called the mean-value parameter space.

The set S can be treated as a differential manifold of dimension n. We define this space by
\[
S = \{\, p_Y(y, \mu) \,\},
\]
and we can use either θ or µ as coordinate system for S and obtain the same results.

An exponential family that is not of full rank is generally called a curved exponential family. That is, the parameter space is of dimension p, where p < n. It turns out that curved exponential families have the geometric structure of a manifold M embedded in a supermanifold S.

If µ(p) : S → R^n denotes the mean of any p ∈ S, then
\[
M = \{\, p \in S \mid \mu^i(p) = f(X_i, \beta) \text{ for some } \beta \in B, \text{ for all } i \,\},
\]
where β is an unknown parameter. By adding some structure to M, it becomes a p-dimensional subspace of S such that M is a curved exponential family. Note that M ⊂ S; we say that M is a curved space that is an embedded subfamily of S.

3.4 Generalized Linear Models

A generalized linear model, or GLM, is a flexible generalization of ordinary least squares regression. It generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Maximum likelihood is the principal method for estimation used for all generalized linear models.


In short we can describe a GLM by three components:

• y is a random observation of a random variable Y, where Y has a distribution from a natural exponential family;
• a set of parameters β, and for each random variable Y_i a vector of covariates X_i = (X_{i1}, ..., X_{ip})^T ∈ R^p;
• a link function g such that µ_i(β) = G(X_i^T β), where g = G^{-1}.

Consider a curved exponential family M, a p-dimensional subspace of S. In a GLM, the space M is a set of generalized linear functions embedded in S. The mean function f is of the special form
\[
f_i(\beta) = G(X_{im}\,\beta^m).
\]
For the mean value map we use the notation µ(β) = (G(X_1^T β), ..., G(X_n^T β))^T, and define M ⊂ S, the subspace of GLM-induced densities, as
\[
M = \{\, p \in S \mid \mu(p) = G(X^T\beta) \,\}.
\]

In words, we have now defined an ambient space S, the space of all densities. Furthermore, we have defined a model space M embedded in S, which contains all densities that our generalized linear model can reach.
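To make the mean value map concrete, the following sketch (plain R, a Poisson GLM with log link and a small simulated design matrix, both illustrative assumptions) evaluates µ(β) for a candidate β; every such µ is one point of the model space M inside the ambient space S.

```r
# Mean value map of a Poisson GLM with log link: mu(beta) = G(X beta), with G = exp.
set.seed(42)
n <- 6; p <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))   # design matrix with intercept
G <- exp                                              # inverse link for the log link

mu_map <- function(beta) G(as.vector(X %*% beta))     # one point of the model space M

mu_map(c(0.5, -0.2))    # an n-vector of means, i.e. one density in M
```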

3.5 A Geometrical Description of a GLM

We want to define the geometric structure of a GLM, using the parametric family of density functions as manifolds. A crucial element of our differential geometric approach is the concept of tangent spaces; M and S can be locally approximated by their tangent spaces.

The geometry of exponential families can be described as follows. Each point p ∈ M has two vector spaces associated with it, namely TpS and TpM. The data is represented in TpS by a vector. An inner product is defined such that if p maximizes the likelihood function, then the vector is orthogonal to TpM.

T_pS and T_pM

The tangent space to S is a vector space whose elements are score statistics. Recall that the score functions are
\[
U_i(p) = \partial_i l(\mu(\beta)),
\]
for i = 1, ..., n. They span an n-dimensional vector space. This is the tangent space of S, formally given by
\[
T_pS = \mathrm{span}\,\{\, U_i(p) \mid i = 1, ..., n \,\}.
\]


We define the tangent space of M in a similar fashion. Let p be a point in M. The log-likelihood function for p can be parameterized by β as l(f(β) | Y), where f(β) = µ(p). We denote the score functions by
\[
U_m(p) = \partial_m l(\mu(\beta)).
\]
These are p random variables that span a linear space of dimension p. The tangent space of M is again defined as the linear space spanned by these score functions,
\[
T_pM = \mathrm{span}\,\{\, U_m(p) \mid m = 1, ..., p \,\}.
\]

We represent U_m(p) by its components (∂_m f^1, ..., ∂_m f^n)^T. Figure 3.1 gives a representation of the tangent space at a point p on the curved space M; possible directions in which one can pass through p are shown as vectors originating from p.

Figure 3.1: Tangent space T_pM

It can be shown that the tangent space T_pM is a linear subspace of the tangent space T_pS. By applying the chain rule, we obtain
\[
U_m(p) = \partial_m l(f(\beta) \mid Y)
       = \partial_m f^i \,\partial_i l(\mu \mid Y)
       = \partial_m f^i \, U_i(p).
\]
We conclude that T_pM ⊂ T_pS.
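This chain-rule identity can be checked numerically for a specific model. The sketch below (plain R, a Poisson GLM with canonical log link and simulated data, all illustrative assumptions) computes the scores with respect to β directly and via U_m = ∂_m f^i U_i, i.e. as J^T U with J the Jacobian of the mean map.

```r
# For a Poisson GLM with log link: mu_i = exp(x_i' beta).
# Scores w.r.t. the means: U_i = d l / d mu_i = (y_i - mu_i) / mu_i.
# Jacobian of the mean map: J[i, m] = d mu_i / d beta_m = mu_i * x_im.
# Chain rule: U_m = sum_i J[i, m] * U_i, which reduces to X' (y - mu).
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n))
beta <- c(0.3, 0.6)
mu <- exp(as.vector(X %*% beta))
y  <- rpois(n, mu)

U_i <- (y - mu) / mu                 # basis scores spanning T_p S (mean parametrization)
J   <- mu * X                        # d mu / d beta, an n x 2 matrix
U_m_chain  <- as.vector(t(J) %*% U_i)        # U_m = (d_m f^i) U_i
U_m_direct <- as.vector(t(X) %*% (y - mu))   # score of l(beta) computed directly

all.equal(U_m_chain, U_m_direct)     # TRUE: T_p M vectors lie in T_p S
```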

By convention we use the subscripts m, n, ... to name vectors in the tangent space TpM and subscripts i, j, ... for vectors in TpS. Collecting all tangent spaces TpS for p ∈ S, we obtain T S; the tangent bundle for S. Similarly, the tangent spaces TpM for all p ∈ M form the tangent bundle T M for M.

An overview

An overview of the different spaces that we have discussed so far is depicted in figure 3.2. In the middle we have S, our ambient space, with the model space M embedded in it. These spaces can be mapped with the mean value map to obtain the mean value spaces µ_M and µ_S, shown on the left. For a point p on M we can determine the associated tangent spaces T_pM and T_pS, shown on the right.


Figure 3.2: An overview of the spaces

The inner product on a tangent space

We will now consider an inner product on the tangent space. It is defined in terms of any parametrization µ according to
\[
\langle U_i, U_j \rangle_{p(\mu)} = \mathrm{E}_\mu[\, U_i(p)\, U_j(p) \,] = I_n(\mu)_{i,j},
\]
which is well-defined and finite for all points p(µ) ∈ S.

In order to define the notion of the angle between two tangent vectors in the tangent space of S, we use the information matrix. Take two vectors in T_{p(µ)}S,
\[
\vec a = a^i U_i(p), \qquad \vec b = b^i U_i(p),
\]
where a = (a^1, ..., a^n)^T and b = (b^1, ..., b^n)^T. We define
\[
\langle \vec a, \vec b \rangle_{p(\mu)} = \mathrm{E}_\mu[\, a^i b^j U_i U_j \,]
= a^i b^j \,\mathrm{E}_\mu[\, U_i U_j \,]
= a^T I_n(\mu)\, b,
\]
where I_n(µ) is the Fisher information matrix evaluated at the mean parameter µ = µ(p). Note that I_n(µ)_{i,j} is the conditional correlation between µ^i and µ^j. We observe that the Fisher information defines an inner product by associating with each point p of S an inner product on the tangent space T_pS. The Fisher information metric, for any \(\vec a, \vec b \in T_pS\), is defined as
\[
d(\vec a, \vec b) = \langle \vec a - \vec b,\; \vec a - \vec b \rangle_p .
\]

Since T_pM is a linear subspace of T_pS, an inner product on T_pS also defines an inner product on T_pM. Consequently, we may define an inner product between a tangent vector in T_{p(µ(β))}S and one in T_{p(µ(β))}M. Consider the two vectors
\[
\vec v = v^j U_j(p), \qquad \vec w = w^m U_m(p).
\]
Note that \(\vec v \in T_{p(\mu(\beta))}S\) and \(\vec w \in T_{p(\mu(\beta))}M\). We define
\[
\langle \vec w, \vec v \rangle_{p(\mu(\beta))} = \mathrm{E}_{\mu(\beta)}[\, w^m v^j U_m U_j \,]
= w^m v^j \,\partial_m \mu^i\, \mathrm{E}_{\mu(\beta)}[\, U_i U_j \,]
= w^T \Bigl(\frac{\partial \mu(\beta)}{\partial \beta}\Bigr)^{\!T} I_n(\mu(\beta))\, v .
\]
We have now obtained a collection of inner product spaces.
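To illustrate, the sketch below (plain R) computes such an inner product and the implied angle between two tangent vectors, taking the Poisson case in the mean parametrization, where the Fisher information matrix is diagonal with entries 1/µ_i; the particular µ and vectors are illustrative assumptions.

```r
# Inner product and angle on T_p S induced by the Fisher information.
# Poisson observations in the mean parametrization: I_n(mu) = diag(1 / mu).
mu  <- c(1.5, 2.0, 0.8, 3.2)          # illustrative mean vector (a point p of S)
I_n <- diag(1 / mu)                    # Fisher information at p

ip <- function(a, b) as.numeric(t(a) %*% I_n %*% b)   # <a, b>_p = a' I_n(mu) b

a <- c(1, 0, -1, 2)
b <- c(0.5, 1, 1, -1)

ip(a, b)                               # inner product of the two tangent vectors
cos_angle <- ip(a, b) / sqrt(ip(a, a) * ip(b, b))
acos(cos_angle)                        # angle between them, in radians
```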

3.6 Angles on the Tangent Space

We need a map that relates the data to these spaces. We consider the space T_µS and take the vector with components y − µ(β) with respect to the basis U_i. We obtain what is called the tangent residual vector r(µ(β), y), expressed by
\[
r(\mu(\beta), y) = (y - \mu(\beta))^i \,\partial_i l(\mu(\beta)).
\]
We now introduce the notion of angles on the tangent space. We denote by ρ_m(β) the angle between the tangent residual vector r(µ(β), y) and the basis vector ∂_m l(µ(β)).

It turns out that the inner product of the tangent residual vector and a basis vector satisfies the identity
\[
\langle r(\mu(\beta), y), \partial_m l(\mu(\beta)) \rangle = \partial_m l(\mu(\beta)). \qquad (3.2)
\]

Proof

We show that the identity holds by calculating the inner product of r(µ(β), y) and ∂_m l(µ(β)). We observe that
\[
\begin{aligned}
\langle r(\mu(\beta), y), \partial_m l(\mu(\beta)) \rangle
&= \mathrm{E}_\mu\Bigl[ \sum_{i=1}^{n} (y_i - \mu_i)\,\partial_i l(\mu(\beta))
     \sum_{j=1}^{n} \partial_j l(\mu(\beta))\,\partial_m \mu^j \Bigr] \\
&= \sum_{i=1}^{n} \partial_m \mu^i \, \mathrm{E}_\mu\bigl[ \partial_i l(\mu(\beta))^2 \bigr]\,(y_i - \mu_i) \\
&= \sum_{i=1}^{n} \frac{\partial \mu^i}{\partial \beta^m}\,\frac{\partial \theta^i}{\partial \mu^i}\,(y_i - \mu_i) \\
&= \partial_m l(\mu(\beta)).
\end{aligned}
\]
Hence, the identity holds.

The derivative of the likelihood, ∂_m l(µ(β)), is equal to zero at the maximum likelihood estimator. Consequently, the inner product in (3.2) is equal to zero. This implies that at the MLE the tangent residual vector is perpendicular to the model space M, i.e. ρ_m(β̂) is a right angle. This is depicted in figure 3.3.

We have obtained a geometrical description of angles on the tangent space, which we can use to estimate the maximum likelihood estimator. A method for finding a sparse solution path that leads to the MLE is covered in the next chapter.


Figure 3.3: The tangent residual vector is perpendicular to the model space M
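This orthogonality can be observed directly in a fitted GLM: at the maximum likelihood estimate of a Poisson regression with canonical link, the score X^T(y − µ̂) vanishes, so the tangent residual vector is orthogonal to every basis vector of T_pM. The sketch below (plain R with simulated data, an illustrative check rather than part of the thesis) demonstrates this with the built-in glm function.

```r
# At the MLE of a canonical-link GLM, X' (y - mu_hat) = 0: the tangent residual
# vector is orthogonal to the tangent space of the model.
set.seed(11)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.4 + 0.8 * x1))          # x2 is pure noise

fit    <- glm(y ~ x1 + x2, family = poisson)
X      <- model.matrix(fit)
mu_hat <- fitted(fit)

round(as.vector(t(X) %*% (y - mu_hat)), 8)            # essentially (0, 0, 0)
```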


Chapter 4

The Least Angle Regression Method

In the previous chapter we learned about the geometry of generalized linear models and obtained a geometrical description of the maximum likelihood estimator. In this chapter we introduce methods for finding a sparse solution path β(γ) that leads to the maximum likelihood estimator. The method is based on the original least angle regression (LARS), which is used to obtain a solution path for a linear regression model; it is an algorithm that sequentially adds variables to the solution. In this chapter, we make a direct connection between the geometry of our model and least angle regression. We begin in the first section by discussing the generalized equiangularity condition. It will also be shown that the Rao score test statistic contains the same information as the angles, and we base an algorithm on this characterization of the Rao score statistic. In R, the differential geometric extension of least angle regression is implemented in the package dglars. The last part of the chapter concerns the predictor-corrector algorithm; it is a possible algorithm for dglars.

4.1 The Generalized Equiangularity Condition

Our aim is to find out which individual covariates describe the response variable best. The angle between the covariates and the tangent residual vector within the likelihood manifold provides a way to determine the importance of the individual covariates. In general, the smaller the angle between the tangent residual vector and a basis vector on the tangent space, the better.

If the basis vectors of two predictors make the same angle with the tangent residual vector, we say that the generalized equiangularity condition holds. This condition is central to least angle regression, and a representation of this situation is shown in figure 4.1. The generalized equiangularity condition is defined as
\[
\rho_{a_1}\{\mu(\beta(\gamma))\} = \rho_{a_2}\{\mu(\beta(\gamma))\}. \qquad (4.1)
\]
We want to calculate the angle ρ_m(β) on the tangent space. From the inner product between the tangent residual vector and the vector ∂_m l(β(γ)), this angle satisfies
\[
\cos[\rho_m\{\beta(\gamma)\}] =
\frac{\langle r(\mu(\beta), y),\; \partial_m l(\beta(\gamma)) \rangle}
     {\|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\;\|\partial_m l(\beta(\gamma))\|_{p\{\mu(\beta(\gamma))\}}}. \qquad (4.2)
\]


Figure 4.1: Equiangularity on the tangent space T_pM

Recall that in the previous chapter we derived the inner product of the tangent residual vector and the basis vector ∂_m l(µ(β)),
\[
\langle r(\mu(\beta), y), \partial_m l(\mu(\beta)) \rangle = \partial_m l(\mu(\beta)). \qquad (4.3)
\]
Combining equations (4.2) and (4.3) we obtain
\[
\partial_m l(\beta(\gamma))
= \cos[\rho_m\{\beta(\gamma)\}]\, \|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\, \|\partial_m l(\beta(\gamma))\|_{p\{\mu(\beta(\gamma))\}}
= \cos[\rho_m\{\beta(\gamma)\}]\, \|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\, i_m^{1/2}\{\beta(\gamma)\}.
\]
The second equality holds since \(\|\partial_m l(\beta(\gamma))\|_{p\{\mu(\beta(\gamma))\}}\) is the square root of the Fisher information i_m{β(γ)}. Hence the angle ρ_m{β(γ)} can be written as
\[
\rho_m\{\beta(\gamma)\} = \cos^{-1}\!\left(
\frac{\partial_m l(\beta(\gamma))}{\|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}\; i_m^{1/2}\{\beta(\gamma)\}}
\right). \qquad (4.4)
\]

We now introduce the Rao score test statistic r^u_m, which is defined as
\[
r^u_m(\gamma) = i_m^{-1/2}\{\beta(\gamma)\}\, \partial_m l(\beta(\gamma)). \qquad (4.5)
\]
The term within the inverse cosine in equation (4.4) is proportional to this Rao score test statistic.

It can be shown that, for generalized linear models, one can equivalently use this geometric characterization of the Rao score test statistic instead of the angles. To see this relationship, we rewrite equation (4.4) such that the Rao score test statistic can be written in the form
\[
r^u_m(\gamma) = i_m^{-1/2}\{\beta(\gamma)\}\, \partial_m l(\beta(\gamma))
= \cos[\rho_m\{\beta(\gamma)\}]\, \|r(\mu(\beta), y)\|_{p\{\mu(\beta(\gamma))\}}.
\]


Since the cosine is strictly decreasing on the restricted domain and ||r(µ(β), y)|| does not depend on m, we conclude that the Rao score test statistic and the angles contain the same information. Note that we prefer a large Rao score statistic, since smaller angles ρ_m{β(γ)} correspond to larger r^u_m(γ).
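As an illustration of this equivalence, the sketch below (plain R, a Poisson GLM evaluated at the intercept-only fit with simulated data; the explicit expressions for the scores, i_m and ‖r‖ are the canonical-link Poisson case, written out here as assumptions of the example) computes both the Rao score statistics and the angles ρ_m, and shows that they rank the covariates identically.

```r
# Rao score statistics and angles for a Poisson GLM at the intercept-only fit.
set.seed(3)
n <- 80; p <- 5
X  <- matrix(rnorm(n * p), nrow = n)
y  <- rpois(n, lambda = exp(0.2 + 1.0 * X[, 2]))       # only covariate 2 matters

mu     <- rep(mean(y), n)                              # intercept-only MLE: mu_i = ybar
dl     <- as.vector(t(X) %*% (y - mu))                 # partial_m l(beta)
i_m    <- as.vector(t(X^2) %*% mu)                     # Fisher information per covariate
r_norm <- sqrt(sum((y - mu)^2 / mu))                   # ||r(mu, y)||, since I_n(mu) = diag(1/mu)

rao   <- dl / sqrt(i_m)                                # r^u_m = i_m^{-1/2} d_m l
angle <- acos(abs(rao) / r_norm)                       # smaller angle <-> larger |Rao score|

order(abs(rao), decreasing = TRUE)                     # ranking by |Rao score| ...
order(angle)                                           # ... equals ranking by smallest angle
```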

4.2 Least Angle Regression

Least angle regression, or LARS, is a regression algorithm used for high dimensional data. Suppose we have a set of variables that potentially describe a response variable; our goal is to select a subset of these variables that describes the response variable best. These covariates are the explanatory variables or predictors. Typically, we have a large number of covariates to choose from and they far outnumber the observations. The LARS algorithm outputs an estimate of the collection of covariates to be included. A brief description of the LARS algorithm is given now.

We begin with only the intercept as our starting point and then find the covariate that is most correlated with the response variable. Let this covariate be X_{j1}. We proceed in the direction of this covariate. The algorithm makes the largest step possible in this direction, up to the point where there is another covariate, say X_{j2}, that is just as correlated with the current residual as X_{j1}. A graphical representation for two covariates is given in figure 4.2. Here we can see clearly that at first proceeding along X_{j1}, instead of X_{j2}, gives a better estimate of the outcome, but at some point this is no longer the case. We then proceed in the direction that is equiangular to both X_{j1} and X_{j2}. Again, we make the largest step possible in this direction, until there is yet another covariate, X_{j3}, that is as correlated with the residual as the other two covariates. We then proceed in the direction such that the covariates X_{j1}, X_{j2} and X_{j3} all make the same angle with the residual. This procedure is iterative. During the iterations of the algorithm, we update the active set A, the set of indices corresponding to the covariates that have the greatest correlations.

Figure 4.2: Least Angle Regression


4.3 LARS Algorithm for GLMs

By making use of the geometrical description derived in the previous chapter, the algorithm can be applied to the generalized linear model. The LARS algorithm specified for GLMs is as follows.

We first define our starting point. The intercept a_0 is contained in the active set. Furthermore, we introduce the concept of protected variables. These are variables that we want to keep in the model. The purpose of the algorithm is to select a set of variables that predicts the outcome best; however, some covariates may already be assumed to be predictive for the outcome, and it can be useful to keep these covariates in the model from the start. These variables are called protected variables. Let the set of protected variables be
\[
A = \{a_1, \dots, a_b\}.
\]
This is the active set at our starting point.

Let β̂(γ^(i)) be the solution curve at the i-th step of the algorithm. The starting point of the algorithm, at γ^(0), is the parameter vector
\[
\hat\beta(\gamma^{(0)}) = \bigl(\hat\beta_{\mathrm{prot}}(\gamma^{(0)}), 0, \dots, 0\bigr)^T,
\]
with a zero for each of the k non-protected variables a_1, ..., a_k, and where β̂_prot(γ^(0)) is the maximum likelihood estimate of the model containing only the protected variables. Now we can proceed with the algorithm.

The tangent residual vector r(µ(β̂_prot(γ^(0))), y) is orthogonal to the tangent space of M at p{µ(β̂_prot(γ^(0)))}.

We must find the most correlated predictor, i.e. the predictor whose basis vector has the smallest angle with the tangent residual vector. Suppose that covariate is X_{a_1}, so the basis vector ∂_{a_1} l(β̂_prot(γ^(0))) has the smallest angle with the tangent residual vector. We include the predictor in the active set, obtaining A = A ∪ {a_1}. The solution curve is now defined by
\[
\hat\beta(\gamma^{(1)}) = \bigl(\hat\beta_{\mathrm{prot}}(\gamma^{(1)}), \hat\beta_{a_1}(\gamma^{(1)})\bigr)^T,
\]
and it is chosen in such a way that the tangent residual vector is always orthogonal to our tangent space. The direction of this curve is defined by the projection of the tangent residual vector onto the basis vector ∂_{a_1} l(β̂(γ^(1))).

The curve β̂(γ) continues in this direction until a new covariate is just as correlated as X_{a_1}. Let that covariate be X_{a_2}. That is, they satisfy the equiangularity condition (4.1) on the tangent space of M at p{µ(β̂(γ^(2)))}. Recall that this condition implies that
\[
\rho_{a_1}\{\mu(\beta(\gamma^{(2)}))\} = \rho_{a_2}\{\mu(\beta(\gamma^{(2)}))\}.
\]
We add X_{a_2} to the active set, which becomes A = A ∪ {a_1, a_2}. The solution curve proceeds such that its direction is equiangular to both X_{a_1} and X_{a_2}. The new curve that we obtain is
\[
\hat\beta(\gamma^{(2)}) = \bigl(\hat\beta_{\mathrm{prot}}(\gamma^{(2)}), \hat\beta_{a_1}(\gamma^{(2)}), \hat\beta_{a_2}(\gamma^{(2)})\bigr)^T.
\]


It is defined such that the tangent residual vector is again orthogonal to the tangent space. Moreover, its direction is defined by the tangent vector that bisects the angle between the basis vectors ∂_{a_1} l(β(γ^(2))) and ∂_{a_2} l(β(γ^(2))). These steps can be repeated for the remaining covariates and a solution curve is obtained.

(a) Solution path direction based on the least angle
(b) Solution path direction based on equiangularity

Figure 4.3: Least Angle Regression for a Generalized Linear Model

Consider the solution paths in figure 4.3. In subfigure 4.3a, a graphical representation is given of the solution path based on the least angle: covariate X_{a_1} has the smallest angle with the tangent residual vector. If we proceed in this direction until the equiangularity condition is satisfied, we arrive at the situation represented in subfigure 4.3b, where we choose the direction that is equiangular to both X_{a_1} and X_{a_2}.

Let us now study the obtained solution curve β̂(γ) in terms of the Rao score test statistic defined in equation (4.5). We have γ ∈ [0, γ^(1)] with 0 ≤ γ^(k) ≤ ... ≤ γ^(2) ≤ γ^(1), and the corresponding active set A(γ) = A ∪ {a_1, a_2, ..., a_k}. The solution curve is defined such that the following two conditions hold for the Rao score test statistics:
\[
|r^u_{a_i}(\gamma)| = |r^u_{a_j}(\gamma)| \quad \text{for all } a_i, a_j \in A(\gamma), \qquad (4.6)
\]
\[
|r^u_{a^c_h}(\gamma)| < |r^u_{a_i}(\gamma)| \quad \text{for all } a^c_h \in A^c(\gamma),\ a_i \in A, \qquad (4.7)
\]
where A^c is the complement of the set A.

Condition (4.6) ensures that the generalized equiangularity condition is satisfied. Condition (4.7) must hold, since otherwise the covariate would already have been added to the active set. Recall that a larger Rao score test statistic corresponds to a smaller angle. Whenever there is a covariate a^c_h ∈ A^c with a Rao score test statistic as large as that of the covariates in the active set, i.e.
\[
|r^u_{a^c_h}(\gamma)| = |r^u_{a_i}(\gamma)| \quad \text{for all } a_i \in A(\gamma), \qquad (4.8)
\]
we add the covariate to the active set.


4.4 Predictor-Corrector Algorithm

The basic idea of the predictor-corrector algorithm is to trace a curve implicitly defined by a system of non-linear equations.

Let there be k covariates in the active set. From (4.6) it follows that
\[
|r^u_{a_1}(\gamma)| = |r^u_{a_2}(\gamma)| = \dots = |r^u_{a_k}(\gamma)|.
\]
We introduce the vector v_k = (v_{a_1}, ..., v_{a_k}) of signs of the Rao score test statistics, that is,
\[
v_{a_j} = \mathrm{sign}\bigl(r^u_{a_j}(\gamma^{(k)})\bigr).
\]

We define β̂_A(γ) by the following k + b non-linear equations:
\[
\begin{cases}
\partial_{a_1} l(\beta(\gamma)) = 0 \\
\qquad\vdots \\
\partial_{a_b} l(\beta(\gamma)) = 0 \\
r^u_{a_1}(\gamma) = v_{a_1}\cdot\gamma \\
\qquad\vdots \\
r^u_{a_k}(\gamma) = v_{a_k}\cdot\gamma,
\end{cases}
\]
where the first b equations correspond to the protected variables. We define our system by
\[
\varphi_A(\gamma) = \bigl( \partial_{a_1} l(\beta(\gamma)), \dots, \partial_{a_b} l(\beta(\gamma)),\, r^u_{a_1}(\gamma), \dots, r^u_{a_k}(\gamma) \bigr)^T.
\]

We now rewrite our system in the following way. Let v_A = (0, ..., 0, v_k)^T be a vector starting with b zeros. If we consider the equation φ̃_A(γ) = φ_A(γ) − v_A γ, we obtain a system of equations equal to zero,
\[
\begin{cases}
\partial_{a_1} l(\beta(\gamma)) = 0 \\
\qquad\vdots \\
\partial_{a_b} l(\beta(\gamma)) = 0 \\
r^u_{a_1}(\gamma) - v_{a_1}\cdot\gamma = 0 \\
\qquad\vdots \\
r^u_{a_k}(\gamma) - v_{a_k}\cdot\gamma = 0,
\end{cases} \qquad (4.9)
\]

and this can be solved by the Newton-Raphson method. Newton-Raphson is a method for approximating the root of a real-valued function. To this end, we need the derivative of φ̃_A(γ), which is given by
\[
\frac{d\tilde\varphi_A(\gamma)}{d\gamma}
= \frac{d\varphi_A(\gamma)}{d\gamma} - v_A
= \frac{\partial \varphi_A(\gamma)}{\partial \hat\beta_A}\cdot\frac{d\hat\beta_A(\gamma)}{d\gamma} - v_A. \qquad (4.10)
\]
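For the corrector step, any standard root finder can be used on the system (4.9) with γ held fixed. The sketch below (plain R) is a minimal, generic Newton-Raphson routine under the assumption that the user supplies a function phi_tilde returning the k + b residual equations of (4.9) for a given coefficient vector of the same length (so the system is square); the Jacobian is approximated by finite differences here purely for brevity, rather than assembled from equation (4.10).

```r
# A minimal Newton-Raphson sketch for the corrector step: solve phi_tilde(beta) = 0
# for the active coefficients at a fixed gamma. 'phi_tilde' is a user-supplied
# function returning the k + b equations of system (4.9). Hypothetical helper,
# not the dglars implementation.
newton_corrector <- function(phi_tilde, beta0, tol = 1e-8, max_iter = 50) {
  beta <- beta0
  for (iter in seq_len(max_iter)) {
    f <- phi_tilde(beta)
    if (max(abs(f)) < tol) break
    # numerical Jacobian of phi_tilde at beta (one column per coefficient)
    J <- sapply(seq_along(beta), function(j) {
      h <- 1e-6 * max(1, abs(beta[j]))
      beta_h <- beta; beta_h[j] <- beta_h[j] + h
      (phi_tilde(beta_h) - f) / h
    })
    beta <- beta - solve(J, f)          # Newton update
  }
  beta
}
```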


The Algorithm

The predictor-corrector algorithm is as follows. The algorithm consists of two important steps: as is to be expected, these are a predictor step and a corrector step. In the predictor step we approximate the solution curve. In the corrector step we can correct the mistake of including a wrong predictor in the set: when an incorrect variable is included, there exists a non-active variable whose Rao score test statistic is greater than γ in absolute value. The problem can be resolved by reducing the previous step size.

We start by computing β̂_prot. We then determine the first active covariate
\[
a_1 = \underset{a^c_j \in A^c}{\arg\max}\; |r^u_{a^c_j}(\hat\beta_{\mathrm{prot}})|,
\]
add it to A, and set γ = |r^u_{a_1}(β̂_prot)|. We repeat the following steps until some convergence criterion is met.

For an optimal step size Δγ_opt, we set γ = γ − Δγ_opt. The local approximation of the solution curve is
\[
\hat\beta(\gamma - \Delta\gamma_{\mathrm{opt}}) \approx \tilde\beta(\gamma - \Delta\gamma_{\mathrm{opt}})
= \hat\beta_A(\gamma) - \Bigl(\frac{\partial \varphi_A(\gamma)}{\partial \hat\beta_A}\Bigr)^{-1} v_A\, \Delta\gamma_{\mathrm{opt}}, \qquad (4.11)
\]
making use of equation (4.10). Solving this equation gives β̃_A(γ); this is the predictor step. For the corrector step, we solve the system of equations (4.9), using β̃_A(γ) as starting value.

For all covariates a^c_h in the non-active set, we compute the Rao score test statistic r^u_{a^c_h}(γ). If there is a covariate a^c_h ∈ A^c such that
\[
|r^u_{a^c_h}(\gamma)| > \gamma,
\]
then the step size was too large. We reduce the step size and go back to the predictor step. If condition (4.8) holds, we continue the algorithm and update the active set.
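To make the starting point of the algorithm concrete, the sketch below (plain R, a Poisson model with only an intercept as protected variable and simulated data; the closed-form score and information expressions are the canonical-link Poisson case, stated here as assumptions) computes β̂_prot, the Rao score statistics of the non-active covariates, the first covariate to enter the active set, and the initial value of γ.

```r
# Initialization of the predictor-corrector algorithm for a Poisson GLM,
# with the intercept as the only protected variable.
set.seed(21)
n <- 120; p <- 10
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
y <- rpois(n, lambda = exp(0.1 + 0.9 * X[, 3]))        # covariate 3 is the signal

beta_prot <- coef(glm(y ~ 1, family = poisson))        # protected (intercept-only) MLE
mu        <- rep(exp(beta_prot), n)

dl  <- as.vector(t(X) %*% (y - mu))                    # partial_m l at beta_prot
i_m <- as.vector(t(X^2) %*% mu)                        # Fisher information per covariate
rao <- dl / sqrt(i_m)                                  # Rao score statistics r^u_m

a1    <- which.max(abs(rao))                           # first covariate entering A
gamma <- abs(rao[a1])                                  # starting value of gamma
c(first_covariate = a1, gamma_start = gamma)
```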

A graphical representation of the predictor-corrector algorithm is depicted in figure 4.4. The sub-figures 4.4(a)-(d) show the Rao score paths of some covariates together with the line |r^u(γ)| = γ.

In 4.4a we see the first iteration, γ^(1). The first covariate is added to the active set, since it has the largest value of the Rao score test statistic; this point is indicated by the dot, and the value of γ at that point by the vertical dotted line. We reduce γ by Δγ, as shown in figure 4.4b. At this point, no covariate satisfies the inclusion condition (4.8). We proceed with the next step by again reducing γ by Δγ. Now there are two covariates whose Rao score test statistic exceeds γ in absolute value, as shown in 4.4c; this means that the step size was too large. We reduce the step size until only one covariate satisfies the equiangularity condition. In 4.4d that is the case and the next covariate is added to the active set; this is γ^(2). We can now proceed with step γ^(3).


(a) The first iteration of the algorithm
(b) The step is not large enough
(c) The step needs to be corrected
(d) Reduced step size

Figure 4.4: The Predictor-Corrector algorithm: Rao score paths

4.5 Some Remarks

We conclude this chapter with three remarks. The first remark concerns the optimal step size Δγ_opt. There are several methods to compute the optimal step size: we can use a fixed value, or we can relate Δγ_opt to a fixed variation in the arc length parameterization of the solution curve. Another method is to consider the step size that changes the active set. In this thesis we will not discuss this matter further; for more detailed information on the third possibility, the reader is referred to Augugliaro et al. (2013).

Secondly, in the corrector step we usually do not find the exact point where a covariate should be included in the active set. However, we do not need this exact point to know which covariate should be included: if there is only one covariate a^c_h for which |r^u_{a^c_h}(γ)| > |r^u_{a_j}(γ)| for all a_j ∈ A, we want to add that covariate to the active set, so we reduce the step size and include it. If there is more than one such covariate, we cannot conclude that they should all be included; we reduce the step size and return to the predictor step to determine which of the covariates has to be included first.


Finally, there are some computational issues. Solving the system of equations (4.9) and computing the inverse in (4.11) in the predictor step are computationally the most expensive steps. For a large number of variables in the active set (as γ decreases), computing the inverse involves very large matrices. There are ways to overcome these problems, but they will not be discussed further here.


Chapter 5

Excess Relative Risk Models

In this chapter we extend the method derived in the previous chapter to survival risk models. The focus will primarily be on the excess relative risk model, which is the main subject of this chapter. Before we describe this model, we start with an introduction to survival analysis and learn how to work with censored survival times. We derive an expression for the partial likelihood function and make the connection to generalized linear models. In the last section of this chapter, we derive expressions for the excess relative risk model which can be implemented in R.

5.1 Censored Survival Times

Survival analysis is the subject in statistics concerned with the time until an event, called failure, occurs. In engineering, this event could be the failure of a component to work properly. In medicine, it could mean the death of a patient or a relapse from remission. In general we assume there is only one event of interest that may happen.

The time from a well-defined starting point until failure is called the survival time.

The main goal of survival analysis is to assess the relationship of explanatory variables to survival data. In this thesis, we focus on survival analysis in cancer studies and in particular genomic research. We seek a relationship between survival and genomic markers to obtain more information about the prospects of an individual cancer patient, based on the genomic structure of the tumor.

The analysis of survival data can be complicated, since in some cases the survival times are unknown. Sometimes some information about the survival time of an individual is available, but we do not know the exact time; we say the data is censored. Censoring is a key problem in survival analysis, occurring mainly in medical trials, and it can have different causes.

There are three main reasons why censoring occurs:

• An individual survives the end of the study

• An individual is lost to follow-up

• An individual withdraws from the study


Example

In figure 5.1, an example of possible survival times of four individuals in a medical trial is represented graphically to illustrate censored and non-censored survival times. The study period runs from time T_0 to time T_C.

Figure 5.1: Types of censoring

Patient 1 has a survival time that we know exactly, since the time of failure is known. Patient 2 has survived past the end of the study period T_C and the time of the event D, death, is therefore not known; we only know that the survival time is at least as long as the period from the time the patient entered the study until the end of the study period. We say the data of patient 2 is right censored. Patient 3 is no longer available for contact at time T_L: the patient is lost to follow-up, or perhaps withdrawn from the study. This happens, for example, when a patient dies of a cause other than the disease under study. Patient 4 is also censored. The survival period started before the study period, and therefore the exact survival time is unknown; this happens when the disease is diagnosed before the study has started. We call this data left censored.

In this thesis we will focus solely on right censored survival times. Although the exact survival time of censored observations is unknown, the information that we have up until the time of censoring can be used in analyzing the survival data.

5.2 The Hazard Function

In survival analysis, the hazard function is one of the most important tools; it is used to define the risk of death at a specific time. The hazard function and the survival function will be defined in this section. We want to estimate and interpret survival and hazard functions from survival data. They give us information about the rate at which survival drops during follow-up. Survival and hazard functions can also be compared; we can, for example, compare a treatment group and a placebo group based on these functions.

Let T be the random variable denoting the survival time and let t be a specific value of T; note that T ≥ 0. Also, let δ be the variable indicating whether we observe the event during the study period or censoring by the end of the study period,
\[
\delta = \begin{cases} 1 & \text{if failure}, \\ 0 & \text{if censored}, \end{cases}
\]
where δ is called the status. Consider the example in the previous section: patient 1 has an exact failure time, hence the status is 'failure' and δ = 1, while for the censored patients the status is 'censored' and δ = 0.
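For a small concrete illustration (simulated numbers, not data from the thesis), right-censored survival data can be stored as a pair of vectors, the observed time and the status δ; in R, the survival package's Surv object is the usual container for such pairs.

```r
# Right-censored survival times: observed time and status delta (1 = failure, 0 = censored).
library(survival)

time   <- c(4.2, 7.5, 3.1, 9.0, 6.4)     # observed follow-up times (illustrative)
status <- c(1,   0,   1,   0,   1)       # delta: 1 if the event was observed

Surv(time, status)                        # prints 4.2, 7.5+, 3.1, 9.0+, 6.4
```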

The hazard function gives the instantaneous rate of death in an infinitesimally small time period between a specified time t and t + dt, given survival up to time t. Formally,
\[
h(t) = \lim_{dt \to 0} \frac{\Pr(t < T \le t + dt \mid T > t)}{dt}
     = \lim_{dt \to 0} \frac{\Pr(t < T \le t + dt)}{dt \cdot \Pr(T > t)}.
\]

Let f(t) be the probability density function of T. The probability of failure before a specific time t is then given by the distribution function
\[
F(t) = \Pr(T < t) = \int_0^t f(s)\, ds.
\]
The survival function gives the probability of survival beyond time t; it is the complement of F(t). That is,
\[
S(t) = \Pr(T \ge t) = 1 - F(t).
\]

The hazard function can now be written in the form
\[
h(t) = \frac{f(t)}{S(t)}. \qquad (5.1)
\]
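As a quick numerical illustration of (5.1) (a standard example, not part of the thesis), for an exponentially distributed survival time the hazard is constant and equal to the rate parameter; the sketch below checks f(t)/S(t) against that constant using base R density and distribution functions.

```r
# For T ~ Exponential(rate), h(t) = f(t) / S(t) should equal 'rate' for every t.
rate <- 0.3
t    <- c(0.5, 1, 2, 5, 10)

f <- dexp(t, rate = rate)                          # density f(t)
S <- pexp(t, rate = rate, lower.tail = FALSE)      # survival function S(t) = P(T > t)

f / S                                              # constant hazard: 0.3 at every t
```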

The hazard function can depend on a vector of covariates X(t) = (X_1(t), ..., X_p(t))^T. The hazard function of a relative risk regression model can be expressed in the form
\[
h(t, x) = h_0(t)\, \psi(X(t), \beta). \qquad (5.2)
\]
In (5.2), h_0(t) is called the baseline hazard function. The function ψ is called the relative risk function and it is defined such that it is positive for all β.

An example of a relative risk model is the model that we are focusing on, namely the excess relative risk model. For this model we have that
\[
\psi(x(t), \beta) = \prod_{k=1}^{p} \bigl( 1 + X_k(t)\,\beta_k \bigr). \qquad (5.3)
\]
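To illustrate the excess relative risk form (5.3), the sketch below (plain R with made-up covariate values and coefficients, and an arbitrary baseline hazard chosen purely for illustration) evaluates ψ(x, β) and the resulting hazard h(t, x); note that β must be such that every factor 1 + x_k β_k stays positive.

```r
# Excess relative risk: psi(x, beta) = prod_k (1 + x_k * beta_k), and
# hazard h(t, x) = h0(t) * psi(x, beta). All numbers are illustrative.
psi_err <- function(x, beta) prod(1 + x * beta)

h0 <- function(t) 0.02 * t            # an arbitrary baseline hazard, for illustration only

x    <- c(1.2, 0.0, 2.5)              # covariate values of one individual
beta <- c(0.30, 0.10, 0.05)           # coefficients; all 1 + x_k * beta_k > 0 here

psi_err(x, beta)                      # relative risk of this individual
h0(3) * psi_err(x, beta)              # hazard at time t = 3 for this individual
```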
