Modelling long term survival with non-proportional hazards
Perperoglou, A.
Citation
Perperoglou, A. (2006, October 18). Modelling long term survival with non-proportional hazards. Retrieved from
https://hdl.handle.net/1887/4918
Version:
Corrected Publisher’s Version
License:
Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of
Leiden
Downloaded from:
https://hdl.handle.net/1887/4918
The coxvc_1-1-1 package
A.1
Introduction
The coxvc_1-1-1 package is a set of functions for survival analysis that run under R2.1.1 [81]. This package contains a set of routines to fit Cox models [24] with time varying effects of the covariates and reduced-rank models [77]. What makes those two modelling approaches so special is that an expanded data set has to be created before fitting, making the task computationally demanding, since even small data sets explode when stacking together all the possible risk sets. Using coxvc the models can be fitted on the original data, in a very fast and efficient algorithm, as described in [76]. The set of routines included in the package also contains some small useful functions that the authors often use when fitting survival models.
The coxvc 1-1-1 requires packages MASS, splines and survival [64], which are automatically loaded when you use the package. Please refer to the manual of those packages for more information. The MASS [102] package is loaded for using the command ginverse which is essential when estimating the generalized inverse matrix of the information matrix from a reduced-rank model. Splines are loaded in order to transform some of the covariates when running the models. Note that this package is not essential (although the build in examples of the coxvc package use splines) but it is definitely useful in many applications. Last, the survival package is the base core of the package, since it is needed for creating the survival objects used in our examples.
A.2
Statistical background
functions. A non-proportional Cox model may be written as:
h(t|X) = h0(t) exp(XΘF0) (A.1)
where h0(t) is the unspecified baseline hazard, X is an 1 × p matrix of p
co-variates,F is a n × q matrix of q time functions, and Θ is a p × q matrix of
estimable coefficients.
Perperoglou, le Cessie and van Houwelingen [77] introduced the idea of reduced-rank regression to survival analysis with time varying coefficients. A reduced-rank model requires the matrix of regression coefficientsΘ to be writ-ten as a product of two submatrices,B of size p × r andΓ of size q × r, thus resulting inΘ = BΓ0, a matrix of reduced-rankr, smaller than the number of covariatesp or the number of time functions q. For fitting the full model, r has
to be chosen to be equal to the minimum (p, q), in which case the structure
matrixΘ is of full rank.
This package was created to fulfil the demand of fitting reduced- rank haz-ards models in a fast and efficient way. For motivation of the package use refer to [76]. The new version of the package contains an additional set of small functions that were found useful to the author in several cases when analyzing survival data.
A.3
Examples
First load the coxvc library: > library(coxvc)
The sample data within this library come from a study of ovarian cancer pa-tients [104]. There are in total 358 cases of papa-tients with information of the following variables:
time The number of days from enrollment until death or censoring. death An indicator of death (1) or censoring (0).
karn The karnofsky index measuring the ability of the patients to perform several tasks.
diam The diameter of the residual tumor.
Table A.1: Definitions of variables and patients frequencies Xk 0 1 2 3 4 Karnofsky 100 90 80 70 <70 n 137 108 47 46 20 Xf 0 1 Figo III IV n 262 96 Xd 0 1 2 3 4 Diameter Micro <1 1-2 2-5 >5 n 29 67 49 68 145
For more information refer to table A.1. First attach the data:
> data(ova) > attach(ova)
A short summary of the data follows: > summary(ova)
time death karn figo
Min. : 7.0 Min. :0.000 Min. :0.000 Min. :0.0000
1st Qu.: 369.3 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
Median : 734.5 Median :1.000 Median :1.000 Median :0.0000
Mean : 980.8 Mean :0.743 Mean :1.173 Mean :0.2682
3rd Qu.:1619.3 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :2729.0 Max. :1.000 Max. :4.000 Max. :1.0000
diam x Min. :0.000 Min. : 1.00 1st Qu.:1.000 1st Qu.: 90.25 Median :3.000 Median :179.50 Mean :2.651 Mean :179.50 3rd Qu.:4.000 3rd Qu.:268.75 Max. :4.000 Max. :358.00
> fit.ph <- coxph(Surv(time, death) ~ karn + diam + figo) > fit.ph
Call: coxph(formula = Surv(time, death) ~ karn + diam + figo)
coef exp(coef) se(coef) z p
karn 0.172 1.19 0.0532 3.23 1.2e-03
diam 0.200 1.22 0.0495 4.05 5.2e-05
figo 0.542 1.72 0.1353 4.00 6.2e-05
Likelihood ratio test=64.1 on 3 df, p=7.68e-14 n= 358
A test of proportionality based on Schoenfeld residuals [92] reveals that in fact there are deviations from proportional hazards in the data.
> cox.zph(fit.ph) rho chisq p karn -0.15773 7.23734 0.00714 diam -0.03464 0.31140 0.57682 figo -0.00405 0.00432 0.94760 GLOBAL NA 9.64451 0.02184
as it is indicated by the small global p-value given above. A graphical inspection given by:
> par(mfrow = c(3, 1)) > plot(cox.zph(fit.ph))
The results are shown in figure A.1 and suggest that there may be an interaction of time with the covariates.
A first approach will be to fit a full rank model, which includes the fullΘ
matrix. We choose to transform time using B-splines, thus create theF matrix
to contain F1(t) = 1 a constant and cubic B-spline functions on 3 degrees of freedom:
> Ft <- cbind(rep(1, nrow(ova)), bs(time, df = 3)) Then the full rank model is given by:
> fit.r3 <- coxvc(Surv(time, death) ~ karn + diam + figo, Ft, rank = 3,
+ data = ova)
Time
Beta(t) for karn
82 200 350 470 630 800 1100 1900 −1 0 1 2 ● ● ●● ● ● ● ●● ● ● ● ●●● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ●● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● Time
Beta(t) for diam
82 200 350 470 630 800 1100 1900 −2.0 −0.5 1.0 ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ●●● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ●● ● ● ●●●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●●●● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● Time
Beta(t) for figo
82 200 350 470 630 800 1100 1900 −2 0 2 4 ●● ●●● ● ●●●● ●●● ●● ●● ● ● ●● ●●● ● ●● ●● ● ●●●● ● ●● ● ● ●● ●●● ● ●● ● ● ● ● ●● ●● ●●●● ● ● ●●● ● ●● ● ●● ● ● ●● ● ● ●●● ● ● ● ● ● ●● ● ● ●●● ●● ●● ● ● ● ● ●●● ●● ●●● ● ● ●●●● ● ● ● ●● ●● ● ●● ● ● ● ● ●●●●●●●●●●● ●● ● ● ● ●● ● ●●● ● ● ● ● ●● ●●● ● ●● ● ● ● ●● ● ● ●●● ● ● ●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ●●●● ● ● ● ● ● ● ●●● ● ● ● ●●● ●● ●● ● ● ● ● ●●●● ●● ●● ● ●●●● ● ●●● ● ●●●●●●● ● ● ● ●●● ●●●● ● ●●●● ●●●● ● ●● ●
Figure A.1: Test of proportionality based on scaled Schoenfeld residuals along with a spline smooth with 90% confidence intervals.
call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft =
Ft, rank = 3, data = ova)
coef exp(coef) se(coef) z p
karn:f3(t) -1.676 0.187 2.150 -0.7795 0.44000
diam:f3(t) -1.550 0.212 1.546 -1.0026 0.32000
figo:f3(t) 3.481 32.492 4.358 0.7988 0.42000
log-likelihood= -1373.840
algorithm converged in 5 iterations
The class of object fit.r3 is coxvc. The generic function printcoxvc is in-cluded in the package for printing results from the full model. The model has 21 parameters, and in practice the results are identical with fitting a coxph model on the expanded data set. However, the fit here was done in 5 iterations, on the original data set, which makes the routine much faster and more efficient. There are in total 266 events present in the ovarian data set. The object fit.r3 also contains the baseline hazard evaluated at this event time points. The function expand.haz can be used for expanding either the baseline or the cumulative baseline hazard.
> haz <- fit.r3$hazard > length(haz)
[1] 266
> haz.exp <- expand.haz(haz, death, fun = "baseline") > length(haz.exp)
[1] 358
When expanding the baseline hazard, the function assigns a zero value in the time points of censoring, while when expanding a cumulative baseline hazard, the function assigns the value of the cumulative baseline at the time where the previous event took place whenever there is a censored case.
> cum.haz <- cumsum(haz)
> cum.haz.exp <- expand.haz(cum.haz, death, fun = "cumulative") The function plotcoxvc is included in the package to draw figures of the time varying behavior of the covariates:
0 500 1000 1500 2000 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0 time in days karn diam figo
Figure A.2: Estimated effects of the covariates over time, for the full rank model.
> plotcoxvc(fit.r3, fun = "survival", xlab = "time in days") In figure A.2 we have seen that the time varying behavior of the covariates is too flexible, especially in the last days of the follow up. We fitted a rank=2 model at the data, to see whether the fit improves:
> fit.r2 <- coxvc(Surv(time, death) ~ karn + diam + figo, Ft, rank = 2,
+ data = ova)
> fit.r2
call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft =
Ft, rank = 2, data = ova)
coef exp(coef) se(coef)
0 500 1000 1500 2000 0.0 0.2 0.4 0.6 0.8 1.0 time in days
Figure A.3: Survival function for the full rank model.
diam 0.176 1.1924381 0.163 figo 0.616 1.8515072 0.358 karn:f1(t) -1.277 0.2788727 0.711 diam:f1(t) 0.209 1.2324450 0.646 figo:f1(t) -0.415 0.6603403 1.259 karn:f2(t) 0.239 1.2699785 1.322 diam:f2(t) 0.094 1.0985597 0.942 figo:f2(t) 0.298 1.3471618 1.455 karn:f3(t) -1.146 0.3179059 1.964 diam:f3(t) -0.830 0.4360493 1.352 figo:f3(t) -2.054 0.1282210 2.202
log-likelihood= -1375.299 , Rank= 2 algorithm converged in 12
Beta : Gamma: [,1] [,2] [,1] [,2] [1,] 5.4171 -72.6259 [1,] 1.0610 0.0716 [2,] 0.3858 -3.2510 [2,] 1.8601 0.1563 [3,] 3.8667 -48.6655 [3,] 0.5818 0.0401 [4,] -5.4324 -0.3894 > summary(fit.r2)
call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft =
Ft, rank = 2, data = ova)
Beta : Gamma: [,1] [,2] [,1] [,2] [1,] 5.4171 -72.6259 [1,] 1.0610 0.0716 [2,] 0.3858 -3.2510 [2,] 1.8601 0.1563 [3,] 3.8667 -48.6655 [3,] 0.5818 0.0401 [4,] -5.4324 -0.3894
The class of fit.r2 is coxrr. For reduced-rank models the generic function print.coxrr will print the estimated coefficients of the model along with their standard errors and so forth, as well as the factors of theΘ matrix, B and Γ.
Moreover, the function summary.coxrr will provide also summary of theB and
Γ matrices.
We see that the rank=2 model, with 16 parameters in total, has a more reasonable fitting of the covariate effects
> plotcoxvc(fit.r2, fun = "effects", xlab = "time in days") while the rank=1 model with 9 free parameters, is more much more rigid: > fit.r1 <- coxvc(Surv(time, death) ~ karn + diam + figo, Ft, rank = 1,
+ data = ova)
> fit.r1
call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft =
0 500 1000 1500 2000 −1.0 −0.5 0.0 0.5 1.0 time in days karn diam figo
Figure A.4: Estimated effects of the covariates over time, for the rank=2 model.
coef exp(coef) se(coef)
log-likelihood= -1378.049 , Rank= 1 algorithm converged in 5 iterations Beta : Gamma: [,1] [,1] [1,] 0.2146 [1,] 1.5553 [2,] 0.2450 [2,] -2.2257 [3,] 0.5008 [3,] 1.3876 [4,] -5.3198
> plotcoxvc(fit.r1, fun = "effects", xlab = " time in days")
0 500 1000 1500 2000 −1.0 −0.5 0.0 0.5 1.0 time in days karn diam figo
to zero. For example consider the simple proportional hazards model fit.ph. To get an estimate of the baseline hazard the function coxph.details can be used:
> haz.ph <- coxph.detail(fit.ph)$haz > haz.ph0 <- calc.h0(fit.ph)
The object haz.ph is the baseline hazard evaluated at the mean value of the covariates, while the object haz.ph0 is the baseline hazard evaluated for all covariate values equal to zero. This can be seen in graph A.6:
> plot(time[death == 1], exp(-cumsum(haz.ph)), ylim = c(0, 1), + ylab = "", "l")
> lines(time[death == 1], exp(-cumsum(haz.ph0)), col = 2)
0 500 1000 1500 2000 0.0 0.2 0.4 0.6 0.8 1.0 time[death == 1]
Figure A.6: Figure of survival for an average person (black line) and a person