By R Views
With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics . Basic life-table methods, including techniques for dealing with censored data, were discovered before 1700 , and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field . Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.
Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger
survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the
Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.
ggfortify enables producing handsome, one-line survival plots with
ranger might be the surprise in my very short list of survival packages. The
ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But
ranger() also works with survival data. Benchmarks indicate that
ranger() is suitable for building time-to-event models with the large, high-dimensional data sets important to internet marketing applications. Since
ranger() uses standard
Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machine-learning age.
Load the data
This first block of code loads the required packages, along with the
veteran dataset from the
survival package that contains data from a two-treatment, randomized trial for lung cancer.
library(survival) library(ranger) library(ggplot2) library(dplyr) library(ggfortify) #------------ data(veteran) head(veteran)
## trt celltype time status karno diagtime age prior ## 1 1 squamous 72 1 60 7 69 0 ## 2 1 squamous 411 1 70 5 64 10 ## 3 1 squamous 228 1 60 3 38 0 ## 4 1 squamous 126 1 60 9 63 10 ## 5 1 squamous 118 1 70 11 65 10 ## 6 1 squamous 10 1 20 5 49 0
The variables in
veteran are: *
trt: 1=standard 2=test *
celltype: 1=squamous, 2=small cell, 3=adeno, 4=large *
time: survival time in days *
status: censoring status *
karno: Karnofsky performance score (100=good) *
diagtime: months from diagnosis to randomization *
age: in years *
prior: prior therapy 0=no, 10=yes
Kaplan Meier Analysis
The first thing to do is to use
Surv() to build the standard survival object. The variable
time records survival time;
status indicates whether the patient’s death was observed (
status = 1) or that survival time was censored (
status = 0). Note that a “+” after the time in the print out of
km indicates censoring.
# Kaplan Meier Survival Curve km
##  72 411 228 126 118 10 82 110 314 100+ 42 8 144 25+ ##  11 30 384 4 54 13 123+ 97+ 153 59 117 16 151 22 ##  56 21 18 139 20 31 52 287 18 51 122 27 54 7 ##  63 392 10 8 92 35 117 132 12 162 3 95 177 162 ##  216 553 278 12 260 200 156 182+ 143 105 103 250 100 999 ##  112 87+ 231+ 242 991 111 1 587 389 33
To begin our analysis, we use the formula
Surv(futime, status) ~ 1 and the
survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time. The
times parameter of the
summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.
## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 1 137 2 0.985 0.0102 0.96552 1.0000 ## 30 97 39 0.700 0.0392 0.62774 0.7816 ## 60 73 22 0.538 0.0427 0.46070 0.6288 ## 90 62 10 0.464 0.0428 0.38731 0.5560 ## 180 27 30 0.222 0.0369 0.16066 0.3079 ## 270 16 9 0.144 0.0319 0.09338 0.2223 ## 360 10 6 0.090 0.0265 0.05061 0.1602 ## 450 5 5 0.045 0.0194 0.01931 0.1049 ## 540 4 1 0.036 0.0175 0.01389 0.0934 ## 630 2 2 0.018 0.0126 0.00459 0.0707 ## 720 2 0 0.018 0.0126 0.00459 0.0707 ## 810 2 0 0.018 0.0126 0.00459 0.0707 ## 900 2 0 0.018 0.0126 0.00459 0.0707
#plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready autoplot(km_fit)
Next, we look at survival curves by treatment.
And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable
AG that has values
GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make
prior into factor variables. But note,
npsurv() worked just fine without this refinement.
Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.
Cox Proportional Hazards Model
Next, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.
# Fit Cox Model cox
## Call: ## coxph(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137, number of events= 128 ## ## coef exp(coef) se(coef) z Pr(>|z|) ## trttest 2.946e-01 1.343e+00 2.075e-01 1.419 0.15577 ## celltypesmallcell 8.616e-01 2.367e+00 2.753e-01 3.130 0.00175 ** ## celltypeadeno 1.196e+00 3.307e+00 3.009e-01 3.975 7.05e-05 *** ## celltypelarge 4.013e-01 1.494e+00 2.827e-01 1.420 0.15574 ## karno -3.282e-02 9.677e-01 5.508e-03 -5.958 2.55e-09 *** ## diagtime 8.132e-05 1.000e+00 9.136e-03 0.009 0.99290 ## age -8.706e-03 9.913e-01 9.300e-03 -0.936 0.34920 ## priorYes 7.159e-02 1.074e+00 2.323e-01 0.308 0.75794 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## exp(coef) exp(-coef) lower .95 upper .95 ## trttest 1.3426 0.7448 0.8939 2.0166 ## celltypesmallcell 2.3669 0.4225 1.3799 4.0597 ## celltypeadeno 3.3071 0.3024 1.8336 5.9647 ## celltypelarge 1.4938 0.6695 0.8583 2.5996 ## karno 0.9677 1.0334 0.9573 0.9782 ## diagtime 1.0001 0.9999 0.9823 1.0182 ## age 0.9913 1.0087 0.9734 1.0096 ## priorYes 1.0742 0.9309 0.6813 1.6937 ## ## Concordance= 0.736 (se = 0.03 ) ## Rsquare= 0.364 (max possible= 0.999 ) ## Likelihood ratio test= 62.1 on 8 df, p=1.799e-10 ## Wald test = 62.37 on 8 df, p=1.596e-10 ## Score (logrank) test = 66.74 on 8 df, p=2.186e-11
Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazard’s model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette  that accompanies the
survival package Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, time-dependent so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.
Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the
survConcordance() function in the
survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”
To demonstrate using the
survival package, along with
ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”
The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.
## Call: ## aareg(formula = Surv(time, status) ~ trt + celltype + karno + ## diagtime + age + prior, data = vet) ## ## n= 137 ## 75 out of 97 unique event times used ## ## slope coef se(coef) z p ## Intercept 0.083400 3.81e-02 1.09e-02 3.490 4.79e-04 ## trttest 0.006730 2.49e-03 2.58e-03 0.967 3.34e-01 ## celltypesmallcell 0.015000 7.30e-03 3.38e-03 2.160 3.09e-02 ## celltypeadeno 0.018400 1.03e-02 4.20e-03 2.450 1.42e-02 ## celltypelarge -0.001090 -6.21e-04 2.71e-03 -0.229 8.19e-01 ## karno -0.001180 -4.37e-04 8.77e-05 -4.980 6.28e-07 ## diagtime -0.000243 -4.92e-05 1.64e-04 -0.300 7.65e-01 ## age -0.000246 -6.27e-05 1.28e-04 -0.491 6.23e-01 ## priorYes 0.003300 1.54e-03 2.86e-03 0.539 5.90e-01 ## ## Chisq=41.62 on 8 df, p=1.6e-06; test weights=aalen
#summary(aa_fit) # provides a more complete summary of results autoplot(aa_fit)
Random Forests Model
As a final example of what some might perceive as a data-science-like way to do time-to-event modeling, I’ll use the
ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the
rpart package, R’s basic tree-modeling package, along with Brian Ripley. See section 8.4 for the rpart vignette  that contains a survival analysis example.
ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.
# ranger model r_fit
The next block of code illustrates how
ranger() ranks variable importance.
## importance ## karno 0.0734 ## celltype 0.0338 ## diagtime 0.0003 ## trt -0.0007 ## prior -0.0011 ## age -0.0027
celltype as the two most important; the same variables with the smallest p-values in the Cox model. Also note that the importance results just give variable names and not level names. This is because
ranger and other tree models do not usually create dummy variables.
ranger() does compute Harrell’s c-index (See  p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.
cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error)
## Prediction Error = 1 - Harrell's c-index = 0.308269
An ROC value of .68 would normally be pretty good for a first try. But note that the
ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper , Hamad observes:
However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.
I believe that the major use for tree-based models for survival data will be to deal with very large data sets.
Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph.The following code pulls out the survival data from the three model objects and puts them into a data frame for
# Set up for ggplot kmi
For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the
ranger() model to do better.
This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The
ranger package, which suggests the
survival package, and
ggfortify, which depends on
ggplot2 and also suggests the
survival package, illustrate how open-source code allows developers to build on the work of their predecessors. The documentation that accompanies the
survival package, the numerous online resources, and the statistics such as concordance and Harrell’s c-index packed into the objects produced by fitting the models gives some idea of the statistical depth that underlies almost everything R.
Some Tutorials and Papers
For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein .
See the 1995 paper  by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.
For convenience, I have collected the references used throughout the post here.
 Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11
 Andersen, P.K., Keiding, N. (1998) Survival analysis Encyclopedia of Biostatistics 6. Wiley, pp. 4452-4461  Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, pp. 457–481, 562–563.  Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, pp. 187–220.
 Diez, David. Survival Analysis in R, OpenIntro
 Klein, John P and Moeschberger, Melvin L. Survival Analysis Techniques for Censored and Truncated Data, Springer. (1997)
 Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R, JSS Vol 77, Issue 1.
 Harrell, Frank, Lee, Kerry & Mark, Daniel. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15 (1996), pp. 361-387  Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R
 NUS Course Notes. Chapter 3 The Cox Proportional Hazards Model
 Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen’s Additive Regression Model  Therneau et al. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model
 Kleinbaum, D.G. and Klein, M. Survival Analysis, A Self Learning Text Springer (2005)  Therneau, T and Atkinson, E. An Introduction to Recursive Partitioning Using RPART Routines
 Intrator, O. and Kooperberg, C. Trees and splines in survival analysis Statistical Methods in Medical Research (1995)
 Bou-Hamad, I. A review of survival trees Statistics Surveys Vol.5 (2011)
Authors’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Portional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…
Source:: R News