Home  /  Products  /  Stata 18  /  Lasso for Cox model

<- See Stata 18's new features

Highlights

  • Estimation: lasso and elastic net

  • Left-truncation and right-censoring

  • Selection methods

    • Cross-validation

    • Adaptive lasso

    • BIC

    • User specified

  • Penalized and postselection predictions

  • Plots of survivor and other functions

  • See more lasso features

If you have time-to-event data, also known as survival-time or failure-time data, and many predictors, check out the lasso cox and elasticnet cox commands. (And when we say many predictors, we mean hundreds, thousands, or more!) New in Stata 18, these commands expand the existing lasso suite for prediction and model selection to include a high-dimensional semiparametric Cox proportional hazards model.

After lasso cox and elasticnet cox, you can use stcurve to plot the survivor, failure, hazard, or cumulative hazard function or use any of the other postestimation tools available after lasso and elasticnet.

Let's see it work

We illustrate lasso cox with an example that predicts the risk of death for stage I lung adenocarcinoma patients. Lung adenocarcinoma is one of the most common non-small cell lung cancers.

Stage I adenocarcinoma indicates that the tumor size is relatively small and the cancer has not spread to other distant organs. Stage I adenocarcinoma patients have varied survival outcomes even though they are in the early cancer-development stage. For example, Yu et al. (2016) show that, in one cohort, more than 50% of stage I adenocarcinoma patients died within 5 years after the initial diagnosis, while about 15% of the patients survived for more than 10 years.

Histopathology image features are used for prognostic analysis. We can use lasso cox to extract the top histopathology image features that distinguish short-term and long-term survivors.

We have a fictional survival dataset (lungcancer.dta) inspired by Yu et al. (2016). The variable t records either the time of death or censoring in months for stage I adenocarcinoma lung cancer patients. The indicator variable died is 1 or 0 if the patient died or is censored, respectively. There are 500 histopathology image features, histfeature1 to hisfeature500, and only 250 patients. The analysis aims to classify a new patient into a low-risk or high-risk group, given the histopathology image features.

We first load the dataset and then type stset to show that it has already been stset.

. webuse lungcancer
(Fictitious data on stage I adenocarcinoma lung cancer)

. stset
-> stset t, failure(died)

Survival-time data settings

         Failure event: died!=0 & died<.
Observed time interval: (0, t]
     Exit on or before: failure

250 total observations
0 exclusions
250 observations remaining, representing
211 failures in single-record/single-failure data
18,465.093 total analysis time at risk and under observation
At risk from t = 0
Earliest observed entry t = 0
Last observed exit t = 260

Next, we split the full sample into training and testing data. The training data will be used for estimation, and the testing data will be used to measure the prediction performance.

We use splitsample to split the data into two parts. We use the generate(group) option to create a new variable, group, that equals 1 if it belongs to the training data or 0 to the testing data. The split(0.6 0.4) option specifies that 60% of the entire data are used as training data and 40% as testing data. To make the results reproducible, we specify the rseed() option.

. splitsample, generate(group) split(0.6 0.4) rseed(12345)

For later use, we save the training data as lungcancer_training.dta and the testing data as lungcancer_testing.dta.

. preserve

. keep if group == 1
(100 observations deleted)

. save lungcancer_training, replace
file lungcancer_training.dta not found
file lungcancer_training.dta saved

. restore

. preserve

. keep if group == 2
(150 observations deleted)

. save lungcancer_testing, replace
file lungcancer_testing.dta not found
file lungcancer_testing.dta saved

. restore

We now fit a lasso cox model using only the training data. By default, we use cross-validation. We specify rseed() for reproducibility.

. use lungcancer_training, clear
(Fictitious data on stage I adenocarcinoma lung cancer)

. lasso cox histfeature*, rseed(12345671)

        Failure _d: died
  Analysis time _t: t

10-fold cross-validation with 100 lambdas ...
(output omitted)
... cross-validation complete ... minimum found

Lasso Cox model                             No. of obs        =        150
                                            No. of covariates =        500
Selection: Cross-validation                 No. of CV folds   =         10

No. of
nonzero In-sample CV mean
ID Description lambda coef. dev. ratio deviance
1 first lambda .3539123 0 0.0000 8.922501
30 lambda before .0918411 45 0.2199 8.042941
* 31 selected lambda .0876668 48 0.2306 8.039609
32 lambda after .0836822 52 0.2419 8.05246
34 last lambda .0762481 63 0.2662 8.105045
* lambda selected by cross-validation.

lasso cox selects 48 of the 500 features. We can now predict the penalized relative-hazard ratio (variable riskscore_training) and evaluate risk scores. We will use the median of riskscore_training as a threshold to classify a patient as low risk or high risk. We store the median value in a global macro (median) for later use.

. predict riskscore_training
(options hr penalized assumed; predicted hazard ratio with penalized
coefficients)

. summarize riskscore_training, detail

Predicted hazard ratio, penalized
Percentiles Smallest
1% .054982 .0414753
5% .0838301 .054982
10% .1308778 .0702972 Obs 150
25% .3676802 .0727958 Sum of wgt. 150
50% .9458244 Mean 1.998198
Largest Std. dev. 3.75226
75% 2.368032 9.962103
90% 4.912702 11.13334 Variance 14.07945
95% 6.651043 12.4411 Skewness 7.054249
99% 12.4411 39.40631 Kurtosis 67.68195
. global median = r(p50)

We now use the testing data to validate the model. First, we predict the penalized hazard ratio (variable riskscore_testing) in the testing sample. Then, we compare riskscore_testing with the median of the hazard ratio obtained in the training data ($median). The patient is labeled high risk if the predicted risk score is greater than or equal to the median. The patient is classified as low risk if the predicted risk score is less than the median.

. use lungcancer_testing, clear
(Fictitious data on stage I adenocarcinoma lung cancer)

. predict riskscore_testing
(options hr penalized assumed; predicted hazard ratio with penalized
coefficients)

. generate byte risk = (riskscore_testing >= $median)

. label define risk_lb 1 "High risk" 0 "Low risk"

. label values risk risk_lb

To evaluate the effectiveness of risk classification, we first look at the Kaplan–Meier plot, which plots the survival curve for both low-risk and high-risk groups.

. sts graph, by(risk)

The graph shows that the predicted high-risk patients have a more steeply falling survival curve than the predicted low-risk patients. To confirm this conjecture, we do a log-rank test.

. sts test risk

        Failure _d: died
  Analysis time _t: t

Equality of survivor functions
Log-rank test

Observed Expected
risk events events
Low risk 39 68.17
High risk 51 21.83
Total 90 90.00
chi2(1) = 61.50 Pr>chi2 = 0.0000

The log-rank test rejects the hypothesis that the predicted low-risk and high-risk patients have the same survival functions. Both the Kaplan–Meier plot and the log-rank test show that using the predicted hazard ratios' median can effectively distinguish a low-risk patient from a high-risk patient. We can now make prognostic predictions given new data.

The dataset (newlungcancer.dta) contains histopathology image features for some new stage I adenocarcinoma patients, and we do not observe their survival time yet because they are still alive. Based on the prediction model from lasso cox, we want to classify these new patients as low risk or high risk. To achieve this objective, we need to predict only the new patients' hazard ratios and compare them with the median level of risk score obtained in the training data.

. webuse newlungcancer, clear
(Fictitious new data on stage I adenocarcinoma lung cancer)

. predict riskscore_new
(options hr penalized assumed; predicted hazard ratio with penalized
coefficients)

. generate risk = (riskscore_new >= $median)

. label define risk_lb 1 "High risk" 0 "Low risk"

. label values risk risk_lb

. tabulate risk

risk Freq. Percent Cum.
Low risk 27 54.00 54.00
High risk 23 46.00 100.00
Total 50 100.00

The table of the predicted risk level shows that 27 patients are classified as low risk, while 23 patients are classified as high risk.

Reference

Yu, K., C. Zhang, G. J. Berry, R. B. Altman, C. Ré, D. L. Rubin, and M. Snyder. 2016. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nature Communications 7(12474).

Tell me more

Learn more about other new features in survival analysis.

Learn more about other lasso features in Lasso for prediction and model selection.

Read more in the Stata Lasso Reference Manual; see [LASSO] lasso and [LASSO] elasticnet.

View all the new features in Stata 18.

Made for data science.

Get started today.