Community corner: Cross-validation in Stata
Evaluating the out-of-sample properties of statistical models is important, especially for predictive modeling and analytics. Steven Brownell and Billy Buchanan’s crossvalidate package makes it easy. It contains xv, an extensible prefix command implementing cross-validation for Stata estimation commands.
The xv and xvloo prefixes split your sample, fit your model to the training sample, predict outcomes on the validation or test sample, and compute metrics related to fit, all in one command.
For example, use an 80/20 split to evaluate the mean squared error for a linear regression model:
. xv .8, metric(mse): reg price mpg i.foreign
Or use a 60/20/20 split with four folds to evaluate accuracy for a logistic regression model:
. xv .6 .2, metric(acc) kfold(4): logit low age lwt i.race smoke ptl ht ui
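For a sense of what the prefix automates, the 80/20 regression example above corresponds roughly to the manual steps below. This is only a sketch in plain Stata, not how xv is implemented. It assumes the variables come from the auto dataset shipped with Stata (the text does not name the dataset), and the seed and the variable names u, train, yhat, and sqerr are made up for illustration.

```stata
* Manual 80/20 holdout sketch (assumption: the example uses the auto dataset)
sysuse auto, clear
set seed 12345                        // arbitrary seed so the split is reproducible

* Assign each observation to the training sample with probability 0.8,
* so the realized split is only approximately 80/20
generate double u = runiform()
generate byte train = u < 0.8

* Fit the model to the training sample only
regress price mpg i.foreign if train

* Predict the outcome for the held-out observations
predict double yhat if !train

* Mean squared error on the held-out observations
generate double sqerr = (price - yhat)^2 if !train
summarize sqerr, meanonly
display "Validation MSE = " r(mean)
```

Because the split is random, the value displayed here should not be expected to match what xv reports for the same model.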
Use one of more than 40 built-in metrics or create your own. You can install these prefix commands and learn more about them and the built-in metrics by typing
. ssc install crossvalidate2
. help crossvalidate2
. help libxv##classification
To learn more about the crossvalidate package and all of its options, take a look at Billy’s GitHub page and Steven’s talk from the 2024 Stata Conference.