Title: Saving one or more parts of a dataset
Authors: Paul Seed, Wolfson Institute of Preventive Medicine, London; Nicholas J. Cox, Durham University, UK; Jean Marie Linhart, StataCorp
The save command does not allow specification either of a varlist, which would be used to specify a subset of variables, or of if or in conditions, which would be used to specify a subset of observations.
We assume the main dataset has previously been saved to a Stata data file in binary format (a .dta file). If not, you should save the data first:
. save main
The first way to save part of a large dataset is to use keep or drop first.
. save part
. use main
and repeat for a different part of the data.
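As a minimal sketch of this first approach, the following uses the auto dataset shipped with Stata as a stand-in for the main dataset. The choice of variables and the condition on foreign are illustrative assumptions only, and the replace option is added so that the example can be rerun.

```stata
* Stand-in for the main dataset (assumption: auto.dta as shipped with Stata)
sysuse auto, clear
save main, replace

* Keep a subset of observations and of variables, then save that part
keep if foreign == 1
keep make price mpg
save part, replace

* Reload the full dataset before extracting the next part
use main, clear
```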
. ssc describe savesome
. ssc install savesome
For alternatives to ssc, see help search.
    use main
    preserve
    foreach i of num 1/7 {
        keep if group == `i'
        save group`i'
        restore, preserve
    }

    use main
    preserve
    forval i = 1/7 {
        keep if group == `i'
        save group`i'
        restore, preserve
    }
While save has these limitations, use does not. You can, in fact, split a large dataset without ever loading it in its entirety.
Suppose again that you wanted to divide a dataset into 7 part datasets depending on the values 1 to 7 of a classifying variable group. Here are two other ways of doing that:
    foreach i of num 1/7 {
        use main if group == `i', clear
        save group`i'
    }

    forval i = 1/7 {
        use main if group == `i', clear
        save group`i'
    }
This approach can be adapted to other similar problems. In particular, you can also specify a varlist with use.
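A short sketch of combining a varlist with an if condition in use is given here. It again assumes the auto dataset shipped with Stata as a stand-in for main; the variable names are illustrative. Note that when a varlist is given, the filename must follow the keyword using.

```stata
* Stand-in for the main dataset (assumption: auto.dta as shipped with Stata)
sysuse auto, clear
save main, replace

* Load only four variables, and only the observations for foreign cars
use make price mpg foreign using main if foreign == 1, clear
save part, replace
```

Only the requested variables and observations are read into memory, so the full dataset is never loaded.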
It is natural to wonder which method is faster. This question is, however, difficult to answer because it depends on the size of a dataset, how much memory you have available, whether you are working over a network, the platform you are on, and so forth.
With method 1 (the preserve and restore loop), the main dataset may stay in memory rather than being written to disk each time, if the operating system is smart enough to do that and enough memory is available. But as far as Stata is concerned, it is written to disk. Method 2 (use with an if condition) has the data on disk and requires disk access.
That said, various experiments with Stata for Linux, for Macintosh, and for Windows indicate that method 2 is generally faster.
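If you want to check this on your own system, a rough comparison can be sketched with the timer command. The test dataset below is made up purely for illustration (700,000 observations in 7 equal groups), and the replace option is added so that the example can be rerun; results will vary with the factors listed above.

```stata
* Hypothetical test data: 700,000 observations in 7 equal groups
clear
set obs 700000
generate byte group = 1 + mod(_n - 1, 7)
generate double x = runiform()
save main, replace

timer clear

* Method 1: keep each group in turn, restoring the preserved data afterwards
timer on 1
use main, clear
preserve
forvalues i = 1/7 {
    keep if group == `i'
    save group`i', replace
    restore, preserve
}
restore
timer off 1

* Method 2: read each group directly from disk with use ... if
timer on 2
forvalues i = 1/7 {
    use main if group == `i', clear
    save group`i', replace
}
timer off 2

* Report elapsed seconds for timers 1 and 2
timer list
```

A single run is only indicative; repeating the comparison several times, with a dataset of realistic size, gives a fairer picture.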