Sometimes, we want to use only part of a dataset. We might wish to use a subset of variables, a subset of observations, or both. First, we copy the example dataset to our current working directory by typing
. copy https://www.stata-press.com/data/r18/nhanes2l.dta nhanes2l.dta
Next, we can type describe using to view information about the contents of the dataset without opening the datafile.
. describe using nhanes2l.dta

Contains data                                 Second National Health and Nutrition Examination Survey
 Observations:        10,351                  23 Mar 2023 10:43
    Variables:            42
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
sampl           long    %9.0g                 Unique case identifier
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit
region          byte    %9.0g      region     Region
smsa            byte    %22.0g     smsalbl    SMSA type
location        byte    %9.0g                 Location (stand office ID)
houssiz         byte    %9.0g                 Number of people in household
sex             byte    %9.0g      sex        Sex
race            byte    %9.0g      race       Race
age             byte    %9.0g                 Age (years)
height          float   %9.0g                 Height (cm)
weight          float   %9.0g                 Weight (kg)
bpsystol        int     %9.0g                 Systolic blood pressure
bpdiast         int     %9.0g                 Diastolic blood pressure
tcresult        int     %9.0g                 Serum cholesterol (mg/dL)
tgresult        int     %9.0g                 Serum triglycerides (mg/dL)
hdresult        int     %9.0g                 High-density lipids (mg/dL)
hgb             float   %9.0g                 Hemoglobin (g/dL)
hct             float   %9.0g                 Hematocrit (%)
tibc            int     %9.0g                 Total iron bind. cap. (mcg/dL)
iron            int     %9.0g                 Serum iron (mcg/dL)
hlthstat        byte    %20.0g     hlth       Health status
heartatk        byte    %16.0g     heartlbl   Prior heart attack
diabetes        byte    %12.0g     diabetes   Diabetes status
sizplace        byte    %39.0g     size       Size of place
finalwgt        long    %9.0g                 Sampling weight (except lead)
leadwt          long    %9.0g                 Sampling weight for lead
corpuscl        float   %9.0g                 Mean corpuscular volume (fL)
trnsfern        float   %9.0g                 Transferrin saturation (%)
albumin         float   %9.0g                 Serum albumin (g/dL)
vitaminc        float   %9.0g                 Serum vitamin C (mg/dL)
zinc            int     %9.0g                 Serum zinc (mcg/dL)
copper          int     %9.0g                 Serum copper (mcg/dL)
porphyrn        int     %9.0g                 Erythrocyte porphyrin (mcg/dl)
lead            byte    %9.0g                 Lead (mcg/dL)
hsizgp          byte    %8.0g                 # in household or 5 if #>=5
rural           byte    %8.0g      rurallbl   Rural
loglead         float   %9.0g                 log(lead)
agegrp          byte    %8.0g      agegrp     Age group
highlead        byte    %10.0g     highlead   High lead level
bmi             float   %9.0g                 Body mass index (BMI)
highbp          byte    %8.0g                 High blood pressure
-------------------------------------------------------------------------------
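When a file has many variables, the full listing can be more than we need. As a minimal sketch, assuming nhanes2l.dta has already been copied to the current directory as above, describe using can be limited to the header information or to a few variables of interest. The comments are written in do-file style.

```stata
// Assumes nhanes2l.dta is in the current working directory.

// Header only: number of observations, number of variables, and date
describe using nhanes2l.dta, short

// Describe only selected variables, still without loading the data
describe diabetes agegrp bmi using nhanes2l.dta
```

Neither command changes the data currently in memory.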
The dataset contains 10,351 observations and 42 variables. Let's say we are interested only in the variables diabetes, agegrp, and bmi. We can include those variable names in our use command, and Stata will load only those variables into memory.
. use diabetes agegrp bmi using nhanes2l
(Second National Health and Nutrition Examination Survey)
We can type describe to view the contents of the data in memory.
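. describe

Contains data from nhanes2l.dta
 Observations:        10,351                  Second National Health and Nutrition Examination Survey
    Variables:             3                  23 Mar 2023 10:43
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
diabetes        byte    %12.0g     diabetes   Diabetes status
agegrp          byte    %8.0g      agegrp     Age group
bmi             float   %9.0g                 Body mass index (BMI)
-------------------------------------------------------------------------------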
There are 10,351 observations for the variables we requested: diabetes, agegrp, and bmi. Note that the other variables are still present in the dataset in the file, but they are not loaded into Stata's memory.
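One practical detail when loading subsets repeatedly: if the data in memory have unsaved changes, use stops with an error rather than overwrite them. The sketch below adds the clear option to discard whatever is in memory before loading. It assumes that nothing in memory needs to be kept.

```stata
// Assumes nhanes2l.dta is in the current working directory and that
// any data currently in memory can be discarded.
use diabetes agegrp bmi using nhanes2l, clear

// Quick check of how many observations and variables were loaded
describe, short
```

The examples in this article work without clear because the data in memory are never modified between use commands.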
We can also use a subset of observations from the dataset. Perhaps we want to use only the first 1,000 observations in the dataset. We could do this with the in qualifier.
. use diabetes agegrp bmi using nhanes2l in 1/1000
(Second National Health and Nutrition Examination Survey)
We can type describe and see that the dataset in memory includes 1,000 observations for the variables diabetes, agegrp, and bmi.
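. describe

Contains data from nhanes2l.dta
 Observations:         1,000                  Second National Health and Nutrition Examination Survey
    Variables:             3                  23 Mar 2023 10:43
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
diabetes        byte    %12.0g     diabetes   Diabetes status
agegrp          byte    %8.0g      agegrp     Age group
bmi             float   %9.0g                 Body mass index (BMI)
-------------------------------------------------------------------------------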
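The range given after in does not have to start at the first observation. As a small sketch under the same assumptions (the file is in the current directory and the data in memory can be discarded), the following loads observations 1,001 through 2,000 instead.

```stata
// Load the second block of 1,000 observations for the same three variables
use diabetes agegrp bmi using nhanes2l in 1001/2000, clear

// count reports the number of observations in memory (1,000 for this range)
count
```

Observation numbers refer to the order in which the observations are stored in the file, so a range like this is only meaningful if that order is.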
Sometimes, we may wish to restrict the observations based on the values of a variable in the dataset. For example, we may be interested in observations from the Northeastern region of the United States. We can begin by loading only the variable region.
. use region using nhanes2l.dta
(Second National Health and Nutrition Examination Survey)
Next we can tabulate the variable region with and without the value labels.
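. tabulate region

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |      2,096       20.25       20.25
         MW |      2,774       26.80       47.05
          S |      2,853       27.56       74.61
          W |      2,628       25.39      100.00
------------+-----------------------------------
      Total |     10,351      100.00

. tabulate region, nolabel

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,096       20.25       20.25
          2 |      2,774       26.80       47.05
          3 |      2,853       27.56       74.61
          4 |      2,628       25.39      100.00
------------+-----------------------------------
      Total |     10,351      100.00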
The Northeastern region of the United States (NE) corresponds to region==1. So, we can open the dataset using only the observations for region 1 by adding the qualifier if region==1.
. use region diabetes agegrp bmi using nhanes2l if region==1
(Second National Health and Nutrition Examination Survey)
We can type tabulate region to verify that the dataset in memory includes only observations from region 1.
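Besides tabulating, the restriction can be checked programmatically, which is convenient in a do-file. The following is a minimal sketch, assuming the file is in the current directory and the data in memory can be discarded. It uses assert, which produces no output when the condition holds for every observation and stops with an error otherwise.

```stata
// Reload the Northeastern subset, discarding the data in memory
use region diabetes agegrp bmi using nhanes2l if region==1, clear

// Stops with an error if any observation is from another region
assert region == 1

// Should match the 2,096 observations reported for region 1
// in the frequency table for the full dataset
count
```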
Don't forget that the dataset in the file still contains all the original data. But the dataset in Stata's memory includes only the variables and observations we specified with our use command. If you save the dataset in memory, you will save only the variables and observations in memory, and you will lose all other data in the original datafile. Be sure to save your partial dataset with a new name to avoid losing data.
. save nhanes2l_partial.dta
file nhanes2l_partial.dta saved
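To confirm that the original file is untouched, we can compare the two files on disk without loading either one. This is a sketch that assumes both files are in the current working directory.

```stata
// The partial file should list only the variables and observations we kept
describe using nhanes2l_partial.dta, short

// The original file should still report 10,351 observations and 42 variables
describe using nhanes2l.dta, short
```

If the partial file is saved a second time, for example when rerunning a do-file, save needs the replace option (save nhanes2l_partial.dta, replace) because the file already exists.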
Read more in the Stata Data Management Reference Manual; see [D] copy, [D] use, [D] describe, and [D] save. In the Stata Base Reference Manual, see [R] tabulate oneway.