Example 2: Load data from Python¶
There are many ways to load data from Python into Stata’s current dataset in memory. For example:
Pandas DataFrames and NumPy arrays can be loaded directly into Stata.
The Data and Frame classes within the Stata Function Interface (sfi) module provide multiple methods for loading data from Python.
Stata can read in data from a variety of sources, many of which can be created in Python: Excel files, CSV files, SPSS and SAS datasets, and various databases.
In this example, we will show you how to load a pandas DataFrame into Stata and then perform analyses in Stata, using the %%stata magic command.
We have data from the Second National Health and Nutrition Examination Survey (NHANES II; McDowell et al. 1981). NHANES II is part of a study to assess the health and nutritional status of adults and children in the United States. It is designed to be a nationally representative sample of the U.S. population. The data for this particular sample were collected between 1976 and 1980.
This dataset contains 10,351 observations and 58 variables, and is stored in a CSV file called nhanes2.csv. Among these variables, we have each individual’s age and systolic blood pressure (bpsystol). Age has been categorized into six groups: 20-29, 30-39, 40-49, 50-59, 60-69, and 70+. The goal is to read the data into a pandas DataFrame and load the DataFrame into Stata. We want to use Stata’s features to fit a regression model of bpsystol as a function of age group (agegrp) and gender. Then we want to see how the average predicted systolic blood pressure varies across individuals in each age group, and across males and females in each age group.
First, we use the pandas function read_csv() to read the data from the .csv file into a DataFrame named nhanes2.
[5]:
import pandas as pd
import io
import requests
data = requests.get("https://www.stata.com/python/pystata18/misc/nhanes2.csv").content
nhanes2 = pd.read_csv(io.StringIO(data.decode('utf-8')))
nhanes2
[5]:
sampl | strata | psu | region | smsa | location | houssiz | sex | race | age | ... | region4 | smsa1 | smsa2 | smsa3 | rural | loglead | agegrp | highlead | bmi | highbp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1400 | 1 | 1 | S | 2 | 1 | 4 | Male | White | 54 | ... | 0 | 0 | 1 | 0 | 0 | NaN | 50-59 | NaN | 20.495686 | 0 |
1 | 1401 | 1 | 1 | S | 2 | 1 | 6 | Female | White | 41 | ... | 0 | 0 | 1 | 0 | 0 | 2.564949 | 40-49 | lead<25 | 21.022337 | 0 |
2 | 1402 | 1 | 1 | S | 1 | 1 | 6 | Female | Other | 21 | ... | 0 | 1 | 0 | 0 | 0 | NaN | 20-29 | NaN | 24.973860 | 0 |
3 | 1404 | 1 | 1 | S | 2 | 1 | 9 | Female | White | 63 | ... | 0 | 0 | 1 | 0 | 0 | NaN | 60-69 | NaN | 35.728722 | 1 |
4 | 1405 | 1 | 1 | S | 1 | 1 | 3 | Female | White | 64 | ... | 0 | 1 | 0 | 0 | 0 | 2.995732 | 60-69 | lead<25 | 27.923803 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10346 | 48760 | 32 | 2 | MW | 4 | 48 | 5 | Female | White | 35 | ... | 0 | 0 | 0 | 1 | 1 | NaN | 30-39 | NaN | 20.355173 | 0 |
10347 | 48763 | 32 | 2 | MW | 4 | 48 | 2 | Female | White | 33 | ... | 0 | 0 | 0 | 1 | 1 | 1.945910 | 30-39 | lead<25 | 41.645557 | 1 |
10348 | 48764 | 32 | 2 | MW | 4 | 48 | 1 | Female | White | 60 | ... | 0 | 0 | 0 | 1 | 0 | NaN | 60-69 | NaN | 35.626114 | 0 |
10349 | 48768 | 32 | 2 | MW | 4 | 48 | 1 | Female | White | 29 | ... | 0 | 0 | 0 | 1 | 0 | NaN | 20-29 | NaN | 19.204464 | 0 |
10350 | 48770 | 32 | 2 | MW | 4 | 48 | 1 | Male | White | 31 | ... | 0 | 0 | 0 | 1 | 1 | NaN | 30-39 | NaN | 19.635565 | 0 |
10351 rows × 58 columns
Then, we load the DataFrame into Stata by specifying the -d argument of the %%stata magic. Here you could also use the API function pdataframe_to_data() of the stata module to load the pandas DataFrame into Stata. See Call Stata using API functions and Example 5 for more information.
Within Stata, we encode agegrp and sex and label the resulting variables, agegrp2 and sex2, along with bpsystol.
[6]:
%%stata -d nhanes2 -force
label define sex2 1 "Male" 2 "Female"
encode sex, generate(sex2) label(sex2)
label define agegrp 1 "20-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-69" 6 "70+"
encode agegrp, generate(agegrp2) label(agegrp)
label variable bpsystol "systolic blood pressure"
label variable agegrp2 "Age Group"
label variable sex2 "1=Male, 2=Female"
describe bpsystol agegrp2 sex2
. label define sex2 1 "Male" 2 "Female"
. encode sex, generate(sex2) label(sex2)
.
. label define agegrp 1 "20-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-69" 6 "70+"
. encode agegrp, generate(agegrp2) label(agegrp)
.
. label variable bpsystol "systolic blood pressure"
. label variable agegrp2 "Age Group"
. label variable sex2 "1=Male, 2=Female"
.
. describe bpsystol agegrp2 sex2
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
bpsystol long %12.0g systolic blood pressure
agegrp2 long %8.0g agegrp Age Group
sex2 long %8.0g sex2 1=Male, 2=Female
.
We then fit a regression model of systolic blood pressure on age group and sex. We also push Stata’s estimation results, including the coefficient vector e(b) and variance–covariance matrix e(V), into a Python dictionary called steret.
[7]:
%%stata -eret steret
// fit a regression model
regress bpsystol agegrp2##sex2
// e() results
ereturn list
. // fit a regression model
. regress bpsystol agegrp2##sex2
Source | SS df MS Number of obs = 10,351
-------------+---------------------------------- F(11, 10339) = 312.88
Model | 1407229.28 11 127929.935 Prob > F = 0.0000
Residual | 4227440.75 10,339 408.882943 R-squared = 0.2497
-------------+---------------------------------- Adj R-squared = 0.2489
Total | 5634670.03 10,350 544.412563 Root MSE = 20.221
------------------------------------------------------------------------------
bpsystol | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
agegrp2 |
30-39 | .7956175 .9473117 0.84 0.401 -1.061297 2.652532
40-49 | 5.117078 1.018176 5.03 0.000 3.121256 7.1129
50-59 | 12.20018 1.022541 11.93 0.000 10.1958 14.20456
60-69 | 16.85887 .8155092 20.67 0.000 15.26031 18.45742
70+ | 22.50889 1.130959 19.90 0.000 20.29199 24.72579
|
sex2 |
Female | -12.60132 .8402299 -15.00 0.000 -14.24833 -10.9543
|
agegrp2#sex2 |
30-39 #|
Female | 4.140156 1.31031 3.16 0.002 1.571695 6.708617
40-49 #|
Female | 8.644866 1.412067 6.12 0.000 5.876941 11.41279
50-59 #|
Female | 11.83134 1.406641 8.41 0.000 9.074051 14.58863
60-69 #|
Female | 14.093 1.130882 12.46 0.000 11.87625 16.30975
70+#Female | 15.86608 1.542296 10.29 0.000 12.84288 18.88928
|
_cons | 123.8862 .6052954 204.67 0.000 122.6997 125.0727
------------------------------------------------------------------------------
.
. // e() results
. ereturn list
scalars:
e(N) = 10351
e(df_m) = 11
e(df_r) = 10339
e(F) = 312.8766723590026
e(r2) = .2497447540773559
e(rmse) = 20.22085415724865
e(mss) = 1407229.279971501
e(rss) = 4227440.746112915
e(r2_a) = .2489465330013186
e(ll) = -45803.93060947768
e(ll_0) = -47291.06810807488
e(rank) = 12
macros:
e(cmdline) : "regress bpsystol agegrp2##sex2"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "bpsystol"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 21
e(V) : 21 x 21
e(beta) : 1 x 20
functions:
e(sample)
.
Below, we list the entire contents of the Python dictionary steret.
[8]:
steret
[8]:
{'e(N)': 10351.0,
'e(df_m)': 11.0,
'e(df_r)': 10339.0,
'e(F)': 312.8766723590026,
'e(r2)': 0.24974475407735586,
'e(rmse)': 20.220854157248645,
'e(mss)': 1407229.2799715009,
'e(rss)': 4227440.746112915,
'e(r2_a)': 0.24894653300131864,
'e(ll)': -45803.930609477684,
'e(ll_0)': -47291.06810807488,
'e(rank)': 12.0,
'e(cmdline)': 'regress bpsystol agegrp2##sex2',
'e(title)': 'Linear regression',
'e(marginsprop)': 'minus',
'e(marginsok)': 'XB default',
'e(vce)': 'ols',
'e(_r_z_abs__CL)': '|t|',
'e(_r_z__CL)': 't',
'e(depvar)': 'bpsystol',
'e(cmd)': 'regress',
'e(properties)': 'b V',
'e(predict)': 'regres_p',
'e(model)': 'ols',
'e(estat_cmd)': 'regress_estat',
'e(b)': array([[ 0. , 0.79561746, 5.11707797, 12.20017802,
16.85886868, 22.50888857, 0. , -12.601317 ,
0. , 0. , 0. , 4.14015609,
0. , 8.6448661 , 0. , 11.83133884,
0. , 14.09300146, 0. , 15.86607901,
123.88620072]]),
'e(V)': array([[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , 0.89739937, 0.36638257, 0.36638257, 0.36638257,
0.36638257, 0. , 0.36638257, 0. , 0. ,
0. , -0.89739937, 0. , -0.36638257, 0. ,
-0.36638257, 0. , -0.36638257, 0. , -0.36638257,
-0.36638257],
[ 0. , 0.36638257, 1.03668247, 0.36638257, 0.36638257,
0.36638257, 0. , 0.36638257, 0. , 0. ,
0. , -0.36638257, 0. , -1.03668247, 0. ,
-0.36638257, 0. , -0.36638257, 0. , -0.36638257,
-0.36638257],
[ 0. , 0.36638257, 0.36638257, 1.04559011, 0.36638257,
0.36638257, 0. , 0.36638257, 0. , 0. ,
0. , -0.36638257, 0. , -0.36638257, 0. ,
-1.04559011, 0. , -0.36638257, 0. , -0.36638257,
-0.36638257],
[ 0. , 0.36638257, 0.36638257, 0.36638257, 0.66505528,
0.36638257, 0. , 0.36638257, 0. , 0. ,
0. , -0.36638257, 0. , -0.36638257, 0. ,
-0.36638257, 0. , -0.66505528, 0. , -0.36638257,
-0.36638257],
[ 0. , 0.36638257, 0.36638257, 0.36638257, 0.36638257,
1.27906771, 0. , 0.36638257, 0. , 0. ,
0. , -0.36638257, 0. , -0.36638257, 0. ,
-0.36638257, 0. , -0.36638257, 0. , -1.27906771,
-0.36638257],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , 0.36638257, 0.36638257, 0.36638257, 0.36638257,
0.36638257, 0. , 0.70598634, 0. , 0. ,
0. , -0.70598634, 0. , -0.70598634, 0. ,
-0.70598634, 0. , -0.70598634, 0. , -0.70598634,
-0.36638257],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , -0.89739937, -0.36638257, -0.36638257, -0.36638257,
-0.36638257, 0. , -0.70598634, 0. , 0. ,
0. , 1.7169127 , 0. , 0.70598634, 0. ,
0.70598634, 0. , 0.70598634, 0. , 0.70598634,
0.36638257],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , -0.36638257, -1.03668247, -0.36638257, -0.36638257,
-0.36638257, 0. , -0.70598634, 0. , 0. ,
0. , 0.70598634, 0. , 1.99393419, 0. ,
0.70598634, 0. , 0.70598634, 0. , 0.70598634,
0.36638257],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , -0.36638257, -0.36638257, -1.04559011, -0.36638257,
-0.36638257, 0. , -0.70598634, 0. , 0. ,
0. , 0.70598634, 0. , 0.70598634, 0. ,
1.97863792, 0. , 0.70598634, 0. , 0.70598634,
0.36638257],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , -0.36638257, -0.36638257, -0.36638257, -0.66505528,
-0.36638257, 0. , -0.70598634, 0. , 0. ,
0. , 0.70598634, 0. , 0.70598634, 0. ,
0.70598634, 0. , 1.27889308, 0. , 0.70598634,
0.36638257],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , -0.36638257, -0.36638257, -0.36638257, -0.36638257,
-1.27906771, 0. , -0.70598634, 0. , 0. ,
0. , 0.70598634, 0. , 0.70598634, 0. ,
0.70598634, 0. , 0.70598634, 0. , 2.37867695,
0.36638257],
[-0. , -0.36638257, -0.36638257, -0.36638257, -0.36638257,
-0.36638257, -0. , -0.36638257, -0. , -0. ,
-0. , 0.36638257, -0. , 0.36638257, -0. ,
0.36638257, -0. , 0.36638257, -0. , 0.36638257,
0.36638257]]),
'e(beta)': array([[ 0. , 0.01239614, 0.07200443, 0.17276986, 0.32311413,
0.2832181 , 0. , -0.26970688, 0. , 0. ,
0. , 0.0487697 , 0. , 0.090657 , 0. ,
0.12640132, 0. , 0.21209664, 0. , 0.15095109]])}
You can also access specific elements of the dictionary. For example, you can access e(b) and e(V) by typing steret[‘e(b)’] and steret[‘e(V)’], respectively, in Python.
[9]:
steret['e(b)']
[9]:
array([[ 0. , 0.79561746, 5.11707797, 12.20017802,
16.85886868, 22.50888857, 0. , -12.601317 ,
0. , 0. , 0. , 4.14015609,
0. , 8.6448661 , 0. , 11.83133884,
0. , 14.09300146, 0. , 15.86607901,
123.88620072]])
Next, we use the margins command to estimate the average predicted systolic blood pressure for each age group, and we graph the results using marginsplot.
[10]:
%%stata
margins agegrp2
Predictive margins Number of obs = 10,351
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
agegrp2 |
20-29 | 117.2684 .419845 279.31 0.000 116.4454 118.0914
30-39 | 120.2383 .5020813 239.48 0.000 119.2541 121.2225
40-49 | 126.9255 .56699 223.86 0.000 125.8141 128.0369
50-59 | 135.682 .5628593 241.06 0.000 134.5787 136.7853
60-69 | 141.5285 .3781197 374.30 0.000 140.7873 142.2696
70+ | 148.1096 .6445073 229.80 0.000 146.8463 149.373
------------------------------------------------------------------------------
[11]:
%%stata
marginsplot
Variables that uniquely identify margins: agegrp2
Now, we estimate those expected values separately for males and females in each age group, and we graph the results.
[12]:
%%stata
margins agegrp2#sex2
marginsplot
. margins agegrp2#sex2
Adjusted predictions Number of obs = 10,351
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
agegrp2#sex2 |
20-29#Male | 123.8862 .6052954 204.67 0.000 122.6997 125.0727
20-29 #|
Female | 111.2849 .5827553 190.96 0.000 110.1426 112.4272
30-39#Male | 124.6818 .728709 171.10 0.000 123.2534 126.1102
30-39 #|
Female | 116.2207 .692755 167.77 0.000 114.8627 117.5786
40-49#Male | 129.0033 .8187185 157.57 0.000 127.3984 130.6081
40-49 #|
Female | 125.0468 .7859058 159.11 0.000 123.5063 126.5874
50-59#Male | 136.0864 .8241405 165.13 0.000 134.4709 137.7019
50-59 #|
Female | 135.3164 .7703532 175.66 0.000 133.8064 136.8264
60-69#Male | 140.7451 .5465096 257.53 0.000 139.6738 141.8163
60-69 #|
Female | 142.2368 .5236736 271.61 0.000 141.2103 143.2633
70+#Male | 146.3951 .9553456 153.24 0.000 144.5224 148.2678
70+#Female | 149.6599 .8717829 171.67 0.000 147.951 151.3687
------------------------------------------------------------------------------
. marginsplot
Variables that uniquely identify margins: agegrp2 sex2
.