Example 5: Call Stata using API functions¶
In the previous examples, we saw how to run Stata commands by using the %%stata magic command. We also saw how it can be used to pass datasets and results between Stata and Python. The magic command is easy and convenient to use, but it will only work in an IPython environment.
In this example, we will demonstrate how to call Stata using the API functions within the config and stata modules. The API functions can be used in both IPython and non-IPython environments. Within an IPython environment, you can use the magic commands and API functions together, making the interaction between Python and Stata more flexible.
The first part of this example will show you how to interact with Stata by using just the API functions. The second part will demonstrate how to use the API functions in combination with the magic commands.
For illustrative purposes, we will use the Boston housing price dataset preloaded in the scikit-learn package. This dataset was collected in 1978 and contains 506 observations on 14 features for homes from various suburbs in Boston, Massachusetts.
Access Stata using API functions¶
First, we load the Boston housing price dataset from the scikit-learn package and describe the dataset.
[33]:
from sklearn import datasets
bos = datasets.load_boston()
print(bos.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
We store it into a pandas DataFrame named boston.
[34]:
import pandas as pd
boston = pd.DataFrame(bos.data)
boston.columns = bos.feature_names
boston['MEDV'] = bos.target
boston.head()
[34]:
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Then, we load the pandas DataFrame into Stata by using the pdataframe_to_data() function of the stata module.
[35]:
from pystata import stata
stata.pdataframe_to_data(boston, force=True)
We use the run() function to quickly summarize the data in Stata.
[36]:
stata.run('summarize')
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
CRIM | 506 3.613524 8.601545 .00632 88.9762
ZN | 506 11.36364 23.32245 0 100
INDUS | 506 11.13678 6.860353 .46 27.74
CHAS | 506 .06917 .253994 0 1
NOX | 506 .5546951 .1158777 .385 .871
-------------+---------------------------------------------------------
RM | 506 6.284634 .7026171 3.561 8.78
AGE | 506 68.5749 28.14886 2.9 100
DIS | 506 3.795043 2.10571 1.1296 12.1265
RAD | 506 9.549407 8.707259 1 24
TAX | 506 408.2372 168.5371 187 711
-------------+---------------------------------------------------------
PTRATIO | 506 18.45553 2.164946 12.6 22
B | 506 356.674 91.29486 .32 396.9
LSTAT | 506 12.65306 7.141062 1.73 37.97
MEDV | 506 22.53281 9.197104 5 50
Then, we specify the variable labels and attach a value label to CHAS, which indicates whether the tract bounds the Charles River.
[37]:
stata.run('''
label variable MEDV "Median value of owner-occupied homes in $1000s"
label variable CRIM "Town crime rate, per capita"
label variable RM "Average number of rooms per dwelling"
label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)"
label define bound 1 "Yes" 0 "No", replace
label values CHAS bound
''')
.
. label variable MEDV "Median value of owner-occupied homes in $1000s"
. label variable CRIM "Town crime rate, per capita"
. label variable RM "Average number of rooms per dwelling"
. label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds rive
> r; 0 otherwise)"
.
. label define bound 1 "Yes" 0 "No", replace
. label values CHAS bound
.
Note that you can also use the Data class in the sfi module to set the variable labels, and the ValueLabel class to set the value labels.
[38]:
from sfi import ValueLabel, Data
Data.setVarLabel('MEDV', "Median value of owner-occupied homes in $1000")
Data.setVarLabel('CRIM', 'Town crime rate, per capita')
Data.setVarLabel('RM', 'Average number of rooms per dwelling')
Data.setVarLabel('CHAS', 'Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)')
# remove the label created above
ValueLabel.removeLabel('bound')
ValueLabel.createLabel('bound')
ValueLabel.setLabelValue('bound', 1, 'Yes')
ValueLabel.setLabelValue('bound', 0, 'No')
ValueLabel.setVarValueLabel('CHAS', 'bound')
[39]:
stata.run('describe')
Contains data
Observations: 506
Variables: 14
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
CRIM double %10.0g Town crime rate, per capita
ZN double %10.0g
INDUS double %10.0g
CHAS double %10.0g bound Tract bounds the Charles River (=
1 if tract bounds river; 0
otherwise)
NOX double %10.0g
RM double %10.0g Average number of rooms per
dwelling
AGE double %10.0g
DIS double %10.0g
RAD double %10.0g
TAX double %10.0g
PTRATIO double %10.0g
B double %10.0g
LSTAT double %10.0g
MEDV double %10.0g Median value of owner-occupied
homes in $1000
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Afterward, we fit a linear regression model of the median home value (MEDV) on the town crime rate (CRIM), the average number of rooms per dwelling (RM), and whether the house is close to the Charles River (CHAS). Then, we predict the median home values, storing these values in the variable midval.
[40]:
stata.run('''
regress MEDV CRIM RM i.CHAS
predict midval
''')
.
. regress MEDV CRIM RM i.CHAS
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(3, 502) = 206.73
Model | 23607.3566 3 7869.11888 Prob > F = 0.0000
Residual | 19108.9388 502 38.0656151 R-squared = 0.5527
-------------+---------------------------------- Adj R-squared = 0.5500
Total | 42716.2954 505 84.5867236 Root MSE = 6.1697
------------------------------------------------------------------------------
MEDV | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIM | -.2607244 .0327369 -7.96 0.000 -.3250427 -.1964061
RM | 8.27818 .4018204 20.60 0.000 7.488723 9.067637
|
CHAS |
Yes | 3.763037 1.086199 3.46 0.001 1.628981 5.897093
_cons | -28.81068 2.563308 -11.24 0.000 -33.84682 -23.77455
------------------------------------------------------------------------------
. predict midval
(option xb assumed; fitted values)
.
Then, we use the get_ereturn() function to push all the e() results to Python as a dictionary named steret. And we use the nparray_from_data() function to store the predicted values (midval) in a NumPy array named stpred.
[41]:
steret = stata.get_ereturn()
stpred = stata.nparray_from_data('midval')
Here are the contents of steret and the first five predicted values.
[42]:
steret
[42]:
{'e(N)': 506.0,
'e(df_m)': 3.0,
'e(df_r)': 502.0,
'e(F)': 206.7251205970851,
'e(r2)': 0.5526545876050155,
'e(rmse)': 6.169733796232259,
'e(mss)': 23607.356626601762,
'e(rss)': 19108.938788418,
'e(r2_a)': 0.5499812086464797,
'e(ll)': -1636.7207309713813,
'e(ll_0)': -1840.240065743183,
'e(rank)': 4.0,
'e(cmdline)': 'regress MEDV CRIM RM i.CHAS',
'e(title)': 'Linear regression',
'e(marginsprop)': 'minus',
'e(marginsok)': 'XB default',
'e(vce)': 'ols',
'e(depvar)': 'MEDV',
'e(cmd)': 'regress',
'e(properties)': 'b V',
'e(predict)': 'regres_p',
'e(model)': 'ols',
'e(estat_cmd)': 'regress_estat',
'e(b)': array([[ -0.26072441, 8.27817981, 0. , 3.76303705,
-28.81068251]]),
'e(V)': array([[ 1.07170671e-03, 2.83319282e-03, 0.00000000e+00,
1.31333008e-03, -2.17690615e-02],
[ 2.83319282e-03, 1.61459656e-01, 0.00000000e+00,
-3.53939984e-02, -1.02250451e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00],
[ 1.31333008e-03, -3.53939984e-02, 0.00000000e+00,
1.17982792e+00, 1.36083940e-01],
[-2.17690615e-02, -1.02250451e+00, -0.00000000e+00,
1.36083940e-01, 6.57054561e+00]])}
[43]:
stpred[0:5]
[43]:
array([25.61670113, 24.33638954, 30.66092491, 29.1115799 , 30.33546638])
Next, we use margins and marginsplot to get the predicted median home values based on values of the CHAS variable.
[44]:
stata.run('''
margins CHAS
marginsplot
''')
.
. margins CHAS
Predictive margins Number of obs = 506
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
CHAS |
No | 22.27252 .2843824 78.32 0.000 21.71379 22.83124
Yes | 26.03555 1.047609 24.85 0.000 23.97732 28.09379
------------------------------------------------------------------------------
. marginsplot
Variables that uniquely identify margins: CHAS
.
Last, we create a partial-regression leverage plot for all the regressors by using the avplots command.
[45]:
stata.run('avplots')
Access Stata using API functions and the stata magic¶
The API functions can replicate what the stata magic can do. So for any given task, you might choose to use the magic command or an API function. But in an IPython environment, you can use them both. To demonstrate, we will perform the same analysis as above, using both the API functions and the stata magic command.
First, we use %%stata to load the Boston housing data into Stata and summarize it.
[46]:
%%stata -d boston -force
summarize
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
CRIM | 506 3.613524 8.601545 .00632 88.9762
ZN | 506 11.36364 23.32245 0 100
INDUS | 506 11.13678 6.860353 .46 27.74
CHAS | 506 .06917 .253994 0 1
NOX | 506 .5546951 .1158777 .385 .871
-------------+---------------------------------------------------------
RM | 506 6.284634 .7026171 3.561 8.78
AGE | 506 68.5749 28.14886 2.9 100
DIS | 506 3.795043 2.10571 1.1296 12.1265
RAD | 506 9.549407 8.707259 1 24
TAX | 506 408.2372 168.5371 187 711
-------------+---------------------------------------------------------
PTRATIO | 506 18.45553 2.164946 12.6 22
B | 506 356.674 91.29486 .32 396.9
LSTAT | 506 12.65306 7.141062 1.73 37.97
MEDV | 506 22.53281 9.197104 5 50
Then, we label the variables.
[47]:
%%stata
label variable MEDV "Median value of owner-occupied homes in $1000s"
label variable CRIM "Town crime rate, per capita"
label variable RM "Average number of rooms per dwelling"
label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)"
label define bound 1 "Yes" 0 "No", replace
label values CHAS bound
describe
. label variable MEDV "Median value of owner-occupied homes in $1000s"
. label variable CRIM "Town crime rate, per capita"
. label variable RM "Average number of rooms per dwelling"
. label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds rive
> r; 0 otherwise)"
.
. label define bound 1 "Yes" 0 "No", replace
. label values CHAS bound
.
. describe
Contains data
Observations: 506
Variables: 14
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
CRIM double %10.0g Town crime rate, per capita
ZN double %10.0g
INDUS double %10.0g
CHAS double %10.0g bound Tract bounds the Charles River (=
1 if tract bounds river; 0
otherwise)
NOX double %10.0g
RM double %10.0g Average number of rooms per
dwelling
AGE double %10.0g
DIS double %10.0g
RAD double %10.0g
TAX double %10.0g
PTRATIO double %10.0g
B double %10.0g
LSTAT double %10.0g
MEDV double %10.0g Median value of owner-occupied
homes in $1000s
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
.
Then, we run a regression model, obtain the predicted values, and store the e() results back into Python.
[48]:
%%stata -eret steret
regress MEDV CRIM RM i.CHAS
predict midval
. regress MEDV CRIM RM i.CHAS
Source | SS df MS Number of obs = 506
-------------+---------------------------------- F(3, 502) = 206.73
Model | 23607.3566 3 7869.11888 Prob > F = 0.0000
Residual | 19108.9388 502 38.0656151 R-squared = 0.5527
-------------+---------------------------------- Adj R-squared = 0.5500
Total | 42716.2954 505 84.5867236 Root MSE = 6.1697
------------------------------------------------------------------------------
MEDV | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
CRIM | -.2607244 .0327369 -7.96 0.000 -.3250427 -.1964061
RM | 8.27818 .4018204 20.60 0.000 7.488723 9.067637
|
CHAS |
Yes | 3.763037 1.086199 3.46 0.001 1.628981 5.897093
_cons | -28.81068 2.563308 -11.24 0.000 -33.84682 -23.77455
------------------------------------------------------------------------------
. predict midval
(option xb assumed; fitted values)
.
We easily stored all the e() results in the dictionary steret. But suppose we only want to store the root mean squared error, e(rmse), and the coefficient vector, e(b). In these cases, it is more convenient to use the API functions defined in the sfi to access each element.
[49]:
from sfi import Scalar, Matrix
strmse = Scalar.getValue('e(rmse)')
steb = Matrix.get('e(b)')
[50]:
strmse
[50]:
6.169733796232259
[51]:
steb
[51]:
[[-0.260724411230532,
8.278179811535974,
0.0,
3.763037052105144,
-28.810682506359157]]
In the last %%stata block, we created the variable midval to store the predicted values. Rather than exporting the whole dataset to Python, by specifying the -douta or -doutd arguments with %%stata, we only want to export this one variable to Python. So we use the nparray_from_data() function to store the midval variable into a NumPy array named stpred.
[52]:
stpred = stata.nparray_from_data('midval')
Throughout the examples, we have demonstrated three ways to interact with Stata from within Python:
using the magic commands
using the suite of API functions from the pystata Python package
using the API functions from the Stata Function Interface (sfi) module
With these tools, you can use Stata’s capabilities within Python and pass data and results between Stata and Python seamlessly.