Example 5: Call Stata using API functions

In the previous examples, we saw how to run Stata commands by using the %%stata magic command. We also saw how it can be used to pass datasets and results between Stata and Python. The magic command is easy and convenient to use, but it will only work in an IPython environment.

In this example, we will demonstrate how to call Stata using the API functions within the config and stata modules. The API functions can be used in both IPython and non-IPython environments. Within an IPython environment, you can use the magic commands and API functions together, making the interaction between Python and Stata more flexible.

The first part of this example will show you how to interact with Stata by using just the API functions. The second part will demonstrate how to use the API functions in combination with the magic commands.

For illustrative purposes, we will use the Boston housing price dataset preloaded in the scikit-learn package. This dataset was collected in 1978 and contains 506 observations on 14 features for homes from various suburbs in Boston, Massachusetts.

Access Stata using API functions

First, we load the Boston housing price dataset from the scikit-learn package and describe the dataset.

[33]:
from sklearn import datasets

bos = datasets.load_boston()
print(bos.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

We store it into a pandas DataFrame named boston.

[34]:
import pandas as pd
boston = pd.DataFrame(bos.data)
boston.columns = bos.feature_names
boston['MEDV'] = bos.target
boston.head()
[34]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

Then, we load the pandas DataFrame into Stata by using the pdataframe_to_data() function of the stata module.

[35]:
from pystata import stata
stata.pdataframe_to_data(boston, force=True)

We use the run() function to quickly summarize the data in Stata.

[36]:
stata.run('summarize')

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        CRIM |        506    3.613524    8.601545     .00632    88.9762
          ZN |        506    11.36364    23.32245          0        100
       INDUS |        506    11.13678    6.860353        .46      27.74
        CHAS |        506      .06917     .253994          0          1
         NOX |        506    .5546951    .1158777       .385       .871
-------------+---------------------------------------------------------
          RM |        506    6.284634    .7026171      3.561       8.78
         AGE |        506     68.5749    28.14886        2.9        100
         DIS |        506    3.795043     2.10571     1.1296    12.1265
         RAD |        506    9.549407    8.707259          1         24
         TAX |        506    408.2372    168.5371        187        711
-------------+---------------------------------------------------------
     PTRATIO |        506    18.45553    2.164946       12.6         22
           B |        506     356.674    91.29486        .32      396.9
       LSTAT |        506    12.65306    7.141062       1.73      37.97
        MEDV |        506    22.53281    9.197104          5         50

Then, we specify the variable labels and attach a value label to CHAS, which indicates whether the tract bounds the Charles River.

[37]:
stata.run('''
label variable MEDV "Median value of owner-occupied homes in $1000s"
label variable CRIM "Town crime rate, per capita"
label variable RM "Average number of rooms per dwelling"
label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)"

label define bound 1 "Yes" 0 "No", replace
label values CHAS bound
''')

.
. label variable MEDV "Median value of owner-occupied homes in $1000s"

. label variable CRIM "Town crime rate, per capita"

. label variable RM "Average number of rooms per dwelling"

. label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds rive
> r; 0 otherwise)"

.
. label define bound 1 "Yes" 0 "No", replace

. label values CHAS bound

.

Note that you can also use the Data class in the sfi module to set the variable labels, and the ValueLabel class to set the value labels.

[38]:
from sfi import ValueLabel, Data

Data.setVarLabel('MEDV', "Median value of owner-occupied homes in $1000")
Data.setVarLabel('CRIM', 'Town crime rate, per capita')
Data.setVarLabel('RM', 'Average number of rooms per dwelling')
Data.setVarLabel('CHAS', 'Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)')

# remove the label created above
ValueLabel.removeLabel('bound')

ValueLabel.createLabel('bound')
ValueLabel.setLabelValue('bound', 1, 'Yes')
ValueLabel.setLabelValue('bound', 0, 'No')
ValueLabel.setVarValueLabel('CHAS', 'bound')
[39]:
stata.run('describe')

Contains data
 Observations:           506
    Variables:            14
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
CRIM            double  %10.0g                Town crime rate, per capita
ZN              double  %10.0g
INDUS           double  %10.0g
CHAS            double  %10.0g     bound      Tract bounds the Charles River (=
                                                1 if tract bounds river; 0
                                                otherwise)
NOX             double  %10.0g
RM              double  %10.0g                Average number of rooms per
                                                dwelling
AGE             double  %10.0g
DIS             double  %10.0g
RAD             double  %10.0g
TAX             double  %10.0g
PTRATIO         double  %10.0g
B               double  %10.0g
LSTAT           double  %10.0g
MEDV            double  %10.0g                Median value of owner-occupied
                                                homes in $1000
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

Afterward, we fit a linear regression model of the median home value (MEDV) on the town crime rate (CRIM), the average number of rooms per dwelling (RM), and whether the house is close to the Charles River (CHAS). Then, we predict the median home values, storing these values in the variable midval.

[40]:
stata.run('''
regress MEDV CRIM RM i.CHAS
predict midval
''')

.
. regress MEDV CRIM RM i.CHAS

      Source |       SS           df       MS      Number of obs   =       506
-------------+----------------------------------   F(3, 502)       =    206.73
       Model |  23607.3566         3  7869.11888   Prob > F        =    0.0000
    Residual |  19108.9388       502  38.0656151   R-squared       =    0.5527
-------------+----------------------------------   Adj R-squared   =    0.5500
       Total |  42716.2954       505  84.5867236   Root MSE        =    6.1697

------------------------------------------------------------------------------
        MEDV | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        CRIM |  -.2607244   .0327369    -7.96   0.000    -.3250427   -.1964061
          RM |    8.27818   .4018204    20.60   0.000     7.488723    9.067637
             |
        CHAS |
        Yes  |   3.763037   1.086199     3.46   0.001     1.628981    5.897093
       _cons |  -28.81068   2.563308   -11.24   0.000    -33.84682   -23.77455
------------------------------------------------------------------------------

. predict midval
(option xb assumed; fitted values)

.

Then, we use the get_ereturn() function to push all the e() results to Python as a dictionary named steret. And we use the nparray_from_data() function to store the predicted values (midval) in a NumPy array named stpred.

[41]:
steret = stata.get_ereturn()
stpred = stata.nparray_from_data('midval')

Here are the contents of steret and the first five predicted values.

[42]:
steret
[42]:
{'e(N)': 506.0,
 'e(df_m)': 3.0,
 'e(df_r)': 502.0,
 'e(F)': 206.7251205970851,
 'e(r2)': 0.5526545876050155,
 'e(rmse)': 6.169733796232259,
 'e(mss)': 23607.356626601762,
 'e(rss)': 19108.938788418,
 'e(r2_a)': 0.5499812086464797,
 'e(ll)': -1636.7207309713813,
 'e(ll_0)': -1840.240065743183,
 'e(rank)': 4.0,
 'e(cmdline)': 'regress MEDV CRIM RM i.CHAS',
 'e(title)': 'Linear regression',
 'e(marginsprop)': 'minus',
 'e(marginsok)': 'XB default',
 'e(vce)': 'ols',
 'e(depvar)': 'MEDV',
 'e(cmd)': 'regress',
 'e(properties)': 'b V',
 'e(predict)': 'regres_p',
 'e(model)': 'ols',
 'e(estat_cmd)': 'regress_estat',
 'e(b)': array([[ -0.26072441,   8.27817981,   0.        ,   3.76303705,
         -28.81068251]]),
 'e(V)': array([[ 1.07170671e-03,  2.83319282e-03,  0.00000000e+00,
          1.31333008e-03, -2.17690615e-02],
        [ 2.83319282e-03,  1.61459656e-01,  0.00000000e+00,
         -3.53939984e-02, -1.02250451e+00],
        [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
          0.00000000e+00, -0.00000000e+00],
        [ 1.31333008e-03, -3.53939984e-02,  0.00000000e+00,
          1.17982792e+00,  1.36083940e-01],
        [-2.17690615e-02, -1.02250451e+00, -0.00000000e+00,
          1.36083940e-01,  6.57054561e+00]])}
[43]:
stpred[0:5]
[43]:
array([25.61670113, 24.33638954, 30.66092491, 29.1115799 , 30.33546638])

Next, we use margins and marginsplot to get the predicted median home values based on values of the CHAS variable.

[44]:
stata.run('''
margins CHAS
marginsplot
''')

.
. margins CHAS

Predictive margins                                         Number of obs = 506
Model VCE: OLS

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        CHAS |
         No  |   22.27252   .2843824    78.32   0.000     21.71379    22.83124
        Yes  |   26.03555   1.047609    24.85   0.000     23.97732    28.09379
------------------------------------------------------------------------------

. marginsplot

Variables that uniquely identify margins: CHAS

.
../_images/notebook_Example5_22_1.svg

Last, we create a partial-regression leverage plot for all the regressors by using the avplots command.

[45]:
stata.run('avplots')
../_images/notebook_Example5_24_0.svg

Access Stata using API functions and the stata magic

The API functions can replicate what the stata magic can do. So for any given task, you might choose to use the magic command or an API function. But in an IPython environment, you can use them both. To demonstrate, we will perform the same analysis as above, using both the API functions and the stata magic command.

First, we use %%stata to load the Boston housing data into Stata and summarize it.

[46]:
%%stata -d boston -force
summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        CRIM |        506    3.613524    8.601545     .00632    88.9762
          ZN |        506    11.36364    23.32245          0        100
       INDUS |        506    11.13678    6.860353        .46      27.74
        CHAS |        506      .06917     .253994          0          1
         NOX |        506    .5546951    .1158777       .385       .871
-------------+---------------------------------------------------------
          RM |        506    6.284634    .7026171      3.561       8.78
         AGE |        506     68.5749    28.14886        2.9        100
         DIS |        506    3.795043     2.10571     1.1296    12.1265
         RAD |        506    9.549407    8.707259          1         24
         TAX |        506    408.2372    168.5371        187        711
-------------+---------------------------------------------------------
     PTRATIO |        506    18.45553    2.164946       12.6         22
           B |        506     356.674    91.29486        .32      396.9
       LSTAT |        506    12.65306    7.141062       1.73      37.97
        MEDV |        506    22.53281    9.197104          5         50

Then, we label the variables.

[47]:
%%stata
label variable MEDV "Median value of owner-occupied homes in $1000s"
label variable CRIM "Town crime rate, per capita"
label variable RM "Average number of rooms per dwelling"
label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)"

label define bound 1 "Yes" 0 "No", replace
label values CHAS bound

describe

. label variable MEDV "Median value of owner-occupied homes in $1000s"

. label variable CRIM "Town crime rate, per capita"

. label variable RM "Average number of rooms per dwelling"

. label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds rive
> r; 0 otherwise)"

.
. label define bound 1 "Yes" 0 "No", replace

. label values CHAS bound

.
. describe

Contains data
 Observations:           506
    Variables:            14
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
CRIM            double  %10.0g                Town crime rate, per capita
ZN              double  %10.0g
INDUS           double  %10.0g
CHAS            double  %10.0g     bound      Tract bounds the Charles River (=
                                                1 if tract bounds river; 0
                                                otherwise)
NOX             double  %10.0g
RM              double  %10.0g                Average number of rooms per
                                                dwelling
AGE             double  %10.0g
DIS             double  %10.0g
RAD             double  %10.0g
TAX             double  %10.0g
PTRATIO         double  %10.0g
B               double  %10.0g
LSTAT           double  %10.0g
MEDV            double  %10.0g                Median value of owner-occupied
                                                homes in $1000s
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

.

Then, we run a regression model, obtain the predicted values, and store the e() results back into Python.

[48]:
%%stata -eret steret
regress MEDV CRIM RM i.CHAS
predict midval

. regress MEDV CRIM RM i.CHAS

      Source |       SS           df       MS      Number of obs   =       506
-------------+----------------------------------   F(3, 502)       =    206.73
       Model |  23607.3566         3  7869.11888   Prob > F        =    0.0000
    Residual |  19108.9388       502  38.0656151   R-squared       =    0.5527
-------------+----------------------------------   Adj R-squared   =    0.5500
       Total |  42716.2954       505  84.5867236   Root MSE        =    6.1697

------------------------------------------------------------------------------
        MEDV | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        CRIM |  -.2607244   .0327369    -7.96   0.000    -.3250427   -.1964061
          RM |    8.27818   .4018204    20.60   0.000     7.488723    9.067637
             |
        CHAS |
        Yes  |   3.763037   1.086199     3.46   0.001     1.628981    5.897093
       _cons |  -28.81068   2.563308   -11.24   0.000    -33.84682   -23.77455
------------------------------------------------------------------------------

. predict midval
(option xb assumed; fitted values)

.

We easily stored all the e() results in the dictionary steret. But suppose we only want to store the root mean squared error, e(rmse), and the coefficient vector, e(b). In these cases, it is more convenient to use the API functions defined in the sfi to access each element.

[49]:
from sfi import Scalar, Matrix

strmse = Scalar.getValue('e(rmse)')
steb = Matrix.get('e(b)')
[50]:
strmse
[50]:
6.169733796232259
[51]:
steb
[51]:
[[-0.260724411230532,
  8.278179811535974,
  0.0,
  3.763037052105144,
  -28.810682506359157]]

In the last %%stata block, we created the variable midval to store the predicted values. Rather than exporting the whole dataset to Python, by specifying the -douta or -doutd arguments with %%stata, we only want to export this one variable to Python. So we use the nparray_from_data() function to store the midval variable into a NumPy array named stpred.

[52]:
stpred = stata.nparray_from_data('midval')

Throughout the examples, we have demonstrated three ways to interact with Stata from within Python:

With these tools, you can use Stata’s capabilities within Python and pass data and results between Stata and Python seamlessly.