Stata provides a tight integration with Python named PyStata. It contains two parts:
The pystata Python package includes two sets of tools for interacting with Stata from Python:
The magic commands can be used to access Stata and Mata interactively in an IPython kernel-based environment. The API functions can be used to interact with Stata and Mata from both IPython and command-line environments.
The pystata package is shipped with Stata and is located in STATA_SYSDIR/utilities/pystata directory. To get started, we need to configure the pystata package within Python to initialize Stata.
There are four methods to initialize Stata from within Python. In the first method, the configuration module stata_setup, which is available in the Python Package Index (PyPI), is provided to locate the pystata package to initialize Stata.
Suppose we have Stata installed in C:/Program Files/Stata17/ and we use the Stata/MP edition. Stata can be initialized as follows:
import stata_setup
stata_setup.config("C:/Program Files/Stata17/", "mp")
___ ____ ____ ____ ____ © /__ / ____/ / ____/ 17.0 ___/ / /___/ / /___/ MP—Parallel Edition Statistics and Data Science Copyright 1985-2021 StataCorp LLC StataCorp 4905 Lakeway Drive College Station, Texas 77845 USA 800-STATA-PC https://www.stata.com 979-696-4600 [email protected] Stata license: Single-user 8-core perpetual Serial number: 100 Licensed to: zxx StataCorp LLC Notes: 1. Unicode is supported; see help unicode_advice. 2. More than 2 billion observations are allowed; see help obs_advice. 3. Maximum number of variables is set to 5,000; see help set_maxvar.
The pystata package provides three magic commands to interact with Stata from within the IPython environment:
The stata magic can be used as both a cell magic and a line magic to execute Stata commands.
%%stata?
Execute one line or a block of Stata commands.
When the line magic command %stata is used, a one-line Stata command can be specified and executed, as it would be in Stata's Command window. When the cell magic command %%stata is used, a block of Stata commands can be specified and executed all at once. This is similar to executing a series of commands from a do-file.
Cell magic syntax:
%%stata [-d DATA] [-f DFLIST|ARRLIST] [-force]
[-doutd DATAFRAME] [-douta ARRAY] [-foutd FRAMELIST] [-fouta FRAMELIST]
[-ret DICTIONARY] [-eret DICTIONARY] [-sret DICTIONARY] [-qui] [-nogr]
[-gw WIDTH] [-gh HEIGHT]
Optional arguments:
-d DATA Load a NumPy array or pandas DataFrame
into Stata as the current working dataset.
-f DFLIST|ARRLIST Load one or multiple NumPy arrays or
pandas DataFrames into Stata as frames.
The arrays and dataframes should be
separated by commas. Each array or
DataFrame is stored in Stata as a separate
frame with the same name.
-force Force loading of the NumPy array or pandas
DataFrame into Stata as the current working
dataset, even if the dataset in memory has
changed since it was last saved; or force
loading of the NumPy arrays or pandas DataFrames
into Stata as separate frames even if one or
more of the frames already exist in Stata.
-doutd DATAFRAME Save the dataset in memory as a pandas
DataFrame when the cell completes.
-douta ARRAY Save the dataset in memory as a NumPy
array when the cell completes.
-foutd FRAMELIST Save one or multiple Stata frames as pandas
DataFrames when the cell completes. The Stata
frames should be separated by commas. Each
frame is stored in Python as a pandas
DataFrame. The variable names in each frame
will be used as the column names in the
corresponding dataframe.
-fouta FRAMELIST Save one or multiple Stata frames as NumPy
arrays when the cell completes. The Stata frames
should be separated by commas. Each frame is
stored in Python as a NumPy array.
-ret DICTIONARY Store current r() results into a dictionary.
-eret DICTIONARY Store current e() results into a dictionary.
-sret DICTIONARY Store current s() results into a dictionary.
-qui Run Stata commands but suppress output.
-nogr Do not display Stata graphics.
-gw WIDTH Set graph width in inches, pixels, or centimeters;
default is inches.
-gh HEIGHT Set graph height in inches, pixels, or centimeters;
default is inches.
Line magic syntax:
%stata stata_cmd
The %%stata magic is used to execute Stata code within a cell.
%%stata
sysuse auto, clear
describe
. sysuse auto, clear (1978 automobile data) . describe Contains data from C:\Program Files\Stata17/ado\base/a/auto.dta Observations: 74 1978 automobile data Variables: 12 13 Apr 2020 17:45 (_dta has notes) ------------------------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ------------------------------------------------------------------------------- make str18 %-18s Make and model price int %8.0gc Price mpg int %8.0g Mileage (mpg) rep78 int %8.0g Repair record 1978 headroom float %6.1f Headroom (in.) trunk int %8.0g Trunk space (cu. ft.) weight int %8.0gc Weight (lbs.) length int %8.0g Length (in.) turn int %8.0g Turn circle (ft.) displacement int %8.0g Displacement (cu. in.) gear_ratio float %6.2f Gear ratio foreign byte %8.0g origin Car origin ------------------------------------------------------------------------------- Sorted by: foreign .
%%stata
scatter mpg weight, by(foreign, total)
The %stata magic provides users a quick way to execute a single-line Stata command.
%stata summarize mpg
Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- mpg | 74 21.2973 5.785503 12 41
The cell magic %%stata provides arguments to control the execution of Stata’s commands within the cell.
%%stata -qui
summarize mpg
%%stata -gw 12cm
scatter mpg weight, by(foreign, total)
There are many ways to load data from Python into Stata's current dataset in memory. For example
We have data from the Second National Health and Nutrition Examination Survey (NHANES II; McDowell et al. 1981) studying the health and nutritional status of adults and children between 1976 and 1980. We want to use Stata’s features to fit a regression model of bpsystol as a function of age group (agegrp) and gender. Then we want to see how the average predicted systolic blood pressure varies across individuals in each age group, and across males and females in each age group.
import pandas as pd
import io
import requests
data = requests.get("https://www.stata.com/python/pystata/misc/nhanes2.csv").content
nhanes2 = pd.read_csv(io.StringIO(data.decode('utf-8')))
nhanes2
sampl | strata | psu | region | smsa | location | houssiz | sex | race | age | ... | region4 | smsa1 | smsa2 | smsa3 | rural | loglead | agegrp | highlead | bmi | highbp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1400 | 1 | 1 | S | 2 | 1 | 4 | Male | White | 54 | ... | 0 | 0 | 1 | 0 | 0 | NaN | 50-59 | NaN | 20.495686 | 0 |
1 | 1401 | 1 | 1 | S | 2 | 1 | 6 | Female | White | 41 | ... | 0 | 0 | 1 | 0 | 0 | 2.564949 | 40-49 | lead<25 | 21.022337 | 0 |
2 | 1402 | 1 | 1 | S | 1 | 1 | 6 | Female | Other | 21 | ... | 0 | 1 | 0 | 0 | 0 | NaN | 20-29 | NaN | 24.973860 | 0 |
3 | 1404 | 1 | 1 | S | 2 | 1 | 9 | Female | White | 63 | ... | 0 | 0 | 1 | 0 | 0 | NaN | 60-69 | NaN | 35.728722 | 1 |
4 | 1405 | 1 | 1 | S | 1 | 1 | 3 | Female | White | 64 | ... | 0 | 1 | 0 | 0 | 0 | 2.995732 | 60-69 | lead<25 | 27.923803 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10346 | 48760 | 32 | 2 | MW | 4 | 48 | 5 | Female | White | 35 | ... | 0 | 0 | 0 | 1 | 1 | NaN | 30-39 | NaN | 20.355173 | 0 |
10347 | 48763 | 32 | 2 | MW | 4 | 48 | 2 | Female | White | 33 | ... | 0 | 0 | 0 | 1 | 1 | 1.945910 | 30-39 | lead<25 | 41.645557 | 1 |
10348 | 48764 | 32 | 2 | MW | 4 | 48 | 1 | Female | White | 60 | ... | 0 | 0 | 0 | 1 | 0 | NaN | 60-69 | NaN | 35.626114 | 0 |
10349 | 48768 | 32 | 2 | MW | 4 | 48 | 1 | Female | White | 29 | ... | 0 | 0 | 0 | 1 | 0 | NaN | 20-29 | NaN | 19.204464 | 0 |
10350 | 48770 | 32 | 2 | MW | 4 | 48 | 1 | Male | White | 31 | ... | 0 | 0 | 0 | 1 | 1 | NaN | 30-39 | NaN | 19.635565 | 0 |
10351 rows × 58 columns
We use the -d argument to load the DataFrame into Stata as current dataset. Within Stata, we encode agegrp and sex and label the resulting variables, agegrp2 and sex2, along with bpsystol.
%%stata -d nhanes2 -force
label define sex2 1 "Male" 2 "Female"
encode sex, generate(sex2) label(sex2)
label define agegrp 1 "20-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-69" 6 "70+"
encode agegrp, generate(agegrp2) label(agegrp)
label variable bpsystol "systolic blood pressure"
label variable agegrp2 "Age Group"
label variable sex2 "1=Male, 2=Female"
describe bpsystol agegrp2 sex2
. label define sex2 1 "Male" 2 "Female" . encode sex, generate(sex2) label(sex2) . . label define agegrp 1 "20-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-69" 6 "70+" . encode agegrp, generate(agegrp2) label(agegrp) . . label variable bpsystol "systolic blood pressure" . label variable agegrp2 "Age Group" . label variable sex2 "1=Male, 2=Female" . . describe bpsystol agegrp2 sex2 Variable Storage Display Value name type format label Variable label ------------------------------------------------------------------------------- bpsystol long %12.0g systolic blood pressure agegrp2 long %8.0g agegrp Age Group sex2 long %8.0g sex2 1=Male, 2=Female .
Next, we fit the model and push Stata's estimation results into Python. The estimation results are stored in steret, which is a Python dictionary.
%%stata -eret steret
// fit a regression model
regress bpsystol agegrp2##sex2
// e() results
ereturn list
. // fit a regression model . regress bpsystol agegrp2##sex2 Source | SS df MS Number of obs = 10,351 -------------+---------------------------------- F(11, 10339) = 312.88 Model | 1407229.28 11 127929.935 Prob > F = 0.0000 Residual | 4227440.75 10,339 408.882943 R-squared = 0.2497 -------------+---------------------------------- Adj R-squared = 0.2489 Total | 5634670.03 10,350 544.412563 Root MSE = 20.221 ------------------------------------------------------------------------------ bpsystol | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- agegrp2 | 30-39 | .7956175 .9473117 0.84 0.401 -1.061297 2.652532 40-49 | 5.117078 1.018176 5.03 0.000 3.121256 7.1129 50-59 | 12.20018 1.022541 11.93 0.000 10.1958 14.20456 60-69 | 16.85887 .8155092 20.67 0.000 15.26031 18.45742 70+ | 22.50889 1.130959 19.90 0.000 20.29199 24.72579 | sex2 | Female | -12.60132 .8402299 -15.00 0.000 -14.24833 -10.9543 | agegrp2#sex2 | 30-39 #| Female | 4.140156 1.31031 3.16 0.002 1.571695 6.708617 40-49 #| Female | 8.644866 1.412067 6.12 0.000 5.876941 11.41279 50-59 #| Female | 11.83134 1.406641 8.41 0.000 9.074051 14.58863 60-69 #| Female | 14.093 1.130882 12.46 0.000 11.87625 16.30975 70+#Female | 15.86608 1.542296 10.29 0.000 12.84288 18.88928 | _cons | 123.8862 .6052954 204.67 0.000 122.6997 125.0727 ------------------------------------------------------------------------------ . . // e() results . ereturn list scalars: e(N) = 10351 e(df_m) = 11 e(df_r) = 10339 e(F) = 312.8766723590079 e(r2) = .2497447540773591 e(rmse) = 20.22085415724865 e(mss) = 1407229.279971525 e(rss) = 4227440.746112916 e(r2_a) = .2489465330013219 e(ll) = -45803.93060947768 e(ll_0) = -47291.06810807489 e(rank) = 12 macros: e(cmdline) : "regress bpsystol agegrp2##sex2" e(title) : "Linear regression" e(marginsok) : "XB default" e(vce) : "ols" e(depvar) : "bpsystol" e(cmd) : "regress" e(properties) : "b V" e(predict) : "regres_p" e(model) : "ols" e(estat_cmd) : "regress_estat" matrices: e(b) : 1 x 21 e(V) : 21 x 21 functions: e(sample) .
steret.keys()
dict_keys(['e(N)', 'e(df_m)', 'e(df_r)', 'e(F)', 'e(r2)', 'e(rmse)', 'e(mss)', 'e(rss)', 'e(r2_a)', 'e(ll)', 'e(ll_0)', 'e(rank)', 'e(cmdline)', 'e(title)', 'e(marginsprop)', 'e(marginsok)', 'e(vce)', 'e(depvar)', 'e(cmd)', 'e(properties)', 'e(predict)', 'e(model)', 'e(estat_cmd)', 'e(b)', 'e(V)'])
You can access specific elements of the dictionary. For example, you can access e(b) and e(V) by typing steret['e(b)'] and steret['e(V)'] in Python.
steret['e(b)']
array([[ 0. , 0.79561746, 5.11707797, 12.20017802, 16.85886868, 22.50888857, 0. , -12.601317 , 0. , 0. , 0. , 4.14015609, 0. , 8.6448661 , 0. , 11.83133884, 0. , 14.09300146, 0. , 15.86607901, 123.88620072]])
The iris dataset consists of four features measured on 50 samples from each of three Iris species. This data is used in Fisher's (1936) article.
%%stata
use http://www.stata-press.com/data/r17/iris, clear
describe
label list species
. use http://www.stata-press.com/data/r17/iris, clear (Iris data) . describe Contains data from http://www.stata-press.com/data/r17/iris.dta Observations: 150 Iris data Variables: 5 18 Jan 2020 13:23 (_dta has notes) ------------------------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ------------------------------------------------------------------------------- iris byte %10.0g species Iris species seplen double %4.1f Sepal length in cm sepwid double %4.1f Sepal width in cm petlen double %4.1f Petal length in cm petwid double %4.1f Petal width in cm ------------------------------------------------------------------------------- Sorted by: . label list species species: 1 setosa 2 versicolor 3 virginica .
Our goal is to build a classifier using those four features to detect the Iris type. We will use the Random Forest classification model within the scikit-learn Python package to achieve this goal.
%%stata -fouta training,test
// Split the original dataset into training and test
// dataset which contains 80% and 20% of observations respectively
splitsample, generate(svar, replace) split(0.8 0.2) show rseed(16)
// create two frames holding the the two datasets
frame put iris seplen sepwid petlen petwid if svar==1, into(training)
frame put iris seplen sepwid petlen petwid if svar==2, into(test)
. // Split the original dataset into training and test . // dataset which contains 80% and 20% of observations respectively . splitsample, generate(svar, replace) split(0.8 0.2) show rseed(16) svar | Freq. Percent Cum. ------------+----------------------------------- 1 | 120 80.00 80.00 2 | 30 20.00 100.00 ------------+----------------------------------- Total | 150 100.00 . . // create two frames holding the the two datasets . frame put iris seplen sepwid petlen petwid if svar==1, into(training) . frame put iris seplen sepwid petlen petwid if svar==2, into(test) .
Now we have 2 NumPy arrays in Python, training and test. Below we split each array into two sub-arrays to store the features and labels separately.
X_train = training[:, 1:]
y_train = training[:, 0]
X_test = test[:, 1:]
y_test = test[:, 0]
Then we use X_train and y_train to train the classification model.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=3, random_state=17)
clf.fit(X_train, y_train)
RandomForestClassifier(max_depth=3, random_state=17)
Next we use X_test and y_test to evaluate the performace of the training model. We also predict the species type of each flower and the probabilities that it belongs to the three species in the test dataset.
from sklearn import metrics
y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)
Next, in the test frame, we create a Byte variable irispr to store the predicted species types and three float variables to store the probabilities that each flower belongs to the three species types from the array y_pred_prob.
from sfi import Frame
fr = Frame.connect('test')
fr.addVarByte('irispr')
fr.addVarFloat('pr1')
fr.addVarFloat('pr2')
fr.addVarFloat('pr3')
fr.store('irispr', None, y_pred)
fr.store('pr1 pr2 pr3', None, y_pred_prob)
In Stata, we change the current working frame to test. We attach the value label species to irispr, and use the tabulate command to display a classification table. We also list the flowers that have been misclassified.
%%stata
frame change test
label values irispr species
label variable irispr predicted
tabulate iris irispr, row
list iris irispr pr1 pr2 pr3 if iris!=irispr
. frame change test . label values irispr species . label variable irispr predicted . tabulate iris irispr, row +----------------+ | Key | |----------------| | frequency | | row percentage | +----------------+ Iris | predicted species | setosa versicolo virginica | Total -----------+---------------------------------+---------- setosa | 11 0 0 | 11 | 100.00 0.00 0.00 | 100.00 -----------+---------------------------------+---------- versicolor | 0 9 3 | 12 | 0.00 75.00 25.00 | 100.00 -----------+---------------------------------+---------- virginica | 0 1 6 | 7 | 0.00 14.29 85.71 | 100.00 -----------+---------------------------------+---------- Total | 11 10 9 | 30 | 36.67 33.33 30.00 | 100.00 . list iris irispr pr1 pr2 pr3 if iris!=irispr +----------------------------------------------------------+ | iris irispr pr1 pr2 pr3 | |----------------------------------------------------------| 16. | versicolor virginica 0 .1658101 .8341899 | 18. | versicolor virginica 0 .0193115 .9806885 | 19. | versicolor virginica .0006667 .2863244 .7130089 | 26. | virginica versicolor .0015674 .593027 .4054056 | +----------------------------------------------------------+ .
The mata magic is used to execute Mata code. It can be used as both a line magic and a cell magic command.
%%mata?
Execute one line or a block of Mata code.
When the %mata line magic command is used, one line of Mata code can be specified and executed. This is similar to specifying mata: istmt within Stata. When the %%mata cell magic command is used, a block of Mata code can be specified. The code is executed just as it would be in a do-file.
Cell magic syntax:
%%mata [-m ARRAYLIST] [-outm MATLIST] [-qui] [-c]
Execute a block of Mata code. This is equivalent to running a
block of Mata code within a do-file. You do not need to
explicitly place the code within the mata[:] and end block.
Optional arguments:
-m ARRAYLIST Load multiple NumPy arrays into Mata's
interactive environment. The array names
should be separated by commas. Each array is
stored in Mata as a matrix with the same name.
-outm MATLIST Save Mata matrices as NumPy arrays when the
cell completes. The matrix names should be
separated by commas. Each matrix is stored
in Python as a NumPy array.
-qui Run Mata code but suppress output.
-c This argument specifies that Mata code be
executed in mata: mode. This means that
if Mata encounters an error, it will stop
execution and return control to Python. The
error will then be thrown as a Python SystemError
exception.
Line magic syntax:
%mata [-c]
Enter Mata's interactive environment. This is equivalent to
typing mata or mata: in the Stata Command window.
Optional argument:
-c Enter interactive Mata environment in mata:
mode. The default is to enter in mata mode.
%mata istmt
Run a single-line Mata statement. This is equivalent to executing
mata: istmt within Stata.
%%mata
/* create an NxN identity matrix */
real matrix id(real scalar n)
{
real scalar i
real matrix res
res = J(n, n, 0)
for (i=1; i<=n; i++) {
res[i,i] = 1
}
return(res)
}
B = id(3)
B
. mata ------------------------------------------------- mata (type end to exit) ----- : /* create an NxN identity matrix */ : real matrix id(real scalar n) > { > real scalar i > real matrix res > > res = J(n, n, 0) > for (i=1; i<=n; i++) { > res[i,i] = 1 > } > return(res) > } : : B = id(3) : B [symmetric] 1 2 3 +-------------+ 1 | 1 | 2 | 0 1 | 3 | 0 0 1 | +-------------+ : end ------------------------------------------------------------------------------- .
The %pystata line magic is used to configure the system and display current system information and settings.
%pystata?
Stata utility commands.
Line magic syntax:
%pystata status
Display current system information and settings.
%pystata set graph_show True|False [, perm]
Set whether Stata graphics are displayed. The default is to
display the graphics. Note that if multiple graphs are
generated, only the last one is displayed. To display multiple
graphs, use the name() option with Stata's graph commands.
%pystata set graph_size w #[in|px|cm] [, perm]
%pystata set graph_size h #[in|px|cm] [, perm]
%pystata set graph_size w #[in|px|cm] h #[in|px|cm] [, perm]
Set dimensions for Stata graphs. The default is a 5.5-inch width
and 4-inch height. Values may be specified in inches, pixels, or
centimeters; the default unit is inches. Either the width or height
must be specified, or both. If only one is specified, the other one
is determined by the aspect ratio.
%pystata set graph_format svg|png|pdf [, perm]
Set the graphic format used to display Stata graphs. The default
is svg. If svg or png is specified, the graphs will be embedded.
If pdf is specified, the graphs will be displayed, and exported
to PDF files and stored in the current working directory with
numeric names, such as 0.pdf, 1.pdf, 2.pdf, etc. Storing the PDF
graph files in the current directory allows you to embed them in
the notebook when exporting the notebook to a PDF via Latex.
%pystata set graph_format png
%%stata
sysuse auto, clear
histogram rep78
. sysuse auto, clear (1978 automobile data) . histogram rep78 (bin=8, start=1, width=.5) .
In addition to the magic commands, you can also call Stata using the API functions defined in the stata module.
We use a dataset containing quarterly turkey sales throughout the 1990s to illustrate how to call Stata using API functions.
turksales = pd.read_csv('turksales.csv')
turksales.head()
t | sales | |
---|---|---|
0 | 1990q1 | 100.000000 |
1 | 1990q2 | 97.846031 |
2 | 1990q3 | 98.840286 |
3 | 1990q4 | 100.827500 |
4 | 1991q1 | 98.909805 |
We load the pandas DataFrame into Stata using the pdataframe_to_data() function of the stata module.
from pystata import stata
stata.pdataframe_to_data(turksales, force=True)
Next, we declare the data to be time-series data using the run() function.
stata.run('''
generate qdate = quarterly(t, "YQ")
format qdate %tq
tsset qdate, quarterly
''')
. . generate qdate = quarterly(t, "YQ") . format qdate %tq . tsset qdate, quarterly Time variable: qdate, 1990q1 to 1999q4 Delta: 1 quarter .
Afterwards, we fit a autoregressive integrated moving average (ARIMA) model on sales using the arima command. Then we predict the sale values, storing these values in the variable sales_pred.
stata.run('''
arima sales, arima(3, 1, 0)
predict sales_pred, y
''')
. . arima sales, arima(3, 1, 0) (setting optimization to BHHH) Iteration 0: log likelihood = -95.623818 Iteration 1: log likelihood = -83.15803 Iteration 2: log likelihood = -78.411534 Iteration 3: log likelihood = -77.140891 Iteration 4: log likelihood = -75.668157 (switching optimization to BFGS) Iteration 5: log likelihood = -75.617809 Iteration 6: log likelihood = -75.542463 Iteration 7: log likelihood = -75.536259 Iteration 8: log likelihood = -75.535664 Iteration 9: log likelihood = -75.53558 Iteration 10: log likelihood = -75.535576 ARIMA regression Sample: 1990q2 thru 1999q4 Number of obs = 39 Wald chi2(3) = 125.92 Log likelihood = -75.53558 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ | OPG D.sales | Coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- sales | _cons | .3138001 .0780853 4.02 0.000 .1607558 .4668444 -------------+---------------------------------------------------------------- ARMA | ar | L1. | -.8190625 .087633 -9.35 0.000 -.99082 -.647305 L2. | -.8174304 .1006356 -8.12 0.000 -1.014673 -.6201882 L3. | -.7738968 .0902345 -8.58 0.000 -.9507531 -.5970404 -------------+---------------------------------------------------------------- /sigma | 1.608579 .264229 6.09 0.000 1.0907 2.126458 ------------------------------------------------------------------------------ Note: The test of the variance against zero is one sided, and the two-sided confidence interval is truncated at zero. . predict sales_pred, y (1 missing value generated) .
Then we use the pdataframe_from_data() function to store the time variable t, original sale values sales, and the predicted values sales_pred in a pandas DataFrame named stpred.
import numpy as np
stpred = stata.pdataframe_from_data('t sales sales_pred', missingval=np.nan)
stpred = stpred.set_index("t")
stpred.head()
sales | sales_pred | |
---|---|---|
t | ||
1990q1 | 100.000000 | NaN |
1990q2 | 97.846031 | 100.313797 |
1990q3 | 98.840286 | 98.946854 |
1990q4 | 100.827500 | 99.967163 |
1991q1 | 98.909805 | 101.124245 |
Next we plot original turkey sale values and the predictions values in Python.
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
sns.set_theme(style="darkgrid")
fig, ax = plt.subplots(figsize=(10, 8))
lp = sns.lineplot(ax=ax, data=stpred)
lp.xaxis.set_major_locator(ticker.MultipleLocator(6))
plt.show()
Python Integration
https://www.stata.com/manuals/ppystataintegration.pdf
The pystata Python package
https://www.stata.com/python/pystata/index.html
Stata Function Interface (sfi) module
https://www.stata.com/python/api17/