Neighboring towns have more influence on each other than on towns far away. The same is true of countries that are close to each other and of closely connected friends on social media.
Spatial autoregressive models are fit using datasets that contain observations on geographical areas. Observations are called spatial units and might be countries, states, counties, postal codes, or city blocks. Alternatively, they might not be geographically based at all; they could be nodes of a social network.
Datasets contain a continuous outcome variable—such as incidence of disease, output of farms, or crime rate—along with other variables to predict the outcome. For cross-sectional data, each variable has one value per spatial unit. For panel data, there are typically multiple values for different time points.
There is a new manual entirely devoted to fitting SAR models, working with spatial data, and creating and managing spatial weighting matrices. The new commands are called the Sp commands. See the Spatial Autoregressive Models Reference Manual.
There are three steps to fitting SAR models:
Stata's Sp commands will work with or without shapefiles, files commonly used to define maps. They will work with other location data or even work with data without locations at all, such as social network data.
Here's an analysis from start to finish.
You do not have to use shapefiles, but we will. We downloaded a standard-format shapefile, tl_2016_us_county.zip, that we found at https://www.census.gov/geo/maps-data/data/tiger.html.
We copied the file to our current directory, then typed in Stata:
. unzipfile tl_2016_us_county.zip (output omitted) . spshape2dta tl_2016_us_county (importing .shp file) (importing .dbf file) (creating _ID spatial-unit id) (creating _CX coordinate) (creating _CY coordinate) file tl_2016_us_county_shp.dta created file tl_2016_us_county.dta created
spshape2dta did its magic and created two Stata datasets for us. One is a Stata-format shapefile:
tl_2016_us_county_shp.dta
The other is a Stata dataset containing the other data that were in the shapefile bundle:
tl_2016_us_county.dta
spshape2dta also linked the two files.
We have our own analysis data for these counties in texas_ue.dta. We could just use them and skip the shapefile, but our data do not have the coordinates of the counties. We could not calculate distances or find neighbors. We could not do an SAR analysis. That's why we got the shapefile from the U.S. Census; it will provide all of this information.
We will merge our data with tl_2016_us_county.dta. We will first create an ID variable for merging the files. We also tell Sp that the Census provided the coordinates in latitude and longitude and that we want distances reported in miles.
. use tl_2016_us_county . generate long fips = real(STATEFP + COUNTYFP) . spset fips, modify replace (output omitted) . spset, modify coordsys(latlong, miles) Sp dataset tl_2016_us_county.dta data: cross sectional spatial-unit id: _ID (equal to fips) coordinates: _CY, _CX (latitude-and-longitude, miles) linked shapefile: tl_2016_us_county_shp.dta . save, replace file tl_2016_us_county.dta saved . use texas_ue, clear . merge 1:1 fips using tl_2016_us_county, keep(match) (output omitted)
Our data are ready for analysis.
We plan on fitting a model with spatial lags of the dependent variable, spatial lags of a covariate, and spatial autoregressive errors. Spatial lags are defined by spatial weighting matrices. We will use one matrix for the variables and another for the errors.
Sp provides many ways to create spatial weighting matrices. We will use just two of its predefined formulations:
. spmatrix create contiguity W . spmatrix create idistance M . spmatrix dir
Weighting matrix name N x N Type Normalization | ||
M 254 x 254 idistance spectral | ||
W 254 x 254 contiguity spectral | ||
We created W to be a contiguity matrix based on nearest neighbors.
We created M to be the inverse of the distance between counties.
We let Sp perform its default normalization, which is spectral (largest eigenvalue). We could have chosen row or min–max normalization.
Sp also provides commands that let you create custom weighting matrices. You can create them from Stata data by writing Mata code or by importing them from a text file. The Mata capability is of special interest because it is so easy to use.
We have prepared our data.
We have defined the spatial weighting matrices we need.
We can now fit our model.
. spregress unemployment college, gs2sls dvarlag(W) ivarlag(W:college) errorlag(M) Spatial autoregressive model Number of obs = 254 GS2SLS estimates Wald chi2(3) = 43.52 Prob > chi2 = 0.0000 Pseudo R2 = 0.2081
unemployment | Coef. Std. Err. z P>|z| [95% Conf. Interval] | |
unemployment | ||
college | -.067419 .0122785 -5.49 0.000 -.0914843 -.0433536 | |
_cons | 5.715733 .4246597 13.46 0.000 4.883415 6.54805 | |
W | ||
college | -.0424388 .0213227 -1.99 0.047 -.0842306 -.000647 | |
unemployment | .2058481 .0961201 2.14 0.032 .0174562 .39424 | |
W | ||
e.unemploy~t | 3.247298 1.369204 2.37 0.018 .5637078 5.930888 | |
Wald test of spatial terms: chi2(3) = 9.77 Prob > chi2 = 0.0206 |
We used the generalized spatial two-stage least-squares (GS2SLS) estimator. The GS2SLS estimator lets us fit multiple spatial lags, potentially allowing us to better approximate the true spatial dependence. Alternatively, we could have used a maximum likelihood estimator to fit the model.
Our dependent variable is the unemployment rate (unemployment) and we think that unemployment is affected by the proportion of the adult population who hold college degrees (college).
If you are familiar with SAR models, you know they are difficult to interpret because the coefficients are a combination of direct and indirect effects. Direct effects are the effects of the spatial unit on itself. Indirect effects are the effects spatial units have on other spatial units, also known as spillover effects.
New command estat impact splits out those effects.
. estat impact Average impacts Number of obs = 254
Delta-Method | ||
dy/dx Std. Err. z P>|z| [95% Conf. Interval] | ||
direct | ||
college | -.0690788 .0121434 -5.69 0.000 -.0928795 -.0452781 | |
indirect | ||
college | -.0586373 .0299383 -1.96 0.050 -.1173152 .0000407 | |
total | ||
college | -.1277161 .0320147 -3.99 0.000 -.1904638 -.0649684 | |
estat impact is essential for interpreting the results of SAR models and works after all Sp estimators. We see that a 1-percentage point increase in those holding college degrees in a county reduces unemployment by 0.07 percentage points in that same county, the direct effect. The spillover effect to neighboring counties is almost as large—a 0.06 percentage point expected reduction in unemployment.
Learn more about Stata's spatial autogressive models features.
There is an entire manual dedicated to SAR, and it has friendly introductions to the subject. See the new Spatial Autoregressive Models Reference Manual.