Last updated: 4 October 2012
2012 Stata Conference San Diego
26–27 July 2012
Manchester Grand Hyatt
One Market Place
San Diego, CA 92101
Proceedings
Custom Stata commands for semi-automatic confidentiality screening of Statistics Canada data
Jesse McCrosky
University of Saskatchewan
The use of Statistics Canada census and survey data in research data centers
is subject to very specific and sometimes complex confidentiality
requirements. Ensuring that statistical output meets these requirements adds
an additional step to the analysis that can be difficult and time-consuming.
Thanks to the flexibility of Stata, this additional step can sometimes be
avoided. I present newly developed Stata commands that partially automate
this process. Features include reporting of minimum unweighted frequencies
for weighted output, automatic rounding of results as required by a given
survey, and warnings when potentially unreleasable results are generated.
These commands have the potential to save time and reduce error rates for
researchers using Statistics Canada data as well as for Research Data
Analysts, the Statistics Canada employees responsible for confidentiality
screening.
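A minimal sketch of the kind of check these commands automate, using only
official Stata commands and an illustrative minimum-cell-size rule of 10;
this is not the syntax of the new commands themselves.
    sysuse nlsw88, clear
    tabulate race collgrad [aweight = hours]   // weighted table a researcher might release
    tabulate race collgrad                     // unweighted counts behind the same cells
    quietly count if race == 3 & collgrad == 1
    if r(N) < 10 {
        display as error "Cell (race==3, collgrad==1) has only " r(N) " unweighted observations"
    }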
Additional information
sd12_mccrosky.pdf
scdensity: A program for self-consistent density estimation
Joerg Luedicke
University of Florida and Yale University
Estimating the density of a distribution from a finite number of data points
is an important tool in the statistician’s and data analyst’s toolbox.
In their recent paper, Bernacchia and Pigolotti (2011) introduce a new
nonparametric method for the density estimation of univariate distributions.
Whereas conventional methods, like plotting histograms or kernel density
estimates, require arbitrary choices to be made beforehand (for
example, choosing a smoothing parameter), Bernacchia and Pigolotti’s
approach does not rely on any a priori assumptions but instead estimates the
density in a “self-consistent” way by iteratively finding an
optimal shape of the kernel. The method of self-consistent density estimation
is implemented in Stata as an ado-file (scdensity), with its main
engine written in Mata. In this presentation, I will discuss the underlying
theory and main features of this program. In addition, I will present results
of Monte Carlo simulations that compare the performance of the
self-consistent density estimate with various kernel estimates and maximum
likelihood fits. Finally, I will evaluate the potential usefulness of the
self-consistent estimator in other contexts, such as nonparametric regression
modeling.
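A minimal sketch of the comparison described above. The kdensity calls use
official syntax with deliberately arbitrary bandwidths; the scdensity call
assumes the command accepts a single variable in the same way, which is an
assumption rather than documented syntax.
    sysuse auto, clear
    kdensity mpg, bwidth(1) name(narrow, replace)   // arbitrary, narrow smoothing parameter
    kdensity mpg, bwidth(5) name(wide, replace)     // arbitrary, wide smoothing parameter
    scdensity mpg                                   // self-consistent estimate (assumed syntax)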
Reference:
Bernacchia, A., and S. Pigolotti. 2011. Self-consistent method for density
estimation.
Journal of the Royal Statistical Society, Series B
73: 407–422.
Additional information
sd12_luedicke.pdf
TMPM: The trauma mortality prediction model is robust to ICD-9, ICD-10, and AIS coding lexicons
Alan Cook
Baylor University Medical Center
Many methods have been developed to predict mortality following trauma. Two
classification systems are used to provide a taxonomy for diseases, including
injuries. The ICD-9 is the classification system for administrative data in
the United States. AIS was developed for characterization of injuries alone.
The Trauma Mortality Prediction Model (TMPM) is based on empirical estimates of
severity for each injury in the ICD-9 and AIS lexicons. The probability of
mortality for each patient is estimated from that patient’s five worst injuries. TMPM has
been rigorously tested against other mortality prediction models using ICD-9
and AIS data and has been found superior. The
tmpm command allows
Stata users to efficiently apply TMPM to datasets using ICD-9 or AIS. The
command uses model-averaged regression coefficients that assign empirically
derived severity measures for each of the 1,322 AIS codes and 1,579 ICD-9
injury codes. The injury codes are sorted into body regions and then merged
with the table of model-averaged regression coefficients to assemble a set of
regression coefficients. A logit model is generated to calculate the
probability of death.
tmpm accommodates either AIS or ICD-9 lexicons
from a single command and adds the probability of mortality for each patient
to the original dataset as a new variable.
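The workflow described above can be sketched conceptually with official
commands; the file names, variable names, and placeholder coefficients below
are hypothetical and do not reflect tmpm's actual interface.
    use patient_injuries, clear                        // hypothetical file: one record per injury
    merge m:1 icd9_code using tmpm_coefficients, keep(match) nogenerate
    bysort patient_id (severity): keep if _n > _N - 5  // keep the five worst injuries per patient
    collapse (sum) index = severity, by(patient_id)
    generate p_death = invlogit(-2.5 + index)          // placeholder coefficients, illustration only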
Additional information
sd12_cook.pdf
Adoption: A new Stata routine for consistently estimating population technological adoption parameters
Aliou Diagne
Africa Rice Center
Diagne and Demont (2007) used a counterfactual outcomes framework to show
that the observed sample technological adoption rate does not consistently
estimate the population adoption rate even if the sample is random. They
likewise showed that an adoption model with the observed adoption outcome as the
dependent variable, in which exposure to the technology is not observed and
controlled for, cannot yield consistent estimates of the determinants of
adoption. In this talk, I present a new user-written Stata command called
adoption. My command is implemented by using Stata estimation commands
internally to carry out the various estimations and by computing the correct
standard errors for the average treatment effect (ATE) parameter estimates:
population mean potential adoption in the exposed subpopulation (ATE1),
population mean potential adoption in the non-exposed subpopulation (ATE0),
population mean joint exposure and adoption (JEA), population adoption gap
(GAP), and population selection bias (PSB). The ATE adoption parameters are
estimated using a semiparametric method (that is, inverse probability
weighting) or a parametric method that regresses the adoption outcome on independent
variables using one of Stata’s parametric models, such as probit, logit,
generalized linear models, ordinary least squares, Poisson, or tobit.
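A minimal sketch of the inverse-probability-weighting idea behind the ATE
parameter, using only official commands; this is not the adoption command's
syntax, and the exposure, adoption, and covariate names are hypothetical.
    probit exposed age educ distance          // model exposure to the technology
    predict phat, pr
    generate ipw = exposed / phat             // weight exposed farmers by inverse exposure probability
    mean adopt [pweight = ipw] if exposed     // estimate of population mean potential adoption (ATE)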
Reference:
Diagne, A., and M. Demont. 2007. Taking a new look at empirical models of
adoption: Average treatment effect estimation of adoption rates and their
determinants.
Agricultural Economics 37: 201–210.
Additional information
sd12_diagne.pdf
Graphics (and numerics) for univariate distributions
Nicholas J. Cox
Durham University, UK
How to plot (and summarize) univariate distributions is a staple of
introductory data analysis. Graphical (and numerical) assessment of marginal
and conditional distributions remains important for much statistical
modeling. Research problems can easily evoke needs for many comparisons,
across groups, across variables, across models, and so forth. Over several
centuries, many methods have been suggested, and their relative merits are a
source of lively ongoing debate. I offer a selective but also detailed
review of Stata functionality for univariate distributions. The presentation
ranges from official Stata commands through various user-written commands,
including some new programs, to suggestions on how to code your own graphics
commands when other sources fail. I also discuss both continuous and discrete
distributions. The tradeoff between showing detail and allowing broad
comparisons is an underlying theme.
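For orientation, a few of the official commands such a review builds on,
applied to the auto data shipped with Stata:
    sysuse auto, clear
    histogram mpg, bin(20) normal       // histogram with a normal overlay
    kdensity mpg                        // kernel density estimate
    quantile mpg                        // quantile plot
    qnorm mpg                           // normal quantile-quantile plot
    graph box mpg, over(foreign)        // box plots for group comparison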
Additional information
sd12_cox.ppt
Binary choice models with endogenous regressors
Christopher Baum
Boston College and DIW Berlin
Yingying Dong
University of California–Irvine
Arthur Lewbel
Boston College
Dong and Lewbel have developed the theory of simple estimators for binary
choice models with endogenous or mismeasured regressors that rely on a
“special regressor” as defined by Lewbel (2000). “Control
function” methods such as Stata’s
ivprobit are generally
only valid when the endogenous regressors are continuous. The estimators proposed
here can be used with limited, censored, continuous, or discrete endogenous
regressors, and they have significant advantages over alternatives such as
maximum likelihood and the linear probability model. These estimators are
numerically straightforward to implement. We present and demonstrate an
improved version of a Stata routine that provides both estimation and
postestimation features.
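For contrast, the control-function baseline mentioned above is fit with
official syntax as follows (variable names hypothetical), with y2 the
endogenous regressor and z1, z2 the instruments:
    ivprobit y x1 (y2 = z1 z2)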
Reference:
Lewbel, A. 2000. Semiparametric qualitative response model estimation with
unknown heteroskedasticity and instrumental variables.
Journal of
Econometrics 97: 145–177.
Additional information
sd12_baum.pdf
An application of multiple imputation and sampling-based estimation
Haluk Gedikoglu
Lincoln University of Missouri
Missing data occurs frequently in agricultural household surveys, possibly
leading to biased and inefficient regression estimates. Multiple imputation
can be used to overcome the missing-data problem. Previous studies applied
multiple imputation to datasets where only some of the variables have missing
observations while the rest have no missing observations; in reality,
however, all the variables in a survey might have missing observations.
Currently, there is no theoretical or practical guidance for practitioners on
how to apply multiple imputation when all the variables in a dataset have
missing observations. The objective of this study is to evaluate the impact
of alternative multiple-imputation application methods when all the variables
have missing observations. The data for this study were collected through a
mail survey of 2,995 farmers in Missouri and Iowa in spring 2011. Two
multiple-imputation methods are applied in the imputation step: one using
only the complete observations and the other using all the observations. The
results of the current study show that using all the observations in the
imputation step, even those with missing values, produces estimates with lower
standard errors. Hence, practitioners should use all the observations in the
imputation step.
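A sketch of imputing when every analysis variable has missing values, using
official mi commands; the imputation models and variable names below are
hypothetical, not the study's actual specification.
    mi set mlong
    mi register imputed income acres adopt
    mi impute chained (regress) income acres (logit) adopt, add(20) rseed(12345)
    mi estimate: regress income acres i.adopt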
Additional information
sd12_gedikoglu.pdf
The application of Stata’s multiple-imputation techniques to analyze a design of experiments with multiple responses
Clara Novoa
Texas State University
In this talk, I exemplify the application of the multiple-imputation
techniques available in Stata to analyze a design of experiments with
multiple responses and missing data. No-imputation (complete-case) and
multiple-imputation methodologies are compared.
Additional information
sd12_novoa.pdf
EFA within a CFA context
Phil Ender
UCLA Statistical Consulting
EFA within a CFA framework combines aspects of both EFA and CFA. It uses CFA
to produce a factor solution that is close to an EFA solution while providing
features typically found in CFA, such as standard errors, statistical tests,
and modification indices. In this presentation, I include an example using
the
sem command introduced in Stata 12.
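A minimal sketch of an EFA-within-CFA specification with the official sem
command (item names hypothetical): each factor has one anchor item that loads
only on that factor, all other items load on both factors, factor variances
are fixed at 1, and the factors are allowed to correlate; estat mindices then
reports modification indices, as in CFA.
    sem (F1 -> v1 v2 v3 v5 v6)    ///
        (F2 -> v2 v3 v4 v5 v6),   ///
        var(F1@1 F2@1)
    estat mindices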
Additional information
sd12_ender.pdf
Structural equation modeling using the SEM Builder and the sem command
Kristin MacDonald
StataCorp LP
In this talk, I will give a brief introduction to structural equation
modeling (SEM) and Stata’s
sem command. I will also introduce
the SEM Builder—the graphical user interface for drawing path diagrams,
fitting structural equation models, and analyzing the results. Using the
SEM Builder, we will take a more detailed look at some of the models
commonly fit within the SEM framework including confirmatory factor models,
path models with observed variables, structural models with latent
variables, and multiple group models.
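Two small command-based examples of the model types mentioned (variable names
hypothetical); the same models can be drawn and fit in the SEM Builder.
    sem (Verbal -> x1 x2 x3) (Math -> x4 x5 x6)                 // confirmatory factor model
    sem (Verbal -> x1 x2 x3) (Math -> x4 x5 x6), group(female)  // multiple-group model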
Additional information
sd12_macdonald.pdf
Imagining a Stata/Python combination
James Fiedler
Universities Space Research Association
There are occasions when a task is difficult in Stata but fairly easy in a
more general programming language. Python is a popular language for a range
of uses. It is easy to use, has many high-quality packages, and allows
programs to be written relatively quickly. Is there any advantage to combining Stata
and Python within a single interface? Stata already offers support for
user-written programs, which allow extensive control over calculations but
somewhat less control over graphics. Also, except for specifying output,
the user has minimal programmatic control over the user interface. Python
can be used in ways that allow more control over the interface and graphics,
and in so doing provide roundabout methods for satisfying some user requests
(for example, transparency levels in graphics and the ability to clear the
results window). My talk will explore these ideas, present a possible method
for combining Stata and Python, and give examples to demonstrate how this
combination might be useful.
Additional information
sd12_fiedler.pdf
Issues for analyzing competing-risks data with missing or misclassification in causes
Ronny Westerman
Philipps-University of Marburg
Competing-risks models have a wide range of applications in medical and
public health studies. A key challenge in applying cause-specific survival
models is missing or misclassified cause-of-death information: the cause of
death is often masked because death certificates are incomplete or only
partially identifiable. In this presentation, I will introduce several
approaches to competing-risks models using implemented Stata commands and
discuss their limitations through hands-on examples. I will also introduce
more sophisticated tools for modeling the long-term survival function in the
presence of competing risks. The data analysis uses freely accessible SEER
data from the National Cancer Institute.
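For instance, competing-risks regression is available in official Stata
through stcrreg; a minimal sketch with hypothetical variable names, where
failtype 1 is the cause of interest and 2 is the competing cause (the talk's
specific commands may differ):
    stset time, failure(failtype == 1)
    stcrreg age female, compete(failtype == 2)    // Fine-Gray subhazard regression
    stcurve, cif at1(female=0) at2(female=1)      // cumulative incidence curves by sex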
Additional information
sd12_westerman.pptx
Generating survival data for fitting marginal structural Cox models using Stata
Ehsan Karim
University of British Columbia
Marginal structural models (MSMs) can be used to estimate the effect of a
time-dependent exposure in the presence of time-dependent confounding.
Previously, Fewell et al. (2004) described how to estimate this model in
Stata using a weighted pooled logistic model approximation. However, drawing
on the current literature and some recent simulation results, this
model can be suitably fit in other ways too, and various new weighting
schemes are proposed accordingly. In this presentation, I will first explain
the idea behind MSMs and justify the use of various weighting schemes through
simple examples and tabulations using Stata. Then I will illustrate the
procedure of generating survival data from a Cox MSM by using existing Stata
commands. I will compare the performance of simulated data generation and
the procedure of fitting MSMs via Stata with other standard statistical
packages such as SAS and R.
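A sketch of the weighted pooled logistic approximation with a simple
stabilized weight, using official commands only; variable names are
hypothetical, and the cumulative product of weights over visits (and the
alternative weighting schemes discussed in the talk) is omitted for brevity.
    logit exposure baseage sex                      // weight numerator: baseline covariates
    predict pnum, pr
    logit exposure baseage sex cd4                  // weight denominator: adds time-varying confounder
    predict pden, pr
    generate sw = cond(exposure, pnum/pden, (1-pnum)/(1-pden))
    logit event exposure c.time##c.time [pweight = sw], vce(cluster id)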
Reference:
Fewell, Z., M. A. Hernán, F. Wolfe, K. Tilling, H. Choi, and J. A. C. Sterne.
2004. Controlling for time-dependent confounding using marginal structural models.
Stata Journal 4: 402–420.
Additional information
sd12_karim.pdf
Computing optimal strata bounds using dynamic programming
Eric Miller
Summit Consulting
Stratification is a sampling design that can improve efficiency. It works by
first partitioning the population into homogeneous subgroups and then
performing simple random sampling within each group. For a continuous
variable, stratification involves determining strata boundaries. Holding the
number of strata fixed, a reduction in the width of a given stratum reduces
its associated variance at the expense of the variances from the other
strata. Dynamic programming provides a method for simultaneously minimizing
all the strata variances by determining optimal strata boundaries. In this
presentation, I describe a new user-written command,
optbounds, that
uses dynamic programming to find optimal boundary points for a continuous
stratification variable. The command uses the variance minimization technique
developed by Khan, Nand, and Ahmad (2008). The user first chooses a known
probability distribution that approximates the stratification variable.
Parameter estimates are then generated from the data, and goodness-of-fit
statistics are used to assess the quality of the approximation. A brief
overview of the theory, a description of the command, and several
illustrative examples will be provided.
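The workflow can be sketched as follows; the first two commands are official
Stata, while the optbounds call and its options are hypothetical placeholders
rather than documented syntax.
    generate ln_income = ln(income)
    swilk ln_income                                 // rough check of a lognormal approximation
    optbounds income, dist(lognormal) strata(4)     // hypothetical options, illustration only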
Reference:
Khan, M. G. M., N. Nand, and N. Ahmad. 2008. Determining the optimum strata
boundary points using dynamic programming.
Survey Methodology 34:
205–214.
Additional information
sd12_miller.pdf
Correct standard errors for multistage regression-based estimators: A guide for practitioners with illustrations
Joseph Terza
University of North Carolina–Greensboro
With a view toward lessening the analytic and computational burden faced by
practitioners seeking to correct the standard errors of two-stage estimators,
I offer a heretofore unnoticed simplification of the conventional formulation
for the most commonly encountered cases in empirical
application—two-stage estimators involving maximum likelihood
estimation or nonlinear least squares in either stage. Also with the applied
researcher in mind, I cast the discussion in the context of nonlinear
regression models involving endogeneity—a sampling problem whose
solution often requires two-stage estimation. I detail simplified standard
error formulations for three very useful estimators in applied contexts
involving endogeneity in a nonlinear setting (endogenous regressors,
endogenous sample selection, and causal effects). The analytics and
Stata/Mata code for implementing the simplified formulae are demonstrated
with illustrative real-world examples and simulated data.
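For comparison with the analytic corrections derived in the talk, a common
alternative is to bootstrap both stages jointly, which also yields correct
standard errors; a sketch with hypothetical variable names, where x2 is the
endogenous regressor and z1, z2 are instruments.
    capture program drop twostage
    program define twostage
        regress x2 x1 z1 z2               // first stage
        predict double res, residuals
        poisson y x1 x2 res               // second stage with first-stage residual included
        drop res
    end
    bootstrap _b, reps(500): twostage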
Additional information
sd12_terza.pdf
Shrinkage estimators for structural parameters
Tirthankar Chakravarty
University of California–San Diego
Instrumental-variables estimators of parameters in single-equation structural
models, like 2SLS and LIML, are the most commonly used econometric
estimators. Hausman-type tests are commonly used to choose between OLS and
IV estimators. However, recent research has revealed troublesome size
properties of Wald tests based on these pre-test estimators. These problems
can be circumvented by using shrinkage estimators, particularly
James–Stein estimators. I introduce the
ivshrink command, which
encompasses nearly 20 distinct variants of the shrinkage-type estimators
proposed in the econometrics literature, based on optimal risk properties,
including fixed (k-class estimators are a special case) and data-dependent
shrinkage estimators (random convex combinations of OLS and IV estimators,
for example). Analytical standard errors to be used in Wald-type tests are
provided where appropriate, and bootstrap standard errors are reported
otherwise. Where the variance–covariance matrices of the resulting
estimators are expected to be degenerate, options for matrix norm
regularization are also provided. We illustrate the techniques using a widely
used dataset in the econometric literature.
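The basic idea of a convex combination of OLS and IV estimates can be
illustrated with official commands; this is not ivshrink's syntax, and the
fixed weight below is arbitrary, whereas the estimators in the talk choose it
optimally or from the data.
    regress y x1 x2                           // OLS
    scalar b_ols = _b[x2]
    ivregress 2sls y x1 (x2 = z1 z2)          // IV (2SLS)
    scalar b_iv = _b[x2]
    scalar w = 0.7                            // arbitrary shrinkage weight
    scalar b_shrink = w*b_iv + (1 - w)*b_ols
    display "shrinkage estimate for x2: " b_shrink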
Additional information
sd12_chakravarty.pdf
Stata implementation of the nonparametric spatial heteroskedasticity- and autocorrelation-consistent covariance matrix estimator
P. Wilner Jeanty
Hobby Center for the Study of Texas/Kinder Institute for Urban Research, Rice University
In this talk, I introduce two Stata routines to implement the nonparametric
spatial heteroskedasticity- and autocorrelation-consistent (SHAC) estimator of the
variance–covariance matrix in a spatial context, as proposed by Conley
(1999) and Kelejian and Prucha (2007). The SHAC estimator is robust
against potential misspecification of the disturbance terms and allows for
unknown forms of heteroskedasticity and correlation across spatial units.
Heteroskedasticity is likely to arise when spatial units differ in size or
structural features.
References:
Conley, T. 1999. GMM estimation with cross sectional dependence.
Journal
of Econometrics 92: 1–45.
Kelejian, H. H., and I. R. Prucha. 2010. Specification and estimation of
spatial autoregressive models with autoregressive and heteroskedastic
disturbances.
Journal of Econometrics 157: 53–67.
Additional information
sd12_jeanty.pdf
Big data, little spaces, high speed: Using Stata to analyze the determinants of broadband access in the United States
David Beede
U.S. Department of Commerce
Brittany Bond
U.S. Department of Commerce
This study brings together Census block-level data on broadband service
availability, economics, demographics, regulations, and terrain to model the
supply and demand of high-speed broadband service in the United States.
While Stata is the primary tool for data management and multilevel modeling,
other software tools, such as GIS, are used in conjunction with Stata to
generate visually arresting pictures that help communicate the study’s
findings.
Additional information
sd12_beede.pdf
A comparative analysis of lottery-, charter-, and traditional-based elementary schools within the Anchorage school district
Matthew McCauley
University of Alaska–Anchorage
The growing popularity of alternatives to traditional public
schools—such as charter and lottery-based schools—has prompted
nationwide research. In Anchorage, however, there are few quantitative
studies that compare student performance across traditional public schools,
charter schools, and lottery-based schools. The purpose of this project is to
create and analyze panel data for all public elementary schools within the
Anchorage School District and to compare the achievement of charter and
lottery-based schools with that of traditional schools. This will be done
using public Terra Nova and SBA
data from ASD for the years 2007–2010 in addition to U.S. Census data.
Data will be imported into Stata, a robust statistical software application,
and regression techniques will be used to compare student Terra Nova and SBA
scores while controlling for other factors that also influence test scores.
Additional information
sd12_mccauley.pptx
Matching individuals in the Current Population Survey: A distance-based approach
Stuart Craig
Yale University
In this presentation, I introduce a set of Stata programs designed to match
individuals from year to year in the Current Population Survey (CPS) using a
distance-based measure of similarity. Unlike panel data, the CPS is a
repeated cross section of geographic residences, which are continually
surveyed regardless of whether the occupants are the same. Previous work has
taken the person and household identifiers supplied in the datasets as given
and validated or invalidated identifier-derived matches based on demographic
variables. This work has focused on selecting the best set of demographic
verifiers. Recognizing that there is substantial error in the supplied
identifiers, the distance-based approach extends these methods by treating
demographic variables as pseudo-identifiers and selecting matches based on
a criterion of distance minimization. This approach possesses several
advantages over prior methods. First, by reducing the weight placed on the
survey-provided identifiers, the distance approach provides a matching
technique that can be uniformly applied across the entire CPS series to
create a consistent historical series of CPS matches, even in those years
where the survey-provided identifiers are particularly error-prone. Second,
this approach provides a flexible framework for matching individuals in the
CPS, which allows for the selection of pseudo-identifiers to vary based on
the measurement of interest. Third, it generates a matched series with low
and consistent mismatch rates, which is ideal for measuring secular trends in
dynamics, such as income volatility. Several measures of distance and the
analytical decisions regarding acceptable year-to-year variation are
discussed.
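A conceptual sketch of the distance-minimization step using only official
commands (file and variable names hypothetical): candidate pairs are formed
within households across two survey years, a simple pseudo-identifier
distance is computed, and the closest candidate is kept.
    use cps_year1, clear
    rename (pid age sex educ) (pid1 age1 sex1 educ1)
    joinby hhid using cps_year2                       // all candidate pairs within household
    generate dist = abs(age - age1 - 1) + (sex != sex1) + (educ != educ1)
    bysort hhid pid1 (dist): keep if _n == 1          // keep the minimum-distance match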
Additional information
sd12_craig.pdf
Allocative efficiency analysis using DEA in Stata
Choonjoo Lee
Korea National Defense University
In this presentation, I present a procedure and an illustrative application
of a user-written allocative efficiency (AE) model in Stata. The AE model
measures allocative and economic efficiency as well as technical efficiency
when price and cost information on production is available. The model is an
extension of the basic DEA models that I also wrote.
Additional information
sd12_lee.pdf
Psychometric analysis using Stata
Chuck Huber
StataCorp LP
In this talk, I will provide an overview of Stata features that are typically
used for the analysis of psychometric and educational testing data.
Traditional multivariate tools such as canonical correlation, MANOVA,
multivariate regression, Cronbach’s alpha, exploratory and confirmatory
factor analysis, cluster analysis, and discriminant analysis will be
discussed as well as more modern techniques based on latent trait models such
as the Rasch model, multidimensional scaling, and correspondence analysis.
Multilevel mixed-effects models for continuous, binary, and count outcomes
will be described in the context of both ecological systems theory and
longitudinal data analysis. Structural equation modeling will also be
mentioned but not discussed in detail.
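A few of the official commands the overview draws on (item and variable names
hypothetical): scale reliability, exploratory factor analysis with rotation,
and a mixed-effects model for students nested in schools.
    alpha item1-item20, item                    // Cronbach's alpha with item statistics
    factor item1-item20, ml factors(2)          // maximum likelihood exploratory factor analysis
    rotate, promax                              // oblique rotation
    xtmixed score age || school: || student:    // multilevel model for longitudinal test scores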
Additional information
sd12_huber.pdf
Huber_2012SanDiego.do
Huber_2012SanDiego.dta
Huber_2012SanDiego_Pilot.dta
Huber_2012SanDiego_SEM.stsem
Scientific organizers
Phil Ender (chair), UCLA
A. Colin Cameron, UC Davis
Xiao Chen, UCLA
Estie Hudes, UC San Francisco
Michael Mitchell, U.S. Department of Veterans Affairs
Logistics organizers
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp