Last updated: 16 November 2007
2007 West Coast Stata Users Group meeting
25–26 October 2007
Marina del Rey Hotel
13534 Bali Way
Marina del Rey, CA 90292
Proceedings
Prediction of random effects and effects of misspecification of their distribution
Charles McCulloch
University of California, San Francisco
Statistical models that include random effects are commonly used to analyze
longitudinal and clustered data. These models are often used to derive
predicted values of the random effects, for example in predicting which
physicians or hospitals are performing exceptionally well or exceptionally
poorly. I start this talk with a brief introduction and several examples of
the use of prediction of random effects in practice. In typical
applications, the data analyst specifies a parametric distribution for the
random effects (often Gaussian) although there is little information
available to guide this choice. Are predictions sensitive to this
specification? Through theory, simulations, and an example illustrating the
prediction of who is likely to go on to develop high blood pressure, I show
that misspecification can have a moderate impact on predictions of random
effects and describe simple ways to diagnose such sensitivity.
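For concreteness, a minimal sketch of this kind of prediction (not taken from the talk; variable names such as outcome, age, and hospital are hypothetical) might look like:
    * Two-level linear model with a Gaussian random intercept per hospital
    xtmixed outcome age || hospital:
    * Empirical Bayes (BLUP) predictions of the hospital-level effects
    predict re_hospital, reffects
    * Rank hospitals by their predicted effects
    egen hosp_rank = rank(re_hospital)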
Additional information
West_Coast_Stata_2007_talk_predict_random_effects.pdf (slides)
Panel data methods for microeconometrics using Stata
Colin Cameron
University of California, Davis
This presentation provides an overview of the subset of methods for panel data and
the associated Stata
xt commands most commonly used by
microeconometricians. First, attention is focused on a short panel, meaning
data on many individual units and few time periods. Examples include
longitudinal surveys of many individuals and panel datasets on many firms.
The data can then be viewed as clustered on the individual unit, and the
panel methods used are also applicable to other forms of clustered data,
such as cross-sectional data from individual-level surveys conducted in many
villages with clustering at the village level. Second, emphasis is placed on
using the repeated measures aspect of panel data to estimate key marginal
effects that can be interpreted as measuring causation rather than mere
correlation. The leading methods assume time-invariant individual-specific
effects (or “fixed effects”). Instrumental variables (IV) methods
can also be used, with data from periods other than the current year
potentially serving as instruments. Third, some analyses use dynamic models
rather than static models. Particular interest lies in fitting models
with both lagged dependent variables and fixed effects. The paper
additionally surveys other panel methods used in econometrics, such as those
for nonlinear models and those for dynamic panels with many periods of data.
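As a brief illustration of the short-panel case (a sketch, not part of the talk; the variables y and x, the panel identifier id, and the time variable year are hypothetical):
    * Declare the panel structure
    xtset id year
    * Fixed-effects (within) estimator with cluster-robust standard errors
    xtreg y x, fe vce(cluster id)
    * Random-effects estimator for comparison
    xtreg y x, re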
Additional information
cameronwcsug.pdf (slides)
Repeated measures anova: The wide, the long, and the long
Phil Ender
University of California, Los Angeles
This presentation will give an overview of the three main approaches to
repeated-measures analysis of variance: 1) multivariate models, 2)
traditional anova models, and 3) linear mixed models, along with a
discussion of the advantages and disadvantages of each. The presentation
includes Stata code using
manova,
anova,
regress, and
xtmixed. The
three approaches are illustrated through the use of a split-plot factorial
design with one between-subjects factor and one repeated factor.
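As a rough sketch of the three approaches (not the presenter's code; variable names are hypothetical, with y1-y3 holding the repeated measurements in wide form and y, group, time, and subject in long form):
    * 1) Multivariate approach on wide data
    manova y1 y2 y3 = group
    * 2) Traditional repeated-measures anova on long data
    anova y group / subject|group time group*time, repeated(time)
    * 3) Linear mixed model with a random intercept per subject
    xi: xtmixed y i.group*i.time || subject: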
Additional information
repeated_anova.pdf (slides)
Survey data analysis with Stata 10: Accessible and comprehensive
Christine Wells
University of California, Los Angeles
The presentation will discuss Stata’s evolution into a comprehensive
survey data analysis package by looking at its past, present, and possible
future. Comparisons will be made with other survey data analysis software
packages, such as SUDAAN, WesVar, SAS, and SPSS, with respect to both the
survey designs that can be analyzed and the types of analyses that can be
conducted.
Additional information
Wells_Stata10talk.pdf (slides)
Multilevel modeling of complex survey data
Sophia Rabe-Hesketh
University of California, Berkeley
Survey data are often analyzed using multilevel or hierarchical models. For
example, in education surveys, schools may be sampled at the first stage and
students at the second stage and multilevel models used to model
within-school and between-school variability. An important aspect of most
surveys that is often ignored in multilevel modeling is that units at
each stage are sampled with unequal probabilities. Standard maximum
likelihood estimation can be modified to take the sampling probabilities
into account, yielding pseudomaximum likelihood estimation, which is typically
combined with robust standard errors based on the sandwich estimator. This
approach is implemented in
gllamm. I will introduce the ideas, discuss
issues that arise such as the scaling of the weights, and illustrate the
approach by applying it to data from the Program for International Student
Assessment (PISA).
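As a minimal sketch of this approach (variable names are hypothetical; gllamm expects the level-specific weight variables to share the stub given in pweight(), here wt1 for students and wt2 for schools, with any rescaling of the level-1 weights done beforehand):
    * Two-level random-intercept model with sampling weights and
    * sandwich-based (robust) standard errors
    gllamm score ses, i(school) pweight(wt) adapt robust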
Additional information
stata_sophia.pdf (slides)
Calculating measures of comorbidity using administrative data
Vicki Stagg
University of Calgary
The development of a Stata program to calculate published measures of
comorbidity will be of value to researchers working with inpatient discharge
data coded in ICD-9-CM or ICD-10. The
comorbid command
calculates the weighted sum of comorbidities, as well as comorbidity scores
based on the Charlson Index, which reflects the cumulative increase in
likelihood of 1-year mortality from comorbidities. This allows for
the calculation of three different comorbidity measures: ICD-9-CM, Enhanced
ICD-9-CM, or ICD-10 (Quan et al. 2005). Exclusion of less severe
comorbidities can occur using an optional hierarchical method that excludes
from the calculations a mild comorbidity when a patient has also exhibited a
more severe form of the same diagnosis. The comparable
elixhauser command calculates the sum of this alternate set of
comorbidity measures, which may be associated with negative hospital outcomes
(Elixhauser et al. 1998). Both Stata algorithms can handle patients or
visits as the observational unit. Options allow for a choice of summary
output.
Additional information
Stagg_Stata_Presentation_final.ppt (slides)
stagg_notes_final.pdf (presentation notes)
Managing meta-data in Stata
Elliott Lowy
VA Health Services Research and Development
A collection of user-written commands will be presented, which in one way or
another facilitate dealing with meta-data—from manipulation and
presentation of variable names and types, through labels, notes, and other
meta-data fields included with data files, and on to a command for accessing
small text databases for interrelated datasets.
Additional information
The repository for the ado-files and packages used in this talk can be
found at
http://datadata.info/ado.
It is
easier and more Stata-like to access the repository by typing
net from http://datadata.info/ado
in the command window in Stata. This method also allows web access to
individual help files.
An algorithm for creating models for imputation using the
MICE approach: An application in Stata
Rose Medeiros
University of California, Los Angeles
It is generally advised that imputation models contain as many
“predictor” variables as possible, since the greater the number
of variables, the greater the amount of information available for
estimation (van Buuren, Boshuizen, and Knook 1999). Ideally, an imputation
model might contain all variables in the dataset. Hence, the default in
software packages that perform multivariate imputation by chained equations
(e.g.,
ice in Stata) is often to use all other variables in the imputation
model to predict missing values. However, in datasets with moderate to
large numbers of variables, attempting to use all other variables in the
dataset results in imputation models that are too large to actually run.
One solution to this problem is to select a relatively large, but
reasonable, number of predictors based on bivariate correlations and then
drop predictors as necessary to create a regression model that is tractable
using the complete data. This set of regression models forms the imputation
model for the entire dataset. This presentation outlines this approach in
more detail and presents an overview of the Stata package that implements
it.
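As a schematic example (variable names are hypothetical; the exact syntax for naming the output file varies across ice versions, and the eq() option is the mechanism for supplying reduced, user-specified prediction equations when the default of using all other variables is too large):
    * Impute with chained equations, creating five imputed datasets
    ice age income educ health using imputed.dta, m(5)
    * The same call with reduced prediction equations via eq()
    ice age income educ health using imputed.dta, m(5) eq(income: age educ, health: age income)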
Additional information
medeiros_mice.pdf (slides)
Modeling multiple source risk factor data and health outcomes in twins
Andy Bogart
Jack Goldberg
University of Washington, Seattle
One challenging feature of some medical research is the existence of
multiple sources of exposure information about individual subjects. When an
exposure of interest has been measured in a variety of ways or has been
reported on by multiple informants, analysts must decide how best to
estimate its association with an outcome of interest. Simply performing a
multiple regression analysis of the outcome on all the sources together can
be problematic, since those reports are likely to be highly correlated.
Alternatively, collapsing the reports into one measure invariably
implies an unfortunate loss of information and a nagging question as to
whether one has done the right thing. Instead, we used Stata 9 to implement
a novel application of complex sample survey methods (Pepe, Whitaker, and
Seidel 1999; Horton and Fitzmaurice 2004), which allows simultaneous use of
multiple reports in a single regression model. We further extended the
method to accommodate estimation of within- and between-pair effects in twin
research. The presentation will use Vietnam-era veteran twin data to explore
the association between military service in Vietnam and posttraumatic
stress disorder and to address within- and between-pair effects. We will
gently explore how to properly reshape data, derive necessary variables,
specify models, and implement Stata’s
svy commands to apply the
method.
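As a schematic outline of the data setup (hypothetical variable names, not the authors' code): each informant's report becomes its own observation, and clustering on the twin pair accounts for the correlation among reports.
    * Wide data hold one report per source: exposure1, exposure2, ...
    reshape long exposure, i(pairid twinid) j(source)
    * Declare the twin pair as the primary sampling unit
    svyset pairid
    * Regress the outcome on exposure, allowing source-specific intercepts
    xi: svy: regress ptsd exposure i.source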
References:
Pepe, M. S., R. C. Whitaker, and K. Seidel. 1999. Estimating and
comparing univariate associations with application to the prediction of
adult obesity.
Statistics in Medicine 18: 163–173.
Horton, N. J., and G. M. Fitzmaurice. 2004. Regression analysis of multiple
source and multiple informant data from complex survey samples.
Statistics in Medicine 23: 2911–2933.
Additional information
Bogart_WCSUG_2007_FINAL.ppt (slides)
Rapid formation of regression tables for research purposes
Roy Wada
University of California, Los Angeles
The ostensible reason for preparing regression tables is to submit them to
journals for publication. Contrary to this professed view, regression tables
are mostly used during research, not after it. Journals require regression
tables because they allow visual comparisons across regressions. It is
difficult to compare specifications without placing them in close proximity,
even if that means printing hard copies.
Past users of statistical packages have often resorted to printing hundreds
of pages and flipping them back and forth. The technology for postestimation
display has historically lagged behind the production of estimation itself.
A bottleneck existed in the research process when regressions were produced
much faster than they could be interpreted. The next logical step in the
development of statistical packages is to be able to produce regression
tables as fast and as naturally as performing regressions themselves.
Regression tables ought to be produced easily, rapidly, and sequentially;
they need to be displayed immediately on the computer screen. The
usefulness of regression tables is much reduced if their production is
postponed until the end of the research.
outreg, a program by John Gallup, has been modified
and augmented extensively for this purpose.
outreg2 will immediately
produce and open formatted regression tables in programs associated with
LaTeX, Word, or Excel files.
seeout will immediately display a
regression table in the Stata Data Browser.
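For example (the file name is hypothetical), a typical cycle during interactive work might be:
    * Append each new specification to the same table
    regress price mpg weight
    outreg2 using results, word replace
    regress price mpg weight foreign
    outreg2 using results, word append
    * seeout (from the same package) then shows the table in the Data Browser
    seeout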
Additional information
Rapid_Formation_presentation.pdf (slides)
Rapid_Formation_article.pdf (article)
Syntax coloring, etc.
Elliott Lowy
VA Health Services Research and Development
I will present a sweet syntax-coloring setup for Stata using jEdit, a free,
open-source, Java-based, cross-platform text editor. The syntax coloring
distinguishes commands, variables, macros,
simple and compound quoted strings (and unquoted string literals), and
different kinds of comments. This includes macros inside of strings, strings
in expressions in macro functions, etc. Mata syntax coloring is included. On
the integration side, added bits allow a line, selection, or separately
defined section of code (as well as the whole file) to be run in Stata with
a keystroke. Semicolon-delimited lines and Mata lines are recognized from
context and run correctly. The code can also be run in
do,
run, or
trace modes, as determined by a mode button in jEdit.
Multiline commands (i.e., split with triple slashes) are also recognized and
run as a whole without the need to select all lines.
Additional information
Find all the plug-ins and information about using
jEdit.
Meta-analytical integration of diagnostic accuracy studies in Stata
Ben Dwamena
University of Michigan
This presentation will demonstrate how to perform diagnostic meta-analysis
using
midas, a user-written command.
midas is a comprehensive
program of statistical and graphical routines for undertaking meta-analysis
of diagnostic test performance in Stata. Primary data synthesis is
performed within the bivariate mixed-effects binary regression modeling
framework. Model specification, estimation (by adaptive Gaussian
quadrature), and prediction are carried out with
xtmelogit in Stata
release 10 or
gllamm (Rabe-Hesketh et al.) in Stata release 9. Using
the estimated model coefficients and variance–covariance matrices,
midas calculates summary operating sensitivity and specificity (with
confidence and prediction contours in SROC space), as well as summary
likelihood ratios and odds ratios. Global and relevant test performance metric-specific
heterogeneity statistics are also provided.
midas facilitates
extensive statistical and graphical data synthesis and exploratory analyses
of unobserved heterogeneity, covariate effects, publication bias, and
subgroup analyses. Bayes’ nomograms, likelihood-ratio matrices, and
conditional probability plots may be obtained and used to guide clinical
decision making.
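As a schematic example (variable names are hypothetical), each study contributes the four cells of its 2x2 table:
    * tp, fp, fn, and tn hold each study's true positives, false positives,
    * false negatives, and true negatives
    midas tp fp fn tn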
Additional information
Dwamena_WCSUG2007.pdf (slides)
Estimating heterogeneous choice models with Stata
Richard Williams
University of Notre Dame
When a binary or ordinal regression model incorrectly assumes that error
variances are the same for all cases, the standard errors are wrong and
(unlike OLS regression) the parameter estimates are biased. Heterogeneous
choice/location-scale models explicitly specify the determinants of
heteroskedasticity in an attempt to correct for it. These models are also
useful when the variability of underlying attitudes is itself of substantive
interest. This paper illustrates how Williams’ user-written command
oglm (ordinal generalized linear models) can be used to fit
heterogeneous choice and related models. It further shows how two other
models that have appeared in the literature—Allison’s (1999)
model for comparing logit and probit coefficients across groups, and Hauser
and Andrew’s (2006) logistic response model with partial
proportionality constraints (LRPPC)—are special cases of the
heterogeneous choice model and/or algebraically equivalent to it and can
also be fitted with
oglm. Other key features of
oglm that are
illustrated include support for linear constraints, the use of prefix
commands such as
svy and
stepwise, and the computation of
predicted probabilities and marginal effects.
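For instance (a sketch with hypothetical variables, assuming the hetero() option names the variables in the variance equation):
    * Ordinal outcome with a variance (choice) equation in gender
    oglm warmth male educ income, hetero(male)
    * After svyset, prefix commands such as svy also work, per the talk
    svy: oglm warmth male educ income, hetero(male)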
Additional information
rw_WCSUG2007.pdf (slides)
rw_WCSUG2007.ppt (slides)
rw_WCSUG2007_Handout.pdf (handout)
Using regular expressions for data management in Stata
Rose Medeiros
University of California, Los Angeles
Regular expressions make a number of data management operations involving
string variables much easier. They do this by allowing the user to search
for (and copy or replace) complex patterns of characters within a string.
Examples of when regular expression are useful include extracting zip codes
from addresses, reformatting dates if they were entered in an inconsistent
manner, and removing excess spaces from string expressions. This
presentation will give the user a basic introduction to the use of regular
expressions, and the Stata functions related to regular expressions, as well
as examples of applications where regular expressions can be used to
streamline data management.
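For example (the variable address is hypothetical):
    * Extract a 5-digit zip code from an address
    gen zip = regexs(0) if regexm(address, "[0-9][0-9][0-9][0-9][0-9]")
    * Replace the first run of repeated spaces with a single space
    * (regexr() changes only the first match; loop to catch them all)
    replace address = regexr(address, "  +", " ")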
Additional information
medeiros_reg_ex.pdf (slides)
Teaching with Stata
Alan Acock
Tony Lachenbruch
Oregon State University
Stata is a useful tool to demonstrate statistical concepts to elementary
(and advanced) statistics classes. For elementary classes, one of the
challenges is to avoid turning the class into one about how to use Stata and
to keep the focus on learning statistics. We have found a lab helpful for teaching
students how to use Stata. The basic commands need to be demonstrated, and
since most students don’t have full Stata documentation, some simple
command descriptions are useful. It is also a good idea to use datasets
from real life to illustrate the ideas. Some pitfalls can be shown; our
greatest goof (one we continue to make) arises when using logical
expressions to create new variables, where missing values are always an issue.
Some moderately advanced ideas can be introduced into the elementary class.
Tony Lachenbruch is experimenting with the permutation and bootstrap
commands this year. Alan Acock is trying to find a way to move a college of
SPSS and SAS users to Stata by getting students on the Stata bandwagon. Alan
Acock is also trying to find which user-written commands should be
incorporated in the first-year labs.
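The missing-value goof mentioned above arises, for example, because Stata treats missing values as larger than any number (variable names hypothetical):
    * Missing incomes are (incorrectly) coded as high income here
    gen high_income = (income > 50000)
    * Safer: exclude missing values explicitly
    gen high_income2 = (income > 50000) if !missing(income)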
Additional information
Teaching_with_Stata_alan.ppt (slides by Alan Acock)
Teaching_with_Stata_Tony.ppt (slides by Tony Lachenbruch)
Graph Editing
Vince Wiggins
StataCorp
We will take a quick tour of the Graph Editor, covering the basic concepts:
adding text, lines, and markers; changing the defaults for added objects;
changing properties; working quickly by combining the contextual toolbars
with the more complete object dialogs; and using the object browser
effectively. Leveraging these concepts, we'll discuss how and when to use
the grid editor and techniques for combined and by-graphs. Finally, we will
look at some tricks and features that aren't apparent at first blush.
Creating self-validating datasets
Bill Rising
StataCorp
One of Stata’s great strengths is its data management abilities. When
assembling, sharing, or using shared datasets, some of the most
time-consuming activities are validating the data and writing documentation
for the data. Much of this effort could be avoided if datasets were
self-contained, i.e., if they could validate themselves. I will show how to
achieve this goal within Stata by attaching validation rules to the
variables themselves via Stata’s characteristics. I will show a
dialog box that makes attaching simple validation rules to variables easy
enough that for most rules no Stata expertise is needed, yet which also
allows arbitrarily complicated validation rules. Along with this I'll
demonstrate commands for running error checks, or marking suspicious
observations, as well as documenting the validation rules. The validation
system is flexible enough that simple checks continue to work even if
variable names change or if the data are reshaped, and it is rich enough
that validation may depend on other variables in the dataset. Since the
validation is at the variable level, the self-validation continues to work
if variables are recombined with data from other datasets. With these tools,
Stata’s datasets can become truly self-contained.
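As a purely illustrative sketch of the underlying idea (the characteristic name and checking code here are hypothetical, not the actual syntax of the commands presented):
    * Store a validation rule in a variable characteristic ...
    char age[validrule] "age >= 0 & age <= 120"
    * ... and retrieve it later to flag suspicious observations
    local rule : char age[validrule]
    count if !(`rule') & !missing(age)
    display r(N) " observations violate the rule for age"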
Additional information
ckvarTalk.beamer.pdf (slides)
Estimating average treatment effects in Stata
Guido Imbens
Harvard University
In this talk, I look at several methods for estimating average effects of a
program, treatment, or regime, under unconfoundedness. The setting is one
with a binary program. The traditional example in economics is that of a
labor market program where some individuals receive training and others do
not, and interest is in some measure of the effectiveness of the training.
Unconfoundedness, a term coined by Rubin (1990), refers to the case where
(nonparametrically) adjusting for differences in a fixed set of covariates
removes biases in comparisons between treated and control units, thus
allowing for a causal interpretation of those adjusted differences. This is
perhaps the most important special case for estimating average treatment
effects in practice.
Under the specific assumptions we make in this setting, the
population-average treatment effect can be estimated at the standard
parametric root-N rate without functional form assumptions. A variety of
estimators, at first sight quite different, have been proposed for
implementing this. The estimators include regression estimators, propensity
score based estimators, and matching estimators. Many of these are used in
practice, although rarely is this choice motivated by principled arguments.
In practice, the differences between the estimators are relatively minor when
applied appropriately, although matching in combination with regression is
generally more robust and is probably the recommended choice. More important
than the choice of estimator are two other issues. Both involve analyses of
the data without the outcome variable. First, one should carefully check the
extent of the overlap in covariate distributions between the treatment and
control groups. Often there is a need for some trimming based on the
covariate values if the original sample is not well balanced. Without this,
estimates of average treatment effects can be sensitive to the choice
of, and small changes in the implementation of, the estimators. In this part
of the analysis, the propensity score plays an important role. Second, it is
useful to do some assessment of the appropriateness of the unconfoundedness
assumption. Although this assumption is not directly testable, its
plausibility can often be assessed using lagged values of the outcome as
pseudo-outcomes. Another issue is variance estimation. For matching
estimators, bootstrapping, although widely used, has been shown to be
invalid. I discuss general methods for estimating the conditional variance
that do not involve resampling.
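To make the overlap check concrete, a minimal sketch (hypothetical variable names, not code from the talk) using the estimated propensity score:
    * Estimate the propensity score
    logit treat age educ married re74 re75
    predict pscore, pr
    * Compare its distribution across treated and control units
    * before estimating any treatment effects
    summarize pscore if treat == 1
    summarize pscore if treat == 0
    histogram pscore, by(treat)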
Additional information
stata_07oct_final.pdf (slides)
Scientific organizers
Colin Cameron, UC Davis
Xiao Chen, UCLA
Phil Ender, UCLA
Estie Hudes, UCSF
Tony Lachenbruch, Oregon State
Bill Mason (cochair), UCLA
Sophia Rabe-Hesketh (cochair), UC Berkeley
Logistics organizers
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp