Last updated: 24 November 2008
2008 Fall North American Stata Users Group meeting
13–14 November 2008
The Handlery Union Square Hotel
351 Geary Street
San Francisco, CA 94102
Proceedings
Tricks of the trade: Getting the most out of xtmixed
Roberto G. Gutierrez
StataCorp
Stata’s
xtmixed command can be used to fit mixed models, models
that contain both fixed and random effects. The fixed effects are merely the
coefficients from a standard linear regression. The random effects are not
directly estimated but summarized by their variance components, which are
estimated from the data. As such,
xtmixed is typically used to
incorporate complex and multilevel random-effects structures into standard
linear regression.
xtmixed’s syntax is complex but versatile,
allowing it to be widely used, even for situations that do not fit the
classical “mixed” framework. In this talk, I will give a
tutorial on uses of
xtmixed that are not commonly considered, including
examples of heteroskedastic errors, group structures on random effects, and
smoothing via penalized splines.
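As a flavor of the first of these, here is a minimal sketch of heteroskedastic errors in xtmixed, with hypothetical variables y, x, subj, and grp:

    * random intercept and slope for each subject, with a separate
    * residual variance estimated within each level of grp
    xtmixed y x || subj: x, residuals(independent, by(grp))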
Additional information
gutierrez.pdf
Multilevel modeling of educational longitudinal data
with crossed random effects
Minjeong Jeon
University of California–Berkeley
Sophia Rabe-Hesketh
University of California–Berkeley
We consider multilevel models for longitudinal data where membership in the
highest level units changes over time. The application is a four-year study
of Korean students who are in middle school during the first two waves and
in high school during the second two waves, where middle schools and high
schools are not nested. The model includes crossed random effects for middle
schools and high schools and can be estimated by using Stata’s
xtmixed command. An important consideration is how the impact of the
middle school and high school random effects on the response variable should
be allowed to change over time.
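For reference, crossed random effects of this kind can be requested in xtmixed with the _all: notation; a minimal sketch with hypothetical variables score, wave, ms, and hs:

    * non-nested random intercepts for middle school (ms) and high school (hs)
    xtmixed score wave || _all: R.ms || _all: R.hs
    * (the last crossed factor can equivalently be written || hs: to save memory)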
The consequences of misspecifying the random-effects
distribution when fitting generalized linear mixed models
John M. Neuhaus
University of California–San Francisco
Charles E. McCulloch
University of California–San Francisco
Generalized linear mixed models provide effective analyses of clustered and
longitudinal data and typically require the specification of the
distribution of the random effects. The consequences of misspecifying this
distribution are subject to debate; some authors suggest that large biases
can arise, while others show that there will typically be little bias for the
parameters of interest. Using analytic results, simulation studies, and
example data, I summarize the findings of extensive assessments
of the bias in parameter estimates due to random-effects distribution
misspecification. I also present assessments of the accuracy of
random-effects predictions under misspecification. These assessments
indicate that random-effects distribution misspecification often produces
little bias when estimating slope coefficients but may yield biased
estimators of intercepts and variance components, as well as mildly inaccurate
predicted random effects.
Additional information
neuhaus_stata2008.talk.pdf
Prediction in multilevel logistic regression
Sophia Rabe-Hesketh
University of California–Berkeley
Anders Skrondal
Norwegian Institute of Public Health
This presentation focuses on predicted probabilities for multilevel models
for dichotomous or ordinal responses. For instance, in a three-level model
with patients nested in doctors nested in hospitals, predictions for
patients could be for new or existing doctors and, in the latter case, for
new or existing hospitals. In a new version of
gllamm, these
different types of predicted probabilities can be obtained very easily. We
will give examples of graphs that can be used to help interpret an estimated
model. We will also introduce a program we have written to
construct 95% confidence intervals for predicted probabilities.
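To illustrate the distinction, here is a minimal two-level sketch with hypothetical variable names (gllapred's mu and marginal options are as documented):

    * patients nested in doctors
    gllamm improved age severity, i(doctor) link(logit) family(binom) adapt
    * a new doctor: integrate the random effects out of the probability
    gllapred pr_new, mu marginal
    * an existing doctor: condition on the posterior means of its effects
    gllapred pr_exist, mu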
Additional information
rabe_hesketh_predict5.pdf
Heteroskedasticity, extremity, and moderation in
heterogeneous choice models
Garrett Glasgow
University of California–Santa Barbara
Heterogeneous choice models are extensions of binary and ordinal regression
models that explicitly model the determinants of heteroskedasticity. I show
that, often, moderation (proximity to a choice threshold) will produce
empirical results identical to heteroskedasticity in binary heterogeneous
choice models, while extremity (a preference for endpoint categories) will
produce empirical results identical to heteroskedasticity in ordinal
heterogeneous choice models. I show how a simple extension of
Williams’ user-written
oglm command can create
ordered heterogeneous choice models that can distinguish between
heteroskedasticity, extremity, and moderation.
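As a sketch of the starting point (variable names hypothetical), the standard heterogeneous ordered logit in oglm models the outcome and the error variance jointly:

    * z1 and z2 enter the variance equation
    ssc install oglm
    oglm y x1 x2, hetero(z1 z2)

The extension discussed in the talk adds terms that separate extremity and moderation from genuine heteroskedasticity.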
Additional information
glasgow_stata.ppt
A generalized meta-analysis model for binary diagnostic test performance
Ben Dwamena
University of Michigan Radiology and VA Nuclear Medicine
Methods for meta-analysis of diagnostic test accuracy studies must account
for unobserved heterogeneity as well as covariate heterogeneity,
threshold effects, methodological quality, and small-study bias, which
constitute the major threats to the validity of meta-analytic results. These
have traditionally been addressed independently of each other. Two recent
methodological advances include 1) the bivariate random-effects model for
joint synthesis of sensitivity and specificity, which accounts for unobserved
heterogeneity and threshold variation using random effects, and covariate and
quality effects as independent variables in a metaregression; and 2) a
linear regression test for funnel plot asymmetry in which the diagnostic
odds ratio as an effect-size measure is regressed on effective sample size as a
precision measure. I propose a generalized framework for diagnostic
meta-analysis that integrates both developments based on a modification of
the bivariate Dale model, in which two univariate random-effects logistic
models for sensitivity and specificity are associated through a log-linear
model of odds ratios with the effective sample size as an independent
variable. This framework unifies the estimation of the summary test
performance and the assessment of the presence, extent, and sources of
variability. Taking advantage of the ability of gllamm to model a
mixture of discrete and continuous outcomes, I will discuss specification,
estimation, diagnostics, and prediction of the model, using a motivating
dataset of 43 studies investigating FDG-PET for staging the axilla in
patients with newly diagnosed breast cancer.
It’s a little different with survey data
Christine Wells
UCLA
Analyzing survey data is different from analyzing data generated by
experiments in several important ways. I will discuss these differences
using the NHANES III adult dataset as an example. Topics will
include specifying the survey elements, analysis of subpopulations, model
diagnostics, and model comparison.
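A minimal sketch of the first two topics, with hypothetical NHANES-style variable names:

    * declare the design elements once, then prefix analyses with svy:
    svyset psu [pweight = sampwt], strata(stratum)
    * analyze a subpopulation without dropping observations
    svy, subpop(adult): logistic highbp age bmi
    estat effects          // design effects as a simple diagnostic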
Additional information
wells_stata2008.pdf
Using Stata’s capabilities to assess the performance of
Latin American students in mathematics, reading, and science
Roy Costilla
LLECE/UNESCO Santiago
Stata is a very good tool for analyzing survey data. It accounts for
many important aspects of complex survey design and offers
alternative variance-estimation methods. Through its matrix and macro
languages, it also lets the user store and manage results conveniently,
automating the entire estimation and testing process.
I will discuss the estimation of the main results of the
Second Regional Comparative and Explanatory Study, an
assessment of the performance in the domains of mathematics, reading, and
science of third- and sixth-grade students in 16 countries of Latin
America in 2005–2006. In particular, I will consider the estimation of
mean scores and their variability by country, area, grade, and
subpopulation. I will also present comparisons of performance across
countries and subpopulations.
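A sketch of the kind of estimation involved, with hypothetical variable names:

    * mean scores and linearized standard errors by country
    svyset school [pweight = studwt], strata(stratum)
    svy: mean math reading science, over(country)

Between-country contrasts can then be tested with lincom or test.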
Additional information
costilla_serce_stata_sfo.pdf
3-way ANOVA interactions: Deconstructed
Phil Ender
UCLA
I will present three approaches to understanding 3-way ANOVA interactions:
1) a conceptual approach, 2) an ANOVA approach, and 3) a regression approach
using dummy coding. I will illustrate all three approaches with
a synthetic dataset exhibiting a significant 3-way interaction.
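For concreteness, the ANOVA and regression approaches line up as follows (a, b, and c are hypothetical factors):

    * the same 3-way factorial model fit two ways
    anova y a##b##c
    regress y i.a##i.b##i.c
    * deconstruct the 3-way term: the a#b interaction at each level of c
    contrast a#b@c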
Additional information
ender_3way_anova.pdf
Some challenges in survival analysis with large datasets
Noori Akhtar-Danesh
McMaster University
In this presentation, I demonstrate some common challenges with large
datasets in survival analysis. I investigate the relationship between the age
of smoking initiation and some demographic factors in the Canadian Community
Health Survey, Cycle 3.1 (CCHS-3.1) dataset. CCHS-3.1 is a large dataset
that includes information for over 130,000 individuals. I use different
techniques for model fitting and model checking. Test-based techniques for
assessing the proportional-hazards (PH) assumption are not very useful
because, with so many observations, even a small deviation from the
theoretical model leads to rejection of the PH assumption. In contrast,
graphical approaches seem more helpful, although not every diagnostic graph
can be drawn with a dataset this large. Preliminary results show that 63%
of Canadians have ever smoked a
whole cigarette. Therefore, it seems more appropriate to use a cure fraction
model (Lambert, 2007, Stata Journal 7: 351–375) to handle the large proportion of
censored data. However, sampling weights cannot be used in this model. In
conclusion, survival analysis with large datasets is not straightforward:
assessing the PH assumption and drawing diagnostic graphs are challenging,
and the cure fraction model may not be appropriate when sampling weights
cannot be incorporated in the estimation.
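A sketch of the checks and the cure model mentioned above (variable names hypothetical; strsmix is Lambert's command, assumed installed from the Stata Journal site):

    stset agesmoke, failure(eversmoke)
    stcox i.sex agegrp
    estat phtest, detail      // score tests of the PH assumption
    stphplot, by(sex)         // log-log survival plot as a graphical check
    * mixture cure model for the large non-susceptible fraction
    strsmix sex, distribution(weibull) link(logistic)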
Additional information
akhtar_danesh_stata2008_meeting.ppt
Stata and the one-armed bandit
Martin Weiss
University of Tuebingen, Germany
Using Stata, I have researched the market efficiency of the German 6/49
parimutuel lottery game. I investigate the existence of profit opportunities
for particularly unpopular combinations of numbers (Papachristou and
Karamanis 1998), employing the covariates proposed by Henze and Riedwyl
(1998). Furthermore, I examine the time-series behavior of stakes bet in
relation to the size of the jackpot in the respective draw. In particular, I
attempt to verify the conjecture that the skewness of the payoff
distribution drives bettors' appetite for participation (Golec and Tamarkin
1998). Along the way, I show how one can set up Stata to retrieve data
from the Internet, unpack them automatically, and shape them for the
analysis. I also show how one can schedule tasks to automate the process
further.
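The retrieval step, for example, can be scripted entirely within Stata (URL and file names hypothetical):

    * fetch, unpack, and load the draw history
    copy "http://example.com/lotto6aus49.zip" draws.zip, replace
    unzipfile draws.zip, replace
    import delimited using draws.csv, clear

Running such a do-file in batch mode (stata -b do fetch.do) makes it easy to hand off to the operating system's task scheduler.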
References
Golec, J., and M. Tamarkin. 1998. Bettors love skewness, not risk, at the horse track. Journal of Political Economy 106: 205–225.
Henze, N., and H. Riedwyl. 1998. How to Win More: Strategies for Increasing a Lottery Win. Natick, MA: A K Peters.
Papachristou, G., and D. Karamanis. 1998. Investigating efficiency in betting markets: Evidence from the Greek 6/49 lotto. Journal of Banking & Finance 22: 1597–1615.
Additional information
Presentation_Nov_13_Martin_Weiss.pdf
Semiparametric analysis of case–control genetic data in
the presence of environmental factors
Yulia Marchenko
StataCorp
In the past decade, many statistical methods have been proposed for the
analysis of case–control genetic data with an emphasis on
haplotype-based disease association studies. Most of the methodology has
concentrated on the estimation of genetic (haplotype) main effects,
accounting for environmental and gene–environment interaction effects
through prospective-type analyses that may lead to biased estimates
when used with case–control data. Several recent publications
have addressed the issue of retrospective sampling in the analysis of
case–control genetic data in the presence of environmental factors by
developing new efficient semiparametric statistical methods. I present a
new Stata command,
haplologit, that implements efficient
profile-likelihood semiparametric methods for fitting gene–environment
models in the very important special cases of 1) a rare disease, 2) a
single candidate gene in Hardy–Weinberg equilibrium, and 3)
the independence of genetic and environmental factors.
Additional information
marchenko_SF08.pdf
Using Mata to work more effectively with Stata: A tutorial
Christopher F. Baum
Boston College and DIW Berlin
Stata’s matrix language, Mata, highlighted in Bill Gould’s Mata
Matters columns in the
Stata Journal, is very useful and powerful in
its interactive mode. Stata users who write do-files or ado-files should
gain an understanding of the Stata–Mata interface: how Mata can be
called upon to do one or more tasks and return its results to Stata. Mata's
broad and extensible menu of functions offers assistance with many
programming tasks, including many that are not matrix-oriented. In this
tutorial, I will present examples of how do-file and ado-file writers might
effectively use Mata in their work.
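For a taste of that interface, here is a minimal round trip (everything below is illustrative):

    * pass a Stata matrix to Mata, compute there, and post the result back
    matrix A = (1, 2 \ 3, 4)
    mata:
    X = st_matrix("A")                // read the Stata matrix into Mata
    st_numscalar("total", sum(X))     // write a Stata scalar with the sum
    end
    display total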
Additional information
baum_StataMata.beamer.FNASUG08.pdf
Mata utilities
Elliott Lowy
VAPSHCS HSR&D
I will present a selection of user-written Mata functions that serve to
streamline the process of writing other Mata functions, and I will
demonstrate what makes them handy. I will present debugging/programming
functions for the following: dropping and re-creating one or a few functions
without clearing Mata of all other useful info; displaying the contents of
a matrix in a compact and informative way; and copying private function
information into the global space. I will present text-handling functions
for the following: concatenating and dividing blocks of text; processing
lists of file/directory paths; and converting between matrices of text and
ASCII values. I will present more general-purpose functions for the
following: combining matrices of different sizes; reading and writing Mata
matrices to spreadsheet files; generating a map of matching values in two
matrices; and returning an entire (small) matrix of values to Stata locals.
I will finish with a combined Stata/Mata command for storing Stata command
preferences.
Additional information
Package available by typing net from http://datadata.info/ado within
Stata.
Estimating user-defined nonlinear regression models in
Stata and in Mata
Colin Cameron
University of California–Davis
This talk will be an overview of how to estimate nonlinear regression models
that are not covered by Stata’s many built-in estimation commands.
The Mata
optimize() function will be emphasized, and the Stata
ml
command will also be covered. The material is drawn from chapter 11 of
Cameron and Trivedi’s (2009)
Microeconometrics Using Stata,
Stata Press.
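As a preview of the flavor of the material, here is a minimal d0-type optimize() evaluator for a Poisson MLE on toy data (illustrative only, not drawn from the book):

    mata:
    void lnpois(todo, b, y, X, val, g, H)
    {
        real colvector xb
        xb  = X * b'
        val = sum(y :* xb - exp(xb) - lnfactorial(y))   // Poisson log likelihood
    }
    y = (1 \ 2 \ 0 \ 3)                  // toy outcome
    X = (1, 1 \ 1, 2 \ 1, 3 \ 1, 4)      // constant and one covariate
    S = optimize_init()
    optimize_init_evaluator(S, &lnpois())
    optimize_init_evaluatortype(S, "d0")
    optimize_init_argument(S, 1, y)
    optimize_init_argument(S, 2, X)
    optimize_init_params(S, (0, 0))
    b = optimize(S)
    b
    end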
Additional information
cameronwcsug2008.pdf
Automated stress tests for econometric models
Roy Epstein
Boston College
I will present a Stata program for improved quality control of
econometric models. Reported econometric results often
have unknown reliability because of selective reporting by the researcher.
In particular,
t-statistics are often uninformative or misleading when
multiple models are estimated from the same dataset. Econometric best
practices should include routine stress tests to assess the robustness of
estimation results to reasonable perturbations of the model specification
and underlying data. It is feasible to implement these tests as standard
outputs from the statistical software. This information should lead to
greater transparency and greater ability of others to interpret a given
regression. The Stata program I will discuss can be used after commands
that perform cross-section, time-series, and panel regression. It is easily
extensible to include additional tests as desired.
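One perturbation of the kind envisioned, sketched here on the auto data rather than with the actual program, re-estimates a model dropping one control at a time and tracks the focal coefficient:

    sysuse auto, clear
    local controls weight length foreign
    foreach omit of local controls {
        local keep : list controls - omit
        quietly regress price mpg `keep'
        display "omitting `omit': b[mpg] = " %9.3f _b[mpg]
    }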
Additional information
epstein_stata_november_2008.ppt
Data I/O commands
Elliott Lowy
VAPSHCS HSR&D
While Stata, of course, comes with a serviceable set of I/O commands,
I have found room for improvement. I will present a set of user-written
commands for using, saving, appending, and merging. Highlights include
wildcards in file paths, drastically reducing the amount that needs to be
typed; options to change the working directory to match the file specified;
quick reloading of the current analysis file; saving partial datasets;
using/appending sets of multiple data files; transparent use of Stat/Transfer
within all commands to use, save, append, and merge from and/or to other
formats such as SAS and Excel; maintaining a “recent file” list
through the command interface; and eliminating the irritating irregular need
for quotes.
The merge command, in particular, has an even larger set of
advantages that, together with those above, mean never having to
open and fiddle with a file before merging it. These advantages include
merging on disparately named variables; automatic conversion of
string/numeric variables; case-insensitive merging; renaming variables added
to the current data; automatic tabulation of the (labeled) _merge variable
or summarized merge information with automatic deletion of the _merge
variable; automatic deletion of matched or unmatched records; merging with a
single record from a multiply-matching merge file; and true many-to-many
merging.
Additional information
Package available by typing net from http://datadata.info/ado within
Stata.
Causal regression with imputed estimating equations
Joseph Schafer
Penn State
Joseph Kang
Penn State
Literature on causal inference has emphasized the average causal effect,
defined as the mean difference in potential outcomes under different
treatment conditions. We consider marginal regression models that describe
how causal effects vary in relation to covariates. To estimate parameters,
we replace missing potential outcomes in estimating functions with fitted
values from imputation models that include confounders and prognostic
variables as predictors. When the imputation and analytic models are
linear, our procedure is equivalent to maximum likelihood for normally
distributed outcomes and covariates. Robustness to misspecification of the
imputation models is enhanced by including functions of propensity scores as
regressors. In simulations where the analytic, imputation, and propensity
models are misspecified, the method performs better than inverse-propensity
weighting. Using data from the National Longitudinal Study of Adolescent
Health, we analyze the effects of dieting on emotional distress in the
population of girls who diet, taking into account the study's complex sample
design.
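A much-simplified illustration of the imputation idea, not the authors' estimator (variable names hypothetical): fit outcome models that include the propensity score, fill in each unit's missing potential outcome, and regress the imputed effects on a covariate of interest:

    * propensity score included as a regressor for robustness
    logit treat x1 x2
    predict ps, pr
    * arm-specific outcome models, predicted for everyone
    regress y x1 x2 ps if treat
    predict y1hat
    regress y x1 x2 ps if !treat
    predict y0hat
    * imputed unit-level effects, then effect-modification regression
    generate eff = cond(treat, y - y0hat, y1hat - y)
    regress eff z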
Likelihood-ratio tests for multiply imputed datasets:
Introducing milrtest
Rose Medeiros
UCLA
Through the use of user-written programs, primarily
mim (Carlin,
Galati, and Royston, 2008,
Stata Journal 8: 49–67), Stata users can analyze multiply imputed (MI)
datasets. Among other capabilities,
mim allows the user to estimate
a range of regression models and to perform a multiparameter hypothesis test
after model estimation using a Wald test. The program presented here,
milrtest, lets the user perform likelihood-ratio tests on nested models
fit with mim to MI datasets, providing an additional means of testing
hypotheses after estimation with MI data. The process used to perform the
likelihood-ratio tests is described in Meng and Rubin (1992,
Biometrika 79: 103–111). The test
statistic is calculated based on two sets of likelihood-ratio tests. The
first involves calculating the likelihood ratio for the null versus
the alternative hypothesis in each of the MI datasets. The second involves
calculating the likelihood-ratio statistic in each of the MI datasets with
the parameters constrained to the combined MI estimates (i.e., the average
of the parameter estimates across the MI datasets). The current version
supports a limited number of regression commands (i.e.,
regress,
logit, and
ologit), but subsequent versions may include
compatibility with additional commands.
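For reference, the combined statistic of Meng and Rubin (1992), with m imputations, k constraints, d_i the likelihood-ratio statistic in imputation i at its own estimates, and \tilde{d}_i the statistic evaluated at the averaged MI estimates (a sketch of the paper's result, not of milrtest's code):

    \bar{d} = \frac{1}{m}\sum_{i=1}^{m} d_i, \qquad
    \tilde{d} = \frac{1}{m}\sum_{i=1}^{m} \tilde{d}_i, \qquad
    r = \frac{m+1}{k(m-1)}\bigl(\bar{d} - \tilde{d}\bigr), \qquad
    D = \frac{\tilde{d}}{k(1+r)},

where D is referred to an F distribution with k numerator degrees of freedom.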
Additional information
medeiros_2008.pdf
UCLA ATS/Stat consulting service model for Stata users
Xiao Chen
UCLA
The Statistical Consulting Group provides a variety of resources to Stata
users on campus, from walk-in and email consulting to an extensive website
of materials related to Stata. In this presentation, I will explain how the group
offers such services and will discuss the three major components of the
consulting process: consulting, learning, and documenting. I will also
discuss the benefits and challenges involved in sharing contributed Stata
packages with clients, and the role the Internet has played in shaping the
collaboration aspect of our consulting model.
Scientific organizers
Xiao Chen (cochair), UCLA
Sophia Rabe-Hesketh (cochair), UC Berkeley
Phil Ender, UCLA
Estie Hudes, UCSF
Tony Lachenbruch, Oregon State
Bill Mason, UCLA
Doug Steigerwald, UC Santa Barbara
Logistics organizers
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp