Last updated: 21 July 2010
2010 Stata Conference Boston
15–16 July 2010
Omni Parker House
60 School Street
Boston, MA 02108
Proceedings
Regression for nonnegative skewed dependent variables
Austin Nichols
Urban Institute
In this presentation, I compare several options for estimation and prediction in regressions using
nonnegative skewed dependent variables. Often, Poisson
regression outperforms competitors, even when its assumptions are violated
and the correct model is one that justifies a competitor.
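One reason often given for this kind of result can be seen in the simplest possible case. The sketch below is not taken from the talk; it uses made-up data and intercept-only models to contrast the prediction from a Poisson fit with the naive retransformed prediction from OLS on log(y).

```python
import math

# Hypothetical skewed, positive outcome values (illustration only).
y = [1.0, 2.0, 4.0, 8.0, 85.0]
mean_y = sum(y) / len(y)

# Intercept-only Poisson regression: the score equation sum(y - exp(b0)) = 0
# is solved by b0 = log(mean(y)), so the fitted value is the sample mean.
b0_poisson = math.log(mean_y)
pred_poisson = math.exp(b0_poisson)

# Intercept-only OLS on log(y): b0 = mean(log(y)). Exponentiating it gives
# the geometric mean, which cannot exceed the arithmetic mean.
# Note that log(y) is undefined if any y equals zero.
b0_logols = sum(math.log(v) for v in y) / len(y)
pred_logols = math.exp(b0_logols)

print(pred_poisson, pred_logols)
```

In this toy case the Poisson prediction reproduces the sample mean, while the naive log-OLS prediction falls below it unless a retransformation correction is applied.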
Additional information
boston10_nichols.pdf
Margins and the Tao of interaction
Phil Ender
UCLA Statistical Consulting Group
In this presentation, I show how to use the new
margins command,
introduced in Stata 11, to explore interactions in regression and analysis
of variance. I cover three types of interactions: 1) categorical
by categorical, 2) categorical by continuous, and 3) continuous by
continuous. I also cover issues concerning graphing of interactions, along
with hypothesis testing that is appropriate for interactions.
Additional information
boston10_ender.pdf
To the vector belong the spoils: Circular statistics in Stata
Nicholas J. Cox
Durham University
Circular statistics are needed when one or more variables have outcome space
in a circle, which is, for example, true for data measured with reference to
a compass, a clock, or a calendar. Applications abound in the earth and
environmental sciences and other disciplines, such as music, not to mention
the economic and medical fields that are well represented among Stata users.
A talk on circular statistics was given in Boston in 2001. In this update, I
survey the field with special reference to recently revised or newly
written programs for graphics, modeling, testing, and summary.
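As a small illustration of why circular data need their own summaries, the following sketch (in Python rather than Stata, and not one of the programs surveyed in the talk) computes a mean direction by averaging unit vectors. The function name and the sample bearings are hypothetical.

```python
import math

def circular_mean_deg(angles_deg):
    """Return (mean direction in degrees, mean resultant length)."""
    if not angles_deg:
        raise ValueError("need at least one angle")
    n = len(angles_deg)
    # Treat each angle as a unit vector and average the components.
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    r = math.hypot(c, s)  # close to 1: concentrated; close to 0: dispersed
    # The direction is not meaningful when r is (near) zero.
    mean = math.degrees(math.atan2(s, c)) % 360.0
    if mean >= 360.0:  # guard against floating-point wraparound
        mean = 0.0
    return mean, r

# Hypothetical compass bearings clustered around north.
bearings = [355.0, 5.0, 15.0]
mean_dir, r = circular_mean_deg(bearings)
naive = sum(bearings) / len(bearings)
print(mean_dir, r, naive)
```

The vector mean stays near 5 degrees, whereas the ordinary arithmetic mean of the same bearings is 125 degrees, which points in the wrong direction entirely.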
Additional information
boston10_cox.zip
System for formatting tables
John Gallup
Portland State University
The addition to Stata of a system for formatting tables enables extensive
formatting of statistical tables created within Stata, ultimately allowing
users to create native Word or TeX tables. In this presentation, I talk
about this system, which is intended for use by programmers. Users may
specify font sizes, font types, text justification, cell height and width,
cell boundary lines of different styles, titles, labels, and footnotes,
among other attributes. New data may be merged or appended to existing
tables to create more-complex tables. This system can provide full
formatting for statistical tables, similar to the way that Stata provided
granular formatting for graphics starting in Stata 8. This system is
implemented in Mata for speed and compact memory use (Mata string matrices
made for efficient coding). I have reimplemented the
outreg ado
program using this system, and I have written a program to create formatted
cross-tabulation tables like those created by
tabulate. I also plan
to write a program to create formatted summary statistics tables.
Additional information
boston10_gallup.pdf
Hunting for genes with longitudinal phenotype data using Stata
Chuck Huber
Texas A&M Health Science Center School of Rural Public Health
Project Heartbeat! was a longitudinal study of metabolic and morphological
changes in adolescents aged 8–18 years. It was conducted in the 1990s.
A study is currently being conducted to consider the relationship between a
collection of phenotypes (including BMI, blood pressure, and blood lipids)
and a panel of 1,500 candidate single nucleotide polymorphisms (SNPs).
Traditional genetics software, such as PLINK and HelixTree, lacks the
ability to model longitudinal phenotype data. In this talk, I describe
how to use Stata for a longitudinal genetic association study that includes
these tasks: early-stage data checking (allele frequencies and
Hardy–Weinberg equilibrium), modeling of individual SNPs,
use of false discovery rates to control for the large number of comparisons,
exporting and importing the data through PHASE for haplotype reconstruction,
selection of tag SNPs in Stata, and analysis of haplotypes. I also
discuss strategies for scaling up to an Illumina 100k SNP chip using Stata.
All SNP names and gene names will be de-identified because this is a work in
progress.
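The abstract does not say which false discovery rate procedure is used. As one common possibility, the sketch below implements the Benjamini-Hochberg step-up rule in Python; the function name and the p-values are hypothetical and are not results from the study.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans, True where the test is rejected at FDR level q."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical per-SNP p-values (illustration only).
snp_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(snp_pvalues, q=0.05)
print(flags)
```

With these ten made-up p-values, only the two smallest pass the step-up thresholds, although five of them are below the unadjusted 0.05 level.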
Additional information
boston10_huber.ppt
Bayesian bivariate diagnostic meta-analysis via R-INLA
Ben Adarkwa Dwamena
University of Michigan and VA Ann Arbor Healthcare Systems
Bivariate generalized mixed modeling is currently recommended for joint
meta-analysis of diagnostic test sensitivity and specificity. Estimation is
commonly performed using frequentist likelihood-based techniques assuming
bivariate, normally distributed, correlated logit transformations of
sensitivity and specificity. These estimation techniques are fraught with
nonconvergence and invalid confidence intervals and correlation parameters,
especially with sparse data. Bayesian approaches, though likely to surmount
these and other problems, have not previously been applied. Recently,
integrated nested Laplace approximation (INLA) has been developed as a
computationally fast, deterministic alternative to Markov chain Monte Carlo
(MCMC)-based Bayesian modeling, and an R interface to the C-based INLA
program has been applied to diagnostic meta-analysis. In this presentation,
I show how to easily interface R-INLA estimation with data preprocessing and
postprocessing within Stata. A user-written ado-file allows user-friendly
application of INLA by Stata users.
Storing, analyzing, and presenting Stata output
Julian Reif
University of Chicago
In this presentation, I discuss how to store, analyze, and present Stata
output. I explain how to use my commands
regsave and
svret to
save Stata output to a Stata-formatted dataset. Results can then easily be
manipulated using standard Stata commands. I next demonstrate how to export
large sets of results to Microsoft Excel, where they can easily be viewed in
a pivot table. Finally, I show how to use my command
texsave to
export results to a LaTeX table that can be incorporated into a
professional paper or presentation. I provide examples that show how to
automate these procedures so that researchers can easily rerun analyses
without having to manually reassemble their output each time.
Additional information
boston10_reif.pdf
boston10_reif.zip
An efficient data envelopment analysis with a large dataset in Stata
Choonjoo Lee
Korea National Defense University
In this presentation, I present an approach to improving the computational
efficiency of data envelopment analysis (DEA) with a large dataset in Stata.
I presented my
dea program at the Stata Conference DC 09. I have
reviewed various comments and requests by Stata users and have updated the
code significantly in terms of computation time and model variants. In this
presentation, I illustrate an approach to reducing the computation time and
to improving the accuracy of DEA results using a five-input, one-output
dataset with 365 decision-making units (DMUs).
Additional information
boston10_lee.ppt
boston10_lee.zip
Competing-risks regression in Stata 11
Roberto G. Gutierrez
StataCorp
Competing-risks survival regression provides a useful alternative to Cox
regression in the presence of one or more competing risks. For example, say
that you are studying the time from initial treatment for cancer to
recurrence of cancer in relation to the type of treatment administered and
demographic factors. Death is a competing event: The person under treatment
may die, impeding the occurrence of the event of interest, recurrence of
cancer. Unlike censoring, which merely obstructs you from viewing the event,
a competing event prevents the event of interest from occurring altogether.
Depending on the scope of your statistical inference, your analysis may need
to be adjusted for competing risks.
Stata’s new
stcrreg command implements competing-risks
regression based on Fine and Gray’s proportional subhazards model. In
this talk, I focus on that new command and compare the method of Fine
and Gray to a method based on directly modeling cause-specific hazards.
Regardless of method, the focus is on estimating the cumulative incidence
function (CIF) for the event of interest in the presence of competing
events.
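To make the cumulative incidence function concrete, here is a minimal Python sketch of the standard nonparametric CIF estimator. It is not the Fine and Gray regression that stcrreg fits, and the function name, cause coding (0 for censored), and data are assumptions made for illustration.

```python
def cumulative_incidence(times, causes, cause_of_interest=1):
    """Nonparametric CIF for one cause.

    times: follow-up times; causes: 0 = censored, 1, 2, ... = event type.
    Returns (time, CIF) pairs at each distinct time with at least one event.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0  # probability of being free of any event just before time t
    cif = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_interest = d_any = n_here = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            cause = data[i][1]
            n_here += 1
            if cause != 0:
                d_any += 1
                if cause == cause_of_interest:
                    d_interest += 1
            i += 1
        if d_any > 0:
            # Only subjects still event-free can experience the event of interest.
            cif += surv * d_interest / n_at_risk
            surv *= 1.0 - d_any / n_at_risk
            out.append((t, cif))
        n_at_risk -= n_here
    return out

# Hypothetical data: cause 1 = recurrence, cause 2 = death, 0 = censored.
times = [2, 3, 3, 5, 6, 8]
causes = [1, 2, 0, 1, 0, 2]
curve = cumulative_incidence(times, causes)
print(curve)
```

Because deaths remove subjects from the pool that can still recur, the estimate here ends at 7/18, lower than the 4/9 that one minus a Kaplan-Meier estimate would give if deaths were treated as ordinary censoring.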
Additional information
boston10_gutierrez.pdf
Structural equation models with latent variables
Stas Kolenikov
University of Missouri
In this talk, I introduce the main ideas of structural equation models
(SEMs) with latent variables and Stata tools that can be used for such
models. The two approaches most often used in applied work are numeric
integration of the latent variables and covariance structure modeling. The
first approach is implemented in Stata via
gllamm, which was developed by
Sophia Rabe-Hesketh. The second approach is currently implemented in
confa for confirmatory factor analysis models. Also, the introduction of
the generalized method of moments (GMM) estimation and testing framework in
Stata 11 made it possible to estimate SEMs by using moderately
complex parameter and matrix manipulations. I provide working examples
with some popular datasets (Holzinger–Swineford factor analysis model
and Bollen’s industrialization and political democracy model).
Additional information
boston10_kolenikov.pdf
boston10_kolenikov.zip
Multiple imputation using Stata’s mi command
Yulia Marchenko
StataCorp
Stata’s
mi command can be used to perform multiple-imputation
analysis, including imputation, data management, and estimation.
mi
impute provides a number of univariate and multivariate imputation
methods, including multivariate normal (MVN) data augmentation.
mi estimate combines the
estimation and pooling steps of the multiple-imputation procedure into one
easy step.
mi also provides an extensive ability to manage multiply
imputed data. I give a brief overview of all of
mi’s
capabilities, with emphasis on
mi impute and
mi estimate, and I
also demonstrate examples of some of
mi’s unique
data-management features.
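The pooling step can be illustrated with Rubin's combination rules for a single coefficient. The Python sketch below is only an illustration of that arithmetic, not Stata's implementation; the function name and the numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t

# Hypothetical coefficient estimates from five imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
q_bar, total_var = pool_rubin(estimates, variances)
print(q_bar, total_var ** 0.5)
```

The pooled variance exceeds the average within-imputation variance because it also reflects the disagreement among the imputed datasets.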
Additional information
boston10_marchenko.pdf
CEM: Coarsened exact matching in Stata
Matthew Blackwell
Harvard University
I introduce a Stata implementation of coarsened exact matching, a new method
for improving the estimation of causal effects by reducing imbalance in
covariates between treated and control groups. Coarsened exact matching is
faster, is easier to use and understand, requires fewer assumptions, is more
easily automated, and possesses more attractive statistical properties for
many applications than do existing matching methods. In coarsened exact
matching, users temporarily coarsen their data, exact match on these
coarsened data, and then run their analysis on the uncoarsened, matched
data. Coarsened exact matching bounds the degree of model dependence and
causal effect estimation error by ex ante user choice, is monotonic
imbalance bounding (so that reducing the maximum imbalance on one variable
has no effect on others), does not require a separate procedure to restrict
data to common support, meets the congruence principle, is approximately
invariant to measurement error, balances all nonlinearities and interactions
in sample (that is, not merely in expectation), and works with multiply
imputed datasets. Other matching methods inherit many of the coarsened exact
matching method’s properties when applied to further match data
that are preprocessed by coarsened exact matching.
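The coarsen, match, and prune steps described above can be sketched as follows. This is a simplified Python illustration, not the Stata implementation: the function names, cutpoints, and data are hypothetical, and details such as stratum weights are omitted.

```python
from collections import defaultdict

def coarsen(value, cutpoints):
    """Return the index of the bin that value falls into (cutpoints sorted)."""
    return sum(value >= c for c in cutpoints)

def cem_match(units, cutpoints):
    """Keep units whose coarsened stratum holds both treated and control units.

    units: list of dicts with a boolean 'treated' key plus covariates.
    cutpoints: dict mapping covariate name to sorted cutpoints.
    """
    strata = defaultdict(list)
    for u in units:
        # The stratum key is the tuple of coarsened covariate values.
        key = tuple(coarsen(u[name], cuts)
                    for name, cuts in sorted(cutpoints.items()))
        strata[key].append(u)
    matched = []
    for members in strata.values():
        has_treated = any(m["treated"] for m in members)
        has_control = any(not m["treated"] for m in members)
        if has_treated and has_control:
            matched.extend(members)  # original, uncoarsened values are kept
    return matched

# Hypothetical data (illustration only).
units = [
    {"id": 1, "treated": True,  "age": 23, "educ": 12},
    {"id": 2, "treated": False, "age": 27, "educ": 11},
    {"id": 3, "treated": True,  "age": 45, "educ": 16},
    {"id": 4, "treated": False, "age": 52, "educ": 10},
    {"id": 5, "treated": False, "age": 41, "educ": 17},
]
cutpoints = {"age": [30, 40, 50], "educ": [13]}
matched = cem_match(units, cutpoints)
print(sorted(u["id"] for u in matched))
```

Unit 4 is dropped because no treated unit shares its stratum; the analysis would then proceed on the uncoarsened values of the remaining units.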
Additional information
boston10_blackwell.pdf
Evaluating one-way and two-way cluster–robust covariance matrix estimates
Christopher F. Baum
Boston College
In this presentation, I update Nichols and Schaffer’s 2007 UK Stata Users Group talk
on clustered standard errors. Although cluster–robust standard errors
are now recognized as essential in a panel-data context, official Stata only
supports clusters that are nested within panels. This requirement rules out
the possibility of defining clusters in the time dimension and modeling
contemporaneous dependence of panel units’ error processes. I build
upon recent analytical developments that define two-way (and conceptually,
n-way) clustering and upon the 2010 implementation of two-way clustering in
the widely used
ivreg2 and
xtivreg2 packages. I present
examples of the utility of one-way and two-way clustering using Monte Carlo
techniques, I present a comparison with alternative approaches to modeling
error dependence, and I consider tests for clustering of errors.
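For reference, two-way cluster-robust variance estimators are commonly written in the following inclusion-exclusion form. The notation is an assumption made here and is not taken from the talk: G and H denote the two clustering dimensions (for example, panel unit and time period), and each term on the right is an ordinary one-way cluster-robust estimate.

```latex
\hat{V}_{\text{two-way}} = \hat{V}_{G} + \hat{V}_{H} - \hat{V}_{G \cap H}
```

The last term, clustered on the intersection of the two dimensions, removes the contribution that the first two terms would otherwise count twice.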
Additional information
boston10_baum.pdf
Bootstrap LM test for the Box–Cox tobit model
David Vincent
Hewlett-Packard
Consistency of the maximum likelihood estimators for the parameters in the
standard tobit model relies heavily on the assumption of a normally
distributed error term. The Box–Cox transformation presents an obvious
attempt to preserve normality when the data make it questionable. In this
presentation, I set out an outer-product-of-gradients (OPG) version of a
Lagrange multiplier (LM) test for the null hypothesis of the standard tobit
model against the alternative of a more general nonlinear specification, as
determined by the parameter of the Box–Cox transformation. Monte Carlo
estimates of the rejection probabilities using first-order asymptotic and
parametric bootstrap critical values are obtained for sample sizes that are
comparable to those used in practice. The results show that the LM test
using bootstrap critical values has practically no size distortion, whereas
when using asymptotic critical values, the empirical rejection probabilities
are significantly larger than the nominal levels. A simple program that
carries out this test using bootstrap critical values has also been written
and can be run after the official Stata
tobit estimation command.
Additional information
boston10_vincent.pdf
Teaching a statistical program in emergency medicine research rotations: Command-driven or click-driven?
Muhammad Waseem
Lincoln Medical and Mental Health Center
Stata is a command-driven program. It is a general-purpose statistical
software package that is used by people of different backgrounds and
professional disciplines. Most Stata users, however, are nonphysicians.
Because Stata is used by people in all fields, most training programs
offered are geared toward programmers and nonphysicians. Although Stata has
simple commands, they may be difficult for nonprogrammers to use.
Generally, physicians are more familiar with clicking than with writing
commands. To teach emergency medicine (EM) residents, I developed a
teaching approach using pull-down menus. I observed that for EM residents,
it was easy to learn and use pull-down menus. While teaching, I emphasized
how to enter and import data. During the EM research rotation, residents
were introduced to the Stata software in addition to research methods. I
also developed a manual explaining the basic operations of Stata. Providing
an introduction to Stata prior to data entry improved the accuracy of data
recording and facilitated data analysis. It also provided EM residents with
the experience to navigate Stata following the completion of the research
rotation. Although the basic functions of Stata can be learned via this method,
I feel that it is necessary to develop a training program that addresses
the needs of physicians.
Additional information
boston10_waseem.ppt
Scientific organizers
Christopher F. Baum (chair), Boston College
Elizabeth Allred, Harvard School of Public Health
Amresh Hanchate, Boston University
Marcello Pagano, Harvard School of Public Health
Logistics organizers
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp
Sarah Marrs, StataCorp