Last updated: 20 August 2014
2014 Stata Conference Boston
31 July–1 August 2014
Omni Parker House
60 School Street
Boston, MA 02108
Proceedings
Do-it-yourself multiple imputation: Mode-effect correction in a public opinion survey
Stas Kolenikov
Abt SRBI
In this talk, I demonstrate how to build a multiple-imputation
procedure from scratch. The motivating example
comes from a public opinion survey in which the sampled respondents
provided their responses on the web or by phone. As is known in the survey
methodology literature, the presence of an interviewer on the phone produces
higher reports of socially desirable behaviors, such as number of
friends or political engagement, and lower reports of undesirable
behaviors, such as illicit drug use. Treating these less accurate
responses as partially missing data, I develop a non-standard multiple-imputation
model that is driven by a concept of utility from choice and
decision literature in economics. My implementation is designed to supply
data to Stata's mi suite: I create the imputations, and mi combines them
using Rubin's rules. Additionally, the mode-effect detection workflow
features multiple-testing corrections; it requires extensive post
operations and the exchange of variable lists between the project's
do-files, both of which I also demonstrate in this presentation.
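For readers new to this hand-off, here is a minimal sketch, under assumed names (a stacked.dta file with identifier id, imputation counter m, imputed variable y, and regressor x), of how externally created imputations can be registered with mi and combined by Rubin's rules; it is illustrative, not the author's code.

    * stacked.dta: original data (m==0) plus M hand-made imputations (m==1,...,M)
    use stacked, clear
    mi import flong, m(m) id(id) imputed(y)   // register the imputations with mi
    mi estimate: regress y x                  // mi combines results via Rubin's rules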
Additional information
boston14_kolenikov.pdf
ctgov: A suite of Stata commands for reporting trial
results to ClinicalTrials.gov
Phil Schumm
Department of Health Studies, University of Chicago
In response to the 1997 Food and Drug Administration
Modernization Act, the National Institutes of Health
established ClinicalTrials.gov, an online, publicly accessible registry
for clinical trials. The 2007 Food and Drug Administration Amendments
Act broadened the scope of eligible trials, added outcomes
reporting as a requirement, and established penalties for
non-compliance. Although ClinicalTrials.gov increased the transparency
with which clinical trials are conducted in the U.S. and opened up new
possibilities for research using the information collected, additional
resources, time, and effort are required to comply with this mandate.
This presentation will introduce
ctgov, a suite of Stata commands to
facilitate the reporting of trial results. By using this tool,
researchers will be able to generate results for automatic upload to
ClinicalTrials.gov as they are doing their primary
analyses, thereby eliminating much of the additional effort and ensuring
that the results in ClinicalTrials.gov match those in the official
publication or report. Although primarily of interest to clinical
researchers, biostatisticians, and pharmaceutical companies, the approach
taken by
ctgov also has connections to work being done in the area of
reproducible research.
Additional information
boston14_schumm.pdf
Using Stata for educational accountability and compliance reporting
Billy Buchanan
Mississippi Department of Education
In 2013, the Mississippi State Legislature passed a law
requiring the state to adopt a single, combined statewide accountability
system for schools and districts; the law also barred
the state from using some of the methods in the accountability
system then in place. Once the Mississippi Board of Education
voted to adopt the proposed model, the next major task was to program
all the business rules, requirements, and calculations. This
presentation will focus on how that work led to the
current accountability system. Compared with other software, Stata
let me strip away much of the complexity of the previous accountability
model. The current model uses 15 programs
written in Stata to import data from an internal server, implement the
rules specified in the business rules document, estimate the ratios
required by the system, create graphs to illustrate school versus district
versus state comparisons, and build school and district reports for public
consumption. Using Stata's capabilities, we generate the reports
by writing LaTeX source code along with a Bash script that compiles
the files and cleans up the LaTeX output, saving considerable time.
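To give a flavor of this report-generation pattern, here is a minimal sketch with hypothetical file names and placeholder content (the local macro ratio stands in for a computed result); the actual system is far more elaborate.

    tempname fh
    file open `fh' using "school_report.tex", write replace
    file write `fh' "\documentclass{article}" _n
    file write `fh' "\begin{document}" _n
    file write `fh' "Accountability ratio: `ratio'" _n
    file write `fh' "\end{document}" _n
    file close `fh'
    !pdflatex school_report.tex   // in production, a Bash script compiles and cleans up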
Additional information
boston14_buchanan.pdf
Profile analysis
Phil Ender
UCLA Statistical Consulting Group
This presentation will discuss profile analysis, a multivariate method for examining differences in the
shapes of profiles across groups. Profile analysis uses Stata's
manova
command along with
manovatest for estimation. This presentation will also
demonstrate the user-written command
profileplot to graphically
display group profiles.
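For illustration, here is a minimal sketch of the kind of test profile analysis rests on, assuming hypothetical repeated measures y1, y2, and y3 and a grouping variable group:

    manova y1 y2 y3 = group             // one-way MANOVA on the profile
    matrix c = (1, -1, 0 \ 0, 1, -1)    // differences between adjacent segments
    manovatest group, ytransform(c)     // test that group profiles are parallel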
Additional information
boston14_ender.pdf
Computer simulation of patient flow through an operating suite
David Clark
Maine Medical Center, Portland, Maine
Operating room (OR) inefficiency is costly and stressful for
patients and staff. To evaluate possible improvements, we simulated our
OR and recovery room (RR) processes with Stata. We used hospital data
(in long format) and parametric time-to-event regression (streg) to
derive loglogistic distributions for the duration of procedures, RR
stays, and room turnaround. Variables were then reshaped into a single
row (wide format) for the simulation program. Patient and room status
for a 24-hour day were changed sequentially using a
forvalues loop with
5-minute steps. Scheduled and historical times were first used
deterministically to recreate anticipated and actual events and
durations. Patient observations were then replicated (using
expand) with
different pseudorandom parameters in each row. Distributions of patient
length of stay in OR and RR (and room turnaround times) thus
approximated theoretical input distributions. Refinements included
reassigning cases if the scheduled room was running late, changing staff
availability, and incorporating unscheduled emergencies. Summary
statistics were compiled (using
egen) for each case and the system as a
whole and were consistent with historical data. Stata has some
advantages over specialized simulation programs, especially for current
Stata users. We plan to build a user interface, make other improvements,
and share our program through RePEc.
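A minimal sketch of the simulation clock and replication steps described above, with the status-updating logic elided as a comment:

    forvalues t = 0(5)1435 {        // one 24-hour day in 5-minute steps
        * update each patient's and each room's status at minute `t'
    }
    expand 100                      // replicate patients; each copy then receives
                                    // different pseudorandom durations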
Additional information
boston14_clark.ppt
Stata hybrids: Updates and ideas
James Fiedler
Universities Space Research Association
At last year’s Stata Conference, I presented projects that
facilitate the combined use of Stata and Python. One project provides
the ability to use Python within Stata via a C plugin. The other project
provides a custom Python class that can be used to open, modify, and
save Stata datasets. In this talk, I will begin by describing some
modifications and extensions to these projects. I will then present a
few new ideas for useful combinations of Stata with other tools. Some of
these ideas can be realized using the Python projects above, some using
JavaScript and a web browser.
Additional information
boston14-fiedler.pdf
Mata routines for solution of nonlinear systems using interval methods
Matthew Baker
Hunter College and the Graduate Center, CUNY
Solution of nonlinear systems has become increasingly
important as a step in many estimation problems and is a problem of
interest in its own right. I introduce a collection of Mata routines
that can be used to find all solutions to nonlinear equation systems
and demonstrate their usage on a sequence of test problems. While
specifically tailored to solving polynomial systems, the method can be
applied to any continuous system with a continuous Jacobian. The methods
rely on interval Newton methods, a technique that combines Taylor
expansion, bisection, and interval programming. The routines come
equipped with a heuristic solver that allows for approximate solution
of problems that are especially time consuming or problems that do not
require that all solutions be found. Support tools for the solver
include functions for interval arithmetic and the manipulation of a series of
matrices in parallel. I discuss an extended application of the solution
tools to the problem of finding all equilibria of discrete action games,
which in general requires solving polynomial systems.
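To convey the core idea (this is not Baker's code), here is a minimal Mata sketch of one interval-Newton contraction step for the scalar test problem f(x) = x^2 - 2 on the interval [1, 2]:

    mata:
    // One interval-Newton step: N(X) = m - f(m)/F'(X), then intersect with X.
    // For f(x) = x^2 - 2 on a positive interval, F'(X) = [2*lo, 2*hi].
    real rowvector inewton_step(real scalar lo, real scalar hi)
    {
        real scalar m, nlo, nhi
        m   = (lo + hi)/2
        nlo = m - (m^2 - 2)/(2*lo)    // divide f(m) by each derivative bound
        nhi = m - (m^2 - 2)/(2*hi)
        return((max((lo, min((nlo, nhi)))), min((hi, max((nlo, nhi))))))
    }
    inewton_step(1, 2)    // returns (1.375, 1.4375), still bracketing sqrt(2)
    end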
Additional information
boston14_baker.pdf
Making interactive online visualizations with stata2leaflet and stata2d3
Robert Grant
St. George’s Medical School, University of London
The last three years have seen explosive growth in the
variety and sophistication of interactive online graphics. These are
mostly implemented in the web language JavaScript, with the D3 (Data-Driven
Documents) library being the most popular and flexible at
present. Leaflet is a mapping library also being widely used. R users
have some packages that translate their data and specifications into
interactive graphics and maps; these packages write a text file containing the
HTML and JavaScript instructions that make up a webpage containing the
desired visualization. This translation into a webpage is easily
achieved in Stata, and I will present the
stata2leaflet command which
produces zoomable, clickable online maps. Contemporary interactive
graphs benefit from allowing the viewer to filter and select data of
interest, which is a second layer of specification implemented in the
stata2d3 commands.
stata2d3 capitalizes on the consistency of Stata
graph syntax by parsing and translating a standard Stata graph command
into a webpage. Users can choose to include an explanatory comment
against each line of the source code; the comments are invisible to viewers
but help users learn HTML and JavaScript and make further refinements.
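The write-a-webpage approach itself is easy to sketch by hand. Here is a minimal hand-rolled illustration (not stata2leaflet itself; the CDN URLs and coordinates are assumptions) that writes an HTML file showing one Leaflet marker:

    file open fh using "map.html", write replace
    file write fh "<html><head>" _n
    file write fh `"<link rel="stylesheet" href="https://unpkg.com/leaflet/dist/leaflet.css"/>"' _n
    file write fh `"<script src="https://unpkg.com/leaflet/dist/leaflet.js"></script>"' _n
    file write fh "</head><body><div id='m' style='height:400px'></div><script>" _n
    file write fh "var map = L.map('m').setView([42.3577, -71.0603], 15);" _n
    file write fh "L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);" _n
    file write fh "L.marker([42.3577, -71.0603]).addTo(map).bindPopup('Omni Parker House');" _n
    file write fh "</script></body></html>" _n
    file close fh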
Additional information
boston14_grant.pdf
Classification using random forests in Stata and R
Linden McBride
Cornell University
Many estimation problems focus on classification of cases
(into bins) with tools that aim to identify cases using only a small
subset of all possible questions. These tools can be used in diagnoses
of disease, identification of advanced or failing students using tests,
or classification into poor and nonpoor for the targeting of a
means-tested social program. Most popular estimation procedures for
building these tools prioritize minimizing in-sample prediction error,
but the real objective is minimizing out-of-sample prediction
error. We provide a comparison of linear
discriminant, discrete choice, and random forest methods, with
applications to means-tested social programs. Out-of-sample prediction
error is typically minimized by random forest algorithms.
Additional information
boston14_mcbride.pdf
A midas retouch regarding diagnostic meta-analysis
Ben Dwamena
University of Michigan
The talk describes recent updates for
midas, a comprehensive
and medically popular program for diagnostic test accuracy
meta-analysis. A major change is that
midas is now an estimation command
and a wrapper for
meglm in Stata 13. The update allows more flexibility in specifying
covariance structures and link functions other than the logit, more
extensive postestimation options, specification of starting
values (especially useful with sparse data), and a choice between
univariate (independent) and bivariate (correlated) modeling of
sensitivity and specificity.
Additional information
boston14_dwamena.pdf
Nonstandard deviation: Making the global local
Marcello Pagano
Harvard School of Public Health
In October 2012, HarvardX, through edX, offered its first two
online courses. One of these was
PH207X: Health in Numbers. The
course covered biostatistics and epidemiology at an introductory level
and lasted 12 weeks. Some 60,000 students later, we have exposed more
students to those disciplines than we could have over the next 250 years
of typical brick-and-mortar teaching. To do this, we needed a
statistical package, and we chose Stata. This talk will cover some of
what we learned from the experience.
Additional information
boston14_pagano.pdf
Transformation survival models
Yulia Marchenko
StataCorp LP
The Cox proportional hazards model is one of the most popular
methods for analyzing survival or failure-time data. The key assumption
underlying the Cox model is that of proportional hazards. This
assumption may often be violated in practice. Transformation survival
models extend the Cox regression methodology to allow for
nonproportional hazards. They represent the class of semiparametric
linear transformation models, which relate an unknown transformation of
the survival time linearly to covariates. In my presentation, I will
describe these models and demonstrate how to fit them in Stata.
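As context, a minimal sketch (with hypothetical variables time, fail, and x) of fitting a Cox model and checking the proportional-hazards assumption that transformation models relax:

    stset time, failure(fail)    // declare the survival data
    stcox x                      // Cox proportional hazards model
    estat phtest, detail         // Schoenfeld-residual test; rejection signals
                                 // nonproportional hazards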
Additional information
boston14_marchenko.pdf
Generalized quantile regression in Stata
David Powell
RAND
Quantile regression techniques are useful in understanding the
relationship between explanatory variables and the conditional
distribution of the outcome variable, which allows the parameters of
interest to vary based on a nonseparable disturbance term. Additional
covariates may be necessary or simply desirable for identification, but
including additional variables in a conditional quantile model
separates the disturbance term, which alters the underlying structural
model. To address this problem, Powell (2013) introduces the Generalized
Quantile Regression (GQR) estimator, which provides the impact of the
treatment variables on the outcome distribution and allows for
conditioning on control variables without altering the interpretation of
the estimates. Quantile regression and instrumental-variable quantile
regression are special cases of GQR, but GQR allows for more flexible
estimation of quantile treatment effects. We can easily extend the estimator
to include instrumental variables and panel data. We introduce
a Stata command, gqr, that implements a GMM-based GQR estimator.
Its options include the usual panel-data options and let the user
control for endogeneity in explanatory variables by using instruments.
The command also offers several methods for estimating the standard
errors of the estimated parameters, including direct methods and
Markov chain Monte Carlo simulation.
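For reference, the special cases mentioned above are available in official Stata; a minimal sketch with a hypothetical outcome y, treatment d, and control x (gqr's own syntax is not reproduced here):

    qreg y d x, quantile(.5)    // conditional median regression
    qreg y d x, quantile(.9)    // an upper-tail conditional quantile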
Additional information
boston14_powell.pdf
Small multiples, or the science and art of combining graphs
Nicholas J. Cox
Durham University
Good graphics often exploit one simple graphical design that is repeated
for different parts of the data, a device Edward R. Tufte dubbed
small multiples. In Stata, small multiples are supported for different
subsets of the data with
by() or
over() options of many graph
commands; users can easily emulate this in their own programs by writing
wrapper programs that call
twoway or
graph bar and its siblings.
Otherwise, specific machinery offers repetition of a design for different
variables, such as the (arguably much under-used)
graph matrix command.
Users can always put together their own composite
graphs by saving individual graphs and then combining them.
This presentation offers further modest automation of
the same design repeated for different data. Three general
programs allow small multiples in different ways.
sparkline, also inspired
by Tufte but using a centuries-old design popular in many
sciences, is most suitable for multiple time series, yet it also has
other applications.
crossplot offers a simple student-friendly graph
matrix for each
y and each
x variable specified, which is more general
than a scatterplot matrix.
combineplot is a command for
combining univariate or bivariate plots for different variables.
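A minimal sketch of the built-in machinery mentioned above, using the auto dataset shipped with Stata:

    sysuse auto, clear
    scatter mpg weight, by(foreign)    // one design repeated across subsets
    graph matrix mpg weight length     // one design repeated across variables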
Additional information
boston14_cox.ppt
boston14_cox.do
Measuring mobility
Austin Nichols
Urban Institute
I review various measures of mobility using panel data, with
applications to measuring economic or social mobility in survey
data, and demonstrate a variety of approaches.
Additional information
boston14_nichols.pdf
bipolate: A Stata command for bivariate interpolation with
particular application to 3D graphing
Joseph Canner
Johns Hopkins University School of Medicine
Stata has a variety of flexible commands for graphing in two
dimensions; however, it has few options for graphing in three
dimensions. The user-written
surface command by Adrian Mander, available
from SSC, attempts to fill this gap, providing both 3D wire-frame plots
and dropline plots. However, when some (x,y) combinations do not have a
corresponding
z-value, the graphs produced by surface are often
unintelligible. SAS addresses this problem with PROC G3GRID, which
creates a dataset of interpolated values, providing a smooth surface
plot when used as input for PROC G3D. The default method of
interpolation used by PROC G3GRID was proposed by Hiroshi Akima in 1978.
To reproduce this functionality in Stata, we used a publicly available
Fortran implementation of Akima's method. We converted these Fortran
subroutines into Mata and created the Stata command
bipolate to
interface with them. The
bipolate command contains options
for interpolating
z-values at all possible combinations of the specified
x- and y-values and for supplying particular (x,y) combinations at which
to interpolate. There is also an option for handling multiple
z-values
for a given (x,y). Examples will be provided to illustrate the use of
surface, with and without
bipolate, and to illustrate various
bipolate
options.
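For orientation, a sketch of the existing command that bipolate complements (assuming surface has been installed from SSC and variables x, y, and z are in memory):

    ssc install surface    // Adrian Mander's 3D plotting command
    surface x y z          // wire-frame plot; gaps in the (x,y) grid are
                           // what often make the output unintelligible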
Additional information
boston14_canner.pptx
Dialog-driven event study using Stata (Cancelled)
Chuntao Li
Zhongnan University of Economics and Law
We present our user-written ado program, eventstudy.
This package allows users to perform large-scale event studies with market
models such as the CAPM. The program is built with Stata's
dialog-programming facilities and is menu driven: users simply supply the
key parameters of the event study, and the program automatically carries
out the complex procedure.
Estimating average treatment effects from observational data using teffects
David Drukker
StataCorp LP
After reviewing the potential-outcome framework for
estimating treatment effects from observational data, I will
discuss how to estimate the average treatment effect and the average
treatment effect on the treated by the regression-adjustment estimator,
the inverse-probability-weighted estimator, two doubly robust
estimators, and two matching estimators implemented in
teffects.
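A minimal sketch of the estimators listed above, with a hypothetical outcome y, binary treatment t, and covariates x1 and x2:

    teffects ra      (y x1 x2) (t)          // regression adjustment
    teffects ipw     (y) (t x1 x2)          // inverse-probability weighting
    teffects aipw    (y x1 x2) (t x1 x2)    // doubly robust: augmented IPW
    teffects ipwra   (y x1 x2) (t x1 x2)    // doubly robust: IPW + regression
    teffects psmatch (y) (t x1 x2)          // propensity-score matching
    teffects nnmatch (y x1 x2) (t)          // nearest-neighbor matching
    teffects ra      (y x1 x2) (t), atet    // ATET instead of the default ATE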
Additional information
boston14_drukker.pdf
Optimal interval design for phase I oncology clinical trials
Bryan Fellman
MD Anderson Cancer Center
The optimal interval design is a novel phase I trial design for finding the
maximum tolerated dose (MTD). It casts dose finding
as a sequential decision-making problem for assigning an appropriate dose
for each enrolled patient. The design optimizes the assignment of doses to
patients by minimizing incorrect decisions of dose escalation or
deescalation, that is, erroneously escalating (or deescalating) the dose
when the current dose is actually higher (or lower) than the MTD. This
feature of the optimal interval design strongly ensures adherence to ethical
standards. In addition, because the optimal dose assignment tends to treat
patients at (or close to) the MTD, at the end of the trial, this design will
be able to select the MTD with a high probability since most data and
statistical power are concentrated around the MTD. This presentation will
briefly cover the methods of the design and demonstrate a command that
implements them in a clinical setting.
Additional information
boston14_fellman.pdf
Distributed computations in Stata
Michael Lokshin
Sergiy Radyakin
Development Economics Research Group, The World Bank
Complex tasks in simulation modeling and estimation frequently
strain available computational resources. Often these tasks consist
of a number of separable iterations that can be performed in parallel,
simultaneously and independently of one another. Earlier approaches
(e.g., the parallel package, 2013) were limited to execution in
parallel sessions on a single machine. We are developing a system
that can be run on an MS Windows
network, with automatic registration and deregistration of computing
nodes (each running Stata), a task scheduler, and a results aggregator.
A multiple-machine networked approach allows greater scale and ultimately
higher performance.
Additional information
boston14_radyakin.pdf
Binned scatterplots: Introducing binscatter and exploring its applications
Michael Stepner
MIT
binscatter is a new program that produces binned
scatterplots, which provide a nonparametric estimate of a conditional
expectation function. This presentation will describe the features of
binscatter and explore its versatile applications, which
include observing the relationship between two variables in large
datasets, visualizing OLS regressions, visualizing
regression-discontinuity designs, plotting event studies, and
visualizing IV regressions. The presentation will
demonstrate how
binscatter can be used to complement the empirical
techniques most commonly used in applied economic research.
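A minimal sketch of typical calls, with hypothetical variables y, x, z, and group and an assumed discontinuity at 0 (option names follow the SSC release of binscatter):

    binscatter y x                           // binned CEF estimate with linear fit
    binscatter y x, controls(z) by(group)    // residualize on z; one series per group
    binscatter y x, rd(0)                    // split the fit at the cutoff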
Additional information
boston14_stepner.pdf
Floating-point numbers: A visit through the looking glass
William Gould
StataCorp LP
In lieu of his usual
Report to users, Bill Gould will talk on
floating-point numbers.
Researchers do not adequately appreciate that floating-point numbers are a
simulation of real numbers and, as with all simulations, some features are
preserved while others are not. When writing code, or even do-files,
treating the computer's floating-point numbers as if they were real numbers can
lead to substantive problems and to numerical inaccuracy. In this, the
relationship between computers and real numbers is not entirely unlike the
relationship between tea and Douglas Adams's Nutri-Matic drink dispenser.
The Nutri-Matic produces a concoction that is "almost, but not quite,
entirely unlike tea."
Gould shows what the universe would be like if it were implemented in
floating-point rather than in real numbers. The floating-point universe
turns out to be nothing like the real universe and probably could not be
made to function.
Without jargon and without resort to binary, Gould shows how floating-point
numbers are implemented on an imaginary base-10 computer and quantifies the
kinds of errors that can arise. In this, floating-point subtraction stands
out as really being almost, but not quite, entirely unlike subtraction.
Gould shows how to work around such problems.
The point of the talk is to build your intuition about the floating-point
world so that you as a researcher can predict when calculations might go
awry, know how to think about the problem, and determine how to fix it.
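A classic Stata illustration of the hazard (in binary rather than the talk's base-10, but the intuition carries over):

    display (0.3 - 0.2) == 0.1    // 0 (false): neither side is exactly 1/10
    display %21x 0.1              // hexadecimal display exposes the rounding
    display float(0.1) == 0.1     // 0: float and double round 1/10 differently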
Additional information
boston14_gould.pdf
Scientific organizers
Stephen Soldz (chair), Boston Graduate School of Psychoanalysis
Christopher F. Baum, Boston College
Marcello Pagano, Harvard School of Public Health
Logistics organizers
Nathan Bishop, StataCorp
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp