The 2016 Stata Conference was held July 28-29, but you can still learn more about the presentations and interact with the user community even after the meeting. View the presentation slides (below) and the conference photos.
Regression discontinuity designs and models |
Abstract:
This paper proposes some extensions to Calonico, Cattaneo, and
Titiunik's (2014, Stata Journal 14: 909–946) commands for
regression discontinuity (RD) and regression kink (RK) designs. First,
the commands allow users to take a second difference in RD and RK (and
higher-order derivatives). That is, they combine RD and RK estimators
with difference-in-differences (DID). The DID approach may be
appropriate, for instance, when the density function of the running
variable is discontinuous around the cutoff but steady over time.
Second, they let users specify multidimensional running variables, which
is applicable, for example, to the estimation of boundary discontinuity.
Finally, users can work with weighted data by specifying analytic
weights and can include control variables. For any of those extensions, the
command ddbwsel calculates the MSE-optimal bandwidth, as proposed
by Imbens and Kalyanaraman (2012, Review of Economic Studies 79:
933–959) and Calonico, Cattaneo, and Titiunik (2014,
Econometrica 82: 2295–2326). The command ddrd
implements the robust bias-corrected confidence intervals for any of the
above extensions. The command dddens implements McCrary's (2008,
Journal of Econometrics 142: 698–714) density test for
cross-sectional RD and RK, and RD and RK with DID. The paper presents
applications of two-dimensional RD with DID and one-dimensional RK with DID.
Additional information
Chicago16_Ribas.pdf Rafael Ribas
University of Amsterdam
|
Abstract:
Regression discontinuity (RD) models are commonly used to
nonparametrically identify and estimate a local average treatment effect
(LATE). Dong and Lewbel (2015) show how a derivative of this LATE can be
estimated. They use their treatment-effect derivative (TED) to estimate
how the LATE would change if the RD threshold changed. We argue that
their estimator should be employed in most RD applications as a way to
assess the stability and hence external validity of RD estimates.
Closely related to TED is the complier probability derivative (CPD) of
Cerulli et al. (2016). Just as TED measures stability of the treatment
effect, the CPD measures stability of the complier population in fuzzy
designs. In this paper, we provide the Stata module ted, which can be
used to easily implement TED and CPD estimation, and we apply it to some
real datasets.
References
Cerulli, G., Y. Dong, A. Lewbel, and A. Poulsen. 2016. Testing stability of regression discontinuity models. Advances in Econometrics, forthcoming.
Additional information
Chicago16_Cerulli.pdf Giovanni Cerulli
CNR - IRCrES
|
Abstract:
We introduce the Stata module rdlocrand, which contains four commands to
conduct finite-sample inference in regression discontinuity (RD) designs
under a local randomization assumption, following the framework and
methods proposed in Cattaneo, Frandsen, and Titiunik (2015) and
Cattaneo, Titiunik, and Vazquez-Bare (2016). Assuming a known assignment
mechanism for units close to the RD cutoff, these functions implement a
variety of procedures based on randomization inference techniques.
First, the command rdrandinf employs randomization methods to
conduct point estimation, hypothesis testing and confidence interval
estimation under different assumptions. Second, the command
rdwinselect employs finite-sample methods to select a window near
the cutoff where the assumption of randomized treatment assignment is
most plausible. Third, the command rdsensitivity employs
randomization techniques to conduct a sequence of hypothesis tests for
different windows around the RD cutoff, which can be used to assess the
sensitivity of the methods as well as to construct confidence intervals
by inversion. Finally, the command rdrbounds implements
sensitivity bounds (Rosenbaum 2002) for the context of RD designs under
local randomization. Companion R functions with the same syntax and
capabilities are also provided.
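A minimal sketch of this workflow, assuming an outcome y, a running variable x
with cutoff 0, and pretreatment covariates cov1 and cov2 (all hypothetical
names); the option names follow the package's published examples, so check the
help files for the exact current syntax:

    // choose a window around the cutoff where local randomization is plausible
    rdwinselect x cov1 cov2, cutoff(0)

    // randomization inference for the treatment effect within a chosen window
    rdrandinf y x, cutoff(0) wl(-0.5) wr(0.5)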
Additional information
Chicago16_Vazquez-Bare.pdf Gonzalo Vazquez-Bare
Department of Economics, University of Michigan
Matias Cattaneo
Department of Economics and Department of Statistics, University of Michigan
Rocio Titiunik
Department of Political Science, University of Michigan
|
Integrating Stata with web technologies |
Abstract:
Twitter feed data have increasingly become a rich source of information,
both for commercial marketing purposes and for social science research.
In the early days of Twitter, researchers could access the Twitter API
with a simple URL. In order to control the size of data requests and to
track who is accessing what data, Twitter has instituted some security
policies that make this much more difficult for the average researcher.
Users must obtain unique keys from Twitter, create a timestamp, generate
a random string, combine this with the actual data request, hash this
string using HMAC-SHA1, and submit all of this to the Twitter API in a
timely fashion. Large requests must be divided into multiple smaller
requests and spread out over a period of time. Our particular need was
to obtain Twitter user profile information for over 800 users who had
previously tweeted about surgery. We developed a Stata command to
automate this process. It includes a number of bitwise operator
functions and other tools that could be useful in other applications. We
will discuss the unique features of this command, the tools required to
implement it, and the feasibility of extending this example to other
data request types.
Additional information
Chicago16_Canner.pptx Joseph Canner
Johns Hopkins University School of Medicine
Neeraja Nagarajan
Johns Hopkins University School of Medicine
|
Abstract:
With the integration of cURL, Stata can do amazing things—from
retrieving Instagram image information, to geocoding batches of
coordinates using the US Census Geocoder, to sending direct messages via
Twitter. This presentation focuses on the steps necessary to accomplish
these tasks without use of any additional software. We introduce a new
Mata library that handles bitwise operations, Base64 encoding, and the
HMAC-SHA1 algorithm, the components necessary to submit POST requests
using Twitter's API. As a result, Stata users can favorite Tweets,
retrieve timeline data, and send secure direct messages entirely from
the Stata console.
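The signed Twitter requests require the authors' Mata library, but the basic
pattern of pulling a web resource into Stata can be sketched with built-in
commands alone (the URL below is a placeholder, not a real endpoint):

    // fetch a response into a local file with Stata's built-in copy command
    copy "https://example.com/api/resource?format=json" response.json, replace

    // inspect the raw response
    type response.json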
Additional information
Chicago16_Matsuoka.pptx William Matsuoka
San Diego Gas & Electric
|
Abstract:
Stata graphics are professional, visually pleasing, and easy to format,
but lack the interactive experience or transparency options requested by
many users. In this presentation, we introduce a new command,
gchart, a fully functional wrapper for the Google Chart
API. It is written almost entirely like the twoway command
and allows users to produce high-quality JavaScript data visualizations using
familiar Stata graphing syntax. While other Google Chart-based programs
already exist, gchart aims to be the most comprehensive library
to date. With gchart, Stata users can present interactive
web-based graphics without exporting their data to secondary software
packages or learning JavaScript and HTML. The gchart library
contains most Stata graph types such as bar, pie, and line as well as
new graphs offered by Google Charts such as treemaps, timelines, Sankey
diagrams, and more. The command contains an option to create interactive
tables directly from datasets and even has preset settings to make
resulting Google Charts feel more Stata-like. Once the visualizations are
rendered, web visitors and blog readers will be happy to play with the
resulting graphics.
Additional information
Chicago16_Chavez.pdf Belen Chavez
San Diego Gas & Electric
William Matsuoka
San Diego Gas & Electric
|
Abstract:
Exploratory data analysis and visualization is the bedrock upon which we
construct our understanding of the world around us. It is an immensely
powerful tool for developing our understanding of the data and for
communicating our results in ways that are intuitive to many audiences. As
our data increase in complexity, static visualizations may no longer be
an efficient or sufficient way to extract meaning from
the data. The jsonio, libd3, and libhtml packages were developed
specifically to address this limitation within Stata and to provide a
toolkit upon which other users could easily and quickly contribute. The
jsonio package is a Java plugin that helps users to generate JSON
objects from the data they have in Stata and, unlike .csv, also retains
crucial metadata that can help interpret the meaning of the data
visualizations (value labels, variable labels, etc.). The libd3 Mata
library mimics the D3js API as closely as possible to make it easier for
users to take existing code and implement it in Mata without significant
effort. The libhtml library provides HTML5 DOM element classes for
constructing HTML documents. Together, these form a powerful toolkit for
interactive exploratory data analysis and visualization.
Additional information
Chicago16_Buchanan (online) Billy Buchanan
Office of Research, Evaluation, and Assessment, Minneapolis Public Schools
|
Topics in Stata programming |
Abstract:
Value-added models (VAM) are required by many states to hold teachers
accountable for students' growth within a certain period of time. For
instance, the Florida Department of Education developed VAM for
state-mandated assessments. Meanwhile, to evaluate teachers of subjects
or grades not assessed by the state models, local districts in Florida
started to build their own. However, with hundreds of locally developed
and statewide-administered standardized assessments, many districts
found it difficult to develop systematic and efficient macros to run
hundreds of statistical models and combine results. This presentation
demonstrates a template to set up loops and macros in Stata that run a
variety of linear regression models (single-level models with fixed
effects, two-level, and three-level models with random effects) for the
purpose of generating VAM-related statistics (for example, point
estimates, appropriate standard errors, performance levels, and model
specification statistics) and to save results in designated files.
Specifically, in the case of single-level models, I developed commands
to make Stata compare the output results and automatically select the
teacher with a median-sized effect as the reference for each model. As a
result, no human judgment is involved during the looping process, which
makes it time efficient and free of human error.
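As a stylized illustration of the looping approach (not the district's actual
template), suppose the outcomes are score1-score3, the student-level covariate
is pretest, and the grouping variable is teacher, all hypothetical names; the
sketch below fits a two-level random-effects model for each outcome and stores
the results:

    // loop over hypothetical outcome variables, fit a two-level model for each,
    // and save the estimates to disk for later tabulation
    foreach y of varlist score1 score2 score3 {
        mixed `y' pretest || teacher:, reml
        estimates save vam_`y', replace
    }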
Additional information
Chicago16_An_C.pptx Chen An
Department of Accountability, Research and Evaluation, Orange County Public Schools
|
Data analysis: Tools and techniques |
Abstract:
Doctors and consultants want to know the effect of a covariate for a
given covariate pattern. Policy analysts want to know a population-level
effect of a covariate. I discuss how to estimate and interpret these
effects using factor variables and margins.
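A compact example of the two questions, using standard factor-variable and
margins syntax with hypothetical variables (a binary outcome y, a treatment
indicator treat, and age):

    // logistic model with a factor-variable treatment and a continuous covariate
    logit y i.treat c.age

    // effect of treat at a given covariate pattern (a 40-year-old)
    margins, dydx(treat) at(age=40)

    // population-level (average) effect of treat over the estimation sample
    margins, dydx(treat)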
Additional information
Chicago16_Drukker.pdf David Drukker
StataCorp
|
Abstract:
Polychoric correlation is the correlation between two ordinal variables
obtained as the maximum likelihood estimate under the assumption that
the ordinal variables are obtained by coarsening a bivariate normal
distribution. In the early 2000s, I developed a suite of commands for
polychoric correlation matrix analysis and follow-up principal component
analysis for a common application: scoring households on their socio-economic
status based on categorical proxies of wealth, such as materials used in
the house (dirt floor versus wooden floor versus tile or cement floor) and the
presence or absence of certain assets (TV, radio, bike, car, etc.). Even
though my polychoric program from circa 2004 appears to be finding some
good use in the Stata world, it lacks a number of important features. I
will describe how the Stata tools complement and enhance what polychoric
was purported to achieve. While polychoric only deals with pairwise
correlations, David Roodman's cmp provides better-justified joint
correlations based on numerical multivariate integration. Also, while
plugging the principal components obtained from polychoric into
regression models leads to underaccounting for sampling errors in
regression coefficient estimates because of the generated regressors
problem, generalized structural equation modeling with gsem
provides the capability of simultaneous estimation of models that use
the SES index as a predictor of a substantive outcome. I will review a
stylized example from Bollen, Glanville, and Stecklov's seminal papers
on the use of latent variable models in analyzing socio-economic status
and demonstrate how these different programs can be used to that effect.
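A stylized gsem specification along these lines, with hypothetical variables:
ordinal wealth proxies floor, water, and roof measuring a latent SES, which
simultaneously predicts a binary substantive outcome y, so the sampling error
in the index is not ignored:

    // latent SES measured by ordinal asset proxies and used, in the same model,
    // as a predictor of a substantive outcome
    gsem (floor water roof <- SES, ologit) (y <- SES, logit)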
Stas Kolenikov
Abt SRBI
|
Data analysis: Tools and techniques (continued) |
Abstract:
Risk analysis of a commercial bank's wholesale loan portfolios involves
modeling of the asset quality ratings of each borrower's obligations.
This customarily involves transition matrices that capture the
probability that a loan's AQ rating will migrate to a higher or lower
rating or transition to the default state. We compare and contrast three
approaches for transition matrix modeling: the single-factor approach
commonly used in the financial industry, an approach based on
time-series forecasts of default rates, and an approach based on
modeling selected elements of the transition matrix that comprise the
most likely outcomes. We find that the latter two, less conventional approaches both
have excellent performance over a sample period encompassing the
financial crisis.
Additional information
Chicago16_Baum.pdf Kit Baum
Boston College and DIW Berlin
Soner Tunay
Citizens Financial Group
Alper Corlu
Citizens Financial Group
|
Abstract:
Currently, panel-data analysis largely relies on parametric models (such
as random-effects and fixed-effects models). These models make strong
assumptions in order to draw causal inferences, and in practice some of
these assumptions may not hold. Compared with parametric models,
matching does not make strong parametric assumptions and also helps
provide focused inference on the effect of a particular cause. However,
matching has been used typically in cross-sectional data analysis. In
this paper, we extend matching to panel-data analysis. In the spirit of
the difference-in-differences method, we first difference the outcomes to
remove the fixed effects. Then, we apply matching on the differenced
outcomes at each wave (except the first one). The results can be used to
examine whether treatment effects vary across time. The estimates from
the separate waves can also be combined to provide an overall estimate
of the treatment effects. In doing so, we present a variance estimator
for the overall treatment effects that can account for complicated
sequential dependence in the data. We demonstrate the method through
empirical examples and show its efficacy in comparison with previous
methods. We also outline a Stata add-on "DIDMatch" that we are creating
to implement the method.
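Because the DIDMatch add-on is still being written, the sketch below only
illustrates the underlying idea with official Stata commands and hypothetical
variable names: difference the outcome within units, then match on
pretreatment covariates wave by wave.

    // first-difference the outcome within units (panel declared with xtset)
    xtset id wave
    generate dy = D.y

    // propensity-score matching on the differenced outcome, one wave at a time
    forvalues w = 2/4 {
        teffects psmatch (dy) (treat x1 x2) if wave == `w'
    }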
Additional information
Chicago16_An_W.pdf Weihua An
Departments of Sociology and Statistics, Indiana University
|
Abstract:
Empirical analyses often require implementation of nonlinear models
whose regressors include one or more endogenous
variables—regressors that are correlated with the unobserved
random component of the model. Failure to account for such correlation
in estimation leads to bias and produces results that are not causally
interpretable. Terza et al. (2008) discuss a relatively simple
estimation method that avoids endogeneity bias and is applicable in a
wide variety of nonlinear regression contexts—two-stage residual
inclusion (2SRI). We offer a 2SRI how-to guide for practitioners and
demonstrate how the method can be easily implemented in Stata, complete
with correct asymptotic standard errors for the parameter estimates [see
Terza (2016)]. We illustrate our suggested step-by-step protocol in the
context of a real data example with Stata code. Other examples are
discussed, also coded in Stata.
References
Terza, J. V. 2016. Simpler standard errors for two-stage optimization estimators. Stata Journal, forthcoming.
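A bare-bones version of the 2SRI recipe with hypothetical variables (outcome y,
endogenous regressor x, instrument z, exogenous control w). Bootstrapping the
two stages together is one simple way to account for the first-stage estimation
when computing standard errors; Terza (2016) derives the analytic asymptotic
formulas instead.

    // wrap both stages in a program so the bootstrap resamples them jointly
    capture program drop twostage_sri
    program define twostage_sri
        // stage 1: regress the endogenous regressor on the instrument and controls
        regress x z w
        tempvar xres
        predict double `xres', residuals
        // stage 2: include the first-stage residual in the nonlinear outcome model
        probit y x w `xres'
    end

    bootstrap _b, reps(500): twostage_sri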
Additional information
Chicago16_Terza.pdf Joseph Terza
Department of Economics, Indiana University Purdue University Indianapolis
|
Fitting new statistical models with Stata |
Abstract:
We present a new Stata package for the estimation of autoregressive
distributed lag (ARDL) models in a time-series context. The ardl
command can be used to estimate an ARDL model with the optimal number of
autoregressive and distributed lags based on the Akaike or
Schwarz/Bayesian information criterion. The regression results can be
displayed in the ARDL levels form or in the error-correction
representation of the model. The latter separates long-run and short-run
effects and is available in two different parameterizations of the
long-run (cointegrating) relationship. The bounds testing procedure for
the existence of a long-run levels relationship suggested by Pesaran,
Shin, and Smith (2001, Journal of Applied Econometrics) is
implemented as a postestimation feature. As an alternative to their
asymptotic critical values, the small-sample critical values provided by
Narayan (2005, Applied Economics) are available as well.
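A hypothetical invocation, assuming a dependent variable lny and a single
regressor lnx; the option names below reflect the package's documentation as I
understand it, so check the ardl help file before use.

    // fit an ARDL model with lag orders selected by the AIC, reported in the
    // error-correction parameterization
    ardl lny lnx, aic ec

    // bounds test for a long-run levels relationship (estat ectest in recent
    // versions of the package; earlier versions called it estat btest)
    estat ectest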
Additional information
Chicago16_Kripfganz.pdf Sebastian Kripfganz
Department of Economics, University of Exeter Business School
Daniel C. Schneider
Max Planck Institute for Demographic Research
|
Abstract:
In this presentation, I describe a novel estimator for linear models
with multiple levels of fixed effects. First, I show that solving the
two-way fixed-effects model is equivalent to solving a linear system on
a graph and exploit recent advances in graph theory (Kelner et al.
2013) to propose a nearly linear time estimator. Second, I embed the
estimator into an improved version of the one by Guimaraes and Portugal
(2010) and Gaure (2013). This new estimator performs particularly well
with large datasets and high-dimensional fixed effects and can also be
used as a building block for multiple nonlinear models. Finally, I
introduce the reghdfe package, which applies this estimator and extends
it to instrumental-variable and linear GMM regressions.
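A minimal illustration with hypothetical variable names, absorbing worker and
firm fixed effects rather than estimating them explicitly:

    // two-way (high-dimensional) fixed effects absorbed by reghdfe
    reghdfe lwage tenure age, absorb(worker_id firm_id)

    // the same model with standard errors clustered by firm
    reghdfe lwage tenure age, absorb(worker_id firm_id) vce(cluster firm_id)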
Additional information
Chicago16_Correia.pdf Sergio Correia
Fuqua School of Business, Duke University
|
Abstract:
The mixed-effects location scale model extends the standard two-level
random-intercept mixed-effects model for continuous responses
(implemented in Stata as xtreg, mle) in three ways: (1)
The (log of the) within- and between-subject variances are further
modeled in terms of covariates. (2) A new random effect, referred to as
the random-scale effect, is included in the within-subject variance
function to account for any unexplained subject differences in the
residual variance. The usual random subject intercept is now referred to
as a random-location effect. (3) A subject-level association between the
location and scale is allowed by entering the random-location effect
into the within-subject variance function using either a linear or a
quadratic functional form. The distributions of the random-location and
random-scale effects are assumed to be Gaussian. The runmixregls command
estimates this model and returns all results to Stata, at which point one can
make use of all of Stata's standard postestimation and graphing commands.
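In notation consistent with the description above (a sketch of the model, not
necessarily the software's exact parameterization), for measurement j on
subject i:

    y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \upsilon_i + \epsilon_{ij}
    \log \sigma^2_{\upsilon_{ij}} = \mathbf{u}_{ij}'\boldsymbol{\alpha}
    \log \sigma^2_{\epsilon_{ij}} = \mathbf{w}_{ij}'\boldsymbol{\tau} + \tau_1 \upsilon_i \,[\, + \tau_2 \upsilon_i^2 \,] + \omega_i

where \upsilon_i is the random-location effect, \omega_i is the random-scale
effect, and the bracketed term is included only under the quadratic
location-scale association.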
Additional information
Chicago16_Hedeker.pptx Donald Hedeker
University of Chicago
George Leckie
University of Bristol
|
Reproducible research |
Abstract:
In my presentation, I will demonstrate how we can enhance our workflow
by using the Jupyter Notebook and my
IPyStata package.
IPyStata is a package written
in Python that allows users to write and execute Stata and Python code
side by side in a single Jupyter Notebook. Users can almost seamlessly
modify and analyze data using both Stata and Python because IPyStata
allows data structures (for example, datasets and macros) to be used
interchangeably. The Jupyter Notebook is a
phenomenal tool for researchers and data scientists because it allows
live code to be combined with explanatory text, equations,
visualizations, widgets, and much more. It was originally developed as
an open-source tool for interactive Python use (called the IPython
Notebook) but is now aimed at being language agnostic under the banner
of Project Jupyter. My package, IPyStata, adds Stata to the array of
software and programming languages that can be used in the Jupyter
Notebook. In my talk, I will share how I use Stata, Python, the Jupyter
Notebook, and IPyStata to transparently document and share the code and
results that underlie my work as an aspiring researcher. For a
demonstration notebook, see:
http://nbviewer.ipython.org/github/TiesdeKok/ipystata/blob/master/ipystata/Example.ipynb.
Additional information
chicago16_de_Kok.pdf Ties de Kok
Tilburg University
|
Abstract:
This talk will introduce analysis manager (AM)—an open-source and
free plug-in for conducting reproducible research and creating dynamic
documents using Microsoft Word and Stata. AM was recently developed to
address a critical need in the research community: there were no broadly
accessible tools to integrate document preparation in Word with
statistical code, results, and data. Popular tools such as Sweave,
knitR, and weaver all use LaTeX, Markdown, and plain text editors for
document preparation. Despite the merits of these programs, Microsoft
Word is ubiquitous for manuscript preparation in many fields—such
as medicine—in which conducting reproducible research is
increasingly important. We developed AM to fill this void. AM provides
an interface to edit Stata code directly from Word and allows users to
embed statistical output from that code (estimates, tables, figures)
within Word. This output can then be individually or collectively
updated in one click with a behind-the-scenes call to Stata. With AM,
modification of a dataset or analysis will no longer entail transcribing
or recopying results into a manuscript or table. This talk will provide
an introduction to using AM, including worked examples, and will be
accessible to a wide range of users.
Additional information
Chicago16_Welty.pptx Leah Welty
Division of Biostatistics, Department of Preventive Medicine, Northwestern University
Luke V. Rasmussen
Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University
Abigail S. Baldridge
Department of Preventive Medicine, Northwestern University
|
Visualizing data |
Abstract:
This talk introduces a new graphic for inference and a Stata program to
construct it. When policymakers and other lay audiences internalize
inferential results, they often ignore the measure of precision or
uncertainty that accompanies a point estimate. Doing so can lead to
inefficient action (or inaction). Different organizations take different
approaches to this problem in their reports. Some list point estimates
alone. Others list a confidence interval (CI) or make it available in an
annex. Some plot a CI using a 1-D representation. The 2015 WHO
Vaccination Coverage Cluster Survey Reference Manual recommends a new
graphic called an inchworm plot that represents results with a 2-D
distribution, highest near the point estimate and lowest at the ends of
the central 95% CI. Each distribution includes ticks to represent bounds
for strong one-sided inference. When results for several strata are
shown together, the area of each distribution is scaled to be constant;
precise estimates look like tall inchworms, poised to take a step, and
those less precise are wide and short, like worms outstretched. The talk
highlights the figure's features and its potential to represent uncertainty in a
manner that can be absorbed by laypersons without insulting (or taxing)
their intelligence.
Additional information
Chicago16_Rhoda.pptx Dale Rhoda
Biostat Global Consulting
|
Abstract:
Quantile plots show ordered values (raw data, estimates, residuals,
whatever) against rank or cumulative probability or a one-to-one
function of the same. Even in a strict sense, they are almost 200 years
old. In Stata, quantile, qqplot, and qnorm go back to 1985
and 1986. So why any fuss? The presentation is built on a
long-considered view that quantile plots are the best single plot for
univariate distributions. No other kind of plot shows so many features
so well across a range of sample sizes with so few arbitrary decisions.
Both official and user-written programs appear in a review that includes
side-by-side and superimposed comparisons of quantiles for different
groups and comparable variables. Emphasis is on newer, previously
unpublished work, with focus on the compatibility of quantiles with
transformations; fitting and testing of brand-name distributions;
quantile-box plots as proposed by Emanuel Parzen (1929–2016);
equivalents for ordinal categorical data; and the question of which
graphics best support paired and two-sample t and other tests.
Commands mentioned include distplot, multqplot, and
qplot (Stata Journal) and mylabels,
stripplot, and hdquantile (SSC).
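The official commands mentioned above can be tried directly on the auto dataset
shipped with Stata; the user-written commands (qplot, distplot, stripplot, and
so on) must be installed first.

    sysuse auto, clear

    // ordered values of mpg plotted against their plotting positions
    quantile mpg

    // quantile-quantile comparison of two variables and a normal probability plot
    qqplot mpg turn
    qnorm price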
Additional information
Chicago16_Cox.pptx Nicholas J. Cox
Durham University
|
Hypothesis testing: Multiple comparisons and power |
Abstract:
Multiple-comparison adjustments are used to control the probability of making
type I errors when performing multiple tests of group differences. Stata
supports eight different adjustment methods for multiple comparisons.
This presentation will discuss the various adjustments available in
Stata along with suggestions of when to use them.
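For instance, after a one-way ANOVA, pairwise group comparisons can be adjusted
with any of the supported methods through the mcompare() option of pwcompare
(variable names hypothetical):

    // one-way ANOVA followed by pairwise comparisons under two of the
    // available multiple-comparison adjustments
    anova outcome group
    pwcompare group, mcompare(bonferroni) effects
    pwcompare group, mcompare(tukey) effects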
Additional information
Chicago16_Ender.pdf Philip Ender
UCLA (Ret)
|
Abstract:
The log-rank test is perhaps the most commonly used nonparametric method
for comparing two survival curves and yields maximum power under
proportional hazards (PH) alternatives. While PH is a reasonable
assumption, it need not, of course, hold. Several authors have therefore
developed versatile tests using combinations of weighted log-rank
statistics that are more sensitive to non-PH hazards. For example,
Fleming and Harrington (1991) considered the family of G(rho)
statistics, while JW Lee (1996) and S-H Lee (2007) proposed tests based
on the more extended G(rho, gamma) family. In this talk, we consider
Zm = max(|Z1|, |Z2|, |Z3|), where Z1, Z2, and Z3 are z statistics
obtained from G(0,0), G(1,0), and G(0,1) tests, respectively. G(0,0)
corresponds to the log-rank test, while G(1,0) and G(0,1) are more
sensitive to early and late difference alternatives. Simulation results
indicate that the method based on Zm maintains the type I error rate,
provides increased power relative to the log-rank test under early and
late difference alternatives, and entails only a small to moderate power
loss compared with the more optimally chosen test. The syntax for a
Stata command to implement the method, verswlr, is described, and
the user can specify other choices for rho and gamma.
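The three component statistics can be obtained from official Stata's sts test
with Fleming-Harrington weights once the data have been stset (group is a
hypothetical two-arm indicator); combining them into Zm is what verswlr adds:

    // log-rank, G(1,0), and G(0,1) tests for the two-group comparison
    sts test group, logrank
    sts test group, fh(1 0)
    sts test group, fh(0 1)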
Additional information
Chicago16_Karrison.pptx Theodore Karrison
University of Chicago
|
Bayesian analysis in Stata |
Abstract:
Bayesian analysis is a flexible statistical methodology for inferring
properties of unknown parameters by combining observational evidence
with prior knowledge. Research questions are answered using explicit
probability statements. The Bayesian approach is especially well suited
for analyzing models in which the data structure imposes a hierarchy on the
model parameters. Stata 14 introduces a suite of commands for
specification and simulation of Bayesian models, computing various
posterior summaries, testing hypotheses, and comparing models. I will
describe the main features of these commands and present examples
illustrating various models, from a simple logistic regression to
hierarchical Rasch models.
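For example, a simple Bayesian logistic regression with weakly informative
normal priors on all coefficients can be fit with bayesmh (variable names
hypothetical):

    // Bayesian logistic regression; {y:} refers to all coefficients in the
    // outcome equation
    bayesmh y x1 x2, likelihood(logit) prior({y:}, normal(0, 100)) rseed(17)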
Additional information
Chicago16_Balov.pdf Nikolay Balov
StataCorp
|
Scientific committee
Phil Schumm (Chair)
The University of Chicago
Department of Public Health Sciences
Richard Williams
University of Notre Dame
Department of Sociology
Scott Long
Indiana University
Department of Sociology
Matias Cattaneo
University of Michigan
Department of Economics