Stata Conference
Chicago 2016
July 28–29, 2016

The 2016 Stata Conference was held July 28–29. Even after the meeting, you can still interact with the user community and learn more about the presentations: view the presentation slides (below) and the conference photos.

Regression discontinuity designs and models

Multidimensional regression discontinuity and regression kink designs with difference-in-differences
Abstract: This paper proposes some extensions to Calonico, Cattaneo, and Titiunik's (2014, Stata Journal 14: 909–946) commands for regression discontinuity (RD) and regression kink (RK) designs. First, the commands allow users to take a second difference in RD and RK (and higher-order derivatives). That is, they combine RD and RK estimators with difference-in-differences (DID). The DID approach may be appropriate, for instance, when the density function of the running variable is discontinuous around the cutoff but steady over time. Second, they let users specify multidimensional running variables, which is applicable, for example, to the estimation of boundary discontinuity. Finally, users can work with weighted data by specifying analytic weights and include control variables. For any of those extensions, the command ddbwsel calculates the MSE-optimal bandwidth, as proposed by Imbens and Kalyanaraman (2012, Review of Economic Studies 79: 933–959) and Calonico, Cattaneo, and Titiunik (2014, Econometrica 82: 2295–2326). The command ddrd implements the robust bias-corrected confidence intervals for any of the above extensions. The command dddens implements McCrary's (2008, Journal of Econometrics 142: 698–714) density test for cross-sectional RD and RK, and RD and RK with DID. The paper presents applications of two-dimensional RD with DID and one-dimensional RK with DID.
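The fragment below is a minimal sketch of the difference-in-differences RD idea described above, using the existing rdrobust command of Calonico, Cattaneo, and Titiunik (2014) on a first-differenced outcome; the dataset and variable names (panel_data, id, year, y, running) are hypothetical, and the authors' ddbwsel, ddrd, and dddens commands handle bandwidth selection, bias correction, and density testing jointly rather than through this two-step shortcut.

use panel_data, clear                            // hypothetical two-period panel
bysort id (year): gen double d_y = y - y[_n-1]   // difference out time-invariant confounds
rdrobust d_y running, c(0)                       // sharp RD on the first-differenced outcome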
Additional information
Chicago16_Ribas.pdf
Rafael Ribas
University of Amsterdam
TED: Stata module for testing stability of regression discontinuity models
Abstract: Regression discontinuity (RD) models are commonly used to nonparametrically identify and estimate a local average treatment effect (LATE). Dong and Lewbel (2015) show how a derivative of this LATE can be estimated. They use their treatment-effect derivative (TED) to estimate how the LATE would change if the RD threshold changed. We argue that their estimator should be employed in most RD applications as a way to assess the stability and hence external validity of RD estimates. Closely related to TED is the complier probability derivative (CPD) of Cerulli et al. (2016). Just as TED measures stability of the treatment effect, the CPD measures stability of the complier population in fuzzy designs. In this paper, we provide the Stata module ted, which can be used to easily implement TED and CPD estimation, and we apply it to some real datasets.

References
Dong, Y., and A. Lewbel. 2015. Identifying the effect of changing the policy threshold in regression discontinuity models. Review of Economics and Statistics 97: 1081–1092.

Cerulli, G., Y. Dong, A. Lewbel, and A. Poulsen. 2016. Testing stability of regression discontinuity models. Advances in Econometrics, forthcoming.

Additional information
Chicago16_Cerulli.pdf
Giovanni Cerulli
CNR - IRCrES
rdlocrand: A Stata package for inference in regression discontinuity designs under local randomization
Abstract: We introduce the Stata module rdlocrand, which contains four commands to conduct finite-sample inference in regression discontinuity (RD) designs under a local randomization assumption, following the framework and methods proposed in Cattaneo, Frandsen, and Titiunik (2015) and Cattaneo, Titiunik, and Vazquez-Bare (2016). Assuming a known assignment mechanism for units close to the RD cutoff, these functions implement a variety of procedures based on randomization inference techniques. First, the command rdrandinf employs randomization methods to conduct point estimation, hypothesis testing and confidence interval estimation under different assumptions. Second, the command rdwinselect employs finite-sample methods to select a window near the cutoff where the assumption of randomized treatment assignment is most plausible. Third, the command rdsensitivity employs randomization techniques to conduct a sequence of hypothesis tests for different windows around the RD cutoff, which can be used to assess the sensitivity of the methods as well as to construct confidence intervals by inversion. Finally, the command rdrbounds implements sensitivity bounds (Rosenbaum 2002) for the context of RD designs under local randomization. Companion R functions with the same syntax and capabilities are also provided.
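As a hedged illustration of the workflow described above (window selection followed by randomization inference), the fragment below assumes a hypothetical dataset with outcome y, running variable margin centered at a cutoff of 0, and pretreatment covariates x1 and x2; the window half-width shown is purely illustrative and would in practice come from the rdwinselect output.

rdwinselect margin x1 x2               // pick a window where covariates look locally randomized
rdrandinf y margin, wl(-2.5) wr(2.5)   // randomization inference within the selected window
* rdsensitivity and rdrbounds extend this to multiple windows and to sensitivity bounds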
Additional information
Chicago16_Vazquez-Bare.pdf
Gonzalo Vazquez-Bare
Department of Economics, University of Michigan
Matias Cattaneo
Department of Economics and Department of Statistics, University of Michigan
Rocio Titiunik
Department of Political Science, University of Michigan

Integrating Stata with web technologies

Mining Twitter data for fun and profit
Abstract: Twitter feed data have increasingly become a rich source of information, both for commercial marketing purposes and for social science research. In the early days of Twitter, researchers could access the Twitter API with a simple URL. In order to control the size of data requests and to track who is accessing what data, Twitter has instituted some security policies that make this much more difficult for the average researcher. Users must obtain unique keys from Twitter, create a timestamp, generate a random string, combine this with the actual data request, hash this string using HMAC-SHA1, and submit all of this to the Twitter API in a timely fashion. Large requests must be divided into multiple smaller requests and spread out over a period of time. Our particular need was to obtain Twitter user profile information for over 800 users who had previously tweeted about surgery. We developed a Stata command to automate this process. It includes a number of bitwise operator functions and other tools that could be useful in other applications. We will discuss the unique features of this command, the tools required to implement it, and the feasibility of extending this example to other data request types.
Additional information
Chicago16_Canner.pptx
Joseph Canner
Johns Hopkins University School of Medicine
Neeraja Nagarajan
Johns Hopkins University School of Medicine
Stata tweets and other API libraries
Abstract: With the integration of cURL, Stata can do amazing things: retrieving Instagram image information, geocoding batches of coordinates using the US Census Geocoder, and sending direct messages via Twitter. This presentation focuses on the steps necessary to accomplish these tasks without the use of any additional software. We introduce a new Mata library that handles bitwise operations, Base64 encoding, and the HMAC-SHA1 algorithm, all of which are needed to submit POST requests to Twitter's API. As a result, Stata users can favorite tweets, retrieve timeline data, and send secure direct messages entirely from the Stata console.
Additional information
Chicago16_Matsuoka.pptx
William Matsuoka
San Diego Gas & Electric
Static to live: Combining Stata with Google Charts API
Abstract: Stata graphics are professional, visually pleasing, and easy to format, but they lack the interactive experience or transparency options requested by many users. In this presentation, we introduce a new command, gchart, a fully functional wrapper for the Google Charts API. Written almost entirely like the twoway command, it allows users to produce quality JavaScript data visualizations using familiar Stata graphing syntax. While other Google Charts-based programs already exist, gchart aims to be the most comprehensive library to date. With gchart, Stata users can present interactive web-based graphics without exporting their data to secondary software packages or learning JavaScript and HTML. The gchart library contains most Stata graph types, such as bar, pie, and line, as well as new graphs offered by Google Charts, such as treemaps, timelines, Sankey diagrams, and more. The command contains an option to create interactive tables directly from datasets and even has preset settings to make the resulting Google Charts feel more Stata-like. Once the visualizations are rendered, web visitors and blog readers will be happy to play with the resulting graphics.
Additional information
Chicago16_Chavez.pdf
Belen Chavez
San Diego Gas & Electric
William Matsuoka
San Diego Gas & Electric
Interactive data visualization for the web: Using jsonio, libd3, and libhtml to create reusable D3js
Abstract: Exploratory data analysis and visualization is the bedrock upon which we construct our understanding of the world around us. It is an immensely powerful tool for developing our understanding of the data and for communicating our results in ways that are intuitive to many audiences. As our data increase in complexity, static visualizations may not be an efficient or sufficient way to extract meaning from them. The jsonio, libd3, and libhtml packages were developed specifically to address this limitation within Stata and to provide a toolkit to which other users can easily and quickly contribute. The jsonio package is a Java plugin that helps users generate JSON objects from the data they have in Stata and, unlike a .csv export, retains crucial metadata (value labels, variable labels, etc.) that can help interpret the meaning of the data visualizations. The libd3 Mata library mimics the D3js API as closely as possible to make it easier for users to take existing code and implement it in Mata without significant effort. The libhtml library provides HTML5 DOM element classes for constructing HTML documents. Together, these form a powerful toolkit for interactive exploratory data analysis and visualization.
Additional information
Chicago16_Buchanan (online)
Billy Buchanan
Office of Research, Evaluation, and Assessment, Minneapolis Public Schools

Topics in Stata programming

Setting up loops and macros in Stata to estimate value-added scores for teacher evaluations
Abstract: Value-added models (VAM) are required by many states to hold teachers accountable for students' growth within a certain period of time. For instance, the Florida Department of Education developed VAM for state-mandated assessments. Meanwhile, to evaluate teachers of subjects or grades not assessed by the state models, local districts in Florida started to build their own. However, with hundreds of locally developed and statewide-administered standardized assessments, many districts found it difficult to develop systematic and efficient macros to run hundreds of statistical models and combine the results. This presentation demonstrates a template for setting up loops and macros in Stata that run a variety of linear regression models (single-level models with fixed effects and two- and three-level models with random effects) to generate VAM-related statistics (for example, point estimates, appropriate standard errors, performance levels, and model specification statistics) and save the results in designated files. Specifically, in the case of single-level models, I developed commands that make Stata compare the output and automatically select the teacher with a median-sized effect as the reference for each model. By doing so, no human judgment is involved during the looping process, making it time efficient and free of human error.
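The fragment below is a minimal sketch of the looping idea, assuming hypothetical per-assessment datasets (alg1_students.dta, and so on) with variables score, pre_score, and teacher_id; the presentation's models and output handling are considerably more elaborate.

local assessments "alg1 geom bio1"
foreach a of local assessments {
    use "`a'_students.dta", clear
    mixed score pre_score || teacher_id:        // two-level value-added model
    predict double teacher_effect, reffects     // empirical Bayes teacher effects
    estimates save "vam_`a'", replace           // store estimates for later tabulation
    save "`a'_with_vam.dta", replace            // keep the predicted effects
}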
Additional information
Chicago16_An_C.pptx
Chen An
Department of Accountability, Research and Evaluation, Orange County Public Schools

Data analysis: Tools and techniques

What does your model say? It may depend on who is asking
Abstract: Doctors and consultants want to know the effect of a covariate for a given covariate pattern. Policy analysts want to know a population-level effect of a covariate. I discuss how to estimate and interpret these effects using factor variables and margins.
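A minimal sketch of the distinction, using the margex example dataset that ships with Stata; the covariate value in the second margins call is purely illustrative.

webuse margex, clear
logit outcome i.group c.age
margins, dydx(group)                  // population-averaged effect: the policy analyst's question
margins, dydx(group) at(age=40)       // effect at a specific covariate pattern: the clinician's question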
Additional information
Chicago16_Drukker.pdf
David Drukker
StataCorp
Polychoric, by any other "namelist"
Abstract: Polychoric correlation is the correlation between two ordinal variables obtained as the maximum likelihood estimate under the assumption that the ordinal variables are obtained by coarsening a bivariate normal distribution. In the early 2000s, I developed a suite of programs for polychoric correlation matrix analysis and follow-up principal component analysis for a common application: scoring households on their socio-economic status (SES) based on categorical proxies of wealth, such as the materials used in the house (dirt floor versus wooden floor versus tile or cement floor) and the presence or absence of certain assets (TV, radio, bike, car, etc.). Even though my polychoric program from circa 2004 appears to be finding some good use in the Stata world, it lacks a number of important features. I will describe how newer Stata tools complement and enhance what polychoric was purported to achieve. While polychoric only deals with pairwise correlations, David Roodman's cmp provides better-justified joint correlations based on numerical multivariate integration. Also, while plugging the principal components obtained from polychoric into regression models leads to underaccounting for sampling errors in regression coefficient estimates because of the generated-regressors problem, generalized structural equation modeling with gsem provides the capability of simultaneously estimating models that use the SES index as a predictor of a substantive outcome. I will review a stylized example from Bollen, Glanville, and Stecklov's seminal papers on the use of latent variable models in analyzing socio-economic status and demonstrate how these different programs can be used to that effect.
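The fragment below roughly sketches the two approaches contrasted above, assuming hypothetical indicators (floor, tv, radio, bike) and an outcome y; the option names shown for the user-written polychoric suite follow its help file as I recall it and may differ across versions.

* Polychoric PCA approach: score households, then use the index downstream
polychoric floor tv radio bike
polychoricpca floor tv radio bike, score(ses) nscore(1)
* gsem approach: treat SES as a latent variable and estimate everything jointly
gsem (SES -> floor, ologit) (SES -> tv radio bike, logit) (y <- SES)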
Additional information
Chicago16_Kolenikov.pdf
Chicago16_Kolenikov.rar (additional exercises)
Stas Kolenikov
Abt SRBI

Data analysis: Tools and techniques (continued)

Modeling rating transition matrices for wholesale loan portfolios
Abstract: Risk analysis of a commercial bank's wholesale loan portfolios involves modeling the asset quality (AQ) ratings of each borrower's obligations. This customarily involves transition matrices that capture the probability that a loan's AQ rating will migrate to a higher or lower rating or transition to the default state. We compare and contrast three approaches to transition matrix modeling: the single-factor approach commonly used in the financial industry, an approach based on time-series forecasts of default rates, and an approach based on modeling selected elements of the transition matrix that comprise the most likely outcomes. We find that both of these less orthodox approaches perform very well over a sample period encompassing the financial crisis.
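As a starting point for this kind of analysis, the fragment below shows how an empirical one-quarter transition matrix can be tabulated in Stata, assuming hypothetical panel data with a loan identifier, a quarter variable, and an ordinal AQ rating; the modeling approaches compared in the talk go well beyond this raw tabulation.

xtset loan_id quarter
xttrans rating, freq      // empirical transition frequencies and probabilities between ratings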
Additional information
Chicago16_Baum.pdf
Kit Baum
Boston College and DIW Berlin
Soner Tunay
Citizens Financial Group
Alper Corlu
Citizens Financial Group
Combining difference-in-differences and matching for panel-data analysis
Abstract: Currently, panel-data analysis relies largely on parametric models (such as random-effects and fixed-effects models). These models make strong assumptions in order to draw causal inference, while in practice these assumptions may not hold. Compared with parametric models, matching does not make strong parametric assumptions and also helps provide focused inference on the effect of a particular cause. However, matching has typically been used in cross-sectional data analysis. In this paper, we extend matching to panel-data analysis. In the spirit of the difference-in-differences method, we first difference the outcomes to remove the fixed effects. Then, we apply matching to the differenced outcomes at each wave (except the first). The results can be used to examine whether treatment effects vary across time. The estimates from the separate waves can also be combined to provide an overall estimate of the treatment effects. In doing so, we present a variance estimator for the overall treatment effects that can account for complicated sequential dependence in the data. We demonstrate the method through empirical examples and show its efficacy in comparison with previous methods. We also outline a Stata add-on, DIDMatch, that we are creating to implement the method.
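A minimal sketch of the differencing-then-matching idea for a two-wave panel, assuming hypothetical variables (id, wave, y, treat, x1, x2) with x1 and x2 measured at baseline; the authors' planned DIDMatch command automates this across waves and supplies the variance estimator discussed above.

bysort id (wave): gen double d_y = y - y[_n-1]   // remove unit fixed effects by differencing
keep if wave == 2                                // one differenced observation per unit
teffects nnmatch (d_y x1 x2) (treat)             // match on covariates, estimate the ATE of treat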
Additional information
Chicago16_An_W.pdf
Weihua An
Departments of Sociology and Statistics, Indiana University
A practitioner's guide to implementing the two-stage residual inclusion method in Stata
Abstract: Empirical analyses often require implementation of nonlinear models whose regressors include one or more endogenous variables—regressors that are correlated with the unobserved random component of the model. Failure to account for such correlation in estimation leads to bias and produces results that are not causally interpretable. Terza et al. (2008) discuss a relatively simple estimation method that avoids endogeneity bias and is applicable in a wide variety of nonlinear regression contexts—two-stage residual inclusion (2SRI). We offer a 2SRI how-to guide for practitioners and demonstrate how the method can be easily implemented in Stata, complete with correct asymptotic standard errors for the parameter estimates [see Terza (2016)]. We illustrate our suggested step-by-step protocol in the context of a real data example with Stata code. Other examples are discussed, also coded in Stata.
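A bare-bones 2SRI sketch with hypothetical variables (binary outcome y, endogenous regressor x, exogenous covariate w, instrument z); the talk's step-by-step protocol, including the corrected asymptotic standard errors of Terza (2016), goes well beyond this.

* First stage: regress the endogenous variable on instruments and exogenous covariates
regress x z w
predict double xres, residuals
* Second stage: include the first-stage residual in the nonlinear outcome model
probit y x w xres
* The default second-stage standard errors ignore first-stage estimation error; Terza (2016)
* derives correct asymptotic SEs, and bootstrapping both stages together is a common fallback.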

References
Terza, J., A. Basu, and P. Rathouz. 2008. Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics 27: 531–543.

Terza, J. V. 2016. Simpler standard errors for two-stage optimization estimators. Stata Journal, forthcoming.

Additional information
Chicago16_Terza.pdf
Joseph Terza
Department of Economics, Indiana University Purdue University Indianapolis

Fitting new statistical models with Stata

ardl: Stata module to estimate autoregressive distributed lag models
Abstract: We present a new Stata package for the estimation of autoregressive distributed lag (ARDL) models in a time-series context. The ardl command can be used to estimate an ARDL model with the optimal number of autoregressive and distributed lags based on the Akaike or Schwarz/Bayesian information criterion. The regression results can be displayed in the ARDL levels form or in the error-correction representation of the model. The latter separates long-run and short-run effects and is available in two different parameterizations of the long-run (cointegrating) relationship. The bounds testing procedure for the existence of a long-run levels relationship suggested by Pesaran, Shin, and Smith (2001, Journal of Applied Econometrics) is implemented as a postestimation feature. As an alternative to their asymptotic critical values, the small-sample critical values provided by Narayan (2005, Applied Economics) are available as well.
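A short sketch of typical usage, assuming a hypothetical quarterly dataset containing variables y and x; option and postestimation names follow the ardl help file, so check the installed version (older releases named the bounds test estat btest rather than estat ectest).

tsset quarter
ardl y x, aic            // lag orders selected by the Akaike information criterion
ardl y x, aic ec         // error-correction form separating long-run and short-run effects
estat ectest             // Pesaran-Shin-Smith bounds test for a long-run levels relationship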
Additional information
Chicago16_Kripfganz.pdf
Sebastian Kripfganz
Department of Economics, University of Exeter Business School
Daniel C. Schneider
Max Planck Institute for Demographic Research
reghdfe: Estimating linear models with multiway fixed effects
Abstract: In this presentation, I describe a novel estimator for linear models with multiple levels of fixed effects. First, I show that solving the two-way fixed-effects model is equivalent to solving a linear system on a graph, and I exploit recent advances in graph theory (Kelner et al. 2013) to propose a nearly linear time estimator. Second, I embed this estimator into an improved version of the estimator of Guimaraes and Portugal (2010) and Gaure (2013). The new estimator performs particularly well with large datasets and high-dimensional fixed effects and can also be used as a building block for multiple nonlinear models. Finally, I introduce the reghdfe package, which implements this estimator and extends it to instrumental-variables and linear GMM regressions.
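A minimal sketch of basic usage, assuming a hypothetical panel with firm and year identifiers; absorb() takes the fixed-effects dimensions, and the clustered variance option is illustrative.

reghdfe y x1 x2, absorb(firm_id year) vce(cluster firm_id)   // two-way fixed effects absorbed rather than estimated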
Additional information
Chicago16_Correia.pdf
Sergio Correia
Fuqua School of Business, Duke University
runmixregls: A mixed-effects location scale model run within Stata
Abstract: The mixed-effects location scale model extends the standard two-level random-intercept mixed-effects model for continuous responses (implemented in Stata as xtreg, mle) in three ways: (1) The (log of the) within- and between-subject variances are further modeled in terms of covariates. (2) A new random effect, referred to as the random-scale effect, is included in the within-subject variance function to account for any unexplained subject differences in the residual variance. The usual random subject intercept is now referred to as a random-location effect. (3) A subject-level association between the location and scale is allowed by entering the random-location effect into the within-subject variance function using either a linear or a quadratic functional form. The distributions of the random-location and random-scale effects are assumed to be Gaussian. runmixregls returns all model results to Stata, at which point one can make use of all of Stata's standard postestimation and graphing commands.
Additional information
Chicago16_Hedeker.pptx
Donald Hedeker
University of Chicago
George Leckie
University of Bristol

Reproducible research

Combine Stata with Python using the Jupyter Notebook
Abstract: In my presentation, I will demonstrate how we can enhance our workflow by using the Jupyter Notebook and my IPyStata package. IPyStata is a package written in Python that allows users to write and execute Stata and Python code side by side in a single Jupyter Notebook. Users can almost seamlessly modify and analyze data using both Stata and Python because IPyStata allows data structures (for example, datasets and macros) to be used interchangeably. The Jupyter Notebook is a phenomenal tool for researchers and data scientists because it allows live code to be combined with explanatory text, equations, visualizations, widgets, and much more. It was originally developed as an open-source tool for interactive Python use (called the IPython Notebook) but is now aimed at being language agnostic under the banner of Project Jupyter. My package, IPyStata, adds Stata to the array of software and programming languages that can be used in the Jupyter Notebook. In my talk, I will share how I use Stata, Python, the Jupyter Notebook, and IPyStata to transparently document and share the code and results that underlie my work as an aspiring researcher. For a demonstration notebook, see: http://nbviewer.ipython.org/github/TiesdeKok/ipystata/blob/master/ipystata/Example.ipynb.
Additional information
chicago16_de_Kok.pdf
Ties de Kok
Tilburg University
Analysis Manager: A reproducible research tool for generating dynamic documents using Microsoft Word
Abstract: This talk will introduce Analysis Manager (AM), a free and open-source plug-in for conducting reproducible research and creating dynamic documents using Microsoft Word and Stata. AM was recently developed to address a critical need in the research community: there were no broadly accessible tools to integrate document preparation in Word with statistical code, results, and data. Popular tools such as Sweave, knitr, and weaver all use LaTeX, Markdown, and plain-text editors for document preparation. Despite the merits of these programs, Microsoft Word is ubiquitous for manuscript preparation in many fields, such as medicine, in which conducting reproducible research is increasingly important. We developed AM to fill this void. AM provides an interface to edit Stata code directly from Word and allows users to embed statistical output from that code (estimates, tables, figures) within Word. This output can then be updated individually or collectively in one click with a behind-the-scenes call to Stata. With AM, modifying a dataset or analysis no longer entails transcribing or recopying results into a manuscript or table. This talk will provide an introduction to using AM, including worked examples, and will be accessible to a wide range of users.
Additional information
Chicago16_Welty.pptx
Leah Welty
Division of Biostatistics, Department of Preventive Medicine, Northwestern University
Luke V. Rasmussen
Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University
Abigail S. Baldridge
Department of Preventive Medicine, Northwestern University

Visualizing data

Inchworm plots: Visual representation of inferential uncertainty
Abstract: This talk introduces a new graphic for inference and a Stata program to construct it. When policymakers and other lay audiences internalize inferential results, they often ignore the measure of precision or uncertainty that accompanies a point estimate. Doing so can lead to inefficient action (or inaction). Different organizations take different approaches to this problem in their reports. Some list point estimates alone. Others list a confidence interval (CI) or make it available in an annex. Some plot a CI using a 1-D representation. The 2015 WHO Vaccination Coverage Cluster Survey Reference Manual recommends a new graphic called an inchworm plot that represents results with a 2-D distribution, highest near the point estimate and lowest at the ends of the central 95% CI. Each distribution includes ticks to represent bounds for strong one-sided inference. When results for several strata are shown together, the area of each distribution is scaled to be constant; precise estimates look like tall inchworms, poised to take a step, and those less precise are wide and short, like worms outstretched. The talk highlights figure features and potential to represent uncertainty in a manner that can be absorbed by laypersons without insulting (or taxing) their intelligence.
Additional information
Chicago16_Rhoda.pptx
Dale Rhoda
Biostat Global Consulting
Vote for quantile plots! New planks in an old campaign
Abstract: Quantile plots show ordered values (raw data, estimates, residuals, whatever) against rank or cumulative probability or a one-to-one function of the same. Even in a strict sense, they are almost 200 years old. In Stata, quantile, qqplot, and qnorm go back to 1985 and 1986. So why any fuss? The presentation is built on a long-considered view that quantile plots are the best single plot for univariate distributions. No other kind of plot shows so many features so well across a range of sample sizes with so few arbitrary decisions. Both official and user-written programs appear in a review that includes side-by-side and superimposed comparisons of quantiles for different groups and comparable variables. Emphasis is on newer, previously unpublished work, with focus on the compatibility of quantiles with transformations; fitting and testing of brand-name distributions; quantile-box plots as proposed by Emanuel Parzen (1929–2016); equivalents for ordinal categorical data; and the question of which graphics best support paired and two-sample t and other tests. Commands mentioned include distplot, multqplot, and qplot (Stata Journal) and mylabels, stripplot, and hdquantile (SSC).
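For readers who want to try the ideas above, a minimal example using the auto data shipped with Stata; qplot is the author's Stata Journal command and must be installed separately, and its over() option is assumed here for the group comparison.

sysuse auto, clear
quantile mpg               // official command: ordered values against fraction of the data
qplot mpg, over(foreign)   // user-written qplot: quantiles compared across groups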
Additional information
Chicago16_Cox.pptx
Nicholas J. Cox
Durham University

Hypothesis testing: Multiple comparisons and power

Comparing multiple comparisons
Abstract: Multiple-comparison adjustments are used to control the probability of making a type I error when performing multiple tests of group differences. Stata supports eight different adjustment methods for multiple comparisons. This presentation will discuss the various adjustments available in Stata, along with suggestions on when to use them.
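A minimal illustration of two of the available adjustments, using the systolic example dataset from the Stata manuals; choosing among the adjustments is exactly the topic of the talk.

webuse systolic, clear
anova systolic drug
pwcompare drug, mcompare(bonferroni) effects   // Bonferroni-adjusted pairwise comparisons
pwcompare drug, mcompare(tukey) effects        // Tukey HSD adjustment (available after anova)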
Additional information
Chicago16_Ender.pdf
Philip Ender
UCLA (Ret)
Versatile tests for comparing survival curves based on weighted log-rank statistics
Abstract: The log-rank test is perhaps the most commonly used nonparametric method for comparing two survival curves and yields maximum power under proportional hazards (PH) alternatives. While PH is a reasonable assumption, it need not, of course, hold. Several authors have therefore developed versatile tests using combinations of weighted log-rank statistics that are more sensitive to non-PH hazards. For example, Fleming and Harrington (1991) considered the family of G(rho) statistics, while JW Lee (1996) and S-H Lee (2007) proposed tests based on the more extended G(rho, gamma) family. In this talk, we consider Zm=max(|Z1|,|Z2|,|Z3|), where Z1, Z2, and Z3 are z statistics obtained from G(0,0), G(1,0), and G(0,1) tests, respectively. G(0,0) corresponds to the log-rank test, while G(1,0) and G(0,1) are more sensitive to early and late difference alternatives. Simulation results indicate that the method based on Zm maintains the type I error rate, provides increased power relative to the log-rank test under early and late difference alternatives, and entails only a small to moderate power loss compared with the more optimally chosen test. The syntax for a Stata command to implement the method, verswlr, is described, and the user can specify other choices for rho and gamma.
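The three component statistics combined by the proposed method can be computed with official Stata's sts test and its Fleming-Harrington fh(rho gamma) option, as sketched below on the drug2 example dataset (variable names assumed from the Stata survival manuals); the verswlr command described in the talk combines them into the maximum statistic Zm.

webuse drug2, clear
stset studytime, failure(died)
sts test drug              // G(0,0): the standard log-rank test
sts test drug, fh(1 0)     // G(1,0): weights early differences more heavily
sts test drug, fh(0 1)     // G(0,1): weights late differences more heavily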
Additional information
Chicago16_Karrison.pptx
Theodore Karrison
University of Chicago

Bayesian analysis in Stata

Bayesian hierarchical models in Stata
Abstract: Bayesian analysis is a flexible statistical methodology for inferring properties of unknown parameters by combining observational evidence with prior knowledge. Research questions are answered using explicit probability statements. The Bayesian approach is especially well suited for analyzing data models in which the data structure imposes a model parameter hierarchy. Stata 14 introduces a suite of commands for specification and simulation of Bayesian models, computing various posterior summaries, testing hypotheses, and comparing models. I will describe the main features of these commands and present examples illustrating various models, from a simple logistic regression to hierarchical Rasch models.
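A minimal sketch of the command suite in action: a Bayesian logistic regression fit with bayesmh (Stata 14 or later) on the auto data shipped with Stata, with an illustrative normal prior on all coefficients.

sysuse auto, clear
set seed 12345
bayesmh foreign mpg weight, likelihood(logit) ///
    prior({foreign:}, normal(0, 100))
bayesstats summary         // posterior summaries for all model parameters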
Additional information
Chicago16_Balov.pdf
Nikolay Balov
StataCorp

Scientific committee

Phil Schumm (Chair)
The University of Chicago
Department of Public Health Sciences

Richard Williams
University of Notre Dame
Department of Sociology

Scott Long
Indiana University
Department of Sociology

Matias Cattaneo
University of Michigan
Department of Economics