Last updated: 31 July 2013
2013 Stata Conference New Orleans
18–19 July 2013
Hyatt French Quarter New Orleans
800 Iberville Street
New Orleans, Louisiana
Proceedings
Fitting complex mixed logit models with particular focus on labor supply estimation
Max Löffler
IZA and University of Cologne
When one estimates discrete choice models, the mixed logit approach is
commonly superior to simple conditional logit setups. Mixed logit models not only allow the researcher to incorporate complex random components but also overcome the restrictive IIA assumption. Despite these theoretical
advantages, the estimation of mixed logit models becomes cumbersome when the
model's complexity increases. Applied works therefore often rely on rather
simple empirical specifications because this reduces the computational
burden. I introduce the user-written command
lslogit, which fits
complex mixed logit models using maximum simulated likelihood methods.
Because lslogit is implemented as a d2 ML evaluator written in Mata, estimation is rather efficient compared with other routines. It allows the researcher to
specify complicated structures of unobserved heterogeneity and to choose
from a set of frequently used functional forms for the direct utility
function--for example, Box–Cox transformations, which are difficult to
estimate in the context of logit models. The particular focus of
lslogit is on the estimation of labor supply models in the discrete
choice context; therefore, it facilitates several computationally demanding
but standard tasks in this research area. However, the command can be used
in many other applications of mixed logit models as well.
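For readers new to this setup, here is a minimal conditional-logit baseline of the kind that lslogit generalizes, using Stata's built-in clogit; the dataset and variable names (labor_choices, hh, chosen, consumption, leisure) are hypothetical, and lslogit's own syntax is documented in its help file rather than reproduced here.

    * One record per household-alternative pair; chosen marks the selected alternative
    use labor_choices, clear
    clogit chosen consumption leisure c.consumption#c.leisure, group(hh)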
Additional information
nola13-loffler.pdf
New Stata code to measure poverty accounting for time
Carlos Gradín
Universidade de Vigo and EQUALITAS
The purpose of this presentation is to introduce a new user-written code
that allows for measuring poverty in a panel of individuals. It complements
existing poverty codes for a cross-section of individuals (for example,
povdeco, poverty) by producing a new family of indices proposed by
Gradín, Cantó, and Del Río (2012). This family of
indices is a natural extension of the popular
Foster–Greer–Thorbecke (FGT) poverty indices to the longitudinal
case in which individuals are observed for more than one period. It takes into account that longer poverty spells and more unequal poverty profiles aggravate overall poverty. These measures have attractive decomposability properties. A further advantage of this family of indices is that it embraces other indices recently proposed in the literature as special cases.
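For reference, the cross-sectional FGT family that these indices extend is, in standard notation (not taken from the presentation),

\[ P_\alpha = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{z - y_i}{z}\right)^{\alpha}\mathbf{1}(y_i < z), \]

where z is the poverty line, y_i is individual income, and \alpha is the poverty-aversion parameter (\alpha = 0 gives the headcount ratio and \alpha = 1 the poverty gap). The longitudinal family aggregates such per-period poverty gaps over each individual's poverty profile, as detailed in the reference below.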
Reference
Gradín, C., O. Cantó, and C. Del Río. 2012. Measuring poverty accounting for time.
Review of Income and Wealth 58: 330–354.
Additional information
nola13-gradin.pptx
Demand system estimation with Stata: Multivariate censoring and other econometric issues
Soufiane Khoudmi
Benoit Mulkay
University of Montpellier
This presentation provides a Stata application for the estimation of Banks,
Blundell, and Lewbel's (1997) demand system dealing with the zero problem,
which is central to many expenditure survey analyses. We start from Poi's
(2008) routine, and our main contribution is the multivariate censoring
correction; we implement Tauchmann's (2010) theoretical framework, which
relies on including correction terms in the system. These are computed from
a multivariate probit estimated with simulated maximum likelihood using
Cappellari and Jenkins's (2006) mvnp routine. We also discuss how to deal
with several econometric issues related to the demand system estimation
literature: total budget endogeneity, conditional linearity, and
the symmetry restriction (using a minimum distance estimator).
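As an illustration of the first-stage censoring equations only (not the authors' full routine), a multivariate probit of purchase indicators can be fit by simulated maximum likelihood with Cappellari and Jenkins's mvprobit command, a companion to mvnp; the variable names below are hypothetical, and the construction of the correction terms and the demand system itself follow Tauchmann (2010) and Poi (2008) and are not shown.

    * Hypothetical first stage: joint purchase decisions for three goods,
    * estimated by simulated ML (GHK simulator) with 50 draws
    mvprobit (buy1 = lnexp z1 z2) (buy2 = lnexp z1 z2) (buy3 = lnexp z1 z2), draws(50)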
References
Banks, J., R. Blundell, and A. Lewbel. 1997. Quadratic
Engel curves and consumer demand.
Review of Economics and Statistics
79: 527–539.
Cappellari, L. and S. P. Jenkins. 2006. Calculation
of multivariate normal probabilities by simulation, with applications to
maximum simulated likelihood estimation.
Stata Journal 6:
156–189.
Poi, B. 2008. Demand-system estimation: Update.
Stata Journal 8: 554–556.
Tauchmann, H. 2010. Consistency
of Heckman-type two-step estimators for the multivariate sample-selection
model.
Applied Economics 42: 3895–3902.
Additional information
nola13-khoudmi.pdf
A general approach to testing for autocorrelation
Christopher F. Baum
Boston College and DIW Berlin
Mark E. Schaffer
Heriot-Watt University
Testing for the presence of autocorrelation in a time series is a common
task for researchers working with time-series data. The standard Q test
statistic, introduced by Box and Pierce (1970) and refined by Ljung and Box
(1978), is applicable to univariate time series and to testing for residual
autocorrelation under the assumption of strict exogeneity. Breusch (1978)
and Godfrey (1978) in effect extended the L-B-P approach to testing for autocorrelation in the residuals of models with weakly exogenous regressors.
However, each of these readily available tests has important limitations.
We use the results of Cumby and Huizinga (1992) to extend the implementation
of the Q test statistic of L-B-P-B-G to cover a much wider range of
hypotheses and settings: (a) tests for the presence of autocorrelation of
order p through q, where under the null hypothesis, there may be
autocorrelation of order p-1 or less; (b) tests after estimation in
which regressors are endogenous and estimation is by IV or GMM methods; and
(c) tests after estimation using panel data. We show that the
Cumby–Huizinga test, although developed for the large-T setting, is
formally identical to the test developed by Arellano and Bond (1991) for
AR(2) in a large-N panel setting.
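For context, the standard tests named above are available in official Stata as sketched below, with hypothetical variable names and tsset/xtset data assumed; the generalized Cumby–Huizinga implementation presented in the talk is not reproduced here.

    * Ljung-Box/Box-Pierce Q test on a univariate series
    wntestq y, lags(12)

    * Breusch-Godfrey test on OLS residuals with weakly exogenous regressors
    regress y L.y x
    estat bgodfrey, lags(1/4)

    * Arellano-Bond test for AR(2) in the first-differenced errors of a dynamic panel
    xtabond y x, lags(1)
    estat abond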
Additional information
nola13-baum.pdf
Impulse–response functions analysis: An application to the exchange rate pass-through in Mexico
Sylvia Beatriz Guillermo Peón
Benemérita Universidad Autónoma de Puebla
Martin Rodriguez Brindis
Universidad La Salle
This paper analyzes the exchange rate pass-through mechanism for the Mexican economy using Stata under two time-series
frameworks. The first framework is a recursive structural VAR (SVAR) model,
which, unlike the traditional VAR model, allows us to impose additional
restrictions on the contemporaneous and lagged matrices of coefficients. The
second is a VEC approach, which considers the possibility of valid
cointegrating relationships and allows us to incorporate the deviations from
the long-run equilibrium (cointegrating equations) as explanatory variables
when modeling the short-run behavior of the variables. Both frameworks aim
at the estimation of impulse–response functions (IRFs) as a tool to
analyze the degree and timing of the effect of exchange rate changes on
domestic prices. The recursive SVAR approach allows us to estimate the
structural IRFs, while the VEC approach uses the Cholesky decomposition of
the white noise variance–covariance matrix by imposing some necessary
restrictions so that causal interpretation of the simple IRFs is possible.
If cointegration exists, estimation of the IRFs provides a tool to identify
when the effect of a shock to the exchange rate is transitory and when it is
permanent.
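A minimal sketch of the two official-Stata workflows the paper builds on is given below, with hypothetical variable names (dlnex and dlnp for the recursive VAR in differences; lnex and lnp for the VEC in levels); the data are assumed to be tsset, and the paper's exact specification is not reproduced.

    * Recursive identification via the default Cholesky ordering, then orthogonalized IRFs
    var dlnex dlnp, lags(1/2)
    irf set passthrough
    irf create chol, step(24) replace
    irf graph oirf, impulse(dlnex) response(dlnp)

    * VEC alternative: test the cointegration rank, fit the VECM, then IRFs
    vecrank lnex lnp, lags(2)
    vec lnex lnp, rank(1) lags(2)
    irf create vecm, step(24) replace
    irf graph oirf, impulse(lnex) response(lnp)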
Additional information
nola13-guillermo.ppsx
Including auxiliary variables in models with missing data using full-information maximum likelihood
Rose Anne Medeiros
Rice University
Stata's
sem command includes the ability to estimate models with
missing data using full-information maximum likelihood estimation (FIML).
One of the assumptions of FIML is that the data are at least missing at
random (MAR); that is, conditional on other variables in the model,
missingness is not dependent on the value that would have been observed. The
MAR assumption can be made more plausible and estimation improved by the
inclusion of auxiliary variables, that is, variables that predict
missingness or are related to the variables with missing values but are not
part of the substantive model. The inclusion of auxiliary variables is
common in multiple imputation models but less common in models estimated
using FIML. This presentation will introduce users to the saturated
correlates model (Graham 2003), a method of including auxiliary variables in
FIML models. Examples demonstrating how to include auxiliary variables
using the saturated correlates model with Stata's
sem command will be
shown.
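As a minimal sketch with hypothetical variable names (see the presentation for the exact saturated correlates specification), a substantive model of y on x1 and x2 can be fit by FIML with method(mlmv), and an auxiliary variable z can be brought in by letting it covary freely with the model's variables, for example through a just-identified equation for z whose error covaries with the substantive error.

    * FIML estimation of the substantive model alone
    sem (y <- x1 x2), method(mlmv)

    * One way to add auxiliary variable z in a saturated fashion:
    * regress z on the exogenous predictors and let its error covary with e.y
    sem (y <- x1 x2) (z <- x1 x2), method(mlmv) cov(e.y*e.z)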
Additional information
nola13-medeiros.pdf
Conditional stereotype logistic regression: A new estimation command
Rob Woodruff
Battelle Memorial Institute
The stereotype logistic regression model for a categorical dependent
variable is often described as a compromise between the multinomial and
proportional-odds logistic models and has many attractive features. Among
these are the ability to test the adequacy of the model fit compared with
the unconstrained multinomial model, to test the distinguishability of the
outcome categories, and even to test the "ordinality" assumption itself.
What brought me to write the new command, however, was the desire to take
advantage of these capabilities while working on a matched,
case–control study. Like the multinomial logistic model (and unlike
the proportional-odds model), the stereotype model yields valid inference
under outcome-dependent sampling designs and can be much more parsimonious.
The working title of my command is
cstereo, and it is implemented
using the d2-method of Stata's
ml command. In terms of existing Stata
capabilities,
clogit is to
logit as
cstereo is to
slogit. In this presentation, I will demonstrate the command's
features using a simulated matched, case–control dataset.
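To make the analogy concrete, here is a minimal sketch using the existing commands with hypothetical variables; the syntax of cstereo itself is not shown.

    * Unconditional models
    logit  case  x1 x2                      // binary outcome
    slogit y_cat x1 x2, dimension(1)        // one-dimensional stereotype model

    * Matched case-control data, conditioning on the matched set
    clogit case x1 x2, group(setid)
    * cstereo is intended to fill the remaining cell of this analogy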
Additional information
nola13-woodruff.pptx
powersim: Simulation-based power analysis for linear and generalized linear models
Joerg Luedicke
Yale University and University of Florida
Computing statistical power is a widespread practice within the point null hypothesis significance-testing framework, especially in the planning stage of quantitative studies. However, asymptotic power formulas
are often not readily available for certain tests or are too restrictive
in their underlying assumptions to be of much use in practice. The Stata
package
powersim exploits the flexibility of a simulation-based
approach by providing a facility for automated power simulations in the
context of linear and generalized linear regression models. The package
supports a wide variety of uni- and multivariate covariate distributions
and all family and link choices that are implemented in Stata's
glm
command. The package mainly serves two purposes: First, it provides access
to simulation-based power analyses for researchers without much experience
in simulation studies. Second, it provides a convenient simulation facility
for more advanced users who can easily complement the automated data
generation with their own code for creating more complex synthetic datasets.
The presentation will discuss some advantages of the simulation-based power
analysis approach and will go through a number of worked examples to
demonstrate key features of the package.
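For readers unfamiliar with the general approach, a bare-bones simulation-based power calculation in Stata looks like the sketch below; this is not powersim's own syntax, and the data-generating process, sample size, and effect size are illustrative assumptions.

    program define simpower, rclass
        version 13
        drop _all
        set obs 100                          // sample size under evaluation
        generate x = rnormal()
        generate y = 0.3*x + rnormal()       // assumed effect size of 0.3
        regress y x
        test x = 0
        return scalar reject = (r(p) < 0.05)
    end

    simulate reject=r(reject), reps(1000) seed(20130718): simpower
    summarize reject                         // mean rejection rate estimates power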
Additional information
nola13-luedicke.pdf
Inequality restricted maximum entropy estimation using Stata
Randall Campbell
Mississippi State University
R. Carter Hill
Louisiana State University
We use Stata to obtain the linear maximum entropy estimator developed by
Golan, Judge, and Miller (1996). We use Mata's optimize() function to illustrate maximum entropy estimation in an unrestricted linear regression
model. Next we estimate the model with parameter inequality restrictions to
replicate the Monte Carlo experiments in Campbell and Hill (2005). We
generate data under varying design characteristics and estimate the
parameters using maximum entropy and least squares estimation, both with and
without parameter inequality restrictions.
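For orientation, the general shape of a Mata optimize() call is sketched below with a toy concave objective; the actual maximum entropy objective and its constraints follow Golan, Judge, and Miller (1996) and are not reproduced here.

    mata:
    // toy d0 evaluator: maximize f(b) = -(b1-1)^2 - (b2+2)^2
    void toy_eval(todo, b, v, g, H)
    {
        v = -(b[1]-1)^2 - (b[2]+2)^2
    }
    S = optimize_init()
    optimize_init_evaluator(S, &toy_eval())
    optimize_init_evaluatortype(S, "d0")
    optimize_init_params(S, (0, 0))
    b = optimize(S)                        // converges to (1, -2)
    end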
References
Campbell, R. C. and R. C. Hill. 2005. A Monte Carlo study of the effect of design characteristics on the inequality restricted maximum entropy estimator.
Review of Applied Economics 1: 53–84.
Golan, A., G. Judge, and D. Miller. 1996.
Maximum Entropy Econometrics: Robust Estimation with Limited Data. Chichester, UK: John Wiley & Sons.
Additional information
nola13-campbell.pdf
Power and sample-size analysis in Stata
Yulia Marchenko
Director of Biostatistics, StataCorp
Stata 13's new
power command performs power and sample-size analysis.
The
power command expands the statistical methods that were
previously available in Stata's
sampsi command. I will demonstrate
the
power command and its additional features, including the support
of multiple study scenarios and automatic and customizable tables and
graphs.
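Two illustrative calls (the numbers are arbitrary): computing the required sample size for a two-sample comparison of means, and computing power over several scenarios with a table and graph.

    * Required sample size to detect a difference of 5 with sd 12 at 80% power
    power twomeans 0 5, sd(12) power(0.8)

    * Power over a range of effect sizes and sample sizes, with table and graph
    power twomeans 0 (2(1)8), sd(12) n(50 100 200) table graph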
Additional information
nola13-marchenko.pdf
Automatic generation of personalized answers to a problem set
Rodrigo Taborda
Universidad de los Andes, Colombia
Teaching and learning statistics and econometrics requires assessment
through a problem set (PS). Often the PS requires some statistical analysis
of a single database; therefore, there is a unique answer. Although a unique
answer guarantees the exercise was done correctly, it also facilitates
cheating; the lazy student may borrow the answer from his hardworking
classmate. This scenario does not guarantee an honest effort and learning.
Taking advantage of the automatic generation of documents (Gini and Pasquini
2006) for a unique PS, I generate a personalized subdatabase and answer in
a PDF file. Here are the steps. 1) There is a single PS for all students (implying the use of Stata). 2) There is a single "mother" database. 3) A personalized (per-student) database is drawn from the mother database. 4) Following Gini and Pasquini (2006), a personalized (per-student) answer is generated into a PDF file. Pros: 1) There is no opportunity to cheat by copying and pasting the same answer without actually running the statistical procedure. 2) The lecturer knows the answer beforehand. 3) Grading is easy. 4) Because each student has a different statistical result, each must draw individual inferences from his or her own results.
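Step 3 might be implemented along the following lines; this is a minimal sketch in which the file names, student IDs, and 80% sampling rate are illustrative, and the PDF-generation step of Gini and Pasquini (2006) is not shown.

    use mother_database, clear
    foreach id of numlist 1001 1002 1003 {
        preserve
        set seed `id'                   // reproducible per-student draw
        sample 80                       // keep 80% of the mother database
        save student_`id', replace
        restore
    }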
Reference
Gini, R. and J. Pasquini. 2006. Automatic generation of documents.
Stata Journal 6: 22–39.
Additional information
nola13-taborda.pdf
Teaching students to make their empirical research replicable: A protocol for documenting data management and analysis
Richard Ball
Norm Medeiros
Haverford College
This presentation will describe a protocol we have developed for teaching
students conducting empirical research to document their work in such a way
that their results are completely reproducible and verifiable. The protocol
is composed primarily of creating and assembling a collection of electronic
documents--including raw data files, do-files, and metadata files. The
guiding principle is that an independent researcher, using only the data and
information contained in these files, should be able to replicate every step
of the data management and analysis that generated the reported empirical results. Students in our introductory statistics classes, as well as our senior advisees, have had considerable success using the protocol to document the data processing and analysis involved in their research papers and theses. There is a great deal of evidence (see Ball and Medeiros [2012] and McCullough and McKitrick [2009]) that, across the social sciences,
professional norms and common practices with respect to documenting
empirical research are deficient. We hope that teaching good practices to
our students will help strengthen the professional norm that researchers
have an ethical responsibility to ensure that their statistical results can
be independently replicated.
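One common device for tying the do-files together into a fully reproducible pipeline is a master do-file that reruns everything from the raw data; the schematic example below is illustrative and not taken from the authors' protocol, and the file names are hypothetical.

    * master.do -- reproduces all results from the raw data files
    version 13
    clear all
    do 01-import.do        // read the raw data and save Stata datasets
    do 02-clean.do         // data management: merges, recodes, restrictions
    do 03-analysis.do      // estimation, tables, and figures for the paper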
References
Ball, R. and N. Medeiros. 2012. Teaching integrity in empirical research: A protocol for documenting data management and analysis.
Journal of Economic Education 43: 182–189.
McCullough, B. D. and R. McKitrick. 2009. Check the numbers: The case for due diligence in policy formation. Fraser Institute for Economic Studies in Risk and Regulation.
http://www.pages.drexel.edu/~bdm25/DueDiligence.pdf
Additional information
nola13-ball.pdf
Mathematical optimization in Stata: LP and MILP
Choonjoo Lee
Korea National Defense University
In this presentation, we present a procedure and an illustrative application of user-written mathematical optimization programs: linear programming (LP) and mixed integer linear programming (MILP). The LP and MILP programs allow researchers working in Stata to conduct not only statistical optimization but also mathematical optimization. To date, no mathematical programming options are built into Stata, only statistical optimization routines. The user-written mathematical optimization approach in Stata also suggests possible future extensions to nonparametric optimization programming.
Additional information
nola13-lee.pptx
Introducing PARALLEL: Stata module for parallel computing
George Vega
Chilean Pension Supervisor
Inspired by the R library "snow" and designed for multicore CPUs, PARALLEL implements parallel computing methods through the operating system's shell (running Stata in batch mode) to accelerate computations. By splitting the dataset into a given number of clusters, the module runs a task simultaneously over the data clusters, speeding up computations by a factor of two to five, or more, depending on the number of CPU cores. Without requiring Stata/MP, PARALLEL is, to my knowledge, the first user-contributed Stata module to implement parallel computing.
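Basic usage follows the pattern sketched below; the number of clusters and the do-file name are illustrative, and the package's help file documents the full syntax.

    parallel setclusters 4          // launch 4 child Stata instances in batch mode
    parallel do heavy_task.do       // run the do-file on each cluster of the data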
Additional information
nola13-vega.pdf
Optimizing Stata for analysis of large datasets
Joseph Canner
Eric Schneider
Johns Hopkins University School of Medicine
As with most programming languages, there are multiple ways to do a task
in Stata. Using modern CPUs with adequate memory, most Stata data processing
commands run so quickly on small- or moderate-sized datasets that it is
impossible to tell whether one command performs more efficiently than
another. However, when one analyzes large datasets such as the Nationwide
Inpatient Sample (NIS), with about 8 million records per year (~3.5GB), this
choice can make a substantial difference in performance. Using the Stata
timer command, we performed standardized benchmarks of common programming
tasks, such as searching the NIS for a list of ICD-9 codes, converting string data to numeric, and putting numeric variables in categories. For
example, the
inlist function can achieve significant performance gains
compared with using the equivalent
"var==exp1 | var==exp2" notation
(38% improvement) or using
foreach loops (300% improvement). Using
the
real and
subinstr functions to remove characters from
strings and convert them to numbers is about 20 times faster than the
destring command. The
inlist,
inrange, and
recode functions also perform considerably better than the equivalent
recode commands (13 to 70 times faster), especially for string
variables, and are often easier to write and to read.
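The benchmarking pattern is easy to reproduce with the timer command; a minimal sketch follows, in which the variable dx1 and the ICD-9 codes are hypothetical.

    timer clear
    timer on 1
    generate byte ami1 = inlist(dx1, "41001", "41011", "41041")
    timer off 1

    timer on 2
    generate byte ami2 = (dx1=="41001" | dx1=="41011" | dx1=="41041")
    timer off 2
    timer list                      // compare elapsed times of the two approaches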
Additional information
nola13-canner.pptx
nola13-canner.do
Reimagining a Stata/Python combination
James Fiedler
Universities Space Research Association
At last year's Stata Conference, I presented some ideas for combining
Stata and Python within a single interface. Two methods were presented: in
one, Python was used to automate Stata; in the other, Python was used to
send simulated keystrokes to the Stata GUI. The first method has the
drawback of working only on Windows, and the second can be slow and subject
to character input limits. In this presentation, I will demonstrate a method
for achieving interaction between Stata and Python that does not suffer
these drawbacks, and I will present some examples to show how this
interaction can be useful.
Additional information
nola13-fiedler.pdf
The hierarchy of factor invariance
Phil Ender
UCLA Statistical Consulting Group
Measurement invariance is an important prerequisite in multiple-group structural equation modeling. Testing for it verifies that the factors measure the same underlying latent constructs in each group. This
presentation will show the use of the
sem command in assessing six
types of factor invariance: configurational, metric, strong, strict, strict
plus factor means, and strict plus factor means and variances. These six
types of factor invariance constitute a hierarchy with each level
representing a stricter definition of factor invariance.
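With sem, successive levels of the hierarchy are imposed through the group() and ginvariant() options; a minimal sketch of the first three levels, for a hypothetical one-factor model compared across levels of female, is shown below.

    * configural: common form, all parameters free across groups
    sem (F -> y1 y2 y3 y4), group(female) ginvariant(none)

    * metric: measurement coefficients (loadings) equal across groups
    sem (F -> y1 y2 y3 y4), group(female) ginvariant(mcoef)

    * strong: loadings and measurement intercepts equal across groups
    sem (F -> y1 y2 y3 y4), group(female) ginvariant(mcoef mcons)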
Additional information
nola13-ender.pdf
Correctly modeling CD4 cell count in Cox regression analysis of HIV-positive patients
Allison Dunning
Sean Collins
Dan Fitzgerald
Sandra H. Rua
Weill Cornell Medical College
Background: A previous trial showed that starting ART therapy earlier ("Early") rather than waiting for the onset of symptoms ("Standard") in
HIV patients significantly decreases mortality. As a follow-up, researchers
are interested in determining if "Early" therapy significantly decreases
time to first tuberculosis (TFTB) diagnosis when adjusting for CD4 cell
count, a known strong predictor. Methods: Stata 12.0 was used to fit two
Cox regression models to analyze the effect of ART start time on TFTB. The
first model included baseline CD4 cell count only as a predictor, while the
second model treated CD4 cell count as a time-varying predictor. Results:
Regular Cox regression analysis showed that "Early" therapy results in a
significant decrease in TFTB after adjustment for previous TB diagnosis,
baseline BMI, and baseline CD4 cell count. Treating CD4 cell count as a
time-varying predictor in Cox regression, we found that ART start time
was not a significant predictor of TFTB. Conclusions: Failing to adjust for
the change in CD4 cell counts over time led to reporting that "Early"
therapy significantly reduces the risk of TB diagnosis. Modeled correctly, the
effect becomes nonsignificant. This result has substantial consequences for
treatment decisions.
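In Stata, the two specifications differ mainly in how the data are stset; a minimal sketch with hypothetical variable names (not the authors' code) is given below.

    * Model 1: one record per patient, baseline CD4 only
    stset followup, failure(tb)
    stcox i.early cd4_base prior_tb bmi_base

    * Model 2: multiple records per patient, one per interval with a constant
    * CD4 value, so cd4_current enters as a time-varying covariate
    stset stop, id(patid) failure(tb)
    stcox i.early cd4_current prior_tb bmi_base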
Additional information
nola13-dunning.pptx
Structured chaos: Using Mata and Stata to draw fractals
Seth Lirette
University of Mississippi Medical Center
Fractals are some of the most beloved and recognizable mathematical objects
studied. They have been traced as far back as Leibniz but did not receive rigorous examination until the mid-twentieth century, with the many
publications of Benoit Mandelbrot and the advent of the modern computer.
The powerful programming environment of Mata, in tandem with Stata’s
excellent graphics capabilities, provides a very well-suited setting for
generating fractals. My talk will focus on using Mata, combined with Stata,
to generate some visually recognizable fractals, possibly including, but not
limited to, iterated function systems (Barnsley Fern, Koch Snowflake, Gosper
Island); escape-time fractals (Mandelbrot Set, Julia Sets, Burning Ship);
finite subdivisions (Cantor Set, Sierpinski Triangle); Lindenmayer systems
(Dragon Curve, Levy Curve); and strange attractors (Double-scroll, Rossler,
Lorenz).
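As a flavor of the approach, here is a chaos-game sketch of the Sierpinski triangle, one common construction and not necessarily the code shown in the talk.

    clear
    set obs 20000
    generate double x = .
    generate double y = .
    mata:
    n = st_nobs()
    v = (0, 0 \ 1, 0 \ 0.5, sqrt(3)/2)          // vertices of the triangle
    p = (0.1, 0.1)                              // arbitrary starting point
    pts = J(n, 2, .)
    for (i = 1; i <= n; i++) {
        p = (p + v[ceil(3*runiform(1,1)), .]) / 2    // jump halfway to a random vertex
        pts[i, .] = p
    }
    st_store(., ("x", "y"), pts)                // write the points back to Stata
    end
    scatter y x, msymbol(point)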
Additional information
nola13-lirette.pptx
gpsmap: Routine for verifying and returning the attribute table of given decimal GPS coordinates
Timothy Brophy
University of Cape Town
GPS coordinates are collected by many organizations; however, before any meaningful statistical analysis can be derived from these coordinates, they need to be joined with geographical data. Previously, users were required to
export the GPS data out of Stata and into a GIS mapping program to map the
coordinates, validate them, and join them to an attribute table. The results
would then need to be imported back into Stata for statistical
analysis. gpsmap is a routine that imports a user-provided shapefile and
its attribute table. Using a ray-casting algorithm, it maps the GPS
coordinates to one of the polygons of the given shapefile and returns a
dummy variable indicating whether the GPS coordinates were mapped
successfully. Where the GPS coordinates were successfully mapped, the
attribute table applicable to that particular polygon is also returned to
Stata. One of the contributions of gpsmap is to allow users to circumvent
GIS software and to incorporate GIS information directly within Stata. The
other is to give users who are not familiar with GIS software the
opportunity to use GIS information without having to familiarize themselves
with GIS software.
Additional information
nola13-brophy.pptx
Two-stage regression without exclusion restrictions
Michael Barker
Georgetown University
Klein and Vella (2010) propose an estimator to fit a triangular system of
two simultaneous linear equations with a single endogenous regressor. Models
of this form are generally analyzed with two-stage least squares or IV
methods, which require one or more exclusion restrictions. In practice, the
assumptions required to construct valid instruments are frequently difficult
to justify. The KV estimator does not require an exclusion restriction; the
same set of independent variables may appear in both equations. To account
for endogeneity, the estimator constructs a control function using
information from the conditional distribution of the error terms.
Conditional variance functions are estimated semiparametrically, so
distributional assumptions are minimized. I will present my Stata
implementation of the semiparametric control function estimator,
kvreg, and discuss the assumptions that must hold for consistent
estimation. The
kvreg estimator contains an undocumented
implementation of Ichimura’s (1993) semiparametric least squares
estimator, which I plan to develop into a stand-alone command.
References
Ichimura, H. 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models.
Journal of Econometrics 58: 71–120.
Klein, R. and F. Vella. 2010. Estimating a class of triangular simultaneous equations models without exclusion restrictions.
Journal of Econometrics 154: 154–164.
Additional information
nola13-barker.pdf
Generalizing sem in Stata
Jeff Pitblado
Director of Statistical Software, StataCorp
Introducing generalized SEM: (1) SEM with generalized linear response
variables, and (2) SEM with multilevel mixed effects, whether linear or
generalized linear. Generalized linear response variables mean you can now
fit probit, logit, Poisson, multinomial logistic, ordered logit, ordered
probit, and other models. They also mean measurements can be continuous,
binary, count, categorical, and ordered. Multilevel mixed effects mean you
can place latent variables at different levels of the data. You can fit
models with fixed or random intercepts and fixed or random slopes. I will
present examples using both command syntax and the SEM Builder.
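Two minimal examples of the new syntax, with hypothetical variable and level names: a logistic regression written as a gsem model, and a Poisson response with a random intercept at the school level.

    * generalized linear response: logit
    gsem (y <- x1 x2, logit)

    * multilevel: latent random intercept M1 varying over school, Poisson response
    gsem (days_absent <- ses M1[school], poisson)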
Additional information
nola13-pitblado.pdf
Scientific organizers
R. Carter Hill (chair), Louisiana State University
Mario Cleves, University of Arkansas for Medical Sciences
Edward Peters, LSUHSC School of Public Health
Logistics organizers
Nathan Bishop, StataCorp
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp