Last updated: 31 July 2013
2013 Stata Conference New Orleans
18–19 July 2013
Hyatt French Quarter New Orleans
800 Iberville Street
New Orleans, Louisiana
Proceedings
Fitting complex mixed logit models with particular focus on labor supply estimation
Max Löffler
IZA and University of Cologne
When one estimates discrete choice models, the mixed logit approach is
commonly superior to simple conditional logit setups. Mixed logit models not only allow the researcher to incorporate complex random components but also overcome the restrictive IIA assumption. Despite these theoretical
advantages, the estimation of mixed logit models becomes cumbersome when the
model's complexity increases. Applied works therefore often rely on rather
simple empirical specifications because this reduces the computational
burden. I introduce the user-written command
lslogit, which fits
complex mixed logit models using maximum simulated likelihood methods.
Because lslogit is implemented as a d2 ML evaluator written in Mata, estimation is rather efficient compared with other routines. It allows the researcher to
specify complicated structures of unobserved heterogeneity and to choose
from a set of frequently used functional forms for the direct utility
function--for example, Box–Cox transformations, which are difficult to
estimate in the context of logit models. The particular focus of
lslogit is on the estimation of labor supply models in the discrete
choice context; therefore, it facilitates several computationally demanding
but standard tasks in this research area. However, the command can be used
in many other applications of mixed logit models as well.
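For readers new to this setup, here is a minimal conditional-logit baseline of the kind that lslogit generalizes, using Stata's built-in clogit; the dataset and variable names (labor_choices, hh, chosen, consumption, leisure) are hypothetical, and lslogit's own syntax is documented in its help file rather than reproduced here.

    * One record per household-alternative pair; chosen marks the selected alternative
    use labor_choices, clear
    clogit chosen consumption leisure c.consumption#c.leisure, group(hh)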
Additional information
nola13-loffler.pdf
New Stata code to measure poverty accounting for time
Carlos Gradín
Universidade de Vigo and EQUALITAS
The purpose of this presentation is to introduce a new user-written code
that allows for measuring poverty in a panel of individuals. It complements
existing poverty codes for a cross-section of individuals (for example,
povdeco, poverty) by producing a new family of indices proposed by
Gradín, Cantó, and Del Río (2012). This family of
indices is a natural extension of the popular
Foster–Greer–Thorbecke (FGT) poverty indices to the longitudinal
case in which individuals are observed for more than one period. It takes into account that longer poverty spells and more unequal poverty profiles aggravate overall poverty. These measures have attractive decomposability properties. A further advantage of this family of indices is that it embraces other indices recently proposed in the literature as special cases.
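For reference, the cross-sectional FGT family that these indices extend is, in standard notation (not taken from the presentation),

\[ P_\alpha = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{z - y_i}{z}\right)^{\alpha}\mathbf{1}(y_i < z), \]

where z is the poverty line, y_i is individual income, and \alpha is the poverty-aversion parameter (\alpha = 0 gives the headcount ratio and \alpha = 1 the poverty gap). The longitudinal family aggregates such per-period poverty gaps over each individual's poverty profile, as detailed in the reference below.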
Reference
Gradín, C., O. Cantó, and C. Del Río. 2012. Measuring poverty accounting for time.
Review of Income and Wealth 58: 330–354.
Additional information
nola13-gradin.pptx
Demand system estimation with Stata: Multivariate censoring and other econometric issues
Soufiane Khoudmi
Benoit Mulkay
University of Montpellier
This presentation provides a Stata application for the estimation of Banks,
Blundell, and Lewbel's (1997) demand system dealing with the zero problem,
which is central to many expenditure survey analyses. We start from Poi's
(2008) routine, and our main contribution is the multivariate censoring
correction; we implement Tauchmann's (2010) theoretical framework, which
relies on including correction terms in the system. These are computed from
a multivariate probit estimated with simulated maximum likelihood using
Cappellari and Jenkins's (2006) mvnp routine. We also discuss how to deal
with several econometric issues related to the demand system estimation
literature: total budget endogeneity, conditional linearity, and
the symmetry restriction (using a minimum distance estimator).
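As an illustration of the first-stage censoring equations only (not the authors' full routine), a multivariate probit of purchase indicators can be fit by simulated maximum likelihood with Cappellari and Jenkins's mvprobit command, a companion to mvnp; the variable names below are hypothetical, and the construction of the correction terms and the demand system itself follow Tauchmann (2010) and Poi (2008) and are not shown.

    * Hypothetical first stage: joint purchase decisions for three goods,
    * estimated by simulated ML (GHK simulator) with 50 draws
    mvprobit (buy1 = lnexp z1 z2) (buy2 = lnexp z1 z2) (buy3 = lnexp z1 z2), draws(50)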
References
Banks, J., R. Blundell, and A. Lewbel. 1997. Quadratic
Engel curves and consumer demand.
Review of Economics and Statistics
79: 527–539.
Cappellari, L. and S. P. Jenkins. 2006. Calculation
of multivariate normal probabilities by simulation, with applications to
maximum simulated likelihood estimation.
Stata Journal 6:
156–189.
Poi, B. 2008. Demand-system estimation: Update.
Stata Journal 8: 554–556.
Tauchmann, H. 2010. Consistency
of Heckman-type two-step estimators for the multivariate sample-selection
model.
Applied Economics 42: 3895–3902.
Additional information
nola13-khoudmi.pdf
A general approach to testing for autocorrelation
Christopher F. Baum
Boston College and DIW Berlin
Mark E. Schaffer
Heriot-Watt University
Testing for the presence of autocorrelation in a time series is a common
task for researchers working with time-series data. The standard Q test
statistic, introduced by Box and Pierce (1970) and refined by Ljung and Box
(1978), is applicable to univariate time series and to testing for residual
autocorrelation under the assumption of strict exogeneity. Breusch (1978)
and Godfrey (1978) in effect extended the L-B-P approach to testing for autocorrelation in the residuals of models with weakly exogenous regressors.
However, each of these readily available tests has important limitations.
We use the results of Cumby and Huizinga (1992) to extend the implementation
of the Q test statistic of L-B-P-B-G to cover a much wider range of
hypotheses and settings: (a) tests for the presence of autocorrelation of
order p through q, where under the null hypothesis, there may be
autocorrelation of order p-1 or less; (b) tests after estimation in
which regressors are endogenous and estimation is by IV or GMM methods; and
(c) tests after estimation using panel data. We show that the
Cumby–Huizinga test, although developed for the large-T setting, is
formally identical to the test developed by Arellano and Bond (1991) for
AR(2) in a large-N panel setting.
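For context, the standard tests named above are available in official Stata as sketched below, with hypothetical variable names and tsset/xtset data assumed; the generalized Cumby–Huizinga implementation presented in the talk is not reproduced here.

    * Ljung-Box/Box-Pierce Q test on a univariate series
    wntestq y, lags(12)

    * Breusch-Godfrey test on OLS residuals with weakly exogenous regressors
    regress y L.y x
    estat bgodfrey, lags(1/4)

    * Arellano-Bond test for AR(2) in the first-differenced errors of a dynamic panel
    xtabond y x, lags(1)
    estat abond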
Additional information
nola13-baum.pdf
Impulse–response functions analysis: An application to the exchange rate pass-through in Mexico
Sylvia Beatriz Guillermo Peón
Benemérita Universidad Autónoma de Puebla
Martin Rodriguez Brindis
Universidad La Salle
This paper analyzes the exchange rate pass-through mechanism for the Mexican economy using Stata under two time-series
frameworks. The first framework is a recursive structural VAR (SVAR) model,
which, unlike the traditional VAR model, allows us to impose additional
restrictions on the contemporaneous and lagged matrices of coefficients. The
second is a VEC approach, which considers the possibility of valid
cointegrating relationships and allows us to incorporate the deviations from
the long-run equilibrium (cointegrating equations) as explanatory variables
when modeling the short-run behavior of the variables. Both frameworks aim
at the estimation of impulse–response functions (IRFs) as a tool to
analyze the degree and timing of the effect of exchange rate changes on
domestic prices. The recursive SVAR approach allows us to estimate the
structural IRFs, while the VEC approach uses the Cholesky decomposition of
the white noise variance–covariance matrix by imposing some necessary
restrictions so that causal interpretation of the simple IRFs is possible.
If cointegration exists, estimation of the IRFs provides a tool to identify
when the effect of a shock to the exchange rate is transitory and when it is
permanent.
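A minimal sketch of the two official-Stata workflows the paper builds on is given below, with hypothetical variable names (dlnex and dlnp for the recursive VAR in differences; lnex and lnp for the VEC in levels); the data are assumed to be tsset, and the paper's exact specification is not reproduced.

    * Recursive identification via the default Cholesky ordering, then orthogonalized IRFs
    var dlnex dlnp, lags(1/2)
    irf set passthrough
    irf create chol, step(24) replace
    irf graph oirf, impulse(dlnex) response(dlnp)

    * VEC alternative: test the cointegration rank, fit the VECM, then IRFs
    vecrank lnex lnp, lags(2)
    vec lnex lnp, rank(1) lags(2)
    irf create vecm, step(24) replace
    irf graph oirf, impulse(lnex) response(lnp)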
Additional information
nola13-guillermo.ppsx
Including auxiliary variables in models with missing data using full-information maximum likelihood
Rose Anne Medeiros
Rice University
Stata's
sem command includes the ability to estimate models with
missing data using full-information maximum likelihood estimation (FIML).
One of the assumptions of FIML is that the data are at least missing at
random (MAR); that is, conditional on other variables in the model,
missingness is not dependent on the value that would have been observed. The
MAR assumption can be made more plausible and estimation improved by the
inclusion of auxiliary variables, that is, variables that predict
missingness or are related to the variables with missing values but are not
part of the substantive model. The inclusion of auxiliary variables is
common in multiple imputation models but less common in models estimated
using FIML. This presentation will introduce users to the saturated
correlates model (Graham 2003), a method of including auxiliary variables in
FIML models. Examples demonstrating how to include auxiliary variables
using the saturated correlates model with Stata's
sem command will be
shown.
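As a minimal sketch with hypothetical variable names (see the presentation for the exact saturated correlates specification), a substantive model of y on x1 and x2 can be fit by FIML with method(mlmv), and an auxiliary variable z can be brought in by letting it covary freely with the model's variables, for example through a just-identified equation for z whose error covaries with the substantive error.

    * FIML estimation of the substantive model alone
    sem (y <- x1 x2), method(mlmv)

    * One way to add auxiliary variable z in a saturated fashion:
    * regress z on the exogenous predictors and let its error covary with e.y
    sem (y <- x1 x2) (z <- x1 x2), method(mlmv) cov(e.y*e.z)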
Additional information
nola13-medeiros.pdf
Conditional stereotype logistic regression: A new estimation command
Rob Woodruff
Battelle Memorial Institute
The stereotype logistic regression model for a categorical dependent
variable is often described as a compromise between the multinomial and
proportional-odds logistic models and has many attractive features. Among
these are the ability to test the adequacy of the model fit compared with
the unconstrained multinomial model, to test the distinguishability of the
outcome categories, and even to test the "ordinality" assumption itself.
What brought me to write the new command, however, was the desire to take
advantage of these capabilities while working on a matched,
case–control study. Like the multinomial logistic model (and unlike
the proportional-odds model), the stereotype model yields valid inference
under outcome-dependent sampling designs and can be much more parsimonious.
The working title of my command is
cstereo, and it is implemented
using the d2-method of Stata's
ml command. In terms of existing Stata
capabilities,
clogit is to
logit as
cstereo is to
slogit. In this presentation, I will demonstrate the command's
features using a simulated matched, case–control dataset.
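To make the analogy concrete, here is a minimal sketch using the existing commands with hypothetical variables; the syntax of cstereo itself is not shown.

    * Unconditional models
    logit  case  x1 x2                      // binary outcome
    slogit y_cat x1 x2, dimension(1)        // one-dimensional stereotype model

    * Matched case-control data, conditioning on the matched set
    clogit case x1 x2, group(setid)
    * cstereo is intended to fill the remaining cell of this analogy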
Additional information
nola13-woodruff.pptx
powersim: Simulation-based power analysis for linear and generalized linear models
Joerg Luedicke
Yale University and University of Florida
Computing statistical power is a widespread practice within the point null hypothesis significance-testing framework, especially in the planning stage of quantitative studies. However, asymptotic power formulas
are often not readily available for certain tests or are too restrictive
in their underlying assumptions to be of much use in practice. The Stata
package
powersim exploits the flexibility of a simulation-based
approach by providing a facility for automated power simulations in the
context of linear and generalized linear regression models. The package
supports a wide variety of uni- and multivariate covariate distributions
and all family and link choices that are implemented in Stata's
glm
command. The package mainly serves two purposes: First, it provides access
to simulation-based power analyses for researchers without much experience
in simulation studies. Second, it provides a convenient simulation facility
for more advanced users who can easily complement the automated data
generation with their own code for creating more complex synthetic datasets.
The presentation will discuss some advantages of the simulation-based power
analysis approach and will go through a number of worked examples to
demonstrate key features of the package.
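For readers unfamiliar with the general approach, a bare-bones simulation-based power calculation in Stata looks like the sketch below; this is not powersim's own syntax, and the data-generating process, sample size, and effect size are illustrative assumptions.

    program define simpower, rclass
        version 13
        drop _all
        set obs 100                          // sample size under evaluation
        generate x = rnormal()
        generate y = 0.3*x + rnormal()       // assumed effect size of 0.3
        regress y x
        test x = 0
        return scalar reject = (r(p) < 0.05)
    end

    simulate reject=r(reject), reps(1000) seed(20130718): simpower
    summarize reject                         // mean rejection rate estimates power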
Additional information
nola13-luedicke.pdf
Inequality restricted maximum entropy estimation using Stata
Randall Campbell
Mississippi State University
R. Carter Hill
Louisiana State University
We use Stata to obtain the linear maximum entropy estimator developed by
Golan, Judge, and Miller (1996). We use Mata's optimize() function to illustrate maximum entropy estimation in an unrestricted linear regression
model. Next we estimate the model with parameter inequality restrictions to
replicate the Monte Carlo experiments in Campbell and Hill (2005). We
generate data under varying design characteristics and estimate the
parameters using maximum entropy and least squares estimation, both with and
without parameter inequality restrictions.
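For orientation, the general shape of a Mata optimize() call is sketched below with a toy concave objective; the actual maximum entropy objective and its constraints follow Golan, Judge, and Miller (1996) and are not reproduced here.

    mata:
    // toy d0 evaluator: maximize f(b) = -(b1-1)^2 - (b2+2)^2
    void toy_eval(todo, b, v, g, H)
    {
        v = -(b[1]-1)^2 - (b[2]+2)^2
    }
    S = optimize_init()
    optimize_init_evaluator(S, &toy_eval())
    optimize_init_evaluatortype(S, "d0")
    optimize_init_params(S, (0, 0))
    b = optimize(S)                        // converges to (1, -2)
    end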
References
Campbell, R. C. and R. C. Hill. 2005. A Monte Carlo study of the effect of design characteristics on the inequality restricted maximum entropy estimator.
Review of Applied Economics 1: 53–84.
Golan, A., G. Judge, and D. Miller. 1996.
Maximum Entropy Econometrics: Robust Estimation with Limited Data. Chichester, UK: John Wiley & Sons.
Additional information
nola13-campbell.pdf
Power and sample-size analysis in Stata
Yulia Marchenko
Director of Biostatistics, StataCorp
Stata 13's new
power command performs power and sample-size analysis.
The
power command expands the statistical methods that were
previously available in Stata's
sampsi command. I will demonstrate
the
power command and its additional features, including the support
of multiple study scenarios and automatic and customizable tables and
graphs.
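Two illustrative calls (the numbers are arbitrary): computing the required sample size for a two-sample comparison of means, and computing power over several scenarios with a table and graph.

    * Required sample size to detect a difference of 5 with sd 12 at 80% power
    power twomeans 0 5, sd(12) power(0.8)

    * Power over a range of effect sizes and sample sizes, with table and graph
    power twomeans 0 (2(1)8), sd(12) n(50 100 200) table graph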
Additional information
nola13-marchenko.pdf
Automatic generation of personalized answers to a problem set
Rodrigo Taborda
Universidad de los Andes, Colombia
Teaching and learning statistics and econometrics requires assessment
through a problem set (PS). Often the PS requires some statistical analysis
of a single database; therefore, there is a unique answer. Although a unique
answer guarantees the exercise was done correctly, it also facilitates
cheating; the lazy student may borrow the answer from his hardworking
classmate. This scenario does not guarantee an honest effort and learning.
Taking advantage of the automatic generation of documents (Gini and Pasquini
2006) for a unique PS, I generate a personalized subdatabase and answer in
a PDF file. Here are the steps. 1) There is a single PS for all students (implying the use of Stata). 2) There is a single "mother" database. 3) A personalized (per-student) database is drawn from the mother database. 4) Following Gini and Pasquini (2006), a personalized (per-student) answer is generated into a PDF file. Pros: 1) There is no opportunity to cheat by copying and pasting the same answer without actually running the statistical procedure. 2) The lecturer knows the answer beforehand. 3) Grading is easy. 4) Because each student has a different statistical result, each must draw individual inferences from his or her own results.
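Step 3 might be implemented along the following lines; this is a minimal sketch in which the file names, student IDs, and 80% sampling rate are illustrative, and the PDF-generation step of Gini and Pasquini (2006) is not shown.

    use mother_database, clear
    foreach id of numlist 1001 1002 1003 {
        preserve
        set seed `id'                   // reproducible per-student draw
        sample 80                       // keep 80% of the mother database
        save student_`id', replace
        restore
    }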
Reference
Gini, R. and J. Pasquini. 2006. Automatic generation of documents.
Stata Journal 6: 22–39.
Additional information
nola13-taborda.pdf
Teaching students to make their empirical research replicable: A protocol for documenting data management and analysis
Richard Ball
Norm Medeiros
Haverford College
This presentation will describe a protocol we have developed for teaching
students conducting empirical research to document their work in such a way
that their results are completely reproducible and verifiable. The protocol
is composed primarily of creating and assembling a collection of electronic
documents--including raw data files, do-files, and metadata files. The
guiding principle is that an independent researcher, using only the data and
information contained in these files, should be able to replicate every step
of the data management and analysis that generated the reported empirical results. Students in our introductory statistics classes, as well as our senior advisees, have had considerable success using the protocol to document the data processing and analysis involved in their research papers and theses. There is a great deal of evidence (see Ball and Medeiros [2012] and McCullough and McKitrick [2009]) that, across the social sciences,
professional norms and common practices with respect to documenting
empirical research are deficient. We hope that teaching good practices to
our students will help strengthen the professional norm that researchers
have an ethical responsibility to ensure that their statistical results can
be independently replicated.
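One common device for tying the do-files together into a fully reproducible pipeline is a master do-file that reruns everything from the raw data; the schematic example below is illustrative and not taken from the authors' protocol, and the file names are hypothetical.

    * master.do -- reproduces all results from the raw data files
    version 13
    clear all
    do 01-import.do        // read the raw data and save Stata datasets
    do 02-clean.do         // data management: merges, recodes, restrictions
    do 03-analysis.do      // estimation, tables, and figures for the paper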
References
Ball, R. and N. Medeiros. 2012. Teaching integrity in empirical research: A protocol for documenting data management and analysis.
Journal of Economic Education 43: 182–189.
McCullough, B. D. and R. McKitrick. 2009. Check the numbers: The case for due diligence in policy formation. Fraser Institute for Economic Studies in Risk and Regulation.
http://www.pages.drexel.edu/~bdm25/DueDiligence.pdf
Additional information
nola13-ball.pdf
Mathematical optimization in Stata: LP and MILP
Choonjoo Lee
Korea National Defense University
In this presentation, we present a procedure and an illustrative application of user-written mathematical optimization programs: linear programming (LP) and mixed integer linear programming (MILP). The LP and MILP programs allow researchers working in Stata to conduct not only statistical optimization but also mathematical optimization. To date, no mathematical programming options are built into Stata, only statistical optimization routines. The user-written mathematical optimization approach in Stata also suggests possible future extensions to nonparametric optimization programming.
Additional information
nola13-lee.pptx
Introducing PARALLEL: Stata module for parallel computing
George Vega
Chilean Pension Supervisor
Inspired by the R library "snow" and designed for multicore CPUs, PARALLEL implements parallel computing methods through the operating system's shell (running Stata in batch mode) to accelerate computations. By splitting the dataset into a given number of clusters, the module runs a task simultaneously over the data clusters, speeding up computations by a factor of two to five, or more, depending on the number of CPU cores. Without requiring Stata/MP, PARALLEL is, to my knowledge, the first user-contributed Stata module to implement parallel computing.
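Basic usage follows the pattern sketched below; the number of clusters and the do-file name are illustrative, and the package's help file documents the full syntax.

    parallel setclusters 4          // launch 4 child Stata instances in batch mode
    parallel do heavy_task.do       // run the do-file on each cluster of the data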
Additional information
nola13-vega.pdf
Optimizing Stata for analysis of large datasets
Joseph Canner
Eric Schneider
Johns Hopkins University School of Medicine
As with most programming languages, there are multiple ways to do a task
in Stata. Using modern CPUs with adequate memory, most Stata data processing
commands run so quickly on small- or moderate-sized datasets that it is
impossible to tell whether one command performs more efficiently than
another. However, when one analyzes large datasets such as the Nationwide
Inpatient Sample (NIS), with about 8 million records per year (~3.5GB), this
choice can make a substantial difference in performance. Using the Stata
timer command, we performed standardized benchmarks of common programming
tasks, such as searching the NIS for a list of ICD-9 codes, converting string data to numeric, and putting numeric variables in categories. For
example, the
inlist function can achieve significant performance gains
compared with using the equivalent
"var==exp1 | var==exp2" notation
(38% improvement) or using
foreach loops (300% improvement). Using
the
real and
subinstr functions to remove characters from
strings and convert them to numbers is about 20 times faster than the
destring command. The
inlist,
inrange, and
recode functions also perform considerably better than the equivalent
recode commands (13 to 70 times faster), especially for string
variables, and are often easier to write and to read.
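The benchmarking pattern is easy to reproduce with the timer command; a minimal sketch follows, in which the variable dx1 and the ICD-9 codes are hypothetical.

    timer clear
    timer on 1
    generate byte ami1 = inlist(dx1, "41001", "41011", "41041")
    timer off 1

    timer on 2
    generate byte ami2 = (dx1=="41001" | dx1=="41011" | dx1=="41041")
    timer off 2
    timer list                      // compare elapsed times of the two approaches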
Additional information
nola13-canner.pptx
nola13-canner.do
Reimagining a Stata/Python combination
James Fiedler
Universities Space Research Association
At last year's Stata Conference, I presented some ideas for combining
Stata and Python within a single interface. Two methods were presented: in
one, Python was used to automate Stata; in the other, Python was used to
send simulated keystrokes to the Stata GUI. The first method has the
drawback of working only on Windows, and the second can be slow and subject
to character input limits. In this presentation, I will demonstrate a method
for achieving interaction between Stata and Python that does not suffer
these drawbacks, and I will present some examples to show how this
interaction can be useful.
Additional information
nola13-fiedler.pdf
The hierarchy of factor invariance
Phil Ender
UCLA Statistical Consulting Group
Measurement invariance is an important prerequisite in multiple-group structural equation modeling. Testing for it verifies that the factors measure the same underlying latent constructs in each group. This
presentation will show the use of the
sem command in assessing six
types of factor invariance: configurational, metric, strong, strict, strict
plus factor means, and strict plus factor means and variances. These six
types of factor invariance constitute a hierarchy with each level
representing a stricter definition of factor invariance.
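With sem, successive levels of the hierarchy are imposed through the group() and ginvariant() options; a minimal sketch of the first three levels, for a hypothetical one-factor model compared across levels of female, is shown below.

    * configural: common form, all parameters free across groups
    sem (F -> y1 y2 y3 y4), group(female) ginvariant(none)

    * metric: measurement coefficients (loadings) equal across groups
    sem (F -> y1 y2 y3 y4), group(female) ginvariant(mcoef)

    * strong: loadings and measurement intercepts equal across groups
    sem (F -> y1 y2 y3 y4), group(female) ginvariant(mcoef mcons)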
Additional information
nola13-ender.pdf
Correctly modeling CD4 cell count in Cox regression analysis of HIV-positive patients
Allison Dunning
Sean Collins
Dan Fitzgerald
Sandra H. Rua
Weill Cornell Medical College
Background: A previous trial showed that starting ART therapy earlier ("Early") rather than waiting for the onset of symptoms ("Standard") in
HIV patients significantly decreases mortality. As a follow-up, researchers
are interested in determining if "Early" therapy significantly decreases
time to first tuberculosis (TFTB) diagnosis when adjusting for CD4 cell
count, a known strong predictor. Methods: Stata 12.0 was used to fit two
Cox regression models to analyze the effect of ART start time on TFTB. The
first model included baseline CD4 cell count only as a predictor, while the
second model treated CD4 cell count as a time-varying predictor. Results:
Regular Cox regression analysis showed that "Early" therapy results in a
significant decrease in TFTB after adjustment for previous TB diagnosis,
baseline BMI, and baseline CD4 cell count. Treating CD4 cell count as a
time-varying predictor in Cox regression, we found that ART start time
was not a significant predictor of TFTB. Conclusions: Failing to adjust for
the change in CD4 cell counts over time led to reporting that "Early"
therapy significantly reduces the risk of TB diagnosis. Modeled correctly, the
effect becomes nonsignificant. This result has substantial consequences for
treatment decisions.
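In Stata, the two specifications differ mainly in how the data are stset; a minimal sketch with hypothetical variable names (not the authors' code) is given below.

    * Model 1: one record per patient, baseline CD4 only
    stset followup, failure(tb)
    stcox i.early cd4_base prior_tb bmi_base

    * Model 2: multiple records per patient, one per interval with a constant
    * CD4 value, so cd4_current enters as a time-varying covariate
    stset stop, id(patid) failure(tb)
    stcox i.early cd4_current prior_tb bmi_base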
Additional information
nola13-dunning.pptx
Structured chaos: Using Mata and Stata to draw fractals
Seth Lirette
University of Mississippi Medical Center
Fractals are some of the most beloved and recognizable mathematical objects
studied. They have been traced as far back as Leibniz but did not receive rigorous examination until the mid-twentieth century, with the many
publications of Benoit Mandelbrot and the advent of the modern computer.
The powerful programming environment of Mata, in tandem with Stata’s
excellent graphics capabilities, provides a very well-suited setting for
generating fractals. My talk will focus on using Mata, combined with Stata,
to generate some visually recognizable fractals, possibly including, but not
limited to, iterated function systems (Barnsley Fern, Koch Snowflake, Gosper
Island); escape-time fractals (Mandelbrot Set, Julia Sets, Burning Ship);
finite subdivisions (Cantor Set, Sierpinski Triangle); Lindenmayer systems
(Dragon Curve, Levy Curve); and strange attractors (Double-scroll, Rossler,
Lorenz).
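As a flavor of the approach, here is a chaos-game sketch of the Sierpinski triangle, one common construction and not necessarily the code shown in the talk.

    clear
    set obs 20000
    generate double x = .
    generate double y = .
    mata:
    n = st_nobs()
    v = (0, 0 \ 1, 0 \ 0.5, sqrt(3)/2)          // vertices of the triangle
    p = (0.1, 0.1)                              // arbitrary starting point
    pts = J(n, 2, .)
    for (i = 1; i <= n; i++) {
        p = (p + v[ceil(3*runiform(1,1)), .]) / 2    // jump halfway to a random vertex
        pts[i, .] = p
    }
    st_store(., ("x", "y"), pts)                // write the points back to Stata
    end
    scatter y x, msymbol(point)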
Additional information
nola13-lirette.pptx
gpsmap: Routine for verifying and returning the attribute table of given decimal GPS coordinates
Timothy Brophy
University of Cape Town
GPS coordinates are collected by many organizations; however, before any meaningful statistical analysis can be derived from these coordinates, they need to be joined with geographical data. Previously, users were required to
export the GPS data out of Stata and into a GIS mapping program to map the
coordinates, validate them, and join them to an attribute table. The results
would then need to be imported back into Stata for statistical
analysis. gpsmap is a routine that imports a user-provided shapefile and
its attribute table. Using a ray-casting algorithm, it maps the GPS
coordinates to one of the polygons of the given shapefile and returns a
dummy variable indicating whether the GPS coordinates were mapped
successfully. Where the GPS coordinates were successfully mapped, the
attribute table applicable to that particular polygon is also returned to
Stata. One of the contributions of gpsmap is to allow users to circumvent
GIS software and to incorporate GIS information directly within Stata. The
other is to give users who are not familiar with GIS software the
opportunity to use GIS information without having to familiarize themselves
with GIS software.
Additional information
nola13-brophy.pptx
Two-stage regression without exclusion restrictions
Michael Barker
Georgetown University
Klein and Vella (2010) propose an estimator to fit a triangular system of
two simultaneous linear equations with a single endogenous regressor. Models
of this form are generally analyzed with two-stage least squares or IV
methods, which require one or more exclusion restrictions. In practice, the
assumptions required to construct valid instruments are frequently difficult
to justify. The KV estimator does not require an exclusion restriction; the
same set of independent variables may appear in both equations. To account
for endogeneity, the estimator constructs a control function using
information from the conditional distribution of the error terms.
Conditional variance functions are estimated semiparametrically, so
distributional assumptions are minimized. I will present my Stata
implementation of the semiparametric control function estimator,
kvreg, and discuss the assumptions that must hold for consistent
estimation. The
kvreg estimator contains an undocumented
implementation of Ichimura’s (1993) semiparametric least squares
estimator, which I plan to develop into a stand-alone command.
References
Ichimura, H. 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models.
Journal of Econometrics 58: 71–120.
Klein, R. and F. Vella. 2010. Estimating a class of triangular simultaneous equations models without exclusion restrictions.
Journal of Econometrics 154: 154–164.
Additional information
nola13-barker.pdf
Generalizing sem in Stata
Jeff Pitblado
Director of Statistical Software, StataCorp
Introducing generalized SEM: (1) SEM with generalized linear response
variables, and (2) SEM with multilevel mixed effects, whether linear or
generalized linear. Generalized linear response variables mean you can now
fit probit, logit, Poisson, multinomial logistic, ordered logit, ordered
probit, and other models. They also mean measurements can be continuous,
binary, count, categorical, and ordered. Multilevel mixed effects mean you
can place latent variables at different levels of the data. You can fit
models with fixed or random intercepts and fixed or random slopes. I will
present examples using both command syntax and the SEM Builder.
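Two minimal examples of the new syntax, with hypothetical variable and level names: a logistic regression written as a gsem model, and a Poisson response with a random intercept at the school level.

    * generalized linear response: logit
    gsem (y <- x1 x2, logit)

    * multilevel: latent random intercept M1 varying over school, Poisson response
    gsem (days_absent <- ses M1[school], poisson)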
Additional information
nola13-pitblado.pdf
Scientific organizers
R. Carter Hill (chair), Louisiana State University
Mario Cleves, University of Arkansas for Medical Sciences
Edward Peters, LSUHSC School of Public Health
Logistics organizers
Nathan Bishop, StataCorp
Chris Farrar, StataCorp
Gretchen Farrar, StataCorp