Last updated: 27 September 2007
2007 UK Stata Users Group meeting
10–11 September
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Robust confidence intervals for Hodges–Lehmann
median difference
Roger Newson
Imperial College
The cendif module is part of the somersd package, and calculates
confidence intervals for the Hodges–Lehmann median difference
between values of a variable in two subpopulations. The traditional
Lehmann formula, unlike the formula used by cendif, assumes that the two
subpopulation distributions are different only in location, and that the
subpopulations are therefore equally variable. The cendif formula
therefore contrasts with the Lehmann formula as the unequal-variance
t-test contrasts with the equal-variance t-test. In a simulation study,
designed to test cendif to destruction, the performance of cendif was
compared to that of the Lehmann formula, using coverage probabilities and
median confidence interval width ratios. The simulations involved sampling
from pairs of Normal or Cauchy distributions, with subsample sizes ranging
from 5 to 40, and between-subpopulation variability scale ratios ranging
from 1 to 4. If the sample numbers were equal, then both methods gave
coverage probabilities close to the advertised confidence level. However,
if the sample numbers were unequal, then the Lehmann coverage
probabilities were over-conservative if the smaller sample was from the
less variable population, and over-liberal if the smaller sample was from
the more variable population. The cendif coverage probability was usually
closer to the advertised level, provided that the smaller sample was not very small.
However, if the sample sizes were 5 and 40, and the two populations were
equally variable, then the Lehmann coverage probability was close to its
advertised level, while the cendif coverage probability was over-liberal.
The cendif confidence interval, in its present form, is therefore robust
both to non-Normality and to unequal variability, but may be less robust to
the possibility that the smaller sample size is very small. Possibilities
for improvement are discussed.
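As a minimal sketch of typical usage (the variable and group names are
hypothetical, and options may differ across versions; see the somersd and
cendif help files):

    * install the somersd package, which includes cendif, if necessary
    ssc install somersd
    * robust 95% confidence interval for the Hodges-Lehmann median difference
    * in bmi between the two levels of treatgroup (hypothetical variables)
    cendif bmi, by(treatgroup)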
Additional information
newson-ohp1.pdf (presentation slides)
Translating the S-Plus Least Angle Regression package to Mata
Adrian Mander
MRC Human Nutrition Research
In an attempt to learn Mata, I have translated the LARS package, written for
R by Trevor Hastie and Brad Efron, into Mata. The LARS package is an
efficient implementation that computes an entire lasso sequence at the cost
of a single least-squares estimation. Mata and R/S+ are remarkably similar
in syntax, and on the whole code can be translated by altering the syntactic
“wording”; however, there was the occasional need for additional
functions. Translating an existing package is certainly not the best
approach to learning a new language. I shall describe the new Stata command
and apply this approach to model selection using some nutrition data.
Performing Bayesian analysis in Stata using WinBUGS
Tom Palmer
Department of Health Sciences, Leicester University
WinBUGS is a program for Bayesian model fitting by Gibbs sampling. WinBUGS
has limited facilities for data handling, whereas Stata has excellent data
handling but no routines for Bayesian analysis; therefore, much can be
gained by running Stata and WinBUGS together. This talk explains the use of
the winbugsfromstata package, described in Thompson et al. (2006), a set of
programs that enable data to be processed in Stata and then passed to
WinBUGS for model fitting. Finally, the results can be read back into Stata
for further processing. Examples will be chosen to illustrate the range of
models that can be fitted within WinBUGS and where possible the results will
be compared with frequentist analyses in Stata. Issues to consider when
fitting models by Markov chain Monte Carlo methods will be discussed,
including assessment of convergence, length of burn-in, and the form and
impact of prior distributions.
Reference: Thompson, J., T. Palmer, and S. Moreno. 2006. Bayesian analysis
in Stata with WinBUGS. The Stata Journal 6(4): 530–549.
Additional information
palmer_winbugsfromstata.presentation.pdf (presentation)
palmer_winbugsfromstata.slides.pdf (presentation slides)
A brief introduction to genetic epidemiology using Stata
Neil Shephard
University of Sheffield
An overview of using Stata to perform candidate gene association analysis
will be presented. Areas covered will include data manipulation,
Hardy–Weinberg equilibrium, calculating and plotting linkage
disequilibrium, estimating haplotypes, and interfacing with external
programs.
Usefulness and estimation of proportionality constraints
Maarten Buis
Department of Social Research Methodology,
Vrije Universiteit Amsterdam
Stata has long had the capability to impose the constraint that parameters
are a linear function of one another. It does not have the capability to
impose the constraint that, if a set of parameters change (due to
interaction terms), they maintain the relative differences among them. Such
a proportionality constraint has a nice interpretation: the constrained
variables together measure some latent concept. For instance, if a
proportionality constraint is imposed on the variables father’s education,
mother’s education, father’s occupational status, and mother’s
occupational status, then together they might be thought to measure the
latent variable family socioeconomic status. With the proportionality
constraint one can estimate the effect of the latent variable and how
strongly each observed variable loads on it (i.e., whether the mother, the
father, or the highest-status parent matters most). Such a model is a
special case of a so-called MIMIC model. In principle, these models can be
estimated using standard ml algorithms; however, as the parameters are
rather strongly correlated, ml has a hard time finding the maximum. An EM
algorithm is proposed that will find the maximum. This maximum is then fed
into ml to obtain the correct standard errors.
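As a sketch of the constraint (the notation is mine, not necessarily the
author’s): with observed variables $x_1,\dots,x_K$ measuring one latent
concept, the linear predictor contains a term

\[
\beta\,(\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_K x_K),
\qquad \sum_{k=1}^{K} \lambda_k = 1,
\]

so that interaction terms are allowed to rescale the overall effect $\beta$
while the relative loadings $\lambda_k$ stay fixed.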
Additional information
buis_propcnsreg.pdf (presentation slides)
Dynamic probit models for panel data: A comparison of three methods of estimation
Alfonso Miranda
Department of Economics, Keele University
Three different methods have been suggested in the econometrics literature
to deal with the initial conditions problem in dynamic probit models for
panel data. Heckman (1981) suggests approximating the reduced-form marginal
probability of the initial state with a probit model and allowing free
correlation between the unobserved individual heterogeneity entering the
initial conditions and the main dynamic equations. Alternatively, Wooldridge
(2002) suggests writing a dynamic model conditional on the first observation
and specifying a distribution for the unobserved individual heterogeneity
term conditional on the initial state and any other exogenous explanatory
variables. Finally, Orme (1996) introduces a two-step bias-corrected
procedure that is locally valid when the correlation between the unobserved
individual heterogeneity determining the initial state and the dynamic
probit equations approximates to zero. Orme suggests that this two-step
procedure can perform well even when such correlation is strong. I present
some results from a Monte Carlo simulation study comparing the performance
of these three methods using small and medium sample sizes and low and high
correlation among unobservables.
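For orientation, a generic dynamic panel probit of the kind discussed (the
notation is mine) is

\[
y_{it} = \mathbf{1}\{\gamma\, y_{i,t-1} + \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + u_{it} > 0\},
\qquad u_{it} \sim N(0,1),
\]

where the initial conditions problem arises because the first observed
outcome $y_{i0}$ is not independent of the individual effect $\alpha_i$.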
Additional information
miranda_Dprob_pe.pdf (presentation slides)
metamiss: Meta-analysis with missing data
Ian White
MRC Biostatistics Unit, Cambridge
A new command
metamiss performs meta-analysis when
some or all studies have missing data. A variety of assumptions are
available, including missing-at-random, missing=failure, worst and best
cases, and incorporating a user-specified prior distribution for the degree
of informative missingness. This is joint work with Julian Higgins.
Additional information
Ian_White.ppt (presentation slides)
A simulation-based sensitivity analysis for matching estimators
Tommaso Nannicini
Universidad Carlos III de Madrid
This article presents a Stata program (sensatt) that implements the
sensitivity analysis for matching estimators proposed by Ichino, Mealli, and
Nannicini (2007). The analysis simulates a potential confounder in order to
assess the robustness of the estimated treatment effects with respect to
deviations from the Conditional Independence Assumption (CIA). The program
makes use of the commands for propensity-score matching (att*) developed by
Becker and Ichino (2002). An example is given using data from the National
Supported Work (NSW) demonstration, widely known in the program evaluation
literature.
Additional information
pres_stata_2.pdf (presentation slides)
The advantages of using macros with loops
Shuk-Li Man
Centre for Sexual Health and HIV Research,
University College London
Using loops and macros in Stata has many advantages: it reduces the length
of your do-files, allows errors to be tracked and fixed quickly and
efficiently, makes do-files run faster, and provides reusable programs for
subsequent data analyses with similar scenarios. In this presentation we
shall cover the following areas:
- Storing global and local macros within Stata, with applied examples
  including storing categories of a variable, storing data summaries,
  and storing names of files within a directory.
- The commands foreach, forval, and while, with applied examples.
- Applied examples of how to combine macros with loops, and why this can
  be useful (a minimal sketch follows below).
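A minimal sketch of the kind of construct covered (variable names are
hypothetical):

    * store the distinct categories of region in a local macro, then loop over them
    levelsof region, local(regions)
    foreach r of local regions {
        summarize income if region == `r'
    }
    * loop over a numeric range with forvalues
    forvalues yr = 2000/2005 {
        display "processing year `yr'"
    }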
Additional information
man_Stata_user_groupoct2007v5.ppt (presentation slides)
Regression-based inequality decomposition
Carlo Fiorio
Department of Economic Sciences,
Università degli Studi di Milano
Stephen P. Jenkins
Institute for Social and Economic Research,
University of Essex
This talk discusses
ineqrbd, a program for OLS
regression-based decomposition suggested by G.S. Fields (“Accounting
for Income Inequality and Its Change: A New Method, with Application to the
Distribution of Earnings in the United States”, Research in Labor
Economics, 2003). It provides an exact decomposition of the inequality of
total income into inequality contributions from each of the factor
components (or determinants) of total income.
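In outline (my notation, not necessarily the program’s): for an OLS
regression $y = \alpha + \sum_f \beta_f x_f + \varepsilon$, the share of the
inequality of $y$ attributed to factor $f$ is

\[
s_f = \frac{\operatorname{cov}(\hat\beta_f x_f,\; y)}{\operatorname{var}(y)},
\qquad \sum_f s_f + s_\varepsilon = 1,
\]

where $s_\varepsilon$ is the share attributed to the residual.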
Additional information
fiorio_ineqrbd_UKSUG07.pdf (presentation slides)
Adolists: A new concept for Stata
Ben Jann
ETH Zürich
A new package called
adolist is presented.
adolist is a tool to create, install, and uninstall
lists of user ado-packages (“adolists”). For example,
adolist can create a list of all user packages
installed on a system and then install the same packages on another system.
Moreover, adolist can be used to put together
thematic lists of packages such as, say, a list on income inequality
analysis or time-series add-ons, or the list of “41 user ados
everyone should know”. Such lists can then be shared with others,
who can easily install and uninstall the listed packages using the
adolist command.
Additional information
jann_London07_adolist.pdf
Creating self-validating datasets
Bill Rising
StataCorp
One of Stata’s great strengths is its data management abilities. When
either building or sharing datasets, some of the most time-consuming
activities are validating the data and writing documentation for the data.
Much of this effort could be avoided if datasets were self-contained,
i.e., if they could validate themselves. I will show how to achieve this
goal within Stata. I will demonstrate a package of commands for attaching
validation rules to the variables themselves, via characteristics, along
with commands for running error checks and marking suspicious observations
in the dataset. The validation system is flexible enough that simple checks
continue to work even if variable names change or if the data are reshaped,
and it is rich enough that validation may depend on other variables in the
dataset. Since the validation is at the variable level, the self-validation
also works if variables are recombined with data from other datasets. With
these tools, Stata’s datasets can become truly self-contained.
Additional information
rising_ckvarTalk.beamer.pdf (presentation slides)
Clustered standard errors in Stata
Austin Nichols
Urban Institute
A brief survey of clustered errors, focusing on estimating cluster-robust
standard errors: when and why to use the cluster option (nearly always in
panel regressions), and the implications of doing so. Additional topics may
include using svyset to specify clustering, multidimensional clustering,
clustering in meta-analysis, how many clusters are required for asymptotic
approximations to hold, testing coefficients when the variance–covariance
matrix has less than full rank, and testing for clustering of errors.
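For example, in current Stata syntax a cluster-robust variance estimate is
requested as follows (variable names are hypothetical):

    * OLS with standard errors clustered on firm
    regress y x1 x2, vce(cluster firmid)
    * fixed-effects panel regression, clustering on the panel identifier
    xtset firmid year
    xtreg y x1 x2, fe vce(cluster firmid)
    * or declare the clustering once via the survey settings
    svyset firmid
    svy: regress y x1 x2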
Additional information
nichols_crse.pdf (presentation slides)
Quantiles, L-moments, and modes: Bringing order to descriptive statistics
Nick Cox
Department of Geography, Durham University
Describing batches of data in terms of their order statistics or quantiles
has long roots but remains underrated in graphically based exploration,
data reduction, and data reporting. Hosking in 1990 proposed L-moments based
on quantiles as a unifying framework for summarizing distribution
properties, but despite several advantages they still appear to be very
little known outside their main application areas of hydrology and
climatology. Similarly, the mode can be traced to the prehistory of
statistics, but it is often neglected or disparaged despite its value as a
simple descriptor and even as a robust estimator of location. This paper
reviews and exemplifies these approaches with detailed reference to Stata
implementations. Several graphical displays are discussed, some novel.
Specific attention is given to the use of Mata for programming core
calculations directly and rapidly.
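For reference, the first few L-moments (standard definitions following
Hosking 1990, not specific to the Stata implementation) are expectations of
linear combinations of order statistics $X_{k:n}$:

\[
\lambda_1 = \mathrm{E}[X], \qquad
\lambda_2 = \tfrac{1}{2}\,\mathrm{E}[X_{2:2} - X_{1:2}], \qquad
\lambda_3 = \tfrac{1}{3}\,\mathrm{E}[X_{3:3} - 2X_{2:3} + X_{1:3}],
\]

with $\lambda_2$ a measure of spread and the ratios
$\tau_3 = \lambda_3/\lambda_2$ and $\tau_4 = \lambda_4/\lambda_2$ serving as
measures of skewness and kurtosis.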
Additional information
njctalkNASUG2007.zip (presentation in smcl, plus ado- and do-files and datasets)
Extreme values and “robust” analysis of distributions
Philippe Van Kerm
CEPS/INSTEAD, G.-D. Luxembourg
Distributive analysis typically consists in estimating summary measures
capturing aspects of the distribution of sample points beyond central
tendency. Stochastic dominance analysis is also central for comparisons of
distributions. Unfortunately, data contamination, and extreme data more
generally, are known to be highly influential in both types of analysis,
much more so than in central-tendency analysis, and potentially jeopardize
the validity of one’s conclusions even with relatively large sample sizes.
This presentation illustrates the problems raised by extreme data in
distributive analysis and describes robust parametric and semiparametric
approaches for addressing them. The methods are based on the use of
“optimal B-robust” (OBRE) estimators as an alternative to maximum
likelihood. A prototype Stata implementation of these estimators is
described, and empirical examples in income distribution analysis show how
robust inequality estimates and dominance checks can be derived from these
parametric or semiparametric models.
Additional information
vankerm-uksug_slides.pdf (presentation slides)
Advanced graph editing
Vince Wiggins
StataCorp
We will take a quick tour of the graph editor, covering the basic concepts:
adding text, lines, and markers; changing the defaults for added objects;
changing properties; working quickly by combining the contextual toolbars
with the object dialogs; and using the object browser effectively.
Leveraging these concepts, we’ll discuss how and when to use the grid
editor, and techniques for combined graphs and by-graphs. Finally, we will
look at some tricks and features that aren’t apparent at first blush.
Instrumental variables: Overview and advances
Kit Baum
Boston College
The talk will present the instrumental variables (IV) regression estimator,
a key tool for estimating relationships that involve endogeneity (two-way
causality) or measurement error, focusing on the Baum/Schaffer/Stillman
ivreg2 package and Stata 10’s new ivregress command. The IV or two-stage
least squares estimator is a special case of a Generalized Method of Moments
(GMM) estimator. GMM techniques are appropriate when non-i.i.d. disturbances
are encountered. We will discuss tests of overidentification, weak
instruments, and endogeneity/exogeneity, as well as recently developed tools
for testing functional-form specification (ivreset) and autocorrelation in
the IV context (ivactest).
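A minimal sketch of the two commands (variable names are hypothetical; x2 is
treated as endogenous and instrumented by z1 and z2):

    * user-written ivreg2, with heteroskedasticity-robust standard errors
    ssc install ivreg2
    ivreg2 y x1 (x2 = z1 z2), robust
    * official Stata 10 two-stage least squares
    ivregress 2sls y x1 (x2 = z1 z2), vce(robust)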
Additional information
baumUKSUG2007.pdf (presentation slides)
baumUKSUG2007smcltalk.zip (presentation in smcl)
A new architecture for handling multiply imputed data in Stata
Patrick Royston
MRC Clinical Trials Unit, London
There has been considerable growth of interest, among Stata users and more
widely, in the practical use of multiple imputation as a principled route to
the analysis of datasets with missing covariate values. Sophisticated Stata
software (ice) is available for creating multiply imputed datasets. However,
equally sophisticated and flexible tools are required to carry out the
analyses. The MI Tools package of Carlin et al. (2003) and Royston’s
micombine command (packaged with ice) made a start. We present a new set of
tools, called mim, which carries the postimputation process a step further.
mim defines a standardized architecture for MI datasets and has features for
manipulating MI data. More importantly, it supports a wide range of
regression models, including those for panel and survey data. Limited
facilities for postestimation analysis are provided, and these are expected
to be developed further. The package is in beta testing and has been
submitted for publication in the Stata Journal.
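A minimal sketch of the intended workflow (variable names and the number of
imputations are hypothetical; exact options may differ across versions):

    * create 5 multiply imputed datasets with ice, stacked with _mj/_mi identifiers
    ice y x1 x2 x3, m(5) saving(imputed, replace)
    * fit the analysis model in each imputation and combine with Rubin's rules
    use imputed, clear
    mim: regress y x1 x2 x3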
Additional information
Royston_SUG_2007.ppt (presentation slides)
Scientific organizers
Tim Collier, London School of Hygiene & Tropical Medicine
Stephen Jenkins, University of Essex
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.