Last updated: 27 September 2007
2007 UK Stata Users Group meeting
10–11 September
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Robust confidence intervals for Hodges–Lehmann
median difference
Roger Newson
Imperial College
The cendif module is part of the somersd package, and calculates
confidence intervals for the Hodges–Lehmann median difference
between values of a variable in two subpopulations. The traditional
Lehmann formula, unlike the formula used by cendif, assumes that the two
subpopulation distributions are different only in location, and that the
subpopulations are therefore equally variable. The cendif formula
therefore contrasts with the Lehmann formula as the unequal-variance
t-test contrasts with the equal-variance t-test. In a simulation study,
designed to test cendif to destruction, the performance of cendif was
compared to that of the Lehmann formula, using coverage probabilities and
median confidence interval width ratios. The simulations involved sampling
from pairs of Normal or Cauchy distributions, with subsample sizes ranging
from 5 to 40, and between-subpopulation variability scale ratios ranging
from 1 to 4. If the sample numbers were equal, then both methods gave
coverage probabilities close to the advertised confidence level. However,
if the sample numbers were unequal, then the Lehmann coverage
probabilities were over-conservative if the smaller sample was from the
less variable population, and over-liberal if the smaller sample was from
the more variable population. The cendif coverage probability was usually
closer to the advertised level, provided that the smaller sample was not very small.
However, if the sample sizes were 5 and 40, and the two populations were
equally variable, then the Lehmann coverage probability was close to its
advertised level, while the cendif coverage probability was over-liberal.
The cendif confidence interval, in its present form, is therefore robust
both to non-Normality and to unequal variability, but may be less robust to
the possibility that the smaller sample size is very small. Possibilities
for improvement are discussed.
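As a minimal sketch of typical usage (the variable and group names are
hypothetical, and options may differ across versions; see the somersd and
cendif help files):

    * install the somersd package, which includes cendif, if necessary
    ssc install somersd
    * robust 95% confidence interval for the Hodges-Lehmann median difference
    * in bmi between the two levels of treatgroup (hypothetical variables)
    cendif bmi, by(treatgroup)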
Additional information
newson-ohp1.pdf (presentation slides)
Translating the S-Plus Least Angle Regression package to Mata
Adrian Mander
MRC Human Nutrition Research
In an attempt to learn Mata, I have translated the LARS package, written for
R by Trevor Hastie and Brad Efron, into Mata. The LARS package is an
efficient implementation that computes an entire lasso sequence at the cost
of a single least-squares estimation. Mata and R/S+ are remarkably similar
in syntax, and on the whole code can be translated by altering the syntactic
“wording”; however, there was the occasional need for additional
functions. Translating an existing package is certainly not the best
approach to learning a new language. I shall describe the new Stata command
and apply this approach to model selection using some nutrition data.
Performing Bayesian analysis in Stata using WinBUGS
Tom Palmer
Department of Health Sciences, Leicester University
WinBUGS is a program for Bayesian model fitting by Gibbs sampling. WinBUGS
has limited facilities for data handling, whereas Stata has excellent data
handling but no routines for Bayesian analysis; therefore, much can be
gained by running Stata and WinBUGS together. This talk explains the use of
the winbugsfromstata package, described in Thompson et al. (2006), a set of
programs that enable data to be processed in Stata and then passed to
WinBUGS for model fitting. Finally, the results can be read back into Stata
for further processing. Examples will be chosen to illustrate the range of
models that can be fitted within WinBUGS and where possible the results will
be compared with frequentist analyses in Stata. Issues to consider when
fitting models by Markov chain Monte Carlo methods will be discussed,
including assessment of convergence, length of burn-in, and the form and
impact of prior distributions.
Reference: Thompson, J., T. Palmer, and S. Moreno. 2006. Bayesian analysis
in Stata with WinBUGS. The Stata Journal 6(4): 530–549.
Additional information
palmer_winbugsfromstata.presentation.pdf (presentation)
palmer_winbugsfromstata.slides.pdf (presentation slides)
A brief introduction to genetic epidemiology using Stata
Neil Shephard
University of Sheffield
An overview of using Stata to perform candidate gene association analysis
will be presented. Areas covered will include data manipulation,
Hardy–Weinberg equilibrium, calculating and plotting linkage
disequilibrium, estimating haplotypes, and interfacing with external
programs.
Usefulness and estimation of proportionality constraints
Maarten Buis
Department of Social Research Methodology,
Vrije Universiteit Amsterdam
Stata has long had the capability to impose the constraint that parameters
are a linear function of one another. It does not have the capability to
impose the constraint that, if a set of parameters change (due to
interaction terms), they maintain the relative differences among them. Such
a proportionality constraint has a nice interpretation: the constrained
variables together measure some latent concept. For instance, if a
proportionality constraint is imposed on the variables father’s education,
mother’s education, father’s occupational status, and mother’s
occupational status, then together they might be thought to measure the
latent variable family socioeconomic status. With the proportionality
constraint one can estimate the effect of the latent variable and how
strongly each observed variable loads on it (i.e., whether the mother, the
father, or the highest-status parent matters most). Such a model is a
special case of a so-called MIMIC model. In principle, these models can be
estimated using standard ml algorithms; however, as the parameters are
rather strongly correlated, ml has a hard time finding the maximum. An EM
algorithm is proposed that will find the maximum. This maximum is then fed
into ml to obtain the correct standard errors.
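As a sketch of the constraint (the notation is mine, not necessarily the
author’s): with observed variables $x_1,\dots,x_K$ measuring one latent
concept, the linear predictor contains a term

\[
\beta\,(\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_K x_K),
\qquad \sum_{k=1}^{K} \lambda_k = 1,
\]

so that interaction terms are allowed to rescale the overall effect $\beta$
while the relative loadings $\lambda_k$ stay fixed.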
Additional information
buis_propcnsreg.pdf (presentation slides)
Dynamic probit models for panel data: A comparison of three methods of estimation
Alfonso Miranda
Department of Economics, Keele University
Three different methods have been suggested in the econometrics literature
to deal with the initial conditions problem in dynamic probit models for
panel data. Heckman (1981) suggests approximating the reduced-form marginal
probability of the initial state with a probit model and allowing free
correlation between the unobserved individual heterogeneity entering the
initial conditions and the main dynamic equations. Alternatively, Wooldridge
(2002) suggests writing a dynamic model conditional on the first observation
and specifying a distribution for the unobserved individual heterogeneity
term conditional on the initial state and any other exogenous explanatory
variables. Finally, Orme (1996) introduces a two-step bias-corrected
procedure that is locally valid when the correlation between the unobserved
individual heterogeneity determining the initial state and the dynamic
probit equations approximates to zero. Orme suggests that this two-step
procedure can perform well even when such correlation is strong. I present
some results from a Monte Carlo simulation study comparing the performance
of these three methods using small and medium sample sizes and low and high
correlation among unobservables.
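For orientation, a generic dynamic panel probit of the kind discussed (the
notation is mine) is

\[
y_{it} = \mathbf{1}\{\gamma\, y_{i,t-1} + \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + u_{it} > 0\},
\qquad u_{it} \sim N(0,1),
\]

where the initial conditions problem arises because the first observed
outcome $y_{i0}$ is not independent of the individual effect $\alpha_i$.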
Additional information
miranda_Dprob_pe.pdf (presentation slides)
metamiss: Meta-analysis with missing data
Ian White
MRC Biostatistics Unit, Cambridge
A new command
metamiss performs meta-analysis when
some or all studies have missing data. A variety of assumptions are
available, including missing-at-random, missing=failure, worst and best
cases, and incorporating a user-specified prior distribution for the degree
of informative missingness. This is joint work with Julian Higgins.
Additional information
Ian_White.ppt (presentation slides)
A simulation-based sensitivity analysis for matching estimators
Tommaso Nannicini
Universidad Carlos III de Madrid
This article presents a Stata program (sensatt) that implements the
sensitivity analysis for matching estimators proposed by Ichino, Mealli, and
Nannicini (2007). The analysis simulates a potential confounder in order to
assess the robustness of the estimated treatment effects with respect to
deviations from the Conditional Independence Assumption (CIA). The program
makes use of the commands for propensity-score matching (att*) developed by
Becker and Ichino (2002). An example is given using data from the National
Supported Work (NSW) demonstration, widely known in the program evaluation
literature.
Additional information
pres_stata_2.pdf (presentation slides)
The advantages of using macros with loops
Shuk-Li Man
Centre for Sexual Health and HIV Research,
University College London
Using loops and macros in Stata has many advantages: it reduces the length
of your do-files, allows errors to be tracked and fixed quickly and
efficiently, makes do-files run faster, and provides reusable programs for
subsequent data analyses with similar scenarios. In this presentation we
shall cover the following areas:
- Storing global and local macros within Stata, with applied examples
  including storing categories of a variable, storing data summaries,
  and storing names of files within a directory.
- The commands foreach, forval, and while, with applied examples.
- Applied examples of how to combine macros with loops, and why this can
  be useful (a minimal sketch follows below).
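A minimal sketch of the kind of construct covered (variable names are
hypothetical):

    * store the distinct categories of region in a local macro, then loop over them
    levelsof region, local(regions)
    foreach r of local regions {
        summarize income if region == `r'
    }
    * loop over a numeric range with forvalues
    forvalues yr = 2000/2005 {
        display "processing year `yr'"
    }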
Additional information
man_Stata_user_groupoct2007v5.ppt (presentation slides)
Regression-based inequality decomposition
Carlo Fiorio
Department of Economic Sciences,
Università degli Studi di Milano
Stephen P. Jenkins
Institute for Social and Economic Research,
University of Essex
This talk discusses
ineqrbd, a program for OLS
regression-based decomposition suggested by G.S. Fields (“Accounting
for Income Inequality and Its Change: A New Method, with Application to the
Distribution of Earnings in the United States”, Research in Labor
Economics, 2003). It provides an exact decomposition of the inequality of
total income into inequality contributions from each of the factor
components (or determinants) of total income.
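In outline (my notation, not necessarily the program’s): for an OLS
regression $y = \alpha + \sum_f \beta_f x_f + \varepsilon$, the share of the
inequality of $y$ attributed to factor $f$ is

\[
s_f = \frac{\operatorname{cov}(\hat\beta_f x_f,\; y)}{\operatorname{var}(y)},
\qquad \sum_f s_f + s_\varepsilon = 1,
\]

where $s_\varepsilon$ is the share attributed to the residual.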
Additional information
fiorio_ineqrbd_UKSUG07.pdf (presentation slides)
Adolists: A new concept for Stata
Ben Jann
ETH Zürich
A new package called
adolist is presented.
adolist is a tool to create, install, and uninstall
lists of user ado-packages (“adolists”). For example,
adolist can create a list of all user packages
installed on a system and then install the same packages on another system.
Moreover, adolist can be used to put together
thematic lists of packages such as, say, a list on income inequality
analysis or time-series add-ons, or the list of “41 user ados
everyone should know”. Such lists can then be shared with others,
who can easily install and uninstall the listed packages using the
adolist command.
Additional information
jann_London07_adolist.pdf
Creating self-validating datasets
Bill Rising
StataCorp
One of Stata’s great strengths is its data management abilities. When
either building or sharing datasets, some of the most time-consuming
activities are validating the data and writing documentation for the data.
Much of this effort could be avoided if datasets were self-contained,
i.e., if they could validate themselves. I will show how to achieve this
goal within Stata. I will demonstrate a package of commands for attaching
validation rules to the variables themselves, via characteristics, along
with commands for running error checks and marking suspicious observations
in the dataset. The validation system is flexible enough that simple checks
continue to work even if variable names change or if the data are reshaped,
and it is rich enough that validation may depend on other variables in the
dataset. Since the validation is at the variable level, the self-validation
also works if variables are recombined with data from other datasets. With
these tools, Stata’s datasets can become truly self-contained.
Additional information
rising_ckvarTalk.beamer.pdf (presentation slides)
Clustered standard errors in Stata
Austin Nichols
Urban Institute
A brief survey of clustered errors, focusing on estimating cluster-robust
standard errors: when and why to use the cluster option (nearly always in
panel regressions), and the implications of doing so. Additional topics may
include using svyset to specify clustering, multidimensional clustering,
clustering in meta-analysis, how many clusters are required for asymptotic
approximations to hold, testing coefficients when the variance–covariance
matrix has less than full rank, and testing for clustering of errors.
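For example, in current Stata syntax a cluster-robust variance estimate is
requested as follows (variable names are hypothetical):

    * OLS with standard errors clustered on firm
    regress y x1 x2, vce(cluster firmid)
    * fixed-effects panel regression, clustering on the panel identifier
    xtset firmid year
    xtreg y x1 x2, fe vce(cluster firmid)
    * or declare the clustering once via the survey settings
    svyset firmid
    svy: regress y x1 x2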
Additional information
nichols_crse.pdf (presentation slides)
Quantiles, L-moments, and modes: Bringing order to descriptive statistics
Nick Cox
Department of Geography, Durham University
Describing batches of data in terms of their order statistics or quantiles
has long roots but remains underrated in graphically based exploration,
data reduction, and data reporting. Hosking in 1990 proposed L-moments based
on quantiles as a unifying framework for summarizing distribution
properties, but despite several advantages they still appear to be very
little known outside their main application areas of hydrology and
climatology. Similarly, the mode can be traced to the prehistory of
statistics, but it is often neglected or disparaged despite its value as a
simple descriptor and even as a robust estimator of location. This paper
reviews and exemplifies these approaches with detailed reference to Stata
implementations. Several graphical displays are discussed, some novel.
Specific attention is given to the use of Mata for programming core
calculations directly and rapidly.
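For reference, the first few L-moments (standard definitions following
Hosking 1990, not specific to the Stata implementation) are expectations of
linear combinations of order statistics $X_{k:n}$:

\[
\lambda_1 = \mathrm{E}[X], \qquad
\lambda_2 = \tfrac{1}{2}\,\mathrm{E}[X_{2:2} - X_{1:2}], \qquad
\lambda_3 = \tfrac{1}{3}\,\mathrm{E}[X_{3:3} - 2X_{2:3} + X_{1:3}],
\]

with $\lambda_2$ a measure of spread and the ratios
$\tau_3 = \lambda_3/\lambda_2$ and $\tau_4 = \lambda_4/\lambda_2$ serving as
measures of skewness and kurtosis.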
Additional information
njctalkNASUG2007.zip (presentation in smcl, plus ado- and do-files and datasets)
Extreme values and “robust” analysis of distributions
Philippe Van Kerm
CEPS/INSTEAD, G.-D. Luxembourg
Distributive analysis typically consists in estimating summary measures
capturing aspects of the distribution of sample points beyond central
tendency. Stochastic dominance analysis is also central for comparisons of
distributions. Unfortunately, data contamination, and extreme data more
generally, are known to be highly influential in both types of analysis,
much more so than in central-tendency analysis, and potentially jeopardize
the validity of one’s conclusions even with relatively large sample sizes.
This presentation illustrates the problems raised by extreme data in
distributive analysis and describes robust parametric and semiparametric
approaches for addressing them. The methods are based on the use of
“optimal B-robust” (OBRE) estimators as an alternative to maximum
likelihood. A prototype Stata implementation of these estimators is
described, and empirical examples in income distribution analysis show how
robust inequality estimates and dominance checks can be derived from these
parametric or semiparametric models.
Additional information
vankerm-uksug_slides.pdf (presentation slides)
Advanced graph editing
Vince Wiggins
StataCorp
We will take a quick tour of the graph editor, covering the basic concepts:
adding text, lines, and markers; changing the defaults for added objects;
changing properties; working quickly by combining the contextual toolbars
with the object dialogs; and using the object browser effectively.
Leveraging these concepts, we’ll discuss how and when to use the grid
editor, and techniques for combined graphs and by-graphs. Finally, we will
look at some tricks and features that aren’t apparent at first blush.
Instrumental variables: Overview and advances
Kit Baum
Boston College
The talk will present the instrumental variables (IV) regression estimator,
a key tool for estimating relationships that involve endogeneity (two-way
causality) or measurement error, focusing on the Baum/Schaffer/Stillman
ivreg2 package and Stata 10’s new ivregress command. The IV or two-stage
least squares estimator is a special case of a Generalized Method of Moments
(GMM) estimator. GMM techniques are appropriate when non-i.i.d. disturbances
are encountered. We will discuss tests of overidentification, weak
instruments, and endogeneity/exogeneity, as well as recently developed tools
for testing functional-form specification (ivreset) and autocorrelation in
the IV context (ivactest).
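A minimal sketch of the two commands (variable names are hypothetical; x2 is
treated as endogenous and instrumented by z1 and z2):

    * user-written ivreg2, with heteroskedasticity-robust standard errors
    ssc install ivreg2
    ivreg2 y x1 (x2 = z1 z2), robust
    * official Stata 10 two-stage least squares
    ivregress 2sls y x1 (x2 = z1 z2), vce(robust)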
Additional information
baumUKSUG2007.pdf (presentation slides)
baumUKSUG2007smcltalk.zip (presentation in smcl)
A new architecture for handling multiply imputed data in Stata
Patrick Royston
MRC Clinical Trials Unit, London
There has been considerable growth of interest, among Stata users and more
widely, in the practical use of multiple imputation as a principled route to
the analysis of datasets with missing covariate values. Sophisticated Stata
software (ice) is available for creating multiply imputed datasets. However,
equally sophisticated and flexible tools are required to carry out the
analyses. The MI Tools package of Carlin et al. (2003) and Royston’s
micombine command (packaged with ice) made a start. We present a new set of
tools, called mim, which carries the postimputation process a step further.
mim defines a standardized architecture for MI datasets and has features for
manipulating MI data. More importantly, it supports a wide range of
regression models, including those for panel and survey data. Limited
facilities for postestimation analysis are provided, and these are expected
to be developed further. The package is in beta testing and has been
submitted for publication in the Stata Journal.
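A minimal sketch of the intended workflow (variable names and the number of
imputations are hypothetical; exact options may differ across versions):

    * create 5 multiply imputed datasets with ice, stacked with _mj/_mi identifiers
    ice y x1 x2 x3, m(5) saving(imputed, replace)
    * fit the analysis model in each imputation and combine with Rubin's rules
    use imputed, clear
    mim: regress y x1 x2 x3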
Additional information
Royston_SUG_2007.ppt (presentation slides)
Scientific organizers
Tim Collier, London School of Hygiene & Tropical Medicine
Stephen Jenkins, University of Essex
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.