Last updated: 29 November 2010
2010 UK Stata Users Group meeting
9–10 September 2010
London School of Hygiene and Tropical Medicine
Keppel Street
London WC1E 7HT
United Kingdom
Proceedings
Post-parmest peripherals: fvregen, invcise, and qqvalue
Roger B. Newson
National Heart and Lung Institute, Imperial College London
The parmest package is used with Stata estimation commands to produce
output datasets (or results-sets) with one observation per estimated
parameter, and data on parameter names, estimates, confidence limits,
p-values, and other parameter attributes. These results-sets can then
be input to other Stata programs to produce tables, listings, plots, and
secondary results-sets containing derived parameters. Three recently added
packages for post-parmest processing are fvregen, invcise, and qqvalue.
fvregen is used when the parameters belong to models containing
factor variables, introduced in Stata version 11. It regenerates these
factor variables in the results-set, enabling the user to plot, list, or
tabulate factor levels with estimates and confidence limits of parameters
specific to these factor levels.
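A minimal sketch of this workflow (the auto dataset stands in for real data,
and fvregen's default behavior is assumed from its help file; treat the calls
as illustrative):

    sysuse auto, clear
    regress mpg i.rep78 weight
    parmest, norestore        // replace the data with the results-set
    fvregen                   // regenerate rep78 from the parameter names
    list parm rep78 estimate min95 max95 if !missing(rep78)

Each listed row then pairs a level of rep78 with its estimate and confidence
limits, ready for plotting or tabulation.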
invcise calculates standard errors inversely from confidence limits
produced without standard errors, such as those for medians and for
Hodges–Lehmann median differences. These standard errors can then be
input, with the estimates, into the metaparm module of parmest to produce
confidence intervals for linear combinations of medians or of median
differences, such as those used in meta-analysis or interaction estimation.
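The arithmetic being inverted is simple. A hand-rolled sketch of the
calculation invcise automates, for hypothetical variables min95 and max95
holding symmetric 95% confidence limits:

    * SE recovered from a symmetric 95% confidence interval
    generate double stderr = (max95 - min95) / (2 * invnormal(0.975))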
qqvalue inputs the p-values in a results-set and creates a new variable
containing the quasi-q-values, which are calculated by inverting a
multiple-test procedure designed to control the familywise error rate (FWER)
or the false discovery rate (FDR). The quasi-q-value for each p-value is the
minimum FWER or FDR for which that p-value would be in the discovery set if
the specified multiple-test procedure were used on the full set of p-values.
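For example, with a results-set in memory containing a p-value variable p, a
call along the following lines (option names as I recall them from the
package help file; treat them as illustrative) adds Simes quasi-q-values
controlling the FDR:

    qqvalue p, method(simes) qvalue(qval)
    sort qval
    list parm p qval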
fvregen, invcise, qqvalue, and parmest can be downloaded from SSC.
Additional information
UKSUG10_newson1.zip
A Stata program for calibration weighting
John D'Souza
National Centre for Social Research, London
Although survey data are sometimes weighted by their selection weights, it
is often preferable to use auxiliary information available on the whole
population to improve estimation. Calibration weighting (Deville and
Särndal, 1992,
Journal of the American Statistical Association 87:
376–382) is one of the most common methods of doing this. This method
adjusts the selection weights so that known population totals for the
auxiliary variables are reproduced exactly, while ensuring that the
calibrated weights are as close as possible to the original sampling weights.
The simplest example of calibration is poststratification. This is the
special case where the auxiliary variable is a single categorical variable.
General calibration extends this to deal with more than one auxiliary
variable and allows the user to include both categorical and numerical
variables.
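For the poststratification special case, the adjustment can be written down
directly. A minimal sketch, assuming a selection weight w, a single
categorical auxiliary variable agegrp, and a variable poptotal holding the
known population total of each observation's group:

    * weighted sample total within each poststratum
    bysort agegrp: egen double samptotal = total(w)
    * rescale so each poststratum reproduces its population total
    generate double w_cal = w * poptotal / samptotal

General calibration solves the analogous minimization with several auxiliary
constraints at once.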
A typical example might occur in a population survey, where the selection
weights could be calibrated to ensure that the sample weighted by the
calibration weights has exactly the same distribution as the population on
variables such as age, sex, and region.
Many packages have routines for calibration. SAS has the macro CALMAR;
GenStat has the procedure SVCALIBRATE; and R has the function
calibrate. However, no such routine is publicly available in Stata. I
will introduce a user-written Stata program for calibration and will also
discuss a simple extension to show how it can incorporate a nonresponse
correction. I will also briefly discuss the program’s strengths and limitations when
compared to rival packages.
Additional information
UKSUG10.DSouza.ppt
Estimating and modeling cure within the framework of flexible parametric survival models
Therese M.-L. Andersson
Department of Medical Epidemiology and Biostatistics,
Karolinska Institutet, Stockholm
Cure models can be used to simultaneously estimate the proportion of cancer
patients who are eventually cured of their disease and the survival of
those who remain “uncured”. One limitation of parametric cure models is
that the functional form of the survival of the “uncured” has to
be specified. It can sometimes be hard to fit survival functions flexible
enough to capture high mortality rates within a few months of diagnosis
or a high cure proportion (e.g., over 90%). If instead the flexible
parametric survival models implemented in stpm2 could be used, then these
problems could potentially be avoided. Flexible parametric survival models
are fit on the log cumulative hazard scale using restricted cubic splines
for the baseline. When cure is reached, the excess hazard rate (the
difference in the observed all-cause mortality rate among the patients
compared with that expected in the general population) is zero, and the
cumulative excess hazard is constant. By incorporating an extra constraint
on the log cumulative excess hazard after the last knot so that we force it
not only to be linear but also to have zero slope, we are able to estimate
the cure proportion. The flexible parametric survival model can be written
as a special case of a nonmixture cure model, but with a more flexible
distribution, which also enables estimation of the survival of
“uncured” patients.
We have updated the user-written stpm2 command for flexible parametric
models and added a cure option as well as postestimation predictions of the
cure proportion and survival of the “uncured”. We will compare the use of
flexible parametric cure models implemented in stpm2 with standard
parametric cure models implemented in strsmix and strsnmix.
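A hedged sketch of the updated syntax, assuming the data are already stset
and that rate holds the expected general-population mortality rate
(covariate and prediction names are illustrative):

    stpm2 agegrp2 agegrp3 agegrp4, scale(hazard) df(6) bhazard(rate) cure
    predict curep, cure                 // predicted cure proportion
    predict s_unc, survival uncured     // survival of the "uncured"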
This is joint work with Sandra Eloranta and Paul W. Dickman (same
institution) and Paul C. Lambert (same institution and Department of Health
Sciences, University of Leicester).
Additional information
UKSUG10.Andersson.pptx
Simulation of the “forward-backward” multiple-imputation technique in a longitudinal clinical dataset
Catherine Welch
Department of Primary Care & Population Health, University College London
Most standard missing-data techniques have been designed for cross-sectional
data. A “forward-backward” multiple-imputation algorithm has
been developed to impute missing values in longitudinal data (Nevalainen,
Kenward, and Virtanen, 2009,
Statistics in Medicine 28:
3657–3669). This technique will be applied to The Health Improvement
Network (THIN), a longitudinal primary-care database, to impute variables
associated with incidence of cardiovascular disease (CVD).
A sample of 483 patients was extracted from THIN to test the performance of
the algorithm before it was applied to the whole dataset. This dataset
included individuals with information available on age, sex, deprivation
quintile, height, weight, systolic blood pressure, and total serum
cholesterol for each age from 65 to 69 years. CVD was identified if the
patient was diagnosed with one of a predefined list of conditions at any of
these ages; such patients were then considered to have CVD at each
subsequent age.
In this sample, measurements of weight, systolic blood pressure, and
cholesterol were replaced with missing values such that the probability that
data are missing decreases as age increases; i.e., the data are missing at
random and the overall percentage of missing data is equivalent to that in
THIN. We then applied the forward-backward algorithm, which
imputes values at each time point by using measurements before and after the
one of interest and updates values sequentially. Ten complete datasets were
created. A Poisson regression was performed using data in each dataset, and
estimates were combined using Rubin’s rules. These steps were repeated 200
times and the coefficients were averaged.
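The combination step uses official mi machinery; a sketch, assuming the
stacked completed datasets are in memory with an imputation index imp
(0 for the original data) and a patient identifier patid (all names
hypothetical):

    mi import flong, m(imp) id(patid) imputed(weight sbp chol)
    mi estimate: poisson cvd weight sbp chol i.sex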
I will explain in more detail how the forward-backward algorithm works and
also will demonstrate the results following multiple imputation using this
algorithm. I will compare these results with the analysis before data were
replaced with missing values and a complete case analysis to assess the
performance of the algorithm.
This is joint work with Irene Petersen (same institution) and James
Carpenter (Medical Statistics Unit, London School of Hygiene and Tropical
Medicine).
Additional information
UKSUG10.Welch.ppt
Thirty graphical tips Stata users should know
Nicholas J. Cox
Department of Geography, Durham University
Stata’s graphics were completely rewritten for Stata 8, with further
key additions in later versions. Its official commands have, as usual, been
supplemented by a variety of user-written programs. The resulting variety
presents even experienced users with a system that undeniably is large,
often appears complicated, and sometimes seems confusing. In this talk, I
provide a personal digest of graphics strategy and tactics for Stata users
emphasizing details large and small that, in my view, deserve to be known by
all.
Additional information
UKSUG10.Cox.zip
Mata, the missing manual
William Gould
StataCorp, College Station, Texas
Mata is Stata’s matrix programming language. StataCorp provides
detailed documentation on it, but so far has failed to give users—and
especially users who add new features to Stata—any guidance on when and
how to use the language. This talk provides what has been missing. In
practical ways, it shows how to include Mata code in Stata ado-files,
reveals when to include Mata code and when not to, and provides an
introduction to the broad concepts of Mata, the concepts that will make the
Mata Reference Manual approachable.
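As a flavor of what such guidance covers, here is a minimal, hypothetical
ado-file that delegates its computation to Mata:

    program meanvia_mata, rclass
        version 11
        syntax varname(numeric) [if] [in]
        marksample touse
        tempname mu
        // compute the mean in Mata on the marked sample
        mata: st_numscalar("`mu'", mean(st_data(., "`varlist'", "`touse'")))
        display as txt "mean of `varlist' = " as res `mu'
        return scalar mean = `mu'
    end

Typing meanvia_mata mpg after sysuse auto would then display the mean and
leave it in r(mean).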
Additional information
UKSUG10.Gould.pdf
Hunting for genes with longitudinal phenotype data using Stata
J. Charles Huber Jr.
Texas A&M Health Science Center School of Rural Public Health, College Station, Texas
Project Heartbeat! was a longitudinal study of metabolic and morphological
changes in adolescents aged 8–18 years and was conducted in the 1990s.
A study is currently being conducted to consider the relationship between a
collection of phenotypes (including BMI, blood pressure, and blood lipids) and
a panel of 1,500 candidate SNPs (single nucleotide polymorphisms).
Traditional genetics software such as PLINK and HelixTree lacks the ability
to model longitudinal phenotype data.
This talk will describe the use of Stata for a longitudinal genetic
association study, from the early stages of data checking (allele
frequencies and Hardy–Weinberg equilibrium), through modeling of individual
SNPs, the use of false discovery rates to control for the large number of
comparisons, exporting and importing data through PHASE for haplotype
reconstruction, and selection of tagSNPs in Stata, to the analysis of
haplotypes. We will also discuss strategies for scaling
up to an Illumina 100k SNP chip using Stata. All SNP and gene names will be
de-identified, because this is a work in progress.
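A sketch of the per-SNP modeling and FDR steps (all names hypothetical:
genotype variables snp*, coded 0/1/2, with repeated phenotype measurements
per subject id):

    tempname ph
    postfile `ph' str32 snp double p using snp_pvalues, replace
    foreach s of varlist snp* {
        quietly xtmixed bmi c.age i.`s' || id:   // longitudinal model per SNP
        quietly testparm i.`s'
        post `ph' ("`s'") (r(p))
    }
    postclose `ph'
    use snp_pvalues, clear
    qqvalue p, method(simes) qvalue(qval)        // FDR control (cf. qqvalue above)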
This is joint work with Michael Hallman, Ron Harrist, Victoria Friedel,
Melissa Richard, and Huandong Sun (same institution).
Additional information
UKSUG10.Huber.ppt
Haplotype analysis of case–control data using haplologit: New features
Yulia Marchenko
StataCorp, College Station, Texas
In haplotype-association studies, the risk of a disease is often determined
not only by the presence of certain haplotypes but also by their
interactions with various environmental factors. The detection of such
interactions with case–control data is a challenging task and often requires
very large samples. This prompted the development of more efficient
estimation methods for analyzing case–control genetic data. The haplologit
command implements efficient semiparametric methods, recently proposed in
the literature, for fitting haplotype-environment models in the very
important special cases of 1) a rare disease, 2) a single candidate gene in
Hardy–Weinberg equilibrium, and 3) independence of genetic and environmental
factors. In this presentation, I will describe new features of the
haplologit command.
Additional information
UKSUG10.Marchenko.pdf
DIY fractional polynomials
Patrick Royston
MRC Clinical Trials Unit, London
Fractional polynomial models are a simple yet very useful extension of
ordinary polynomials. They greatly increase the available range of
nonlinear functions and are often used in regression modeling, both in
univariate format (using Stata’s fracpoly command) and in multivariable
modeling (using mfp). The standard implementation in fracpoly supports a
wide range of single-equation regression models but cannot cope with the
more complex and varied syntaxes of other types of multi-equation models.
In this talk, I show that if you are willing to do some straightforward
do-file programming, you can apply fractional polynomials in a bespoke
manner to more complex Stata regression commands and get useful results. I
illustrate the approach in multilevel modeling of longitudinal fetal-size
data using xtmixed and in a seemingly unrelated regression analysis of a
dataset of academic achievement using sureg.
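A minimal sketch of the approach for xtmixed, assuming an FP2 model with
powers (0, 2) has already been chosen by comparing log likelihoods over the
usual FP power set (variable names hypothetical: fetal length length,
gestational age ga, fetus identifier id):

    generate double ga1 = ln(ga)   // FP power 0
    generate double ga2 = ga^2     // FP power 2
    xtmixed length ga1 ga2 || id: ga1 ga2, covariance(unstructured)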
Additional information
UKSUG10.Royston.ppt
Forecast evaluation with Stata
Robert A. Yaffee
Silver School of Social Work, New York University
Forecasters are expected to provide evaluations of their forecasts along
with the forecasts themselves. These assessments demonstrate comparative,
adequate, or optimal accuracy according to common forecasting criteria,
lending credence to the forecasts. To assist the Stata user in
this process, Robert Yaffee has written Stata programs to evaluate ARIMA and
GARCH models. He explains how these assessment programs are applied to
one-step-ahead and dynamic forecasts, ex post and ex ante
forecasts, conditional and unconditional forecasts, as well as combinations
of forecasts. In his presentation, he will also demonstrate how assessment
can be applied to rolling origin forecasts of time-series models.
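The programs themselves are not named in the abstract, but the flavor of
such an assessment can be sketched with official commands (model, dates, and
variable names are illustrative; data assumed tsset quarterly):

    arima y, arima(1,1,1)
    predict double yhat, y dynamic(tq(2009q1))   // dynamic forecast in levels
    generate double fe2 = (y - yhat)^2 if tin(2009q1, 2010q2)
    quietly summarize fe2
    display as txt "holdout RMSE = " as res sqrt(r(mean))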
Additional information
UKSUG10.Yaffee.pdf
An overview of meta-analysis in Stata
Jonathan Sterne
Department of Social Medicine, University of Bristol
Roger Harbord
Department of Social Medicine, University of Bristol
Ian White
MRC Biostatistics Unit, Cambridge
A comprehensive range of user-written commands for meta-analysis is
available in Stata and documented in detail in the recent book
Meta-Analysis in Stata (Sterne, ed., 2009, Stata Press). The purpose of this session
is to describe these commands, with a focus on recent developments and
areas in which further work is needed. We will define systematic reviews and
meta-analyses and will introduce the metan command, which is the main Stata
meta-analysis command. We will distinguish between meta-analyses of
randomized controlled trials and observational studies, and we will discuss the
additional complexities inherent in systematic reviews of the latter.
Meta-analyses are often complicated by heterogeneity, variation between the
results of different studies beyond that expected due to sampling variation
alone. Meta-regression, implemented in the metareg command, can be
used to explore reasons for heterogeneity, although its utility in medical
research is limited by the modest numbers of studies typically included in
meta-analyses and the many possible reasons for heterogeneity.
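As a hedged sketch of typical calls, assuming one row per study with an
effect estimate theta, its standard error se_theta, and a study-level
covariate latitude (all names hypothetical):

    metan theta se_theta, random label(namevar=study)
    metareg theta latitude, wsse(se_theta)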
Heterogeneity is a striking feature of meta-analyses of diagnostic-test
accuracy studies. We will describe how to use the midas and metandi
commands to display and meta-analyze the results of such
studies.
Many meta-analysis problems involve combining estimates of more than one
quantity: for example, treatment effects on different outcomes or contrasts
among more than two groups. Such problems can be tackled using
multivariate meta-analysis, implemented in the mvmeta command. We will
describe how the model is fit and when it may be superior to a set of
univariate meta-analyses. We will also illustrate its application in a
variety of settings.
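A sketch of the wide data layout mvmeta expects, as I understand its help
file (two estimates y1 and y2 per study, with variances and covariance
S11, S12, and S22):

    mvmeta y S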
Additional information
UKSUG10.Sterne.pdf
UKSUG10.White.ppt
UKSUG10.Harbord.pdf
Evaluating one-way and two-way cluster–robust covariance matrix estimates
Christopher F. Baum
Department of Economics, Boston College, Chestnut Hill, Massachusetts
In this presentation, I update Nichols and Schaffer’s 2007 UK Stata Users
Group talk on clustered standard errors. Although cluster–robust standard
errors are now recognized as essential in a panel-data context, official
Stata only supports clusters that are nested within panels. This
requirement rules out the possibility of defining clusters in the time
dimension and modeling contemporaneous dependence of panel units’
error processes. I build upon recent analytical developments that define
two-way (and, conceptually, n-way) clustering and upon the 2010
implementation of two-way clustering in the widely used ivreg2 and
xtivreg2 packages. I present examples of the utility of one-way and
two-way clustering using Monte Carlo techniques, I present a comparison with
alternative approaches to modeling error dependence, and I consider tests
for clustering of errors.
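For example, with hypothetical panel data identified by unit id and time
year, the 2010 ivreg2 syntax extends the familiar one-way call to two
dimensions:

    ivreg2 y x1 x2, cluster(id)        // one-way: cluster on panel unit
    ivreg2 y x1 x2, cluster(id year)   // two-way: cluster on unit and time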
This is joint work with Mark E. Schaffer (Heriot-Watt University) and Austin
Nichols (Urban Institute).
Additional information
UKSUG10.Baum.pdf
An introduction to matching methods for causal inference and their implementation in Stata
Barbara Sianesi
Institute for Fiscal Studies, London
Matching, especially in its propensity-score flavors, has become an
extremely popular evaluation method. Matching is, in fact, the
best available method for selecting a matched (or reweighted) comparison
group that looks like the treatment group of interest.
In this talk, I will introduce matching methods within the general problem
of causal inference, highlight their strengths and weaknesses, and offer a
brief overview of different matching estimators. Using psmatch2, I will
then step through a practical example in Stata based on real data, showing
how to implement some of these estimators and highlighting a number of
implementation issues.
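A minimal psmatch2 call in the spirit of that example (variable names
hypothetical):

    * 1-to-1 nearest-neighbor matching on the propensity score,
    * restricted to the region of common support
    psmatch2 treated age educ married, outcome(earnings) neighbor(1) common
    pstest age educ married, both      // covariate balance before and after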
Additional information
UKSUG10.Sianesi.pdf
UKSUG10.Sianesi.zip
Report to users, followed by Wishes and grumbles
William Gould
StataCorp, College Station, Texas
William Gould, as President of StataCorp and Chief of Development, will
report on StataCorp activity over the last year. This will morph into the
traditional voicing from the audience of users’ wishes and grumbles
regarding Stata.
Scientific organizers
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.