Last updated: 8 December 2008
2008 UK Stata Users Group meeting
8–9 September 2008
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Stata’s mishandling of missing data: A problem and two solutions
Kenneth I. MacDonald
Nuffield College, University of Oxford
The design decisions made by Stata in handling missing data in relational
and logical expressions have, for the user, complex, pernicious, and poorly
understood consequences. This presentation intends to substantiate that
claim and to present two possible resolutions to the problem.
As is well documented and reasonably well known, Stata considers p & q
(and p | q) to be true when both p and q are indeterminate. This
interpretation is counterintuitive and at odds with the
formal-logic definition of these operators. To assert two unknowns is not to
assert truth. Nevertheless, introductions to Stata characteristically
present this as merely a “feature” and suggest that the
obligation imposed on users (us) to explicitly test for missing data is
straightforwardly implementable. Simple cases are indeed simple but, it
will be argued, do not readily scale up to complex, real-life instances.
For example, the one-line Stata command to implement the intention,
. generate v = p|q
becomes
. generate v = p|q if !mi(p,q)|(p&!mi(p))|(q&!mi(q))
And so forth. Such coding is a problem, not a feature—so solutions should
be sought.
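A minimal illustration of the behavior at issue, assuming Stata’s usual
treatment of missing as a very large (and hence nonzero, “true”) value:
each of the following commands displays 1.
. display . & .
. display 0 | .
. display 5 < .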
One solution (really a work-around) introduces my command, validly, which
allows expressions such as
. validly generate v = p|q
and correctly, without fuss, interprets the logical or relational operators
(here returning true if p is true but q indeterminate, and indeterminate if
p is false but q indeterminate). More generally, validly serves as a
“wrapper” for any standard conditional command. So, for example,
. validly reg a b c if p|q
is handled correctly. But validly (its code deploys nested calls to cond())
is computationally expensive.
The better resolution would be for Stata, in its next release, to redesign
its core code so that logical and relational operators would (as algebraic
operators currently do) handle missing data appropriately. (Objections to
this strategy are examined and deemed to lack force.) I would like to enlist
the informed and active judgment of the participants of the 14th Users Group
meeting to help bring this about.
Additional information
KIMacD.presentation.ppt
Robust statistics using Stata
Vincenzo Verardi
University of Brussels and University of Namur
In regression and multivariate analysis, the presence of outliers in the
dataset can strongly distort classical estimations and lead to unreliable
results. To deal with this, several robust-to-outliers methods have been
proposed in the statistical literature. In Stata, some of these methods are
available through the commands rreg and qreg for robust regression and
hadimvo for multivariate outlier identification.
Unfortunately, these methods only resist some specific types of outliers and
turn out to be ineffective under alternative scenarios. In this
presentation, after illustrating the drawbacks of the available methods, we
present more effective robust estimators that we implemented in Stata. We
also present a graphical tool that allows users to recognize the type of
existing outliers in regression analysis.
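For reference, a minimal sketch of the existing tools named above, run on
the auto dataset shipped with Stata (the variable choices and the p()
cutoff are illustrative only):
. sysuse auto, clear
. rreg price weight length
. qreg price weight length
. hadimvo price weight length, gen(odd) p(.01)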
Additional information
VerardiRobustStatisticsinStata.pdf
Creating fancy maps and pie charts using Google API charts
Lionel Page and Franz Buscha
University of Westminster
Google has recently developed a new tool to allow users to create Google-like
charts and maps: the Google Application Programming Interface (API) chart. This
tool allows the user to create custom PNG pictures by sending a request
with the appropriate syntax over the web. Here we present two Stata
programs that make it
possible for users to create .png figures using the Google API chart
directly from Stata and to download the figures to the current directory:
- gmap allows the user to create thematic mapping of the world or
one of its continents.
- gpie allows the user to create pie charts in color and in 2D or 3D.
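To give a flavor of the mechanism these programs automate, the sketch
below hand-builds a chart request and fetches the resulting PNG with
Stata’s copy command; the URL parameters follow the public Google Chart
API syntax, and the data values and file name are made up.
. copy "http://chart.apis.google.com/chart?cht=p3&chs=400x200&chd=t:60,40&chl=Men|Women" mypie.png, replace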
Additional information
buscha.ppt
Between tables and graphs: A tutorial
Nicholas J. Cox
Durham University
The display of data or of results often entails the preparation of a variety
of table-like graphs showing both text labels and numeric values. I will
present basic techniques, tips, and tricks using both official Stata and
various user-written commands. The main message is that whenever the
graph bar, graph dot, or graph box commands fail to give what you want,
you can knit your own customized displays using twoway as a general
framework.
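As a hedged sketch of this approach (the dataset, variables, and layout
are illustrative, not taken from the talk), summarize a variable by group
with statsby and then lay out the results with twoway scatter, using
marker labels as the text column:
. sysuse auto, clear
. statsby mean=r(mean), by(rep78) clear: summarize mpg
. generate row = _n
. twoway scatter row mean, mlabel(rep78) mlabposition(9) ylabel(none) ytitle("") xtitle("Mean mpg")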
Additional information
Cox.london08.zip
Prediction for multilevel models
Sophia Rabe-Hesketh
University of California–Berkeley
This presentation focuses on predicted probabilities for multilevel models
for dichotomous or ordinal responses. In a three-level model, for instance,
with patients nested in doctors nested in hospitals, predictions for
patients could be for new or existing doctors and, in the latter case, for
new or existing hospitals. In a new version of gllamm, these different
types of predicted probabilities can be obtained very easily. I
will give examples of graphs that can be used to help interpret an estimated
model. I will also introduce a little program I’ve written to
construct 95% confidence intervals for predicted probabilities.
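For orientation, a sketch of the kind of call involved, using gllamm’s
companion prediction command gllapred; the model and variable names here
are invented, and the options shown (mu for predicted means or
probabilities, marginal to integrate over the random effects) are those
documented in the GLLAMM manual.
. gllamm improved treatment, i(doctor hospital) link(logit) family(binom)
. gllapred pmarg, mu marginal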
Additional information
Sophia.predict5.pdf
Sophia.predict5.do
antibiotics.dta
margprob.dta
postprob.dta
Tricks of the trade: Getting the most out of xtmixed
Roberto G. Gutierrez
StataCorp
Stata’s xtmixed command can be used to fit mixed models, models that
contain both fixed and random effects. The fixed effects are merely the
coefficients from a standard linear regression. The random effects are not
directly estimated but summarized by their variance components, which are
estimated from the data. As such, xtmixed is typically used to incorporate
complex and multilevel random-effects structures into standard linear
regression. xtmixed’s syntax is complex but versatile, allowing it to be
used widely, even for situations that do not fit the classical
“mixed” framework. In this talk, I will give a tutorial on uses of
xtmixed not commonly considered, including examples of heteroskedastic
errors, group structures on random effects, and smoothing via penalized
splines.
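Two hedged sketches of such uses, based on datasets from the Stata
manuals (the residuals() option in the second assumes Stata 10 or later):
a random slope for week on the pig data, and group-specific residual
variances on the NLS data.
. webuse pig, clear
. xtmixed weight week || id: week, covariance(unstructured)
. webuse nlswork, clear
. xtmixed ln_wage grade age || idcode:, residuals(independent, by(race))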
Additional information
gutierrez_mixed.pdf
parmest and extensions
Roger Newson
Imperial College London
The parmest package creates output datasets (or results sets) with one
observation for each of a set of estimated parameters, and data on the
parameter estimates, standard errors, degrees of freedom, t or z
statistics, p-values, confidence limits, and other parameter attributes
specified by the user. It is especially useful when parameter estimates
are “mass-produced”, as in a genome scan. Versions of the package have
existed on SSC since 1998, when it contained the single command parmest.
However, the package has since been extended with additional commands. The
metaparm command allows the user to mass-produce confidence intervals for
linear combinations of uncorrelated parameters. Examples include confidence
intervals for a weighted arithmetic or geometric mean parameter in a
meta-analysis, or for differences or ratios between parameters, or for
interactions, defined as differences (or ratios) between differences. The
parmcip command is a lower-level utility, inputting variables containing
estimates, standard errors, and degrees of freedom, and outputting
variables containing confidence limits and p-values. As an example, we can
input genotype frequencies and calculate confidence intervals for geometric
mean homozygote/heterozygote ratios for genetic polymorphisms, measuring
the size and direction of departures from Hardy–Weinberg equilibrium.
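A minimal sketch of the basic parmest workflow, using the auto data (the
file name is illustrative); metaparm and parmcip build on the same
results-set structure.
. sysuse auto, clear
. regress mpg weight foreign
. parmest, saving(myparms.dta, replace)
. use myparms.dta, clear
. list parm estimate min95 max95 p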
Additional information
newson_ohp1.pdf
How do I do that in Stata?
Brains trust
This will be an opportunity to ask a panel of experts how to
do something in Stata. If they are stumped, odds are that the audience will
be able to suggest something. Bring your problems, generic or specific.
Semiparametric analysis of case–control genetic data in
the presence of environmental factors
Yulia Marchenko
StataCorp
In the past decade, many statistical methods have been proposed for the
analysis of case–control genetic data with an emphasis on
haplotype-based disease association studies. Most of the methodology has
concentrated on the estimation of genetic (haplotype) main effects. Most
methods accounted for environmental and gene-environment interaction effects
by utilizing prospective-type analyses that may lead to biased estimates
when used with case–control data. Several recent publications
addressed the issue of retrospective sampling in the analysis of
case–control genetic data in the presence of environmental factors by
developing new efficient semiparametric statistical methods. I present the
new Stata command, haplologit, which implements efficient
profile-likelihood semiparametric methods for fitting gene–environment
models in the very important special cases of a) a rare disease, b) a
single candidate gene in Hardy–Weinberg equilibrium, and c)
independence of genetic and environmental factors.
Additional information
london08_yulia.pdf
The exploration of metabolic systems using Stata
Ray Boston
University of Pennsylvania
Considerable headway has been made over the last 20 or 30 years in
isolating points of failure in human energy metabolism using metabolic
models of challenge data. These models are almost always differential in
form, second-order (or higher), nonlinear, and involve both estimated and
observed metabolite concentrations. As such, they usually lie outside the
scope of statistical modeling software packages. In this
presentation, we demonstrate novel methods for solving and fitting these
models to challenge data using Stata, and we illustrate techniques for
deriving useful clinical indices such as insulin resistance.
Multiple imputation for household surveys: A comparison of methods
Rodrigo A. Alfaro and Marcelo E. Fuenzalida
Central Bank of Chile
We discuss empirical applications of imputation methods for missing data.
Our results are based on Chilean household surveys using three methods of
proper imputation.
Additional information
Alfaro_London_v3.ppt
Using Mata to work more effectively with Stata: A tutorial
Christopher F. Baum
Boston College
Stata’s matrix language, Mata, highlighted in Bill Gould’s Mata Matters
columns in the Stata Journal, is very useful and powerful in its
interactive mode. Stata users who write do-files
or ado-files should gain an understanding of the Stata–Mata interface: how
Mata may be called upon to do one or more tasks and return its results to
Stata. Mata’s broad and extensible menu of functions offers assistance
with many programming tasks, including many that are not matrix-oriented.
In this tutorial, I will present examples of how do-file and ado-file writers
might effectively use Mata in their work.
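A minimal sketch of the round trip being described: read Stata variables
into Mata, compute there, and post the result back to Stata (the variable
and matrix names are illustrative only).
. sysuse auto, clear
. mata:
: X = st_data(., ("price", "weight", "length"))
: M = mean(X)
: st_matrix("means", M)
: end
. matrix list means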
Additional information
baum_StataMata.beamer.UKSUG14.pdf
PanelWhiz
John P. Haisken-De New
RWI Essen
PanelWhiz is a collection of Stata add-on scripts that make using panel
datasets easier. It is designed for empirically minded economists,
sociologists, political scientists, and demographers and allows the user to
select vectors of variables at once. Matching and merging are done
automatically. It allows items to be stored as project classes (modules).
Modules can be edited and appended. PanelWhiz allows self-documenting panel
retrievals to be made at the click of a button and easy data cleaning of the
selected items for time consistency with PanelWhiz “plugins”. It
also easily exports any Stata data to SAS, SPSS, LIMDEP, GAUSS, or MS Excel.
Trilinear plots and some alternatives
Nicholas J. Cox
Durham University
Data on three proportions, probabilities, or fractions that add to 1 can be
projected from a simplex to the plane and represented in a two-dimensional
plot, commonly known as trilinear, triaxial, triangular, etc. The Stata
program triplot has been available for some years as one way to
produce such plots. In this talk, I preview a new version of this program
together with a variety of other plots based on transformations of the
underlying variables.
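A minimal sketch of the input triplot expects, namely three shares
summing to 1, using simulated data (the basic varlist syntax is that of
the SSC version; install with ssc install triplot):
. clear
. set obs 50
. set seed 20080908
. generate a = runiform()
. generate b = runiform()*(1 - a)
. generate c = 1 - a - b
. triplot a b c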
Additional information
Cox.london082.zip
Bivariate kernel regression
Lionel Page
University of Westminster
In recent years, more Stata programs have become available for
nonparametric regression. The commands mrunning and mlowess
make it possible to perform nonparametric regression over several
dimensions. These techniques, however, impose separable additivity of the
effects of different regressors. In some situations, this condition may be
undesirable. To allow for a fully flexible nonparametric regression, I
wrote a program to perform a kernel regression over two regressors. It is
the natural extension of kernreg to the bivariate case.
Tools for spatial density estimation
Maurizio Pisati
University of Milano–Bicocca
The purpose of this talk is to illustrate the main features and applications
of two new Stata programs for spatial density estimation: spgrid and
spkde. The spgrid program generates two-dimensional arrays of evenly
spaced points spanning any regular or irregular study region specified by
the user. In turn, the spkde program carries out spatial kernel density
estimation based on the reference points generated by spgrid.
Additional information
PisatiDensityestimationforspatialdata.pdf
Scientific organizers
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.