Last updated: 8 December 2008
2008 UK Stata Users Group meeting
8–9 September 2008
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Stata’s mishandling of missing data: A problem and two solutions
Kenneth I. MacDonald
Nuffield College, University of Oxford
The design decisions made by Stata in handling missing data in relational
and logical expressions have, for the user, complex, pernicious, and poorly
understood consequences. This presentation intends to substantiate that
claim and to present two possible resolutions to the problem.
As is well documented and reasonably well known, Stata considers p & q
(and p | q) to be true when both p and q are indeterminate. This
interpretation is counterintuitive and at odds with the
formal-logic definition of these operators. To assert two unknowns is not to
assert truth. Nevertheless, introductions to Stata characteristically
present this as merely a “feature” and suggest that the
obligation imposed on users (us) to explicitly test for missing data is
straightforwardly implementable. Simple cases are indeed simple but, it
will be argued, do not readily scale up to complex, real-life instances.
For example, the one-line Stata command to implement the intention,
. generate v = p|q
becomes
. generate v = p|q if !mi(p,q)|(p&!mi(p))|(q&!mi(q))
And so forth. Such coding is a problem, not a feature—so solutions should
be sought.
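A minimal illustration of the behavior at issue, assuming Stata’s usual
treatment of missing as a very large (and hence nonzero, “true”) value:
each of the following commands displays 1.
. display . & .
. display 0 | .
. display 5 < .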
One solution (really a work-around) introduces my command, validly, which
allows expressions such as
. validly generate v = p|q
and correctly, without fuss, interprets the logical or relational operators
(here returning true if p is true but q indeterminate, and indeterminate if
p is false but q indeterminate). More generally, validly serves as a
“wrapper” for any standard conditional command. So, for example,
. validly reg a b c if p|q
is handled correctly. But validly (its code deploys nested calls to cond())
is computationally expensive.
The better resolution would be for Stata, in its next release, to redesign
its core code so that logical and relational operators would (as algebraic
operators currently do) handle missing data appropriately. (Objections to
this strategy are examined and deemed to lack force.) I would like to enlist
the informed and active judgment of the participants of the 14th Users Group
meeting to help bring this about.
Additional information
KIMacD.presentation.ppt
Robust statistics using Stata
Vincenzo Verardi
University of Brussels and University of Namur
In regression and multivariate analysis, the presence of outliers in the
dataset can strongly distort classical estimations and lead to unreliable
results. To deal with this, several robust-to-outliers methods have been
proposed in the statistical literature. In Stata, some of these methods are
available through the commands rreg and qreg for robust regression and
hadimvo for multivariate outlier identification.
Unfortunately, these methods only resist some specific types of outliers and
turn out to be ineffective under alternative scenarios. In this
presentation, after illustrating the drawbacks of the available methods, we
present more effective robust estimators that we implemented in Stata. We
also present a graphical tool that allows users to recognize the type of
existing outliers in regression analysis.
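For reference, a minimal sketch of the existing tools named above, run on
the auto dataset shipped with Stata (the variable choices and the p()
cutoff are illustrative only):
. sysuse auto, clear
. rreg price weight length
. qreg price weight length
. hadimvo price weight length, gen(odd) p(.01)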
Additional information
VerardiRobustStatisticsinStata.pdf
Creating fancy maps and pie charts using Google API charts
Lionel Page and Franz Buscha
University of Westminster
Google has recently developed a new tool to allow users to create Google-like
charts and maps: the Google Application Programming Interface (API) chart. This
tool allows the user to create custom PNG pictures by sending a request
with the appropriate syntax over the web. Here we present two Stata
programs that make it
possible for users to create .png figures using the Google API chart
directly from Stata and to download the figures to the current directory:
- gmap allows the user to create thematic mapping of the world or
one of its continents.
- gpie allows the user to create pie charts in color and in 2D or 3D.
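To give a flavor of the mechanism these programs automate, the sketch
below hand-builds a chart request and fetches the resulting PNG with
Stata’s copy command; the URL parameters follow the public Google Chart
API syntax, and the data values and file name are made up.
. copy "http://chart.apis.google.com/chart?cht=p3&chs=400x200&chd=t:60,40&chl=Men|Women" mypie.png, replace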
Additional information
buscha.ppt
Between tables and graphs: A tutorial
Nicholas J. Cox
Durham University
The display of data or of results often entails the preparation of a variety
of table-like graphs showing both text labels and numeric values. I will
present basic techniques, tips, and tricks using both official Stata and
various user-written commands. The main message is that whenever the
graph bar, graph dot, or graph box commands fail to give what you want,
you can knit your own customized displays using twoway as a general
framework.
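As a hedged sketch of this approach (the dataset, variables, and layout
are illustrative, not taken from the talk), summarize a variable by group
with statsby and then lay out the results with twoway scatter, using
marker labels as the text column:
. sysuse auto, clear
. statsby mean=r(mean), by(rep78) clear: summarize mpg
. generate row = _n
. twoway scatter row mean, mlabel(rep78) mlabposition(9) ylabel(none) ytitle("") xtitle("Mean mpg")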
Additional information
Cox.london08.zip
Prediction for multilevel models
Sophia Rabe-Hesketh
University of California–Berkeley
This presentation focuses on predicted probabilities for multilevel models
for dichotomous or ordinal responses. In a three-level model, for instance,
with patients nested in doctors nested in hospitals, predictions for
patients could be for new or existing doctors and, in the latter case, for
new or existing hospitals. In a new version of gllamm, these different
types of predicted probabilities can be obtained very easily. I
will give examples of graphs that can be used to help interpret an estimated
model. I will also introduce a little program I’ve written to
construct 95% confidence intervals for predicted probabilities.
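For orientation, a sketch of the kind of call involved, using gllamm’s
companion prediction command gllapred; the model and variable names here
are invented, and the options shown (mu for predicted means or
probabilities, marginal to integrate over the random effects) are those
documented in the GLLAMM manual.
. gllamm improved treatment, i(doctor hospital) link(logit) family(binom)
. gllapred pmarg, mu marginal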
Additional information
Sophia.predict5.pdf
Sophia.predict5.do
antibiotics.dta
margprob.dta
postprob.dta
Tricks of the trade: Getting the most out of xtmixed
Roberto G. Gutierrez
StataCorp
Stata’s xtmixed command can be used to fit mixed models, models that
contain both fixed and random effects. The fixed effects are merely the
coefficients from a standard linear regression. The random effects are not
directly estimated but summarized by their variance components, which are
estimated from the data. As such, xtmixed is typically used to incorporate
complex and multilevel random-effects structures into standard linear
regression. xtmixed’s syntax is complex but versatile, allowing it to be
used widely, even for situations that do not fit the classical
“mixed” framework. In this talk, I will give a tutorial on uses of
xtmixed not commonly considered, including examples of heteroskedastic
errors, group structures on random effects, and smoothing via penalized
splines.
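Two hedged sketches of such uses, based on datasets from the Stata
manuals (the residuals() option in the second assumes Stata 10 or later):
a random slope for week on the pig data, and group-specific residual
variances on the NLS data.
. webuse pig, clear
. xtmixed weight week || id: week, covariance(unstructured)
. webuse nlswork, clear
. xtmixed ln_wage grade age || idcode:, residuals(independent, by(race))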
Additional information
gutierrez_mixed.pdf
parmest and extensions
Roger Newson
Imperial College London
The parmest package creates output datasets (or results sets) with one
observation for each of a set of estimated parameters, and data on the
parameter estimates, standard errors, degrees of freedom, t or z
statistics, p-values, confidence limits, and other parameter attributes
specified by the user. It is especially useful when parameter estimates
are “mass-produced”, as in a genome scan. Versions of the package have
existed on SSC since 1998, when it contained the single command parmest.
However, the package has since been extended with additional commands. The
metaparm command allows the user to mass-produce confidence intervals for
linear combinations of uncorrelated parameters. Examples include confidence
intervals for a weighted arithmetic or geometric mean parameter in a
meta-analysis, or for differences or ratios between parameters, or for
interactions, defined as differences (or ratios) between differences. The
parmcip command is a lower-level utility, inputting variables containing
estimates, standard errors, and degrees of freedom, and outputting
variables containing confidence limits and p-values. As an example, we can
input genotype frequencies and calculate confidence intervals for geometric
mean homozygote/heterozygote ratios for genetic polymorphisms, measuring
the size and direction of departures from Hardy–Weinberg equilibrium.
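A minimal sketch of the basic parmest workflow, using the auto data (the
file name is illustrative); metaparm and parmcip build on the same
results-set structure.
. sysuse auto, clear
. regress mpg weight foreign
. parmest, saving(myparms.dta, replace)
. use myparms.dta, clear
. list parm estimate min95 max95 p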
Additional information
newson_ohp1.pdf
How do I do that in Stata?
Brains trust
This will be an opportunity to ask a panel of experts how to
do something in Stata. If they are stumped, odds are that the audience will
be able to suggest something. Bring your problems, generic or specific.
Semiparametric analysis of case–control genetic data in
the presence of environmental factors
Yulia Marchenko
StataCorp
In the past decade, many statistical methods have been proposed for the
analysis of case–control genetic data with an emphasis on
haplotype-based disease association studies. Most of the methodology has
concentrated on the estimation of genetic (haplotype) main effects. Most
methods accounted for environmental and gene-environment interaction effects
by utilizing prospective-type analyses that may lead to biased estimates
when used with case–control data. Several recent publications
addressed the issue of retrospective sampling in the analysis of
case–control genetic data in the presence of environmental factors by
developing new efficient semiparametric statistical methods. I present the
new Stata command, haplologit, which implements efficient
profile-likelihood semiparametric methods for fitting gene–environment
models in the very important special cases of a) a rare disease, b) a
single candidate gene in Hardy–Weinberg equilibrium, and c)
independence of genetic and environmental factors.
Additional information
london08_yulia.pdf
The exploration of metabolic systems using Stata
Ray Boston
University of Pennsylvania
Considerable headway has been made over the last 20 or 30 years in
isolating points of failure in human energy metabolism using metabolic
models of challenge data. These models are almost always differential in
form, second-order (or higher), nonlinear, and involve both estimated and
observed metabolite concentrations. As such, they usually lie outside the
scope of statistical modeling software packages. In this
presentation, we demonstrate novel methods for solving and fitting these
models to challenge data using Stata, and we illustrate techniques for
deriving useful clinical indices such as insulin resistance.
Multiple imputation for household surveys: A comparison of methods
Rodrigo A. Alfaro and Marcelo E. Fuenzalida
Central Bank of Chile
We discuss empirical applications of imputation methods for missing data.
Our results are based on Chilean household surveys using three methods of
proper imputation.
Additional information
Alfaro_London_v3.ppt
Using Mata to work more effectively with Stata: A tutorial
Christopher F. Baum
Boston College
Stata’s matrix language, Mata, highlighted in Bill Gould’s Mata Matters
columns in the Stata Journal, is very useful and powerful in its
interactive mode. Stata users who write do-files
or ado-files should gain an understanding of the Stata–Mata interface: how
Mata may be called upon to do one or more tasks and return its results to
Stata. Mata’s broad and extensible menu of functions offers assistance
with many programming tasks, including many that are not matrix-oriented.
In this tutorial, I will present examples of how do-file and ado-file writers
might effectively use Mata in their work.
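A minimal sketch of the round trip being described: read Stata variables
into Mata, compute there, and post the result back to Stata (the variable
and matrix names are illustrative only).
. sysuse auto, clear
. mata:
: X = st_data(., ("price", "weight", "length"))
: M = mean(X)
: st_matrix("means", M)
: end
. matrix list means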
Additional information
baum_StataMata.beamer.UKSUG14.pdf
PanelWhiz
John P. Haisken-De New
RWI Essen
PanelWhiz is a collection of Stata add-on scripts that make using panel
datasets easier. It is designed for empirically minded economists,
sociologists, political scientists, and demographers and allows the user to
select vectors of variables at once. Matching and merging are done
automatically. It allows items to be stored as project classes (modules).
Modules can be edited and appended. PanelWhiz allows self-documenting panel
retrievals to be made at the click of a button and easy data cleaning of the
selected items for time consistency with PanelWhiz “plugins”. It
also easily exports any Stata data to SAS, SPSS, LIMDEP, GAUSS, or MS Excel.
Trilinear plots and some alternatives
Nicholas J. Cox
Durham University
Data on three proportions, probabilities, or fractions that add to 1 can be
projected from a simplex to the plane and represented in a two-dimensional
plot, commonly known as trilinear, triaxial, triangular, etc. The Stata
program triplot has been available for some years as one way to
produce such plots. In this talk, I preview a new version of this program
together with a variety of other plots based on transformations of the
underlying variables.
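A minimal sketch of the input triplot expects, namely three shares
summing to 1, using simulated data (the basic varlist syntax is that of
the SSC version; install with ssc install triplot):
. clear
. set obs 50
. set seed 20080908
. generate a = runiform()
. generate b = runiform()*(1 - a)
. generate c = 1 - a - b
. triplot a b c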
Additional information
Cox.london082.zip
Bivariate kernel regression
Lionel Page
University of Westminster
In recent years, more Stata programs have become available for
nonparametric regression. The commands mrunning and mlowess
make it possible to perform nonparametric regression over several
dimensions. These techniques, however, impose separable additivity of the
effects of different regressors. In some situations, this condition may be
undesirable. To allow for a fully flexible nonparametric regression, I
wrote a program to perform a kernel regression over two regressors. It is
the natural extension of kernreg to the bivariate case.
Tools for spatial density estimation
Maurizio Pisati
University of Milano–Bicocca
The purpose of this talk is to illustrate the main features and applications
of two new Stata programs for spatial density estimation: spgrid and
spkde. The spgrid program generates two-dimensional arrays of evenly
spaced points spanning any regular or irregular study region specified by
the user. In turn, the spkde program carries out spatial kernel density
estimation based on the reference points generated by spgrid.
Additional information
PisatiDensityestimationforspatialdata.pdf
Scientific organizers
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.