Last updated: 9 October 2009
2009 UK Stata Users Group meeting
10–11 September
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Selection endogenous dummy ordered probit and selection endogenous dummy dynamic ordered probit models
Massimiliano Bratti
University of Milan
Alfonso Miranda
Institute of Education, University of London
In this presentation we define two qualitative response models: 1) a Selection
Endogenous Dummy Ordered Probit (SED-OP) model; 2) a Selection Endogenous
Dummy Dynamic Ordered Probit (SED-DOP) model. The SED-OP model is a
three-equation model consisting of an endogenous dummy equation, a selection
equation, and a main equation with an ordinal response. The main feature of
the model is that the endogenous dummy enters both the selection equation and
the main equation. The dynamic SED-DOP model allows both the selection
equation and the ordered equation to be dynamic by including lagged individual
behaviour. Initial conditions are properly accounted for, and free correlation
among the unobservables entering each of the three equations is allowed. We
show how these models can be estimated in Stata using maximum simulated
likelihood.
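As a rough schematic of the three-equation structure (notation ours, inferred from the description above rather than taken from the authors' materials):

$$
\begin{aligned}
d_i^{*} &= \mathbf{z}_i'\boldsymbol{\gamma} + u_i, & d_i &= \mathbf{1}(d_i^{*} > 0) && \text{(endogenous dummy)}\\
s_i^{*} &= \mathbf{w}_i'\boldsymbol{\delta} + \alpha d_i + v_i, & s_i &= \mathbf{1}(s_i^{*} > 0) && \text{(selection)}\\
y_i^{*} &= \mathbf{x}_i'\boldsymbol{\beta} + \lambda d_i + \varepsilon_i, & y_i &= j \ \text{if} \ \kappa_{j-1} < y_i^{*} \le \kappa_j && \text{(ordered outcome; observed only if } s_i = 1\text{)}
\end{aligned}
$$

with $(u_i, v_i, \varepsilon_i)$ jointly normal with unrestricted correlations; the simulated likelihood integrates over these correlated unobservables.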
Additional information
uk09_bratti_miranda.pdf
Robust principal component analysis in Stata
Vincenzo Verardi
University of Brussels and University of Namur
In data analysis, when some observations are outlying in one or several
dimensions, principal component analysis (PCA) is distorted and may lead to
questionable results. I therefore propose a simple solution to this
problem: a short ado-file based on a robust estimation of the covariance
matrix. To illustrate the importance of this type of approach, I present a
PCA based on the variables used to rank universities according to academic
excellence (as measured by scores in the Shanghai ARWU ranking).
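A minimal sketch of this approach (not Verardi's ado-file itself): obtain a robust correlation matrix from some robust estimator, then hand it to official Stata's pcamat. The matrix R and the variable names here are placeholders.

    * Sketch: PCA on a robustly estimated correlation matrix.
    * Assume R holds a robust correlation estimate of x1-x5
    * (e.g., derived from a minimum covariance determinant fit).
    pcamat R, n(100) names(x1 x2 x3 x4 x5)   // n() = sample size behind R
    screeplot                                // inspect the eigenvalues

Because R downweights outlying observations, the leading components are no longer pulled toward them.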
Additional information
uk09_verardi.ppt
Three models for combining information from causal indicators
Maarten Buis
University of Tuebingen
Sometimes we have multiple measures of the same concept. Combining the
information of these multiple measures would allow us to improve the
measurement. When combining the information from different indicators, one
needs to distinguish between two types of relationships between the observed
indicators and the underlying latent variable: either the latent variable
influences the indicators or the indicators influence the latent variable.
To distinguish between these two situations, some authors, following Bollen
(Quality and Quantity, 1984) and Bollen and Lennox (Psychological Bulletin,
1991), call the observed variables “effect indicators” when they are
influenced by the latent variable, and “causal indicators” when they
influence the latent variable.
Distinguishing between these two is important as they require very different
strategies for recovering the latent variable. In a basic (exploratory)
factor analysis, which is a model for effect indicators, one assumes that
the only thing that the observed variables have in common is the latent
variable, so any correlation between the observed variables must be due to
the latent variable, and it is this correlation that is used to recover the
latent variable. In the models for causal indicators that I will discuss in
this talk, I assume that the latent variable is a weighted sum of the
observed variables (and optionally an error term), and the weights are
estimated such that they are optimal for predicting the dependent variable.
The three models for dealing with causal indicators that will be discussed
are a model with “sheaf coefficients” (Heise, Sociological Methods &
Research, 1972), a model with “parametrically weighted covariates”
(Yamaguchi, Sociological Methodology, 2002), and a multiple indicators and
multiple causes (MIMIC) model (Hauser and Goldberger, Sociological
Methodology, 1971). The latter two can be estimated using propcnsreg, while
the former can be estimated using sheafcoef. Both commands are available
from the SSC archive.
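As a hedged illustration of the sheaf-coefficient setup (the call pattern below is from memory of the SSC help file, so check help sheafcoef; variable names are hypothetical):

    * Sketch: y regressed on a block of causal indicators (x1-x3)
    * plus a control z; sheafcoef then replaces the block with one
    * standardized latent variable whose weights best predict y.
    regress y x1 x2 x3 z
    sheafcoef, latent(x1 x2 x3)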
Additional information
uk09_buis.pdf
To the vector belong the spoils: Circular statistics in Stata
Nicholas J. Cox
Durham University
Circular statistics are needed when one or more variables have outcome space
on the circle, as is true, for example, of data measured with reference to
compass, clock, or calendar. Applications abound in the earth and
environmental sciences, not to mention the economic and medical fields well
represented among Stata users, and in other disciplines such as music.
Previous talks on circular statistics were given at the UK Stata Users Group
meetings in 1997 and 2004. This update will survey the field with special reference to
recently revised or newly written programs for graphics, summary, testing,
and modeling.
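As a from-scratch taste of the arithmetic (official Stata only, not one of the programs in the talk): the circular mean of directions in degrees is the direction of the mean resultant vector.

    * Sketch: circular mean of a variable "direction" in degrees.
    generate double s = sin(direction*_pi/180)
    generate double c = cos(direction*_pi/180)
    summarize s, meanonly
    scalar S = r(mean)
    summarize c, meanonly
    scalar C = r(mean)
    display "circular mean (degrees) = " mod(atan2(S, C)*180/_pi + 360, 360)
    display "mean resultant length   = " sqrt(S^2 + C^2)

The mean resultant length lies between 0 (directions fully dispersed) and 1 (fully concentrated), which is why ordinary arithmetic means fail on such data.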
Exporting and importing Stata genotype data to and from PHASE and HaploView
Chuck Huber
Texas A&M University
Genetic association studies often explore the relationship between diseases
and haplotypes, collections of contiguous genetic markers located on the
same chromosome. Haplotypes are usually not observed directly but are
inferred statistically using a variety of algorithms. One of the most
popular haplotype-inference programs is PHASE, and one of the most popular
programs for examining characteristics of the resulting haplotypes is
HaploView. I have developed a set of Stata commands for
exporting genotype data from Stata into PHASE, importing the resulting
haplotypes back into Stata for association analysis, and exporting the
haplotype data from Stata into HaploView.
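The heart of the export step is plain-text file writing, which Stata's file commands handle directly. A rough sketch of the pattern for a PHASE-style input file (the exact layout must be checked against the PHASE documentation; the variables id and a1_1-a2_5, holding the two alleles at each of five loci, are hypothetical):

    * Sketch: write a PHASE-style input file (layout approximate).
    file open fh using "study.inp", write replace text
    file write fh (_N) _n                   // number of individuals
    file write fh "5" _n                    // number of loci
    file write fh "SSSSS" _n                // locus types (S = SNP)
    forvalues i = 1/`=_N' {
        file write fh "#" (id[`i']) _n      // individual identifier
        forvalues l = 1/5 {                 // first allele row
            file write fh (a1_`l'[`i']) " "
        }
        file write fh _n
        forvalues l = 1/5 {                 // second allele row
            file write fh (a2_`l'[`i']) " "
        }
        file write fh _n
    }
    file close fh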
Additional information
uk09_huber.ppt
Improving the output capabilities of Stata with Open Document Format XML
Adam Jacobs
Dianthus Medical Limited, London
Stata’s capabilities for statistical analysis, graphics, and data management
are world class, but its ability to produce well-presented textual output is
considerably more limited. Particularly annoying are the lack of appropriate
page breaks and repeated column headers in large tables, the lack of Unicode
support, and the absence of many other features taken for granted in word
processors, such as automatically generated tables of contents. But all is
not lost. Open Document Format (ODF) is an open ISO standard for office-type
documents, including word-processing documents, and is the default file
format of the popular open-source office software suite OpenOffice.org. It
is an XML-based format, which means that ODF files can be written in a text
editor, or with any software that can produce plain-text output. Happily,
Stata is more than equal to the task of producing plain-text output. In this
talk, I shall explain how I have used Stata to produce output in ODF XML
files, thus making the appearance of the output considerably more
user-friendly than native Stata output.
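A minimal sketch of the technique (fragment only; a real content.xml also needs the document wrapper, namespace declarations, and zip packaging into an .odt file):

    * Sketch: emit one ODF table row per observation.
    sysuse auto, clear
    file open odf using "rows.xml", write replace text
    forvalues i = 1/`=_N' {
        file write odf "<table:table-row>" _n
        file write odf `"<table:table-cell office:value-type="string"><text:p>"'
        file write odf (make[`i']) "</text:p></table:table-cell>" _n
        file write odf "</table:table-row>" _n
    }
    file close odf

Real data would also need XML-escaping of characters such as < and &.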
Additional information
uk09_jacobs.ppt
The economics of Statalist exchanges
Martin Weiss
University of Tuebingen
I have researched the economics of interactions on Statalist, based on the
full population of exchanges from January 1 to June 30, 2009. Both the
“demand side”—the questions asked on the list—and the “supply
side”—the answers provided—are examined. Along the way, I pay
particular attention to the role of unsatisfied demand (“orphans”), i.e.,
questions that never attract a reply.
Additional information
uk09_weiss.pdf
Summarizing the results of simulation studies
Ian White
MRC Biostatistics Unit, Cambridge University
Simulation studies are a powerful tool, but their analyses are not always
done well; in particular, Monte Carlo standard errors are often not
reported. I present a Stata program, simsum, which can output a range of
summaries, including bias, precision of one method relative to another,
percentage difference between model-based and empirical standard errors,
power, and coverage. Monte Carlo standard errors are computed for all these
quantities, using exact or approximate formulae.
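A hedged sketch of the call pattern (option names from memory of the simsum help file; the data layout, with one row per repetition and variables b, se, and method, is hypothetical):

    * Sketch: summarize estimates of a true parameter value of 0.5.
    simsum b, true(0.5) se(se) methodvar(method) bias empse cover

Each requested performance measure is then reported together with its Monte Carlo standard error.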
Additional information
uk09_white.pdf
Rating scale analysis
Michael Glencross
Community Agency for Social Enquiry, Johannesburg
In many research studies, respondents’ beliefs and opinions about various
concepts are measured by means of five-, six-, and seven-point scales. The
widely used five-point scale is commonly known as a Likert scale (Likert
(1932), “A technique for the measurement of attitudes”, Archives of
Psychology, 22, 1–55). In such situations, it is desirable to have a test
statistic that provides a measure of the amount of agreement or disagreement
in the sample, that is, whether a particular item “pole” is characteristic
of the respondents. This is preferable to making arbitrary decisions about
the extremeness or otherwise of the sample responses. A suitable test for
this purpose, Cooper z, was designed by Cooper (1976, “An exact probability
test for use with Likert-type scales”, Educational and Psychological
Measurement, 36, 647–655), with modifications suggested by Whitney (1978,
“An alternative test for use with Likert-type scales”, Educational and
Psychological Measurement, 38, 15–19) yielding Whitney t. Cooper showed
that for large samples the Cooper z statistic has a sampling distribution
that is approximately normal. The alternative Whitney t statistic has a
sampling distribution that is approximately t with (n−1) degrees of
freedom and is suitable for small samples. Between them, these two
statistics, although rarely used, provide a quick and straightforward way of
analyzing rating scales objectively. In this presentation, I will describe
the Stata syntax used to calculate the Cooper z and Whitney t statistics and
to create the related bar graphs. An illustrative example will demonstrate
their use in a survey.
Additional information
uk09_glencross.ppt
Funnel plots for institutional comparisons
Rosa Gini
Regional Agency for Public Health of Tuscany
Sylvia Forni
Regional Agency for Public Health of Tuscany
We introduce funnelcompar, a Stata routine that performs the analysis
suggested by David J. Spiegelhalter (“Funnel plots for comparing
institutional performance”, Statistics in Medicine, 2005, 24, 1185–1202).
The basic idea of funnel plots is to plot performance indicators against a
measure of their precision in order to detect outliers. The indicator is
shown as a scatter plot together with a baseline and control limits, which
shrink as the sample size gets bigger. Our command produces funnel plots for
binomially (proportions), Poisson (crude and standardized rates), and
normally (means) distributed variables. The baseline (and the standard
errors, in the case of normal variables) can either be specified by the user
(for instance, from a literature reference) or be estimated from the data as
a weighted or unweighted mean. By default, control limits are drawn at two
and three standard errors, to flag alert and alarm signals, as recommended
by statistical process control theory. Options have been implemented to mark
single institutions, groups of institutions, or those institutions lying
outside the control limits. Such plots are increasingly used to report
performance indicators at the institutional level. Classical league tables
imply the existence of a ranking between institutions and implicitly support
the idea that some of them are worse or better than others. A different
approach is possible using statistical process control theory: under the
null hypothesis, all institutions are part of a single system and perform at
the same level. Observed differences can never be completely eliminated and
are explained by chance (common-cause variation). If observed variation
exceeds what is expected, special-cause variation exists and requires
further investigation to identify its cause.
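For the binomial case, the construction is easy to prototype from scratch (a sketch of the idea with official commands, not funnelcompar's own syntax; d and n are hypothetical event counts and caseloads, with an assumed baseline p0 = 0.1):

    * Sketch: funnel plot of institution-level proportions.
    scalar p0 = 0.1
    generate double phat = d/n
    generate double lo2 = scalar(p0) - 2*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    generate double hi2 = scalar(p0) + 2*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    generate double lo3 = scalar(p0) - 3*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    generate double hi3 = scalar(p0) + 3*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    sort n
    twoway (line lo2 hi2 lo3 hi3 n) (scatter phat n), ///
        yline(0.1) legend(off) ytitle("Proportion") xtitle("Cases")

Institutions outside the three-standard-error funnel are the candidates for special-cause investigation.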
Additional information
uk09_gini_forni.pdf
Decomposition of inequality change into pro-poor growth and mobility components: dsginideco
Stephen P. Jenkins
University of Essex
Philippe Van Kerm
CEPS/INSTEAD, Luxembourg
In this short talk, we describe the module dsginideco, which decomposes the
change in income inequality between two time periods into two components:
one representing the progressivity (pro-poorness) of income growth, and the
other representing reranking. Inequality is measured using the generalized
Gini coefficient, also known as the S-Gini, G(v). This is a distributionally
sensitive inequality index, with larger values of v placing greater weight
on inequality differences among poorer (lower-ranked) observations. The
conventional Gini coefficient corresponds to the case v = 2. The
decomposition is of the form: final-period inequality − initial-period
inequality = R − P, where R is a measure of reranking and P is a measure of
the progressivity of income growth. For full details of the decomposition
and an application, see S.P. Jenkins and P. Van Kerm (2006), “Trends in
income inequality, pro-poor income growth and income mobility”, Oxford
Economic Papers, 58: 531–548.
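In symbols (our transcription from memory of the 2006 paper, so check the original): writing C_1(v) for the concentration coefficient of final-period income when individuals keep their initial-period ranks,

$$
G_1(v) - G_0(v) \;=\; \underbrace{\bigl[G_1(v) - C_1(v)\bigr]}_{R(v)\ \text{(reranking)}} \;-\; \underbrace{\bigl[G_0(v) - C_1(v)\bigr]}_{P(v)\ \text{(progressivity)}}
$$

so R(v) is nonnegative and vanishes when no one changes rank, while P(v) is positive when income growth is progressive.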
Additional information
uk09_jenkins_vankerm.pdf
Education inequality in Latin America and the Caribbean: A socioeconomic gradients analysis using Stata
Roy Costilla
LLECE/UNESCO, Santiago
A socioeconomic gradient describes the relationship between a social outcome
and socioeconomic status for individuals in a specific jurisdiction, such as
a school, a province or state, or a country (Willms (2003), “Ten hypotheses
about socioeconomic gradients and community differences in children’s
developmental outcomes”). Within this framework, I analyze the relationship
between students’ achievement in mathematics and reading and their
socioeconomic and cultural status for the Latin American and Caribbean
primary school students assessed by the SERCE study (OREALC/UNESCO,
Santiago, 2008). The strength of this relationship is shown to vary
considerably among countries, suggesting different degrees of success in
reducing the disparities associated with socioeconomic and cultural status.
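A minimal sketch of a gradient fit with official Stata (variable names hypothetical): estimate the achievement-status slope separately by country and compare.

    * Sketch: country-specific socioeconomic gradients.
    statsby slope=_b[ses] r2=e(r2), by(country) clear: regress math ses
    list country slope r2

Steeper slopes indicate greater inequality of outcomes with respect to socioeconomic status.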
Additional information
uk09_costilla.pdf
Multiple-imputation analysis using Stata’s new mi command
Yulia Marchenko
StataCorp
Stata 11’s mi command can be used to perform multiple-imputation analysis,
including imputation, data management, and estimation. mi impute provides
five univariate and two multivariate imputation methods. mi estimate
combines the estimation and pooling steps of the multiple-imputation
procedure into one easy step. mi also provides extensive facilities for
managing multiply imputed data. I will give a brief overview of all of mi’s
capabilities, with emphasis on mi impute and mi estimate, and will also
demonstrate some of mi’s unique data management features.
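A minimal end-to-end sketch with the official Stata 11 syntax (variable names hypothetical):

    * Sketch: impute a partially observed covariate, then pool.
    mi set mlong                               // choose a storage style
    mi register imputed bmi                    // declare what gets imputed
    mi impute regress bmi age smoke, add(20)   // 20 linear-regression imputations
    mi estimate: regress sbp bmi age smoke     // fit and pool (Rubin's rules)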
Additional information
uk09_marchenko.pdf
Contour enhanced funnel plots for meta-analysis
Tom Palmer
University of Bristol
Funnel plots are commonly used to investigate publication and related biases
in meta-analysis. Although asymmetry in the appearance of a funnel plot is
often interpreted as being caused by publication bias, in reality the
asymmetry could be due to other factors that cause systematic differences in
the results of large and small studies, for example, confounding factors
such as differential study quality. Funnel plots can be enhanced by adding
contours of statistical significance to aid in interpreting the funnel plot.
If studies appear to be missing in areas of low statistical significance,
then it is possible that the asymmetry is due to publication bias. If
studies appear to be missing in areas of high statistical significance, then
publication bias is a less likely cause of the funnel asymmetry. Examples
will be given using the user-written confunnel command in conjunction with
some of the other user-written commands for meta-analysis.
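The contour idea itself is easy to sketch with official graphics (this is not confunnel's syntax; ES and seES are hypothetical variables holding study effects and their standard errors):

    * Sketch: funnel with two-sided p = 0.05 significance contours.
    twoway (function y = x/1.96,  range(0 2)  lpattern(dash))  ///
           (function y = -x/1.96, range(-2 0) lpattern(dash))  ///
           (scatter seES ES),                                  ///
           yscale(reverse) ytitle("Standard error") xtitle("Effect size")

Studies falling in the wedge between the dashed lines are nonsignificant at the 5% level; if studies are missing mainly from that wedge, publication bias is a plausible explanation.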
Additional information
uk09_palmer_presentation.pdf
uk09_palmer_handouts.pdf
Homoskedastic adjustment inflation factors in model selection
Roger B. Newson
Imperial College, London
Insufficient confounder adjustment is viewed as a common source of “false
discoveries”, especially in epidemiology. However, adjustment for
“confounders” that are correlated with the exposure, but which do not
independently predict the outcome, may cause loss of power to detect the
exposure effect. On the other hand, choosing confounders by “stepwise”
methods is subject to many hazards, which imply that the confidence interval
eventually published is likely not to have the advertised coverage
probability for the effect that we wanted to know. We would like to be able
to find a model in the data on exposures and confounders, and then to
estimate the parameters of that model from the conditional distribution of
the outcome, given the exposures and confounders. The haif package,
downloadable from the SSC archive, calculates the homoskedastic adjustment
inflation factors (HAIFs) by which the variances and standard errors of the
coefficients for a matrix of X-variables are scaled (or inflated) if a
matrix of unnecessary confounders A is also included in a regression model,
assuming equal variances (homoskedasticity). These factors can be calculated
from the A- and X-variables alone and can be used to inform the choice of a
set of models eventually fitted to the outcome data, together with the usual
criteria involving causality and prior opinion. Examples are given of the
use of HAIFs and their ratios.
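A hedged Mata sketch of the quantity as we read the abstract (our reading, not the haif package's own code): under homoskedasticity, the variance of the coefficient on each X-variable inflates by the ratio below when the columns of A are added, computable from X and A alone.

    mata:
    // HAIF_j = [(X'M_A X)^{-1}]_jj / [(X'X)^{-1}]_jj,
    // where M_A = I - A(A'A)^{-1}A' (include constant columns as needed).
    real colvector haif_sketch(real matrix X, real matrix A)
    {
        real matrix XMA
        XMA = X - A*invsym(cross(A, A))*cross(A, X)   // X residualized on A
        return(diagonal(invsym(cross(XMA, XMA))) :/
               diagonal(invsym(cross(X, X))))
    }
    end

Values well above 1 flag “confounders” whose inclusion mostly costs precision.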
Additional information
uk09_newson.pdf
Implementing econometric estimators with Mata
Christopher F. Baum
Boston College
Mark E. Schaffer
Heriot-Watt University
We discuss how econometric estimators may be efficiently programmed in Mata.
The prevalence of matrix-based analytical derivations of estimation
techniques and the computational improvements available from just-in-time
compilation combine to make Mata the tool of choice for econometric
implementation. Two examples are given: computing the seemingly unrelated
regression (SUR) estimator for an unbalanced panel, a multivariate linear
approach, and computing the continuously updated GMM estimator (GMM-CUE) for
a linear instrumental variables model. The GMM-CUE estimator makes use of
Mata’s optimize() suite of functions. Both illustrate the power and
effectiveness of a Mata-based approach.
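The optimize() scaffolding is compact. A minimal sketch with a toy objective (not the GMM-CUE criterion itself):

    mata:
    // Maximize f(p) = -(p1-1)^2 - (p2+2)^2; optimum at (1, -2).
    void toyeval(real scalar todo, real rowvector p,
                 real scalar v, real rowvector g, real matrix H)
    {
        v = -((p[1]-1)^2 + (p[2]+2)^2)
    }
    S = optimize_init()
    optimize_init_evaluator(S, &toyeval())
    optimize_init_evaluatortype(S, "d0")   // value only; derivatives numeric
    optimize_init_params(S, (0, 0))
    p = optimize(S)
    p
    end

A real estimator substitutes its own objective in the evaluator; for GMM-CUE, the criterion recomputes the weight matrix at each trial parameter vector.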
Additional information
uk09_baum.pdf
Flexible parametric alternatives to the Cox model
Paul Lambert
University of Leicester
Patrick Royston
MRC Clinical Trials Unit, London
The Cox model is the most popular method for modeling time-to-event data.
The fact that it does not directly estimate the baseline hazard function is
both an advantage and a disadvantage. This tutorial will describe various
aspects of flexible parametric alternatives to the Cox model by describing a
new command, stpm2. We will cover the following areas:
- the general idea of the flexible parametric approach
- proportional hazards and proportional odds models
- model selection for the baseline hazard
- modeling time-dependent effects
- using age as the time-scale
- modeling with multiple time-scales
- using absolute or relative differences (hazard ratios or differences in hazard rates)
- multiple events
- time-varying covariates
- adjusted survival curves
- relative survival (incorporating expected mortality)
- estimating crude and net mortality (based on competing risks)
We aim to show that statisticians who are required to analyze time-to-event
data should not always opt for the Cox model and that use of the flexible
parametric approach brings a number of advantages. The topics covered in
this tutorial are among those described in more detail in a book to be
released by Stata Press later this year.
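A minimal call sketch (stpm2 is available from SSC; the dataset and variable names are hypothetical):

    * Sketch: flexible parametric proportional-hazards model with
    * 4 df for the baseline and a time-dependent treatment effect.
    stset time, failure(died)
    stpm2 treat age, scale(hazard) df(4) eform
    stpm2 treat age, scale(hazard) df(4) tvc(treat) dftvc(3) eform
    predict h, hazard                   // smooth fitted hazard function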
Additional information
uk09_lambert_royston.pdf
Recent developments in output processing
Ben Jann
ETH, Zurich
This tutorial will show how results from various Stata commands can be
processed efficiently for inclusion in customized reports. A two-step
procedure is proposed in which results are gathered and archived in a first
step and then tabulated in a second step. Such an approach disentangles the
task of computing results (which may take a long time) from that of
preparing results for inclusion in presentations, papers, and reports (which
you may have to do over and over). Examples are presented using results from
model estimation commands as well as various other Stata commands such as
tabulate, summarize, and correlate. Furthermore, this tutorial shows how to
dynamically link results into word processors or LaTeX documents.
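A minimal sketch of the two-step pattern, using official estimates save/use for the archive step and the SSC command esttab for the tabulation step (one of several possible tools; file names are placeholders):

    * Step 1: compute once and archive to disk.
    sysuse auto, clear
    regress price mpg weight
    estimates save m1, replace
    regress price mpg weight foreign
    estimates save m2, replace

    * Step 2: tabulate whenever the report changes.
    estimates use m1
    estimates store m1
    estimates use m2
    estimates store m2
    esttab m1 m2 using mytable.tex, se r2 replace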
Additional information
uk09_jann.pdf
Scientific organizers
Roger Newson, Imperial College London
Stephen Jenkins, University of Essex
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.