Last updated: 9 October 2009
2009 UK Stata Users Group meeting
10–11 September
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Selection endogenous dummy ordered probit and selection endogenous dummy dynamic ordered probit models
Massimiliano Bratti
University of Milan
Alfonso Miranda
Institute of Education, University of London
In this presentation we define two qualitative response models: 1) a Selection
Endogenous Dummy Ordered Probit (SED-OP) model; 2) a Selection Endogenous
Dummy Dynamic Ordered Probit (SED-DOP) model. The SED-OP model is a
three-equation model consisting of an endogenous dummy equation, a selection
equation, and a main equation with an ordinal response. The main feature of
the model is that the endogenous dummy enters both the selection equation and
the main equation. The dynamic SED-DOP model allows both the selection
equation and the ordered equation to be dynamic by including lagged individual
behaviour. Initial conditions are properly accounted for, and free correlation
among the unobservables entering each of the three equations is allowed. We
show how these models can be estimated in Stata using maximum simulated
likelihood.
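As a rough schematic of the three-equation structure (notation ours, inferred from the description above rather than taken from the authors' materials):

$$
\begin{aligned}
d_i^{*} &= \mathbf{z}_i'\boldsymbol{\gamma} + u_i, & d_i &= \mathbf{1}(d_i^{*} > 0) && \text{(endogenous dummy)}\\
s_i^{*} &= \mathbf{w}_i'\boldsymbol{\delta} + \alpha d_i + v_i, & s_i &= \mathbf{1}(s_i^{*} > 0) && \text{(selection)}\\
y_i^{*} &= \mathbf{x}_i'\boldsymbol{\beta} + \lambda d_i + \varepsilon_i, & y_i &= j \ \text{if} \ \kappa_{j-1} < y_i^{*} \le \kappa_j && \text{(ordered outcome; observed only if } s_i = 1\text{)}
\end{aligned}
$$

with $(u_i, v_i, \varepsilon_i)$ jointly normal with unrestricted correlations; the simulated likelihood integrates over these correlated unobservables.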
Additional information
uk09_bratti_miranda.pdf
Robust principal component analysis in Stata
Vincenzo Verardi
University of Brussels and University of Namur
In data analysis, when some observations are outlying in one or several
dimensions, principal component analysis (PCA) is distorted and may lead to
questionable results. I therefore propose a simple solution to this
problem: a short ado-file based on a robust estimation of the covariance
matrix. To illustrate the importance of this type of approach, I present a
PCA based on the variables used to rank universities according to academic
excellence (as measured by scores in the Shanghai ARWU ranking).
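A minimal sketch of this approach (not Verardi's ado-file itself): obtain a robust correlation matrix from some robust estimator, then hand it to official Stata's pcamat. The matrix R and the variable names here are placeholders.

    * Sketch: PCA on a robustly estimated correlation matrix.
    * Assume R holds a robust correlation estimate of x1-x5
    * (e.g., derived from a minimum covariance determinant fit).
    pcamat R, n(100) names(x1 x2 x3 x4 x5)   // n() = sample size behind R
    screeplot                                // inspect the eigenvalues

Because R downweights outlying observations, the leading components are no longer pulled toward them.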
Additional information
uk09_verardi.ppt
Three models for combining information from causal indicators
Maarten Buis
University of Tuebingen
Sometimes we have multiple measures of the same concept. Combining the
information of these multiple measures would allow us to improve the
measurement. When combining the information from different indicators, one
needs to distinguish between two types of relationships between the observed
indicators and the underlying latent variable: either the latent variable
influences the indicators or the indicators influence the latent variable.
To distinguish between these two situations, some authors, following Bollen
(Quality and Quantity, 1984) and Bollen and Lennox (Psychological Bulletin,
1991), call the observed variables “effect indicators” when they are
influenced by the latent variable, and “causal indicators” when they
influence the latent variable.
Distinguishing between these two is important as they require very different
strategies for recovering the latent variable. In a basic (exploratory)
factor analysis, which is a model for effect indicators, one assumes that
the only thing that the observed variables have in common is the latent
variable, so any correlation between the observed variables must be due to
the latent variable, and it is this correlation that is used to recover the
latent variable. In the models for causal indicators that I will discuss in
this talk, I assume that the latent variable is a weighted sum of the
observed variables (and optionally an error term), and the weights are
estimated such that they are optimal for predicting the dependent variable.
The three models for dealing with causal indicators that will be discussed
are a model with “sheaf coefficients” (Heise, Sociological Methods &
Research, 1972), a model with “parametrically weighted covariates”
(Yamaguchi, Sociological Methodology, 2002), and a multiple indicators and
multiple causes (MIMIC) model (Hauser and Goldberger, Sociological
Methodology, 1971). The latter two can be estimated using propcnsreg, while
the former can be estimated using sheafcoef. Both commands are available
from the SSC archive.
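As a hedged illustration of the sheaf-coefficient setup (the call pattern below is from memory of the SSC help file, so check help sheafcoef; variable names are hypothetical):

    * Sketch: y regressed on a block of causal indicators (x1-x3)
    * plus a control z; sheafcoef then replaces the block with one
    * standardized latent variable whose weights best predict y.
    regress y x1 x2 x3 z
    sheafcoef, latent(x1 x2 x3)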
Additional information
uk09_buis.pdf
To the vector belong the spoils: Circular statistics in Stata
Nicholas J. Cox
Durham University
Circular statistics are needed when one or more variables have outcome space
on the circle, as is true, for example, of data measured with reference to
compass, clock, or calendar. Applications abound in the earth and
environmental sciences, not to mention the economic and medical fields well
represented among Stata users, and in other disciplines such as music.
Previous talks on circular statistics were given at the UK Stata Users Group
meetings in 1997 and 2004. This update will survey the field with special reference to
recently revised or newly written programs for graphics, summary, testing,
and modeling.
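As a from-scratch taste of the arithmetic (official Stata only, not one of the programs in the talk): the circular mean of directions in degrees is the direction of the mean resultant vector.

    * Sketch: circular mean of a variable "direction" in degrees.
    generate double s = sin(direction*_pi/180)
    generate double c = cos(direction*_pi/180)
    summarize s, meanonly
    scalar S = r(mean)
    summarize c, meanonly
    scalar C = r(mean)
    display "circular mean (degrees) = " mod(atan2(S, C)*180/_pi + 360, 360)
    display "mean resultant length   = " sqrt(S^2 + C^2)

The mean resultant length lies between 0 (directions fully dispersed) and 1 (fully concentrated), which is why ordinary arithmetic means fail on such data.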
Exporting and importing Stata genotype data to and from PHASE and HaploView
Chuck Huber
Texas A&M University
Genetic association studies often explore the relationship between diseases
and haplotypes, collections of contiguous genetic markers located on the
same chromosome. Haplotypes are usually not observed directly but are
inferred statistically using a variety of algorithms. One of the most
popular haplotype-inference programs is PHASE, and one of the most popular
programs for examining characteristics of the resulting haplotypes is
HaploView. I have developed a set of Stata commands for
exporting genotype data from Stata into PHASE, importing the resulting
haplotypes back into Stata for association analysis, and exporting the
haplotype data from Stata into HaploView.
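The heart of the export step is plain-text file writing, which Stata's file commands handle directly. A rough sketch of the pattern for a PHASE-style input file (the exact layout must be checked against the PHASE documentation; the variables id and a1_1-a2_5, holding the two alleles at each of five loci, are hypothetical):

    * Sketch: write a PHASE-style input file (layout approximate).
    file open fh using "study.inp", write replace text
    file write fh (_N) _n                   // number of individuals
    file write fh "5" _n                    // number of loci
    file write fh "SSSSS" _n                // locus types (S = SNP)
    forvalues i = 1/`=_N' {
        file write fh "#" (id[`i']) _n      // individual identifier
        forvalues l = 1/5 {                 // first allele row
            file write fh (a1_`l'[`i']) " "
        }
        file write fh _n
        forvalues l = 1/5 {                 // second allele row
            file write fh (a2_`l'[`i']) " "
        }
        file write fh _n
    }
    file close fh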
Additional information
uk09_huber.ppt
Improving the output capabilities of Stata with Open Document Format XML
Adam Jacobs
Dianthus Medical Limited, London
Stata’s capabilities for statistical analysis, graphics, and data management
are world class, but its ability to produce well-presented textual output is
considerably more limited. Particularly annoying are the lack of appropriate
page breaks and repeated column headers in large tables, the lack of Unicode
support, and the absence of many other features taken for granted in word
processors, such as automatically generated tables of contents. But all is
not lost. Open Document Format (ODF) is an open ISO standard for office-type
documents, including word-processing documents, and is the default file
format of the popular open-source office software suite OpenOffice.org. It
is an XML-based format, which means that ODF files can be written in a text
editor, or with any software that can produce plain-text output. Happily,
Stata is more than equal to the task of producing plain-text output. In this
talk, I shall explain how I have used Stata to produce output in ODF XML
files, thus making the appearance of the output considerably more
user-friendly than native Stata output.
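A minimal sketch of the technique (fragment only; a real content.xml also needs the document wrapper, namespace declarations, and zip packaging into an .odt file):

    * Sketch: emit one ODF table row per observation.
    sysuse auto, clear
    file open odf using "rows.xml", write replace text
    forvalues i = 1/`=_N' {
        file write odf "<table:table-row>" _n
        file write odf `"<table:table-cell office:value-type="string"><text:p>"'
        file write odf (make[`i']) "</text:p></table:table-cell>" _n
        file write odf "</table:table-row>" _n
    }
    file close odf

Real data would also need XML-escaping of characters such as < and &.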
Additional information
uk09_jacobs.ppt
The economics of Statalist exchanges
Martin Weiss
University of Tuebingen
I have researched the economics of interactions on Statalist, based on the
full population of exchanges from January 1 to June 30, 2009. Both the
“demand side”—the questions asked on the list—and the “supply
side”—the answers provided—are examined. Along the way, I pay
particular attention to the role of unsatisfied demand (“orphans”), i.e.,
questions that never attract a reply.
Additional information
uk09_weiss.pdf
Summarizing the results of simulation studies
Ian White
MRC Biostatistics Unit, Cambridge University
Simulation studies are a powerful tool, but their analyses are not always
done well; in particular, Monte Carlo standard errors are often not
reported. I present a Stata program, simsum, which can output a range of
summaries, including bias, precision of one method relative to another,
percentage difference between model-based and empirical standard errors,
power, and coverage. Monte Carlo standard errors are computed for all these
quantities, using exact or approximate formulae.
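A hedged sketch of the call pattern (option names from memory of the simsum help file; the data layout, with one row per repetition and variables b, se, and method, is hypothetical):

    * Sketch: summarize estimates of a true parameter value of 0.5.
    simsum b, true(0.5) se(se) methodvar(method) bias empse cover

Each requested performance measure is then reported together with its Monte Carlo standard error.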
Additional information
uk09_white.pdf
Rating scale analysis
Michael Glencross
Community Agency for Social Enquiry, Johannesburg
In many research studies, respondents’ beliefs and opinions about various
concepts are measured by means of five-, six-, and seven-point scales. The
widely used five-point scale is commonly known as a Likert scale (Likert
(1932), “A technique for the measurement of attitudes”, Archives of
Psychology, 22, 1–55). In such situations, it is desirable to have a test
statistic that provides a measure of the amount of agreement or disagreement
in the sample, that is, whether a particular item “pole” is characteristic
of the respondents. This is preferable to making arbitrary decisions about
the extremeness or otherwise of the sample responses. A suitable test for
this purpose, Cooper z, was designed by Cooper (1976, “An exact probability
test for use with Likert-type scales”, Educational and Psychological
Measurement, 36, 647–655), with modifications suggested by Whitney (1978,
“An alternative test for use with Likert-type scales”, Educational and
Psychological Measurement, 38, 15–19) yielding Whitney t. Cooper showed
that for large samples the Cooper z statistic has a sampling distribution
that is approximately normal. The alternative Whitney t statistic has a
sampling distribution that is approximately t with (n−1) degrees of
freedom and is suitable for small samples. Between them, these two
statistics, although rarely used, provide a quick and straightforward way of
analyzing rating scales objectively. In this presentation, I will describe
the Stata syntax used to calculate the Cooper z and Whitney t statistics and
to create the related bar graphs. An illustrative example will demonstrate
their use in a survey.
Additional information
uk09_glencross.ppt
Funnel plots for institutional comparisons
Rosa Gini
Regional Agency for Public Health of Tuscany
Sylvia Forni
Regional Agency for Public Health of Tuscany
We introduce funnelcompar, a Stata routine that performs the analysis
suggested by David J. Spiegelhalter (“Funnel plots for comparing
institutional performance”, Statistics in Medicine, 2005, 24, 1185–1202).
The basic idea of funnel plots is to plot performance indicators against a
measure of their precision in order to detect outliers. The indicator is
shown as a scatter plot together with a baseline and control limits, which
shrink as the sample size gets bigger. Our command produces funnel plots for
binomially (proportions), Poisson (crude and standardized rates), and
normally (means) distributed variables. The baseline (and the standard
errors, in the case of normal variables) can either be specified by the user
(for instance, from a literature reference) or be estimated from the data as
a weighted or unweighted mean. By default, control limits are drawn at two
and three standard errors, to flag alert and alarm signals, as recommended
by statistical process control theory. Options have been implemented to mark
single institutions, groups of institutions, or those institutions lying
outside the control limits. Such plots are increasingly used to report
performance indicators at the institutional level. Classical league tables
imply the existence of a ranking between institutions and implicitly support
the idea that some of them are worse or better than others. A different
approach is possible using statistical process control theory: under the
null hypothesis, all institutions are part of a single system and perform at
the same level. Observed differences can never be completely eliminated and
are explained by chance (common-cause variation). If observed variation
exceeds what is expected, special-cause variation exists and requires
further investigation to identify its cause.
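For the binomial case, the construction is easy to prototype from scratch (a sketch of the idea with official commands, not funnelcompar's own syntax; d and n are hypothetical event counts and caseloads, with an assumed baseline p0 = 0.1):

    * Sketch: funnel plot of institution-level proportions.
    scalar p0 = 0.1
    generate double phat = d/n
    generate double lo2 = scalar(p0) - 2*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    generate double hi2 = scalar(p0) + 2*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    generate double lo3 = scalar(p0) - 3*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    generate double hi3 = scalar(p0) + 3*sqrt(scalar(p0)*(1 - scalar(p0))/n)
    sort n
    twoway (line lo2 hi2 lo3 hi3 n) (scatter phat n), ///
        yline(0.1) legend(off) ytitle("Proportion") xtitle("Cases")

Institutions outside the three-standard-error funnel are the candidates for special-cause investigation.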
Additional information
uk09_gini_forni.pdf
Decomposition of inequality change into pro-poor growth and mobility components: dsginideco
Stephen P. Jenkins
University of Essex
Philippe Van Kerm
CEPS/INSTEAD, Luxembourg
In this short talk, we describe the module dsginideco, which decomposes the
change in income inequality between two time periods into two components:
one representing the progressivity (pro-poorness) of income growth, and the
other representing reranking. Inequality is measured using the generalized
Gini coefficient, also known as the S-Gini, G(v). This is a distributionally
sensitive inequality index, with larger values of v placing greater weight
on inequality differences among poorer (lower-ranked) observations. The
conventional Gini coefficient corresponds to the case v = 2. The
decomposition is of the form: final-period inequality − initial-period
inequality = R − P, where R is a measure of reranking and P is a measure of
the progressivity of income growth. For full details of the decomposition
and an application, see S.P. Jenkins and P. Van Kerm (2006), “Trends in
income inequality, pro-poor income growth and income mobility”, Oxford
Economic Papers, 58: 531–548.
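In symbols (our transcription from memory of the 2006 paper, so check the original): writing C_1(v) for the concentration coefficient of final-period income when individuals keep their initial-period ranks,

$$
G_1(v) - G_0(v) \;=\; \underbrace{\bigl[G_1(v) - C_1(v)\bigr]}_{R(v)\ \text{(reranking)}} \;-\; \underbrace{\bigl[G_0(v) - C_1(v)\bigr]}_{P(v)\ \text{(progressivity)}}
$$

so R(v) is nonnegative and vanishes when no one changes rank, while P(v) is positive when income growth is progressive.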
Additional information
uk09_jenkins_vankerm.pdf
Education inequality in Latin America and the Caribbean: A socioeconomic gradients analysis using Stata
Roy Costilla
LLECE/UNESCO, Santiago
A socioeconomic gradient describes the relationship between a social outcome
and socioeconomic status for individuals in a specific jurisdiction, such as
a school, a province or state, or a country (Willms (2003), “Ten hypotheses
about socioeconomic gradients and community differences in children’s
developmental outcomes”). Within this framework, I analyze the relationship
between students’ achievement in mathematics and reading and their
socioeconomic and cultural status for the Latin American and Caribbean
primary school students assessed by the SERCE study (OREALC/UNESCO,
Santiago, 2008). The strength of this relationship is shown to vary
considerably among countries, suggesting different degrees of success in
reducing the disparities associated with socioeconomic and cultural status.
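A minimal sketch of a gradient fit with official Stata (variable names hypothetical): estimate the achievement-status slope separately by country and compare.

    * Sketch: country-specific socioeconomic gradients.
    statsby slope=_b[ses] r2=e(r2), by(country) clear: regress math ses
    list country slope r2

Steeper slopes indicate greater inequality of outcomes with respect to socioeconomic status.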
Additional information
uk09_costilla.pdf
Multiple-imputation analysis using Stata’s new mi command
Yulia Marchenko
StataCorp
Stata 11’s mi command can be used to perform multiple-imputation analysis,
including imputation, data management, and estimation. mi impute provides
five univariate and two multivariate imputation methods. mi estimate
combines the estimation and pooling steps of the multiple-imputation
procedure into one easy step. mi also provides extensive facilities for
managing multiply imputed data. I will give a brief overview of all of mi’s
capabilities, with emphasis on mi impute and mi estimate, and will also
demonstrate some of mi’s unique data management features.
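A minimal end-to-end sketch with the official Stata 11 syntax (variable names hypothetical):

    * Sketch: impute a partially observed covariate, then pool.
    mi set mlong                               // choose a storage style
    mi register imputed bmi                    // declare what gets imputed
    mi impute regress bmi age smoke, add(20)   // 20 linear-regression imputations
    mi estimate: regress sbp bmi age smoke     // fit and pool (Rubin's rules)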
Additional information
uk09_marchenko.pdf
Contour enhanced funnel plots for meta-analysis
Tom Palmer
University of Bristol
Funnel plots are commonly used to investigate publication and related biases
in meta-analysis. Although asymmetry in the appearance of a funnel plot is
often interpreted as being caused by publication bias, in reality the
asymmetry could be due to other factors that cause systematic differences in
the results of large and small studies, for example, confounding factors
such as differential study quality. Funnel plots can be enhanced by adding
contours of statistical significance to aid in interpreting the funnel plot.
If studies appear to be missing in areas of low statistical significance,
then it is possible that the asymmetry is due to publication bias. If
studies appear to be missing in areas of high statistical significance, then
publication bias is a less likely cause of the funnel asymmetry. Examples
will be given using the user-written confunnel command in conjunction with
some of the other user-written commands for meta-analysis.
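The contour idea itself is easy to sketch with official graphics (this is not confunnel's syntax; ES and seES are hypothetical variables holding study effects and their standard errors):

    * Sketch: funnel with two-sided p = 0.05 significance contours.
    twoway (function y = x/1.96,  range(0 2)  lpattern(dash))  ///
           (function y = -x/1.96, range(-2 0) lpattern(dash))  ///
           (scatter seES ES),                                  ///
           yscale(reverse) ytitle("Standard error") xtitle("Effect size")

Studies falling in the wedge between the dashed lines are nonsignificant at the 5% level; if studies are missing mainly from that wedge, publication bias is a plausible explanation.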
Additional information
uk09_palmer_presentation.pdf
uk09_palmer_handouts.pdf
Homoskedastic adjustment inflation factors in model selection
Roger B. Newson
Imperial College, London
Insufficient confounder adjustment is viewed as a common source of “false
discoveries”, especially in epidemiology. However, adjustment for
“confounders” that are correlated with the exposure, but which do not
independently predict the outcome, may cause loss of power to detect the
exposure effect. On the other hand, choosing confounders by “stepwise”
methods is subject to many hazards, which imply that the confidence interval
eventually published is likely not to have the advertised coverage
probability for the effect that we wanted to know. We would like to be able
to find a model in the data on exposures and confounders, and then to
estimate the parameters of that model from the conditional distribution of
the outcome, given the exposures and confounders. The haif package,
downloadable from the SSC archive, calculates the homoskedastic adjustment
inflation factors (HAIFs) by which the variances and standard errors of the
coefficients for a matrix of X-variables are scaled (or inflated) if a
matrix of unnecessary confounders A is also included in a regression model,
assuming equal variances (homoskedasticity). These factors can be calculated
from the A- and X-variables alone and can be used to inform the choice of a
set of models eventually fitted to the outcome data, together with the usual
criteria involving causality and prior opinion. Examples are given of the
use of HAIFs and their ratios.
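A hedged Mata sketch of the quantity as we read the abstract (our reading, not the haif package's own code): under homoskedasticity, the variance of the coefficient on each X-variable inflates by the ratio below when the columns of A are added, computable from X and A alone.

    mata:
    // HAIF_j = [(X'M_A X)^{-1}]_jj / [(X'X)^{-1}]_jj,
    // where M_A = I - A(A'A)^{-1}A' (include constant columns as needed).
    real colvector haif_sketch(real matrix X, real matrix A)
    {
        real matrix XMA
        XMA = X - A*invsym(cross(A, A))*cross(A, X)   // X residualized on A
        return(diagonal(invsym(cross(XMA, XMA))) :/
               diagonal(invsym(cross(X, X))))
    }
    end

Values well above 1 flag “confounders” whose inclusion mostly costs precision.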
Additional information
uk09_newson.pdf
Implementing econometric estimators with Mata
Christopher F. Baum
Boston College
Mark E. Schaffer
Heriot-Watt University
We discuss how econometric estimators may be efficiently programmed in Mata.
The prevalence of matrix-based analytical derivations of estimation
techniques and the computational improvements available from just-in-time
compilation combine to make Mata the tool of choice for econometric
implementation. Two examples are given: computing the seemingly unrelated
regression (SUR) estimator for an unbalanced panel, a multivariate linear
approach, and computing the continuously updated GMM estimator (GMM-CUE) for
a linear instrumental variables model. The GMM-CUE estimator makes use of
Mata’s optimize() suite of functions. Both illustrate the power and
effectiveness of a Mata-based approach.
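The optimize() scaffolding is compact. A minimal sketch with a toy objective (not the GMM-CUE criterion itself):

    mata:
    // Maximize f(p) = -(p1-1)^2 - (p2+2)^2; optimum at (1, -2).
    void toyeval(real scalar todo, real rowvector p,
                 real scalar v, real rowvector g, real matrix H)
    {
        v = -((p[1]-1)^2 + (p[2]+2)^2)
    }
    S = optimize_init()
    optimize_init_evaluator(S, &toyeval())
    optimize_init_evaluatortype(S, "d0")   // value only; derivatives numeric
    optimize_init_params(S, (0, 0))
    p = optimize(S)
    p
    end

A real estimator substitutes its own objective in the evaluator; for GMM-CUE, the criterion recomputes the weight matrix at each trial parameter vector.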
Additional information
uk09_baum.pdf
Flexible parametric alternatives to the Cox model
Paul Lambert
University of Leicester
Patrick Royston
MRC Clinical Trials Unit, London
The Cox model is the most popular method for modeling time-to-event data.
The fact that it does not directly estimate the baseline hazard function is
both an advantage and a disadvantage. This tutorial will describe various
aspects of flexible parametric alternatives to the Cox model by describing a
new command, stpm2. We will cover the following areas:
- the general idea of the flexible parametric approach
- proportional hazards and proportional odds models
- model selection for the baseline hazard
- modeling time-dependent effects
- using age as the time-scale
- modeling with multiple time-scales
- using absolute or relative differences (hazard ratios or differences in hazard rates)
- multiple events
- time-varying covariates
- adjusted survival curves
- relative survival (incorporating expected mortality)
- estimating crude and net mortality (based on competing risks)
We aim to show that statisticians who are required to analyze time-to-event
data should not always opt for the Cox model and that use of the flexible
parametric approach brings a number of advantages. The topics covered in
this tutorial are among those described in more detail in a book to be
released by Stata Press later this year.
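A minimal call sketch (stpm2 is available from SSC; the dataset and variable names are hypothetical):

    * Sketch: flexible parametric proportional-hazards model with
    * 4 df for the baseline and a time-dependent treatment effect.
    stset time, failure(died)
    stpm2 treat age, scale(hazard) df(4) eform
    stpm2 treat age, scale(hazard) df(4) tvc(treat) dftvc(3) eform
    predict h, hazard                   // smooth fitted hazard function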
Additional information
uk09_lambert_royston.pdf
Recent developments in output processing
Ben Jann
ETH, Zurich
This tutorial will show how results from various Stata commands can be
processed efficiently for inclusion in customized reports. A two-step
procedure is proposed in which results are gathered and archived in a first
step and then tabulated in a second step. Such an approach disentangles the
task of computing results (which may take a long time) from that of
preparing results for inclusion in presentations, papers, and reports (which
you may have to do over and over). Examples are presented using results from
model estimation commands as well as various other Stata commands such as
tabulate, summarize, and correlate. Furthermore, this tutorial shows how to
dynamically link results into word processors or LaTeX documents.
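A minimal sketch of the two-step pattern, using official estimates save/use for the archive step and the SSC command esttab for the tabulation step (one of several possible tools; file names are placeholders):

    * Step 1: compute once and archive to disk.
    sysuse auto, clear
    regress price mpg weight
    estimates save m1, replace
    regress price mpg weight foreign
    estimates save m2, replace

    * Step 2: tabulate whenever the report changes.
    estimates use m1
    estimates store m1
    estimates use m2
    estimates store m2
    esttab m1 m2 using mytable.tex, se r2 replace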
Additional information
uk09_jann.pdf
Scientific organizers
Roger Newson, Imperial College London
Stephen Jenkins, University of Essex
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.