Last updated: 16 August 2002
Dutch & German Stata Users Group meeting
23 May 2002
Maastricht University
Auditorium (Aula)
Tongersestraat 53
Maastricht, The Netherlands
Proceedings
Patrick Royston
MRC Clinical Trials Unit, London
W. Sauerbrei
Universitaet Freiburg, Germany
We consider modelling and testing for `interaction' between a continuous
covariate X and a categorical covariate C in a regression model. Here C
represents two treatment arms in a parallel-group clinical trial and X is a
prognostic factor which may influence response to treatment. Usually X is
categorised into groups according to cut-point(s) and the interaction is
analysed in a model with main effects and multiplicative terms. A trend test
of the effect of C over the ordered categories from X may be performed and is
likely to have better power. The cut-point approach raises several
well-known and difficult issues for the analyst, including dependency of the
results on the choice of cut-point, loss of power due to categorisation, and
the danger of `over-fitting' if several cut-points are considered in a
search for `optimality' (Altman et al., 1994).
We will describe an approach to avoid such problems based on fractional
polynomial (FP) modelling of X, without categorisation, overall and at each
level of C (Royston and Sauerbrei, 2002). The first step is to construct a
multivariable adjustment model which may contain binary covariates and FP
transformations of continuous covariates other than X. The second step
involves FP modelling of X within the adjustment model.
Stata software to fit the models will be demonstrated using example
datasets, mainly from cancer studies. The examples show the power of the
approach in detecting and displaying interactions in real data from
randomised controlled trials with a survival-time outcome.
References
- Altman, D. G., B. Lausen, W. Sauerbrei, and M. Schumacher. 1994. The
  dangers of using `optimal' cutpoints in the evaluation of prognostic
  factors. Journal of the National Cancer Institute 86: 829–835.
- Royston, P. and W. Sauerbrei. 2002. A new approach to modelling
  interactions between treatment and continuous covariates in clinical
  trials by using fractional polynomials. Statistics in Medicine, to be
  submitted.
Hand-outs/slides
royston.pdf
Nicholas J. Cox
Durham University
It is commonplace to compute various flavours of residual and predicted
values after fitting many different kinds of model. This allows production
of a great variety of diagnostic graphics, used to examine the general and
specific fit between data and model and to seek possible means of improving
the model. Several different graphs may be inspected in many modelling
exercises, partly because each kind may be best for particular purposes, and
partly because in many analyses a variety of models — in terms of
functional form, choice of predictors, and so forth — may be
entertained, at least briefly. It is therefore helpful to be able to
produce such graphs very rapidly.
Official Stata supplies as built-ins a bundle of commands originally written
for use after regress: avplot, avplots, cprplot, acprplot, lvr2plot, rvfplot
and rvpplot. These were introduced in Stata 3.0 in 1992 and are documented at
[R] regdiag. More recently, in an update to Stata 7.0 on 6 September 2001, all
but the first two have been modified so that they may be used after anova.
Despite their many uses, this suite omits some very useful kinds of plot,
while none of the commands may be used after other modelling commands.
The presentation focuses on a new set of commands, which are biased to
graphics useful for models predicting continuous response variables. The
ideal, approachable asymptotically, is to make minimal assumptions about which
modelling command has been issued previously. The down-side for users is that
if the data and the previous model results do not match the assumptions, it is
possible to get either bizarre results or an error message.
The commands which have been written include the following:

- anovaplot shows fitted or predicted values from an immediately previous
  one-, two-, or three-way anova. By default the data for the response are
  also plotted. In particular, anovaplot can show interaction plots.
- indexplot plots estimation results (by default whatever predict produces
  by default) from an immediately previous regress or similar command versus
  a numeric index or identifier variable, if that is supplied, or observation
  number, if that is not supplied. Values are shown, by default, as vertical
  spikes starting at 0.
- ovfplot plots observed vs fitted or predicted values for the response from
  an immediately previous regress or similar command, with by default a line
  of equality superimposed.
- qfrplot shows quantile plots of fitted values, minus their mean, and
  residuals from the previous estimation command. Fitted values are whatever
  predict produces by default and residuals are whatever predict, res
  produces. Comparing the distributions gives an overview of their
  variability and some idea of their fine structure. By default plots are
  side-by-side. Quantile plots may be observed vs normal (Gaussian).
- rdplot graphs residual distributions. The residuals are, by default, those
  calculated by predict, residuals or (if the previous estimation command was
  glm) by predict, response. The graph by default is a single or multiple
  dotplot, as produced by dotplot; histograms or box plots may be selected by
  specifying either the histogram or the box option.
- regplot plots fitted or predicted values from an immediately previous
  regress or similar command. By default the data for the response are also
  plotted. With one syntax, no varname is specified: regplot shows the
  response and predicted values on the y axis and the covariate named first
  in the regress or similar command on the x axis. Thus with this syntax the
  plot shown is sensitive to the order in which covariates are specified in
  the estimation command. With another syntax, a varname is supplied, which
  may name any numeric variable. This is used as the variable on the x axis.
  In practice regplot is most useful when the fitted values are a smooth
  function of the variable shown on the x axis, or a set of such functions
  given also one or more dummy variables as covariates. However, other
  applications also arise, such as plotting observed and predicted values
  from a time series model versus time.
- rvfplot2 graphs a residual-versus-fitted plot, a graph of the residuals
  versus the fitted values. The residuals are, by default, those calculated
  by predict, residuals or (if the previous estimation command was glm) by
  predict, response. The fitted values are those produced by predict by
  default after each estimation command. rvfplot2 is offered as a
  generalisation of rvfplot in official Stata.
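
As a rough illustration of two of the plot types described above, observed
versus fitted and residual versus fitted plots can be built by hand from
predict and official graph commands. This is a minimal sketch in current
Stata graphics syntax (Stata 8 or later), using the auto dataset shipped with
Stata; the variable names fitted and resid are made up for the example, and
the code does not reproduce the options or defaults of ovfplot or rvfplot2.

```stata
* Example data shipped with Stata; any regress-type model would do
sysuse auto, clear
regress mpg weight foreign

* Default prediction after regress is the linear prediction (fitted value)
predict double fitted
predict double resid, residuals

* Observed versus fitted, with the line of equality y = x superimposed
twoway (scatter mpg fitted) (function y = x, range(fitted)), ///
    ytitle("Observed mpg") xtitle("Fitted mpg") legend(off)

* Residual versus fitted, with a reference line at zero
scatter resid fitted, yline(0) ytitle("Residual") xtitle("Fitted mpg")
```

Because both plots rely only on predict, the same lines should work after
other estimation commands that support the corresponding predict options.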
Hand-outs/slides
diag.pdf
diag.html — graphs covered in the meeting
Ulrich Kohler
Dept. of Social Sciences, Mannheim University
outdat is a Stata program to transfer data from Stata to other statistical
packages. outdat writes data to a disk file in ASCII format and makes a
dictionary to read the data into SPSS, Stata, or Limdep. The presentation
shows how outdat works and how to extend outdat to other data formats.
Hand-outs/slides
kohler.pdf
outdat.zip
Sophia Rabe-Hesketh
Inst. of Psychiatry, King's College, London
Models for handling sample selection or informative missingness have been
developed for both cross-sectional and longitudinal or panel data. For
cross-sectional data, Heckman (1979) suggested a joint model for the response
and sample selection processes where the disturbances of the processes are
correlated. For longitudinal data, Hausman and Wise (1979) and Diggle and
Kenward (1994) developed a model in which the continuous response (observed or
unobserved), and possibly the lagged response, is a predictor of attrition or
dropout. The Heckman model can be estimated using the heckman command in Stata
and the Diggle–Kenward model is available in the Oswald package running in
S-PLUS. Both models can also be estimated using gllamm with the
advantage that the following three generalisations are possible. First, the
models can be extended to multilevel settings where there may be unobserved
heterogeneity between the clusters at the different levels in both the
substantive and selection processes and where selection may operate at several
levels. Second, the Heckman model can be modified for nonnormal response
processes. Third, both the Heckman and Diggle–Kenward models can be
extended to situations where the substantive response is a latent variable
measured by a number of indicators. I will show how the standard Heckman and
Diggle–Kenward models are estimated in gllamm and give examples
of all three types of generalisation of these standard models. The research
was carried out jointly with Anders Skrondal and Andrew Pickles.
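
For orientation, the standard cross-sectional Heckman model mentioned above
can be fitted with the official heckman command as sketched below. This is
only an illustration: it uses the womenwk example dataset from the Stata
documentation (fetched by webuse, so it needs an internet connection) and
does not show the gllamm generalisations described in the talk.

```stata
* Example dataset from the Stata documentation (downloaded on demand)
webuse womenwk, clear

* Outcome equation: wage on education and age.
* Selection equation: whether a wage is observed, modelled with
* marital status, number of children, education and age.
heckman wage educ age, select(married children educ age)
```

The output reports the estimated correlation between the disturbances of the
two equations, which is the quantity that links the response and selection
processes in this model.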
Hand-outs/slides
select.pdf
Ron van der Holt
Dept. of Statistics, AZR, Rotterdam
Wim van Putten
Dept. of Statistics, AZR, Rotterdam
Correct data are crucial for any analysis. For example, the date of
randomization in a clinical trial should never precede the date of diagnosis,
and treatment should always start at the date of randomization or thereafter.
You could easily list patients with flaws in the data using:
. l patnr ddiag drand if ddiag>drand, nod noo
. l patnr drand dsttreat if drand>dsttreat, nod noo
However, when the number of variables and the number of data checks are
large, the errors in the data of a given patient may be scattered throughout
your output, which makes it hard to correct the data. To overcome this
problem, the program qd.ado was developed. All errors are now neatly grouped
by patient; moreover, additional comments to facilitate correction of the
data can easily be defined. We will demonstrate the program and show some
examples. To get more information about this program, type findit qd in
Stata. You will find the relevant ado-files and also a PowerPoint
presentation.
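
The idea of grouping all failed checks by patient can be sketched with
official commands alone. The code below is not qd.ado and makes no claim
about its syntax; it only reuses the variable names from the list commands
above, with a small made-up dataset and hypothetical flag variables err_diag
and err_treat. It uses current do-file syntax (/// continuation).

```stata
* Made-up example data: dates stored as plain day numbers
clear
input patnr ddiag drand dsttreat
1 100 105 106
2 120 118 119
3 130 131 129
4 140 138 137
end

* One indicator per check. Missing values compare as larger than any
* number in Stata, so they are excluded explicitly here.
generate byte err_diag  = ddiag > drand    & !missing(ddiag, drand)
generate byte err_treat = drand > dsttreat & !missing(drand, dsttreat)

* One row per flagged patient, with all failed checks side by side
sort patnr
list patnr ddiag drand dsttreat err_diag err_treat ///
    if err_diag | err_treat, noobs
```

With these data, patients 2 and 3 each fail one check and patient 4 fails
both, so each patient's problems appear together on a single row.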
Hand-outs/slides
qd.ppt
Wilfried Graveland
Dept. of Statistics, AZR, Rotterdam
The program adoinf.ado gives an overview of the ado-files in your site or
personal directory. As the number of ado-files grows, it is easy to forget
what certain ado-files do and whether they are still needed. It is equally
easy to forget which subroutines a file uses and which other ado-files call
it. This can be especially important when you publish ado-files on the web.
Furthermore, for each file the program tries to find the author, a one-line
description of the file, the date last saved, the date of the last help
version, and so on. To get more information about this program, type
findit adoinf in Stata, where the files and the PowerPoint presentation are
available.
Hand-outs/slides
adoinf.ppt
Jeroen Weesie
Dept. of Sociology, Utrecht University
In this talk, I discuss Pearson's X^2 as a measure of goodness of fit for
quantal response models, including binary outcome models (logit, probit,
gompit), multinomial logistic regression (mlogit) and conditional logistic
regression (clogit). Large sample results for X^2 have been derived by,
among others, McCullagh and Windmeijer. A Stata implementation of the test
will be illustrated.
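
To make the statistic concrete, the sketch below computes Pearson's X^2 by
hand after a logit fit, treating each observation as its own cell: the sum of
(y - p)^2 / (p(1 - p)) over observations, where p is the fitted probability.
It is an illustration only, using the auto dataset shipped with Stata and the
made-up variable names phat and x2i; it does not implement the large-sample
test discussed in the talk, and statistics that group observations by
covariate pattern can give a different value.

```stata
sysuse auto, clear
logit foreign weight mpg

* Fitted probability of a positive outcome for each observation
predict double phat, pr

* Squared Pearson residual per observation
generate double x2i = (foreign - phat)^2 / (phat * (1 - phat))

* Pearson's X^2 is the sum of the squared Pearson residuals
quietly summarize x2i
display "Pearson X^2 (one cell per observation) = " r(sum)
```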
Hand-outs/slides
pearsonx2.pdf
Wim van Putten
Dept. of Statistics, AZR, Rotterdam
Classification and Regression Tree (CART) analysis can be applied for the
identification and assessment of prognostic factors in clinical research. It
involves repeated subdivisions of a group of subjects on the basis of optimal
cut-points of binary, ordinal, or continuous covariates, chosen to maximize a
certain split criterion. I will describe a specific implementation of CART as
the Stata ado-file cart.ado for failure time data, with an adjusted p-value
as the split criterion. The p-value is associated with the chi-square logrank
statistic based on residuals. The adjustment is for the multiple testing
associated with the search for the optimal cut-point with the maximum
chi-square value (Lausen, 1997). Examples of applications are given. CART has
a serious risk of overfitting. However, it can be a useful exploratory tool in
addition to more standard regression-type techniques.
Reference
- Lausen, B. et al. 1997. The regression tree method and its application in
  nutritional epidemiology. Informatik, Biometrie und Epidemiologie in
  Medizin und Biologie 28(1): 1–13.
Hand-outs/slides
cart.pdf
cart.ppt
Roberto G. Gutierrez
StataCorp
With the release of Stata 7, the capabilities of glm were greatly enhanced.
Among the improvements was the ability for users to program their own custom
link and variance functions. Whereas previously glm was used primarily as a
platform on which to compare the results of standard regression models (such
as the logistic, probit, and Poisson), it may now be utilised to perform
generalized maximum pseudo-likelihood estimation in any framework. Thus far,
this has been an ability that for the most part has not been exploited.
The method by which user-defined links and variance functions may be
incorporated is quite straightforward, as demonstrated in the companion text
to glm by Hardin and Hilbe (2001). In this talk, I present a few examples of
case studies from the literature where the science dictated the fitting of a
generalized linear model with special (non-standard) link and/or variance
function. I demonstrate how these models (which were typically fit using
SAS's GENMOD procedure) may be fit using Stata.
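
As a small illustration of glm as a platform for comparing standard models,
the sketch below fits the same binary response first with logit and then with
glm under two built-in links. It uses the auto dataset shipped with Stata and
only built-in family() and link() choices; it does not show a user-programmed
link or variance function, which is the subject of the talk.

```stata
sysuse auto, clear

* Reference fit with the dedicated command
logit foreign weight mpg

* Same model through glm: binomial family with the logit link
glm foreign weight mpg, family(binomial) link(logit)

* Swapping the link changes the model but not the framework
glm foreign weight mpg, family(binomial) link(cloglog)
```

The first two fits should agree on the coefficient estimates, while the
third shows how an alternative link is requested with a single option.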
Reference
- Hardin, J. and J. Hilbe. 2001. Generalized linear models and extensions.
  Stata Press, College Station, TX.
Philippe Van Kerm
CEPS/INSTEAD, Luxembourg
A set of Stata routines to help analyse `income mobility' is presented
and illustrated. Income mobility is taken here as the pattern of income change
from one time period to another within an income distribution. Multiple
approaches have been advocated to assess the magnitude of income mobility.
The macros presented provide tools for estimating several measures of income
mobility; e.g., the Shorrocks (JET 1978) or King (Econometrica 1983) indices
or summary statistics for transition matrices.
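
As one example of a summary statistic for a transition matrix, the sketch
below computes the normalised trace measure, (n - trace(P)) / (n - 1), for an
n x n matrix P of transition probabilities between income classes. The choice
of this particular measure is an assumption made for illustration, the matrix
entries are made-up numbers rather than real data, and the code is unrelated
to the routines presented in the talk.

```stata
* Made-up 3 x 3 transition matrix: row i gives the probabilities of moving
* from income class i to classes 1, 2 and 3 (each row sums to 1)
matrix P = (0.6, 0.3, 0.1 \ 0.2, 0.6, 0.2 \ 0.1, 0.3, 0.6)

* Normalised trace: 0 if nobody changes class, larger with more movement
scalar mtrace = (rowsof(P) - trace(P)) / (rowsof(P) - 1)
display "Normalised trace measure = " mtrace
```

For this matrix the trace is 1.8, so the measure is (3 - 1.8) / 2 = 0.6.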
Hand-outs/slides
vankerm.pdf
Niko Speybroeck
Dept. of Animal Health, ITG, Antwerpen
Frank Boelaert
European Commission, Belgium
Geert Molenberghs
Hasselt University, Belgium
Tomasz Burzykowski
Hasselt University, Belgium
Didier Renard
Hasselt University, Belgium
K. Mintiens
Maxime Madder
Dept. of Animal Health, ITG, Antwerpen
D. Berkvens
Dept. of Animal Health, ITG, Antwerpen
Abstract
Infectious bovine rhinotracheitis is caused by the bovine herpesvirus type 1.
It is an enzootic disease on the B List of the Office International des
Epizooties (O.I.E.). Programs to eradicate bovine herpesvirus type 1 have been
implemented in several European countries to facilitate the free trade of
cattle, semen, and embryos within the European Community. Therefore, Belgium
has an incentive to control and eradicate this viral infection. In the initial
stage of the eradication campaign, it is essential to survey the infection
prevalence. Also, it is important to investigate the survey results for
possible risk factors that might be associated with bovine herpesvirus-1
positivity among cattle. The national bovine herpesvirus 1 seroprevalence
(apparent prevalence) in the Belgian cattle population was determined by a
serological survey that was conducted from December 1997 to March 1998. In a
random sample of unvaccinated herds (N=309), all cattle (N=11,248) were tested
for the presence of antibodies to glycoprotein B of bovine herpesvirus 1. The
age and sex of the animals and the type (dairy, mixed, or beef) and size of
the herds were registered. The survey is an example of a stratified one-stage
cluster sampling design. Stata has some very useful commands for analysing
surveys. The dataset was analysed using the svylogit and gllamm
Stata commands, which provided similar results. The strengths of
svylogit and gllamm will be highlighted. We will also compare
the command gllamm with the SAS procedure NLMIXED, which produced
similar results on the analysed dataset. The binary response is the apparent
prevalence, which is the serological discrete test result (positive/negative).
The true infection status was mimicked via an ado-file, using expert opinion
on the uncertainty regarding the test misclassification probabilities. The
results based on the analysis using this new response were compared with those
based on the analysis of the original response.
William W. Gould
StataCorp
Bill Gould, who is President of StataCorp, and more importantly for this
meeting, the head of development, will ruminate about work at Stata over the
last year and about ongoing activity.
Scientific organizers
Alexander Volovics
Wim van Putten
Logistics organizers
Smit Consult and DPC, the official distributors of Stata in the Netherlands
and Germany, respectively.