Last updated: 1 December 2011
2011 Nordic and Baltic Stata Users Group meeting
11 November 2011
Karolinska Institutet
CMB, Berzelius väg 21
Solna Campus
Stockholm, Sweden
Proceedings
Quantile imputation of missing data
Matteo Bottai
Unit of Biostatistics, Institute of Environmental Medicine,
Karolinska Institutet, Sweden
Multiple imputation is an increasingly popular approach for the analysis of
data with missing observations. It is implemented in Stata's mi suite of
commands. I present a new Stata command for
imputation of missing values based on prediction of conditional quantiles of
missing observations given the observed data. The command does not require
making distributional assumptions and can be applied to impute dependent,
bounded, censored, and count data.
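The abstract gives neither the command's name nor its syntax, but the idea
can be sketched with official Stata alone: for each missing value, draw a
random quantile u and impute the conditional u-th quantile predicted by a
quantile regression (qreg) fit to the observed data. A minimal sketch, with
artificial missingness created in the auto data:

    sysuse auto, clear
    set seed 2011
    replace mpg = . if runiform() < .2             // artificial missingness
    generate double u = runiform() if missing(mpg) // one quantile per gap
    generate double mpg_imp = mpg                  // observed values kept
    forvalues i = 1/`=_N' {
        if missing(mpg[`i']) {
            // qreg drops rows with missing mpg, so the fit uses observed
            // data only; extreme draws of u may be slow to converge
            quietly qreg mpg weight foreign, quantile(`=u[`i']')
            quietly predict double xbhat, xb
            quietly replace mpg_imp = xbhat[`i'] in `i'
            drop xbhat
        }
    }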
Additional information
bottai_nordic11.pdf
Comparing observed and theoretical distributions
Maarten L. Buis
Institut fuer Soziologie, Universitaet Tuebingen, Germany
In this presentation, I aim to introduce graphical tools for comparing the
distribution of a variable in your dataset with a theoretical probability
distribution, like the normal distribution or the Poisson distribution. The
presentation will consist of two parts. In the first part, I will consider
univariate distributions, with a particular emphasis on hanging and suspended
rootograms (hangroot). Looking at univariate distributions
is not very common in a lot of (sub-(sub-))disciplines, but there are
situations where this can be very useful: For example, if we have a count of
accidents and we want to know whether these are occurring randomly, then we
can compare this variable with a Poisson distribution. Another example would
be simulations, where it is often the case that parameters or test statistics
should follow a certain distribution when the model that is being checked is
working as expected.
In the second part of the talk, I will focus on the more common situation
where models assume a certain distribution for the
explained/dependent/y variable, and I will estimate how one or more
parameters, often the mean, change when one or more
explanatory/independent/x variables change. The challenge now is
that the dependent variable no longer follows the theoretical distribution,
but rather a mixture of these theoretical distributions. In the case of a
linear regression, we can circumvent this difficulty by looking at the
residuals, which should follow a normal distribution. However, this
circumvention does not generalize to other models. I will show how to
graphically compare the distribution of the dependent variable with the
theoretical mixture distribution. The focus will be on a trick to sample new
dependent variables under the assumption that the model is true. Graphing the
distribution of the actual dependent variable together with these sampled
variables will give an idea of whether deviations from the theoretical
distribution could have occurred by chance. This idea will be applied to
checking the distributional assumption in beta regression
(betafit) and to choosing between different parametric survival models
(streg).
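For readers who want to experiment, hangroot is available from the SSC
archive. A minimal usage sketch (option names as I recall them from the
help file, so check ssc describe hangroot):

    ssc install hangroot
    sysuse nlsw88, clear
    hangroot wage                 // hanging rootogram vs. a fitted normal
    * for the accident-count example, with a count variable named accidents:
    * hangroot accidents, dist(poisson)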
Additional information
buis_nordic11.pdf
Simulating complex survival data
Michael J. Crowther
Department of Health Sciences,
University of Leicester, Leicester, United Kingdom
Paul C. Lambert
Department of Health Sciences, University of
Leicester, Leicester, United Kingdom and Department of Medical Epidemiology
and Biostatistics, Karolinska Institutet, Stockholm, Sweden
Simulation studies are essential for understanding and evaluating both
current and new statistical models. When simulating survival times, often an
exponential or Weibull distribution is assumed for the baseline hazard
function, but these distributions can be considered too simplistic and lack
biological plausibility in many situations. We will describe a new
user-written command, survsim, that allows the user to
simulate survival times from two-component mixture models, allowing much more
flexibility in the underlying hazard. Standard parametric models can also be
used, including the exponential, Weibull, and Gompertz models. Furthermore,
survival times can be simulated from the all-cause distribution of
cause-specific hazards for competing risks. A multinomial distribution is
used to create the event indicator, whereby the probability of experiencing
each event at a simulated time, t, is the cause-specific hazard divided by
the all-cause hazard evaluated at time t. Baseline
covariates and non-proportional hazards can be included in all scenarios.
Finally, we will discuss the complex extension of simulating joint
longitudinal and survival data.
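As a taste of the command, here is a minimal sketch of simulating from a
two-component Weibull mixture with a binary covariate (survsim is on SSC;
the option names below follow the help file as I recall it, so verify them
against the installed version):

    ssc install survsim
    clear
    set obs 1000
    set seed 2011
    generate trt = rbinomial(1, 0.5)
    * two-component Weibull mixture baseline, log hazard ratio -0.5 for trt,
    * administrative censoring at 5 years; stime = time, event = indicator
    survsim stime event, mixture lambdas(0.1 0.05) gammas(1 1.5) ///
        pmix(0.5) covariates(trt -0.5) maxtime(5)
    stset stime, failure(event)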
Additional information
crowther_nordic11.pdf
Quantiles of the survival time from inverse probability
weighted Kaplan–Meier estimates
Andrea Discacciati
Unit of Biostatistics and Nutritional Epidemiology,
Institute of Environmental Medicine, Karolinska Institutet, Sweden
The official Stata command stci indirectly estimates quantiles of the
survival time for different exposure levels from the Kaplan–Meier
estimates. However, stci does not take into account possible confounding
effects. Therefore, we introduce a new Stata command, stqkm, that
indirectly estimates quantiles of
the survival time from inverse probability weighted Kaplan–Meier
estimates. Confidence intervals for the quantile estimates are obtained
using the bootstrap method. We present a simulation study to assess the
performance of the stqkm command in the presence of confounding, and we
present a case study.
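For orientation, the official stci baseline that stqkm extends looks like
this (unadjusted survival-time quantiles by exposure group):

    webuse drugtr, clear
    stset studytime, failure(died)
    stci, p(25) by(drug)      // 25th percentile of survival time, by drug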
Additional information
discacciati_nordic11.pdf
An example of competing-risks analysis using Stata
Christel Häggström
Umeå University, Sweden
Competing-risks analysis in epidemiology is of special importance in survival
analysis when studying the elderly and also when the exposure is related to
early death. In a cohort study, I investigated the association between
metabolic factors (obesity, hypertension, high glucose levels, etc.) and
prostate cancer (with mean age at diagnosis of 70 years). Using these data,
I will present the analysis where I plotted cumulative incidence curves to
visualize the risk of prostate cancer in comparison with the competing risk,
all-cause mortality, for different levels of metabolic factors, using the
Stata commands stcompet and stpepemori. I also used Fine and Gray regression
(the stcrreg command) to calculate subdistribution hazard ratios for both
prostate cancer incidence and all-cause mortality.
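The Fine and Gray part of the analysis uses only official Stata. A minimal
sketch with the dataset from the [ST] stcrreg manual entry (variable roles
as I recall them from that example):

    webuse hypoxia, clear
    * failtype 1 = event of interest, failtype 2 = competing event
    stset dftime, failure(failtype == 1)
    stcrreg ifp tumsize pelnode, compete(failtype == 2)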
Additional information
haggstrom_nordic11.pdf
Using Stata for agent-based simulations
Peter Hedström
Institute for Futures Studies, Stockholm, Sweden
Thomas Grund
ETH Zürich, Switzerland
Agent-based modeling (ABM) is an analytical tool that is becoming
increasingly important in the social sciences. The core idea behind ABM is
to use computational models to analyze the macro- or aggregate-level outcomes
that groups of agents, in interaction with one another, bring about. In this
presentation, we briefly discuss why ABM is important and show how Stata can
be used for such analyses. We also present a suite of programs. Some of these
commands are used for generating, visualizing, or measuring various
properties of the networks within which the agents are embedded, and others
are used for analyzing the collective outcomes that agents are likely to
bring about when embedded in such networks.
A command for Laplace regression
Nicola Orsini
Unit of Biostatistics and Nutritional Epidemiology,
Institute of Environmental Medicine, Karolinska Institutet, Sweden
I present an estimation command for Laplace regression to model conditional
quantiles of a response variable given a set of covariates. The laplace
command is similar to the official qreg command except that it can account
for censored data. I
illustrate its applicability and use through examples from health-related
fields.
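A minimal usage sketch (the option names below are as I recall them from
the laplace help file; failure() marks the censoring indicator that qreg
cannot handle):

    * median survival time as a function of treatment and age,
    * accounting for right-censoring through the failure() option
    webuse drugtr, clear
    laplace studytime drug age, quantiles(50) failure(died)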
Additional information
orsini_nordic11.pdf
Using meta-analysis to inform the design of subsequent studies
Sally R. Hinchliffe, Michael J. Crowther, Alison
Donald, and Alex J. Sutton
Department of Health Sciences, University of
Leicester, Leicester, United Kingdom
In this presentation, we describe a suite of programs (metasim, metapow,
metapowplot) that enable the
user to estimate the probability that the conclusions of a meta-analysis will
change with the inclusion of a new study(ies), as described previously by
Sutton et al. (2007). Using the metasim program, we take a simulation
approach to estimating the effects in
future studies. The method assumes that the effect sizes of future
studies are consistent with those observed previously, as represented by
the current meta-analysis. The contexts of both two-arm randomized
controlled trials and studies of diagnostic test accuracy are considered for
a variety of outcome measures. Calculations are possible under both fixed-
and random-effect assumptions, and several approaches to inference, including
statistical significance and limits of clinical significance, are possible.
Calculations for specific sample sizes can be conducted (using metapow),
and plots, akin to traditional power curves, indicating the probability
that a new study(ies) will change inferences for a range of sample sizes
can be produced (using metapowplot).
Finally, plots of the simulation results are overlaid on a previously
described macro, extfunnel, which can help to intuitively
explain the results of such calculations of sample size. We hope the macro
will be useful to trialists who want to assess the impact potential new
trials will have on the overall evidence base and to meta-analysts who want
to
assess the robustness of the current meta-analysis to the inclusion of
future data.
Reference:
Sutton, A. J., N. J. Cooper, D. R. Jones, P. C. Lambert, J. R. Thompson, and K.
R. Abrams. 2007. Evidence-based sample size calculations based upon updated
meta-analysis.
Statistics in Medicine 27: 471–490.
Additional information
hinchcliffe_nordic11.pdf
Taking the pain out of looping and storing
Patrick Royston
MRC Clinical Trials Unit, United Kingdom
Quite a common task in Stata is to run some sequence of commands under the
control of a looping parameter and store the corresponding results in one
or more new variables. Over the years, I have written many such loops, some
of greater complexity than others. I finally became fed up with it and
decided to write a simple command to automate the repetitive parts. The
result is looprun, which I shall describe in this
presentation.
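The kind of loop being automated looks like this in plain Stata (my own
illustration of the repetitive pattern, not looprun's syntax): run a
regression per group and store the results in new variables.

    sysuse auto, clear
    generate b_weight  = .
    generate se_weight = .
    forvalues f = 0/1 {
        quietly regress mpg weight if foreign == `f'
        quietly replace b_weight  = _b[weight]  if foreign == `f'
        quietly replace se_weight = _se[weight] if foreign == `f'
    }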
Additional information
royston_nordic11.ppt
Projecting cancer incidence using restricted cubic splines
Mark J. Rutherford, Paul C. Lambert, and John R. Thompson
Department of Health Sciences, University of
Leicester, Leicester, United Kingdom
Age–period–cohort models provide a useful method for modeling
cancer incidence and mortality rates. There is great interest in estimating
the rates of disease at given future time points so that plans can be made
for the provision of the required future services. In the setting of using
age–period–cohort models incorporating restricted cubic splines,
we propose a new technique for projecting incidence. The method is validated
via a comparison with existing methods in the setting of Finnish Cancer
Registry data. The reasons for the improvements seen in the newly proposed
method are twofold. First, improvements are seen because of the finer
splitting of the timescale to give a more continuous estimate of the
incidence rate. Second, the new method uses more recent trends than
previously proposed methods to dictate the future projections. The output
will be produced via the user-written command apcfit. The
functionality of the command will be illustrated throughout the talk.
The talk will comprise an introduction to the use of restricted cubic splines
for model fitting before describing their use for
age–period–cohort models. A description of the new method for
projecting cancer incidence will be given prior to showing the results of the
application of the method to Finnish Cancer Registry data. The talk will
conclude with a description of the potential problems and issues when making
projections.
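apcfit's own syntax is covered in the talk materials; the restricted cubic
spline building block it rests on can be illustrated with official Stata
(the variables cases, pyears, and age here are hypothetical):

    * restricted cubic spline basis for age with 5 knots (4 basis variables),
    * entered into a Poisson model for the incidence rate
    mkspline agespl = age, cubic nknots(5)
    poisson cases agespl1-agespl4, exposure(pyears)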
Additional information
rutherford_nordic11.pdf
Time to dementia onset: Competing-risks analysis with Laplace regression
Giola Santoni, Debora Rizzuto, and Laura Fratiglioni
Aging Research Center, Karolinska Institutet, Sweden
We want to quantify the protective effect of education on time to dementia
onset using longitudinal data from a population study. We consider dropout
due to death of the subject as a competing event for the outcome of
interest. We show an adaptation of the Laplace regression method to the
case of competing-risks analysis. The first 20% of highly educated people
will develop dementia 2.5 years (p<.01) later than those with a lower
education level. The effect on all-cause mortality is negligible. We show
that the
results derived through Laplace regression are comparable with those derived
with the Stata command stcrreg.
Additional information
santoni_nordic11.pdf
Doubly robust estimation in generalized linear models with Stata
Arvid Sjölander
Department of Medical Epidemiology and
Biostatistics, Karolinska Institutet, Sweden
Nicola Orsini
Unit of Biostatistics and Nutritional Epidemiology,
Institute of Environmental Medicine, Karolinska Institutet, Sweden
The aim of epidemiological research is typically to estimate the association
between a particular exposure and a particular outcome, adjusted for a set of
additional covariates. This is commonly done by fitting a regression model
for the outcome, given exposure and covariates. If the regression model is
misspecified, then the resulting estimator may be inconsistent. Recently, a
new class of estimators has been developed, so-called “doubly
robust” (DR) estimators. These estimators use two regression models:
one for the outcome and one for the exposure. A DR estimator is consistent if
either model is correct, not necessarily both. Thus DR estimators give the
analyst two chances instead of only one to make valid inference. In this
presentation, we describe a new package for Stata that implements the most
common DR estimators.
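The package's name and syntax are not given in the abstract, but the logic
of one common DR estimator, the augmented inverse-probability-weighted
(AIPW) estimator of an average treatment effect, can be sketched with
official commands (dataset and covariates chosen purely for illustration):

    webuse cattaneo2, clear
    * exposure model: propensity score for smoking during pregnancy
    logit mbsmoke mage fbaby
    predict double ps, pr
    * outcome models, fit within each exposure group, predicted for everyone
    regress bweight mage fbaby if mbsmoke == 1
    predict double m1
    regress bweight mage fbaby if mbsmoke == 0
    predict double m0
    * AIPW contrast: consistent if either the outcome model or the
    * exposure model is correctly specified
    generate double aipw = (mbsmoke*bweight - (mbsmoke - ps)*m1)/ps ///
        - ((1 - mbsmoke)*bweight + (mbsmoke - ps)*m0)/(1 - ps)
    summarize aipw            // the mean is the AIPW effect estimate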
Additional information
sjolander_nordic11.pdf
Chained equations and more in multiple imputation in Stata 12
Yulia Marchenko
StataCorp LP
I present the new Stata 12 command, mi impute chained, to
perform multivariate imputation using chained equations (ICE), also known as
sequential regression imputation. ICE is a flexible imputation technique
for imputing various types of data. The variable-by-variable specification
of ICE allows you to impute variables of different types by choosing the
appropriate method for each variable from several univariate imputation
methods. Variables can have an arbitrary missing-data pattern. By
specifying a separate model for each variable, you can incorporate certain
important characteristics, such as ranges and restrictions within a subset,
specific to each variable. I also describe other new features in multiple
imputation in Stata 12.
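A short example of the variable-by-variable specification, along the lines
of the [MI] manual examples (dataset and variable names as I recall them
from the manual):

    webuse mheart8s0, clear
    * linear regression for bmi and age, logistic for hsgrad, multinomial
    * logistic for marstatus; attack and smokes are complete predictors
    mi impute chained (regress) bmi age (logit) hsgrad ///
        (mlogit) marstatus = attack smokes, add(10)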
Additional information
marchenko_nordic11.pdf
SEM for those who think they don’t care
Vince Wiggins
StataCorp LP
We will discuss SEM (structural equation modeling), not from the perspective
of the models for which it is most often used—measurement models,
confirmatory factor analysis, and the like—but from the perspective of
how it can extend other estimators. From a wide range of choices, we will
focus on extensions of mixed models (random- and fixed-effects regression).
Extensions include conditional effects (not completely random), endogenous
covariates, and others.
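A flavor of the idea in Stata 12 sem syntax: a linear regression in which
one covariate is treated as endogenous by letting the two equations' error
terms covary (a minimal sketch on arbitrary auto-data variables, not an
example from the talk):

    sysuse auto, clear
    * mpg equation with weight endogenous; length and displacement enter
    * only the weight equation and serve to identify the model
    sem (mpg <- weight foreign) (weight <- length displacement), ///
        cov(e.mpg*e.weight)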
Additional information
wiggins_nordic11.pdf
Scientific organizers
Peter Hedström, Metrika Consulting, Nuffield College and Oxford University
Nicola Orsini, Karolinska Institutet
Matteo Bottai, Karolinska Institutet
Logistics organizers
Metrika Consulting,
the official distributor of Stata in the Nordic and Baltic regions, and
Karolinska Institutet.