Proceedings
9:15–9:45 |
Abstract:
The Clinical Practice Research Datalink (CPRD) is a
centrally managed data warehouse, storing data provided by the
primary care sector of the United Kingdom (UK) National Health
Service (NHS). Medical researchers request retrievals from
this database, which take the form of a collection of text
datasets whose format can be complicated. I have written a
flagship package, cprdutil, with multiple modules to
input into Stata the many text dataset types provided in a
CPRD retrieval. These text datasets may be converted either to
Stata value labels or to Stata datasets, which can be created
complete with value labels, variable labels, and numeric Stata
dates. I have also written a fleet of satellite packages to
input into Stata the text datasets for retrievals of linked
data, in which data are provided from non-CPRD sources, with
CPRD identifier variables as a foreign key to allow data
linkage. I introduce the modules of cprdutil and give a
demonstration example in which I produce a minimal CPRD
database in Stata, using cprdutil, and in which I illustrate some
principles of sensible programming practice for creating large
databases.
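As a flavour of the kind of conversion involved (an illustrative
sketch only, not cprdutil's own syntax; the file and variable names
are hypothetical), a two-column lookup file with numeric codes and
textual meanings can be turned into a Stata value label like this:

    * read a hypothetical lookup file (numeric code, string meaning)
    import delimited using medical.txt, varnames(1) clear
    * build a value label mapping each code to its meaning
    capture label drop medcode
    quietly forvalues i = 1/`=_N' {
        label define medcode `=code[`i']' `"`=meaning[`i']'"', add
    }
    * save the label definition as a do-file for reuse
    label save medcode using medcode_label.do, replace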
Additional information: uk18_Newson.pdf
Roger B. Newson
Imperial College London
|
9:45–10:15 |
Abstract:
Multiarm multistage (MAMS) adaptive clinical trials offer
several practical advantages over traditional two-arm designs.
The framework proposed by Royston et al. (2011) uses
intermediate outcomes at interim analyses to drop research
arms demonstrating insufficient benefit prior to the final
analysis on the primary outcome. To our knowledge, the
nstage command developed for Stata (Barthel, Royston, and
Parmar, 2009) is the only sample size software for MAMS trials
with time-to-event outcomes, a common outcome measure in
modern trials in cancer, cardiovascular disease, and other
disease areas. We present an update to nstage to
increase the efficiency and uptake of MAMS designs.
nstage can now accommodate efficacy-stopping boundaries at interim
analyses through a new option. Users choose a stopping rule, and the
program estimates the operating characteristics of a design that can
assess early evidence of overwhelming efficacy on the primary outcome
while interim analyses for lack of benefit occur on an intermediate
outcome. The user specifies whether the trial is expected to terminate
or to continue with the remaining arms should an efficacious research
arm be identified before the final analysis. Because such a design
inflates the probability of a type I error, the updated command offers
an option to search for a design that strongly controls the maximum
familywise error rate at the desired level, where required. The
command estimates the operating characteristics of the chosen design
within a reasonable timeframe, allowing users to compare trial designs
for different input parameters easily. We illustrate how the updates
can be used to design a trial via the drop-down menu, using the MAMS
trial STAMPEDE as an example. We hope the new functionality of the
command will serve a broader range of trial objectives and thus
increase adoption of the design in practice.
Additional information: uk18_Choodari-Oskooei.pptx
Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL, London
Alexandra Blenkinsop
MRC Clinical Trials Unit at UCL, London
|
10:15–10:45 |
Abstract:
Spaghetti plots show many tangled lines (say, for multiple time
series or other functional traces) that are hard to
distinguish and interpret. Paella plots show multiple point
patterns for many groups, mixed up enough that comparisons are
difficult. The talk surveys several
tactics and strategies for better, friendlier comparisons.
Devices range from showing data several times over to
selection, smoothing, and transformation.
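One such tactic, sketched here in standard Stata graphics (my
example, not taken from the talk): show the data twice, with all
traces as a light-grey backdrop and one series highlighted in front.

    webuse grunfeld, clear
    sort company year
    * connect(L) joins points only while x increases, so the line
    * breaks automatically between companies
    twoway (line invest year, connect(L) lcolor(gs13))       ///
           (line invest year if company == 1, lcolor(navy)), ///
           legend(off) ytitle(Gross investment)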
Additional information: uk18_Cox.pptx
Nicholas J. Cox
Durham University
|
11:15–11:45 |
Abstract:
The overall look of Stata's graphs is determined by so-called
scheme files. Scheme files are system components, that is,
they are part of the local Stata installation. In this talk, I
will argue that style settings deviating from default schemes
should be part of the script producing the graphs rather than
being kept in separate scheme files, and I will present
software that supports such practice. In particular, I will
present a command called grstyle that allows users to
quickly change the overall look of graphs without having to
fiddle around with external scheme files. I will also present
a command called colorpalette that provides a wide
variety of colour schemes for use in Stata graphics.
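A minimal sketch of the intended workflow, assuming grstyle and
palettes are installed (ssc install grstyle; ssc install palettes;
recent versions of palettes also need colrspace):

    sysuse auto, clear
    grstyle init                    // open a temporary scheme
    grstyle set plain, horizontal   // plain look, horizontal y labels
    grstyle set color Set1          // plot colours from ColorBrewer Set1
    scatter mpg weight
    grstyle clear                   // back to the default scheme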
Additional information: uk18_Jann.pdf
Ben Jann
University of Bern
|
11:45–12:45 |
Abstract:
In this presentation, I will discuss some popular supervised and
unsupervised machine learning algorithms and their recommended
uses, and then present implementations in Stata.
The emphasis is on prediction and causal inference, and how to
tailor a method to a specific application.
Additional information: uk18_Nichols.pdf
Austin Nichols
Abt Associates
|
1:45–2:45 |
Abstract:
Stata 15 introduced the new estimation command menl for
fitting nonlinear mixed-effects models, also known as
nonlinear multilevel models and nonlinear hierarchical models.
These models can be thought of in two ways: as nonlinear
models containing random effects or as linear mixed-effects
models in which some or all fixed and random effects enter
nonlinearly. The overall error distribution is assumed to be
Gaussian. Nonlinear mixed-effects models have been used to
model drug absorption in the body, intensity of earthquakes,
and growth of plants, to name a few.
In my presentation, I will demonstrate how to use the menl command
to fit nonlinear mixed-effects models in a variety of applications,
including population pharmacokinetics and macroeconomics.
Additional information: uk18_Marchenko.pdf
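A minimal sketch, following the orange tree growth example in the
menl documentation: a logistic growth curve whose asymptote varies
across trees through a random intercept.

    webuse orange, clear
    * asymptote {b1} + {U0[tree]} varies across trees
    menl circumf = ({b1} + {U0[tree]})/(1 + exp(-(age - {b2})/{b3}))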
Yulia Marchenko
StataCorp
|
2:45–3:30 |
Abstract:
In a typical survival analysis, researchers study the time to an event of
interest. For example, in cancer studies,
researchers often wish to analyse a patient's time to death
since diagnosis. Similar applications also exist in economics
and engineering. In any case, often no distinction is made between
the different causes of the event of interest. Although this may
sometimes be useful, in many situations it does not paint the
entire picture and restricts the analysis. More commonly, the
event may occur because of different causes, which better reflects
real-world scenarios. For instance, if the event of interest
is death due to cancer, it is also possible for the patient to
die because of other causes. This means that the time at which
the patient would have died because of cancer is never observed.
These are known as competing causes of death or competing
risks. In a competing-risks analysis, interest lies in the
cause-specific cumulative incidence function (CIF). This can be
calculated either by (1) transforming (all) the cause-specific
hazards or by (2) using a direct relationship with the
subdistribution hazard.
Within the flexible parametric modelling framework, cause-specific
CIFs can be obtained under approach (1) by using the stpm2
postestimation command stpm2cif. Alternatively, because competing
risks is a special case of a multistate model, an equivalent model
can be fit using the multistate package. To estimate cause-specific
CIFs using approach (2), one can use stpm2 after applying
time-dependent censoring weights that are calculated on restructured
data using stcrprep. The above methods all involve some form of data
augmentation. Instead, estimation on individual-level data may be
preferred because of computational advantages. This is possible
under either approach, (1) or (2), with stpm2cr. In this talk, I
provide an overview of these various tools and discuss which of them
to use and when.
Additional information: uk18_Mozumder.pdf
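As a sketch of approach (1), following the general pattern of the
stpm2cif documentation (the variables id, time, and status, with
1 = cancer and 2 = other causes, are assumptions):

    * duplicate each subject: one row per competing cause
    expand 2
    bysort id: gen cause = _n
    gen cancer = (cause == 1)
    gen other  = (cause == 2)
    gen event  = (cause == status)  // failure from the row's own cause
    stset time, failure(event == 1)
    * one model holding both cause-specific hazards
    stpm2 cancer other, scale(hazard) df(4) tvc(cancer other) ///
        dftvc(4) rcsbaseoff nocons
    * cause-specific CIFs from the fitted hazards
    stpm2cif cif_cancer cif_other, cause1(cancer 1) cause2(other 1)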
Sarwar Islam Mozumder
Biostatistics Research Group, University of Leicester
|
4:00–4:15 |
Abstract:
The command makehlp was released in July 2012; it simplifies the
construction of a help file by producing a SMCL help template. The
command opens the ado-file and produces a template help file from
the syntax line. In the past, the user would need to edit this
template and fill in details such as the description, title, and
examples. The new version of makehlp keeps the old functionality
but also checks for returned results to automatically produce a
list of stored outputs. In addition, I introduce a new syntax so
that all the necessary text can be included in the ado-file for the
various sections, such as the description, title, examples, author,
references, see also, and all the option and return descriptions.
An example of the new syntax is desc[], which places all the text
between the brackets into the help file description, formatted
exactly as written, so SMCL commands are allowed. This means that
the ado-file can store the majority of the help file, and the help
file can subsequently be created from the ado-file.
Additional information: uk18_Mander.pdf
Adrian Mander
MRC Biostatistics Unit, University of Cambridge
|
4:15–4:30 |
Abstract:
Meta-analysis (MA) is a statistical technique for combining
results from multiple independent studies, with the aim of
estimating a single overall effect with a size, direction, and
precision consistent with the data. Traditionally, MA is
performed on aggregate data (AD), where each observation
represents the effect observed in a study, often derived from
study publications. The community-contributed command metan
(Harris et al., 2008) is by far the most popular Stata
command for performing AD MA, but it was last updated in 2010
and has various flaws and limitations.
The alternative to AD MA is to obtain and analyse individual
participant data (IPD), where the totality of data from all studies
is stacked to form a single large dataset. I have previously
described (Fisher 2015) a community-contributed command, ipdmetan,
that facilitates so-called "two-stage" IPD MA. The two stages are
fitting a given model to the data from each study in turn and
combining the results using AD techniques. The second stage,
performed using the AD command admetan, has now been expanded into a
fully comprehensive AD MA command, with all the functionality of
metan and much more besides. The co-author and maintainer of metan,
Ross Harris, has confirmed to me that he is no longer in a position
to maintain it and is happy for admetan to take its place.
Another important aspect of ipdmetan (and hence also admetan) is its
forest plot capabilities. Not only is the forest plot engine much
more efficient and capable of better plots "out of the box" compared
with metan; it also allows the user to save and edit "forestplot
results sets", which are interpreted directly by the standalone
command forestplot to produce fully flexible plots. I will take you
on a quick tour of admetan and forestplot and hope to encourage you
(and your colleagues and collaborators!) to use them in preference
to metan.
References:
Fisher, D. J. 2015. Two-stage individual participant data
meta-analysis and generalized forest plots. Stata Journal 15:
369–396.
Harris, R. J., M. J. Bradburn, J. J. Deeks, R. M. Harbord, D. G.
Altman, and J. A. C. Sterne. 2008. metan: fixed- and random-effects
meta-analysis. Stata Journal 8: 3–28.
Additional information: uk18_Fisher.pptx
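A hedged sketch of the intended workflow (the variable names are
assumptions; exact options per help admetan), with admetan installed
from SSC:

    * AD meta-analysis of study-level log odds-ratios and their SEs
    admetan logor selogor, study(studyname) random eform ///
        effect(Odds ratio) saving(fp_results, replace)
    * the saved "forestplot results set" can be edited and re-drawn
    use fp_results, clear
    forestplot, eform effect(Odds ratio)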
David Fisher
MRC Clinical Trials Unit at UCL
|
4:30–5:00 |
Abstract:
In observational studies with time-to-event outcomes, we
expect that there will be confounding and would usually adjust
for these confounders in a survival model. From such models,
an adjusted hazard ratio comparing exposed and unexposed
subjects is often reported. This is fine, but hazard ratios
can be difficult to interpret and are not collapsible. There
are further problems when trying to interpret hazard ratios as
causal effects. Risks are much easier to interpret than rates,
so quantifying the difference on the survival scale can be
desirable.
In Stata, stcurve gives survival curves after fitting a model in
which certain covariates can be given specific values; those not
specified are set to their mean values. Thus, it gives a prediction
for an individual who happens to have the mean value of each
covariate, which may not reflect the average in the population. An
alternative is to use standardization to estimate marginal effects,
where the regression model is used to predict the survival curve for
unexposed and exposed subjects at all combinations of the other
covariates included in the model. These predictions are then
averaged to give marginal effects.
I will describe a command, stpm2_standsurv, that obtains various
standardized measures after fitting a flexible parametric survival
model. The command can estimate standardized survival curves, the
marginal hazard function, the standardized restricted mean survival
time, and centiles of the standardized survival curve. Contrasts
(differences, ratios) can be made between any of these measures, and
a user-defined function can be given for more complex contrasts.
Additional information: uk18_Lambert.pdf
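A hedged sketch, assuming stpm2 and stpm2_standsurv are installed
from SSC (the dataset and covariates are my choices, not the
talk's):

    webuse brcancer, clear
    stset rectime, failure(censrec == 1) scale(365.25)
    stpm2 hormon x1, scale(hazard) df(4)  // flexible parametric model
    range tt 0 5 200                      // prediction time points
    * standardized survival under "all untreated" vs "all treated",
    * plus the difference between the curves with a CI
    stpm2_standsurv, at1(hormon 0) at2(hormon 1) timevar(tt) ///
        contrast(difference) ci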
Paul C. Lambert
Biostatistics Research Group, University of Leicester
Karolinska Institutet
|
5:00–5:30 |
Abstract:
Stepping back from Mata, and even a little back from the book
itself, I use the publication of The Mata Book as an excuse to
describe Mata, its features, and what programming in Mata can
achieve.
Additional information: uk18_Gould.pdf
William Gould
StataCorp
|
5:30–close |
Abstract:
The 2017 Stata Journal Editors' Prize will be presented
symbolically to Ben Jann.
Newton, H. J., and N. J. Cox. 2017. The Stata Journal Editors' Prize
2017: Ben Jann. Stata Journal 17: 781–785.
|
9:00–9:30 |
Abstract:
We estimate response surface coefficients for a large range of
quantiles of the Leybourne and Taylor (2003, Journal of
Time Series Analysis 24: 441–460) test for the presence of
seasonal unit roots. This test statistic offers power gains over
the familiar regression-based approach advocated by Hylleberg et
al. (1990, Journal of Econometrics 44: 215–238). That approach is
currently implemented in Stata via the command sroot, developed by
Depalo (2009, Stata Journal 9: 422–438), and extended by the
command hegy of del Barrio Castro, Bodnar, and Sansó (2016,
Stata Journal 16: 740–760). The main feature of the
Leybourne and Taylor test is that it achieves power gains
through the use of forward and reverse HEGY regressions. The
estimated response surfaces allow for different combinations
of number of observations T and lag order in the test
regressions p, where the latter can be either specified by
the user or endogenously determined by the underlying data.
The critical values depend on the method used to select the
number of lags. We introduce the new Stata command ltur
and illustrate its use with an empirical example. The new
command permits the computation of the Leybourne and Taylor
test statistics along with their associated critical values
and approximate probability values.
Additional information: uk18_Otero.pdf
Jesús Otero
Universidad del Rosario
Kit Baum
Boston College
|
9:30–10:00 |
Abstract:
Autoregressive distributed lag (ARDL) models are often used to
analyse dynamic relationships with time-series data in a
single-equation framework. The current value of the dependent
variable is allowed to depend on its own past realisations—the autoregressive
part—as well as current and past values
of additional explanatory variables—the distributed lag
part. The variables can be stationary, nonstationary, or a
mixture of both. In its equilibrium correction (EC)
representation, the ARDL model can be used to separate the
long-run and short-run effects, and to test for cointegration
or, more generally, for the existence of a long-run
relationship among the variables of interest.
This talk serves as a tutorial for the ardl Stata command, which can
be used to fit an ARDL or EC model with the optimal number of lags
based on the Akaike or Schwarz/Bayesian information criterion. I
will address frequently asked questions and provide step-by-step
instructions for the Pesaran, Shin, and Smith (2001, Journal of
Applied Econometrics) bounds test for the existence of a long-run
relationship. This test is implemented as the postestimation command
estat ectest, which features newly computed finite-sample critical
values and approximate p-values. These critical values cover many
model configurations and supersede previous tabulations available in
the literature. They account for the sample size, the chosen lag
order, the number of explanatory variables, and the choice of
unrestricted or restricted deterministic model components. The ardl
command uses Stata's regress command to fit the model. As a
consequence, specification tests can be carried out with the
standard postestimation commands for linear (time-series)
regressions, and the forecast command suite can be used to obtain
dynamic forecasts.
Additional information: uk18_Kripfganz.pdf
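A hedged sketch using a standard dataset (ssc install ardl; the
variable choices are mine, not the authors'):

    webuse lutkepohl2, clear
    tsset qtr
    * EC representation, lag order selected by the AIC
    ardl ln_consump ln_inc ln_inv, ec aic
    * Pesaran-Shin-Smith bounds test for a long-run relationship
    estat ectest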
Sebastian Kripfganz
University of Exeter Business School
Daniel C. Schneider
Max Planck Institute for Demographic Research
|
10:00–10:30 |
Abstract:
The package multishell is intended to speed up
simulations by using multicore processors and Stata's
shell command. In a first step, one or multiple do-files
are converted into batch files and added to a queue. After
starting the main command, the current instance of Stata acts
as an organiser and works through the queue. It allocates the
batch files to a preset number of Stata instances running in
parallel.
multishell has several distinct features. If do-files include
forvalues and foreach loops, multishell dissects the loops and
creates a new do-file for each combination, which is added to the
queue. This allows for an efficient allocation and use of processor
power. multishell can also be used to connect two or more computers
into a cluster: it then allocates parts of the queue to each
computer, and a simulation runs in parallel on multiple computers.
Computational power is used efficiently, and time is saved.
Additional information: uk18_Ditzen.pdf
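Not multishell's own syntax (see its help file), but the mechanism
it automates can be illustrated with Stata's shell command, which
launches a further Stata instance in batch mode (the executable path
and do-file name here are hypothetical):

    * run sim_part1.do in a separate batch-mode Stata instance (Windows)
    shell "C:\Program Files\Stata15\StataMP-64.exe" /e do sim_part1.do
    * the macOS/Linux equivalent is along the lines of
    * shell /usr/local/stata15/stata-mp -b do sim_part1.do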
Jan Ditzen
Centre for Energy Economics Research and Policy, Heriot-Watt University
|
11:00–11:30 |
Abstract:
merlin can do a lot of things. From linear regression
to a Weibull survival model, from a three-level logistic
model to a multivariate joint model of multiple longitudinal
outcomes, a recurrent event, and survival. merlin can do
things I haven't even thought of yet. I'll take a single
dataset, attempt to show you the full range of
capabilities of merlin, and talk about some of the new
features following its rise from the ashes of megenreg.
There'll even be some surprises.
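A hedged sketch of the simplest end of that range, assuming merlin
is installed from SSC (the data choice is mine):

    webuse brcancer, clear
    gen years = rectime/365.25
    * a Weibull survival model for the effect of hormonal therapy
    merlin (years hormon, family(weibull, failure(censrec)))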
Additional information: uk18_Crowther.pdf
Michael J. Crowther
Biostatistics Research Group, University of Leicester
|
11:30–12:30 |
Abstract:
Latent class analysis (LCA) allows us to identify and understand
unobserved groups in our data. These groups may be consumers with
different buying preferences, adolescents with different patterns of
behaviour, or different health status classifications.
Stata 15 introduced new features for performing LCA. In this
presentation, I will demonstrate how to use gsem with categorical
latent variables to fit standard latent class models—models that
identify unobserved groups based on a set of categorical outcomes. I
will also show how we can extend the standard model to include
additional equations and to identify groups using continuous, count,
ordinal, and even survival-time outcomes. We will use the results of
these models to determine who is likely to be in a group and how
that group's characteristics differ from those of other groups.
Additional information: uk18_MacDonald.pdf
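A minimal sketch following the latent class examples in the
Stata 15 [SEM] manual:

    webuse gsem_lca1, clear
    * two latent classes from four binary items
    gsem (accident play insurance stock <- ), logit lclass(C 2)
    estat lcprob                    // marginal class probabilities
    estat lcmean                    // class-specific response probabilities
    predict cpr*, classposteriorpr  // posterior membership probabilities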
Kristin MacDonald
StataCorp
|
1:45–2:45 |
Abstract:
The field of machine learning is attracting increasing
attention among social scientists and economists. At the same
time, Stata offers only a limited set of machine
learning tools to date. This one-hour session introduces two Stata
packages, lassopack and pdslasso, which
implement regularized regression methods, including but not
limited to the lasso (Tibshirani 1996 Journal of the
Royal Statistical Society Series B), for Stata. The
packages include features intended for prediction, model
selection, and causal inference, and are thus applicable in
many settings. The commands allow for high-dimensional models,
where the number of regressors may be large, or may even exceed the
number of observations, under the assumption of sparsity.
The package lassopack implements the lasso, square-root lasso
(Belloni et al. 2011, Biometrika; 2014, Annals of Statistics),
elastic net (Zou and Hastie 2005, Journal of the Royal Statistical
Society Series B), ridge regression (Hoerl and Kennard 1970,
Technometrics), adaptive lasso (Zou 2006, Journal of the American
Statistical Association), and postestimation OLS. These methods rely
on tuning parameters, which determine the degree and type of
penalization. lassopack supports three approaches for selecting
these tuning parameters: information criteria (implemented in
lasso2), K-fold and h-step-ahead rolling cross-validation (cvlasso),
and theory-driven penalization (rlasso) due to Belloni et al. (2012,
Econometrica). In addition, rlasso implements the Chernozhukov et
al. (2013, Annals of Statistics) sup-score test of joint
significance of the regressors.
The package pdslasso offers methods to facilitate causal inference
in structural models. It implements methods for selecting control
variables (pdslasso), instruments (ivlasso), or both from a large
set of variables, in a setting where the researcher is interested in
estimating the causal impact of one or more (possibly endogenous)
variables of interest. pdslasso and ivlasso rely on the lasso and
square-root lasso estimators implemented in lassopack. ivlasso also
supports weak-identification-robust hypothesis tests and confidence
sets.
Additional information: uk18_Ahrens.pdf
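A hedged sketch of the three tuning approaches, assuming lassopack
is installed (ssc install lassopack):

    sysuse auto, clear
    * (1) information criteria: fit the path, refit at the EBIC minimum
    lasso2 price mpg headroom trunk weight length turn, lic(ebic)
    * (2) 10-fold cross-validation, refitting at the CV-optimal lambda
    cvlasso price mpg headroom trunk weight length turn, lopt seed(42)
    * (3) theory-driven ("rigorous") penalization
    rlasso price mpg headroom trunk weight length turn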
Achim Ahrens
Economic and Social Research Institute, Dublin
Christian B. Hansen
Booth School of Business, University of Chicago
Mark E. Schaffer
Heriot-Watt University, Edinburgh
|
2:45–3:15 |
Abstract:
Matching is a popular estimator of average treatment effects (ATEs)
in observational studies within the counterfactual framework. In
recent years, however, many scholars have questioned the validity of
this approach for causal inference, because its reliability draws
heavily upon the so-called selection-on-observables assumption.
When unobservable confounders are possibly at work, they say, it
becomes hard to trust matching results, and the analyst should
consider alternative methods suitable for tackling unobservable
selection. Unfortunately, these alternatives require extra
information that may be costly to obtain, or even inaccessible. For
this reason, some scholars have proposed matching sensitivity tests
for the possible presence of unobservable selection. The literature
sets out two methods: the Rosenbaum (1987) test and the Ichino,
Mealli, and Nannicini (2008) test. Both are implemented in Stata. In
this work, I propose a third and different sensitivity test for
unobservable selection in matching estimation, based on a
"leave-covariates-out" (LCO) approach. Rooted in the machine
learning literature, this sensitivity test performs a bootstrap over
different subsets of covariates and simulates various estimation
scenarios to be compared with the baseline matching estimated by the
analyst. Finally, I will present sensimatch, the Stata routine I
developed to run this method, and provide some instructional
applications on real datasets.
Additional information: uk18_Cerulli.pdf
Giovanni Cerulli
IRCrES-CNR, Italy
|
3:45–4:15 |
Abstract:
In regression analysis, it is well known that skewness and
excessive tail heaviness affect the efficiency of classical
estimators. In this work, we propose an estimator that is
highly efficient for many distributions. More
specifically, in accordance with standard Le Cam theory, we
define a sign-and-rank–based estimator of the regression
coefficients as a one-step update, based on a fully
semiparametrically efficient central sequence, of an initial
root-n-consistent estimator.
In the central sequence, the score function, initially defined on
the basis of the exact underlying innovation density f, is estimated
using the fact that f can be well approximated by a Tukey g-and-h
distribution. We present the results of some Monte Carlo simulations
conducted to assess the finite-sample performance of our estimator,
compared with the ordinary least-squares estimator and the
approximated maximum likelihood estimator. We propose a Stata
command, flexrank, to implement it in practice. The procedure is
very fast and has low computational complexity.
Additional information: uk18_Verardi.pdf
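For reference (my addition, in standard notation), the Tukey g-and-h
family transforms a standard normal variate Z as

    X = A + B \frac{e^{gZ} - 1}{g} \exp\!\left(\frac{h Z^{2}}{2}\right),

where A and B are location and scale, g controls skewness, and h
controls tail heaviness; this flexibility is what allows the
innovation density f to be well approximated.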
Vincenzo Verardi
Université Libre de Bruxelles
|
4:15–close |
Abstract:
Stata developers present will carefully and cautiously
consider wishes and grumbles from Stata users in the audience.
Questions, and possibly answers, may concern reports of
present bugs and limitations or requests for new features in
future releases of the software.
StataCorp personnel
StataCorp
|