Last updated: 17 October 2011
2011 UK Stata Users Group meeting
15–16 September 2011
Centre for Econometric Analysis
Cass Business School
106 Bunhill Row
London EC1Y 8TZ
United Kingdom
Proceedings
Sensible parameters for polynomials and other splines
Roger B. Newson
National Heart and Lung Institute, Imperial College London
Splines, including polynomials, are traditionally used to model nonlinear
relationships involving continuous predictors. However, when they are
included in linear models (or generalized linear models), the estimated
parameters for polynomials are not easy for nonmathematicians to understand,
and the estimated parameters for other splines are often not easy even for
mathematicians to understand. It would be easier if the parameters were
values of the polynomial or spline at reference points on the x-axis, or
differences or ratios between the values of the spline at those reference
points and the value of the spline at a base reference point. The
bspline package, downloadable from Statistical Software Components (SSC),
generates spline bases for inclusion in the design matrices of linear
models, based on Schoenberg B-splines. The package has a recently added
module
flexcurv, which inputs a sequence of reference points on the
x-axis and outputs a spline basis, based on equally spaced knots
generated automatically, whose parameters are the values of the spline at
the reference points. This spline basis can be modified by excluding the
spline vector at a base reference point and including the unit vector. If
this is done, then the parameter corresponding to the unit vector will be
the value of the spline at the base reference point, and the parameters
corresponding to the remaining reference spline vectors will be differences
between the values of the spline at the corresponding reference points and
the value of the spline at the base reference point. The spline bases are
therefore extensions, to continuous factors, of the bases of unit vectors
and/or indicator functions used to model discrete factors. It is possible to
combine these bases for different continuous and/or discrete factors in the
same way, using product bases in a design matrix to estimate factor-value
combination means and/or factor-value effects and/or factor interactions.
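As an illustration (a minimal sketch, not taken from the talk materials), the workflow might look like the following; the flexcurv option names xvar(), refpts(), power(), and generate() are quoted from memory of the SSC package's help file and should be checked with help flexcurv after installation:

    // Sketch only: a cubic spline in mpg whose parameters are the fitted
    // values of the spline at the listed reference points
    ssc install bspline, replace
    sysuse auto, clear
    flexcurv, xvar(mpg) refpts(12 20 30 41) power(3) generate(sp_)
    regress price sp_*, noconstant
    // each coefficient is the fitted value of the spline at one reference point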
Additional information
UK11_newson.pdf
UK11_newson_dofiles1.zip
Experiences and lessons learned from bootstrapping random-effects predictions
Robert Grant
Kingston University and St. George’s University of London
Background: Random effects are commonly modeled in multilevel, longitudinal,
and latent-variable settings. Rather than estimating fixed effects for
specific clusters of data, “predictions” can be made as the mode
or mean of posterior distributions that arise as the product of the random
effect (an empirical Bayes prior) and the likelihood function conditional on
cluster membership.
Analyses and data: This presentation will explore the
experiences and lessons learned in using the bootstrap for inference on
random-effects predictions following logistic regression models fitted with both
xtmelogit and
gllamm.
In the United Kingdom, 203 hospitals were compared on the
quality of care received by 10,617 stroke patients through multilevel
logistic regression models.
Results and considerations: Multilevel
modeling and prediction are both computer-intensive, and so bootstrapping
them is especially time-consuming. Examples from do-files with some helpful
approaches will be shown. A small proportion of modal best linear
unbiased predictors contained
errors, possibly arising from the prediction algorithm. Various bootstrap
confidence intervals exhibited problems such as excluding the point
prediction and degeneracy. Methods for tracing the source will be presented.
Conclusion: Bootstrapping provides flexible but time-consuming inference for
individual clusters’ predictions. However, there are potential
problems that analysts should be aware of.
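A minimal sketch of the kind of cluster bootstrap involved is given below; the variable names (goodcare, casemix, hospital) are hypothetical, and in practice much more care is needed over clusters that are resampled several times or not at all:

    // Sketch only: bootstrap the predicted random effect for one hospital
    // after -xtmelogit-, resampling whole hospitals
    program define reboot, rclass
        preserve
        bsample, cluster(hospital) idcluster(newhosp)
        xtmelogit goodcare casemix || newhosp:
        predict double u, reffects
        // hospital 1 may appear several times (its copies are averaged) or
        // not at all (a missing value is returned) in a given replicate
        summarize u if hospital == 1, meanonly
        return scalar u1 = r(mean)
        restore
    end
    simulate u1 = r(u1), reps(200) seed(2011): reboot
    centile u1, centile(2.5 97.5)     // crude percentile interval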
Additional information
UK11_Grant.ppt
Sensitivity analysis for randomized trials with missing outcome data
Ian White
MRC Biostatistics Unit, Cambridge
Any analysis with incomplete data makes untestable assumptions about the
missing data, and analysts are therefore urged to conduct sensitivity
analyses. Ideally, a model is constructed containing a nonidentifiable
parameter
d, where
d = 0 corresponds to the assumption made in
the standard analysis, and the value of
d is then varied in a range
considered plausible in the substantive context. I have produced Stata
software for performing such sensitivity analyses in randomized trials with
a single outcome, when the user specifies a value or range of values of
d. The analysis model is assumed to be a generalized linear model
with adjustment for baseline covariates. I will describe the statistical
model used to allow for the missing data, sketch the programming required
to obtain a sandwich variance estimator, and describe modifications needed
to make the results given when
d = 0 correspond exactly to those
obtained by standard methods. I will illustrate the use of the software for
binary and continuous outcomes, when the standard analysis assumes either
missing at random or (for a binary outcome)
“missing =
failure”.
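The command itself is not reproduced here, but the underlying idea can be sketched by hand for a continuous outcome; the variable names (y, treat, baseline) and the value of d are hypothetical, and this delta adjustment via mi is only one way of implementing such an analysis:

    // Sketch only: impute under MAR (d = 0), then shift the imputed values
    // by d in one arm and re-analyze
    generate byte miss_y = missing(y)
    mi set flong
    mi register imputed y
    mi impute regress y treat baseline, add(20) rseed(1234)
    local d 0.5                // assumed mean shift for unobserved outcomes
    replace y = y + `d' if miss_y & treat & _mi_m > 0
    mi update
    mi estimate: regress y treat baseline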
Additional information
UK11_White.pdf
Implementing the continual reassessment method (CRM)
Adrian Mander
MRC Biostatistics Unit Hub for Trials Methodology, Cambridge
One of the aims of a phase I trial in oncology is to find the maximum
tolerated dose. A set of doses is administered to participants starting from
the lowest dose in increasing steps. To do this safely, the toxicity of each dose
is assessed, and a decision is made about whether to proceed with the next
highest dose until the desired target toxicity level is found. A suitable
dose is then chosen to take forward into phase II studies to discover
whether this drug is efficacious. The majority of oncology phase I trials
use algorithm-based rules such as the 3 + 3 design to escalate doses; the 3 + 3
design is easy to implement by nonstatisticians but is statistically
inefficient. Other designs, such as the continual reassessment method
(O’Quigley, Pepe, and Fisher 1990), use a model to help guide the decision of
which dose to give. The complexity of the CRM, and the fact that it requires
software, may be reasons why it is not more widely used. This talk will
describe a new command,
crm, a Mata implementation of the CRM, and will include
some discussion of the programming difficulties.
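The crm command itself is not shown here, but the core calculation of a one-parameter power-model CRM can be sketched in Mata; the skeleton, the N(0, 1.34) prior, and the 0.25 target below are common illustrative choices, not values taken from the talk:

    // Sketch only: grid-integration CRM update for a power model
    mata:
    real scalar crm_nextdose(real rowvector skeleton, real colvector dose,
                             real colvector tox, real scalar target)
    {
        real colvector a, w, like, p
        real rowvector phat
        real scalar i, ahat

        a = rangen(-10, 10, 2001)                // grid for the model parameter
        w = normalden(a, 0, sqrt(1.34))          // N(0, 1.34) prior density
        like = J(rows(a), 1, 1)
        for (i = 1; i <= rows(dose); i++) {
            p = skeleton[dose[i]] :^ exp(a)      // toxicity at the dose given
            like = like :* (tox[i] :* p + (1 :- tox[i]) :* (1 :- p))
        }
        ahat = sum(a :* like :* w) / sum(like :* w)   // posterior mean of a
        phat = skeleton :^ exp(ahat)                  // updated toxicity estimates
        return(order(abs(phat :- target)', 1)[1])     // dose closest to target
    }
    // example: 5-dose skeleton, 3 patients so far, one toxicity at dose 3
    crm_nextdose((0.05, 0.10, 0.20, 0.35, 0.50), (1\2\3), (0\0\1), 0.25)
    end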
Additional information
UK11_Mander.pdf
A review of estimators for the fixed-effects ordered logit model
Arne Risa Hole
University of Sheffield
Joint with Andy Dickerson and Luke Munford
It is well known that the dummy-variable estimator for the fixed-effects
ordered logit model is inconsistent when
T, the time dimension of
the panel, is fixed. This talk will review a range of alternative
fixed-effects ordered logit
estimators that are based on Chamberlain’s fixed-effects estimator
for the binary logit model. The talk will present Stata code for the
estimators and discuss the available evidence on their finite-sample
performance. We will conclude by presenting an empirical example in which
the estimators are used to model the relationship between commuting and life
satisfaction.
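As one concrete member of this family, the "blow-up and cluster" construction can be sketched as follows; the dataset and variable names (id, wave, lifesat coded 1–5, commute) are hypothetical:

    // Sketch only: dichotomize the ordered outcome at every cutpoint, stack
    // the copies, and fit Chamberlain's conditional logit clustered on id
    expand 4                                  // 5 categories -> 4 cutpoints
    bysort id wave: generate byte cutpt = _n + 1
    generate byte dk = lifesat >= cutpt if !missing(lifesat)
    egen long gid = group(id cutpt)           // fixed effect = person x cutpoint
    clogit dk commute age lninc, group(gid) vce(cluster id)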
Additional information
UK11_Hole.pdf
Generalized method of moments fitting of structural mean models
Tom Palmer
MRC CAiTE Centre, School of Social and Community Medicine, University of Bristol
Joint with Roger Harbord, Paul Clarke, and Frank Windmeijer
In this talk we describe how to fit structural mean models (SMMs), as
proposed by Robins, using instrumental variables in the generalized
method of moments (GMM) framework using Stata’s
gmm command.
The GMM approach is flexible because it can fit overidentified models in
which there are more instruments than endogenous variables. It also allows
assessment of the joint validity of the instruments using Hansen’s
J
test through Stata’s
estat overid gmm postestimation command.
In the case of the logistic SMM, the approach also allows different first-stage
association models. We show the relationship between the
multiplicative SMM and the multiplicative GMM estimator implemented in the
ivpois command of Nichols (2007). For the multiplicative SMM, we
show—analogously to Imbens and Angrist (1994) for the linear case—that the
estimate is a weighted average of local estimates using the instruments
separately. To demonstrate the models, we use a Mendelian randomization
example, in which genotypes found to be robustly associated with risk
factors from genome-wide association studies are used as instrumental
variables to investigate the effect of being overweight on the risk of
hypertension in the Copenhagen General Population Study.
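A minimal sketch of the gmm specifications involved is shown below for the additive and multiplicative SMMs, with hypothetical variable names (outcome y, exposure x, genotype instruments z, z1, z2); the logistic SMM requires an additional association model and is not shown:

    // Sketch only
    // Additive SMM: E[y - psi*x | z] does not vary with z
    gmm (y - {psi}*x - {ey0}), instruments(z) onestep
    // Multiplicative SMM: E[y*exp(-psi*x) | z] does not vary with z
    gmm (y*exp(-{psi}*x) - {ey0}), instruments(z) onestep
    // With two or more instruments the model is overidentified and
    // Hansen's J test is available after two-step GMM
    gmm (y*exp(-{psi}*x) - {ey0}), instruments(z1 z2) twostep
    estat overid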
Additional information
UK11_palmer_handouts.pdf
UK11_palmer_presentation.pdf
Flexible joint modeling of longitudinal and time-to-event data
Michael J. Crowther
Department of Health Sciences, University of Leicester
Joint with Keith R. Abrams and Paul C. Lambert
The joint modeling of longitudinal and time-to-event data has exploded in
the methodological literature in the past decade; however, the availability
of software to implement the methods lags behind. The most common form of
joint model assumes that the association between the survival and
longitudinal processes are underlined by shared random effects. As a result,
computationally intensive numerical integration techniques such as
Gauss–Hermite quadrature are required to evaluate the likelihood. We
describe a new user-written command
jm, which allows the user to
jointly model a continuous longitudinal response and an event of interest.
We assume a linear mixed-effects model for the longitudinal submodel, thereby
allowing flexibility through the use of fixed and/or random fractional
polynomials of time. We also assume a flexible parametric model (
stpm2) for the
survival submodel. Flexible parametric models are fitted on the log
cumulative hazard scale, which has direct computational benefits because it
avoids the use of numerical integration to evaluate the cumulative hazard. We
describe the features of
jm through application to a dataset
investigating the effect of serum albumin level on time to death from any
cause in 252 patients suffering end-stage renal disease.
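The jm syntax itself is not reproduced here; as a point of reference, the two submodels it combines can be fit separately as below, assuming hypothetical variable names (albumin, time, trt, id, stime, died) and stpm2 installed from SSC:

    // Sketch only: the building blocks, fit separately rather than jointly
    // Longitudinal submodel: linear mixed model with random intercept and slope
    xtmixed albumin c.time trt || id: time, covariance(unstructured)
    // Survival submodel: flexible parametric model on the log cumulative
    // hazard scale, 3 df for the baseline
    stset stime, failure(died)
    stpm2 trt, scale(hazard) df(3) eform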
Additional information
UK11_crowther.pdf
Sample size and power estimation when covariates are measured with error
Michael Wallace
London School of Hygiene and Tropical Medicine
Measurement error in exposure variables can lead to bias in effect
estimates, and methods that aim to correct this bias often come at the
price of greater standard errors (and so, lower statistical power). This
means that standard sample size calculations are inadequate and that, in
general, simulation studies are required. Our routine
autopower aims
to take the legwork out of this simulation process, restricting attention to
univariate logistic regression where exposures are subject to classical
measurement error. It can be used to estimate the power of a particular model
setup or to search for a suitable sample size for a desired power. The
measurement error correction methods that are employed are regression
calibration (
rcal) and a conditional score method, for which we also
introduce a Stata routine.
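The general simulation idea (though not autopower's actual code) can be sketched with core Stata commands; the sample size, effect size, and error variance below are arbitrary illustrative values, and only the naive analysis of the error-prone exposure is shown:

    // Sketch only: simulation-based power with classical measurement error
    program define mepower, rclass
        drop _all
        set obs 500                               // candidate sample size
        generate double x = rnormal()             // true exposure
        generate double w = x + rnormal(0, 0.5)   // observed, with classical error
        generate byte y = runiform() < invlogit(-1 + 0.4*x)
        logit y w
        return scalar sig = abs(_b[w]/_se[w]) > invnormal(0.975)
    end
    simulate sig = r(sig), reps(1000) seed(123): mepower
    summarize sig                                 // mean = estimated power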
Additional information
UK11_wallace.ppt
Spline models for prediction of house prices
David Boniface
Epidemiology and Public Health, University College London
Aim: To create a web-based facility for customers to enter the address of a
house and obtain, within milliseconds, a graph showing the trend in the price
of the house since it was last sold, extrapolated to the current date.
Method: The UK Land Registry of house sale prices was used to estimate mean
price trends from 2000 to 2010 for each category of house. The Stata
ado-file
uvrs (with user-specified knots) was used to model the
curve. The parameter estimates were saved. Later, to respond in real time
to a query about a particular house,
splinegen was used to generate
the spline curve for the appropriate time period, which was adjusted to apply
to the particular house and plotted on the webpage.
Challenges: use of coded date, choice of user knots for splines,
saving and retrieving the knots and parameter estimates, use of log
scale for prices to deal with skewed price distribution, estimation of
prediction intervals, and the 2009 slump in house prices.
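A minimal sketch of the save-and-reuse workflow, using official mkspline rather than the uvrs and splinegen commands from the talk, with hypothetical variable names (price, saledate, housetype):

    // Sketch only: spline in sale date on the log-price scale
    generate double lnprice = ln(price)
    mkspline dsp = saledate, cubic displayknots ///
        knots(`=td(01jan2000)' `=td(01jan2003)' `=td(01jan2006)' `=td(01jan2009)')
    regress lnprice dsp* i.housetype
    estimates save pricemodel, replace
    // ... later: recreate dsp* for the query dates with the same knots, then
    estimates use pricemodel
    predict double lnphat
    generate double phat = exp(lnphat)    // ignores retransformation bias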
Additional information
UK11_boniface.ppt
Endogenous treatment effects for count data models with endogenous participation or sample selection
Alfonso Miranda
Institute of Education, University of London
Joint with Massimiliano Bratti
We propose an estimator for models in which an endogenous dichotomous
treatment affects a count outcome in the presence of either sample selection
or endogenous participation using maximum simulated likelihood. We allow for
the treatment to have an effect on both the participation or the sample
selection rule and on the main outcome. Applications of this model are
frequent in—but not limited to—health economics. We show an
application of the model using data from Kenkel and Terza (2001,
Journal of Applied Econometrics 16: 165–184), who investigated
the effect of physician advice on the amount of alcohol consumption. Our
estimates suggest that in these data, a) neglecting treatment endogeneity
leads to a wrongly signed effect of physician advice on drinking intensity,
b) accounting for treatment endogeneity but neglecting endogenous
participation leads to an upwardly biased estimate of the treatment effect,
and c) advice affects only the drinking-intensive margin but not drinking
prevalence.
Additional information
UK11_Miranda.pdf
Multiple imputation with large proportions of missing data: How much is too much?
Jin Hyuk Lee
Texas A&M Health Science Center
Joint with John Huber Jr.
Multiple imputation (MI) is known as an effective method for handling
missing data. However, it is not clear that the method will be effective
when the data contain a high percentage of missing observations on a
variable. This study examines the effectiveness of MI in
data with 10% to 80% missing observations using absolute bias and
root mean squared error of MI measured under missing completely at
random, missing at random, and not missing at random
assumptions. Using both simulated data drawn from a multivariate normal
distribution and example data from the Predictive Study of Coronary
Heart Disease, the bias and root mean squared error using MI are much smaller than
those obtained when complete-case analysis is used. In addition, the bias
of MI is stable as the number of imputations (M) increases from
M = 10 to M = 50. Moreover, in addition to the regression and predictive
mean matching methods, the Markov chain Monte Carlo method can also be used as an
imputation mechanism for continuous univariate missing variables. In conclusion, MI
produces less-biased estimates, but when large proportions of data are
missing, other things need to be considered such as the number of
imputations, imputation mechanisms, and missing data mechanisms for proper
imputation.
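The three imputation methods compared might be specified as below; the variable names (sbp with missing values, covariates age and smoke, outcome chd) are hypothetical stand-ins for the study data:

    // Sketch only: alternative univariate imputation methods (run one at a time)
    mi set wide
    mi register imputed sbp
    mi impute regress sbp age smoke chd, add(50) rseed(1)       // regression method
    * mi impute pmm sbp age smoke chd, add(50) rseed(1) knn(5)  // predictive mean matching
    * mi impute mvn sbp = age smoke chd, add(50) rseed(1)       // MCMC (data augmentation)
    mi estimate: logit chd sbp age smoke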
Additional information
UK11_lee.pptx
Testing the performance of the two-fold FCS algorithm for multiple imputation of longitudinal clinical records
Irene Petersen
University College London
Joint with Catherine Welch, Jonathan Bartlett, Ian White, Richard Morris, Louise Marston, Kate Walters, Irwin Nazareth, and James Carpenter
Multiple imputation is increasingly regarded as the standard method to
account for partially observed data, but most methods have been based on
cross-sectional imputation algorithms. Recently, a new multiple-imputation
method, the two-fold fully conditional specification (FCS) method, was
developed to impute missing data in longitudinal datasets with nonmonotone
missing data. (See Nevalainen J., Kenward M.G., and Virtanen S.M. 2009. Missing values
in longitudinal dietary data: A multiple imputation approach based on a
fully conditional specification.
Statistics in Medicine 28:
3657–3669.) This method imputes missing data at a given time point based on
measurements recorded at the previous and next time points. Up to now, the
method has only been tested on a relatively small dataset and under very
specific conditions. We have implemented the two-fold FCS algorithm in Stata,
and in this study we further challenge and evaluate the performance of the
algorithm under different scenarios. In simulation studies, we generated
1,000 datasets, which were similar in structure to the longitudinal clinical
records (The Health Improvement Network primary care database) to
which we will apply the two-fold FCS algorithm. Initially, these generated
datasets included complete records. We then introduced different levels and
patterns of partially observed data and applied the algorithm to
generate multiply imputed datasets. The results of our initial multiple
imputations demonstrated that the algorithm provided acceptable results when
using a linear substantive model and data were imputed over a limited time
period for continuous variables such as weight and blood pressure.
Introducing an exponential substantive model introduced some bias, but
estimates were still within acceptable ranges. We will present results for
simulation studies that include situations where categorical and
continuous variables change over a 10-year period (for example, smokers become
ex-smokers, weight increases or decreases) and large proportions of data are
unobserved. We also explore how the algorithm deals with interactions and
whether running the algorithm forward or backward in time has any impact on
the final data distribution.
Additional information
UK11_welch.pptx
Implementing procedures for spatial panel econometrics in Stata
Gordon Hughes
School of Economics, University of Edinburgh
Econometricians have begun to devote more attention to spatial interactions
when carrying out applied econometric studies. In part, this is motivated by
an explicit focus on spatial interactions in policy formulation or market
behavior, but it may also reflect concern about the role of omitted
variables that are or may be spatially correlated. The classic models of
spatial autocorrelation or spatial error rely upon a predefined matrix of
spatial weights
W, which may be derived from an explicit model of
spatial interactions but which, alternatively, could be viewed as a flexible
approximation to an unknown set of spatial links similar to the use of a
translog cost function. With spatial panel data, it is possible, in
principle, to regard
W as potentially estimable, though the number of
time periods would have to be large relative to the number of spatial panel
units unless severe restrictions are placed upon the structure of the
spatial interactions. While the estimation of
W may be infeasible for
most real data, there is a strong, formal similarity between spatial panel
models and nonspatial panel models in which the variance–covariance
matrix of panel errors is not diagonal. One important variant of this type
of model is the random-coefficient model in which slope coefficients differ
across panel units so that interest focuses on the mean slope coefficient
across panel units. In certain applications—for example, cross-country
(macro-)economic data—the assumption that reaction coefficients are
identical across panel units is not intuitively plausible. Instead of just
sweeping differences in coefficients into a general error term, the
random-coefficient model allows the analyst to focus on the common component of
responses to changes in the independent variables while retaining the
information about the error structure associated with coefficients that are
random across panel units but constant over time for each panel unit.
At present, Stata’s spatial procedures include a range of user-written
routines that are designed to deal with cross-sectional spatial data. The
recent release of a set of programs (including
spmat,
spivreg,
and
spreg) written by Drukker, Prucha, and Raciborski
provides Stata’s users with the opportunity to fit a wide range
of standard spatial econometric models for cross-sectional data. Extending
such procedures to deal with panel data is nontrivial, in part because
there are important issues about how panels with incomplete data should be
treated. The casewise exclusion of missing data is automatic for
cross-sectional data, but omitting a whole panel unit because some of the data
in the panel are missing will typically lead to a very large reduction in the size
of the working dataset. For example, it is very rare for international
datasets on macroeconomic or other data to be complete, so that casewise
exclusion of missing data will generate datasets that contain many fewer
countries or time periods than might otherwise be usable.
The theoretical literature on econometric models for the analysis of spatial
panels has flourished in the last decade with notable contributions from
LeSage and Pace, Elhorst, and Pfaffermayr, among others. In some cases,
authors have made available specific code for the implementation of the
techniques that they have developed. However, the programming language of
choice for such methods has been MATLAB, which is expensive and has a fairly
steep learning curve for nonusers. Many of the procedures assume that
there are no missing data, and they may not be able to handle large datasets
because the model specifications can easily become unmanageable if either
N (the number of spatial units) or
T (the number of time
periods) becomes large.
The presentation will cover a set of user-written maximum likelihood
procedures for fitting models with a variety of spatial structures
including the spatial error model, the spatial Durbin model, the
spatial autocorrelation model, and certain combinations of these
models—the terminology is attributable to LeSage and Pace (2009).
A suite of
MATLAB programs to fit these models for both random and fixed effects
has been compiled by Elhorst (2010) and provides the basis for the
implementation in Stata/Mata. Methods of dealing with missing data,
including the implementation of an approach proposed by Pfaffermayr (2009),
will be discussed.
The problem of missing data is most severe when data on
the dependent variable are missing in the spatial autocorrelation model
because it means that information on spatial interactions may be greatly
reduced by the exclusion of countries or other panel units. In such cases,
some form of imputation may be essential, so the presentation will
consider alternative methods of imputation. It should be noted that
mi does not support panel data procedures in general, and the
relatively high cost of fitting spatial panel models means that it may be
difficult to combine
mi with spatial procedures for practical
applications.
A second aspect of spatial panel models that will be covered in the
presentation concerns the links between such models and random-coefficient
models that can be fit using procedures such as
xtrc or the
user-written procedure
xtmg. The classic formulation of
random-coefficient models assumes that the variance–covariance matrix of panel
errors is diagonal but heteroskedastic. This is an implausible assumption
for most cross-country datasets, so it is important to consider how it may
be relaxed, either by allowing for explicit spatial interactions or by
using a consistent estimator of the cross-country variance–covariance
matrix.
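For reference, the classic random-coefficient formulation mentioned here is available in official Stata (hypothetical variable names):

    // Sketch only: Swamy random-coefficient model, which assumes
    // heteroskedastic but cross-sectionally independent panel errors
    xtset country year
    xtrc lngdp lninvest lnlabour, betas   // betas: report per-country coefficients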
The user-written procedures introduced in the presentation will be
illustrated by applications drawn from analyses of demand for
infrastructure, health outcomes, and climate for cross-country data covering
the developing and developed world plus regions in China.
Additional information
UK11_hughes.pdf
Structural equation modeling for those who think they don’t care
Vince Wiggins
StataCorp LP
We will discuss SEM (structural equation modeling), not from the perspective
of the models for which it is most often used—measurement models,
confirmatory factor analysis, and the like—but from the perspective of
how it can extend other estimators. From a wide range of choices, we will
focus on extensions of mixed models (random- and fixed-effects regression).
Extensions include conditional effects (not completely random), endogenous
covariates, and others.
Additional information
UK11_Wiggins.pdf
Chained equations and more in multiple imputation in Stata 12
Yulia Marchenko
StataCorp LP
I present the new Stata 12 command,
mi impute chained, to
perform multivariate imputation using chained equations (ICE), also known as
sequential regression imputation. ICE is a flexible imputation technique
for imputing various types of data. The variable-by-variable specification
of ICE allows you to impute variables of different types by choosing the
appropriate method for each variable from several univariate imputation
methods. Variables can have an arbitrary missing-data pattern. By specifying
a separate model for each variable, you can incorporate certain important
characteristics, such as ranges and restrictions within a subset, specific to
each variable. I also describe other new features in multiple imputation in
Stata 12.
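A small sketch of the kind of specification described, with hypothetical variable names, is:

    // Sketch only: chained equations mixing variable types
    mi set flong
    mi register imputed bmi smokes income
    mi impute chained (regress) bmi (logit) smokes (pmm, knn(5)) income ///
        = age i.sex, add(20) rseed(2011)
    mi estimate: logit disease bmi smokes income age i.sex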
Additional information
UK11_marchenko.pdf
Exporting CAPI data to Stata: Experience from surveybe
Joachim De Weerdt
Economic Development Initiatives, Tanzania
Researchers typically spend significant amounts of time cleaning and
labeling data files in preparation for the analysis of survey data.
Computer-assisted personal interviewing (CAPI) gives the ability to automate this
process. First, consistency checks can be run during the interview so that
only data that passes autogenerated and user-written validation tests comes
back from the field. Second, CAPI allows for the autogeneration of a Stata
do-file that labels data files. This presentation discusses the Stata
export procedure used by
surveybe, a CAPI application designed to
handle complex surveys. The questions, as displayed on the screen to the
interviewer, are automatically turned into variable labels. Likewise, the
drop-down menus are autocoded as value labels. Furthermore, the export
procedure ensures that data from rosters get exported into different Stata
data files and that complete referential integrity is ensured across all the
files originating from the same survey, with unique primary keys linking
files together. Any changes made to the electronic questionnaire (for example,
adding a response code to the drop-down menu) or changes to the phrasing of a
question will be automatically incorporated into the exported data files,
thus ensuring that the data files match the questionnaires completely.
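The autogenerated do-file is, in essence, ordinary labeling code; a hand-written sketch with hypothetical names gives the flavor:

    // Sketch only: what the exported labeling do-file amounts to
    use "hh_roster.dta", clear
    label variable q07 "How old was NAME at his/her last birthday?"
    label define yesno 1 "Yes" 2 "No"
    label values q12 yesno
    // primary keys give referential integrity across exported files
    isid hhid memberid
    save "hh_roster_labelled.dta", replace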
Additional information
UK11_deweerdt.pdf
Using Mata to import Illumina SNP chip data for genome-wide association studies
J. Charles Huber Jr.
Texas A&M University
Joint with Michael Hallman, Victoria Friedel, Melissa Richard and Huandong Sun
Modern genetic genome-wide association studies typically rely on
single nucleotide polymorphism (SNP) chip technology to determine hundreds
of thousands of genotypes for an individual sample. Once these genotypes
are ascertained, each SNP alone or in combination is tested for association
with outcomes of interest such as disease status or severity. Project Heartbeat!
was a longitudinal study conducted in the 1990s that explored changes in
lipids and hormones and morphological changes in children from 8 to 18
years of age. A genome-wide association study is currently being conducted to look for SNPs
that are associated with these developmental changes. While there are
specialty programs available for the analysis of hundreds of thousands of
SNPs, they are not capable of modeling longitudinal data. Stata is well
equipped for modeling longitudinal data but cannot load hundreds of
thousands of variables into memory simultaneously. This talk will briefly
describe the use of Mata to import hundreds of thousands of SNPs from the
Illumina SNP chip platform and how to load those data into Stata for
longitudinal modeling.
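A minimal Mata sketch of the approach is shown below; the file name, file layout (tab-delimited, sample id in the first column), and SNP names are hypothetical, and the production code handles many more practicalities:

    // Sketch only: stream a huge genotype file, keeping selected SNP columns
    mata:
    fh = fopen("illumina_calls.txt", "r")
    header = tokens(fget(fh), char(9))
    keep = select(1..cols(header),
                  header :== "rs1234567" :| header :== "rs7654321")
    ids = J(0, 1, "")
    calls = J(0, cols(keep), "")
    while ((line = fget(fh)) != J(0, 0, "")) {
        f = tokens(line, char(9))
        ids = ids \ f[1]
        calls = calls \ f[keep]
    }
    fclose(fh)
    stata("clear")
    st_addobs(rows(ids))
    idx = st_addvar("str20", ("sampleid", header[keep]))
    st_sstore(., idx, (ids, calls))
    end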
Additional information
UK11_Huber.pptx
Using Stata for handling CDISC datasets
Adam Jacobs
Dianthus Medical Limited
The Clinical Data Interchange Standards Consortium (CDISC) is a globally
relevant nonprofit organization that defines standards for handling data
in clinical research. It produces a range of standards for clinical data at
various stages of maturity. One of the most mature standards is the Study
Data Tabulation Model, which provides a standardized yet flexible
data structure for storing entire databases from clinical trials. A related
standard is the Analysis Dataset Model, which defines datasets that
can be used for analyzing data from clinical trials. I shall explain how the
CDISC standards work, how Stata can simplify many of the routine tasks
encountered in handling CDISC datasets, and the great efficiencies that can
result from using datasets in a standardized structure.
Additional information
UK11_jacobs.ppt
Picturing mobility: Transition probability color plots
Philippe Van Kerm
CEPS/INSTEAD, Luxembourg
This talk presents a simple but effective graphical device for visualization
of patterns of income mobility. The device in effect uses color palettes to
picture information contained in transition matrices created from a fine
partition of the marginal distributions. The talk explains how these graphs
can be constructed using the user-written package
spmap from Maurizio
Pisati, briefly presents the wrapper command
tpcplot (for
transition probability color plots), and demonstrates how such
graphs are effective for contrasting patterns of mobility in different
countries or contrasting observed patterns against benchmarks of maximal or
minimal mobility.
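The matrix being pictured is just a transition matrix over a fine partition; with hypothetical variable names, the underlying computation is:

    // Sketch only: 20x20 transition matrix between income vigintiles
    // (-tpcplot- itself wraps -spmap- for the graphics)
    xtile g0 = income2005, nquantiles(20)
    xtile g1 = income2010, nquantiles(20)
    tabulate g0 g1, row nofreq   // row percentages = transition probabilities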
Additional information
UK11_vankerm.pdf
Running multilevel models in MLwiN from within Stata: runmlwin
George Leckie
Centre for Multilevel Modelling, University of Bristol
Joint with Chris Charlton
Multilevel analysis is the statistical modeling of hierarchical and
nonhierarchical clustered data. These data structures are common in social
and medical sciences. Stata provides the
xtmixed,
xtmelogit,
and
xtmepoisson commands for fitting multilevel models, but these are
only relevant for univariate continuous, binary, and count response variables,
respectively. A much wider range of multilevel models can be fit using
the user-written
gllamm command, but
gllamm can be
computationally slow for large datasets or when there are many random
effects. Many Stata users therefore turn to specialist multilevel modeling
packages such as MLwiN for fast fitting of a wide range of complex
multilevel models. MLwiN includes the following features: fitting of
multilevel models for
n-level hierarchical and nonhierarchical data
structures; fast fitting via classical and Bayesian methods; fitting
of multilevel models for continuous, binary, ordered categorical, unordered
categorical, and count data; fitting of multilevel multivariate response
models, spatial models, measurement error models, multiple-imputation models,
and multilevel factor models; interactive model equation windows and graph
windows for model exploration; and free availability to academics
in the United Kingdom. In this
presentation, we will introduce the
runmlwin command to fit
multilevel models in MLwiN from within Stata and to return estimation
results to the Stata environment. We shall demonstrate
runmlwin in
action with several example multilevel analyses in which we fit models and use
Stata’s standard postestimation commands such as
predict and
test to calculate predictions, perform hypothesis tests, and produce
publication-quality graphics.
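A minimal sketch of a runmlwin call for a two-level random-intercept model is shown below, based on the package's documented tutorial example; the MLwiN installation path is an assumption and must be edited to match the local machine:

    // Sketch only: two-level random-intercept model via MLwiN
    global MLwiN_path "C:\Program Files\MLwiN v2.24\mlwin.exe"
    use "tutorial.dta", clear
    generate cons = 1
    runmlwin normexam cons standlrt, ///
        level2(school: cons) level1(student: cons) nopause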
Additional information
UK11_leckie.do
UK11_leckie.pdf
Plagiarism in student papers and cheating in student exams: Results from surveys using special techniques for sensitive questions
Ben Jann
University of Bern
Eliciting truthful answers to sensitive questions is an age-old problem in
survey research. Respondents tend to underreport socially undesired or
illegal behaviors while overreporting socially desirable ones. To combat
such response bias, various techniques have been developed that are geared
toward providing the respondent greater anonymity and minimizing the
respondent’s feelings of jeopardy. Examples of such techniques are the
randomized response technique, the item-count technique, and the crosswise
model. I will present results from several surveys, conducted among
university students, that employ such techniques to measure the prevalence
of plagiarism and cheating in exams. User-written Stata programs for
analyzing data from such techniques are also presented.
Additional information
UK11_jann.pdf
Lowering your handicap with Stata
Tim Collier
London School of Hygiene and Tropical Medicine
When I first met Stata in October 2000, my golf handicap was 27 and my game
was going nowhere slowly. Ten years of intensive Stata therapy later, my
handicap is 17.3 and falling. It would, of course, be nonsense to infer from
this data that lowering your handicap increases Stata use, but could the
reverse be true? Could there be a causal relationship between increasing
Stata use and a decreasing handicap? In this presentation, I argue that, yes,
there is. Granted, Stata might not work along the lines of traditional golf
training aids, but rather its effect is mediated through a third factor,
namely time. Golf consumes time. Stata produces time. In this presentation, I
will demonstrate how minutes in Stata’s programming world are
equivalent to hours in the real world, and by the use of programs within
programs, minutes can translate to days. Although extrapolation from an
N of 1 is nearly always dangerous, I believe that Stata could be
similarly used to reduce your weight, improve foreign language skills, or
even increase research output.
Additional information
UK11_Collier.ppt
Fun and fluency with functions
Nicholas J. Cox
Durham University
Functions in Stata range between those you know you want and those you
don’t know you need. The word “functions” is heavily
overloaded in Stata; here the focus is on functions in the strict sense,
_variables, extended macro functions, and
egen functions. Often Stata
users in difficulty are seeking commands or imagining that they need to
write programs, when a few lines of code using functions would crack their
problem. In this talk, I will briefly give some general advice on using
functions and in more detail discuss a variety of examples, with the aim of
introducing something unappreciated but useful to almost everyone. Somehow
or other, graphs and my own work will also be mentioned.
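A few small examples of the kinds of functions meant (using the auto data shipped with Stata):

    sysuse auto, clear
    // a function in the strict sense, avoiding a chain of -replace- commands
    generate byte pricecat = cond(price < 5000, 1, cond(price < 10000, 2, 3))
    // an egen function: group summaries without collapsing or merging
    egen mean_mpg = mean(mpg), by(foreign)
    // an extended macro function: fetch a variable label into a macro
    local lbl : variable label price
    display "`lbl'"
    // a _variable: running observation number within groups
    bysort foreign (mpg): generate rank_in_group = _n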
Additional information
UK11_Cox_functions.html
UK11_Cox_functions.smcl
Panel time-series modeling: New tools for analyzing xt data
Markus Eberhardt
University of Oxford
Stata already has an extensive range of built-in and user-written commands
for analyzing
xt (cross-sectional time-series) data.
However, most of these commands do not take into account important features
of the data relating to their time-series properties or cross-sectional
dependence. This talk reviews the recent literature concerned with these
features with reference to the types of data in which they arise. Most of
the talk will be spent discussing and illustrating various Stata commands
for analyzing these types of data, including several new user-written
commands. The talk should be of general interest to users of
xt data
and of particular interest to researchers with panel datasets in which
countries or regions are the unit of analysis and there is also a
substantial time-series element. Over the past two decades, a literature
dedicated to the analysis of macro panel data has concerned itself with some
of the idiosyncrasies of this type of data, including variable
nonstationarity and cointegration, as well as with the investigation of
possible parameter heterogeneity across panel members and its implications
for estimation and inference. Most recently, this literature has turned its
attention to concerns over cross-sectional dependence, which can arise either
in the form of unobservable global shocks that differ in their impact
across countries (for example, the recent financial crisis) or as spillover effects
(again, unobservable) between a subset of countries or regions.
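As a small illustration of the first of these concerns, official commands already cover some of the ground (hypothetical variable names); the user-written commands discussed in the talk go considerably further:

    // Sketch only: Im-Pesaran-Shin panel unit-root tests in levels and
    // first differences
    xtset country year
    xtunitroot ips lngdp, lags(1) trend
    xtunitroot ips d.lngdp, lags(1)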
Additional information
UK11_eberhardt.pdf
Scientific organizers
Stephen Jenkins, London School of Economics
Roger Newson, Imperial College London
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.