Last updated: 4 April 2003
2003 North American Stata Users Group meeting
18–19 March 2003
Longwood Galleria Conference Center
342 Longwood Avenue
Boston, Massachusetts
Proceedings
Session 1, 0830–0945
Generalized latent class modeling using gllamm
Sophia Rabe-Hesketh,
Institute of Psychiatry, King's College
Andrew Pickles,
University of Manchester
Anders Skrondal,
Norwegian Institute of Public Health
-
Abstract
gllamm can estimate both conventional and unconventional
latent class models. Models are specified using discrete latent variables
whose values determine the conditional response distributions for the
classes. A new feature of gllamm is that latent class
probabilities can depend on covariates. We will first discuss the
conventional exploratory latent class model. When a number of fallible
diagnoses of some disease are available, this model can be used to estimate
the prevalence of the disease as well as the sensitivities and
specificities of the tests in the absence of a gold standard. After
estimating the model in gllamm, gllapred
can be used to diagnose individual subjects based on their posterior class
probabilities. An advantage of using gllamm is that a wide
range of response types can be accommodated. To illustrate this, we
consider the analysis of rankings of political goals in the study of value
orientations. We will also discuss confirmatory models such as latent
class factor models and apply them to attitudes to abortion data, taking
the survey design into account by using probability weighting and robust
standard errors. Finally, we consider latent trajectory models for
investigating distinct patterns of change in longitudinal data.
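As an illustration of how posterior class probabilities can be used for diagnosis, the following is a minimal Python sketch, not gllamm or gllapred code. It assumes a two-class model (diseased and not diseased) in which the tests are conditionally independent given the class. The function name and all numeric values are hypothetical.

```python
from math import prod


def posterior_disease_probability(results, prevalence, sensitivities, specificities):
    """Return P(diseased | test results) for a two-class latent class model.

    Assumes the tests are conditionally independent given the latent class.
    results: sequence of 0/1 test outcomes (1 = positive), one per test.
    """
    # Likelihood of the observed pattern if the subject is diseased.
    like_diseased = prod(
        se if r else 1 - se for r, se in zip(results, sensitivities)
    )
    # Likelihood of the observed pattern if the subject is not diseased.
    like_healthy = prod(
        1 - sp if r else sp for r, sp in zip(results, specificities)
    )
    joint_diseased = prevalence * like_diseased
    joint_healthy = (1 - prevalence) * like_healthy
    # Bayes' rule: normalize over the two classes.
    return joint_diseased / (joint_diseased + joint_healthy)


# Illustrative values only: three fallible tests, two positive and one negative.
p = posterior_disease_probability(
    results=[1, 1, 0],
    prevalence=0.2,
    sensitivities=[0.90, 0.80, 0.85],
    specificities=[0.95, 0.90, 0.80],
)
print(round(p, 3))
```

A subject would be assigned to the class with the larger posterior probability. In practice the prevalence, sensitivities, and specificities would be the estimates from the fitted model rather than fixed inputs.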
Additional information
lclass.pdf
Case–control study power and sample size calculations using Stata
Katie Saunders,
Cancer Research UK, Genetic Epidemiology Division, University of Leeds
Tim Bishop,
Cancer Research UK, Genetic Epidemiology Division, University of Leeds
Jenny Barrett,
Cancer Research UK, Genetic Epidemiology Division, University of Leeds
-
Abstract
We use Stata's npnchi2 and nchi2 functions
to calculate power and required sample size for case–control studies.
Following the method described by Self et al. (1992), a large exemplary
dataset with expected risk factor frequencies among cases and controls
under any alternative hypothesis is created. The likelihood-ratio test
statistic for the hypothesis of interest is distributed as a non-central
chi-squared statistic under the alternative hypothesis, and the likelihood
ratio test statistic from the analysis of the exemplary dataset is an
approximation to the non-centrality parameter for this distribution. We
apply these methods to power and sample-size calculations for
case–control studies of gene-gene and gene-environment interactions.
Because of the low power of case–control studies to detect interactions, a
wide range of different strategies have been proposed. Required sample
size depends on several design parameters and so the simplicity of these
methods means that the efficiency of many designs can be compared over
different ranges, a valuable tool at the planning stage of a study.
Results are presented for population-based, family, and matching schemes
that have been proposed to improve power, and comparisons of the power of
different designs are made. Stata programs are available for these
comparisons.
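The exemplary-dataset idea can be sketched in a few lines. The Python below is an illustration under simplifying assumptions, not the authors' Stata programs: a single binary risk factor, an unmatched design, and a 1-degree-of-freedom test, for which the noncentral chi-squared tail can be written with the standard normal distribution. All function names and numeric inputs are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def exemplary_table(n_cases, n_controls, p_exposed_controls, odds_ratio):
    """Expected exposed/unexposed counts under the alternative hypothesis."""
    odds_cases = odds_ratio * p_exposed_controls / (1 - p_exposed_controls)
    p_exposed_cases = odds_cases / (1 + odds_cases)
    return [
        [n_cases * p_exposed_cases, n_cases * (1 - p_exposed_cases)],
        [n_controls * p_exposed_controls, n_controls * (1 - p_exposed_controls)],
    ]


def lr_statistic(table):
    """Likelihood-ratio (G^2) statistic for no association in a two-way table.

    Applied to expected counts, it approximates the noncentrality parameter.
    """
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:
                g2 += 2 * observed * log(observed / expected)
    return g2


def power_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-squared test with noncentrality parameter ncp.

    Uses the fact that a 1-df noncentral chi-squared variable is
    (Z + sqrt(ncp))^2 with Z standard normal.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    root = sqrt(ncp)
    return (1 - z.cdf(crit - root)) + z.cdf(-crit - root)


def ncp_for_power(power, alpha=0.05):
    """Approximate ncp for the target power (ignores the negligible lower tail)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


# Illustrative design: 500 cases, 500 controls, 30% exposure, odds ratio 1.5.
ncp = lr_statistic(exemplary_table(500, 500, 0.3, 1.5))
print("power:", round(power_1df(ncp), 3))
# The ncp grows linearly with sample size, so the required size follows by scaling.
print("cases needed for 80% power:", round(500 * ncp_for_power(0.8) / ncp))
```

Because the noncentrality parameter scales linearly with the number of subjects, one exemplary dataset per design is enough to compare designs across a range of sample sizes.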
Additional information
talk4.ppt.zip
saunders.zip
The effects of self-perception on students' mathematics and
science achievement in 36 countries
Ce Shen,
Academic Technology Services, Boston College
Oleksandr Talavera,
Academic Technology Services, Boston College
-
Abstract
Earlier studies based on analyses of data from the Third
International Mathematics and Science Study (TIMSS) identified an
interesting but conflicting
finding on the effects of three self-perception measures on students'
achievement in the two subjects at two different levels: within-country
data generally show a positive correlation between the three measures and
students' actual achievement, while at the country level, the direction
is reversed. The three measures of self-perception are how
much students like the two subjects, how difficult they perceive the
two subjects to be, and how well they think they are doing in the two
subjects. Because the TIMSS sample design was a two-stage stratified
design, this study uses Stata's svyreg
procedure (for complex survey analysis) to replicate earlier
analyses. We find that at the individual level, when the number of books
at home, school resources, and indicators of school management are
controlled for, the three self-perceptions
demonstrate positive effects on students' achievement
for most countries, while at the school level the picture becomes mixed.
For most countries, the effect of perceived easiness of the two subjects
becomes negative. We suggest this inconsistency reflects
differences in culture and in academic standards from country to
country.
Additional information
shen.pdf
Session 2, 1015–1200
Generalized linear models for prediction: some principles,
some programs and some practice
Nicholas J. Cox,
Durham University
-
Abstract
Despite a history now over 30 years long, the adoption of generalized
linear models (GLMs) remains patchy: they are well-known in several
fields, but used little if at all in many others. One major advantage
of GLMs is that they return predictions on the scale of the response.
The use of link functions avoids the need for prior transformation of
the response, for back-transformation of predictions, and above all for
bias corrections to back-transformations, whether systematic or ad hoc.
Case studies from environmental applications (suspended sediment
concentrations of rivers, heights of forest trees) are introduced in
which predictions on the response scale are of paramount scientific and
practical interest. Heavy use is made of a suite of Stata programs
written by the author producing graphic and numeric diagnostics after
regression-type models, which extend and complement commands in
official Stata. Most of these programs have uses beyond GLMs and they
will also be discussed directly.
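The point about back-transformation bias can be seen in the simplest possible case, a model with only a constant. The Python sketch below uses made-up data and is not taken from the talk. It relies on the fact that, for an intercept-only GLM with a log link (Gaussian, Poisson, or gamma family), the fitted mean is the sample mean, whereas regressing log(y) on a constant and exponentiating gives the geometric mean.

```python
from math import exp, log
from statistics import fmean

# Made-up positive responses (for example, sediment concentrations).
y = [1.0, 2.0, 4.0, 8.0, 16.0]

# Route 1: transform the response, fit a constant, back-transform naively.
# The result is the geometric mean, which underestimates the mean of y.
naive_prediction = exp(fmean([log(v) for v in y]))

# Route 2: intercept-only GLM with log link. The fitted value on the response
# scale is the arithmetic mean, so no bias correction is required.
glm_prediction = fmean(y)

print(naive_prediction, glm_prediction)
```

The gap between the two numbers is what a retransformation bias correction would otherwise have to estimate.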
Additional information
cox.pdf
Using Stata to manage and create a research data bank
Frederick Wolfe,
National Data Bank for Rheumatic Diseases
Kaleb Michaud,
National Data Bank for Rheumatic Diseases
-
Abstract
We manage a longitudinal research data bank containing 3,000 variables
that adds 25,000 observations per year. Data are batch converted from
SQL to Stata on a daily basis, resulting in the creation of 20
preliminary datasets. We then use Stata to quality control the data
and to prepare a single research dataset that can be augmented as
required by the data analyst by calls to specialized programs that
access the additional datasets. Our philosophy is that most of the
quality control and programming and dataset preparation should be
built into the dataset creation process rather than requiring the data
user to do this. For example, data quality checks and complex data
preparation of items such as costs and hospital and mortality codes
are programmed into the dataset creation process, and relevant
additional datasets are automatically created to reflect such new
data. The basic dataset consists of research and control variables
that are needed for most analyses. With simple programming statements
such as getwork and getcosts,
preprocessed work and cost data, for example, are merged with the
basic set. Global macros identify file locations, database versions,
and variable sets, making updating and sharing simple.
Asymptotic confidence intervals (CIs) for a difference
between two independent proportions
Joseph Coveney,
Cobridge Co., Ltd.
-
Abstract
Binary response variables arise in a variety of studies. It is often of
interest to summarize a treatment effect in terms of the difference in
proportions of successes between groups. Stata produces the Wald-type
asymptotic CI for differences in proportions in cs and
glm, family(binomial) link(identity). The Wald-type CI
is easy to compute, but it is sometimes desirable to have a large-sample
CI with better coverage properties. Alternative asymptotic CI methods
have appeared in the literature with claims of better performance.
Miettinen and Nurminen (1985) describe an iterative method for such an
improved CI requiring repeatedly solving a cubic equation. A
noniterative approximation by Wallenstein (1997) requires only solution
to a quadratic equation. Newcombe (1998) favored a method involving
Wilson intervals of each of the two proportions (see
ciw). Agresti and Caffo (2000) describe a simple method
inspired by an analogous Wilson-type interval (see
propci) for single proportions. The Stata implementation of each of
these CIs will be illustrated in the context of a
therapeutic equivalence clinical trial.
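Three of the noniterative intervals mentioned above have simple closed forms. The Python sketch below follows the textbook formulas for the Wald interval, the Newcombe hybrid score interval built from Wilson limits, and the Agresti-Caffo interval (add one success and one failure to each group). It is an illustration, not the Stata implementations shown in the talk; the function names are hypothetical, and the iterative Miettinen-Nurminen method is not included.

```python
from math import sqrt
from statistics import NormalDist


def _z(level):
    """Two-sided standard normal critical value for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def wald_ci(x1, n1, x2, n2, level=0.95):
    p1, p2 = x1 / n1, x2 / n2
    half = _z(level) * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half, p1 - p2 + half


def wilson_ci(x, n, level=0.95):
    """Wilson score interval for a single proportion."""
    z = _z(level)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def newcombe_ci(x1, n1, x2, n2, level=0.95):
    """Hybrid score interval combining the two Wilson intervals."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_ci(x1, n1, level)
    l2, u2 = wilson_ci(x2, n2, level)
    diff = p1 - p2
    lower = diff - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def agresti_caffo_ci(x1, n1, x2, n2, level=0.95):
    """Wald interval after adding one success and one failure to each group."""
    return wald_ci(x1 + 1, n1 + 2, x2 + 1, n2 + 2, level)


# Illustrative data: 56 of 70 successes versus 48 of 80.
for name, ci in [
    ("Wald", wald_ci(56, 70, 48, 80)),
    ("Newcombe", newcombe_ci(56, 70, 48, 80)),
    ("Agresti-Caffo", agresti_caffo_ci(56, 70, 48, 80)),
]:
    print(name, round(ci[0], 4), round(ci[1], 4))
```

Unlike the Wald interval, the Newcombe interval is not symmetric about the observed difference, and the Agresti-Caffo interval is centered on a difference shrunk slightly toward zero.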
Extending xi
Phil Ender,
UCLA Department of Education
Michael Mitchell,
UCLA Academic Technology Services
-
Abstract
Stata's xi command performs dummy (indicator) coding on
the fly and with the "*" operator allows for the interaction of two
categorical variables or a categorical with a continuous variable.
xi3 extends the capabilities of xi to
include a number of additional coding systems and can create codings
that allow for testing simple contrasts and simple effects. In addition
to indicator coding, xi3 supports the following coding
schemes:
-
simple coding - compares each level to a reference level
deviation coding - deviations from the grand mean
Helmert coding - compares levels of a variable with the mean of subsequent levels
reverse Helmert coding - compares levels of a variable with the mean of previous levels
forward differences - adjacent levels, each versus next
backward differences - adjacent levels, each versus previous
orthogonal polynomial coding
Additionally, xi3 supports user-defined coding schemes,
which allow virtually any type of contrast to be used. Like
xi, xi3 can be used in conjunction with any
of the estimation commands. xi3 will do three-way
interactions with categorical variables, a mixture of categorical and
continuous variables, or with continuous variables alone.
xi3 can be issued as a
stand-alone command. In addition to the "*" operator for interactions,
xi3 adds the "@"
operator which performs the
coding separately for each level of the second variable to allow for
simple contrasts and simple effects.
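To make one of the coding schemes concrete, the Python sketch below builds the textbook regression coding for Helmert contrasts (each level versus the mean of subsequent levels) and checks, with exact fractions, that the resulting coefficients are the intended contrasts. It illustrates the general technique only. Whether xi3 generates exactly these columns is an assumption, and the function names and cell means are hypothetical.

```python
from fractions import Fraction


def helmert_coding(k):
    """Return a k x (k-1) coding matrix: rows are levels, columns are regressors."""
    rows = []
    for level in range(k):
        row = []
        for col in range(k - 1):
            remaining = k - col  # number of levels from `col` to the last level
            if level < col:
                row.append(Fraction(0))
            elif level == col:
                row.append(Fraction(remaining - 1, remaining))
            else:
                row.append(Fraction(-1, remaining))
        rows.append(row)
    return rows


def helmert_contrasts(cell_means):
    """Each level's mean minus the mean of all subsequent levels."""
    k = len(cell_means)
    return [
        cell_means[j] - Fraction(sum(cell_means[j + 1:]), k - j - 1)
        for j in range(k - 1)
    ]


# Hypothetical cell means for a four-level factor.
means = [10, 8, 6, 4]
coding = helmert_coding(len(means))
coefs = helmert_contrasts(means)
intercept = Fraction(sum(means), len(means))  # unweighted mean of the cell means

# Intercept plus coding times coefficients reproduces every cell mean exactly.
fitted = [
    intercept + sum(x * b for x, b in zip(row, coefs))
    for row in coding
]
print(coding)
print(coefs, fitted)
```

The same check, coding matrix times contrasts plus intercept equals the cell means, can be applied to any user-defined scheme to confirm that the coefficients mean what is intended.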
Additional information
extendxi.html
Session 3, 1330–1500
Teaching Stata for data management
Phil Bardsley,
Carolina Population Center, University of North Carolina
Dan Blanchette,
Carolina Population Center, University of North Carolina
-
Abstract
The Carolina Population Center is a SAS shop, and its 25 programmers
have long favored SAS for data management. Its research faculty,
however, encourage use of Stata to adjust for survey sampling effects.
In an effort to introduce both SAS programmers and new trainees to
Stata, we wrote a web-based Stata tutorial. It focuses on the subset of
Stata commands necessary to manage survey research files. This includes
commands to clean data that are out of range, find duplicate
identifiers that should not exist, recode variables and create new
ones, and document the data. Because the surveys often involve
hierarchical file structures, the tutorial covers merging and
reshaping. It also introduces the very powerful for
command and its variants as labor-saving devices. We have used this
tutorial to teach a short course on Stata, many trainees have used it
to teach themselves Stata, and it has been used in training programs
overseas. The talk will show the format of the tutorial on the web and
quickly review the range of commands that the tutorial covers. We will
also talk about adding a "Rosetta Stone" to help SAS programmers
convert their code to Stata.
Teaching Stata through guided practice
Estelle Young,
Bowie State University
Stacy Gibbs,
Bowie State University
Michael Wynn,
Bowie State University
-
Abstract
Bowie State University requires a software course in its 3-course
research sequence. The course covers descriptive and inferential
statistics and some data management using Stata and SPSS syntax. The
professor provides a diskette with the datasets and all class syntax
files and a course notebook containing: the syllabus; syntax/output files
for each week's material; background notes on hypothesis testing; and
in-class practice exercises, homework and a final presentation. Copies
are posted on the course website.
To log in, use the userid eyoung and
the password marvin, and click on Data Analysis Main. Go to course
documents to find the folders and files associated with the course. The
structure of each lesson is as follows: I present a brief summary of the
statistical material and corresponding syntax/output files. The students
follow on their computers, using the syntax files on their diskettes and
the hard copies in their notebooks. Students then practice the material
and a mini-homework assignment via the in-class exercise. The following
week, each student presents a part of the homework assignment to the rest
of the class. For the Stata Users Group meeting, the student and I would
present one mock lesson as well as distribute sample course notebooks.
Additional information
eyoung.zip
Instrumental variables and GMM: Estimation and testing
Christopher F. Baum,
Boston College
Mark E. Schaffer,
Heriot-Watt University
Steven Stillman,
New Zealand Department of Labour
-
Abstract
We discuss instrumental variables (IV) estimation in the broader context
of the generalized method of moments (GMM), and describe an extended IV
estimation routine that provides GMM estimates as well as additional
diagnostic tests. Stand-alone test procedures for heteroskedasticity,
overidentification, and endogeneity in the IV context are also described.
Additional information
ivgmm3316.pdf
wp545.pdf
ivgmm.do
Building a collection of programs for Stata
Henrik Schmiediche,
Texas A&M University
-
Abstract
We give an introduction to, and overview of, the technical aspects of building a
collection of programs for Stata. In particular, we will focus on the
collection of programs developed for nonlinear measurement error models
to be presented at the workshop following the users group meeting.
Multivariate data exploration with Stata: Evaluation and wish list
Stephen Soldz,
Boston Graduate School of Psychoanalysis
-
Abstract
Stata is a general purpose statistical package with especially strong
data manipulation and regression modeling capabilities. It appears to be
especially strong in statistical techniques used by econometricians and
biostatisticians. As psychologists, among others, adopt it, certain
relative weaknesses in the existing set of implemented procedures become
apparent. In particular, multidimensional exploratory data analyses are a
set of data analytic procedures (including principal components and
factor analysis, correspondence analysis, optimal scaling, and
multidimensional scaling) commonly used to explore the structure of
data sets and derive variables (e.g., principal components or factors)
that summarize the data in a small number of variables. While Stata, as
delivered or through user add-ons, has many of the basic capabilities in
these areas, many are implemented in a fairly rudimentary fashion and
others are implemented in the Stata executable, without sufficient hooks
for users to be able to expand them. This talk will discuss some of these
procedures and will evaluate Stata capabilities in these areas. It is
hoped that it will help stimulate StataCorp or the user community to
expand Stata capabilities in these areas.
Additional information
soldz.ppt
Session 4, 1530–1730
Stata Journal Editors' report
H. Joseph Newton,
Texas A&M University
Nicholas J. Cox,
University of Durham
Report to Stata users: Stata 8
William W. Gould,
StataCorp
William W. Gould,
StataCorp
Chinh Nguyen,
StataCorp
Scientific organizers
Elizabeth Allred, Harvard School of Public Health
Kit Baum, Boston College
Nicholas J. Cox, Durham University
Marcello Pagano, Harvard School of Public Health