2015 Oceania Stata Users Group meeting
24–25 September 2015
The Australian National University
Canberra ACT 0200
Australia
Proceedings
Text analysis using WordStat 7 within Stata
Normand Péladeau
Provalis Research
WordStat for Stata offers advanced text analytics features, allowing
Stata 13 and 14 users to analyze text stored in both short- and
long-string variables using numerous text-mining features, such as topic
modeling, document clustering, automatic classification, and
state-of-the-art dictionary-based content analysis. Extracted themes may
then be related to structured data using various statistics and graphic
displays. WordStat also offers a tool to create a Stata project from
lists of documents (including .DOC, HTML, and PDF files) and to
automatically extract numerical data, categorical data, and dates from them.
Treatment effects for survival-time outcomes: Theory and applications using Stata 14
Rebecca Pope
StataCorp
The potential-outcomes framework for estimating treatment effects from
observational data treats the unobserved outcome as a missing data
problem. When we extend this framework to the analysis of survival-time
outcomes, we also allow for data that are missing because of censoring. This
requires us to make additional assumptions and changes the properties of
some of the estimators.
Beginning with a brief review of key concepts of survival-time data, I
discuss potential outcomes in the context of survival analysis. I also
explain some of the advantages of using treatment-effects analysis
relative to traditional survival analysis. Alongside a brief overview
of some of the estimators that are implemented in Stata 14, I
demonstrate the application of survival treatment-effects analysis.
Examples include analysis of single- and multivalued treatments and
postestimation checking of model assumptions.
Additional information
oceania15_pope.pdf
Causal inference and treatment effect: An integrative framework for evaluation research
Bill Tyler
Charles Darwin University
The increased popularity of quasi-experimental designs with
observational data in policy-oriented evaluation studies, while
enriching the environment of Stata applications, has complicated the
options available to health and other social science researchers. In
cross-cultural policy-related research, the tensions between multilevel
and counterfactual modeling present particular problems for satisfying
evidential criteria for both efficacy and effectiveness within what is
often viewed as a homogeneous field for educational and child
development policy. This presentation offers a comparative framework for
interrogating the options for extending propensity-score analysis and
other counterfactual approaches to multilevel modeling. The utility of
this framework is illustrated from issues arising from ongoing
evaluation projects in the areas of indigenous school-based
interventions in remote community settings in Northern Australia.
Additional information
oceania15_tyler.pdf
Applications of -margins- in social science
Philip Morrison
Victoria University of Wellington
Although introduced in Stata 11 and 12, margins and marginsplot are not
as widely used in social science as they could be. This presentation advocates
wider use of these tools. I introduce the basic ideas and
illustrate their application to several different types of research
questions from my own research. The margins command and its associated
commands greatly expand our ability to assess the effects (associations)
of (the usually categorical) attributes of respondents on outcomes of
policy interest. I focus on the additional insights gained,
especially when margins is combined with marginsplot and user-written
graphical displays such as coefplot.
Additional information
oceania15_morrison.pdf
Pneumonia prevention using topical antibiotics in the intensive care unit (ICU): Another variation on control group variability
James Hurley
Ballarat Health Services
There are over 200 published studies of methods to prevent infections
acquired in the intensive care unit (ICU) such as pneumonia and
bacteremia. The application of various combinations of antibiotics
topically to the upper airway appears to be the most effective method
(over 40 studies). Surprisingly, within these studies of topical
antibiotics as the prevention method, the incidence of pneumonia and
bacteremia among the control groups is as much as double that among
control groups within studies of methods other than topical antibiotics.
Why?
Graphics such as those that metandi produces for meta-analysis of
diagnostic tests offer a "novel" approach to modeling the
relationship between control group rate and intervention effect size
within controlled trials. Stata offers a broad range of commands to
study statistical relationships, but an outstanding feature is the range
of graphical commands available that enable the data to be
"eyeballed". In this presentation, I will demonstrate, using graphs produced by
metan, metandi, funnelcompar, ellip, and good old
twoway (scatter), that the relationship between control group incidence and
effect size in this context is not simple. Is it cause and effect or
the other way around?
Reference:
Hurley, J. C. 2014. Topical antibiotics as a major contextual hazard
toward bacteremia within selective digestive decontamination studies: A
meta-analysis. BMC Infect Dis 14:714.
http://www.biomedcentral.com/1471-2334/14/714/.
Additional information
oceania15_hurley.pdf
Structural equation models with a binary outcome using Stata and Mplus
Richard J. Woodman
Flinders University
Xiuqin Hong
Central South University
Shuiyuan Xiao
Central South University
Arduino A. Mangoni
Flinders University
Structural Equation Modeling (SEM) is a powerful technique for
examining complex relational structures and potential causal pathways.
Although many software packages, including AMOS, Stata, Mplus, LISREL, and
R, provide routines for SEM with continuous outcomes, not all are capable
of handling categorical data. In addition, there are differences between
software in regard to the availability of desirable SEM features,
including model fit indices, tests of group invariance, direct- and
indirect-effect estimates, modification indices, and estimation
approaches. Mplus software is widely used in the social sciences and is
considered by many as the gold-standard software for SEM. Stata
introduced SEM in version 12 and implemented SEM for categorical
outcomes in version 13.
This presentation will describe and compare the
available estimation options of Stata and Mplus for SEM using a clinical
dataset that includes the binary outcome of coronary artery disease
(CAD). We used cross-sectional data on 242 individuals with CAD and 218
individuals without CAD to examine the potential causal pathways and
direct and indirect effects of homocysteine on CAD. Data were available
for systolic blood pressure, triglycerides, and cholesterol
subfractions. Body mass index, blood urea nitrogen, C-reactive protein,
and uric acid were used as markers of insulin sensitivity, renal
function, inflammation, and oxidative stress, respectively. In addition to
discussing the available estimation features of the two packages, this
presentation compares the respective syntaxes and path diagramming
features.
Additional information
oceania15_woodman.pdf
An assessment of current software: Parameter estimate accuracy for Generalized Linear Mixed Models with binary outcome data
Tyman Stanford
The University of Adelaide
Generalized linear mixed models (GLMMs) are a widely used class of
models that assume the expected value of an outcome variable is
determined by a linear combination of predictor variables, via an
invertible link function, with both fixed and random model coefficients.
Estimation of the model coefficients has improved with increased
computational power; the current gold standard to estimate GLMM
coefficients requires adaptive Gauss-Hermite quadrature approximation of
the profiled likelihood function, usually a multidimensional integral,
to obtain (approximate) maximum likelihood solutions. The performance of
widely used software packages in estimating fixed and random
coefficients with a Bernoulli outcome variable is the focus of this
work. The packages surveyed, many with multiple routines available to
perform GLMM parameter estimation, are Stata, R, SAS, ADMB, SPSS, and
Matlab. The GLMM routines in these packages are applied to multiple
simulated datasets with known parameters to determine the accuracy of
parameter estimates of both fixed effects and the variance components.
The effect of increasing the number of adaptive Gauss-Hermite quadrature
integral approximation points on the bias and precision of the
estimates, as well as the effect on model selection using AIC, will be
presented. The computational time taken to generate model parameter
estimates using simulated data is also presented, an additional
consideration in practice.
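The role of quadrature points can be illustrated with a minimal sketch. It uses a fixed (non-adaptive) 3-point Gauss-Hermite rule to integrate a normally distributed random intercept out of a logistic model. The function names and parameter values are hypothetical, and the adaptive variant and the routines compared in the presentation are not reproduced here.

```python
import math

# 3-point Gauss-Hermite rule for integrals of exp(-x^2) * f(x)
NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
WEIGHTS = [math.sqrt(math.pi) / 6,
           2 * math.sqrt(math.pi) / 3,
           math.sqrt(math.pi) / 6]


def normal_expectation(g, sigma):
    """Approximate E[g(u)] for u ~ N(0, sigma^2), substituting u = sqrt(2)*sigma*x."""
    total = sum(w * g(math.sqrt(2) * sigma * x) for x, w in zip(NODES, WEIGHTS))
    return total / math.sqrt(math.pi)


def marginal_prob(beta0, sigma):
    """Marginal P(y = 1) in a random-intercept logistic model (hypothetical values)."""
    return normal_expectation(lambda u: 1 / (1 + math.exp(-(beta0 + u))), sigma)


# The rule is exact for low-degree polynomials, so E[u^2] recovers sigma^2.
second_moment = normal_expectation(lambda u: u * u, 2.0)
p_marginal = marginal_prob(beta0=0.0, sigma=1.0)
```

With more nodes the approximation of the logistic integrand improves, which is the trade-off between accuracy and computing time that the abstract refers to.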
Additional information
oceania15_stanford.pdf
Model comparison for analysis of population surveillance data
Rosie Meng
Flinders University
Richard Woodman
Flinders University
Stephen R. Cole
Flinders University
Erin Symonds
Repatriation General Hospital
In this presentation, I evaluate the relative merits of different
approaches to the analysis of
population-level bowel cancer surveillance data using available Stata
routines. The focus is on selecting models to suit the research
questions and the ease of interpretation.
Outcomes of colonoscopies for colorectal cancer surveillance were
obtained from the South Australian Southern Cooperative Program for the
Prevention of Colorectal Cancer (SCOOP). The research questions
asked whether patient and adenoma characteristics were associated
with the degree of neoplasia advancement at the next surveillance
colonoscopy. For 379 patients with a diagnosis of low- or high-risk
adenoma at index colonoscopy, the first surveillance
colonoscopy was performed between 06-Dec-2001 and 21-Dec-2010. Five
regression models were constructed: 1) Cox cause-specific model (stcox);
2) Cox model with stratification; 3) parametric survival model (streg);
4) competing-risks survival model (stcrreg); and 5) multinomial logistic
regression (mlogit).
The four survival models generally agreed well and were
consistent with Kaplan-Meier curves, but results from mlogit differed
significantly from the rest.
Survival analysis is preferred for surveillance data especially when
follow-up time varies considerably between individuals. A cause-specific
Cox model may be preferred over a competing-risks model to ease result
interpretation.
Additional information
oceania15_meng.pdf
rdecompose: Outcome decomposition for aggregate data
JinJing Li
University of Canberra
Yohannes Kinfu
University of Canberra
Social, behavioral, and health scientists frequently apply methods for
decomposing changes or differences in outcome variables into components
of change. A number of Stata commands, such as those based on the
Blinder-Oaxaca approach, have been developed over the years to
facilitate this exercise using unit-level data. However, despite the
abundance of aggregate data and wide use of corresponding aggregate data
decomposition techniques, there are no comparable user-developed Stata
commands for decomposing changes or differences using aggregate-level
data. In this presentation, we introduce a new Stata command for aggregate data
decomposition, based on Gupta's reformulation, and demonstrate
applications from a wide range of settings that include demography,
epidemiology, and health economics. Our command also extends
existing approaches to allow any number of factors and various
functional relationships that are not available on any other platform.
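As a rough illustration of aggregate-level decomposition, the sketch below handles the simplest case: an outcome that is the product of two factors, compared across two populations. Each factor's effect is its change weighted by the average level of the other factor. This is an assumed textbook-style example with made-up numbers; it is not the rdecompose syntax, and whether the command computes exactly this form is an assumption.

```python
def decompose_two_factor(a1, b1, a2, b2):
    """Split the change in R = a * b between populations 1 and 2 into factor effects."""
    a_effect = (b1 + b2) / 2 * (a2 - a1)  # change in a, holding b at its average
    b_effect = (a1 + a2) / 2 * (b2 - b1)  # change in b, holding a at its average
    return a_effect, b_effect


# Hypothetical aggregate data: rate = prevalence * case fatality
a_eff, b_eff = decompose_two_factor(a1=0.10, b1=0.20, a2=0.15, b2=0.30)
total_change = 0.15 * 0.30 - 0.10 * 0.20
```

The two effects sum to the total change in the outcome, so no residual or interaction term is left over in this two-factor case.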
Additional information
oceania15_li.pdf
Bayesian analysis using Stata
Bill Rising
StataCorp
Bayesian analysis made its official Stata debut with the release of
Stata 14. In this presentation, we will explore some simple applications to
demonstrate the basics of Stata's user interface and suite of commands
for Bayesian analysis.
Additional information
oceania15_rising_beamer.pdf
oceania15_rising_handouts.pdf
oceania15_rising_handouts_a4.pdf
bayesfiles.zip
A practical introduction to Stata 14 item response theory (IRT)
Malcolm Rosier
SDAS
Stata 14 includes a module on item response theory (IRT). I discuss
basic characteristics of measurement in the social sciences, show how
traditional measurement techniques and IRT are related, and discuss
merits, constraints, and uses of IRT.
The IRT procedure produces a calibrated scale of the underlying (latent)
dimension at the interval level of measurement. The same scale is used
to obtain a measure of the difficulty of each item and of the ability of
each person. I illustrate the one-parameter and two-parameter logistic
models by analyzing a mathematics achievement test with dichotomous
responses, scored correct or incorrect. I then introduce the IRT
procedures applied to ordered categorical data. I apply the rating
scale model (RSM) and the graded response model (GRM) to attitude scale
data.
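The way item difficulty and person ability share one scale can be sketched with the logistic item response function. The function name and values below are hypothetical. The one-parameter model is taken here with discrimination fixed at 1 for simplicity, which is an assumption; software may instead estimate a common discrimination.

```python
import math


def irf(theta, b, a=1.0):
    """Probability of a correct response for ability theta, difficulty b, discrimination a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


p_matched = irf(theta=0.0, b=0.0)        # ability equals difficulty
p_easy = irf(theta=1.0, b=-1.0)          # able person, easy item (one-parameter form)
p_sharp = irf(theta=1.0, b=-1.0, a=2.0)  # same pair, more discriminating item (two-parameter form)
```

When ability equals difficulty, the probability of a correct response is one half, which is what ties the two quantities to the same latent scale.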
Additional information
oceania15_rosier.pdf
Identifying biomarkers in epidemiological studies using a fusion of data mining and traditional statistical techniques in Stata
Jo Dipnall
Deakin University
Julie A. Pasco
The University of Melbourne
Michael Berk
The University of Melbourne
Lana J. Williams
Deakin University
Seetal Dodd
Orygen Youth Health Research Centre
Felice N. Jacka
Murdoch Children's Research Institute
Denny Meyer
Swinburne University of Technology
Epidemiological studies generally incorporate vast numbers of variables.
There are a multitude of techniques for variable selection in data
mining, machine learning, and traditional statistics with varying
accuracy. The aim of this study was to incorporate these techniques in
Stata to identify key biomarkers, from a large number measured, and
explore their associations with depression.
Data from the National Health and Nutrition Examination Survey
(2009-2010) were utilized (n=5,227, mean age=43 yr). Depressive symptoms
were measured using the Patient Health Questionnaire-9. Blood and urine
samples were taken, and large numbers of biomarkers measured (n=67).
Anthropometric measurements, demographics, and medications were
determined. Lifestyle and health conditions were obtained via a
questionnaire.
A four-step analysis process was performed incorporating multiple
imputation, a Stata boosted regression plugin, and traditional
statistical techniques. Covariates included sex, age, race, smoking,
food security, PIR, BMI, diabetes, inactivity, and medications. The final
model controlled for confounders and effect moderators. All analysis was
managed within Stata's project and macro do environment.
Out of a possible 67 biomarkers, 4 were identified as being associated
with depressive symptoms. Implementing this research's complex
analysis strategy entirely from within Stata eliminated cross-platform
errors and ensured easy replication of the results.
The Hjort-Hosmer goodness-of-fit statistic for binary regression
Steve Quinn
Flinders University
D.W. Hosmer
University of Massachusetts, Amherst
The statistic most commonly used to evaluate the adequacy of a logistic
regression model is the Hosmer-Lemeshow statistic. Hosmer and Lemeshow
proposed a goodness-of-fit test based on partitioning the fitted
probabilities into a number of groups and comparing observed events to
expected events within each group. They showed via simulations that the
resulting statistic follows a chi-squared distribution with degrees of
freedom approximately equal to the number of groups minus two. The
Hjort-Hosmer statistic also assesses model adequacy and is based on
partial sums of residuals that are sorted by their corresponding fitted
values. The basic idea is that if a model is correctly fitted, then the
partial sums should vary randomly about zero, and better model fit
should correspond to smaller maximal partial sums. In this presentation, the
Hosmer-Lemeshow and Hjort-Hosmer statistics are compared in binary
regression models with different links, and we describe hjorthos,
which calculates the Hjort-Hosmer statistic.
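The partial-sum idea can be sketched in a few lines: sort observations by fitted probability, accumulate the residuals, and record the largest absolute partial sum. This is only an illustration with made-up data. The function name is hypothetical, and the scaling and reference distribution of the published statistic, as well as the hjorthos implementation, are not reproduced here.

```python
from itertools import accumulate


def max_abs_partial_sum(y, p):
    """Largest absolute partial sum of residuals y - p, ordered by fitted value p."""
    pairs = sorted(zip(p, y))                      # sort by fitted probability
    residuals = [yi - pi for pi, yi in pairs]
    return max(abs(s) for s in accumulate(residuals))


y = [0, 0, 1, 1]
well_fitted = max_abs_partial_sum(y, [0.2, 0.4, 0.6, 0.8])
poorly_fitted = max_abs_partial_sum(y, [0.8, 0.6, 0.4, 0.2])
```

In this toy example the fitted values that track the outcomes give the smaller maximal partial sum, in line with the intuition stated in the abstract.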
Additional information
oceania15_quinn.pdf
xtcluster: A partially heterogeneous framework for short panel-data models
Demetris Christodoulou
MEAFA
Vasilis Sarafidis
Monash University
xtcluster implements the partially heterogeneous framework proposed by
Sarafidis and Weber (2015). The algorithm classifies individuals into
panel-data regression clusters, such that within each cluster, the slope
coefficients are homogeneous, and intracluster heterogeneity is
attributed to the presence of individual- and time-specific effects. The
slope coefficients differ across clusters. The optimal number of
clusters and the associated optimal partition are determined using a
model information criterion that is consistent for T fixed as N grows
large. The proposed method relies on the data to suggest any clustering
structure that might exist. Hence, it can be particularly useful when
there is no a priori information about a potential clustering structure,
or when one is interested in examining how far a structure that might be
meaningful according to some economic measure lies from the structure
that is optimal from a statistical point of view.
Additional information
oceania15_christodolou.pdf
Statdoc: Document and explore
Markus Schaffner
Queensland University of Technology
Statdoc is a small utility program written in Java that automatically
documents data analysis projects. It is modeled after similar tools
used in software development and as such supports good coding standards.
The program can run stand-alone or from within Stata and produces a set
of static HTML files that reveal information about the files in a given
folder structure.
Statdoc automatically discovers as much information as possible about the
data, variables, script files, and output files that it can identify
and highlights the links between them. It features an enhanced
documenting comment type, which allows it to record supporting
meta-information. This way, it allows the user to organize projects with
ease and helps to uncover information about other people's projects.
The utility is aimed at real-world research projects where a multitude
of data sources, script files, and outputs are not uncommon. Because the
documentation is produced as static HTML files, it also facilitates
sharing the complete information about a project on the web, helping
efforts to make the data analysis process more transparent. Statdoc is
available as an open source project on GitHub (for more information and
examples, see https://github.com/mas802/statdoc).
Additional information
oceania15_schaffner.pdf
Using interrupted time-series analysis to examine the effectiveness of the comprehensive stroke unit model
Susan Kim
Flinders University
Daniel Verma
Flinders Medical Centre
Chris Horwood
South Australia Department of Health
Paul Hakendorf
Flinders Medical Centre
Andrew Lee
Flinders University
Stroke care on the comprehensive stroke unit (CSU) is the gold standard.
Care for stroke patients often involves neurologists as well as other
physicians with stroke care expertise and training, that is, stroke
physicians. The aim of this study is to examine whether the CSU results
in better outcomes irrespective of the physician.
Data from patients with ischemic stroke admitted to a single center
between 2000 and 2014 were analyzed. Three system
changes were made during this time: (1) patients were initially seen by
a neurologist and transferred to a stroke physician from 2004 onward; (2)
advent of a stroke-trained neurologist in 2007; and (3) a CSU model with
care by a single stroke physician led by a stroke director from 2010 onward.
Interrupted time-series analysis was used to model the changes in
patients' outcomes and complication rates over time using monthly
aggregated data.
The percentage of patients discharged to rehabilitation facilities
changed significantly after each implementation (p<0.01), and
significantly fewer patients developed aspiration pneumonia
post 2010 (p=0.045). More patients were sent to rehabilitation
facilities and fewer had complications with the CSU model, so better
outcomes can be achieved via the CSU model of care even when staffed by
nonneurologist stroke physicians.
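Interrupted time-series analysis is commonly set up as a segmented regression with a level change and a slope change at each interruption. The sketch below fits one interruption to noise-free, made-up monthly data by ordinary least squares. The function names, the single break point, and the data are all assumptions for illustration; the study itself involved three system changes, and its model specification is not given in the abstract.

```python
def design_row(t, t0):
    """Regressors for month t with one interruption at month t0."""
    post = 1.0 if t >= t0 else 0.0
    # intercept, pre-existing trend, level change, slope change
    return [1.0, float(t), post, post * (t - t0)]


def ols(X, y):
    """Ordinary least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical series: 24 months, interruption at month 12
true_coefs = [50.0, 0.5, 5.0, 1.0]
X = [design_row(t, 12) for t in range(24)]
y = [sum(c * x for c, x in zip(true_coefs, row)) for row in X]
coefs = ols(X, y)
```

The third and fourth coefficients estimate the immediate level change and the change in trend after the interruption. Real monthly aggregates would also call for attention to autocorrelation, which this sketch ignores.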
Additional information
oceania15_kim.pdf
Count model selection and postestimation to evaluate composite flour technology adoption in Senegal (West Africa)
Kodjo Kondo
University of New England
This presentation examines Stata estimation and postestimation analyses in
identifying determinants of the probability and extent of adoption of
composite flour technology in bread baking in the Dakar region of
Senegal (West Africa). The technology is promoted to limit dependency on
imported wheat. A hurdle regression model is estimated using
socioeconomic and production data collected from 150 bakers in 2014.
The hurdle model, which was preferred over the negative binomial and the
zero-inflated negative binomial models, allows us to disentangle factors
affecting the adoption decisions from those influencing the quantities
used. Findings indicate that the ownership of a 50 kg mixer, training
programs on composite flour production, and the number of bakeries owned
positively affect adoption decisions, while the quantity decisions are
influenced by membership in the baker federation and the expected
output. The wheat and millet flour price ratio positively affects both
decisions. These results imply that efforts to increase the adoption
rate and its extent should promote the 50 kg mixers, intensify the
professional training on composite flour production, institutionalize
the use of composite flour, and contribute to making local flour cheaper
than wheat flour by intensifying local cereal production.
Additional information
oceania15_kondo.pdf
Wishes and grumbles
Bill Rising & Rebecca Pope
StataCorp
StataCorp will be happy to
receive wishes for developments in Stata and almost as happy to
receive grumbles about the software.
Scientific organizers
Demetris Christodoulou (chair), University of Sydney
Yohannes Kinfu, University of Canberra, CeRAPH
Ghada Gleeson, The Australian National University, ACERH
JinJing Li, University of Canberra
Con Menictas, University of Newcastle
Logistics organizers
Survey Design and Analysis Services Pty Ltd,
the official distributor of Stata in Australia and New Zealand.