The Stata Conference was held 19–20 July 2018. You can view the proceedings and presentation slides (below) and the conference photos.
9:00–9:20 |
Analysis of surgical outcomes in clustered data: Approaches
and interpretation
Abstract:
Observational clinical studies increasingly use large and complex
datasets representing patients who are clustered by provider,
institution, or geographic location.
Previous research on surgical outcomes (including morbidity, mortality, and
subsequent healthcare utilization) has highlighted provider technique and
experience, center volume-outcomes relationships, and geographical disparities in the
quality of surgical care as important applications of clustered data
analysis. In regression models, the nonindependence of outcomes within
each cluster may be handled through cluster-robust standard errors or
introduction of cluster-level fixed or random effects. However, clinical
studies rarely articulate and occasionally misinterpret the rationale
for applying these methods. I review recent literature on surgical
outcomes to describe how the choice of approach may be influenced by the
intended comparison among clusters, theoretical expectation of specific
cluster-level factors influencing patient outcomes, and clinical
importance of residual variation among clusters. I then present an
example from transplant surgery where the primary contribution of a
mixed-effects model is made by interpreting residual county-level
variation in posttransplant survival.
Additional information:
Dmitry Tumin
The Ohio State University, Nationwide Children's Hospital
|
9:20–9:40 |
Organ pipe plots for clustered datasets: Visualize disparities in
cluster-level coverage
Abstract:
Leo Tolstoy is famous for his novels and less well known for his ideas
on survey data analysis. Concerning estimated proportions, he is said to
have written: "Covered strata are all alike; every poorly covered
stratum is poorly covered in its own way." I describe a new
command to make what we call organ pipe plots to visualize heterogeneity
in binary outcomes in clustered data. The plots were conceived for
vaccination coverage surveys, but they are helpful in a wide variety of
contexts. Imagine a survey where only 50% of sampled children are found
to be vaccinated. Different programmatic responses would be appropriate
if the vaccinated include all the children in half the clusters
versus half the children in all the clusters. These plots have been used
to identify neighborhoods that were surreptitiously and intentionally
skipped over during vaccination campaigns. The talk will demonstrate the
command and discuss similarities with Pareto plots from quality control
and a visual connection to the intracluster correlation coefficient
(ICC). Note that the ICC shares a connection to anarcho-pacifistic
ideas in Tolstoy’s later novels: many students mention them … but few
can describe them clearly.
Additional information:
Mary Prier
Biostat Global Consulting
|
9:40–10:00 |
Hepatobiliary-related outcomes in US adults exposed to lead
Abstract:
The purpose of this cross-sectional study was to investigate
hepatobiliary-related clinical markers in United States adults (aged ≥
20) exposed to lead using the National Health and Nutrition Examination
Survey (NHANES) 2007–2008 and 2009–2010 datasets.
Clinical markers and occupation were evaluated in 4 quartiles of exposure—0–2 μg/dL,
2–5 μg/dL, 5–10 μg/dL, and 10 μg/dL and over—to examine how the
markers and various occupations manifested in the quartiles. Linear
regression determined associations, and binary logistic regression
predicted the likelihood of elevated clinical markers using binary
degrees of exposure set at 2 μg/dL, 5 μg/dL, and 10 μg/dL. Clinical
markers, and how they manifested between exposed and less exposed
occupations, were explored in addition to how duration of exposure
altered these clinical markers. In regression analysis, gamma-glutamyl
transferase (GGT), total bilirubin, and alkaline phosphatase (ALP) were
positively and significantly associated with blood lead level (BLL).
Using binary logistic regression models, ALP and GGT were more likely to
be elevated in those exposed at both the 2 μg/dL and 5 μg/dL thresholds,
whereas at the 10 μg/dL threshold, only GGT was more likely to be
elevated in those exposed. In the occupational analysis,
aspartate aminotransferase (AST), alanine aminotransferase (ALT), GGT,
and ALP showed differences between populations in the exposed and
less exposed occupations. Regarding agriculture, forestry and fishing,
duration of exposure altered AST, ALP, and total bilirubin significantly
(p < 0.05), while ALT and GGT were altered moderately significantly (p <
0.10). With mining, duration of exposure altered AST and GGT moderately
significantly, whereas in construction, duration in occupation altered
AST and GGT significantly and total bilirubin moderately significantly.
The study findings are evidence of occupational exposure to lead playing
a significant role in initiating and promoting adverse hepatobiliary
clinical outcomes in United States adults.
Additional information:
Emmanuel Obeng-Gyasi
North Carolina A&T State University
|
10:00–10:40 |
Disappearing Medicaid enrollment disparities for United States citizen children in immigrant families: An example of average marginal analyses for applied research
Additional information:
Eric Seiber
Ohio State University
|
11:10–11:40 |
Bayes for undergrads
Abstract:
Teaching Markov chain Monte Carlo Bayesian methods to undergraduates can
be challenging because, for the most part, they are not familiar with
advanced methodologies such as multilevel models, IRT, or other
analytical methods commonly found in Bayesian analyses.
However, almost every undergraduate is familiar with the t test. This
presentation will use Stata's bayesmh command to perform a two-sample
independent t test. We will discuss the advantages of using a Bayesian
approach to perform t test-type analyses and compare the output or
results with the traditional frequentist t test.
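As a hedged sketch of the general approach (not the presenter's actual code), a Bayesian analogue of the two-sample independent t test can be set up in bayesmh by regressing the outcome on a group indicator, so the group coefficient is the difference in group means. The variable names (score, group) and the prior choices here are illustrative assumptions:

```stata
* Bayesian two-sample comparison of means (hypothetical variables score and group)
bayesmh score i.group, likelihood(normal({sigma2}))  ///
    prior({score:}, normal(0, 10000))                ///
    prior({sigma2}, igamma(0.01, 0.01))              ///
    rseed(12345)

* Posterior summary of the difference in means (the coefficient on 1.group)
bayesstats summary {score:1.group}
```

Unlike a frequentist t test, the output is a full posterior distribution for the mean difference, which can be summarized with credible intervals or posterior probabilities.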
Additional information:
Phil Ender
UCLA (Ret)
|
11:40–12:10 |
Output and automatic reporting using putdocx/putpdf
Abstract:
Are you tired of copying and pasting tables, titles, figures,
paragraphs, and footnotes from Excel into Word or PDF files? Here is
good news: Stata 15 includes new features that create analysis tables,
figures, footnotes, and paragraphs directly in Word or PDF files. The
new commands, putdocx and putpdf, serve as a one-stop shop for
transforming your Stata code's output into Word or PDF files. This
presentation will show you how to generate analysis tables, figures, and
discussion or summary paragraphs directly in Word or PDF format. Plus,
instead of manually updating the numbers in your tables, figures,
summary paragraphs, or footnotes when periodic updates are required, all
you need to do is refresh the dataset and rerun your existing
putdocx/putpdf do-file to see instantly updated results in the Word or
PDF file. This can be done in one click. More specifically, below is a
list of formatting and analysis results to be produced with
putdocx/putpdf and output directly in a Word or PDF file:
1. Paragraphs with statistics in them
2. Figures
3. Tables
   • descriptive summary table
   • regression table
   • logistic regression table
   • survival analysis table, etc.
4. Automation of exporting
5. Combination of several .docx files into one summary report.
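As a minimal sketch of the workflow (not the presenter's code; the filename and the auto dataset are illustrative), a do-file can build a Word report that regenerates itself whenever the data change:

```stata
sysuse auto, clear
putdocx begin
putdocx paragraph, style(Heading1)
putdocx text ("Regression summary")

regress mpg weight foreign
putdocx table results = etable          // export the estimation table

putdocx paragraph
putdocx text ("The fitted model has R-squared = ")
putdocx text (e(r2)), nformat(%5.3f)    // statistic embedded in the paragraph

putdocx save myreport.docx, replace
```

Rerunning this do-file after a data refresh rewrites myreport.docx with the updated table and statistics, with no manual copying.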
Additional information:
Winnie Hua
Corrona, LLC
|
12:10–12:30 |
Assessing the calibration of dichotomous outcome models with the
calibration belt
Abstract:
The calibration belt is a graphical approach designed to evaluate the
goodness of fit of binary outcome models such as logistic regression
models. The calibration belt examines the relationship between estimated
probabilities and observed outcome rates. Significant deviations from
the perfect calibration can be spotted on the graph. The graphical
approach is paired with a statistical test, synthesizing the calibration
assessment in a standard hypothesis-testing framework. We present the
calibrationbelt command, which implements the calibration belt and its
associated test in Stata.
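A hedged usage sketch (the outcome and covariates are hypothetical, and the exact options follow the command's help file): fit a logistic model, obtain predicted probabilities, and then draw the calibration belt:

```stata
* Hypothetical binary outcome 'died' and covariates
logit died age i.treatment
predict phat, pr

* Plot the calibration belt and run the associated statistical test
calibrationbelt died phat
```

The belt shows where, across the range of predicted probabilities, the model significantly over- or under-predicts the observed outcome rate.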
Additional information:
Giovanni Nattino
The Ohio State University, The Ohio Colleges of Medicine Government Resource Center
|
1:30–2:15 |
Nonlinear mixed-effects regression
Abstract:
In many applications, such as biological and agricultural growth
processes and pharmacokinetics,
the time course of a continuous response
for a subject may be characterized by a nonlinear function.
Parameters in these subject-specific nonlinear functions often have
natural physical interpretations, and observations within the same
subject are correlated. Subjects may be nested within higher-level
groups, giving rise to nonlinear multilevel models, also known as
nonlinear mixed-effects or hierarchical models. The new Stata 15 command
menl allows you to fit nonlinear mixed-effects models, in which fixed
and random effects may enter the model nonlinearly at different levels
of hierarchy. In this talk, I will show you how to fit nonlinear
mixed-effects models that contain random intercepts and slopes at
different grouping levels with different covariance structures for both
the random effects and the within-subject errors. I will also discuss
parameter interpretation and highlight postestimation capabilities.
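As a hedged illustration along the lines of the menl documentation's orange-tree growth example (a logistic growth curve in which the asymptote varies by tree through a random intercept):

```stata
webuse orange, clear

* Trunk circumference approaches asymptote phi1 as age increases;
* phi1 is subject-specific: fixed effect b1 plus random intercept U1 by tree.
menl circumf = {phi1:}/(1 + exp(-(age - {phi2})/{phi3})), ///
    define(phi1: {b1} + {U1[tree]})

estat ic    // postestimation: information criteria
```

Here the random effect enters the model nonlinearly through phi1, and each fixed parameter (the asymptote, inflection age, and growth scale) has a direct physical interpretation.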
Additional information:
Houssein Assaad, Senior Statistician and Software Developer
StataCorp
|
2:15–3:00 |
ERMs, simple tools for complicated data
Abstract:
While the term "extended regression model" (ERM) may be new, the method
is not. ERMs are regression models with continuous outcomes (including
censored and tobit outcomes), binary outcomes, and ordered outcomes that
are fit via maximum likelihood and that also account for endogenous
covariates, sample selection, and nonrandom treatment assignment. These
models can be used when you are worried about bias due to unmeasured
confounding, trials with informative dropout, outcomes that are missing
not at random, selection on unobservables, and more. ERMs provide a
unifying framework for handling these complications individually or in
combination. I will briefly review the types of
complications that ERMs can address. I will work through examples that
demonstrate several of these complications and show some inferences we
can make despite those complications.
Additional information:
Charles Lindsey, Senior Statistician and Software Developer
StataCorp
|
3:00–3:30 |
Even simpler standard errors for two-stage optimization estimators:
Mata implementation via the deriv command
Abstract:
Terza (2016a) offers a heretofore unexploited simplification (henceforth
referred to as SIMPLE) of the conventional formulation for the standard
errors of two-stage optimization estimators (2SOE). In that paper,
SIMPLE was illustrated in the context of two-stage residual inclusion
(2SRI) estimation (Terza et al., 2008). Stata/Mata implementations of
SIMPLE for 2SRI estimators are detailed in Terza (2017a and b). Terza
(2016b) develops a variant of SIMPLE for calculating the standard errors
of two-stage marginal effects estimators (2SME). Generally applicable
Stata/Mata implementation of SIMPLE for 2SME is detailed in Terza
(2017c) and compared with results from the Stata margins command (for
the subset of cases in which the margins command is available). Although
SIMPLE substantially reduces the analytic and coding burden imposed by
the conventional formulation, it still requires the derivation and
coding of key partial derivatives that may prove daunting for some model
specifications. In this presentation, I detail how such analytic demands
and coding requirements are virtually eliminated via the use of the Mata
deriv command. I will discuss illustrations in the 2SRI and 2SME
contexts.
References:
Terza, J., A. Basu, and P. Rathouz (2008). Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics 27: 531–543.
Terza, J.V. (2016a). Simpler standard errors for two-stage optimization estimators. Stata Journal 16: 368–385.
Terza, J.V. (2016b). Inference using sample means of parametric nonlinear data transformations. Health Services Research 51: 1109–1113.
Terza, J.V. (2017a). Two-stage residual inclusion estimation: A practitioners guide to Stata implementation. Stata Journal 17: 916–938.
Terza, J.V. (2017b). Two-stage residual inclusion estimation in health services research and health economics. Health Services Research, forthcoming. DOI: 10.1111/1475-6773.12714.
Terza, J.V. (2017c). Causal effect estimation and inference using Stata. Stata Journal 17: 939–961.
Additional information:
Joseph Terza
Indiana University Purdue University Indianapolis
|
4:00–4:20 |
Estimating the average lifetime of nonmaturity deposits
Abstract:
Nonmaturity deposits represent funds placed with banks that have no
contractually set time for maturing, or leaving the bank.
However, in the aggregate, such deposits tend to be increasingly
withdrawn as the accounts age. Having an estimate
of the lifetime of such deposit accounts is an important ingredient for
calculating their present value. We show how to model the average
lifetime based, first, on estimating the decay rate of deposit balances
using Stata's nl command and, second, on calculating average lifetime
based on the decay rate.
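As a hedged sketch of the two steps (the variable names balance and months are hypothetical, as are the starting values): fit an exponential decay with nl, then compute the implied average lifetime. For a balance decaying exponentially at rate λ, the average lifetime is 1/λ:

```stata
* Step 1: estimate the decay rate of deposit balances over account age
nl (balance = {b0=100}*exp(-{lambda=0.05}*months))

* Step 2: average lifetime implied by the fitted decay rate,
* with a delta-method standard error
nlcom (avg_lifetime: 1/_b[/lambda])
```

The nlcom step turns the estimated decay parameter directly into the quantity needed for present-value calculations, along with its standard error.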
Additional information:
Calvin Price
MUFG
|
4:20–4:40 |
Automating exploratory data analysis tasks with eda
Abstract:
Several tools currently exist in Stata for document
preparation, authoring, and creation, each with its own unique
strengths. Similarly, there are many tools available to map data to
visual dimensions for exploratory and expositive purposes. While these
tools are powerful on their own, they do not attempt to solve the most
significant resource constraint we all face. The eda command is
designed to address this time constraint by automating the creation of
all the univariate and bivariate data visualizations and summary
statistics tables in a dataset. Users can specify categorical and
continuous variables manually, provide their own rules based on the
number of unique values, or allow eda to use its own defaults, and
eda will apply the necessary logic to graph and describe the data
available. The command is designed to produce the maximum amount of
output by default, so a single line of code can easily produce a
document providing substantial insight into your data.
Additional information:
Billy Buchanan
Fayette County Public Schools
|
9:00–9:20 |
Ordinary least-products regression is a simple and powerful
statistical tool to identify systematic disagreement between
two measures: Fixed and proportional bias assessment
Abstract:
Background: We aimed to provide a statistical procedure to assess
systematic disagreement between two measures, assuming that measurements
made by either method are attended by random error.
Methods: We applied
Bland–Altman analysis (baplot) and ordinary least-products (OLP)
regression (manually) in three simulated pairs of samples (N=100). In
OLP, values of y and x are used in the major axis regression analysis,
but then intercept and slope are back-transformed by dividing them by
(). Fixed bias was defined as present if the 95% confidence interval
(CI) of the intercept did not include 0; proportional bias was defined
as present if the 95% CI of the slope did not include 1. Results: Using
baplot, we found no fixed bias (bias=3.4 minutes/day; 95% CI -10.4 to
17.2) and no proportional bias (r=-0.2; p=0.09) for physical activity
(PA), and both fixed bias (bias=-5.3 hours/day, 95% CI -5.4 to -5.2;
bias=4.5 hours/day, 95% CI 4.3 to 4.7) and proportional bias (r=-0.9,
p<0.01; r=0.8, p<0.01) for sedentary behaviour (SB) and sleep time,
respectively. Using OLP, we obtained findings similar to baplot for PA
(intercept=23.1, 95% CI -3.04 to 49.3; slope=0.92, 95% CI 0.83 to 1.01)
and sleep time (intercept=3.14, 95% CI 2.82 to 3.45; slope=1.20, 95% CI
1.16 to 1.24). However, we found no fixed and no proportional bias for
SB (intercept=-0.04, 95% CI -0.45 to 0.38; slope=0.20, 95% CI -0.07 to
0.10). Conclusions: OLP could be included in Stata as a valid and
comparable alternative to the Bland–Altman method.
Additional information:
Marcus Vinicius Nascimento-Ferreira
YCARE (Youth/Child cArdiovascular Risk and Environmental)
Research Group, Universidade de São Paulo
|
9:20–9:50 |
Simple tools for saving time
Abstract:
This brief talk will show some simple tools for saving time when working with Stata.
This will be a hodgepodge of items whose goal is to reduce the amount of thought, coordination,
and human memory required of common tasks in a complex work environment while speeding up such tasks greatly.
Additional information:
Bill Rising, Director of Educational Services
StataCorp
|
9:50–10:20 |
Vector-based kernel weighting: A simple estimator for improving precision
and bias of average treatment effects in multiple treatment settings
Abstract:
Treatment-effect estimation must account for endogeneity, in which
factors affect treatment assignment and outcomes simultaneously. By
ignoring endogeneity, we risk concluding that a helpful treatment is not
beneficial or that a treatment is safe when it is actually harmful.
Propensity-score (PS) matching or weighting adjusts for observed
endogeneity, but matching becomes impracticable with multiple
treatments, and weighting methods are sensitive to PS model
misspecification in applied analyses. We used Monte Carlo simulations
(1,000 replications) to examine sensitivity of multivalued treatment
inferences to PS weighting or matching strategies. We consider four
variants of PS adjustment: inverse probability of treatment weights
(IPTW), kernel weights, vector matching, and a new hybrid—vector-based
kernel weighting (VBKW). VBKW matches observations with
similar PS vectors, assigning greater kernel weights to observations
with similar probabilities within a given bandwidth. We varied the degree of
PS model misspecification, sample size, number of treatment groups, and
sample distribution across treatment groups. Across simulations, VBKW
performed equally or better than the other methods in terms of bias and
efficiency. VBKW may be less sensitive to PS model misspecification than
other methods used to account for endogeneity in multivalued treatment
analyses.
Additional information:
Jessica Lum
Department of Veterans Affairs
|
10:50–11:20 |
dtalink: Faster probabilistic record linking and deduplication
methods in Stata for large data files
Abstract:
Stata users often need to link records from two or more data files or
find duplicates within data files. Probabilistic linking methods are
often used when the file or files do not have reliable or unique
identifiers, causing deterministic linking methods (such as Stata's
merge or duplicates commands) to fail. For example, one might need to
link files that only include inconsistently spelled names, dates of
birth with typos or missing data, and addresses that change over time.
Probabilistic linkage methods score each potential pair of records on
the probability the two records match so that pairs with higher overall
scores indicate a better match than pairs with lower scores. Two
community-contributed Stata commands for probabilistic linking exist
(reclink and reclink2), but they do not scale efficiently. dtalink is a
new command that offers streamlined probabilistic linking methods
implemented in parallelized Mata code. Significant speed improvements
make it practical to implement probabilistic linking methods on large,
administrative data files (files with many rows or matching variables),
and new features offer more flexible scoring and many-to-many matching
techniques. The presentation introduces dtalink, discusses useful tips
and tricks, and provides an example of linking Medicaid and birth
certificates data.
Additional information:
Keith Kranker
Mathematica Policy Research
|
11:20–11:40 |
Doing less with Stata Markdown
Abstract:
Stata’s new dyndoc and its sister commands provide a rich set of
tools for reimagining document writing. An example of this is a document
translator, stmd, that converts dynamic documents written with plain
Markdown tags to Stata’s dyndoc format. This allows the user to write
documents in the simple, uncluttered Markdown style used with other
programming languages and on websites and still use many of dyndoc’s
features such as executing code and embedding graphics links.
Additional information:
Doug Hemken
Social Science Computing Cooperative,
University of Wisconsin–Madison
|
11:40–12:00 |
New data-cleaning command: assertlist improves speed and accuracy of
collaborative correction
Abstract:
Stata’s handy assert command can certify that a dataset meets a set of
user expectations, but when one assertion is violated, it throws an
error and does not proceed to check the rest. Identifying problems with
every variable in a large dataset can involve a messy set of ad hoc
error traps and list commands to learn what unexpected values occur in
what dataset rows. Furthermore, code to replace errant values sometimes
involves if syntax with a list of terms connected by Boolean ANDs that
identify the row targeted for the fix; when typed by hand, these rows
are quite susceptible to typographical errors. This talk describes a new
command, assertlist, that can test an entire set of assertions in
one run without ad hoc code to drill down or move on. Exceptions are
listed either to the screen or a spreadsheet. In situations where
problematic values will later be corrected or replaced, assertlist
generates spreadsheet columns that wait to receive manually
corrected values and other columns that immediately put corrected values
into Stata replace commands for easy pasting into downstream do-files.
In our experience, assertlist streamlines well-documented data cleaning
and guards against errors in correction code.
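A hedged sketch contrasting assert with assertlist (the option names here are assumptions based on the command's description, and respid is a hypothetical ID variable):

```stata
* assert halts at the first violated assertion:
assert !missing(dob)
assert age >= 0 & age < 120

* assertlist instead tests the assertion on every row and lists all
* exceptions, optionally to a spreadsheet that can hold corrections:
assertlist age >= 0 & age < 120, idlist(respid) list(age) ///
    excel(exceptions.xlsx) fix
```

The spreadsheet output is what enables the collaborative workflow described above: colleagues enter corrected values in the designated columns, and the generated replace statements are pasted back into downstream do-files.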
Additional information:
Dale Rhoda
Biostat Global Consulting
|
1:00–1:30 |
Regulation and US state-level corruption
Abstract:
I exploit a panel dataset on the US for 1990–2013 to evaluate the
causal impact of government regulation on bureaucratic corruption.
Despite the stylized fact that corruption and regulation are positively
correlated,
there is a lack of empirical evidence to substantiate a
causal relationship. Using novel data on federal regulation of
industries (Al-Ubaydli and McLaughlin 2015) and convictions of public
officials from the Public Integrity Section, I apply a stochastic
frontier approach to account for one-sided measurement error in
bureaucratic corruption and the Lewbel (2012) identification strategy to
control for potential endogeneity of regulation. Results are striking.
Based on the preferred model, there is evidence of endogeneity of
regulation and absence of a causal link between regulation and
corruption. However, if any of the above two econometric issues are
ignored, evidence of a spurious relationship between corruption and
regulation is found.
Additional information:
Sanchari Choudhury
Southern Methodist University
|
1:30–1:50 |
The satisfaction with healthcare services in the Emirate of
Dubai using Dubai Household Survey 2014: Inpatient admission
Abstract:
The population of the Emirate of Dubai is 2.8 million. The Dubai Health
Authority (DHA) is the government entity that oversees healthcare in the
Emirate. It is therefore important to measure patients' satisfaction
with healthcare services in the Emirate in order to improve the services
provided. This study is a secondary analysis of data collected through
complex stratified (geographic area), multistage probability sampling.
Using ordered logistic regression, the study examines satisfaction with
healthcare services in the Emirate of Dubai in relation to inpatient
admission during the last 12 months.
Satisfaction was used as a dependent variable, and many independent
variables were used in the model, including suffering from a chronic
disease and admission as an inpatient during the last 12 months. Other
covariates included were age, gender, insurance type, and nationality.
With respect to satisfaction with healthcare services, we found no
difference between those with and without a chronic disease, no
difference between males and females, and no difference by age. All
other insurance types were less likely to be satisfied compared with
private insurance as the reference group. All other nationalities in
Dubai were more likely to be satisfied compared with UAE nationals as
the reference group. Respondents not admitted as inpatients during the
last 12 months were more likely to be satisfied with the healthcare
services than those admitted in the government sector, the reference
group. There is a need to improve healthcare services in the government
sector of the Emirate of Dubai through public-private partnership and
competition with the private sector, improving quality of care and
waiting times across all government health providers.
Additional information:
Wafa Alnakhi
Johns Hopkins University
|
1:50–2:10 |
Welfare gain of rice-grading information
Abstract:
This study examines consumers' valuation of rice-grade labeling
information to assess the effectiveness of the mandatory rice-grading
policy introduced in October 2018. We measure consumers' premiums for
super, good, and normal grades before and after providing grade-labeling
information using a nonhypothetical random nth-price experimental
auction. We then estimate consumers' value of grade-labeling information
by comparing these premiums with market premiums. The results suggest
that consumers value the provision of grade-labeling information, with
the highest value for the super grade. Given the grade-labeling
information, additional detailed information about grade labeling does
not affect consumers' rice-purchasing behavior. The findings suggest
that rice-grading information is an important factor differentiating
domestic rice from imported rice, and it also provides consumers
credible information on rice quality to make better purchasing
decisions.
Additional information:
Doo Bong Han
Korea University
|
Scientific committee
Bowling Green State University
National Science Foundation