July 27–28
The Stata Conference was held July 27–28, 2017, but you can still view the program and presentation slides (below) and the conference photos.
9:00–9:20 |
Abstract:
This presentation introduces two user-written Stata
commands related to the data and calculations of
demographic life tables, whose most prominent feature is
the calculation of life expectancy at birth. The first
command, hmddata, provides a convenient interface
to the Human Mortality Database (HMD,
www.mortality.org), a database widely used for mortality
data by demographers, health researchers, and social
scientists. Different subcommands of hmddata
allow data from this database to be easily loaded,
transformed, reshaped, tabulated, and graphed. The
second command, lifetable, produces demographic
period life tables. The main features are that life
table columns can be flexibly calculated using any valid
minimum starting information; abridged tables can be
generated from complete ones; finally, a Stata dataset
can hold any number of life tables, and the various
lifetable subcommands can operate on any subset
of them.
Additional information: Baltimore17_Schneider.pdf
Daniel C. Schneider
Max Planck Institute for Demographic Research
|
9:40–10:00 |
Abstract:
There has been extensive research indicating
gender-based differences among STEM subjects,
particularly mathematics (Albano and Rodriguez 2013;
Lane, Wang, and Magone 1996). Similarly, gender-based
differential item functioning (DIF) has been researched
because of the disadvantages females face in STEM
subjects when compared with their male counterparts.
Given these findings, this study will apply the multiple
indicators multiple causes (MIMIC) model, a type of
structural equation model, to detect the presence of
gender-based DIF using the Program for International
Student Assessment (PISA) mathematics data from students
in the United States of America and then predict the DIF
using math-related covariates. This study will build
upon a previous study that explored the same data using
the hierarchical generalized linear model and will be
confirmatory in nature. Based on the results of the
previous study, it is expected that several items will
exhibit DIF that disadvantages females and that
mathematics-based self-efficacy will predict the DIF.
However, additional covariates will also be explored,
and the two models will be compared in terms of their
DIF detection and the subsequent modeling of DIF.
Implications of these results include females
underachieving when compared with their male
counterparts, thus continuing the current trend. These
gender differences can further manifest at the national
level, causing U.S. students as a whole to underperform
at the international level. Last, the efficacy of the
MIMIC model in detecting and predicting DIF will be
illustrated, with the goal that it becomes more widely
used to model and better understand such differences and DIF.
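As context for readers unfamiliar with the approach, a MIMIC DIF specification can be written as a generalized SEM in which gender predicts both the latent trait and, for a suspect item, the item response directly. The sketch below uses Stata's gsem with hypothetical item and covariate names; it illustrates the general technique rather than the authors' code.
    * Hypothetical MIMIC model for uniform DIF: female affects the latent math
    * ability (impact) and also item3 directly (DIF); item names are illustrative.
    gsem (MathAbility -> item1 item2 item3 item4 item5, logit) ///
         (female -> MathAbility)                               ///
         (female -> item3, logit)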
Additional information: Baltimore17_Krost.pptx
Kevin Krost
Virginia Tech
Joshua Cohen
Virginia Tech
|
10:00–10:20 |
Abstract:
In 2001, I gave a presentation on three-valued logic.
Since then, I have developed some ideas that grew out of
that investigation, leading to new insights about
missing values and to the development of five-valued
logic. I will also show how these notions extend to
numeric computation and to an abstract generalization of
the principles involved. This is not about analysis;
this is about data construction and preparation, and it
is a possibly interesting conceptual tool.
Additional information: Baltimore17_Kantor.pptx multi_valued_logic.docx
David Kantor
Data for Decisions
|
10:40–11:10 |
Abstract:
The parallel command lets you run Stata faster, sometimes
faster than Stata/MP itself. By splitting your job across
several Stata instances, parallel gives you out-of-the-box
parallel computing. Using the parallel prefix, you can speed
up simulations, bootstrapping, reshaping big data, and more,
without having to know a thing about parallel computing.
Even without Stata/MP installed on your computer, parallel
has been shown to speed up computations by a factor of two,
four, or more, depending on how many processors your
computer has.
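A minimal sketch of the workflow described above, assuming the setclusters and bs subcommands documented for the user-written package (exact syntax may differ across versions):
    ssc install parallel               // user-written package
    sysuse auto, clear
    parallel setclusters 4             // spread work across 4 child Stata instances
    parallel bs, reps(2000): regress price mpg weight   // parallelized bootstrap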
Additional information: Baltimore17_Quistorff.pdf
Brian Quistorff
Microsoft
George G. Vega Yon
University of Southern California
|
11:10–11:40 |
Abstract:
The inclusion of the Java API for Stata provides users and
user-programmers with exciting opportunities to leverage a
wide array of existing work in the context of their Stata
workflow. This talk will introduce a few tools designed to
help others integrate Java libraries into their workflow:
the Stata Maven Archetype and the StataJavaUtilities
library. In addition to a higher-level overview, the
presentation will show examples of using existing Java
libraries to expand statistical models in psychometrics, to
send yourself an email when your job is complete, to compute
phonetic string encodings and string distances, and to
access file and operating-system properties, as well as
examples to use as starting points for developing Java
plugins in Stata.
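For orientation, a Java library bundled as a jar can be invoked from Stata with the built-in javacall command; the class, method, and jar names below are hypothetical placeholders, not part of the tools presented in this talk.
    * Call a static method from a user-supplied jar (all names are hypothetical).
    javacall org.example.StringTools encodePhonetic, jars(stringtools.jar)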
Additional information: Baltimore17_Buchanan
Billy Buchanan
Fayette County Public Schools
|
11:40–12:00 |
Abstract:
In recent years, very large datasets have become
increasingly prevalent in most social sciences. However,
some of the most important Stata commands
(collapse, egen, merge,
sort, etc.) rely on algorithms that are not well
suited for big data. In my talk, I will present the
ftools package, which contains plugin alternatives to
these commands and performs up to 20 times faster on
large datasets [1]. Further, I will explain the
underlying algorithm and Mata function and show how to
use this function to create new Stata commands and to
speed up existing packages.
[1] See benchmarks at
https://github.com/sergiocorreia/ftools/#benchmarks
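As an illustration of the drop-in style of the package, here is a sketch assuming the fcollapse and fegen commands provided by ftools (ssc install ftools); the dataset and variable names are hypothetical.
    ssc install ftools
    use big_panel, clear                          // hypothetical large dataset
    fegen firm_id = group(country industry)       // fast alternative to egen ... group()
    fcollapse (mean) wage (sum) employment, by(firm_id year)   // fast collapse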
Additional information: Baltimore17_Correia.pdf
Sergio Correia
Board of Governors of the Federal Reserve System
|
1:00–1:30 |
Abstract:
Part of the art of coding is writing as little as
possible to do as much as possible. The presentation
expands on this truism. Examples are given of Stata
code to yield graphs and tables in which most of the
real work is happily delegated to workhorse commands. In
graphics, a key principle is that graph twoway is
the most general command, even when you do not want
rectangular axes. Variations on scatter- and line plots
are precisely that, variations on scatter- and line
plots. More challenging illustrations include commands
for circular and triangular graphics, in which x and y
axes are omitted, with an inevitable but manageable cost
in re-creating scaffolding, titles, labels, and other
elements. In tabulations and listings, the better-known
commands sometimes seem to fall short of what you want.
However, a few preparation commands (such as
generate, egen, collapse, or
contract) followed by list,
tabdisp, or _tab can get you a long way.
The examples range in scope from a few lines of
interactive code to fully developed programs. The
presentation is thus pitched at all levels of Stata
users.
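A tiny example of the preparation-then-display pattern mentioned above, using only official commands and the auto dataset shipped with Stata:
    sysuse auto, clear
    contract rep78 foreign, freq(n)       // one row per combination, with a count
    tabdisp rep78 foreign, cell(n)        // display the counts as a two-way table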
Additional information: Baltimore17_Cox.pptx
Nicholas Cox
Durham University, United Kingdom
|
1:30–2:20 |
Abstract:
Part of reproducible research is eliminating manual steps such
as hand-editing documents. Stata 15 introduces several commands
which facilitate automated document production, including
dyndoc for converting dynamic Markdown documents to
web pages, putdocx for creating Word documents, and
putpdf for creating PDF files.
These commands let you mix formatted text with Stata output
and embed Stata graphs, in-line Stata results, and tables
containing the output of selected Stata commands.
We will show these commands in action, demonstrating how to
automate the production of documents in various formats and
how to include Stata results in those documents.
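A minimal sketch of the putdocx workflow, using the auto dataset; the output file name is arbitrary:
    sysuse auto, clear
    putdocx begin                                    // open a new Word document in memory
    putdocx paragraph, style(Heading1)
    putdocx text ("Regression of price on mpg and weight")
    quietly regress price mpg weight
    putdocx table reg = etable                       // add the estimation table
    putdocx save myreport.docx, replace              // write the .docx file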
Additional information: Baltimore17_Peng
Hua Peng
StataCorp
|
2:40–3:10 |
Abstract:
We compare a variety of methods for predicting the
probability of a binary treatment (the propensity
score), with the goal of comparing otherwise like cases
in treatment and control conditions for causal inference
about treatment effects. Better prediction methods can,
under some circumstances, improve causal inference by
reducing both the finite-sample bias and the variability of
estimators. Sometimes, however, better predictions of the
probability of treatment can increase bias and variance. We
clarify the conditions under which different methods produce
better or worse inference (in terms of mean squared error of
causal impact estimates).
Additional information: Baltimore17_Nichols.pdf
Austin Nichols
Abt Associates
Linden McBride
Cornell University
|
3:10–3:40 |
Abstract:
In this paper, we create an algorithm to predict which
students are eventually going to drop out of U.S. high
school using information available in ninth grade. We
show that using a naive model—as implemented in
many schools—leads to poor predictions. In
addition to this, we explain how schools can obtain more
precise predictions by exploiting the big data available
to them, as well as more sophisticated quantitative
techniques. We also compare the performance of
econometric techniques such as logistic regression with
machine learning tools such as support vector machines,
boosting, and LASSO. We offer practical advice on how to
apply machine learning methods using Stata to the
high-dimensional datasets available in education.
Model parameters are calibrated by taking into account
policy goals and budget constraints.
Additional information: Baltimore17_Sansone.pdf
Dario Sansone
Georgetown University
|
4:00–4:20 |
Abstract:
We present a new Stata package for small-area
estimations of poverty and inequality implementing
methodologies from Elbers, Lanjouw, and Lanjouw (2003).
Small-area methods attempt to overcome the low
representativeness of surveys within areas or the lack
of data for specific areas and subpopulations. This is
accomplished by incorporating information from outside
sources. A common outside source is census data, which
often lack detailed information on welfare. Thus far, a
major limitation to such analysis in Stata has been
the memory required to work with census data. The
povmap package introduces new Mata functions and a
plugin used to circumvent memory limitations that will
arise when working with big data.
Additional information: Baltimore17_Nguyen.pdf
Minh Nguyen
World Bank
Paul Andres Corral Rodas; Joao Pedro Wagner De Azevedo; Qinghua Zhao
World Bank
|
4:20–4:40 |
Abstract:
We present examples of how to construct interactive maps
in Stata, using only built-in commands available even in
secure environments. One can also use built-in commands
to smooth geographic data as a pre-processing step.
Smoothing can be done using methods from twoway contour,
or predictions from a GMM model as described in Drukker,
Prucha, and Raciborski (2013). The basic approach to
creating a map in Stata is twoway area, with the options
nodropbase cmiss(no) yscale(off) xscale(off),
with a polygon “shape file” dataset (often created
by the user-written shp2dta by Kevin Crow,
possibly with a change of projection using programs by
Robert Picard) and multiple calls to area with if
qualifiers to build a choropleth or scatter to
superimpose point data. This approach is automated by
several user-written commands and works well for static
images but is less effective for web content, where a
JavaScript entity is desirable. However, it is
straightforward to write out the requisite information
using the file command and to use open-source map
tools to create interactive maps for the web. We present
two useful examples.
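A condensed sketch of the twoway area approach described above, assuming a hypothetical shapefile states.shp converted with the user-written shp2dta:
    * Convert the shapefile to an attribute database and a coordinates dataset.
    shp2dta using states.shp, database(state_db) coordinates(state_xy) genid(id) replace
    use state_xy, clear
    twoway area _Y _X, nodropbase cmissing(n) yscale(off) xscale(off) ///
        legend(off) title("Polygon outlines")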
Additional information: Baltimore17_Lauer.pdf
Ali Lauer
Abt Associates
|
4:40–5:00 |
Abstract:
We provide examples of how one can use satellite or
other remote sensing data in Stata, with a variety of
analysis methods, including examples of measuring
economic disadvantage using satellite imagery.
Additional information: Baltimore17_Nisar.pdf
Hiren Nisar
Abt Associates
|
9:00–9:20 |
Abstract:
We developed an ado-file to easily estimate three
selected occupational segregation indicators with
standard errors using a bootstrap procedure. The
indicators are the Duncan and Duncan (1955)
dissimilarity index, the Gini coefficient based on the
distribution of jobs by gender (see Deutsch et al.
[1994]), and the Karmel and MacLachlan (1988) index of
labor market segregation. This routine can be easily
applied to conventional labor market microdata in which
information regarding the occupation classification,
industry, and occupational category variables is usually
available. As an illustration of the application of this
ado-file, we present estimates of both occupational and
industry segregation by gender drawn from Colombian
household survey microdata. The estimation of
occupational segregation measures with standard errors
proves to be useful in assessing statistical differences
in segregation measures within labor market groups and
over time.
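For reference, the Duncan and Duncan (1955) dissimilarity index over occupations i is D = (1/2) * sum_i |m_i/M - f_i/F|, where m_i and f_i are the male and female counts in occupation i and M and F are the corresponding totals. A hedged sketch of computing it from person-level microdata, assuming hypothetical variables occ (occupation) and female (0/1):
    preserve
    generate byte male = !female
    collapse (sum) male female, by(occ)        // employment counts by occupation and sex
    summarize male, meanonly
    local M = r(sum)
    summarize female, meanonly
    local F = r(sum)
    generate double d_i = abs(male/`M' - female/`F')
    summarize d_i, meanonly
    display as result "Duncan dissimilarity index = " %6.4f r(sum)/2
    restore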
Additional information: Baltimore17_Isaza-Castro.pdf
Jairo G Isaza-Castro
Universidad de la Salle
Karen Hernandez; Karen Guerrero; Jessy Hemer
Universidad de la Salle
|
9:20–9:40 |
Abstract:
Cluster randomized trials (CRTs), where clusters (for
example, schools or clinics) are randomized but
measurements are taken on individuals, are commonly used
to evaluate interventions in public health and social
science. Because CRTs typically involve only a few
clusters, simple randomization frequently leads to
baseline imbalance of cluster characteristics across
treatment arms, threatening the internal validity of the
trial. In CRTs with a small number of clusters, classic
approaches to balancing baseline characteristics—such
as matching and stratification—have several drawbacks,
especially when the number of baseline characteristics
the researcher desires to balance is large (Ivers et al.
2012). An alternative approach is constrained
randomization, whereby an allocation scheme is randomly
selected from a subset of all possible allocation
schemes based on the value of a balancing criterion
(Raab and Butcher 2001). Subsequently, an adjusted
permutation test can be used in the analysis, which
provides increased efficiency under constrained
randomization compared with simple randomization (Li et
al. 2015). We describe constrained randomization and
permutation tests for the design and analysis of CRTs
and provide examples to demonstrate the use of our newly
created Stata package, cvcrand, which uses Mata to
efficiently process large allocation matrices, to
implement constrained randomization and permutation
tests.
Additional information: Baltimore17_Gallis.pdf
John Gallis
Duke University
Fan Li; Hengshi Yu; Elizabeth L. Turner
Duke University
|
9:40–10:00 |
Abstract:
Researchers constructing measurement models must decide
how to proceed when an initial specification fits
poorly. Common approaches include search algorithms that
optimize fit and piecemeal changes to the item list or
the error specification. The former approach may yield a
good-fitting model that is inconsistent with theory or
may fail to identify the best-fitting model because of
local optimization issues. The latter suffers from poor
reproducibility and may also fail to identify the
optimal model. We outline a new approach that defines a
computationally tractable specification space based on
theory. We use the example of a hypothesized latent
variable with 25 candidate indicators divided across 5
content areas. Using Stata’s tuples command, we
identify all combinations of indicators containing at least
one indicator per content area. In our example, this yields
7,294 models. We estimate each model on a derivation
dataset and select candidate models with fit statistics
that are acceptable or could be rendered acceptable by
permitting correlated errors. Eight models fit these
criteria. We evaluate modification indices, respecify if
there is theoretical justification for correlated
errors, and select a final model based on fit
statistics. In contrast to other methods, this approach
is easily replicable and may result in a model that is
consistent with theory and has acceptable fit.
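A simplified, hedged sketch of the enumeration mechanics, assuming the tuple# and ntuples locals and the min() option documented for the user-written tuples command; the indicator names are hypothetical, and the at-least-one-per-content-area constraint is omitted for brevity.
    tuples x1 x2 x3 x4 x5 x6, min(4)           // all subsets with at least four indicators
    forvalues i = 1/`ntuples' {
        display as text "Model `i': `tuple`i''"
        quietly sem (F -> `tuple`i'')          // one-factor measurement model
        estat gof, stats(all)                  // inspect fit statistics
    }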
Additional information: Baltimore17_Dougherty.pptx
Geoff Dougherty
Johns Hopkins Bloomberg School of Public Health
Dr. Lorraine Dean
Johns Hopkins Bloomberg School of Public Health
|
10:40–11:10 |
Abstract:
We present response surface coefficients for a large
range of quantiles of the Elliott, Rothenberg, and Stock
(Econometrica 1996) DF-GLS unit-root tests for
different combinations of the number of observations and
the lag order in the test regressions, where the latter
can be either specified by the user or endogenously
determined. The critical values depend on the method
used to select the number of lags. The Stata command
ersur is presented, and its use illustrated with
an empirical example that tests the validity of the
expectations hypothesis of the term structure of
interest rates.
Additional information: Baltimore17_Baum.pdf
Christopher Baum
Boston College and DIW Berlin
Jesús Otero
Universidad del Rosario, Colombia
|
3:10–3:40 |
Abstract:
Estimating the causal effect of a treatment is
challenging when selection into the treatment is based
on contemporaneous unobservable characteristics, and the
outcome of interest is represented by a series of
correlated binary outcomes. Under these assumptions,
traditional nonlinear panel-data models, such as the
random-effects logistic model, will produce biased
estimates of the treatment effect because of correlation
between the treatment variable and model unobservables.
In this presentation, I will introduce a new Stata
estimation command, etxtlogit, that can estimate
a model where the outcome is a series of J correlated
logistic binary outcomes and selection into the
treatment is based on contemporaneous unobservable
characteristics. The presentation will introduce the new
estimation command, present Monte Carlo evidence, and
offer empirical examples. Special cases of the model
will be discussed, including applications based on the
explanatory (behavioral) Rasch model, a model from item
response theory (IRT).
Additional information: Baltimore17_Rabbitt.pdf
Matthew P. Rabbitt
Economic Research Service, U.S. Department of Agriculture
|
11:40–12:00 |
Abstract:
A continuation ratio model represents a variant of an
ordered regression model that is suited to modeling
processes that unfold in stages, such as educational
attainment. The parameters for covariates in
continuation ratio models may be constrained to be
equal, subject to a proportionality constraint across
stages, or freely vary across stages. Currently, there
are three user-written Stata commands that fit
continuation ratio models. Each of these commands fits
some subset of continuation ratio models involving
parameter constraints, but none of them offer complete
coverage of the range of possibilities. In addition, all
the commands rely on reshaping the data into a
stage-case format to facilitate estimation. The new
crreg command expands the options for
continuation ratio models to include the possibility for
some or all of the covariates to be constrained to be
equal, to freely vary, or to have a proportionality
constraint across stages. The crreg command
relies on Stata’s ML routines for estimation and
avoids reshaping the data. The crreg command
includes options for three different link functions (the
logit, probit, and cloglog) and supports Stata’s
survey and multiple imputation suites of commands.
Additional information: Baltimore17_Bauldry.pdf
Shawn Bauldry
Purdue University
Jun Xu
Ball State University
Andrew Fullerton
Oklahoma State University
|
1:00–1:30 |
Abstract:
When I was in graduate school, I was taught that
multivariate methods were the future of data analysis.
In that dark computer stone age, multivariate meant
multivariate analysis of variance (MANOVA), linear
discriminant function analysis (LDA), canonical
correlation analysis (CA), and factor analysis (which
will not be discussed in this presentation). Statistical
software has evolved considerably since those ancient
days. MANOVA, LDA, and CA are still around but have been
eclipsed and pushed aside by newer, sexier
methodologies. These three methods have been consigned
to the multivariate dustbin, so to speak. This
presentation will review MANOVA, LDA, and CA, discuss
the connections among the three approaches, and
highlight the positives and negatives of each approach.
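The three methods correspond to official Stata commands; a quick sketch on the auto dataset, purely for illustration:
    sysuse auto, clear
    manova mpg weight length = foreign             // multivariate analysis of variance
    discrim lda mpg weight length, group(foreign)  // linear discriminant function analysis
    canon (mpg weight) (length turn displacement)  // canonical correlation analysis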
Additional information: Baltimore17_Ender.pdf
Phil Ender
UCLA (Ret.)
|
1:30–2:20 |
Abstract:
In survival analysis, right-censored data have been studied extensively
and can be analyzed using Stata's extensive suite of survival commands,
including streg for fitting parametric survival models. Right-censored
data are a special case of interval-censored data. Interval-censoring
occurs when the failure time of interest is not exactly observed but is
only known to lie within some interval. Left-censoring, which occurs
when the failure is known to happen some time before the observed time,
is also a special case of interval-censoring. Survival data may contain
a mixture of uncensored, right-censored, left-censored, and
interval-censored observations. In this talk, I will describe basic
types of interval-censored data and demonstrate how to fit parametric
survival models to these data using Stata's new stintreg command. I
will also discuss postestimation features available after this command.
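A hedged sketch of the stintreg syntax for interval-censored data, assuming a dataset with interval bounds ltime and rtime (hypothetical variable names; a missing upper bound marks a right-censored observation):
    * Weibull model for interval-censored failure times.
    stintreg age i.treatment, interval(ltime rtime) distribution(weibull)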
Additional information: Baltimore17_Yang.pdf
Xiao Yang
StataCorp
|
2:40–3:10 |
Abstract:
I use the new extended regression command eoprobit to estimate
the effect of an endogenous treatment on an ordinal profit outcome.
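A hedged sketch of the kind of specification involved, assuming eoprobit's entreat() option for an endogenous binary treatment; the variable names are hypothetical:
    * Ordered probit outcome y with endogenous treatment 'treated' instrumented by z1 z2.
    eoprobit y x1 x2, entreat(treated = z1 z2)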
Additional information: Baltimore17_Drukker.pdf
David M. Drukker
StataCorp
|
3:40–4:30 |
Wishes and grumbles
StataCorp
|
Renaissance Baltimore Harborplace Hotel
202 East Pratt Street
Baltimore, MD 21202
The conference venue is near several tourist attractions, including the USS Constellation and other vessels in the harbor, the American Visionary Art Museum, and the National Aquarium.
Joe Canner (Chair)
Department of Surgery
Johns Hopkins University
John McGready
Department of Biostatistics
Johns Hopkins University
Austin Nichols
Abt Associates
Sharon Weinberg
Applied Statistics and Psychology
New York University