The Spanish Stata Conference was held on 24 October 2018 at Universitat Pompeu Fabra, Campus Ciutadella, Edifici Mercè Rodoreda. You can view the program below.
Proceedings
9:30–10:45 |
Abstract:
Researchers' interest in the use of Bayesian regression analysis has
increased significantly in recent years. One of the fundamental reasons for
this growing interest is that a wide variety of models can be
accommodated within this alternative regression approach. This flexibility is
due in part to the possibility of using a common theoretical framework to
estimate the parameters for posterior distributions associated with different
kinds of model specifications. I will outline the main aspects associated
with Bayesian regression in Stata, and I will show the facilities
incorporated in Stata 15 to make this kind of analysis more accessible to
those who are not very familiar with this approach.
Additional information: spain18_Sánchez.pdf
Gustavo Sánchez
StataCorp
|
10:45–11:15 |
Abstract:
eltmle is a Stata program implementing targeted maximum-likelihood
estimation (TMLE) of the average treatment effect (ATE) for a binary or continuous outcome and binary
treatment. eltmle includes the use of a super learner called from the SuperLearner
package v.2.0-21 (Polley et al. 2011). Modern epidemiology has been able to
identify significant limitations of classic epidemiological methods, like outcome
regression analysis, when estimating causal quantities such as the average
treatment effect (ATE) for observational data. For example, using classical
regression models to estimate the ATE requires the assumption that the
effect measure is constant across levels of confounders included in the model, i.e.,
that there is no effect modification. Other methods do not require this assumption,
including g-methods (for example, the g-formula) and targeted maximum-likelihood
estimation (TMLE). The average treatment effect (ATE) or risk difference is the
most commonly used causal parameter. Many estimators of the ATE, but not all, rely
on parametric modeling assumptions. Therefore, the correct model specification is
crucial to obtain unbiased estimates of the true ATE. TMLE is a semi-parametric,
efficient substitution estimator allowing for data-adaptive estimation while
obtaining valid statistical inference based on the targeted minimum loss-based
estimation. TMLE has the advantage of being doubly robust. Moreover, TMLE allows
inclusion of machine learning algorithms to minimize the risk of model
misspecification, a problem that persists for competing estimators. Evidence shows
that TMLE typically provides the least biased estimates of the ATE compared with
other doubly robust estimators. The following links provide access to a TMLE
tutorial:
https://migariane.github.io/TMLE.nb.html and the GitHub repository for the
eltmle Stata package,
https://github.com/migariane/meltmle.
Additional information: spain18_Luque-Fernández(1).pdf
Miguel Ángel Luque-Fernández
Universidad de Granada, London School of Hygiene and Tropical Medicine, CIBERESP ISCIII
|
11:45–12:15 |
Abstract:
Text data, such as answers to open-ended questions, are sometimes
ignored because they are hard to analyze. My community-contributed Stata
command ngram turns text into hundreds of variables using the
"bag of words" approach. Broadly speaking, each variable records
how often the corresponding word or word sequence occurs in a given
text. This is more useful than it sounds. The program supports text
in 12 European languages.
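The "bag of words" idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the ngram command itself: the function name `ngram_counts`, the `n_max` parameter, and the two sample answers are hypothetical, and real text would also need punctuation handling and stemming.

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Count every word sequence of length 1..n_max in one text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical answers to an open-ended question
texts = ["the survey was too long", "too long and too boring"]
counts = [ngram_counts(t) for t in texts]

# One column ("variable") per distinct word or word pair, one row per text
vocabulary = sorted(set().union(*counts))
matrix = [[c[term] for term in vocabulary] for c in counts]
```

Each row of `matrix` is one text and each column records how often the corresponding word or word pair occurs in it, which is the structure a downstream model would consume.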
Additional information: spain18_Schonlau.pdf
Matthias Schonlau
University of Waterloo
|
12:15–12:45 |
Abstract:
Receiver operating characteristic (ROC) analysis is used for comparing predictive models, both in
model selection and model evaluation. This method is often applied in clinical medicine and
social science to assess the tradeoff between model sensitivity and specificity. After one fits a
binary logistic regression model with a set of independent variables, the predictive performance
of this set of variables—as assessed by the area under the curve (AUC) from a ROC curve—must
be estimated for a sample (the "test" sample) that is independent of the sample used to fit
the model (the "training" sample). An important aspect of predictive modeling
(regardless of model type) is the ability of a model to generalize to new cases. Evaluating the
predictive performance (AUC) of a set of independent variables using all cases from the original
analysis sample tends to result in an overly optimistic estimate of predictive performance. K-fold
cross-validation can be used to generate a more realistic estimate of predictive performance. To
assess a model's ability to generalize when the number of observations is not very large, cross-validation
and bootstrap strategies are useful. cvAUROC is a community-contributed Stata command that
implements k-fold cross-validation for the AUC for a binary outcome after fitting a logistic
regression model and provides the cross-validated fitted probabilities for the dependent variable
or outcome, contained in a new variable named _fit. Different options and examples for the use
of cvAUROC can be downloaded at
https://github.com/migariane/cvAUROC and can be directly
installed in Stata using ssc install cvAUROC.
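The k-fold procedure described above can be sketched as follows. This is an illustrative Python sketch on simulated data, not the cvAUROC implementation: the helper names (`auc`, `fit_logit`, `cross_validated_fit`), the single-predictor model, and the gradient-ascent fit are all assumptions made to keep the example self-contained.

```python
import math
import random

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_logit(xs, ys, steps=500, lr=0.5):
    """Fit a one-predictor logistic regression by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def cross_validated_fit(xs, ys, k=5, seed=1):
    """Predict each observation from a model fit on the other k-1 folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    fit = [None] * len(xs)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_logit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            fit[i] = 1.0 / (1.0 + math.exp(-(b0 + b1 * xs[i])))
    return fit

# Simulated data (hypothetical): one predictor, binary outcome
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [1 if rng.random() < 1 / (1 + math.exp(-x)) else 0 for x in xs]

cv_fit = cross_validated_fit(xs, ys)   # out-of-fold fitted probabilities
cv_auc = auc(cv_fit, ys)               # cross-validated AUC
```

Because every fitted probability in `cv_fit` comes from a model that never saw that observation, `cv_auc` avoids the optimism of evaluating on the estimation sample.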
Additional information: spain18_Miguel Ángel Luque-Fernández(2).pdf
Miguel Ángel Luque-Fernández
Universidad de Granada, London School of Hygiene and Tropical Medicine, CIBERESP ISCIII
Camille Maringe
London School of Hygiene and Tropical Medicine
|
12:45–1:15 |
Abstract:
The priority review voucher (PRV) was implemented in the United
States in 2007 with the aim to stimulate research and
development (R&D) for neglected diseases. The idea is the following:
pharmaceutical companies are granted a priority review voucher by
the Food and Drug Administration (FDA) upon successful development of
a product (for example, a drug or vaccine) for a disease on the PRV list.
The voucher entitles its holder to a faster review (for example, within
6 months compared with the standard 10 months). It can either be used
for a blockbuster drug or sold to a third party. The PRV is believed to
weigh heavily in pharmaceutical companies' decisions to initiate or
continue a project for a neglected disease; the most recent voucher
was granted in June 2018.
R&D investment is measured by the number of clinical trials initiated
yearly and per disease, which is downloadable from the WHO platform
registry. Because the policy targets a specific group of diseases in a
specific country (namely, the U.S.), we isolate the impact of
the policy through the differences-in-differences (DD) approach and
differences-in-differences-in-differences (DDD) approach.
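The basic DD comparison can be sketched as a difference of before/after changes between a targeted and a non-targeted group. The Python sketch below is illustrative only: the function name `did` and the yearly trial counts are hypothetical and are not taken from the study's data.

```python
def mean(values):
    return sum(values) / len(values)

def did(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(control_post) - mean(control_pre)
    )

# Hypothetical yearly counts of newly initiated clinical trials
prv_pre, prv_post = [10, 12], [18, 20]      # diseases on the PRV list
other_pre, other_post = [30, 32], [33, 35]  # diseases not on the list

effect = did(prv_pre, prv_post, other_pre, other_post)
```

With these made-up numbers, trials for listed diseases rise by 8 per year and trials for other diseases by 3, so the DD estimate attributes 5 additional trials per year to the policy. A DDD estimate would apply the same differencing once more, across a third dimension such as country.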
Céline Aerts
Barcelona Institute for Global Health (ISGlobal)
Marisa Miraldo
Eliana Barenho
Imperial College London
Elisa Sicuri
Barcelona Institute for Global Health (ISGlobal), Imperial College London
|
1:15–1:45 |
Abstract:
We estimated the demand for house improvement in rural Gambia,
West Africa, by exploring three definitions of demand:
utility-derived demand, stated demand, and revealed preferences-based demand.
Data were collected in the context of a cluster-randomized controlled
trial aiming at identifying and measuring the impact of improved
houses on selected health outcomes. We collected panel data
(4 rounds over approximately 1 year to control for seasonality)
from nearly 200 households representing intervention, control
and nonstudy groups, from a random subsample of 15 study
villages. We collected information on satisfaction with owned
houses (utility), willingness to pay for house improvement
(stated preferences), and routine housing behavior (revealed
preferences). We estimated the determinants of demand through
ordered logit or linear (depending on the outcome variable
distribution) fixed-effects models. Under the hypothesis that
housing investment choices in such a rural context (and
considering the short term) aim at maintaining utility constant
across seasons, we plotted predicted demand from the estimated
models against time (rounds) and analyzed and interpreted differences
across the three definitions of demand.
Additional information: spain18_Sicuri.pdf
Elisa Sicuri
Barcelona Institute for Global Health (ISGlobal), Imperial College London
Lesong Conteh
Barcelona Institute for Global Health (ISGlobal)
|
2:45–4:00 |
Abstract:
After we fit a model, our analysis does not stop. We want to use our
results to construct counterfactual scenarios. We want to study the effects of
changes in variables over the population or for a specific subpopulation.
Answering such questions is more challenging for nonlinear models and, in
particular, for models in which we make no assumptions about functional
forms—nonparametric models. In this presentation, we will illustrate how to answer
these and other relevant empirical questions for nonlinear cross-sectional and
panel-data models and for nonparametric models. We do this within a unified
framework using Stata.
Additional information: spain18_Pinzón.pdf
Enrique Pinzón
StataCorp
|
4:00–4:30 |
Abstract:
In observational studies, estimation of causal effects often
relies on the assumption that all relevant confounders are
observed. Under this assumption, propensity-score matching
(PSM) can be used to adjust for observed confounders. PSM
is a semiparametric alternative to regression models that
consists of two steps: 1) estimation of the probability of
receiving the treatment (propensity score); 2) matching on
the estimated propensity score.
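The two steps can be sketched as follows. This is a minimal Python illustration on hypothetical, noise-free data with a single confounder and one-to-one nearest-neighbor matching with replacement. It is not any Stata routine, and it ignores clustering entirely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ts, steps=2000, lr=0.5):
    """Step 1: logistic model for treatment, fit by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = sum(t - sigmoid(b0 + b1 * x) for x, t in zip(xs, ts)) / n
        g1 = sum((t - sigmoid(b0 + b1 * x)) * x for x, t in zip(xs, ts)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical data: treated units tend to have larger x, and the
# outcome is y = x + 2 * t, so the true treatment effect is 2.
x_control = [i / 10 for i in range(10)] + [i / 10 for i in range(5)]
x_treated = [i / 10 for i in (3, 5, 6, 7, 8, 9)]
xs = x_control + x_treated
ts = [0] * len(x_control) + [1] * len(x_treated)
ys = [x + 2 * t for x, t in zip(xs, ts)]

b0, b1 = fit_logit(xs, ts)
scores = [sigmoid(b0 + b1 * x) for x in xs]  # estimated propensity scores

# Step 2: match each treated unit to the control with the closest score
controls = [i for i, t in enumerate(ts) if t == 0]
treated = [i for i, t in enumerate(ts) if t == 1]
diffs = []
for i in treated:
    j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
    diffs.append(ys[i] - ys[j])

att = sum(diffs) / len(diffs)  # matched estimate of the effect on the treated
naive = (sum(ys[i] for i in treated) / len(treated)
         - sum(ys[i] for i in controls) / len(controls))
```

In this constructed example every treated unit has a control with the same x, so the matched estimate recovers the effect of 2, while the unadjusted difference in means (`naive`) is inflated by the confounder.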
PSM was originally proposed for unstructured data, and available
Stata routines are designed for these types of data. However,
clustered or hierarchical data are common in many fields of study
(for example, students nested in schools, voters in parties,
patients in hospitals). Building on recent methodological
developments, the goal of this presentation is to show how PSM can
be implemented with clustered data in Stata. Using examples with real
data, I will present methods that exploit the information on the
clustered structure of the data in two ways: in the estimation of
the propensity-score model (through the inclusion of fixed or random
effects) or in the implementation of the matching algorithm.
Additional information: spain18_Arpino.pdf
Bruno Arpino
Universitat Pompeu Fabra
|
4:45–5:15 |
Abstract:
Since the release of Stata 15, it has been possible to convert
the results of analyses into .docx (putdocx), .pdf (putpdf), and
.html (dyndoc) files. This presentation demonstrates the process by which
this is achieved to create a set of basic exercises online
(http://bit.ly/Analisis2018), so researchers and students can
learn how to use Stata. First, I discuss the varied file types and how to
work with them. Then, I present the steps necessary for obtaining
basic analysis with the program, including percentage
tables, means, and regressions. In addition to this option, Stata's
dyndoc command can generate other web pages unrelated to the program,
with minimal knowledge of the HTML language.
Additional information: spain18_Escobar.pdf
Modesto Escobar
Universidad de Salamanca
|
5:15–5:45 |
Abstract:
The goal of this presentation is to identify some common
analytical problems that are often encountered in quantitative research
in a wide array of social science applications (and possibly in other
research fields as well), such as the analysis of multicollinearity of
independent variables when qualitative variables are involved; the
elaboration of three-way contingency tables with percentages; the
presentation of predictive margins and frequency distributions of both
qualitative and quantitative variables; the presentation of information
both on predictive margins and on contrasts of the statistical
significance of the differences of the effects of adjacent and
nonadjacent categories of qualitative independent variables; and the
construction of time-series graphs based on the frequency distribution
of categorical variables. I will put forward some solutions
with Stata for discussion among the audience and identify some
unresolved challenges.
Additional information: spain18_Rama.ppsx
José Rama
Andrés Santana
Universidad Autónoma de Madrid
|
5:45–6:15 |
Abstract:
Panel attrition is a threat to data quality in longitudinal
studies, especially if those who drop out of the study are different
from the panel respondents. This presentation investigates the effect of
survey length on wave nonresponse using data from Understanding
Society, the United Kingdom Household Longitudinal Study (UKHLS).
The concept of survey length is addressed from a theoretical point
of view, and two measures, length and interview pace, are computed
to test their effect on survey cooperation.
Pablo Cabrera-Álvarez
David Dóncel Abad
Universidad de Salamanca
|
6:15–6:45 |
Abstract:
The goal of this presentation is to put forward a new set of
indexes and data analytic strategies for the comparative study of
attitudes toward linguistic educational policies in multinational
settings. These indexes deal with the attitudes toward the linguistic
mix in primary and secondary education, most notably regarding the
local-international dimension (regional and state-wide ones vis-à-vis
English) and the subnational-national one. Empirical analysis will be
performed with Stata using data from a specialized survey for
the Catalan case (N > 2,200) and the Eusko-barometer of May 2018 (N >
600). Several analytical options will be presented for discussion.
Additional information: spain18_Santana.ppsx
Andrés Santana
Universidad Autónoma de Madrid
|
6:45–7:15 |
Abstract:
Stata developers present will carefully and cautiously
consider wishes and grumbles from Stata users in the audience.
Questions, and possibly answers, may concern reports of
present bugs and limitations or requests for new features in
future releases of the software.
StataCorp personnel
StataCorp
|