The 2019 Spanish Stata Conference was held on 17 October in Madrid at Lexington Madrid.
Proceedings
9:30–10:45
Abstract:
The increasing availability of high-dimensional data and growing interest in
more realistic functional forms have sparked renewed interest in automated
methods for selecting the covariates to include in a model. I discuss the promises
and perils of model selection and pay special attention to estimators that provide
reliable inference after model selection. I will demonstrate how to use Stata 16's
new features for double selection, partialing out, and cross-fit partialing out to
estimate the effects of variables of interest while using lasso methods to select
control variables.
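As a flavor of the syntax, a minimal sketch of the three estimators with hypothetical variables (y is the outcome, d the covariate of interest, and x1-x100 the candidate controls):

* Lasso-based inferential estimators in Stata 16 (variables hypothetical)
dsregress  y d, controls(x1-x100)    // double selection
poregress  y d, controls(x1-x100)    // partialing out
xporegress y d, controls(x1-x100)    // cross-fit partialing out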
Additional information: Spain19_Drukker.pdf
David M. Drukker
StataCorp
10:45–11:15
Abstract:
The hierarchical summary ROC (HSROC) model (Rutter and Gatsonis 2001)
is one of two statistically rigorous multilevel or mixed-effects models recommended for diagnostic test
accuracy meta-analysis by the Cochrane Collaboration. The original parameterization of the HSROC model
does not incorporate the difference in log-likelihood expressions between cohort (prevalence-dependent)
and case-control (prevalence-independent) diagnostic test data (Ma 2016). Using publicly available
data from a meta-analysis of gadolinium-enhanced MRI
for detecting lymph node metastases, I show in this presentation how bayesmh and its
myriad postestimation commands, the cond() function, and substitutable expressions in Stata facilitate
estimation, graphical depiction, and interrogation of simultaneous HSROC modeling of cohort and
case-control diagnostic test accuracy studies.
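A minimal sketch of one possible bayesmh coding of the HSROC likelihood, using Stata 16's random-effects notation; variable names are hypothetical (y = number testing positive, n = arm size, dis = 0.5 for diseased and -0.5 for nondiseased arms, sid = study identifier) and the priors are illustrative only. A cond() term on a study-design indicator could then separate the cohort and case-control likelihood contributions:

* HSROC: logit(p) = (theta_i + alpha_i*dis)*exp(-beta*dis)
bayesmh y = (({Theta} + {U[sid]} + ({Lambda} + {A[sid]})*dis) * ///
        exp(-{beta}*dis)), likelihood(binlogit(n))             ///
    prior({Theta} {Lambda} {beta}, normal(0, 4))               ///
    prior({U[sid]}, normal(0, {s2u}))                          ///
    prior({A[sid]}, normal(0, {s2a}))                          ///
    prior({s2u} {s2a}, igamma(0.01, 0.01))                     ///
    burnin(5000) mcmcsize(10000)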
Ben Adarkwa Dwamena
University of Michigan
11:45–12:15
Abstract:
We have developed four new commands that allow one to evaluate the out-of-sample prediction performance
of panel-data models in their time-series and cross-individual dimensions separately, with
separate procedures for continuous and dichotomous dependent variables
(xtreg_oust, xtreg_ousi, xtlogit_oust, and xtlogit_ousi).
The time-series procedures exclude a number of time periods defined by the user from the
estimation sample for each individual in the panel. Similarly, the cross-individual procedures
exclude a group of individuals (for example, countries) defined by the user from the estimation
sample, including all their observations throughout time. Then, for the remaining subsamples,
they fit the specified models and use the resulting parameters to forecast the dependent variable
(or the probability of a positive outcome) in the unused periods or individuals. The holdout
sets are then recursively reduced, by one period in every subsequent step or by individuals
chosen in a random or ordered fashion, and the estimation and forecasting evaluation are repeated
until no further periods or individuals remain to be evaluated. In the
continuous case, the model's forecasting performance is reported both in absolute terms (RMSE)
and relative to an AR(1) model by a Theil's U ratio. In the dichotomous case, prediction
performance is evaluated based on the area under the receiver operating characteristic (ROC)
curve in both the training sample and the holdout sample. Despite their names,
the procedures allow one to choose different estimation methods, including some dynamic methodologies,
and can also be used with pure time-series or cross-sectional datasets. They also allow evaluating
the model's forecasting performance for one particular individual or for a defined group of
individuals instead of the whole panel.
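A minimal sketch of the time-series exercise that these commands automate, done by hand for a single holdout split (hypothetical variables y and x, panel identifier id, time variable year):

* Hold out the last H periods, fit on the rest, forecast the holdout
xtset id year
quietly summarize year
local H = 3                               // holdout length (illustrative)
local cut = r(max) - `H'
xtreg y x if year <= `cut', fe            // fit on the training window
predict double yhat, xbu                  // forecast, including unit effects
generate double fe2 = (y - yhat)^2 if year > `cut'
quietly summarize fe2
display "Out-of-sample RMSE = " sqrt(r(mean))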
Additional information: Spain19_Ruiz.pdf
Alfonso Ugarte Ruiz
BBVA Research
12:15–12:45
Abstract:
The TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis
or diagnosis) standards for predictive model reporting in research include internal model
validation. Bootstrap techniques are the most appropriate procedure for internal validation
because they use all the data involved in developing the model and allow optimism to be quantified.
Objective: To provide researchers with a tool, implemented in Stata as a postestimation command, for internal bootstrap validation of logistic regression models.
Methods: The validation method follows a bootstrap optimism-correction algorithm, sketched below:
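A minimal sketch of the standard optimism-corrected bootstrap for a logistic model's discrimination, illustrating the kind of algorithm involved rather than the authors' command itself (variables y and x1-x3 are hypothetical):

program define optimism, rclass
    // Refit the model on a bootstrap resample ...
    preserve
    bsample
    logit y x1 x2 x3
    lroc, nograph
    local auc_boot = r(area)        // apparent AUC in the resample
    restore
    // ... and score the resample's model on the original data
    predict double p_orig, pr
    roctab y p_orig
    return scalar opt = `auc_boot' - r(area)   // optimism estimate
    drop p_orig
end

logit y x1 x2 x3                    // model fit on the full data
lroc, nograph
local auc_app = r(area)             // apparent performance
simulate opt=r(opt), reps(200) nodots: optimism
quietly summarize opt
display "Optimism-corrected AUC = " `auc_app' - r(mean)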
Conclusions: This tool makes the internal validation methods more accessible to researchers and allows better reporting of predictive models according to the TRIPOD standards.
Additional information: Spain19_Fernández-Félix.pdf
Borja M. Fernández-Félix
Instituto Ramón y Cajal de Investigación Sanitaria
12:45–13:15
Abstract:
The instrumental variables (IV) method is a standard econometric approach to address endogeneity
issues. Many instruments rely on cross-sectional variation produced by a dummy variable that
is discretized from a continuous variable. Converting a continuous variable into a binary
instrument provides a simple tool to evaluate the IV strategy and the identification assumptions.
Unfortunately, the construction of the binary instrument often appears to be arbitrary, which may
raise concerns about the robustness of the second-stage results. We propose a data-driven procedure
to build this discrete instrument, implemented in a command called discretize. The boundaries of
the discrete variable are chosen to maximize the F statistic in the first stage. This procedure has
two main advantages. First, it minimizes the weak-instrument problem, which can arise in the case of
incorrect functional specification in the first stage. Second, it offers a transparent, data-driven
procedure to select an instrument that does not depend on arbitrary decisions. Several options are
available with the command to graphically check the robustness of the estimates. The presentation
also includes an illustration of its usefulness with an example that relates the rise of violent
crime in city centers to the process of suburbanization. The endogeneity is addressed by using
lead poisoning as an instrument.
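A minimal sketch of the underlying cutoff-search idea (not the authors' implementation), with hypothetical variables x_endog (endogenous regressor), z (continuous instrument), and w (a control):

* Scan candidate percentile cutoffs of z; keep the one with the largest
* first-stage F statistic for the binarized instrument
local bestF = 0
local bestc = .
forvalues q = 10(5)90 {
    _pctile z, p(`q')
    local c = r(r1)
    quietly generate byte z_bin = (z > `c') if !missing(z)
    quietly regress x_endog z_bin w
    quietly test z_bin
    if r(F) > `bestF' {
        local bestF = r(F)
        local bestc = `c'
    }
    drop z_bin
}
display "Chosen cutoff: `bestc' (first-stage F = `bestF')"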
Federico Curci
Colegio Universitario de Estudios Financieros
Sébastien Fontenay
Université catholique de Louvain
Federico Masera
University of New South Wales
13:15–13:45
Abstract:
crtrees performs classification trees and regression trees (Breiman et al. 1984) and
Random Forests (Breiman 2001; Scornet et al. 2015). The classification and regression
tree procedure consists of three algorithms: tree growing, tree pruning, and finding the honest tree.
The Random Forests algorithm is an ensemble method that applies tree growing to many
random subsets of the data and of the set of splitting variables. Random Forests can be used
both for classification and for regression applications.
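For regression problems, the tree-growing step follows the standard CART criterion of Breiman et al. (1984): at each node, choose the splitting variable $j$ and cutpoint $s$ that minimize the within-child sum of squared errors,

$$\min_{j,s}\;\Bigl[\sum_{i:\,x_{ij}\le s}\bigl(y_i-\bar{y}_L\bigr)^2+\sum_{i:\,x_{ij}>s}\bigl(y_i-\bar{y}_R\bigr)^2\Bigr],$$

where $\bar{y}_L$ and $\bar{y}_R$ are the mean responses in the left and right child nodes.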
Additional information: Spain19_Mora.pdf
Ricardo Mora
Universidad Carlos III de Madrid
14:45–16:00
Abstract:
Meta-analysis provides a theoretical framework to
integrate and analyze empirical evidence from multiple studies. It
has been applied to many areas of research, such as econometrics,
education, psychology, and medicine. The new suite of meta commands
provides an integrated framework to address the different aspects of
a meta-analysis simply. I will discuss how to prepare and
summarize our data, address heterogeneity using
random-effects models, extend these models with
meta-regression, and use postestimation commands to perform
statistical tests and assess possible issues in our data.
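A minimal sketch of the workflow, assuming precomputed effect sizes es with standard errors se and a hypothetical moderator x:

* Stata 16 meta suite (variable names hypothetical)
meta set es se, studylabel(study) random(reml)  // declare the data
meta summarize            // overall effect and heterogeneity statistics
meta forestplot           // study-by-study graphical summary
meta regress x            // random-effects meta-regression
meta bias, egger          // test for small-study effects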
Additional information: Spain19_Canette.pdf
Isabel Canette
StataCorp
16:00–16:30
Abstract:
Technical efficiency analysis has limited utility for policymakers and managers unless sources
of inefficiency are identified. Apart from decisions about inputs and outputs, the skills and
competencies that managers gain over the years, as well as the environment, are drivers of airports'
efficiency. Previous studies in agricultural economics have found managerial practices (for
example, Rougoor, Trip et al. 1998; Hansson 2008; and Manevska-Tasevska and Hansson 2011), knowledge-based
agricultural education (for example, Galanopoulos et al. 2006 and Manevska-Tasevska 2013),
experience (for example, Puig-Junoy and Argiles 2004), and economy-driven goals (for example,
Willock et al. 1999 and Wilson et al. 2001) to have a positive effect on efficiency. Nevertheless,
these factors have not been accounted for in air transport studies. In this study, a stochastic
frontier production function is used to measure airports' efficiency. Because investments are made in
some airports to the detriment of others, capital investments are modeled as nonneutral technical
change, allowing time-varying efficiencies. The overall efficiency is expected to be decomposed
into airport-specific factors, such as airport managers' practices, and a component related
to time-varying residual factors. Both managerial skills and the economic size of the airports,
understood as technological endowments, should provide insights into airports' performance.
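As a point of reference, a minimal sketch of a time-varying-efficiency panel frontier with official Stata commands (the Battese-Coelli decay model; variable names hypothetical), though the study's specification with nonneutral technical change is richer:

* Panel stochastic frontier with time-varying inefficiency
xtset airport year
xtfrontier lnoutput lncapital lnlabor, tvd   // time-varying decay model
predict te, te                               // technical-efficiency scores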
Additional information: Spain19_Ripoll-Zarraga.pptx
Ane Elixabete Ripoll-Zarraga
Universidad Pontificia de Comillas
17:00–17:30
Abstract:
Product-mix problems, where a range of products that generate
different income competes for a limited set of resources, are key to the
success of organizations in many industries. These are simple optimization
problems in their most basic forms; however, the consideration of
uncertainties may turn them into intractable problems. In this presentation,
I investigate the economic impact of demand uncertainties on organizations
facing such decision-making problems. In this fashion, I extend
Goldratt's PQ problem to uncertain settings by considering variability in
the volume and mix of customer needs. To this end, I develop a hybrid
model-driven decision support system, characterized by a colored Petri net
that shapes the dynamics of the agent-based production system along with a
discrete-event engine of the simulation clock that makes it more agile. I
design a Drum-Buffer-Rope mechanism aimed at protecting the throughput of
the production system when it is exposed to uncertainty. Through a
statistical study, I obtain regression models that link the net profit to
the degree of volume and mix variabilities. I observe that, as demand
uncertainty grows, the net profit becomes more volatile and tends to
decrease. However, I find that, up to a certain threshold, the production
system is barely affected by variabilities; thus, the Drum-Buffer-Rope mechanism
makes it robust to uncertainties. This has important implications for
professionals dealing with product-mix problems because understanding the link
between uncertainties and performance may prompt them to consider several
actionable changes in their organizations.
Additional information: Spain19_Costas.pptx
José Costas
Plastic7A, S.L.
17:30–18:00
Abstract:
The objective of our presentation is to exchange information
and experiences with other professors as well as university researchers and
professionals interested in promoting and improving teaching methods mainly in
sociology, political science, and economics. In this
presentation, we will share a teaching-innovation experience from a
project that won competitive funding at the
Universidad Autónoma de Madrid (D_020.18_INN "Quantitative techniques
in short teaching videos"). The innovation project consists of the
elaboration of a series of educational videos on the analysis of social data—aggregate
or individual—using Stata. In this presentation, we will
discuss at least three aspects of this experience.
Additional information: Spain19_Rama.ppsx
José Rama
Andrés Santana
Universidad Autónoma de Madrid
18:00–18:30
Abstract:
Given the emergence of big data generated by massive digitization, as well as the growing
access to information from the so-called second digital revolution, social scientists
face a number of methodological challenges to better understand social life: data
collection, new ways of sampling, automatic coding, and statistical analysis of information.
This presentation proposes the graphical analysis of information based on data binarization. The idea is to build three-dimensional binary matrices formed by 1) temporal or spatial sets, 2) scenarios, and 3) events or characteristics, supported by matrices with their attributes. The treatment of this structure is based on the methodology of two-mode networks, combined with statistical tools for the selection and location of nodes and the representation of edges.
Graphs have been used not only to solve topographic problems and to represent social structures, but also to study relationships between variables. To improve their analytical potential, these graphs are endowed with an interactive capability that includes the selection of various attributes for the recognition of the elements analyzed and the modification of parameters to focus on stronger relationships.
In this presentation, we advance a Stata program that uses Stata's recent link with Python to elaborate these interactive graphs, as sketched below. We give a variety of examples that range from the analysis of photo collections, content analysis of text, representation of concerts and exhibitions, and surveys of personal correspondence to the analysis of multiple-response questions in questionnaires.
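A minimal sketch of the kind of Stata-Python round trip involved, assuming hypothetical variables actor and event that define a two-mode network and the networkx package on the Python side:

python:
import networkx as nx
from sfi import Data

# Read the two-mode (actor x event) edge list from the dataset in memory
actors = Data.get("actor")
events = Data.get("event")

G = nx.Graph()
G.add_nodes_from(set(actors), bipartite=0)   # mode 1: actors
G.add_nodes_from(set(events), bipartite=1)   # mode 2: events
G.add_edges_from(zip(actors, events))        # affiliation ties

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
end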
Additional information: Spain19_Escobar.pdf
Modesto Escobar et al.
Universidad de Salamanca
18:30–19:00
Abstract:
It has become more difficult to publish an academic article that shows only the table with
the results of the different regressions used to test your hypotheses. At the very least,
authors who add a few graphs of predicted probabilities (margins plots) or the occasional
coefficient graph know that their chances of success are higher. Despite this, not all graphics
produced by Stata are the same and, without a doubt, some look prettier than
others. Our presentation is dedicated to the graphical presentation of
results from multiple regression models. Accordingly, we divide the
presentation into two parts.
The first part is dedicated to the presentation of the estimates of the effects of the variables (the betas) with the help of the community-contributed coefplot command. We will start from a basic graph and enrich it with headings, groups, notes, statistical significance, and other less-known options that help to better distinguish the contributions of each model. In addition, we will discuss the relevance or irrelevance of standardizing the variables (standard deviation = 1), showing how, surprisingly, some results change depending on whether we standardize or leave the variables as they were.
However, betas are of little use when the corresponding variables are not quantitative. Because many variables of interest in the social sciences are qualitative, the second part of the presentation is dedicated to the average marginal effects (AMEs) of all the independent variables, using the post option of margins to save results that can then be graphed with official and community-contributed Stata commands, as sketched below.
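A minimal sketch of both graph types (coefplot is Ben Jann's community-contributed command, installable with ssc install coefplot; variable names are hypothetical):

* Coefficient plot comparing two models
logit vote age educ i.female
estimates store m1
logit vote age educ i.female income
estimates store m2
coefplot m1 m2, drop(_cons) xline(0)

* Average marginal effects, posted so they can be plotted
margins, dydx(*) post                  // AMEs from the active model
coefplot, xline(0)                     // plot the posted AMEs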
Additional information: Spain19_Santana.ppt
Andrés Santana
José Rama
Universidad Autónoma de Madrid
19:00–19:30
Open panel discussion with Stata developers
Scientific committee
Economics:
Dr. Ricardo Mora
Dpto. Economía, Universidad Carlos III de Madrid
Sociology and Political Science:
Dr. Modesto Escobar
Dpto. Sociología y Comunicación, Universidad de Salamanca
Health Sciences:
Dr. Alexander Zlotnik
Cuerpo Superior de Sistemas y Tecnologías de la Información de la Administración del Estado
Jefe de Servicio en el Ministerio de Sanidad, Consumo y Bienestar Social de España
Logistics organizer
The logistics organizer for the 2019 Spanish Stata Conference is Timberlake Consulting S.L., the distributor of Stata in Spain.
View the proceedings of previous Stata Users Group meetings.