The Spanish Stata Users Group Meeting was Thursday,
19 October 2017 at
Instituto de Salud
9:00–10:00 |
Abstract:
Nonparametric analysis has traditionally been descriptive. We fit the
regression function that relates the outcome of interest and the covariates,
and then we graph. But we can go beyond the descriptive. We may use this
function to compute marginal effects, counterfactuals, and other statistics
of interest. In other words, we may use margins after npregress
to conduct semiparametric analysis. I will show you how.
Additional information: spain17_Pinzón.pdf
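A minimal sketch of this workflow, with hypothetical variables y, x1, and x2 (the
specification is illustrative, not taken from the talk):

    npregress kernel y x1 x2          // nonparametric regression of y on x1 and x2
    margins, at(x1 = (10 20 30))      // average predictions at chosen values of x1
    margins, dydx(x1)                 // average marginal effect of x1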
Enrique Pinzón
StataCorp
|
10:00–10:15 |
Abstract:
Objectives: To implement a health expenditure prediction system based on
morbidity and to analyze its goodness of fit.
Methods: Observational, descriptive, retrospective, and cross-sectional study of total
health expenditure using explanatory-predictive stratified models. The database covered
156,811 inhabitants of the Denia health department and included age, Clinical Risk Group
(CRG), and total health expenditure, among other variables. Generalized linear models
(GLM) with a gamma distribution and logarithmic link were fit in successive iterations,
with total health expenditure as the dependent variable and age, gender, and CRG
membership as independent variables, in order to select the model that best explains the
behavior of health expenditure.
Results: The model with the highest statistical significance used the combination of age,
sex, CRG health status, and severity level, with an Akaike information criterion of 14.2.
Correlating the values estimated by the model with the real values gives a correlation of
25%. By type of expenditure, the CRG showed a greater explanatory capacity for outpatient
pharmaceutical spending and a lower explanatory capacity for hospital expenditure.
Conclusion: Multimorbidity factors have a greater impact on the explanation of health
expenditure than demographic variables.
Additional information: spain17_Caballer.pdf
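As a rough illustration of the kind of model described, with hypothetical variable names
(not those of the study):

    glm total_exp i.agegroup i.sex i.crg_status i.severity, family(gamma) link(log)
    estat ic                       // Akaike information criterion
    predict exp_hat, mu            // model-based expenditure predictions
    correlate exp_hat total_exp    // correlation between predicted and observed values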
Vicente Caballer
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
David Vivas, N. Guadalajara, Alexander Zlotnik, Isabel Barrachina
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
|
10:15–10:30 |
Abstract:
The objective of this study was the construction and validation of a predictive
model for the identification of complex chronic patients.
A cross-sectional study was performed on the population of the Comunidad Valenciana
region in 2015 (4,708,754 persons). Dependent variable: resource use greater than or
equal to the 95th percentile (P95), including the number of primary care contacts, number
of hospital admissions, number of visits to emergency departments, pharmaceutical
problems, and pharmaceutical costs. Predictive variables: age, morbidity (according to
clinical risk groups, CRG), and the resource use variables mentioned above.
Persons exceeding P95 were 0.2% of the population; thus, the study was carried out on a
10% sample stratified by CRG, and all persons without chronic or moderate conditions were
eliminated, that is, those belonging to health states 1, 2, 3, and 4, for a total of
150,252 persons. A logistic regression model was then built, and its validity was
analyzed with sensitivity, specificity, a goodness-of-fit test, and the area under the
ROC curve (AUC).
Additional information: spain17_Badal.pdf
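A sketch of the validation steps mentioned above, with hypothetical variable names:

    logit complex age i.crg n_primary n_admissions n_emergency pharm_cost
    estat classification           // sensitivity and specificity
    estat gof, group(10)           // Hosmer-Lemeshow goodness-of-fit test
    lroc                           // area under the ROC curve (AUC)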
Silvia Badal
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
Alexander Zlotnik, Ruth Usó, David Vivas
Universidad Politécnica de Valencia, Universidad Politécnica de Madrid
|
10:30–10:45 |
Abstract:
Missing data are common in HIV cohort studies, affecting both the covariates and
the outcome. In this case study, we compare different methods to deal with missing
data applied to estimate mortality by Hepatitis C virus coinfection in the cohort of
the Spanish Network of HIV Research (CoRIS) using Stata.
I used Poisson regression to estimate mortality rate ratios, using five methods to handle
missing data in both the covariates and the cause of death: complete-case analysis, the
indicator method (IM), multiple imputation by chained equations (MICE), multiple
imputation then deletion (MID), and inverse probability weighting (IPW). Strong
predictors were found for the values of the incomplete variables and for their
probability of being missing. No significant differences were found in excess hazard
ratios between the different methods. However, the complete-case approach led to less
precise estimates, and incorrect classification of the cause of death, or deletion of
cases with a missing cause of death when using complete-case analysis, MID, or IPW, led
to underestimation of the excess mortality rates. In this case study, MICE seemed to work
best, because it both corrected bias and produced the most accurate estimates. Although
MICE rests on the untestable assumption that data are missing at random, this seemed
plausible in this context.
Additional information: spain17_Ferreras.pdf
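For readers unfamiliar with the MICE workflow in Stata, a schematic version (hypothetical
variables, not the study's actual specification):

    mi set wide
    mi register imputed hcv cd4 cause_death
    mi impute chained (logit) hcv (regress) cd4 (mlogit) cause_death ///
        = age i.sex i.transmission, add(20) rseed(1234)
    mi estimate: poisson died i.hcv age i.sex, exposure(pyears)
    // exponentiate the coefficients to obtain mortality rate ratios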
Belén Alejos Ferreras
Instituto Carlos III
|
10:45–11:00 |
Abstract:
Introduction: Autonomic symptoms (AS) of Parkinson's disease (PD) may be present
from the time of diagnosis and even precede it. So far, there are many unknowns in
the relationship between the evolution of AS and other variables of the disease.
Objective: To describe the evolution of AS and its relationship with motor and non-motor
symptoms in PD.
Methods: Observational, multicenter study (Spain and Holland) with longitudinal follow-up
and evaluations at baseline and in the fourth year. The SCOPA-Motor, SCOPA-Cognition, and
HY stage scales were used, along with the SCOPA-AUT and SCOPA-Sleep self-administered
questionnaires. Measures of change (effect size and relative change) and of sensitivity
to change (standard error of measurement, 10% of the maximum total score, and ½ standard
deviation) were calculated for each SCOPA-AUT subscale, together with their mean value
(estimated value of change, EVC). Patients were classified as worsening or not on the
autonomic subscales depending on whether the difference between the baseline and
follow-up scores exceeded the EVC.
Additional information: spain17_Chávez.pptx
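The responsiveness statistics referred to above can be computed directly; a sketch with
hypothetical variable names:

    generate change = scopa_aut_y4 - scopa_aut_base
    quietly summarize scopa_aut_base
    scalar sd_base = r(sd)
    quietly summarize change
    display "effect size      = " r(mean)/sd_base
    display "1/2 SD criterion = " sd_base/2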
Abelardo Fernández Chávez
Hospital Ramón y Cajal
|
11:45–12:00 |
Abstract:
Stata does not include most classical machine learning algorithms in its core
libraries. A few algorithms are available through plug-ins, such as wrappers for
the LIBSVM library; however, these sometimes exhibit performance problems, do
not expose the full functionality of the algorithm, and are often challenging to modify.
Weka is an open-source software suite written in Java that implements most well-known
machine learning algorithms. Because its source code is available and documented, it is
relatively easy to introduce custom modifications that fulfill most practical business
and research needs.
In this talk, we present a simple method for integrating Stata ado-file programs with
standard Weka algorithms (CART and C4.5 decision trees, support vector machines, neural
networks, Bayesian networks, k-nearest neighbors, LogitBoost classifiers, stacking
classifiers, generic ensemble classifiers, etc.) as well as custom Weka algorithms (such
as CART trees with LogitBoost on their branches).
Additional information: spain17_Zlotnik_1.pdf
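One possible way to wire Stata to Weka from an ado-file (not necessarily the approach
taken in the talk) is to export the estimation sample, call the Weka command-line
interface through shell, and read the results back; file names and class paths below are
hypothetical:

    export delimited y x1 x2 using "train.csv", replace
    shell java -cp weka.jar weka.core.converters.CSVLoader train.csv > train.arff
    shell java -cp weka.jar weka.classifiers.trees.J48 -t train.arff > j48_results.txt
    // parse j48_results.txt (e.g., with file read) to recover the fitted tree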
Alexander Zlotnik
Universidad Politécnica de Madrid
|
12:00–12:15 |
Abstract:
Stata is a well-known statistical software package used for a wide variety of
statistical analyses. As Stata users in several fields know, current and future
data processing often require increasingly larger computing resources. Stata/MP
is a multiprocessor version suitable for some of these tasks but, even on powerful
hardware, its capacities are sometimes surpassed for computationally demanding tasks.
If these tasks can be parallelized, distributed computing approaches can also be used.
Some software packages that require powerful computing resources, such as the ChessBase
engines used for deep chess variant analysis, have introduced the possibility of
offloading some calculations to their private clouds. Alternatively, large computational
projects such as SETI@home have chosen a grid computing model. The latter approach could
be further enhanced with a blockchain-based distributed ledger that registers the
computational power contributed to the community by each of its members and rewards them
for their contribution. All of these approaches, or combinations thereof, could be used
for new Stata licensing schemes.
Additional information: spain17_Zlotnik_2.pdf
Alexander Zlotnik
Universidad Politécnica de Madrid
David Arroyo Manzano
Bioestadístico
|
12:15–12:30 |
Abstract:
Statistical Data and Metadata eXchange (SDMX) is an ISO standard
developed by seven international organizations (BIS, ECB, Eurostat, IMF, OECD, the
United Nations, and the World Bank) to facilitate the exchange of statistical data.
The package sdmxuse (available from the SSC archive) allows Stata users to download
and import SDMX data directly within their favorite software. The program builds and
sends a query to the statistical agency (using RESTful web services), then imports
and formats the downloaded dataset (in XML format). The complex structure of the
datasets (so-called "cube") is reviewed to show how users can send specific
queries and import only the required time series. sdmxuse might prove useful for
researchers who need frequently updated time series and wish to automate the
downloading and formatting process.
Additional information: spain17_Fontenay.pdf
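A typical call, following the syntax documented in the package's help file (the provider,
dataset, and dimension codes are only an example and should be checked against help
sdmxuse):

    ssc install sdmxuse
    sdmxuse data ECB, dataset(EXR) dimensions(M.USD.EUR.SP00.A) clear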
Sébastien Fontenay (was unable to attend and present)
Université catholique de Louvain
|
12:30–12:45 |
Abstract:
The purpose of this paper is to implement
estimators proposed by Albarrán et al. (2017) in Stata for dynamic binary choice
correlated random effects (CRE) models with unbalanced panel data. The
procedure allows for unrestricted correlation between the sample selection
process that determines the unbalancedness and the time invariant unobserved
heterogeneity. We create a specific command for this procedure, named xtunbalmd.
It fits the model for each subpanel separately and obtains
estimates of the common parameters across subpanels by minimum distance (MD).
This estimation method is faster than estimation by maximum likelihood (ML),
because it allows us to use the same estimation routines that we would use if we had
a balanced panel, while keeping the good asymptotic properties of the ML
estimator for the whole sample.
Additional information: spain17_Albarrán.pdf
Pedro Albarrán
Universidad de Alicante
Raquel Carrasco, Jesús M. Carro
Universidad Carlos III de Madrid
|
12:45–1:00 |
Abstract:
In the presence of a choice of two binary variables, the usual econometric
procedure within the framework of the random utility model is the estimation of
a bivariate probit that accounts for the potential correlation between
the error terms for the utilities of all different alternatives. This approach is
not useful if the objective of the analysis is the study of the complementarity or
substitutability of the two alternatives, because the bivariate model assumes by
construction that the two alternatives are independent from the economic point of
view. In other words, in the bivariate probit, a factor that points to
one alternative in the first choice, but does not affect the utilities of the other
choice directly, does not induce a change in the second choice. To study the
complementarity or substitutability of alternatives, it is necessary to fit
a more flexible model such as the multinomial model and to compute expected
complementarity patterns from the standard results. In this presentation, we introduce
the Stata command gentzkow, which performs the complete analysis, and we illustrate its
usefulness with an example with data from China on the double choice of grandparents
to first live in the same house as their children and grandchildren, and secondly to
help with the care of the grandchild.
Additional information: spain17_Mora.pdf
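The contrast drawn in the abstract can be sketched with standard commands (hypothetical
variable names; the gentzkow command automates the full analysis):

    biprobit coreside help_care x1 x2        // standard bivariate probit of the two choices
    generate joint = 2*coreside + help_care  // the four joint alternatives: 0, 1, 2, 3
    mlogit joint x1 x2, baseoutcome(0)       // flexible multinomial model over joint outcomes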
Ricardo Mora
Universidad Carlos III
Yunrong Li
Southwestern University of Finance and Economics
|
1:00–1:30 |
Abstract:
Decision models are based on Markov processes that describe the statistical
laws for possible states or sequential events to which an individual or patient
is subject within a system. Every decision model can be represented as a probability
tree containing nodes and branches. Each node represents a possible state of the
patient in its clinical evolution and socioeconomic status, while each branch joins
two states that are sequentially possible. Thus, several branches arise from the initial
node representing the patient's input to the system representing the following
different possible states. Each terminal node of the tree represents the last possible
state after a particular patient evolution. Therefore, probability trees are diagrams
that represent all possible evolutions of a patient within a system. By assigning net
costs to each node and conditional probabilities to each branch, it is possible to
calculate the expected net cost per patient. Using Monte Carlo techniques, the
distribution of estimated net costs per patient in the population of interest can be
estimated to incorporate the uncertainty inherent in using estimated values for
conditional probabilities and net costs. I introduce the manantial command, which takes as
inputs the decision tree, probability distributions, and payoffs. The command provides
significance tests and confidence intervals, and performs sensitivity analysis. We
illustrate the use of the command with an evaluation of early intervention in
psychosis. Early intervention in psychosis is a clinical approach for those who
experience symptoms of psychosis for the first time. It is part of a new paradigm of
prevention in psychiatry that is shaping the reform of mental health services.
The focus is on the early detection and treatment of early symptoms of psychosis during
the formative years critical to the psychotic condition.
Additional information: spain17_García_Goñi.pdf
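The Monte Carlo logic described above can be illustrated with a toy two-branch tree (all
numbers are hypothetical and unrelated to the psychosis application):

    clear
    set seed 12345
    set obs 1000
    generate p    = rbeta(20, 80)            // uncertain branch probability, mean 0.20
    generate cost = p*50000 + (1 - p)*10000  // expected net cost per patient for each draw
    summarize cost
    _pctile cost, percentiles(2.5 97.5)
    display "95% interval: " r(r1) " to " r(r2)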
Manuel García-Goñi
Universidad Complutense de Madrid
Ricardo Mora
Universidad Carlos III de Madrid
|
3:00–4:00 |
Abstract:
Sometimes we are interested in identifying and understanding different
groups in a population, even though we cannot directly observe which group
each individual belongs to. Latent class analysis deals with these problems.
Often, those classes are determined by heterogeneity in regression models, where the
relationship of a dependent variable (or variables) with a group of covariates varies
from group to group. The new features added to the gsem command in Stata 15 allow us to
fit many latent class models, including finite mixture models, which can also be fit
using the new fmm prefix. We will introduce these topics and discuss examples using
Stata.
Additional information: spain17_Cañette.pdf
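A flavor of the new syntax (illustrative only, with hypothetical variables):

    fmm 2: regress y x1 x2          // finite mixture of two linear regressions
    estat lcprob                    // estimated class probabilities
    gsem (y <- x1 x2), lclass(C 2)  // the same kind of model expressed with gsem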
Isabel Cañette
StataCorp
|
4:30–4:45 |
Abstract:
Simulation is nowadays a very important way of analyzing new improvements
in different areas before their physical implementation, which may require
substantial resources that can only be justified when the probability
of success is high. Random samples from different distributions are a must
in simulations.
In this talk, we introduce new Stata functions for generating random samples from
continuous and discrete distributions that are not covered by Stata's built-in
random-number generation functions. In addition, we introduce new Stata functions for
generating random samples as an alternative to the built-in Stata functions. The goodness
of the generated samples will be checked using the mean squared error (MSE) of the
differences between the sample frequencies and the theoretically expected ones. We will
also provide bar charts that allow the user to graphically compare the sample with the
exact distribution function of the distribution being sampled.
Additional information: spain17_Aguilera-Venegas.pdf
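As a point of reference, the inverse-transform method behind many such generators takes
only a few lines; an exponential example with a hypothetical rate parameter:

    clear
    set seed 2017
    set obs 10000
    scalar lambda = 1.5
    generate u = runiform()
    generate x = -ln(1 - u)/lambda   // inverse CDF of the exponential(lambda) distribution
    summarize x                      // the mean should be close to 1/lambda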
Gabriel Aguilera-Venegas
Universidad de Málaga, Universidad Politécnica de Madrid
José Luis Galán-García, M. Ángeles Galán-García, Pedro Rodríguez-Cielos, Ricardo Rodríguez-Cielos
Universidad de Málaga, Universidad Politécnica de Madrid
|
4:45–5:00 |
Abstract:
Typical multilevel analysis in comparative research implies the use of
cross-sectional data for multiple countries. Multilevel models in such
settings are likely to be affected by problems of endogeneity and omitted-variable
bias because of unobserved heterogeneity. However, there is a
growing volume of longitudinal data in comparative data projects, because they
typically span multiple waves (e.g., the European Social Survey or the World
Values Survey). This allows us to exploit the longitudinal dimension of the data
by splitting the effect of aggregate variables into two different sources of
variation (between and within countries), which makes multilevel models robust
against the problem of unobserved heterogeneity. Drawing upon a few recent works
in the literature that propose to include both cross-sectional and longitudinal
effects in multilevel models, I focus on the theoretical
and practical implications of this modeling strategy. Furthermore, I provide
some examples and practical recommendations using this approach with Stata.
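A sketch of the between/within split, with hypothetical variables (y an individual
outcome, z a country-level predictor observed across waves):

    bysort country: egen z_between = mean(z)   // country mean over waves (between effect)
    generate z_within = z - z_between          // wave-specific deviation (within effect)
    mixed y x z_between z_within i.wave || country: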
Antonio M. Jaime-Castillo
Universidad de Málaga
|
5:00–5:15 |
Abstract:
Demography has traditionally been interested in birth weight and the impact
that certain descriptive characteristics have on birth weight. Most of the
interest in this variable lies in the fact that weight at birth is a significant
predictor of infant health outcomes (as well as health at adult ages), but also
of cognitive performance and educational results. While evidence explaining
the prevalence of low birth weight is common in different
disciplines, it is much less frequent to see high-quality evidence built from large
sample sizes quantifying the impact of weight at birth on schooling outcomes.
In this paper, we use data from the Chinese Family Panel Study (2010 wave), a large-scale
representative sample of Chinese households, to model the effect of low birth weight on
standardized test scores among Chinese children aged 10–15 years. Our evidence confirms a
highly significant negative effect of LBW on the results obtained by children in both
mathematics and Chinese language. The paper also shows a clear gradient in the prevalence
of low birth weight by family background. Our evidence also implies that highly educated
parents (mothers) can actually compensate for the disadvantage that low birth weight
represents in terms of cognitive performance.
Héctor Cebolla-Boado
UNED
Leire Salazar
UNED
|
5:30–5:45 |
Abstract:
According to Eurostat, between 2011 and 2016 early school leaving in Spain fell
from 26.3% to 19.0%. Following this fast progression, the target for 2020
(15%) will certainly be reached soon. The same database indicates that
the percentage of the population aged 30–34 with tertiary studies has remained
above 40%. With these indicators, we can only congratulate a society that is
winning the battle against premature school leaving and has such an abundant
volume of highly qualified young people.
The above information is based on the Spanish Labour Force Survey (EPA, Encuesta de
Población Activa), a panel survey in which an individual can be observed up to six times
in a row. By applying Stata's xt commands for panel-data analysis, the picture of the
level of education in Spain changes dramatically: early school leaving is much higher and
the proportion of people who completed a university degree much lower. We just need to
take into account that the EPA surveys the same individuals on different occasions.
Pau Miret Gamundi
Universidad Autónoma de Barcelona, Centro de Estudios Demográficos
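The panel logic alluded to above amounts to declaring the rotating panel and following
individuals across interviews; a sketch with hypothetical variable names:

    xtset person quarter
    xtdescribe                 // how many times each individual is observed
    xttrans educlevel          // transitions in educational attainment between interviews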
|
5:45–6:00 |
Abstract:
Normally, after survey data collection is completed, the final sample differs
from population figures on key variables. If the population figures
are known, the final sample can be adjusted using techniques such as post-stratification
or calibration. These techniques are used to compute weights, which ensure that the
weighted distribution of the final sample matches the population on key variables.
This presentation compares two commands available in Stata that, under certain circumstances,
lead to different results: svyset and calibrate. svyset, which is the Stata command to
deal with survey data, includes poststrata and postweight options to post-adjust survey
data; calibrate (D'Souza, 2011) is used to compute different types of calibration.
The main difference in how these two commands compute the weights is their treatment of
missing values. Here, we present how to use both, alongside an explanation of
the differences, followed by research examples.
Additional information: spain17_Cabrera.pdf
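For reference, the post-stratification route through svyset looks like this (variables
are hypothetical; see the calibrate help file for the calibration counterpart):

    svyset psu [pweight = design_wt], poststrata(agegroup) postweight(pop_agegroup)
    svy: mean y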
Pablo Cabrera
Universidad de Salamanca
Modesto Escobar
Universidad de Salamanca
|
6:00–6:15 |
Abstract:
The presentation we propose is mainly conceived as a contribution to Stata
teaching strategies and tricks in university courses, but many of the tricks
we show are very useful for research purposes as well. Some of the questions
we will address in our presentation are: How can we open
datasets that are originally in older versions of Stata? How can we open datasets available
in other programs, like SPSS? We bring attention to several short commands
that enable us to accomplish these goals. How do we show both numeric codes and value labels in
our tables? How do we perform comparisons of means that show both means and p-values?
We point to the existence of a command that allows us to do so while showing a very
easy-to-interpret output. How do we show correlations with their corresponding
significance levels? How do we combine several graphs with a single legend and
the same scale on both axes? How do we compare several models with tables and with graphs?
Additional information: spain17_Santana.pdf
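A few of the one-liners alluded to in the abstract (official commands; variable, graph,
and model names are hypothetical):

    numlabel, add                             // show numeric codes together with value labels
    pwcorr x1 x2 x3, sig star(.05)            // correlations with significance levels
    graph combine g1 g2, ycommon xcommon      // combined graphs sharing the same scales
    estimates table m1 m2, star stats(N r2)   // compare several models in one table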
Andrés Santana
Universidad Autónoma de Madrid
José Rama
Universidad Autónoma de Madrid
|
6:30–7:00 |
Wishes and grumbles
StataCorp
|
Organizers
Scientific committee
Ricardo Mora
Universidad Carlos III de Madrid
Modesto Escobar
Universidad de Salamanca
Alexander Zlotnik
Universidad Politécnica de Madrid
Logistics organizer
The logistics organizer for the 2017 Spanish Stata Users Group meeting is
Timberlake Consulting S.L.,
the distributor
of Stata in Spain.
View the proceedings of previous Stata Users Group meetings.