Last updated: 16 June 2014
2015 German Stata Users Group meeting
Friday, 26 June 2015
Institute for Employment Research (IAB)
Room 168
Regensburger Str. 104
Nuremberg, Germany
Proceedings
Statistical learning with boosting
Matthias Schonlau
University of Waterloo, Canada
Additional materials:
de15_schonlau.pdf
Estimating survival-time treatment effects and endogenous treatment effects using Stata
David Drukker
StataCorp LP
After reviewing the potential-outcome approach to estimating treatment effects from
observational data, this talk discusses new estimators in Stata 14 for estimating
average treatment effects from survival-time data and estimators for average
treatment effects from endogenous-treatment designs. The talk also covers new research on
estimating quantile treatment effects.
Additional materials:
de15_drukker.pdf
Multiprocess modeling with Stata
Tamás Bartus
Corvinus University of Budapest
Multiprocess hazard models consist of multilevel hazard and discrete choice
equations with correlated random effects and are routinely used by
demographers to correct estimates for endogeneity and sample selection. Although
no official Stata command is devoted to estimating systems of hazard equations, the
official gsem command and the user-written cmp command offer the opportunity to
estimate models of this sort (Roodman 2011; Bartus and Roodman 2014). The
presentation addresses (1) the joint estimation of multilevel discrete-time survival and
discrete-choice equations with the gsem and cmp commands; (2) the estimation
of (either multilevel or single-level) systems of lognormal survival and discrete-choice
equations with the cmp command; and (3) the preparation of multi-spell survival
datasets for the purpose of estimation. Multiprocess survival modeling is illustrated
using standard examples from demographic research.
References
- Roodman, D. 2011. Fitting fully observed recursive mixed-process models with cmp. Stata Journal 11: 159–206.
- Bartus, T., and D. Roodman. 2014. Estimation of multiprocess survival models with cmp. Stata Journal 14: 756–777.
Additional materials:
de15_bartus.pdf
A Stata ado for categorical data analysis with latent variables
Hans-Jürgen Andreß
University of Cologne
Maximilian Hörl
University of Cologne
Alexander Schmidt-Catran
University of Cologne
Path models are used widely in the social sciences to illustrate statistical
models used in applied research. They describe the assumed relationships and
dependencies between the variables of interest and are easy to comprehend even
for statistical laypersons. Up to now, they have mostly been applied to quantitative
data. But the main ideas are easily transferred to the analysis of categorical data. In
doing so, they offer a unified perspective on different statistical methods for
categorical data analysis. The catsem ado attempts to access all of these different
methods, which are scattered over a whole range of Stata commands, with an
easy-to-understand and intuitive command language that basically describes path
diagrams. Moreover, it adds functionality that at present is not yet included in Stata:
the possibility to include categorical latent variables (Andreß 1997) and the
possibility to analyze fairly general functions of the responses as
described by Grizzle et al. (1969).
References
- Andreß, H.-J., J. A. Hagenaars, and S. Kühnel. 1997. Analyse von Tabellen und kategorialen Daten: Log-lineare Modelle, latente Klassenanalyse, logistische Regression und GSK-Ansatz. Berlin: Springer-Lehrbuch.
- Grizzle, J., C. Starmer, and G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25: 489–504.
Additional materials:
de15_andress.pdf
simarwilson: DEA-based two-step efficiency analysis
Harald Tauchmann
Friedrich-Alexander-Universität Erlangen-Nürnberg
Measuring the efficiency of production units (decision-making units, DMUs) has developed into an
industry in applied econometrics. Unlike parametric approaches, nonparametric
techniques, namely data envelopment analysis (DEA), yield individual efficiency scores for DMUs but do not
directly answer the question of what determines efficiency differentials between
them. One obvious way to circumvent this limitation is to conduct a two-stage
analysis in which DEA scores obtained in the first stage serve as left-hand-side
variables in a second-stage regression that links efficiency to exogenous factors.
Such a two-step approach, however, encounters severe problems: (i) DEA efficiency
scores are bounded (depending on how efficiency is defined) from above or from
below at the value of one; and (ii) DEA generates a complex and generally unknown
correlation pattern among estimated efficiency scores, resulting in invalid inference in
the subsequent regression analysis. To address these problems, Simar and Wilson
(2007) suggest a simulation-based, multistep iterative procedure that follows DEA
and is based on (i) truncated regressions, (ii) simulating the unknown error
correlation, and (iii) calculating bootstrapped standard errors. We introduce the new
Stata command simarwilson, which implements this procedure. It
complements the user-written command dea (Ji and Lee 2010), which has to precede
simarwilson in applied work.
References
- Simar, L., and P. W. Wilson. 2007. Estimation and inference in two-stage, semiparametric models of production processes. Journal of Econometrics 136: 31–64.
- Ji, Y.-B., and C. Lee. 2010. Data envelopment analysis. Stata Journal 10: 267–280.
Additional materials:
de15_tauchmann.pdf
A simple procedure to correct for measurement errors in survey research
Anna de Castellarnau
Research and Expertise Centre for Survey Methodology, University Pompeu Fabra
Although there is a wide literature on the existence of measurement errors,
few researchers correct for them in their analyses. In this presentation, we will
show that correction for measurement errors in survey research is not only necessary
but also possible and actually rather simple. Using the quality estimates obtained
from the free online software Survey Quality Predictor (SQP), correlation and
covariance matrices can easily be corrected and used as input for analyses.
This procedure was described for Stata, LISREL, and R in the ESS EduNet module
"A simple procedure to correct for measurement errors in survey research". This
presentation will focus on the correction of measurement errors in regression
analysis and causal models using Stata.
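The correction step can be illustrated with a minimal sketch. It assumes that each variable has a quality estimate q² (the share of observed variance attributable to the concept of interest) and it ignores common method variance; under those assumptions, placing q² on the diagonal of the correlation matrix and rescaling is equivalent to dividing each observed correlation by the product of the two quality coefficients. The function name and all numbers below are hypothetical illustrations, not SQP output and not the code used in the talk.

```python
import math


def correct_correlations(observed, quality):
    """Correct an observed correlation matrix for measurement error.

    observed: square list-of-lists correlation matrix (ones on the diagonal).
    quality:  list of quality estimates q^2, one per variable (assumed input).
    Assumption: no common method variance is subtracted in this sketch.
    """
    n = len(observed)
    # Step 1: replace the unit diagonal by the quality estimates.
    adjusted = [
        [quality[i] if i == j else observed[i][j] for j in range(n)]
        for i in range(n)
    ]
    # Step 2: rescale to a correlation matrix again.
    scale = [math.sqrt(q2) for q2 in quality]
    return [
        [adjusted[i][j] / (scale[i] * scale[j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up example: two survey items with quality 0.64 and 0.81.
observed = [[1.0, 0.3], [0.3, 1.0]]
corrected = correct_correlations(observed, [0.64, 0.81])
print(round(corrected[0][1], 4))
```

The corrected matrix, rather than the observed one, would then serve as input for the regression or causal model.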
Additional materials:
de15_deCastellarnau.pdf
Time-series analysis using ARFIMA
Frank Ebert
Ebert Beratung und Innovationen GmbH
Since version 12, Stata has offered the analysis of ARFIMA models. How can it
be applied, and what should be considered when using it? Weather data are reported
to show a "long" memory. This can be checked by estimating the fractional
integration parameter d of an autoregressive fractionally (or fractal) integrated
moving average (ARFIMA) process. Further relevant data are high-frequency stock market
quotations and energy prices. Weather data (in particular wind time series)
seem to show a complementary behavior to energy prices. A further aspect is the
characterization of a time series by its fractional integration parameter d. Can it be
used to compress large amounts of time-series data? More technical questions are the following:
What should be considered when working with data that are influenced by fractal (nonwhite)
noise, and what could be done to overcome performance problems?
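To make the role of the fractional integration parameter d concrete, the following sketch computes the weights of the fractional differencing filter (1 - L)^d from its binomial expansion, w_0 = 1 and w_k = w_{k-1} (k - 1 - d) / k, and applies them to a series. For noninteger d the weights decay slowly, which is what "long" memory refers to. This is an illustration only, not how Stata's arfima command estimates d; the function names are hypothetical.

```python
def frac_diff_weights(d, n_weights):
    """First n_weights coefficients of the expansion of (1 - L)^d."""
    weights = [1.0]
    for k in range(1, n_weights):
        # Recursion from the binomial series: w_k = w_{k-1} * (k - 1 - d) / k
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply the truncated filter (1 - L)^d to a list of observations.

    Values before the start of the sample are treated as zero (an assumption).
    """
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


# With d = 1 the filter reduces to the ordinary first difference.
first_diff = frac_diff([1.0, 3.0, 6.0, 10.0], 1)

# With d = 0.4 the weights die out slowly instead of vanishing after lag 1.
long_memory_weights = frac_diff_weights(0.4, 5)
print(first_diff)
print(long_memory_weights)
```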
Additional materials:
de15_ebert.pdf
PSIDTOOLS: An interface to the Panel Study of Income Dynamics
Ulrich Kohler
University of Potsdam
The presentation discusses a collection of user-written programs designed to
make analyses of the Panel Study of Income Dynamics (PSID) easier. The PSID is
the longest-running longitudinal household survey in the world. Beginning in 1968, the
PSID collected yearly information from over 18,000 individuals living in 5,000
households. The PSID offers data to study a broad range of topics, including
employment, income, wealth, expenditures, health, and numerous others. However, as in
many other panel studies, the hurdles for using the data are relatively high.
One reason is that the main corpus of the PSID data is delivered to the end
user in sets of yearly ASCII text files, forcing the user to first retrieve a dataset
streamlined to the research topic. The PSID tools make these initial steps of PSID
data analysis very easy. Particularly, the programs automatically create Stata datasets
from ASCII text files, load and merge items from several PSID waves, ease wide-long
conversions (while keeping labeling information), and automatically add value-label
information from the PSID homepage to the dataset in memory.
Additional materials:
de15_kohler.pdf
Extensions to the label commands
Daniel Klein
University of Kassel
Stata has commands to change variable names, as well as their contents,
using expressions, a variety of functions, or simple transformation rules. Name
abbreviations, wildcard characters, time-series operators, and factor-variable notation
further facilitate working with variables. Managing value and variable labels, on the
other hand, is not as convenient. Despite a large number of existing user-written
commands for this purpose, there is still room for improvement. In this presentation, I
introduce a new package, elab, that aims at transferring concepts for manipulating
variables to value and variable labels. The package enhances the capabilities of
official Stata's label suite and introduces additional tools similar to existing Stata
commands for managing variables. Features of elab include support for value-label
name abbreviations and wildcard characters and for restricting requests to
subsets of integer-to-text mappings. The package offers commands to systematically
change integer values and text in value labels using arithmetic expressions or string
functions. It further provides programming utilities, making it easy to implement these
features in do- and ado-files.
Additional materials:
de15_klein.pdf
A new Stata command for computing and graphing percentile shares
Ben Jann
University of Bern
Percentile shares provide an intuitive and easy-to-understand way of
analyzing income or wealth distributions. A celebrated example is the top income
shares featured in the works of Thomas Piketty and colleagues. Moreover, series of
percentile shares, defined as differences between Lorenz ordinates, can be used to
visualize whole distributions or changes in distributions. In this talk, I present a new
command called pshare that computes and graphs percentile shares (or changes
in percentile shares) from individual-level data. The command also provides
confidence intervals and supports survey estimation.
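The definition of percentile shares as differences between Lorenz ordinates can be sketched as follows. This is a minimal unweighted illustration with linear interpolation inside the observation that straddles a cutoff; it does not reproduce pshare's estimator, confidence intervals, or survey support, and the function names and incomes are made up.

```python
def lorenz_ordinate(sorted_values, p):
    """Share of the total held by the poorest fraction p (0 <= p <= 1)."""
    n = len(sorted_values)
    total = sum(sorted_values)
    position = p * n              # possibly fractional number of units below p
    whole = int(position)
    cumulative = sum(sorted_values[:whole])
    if whole < n:
        # Interpolate within the unit that straddles the cutoff.
        cumulative += (position - whole) * sorted_values[whole]
    return cumulative / total


def percentile_shares(values, cutoffs):
    """Shares of the total falling between consecutive percentile cutoffs."""
    data = sorted(values)
    ordinates = [lorenz_ordinate(data, p) for p in cutoffs]
    return [hi - lo for lo, hi in zip(ordinates, ordinates[1:])]


# Made-up incomes; quintile shares are differences of Lorenz ordinates.
incomes = [10, 20, 30, 40, 100]
shares = percentile_shares(incomes, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
print([round(s, 3) for s in shares])
```

In this toy sample the last element is the top-quintile share, the kind of top income share the abstract refers to.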
Additional materials:
de15_jann.pdf
Report to users / Wishes and grumbles
Bill Rising
StataCorp
Bill Rising, director of Educational Services at StataCorp LP, will be happy to receive wishes for developments in Stata and almost as happy to receive grumbles about the software.
Scientific organizers
Johannes Giesecke, Humboldt University of Berlin
Stephanie Eckman, Institute for Employment Research (IAB)
Logistics organizers
Dittrich & Partner Consulting GmbH, the distributor
of Stata in several countries, including Germany, the Netherlands, Austria, Czech Republic,
and Hungary.