Don't forget to save the date and join us next year at the Stata Conference in Nashville, Tennessee on 31 July–1 August 2025!
View the conference photos here, and view the proceedings and presentation slides below.
All times Pacific Daylight Time
Additional information:
US24_Barrette.pptx
Matching-adjusted indirect comparison (MAIC) is a comparative effectiveness research methodology that leverages individual-level data and aggregate results when head-to-head randomized trials are not available or feasible. MAIC is growing in popularity partly because of the high cost of randomized trials and partly because regulators are interested in additional safety and effectiveness evidence. Since the seminal papers describing the theory and application of MAIC were published just over a decade ago, the literature on how to apply the method, along with demonstrations of its applications, has grown quickly. The National Institute for Health and Care Excellence (NICE) in the UK released a technical document in 2016 that described MAIC best practices and provided sample R code for an example analysis. As the method has become more popular, references to the use of Stata for statistical analysis are appearing in publications, yet very little documentation or code is available. We present the NICE technical documentation worked example in Stata, in parallel with the original example in R, and highlight the efficiencies and potential challenges of both programs.
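As a hedged illustration of the mechanics (not the presenters' code), the sketch below estimates MAIC weights with official Stata's gmm command by solving the method-of-moments condition that weighted individual-level covariate means match published aggregate means; the variable names and aggregate values are hypothetical.

    * Center IPD covariates at the comparator trial's published aggregate means
    * (52.3 years mean age and 48% male are made-up values).
    generate double age_c  = age  - 52.3
    generate double male_c = male - 0.48
    * With gmm's default constant-only instruments, each equation imposes the
    * moment condition sum_i exp(a1*age_c_i + a2*male_c_i) * x_i = 0.
    gmm (exp({a1}*age_c + {a2}*male_c)*age_c)   ///
        (exp({a1}*age_c + {a2}*male_c)*male_c), ///
        winitial(identity) onestep
    * MAIC weights: after weighting, the IPD covariate means match the aggregates.
    generate double w = exp(_b[/a1]*age_c + _b[/a2]*male_c)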
Additional information:
US24_Brownell.html
Evaluating the out-of-sample properties of statistical models is important, especially for predictive modeling/analytics. Although Stata currently implements cross-validation natively for some model-fitting commands—dslogit, dspoisson, dsregress, elasticnet, lasso, poivregress, pologit, popoisson, poregress, sqrtlasso, xpoivregress, xpologit, xpopoisson, and xporegress—broader use of cross-validation is not supported out of the box. At last year’s conference, a user explained the challenges that students and new users face when trying to use cross-validation procedures in Stata. While it is possible to implement the four-step process of splitting the sample, fitting the model to the training sample, predicting outcomes on the validation/test sample, and computing metrics related to the fit, doing so is tedious and time-consuming. Developing a program that implements this four-step process is not a trivial task, despite what one of the authors initially thought. In this talk, we present xv, an extensible prefix command implementing cross-validation for Stata estimation commands.
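For readers unfamiliar with the four-step process, here is a minimal sketch using only official commands (the auto data are used purely for illustration):

    sysuse auto, clear
    splitsample, generate(sample) split(0.75 0.25) rseed(12345)  // 1 = training, 2 = test
    regress mpg weight displacement if sample == 1               // fit on the training sample
    predict double mpg_hat if sample == 2                        // predict on the test sample
    generate double sqerr = (mpg - mpg_hat)^2 if sample == 2
    quietly summarize sqerr
    display "Test MSE = " r(mean)                                // out-of-sample fit metric

A prefix command such as xv automates and generalizes exactly this kind of loop.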
Additional information:
US24_Ender.pdf
Ordinary least-squares (OLS) regression estimates coefficients such that the residual sum of squares (RSS) is minimized; equivalently, the R-squared between the response variable and the predictors is maximized. The solution for the OLS coefficients is unique; that is, only one set of coefficients minimizes the RSS. But what if we estimated coefficients whose R-squared comes within one percent (0.01) or less of the maximum? There can be multiple sets of coefficients that yield the same R-squared. These are the fungible regression coefficients (FRCs). How many different fungible regression coefficients are possible? What do these FRCs look like? Are these FRCs of any use whatsoever? This presentation will address these questions.
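The flavor of the question can be sketched in a few lines: perturb the OLS slopes and compare the squared correlation between the outcome and the alternative linear predictor with the OLS R-squared. The data and scaling factors below are arbitrary illustrations, not the presenter's method:

    sysuse auto, clear
    regress mpg weight length
    local r2_ols = e(r2)
    * An alternative coefficient vector: shrink one slope, inflate the other.
    generate double xb_alt = 0.9*_b[weight]*weight + 1.1*_b[length]*length
    quietly correlate mpg xb_alt
    display "OLS R-squared:         " `r2_ols'
    display "Alternative R-squared: " r(rho)^2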
Additional information:
US24_Jin.pptx
Background: Traditional Stata programming requires the mastery of syntax for tailoring program behaviors. However, the traditional programming approach is challenging in scenarios where 1) users lack familiarity with Stata syntax, 2) instructions for third-party programs are nonexistent, and 3) the program calls for complex syntax. To address these issues, we've explored a prompt-based programming approach.
Method: We used Stata's _request() directive of the display command to offer users stepwise value-intake prompts, establish checkpoints prior to execution, and allow modifications without terminating the program.
Results: The _request() directive introduces significant advantages. It guides the user through parameter intake in a straightforward manner, reducing the effort needed to understand complex syntax. Moreover, by incorporating checkpoints and allowing modifications during processing, it substantially diminishes the likelihood and impact of errors. However, incorporating such prompt-based elements into broader programming frameworks presents challenges because of the requirement for user input.
Conclusion: Adopting a prompt-based programming approach significantly eases the learning curve and offers a practical solution for both preventing and correcting errors efficiently. Nonetheless, the potential difficulties of integrating these prompt-based elements into larger programming projects warrant careful consideration. Programmers need to weigh the application context, the user's expertise, and the practicality of integration when choosing between programming methodologies.
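A minimal sketch of the approach, assuming a simple regression task (the prompts and macro names are hypothetical); display's _request() directive reads console input into a global macro:

    display "Dependent variable: " _request(depvar)
    display "Independent variables: " _request(indepvars)
    display "About to run: regress $depvar $indepvars. Proceed? (y/n) " _request(ok)
    if "$ok" == "y" {
        regress $depvar $indepvars            // checkpoint passed: execute
    }
    else {
        display as error "Cancelled; nothing was run."
    }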
Additional information:
US24_Dallakyan.pptx
Causal mediation analysis examines the mechanism through which a treatment influences an outcome via a mediator. The objective of this presentation is to provide a practical guide to understanding and implementing causal mediation analysis using Stata 18's new mediate command. We will begin by introducing the fundamental steps of causal analysis and then apply these steps to causal mediation analysis. Additionally, we will highlight the differences between causal and traditional mediation analysis. The presentation will also delve into various types of direct and indirect effects, illustrating their practical applications. Examples demonstrating how to perform causal mediation estimation within Stata, using different types of outcomes and mediators (continuous, binary, and count), will be provided. No prior knowledge of Stata is required, although a basic understanding of causal inference will be beneficial.
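As a hedged preview of the syntax, a minimal sketch with hypothetical variables (continuous outcome y, continuous mediator m, binary treatment t, covariates x1 and x2):

    mediate (y x1 x2) (m x1 x2) (t)   // total effect decomposed into natural direct and indirect effects
    estat proportion                  // share of the total effect transmitted through the mediator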
Additional information:
US24_Gallup.pdf
Added-variable plots show the contribution of each data point of one explanatory variable to an outcome variable, while controlling for the influence of multiple other explanatory variables. This is a multivariate generalization of a scatterplot with a trend line. It provides an intuitive visual presentation of complex estimation results to specialists and nonspecialists alike. The plots show the marginal effect of an explanatory variable on the outcome as well as how closely the data adhere to the estimate. Observers can see outliers and the statistical significance of the estimated coefficient. The more complex the estimation method, the more helpful it is to have an accessible visual representation of the results.
Currently, added-variable plots are available only for OLS regression in Stata. I recently extended the theory of added-variable plots to all commonly used linear and nonlinear estimators, including generalized least squares, instrumental variables, maximum likelihood, nonlinear least squares, and generalized method of moments estimators. I am in the process of programming added-variable plot commands for all Stata estimators. I have started with added-variable plots for panel data (xt) estimators (SJ 2020) and will shortly add them for instrumental variables and time-series estimators.
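For OLS, the existing official commands look like this (auto data for illustration):

    sysuse auto, clear
    regress mpg weight length foreign
    avplot weight    // weight versus mpg, with length and foreign partialled out of both axes
    avplots          // one added-variable plot per regressor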
Additional information:
US24_Bjärkefur1.pdf
In this presentation, we introduce two novel commands, repado and reproot, part of the Stata package repkit, designed to streamline package version control and project management across teams. The repado command facilitates precise version control of Stata packages within a project by establishing project-specific ado-path folders. This ensures that team members use identical package dependencies and enhances reproducibility by preserving access to specific command versions, which is vital when revisiting older projects. Moreover, repado proves instrumental in package development, enabling seamless testing of unpublished commands alongside stable versions in diverse project environments. Complementing repado, reproot offers efficient management of root paths across projects with minimal manual intervention. Unlike existing packages addressing the same inefficiency, such as setroot, reproot excels in handling multirooted projects, such as those involving Git collaboration and data sharing on platforms like Dropbox or network drives. Its streamlined setup ensures rapid root-path identification, even when a project's roots are in different locations, and the setup needs to be done only once per computer for all projects. This helps optimize project navigation and facilitates seamless integration across team workflows, especially in teams and organizations collaborating on many projects.
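Rather than guessing at repado's exact syntax, the underlying mechanism it manages can be sketched with official commands (the project path is hypothetical): point the ado-path at a folder inside the project itself and install dependencies there, so the exact command versions travel with the project.

    adopath ++ "C:/projects/myproject/ado"   // project-specific ado folder, first in the search path
    net set ado "C:/projects/myproject/ado"  // future installations go into the project folder
    ssc install estout, replace              // example dependency, now pinned within the project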
Additional information:
US24_Daniels.pdf
The reprun command in Stata is designed to automate reproducibility verification for sets of Stata do-files. This session presents detailed updates to the command in the context of DIME Analytics's repkit package, which spans a complete workflow for reproducibility verification. The repkit package aims to ensure that the outputs of reproducibility packages are stable and reproducible, addressing the common sources of reproducibility failures. By identifying and correcting issues, users can improve the reliability of their statistical analyses, making them suitable for sharing and publication. The reprun command performs two runs of a specified do-file, recording the state of Stata after each line's execution during the first run and then comparing it with the state after the same line's execution in the second run. Key states monitored include the random-number generator (RNG) state, the data sort order, and the data contents. If discrepancies occur between the two runs, reprun flags potential reproducibility errors, reporting mismatches in a table, which helps in identifying and resolving issues. This tool emphasizes the importance of managing randomness and maintaining consistent data states to avoid reproducibility errors, especially when inconsistent outputs are far downstream in code from their sources.
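The states reprun compares can be inspected by hand; a rough end-to-end sketch (the do-file name is hypothetical, and this is not the reprun implementation):

    do "main.do"                               // first run
    quietly datasignature
    local sig1 "`r(datasignature)'"            // fingerprint of the data in memory
    local rng1 "`c(rngstate)'"                 // random-number generator state
    do "main.do"                               // second run
    quietly datasignature
    assert "`sig1'" == "`r(datasignature)'"    // same data after both runs?
    assert "`rng1'" == "`c(rngstate)'"         // same RNG state after both runs?

reprun performs this comparison line by line rather than end to end.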
Additional information:
US24_Bjärkefur2.pdf
adodown aims to make Stata packages easier for developers to create and for users to understand. For developers, adodown offers workflow commands that automate manual tasks at each stage of development. At a project's start, adodown creates the necessary scaffolding for the package (folders, pkg-file, etc.). For each package command, it uses templates to create the necessary files (ado-file, documentation, unit tests) and adds the appropriate entries to the pkg-file. For documentation, it lets developers draft in plain Markdown while generating standard SMCL help files. And for publication, adodown collects the required files, puts them in the proper format, and prepares a zip file for SSC submission. adodown also automatically deploys a package documentation website. For users, this provides an easy way to discover packages, understand what they do, and explore how commands work—all without installing the package. For developers, it gives packages a welcome web presence, offers a home for additional documentation (e.g., how-to guides, technical notes, FAQs), and keeps the HTML documentation in sync with the SMCL documentation through continuous deployment via GitHub Actions. This talk will demonstrate how adodown works, showcase a few live examples, and seek feedback from the Stata community.
Additional information:
US24_Baum.pdf
In this case study, we examine the wage earnings of fully employed immigrants who arrived in Sweden as refugees. Using administrative employer–employee data from 1990 onward, about 100,000 refugee immigrants who arrived between 1980 and 1996 and were granted asylum are compared with a matched sample of native-born workers using coarsened exact matching. Applying recentered influence function (RIF) quantile regressions to wage earnings for the period 2011–2015, the occupational-task-based Oaxaca–Blinder decomposition shows that refugees perform better than natives at the median wage, controlling for individual and firm characteristics. The RIF-quantile approach provides better insights into these wage differentials than the standard regression model employed in earlier versions of the study.
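A hedged sketch of this estimation approach using Fernando Rios-Avila's community-contributed rif package (ssc install rif); all variable names are hypothetical, and the exact specification differs from the study's code:

    rifhdreg lwage educ exper, rif(q(50))               // RIF regression at the median
    oaxaca_rif lwage educ exper, by(refugee) rif(q(50)) // Oaxaca–Blinder decomposition of the median gap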
Additional information:
US24_Cerulli.pdf
This presentation introduces a new Stata command for carrying out optimal policy learning (OPL) with observational data, that is, data-driven optimal decision-making, in multiaction (or multiarm) settings, where a finite set of decision options is available. The presentation and related command focus on three components (estimation, risk preference, and regret) via three estimation methods: regression adjustment, inverse-probability weighting, and doubly robust estimation. After briefly presenting the statistical background of the OPL model and the syntax of the Stata command, the presentation will focus on an application to climate-related agricultural policies.
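The three estimation methods named above are the same ones official teffects provides, and teffects also accommodates multivalued treatments; a sketch with hypothetical variables (outcome y, three-arm treatment w, covariates x1 and x2):

    teffects ra   (y x1 x2) (w)          // regression adjustment
    teffects ipw  (y) (w x1 x2)          // inverse-probability weighting
    teffects aipw (y x1 x2) (w x1 x2)    // doubly robust (augmented IPW)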
Additional information:
US24_Peón.pdf
We use Stata to estimate the probability of reaching a high socioeconomic destination as a function of education, parental economic level, and other explanatory variables. Given the potential endogeneity of the education variable, we estimate a probit model with an instrumental variable in the context of a complex survey dataset. The maximum-likelihood estimation of the structural parameters is carried out using two equivalent strategies that draw on Stata's estimation and reporting options. Following Long and Freese (2014), we first estimate the model using the ivprobit command with sampling weights and cluster-robust standard errors, which allows us to obtain the Wald test of exogeneity. As a second strategy, we use the ivprobit command under survey-data estimation (the svy prefix). We perform the additional steps needed to compute the overall rate of correctly classified outcomes after estimation under survey-data analysis or sampling weights. Testing the validity of the instruments remains challenging for ivprobit models with survey data. Our analysis of the estimation results is extended and enriched with the calculation of odds ratios (testing whether they are statistically different from one) and average probabilities by region in Mexico and by educational level.
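A hedged sketch of the two strategies (variable, instrument, and design names are hypothetical):

    * Strategy 1: sampling weights with cluster-robust standard errors;
    * reports the Wald test of exogeneity.
    ivprobit highdest x1 x2 (educ = instr) [pweight = wt], vce(cluster psu)
    * Strategy 2: full survey-design estimation via the svy prefix.
    svyset psu [pweight = wt], strata(stratum)
    svy: ivprobit highdest x1 x2 (educ = instr)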
Additional information:
US24_Peng.html
This talk will demonstrate how to produce informative, robust, and complex graphs using reproducible official and community-contributed routines in Stata. We will also discuss commonly used programming tools and tips for creating more engaging graphs.
Additional information:
US24_Kolenikov.pdf
Latent class analysis (LCA) is a statistical model with categorical latent variables in which the category proportions of the measured categorical outcomes differ between classes. In official Stata, the model is fit using the gsem, lclass() command. Applied researchers often need to follow up the LCA modeling with other statistical analyses that involve the classes from the model, from simple descriptive statistics of variables not in the model to multivariate models. A simplified shortcut is to assign each observation the class with the highest predicted probability, but doing so treats the classes as fixed and perfectly observed rather than latent and estimated, leading to an underaccounting of uncertainty and downward bias in standard errors. We demonstrate how to use the existing official Stata multiple-imputation (MI) capacity to impute classes based on the LCA postestimation results and present the resulting dataset to Stata's mi procedures as valid MI data. The standard MI diagnostics that can be applied to the mi estimate results show that variances are noticeably underestimated when only the modal class is imputed. In the application that motivated this development, the variances were biased down by 25% to 40%.
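A sketch of the posterior probabilities that drive the imputation, with hypothetical binary indicators y1-y3 and two classes (the full MI workflow is the subject of the talk):

    gsem (y1 y2 y3 <- _cons), logit lclass(C 2)
    predict double pr1 pr2, classposteriorpr    // posterior class membership probabilities
    * Modal-class shortcut: treats membership as observed, biasing SEs downward.
    generate byte cmodal = 1 + (pr2 > pr1)
    * One stochastic draw instead; repeating this m times and declaring the
    * results as mi data propagates the classification uncertainty.
    generate byte cdraw = 1 + (runiform() > pr1)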
Additional information:
US24_San_Martin.pdf
The challenge of reproducing economics research has gained increased attention with the growing advocacy for open science in the field. Economics journals and research institutions are quickly adopting reproducibility guidelines, requiring authors to provide code and data for reproducing results and ensuring the trustworthiness of their findings. Presented by the Development Impact Analytics team of the World Bank, this session delves into the intricacies of achieving reproducibility in Stata. Since the launch of the World Bank's Reproducible Research Repository, the team has conducted reproducibility verifications and curated reproducibility packages for almost a hundred working papers from diverse research teams in the organization, building valuable and novel experience in addressing common issues that break reproducibility in Stata analyses. The session will present an overview of the workflows and tools the team has developed in response to reproducibility challenges identified in typical Stata projects, covering key topics such as controlling the versions of external dependencies and appropriately handling randomness in Stata code. The presentation will include practical strategies for enhancing the transparency and reliability of Stata-based research.
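Two of those topics reduce to one-liners that belong at the top of a master do-file (the seed value is arbitrary):

    version 18        // pin interpreter behavior across Stata releases
    set seed 754631   // fix the random-number generator state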
Additional information:
US24_Sirchenko.pdf
We describe a new Stata command, classify, that computes various measures of association and correlation between two categorical variables (dichotomous and polytomous, nominal and ordinal), diagnostic scores for probabilistic forecasts of such variables, and various measures of the accuracy of deterministic forecasts of them. We compiled a comprehensive catalogue of more than 210 measures of association, correlation, and forecast verification and 9 diagnostic scores for probabilistic forecasts from different fields, along with the associated terminological synonymy and bibliography. In addition to the overall measures, the command computes class-specific metrics as well as their macro and weighted averages.
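Official Stata already reports a handful of these measures, shown below for orientation; classify extends the menu far beyond this:

    sysuse auto, clear
    tabulate rep78 foreign, all   // Pearson and LR chi-squared, Cramer's V, gamma, Kendall's tau-b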
Contribute to the Stata community by sharing your feedback with StataCorp's developers. From feature improvements to bug fixes and new ways to analyze data, we want to hear how Stata can be made better for our users.
Additional information:
US24_Grant.pdf
Few methods have been proposed for flexible, nonparametric density estimation, and those that exist do not scale well to high-dimensional problems. We describe a new approach based on smoothed trees called the kudzu density (Grant 2022). This fits the little-known density estimation tree (Ram and Gray 2011) to a dataset and convolves the edges with inverse logistic functions, which are in the class of computationally minimal smooth ramps. New Stata commands provide tree fitting, kudzu tuning, estimates of joint, marginal, and cumulative densities, and pseudorandom numbers. Results will be shown for fidelity and computational cost, along with preliminary results for ensembles of kudzu under bagging and boosting. Kudzu densities are useful for Bayesian model updating where models have many unknowns and require rapid updating, datasets are large, and posteriors have no guarantee of convexity or unimodality. The input “dataset” is the posterior sample from a previous analysis. This is demonstrated with a large real-life dataset. A new command outputs code to use the kudzu prior in bayesmh evaluators, BUGS/JAGS, and Stan.
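In our own notation (not the presenter's), one smoothed tree edge can be written as follows: a density-estimation-tree step from height a to height b at cut point c, f(x) = a + (b - a) 1{x > c}, becomes, after replacing the indicator with an inverse logistic ramp with bandwidth h > 0,

    f(x) \approx a + \frac{b - a}{1 + e^{-(x - c)/h}},

which is smooth everywhere and recovers the hard step as h tends to 0.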
Additional information:
US24_Benstead.pptx
Portland State University
Department of Economics
John Luke Gallup is an associate professor at Portland State University in Oregon focusing on development and applied econometrics. John has a PhD in economics from Berkeley and has worked in a number of countries on five continents, especially Vietnam. John started using Stata in 1990, making a hobby of creating Stata commands, the best known of which is outreg. He is currently working on deriving added-variable plots for a wide range of estimators.
Oregon Health & Science University
Department of Obstetrics and Gynecology
Bharti Garg is a senior biostatistician in the Department of Obstetrics and Gynecology at Oregon Health & Science University. She has more than 10 years of statistical experience, supports multiple researchers with statistical analysis, and mentors medical students and residents. Bharti is an author or coauthor of more than 40 peer-reviewed publications and more than 100 abstracts presented at scientific meetings and conferences.
Kaiser Permanente
Center for Health Research
Nadia Redmond, MSPH, is a Research Associate II at Kaiser Permanente Evidence-based Practice Center at the Center for Health Research in Portland, Oregon. She provides statistical support and conducts systematic reviews for developing clinical guidelines, including those for the U.S. Preventive Services Task Force. She is interested in biostatistics, epidemiology, prediction modeling, quantitative synthesis, and meta-analysis.
SAG Corporation
Billy Buchanan, PhD, is a senior research scientist at SAG Corporation, where he works on a study estimating the effects of disability on the earnings capacity of U.S. Veterans and leads a team providing analytical support for efforts to automate and modernize the disability claims process for Veterans. He has developed and shared several Stata programs for data visualization, automation, psychometrics, Java integration, data management, and development. He has also presented at several Stata conferences.