2024 UK Stata Conference

Proceedings

10:10–10:30 Balance and variance inflation checks for completeness-propensity weights

Abstract:
Inverse treatment-propensity weights are a standard method for adjusting for predictors of exposure to a treatment. Because a treatment-propensity score is a balancing score, it makes sense to do balance checks on the corresponding treatment-propensity weights. It is also a good idea to do variance-inflation checks to estimate how much the propensity weights might inflate the variance of an estimated treatment effect, in the pessimistic scenario in which the weights are not really necessary. In Stata, the SSC packages somersd and scsomersd can be used for balance checks, and the SSC package haif can be used for variance-inflation checks. It is argued that balance and variance-inflation checks are also necessary in the case of completeness-propensity weights, which are intended to remove imbalance in predictors of completeness between the subsample with complete data and the full sample of subjects with complete or incomplete data. However, the usage of somersd, scsomersd, and haif must be modified, because we are removing imbalance between the complete sample and the full sample instead of between the treated and untreated subsamples. An example will be presented from a clinical trial in which the author was involved and in which nearly a quarter of randomized subjects had no final outcome data. A post hoc sensitivity analysis is presented using inverse completeness-propensity weights.
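For orientation, here is a minimal sketch of the weighting step using only built-in commands; variable names are hypothetical, and the balance and variance-inflation checks themselves are left to the SSC packages named above:

    * Hypothetical sketch: inverse completeness-propensity weights
    logit complete x1 x2 x3                        // model completeness on its predictors
    predict double phat, pr                        // completeness-propensity score
    generate double icpw = 1/phat if complete==1   // weight for the complete subsample
    * somersd/scsomersd balance checks and haif variance-inflation checks
    * would follow, with the modified usage described above
    summarize icpw, detail                         // extreme weights hint at variance inflation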


Additional information:
UK24_Newson.pdf

Roger B. Newson
King's College London
10:30–10:50 Using GitHub for collaborative analysis

Abstract:
Recent trends have placed increasing importance on formal quality-control processes for analysis conducted within the pharmaceutical industry and beyond. While a key feature of Stata is reproducibility through do-files and automated reporting, it has limited built-in tools for version control, code review, and collaborative analysis.

Git is a distributed version control system widely used by software development teams for collaborative programming, change tracking, and enforcement of best practices. Git keeps a record of all changes to a codebase over time, providing the ability to easily revert to a previous state, manage temporary branches, and combine code written by multiple people. Services such as GitHub build on the Git framework, providing tools to conduct code review, host source files, and manage projects.

We present an overview of Git and GitHub and explain how we use them for Stata projects at Adelphi Real World, an organization specializing in the collection and analysis of real-world healthcare data from physicians, patients, and caregivers. We share an example project to outline the benefits of code review, both for data integrity and as a training tool. We also discuss how, by implementing a software-development-like approach to the creation of ado-files, we can enhance the process of creating new programs in Stata and gain confidence in the robustness and quality of our commands.
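As a flavor of the workflow (a hedged sketch, not Adelphi's actual process; file and branch names are hypothetical), a basic review cycle can be driven from Stata via shell escapes:

    * Hypothetical sketch: a basic Git review cycle for a Stata project
    * work on an isolated branch
    !git checkout -b feature/table-one
    !git add table_one.do
    !git commit -m "Add baseline characteristics table"
    * push the branch and open a pull request on GitHub for code review
    !git push origin feature/table-one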

Contributor:
Liane Gillespie-Akar
Adelphi Real World

Additional information:
UK24_Middleton-Dalby.pptx

Chloe Middleton-Dalby
Adelphi Real World
10:50–11:10 My favorite overlooked lifesavers in Stata

Abstract:
Everyone loves a good testing, estimation, or graphical community-contributed package. However, a successful empirical project relies on many small and overlooked but priceless programs. I will present three of my personal lifesavers.

1. adotools: adotools has four main uses. It allows the user to create and maintain a library of ado-paths. Paths can be dynamically added to and removed from a running Stata session. When an ado-path is removed, all ado-programs located in that folder are cleared from memory. adotools can also reset all user-specified ado-paths. (A sketch of the built-in machinery it automates follows this list.)

2. psimulate2: Ever wanted to run Monte Carlo simulations in parallel? You can with psimulate2, and there are (almost) no setup costs at all. psimulate2 splits the number of repetitions into equal chunks, spreads them over multiple instances of Stata, and reduces the time to run Monte Carlo simulations. It also allows macros to be returned and can save and append simulation results directly into a .dta file or frame. It can be run on Windows, Unix, and Mac.

3. xtgetpca: Extracting principal components from panel data is common. However, no built-in Stata solution exists. xtgetpca fills this gap. It allows for different types of standardization, removal of fixed effects, and removal of unbalanced panels.
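As promised above, a minimal sketch of the built-in ado-path machinery that adotools wraps and extends (the folder path is hypothetical):

    * Built-in building blocks that adotools automates:
    adopath + "C:/projects/mylib/ado"    // add a folder of ado-files to the search path
    adopath - "C:/projects/mylib/ado"    // remove it again
    program drop _all                    // clear loaded programs from memory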


Additional information:
UK24_Ditzen.pdf

Jan Ditzen
Free University of Bozen-Bolzano
11:10–12:00 Professional statistical development: What, why, and how

Abstract:
In this presentation, I will talk about professional statistical software development in Stata and the challenges of producing and supporting a statistical software package. I will share some of my experience on how to produce high-quality software, including verification, certification, and reproducibility of the results, and on how to write efficient and stable Stata code. I will also discuss some of the aspects of commercial software development such as clear and comprehensive documentation, consistent specifications, concise and transparent output, extensive error checks, and more.


Additional information:
UK24_Marchenko.pdf

Yulia Marchenko
StataCorp LLC
1:00–1:30 Stata: A short history viewed through epidemiology

Abstract:
In this talk, I will use personal recollections to revisit the challenges many public health researchers have faced since the birth of Stata in 1985. I will discuss how, from the 1990s onward, the increasing demands for data management and analysis were met by Stata developers and the broader Stata community, particularly Michael Hills. Additionally, I will review how Stata's expansion in scope and capacity with each new version has enhanced our ability to train new generations of medical statisticians and epidemiologists. Finally, I will reflect on current and future challenges.


Additional information:
UK24_de_Stavola.pdf

Bianca de Stavola
University College London
1:30–1:50 compmed: A new command for estimating causal mediation effects with nonadherence to treatment allocation

Abstract:
In clinical trials, a standard intention-to-treat analysis will unbiasedly estimate the causal effect of treatment offer, though this ignores the impact of participant nonadherence. To account for this, one can estimate a complier-average causal effect (CACE), the average causal effect of treatment receipt in the principal stratum of participants who would comply with their randomized allocation. Evaluating how interventions lead to changes in the outcome (the mechanism) is also key for the development of more effective interventions. A mediation analysis aims to decompose a total treatment effect into an indirect effect, which operates via changing the mediator, and a direct effect. To identify mediation effects under nonadherence, it has been shown that the CACE can be decomposed into a direct effect, the complier-average natural direct effect (CANDE), and a mediated effect, the complier-average causal mediated effect (CACME). These can be estimated with linear structural equation models (SEMs) with instrumental variables.

However, obtaining estimates of the CACME and CANDE in Stata requires (1) correctly fitting the SEM in Stata and (2) correctly identifying the pathways that correspond to the CACME and CANDE. To address these challenges, we introduce a new command, compmed, that performs the relevant SEM fitting for estimating the CACME and CANDE through a single, more intuitive and user-friendly interface. compmed requires the user to specify only the continuous outcome, continuous mediator, treatment receipt, and randomization variables. Estimates, standard errors, and 95% confidence intervals are reported for all effects.
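A hedged sketch of the intended call, with the argument roles inferred from the description above (the exact syntax and argument order are assumptions; variable names are hypothetical):

    * Hypothetical invocation: continuous outcome y, continuous mediator m,
    * treatment receipt d, randomization arm z
    compmed y m d z
    * reported: CACE decomposed into CACME (mediated) and CANDE (direct),
    * each with standard errors and 95% confidence intervals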

Contributors:
Sabine Landau, Richard Emsley
King's College London

Additional information:
UK24_Ster.pptx

Anca Chis Ster
King's College London
1:50–2:30 Causal mediation

Abstract:
Causal inference aims to identify and quantify a causal effect. With traditional causal inference methods, we can estimate the overall effect of a treatment on an outcome. When we want to better understand a causal effect, we can use causal mediation analysis to decompose the effect into a direct effect of the treatment on the outcome and an indirect effect through another variable, the mediator. Causal mediation analysis can be performed in many situations—the outcome and mediator variables may be continuous, binary, or count, and the treatment variable may be binary, multivalued, or continuous.

In this talk, I will introduce the framework for causal mediation analysis and demonstrate how to perform this analysis with the mediate command, which was introduced in Stata 18. Examples will include various combinations of outcome, mediator, and treatment types.
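For example, a minimal call with mediate (variable names hypothetical; linear models are the defaults for continuous outcomes and mediators):

    * Continuous outcome y, continuous mediator m, binary treatment t,
    * with x1 as a covariate in both the outcome and mediator models
    mediate (y x1) (m x1) (t)    // total, direct, and indirect effects
    estat proportion             // share of the total effect that is mediated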


Additional information:
UK24_MacDonald.pdf

Kristin MacDonald
StataCorp LLC
2:30–2:50 Imputation when data cannot be pooled

Abstract:
Distributed data networks are increasingly used to study human health across different populations and countries. Because legal and logistical barriers often preclude the transfer of individual-level data between study sites, analyses are commonly performed separately at each site. Despite many benefits, however, a frequent challenge in such networks is the absence of key variables of interest at one or more study sites. Current imputation methods require individual-level data from the involved studies to impute missing values. This creates a need for methods that can impute data in one study using only information that can be easily and freely shared within a data network. To address this need, we introduce a new Stata command, mi impute from, designed to impute missing variables in a single study using a linear predictor and the related variance–covariance matrix from an imputation model fit to one or more external studies. In this presentation, the syntax of mi impute from will be presented along with motivating examples from health-related research.
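The underlying idea, shown as a toy sketch with built-in commands only (all names and numbers are hypothetical; see the command's help file for its actual syntax):

    * An external study shares only the imputation model y = b0 + b1*x + e,
    * its standard errors, and the residual sd; no individual data change hands.
    scalar b1 = 0.5 + rnormal()*0.10    // external slope, se 0.10 (toy values)
    scalar b0 = 1.2 + rnormal()*0.20    // external constant, se 0.20 (toy values)
    scalar rmse = 0.8                   // external residual sd (toy value)
    * impute missing values from the drawn linear predictor plus noise;
    * a proper implementation redraws per imputation and respects the
    * coefficients' covariance, which this toy ignores
    replace y = b0 + b1*x + rnormal()*rmse if missing(y)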

Contributors:
Robert Thiesmeier, Matteo Bottai
Karolinska Institutet

Additional information:
UK24_Orsini.pdf

Nicola Orsini
Karolinska Institutet
3:00–3:50 Thirty graphical tips Stata users should know, revisited

Abstract:
In 2010, I gave a talk at the London conference presenting 30 graphical tips. The display materials remain accessible on Stata's website but are awkward to view because they are based on a series of .smcl files. I will recycle the title and some of the tips and add new ones, bearing on what you or your students or your research team should know when coding graphics for mainstream tasks. The theme of "thirty" matches this 30th London conference and, to a good enough approximation, my 33 years as a Stata user. The talk mixes examples from official and community-contributed commands and details both large and small.


Additional information:
UK24_Cox.zip

Nicholas J. Cox
Durham University
3:50–4:10 Fancy graphics: Small multiples carpentry

Abstract:
Using “small multiples” in data visualization and statistical graphics consists of combining repeated small diagrams to display variations in data patterns or associations across a series of units. Sometimes the small multiples are mere replications of identical plots but with different plot elements highlighted. Small displays are typically arranged on a grid, and the overall appearance is, as Tufte puts it, akin to the sequence of frames of a movie when the ordering follows a time dimension. Creating diagrams for use in gridded “small multiples” is easy with Stata's graph-combination commands. However, the grid pattern can be limiting. This talk will present tips and tricks for building small-multiple diagrams and illustrate some coding strategies for arranging individual frames in the most flexible way, opening up creative possibilities for data visualization.
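As a baseline, a gridded small multiple needs nothing beyond built-in graphics; the carpentry the talk goes beyond starts from code like this:

    * Gridded small multiples with graph combine (auto dataset shipped with Stata)
    sysuse auto, clear
    local graphs
    forvalues r = 3/5 {
        twoway scatter mpg weight if rep78==`r', nodraw ///
            name(g`r', replace) title("rep78 = `r'")
        local graphs `graphs' g`r'
    }
    graph combine `graphs', rows(1) ycommon xcommon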


Additional information:
UK24_Van_Kerm.pdf

Philippe Van Kerm
University of Luxembourg
4:10–4:30 Scalable high-dimensional nonparametric density estimation with Bayesian applications

Abstract:
Few methods have been proposed for flexible, nonparametric density estimation, and those that exist do not scale well to high-dimensional problems. I describe a new approach based on smoothed trees, called the kudzu density (Grant 2022). This fits the little-known density estimation tree (Ram and Gray 2011) to a dataset and convolves the edges with inverse logistic functions, which are in the class of computationally minimal smooth ramps. New Stata commands provide tree fitting, kudzu tuning, estimates of joint, marginal, and cumulative densities, and pseudorandom numbers.

Results will be shown for fidelity and computational cost. Preliminary results will also be shown for ensembles of kudzu under bagging and boosting. Kudzu densities are useful for Bayesian model updating where models have many unknowns and require rapid updating, datasets are large, and posteriors have no guarantee of convexity or unimodality. The input “dataset” is the posterior sample from a previous analysis. This is demonstrated with a real-life large dataset. A new command outputs code to use the kudzu prior in bayesmh evaluators, BUGS, and Stan.


Additional information:
UK24_Grant.pdf

Robert Grant
BayesCamp Ltd
9:00–9:20 Robust testing for serial correlation in linear panel-data models

Abstract:
Serial correlation tests are essential parts of standard model-specification toolkits. For static panel models with strictly exogenous regressors, a variety of tests are readily available. However, their underlying assumptions can be very restrictive. For models with predetermined or endogenous regressors, including dynamic panel models, the Arellano–Bond (1991, Review of Economic Studies) test is predominantly used, but it has low power against certain alternatives. While more powerful alternatives exist, they are underused in empirical practice. The recently developed Jochmans (2020, Journal of Applied Econometrics) portmanteau test yields substantial power gains when the time horizon is very short, but it can quickly lose its advantage even for time dimensions that are still widely considered small.

I propose a new test based on a combination of short and longer differences, which overcomes this shortcoming and can be shown to have superior power against a wide range of stationary and nonstationary alternatives. It does not lose power as the process under the alternative approaches a random walk—unlike the Arellano–Bond test—and it is robust to large variances of the unit-specific error component—unlike the Jochmans portmanteau test. I present a new Stata command that flexibly implements these (and more) tests for serial correlation in linear error-component panel-data models. The command can be run as a postestimation command after a variety of estimators, including generalized method of moments, maximum likelihood, and bias-corrected estimation.
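For context, the test predominantly used today can be reproduced with built-in commands (a minimal sketch; variable names hypothetical):

    * Arellano–Bond serial-correlation test after a dynamic panel estimator
    xtset id year
    xtabond y x, lags(1)
    estat abond    // tests for AR(1) and AR(2) in the first-differenced errors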


Additional information:
UK24_Kripfganz.pdf

Sebastian Kripfganz
University of Exeter Business School
9:20–9:40 Estimating the wage premia of refugee immigrants

Abstract:
In this case study, I examine the wage earnings of fully employed refugee immigrants in Sweden. Using administrative employer–employee data from 1990 onward, about 100,000 refugee immigrants who arrived between 1980 and 1996 and were granted asylum are compared with a matched sample of native-born workers using coarsened exact matching. Applying recentered influence function (RIF) quantile regressions to wage earnings for the period 2011–2015, an occupational-task-based Oaxaca–Blinder decomposition shows that refugees perform better than natives at the median wage, controlling for individual and firm characteristics. The RIF quantile approach provides better insights for the analysis of these wage differentials than the standard regression model employed in earlier versions of the study.


Additional information:
UK24_Baum.pdf

Kit Baum
Boston College
9:40–10:00 The Oaxaca–Blinder decomposition in Stata: An update

Abstract:
In 2008, I published the Stata command oaxaca, which implements the popular Oaxaca–Blinder (OB) decomposition technique. This technique is used to analyze differences in outcomes between groups, such as the wage gap by gender or race. Over the years, both the functionality of Stata and the literature on decomposition methods have evolved, so an update of the oaxaca command is now long overdue. I will present a revised version of oaxaca that uses modern Stata features, such as factor-variable notation, and supports additional decomposition variants that have been proposed in the literature (for example, reweighted decompositions or decompositions based on recentered influence functions).
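For reference, a minimal call with the current oaxaca from SSC (variable names hypothetical); the revised version is said to accept factor-variable notation such as i.union directly:

    * Twofold Oaxaca–Blinder decomposition of a gender wage gap
    ssc install oaxaca
    oaxaca lnwage educ exper tenure, by(female) pooled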


Additional information:
UK24_Jann.pdf
UK24_Jann-geoplot.pdf

Ben Jann
University of Bern
11:00–11:20 Visualizations to evaluate and communicate adverse event data in randomized controlled trials

Abstract:
Introduction: Well-designed visualizations are powerful ways to communicate information to a range of audiences. In randomized controlled trials (RCTs), where there is an abundance of complex data on harms (known as adverse events), visualizations can be a highly effective means of summarizing harm profiles and identifying potential adverse reactions. Trial reporting guidelines such as the CONSORT extension for harms encourage the use of visualizations for exploring harm outcomes, but research has demonstrated that their uptake is extremely low.

Methods: To improve the communication of adverse event data collected in RCTs, we developed recommendations to help trialists decide which visualizations to use to present these data. We developed Stata commands (aedot and aevolcano) to produce two of these visualizations, the dot and volcano plots, with the aim of easing implementation and promoting increased uptake.

Results: In this talk, using clinical examples, I will introduce and demonstrate the application of these commands. I will contrast the visual summaries produced by the volcano and dot plots with traditional nongraphical presentations of adverse event data, drawing on examples from the published literature, with the aim of demonstrating the benefits of graphical displays.

Discussion: Visualizations offer an efficient means of summarizing large amounts of adverse event data from RCTs, and statistical software eases the implementation of such displays. We hope that the development of bespoke Stata commands to create visual summaries of adverse events will increase the uptake of visualizations in this area by applied clinical trial statisticians.


Additional information:
UK24_Phillips.pptx

Rachel Phillips
Imperial College London
11:20–11:40 Optimizing adverse event analysis in clinical trials when dichotomising continuous harm outcomes

Abstract:
Introduction: The assessment of harm in randomized controlled trials is vital to enable a risk–benefit assessment of the intervention under evaluation. Many trials undertake regular monitoring of continuous outcomes such as laboratory measurements, for example, blood tests. Typical practice in a trial analysis is to dichotomize this type of data into abnormal/normal categories based on reference values. Frequently, the proportions of participants with abnormal results are then compared between treatment arms using a chi-squared or Fisher's exact test, reporting a p-value. Because dichotomization results in a substantial loss of the information contained in the outcome distribution, it increases the chance of missing an opportunity to detect signals of harm.

Methods: A solution to this problem is to use the outcome distribution in each arm to estimate the between-arm difference in the proportions of participants with an abnormal result. This approach, developed by Sauzet et al. (2016), protects against the loss of information and retains statistical power.
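The gist of the distributional estimator, as a toy sketch with built-in commands (a normal outcome is assumed; names and the cutoff are hypothetical, and distdicho handles this properly):

    * Estimate the difference in proportions above a reference cutoff from the
    * fitted outcome distributions rather than from dichotomized counts
    scalar cutoff = 2
    quietly summarize y if arm==0
    scalar p0 = 1 - normal((cutoff - r(mean))/r(sd))
    quietly summarize y if arm==1
    scalar p1 = 1 - normal((cutoff - r(mean))/r(sd))
    display "distributional difference in proportions: " %6.3f p1 - p0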

Results: In this talk, I will introduce the distributional approach and the associated community-contributed Stata command distdicho. I will compare the original analysis of blood test results from a small-population drug trial in pediatric eczema with the results from the distributional approach and discuss trial inference based on each.

Contributor:
Odile Sauzet
Imperial College London

Additional information:
UK24_Cornelius.pptx

Victoria Cornelius
Imperial College London
11:40–12:00 Implementing treatment-selection rules for multiarm multistage trials using nstage

Abstract:
Multiarm multistage (MAMS) randomized trial designs offer an efficient and practical framework for addressing multiple research questions. Typically, standard MAMS designs employ prespecified interim stopping boundaries based on lack of benefit and overwhelming efficacy. To facilitate implementation, we have developed the nstage suite of commands, which calculates the required sample sizes and trial timelines for a MAMS design.

In this talk, we introduce the MAMS selection design, which integrates an additional treatment-selection rule to restrict the number of research arms progressing to subsequent stages in the event that all demonstrate a promising treatment effect at interim analyses. The MAMS selection design streamlines the trial process by merging traditionally early-phase treatment selection with the late-phase confirmatory trial. As a result, it gains efficiency over the standard MAMS design by reducing overall trial timelines and required sample sizes. We present an update to the nstagebin Stata command that incorporates this additional layer of adaptivity and calculates required sample sizes, trial timelines, and overall familywise type I error rate and power for MAMS selection designs.

Finally, we illustrate how a MAMS selection design can be implemented using the nstage suite of commands and outline its advantages using the ongoing trials in surgery (ROSSINI-2) and maternal health (WHO RED).

Contributors:
Alexandra Blenkinsop, Mahesh KB Parmar
MRC Clinical Trials Unit at UCL

Additional information:
UK24_Choodari-Oskooei.pptx

Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL
12:00–12:20 Poster lightning session

nmf: Implementation of nonnegative matrix factorization (NMF) in Stata

Additional information:
UK24_Batty1.pptx

Jonathan Batty
University of Leeds

Difference in differences using constraints in Stata

Additional information:
UK24_Birch.pdf

Colin Birch
Animal and Plant Health Agency (APHA)
1:30–1:50 Advanced Bayesian survival analysis with merlin and morgana

Abstract:
In this talk, I will describe our latest work to bring advanced Bayesian survival analysis tools to Stata. Previously, we introduced the morgana prefix command (bayesmh in disguise), which provides a Bayesian wrapper for survival models fit with stmerlin (which is merlin’s more user-friendly wrapper designed for working with st data). We have now begun the work to sync morgana with the much more general merlin command to allow for Bayesian multiple-outcome models. Within survival analysis, multiple outcomes arise when we consider competing-risks or the more general setting of multistate processes. Using an example in breast cancer, I will show how to estimate competing-risks and illness-death multistate models within a Bayesian framework, incorporating prior information for covariate effects and baseline hazard parameters. Importantly, we have also developed the predict functionality to obtain a wide range of easily interpretable predictions, such as cumulative incidence functions and (restricted) life expectancy, along with their credible intervals.
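A hedged sketch of the single-outcome workflow this builds on (variable names hypothetical; the exact morgana syntax is an assumption based on the description above):

    * Bayesian Weibull survival model: morgana wraps an stmerlin fit in bayesmh
    stset time, failure(died)
    morgana: stmerlin trt age, distribution(weibull)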


Additional information:
UK24_Crowther.pdf

Michael Crowther
Red Door Analytics
1:50–2:10 codefinder: Optimizing Stata for the analysis of large, routinely collected healthcare data

Abstract:
Routinely collected healthcare data (including electronic healthcare records and administrative data) are increasingly available at the whole-population scale and may span decades of data collection. These data may be analyzed as part of clinical, pharmacoepidemiologic, and health services research, producing insights that improve future clinical care. However, the analysis of healthcare data on this scale presents a number of unique challenges. These include the storage of diagnosis, medication, and procedure codes using a number of discordant systems (including ICD-9 and ICD-10, SNOMED CT, Read codes, etc.) and the inherently relational nature of the data (each patient has multiple clinical contacts, during which multiple codes may be recorded). Preprocessing and analyzing these data using optimized methods has a number of benefits, including minimization of computational requirements, analytic time, carbon footprint, and cost.

We will focus on one of the main issues faced by the healthcare data analyst: how to most efficiently collapse multiple disparate diagnosis codes (stored as strings across a number of variables) into a discrete disease entity using a predefined code list. A number of approaches (including the use of Boolean logic, the inlist() function, string functions, and regular expressions) will be sequentially benchmarked in a large, real-world healthcare dataset (n = 192 million hospitalization episodes during a 12-year period; approximately 1 terabyte of data). The time and space complexity of each approach, in addition to its carbon footprint, will be reported. The most efficient strategy has been implemented in our newly developed Stata command, codefinder, which will be discussed.
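One of the benchmarked approaches, as a toy sketch (ICD-10 codes I21–I23, covering acute myocardial infarction, serve as the illustrative code list; the number and names of the diagnosis variables are hypothetical):

    * Flag an episode if any diagnosis field starts with a listed ICD-10 stem
    generate byte mi_flag = 0
    forvalues i = 1/20 {
        replace mi_flag = 1 if inlist(substr(diag`i', 1, 3), "I21", "I22", "I23")
    }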

Contributor:
Marlous Hall
University of Leeds

Additional information:
UK24_Batty2.pptx

Jonathan Batty
University of Leeds
2:10–2:30 Data-driven decision making using Stata

Abstract:
This presentation focuses on implementing a model in Stata for making optimal decisions in settings with multiple actions or options, commonly known as multiaction (or multiarm) settings. In these scenarios, a finite set of decision options is available. In the initial part of the presentation, I provide a concise overview of the primary approaches for estimating the reward or value function as well as the optimal policy within the multiarm framework. I outline the identification assumptions and statistical properties associated with optimal policy learning estimators.

Moving on to the second part, I explore the analysis of decision risk. This examination reveals that the optimal choice can be influenced by the decision maker's risk attitude, specifically regarding the tradeoff between the reward conditional mean and conditional variance.

The third part presents a Stata implementation of the model, accompanied by an application to real data.


Additional information:
UK24_Cerulli.pdf

Giovanni Cerulli
CNR-IRCRES
2:30–2:50 Pattern matching in Stata: Chasing the devil in the details

Abstract:
Most quantitative analyses now rely on computer calculations. A computation script strengthens the reproducibility of such studies but requires care from researchers to avoid various coding mistakes. This presentation introduces a command implementing checks that are foreign to a dynamically typed language such as Stata in the context of data analysis. The command uses a new syntax, similar to switch or match expressions, to create a variable based on other variables, in place of chains of "replace" statements with "if" conditions. More than the syntax, the real interest of this command lies in the two properties it checks. The first is exhaustiveness: do the stated conditions cover all possible cases? The second is usefulness: are all the conditions useful, or is there redundancy between branches? I borrow the idea of pattern matching from the Rust programming language and from the earlier implementation, in the OCaml programming language, of the algorithm detailed in Maranget (2017). The command and source code are available on GitHub.
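For contrast, the status quo that such a command replaces, as a toy example; nothing here checks exhaustiveness or redundancy, which is exactly the gap the new syntax fills (variable names hypothetical):

    * Chains of replace ... if: silently non-exhaustive and easy to get wrong
    generate byte agegrp = .
    replace agegrp = 1 if age < 18
    replace agegrp = 2 if age >= 18 & age < 65
    replace agegrp = 3 if age >= 65    // also catches age==., a classic trap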


Additional information:
UK24_Astruc-Le_Souder.pdf

Mael Astruc-Le Souder
University of Bordeaux
3:10–4:10 Relationships among recent difference-in-differences estimators and how to compute them in Stata

Abstract:
I will provide an overview of the similarities and differences among popular estimators in the context of staggered interventions with panel data, illustrating how to compute and interpret the estimates using built-in and community-contributed Stata commands.
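For instance (a hedged sketch; variable names hypothetical), a built-in Stata 18 estimator for staggered adoption alongside a popular community-contributed alternative:

    * Built-in: heterogeneous treatment-effects DiD with staggered adoption
    xtset id year
    xthdidregress aipw (y) (treat), group(id)
    * Community-contributed: Callaway–Sant'Anna estimator via csdid (SSC)
    csdid y, ivar(id) time(year) gvar(first_treat)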


Additional information:
UK24_Wooldridge.pdf

Jeffrey Wooldridge
Michigan State University
4:10–5:00 Open panel discussion with Stata developers
Contribute to the Stata community by sharing your feedback with StataCorp's developers. From feature improvements to bug fixes and new ways to analyze data, we want to hear how Stata can be made better for our users.

Scientific committee

Tim Morris
MRC Clinical Trials Unit at UCL
David Vincent
David Vincent Econometrics
Rachael Hughes
University of Bristol

Logistics organizer

The logistics organizer for the 2024 UK Stata Conference is Timberlake Consultants, the Stata distributor to the United Kingdom and Ireland, France, Spain, Portugal, the Middle East and North Africa, Brazil, and Poland.

View the proceedings of previous Stata Conferences and international meetings.