2025 Stata Biostatistics and Epidemiology Virtual Symposium

20 February 2025

What is the Virtual Symposium?

The 2025 Stata Biostatistics and Epidemiology Virtual Symposium is a meeting of researchers in biostatistics and epidemiology from around the world discussing current theory and applied methods using Stata. The program consists of invited talks by top Stata users, and the virtual platform allows you to experience this one-day event from wherever you are.

Sigrid Leithe, Cancer Registry of Norway

Robert Thiesmeier, Karolinska Institutet

Bianca De Stavola, UCL Great Ormond Street Institute of Child Health

Joie Ensor, University of Birmingham

Giselle Kolenic, University of Michigan

Alyssa Bilinski, Brown University

Agenda

All times Central Standard Time

9:00 a.m.

Balancing the privacy-utility tradeoff for synthetic time-to-event data

Sigrid Leithe, Cancer Registry of Norway

Additional information:

Bio25_Leithe.pdf

Generation of synthetic patient records can preserve the structure and statistical properties of the original data without violating privacy, providing access to high-quality data for research and innovation. Few synthetization methods account for the censoring mechanism in time-to-event data, and formal privacy risk evaluations are often lacking. Improvements in synthetic data utility come with increased risks of privacy disclosure, necessitating a careful evaluation to obtain the proper balance. In this talk, I will demonstrate a method for generating synthetic time-to-event data based on regression models and a flexible parametric survival model in Stata. I show how to evaluate the synthetic data utility and present a method for estimating the privacy loss from publishing a synthetic dataset.

9:45 a.m.

Multiple imputation for recovering missing values when data cannot be shared

Robert Thiesmeier, Karolinska Institutet

Additional information:

Bio25_Thiesmeier.pdf

Multisite studies are increasingly used to study human health across different populations and countries. However, a common challenge in using data from multiple studies is the presence of systematically missing values – when some studies have not recorded information on certain variables. Although it is possible to use data from sites with recorded observations to impute the missing values, this process becomes challenging when data pooling is not feasible because of logistical or legal constraints. We address this by introducing a framework for multiple imputation across study sites without the need to share individual-level data. In this talk, we present some motivating examples alongside a new command, mi impute from, that can handle the imputation of binary, discrete, and continuous variables. Given the increasing importance of multisite studies in medical and epidemiological research, mi impute from can offer a practical approach for imputing variables that have not been recorded in some study sites.

10:15 a.m.

Break

10:30 a.m.

Stata: A short history viewed through epidemiology

Bianca De Stavola, UCL Great Ormond Street Institute of Child Health

Additional information:

Bio25_De_Stavola.pdf

In this talk, I will use personal recollections to revisit the challenges many public health researchers have faced since the birth of Stata in 1985. I will discuss how, from the 1990s onward, the increasing demands for data management and analysis were met by Stata developers and the broader Stata community, particularly by Michael Hills. Additionally, I will review how Stata's expansion in scope and capacity with each new version has enhanced our ability to train new generations of medical statisticians and epidemiologists. Finally, I will reflect on current and future challenges.

11:00 a.m.

Harnessing uncertainty in clinical prediction models using Stata

Joie Ensor, University of Birmingham

Additional information:

Bio25_Ensor.pdf

Development of new clinical prediction models is in vogue, with many showing off their ill-fitting wares on journal runways. The vast majority of these models ultimately aim to inform care for the individual, based on the probability of their outcome as calculated by the prediction model. Therefore, we should all be concerned about the reliability of such models.

Unfortunately, most models are ill-fitting and developed using small samples, exacerbating overfitting and leading to large uncertainty in model predictions for an individual. This issue makes internal validation nonnegotiable in the development of any new model, and its reporting is mandated by the recent TRIPOD+AI guidelines. At the development stage, we know that our model and any estimates of performance are optimistic – our model is fit to our data and so should perform well. Therefore, we commonly assess the internal validity of our model using bootstrapping, allowing us to quantify the optimism in our development process and uncertainty in the model predictions, giving a better feel for how accurate and reliable our model is.

In this talk, I will discuss the concept of model uncertainty and demonstrate how our new Stata packages allow developers to estimate uncertainty in their model and harness this information to inform the next steps in the pipeline of their model.
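
The bootstrap internal validation mentioned in this abstract can be illustrated with a minimal sketch. It is not the speaker's Stata packages, and everything in it is an assumption made for illustration: the data are simulated, the "prediction model" is a one-predictor linear regression, and performance is measured by R-squared rather than by the measures typically reported for clinical prediction models. The function names (fit_line, r_squared, optimism_corrected) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(model, xs, ys):
    """Proportion of variance in ys explained by the fitted line."""
    intercept, slope = model
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def optimism_corrected(xs, ys, n_boot=200, seed=1):
    """Return (apparent, optimism, corrected) R-squared via the bootstrap."""
    rng = random.Random(seed)
    n = len(xs)
    # Apparent performance: model fit to, and evaluated on, the same data.
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism_draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx, by = [xs[k] for k in idx], [ys[k] for k in idx]
        boot_model = fit_line(bx, by)
        # Optimism: performance in the bootstrap sample the model was fit to,
        # minus performance of that same model in the original data.
        optimism_draws.append(
            r_squared(boot_model, bx, by) - r_squared(boot_model, xs, ys)
        )
    optimism = sum(optimism_draws) / n_boot
    return apparent, optimism, apparent - optimism


# Simulated toy data (hypothetical): small sample, weak signal.
data_rng = random.Random(2025)
xs = [data_rng.gauss(0, 1) for _ in range(30)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

apparent, optimism, corrected = optimism_corrected(xs, ys)
print(f"apparent R^2 = {apparent:.3f}, optimism = {optimism:.3f}, "
      f"corrected = {corrected:.3f}")
```

The spread of the individual bootstrap differences, not only their mean, gives a rough sense of how unstable the development process is in a small sample.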

11:45 a.m.

Lunch

1:00 p.m.

Increase efficiency and reproducibility in clinical trial reporting with Stata tools

Giselle Kolenic, University of Michigan

Additional information:

Bio25_Kolenic.zip

The Statistical Analysis of Biomedical and Education Research Group (SABER) unit of the Department of Biostatistics is an academic data coordinating center (DCC) that provides expertise in the design, conduct, and analysis of multicenter clinical trials. These trials often require reporting to Data and Safety Monitoring Boards (DSMBs), usually every six months over the course of multiple years. DSMB members are provided reports that contain tables, listings, and figures (TLFs) that summarize cumulative data for evidence of study-related adverse events, adherence to the protocol, site performance, compliance with recruitment and retention goals, and data quality, timeliness, and completeness. Stata tools can be used for consistent generation of TLFs and DSMB reports over the life of trials, increasing efficiency and reproducibility. This presentation provides an overview and illustration of some of these tools, including the putdocx Stata command.

2:00 p.m.

Difference in differences with infectious disease outcomes

Alyssa Bilinski, Brown University

Additional information:

Bio25_Bilinski.pdf

Researchers frequently employ difference in differences (DiD) to study the impact of public health interventions on infectious disease outcomes. DiD assumes that treatment and nonexperimental comparison groups would have moved in parallel in expectation, absent the intervention (“parallel-trends assumption”). However, the plausibility of the parallel-trends assumption in the context of infectious disease transmission is not well understood. Our work bridges this gap by formalizing epidemiological assumptions required for common DiD specifications, positing an underlying susceptible-infectious-recovered (SIR) data-generating process. We demonstrate that popular specifications can encode strict epidemiological assumptions. For example, DiD modeling incident case numbers or rates as outcomes will produce biased treatment-effect estimates unless untreated potential outcomes for treatment and comparison groups come from a data-generating process with the same initial infection and equal transmission rates at each time step. Applying a log transformation or modeling log growth allows for different initial infection rates under an “infinite susceptible population” assumption but invokes conditions on transmission parameters. We then propose alternative DiD specifications based on epidemiological parameters, the effective reproduction number and the effective contact rate, that are both more robust to differences between treatment and comparison groups and can be extended to complex transmission dynamics. With minimal power difference between incidence and log-incidence models, we recommend a default of the more robust log specification. Our alternative specifications have lower power than incidence or log-incidence models but have higher power than log-growth models. We illustrate implications of our work by reanalyzing published studies of COVID-19 mask policies.
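
The contrast between incidence and log-incidence specifications described in this abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' analysis: it uses a simple discrete-time SIR model, hypothetical parameter values, and a "placebo" comparison in which neither group receives an intervention, so an unbiased DiD estimate should be zero. The function names sir_incidence and placebo_did are invented for this example.

```python
import math


def sir_incidence(i0, beta, gamma, steps, infinite_susceptible=True):
    """New infections per step from a discrete-time SIR model (population = 1)."""
    s, i = 1.0 - i0, i0
    incidence = []
    for _ in range(steps):
        # Under the "infinite susceptible population" assumption, depletion of
        # susceptibles is ignored, so each infectious person causes beta new cases.
        frac_susceptible = 1.0 if infinite_susceptible else s
        new_cases = beta * i * frac_susceptible
        s -= new_cases
        i += new_cases - gamma * i
        incidence.append(new_cases)
    return incidence


def placebo_did(treated, comparison, split):
    """Change in the mean treated-minus-comparison gap from pre to post period."""
    gap = [t - c for t, c in zip(treated, comparison)]
    pre, post = gap[:split], gap[split:]
    return sum(post) / len(post) - sum(pre) / len(pre)


# Hypothetical parameters: equal transmission (beta) and recovery (gamma) rates,
# different initial infection levels, and no intervention in either group.
treated = sir_incidence(i0=0.004, beta=0.3, gamma=0.1, steps=10)
comparison = sir_incidence(i0=0.001, beta=0.3, gamma=0.1, steps=10)

did_levels = placebo_did(treated, comparison, split=5)
did_logs = placebo_did(
    [math.log(x) for x in treated],
    [math.log(x) for x in comparison],
    split=5,
)

print(f"placebo DiD on incidence:     {did_levels:.6f}")
print(f"placebo DiD on log incidence: {did_logs:.6f}")
```

In this setup the gap in raw incidence widens over time even without an intervention, while the gap in log incidence stays constant. Calling sir_incidence with infinite_susceptible=False lets susceptible depletion differ between the groups, and the log gap then drifts as well.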

2:45 p.m.

Adjourn

Registration is now closed.
