
In the spotlight: Group sequential designs for clinical trials

Clinical trials are experiments on human volunteers that test the effects of medical interventions on health-related outcomes. Designing an experiment requires careful consideration of how to use the time and resources available to collect the best statistical evidence possible, but clinical trials require an extra degree of care because investigators have an ethical responsibility to safeguard participants' health. And unlike other types of experiments that collect data all at once, it is customary for data in a clinical trial to trickle in as the outcome is recorded individually for each participant or group of participants.

Group sequential designs are among the most popular methods to help address the specific requirements of a clinical trial. Rather than waiting for the end of the experiment to perform data analysis, a group sequential trial plans for interim analyses of the incomplete trial data while the study is still underway. If an interim analysis provides compelling evidence that the treatment is effective or ineffective, the trial is stopped early. Group sequential designs capitalize on the piecemeal data collection that is typical of clinical trials by performing multiple analyses of trial data while controlling the overall false-positive (type I) error rate. And by terminating the experiment early if there is a clear winner or loser, group sequential trials help investigators fulfill their moral obligation not to expose participants to inferior treatments.

Let's explore how we could use the gs commands introduced in Stata 18 to design a phase III group sequential trial testing the efficacy of a new drug.

Designing a group sequential clinical trial for a new weight-loss drug

In the last decade, drugs classed as glucagon-like peptide-1 receptor agonists (GLP-1 RAs) have become increasingly popular for the treatment of type 2 diabetes and weight loss. Suppose that we are designing a clinical trial to compare the efficacy of our new (fictitious) GLP-1 RA, piruglutide, against comparator treatment semaglutide (brand name Ozempic™ or Wegovy™). Our target population is American adults with obesity, defined as having a body mass index of 30 or higher. A previous clinical trial of semaglutide in obese adults demonstrated an average weight loss of 5.8 kg over the 30-week study, with a standard deviation of 5.5 kg. Say that based on our initial studies of piruglutide, we anticipate an average weight loss of 6.6 kg over 30 weeks, with a standard deviation of 7 kg.

Our trial will randomly assign half the participants to the experimental arm, where they will receive the test drug piruglutide, and the other half to the control arm, where they will receive the reference drug semaglutide. Doctors will follow participants for 30 weeks and measure weight loss (in kg) from baseline. We use Stata's gsdesign command to plan a clinical trial testing whether piruglutide is superior to semaglutide for weight loss in obese adults. We first specify the weight-loss means and standard deviations for each drug. We also specify that we will conduct a one-sided test at the 0.025 level (alpha(0.025)) and that we require 90% power (power(0.9)) to detect the specified difference in means. Analyses are scheduled to occur once we have 40%, 60%, 80%, and 100% of the data (information(40 60 80 100)). We specify the efficacy() and futility() options to indicate that we will calculate efficacy and nonbinding futility boundaries using the error-spending Kim–DeMets design, with parameter \(\rho_e = 3\) for the efficacy bound and \(\rho_f = 2\) for the futility bound. The graphbounds option tells Stata to graph the stopping boundaries.
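
For context, the Kim–DeMets power family spends error as a function of the information fraction \(t\): the cumulative type I error spent by fraction \(t\) is \(\alpha t^{\rho_e}\), and the cumulative type II error spent is \(\beta t^{\rho_f}\), where \(\beta = 1 - \text{power}\). Larger values of \(\rho\) spend less error at early looks, producing more conservative early boundaries.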

. gsdesign twomeans 5.8 6.6, sd1(5.5) sd2(7) knownsds onesided alpha(0.025)
>      power(0.9) information(40 60 80 100) efficacy(kdemets(3))
>      futility(kdemets(2)) graphbounds

Group sequential design for a two-sample means test
z test
H0: m2 = m1 versus Ha: m2 > m1

Efficacy: Error-spending Kim–DeMets, rho = 3.0000  
Futility: Error-spending Kim–DeMets, nonbinding, rho = 2.0000  

Study parameters:
      alpha =   0.0250  (upper one-sided)
      power =   0.9000
      delta =   0.8000
         m1 =   5.8000
         m2 =   6.6000
        sd1 =   5.5000
        sd2 =   7.0000

Expected sample size:
         H0 = 1,596.73
         Ha = 1,991.88

Info. ratio =   1.1053
    N fixed =    2,604
      N max =    2,878
     N1 max =    1,439
     N2 max =    1,439

Fixed-study crit. value = 1.9600  

Critical values, p-values, and sample sizes for a group sequential design

          Info.        Efficacy            Futility
 Look     frac.     Upper   p-value     Lower   p-value
    1      0.40    2.9478    0.0016    0.0109    0.4956
    2      0.60    2.6021    0.0046    0.7506    0.2264
    3      0.80    2.3056    0.0106    1.4101    0.0792
    4      1.00    2.0452    0.0204    2.0452    0.0204
Note: Critical values are for z statistics; otherwise, use p-value boundaries.

               Sample size
 Look        N1        N2         N
    1       576       576     1,152
    2       863       863     1,726
    3     1,151     1,151     2,302
    4     1,439     1,439     2,878
[Figure: Efficacy and futility stopping boundaries plotted against sample size, produced by the graphbounds option]

The top of the output tells us about the hypothesis test we plan to conduct, the stopping boundaries we selected, and the study parameters we specified. The expected sample size under H0 (and Ha) is the average number of participants that would be required if the null (or alternative) hypothesis were true and if the group sequential trial were repeated many times. The expected sample sizes of 1,596.73 and 1,991.88 participants under the null and alternative hypotheses, respectively, are considerably smaller than the sample of 2,604 that would be required by a traditional fixed-sample study design with equivalent power and type I error.
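
As a quick cross-check (this command is not part of the gsdesign results above), the fixed-sample requirement can be reproduced with power twomeans using the same means, standard deviations, test type, and error rates; it should report the total sample size shown as N fixed:

. power twomeans 5.8 6.6, sd1(5.5) sd2(7) knownsds onesided alpha(0.025) power(0.9)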

The first interim analysis will be conducted once 1,152 participants have completed the study. We will use a z test to determine whether the difference in means is significant at the one-sided 0.025 level, but instead of comparing the z statistic z1 with a critical value of 1.96 as we would in a fixed-sample study, we compare it with the efficacy and futility critical values of 2.95 and 0.01, respectively. If z1 ≥ 2.95, we stop the trial early for efficacy because we have demonstrated that piruglutide is a more effective weight-loss drug than semaglutide. If z1 < 0.01, we can terminate the trial for futility to “abandon a lost cause”, saving money and sparing participants continued exposure to an ineffective treatment. If 0.01 ≤ z1 < 2.95, we continue the trial until the second analysis, which is scheduled to occur once we have collected data from 1,726 participants.
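
To make the decision rule concrete, here is a minimal do-file sketch of look 1, assuming a hypothetical dataset containing the first 1,152 participants with a weight-loss variable wtloss and a string treatment variable arm (these variable names are illustrative, not part of the trial design above):

* Hypothetical look-1 analysis; variable names are illustrative
quietly summarize wtloss if arm == "piruglutide", meanonly
scalar xbar_p = r(mean)
scalar n_p    = r(N)
quietly summarize wtloss if arm == "semaglutide", meanonly
scalar xbar_s = r(mean)
scalar n_s    = r(N)

* z statistic for H0: m2 = m1 versus Ha: m2 > m1, with known SDs 7 and 5.5
scalar z1 = (xbar_p - xbar_s) / sqrt(7^2/n_p + 5.5^2/n_s)

if z1 >= 2.9478 {
    display "Look 1: stop for efficacy (reject H0)"
}
else if z1 < 0.0109 {
    display "Look 1: may stop for futility (nonbinding)"
}
else {
    display "Look 1: continue to look 2"
}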

On the graph, the first interim analysis is marked with a dashed vertical line at a sample size of 1,152, which is when we calculate z statistic z1. If z1 is above the efficacy boundary, it lies in the blue rejection region, and we can reject H0. If it is below the futility boundary, it lies in the red acceptance region, and we can accept H0. If z1 falls in the green continuation region, the trial continues to the next analysis. As the trial progresses, the continuation region shrinks, making it increasingly likely that the trial is terminated. The shapes of the efficacy and futility boundaries are controlled by parameters \(\rho_e\) and \(\rho_f\), but at the final analysis, the efficacy boundary always meets the futility boundary, and there is no continuation region; the trial must stop, and H0 is rejected or accepted. While the concept of accepting H0 is taboo in many circles, it is a long-established practice in group sequential designs.
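
To see how the spending parameters change the picture, one option is to rerun the same design with smaller values of \(\rho\), say \(\rho_e = \rho_f = 1\), which spends error in proportion to the information fraction; the early efficacy critical values become less extreme and early stopping becomes easier, typically at the cost of a larger maximum sample size. For example (a variation on the command above, not part of the design discussed here):

. gsdesign twomeans 5.8 6.6, sd1(5.5) sd2(7) knownsds onesided alpha(0.025)
>      power(0.9) information(40 60 80 100) efficacy(kdemets(1))
>      futility(kdemets(1)) graphbounds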

To learn more about group sequential designs, see the Stata Adaptive Designs: Group Sequential Trials Reference Manual.

— Alex Asher
Senior Biostatistician and Software Developer
