2 Basic calibration

We consider a phase II setting with null response probability p0 = 0.2 and an efficacy threshold k = 1/3 on the \(BF_{01}\) scale. We assume a slightly optimistic design prior under \(H_1\), specified via da1 = 2.5 and db1 = 2, a flat design prior under \(H_0\), specified via da0 = 1 and db0 = 1, and flat analysis priors under both \(H_0\) and \(H_1\) (specified via a0 = 1, b0 = 1, a1 = 1 and b1 = 1). We test the hypotheses \[H_0:p\leq 0.2 \text{ versus } H_1:p>0.2,\] so we use the argument type = "direction" and set p0 = 0.2. We restrict the sample size search to the range between n_min = 10 and n_max = 200, and require 80% Bayesian power, specified via target_power = 0.80. We also require a Bayesian type-I-error rate of 5% or less, specified via the argument target_type1 = 0.05:

des_bayes <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 200,
  k = 1/3,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "Bayesian",
  target_power = 0.8,
  target_type1 = 0.05
)

We can inspect the results as follows:

summary(des_bayes)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: Bayesian 
#> Sustain:  10  future n
#> Feasible   : TRUE 
#> Status     : Smallest feasible one-stage design found. 
#> 
#> Selected design
#>   n          : 13 
#>   k          : 0.333 
#> 
#> Operating characteristics
#>   Bayes power      : 0.821 
#>   Bayes type-I     : 0.021 
#>   CE(H0)           :   NA 
#>   Freq power       :   NA 
#>   Freq type-I      : 0.099

2.1 Operating-characteristic curves

The search results can be visualized directly.

plot(des_bayes, what = "all")

The left panel shows Bayesian power and type-I error, together with optional frequentist overlays if a point alternative is supplied.

The right panel shows the relevant operating characteristics of the trial design selected by the calibration algorithm. In this case, no frequentist power calculations were required, so the frequentist power is shown as NA (not available), and the lower part states that frequentist power calculations under a point alternative were not requested. We can quickly see that Bayesian power and type-I-error rate are already calibrated for \(n=13\), so with the evidence threshold \(k=1/3\), \(n=13\) patients suffice to yield a calibrated Bayesian design. The output also shows that this design is not calibrated in terms of the frequentist type-I-error rate, which is computed via a grid search and in this case is \(9.9\%\).
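The two type-I-error rates reported for \(n=13\) can be re-derived in base R. The following is a minimal sketch, not the package's implementation: the helper names bf01_direction and marg_h0 are hypothetical, and the sketch assumes that (i) analysis and design priors are beta distributions truncated to the respective hypothesis region, (ii) efficacy is declared when \(BF_{01}<k\), and (iii) the frequentist type-I-error rate is evaluated at the boundary \(p=p_0\).

```r
# Sketch under stated assumptions; helper names are hypothetical.
# BF_01 for H0: p <= p0 vs H1: p > p0 with truncated beta analysis priors.
bf01_direction <- function(y, n, p0, a0 = 1, b0 = 1, a1 = 1, b1 = 1) {
  log_m0 <- lbeta(a0 + y, b0 + n - y) - lbeta(a0, b0) +
    pbeta(p0, a0 + y, b0 + n - y, log.p = TRUE) -
    pbeta(p0, a0, b0, log.p = TRUE)
  log_m1 <- lbeta(a1 + y, b1 + n - y) - lbeta(a1, b1) +
    pbeta(p0, a1 + y, b1 + n - y, lower.tail = FALSE, log.p = TRUE) -
    pbeta(p0, a1, b1, lower.tail = FALSE, log.p = TRUE)
  exp(log_m0 - log_m1)
}

# Prior-predictive probability of y responders under a Beta(a, b) design
# prior truncated to [0, p0] (assumed form of the design prior under H0).
marg_h0 <- function(y, n, p0, a, b) {
  exp(lchoose(n, y) + lbeta(a + y, b + n - y) - lbeta(a, b) +
        pbeta(p0, a + y, b + n - y, log.p = TRUE) -
        pbeta(p0, a, b, log.p = TRUE))
}

n <- 13; p0 <- 0.2; k <- 1/3
y <- 0:n
reject <- bf01_direction(y, n, p0) < k   # assumed decision rule
y_crit <- min(y[reject])                 # smallest count declaring efficacy

freq_type1  <- sum(dbinom(y[reject], n, p0))              # evaluated at p = p0
bayes_type1 <- sum(marg_h0(y[reject], n, p0, a = 1, b = 1))

round(c(y_crit = y_crit, freq_type1 = freq_type1,
        bayes_type1 = bayes_type1), 3)
```

Under these assumptions the sketch yields a critical count of 5 responders, a frequentist type-I-error rate of about 0.099 and a Bayesian type-I-error rate of about 0.021, in line with the summary output above.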

Note that the beta-binomial model induces the zig-zag shape of the resulting curves for operating characteristics such as power and type-I-error rate. As a consequence, the calibration algorithm by default requires the power not to drop below the specified target threshold for the next ten sample sizes. This is customisable via the parameter sustain_n, whose value is also shown in the summary output above as “Sustain: 10 future n”.
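The zig-zag pattern stems from the discreteness of the responder count: the smallest count at which efficacy is declared only moves in integer steps as \(n\) grows. The sketch below illustrates this together with a simple version of the sustain rule. It is not the package's code: bf01_flat and first_sustained are hypothetical helpers, the decision rule \(BF_{01}<k\) with flat truncated analysis priors is an assumption, and so is the handling of windows that extend past the end of the search range. The power values are the Bayesian power entries for \(n=10,\dots,15\) listed in the search results in Section 3, which use the same priors and \(k=1/3\).

```r
# Sketch under stated assumptions; helper names are hypothetical.
bf01_flat <- function(y, n, p0) {
  post_h0 <- pbeta(p0, y + 1, n - y + 1)        # P(p <= p0 | y), flat prior
  (post_h0 / p0) / ((1 - post_h0) / (1 - p0))   # posterior odds / prior odds
}

p0 <- 0.2; k <- 1/3
n_grid <- 10:15

# Smallest responder count with BF_01 < k for each n: it moves in integer
# steps, so the rejection probability at p0 jumps up and down with n.
y_crit <- sapply(n_grid, function(n) {
  y <- 0:n
  min(y[bf01_flat(y, n, p0) < k])
})
rej_at_p0 <- 1 - pbinom(y_crit - 1, n_grid, p0)
data.frame(n = n_grid, y_crit = y_crit, rej_at_p0 = round(rej_at_p0, 4))

# First n whose target is met at n and at the next sustain_n sample sizes
# (windows are truncated at the end of the grid, which is an assumption).
first_sustained <- function(n, ok, sustain_n) {
  for (i in seq_along(ok)) {
    window <- i:min(i + sustain_n, length(ok))
    if (all(ok[window])) return(n[i])
  }
  NA_integer_
}

power <- c(0.8065, 0.8439, 0.7850, 0.8206, 0.8498, 0.8017)  # n = 10, ..., 15
ok <- power >= 0.8
first_sustained(n_grid, ok, sustain_n = 0)
first_sustained(n_grid, ok, sustain_n = 2)
```

Without a sustain requirement the first sample size meeting the power target is \(n=10\), but power drops below 80% again at \(n=12\); requiring the target to hold for the following sample sizes as well moves the choice to \(n=13\).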

2.2 Adding a constraint on the probability of compelling evidence

Next, we add a constraint on the probability of compelling evidence for \(H_0\), which can be imposed through the optional argument target_ce_h0. This is useful when the design should also have a sufficient probability of yielding compelling evidence in favour of the null hypothesis, that is, \[P(BF_{01}>k_f \mid H_0)>f\] for some \(f\in (0,1)\) and futility evidence threshold \(k_f>1\). Here, we use \(f=0.60\) and \(k_f=3\), specified via the arguments target_ce_h0 = 0.60 and k_ce = 3. Everything else remains the same, but we also add the argument dp = 0.4 to compute frequentist power under the point alternative \(p=0.4\):

des_bayes_ce <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 200,
  k = 1/3,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "Bayesian",
  target_power = 0.8,
  target_type1 = 0.05,
  target_ce_h0 = 0.6,
  k_ce = 3,
  dp = 0.4
)

We inspect the results:

summary(des_bayes_ce)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: Bayesian 
#> Sustain:  10  future n
#> Feasible   : FALSE 
#> Status     : No feasible one-stage design found.

Now, no design could be found that fulfils our target constraints. Before we see why, note that because we added the argument dp = 0.4 to the function call above, frequentist power is additionally calculated under the point alternative \(p=0.4\) in the single-arm case. A point alternative can always be added to a design calibration, but the calibration mode (specified via the argument calibration) decides which metrics serve as the benchmark for finding a feasible design (that is, a sufficiently large sample size to meet the specified target constraints). Thus, frequentist power is calculated here solely post hoc; the design itself is calibrated according to the Bayesian target power and type-I-error rate.
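As a rough illustration of what the post-hoc frequentist power under dp measures, the following sketch sums the binomial probabilities of all responder counts that lead to a declaration of efficacy, evaluated at the point alternative. It assumes flat truncated analysis priors and the decision rule \(BF_{01}<k\); freq_power_sketch is a hypothetical helper, not a package function.

```r
# Sketch under stated assumptions; freq_power_sketch is a hypothetical helper.
freq_power_sketch <- function(n, p0, k, dp) {
  y <- 0:n
  post_h0 <- pbeta(p0, y + 1, n - y + 1)                # P(p <= p0 | y)
  bf01 <- (post_h0 / p0) / ((1 - post_h0) / (1 - p0))   # flat truncated priors
  sum(dbinom(y[bf01 < k], n, dp))                       # P(declare efficacy | p = dp)
}

pow13 <- freq_power_sketch(n = 13, p0 = 0.2, k = 1/3, dp = 0.4)
round(pow13, 3)
```

For \(n=13\) this gives approximately 0.647, which agrees with the freq_power entry the package reports for that sample size in the search results shown in Section 3.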

We can plot the resulting design:

plot(des_bayes_ce, what = "all")

The plot shows that the factor which causes problems is the probability of compelling evidence. In the sample size range up to \(n=200\), it stays below the required 60%, so no calibrated design can be found. If we lower the requirement to, say, 50%, we get:

des_bayes_ce50 <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 200,
  k = 1/3,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "Bayesian",
  target_power = 0.8,
  target_type1 = 0.05,
  target_ce_h0 = 0.5,
  k_ce = 3,
  dp = 0.4
)

We inspect the fit:

summary(des_bayes_ce50)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: Bayesian 
#> Sustain:  10  future n
#> Feasible   : TRUE 
#> Status     : Smallest feasible one-stage design found. 
#> 
#> Selected design
#>   n          : 65 
#>   k          : 0.333 
#>   k_ce       :    3 
#> 
#> Operating characteristics
#>   Bayes power      : 0.934 
#>   Bayes type-I     : 0.009 
#>   CE(H0)           : 0.574 
#>   Freq power       : 0.986 
#>   Freq type-I      : 0.085

We can plot the resulting design:

plot(des_bayes_ce50, what = "all")

Now a calibrated design can be found, as expected.
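To make the compelling-evidence criterion more concrete, the sketch below computes the probability that \(BF_{01}\) exceeds \(k_f=3\) for a given sample size. It is not the package's implementation: ce_h0_sketch is a hypothetical helper, flat truncated analysis priors are assumed, and the probability is evaluated at the boundary value \(p=p_0\), which is an assumption about how \(P(BF_{01}>k_f \mid H_0)\) is operationalised.

```r
# Sketch under stated assumptions; ce_h0_sketch is a hypothetical helper.
ce_h0_sketch <- function(n, p0, k_ce) {
  y <- 0:n
  post_h0 <- pbeta(p0, y + 1, n - y + 1)                # P(p <= p0 | y)
  bf01 <- (post_h0 / p0) / ((1 - post_h0) / (1 - p0))   # flat truncated priors
  sum(dbinom(y[bf01 > k_ce], n, p0))                    # evaluated at p = p0 (assumption)
}

ce65 <- ce_h0_sketch(65, p0 = 0.2, k_ce = 3)
round(ce65, 3)

# The same quantity on a few other sample sizes
ce_grid <- sapply(c(20, 65, 100, 200), ce_h0_sketch, p0 = 0.2, k_ce = 3)
round(ce_grid, 3)
```

For \(n=65\) the sketch returns roughly 0.57, close to the CE(H0) value in the summary above, which suggests but does not confirm that the package evaluates this probability in a similar way.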

3 Full Bayes-frequentist calibration

The same interface also supports frequentist and fully joint calibration modes. We change calibration = "Bayesian" to calibration = "full", and specify our target constraints for frequentist power and type-I-error rate via the arguments target_freq_power = 0.8 and target_freq_type1 = 0.05.

des_full <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/3,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  dp = 0.4,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "full",
  target_power = 0.8,
  target_type1 = 0.05,
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)

summary(des_full)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: full 
#> Sustain:  10  future n
#> Feasible   : FALSE 
#> Status     : No feasible one-stage design found.

Given our priors and evidence thresholds, no feasible one-stage design was found in the specified sample size range. We can see the reason for this by inspecting the first rows of the search results:

head(des_full$search_results)
#>    n     power      type1 ce_h0 power_naive type1_naive erased_power
#> 1 10 0.8065237 0.02925913    NA   0.8065237  0.02925913            0
#> 2 11 0.8438969 0.04026381    NA   0.8438969  0.04026381            0
#> 3 12 0.7849683 0.01482212    NA   0.7849683  0.01482212            0
#> 4 13 0.8206335 0.02085515    NA   0.8206335  0.02085515            0
#> 5 14 0.8498421 0.02813328    NA   0.8498421  0.02813328            0
#> 6 15 0.8016997 0.01109326    NA   0.8016997  0.01109326            0
#>   erased_type1 freq_power freq_type1 freq_power_naive freq_type1_naive
#> 1            0  0.6177194 0.12087388        0.6177194       0.12087388
#> 2            0  0.7037157 0.16113920        0.7037157       0.16113920
#> 3            0  0.5618218 0.07255550        0.5618218       0.07255550
#> 4            0  0.6469582 0.09913061        0.6469582       0.09913061
#> 5            0  0.7207430 0.12983963        0.7207430       0.12983963
#> 6            0  0.5967844 0.06105143        0.5967844       0.06105143
#>   erased_freq_power erased_freq_type1 bayes_ok freq_ok hybrid_ok feasible_ce
#> 1                 0                 0     TRUE   FALSE     FALSE        TRUE
#> 2                 0                 0     TRUE   FALSE     FALSE        TRUE
#> 3                 0                 0    FALSE   FALSE     FALSE        TRUE
#> 4                 0                 0     TRUE   FALSE     FALSE        TRUE
#> 5                 0                 0     TRUE   FALSE     FALSE        TRUE
#> 6                 0                 0     TRUE   FALSE     FALSE        TRUE
#>   feasible_pointwise feasible calibration
#> 1              FALSE    FALSE        full
#> 2              FALSE    FALSE        full
#> 3              FALSE    FALSE        full
#> 4              FALSE    FALSE        full
#> 5              FALSE    FALSE        full
#> 6              FALSE    FALSE        full

We can quickly see that the frequentist type-I-error of all designs is simply too large:

range(des_full$search_results$freq_type1)
#> [1] 0.06105143 0.16113920
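The logical columns in the search results can be related to the target constraints with a few lines of R. The sketch below rebuilds pointwise flags from the six rows printed above. The definitions are inferred from the column names and from the description of the calibration modes in this vignette, so they are assumptions rather than the package's actual code; in particular, the sustain rule and the CE constraint are not included, and full_ok is a hypothetical name.

```r
# Rows n = 10, ..., 15 copied from the printed search results above
sr <- data.frame(
  n          = 10:15,
  power      = c(0.8065237, 0.8438969, 0.7849683, 0.8206335, 0.8498421, 0.8016997),
  type1      = c(0.02925913, 0.04026381, 0.01482212, 0.02085515, 0.02813328, 0.01109326),
  freq_power = c(0.6177194, 0.7037157, 0.5618218, 0.6469582, 0.7207430, 0.5967844),
  freq_type1 = c(0.12087388, 0.16113920, 0.07255550, 0.09913061, 0.12983963, 0.06105143)
)

target_power <- 0.8;      target_type1 <- 0.05
target_freq_power <- 0.8; target_freq_type1 <- 0.05

# Assumed pointwise definitions of the feasibility flags
bayes_ok  <- sr$power >= target_power & sr$type1 <= target_type1
freq_ok   <- sr$freq_power >= target_freq_power & sr$freq_type1 <= target_freq_type1
hybrid_ok <- sr$power >= target_power & sr$freq_type1 <= target_freq_type1
full_ok   <- bayes_ok & freq_ok   # hypothetical name for the joint criterion

cbind(sr["n"], bayes_ok, freq_ok, hybrid_ok, full_ok)
```

With these definitions the sketch reproduces the bayes_ok, freq_ok and hybrid_ok columns for the six rows shown: the Bayesian targets are met for most of these sample sizes, while every row fails on the frequentist type-I-error rate.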

Making the evidence threshold more stringent, that is, lowering \(k\) so that a smaller Bayes factor \(BF_{01}\) is needed to declare efficacy, makes it harder to accumulate sufficient evidence in favour of \(H_1\) and thus decreases the type-I-error rates (both the Bayesian and the frequentist ones). We therefore refit our design, changing k = 1/3 to k = 1/10:

des_full_strong_ev <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/10,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  dp = 0.4,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "full",
  target_power = 0.8,
  target_type1 = 0.05,
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)

We investigate the fit:

summary(des_full_strong_ev)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: full 
#> Sustain:  10  future n
#> Feasible   : TRUE 
#> Status     : Smallest feasible one-stage design found. 
#> 
#> Selected design
#>   n          : 38 
#>   k          :  0.1 
#> 
#> Operating characteristics
#>   Bayes power      : 0.866 
#>   Bayes type-I     : 0.003 
#>   CE(H0)           :   NA 
#>   Freq power       : 0.814 
#>   Freq type-I      : 0.029
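The effect of the stricter threshold can be traced at the selected sample size \(n=38\). The following sketch compares the critical responder count and the resulting frequentist operating characteristics for \(k=1/3\) and \(k=1/10\). It assumes flat truncated analysis priors, the decision rule \(BF_{01}<k\), and evaluation of the type-I-error rate at \(p=p_0\); oc_at_n is a hypothetical helper and not part of the package.

```r
# Sketch under stated assumptions; oc_at_n is a hypothetical helper.
oc_at_n <- function(n, p0, k, dp) {
  y <- 0:n
  post_h0 <- pbeta(p0, y + 1, n - y + 1)                # P(p <= p0 | y)
  bf01 <- (post_h0 / p0) / ((1 - post_h0) / (1 - p0))   # flat truncated priors
  reject <- bf01 < k                                    # assumed decision rule
  c(y_crit     = min(y[reject]),
    freq_type1 = sum(dbinom(y[reject], n, p0)),         # at the boundary p0
    freq_power = sum(dbinom(y[reject], n, dp)))         # at the point alternative
}

res <- rbind(`k = 1/3`  = oc_at_n(38, p0 = 0.2, k = 1/3,  dp = 0.4),
             `k = 1/10` = oc_at_n(38, p0 = 0.2, k = 1/10, dp = 0.4))
round(res, 3)
```

Lowering \(k\) raises the number of responders needed to declare efficacy, which pushes the frequentist type-I-error rate at \(n=38\) below 5% (about 0.029 for \(k=1/10\)) at the cost of some power.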

We plot the fit:

plot(des_full_strong_ev)

3.1 Frequentist calibration

We turn to frequentist calibration of our design, and build upon the fully calibrated design above. We change the calibration mode accordingly to frequentist:

des_freq_strong_ev <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/10,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  dp = 0.4,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "frequentist",
  target_power = 0.8,
  target_type1 = 0.05,
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)

summary(des_freq_strong_ev)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: frequentist 
#> Sustain:  10  future n
#> Feasible   : TRUE 
#> Status     : Smallest feasible one-stage design found. 
#> 
#> Selected design
#>   n          : 38 
#>   k          :  0.1 
#> 
#> Operating characteristics
#>   Bayes power      : 0.866 
#>   Bayes type-I     : 0.003 
#>   CE(H0)           :   NA 
#>   Freq power       : 0.814 
#>   Freq type-I      : 0.029

Note that in the above call the arguments target_power = 0.8 and target_type1 = 0.05 could be removed, as they play no role when the calibration argument is set to "frequentist". target_power and target_type1 only specify the target constraints for calibration modes that involve a Bayesian component, namely calibration = "Bayesian", calibration = "hybrid" or calibration = "full".

Based on the summary output, we see that the design calibrated in terms of frequentist type-I-error rate and power is also fully calibrated. This can also be seen from the last plot, because the Bayesian type-I-error rate and power already meet their respective target constraints at smaller sample sizes. The limiting factor in the fully calibrated design is thus the frequentist power, whose target is only achieved at \(n=38\) (see the left panel in the last plot above).

3.2 Hybrid calibration

An interesting alternative is hybrid calibration. We do not provide detailed information here, but the core idea is to base the calibration on Bayesian power, which is often more realistic than frequentist power under a specific point alternative, together with the frequentist type-I-error rate. From a regulatory perspective, the frequentist type-I-error rate is often the hard constraint that must be met, as it constitutes a worst-case scenario, whereas Bayesian power may be more realistic under a slightly informative design prior. We set the calibration mode to hybrid:

des_hybrid_strong_ev <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/10,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  dp = 0.4,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "hybrid",
  target_power = 0.8,
  target_type1 = 0.05,
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)

summary(des_hybrid_strong_ev)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: hybrid 
#> Sustain:  10  future n
#> Feasible   : TRUE 
#> Status     : Smallest feasible one-stage design found. 
#> 
#> Selected design
#>   n          : 23 
#>   k          :  0.1 
#> 
#> Operating characteristics
#>   Bayes power      : 0.809 
#>   Bayes type-I     : 0.004 
#>   CE(H0)           :   NA 
#>   Freq power       : 0.612 
#>   Freq type-I      : 0.027

We see that now the feasible sample size to calibrate our design has changed from \(n=38\) to only \(n=23\). Thus, the hybrid design required fewer patients.

plot(des_hybrid_strong_ev)

The plot clearly shows that Bayesian power is the limiting factor here: from \(n=23\) patients on, it passes the required 80% threshold. The Bayesian and frequentist type-I-error rates are controlled for even smaller sample sizes, and from our previous full and frequentist calibration results we know that frequentist power achieves the required 80% target only at \(n=38\). Thus, shifting to a Bayesian notion of power under the slightly informative design prior under \(H_1\) yields a reduction in sample size of \(15\) patients, which is non-negligible.

We could also add a probability of compelling evidence constraint to that design by adding the arguments k_ce = 3 and target_ce_h0 = 0.60, indicating that we require at least 60% probability of compelling evidence for \(H_0\) based on an evidence threshold \(k_f = 3\):

des_hybrid_strong_ev_with_ce <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/10,
  k_ce = 3,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  dp = 0.4,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "hybrid",
  target_power = 0.8,
  target_type1 = 0.05,
  target_ce_h0 = 0.6,
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)

summary(des_hybrid_strong_ev_with_ce)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: hybrid 
#> Sustain:  10  future n
#> Feasible   : FALSE 
#> Status     : No feasible one-stage design found.

We see that no calibrated design could be found in the sample size range specified. The following plot shows why:

plot(des_hybrid_strong_ev_with_ce)

The probability of compelling evidence simply does not reach the target 60% in the specified sample size range. Thus, we could either extend that range by increasing n_max or adopt different design priors. Alternatively, we could lower our requirement on the probability of compelling evidence to, say, 50% as follows:

des_hybrid_strong_ev_with_ce50 <- design_singlearm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/10,
  k_ce = 3,
  p0 = 0.2,
  a0 = 1, b0 = 1,
  a1 = 1, b1 = 1,
  dp = 0.4,
  da0 = 1, db0 = 1,
  da1 = 2.5, db1 = 2,
  type = "direction",
  calibration = "hybrid",
  target_power = 0.8,
  target_type1 = 0.05,
  target_ce_h0 = 0.5,
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)

summary(des_hybrid_strong_ev_with_ce50)
#> Summary: One-stage single-arm Bayes factor design
#> ------------------------------------------------
#> Calibration: hybrid 
#> Sustain:  10  future n
#> Feasible   : TRUE 
#> Status     : Smallest feasible one-stage design found. 
#> 
#> Selected design
#>   n          : 65 
#>   k          :  0.1 
#>   k_ce       :    3 
#> 
#> Operating characteristics
#>   Bayes power      : 0.904 
#>   Bayes type-I     : 0.002 
#>   CE(H0)           : 0.574 
#>   Freq power       : 0.952 
#>   Freq type-I      : 0.026

plot(des_hybrid_strong_ev_with_ce50)

3.3 Relationship to the two-stage design

The one-stage design can be viewed as the fixed-sample baseline. In the package, the function design_singlearm_bf() extends this framework to a two-stage design with one interim analysis and early stopping for futility.

3.4 Summary

In this vignette we demonstrated how to calibrate one-stage single-arm Bayes factor designs for phase II trials with binary endpoints using the functions provided in the bfbin2arm package. The calibration framework combines design and analysis priors, Bayes-factor evidence thresholds, and target operating characteristics into a fully numerical search over the total sample size. Within this framework, users can work in purely Bayesian, purely frequentist, hybrid, or full modes, and optionally impose an additional constraint on the probability of compelling evidence in favour of the null hypothesis.

We illustrated how to obtain the smallest feasible design under Bayesian calibration, and how to visualize the resulting operating-characteristic curves via the dedicated plotting method. We also showed how to extend the calibration by adding a probability of compelling evidence (CE) constraint, and how the chosen futility threshold \(k_f\) (the parameter k_ce) and the target level for the probability of compelling evidence interact with the achievable power and type-I error in a given sample size range. Finally, we discussed the sustain parameter sustain_n, which requires that the chosen design remains feasible for a number of larger sample sizes and thus guards against local beta–binomial oscillations in the operating characteristics.

References

Kelter, Riko. 2026. “Power and Sample Size Calculations for Bayes Factors in Two-Arm Clinical Phase II Trials with Binary Endpoints.” https://arxiv.org/abs/2603.01715.

Kelter, Riko, and Samuel Pawel. 2025a. “Bayesian Power and Sample Size Calculations for Bayes Factors in the Binomial Setting.” https://arxiv.org/abs/2502.02914.

———. 2025b. “The Bayesian Optimal Two-Stage Design for Clinical Phase II Trials Based on Bayes Factors.” https://arxiv.org/abs/2511.23144.

Calibration of Bayesian one-stage designs for single-arm phase II trials with binary endpoints

Riko Kelter
Institute of Medical Statistics and Computational Biology
Faculty of Medicine, University of Cologne
Cologne, Germany

01 June 2026

1 Introduction

1.1 How to design a Bayesian trial: Overview of the calibration algorithm

2 Basic calibration

2.1 Operating-characteristic curves

2.2 Adding a constraint on the probability of compelling evidence

3 Full Bayes-frequentist calibration

3.1 Frequentist calibration

3.2 Hybrid calibration

3.3 Relationship to the two-stage design

3.4 Summary

References
