Introduction to Medical Statistics 2026
Exercise class 9
Sample Size Calculation and Repetition

Author

Ronald Geskus

Published

March 1, 2026

I. Sample size calculation via formulas

Investigators want to run a phase-3 RCT that compares a new chemotherapy to an old one. They use 6-month tumour response as the primary endpoint. They need to perform a sample size/power calculation. The investigators assume that the tumour response probability is 20% for the old chemotherapy (“control” arm) and 40% for the new one (“intervention” arm).

  a. How many patients are required in order to have a power of 80% to detect the assumed increase in the tumour response probability at the two-sided 5% significance level? Do the same for a power of 90%. Use the R function power.prop.test (you may want to check the help file) and round the result up to a whole number. Compare the result with the result of Practical II-b in session 3 (Tuesday morning), where with \(n=50\) the probability to reject the null hypothesis if \(\pi=0.4\) was 84%. Can you explain the difference?
power.prop.test(p1=..., p2=..., power=...) 

Answer: We need 81.2 individuals per group for 80% power and 108.2 individuals per group for 90% power, which we round up to the safe values 82 and 109. In Practical II-b in session 3 (Tuesday morning) we saw that 50 individuals already gave a power of 84%. However, that practical only considered collecting data for the new treatment, because the probability of tumour response under the old chemotherapy was assumed to be known and equal to 20%. In a proper RCT, the old treatment is evaluated in the same trial, which adds extra variation and lowers the power.

power.prop.test(p1=0.2, p2=0.4, power=0.8) 

     Two-sample comparison of proportions power calculation 

              n = 81.22424
             p1 = 0.2
             p2 = 0.4
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
power.prop.test(p1=0.2, p2=0.4, power=0.9) 

     Two-sample comparison of proportions power calculation 

              n = 108.2355
             p1 = 0.2
             p2 = 0.4
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group
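As a cross-check (an addition, not part of the exercise), the same numbers can be computed by hand with the standard normal-approximation formula using the pooled proportion under \(H_0\); up to rounding this reproduces the values from power.prop.test:

```r
## normal-approximation sample size per group for comparing two proportions:
## n = ( z_{1-alpha/2}*sqrt(2*pbar*(1-pbar)) +
##       z_{power}*sqrt(p1*(1-p1)+p2*(1-p2)) )^2 / (p2-p1)^2
p1 <- 0.2; p2 <- 0.4; pbar <- (p1 + p2)/2
n.per.group <- function(power)
  (qnorm(0.975) * sqrt(2 * pbar * (1 - pbar)) +
   qnorm(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p2 - p1)^2
c(n.per.group(0.8), n.per.group(0.9))  # approximately 81.2 and 108.2
```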
  b. Assume that in reality the new therapy increases the tumour response probability only to 30%. What power does the RCT then have to obtain a significant result with the sample size calculated in a. for 90% power? Use power.prop.test again, but now use the argument n=… instead of power=… and set the sample size per group to the value we found in a.
power.prop.test(p1=..., p2=..., n=109)
power.prop.test(p1=0.2, p2=0.3, n=109)

     Two-sample comparison of proportions power calculation 

              n = 109
             p1 = 0.2
             p2 = 0.3
      sig.level = 0.05
          power = 0.3986717
    alternative = two.sided

NOTE: n is number in *each* group

Answer: The power drops to about 40%. We can also perform this computation for a range of values to obtain a power curve.

library(ggplot2)
p2 <- seq(0.2,0.5,by=0.01)
power.prop.test(p1=0.2, p2=p2, n=109)$power
 [1] 0.02500000 0.03776501 0.05502726 0.07753971 0.10591162 0.14052788 0.18148125
 [8] 0.22852766 0.28107211 0.33818838 0.39867171 0.46111853 0.52402427 0.58588815
[15] 0.64531335 0.70109251 0.75227085 0.79818264 0.83846037 0.87301900 0.90202030
[22] 0.92582363 0.94493004 0.95992609 0.97143244 0.98006106 0.98638288 0.99090627
[29] 0.99406567 0.99621860 0.99764909
PowerByP2 <- data.frame(p2=p2, power=power.prop.test(p1=0.2, p2=p2, n=109)$power)
ggplot(PowerByP2, aes(p2,power)) + geom_line(linewidth=1.2) +
  geom_hline(yintercept=0.025, linetype="dotted",color="red")

We see that the power approaches the one-sided significance level 0.025 (the dotted red line) as p2 gets close to the value under the null hypothesis, p1=0.2.

  c. The new chemotherapy is also expected to cause less anaemia. As a secondary endpoint, the haemoglobin value after the first chemotherapy cycle was chosen. It is expected that mean haemoglobin at that time point is 1 g/dl higher in the intervention arm and that the standard deviation of haemoglobin values is 2.5 g/dl in both arms. What power does the RCT with the sample size from a. have to give a significant result at the two-sided 5% significance level for the secondary endpoint under these assumptions? Can you guess the power before performing the calculation? Use power.t.test (again setting the sample size per group n=… and not setting a value for the power argument).
power.t.test(delta=..., sd=..., n=109)
power.t.test(delta=1, sd=2.5,n=109)

     Two-sample t test power calculation 

              n = 109
          delta = 1
             sd = 2.5
      sig.level = 0.05
          power = 0.8364191
    alternative = two.sided

NOTE: n is number in *each* group

Answer: The power is 84%.
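For intuition (an addition, not part of the exercise), this power can be approximated by hand via the normal distribution; power.t.test gives a slightly smaller value because it accounts for the t-distribution:

```r
## normal-approximation power for a two-sample comparison of means:
## power ~ P( Z > z_{1-alpha/2} - delta / (sd*sqrt(2/n)) )
delta <- 1; sd <- 2.5; n <- 109
pnorm(delta / (sd * sqrt(2 / n)) - qnorm(0.975))  # approximately 0.84
```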

II. Analysis of RCTs / repetition of statistical tests

A randomized controlled clinical trial compared antibiotics administration guided by the biomarker procalcitonin (PCT) with antibiotics administration according to standard of care in patients admitted to hospital with lower respiratory tract infections. A total of 442 patients were randomized. The primary endpoint of the trial was antibiotics prescription within 30 days of hospitalization (i.e. did the patient get any antibiotics during that time interval or not). An important secondary endpoint was the length of hospital stay (in days). The results of the trial are in the dataset pctTrial.csv, with a short description in the file pctTrial_description.txt.

  a. Summarize the antibiotics prescription probabilities in the two arms and compare them with an appropriate statistical test. What does the code below do? What do you conclude?
## START PRACTICAL II
pctTrial  <- read.csv("Data/pctTrial.csv")

xtabs(~AByn+group, data=pctTrial)
     group
AByn  Control PCT
  no       27  58
  yes     194 163
prop.test(c(194,163), c(221,221))

    2-sample test for equality of proportions with continuity correction

data:  c(194, 163) out of c(221, 221)
X-squared = 13.109, df = 1, p-value = 0.0002938
alternative hypothesis: two.sided
95 percent confidence interval:
 0.06343631 0.21710668
sample estimates:
   prop 1    prop 2 
0.8778281 0.7375566 
with(pctTrial, chisq.test(AByn,group))

    Pearson's Chi-squared test with Yates' continuity correction

data:  AByn and group
X-squared = 13.109, df = 1, p-value = 0.0002938
# or
chisq.test(matrix(c(194,163,27,58), nrow=2))

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(c(194, 163, 27, 58), nrow = 2)
X-squared = 13.109, df = 1, p-value = 0.0002938
# we can also use the gtsummary package for a more beautiful output
library(gtsummary)
add_p(tbl_summary(pctTrial, by=group, include=c(AByn,group)))
Characteristic   Control, N = 221   PCT, N = 221   p-value
AByn, n (%)      194 (88%)          163 (74%)      <0.001
p-value from Pearson’s Chi-squared test

Answer: The prescription probability is 88% in the control arm vs. 74% in the PCT arm. Both variables are binary, so we can use the prop.test or chisq.test function. The observed reduction in the PCT arm compared to control is 14 percentage points (95% CI 6 to 22 percentage points, p=0.0003).
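As an additional check (not asked for in the exercise), Fisher's exact test can be applied to the same 2×2 table; with counts of this size it gives a very similar p-value to the chi-squared test:

```r
## rows are the arms (Control, PCT), columns the yes/no antibiotics counts
fisher.test(matrix(c(194, 163, 27, 58), nrow = 2))
```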

  b. Visualize the length of hospital stay in the two arms via a violin plot and an ECDF plot. Which one do you prefer? Also give a numerical summary and compare the mean length of stay by arm with an appropriate test.
library(ggplot2)
ggplot(pctTrial, aes(group,LOS)) + geom_violin() + 
  geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')

ggplot(pctTrial, aes(LOS,group=group, color=group)) + stat_ecdf()

## numerical summary
tbl_summary(pctTrial, by=group, include=c(LOS,group))
Characteristic         Control, N = 221   PCT, N = 221
LOS, Median (Q1, Q3)   8 (4, 12)          8 (4, 12)
# a bit skewed, but not too much. Hence the t-test should be ok given the sample size
wilcox.test(LOS~group,data=pctTrial)

    Wilcoxon rank sum test with continuity correction

data:  LOS by group
W = 23993, p-value = 0.7498
alternative hypothesis: true location shift is not equal to 0
t.test(LOS~group,data=pctTrial)

    Welch Two Sample t-test

data:  LOS by group
t = -0.24839, df = 439.62, p-value = 0.804
alternative hypothesis: true difference in means between group Control and group PCT is not equal to 0
95 percent confidence interval:
 -1.572807  1.219866
sample estimates:
mean in group Control     mean in group PCT 
             9.162896              9.339367 

Answer: The distribution of length of stay is a bit skewed, but not too much. Median length of stay is 8 days in both groups (means 9.16 and 9.34 days), and there is no evidence for a difference between the groups (Wilcoxon test p=0.75; t-test p=0.80).

  c. Plot the relation between the two biomarkers pct and proADM. First check whether the values of the biomarkers should be transformed prior to the analysis. How large is the correlation between the two markers?
ggplot(pctTrial, aes(pct)) + geom_histogram()

ggplot(pctTrial, aes(proADM)) + geom_histogram()

ggplot(pctTrial, aes(pct,proADM)) + geom_point() 

summary(pctTrial$pct) 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
  0.01955   0.11235   0.24221   4.05090   1.38530 199.60704 
summary(pctTrial$proADM) # no zero values, hence we can use log values
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2118  0.6650  1.0497  1.4876  1.6002  9.0887 
ggplot(pctTrial, aes(1,log10(pct))) + geom_violin() +
    geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')

ggplot(pctTrial, aes(log10(pct))) + geom_histogram() 

ggplot(pctTrial, aes(1,log10(proADM))) + geom_violin() +
    geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')

ggplot(pctTrial, aes(log10(proADM))) + geom_histogram()

ggplot(pctTrial, aes(log10(pct),log10(proADM))) + geom_point()

## we compute both Pearson and Spearman correlation (note that Spearman is insensitive to monotone transformations)
cor.test(pctTrial$pct, pctTrial$proADM)

    Pearson's product-moment correlation

data:  pctTrial$pct and pctTrial$proADM
t = 11.531, df = 440, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4067332 0.5502770
sample estimates:
      cor 
0.4817299 
cor.test(log10(pctTrial$pct), log10(pctTrial$proADM))

    Pearson's product-moment correlation

data:  log10(pctTrial$pct) and log10(pctTrial$proADM)
t = 18.436, df = 440, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6040861 0.7097323
sample estimates:
      cor 
0.6601618 
cor.test(pctTrial$pct, pctTrial$proADM, method="spearman")

    Spearman's rank correlation rho

data:  pctTrial$pct and pctTrial$proADM
S = 5231034, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6365253 

Answer: Both biomarkers are very skewed, as can be seen in the histograms, the violin plots and the scatterplot. The log-transformed values look much more symmetric, and the relation between them looks quite linear. On the log scale the Pearson correlation is 0.66; the Spearman correlation is 0.64, with or without log transformation.
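The comment above that Spearman's correlation is insensitive to monotone transformations can be illustrated with simulated skewed data (hypothetical data, not the trial data):

```r
## the log transform is strictly increasing, so the ranks -- and hence
## Spearman's correlation -- do not change
set.seed(1)
x <- rlnorm(200)
y <- x * rlnorm(200)
cor(x, y, method = "spearman") == cor(log10(x), log10(y), method = "spearman")  # TRUE
```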

III. Sample size calculation via simulation

In the real world we typically have one single data set. However, on our computer we can create a hypothetical situation in which we generate as many artificial data sets as we want. Such a simulation study is often performed by statisticians in order to investigate performance of a newly developed statistical method.

Simulation studies can also be used in sample size/power calculations, since these likewise concern a hypothetical situation, before any data have been collected. Instead of using ready-made software for power and sample size calculations, we can use our computer as a laboratory: we generate RCTs on our computer and look at the distribution of the outcomes.

We use the setting of question I.a. We first set some values

N <- 109 # sample size per group
signif <- 0.05 # significance level
prob1 <- 0.2 # success probability under $H_0$
prob2 <- 0.4 # success probability in intervention group under assumed value of $H_1$
  a. Visualize the probability distribution of the number of observed tumour responses with a sample of N=109 and a response probability of prob1=0.2. Do the same for the proportion of observed tumour responses (code not given).
dist.binom <- data.frame(x=0:N, p1=dbinom(x=0:N, size=N ,prob=prob1))
head(dist.binom) # first rows; the full table has 110 rows (x = 0, ..., 109)
  x           p1
1 0 2.734063e-11
2 1 7.450323e-10
3 2 1.005794e-08
4 3 8.968326e-08
5 4 5.941516e-07
6 5 3.119296e-06
## we first save the plot in an object plot.binom because we will use it again later
library(ggplot2)
plot.binom <- ggplot(dist.binom) + 
  geom_linerange(aes(x,p1,ymin=0,ymax=p1),linewidth=1)
plot.binom

## we repeat the figure, but restrict x to the range of values with probability larger than 0.00001
plot.binom + xlim(range(which(dist.binom$p1>0.00001)))

For the proportion, we only need to change “x” into “x/N”.

ggplot(dist.binom) +  geom_linerange(aes(x/N,p1,ymin=0,ymax=p1),linewidth=1) 

  b. Sample N=109 tumour outcomes under the response probability prob1=0.2 and compute the number of responses. Locate this number in the figure above. Is this number likely to occur? (Note: this is random, so each of you may get a slightly different answer.) What is the proportion of tumour responses in your generated data set? We use the rbinom function, generating N binary 0-1 outcomes.
data.p1 <- rbinom(N,1,prob1)
data.p1 # individual data
  [1] 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0
 [41] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0
 [81] 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
sum(data.p1) # number of "successes"
[1] 22
sum(data.p1)/N # proportion of "successes"
[1] 0.2018349
  c. If we want to get an idea of the distribution of the number of successes, we repeat the experiment many times. We compute the number of tumour responses in each experiment and summarize over the experiments. Note that when generating the data with the rbinom function, we now specify M binomial experiments, each of size N=109, with the total number of “successes” per experiment as outcome. Set M=100. Can you explain the difference between the generated data (in orange) and the theoretical distribution (in black)?
M <- 100
data.p1 <- rbinom(M,N,prob1)
data.p1
  [1] 15 22 14 22 22 26 17 18 19 23 17 16 33 16 19 20 19 19 24 23 22 19 25 14 22 27
 [27] 22 19 20 23 25 20 18 21 19 23 26 23 23 29 23 23 26 23 29 21 28 26 22 21 18 21
 [53] 22 24 22 20 25 21 19 21 16 23 28 20 19 12 24 23 20 13 20 23 21 23 23 17 25 27
 [79] 24 22 19 19 23 28 23 14 19 26 19 22 25 24 19 14 29 21 24 31 28 15
## we summarize these values in a histogram and add it to the plot
hist.p1 <- data.frame(N=sort(unique(data.p1)), freq=as.numeric(table(data.p1))/M)
hist.p1
    N freq
1  12 0.01
2  13 0.01
3  14 0.04
4  15 0.02
5  16 0.03
6  17 0.03
7  18 0.03
8  19 0.14
9  20 0.07
10 21 0.08
11 22 0.11
12 23 0.16
13 24 0.06
14 25 0.05
15 26 0.05
16 27 0.02
17 28 0.04
18 29 0.03
19 31 0.01
20 33 0.01
plot.binom + xlim(range(which(dist.binom$p1>0.00001))) + 
  geom_linerange(data=hist.p1, aes(N-0.2,freq,ymin=0,ymax=freq),
                 linewidth=1,col="orange")

Answer: The difference is explained by randomness. We have drawn 100 values of the number of successes from a Bin(N=109, p=0.2) distribution, summarized by the orange bars; the black bars give the true distribution. The summary based on our experiments gets closer to the true (theoretical) values as we increase the number of experiments. Below we increase the number of experiments to one million and repeat the procedure; the resulting red histogram is indeed much closer to the true values.

M <- 1000000
data.p1 <- rbinom(M,N,prob1)
hist.p1.M <- data.frame(N=sort(unique(data.p1)), freq=as.numeric(table(data.p1))/M)
plot.binom + xlim(range(which(dist.binom$p1>0.00001))) + 
  geom_linerange(data=hist.p1, aes(N-0.2,freq,ymin=0,ymax=freq),
                 linewidth=1,col="orange") + 
  geom_linerange(data=hist.p1.M, aes(N+0.2,freq,ymin=0,ymax=freq),
                 linewidth=1,col="red")

  d. We return to the sample size and power calculations. We simulate the RCT M=1,000,000 times. In each “experiment”, we generate the number of responses among N individuals in the control arm and N individuals in the intervention arm, both under \(H_0\) (response probability 0.2 in both arms) and under the assumed value \(p=0.4\) for the intervention arm under \(H_1\). We collect all results in a data.frame. What do you observe?
M <- 1000000
data.sim <- data.frame(H0.Control = rbinom(M,N,prob1), 
                       H0.Intervention = rbinom(M,N,prob1), 
                       H1.Control = rbinom(M,N,prob1), 
                       H1.Intervention = rbinom(M,N,prob2))
summary(data.sim)  
   H0.Control   H0.Intervention   H1.Control    H1.Intervention
 Min.   : 5.0   Min.   : 6.0    Min.   : 4.00   Min.   :20.0   
 1st Qu.:19.0   1st Qu.:19.0    1st Qu.:19.00   1st Qu.:40.0   
 Median :22.0   Median :22.0    Median :22.00   Median :44.0   
 Mean   :21.8   Mean   :21.8    Mean   :21.81   Mean   :43.6   
 3rd Qu.:25.0   3rd Qu.:25.0    3rd Qu.:25.00   3rd Qu.:47.0   
 Max.   :44.0   Max.   :45.0    Max.   :43.00   Max.   :69.0   

Answer: The mean values are as expected (109 × 0.2 ≈ 21.8 and 109 × 0.4 ≈ 43.6). The first three columns have the same distribution.
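The theoretical means and standard deviations of the binomial counts can be computed directly for comparison with the summary above (a quick check, not part of the exercise):

```r
## mean N*p and standard deviation sqrt(N*p*(1-p)) of a Bin(N, p) count
N <- 109
c(mean.H0 = N * 0.2, sd.H0 = sqrt(N * 0.2 * 0.8),
  mean.H1 = N * 0.4, sd.H1 = sqrt(N * 0.4 * 0.6))  # means 21.8 and 43.6
```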

  e. Under the null hypothesis, the tumour response probability is 0.2 in both the intervention group tr and the control group pl. Since the number in each group is the same, we can use the difference between the numbers of individuals with tumour response in the two groups as test statistic, \(D=X_{tr}-X_{pl}\). In order to determine for which observed differences we reject \(H_0\), we need the distribution of \(D\) under the null hypothesis. \(D\) doesn’t have any standard distribution that you know of (although it can be approximated by a normal distribution). However, we can use our simulated data! Summarize and plot the distribution of the difference in the number of tumour responses under \(H_0\). At the two-sided significance level \(\alpha=0.05\), how large should the difference be in order to decide that \(H_0\) is not correct (this value is called the “critical value”)?
ggplot(data.sim) + geom_histogram(aes(H0.Intervention-H0.Control,after_stat(density)), 
                                  binwidth=0.2)

Q <- quantile(data.sim$H0.Intervention-data.sim$H0.Control, 
              prob=(1-signif/2))
Q
97.5% 
   12 
## check whether the value of Q itself should lead to rejection already:
sum(data.sim$H0.Intervention-data.sim$H0.Control>=12)/M # P(D >= 12) under null hypothesis
[1] 0.025771
sum(data.sim$H0.Intervention-data.sim$H0.Control>=13)/M # P(D >= 13) under null hypothesis
[1] 0.01714

Answer: We reject the null hypothesis if the difference is at least 13: under the null hypothesis, the fraction of experiments in which the number of responses in the new chemotherapy group exceeds that in the old therapy group by at least 13 is smaller than 0.025, whereas for a difference of at least 12 it is not.
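Because the two counts are independent binomials, the null distribution of \(D\) can also be computed exactly rather than by simulation; this sketch (an addition, not part of the exercise) should reproduce the critical value of 13:

```r
## exact distribution of D = X_intervention - X_control under H0:
## P(D = d) = sum over all (i, j) with i - j = d of P(X = i) * P(Y = j)
N <- 109; prob1 <- 0.2; signif <- 0.05
pX <- dbinom(0:N, N, prob1)
pD <- tapply(outer(pX, pX), outer(0:N, 0:N, "-"), sum)  # named by d = -N, ..., N
p.upper <- rev(cumsum(rev(pD)))                         # P(D >= d)
min(as.numeric(names(p.upper))[p.upper <= signif / 2])  # critical value
```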

  f. Compute the power if N=109 and p=0.4, i.e. the percentage of “experiments” that lead to rejection of \(H_0\) if p=0.4.
sum(data.sim$H1.Intervention-data.sim$H1.Control>=13)/M # P(D >= 13) if p=0.4
[1] 0.919823
## We can also create a nice visualisation
ggplot(data.sim) + geom_histogram(aes(H0.Intervention-H0.Control,after_stat(density)), 
                                  binwidth=0.2) +
  geom_vline(aes(xintercept=13-0.5),color="red") + 
  geom_histogram(aes(H1.Intervention-H1.Control+0.2,after_stat(density)), 
                 binwidth=0.2, fill="green") + 
  xlab("difference in number of tumour responses") + ylab("probability")

Answer: The power is a bit larger than 90%, namely 92%.

  g. We computed the power for a sample size per group of N=109. However, we can use the same approach to do a sample size calculation. For this, we vary \(N\) and repeat the calculations as in d./e./f. The most efficient way of doing this is to first create a function that performs the essential steps. I created this function for you. Run the code in the chunk below in your R session; this will make the function available for your use. Check whether it gives the same value for the power as above if N=109.
sim.power <- function(Nstart, Nstop, prob1=0.2, prob2=0.4, M=1000000, signif=0.05){
    power <- data.frame(N=Nstart:Nstop, CriticalValue=NA, Power=NA)
    for(Ni in 1:(Nstop+1-Nstart)){
        data.sim <- data.frame(H0.Control = rbinom(M,Nstart+Ni-1,prob1),
                               H0.Intervention = rbinom(M,Nstart+Ni-1,prob1),
                               H1.Control = rbinom(M,Nstart+Ni-1,prob1),
                               H1.Intervention = rbinom(M,Nstart+Ni-1,prob2))
        Q <- quantile(data.sim$H0.Intervention-data.sim$H0.Control, prob=(1-signif/2))
        power[Ni, "CriticalValue"] <- Q+1
        power[Ni,"Power"] <- 1-cumsum(xtabs(~I(H1.Intervention-H1.Control), data=data.sim)/M)[as.character(Q)]
    }
    power
}
sim.power(Nstart=109,Nstop=109)
    N CriticalValue    Power
1 109            13 0.920369
power.prop.test(p1=0.2, p2=0.4, n=109)

     Two-sample comparison of proportions power calculation 

              n = 109
             p1 = 0.2
             p2 = 0.4
      sig.level = 0.05
          power = 0.9020203
    alternative = two.sided

NOTE: n is number in *each* group

Answer: The power from the simulation (0.920) is slightly larger than the value given by the power.prop.test function (0.902).

Now we can compute the power for different sample sizes. If we have no idea of the required sample size, we can choose a range, say from 50 to 200. In order to save time, we lower M to 100,000. (This calculation may take a minute.) What do you conclude with respect to the sample size needed to obtain a power of 90%?

BinPower <- sim.power(50,200,M=100000)
ggplot(BinPower, aes(N,Power)) + geom_point(size=1)

ggplot(BinPower, aes(N,Power)) + geom_point(size=2) + xlim(90,115) + ylim(0.85,0.95)

Answer: The required sample size is 98 instead of 109. The function power.prop.test uses a normal approximation; the simulation is more reliable. We also see a striking non-monotonic pattern in the power as a function of the sample size: sometimes the power jumps down when we increase the sample size. This is caused by the fact that the binomial distribution takes discrete values only, leading to discrete critical values. This “saw-tooth” behaviour is a well-known phenomenon.