Introduction to Medical Statistics 2024
Exercises Class IV
Normal distribution and the Confidence Intervals

Authors

Thuan Hoang

Ronald Geskus

Published

March 24, 2026

Warning: package 'ggplot2' was built under R version 4.5.1

Warning: package 'gghighlight' was built under R version 4.5.2

I. Calculation of normal probabilities

Below is a bell-shaped curve demonstrating a Standard Normal distribution. In R, function pnorm(x, mean, sd) is a function to quantify the probability \(P(X \leq x)\) from a Normal distribution with mean mean and standard deviation sd.

Figure 1: Given X ~ Normal(0,1). The curve represents the probability density function of x. The probability of x < -1 (P(x < -1) is highlighted in red. Do you see that it is symmetrical?

Assume that the probability distribution of a laboratory marker in a population has a normal distribution with mean 75 and standard deviation 15. Use the 68-95-99.7-rule to (approximately) calculate the following probabilities by hand. Use the pnorm function in R to confirm your approximations.

P(60<X<90)

Answer: Standardized value: (90-75)/15=1 sd from the mean. About 68% of the values is within 1 sd from the mean, hence P(60<x<90)=0.68

pnorm(60, mean=75, sd=15) - pnorm(75, mean=75, sd=15)

[1] -0.3413447

P(X<60)

Answer: 32% percent of the values is more than 1sd from the mean, 16% on each side. Hence P(x<60)=0.16

1-pnorm(60, mean=75, sd=15)

[1] 0.8413447

P(X>90)

Answer: Same as above as Normal Distribution is symmetrical. Hence P(X>90) = P(X<60) = 1-P(X<90) = 0.16

1-pnorm(60, mean=75, sd=15)

[1] 0.8413447

# or pnorm(90, mean=75, sd=15, lower.tail=FALSE)

P(X<60 or X>105)

Answer: (105-75)/15=2 About 95% is within 2sd from the mean, and about 2.5% is larger than 105. Hence P(X<60 or X>105)=0.16+0.025=0.185

pnorm(60,75,15) +
  (1-pnorm(105,75,15))  # P(X<60) + (1-P(X<105))

[1] 0.1814054

II. Distribution of the sample mean

Assume that the probability distribution of a laboratory marker in a population has a normal distribution with mean 75 and standard deviation 15 and that we measure this marker in 144 individuals. We compute the mean of those 144 values.

What is the distribution of this sample mean? Give the mean as well as the standard deviation.

As the sample size is large (N=144), the sample mean approximately follows:

\[ M \sim \mathcal N(\hat{\text{M}}, \frac{\text{SD}}{\sqrt{\text N}}) \] where M is the sample mean, \(\hat M\) is the estimate of M, and SD is the sample standard deviation. Hence the answer is

\[ M \sim \mathcal N(75, \frac{15}{\sqrt{144}}) = \mathcal N(75, 1.25) \]

Assuming that the population true mean is \(\mu\) = 75. We no do collect a sample of size 144. The mean of this sample is M. What is the probability that this sample mean M differs by more than 2 units from the population mean \(\mu\) = 75?

Answer We want to calculate the probability that M < 75-2 or M > 75+2, i.e, P(M<73 or M>77). We use code given in Exercise I.

pnorm(75-2,mean=75,sd=1.25) + 
  pnorm(75+2,mean=75,sd=1.25,lower.tail=F)

[1] 0.1095986

We visualised what we just calculated.

We are running into a problem. We failed to collect the planned sample size of N = 144. In fact, we only managed to collect a very small handful of samples. That mean we could not approximate a Normal distribution anymore. We have to use a t distribution. Given the function to calculate a cumulative probabily of a t distribution is R is pstudent_t(x, df, mu, sigma) (instead of pnorm(x, mean, sd) as we have seen before). Recalculate the probability in question 2 the scenarios below. Compare that with what you observe from the Normal observation, what can you conclude?

N = 4
N = 9
N = 25
N = 144 (but don’t use the Normal approximation.)

Answer The number of degrees of freedom for t distribution is the sample size minus 1 (N-1). The calculations are hence done as below:

# The pstudent_t function does not exist in base R. We have to load a package first
library(ggdist)

Warning: package 'ggdist' was built under R version 4.5.2

# N = 5
pstudent_t(75-2, df=3, mu=75,sigma=1.25) +
  pstudent_t(75+2, df=3, mu=75,sigma=1.25,lower.tail=F)

[1] 0.2079048

# N = 9
pstudent_t(75-2, df=8, mu=75,sigma=1.25) + 
  pstudent_t(75+2, df=8, mu=75,sigma=1.25,lower.tail=F)

[1] 0.1482661

# N = 25
pstudent_t(75-2, df=24, mu=75,sigma=1.25) + 
  pstudent_t(75+2, df=24, mu=75,sigma=1.25,lower.tail=F)

[1] 0.1226814

# N = 144
pstudent_t(75-2, df=142, mu=75,sigma=1.25) + 
  pstudent_t(75+2, df=143, mu=75,sigma=1.25,lower.tail=F)

[1] 0.1118134

We notice that the more samples we collected, the lighter the tails and the nearer to a Normal distribution a \(t_{N-1}\) is. However even with a sample size as large as 144, there is still a tiny difference (0.112 vs 0.110).

III. Confidence interval for a proportion

We continue with the exercise from this morning, in which we considered a new chemotherapy. In a trial of 50 patients, 16 had a tumor response. This morning we tested the null hypothesis that the new therapy does not increase the tumor response probability to a value higher than the 20% of the currently available chemotherapies. Now we focus on the 95% confidence interval.

Calculate a two-sided 95% confidence interval for the true tumor response probability. In \(\textsf{R}\), this is done via function prop.test. Use the fact that the confidence interval is equal to the estimate \(\pm 2 \times\) SE, with SE the standard error of the sample proportion.

Answer The calculation is as below:

Tip

Contrary to the default in \(\textsf{R}\), the Yates’ correction for continuity might be too conservative. It is safe to not use that.

prop.test(x = 16, # number of positive response/success
          n = 50, # total number of patients involved,
          conf.level = 0.95, # confidence level = 1-alpha, 
          correct=FALSE 
          )$conf.int

[1] 0.2075822 0.4581030
attr(,"conf.level")
[1] 0.95

Note that in \(\textsf R\), there are many methods to approximate the confidence interval and p-value, as you might have known. To name a few:

Wilson method (via prop.test)
Exact method (via binom.test)
And Wald’s asymptotic method

When in doubt, use prop.test but one may want to be cautious if p-value \(\approx 0.05\). It is advisable to investigate all three methods with Hmisc::binconf.

library(Hmisc)

Warning: package 'Hmisc' was built under R version 4.5.1


Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

binconf(16, 50, method="all")

           PointEst     Lower     Upper
Exact          0.32 0.1952042 0.4669938
Wilson         0.32 0.2075822 0.4581030
Asymptotic     0.32 0.1907018 0.4492982

Oopsie! Only the Wilson method gives a confidence interval that doesn’t cover 0.2, and hence would reject \(H_0\). Note: If you look at the p-value of binom.test, you see that it is smaller than 0.05, which would lead to a rejection of \(H_0\). When you read the Details section in the help file of the function, you notice that there is more than one way to compute the confidence interval. So apparently the procedure used can give a confidence interval that is not in correspondence with the p-value. It again shows that a decision based on p<0.05 can depend on subtle variations in the procedures followed

IV. Inference on the change in white cell count

The dataset bmData.csv contains selected variables from 300 patients with confirmed bacterial meningitis. They were randomized to either adjunctive dexamethasone therapy or placebo. In this exercise we study the change in CSF total white cell count between baseline and follow-up in the dexamethasone arm.

Have a look at the description of the dataset and import the dataset bmData.csv into R with variable name bmData and use tbl_summary to summarise the dataset (except patId) by treatment group (variable group).

library(gtsummary)

Warning: package 'gtsummary' was built under R version 4.5.1

bmData <- read.csv('https://raw.githubusercontent.com/oucru-biostats/IntroductionToBiostatistics2024/main/data/bmData.csv') # HINT: change the function name to read a csv file (Exercise I)
tbl_summary(bmData[,-1],  # remove the first column, which is the patient ids
            by = group,   # do the summary by group
            missing_text = '(missing)')

Characteristic	dexamethasone N = 143¹	placebo N = 157¹
age	43 (33, 58)	42 (29, 52)
sex
F	43 (30%)	37 (24%)
M	100 (70%)	120 (76%)
gcs	12 (9, 15)	13 (10, 15)
paresis	16 (11%)	9 (5.7%)
op.csf	20 (14, 35)	20 (16, 28)
(missing)	35	28
wc.csf	3,600 (1,000, 7,800)	2,808 (1,175, 8,045)
(missing)	0	1
glu.csf	19 (10, 36)	20 (10, 37)
wc.csf.fup	828 (249, 2,800)	945 (459, 2,129)
(missing)	9	13
death.6mo	9 (6.3%)	22 (14%)
¹ Median (Q1, Q3); n (%)

Create a dataset that restricts observations to the dexamethasone group (e.g. using the subset function; look at the examples in the help file for filter if you need to learn how to use it).

Make a numerical summary of CSF total white cell count at each of the time points: wc.csf at baseline and wc.csf.fup at follow-up. What do you find regarding the distributions and value ranges?

Answer Both have skewed distributions with only values larger than zero

bmData.dex <- subset(bmData, 
                     group=="dexamethasone") # what group you want to see?
summary(bmData.dex$wc.csf)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    1000    3600    6048    7790   48000

summary(bmData.dex$wc.csf.fup)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   16.0   250.8   827.5  3858.8  2757.5 84000.0       9

Now make a histogram to describe the variation visually. Does CSF total white cell count at each of the time points approximately have a Normal distribution?
If not: consider a suitable transformation of the variable and make the histograms again.

# Load the packages first
library(ggplot2)
library(patchwork) # This is a nice package to align plots together

Warning: package 'patchwork' was built under R version 4.5.2

plt.baseline <- ggplot(bmData.dex, aes(wc.csf)) +
  geom_histogram(boundary=0) +
  ggtitle("Baseline")
plt.fup      <- ggplot(bmData.dex, aes(wc.csf.fup)) +
  geom_histogram(boundary=0) +
  ggtitle("Follow-up")

## to show the two figures next to each other we simply use the "+" as defined in patchwork
plt.baseline + plt.fup

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 9 rows containing non-finite outside the scale range
(`stat_bin()`).

The plots clearly indicate a skewd distribution. For strictly positive variables with right skewness (i.e., having a wiggling tail on the right), a very effective transformation is the logarithm. We use the log 10 transformation. We adjust the number of bin in the histogram to make it clearer.

Distribution on log-scale doesn’t look perfectly normal, but it’s better than the original scale.

plt.baseline.log <- ggplot(bmData.dex, aes(log10(wc.csf))) +
  geom_histogram(bins=15) +
  ggtitle("Baseline (log scale)")
plt.fup.log      <- ggplot(bmData.dex, aes(log10(wc.csf.fup))) +
  geom_histogram(bins=15) +
  ggtitle("Follow-up (log scale)")

## the "/" is defined in patchwork as a new line
(plt.baseline + plt.fup) / (plt.baseline.log + plt.fup.log)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).

We shall create a variable that contains the change in CSF total white cell count between baseline and follow-up. Do the same for the change in the transformed values. Have a look at the distribution of the difference (diff.wc and diff.log.wc), both for the original variable as well as for the transformed variable.

Answer:

If we plot baseline and follow-up values separately, their distributions are quite skewed
If we plot the absolute difference of the original variable, the distribution is more symmetric (because of the substraction extreme values can occur in both directions) but there are many outliers.
The distribution of the difference in log-values is closer to normal.

# Original scale
bmData.dex$diff.wc <- bmData.dex$wc.csf.fup-bmData.dex$wc.csf

# Transformed scale
transform_fn <- log10 # ? what transform shall we take
bmData.dex$diff.log.wc <- transform_fn(bmData.dex$wc.csf.fup) - transform_fn(bmData.dex$wc.csf)

plt.diffwc <- ggplot(bmData.dex, aes(diff.wc)) +
  geom_histogram(bins=15) +
  ggtitle("Differences")
plt.difflogwc <- ggplot(bmData.dex, aes(diff.log.wc)) +
  geom_histogram(bins=15)  +
  ggtitle("Differences (log scale)")

plt.diffwc + plt.difflogwc

Warning: Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).

Compute the 95% confidence interval for the change in transformed value of CSF total white cell count, based on the quantiles of the Student’s t distribution. In \(\textsf{R}\), the t.test function can be used. Is there a change in CSF total white cell count between baseline and follow-up? Compare the results with those if the change in the original values is studied.

Answer

# t-test for the difference on original scale
t.test(bmData.dex$diff.wc)


    One Sample t-test

data:  bmData.dex$diff.wc
t = -2.1218, df = 133, p-value = 0.03571
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -4535.491  -159.121
sample estimates:
mean of x 
-2347.306

# t-test for the difference on log10 scale
t.test(bmData.dex$diff.log.wc)


    One Sample t-test

data:  bmData.dex$diff.log.wc
t = -7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.662089 -0.374681
sample estimates:
mean of x 
-0.518385

You might also want to plot the comparison for insight. This is done via function plotttest in package nhstplot

# Load the package first
library(nhstplot)

Warning: package 'nhstplot' was built under R version 4.5.2

# t-test on the original scale
orig.ttest.plot <- plotttest(t.test(bmData.dex$diff.wc))

# t-test on the log scale
log.ttest.plot <- plotttest(t.test(bmData.dex$diff.log.wc))

orig.ttest.plot + log.ttest.plot

Do the calculations to obtain the 95% confidence interval yourself and compare results with those from the t.test function. Use the function introduced in Exercise II. There are missing values, which we need to remove before we can apply the functions

mean.diff <- mean(bmData.dex$diff.log.wc, na.rm=TRUE)
se.diff   <- 
  sqrt(var(bmData.dex$diff.log.wc, na.rm=TRUE))/
  sqrt(143-9) # There are 9 missing values.
qstudent_t(p = c(0.025,0.975), # quantile to calculate
           df= 133,            # The number of degrees of freedom is N-1=133
           mu = mean.diff,
           sigma = se.diff)

[1] -0.662089 -0.374681