Introduction to Medical Statistics 2026
Exercises Class 5
Statistics for the comparison of groups (continuous data)

Author

Nguyen Lam Vuong and the Biostatistics team

Published

March 23, 2026

Exercise i)

We use the dataset cmTbmData.csv. It contains information on 201 patients with meningitis from 4 different patient groups. For this session, we will restrict attention to HIV-positive patients.

  1. Import the dataset and create a new dataset cmTbm.hiv which contains HIV-positive patients only. We also load the ggplot2 package to plot beautiful graphs, and the patchwork package which allows us to easily plot graphs next to each other.
library(ggplot2)
library(patchwork)
# Import the dataset
cmTbm <- read.csv("https://raw.githubusercontent.com/oucru-biostats/IntroductionToBiostatistics2024/main/data/cmTbmData.csv")

# Create a new dataset cmTbm.hiv that contains HIV-positive patients only
cmTbm.hiv <- subset(cmTbm,hiv==1)
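As a quick sanity check on the subsetting step, the sketch below uses a small made-up data frame (the object `toy` and all its values are purely illustrative, not taken from cmTbmData.csv). It shows that `subset()` drops rows where the condition evaluates to `NA`, whereas logical indexing with `[` keeps them as rows filled with `NA`.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(id  = 1:5,
                  hiv = c(1, 0, NA, 1, 0),
                  age = c(25, 31, 40, 28, 35))

# subset() treats NA in the condition as FALSE, so row 3 is dropped
toy.hiv <- subset(toy, hiv == 1)

# Logical indexing keeps the NA case as a row filled with NA
toy.idx <- toy[toy$hiv == 1, ]

nrow(toy.hiv)   # rows with hiv == 1 only
nrow(toy.idx)   # includes one extra all-NA row
```

With the course data, `table(cmTbm.hiv$diagnosis, useNA = "ifany")` would show how many patients remain in each diagnosis group.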
  1. We want to test whether there is a difference in age between HIV-positive patients with cryptococcal meningitis (CM) and those with tuberculous meningitis (TBM). What should we do?
  1. First, visualise and compare the age distribution in the two groups using a boxplot and a density plot. Do you think it is reasonable to assume that the age distribution is normal in both groups?
ggplot(cmTbm.hiv, aes(diagnosis,age)) + geom_boxplot() + 
  geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

ggplot(cmTbm.hiv, aes(x = age, fill = diagnosis)) + geom_density(alpha = 0.7)
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_density()`).

## alternative: frequency polygon (not available via ggplotgui)
ggplot(cmTbm.hiv, aes(x = age)) + geom_freqpoly(aes(colour = diagnosis),binwidth = 5)
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

  1. Perform a t-test of the hypothesis that the mean age is the same in both groups. Use the t.test function. You can pass the formula age~diagnosis as the first argument, with the outcome (dependent) variable on the left-hand side and the grouping (independent) variable on the right-hand side. Try to understand the output of the t.test function. What do you conclude?

Next, remove the person aged 62 and check whether anything changes.

t.test(age ~ diagnosis, data = cmTbm.hiv) 

    Welch Two Sample t-test

data:  age by diagnosis
t = -1.363, df = 101.48, p-value = 0.1759
alternative hypothesis: true difference in means between group CM and group TBM is not equal to 0
95 percent confidence interval:
 -4.5856262  0.8504538
sample estimates:
 mean in group CM mean in group TBM 
         26.46000          28.32759 
# remove the person aged 62
t.test(age ~ diagnosis, data = cmTbm.hiv, subset = age<60)

    Welch Two Sample t-test

data:  age by diagnosis
t = -2.212, df = 104.03, p-value = 0.02915
alternative hypothesis: true difference in means between group CM and group TBM is not equal to 0
95 percent confidence interval:
 -4.9173369 -0.2684478
sample estimates:
 mean in group CM mean in group TBM 
         25.73469          28.32759 
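To help read the `t.test` output, the following sketch recomputes the Welch statistic, the degrees of freedom, the p-value and the confidence interval by hand. It uses simulated ages (the vectors `x` and `y`, their sizes and their parameters are assumptions for illustration, not the course data) and compares the manual results with what `t.test` returns.

```r
set.seed(1)
# Simulated ages in two groups (illustrative only)
x <- rnorm(50, mean = 26, sd = 7)
y <- rnorm(58, mean = 28, sd = 6)

# Standard error of the difference in means, allowing unequal variances
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t.manual  <- (mean(x) - mean(y)) / se
df.manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p.manual  <- 2 * pt(-abs(t.manual), df.manual)
ci.manual <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df.manual) * se

# Same quantities from t.test (Welch is the default, var.equal = FALSE)
fit <- t.test(x, y)
c(manual = t.manual, t.test = unname(fit$statistic))
c(manual = df.manual, t.test = unname(fit$parameter))
```

The non-integer degrees of freedom in the output (for example df = 101.48 above) come from this Welch-Satterthwaite approximation.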
  1. Test the hypothesis again, now with the Wilcoxon rank-sum test, using the wilcox.test function (the formula interface works here as well). What happens if we remove the person aged 62 from the test?
# Wilcoxon rank-sum test
wilcox.test(age~diagnosis, data=cmTbm.hiv)

    Wilcoxon rank sum test with continuity correction

data:  age by diagnosis
W = 1210, p-value = 0.1392
alternative hypothesis: true location shift is not equal to 0
# remove the person aged 62
wilcox.test(age~diagnosis, data=cmTbm.hiv, subset=age<60)

    Wilcoxon rank sum test with continuity correction

data:  age by diagnosis
W = 1152, p-value = 0.09244
alternative hypothesis: true location shift is not equal to 0
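A sketch of why a single extreme value can affect the two tests differently: the t-test works with means and variances, which one large observation can shift noticeably, while the Wilcoxon rank-sum test uses only the ranks. The ages below are made up for illustration and are not the course data.

```r
# Hypothetical ages in one group (illustrative values only)
ages     <- c(21, 23, 24, 25, 26, 27, 28, 30)
ages.out <- c(ages, 62)   # add one extreme value

# The mean and standard deviation react strongly to the extreme value ...
c(mean(ages), mean(ages.out))
c(sd(ages), sd(ages.out))

# ... whereas the median barely moves
c(median(ages), median(ages.out))

# In a rank-based test, 62 simply receives the largest rank,
# exactly as any other value above 30 would
rank(ages.out)
rank(c(ages, 31))
```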
  1. We compare white cell count in blood (bldwcc) and in CSF (csfwcc) between the two groups, using both the t-test and the Wilcoxon rank-sum test. Are there significant differences between HIV-positive patients with CM and those with TBM? Do the variables need to be transformed before performing a t-test? If yes, please do so.
# First, draw boxplots and calculate several summary statistics
summary(cmTbm.hiv$bldwcc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.010   5.317   7.155   8.143   9.748  21.900       1 
p1 <- ggplot(cmTbm.hiv, aes(diagnosis, bldwcc)) + geom_boxplot() + 
  geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')
p2 <- ggplot(cmTbm.hiv, aes(diagnosis, log10(bldwcc))) + geom_boxplot() + 
  geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')
p1 + p2
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

summary(cmTbm.hiv$csfwcc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0    23.5   115.5   422.4   545.5  3960.0       1 
p1 <- ggplot(cmTbm.hiv, aes(diagnosis, csfwcc)) + geom_boxplot() + 
  geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')
p2 <- ggplot(cmTbm.hiv, aes(diagnosis, log10(csfwcc + 1))) + geom_boxplot() + 
  geom_jitter(size = 1, alpha = 0.5, width = 0.25, colour = 'red')
p1 + p2
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
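The exercise asks for both tests, and for a transformation where needed. A minimal sketch of the comparison follows, using simulated right-skewed counts (the data frame `sim`, its group sizes and its parameters are assumptions, not the course data). It also illustrates that the Wilcoxon rank-sum test gives the same result before and after a log transformation, because the transformation preserves the ordering of the values.

```r
set.seed(42)
# Simulated skewed outcome in two groups (illustrative only)
sim <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 40),
  wcc       = c(rlnorm(40, meanlog = 3, sdlog = 1.2),
                rlnorm(40, meanlog = 4, sdlog = 1.2))
)

# t-test on the raw scale and on the log10 scale
# (+ 1 guards against log10(0), as done for csfwcc in the boxplots)
t.raw <- t.test(wcc ~ diagnosis, data = sim)
t.log <- t.test(log10(wcc + 1) ~ diagnosis, data = sim)

# Wilcoxon rank-sum test on both scales
w.raw <- wilcox.test(wcc ~ diagnosis, data = sim)
w.log <- wilcox.test(log10(wcc + 1) ~ diagnosis, data = sim)

c(t.raw = t.raw$p.value, t.log = t.log$p.value)
c(w.raw = w.raw$p.value, w.log = w.log$p.value)  # same p-value: ranks are unchanged
```

With the course data, the analogous calls would be `t.test(log10(bldwcc) ~ diagnosis, data = cmTbm.hiv)` and `wilcox.test(bldwcc ~ diagnosis, data = cmTbm.hiv)`, and likewise for csfwcc with `log10(csfwcc + 1)`.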

Exercise ii)

We use the dataset bmData.csv, containing selected variables from 300 patients with confirmed bacterial meningitis who were randomized to either adjunctive dexamethasone therapy or placebo.

  1. Import the dataset with both treatment groups and check the distribution of CSF total white cell count at baseline and at follow-up. Compute transformed variables if needed.
# Import dataset
bmData <- read.csv("https://raw.githubusercontent.com/oucru-biostats/IntroductionToBiostatistics2024/main/data/bmData.csv")

summary(bmData$wc.csf)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1    1085    3160    6252    7830   64000       1 
summary(bmData$wc.csf.fup)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   16.0   352.5   902.5  2883.4  2375.0 84000.0      22 
# draw histogram
p1 <- ggplot(bmData, aes(wc.csf)) + geom_histogram()
p2 <- ggplot(bmData, aes(wc.csf.fup)) + geom_histogram()
p1 + p2
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 22 rows containing non-finite outside the scale range
(`stat_bin()`).

# perform a log-transformation
bmData$log.wc <- log10(bmData$wc.csf)
bmData$log.wc.fup <- log10(bmData$wc.csf.fup)

# draw histogram of the log-transformed variables
p1 <- ggplot(bmData, aes(log.wc)) + geom_histogram()
p2 <- ggplot(bmData, aes(log.wc.fup)) + geom_histogram()
p1 + p2
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 22 rows containing non-finite outside the scale range
(`stat_bin()`).
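The summaries above report 1 missing baseline value and 22 missing follow-up values. A paired analysis can only use patients with both measurements, so it is worth counting the complete pairs first. The sketch below shows the idea on a made-up data frame (`toy.pairs` and its values are hypothetical); the degrees of freedom of the paired t-test equal the number of complete pairs minus one.

```r
# Hypothetical paired measurements with missing values (illustration only)
toy.pairs <- data.frame(
  base = c(3.2, 3.9, NA, 4.1, 3.5, 3.8, 4.4),
  fup  = c(2.9, 3.1, 2.7, NA, 3.0, 3.6, 3.5)
)

# Number of patients with both measurements available
n.complete <- sum(complete.cases(toy.pairs$base, toy.pairs$fup))
n.complete

# The paired t-test drops incomplete pairs without warning
fit <- t.test(toy.pairs$base, toy.pairs$fup, paired = TRUE)
fit$parameter   # df = n.complete - 1
```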

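  1. Test whether the change in log-transformed CSF white cell count between baseline and follow-up differs from zero in the dexamethasone group, using the paired t-test. Compare the result with the one-sample t-test applied to the difference.
# paired t-test: baseline vs. follow-up in the dexamethasone group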
t.test(subset(bmData, group=="dexamethasone")$log.wc, subset(bmData, group=="dexamethasone")$log.wc.fup, paired = TRUE)

    Paired t-test

data:  subset(bmData, group == "dexamethasone")$log.wc and subset(bmData, group == "dexamethasone")$log.wc.fup
t = 7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.374681 0.662089
sample estimates:
mean difference 
       0.518385 
# alternative formulation
with( subset(bmData, group=="dexamethasone"), t.test(log.wc, log.wc.fup, paired = TRUE))

    Paired t-test

data:  log.wc and log.wc.fup
t = 7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.374681 0.662089
sample estimates:
mean difference 
       0.518385 
# Equivalent alternative: One sample t.test to test whether difference is 0
t.test(subset(bmData, group=="dexamethasone")$log.wc - subset(bmData, group=="dexamethasone")$log.wc.fup)

    One Sample t-test

data:  subset(bmData, group == "dexamethasone")$log.wc - subset(bmData, group == "dexamethasone")$log.wc.fup
t = 7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.374681 0.662089
sample estimates:
mean of x 
 0.518385 
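# this can also be formulated as follows (note: follow-up minus baseline here,
# so the estimate, t statistic and confidence limits change sign)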
t.test(log.wc.fup - log.wc ~ 1, data = bmData, subset = group=="dexamethasone")

    One Sample t-test

data:  log.wc.fup - log.wc
t = -7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.662089 -0.374681
sample estimates:
mean of x 
-0.518385 
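Because the tests were run on log10-transformed counts, the mean difference is a difference of logs, which back-transforms to a ratio of geometric means. The sketch below applies this to the estimate and confidence limits printed above for baseline minus follow-up; only the back-transformation step is new, and the object names are illustrative.

```r
# Estimate and 95% CI on the log10 scale, copied from the paired t-test output
log.diff <- 0.518385
log.ci   <- c(0.374681, 0.662089)

# Back-transform: 10^(mean difference of log10 values) = ratio of geometric means
ratio    <- 10^log.diff
ratio.ci <- 10^log.ci

round(ratio, 1)      # about 3.3: baseline counts are roughly 3.3 times follow-up counts
round(ratio.ci, 1)   # confidence limits on the ratio scale
```

Reporting the ratio with its confidence interval is often easier to interpret than a difference on the log scale.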
  1. Does the change in CSF total white cell counts differ between the two randomized groups?
# t-test using the formula structure, with the difference as outcome.
t.test(log.wc.fup - log.wc ~ group, data = bmData)

    Welch Two Sample t-test

data:  log.wc.fup - log.wc by group
t = -0.88503, df = 265.48, p-value = 0.3769
alternative hypothesis: true difference in means between group dexamethasone and group placebo is not equal to 0
95 percent confidence interval:
 -0.2726660  0.1035552
sample estimates:
mean in group dexamethasone       mean in group placebo 
                 -0.5183850                  -0.4338296 
# Or Wilcoxon test
wilcox.test(log.wc.fup-log.wc ~ group, data = bmData)

    Wilcoxon rank sum test with continuity correction

data:  log.wc.fup - log.wc by group
W = 8980, p-value = 0.3674
alternative hypothesis: true location shift is not equal to 0