Introduction to Medical Statistics 2026
Exercises Class 5
Statistics for the comparison of groups (continuous data)
Author
Nguyen Lam Vuong and the Biostatistics team
Published
March 23, 2026
Exercise i)
We use the dataset cmTbmData.csv. It contains information on 201 patients with meningitis from 4 different patient groups. For this session, we will restrict attention to HIV-positive patients.
Import the dataset and create a new dataset cmTbm.hiv which contains HIV-positive patients only. We also load the ggplot2 package to plot beautiful graphs, and the patchwork package which allows us to easily plot graphs next to each other.
library(ggplot2)
library(patchwork)
# Import the dataset
cmTbm <- read.csv("https://raw.githubusercontent.com/oucru-biostats/IntroductionToBiostatistics2024/main/data/cmTbmData.csv")
# Create a new dataset cmTbm.hiv that contains HIV-positive patients only
cmTbm.hiv <- subset(cmTbm, hiv == 1)
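It can help to check what subset() does with missing values before relying on the restricted dataset. The following is a minimal sketch on a toy data frame; the object names toy and toy.hiv and all values are hypothetical and are not taken from cmTbmData.csv.

```r
# Toy data with made-up values (hypothetical, not the real dataset)
toy <- data.frame(
  id        = 1:6,
  hiv       = c(1, 0, 1, NA, 1, 0),
  diagnosis = c("CM", "TBM", "TBM", "CM", "CM", "TBM")
)

# subset() keeps only rows where the condition is TRUE;
# rows where it is FALSE or NA are dropped
toy.hiv <- subset(toy, hiv == 1)

nrow(toy.hiv)             # number of rows kept
table(toy.hiv$diagnosis)  # group sizes after the restriction
```

Running the same two checks on cmTbm.hiv shows how many patients remain in each diagnosis group.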
We want to test whether there is a difference in age between HIV-positive patients with cryptococcal meningitis (CM) and those with tuberculous meningitis (TBM). What should we do?
First, visualise and compare the age distribution in the two groups using a boxplot and a density plot. Do you think it is reasonable to assume that the age distribution is normal in both groups?
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
ggplot(cmTbm.hiv, aes(x = age, fill = diagnosis)) +
  geom_density(alpha = 0.7)
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_density()`).
## alternative: frequency polygon (not available via ggplotgui)
ggplot(cmTbm.hiv, aes(x = age)) +
  geom_freqpoly(aes(colour = diagnosis), binwidth = 5)
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
Perform a t-test of the hypothesis that the mean age is the same in both groups. Use the t.test function. You can pass the formula age ~ diagnosis as the first argument, with the outcome (dependent) variable on the left-hand side and the grouping (independent) variable on the right-hand side. Try to understand the output of the t.test function. What do you conclude?
Next, remove the person aged 62 and check whether anything changes.
t.test(age ~ diagnosis, data = cmTbm.hiv)
Welch Two Sample t-test
data: age by diagnosis
t = -1.363, df = 101.48, p-value = 0.1759
alternative hypothesis: true difference in means between group CM and group TBM is not equal to 0
95 percent confidence interval:
-4.5856262 0.8504538
sample estimates:
mean in group CM mean in group TBM
26.46000 28.32759
# remove the person aged 62
t.test(age ~ diagnosis, data = cmTbm.hiv, subset = age < 60)
Welch Two Sample t-test
data: age by diagnosis
t = -2.212, df = 104.03, p-value = 0.02915
alternative hypothesis: true difference in means between group CM and group TBM is not equal to 0
95 percent confidence interval:
-4.9173369 -0.2684478
sample estimates:
mean in group CM mean in group TBM
25.73469 28.32759
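To see where the numbers in the t.test output come from, the Welch statistic, its degrees of freedom, the p-value, and the confidence interval can be recomputed by hand. The following is a minimal sketch on simulated ages; the vectors age.cm and age.tbm, the group sizes, and the means and standard deviations are assumptions made for illustration and are not the study data.

```r
set.seed(1)
# Simulated ages (hypothetical values, not taken from cmTbmData.csv)
age.cm  <- rnorm(50, mean = 26, sd = 6)
age.tbm <- rnorm(58, mean = 28, sd = 8)

n1 <- length(age.cm)
n2 <- length(age.tbm)
v1 <- var(age.cm) / n1    # squared standard error of the mean, group 1
v2 <- var(age.tbm) / n2   # squared standard error of the mean, group 2

diff.means <- mean(age.cm) - mean(age.tbm)
t.stat <- diff.means / sqrt(v1 + v2)
# Welch-Satterthwaite approximation to the degrees of freedom
df.welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p.val <- 2 * pt(-abs(t.stat), df.welch)
ci <- diff.means + c(-1, 1) * qt(0.975, df.welch) * sqrt(v1 + v2)

# Compare with the built-in function (Welch is the default)
res <- t.test(age.cm, age.tbm)
c(t.by.hand = t.stat, t.test = unname(res$statistic))
c(df.by.hand = df.welch, t.test = unname(res$parameter))
c(p.by.hand = p.val, t.test = res$p.value)
```

Because the standard errors depend on the sample variances, a single extreme age can inflate them noticeably, which is one reason to rerun the test without the person aged 62.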
Test the hypothesis again, now with the Wilcoxon rank-sum test, using the wilcox.test function (here we can use the formula structure as well). What happens if we remove the person aged 62 in the test?
wilcox.test(age ~ diagnosis, data = cmTbm.hiv)
Wilcoxon rank sum test with continuity correction
data: age by diagnosis
W = 1210, p-value = 0.1392
alternative hypothesis: true location shift is not equal to 0
# remove the person aged 62
wilcox.test(age ~ diagnosis, data = cmTbm.hiv, subset = age < 60)
Wilcoxon rank sum test with continuity correction
data: age by diagnosis
W = 1152, p-value = 0.09244
alternative hypothesis: true location shift is not equal to 0
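The W statistic reported by wilcox.test is based on ranks only, which is why one extreme value affects it far less than it affects the t-test. The following is a minimal sketch on simulated ages; the helper rank.sum.W and all data values are hypothetical and serve only to illustrate the calculation.

```r
set.seed(2)
# Simulated ages in whole years (hypothetical values, not the study data)
x <- round(rnorm(20, mean = 26, sd = 5))
y <- round(rnorm(25, mean = 28, sd = 5))

# W = sum of the ranks of the first group minus its smallest possible value
rank.sum.W <- function(x, y) {
  r <- rank(c(x, y))  # tied values receive midranks
  sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2
}

W.hand <- rank.sum.W(x, y)
# exact = FALSE requests the normal approximation, as used when ties are present
W.test <- unname(wilcox.test(x, y, exact = FALSE)$statistic)
c(by.hand = W.hand, wilcox.test = W.test)

# Add one extreme observation to the first group: only its rank matters,
# so W is the same for 62 and for 620, whereas the t statistic changes
W.62  <- rank.sum.W(c(x, 62), y)
W.620 <- rank.sum.W(c(x, 620), y)
t.62  <- unname(t.test(c(x, 62), y)$statistic)
t.620 <- unname(t.test(c(x, 620), y)$statistic)
c(W.62 = W.62, W.620 = W.620, t.62 = t.62, t.620 = t.620)
```

Removing an observation still changes W, because the group size and the remaining ranks change, but the size of the removed value beyond its rank plays no role.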
We compare white cell count in both blood (bldwcc) and CSF (csfwcc) between the two groups, using both the t-test and the Wilcoxon rank-sum test. Are there significant differences between HIV-positive patients with CM and those with TBM? Do the variables need to be transformed before performing a t-test? If so, transform them.
# First, draw boxplots and calculate several summary statistics
summary(cmTbm.hiv$bldwcc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.010 5.317 7.155 8.143 9.748 21.900 1
Warning: Removed 1 row containing non-finite outside the scale range (`stat_boxplot()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
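For right-skewed counts, a log transformation often makes the t-test more reasonable, while the Wilcoxon rank-sum test gives the same result on either scale because the transformation preserves the ordering. The following is a minimal sketch on simulated log-normal data; the data frame d, its columns wcc and group, and all parameter values are assumptions for illustration and are not the study data.

```r
set.seed(3)
# Simulated right-skewed cell counts (hypothetical log-normal values)
d <- data.frame(
  wcc   = c(rlnorm(40, meanlog = 4.0, sdlog = 1.2),
            rlnorm(45, meanlog = 4.5, sdlog = 1.2)),
  group = rep(c("CM", "TBM"), times = c(40, 45))
)

# t-test on the raw scale and on the log10 scale
t.raw <- t.test(wcc ~ group, data = d)
t.log <- t.test(log10(wcc) ~ group, data = d)

# Wilcoxon rank-sum test on both scales
w.raw <- wilcox.test(wcc ~ group, data = d)
w.log <- wilcox.test(log10(wcc) ~ group, data = d)

c(t.raw = t.raw$p.value, t.log = t.log$p.value)  # generally differ
c(w.raw = w.raw$p.value, w.log = w.log$p.value)  # same ranks, same p-value
```

If a variable contains zeros, log10() returns -Inf for them, so such values need to be handled before transforming.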
Exercise ii)
We use the dataset bmData.csv, containing selected variables from 300 patients with confirmed bacterial meningitis who were randomized to either adjunctive dexamethasone therapy or placebo.
Import the dataset with both treatment groups and check the distribution of CSF total white cell count at baseline and at follow-up. Compute transformed variables if needed.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 22 rows containing non-finite outside the scale range
(`stat_bin()`).
# perform a log-transformation
bmData$log.wc <- log10(bmData$wc.csf)
bmData$log.wc.fup <- log10(bmData$wc.csf.fup)
# draw histogram of the log-transformed variables
p1 <- ggplot(bmData, aes(log.wc)) + geom_histogram()
p2 <- ggplot(bmData, aes(log.wc.fup)) + geom_histogram()
p1 + p2
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 22 rows containing non-finite outside the scale range
(`stat_bin()`).
Test whether the change in value differs from zero in the dexamethasone group, using the paired t-test. Compare the result with the one based on the one-sample t-test for the difference.
# paired t-test: baseline versus follow-up in the dexamethasone group
t.test(subset(bmData, group == "dexamethasone")$log.wc, subset(bmData, group == "dexamethasone")$log.wc.fup, paired = TRUE)
Paired t-test
data: subset(bmData, group == "dexamethasone")$log.wc and subset(bmData, group == "dexamethasone")$log.wc.fup
t = 7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.374681 0.662089
sample estimates:
mean difference
0.518385
# alternative formulation
with(subset(bmData, group == "dexamethasone"), t.test(log.wc, log.wc.fup, paired = TRUE))
Paired t-test
data: log.wc and log.wc.fup
t = 7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.374681 0.662089
sample estimates:
mean difference
0.518385
# Equivalent alternative: one-sample t-test of whether the mean difference is 0
t.test(subset(bmData, group == "dexamethasone")$log.wc - subset(bmData, group == "dexamethasone")$log.wc.fup)
One Sample t-test
data: subset(bmData, group == "dexamethasone")$log.wc - subset(bmData, group == "dexamethasone")$log.wc.fup
t = 7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.374681 0.662089
sample estimates:
mean of x
0.518385
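# this can also be formulated with the formula interface
# (note the reversed sign: follow-up minus baseline, so t and the interval change sign)
t.test(log.wc.fup - log.wc ~ 1, data = bmData, subset = group == "dexamethasone")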
One Sample t-test
data: log.wc.fup - log.wc
t = -7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.662089 -0.374681
sample estimates:
mean of x
-0.518385
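The agreement between the paired t-test and the one-sample t-test on the differences follows directly from the formula t = mean(d) / (sd(d) / sqrt(n)). The following is a minimal sketch on simulated log10 values; the vectors base and fup and all parameter values are hypothetical and are not taken from bmData.csv.

```r
set.seed(4)
n <- 30
# Simulated log10 cell counts (hypothetical values, not the trial data)
base <- rnorm(n, mean = 3.3, sd = 0.6)         # baseline
fup  <- base - rnorm(n, mean = 0.5, sd = 0.8)  # follow-up

# Test statistic computed by hand from the within-patient differences
d <- base - fup
t.hand <- mean(d) / (sd(d) / sqrt(n))

paired <- t.test(base, fup, paired = TRUE)
onesmp <- t.test(d)
c(by.hand = t.hand,
  paired = unname(paired$statistic),
  one.sample = unname(onesmp$statistic))

# A mean difference on the log10 scale back-transforms to a ratio of
# geometric means (baseline relative to follow-up) on the original scale
10^unname(paired$estimate)
10^as.numeric(paired$conf.int)
```

Applied to the output above, the mean difference of 0.518 on the log10 scale corresponds to a geometric mean cell count at baseline of roughly 10^0.518 times the value at follow-up.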
Does the change in CSF total white cell counts differ between the two randomized groups?
# t-test using the formula structure, with the difference as outcome.
t.test(log.wc.fup - log.wc ~ group, data = bmData)
Welch Two Sample t-test
data: log.wc.fup - log.wc by group
t = -0.88503, df = 265.48, p-value = 0.3769
alternative hypothesis: true difference in means between group dexamethasone and group placebo is not equal to 0
95 percent confidence interval:
-0.2726660 0.1035552
sample estimates:
mean in group dexamethasone mean in group placebo
-0.5183850 -0.4338296
# Or Wilcoxon test
wilcox.test(log.wc.fup - log.wc ~ group, data = bmData)
Wilcoxon rank sum test with continuity correction
data: log.wc.fup - log.wc by group
W = 8980, p-value = 0.3674
alternative hypothesis: true location shift is not equal to 0