Introduction to Medical Statistics 2026
Exercises Class 5
Statistics for the comparison of groups (continuous data)

Author

Nguyen Lam Vuong and the Biostatistics team

Published

March 23, 2026

Exercise i)

We use the dataset cmTbmData.csv. It contains information on 201 patients with meningitis from 4 different patient groups. For this session, we will restrict attention to HIV-positive patients.

Import the dataset and create a new dataset cmTbm.hiv which contains HIV-positive patients only. We also load the ggplot2 package to plot beautiful graphs, and the patchwork package which allows us to easily plot graphs next to each other.
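The import and subsetting step could be sketched as follows. This is only an illustration under stated assumptions: the column name `hiv` and its coding ("positive"/"negative") are hypothetical, and a small simulated file stands in for cmTbmData.csv so that the code runs on its own.

```r
# Simulated stand-in for cmTbmData.csv (hypothetical values and coding)
set.seed(1)
sim <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 20),
  hiv       = rep(c("positive", "negative"), times = 20),  # assumed column name
  age       = round(rnorm(40, mean = 33, sd = 8))
)
csv_path <- tempfile(fileext = ".csv")
write.csv(sim, csv_path, row.names = FALSE)

# With the real file this would be read.csv("cmTbmData.csv")
cmTbm <- read.csv(csv_path)

# Keep HIV-positive patients only
cmTbm.hiv <- subset(cmTbm, hiv == "positive")
table(cmTbm.hiv$diagnosis)
```

With the real data, check the actual name and coding of the HIV variable (for example with `str(cmTbm)`) before subsetting.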

We want to test whether there is a difference in age between HIV-positive patients with cryptococcal meningitis (CM) and those with tuberculous meningitis (TBM). What should we do?

First, visualise and compare the age distribution in the two groups using a boxplot and a density plot. Do you think it is reasonable to assume that the age distribution is normal in both groups?
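One possible way to draw the two plots side by side with ggplot2 and patchwork is sketched below. The data are simulated (the real cmTbm.hiv is assumed to have the columns `age` and `diagnosis`), so the shapes shown are not those of the actual data.

```r
library(ggplot2)
library(patchwork)

# Simulated stand-in for cmTbm.hiv (values are arbitrary)
set.seed(2)
cmTbm.hiv <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 50),
  age       = round(20 + rgamma(100, shape = 3, scale = 4))
)

# Boxplot of age by diagnosis
p_box <- ggplot(cmTbm.hiv, aes(x = diagnosis, y = age)) +
  geom_boxplot() +
  labs(x = "Diagnosis", y = "Age (years)")

# Density of age, one curve per diagnosis
p_dens <- ggplot(cmTbm.hiv, aes(x = age, colour = diagnosis)) +
  geom_density() +
  labs(x = "Age (years)", y = "Density")

# patchwork: `+` places the two plots next to each other
p_box + p_dens
```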

Answer: The distribution of age in each diagnosis group is slightly skewed to the right. There is one outlier in the CM group, a person aged 62.

Perform a t-test of the hypothesis that mean age is equal in the two groups. Use the t.test function. You can use the formula specification age~diagnosis as the first argument, with the outcome (dependent) variable on the left-hand side and the grouping (independent) variable on the right-hand side. Try to understand the output of the t.test function. What do you conclude?
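A minimal sketch of the call, using simulated data in place of cmTbm.hiv (so the numbers will not match the exercise):

```r
# Simulated stand-in for cmTbm.hiv (values are arbitrary)
set.seed(3)
cmTbm.hiv <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 40),
  age       = c(rnorm(40, mean = 32, sd = 7), rnorm(40, mean = 34, sd = 7))
)

# Outcome on the left of ~, grouping variable on the right.
# By default t.test() does not assume equal variances (Welch test).
tt <- t.test(age ~ diagnosis, data = cmTbm.hiv)
tt

# Individual parts of the output
tt$estimate   # group means
tt$conf.int   # 95% CI for the difference in means
tt$p.value
```

Setting `var.equal = TRUE` would give the classical two-sample t-test with a pooled variance instead.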

Next, remove the person aged 62 and check whether anything changes.
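One way to repeat the test without the outlier is the `subset` argument of the formula interface. The sketch below uses made-up ages with a single value of 62 in the CM group; it illustrates the mechanics only, not the real result.

```r
# Made-up stand-in for cmTbm.hiv with one person aged 62 in the CM group
cmTbm.hiv <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 30),
  age       = c(rep(24:38, length.out = 29), 62,
                rep(26:40, length.out = 30))
)

# Full data
t_all <- t.test(age ~ diagnosis, data = cmTbm.hiv)

# Same test, excluding the person aged 62
t_no62 <- t.test(age ~ diagnosis, data = cmTbm.hiv, subset = age != 62)

c(all = t_all$p.value, without_62 = t_no62$p.value)
```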

Answer: The p-value of 0.18 gives no strong indication of a difference in age. If we remove the person aged 62, the p-value drops considerably and now suggests a difference.

Test the hypothesis again, now with the Wilcoxon rank-sum test, using the wilcox.test function (here we can use the formula structure as well). What happens if we remove the person aged 62 in the test?
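The Wilcoxon rank-sum test can be run with the same formula and `subset` arguments. The sketch below again uses made-up ages; `exact = FALSE` is chosen here only because tied values rule out an exact p-value.

```r
# Made-up stand-in for cmTbm.hiv with one person aged 62 in the CM group
cmTbm.hiv <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 30),
  age       = c(rep(24:38, length.out = 29), 62,
                rep(26:40, length.out = 30))
)

# Rank-sum test on the full data (normal approximation because of ties)
w_all <- wilcox.test(age ~ diagnosis, data = cmTbm.hiv, exact = FALSE)

# Same test, excluding the person aged 62
w_no62 <- wilcox.test(age ~ diagnosis, data = cmTbm.hiv,
                      subset = age != 62, exact = FALSE)

c(all = w_all$p.value, without_62 = w_no62$p.value)
```

Because the test uses ranks only, the person aged 62 counts as the largest observation regardless of how extreme the value is, which is why a single outlier tends to affect it less than the t-test.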

Answer: With the Wilcoxon rank-sum test, the p-value is close to that of the t-test on the full data set. Removing the one person lowers the p-value, but not as much as with the t-test.

We could conclude that there’s no strong indication that the age distribution is different. There are two further things to consider:

Is the difference in mean age clinically relevant?

What is the interpretation of the test? Do we test for a difference in age between HIV-positive individuals with CM and TBM in some larger population? If so, which population?

If we are only interested in this specific sample, performing the test and computing confidence intervals and p-values is irrelevant. We then merely describe the age distribution in this sample, and the result cannot be generalized to another setting or a larger population.

We compare white cell count both in blood (bldwcc) and in CSF (csfwcc) between the two groups, using both the t-test and the Wilcoxon rank-sum test. Are there significant differences between HIV-positive patients with CM or TBM? Do the variables need to be transformed before performing a t-test? If yes, please do so.

Answer: Log-transformation is a good idea for csfwcc, because it has a very skewed distribution. Note that there is still an outlier in the TBM group. For bldwcc a transformation is not needed. Notice that csfwcc contains values of 0, for which the logarithm is undefined. If you decide to log-transform, add a small constant to the original values first.
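A sketch of the transformation with an offset, on simulated counts that include zeros. The offset of 1 is an assumption made for this example; the source does not say which constant was used.

```r
# Simulated stand-in for cmTbm.hiv with skewed counts and two zeros
set.seed(6)
cmTbm.hiv <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 40),
  csfwcc    = c(0, 0, round(rlnorm(38, meanlog = 3, sdlog = 1.2)),
                round(rlnorm(40, meanlog = 5, sdlog = 1.2)))
)

# log(0) is -Inf, so add a small constant first (here 1, an assumption)
cmTbm.hiv$log.csfwcc <- log(cmTbm.hiv$csfwcc + 1)

# t-test on the log scale
t.test(log.csfwcc ~ diagnosis, data = cmTbm.hiv)
```

The choice of constant can influence the result when many values are 0 or very small, so it is worth checking that the conclusion does not depend on it.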

Both markers seem to differ by infection type. All tests convincingly show that both bldwcc and csfwcc are higher in TBM patients (all p<0.0001).

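t.test gives you the CI (but note that this is a CI for the difference in means of the log-transformed data if we use the log-transformed values). For the Wilcoxon test it doesn’t matter whether the data are log-transformed or not, because the transformation preserves the ranks. The wilcox.test function does not provide CIs by default; with conf.int = TRUE it reports a CI for the location shift.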
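These points can be checked on simulated data. The sketch below uses strictly positive simulated values (so no offset is needed) and shows that the Wilcoxon p-value is unchanged by the log-transformation, how to request a CI from wilcox.test, and how a CI on the log scale can be back-transformed.

```r
# Simulated, strictly positive stand-in for bldwcc (values are arbitrary)
set.seed(7)
d <- data.frame(
  diagnosis = rep(c("CM", "TBM"), each = 30),
  bldwcc    = c(rlnorm(30, meanlog = 2.0, sdlog = 0.4),
                rlnorm(30, meanlog = 2.3, sdlog = 0.4))
)

# The log is monotone, so ranks (and the rank-sum test) are unchanged
w_raw <- wilcox.test(bldwcc ~ diagnosis, data = d)
w_log <- wilcox.test(log(bldwcc) ~ diagnosis, data = d)
all.equal(w_raw$p.value, w_log$p.value)

# CI for the location shift on request
w_ci <- wilcox.test(bldwcc ~ diagnosis, data = d, conf.int = TRUE)
w_ci$conf.int

# t-test CI on the log scale, back-transformed to a ratio of geometric means
t_log <- t.test(log(bldwcc) ~ diagnosis, data = d)
exp(t_log$conf.int)
```

The back-transformation to a ratio of geometric means is exact only when the plain logarithm is used; with an added constant it is an approximation.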

Exercise ii)

We use the dataset bmData.csv, containing selected variables from 300 patients with confirmed bacterial meningitis who were randomized to either adjunctive dexamethasone therapy or placebo.

Import the dataset with both treatment groups and check the distribution of CSF total white cell count at baseline and at follow-up. Compute the transformed variables if needed.

Answer: We saw that CSF total white cell count at baseline and at follow-up are highly skewed in the dexamethasone group. Therefore, we perform a log-transformation first (note that all values were >0). Keep in mind that these are paired variables (WCC in CSF measured at baseline and at follow-up come from the same patients).
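A sketch of the transformation step on simulated data. The column names `group`, `csfwcc.bl` and `csfwcc.fu` are hypothetical; the real names in bmData.csv may differ.

```r
# Simulated stand-in for bmData.csv (hypothetical column names and values)
set.seed(8)
n <- 60
bm <- data.frame(
  group     = rep(c("dexamethasone", "placebo"), each = n / 2),
  csfwcc.bl = rlnorm(n, meanlog = 7.0, sdlog = 1.2),
  csfwcc.fu = rlnorm(n, meanlog = 7.3, sdlog = 1.2)
)

# A mean far above the median points to a right-skewed distribution
summary(bm$csfwcc.bl)

# Log-transform both measurements (all values are > 0 here)
bm$log.wcc.bl <- log(bm$csfwcc.bl)
bm$log.wcc.fu <- log(bm$csfwcc.fu)

# Within-patient change on the log scale
bm$log.wcc.diff <- bm$log.wcc.fu - bm$log.wcc.bl
head(bm)
```

The difference of the logs equals the log of the ratio follow-up / baseline, so the change is interpreted as a relative rather than an absolute change.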

Test whether the change in value differs from zero in the dexamethasone group, using the paired t-test. Compare the result with the one based on the one-sample t-test for the difference.

Answer: To perform the test for the dexamethasone group, we could first create a separate data set using subset(). That would save some writing and make the code more readable, but it is not needed: we can also do the subsetting when performing the test. We could use a paired t-test or a one-sample t-test to test whether the difference is 0. The latter has the advantage that we can use the formula structure of the function and select the dexamethasone subgroup via the subset argument.
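The equivalence of the two approaches can be sketched on simulated data. Column names and the group label "dexamethasone" are assumptions, and the simulated change is arbitrary and not meant to reproduce the real result.

```r
# Simulated stand-in for the log-transformed bmData (hypothetical names)
set.seed(9)
n <- 60
bm <- data.frame(
  group      = rep(c("dexamethasone", "placebo"), each = n / 2),
  log.wcc.bl = rnorm(n, mean = 7, sd = 1)
)
bm$log.wcc.fu   <- bm$log.wcc.bl + rnorm(n, mean = 0.5, sd = 0.8)
bm$log.wcc.diff <- bm$log.wcc.fu - bm$log.wcc.bl

# Option 1: paired t-test on a separate data set for the dexamethasone group
dex <- subset(bm, group == "dexamethasone")
t_paired <- t.test(dex$log.wcc.fu, dex$log.wcc.bl, paired = TRUE)

# Option 2: one-sample t-test on the difference, using formula and subset
t_one <- t.test(log.wcc.diff ~ 1, data = bm,
                subset = group == "dexamethasone")

# Both give the same test
all.equal(t_paired$p.value, t_one$p.value)
```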

The data strongly suggest that the value is higher in the follow-up measurement.

Does the change in CSF total white cell counts differ between the two randomized groups?
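One possible approach, assumed here rather than taken from the source, is to compare the within-patient change on the log scale between the two arms with a two-sample t-test. The data and column names below are simulated and hypothetical.

```r
# Simulated stand-in: change on the log scale per patient (hypothetical names)
set.seed(10)
n <- 60
bm <- data.frame(
  group        = rep(c("dexamethasone", "placebo"), each = n / 2),
  log.wcc.diff = rnorm(n, mean = 0.5, sd = 0.8)
)

# Mean change per treatment group
tapply(bm$log.wcc.diff, bm$group, mean)

# Two-sample (Welch) t-test comparing the change between the groups
t_change <- t.test(log.wcc.diff ~ group, data = bm)
t_change
```

A Wilcoxon rank-sum test on the same change variable (`wilcox.test(log.wcc.diff ~ group, data = bm)`) could serve as a check that the conclusion does not hinge on normality.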

Answer: The change seems to be comparable in both groups.