Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'gghighlight' was built under R version 4.5.2
Below is a bell-shaped curve showing the Standard Normal distribution. In R, the function pnorm(x, mean, sd) computes the probability \(P(X \leq x)\) for a Normal distribution with mean mean and standard deviation sd.
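As an aside (not part of the original exercise), the same quantity can be cross-checked outside R with Python's standard-library `statistics.NormalDist`; a minimal sketch for the Standard Normal:

```python
from statistics import NormalDist

# Analogue of R's pnorm(x, mean, sd): P(X <= x) for X ~ N(mean, sd)
std_normal = NormalDist(mu=0, sigma=1)
p = std_normal.cdf(1.0)   # P(X <= 1) under the Standard Normal
print(round(p, 7))        # 0.8413447
```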
Assume that the probability distribution of a laboratory marker in a population has a normal distribution with mean 75 and standard deviation 15. Use the 68-95-99.7-rule to (approximately) calculate the following probabilities by hand. Use the pnorm function in R to confirm your approximations.
Answer: Standardized value: (90-75)/15 = 1 SD from the mean. About 68% of the values are within 1 SD of the mean, hence P(60 < X < 90) ≈ 0.68.
pnorm(90, mean=75, sd=15) - pnorm(60, mean=75, sd=15)
[1] 0.6826895
Answer: 32% of the values are more than 1 SD from the mean, 16% on each side. Hence P(X < 60) ≈ 0.16.
pnorm(60, mean=75, sd=15)
[1] 0.1586553
Answer: Same as above, as the Normal distribution is symmetric. Hence P(X > 90) = P(X < 60) = 1 - P(X < 90) ≈ 0.16.
1 - pnorm(90, mean=75, sd=15) # or pnorm(90, mean=75, sd=15, lower.tail=FALSE)
[1] 0.1586553
Answer: (105-75)/15 = 2. About 95% of the values are within 2 SD of the mean, so about 2.5% are larger than 105. Hence P(X < 60 or X > 105) ≈ 0.16 + 0.025 = 0.185.
pnorm(60,75,15) +
  (1-pnorm(105,75,15)) # P(X<60) + (1-P(X<105))
[1] 0.1814054
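The three pnorm checks above can also be reproduced outside R; a sketch using Python's standard-library `statistics.NormalDist`, with the same mean 75 and SD 15:

```python
from statistics import NormalDist

X = NormalDist(mu=75, sigma=15)   # the marker distribution from the exercise

p_between = X.cdf(90) - X.cdf(60)          # P(60 < X < 90), about 0.68
p_below   = X.cdf(60)                      # P(X < 60), about 0.16
p_tails   = X.cdf(60) + (1 - X.cdf(105))   # P(X < 60 or X > 105), about 0.185
print(round(p_between, 7), round(p_below, 7), round(p_tails, 7))
```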
Assume that the probability distribution of a laboratory marker in a population has a normal distribution with mean 75 and standard deviation 15 and that we measure this marker in 144 individuals. We compute the mean of those 144 values.
As the sample size is large (N = 144), the sample mean approximately follows:
\[ M \sim \mathcal N\left(\mu, \frac{\sigma}{\sqrt{N}}\right) \] where M is the sample mean, \(\mu\) is the population mean, and \(\sigma\) is the population standard deviation (note that, as in R, the second parameter is the standard deviation, not the variance). Hence the answer is
\[ M \sim \mathcal N\left(75, \frac{15}{\sqrt{144}}\right) = \mathcal N(75, 1.25) \]
Answer: We want to calculate the probability that M < 75-2 or M > 75+2, i.e., P(M < 73 or M > 77). We use the code pattern from Exercise I.
pnorm(75-2, mean=75, sd=1.25) +
  pnorm(75+2, mean=75, sd=1.25, lower.tail=F)
[1] 0.1095986
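The same calculation can be cross-checked outside R; a sketch using only the Python standard library, with the numbers taken from the exercise:

```python
from math import sqrt
from statistics import NormalDist

se = 15 / sqrt(144)                  # standard error: SD / sqrt(N) = 1.25
M = NormalDist(mu=75, sigma=se)      # sampling distribution of the mean
p = M.cdf(73) + (1 - M.cdf(77))      # P(M < 73 or M > 77)
print(round(p, 7))                   # 0.1095986
```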
We visualised what we just calculated.
In R, this is pstudent_t(x, df, mu, sigma) (instead of pnorm(x, mean, sd) as we have seen before). Recalculate the probability in question 2 for the scenarios below. Compare with what you observed for the Normal distribution; what can you conclude?

Answer: The number of degrees of freedom for the t distribution is the sample size minus 1 (N-1). The calculations are hence done as below:
# The pstudent_t function does not exist in base R. We have to load a package first
library(ggdist)
Warning: package 'ggdist' was built under R version 4.5.2
# N = 4
pstudent_t(75-2, df=3, mu=75, sigma=1.25) +
  pstudent_t(75+2, df=3, mu=75, sigma=1.25, lower.tail=F)
[1] 0.2079048
# N = 9
pstudent_t(75-2, df=8, mu=75, sigma=1.25) +
  pstudent_t(75+2, df=8, mu=75, sigma=1.25, lower.tail=F)
[1] 0.1482661
# N = 25
pstudent_t(75-2, df=24, mu=75, sigma=1.25) +
  pstudent_t(75+2, df=24, mu=75, sigma=1.25, lower.tail=F)
[1] 0.1226814
# N = 144
pstudent_t(75-2, df=143, mu=75, sigma=1.25) +
  pstudent_t(75+2, df=143, mu=75, sigma=1.25, lower.tail=F)
[1] 0.1118134
We notice that the larger the sample size, the lighter the tails of the \(t_{N-1}\) distribution and the closer it comes to a Normal distribution. However, even with a sample size as large as 144, there is still a tiny difference (0.112 vs 0.110).
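For readers without the ggdist package, the same tail probabilities can be approximated from scratch. Below is a sketch in Python (standard library only; nothing here is part of the original R code) that integrates the t density numerically with Simpson's rule; the two-tailed probability is 2·P(T < -1.6), since (73 - 75)/1.25 = -1.6:

```python
from math import exp, lgamma, pi, sqrt

def t_two_tail(z, df, lo=-200.0, steps=80_000):
    """2 * P(T <= -|z|) for T ~ t(df), via Simpson's rule on [lo, -|z|]."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    a, b = lo, -abs(z)
    h = (b - a) / steps
    s = pdf(a) + pdf(b)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * pdf(a + i * h)
    return 2 * s * h / 3

z = (73 - 75) / 1.25                 # standardised distance from the mean: -1.6
probs = {df: t_two_tail(z, df) for df in (3, 8, 24, 143)}
for df, p in probs.items():
    print(df, round(p, 7))           # shrinks towards the Normal value 0.1096
```

The printed probabilities decrease monotonically in df, illustrating the "lighter tails" conclusion above.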
We continue with the exercise from this morning, in which we considered a new chemotherapy. In a trial of 50 patients, 16 had a tumor response. This morning we tested the null hypothesis that the new therapy does not increase the tumor response probability to a value higher than the 20% of the currently available chemotherapies. Now we focus on the 95% confidence interval.
Calculate a two-sided 95% confidence interval for the true tumor response probability. In \(\textsf{R}\), this is done via function prop.test. Use the fact that the confidence interval is equal to the estimate \(\pm 2 \times\) SE, with SE the standard error of the sample proportion.
Answer The calculation is as below:
The Yates continuity correction, which R applies by default, can be too conservative. It is safe not to use it.
prop.test(x = 16, # number of positive responses/successes
          n = 50, # total number of patients involved
          conf.level = 0.95, # confidence level = 1 - alpha
          correct = FALSE # no continuity correction
          )$conf.int
[1] 0.2075822 0.4581030
attr(,"conf.level")
[1] 0.95
Note that in \(\textsf R\) there are many methods to approximate the confidence interval and p-value, as you might have known. To name a few: prop.test (Wilson score interval) and binom.test (exact interval). When in doubt, use prop.test, but one may want to be cautious if the p-value \(\approx 0.05\). It is advisable to investigate all three methods (exact, Wilson, asymptotic) with Hmisc::binconf.
library(Hmisc)
Warning: package 'Hmisc' was built under R version 4.5.1

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units
binconf(16, 50, method="all")
           PointEst     Lower     Upper
Exact          0.32 0.1952042 0.4669938
Wilson         0.32 0.2075822 0.4581030
Asymptotic     0.32 0.1907018 0.4492982
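These intervals can also be reproduced by hand. Below is a sketch (in Python, as a cross-check outside R) of the Wald (asymptotic) and Wilson formulas, with x = 16 successes out of n = 50 from the exercise:

```python
from math import sqrt

x, n, z = 16, 50, 1.959964                  # successes, trials, z quantile at 0.975
p = x / n                                   # point estimate: 0.32

# Asymptotic (Wald): estimate +/- z * SE
se = sqrt(p * (1 - p) / n)
wald = (p - z * se, p + z * se)

# Wilson score interval (what prop.test with correct=FALSE reports)
centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
half = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
wilson = (centre - half, centre + half)

print([round(v, 7) for v in wald])    # matches the Asymptotic row
print([round(v, 7) for v in wilson])  # matches the Wilson row
```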
Oopsie! Only the Wilson method gives a confidence interval that does not cover 0.2, and hence would reject \(H_0\). Note: if you look at the p-value of binom.test, you see that it is smaller than 0.05, which would lead to a rejection of \(H_0\). When you read the Details section in the help file of the function, you notice that there is more than one way to compute the confidence interval. So the procedure used can give a confidence interval that is not in correspondence with the p-value. It again shows that a decision based on p < 0.05 can depend on subtle variations in the procedures followed.
The dataset bmData.csv contains selected variables from 300 patients with confirmed bacterial meningitis. They were randomized to either adjunctive dexamethasone therapy or placebo. In this exercise we study the change in CSF total white cell count between baseline and follow-up in the dexamethasone arm.
Use tbl_summary to summarise the dataset (except patId) by treatment group (variable group).

library(gtsummary)
Warning: package 'gtsummary' was built under R version 4.5.1
bmData <- read.csv('https://raw.githubusercontent.com/oucru-biostats/IntroductionToBiostatistics2024/main/data/bmData.csv') # HINT: change the function name to read a csv file (Exercise I)
tbl_summary(bmData[,-1], # remove the first column, which contains the patient ids
            by = group, # do the summary by group
            missing_text = '(missing)')

| Characteristic | dexamethasone, N = 143¹ | placebo, N = 157¹ |
|---|---|---|
| age | 43 (33, 58) | 42 (29, 52) |
| sex | ||
| F | 43 (30%) | 37 (24%) |
| M | 100 (70%) | 120 (76%) |
| gcs | 12 (9, 15) | 13 (10, 15) |
| paresis | 16 (11%) | 9 (5.7%) |
| op.csf | 20 (14, 35) | 20 (16, 28) |
| (missing) | 35 | 28 |
| wc.csf | 3,600 (1,000, 7,800) | 2,808 (1,175, 8,045) |
| (missing) | 0 | 1 |
| glu.csf | 19 (10, 36) | 20 (10, 37) |
| wc.csf.fup | 828 (249, 2,800) | 945 (459, 2,129) |
| (missing) | 9 | 13 |
| death.6mo | 9 (6.3%) | 22 (14%) |
¹ Median (Q1, Q3); n (%)
Answer: Both have skewed distributions, with only values larger than zero.
bmData.dex <- subset(bmData,
                     group=="dexamethasone") # select the dexamethasone group
summary(bmData.dex$wc.csf)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1    1000    3600    6048    7790   48000
summary(bmData.dex$wc.csf.fup)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   16.0   250.8   827.5  3858.8  2757.5 84000.0       9
Now make a histogram to describe the variation visually. Does CSF total white cell count at each of the time points approximately have a Normal distribution?
If not: consider a suitable transformation of the variable and make the histograms again.
# Load the packages first
library(ggplot2)
library(patchwork) # This is a nice package to align plots together
Warning: package 'patchwork' was built under R version 4.5.2
plt.baseline <- ggplot(bmData.dex, aes(wc.csf)) +
geom_histogram(boundary=0) +
ggtitle("Baseline")
plt.fup <- ggplot(bmData.dex, aes(wc.csf.fup)) +
geom_histogram(boundary=0) +
ggtitle("Follow-up")
## to show the two figures next to each other we simply use the "+" as defined in patchwork
plt.baseline + plt.fup
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 9 rows containing non-finite outside the scale range
(`stat_bin()`).
The plots clearly indicate a skewed distribution. For strictly positive variables with right skewness (i.e., a long tail to the right), a very effective transformation is the logarithm. We use the log10 transformation. We adjust the number of bins in the histogram to make the shape clearer.
The distribution on the log scale does not look perfectly Normal, but it is better than on the original scale.
plt.baseline.log <- ggplot(bmData.dex, aes(log10(wc.csf))) +
geom_histogram(bins=15) +
ggtitle("Baseline (log scale)")
plt.fup.log <- ggplot(bmData.dex, aes(log10(wc.csf.fup))) +
geom_histogram(bins=15) +
ggtitle("Follow-up (log scale)")
## the "/" is defined in patchwork as a new line
(plt.baseline + plt.fup) / (plt.baseline.log + plt.fup.log)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).
Calculate the change between follow-up and baseline (variables diff.wc and diff.log.wc), both for the original variable as well as for the transformed variable.

Answer:
# Original scale
bmData.dex$diff.wc <- bmData.dex$wc.csf.fup-bmData.dex$wc.csf
# Transformed scale
transform_fn <- log10 # the transformation chosen above
bmData.dex$diff.log.wc <- transform_fn(bmData.dex$wc.csf.fup) - transform_fn(bmData.dex$wc.csf)
plt.diffwc <- ggplot(bmData.dex, aes(diff.wc)) +
geom_histogram(bins=15) +
ggtitle("Differences")
plt.difflogwc <- ggplot(bmData.dex, aes(diff.log.wc)) +
geom_histogram(bins=15) +
ggtitle("Differences (log scale)")
plt.diffwc + plt.difflogwc
Warning: Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 9 rows containing non-finite outside the scale range (`stat_bin()`).
The t.test function can be used. Is there a change in CSF total white cell count between baseline and follow-up? Compare the results with those obtained when the change in the original values is studied.

Answer:
# t-test for the difference on original scale
t.test(bmData.dex$diff.wc)
One Sample t-test
data: bmData.dex$diff.wc
t = -2.1218, df = 133, p-value = 0.03571
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-4535.491 -159.121
sample estimates:
mean of x
-2347.306
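As a quick consistency check (an arithmetic sketch, not part of the exercise), the printed interval can be reconstructed from the other numbers in the output: SE = mean / t, the interval's midpoint equals the sample mean, and the half-width divided by SE recovers the t critical value of roughly 1.98:

```python
# Numbers copied from the t.test output above
mean_x, t_stat = -2347.306, -2.1218
lower, upper = -4535.491, -159.121

se = mean_x / t_stat                  # standard error of the mean difference
midpoint = (lower + upper) / 2        # equals the sample mean
t_crit = (upper - lower) / 2 / se     # recovers t_{0.975, 133}, about 1.98
print(round(se, 1), round(midpoint, 3), round(t_crit, 3))
```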
# t-test for the difference on log10 scale
t.test(bmData.dex$diff.log.wc)
One Sample t-test
data: bmData.dex$diff.log.wc
t = -7.1351, df = 133, p-value = 5.623e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.662089 -0.374681
sample estimates:
mean of x
-0.518385
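Because the test was done on the log10 scale, the estimate is easiest to interpret after back-transforming. A sketch (not part of the original solution): 10 raised to the mean log10 difference gives the geometric mean of the follow-up/baseline ratios:

```python
# Mean log10 difference and its CI, copied from the t.test output above
mean_log = -0.518385
ci_log = (-0.662089, -0.374681)

ratio = 10 ** mean_log                       # geometric mean fold change
ci_ratio = tuple(10 ** v for v in ci_log)
print(round(ratio, 3), [round(v, 3) for v in ci_ratio])
```

So at follow-up the white cell count is, geometrically on average, about 30% of its baseline value (roughly a 70% reduction), with a 95% CI from about 22% to 42%.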
We visualise the tests with function plotttest in package nhstplot.

# Load the package first
library(nhstplot)
Warning: package 'nhstplot' was built under R version 4.5.2
# t-test on the original scale
orig.ttest.plot <- plotttest(t.test(bmData.dex$diff.wc))
# t-test on the log scale
log.ttest.plot <- plotttest(t.test(bmData.dex$diff.log.wc))
orig.ttest.plot + log.ttest.plot
Recalculate the confidence interval without using the t.test function. Use the function introduced in Exercise II. There are missing values, which we need to remove before we can apply the functions.
mean.diff <- mean(bmData.dex$diff.log.wc, na.rm=TRUE)
se.diff <-
sqrt(var(bmData.dex$diff.log.wc, na.rm=TRUE))/
  sqrt(143-9) # 143 patients, of whom 9 have missing values
qstudent_t(p = c(0.025, 0.975), # quantiles to calculate
           df = 133, # degrees of freedom: 134 non-missing values - 1 = 133
           mu = mean.diff,
           sigma = se.diff)
[1] -0.662089 -0.374681