Introduction to Medical Statistics 2026
Exercise 6 – Simple Linear Regression
Data Analysis and Model Diagnostics
Author
Nguyen Lam Vuong and Biostatistics team
Published
March 23, 2026
Exercise i) (Simple Linear Regression – HIV-negative CM patients)
As in the exercises of day 1, we use the dataset cmTbmData.csv containing information on 201 patients with meningitis from 4 different patient groups. However, for this session, we will restrict attention to the 49 HIV-negative patients with cryptococcal meningitis. We will examine how well blood white cell count can predict CSF white cell count in this group.
Import the dataset cmTbmData.csv and create a new data.frame cm.hivneg which contains HIV-negative patients with cryptococcal meningitis only.
Perform a linear regression with CSF white cell count as the outcome (response variable) and blood white cell count as a covariable (explanatory variable) and interpret the output. Calculate the 95% confidence intervals for the regression coefficients. Use the functions lm, summary, and confint. What do you conclude from the model results?
Add the fitted regression line to the scatterplot. (You can use the GUI in the ggplotgui package to obtain the scatterplot with the fitted regression line.) By looking at the plot, do you think that the model assumptions are fulfilled?
fit <-lm(csfwcc ~ bldwcc, data = cm.hivneg)# Summarize the regression resultsummary(fit)
Call:
lm(formula = csfwcc ~ bldwcc, data = cm.hivneg)
Residuals:
Min 1Q Median 3Q Max
-341.81 -134.30 -50.36 45.82 915.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.594 77.533 0.227 0.82147
bldwcc 17.758 6.398 2.776 0.00788 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 229.3 on 47 degrees of freedom
Multiple R-squared: 0.1408, Adjusted R-squared: 0.1226
F-statistic: 7.705 on 1 and 47 DF, p-value: 0.00788
# Draw the scatterplot again (and add a regression line)ggplot(cm.hivneg, aes(bldwcc, csfwcc)) +geom_point() +geom_smooth(method ="lm")
library(ggExtra)# the ggExtra package allows to create the histograms (or boxplots) alongside the scatterplot via ggMarginal# you may have to install it firstp <-ggplot(cm.hivneg, aes(bldwcc,csfwcc)) +geom_point() +geom_smooth(method="lm")ggMarginal(p, type ="histogram", xparams=list(binwidth=2, boundary=0), yparams=list(binwidth=75, boundary=0))
Perform diagnostic plots for the fitted model using plot(fit). Interpret the residuals. Do they indicate any problems regarding the assumptions of the linear regression model? (Some further explanation of diagnostic plots in R can be obtained at http://data.library.virginia.edu/diagnostic-plots/ (https://easystats.github.io/performance/) )
plot(fit)
An alternative is to use the resid_panel function frim the ggResidpanel package. Using the argument plot=“R” gives the same four plots.
library(ggResidpanel)resid_panel(fit, plot="R")
library(performance)
Warning: package 'performance' was built under R version 4.4.3
library(see)
Warning: package 'see' was built under R version 4.4.3
library(patchwork)
Warning: package 'patchwork' was built under R version 4.4.3
# checking model assumptionscheck_model(fit)
Create two new variables log10.bldwcc and log10.csfwcc containing log10-transformed values of the original data and then perform steps b) and c) again for the log-transformed variables. What do you conclude?
# First check if there is any 0 in each wcc by seeing if the lowest value is 0summary(cm.hivneg$bldwcc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.54 7.50 10.50 10.98 13.20 26.70
summary(cm.hivneg$csfwcc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 56.0 110.0 212.7 260.0 1080.0
# Create new variables with log10-transformed valuescm.hivneg$log10.bldwcc <-log10(cm.hivneg$bldwcc)cm.hivneg$log10.csfwcc <-log10(cm.hivneg$csfwcc)# Repeat steps b) - c)fit.log <-lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg)summary(fit.log)
Call:
lm(formula = log10.csfwcc ~ log10.bldwcc, data = cm.hivneg)
Residuals:
Min 1Q Median 3Q Max
-1.92746 -0.30877 0.09044 0.35827 1.03000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3457 0.3054 4.406 6.05e-05 ***
log10.bldwcc 0.7156 0.3003 2.383 0.0213 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5428 on 47 degrees of freedom
Multiple R-squared: 0.1078, Adjusted R-squared: 0.08881
F-statistic: 5.678 on 1 and 47 DF, p-value: 0.02127
# As a variation, we add a boxplot instead of a histogramp <-ggplot(cm.hivneg, aes(log10.bldwcc,log10.csfwcc)) +geom_point()ggMarginal( p, type ="boxplot")
p +geom_smooth(method="lm")
# We use the resid\_panel function from the ggResidpanel package, and now plot all residual plotsresid_panel(fit.log,"all")
# The extreme individual is the 37th observation
The “Residuals vs Leverage” plot suggests that there’s one individual that may have a large impact on the parameter estimates. Identify this point and perform steps b) and c) again for the log-transformed data with that observation removed. Comments? Do you see any other individual that may not follow the model assumptions?
cm.hivneg["149",]
code hiv diagnosis group groupLong age sex bldwcc bldneut bldlym
149 BMD901 0 CM 4 HIV neg - CM 49 2 0.54 0.43 0.05
csfwcc csfneut csflym log10.bldwcc log10.csfwcc
149 73 64.24 7.3 -0.2676062 1.863323
# The individual in row number 149 is extreme # Note that the subset function keeps the row numbers as in the original data set cmTbm# Hence, that individual is not the 149th row in cm.hivneg# We can visualize and show its values viaggplot(cm.hivneg, aes(log10.bldwcc, log10.csfwcc)) +geom_point() +geom_point(data = cm.hivneg["149",], colour ="red", size =2)
# In fact, it is the only person with a negative value for log10.bldwcc, hence we can use that to selectsubset(cm.hivneg, log10.bldwcc <0)
code hiv diagnosis group groupLong age sex bldwcc bldneut bldlym
149 BMD901 0 CM 4 HIV neg - CM 49 2 0.54 0.43 0.05
csfwcc csfneut csflym log10.bldwcc log10.csfwcc
149 73 64.24 7.3 -0.2676062 1.863323
# Refit without that personfit.log.del <-lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg, subset =row.names(cm.hivneg) !="149")# Easier is# fit.log.del <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg, subset = log10.bldwcc > 0)# Or observing that it is individual 37# fit.log.del <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg[-37,])summary(fit.log.del)
Call:
lm(formula = log10.csfwcc ~ log10.bldwcc, data = cm.hivneg, subset = row.names(cm.hivneg) !=
"149")
Residuals:
Min 1Q Median 3Q Max
-1.8059 -0.3538 0.1124 0.3567 1.0940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7829 0.4200 1.864 0.06868 .
log10.bldwcc 1.2583 0.4090 3.076 0.00352 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5284 on 46 degrees of freedom
Multiple R-squared: 0.1706, Adjusted R-squared: 0.1526
F-statistic: 9.465 on 1 and 46 DF, p-value: 0.003522
# Now individual with row number 126 may be problematic with respect to the QQ plot and Residuals vs Fittedfit.log.del2 <-lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg, subset =!row.names(cm.hivneg) %in%c("126", "149"))summary(fit.log.del2)
Call:
lm(formula = log10.csfwcc ~ log10.bldwcc, data = cm.hivneg, subset = !row.names(cm.hivneg) %in%
c("126", "149"))
Residuals:
Min 1Q Median 3Q Max
-0.98527 -0.34672 0.07149 0.32415 1.03442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0472 0.3697 2.832 0.00689 **
log10.bldwcc 1.0356 0.3587 2.887 0.00595 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4578 on 45 degrees of freedom
Multiple R-squared: 0.1563, Adjusted R-squared: 0.1376
F-statistic: 8.336 on 1 and 45 DF, p-value: 0.005952
# the individual in row 37 is extreme# we can visualize and show its values viaggplot(cm.hivneg, aes(log10.bldwcc,log10.csfwcc)) +geom_point() +geom_point(data=cm.hivneg[37,], colour="red", size=2)
subset(cm.hivneg, log10.bldwcc<0 )
code hiv diagnosis group groupLong age sex bldwcc bldneut bldlym
149 BMD901 0 CM 4 HIV neg - CM 49 2 0.54 0.43 0.05
csfwcc csfneut csflym log10.bldwcc log10.csfwcc
149 73 64.24 7.3 -0.2676062 1.863323
# -> refit without observations 37fit.log.del.alt <-lm(log10.csfwcc~log10.bldwcc, data=cm.hivneg[-37,])summary(fit.log.del.alt)
Call:
lm(formula = log10.csfwcc ~ log10.bldwcc, data = cm.hivneg[-37,
])
Residuals:
Min 1Q Median 3Q Max
-1.8059 -0.3538 0.1124 0.3567 1.0940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7829 0.4200 1.864 0.06868 .
log10.bldwcc 1.2583 0.4090 3.076 0.00352 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5284 on 46 degrees of freedom
Multiple R-squared: 0.1706, Adjusted R-squared: 0.1526
F-statistic: 9.465 on 1 and 46 DF, p-value: 0.003522
# we can also decide to transform only the dependent variable, csfwcc (it was the extreme white# blood cell count that led to the outlier on the log scale)# remember that we don't need to assume approximate normality for the covariable, here bldwccfit.log.csf <-lm(log10.csfwcc~bldwcc, data=cm.hivneg)summary(fit.log.csf)
Call:
lm(formula = log10.csfwcc ~ bldwcc, data = cm.hivneg)
Residuals:
Min 1Q Median 3Q Max
-1.86267 -0.30318 0.09761 0.36432 1.09570
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.59164 0.17975 8.855 1.39e-11 ***
bldwcc 0.04170 0.01483 2.811 0.00718 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5317 on 47 degrees of freedom
Multiple R-squared: 0.1439, Adjusted R-squared: 0.1257
F-statistic: 7.903 on 1 and 47 DF, p-value: 0.007176
resid_panel(fit.log.csf, plots="all")
What CSF white cell count does the model from e) predict for a patient with a white cell count in blood of 10x10^3/mm³, i.e., with log10.bldwcc = 1? Calculate a 95% prediction interval for log10.csfwcc in a patient with log10.bldwcc = 1.
# Prediction of csfwcc for a patient with lo10bldwcc=1predict(fit.log.del,newdata=data.frame(log10.bldwcc=1),interval="prediction")
# The prediction interval is much wider than the confidence interval.# Slightly easier is to use the following code from the ggeffects package. Note that results are not# exactly the same because ggeffects uses the normal distribution instead of the t-distribution# You may have to install that package firstlibrary(ggeffects)ggpredict(fit.log.del)