Introduction to Medical Statistics 2026
Exercise 6 – Simple Linear Regression
Data Analysis and Model Diagnostics
In this exercise, we use the package ggplot2.
Exercise i) (Simple Linear Regression – HIV-negative CM patients) As in the exercises of day 1, we use the dataset cmTbmData.csv containing information on 201 patients with meningitis from 4 different patient groups. However, for this session, we will restrict attention to the 49 HIV-negative patients with cryptococcal meningitis. We will examine how well blood white cell count can predict CSF white cell count in this group.
- Import the dataset cmTbmData.csv and create a new data.frame cm.hivneg which contains HIV-negative patients with cryptococcal meningitis only.
- Perform a linear regression with CSF white cell count as the outcome (response variable) and blood white cell count as a covariable (explanatory variable) and interpret the output. Calculate the 95% confidence intervals for the regression coefficients. Use the functions lm, summary, and confint. What do you conclude from the model results?
Add the fitted regression line to the scatterplot. (You can use the GUI in the ggplotgui package to obtain the scatterplot with the fitted regression line.) By looking at the plot, do you think that the model assumptions are fulfilled?
Answer: The test of the null hypothesis that the coefficient for bldwcc is zero has a p-value of 0.008. Hence, assuming that the model assumptions hold, there is evidence of an association between blood and CSF white cell count. However, several points lie far above the fitted line, whereas no points lie comparably far below it. This asymmetry suggests that the residuals are right-skewed and do not follow a normal distribution.
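The import, fit, and plot steps can be sketched as follows. Because the real file is not reproduced here, the sketch first writes a simulated stand-in CSV to a temporary file; the column names group and hiv and their codings are assumptions (only bldwcc appears in the text, and csfwcc is inferred from log10.csfwcc), so adapt them to the actual columns of cmTbmData.csv. The numbers it prints are not the results quoted in the answer.

```r
# Simulated stand-in for cmTbmData.csv (values are NOT the real data).
# Hypothetical columns and codings: group ("CM"/"TBM"), hiv ("neg"/"pos").
set.seed(1)
n <- 201
sim <- data.frame(
  group  = sample(c("CM", "TBM"), n, replace = TRUE),
  hiv    = sample(c("neg", "pos"), n, replace = TRUE),
  bldwcc = exp(rnorm(n, mean = log(9), sd = 0.4))
)
sim$csfwcc <- exp(3 + 0.8 * log(sim$bldwcc) + rnorm(n, sd = 1))
csv.file <- tempfile(fileext = ".csv")
write.csv(sim, csv.file, row.names = FALSE)

# Import the data and keep HIV-negative cryptococcal meningitis patients only
cm.tbm <- read.csv(csv.file)
cm.hivneg <- subset(cm.tbm, group == "CM" & hiv == "neg")

# Linear regression: CSF white cell count on blood white cell count
fit <- lm(csfwcc ~ bldwcc, data = cm.hivneg)
print(summary(fit))

# 95% confidence intervals for intercept and slope
ci <- confint(fit, level = 0.95)
print(ci)

# Scatterplot with the fitted least-squares line (skipped if ggplot2 is missing)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cm.hivneg, aes(x = bldwcc, y = csfwcc)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  print(p)
}
```

With the real data, replace csv.file by the path to cmTbmData.csv and drop the simulation lines.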
- Create diagnostic plots for the fitted model using plot(fit). Interpret the residuals. Do they indicate any problems regarding the assumptions of the linear regression model? (Further explanation of diagnostic plots in R is available at http://data.library.virginia.edu/diagnostic-plots/ and https://easystats.github.io/performance/.)
Answer: The residual plots also look poor. Some residuals are much larger than 0. From the Q-Q plot we conclude that the residuals do not follow a normal distribution.
An alternative is to use the resid_panel function from the ggResidpanel package. Using the argument plots = "R" gives the same four plots.
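A minimal sketch of both ways of drawing the diagnostic plots is given here. The data frame is a simulated stand-in for cm.hivneg (right-skewed by construction, not the real data), and the ggResidpanel call is skipped when the package is not installed.

```r
# Simulated stand-in for cm.hivneg (NOT the real data)
set.seed(1)
cm.hivneg <- data.frame(bldwcc = exp(rnorm(49, mean = log(9), sd = 0.4)))
cm.hivneg$csfwcc <- exp(3 + 0.8 * log(cm.hivneg$bldwcc) + rnorm(49, sd = 1))
fit <- lm(csfwcc ~ bldwcc, data = cm.hivneg)

# Base R: the four standard diagnostic plots in a 2 x 2 grid
old.par <- par(mfrow = c(2, 2))
plot(fit)
par(old.par)

# ggResidpanel: the same four panels, if the package is available
if (requireNamespace("ggResidpanel", quietly = TRUE)) {
  print(ggResidpanel::resid_panel(fit, plots = "R"))
}

# Numerical complement to the plots: range of the standardized residuals.
# A maximum much larger in size than the minimum points to right skew.
res.std <- rstandard(fit)
print(summary(res.std))
```

When run with Rscript rather than interactively, the plots are written to a PDF file in the working directory.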
- Create two new variables log10.bldwcc and log10.csfwcc containing log10-transformed values of the original data and then perform steps b) and c) again for the log-transformed variables. What do you conclude?
Answer d:
1. The diagnostic plots look better, although there are some suspicious points.
2. Q-Q plot: the distribution looks closer to normal, except for one outlier.
3. The “Residuals vs Leverage” plot suggests that there is one individual who may have a large impact on the parameter estimates.
4. There is still a fairly strong suggestion that blood white cell count is related to CSF white cell count, although the p-value has gone up quite a bit.
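The transformation and refit might look like the sketch below, again on a simulated stand-in for cm.hivneg rather than the real data. The object name fit.log is an assumption made for illustration. Note that log10 is only defined for strictly positive counts, so the sketch checks this first.

```r
# Simulated stand-in for cm.hivneg (NOT the real data)
set.seed(1)
cm.hivneg <- data.frame(bldwcc = exp(rnorm(49, mean = log(9), sd = 0.4)))
cm.hivneg$csfwcc <- exp(3 + 0.8 * log(cm.hivneg$bldwcc) + rnorm(49, sd = 1))

# log10 of zero is -Inf, so make sure all counts are strictly positive
stopifnot(all(cm.hivneg$bldwcc > 0), all(cm.hivneg$csfwcc > 0))

# New log10-transformed variables
cm.hivneg$log10.bldwcc <- log10(cm.hivneg$bldwcc)
cm.hivneg$log10.csfwcc <- log10(cm.hivneg$csfwcc)

# Refit on the log10 scale and repeat summary, confidence intervals, diagnostics
fit.log <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg)
print(summary(fit.log))
print(confint(fit.log, level = 0.95))

old.par <- par(mfrow = c(2, 2))
plot(fit.log)
par(old.par)
```

On this scale the slope describes the change in log10 CSF white cell count per unit increase in log10 blood white cell count, that is, per tenfold increase in blood white cell count.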
- The “Residuals vs Leverage” plot suggests that there’s one individual that may have a large impact on the parameter estimates. Identify this point and perform steps b) and c) again for the log-transformed data with that observation removed. Comments? Do you see any other individual that may not follow the model assumptions?
Answer e:
1. The individual in row number 149 is particularly extreme.
2. Fit the model again without that individual. The estimates change quite a lot, the p-value has gone down again, and the diagnostic plots look better.
3. Now the individual with row number 126 may be problematic with respect to the Q-Q plot and the “Residuals vs Fitted” plot. Again, things change quite a bit. However, we should be cautious about removing individuals. When to stop is partly subjective. Report your decisions in your analysis, and keep in mind that estimates and p-values can be quite sensitive to such decisions.
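One possible way to identify the influential observation numerically, rather than reading its label off the plot, is to take the largest Cook's distance. This is a sketch on simulated log10-scale data with one extreme point deliberately planted in row 49; the names fit.log, cm.reduced, and fit.reduced are assumptions for illustration, and the row numbers 149 and 126 from the answer refer to the real data only.

```r
# Simulated log10-scale stand-in for cm.hivneg (NOT the real data)
set.seed(1)
cm.hivneg <- data.frame(log10.bldwcc = rnorm(49, mean = 1, sd = 0.15))
cm.hivneg$log10.csfwcc <- 1.3 + 0.8 * cm.hivneg$log10.bldwcc + rnorm(49, sd = 0.4)
# Plant one high-leverage point far from the trend in row 49
cm.hivneg[49, ] <- c(1.8, 0.2)

fit.log <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.hivneg)

# Cook's distance combines residual size and leverage for each observation
cd <- cooks.distance(fit.log)
worst <- names(which.max(cd))   # row name of the most influential observation
print(worst)

# Refit without that observation, selecting by row name
cm.reduced <- cm.hivneg[rownames(cm.hivneg) != worst, ]
fit.reduced <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.reduced)

# Compare the estimates with and without the observation
print(rbind(all = coef(fit.log), reduced = coef(fit.reduced)))
print(summary(fit.reduced))

old.par <- par(mfrow = c(2, 2))
plot(fit.reduced)
par(old.par)
```

Selecting by row name rather than by position keeps the original row numbers intact in the reduced data, so the labels in the new diagnostic plots still match the full dataset.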
Alternative solution:
- What CSF white cell count does the model from e) predict for a patient with a white cell count in blood of 10x10^3/mm³, i.e., with log10.bldwcc = 1? Calculate a 95% prediction interval for log10.csfwcc in a patient with log10.bldwcc = 1.
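A prediction interval can be obtained from predict with interval = "prediction". The sketch below uses simulated log10-scale data in place of the reduced dataset from e), and the name fit.e is an assumption for illustration. A prediction interval is wider than the confidence interval for the mean response at the same covariate value, because it also accounts for the residual variation of a single new patient.

```r
# Simulated log10-scale stand-in for the reduced data from e) (NOT the real data)
set.seed(1)
cm.reduced <- data.frame(log10.bldwcc = rnorm(48, mean = 1, sd = 0.15))
cm.reduced$log10.csfwcc <- 1.3 + 0.8 * cm.reduced$log10.bldwcc + rnorm(48, sd = 0.4)
fit.e <- lm(log10.csfwcc ~ log10.bldwcc, data = cm.reduced)

# New patient with blood white cell count 10 x 10^3/mm^3, i.e. log10.bldwcc = 1
new.patient <- data.frame(log10.bldwcc = 1)

# Point prediction and 95% prediction interval on the log10 scale
pred <- predict(fit.e, newdata = new.patient,
                interval = "prediction", level = 0.95)
print(pred)

# For comparison: 95% confidence interval for the mean response (narrower)
conf <- predict(fit.e, newdata = new.patient,
                interval = "confidence", level = 0.95)
print(conf)

# Back-transform to the original CSF white cell count scale
print(10^pred)
```

The back-transformed interval limits are valid on the original scale, while the back-transformed point prediction estimates the median rather than the mean CSF white cell count if the log-scale errors are symmetric.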