Introduction to Medical Statistics 2026
Exercise 7 – Multiple linear regression
Data Analysis and Model Diagnostics
Exercise i (Multivariable linear regression – Peru lung data set)
The dataset perulung contains data from a study of lung function among 636 children aged 7 to 10 years living in a deprived suburb of Lima, Peru. The outcome of interest is the maximum volume of air a child could breathe out in one second measured using a spirometer (forced expiratory volume, litres/second) and we aim to predict it based on other covariables. We will use the packages ggplot2, patchwork, gtsummary and ggResidpanel. Load these packages.
- Plot the outcome (fev1) against the child’s age and height. Do the associations look linear?
Answer: There seem to be positive relationships between fev1 and age as well as between fev1 and height. The associations look fairly linear. However, the associations and their linearity may no longer hold in the multivariable model.
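A minimal sketch of the two scatter plots. The perulung data are not reproduced here, so the code simulates a stand-in data frame. The column names age and height and all simulated values are assumptions for illustration only, not the study data.

```r
library(ggplot2)
library(patchwork)

# Simulated stand-in for perulung (NOT the real data).
# Column names age and height are assumed; only fev1 is named in the exercise.
set.seed(1)
n <- 636
perulung <- data.frame(age = runif(n, 7, 10))
perulung$height <- 85 + 4.5 * perulung$age + rnorm(n, sd = 5)
perulung$fev1 <- -2.4 + 0.1 * perulung$age + 0.025 * perulung$height +
  rnorm(n, sd = 0.25)

# Scatter plot with a loess smoother to judge linearity by eye
scatter_with_smooth <- function(data, xvar) {
  ggplot(data, aes(x = .data[[xvar]], y = fev1)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    labs(x = xvar, y = "fev1")
}

p_age <- scatter_with_smooth(perulung, "age")
p_height <- scatter_with_smooth(perulung, "height")

# patchwork places the two panels side by side
p_both <- p_age + p_height
p_both
```

A smoother that stays close to a straight line over the observed range supports a linear term for that covariate.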
- Perform 2 separate linear regression models of fev1 against age and height.
Answer: Convincing positive relationships for both covariables.
- Perform a multiple linear regression analysis of fev1 depending on both age and height. Compare the regression coefficients of the multiple regression with the simple regression models. Why is the coefficient for age smaller in the multiple regression model?
Calculate the 95% confidence interval for the regression coefficients.
Answer: The age effect is less pronounced after controlling for height, because part of the association between age and fev1 in the univariable model is explained by height.
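The comparison of simple and multiple regression can be sketched as follows. The data frame is simulated (age and height are built to be positively correlated, as one would expect in children). The column names age and height and all values are assumptions, not the real perulung data.

```r
# Simulated stand-in for perulung (NOT the real data); column names assumed
set.seed(1)
n <- 636
perulung <- data.frame(age = runif(n, 7, 10))
perulung$height <- 85 + 4.5 * perulung$age + rnorm(n, sd = 5)
perulung$fev1 <- -2.4 + 0.1 * perulung$age + 0.025 * perulung$height +
  rnorm(n, sd = 0.25)

# Two simple models and one multiple model
fit_age <- lm(fev1 ~ age, data = perulung)
fit_height <- lm(fev1 ~ height, data = perulung)
fit_both <- lm(fev1 ~ age + height, data = perulung)

# Age coefficient: unadjusted versus adjusted for height
age_coefs <- c(simple = unname(coef(fit_age)["age"]),
               adjusted = unname(coef(fit_both)["age"]))
age_coefs

# 95% confidence intervals for the coefficients of the multiple model
ci_both <- confint(fit_both, level = 0.95)
ci_both
```

In the simple model, the age coefficient also absorbs the part of the height effect that travels with age, which is why it shrinks once height is included.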
- What is the interpretation of the intercept in the model c)? Scale the variables, so that both the intercept and the regression coefficients are easier to interpret.
Answer: In the model above, the intercept corresponds to the predicted fev1 of a child with age 0 and height 0. This does not make sense and extrapolates far beyond the data range.
For example, scale the variables such that the intercept corresponds to a child with age 7 years and height 120 cm and that the coefficient for height corresponds to a 10 cm increase in height.
After rescaling, the estimated coefficient of age is still the same. The estimated coefficient of height increased 10-fold, but the p-value is still the same. Interpretation of the intercept makes more sense now.
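The rescaling can be sketched as below, again on simulated stand-in data. The column names age and height, the new variable names age7 and height120_10, and all values are assumptions for illustration.

```r
# Simulated stand-in for perulung (NOT the real data); column names assumed
set.seed(1)
n <- 636
perulung <- data.frame(age = runif(n, 7, 10))
perulung$height <- 85 + 4.5 * perulung$age + rnorm(n, sd = 5)
perulung$fev1 <- -2.4 + 0.1 * perulung$age + 0.025 * perulung$height +
  rnorm(n, sd = 0.25)

fit_raw <- lm(fev1 ~ age + height, data = perulung)

# Centre age at 7 years and height at 120 cm; express height per 10 cm
perulung$age7 <- perulung$age - 7
perulung$height120_10 <- (perulung$height - 120) / 10
fit_scaled <- lm(fev1 ~ age7 + height120_10, data = perulung)

# Side-by-side coefficients: same age slope, 10-fold height slope
coef_table <- cbind(raw = coef(fit_raw), scaled = coef(fit_scaled))
coef_table

# The new intercept equals the raw model's prediction at age 7, height 120 cm
pred_ref <- predict(fit_raw, newdata = data.frame(age = 7, height = 120))

# t statistics (and hence p-values) are unchanged by the linear rescaling
t_raw <- summary(fit_raw)$coefficients["height", "t value"]
t_scaled <- summary(fit_scaled)$coefficients["height120_10", "t value"]
```

Centring and rescaling is only a reparametrisation: fitted values and residuals are identical in both models.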
- Add sex as an additional covariate to the regression model. How can the coefficient for sex be interpreted? How much of the total variability in fev1 does the model explain?
Answer: The sex coefficient of 0.12 corresponds to the expected difference in fev1 between a boy and a girl of identical age and height.
R-squared is 0.48, i.e. the model explains about 48% of the total variability in fev1. This value can also be read from the output of summary.
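A sketch of how the sex coefficient and R-squared can be extracted, using simulated stand-in data. The column names age, height and sex, the factor coding girl/boy, and all values are assumptions. The model name fit6 follows the name used later in the exercise.

```r
# Simulated stand-in for perulung (NOT the real data); names and coding assumed
set.seed(1)
n <- 636
perulung <- data.frame(age = runif(n, 7, 10))
perulung$height <- 85 + 4.5 * perulung$age + rnorm(n, sd = 5)
perulung$sex <- factor(sample(c("girl", "boy"), n, replace = TRUE),
                       levels = c("girl", "boy"))
perulung$fev1 <- -2.4 + 0.1 * perulung$age + 0.025 * perulung$height +
  0.1 * (perulung$sex == "boy") + rnorm(n, sd = 0.25)

fit6 <- lm(fev1 ~ age + height + sex, data = perulung)

# Expected difference boy minus girl at identical age and height
sex_effect <- coef(fit6)["sexboy"]
sex_effect

# R-squared from summary() and recomputed by hand as 1 - RSS / TSS
r2 <- summary(fit6)$r.squared
rss <- sum(residuals(fit6)^2)
tss <- sum((perulung$fev1 - mean(perulung$fev1))^2)
r2_manual <- 1 - rss / tss
c(summary = r2, manual = r2_manual)
```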
- Perform appropriate diagnostic plots for the model in e) using the ggResidpanel package and the code resid_interact(fit6, plots = "R"). Is there any evidence that the assumptions of the regression model are violated? There is one individual that has fairly extreme residuals in all four plots. Can you find it? What happens if you refit the model without that individual?
Answer: The individual with the extreme residuals is in row 275. We can check whether removing this row has a big impact on the conclusions (it does not).
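One way to locate an extreme observation and refit without it is sketched below. The data are simulated, and an artificial outlier is planted in row 275 purely to mimic the situation described in the exercise. Column names and all values are assumptions. The static resid_panel call is only run if ggResidpanel is installed.

```r
# Simulated stand-in for perulung (NOT the real data); names and coding assumed
set.seed(1)
n <- 636
perulung <- data.frame(age = runif(n, 7, 10))
perulung$height <- 85 + 4.5 * perulung$age + rnorm(n, sd = 5)
perulung$sex <- factor(sample(c("girl", "boy"), n, replace = TRUE),
                       levels = c("girl", "boy"))
perulung$fev1 <- -2.4 + 0.1 * perulung$age + 0.025 * perulung$height +
  0.1 * (perulung$sex == "boy") + rnorm(n, sd = 0.25)
# Artificial outlier, planted only so that the sketch has something to find
perulung$fev1[275] <- perulung$fev1[275] + 2

fit6 <- lm(fev1 ~ age + height + sex, data = perulung)

# Static counterpart of resid_interact(fit6, plots = "R")
if (requireNamespace("ggResidpanel", quietly = TRUE)) {
  print(ggResidpanel::resid_panel(fit6, plots = "R"))
}

# Row with the largest absolute standardised residual
idx <- unname(which.max(abs(rstandard(fit6))))
idx

# Refit without that row and compare the coefficients
fit6_wo <- lm(fev1 ~ age + height + sex, data = perulung[-idx, ])
coef_compare <- cbind(all_rows = coef(fit6), without_row = coef(fit6_wo))
coef_compare
```

If the two coefficient columns are close, the single observation is not driving the conclusions.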
Exercise ii (Multivariable linear regression – Dengue viremia and interactions)
The dataset dengueViremia contains selected data from 121 children with dengue serotype 1 or 2 presenting to a community clinic in Ho Chi Minh City within 3 days of illness onset. In this exercise, we will investigate how the dengue serotype (DENV-1 or DENV-2) and the serology (primary or secondary infection) affect the child’s dengue viremia level on day 3. We will use log10-transformed viremia for all analyses.
- Import the dengue viremia dataset and create a boxplot of log10-viremia by serotype and serology. Is there evidence of an interaction between the effect of the serotype and the effect of serology on viremia?
Answer: Boxplots show a clear indication of an interaction. For DENV-2, secondary infections tend to have higher viremia levels, but in DENV-1, secondary infections tend to have lower viremia levels.
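A sketch of a grouped boxplot for this step. The dengueViremia data are not reproduced here, so the data frame is simulated. The column names log10_viremia, serotype and serology and all values are assumptions, chosen only to mimic the crossing pattern described in the answer.

```r
library(ggplot2)

# Simulated stand-in for dengueViremia (NOT the real data); column names assumed
set.seed(2)
n <- 121
dengue <- data.frame(
  serotype = factor(rep(c("DENV-1", "DENV-2"), length.out = n)),
  serology = factor(rep(c("primary", "primary", "secondary", "secondary"),
                        length.out = n))
)
is_d2 <- dengue$serotype == "DENV-2"
is_sec <- dengue$serology == "secondary"
dengue$log10_viremia <- 8 - 1.5 * is_d2 - 0.6 * is_sec +
  1.8 * is_d2 * is_sec + rnorm(n, sd = 1)

# One box per serotype-serology combination
p_box <- ggplot(dengue, aes(x = serotype, y = log10_viremia, fill = serology)) +
  geom_boxplot() +
  labs(x = "Serotype", y = "log10 viremia", fill = "Serology")
p_box
```

An interaction shows up as a serology difference that changes size or direction between the two serotypes.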
- Compare viremia-levels between primary and secondary infections with an appropriate test, combining both serotypes. In view of a), does this comparison make sense?
Answer: Overall comparison does not show any difference. This test is not very sensible as the opposing effects of serology in DENV-1 and DENV-2 may balance out.
- Compare viremia-levels of primary and secondary infections in the subgroups of patients with DENV-1 and DENV-2 separately with appropriate tests.
Answer: Secondary infections have lower viremia in DENV-1 and higher viremia in DENV-2.
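The overall and subgroup comparisons can be sketched as follows. The exercise does not name the test; a Welch t-test is used here as one possible choice (a Wilcoxon rank-sum test via wilcox.test would be an alternative). The data are simulated, and the column names and values are assumptions.

```r
# Simulated stand-in for dengueViremia (NOT the real data); column names assumed
set.seed(2)
n <- 121
dengue <- data.frame(
  serotype = factor(rep(c("DENV-1", "DENV-2"), length.out = n)),
  serology = factor(rep(c("primary", "primary", "secondary", "secondary"),
                        length.out = n))
)
is_d2 <- dengue$serotype == "DENV-2"
is_sec <- dengue$serology == "secondary"
dengue$log10_viremia <- 8 - 1.5 * is_d2 - 0.6 * is_sec +
  1.8 * is_d2 * is_sec + rnorm(n, sd = 1)

# Overall comparison, pooling both serotypes
test_all <- t.test(log10_viremia ~ serology, data = dengue)
test_all

# Same comparison within each serotype
tests_by_serotype <- lapply(
  split(dengue, dengue$serotype),
  function(d) t.test(log10_viremia ~ serology, data = d)
)

# Group means and p-values per serotype
subgroup_means <- sapply(tests_by_serotype, function(tt) tt$estimate)
subgroup_p <- sapply(tests_by_serotype, function(tt) tt$p.value)
subgroup_means
subgroup_p
```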
- We want to assess whether dengue serotype and serology affect log10-viremia after controlling for age and gender. Model the log10-viremia with a multiple linear regression model with the covariates age, gender, serotype, serology. What do you conclude?
Answer: DENV-2 infected individuals have lower viremia. The other variables don’t show a strong relation with viremia.
- This model fit may not be adequate because we already know that there may be an interaction between serotype and serology. Therefore, add an interaction between serotype and serology. Create an article-ready table using the tbl_regression function from the gtsummary package. Interpret the regression output. Use the predict function, or the ggpredict function from the ggeffects package, to obtain the expected value and 95% confidence intervals for each of the four serotype-serology combinations. Choose age = 11 and sex = "female" (which are the default values chosen by ggpredict).
Answer: This gives a better fit to the data (e.g. R-squared increases from 0.095 to 0.173), and the interaction term is different from 0. However, the interpretation of the estimated coefficients is now more difficult. In DENV-1, secondary infection is associated with a 0.62 lower viremia on the log10 scale. In DENV-2, secondary infection is associated with a (-0.62 + 1.84) = 1.22 higher viremia on the log10 scale. The ggeffects package is useful to report such results.
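The interaction model and the four predictions can be sketched with base R as below. The data are simulated. The column names log10_viremia, serotype, serology, age and sex and all values are assumptions, so the numbers will not match the answer above.

```r
# Simulated stand-in for dengueViremia (NOT the real data); column names assumed
set.seed(2)
n <- 121
dengue <- data.frame(
  serotype = factor(rep(c("DENV-1", "DENV-2"), length.out = n)),
  serology = factor(rep(c("primary", "primary", "secondary", "secondary"),
                        length.out = n)),
  age = sample(5:15, n, replace = TRUE),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
is_d2 <- dengue$serotype == "DENV-2"
is_sec <- dengue$serology == "secondary"
dengue$log10_viremia <- 8 - 1.5 * is_d2 - 0.6 * is_sec +
  1.8 * is_d2 * is_sec + rnorm(n, sd = 1)

# serotype * serology expands to both main effects plus their interaction
fit_int <- lm(log10_viremia ~ age + sex + serotype * serology, data = dengue)

# Effect of secondary versus primary infection within each serotype
b <- coef(fit_int)
effect_d1 <- unname(b["serologysecondary"])
effect_d2 <- unname(b["serologysecondary"] +
                      b["serotypeDENV-2:serologysecondary"])
c(DENV1 = effect_d1, DENV2 = effect_d2)

# Expected value and 95% CI for the four combinations at age 11, female
newdat <- expand.grid(serotype = levels(dengue$serotype),
                      serology = levels(dengue$serology),
                      stringsAsFactors = FALSE)
newdat$age <- 11
newdat$sex <- "female"
pred <- cbind(newdat, predict(fit_int, newdata = newdat,
                              interval = "confidence", level = 0.95))
pred
```

With the packages named in the exercise, gtsummary::tbl_regression(fit_int) would format the same model as a table, and ggeffects::ggpredict(fit_int, terms = c("serotype", "serology")) would return comparable predictions.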
- Perform diagnostic plots for the model from e).
Answer: The residual plots look reasonable. There is slight evidence of non-normal residuals in the normal Q-Q plot, but probably nothing to worry about.