Introduction to Medical Statistics 2026
Exercises Class II
Exploratory Data Analysis
The purpose of this exercise is to find out what we can learn from exploratory data analysis.
We will investigate relationships between variables, using the same data set cmTbmData.csv as this morning. We first import the data set again.
I. Baseline table
We will start with some numerical summaries.
- Summarize the variables age, white cell count in blood (bldwcc), white cell count in cerebrospinal fluid (csfwcc) and sex by patient group (variable groupLong) in a nice tabular format. Use the tbl_summary function from the gtsummary package. First load the gtsummary package into your R session. Specify the arguments data, by and include. How many patients are in each group? Are there any missing values for these variables? What do the numbers represent?
Answer: The number of patients in each subgroup can be read off the header. Missing values are shown as “Unknown”. The numbers are explained in the footnote. Skewness of the numeric variables is harder to judge from this table, but not impossible: check whether the 25th and 75th percentiles are equally far from the median. If they are not, the distribution is probably skewed.
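The quartile check described in the answer can be sketched in base R. The data below are simulated from a log-normal distribution and are not the course data; the variable names are chosen for illustration only.

```r
# Simulated right-skewed values (log-normal); NOT the course data
set.seed(1)
x <- rlnorm(200, meanlog = 4.5, sdlog = 1.2)

# 25th percentile, median and 75th percentile
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
lower_gap <- q[2] - q[1]   # median minus 25th percentile
upper_gap <- q[3] - q[2]   # 75th percentile minus median

# A ratio well above 1 suggests a right-skewed distribution
gap_ratio <- upper_gap / lower_gap
gap_ratio

# The same check after a log10 transformation
q_log <- quantile(log10(x), probs = c(0.25, 0.50, 0.75), names = FALSE)
gap_ratio_log <- (q_log[3] - q_log[2]) / (q_log[2] - q_log[1])
gap_ratio_log
```

For a right-skewed variable the upper gap is larger than the lower gap, and the two gaps become more similar after a log transformation.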
We change and polish the table a bit. We add the units of age and white cell count, using the label argument. We additionally report the mean and standard deviation for age and for both CSF and blood white cell count via the statistic argument. We change the label “Unknown” into “missing” via the missing_text argument. With binary variables like sex, it is usually sufficient to show only one of the two levels, because the value for the other level is simply the remainder. Hence, we show only one of the two levels of sex. Run the code and have a look at the results. We made one error which you will notice when looking at the output; please correct it (Hint: look at the sex specification in the label argument). Does the distribution of the variables clearly differ by patient group? Can you tell more about skewness of the three numeric variables?
Answer: Since we made sex a categorical variable, we should write “female” instead of the number 2. HIV-positive patients are younger, have a lower blood white cell count and are more likely to be male. TBM patients have a higher CSF white cell count. CSF white cell count has a skewed distribution, because the median and the mean are clearly different. Later we learn that for a variable with an approximately symmetric (“normal”) distribution we can use the rule of thumb that about 2.5% of the values lie more than two standard deviations below the mean. This cannot hold for the white cell count, because that variable cannot be negative, whereas the mean minus two standard deviations, \(213 - 2 \times 245 = -277\), is far below zero.
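As a rough sketch of how the arguments discussed above fit together, the call below uses simulated data in place of cmTbmData.csv. The group labels, the units in the labels and the use of type "continuous2" to show two summary lines per variable are assumptions made for this illustration; they are not taken from the course code.

```r
library(gtsummary)

# Simulated stand-in for cmTbmData.csv (hypothetical values and group labels)
set.seed(2)
n <- 120
d <- data.frame(
  groupLong = factor(sample(c("CM HIV-neg", "CM HIV-pos", "TBM HIV-neg", "TBM HIV-pos"),
                            n, replace = TRUE)),
  age    = round(runif(n, 18, 70)),
  bldwcc = round(rlnorm(n, log(9), 0.4), 1),
  csfwcc = round(rlnorm(n, log(100), 1.2)),
  sex    = factor(sample(c("male", "female"), n, replace = TRUE),
                  levels = c("male", "female"))
)
d$csfwcc[sample(n, 8)] <- NA  # introduce some missing CSF counts

tab <- tbl_summary(
  data = d,
  by = groupLong,
  include = c(age, bldwcc, csfwcc, sex),
  # Units are assumptions for this sketch
  label = list(
    age    ~ "Age (years)",
    bldwcc ~ "Blood white cell count (10^9/L)",
    csfwcc ~ "CSF white cell count (cells/mm3)",
    sex    ~ "Female"
  ),
  # Two summary lines per numeric variable: median (IQR) and mean (SD)
  type = c(age, bldwcc, csfwcc) ~ "continuous2",
  statistic = all_continuous2() ~ c("{median} ({p25}, {p75})", "{mean} ({sd})"),
  # Show only one level of the binary variable, referred to by its label
  value = sex ~ "female",
  missing_text = "missing"
)
tab
```

Because sex is a factor here, the level shown is selected by its label ("female") rather than by a number.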
II. Visualization of CSF white cell count by subgroup
We will switch to a graphical display of CSF white cell count, split by patient group.
- Draw a boxplot of CSF white cell count for each of the four patient groups. Add the raw data points to the plot. Do this for both the original values and the log-transformed CSF values that we created in the morning. Add an informative label to the CSF white cell count values via the ylab function. What information do these plots give you with respect to skewness of the variables? How does the distribution vary by patient group? Relate the figures to the numerical summaries that you made earlier. Can you explain the warning messages?
Answer: The boxplot also shows the skewness of the original CSF white cell count values. They become much more symmetric after the transformation. HIV-positive CM patients seem to have the lowest CSF white cell count, and both TBM groups have the highest values. On the log-transformed scale, there is some suggestion of more variation (a larger SD) in the HIV-positive TBM group than in the HIV-negative TBM group. The log-transformed CSF WCC has some outliers. The warning messages come from the missing values in csfwcc.
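A boxplot with the raw points added could look like the sketch below, which assumes the ggplot2 package and uses simulated data. The group labels and the column name log10csfwcc are hypothetical; use the name of the log-transformed variable created in the morning session.

```r
library(ggplot2)

# Simulated stand-in for the course data (hypothetical values and group labels)
set.seed(3)
n <- 120
d <- data.frame(
  groupLong = factor(sample(c("CM HIV-neg", "CM HIV-pos", "TBM HIV-neg", "TBM HIV-pos"),
                            n, replace = TRUE)),
  csfwcc = rlnorm(n, log(100), 1.2)
)
d$csfwcc[sample(n, 8)] <- NA          # missing values trigger the warnings
d$log10csfwcc <- log10(d$csfwcc)      # hypothetical name for the transformed variable

# Original scale
p_raw <- ggplot(d, aes(x = groupLong, y = csfwcc)) +
  geom_boxplot(outlier.shape = NA) +  # hide boxplot outliers; all points are drawn below
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +
  ylab("CSF white cell count")

# Log10 scale
p_log <- ggplot(d, aes(x = groupLong, y = log10csfwcc)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +
  ylab("log10 CSF white cell count")

p_raw
p_log
```

When the plots are drawn, ggplot2 warns that rows with missing values were removed, once for each layer.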
- Make a frequency polygon of the log-transformed CSF white cell count for each of the four patient groups separately, but plotted on top of each other. Distinguish between the four groups by colour. A density plot is a graphical summary that can be seen as a “smoothed” version of a histogram. Instead of the binwidth in a histogram, we now specify the level of detail via a bandwidth parameter. Make the density plots for each of the four patient groups separately, but again plotted on top of each other. We make the colours slightly transparent via the alpha argument. See what happens if you change the arguments alpha and adjust in the density plot to lower or higher values. Which plot type do you prefer and why: boxplot, frequency polygon or density plot?
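One possible way to draw both plot types with ggplot2 is sketched below on simulated data. The group labels, the column name log10csfwcc and the chosen binwidth are assumptions for illustration.

```r
library(ggplot2)

# Simulated stand-in for the course data (hypothetical values and group labels)
set.seed(4)
n <- 160
d <- data.frame(
  groupLong = factor(sample(c("CM HIV-neg", "CM HIV-pos", "TBM HIV-neg", "TBM HIV-pos"),
                            n, replace = TRUE)),
  log10csfwcc = rnorm(n, mean = 2, sd = 0.5)
)

# Frequency polygon: one line per group, level of detail set by binwidth
p_freq <- ggplot(d, aes(x = log10csfwcc, colour = groupLong)) +
  geom_freqpoly(binwidth = 0.25)

# Density plot: alpha sets the transparency of the fill,
# adjust scales the default bandwidth (above 1 is smoother, below 1 more detailed)
p_dens <- ggplot(d, aes(x = log10csfwcc, colour = groupLong, fill = groupLong)) +
  geom_density(alpha = 0.3, adjust = 1)

p_freq
p_dens
```

Trying, for example, adjust = 0.5 and adjust = 2 shows how strongly the impression of the shape depends on the bandwidth.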
- A violin plot is an alternative to the boxplot that summarizes data in a similar fashion as the density plot. Summarize CSF white cell count by patient group in a violin plot. Give each of the four groups a different colour. This time, make a separate panel (“facet”) for males and females, next to each other. Do you see any difference between males and females?
Answer: We see a somewhat different violin shape among HIV-positive females.
Make the same violin plot, but now add the individual values.
Answer: Adding the individual points shows us that there are only a few women in that group, so we need to be careful in drawing any conclusions about the shape.
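A violin plot with facets and individual points might be built as in the sketch below. It uses simulated data, plots the log10 values (an assumption, since the exercise does not fix the scale), and the group labels and column name log10csfwcc are hypothetical.

```r
library(ggplot2)

# Simulated stand-in for the course data (hypothetical values and group labels)
set.seed(5)
n <- 160
d <- data.frame(
  groupLong = factor(sample(c("CM HIV-neg", "CM HIV-pos", "TBM HIV-neg", "TBM HIV-pos"),
                            n, replace = TRUE)),
  sex = factor(sample(c("male", "female"), n, replace = TRUE, prob = c(0.7, 0.3))),
  log10csfwcc = rnorm(n, mean = 2, sd = 0.5)
)

p_violin <- ggplot(d, aes(x = groupLong, y = log10csfwcc, fill = groupLong)) +
  geom_violin() +
  geom_jitter(width = 0.1, height = 0, alpha = 0.5) +  # individual values
  facet_wrap(~ sex) +                                  # one panel per sex, side by side
  ylab("log10 CSF white cell count")

p_violin

# Group sizes per panel: small cells make the violin shape unreliable
table(d$sex, d$groupLong)
```

The table of counts gives the same warning as the added points: a violin drawn from a handful of observations should not be over-interpreted.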
- Make a raincloud plot instead of the violin plot. Which plot type do you prefer, the violin or the raincloud?
III. Relation between blood and CSF white cell count.
- CSF white cell count is harder to measure than white cell count in blood. It would be great if we could measure blood WCC to obtain an idea of CSF WCC. Make a scatterplot of CSF white cell count (y-axis) against blood white cell count (x-axis). Use an appropriate transformation for both (you can make histograms to find the best one). Do you observe a relation between the two? Would it be feasible to predict CSF white cell count based on white cell count in blood?
Answer: By making a histogram or a boxplot you can observe that white cell count in blood doesn’t have a highly skewed distribution. In fact, using the log-transformed values makes the distribution skewed to the left. Therefore, we are not sure whether it is better to use a log transformation for blood white cell count. It may be best to make a plot both on the original scale and on the transformed scale.
From the scatterplot we conclude that there may be some relation (low blood counts can give low CSF counts), but the relation is not very strong. The relation looks nonlinear. The relation looks more linear if we use the log-transformed blood white cell count, but there are some very low values that may give an incorrect impression of the relation.
- Quantify the strength of the relation via an appropriate correlation. Try both the Pearson and Spearman rank correlation, and both with the original CSF WCC as well as the log-transformed values. Use the original values for blood WCC. What do you observe?
Answer: The Pearson correlation changes quite a bit if we use the log-transformed CSF WCC. The latter value is more appropriate for the Pearson correlation, given the skewed nature of the variable. The Spearman correlation does not change. This is to be expected, because the ranks of the values do not change under a log transformation. The Pearson correlation with the log-transformed values is quite similar to the Spearman correlation.
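The invariance of the Spearman correlation under a log transformation can be checked with a small base R sketch. The data are simulated and the strength of the relation is arbitrary; only the variable names follow the data set.

```r
# Simulated blood and CSF white cell counts; NOT the course data
set.seed(6)
n <- 100
bldwcc <- rlnorm(n, log(9), 0.4)
csfwcc <- exp(4 + 0.08 * bldwcc + rnorm(n))   # right-skewed, weakly related to bldwcc

pearson_raw  <- cor(bldwcc, csfwcc,        method = "pearson")
pearson_log  <- cor(bldwcc, log10(csfwcc), method = "pearson")
spearman_raw <- cor(bldwcc, csfwcc,        method = "spearman")
spearman_log <- cor(bldwcc, log10(csfwcc), method = "spearman")

# Pearson depends on the scale, Spearman only on the ranks
c(pearson_raw = pearson_raw, pearson_log = pearson_log)
c(spearman_raw = spearman_raw, spearman_log = spearman_log)
```

With real data that contain missing values, cor needs an additional argument such as use = "complete.obs", otherwise it returns NA.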
- Repeat exercise a. (the scatterplot), but now split up by sex and patient group. What additional information can we obtain from this figure?
Answer: Within each of the four disease groups, we do not observe a clear difference by sex. Individuals of either sex with low values on both measurements are mostly in the HIV-positive CM group. Within each combination of sex and disease group, there is little relation between blood and CSF white cell count. The relation that we saw earlier is mainly due to the males in the HIV-positive CM group, who have low CSF and blood WCC. This is an example of a phenomenon called confounding: a variable that affects both outcomes can change the observed relation between the outcomes if we do not correct for it. Note that numbers are small, so we need to be careful in drawing firm conclusions.
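The confounding mechanism can be illustrated with a small simulation in base R. This is a toy example, not the course data: the two groups and the size of the shift are hypothetical, and within each group the two variables are generated independently.

```r
set.seed(7)
n_per_group <- 100

# Hypothetical grouping variable: group "A" has lower values on BOTH measurements
group <- rep(c("A", "B"), each = n_per_group)
shift <- ifelse(group == "A", 0, 3)

# Within each group, x and y are independent of each other
x <- shift + rnorm(2 * n_per_group)
y <- shift + rnorm(2 * n_per_group)

# Correlation when the group is ignored
pooled_cor <- cor(x, y)

# Correlation within each group separately
within_cor <- sapply(split(data.frame(x, y), group), function(g) cor(g$x, g$y))

pooled_cor
within_cor
```

The pooled correlation is clearly positive, although the correlation within each group is close to zero. The group variable shifts both x and y and thereby creates the overall relation.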
IV. Graphical summary of several variables at once
- Make a pairwise summary plot for the variables age, log10 of CSF WCC and sex using the command below. What type of summaries do you see? What do you conclude with respect to the three variables?
Answer: The density plot shows that the distribution of age is not symmetric. The boxplot suggests that males are younger, so that the older males are flagged as outliers. The histogram shows that this is indeed due to the large number of males between 20 and 40 years old (the females have a more uniform distribution). The scatterplot and the Pearson correlation of 0.49 show that the (linear) relation between age and CSF white cell count is not very strong. A nice property of ggpairs is that it shows all pairwise relations in one graph, with the type of visualization depending on the variable types.
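A pairwise summary plot of this kind can be sketched with ggpairs from the GGally package, here on simulated data. The column name log10csfwcc and all values are hypothetical, and this is not necessarily the exact command used in the course.

```r
library(GGally)

# Simulated stand-in for the course data (hypothetical values)
set.seed(8)
n <- 120
d <- data.frame(
  age = round(runif(n, 18, 70)),
  log10csfwcc = rnorm(n, mean = 2, sd = 0.5),
  sex = factor(sample(c("male", "female"), n, replace = TRUE))
)

# One panel per pair of variables; the panel type depends on the variable types
p_pairs <- ggpairs(d, columns = c("age", "log10csfwcc", "sex"))
p_pairs
```

With the default settings, two numeric variables give a scatterplot and a correlation, a numeric and a categorical variable give boxplots and histograms, and the diagonal shows a density plot or a bar chart.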