Introduction to Medical Statistics 2026
Exercises Class I
Data, Variables, Descriptive Statistics

Author

Ronald Geskus

Published

March 23, 2026

I. Characteristics of a numeric variable

Days off at a mining plant. Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average. The manager is under pressure to increase the amount of paid time off. However, he does not want to give more days off to the workers because that would be costly. Instead he decides to fire 10 employees in such a way as to raise the average number of days off that are reported by his employees. In order to achieve this goal, should he fire employees who have the largest number of paid vacation days, those with the smallest number, or those who have about the average number of days off?

Answer: To increase the mean, he should remove the lowest values. Hence, the manager should fire the employees who have the fewest days of paid vacation.
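A quick numeric check of this reasoning can be sketched in R. The vacation days below are made up for illustration (chosen so that the mean is 35); they are not data from the exercise.

```r
# Hypothetical paid vacation days for 20 workers (illustrative values only)
days <- c(20, 22, 25, 28, 30, 30, 32, 33, 35, 35,
          35, 36, 38, 40, 40, 42, 42, 44, 45, 48)
mean_all <- mean(days)

# Fire the 10 workers with the fewest days off and recompute the mean
sorted <- sort(days)
mean_after_low <- mean(sorted[-(1:10)])

# For comparison: fire the 10 workers with the most days off
mean_after_high <- mean(sorted[1:10])

c(all = mean_all, drop_lowest = mean_after_low, drop_highest = mean_after_high)
```

Removing workers close to the average would leave the mean nearly unchanged, which is why only the low end helps the manager.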

Infant mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The frequency histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. The height of the bar depicts the frequency of that range of mortality rates.

Guess the first quartile and the median from the histogram.
Would you expect the mean of this data set to be smaller or larger than the median? Explain your reasoning.

Answer: The first quartile is between 0 and 10 because the first bar already contains more than 25% of the countries. The histogram doesn’t allow you to tell exactly where it is. The median is between 10 and 20. The distribution is skewed to the right: the few very large values pull the mean upwards, hence the median is smaller than the mean.
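The relation between skewness, mean and median can be illustrated with a small sketch. The values below are not the infant mortality data; they are 222 evenly spaced quantiles of an exponential distribution with mean 30, an assumption chosen only because it gives a right-skewed shape.

```r
# Deterministic right-skewed "sample": 222 quantiles of an exponential
# distribution with mean 30 (an assumption chosen for illustration only)
rates <- qexp(ppoints(222), rate = 1 / 30)

quantile(rates, probs = c(0.25, 0.5))  # first quartile and median
mean(rates)                            # pulled upwards by the long right tail
```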

Distributions and appropriate statistics. For each of the following, describe whether you expect the distribution to be approximately symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a “typical” observation in the data, and whether the variability of observations would be better represented by the standard deviation or quartiles/IQR.

DENV viremia in a data set where the first quartile of the values is at 350,000 copies/mL, the median at 450,000, the third quartile at 1,000,000 and 4.3% of individuals have more than 6,000,000 copies/mL.
DENV viremia in a data set where the first quartile of the values is at 300,000 copies/mL, the median at 600,000, the third quartile at 900,000 and 0.5% of individuals have more than 1,200,000 copies/mL.
Weight distribution of adults living in Ho Chi Minh City
The number of days that individuals stay in ICU at HTD.

Answers:

Right skewed
Approximately symmetric, although the values above 1,200,000 cannot be mirrored by values below zero
Variables like weight often have a fairly symmetric distribution in a population. But there is no guarantee; in the US, for example, the distribution is probably skewed to the right because of the high prevalence of obesity
Right skewed; many stay in ICU a few days only, but there are some that stay in ICU for a long time. Time/duration variables often have a distribution that is skewed to the right

For left and right skewed distributions, median and IQR/quartiles are better. For symmetric distributions, mean and sd will do well.
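Why median and IQR are preferred for skewed data can be seen by adding a single extreme value to a small sample. The ICU stays below are invented for illustration, and `summarise_var` is a hypothetical helper defined only for this sketch.

```r
# Hypothetical ICU stays in days (illustrative values only)
stay <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7)
stay_long <- c(stay, 60)   # add one patient with a very long stay

# Helper for this sketch: the four summaries discussed above
summarise_var <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

rbind(original = summarise_var(stay),
      with_long_stay = summarise_var(stay_long))
```

The mean and standard deviation change substantially after adding one patient, whereas the median and IQR move only slightly.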

Histograms and box plots. Compare the two plots below. What characteristics of the distribution are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot but not in the histogram?

Answer: From the histogram you can see that the distribution is bi-modal: it has two peaks. The boxplot shows the exact location of the median and the most extreme values (outliers).

Today we will use a dataset that contains information on 201 patients with meningitis from 4 different patient groups, determined by whether the patients have tuberculous or cryptococcal meningitis and whether they have HIV coinfection. You can find a description of the variables in the file cmTbmData_description.txt.

II. Data import

We start running the code cells. To run a single line, you use “command + Return” in macOS and “Ctrl + Enter” in Windows/Linux. To run the entire code cell, you can simply click the “Run code” button, or use the keyboard shortcut “Shift + Return/Enter”.

Import the data set cmTbmDataWithErrors.csv. Is the data set “tidy”? Do you agree with the naming of the columns? Describe the type of each of the variables.

Answer: The data set is tidy. Column names are lowercase. The exception is groupLong; it may be more consistent to also write BldNeut, BldLym etc. Are the last six names informative enough? Some of the variables that have been coded numerically should be interpreted as categorical (hiv, group, sex).

With respect to the type of variables:

the last six columns are all continuous
age is continuous as well, although it can also be seen as discrete (in this data set it only gives the years as integer values)
hiv, diagnosis and sex are binary
code, group and groupLong are nominal. Note that the latter two represent the same information.
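A minimal sketch of importing a CSV file and inspecting the variable types is given below. So that it runs on its own, it writes a small toy file first. The column names follow the exercise text, but all values (and the 0/1 coding of hiv) are invented for illustration; with the course data you would pass the path to cmTbmDataWithErrors.csv to `read.csv` instead.

```r
# Toy stand-in for the course file; values are invented for illustration
toy <- data.frame(
  code   = c("P001", "P002", "P003", "P004"),
  age    = c(34, 51, 27, 45),
  sex    = c(1, 2, 1, 2),
  hiv    = c(0, 1, 1, 0),      # assumed coding, not confirmed by the exercise
  group  = c(1, 2, 3, 4),
  csfwcc = c(0, 120, 850, 15)
)
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)

# Import and inspect: one row per patient, one column per variable
cmTbm_toy <- read.csv(csv_file)
str(cmTbm_toy)             # storage type of every column
sapply(cmTbm_toy, class)   # sex, hiv and group are read as numeric codes
```

The storage type reported by R is not the same as the statistical type: a nominal variable such as group is stored as a number here and still has to be interpreted as categorical.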

We purposely created 4 errors in this data set. Create a summary of the variables and see whether you can find them. Hint: look at the variables groupLong, sex, bldwcc and csfwcc.

Answer:

groupLong has one observation with “HIV neg- CM” and 48 with “HIV neg - CM”. A different number of spaces in the level of a categorical variable makes the levels different. “HIV neg- CM” should have an extra space. The command with(cmTbm, table(groupLong, group)) shows that they have the same value in the group column.
sex has been coded as numerical, with 1=male, 2=female. It cannot have a maximum value of 4.
The lowest value in csfwcc is negative. A count cannot be negative.
The largest value in bldwcc is 99999, which is much larger than all other values. You can observe this by clicking on the object “cmTbm” in the Environment tab and sorting the variable bldwcc by value (click on the small triangle in that column). Missing values are sometimes coded as a number that differs from the rest of the data, here 99999. This is a bad practice that is better avoided.
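The checks described above can be sketched on a toy data frame that contains the same four kinds of errors. All values are invented, and the level "HIV pos - CM" is a hypothetical label used only to have a second group.

```r
# Toy data with the same four kinds of errors (values invented for illustration)
toy_err <- data.frame(
  groupLong = c("HIV neg - CM", "HIV neg- CM", "HIV neg - CM", "HIV pos - CM"),
  sex       = c(1, 2, 4, 1),
  bldwcc    = c(7.2, 99999, 10.4, 5.9),
  csfwcc    = c(120, 0, -5, 850)
)

summary(toy_err)           # impossible minima and maxima stand out
table(toy_err$groupLong)   # near-duplicate levels show up as separate rows

# Flag each problem explicitly
bad_sex    <- !toy_err$sex %in% c(1, 2)
bad_csfwcc <- toy_err$csfwcc < 0
sentinel   <- toy_err$bldwcc == 99999

# Recode the sentinel value as a proper missing value
toy_err$bldwcc[sentinel] <- NA
```

Recoding 99999 to `NA` lets functions such as `summary` report the missing value separately instead of treating it as a measurement.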

From now on we will use the data set without the errors. Import the data.

III. Numerical summaries and data transformations

Summarize the variables age, white cell count in CSF (csfwcc) and sex. Do you think that the variables age and CSF white cell count have a skewed distribution? Have a look at the sex variable. The summary of the sex variable is probably not what you expect. What is the reason?

Make sex into a categorical variable via the factor function; use appropriate labels. Run the summary function again on the sex variable.

Answer: The mean of age is slightly higher than the median. However, they are not that far apart. We are better able to judge skewness from a figure, which we do in a later exercise. For white cell count, the mean is considerably larger than the median and the maximum is far larger than the mean, indicating that the distribution is skewed to the right. Sex is initially summarized as a numeric variable, because this is how it was coded. After making it into a categorical variable, the summary shows the number of individuals in each category.
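The effect of the factor function on the summary can be sketched with a short made-up vector of codes (1 = male, 2 = female, as in the data set; the values themselves are hypothetical):

```r
sex_num <- c(1, 2, 2, 1, 1, 2, 1)   # hypothetical codes: 1 = male, 2 = female
summary(sex_num)                    # mean and quartiles: not meaningful for a code

sex_fct <- factor(sex_num, levels = c(1, 2), labels = c("male", "female"))
summary(sex_fct)                    # counts per category
```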

Create a new variable log10.csfwcc containing log10-transformed values of white cell count in CSF and add it to the dataset. Check whether the values of log10.csfwcc make sense by applying the summary function to that variable. Do you observe anything strange? If so, what do you think has happened?

How would you solve this problem? Try it out by changing the code above, and make a summary of the logarithm of white cell count in CSF again. Is the distribution of CSF white cell count less skewed after the logarithmic transformation?

Answer: The minimum and mean value are “-Inf”. This is because some individuals had a CSF white cell count of 0, and log10(0) = -infinity. The solution is to add a small number to the variable csfwcc before taking the logarithm; the most obvious choice is adding the value 1, because log10(0 + 1) = 0, so a count of zero remains zero after the transformation. The summary suggests that the distribution is much more symmetric: mean and median are more or less equal, and minimum and maximum value are 1.5 to 2 from the central value.
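A minimal sketch of the problem and the fix, using a made-up vector of counts rather than the course data:

```r
csfwcc_toy <- c(0, 3, 15, 120, 850, 4200)   # hypothetical counts, including a zero

summary(log10(csfwcc_toy))      # minimum and mean are -Inf because log10(0) = -Inf

log10.csfwcc <- log10(csfwcc_toy + 1)
summary(log10.csfwcc)           # all values finite; a count of zero stays at zero
```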

IV. Basic visual data summaries

Most of the figures in the practicals of this course make use of the ggplot2 package, but there are often similar base R plotting functions as well. We only provide answers in ggplot2. This package is based on the “Grammar of Graphics” philosophy, which is further explained in the course Principles of Data Visualization.

In the previous section we produced a numerical summary of CSF white cell count. There was a clear suggestion that it had a skewed distribution, which became much more symmetric after we log-transformed the values. Now we see what we can learn from a histogram. Draw a histogram for white cell count in CSF (csfwcc). Vary the number of bins for the histogram (at the dots …) to see whether that changes the visual appearance. Choose the binwidth that gives a detailed visual representation without too much noise. We first need to “load” the ggplot2 package into our R session. What do you think is the role of the boundary argument?

Answer: We choose a binwidth of 200. Note that the first bar starts below 0, even though negative values are not possible. This can be prevented via the boundary argument (which is not available in the GUI). The log-transformed data are much less skewed.
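The role of the boundary argument can be sketched as follows. The data frame `toy_hist` and its values are hypothetical stand-ins for csfwcc; the code assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical right-skewed counts standing in for csfwcc
toy_hist <- data.frame(
  csfwcc = c(0, 5, 12, 30, 45, 80, 110, 150, 240, 390, 620, 1100, 1900)
)

# Default bin placement: bins are centred on multiples of the binwidth,
# so the first bin runs from -100 to 100
p_default <- ggplot(toy_hist, aes(x = csfwcc)) +
  geom_histogram(binwidth = 200)

# boundary = 0 puts a bin edge at zero, so the first bin runs from 0 to 200
p_boundary <- ggplot(toy_hist, aes(x = csfwcc)) +
  geom_histogram(binwidth = 200, boundary = 0)

p_boundary
```

Changing `binwidth` while keeping `boundary = 0` is a convenient way to try several levels of detail without the first bar extending below zero.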

Do the same for the log-transformed values. Can you tell from the figure whether the distribution of the CSF white cell count becomes less skewed after the log transformation?

Answer: It has become a bit skewed to the left, but is much less skewed than for the original values.

Repeat exercise a above, but now use a boxplot and the empirical cumulative distribution function. How does the information shown compare with the histogram?

Answer: The boxplot clearly shows the skewness, but doesn’t clearly show the downward trend in density of the values. The ECDF is harder to interpret, but shows all the individual data values.

Repeat the above exercises a. and b., but now use the empirical cumulative distribution function. How does the shown information compare with the histogram and the boxplot?

Answer: The boxplot shows the more symmetric data distribution, but doesn’t show the small peak in the beginning. The ECDF is harder to interpret, but shows all the individual data values.
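A boxplot and an ECDF can be drawn along the following lines. This is a sketch on hypothetical values standing in for csfwcc (the data frame `toy_wcc` is not part of the course data), and it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical counts standing in for csfwcc, plus the log10(x + 1) transform
toy_wcc <- data.frame(
  csfwcc = c(0, 5, 12, 30, 45, 80, 110, 150, 240, 390, 620, 1100, 1900)
)
toy_wcc$log10.csfwcc <- log10(toy_wcc$csfwcc + 1)

# Boxplot: median, quartiles and outlying values
p_box <- ggplot(toy_wcc, aes(y = log10.csfwcc)) +
  geom_boxplot()

# ECDF: one step per observation, so every individual value is visible
p_ecdf <- ggplot(toy_wcc, aes(x = log10.csfwcc)) +
  stat_ecdf(geom = "step") +
  labs(y = "Cumulative proportion")

# Base R equivalent of the ECDF, useful for reading off a proportion
Fn <- ecdf(toy_wcc$log10.csfwcc)
Fn(2)   # proportion of values with log10(csfwcc + 1) <= 2
```

Replacing `log10.csfwcc` by `csfwcc` in the two `aes()` calls gives the corresponding plots on the original scale.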