Introduction to Medical Statistics 2026
Exercises Class I
Data, Variables, Descriptive Statistics

Author

Ronald Geskus

Published

March 23, 2026

I. Characteristics of a numeric variable

  a. Days off at a mining plant. Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average. The manager is under pressure to increase the amount of paid time off. However, he does not want to give more days off to the workers because that would be costly. Instead, he decides to fire 10 employees in such a way as to raise the average number of days off that are reported by his employees. In order to achieve this goal, should he fire employees who have the largest number of paid vacation days, those with the smallest number, or those who have about the average number of days off?
  b. Infant mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The frequency histogram below shows the distribution of estimated infant mortality rates in 2012 for 222 countries. The height of each bar shows the number of countries whose rate falls in that range.

     1. Guess the first quartile and the median from the histogram.
     2. Would you expect the mean of this data set to be smaller or larger than the median? Explain your reasoning.
  c. Distributions and appropriate statistics. For each of the following, describe whether you expect the distribution to be approximately symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a “typical” observation in the data, and whether the variability of observations would be better represented by the standard deviation or quartiles/IQR.
     1. DENV viremia in a data set where the first quartile of the values is at 350,000 copies/mL, the median at 450,000, the third quartile at 1,000,000, and 4.3% of individuals have more than 6,000,000 copies/mL.
     2. DENV viremia in a data set where the first quartile of the values is at 300,000 copies/mL, the median at 600,000, the third quartile at 900,000, and 0.5% of individuals have more than 1,200,000 copies/mL.
     3. Weight distribution of adults living in Ho Chi Minh City.
     4. The number of days that individuals stay in the ICU at HTD.
  d. Histograms and box plots. Compare the two plots below. What characteristics of the distribution are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot but not in the histogram?
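
The questions in this section can also be explored numerically. The following is a minimal sketch to experiment with, assuming a small made-up sample; the vector `x` is hypothetical and is not taken from any of the course data sets.

```r
# Made-up toy sample with a long right tail (hypothetical values, not course data)
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 60)

# Two measures of location: compare them for this sample
mean_x   <- mean(x)
median_x <- median(x)

# Two ways of describing spread: standard deviation versus quartiles/IQR
sd_x  <- sd(x)
q_x   <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr_x <- IQR(x)

print(c(mean = mean_x, median = median_x, sd = sd_x, IQR = iqr_x))
print(q_x)
```

Changing or removing the largest values in `x` and rerunning the code shows which of these statistics react strongly to extreme observations.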


Today we will use a dataset that contains information on 201 patients with meningitis from 4 different patient groups, determined by whether the patients have tuberculous or cryptococcal meningitis and whether they have HIV coinfection. You can find a description of the variables in the file cmTbmData_description.txt.

II. Data import

  a. Download the file cmTbmDataWithErrors.csv and open it in MS Excel. Is the data set “tidy”? Do you agree with the naming of the columns? Describe the type of each of the variables. Save the data set as an Excel Workbook file.
  b. Start RStudio and import the Excel data set via the Import Dataset menu in the Environment tab. Have a look at the values of the continuous variables. What do you notice?
  c. Import the data set cmTbmDataWithErrors.csv via the Import Dataset menu in the Environment tab. Choose the option From Text (base)…. Before clicking on the Import button, have a look at some of the Import options and tick the option Strings as factors. We purposely created 4 errors in this data set. Issue the following command and see whether you can find them. Hint: look at the variables groupLong, sex, bldwcc and csfwcc.
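
The command itself is not reproduced in this sheet. One plausible way to screen a freshly imported data set, offered here only as an assumption, is `summary()` combined with `levels()` and `table()`. The sketch below uses a small invented CSV file so that it runs on its own; the group labels, codes, and planted problems are hypothetical and are not the errors in cmTbmDataWithErrors.csv.

```r
# Hypothetical toy data in the spirit of the exercise; the values and the
# planted problems are invented and are NOT the errors in the course file
toy_lines <- c(
  "groupLong,sex,bldwcc,csfwcc",
  "TBM HIV-,1,7.2,150",
  "TBM HIV+,2,5.1,80",
  "tbm HIV+,1,6.3,-20",
  "CM HIV+,3,4.8,40"
)
toy_file <- tempfile(fileext = ".csv")
writeLines(toy_lines, toy_file)

# Script counterpart of Import Dataset > From Text (base) with
# "Strings as factors" ticked
toy <- read.csv(toy_file, stringsAsFactors = TRUE)

# summary() shows counts per level for factors and min/quartiles/max for numbers
print(summary(toy))

# Listing factor levels and tabulating codes helps to spot typos and odd codes
print(levels(toy$groupLong))
print(table(toy$sex, useNA = "ifany"))
```

In the toy data, the inconsistent capitalisation creates an extra factor level, the sex column contains an unexpected code, and one count is negative; the same checks can be applied to the imported course data.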

III. Numerical summaries and data transformations

From now on, we recommend that you write all your code in an R Notebook file (a special type of R Markdown file), but you can also use an R Script file if you prefer. A new R Notebook file can be created via File -> New File -> R Notebook. This file allows you to combine R code with explanatory text and to run the code interactively. Every piece of R code is written in a so-called chunk. (The new R Notebook file that you created contains an example chunk; run it to see how this works.) You can use the shortcut Ctrl-Alt-I (MS Windows) or Cmd-Option-I (Mac) to insert a new chunk.

Download cmTbmData.csv and import it in the same way as you imported cmTbmDataWithErrors.csv: choose the option From Text (base)… and tick the option Strings as factors. Name the data set cmTbm. Copy the code that RStudio uses to import the data to a new chunk in your R Notebook file.
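
The code that RStudio generates for this import is typically a `read.csv()` call. A minimal sketch of such a chunk follows; the file location in `csv_path` is an assumption and has to be adjusted to wherever you saved the file.

```r
# Assumed location of the downloaded file; adjust to where you saved it
csv_path <- "cmTbmData.csv"

if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE corresponds to ticking "Strings as factors"
  cmTbm <- read.csv(csv_path, stringsAsFactors = TRUE)
  str(cmTbm)  # quick check of the variable types
} else {
  message("File not found: ", csv_path)
}
```

Keeping the import in a chunk means that the whole notebook can be rerun from scratch without using the menu again.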

  a. Summarize the variables age, white cell count in CSF (csfwcc) and sex. Do you think that the variables age and CSF white cell count have a skewed distribution? Have a look at the sex variable. The summary of the sex variable is probably not what you expect. What is the reason?
summary(cmTbm[,c("age","csfwcc","sex")])
  b. Make sex into a categorical variable via the factor function; use appropriate labels. Run the summary function again on the sex variable.
cmTbm$sex <- factor(cmTbm$sex, labels=c("male","female"))
## summarize sex variable
summary(cmTbm$sex)
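
The call to `factor()` above assigns the labels to the codes in sorted order. A slightly more defensive variant states the codes explicitly via the `levels` argument. The sketch below assumes a hypothetical coding (1 = male, 2 = female) on a made-up vector; check cmTbmData_description.txt for the coding actually used.

```r
# Hypothetical coding: 1 = male, 2 = female (an assumption for illustration;
# check cmTbmData_description.txt for the coding used in the course data)
sex_code <- c(1, 2, 2, 1, 1)

# On a numeric vector, summary() reports min, quartiles, mean and max
print(summary(sex_code))

# Stating the levels explicitly ties each label to one code
sex_f <- factor(sex_code, levels = c(1, 2), labels = c("male", "female"))

# On a factor, summary() reports the count per category
print(summary(sex_f))
```

With explicit `levels`, a code that is not listed becomes `NA` instead of silently receiving a label, which makes coding errors easier to detect.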
  c. Create a new variable log10.csfwcc containing log10-transformed values of white cell count in CSF and add it to the dataset. Check whether the values of log10.csfwcc make sense by applying the summary function to that variable. Do you observe anything strange? If so, what do you think has happened?
cmTbm$log10.csfwcc <- log10(cmTbm$csfwcc)
summary(cmTbm$log10.csfwcc)
  d. How would you solve this problem? Try it out by changing the code above, and make a summary of the logarithm of white cell count in CSF again. Is the distribution of CSF white cell count less skewed after the logarithmic transformation?
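
The behaviour of `log10()` at the boundary can be tried out on a toy vector. The sketch below assumes that the oddity is caused by counts equal to zero. Adding 1 before taking the logarithm is one common workaround and is shown here as an assumption, not necessarily the solution intended in this course. The vector `counts` is made up.

```r
# Made-up counts including a zero (hypothetical values, not course data)
counts <- c(0, 5, 40, 300, 2500)

# log10(0) is -Inf, which carries over to the minimum and the mean
print(summary(log10(counts)))

# One common workaround (an assumption, not necessarily the intended answer):
# shift all counts by 1, so that a count of zero maps to log10(1) = 0
log10_shifted <- log10(counts + 1)
print(summary(log10_shifted))
```

The shift barely changes large counts but has a visible effect on small ones, so the choice of the added constant should be reported together with the results.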

IV. Basic visual data summaries

We recommend that you make all of the subsequent figures using the ggplot2 package, but you can use the base R plotting functions if you prefer. We only provide answers in ggplot2. You can choose either of these options:

  • Write all R code yourself. If you are not familiar with ggplot2, you can read the ggplot2 web page. Note that you need to load the package first:
library(ggplot2)
  • You can also use the esquisse package. Have a look at the esquisse web page. You can either install and load the package in RStudio and start the GUI via esquisser(cmTbm), or use it online. If you are satisfied with the figure, save the R code that generates the figure to your R Notebook/R Script file and continue to the next exercise. After you close the GUI window, you can run the code from your R Notebook/R Script file to reproduce the figure (you may have to change the name of the object that contains the data into cmTbm).
  a. In the previous section we produced a numerical summary of CSF white cell count. There was a clear suggestion that it had a skewed distribution, which became much more symmetric after we log-transformed the values. Now we see what we can learn from a histogram. Draw a histogram for white cell count in CSF (csfwcc). Vary the number of bins for the histogram to see whether that changes the visual appearance. Choose the bin width that gives a detailed visual representation without too much noise.
library(ggplot2)
ggplot(cmTbm, aes(csfwcc)) + geom_histogram(binwidth=..., boundary = 0) 
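
To see how the bin width changes what a histogram shows, the bin counts can be inspected directly. The following minimal sketch uses the base R function `hist()` with `plot = FALSE` on a made-up sample. The vector `x` and the two bin widths are assumptions chosen for illustration; they are not suggested values for csfwcc.

```r
# Made-up toy sample with a long right tail (hypothetical values, not course data)
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 60)

# Bins start at 0, similar in spirit to boundary = 0 in geom_histogram()
wide   <- hist(x, breaks = seq(0, 60, by = 10), plot = FALSE)
narrow <- hist(x, breaks = seq(0, 60, by = 2),  plot = FALSE)

# Few wide bins: a coarse picture in which most values share one bin
print(wide$counts)

# Many narrow bins: more detail, but also many empty bins
print(narrow$counts)
```

Both sets of counts add up to the number of observations; only the level of detail differs.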
  b. Do the same for the log-transformed values. Can you tell from the figure whether the distribution of the CSF white cell count becomes less skewed after the log transformation?
ggplot(cmTbm, aes(log10.csfwcc)) + geom_histogram(binwidth=0.25)
  c. Repeat exercise a, but now use a boxplot and the empirical cumulative distribution function. How does the information shown compare with the histogram?
ggplot(cmTbm, aes(csfwcc)) + geom_boxplot()
ggplot(cmTbm, aes(csfwcc)) + stat_ecdf()
  d. Repeat exercise b, but now use a boxplot and the empirical cumulative distribution function. How does the information shown compare with the histogram?
ggplot(cmTbm, aes(log10.csfwcc)) + geom_boxplot()
ggplot(cmTbm, aes(log10.csfwcc)) + stat_ecdf()
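
The numbers behind a boxplot and an empirical cumulative distribution function can also be computed directly, which may help when comparing the plots. The sketch below uses the base R functions `ecdf()`, `quantile()`, and `fivenum()` on a made-up sample; the vector `x` is hypothetical and not course data.

```r
# Made-up toy sample with a long right tail (hypothetical values, not course data)
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 60)

# ecdf() returns a step function: the proportion of observations <= a value
Fx <- ecdf(x)
print(Fx(3))
print(Fx(10))

# Reading the ECDF in the other direction gives quantiles;
# type = 1 is the quantile definition that inverts the ECDF
print(quantile(x, probs = c(0.25, 0.5, 0.75), type = 1))

# Five-number summary (minimum, lower hinge, median, upper hinge, maximum),
# the basis of the box in a boxplot
print(fivenum(x))
```

The height of the ECDF at a value is a proportion, whereas the boxplot only marks a few selected quantiles.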