Introduction to Medical Statistics 2026
Exercises Class I
Data, Variables, Descriptive Statistics

Author: Ronald Geskus

Published: March 23, 2026

I. Characteristics of a numeric variable

  1. Days off at a mining plant. Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average. The manager is under pressure to increase the amount of paid time off. However, he does not want to give more days off to the workers because that would be costly. Instead he decides to fire 10 employees in such a way as to raise the average number of days off that are reported by his employees. In order to achieve this goal, should he fire employees who have the largest number of paid vacation days, those with the smallest number, or those who have about the average number of days off?

Answer: To increase the mean, he should remove the lowest values. Hence, the manager should fire the employees who have the fewest days of paid vacation.
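The effect can be checked with a quick sketch in R (the vacation numbers below are made up for illustration):

```r
# Made-up vacation days for eight workers
days <- c(20, 25, 30, 35, 40, 45, 50, 60)

mean(days)             # overall mean: 38.125
mean(days[days > 25])  # mean after removing the two lowest values: about 43.3
```

Removing the lowest observations always raises the mean of the remaining values.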

  2. Infant mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The frequency histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. The height of each bar depicts the frequency of that range of mortality rates.

  a. Guess the first quartile and the median from the histogram.
  b. Would you expect the mean of this data set to be smaller or larger than the median? Explain your reasoning.

Answer: The first quartile lies between 0 and 10, because the first bar already contains more than a quarter (0.25) of the countries. The histogram doesn’t allow you to tell exactly where it is. The median is between 10 and 20. The distribution is skewed to the right; hence the median is smaller than the mean.

  3. Distributions and appropriate statistics. For each of the following, describe whether you expect the distribution to be approximately symmetric, right skewed, or left skewed. Also specify whether the mean or the median would best represent a “typical” observation in the data, and whether the variability of observations would be better represented by the standard deviation or by the quartiles/IQR.
  a. DENV viremia in a data set where the first quartile of the values is at 350,000 copies/mL, the median at 450,000, the third quartile at 1,000,000, and 4.3% of individuals have more than 6,000,000 copies/mL.
  b. DENV viremia in a data set where the first quartile of the values is at 300,000 copies/mL, the median at 600,000, the third quartile at 900,000, and 0.5% of individuals have more than 1,200,000.
  c. Weight distribution of adults living in Ho Chi Minh City.
  d. The number of days that individuals stay in the ICU at HTD.

Answers:

  a. Right skewed.
  b. Approximately symmetric, although the values above 1,200,000 cannot be mirrored by values below zero.
  c. Variables like weight often have a fairly symmetric distribution in a population. But there is no guarantee; in the US the distribution is probably skewed to the right because many people are obese.
  d. Right skewed; most stay in the ICU only a few days, but some stay for a long time. Time/duration variables often have a distribution that is skewed to the right.

For left and right skewed distributions, median and IQR/quartiles are better. For symmetric distributions, mean and sd will do well.
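A small simulation illustrates this; the exponential distribution below is just a convenient right-skewed example, not taken from the data set:

```r
set.seed(1)
x <- rexp(1000, rate = 1/100)  # right-skewed sample of "durations"

mean(x) > median(x)            # TRUE: the long right tail pulls the mean up
c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
```

For such a sample the mean and sd are dominated by the tail, while the median and IQR describe where the bulk of the observations lie.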

  4. Histograms and box plots. Compare the two plots below. What characteristics of the distribution are apparent in the histogram but not in the box plot? What characteristics are apparent in the box plot but not in the histogram?

Answer: From the histogram you can see that the distribution is bi-modal: it has two peaks. The boxplot shows the exact location of the median and the most extreme values (outliers).


Today we will use a dataset that contains information on 201 patients with meningitis from 4 different patient groups, determined by whether the patients have tuberculous or cryptococcal meningitis and whether they have HIV coinfection. You can find a description of the variables in the file cmTbmData_description.txt.

II. Data import

  1. Open the dataset cmTbmDataWithErrors.csv in MS Excel. Is the data set “tidy”? Do you agree with the naming of the columns? Describe the type of each of the variables. Save the dataset as an Excel Workbook file.

Answer: The data set is tidy. Column names are lowercase, with the exception of groupLong; it may be more consistent to also write the other names in camel case (bldNeut, bldLym, etc.). Are the last six names informative enough? Some of the variables that have been coded numerically should be interpreted as categorical (hiv, group, sex).

With respect to the type of variables:

  • the last six columns are all continuous
  • age is continuous as well, although it can also be seen as discrete (in this data set it only gives the years as integer values)
  • hiv, diagnosis and sex are binary
  • code, group and groupLong are nominal. Note that the latter two represent the same information.

  2. Start RStudio and import the Excel data set via the Import Dataset menu in the Environment tab. Have a look at the values of the continuous variables. What do you notice?

Answer: When reading the Excel file into RStudio, the last six variables are incorrectly interpreted as type character. The reason is that these variables contain missing values, which are represented by the R-style NA value; in Excel this is just the text string “NA”. Because Excel has a sloppy (others call it flexible) definition of variable type, the type of each column is guessed from its first 1000 values. At least one of those cells contains the character value NA, so the whole column is read as character. Note that the na argument of the read_excel function lets you tell R to interpret NA values as missing when reading the file.
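A sketch of such an import call; the file name is an assumption, so adjust it to wherever you saved the workbook:

```r
library(readxl)

# Treat the text string "NA" as missing, so that the numeric columns
# are guessed correctly instead of being read as character
cmTbm <- read_excel("cmTbmData.xlsx", na = "NA")
```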

  3. Import the data set cmTbmDataWithErrors.csv via the Import Dataset menu in the Environment tab. Choose the option From Text (base)…. Before clicking on the Import button, have a look at some of the Import options and tick the option Strings as factors. We purposely created 4 errors in this data set. Explore the data set and see whether you can find them. Hint: look at the variables groupLong, sex, bldwcc and csfwcc.

Answer:

  • groupLong has one observation with “HIV neg- CM” and 48 with “HIV neg - CM”. A different number of spaces makes the levels of a categorical variable different: “HIV neg- CM” is missing a space before the hyphen. The command with(cmTbm, table(groupLong, group)) shows that both levels have the same value in the group column.
  • sex has been coded as numerical, with 1=male and 2=female, so it cannot have a maximum value of 4.
  • The lowest value in csfwcc is negative. A count cannot be negative.
  • The largest value in bldwcc is 99999, which is much larger than all other values. You can observe this by clicking on the object “cmTbm” in the Environment tab and sorting the variable bldwcc by value (click on the small triangle in that column). Missing values are sometimes coded as a number that differs from the rest of the data, here 99999. This is a bad practice that is better avoided.
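The same checks can be sketched on a toy data frame that mimics the four error patterns (all names and values below are made up for illustration):

```r
toy <- data.frame(
  groupLong = c("HIV neg - CM", "HIV neg- CM", "HIV neg - CM"),
  sex       = c(1, 2, 4),
  csfwcc    = c(10, -5, 120),
  bldwcc    = c(6.2, 99999, 8.1)
)

table(toy$groupLong)  # the inconsistent spelling shows up as an extra level
max(toy$sex)          # 4: impossible for a 1/2 coding
min(toy$csfwcc)       # -5: a count cannot be negative
max(toy$bldwcc)       # 99999: a disguised missing value
```

Tabulating each categorical variable and looking at the minimum and maximum of each numeric variable is a quick first screen for this kind of data entry error.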

III. Numerical summaries and data transformations

From now on, we recommend that you write all your code in an R Notebook file (which is a special type of R Markdown file), but you can also use an R Script file if you prefer. A new R Notebook file can be created via File -> New File -> R Notebook. This file allows you to combine R code with explanatory text and run the code interactively. Every piece of R code is written in a so-called chunk. (You see an example in the new R Notebook file that you created. Run that chunk as an example.) You can use the shortcut Ctrl-Alt-I (MS Windows) or Cmd-Option-I (Mac) to insert a new chunk.

Import the data set cmTbmData.csv in the same way as you imported cmTbmDataWithErrors.csv. Choose the option From Text (base)…. Tick the option Strings as factors. Name the data set cmTbm. Copy the code that is used to import the data into RStudio to a new chunk in your R Notebook file.

  1. Summarize the variables age, white cell count in CSF (csfwcc) and sex. Do you think that the variables age and CSF white cell count have a skewed distribution? Have a look at the sex variable. The summary of the sex variable is probably not what you expect. What is the reason?
summary(cmTbm[,c("age","csfwcc","sex")])
      age            csfwcc            sex       
 Min.   :15.00   Min.   :   0.0   Min.   :1.000  
 1st Qu.:23.00   1st Qu.:  46.5   1st Qu.:1.000  
 Median :28.00   Median : 142.0   Median :1.000  
 Mean   :32.08   Mean   : 367.1   Mean   :1.259  
 3rd Qu.:37.00   3rd Qu.: 453.5   3rd Qu.:2.000  
 Max.   :78.00   Max.   :3960.0   Max.   :2.000  
 NA's   :1       NA's   :6                       

Answer: The mean of age is slightly higher than the median, but they are not far apart. Skewness is easier to judge from a figure, which we make in a later exercise. For white cell count, the mean is quite a bit larger than the median and the largest value is far larger than the mean, indicating that the distribution is skewed to the right. Sex is summarized as a numeric variable, because that is how it was coded; a mean or quartiles of sex are meaningless.

  2. Make sex into a categorical variable via the factor function; use appropriate labels. Run the summary function again on the sex variable.
cmTbm$sex <- factor(cmTbm$sex, labels=c("male","female"))
## summarize sex variable
summary(cmTbm$sex)
  male female 
   149     52 

Answer: After making sex into a categorical variable, it shows the correct summary.

  3. Create a new variable log10.csfwcc containing log10-transformed values of white cell count in CSF and add it to the dataset. Check whether the values of log10.csfwcc make sense by applying the summary function to that variable. Do you observe anything strange? If so, what do you think has happened?
cmTbm$log10.csfwcc <- log10(cmTbm$csfwcc)
summary(cmTbm$log10.csfwcc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   -Inf   1.667   2.152    -Inf   2.657   3.598       6 

  4. How would you solve this problem? Try it out by changing the code above, and make a summary of the logarithm of white cell count in CSF again. Is the distribution of CSF white cell count less skewed after the logarithmic transformation?

Answer: The minimum and the mean are “-Inf”. This is because some individuals had a CSF white cell count of 0, and log(0) = -infinity. The solution is to add a small number to the variable csfwcc before taking the logarithm; the most obvious choice is 1, which keeps csfwcc values of zero at zero after the log transformation. The summary suggests that the distribution is much more symmetric: mean and median are roughly equal, and the minimum and maximum are both 1.5 to 2 units away from the central value.

cmTbm$log10.csfwcc <- log10(cmTbm$csfwcc+1)
summary(cmTbm$log10.csfwcc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.677   2.155   2.075   2.658   3.598       6 

IV. Basic visual data summaries

We recommend that you make all of the subsequent figures using the ggplot2 package, but you can use the base R plotting functions if you prefer. We only provide answers in ggplot2. You can choose either of these options:

  • Write all R code yourself. If you are not familiar with ggplot2, you can read the ggplot2 web page. Note that you need to load the package first:
library(ggplot2)
  • You can also use the esquisse package. Have a look at the esquisse web page. You can install and load the package in RStudio and start the GUI via esquisser(cmTbm), or you can use it online. If you are satisfied with the figure, save the R code that generates it to your R Notebook/R script file and continue to the next exercise. After you close the GUI window, you can run the code from your R Notebook/Script file to reproduce the figure (you may have to change the name of the object that contains the data into cmTbm).

  a. In the previous section we produced a numerical summary of CSF white cell count. There was a clear suggestion that it had a skewed distribution, which became much more symmetric after we log-transformed the values. Now we see what we can learn from a histogram. Draw a histogram for white cell count in CSF (csfwcc). Vary the number of bins for the histogram to see whether that changes the visual appearance. Choose the binwidth that gives a detailed visual representation without too much noise.
library(ggplot2)
ggplot(cmTbm, aes(csfwcc)) + geom_histogram(binwidth=..., boundary = 0) 

Answer: We choose a binwidth of 200. Note that the first bar starts below 0, even though negative values are not possible. This can be prevented via the boundary argument (which is not available in the GUI). The log-transformed data are much less skewed.

ggplot(cmTbm, aes(csfwcc)) + geom_histogram(binwidth=200, boundary = 0) 

  b. Do the same for the log-transformed values. Can you tell from the figure whether the distribution of the CSF white cell count has become less skewed after the log transformation?
ggplot(cmTbm, aes(log10.csfwcc)) + geom_histogram(binwidth=0.25)

Answer: It has become a bit skewed to the left, but is much less skewed than for the original values.

  c. Repeat exercise a., but now use a box plot and the empirical cumulative distribution function. How does the information shown compare with the histogram?
ggplot(cmTbm, aes(csfwcc)) + geom_boxplot()
ggplot(cmTbm, aes(csfwcc)) + stat_ecdf()

Answer: The boxplot clearly shows the skewness, but doesn’t clearly show the downward trend in density of the values. The ECDF is harder to interpret, but shows all the individual data values.

  d. Repeat exercise c., but now for the log-transformed values. How does the information shown compare with the histogram and the box plot?
ggplot(cmTbm, aes(log10.csfwcc)) + geom_boxplot()
ggplot(cmTbm, aes(log10.csfwcc)) + stat_ecdf()

Answer: The boxplot shows the more symmetric data distribution, but doesn’t show the small peak in the beginning. The ECDF is harder to interpret, but shows all the individual data values.