How to add number of observations to a ggplot2 boxplot

Introduction

Boxplots are extremely useful to learn more about any given dataset. Basically, it allows you to compare a continuous and a categorical variable, that includes information about distribution and statistics, such as the median. As an example, let us explore the Iris dataset.

Let’s say you want to know more about the variable Sepal.Length. One way to do this would be to look at its statistics.

summary(iris$Sepal.Length)##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 4.300 5.100 5.800 5.843 6.400 7.900

If you want to look at the variable Sepal.Length and differentiate by another variable - let's say Species you could summarize it as such:

kable(
iris %>%
group_by(Species) %>%
summarize(
mean = mean(Sepal.Length),
count = n())
)
Species mean count
----------- ------ ------
setosa 5.006 50
versicolor 5.936 50
virginica 6.588 50

A visual way of exploring the data is to use a boxplot. It shows you the distribution, the median as well as the upper and lower quartile.

ggplot(iris, aes(Species, Sepal.Length)) + 
geom_boxplot() +
theme_fivethirtyeight()

Challenge

The only missing information in a boxplot for me is the count of observation by category and the mean. I will try to show a way to add this information to the plot as convenient as possible.

One way to add number and mean information to a boxplot

First attempt

I will make use of the stat_summary function. It allows to Summarise y values at unique/binned x (cf. reference).

stat_summary expects a function that computes necessary informations to display. I will call this function stat_box_data. As input variables it expects the y variable - basis for computation. The other input variable is the upper_limit, which is somewhat a limitation that we need to provide it. It basically is the value to position the text on top of the boxes and is dependent on the y-variable.

stat_box_data <- function(y, upper_limit = max(iris$Sepal.Length) * 1.15) {
return(
data.frame(
y = 0.95 * upper_limit,
label = paste('count =', length(y), '\n',
'mean =', round(mean(y), 1), '\n')
)
)
}

This basically means, that for each plot it is necessary to adapt it. If you know of a better way to do it, please let me know, seriously!

We may know use the function stat_box_data in the ggplot call.

ggplot(iris, aes(Species, Sepal.Length)) + 
geom_boxplot() +
stat_summary(
fun.data = stat_box_data,
geom = "text",
hjust = 0.5,
vjust = 0.9
) +
theme_fivethirtyeight()

Additional tweaks to format a thousands separator

In case you want to explore variables with a higher number, i.e., above thousand, you may want to add a thousands separator. For this, I add a formatter in the function stat_box_data. Also, I make use of another dataset for this.

stat_box_data <- function(x, upper_limit = max(diamonds$price) * 1.15) {
return(
data.frame(
y = 0.95 * upper_limit,
label = paste('count =',
format(length(x), big.mark = ",", decimal.mark = ".", scientific = FALSE),
'\n',
'mean =',
format(round(mean(x), 1), big.mark = ",", decimal.mark = ".", scientific = FALSE))
)
)
}

This will result in the following plot:

ggplot(diamonds, aes(color, price)) + 
geom_boxplot() +
stat_summary(
fun.data = stat_box_data,
geom = "text",
hjust = 0.5,
vjust = 0.9
) +
theme_fivethirtyeight()

Conclusion

By adding stat_summary() to the ggplot call and providing the stat_box_data function to provide what information to display it is easily possible to have additional information on the boxplot for data analysis.

One drawback is that the function stat_box_data is dependent on the maximum value of a given continuous variable, and hence, need to be adapted for each dataset or variable to explore.

Please feel free to reach out to me if you have ideas to improve this solution!

All the best,
Gregor

ps. You find code and figures also in my GitHub repository.

Gregor Scheithauer is a consultant, data scientist, and researcher. He is specialized in Process Mining, Process Management, and Data Analytics.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store