What’s wrong with using bar charts


…and possible ways to fix it

Image generated by Canva text to image tool

Don’t get me wrong, bar charts can be a great tool for data visualization, especially when used for displaying counts or proportions. However, using them in the wrong way can lead to unintentional (or even worse, intentional) data misinterpretation. The particular issue I will be talking about today is using bar charts to present aggregated summary statistics such as means or medians.

The biggest problem here is the loss of detail: bar charts can oversimplify, leaving out important information such as variance, distribution, outliers, and trends. In this post, I will illustrate this problem using a series of examples and propose potential solutions. To avoid interrupting the flow of the post, the code for the charts is given at the end for those who are interested 🙂

The wine quality dataset

Photo by Kym Ellis on Unsplash

For this post, I will be using the wine quality dataset¹, available through the UCI ML repository. Although the dataset contains many wine properties, we will focus on the total sulfur dioxide measurements.

Sulfur, commonly added to wine as sulfur dioxide, plays a crucial role in winemaking due to its preservative qualities. Acting as an antioxidant, it helps prevent the wine’s oxidation, safeguarding it from discoloration and undesired flavor alterations. Its antimicrobial characteristics also protect the wine against spoilage from bacteria and yeasts, preserving the intended taste and quality.

Let’s illustrate the issue by plotting a simple bar chart comparing total sulfur dioxide levels between red and white wines.

Image by Author

Ok, maybe it isn’t fair to bash on bar charts using the above example since the basic chart looks so ugly, it is off-putting without any further argument needed. Let’s first make it a bit prettier by tweaking some aesthetic properties.

Image by Author

Much better. Now, back to the issue at hand. What does the chart tell us? Well, obviously, the sulfur levels seem to be much higher for white wines. This was to be expected due to the differences in the winemaking process between red and white wines.

Red wines are fermented with their skins, providing natural antioxidants that help protect the wine from oxidation. In contrast, white wines are typically made by pressing the grapes and removing the skins prior to fermentation. This leaves them more susceptible to oxidation, requiring additional protection in the form of sulfur dioxide.

Although the average effect is discernible, the bar chart gives us no information about the distribution of values in each group, or the number of observations in each group.

This can partially be addressed by adding the number of observations above the bars and adding error bars to show the standard deviation of each group.

Image by Author

This can be enough if the underlying distribution of values is symmetrical, but that doesn’t have to be the case, and for skewed data the standard deviation is a poor choice of dispersion statistic. Nothing more can be added to a bar chart to fix this without turning it into a completely different type of chart. This indicates that bar charts are not ideal for presenting this type of data.
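To see why the standard deviation struggles with skewed data, here is a minimal sketch in base R. The values in `x` are made up for illustration (they are not taken from the wine dataset): all of them are positive, yet the lower end of the mean minus one standard deviation interval dips below zero, while the quartiles stay within the range of the data.

```r
# Made-up, right-skewed values (NOT the wine data): mostly small, two large ones
x <- c(1, 2, 2, 3, 3, 4, 5, 8, 40, 90)

m <- mean(x)
s <- sd(x)

# Mean +/- one standard deviation, i.e. what the error bars would show
sd_interval <- c(lower = m - s, upper = m + s)

# Quartiles describe the same values without assuming symmetry
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

print(sd_interval)
print(quartiles)
print(range(x))
```

With values like these, the error bar would suggest that negative measurements are plausible, even though none exist in the data.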

So, what are the possible alternatives? In the remainder of the post, I’ll go through four that I think are better and more transparent solutions.

1. Jittered points

The first possibility is to add the actual individual observations to the chart.

Image by Author

This can be a great alternative if the number of observations is relatively small. However, in this specific case, it feels quite cumbersome by itself due to a very large number of wines in the dataset.

2. Boxplots with specified means

The second alternative is using boxplots with an added twist of specifying the means as well as medians (which are displayed by a flat line in the central box by default). Although boxplots give us an idea of the underlying distribution by specifying quartiles, I like the additional information which the mean offers. This is because a large and easily visible difference between the mean and the median immediately tells us whether the distribution is skewed and in which direction.

Image by Author
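The mean versus median comparison described above can be sketched in a few lines of base R. The helper `skew_hint()` and the three example vectors are hypothetical, made up purely for illustration, and the gap between mean and median is only a rough hint rather than a formal test of skewness.

```r
# Hypothetical helper: uses the gap between mean and median as a rough skewness hint
skew_hint <- function(x) {
  gap <- mean(x) - median(x)
  if (gap > 0) {
    "right-skewed (mean above median)"
  } else if (gap < 0) {
    "left-skewed (mean below median)"
  } else {
    "no gap between mean and median"
  }
}

# Made-up example values (NOT the wine data)
right_tail <- c(10, 12, 13, 15, 18, 60)  # one large value pulls the mean up
left_tail  <- c(5, 50, 52, 55, 57, 60)   # one small value pulls the mean down
symmetric  <- c(1, 2, 3, 4, 5)

print(skew_hint(right_tail))
print(skew_hint(left_tail))
print(skew_hint(symmetric))
```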

3. Violin plots with medians

Violin plots are great because they let us in on the shape of the underlying distributions, making it possible to easily detect anomalies such as bimodalities or data skewness. One might argue that boxplots do this implicitly as well. Although I agree to a certain point, we also have to take into account that a person has to be taught how to read a boxplot, whereas that’s not the case with violin plots.

I also like to add the information on the median since the violins leave a lot of unused space, so why not 🙂

Image by Author
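As a rough illustration of what the shape of a violin can reveal, the sketch below builds a made-up bimodal sample in base R and evaluates a kernel density estimate, which is the kind of curve a violin plot draws. The sample and the three evaluation points are assumptions chosen for this example, not wine data.

```r
# Made-up bimodal sample: two normal-shaped clusters centred at 0 and 6.
# ppoints() + qnorm() give evenly spaced normal quantiles, so no random numbers are needed.
x <- c(qnorm(ppoints(200)), 6 + qnorm(ppoints(200)))

# The five-number summary that a boxplot is built from
print(fivenum(x))

# Kernel density estimate with default settings
d <- density(x)

# Estimated density at the two cluster centres and at the midpoint between them
dens <- approx(d$x, d$y, xout = c(0, 3, 6))$y
names(dens) <- c("at 0", "at 3", "at 6")
print(dens)
```

Here the median sits at about 3, right in the dip between the two peaks, which is something the five-number summary alone does not make obvious.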

4. Violin plots with jittered points

Ok, this one isn’t really a standalone option, but rather a combination of options 1 and 3. For our specific case, this would be my pick, but that doesn’t mean it would be ideal for all possible scenarios, as that depends on specifics of the problem such as the number of groups being compared, the total number of points, and the group dispersions.

Notice that I didn’t try to combine boxplots with individual points. This is intentional, as I feel that such a combination would defeat the purpose of the boxplot. Namely, boxplots display individual points only if they lie more than 1.5 interquartile ranges above the upper border or below the lower border of the central box. This can be used as a simple method for outlier detection, which would be obscured by adding all the other points as well.
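The 1.5 interquartile range rule mentioned above can be written out by hand. This is a minimal sketch in base R on made-up values, using `quantile()` for the box borders; plotting functions may compute the borders slightly differently, so treat it as an approximation of what a boxplot flags.

```r
# Made-up values (NOT the wine data) with one suspiciously large observation
x <- c(2, 4, 5, 5, 6, 7, 7, 8, 9, 30)

# Borders of the central box: first and third quartile
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Points beyond these fences are the ones drawn individually
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```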

Image by Author

This post discusses a specific issue with using bar charts to present aggregate group statistics, using a wine quality dataset to provide hands-on examples. After illustrating the issue, four possible alternatives are presented and their advantages and disadvantages are discussed.

Finally, remember, the primary goal of any data visualization is to accurately and effectively convey information. Always choose the type of visualization that best suits the data and the message you want to communicate.

I hope you will find the post useful. If you have any comments, feel free to leave a reply to the post. And, of course, if you liked what you read, please clap and follow me for more similar content.

Footnotes

¹P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009. (CC BY 4.0)

Code for generating the charts

library(tidyverse)

# Read the red and white wine files, tag each with its type, and stack them
wine <- read_delim("winequality-red.csv",
                   delim = ";", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(Type = "Red") %>%
  bind_rows(read_delim("winequality-white.csv",
                       delim = ";", escape_double = FALSE, trim_ws = TRUE) %>%
              mutate(Type = "White")) %>%
  mutate(Type = factor(Type)) %>%
  # Reshape to long format and keep only the total sulfur dioxide values
  pivot_longer(`fixed acidity`:`quality`,
               names_to = "Parameter", values_to = "Value") %>%
  filter(Parameter == "total sulfur dioxide") %>%
  select(-Parameter)

# Per-type summary statistics used by the charts below
wine_summary <- wine %>%
  group_by(Type) %>%
  summarise(Median = median(Value), Mean = mean(Value),
            SD = sd(Value), N = n())

# basic bar chart
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_col() +
  labs(x = "Wine type", y = "Total sulfur levels")

# aesthetically pleasing bar chart
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_col(aes(fill = Type), width = 0.8) +
  scale_fill_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# bar chart with error bars and specified number of observations per group
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_col(aes(fill = Type), width = 0.8) +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.15) +
  geom_label(aes(y = 200, label = N), fill = "gray97") +
  scale_fill_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# jittered points chart
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_jitter(data = wine, aes(x = Type, y = Value, col = Type), alpha = 0.4) +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.15) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# boxplot with added information about the mean
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_boxplot(data = wine, aes(x = Type, y = Value, col = Type)) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# violin plot with information about the median
wine_summary %>%
  ggplot(aes(Type, Median)) +
  geom_violin(data = wine, aes(x = Type, y = Value, col = Type)) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# violin plot with added jittered points
wine_summary %>%
  ggplot(aes(Type, Median)) +
  geom_violin(data = wine, aes(x = Type, y = Value), fill = "gray92") +
  geom_jitter(data = wine, aes(x = Type, y = Value, col = Type), alpha = 0.1) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")


