ggplot2 bar chart data visualization: how to use the stat_*() functions

created at 07-03-2021

Preface

We often use geom_*() functions when drawing layers, but rarely use stat_*() functions.

Most plotting work can indeed be done with geom_*() functions alone. So is there any reason to reach for the stat_*() functions?

Let's look at an example. Assume we have the following data:

> select(diamonds, cut, price)
# A tibble: 53,940 x 2
   cut       price
   <ord>     <int>
 1 Ideal       326
 2 Premium     326
 3 Good        327
 4 Premium     334
 5 Good        335
 6 Very Good   336
 7 Very Good   336
 8 Very Good   337
 9 Fair        337
10 Very Good   338
# … with 53,930 more rows

We want to draw a bar chart that shows the average price of each cut.

The conventional approach is to summarize the data with tidyverse functions first, and then map the computed statistics to the corresponding aesthetics:

library(tidyverse)  # loads dplyr and ggplot2 (which provides the diamonds data)

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    .groups = "drop"
  ) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col()

Figure: bar chart of mean price by cut

Suppose we are not satisfied with this and want to add error bars to the bar chart.

That is also simple: we compute the standard error from the data, and then draw it.

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    se = sqrt(var(price) / length(price)),  # standard error of the mean
    .groups = "drop"
  ) %>%
  mutate(lower = mean_price - se, upper = mean_price + se) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5)

Figure: bar chart of mean price by cut, with error bars

Hmm. For such a simple plot, that is a lot of code.

The reason is that we are still following the same mental model: prepare the data first, then map it to aesthetics.

This forces us to run a lot of statistical calculations on the data before plotting, and the summarized tables we build along the way are no longer the tidy data we started with.

Think of it this way: since all the statistics come from the same data, why not pass the raw data straight to ggplot and let it perform the statistical calculation internally?

We can rewrite the plot like this:

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar", width = 0.5)

Figure: the same bar chart with error bars, drawn with stat_summary()

Two stat_summary() layers are enough. Why write so much more? Better to save the time for a cup of tea.

Principle analysis
Once you understand how stat_summary() works, the other stat_*() functions are easy to follow.

So how does stat_summary() work? Let's go through an example.

Using the data above, we first draw a point plot of price against cut:

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  geom_point()

Figure: point plot of price by cut

Now replace geom_point() with stat_summary(), called with no arguments, and see what happens:

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary()

Figure: stat_summary() in place of geom_point()

A pointrange is drawn for each cut.

Let's first take a look at the signature of stat_summary():

stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",
  position = "identity",
  ...,
  fun.data = NULL,
  fun = NULL,
  fun.max = NULL,
  fun.min = NULL,
  fun.args = list(),
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.y,
  fun.ymin,
  fun.ymax
)

The default geom is pointrange. Which aesthetics does pointrange require?

x or y
ymin or xmin
ymax or xmax

However, we never mapped ymin or ymax. Presumably stat_summary() computes those values and passes them on to pointrange.

How can we verify this guess? First, note that running the code above prints a message:

No summary function supplied, defaulting to `mean_se()`

In other words, mean_se() is applied as the summary function by default.

Let's take a look at what mean_se() does:

> mean_se
function (x, mult = 1) 
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    new_data_frame(list(y = mean, ymin = mean - se, ymax = mean + 
        se), n = 1)
}
<bytecode: 0x7fca56dfa5d0>
<environment: namespace:ggplot2>

The data frame returned by the function contains three columns, y, ymin, and ymax, which are exactly the aesthetics that pointrange requires.
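To make the return value concrete, here is a minimal sketch that calls mean_se() directly on a small vector. The vector x is made up for illustration; the only assumption is that ggplot2 is installed.

```r
library(ggplot2)

# Toy input, chosen only for illustration
x <- c(1, 2, 3, 4, 5)

# mean_se() returns a one-row data frame with columns y, ymin, ymax
res <- mean_se(x)
res

# Recompute the standard error by hand and compare with the result
se <- sqrt(var(x) / length(x))
all.equal(res$ymin, mean(x) - se)
all.equal(res$ymax, mean(x) + se)
```

The column names matter more than the numbers here: they are what lets the geom pick up the computed values.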

We can use the layer_data() function to extract the data that a layer actually draws:

> p <- select(diamonds, cut, price) %>%
+   ggplot(aes(cut, price, colour = cut)) +
+   stat_summary()
>
> layer_data(p, 1)
No summary function supplied, defaulting to `mean_se()`
     colour x group        y     ymin     ymax PANEL flipped_aes size linetype shape fill alpha stroke
1 #440154FF 1     1 4358.758 4270.025 4447.491     1       FALSE  0.5        1    19   NA    NA      1
2 #3B528BFF 2     2 3928.864 3876.302 3981.426     1       FALSE  0.5        1    19   NA    NA      1
3 #21908CFF 3     3 3981.760 3945.953 4017.567     1       FALSE  0.5        1    19   NA    NA      1
4 #5DC863FF 4     4 4584.258 4547.223 4621.293     1       FALSE  0.5        1    19   NA    NA      1
5 #FDE725FF 5     5 3457.542 3431.600 3483.484     1       FALSE  0.5        1    19   NA    NA      1

Then compare with the calculation result using the mean_se() function

> select(diamonds, cut, price) %>%
+   group_by(cut) %>%
+   summarise(mean_se(price))
# A tibble: 5 x 4
  cut           y  ymin  ymax
* <ord>     <dbl> <dbl> <dbl>
1 Fair      4359. 4270. 4447.
2 Good      3929. 3876. 3981.
3 Very Good 3982. 3946. 4018.
4 Premium   4584. 4547. 4621.
5 Ideal     3458. 3432. 3483.

The values of y, ymin, and ymax in the layer data match the output of mean_se().
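The comparison can also be done programmatically instead of by eye. The following is a sketch under a few assumptions: it passes fun.data = mean_se explicitly (to avoid the message), and it uses base R split() and lapply() rather than dplyr to compute the per-cut summaries. The variable names are made up for this example.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(fun.data = mean_se)

# What the stat computed for the layer
from_layer <- layer_data(p, 1)[, c("y", "ymin", "ymax")]

# One mean_se() call per cut, stacked in factor-level order
by_hand <- do.call(
  rbind,
  lapply(split(diamonds$price, diamonds$cut), mean_se)
)

# TRUE if all three columns agree (up to floating-point tolerance)
same <- isTRUE(all.equal(from_layer$y, by_hand$y)) &&
  isTRUE(all.equal(from_layer$ymin, by_hand$ymin)) &&
  isTRUE(all.equal(from_layer$ymax, by_hand$ymax))
same
```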

Usage

Now that we know which summary function is applied, we can define our own statistical transformation and adjust the plot to our needs.

The fun.data argument of stat_summary() specifies the summary function. When no summary function is supplied, stat_summary() falls back to mean_se().

The function passed to fun.data must return a data frame whose column names are the names of the aesthetics to fill in (for example y, ymin, and ymax).
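As an illustration of this contract, here is a sketch of a custom summary function. median_iqr is a hypothetical helper written for this example, not part of ggplot2; it returns the median as y and the quartiles as ymin and ymax.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): median with interquartile range.
# The column names y, ymin, ymax are the aesthetics pointrange looks for.
median_iqr <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

p <- ggplot(diamonds, aes(cut, price, colour = cut)) +
  stat_summary(fun.data = median_iqr)

# Inspect what the stat handed to the geom; print p to draw the plot
layer_data(p, 1)[, c("x", "y", "ymin", "ymax")]
```

Any function with this shape should work in the same way, as long as the returned columns are aesthetics the chosen geom understands.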

Here are a few customized plots.

1. 95% confidence interval error bars

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(
    geom = "errorbar", width = 0.5,
    fun.data = ~mean_se(., mult = 1.96)
  )

Figure: bar chart with 95% confidence interval error bars

Note: the ~ symbol builds an anonymous function in which . stands for the argument, so the formula above is equivalent to

function(x) {mean_se(x, mult = 1.96)}
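The equivalence can be checked outside of a plot. This sketch assumes that the formula is converted to a function the way rlang::as_function() does it; the names f_lambda and f_plain and the vector x are made up for illustration.

```r
library(ggplot2)

# Formula shorthand, converted with rlang (installed alongside ggplot2)
f_lambda <- rlang::as_function(~ mean_se(., mult = 1.96))

# The equivalent ordinary function
f_plain <- function(x) {
  mean_se(x, mult = 1.96)
}

x <- c(2, 4, 6, 8)

# Both should give the same one-row data frame
all.equal(f_lambda(x), f_plain(x))
```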

2. Specify the filling color

Here the summary function also returns a fill column, so that groups whose median is below the threshold get one color and groups whose median is at or above it get another:

func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = list(cut_off = 2800),  # fun.args is documented as a list
    geom = "bar"
  )

Figure: bar chart of median price by cut, filled by threshold

Instead of wrapping the call in an anonymous function, we pass the extra argument through fun.args, which is equivalent to

fun.data = ~ func_median_color(., cut_off = 2800)
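To see what the summary function returns and how extra arguments reach it, here is a small sketch that calls it directly. The prices vector is made up, and the do.call() line only illustrates the idea of forwarding a list of arguments; it is not meant as a description of ggplot2's internals.

```r
library(dplyr)  # tibble(), mutate(), if_else()

# Same function as in the article
func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

# Toy input, chosen only for illustration
prices <- c(1000, 2000, 3000)

# Direct call with the extra argument
direct <- func_median_color(prices, cut_off = 2800)

# Forwarding the extra argument from a list, as fun.args would supply it
via_args <- do.call(func_median_color, c(list(prices), list(cut_off = 2800)))

direct
all.equal(direct, via_args)
```

The median of the toy vector is 2000, which is below the cut-off, so the fill column holds the first of the two colors.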

3. Set the point size of the pointrange

We set the size of each pointrange according to the number of observations in its group

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary(
    fun.data = function(x) {
      mean_se(x) %>%
        mutate(size = length(x) * 5 / nrow(diamonds))
    }
  )

Figure: pointrange plot with size scaled by group size
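The size values used in that plot can be computed on their own, which helps when choosing the scaling factor. This is a sketch using base R table(); the variable names are made up, and the factor 5 is the one from the example above.

```r
library(ggplot2)

# Number of observations per cut
n_per_cut <- table(diamonds$cut)

# Same formula as in the summary function: group share of the data times 5
sizes <- as.numeric(n_per_cut) * 5 / nrow(diamonds)
names(sizes) <- names(n_per_cut)

sizes
```

Because each size is a share of the whole data set multiplied by 5, the five values add up to 5, and the largest group gets the largest point.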
