# ggplot2 bar chart data visualization: how to use the stat_*() functions

created at 07-03-2021

## Preface

We often use `geom_*()` functions when drawing layers, but rarely use `stat_*()` functions.

Of course, most of the drawing work can be done by using the `geom_*()` function. Is it necessary to use the `stat_*()` function?

Let's look at an example, assuming the following data:

``````
> select(diamonds, cut, price)
# A tibble: 53,940 x 2
   cut       price
   <ord>     <int>
 1 Ideal       326
 2 Premium     326
 3 Good        327
 4 Premium     334
 5 Good        335
 6 Very Good   336
 7 Very Good   336
 8 Very Good   337
 9 Fair        337
10 Very Good   338
# … with 53,930 more rows
``````

We want to draw a bar chart that shows the average price of each cut.

The conventional method is to use `tidyverse` functions to organize the data, calculate the required statistics, and then map them to the corresponding aesthetics:

``````
library(tidyverse)  # loads dplyr and ggplot2 (which provides the diamonds data)

select(diamonds, cut, price) %>%
  group_by(cut) %>%
  # compute the mean price of each cut
  summarise(
    mean_price = mean(price),
    .groups = "drop"
  ) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col()
``````

Suppose this is not enough, and we also want to add error bars to the bar chart.

Of course, this is also simple: we compute the statistics from the data first and then draw them.

``````
select(diamonds, cut, price) %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    # standard error of the mean
    se = sqrt(var(price) / length(price)),
    .groups = "drop"
  ) %>%
  # error bars span one standard error on either side of the mean
  mutate(lower = mean_price - se, upper = mean_price + se) %>%
  ggplot(aes(cut, mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5)
``````

Hmm. For such a simple plot, we have written quite a lot of code.

That is because we are still following the same mindset: prepare the data first, and then map the data to aesthetics.

This forces us to perform a lot of statistical calculations on the data up front, which works against keeping the data tidy.

We can think of it this way. Since all the statistical information comes from the same data, why not pass the data directly to ggplot so that the statistical calculation of the data is performed internally?

We can rewrite the code like this:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +                   # bar height: mean price of each cut
  stat_summary(geom = "errorbar", width = 0.5)   # error bars: mean ± one standard error
``````

The two `stat_summary()` lines do all the work. Why write so much more? Better to save the time and have a cup of tea.

## Principle analysis

Once you understand how the `stat_summary()` function works, the other `stat_*()` functions are easy to understand as well.

So how do we understand `stat_summary`? Let's give an example

Using the data above, we draw a scatter plot of `cut` and `price`:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  geom_point()
``````

Then we replace `geom_point()` with `stat_summary()` without arguments to see what happens:

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary()
``````

A `pointrange` is drawn instead of the individual points.

Let's take a look at the signature of the `stat_summary()` function first:

``````
stat_summary(
  mapping = NULL,
  data = NULL,
  geom = "pointrange",
  position = "identity",
  ...,
  fun.data = NULL,
  fun = NULL,
  fun.max = NULL,
  fun.min = NULL,
  fun.args = list(),
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  fun.y,
  fun.ymin,
  fun.ymax
)
``````

The default geom is `pointrange`. Which aesthetics does `pointrange` require?

- `x` or `y`
- `ymin` or `xmin`
- `ymax` or `xmax`

However, we did not define `ymin` and `ymax`, so the corresponding values must have been calculated by `stat_summary()` and passed to `pointrange`.

How can we verify this conjecture? First, notice that running the code above prints a message:

``````
No summary function supplied, defaulting to `mean_se()`
``````

In other words, the `mean_se()` function is applied as the transformation by default.

Let's take a look at what `mean_se()` does:

``````
> mean_se
function (x, mult = 1)
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    new_data_frame(list(y = mean, ymin = mean - se, ymax = mean +
        se), n = 1)
}
<bytecode: 0x7fca56dfa5d0>
<environment: namespace:ggplot2>
``````

We can see that the data frame returned by the function contains three columns (`y`, `ymin`, and `ymax`), which are exactly the aesthetics that `pointrange` requires.
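As a quick check of the source above, `mean_se()` can be called directly on a vector. The sketch below uses a small made-up vector `x` (an assumption for illustration, not data from this article) and compares the result with the same quantities computed by hand.

```r
library(ggplot2)

# Made-up values, used only for illustration
x <- c(2, 4, 4, 6, 9)

# Standard error computed by hand, mirroring the body of mean_se()
se <- sqrt(var(x) / length(x))

# mean_se() returns a one-row data frame with the columns y, ymin and ymax
from_ggplot <- mean_se(x)
print(from_ggplot)

# Side-by-side comparison of the hand-computed and returned values
rbind(
  by_hand = c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se),
  mean_se = unlist(from_ggplot)
)
```

Both rows of the comparison should agree, since the hand calculation follows the same formula as the function body.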

We can use the `layer_data()` function to extract the data used in the layer:

``````
> p <- select(diamonds, cut, price) %>%
+   ggplot(aes(cut, price, colour = cut)) +
+   stat_summary()
>
> layer_data(p, 1)
No summary function supplied, defaulting to `mean_se()`
     colour x group        y     ymin     ymax PANEL flipped_aes size linetype shape fill alpha stroke
1 #440154FF 1     1 4358.758 4270.025 4447.491     1       FALSE  0.5        1    19   NA    NA      1
2 #3B528BFF 2     2 3928.864 3876.302 3981.426     1       FALSE  0.5        1    19   NA    NA      1
3 #21908CFF 3     3 3981.760 3945.953 4017.567     1       FALSE  0.5        1    19   NA    NA      1
4 #5DC863FF 4     4 4584.258 4547.223 4621.293     1       FALSE  0.5        1    19   NA    NA      1
5 #FDE725FF 5     5 3457.542 3431.600 3483.484     1       FALSE  0.5        1    19   NA    NA      1
``````

Then compare this with the result of applying the `mean_se()` function to each group directly:

``````
> select(diamonds, cut, price) %>%
+   group_by(cut) %>%
+   summarise(mean_se(price))
# A tibble: 5 x 4
  cut           y  ymin  ymax
* <ord>     <dbl> <dbl> <dbl>
1 Fair      4359. 4270. 4447.
2 Good      3929. 3876. 3981.
3 Very Good 3982. 3946. 4018.
4 Premium   4584. 4547. 4621.
5 Ideal     3458. 3432. 3483.
``````

We can see that the values of `y`, `ymin`, and `ymax` in the layer data are consistent with the result of `mean_se()`.
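Since the default transformation is `mean_se()`, passing it explicitly through `fun.data` should produce the same layer, just without the message. The following is a minimal sketch of that check; the object names `p_default`, `p_explicit`, `d_default`, and `d_explicit` are only illustrative.

```r
library(ggplot2)

# Implicit default: building this plot prints the "No summary function supplied" message
p_default <- ggplot(diamonds, aes(cut, price)) +
  stat_summary()

# Explicit summary function: a summary function is supplied, so no message is needed
p_explicit <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(fun.data = mean_se)

# Extract the computed values of both layers
cols <- c("y", "ymin", "ymax")
d_default  <- layer_data(p_default, 1)[cols]
d_explicit <- layer_data(p_explicit, 1)[cols]

# Compare the two sets of computed values
all.equal(d_default, d_explicit)
```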

## Usage

Now that we know where the transformation function comes in, we can define our own statistical transformation and customize the plot as needed.

The `fun.data` parameter of the `stat_summary()` function specifies the statistical transformation function. When no summary function is supplied, `mean_se()` is used.

The function passed to `fun.data` must return a data frame, and the column names of that data frame must be the names of the aesthetics to be mapped (for example `y`, `ymin`, and `ymax`).
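To make this contract concrete, here is a minimal sketch of a custom transformation. `median_iqr()` is a hypothetical helper written for this example (it is not part of ggplot2); it returns the median as `y` and the first and third quartiles as `ymin` and `ymax`.

```r
library(ggplot2)

# Hypothetical helper (not a ggplot2 function): median with the interquartile range.
# The column names y, ymin and ymax are the aesthetics that pointrange expects.
median_iqr <- function(x) {
  data.frame(
    y    = median(x),
    ymin = unname(quantile(x, 0.25)),
    ymax = unname(quantile(x, 0.75))
  )
}

p <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(fun.data = median_iqr)

# Inspect the values that stat_summary() computed for each cut
d <- layer_data(p, 1)
d[c("x", "y", "ymin", "ymax")]
```

Because the default geom is `pointrange`, each cut is drawn as a point at the median with a line spanning the two quartiles.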

Let's draw some customized plots below.

### 1. 95% confidence interval error bars

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, fill = cut)) +
  stat_summary(geom = "bar") +
  stat_summary(
    geom = "errorbar", width = 0.5,
    # widen the interval from 1 to 1.96 standard errors
    fun.data = ~ mean_se(., mult = 1.96)
  )
``````

Note: We use the `~` symbol to construct an anonymous function, which is equivalent to

``````
function(x) {mean_se(x, mult = 1.96)}
``````
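The equivalence can be checked outside of a plot. The sketch below assumes the formula shorthand follows the convention of `rlang::as_function()`, where `.` stands for the first argument; the vector `x` is made up for illustration.

```r
library(ggplot2)

# Made-up values, used only for illustration
x <- c(2, 4, 4, 6, 9)

# Convert the one-sided formula into a function; `.` refers to its first argument
f_lambda <- rlang::as_function(~ mean_se(., mult = 1.96))

# The ordinary anonymous function it is meant to be equivalent to
f_plain <- function(x) mean_se(x, mult = 1.96)

# Compare the results of the two functions on the same input
all.equal(f_lambda(x), f_plain(x))
```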

### 2. Specify the fill color

We use a transformation function to set the fill color by condition: groups whose median is below a threshold get one color, and the remaining groups get another.

``````
# Return the median as y, plus a fill color chosen by comparing it with cut_off
func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

select(diamonds, cut, price) %>%
  ggplot(aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    # extra arguments for fun.data are passed as a list
    fun.args = list(cut_off = 2800),
    geom = "bar"
  )
``````

Here we pass the additional argument through `fun.args` instead of writing an anonymous function. This is equivalent to

``````
fun.data = ~ func_median_color(., cut_off = 2800)
``````
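One way to convince ourselves that the two spellings are interchangeable is to build both plots and compare their layer data. This is only a sketch: the names `p_args`, `p_lambda`, `d_args`, and `d_lambda` are illustrative, and `func_median_color()` is repeated so that the example is self-contained.

```r
library(ggplot2)
library(dplyr)

# Same transformation as above: median as y, fill chosen by the threshold
func_median_color <- function(x, cut_off) {
  tibble(y = median(x)) %>%
    mutate(fill = if_else(y < cut_off, "#80b1d3", "#fb8072"))
}

# Variant 1: extra argument passed through fun.args
p_args <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(
    fun.data = func_median_color,
    fun.args = list(cut_off = 2800),
    geom = "bar"
  )

# Variant 2: extra argument fixed inside a formula-style anonymous function
p_lambda <- ggplot(diamonds, aes(cut, price)) +
  stat_summary(
    fun.data = ~ func_median_color(., cut_off = 2800),
    geom = "bar"
  )

# Compare the computed bar heights and fill colors of the two layers
d_args   <- layer_data(p_args, 1)[c("y", "fill")]
d_lambda <- layer_data(p_lambda, 1)[c("y", "fill")]
all.equal(d_args, d_lambda)
```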

### 3. Set the point size of the pointrange

We set the size of the point in each pointrange according to the number of observations in the group.

``````
select(diamonds, cut, price) %>%
  ggplot(aes(cut, price, colour = cut)) +
  stat_summary(
    fun.data = function(x) {
      # scale the size by the group's share of all observations
      mean_se(x) %>%
        mutate(size = length(x) * 5 / nrow(diamonds))
    }
  )
``````