Overview

In this lesson we will work with ggplot2 and its “grammar of graphics” to create a variety of plots using the gapminder data we saw in the last lesson.

Grammar of graphics

When people talk about ggplot2 you may hear them use the phrase “grammar of graphics,” which is a conception of statistical graphics developed by Leland Wilkinson. The main components of the grammar (in bold) are defined as follows in the book ggplot2: Elegant Graphics for Data Analysis.

All plots are composed of the data, the information to visualize, and a mapping, the description of the data’s variables are mapped to aesthetic attributes. There are five mapping components:

  • A layer is a collection of geometric elements and statistical transformations. Geometric elements, geoms for short, represent what you actually see in the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise the data: for example, binning and counting observations to create a histogram, or fitting a linear model.
  • Scales map values in the data space to values in the aesthetic space. This includes the use of colour, shape or size. Scales also draw the legend and axes, which make it possible to read the original data values from the plot (an inverse mapping).
  • A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
  • A facet specifies how to break up and display subsets of data as small multiples. This is also known as conditioning or latticing/trellising.
  • A theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.

Remembering all of this isn’t necessary for generating plots quickly with ggplot2, so don’t worry if these concepts aren’t completely clear at first. As you work with ggplot2 you will start to get a sense for how these things fit together and what function calls to use.

Scatter plots

We first load the ggplot2 library:

library(ggplot2)

The plots you generate with ggplot2 will usually conform to the template:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
        <GEOM_FUNCTION>() +
        <COORD_FUNCTION>() +
        <SCALE_FUNCTION>() +
        <THEME_FUNCTION>()

We may not always use coordinate, scale, and theme functions, but at minimum we must specify the data, mapping, and geometry functions. ggplot2 functions need data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2. The gapminder data that we have been using is already in the format best-suited for ggplot2, but keep in mind you might need to use the dplyr functions we learned in the last lesson to get to this form for other data you may encounter.

ggplots can be built iteratively, which is great for understanding what each addition does to the resulting plot.

  1. Use the ggplot() function and bind the plot to the gapminder data frame with the data argument:
ggplot(data = gapminder)

This plot is empty because we told it what data to use, but neither the mappings nor geometries.

  1. Let’s define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph. First we’ll specify x/y axes, but later we will see how to use size, shape, color, etc.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

This is a little better in that we at least have axes, but there is still no data on the plot because we didn’t specify a geometry.

  1. Now let’s add a ‘geom’ – graphical representation of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
    • geom_point() for scatter plots, dot plots, etc.
    • geom_line() for trend lines, time series, etc.
    • geom_barplot() for, well, boxplots!
    • geom_boxplot() distributions.

To add a geom to the plot use the + operator (this is the ggplot version of the pipe from the previous lesson). Because we have two continuous variables, let’s use geom_point() first:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. Now we might want to change some things like the axis labels and add a title to the plot.

  1. We can modify plot titles, axis labels, and their sizes, colors, etc. using theme functions. Let’s give the axis labels better names, and give the plot a title using the labs() function.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy')

Now we have clearer labels and a good summary title. But what if we’re not keen on that gray background? We can use a theme function to change the look and feel of the graph. The Themes section of the ggplot2 book provides a good overview of theming options.

  1. Let’s change the theme to something lighter:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()

Storing plots as objects

Base graphics in R typically don’t allow plots to be saved and modified as objects, but you can with ggplots. This can be helpful for storing a base version of a plot to tinker with more easily without having to write some of the same code over and over again. For example:

base_plot = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

And from here we can try some different things out, like changing the color of all the points:

base_plot + geom_point(color = 'blue')

Or we can make the points a little transparent:

base_plot + geom_point(alpha = 0.5)

Tip

  • Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
  • You can also specify mappings for a given geom independently of the mappings defined globally in the ggplot() function.
  • The + sign used to add new layers must be placed at the end of the line containing the previous layer.

Exploring other aesthetics

Color

To color each continent in the plot differently, you can specify that of the data to use as the color in the aesthetic (aes()) funciton.

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point() +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()

You can also change the colors that are used by specifying a named palette, or by manually defining the colors. This page shows some of the named palettes which are available, and ?scale_colour_brewer also shows the names of the palettes.

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point() +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    scale_color_brewer(palette = 'Set3') +
    theme_bw()

Shape

You can also alter the shape of the points by specifying shape in the aes() function:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, shape = continent)) +
    geom_point() +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()

Size

You can alter the size of points by specifying size in the aes() function while also using the continents as a color. Let’s save this plot as an object to use later in this lesson.

gdp_life_plot <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
    geom_point(alpha = 0.5) +
    labs(title = 'GDP per Capita vs Life Expectancy (Size by Population)', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()
# When saving a plot as an object, RStudio won't automatically display it
gdp_life_plot

Faceting

ggplot2 has a special technique called faceting that allows us to split one plot into multiple plots based on a factor in the data. Let’s use it to split our plot into five panels, one for each continent.

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point() +
    facet_grid(. ~ continent) +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()

We can also experiment with stacking the facets vertically, rather than horizontally. The facet_grid geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column).

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point() +
    facet_grid(continent ~ .) +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()

Line plots

We’ve explored a lot of options with the geom_point() geometry, but we should expand our horizons to other geometries. Let’s first create a subset of the gapminder data limited only to Poland and plot the life expectancy over all the years for which we have data as a line plot.

Exercise

Use the filter() function to select only the Poland data from the gapminder data, and save it as as an object named “poland”.

Solution
poland = gapminder %>% filter(country == 'Poland')
poland
# A tibble: 12 × 7
   country  year      pop continent lifeExp gdpPercap     total_gdp
   <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>         <dbl>
 1 Poland   1952 25730551 Europe       61.3     4029. 103676873316.
 2 Poland   1957 28235346 Europe       65.8     4734. 133673272043.
 3 Poland   1962 30329617 Europe       67.6     5339. 161922307755.
 4 Poland   1967 31785378 Europe       69.6     6557. 208421579589.
 5 Poland   1972 33039545 Europe       70.8     8007. 264531348088.
 6 Poland   1977 34621254 Europe       70.7     9508. 329183780347.
 7 Poland   1982 36227381 Europe       71.3     8452. 306176833715.
 8 Poland   1987 37740710 Europe       71.0     9082. 342774381701.
 9 Poland   1992 38370697 Europe       71.0     7739. 296946267448.
10 Poland   1997 38654957 Europe       72.8    10160. 392718270288.
11 Poland   2002 38625976 Europe       74.7    12002. 463598198650.
12 Poland   2007 38518241 Europe       75.6    15390. 592792827796.


Exercise

For the poland subset of the gapminder data, plot the life expectancy as a function of the year. Make sure to label the axes and give the plot an appropriate title. Make sure to save your plot as an object called “poland_plot”. Hint: Try the geom_line() geometry.

Solution
poland_plot <- ggplot(data = poland, aes(x = year, y = lifeExp)) +
    geom_line() +
    labs(x = 'Year', y = 'Life Expectancy', title = 'Polish Life Expectancy Over Time')
poland_plot


This is excellent! Now let’s consider plotting a similar line graph of life expectancy across time for all countries on the same axes. Moreover, let’s color the lines by their continent. Perhaps we can even take as a guide our scatterplot from above:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point() +
    labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
    theme_bw()

We might just replace gdpPercap with lifeExp and geom_point() with geom_line() and hopefully get what we want. Let’s give it a shot:

ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent)) +
    geom_line() +
    labs(title = 'Life Expectancy Over Time by Country', x = 'Year', y = 'Life Expectancy') +
    theme_bw()

This doesn’t really look like what we expected. I expected a line for each country, but I can’t really explain how ggplot decided to create the lines. There is a parameter of aes() called group which can help us. It will tell ggplot to first group the data by a column of the data, in this we probably want country. What happens if we try that? (Let’s also save this plot as an object to use later.)

life_over_time_plot <- ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent, group = country)) +
    geom_line() +
    labs(title = 'Life Expectancy Over Time by Country', x = 'Year', y = 'Life Expectancy') +
    theme_bw()
life_over_time_plot

Voila, we’ve got ourselves a spaghetti plot.

Saving plots to file

Let’s take a brief pause from experimenting with plots and learn how to save plots as image files. We should have three plots that we’ve saved as objects: gdp_life_plot, poland_plot, and life_over_time_plot. Let’s save these plots as files using the ggsave() function.

The ggsave() function is an easy way to specify which plots to save, at what location, at what size and resolution, and in what format. Let’s first have a look at the help file to figure out which parameters we should pay attention to.

?ggsave

The filename parameter can determine what format to write the file in based on the extension (e.g. .tiff, .png, .pdf, or .jpeg). Next the height and width can be specified in any number of units to get the exact size figure you prefer. Finally, the dpi parameter specifies the resolution in “dots per inch”, and this parameter can ensure that the plots you output are sufficiently high-quality to submit as part of a publication to a journal. Typically 300dpi is sufficient. With this in mind, let’s save the gdp_life_plot:

ggsave(filename = 'gdp_life_plot.png', plot = gdp_life_plot, width = 6, height = 6, units = 'in', dpi = 300)

Bar plots

What if we wanted to plot the global total GDP per year as a bar plot, but separated by continent? Let’s first take a look at the help function for geom_bar():

?geom_bar

Looking through the description:

Since we want the height of the bars to represent the values of the total_gdp column, we should be using geom_col(), a relative of geom_bar(). To make this graph, it’s clear we want the year on the x-axis, the total_gdp on the y-axis, and if we want to separate the continents apart in the graph, we will use a new aesthetic parameter called fill and assign it continent.

ggplot(data = gapminder, aes(x = year, y = total_gdp, fill = continent)) + geom_col()

This is a good start, and from previous examples we know how we might alter the x and y axis labels, but how can we edit the legend title? This is a common task when creating publication figures because the legend title will inherit the name of the column, which is usually some shorthand. We can use the same labs() function we used before, and specify a value for fill, to go with the fill parameter in aes() above. Let’s save this plot as “total_gdp_plot”.

total_gdp_plot = ggplot(data = gapminder, aes(x = year, y = total_gdp, fill = continent)) +
    geom_col() +
    labs(title = 'Global GDP Over Time (by continent)', x = 'Year', y = 'Total GDP', fill = 'Continent')
total_gdp_plot

Box plots

What if we wanted a plot to look at the distribution of populations of countries grouped by continents over the years? Let’s start by building this plot iteratively.

ggplot(data = gapminder, aes(x = continent, y = pop)) + geom_boxplot()

This is a good start, but because of the wide range of populations, we’re having a problem seeing the distribution of values at the lower end. This is a perfect opportunity to change the scale of the y-axis to a logarithmic scale using the scale_y_log10() function.

ggplot(data = gapminder, aes(x = continent, y = pop)) + geom_boxplot() + scale_y_log10()

This is better, but we still have all the years lumped into each continent bar. We could split out the years as facets. We saw an example of that with facet_grid() above. Since we’re trying to compare the distributions of populations across years at a glance, it would probably best to facet the years horizontally.

ggplot(data = gapminder, aes(x = continent, y = pop)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_grid(. ~ year)

This is great, we’ve come a long way, but maybe we’d actually like the x-axis to be by year and the facets to be by continent. That would feel like a better display of the data.

ggplot(data = gapminder, aes(x = year, y = pop)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_grid(. ~ continent)
Warning: Continuous x aesthetic -- did you forget aes(group=...)?

That’s an interesting warning. Because year is considered a numeric, we’re not really getting what we’d like. ggplot is helpfully suggesting adding group = year.

ggplot(data = gapminder, aes(x = year, y = pop, group = year)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_grid(. ~ continent)

That feels like closer to what we intended. It also has the benefit of being beautiful in a way.

Rotating axis labels

You will very likely discover in the course of making plots that the axis labels may collide with one another, and that rotating the labels would make for a better plot. Let’s modify the plot we just made to rotate the years by 45 degrees. But where to start? Typically the answer for ggplot questions is to search the internet and look at example code. Let’s do just that. Let’s search for “how to rotate axis labels in ggplot”.

It looks like a possible solution is to add

theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Let’s give it a shot:

ggplot(data = gapminder, aes(x = year, y = pop, group = year)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_grid(. ~ continent) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Perfect!

Summary

In this lesson we’ve introduced the basic concepts underpinnning ggplot2:

We’ve introduced a number of specific geometries:

And we explored various customizations:

And we explored how to save our plots in different formats and resolutions to ensure publication quality graphics.

Resources