In this lesson we will work with ggplot2
and its
“grammar of graphics” to create a variety of plots using the
gapminder
data we saw in the last lesson.
When people talk about ggplot2
you may hear them use the
phrase “grammar of graphics,” which is a conception of statistical
graphics developed by Leland Wilkinson. The main components of the
grammar (in bold) are defined as follows in the book ggplot2: Elegant Graphics for Data
Analysis.
All plots are composed of the data, the information to visualize, and a mapping, the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components:
- A layer is a collection of geometric elements and statistical transformations. Geometric elements, geoms for short, represent what you actually see in the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise the data: for example, binning and counting observations to create a histogram, or fitting a linear model.
- Scales map values in the data space to values in the aesthetic space. This includes the use of colour, shape or size. Scales also draw the legend and axes, which make it possible to read the original data values from the plot (an inverse mapping).
- A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
- A facet specifies how to break up and display subsets of data as small multiples. This is also known as conditioning or latticing/trellising.
- A theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.
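As a rough illustration of how these components line up with ggplot2 functions, here is a minimal sketch. The toy data frame and its column names (x, y, grp) are invented for this example and are not part of the gapminder data.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  x = 1:6,
  y = c(2, 4, 3, 6, 5, 7),
  grp = rep(c('a', 'b', 'c'), each = 2)
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y, color = grp)) +  # data and mapping
  geom_point() +                          # layer: a geom with its default stat
  scale_color_brewer(palette = 'Set1') +  # scale: data values -> colors
  coord_cartesian() +                     # coord: the default system, made explicit
  facet_grid(. ~ grp) +                   # facet: one panel per value of grp
  theme_bw()                              # theme: non-data display details
p
```

Each line after the first adds one component, so you can delete any of the optional lines and still get a valid plot.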
Remembering all of this isn’t necessary for generating plots quickly with ggplot2, so don’t worry if these concepts aren’t completely clear at first. As you work with ggplot2 you will start to get a sense for how these things fit together and what function calls to use.
We first load the ggplot2 library:
library(ggplot2)
The plots you generate with ggplot2
will usually conform
to the template:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOM_FUNCTION>() +
<COORD_FUNCTION>() +
<SCALE_FUNCTION>() +
<THEME_FUNCTION>()
We may not always use coordinate, scale, and theme functions, but at minimum we must specify the data, mapping, and geometry functions. ggplot2 functions need data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2. The gapminder data that we have been using is already in the format best-suited for ggplot2, but keep in mind you might need to use the dplyr functions we learned in the last lesson to get to this form for other data you may encounter.
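To make the difference between ‘wide’ and ‘long’ data concrete, here is a small sketch. The table and its values are invented for illustration, and the reshaping is done by hand in base R (rather than with dplyr) purely to keep the example self-contained.

```r
# Hypothetical 'wide' table: one column per year (values invented)
wide <- data.frame(
  country = c('A', 'B'),
  lifeExp_1952 = c(60, 50),
  lifeExp_2007 = c(75, 65)
)

# The same information in 'long' form: one row per country-year observation
long <- data.frame(
  country = rep(wide$country, times = 2),
  year = rep(c(1952, 2007), each = nrow(wide)),
  lifeExp = c(wide$lifeExp_1952, wide$lifeExp_2007)
)
long
```

In the long version, year and lifeExp are ordinary columns, so they can be mapped directly to x and y in aes().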
ggplots can be built iteratively, which is great for understanding what each addition does to the resulting plot.
Use the ggplot() function and bind the plot to the gapminder data frame with the data argument:
ggplot(data = gapminder)
This plot is empty because we told it what data to use, but neither the mappings nor geometries.
Define a mapping (using the aesthetic, aes(), function) by selecting the variables to be plotted and specifying how to present them in the graph. First we’ll specify x/y axes, but later we will see how to use size, shape, color, etc.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
This is a little better in that we at least have axes, but there is still no data on the plot because we didn’t specify a geometry.
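One way to see why the first two plots show no data is to count the layers in each plot object. This is only a sketch: the toy data frame and its columns (gdp, life) are made up for the example.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(gdp = c(1000, 5000, 20000), life = c(50, 65, 80))

p_data   <- ggplot(data = toy)                            # data only
p_mapped <- ggplot(data = toy, aes(x = gdp, y = life))    # data and mapping
p_points <- p_mapped + geom_point()                       # data, mapping, and a geom

length(p_data$layers)    # 0: nothing to draw
length(p_mapped$layers)  # 0: axes can be drawn, but still no layer
length(p_points$layers)  # 1: the point layer
```

Nothing is drawn from the data until at least one layer has been added.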
ggplot2 offers many different geoms; we will use some common ones today, including:
- geom_point() for scatter plots, dot plots, etc.
- geom_line() for trend lines, time series, etc.
- geom_bar() for bar plots
- geom_boxplot() for, well, boxplots, which summarize distributions
To add a geom to the plot use the + operator (this is the ggplot version of the pipe from the previous lesson).
Because we have two continuous variables, let’s use
geom_point()
first:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()
The +
in the ggplot2
package is
particularly useful because it allows you to modify existing
ggplot
objects. Now we might want to change some things
like the axis labels and add a title to the plot.
We can do this with the labs() function:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy')
Now we have clearer labels and a good summary title. But what if we’re not keen on that gray background? We can use a theme function to change the look and feel of the graph. The Themes section of the ggplot2 book provides a good overview of theming options.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
Base graphics in R typically don’t allow plots to be saved and modified as objects, but you can with ggplots. This can be helpful for storing a base version of a plot to tinker with more easily without having to write some of the same code over and over again. For example:
base_plot = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
And from here we can try some different things out, like changing the color of all the points:
base_plot + geom_point(color = 'blue')
Or we can make the points a little transparent:
base_plot + geom_point(alpha = 0.5)
Tip
- Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
- You can also specify mappings for a given geom independently of the mappings defined globally in the ggplot() function.
- The + sign used to add new layers must be placed at the end of the line containing the previous layer.
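The following sketch illustrates the difference between global and geom-specific mappings. The toy data frame and its columns (x, y, grp) are invented for the example.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(1, 2, 3, 2, 3, 5),
  grp = rep(c('a', 'b'), each = 3)
)

p <- ggplot(data = toy, aes(x = x, y = y)) +  # global mapping, seen by both geoms
  geom_line(aes(group = grp)) +               # lines: grouped, but not colored
  geom_point(aes(color = grp))                # color is mapped for the points only
p
```

Note that every + sits at the end of a line. If a line started with + instead, R would treat the previous line as a complete expression and the layer would not be added.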
To color each continent in the plot differently, you can specify that column of the data as the color in the aesthetic (aes()) function.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
You can also change the colors that are used by specifying a named
palette, or by manually defining the colors. This
page shows some of the named palettes which are available, and
?scale_colour_brewer
also shows the names of the
palettes.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
scale_color_brewer(palette = 'Set3') +
theme_bw()
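The example above uses a named palette. For manually defined colors, one option is scale_color_manual(), sketched here on invented toy data; the color choices and the group names are arbitrary assumptions.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(x = 1:3, y = c(2, 1, 3), grp = c('a', 'b', 'c'))

# A named vector ties each group to a specific color
my_colors <- c(a = 'steelblue', b = 'darkorange', c = 'forestgreen')

p <- ggplot(data = toy, aes(x = x, y = y, color = grp)) +
  geom_point() +
  scale_color_manual(values = my_colors) +
  theme_bw()
p
```

Naming the vector keeps the color assignment stable even if the order of the groups in the data changes.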
You can also alter the shape of the points by specifying
shape
in the aes()
function:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, shape = continent)) +
geom_point() +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
You can alter the size of points by specifying size in the aes() function while also using the continents as a color. Let’s save this plot as an object to use later in this lesson.
gdp_life_plot <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
geom_point(alpha = 0.5) +
labs(title = 'GDP per Capita vs Life Expectancy (Size by Population)', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
# When saving a plot as an object, RStudio won't automatically display it
gdp_life_plot
ggplot2
has a special technique called faceting
that allows us to split one plot into multiple plots based on a factor
in the data. Let’s use it to split our plot into five panels, one for
each continent.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
facet_grid(. ~ continent) +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
We can also experiment with stacking the facets vertically, rather than horizontally. The facet_grid() function allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column).
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
facet_grid(continent ~ .) +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
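The formula can also name a variable on both sides, which gives a grid of panels. This is a minimal sketch on invented toy data; the period column does not exist in gapminder and is assumed only for illustration.

```r
library(ggplot2)

# Hypothetical toy data: every combination of two grouping variables
toy <- expand.grid(
  continent = c('X', 'Y'),
  period = c('early', 'late'),
  stringsAsFactors = FALSE
)
toy$x <- c(1, 2, 3, 4)
toy$y <- c(4, 3, 2, 1)

# Rows are split by period, columns by continent: a 2 x 2 grid of panels
p <- ggplot(data = toy, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(period ~ continent)
p
```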
We’ve explored a lot of options with the geom_point()
geometry, but we should expand our horizons to other geometries. Let’s
first create a subset of the gapminder
data limited only to
Poland and plot the life expectancy over all the years for which we have
data as a line plot.
Exercise
Use the filter() function to select only the Poland data from the gapminder data, and save it as an object named “poland”.
poland = gapminder %>% filter(country == 'Poland')
poland
# A tibble: 12 × 7
   country  year      pop continent lifeExp gdpPercap     total_gdp
   <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>         <dbl>
 1 Poland   1952 25730551 Europe       61.3     4029. 103676873316.
 2 Poland   1957 28235346 Europe       65.8     4734. 133673272043.
 3 Poland   1962 30329617 Europe       67.6     5339. 161922307755.
 4 Poland   1967 31785378 Europe       69.6     6557. 208421579589.
 5 Poland   1972 33039545 Europe       70.8     8007. 264531348088.
 6 Poland   1977 34621254 Europe       70.7     9508. 329183780347.
 7 Poland   1982 36227381 Europe       71.3     8452. 306176833715.
 8 Poland   1987 37740710 Europe       71.0     9082. 342774381701.
 9 Poland   1992 38370697 Europe       71.0     7739. 296946267448.
10 Poland   1997 38654957 Europe       72.8    10160. 392718270288.
11 Poland   2002 38625976 Europe       74.7    12002. 463598198650.
12 Poland   2007 38518241 Europe       75.6    15390. 592792827796.
Exercise
For the poland subset of the gapminder data, plot the life expectancy as a function of the year. Make sure to label the axes and give the plot an appropriate title. Make sure to save your plot as an object called “poland_plot”. Hint: Try the geom_line() geometry.
poland_plot <- ggplot(data = poland, aes(x = year, y = lifeExp)) +
geom_line() +
labs(x = 'Year', y = 'Life Expectancy', title = 'Polish Life Expectancy Over Time')
poland_plot
This is excellent! Now let’s consider plotting a similar line graph of life expectancy across time for all countries on the same axes. Moreover, let’s color the lines by their continent. Perhaps we can even take as a guide our scatterplot from above:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
labs(title = 'GDP per Capita vs Life Expectancy', x = 'GDP per Capita', y = 'Life Expectancy') +
theme_bw()
We might just replace gdpPercap with year and geom_point() with geom_line() and hopefully get what we want. Let’s give it a shot:
ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent)) +
geom_line() +
labs(title = 'Life Expectancy Over Time by Country', x = 'Year', y = 'Life Expectancy') +
theme_bw()
This doesn’t really look like what we expected. I expected a line for each country, but I can’t really explain how ggplot decided to create the lines. There is a parameter of aes() called group which can help us. It will tell ggplot to first group the data by a column of the data; in this case we probably want country. What happens if we try that? (Let’s also save this plot as an object to use later.)
life_over_time_plot <- ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent, group = country)) +
geom_line() +
labs(title = 'Life Expectancy Over Time by Country', x = 'Year', y = 'Life Expectancy') +
theme_bw()
life_over_time_plot
Voila, we’ve got ourselves a spaghetti plot.
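To see what group changes, it can help to count the groups ggplot2 actually builds. When group is not set, ggplot2 groups by the discrete variables that appear in the mapping, which here is only the color variable. The sketch below uses invented toy data with three countries on two continents.

```r
library(ggplot2)

# Hypothetical toy data: three countries, two continents (values invented)
toy <- data.frame(
  year = rep(c(1990, 2000, 2010), times = 3),
  value = c(1, 2, 3, 2, 3, 4, 5, 5, 6),
  country = rep(c('A', 'B', 'C'), each = 3),
  continent = rep(c('P', 'P', 'Q'), each = 3)
)

ungrouped <- ggplot(data = toy, aes(x = year, y = value, color = continent)) +
  geom_line()
grouped <- ggplot(data = toy, aes(x = year, y = value, color = continent, group = country)) +
  geom_line()

length(unique(layer_data(ungrouped)$group))  # 2: one zig-zag line per continent
length(unique(layer_data(grouped)$group))    # 3: one line per country
```

Without group, all countries in a continent are joined into a single line, which is why the first attempt looked so jagged.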
Let’s take a brief pause from experimenting with plots and learn how to save plots as image files. We should have three plots that we’ve saved as objects: gdp_life_plot, poland_plot, and life_over_time_plot. Let’s save these plots as files using the ggsave() function.
The ggsave()
function is an easy way to specify which
plots to save, at what location, at what size and resolution, and in
what format. Let’s first have a look at the help file to figure out
which parameters we should pay attention to.
?ggsave
The filename parameter can determine what format to write the file in based on the extension (e.g. .tiff, .png, .pdf, or .jpeg). Next the height and width can be specified in one of several units (set with the units parameter) to get the exact size figure you prefer. Finally, the dpi parameter specifies the resolution in “dots per inch”, and this parameter can ensure that the plots you output are sufficiently high-quality to submit as part of a publication to a journal. Typically 300 dpi is sufficient. With this in mind, let’s save the gdp_life_plot:
ggsave(filename = 'gdp_life_plot.png', plot = gdp_life_plot, width = 6, height = 6, units = 'in', dpi = 300)
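The same call works for other formats and units. The sketch below writes a PDF sized in centimetres to a temporary directory; the toy plot, the file name, and the dimensions are all assumptions made for the example.

```r
library(ggplot2)

# Hypothetical toy plot, invented for illustration only
toy <- data.frame(x = 1:3, y = c(2, 4, 8))
p <- ggplot(data = toy, aes(x = x, y = y)) + geom_point()

# The .pdf extension selects the output format; tempdir() keeps the example tidy
out_file <- file.path(tempdir(), 'toy_plot.pdf')
ggsave(filename = out_file, plot = p, width = 12, height = 8, units = 'cm')

file.exists(out_file)
```

PDF is a vector format, so dpi matters mainly for raster formats such as .png, .tiff, and .jpeg.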
What if we wanted to plot the global total GDP per year as a bar
plot, but separated by continent? Let’s first take a look at the help
function for geom_bar()
:
?geom_bar
Looking through the description:
geom_bar() makes the height of the bar proportional to the number of cases in each group. If you want the heights of the bars to represent values in the data, use geom_col() instead.
Since we want the height of the bars to represent the values of the total_gdp column, we should be using geom_col(), a relative of geom_bar(). To make this graph, it’s clear we want the year on the x-axis, the total_gdp on the y-axis, and if we want to separate the continents in the graph, we will use a new aesthetic parameter called fill and assign it continent.
ggplot(data = gapminder, aes(x = year, y = total_gdp, fill = continent)) + geom_col()
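The difference between the two geoms is easy to check on a tiny data frame. This is a sketch with invented values; only the column names echo the lesson.

```r
library(ggplot2)

# Hypothetical toy data: two rows for continent X, one for Y (values invented)
toy <- data.frame(
  continent = c('X', 'X', 'Y'),
  total_gdp = c(10, 30, 5)
)

# geom_bar() counts rows per group: X has 2 rows, Y has 1
p_bar <- ggplot(data = toy, aes(x = continent)) + geom_bar()
layer_data(p_bar)$count

# geom_col() uses the y values; rows sharing an x position are stacked,
# so the bar for X reaches 10 + 30 = 40
p_col <- ggplot(data = toy, aes(x = continent, y = total_gdp)) + geom_col()
max(layer_data(p_col)$ymax)
```

The stacking is also why the gapminder plot shows totals per year: the rows for individual countries are stacked on top of one another within each bar.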
This is a good start, and from previous examples we know how we might alter the x and y axis labels, but how can we edit the legend title? This is a common task when creating publication figures because the legend title will inherit the name of the column, which is usually some shorthand. We can use the same labs() function we used before, and specify a value for fill, to go with the fill parameter in aes() above. Let’s save this plot as “total_gdp_plot”.
total_gdp_plot = ggplot(data = gapminder, aes(x = year, y = total_gdp, fill = continent)) +
geom_col() +
labs(title = 'Global GDP Over Time (by continent)', x = 'Year', y = 'Total GDP', fill = 'Continent')
total_gdp_plot
What if we wanted a plot to look at the distribution of populations of countries grouped by continents over the years? Let’s start by building this plot iteratively.
ggplot(data = gapminder, aes(x = continent, y = pop)) + geom_boxplot()
This is a good start, but because of the wide range of populations,
we’re having a problem seeing the distribution of values at the lower
end. This is a perfect opportunity to change the scale of the y-axis to
a logarithmic scale using the scale_y_log10()
function.
ggplot(data = gapminder, aes(x = continent, y = pop)) + geom_boxplot() + scale_y_log10()
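To see what scale_y_log10() does to the data, we can inspect the values ggplot2 actually plots. The sketch uses invented populations and geom_point() instead of geom_boxplot() so that each value stays visible.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(continent = c('X', 'Y', 'Z'), pop = c(1e3, 1e6, 1e9))

p <- ggplot(data = toy, aes(x = continent, y = pop)) +
  geom_point() +
  scale_y_log10()

# The plotted y values are the base-10 logarithms of pop
layer_data(p)$y  # 3 6 9
```

The axis labels still show the original values, but the points are spaced by their logarithms. As far as I know, summary statistics such as those in a boxplot are computed after this transformation.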
This is better, but we still have all the years lumped into each continent bar. We could split out the years as facets. We saw an example of that with facet_grid() above. Since we’re trying to compare the distributions of populations across years at a glance, it would probably be best to facet the years horizontally.
ggplot(data = gapminder, aes(x = continent, y = pop)) +
geom_boxplot() +
scale_y_log10() +
facet_grid(. ~ year)
This is great, we’ve come a long way, but maybe we’d actually like the x-axis to be by year and the facets to be by continent. That would feel like a better display of the data.
ggplot(data = gapminder, aes(x = year, y = pop)) +
geom_boxplot() +
scale_y_log10() +
facet_grid(. ~ continent)
Warning: Continuous x aesthetic -- did you forget aes(group=...)?
That’s an interesting warning. Because year is considered a numeric, we’re not really getting what we’d like. ggplot is helpfully suggesting adding group = year.
ggplot(data = gapminder, aes(x = year, y = pop, group = year)) +
geom_boxplot() +
scale_y_log10() +
facet_grid(. ~ continent)
That feels closer to what we intended. It also has the benefit of being beautiful in a way.
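An alternative to group = year, not used in this lesson, is to convert year to a factor so that the x-axis becomes discrete. The sketch below uses invented toy data with two years.

```r
library(ggplot2)

# Hypothetical toy data: four populations in each of two years (values invented)
toy <- data.frame(
  year = rep(c(1952, 2007), each = 4),
  pop = c(1e5, 2e5, 4e5, 8e5, 2e5, 4e5, 8e5, 1.6e6)
)

# factor(year) makes x discrete, so ggplot2 draws one box per year
p <- ggplot(data = toy, aes(x = factor(year), y = pop)) +
  geom_boxplot() +
  scale_y_log10()

nrow(layer_data(p))  # 2: one row of boxplot statistics per year
```

One trade-off is that a discrete axis spaces the years evenly and labels every one of them, whereas group = year keeps year as a continuous axis.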
You will very likely discover in the course of making plots that the axis labels may collide with one another, and that rotating the labels would make for a better plot. Let’s modify the plot we just made to rotate the years by 45 degrees. But where to start? Typically the answer for ggplot questions is to search the internet and look at example code. Let’s do just that. Let’s search for “how to rotate axis labels in ggplot”.
…
It looks like a possible solution is to add
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Let’s give it a shot:
ggplot(data = gapminder, aes(x = year, y = pop, group = year)) +
geom_boxplot() +
scale_y_log10() +
facet_grid(. ~ continent) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Perfect!
In this lesson we’ve introduced the basic concepts underpinning ggplot2:
We’ve introduced a number of specific geometries:
- geom_point()
- geom_line()
- geom_bar() and geom_col()
- geom_boxplot()
And we explored various customizations:
And we explored how to save our plots in different formats and resolutions to ensure publication quality graphics.