Curriculum
Scatterplots are used to visualize the relationship between two continuous variables. In ggplot2 we can use the geom geom_point()
to create a simple scatterplot:
# Simple scatterplot
ggplot(issp, aes(x=age, y=hours_work)) +
geom_point()
Above we created a simple scatterplot of the relationship between age and hours_work from the issp dataset. We could modify the points by using another shape:
# Changing the shape of the points
ggplot(issp, aes(x=age, y=hours_work)) +
geom_point(shape=18)
There are several shapes you can use to modify the points. The picture below shows the different shapes and their corresponding numbers.
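The shape codes can also be drawn directly in R. The following is a minimal sketch, not part of the original material: it assumes the standard R point shapes numbered 0 to 25, and the object names shapes and p_shapes are made up for illustration.

```r
library(ggplot2)

# One row per shape code, laid out in a grid of seven columns
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,      # column position in the grid
  y = -(0:25 %/% 7)   # row position, from top to bottom
)

# scale_shape_identity() tells ggplot2 to use the numbers as shape codes directly
p_shapes <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 5, fill = "steelblue") +
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +
  scale_shape_identity() +
  theme_void()
p_shapes
```

Shapes 21 to 25 have both an outline and a fill, which is why the fill = argument only affects those five.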
If we wanted to include a third variable, class, we could use several aesthetics to map it. We could use the color =
argument to highlight the class differences by using a specific color for each class:
# Simple scatterplot with third variable using the color = argument
ggplot(issp, aes(x=age, y=hours_work, color=class)) +
geom_point()
However, we could also use a different size or shape to highlight the class categories:
# Shape
ggplot(issp, aes(x=age, y=hours_work, shape=class)) +
geom_point()
# Size
ggplot(issp, aes(x=age, y=hours_work, size=class)) +
geom_point()
## Warning: Using size for a discrete variable is not advised.
We get a warning from R when using the size = class
argument. As you can see from the plot the points tend to overlap making it difficult to distinguish between the different classes.
A common problem if you have a large dataset with many observations is overplotting, i.e. that the points overlap. There are different ways to tackle this problem. For example, we could make the points more transparent using the alpha =
argument from before:
# Correcting for overplotting: transparency of points
ggplot(issp, aes(x=age, y=hours_work)) +
geom_point(alpha=0.2)
This makes it a bit better. You should always try different values for the alpha = argument. alpha = ranges from 0 to 1, where lower values correspond to more transparency.
Alternatively, we could use geom_jitter() to help tackle overplotting. This geom adds some random variation to the location of each point:
# Correcting for overplotting: geom_jitter()
ggplot(issp, aes(x=age, y=hours_work)) +
geom_jitter()
The plot clearly shows more datapoints using geom_jitter()
. We could then combine geom_jitter()
with the alpha =
argument to reduce the problem further:
# Correcting for overplotting: geom_jitter() and transparency
ggplot(issp, aes(x=age, y=hours_work)) +
geom_jitter(alpha=0.3)
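The amount of random variation that geom_jitter() adds can be adjusted with its width = and height = arguments. Below is a minimal sketch, assuming a simulated data frame (demo, with made-up values) that stands in for issp; the particular values 0.3 and 1 are only a starting point to experiment with.

```r
library(ggplot2)

# Simulated stand-in for the issp data (hypothetical values)
set.seed(1)
demo <- data.frame(
  age = sample(18:70, 500, replace = TRUE),
  hours_work = sample(seq(10, 60, 5), 500, replace = TRUE)
)

# width and height set the maximum shift in each direction,
# so the total spread is twice the value given
p_jitter <- ggplot(demo, aes(x = age, y = hours_work)) +
  geom_jitter(width = 0.3, height = 1, alpha = 0.3)
p_jitter
```

Keeping the jitter small relative to the distance between neighbouring values avoids moving points so far that they appear to belong to a different age or number of hours.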
geom_jitter()
is especially useful if you plot the relationship between a discrete and continuous variable. Take the following example. Note that we include coord_flip()
so that the labels on the x-axis won’t overlap.
ggplot(issp, aes(x=class, y=hours_work, color=class)) +
geom_point() +
coord_flip()
This is not very informative since it just creates lines of points for each value of class. We can correct this issue by using geom_jitter()
instead:
ggplot(issp, aes(x=class, y=hours_work, color=class)) +
geom_jitter() +
coord_flip()
However, it would be more useful to create a boxplot if you want to visualize the relationship between a categorical and continuous variable.
Another option for reducing overplotting is geom_hex(). This geom is part of ggplot2, but it requires the hexbin package to be installed. It creates a hexagonal heat map in which each hexagon represents a collection of points:
# Correcting for overplotting: geom_hex()
ggplot(issp, aes(x=age, y=hours_work)) +
geom_hex()
We can see that there are more observations in the brighter colors.
You can also modify the number of bins using the bins =
argument:
# Changing the number of bins
ggplot(issp, aes(x=age, y=hours_work)) +
geom_hex(bins=10)
You should always try increasing and decreasing the number of bins to see what happens to your plot.
We could also change the color of the gradient by adding the scale_fill_gradient()
function:
# Changing the color of gradient
ggplot(issp, aes(x=age, y=hours_work)) +
geom_hex(bins = 40) +
scale_fill_gradient(low = "grey", high = "red")
And to visualize the points even better, we could place the scatterplot on top of the hexagonal heat map:
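A minimal sketch of such a layered plot, assuming a simulated data frame (demo, with made-up values) in place of issp. The bin count, colors and transparency are illustrative choices rather than recommended settings.

```r
library(ggplot2)  # geom_hex() also needs the hexbin package to be installed

# Simulated stand-in for the issp data (hypothetical values)
set.seed(1)
demo <- data.frame(
  age = sample(18:70, 500, replace = TRUE),
  hours_work = sample(seq(10, 60, 5), 500, replace = TRUE)
)

# Layers are drawn in the order they are added: hexagons first, points on top
p_overlay <- ggplot(demo, aes(x = age, y = hours_work)) +
  geom_hex(bins = 40) +
  geom_jitter(alpha = 0.3) +
  scale_fill_gradient(low = "grey", high = "red")

# Only draw the plot if hexbin is available
if (requireNamespace("hexbin", quietly = TRUE)) print(p_overlay)
```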
A general note when layering multiple geoms in one plot, such as geom_hex() and geom_jitter(): place the aes() call inside ggplot() rather than inside each geom. R then uses the mappings in aes() as the default for every geom that follows, which saves typing. This is also good practice when you only use one geom, because it makes the code easier to read. That’s why we’ve always placed aes() inside ggplot() in the examples in this session.
These are different approaches to tackling overplotting. Which method comes closest to solving the problem will vary from case to case, so you should try them all to see which handles the issue best in your particular case.
geom_smooth() with lm
It is also possible to fit a straight regression line through the data by adding geom_smooth() to our scatterplot:
# Adding a straight regression line
ggplot(issp, aes(x=age, y=hours_work)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
In the code above, we set the method = argument to “lm”. You’ll learn more about what “lm” means in the session on regression, but for now, just know that specifying “lm” in the method = argument tells R to fit a linear regression line (i.e., a straight line) to the plot.
From the output, we see that the regression line is nearly flat, although there seems to be a very weak negative correlation between age and the number of working hours.
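The line drawn by geom_smooth(method = "lm") comes from the same kind of model that the lm() function fits directly. The following is a minimal sketch on simulated data: the data frame demo and all of its values are made up and only stand in for issp.

```r
# Simulated stand-in for the issp data (hypothetical values)
set.seed(1)
demo <- data.frame(age = sample(18:70, 500, replace = TRUE))
demo$hours_work <- 40 - 0.05 * demo$age + rnorm(500, sd = 8)

# Fit the straight line y = intercept + slope * x
fit <- lm(hours_work ~ age, data = demo)

# The two coefficients are the intercept and the slope of the line
coef(fit)
```

The slope tells you how much the predicted number of working hours changes for each additional year of age; a slope close to zero corresponds to a nearly flat line in the plot.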
If we do not want to have standard errors displayed in the plot, we could add the argument se = FALSE
inside geom_smooth()
:
# Removing standard errors
ggplot(issp, aes(x=age, y=hours_work)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
If we wanted to add regression lines by a grouping variable, we could use geom_point()
and the color =
argument, in combination with geom_smooth()
:
# Including a grouping variable using the color = argument
ggplot(issp, aes(x=age, y=hours_work, color=gender)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Help documentation
To see the help pages for the functions geom_point(), geom_jitter(), geom_hex() and geom_smooth(), just run the following code and the help pages will open in your browser.
help(geom_point, package = "ggplot2")
help(geom_jitter, package = "ggplot2")
help(geom_hex, package = "ggplot2")
help(geom_smooth, package = "ggplot2")
Reminder
In these exercises, you will be using the dataset abu89. In all exercises where you are asked to make a plot or modify code used to make a plot, you’ll have to type the name of the plot on another line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.
The exercises (1, 2, 3) in this part are also sequential (1a-1g, 2a-2c, 3a-3e), meaning that you should modify your code from the previous step (a) to correspond to the changes in the following step (b, etc.).
Scatterplots and overplotting
1. Using the abu89 dataset…
a) … Make a scatterplot of the relationship between age and wage_hour. Store it in an object named p. Remember to type the name of the plot on another line to print the result.
p <- ggplot(abu89) +
geom_point(aes(x=age, y=wage_hour))
p
"geom_point() is the way to go!"
"Replace the dotted lines with the name of the dataset and the correct variables"
p <- ggplot(...) +
geom_point(aes(x=..., y=...))
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89) +
geom_point(aes(x=age, y=wage_hour))
...
b) Add some color to your plot by adding the color="steelblue"
argument.
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_point(color="steelblue")
p
"Remember to place the color argument outside aes()"
"Replace the dotted lines with the correct color"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_point(color="...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_point(color="steelblue")
...
c) Make the points a bit more transparent by setting the alpha =
argument to 0.2. Does it make the plot look better?
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_point(color="steelblue", alpha=0.2)
p
"Remember that the 'alpha =' argument should be placed inside geom_point()"
"Replace the dotted lines with the correct specification"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_point(color="steelblue", alpha=...)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_point(color="steelblue", alpha=0.2)
...
d) Try using geom_jitter()
instead of geom_point()
. Does it make the plot look better?
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_jitter(color="steelblue", alpha=0.2)
p
"Replace the dotted lines with the correct geom"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_...(color="steelblue", alpha=0.2)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_jitter(color="steelblue", alpha=0.2)
...
e) Try using geom_hex()
instead of geom_jitter()
. Does this make the plot look better?
p <- ggplot(abu89) +
geom_hex(aes(x=age, y=wage_hour))
p
"You'll have to remove the color= and alpha= arguments"
"Replace the dotted line with the correct geom"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89) +
geom_hex(aes(x=age, y=wage_hour))
...
f) Try mapping the scatterplot (geom_point()
) on top of geom_hex()
. Does this make the plot look better?
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_hex() +
geom_point()
p
"Replace the dotted lines with the correct geoms"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_...() +
geom_...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_hex() +
geom_point()
...
g) Change the color gradient by adding the scale_fill_gradient()
function. Set the low color to “grey” and the high color to “orange”.
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_hex() +
geom_point() +
scale_fill_gradient(low = "grey", high="orange")
p
"Add the scale_fill_gradient() function to the bottom of your code. Remember the + operator!"
"Replace the dotted lines with the correct specifications"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_hex() +
geom_point() +
scale_fill_gradient(... = "...", ... = "...")
"Replace the dotted lines with the correct colors"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_hex() +
geom_point() +
scale_fill_gradient(low = "...", high = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour)) +
geom_hex() +
geom_point() +
scale_fill_gradient(low = "grey", high="orange")
...
h) Use the console below and the skills you’ve acquired to modify your plot to make it even better. Add appropriate labels and titles, change the theme, modify the number of hexagon bins, change the color, modify the limits of the x- and y-axis etc. Do whatever you like!
Scatterplots with groups
2. Using the abu89 dataset…
a) …Make a scatterplot (using geom_jitter()
) of the relationship between age and wage_hour, but this time group the plot by gender using the color =
argument to highlight the gender differences by separate colors. Save it in an object named p. Remember to type the name of the plot on another line to print the result.
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter()
p
"Remember to place the color= argument inside aes()"
"Replace the dotted lines with the correct variable"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=...)) +
geom_jitter()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter()
...
b) Make the points a bit more transparent by specifying the alpha =
argument to 0.3.
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3)
p
"Remember to place the alpha = argument inside geom_jitter()"
"Replace the dotted lines with the correct specification"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = ...)
p
c) Add appropriate labels to the y- and x-axis to communicate the information in the plot more clearly. Use the following labels: “Hourly wage” for the y-axis and “Age” for the x-axis.
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age")
p
"Remember the labs() function and the correct arguments"
"Replace the dotted lines with the correct labels"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "...",
x = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age")
...
d) A good title is also necessary to communicate the information in the plot effectively to the reader. Add the following title: “Relationship between age and hourly wages by gender”
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
p
"To add a title, remember to specify the title = argument inside the labs() function"
"Replace the dotted line with the correct title"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
...
e) Instead of using the color =
argument to group by gender, try using the shape =
argument instead.
p <- ggplot(abu89, aes(x=age, y=wage_hour, shape=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
p
"All you have to do is to replace the color= argument with the shape= argument"
"Replace the dotted line with the correct variable"
p <- ggplot(abu89, aes(x=age, y=wage_hour, shape=...)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, shape=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
...
f) Try using size to group the plot by gender.
p <- ggplot(abu89, aes(x=age, y=wage_hour, size=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
p
"Replace the shape = argument with the size = argument"
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, size=gender)) +
geom_jitter(alpha = 0.3) +
labs(y = "Hourly wage",
x = "Age",
title = "Relationship between age and hourly wages by gender")
...
g) You get a warning message. Reflect for yourself or discuss with a classmate why you think the size aesthetic is not recommended to use with a discrete variable.
Scatterplots with regression lines
3.* Using the abu89 dataset…
a) …Make a scatterplot (using geom_point()
) of the relationship between age and wage_hour grouped by gender using the color =
argument to highlight the gender differences. Save it in an object named p. Remember to type the name of the plot on another line to print the result.
p <- ggplot(abu89) +
geom_point(aes(x=age, y=wage_hour, color=gender))
p
"Remember to place the color= argument inside aes()"
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89) +
geom_point(aes(x=age, y=wage_hour, color=gender))
...
b) Replace geom_point()
with geom_jitter()
and add the alpha = 0.3
argument.
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3)
p
"Remember to put the alpha= argument outside aes()"
"Replace the dotted lines with the correct geom and the correct specification"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_...(alpha=...)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3)
...
c) Add a straight regression line to the plot by adding geom_smooth()
and specifying the correct method in the method =
argument.
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm")
p
"Finish the code to obtain the desired plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
... +
...
"Replace the dotted lines with the correct geoms and method"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_...(alpha=0.3) +
geom_...(method = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm")
...
e) Remove the standard errors from the regression line by specifying the se =
argument correctly.
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm", se=FALSE)
p
"All you have to do is to add the se= argument inside `geom_smooth()`"
"Replace the dotted line with the correct argument"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm", se=...)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(abu89, aes(x=age, y=wage_hour, color=gender)) +
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm", se=FALSE)
...
f) Use the console below and the skills you’ve acquired in this session to modify your plot as you see fit. Add an appropriate title, experiment with colors, limits of x- and y-axis, legend position, whether to include standard errors or not etc. The sky is the limit!
Curriculum
Thus far, we’ve only used datasets that are cross-sectional, i.e., data from a population at only one specific point in time. However, sometimes we have data from multiple time points. We can create time series plots with the geom geom_line()
. In this part, we’ll be using the dataset gapminder from the gapminder package. This is an excerpt of the original gapminder data, and provides data on the life expectancy, total population and per capita GDP for 9 countries in the years 1972-2007. We start off by getting some more information of the dataset using glimpse()
:
# Getting more information on the dataset
glimpse(gapminder)
## Rows: 72
## Columns: 7
## Groups: continent, country, year [72]
## $ country <fct> "United States", "United States", "United States", "United S…
## $ continent <fct> Americas, Americas, Americas, Americas, Americas, Americas, …
## $ year <int> 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, 1972, 1977, …
## $ lifeExp <dbl> 71.340, 73.380, 74.650, 75.020, 76.090, 76.810, 77.310, 78.2…
## $ pop <int> 209896000, 220239000, 232187835, 242803533, 256894189, 27291…
## $ gdpPercap <dbl> 21806.04, 24072.63, 25009.56, 29884.35, 32003.93, 35767.43, …
## $ year_ind <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
We can see that there are 72 rows and 7 columns in the gapminder dataset. First we want to create a time series plot of life expectancy in Norway from the year 1990 and onwards. We then have to use the filter()
function from the dplyr package, which subsets the observations in the dataset based on specific conditions. You’ll learn more about this function in an upcoming session. In this particular case, we use filter()
to subset the observations to only include data from Norway and in years greater than 1990.
# Simple time series plot for Norway and years greater than 1990
filter(gapminder, country == "Norway" & year > 1990) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()
In the code above, we use filter()
, where the first argument is the name of the dataframe, and then we filter the observations to only include Norway and years greater than 1990. We then send this data to ggplot()
, where we map the variable year to the x-axis and life expectancy (lifeExp) to the y-axis. Finally, we use geom_line()
to create a line plot.
Next, we want to make a plot of the life expectancy in Norway for all the years included in the data (1972-2007 in five-year intervals):
# Plot for Norway and all years - specifying x-axis ticks
filter(gapminder, country == "Norway" & year >= 1972) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line() +
scale_x_continuous(breaks = seq(1972, 2007, 5))
In the code above, we modify the filter()
function to include years equal to or greater than 1972 (the first year in the data). We then use ggplot()
to make the plot as before, and finally we add the scale_x_continuous()
function with the breaks =
argument and the seq()
function to specify how we want the ticks on the x-axis to be. We tell R to set the tick marks in the range of 1972-2007 with an interval of 5. If we do not include this function, we do not get all the years on the x-axis:
filter(gapminder, country == "Norway" & year >= 1972) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()
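To see exactly which tick positions seq() hands to the breaks = argument, the call can be run on its own. This is a small sketch; the object name breaks is only chosen for illustration.

```r
# Numbers from 1972 to 2007 in steps of 5
breaks <- seq(1972, 2007, 5)
breaks
# [1] 1972 1977 1982 1987 1992 1997 2002 2007

# One tick for each of the eight survey years
length(breaks)
# [1] 8
```

These are the same eight years that appear in the year column of the dataset, so every observation gets its own tick mark.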
There are several modifications we can do to the plot. For example, we could add points for each year on the line by adding geom_point()
:
# Adding points to lines
filter(gapminder, country == "Norway" & year >= 1972) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line() +
scale_x_continuous(breaks = seq(1972, 2007, 5)) +
geom_point()
We could also modify line types, colors and size - both for the line and the points:
# Changing linetype, color and size
# (in ggplot2 3.4.0 and later, linewidth = is preferred over size = for lines)
filter(gapminder, country == "Norway" & year >= 1972) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(linetype = "dashed", size = 2, color = "steelblue") +
scale_x_continuous(breaks = seq(1972, 2007, 5)) +
geom_point(color = "red", size = 3)
There are several line types to be used in geom_line()
. The picture below shows the different types with their corresponding names.
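The named line types can also be drawn directly in R. A minimal sketch, not part of the original material: it lists the six named line types other than "blank", and the object names lines_df and p_linetypes are made up for illustration.

```r
library(ggplot2)

# The six named line types (besides "blank")
lines_df <- data.frame(
  name = c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash"),
  x = 0,
  xend = 1,
  stringsAsFactors = FALSE
)

# scale_linetype_identity() uses the names in the data as line types directly
p_linetypes <- ggplot(lines_df) +
  geom_segment(aes(x = x, xend = xend, y = name, yend = name,
                   linetype = name)) +
  scale_linetype_identity() +
  labs(x = NULL, y = NULL) +
  theme_minimal()
p_linetypes
```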
So far we’ve only been looking at the trend in life expectancy in Norway. We could include all the countries in the dataset and use the group =
argument to group by the country variable.
# Include all countries in plot and grouping by country
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
geom_line() +
scale_x_continuous(breaks = seq(1972, 2007, 5))
However, it’s impossible to see which line corresponds to which country in the plot. We could change this by adding the color =
argument and mapping it to the country variable:
# Different line colors by country
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
geom_line(aes(color = country)) +
scale_x_continuous(breaks = seq(1972, 2007, 5))
This is a good example of how we can use colors to convey information effectively. It is much easier to interpret the trend in life expectancy for the different countries when we separate the lines for each country by color. Alternatively, we could choose the colors for each country manually, by using the scale_color_manual()
function along with the values =
and labels =
arguments:
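A minimal sketch of this approach, assuming a small made-up data frame (demo) with three of the countries in place of the full gapminder excerpt. The life expectancy values and the chosen colors are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the gapminder excerpt: three countries, made-up values
demo <- data.frame(
  country = rep(c("Norway", "Poland", "Sweden"), each = 8),
  year = rep(seq(1972, 2007, 5), times = 3),
  lifeExp = c(seq(74, 80, length.out = 8),
              seq(70, 75, length.out = 8),
              seq(75, 81, length.out = 8))
)

# Colors in values = are assigned in the order of the countries
# (alphabetical here), and labels = sets the legend text in the same order
p_manual <- ggplot(demo, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = country)) +
  scale_x_continuous(breaks = seq(1972, 2007, 5)) +
  scale_color_manual(
    values = c("steelblue", "red", "orange"),
    labels = c("Norway", "Poland", "Sweden")
  )
p_manual
```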
In the code above, we place different colors inside the values =
argument, and we specify the corresponding country inside the labels =
argument.
We could also have different line types for each country by adding the linetype =
argument and mapping it to the country variable:
# Different linetype by country
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
geom_line(aes(color = country, linetype = country)) +
scale_x_continuous(breaks = seq(1972, 2007, 5))
Finally, we could add colors to the points by country, an appropriate title, change the labels and set a theme:
# Adding colors to points by country, titles, labels and theme
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
geom_line(aes(color = country, linetype = country)) +
scale_x_continuous(breaks = seq(1972, 2007, 5)) +
geom_point(aes(color = country)) +
labs(title = "Trends in life expectancy 1972-2007",
subtitle = "Sweden has the highest, while Poland has the lowest life expectancy in 2007",
y = "Life expectancy",
x = "Year") +
theme_bw()
If we wanted to show multiple subplots by each country instead, we could simply replace the group =
argument with the facet_wrap()
function:
# Multiple subplots (faceting) by country
ggplot(gapminder, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country, linetype = country)) +
scale_x_continuous(breaks = seq(1972, 2007, 5)) +
geom_point(aes(color = country)) +
labs(title = "Trends in life expectancy 1972-2007",
subtitle = "Sweden has the highest, while Poland has the lowest life expectancy in 2007",
y = "Life expectancy",
x = "Year") +
facet_wrap(~country) +
theme_bw()
If we wanted the scales to vary freely by country, we could add the argument scales = "free"
inside facet_wrap()
:
# Free scales in each subplot
ggplot(gapminder, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country, linetype = country)) +
scale_x_continuous(breaks = seq(1972, 2007, 5)) +
geom_point(aes(color = country)) +
labs(title = "Trends in life expectancy 1972-2007",
subtitle = "Sweden has the highest, while Poland has the lowest life expectancy in 2007",
y = "Life expectancy",
x = "Year") +
facet_wrap(~country, scales = "free") +
theme_bw()
Help documentation
To see the help page for the function geom_line(), just run the following code and the help page will open in your browser.
help(geom_line, package = "ggplot2")
Reminder
In these exercises, you will be using the dataset covdata. The covdata dataset contains data on death rates in 9 different countries in the years 2010-2021. It is used to analyze the impact of covid-19 on the total death toll in each country.
In all exercises where you are asked to make a plot or modify code used to make a plot, you’ll have to type the name of the plot on another line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.
The exercises in this part are also sequential, meaning that you should modify your code from the previous step (a) to correspond to the changes in the following step (b, etc.).
Simple time series plot
1. Using the covdata dataset…
a) …Get a bit more familiar with the dataset by using the glimpse()
function.
glimpse(covdata)
"Replace the dotted line with the name of the dataset"
glimpse(...)
b) Create a time series plot of the death rate in Norway in the year 2020. In the console below, we have filled in a central part of the code for this plot, using the filter()
function to subset the covdata dataset to include observations only in Norway and the year 2020. Finish typing the code to make the desired plot. Map the variable week to the x-axis and the variable rate_total to the y-axis. Finally, remember to type the name of the plot to print the result before hitting the “run code” button.
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot()
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line()
p
"Replace the dotted lines with the correct variables and geom"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = ..., y = ...)) +
...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line()
...
c) Specify which weeks should be shown on the x-axis by modifying the ticks on the x-axis using the scale_x_continuous()
function. You’ll want to show from week 1 to week 52 by an interval of 2. Hint: Remember the seq()
function.
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2))
p
"Replace the dotted lines with the correct specifications of the ticks on the x-axis"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(..., ..., ...))
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2))
...
d) Add appropriate labels to your plot: “Death rate (total deaths)” for the y-axis and “Week” for the x-axis.
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
labs(y = "Death rate (total deaths)",
x = "Week")
p
"Remember the labs() function and the correct arguments"
"Replace the dotted lines with the correct labels"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
labs(y = "...",
x = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
labs(y = "Death rate (total deaths)",
x = "Week")
...
e) Now modify your plot from above to add an appropriate title to your plot. The title should be: “Death rate (total) in Norway in 2020”.
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
labs(y = "Death rate (total deaths)",
x = "Week",
title = "Death rate (total) in Norway in 2020")
p
"To add a title, remember to specify the title = argument inside the labs() function"
"Replace the dotted line with the correct title"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
labs(y = "Death rate (total deaths)",
x = "Week",
title = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- filter(covdata, country_code == "NOR" & year == 2020) %>%
ggplot(aes(x = week, y = rate_total)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
labs(y = "Death rate (total deaths)",
x = "Week",
title = "Death rate (total) in Norway in 2020")
...
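As a rough illustration of how labs() attaches axis labels and a title, the sketch below builds the same kind of plot on a small made-up data frame. The data frame toy and its values are assumptions for illustration only; they are not taken from covdata.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Norway 2020 subset of covdata
toy <- data.frame(week = 1:52, rate_total = 15 + sin(1:52 / 8))

p <- ggplot(toy, aes(x = week, y = rate_total)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1, 52, 2)) +
  labs(y = "Death rate (total deaths)",
       x = "Week",
       title = "Death rate (total) in Norway in 2020")

# Labels set with labs() are stored on the plot object,
# so they can be inspected without drawing the plot
p$labels$title
```

Typing p on its own line afterwards draws the plot with the labels and the title in place.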
Time series plot by group
2. Using the covdata dataset…
a) Create a plot showing the death rate for all the years in Norway. Finish the code below to obtain the desired result! Remember that you’ll have to use the group = argument to group the plot by the variable year. As in the previous plot, use the scale_x_continuous() function along with the breaks = argument and the seq() function to show weeks 1 to 52 at an interval of two.
p <- filter(covdata, country_code == "NOR") %>%
ggplot()
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x = week, y = rate_total, group = year)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2))
p
"Remember to place the group = argument inside aes()"
"Replace the dotted line with the correct grouping variable"
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x = week, y = rate_total, group = ...)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2))
"Replace the dotted lines with the correct specifications"
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x = week, y = rate_total, group = year)) +
geom_line() +
scale_x_continuous(breaks = seq(..., ..., ...))
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x = week, y = rate_total, group = year)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2))
...
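To see what the group = argument and the seq() breaks are doing, here is a minimal sketch on made-up data. The data frame toy, its three years and its values are assumptions for illustration; only the column names mirror covdata.

```r
library(ggplot2)

# Tick positions produced by seq(from, to, by): every second week from week 1
week_breaks <- seq(1, 52, 2)
week_breaks[1:5]
length(week_breaks)

# Hypothetical toy data: three years of weekly rates (values are made up)
toy <- expand.grid(week = 1:52, year = 2018:2020)
toy$rate_total <- 15 + (toy$year - 2018) + cos(toy$week / 8)

# group = year draws one line per year; without it, geom_line() would
# connect all observations into a single zig-zag line
p <- ggplot(toy, aes(x = week, y = rate_total, group = year)) +
  geom_line() +
  scale_x_continuous(breaks = week_breaks)
p
```

Removing group = year from aes() in this sketch is a quick way to see why the grouping is needed.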
b) Highlight the year 2020 in a different color to clearly communicate the impact of COVID-19 on the total death rate in Norway. In the covdata dataset, there is a variable called year_ind which indicates whether the year is 2020 or one of the other years in 2010-2021. Inside aes(), map this variable to the color = argument. Using the scale_color_manual() function, place the colors “gray70” and “red” inside the values = argument, and the corresponding labels, “2010-2021” and “2020”, inside the labels = argument.
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020"))
p
"Replace the dotted line with the correct variable to color by"
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x = week, y = rate_total, group = year, color = ...)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
...()
"Replace the dotted lines with the correct color specifications, and the corresponding labels in the labels argument"
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x = week, y = rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
scale_color_manual(values = c("...", "..."),
labels = c("...", "..."))
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- filter(covdata, country_code == "NOR") %>%
ggplot(aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020"))
...
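The order of the colors and labels matters, because scale_color_manual() matches unnamed values = and labels = to the categories of the mapped variable in order. The sketch below shows this on made-up data. The data frame toy and the coding of year_ind as "other" and "2020" are assumptions; the actual coding in covdata may differ.

```r
library(ggplot2)

# Hypothetical toy data; the real coding of year_ind in covdata may differ
toy <- expand.grid(week = 1:52, year = 2018:2020)
toy$rate_total <- 15 + cos(toy$week / 8) + ifelse(toy$year == 2020, 2, 0)
toy$year_ind <- factor(ifelse(toy$year == 2020, "2020", "other"),
                       levels = c("other", "2020"))

# values and labels are matched to the factor levels in order:
# "other" -> "gray70" with label "2010-2021", "2020" -> "red" with label "2020"
p <- ggplot(toy, aes(x = week, y = rate_total, group = year, color = year_ind)) +
  geom_line() +
  scale_color_manual(values = c("gray70", "red"),
                     labels = c("2010-2021", "2020"))
p
```

If the colors come out swapped on your own data, checking the order of the categories with levels() or unique() is a reasonable first step.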
c) Now modify your plot from above to make a plot that shows the total death rate for all the countries. You can achieve this by dropping the filter() function and adding the name of the dataset inside ggplot(). You also have to separate the countries from each other by adding the facet_wrap() function on a new line. Hint: The variable cname gives the names of all the countries.
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname)
p
"Replace the dotted line with the correct variable to separate the countries"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~...)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,2)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname)
...
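A small sketch of what facet_wrap() does, using made-up data with two invented countries. The data frame toy, the country names and the values are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical toy data: two made-up countries, three years each
toy <- expand.grid(week = 1:52, year = 2018:2020,
                   cname = c("Country A", "Country B"),
                   stringsAsFactors = FALSE)
toy$rate_total <- 15 + 0.5 * (toy$year - 2018) + cos(toy$week / 8) +
  ifelse(toy$cname == "Country B", 3, 0)

# Without filter(), the dataset goes straight into ggplot();
# facet_wrap(~cname) then draws one panel per country
p <- ggplot(toy, aes(x = week, y = rate_total, group = year)) +
  geom_line() +
  facet_wrap(~cname)
p
```

By default all panels share the same axes, which makes the levels directly comparable across countries.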
d) Modify your code so that the labels on the x-axis do not overlap. Let the x-axis breaks run from week 1 to week 52 at an interval of 10.
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname)
p
"Modify the arguments inside the seq() function to obtain the desired result"
"Replace the dotted lines with the correct specifications for the x-axis"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(..., ..., ...)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname)
...
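The effect of changing the interval can be checked directly in the console, since seq() is plain base R. The name week_breaks below is only an illustrative choice.

```r
# Breaks used in the earlier plots: every second week
seq(1, 52, 2)

# Breaks asked for here: every tenth week, starting at week 1
week_breaks <- seq(1, 52, 10)
week_breaks
length(week_breaks)
```

With an interval of 10 there are only six breaks (1, 11, 21, 31, 41, 51) instead of 26, so the labels have room in the small facet panels. Week 52 itself is not a break, because the sequence stops at the last value that does not exceed 52.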
e) Add appropriate labels to your plot to make it more intuitive for the reader to interpret. Use these labels: "Death rate (total deaths)" for the y-axis and "Week" for the x-axis.
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "Death rate (total deaths)",
x = "Week")
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "Death rate (total deaths)",
x = "Week")
p
"Remember the labs() function and the correct arguments"
"Replace the dotted lines with the correct labels"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "...",
x = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "Death rate (total deaths)",
x = "Week")
...
f) Finally, add an appropriate title to communicate the information in the plot even more clearly. Use the following title: “Death rate (total) 2010-2021 by country”.
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "Death rate (total deaths)",
x = "Week",
title = "Death rate (total) 2010-2021 by country")
p
"To add a title, remember to specify the title = argument inside the labs() function"
"Replace the dotted line with the correct title"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "Death rate (total deaths)",
x = "Week",
title = "...")
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"
p <- ggplot(covdata, aes(x=week, y=rate_total, group = year, color = year_ind)) +
geom_line() +
scale_x_continuous(breaks = seq(1,52,10)) +
scale_color_manual(values = c("gray70", "red"),
labels = c("2010-2021", "2020")) +
facet_wrap(~cname) +
labs(y = "Death rate (total deaths)",
x = "Week",
title = "Death rate (total) 2010-2021 by country")
...