

1 Introduction

A good place to start any kind of data analysis is by describing what you see. In this session you will learn how to do exploratory data analysis by visualizing data with the package ggplot2, which is a core member of the tidyverse. The tidyverse is a collection of R packages designed by Hadley Wickham and colleagues at RStudio to make the process of data analysis easier. The packages in the tidyverse are built on common principles, which allows them to work together seamlessly. If you want to read more about the tidyverse, check out their homepage.

The “gg” in the name ggplot2 stands for “grammar of graphics”: every plot is built from the same underlying principles and grammar. This means that by learning the basic types of plot in this course you also get an introduction to this grammar, and much more advanced graphics build on the same grammar. You should consult R4DS for a proper explanation. Here, we head straight on.

Curriculum

The curriculum for this session is chapter 1 in R for data science (R4DS) and chapters 1, 4 and 5 in The Basic Practice of Statistics (BPS). For some topics, paragraphs in other chapters are relevant. In these cases, we’ll refer you to the correct paragraphs or pages. We highly recommend that you read the curriculum before solving the exercises, and that you use it as a manual while solving them. You can think of R4DS as a practical manual, showing you how to operate R and perform data analysis in it. BPS, on the other hand, gives an introduction to statistical data analysis and explains the central principles, terms and methods. Simply put, BPS tells you what to do (and in some cases how), and R4DS tells you how to do it practically in R. For examples and learning outcomes that are not explicitly mentioned in R4DS, some introductory text and examples will be given.

Prerequisites

We will be showing examples using the dataset issp, which is an extract from the Norwegian part of the 2019 International Social Survey Programme. The data can be downloaded from their homepage. When solving the exercises, you will mainly be using the dataset abu89, which is the labour market survey from Norway in 1989. This dataset was originally prepared for a textbook on Stata programming by Wiborg and Ringdal and can be downloaded from the book’s homepage. In some exercises, other datasets will be used. In these cases we will mention it explicitly. The dataset abu89 and the package tidyverse (where ggplot2 is automatically included) are loaded in your workspace.

Even though this tutorial and its exercises will provide all the functions you need for the purposes of this course, we also recommend checking out the cheat sheet for ggplot2. You can download the cheat sheet and read more about ggplot2 here.

Note: As you will soon discover, in ggplot2 you can add more functions to modify the plot by putting + at the end of a line and then adding another function. This resembles what is elsewhere called a pipe, where the symbol %>% is used to pass a result on to the next function. This might be confusing later on. Just remember this: in ggplot2 you use + and in data wrangling you use %>%. The other way around gives you an error.
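A minimal sketch of how the two operators combine in one piece of code. It uses R's built-in mtcars data rather than the course data (an assumption made so the example runs anywhere), and filter() is a data wrangling function you will meet in a later session.

```r
library(dplyr)
library(ggplot2)

# Data wrangling steps are chained with %>% ...
p <- mtcars %>%
  filter(cyl == 4) %>%
  # ... and from ggplot() onwards, layers are added with +
  ggplot(aes(x = mpg)) +
  geom_histogram(bins = 10)

p
```

The data flows through the pipe into ggplot(), and everything after that point is added to the plot with +.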

2 Histograms and density plots

Curriculum

  • R4DS: Chapter 1
  • R4DS: chapter 5 (paragraph: “Visualizing Distributions”)
  • BPS: Chapter 1, p. 21-28

Histograms are used when you want to visualize the distribution of scores of a single continuous variable. A histogram plot summarizes a continuous variable by chopping it up into “bins” (intervals) and counting the number of observations within each bin (Healey, 2019, p.85). By default, R will choose a bin size for us.
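To make the idea of binning concrete, here is a small sketch that does the chopping and counting by hand with the base R functions cut() and table(). The nine values are the first values of hours_work visible in the glimpse() output further down; the bin width of 20 hours is chosen only for illustration and is not what ggplot2 picks by default.

```r
# First nine values of hours_work, copied from the glimpse() output in this session
hours <- c(35, 20, 40, 40, 37.5, 10, 50, 42, 84)

# Chop the variable into intervals ("bins") that are 20 hours wide:
# [0,20), [20,40), [40,60), [60,80), [80,100)
bins <- cut(hours, breaks = seq(0, 100, by = 20), right = FALSE)

# Count the number of observations within each bin
counts <- table(bins)
counts
```

A histogram draws one bar per bin, with the height of each bar given by these counts.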

Let’s say we wanted to plot a histogram of the working hours per week variable (hours_work) in the issp dataset. First, we’ll get a bit familiar with the dataset by using the function glimpse(). (Don’t worry, you’ll learn more about this function later).

glimpse(issp)
## Rows: 715
## Columns: 21
## $ idnr           <int> 1002, 1008, 1036, 1047, 1050, 1060, 1062, 1063, 1074, 1…
## $ day            <int> 2, 17, 30, 2, 24, 13, 5, 6, 17, 29, 6, 30, 16, 13, 14, …
## $ month          <int> 3, 3, 3, 3, 4, 5, 3, 5, 3, 2, 3, 3, 3, 3, 4, 3, 5, 4, 5…
## $ year           <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ gender         <fct> Female, Male, Female, Male, Female, Male, Male, Female,…
## $ age            <int> 31, 68, 50, 61, 52, 66, 49, 29, 54, 40, 58, 46, 67, 42,…
## $ ethnicity      <fct> Norwegian, Norwegian, Norwegian, Norwegian, Norwegian, …
## $ pl_residence   <fct> Big city, Sparsely settled, Suburbs of city, Suburbs of…
## $ marital_status <fct> Married, Widowed, Married, Unmarried, Married, Married,…
## $ religion       <fct> Christianity, Christianity, Christianity, Christianity,…
## $ yeareduc       <int> 16, 2, 18, 14, 6, 12, 20, 18, 2, 20, 16, 18, 13, 15, 14…
## $ hours_work     <dbl> 35.0, 20.0, 40.0, 40.0, 37.5, 10.0, 50.0, 42.0, 84.0, 3…
## $ inc_dec        <fct> 558-658000, 435-493000, 558-658000, 231-306000, 659-856…
## $ class          <fct> Upper middle class, Middle class, Upper middle class, L…
## $ voting_elec    <fct> Arbeiderpartiet, Senterpartiet, Hoyre, Arbeiderpartiet,…
## $ diff_income    <fct> Strongly agree, Agree, Disagree, Neutral, Neutral, Agre…
## $ good_edu       <fct> Very important, Essential, Very important, Fairly impor…
## $ rich_fam       <fct> Not very important, Fairly important, Fairly important,…
## $ edu_pay        <fct> Essential, Very important, Very important, Fairly impor…
## $ scalenow       <int> 9, 6, 7, 5, 5, 6, 6, 8, 5, 7, 7, 4, 7, 6, 6, 5, 3, 6, 6…
## $ scalefam       <int> 5, 7, 7, 7, 5, 8, 5, 6, 4, 5, 6, 4, 5, 6, 5, 5, 4, 5, 3…

The issp dataset contains 21 variables and 715 observations.

# Simple histogram plot
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There we have a histogram of hours_work. On the y-axis we can see the newly created variable count, which counts the number of observations within each bin. Further, the message in the output tells us that stat_bin() used the default of 30 bins.

Adjusting the bins

The intervals in a histogram are called bins, and the width of those bins will affect the look of the graph. As pointed out in R4DS chapter 5, it is highly recommended to experiment with the bins when plotting a histogram. They make a huge difference to how the final figure will appear. You can specify the number of bins with the bins = argument:

# Modifying the number of bins
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram(bins = 50)

As an alternative to specifying the number of bins, you can set the width of the bins directly using the binwidth = argument.

# Modifying binwidth
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram(binwidth = 10)

From the output, we can now see that the bins are much wider than in the previous plot. The result is a bit rough and removes much detail, but it might be good enough for the purpose.

2.1 Density plots

The default scale of the y-axis when using geom_histogram() is counts: the number of observations in each interval. A common alternative is “density”. In short, this scales the bars so that the area of each bar equals the proportion of observations in that bin. Importantly: the y-axis shows “density”, and the proportion, \(p\), in any given interval is the area of its bar: \(p = x\cdot y\), where \(x\) is the width of the bin and \(y\) is the density. That implies that the total area of the bars sums to 1. That is: the proportions across all intervals sum to 100%.
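The relation \(p = x\cdot y\) can be checked with a few lines of base R. This is a sketch using hist() with plot = FALSE (which only does the calculations) on nine values of hours_work copied from the glimpse() output above; the bin width of 20 is an arbitrary choice for the example.

```r
hours <- c(35, 20, 40, 40, 37.5, 10, 50, 42, 84)

# Compute the histogram without drawing it
h <- hist(hours, breaks = seq(0, 100, by = 20), plot = FALSE)

width <- diff(h$breaks)   # x: the width of each bin (20 hours)
p <- h$density * width    # p = x * y: the proportion in each bin

p        # the same as h$counts / length(hours)
sum(p)   # the proportions sum to 1
```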

Note that the plot should look exactly the same visually, only with a different scale on the y-axis.

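To make the plot in this way, add y = ..density.. inside aes(). (From ggplot2 version 3.4.0, this notation is deprecated in favor of y = after_stat(density). Both produce the same plot, but newer versions warn about the old notation.)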

# Histogram on the density scale
ggplot(data = issp, aes(x = hours_work, y = ..density..)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If we wanted to visualize the density curve of a continuous variable, we could use the geom_density() function. Instead of binning the data (as with histograms), this function computes a kernel density estimate of the underlying distribution. Basically, it’s a smoothed version of the histogram, and it is on the density scale.

# Simple density plot
ggplot(data = issp, aes(x = hours_work)) +
  geom_density()

In a similar way to how bins can be changed in histograms, the smoothness of the density curve can be adjusted. The argument adjust = multiplies the default bandwidth of the kernel: values below 1 give more “wiggliness”, and values above 1 give a smoother curve. (The argument n = only sets the number of points at which the curve is evaluated and does not change the smoothing.) Usually, we are quite happy with what R does by default, but here is an example:

# Density plot with half the default bandwidth
ggplot(data = issp, aes(x = hours_work)) +
  geom_density(adjust = 0.5)
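To see which argument controls the smoothing, it can help to look at the numbers behind a density curve. The sketch below uses base R's density() function on simulated data (not the issp data), assuming that geom_density() treats adjust = and n = in the same way, as the ggplot2 help page indicates by pointing to density() for details.

```r
set.seed(1)
x <- rnorm(200, mean = 37.5, sd = 8)   # simulated "working hours", not the issp data

d_default <- density(x)                 # defaults: adjust = 1, n = 512
d_wiggly  <- density(x, adjust = 0.5)   # half the bandwidth: a more wiggly curve
d_coarse  <- density(x, n = 100)        # same bandwidth, evaluated at only 100 points

# The bandwidth (bw) changes with adjust, but not with n
c(default = d_default$bw, wiggly = d_wiggly$bw, coarse = d_coarse$bw)

# n only decides how many points the curve is drawn through
length(d_coarse$x)
```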

To see the relation between histograms and density plots, we can combine both in the same plot. Both need to be on the density-scale. Thus, we have to add the argument y = ..density.., as otherwise, the histogram will use the default scale.

# Combined histogram and density plot
ggplot(data = issp, aes(x = hours_work, y = ..density..)) +
  geom_histogram() +
  geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note: When you try to understand what the normal distribution really shows, and how on earth you are to interpret the tables, it might help to realize that the normal distribution is a density distribution. Thus: the area under the curve (in any interval) is the proportion.
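As a small illustration of that point, the proportion of a standard normal distribution between -1 and 1 can be found in two ways: from the cumulative probabilities that the tables list (pnorm() in R), or as the area under the density curve (dnorm()) in that interval. The interval is chosen only as an example.

```r
# Proportion between -1 and 1, from cumulative probabilities (what the tables list)
p_table <- pnorm(1) - pnorm(-1)

# The same proportion, computed as the area under the density curve
p_area <- integrate(dnorm, lower = -1, upper = 1)$value

c(table = p_table, area = p_area)   # both are about 0.683
```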

Density plots with two groups

It is also possible to make a density plot where we compare the density curves of two groups. This is one advantage of density plots, as putting two histograms on top of each other in one graph will not look good. Let’s say we wanted to investigate whether the amount of working hours varies between men and women. To obtain this plot, we have to use the fill = argument and the alpha = argument.

# Density plot with two groups
ggplot(data = issp, aes(x = hours_work, fill = gender)) +
  geom_density(alpha = 0.5)

In the code above we map the gender variable to the fill = argument, which fills the density curves for each gender with a separate color. We also use the alpha = argument to make the curves slightly transparent. You’ll learn more about these two arguments in the following subchapters of this session.
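The difference between mapping an aesthetic to a variable (inside aes()) and setting it to one fixed value (outside aes()) can be sketched as follows. The example uses R's built-in mtcars data as a stand-in for the course data, with am (automatic or manual transmission) as the grouping variable.

```r
library(ggplot2)

# Mapping: the fill depends on a variable, so it goes inside aes()
p_map <- ggplot(mtcars, aes(x = mpg, fill = factor(am))) +
  geom_density(alpha = 0.5)

# Setting: one fixed fill color for everything, so it goes outside aes()
p_set <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "blue", alpha = 0.5)

# layer_data() returns the values ggplot2 actually uses for drawing
unique(layer_data(p_map)$fill)   # two colors, one per group
unique(layer_data(p_set)$fill)   # only "blue"
```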

Help documentation

To see the help pages for the functions geom_histogram() and geom_density(), just run the following code and the help pages will open in your browser.

help(geom_histogram, package = "ggplot2")
help(geom_density, package = "ggplot2")

2.2 Exercises

In these exercises, you will be using the dataset abu89. In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on another line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

1. Use the glimpse() function to get a bit more information on the dataset.

glimpse(abu89)

2. Using the dataset abu89

a) Create a histogram of the wage_hour variable and store it in an object named p. Print the plot by typing its name on another line in your code. Remember to hit the “run code” button before submitting your answer.

p <- ggplot(abu89, aes(x = wage_hour)) +
  geom_histogram()

p
"Replace the dotted lines with the name of the dataset and the name of the variable"

p <- ggplot(..., aes(x = ...)) +
  geom_histogram()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89, aes(x = wage_hour)) +
  geom_histogram()

...

b) Modify your code so that the plot has 50 bins. Remember to print the plot after your modifications by typing its name.

p <- ggplot(abu89, aes(x = wage_hour)) +
  geom_histogram(bins = 50)

p
"Remember the `bins =` argument in the `geom_histogram()` function"
"Replace the dotted lines with the correct argument"

p <- ggplot(abu89, aes(x = wage_hour)) +
  geom_histogram(... = ...)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89, aes(x = wage_hour)) +
  geom_histogram(bins = 50)

...

4. Make a density plot instead of a histogram and store it in an object named d. Remember to print the plot by typing its name on another line.

d <- ggplot(abu89, aes(x = wage_hour)) +
  geom_density()

d
"You'll have to replace 'geom_histogram()' with another geom"
"Replace the dotted lines with the correct geom"

d <- ggplot(abu89, aes(x = wage_hour)) +
  geom_...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

d <- ggplot(abu89, aes(x = wage_hour)) +
  geom_density()

...

5. Make a combined histogram and density plot and store it in an object named c. Remember to print the plot by typing its name on another line.

c <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density()

c
"You'll have to add the `y = ..density..` argument inside aes()"
"Replace the dotted lines with the correct argument and geom"

c <- ggplot(abu89, aes(x = wage_hour, y = ...)) +
  geom_histogram() +
  geom_...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

c <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density()

...

3 Extra: Ridge plot

You can also stack the graphs on top of each other to see each more clearly. This is called a ridge plot, but is basically just multiple density plots. The package ggridges provides the function geom_density_ridges() to do just that. You need to specify the grouping-variable as y = and if you like different colors put the same variable in fill = as follows:
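A minimal sketch of such a call. So that it can run on its own, it uses a small simulated stand-in for issp: the data frame demo and its values are made up for illustration and do not reflect the survey. With the course data you would write issp in place of demo.

```r
library(ggplot2)
library(ggridges)

# Simulated stand-in for issp: made-up working hours for two groups
set.seed(1)
demo <- data.frame(
  hours_work = c(rnorm(100, mean = 40, sd = 8), rnorm(100, mean = 35, sd = 8)),
  gender     = rep(c("Male", "Female"), each = 100)
)

# The grouping variable goes on the y-axis and, for different colors, also in fill
ridge <- ggplot(demo, aes(x = hours_work, y = gender, fill = gender)) +
  geom_density_ridges()

ridge
```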

## Picking joint bandwidth of 1.3

However, this technique is mainly useful if you want to compare many groups. For example, if you would like to see if working hours differ by educational level, you can use the variable yeareduc as follows. Note that the grouping variable has to be categorical, and we use factor(yeareduc) to make R not interpret it as a continuous scale. You will learn more about factors in a later session.
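A sketch of what this could look like, again with a simulated stand-in for issp (the data frame demo, its three education levels and its values are hypothetical and chosen only so the example runs on its own):

```r
library(ggplot2)
library(ggridges)

# Simulated stand-in for issp: three made-up levels of years of education
set.seed(1)
demo <- data.frame(
  yeareduc   = rep(c(10, 13, 16), each = 50),
  hours_work = rnorm(150, mean = 38, sd = 8)
)

# factor() turns the numeric variable into a categorical one,
# so that each value of yeareduc gets its own ridge
ridge_edu <- ggplot(demo, aes(x = hours_work, y = factor(yeareduc))) +
  geom_density_ridges()

ridge_edu
```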

## Picking joint bandwidth of 3.54

4 Adjusting the look of a graph

There are a number of adjustments you can make to the final look of a graph. This typically needs to be done before it is ready for publication. What needs to be done will vary by publisher, who will typically have some standards for how graphics should look. You probably have some views yourself as well, even if the publisher does not.

We use the histogram as example, but these techniques apply to all plots made with the ggplot2 package.

Labels

Good labels are important in making the plots intuitive to interpret for the reader. We want to convey the information of a plot as clearly and effectively as possible, and with good labels we are one step further in this process. If we wanted to change the labels of the x- and y-axis, we could use the labs() function and specify the x = and y = arguments. In this case, we need to explain what the variable along the x-axis is, and the unit along the y-axis:

# Specifying axis titles
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Colors

If we want to add some color to our histogram plot, we could use either the color = or the fill = argument. The former applies to lines and the latter to areas.

One reason for using colors is just to liven up the graph a bit. That can be nice. But importantly: too many colors without a clear purpose can make your plot look cluttered. We will use colors to highlight information later. Here are just the basics of how to add colors.

# Modifying color
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram(color = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see the color = "blue" argument colors the outlines of the bins. It does not matter whether you type color or colour. It will give the same result. If we want to fill the bins with color, we can use the fill = argument:

# Modifying the fill color
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram(fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It’s also worth mentioning that both the color = and fill = arguments apply to all the different geoms, e.g. geom_density(), geom_histogram(), geom_point() etc. In our combined histogram and density plot, we could fill the density plot with the color red, and use the alpha = argument to make the color transparent:

# Filling density plot with color and making the color transparent
ggplot(data = issp, mapping = aes(x = hours_work, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha = 0.2, fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Color palettes

ggplot2 offers a lot of different built-in colors to use. The default colors in ggplot2 are shown in the image below.

However, the great thing about R being open-source is that several users have created their own colors and color schemes. All you have to do to use them is install the packages. For example, the RColorBrewer package has a range of different color palettes, and some are also robust to color blindness.

In the following we use the function display.brewer.all() to list all the available color palettes in the RColorBrewer package.

display.brewer.all()

Note that the first set of palettes are called sequential, and should be used for continuous variables (i.e. data that ranges from low to high, or high to low). If you want to visualize differences between groups, use the second set of palettes. These are called qualitative, and consist of colors that are easy to distinguish from each other. This is pretty handy when we want to visualize group differences: the information is conveyed more clearly when the colors are distinguishable. The final set of palettes are called diverging, and are recommended if you want to emphasize both the middle values of a variable and the values at each end.
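If you prefer a table to a picture, the RColorBrewer package also ships a data frame called brewer.pal.info with one row per palette. A short sketch of how it can be used to look up palettes by type:

```r
library(RColorBrewer)

# One row per palette; the row names are the palette names
# category is "seq" (sequential), "qual" (qualitative) or "div" (diverging)
table(brewer.pal.info$category)

# Qualitative palettes that are flagged as colorblind friendly
qual <- subset(brewer.pal.info, category == "qual" & colorblind)
rownames(qual)
```

The names returned by rownames() are the ones you pass to the palette = argument of the brewer scales in ggplot2.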

You can read more about the different color schemes in RColorBrewer here

Colorblind-friendly palettes

Sometimes we use colors just to make the graph look nice. Other times we use them to highlight important differences between groups, grading scales etc. In such cases, the colors carry important information! That information is lost if a sizable proportion of your readers cannot clearly see the difference you are trying to emphasize. About 5%-8% of men are colorblind, but far fewer women. There are different kinds of colorblindness, and truly universal design might not be easy.

Fortunately, there are several functions and packages in R that make it easier. As stated previously, the RColorBrewer package includes a range of different color palettes that are robust to color blindness. By adding the colorblindFriendly = TRUE argument inside the display.brewer.all() function, we list all color palettes that are colorblind friendly!

display.brewer.all(colorblindFriendly = TRUE)

The following examples illustrate the visual difference between using conventional palettes and colorblind friendly palettes. First, we make a density plot showing the distribution of working hours by gender using conventional colors. To make a density plot by gender, we add fill = gender to get a different fill color for each gender. We also use the alpha = 0.3 argument to make the density curves slightly transparent.

# Density curve with conventional color palette
ggplot(data = issp, mapping = aes(x = hours_work, y = ..density.., fill = gender)) +
  geom_density(alpha = 0.3)

However, if we want our plots to be robust to colorblindness, we could use one of the colorblind friendly palettes from the RColorBrewer package. In the following we use the function scale_fill_brewer() and set the palette = argument to "Dark2", which is a colorblind friendly palette.

# Density curve with colorblind friendly palette - RColorBrewer
ggplot(data = issp, mapping = aes(x = hours_work, y = ..density.., fill = gender)) +
  geom_density(alpha = 0.3) +
   scale_fill_brewer(palette = "Dark2")

This plot looks a lot different from the previous one, and the difference between the groups should now be clear to far more readers!

However, as an alternative to the RColorBrewer package, we could use the color palettes from the viridis package. These color scales are built into ggplot2, so no extra package is needed, and you can read more about them and the different palettes here. This color scale is also designed to work well when e.g. printed in black and white. That’s pretty neat!

To add the palette, you simply add scale_fill_viridis_d() to the code. In the same way as for the RColorBrewer package, viridis also has related scales for continuous variables, and for coloring lines rather than the fill.

# Density curve with colorblind friendly palette - viridis
ggplot(data = issp, mapping = aes(x = hours_work, y = ..density.., fill = gender)) +
  geom_density(alpha = 0.2) +
   scale_fill_viridis_d()

Since we do not go into depth on the use of colors, we leave it here. For a more thorough discussion of functions for colorblind friendly palettes, check out packages specialized for that purpose, such as the colorBlindness package here.

Titles

In addition to labels and colors, a good title is also essential to communicate the information in a plot effectively.

We can add a title by including the title = argument in the labs() function.

# Adding a title 
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours, from ISSP")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As mentioned in R4DS chapter 22, you could also include a subtitle and a caption if you need more text. Information about data source etc. would be natural to put in the caption.

# Adding a subtitle and a caption
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       subtitle = "Including overtime",
       caption = "Data from ISSP, 2019")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Axis limits

From the plot we can see that there are a few observations that fall outside the overall pattern. It seems like some respondents work more than 75 hours per week. We can check the values of these observations using the functions filter() and select() from the dplyr package. You will learn more about these functions in a later session, but for now you just have to know that the code below filters the rows in the issp dataset to only show observations where the value of hours_work is greater than 75.

issp %>%
  filter(hours_work >75) %>%
  select(hours_work)

The output tells us that there are five observations with a value ranging between 80-90. These observations could be considered outliers. We could modify the limits of the axes, using the ylim() function (for the y-axis) and the xlim() function (for the x-axis), to remove those observations from the plot. If you remove data, you need to say so somewhere, either in the analysis or e.g. in the subtitle as we do here. Specifying the limits of the y-axis might tidy up the graph a bit if you like.

# Specifying the limits (minimum and maximum) of the y- and x-axis
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       subtitle = "(Outliers over 70 excluded)",
       caption = "Data from ISSP, 2019") +
  ylim(0, 200) +
  xlim(0, 70)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

You can see that the x-axis is now limited to a maximum value of 70, and the plot does not show the outliers with more than 70 working hours per week. The warnings in the output are expected here. The first one tells us that R dropped the rows it could not place on the x-axis before making the plot: those with missing values in the variable and those outside the limits we set. The second one typically refers to bars at the edges of the limits. You’ll learn more about missing values in the next session.

You can gain more control over the appearance of the axis by using another function: scale_x_continuous(). For example, we can specify the ticks of the x-axis in our plot using the breaks = argument. Adding scale_x_continuous() would replace an earlier xlim(), so we have removed xlim() here, as well as the subtitle.

# Specifying ticks of x-axis
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       caption = "Data from ISSP, 2019") +
  scale_x_continuous(breaks = c(0,20,40,60,80,100))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the example above we set the position of six tick marks on the x-axis using the c() function. However, a more efficient way to obtain the same result would be to use the seq() function, where you specify from, to and by (in that order) to create a sequence of numbers:

# Specifying ticks of x-axis
ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       caption = "Data from ISSP, 2019") +
  scale_x_continuous(breaks = seq(0,100,by = 20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the code above, we tell R to create a sequence of numbers from 0-100 with a step size between the numbers of 20. Don’t worry, you’ll learn more about both the c() and the seq() functions in the next session.
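You can check for yourself that the two ways of writing the breaks give the same numbers. A short sketch (the object names by_hand and with_seq are just chosen for this example):

```r
# The tick positions typed out by hand ...
by_hand <- c(0, 20, 40, 60, 80, 100)

# ... and the same positions generated as a sequence
with_seq <- seq(from = 0, to = 100, by = 20)

with_seq
all.equal(by_hand, with_seq)   # TRUE: the two vectors hold the same numbers
length(with_seq)               # 6 tick marks
```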

Consistent look of your graphs

The histogram looks pretty good now, but there are all sorts of reasons why you might want to change an element in a plot. Perhaps you do not like the gray background? Or the white lines? Or the fonts, or whatever. But you surely would like all plots in your report to have a consistent look. The solution is to use themes. ggplot2 has a number of different built-in themes to choose from. We can quickly modify the look of our plot using one of the theme_*() functions.

Often, theme_minimal() gives you just what you need: a clean graph with just light background grid to ease reading the values.

ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       caption = "Data from ISSP, 2019") +
  scale_x_continuous(breaks = seq(0,100,by = 20)) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sometimes, you would like even less, just the data and nothing else. This can be useful for maps etc. You should typically not use theme_void() where scales matter. For this example, also titles and caption are removed.

ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(0,100,by = 20)) +
  theme_void()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Or perhaps you would like a dark background and let the graph itself be light. theme_dark() does the trick, but then you need to specify the histogram to be white using fill = and perhaps also with black outlines using col = (short for color =). Both arguments were introduced in the section on colors above.

ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram(fill="white", col="black") +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       caption = "Data from ISSP, 2019") +
  scale_x_continuous(breaks = seq(0,100,by = 20)) +
  theme_dark()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
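To give all plots in a report the same look without repeating the theme in every plot, ggplot2 also has the function theme_set(), which changes the default theme for the rest of the R session. A minimal sketch, using R's built-in mtcars data as a stand-in for the course data:

```r
library(ggplot2)

# Make theme_minimal() the default; theme_set() returns the previous default
old_theme <- theme_set(theme_minimal())

# Any plot without a theme_*() of its own is now drawn with theme_minimal()
p1 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p2 <- ggplot(mtcars, aes(x = wt)) + geom_density()
p1
p2

# Restore the previous default when you are done
theme_set(old_theme)
```

Note that the default theme is applied when a plot is printed, so plots printed after the last line go back to the previous look.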

ggplot2 comes with a total of eight different built-in themes. However, users have also developed their own themes, and there are several packages you can install to get more themes, e.g. ggthemes or ggthemr. The different built-in themes in ggplot2 are shown in the image below.

More aesthetics specifications

There are a lot of different aesthetics specifications you can use to modify your plots in ggplot2. As described in R4DS chapter 1, you can also change the size, shape or color of points in addition to the different options described previously. You will learn more about this in the third part of the tutorial on ggplot2. For more information, check out the help page for the aesthetic specifications in ggplot2.

Saving plots with ggsave()

When you’re happy with your plot, you would like to include it in your report. Copy-paste from the plot window in R is possible, but you should generally avoid that. Save it to disk instead and insert it into e.g. MS Word afterwards. (R also integrates with other typesetting formats such as LaTeX and Markdown, but that is probably not relevant for most of you, even if it is pretty cool.)

Saving plots to your project folder is easy. We’ll show you how to do that. However, we cannot really demonstrate this online as it has to be done on your local computer. Try it out!

When we are satisfied with how our plot looks in the plot window, we can use the function ggsave() to save our plot to the working directory folder. Preferably, you should have a folder for where you store output, and you would save it to this folder by specifying a relative filepath as follows:

ggsave("output/density_hours_work.png")

The ggsave() function will save your last displayed plot by default. So make sure that you save the correct plot. However, if you store your plot in an object, you can specify which plot to save by adding the plot = argument inside ggsave(). That is much preferable.

We could also store our plot in an object named hist_plot by using the assignment operator <- (In the next session you will learn more about this operator):

# Making the plot and storing it
hist_plot <- ggplot(data = issp, aes(x = hours_work)) +
  geom_histogram() +
  labs(x = "Working hours per week (hours_work)",
       y = "Number of persons",
       title = "Histogram of self-reported working hours",
       caption = "Data from ISSP, 2019") +
  scale_x_continuous(breaks = seq(0,100,by = 20)) +
  theme_minimal()

In the code chunk above, we first create the plot and store it in an object named hist_plot. Then we save it by using ggsave() as follows:

ggsave(plot = hist_plot, "density_hours_work.png")

The plot = argument specifies what to save (the plot object), and then we type the name of the file where it will be stored on disk. It will overwrite any existing file with the same name. ggsave() chooses the file format from the ending of the filename. If it ends in .png it will be saved in PNG format. If it ends in .jpg it will be JPEG format, .pdf gives PDF format, and so on.
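For a report you will often want to control the size and resolution of the saved figure as well. A sketch of how that can be done with the width =, height = and dpi = arguments of ggsave(). The example uses R's built-in mtcars data and saves to a temporary folder so that it can run anywhere; the object and file names are chosen only for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Save to a temporary folder here; in a project you would use e.g. "output/..."
out_file <- file.path(tempdir(), "hist_mpg.png")

# width and height are in inches by default; dpi sets the resolution
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```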

4.1 Exercises

In these exercises, you will be using the dataset abu89. In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on another line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

1. What is wrong with the following code? Why are the outlines of the bins not blue? Modify the code so the outlines are colored in blue.

hist <- ggplot(abu89) +
  geom_histogram(aes(x = wage_hour, color = "blue"))

hist
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
hist <- ggplot(abu89) +
  geom_histogram(aes(x = wage_hour), color = "blue")

hist
"The color argument should not be placed inside `aes()`"
"If the color argument is placed inside `aes()`, it has to be set equal to a variable name, and the colors will vary with the values of that variable."

2. In the following, the exercises are sequential, i.e. you should use the code from the previous exercise as default for the next, and simply modify the code so that the plot corresponds to the specific task. Remember to print the output each time, by typing the name of the plot, to make sure you’ve done it correctly.

a) Create a combined histogram and density plot of the variable wage_hour in the abu89 dataset and store it in an object named plot. Set the theme to theme_bw(). Remember to print the plot by typing its name on another line.

plot <- ggplot(abu89, aes(x = wage_hour)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  theme_bw()

plot
"Replace the dotted lines with the name of the theme"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density() +
  ...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density() +
  theme_bw()

...

b) Modify your code to fill the density curve with the color “firebrick”, and make it somewhat transparent by setting the alpha = argument to 0.3. Remember to print the result by typing the name of the plot on another line.

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha = 0.3, fill = "firebrick") +
  theme_bw()

plot
"Replace the dotted lines with the correct arguments"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(... = ..., ... = "...") +
  theme_bw()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha = 0.3, fill = "firebrick") +
  theme_bw()

...

c) Modify the ticks of the x-axis using the scale_x_continuous() function and the breaks = argument. Let the breaks range from 0 to 300 in steps of 50.

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha=0.3, fill="firebrick") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0,300,50))

plot
"Remember the seq() function"
"Replace the dotted lines with the correct specifications to solve the exercise"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha=0.3, fill="firebrick") +
  theme_bw() +
  scale_x_continuous(... = seq(...,...,...))
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha=0.3, fill="firebrick") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0,300,50))

...

d) Modify the labels on the plot. Change the label of the x-axis to “Hourly wage (NOK)”.

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha=0.3, fill="firebrick") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0,300,50)) +
  labs(x = "Hourly wage (NOK)")

plot
"Add the labs() function to your code and place the x = argument inside it"
"Replace the dotted lines with the right argument"

(plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha=0.3, fill="firebrick") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0,300,50)) +
  labs(... = "...")
)
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

plot <- ggplot(abu89, aes(x = wage_hour, y = ..density..)) +
  geom_histogram() +
  geom_density(alpha=0.3, fill="firebrick") +
  theme_bw() +
  scale_x_continuous(breaks = seq(0,300,50)) +
  labs(x = "Hourly wage (NOK)")

...

e) Write the correct code to save your plot. Name the file “hist_density.png”, and remember to specify which plot to save using the name of the plot and the plot = argument.

ggsave("hist_density.png", plot = plot)

f) Use the console below to try out the different aesthetics introduced in this chapter on your plot. Try setting an appropriate title, different colors and fills, a different number of bins, etc.
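The exercises above use the ..density.. notation. In ggplot2 3.4.0 and later this notation is deprecated in favor of after_stat(density), so you may see a warning when running the code with a recent version. The sketch below shows the newer spelling. It assumes ggplot2 3.4.0 or later, uses simulated wages (the data frame demo is not abu89), and picks a binwidth of 10 purely for illustration.

```r
library(ggplot2)

# Simulated stand-in for abu89$wage_hour (made-up values)
set.seed(1)
demo <- data.frame(wage_hour = rlnorm(300, meanlog = 4.3, sdlog = 0.4))

plot_new <- ggplot(demo, aes(x = wage_hour)) +
  # after_stat(density) rescales the bars so that their total area is 1,
  # which puts the histogram on the same scale as the density curve
  geom_histogram(aes(y = after_stat(density)), binwidth = 10) +
  geom_density(alpha = 0.3, fill = "firebrick") +
  theme_bw()

plot_new
```

Setting binwidth = (or bins =) explicitly also silences the `stat_bin()` message about the default of 30 bins.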

5 Boxplots

Curriculum

  • R4DS: Chapter 1
  • BPS: Chapter 2, p. 53-56

Boxplots are used to graph what is known as the five-number summary (minimum, first quartile, median, third quartile and maximum). You can read more about boxplots and the five-number summary in BPS (Chapter 2, paragraph 2.5).
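To make the five-number summary concrete, here is a small sketch that computes it in base R. The vector hours contains made-up numbers and is not taken from issp.

```r
# Hypothetical weekly working hours (made-up numbers)
hours <- c(20, 30, 35, 37.5, 37.5, 40, 40, 45, 60)

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(hours)

# The 0%, 25%, 50%, 75% and 100% quantiles
quantile(hours)
```

For these numbers both functions give 20, 35, 37.5, 40 and 60. Note that geom_boxplot() by default draws each whisker only to the most extreme observation within 1.5 times the interquartile range from the box, and plots observations beyond that as individual points, so the whisker ends do not always equal the minimum and maximum.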

We can create a boxplot of the hours_work variable in the issp dataset by using the geom geom_boxplot():

# Simple boxplot
ggplot(issp, aes(y = hours_work)) + 
  geom_boxplot(color = "red", 
               show.legend = FALSE)

In the example above we map the hours_work variable to the y-axis, set the color = argument to "red", and use the show.legend = FALSE argument to drop the legend.

However, boxplots are usually reserved for comparing distributions. We can map the variable gender to the x-axis to show the five-number summary for the variable hours_work by gender:

# Simple boxplot by gender
ggplot(issp, aes(y = hours_work, x = gender)) + 
  geom_boxplot(color = "black", 
               fill = "skyblue")

ggplot2 also allows us to change the boxplot color by group. As shown in R4DS chapter 1, we could modify our code so that the color varies by gender by mapping the grouping variable gender to the color = argument inside aes():

# Changing color of plot by group
ggplot(issp, aes(y = hours_work, x = gender, color = gender)) + 
  geom_boxplot()

Mapping a grouping variable (a discrete variable) to the color = argument is a great way to visualize the difference between groups as each group gets a separate color.
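Mapping a color inside aes() and setting it outside aes() are easy to mix up. The sketch below contrasts the two and uses layer_data() to inspect the colors ggplot2 actually ends up drawing. The data frame demo is simulated (it is not issp), and the group means are arbitrary assumptions.

```r
library(ggplot2)

# Simulated stand-in for issp (made-up values)
set.seed(1)
demo <- data.frame(
  gender = rep(c("Female", "Male"), each = 100),
  hours_work = c(rnorm(100, mean = 35, sd = 8), rnorm(100, mean = 39, sd = 8))
)

# Mapped: color depends on the data, one color per gender, legend included
mapped <- ggplot(demo, aes(y = hours_work, x = gender, color = gender)) +
  geom_boxplot()

# Set: every box gets the same fixed color, no legend
fixed <- ggplot(demo, aes(y = hours_work, x = gender)) +
  geom_boxplot(color = "black")

# Colors used per box after the plots are built
unique(layer_data(mapped)$colour)
unique(layer_data(fixed)$colour)
```

The mapped plot reports two different colors chosen by the default color scale, while the fixed plot reports only the color that was set.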

In a previous example we removed the legend from the plot. However, if we wanted to adjust the position of the legend, we could use the theme() function combined with the legend.position = argument. You can put it on “top”, “bottom”, “left”, “right” - or remove it altogether with “none”.

# Changing the position of legend
ggplot(issp, aes(y = hours_work, x = gender, color = gender)) + 
  geom_boxplot() +
  theme(legend.position = "top")

ggplot(issp, aes(y = hours_work, x = gender, color = gender)) + 
  geom_boxplot() +
  theme(legend.position = "bottom")

ggplot(issp, aes(y = hours_work, x = gender, color = gender)) + 
  geom_boxplot() +
  theme(legend.position = "none")

One handy function to know about is coord_flip(), which swaps the x- and y-axes. It is not specific to boxplots and can be added to other ggplots as well, but boxplots are a good case for showing the effect of flipping the axes:

# Flipping the y- and x-axis
ggplot(issp, aes(y = hours_work, x = gender, color = gender)) + 
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none")
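As an alternative sketch, assuming ggplot2 3.3.0 or later, a horizontal boxplot can also be made without coord_flip() by mapping the grouping variable to y and the continuous variable to x. The data frame demo is simulated and only stands in for issp.

```r
library(ggplot2)

# Simulated stand-in for issp (made-up values)
set.seed(1)
demo <- data.frame(
  gender = rep(c("Female", "Male"), each = 100),
  hours_work = c(rnorm(100, mean = 35, sd = 8), rnorm(100, mean = 39, sd = 8))
)

# Vertical boxes, flipped afterwards with coord_flip()
flipped <- ggplot(demo, aes(y = hours_work, x = gender, color = gender)) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none")

# Horizontal boxes directly: grouping variable on y, continuous variable on x
swapped <- ggplot(demo, aes(x = hours_work, y = gender, color = gender)) +
  geom_boxplot() +
  theme(legend.position = "none")

flipped
swapped
```

Both versions should give a similar horizontal layout. With the second approach, functions such as labs() and scale_x_continuous() refer to the axes as they appear on screen, which some find easier to reason about.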

Help documentation

To see the help page for the function geom_boxplot(), run the following code and the help page will open in your browser.

help(geom_boxplot, package = "ggplot2")

5.1 Exercises

Reminder

In these exercises, you will be using the dataset abu89. In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on another line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

1. Using the abu89 dataset…

a) …Make a boxplot of the wage_hour variable with the following specifications and store it in an object named box:

  • Color: “gray”
  • Fill: “coral”
  • Theme: theme_minimal()

Also, remember to print the plot by typing its name on another line.

box <- ggplot(abu89, aes(y = wage_hour)) + 
  geom_boxplot(color = "gray", fill = "coral") +
  theme_minimal()

box
"Replace the dotted lines with the correct arguments and specifications"

box <- ggplot(abu89, aes(y = ...)) + 
  geom_boxplot(color = "...", fill = "...") +
  theme_...()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

box <- ggplot(abu89, aes(y = wage_hour)) + 
  geom_boxplot(color = "gray", fill = "coral") +
  theme_minimal()

...

b) Modify your code from above so that you group the plot by the class variable.

box <- ggplot(abu89, aes(y = wage_hour, x = class)) + 
  geom_boxplot(color = "gray", fill = "coral") +
  theme_minimal()

box
"Map the class variable to the `x =` argument inside aes()"
"Replace the dotted lines with the correct argument and variable"

box <- ggplot(abu89, aes(y = wage_hour, ... = ...)) + 
  geom_boxplot(color = "gray", fill = "coral") +
  theme_minimal()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

box <- ggplot(abu89, aes(y = wage_hour, x = class)) + 
  geom_boxplot(color = "gray", fill = "coral") +
  theme_minimal()

...

c) Modify your code from the previous exercise, so that the box for each class has a separate color.

box <- ggplot(abu89, aes(y = wage_hour, x = class, color = class)) + 
  geom_boxplot() +
  theme_minimal()

box
"Inside aes() map the variable class to the color argument"
"Replace the dotted lines with the name of the grouping variable"

box <- ggplot(abu89, aes(y = wage_hour, x = class, color = ...)) + 
  geom_boxplot() +
  theme_minimal()
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

box <- ggplot(abu89, aes(y = wage_hour, x = class, color = class)) + 
  geom_boxplot() +
  theme_minimal()

...

How did that plot turn out? Overlapping labels may be an issue if the labels are long. Can you think of a function presented in this part that would be handy to make the plot look better?

d) Modify your code so that the labels on the x-axis do not overlap.

box <- ggplot(abu89, aes(y = wage_hour, x = class, color = class)) + 
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

box
"Maybe the function coord_flip() would be useful in this case?"
"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

box <- ggplot(abu89, aes(y = wage_hour, x = class, color = class)) + 
  geom_boxplot() +
  theme_minimal() +
  coord_flip()

...

Continuous variables

Published: 12 September, 2021