Data visualization and graphics


1 Scatterplots

Curriculum

R4DS: Chapter 1
BPS: Chapter 4, p. 103-110

Scatterplots are used to visualize the relationship between two continuous variables. In ggplot2 we can use the geom geom_point() to create a simple scatterplot:

# Simple scatterplot
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_point()

Above we created a simple scatterplot of the relationship between age and hours_work from the issp dataset. We could modify the points by using another shape:

# Changing the shape of the points
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_point(shape=18)

There are several shapes you can use to modify the points. The picture below shows the different shapes and their corresponding numbers.
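The shape chart can also be reproduced directly in R. The following is a minimal sketch, assuming the standard shape codes 0 to 25; the helper data frame shapes and its grid layout are made up for illustration:

```r
library(ggplot2)

# Hypothetical helper data: one row per shape code, laid out on a 7-column grid
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,
  y = -(0:25 %/% 7)
)

p_shapes <- ggplot(shapes, aes(x = x, y = y)) +
  # fill only affects shapes 21-25, which have a separate fill color
  geom_point(aes(shape = shape), size = 5, fill = "red") +
  # print the shape number just below each point
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +
  # use the numbers in the shape column as shape codes, without a legend
  scale_shape_identity() +
  theme_void()

p_shapes
```

Any of the printed numbers can then be passed to the shape = argument of geom_point().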

Scatterplots with a third variable

If we wanted to include a third variable, class, we could use several aesthetics to map it. We could use the color = argument to highlight the class differences by using a specific color for each class:

# Simple scatterplot with third variable using the color = argument
ggplot(issp, aes(x=age, y=hours_work, color=class)) +
 geom_point()

However, we could also use a different size or shape to highlight the class categories:

# Shape
ggplot(issp, aes(x=age, y=hours_work, shape=class)) +
 geom_point()

# Size
ggplot(issp, aes(x=age, y=hours_work, size=class)) +
 geom_point()

## Warning: Using size for a discrete variable is not advised.

We get a warning from R when using the size = class argument. As you can see from the plot, the points tend to overlap, making it difficult to distinguish between the different classes.

Tackling the issue with overplotting

A common problem if you have a large dataset with many observations is overplotting, i.e. that the points overlap. There are different ways to tackle this problem. For example, we could make the points more transparent using the alpha = argument from before:

# Correcting for overplotting: transparency of points
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_point(alpha=0.2)

This makes it a bit better. You should always try different values for the alpha = argument. It ranges from 0 to 1, where lower values correspond to more transparency.

Alternatively, we could use geom_jitter() to help tackle overplotting. This geom adds some random variation to the location of each point:

# Correcting for overplotting: geom_jitter()
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_jitter()

The plot clearly shows more datapoints using geom_jitter(). We could then combine geom_jitter() with the alpha = argument to reduce the problem further:

# Correcting for overplotting: geom_jitter() and transparency
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_jitter(alpha=0.3)
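The amount of random variation can be controlled with the width = and height = arguments of geom_jitter(). The sketch below uses a small simulated data frame, sim, as a hypothetical stand-in for issp, and the values 0.4 and 2 are arbitrary choices for illustration:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for issp: whole-number ages and hours in steps of 5
sim <- data.frame(
  age = sample(18:70, 500, replace = TRUE),
  hours_work = sample(seq(10, 60, by = 5), 500, replace = TRUE)
)

# Each point is moved by at most 0.4 units horizontally and 2 units vertically
p_jitter <- ggplot(sim, aes(x = age, y = hours_work)) +
  geom_jitter(width = 0.4, height = 2, alpha = 0.3)

p_jitter
```

Small values keep the points close to their true position, while larger values spread them out more but make the exact values harder to read off the plot.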

geom_jitter() is especially useful if you plot the relationship between a discrete and continuous variable. Take the following example. Note that we include coord_flip() so that the labels on the x-axis won’t overlap.

ggplot(issp, aes(x=class, y=hours_work, color=class)) +
 geom_point() + 
  coord_flip()

This is not very informative since it just creates lines of points for each value of class. We can correct this issue by using geom_jitter() instead:

ggplot(issp, aes(x=class, y=hours_work, color=class)) +
 geom_jitter() +
  coord_flip()

However, it would be more useful to create a boxplot if you want to visualize the relationship between a categorical and continuous variable.
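As a rough sketch of that alternative, geom_boxplot() can replace geom_jitter() in the same code. The data frame sim and its class labels are simulated here purely for illustration and are not taken from issp:

```r
library(ggplot2)

set.seed(2)
# Simulated stand-in for issp; the class labels are made up
sim <- data.frame(
  class = sample(c("Class A", "Class B", "Class C"), 300, replace = TRUE),
  hours_work = round(rnorm(300, mean = 38, sd = 8))
)

# One box per class, flipped so the class labels sit on the vertical axis
p_box <- ggplot(sim, aes(x = class, y = hours_work, color = class)) +
  geom_boxplot() +
  coord_flip()

p_box
```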

Another way to reduce overplotting is to use geom_hex(). This geom is part of ggplot2 but requires the hexbin package to be installed. It creates a hexagonal heat map, where each hexagon represents a collection of points:

# Correcting for overplotting: geom_hex()
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_hex()

We can see that there are more observations in the brighter colors.

You can also modify the number of bins using the bins = argument:

# Changing the number of bins
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_hex(bins=10)

You should always try increasing and decreasing the number of bins to see what happens to your plot.

We could also change the color of the gradient by adding the scale_fill_gradient() function:

# Changing the color of gradient
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_hex(bins = 40) +
  scale_fill_gradient(low = "grey", high = "red")

And to visualize the points even better, we could place the scatterplot on top of the hexagonal heat map:
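A minimal sketch of such a layered plot is given here. It uses a small simulated data frame, sim, as a hypothetical stand-in for issp, and the bins and alpha values are illustrative choices rather than settings from this session:

```r
library(ggplot2)  # geom_hex() also needs the hexbin package to be installed

set.seed(3)
# Simulated stand-in for issp
sim <- data.frame(
  age = sample(18:70, 500, replace = TRUE),
  hours_work = round(rnorm(500, mean = 38, sd = 8))
)

# aes() is given once in ggplot() and is inherited by both geoms
p_layers <- ggplot(sim, aes(x = age, y = hours_work)) +
  geom_hex(bins = 20) +              # heat map as the bottom layer
  geom_jitter(alpha = 0.3) +         # points drawn on top of it
  scale_fill_gradient(low = "grey", high = "red")

p_layers
```

Geoms are drawn in the order they are added, so geom_hex() comes first and the points are placed on top of it.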

A general tip when adding multiple geoms to a plot is to put aes() at the top of the code, as in the example above: we place aes() inside ggplot() rather than inside geom_hex() or geom_jitter(). R then uses the mappings inside aes() as the default for every geom that follows, which saves typing. This is good practice even if you use only one geom, because it makes the code easier to read. That’s why we’ve always placed aes() inside ggplot() in the examples in this session.

These are different approaches to tackling overplotting. Which method comes closest to solving the problem will vary from case to case, so you should try them all to see which works best for your data.

Scatterplots with regression line: geom_smooth() with lm

It is also possible to fit a straight regression line through the data by adding geom_smooth() to our scatterplot:

# Adding a straight regression line
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

In the code above, we set the method = argument to “lm”. You’ll learn more about what “lm” means in the session on regression, but for now, just know that specifying method = “lm” tells R to fit a linear regression line (i.e., a straight line) to the plot.

From the output, we see that the regression line is nearly flat, although it slopes slightly downward, which suggests a very weak negative relationship between age and working hours.
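A visual impression like this can be checked with numbers. The sketch below is only an illustration on simulated data: sim is a hypothetical stand-in for issp with a made-up, slightly negative relationship, so the values it prints say nothing about the real dataset. It shows that the line drawn by geom_smooth(method = "lm") is the same line that lm() fits:

```r
library(ggplot2)

set.seed(4)
sim <- data.frame(age = sample(18:70, 200, replace = TRUE))
# Hours decline very slightly with age, plus noise (made-up relationship)
sim$hours_work <- 40 - 0.02 * sim$age + rnorm(200, sd = 8)

fit <- lm(hours_work ~ age, data = sim)
coef(fit)                      # intercept and slope of the straight line
cor(sim$age, sim$hours_work)   # direction and strength of the linear relationship

p_lm <- ggplot(sim, aes(x = age, y = hours_work)) +
  geom_point() +
  geom_smooth(method = "lm")

p_lm
```

A slope and correlation close to zero correspond to a nearly flat line, and their sign gives the direction of the relationship.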

If we do not want the confidence band around the line (the shaded area, which is based on the standard error) displayed in the plot, we can add the argument se = FALSE inside geom_smooth():

# Removing the confidence band
ggplot(issp, aes(x=age, y=hours_work)) +
 geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

If we wanted to add regression lines by a grouping variable, we could use geom_point() and the color = argument, in combination with geom_smooth():

# Including a grouping variable using the color = argument
ggplot(issp, aes(x=age, y=hours_work, color=gender)) +
 geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'
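Where the aesthetic is placed matters here. As a sketch (with a simulated data frame sim and made-up gender labels standing in for issp), moving color = gender from ggplot() into geom_point() colors the points by group but lets geom_smooth() fit a single line through all observations:

```r
library(ggplot2)

set.seed(5)
# Simulated stand-in for issp; the gender labels are made up
sim <- data.frame(
  age = sample(18:70, 200, replace = TRUE),
  hours_work = round(rnorm(200, mean = 38, sd = 8)),
  gender = sample(c("Female", "Male"), 200, replace = TRUE)
)

# color is mapped only inside geom_point(), so geom_smooth() does not inherit it
p_one_line <- ggplot(sim, aes(x = age, y = hours_work)) +
  geom_point(aes(color = gender)) +
  geom_smooth(method = "lm")

p_one_line
```

Mappings given in ggplot() apply to every geom, while mappings given inside a geom apply only to that geom.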

Help documentation

To see the help pages for the functions geom_point(), geom_jitter(), geom_hex() and geom_smooth(), just run the following code and the help pages will open in your browser.

help(geom_point, package = "ggplot2")
help(geom_jitter, package = "ggplot2")
help(geom_hex, package = "ggplot2")
help(geom_smooth, package = "ggplot2")

1.1 Exercises

Reminder

In these exercises, you will be using the dataset abu89. In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on a separate line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

The exercises (1, 2, 3) in this part are also sequential (1a-1g, 2a-2c, 3a-3e), meaning that you should modify your code from the previous step (a) to correspond to the changes in the following step (b, etc.).

Scatterplots and overplotting

1. Using the abu89 dataset…

a) … Make a scatterplot of the relationship between age and wage_hour. Store it in an object named p. Remember to type the name of the plot on a separate line to print the result.