Data visualization and graphics


1 Barplots

Curriculum

R4DS: Chapter 1
BPS: Chapter 1, p. 16-21

Barplots are used to visualize the distribution of categorical data. There are two geoms for barplots in ggplot2: geom_col() and geom_bar(). We use the former when our data is already aggregated (e.g. a table with frequencies), and the latter when the data is not aggregated, i.e. a dataset with one row per observation.

The main difference between the two geoms is that they have different default “stats”. The stat decides what is put on the y-axis. The default stat of geom_col() is stat_identity(), which leaves the data as is. Thus, geom_col() expects both y- and x-values to be specified in aes(), so that the y-variable represents the height of each bar. Usually, the x-axis variable is discrete, while the y-axis variable is numeric.

On the other hand, the default stat of geom_bar() is stat_count(). The geom_bar() function expects only an X-variable. stat_count() counts the number of observations for each value of X, and the values of these counts are then mapped to the y-axis.
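The difference between the two stats can be checked on a small example. The sketch below uses made-up data (the data frames raw and agg and the variable party are hypothetical, not part of the issp dataset) and the ggplot2 function layer_data(), which returns the values a plot actually draws.

```r
library(ggplot2)

# Hypothetical raw data: one row per respondent
raw <- data.frame(party = c("A", "A", "B", "C", "C", "C"))

# Hypothetical aggregated data: one row per party with its frequency
agg <- data.frame(party = c("A", "B", "C"), n = c(2, 1, 3))

# geom_bar() counts the rows itself (stat_count)
p_bar <- ggplot(raw, aes(x = party)) +
  geom_bar()

# geom_col() uses the supplied heights as they are (stat_identity)
p_col <- ggplot(agg, aes(x = party, y = n)) +
  geom_col()

# Bar heights as computed by ggplot2 for each plot
bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
```

Both vectors should contain the same heights, since the aggregated table holds exactly the counts that stat_count() computes from the raw rows.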

Barplots with `geom_bar()`

For example, if we wanted to visualize how many respondents voted for the different Norwegian political parties, we could use geom_bar():

# Simple barplot
ggplot(issp, aes(x = voting_elec)) +
  geom_bar() + 
  coord_flip()

In the code above, we created a barplot of the variable voting_elec from the issp dataset using the geom geom_bar(). We also used the function coord_flip() to flip the y- and x-axis so that the labels of the variable voting_elec do not overlap. On the horizontal axis we can see the variable count, which ggplot2 computes automatically and which gives the number of observations for each value of voting_elec.

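If we want to show proportions instead of counts, we can modify our code and add the y = ..prop.. argument along with the group = 1 argument (from ggplot2 3.4.0 onwards, after_stat(prop) is the recommended way to write ..prop.., which still works but gives a deprecation warning):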

# Proportions instead of counts
ggplot(issp, aes(x = voting_elec, y = ..prop.., group = 1)) +
  geom_bar() + 
  coord_flip()

If we do not set the group = 1 argument, the proportions are calculated separately within each value of voting_elec. Every bar then gets the proportion 1, which is not what we want:

ggplot(issp, aes(x = voting_elec, y = ..prop..)) +
  geom_bar() + 
  coord_flip()
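To see what group = 1 changes, we can look at the computed proportions directly. This is a minimal sketch on made-up data (raw and party are hypothetical names), written with after_stat(prop), which assumes ggplot2 3.3.0 or later and means the same as ..prop...

```r
library(ggplot2)

# Hypothetical raw data: one row per respondent
raw <- data.frame(party = c("A", "A", "B", "C", "C", "C"))

# With group = 1, all rows belong to one group
with_group <- ggplot(raw, aes(x = party, y = after_stat(prop), group = 1)) +
  geom_bar()

# Without group = 1, each value of party is its own group
without_group <- ggplot(raw, aes(x = party, y = after_stat(prop))) +
  geom_bar()

# Proportions as computed by stat_count() in each case
prop_with <- layer_data(with_group)$prop
prop_without <- layer_data(without_group)$prop
```

With group = 1 the proportions are shares of the whole dataset and sum to 1. Without it, each bar is compared only to itself, so every proportion is 1.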

If we’d rather show percentages instead, we add the y = ..prop..*100 argument:

# Percentages instead of counts
ggplot(issp, aes(x = voting_elec, y = ..prop..*100, group = 1)) +
  geom_bar() + 
  coord_flip()

Barplots with two variables

It’s possible to include a second variable in geom_bar(). If we wanted to visualize how many votes each political party received by gender, we could use the color = argument:

# Including a grouping-variable using the color = argument
ggplot(issp, aes(x = voting_elec, color = gender)) +
  geom_bar() +
  coord_flip()

In the code above we map the variable gender to the color = argument. However, color only changes the outline of the bars, which makes the groups hard to tell apart. It is usually more useful to map the gender variable to the fill = argument:

# Including a grouping-variable using the fill = argument
ggplot(issp, aes(x = voting_elec, fill = gender)) +
  geom_bar() +
  coord_flip()

Probably, in this case, it would be more useful to have the bars by gender side-by-side instead of stacked on top of each other. To do so, use position = "dodge" inside the geom_bar() function.

# Changing the position of the bars 
ggplot(issp, aes(x = voting_elec, fill = gender)) +
  geom_bar(position = "dodge") +
  coord_flip()

This plot shows the number of votes from each gender more clearly than the previous one.

Help documentation

To see the help-page for the function geom_bar(), just run the following code and the help-page will open in your browser.

help(geom_bar, package = "ggplot2")

1.1 Exercises

Reminder

In these exercises, you will be using the dataset abu89. In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on a separate line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

1. Using the abu89 dataset…

a) …Make a barplot of the variable promotion and store it in an object named p.

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion))

p

"Replace the dotted lines with the name of the dataset and the variable"

p <- ggplot(...) +
  geom_bar(aes(x=...))

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion))

...

b) Modify your code so that the plot shows proportions instead of counts.

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, y=..prop.., group = 1))

p

"Replace the dotted lines with the correct specifications and argument"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, y=..., ... = ...))

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, y=..prop.., group = 1))

...

c) Modify your code so that the plot shows percentages instead of proportions.

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, y=..prop..*100, group = 1))

p

"Maybe some multiplication will help?"

"Replace the dotted lines with the correct specifications"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, y=...*..., group = ...))

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, y=..prop..*100, group = 1))

...

2. How does promotion vary with gender? Using the abu89 dataset…

a) …Make a barplot of promotion by gender using the fill = argument. Store the plot in an object named p, and remember to print the plot by typing its name on another line.

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, fill=gender))

p

"Replace the dotted lines with the name of the variable and the correct argument"

p <- ggplot(abu89) +
  geom_bar(aes(x=..., ...=...))

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, fill=gender))

...

b) Are there more promotions in the private or the public sector? Use the same code as in the previous exercise, but replace the gender variable with the sector variable.

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, fill=sector))

p

"Replace the dotted line with the correct variable"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, fill=...))

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(abu89) +
  geom_bar(aes(x=promotion, fill=sector))

...

c) Try to make the plot look better as you see fit. Give it a descriptive title, x- and y-axis labels, get some colors in it, decide if you want to show counts, proportions or percent etc.

3. What is wrong with the following code? Why does the plot look wrong? Modify the code so that the plot becomes correct.

plot <- ggplot(issp) +
  geom_bar(aes(x = class, y = ..prop..)) + 
  coord_flip()

plot

plot <- ggplot(issp) +
  geom_bar(aes(x = class, y = ..prop.., group=1)) + 
  coord_flip()

plot

"Remember the `group =` argument"

"Replace the dotted line with the correct specification for the 'group =' argument"

plot <- ggplot(issp) +
  geom_bar(aes(x = class, y = ..prop.., group = ...)) + 
  coord_flip()

plot

2 Barplots with data from a table

As mentioned earlier, geom_col() works differently than geom_bar(). geom_col() requires that you map both an X- and a Y-variable. You should use geom_col() if you have pre-calculated values that already exist in the data. These values are often aggregated summaries, frequencies or proportions.

For the purpose of example, let’s assume we had this dataset vote:

vote has 3 variables. It contains the voting_elec variable from the issp dataset, the variable n, which counts the number of observations in each value of voting_elec and prop, which gives the proportions of each value of voting_elec.
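A table like this could be built from data with one row per observation. The sketch below is only an assumption about how such a table might be made: it uses the dplyr functions count() and mutate() on a small made-up data frame (raw and vote_sketch are hypothetical names), not the actual issp data.

```r
library(dplyr)

# Hypothetical raw data: one row per respondent
raw <- data.frame(voting_elec = c("A", "A", "B", "C", "C", "C"))

# Count the observations for each party, then turn the counts into proportions
vote_sketch <- raw %>%
  count(voting_elec) %>%
  mutate(prop = n / sum(n))

vote_sketch
```

The result has one row per party and the three variables voting_elec, n and prop, which is the structure geom_col() needs.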

We could then use geom_col() to plot the proportion of votes to each party. We also add the coord_flip() function to flip the y- and x-axis so that the labels on the x-axis won’t overlap.

# Barplot using geom_col()
ggplot(vote) +
  geom_col(aes(x = voting_elec, y = prop)) +
  coord_flip()

As you can see, the result of this plot is identical to the plot where we used geom_bar() to plot the proportions of votes each party received. The difference is that ggplot2 automatically calculates the proportion when we use the geom_bar() function and specify the y = ..prop.. argument, while we have to map the variable prop ourselves when using geom_col(). Therefore, to be able to use geom_col(), the y-values must already be present in your data.

However, we could make exactly the same plot using geom_bar() instead. As mentioned earlier, the default stat of geom_bar() is stat = "count". By changing it to stat = "identity", we can use the Y-variable that already exists in the data in geom_bar() and obtain the same plot as when we used geom_col():

# Changing the default stat
ggplot(vote) +
  geom_bar(aes(x = voting_elec, y = prop), stat = "identity") +
  coord_flip()

By comparing this plot with the one above, we see that they are exactly the same!
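If you do not want to rely on comparing the plots by eye, the drawn values can be compared as well. This is a small sketch on a made-up table (agg and party are hypothetical names), using layer_data() from ggplot2 to extract the bar heights.

```r
library(ggplot2)

# Hypothetical aggregated table with pre-calculated proportions
agg <- data.frame(party = c("A", "B", "C"), prop = c(0.5, 0.3, 0.2))

# The same table plotted with geom_col() and with geom_bar(stat = "identity")
p_col <- ggplot(agg) +
  geom_col(aes(x = party, y = prop))

p_bar <- ggplot(agg) +
  geom_bar(aes(x = party, y = prop), stat = "identity")

# TRUE if both plots draw bars of the same height
same_heights <- isTRUE(all.equal(layer_data(p_col)$y, layer_data(p_bar)$y))
same_heights
```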

Barplots with a third variable: colors, fill, groups, stacked vs. dodge etc

But what if we wanted to visualize the percentage of votes to each political party by class? To make this plot, we have to use three variables. But how?

Imagine we had this dataset, vote2, which has four variables: class and voting_elec from the issp dataset, n, which counts the number of observations in each combination of voting_elec and class (used to make the percent variable), and percent, which gives the percent of votes to each party by class.
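As a rough sketch of how a table with this structure might be made, the code below counts each combination of class and party and then calculates percentages. It assumes that “percent by class” means that the percentages sum to 100 within each class, and it uses made-up data (raw and vote2_sketch are hypothetical names), not the issp dataset.

```r
library(dplyr)

# Hypothetical raw data: one row per respondent
raw <- data.frame(
  class       = c("working", "working", "working", "middle", "middle"),
  voting_elec = c("A", "A", "B", "A", "B")
)

# Count each combination, then calculate the percent within each class
vote2_sketch <- raw %>%
  count(class, voting_elec) %>%
  group_by(class) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

vote2_sketch
```

The group_by(class) step is what makes the percentages relative to each class rather than to the whole dataset.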

To be able to plot the percentage of votes to the different political parties by class, we could use the color = and fill = arguments in geom_col() to obtain the desired plot:

# Barplot with a third variable using the color = argument
ggplot(vote2) +
  geom_col(aes(x = voting_elec, y = percent, color = class)) +
  coord_flip()

Or we could use the fill = argument instead:

# Barplot with a third variable using the fill = argument
ggplot(vote2) +
  geom_col(aes(x = voting_elec, y = percent, fill = class)) +
  coord_flip()

Data from a cross-tabulation

Sometimes data comes in the form of a cross-tabulation where the values you would like to plot are spread across two columns. A lot of the datasets in the textbook we use (and elsewhere as well) are in this form. Published tables are often in this form because it is an easier layout for the reader. Thus, we need to be able to handle this structure as well.

That can be handled by specifying geom_col() twice with a different aes(y = ...) in each. However, with ggplot() it is best if the values are in one single column. This is sometimes called “long form”, and it is more efficient to work with. It is also a much tidier structure, and it is sometimes just called tidy. All the functions we use from the tidyverse package are based on this structure. Guess why it is called tidyverse!

To transpose the cross-tab to tidy, use the function pivot_longer(). We will not go into details on pivoting. It is enough (for our purposes) to just specify which column not to transpose. Remember that the exclamation mark ! means “not”, so the following code means “transpose everything, but not the variable voting_elec”. For this to work as planned,

# Transposing all columns except 'voting_elec'
vote4 <- vote3 %>% 
  pivot_longer(!voting_elec)
vote4

This changes the variable names: by default, pivot_longer() calls the new columns name and value. That does not really matter for making the plot, but you can choose the names yourself using names_to = and values_to =. (You can also use the rename() function, which you will learn about later.)

vote4 <- vote3 %>% 
  pivot_longer(!voting_elec, names_to = "gender", values_to = "percent")
vote4
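Since the cross-tabulation itself is not shown here, the following sketch uses a small made-up table (wide, with the hypothetical columns party, Men and Women) to show what pivot_longer() does to the structure, both with the default names and with names we choose ourselves.

```r
library(tidyr)

# Hypothetical cross-tabulation: one column of percentages per gender
wide <- data.frame(
  party = c("A", "B"),
  Men   = c(60, 40),
  Women = c(45, 55)
)

# Default column names for the new columns
long_default <- pivot_longer(wide, !party)

# Column names chosen with names_to = and values_to =
long_named <- pivot_longer(wide, !party, names_to = "gender", values_to = "percent")

long_named
```

The two rows and two value columns of the wide table become four rows in the long table, one for each combination of party and gender.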

Now, you can plot the data as before:

ggplot(vote4) +
  geom_col(aes(x = voting_elec, y = percent, fill = gender)) +
  coord_flip()

Changing the position

As mentioned in chapter 1 of R4DS (paragraph: position), the default position of both geom_bar() and geom_col() is "stack". This means that the values for each group are stacked on top of each other within each bar:

# Position = stack
ggplot(vote2, aes(x = voting_elec, y = percent, fill = class)) +
  geom_col(position = "stack") +
  coord_flip()

However, we could modify the position by adding the argument position = "dodge":

# Position = dodge
ggplot(vote2, aes(x = voting_elec, y = percent, fill = class)) +
  geom_col(position = "dodge") +
  coord_flip()

As you can see, by adding this argument the different classes get their own bar for each political party.
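The difference between the two positions can also be seen in the values that are drawn. The sketch below uses a made-up table (tab, with the hypothetical variables party, class and percent) and layer_data() from ggplot2 to compare the top of the tallest bar under each position.

```r
library(ggplot2)

# Hypothetical table: percent of votes to each party within each class
tab <- data.frame(
  party   = c("A", "A", "B", "B"),
  class   = c("working", "middle", "working", "middle"),
  percent = c(60, 30, 40, 70)
)

base <- ggplot(tab, aes(x = party, y = percent, fill = class))

# Values drawn with stacked bars and with dodged bars
stacked <- layer_data(base + geom_col(position = "stack"))
dodged  <- layer_data(base + geom_col(position = "dodge"))

# Top of the tallest bar in each plot
tallest_stacked <- max(stacked$ymax)
tallest_dodged  <- max(dodged$ymax)
```

When stacked, the tallest bar reaches the sum of the classes for that party (40 + 70). When dodged, each class keeps its own bar, so the tallest bar is just the largest single value (70).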

Help documentation

To see the help-page for the function geom_col(), just run the following code and the help-page will open in your browser.

help(geom_col, package = "ggplot2")

2.1 Exercises

Reminder

In these exercises, you will be using the dataset promo (shown in the output below). In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on a separate line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

1. Using the promo dataset and geom_col()…

a) …Create a barplot showing the proportions of promotion. Store the plot in an object named p.

p <- ggplot(promo, aes(x=promotion, y=proportion)) +
  geom_col()

p

"Replace the dotted lines with the correct dataset and variables"

p <- ggplot(..., aes(x=..., y=...)) +
  geom_col()

"Replace the dotted lines with the correct variable"

p <- ggplot(promo, aes(x=promotion, y=...)) +
  geom_col()

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo, aes(x=promotion, y=proportion)) +
  geom_col()

...

b) What if you wanted to show counts instead of proportion? Which variable in the promo dataset should you then include in your code? Modify your code from above to obtain the desired result.

p <- ggplot(promo, aes(x=promotion, y=n)) +
  geom_col()

p

"Maybe the 'n' variable may help?"

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo, aes(x=promotion, y=n)) +
  geom_col()

...

2. Does promotion vary with gender? Using the promo_gen dataset (shown in the output below) and geom_col()…

a) …Create a barplot showing the percentage of promotion by gender. Store it in an object named p. Hint: Use the fill= argument.

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col()

p

"Replace the dotted lines with the correct variables"

p <- ggplot(promo_gen, aes(x=..., y=..., fill=...)) +
  geom_col()

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col()

...

b) Change your plot so that each gender gets a separate bar for each value of promotion.

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col(position = "dodge")

p

"Remember the 'position =' argument"

"Replace the dotted lines with the correct position"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col(position = "...")

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col(position = "dodge")

...

3 Multiple plots in one

Curriculum

R4DS: Chapter 1

In addition to using aesthetic arguments to map a third variable to a plot, ggplot2 also offers the function facet_wrap(). facet_wrap() works especially well with categorical data, and lets you split your plot into multiple subplots, one for each category of the given variable:

# Faceting plot by class
ggplot(vote2) +
  geom_col(aes(x = voting_elec, y = percent)) +
  coord_flip() +
  facet_wrap(~class)

We now get a separate plot for each class. If we wanted to add different colors for each class, we could add the fill = argument:

# Fill with separate colors for each class
ggplot(vote2) +
  geom_col(aes(x = voting_elec, y = percent, fill = class)) +
  coord_flip() +
  facet_wrap(~class)
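To check that facet_wrap() really makes one subplot per category, we can count the panels in a built plot. This is a minimal sketch on a made-up table (tab, with the hypothetical variables party, class and percent), again using layer_data() from ggplot2.

```r
library(ggplot2)

# Hypothetical table: percent of votes to each party within each class
tab <- data.frame(
  party   = c("A", "A", "B", "B"),
  class   = c("working", "middle", "working", "middle"),
  percent = c(60, 30, 40, 70)
)

# One subplot per value of class
p <- ggplot(tab) +
  geom_col(aes(x = party, y = percent)) +
  coord_flip() +
  facet_wrap(~class)

# Number of panels in the plot
n_panels <- length(unique(layer_data(p)$PANEL))
n_panels
```

The number of panels equals the number of distinct values of the faceting variable, here two.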

Help documentation

To see the help-page for the function facet_wrap(), just run the following code and the help-page will open in your browser.

help(facet_wrap, package = "ggplot2")

3.1 Exercises

Reminder

In these exercises, you will be using the dataset promo_gen (shown in the output below). In all exercises where you are asked to make a plot or modify the code used to make a plot, you’ll have to type the name of the plot on a separate line to print the output. And finally, remember to always press the ‘run code’ button before submitting your answer.

1. Using the promo_gen dataset…

a) …Make a barplot using geom_col() to plot the percent of promotions by gender and store it in an object named p. Use facet_wrap() to divide the plot into two subplots for each gender. Remember to type in the name of the plot on another line to print it.

p <- ggplot(promo_gen, aes(x=promotion, y=percent)) +
  geom_col() +
  facet_wrap(~gender)

p

"Replace the dotted lines with the name of the dataset and the variables"

p <- ggplot(..., aes(x=..., y=...)) +
  geom_col() +
  facet_wrap(~...)

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo_gen, aes(x=promotion, y=percent)) +
  geom_col() +
  facet_wrap(~gender)
  
...

b) Fill the plot with color by gender to highlight the gender differences.

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender)

p

"Remember the fill= argument"

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=...)) +
  geom_col() +
  facet_wrap(~gender)

...

c) As stated previously, labels can make plots look much better! Add the following labels to the y- and x-axis:

y-axis: “Percent” x-axis: “Promotion”

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(y = "Percent",
       x = "Promotion")

p

"Remember the 'labs()' function and its 'y =' and 'x =' arguments"

"Replace the dotted lines with the correct labels"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(y = "...",
       x = "...")

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(y = "Percent",
       x = "Promotion")


...

d) A good title is also necessary to communicate the information in a plot effectively. Add an appropriate title to your plot. Use the following: “Men get more promotions than women”.

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(y = "Percent",
       x = "Promotion",
       title = "Men get more promotions than women")

p

"Remember the 'title =' argument inside the 'labs()' function"

"Replace the dotted line with the correct title"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(y = "Percent",
       x = "Promotion",
       title = "...")

"Finally, remember to print the plot by typing its name. Replace the dotted line with the name of the plot"

p <- ggplot(promo_gen, aes(x=promotion, y=percent, fill=gender)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(y = "Percent",
       x = "Promotion",
       title = "Men get more promotions than women")


...

4 Pie and donut charts

Pie charts are a popular type of graph, but there are really no situations where a pie chart is preferable to a bar chart or some other type of chart. If the goal is to communicate information in the best possible way, you have no use for pie charts.

R can make them, of course, and there are packages that make it relatively painless as well. So, if you really, really need to do it, you can.

Here is an example using ggplot2. The key here is to make a bar chart on a circular coordinate system, using coord_polar(). The values that specify the size of each piece of the pie are given by the y-variable, and the categories are the fill colors. So, you let the x-argument be blank.

Then there are a couple of arguments to add to the coord_polar function: theta = "y" specifies that it is the y-variable that should map to angle. The argument start = 0 just lets the first category start on the top of the pie.

You can try it out below. The dataset vote5 is loaded in your workspace. Try changing some of the arguments to see what happens.

(This exercise is just for playing around with pie charts, so you will not get feedback on it.)

# Piechart
ggplot(vote5, aes(x = "", y = prop, fill = voting_elec )) +
  geom_col(width = 1, col = "white") + 
  ___(theta = "y", start=0) +
  theme_void()

Now, you might think this plot would be a lot easier to read if labels and percentages were added to each piece of the pie. Of course, that can also be done. But you should then give the following a thought: If you need to read the categories and percentages to make sense of the graph, then why make a graph in the first place?

So, we do not cover further elaborations on pie charts here.

One of the main problems with pie charts is that you have to visually estimate the angle of each piece to get a perception of the area which represents the proportion in each category. You can actually improve the visual appearance by removing the angle. A donut chart is a pie chart with a hole in the middle. So, if someone insists you should make a pie chart, then try giving them a donut chart instead.

A donut chart is made by a slight modification of the pie chart. First, you set x = 2 in the aes() function. Then you add the xlim() function. The reasoning behind this is a bit cryptic and we do not explain it here, but set the xlim values to .2 and 2.5.

# Donut chart
ggplot(vote5, aes(x = ___, y = prop, fill = voting_elec )) +
  geom_col(width = 1, col = "white") + 
  coord_polar(theta = "y", start=0) +
  xlim(___, ___) +
  theme_void()
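For reference, here is one complete version of the donut chart with the values from the text filled in. It is only a sketch on made-up data (shares and party are hypothetical names, not the vote5 dataset). One way to read the xlim values: with x = 2 and width = 1 the bars cover the range from 1.5 to 2.5 on the x-axis, so letting the axis start at 0.2 leaves an empty range that becomes the hole when the plot is bent into a circle.

```r
library(ggplot2)

# Hypothetical table with one proportion per party
shares <- data.frame(party = c("A", "B", "C"), prop = c(0.5, 0.3, 0.2))

# Stacked bar at x = 2, bent into a ring by coord_polar()
donut <- ggplot(shares, aes(x = 2, y = prop, fill = party)) +
  geom_col(width = 1, colour = "white") +
  coord_polar(theta = "y", start = 0) +
  xlim(0.2, 2.5) +   # the empty range below 1.5 becomes the hole
  theme_void()

# Values drawn in the plot, for checking where the bars sit on the x-axis
donut_data <- layer_data(donut)

donut
```

Try changing the lower xlim value to see how the size of the hole changes.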
