Data wrangling with tidyverse and dplyr

1 Introduction

Statisticians often use pre-existing datasets to answer their research questions, and pre-existing datasets often contain a large number of variables and observations. A central first step in most statistical data analysis is therefore to transform the data to make the dataset more manageable. For example, we usually want to select only the variables needed for the analysis and drop the ones we don’t need. We also have to filter out observations that do not fulfill the sample criteria, and usually it’s also necessary to create some new variables. These examples illustrate the importance of data transformation. In this session you will learn how to transform and handle data using the tidyverse and dplyr packages.

The curriculum for this topic is chapter 3 in R for data science (R4DS). The exercises in this session follow the outline of chapter 3 in R4DS, so we highly recommend that you use the chapter as a manual when trying to solve the following exercises. For examples and learning outcomes that are not explicitly mentioned in R4DS, some introductory text and examples will be presented.

You will mainly be using the dataset issp when exploring the dplyr verbs for data transformation and solving the exercises, but for some verbs other datasets will be used. The issp dataset and the packages tidyverse and dplyr are already loaded in your workspace.

Before we dive straight into the dplyr verbs for data transformation, get familiar with the issp dataset using the functions glimpse() and head() that you’ve already learned. Replace the dotted lines with the name of the dataset, and remember to hit the ‘run code’ button to get the output.

glimpse(...)

head(...)

2 The dplyr verbs for data transformation

The dplyr package is a core part of the tidyverse, and is intended to make the process of data transformation easy by focusing on six verbs. You’ve probably read the curriculum (chapter 3 in R4DS) which explains the verbs, but here is a quick reminder:

filter(): subsets observations based on conditions
arrange(): reorders the observations based on values in specified variables
select(): picks out specific variables to keep or delete (from a dataset)
mutate(): creates new variables, typically based on the values of existing variables
summarize(): creates summary statistics for a specific variable and is especially useful in combination with the group_by() function, which enables analysis on groups instead of the complete dataset.
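
As a rough illustration of how the verbs fit together, here is a minimal sketch on a small made-up data frame. The data frame toy and all of its columns are invented for this example and are not part of the issp dataset.

```r
library(dplyr)

# Hypothetical toy data (not the issp dataset)
toy <- tibble(
  id     = 1:6,
  gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  age    = c(23, 35, 47, 52, 31, 64),
  hours  = c(20, 40, 38, NA, 45, 10)
)

# filter(): keep only the rows that meet a condition
age_30plus <- filter(toy, age >= 30)

# arrange(): reorder the rows, here from oldest to youngest
sorted <- arrange(age_30plus, desc(age))

# select(): keep only some of the columns
narrow <- select(sorted, id, gender, hours)

# mutate(): add a new column computed from an existing one
with_flag <- mutate(narrow, full_time = hours >= 37)

# summarise() combined with group_by(): one summary row per gender
by_gender <- summarise(
  group_by(toy, gender),
  n          = n(),
  mean_hours = mean(hours, na.rm = TRUE)
)
by_gender
```

Each verb takes a data frame as its first argument and returns a new data frame, which is why the result of one step can be passed straight into the next.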

It may also be useful to take a look at dplyr’s cheatsheet. You can read more about dplyr and download the cheatsheet here

2.1 Filter observations with `filter()`

Help documentation

To see the help-page for the function filter(), just run the following code and the help-page will open in your browser.

help(filter, package = "dplyr")

Note that since the filter() function is not a part of base R, we’ll have to specify which package the function belongs to. In this case, it’s the dplyr package. Sometimes when you see us referring to help-pages we do not use the package = argument, and other times we do. This depends on whether or not the particular function is part of base R or another package.

2.1.1 Exercises

Read the paragraph in R4DS, chapter 3: “Filter rows with filter()”, and try to solve the following exercises.

1. Using the issp dataset, filter observations to only include respondents who are at least 25 years old and at most 50 years old (greater than or equal to 25 and less than or equal to 50).

filter(issp, age >= 25 & age <=50)

"Remember the comparison operators: greater than or equal to: >=, and less than or equal to: <="

2. Create a new dataset named issp_age based on the filtering conditions in the previous exercise

issp_age <- filter(issp, age >= 25 & age <=50)

"The assignment operator <- is very useful for the purpose of creating a new dataframe"

3. In the dataset issp, how many respondents have 16 years of education and work more than 40 hours per week? …

a) … Use filter() to keep only the observations where the respondent has 16 years of education (yeareduc) and works more than 40 hours per week (hours_work). Store the result in an object named issp_educwork.

issp_educwork <- filter(issp, yeareduc == 16 & hours_work >40)

"You can either use the 'and' operator: &, or you could use a comma"

issp_educwork <- filter(issp, ... & ...)
issp_educwork <- filter(issp, ..., ...)

b) Use the glimpse() function to see how many respondents have 16 years of education and work more than 40 hours per week. Hint: Look at the number of rows.

glimpse(issp_educwork)

"Replace the dotted line with the name of your newly created object"

glimpse(...)

4. How many respondents are in the highest or lowest income deciles (inc_dec) and live in a sparsely settled area (pl_residence)?

a) … Using the issp dataset, filter() the observations to only keep respondents who are either in the highest (“857000_or_higher”) or lowest (“0-123000”) income deciles (inc_dec). Store the result in an object named issp_inc.

issp_inc <- filter(issp, (inc_dec == "0-123000" | inc_dec == "857000_or_higher"))

"Replace the dotted lines with the correct values of the inc_dec variable"

issp_inc <- filter(issp, (inc_dec == "..." | inc_dec == "..."))

b) …Now filter the issp_inc data frame to only include respondents whose place of living (pl_residence) is in a sparsely settled area (“Sparsely_settled”). Store the result in an object named issp_incpl.

issp_incpl <- filter(issp_inc, pl_residence == "Sparsely_settled")

"Replace the dotted lines with the correct dataframe and value of the pl_residence variable to complete the task"

issp_incpl <- filter(..., pl_residence == "...")

"Replace the dotted lines with the correct value of the pl_residence variable to complete the task"

issp_incpl <- filter(issp_inc, pl_residence == "...")

c) …Now that you’ve filtered the data on all the conditions, use the glimpse() function on your newest created object (issp_incpl) to see how many respondents are in the highest or lowest income deciles and live in a sparsely settled area. Hint: Look at the number of rows.

glimpse(issp_incpl)

"Replace the dotted line with the name of the correct object to solve the task"

glimpse(...)

5. What is wrong with the following code? The output we wanted was to see all observations which are either “Upper_middle_class” or “Upper_class”. Modify the code to get the output desired. There are two correct ways to solve this task (see R4DS, chapter 3) - you can use either one.

filter(issp, class == "Upper_middle_class" | "Upper_class")

filter(issp, class == "Upper_middle_class" | class == "Upper_class")
filter(issp, class %in% c("Upper_middle_class", "Upper_class"))

"You can either use the or operator: |, or you can use the %in% operator combined with the c() function"

filter(issp, class == ... | class == ...)

filter(issp, class %in% c("...", "..."))
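
To see why each side of the | operator has to be a complete condition, it can help to try both solutions on a tiny made-up data frame. This is only a sketch: the data frame toy is invented for the example, and only the column name class and its values are borrowed from the exercise.

```r
library(dplyr)

# Hypothetical toy data (not the issp dataset)
toy <- tibble(
  id    = 1:5,
  class = c("Upper_class", "Middle_class", "Upper_middle_class",
            "Working_class", NA)
)

# `class == "Upper_middle_class" | "Upper_class"` fails because the right-hand
# side of | is a bare character string, not a logical comparison.
# Each side of | must be a full condition that evaluates to TRUE or FALSE:
upper_or <- filter(toy, class == "Upper_middle_class" | class == "Upper_class")

# %in% tests membership in a set of values, which is shorter when there are
# many values to match:
upper_in <- filter(toy, class %in% c("Upper_middle_class", "Upper_class"))

upper_or
upper_in
```

Both calls keep the same rows here. The %in% version tends to be easier to read and extend once more than two values are involved.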

6. Using the issp dataset, what gender is the respondent who is 35 years old and divorced? Write the correct code to get the desired output where you can identify the respondent’s gender. There are two possible ways (correct codes) to solve this task:

filter(issp, age == 35 & marital_status == "Divorced")
filter(issp, age == 35, marital_status == "Divorced")

"You can either use the 'and' operator: &, or you could use a comma"

filter(issp, ... & ...)
filter(issp, ..., ...)

filter(issp, age ... & marital_status ...)

7. Using the issp dataset, filter the observations to only include respondents who are not married.

filter(issp, marital_status != "Married")

"Remember the 'not equal to' operator: !="

filter(issp, ... != "...")
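
One thing to keep in mind with != is how filter() treats missing values: it only keeps rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped. The sketch below uses a made-up data frame (toy is hypothetical) to show the difference.

```r
library(dplyr)

# Hypothetical toy data (not the issp dataset)
toy <- tibble(
  id             = 1:4,
  marital_status = c("Married", "Divorced", NA, "Unmarried")
)

# NA != "Married" evaluates to NA, so the respondent with a missing
# marital status (id 3) is dropped together with the married respondent
not_married <- filter(toy, marital_status != "Married")

# To keep respondents with missing marital status, ask for them explicitly
not_married_or_na <- filter(
  toy,
  marital_status != "Married" | is.na(marital_status)
)

not_married
not_married_or_na
```

Which of the two versions is appropriate depends on the research question, so it is worth deciding deliberately rather than relying on the default.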

2.2 Reorder rows with `arrange()`

You might wonder why you should sort your dataset. Does it matter for the analysis? Not for the analysis as such, but it can matter for some recoding tasks. For now, it mainly matters for the output. You sometimes need to inspect the data, e.g. by looking at a subset of the dataset or by grouping observations together, for instance within the same year, in chronological order, or within a family. We can do this using the dplyr function arrange().
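
As a small sketch of sorting within groups, the example below orders a made-up data frame first by family and then chronologically within each family. The data frame toy and its columns family, year and income are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data (not the issp dataset)
toy <- tibble(
  family = c("B", "A", "B", "A", "A"),
  year   = c(2019, 2020, 2018, 2018, 2019),
  income = c(410, 520, 390, 480, 500)
)

# Sort by family first, then by year within each family (ascending)
by_family_year <- arrange(toy, family, year)

# Same grouping, but with the most recent year first within each family
by_family_newest <- arrange(toy, family, desc(year))

by_family_year
by_family_newest
```

When several variables are given, arrange() uses each additional variable to break ties in the preceding ones.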

Help documentation

To see the help-page for the function arrange(), just run the following code and the help-page will open in your browser.

help(arrange, package = "dplyr")

2.2.1 Exercises

Read the paragraph in R4DS, chapter 3: “Arrange rows with arrange()”, and try to solve the following exercises.

1. Using the issp dataset, reorder the rows of the age variable from youngest to oldest (ascending order). What is the gender of the youngest respondent? Write the correct code to get the output needed to answer this question.

arrange(issp, age)

arrange(..., ...)

2. Using the issp dataset, reorder the rows of the years of education variable from highest to lowest. How many years of completed education has the respondent with the highest years of education? Write the correct code to obtain this information.

arrange(issp, desc(yeareduc))

arrange(..., desc(...))

2.3 Select variables with `select()`

Read the paragraph in R4DS, chapter 3: “Select Columns with select()”, and try to solve the following exercises.

Help documentation

To see the help-page for the function select(), just run the following code and the help-page will open in your browser.

help(select, package = "dplyr")

You’ll also encounter the function rename(), which gives a column a new name. Why would you want to rename variables? Sometimes variable names are inconvenient, for example because they are too long or unintelligible. A variable name should give a sufficient hint of what the variable contains, so that you do not have to look it up in the data documentation all the time. Names that are too long take a lot of typing, which leads to frequent typos.
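
Here is a minimal sketch of rename() on a made-up data frame with deliberately awkward column names. The data frame toy and its original column names are hypothetical; the new names are chosen to resemble the ones used in issp.

```r
library(dplyr)

# Hypothetical toy data with inconvenient variable names
toy <- tibble(
  idnr                                = 1:3,
  Q17_b_total_household_income_decile = c(3, 7, 10),
  v42                                 = c(12, 16, 18)
)

# The pattern is always new_name = old_name
toy_renamed <- rename(
  toy,
  inc_dec  = Q17_b_total_household_income_decile,
  yeareduc = v42
)

names(toy_renamed)
```

Columns that are not mentioned in rename() are kept unchanged and in their original position.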

Run the following code and the help-page will open in your browser.

help(rename, package = "dplyr")

2.3.1 Exercises

1. Write the correct code to reorder the columns in the issp dataset so that the gender variable comes first. Hint: use select() with the everything() helper (see R4DS: chapter 3).

select(issp, gender, everything())

select(..., ..., everything())

2. Create a new dataframe called subset which only contains the columns idnr, gender, age and class. Hint: remember the assignment operator <-.

subset <- select(issp, idnr, gender, age, class)

subset <- select(..., ..., ..., ..., ...)

3. Using the issp dataset, write the correct code to rename the variable yeareduc to years_educ.

rename(issp, years_educ = yeareduc)

"the rename() function would be useful"

"Remember: first the new variable name, then the old variable name"

rename(issp, new_var_name = old_var_name)

4. Write the correct code to select all variables in the issp dataset except for the variable religion.

select(issp, -religion)

"Remember to use the minus symbol: -"

select(..., -...)

5. Create a new dataframe called subset2 which only contains the 1st, 4th, 3rd and 10th column in the issp dataset.

subset2 <- select(issp, 1,4,3,10)

"You can refer to the variables (columns) either by their name or their column number"

subset2 <- select(..., ..,..,..,10)

6. Write the correct code that selects all other variables except voting_elec, diff_income and good_edu. Hint: Remember the : operator for consecutive columns

select(issp, -(voting_elec:good_edu))

"Remember to use parentheses"

select(..., -(...:...))

2.4 Recoding and creating new variables based on existing ones with `mutate()`

If we want to create or modify variables in a dataset, the mutate() function from dplyr is our way to go! In this part, we’ll briefly introduce the most basic use of mutate() and how it can be combined with other dplyr functions.

In addition to creating new variables based on functions from existing variables described in R4DS chapter three (section: Add New Variables with mutate()), the mutate() function is also very useful if you want to recode an existing variable. Using mutate() in combination with other dplyr functions like if_else() and case_when() is especially handy in this regard and will be the focus in this section.

if_else() and case_when() are called conditional operations, because we write a (set of) condition(s) which are evaluated by R. R will then execute different procedures based on whether or not the condition is met (TRUE or FALSE). We’ll start by introducing the if_else() function.

Help documentation

To see the help-page for the function mutate(), just run the following code and the help-page will open in your browser.

help(mutate, package = "dplyr")

`if_else()` function

if_else() is especially useful when we only have one specified condition. In the issp dataset, the variable gender is a factor variable consisting of two categories: “Female” and “Male”. However, when dealing with a categorical variable with only two categories, the most appropriate way is to convert them to a dummy-variable. A dummy variable is a numeric variable, consisting of the values 1 if the condition is TRUE, and everything else is given the value 0 (FALSE).

Let’s say we wanted to create a dummy variable where the value of 1 refers to females, and the value of 0 refers to males. A general rule when creating dummy-variables, is to name the variable with the quality assigned to the value 1. In our example, the value of 1 in the gender variable equals females. We therefore want to create a dummy variable named female. We can do this by using mutate() in combination with if_else():

issp <- mutate(
  issp, female = if_else(
    condition = gender == "Female",
    true = 1, 
    false = 0)
  )

In the example above issp refers to the dataframe, female is the name of the new variable we’re creating and if_else() is a logical function. The first argument in the if_else() function is the condition argument. Here we ask R to evaluate whether the value of gender equals “Female”. The second argument in the if_else() function is the true = argument. The value given to the true = argument tells R what value the female variable should have if the condition (gender == "Female") is true. In our case, we gave it the value of 1. The third argument in the if_else() function is the false = argument, and the value given to the false = argument is the value the female variable should have if the condition is false (which essentially means gender != "Female": gender is not equal to “Female”). Basically you could read the example above as “if the value of gender equals ‘Female’, the variable female should be given the value of 1, everything else should be given the value of 0”.

In the example code snippet above, we wrote the code out in full for pedagogical purposes. In the future we will write it more compactly, meaning that we won’t be writing condition = .., true = .., and false = ... However, it is important to note that when the arguments are not named, the first value ALWAYS refers to the condition = argument, the second value ALWAYS refers to the true = argument and the third value ALWAYS refers to the false = argument. The following code yields exactly the same results:

mutate(issp, female = if_else(gender == "Female", 1, 0))

Be aware of the difference between ifelse() and if_else(). The former is base R and is less strict and takes longer to execute, while the latter belongs to the dplyr package. We recommend using the dplyr and tidyverse functions consistently for a smooth learning process.
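
To give a feel for what “less strict” means in practice, the sketch below mixes a character value and a number in the true and false arguments. The vector hours and the labels are made up for the example, and the error text returned by tryCatch() is our own placeholder message, not dplyr’s actual wording.

```r
library(dplyr)

# Hypothetical working hours
hours <- c(10, 40, 60)

# Base R ifelse() silently coerces mixed types:
# the number 0 is turned into the character string "0"
loose <- ifelse(hours > 37, "full_time", 0)
class(loose)

# dplyr's if_else() refuses to combine a character value with a number
# and stops with an error instead, which we catch here
strict <- tryCatch(
  if_else(hours > 37, "full_time", 0),
  error = function(e) "error: true and false have incompatible types"
)
strict
```

The stricter behaviour can feel inconvenient at first, but it tends to surface recoding mistakes early instead of quietly producing a variable of an unexpected type.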

Help documentation

To see the help-page for the function if_else(), just run the following code and the help-page will open in your browser.

help(if_else, package = "dplyr")

Before we proceed to the introduction of the next function, a general note on the use of mutate() is necessary. If you use the same variable name as the original variable when using mutate(), the function will overwrite the existing variable with the condition(s) specified. However, if you write a new variable name, like we did in the example above (new variable = female, old variable = gender), the mutate() function will create a new variable which will be placed at the end of the dataset.
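
The difference between adding and overwriting a variable can be sketched on a made-up data frame. The data frame toy, the column days_work and the assumption of a 7.5-hour working day are all invented for this illustration.

```r
library(dplyr)

# Hypothetical toy data (not the issp dataset)
toy <- tibble(id = 1:3, hours_work = c(20, 37.5, 45))

# New name: a column is added at the end and hours_work is left untouched
added <- mutate(toy, days_work = hours_work / 7.5)

# Existing name: the hours_work column itself is replaced
overwritten <- mutate(toy, hours_work = hours_work / 7.5)

names(added)
names(overwritten)
```

Note that neither call changes toy itself. The result only persists if it is stored with the assignment operator, as in the examples above.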

`case_when()` function

But what if we have multiple conditions? Let’s say we wanted to create a new categorical variable in the issp dataset called work_hours_cat based on the continuous variable hours_work. We want to create a variable with five categories referring to different levels of working hours per week. In addition to creating dummy variables, this type of transformation of variables from continuous to categorical is also very common. While if_else() is useful when we only have one condition, dplyr’s case_when() is the ‘go-to’ in cases where we have multiple conditions.

In our example of creating a categorical variable with 5 levels based on hours_work, we essentially have five conditions:

1-25: greater than 0 and less than or equal to 25 working hours per week
26-35: greater than 25 and less than or equal to 35 working hours per week
36-45: greater than 35 and less than or equal to 45 working hours per week
46-55: greater than 45 and less than or equal to 55 working hours per week
56 or more: greater than or equal to 56 working hours per week

Using mutate() in combination with case_when(), we can create the variable work_hours_cat with five different levels.

issp <- mutate(
  issp, work_hours_cat = case_when(
    hours_work >0 & hours_work <= 25 ~ 1,
    hours_work >25 & hours_work <= 35 ~ 2,
    hours_work >35 & hours_work <= 45 ~ 3,
    hours_work >45 & hours_work <= 55 ~ 4,
    hours_work >= 56 ~ 5))

Help documentation

You can look up the help documentation of the case_when() function and follow along with the explanation of what we did below. To see the help-page for case_when(), just run the following code and the help-page will open in your browser.

help(case_when, package = "dplyr")

The help documentation informs us that the case_when() function has one main argument, the ... argument, and that we should supply one or more two-sided formulas separated by commas to it. Okay, but what does that actually mean?

When the help documentation highlights a two-sided formula, it basically means: LHS ~ RHS. In this case, LHS refers to left-hand side, while RHS means right-hand side. In the LHS you should type the conditions that you want R to test. This is basically the equivalent to if_else() condition argument. On the other hand, in the RHS we should type the value we want to be returned by the case_when() function when the condition on the left-hand side is met (TRUE). This you can think of as the equivalent to if_else() function’s true = argument. Finally, the tilde symbol (~) is used to separate the left-hand side (the conditions) from the right-hand side (the return values).

In the first line of our example above, issp refers to the dataframe, work_hours_cat is the name of the new variable we’re creating, and we combine the functions mutate() and case_when() to create the variable. Inside the parentheses, the left-hand side refers to hours_work >0 & hours_work <= 25 and the right-hand side refers to the value 1. As you can see, the left-hand side and the right-hand side are separated by the tilde (~) symbol. Essentially this means that if hours_work is greater than 0 and less than or equal to 25, then return the value of 1 in the new variable (work_hours_cat). And as you can see, this structure is the same for all five conditions. case_when() is therefore very easy to use, as the code is intuitive to read and understand.
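
To see the LHS ~ RHS pattern at work on values you can check by eye, here is the same recoding applied to a handful of made-up working hours. The data frame toy is hypothetical; the conditions are the ones from the example above.

```r
library(dplyr)

# Hypothetical working hours, chosen to hit every category
toy <- tibble(hours_work = c(12, 25, 30, 40, 50, 70))

toy <- mutate(
  toy,
  work_hours_cat = case_when(
    hours_work > 0  & hours_work <= 25 ~ 1,  # condition ~ value to return
    hours_work > 25 & hours_work <= 35 ~ 2,
    hours_work > 35 & hours_work <= 45 ~ 3,
    hours_work > 45 & hours_work <= 55 ~ 4,
    hours_work >= 56                   ~ 5
  )
)

toy
```

The conditions are checked from top to bottom, and each row gets the value from the first condition it satisfies.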

`mutate()` and `factor()`: Recode to factor variable

We have now created the categorical variable, work_hours_cat, but the values are not that informative. Recodings like this are very common when you use a pre-existing dataset to answer a research question, and it can get very complicated to remember what value 3 means when you do this type of recoding on 10+ variables. We should convert work_hours_cat to a factor, and give the different levels more informative labels. If we combine the mutate() function with the factor() function you’ve already learned in an earlier session, we can do exactly that.

issp <- mutate(
  issp, work_hours_cat = factor(
    work_hours_cat, 
    levels = c(1,2,3,4,5), 
    labels = c("0-25h", "26-35h", "36-45h", "46-55h", "more than 55h")))

In the code above, we convert the variable work_hours_cat to a factor, and specify the levels and the corresponding labels attached to each level. From the output, we can see that it is now a factor variable with five levels:

glimpse(issp$work_hours_cat)

##  Factor w/ 5 levels "0-25h","26-35h",..: 2 NA 1 NA NA NA NA 3 NA 2 ...

However, instead of doing it in two different chunks of code, we can do it in a single chunk of code:

issp <- mutate(
  issp, work_hours_cat = case_when(
    hours_work >0 & hours_work <= 25 ~ "0-25h",
    hours_work >25 & hours_work <= 35 ~ "26-35h",
    hours_work >35 & hours_work <= 45 ~ "36-45h",
    hours_work >45 & hours_work <= 55 ~ "46-55h",
    hours_work >= 56 ~ "more than 55h"),
               work_hours_cat = factor(work_hours_cat))

glimpse(issp$work_hours_cat)

##  Factor w/ 5 levels "0-25h","26-35h",..: 2 NA 1 NA NA NA NA 3 NA 2 ...

In the code snippet above, we both create the categorical variable work_hours_cat and convert it to a factor (converting the variable to a factor implies overwriting the variable we just created). In the final line we print the variable.

Notice how we gave the variable a character value in this code chunk. Since we dropped giving the return values a numeric value, we don’t have to specify the levels and the corresponding labels in the factor() function which makes the code more efficient. As you can see from the output, this single code chunk gives the exact same result as when we created the variable in two different code chunks.
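
A word of caution about this shortcut: when factor() is given a character variable without a levels argument, it orders the levels alphabetically. The labels used above happen to sort in the intended order, but that is not guaranteed for other labels. The sketch below uses a made-up data frame (toy) and invented education categories to show the difference.

```r
library(dplyr)

# Hypothetical years of education
toy <- tibble(yeareduc = c(9, 12, 16, 20))

toy <- mutate(
  toy,
  educ_cat = case_when(
    yeareduc <= 10 ~ "Low",
    yeareduc <= 15 ~ "Medium",
    yeareduc > 15  ~ "High"
  ),
  # Without levels: alphabetical order (High, Low, Medium)
  educ_default = factor(educ_cat),
  # With levels: the order we actually want (Low, Medium, High)
  educ_ordered = factor(educ_cat, levels = c("Low", "Medium", "High"))
)

levels(toy$educ_default)
levels(toy$educ_ordered)
```

If the order of the categories matters for tables or plots, it is therefore safer to state the levels explicitly.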

Help documentation

To see the help-page for the factor() function, just run the following code and the help-page will open in your browser.

help(factor)

What about missing values (NA’s)?

The help documentation for the case_when() function also explains that if no cases are matched to the specified conditions, NA’s are returned. If we use the summary() function to get information on the dataset, we can see that the new variable, work_hours_cat, has 21 more missing values than hours_work (382 compared to 361):

summary(issp)

##       idnr           day            month            year         gender   
##  Min.   :1002   Min.   : 1.00   Min.   :2.000   Min.   :2020   Male  :606  
##  1st Qu.:2177   1st Qu.: 6.00   1st Qu.:3.000   1st Qu.:2020   Female:717  
##  Median :3175   Median :13.00   Median :3.000   Median :2020               
##  Mean   :3206   Mean   :13.15   Mean   :3.315   Mean   :2020               
##  3rd Qu.:4270   3rd Qu.:17.00   3rd Qu.:3.000   3rd Qu.:2020               
##  Max.   :5400   Max.   :31.00   Max.   :6.000   Max.   :2020               
##                                                                            
##       age                    ethnicity              pl_residence
##  Min.   :18.00   Norwegian        :1196   Big_city        :393  
##  1st Qu.:37.00   No_specific_group:  48   Suburbs_of_city :197  
##  Median :52.00   European         :  42   Small_town      :321  
##  Mean   :50.52   Asian            :   8   Village         :235  
##  3rd Qu.:64.00   Iranian          :   3   Sparsely_settled:170  
##  Max.   :79.00   (Other)          :  10   NA's            :  7  
##  NA's   :2       NA's             :  16                         
##    marital_status         religion      yeareduc       hours_work   
##  Married  :706    Christianity:876   Min.   : 0.00   Min.   :  0.0  
##  Partner  : 44    No_belief   :276   1st Qu.:12.00   1st Qu.: 35.5  
##  Separated: 12    Catholic    : 47   Median :15.00   Median : 38.0  
##  Divorced :114    Other       : 44   Mean   :14.23   Mean   : 37.1  
##  Widowed  : 51    Islam       : 25   3rd Qu.:17.00   3rd Qu.: 42.0  
##  Unmarried:384    (Other)     : 46   Max.   :30.00   Max.   :165.0  
##  NA's     : 12    NA's        :  9   NA's   :11      NA's   :361    
##        inc_dec                   class                 voting_elec 
##  558-658000:159   Middle_class      :616   Arbeiderpartiet   :305  
##  494-557000:151   Working_class     :235   Hoyre             :253  
##  659-856000:144   Upper_middle_class:226   Senterpartiet     :115  
##  370-434000:138   Lower_middle_class:137   Fremskrittspartiet: 81  
##  435-493000:133   Lower_class       : 42   Dont_remember     : 67  
##  (Other)   :588   (Other)           :  2   (Other)           :264  
##  NA's      : 10   NA's              : 65   NA's              :238  
##             diff_income                good_edu                 rich_fam  
##  Strongly_agree   :232   Essential         :120   Essential         : 33  
##  Agree            :638   Very_important    :546   Very_important    :159  
##  Neutral          :249   Fairly_important  :510   Fairly_important  :443  
##  Disagree         :150   Not_very_important: 81   Not_very_important:462  
##  Strongly_disagree: 25   Not_important     : 16   Not_important     :172  
##  Dont_know        : 14   Dont_know         :  4   Dont_know         :  7  
##  NA's             : 15   NA's              : 46   NA's              : 47  
##                edu_pay       scalenow         scalefam          female     
##  Essential         :128   Min.   : 1.000   Min.   : 1.000   Min.   :0.000  
##  Very_important    :496   1st Qu.: 5.000   1st Qu.: 4.000   1st Qu.:0.000  
##  Fairly_important  :565   Median : 6.000   Median : 5.000   Median :1.000  
##  Not_very_important:102   Mean   : 5.792   Mean   : 5.258   Mean   :0.542  
##  Not_important     : 16   3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:1.000  
##  Dont_know         :  4   Max.   :10.000   Max.   :10.000   Max.   :1.000  
##  NA's              : 12   NA's   :16       NA's   :19                      
##        work_hours_cat
##  0-25h        :119   
##  26-35h       : 99   
##  36-45h       :591   
##  46-55h       :103   
##  more than 55h: 29   
##  NA's         :382   
##

Why has this happened? Is there something wrong with our code? Not exactly. If we print the hours_work variable, we see that this variable has a valid range of values from 0.0-165.0.

summary(issp$hours_work)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    35.5    38.0    37.1    42.0   165.0     361

If we then use the functions select() and arrange() to select the hours_work variable and arrange it in ascending order (ignore the %>% operator - we’ll come back to this operator later in this session), we find that the first 21 rows of the hours_work variable have a value of 0.0:

issp %>%
  select(hours_work) %>%
  arrange(hours_work)

When we created the work_hours_cat variable, the first condition on the left-hand side stated that only when hours_work is greater than 0 and less than or equal to 25, it should get the value of 1 (right-hand side). Therefore, the 21 respondents who answered 0.0 on the hours_work variable, got assigned the value of NA in our newly created variable (work_hours_cat) since they do not meet any of the conditions in our code.
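
If we did want the respondents with zero hours to end up in a category of their own, one possible approach is to add an explicit condition for them. The sketch below uses a made-up data frame (toy) and simplified, hypothetical category labels; it is not the recoding used elsewhere in this session.

```r
library(dplyr)

# Hypothetical working hours, including a zero and a genuine missing value
toy <- tibble(hours_work = c(0, 12, 40, 70, NA))

toy <- mutate(
  toy,
  work_hours_cat = case_when(
    hours_work == 0                    ~ "0h",  # explicit category for zero
    hours_work > 0  & hours_work <= 25 ~ "1-25h",
    hours_work > 25 & hours_work <= 55 ~ "26-55h",
    hours_work > 55                    ~ "more than 55h"
  )
)

# The two counts now agree: only the genuine NA remains missing
summarise(
  toy,
  missing_before = sum(is.na(hours_work)),
  missing_after  = sum(is.na(work_hours_cat))
)
```

Comparing the number of missing values before and after a recoding, as in the last step, is a quick way to check that no observations were lost unintentionally.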

Just as with the if_else() function, there are several ways to use the case_when() function. For example, you may recode a variable based on conditions from multiple existing variables. In our examples above, we only show conditions based on one variable. However, detailed documentation of all the possibilities with the case_when() function is beyond the scope of this course. Creating variables based on conditions of one variable is sufficient for the learning outcomes in this course. If you want to learn more about the possibilities with the case_when() function, we recommend reading the help pages.
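
For the curious, here is a minimal sketch of what conditions on more than one variable can look like. The data frame toy, the age limit of 67 and the category labels are all invented for the example and are not part of the course material.

```r
library(dplyr)

# Hypothetical toy data (not the issp dataset)
toy <- tibble(
  age        = c(22, 45, 67, 70, 30),
  hours_work = c(40, 38, 0, 20, 0)
)

toy <- mutate(
  toy,
  work_status = case_when(
    age >= 67 & hours_work == 0 ~ "Retired",
    age >= 67 & hours_work > 0  ~ "Working_past_67",
    age <  67 & hours_work > 0  ~ "Working",
    age <  67 & hours_work == 0 ~ "Not_working"
  )
)

toy
```

Each left-hand side simply combines comparisons on several columns with &, exactly as in filter().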

2.4.1 Exercises

1. What is wrong with the following code? If you scroll to see the last column, you’ll see that all the values in the male variable are 0. This is not correct. Modify the code to get the correct output.

mutate(issp, male = if_else(gender == "male", 1,0))

mutate(issp, male = if_else(gender == "Male", 1,0))

"It may be useful to look up the gender variable.."

"Remember that R is case-sensitive"

2. In the issp dataset, create a new variable named birth_year using mutate() and the existing variables year (year of interview) and age (age of respondent). Store the new variable in the existing dataframe issp using the assignment operator. Finally, print the output by typing the name of the dataset at another line.

issp <- mutate(issp, birth_year = year-age)

issp

"Replace the dotted lines and complete the code:"

issp <- mutate(..., ... = ... - ...)

"Replace the dotted lines and complete the code:"

issp <- mutate(issp, birth_year = ... - age)

3. Using mutate(), case_when() and factor(), generate a new variable named birth_cohort based on the variable you just created, birth_year. The birth cohorts should be divided into six groups based on the following conditions:

1941-1950: if birth_year of respondent is greater than or equal to 1941 and less than or equal to 1950
1951-1960: if birth_year of respondent is greater than 1950 and less than or equal to 1960
1961-1970: if birth_year of respondent is greater than 1960 and less than or equal to 1970
1971-1980: if birth_year of respondent is greater than 1970 and less than or equal to 1980
1981-1990: if birth_year of respondent is greater than 1980 and less than or equal to 1990
1991-2002: if birth_year of respondent is greater than 1990 and less than or equal to 2002

mutate(issp, birth_cohort = case_when(birth_year >= 1941 & birth_year <= 1950 ~ "1941-1950",
                                      birth_year > 1950 & birth_year <= 1960 ~ "1951-1960",
                                      birth_year > 1960 & birth_year <= 1970 ~ "1961-1970",
                                      birth_year > 1970 & birth_year <= 1980 ~ "1971-1980",
                                      birth_year > 1980 & birth_year <= 1990 ~ "1981-1990",
                                      birth_year > 1990 & birth_year <= 2002 ~ "1991-2002"),
       birth_cohort = factor(birth_cohort
                             )
       )

mutate(issp, birth_cohort = case_when(birth_year ... & birth_year ... ~ "1941-1950",
                                      birth_year ... & birth_year ... ~ "1951-1960",
                                      birth_year ... & birth_year ... ~ "1961-1970",
                                      birth_year ... & birth_year ... ~ "1971-1980",
                                      birth_year ... & birth_year ... ~ "1981-1990",
                                      birth_year ... & birth_year ... ~ "1991-2002"),
       ... = factor(...))

mutate(issp, birth_cohort = case_when(birth_year >= 1941 & birth_year <= 1950 ~ "1941-1950",
                                      birth_year ... & birth_year ... ~ "1951-1960",
                                      birth_year ... & birth_year ... ~ "1961-1970",
                                      birth_year ... & birth_year ... ~ "1971-1980",
                                      birth_year ... & birth_year ... ~ "1981-1990",
                                      birth_year ... & birth_year ... ~ "1991-2002"),
       ... = factor(...))

mutate(issp, birth_cohort = case_when(birth_year >= 1941 & birth_year <= 1950 ~ "1941-1950",
                                      birth_year > 1950 & birth_year <= 1960 ~ "1951-1960",
                                      birth_year ... & birth_year ... ~ "1961-1970",
                                      birth_year ... & birth_year ... ~ "1971-1980",
                                      birth_year ... & birth_year ... ~ "1981-1990",
                                      birth_year ... & birth_year ... ~ "1991-2002"),
       birth_cohort = factor(...))

4. Using mutate() and if_else(), create a dummy variable named married, based on the marital_status variable in the issp dataset, where the value 1 is given if the respondent is married (“Married”), everything else should be given the value 0.

mutate(issp, married = if_else(marital_status == "Married", 1, 0))

mutate(..., ... = if_else(... == "...", 1, 0))

2.5 Summarise data with `summarise()`

As explained in R4DS, chapter 3 (section: Grouped summaries with summarise()), the summarise() function summarizes multiple values to a single value. There are a lot of useful summary functions to be used in combination with summarise(). The list below is not exhaustive, but presents some relevant functions for this course:

sum() - calculates the sum of elements in a vector
min() - returns the minimum value of a variable
max() - returns the maximum value of a variable
mean() - calculates the mean of a vector
median() - calculates the median of a vector
sd() - calculates the standard deviation of a vector
IQR() - calculates the interquartile range of a vector
n() - counts the number of observations
n_distinct() - counts the number of unique observations in a variable
sum(is.na()) - counts the number of missing values in a variable
sum(!is.na()) - counts the number of non-missing values
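
As a minimal sketch of how some of these helpers behave inside summarise(), the chunk below applies them to a small made-up tibble. The object toy and its columns group and score are invented for illustration and are not part of the issp dataset.

```r
library(dplyr)

# A made-up tibble; `toy`, `group` and `score` are hypothetical names
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  score = c(10, 20, 30, NA, 50)
)

toy_summary <- summarise(
  toy,
  n_rows     = n(),                       # number of rows
  n_groups   = n_distinct(group),         # number of unique values in group
  total      = sum(score, na.rm = TRUE),  # sum of the non-missing scores
  lowest     = min(score, na.rm = TRUE),  # smallest non-missing score
  highest    = max(score, na.rm = TRUE),  # largest non-missing score
  n_missing  = sum(is.na(score)),         # number of missing scores
  n_observed = sum(!is.na(score))         # number of non-missing scores
)

toy_summary
```

Each argument to summarise() becomes one column in the one-row result, so several summary functions can be compared side by side.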

Note that it does not matter whether you type summarise() or summarize(); both produce the same result.

For example, if we wanted to know the average age of the respondents in the issp dataset, we could write the following code:

summarise(issp, age = mean(age, na.rm = TRUE))

It’s also possible to do summary functions on several variables in the same use of summarise():

summarise(issp, 
          age = mean(age, na.rm = TRUE),
          yeareduc = mean(yeareduc, na.rm = TRUE), 
          hours_work = mean(hours_work, na.rm = TRUE))

Remember to set the na.rm = argument to TRUE! As chapter 3 in R4DS carefully explains, we have to use the argument na.rm = TRUE for aggregating functions when the input variable has missing values. If we do not set the na.rm = argument to TRUE, we’ll get NA returned:

summarise(issp, age = mean(age))

This makes sense. It’s just not possible to do calculations if the values are missing. This is why we use the na.rm = (which literally means NA remove) argument and set it to TRUE.
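
The same behaviour can be seen outside summarise(). The sketch below uses a made-up vector, ages, which is hypothetical and not taken from issp:

```r
# A made-up vector with one missing value
ages <- c(25, 40, NA, 55)

mean(ages)                 # NA: the missing value propagates to the result
mean(ages, na.rm = TRUE)   # 40: the NA is dropped before averaging
```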

Summarise by group with `summarise()` and `group_by()`

However, the summarise() function is not very useful on its own, and is mostly used in combination with another dplyr verb: group_by(). Together, these two verbs provide grouped summaries. If we wanted to know the average age by gender, we could write the following code:

by_gender <- group_by(issp, gender)
summarise(by_gender, age = mean(age, na.rm = TRUE))

In the code chunk above, we first created a new dataframe named by_gender using dplyr’s group_by() function, which groups the entire issp dataset by gender. Then we passed the newly created grouped dataset, by_gender, to summarise() to obtain the average age for males and females.
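
To see what a grouped summary returns, here is a minimal sketch on made-up data. The tibble toy and its values are invented for illustration; only the pattern of group_by() followed by summarise() mirrors the code above.

```r
library(dplyr)

# Made-up respondents; `toy` is a hypothetical stand-in for issp
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male", "Male"),
  age    = c(30, 50, 20, NA, 40)
)

# Step 1: group the rows by gender
by_gender_toy <- group_by(toy, gender)

# Step 2: summarise within each group
age_by_gender <- summarise(
  by_gender_toy,
  respondents = n(),                      # rows per group, including the NA
  mean_age    = mean(age, na.rm = TRUE)   # mean of the non-missing ages
)

age_by_gender
```

The result has one row per group rather than one row in total, which is the difference between a grouped and an ungrouped summary.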

Try it out yourself. In the following you’ll be solving different exercises using the summarise() function. We recommend keeping the section on grouped summaries with summarise() in chapter 3 of R4DS beside you for example code.

Help documentation

To see the help pages for the summarise() and group_by() functions, just run the following code and the help pages will open in your browser.

help(summarise, package = "dplyr")

help(group_by, package = "dplyr")

2.5.1 Exercises

1. How many respondents are there of each social class in the issp dataset?…

a) …Create a new dataframe named by_class that is grouped by the variable class.

by_class <- group_by(issp, class)

"Replace the dotted lines with the name of the dataset and the grouping variable"

by_class <- group_by(..., ...)

b) …Now that you’ve created a grouped dataframe, by_class, use the summarise() and n() functions to count the respondents in each class.

summarise(by_class, count = n())

"Replace the dotted lines with the correct dataframe and function to solve the task"

summarise(..., count = ...())

2. What is the average number of years of completed education (yeareduc) in the different social classes? Write the correct code to obtain this information. Hint: use the dataframe you just created that’s grouped by class, by_class.

summarize(by_class, mean(yeareduc, na.rm=TRUE))

"Replace the dotted lines with the correct specifications to solve the exercise"

summarise(..., mean(..., ...))

"Remember to include the 'na.rm =' argument with the correct specification"

3. Is there a difference between the social classes in the minimum number of years of completed education (yeareduc)? Hint: use the by_class dataframe, and the min() function.

summarize(by_class, min(yeareduc, na.rm = TRUE))

summarise(..., min(..., ...))

"Remember to use the 'na.rm =' argument with the correct specification"

4. Ignoring the variables idnr and year, what is the standard deviation of each of the remaining 5 numeric variables in the issp dataset? Hint: use the glimpse() function to find out which variables are numeric (chapter 3 in R4DS, section on nycflights13, explains the abbreviations for the different types of variables).

summarize(issp, 
          age = sd(age, na.rm = TRUE),
          yeareduc = sd(yeareduc, na.rm = TRUE),
          hours_work = sd(hours_work, na.rm = TRUE),
          scalenow = sd(scalenow, na.rm = TRUE),
          scalefam = sd(scalefam, na.rm = TRUE))

"The 5 remaining numeric variables are: age, yeareduc, hours_work, scalenow and scalefam"

"Remember the 'sd()' function"

" Replace the dotted lines with the correct specifications to solve the exercise"

summarise(issp, 
          age = sd(age, na.rm = TRUE),
          yeareduc = ..(..., ... = ...),
          hours_work = ..(..., ... = ...),
          scalenow = ..(..., ... = ...),
          scalefam = ..(..., ... = ...))

5. Which class has the highest amount of working hours per week (hours_work)? Hint: use the by_class dataframe and the max() function.

summarize(by_class, max(hours_work, na.rm = TRUE))

summarise(..., max(..., ...))

"Remember to use the 'na.rm =' argument with the correct specification!"

3 What is a pipe %>% ?

The pipe operator %>% is another central part of the tidyverse and was introduced in the magrittr package. As explained in R4DS, chapter 3 (section: Combining multiple operations with the pipe), the pipe allows us to do multiple operations in the same code chunk.

Note: As you will soon discover, the pipe operator %>% resembles how you use + in ggplot2. But these are not the same, and you will probably be confused by it. Just remember this: in ggplot2 you use + and in data wrangling you use %>%. The other way around gives you an error. The author of the ggplot2 package explained on Reddit that he wrote the package before the pipe operator existed in R, and that he would have done it differently if it had.

Remember how we first had to create a new grouped dataset before we could use the summarise() function? What if we had multiple things we wanted to do? Well, we would either have to create a lot of temporary dataframes to store the output from each stage and use it as input for the next, or we would have to nest the different functions (i.e. place one function inside another). However, nesting makes the code much harder to read if you have a lot of functions. Luckily, there is a solution! The pipe operator makes this much easier, and it also makes the code a lot easier to read. Essentially, the pipe %>% allows us to take the output of one function and send it directly to the next function, which is very handy when we want to do multiple things to the same dataset.
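
To make the contrast between nesting and piping concrete, here is a minimal sketch on made-up data. The tibble toy is invented for illustration; both versions perform the same three steps.

```r
library(dplyr)

# Made-up respondents; `toy` is a hypothetical stand-in for issp
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  age    = c(30, 50, 20, 70)
)

# Nested version: has to be read from the inside out
nested <- summarise(
  group_by(filter(toy, age <= 65), gender),
  mean_age = mean(age)
)

# Piped version: reads from top to bottom, one step per line
piped <- toy %>%
  filter(age <= 65) %>%
  group_by(gender) %>%
  summarise(mean_age = mean(age))

# Compare the two results
all.equal(nested, piped)
```

With only three steps the nested version is still manageable, but every additional function adds another layer of parentheses, while the piped version simply gets one more line.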

Let’s use an example to illustrate the usefulness of pipes. Say we just got the issp dataset, and we want to know more about the average amount of work hours per week by different birth cohorts. We only want to include respondents who are of working age, and not retired for example. This transformation process consists of multiple operations.

First we’ll have to select the variables needed to obtain this information (1), and then we also have to subset the dataframe to only keep respondents in between the ages of 18 and 65 (working age) (2). Then we need to create a new variable which informs us about the birth year of the respondents (3) before we can proceed to create a variable with different birth cohorts (4). Finally we have to group the dataframe by birth cohorts (5) and then get the mean of working hours per week (6). Because of the multiple steps needed to obtain this information, the above mentioned hypothetical scenario is a good example to illustrate how useful pipes actually are:

We could do this process without the use of pipes:

# Example without the use of pipes
df1 <- select(issp, year, age, hours_work)

df2 <- filter(df1, age >17 & age <= 65)

df3 <- mutate(df2, year_born = year-age)

df4 <- mutate(df3, birth_cohort = case_when(
    year_born >= 1955 & year_born <= 1965 ~ "1955-1965",
    year_born > 1965 & year_born <= 1975 ~ "1966-1975",
    year_born > 1975 & year_born <= 1985 ~ "1976-1985",
    year_born > 1985 & year_born <= 1995 ~ "1986-1995",
    year_born > 1995 & year_born <= 2002 ~ "1996-2002"),
    birth_cohort = factor(birth_cohort))

df5 <- group_by(df4, birth_cohort)

df6 <- summarise(df5, mean_hours_work = mean(hours_work, na.rm=TRUE))

df6

As you can see, with this method we have to create several temporary dataframes which we use as input for the next function. This method can therefore fill up your environment pane with a lot of different objects making your workspace really cluttered. However, we could also do the same operations using pipes:

# Example with the use of pipes
meanwh_by_birthcohort <- issp %>%
  select(year, age, hours_work) %>%
  filter(age >17 & age <= 65) %>%
  mutate(year_born = year-age) %>%
  mutate(birth_cohort = case_when(
    year_born >= 1955 & year_born <= 1965 ~ "1955-1965",
    year_born > 1965 & year_born <= 1975 ~ "1966-1975",
    year_born > 1975 & year_born <= 1985 ~ "1976-1985",
    year_born > 1985 & year_born <= 1995 ~ "1986-1995",
    year_born > 1995 & year_born <= 2002 ~ "1996-2002"),
    birth_cohort = factor(birth_cohort)) %>%
  group_by(birth_cohort) %>%
  summarise(mean_hours_work = mean(hours_work, na.rm=TRUE))

meanwh_by_birthcohort

In the code above we start off by creating a new object named meanwh_by_birthcohort based on the issp dataset. We then use the pipe operator %>% to send the issp dataset first through select(), where we only keep the variables we need, then through filter() to only keep observations where age is between 18 and 65 years. Next we use mutate() to create a new variable, year_born, and then mutate() again, combined with case_when() and factor(), to create the variable birth_cohort and convert it to a factor. Then we group the dataframe by birth_cohort and, finally, use summarise() to calculate the average working hours by birth cohort.

Notice how we did not have to create any temporary dataframes. As you can see, the pipe operator (%>%) allows us to do multiple operations in a series of steps. This means that each line of code only does one thing and the lines follow in sequential order. Because pipes allow us to take the output of one function and send it directly to the next function, they make it much easier to write efficient code! They also make the code intuitive to read - remember that you can basically read the %>% operator as “then”!

Integration with other packages in the tidyverse

One thing that works especially well with the tidyverse is the integration between the core packages. As shown in R4DS, chapter 3, you can easily combine the verbs from the dplyr package with ggplot2. Unfortunately, since ggplot2 was written before the pipe %>% was introduced, ggplot2 still uses + instead of pipes. However, you’ll get comfortable with this pretty quickly.

Using the example from before, if we wanted to visualize the average working hours by birth cohorts this is no problem! We simply add ggplot2 specifications to the code:

issp %>%
  select(year, age, hours_work) %>%
  filter(age >17 & age <= 65) %>%
  mutate(year_born = year-age) %>%
  mutate(birth_cohort = case_when(
    year_born >= 1955 & year_born <= 1965 ~ "1955-1965",
    year_born > 1965 & year_born <= 1975 ~ "1966-1975",
    year_born > 1975 & year_born <= 1985 ~ "1976-1985",
    year_born > 1985 & year_born <= 1995 ~ "1986-1995",
    year_born > 1995 & year_born <= 2002 ~ "1996-2002"),
    birth_cohort = factor(birth_cohort)) %>%
  group_by(birth_cohort) %>%
  summarize(mean_hours_work = mean(hours_work, na.rm=TRUE)) %>%
  ggplot(aes(y=mean_hours_work, x=birth_cohort)) +
  geom_bar(stat = "identity") +
  theme_minimal()

The difference between the first example and this one, is simply that we’ve added some ggplot2 code in the last three lines of the code chunk. We specify the y-axis to be mean_hours_work and the x-axis to be birth_cohort. Then we add a new line, specifying the geom to be geom_bar() and telling R that we’ll specify the y-values ourselves (stat = "identity") and to not use the default, which is count. Finally, we add a theme to make the plot look a bit better (theme_minimal()).
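
As a side note, ggplot2 also provides geom_col(), which plots the y-values exactly as given and so behaves like geom_bar(stat = "identity"). The sketch below uses a made-up summary table, cohort_means, whose numbers are invented for illustration and are not results from the issp dataset.

```r
library(dplyr)
library(ggplot2)

# Made-up summary table; the values are hypothetical
cohort_means <- tibble(
  birth_cohort    = factor(c("1955-1965", "1966-1975", "1976-1985")),
  mean_hours_work = c(36, 38, 39)
)

# geom_col() draws one bar per row, with height taken from y
p_col <- cohort_means %>%
  ggplot(aes(x = birth_cohort, y = mean_hours_work)) +
  geom_col() +
  theme_minimal()

p_col
```

Which of the two spellings to use is a matter of taste; the exercises below stick to geom_bar(stat = "identity").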

3.1 Exercises

In this part, you’ll have to use pipes (%>%) to solve the tasks.

TIP: In RStudio, there is a keyboard shortcut for writing %>%: Ctrl+Shift+M (Cmd+Shift+M on a Mac). Shortcuts are just a bit more convenient.

1. Which gender has the highest amount of working hours?

Using the issp dataset, then
use select to keep the variables age, hours_work and gender, then
filter the rows to remove all respondents who are not in the ages between 25-55, then
group the dataframe by gender and then
summarise the maximum of hours_work for each gender (name it max_hours_work), then
finally, arrange max_hours_work in descending order.

issp %>%
  select(age, hours_work, gender) %>%
  filter(age >= 25 & age <= 55) %>%
  group_by(gender) %>%
  summarize(max_hours_work = max(hours_work, na.rm = TRUE)) %>%
  arrange(desc(max_hours_work))

issp %>%
  select(..., ..., ...) %>%
  filter()

issp %>%
  select(age, hours_work, gender) %>%
  filter(... & ...) %>%
  group_by()

"Remember the max() function to use inside summarise()"

issp %>%
  select(age, hours_work, gender) %>%
  filter(age >= 25 & age <= 55) %>%
  group_by(...) %>%
  summarize(... = ...(...))

"Have you remembered to set the na.rm argument to TRUE?"

"Remember the desc() function to be used inside arrange()"

issp %>%
  select(age, hours_work, gender) %>%
  filter(age >= 25 & age <= 55) %>%
  group_by(gender) %>%
  summarize(max_hours_work = max(hours_work, na.rm = TRUE)) %>%
  arrange(...(...))

2. In this task you’ll be using the gss_wages dataset, based on a survey from the US, which is loaded in your workspace.

a) Use glimpse() to get some more information about the dataset.

glimpse(gss_wages)

b) Which gender has the highest average income? Store the result in an object named p.

using the gss_wages dataset, then
use select() to select the variables realrinc, age and gender, then
use filter() to only keep respondents in the ages between 22 and 65, then
use mutate() in combination with factor() to recode the gender variable to a factor variable named gender_cat, then
group the dataset by gender_cat, and then
summarise the mean income by gender_cat (name it mean_income), then
visualize the difference in average income (mean_income) by gender (gender_cat) with ggplot.
use geom_bar() and specify the stat = argument to “identity”.
Add theme_minimal() to make the plot look better!
Finally, remember to print the plot by typing its name on another line!

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(age >= 22 & age <= 65) %>%
  mutate(gender_cat = factor(gender)) %>%
  group_by(gender_cat) %>%
  summarize(mean_income = mean(realrinc, na.rm = TRUE)) %>%
  ggplot(aes(y=mean_income, x=gender_cat)) +
    geom_bar(stat = "identity") +
    theme_minimal()

p

p <- gss_wages %>%
  select(..., ..., ...) %>%
  filter()

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(... & ...) %>%
  mutate()

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(age >= 22 & age <= 65) %>%
  mutate(gender_cat = ...(...)) %>%
  group_by()

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(age >= 22 & age <= 65) %>%
  mutate(gender_cat = factor(gender)) %>%
  group_by(...) %>%
  summarize()

"Remember the mean() function to be used inside summarize(), and don't forget the na.rm argument!"

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(age >= 22 & age <= 65) %>%
  mutate(gender_cat = factor(gender)) %>%
  group_by(gender_cat) %>%
  summarize(mean_income = ...(..., ... = ...))

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(age >= 22 & age <= 65) %>%
  mutate(gender_cat = factor(gender)) %>%
  group_by(gender_cat) %>%
  summarize(mean_income = mean(realrinc, na.rm = TRUE)) %>%
  ggplot(...(y=..., x=...)) +
  geom_bar()

p <- gss_wages %>%
  select(realrinc, age, gender) %>%
  filter(age >= 22 & age <= 65) %>%
  mutate(gender_cat = factor(gender)) %>%
  group_by(gender_cat) %>%
  summarize(mean_income = mean(realrinc, na.rm = TRUE)) %>%
  ggplot(aes(y=mean_income, x=gender_cat)) +
  geom_bar(... = "...") +
  theme_minimal()

3. In this task, you’ll use the dataset nhis (National Health Interview Survey), which is a survey on health from the US. The dataset is loaded in your workspace.

a) Use glimpse() to get an overview of the dataset.

glimpse(nhis)

b) What is the difference in average health status depending on whether or not the respondent has some health insurance?

Using the nhis dataset, then
select() the variables hi, age and hlth, then
filter() the rows to only keep respondents who are in between 30-50 years old, then
use mutate(), case_when() and factor() to recode the hi variable into a factor variable (name it hi_cat). Value 0 in hi refers to “No health insurance”, while the value 1 refers to “Some health insurance”. Attach these labels to the levels, then
group the dataframe by hi_cat, and then
calculate the mean of hlth (name it mean_hlth).

nhis %>%
  select(hi, age, hlth) %>%
  filter(age >= 30 & age <= 50) %>%
  mutate(hi_cat = case_when(hi == 0 ~ "No health insurance",
                            hi == 1 ~ "Some health insurance"),
         hi_cat = factor(hi_cat)) %>%
  group_by(hi_cat) %>%
  summarize(mean_hlth = mean(hlth, na.rm = TRUE))

nhis %>%
  select(..., ..., ...) %>%
  filter()

nhis %>%
  select(hi, age, hlth) %>%
  filter(... & ...) %>%
  mutate()

nhis %>%
  select(hi, age, hlth) %>%
  filter(age >= 30 & age <= 50) %>%
  mutate(hi_cat = case_when(... == ... ~ "...",
                            ... == ... ~ "..."),
         hi_cat = ...()) %>%
  group_by()

nhis %>%
  select(hi, age, hlth) %>%
  filter(age >= 30 & age <= 50) %>%
  mutate(hi_cat = case_when(hi == 0 ~ "No health insurance",
                            hi == 1 ~ "Some health insurance"),
         hi_cat = factor(hi_cat)) %>%
  group_by(...) %>%
  summarize()

nhis %>%
  select(hi, age, hlth) %>%
  filter(age >= 30 & age <= 50) %>%
  mutate(hi_cat = case_when(hi == 0 ~ "No health insurance",
                            hi == 1 ~ "Some health insurance"),
         hi_cat = factor(hi_cat)) %>%
  group_by(hi_cat) %>%
  summarize(mean_hlth = ...(..., ... = ...))

4 More recoding: NA values

Another common task when transforming a dataset to suit the research question at hand is recoding missing values. Pre-existing datasets very often use a numeric value to indicate “non response” or “don’t know”, for example. If we do not recode these numeric values, all of our calculations and descriptive statistics will be wrong. Remember how the mean is sensitive to extremely small or extremely large values? Think about how the mean of a vector would be affected if the value 98 has a lot of cases and refers to “non-response”.

Recoding to NA - A single vector

Take a look at this raw version of the issp dataset, issp_raw.

glimpse(issp_raw)

## Rows: 1,323
## Columns: 21
## $ idnr           <int> 1002, 1003, 1008, 1015, 1017, 1025, 1029, 1036, 1037, 1…
## $ day            <int> 2, 13, 17, 5, 4, 20, 18, 30, 20, 30, 2, 24, 14, 13, 5, …
## $ month          <int> 3, 5, 3, 3, 3, 4, 3, 3, 4, 3, 3, 4, 5, 5, 3, 5, 3, 2, 3…
## $ year           <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ gender         <int> 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1…
## $ age            <int> 31, 71, 68, 71, 59, 78, 64, 50, 52, 56, 61, 52, 75, 66,…
## $ pl_residence   <int> 1, 5, 5, 1, 5, 5, 3, 2, 2, 2, 2, 3, 1, 4, 3, 1, 5, 1, 5…
## $ ethnicity      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ yeareduc       <int> 16, 11, 2, 16, 10, 15, 13, 18, 12, 14, 14, 6, 9, 12, 20…
## $ marital_status <int> 1, 1, 5, 5, 6, 1, 1, 1, 6, 1, 6, 1, 1, 1, 1, 1, 4, 1, 1…
## $ inc_dec        <int> 8, 4, 6, 8, 5, 4, 9, 8, 3, NA, 3, 9, 5, 5, 10, 9, 9, 7,…
## $ hours_work     <dbl> 35.0, NA, 20.0, NA, NA, NA, NA, 40.0, NA, 32.0, 40.0, 3…
## $ good_edu       <int> 2, 3, 1, 2, 3, 4, 4, 2, 1, 2, 3, 2, 2, 3, 3, 1, 2, 2, 2…
## $ rich_fam       <int> 4, 4, 3, 3, 4, 4, 5, 3, 3, 4, 3, 5, 3, 4, 4, 4, 5, 3, 2…
## $ diff_income    <int> 1, 3, 2, 4, 1, 2, 2, 4, 1, 4, 3, 3, NA, 2, 3, 4, 2, 2, …
## $ scalenow       <int> 9, 5, 6, 6, 3, 8, 7, 7, 4, 8, 5, 5, 7, 6, 6, 8, 5, 7, 7…
## $ scalefam       <int> 5, 5, 7, 3, 6, 4, 6, 7, 4, 4, 7, 5, 6, 8, 5, 6, 4, 5, 6…
## $ class          <int> 5, NA, 4, 4, 3, 5, 5, 5, 2, 4, 3, 5, NA, 4, 4, 5, 2, 5,…
## $ edupay         <int> 1, 4, 2, 3, 3, 3, 4, 2, 3, NA, 3, 1, 2, 2, 2, 2, 2, 2, …
## $ religion       <int> 1, 1, 1, 13, 13, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 13, 1…
## $ voting_elec    <int> 1, NA, 7, 9, NA, 7, 98, 3, 1, NA, 1, 3, 3, 2, 3, 1, 2, …

The version of the issp dataset you’ve been seeing so far in this session has undergone some transformation for the purposes of the exercises. However, in the raw version, the value of 98 in the inc_dec variable indicates “don’t know”. If we wanted to use information on the income of respondents, this value has to be recoded in a way such that R recognizes it as a missing value. We’ll calculate the mean of inc_dec with and without a recoding, to illustrate the importance of recoding to NA. First, we check out the mean of inc_dec without a recoding:

issp_raw %>%
  summarise(inc_dec = mean(inc_dec, na.rm = TRUE))

From the output, we can see that the mean of inc_dec is 10.86519. However, the mean will change when we recode the value of 98 to NA. Using the dplyr function na_if() in combination with mutate(), we can do just that:

issp_raw <- issp_raw %>%
  mutate(inc_dec = na_if(inc_dec, 98))

In the code above, we overwrite the issp_raw dataset and save the output from the next line: using mutate(), we tell R to modify inc_dec and to replace the value 98 in inc_dec with NA. Since we did not give the variable a new name, the code above tells mutate() to overwrite the inc_dec variable. Let’s now calculate the mean of inc_dec after recoding to NA.

issp_raw %>%
  summarise(inc_dec = mean(inc_dec, na.rm = TRUE))

Wow. That’s quite a big change in the output. This example clearly shows how important it is to handle and transform missing values to NA.
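
The effect is easy to reproduce by hand. A minimal sketch, assuming a made-up vector of income deciles in which 98 stands for “don’t know” (the tibble toy and its values are invented for illustration):

```r
library(dplyr)

# Made-up income deciles where 98 is a "don't know" code
toy <- tibble(inc = c(2, 4, 6, 98))

mean(toy$inc)   # 27.5: the code 98 is treated as a real value

# Recode 98 to NA, as in the issp_raw example
toy_clean <- toy %>%
  mutate(inc = na_if(inc, 98))

mean(toy_clean$inc, na.rm = TRUE)   # 4: only the real answers are averaged
```

A single unrecoded value is enough to pull the mean far outside the valid range of 1 to 10.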

Help documentation

To see the help page for the na_if() function, just run the following code and the help page will open in your browser.

help(na_if, package = "dplyr")

Recoding to NA - multiple vectors at once

What if there is more than one variable that has the numeric value of 98 which should be converted to NA? The summary() function is useful for getting an overview of the minimum and maximum values of the variables in a dataset. Usually, when working with a pre-existing dataset, like issp_raw, sociologists read the relevant documentation on the dataset, which specifies the range, labels and missing values of the variables. Since we have read the documentation of the issp_raw dataset, we know that the value 98 should be converted to NA. First, we’ll use summary() to get an indication of which variables have the value of 98.

summary(issp_raw)

##       idnr           day            month            year          gender     
##  Min.   :1002   Min.   : 1.00   Min.   :2.000   Min.   :2020   Min.   :1.000  
##  1st Qu.:2177   1st Qu.: 6.00   1st Qu.:3.000   1st Qu.:2020   1st Qu.:1.000  
##  Median :3175   Median :13.00   Median :3.000   Median :2020   Median :2.000  
##  Mean   :3206   Mean   :13.15   Mean   :3.315   Mean   :2020   Mean   :1.542  
##  3rd Qu.:4270   3rd Qu.:17.00   3rd Qu.:3.000   3rd Qu.:2020   3rd Qu.:2.000  
##  Max.   :5400   Max.   :31.00   Max.   :6.000   Max.   :2020   Max.   :2.000  
##                                                                               
##       age         pl_residence    ethnicity        yeareduc     marital_status 
##  Min.   :18.00   Min.   :1.00   Min.   : 1.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:37.00   1st Qu.:1.00   1st Qu.: 1.00   1st Qu.:12.00   1st Qu.:1.000  
##  Median :52.00   Median :3.00   Median : 1.00   Median :15.00   Median :1.000  
##  Mean   :50.52   Mean   :2.69   Mean   : 5.84   Mean   :14.23   Mean   :2.933  
##  3rd Qu.:64.00   3rd Qu.:4.00   3rd Qu.: 1.00   3rd Qu.:17.00   3rd Qu.:6.000  
##  Max.   :79.00   Max.   :5.00   Max.   :98.00   Max.   :30.00   Max.   :6.000  
##  NA's   :2       NA's   :7      NA's   :16      NA's   :11      NA's   :12     
##     inc_dec         hours_work       good_edu        rich_fam    
##  Min.   : 1.000   Min.   :  0.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.: 35.5   1st Qu.:2.000   1st Qu.:3.000  
##  Median : 6.000   Median : 38.0   Median :2.000   Median :4.000  
##  Mean   : 5.958   Mean   : 37.1   Mean   :2.489   Mean   :3.483  
##  3rd Qu.: 8.000   3rd Qu.: 42.0   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :10.000   Max.   :165.0   Max.   :8.000   Max.   :8.000  
##  NA's   :80       NA's   :361     NA's   :46      NA's   :47     
##   diff_income       scalenow         scalefam          class      
##  Min.   :1.000   Min.   : 1.000   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.: 5.000   1st Qu.: 4.000   1st Qu.:3.000  
##  Median :2.000   Median : 6.000   Median : 5.000   Median :4.000  
##  Mean   :2.364   Mean   : 5.792   Mean   : 5.258   Mean   :3.768  
##  3rd Qu.:3.000   3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:4.000  
##  Max.   :8.000   Max.   :10.000   Max.   :10.000   Max.   :8.000  
##  NA's   :15      NA's   :16       NA's   :19       NA's   :15     
##      edupay         religion       voting_elec    
##  Min.   :1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:2.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :3.000   Median : 1.000   Median : 3.000  
##  Mean   :2.544   Mean   : 5.928   Mean   : 9.522  
##  3rd Qu.:3.000   3rd Qu.:12.000   3rd Qu.: 7.000  
##  Max.   :8.000   Max.   :98.000   Max.   :98.000  
##  NA's   :12      NA's   :9        NA's   :238

We see that the variables ethnicity, religion and voting_elec all have a maximum value of 98, and we know that this value actually represents a missing value. We can recode these variables so that 98 is transformed to NA:

issp_raw <- issp_raw %>%
  mutate(across(c(ethnicity, religion, voting_elec), ~ na_if(., 98)))

In the code above, we overwrite the existing dataset issp_raw, and then, by using mutate() combined with the across() function, we specify a list of vectors c(ethnicity, religion, voting_elec) to be modified, and using the tilde symbol (~) and the na_if() function, we tell R to return NA if the value in the specified variables is 98.

It’s important to note that whatever datatype the value is, whether numeric like in our examples above, or character (example: “don’t know”), can be modified to NA.
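
A minimal sketch of the character case, using a made-up tibble (toy and its category labels are invented for illustration):

```r
library(dplyr)

# Made-up character variables where "Dont_know" marks a non-answer
toy <- tibble(
  class    = c("Working class", "Dont_know", "Middle class"),
  religion = c("Dont_know", "None", "Dont_know")
)

# Replace the text "Dont_know" with NA in both columns at once
toy_clean <- toy %>%
  mutate(across(c(class, religion), ~ na_if(., "Dont_know")))

# Count the missing values per column after recoding
colSums(is.na(toy_clean))
```

Note that the value to replace is written in quotation marks here, because it has to match the datatype of the variable being recoded.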

Help documentation

The across() function is another dplyr verb, intended to make it easy to apply the same transformation to multiple variables at the same time, and can be used with all the other dplyr verbs, like summarise() for example.
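
As a minimal sketch of across() inside summarise(), the chunk below computes the standard deviation of two columns in one call. The tibble toy is made up for illustration; only the column names age and yeareduc are borrowed from issp.

```r
library(dplyr)

# Made-up data with one missing value in each column
toy <- tibble(
  age      = c(20, 30, NA, 50),
  yeareduc = c(10, 12, 14, NA)
)

# Apply sd() with na.rm = TRUE to both columns at once
toy_sd <- toy %>%
  summarise(across(c(age, yeareduc), ~ sd(., na.rm = TRUE)))

toy_sd
```

This gives the same kind of result as writing one sd() line per variable, which becomes tedious as the number of variables grows.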

To see the help-page for the across() function, just run the following code and the help-page will open in your browser.

help(across, package = "dplyr")

4.1 Exercises

1. Some variables in the issp_raw dataset have a numeric value of 8 which should be recoded to NA. The variables are: diff_income, good_edu, rich_fam and edupay. Write the correct code so that the value 8 in these variables becomes NA.

issp_raw <- issp_raw %>%
  mutate(across(c(diff_income, good_edu, rich_fam, edupay), ~ na_if(., 8)))

"Remember the mutate() function combined with across()"

2. Using the issp_raw dataset, write the correct code to convert the value of 8 in the class variable to NA. Store the output in a new variable named class_clean.

issp_raw <- issp_raw %>%
  mutate(class_clean = na_if(class, 8))

... <- issp_raw %>%
  mutate(... = na_if(..., ...))

issp_raw <- issp_raw %>%
  mutate(... = na_if(class, 8))

3. In this exercise you will be using the issp dataset to recode the value “Dont_know” to NA across all variables that contain the value “Dont_know”…

a) … Use summary() and find out which variables you need to recode. There should be a total of seven variables.

summary(...)

b) … Use summary() to check out the variables class, inc_dec and religion. Modify the code below to obtain the information you need.

summary(...$...)

c) …Now that you know which variables to recode, write the correct code to recode the value “Dont_know” to NA across all seven variables.

issp <- issp %>%
  mutate(across(c(good_edu, edu_pay, diff_income, class, rich_fam, inc_dec, religion), ~ na_if(., "Dont_know")))

"Replace the dotted lines with the correct specifications to solve the exercise:"

... <- ... %>%
  mutate(across(c(..., ..., ..., ..., ..., ..., ...), ~ na_if(., ...)))

"Remember that 'Dont_know' is a value of the character datatype"

issp <- issp %>%
  mutate(across(c(..., ..., ..., ..., ..., ..., ...), ~ na_if(., "...")))
