

1 Multiple regression

It is plausible to assume that there are relevant variables that are not included in our model. For example, the relationship between working hours and years of education might vary by gender. In this case, we are talking about multiple regression instead of simple linear regression. As stated in BPS, chapter 29, multiple linear regression allows us to include several predictor variables that together explain the outcome variable of interest. In sociological research it is far more common to examine multiple explanatory variables, so that we can control for other factors.

Estimating a multiple regression model in R is straightforward. All we have to do is write a formula = that is a bit more complicated than for a simple linear regression. If we wanted to include gender as a second predictor in our model, we could simply type this command:

# Estimating a multiple regression model
m2 <- lm(formula = hours_work ~ yeareduc + gender,
         data = issp)

Notice that the formula = argument is now a bit different. We added gender as another predictor by using the + symbol. You can include as many predictors as you’d like - just separate each additional predictor with a + symbol. We also stored the results from this model in a new object named m2.

summary(m2)
## 
## Call:
## lm(formula = hours_work ~ yeareduc + gender, data = issp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.432  -1.780   1.231   4.623  53.927 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.3872     1.6482  20.863  < 2e-16 ***
## yeareduc       0.3371     0.1063   3.172 0.001580 ** 
## genderFemale  -3.3924     0.8684  -3.907 0.000103 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 712 degrees of freedom
## Multiple R-squared:  0.03081,    Adjusted R-squared:  0.02808 
## F-statistic: 11.32 on 2 and 712 DF,  p-value: 1.452e-05

We can see from the output that we now have a new row in the coefficients table with the slope estimate for genderFemale. On average, females work 3.3924 hours less per week than males, holding years of education constant. Note that gender is a factor variable with two levels: “Male” and “Female”. R automatically uses the first level of the factor as the reference category, which is absorbed into the intercept. If we wanted to change the reference category, we could use the relevel() function directly inside the lm() call:

lm(formula = hours_work ~ yeareduc + relevel(gender, ref = "Female"),
    data = issp)
## 
## Call:
## lm(formula = hours_work ~ yeareduc + relevel(gender, ref = "Female"), 
##     data = issp)
## 
## Coefficients:
##                         (Intercept)                             yeareduc  
##                             30.9948                               0.3371  
## relevel(gender, ref = "Female")Male  
##                              3.3924

As you can see, the reference level is now changed to “Female”, and we instead get a slope coefficient for the level Male. We could alternatively have coded gender as a numeric dummy variable; it would give exactly the same result, as the sketch below illustrates.
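A minimal sketch of the dummy-variable approach (assuming the issp data from before; the variable name female_dummy is ours):

# Recode gender into a 0/1 dummy variable (1 = Female, 0 = Male)
issp$female_dummy <- ifelse(issp$gender == "Female", 1, 0)

# Refitting the model with the dummy gives the same coefficient as genderFemale above
lm(formula = hours_work ~ yeareduc + female_dummy, data = issp)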

Predictions based on multiple variables

What if we wanted to know the average amount of working hours per week for females with 18 years of education? Just as before, when we predicted the average hours of work per week given a specific value of years of education, we simply add a specification for the gender variable to our code. Importantly, you must provide a value for every variable in the model; predict() will throw an error if any of them are missing from newdata.

# Creating a new dataframe with the correct specifications
woman_educ <- data.frame(yeareduc = 18,
                         gender = "Female")

# Predicting the mean outcome value given the specifications above
predict(m2, newdata = woman_educ)
##        1 
## 37.06204

As we can see from the output, the expected mean amount of working hours per week for a woman with 18 years of completed education is 37.06204. All we did to obtain this was to add another argument inside the data.frame() function, specifying which gender we wanted to predict for.
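If you want predictions for several combinations at once, newdata can contain more than one row. A minimal sketch (the object name profiles and the chosen values are ours):

# Predicting at several combinations of education and gender in one call
profiles <- data.frame(yeareduc = c(12, 18, 12, 18),
                       gender   = c("Male", "Male", "Female", "Female"))

predict(m2, newdata = profiles)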

1.1 Exercises

Regression with two groups

1. Using the abu89 dataset again, estimate a multiple regression model where wage_hour is the outcome variable, age is the first predictor and female is the second predictor. Store the results in an object named est2.

est2 <- lm(formula = wage_hour ~ age + female,
            data = abu89)
"Remember that you add more predictors by using the + symbol"
"Replace the dotted lines with the specified variables so that the model formula is correct"

est2 <- lm(formula = ... ~ ... + ...,
            data = abu89)

2. Check out the result of your model (est2) by using the summary() function.

summary(est2)

3. Use your model (est2) to predict the mean hourly wage for men (value = 0 in the female variable) at the age of 40 years old (variable: age).

a) Create a new dataframe named men containing the given specifications. Remember to type the name of your newly created dataframe on another line to make sure you did it correctly.

men <- data.frame(female = 0,
                  age = 40)

men
"Replace the dotted lines with the correct specifications"

men <- data.frame(female = "...",
                  age = ...)

b) Using your newly created dataset (men) and your model (est2), predict the mean hourly wage for men at age 40.

predict(est2, newdata = men)
"Replace the dotted lines with the name of your model and your new dataframe"

predict(..., newdata = ...)

4. Use your model (est2) to predict the mean hourly wage for women (value = 1 in the female variable) at the age of 40 years old.

a) Create a new dataframe named women containing the given specifications. Remember to type the name of your newly created dataframe on another line to make sure you did it correctly.

women <- data.frame(female = 1,
                     age = 40)

women

b) Using your newly created dataset (women) and your model (est2) predict the mean hourly wage for women at age 40.

predict(est2, newdata = women)

2 Check model assumptions

In this part we’ll show you how to do regression diagnostics, which refers to the process of checking whether the assumptions of linear regression are met in your model. Regression diagnostics are not easy and require a good amount of judgement. However, they are an important part of statistical inference, and an analysis is not complete without checking whether the assumptions are met.

This section describes how you can check the assumptions of your model. It does not cover everything you can do, but it introduces the basics. We will not go into detail about the assumptions of linear regression, as these are covered in BPS chapters 26 and 29. However, here is a quick reminder:

  • Linearity - There exists a linear relationship between the explanatory variable, X, and the outcome variable, Y.
  • Normality - The residuals are normally distributed.
  • Homoscedasticity - The residuals have constant variance at every level (value) of X.
  • Independence - The residuals are independent. Specifically, there should be no correlation between consecutive residuals in time series data; with cross-sectional data this assumption is normally met. In this course you’ll mainly be working with cross-sectional data, so this assumption is not covered here.
  • No influential outliers - This is not technically an assumption of linear regression, but it is closely related to the others. Influential outliers are outliers that have an impact on the regression line. Not all outliers will affect the regression line, but we do have to be aware of those that do.

We will show you how to do regression diagnostics using base R’s built-in plot() function, which you apply to an lm object after running an analysis. By default, R will then show you four diagnostic plots, one by one. You can also specify which plot you want to print using the which = argument followed by the corresponding plot number (see the help documentation and the following subsections for more).

The plots show the residuals in four different ways, and in each plot you will often see numbers next to some points. These numbers refer to specific row numbers (observations) that are flagged as potential outliers according to each plot’s criterion. We will address this again in a later subsection.
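The plot numbers used by the which = argument are: 1 = residuals vs fitted, 2 = normal Q-Q, 3 = scale-location, 4 = Cook’s distance, 5 = residuals vs leverage, 6 = Cook’s distance vs leverage. If you prefer to see the four default plots in one window, a small sketch is:

# Arrange the plotting window in a 2 x 2 grid and draw the default diagnostic plots
# (by default, plot() on an lm object shows plots 1, 2, 3 and 5)
par(mfrow = c(2, 2))
plot(m2)
par(mfrow = c(1, 1))  # reset the plotting window afterwards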

Help documentation

Note that plot() uses a different graphics engine than ggplot(), and the two cannot be combined. We use plot() only for quick plots here. plot() will produce some standard output depending on the type of object: similar to how predict() calls a sub-function, plot() will call plot.lm() when the object is a linear model.

To see the help-file you can try the following code:

help(plot.lm)

Checking linearity

The linearity assumption can be checked either by making a scatterplot and checking the correlation between the explanatory variable and the outcome variable (for continuous variables), or by using a residuals vs fitted plot. This plot shows whether there are non-linear patterns in the residuals of your model. If the residuals are equally spread around a horizontal line without any distinct patterns, it is a good indication that you don’t have a non-linear relationship.
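For the first approach, a minimal sketch might look like this (using the variables from our model and assuming the issp data is loaded as before):

# Correlation between years of education and working hours (dropping missing values)
cor(issp$yeareduc, issp$hours_work, use = "complete.obs")

# Quick scatterplot of the raw relationship
plot(issp$yeareduc, issp$hours_work,
     xlab = "Years of education", ylab = "Working hours per week")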

Let’s illustrate the residuals vs fitted approach using our multiple regression model from before, m2. To make a quick-and-dirty plot, we can use the function plot().

For linear models, there are six automatic plots, but in this case we are only interested in the first of these. Thus, we specify which = 1 as well:

# Plotting a residuals vs fitted plot
plot(m2, which = 1)

The plot shows no clear non-linear patterns - there is no curvature in the line, and the residuals are fairly equally spread along a horizontal line. This is a good indication that we do not have any issues with non-linearity in our model.

Extra: Doing the same with ggplot()

You have previously learned to use ggplot() for making proper graphics. If you would like a nicer residual plot, it can also be made with ggplot(), and you can then use all the skills you have learned earlier to produce publication-quality graphics. ggplot() mainly plots data stored in a data.frame, but it can also extract what it needs from a regression object like the one we have here. The object m2 stores residuals, fitted values and a lot of other quantities that can be used when needed. Residuals and fitted values can be extracted by putting .fitted as the x-variable and .resid as the y-variable inside aes(). To recreate the plot, use geom_point() for the points and geom_smooth() for a fitted line, as you have learned earlier:

ggplot(m2, aes(x = .fitted, y = .resid)) +
  geom_point()+
  geom_smooth(method = "lm", se = F)
## `geom_smooth()` using formula 'y ~ x'

For residual plots, the quick fix using plot() will typically be enough, as such plots are mainly for yourself to inspect and will usually not find their way into a published report.

Checking normality of residuals

The residuals should also be normally distributed. We can check this assumption by plotting a histogram of the residuals and/or creating a quantile-quantile-plot.

Histogram of the residuals

We start off by plotting a histogram of the residuals using the ggplot2 package. As explained above, we extract the residuals by specifying .resid as follows:

# Plotting a histogram of residuals 
ggplot(m2, aes(x = .resid)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  labs(title = "Histogram of residuals", x = "Residuals", y = "Frequency")

Here, we have also added fill and outline colors, a title, and labels for the x- and y-axes.

Quantile-quantile plot

The qq-plot shows the quantiles of the residuals plotted against the theoretical quantiles of a normal distribution. We skip the details here, but if the residuals are normal, the points should fall along a straight line.

We can create a qq-plot by using plot() on the regression object and setting the which = argument to 2. If the residuals follow the straight line well, this is a good indication that they are normally distributed. However, if they deviate severely from the line, we could have issues with non-normality of the residuals:

# Plotting a qq-plot
plot(m2, which = 2)

What do you think of the plot? Do the residuals look normally distributed? To us, this plot suggests that we could have issues with non-normality: the residuals deviate quite severely from the line at the beginning and at the end.

Extra: quantile-quantile plot with ggplot()

For the sake of completeness, here is the code for making a quantile-quantile plot with ggplot(). Note that inside aes() you need to specify sample = .resid. One geom draws the qq points and another draws the reference line.

ggplot(m2, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()

The two plots look similar, but the ggplot() version is easier to polish if needed. Again: the quick version using plot() will usually do.

Checking homoscedasticity

Another assumption about the residuals is homoscedasticity, i.e. that the residuals should have constant variance at all values of the explanatory variables. We can check whether this assumption is met by plotting a scale-location plot (also called a spread-location plot). If the residuals are equally (randomly) spread along a horizontal line with no distinct patterns, it is a good indication that we have homoscedasticity. We can obtain this plot by using the plot() function and setting the which = argument to 3:

# Plotting a scale-location plot 
plot(m2, which = 3)

What do you think? Does the plot show equal variance?

To do this and the next plot in ggplot() requires a bit more work, so we leave that out for now.

Checking for influential outliers

Finally, we can check for influential outliers, i.e. outliers that have an impact on the regression line, by plotting a residuals vs leverage plot. Not all outliers are influential, meaning that the results would stay the same whether we included or excluded them from the analysis. However, if the results change depending on whether we keep or drop an observation, that observation is defined as influential.

We can interpret the plot by looking out for outlying values at the lower or upper right corner. These are the spots where observations may be influential to the regression line. We should be looking for observations that fall outside of a red, dashed line. When observations are located outside the dashed lines, they are usually influential to the regression results. This essentially means that the results would be different if we dropped those observations from the analysis.

By using the plot() function, and specifying the which = argument to number 5, we can plot a residuals vs leverage plot:

# Plotting a residuals vs leverage plot
plot(m2, which = 5)

What do you think? Do there appear to be any influential outliers? Since we cannot even see the red, dashed lines, there are no influential outliers here. Even though observations 1216, 83 and 17 are marked as outliers, they do not fall outside any red, dashed line and are therefore not influential to the regression results.

What now?

The keen observer may have noticed that some observations are marked in each plot, notably observations 1216 and 17. When this happens it may be useful to take a closer look at these cases individually. Is there anything remarkably different about these observations? Could they be caused by data entry errors?
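One way to take a closer look is to pull out the flagged rows and, for example, compare their Cook's distances with the rest. A minimal sketch (assuming the plotted numbers match the default row names of issp; the selected columns are just examples):

# Inspect the flagged observations by row number
issp[c(17, 83, 1216), c("hours_work", "yeareduc", "gender")]

# Cook's distance for every observation in the model; large values indicate influence
head(sort(cooks.distance(m2), decreasing = TRUE))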

If you do notice patterns in the residuals, what to do next is not straightforward. You should go back to your theory: for example, is it reasonable to assume a linear relationship between the predictors and your outcome? Maybe there are important variables missing from your model?

In closing this section on regression diagnostics, we would like to emphasize that the methods described here, i.e. diagnostic plots of residuals, do not give the full picture of whether the assumptions of linear regression are met. As stated in BPS: “(…) regression diagnostics is a subject that could fill an entire book” (chapter 29, p. 50). Obviously, we are not able to cover the full topic here. However, the methods described in this section are a good place to start and are sufficient for the purposes of this course.

2.1 Exercises

1. Using your model (est2) and the plot() function, check whether the assumption of linearity is met by plotting a residual vs fitted plot.

plot(est2, which = 1)
"To obtain the residuals vs fitted plot, remember to specify the `which =` argument to number 1"
"Replace the dotted line with the name of your model (lm object)"

plot(..., which = 1)

2. Using your model (est2) check whether the assumption of normally distributed residuals is met…

a) …by plotting a histogram of the residuals with the following specifications in ggplot(). Store the plot in an object named hist_res. Remember to type the name of the object on another line to print it. (Hint: Remember that the dataset you’re using is abu89.)

  • Fill: “coral”
  • Color: “black”
  • Title: “Histogram of residuals”
  • Label x-axis: “Residuals”
  • Label y-axis: “Frequency”
hist_res <- ggplot(data = abu89, aes(x = est2$residuals)) +
  geom_histogram(fill = "coral", color = "black") +
  labs(title = "Histogram of residuals", x = "Residuals", y = "Frequency")

hist_res
"Replace the dotted lines with the correct dataset, variables and corresponding argument specifications"

hist_res <- ggplot(data = .., aes(x = ..$..)) +
  geom_histogram(fill = "..", color = "..") +
  labs(title = "..", x = "..", y = "..")
" Replace the dotted lines with the correct specifications to solve the exercise"

hist_res <- ggplot(data = abu89, aes(x = est2$residuals)) +
  geom_histogram(fill = "..", color = "..") +
  labs(title = "..", x = "..", y = "..")

b) …by plotting a qq-plot using the plot() function.

plot(est2, which = 2)
"To obtain the qq-plot, remember to specify the `which =` argument to number 2"

3. Using your model (est2) and the plot() function, check whether the assumption of homoscedasticity (i.e residuals have constant variance) is met by plotting a scale-location plot.

plot(est2, which = 3)
"To obtain the scale-location plot, remember to specify the `which =` argument to number 3"

4. Using your model (est2) and the plot() function, check whether there are any influential outliers by plotting a residuals vs leverage plot.

plot(est2, which = 5)
"To obtain the residuals vs leverage plot, remember to specify the `which =` argument to number 5"

3 Interaction

Sometimes there are reasons to assume that the effect of an explanatory variable on the outcome changes depending on the value of another variable. For example, there are reasons to assume that the relationship between working hours and years of education is not the same for both genders, e.g. that although working hours increase with years of education, that increase might differ for men and women. In fact, the “effect” of education could hold only for one of the sexes - or even go in the opposite direction.

Interaction terms allow us to estimate slope parameters that vary by another variable in the model. In this section we’ll look at interactions and how to include an interaction in a regression model in R.

Including an interaction in a regression model

Let’s say we wanted to include an interaction between yeareduc and gender in our model to check whether the effect of years of education on working hours depends on whether the respondents are male or female. In Norway, women tend to outdo men in higher education, so maybe the difference in working hours can be explained by women having more years of education than men? All we have to do is add an interaction term to the formula = argument.

To estimate a model of the form \(y = \alpha + \beta_1 \, yeareduc + \beta_2 \, gender + \beta_3 \, (yeareduc \times gender)\), we add each term to the formula =.

m3 <- lm(formula = hours_work ~ yeareduc + gender + yeareduc*gender,
         data = issp)

Notice that we simply add yeareduc*gender to the formula used in the previous model. By combining two variables using the * symbol, we create an interaction term.
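Note that in R’s formula syntax, yeareduc*gender by itself already expands to the two main effects plus their interaction, so a shorter, equivalent specification is also possible (a minimal sketch; the object name m3_alt is ours):

# a*b in a formula expands to a + b + a:b, so this fits the same model as m3
m3_alt <- lm(formula = hours_work ~ yeareduc*gender, data = issp)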

Again, we can obtain the result of our new model using the summary() function:

summary(m3)
## 
## Call:
## lm(formula = hours_work ~ yeareduc + gender + yeareduc * gender, 
##     data = issp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.690  -1.690   1.084   4.803  52.893 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            35.9767     2.2483  16.001   <2e-16 ***
## yeareduc                0.2261     0.1506   1.501   0.1337    
## genderFemale           -6.6673     3.2684  -2.040   0.0417 *  
## yeareduc:genderFemale   0.2209     0.2125   1.039   0.2990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 711 degrees of freedom
## Multiple R-squared:  0.03228,    Adjusted R-squared:  0.02819 
## F-statistic: 7.905 on 3 and 711 DF,  p-value: 3.423e-05

As we can see from the output, we still get slope coefficients for yeareduc and genderFemale, but in addition we get a coefficient for the interaction term (yeareduc:genderFemale). In an interaction model, the coefficients for yeareduc and genderFemale are interpreted relative to the reference categories:

  • yeareduc: for each additional year of education, the amount of working hours increases by 0.2261 for males (the reference group).
  • genderFemale: females work 6.6673 hours less than males when years of education is held at zero.

The interpretation of the interaction term is a bit different: Simply put, the coefficient for the interaction term tells us that the effect of years of education on working hours per week is 0.2209 higher for females than for males (reference group). However, the interaction term is not significant, meaning that we cannot say that there is an actual interaction effect in the population - it might be due to some random variation in this particular sample.
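To get the implied education slope for females, add the two coefficients together. A quick way is to extract them with coef(), using the coefficient names from the output above:

# Education slope for females = main effect of yeareduc + interaction term
# With the rounded estimates above this is roughly 0.2261 + 0.2209 = 0.447
coef(m3)["yeareduc"] + coef(m3)["yeareduc:genderFemale"]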

Visualizing interactions with ggplot2

It is often useful to visualize an interaction. An interaction with a binary moderator results in two regression lines, and if there is an interaction effect, the two lines will differ from each other - for example, one of them may be much steeper than the other.

We can do this by using the ggplot2 package, combined with the geom geom_smooth(), which you are already familiar with:

# Plot with two regression lines 
ggplot(issp, aes(x=yeareduc, y=hours_work, color=gender)) +
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Years of education",
       y = "Working hours per week",
       color = "Moderator") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

However, we usually want to clearly communicate as much information as possible, and combining a scatterplot with regression lines for both genders may be more useful:

# Scatterplot combined with two regression lines
ggplot(issp, aes(x=yeareduc, y=hours_work, color=gender)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Years of education",
       y = "Working hours per week",
       color = "Moderator") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Compared with the previous plot, which only included the regression lines, this plot communicates more clearly that gender does not substantially moderate the relationship between years of education and working hours per week.

3.1 Exercises

1. Does gender moderate the effect of age on hourly wages? Use the dataset abu89, and…

a) …Create a scatterplot of the relationship between age (age) and hourly wages (wage_hour) using the geom geom_jitter() to help with the overplotting. Make the points somewhat transparent by setting the alpha = argument to 0.3, use the color = argument inside aes() to group the plot by the variable female, and add the theme theme_minimal(). Save the plot in an object named p. Remember to type the name of the object on another line to print it. Based on the plot, does it seem like gender moderates the relationship between age and hourly wages?

p <- ggplot(abu89, aes(x=age, y=wage_hour, color=female)) +
  geom_jitter(alpha = 0.3) +
  theme_minimal()

p
"Replace the dotted lines with the correct variables and specifications:"

p <- ggplot(abu89, aes(x=..., y=..., color=...)) +
  geom_jitter(alpha = ...) +
  theme_...()

b) Using your code from the previous exercise, add regression lines to the plot using geom_smooth(). Remember to set the method = argument to the correct method, and remove the confidence intervals from the regression lines by setting the se = argument to FALSE. Is the effect of age on hourly wages different for women and men?

p <- ggplot(abu89, aes(x=age, y=wage_hour, color=female)) +
  geom_jitter(alpha = 0.3) +
   geom_smooth(method = "lm", se = FALSE) +
   theme_minimal()

p
"Replace the dotted lines with the correct specifications:"

p <- ggplot(abu89, aes(x=age, y=wage_hour, color=female)) +
  geom_jitter(alpha = 0.3) +
   geom_...(method = "...", se = ...) +
   theme_minimal()

c) Estimate a regression model where you include an interaction between age and gender (age*female). Store the result in an object named est3.

est3 <- lm(wage_hour ~ age + female + age*female, data = abu89)
"Remember that you use the * symbol to make an interaction between two variables"

d) Print the result of the regression model est3, and reflect on what each of the coefficients and their corresponding p-values tell you.

summary(est3)
"The summary() function may be very useful!"

e) Based on the model (est3) with the interaction term: How much does the hourly wage increase for each additional year of age for women? Type in the correct mathematical formula to calculate this answer. (Hint: Use the coefficients).

0.5907-0.2551
"Remember that women are not in the reference group"
"Since women are not in the reference group, you'll have to use the coefficients for 'age' and 'age:female' in the formula"

4 Put it all in a table

So far we’ve run three different models in this session when trying to understand the effect of years of education on the amount of working hours per week. We started with a simple model with only the outcome and the explanatory variable (m1), then we estimated a multiple regression model where we included gender as another predictor (m2) and finally we estimated a model containing the interaction between years of education and gender (m3). This process is very common when doing analyses.

However, we also need to turn the models into publication-ready tables so that we can use them in our assignments, papers, or theses. It does not look good to simply screenshot the output directly from R. We want a nice table output for our models, and we want them combined in a single table. We can achieve this with the functions tbl_regression() and tbl_merge() from the package gtsummary. We also use the package flextable to save the table to our working folder.

Modifying the appearance of tables

To be able to merge together several regression models into a single table, we first have to generate each of the models as gtsummary objects and do some modifications to each of them.

As you can see from the help documentation on tbl_regression() (scroll down to the section: “help documentation”), there are a lot of different functions you can use to modify the appearance of the tables. In this session, we’ll focus on the simple things and for those of you who are interested, check out the help documentation and try it out yourself in R.

Let’s begin with our first model, m1, to show you the default output of tbl_regression():

# Default output from tbl_regression()
m1 %>%
  tbl_regression()
Characteristic    Beta    95% CI¹        p-value
yeareduc          0.29    0.08, 0.50     0.007

¹ CI = Confidence Interval

In the code above, we send the model m1 through tbl_regression() by using the pipe operator %>% from the magrittr package (included in tidyverse).

From the output we can see the coefficient for the explanatory variable, yeareduc, with its corresponding 95% confidence interval and p-value. However, the intercept is not shown by default. We can include the intercept in the table by setting the intercept = argument to TRUE:

# Including the intercept in the table
m1 %>%
  tbl_regression(intercept = TRUE)
Characteristic    Beta    95% CI¹        p-value
(Intercept)       33      30, 37         <0.001
yeareduc          0.29    0.08, 0.50     0.007

¹ CI = Confidence Interval

There we go! Now the intercept is also shown in the table. We would also like to modify the labels of the variables so that the table does not just show the raw variable names. We can do this with the label = argument:

# Modifying label names of variables
m1 %>%
  tbl_regression(intercept = TRUE, label = list(yeareduc = "Years of education"))
Characteristic        Beta    95% CI¹        p-value
(Intercept)           33      30, 37         <0.001
Years of education    0.29    0.08, 0.50     0.007

¹ CI = Confidence Interval

If we wanted to highlight the variables, we could either bold the labels of the variables by adding the bold_labels() function, or we could use the italicize_labels() function to italicize the variable labels:

# Bold labels of variables
m1 %>%
  tbl_regression(intercept = TRUE, label = list(yeareduc = "Years of education")) %>%
  bold_labels()
Characteristic        Beta    95% CI¹        p-value
(Intercept)           33      30, 37         <0.001
Years of education    0.29    0.08, 0.50     0.007

¹ CI = Confidence Interval

# Italicize labels of variables
m1 %>%
  tbl_regression(intercept = TRUE, label = list(yeareduc = "Years of education")) %>%
  italicize_labels()
Characteristic        Beta    95% CI¹        p-value
(Intercept)           33      30, 37         <0.001
Years of education    0.29    0.08, 0.50     0.007

¹ CI = Confidence Interval

Preparing each of the models

Now that we have shown you some of the basic formatting functions to use with tbl_regression(), we can move on to preparing each of our three models as gtsummary objects. All we have to do is set up each model with the lm() function, send it through tbl_regression(), and store the result in an object (below we call them t1, t2 and t3). In the code below we do this for all three models and make some modifications to each of them:

# Model 1: simple linear regression
t1 <-
   lm(formula = hours_work ~ yeareduc, data = issp) %>%
  tbl_regression(intercept = TRUE, label = list(yeareduc = "Years of education")) %>%
  bold_labels()

t1
Characteristic        Beta    95% CI¹        p-value
(Intercept)           33      30, 37         <0.001
Years of education    0.29    0.08, 0.50     0.007

¹ CI = Confidence Interval

# Model 2: Multiple regression, including gender
t2 <-
    lm(formula = hours_work ~ yeareduc + gender, data = issp) %>%
  tbl_regression(intercept = TRUE, label = list(yeareduc = "Years of education", gender = "Gender")) %>%
  bold_labels() %>%
  italicize_levels()

t2
Characteristic        Beta    95% CI¹        p-value
(Intercept)           34      31, 38         <0.001
Years of education    0.34    0.13, 0.55     0.002
Gender
    Male
    Female            -3.4    -5.1, -1.7     <0.001

¹ CI = Confidence Interval

# Model 3: Interaction term, yeareduc*gender
t3 <-
    lm(formula = hours_work ~ yeareduc*gender, data = issp) %>%
  tbl_regression(intercept = TRUE, label = list(yeareduc = "Years of education", gender = "Gender")) %>%
  bold_labels() %>%
  italicize_levels()

t3
Characteristic                     Beta    95% CI¹        p-value
(Intercept)                        36      32, 40         <0.001
Years of education                 0.23    -0.07, 0.52    0.13
Gender
    Male
    Female                         -6.7    -13, -0.25     0.042
Years of education * Gender
    Years of education * Female    0.22    -0.20, 0.64    0.3

¹ CI = Confidence Interval

As you can see from the output of t2 and t3, we use the function italicize_levels() to italicize the levels of categorical variables - in our case, the gender variable. You can also see in the output from t3 that when we specify the variable labels, the new labels automatically appear in the interaction term as well!

Merging the models together

Now that we have stored each of the models as gtsummary objects, we can merge them together in a single table output using the tbl_merge() function:

merged <- tbl_merge(tbls = list(t1, t2, t3), tab_spanner = c("**Model 1**", "**Model 2**", "**Model 3**"))
 
merged
                                   Model 1                      Model 2                      Model 3
Characteristic                     Beta  95% CI¹      p-value   Beta  95% CI¹      p-value   Beta  95% CI¹       p-value
(Intercept)                        33    30, 37       <0.001    34    31, 38       <0.001    36    32, 40        <0.001
Years of education                 0.29  0.08, 0.50   0.007     0.34  0.13, 0.55   0.002     0.23  -0.07, 0.52   0.13
Gender
    Male
    Female                                                      -3.4  -5.1, -1.7   <0.001    -6.7  -13, -0.25    0.042
Years of education * Gender
    Years of education * Female                                                              0.22  -0.20, 0.64   0.3

¹ CI = Confidence Interval

In the code above, we specify which gtsummary objects to merge by setting the tbls = argument, and then we use the tab_spanner = argument to give labels to each of the models.

As you can see, the results for all the models are now merged together side-by-side into a single table!

Help documentation

To see the help-pages for the functions tbl_regression() and tbl_merge(), just run the following codes and the help-pages will open in your browser.

help(tbl_regression, package = "gtsummary")
help(tbl_merge, package = "gtsummary")

Saving tables with models

After we’ve created the table outputs for our models with tbl_regression(), we also need to save them to our working folder. We can do this with the function as_flex_table() from the gtsummary package in combination with save_as_docx() from the flextable package. We want to save the table that merged all of our models, merged. The following code saves the merged table as a Word document in our working folder:

# Saving the table as a word file to our working folder 
merged %>%
  as_flex_table() %>%
  save_as_docx(path = "output.docx")

In the code above we take the object merged, send it through as_flex_table() to convert it into a flextable object, and finally send that output through save_as_docx() to save the table to our working folder. The path = argument specifies the file name (output) and the file extension (.docx), so the table is saved as a Word file.

Help documentation

To see the help-pages for the functions as_flex_table() and save_as_docx(), just run the following codes and the help-pages will open in your browser.

help(as_flex_table, package = "gtsummary")
help(save_as_docx, package = "flextable")

4.1 Exercises

In the exercises in this session, you’ve made three different models (est1, est2, est3) for investigating the relationship between age and hourly wages. Now you’re going to merge the results from these models into a single table and save the merged table to your working folder. Remember that the formulas for your three regression models are as follows:

  • est1 = lm(formula = wage_hour ~ age, data = abu89)
  • est2 = lm(formula = wage_hour ~ age + female, data = abu89)
  • est3 = lm(formula = wage_hour ~ age*female, data = abu89)

Note that the packages gtsummary and flextable are loaded in your workspace.

1. First, you’ll have to transform your models into gtsummary objects. Hint: check out the examples in paragraph 7.2.

a) Transform your model est1 to a gtsummary object. Specify the regression model, then use the function tbl_regression() to put the results into a table. Include the intercept by setting the intercept = argument to TRUE, label the variable age as “Age” with the label = argument, and highlight the variable names in bold using the bold_labels() function. Store it in an object named tab1. Remember to type the name of the object on another line to print the output.

tab1 <-
   lm(formula = wage_hour ~ age, data = abu89) %>%
  tbl_regression(intercept = TRUE, label = list(age = "Age")) %>%
  bold_labels()

tab1
"Remember to use the pipe operator: %>%"
"Finish the code below to solve the exercise"

tab1 <-
   lm(formula = ... ~ ..., data = ...) %>%
  tbl_regression(intercept = ..., label = list(... = "...")) %>%
  bold_labels()

tab1

b) Transform your model est2 to a gtsummary object. Specify the regression model, then use the function tbl_regression() to put the results into a table. Include the intercept by setting the intercept = argument to TRUE, label the variable age as “Age” and the variable female as “Gender” with the label = argument, highlight the variable names in bold using the bold_labels() function, and italicize the levels using the italicize_levels() function. Store it in an object named tab2. Remember to type the name of the object on another line to print the output.

tab2 <-
   lm(formula = wage_hour ~ age + female, data = abu89) %>%
  tbl_regression(intercept = TRUE, label = list(age = "Age", female = "Gender")) %>%
  bold_labels() %>%
   italicize_levels()

tab2
"Finish the code below to solve the exercise"

tab2 <-
   lm(formula = ... ~ ... + ..., data = ...) %>%
  tbl_regression(intercept = ..., label = list(... = "...", ... = "...")) %>%
  bold_...() %>%
    italicize_...()

tab2

c) Transform your model est3 to a gtsummary object. Specify the regression model, then use the function tbl_regression() to put the results into a table. Include the intercept by setting the intercept = argument to TRUE, label the variable age as “Age” and the variable female as “Gender” with the label = argument, highlight the variable names in bold using the bold_labels() function, and italicize the levels using the italicize_levels() function. Store it in an object named tab3. Remember to type the name of the object on another line to print the output.

tab3 <-
   lm(formula = wage_hour ~ age*female, data = abu89) %>%
  tbl_regression(intercept = TRUE, label = list(age = "Age", female = "Gender")) %>%
  bold_labels() %>%
   italicize_levels()

tab3
"Finish the code below to solve the exercise"

... <-
   lm(formula = ... ~ ...*..., data = ...) %>%
  tbl_regression(intercept = ..., label = list(... = "...", ... = "...")) %>%
  bold_...() %>%
    italicize_...()

tab3

2. Now that you have created tables for each of your models and transformed them to gtsummary objects, merge the tables into a single table using the tbl_merge() function. Using the tab_spanner = argument, give tab1, tab2 and tab3 the corresponding labels “Bivariate”, “Multiple” and “Interaction”. Save the table in an object named full_tab. Remember to type the name of the object on another line to print the output.

full_tab <- tbl_merge(tbls = list(tab1, tab2, tab3), tab_spanner = c("**Bivariate**", "**Multiple**", "**Interaction**"))

full_tab
"Finish the code below to solve the exercise - replace the dotted lines with the correct variables/specifications:"

full_tab <- tbl_merge(... = list(..., ..., ...), tab_spanner = c("**...**", "**...**", "**...**"))

full_tab

3. Now that you’ve merged the tables into a single one, write the code to save the merged table as a Word file to your working folder by using the functions as_flex_table() and save_as_docx(). Name the file: reg_output.docx

full_tab %>%
  as_flex_table() %>%
  save_as_docx(path = "reg_output.docx")
"Replace the dotted lines with the correct specifications to solve the exercise:"

full_tab %>%
  ...() %>%
  ...(path = "...")

Published: 01 September, 2021