

1 Introduction

In this session you will learn how to perform linear regression in R. You’ll learn how to estimate both bivariate and multiple regression models, and how to obtain predicted means for specific values of the explanatory variable. We will also cover how to estimate a linear probability model, how to run diagnostics on your models (i.e. to check that the assumptions of linear regression are met), and how to obtain and interpret a residual plot. We will also show you how to estimate regression models with interactions in R, and finally, you’ll learn how to export your regression models into publication-ready tables and save them to your working folder.

Curriculum

The curriculum for this session is chapters 5, 26 and 29 in Basic Practice of Statistics (BPS). Chapter 29, on multiple regression, can be found online here.

We highly recommend that you read those chapters before diving into the exercises and examples in this session. We won’t go into detail about the theoretical and technical assumptions behind linear regression, or about interpretation, as these are covered in the curriculum. Instead, we will show you how to estimate linear regression models in R in practice.

In some of the exercises, we have also included quiz questions for the purpose of repetition. To answer these questions, you will need to have read the curriculum.

Packages

gtsummary, flextable, tidyverse, ggplot2

Data

We will be using the issp and ess8 datasets for the examples, and you will be using the dataset abu89 to solve the exercises in this session.

2 Simple linear regression

The base R function lm() estimates a linear regression model, which is used to assess the relationship between a continuous response variable and one or more explanatory variables that are assumed to have a linear relationship with it. The lm() function uses Ordinary Least Squares (OLS) as its estimation method.

By reading the function’s help documentation (see the Help documentation section below for how to look it up), you’ll see that there are a lot of different arguments that can be used. However, for the purposes of this course, there are two important arguments to note: formula = and data =.

  • formula =: refers to a formula which specifies the regression model. For simple linear regression models, where you have a single explanatory (predictor) variable in addition to an intercept term, this formula takes the form outcome ~ predictor in R. As you can see, the outcome variable and the predictor variable are separated by the tilde symbol ~. (This is equivalent to how textbooks typically write a regression formula as \(y = a + b x\) or \(y = \alpha + \beta x\), except that the constant, \(\alpha\), is not written out in the R formula; the intercept is included automatically.) You can also have more complicated formulas, which we will describe later (regression with two or more predictors, and interactions).

  • data =: here you should include the dataframe (dataset) that contains the variables.

Since the output of the lm() function contains quite a lot of technical information, and this information is used by other functions, we generally create an object which stores the results of our regression. With this in mind, assume that we want to estimate a regression of the relationship between years of education and the number of working hours per week using the issp dataset. The command we would type to obtain this result is the following:

# Simple linear regression
m1 <- lm(formula = hours_work ~ yeareduc, 
         data = issp)

In our code snippet above, our formula is hours_work ~ yeareduc: the variable hours_work is our outcome variable, and yeareduc is our explanatory (or predictor) variable. The command above creates an lm object called m1. If we wanted to print the results of our estimated model, we could type:

# Print the results of the model
summary(m1)
## 
## Call:
## lm(formula = hours_work ~ yeareduc, data = issp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.051  -1.301   1.235   4.381  55.245 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.3237     1.6418  20.298  < 2e-16 ***
## yeareduc      0.2864     0.1065   2.688  0.00735 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.61 on 713 degrees of freedom
## Multiple R-squared:  0.01003,    Adjusted R-squared:  0.008645 
## F-statistic: 7.226 on 1 and 713 DF,  p-value: 0.007353

There we have it! As you can see from the output, the summary() function provides quite a lot of information. First of all, R reminds us of the formula we used to estimate the model. The second part consists of information on the residuals.

However, the more interesting piece of information for our purposes is the third part. These are the estimates of main interest: the regression coefficients. Here R gives us the intercept (33.3237) and the slope coefficient (0.2864) for yeareduc. Substantially, this means that there is a positive relationship between years of completed education and the number of working hours per week. The intercept tells us that with 0 years of completed education, the average number of working hours is 33.3237, which it would be reasonable to round to 33.3 hours. For each additional year of completed education, the number of working hours per week increases by 0.2864 hours on average.

The output also gives the standard errors, t-values and p-values of the intercept and the slope. Then follows a legend for the significance levels, where each level is indicated by a corresponding symbol. The two asterisks next to the slope coefficient tell us that the estimate is statistically significant at the 0.01 level (p = 0.00735). This basically means that the estimate is unlikely to be due to random variation alone.

At the bottom we get some more information: the residual standard error (11.61) and the degrees of freedom (713). We also get the R-squared (0.01003) and the adjusted R-squared (0.008645). Finally, the output reports the F-statistic (7.226) with its corresponding p-value (0.007353).
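If you ever need one of these statistics on its own, note that summary() returns a list whose components can be extracted with $. A minimal sketch, using component names from base R’s summary.lm() output:

# Store the summary object and pull out individual statistics
s1 <- summary(m1)
s1$r.squared      # Multiple R-squared
s1$adj.r.squared  # Adjusted R-squared
s1$sigma          # Residual standard error
s1$coefficients   # The coefficient table as a matrix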

A note on lm():

The lm() function requires two arguments: formula and data. Each argument can be named explicitly, as formula = followed by data =. However, you do not have to type the names: R matches unnamed arguments by position, so the first argument is taken as the formula and the second as the data.

Thus, this model: lm(formula = hours_work ~ yeareduc, data = issp),

will give identical results as this model: lm(hours_work ~ yeareduc, data = issp),

as well as this model: lm(hours_work ~ yeareduc, issp)

This logic applies to all R functions, and you can avoid a lot of typing (and issues with typos!) this way. But it requires that you know what you are doing, as it is easy to make mistakes. Until you are experienced and thoroughly familiar with R, it is good practice to be explicit and name each argument.

Help documentation:

To see the help page for linear models, just run the following code, and the help page will open in your browser.

help(lm)

2.1 Exercises

1. Is there a relationship between hourly wages and age? Using the abu89 dataset, estimate a simple linear regression model with wage_hour as your outcome variable and age as your explanatory variable. Store the result in an object named est1. Replace the dotted line with the correct arguments.

est1 <- lm(formula = ... ~ ..., 
           data = ...)

"Remember the correct formula: outcome ~ predictor"
"Replace the dotted lines with the correct variables and dataset"

est1 <- lm(formula = wage_hour ~ age, 
           data = abu89)

2. Write the correct code to check out the results of your model.

"The summary() function will help you!"

summary(est1)

3 Calculating predicted values

As stated in BPS, chapter 26, one possible purpose of estimating a linear regression is to predict the outcome value on the basis of a particular value of the explanatory variable (BPS, p. 616). Regression also has other uses beyond what is covered in this course.

Let’s say we want to know the average number of working hours when years of completed education is 16. If the regression model is \(y = a + bx\), we substitute \(a\) and \(b\) with the estimated regression parameters so that it becomes \(\hat{y} = 33.32 + 0.29x\). To calculate the prediction for \(x = 16\), we just substitute 16 for \(x\).

This can be calculated “manually” in R thus:

33.32 + 0.29 * 16
## [1] 37.96

For simple models and for one value, this is easy to do manually. However, it is not convenient if the model is complex or you need to calculate predicted estimates at many X-values.

The predict() function

We can use the predict() function to do this in R. Note that the printed output from a regression model rounds off decimals, while predict() uses all decimals, whether shown in the output or not. Thus, you might not get exactly the same result as when calculating manually.
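If you want a manual calculation to agree with predict() exactly, a small sketch: extract the unrounded coefficients with the base R function coef() and use those instead of the rounded numbers from the printout.

# Extract the unrounded coefficients from the model object
b <- coef(m1)

# Manual prediction at x = 16 using full precision; this matches predict()
b[1] + b[2] * 16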

To use predict(), we have to start off by creating a new dataframe which contains the values of our explanatory variable at which we want to predict the outcome. Therefore, we start by creating a dataframe which contains the value 16, using the data.frame() function you already know:

# Creating a new dataframe which contains the value (16) of the explanatory variable we want to predict the outcome at
sixteen_yeareduc <- data.frame(yeareduc = 16)

After we have created this dataframe, we can then use predict() to predict the expected amount of working hours per week when the years of completed education is 16:

# Predicting the outcome value
predict(m1, newdata = sixteen_yeareduc)
##        1 
## 37.90547

In the code above we first type the name of our model, m1, which predicts the number of working hours on the basis of years of education. This also illustrates why we usually store the results of a regression model in an object: we often use the output in later functions! We then use the newdata = argument and pass it the name of our newly created dataset, sixteen_yeareduc. The output tells us that the expected number of working hours per week when completed years of education equals 16 is 37.90547.

If we wanted to know the predicted outcome for several values of our explanatory variable, we could simply include more values in our new dataset:

# Creating a new dataframe containing several values to predict at
new_yeareduc <- data.frame(yeareduc = c(10,13,16,18))

Notice how we had to use the c() function since there are multiple values we want to predict the outcome for. After we have created this dataset, we can simply reuse the code from above, replacing the name of the dataset with the new one:

# Predicting the outcome values
predict(m1, newdata = new_yeareduc)
##        1        2        3        4 
## 36.18730 37.04639 37.90547 38.47820

Note that the first value in the output (1) corresponds to the first value we included in the new dataframe (10), and the second value in the output (2) corresponds to our second value (13) and so on.

Confidence and prediction intervals

Recall the difference between confidence and prediction intervals: the former is used when you want to predict the expected average value for a group, while the latter is used when you want to predict the expected value for a single individual.

If we wanted to obtain 95% confidence intervals for our predicted outcome values, we simply add the argument interval = "confidence" to our command:

# Obtaining confidence intervals for predicted values
predict(m1, newdata = new_yeareduc, interval = "confidence")
##        fit      lwr      upr
## 1 36.18730 34.86032 37.51428
## 2 37.04639 36.10923 37.98354
## 3 37.90547 37.02059 38.79035
## 4 38.47820 37.40259 39.55380

From the output we see that the first column (fit) gives the predicted outcome for our four values of years of education, while the second (lwr) and third (upr) columns give the lower and upper confidence limits for those expected outcome values.

However, if we’d rather have the prediction interval, we simply specify “prediction” in the interval = argument instead:

# Obtaining prediction intervals for predicted values
predict(m1, newdata = new_yeareduc, interval = "prediction")
##        fit      lwr      upr
## 1 36.18730 13.35803 59.01657
## 2 37.04639 14.23646 59.85631
## 3 37.90547 15.09763 60.71331
## 4 38.47820 15.66216 61.29423

Help documentation for predict()

The predict() function recognizes that the object is a linear model and applies the correct calculations for that type of model. Thus, predict() calls a sub-function for linear models, predict.lm(). To see its help page, just run the following code, and the help page will open in your browser.

help(predict.lm)

For other types of models, there are related sub-functions. Thus, predict() can also be used for a wide range of statistical models (not covered in this course).
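You can see what drives this dispatch by checking the class of the model object we stored earlier:

# The stored model object has class "lm",
# which is what predict() uses to pick predict.lm()
class(m1)
## [1] "lm"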

3.1 Exercises

1. What is the expected hourly wage at age 63? Use the model you estimated in the previous exercise (est1), and the predict() function to predict the outcome value.

a) Create a new dataframe named age_sixty which contains the value 63 of the age variable. Remember to type the name of your created dataframe on another line to make sure you did it correctly.

... <- data.frame(age = ...)

"The data.frame() function is the way to go!"
"Replace the dotted lines with the correct specifications"

age_sixty <- data.frame(age = 63) 

age_sixty

b) Use your model (est1) and your newly created dataset (age_sixty) to predict the expected hourly wage at age 63.

predict(..., newdata = ...)

"The predict() function may be very helpful!"
"Replace the dotted lines with the correct specifications:"

predict(est1, newdata = age_sixty)

c) Modify your code from the previous exercise so you add confidence intervals to your predicted value.

predict(est1, newdata = age_sixty, interval = "...")

"Remember the interval = argument inside predict()"
"Replace the dotted line with the correct specification"

predict(est1, newdata = age_sixty, interval = "confidence")

d) Modify your code from the previous exercise to add prediction intervals instead of confidence intervals.

"Replace 'confidence' with 'prediction' in the interval = argument."

predict(est1, newdata = age_sixty, interval = "prediction")

2. What is the mean hourly wage at ages 25, 35, 45 and 55?

a) Create a new dataset named ages which holds the values 25, 35, 45 and 55 of age. Remember to type the name of your created dataframe on another line to make sure you did it correctly.

"Remember the c() function to list multiple values"

ages <- data.frame(age = c(25,35,45,55))

ages

b) Use your newly created dataset (ages) from the previous exercise (2a) and your model (est1) to predict the mean hourly wage at ages 25, 35, 45 and 55. Add confidence intervals to the predictions.

predict(est1, newdata = ages, interval = "confidence")

4 The linear probability model

The linear regression model is primarily meant for cases where you have a continuous outcome variable. Textbooks typically present it that way, and so does BPS, the textbook for this course. What happens if you estimate a linear regression model with an outcome variable that has only two levels (i.e. a dichotomous/binary variable)? Essentially, you then estimate what is called a linear probability model. That is just an ordinary linear regression model where the outcome is a proportion or probability, and what you have learned about linear regression models so far still applies. It is the same thing, really. (There exist other, specialized models for such outcomes, but you will do just fine with linear regression.)

The outcome variable must contain the values 0 and 1 (a dummy variable) and be of the numeric type. If your variable is stored as a factor, recode it to a numeric 0/1 variable first, and make sure that the category of interest is the one coded 1.

Either way, R treats the outcome as having the values 0 or 1. The estimated coefficients in a linear probability model give information about the expected probability that Y = 1. Alternatively, you can interpret the coefficients as the expected proportion that has the value 1 on the outcome variable.
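As a minimal sketch of such a recode (both mydata and the factor empstat with its level "unemployed" are hypothetical names, not variables from the datasets used in this session):

# Hypothetical example: create a numeric 0/1 dummy from a factor
mydata$unemployed <- ifelse(mydata$empstat == "unemployed", 1, 0)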

Let’s illustrate this with an example using the ess8 dataset. Does unemployment vary by gender? In this case, whether or not a respondent is unemployed is our outcome variable (unemployed), and gender is our explanatory variable. We use a binary variable for gender (male), where the value 1 equals males and 0 equals females. To estimate this model, all we have to do is use the lm() function as before; with a 0/1 outcome variable, an ordinary lm() fit is a linear probability model:

lm_prob <- lm(formula = unemployed ~ male, data = ess8)

Again, we can use summary() to see the regression results:

summary(lm_prob)
## 
## Call:
## lm(formula = unemployed ~ male, data = ess8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02651 -0.02651 -0.02651 -0.02517  0.97483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.025175   0.005943   4.236 2.41e-05 ***
## male        0.001331   0.008108   0.164     0.87    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1589 on 1543 degrees of freedom
## Multiple R-squared:  1.747e-05,  Adjusted R-squared:  -0.0006306 
## F-statistic: 0.02696 on 1 and 1543 DF,  p-value: 0.8696

The intercept tells us the expected probability of being unemployed for females (the reference group), while the coefficient for male tells us the difference in probability between males and females. Looking at the intercept, we see that females have a probability of about 2.5% of being unemployed. The coefficient for male indicates that unemployment is slightly more prevalent among men, with men having a 0.13 percentage point higher probability than women. Substantially, the coefficient is very small, meaning that there is practically no difference between males and females in the probability of being unemployed. The coefficient is also not significant at \(\alpha = .05\), which means that the gender difference observed in this sample is no larger than what can be expected from random variation. Thus, this is not strong evidence of an actual gender difference.
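We can check this interpretation with the predict() function from the previous section. A quick sketch:

# Predicted probability of unemployment for females (male = 0) and males (male = 1)
genders <- data.frame(male = c(0, 1))
predict(lm_prob, newdata = genders)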

4.1 Exercises

In these exercises you will use the dataset abu89 to investigate the relationship between the binary outcome variable promotion and two different predictors, gender and age. The variable promotion is coded so that the value 1 means the respondent has been promoted, while the value 0 means no promotion.

1. Does promotion vary by gender?

a) Estimate a linear probability model with promotion as your outcome variable and female as your explanatory variable. The female variable is a binary variable where the value 1 refers to women, and the value 0 refers to men. Store the result in an object named lm_gender.

... <- lm(formula = ... ~ ..., data = ...)

"Replace the dotted lines with the correct specifications to solve the exercise:"

lm_gender <- lm(formula = promotion ~ female, data = abu89)

b) Print the results of your model by using the summary() function, and reflect on what the estimates mean. Remember that the variable female is a binary variable.

summary(...)

"Replace the dotted line with the name of your model"

summary(lm_gender)

2. Does promotion vary with age?

a) Estimate a linear probability model with promotion as your outcome variable and age as your explanatory variable. Store the result in an object named lm_age.

... <- lm(formula = ... ~ ..., data = ...)

"Replace the dotted lines with the correct specifications to solve the exercise:"

lm_age <- lm(formula = promotion ~ age, data = abu89)

b) Print the results of your model (lm_age) by using the summary() function, and reflect on what the estimates mean. Remember that the age variable is a continuous variable. How does the interpretation of a coefficient from a continuous variable differ from the interpretation of a coefficient from a dichotomous variable?

summary(...)

"Replace the dotted line with the name of your model"

summary(lm_age)

Published: 01 September, 2021