To get back to the main page click here

1. Basic inference

Statistical inference is about comparisons and consider to what extent something are similar or different. Comparing the data at hand is easy: it is what it is. However, when we consider a sample to represent some larger population, we wish to make conclusion about the the population. The sample as such is actually relatively uninteresting in itself - but provides information about something more than itself!

The key problem is that some difference in the sample is expected due to random variation alone. Thus, differences in the sample might not reflect a true difference in the population. To say that e.g. two groups have different average values in the population, we need take into account the possibility that the differences in the sample are merely conincidences. Statistical inference is the procedure of interpreting results while taking this uncertainty into account.

This is no easy task! Typically, we do not know much about the population. That is why one made the study in the first place: to learn about the population - from a sample!

There are two key procedures you need to master:

  • confidence intervals
  • statistical tests

These are closely related, and both are based on taking uncertainty into account. Confidence intervals adds a margin of error around the estimated difference in e.g. means. For a specific estimate, we might also make a statistical test to check whether the difference might be consistent with just random variation - or not.

To compare groups, we typically do a t-test for continuous variables and chi-square test for categorical variables. To compare two proportions, we can also do z-tests that are very similar to t-tests.

For all techniques you might use for estimating something, there is a corresponding statistical tests and confidence intervals. How these are calculated differs. In this chapter, we consider inference for one and two proportions as well as one and two means.

You should also understand that inference for regression estimates are t-tests - just as when comparing means. So, t-tests are important to understand to be able to interpret regression results.

Packages used in this chapter

There are quite many packages for doing the basic statistical tests in R. Thus, if you make a search on the internet, you might find different codes than we will use here.

We will rely on the package mosaic. This packages is chosen because it provides the same basic syntax for the tests we cover in this chapter. First of all, these tests are specified using a formula similar to what is used in regression models: outcomevariable ~ groupvariable and then specify in what dataset these variables are found: data = dataset.

2. Inference for means

Compare two groups

Consider the data from the international social survey project, ISSP. One question in the questionaire is how much the participants think people from various occupations should earn. The graph shows the distribution by gender of the participant.

ggplot(issp, aes(x = factor(kjonn), y = boerlege)) +
  geom_jitter(width=0.1, height=20, alpha=.4) +
  ylab("") +
  xlab("Gender")
## Warning: Removed 29 rows containing missing values (geom_point).

To get the exact figures you can calculate the averages as shown below. In addition to the mean, we also include standard deviations and standard errors.

In this example, we do the calculations a bit manually, using some standard functions from the tidyverse like group_by, filter, summarise.

issp %>% 
  group_by(kjonn) %>% 
  filter( !is.na(boerlege)) %>% 
  summarise(n = n(), mean = mean(boerlege), std = sd(boerlege)) 

It seems men think doctors should earn more than women thinks: average of 843 vs 764. That is what the sample say. But does this mean that men and woment have different opinions about doctors’ salaries more generally?

The function t_test() do a few things for you:

  • calculate the differences in means between the two groups (this is your “estimate”)
  • calculate the confidence interval around this estimate
  • performs the t-test

Actually, in the example, the two first lines are enought. The last line of code just specify the test to be a two-sided t-test. You can also specify one-sided tests by specifying alternative = "greated" og alternative = "less".

t_test(formula = boerlege ~ kjonn,
       data = filter(issp, !is.na(boerlege)),
       alternative = "two.sided"
       )
## 
##  Welch Two Sample t-test
## 
## data:  boerlege by kjonn
## t = 3.3615, df = 1279.3, p-value = 0.000798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   32.86543 124.99291
## sample estimates:
##   mean in group Male mean in group Female 
##             843.2763             764.3472

The difference between the groups are 78.9, with a 95% confidence interval between 32.9 and 125.

There is also a t-test where the “statistic” is 3.36, which is the t-value. The calculation is done under the hood, so you do not get to see the standard error, just the results. It also calculates the degrees of freedom (df) which is 1279.3, and the p-value for a two-sided test is 0.000798.

NB! It is easy to get a bit confused about the degrees of freedom as the formula for calculating it by hand in any textbook will give you a different figure! Importantly: those formulas are only approximate, and the formula used in software is more complicated. The approximate will be fine in most instances, but software do not need such approximations and will do it exact. Thus: do not worry about this.

You need to make special notice of a few things with this code:

  • there is a formula where the left hand side (LHS) is the outcome variable of interest, and the right hand side (RHS) is the grouping variable.
  • the type of test is specified by stating the alternative hypothesis. When stating “two-sided”, you (obviously) get a two-sided t-test.
  • the order-argument states which way the difference is calculated. It does not really matter for the calculations except for the sign.

1.1. Exercises

Now it is your turn. Repeat the t-test above, comparing how much men and women thinks doctors should earn.

t_test(... ~ ..., 
       data = ...) 
t_test(boerlege ~ kjonn, data = issp) 

2.1 Doing the test by “hand” using code

Recall the earlier chapter on handling data and getting basic descriptives. You can use the same techniques to do the same thing.

You do this by using the functions group_by() and summarise(), specifying the number of observations, the mean and the standard deviation as follows:

issp %>% 
  filter( !is.na(boerlege)) %>% 
  group_by(kjonn) %>% 
  summarise(n = n(), mean = mean(boerlege), std = sd(boerlege)) 

What is missing is the standard error, but you can that is just the standard deviation divided by the square of the number of observations. Just create a new variable using mutate() like this:

issp %>% 
  filter( !is.na(boerlege)) %>% 
  group_by(kjonn) %>% 
  summarise(n = n(), mean = mean(boerlege), std = sd(boerlege)) %>% 
  mutate(se = std/sqrt(n)) %>% 
  ungroup() %>% 
  mutate(diff = mean - lag(mean) )
grade_code()
## function (check_env) 
## {
##     if (is.list(check_env)) {
##         check_env <- list2env(check_env)
##     }
##     user_code <- check_env$.user_code
##     if (is.null(user_code)) {
##         return(legacy_graded(correct = FALSE, message = "I didn't receive your code. Did you write any?"))
##     }
##     solution_code <- check_env$.solution_code
##     if (is.null(solution_code) || length(str2expression(solution_code)) == 
##         0) {
##         return(legacy_graded(correct = FALSE, message = "No exercise solution provided. Defaulting to _incorrect_."))
##     }
##     message <- code_feedback(user_code = user_code, solution_code = solution_code, 
##         env = check_env, allow_partial_matching = allow_partial_matching)
##     if (is.null(message)) {
##         return(legacy_graded(correct = TRUE, message = glue_message(glue_correct %||% 
##             gradethis_legacy_options$gradethis.glue_correct, 
##             .is_correct = TRUE, .message = NULL, .correct = correct, 
##             .user_code = user_code)))
##     }
##     message <- glue_message(glue_incorrect %||% gradethis_legacy_options$gradethis.glue_incorrect, 
##         .is_correct = FALSE, .message = message, .incorrect = incorrect, 
##         .user_code = user_code)
##     legacy_graded(correct = FALSE, message = message)
## }
## <bytecode: 0x47c64f0>
## <environment: 0x47c2db0>

The special function lag alows calculations based on values from a previous row.

Admittingly, this is a bit elaborate code, and it is slightly more complicated than we expect you to be able to do in this course.

3. Inference for proportions

issp <- issp %>% 
  filter(religion_viktig %in% c("Important", "Not important")) %>% 
  mutate(religion_viktig = factor(religion_viktig, levels = c("Important", "Not important"))) 

issp %>% 
  group_by(kjonn) %>% 
  summarise(n = n(), prop= mean(religion_viktig == "Important") )
prop.test(religion_viktig ~ kjonn, data =  issp)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  tally(religion_viktig ~ kjonn)
## X-squared = 0.043093, df = 1, p-value = 0.8355
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.04011965  0.05304584
## sample estimates:
##    prop 1    prop 2 
## 0.1911765 0.1847134

3.1. Exercises

grade_code()
## function (check_env) 
## {
##     if (is.list(check_env)) {
##         check_env <- list2env(check_env)
##     }
##     user_code <- check_env$.user_code
##     if (is.null(user_code)) {
##         return(legacy_graded(correct = FALSE, message = "I didn't receive your code. Did you write any?"))
##     }
##     solution_code <- check_env$.solution_code
##     if (is.null(solution_code) || length(str2expression(solution_code)) == 
##         0) {
##         return(legacy_graded(correct = FALSE, message = "No exercise solution provided. Defaulting to _incorrect_."))
##     }
##     message <- code_feedback(user_code = user_code, solution_code = solution_code, 
##         env = check_env, allow_partial_matching = allow_partial_matching)
##     if (is.null(message)) {
##         return(legacy_graded(correct = TRUE, message = glue_message(glue_correct %||% 
##             gradethis_legacy_options$gradethis.glue_correct, 
##             .is_correct = TRUE, .message = NULL, .correct = correct, 
##             .user_code = user_code)))
##     }
##     message <- glue_message(glue_incorrect %||% gradethis_legacy_options$gradethis.glue_incorrect, 
##         .is_correct = FALSE, .message = message, .incorrect = incorrect, 
##         .user_code = user_code)
##     legacy_graded(correct = FALSE, message = message)
## }
## <bytecode: 0x47c64f0>
## <environment: 0xd96f278>

4. Inference for cross-tables

The \(\chi^2\)-test can be used in a similar way as the test for comparing two proportions. In fact, it will give the exact same p-value. However, you cannot create confidence intervals in this way, just the test. On the other hand, \(\chi^2\)-test extends to cross-tables with more categories.

We use the function xchisq.test() from the mosaic-package to perform the \(\chi^2\)-test. For this test, it does not matter which variable you regard as outcome and explanatory, but we nevertheless use the formula in a similar way. Both variables need to be categorical, preferably a factor.

xchisq.test(religion_viktig ~ kjonn, data = issp)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x
## X-squared = 0.043093, df = 1, p-value = 0.8355
## 
##   104      116   
## (102.12) (117.88)
## [0.0188] [0.0162]
## < 0.186> <-0.174>
##    
##   440      512   
## (441.88) (510.12)
## [0.0043] [0.0038]
## <-0.090> < 0.083>
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

The output is a bit messy, but you should be able to find each parts of interest. An alternative is to store the results in an object and then pick out each part by using $.

xtest <- xchisq.test(religion_viktig ~ kjonn, data = issp)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x
## X-squared = 0.043093, df = 1, p-value = 0.8355
## 
##   104      116   
## (102.12) (117.88)
## [0.0188] [0.0162]
## < 0.186> <-0.174>
##    
##   440      512   
## (441.88) (510.12)
## [0.0043] [0.0038]
## <-0.090> < 0.083>
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>
xtest$observed
##                kjonn
## religion_viktig Male Female
##   Important      104    116
##   Not important  440    512
xtest$expected
##                kjonn
## religion_viktig    Male  Female
##   Important     102.116 117.884
##   Not important 441.884 510.116

4.1. Exercises

5. Doing tests in descriptive tables

The statistical tests as typically taught in textbooks, and also in the previous chapters here, are quite detailed and explicit about each part. In other words: The output provides a lot of details. This is very useful for you to understand what is actually going on.

However, the output from these procedures is typically ugly and hard to get into a publication-ready format. When writing up a report, a research paper or a term paper or thesis, you typically do not have to provide as much details as that. You need to know what is going on, but only the basic information will suffice. It is quite common to report only the estimate, the standard error and the p-value.

In a previous chapter, we covered tabulations using the package gtsummary. In a publication, you would normally not show all the details of a performed statistical test. It is more common to show e.g. the differences in means in a table, and the adding either standard errors, confidence intervals or p-values from a significance test.

The add_p() function will automatically determine the appropriate test based on the type of variable and additional requirements. Thus, the specific functions for each test covered above is used by gtsummary to do the exact same thing. You need to know each function and test so that you understand the output in the table.

There exist more tests than is covered in our curriculum, so we restrict to the three mentioned tests.

We specify the tests as t.test and chi-square in the add_p(), similarly as to how we have used lists elsewhere. The default tests would add continuity correction to the test, which we do not cover here. With specifying the tests, you will get the same results if you do the calculations by hand. You should double-check just that! (The difference between tests would usually be tiny).

issp %>% 
  tbl_summary(by=kjonn,
              statistic = list(all_continuous() ~ "{mean} ({sd})",
                               all_categorical() ~ "{p}%"),
              label = list(religion_viktig ~ "Religion important", 
                           boerbuti ~ "Work in shops should earn (NKr)",
                           boerlege ~ "Doctors should earn (NKr)")
              ) %>% 
  #add_p() %>% 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct")) %>% 
  add_overall() %>% 
  modify_header(update = list(
      label ~ "",
      stat_0 ~ "**All**",
      stat_1 ~ "**Women**",
      stat_2 ~ "**Men**"
      )
    ) 
issp %>% 
  tbl_summary(by=kjonn,
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              label = list(religion_viktig ~ "Religion important", 
                           boerbuti ~ "Work in shops should earn (NKr)",
                           boerlege ~ "Doctors should earn (NKr)")
              ) %>% 
  #add_p() %>% 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct")) %>% 
  add_overall() %>% 
  modify_header(update = list(
      label ~ "",
      stat_0 ~ "**All**",
      stat_1 ~ "**Women**",
      stat_2 ~ "**Men**"
      )
    ) 
issp %>% 
  select(religion_viktig, kjonn) %>% 
  tbl_cross() %>% 
  add_p()
Characteristic kjonn Total p-value1
Male Female
religion_viktig 0.8
Important 104 116 220
Not important 440 512 952
Total 544 628 1,172

1 Pearson's Chi-squared test

5.1. Exercises

Published: X XXX, 2021
Revised: 22 January, 2025