MERK: Det kan være feil i disse sidene med tekst og/eller bilder. Meld gjerne i fra i nettskjema her.

1 Getting started

This document provides a detailed instruction on how to get things done in R. That requires you get data into R, understand a bit of how the software works, and then mastering specific techniques.
It is a mix of tutorials and hands-on exercises so you can try it out straight away. You do not need to install anything on your local computer, but you can do everything in a browser. For each section, there are interactive exersizes for you to try. Follow the instructions to get the code right, and get immediate feedback!

When you think you know how things work, you should do it on your local computer with a different dataset as well. In these online exercises, you will get a lot of help: half-written codes for you to fill in, hints when wrong, and immediate feedback if it works or not. When doing an analysis for yourself - e.g. on the exam - there will be none of that. So, you should prepare for that as well.

The goal is that after finishing this course, you should be ready to use R to carry out a comple set of basic empirical analysis - as well as understanding what you’re doing!

R has several sub-languages for data analysis, which you can access by installing packages. Packages are basically user-written functions that enhance R’s functionality. Some packages are tiny utility-functions, while others are entire systems for data management or analysis. That means it is possible to get a bit lost in all stuff you can do, and different ways of doing the same thing etc. Searching for help on coding in R on the internet can therefore be a bit hard in the beginning. Do not worry! We’ll guide you on a straight path of consistent and user-friendly approach.

In this course, we focus on a very popular approach to both datamanagement and analysis that is called tidyverse. It is a collection of packages that are linked together using the same principles. When installing the package tidyverse on your computer, you are actually installing dplyr, ggplot2 and some other as well. In addition, we will be using packages that are not formally part of this ensemble, but are designed in a similar way to extend the tidyverse-functionality. The advantage of using tidy principles is that it is an effective, consistent and user-friendly approach to data analysis.

It is worth noting that using tidyverse is more than just a preference. It is build on consistent principles for data handling and analysis that goes beyond what we do just in R. You do not need to learn the details of that, but if you like, you can check out Hadley Wickham’s article on tidy data explaining the principles.

Most of what we cover here is explained in more depth in the book R for data science by Wickham and Grolemund. Throughout the tutorials and exercises, we refer to that book as R4DS. But here, we focus specifically on what is needed for the course SOSGEO1120 and skip everything else. Some places we will extend with functions not covered in the book, but will then explain those parts a bit more to get you covered.

As of now, the course SOSGEO1120 uses the textbook The basic practice of statistics, and we will throughout refer to it as BPS. That book explains how the techniques work, and what you’re actually doing. But frankly, any good textbook will do just fine as long as it covers the same material. R4DS explains how to do it in R. You need to know both, of course.

Note: You might worry you need to know all the codes and functions by heart. Not so. Looking up stuff while working is fine, also on the exam and later in professional life. However, if you spend too much time looking up everything it is hard to get anything done in reasonable time. Thus: you need to know the most important functions by heart, but look up the details. For example: you should be able to make a decent graph without the manual, but to fix the details of making it publication ready, it is fine to look up the manual.

Enjoy!

2 Essentials of R

To get you started, you need to start with some data. In some other programs (like Excel, Stata, SPSS etc) you would open a dataset and you will look at the data while you work. Not so in R. Rather, you will read data from disc and into an object in your workspace inside R. Taking a look at the data is a bit different than what you might be used to or expect. But you will quickly get used to it.

Go here for interactive exercises on R essentials

3 Beautiful evidence: graphics

Graphics can be a very effective way of communicating results, and you will often see plenty of statistical graphics in published results. Graphics is also very well suited to explore data as a step in you analysis of data, checking results and conditions. However, you often see graphs in published results that are less than good, and sometimes even ugly. Other times, the graphics are overloaded with elements that distorts the data it is supposed to show. What’s the point of that?!? In this chapter, you will learn how to make tidy graphics that clearly shows what you intend to show. Graphics should always look good, and this is an introduction to beautiful graphics.

Go here for the first part of interactive exercises on Beautiful graphics - part 1

Go here for the second part of interactive exercises on Beautiful graphics - part 2

Go here for the third part of interactive exercises on Beautiful graphics - part 3

Go here for the fourth part of interactive exercises on creating maps: Beautiful graphics - part 4

4 Some basic functioning of R

To summarise data, you also need to understand a little about how R works in the first place.

You can use R as a calculator. Everything we do in this course as hand calculations, you can do by programming in R. In other words: just write up the equations in R as you would by hand. You do not need a calculator for this course. In fact: you really should use R instead!

We move from basic calculations “by hand” using R, and then move on to vectors and data.frames and how those work.

Go here for interactive exercises on R basics

5 How to handle your dataset

While most datasets in this course are fairly tidy, real-world data are typically messy when you first lay your hands on them. They never come neatly in the way you would like them to. Thus, you need to sort it out. Cleaning and preparing data is a really large part of any analysis. This is sometimes called data wrangling. You need to do some of that to analyse subsamples, compare groups, handle missing data, re-code variables etc.

Of course, you also need some functions to summarise the data.

Go here for interactive exercises on Data wrangling

6 Effective tabulations of descriptive statistics

Almost every analysis will contain some tables with descriptive statistics. Some reports will contain tons of them, shorter scientific articles maybe just a couple. This might be basic infomation about the data, overview of distribution of main variables, bi-variate comparison of groups (with or without p-values), or more complicated structures. This chapter teaches you how to do the basic and slightly advanced tabulations using R and the package gtsummary.

Go here for interactive exercises on Creating tables

7 Statistical tests

Standard statistical analyses typically involve some tests of whether any observed differences between groups might just be due to random variation. Statistical tests basically adds a margin of error to the estimate. If the differences are larger than we would expect from random variation, we denote the difference as “statistically significant”.

You should learn the basics by hand to understand what is going on. In practice, we often let software do this work, and we end up with just some standard errors, p-values and confidence intervals.

In the chapter on tabulations, you learned to add p-values when comparing groups, and a footnote on which test was used. In the next chapters on regression, you will also just get the tests automatically.

In this chapter, we’ll quickly walk you through each test in R. That is: t-tests for comparing means, and z-tests for comparing proportions. The two first can be for one or two samples, or one-sided or two-sided. They also come with confidence intervals. Given some assumptions, the z-test can be used instead of the t-test, but not the other way around. So, you’re probably already confused. In addition, there is the Chi-squared test for cross-tables.

To get less confused, read the textbook BPS which explains it all. And do some handcalculations. This section just explains how to get it done using R.

Go here for interactive exercises on statistical tests [ONLY VERY ROUGH DRAFT SO FAR - COMING LATER]

8 Intro to linear regression

Linear regression is the working horse of much social science analyses. What is often called simple regression is just when there is only one outcome variable and one explanatory variable. We will start with that. There is nothing simple about simple regression, beyond it is less complex than multiple regression.

Go here for interactive exercises on Regression analysis

9 Intro to multiple regression

Multiple regression is the same thing as simple regression, only adding more explanatory variables. It allows you to compare groups on some characteristics while controlling for a bunch of other stuff. For real world applications, we do multiple regression, and most analyses will just refer to it simply as regression. Lots of techniques with very fancy names are basically multiple regession. When the analysis is done, and you have a pretty good idea of what is going on, you need to gather the stuff in one table.

Go here for interactive exercises on Multiple regression analysis

10 Statistics in practise - Extra material

In this section, we put together all the previous parts. This is the basis for writing up an analysis of any kind, such as a BA-thesis or a MA-thesis. But also if you are to write a scientific paper or a governmental report.

Most analyses of quantitative data will include some tables with basic statistics (and they should look good!) and some graphics (which should be beautiful). Usually, there will be some comparisons of means on some characteristic between groups, and often the comparisons are done using regression techniques to control for other differences - and regression tables should look good as well. Throughout, you need to consider uncertainties in the form of p-values and confidence intervals.

However, before you get to all of this, with real data, you will most certainly bump into a host of practical problems. Most datasets used in courses like this will typically be clean and tidy data, so it allows you to practice the techniques you’ve just learned. Most real-world datasets are not like this, and you need to get them into shape before you are able to do much sensible with them. This includes things like importing data into R from some unfamiliar format, dealing with missing values, recoding, handling of various datatypes etc. We will look at some common practical issues.

All of this is pretty useless if the research design is crap. Remember: inference is conditioning on the kind of data you got. If the sample is not a random sample, you cannot infer the results holds for the population, and the measures of uncertainties are not informative. For causal inference, you need to have an experiment in some way. Natural experiments will do, though. If you do not have randomization to a treatment-group and a control-group, you cannot estimate the causal effect no matter how fancy statistical maneuvering you do. The bottom line is:

Non-random sample do not allow for inference out side the sample. Thus, p-values and confidence intervals are not useful.
Random sample allows for inference to the population, p-values and confidence intervals are necessary
Randomised experiments allows for estimating causal effects, p-values and confidence intervals are necessary.

The bottom line is: for real-world analysis, you need to master some data management as well as understanding research design and statistical inference. This chapter prepares you to get your hands a bit dirty.

Go here for interactive exercises on Statistics in practice [ONLY VERY ROUGH DRAFT SO FAR - COMING LATER]

11 Quizes

This section is only in Norwegian. There are a handful of questions to each chapter of “The basic practice of statistics” (BPS), but that only serves to structure the content. It doesn’t really matter which textbook you use.

Disse quiz-oppgavene er relativt lette og dekker de sentrale temaene i kurset - men de dekker ikke alt. Du kan ta en oppgave på nytt hvis du svarer feil, og noen ganger får du hint som hjelper deg forstå hva som ble feil. Ideelt sett skal du få det meste riktig, men gjør du ikke - så tar du det bare en gang til.

Gå hit for quiz-oppgaver om Beskrive en variabel (BPS kapittel 1-3)

Gå hit for quiz-oppgaver om Beskrive flere variable samtidig (BPS kapittel 4-6, og litt 29)

Gå hit for quiz-oppgaver om Datainnsamling (BPS kapittel 8-10)

Gå hit for quiz-oppgaver om Sannsynlighetsregning (BPS kapittel 12, 15-17)

Gå hit for quiz-oppgaver om Statistisk tolkning i praksis (BPS kapittel 18-26, og litt 29)

—

About this page

Why did we choose these packages?

There are many ways of doing things in R, and you might wonder why we did not use your favorite approach. Or you might find textbooks that use other packages.

The most important principle here is that we would like to use an approach that makes the students capable to make something useful quickly, and in a way that lays the foundations for future more advanced uses - to be learned at a much later stage. Moreover, all output should be exported to MS Word, as this is the text processor which most beginners use. As a result, the techniques taught here should make the students capable of producing professionally looking reports. It also lays the foundation for professional uses like reproducible and automated reports, which requires eliminating such things as manual editing.

These goals can be reached in several ways, and some courses and many textbooks use a different approach than we do here. We prefer the Tidyverse, and it is for the above reasons. The syntax is consistent and have an underlying philosophy also for data structure. Thus, it lays the foundations for good habits, although principles of data structure and workflow is not emphasized as such. This makes it easier to learn more later. A lot of more tricky stuff can be saved for later when you actually need it.

David Robinson at Datacamp gave a talk about this a few years back, of which we largely agree.

Thus, this course is largely a consistent tidyverse-approach to data analysis. However, we actually use some packages that are not formally a part of the tidyverse. Those are Mosaic and gtsummary, and combined with Flextable. And of course: we use ggplot for all graphics.

Although we use a handful of packages, and those have dependencies etc, the goal is to use as few packages as possible, while fulfilling the goals mentioned.

Why ggplot() rather than plot()

It is clearly so that base plot gets you a basic plot quicker than ggplot. There are some advantages to that. However, the underlying idea in this course is not only to learn to make specific graphics. In addition, the goal is to teach a consistent grammar for greating a wide range of graphics using the same syntax. These first graphs are just the starting point, an introduction to the grammer. Indeed the “gg” in ggplot stands for “grammar of graphics”. This is the introduction to learning very advanced graphics.

If the goal was only learn barplot, histogram and scatterplot, it would have sufficed to learn plot(), barplot() and hist(). However, that is not the goal. Moreover, to make base-plot publication read requires additional work which makes the syntax not so easy after all. Moreover, using ggsave() is easier than the export functions when turning graphics devices on and off.

Mosaic rather than base tests

The Mosaic package is its own universe, and not particularly connected to tidyverse. We use it only for a few specific purposes: the basic statistical tests: t-test, z-test, and chi-square test. It is useful when we need the full output from these text, which also aligns with hand calculations.

Mosaic provides a consistent syntax using formula for the tests we use: t-test, z-test, and chi-square test. Thus it is better for us to use t_test() than t.test(), and especially prop_test() rather than prop.test(). The point is consistent syntax acoss these tests. These procedures are a bit cumbersome, but it is because they detail exactly what the textbook describes for hand calculations. However, these tests can also be done using gtsummary-package but with a very minimalistic output.

Another advantage of using the formula-syntax is that it familiarizes with formula, as also used in regression models and some other functions which they might need at some later stage.

However, in publication-ready tables, these outputs are way to verboose. A standard reporting would be much more compact and should be covered well in the functions for tabulations, both for descriptive tables and regression tables. In other words: using the gtsummary-package, tables of descriptive statistics and comparing groups are done using the appropriate statistical tests. The Mosaic package is used to learn what is going on “under the hood”, as both packages rely on the same basic functions provided by Base-R.

Tabulations: gtsummary rather than various other stuff

There are quite many other packages to greate tables, and some are great. Most are for specific purposes. We need a relatively easy package that have a general purpose. When students get more advanced, they can use whatever package they want.

The goal should be to be able to create publishable tables, but creating tables in an efficient way can be challenging. We require the following to be done easily:

create very simple univariate tables
expand to multiple variables in the same table
use different metrics for different variables as needed, such as mean and sd for continuous variables and percentage for categorical variables, but also adding median, quantiles etc when neeeded d) expand to both two-way and three-way descriptive tables
create tables where the data are grouped by more than one variable in the columns
The same functions should also be able to handle regression output in equally advanced manner - although the basic should be relatively easy.

Finally: the results must be possible to export to common formats, in particularly MS Word, but also Markdown, LaTex and html. Most students use MS Word, so that is the primary goal here. But later, students might need other output formats. Since gtsummary works with the flextable-package, it allows for exporting to all of these formats.

gtsummary also allows for inline-reporting in RMarkdown, which is useful for later advanced use. We do not use that here, but we lay the groundwork for future advanced use.

Another reason is that gtsummary works great with tidyverse, and builds on the gt-package. This means that the full range of gt-functions can be used to create even better tables, if needed. Actually, further modifications can also be done using flextable. In sum: a fully professional looking table can be automated in R, and the exported to whatever format. This lays the groundwork for reproducible research.

MA in sociology from the University of Oslo. Main author of the interactive exercises on basic R, graphics, descriptives and regression↩︎
BA student at the University of Oslo. Main author of quizes to each chapter of the BPS↩︎
Professor in sociology at the University of Oslo. Overall responsibility for content and structure. Main author of the interactive exercises on tabulations and maps. Responsible for future edits and updates.↩︎

Practical data analysis for sociologists

Course material for SOSGEO1120 at University of Oslo

Authors:

Benedicte Nordahl Berntsen¹

Sigrid Kirkebø Landa²

Torbjørn Skardhamar³

Published: 1 September, 2021
Revised: 25 April, 2022

1 Getting started

2 Essentials of R

3 Beautiful evidence: graphics

4 Some basic functioning of R

5 How to handle your dataset

6 Effective tabulations of descriptive statistics

7 Statistical tests

8 Intro to linear regression

9 Intro to multiple regression

10 Statistics in practise - Extra material

11 Quizes

About this page

Why did we choose these packages?

Why ggplot() rather than plot()

Mosaic rather than base tests

Tabulations: gtsummary rather than various other stuff

Practical data analysis for sociologists

Course material for SOSGEO1120 at University of Oslo

Authors:

Benedicte Nordahl Berntsen1

Sigrid Kirkebø Landa2

Torbjørn Skardhamar3

Published: 1 September, 2021 Revised: 25 April, 2022

1 Getting started

2 Essentials of R

3 Beautiful evidence: graphics

4 Some basic functioning of R

5 How to handle your dataset

6 Effective tabulations of descriptive statistics

7 Statistical tests

8 Intro to linear regression

9 Intro to multiple regression

10 Statistics in practise - Extra material

11 Quizes

About this page

Why did we choose these packages?

Why ggplot() rather than plot()

Mosaic rather than base tests

Tabulations: gtsummary rather than various other stuff

Benedicte Nordahl Berntsen¹

Sigrid Kirkebø Landa²

Torbjørn Skardhamar³

Published: 1 September, 2021
Revised: 25 April, 2022