While a typical course in quantitative methods will teach you statistical techniques for analyzing data and presenting the results, the data used are typically quite tidy and uncomplicated. In this course, we have introduced some basic techniques for data management, such as reading in data, recoding variables, and producing summary statistics for grouped data. When doing analyses outside statistical courses, you will need many such techniques.
The main goal of this chapter is to give you a feel for real-world quantitative analyses, and to present some very handy solutions that you can apply right away.
These are the main challenges to be addressed in this chapter:
As mentioned earlier, this chapter will not cover it all. You will, at some point, need to search for help. First of all: a lot of what you will need is covered in the book “R for data science”. Try that first.
A second very important point is that R has a very active online community. You will find solutions to your problems in a range of online discussion forums, blogs, various course pages, online books, tutorials and so forth. A major advantage of R is exactly this: there are so many free online resources and a very helpful and friendly online community! But it can be overwhelming at first, and it can be hard to navigate. (And as in all other online forums, you should search for existing solutions before opening a new topic that might already be solved.)
Remember how we have used the package “tidyverse” a lot? This is a collection of packages which provides an entire set of functions to handle data in a logical and consistent way. It is almost like a programming language in its own right, inside R.
That implies there are other ways of doing the same things. Some of the suggested solutions you find online will not be the tidyverse way, but “base”, “data.table” or some other approach. In forums such as Stackoverflow, someone might post a problem, someone else presents a solution that works well, and then others add further solutions, often starting with “For completeness…”. And they do the same thing, just differently. You will often find solutions in “base”, “tidyverse” and “data.table” - and some more.
If you want to avoid being too confused, look for the tidyverse solutions as used in this course. Sometimes, they are referred to as “tidy” or “dplyr” as well. The package “dplyr” is the main package in the “tidyverse” for data management, while “tidy” refers to the tidy principles for data handling which the tidyverse is built upon. Also: any solution using the pipe symbol, %>%, is probably tidyverse, as this pipe is a tidyverse invention. (Other approaches have pipes as well, but written differently.)
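For example, the two lines below do the same thing: count cars by number of cylinders in the built-in mtcars data. The first uses the tidyverse pipe, the second the pipe built into base R (from version 4.1), which is written differently:

library(tidyverse)
mtcars %>% count(cyl)   # the tidyverse (magrittr) pipe
mtcars |> count(cyl)    # the base R pipe does the same, but is written differently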
This applies to books as well! Introductory textbooks in R are quite often written in base R. While this is fine, I do not think it is a particularly good idea if one should also prepare for a modern approach to practical data analysis. At some point in professional life, you might need to learn base R, but hey: you can wait!
So, before you buy an expensive book, be aware that it might use a different approach than this one. As long as it works, it might not matter, but on the other hand, it is a lot easier to work consistently within one approach.
You need to get hold of some data and get it into R to do your analyses. Quite often, data do not come in the format you would prefer. The dataset might originate in other statistical software such as SPSS, Stata, SAS or something else you’ve never heard of before. But even formats that you are familiar with, such as Excel, might be a challenge. Other times, you are given a lot of choices and do not know which one to pick.
Here is an example of the kind of choice you have to make when downloading from a data archive. This one is from the European Social Survey. The data formats offered refer to specific statistical software packages, which is convenient for people who work in one of them, but R is not mentioned here.
So the question is: how do you get the data into R?!?
Luckily, R can read a lot of different formats. You are already familiar with the tidyverse, which is actually not one package, but a collection of packages. One of those packages is called haven, and it includes functions to read data from the statistical software mentioned above.
My preferred choice for all permanent storage of data is plain text, where the variables are delimited using either tab, comma or semicolon. That way, any software can open and use the file. It also means you will never have problems with compatibility across platforms, reading older, deprecated formats and so on.
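As a minimal sketch, writing a data frame to such a file takes one line with readr (part of the tidyverse); the object and filename here are placeholders:

write_csv2(mydata, "data/mydata.csv")   # hypothetical data frame; writes semicolon-delimited plain text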
Those used to working in e.g. SPSS or Stata might object that all the formatting would then be lost. In these packages one would typically use “labels”, which link the values of a variable to a text string indicating what they mean. Such labels are not supported by the plain text format. I would say that is not really a limitation of the format, but of the person or organization storing the data. There are ways of handling this.
It might be more convenient to use formats supporting labels for some purposes, but all permanent storage should be based on text files. Both the values and the labels should be stored. Labels can, for example, be stored either as separate variables in the dataset or as separate text files with coding lists designed for easy reading and recoding in any software. (No, a pdf documentation is not the same thing.)
If you ever get a job where you need to store data permanently, you really need to find out how to do it this way.
All data import in R involves reading the data and putting it into an object. We covered some basics in the earlier section on R basics and reading data. You need to use a different function for each type of format. (There exist multiple functions for each format, but we only cover some basics here.)
We rely on the haven package, which is part of the tidyverse, to read Stata, SPSS and SAS files. The functions have very similar names and syntax: you specify the file path and file name with the appropriate ending. Note that haven is installed with the tidyverse, but must be loaded explicitly with library(haven).

read_sav("filepath/filename.sav")        # SPSS
read_dta("filepath/filename.dta")        # Stata
read_sas("filepath/filename.sas7bdat")   # SAS
As encountered earlier, for csv files we use the basic read.csv(), and for RDS files we use readRDS().
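Following the same pattern as above:

read.csv("filepath/filename.csv")   # plain text, delimited
readRDS("filepath/filename.rds")    # R's own binary format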
It should be noted that importing data from foreign formats can be slow, though this is mainly noticeable for large files. For very large files, there are speedier functions, at least for csv files, and there also exist special formats designed to handle very large files quickly. R can also connect to a wide range of databases. We do not cover such topics here, but focus on the types of data you are more likely to encounter in the foreseeable future as a student of the social sciences.
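As a taste of the speedier options: readr (part of the tidyverse) reads csv files considerably faster than base read.csv(). The filename below is just a placeholder:

library(readr)
fast <- read_csv2("data/some_large_file.csv")   # read_csv2() expects semicolon-separated values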
There is also commercial software available for transferring data files from one format to another. One such program is Stat/Transfer, which solves most such issues. However, those who store data for sharing should ideally take care to store the data so that such problems are avoided. Alas, that is not always the case.
Stata is a very common format in the social sciences, so that is a good place to start. In practice, importing SPSS and SAS files works the same way, so using Stata as an example will suffice.
The example data is a downloadable cross-table from the European Social Survey, using their online analysis tool. It is stored in Stata format, the file name ending with .dta. It includes three variables: occupation, gender and the response to a question on how interested the person is in politics.
The following reads the data into the object polit and then prints a preview using glimpse().
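polit <- read_dta("data/politics.dta")
glimpse(polit)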
## Rows: 49,519
## Columns: 3
## $ isco08 <dbl+lbl> 3333, 7122, 4221, 4311, 6130, 7212, 5131, 5223, 6112, 4224…
## $ gndr <dbl+lbl> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1…
## $ polintr <dbl+lbl> 3, 2, 4, 3, 2, 2, 4, 3, 3, 4, 2, 2, 1, 4, 2, 1, 2, 3, 2, 3…
Note that the variable names are a bit cryptic, and so are the values. It is important to know what these codes mean, and a documentation report for the ESS is available here. So, the first thing to do is to find out what the variables are by searching the documentation.
There are three variables: isco08 is an occupation code, gndr is the respondent’s gender (male or female), and polintr is the answer to how interested the person is in politics.
For the variable polintr the documentation looks like this:
There are really only 4 valid values: 1 through 4. The others are types of non-answers, which might be interesting in themselves, but might also cause problems in your analysis. These are defined as missing. That is common in survey data, but might not be relevant for other kinds of data. We will look more closely at missing values later.
Moreover, the function glimpse() also reports variable types, and in previous examples these have mainly been numeric, character or factor. In this case, the variable types have two assignments rather than one, the second being lbl. This means that these variables are labelled.
These are features of the data, not of the type of data file as such: it will look the same when importing from SPSS, Stata or SAS. However, the format needs to support labelled data, so e.g. csv files will not easily do so. (There are ways, of course.)
Ex 1. Import data in Stata format (.dta) using read_dta()
dta <- read____("data/politics.___")
Ex 2. Import data in SPSS format (.sav) using read_sav()
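sav <- read____("data/politics.___")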
Ex 3. Import data in SAS format (.sas7bdat) using read_sas()
sas <- read____("data/politics.sas7bdat")
Ex 4. Import data in csv format using read.csv(). (NB: note the dot, not underscore, in the function name.) To make sure all variables are interpreted as character, set colClasses = "character". The separator is a semicolon, so specify that with sep = ";".
csv <- read.___("data/politics.csv", sep=";", colClasses="character")
The variable types are not a problem in themselves. R can handle them easily. The problem is that you might not handle them as easily, unless you know what is what.
Here is the basic challenge: R will treat variables differently depending on their type, in tabulations as well as in regression analyses. For data handling, it also matters how you do e.g. recoding and summarising. It is not always a problem, so you might not have noticed it in the chapters so far.
Numeric variables are numbers on a continuous scale: in short, variables where it makes sense to calculate means and standard deviations. These can be stored with varying degrees of precision, and will appear as double, numeric, float or integer. In almost every situation, these are just continuous variables, and how they are stored does not matter much. (At least in the social sciences. If you later start doing physics, it probably matters.)
Character variables consist of text strings, or plain text if you will. They will appear with quotation marks around them, such as “high education” or “low social class”. They might also contain numeric values, but treated as text. For example, municipality codes are numbers, but not really numeric, and would normally be represented as text. Oslo has the code “0301”. If it had been numeric, the leading zero would be dropped and it would appear as 301. Might not be such a big deal? However, the city area “Gamle Oslo” has the code “01”, and the full code is a combination of the two: “030101”. If you combine the numbers 301 and 1, you need to increase the number of digits and insert an additional zero. Worse: extracting the last two digits from the full code “030101” is harder. If the code is stored as a string, there are functions to do such things easily.
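A minimal sketch of the difference, using str_sub() from the stringr package (part of the tidyverse); the codes are the ones from the example above:

library(stringr)
full_code <- "030101"        # municipality code for Oslo + city area code for "Gamle Oslo"
str_sub(full_code, 5, 6)     # extract the last two digits as text: "01"
as.numeric(full_code)        # as a number, the leading zero is lost: 30101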
Factors are categorical variables with a numeric representation, but with an attached meaning for each “level”. In this way, the variable can be used in calculations, and R will automatically create dummy variables when appropriate. You will often need to convert character variables to factors when doing analyses.
Dates are a special type of numeric variable representing dates and times. We will skip these for now, but you will sometimes bump into them and should be aware that they are special. The package lubridate provides a lot of functions to handle dates and times.
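Just to give a taste, a minimal sketch with lubridate; the dates are made up:

library(lubridate)
ymd("2020-03-01")             # parse a text string into a proper date
ymd("2020-03-01") + days(30)  # date arithmetic: 30 days later is "2020-03-31"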
In the above example reading a Stata file, you might have noticed that all the variables are numeric. Clearly, that is not quite correct. At least isco08 and gndr are categorical. (Would you compute the average occupation type in the population?) This is how it is stored in the original files, so there is nothing wrong with the data as such. If someone is to blame, it is the persons who made the file.
There is one potential problem with storing isco08 as numeric. It is a standard coding scheme, and to get a grip on it you might take a look at Statistics Norway’s standard classification pages here, or the international page for the International Standard Classification of Occupations.
If the first two digits are “01”, the code means “Armed forces”, but the numeric representation would then be just 1. And if the first digit is 1, the code means a leader position. Thus, if you extract the first digit from the full numeric code, you might get some unexpected errors. These are avoidable also in the numeric representation, but more easily avoided if the code is a text string to begin with, as “0” would then be the first digit.
But this is how the world of data is: not ideal. Those who created the dataset did not have you in mind specifically, and might have been sloppy as well. More likely: the quality control ensured that such mistakes are not actually possible for these data - even though you couldn’t necessarily know that. We’ll assume this is ok. Anyway, we need to deal with it. These data are labelled, so that should be easy.
We use glimpse() to take a look at the data.
glimpse(polit)
## Rows: 49,519
## Columns: 3
## $ isco08 <dbl+lbl> 3333, 7122, 4221, 4311, 6130, 7212, 5131, 5223, 6112, 4224…
## $ gndr <dbl+lbl> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1…
## $ polintr <dbl+lbl> 3, 2, 4, 3, 2, 2, 4, 3, 3, 4, 2, 2, 1, 4, 2, 1, 2, 3, 2, 3…
Notice that for each variable it is indicated what kind of variable it is. The isco08 is marked as <dbl+lbl>. The first part indicates a numeric variable of the kind “double” (which is just a technical detail about how the data is stored: it is a numeric variable). The second part indicates that the variable is “labelled”. The standard way of storing data in some software is to have labels attached to the values, and that is done here. This means that each variable has a value and a label indicating what the value means.
The labelled variable type is not the same thing as a factor, although it might sound similar! A label is more like a sticker glued to the variable, not part of the variable value as such. For data management, you need to use the values. While R can handle labels, they are rarely used in R, and they will not show in most standard output.
The first observation has isco08-code 3333, which means “Employment agents and contractors”. For the variable gndr, the value is 1, which means “Male”, and so forth.
As you will see below, for isco08 we will need the values. Recoding on long text strings like these labels would be a true mess. The codes are meaningful in themselves, and will ease some recoding a lot. For gndr, we can just make it into a factor variable, but for the others we should keep the original values.
Let’s make a couple of versions of these variables to be used when needed. To do so, we use mutate(). This could be done in one mutate sentence, but in the following, mutate is repeated three times. That is not necessary, but it is done here because we are using a little trick: .after= puts the new variable in a specific place, just after the original variable. This is done for isco08 and polintr so that the versions of the same variable are next to each other. The placement of a variable does not matter for analysis; it is just convenient when looking at the data. Afterwards, we use mutate again to make the original isco08 and gndr into factor variables. Note that the order matters: we have to create the factor version of isco08 last, because as.character() on a factor variable would give unexpected results.
polit <- read_dta("data/politics.dta")

polit <- polit %>%
  mutate(isco08_code = as.character(isco08), .after = isco08) %>%  # keep the codes as text
  mutate(polintr_fct = as_factor(polintr), .after = polintr) %>%   # labelled values -> factor
  mutate(isco08 = as_factor(isco08),                               # factor version made last, see above
         gndr_fct = as_factor(gndr)) %>%
  relocate(gndr, .after = last_col())
glimpse(polit)
## Rows: 49,519
## Columns: 6
## $ isco08 <fct> "Employment agents and contractors", "Floor layers and til…
## $ isco08_code <chr> "3333", "7122", "4221", "4311", "6130", "7212", "5131", "5…
## $ polintr <dbl+lbl> 3, 2, 4, 3, 2, 2, 4, 3, 3, 4, 2, 2, 1, 4, 2, 1, 2, 3, …
## $ polintr_fct <fct> Hardly interested, Quite interested, Not at all interested…
## $ gndr_fct <fct> Male, Male, Female, Male, Female, Male, Male, Female, Male…
## $ gndr <dbl+lbl> 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, …
Let’s make some tables. We start by using a very basic function for making tables, table(), and the corresponding addmargins() to get the totals as well.
polit %>%
select(polintr, gndr) %>%
table() %>%
addmargins()
Looks good! But if you look closely, the total number of observations is less than reported by glimpse() above. The reason is missing values.
Let’s try to make the kind of table recommended in a previous chapter. This time, we will use the factor versions of these variables.
polit <- read_dta("data/politics.dta") %>%
mutate(isco08_code = as.character(isco08), .after=isco08) %>%
mutate(polintr_fct = as_factor(polintr), .after = polintr) %>%
mutate(isco08 = as_factor(isco08),
gndr_fct = as_factor(gndr)) %>%
relocate(gndr, .after = last_col())
# tbl_summary() and add_overall() come from the gtsummary package
polit %>%
  select(polintr_fct, gndr_fct) %>%
  tbl_summary(by = gndr_fct) %>%
  add_overall()
A new row appears at the bottom: “Unknown”. These are not included in the percentages, though. This is because they are the missing values shown in the documentation above: values defined as missing should not be used in computations.
It can be convenient to define missing values in this way, but mainly for categorical variables. If you do it for continuous variables, make sure the indicator for missing is not a potential true value, as that might cause problems later.
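For example, if the value 999 were used to mark missing on a continuous variable, na_if() from dplyr converts it to a proper NA. This is just a sketch; the data frame and the variable income are hypothetical:

mydata <- mydata %>%
  mutate(income = na_if(income, 999))   # hypothetical variable; 999 means "no answer"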
Now, let’s look at the occupations. A bar graph might work. We know there are many categories, so we’ll turn the angle of the labels on the x-axis, and make the text tiny.

ggplot(polit, aes(x = isco08)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = -90, size = .5, vjust = 3))
Damn! That was not very informative at all. There are just too many values. We have to do something about that. And what is that one occupation which is so much larger than all the others?
Let’s start identifying what kind of occupation is the largest one. An earlier chapter gives you the tools to do so, so look at that and fill in the gaps below. The code should count the number of observations for each level of isco08, and then sort the data with the largest group first.
polit %>%
group_by(...) %>%
...(count = n()) %>%
...(desc(count))
Ouch. The largest occupation group is missing. NA means “not available”.
For some of the respondents, there might be missing values. Either the respondent did not answer, or perhaps the questionnaire was designed so that not everybody got all the questions.
In survey data, it is common to also code the reasons for not responding, or to include response options indicating that the respondent does not want to answer. An example is the variable polintr as displayed in the documentation:

This means that only values 1 through 4 are actually valid responses to the question. The other responses can be regarded as reasons for not responding.
In the previous section, we discovered a couple of things:

- Observations with missing values on any variable will be dropped from analyses. There are 98 observations with missing on polintr and 4,794 with missing on isco08. These will be dropped unless we do something about it (a quick way to check these counts is sketched just below).
- For occupation, we need to aggregate into fewer, meaningful groups of occupations.
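A minimal sketch for counting the missing values directly, assuming the polit object created above:

polit %>%
  summarise(missing_polintr = sum(is.na(polintr)),   # should give 98
            missing_isco08  = sum(is.na(isco08)))    # should give 4794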
Let’s start with the variable on political interest. If we are to study how political interest varies by occupation and gender, then polintr is the outcome variable. One approach is to focus on those who are interested in politics: those who answered very or quite interested versus the rest. That implies simplifying to only two categories, and those with missing values would then reasonably go into the “rest” group.
In the following, we create a dummy variable, interest, which is 1 for those who answered very or quite interested (values 1 and 2) and 0 for everyone else, including the missing:
polit_d <- polit %>%
mutate(interest = as.numeric(polintr %in% c(1, 2)))
polit_d %>%
select(polintr_fct, interest) %>%
tbl_cross()
| Characteristic | interest: 0 | interest: 1 | Total |
|---|---|---|---|
| How interested in politics | | | |
| Very interested | 0 | 5,423 | 5,423 |
| Quite interested | 0 | 16,016 | 16,016 |
| Hardly interested | 17,837 | 0 | 17,837 |
| Not at all interested | 10,145 | 0 | 10,145 |
| Unknown | 98 | 0 | 98 |
| Total | 28,080 | 21,439 | 49,519 |
Sometimes, a categorical variable has hundreds of values. One such example is the International Standard Classification of Occupations, where each kind of occupation has a specific code. There are many types of occupations, so there are about 800 codes. For most analyses, this is way too detailed, and you need to group them into a smaller number of main categories. Statistics Norway provides an easily accessible overview of these codes here. In short, there is a 4-digit code, where the first position is a rough grouping, and the following positions imply further detail within the overall group.
You could write up the recoding as explained above, but that would take quite a few lines of code. It would take a lot of time, it is easy to make mistakes - and it is a nightmare to debug to find the errors. In short: that approach is out of the question. (That has nothing to do with R; you should not do it that way in any software.)
Rather, we just extract the first position from the text string. The function str_sub() is made for extracting parts of text strings by specifying where to start and end the extraction. In this case, we start and end at the first position, as sketched below.
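A minimal sketch, using the isco08_code variable we created earlier (the new variable name isco_main is just a suggestion):

polit <- polit %>%
  mutate(isco_main = str_sub(isco08_code, 1, 1))   # first digit = main occupational group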
Sometimes, such long lists of unique codes have a standard grouping that is not hierarchical. One example is the ORDC class schema developed by the HISTCLASS project (homepage). This is a coding schema mapping isco codes to classes, and a catalogue is available from the homepage. There are about 800 unique values, and they are stored in Excel format with the corresponding grouping for each code. The first lines in the Excel file look like this:
This file can be read into R using read_excel(), where the first line is typically read as variable names. But here the first lines are not to be used, and neither are several of the columns.
To read it into R, we need to skip the lines not needed and change the variable names. In this example, we only need the isco codes and the ordc_yrk codes. Variable names should not include spaces, so we include an argument ensuring universally valid variable names; it substitutes white space with dots. We also specify col_types = "text" to avoid the values being interpreted as numeric.
isco <- readxl::read_excel(path = "data/3_codes_isco88_ordc.xlsx",
                           skip = 4,                    # skip the heading lines
                           .name_repair = "universal",  # make valid variable names
                           col_types = "text") %>%      # read everything as text
  select(2, 9)                                          # keep only the code and class columns
## New names:
## * `ISCO-88 Title EN` -> ISCO.88.Title.EN
## * `ISCO-88 code` -> ISCO.88.code
## * `ISCO 08 Code` -> ISCO.08.Code
## * `ISCO-08 part` -> ISCO.08.part
## * `ISCO-08 Title EN` -> ISCO.08.Title.EN
## * ...
head(isco)
Now we can merge the data with this catalogue, so that the record in the catalogue is merged to each record with the same code. To do this, we use left_join(), and store the result in a new object.
isco <- isco %>%
  rename(isco08_code = ISCO.88.code)   # rename the key so it matches the variable name in polit

polit2 <- left_join(polit, isco, by = "isco08_code")
head(polit2)
What happened here is that the recoding was done almost automatically: the join added a new column containing the new variable.
Now, you can make e.g. a cross-tabulation of social class by gender.
polit2 %>%
select(ORDC_YRK, gndr) %>%
tbl_summary(by = gndr)
## ℹ Column(s) gndr are class "haven_labelled". This is an intermediate datastructure not meant for analysis. Convert columns with `haven::as_factor()`, `labelled::to_factor()`, `labelled::unlabelled()`, and `unclass()`. "haven_labelled" value labels are ignored when columns are not converted. Failure to convert may have unintended consequences or result in error.
## • https://haven.tidyverse.org/articles/semantics.html
## • https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled
| Characteristic | 1, N = 151,187¹ | 2, N = 228,452¹ |
|---|---|---|
| ORDC_YRK | | |
| 1 | 448 (2.4%) | 364 (2.0%) |
| 10 | 5,887 (32%) | 3,222 (17%) |
| 11 | 3,757 (20%) | 5,563 (30%) |
| 12 | 1,222 (6.6%) | 1,187 (6.4%) |
| 2 | 1,367 (7.4%) | 1,333 (7.1%) |
| 3 | 30 (0.2%) | 13 (<0.1%) |
| 4 | 561 (3.0%) | 1,251 (6.7%) |
| 5 | 2,000 (11%) | 2,669 (14%) |
| 6 | 1,681 (9.1%) | 1,850 (9.9%) |
| 7 | 180 (1.0%) | 267 (1.4%) |
| 8 | 1,129 (6.1%) | 726 (3.9%) |
| 9 | 134 (0.7%) | 105 (0.6%) |
| 996 | 84 (0.5%) | 109 (0.6%) |
| 997 | 5 (<0.1%) | 1 (<0.1%) |
| Unknown | 132,702 | 209,792 |

¹ n (%)
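The message above also tells us how to avoid the warning: convert the labelled variable before tabulating. A minimal sketch, reusing the factor version gndr_fct created earlier:

polit2 %>%
  select(ORDC_YRK, gndr_fct) %>%
  tbl_summary(by = gndr_fct)   # columns now show Male/Female instead of 1/2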
Sometimes, data come in a “wide” format, and you need to get them into a “long” format.
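A minimal sketch using pivot_longer() from tidyr (part of the tidyverse); the data frame and values here are made up for illustration:

# hypothetical wide data: one row per municipality, one column per year
wide <- tibble(municipality = c("0301", "1103"),
               y2018 = c(2.1, 2.5),
               y2019 = c(2.0, 2.4))

long <- wide %>%
  pivot_longer(cols = starts_with("y"),   # the year columns
               names_to = "year",
               names_prefix = "y",        # strip the leading "y"
               values_to = "unemployment")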
The essurvey package allows easy download of European Social Survey data directly from R.
Replication data for the study can be downloaded from dataverse.no here.
Data at the municipality level are downloadable from Statistics Norway, and can be merged on using the municipality code. We will here look at the relationship between population structure, unemployment and crime.
Statistics Norway provides time-series data on most statistics.