Essentials in R


1 Introduction

In this session you’ll learn about some of the basic concepts and some essential skills in R. You will learn what objects and dataframes are, how you can read data into R and how to look at data when it’s been read into your workspace.

This is not meant to be an exhaustive tutorial on the essentials of R, and more concepts will be covered in later sessions. However, it will give you an understanding of key concepts in R and the skills you need to read data into R and to obtain more information about a dataset.

2 Objects

For those of you who have tried out R, or have read a bit of R for Data Science (abbreviated R4DS) or other material on R, you have probably seen the word object being thrown around. But what actually is an R object?

Essentially, everything you store in R - variables, graphs, dataframes, numbers, tables etc. - is an object: it is assigned a name and can be referred to in later functions or commands.

An object exists when you’ve assigned it a value. It will appear in your environment pane in RStudio, and can then be modified and operated upon. You define an object by using the assignment operator (<-) in combination with a name for your object. You can think of the assignment operator as the words “is defined as”. Assignment commands typically follow this general order:

name_of_object <- value (or process/function/calculation that produces a value)

For example, we could store simple math calculations in an object:

# Creating an object named sub
sub <- 345-87

In the code above we define the object named sub and assign it the value of 87 subtracted from 345. We can then print the content of our object by simply typing its name:

# Printing the content of sub
sub

## [1] 258

We can see from the output that the object sub contains the value of 258. This makes sense since 345 - 87 = 258. However, if we just type the calculation, without storing it as an object, R simply prints the result, but there is no object stored in our environment pane:

345-87

## [1] 258

You can always overwrite an object’s value by using the assignment operator again to redefine its value. In the following example, we redefine the value in sub:

# Redefining the object sub
sub <- 67

By typing the name of the object, we can see from the output that it now contains the value of 67:

sub

## [1] 67
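Once an object exists, its name can stand in for its value in later commands. The following is a minimal sketch of that idea; the object names price, quantity and total are made up for illustration and are not part of the course data.

```r
# Hypothetical objects, used only for illustration
price <- 20
quantity <- 3

# The names can be used in a later calculation
total <- price * quantity
total

# Redefining price does not update total automatically:
# total keeps the value it was assigned earlier
price <- 25
total
```

If you want total to reflect the new price, you have to run the assignment for total again.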

Keep the following rules in mind when naming objects:

Object names must begin with a letter (i.e. they cannot begin with a number).
Object names are case-sensitive (e.g. the object sub is not the same as the object Sub).
Object names cannot contain spaces. Use an underscore (_) or a period (.) instead of a space.
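The naming rules can be tried out directly. This is a small sketch with made-up object names (mean_age, mean.age, age, Age); the base R function make.names() is used here only to show how R itself would repair a name that breaks the rules.

```r
# Valid names: they start with a letter and use _ or . instead of a space
mean_age <- 41.5
mean.age <- 41.5

# Names are case-sensitive: age and Age are two different objects
age <- 31
Age <- 68
age == Age

# make.names() turns an invalid name into a valid one
make.names("2nd value")

# A name that is already valid is returned unchanged
make.names("mean_age")
```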

It is also important to note that all R objects are stored in your computer’s memory. As long as you have RStudio open, the object will exist in your workspace. However, if you close RStudio, the object will no longer be there when you open RStudio again. This is one of the main reasons why you should ALWAYS write your code in scripts, so that you can simply run the script each time you open RStudio.

You will get more familiar with defining objects and the assignment operator in the upcoming sessions. For now, we wanted to introduce you to the concept of objects and how we create them.

3 Read data into R

In this course you will be working with pre-existing data, conventionally called datasets. A dataset is also an object in R (typically a “dataframe”) and must be assigned a name when we import it into R. Importantly, you need to keep your working directory well organised, so that R can find your data files.

The following code imports the dataset isspNO2019 into R. The data file is in a folder named data. That folder should be inside your working directory (project folder), so you should use a relative file path, naming the folder before the file name.

# Importing the dataset isspNO2019 into R and storing it in an object named issp

issp <- readRDS("data/isspNO2019.rds")

We store the dataset in an object named issp by using the assignment operator. The function readRDS() imports the dataset into R. Inside the parentheses, we use quotation marks around the file path to tell R where to find the dataset: first the folder (our data folder, which is a subfolder of our project folder) and then the name of the file, including its extension (isspNO2019.rds). After running the code, the object issp appears in our environment pane in RStudio, where it is defined as a data frame object.
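If readRDS() reports that it cannot find the file, it can help to check where R is looking. The sketch below uses only base R functions and assumes the folder and file names from the example above; whether file.exists() returns TRUE depends on your own project folder.

```r
# Show the current working directory (your project folder, if set up as described)
getwd()

# Build the relative path from the folder name and the file name
path <- file.path("data", "isspNO2019.rds")
path

# TRUE if the file is where R expects it, FALSE otherwise
file.exists(path)
```

If the result is FALSE, check that the data folder is inside the working directory and that the file name is spelled exactly as on disk.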

However, which function you should use to read a dataset into R depends on the type of file the dataset is stored in (e.g. .xlsx, .csv, .rds etc.). The table below shows some of the most common file types and which command to use for reading that type of file into R.

File extension	Explanation	Command
.csv	A comma-separated values (CSV) text file	read.csv()
.rds	R’s own data file format for saving a single object	readRDS()
.xlsx	A Microsoft Excel file	read_excel() (from the readxl package)
.RDat .RData	R’s own data file format that can contain one or several objects. If the file contains several objects, they are all imported at once	load()
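To see two of these functions at work without needing any course files, the sketch below writes a small made-up data frame (toy, not part of the course data) to temporary files and reads it back in. It uses only base R functions.

```r
# A small made-up data frame, used only for illustration
toy <- data.frame(id = 1:3, age = c(31, 68, 50))

# Write it to temporary files so that there is something to read
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
write.csv(toy, csv_file, row.names = FALSE)
saveRDS(toy, rds_file)

# Read each file back with the function that matches its file type
toy_csv <- read.csv(csv_file)
toy_rds <- readRDS(rds_file)

toy_csv
toy_rds
```

Note that readRDS() returns the object, so you choose its name when you assign it, whereas load() restores objects under the names they were saved with.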

3.1 Exercises

Try it out yourself!

Here is a simple exercise with an empty code chunk provided for entering the answer.

1. Write the code to import the dataset abu89.rds, which is in a folder called data in your working directory. Store it in an object named abu89.

"Did you specify a relative file path?"

"Did you include the file name extension .rds?
(The object you assign the data to does not need an extension.)"

abu89 <- readRDS("data/abu89.rds")

4 Dataframes

As mentioned previously, when we load a dataset into R and assign it a name it becomes an R object of the type ‘data frame’. A data frame is a two-dimensional table structure which is used to hold the data values of multiple vectors (i.e. variables with more than one value) of the same length. A data frame has the variables of a dataset as columns and the observations of a dataset as rows. Therefore, the variables (vectors) must be of the same length, meaning they have to have the same number of observations. Let’s take a look at the structure of a data frame using the issp data set:

issp

The top line of the table contains the column names, i.e. the variables. Each horizontal line that follows is a row of data for one observation. It begins with the row number and is followed by the actual data. Each data point is called a cell. Read vertically, a column holds the values of one variable across all observations; read horizontally, a row holds the values of all variables for a given observation (respondent).
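To see how vectors of equal length line up as the columns of a data frame, here is a minimal sketch. The vectors and the object name respondents are made up for illustration; the values are not taken from the issp data.

```r
# Three made-up vectors of the same length (one value per respondent)
idnr   <- c(1, 2, 3)
gender <- c("Female", "Male", "Female")
age    <- c(31, 68, 50)

# Combine them into a data frame: one column per vector, one row per respondent
respondents <- data.frame(idnr, gender, age)
respondents

# Number of rows (observations) and columns (variables)
nrow(respondents)
ncol(respondents)

# A single cell: the age of the respondent in row 2
respondents[2, "age"]
```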

Sometimes, you might spot that your data is stored not as a data.frame but as a tibble. For this course, the difference does not matter at all: the dataset would have the same structure and the same functions will work on both. However, tibble also supports some fancy stuff that we will not cover in this course.

5 Looking at data

After we’ve read a dataset into R and assigned it as an R object of the class data frame, we can get more familiar with it using the functions head(), glimpse() and View(). We will illustrate these functions using the issp dataset we’ve already loaded into our workspace.

The head() function returns the first part of the dataset: by default, the first six rows. How many columns fit in the printed output depends on the size of the dataset.

head(issp)

We can see that in this particular case of the issp dataset, the output shows the first six rows (observations) of all the columns (variables) of the data frame. Because the issp dataset only contains 21 columns, all of them are included in the output.
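The number of rows that head() returns can be changed. The sketch below uses a made-up data frame (toy) rather than the issp data, and also shows tail(), the base R counterpart that returns the last rows.

```r
# A made-up data frame with ten rows
toy <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

# By default, head() returns the first six rows
head(toy)

# The n argument changes how many rows are returned
head(toy, n = 3)

# tail() returns the last rows instead
tail(toy, n = 2)
```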

The glimpse() function from the dplyr package returns even more information. Simply put, this function tries to provide you with as much information about the dataset as possible.

glimpse(issp)

## Rows: 715
## Columns: 21
## $ idnr           <int> 1002, 1008, 1036, 1047, 1050, 1060, 1062, 1063, 1074, 1…
## $ day            <int> 2, 17, 30, 2, 24, 13, 5, 6, 17, 29, 6, 30, 16, 13, 14, …
## $ month          <int> 3, 3, 3, 3, 4, 5, 3, 5, 3, 2, 3, 3, 3, 3, 4, 3, 5, 4, 5…
## $ year           <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ gender         <fct> Female, Male, Female, Male, Female, Male, Male, Female,…
## $ age            <int> 31, 68, 50, 61, 52, 66, 49, 29, 54, 40, 58, 46, 67, 42,…
## $ ethnicity      <fct> Norwegian, Norwegian, Norwegian, Norwegian, Norwegian, …
## $ pl_residence   <fct> Big city, Sparsely settled, Suburbs of city, Suburbs of…
## $ marital_status <fct> Married, Widowed, Married, Unmarried, Married, Married,…
## $ religion       <fct> Christianity, Christianity, Christianity, Christianity,…
## $ yeareduc       <int> 16, 2, 18, 14, 6, 12, 20, 18, 2, 20, 16, 18, 13, 15, 14…
## $ hours_work     <dbl> 35.0, 20.0, 40.0, 40.0, 37.5, 10.0, 50.0, 42.0, 84.0, 3…
## $ inc_dec        <fct> 558-658000, 435-493000, 558-658000, 231-306000, 659-856…
## $ class          <fct> Upper middle class, Middle class, Upper middle class, L…
## $ voting_elec    <fct> Arbeiderpartiet, Senterpartiet, Hoyre, Arbeiderpartiet,…
## $ diff_income    <fct> Strongly agree, Agree, Disagree, Neutral, Neutral, Agre…
## $ good_edu       <fct> Very important, Essential, Very important, Fairly impor…
## $ rich_fam       <fct> Not very important, Fairly important, Fairly important,…
## $ edu_pay        <fct> Essential, Very important, Very important, Fairly impor…
## $ scalenow       <int> 9, 6, 7, 5, 5, 6, 6, 8, 5, 7, 7, 4, 7, 6, 6, 5, 3, 6, 6…
## $ scalefam       <int> 5, 7, 7, 7, 5, 8, 5, 6, 4, 5, 6, 4, 5, 6, 5, 5, 4, 5, 3…

The output shows that, in addition to the total number of rows and columns in the dataset, the glimpse() function also returns all columns, the class of each column, and as many of the values of each column as fit on the line. The main difference between this function and the head() function is that the columns are now listed vertically, each followed by its values horizontally. (Don’t worry if you do not know what ‘the class of each column’ means. Basically, class refers to different types of variables. You will learn more about the different types of variables in a later session.)
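For the curious, the class of each column can also be inspected with base R alone. This is a small sketch on a made-up data frame (toy); the column names mirror a few of those in the glimpse() output above, but the data frame itself is hypothetical.

```r
# A made-up data frame with three different column types
toy <- data.frame(
  age        = c(31L, 68L, 50L),                      # whole numbers
  hours_work = c(35.0, 20.0, 37.5),                   # decimal numbers
  gender     = factor(c("Female", "Male", "Female"))  # categories
)

# The class of each column
sapply(toy, class)

# str() is a base R function that gives a compact overview similar to glimpse()
str(toy)
```

The classes integer, numeric and factor correspond to the abbreviations <int>, <dbl> and <fct> in the glimpse() output.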

The View() function does not work in this learning environment, but it is useful to mention. When you run the function in RStudio on your computer, it will automatically open a new tab with a spreadsheet-style, scrollable data viewer showing the complete dataset. Many students prefer this function since it gives a good overview of the dataset. We recommend trying it out yourself in RStudio on your computer.

Help documentation

To see the help-pages for the functions head(), glimpse() and View(), just run the following code and the help-pages will open in your browser.

help(head)

help(glimpse, package = "dplyr")

help(View)

Note that the code for the function glimpse() is a bit different from the others. To navigate to the help-page for this function, we’ll have to specify which package it belongs to.

5.1 Exercises

In later sessions, you’ll be working with the abu89 dataset in addition to the issp dataset. The dataset abu89 is loaded in your workspace in addition to the package dplyr.

Get more familiar with the abu89 dataset using the functions head() and glimpse(). Replace the dotted lines with the name of the dataset, and press the ‘run code’ button to see the output.

1. The head() function.

head(...)

head(abu89)

2. The glimpse() function.

glimpse(...)

glimpse(abu89)