For this session there are no slides, because I’m not at the WZB to hold a presentation. Instead I provide you with a full document for self-study. As always, there are some tasks at the end of the document which you should complete. I recommend that you not only read the document, but also try out some of the code examples and experiment with them.
The solutions for tasks #4 are now online at https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/.
After being introduced to the basics of R programming using “traditional” R (or: base R) concepts, we will now focus on a more modern way of R programming using the tidyverse philosophy. Don’t worry: as you will soon notice, you will still apply the concepts that you learned with base R in the last sessions. The modern R programming approach in the tidyverse builds on top of the basic R concepts and provides a common, clearly defined “grammar” for all the main aspects of data analysis, including data transformation, visualization and modeling. The tidyverse is organized into several R packages, each serving a distinct purpose. You can read about the individual packages at https://www.tidyverse.org/packages/.
The concept of the tidyverse and the underlying software packages are mainly developed by Hadley Wickham. Together with Garrett Grolemund, he wrote the book R for Data Science. The book is freely available online at http://r4ds.had.co.nz/. We will use this book to study the main concepts of modern data transformation in R.
Before diving into the tidyverse, make sure you’ve installed the tidyverse package via RStudio’s package manager or the install.packages() function. You also need to install the nycflights13 package in order to complete the exercises.
filter()
Read sections 5.1 and 5.2 on filtering (i.e. subsetting) observations with filter() from the tidyverse package dplyr. Notice how the concepts for comparisons and logical operators are similar to what you’ve learned in the previous sessions. They are now applied in a different context.
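As a minimal sketch of how filter() reuses the comparison and logical operators from the previous sessions, consider the following example. The data frame trips and its values are made up for illustration and do not come from the book.

```r
library(dplyr)

# hypothetical toy data, only for illustration
trips <- data.frame(
  origin = c("JFK", "LGA", "EWR", "JFK"),
  dest   = c("MIA", "ATL", "ORD", "BQN"),
  delay  = c(12, NA, -3, 45)
)

# several conditions separated by commas are combined with AND:
# keep rows where origin is JFK and delay is greater than 10
jfk_delayed <- filter(trips, origin == "JFK", delay > 10)

# the logical OR operator works as in base R
not_jfk <- filter(trips, origin == "LGA" | origin == "EWR")

# rows for which the condition evaluates to NA are dropped
late <- filter(trips, delay > 0)
```

Each call returns a new data frame; trips itself is not modified.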
distinct()
Sometimes, you need to get all distinct observations of a vector or data frame. This means that the resulting vector or data frame does not include duplicate observations. Only unique observations will be left.
In base R, there is the function unique(), which removes duplicate values from vectors:
colors <- c('red', 'green', 'red', 'yellow', 'red', 'yellow')
unique(colors)
## [1] "red" "green" "yellow"
To filter data frames for unique observations, we can use distinct() from the tidyverse package dplyr. With this, for example, we can find all unique carrier codes in the flights data set:
library(nycflights13)
library(tidyverse)
distinct(flights, carrier)
## # A tibble: 16 x 1
## carrier
## <chr>
## 1 UA
## 2 AA
## 3 B6
## 4 DL
## 5 EV
## 6 MQ
## 7 US
## 8 WN
## 9 VX
## 10 FL
## 11 AS
## 12 9E
## 13 F9
## 14 HA
## 15 YV
## 16 OO
We could do the same with unique(). Notice, however, that distinct() always returns a data frame (here with only a single column), while unique() applied to a column returns a vector:
unique(flights$carrier)
## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA"
## [15] "YV" "OO"
So far we have only looked for distinct values in a single column, but you can also find the distinct observations using several (or all) columns of a data frame. With this, for example, we can find all distinct flight routes in the flights data set:
distinct(flights, origin, dest)
## # A tibble: 224 x 2
## origin dest
## <chr> <chr>
## 1 EWR IAH
## 2 LGA IAH
## 3 JFK MIA
## 4 JFK BQN
## 5 LGA ATL
## 6 EWR ORD
## 7 EWR FLL
## 8 LGA IAD
## 9 JFK MCO
## 10 LGA ORD
## # ... with 214 more rows
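The case of using all columns can be sketched as follows. The data frame routes is a made-up example, not part of nycflights13.

```r
library(dplyr)

# hypothetical toy data; rows 1 and 3 are identical
routes <- data.frame(
  origin = c("JFK", "LGA", "JFK", "JFK"),
  dest   = c("MIA", "ATL", "MIA", "BQN")
)

# without column names, distinct() compares entire rows
unique_routes <- distinct(routes)

# with a column name, only that column is compared and returned
unique_origins <- distinct(routes, origin)
```

Here unique_routes has three rows, because the duplicated JFK to MIA route is kept only once, and unique_origins has two rows (JFK and LGA).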
arrange()
Read section 5.3 on sorting observations with arrange() from the tidyverse package dplyr. We already got to know the function sort() for sorting vectors. arrange() allows us to sort data frames, and it can sort by several variables at once.
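A minimal sketch of sorting by several variables, using a made-up data frame scores (the names and values are only for illustration):

```r
library(dplyr)

# hypothetical toy data, only for illustration
scores <- data.frame(
  name   = c("Ann", "Ben", "Cem", "Dua"),
  group  = c("b", "a", "b", "a"),
  points = c(7, 9, 10, 4)
)

# sort by group in ascending order; within each group,
# sort by points in descending order using desc()
sorted <- arrange(scores, group, desc(points))

sorted$name
# [1] "Ben" "Dua" "Cem" "Ann"
```

The second variable is only used to break ties in the first one.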
select() and rename()
Read section 5.4 on selecting and renaming variables in data frames with select() and rename() from the tidyverse package dplyr.
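The difference between the two functions can be sketched with a small made-up data frame (people and the new column name person_id are assumptions for illustration):

```r
library(dplyr)

# hypothetical toy data, only for illustration
people <- data.frame(id = 1:3, age = c(19, 15, 24), smoker = c(0, 0, 1))

# select() keeps only the named columns
ages <- select(people, id, age)

# a minus sign drops a column instead
no_id <- select(people, -id)

# rename() keeps all columns; the syntax is new_name = old_name
renamed <- rename(people, person_id = id)

names(renamed)
# [1] "person_id" "age"       "smoker"
```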
mutate() and transmute()
First, read section 5.5 on adding new variables to data frames with mutate() and transmute() from the tidyverse package dplyr. Then continue with the two sections below.
Note: You can skim section 5.5.1 “Useful creation functions”; you don’t need to understand in detail what these functions do.
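As a brief sketch of how the two functions differ, consider this made-up data frame heights (names and values are only for illustration):

```r
library(dplyr)

# hypothetical toy data, only for illustration
heights <- data.frame(name = c("Ann", "Ben"), height_cm = c(170, 182))

# mutate() adds the new variable and keeps all existing ones
with_m <- mutate(heights, height_m = height_cm / 100)

# transmute() keeps only the variables named in the call
only_m <- transmute(heights, name, height_m = height_cm / 100)

names(with_m)
# [1] "name"      "height_cm" "height_m"
names(only_m)
# [1] "name"     "height_m"
```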
mutate() and transmute()
mutate() and transmute() are often used to convert the data type of variables. Consider this data frame, in which smoker is a numerical variable where 0 means non-smoker and 1 means smoker:
(smoker_data <- data.frame(age = c(19, 15, 24, 29, 17), smoker = c(0, 0, 1, NA, 1)))
## age smoker
## 1 19 0
## 2 15 0
## 3 24 1
## 4 29 NA
## 5 17 1
We can convert this to a logical TRUE/FALSE vector with the conversion function as.logical():
(smoker_data <- mutate(smoker_data, smoker = as.logical(smoker)))
## age smoker
## 1 19 FALSE
## 2 15 FALSE
## 3 24 TRUE
## 4 29 NA
## 5 17 TRUE
mutate() and transmute() in combination with ifelse()
mutate() and transmute() are often used in combination with the function ifelse(cond, true-value, false-value). For each element, this function returns true-value if the condition cond is TRUE and false-value if the condition is FALSE.
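Before applying it to a data frame, the behavior of ifelse() can be sketched on a plain vector. The vector x and the labels are made up for illustration.

```r
x <- c(5, -2, 0, 8)

# the condition is checked element by element
labels <- ifelse(x > 0, "positive", "not positive")
labels
# [1] "positive"     "not positive" "not positive" "positive"

# where the condition itself is NA, the result is NA as well
with_na <- ifelse(c(1, NA) > 0, "yes", "no")
with_na
# [1] "yes" NA
```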
Imagine that NA values were coded as -1 in the smoker variable:
(smoker_data <- data.frame(age = c(19, 15, 24, 29, 17), smoker = c(0, 0, 1, -1, 1)))
## age smoker
## 1 19 0
## 2 15 0
## 3 24 1
## 4 29 -1
## 5 17 1
Now if we tried to convert that to a logical vector, we would run into trouble, because every number that is not 0 is considered TRUE, hence -1 also becomes TRUE:
mutate(smoker_data, smoker = as.logical(smoker))
## age smoker
## 1 19 FALSE
## 2 15 FALSE
## 3 24 TRUE
## 4 29 TRUE
## 5 17 TRUE
In order to fix that, we can use ifelse(). We set the condition smoker == -1 and pass NA as the value that should be used whenever the condition is TRUE. In all other cases, we want to retain the original value, hence we pass smoker. After converting the output with as.logical(), we get the correct result:
mutate(smoker_data, smoker = as.logical(ifelse(smoker == -1, NA, smoker)))
## age smoker
## 1 19 FALSE
## 2 15 FALSE
## 3 24 TRUE
## 4 29 NA
## 5 17 TRUE
Note: You need to install the packages nycflights13 and tidyverse in order to complete the exercises.
Download 05transform1-resources.zip from the course website, read codebook.txt and complete the following tasks:
Load schulen_potsdam.csv into R without specifying further parameters for read.csv(). Have a look at the data using functions like str() and head(). Do you spot any potential problems?
Now add the following parameters to read.csv(): stringsAsFactors = FALSE, colClasses = c(plz = "character"). Inspect the result. What is the effect of these parameters? Are all problems now fixed and do the variable types match the specifications in the codebook (see codebook.txt, also contained in the zip file)? If not, convert these variables to the correct data type using mutate() and as.factor().
Create a new variable full_address using mutate(). This new variable should contain the full address consisting of street name, zip code and city name, e.g. “Carl-von-Ossietzky-Straße 37 01570 Potsdam”. You can combine several strings to form one string using the function paste().
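As a hint for the last task, here is a minimal sketch of how paste() works on vectors. The vectors first and last are made up and unrelated to the school data.

```r
first <- c("Ada", "Alan")
last  <- c("Lovelace", "Turing")

# paste() combines the vectors element by element,
# separated by a single space by default
full <- paste(first, last)
full
# [1] "Ada Lovelace" "Alan Turing"

# the separator can be changed with the sep argument
with_comma <- paste(last, first, sep = ", ")
with_comma
# [1] "Lovelace, Ada" "Turing, Alan"
```

Because paste() works element by element, it can be used inside mutate() to build a new column from existing ones.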