For this session there are no slides, because I’m not at the WZB to hold a presentation. Instead I provide you with a full document for self-study. As always, there are some tasks at the end of the document which you should complete. I recommend that you not only read the document, but also try out some of the code examples and experiment with them.
The solutions for tasks #4 are now online on https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/.
After being introduced to the basics of R programming using “traditional” R (or: base R) concepts, we will now focus on a more modern way of R programming using the tidyverse philosophy. Don’t worry, you will still apply concepts that you learned in the last sessions with base R as you will soon notice! The modern R programming approach in the tidyverse builds on top of the basic R concepts and provides a common, clearly defined “grammar” for all the main aspects of data analysis including data transformation, visualization and modeling. The tidyverse is organized into several R packages, each serving a distinct purpose. You can inform yourself about the individual packages on https://www.tidyverse.org/packages/.
The concept of the tidyverse and the underlying software packages are mainly developed by Hadley Wickham. Together with Garrett Grolemund, he wrote the book R for Data Science. The book is freely available online on http://r4ds.had.co.nz/. We will use this book to study the main concepts of modern data transformation in R.
Before diving into the tidyverse, make sure you’ve installed the tidyverse package via RStudio’s package manager or the install.packages() function. You also need to install nycflights13 package in order to complete the exercises.
filter()Read sections 5.1 and 5.2 on filtering (i.e. subsetting) observations with filter() from the tidyverse package dplyr. Notice how the concepts for comparisons and logical operators are similar to what you’ve learned in the previous sessions. They are now applied in a different context.
distinct()Sometimes, you need to get all distinct observations of a vector or data frame. This means that the resulting vector or data frame does not include duplicate observations. Only unique observations will be left.
In base R, there is the function unique() which will remove duplicate values in vectors:
colors <- c('red', 'green', 'red', 'yellow', 'red', 'yellow')
unique(colors)## [1] "red"    "green"  "yellow"To filter data frames for unique observations, we can use distinct() from the tidyverse package dplyr. With this, for example, we can find all unique carrier codes in the flights data set:
library(nycflights13)
library(tidyverse)
distinct(flights, carrier)## # A tibble: 16 x 1
##    carrier
##    <chr>  
##  1 UA     
##  2 AA     
##  3 B6     
##  4 DL     
##  5 EV     
##  6 MQ     
##  7 US     
##  8 WN     
##  9 VX     
## 10 FL     
## 11 AS     
## 12 9E     
## 13 F9     
## 14 HA     
## 15 YV     
## 16 OOWe could do the same with unique(), however, notice that distinct() always returns a data frame (the above having only a single column) while unique() returns a vector:
unique(flights$carrier)##  [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA"
## [15] "YV" "OO"So far we only created distinct observations of a single column, but you can also find the distinct observations using several (or all) columns in a data frame. With this, for example, we can find all distinct flight routes in the flights data set:
distinct(flights, origin, dest)## # A tibble: 224 x 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    IAH  
##  2 LGA    IAH  
##  3 JFK    MIA  
##  4 JFK    BQN  
##  5 LGA    ATL  
##  6 EWR    ORD  
##  7 EWR    FLL  
##  8 LGA    IAD  
##  9 JFK    MCO  
## 10 LGA    ORD  
## # ... with 214 more rowsarrange()Read sections 5.3 on sorting observations with arrange() from the tidyverse package dplyr. We already got to know the function sort() for sorting vectors. arrange() allows us to sort data frames, even using several variables to sort along.
select() and rename()Read sections 5.4 on selecting and renaming variables in data frames with select() and rename() from the tidyverse package dplyr.
mutate() and transmute()At first, read sections 5.5 on adding new variables to data frames with mutate() and transmute() from the tidyverse package dplyr. Then continue with the two sections below.
Note: You can gloss over section 5.5.1 “Useful creation functions” quickly – you don’t need to understand what these functions do in all detail.
mutate() and transmute()mutate() and transmute() are often used to convert the data type of variables. Let’s consider this data frame, where smoker is indicated with a numerical variable where 0 means non-smoker and 1 means smoker:
(smoker_data <- data.frame(age = c(19, 15, 24, 29, 17), smoker = c(0, 0, 1, NA, 1)))##   age smoker
## 1  19      0
## 2  15      0
## 3  24      1
## 4  29     NA
## 5  17      1We can convert this to a logical TRUE/FALSE vector with the conversion function as.logical:
(smoker_data <- mutate(smoker_data, smoker = as.logical(smoker)))##   age smoker
## 1  19  FALSE
## 2  15  FALSE
## 3  24   TRUE
## 4  29     NA
## 5  17   TRUEmutate() and transmute() in combination with ifelse()mutate() and transmute() are often used in combination with the function ifelse(cond, true-value, false-value). This function allows to set one value true-value if a condition cond is TRUE and another value false-value if this condition is FALSE.
Imagine that NA values were coded as -1 in the smoker variable:
(smoker_data <- data.frame(age = c(19, 15, 24, 29, 17), smoker = c(0, 0, 1, -1, 1)))##   age smoker
## 1  19      0
## 2  15      0
## 3  24      1
## 4  29     -1
## 5  17      1Now if we tried to convert that to a logical vector, we would run into trouble, because everything that is not 0 is considered TRUE, hence -1 also becomes TRUE:
mutate(smoker_data, smoker = as.logical(smoker))##   age smoker
## 1  19  FALSE
## 2  15  FALSE
## 3  24   TRUE
## 4  29   TRUE
## 5  17   TRUEIn order to fix that, we can use ifelse(). We set the condition smoker == -1 and pass NA as value that should be taken whenever the condition is TRUE. On all other occasions, we want to retain the original value, hence we pass smoker. After converting the output to as.logical() we get the correct result:
mutate(smoker_data, smoker = as.logical(ifelse(smoker == -1, NA, smoker)))##   age smoker
## 1  19  FALSE
## 2  15  FALSE
## 3  24   TRUE
## 4  29     NA
## 5  17   TRUENote: You need to install the packages nycflights13 and tidyverse in order to complete the exercises.
05transform1-resources.zip from the course website, read codebook.txt and complete the following tasks:
schulen_potsdam.csv into R without specifying further parameters for read.csv(). Have a look at the data using functions like str() and head(). Do you spot any potential problems?read.csv(): stringsAsFactors = FALSE, colClasses = c(plz = "character"). Inspect the result. What is the effect of these parameters? Are all problems now fixed and do the variable types match the specifications in the codebook (see codebook.txt also contained in the zip-file)? If not, convert these variables to the correct data type using mutate() and as.factor().full_address using mutate(). This new variable should contain the full address consisting of street name, zip code and city name, e.g. “Carl-von-Ossietzky-Straße 37 01570 Potsdam”. You can combine several strings to form one string using the function paste().