R Tutorial at the WZB

Tasks

Note: You need to install the packages nycflights13 and tidyverse in order to complete the exercises.

1. Complete exercises 1 to 3 from section 5.2.4 in R for Data Science

At first we load the required libraries (this is necessary for all tasks and I won’t repeat it for the other tasks, too).

library(tidyverse)
library(nycflights13)

We should check the documentation of the data in order to see what the variables mean and what their units are:

?flights

We should have a look at the whole data frame first:

View(flights)

We can also find out how many rows and columns we have:

dim(flights)

## [1] 336776     19

More than 330,000 rows and 19 columns!

Exercise 1

You only had to do 3 out of the 7 tasks here, but I’ll show solutions for all tasks.

1.1. Find all flights that had an arrival delay of two or more hours

Departure and arrival delays are given in minutes. Hence we need to filter for an arrival delay of at least 120 minutes:

filter(flights, arr_delay >= 2*60)

## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      811            630       101     1047            830       137 MQ     
##  2  2013     1     1      848           1835       853     1001           1950       851 MQ     
##  3  2013     1     1      957            733       144     1056            853       123 UA     
##  4  2013     1     1     1114            900       134     1447           1222       145 UA     
##  5  2013     1     1     1505           1310       115     1638           1431       127 EV     
##  6  2013     1     1     1525           1340       105     1831           1626       125 B6     
##  7  2013     1     1     1549           1445        64     1912           1656       136 EV     
##  8  2013     1     1     1558           1359       119     1718           1515       123 EV     
##  9  2013     1     1     1732           1630        62     2028           1825       123 EV     
## 10  2013     1     1     1803           1620       103     2008           1750       138 MQ     
## # ... with 10,190 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.2. Find all flights that flew to Houston (IAH or HOU)

filter(flights, dest %in% c('IAH', 'HOU'))

## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      623            627        -4      933            932         1 UA     
##  4  2013     1     1      728            732        -4     1041           1038         3 UA     
##  5  2013     1     1      739            739         0     1104           1038        26 UA     
##  6  2013     1     1      908            908         0     1228           1219         9 UA     
##  7  2013     1     1     1028           1026         2     1350           1339        11 UA     
##  8  2013     1     1     1044           1045        -1     1352           1351         1 UA     
##  9  2013     1     1     1114            900       134     1447           1222       145 UA     
## 10  2013     1     1     1205           1200         5     1503           1505        -2 UA     
## # ... with 9,303 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.3. Find all flights that were operated by United, American, or Delta

Note: The airline codes are available in the data frame airlines.

filter(flights, carrier %in% c('UA', 'AA', 'DL'))

## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      542            540         2      923            850        33 AA     
##  4  2013     1     1      554            600        -6      812            837       -25 DL     
##  5  2013     1     1      554            558        -4      740            728        12 UA     
##  6  2013     1     1      558            600        -2      753            745         8 AA     
##  7  2013     1     1      558            600        -2      924            917         7 UA     
##  8  2013     1     1      558            600        -2      923            937       -14 UA     
##  9  2013     1     1      559            600        -1      941            910        31 AA     
## 10  2013     1     1      559            600        -1      854            902        -8 UA     
## # ... with 139,494 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.4. Find all flights that departed in summer (July, August, and September)

# alternatives are for example:
# filter(flights, month >= 7 & month <= 9)
# filter(flights, month %in% c(7, 8, 9))
filter(flights, month %in% 7:9)

## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1        1           2029       212      236           2359       157 B6     
##  2  2013     7     1        2           2359         3      344            344         0 B6     
##  3  2013     7     1       29           2245       104      151              1       110 B6     
##  4  2013     7     1       43           2130       193      322             14       188 B6     
##  5  2013     7     1       44           2150       174      300            100       120 AA     
##  6  2013     7     1       46           2051       235      304           2358       186 B6     
##  7  2013     7     1       48           2001       287      308           2305       243 VX     
##  8  2013     7     1       58           2155       183      335             43       172 B6     
##  9  2013     7     1      100           2146       194      327             30       177 B6     
## 10  2013     7     1      100           2245       135      337            135       122 B6     
## # ... with 86,316 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.5. Find all flights that arrived more than two hours late, but didn’t leave late

filter(flights, arr_delay > 2*60, dep_delay <= 0)

## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1    27     1419           1420        -1     1754           1550       124 MQ     
##  2  2013    10     7     1350           1350         0     1736           1526       130 EV     
##  3  2013    10     7     1357           1359        -2     1858           1654       124 AA     
##  4  2013    10    16      657            700        -3     1258           1056       122 B6     
##  5  2013    11     1      658            700        -2     1329           1015       194 VX     
##  6  2013     3    18     1844           1847        -3       39           2219       140 UA     
##  7  2013     4    17     1635           1640        -5     2049           1845       124 MQ     
##  8  2013     4    18      558            600        -2     1149            850       179 AA     
##  9  2013     4    18      655            700        -5     1213            950       143 AA     
## 10  2013     5    22     1827           1830        -3     2217           2010       127 MQ     
## # ... with 19 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.6. Find all flights that were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60, arr_delay < 30)

## # A tibble: 206 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     3     1850           1745        65     2148           2120        28 AA     
##  2  2013     1     3     1950           1845        65     2228           2227         1 B6     
##  3  2013     1     3     2015           1915        60     2135           2111        24 9E     
##  4  2013     1     6     1019            900        79     1558           1530        28 HA     
##  5  2013     1     7     1543           1430        73     1758           1735        23 AA     
##  6  2013     1    11     1020            920        60     1311           1245        26 AA     
##  7  2013     1    12     1706           1600        66     1949           1927        22 DL     
##  8  2013     1    12     1953           1845        68     2154           2137        17 9E     
##  9  2013     1    19     1456           1355        61     1636           1615        21 EV     
## 10  2013     1    21     1531           1430        61     1843           1815        28 DL     
## # ... with 196 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.7. Find all flights that departed between midnight and 6am

filter(flights, dep_time >= 0, dep_time <= 600)

## # A tibble: 9,344 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      542            540         2      923            850        33 AA     
##  4  2013     1     1      544            545        -1     1004           1022       -18 B6     
##  5  2013     1     1      554            600        -6      812            837       -25 DL     
##  6  2013     1     1      554            558        -4      740            728        12 UA     
##  7  2013     1     1      555            600        -5      913            854        19 B6     
##  8  2013     1     1      557            600        -3      709            723       -14 EV     
##  9  2013     1     1      557            600        -3      838            846        -8 B6     
## 10  2013     1     1      558            600        -2      753            745         8 AA     
## # ... with 9,334 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 2

Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

According to ?between() it’s a shortcut for x >= left & x <= right, i.e. to check if a value \(x\) is in the range \([\text{left}, \text{right}]\).

We can simplify the code for exercise 1.4 and 1.7 using between():

# 1.4. Find all flights that departed in summer (July, August, and September)

filter(flights, between(month, 7, 9))

## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     7     1        1           2029       212      236           2359       157 B6     
##  2  2013     7     1        2           2359         3      344            344         0 B6     
##  3  2013     7     1       29           2245       104      151              1       110 B6     
##  4  2013     7     1       43           2130       193      322             14       188 B6     
##  5  2013     7     1       44           2150       174      300            100       120 AA     
##  6  2013     7     1       46           2051       235      304           2358       186 B6     
##  7  2013     7     1       48           2001       287      308           2305       243 VX     
##  8  2013     7     1       58           2155       183      335             43       172 B6     
##  9  2013     7     1      100           2146       194      327             30       177 B6     
## 10  2013     7     1      100           2245       135      337            135       122 B6     
## # ... with 86,316 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# 1.7. Find all flights that departed between midnight and 6am

filter(flights, between(dep_time, 0, 600))

## # A tibble: 9,344 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      542            540         2      923            850        33 AA     
##  4  2013     1     1      544            545        -1     1004           1022       -18 B6     
##  5  2013     1     1      554            600        -6      812            837       -25 DL     
##  6  2013     1     1      554            558        -4      740            728        12 UA     
##  7  2013     1     1      555            600        -5      913            854        19 B6     
##  8  2013     1     1      557            600        -3      709            723       -14 EV     
##  9  2013     1     1      557            600        -3      838            846        -8 B6     
## 10  2013     1     1      558            600        -2      753            745         8 AA     
## # ... with 9,334 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2. Complete exercises 1 to 3 from section 5.3.1 in R for Data Science

Exercise 1

One could at first sort for is.na(<VARIABLE>) but in decreasing order (because the logical vector from is.na() is sorted by FALSE (= 0) to TRUE (= 1) by default (ascending order) and we have to reverse that).

arrange(flights, desc(is.na(dep_time)), dep_time)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1       NA           1630        NA       NA           1815        NA EV     
##  2  2013     1     1       NA           1935        NA       NA           2240        NA AA     
##  3  2013     1     1       NA           1500        NA       NA           1825        NA AA     
##  4  2013     1     1       NA            600        NA       NA            901        NA B6     
##  5  2013     1     2       NA           1540        NA       NA           1747        NA EV     
##  6  2013     1     2       NA           1620        NA       NA           1746        NA EV     
##  7  2013     1     2       NA           1355        NA       NA           1459        NA EV     
##  8  2013     1     2       NA           1420        NA       NA           1644        NA EV     
##  9  2013     1     2       NA           1321        NA       NA           1536        NA EV     
## 10  2013     1     2       NA           1545        NA       NA           1910        NA AA     
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 2

# most delayed flights first:
arrange(flights, desc(dep_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     9      641            900      1301     1242           1530      1272 HA     
##  2  2013     6    15     1432           1935      1137     1607           2120      1127 MQ     
##  3  2013     1    10     1121           1635      1126     1239           1810      1109 MQ     
##  4  2013     9    20     1139           1845      1014     1457           2210      1007 AA     
##  5  2013     7    22      845           1600      1005     1044           1815       989 MQ     
##  6  2013     4    10     1100           1900       960     1342           2211       931 DL     
##  7  2013     3    17     2321            810       911      135           1020       915 DL     
##  8  2013     6    27      959           1900       899     1236           2226       850 DL     
##  9  2013     7    22     2257            759       898      121           1026       895 DL     
## 10  2013    12     5      756           1700       896     1058           2020       878 AA     
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# flights that left earliest ("negative delay")
arrange(flights, dep_delay)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013    12     7     2040           2123       -43       40           2352        48 B6     
##  2  2013     2     3     2022           2055       -33     2240           2338       -58 DL     
##  3  2013    11    10     1408           1440       -32     1549           1559       -10 EV     
##  4  2013     1    11     1900           1930       -30     2233           2243       -10 DL     
##  5  2013     1    29     1703           1730       -27     1947           1957       -10 F9     
##  6  2013     8     9      729            755       -26     1002            955         7 MQ     
##  7  2013    10    23     1907           1932       -25     2143           2143         0 EV     
##  8  2013     3    30     2030           2055       -25     2213           2250       -37 MQ     
##  9  2013     3     2     1431           1455       -24     1601           1631       -30 9E     
## 10  2013     5     5      934            958       -24     1225           1309       -44 B6     
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise 3

To sort by fastest flights, we sort by speed in descending order. Speed is calculated as distance divided by air_time.

arrange(flights, desc(distance / air_time))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     5    25     1709           1700         9     1923           1937       -14 DL     
##  2  2013     7     2     1558           1513        45     1745           1719        26 EV     
##  3  2013     5    13     2040           2025        15     2225           2226        -1 EV     
##  4  2013     3    23     1914           1910         4     2045           2043         2 EV     
##  5  2013     1    12     1559           1600        -1     1849           1917       -28 DL     
##  6  2013    11    17      650            655        -5     1059           1150       -51 DL     
##  7  2013     2    21     2355           2358        -3      412            438       -26 B6     
##  8  2013    11    17      759            800        -1     1212           1255       -43 AA     
##  9  2013    11    16     2003           1925        38       17             36       -19 DL     
## 10  2013    11    16     2349           2359       -10      402            440       -38 B6     
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3. Complete exercises 1 and 2 from section 5.4.1 in R for Data Science

Excercise 1

These are three ways I came up with. There are probably more. The results are all the same, therefore I only printed them for the last command.

select(flights, dep_time, dep_delay, arr_time, arr_delay)

select(flights, starts_with('dep_'), starts_with('arr_'))

select(flights, c(4, 6, 7, 9))   # using column indices instead of names

## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows

Excercise 2

If you include the same variable name several times in a select() call, it will not be repeated in the resulting data frame. Only the first occurrence of the variable will be kept:

select(flights, carrier, dep_time, carrier)

## # A tibble: 336,776 x 2
##    carrier dep_time
##    <chr>      <int>
##  1 UA           517
##  2 UA           533
##  3 AA           542
##  4 B6           544
##  5 DL           554
##  6 UA           554
##  7 B6           555
##  8 EV           557
##  9 B6           557
## 10 AA           558
## # ... with 336,766 more rows

4. Download and unzip `05transform1-resources.zip` from the course website, read `codebook.txt` and complete the following tasks

4.1. Load the CSV file `schulen_potsdam.csv` into R without specifying further parameters for `read.csv()`

Have a look at the data using functions like str() and head(). Do you spot any potential problems?

schulen <- read.csv('05transform1-resources/schulen_potsdam.csv')
str(schulen)

## 'data.frame':    1340 obs. of  8 variables:
##  $ jahr           : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ traeger        : Factor w/ 2 levels "oeff","priv": 1 1 1 1 1 1 1 1 1 1 ...
##  $ art_reduziert  : Factor w/ 3 levels "gs","mit_gym_os",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ priv_schule_typ: Factor w/ 4 levels "","Frei","Kirchlich",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ name           : Factor w/ 217 levels "\"Eisenhart-Schule\"",..: 101 181 90 197 109 70 192 67 98 105 ...
##  $ strasse        : Factor w/ 114 levels "","(befristet für ca. 1 Jahr) Schilfhof 23-",..: 4 9 11 14 21 25 26 27 28 31 ...
##  $ plz            : int  1585 14467 1570 1590 1560 1560 1597 1560 1580 1599 ...
##  $ ort            : Factor w/ 1 level "Potsdam": 1 1 1 1 1 1 1 1 1 1 ...

head(schulen)

##   jahr traeger art_reduziert priv_schule_typ                               name
## 1 1992    oeff            gs                                     Grundschule 35
## 2 1992    oeff            gs                        Rosa-Luxemburg-Gesamtschule
## 3 1992    oeff            gs                 Grundschule 12 "Gerhart Hauptmann"
## 4 1992    oeff            gs                             Schule am Griebnitzsee
## 5 1992    oeff            gs                                      Grundschule 8
## 6 1992    oeff            gs                                        Grundschule
##                        strasse   plz     ort
## 1      An der Alten Zauche 2 c  1585 Potsdam
## 2              Burgstraße 23 a 14467 Potsdam
## 3 Carl-von-Ossietzky-Straße 37  1570 Potsdam
## 4               Domstraße 14 b  1590 Potsdam
## 5            Dortustraße 28129  1560 Potsdam
## 6                 Finkenweg 15  1560 Potsdam

An obvious problem is that the ZIP codes in plz were converted to integers, which caused leading zeros like in “01585” to disappear and forming a number like 1585.

Another potential problem is that all character string variables were converted to factors, although variables like “name” or “strasse” are not categorical variables.

4.2. Load the data again, this time specifying two additional parameters for `read.csv()`: `stringsAsFactors = FALSE, colClasses = c(plz = "character")`

Inspect the result. What is the effect of these parameters? Are all problems now fixed and do the variable types match the specifications in the codebook (see codebook.txt also contained in the zip-file)? If not, convert these variables to the correct data type using mutate() and as.factor().

schulen <- read.csv('05transform1-resources/schulen_potsdam.csv',
                    stringsAsFactors = FALSE,
                    colClasses = c(plz = "character"))
str(schulen)

## 'data.frame':    1340 obs. of  8 variables:
##  $ jahr           : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ traeger        : chr  "oeff" "oeff" "oeff" "oeff" ...
##  $ art_reduziert  : chr  "gs" "gs" "gs" "gs" ...
##  $ priv_schule_typ: chr  "" "" "" "" ...
##  $ name           : chr  "Grundschule 35" "Rosa-Luxemburg-Gesamtschule" "Grundschule 12 \"Gerhart Hauptmann\"" "Schule am Griebnitzsee" ...
##  $ strasse        : chr  "An der Alten Zauche 2 c" "Burgstraße 23 a" "Carl-von-Ossietzky-Straße 37" "Domstraße 14 b" ...
##  $ plz            : chr  "01585" "14467" "01570" "01590" ...
##  $ ort            : chr  "Potsdam" "Potsdam" "Potsdam" "Potsdam" ...

The problem with the ZIP code is fixed (by explicitely specifying a data type or “class” for the variable plz with colClasses = c(plz = "character")). However, some variables like traeger or art_reduziert are now character strings although they should be factors / categoricals according to the codebook. We can convert them to factors with mutate(). At first we need to load the tidyverse package for this:

library(tidyverse)

And now we update the data types of the variables:

schulen <- mutate(schulen, traeger = as.factor(traeger),
                  art_reduziert = as.factor(art_reduziert),
                  priv_schule_typ = as.factor(priv_schule_typ))
str(schulen)

## 'data.frame':    1340 obs. of  8 variables:
##  $ jahr           : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ traeger        : Factor w/ 2 levels "oeff","priv": 1 1 1 1 1 1 1 1 1 1 ...
##  $ art_reduziert  : Factor w/ 3 levels "gs","mit_gym_os",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ priv_schule_typ: Factor w/ 4 levels "","Frei","Kirchlich",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ name           : chr  "Grundschule 35" "Rosa-Luxemburg-Gesamtschule" "Grundschule 12 \"Gerhart Hauptmann\"" "Schule am Griebnitzsee" ...
##  $ strasse        : chr  "An der Alten Zauche 2 c" "Burgstraße 23 a" "Carl-von-Ossietzky-Straße 37" "Domstraße 14 b" ...
##  $ plz            : chr  "01585" "14467" "01570" "01590" ...
##  $ ort            : chr  "Potsdam" "Potsdam" "Potsdam" "Potsdam" ...

4.3. Create a new variable `full_address` using `mutate()`

This new variable should contain the full address consisting of street name, zip code and city name, e.g. “Carl-von-Ossietzky-Straße 37 01570 Potsdam”. You can combine several strings to form one string using the function paste().

schulen <- mutate(schulen, full_address = paste(strasse, plz, ort))
head(schulen)

##   jahr traeger art_reduziert priv_schule_typ                               name
## 1 1992    oeff            gs                                     Grundschule 35
## 2 1992    oeff            gs                        Rosa-Luxemburg-Gesamtschule
## 3 1992    oeff            gs                 Grundschule 12 "Gerhart Hauptmann"
## 4 1992    oeff            gs                             Schule am Griebnitzsee
## 5 1992    oeff            gs                                      Grundschule 8
## 6 1992    oeff            gs                                        Grundschule
##                        strasse   plz     ort                               full_address
## 1      An der Alten Zauche 2 c 01585 Potsdam      An der Alten Zauche 2 c 01585 Potsdam
## 2              Burgstraße 23 a 14467 Potsdam              Burgstraße 23 a 14467 Potsdam
## 3 Carl-von-Ossietzky-Straße 37 01570 Potsdam Carl-von-Ossietzky-Straße 37 01570 Potsdam
## 4               Domstraße 14 b 01590 Potsdam               Domstraße 14 b 01590 Potsdam
## 5            Dortustraße 28129 01560 Potsdam            Dortustraße 28129 01560 Potsdam
## 6                 Finkenweg 15 01560 Potsdam                 Finkenweg 15 01560 Potsdam

Optional: 5. Complete exercises 1 and 2 from section 5.5.2 in R for Data Science

Important notes for the last excercise: I think the excercises from section 5.5.2 are too challenging for beginners without any guidance. Therefore I made it optional and I will provide a little more explanation here.

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

So what’s the problem with those variables anyway? What does it mean that “they’re not really continuous numbers”?

The documentation says for those variables that they are “actual departure and arrival times (format HHMM or HMM)”. Let’s have a look at some numbers:

library(tidyverse)
library(nycflights13)

sample(flights$dep_time, 10)

##  [1] 1121 2106 1530 1848 2134 1354 1155 1727 1638  554

We know that these are integer variables, but what do they mean? According to the documentation, we should treat the last two digits as minutes and the first one or two digits as hours “(format HHMM or HMM)”. So the above numbers mean “21:24” (so 9:24pm), 6:51am, etc.

Now what’s the problem with these numbers? The problem is that they combine two units of measurement into one: hours and minutes. The range of minutes is only between 0 and 60, and the range of hours is only from 0 to 23, so combining them in this way makes them not continuous and hard to compute with. For example, the difference between 5:10am and 4:50am is 20 minutes, right? Well, simply substracting the numbers as they are formatted in the data frame would yield 60 minutes instead:

510 - 450

## [1] 60

So how do we treat this? We need to convert them to continuous numbers! The excercise notes already give the correct hint: “Convert them to a more convenient representation of number of minutes since midnight.” How can we do that?

The general idea is that we need to split the hours and the minutes. Once we have both as separate numbers, we can easily calculate the number of minutes since midnight: \(h \cdot 60 + m\) with \(h\) being the hours and \(m\) being the minutes. Let’s suppose we already split the time components and did the calculation with the same numbers as above. Would we get the right result now?

(5 * 60 + 10) - (4 * 60 + 50)

## [1] 20

Seems correct! But the hard part is now splitting the time components. One way would be to convert the integers to character strings and do string slicing. However, there is an easier way by using “modular arithmetic”. Don’t worry, this sounds more intriguing than it is!

In order to “cut off” the last two digits of our combined time number, we need to divide it by 100. However, by this we would get a fraction most of the time, but we only need the integer part so we need to round always to the lower number. The function floor() does that for us:

floor(510 / 100)

## [1] 5

A shorter version of this is the “integer division” operator %/%:

510 %/% 100

## [1] 5

How do we get minutes component? We can use the “modulo” operator %% which gives us the remainder of an integer division. For example, if you do integer division of 13 by 6 you get 2 and a remainder of 1 (because \(6 \cdot 2 + 1 = 13\)). Let’s apply it to a combined time format number:

510 %% 100

## [1] 10

So we now have a way to split hours and minutes from the combined time numbers. We can write a general rule for calculating the numbers of minutes since midnight from any combined time number t: (t %/% 100) * 60 + (t %% 100).

Whenever we have something like a general rule or algorithm and we want to use it several times (i.e. for several variables), we should define a function for it, so that we don’t repeat ourselves. We will learn how to define our own functions in the next session so for now I will create the function and call it normalize_time:

normalize_time <- function(t) {
  (t %/% 100) * 60 + t %% 100
}

We can now use it with any number in combined time format to get the number of minutes since midnight:

normalize_time(510)

## [1] 310

Of course we can pass integer vectors:

normalize_time(c(510, 450))

## [1] 310 290

We can now also calculate the correct result for the time differences in minutes:

normalize_time(510) - normalize_time(450)

## [1] 20

From now on, the rest of the tasks should be easier to solve. I recommend that you at first create a new data frame consisting only of the variables that you need to solve the tasks (so use select()!). This provides better overview. You can then apply the normalize_time() function to the time variables using mutate().

Excercise 1

Create a subset of the variables for better overview.

(fl_times <- select(flights, dep_time, sched_dep_time, arr_time, sched_arr_time, air_time))

## # A tibble: 336,776 x 5
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # ... with 336,766 more rows

Define the function that converts the combined time format to number of minutes since midnight.

normalize_time <- function(t) {
  (t %/% 100) * 60 + t %% 100
}

Apply this function to dep_time and sched_dep_time.

fl_times_norm <- mutate(fl_times, dep_time = normalize_time(dep_time),
                                  sched_dep_time = normalize_time(sched_dep_time))
fl_times_norm

## # A tibble: 336,776 x 5
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <dbl>          <dbl>    <int>          <int>    <dbl>
##  1      317            315      830            819      227
##  2      333            329      850            830      227
##  3      342            340      923            850      160
##  4      344            345     1004           1022      183
##  5      354            360      812            837      116
##  6      354            358      740            728      150
##  7      355            360      913            854      158
##  8      357            360      709            723       53
##  9      357            360      838            846      140
## 10      358            360      753            745      138
## # ... with 336,766 more rows

Excercise 2

At first we compare the variables:

transmute(fl_times_norm, dep_time, arr_time, arr_time - dep_time, air_time)

## # A tibble: 336,776 x 4
##    dep_time arr_time `arr_time - dep_time` air_time
##       <dbl>    <int>                 <dbl>    <dbl>
##  1      317      830                   513      227
##  2      333      850                   517      227
##  3      342      923                   581      160
##  4      344     1004                   660      183
##  5      354      812                   458      116
##  6      354      740                   386      150
##  7      355      913                   558      158
##  8      357      709                   352       53
##  9      357      838                   481      140
## 10      358      753                   395      138
## # ... with 336,766 more rows

air_time should match the difference between arrival and departure time, but it’s way off. First of all, we should also convert arr_time to the number of minutes since midnight, because we only did this for dep_time and sched_dep_time so far:

fl_times_norm <- mutate(fl_times_norm, arr_time = normalize_time(arr_time),
                                       sched_arr_time = normalize_time(sched_arr_time))
transmute(fl_times_norm, dep_time, arr_time, arr_time - dep_time, air_time)

## # A tibble: 336,776 x 4
##    dep_time arr_time `arr_time - dep_time` air_time
##       <dbl>    <dbl>                 <dbl>    <dbl>
##  1      317      510                   193      227
##  2      333      530                   197      227
##  3      342      563                   221      160
##  4      344      604                   260      183
##  5      354      492                   138      116
##  6      354      460                   106      150
##  7      355      553                   198      158
##  8      357      429                    72       53
##  9      357      518                   161      140
## 10      358      473                   115      138
## # ... with 336,766 more rows

Still, the numbers do not match, although the difference is not as big any more. The reason for that could be that the time zones are different or that the air time is not actually the difference between officially recorded departure and arrival.

R Tutorial at the WZB

Tasks for session 5 - Transforming data with R I

Markus Konrad

November 22, 2018

Tasks

1. Complete exercises 1 to 3 from section 5.2.4 in R for Data Science

Exercise 1

1.1. Find all flights that had an arrival delay of two or more hours

1.2. Find all flights that flew to Houston (IAH or HOU)

1.3. Find all flights that were operated by United, American, or Delta

1.4. Find all flights that departed in summer (July, August, and September)

1.5. Find all flights that arrived more than two hours late, but didn’t leave late

1.6. Find all flights that were delayed by at least an hour, but made up over 30 minutes in flight

1.7. Find all flights that departed between midnight and 6am

Exercise 2

2. Complete exercises 1 to 3 from section 5.3.1 in R for Data Science

Exercise 1

Exercise 2

Exercise 3

3. Complete exercises 1 and 2 from section 5.4.1 in R for Data Science

Excercise 1

Excercise 2

4. Download and unzip `05transform1-resources.zip` from the course website, read `codebook.txt` and complete the following tasks

4.1. Load the CSV file `schulen_potsdam.csv` into R without specifying further parameters for `read.csv()`

4.2. Load the data again, this time specifying two additional parameters for `read.csv()`: `stringsAsFactors = FALSE, colClasses = c(plz = "character")`

4.3. Create a new variable `full_address` using `mutate()`

Optional: 5. Complete exercises 1 and 2 from section 5.5.2 in R for Data Science

Excercise 1

Excercise 2

R Tutorial at the WZB

Tasks for session 5 - Transforming data with R I

Markus Konrad

November 22, 2018

Tasks

1. Complete exercises 1 to 3 from section 5.2.4 in R for Data Science

Exercise 1

1.1. Find all flights that had an arrival delay of two or more hours

1.2. Find all flights that flew to Houston (IAH or HOU)

1.3. Find all flights that were operated by United, American, or Delta

1.4. Find all flights that departed in summer (July, August, and September)

1.5. Find all flights that arrived more than two hours late, but didn’t leave late

1.6. Find all flights that were delayed by at least an hour, but made up over 30 minutes in flight

1.7. Find all flights that departed between midnight and 6am

Exercise 2

2. Complete exercises 1 to 3 from section 5.3.1 in R for Data Science

Exercise 1

Exercise 2

Exercise 3

3. Complete exercises 1 and 2 from section 5.4.1 in R for Data Science

Excercise 1

Excercise 2

4. Download and unzip 05transform1-resources.zip from the course website, read codebook.txt and complete the following tasks

4.1. Load the CSV file schulen_potsdam.csv into R without specifying further parameters for read.csv()

4.2. Load the data again, this time specifying two additional parameters for read.csv(): stringsAsFactors = FALSE, colClasses = c(plz = "character")

4.3. Create a new variable full_address using mutate()

Optional: 5. Complete exercises 1 and 2 from section 5.5.2 in R for Data Science

Excercise 1

Excercise 2

4. Download and unzip `05transform1-resources.zip` from the course website, read `codebook.txt` and complete the following tasks

4.1. Load the CSV file `schulen_potsdam.csv` into R without specifying further parameters for `read.csv()`

4.2. Load the data again, this time specifying two additional parameters for `read.csv()`: `stringsAsFactors = FALSE, colClasses = c(plz = "character")`

4.3. Create a new variable `full_address` using `mutate()`