05transform1-resources.zip
from the course website, read codebook.txt
and complete the following tasksNote: You need to install the packages nycflights13 and tidyverse in order to complete the exercises.
At first we load the required libraries (this is necessary for all tasks and I won’t repeat it for the other tasks, too).
library(tidyverse)
library(nycflights13)
We should check the documentation of the data in order to see what the variables mean and what their units are:
?flights
We should have a look at the whole data frame first:
View(flights)
We can also find out how many rows and columns we have:
dim(flights)
## [1] 336776 19
More than 330,000 rows and 19 columns!
You only had to do 3 out of the 7 tasks here, but I’ll show solutions for all tasks.
Departure and arrival delays are given in minutes. Hence we need to filter for an arrival delay of at least 120 minutes:
filter(flights, arr_delay >= 2*60)
## # A tibble: 10,200 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 811 630 101 1047 830 137 MQ
## 2 2013 1 1 848 1835 853 1001 1950 851 MQ
## 3 2013 1 1 957 733 144 1056 853 123 UA
## 4 2013 1 1 1114 900 134 1447 1222 145 UA
## 5 2013 1 1 1505 1310 115 1638 1431 127 EV
## 6 2013 1 1 1525 1340 105 1831 1626 125 B6
## 7 2013 1 1 1549 1445 64 1912 1656 136 EV
## 8 2013 1 1 1558 1359 119 1718 1515 123 EV
## 9 2013 1 1 1732 1630 62 2028 1825 123 EV
## 10 2013 1 1 1803 1620 103 2008 1750 138 MQ
## # ... with 10,190 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dest %in% c('IAH', 'HOU'))
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 623 627 -4 933 932 1 UA
## 4 2013 1 1 728 732 -4 1041 1038 3 UA
## 5 2013 1 1 739 739 0 1104 1038 26 UA
## 6 2013 1 1 908 908 0 1228 1219 9 UA
## 7 2013 1 1 1028 1026 2 1350 1339 11 UA
## 8 2013 1 1 1044 1045 -1 1352 1351 1 UA
## 9 2013 1 1 1114 900 134 1447 1222 145 UA
## 10 2013 1 1 1205 1200 5 1503 1505 -2 UA
## # ... with 9,303 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note: The airline codes are available in the data frame airlines
.
filter(flights, carrier %in% c('UA', 'AA', 'DL'))
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 554 600 -6 812 837 -25 DL
## 5 2013 1 1 554 558 -4 740 728 12 UA
## 6 2013 1 1 558 600 -2 753 745 8 AA
## 7 2013 1 1 558 600 -2 924 917 7 UA
## 8 2013 1 1 558 600 -2 923 937 -14 UA
## 9 2013 1 1 559 600 -1 941 910 31 AA
## 10 2013 1 1 559 600 -1 854 902 -8 UA
## # ... with 139,494 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# alternatives are for example:
# filter(flights, month >= 7 & month <= 9)
# filter(flights, month %in% c(7, 8, 9))
filter(flights, month %in% 7:9)
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # ... with 86,316 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, arr_delay > 2*60, dep_delay <= 0)
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 27 1419 1420 -1 1754 1550 124 MQ
## 2 2013 10 7 1350 1350 0 1736 1526 130 EV
## 3 2013 10 7 1357 1359 -2 1858 1654 124 AA
## 4 2013 10 16 657 700 -3 1258 1056 122 B6
## 5 2013 11 1 658 700 -2 1329 1015 194 VX
## 6 2013 3 18 1844 1847 -3 39 2219 140 UA
## 7 2013 4 17 1635 1640 -5 2049 1845 124 MQ
## 8 2013 4 18 558 600 -2 1149 850 179 AA
## 9 2013 4 18 655 700 -5 1213 950 143 AA
## 10 2013 5 22 1827 1830 -3 2217 2010 127 MQ
## # ... with 19 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dep_delay >= 60, arr_delay < 30)
## # A tibble: 206 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 3 1850 1745 65 2148 2120 28 AA
## 2 2013 1 3 1950 1845 65 2228 2227 1 B6
## 3 2013 1 3 2015 1915 60 2135 2111 24 9E
## 4 2013 1 6 1019 900 79 1558 1530 28 HA
## 5 2013 1 7 1543 1430 73 1758 1735 23 AA
## 6 2013 1 11 1020 920 60 1311 1245 26 AA
## 7 2013 1 12 1706 1600 66 1949 1927 22 DL
## 8 2013 1 12 1953 1845 68 2154 2137 17 9E
## 9 2013 1 19 1456 1355 61 1636 1615 21 EV
## 10 2013 1 21 1531 1430 61 1843 1815 28 DL
## # ... with 196 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dep_time >= 0, dep_time <= 600)
## # A tibble: 9,344 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # ... with 9,334 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Another useful dplyr filtering helper is between()
. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
According to ?between()
it’s a shortcut for x >= left & x <= right
, i.e. to check if a value \(x\) is in the range \([\text{left}, \text{right}]\).
We can simplify the code for exercise 1.4 and 1.7 using between()
:
# 1.4. Find all flights that departed in summer (July, August, and September)
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # ... with 86,316 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# 1.7. Find all flights that departed between midnight and 6am
filter(flights, between(dep_time, 0, 600))
## # A tibble: 9,344 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # ... with 9,334 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
One could at first sort for is.na(<VARIABLE>)
but in decreasing order (because the logical vector from is.na()
is sorted by FALSE
(= 0) to TRUE
(= 1) by default (ascending order) and we have to reverse that).
arrange(flights, desc(is.na(dep_time)), dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 NA 1630 NA NA 1815 NA EV
## 2 2013 1 1 NA 1935 NA NA 2240 NA AA
## 3 2013 1 1 NA 1500 NA NA 1825 NA AA
## 4 2013 1 1 NA 600 NA NA 901 NA B6
## 5 2013 1 2 NA 1540 NA NA 1747 NA EV
## 6 2013 1 2 NA 1620 NA NA 1746 NA EV
## 7 2013 1 2 NA 1355 NA NA 1459 NA EV
## 8 2013 1 2 NA 1420 NA NA 1644 NA EV
## 9 2013 1 2 NA 1321 NA NA 1536 NA EV
## 10 2013 1 2 NA 1545 NA NA 1910 NA AA
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# most delayed flights first:
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL
## 7 2013 3 17 2321 810 911 135 1020 915 DL
## 8 2013 6 27 959 1900 899 1236 2226 850 DL
## 9 2013 7 22 2257 759 898 121 1026 895 DL
## 10 2013 12 5 756 1700 896 1058 2020 878 AA
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# flights that left earliest ("negative delay")
arrange(flights, dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 12 7 2040 2123 -43 40 2352 48 B6
## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL
## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV
## 4 2013 1 11 1900 1930 -30 2233 2243 -10 DL
## 5 2013 1 29 1703 1730 -27 1947 1957 -10 F9
## 6 2013 8 9 729 755 -26 1002 955 7 MQ
## 7 2013 10 23 1907 1932 -25 2143 2143 0 EV
## 8 2013 3 30 2030 2055 -25 2213 2250 -37 MQ
## 9 2013 3 2 1431 1455 -24 1601 1631 -30 9E
## 10 2013 5 5 934 958 -24 1225 1309 -44 B6
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
To sort by fastest flights, we sort by speed in descending order. Speed is calculated as distance
divided by air_time
.
arrange(flights, desc(distance / air_time))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 5 25 1709 1700 9 1923 1937 -14 DL
## 2 2013 7 2 1558 1513 45 1745 1719 26 EV
## 3 2013 5 13 2040 2025 15 2225 2226 -1 EV
## 4 2013 3 23 1914 1910 4 2045 2043 2 EV
## 5 2013 1 12 1559 1600 -1 1849 1917 -28 DL
## 6 2013 11 17 650 655 -5 1059 1150 -51 DL
## 7 2013 2 21 2355 2358 -3 412 438 -26 B6
## 8 2013 11 17 759 800 -1 1212 1255 -43 AA
## 9 2013 11 16 2003 1925 38 17 36 -19 DL
## 10 2013 11 16 2349 2359 -10 402 440 -38 B6
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
These are three ways I came up with. There are probably more. The results are all the same, therefore I only printed them for the last command.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with('dep_'), starts_with('arr_'))
select(flights, c(4, 6, 7, 9)) # using column indices instead of names
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
If you include the same variable name several times in a select()
call, it will not be repeated in the resulting data frame. Only the first occurrence of the variable will be kept:
select(flights, carrier, dep_time, carrier)
## # A tibble: 336,776 x 2
## carrier dep_time
## <chr> <int>
## 1 UA 517
## 2 UA 533
## 3 AA 542
## 4 B6 544
## 5 DL 554
## 6 UA 554
## 7 B6 555
## 8 EV 557
## 9 B6 557
## 10 AA 558
## # ... with 336,766 more rows
05transform1-resources.zip
from the course website, read codebook.txt
and complete the following tasksschulen_potsdam.csv
into R without specifying further parameters for read.csv()
Have a look at the data using functions like str()
and head()
. Do you spot any potential problems?
schulen <- read.csv('05transform1-resources/schulen_potsdam.csv')
str(schulen)
## 'data.frame': 1340 obs. of 8 variables:
## $ jahr : int 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
## $ traeger : Factor w/ 2 levels "oeff","priv": 1 1 1 1 1 1 1 1 1 1 ...
## $ art_reduziert : Factor w/ 3 levels "gs","mit_gym_os",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ priv_schule_typ: Factor w/ 4 levels "","Frei","Kirchlich",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 217 levels "\"Eisenhart-Schule\"",..: 101 181 90 197 109 70 192 67 98 105 ...
## $ strasse : Factor w/ 114 levels "","(befristet für ca. 1 Jahr) Schilfhof 23-",..: 4 9 11 14 21 25 26 27 28 31 ...
## $ plz : int 1585 14467 1570 1590 1560 1560 1597 1560 1580 1599 ...
## $ ort : Factor w/ 1 level "Potsdam": 1 1 1 1 1 1 1 1 1 1 ...
head(schulen)
## jahr traeger art_reduziert priv_schule_typ name
## 1 1992 oeff gs Grundschule 35
## 2 1992 oeff gs Rosa-Luxemburg-Gesamtschule
## 3 1992 oeff gs Grundschule 12 "Gerhart Hauptmann"
## 4 1992 oeff gs Schule am Griebnitzsee
## 5 1992 oeff gs Grundschule 8
## 6 1992 oeff gs Grundschule
## strasse plz ort
## 1 An der Alten Zauche 2 c 1585 Potsdam
## 2 Burgstraße 23 a 14467 Potsdam
## 3 Carl-von-Ossietzky-Straße 37 1570 Potsdam
## 4 Domstraße 14 b 1590 Potsdam
## 5 Dortustraße 28129 1560 Potsdam
## 6 Finkenweg 15 1560 Potsdam
An obvious problem is that the ZIP codes in plz
were converted to integers, which caused leading zeros like in “01585” to disappear and forming a number like 1585.
Another potential problem is that all character string variables were converted to factors, although variables like “name” or “strasse” are not categorical variables.
read.csv()
: stringsAsFactors = FALSE, colClasses = c(plz = "character")
Inspect the result. What is the effect of these parameters? Are all problems now fixed and do the variable types match the specifications in the codebook (see codebook.txt
also contained in the zip-file)? If not, convert these variables to the correct data type using mutate()
and as.factor()
.
schulen <- read.csv('05transform1-resources/schulen_potsdam.csv',
stringsAsFactors = FALSE,
colClasses = c(plz = "character"))
str(schulen)
## 'data.frame': 1340 obs. of 8 variables:
## $ jahr : int 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
## $ traeger : chr "oeff" "oeff" "oeff" "oeff" ...
## $ art_reduziert : chr "gs" "gs" "gs" "gs" ...
## $ priv_schule_typ: chr "" "" "" "" ...
## $ name : chr "Grundschule 35" "Rosa-Luxemburg-Gesamtschule" "Grundschule 12 \"Gerhart Hauptmann\"" "Schule am Griebnitzsee" ...
## $ strasse : chr "An der Alten Zauche 2 c" "Burgstraße 23 a" "Carl-von-Ossietzky-Straße 37" "Domstraße 14 b" ...
## $ plz : chr "01585" "14467" "01570" "01590" ...
## $ ort : chr "Potsdam" "Potsdam" "Potsdam" "Potsdam" ...
The problem with the ZIP code is fixed (by explicitely specifying a data type or “class” for the variable plz
with colClasses = c(plz = "character")
). However, some variables like traeger
or art_reduziert
are now character strings although they should be factors / categoricals according to the codebook. We can convert them to factors with mutate()
. At first we need to load the tidyverse package for this:
library(tidyverse)
And now we update the data types of the variables:
schulen <- mutate(schulen, traeger = as.factor(traeger),
art_reduziert = as.factor(art_reduziert),
priv_schule_typ = as.factor(priv_schule_typ))
str(schulen)
## 'data.frame': 1340 obs. of 8 variables:
## $ jahr : int 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
## $ traeger : Factor w/ 2 levels "oeff","priv": 1 1 1 1 1 1 1 1 1 1 ...
## $ art_reduziert : Factor w/ 3 levels "gs","mit_gym_os",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ priv_schule_typ: Factor w/ 4 levels "","Frei","Kirchlich",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "Grundschule 35" "Rosa-Luxemburg-Gesamtschule" "Grundschule 12 \"Gerhart Hauptmann\"" "Schule am Griebnitzsee" ...
## $ strasse : chr "An der Alten Zauche 2 c" "Burgstraße 23 a" "Carl-von-Ossietzky-Straße 37" "Domstraße 14 b" ...
## $ plz : chr "01585" "14467" "01570" "01590" ...
## $ ort : chr "Potsdam" "Potsdam" "Potsdam" "Potsdam" ...
full_address
using mutate()
This new variable should contain the full address consisting of street name, zip code and city name, e.g. “Carl-von-Ossietzky-Straße 37 01570 Potsdam”. You can combine several strings to form one string using the function paste()
.
schulen <- mutate(schulen, full_address = paste(strasse, plz, ort))
head(schulen)
## jahr traeger art_reduziert priv_schule_typ name
## 1 1992 oeff gs Grundschule 35
## 2 1992 oeff gs Rosa-Luxemburg-Gesamtschule
## 3 1992 oeff gs Grundschule 12 "Gerhart Hauptmann"
## 4 1992 oeff gs Schule am Griebnitzsee
## 5 1992 oeff gs Grundschule 8
## 6 1992 oeff gs Grundschule
## strasse plz ort full_address
## 1 An der Alten Zauche 2 c 01585 Potsdam An der Alten Zauche 2 c 01585 Potsdam
## 2 Burgstraße 23 a 14467 Potsdam Burgstraße 23 a 14467 Potsdam
## 3 Carl-von-Ossietzky-Straße 37 01570 Potsdam Carl-von-Ossietzky-Straße 37 01570 Potsdam
## 4 Domstraße 14 b 01590 Potsdam Domstraße 14 b 01590 Potsdam
## 5 Dortustraße 28129 01560 Potsdam Dortustraße 28129 01560 Potsdam
## 6 Finkenweg 15 01560 Potsdam Finkenweg 15 01560 Potsdam
Important notes for the last excercise: I think the excercises from section 5.5.2 are too challenging for beginners without any guidance. Therefore I made it optional and I will provide a little more explanation here.
- Currently
dep_time
andsched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
So what’s the problem with those variables anyway? What does it mean that “they’re not really continuous numbers”?
The documentation says for those variables that they are “actual departure and arrival times (format HHMM or HMM)”. Let’s have a look at some numbers:
library(tidyverse)
library(nycflights13)
sample(flights$dep_time, 10)
## [1] 1121 2106 1530 1848 2134 1354 1155 1727 1638 554
We know that these are integer variables, but what do they mean? According to the documentation, we should treat the last two digits as minutes and the first one or two digits as hours “(format HHMM or HMM)”. So the above numbers mean “21:24” (so 9:24pm), 6:51am, etc.
Now what’s the problem with these numbers? The problem is that they combine two units of measurement into one: hours and minutes. The range of minutes is only between 0 and 60, and the range of hours is only from 0 to 23, so combining them in this way makes them not continuous and hard to compute with. For example, the difference between 5:10am and 4:50am is 20 minutes, right? Well, simply substracting the numbers as they are formatted in the data frame would yield 60 minutes instead:
510 - 450
## [1] 60
So how do we treat this? We need to convert them to continuous numbers! The excercise notes already give the correct hint: “Convert them to a more convenient representation of number of minutes since midnight.” How can we do that?
The general idea is that we need to split the hours and the minutes. Once we have both as separate numbers, we can easily calculate the number of minutes since midnight: \(h \cdot 60 + m\) with \(h\) being the hours and \(m\) being the minutes. Let’s suppose we already split the time components and did the calculation with the same numbers as above. Would we get the right result now?
(5 * 60 + 10) - (4 * 60 + 50)
## [1] 20
Seems correct! But the hard part is now splitting the time components. One way would be to convert the integers to character strings and do string slicing. However, there is an easier way by using “modular arithmetic”. Don’t worry, this sounds more intriguing than it is!
In order to “cut off” the last two digits of our combined time number, we need to divide it by 100. However, by this we would get a fraction most of the time, but we only need the integer part so we need to round always to the lower number. The function floor()
does that for us:
floor(510 / 100)
## [1] 5
A shorter version of this is the “integer division” operator %/%
:
510 %/% 100
## [1] 5
How do we get minutes component? We can use the “modulo” operator %%
which gives us the remainder of an integer division. For example, if you do integer division of 13 by 6 you get 2 and a remainder of 1 (because \(6 \cdot 2 + 1 = 13\)). Let’s apply it to a combined time format number:
510 %% 100
## [1] 10
So we now have a way to split hours and minutes from the combined time numbers. We can write a general rule for calculating the numbers of minutes since midnight from any combined time number t
: (t %/% 100) * 60 + (t %% 100)
.
Whenever we have something like a general rule or algorithm and we want to use it several times (i.e. for several variables), we should define a function for it, so that we don’t repeat ourselves. We will learn how to define our own functions in the next session so for now I will create the function and call it normalize_time
:
normalize_time <- function(t) {
(t %/% 100) * 60 + t %% 100
}
We can now use it with any number in combined time format to get the number of minutes since midnight:
normalize_time(510)
## [1] 310
Of course we can pass integer vectors:
normalize_time(c(510, 450))
## [1] 310 290
We can now also calculate the correct result for the time differences in minutes:
normalize_time(510) - normalize_time(450)
## [1] 20
From now on, the rest of the tasks should be easier to solve. I recommend that you at first create a new data frame consisting only of the variables that you need to solve the tasks (so use select()
!). This provides better overview. You can then apply the normalize_time()
function to the time variables using mutate()
.
Create a subset of the variables for better overview.
(fl_times <- select(flights, dep_time, sched_dep_time, arr_time, sched_arr_time, air_time))
## # A tibble: 336,776 x 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows
Define the function that converts the combined time format to number of minutes since midnight.
normalize_time <- function(t) {
(t %/% 100) * 60 + t %% 100
}
Apply this function to dep_time
and sched_dep_time
.
fl_times_norm <- mutate(fl_times, dep_time = normalize_time(dep_time),
sched_dep_time = normalize_time(sched_dep_time))
fl_times_norm
## # A tibble: 336,776 x 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <dbl> <dbl> <int> <int> <dbl>
## 1 317 315 830 819 227
## 2 333 329 850 830 227
## 3 342 340 923 850 160
## 4 344 345 1004 1022 183
## 5 354 360 812 837 116
## 6 354 358 740 728 150
## 7 355 360 913 854 158
## 8 357 360 709 723 53
## 9 357 360 838 846 140
## 10 358 360 753 745 138
## # ... with 336,766 more rows
At first we compare the variables:
transmute(fl_times_norm, dep_time, arr_time, arr_time - dep_time, air_time)
## # A tibble: 336,776 x 4
## dep_time arr_time `arr_time - dep_time` air_time
## <dbl> <int> <dbl> <dbl>
## 1 317 830 513 227
## 2 333 850 517 227
## 3 342 923 581 160
## 4 344 1004 660 183
## 5 354 812 458 116
## 6 354 740 386 150
## 7 355 913 558 158
## 8 357 709 352 53
## 9 357 838 481 140
## 10 358 753 395 138
## # ... with 336,766 more rows
air_time
should match the difference between arrival and departure time, but it’s way off. First of all, we should also convert arr_time
to the number of minutes since midnight, because we only did this for dep_time
and sched_dep_time
so far:
fl_times_norm <- mutate(fl_times_norm, arr_time = normalize_time(arr_time),
sched_arr_time = normalize_time(sched_arr_time))
transmute(fl_times_norm, dep_time, arr_time, arr_time - dep_time, air_time)
## # A tibble: 336,776 x 4
## dep_time arr_time `arr_time - dep_time` air_time
## <dbl> <dbl> <dbl> <dbl>
## 1 317 510 193 227
## 2 333 530 197 227
## 3 342 563 221 160
## 4 344 604 260 183
## 5 354 492 138 116
## 6 354 460 106 150
## 7 355 553 198 158
## 8 357 429 72 53
## 9 357 518 161 140
## 10 358 473 115 138
## # ... with 336,766 more rows
Still, the numbers do not match, although the difference is not as big any more. The reason for that could be that the time zones are different or that the air time is not actually the difference between officially recorded departure and arrival.