Install the package pscl (a package from the Political Science Computational Laboratory of Standford Univ.). Load the data set iraqVote
and have a look at its documentation. Then, using the aggregation and summarization functions you learned in this session, solve the following tasks:
Replicate the table below, where n_members
is the number of members, n_aye
the number of “Aye” (yes) votes and perc_aye
the share of “Aye” votes per Republican/Non-republican group (Hint: Group on rep
and then use summarise()
).
library(tidyverse)
library(pscl)
iraqVote %>%
group_by(rep) %>%
summarise(n_members = n(),
n_aye = sum(y),
perc_aye = n_aye / n_members * 100)
## # A tibble: 2 x 4
## rep n_members n_aye perc_aye
## <lgl> <int> <dbl> <dbl>
## 1 FALSE 51 29 56.9
## 2 TRUE 49 48 98.0
In each state, there are two Senators that voted. Their votes might differ (e.g. Senator A voted “Aye”, Sentator B voted “Nay”) and/or their parties might differ (Senator A is Republican, Senator B is Democrat). Create a data frame that for each state indicates whether the votes diverged (new variable votes_differ
) and whether the party membership differs (new variable party_differs
). See the table below for an example output of the first five states. Hint: Consider that there are only two votes per state (where 0 means “Nay” and 1 means “Aye”). So there are four possible combinations of votes (0/0, 1/0, 0/1 or 1/1) out of which only those with a sum of 1 (1/0 or 0/1) are cases of diverging votes. The same principle can be used for the party membership.
(votes_diff <- iraqVote %>%
group_by(state.name) %>%
summarise(votes_differ = sum(y) == 1,
party_differs = sum(rep) == 1))
## # A tibble: 50 x 3
## state.name votes_differ party_differs
## <chr> <lgl> <lgl>
## 1 Alabama FALSE FALSE
## 2 Alaska FALSE FALSE
## 3 Arizona FALSE FALSE
## 4 Arkansas FALSE TRUE
## 5 California TRUE FALSE
## 6 Colorado FALSE FALSE
## 7 Connecticut FALSE FALSE
## 8 Delaware FALSE FALSE
## 9 Florida TRUE FALSE
## 10 Georgia FALSE FALSE
## # ... with 40 more rows
From the output of the previous task, select those states, where the votes diverged but the party membership of both senators did not differ. Save the output vector of those state names in an object named div_votes_states
(Hint: You can convert single column of a data frame to a vector using unlist()
: ... %>% select(state.name) %>% unlist()
). Using div_votes_states
, filter the observations of iraqVote
to get only the observations from the states listed in div_votes_states
. Are there any Republicans in this list?
(div_votes_states <- votes_diff %>%
filter(votes_differ & !party_differs) %>%
select(state.name) %>%
unlist(use.names = FALSE))
## [1] "California" "Florida" "Massachusetts" "New Jersey" "North Dakota" "Washington"
## [7] "West Virginia" "Wisconsin"
filter(iraqVote, state.name %in% div_votes_states)
## y state.abb name rep state.name gorevote
## 1 0 CA BOXER (D CA) FALSE California 53.45
## 2 1 CA FEINSTEIN (D CA) FALSE California 53.45
## 3 0 FL GRAHAM (D FL) FALSE Florida 48.84
## 4 1 FL NELSON (D FL) FALSE Florida 48.84
## 5 0 MA KENNEDY (D MA) FALSE Massachusetts 59.80
## 6 1 MA KERRY (D MA) FALSE Massachusetts 59.80
## 7 0 NJ CORZINE (D NJ) FALSE New Jersey 56.13
## 8 1 NJ TORRICELLI (D NJ) FALSE New Jersey 56.13
## 9 0 ND CONRAD (D ND) FALSE North Dakota 33.05
## 10 1 ND DORGAN (D ND) FALSE North Dakota 33.05
## 11 1 WA CANTWELL (D WA) FALSE Washington 50.13
## 12 0 WA MURRAY (D WA) FALSE Washington 50.13
## 13 0 WV BYRD (D WV) FALSE West Virginia 45.59
## 14 1 WV ROCKEFELLER (D WV) FALSE West Virginia 45.59
## 15 0 WI FEINGOLD (D WI) FALSE Wisconsin 47.83
## 16 1 WI KOHL (D WI) FALSE Wisconsin 47.83
There are no Republicans in this list.
Load the data set politicalInformation
that is also available from the package pscl and view its documentation. Then, solve the following tasks:
Create a new data set polinf
based on politicalInformation
but with a new variable age_group
.
polinf <- politicalInformation %>% mutate(age_group = case_when(
between(age, 18, 29) ~ 'young adult',
between(age, 29, 45) ~ 'adult',
between(age, 46, 59) ~ 'middle age',
between(age, 60, 99) ~ 'senior'
))
head(polinf)
## y collegeDegree female age homeOwn govt length id age_group
## 1 Fairly High Yes No 49 Yes No 58.40 1 middle age
## 2 Average No Yes 35 Yes No 46.15 2 adult
## 3 Very High No Yes 57 Yes No 89.52 3 middle age
## 4 Average No No 63 Yes No 92.63 4 senior
## 5 Fairly High Yes Yes 40 Yes No 58.85 4 adult
## 6 Average No No 77 Yes No 53.82 4 senior
Filter polinf
to only include non-NA values in age_group
. Then group by age_group
and collageDegree
. Compute the mean of the interviewer rating y
and indicate the number of non-NA values in y
for each group.
polinf %>% filter(!is.na(age_group)) %>%
group_by(age_group, collegeDegree) %>%
summarise(n_nonNA = sum(!is.na(y)),
mean_y = mean(as.integer(y), na.rm = TRUE))
## # A tibble: 8 x 4
## # Groups: age_group [?]
## age_group collegeDegree n_nonNA mean_y
## <chr> <fct> <int> <dbl>
## 1 adult No 329 2.95
## 2 adult Yes 307 3.74
## 3 middle age No 249 3.14
## 4 middle age Yes 211 4.06
## 5 senior No 304 3.06
## 6 senior Yes 113 3.99
## 7 young adult No 193 2.56
## 8 young adult Yes 86 3.45
Load the data set flights
from the package nycflights13.
library(nycflights13)
Find out the number of carrier companies in the data set (Hint: You can use unique()
or distinct()
for that).
length(unique(flights$carrier))
## [1] 16
Find out the three longest flights (in terms of air_time
) for each carrier. Construct a single command combined with %>%
-operators. You can use rank()
to rank observations according to a variable and then filter()
only for the ranks 1 to 3.
flights %>%
select(carrier, origin, dest, air_time, distance) %>% # only to reduce amount of data displayed
group_by(carrier) %>%
mutate(rank_air_time = rank(desc(air_time), ties.method = 'first')) %>%
filter(rank_air_time <= 3) %>%
arrange(carrier, desc(air_time))
## # A tibble: 48 x 6
## # Groups: carrier [16]
## carrier origin dest air_time distance rank_air_time
## <chr> <chr> <chr> <dbl> <dbl> <int>
## 1 9E JFK SAT 272 1587 1
## 2 9E JFK SAT 264 1587 2
## 3 9E JFK SAT 262 1587 3
## 4 AA JFK SFO 426 2586 1
## 5 AA JFK SFO 422 2586 2
## 6 AA JFK SFO 410 2586 3
## 7 AS EWR SEA 392 2402 1
## 8 AS EWR SEA 384 2402 2
## 9 AS EWR SEA 377 2402 3
## 10 B6 JFK SAN 413 2446 1
## # ... with 38 more rows