R Tutorial at the WZB

Tasks

1.

Install the package pscl (a package from the Political Science Computational Laboratory of Standford Univ.). Load the data set iraqVote and have a look at its documentation. Then, using the aggregation and summarization functions you learned in this session, solve the following tasks:

1.a)

Replicate the table below, where n_members is the number of members, n_aye the number of “Aye” (yes) votes and perc_aye the share of “Aye” votes per Republican/Non-republican group (Hint: Group on rep and then use summarise()).

library(tidyverse)
library(pscl)

iraqVote %>%
  group_by(rep) %>%
  summarise(n_members = n(),
            n_aye = sum(y),
            perc_aye = n_aye / n_members * 100)

## # A tibble: 2 x 4
##   rep   n_members n_aye perc_aye
##   <lgl>     <int> <dbl>    <dbl>
## 1 FALSE        51    29     56.9
## 2 TRUE         49    48     98.0

1.b)

In each state, there are two Senators that voted. Their votes might differ (e.g. Senator A voted “Aye”, Sentator B voted “Nay”) and/or their parties might differ (Senator A is Republican, Senator B is Democrat). Create a data frame that for each state indicates whether the votes diverged (new variable votes_differ) and whether the party membership differs (new variable party_differs). See the table below for an example output of the first five states. Hint: Consider that there are only two votes per state (where 0 means “Nay” and 1 means “Aye”). So there are four possible combinations of votes (0/0, 1/0, 0/1 or 1/1) out of which only those with a sum of 1 (1/0 or 0/1) are cases of diverging votes. The same principle can be used for the party membership.

(votes_diff <- iraqVote %>%
  group_by(state.name) %>%
  summarise(votes_differ = sum(y) == 1,
            party_differs = sum(rep) == 1))

## # A tibble: 50 x 3
##    state.name  votes_differ party_differs
##    <chr>       <lgl>        <lgl>        
##  1 Alabama     FALSE        FALSE        
##  2 Alaska      FALSE        FALSE        
##  3 Arizona     FALSE        FALSE        
##  4 Arkansas    FALSE        TRUE         
##  5 California  TRUE         FALSE        
##  6 Colorado    FALSE        FALSE        
##  7 Connecticut FALSE        FALSE        
##  8 Delaware    FALSE        FALSE        
##  9 Florida     TRUE         FALSE        
## 10 Georgia     FALSE        FALSE        
## # ... with 40 more rows

1.c)

From the output of the previous task, select those states, where the votes diverged but the party membership of both senators did not differ. Save the output vector of those state names in an object named div_votes_states (Hint: You can convert single column of a data frame to a vector using unlist(): ... %>% select(state.name) %>% unlist()). Using div_votes_states, filter the observations of iraqVote to get only the observations from the states listed in div_votes_states. Are there any Republicans in this list?

(div_votes_states <- votes_diff %>%
  filter(votes_differ & !party_differs) %>%
  select(state.name) %>%
  unlist(use.names = FALSE))

## [1] "California"    "Florida"       "Massachusetts" "New Jersey"    "North Dakota"  "Washington"   
## [7] "West Virginia" "Wisconsin"

filter(iraqVote, state.name %in% div_votes_states)

##    y state.abb               name   rep    state.name gorevote
## 1  0        CA       BOXER (D CA) FALSE    California    53.45
## 2  1        CA   FEINSTEIN (D CA) FALSE    California    53.45
## 3  0        FL      GRAHAM (D FL) FALSE       Florida    48.84
## 4  1        FL      NELSON (D FL) FALSE       Florida    48.84
## 5  0        MA     KENNEDY (D MA) FALSE Massachusetts    59.80
## 6  1        MA       KERRY (D MA) FALSE Massachusetts    59.80
## 7  0        NJ     CORZINE (D NJ) FALSE    New Jersey    56.13
## 8  1        NJ  TORRICELLI (D NJ) FALSE    New Jersey    56.13
## 9  0        ND      CONRAD (D ND) FALSE  North Dakota    33.05
## 10 1        ND      DORGAN (D ND) FALSE  North Dakota    33.05
## 11 1        WA    CANTWELL (D WA) FALSE    Washington    50.13
## 12 0        WA      MURRAY (D WA) FALSE    Washington    50.13
## 13 0        WV        BYRD (D WV) FALSE West Virginia    45.59
## 14 1        WV ROCKEFELLER (D WV) FALSE West Virginia    45.59
## 15 0        WI    FEINGOLD (D WI) FALSE     Wisconsin    47.83
## 16 1        WI        KOHL (D WI) FALSE     Wisconsin    47.83

There are no Republicans in this list.

2.

Load the data set politicalInformation that is also available from the package pscl and view its documentation. Then, solve the following tasks:

2.a)

Create a new data set polinf based on politicalInformation but with a new variable age_group.

polinf <- politicalInformation %>% mutate(age_group = case_when(
  between(age, 18, 29) ~ 'young adult',
  between(age, 29, 45) ~ 'adult',
  between(age, 46, 59) ~ 'middle age',
  between(age, 60, 99) ~ 'senior'
  ))
head(polinf)

##             y collegeDegree female age homeOwn govt length id  age_group
## 1 Fairly High           Yes     No  49     Yes   No  58.40  1 middle age
## 2     Average            No    Yes  35     Yes   No  46.15  2      adult
## 3   Very High            No    Yes  57     Yes   No  89.52  3 middle age
## 4     Average            No     No  63     Yes   No  92.63  4     senior
## 5 Fairly High           Yes    Yes  40     Yes   No  58.85  4      adult
## 6     Average            No     No  77     Yes   No  53.82  4     senior

2.b)

Filter polinf to only include non-NA values in age_group. Then group by age_group and collageDegree. Compute the mean of the interviewer rating y and indicate the number of non-NA values in y for each group.

polinf %>% filter(!is.na(age_group)) %>%
  group_by(age_group, collegeDegree) %>%
  summarise(n_nonNA = sum(!is.na(y)),
            mean_y = mean(as.integer(y), na.rm = TRUE))

## # A tibble: 8 x 4
## # Groups:   age_group [?]
##   age_group   collegeDegree n_nonNA mean_y
##   <chr>       <fct>           <int>  <dbl>
## 1 adult       No                329   2.95
## 2 adult       Yes               307   3.74
## 3 middle age  No                249   3.14
## 4 middle age  Yes               211   4.06
## 5 senior      No                304   3.06
## 6 senior      Yes               113   3.99
## 7 young adult No                193   2.56
## 8 young adult Yes                86   3.45

3.

Load the data set flights from the package nycflights13.

library(nycflights13)

3.a)

Find out the number of carrier companies in the data set (Hint: You can use unique() or distinct() for that).

length(unique(flights$carrier))

## [1] 16

3.b)

Find out the three longest flights (in terms of air_time) for each carrier. Construct a single command combined with %>%-operators. You can use rank() to rank observations according to a variable and then filter() only for the ranks 1 to 3.

flights %>%
  select(carrier, origin, dest, air_time, distance) %>%   # only to reduce amount of data displayed
  group_by(carrier) %>%
  mutate(rank_air_time = rank(desc(air_time), ties.method = 'first')) %>%
  filter(rank_air_time <= 3) %>%
  arrange(carrier, desc(air_time))

## # A tibble: 48 x 6
## # Groups:   carrier [16]
##    carrier origin dest  air_time distance rank_air_time
##    <chr>   <chr>  <chr>    <dbl>    <dbl>         <int>
##  1 9E      JFK    SAT        272     1587             1
##  2 9E      JFK    SAT        264     1587             2
##  3 9E      JFK    SAT        262     1587             3
##  4 AA      JFK    SFO        426     2586             1
##  5 AA      JFK    SFO        422     2586             2
##  6 AA      JFK    SFO        410     2586             3
##  7 AS      EWR    SEA        392     2402             1
##  8 AS      EWR    SEA        384     2402             2
##  9 AS      EWR    SEA        377     2402             3
## 10 B6      JFK    SAN        413     2446             1
## # ... with 38 more rows