Tasks

1. Indexing

There are two predefined objects in R that contain all letters from A-Z and a-z, respectively:

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

Using numeric indexing (not subsetting with logical expressions), try to generate the following output using either letters or LETTERS:


# 1. The single letter `"e"`
letters[5]
## [1] "e"
# 2. All letters but `"e"`
letters[-5]
##  [1] "a" "b" "c" "d" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r"
## [18] "s" "t" "u" "v" "w" "x" "y" "z"
# 3. The last five letters in the alphabet (`"v" "w" "x" "y" "z"`)
letters[22:26]
## [1] "v" "w" "x" "y" "z"
# 4. The 23rd, 26th and second capital letters (in that order) forming `"W" "Z" "B"`
LETTERS[c(23, 26, 2)]
## [1] "W" "Z" "B"
# 5. Every second letter starting from 1 (`"a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"`)
letters[seq(1, 26, by = 2)]
##  [1] "a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"
# 6. All but the first five letters: `"f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"`
letters[-(1:5)]
##  [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
## [18] "w" "x" "y" "z"
# 7. Create an object `myletters` as a copy of `letters` (`myletters <- letters`).
# Assign the first five capital letters (from `LETTERS`) to the first five letters
# of `myletters` so that `myletters` will then contain:
# `"A" "B" "C" "D" "E" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"`
myletters <- letters
myletters[1:5] <- LETTERS[1:5]
myletters
##  [1] "A" "B" "C" "D" "E" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
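The solutions above hard-code positions such as 22 and 26. As a minimal sketch (the helper name `n` is only for illustration), the same indices can be derived from the length of the vector, so the code still works for a vector of a different length:

```r
n <- length(letters)   # 26 for the built-in alphabet

# last five elements, computed from the vector length
letters[(n - 4):n]
## [1] "v" "w" "x" "y" "z"

# every second element without hard-coding the upper bound
letters[seq(1, n, by = 2)]

# tail() is an alternative for "the last k elements"
tail(letters, 5)
## [1] "v" "w" "x" "y" "z"
```

Note the parentheses in `(n - 4):n`. Without them, `n - 4:n` would be evaluated as `n - (4:n)`, because `:` binds more tightly than `-`.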

2. Subsetting with logical expressions

  1. Complete lesson 6 (“Subsetting vectors”) of the SWIRL course “R Programming”. (See the notes in session 2 tasks about installing SWIRL if you have not done that yet.)
  2. Consider the following data and subset it according to the criteria listed below:
retweets <- c(1, 3, 2, 2, 3, 4, 3, 2, 8, 2)
likes <- c(6, 10, 9, 6, 3, 6, 6, 7, 6, 15)
users <- factor(c('WZB_Berlin', 'JWI_Berlin', 'JWI_Berlin', 'gesis_org', 'WZB_Berlin', 'WZB_Berlin', 'WZB_Berlin', 'gesis_org', 'JWI_Berlin', 'WZB_Berlin'))
located_in_berlin <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

Assume that the elements in the vectors are aligned, i.e. the first element in retweets corresponds to the first element in likes and users etc. (as if they were combined in a data frame). Solve all tasks by using logical expressions / logical vectors.


a) Form subsets of the vectors retweets and likes to contain only data from the user WZB_Berlin.

retweets[users == 'WZB_Berlin']
## [1] 1 3 4 3 2
likes[users == 'WZB_Berlin']
## [1]  6  3  6  6 15

b) Form a subset of the vector users to contain only elements where located_in_berlin is FALSE or users equals "WZB_Berlin" (this should return a vector only containing "gesis_org" and "WZB_Berlin").

users[!located_in_berlin | users == 'WZB_Berlin']
## [1] WZB_Berlin gesis_org  WZB_Berlin WZB_Berlin WZB_Berlin gesis_org 
## [7] WZB_Berlin
## Levels: gesis_org JWI_Berlin WZB_Berlin
# this is also correct:
users[located_in_berlin == FALSE | users == 'WZB_Berlin']
## [1] WZB_Berlin gesis_org  WZB_Berlin WZB_Berlin WZB_Berlin gesis_org 
## [7] WZB_Berlin
## Levels: gesis_org JWI_Berlin WZB_Berlin
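Note that the output still lists `JWI_Berlin` under `Levels`, although no element of the subset has that value: subsetting a factor keeps all of its original levels. The following sketch (the name `subset_b` is only for illustration) shows two ways to check which values actually occur and how to drop the unused level with `droplevels()`:

```r
users <- factor(c('WZB_Berlin', 'JWI_Berlin', 'JWI_Berlin', 'gesis_org', 'WZB_Berlin',
                  'WZB_Berlin', 'WZB_Berlin', 'gesis_org', 'JWI_Berlin', 'WZB_Berlin'))
located_in_berlin <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

subset_b <- users[!located_in_berlin | users == 'WZB_Berlin']

# values that actually occur in the subset
unique(as.character(subset_b))
## [1] "WZB_Berlin" "gesis_org"

# remove levels that no longer occur (here: JWI_Berlin)
subset_b <- droplevels(subset_b)
levels(subset_b)
```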

c) Form subsets of the vectors retweets, likes and users that contain only elements with at least three retweets and at least six likes.

To save us some typing, I create a logical vector first:

(criteria <- retweets >= 3 & likes >= 6)
##  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE

And now I can use it with all the vectors:

retweets[criteria]
## [1] 3 4 3 8
likes[criteria]
## [1] 10  6  6  6
users[criteria]
## [1] JWI_Berlin WZB_Berlin WZB_Berlin JWI_Berlin
## Levels: gesis_org JWI_Berlin WZB_Berlin
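A logical vector like `criteria` can also be inspected directly before it is used for subsetting. As a small sketch: `sum()` counts the `TRUE` values (since `TRUE` is treated as 1) and `which()` returns their positions:

```r
retweets <- c(1, 3, 2, 2, 3, 4, 3, 2, 8, 2)
likes <- c(6, 10, 9, 6, 3, 6, 6, 7, 6, 15)

criteria <- retweets >= 3 & likes >= 6

# number of elements that meet both conditions
sum(criteria)
## [1] 4

# positions of the elements that meet both conditions
which(criteria)
## [1] 2 6 7 9
```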

d) Calculate the median of retweets. Now form a subset of retweets, users and located_in_berlin where retweets are higher than the median.

This is the median for retweets:

(med_retw <- median(retweets))
## [1] 2.5

Again, we create a logical vector first:

(retw_above_median <- retweets > med_retw)
##  [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
retweets[retw_above_median]
## [1] 3 3 4 3 8
users[retw_above_median]
## [1] JWI_Berlin WZB_Berlin WZB_Berlin WZB_Berlin JWI_Berlin
## Levels: gesis_org JWI_Berlin WZB_Berlin
located_in_berlin[retw_above_median]
## [1] TRUE TRUE TRUE TRUE TRUE
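Since the vectors are aligned as if they were columns of a data frame, the same logical vectors can be used to select rows once the vectors are actually combined. This is only a sketch; the data frame name `tweets` is an assumption made for illustration:

```r
retweets <- c(1, 3, 2, 2, 3, 4, 3, 2, 8, 2)
likes <- c(6, 10, 9, 6, 3, 6, 6, 7, 6, 15)
users <- factor(c('WZB_Berlin', 'JWI_Berlin', 'JWI_Berlin', 'gesis_org', 'WZB_Berlin',
                  'WZB_Berlin', 'WZB_Berlin', 'gesis_org', 'JWI_Berlin', 'WZB_Berlin'))
located_in_berlin <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# combine the aligned vectors; the column names are taken from the variable names
tweets <- data.frame(retweets, likes, users, located_in_berlin)

# the logical vector selects rows, the empty column index keeps all columns
above_median <- tweets$retweets > median(tweets$retweets)
tweets[above_median, ]

# select rows and a subset of the columns at once
tweets[above_median, c('retweets', 'users')]
```

With a data frame, one row subset replaces the three separate vector subsets from task d).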

3. Reading and writing files / subsetting data frames

Create a script file in RStudio that does the following:

  1. It loads the CSV file segindex_sample.csv (from the accompanying resources file 04rbasics3-resources.zip available on the course website) into a data frame. Set read.csv() to not convert strings to factors automatically.
  2. It filters this data frame by selecting only observations from the states “NRW”, “RP” and “BW” (Hint: You can use the %in% operator for this – it was introduced in the previous session).
  3. It saves the filtered data frame to an Excel file segindex_subset.xlsx.

You should write this to an R script (with the file name extension .R), but I copied the contents of a solution here. Please note that the paths to the files for reading/writing can be different in your case. For this solution, I assume that the R script resides in the same directory as the CSV input and Excel output files and that this directory is also the working directory.

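library(writexl)  # provides write_xlsx(); install once with install.packages('writexl')

# step 1: load the CSV file without converting strings to factors
segindex <- read.csv('segindex_sample.csv', stringsAsFactors = FALSE)

# step 2: keep only the rows whose state is "NRW", "RP" or "BW"
segindex_subset <- segindex[segindex$state %in% c('NRW', 'RP', 'BW'), ]

# step 3: save the filtered data frame to an Excel file
write_xlsx(segindex_subset, 'segindex_subset.xlsx')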
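To see what the `%in%` filter in step 2 does without needing the course files, here is a self-contained sketch. The data frame `demo` is a made-up stand-in for segindex_sample.csv: only the column name `state` comes from the solution above, while the column `value` and all values are invented for illustration. The sketch uses only base R and a temporary CSV file instead of an Excel file:

```r
# hypothetical stand-in data; not the contents of segindex_sample.csv
demo <- data.frame(
  state = c('NRW', 'BY', 'RP', 'BW', 'HE'),
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  stringsAsFactors = FALSE
)

# write the data to a temporary CSV file and read it back
csv_path <- tempfile(fileext = '.csv')
write.csv(demo, csv_path, row.names = FALSE)
loaded <- read.csv(csv_path, stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE per row: is the state one of the three?
keep <- loaded$state %in% c('NRW', 'RP', 'BW')
keep
## [1]  TRUE FALSE  TRUE  TRUE FALSE

# use the logical vector as row index; the empty column index keeps all columns
loaded_subset <- loaded[keep, ]
loaded_subset$state
## [1] "NRW" "RP"  "BW"
```

Passing `row.names = FALSE` to `write.csv()` avoids an extra column of row numbers in the written file.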