- Review of last week's tasks
- Record linkage: Combining data sets
- Reproducible workflows with RStudio
January 17, 2019
now online on https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
Record linkage or data joining is the process of combining observations in a data set \(A\) with observations in a data set \(B\) according to some matching criteria. Most of the time, a matching criterion is a common identifier.
| id | x1 |
|---|---|
| 1 | 0.2875775 |
| 2 | 0.7883051 |
| 3 | 0.4089769 |
| 4 | 0.8830174 |
| 5 | 0.9404673 |
| id | x2 |
|---|---|
| 1 | 0 |
| 2 | 5 |
| 3 | 9 |
| 4 | 6 |
| 5 | 5 |
| id | x1 | x2 |
|---|---|---|
| 1 | 0.2875775 | 0 |
| 2 | 0.7883051 | 5 |
| 3 | 0.4089769 | 9 |
| 4 | 0.8830174 | 6 |
| 5 | 0.9404673 | 5 |
\(A\) and \(B\) are joined by common identifier "id".
If you have multiple data sets that can be combined, you have relational data.
Data from a repeated measures experiment:
## person_id test_type score ## 1 1 pre 2 ## 2 1 post 0 ## 3 2 pre 3 ## 4 2 post 10 ## 5 3 pre 9 ## 6 3 post 7
We also have "meta data" about each participant:
## id age smoker ## 1 1 23 TRUE ## 2 3 42 TRUE ## 3 4 20 FALSE
We combine the data using a left join with criterion person_id = id:
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 2 pre 3 NA NA ## 4 2 post 10 NA NA ## 5 3 pre 9 42 TRUE ## 6 3 post 7 42 TRUE
Notice how we introduced NAs, because participant ID 2 is missing in the "meta data". Also participant ID 4 does not appear in the result.
There are several functions in the package dplyr (contained in tidyverse) for combining data sets:
left_join, right_join, inner_join, full_join, semi_join, anti_joinintersect, union, setdiffbind_rows, bind_colsWe'll have a look at the most common operations.
There are six join operations (of which three are commonly used).
All join operations have three parameters in common:
abbyThe type of join operation determines which rows and values are retained.
| id | test_type | score |
|---|---|---|
| 1 | pre | 10 |
| 1 | post | 5 |
| 2 | pre | 7 |
| 2 | post | 6 |
| 3 | pre | 1 |
| 3 | post | 9 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
left_join(a, b, by = <criterion>): always retains rows on the "left side" and fills up non-matching rows with NAs.
left_join(scores, personaldata, by = c('id'))
## id test_type score age smoker ## 1 1 pre 10 23 TRUE ## 2 1 post 5 23 TRUE ## 3 2 pre 7 NA NA ## 4 2 post 6 NA NA ## 5 3 pre 1 42 TRUE ## 6 3 post 9 42 TRUE
| id | test_type | score |
|---|---|---|
| 1 | pre | 10 |
| 1 | post | 5 |
| 2 | pre | 7 |
| 2 | post | 6 |
| 3 | pre | 1 |
| 3 | post | 9 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
right_join(a, b, by = <criterion>): always retains rows on the "right side" and fills up non-matching rows with NAs.
How many rows do you expect for a right join between scores and personaldata?
right_join(scores, personaldata, by = c('id'))
## id test_type score age smoker ## 1 1 pre 10 23 TRUE ## 2 1 post 5 23 TRUE ## 3 3 pre 1 42 TRUE ## 4 3 post 9 42 TRUE ## 5 4 <NA> NA 20 FALSE
You can always transform a left join to a right join and vice versa.
Which of these statements are equivalent?
left_join(a, b, by = 'id') and right_join(a, b, by = 'id')right_join(b, a, by = 'id') and left_join(a, b, by = 'id')left_join(b, a, by = 'id') and right_join(a, b, by = 'id')Match criteria are specified with parameter by.
| group | test_type | mean_score |
|---|---|---|
| treat_A | pre | 6.405068 |
| treat_A | post | 9.942698 |
| treat_B | pre | 6.557058 |
| treat_B | post | 7.085305 |
| ctrl | pre | 5.440660 |
| ctrl | post | 5.941420 |
| group | test_type | n | lab |
|---|---|---|---|
| treat_A | pre | 11 | a |
| treat_A | post | 14 | b |
| treat_B | pre | 12 | b |
| treat_B | post | 10 | b |
| ctrl | pre | 13 | a |
| ctrl | post | 12 | a |
Parameter by is a character vector with all columns that must match:
left_join(experiments, session_info, by = c('group', 'test_type'))
## group test_type mean_score n lab ## 1 treat_A pre 6.405068 11 a ## 2 treat_A post 9.942698 14 b ## 3 treat_B pre 6.557058 12 b ## 4 treat_B post 7.085305 10 b ## 5 ctrl pre 5.440660 13 a ## 6 ctrl post 5.941420 12 a
Parameter by can be a named character vector like c('x' = 'y'). This will match a.x to b.y (x of left-hand side to y of right-hand side).
This time, the scores data set has an ID column named person_id:
| person_id | test_type | score |
|---|---|---|
| 1 | pre | 2 |
| 1 | post | 0 |
| 2 | pre | 3 |
| 2 | post | 10 |
| 3 | pre | 9 |
| 3 | post | 7 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
We have to consider that when specifying the matching criterion:
left_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 2 pre 3 NA NA ## 4 2 post 10 NA NA ## 5 3 pre 9 42 TRUE ## 6 3 post 7 42 TRUE
| person_id | test_type | score |
|---|---|---|
| 1 | pre | 2 |
| 1 | post | 0 |
| 2 | pre | 3 |
| 2 | post | 10 |
| 3 | pre | 9 |
| 3 | post | 7 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
inner_join(a, b, by = <criterion>): only retains rows that match on both sides.
How many rows do you expect for an inner join between scores and personaldata?
inner_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 3 pre 9 42 TRUE ## 4 3 post 7 42 TRUE
| person_id | test_type | score |
|---|---|---|
| 1 | pre | 2 |
| 1 | post | 0 |
| 2 | pre | 3 |
| 2 | post | 10 |
| 3 | pre | 9 |
| 3 | post | 7 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
full_join(a, b, by = <criterion>): retains all rows for both sides and fills up non-matching rows with NAs.
How many rows do you expect for a full join between scores and personaldata?
full_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 2 pre 3 NA NA ## 4 2 post 10 NA NA ## 5 3 pre 9 42 TRUE ## 6 3 post 7 42 TRUE ## 7 4 <NA> NA 20 FALSE
| person_id | test_type | score |
|---|---|---|
| 1 | pre | 2 |
| 1 | post | 0 |
| 2 | pre | 3 |
| 2 | post | 10 |
| 3 | pre | 9 |
| 3 | post | 7 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
semi_join(a, b, by = <criterion>): A semi join is a filtering join. It returns all observations of a where the criterion matches.
semi_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score ## 1 1 pre 2 ## 2 1 post 0 ## 3 3 pre 9 ## 4 3 post 7
→ "return all scores for which we have personal data"
| person_id | test_type | score |
|---|---|---|
| 1 | pre | 2 |
| 1 | post | 0 |
| 2 | pre | 3 |
| 2 | post | 10 |
| 3 | pre | 9 |
| 3 | post | 7 |
| id | age | smoker |
|---|---|---|
| 1 | 23 | TRUE |
| 3 | 42 | TRUE |
| 4 | 20 | FALSE |
anti_join(a, b, by = <criterion>): An anti join is the inverse of a semi join. It returns all observations of a where the criterion does not match.
anti_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score ## 1 2 pre 3 ## 2 2 post 10
→ "return all scores for which we have no personal data"
An inner join matches keys that appear in both data sets and returns the combined observations:
Left and right outer joins keep all observations on the left-hand or right-hand side data sets respectively. Unmatched rows are filled up with NAs:
A full outer join keeps all observations of both data sets. Unmatched rows are filled up with NAs:
A semi join filters the left-hand data set to return only those observations, that match with the right-hand data set:
An anti join is the inverse of a semi join:
Set operations as defined in the dplyr package operate on data frames. They translate to the same operations as you know from maths:
intersect(A, B): \(A \cap B\)union(A, B): \(A \cup B\)setdiff(A, B): \(A \setminus B\)There is no "match criterion". All values in a row are taken into account to match observations.
Some examples:
| x | y |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| x | y |
|---|---|
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
intersect(A, B)
## x y ## 1 1 1 ## 2 3 2
union(A, B)
## x y ## 1 2 0 ## 2 3 2 ## 3 2 1 ## 4 1 1
setdiff(A, B)
## x y ## 1 2 1
1. Forgetting to specify by
left_join(scores, personaldata) ## Error: `by` required, because the data sources have no common variables
If there are common variables, they are used for matching by default, which is probably not what you want.
→ always specify by!
2. Not using a (named) character vector for by
left_join(scores, personaldata, by = c(person_id = id)) ## Error: `by` must specify variables to join by
→ quotes are missing (c('person_id' = 'id'))
left_join(scores, personaldata, by = ('person_id' = 'id'))
## Error: `by` can't contain join column `id` which is missing from LHS
→ the little c() is missing to denote a vector
3. Comparing the wrong types
A <- data.frame(id = 1:3, y = c(1, 1, 2)) B <- data.frame(id = as.factor(1:3), y = c(1, 0, 2)) inner_join(A, B, by = 'id') ## Error: Can't join on 'id' x 'id' because of incompatible types (integer / factor)
What is a reproducible workflow?
Hence you need to provide:
* subject to many limitations due to privacy and other considerations
setwd() in your scriptsGo to Tools > Global options… → Never save workspace to file and never restore it on startup
Go to Tools > Global options… → Never save workspace to file and never restore it on startup
Why?
Because:
"Restart R session" will not reload data: you can start afresh
use "Restart R session" (CTRL+SHIFT+F10) often to check if scripts work when loaded afresh
RStudio supports projects (File > New project …). Use them!
Why?
Because:
setwd()!) to the root of the projectmyproject.Rproj file)You may have noticed that the solutions to the tasks involve a mixture of code, plots and prose. They're RMarkdown documents:
My presentations are actually RMarkdown documents:
Good for:
Not good for:
→ use R scripts for that
Create a new RMarkdown document (short: Rmd) in RStudio with File > New file > RMarkdown ….
→ creates a sample document with explanations and links to RMarkdown documentation
Make sure that Rmd documents are reproducible: Select Run > Restart R and Run All Chunks.
Free book: R Markdown: The Definite Guide (Xie et al.)
Bad practice:
setwd('C:/Research/Super Interesting Project/Analysis & more/')
X <- read.csv('cat_research_data.csv',stringsAsFactors =FALSE,
col.names= c('weight', 'age', 'length_tail'))
m <- mean(X$var2)
if (m> 8) {
print('fat cats!')
if(m > 12) {print('super fat cats!')}
}
S <- X[X$var2 > m,]
library(ggplot2)
qplot(S$var2, S$var3)
Better practice:
# super important cat research script (no need for setwd() -- RStudio project!)
# author, date
library(ggplot2) # put libraries on top
# load the data
catdata <- read.csv('cat_research_data.csv',
stringsAsFactors = FALSE,
col.names = c('weight', 'age', 'length_tail'))
# calculate mean weight and create subset with obs. where weight > mean weight
mean_weight <- mean(catdata$weight)
above_mean <- catdata[catdata$weight > mean_weight,]
if (mean_weight > 8) { # just to show nested indentation
print('fat cats!')
if(mean_weight > 12) {
print('super fat cats!')
}
}
# scatter plot of above mean data with weight against length of cat's tail
qplot(above_mean$weight, above_mean$length_tail)
See dedicated tasks sheet on the tutorial website.