- Review of last week's tasks
- Record linkage: Combining data sets
- Reproducible workflows with RStudio
January 17, 2019
now online on https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
Record linkage or data joining is the process of combining observations in a data set \(A\) with observations in a data set \(B\) according to some matching criteria. Most of the time, a matching criterion is a common identifier.
id | x1 |
---|---|
1 | 0.2875775 |
2 | 0.7883051 |
3 | 0.4089769 |
4 | 0.8830174 |
5 | 0.9404673 |
id | x2 |
---|---|
1 | 0 |
2 | 5 |
3 | 9 |
4 | 6 |
5 | 5 |
id | x1 | x2 |
---|---|---|
1 | 0.2875775 | 0 |
2 | 0.7883051 | 5 |
3 | 0.4089769 | 9 |
4 | 0.8830174 | 6 |
5 | 0.9404673 | 5 |
\(A\) and \(B\) are joined by common identifier "id".
If you have multiple data sets that can be combined, you have relational data.
Data from a repeated measures experiment:
## person_id test_type score ## 1 1 pre 2 ## 2 1 post 0 ## 3 2 pre 3 ## 4 2 post 10 ## 5 3 pre 9 ## 6 3 post 7
We also have "meta data" about each participant:
## id age smoker ## 1 1 23 TRUE ## 2 3 42 TRUE ## 3 4 20 FALSE
We combine the data using a left join with criterion person_id = id
:
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 2 pre 3 NA NA ## 4 2 post 10 NA NA ## 5 3 pre 9 42 TRUE ## 6 3 post 7 42 TRUE
Notice how we introduced NAs, because participant ID 2 is missing in the "meta data". Also participant ID 4 does not appear in the result.
There are several functions in the package dplyr (contained in tidyverse) for combining data sets:
left_join
, right_join
, inner_join
, full_join
, semi_join
, anti_join
intersect
, union
, setdiff
bind_rows
, bind_cols
We'll have a look at the most common operations.
There are six join operations (of which three are commonly used).
All join operations have three parameters in common:
a
b
by
The type of join operation determines which rows and values are retained.
id | test_type | score |
---|---|---|
1 | pre | 10 |
1 | post | 5 |
2 | pre | 7 |
2 | post | 6 |
3 | pre | 1 |
3 | post | 9 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
left_join(a, b, by = <criterion>)
: always retains rows on the "left side" and fills up non-matching rows with NAs.
left_join(scores, personaldata, by = c('id'))
## id test_type score age smoker ## 1 1 pre 10 23 TRUE ## 2 1 post 5 23 TRUE ## 3 2 pre 7 NA NA ## 4 2 post 6 NA NA ## 5 3 pre 1 42 TRUE ## 6 3 post 9 42 TRUE
id | test_type | score |
---|---|---|
1 | pre | 10 |
1 | post | 5 |
2 | pre | 7 |
2 | post | 6 |
3 | pre | 1 |
3 | post | 9 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
right_join(a, b, by = <criterion>)
: always retains rows on the "right side" and fills up non-matching rows with NAs.
How many rows do you expect for a right join between scores
and personaldata
?
right_join(scores, personaldata, by = c('id'))
## id test_type score age smoker ## 1 1 pre 10 23 TRUE ## 2 1 post 5 23 TRUE ## 3 3 pre 1 42 TRUE ## 4 3 post 9 42 TRUE ## 5 4 <NA> NA 20 FALSE
You can always transform a left join to a right join and vice versa.
Which of these statements are equivalent?
left_join(a, b, by = 'id')
and right_join(a, b, by = 'id')
right_join(b, a, by = 'id')
and left_join(a, b, by = 'id')
left_join(b, a, by = 'id')
and right_join(a, b, by = 'id')
Match criteria are specified with parameter by
.
group | test_type | mean_score |
---|---|---|
treat_A | pre | 6.405068 |
treat_A | post | 9.942698 |
treat_B | pre | 6.557058 |
treat_B | post | 7.085305 |
ctrl | pre | 5.440660 |
ctrl | post | 5.941420 |
group | test_type | n | lab |
---|---|---|---|
treat_A | pre | 11 | a |
treat_A | post | 14 | b |
treat_B | pre | 12 | b |
treat_B | post | 10 | b |
ctrl | pre | 13 | a |
ctrl | post | 12 | a |
Parameter by
is a character vector with all columns that must match:
left_join(experiments, session_info, by = c('group', 'test_type'))
## group test_type mean_score n lab ## 1 treat_A pre 6.405068 11 a ## 2 treat_A post 9.942698 14 b ## 3 treat_B pre 6.557058 12 b ## 4 treat_B post 7.085305 10 b ## 5 ctrl pre 5.440660 13 a ## 6 ctrl post 5.941420 12 a
Parameter by
can be a named character vector like c('x' = 'y')
. This will match a.x
to b.y
(x
of left-hand side to y
of right-hand side).
This time, the scores
data set has an ID column named person_id
:
person_id | test_type | score |
---|---|---|
1 | pre | 2 |
1 | post | 0 |
2 | pre | 3 |
2 | post | 10 |
3 | pre | 9 |
3 | post | 7 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
We have to consider that when specifying the matching criterion:
left_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 2 pre 3 NA NA ## 4 2 post 10 NA NA ## 5 3 pre 9 42 TRUE ## 6 3 post 7 42 TRUE
person_id | test_type | score |
---|---|---|
1 | pre | 2 |
1 | post | 0 |
2 | pre | 3 |
2 | post | 10 |
3 | pre | 9 |
3 | post | 7 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
inner_join(a, b, by = <criterion>)
: only retains rows that match on both sides.
How many rows do you expect for an inner join between scores
and personaldata
?
inner_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 3 pre 9 42 TRUE ## 4 3 post 7 42 TRUE
person_id | test_type | score |
---|---|---|
1 | pre | 2 |
1 | post | 0 |
2 | pre | 3 |
2 | post | 10 |
3 | pre | 9 |
3 | post | 7 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
full_join(a, b, by = <criterion>)
: retains all rows for both sides and fills up non-matching rows with NAs.
How many rows do you expect for a full join between scores
and personaldata
?
full_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score age smoker ## 1 1 pre 2 23 TRUE ## 2 1 post 0 23 TRUE ## 3 2 pre 3 NA NA ## 4 2 post 10 NA NA ## 5 3 pre 9 42 TRUE ## 6 3 post 7 42 TRUE ## 7 4 <NA> NA 20 FALSE
person_id | test_type | score |
---|---|---|
1 | pre | 2 |
1 | post | 0 |
2 | pre | 3 |
2 | post | 10 |
3 | pre | 9 |
3 | post | 7 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
semi_join(a, b, by = <criterion>)
: A semi join is a filtering join. It returns all observations of a
where the criterion matches.
semi_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score ## 1 1 pre 2 ## 2 1 post 0 ## 3 3 pre 9 ## 4 3 post 7
→ "return all scores for which we have personal data"
person_id | test_type | score |
---|---|---|
1 | pre | 2 |
1 | post | 0 |
2 | pre | 3 |
2 | post | 10 |
3 | pre | 9 |
3 | post | 7 |
id | age | smoker |
---|---|---|
1 | 23 | TRUE |
3 | 42 | TRUE |
4 | 20 | FALSE |
anti_join(a, b, by = <criterion>)
: An anti join is the inverse of a semi join. It returns all observations of a
where the criterion does not match.
anti_join(scores, personaldata, by = c('person_id' = 'id'))
## person_id test_type score ## 1 2 pre 3 ## 2 2 post 10
→ "return all scores for which we have no personal data"
An inner join matches keys that appear in both data sets and returns the combined observations:
Left and right outer joins keep all observations on the left-hand or right-hand side data sets respectively. Unmatched rows are filled up with NAs:
A full outer join keeps all observations of both data sets. Unmatched rows are filled up with NAs:
A semi join filters the left-hand data set to return only those observations, that match with the right-hand data set:
An anti join is the inverse of a semi join:
Set operations as defined in the dplyr package operate on data frames. They translate to the same operations as you know from maths:
intersect(A, B)
: \(A \cap B\)union(A, B)
: \(A \cup B\)setdiff(A, B)
: \(A \setminus B\)There is no "match criterion". All values in a row are taken into account to match observations.
Some examples:
x | y |
---|---|
1 | 1 |
2 | 1 |
3 | 2 |
x | y |
---|---|
1 | 1 |
2 | 0 |
3 | 2 |
intersect(A, B)
## x y ## 1 1 1 ## 2 3 2
union(A, B)
## x y ## 1 2 0 ## 2 3 2 ## 3 2 1 ## 4 1 1
setdiff(A, B)
## x y ## 1 2 1
1. Forgetting to specify by
left_join(scores, personaldata) ## Error: `by` required, because the data sources have no common variables
If there are common variables, they are used for matching by default, which is probably not what you want.
→ always specify by
!
2. Not using a (named) character vector for by
left_join(scores, personaldata, by = c(person_id = id)) ## Error: `by` must specify variables to join by
→ quotes are missing (c('person_id' = 'id')
)
left_join(scores, personaldata, by = ('person_id' = 'id')) ## Error: `by` can't contain join column `id` which is missing from LHS
→ the little c()
is missing to denote a vector
3. Comparing the wrong types
A <- data.frame(id = 1:3, y = c(1, 1, 2)) B <- data.frame(id = as.factor(1:3), y = c(1, 0, 2)) inner_join(A, B, by = 'id') ## Error: Can't join on 'id' x 'id' because of incompatible types (integer / factor)
What is a reproducible workflow?
Hence you need to provide:
* subject to many limitations due to privacy and other considerations
setwd()
in your scriptsGo to Tools > Global options… → Never save workspace to file and never restore it on startup
Go to Tools > Global options… → Never save workspace to file and never restore it on startup
Why?
Because:
"Restart R session" will not reload data: you can start afresh
use "Restart R session" (CTRL+SHIFT+F10
) often to check if scripts work when loaded afresh
RStudio supports projects (File > New project …). Use them!
Why?
Because:
setwd()
!) to the root of the projectmyproject.Rproj
file)You may have noticed that the solutions to the tasks involve a mixture of code, plots and prose. They're RMarkdown documents:
My presentations are actually RMarkdown documents:
Good for:
Not good for:
→ use R scripts for that
Create a new RMarkdown document (short: Rmd) in RStudio with File > New file > RMarkdown ….
→ creates a sample document with explanations and links to RMarkdown documentation
Make sure that Rmd documents are reproducible: Select Run > Restart R and Run All Chunks.
Free book: R Markdown: The Definite Guide (Xie et al.)
Bad practice:
setwd('C:/Research/Super Interesting Project/Analysis & more/') X <- read.csv('cat_research_data.csv',stringsAsFactors =FALSE, col.names= c('weight', 'age', 'length_tail')) m <- mean(X$var2) if (m> 8) { print('fat cats!') if(m > 12) {print('super fat cats!')} } S <- X[X$var2 > m,] library(ggplot2) qplot(S$var2, S$var3)
Better practice:
# super important cat research script (no need for setwd() -- RStudio project!) # author, date library(ggplot2) # put libraries on top # load the data catdata <- read.csv('cat_research_data.csv', stringsAsFactors = FALSE, col.names = c('weight', 'age', 'length_tail')) # calculate mean weight and create subset with obs. where weight > mean weight mean_weight <- mean(catdata$weight) above_mean <- catdata[catdata$weight > mean_weight,] if (mean_weight > 8) { # just to show nested indentation print('fat cats!') if(mean_weight > 12) { print('super fat cats!') } } # scatter plot of above mean data with weight against length of cat's tail qplot(above_mean$weight, above_mean$length_tail)
See dedicated tasks sheet on the tutorial website.