- Review of last week's tasks
- Reshaping data with
gather()
andspread()
- Plotting with ggplot2
December 06, 2018
gather()
and spread()
now online on https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
gather()
and spread()
Hadley Wickham introduced the concept of tidy data as a way how data should be organized so it is comfortable to work with (H. Wickham 2014: Tidy Data). He defined three rules that make a data set tidy:
tools and packages in the tidyverse like dplyr or ggplot2 require data to be organized in that way
the problem: most data you'll get won't be "tidy" from the beginning on → you'll need to reshape it
OBrienKaiser
from package carData
.1
to .5
)treatment gender pre.1 pre.2 pre.3 pre.4 pre.5 post.1 post.2 post.3 post.4 post.5 fup.1 ... A M 7 8 7 9 9 9 9 10 8 9 9 ... B F 4 5 7 5 4 7 7 8 6 7 7 ... control M 1 2 4 2 1 3 2 5 3 2 2 ... ...
Is this data tidy?
No! You have several measurements (observations) per row. This is also called a wide table format.
→ hard to work with (e.g. compute mean of all three measurement types separately)
We can use gather()
from package tidyr to put the measurements in separate rows.
gather()
takes the following arguments (among others):
%>%
).key
: The name of the new column containing the measurement types.value
: The name of the new column containing the measurement values.(tidy_obk <- gather(OBrienKaiser, pre.1:fup.5, key = 'meas_type_occasion', value = 'value'))
## treatment gender meas_type_occasion value ## 1 control M pre.1 1 ## 2 control M pre.1 4 ## 3 control M pre.1 5 ## 4 control F pre.1 5 ## 5 control F pre.1 3 ## 6 A M pre.1 7 ## 7 A M pre.1 5 ## 8 A F pre.1 2 ## 9 A F pre.1 3 ## 10 B M pre.1 4 ## 11 B M pre.1 3 ## 12 B M pre.1 6 ## 13 B F pre.1 5 ## 14 B F pre.1 2 ## 15 B F pre.1 2 ## [ reached getOption("max.print") -- omitted 225 rows ]
Our data is already better to work with, but meas_type_occasion
still contains two values like "pre.1" (pretest 1) or "fup.4" (follow-up test 4). This violates rule #3: "Each value must have its own cell.").
We can use separate()
(package tidyr) to split columns that contain several values. It takes the following arguments (among others):
%>%
).into
: The names of the new columns.sep
: A rule for how to split the values. The default is to split on anything that is not a number or character (e.g. slash, period, hyphen, …).tidy_obk <- separate(tidy_obk, meas_type_occasion, into = c('meas_type', 'meas_occasion'))
Some additional variable conversion and we're done:
tidy_obk <- mutate(tidy_obk, meas_type = factor(meas_type, levels = c('pre', 'post', 'fup')), meas_occasion = as.integer(meas_occasion))
## treatment gender meas_type meas_occasion value ## 1 control M pre 1 1 ## 2 control M pre 1 4 ## 3 control M pre 1 5 ## 4 control F pre 1 5 ## 5 control F pre 1 3 ## 6 A M pre 1 7 ## 7 A M pre 1 5 ## 8 A F pre 1 2 ## 9 A F pre 1 3 ## 10 B M pre 1 4 ## 11 B M pre 1 3 ## 12 B M pre 1 6 ## [ reached getOption("max.print") -- omitted 228 rows ]
→ This is tidy data. It's also called the long table format.
With this data we can actually work!
tidy_obk %>% group_by(treatment, meas_type) %>% summarise(mean_measurement = mean(value))
## # A tibble: 9 x 3 ## # Groups: treatment [?] ## treatment meas_type mean_measurement ## <fct> <fct> <dbl> ## 1 control pre 4.2 ## 2 control post 4 ## 3 control fup 4.4 ## 4 A pre 5 ## 5 A post 6.5 ## 6 A fup 7.25 ## 7 B pre 4.14 ## 8 B post 6.57 ## 9 B fup 7.29
It's better spending some time on data cleanup than struggling with messy data sets!
spread()
is the opposite of gather()
: It combines observations from multiple rows into a single row with more columns. Hence it converts data from the long table format to the wide table format.
economics_long
from package ggplot2 contains data in long table format:
date variable value 1967-07-01 pce 507. 1967-08-01 pce 510. 1967-09-01 pce 516. ... 1967-07-01 pop 198712 1967-08-01 pop 198911 1967-09-01 pop 199113 ... 1967-07-01 psavert 12.5 1967-08-01 psavert 12.5 1967-09-01 psavert 11.7 ...
For each date, there are five variable types (pce, pop, psavert, unemploy, uempmed) and their respective values.
spread()
takes the following arguments:
%>%
).key
: The column containing the variable names.value
: The column containing the respective variable values.spread(economics_long, variable, value)
## # A tibble: 574 x 6 ## date pce pop psavert uempmed unemploy ## <date> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1967-07-01 507. 198712 12.5 4.5 2944 ## 2 1967-08-01 510. 198911 12.5 4.7 2945 ## 3 1967-09-01 516. 199113 11.7 4.6 2958 ## 4 1967-10-01 513. 199311 12.5 4.9 3143 ## 5 1967-11-01 518. 199498 12.5 4.7 3066 ## 6 1967-12-01 526. 199657 12.1 4.8 3018 ## 7 1968-01-01 532. 199808 11.7 5.1 2878 ## 8 1968-02-01 534. 199920 12.2 4.5 3001 ## 9 1968-03-01 545. 200056 11.6 4.1 2877 ## 10 1968-04-01 545. 200208 12.2 4.6 2709 ## # ... with 564 more rows
Grouping, aggregating and summarizing results are important for data analysis. They let you reduce complex data sets to simpler summarizations, compare groups and help to see patterns in your data.
Visualization of your data can have the same benefits and has the power to communicate a message much faster than a table of numbers. However, it requires a sense for which type of graph or visual representation is appropriate for your data.
A useful website for inspiration, code examples and caveats is https://www.data-to-viz.com.
source: tidyverse.org
There are three basic steps for constructing plots with ggplot2:
ggplot()
.aes()
.Additionally, you can further change the appearance of your plot by:
You combine all these steps with a +
.
We'll make a plot for historical election results from four states (data set presidentialElections
from package pscl):
## # A tibble: 88 x 4 ## state demVote year south ## <chr> <dbl> <int> <lgl> ## 1 Louisiana 92.8 1932 TRUE ## 2 Maine 43.2 1932 FALSE ## 3 South Dakota 63.6 1932 FALSE ## 4 Tennessee 66.5 1932 TRUE ## 5 Louisiana 88.8 1936 TRUE ## 6 Maine 41.5 1936 FALSE ## 7 South Dakota 54.0 1936 FALSE ## 8 Tennessee 68.8 1936 TRUE ## 9 Louisiana 85.9 1940 TRUE ## 10 Maine 48.8 1940 FALSE ## # ... with 78 more rows
Example data: politicalInformation
from pscl package
Interviewers administering the 2000 American National Election Studies assigned an ordinal rating to each respondent's "general level of information" about politics and public affairs.
## y collegeDegree female age homeOwn govt length id ## 1 Fairly High Yes No 49 Yes No 58.40 1 ## 2 Average No Yes 35 Yes No 46.15 2 ## 3 Very High No Yes 57 Yes No 89.52 3 ## 4 Average No No 63 Yes No 92.63 4 ## 5 Fairly High Yes Yes 40 Yes No 58.85 4 ## 6 Average No No 77 Yes No 53.82 4 ## 7 Average No No 43 Yes No 58.47 5 ## [ reached getOption("max.print") -- omitted 1800 rows ]
At first, you should make sure that you either loaded the tidyverse package or ggplot2 package. We define:
politicalInformation
.age
.geom_histogram()
.ggplot(politicalInformation, aes(age)) + geom_histogram()
I hid two messsages produced by ggplot:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`. Removed 9 rows containing non-finite values (stat_bin).
binwidth = 5
below to make a histogram of age with five-year stepsage
variable → we can ignore thisggplot(politicalInformation, aes(age)) + geom_histogram(binwidth = 5)
What if we wanted to know if the distribution of age
in our sample is different for men and women?
We can produce boxplots for both subsets, putting female
on the x-axis (see how the aesthetics mapping was changed!):
ggplot(politicalInformation, aes(x = female, y = age)) + geom_boxplot()
A violin plot offers better insight about the actual shape of the distribution:
ggplot(politicalInformation, aes(x = female, y = age)) + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) # horizontal lines for quartiles
A density plot also reveals that there are more younger men in the sample:
ggplot(politicalInformation, aes(age, color = female)) + geom_density()
Histograms can be used to bin continuous variables. To show the distribution of a categorical variable we can use a simple bar chart:
ggplot(politicalInformation, aes(y)) + geom_bar()
A stacked and grouped bar charts allows to compare groups. Here, we compare groups based on the collegeDegree
variable by mapping the "fill (color)" aesthetic to it.
By default, the bars get stacked:
ggplot(politicalInformation, aes(y, fill = collegeDegree)) + geom_bar()
A stacked and grouped bar charts allows to compare groups. Here, we compare groups based on the collegeDegree
variable by mapping the "fill (color)" aesthetic to it.
Setting the bars' position
to position_dodge()
creates a grouped bar chart:
ggplot(politicalInformation, aes(y, fill = collegeDegree)) + geom_bar(position = position_dodge())
A scatterplot displays the relationship between two numeric variables. We'll use the airquality
and have a look at the relationship between ozone and temperature.
head(airquality)
## Ozone Solar.R Wind Temp Month Day ## 1 41 190 7.4 67 5 1 ## 2 36 118 8.0 72 5 2 ## 3 12 149 12.6 74 5 3 ## 4 18 313 11.5 62 5 4 ## 5 NA NA 14.3 56 5 5 ## 6 28 NA 14.9 66 5 6
A scatterplot displays the relationship between two numeric variables. We'll use the airquality
and have a look at the relationship between ozone and temperature.
We map Ozone
to the y-axis and Temp
to the x-axis and use geom_point()
:
ggplot(airquality, aes(x = Temp, y = Ozone)) + geom_point()
Overplotting can easily occur, especially with large data sets.
ggplot(airquality, aes(x = Temp, y = Ozone)) + geom_point(alpha = 0.33) # alpha of 0 is invisible, 1 is opaque
Overplotting can easily occur, especially with large data sets.
ggplot(airquality, aes(x = Temp, y = Ozone)) + geom_point(position = position_jitter())
geom_smooth()
aids the eye in seeing patterns by adding a (local polynomial) regression line and its confidence interval:
ggplot(airquality, aes(x = Temp, y = Ozone)) + geom_point(alpha = 0.33) + geom_smooth()
A line graph can be used whenever you have an ordered numeric variable on the x-axis, such as years.
We'll use a subset of the presidentialElections
data set from the package pscl:
(cal_pres <- presidentialElections %>% filter(state == 'California'))
## # A tibble: 22 x 4 ## state demVote year south ## <chr> <dbl> <int> <lgl> ## 1 California 58.4 1932 FALSE ## 2 California 67.0 1936 FALSE ## 3 California 57.4 1940 FALSE ## 4 California 56.5 1944 FALSE ## 5 California 47.6 1948 FALSE ## 6 California 42.7 1952 FALSE ## 7 California 44.3 1956 FALSE ## 8 California 49.6 1960 FALSE ## 9 California 59.1 1964 FALSE ## 10 California 44.7 1968 FALSE ## # ... with 12 more rows
We map year
on the x-axis, the vote share for Democrats demVote
on the y-axis and add a geom_line()
layer:
ggplot(cal_pres, aes(x = year, y = demVote)) + geom_line()
We can additionally add points like in a scatterplot:
ggplot(cal_pres, aes(x = year, y = demVote)) + geom_line() + geom_point()
Again, we'll plot the historical election results from four states (data set presidentialElections
from package pscl) stored in an object sampled_pres
:
## # A tibble: 88 x 4 ## state demVote year south ## <chr> <dbl> <int> <lgl> ## 1 Louisiana 92.8 1932 TRUE ## 2 Maine 43.2 1932 FALSE ## 3 South Dakota 63.6 1932 FALSE ## 4 Tennessee 66.5 1932 TRUE ## 5 Louisiana 88.8 1936 TRUE ## 6 Maine 41.5 1936 FALSE ## 7 South Dakota 54.0 1936 FALSE ## 8 Tennessee 68.8 1936 TRUE ## 9 Louisiana 85.9 1940 TRUE ## 10 Maine 48.8 1940 FALSE ## # ... with 78 more rows
If we simply use the same command as before, we produce some interesting chaos:
ggplot(sampled_pres, aes(x = year, y = demVote)) + geom_line()
We have to add another aesthetic mapping (color = state
) in order to tell apart the states' lines:
ggplot(sampled_pres, aes(x = year, y = demVote, color = state)) + geom_line()
A ggplot object can be assigned a name just as any other object in R:
myplot <- ggplot(airquality, aes(x = Temp, y = Ozone)) myplot # shows an "empty" plot
You can re-use the ggplot object and try out different layers:
myplot + geom_point()
You can re-use the ggplot object and try out different layers or themes:
myplot + geom_point(position = position_jitter()) + theme_minimal()
You can eventually save the plot to disk with ggsave()
:
final_plot <- myplot + geom_point(position = position_jitter()) ggsave('07plotting-resources/example_saved_figure.png', plot = final_plot, width = 4, height = 3)
There are several options to configure the output file (see ?ggsave
):
A very common mistake is to accidentally put +
on a new line:
ggplot(airquality, aes(x = Temp, y = Ozone)) + geom_point()
Error: Cannot use "+.gg()" with a single argument. Did you accidentally put + on a new line?
The +
operator must appear before the line break (the same is true for other operators like %>%
used in dplyr):
ggplot(airquality, aes(x = Temp, y = Ozone)) + geom_point()
The type of your variables determines its scale for plotting. E.g. here you might want to use a discrete scale:
ggplot(airquality, aes(x = Temp, y = Ozone, color = Month)) + geom_point()
Converting the numerical to a factor tells ggplot that a discrete scale is appropriate:
ggplot(airquality, aes(x = Temp, y = Ozone, color = as.factor(Month))) + geom_point()
This was only a short intro about a few types of plots that can be made with ggplot2. For more advanced plots you should learn about:
There are several packages that extend ggplot's capabilities:
The documentation is excellent. If you don't know how a "geom" can be controlled, have a look at it's documentation, e.g.
[...] Aesthetics: ‘geom_point’ understands the following aesthetics (required aesthetics are in bold): • *‘x’* • *‘y’* • ‘alpha’ • ‘colour’ • ‘fill’ • ‘group’ • ‘shape’ • ‘size’ • ‘stroke’ [...]
→ you must specify the aesthetics x
and y
→ you can make several other properties (e.g. transparency or fill color) dependent on some variable
For more, see:
See dedicated tasks sheet on the tutorial website.