Tasks

1. Install and load the package reshape2 and have a look at the data set smiths. It is a very small data set (two observations) used for demonstration purposes.

library(reshape2)
library(tidyverse)

smiths
## # A tibble: 2 x 5
##   subject     time   age weight height
##   <chr>      <dbl> <dbl>  <dbl>  <dbl>
## 1 John Smith     1    33     90   1.87
## 2 Mary Smith     1    NA     NA   1.54

1.1. Add a logical (TRUE/FALSE) variable smoker to smiths. Use any values you like, e.g. c(FALSE, TRUE).

smiths$smoker <- c(FALSE, TRUE)
smiths
## # A tibble: 2 x 6
##   subject     time   age weight height smoker
##   <chr>      <dbl> <dbl>  <dbl>  <dbl> <lgl> 
## 1 John Smith     1    33     90   1.87 FALSE 
## 2 Mary Smith     1    NA     NA   1.54 TRUE

1.2. Reshape this data set to form a “long table” format. Do this by using gather() on the columns age, weight, height and smoker. Set the “key” column’s name to “var” and the “value” column’s name to “y”. Store the result in an object named smiths_long. What happened to the logical values of the smoker variable?

(smiths_long <- gather(smiths, age:smoker, key = "var", value = "y"))
## # A tibble: 8 x 4
##   subject     time var        y
##   <chr>      <dbl> <chr>  <dbl>
## 1 John Smith     1 age    33   
## 2 Mary Smith     1 age    NA   
## 3 John Smith     1 weight 90   
## 4 Mary Smith     1 weight NA   
## 5 John Smith     1 height  1.87
## 6 Mary Smith     1 height  1.54
## 7 John Smith     1 smoker  0   
## 8 Mary Smith     1 smoker  1

The logical values of the smoker variable were converted to the numeric values 0 (FALSE) and 1 (TRUE), because a column in a data frame can hold only one type. gather() stacks age, weight, height and smoker into the single column y, and numeric is the most general type among those columns, so smoker is coerced to numeric.
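
The coercion described above can be reproduced in isolation. The following is a minimal sketch in base R; the vector names mixed and mixed_chr are illustrative and not part of the exercise.

```r
# Combining numeric and logical values in one vector forces a common type.
mixed <- c(33, 1.87, FALSE, TRUE)
class(mixed)   # numeric: FALSE and TRUE are stored as 0 and 1
mixed

# A character value would instead force every element to character,
# because character is the most general atomic vector type.
mixed_chr <- c(33, TRUE, "John")
class(mixed_chr)
```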

1.3. Use spread() on smiths_long to convert back to its original format, having age, weight, height and smoker as separate columns. What is the type of the smoker variable now and why is that so?

spread(smiths_long, var, y)
## # A tibble: 2 x 6
##   subject     time   age height smoker weight
##   <chr>      <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1 John Smith     1    33   1.87      0     90
## 2 Mary Smith     1    NA   1.54      1     NA

The smoker variable is now numeric, not logical, because spread() builds it from the numeric column y in smiths_long, which carries no record that the values were originally logical.
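
If the logical type is needed again, it has to be restored explicitly. The sketch below assumes a data frame shaped like the spread() result; it is built by hand here so that it runs without any packages, and the name smiths_wide is hypothetical.

```r
# Hand-built stand-in for the result of spread(smiths_long, var, y);
# `smiths_wide` is a hypothetical name, not used in the exercise.
smiths_wide <- data.frame(
  subject = c("John Smith", "Mary Smith"),
  smoker  = c(0, 1)
)

# as.logical() maps 0 to FALSE and any non-zero number to TRUE.
smiths_wide$smoker <- as.logical(smiths_wide$smoker)
str(smiths_wide$smoker)
```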

2. Install the package reshape2 and inform yourself about the french_fries data that comes with the package.

2.1. There are 5 flavors being rated in each row. Convert the data into “long format” by using gather() on the five flavor columns. The “key” column should be named “flavor” and the “value” column should be named “rating”. Store the result in an object tidy_fries.

(tidy_fries <- gather(french_fries, potato:painty, key = 'flavor', value = 'rating'))
##      time treatment subject rep  flavor rating
## 1       1         1       3   1  potato    2.9
## 2       1         1       3   2  potato   14.0
## 3       1         1      10   1  potato   11.0
## 4       1         1      10   2  potato    9.9
## 5       1         1      15   1  potato    1.2
##  [ reached getOption("max.print") -- omitted 3475 rows ]

2.2. There are 696 observations in the original data set. Compare that to the number of rows in the “long” data set that you just created. Does the number of rows in the “long” data set make sense?

We expect \(696 \cdot 5 = 3480\) rows in the “long” data set, because each of the five flavor ratings in an original row becomes a row of its own. This expectation is confirmed:

nrow(tidy_fries)
## [1] 3480
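
The same check can be written without hard-coding 696. The sketch below also uses pivot_longer(), which the tidyr documentation names as the newer interface that supersedes gather(); the object names fries, fries_long and n_flavors are illustrative.

```r
library(tidyr)

# french_fries ships with reshape2; access it without attaching the package.
fries <- reshape2::french_fries

# pivot_longer() performs the same wide-to-long reshaping as gather().
fries_long <- pivot_longer(
  fries,
  cols      = potato:painty,
  names_to  = "flavor",
  values_to = "rating"
)

# Each original row contributes one row per flavor column.
n_flavors <- 5
nrow(fries_long) == nrow(fries) * n_flavors
```

Note that pivot_longer() orders the rows differently from gather() (all flavors of one original row stay together), but the number of rows is the same.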

2.3. Make a boxplot that shows a box for each interaction between treatment and flavor (x-axis) regarding their rating (y-axis).

Note: Interaction means each possible combination of treatment and flavor. It can be constructed with interaction(treatment, flavor).

ggplot(tidy_fries, aes(x = interaction(treatment, flavor), y = rating)) + geom_boxplot()
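
To see what interaction() produces, it can help to try it on a tiny example first. The factors below are made up for illustration and only mimic the treatment and flavor columns.

```r
# Toy factors; the values mimic treatment and flavor but are made up.
treatment <- factor(c(1, 2, 3))
flavor    <- factor(c("potato", "buttery", "potato"))

# interaction() creates one level per combination of the input levels,
# including combinations that do not occur in the data.
combo <- interaction(treatment, flavor)
nlevels(combo)   # 3 treatment levels times 2 flavor levels
levels(combo)
```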

2.4. Improve the above plot by creating “small multiples”, i.e. facets for the variable flavor. This means each facet (i.e. a small embedded plot) then shows a boxplot for a specific flavor (with treatments on the x-axis and rating on the y-axis).

Note: Add facet_grid(~ flavor) to your plot to create small boxplots per flavor in a row.

ggplot(tidy_fries, aes(x = treatment, y = rating)) +
  geom_boxplot() +
  facet_grid(~ flavor)

3. Load the data UN from the package carData and inform yourself about it.

3.1. You want to study the relationship between Gross Domestic Product (GDP) and infant mortality using this data. What kind of plot can you use to do that? Which variables go on which axes?

A scatter plot with GDP on the x-axis and infant mortality rate on the y-axis can be used for that.

3.2. Construct a scatter plot that plots GDP (ppgdp variable) against infant mortality rate (infantMortality variable).

library(carData)

ggplot(UN, aes(x = ppgdp, y = infantMortality)) + geom_point()

3.3. How can you improve the plot to avoid overplotting? How can you aid the eye in showing a trend?

library(carData)

ggplot(UN, aes(x = ppgdp, y = infantMortality)) +
  geom_point(alpha = 0.33) +     # add transparency to avoid overplotting
  geom_smooth()                  # add a smooth line to show a trend

3.4. Add another dimension to the plot by making the points’ color dependent on the variable group. Keep the trend lines. What are the problems with the trend lines, especially for the groups “africa” and “other”?

library(carData)

ggplot(UN, aes(x = ppgdp, y = infantMortality, color = group)) +
  geom_point(alpha = 0.33) +
  geom_smooth()

The problem with the trend lines, especially for the “africa” group, is that each group has only a few observations at its more extreme GDP values. This makes the trend unreliable there (hence the wide confidence band) and sensitive to outliers: the lines for “africa” and “other” turn upward because of single outlying points. Furthermore, the confidence band extends to negative values on the y-axis, which makes no sense for a mortality rate.

One could improve this by excluding outliers. Additionally, faceting could aid the eye.
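
The faceting idea can be sketched as follows. Using a straight-line fit (method = "lm"), hiding the confidence band (se = FALSE) and freeing the x scales are choices made here for illustration; they are assumptions, not part of the original solution.

```r
library(ggplot2)
library(carData)

# One panel per group, so the groups no longer overlap.
p <- ggplot(UN, aes(x = ppgdp, y = infantMortality)) +
  geom_point(alpha = 0.33) +
  # Assumption: a straight-line fit without a confidence band, which is
  # less flexible than the default smoother where observations are sparse.
  geom_smooth(method = "lm", se = FALSE) +
  # Free x scales let each panel use its own GDP range.
  facet_wrap(~ group, scales = "free_x")
p
```

Whether a linear fit is adequate for each group is a separate question; the sketch only shows how the panels separate the groups.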

4. Get acquainted with the data set Arrests from the package carData.

4.1. Summarize the data by creating a data set with the number of arrests per year and sex. (Hint: you can use group_by() and count() or n().)

(arrests_per_year_sex <- Arrests %>% group_by(year, sex) %>% count())
## # A tibble: 12 x 3
## # Groups:   year, sex [12]
##     year sex        n
##    <int> <fct>  <int>
##  1  1997 Female    32
##  2  1997 Male     460
##  3  1998 Female    95
##  4  1998 Male     782
##  5  1999 Female   100
##  6  1999 Male     999
##  7  2000 Female   105
##  8  2000 Male    1165
##  9  2001 Female    93
## 10  2001 Male    1118
## 11  2002 Female    18
## 12  2002 Male     259
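
The hint also mentions n(). A sketch of that variant is shown below; the name arrests_alt is illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(carData)

# Same summary using summarise() and n() instead of count().
arrests_alt <- Arrests %>%
  group_by(year, sex) %>%
  summarise(n = n(), .groups = "drop")   # return an ungrouped result

arrests_alt
```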

4.2. Visualize the data using an appropriate type of plot. Is there something in the trend that makes you wonder? If so, what could be the reason(s) for that?

ggplot(arrests_per_year_sex, aes(x = year, y = n, color = sex)) +
  geom_line() +
  geom_point()

There’s a sudden decline in arrests in 2002, which breaks with the trend of the previous years. There are several possible reasons. One is that the data for 2002 are incomplete (e.g. they cover only the first quarter of the year). Another is a change in legislation regarding the possession of marijuana.