Load the data set smiths. It is a very small data set (two observations) used for demonstration purposes.
library(reshape2)
library(tidyverse)
smiths
## # A tibble: 2 x 5
## subject time age weight height
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 John Smith 1 33 90 1.87
## 2 Mary Smith 1 NA NA 1.54
Add a column smoker to smiths. Use any values you like, e.g. c(FALSE, TRUE).
smiths$smoker <- c(FALSE, TRUE)
smiths
## # A tibble: 2 x 6
## subject time age weight height smoker
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 John Smith 1 33 90 1.87 FALSE
## 2 Mary Smith 1 NA NA 1.54 TRUE
Use gather() on the columns age, weight, height and smoker. Set the “key” column’s name to “var” and the “value” column’s name to “y”. Store the result in an object named smiths_long. What happened to the logical values of the smoker variable?
# the outer parentheses print the result of the assignment
(smiths_long <- gather(smiths, age:smoker, key = "var", value = "y"))
## # A tibble: 8 x 4
## subject time var y
## <chr> <dbl> <chr> <dbl>
## 1 John Smith 1 age 33
## 2 Mary Smith 1 age NA
## 3 John Smith 1 weight 90
## 4 Mary Smith 1 weight NA
## 5 John Smith 1 height 1.87
## 6 Mary Smith 1 height 1.54
## 7 John Smith 1 smoker 0
## 8 Mary Smith 1 smoker 1
The logical values of the smoker variable were converted to the numeric values 0 and 1, because a column in a data frame can only have one type. Since numeric is the most general type that can represent all of the gathered values, smoker is converted to that type.
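The same coercion can be seen without any data frame, since gather() has to fit all gathered values into a single column. A minimal base R sketch (the vector name mixed is only illustrative):

```r
# Combining numeric and logical values in one vector forces a common type.
mixed <- c(33, 1.87, FALSE, TRUE)
class(mixed)   # "numeric"
mixed          # FALSE became 0 and TRUE became 1
```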
Use spread() on smiths_long to convert it back to its original format, with age, weight, height and smoker as separate columns. What is the type of the smoker variable now and why is that so?
spread(smiths_long, var, y)
## # A tibble: 2 x 6
## subject time age height smoker weight
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 John Smith 1 33 1.87 0 90
## 2 Mary Smith 1 NA 1.54 1 NA
The smoker variable is now numeric, not logical, because it was spread from the numeric variable y in smiths_long, and R does not know that it was originally a logical variable.
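If the logical type is needed again, it has to be restored by hand. A small sketch, assuming the spread result was stored in an object; the name smiths_wide and the stand-in data frame below are hypothetical and only mirror the output above:

```r
# Stand-in for the result of spread(smiths_long, var, y)
smiths_wide <- data.frame(
  subject = c("John Smith", "Mary Smith"),
  smoker  = c(0, 1)
)
# as.logical() maps 0 to FALSE and any non-zero number to TRUE
smiths_wide$smoker <- as.logical(smiths_wide$smoker)
str(smiths_wide$smoker)
```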
Load the french_fries data that comes with the reshape2 package.
Use gather() on the five flavor columns. The “key” column should be named “flavor” and the “value” column should be named “rating”. Store the result in an object tidy_fries.
(tidy_fries <- gather(french_fries, potato:painty, key = "flavor", value = "rating"))
## time treatment subject rep flavor rating
## 1 1 1 3 1 potato 2.9
## 2 1 1 3 2 potato 14.0
## 3 1 1 10 1 potato 11.0
## 4 1 1 10 2 potato 9.9
## 5 1 1 15 1 potato 1.2
## [ reached getOption("max.print") -- omitted 3475 rows ]
We expect \(696 \cdot 5 = 3480\) rows in the “long” data set, because each of the five flavor ratings in each of the 696 original rows is moved into its own row. This expectation can be confirmed:
nrow(tidy_fries)
## [1] 3480
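The rule behind this count (rows in the wide data times the number of gathered columns) can be checked on any small data set. The sketch below uses a made-up data frame and base R's stack() in place of gather(), purely for illustration:

```r
# Hypothetical wide data: 3 rows, 2 measurement columns
wide <- data.frame(id = 1:3, potato = c(1, 2, 3), buttery = c(4, 5, 6))
measure_cols <- c("potato", "buttery")
# stack() puts the selected columns on top of each other
# (result columns: values, ind); the id column is left out here
long <- stack(wide[measure_cols])
nrow(long) == nrow(wide) * length(measure_cols)
```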
Create boxplots for the interaction of treatment and flavor (x-axis) regarding their rating (y-axis).
Note: Interaction means each possible combination of treatment and flavor. It can be constructed with interaction(treatment, flavor).
ggplot(tidy_fries, aes(x = interaction(treatment, flavor), y = rating)) + geom_boxplot()
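To see what interaction() produces, it helps to call it on two short factors. The values below are made up for illustration:

```r
treatment <- factor(c("1", "2", "1", "2"))
flavor    <- factor(c("potato", "potato", "buttery", "buttery"))
# One combined level per treatment/flavor combination, joined by "."
combo <- interaction(treatment, flavor)
as.character(combo)
nlevels(combo)   # 2 treatments x 2 flavors
```

Each level of the combined factor becomes one box on the x-axis of the plot.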
Facet the plot by flavor. This means each facet (i.e. a small embedded plot) then shows a boxplot for a specific flavor (with treatments on the x-axis and rating on the y-axis).
Note: Add facet_grid(~ flavor) to your plot to create small boxplots per flavor in a row.
ggplot(tidy_fries, aes(x = treatment, y = rating)) +
  geom_boxplot() +
  facet_grid(~ flavor)
Load the data set UN from the package carData and inform yourself about it.
A scatter plot with GDP on the x-axis and infant mortality rate on the y-axis can be used to show how the two are related.
Plot GDP per capita (ppgdp variable) against infant mortality rate (infantMortality variable).
library(carData)
ggplot(UN, aes(x = ppgdp, y = infantMortality)) + geom_point()
library(carData)
ggplot(UN, aes(x = ppgdp, y = infantMortality)) +
  geom_point(alpha = 0.33) +   # add transparency to avoid overplotting
  geom_smooth()                # add a smooth line to show a trend
Color the points by group. Keep the trend lines. What are the problems with the trend lines, especially for the groups “africa” and “other”?
library(carData)
ggplot(UN, aes(x = ppgdp, y = infantMortality, color = group)) +
  geom_point(alpha = 0.33) +
  geom_smooth()
The problem with the trend lines, especially for the “africa” group, is that there are only few observations at the more extreme GDP values in each group. This makes the trend unreliable there (hence the wide confidence band around it) and sensitive to outliers (the lines go up because of single outliers in the “africa” and “other” groups). Furthermore, the confidence band extends into negative values on the y-scale, which does not make sense for a mortality rate.
One could improve this by excluding outliers. Additionally, faceting could aid the eye.
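As a sketch of the faceting idea (not part of the original solution), each group can get its own panel with facet_wrap(). This assumes the ggplot2 and carData packages are installed; the object name p is only illustrative:

```r
library(ggplot2)
library(carData)
# One panel per group, so the trend lines and their bands no longer overlap
p <- ggplot(UN, aes(x = ppgdp, y = infantMortality)) +
  geom_point(alpha = 0.33) +
  geom_smooth() +
  facet_wrap(~ group)
p
```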
Load the data set Arrests from the package carData.
Count the arrests per year and sex (Hint: you can use group_by() and count() or n()).
(arrests_per_year_sex <- Arrests %>% group_by(year, sex) %>% count())
## # A tibble: 12 x 3
## # Groups: year, sex [12]
## year sex n
## <int> <fct> <int>
## 1 1997 Female 32
## 2 1997 Male 460
## 3 1998 Female 95
## 4 1998 Male 782
## 5 1999 Female 100
## 6 1999 Male 999
## 7 2000 Female 105
## 8 2000 Male 1165
## 9 2001 Female 93
## 10 2001 Male 1118
## 11 2002 Female 18
## 12 2002 Male 259
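The hint mentions two routes to the same counts. A sketch on a small made-up data frame (the object toy and its values are hypothetical), assuming dplyr is installed:

```r
library(dplyr)
toy <- data.frame(
  year = c(1997, 1997, 1997, 1998),
  sex  = c("Male", "Male", "Female", "Male")
)
# Route 1: group, then count the rows per group
via_count <- toy %>% group_by(year, sex) %>% count() %>% ungroup()
# Route 2: summarise each group with n()
via_n <- toy %>% group_by(year, sex) %>% summarise(n = n(), .groups = "drop")
all.equal(as.data.frame(via_count), as.data.frame(via_n))
```

Both routes give one row per year/sex combination with the count in a column n.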
ggplot(arrests_per_year_sex, aes(x = year, y = n, color = sex)) + geom_line() + geom_point()
There is a sudden decline in arrests in 2002, which breaks with the trend of the previous years. This might have several reasons. One could be that the data for 2002 are incomplete (e.g. only data from the first quarter of the year). Another could be a change in legislation regarding the possession of marijuana.