library(tidyverse)
schools <- readRDS('10linkage-resources/schulen_bb_051015.RDS')
head(schools)
## art art_reduziert bundesland jahr
## 1 Grundschule gs brandenburg 2005
## 2 Grundschule gs brandenburg 2005
## 3 Grundschule gs brandenburg 2005
## 4 Grundschule gs brandenburg 2005
## 5 Grundschule gs brandenburg 2005
## 6 Grundschule gs brandenburg 2005
## name ort plz
## 1 Grundschule Spreenhagen Artur Becker Spreenhagen 15528
## 2 Grundschule "A. Diesterweg" Falkensee 14612
## 3 Havelschule Oranienburg 16515
## 4 Grundschule "Albert Schweitzer" Treuenbrietzen 14929
## 5 Astrid-Lindgren-Grundschule Frankfurt (Oder) 15236
## 6 Grundschule Königs Wusterhausen 15758
## priv_schule_typ strasse traeger
## 1 A.-Becker-Ring 3 oeff
## 2 Adlerstraße 9 oeff
## 3 Albert-Buchmann-Straße 11 oeff
## 4 Albert-Schweitzer-Straße 23 oeff
## 5 Alexej-Leonow-Straße 4 oeff
## 6 Alte Trift 3 oeff
gemvz <- readRDS('10linkage-resources/gemvz_bb_051015.RDS')
head(gemvz)
## # A tibble: 6 x 5
## jahr plz gemeindename einw_gesamt flaeche_km2
## <int> <chr> <chr> <dbl> <dbl>
## 1 2005 14770 Brandenburg an der Havel, Stadt 74129 229.
## 2 2005 03046 Cottbus, Stadt 105309 164.
## 3 2005 15230 Frankfurt (Oder), Stadt 63748 148.
## 4 2005 14461 Potsdam, Stadt 147583 187.
## 5 2005 16356 Ahrensfelde 12848 57.7
## 6 2005 16321 Bernau bei Berlin, Stadt 35235 104.
How many municipalities are there per postal code in each year? I just want to show that there are several municipalities per postal code.
n_per_jahr_plz <- group_by(gemvz, jahr, plz) %>% count()
head(n_per_jahr_plz)
## # A tibble: 6 x 3
## # Groups: jahr, plz [6]
## jahr plz n
## <int> <chr> <int>
## 1 2005 01945 10
## 2 2005 01968 1
## 3 2005 01979 1
## 4 2005 01983 2
## 5 2005 01987 1
## 6 2005 01990 2
range(n_per_jahr_plz$n)
## [1] 1 11
Let’s view a sample of a postal code with its municipalities in year 2005:
filter(gemvz, jahr == 2005, plz == '01945')
## # A tibble: 10 x 5
## jahr plz gemeindename einw_gesamt flaeche_km2
## <int> <chr> <chr> <dbl> <dbl>
## 1 2005 01945 Frauendorf 798 20.8
## 2 2005 01945 Kroppen 762 15.2
## 3 2005 01945 Lindenau 773 11.1
## 4 2005 01945 Tettau 881 8.8
## 5 2005 01945 Grünewald 635 13.4
## 6 2005 01945 Guteborn 624 16.6
## 7 2005 01945 Hermsdorf 924 32.8
## 8 2005 01945 Hohenbocka 1177 15.6
## 9 2005 01945 Ruhland, Stadt 4106 37.1
## 10 2005 01945 Schwarzbach 778 15.8
So we have 10 municipalities belonging to this postal code.
For each postal code in each year, we need to sum up the area and the population of each of its municipalities. We can do that by grouping and summarizing:
jahr_plz <- group_by(gemvz, jahr, plz) %>% # total postal code areas and population per year
summarise(einw_plz = sum(einw_gesamt), flaeche_plz = sum(flaeche_km2)) %>%
ungroup()
head(jahr_plz)
## # A tibble: 6 x 4
## jahr plz einw_plz flaeche_plz
## <int> <chr> <dbl> <dbl>
## 1 2005 01945 11458 187.
## 2 2005 01968 28774 127.
## 3 2005 01979 18697 88.4
## 4 2005 01983 12264 119.
## 5 2005 01987 6555 33.2
## 6 2005 01990 3790 21.1
Have a look at our sample from above which contained ten municipalities:
filter(jahr_plz, plz == '01945')
## # A tibble: 3 x 4
## jahr plz einw_plz flaeche_plz
## <int> <chr> <dbl> <dbl>
## 1 2005 01945 11458 187.
## 2 2010 01945 10759 187.
## 3 2015 01945 10072 188.
We can now calculate the population density for each postal code:
jahr_plz$einw_per_km2 = jahr_plz$einw_plz / jahr_plz$flaeche_plz
head(jahr_plz)
## # A tibble: 6 x 5
## jahr plz einw_plz flaeche_plz einw_per_km2
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2005 01945 11458 187. 61.2
## 2 2005 01968 28774 127. 227.
## 3 2005 01979 18697 88.4 211.
## 4 2005 01983 12264 119. 103.
## 5 2005 01987 6555 33.2 197.
## 6 2005 01990 3790 21.1 180.
Let’s have a quick look at the postal codes with the highest population densities:
filter(jahr_plz, jahr == 2015) %>% arrange(desc(einw_per_km2)) %>% head(n = 5)
## # A tibble: 5 x 5
## jahr plz einw_plz flaeche_plz einw_per_km2
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2015 16548 12155 4.61 2637.
## 2 2015 15732 14298 11.9 1203.
## 3 2015 14513 25483 21.6 1180.
## 4 2015 15745 9978 9.1 1096.
## 5 2015 16767 6678 6.45 1035.
The municipalities that belong to the high density postal code zones:
plz_max_dens <- filter(jahr_plz, jahr == 2015) %>%
arrange(desc(einw_per_km2)) %>% head(n = 5) %>%
pull(plz) # pull retrieves a single column as vector
filter(gemvz, jahr == 2015, plz %in% plz_max_dens)
## # A tibble: 6 x 5
## jahr plz gemeindename einw_gesamt flaeche_km2
## <int> <chr> <chr> <dbl> <dbl>
## 1 2015 15732 Eichwalde 6426 2.79
## 2 2015 15732 Schulzendorf 7872 9.1
## 3 2015 15745 Wildau, Stadt 9978 9.1
## 4 2015 16548 Glienicke/Nordbahn 12155 4.61
## 5 2015 16767 Leegebruch 6678 6.45
## 6 2015 14513 Teltow, Stadt 25483 21.6
Interestingly, it’s not the bigger cities that appear in that list. It’s postal code zones that cover a very small area, but still have a comparatively high number of inhabitants.
A summary of the density data shows us that both population and area in the postal code zones are quite unevenly distributed.
summary(jahr_plz)
## jahr plz einw_plz flaeche_plz
## Min. :2005 Length:540 Min. : 1222 Min. : 4.6
## 1st Qu.:2005 Class :character 1st Qu.: 5340 1st Qu.: 70.5
## Median :2010 Mode :character Median : 9012 Median :127.6
## Mean :2010 Mean : 13977 Mean :164.1
## 3rd Qu.:2015 3rd Qu.: 17832 3rd Qu.:207.0
## Max. :2015 Max. :167745 Max. :890.8
## einw_per_km2
## Min. : 14.25
## 1st Qu.: 34.04
## Median : 58.23
## Mean : 171.76
## 3rd Qu.: 175.46
## Max. :2636.66
qplot(jahr_plz$einw_plz, binwidth = 5000)
library(scales)
# a log2 scale makes more sense for area
qplot(jahr_plz$flaeche_plz, binwidth = 0.1) + scale_x_continuous(trans = log2_trans())
qplot(jahr_plz$einw_per_km2, binwidth = 100)
We can use an inner join because for our calculations we need only those school entries that match with the density data. We match on year and postal code.
schools_w_dens <- inner_join(schools, jahr_plz, by = c('jahr', 'plz')) %>%
select(-c(art, art_reduziert, bundesland, strasse, priv_schule_typ)) # we don't need these variables
head(schools_w_dens)
## jahr name ort plz traeger
## 1 2005 Grundschule Spreenhagen Artur Becker Spreenhagen 15528 oeff
## 2 2005 Grundschule "A. Diesterweg" Falkensee 14612 oeff
## 3 2005 Havelschule Oranienburg 16515 oeff
## 4 2005 Grundschule "Albert Schweitzer" Treuenbrietzen 14929 oeff
## 5 2005 Astrid-Lindgren-Grundschule Frankfurt (Oder) 15236 oeff
## 6 2005 Regenbogengrundschule Brüssow Brüssow 17326 oeff
## einw_plz flaeche_plz einw_per_km2
## 1 3535 135.80 26.03093
## 2 38376 43.30 886.28176
## 3 41115 162.37 253.21796
## 4 8475 211.33 40.10316
## 5 3859 101.82 37.90022
## 6 2315 101.02 22.91625
Let’s see how many school entries were matched.
(n_schools <- nrow(schools))
## [1] 2596
(n_schools_matched <- nrow(schools_w_dens))
## [1] 2176
Percentage that could be matched via postal code:
n_schools_matched / n_schools * 100
## [1] 83.82126
About 84% matched. A few school entries could not be matched because they have an invalid postal code:
sum(str_length(schools$plz) != 5)
## [1] 3
The others were probably not matched because the postal code was not correct for that year (postal codes changed from time to time and these changes may not have been reflected in the school entries). We would have to check that manually.
We can have a quick look on some summary statistics comparing public and private schools in terms of population density:
group_by(schools_w_dens, traeger) %>% summarise(med_dens = median(einw_per_km2),
mean_dens = mean(einw_per_km2),
sd_dens = sd(einw_per_km2),
n = n())
## # A tibble: 2 x 5
## traeger med_dens mean_dens sd_dens n
## <fct> <dbl> <dbl> <dbl> <int>
## 1 oeff 95.7 205. 263. 1891
## 2 priv 145. 282. 337. 285
As boxplot, cutting off some extreme values:
ggplot(schools_w_dens, aes(x = traeger, y = einw_per_km2)) +
geom_boxplot() +
ylim(0, 1000)
With the null hypothesis that the population density means between public and private schools are equal and the alternative hypothesis that the population density for private schools is higher than for public schools, we can perform a t-test. Since we have a large sample, the skewed density distribution should not be a problem.
Performing a t-test between public and private schools with a 95% confidence level for all years reveals that the population density around private schools is significantly higher:
# one-sided test with alternative hypothesis mean_public < mean_private
# default is 95% confidence level
t.test(einw_per_km2 ~ traeger, data = schools_w_dens, alternative = 'less')
##
## Welch Two Sample t-test
##
## data: einw_per_km2 by traeger
## t = -3.6635, df = 338.01, p-value = 0.0001444
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -42.04175
## sample estimates:
## mean in group oeff mean in group priv
## 205.0447 281.5153
However, the observations within the group are not independent when mixing the observations of all years. We should perform the t-test per year difference, which also reveals how things have changed over the years:
library(broom)
# We use `tidy()` from the package broom which converts the output of `t.test` into
# a data frame.
# `do()` applies a function to each group. The object `.` is a placeholder for the
# current group within a `do()`-call.
ttests_year <- group_by(schools_w_dens, jahr) %>%
do(tidy(t.test(einw_per_km2 ~ traeger, data = ., alternative = 'less')))
jahr | estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|---|
2005 | -55.00505 | 195.8145 | 250.8195 | -1.612899 | 0.0555974 | 71.13215 | -Inf | 1.829981 | Welch Two Sample t-test | less |
2010 | -86.83647 | 207.9497 | 294.7862 | -2.352823 | 0.0101235 | 120.85401 | -Inf | -25.660327 | Welch Two Sample t-test | less |
2015 | -72.75758 | 213.2991 | 286.0567 | -2.109953 | 0.0182524 | 151.58393 | -Inf | -15.689273 | Welch Two Sample t-test | less |
In the output, estimate1
and estimate2
are the means for public and private school population densities respectively. estimate
is the difference in means. parameter
is the degrees of freedom used for calculating the t-statistic (column statistic
). conf.low
and conf.high
are the lower and upper bounds of the confidence interval at a 95% confidence level. We can see that a difference of means of \(0\) (our null hypothesis) is within the confidence interval for 2005 (which is also reflected by the p-value above the significance level of 5%), but not for the subsequent years.
The results suggest that private schools tend to be located in more densily populated areas as compared to public schools and the difference in means are significant at a 95% confidence level for the years 2010 and 2015. For 2005, the result is very close to the significance level of 5%.
However, with this limited data we can’t really understand the full dynamics of where and why (private) schools are founded. Also, the postal code zones are very heterogenous in terms of the area and population that they cover. A high density postal code zone doesn’t necessarily mean that this area is an urban area. It could be that the postal code zone is just very small. It would be more fair to do comparisons on municipality level.