Tasks

library(tidyverse)

Schools and population density

Reading in the data

schools <- readRDS('10linkage-resources/schulen_bb_051015.RDS')
head(schools)
##           art art_reduziert  bundesland jahr
## 1 Grundschule            gs brandenburg 2005
## 2 Grundschule            gs brandenburg 2005
## 3 Grundschule            gs brandenburg 2005
## 4 Grundschule            gs brandenburg 2005
## 5 Grundschule            gs brandenburg 2005
## 6 Grundschule            gs brandenburg 2005
##                                   name                 ort   plz
## 1 Grundschule Spreenhagen Artur Becker         Spreenhagen 15528
## 2          Grundschule "A. Diesterweg"           Falkensee 14612
## 3                          Havelschule         Oranienburg 16515
## 4      Grundschule "Albert Schweitzer"      Treuenbrietzen 14929
## 5          Astrid-Lindgren-Grundschule    Frankfurt (Oder) 15236
## 6                          Grundschule Königs Wusterhausen 15758
##   priv_schule_typ                     strasse traeger
## 1                            A.-Becker-Ring 3    oeff
## 2                               Adlerstraße 9    oeff
## 3                   Albert-Buchmann-Straße 11    oeff
## 4                 Albert-Schweitzer-Straße 23    oeff
## 5                      Alexej-Leonow-Straße 4    oeff
## 6                                Alte Trift 3    oeff
gemvz <- readRDS('10linkage-resources/gemvz_bb_051015.RDS')
head(gemvz)
## # A tibble: 6 x 5
##    jahr plz   gemeindename                    einw_gesamt flaeche_km2
##   <int> <chr> <chr>                                 <dbl>       <dbl>
## 1  2005 14770 Brandenburg an der Havel, Stadt       74129       229. 
## 2  2005 03046 Cottbus, Stadt                       105309       164. 
## 3  2005 15230 Frankfurt (Oder), Stadt               63748       148. 
## 4  2005 14461 Potsdam, Stadt                       147583       187. 
## 5  2005 16356 Ahrensfelde                           12848        57.7
## 6  2005 16321 Bernau bei Berlin, Stadt              35235       104.

How many municipalities are there per postal code in each year? I just want to show that there are several municipalities per postal code.

n_per_jahr_plz <- group_by(gemvz, jahr, plz) %>% count()
head(n_per_jahr_plz)
## # A tibble: 6 x 3
## # Groups:   jahr, plz [6]
##    jahr plz       n
##   <int> <chr> <int>
## 1  2005 01945    10
## 2  2005 01968     1
## 3  2005 01979     1
## 4  2005 01983     2
## 5  2005 01987     1
## 6  2005 01990     2
range(n_per_jahr_plz$n)
## [1]  1 11

Let’s view a sample of a postal code with its municipalities in year 2005:

filter(gemvz, jahr == 2005, plz == '01945')
## # A tibble: 10 x 5
##     jahr plz   gemeindename   einw_gesamt flaeche_km2
##    <int> <chr> <chr>                <dbl>       <dbl>
##  1  2005 01945 Frauendorf             798        20.8
##  2  2005 01945 Kroppen                762        15.2
##  3  2005 01945 Lindenau               773        11.1
##  4  2005 01945 Tettau                 881         8.8
##  5  2005 01945 Grünewald              635        13.4
##  6  2005 01945 Guteborn               624        16.6
##  7  2005 01945 Hermsdorf              924        32.8
##  8  2005 01945 Hohenbocka            1177        15.6
##  9  2005 01945 Ruhland, Stadt        4106        37.1
## 10  2005 01945 Schwarzbach            778        15.8

So we have 10 municipalities belonging to this postal code.

For each postal code in each year, we need to sum up the area and the population of each of its municipalities. We can do that by grouping and summarizing:

jahr_plz <- group_by(gemvz, jahr, plz) %>%   # total postal code areas and population per year
  summarise(einw_plz = sum(einw_gesamt), flaeche_plz = sum(flaeche_km2)) %>%
  ungroup()
head(jahr_plz)
## # A tibble: 6 x 4
##    jahr plz   einw_plz flaeche_plz
##   <int> <chr>    <dbl>       <dbl>
## 1  2005 01945    11458       187. 
## 2  2005 01968    28774       127. 
## 3  2005 01979    18697        88.4
## 4  2005 01983    12264       119. 
## 5  2005 01987     6555        33.2
## 6  2005 01990     3790        21.1

Have a look at our sample from above which contained ten municipalities:

filter(jahr_plz, plz == '01945')
## # A tibble: 3 x 4
##    jahr plz   einw_plz flaeche_plz
##   <int> <chr>    <dbl>       <dbl>
## 1  2005 01945    11458        187.
## 2  2010 01945    10759        187.
## 3  2015 01945    10072        188.

We can now calculate the population density for each postal code:

jahr_plz$einw_per_km2 = jahr_plz$einw_plz / jahr_plz$flaeche_plz
head(jahr_plz)
## # A tibble: 6 x 5
##    jahr plz   einw_plz flaeche_plz einw_per_km2
##   <int> <chr>    <dbl>       <dbl>        <dbl>
## 1  2005 01945    11458       187.          61.2
## 2  2005 01968    28774       127.         227. 
## 3  2005 01979    18697        88.4        211. 
## 4  2005 01983    12264       119.         103. 
## 5  2005 01987     6555        33.2        197. 
## 6  2005 01990     3790        21.1        180.

Let’s have a quick look at the postal codes with the highest population densities:

filter(jahr_plz, jahr == 2015) %>% arrange(desc(einw_per_km2)) %>% head(n = 5)
## # A tibble: 5 x 5
##    jahr plz   einw_plz flaeche_plz einw_per_km2
##   <int> <chr>    <dbl>       <dbl>        <dbl>
## 1  2015 16548    12155        4.61        2637.
## 2  2015 15732    14298       11.9         1203.
## 3  2015 14513    25483       21.6         1180.
## 4  2015 15745     9978        9.1         1096.
## 5  2015 16767     6678        6.45        1035.

The municipalities that belong to the high density postal code zones:

plz_max_dens <- filter(jahr_plz, jahr == 2015) %>%
  arrange(desc(einw_per_km2)) %>% head(n = 5) %>%
  pull(plz)  # pull retrieves a single column as vector

filter(gemvz, jahr == 2015, plz %in% plz_max_dens)
## # A tibble: 6 x 5
##    jahr plz   gemeindename       einw_gesamt flaeche_km2
##   <int> <chr> <chr>                    <dbl>       <dbl>
## 1  2015 15732 Eichwalde                 6426        2.79
## 2  2015 15732 Schulzendorf              7872        9.1 
## 3  2015 15745 Wildau, Stadt             9978        9.1 
## 4  2015 16548 Glienicke/Nordbahn       12155        4.61
## 5  2015 16767 Leegebruch                6678        6.45
## 6  2015 14513 Teltow, Stadt            25483       21.6

Interestingly, it’s not the bigger cities that appear in that list. It’s postal code zones that cover a very small area, but still have a comparatively high number of inhabitants.

A summary of the density data shows us that both population and area in the postal code zones are quite unevenly distributed.

summary(jahr_plz)
##       jahr          plz               einw_plz       flaeche_plz   
##  Min.   :2005   Length:540         Min.   :  1222   Min.   :  4.6  
##  1st Qu.:2005   Class :character   1st Qu.:  5340   1st Qu.: 70.5  
##  Median :2010   Mode  :character   Median :  9012   Median :127.6  
##  Mean   :2010                      Mean   : 13977   Mean   :164.1  
##  3rd Qu.:2015                      3rd Qu.: 17832   3rd Qu.:207.0  
##  Max.   :2015                      Max.   :167745   Max.   :890.8  
##   einw_per_km2    
##  Min.   :  14.25  
##  1st Qu.:  34.04  
##  Median :  58.23  
##  Mean   : 171.76  
##  3rd Qu.: 175.46  
##  Max.   :2636.66
qplot(jahr_plz$einw_plz, binwidth = 5000)

library(scales)
# a log2 scale makes more sense for area
qplot(jahr_plz$flaeche_plz, binwidth = 0.1) + scale_x_continuous(trans = log2_trans())

qplot(jahr_plz$einw_per_km2, binwidth = 100)

Combining schools with population density data

We can use an inner join because for our calculations we need only those school entries that match with the density data. We match on year and postal code.

schools_w_dens <- inner_join(schools, jahr_plz, by = c('jahr', 'plz')) %>%
  select(-c(art, art_reduziert, bundesland, strasse, priv_schule_typ))      # we don't need these variables
head(schools_w_dens)
##   jahr                                 name              ort   plz traeger
## 1 2005 Grundschule Spreenhagen Artur Becker      Spreenhagen 15528    oeff
## 2 2005          Grundschule "A. Diesterweg"        Falkensee 14612    oeff
## 3 2005                          Havelschule      Oranienburg 16515    oeff
## 4 2005      Grundschule "Albert Schweitzer"   Treuenbrietzen 14929    oeff
## 5 2005          Astrid-Lindgren-Grundschule Frankfurt (Oder) 15236    oeff
## 6 2005        Regenbogengrundschule Brüssow          Brüssow 17326    oeff
##   einw_plz flaeche_plz einw_per_km2
## 1     3535      135.80     26.03093
## 2    38376       43.30    886.28176
## 3    41115      162.37    253.21796
## 4     8475      211.33     40.10316
## 5     3859      101.82     37.90022
## 6     2315      101.02     22.91625

Let’s see how many school entries were matched.

(n_schools <- nrow(schools))
## [1] 2596
(n_schools_matched <- nrow(schools_w_dens))
## [1] 2176

Percentage that could be matched via postal code:

n_schools_matched / n_schools * 100
## [1] 83.82126

About 84% matched. A few school entries could not be matched because they have an invalid postal code:

sum(str_length(schools$plz) != 5)
## [1] 3

The others were probably not matched because the postal code was not correct for that year (postal codes changed from time to time and these changes may not have been reflected in the school entries). We would have to check that manually.

Analysis

We can have a quick look on some summary statistics comparing public and private schools in terms of population density:

group_by(schools_w_dens, traeger) %>% summarise(med_dens = median(einw_per_km2),
                                                mean_dens = mean(einw_per_km2),
                                                sd_dens = sd(einw_per_km2),
                                                n = n())
## # A tibble: 2 x 5
##   traeger med_dens mean_dens sd_dens     n
##   <fct>      <dbl>     <dbl>   <dbl> <int>
## 1 oeff        95.7      205.    263.  1891
## 2 priv       145.       282.    337.   285

As boxplot, cutting off some extreme values:

ggplot(schools_w_dens, aes(x = traeger, y = einw_per_km2)) +
  geom_boxplot() +
  ylim(0, 1000)

With the null hypothesis that the population density means between public and private schools are equal and the alternative hypothesis that the population density for private schools is higher than for public schools, we can perform a t-test. Since we have a large sample, the skewed density distribution should not be a problem.

Performing a t-test between public and private schools with a 95% confidence level for all years reveals that the population density around private schools is significantly higher:

# one-sided test with alternative hypothesis mean_public < mean_private
# default is 95% confidence level
t.test(einw_per_km2 ~ traeger, data = schools_w_dens, alternative = 'less')
## 
##  Welch Two Sample t-test
## 
## data:  einw_per_km2 by traeger
## t = -3.6635, df = 338.01, p-value = 0.0001444
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -42.04175
## sample estimates:
## mean in group oeff mean in group priv 
##           205.0447           281.5153

However, the observations within the group are not independent when mixing the observations of all years. We should perform the t-test per year difference, which also reveals how things have changed over the years:

library(broom)

# We use `tidy()` from the package broom which converts the output of `t.test` into
# a data frame.
# `do()` applies a function to each group. The object `.` is a placeholder for the
# current group within a `do()`-call.

ttests_year <- group_by(schools_w_dens, jahr) %>%
                 do(tidy(t.test(einw_per_km2 ~ traeger, data = ., alternative = 'less')))
jahr estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
2005 -55.00505 195.8145 250.8195 -1.612899 0.0555974 71.13215 -Inf 1.829981 Welch Two Sample t-test less
2010 -86.83647 207.9497 294.7862 -2.352823 0.0101235 120.85401 -Inf -25.660327 Welch Two Sample t-test less
2015 -72.75758 213.2991 286.0567 -2.109953 0.0182524 151.58393 -Inf -15.689273 Welch Two Sample t-test less

In the output, estimate1 and estimate2 are the means for public and private school population densities respectively. estimate is the difference in means. parameter is the degrees of freedom used for calculating the t-statistic (column statistic). conf.low and conf.high are the lower and upper bounds of the confidence interval at a 95% confidence level. We can see that a difference of means of \(0\) (our null hypothesis) is within the confidence interval for 2005 (which is also reflected by the p-value above the significance level of 5%), but not for the subsequent years.

The results suggest that private schools tend to be located in more densily populated areas as compared to public schools and the difference in means are significant at a 95% confidence level for the years 2010 and 2015. For 2005, the result is very close to the significance level of 5%.

However, with this limited data we can’t really understand the full dynamics of where and why (private) schools are founded. Also, the postal code zones are very heterogenous in terms of the area and population that they cover. A high density postal code zone doesn’t necessarily mean that this area is an urban area. It could be that the postal code zone is just very small. It would be more fair to do comparisons on municipality level.