Introduction

Spatial joins allow to augment one spatial dataset with information from another spatial dataset by linking overlapping features. In this post I will provide an example showing how to augment a dataset containing school locations with socioeconomic data of their surrounding statistical region using R and the package sf (Pebesma 2018). This approach has the drawback that the surrounding statistical region doesn’t reflect the actual catchment area of the school. I will present an alternative approach where the overlaps of the schools’ catchment areas with the statistical regions allow to calculate the weighted average of the socioeconomic statistics. If we have no data about the actual catchment areas of the schools, we may resort to approximating these areas as circular regions or as Voronoi regions around schools.

Data

For this example, I’d like to compare the percentage of children whose parents obtain social welfare in the neighborhood regions around public and private primary schools in Berlin. This blog post concentrates on how to join the point samples (the schools) with the surrounding statistical regions and calculate a spatially weighted average the welfare rate, so I will present only a few descriptive results in the end.

We will work with several datasets: The first spatial dataset contains the shape of the statistical regions in Berlin, the second dataset contains the socioeconomic data for these regions, the third and fourth datasets contain the locations and other attributes of public and private primary schools in Berlin, respectively.

All data and the code are available in the GitHub repository. We will use the sf package for working with spatial data in R, dplyr for data management and ggplot2 for a few more advanced visualizations, i.e. when base plot() is not sufficient.

library(sf)
library(dplyr)
library(ggplot2)

Socioeconomic data for statistical regions

We will at first load a dataset with the most granular official statistical regions for Berlin, called Planungsräume (planning areas). We select the area ID and name as spatial attributes. The result is a spatial dataframe (a simple feature (sf) collection).

bln_plan <- read_sf('data/berlin_plr.shp') %>%
  mutate(areaid = as.integer(SCHLUESSEL)) %>%   # transform character SCHLUESSEL to numeric area ID
  select(areaid, name = PLR_NAME)
head(bln_plan)
## Simple feature collection with 6 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 386668.2 ymin: 5817761 xmax: 390764.3 ymax: 5820432
## Projected CRS: ETRS89 / UTM zone 33N
## # A tibble: 6 x 3
##    areaid name                                                          geometry
##     <int> <chr>                                               <MULTIPOLYGON [m]>
## 1 1011101 Stülerstr.      (((387256.6 5818552, 387323.1 5818572, 387418.9 58186…
## 2 1011102 Großer Tiergar… (((386767.5 5819393, 386768.3 5819389, 386769.6 58193…
## 3 1011103 Lützowstr.      (((387952.6 5818275, 387986.7 5818313, 387994.6 58183…
## 4 1011104 Körnerstr.      (((388847.1 5817875, 388855.5 5817899, 388865.1 58179…
## 5 1011105 Nördlicher Lan… (((388129.5 5819015, 388157.1 5819017, 388170.8 58190…
## 6 1011201 Wilhelmstr.     (((389845.7 5819286, 389840.9 5819311, 389846.1 58193…

When printing this dataframe, the header reveals another important information: The coordinate reference system (CRS) of this dataset is ETRS89 / UTM zone 33N. We will later need to make sure that the coordinates of the school locations and the coordinates of the planning areas use the same coordinate system.

This data can be joined with socioeconomic information provided from official sources. Luckily, Helbig/Salomo 2021 compiled these information for some cities in Germany (available for download) among which is data for Berlin from 2020. I’ve created an excerpt with percentages of residents receiving social welfare (welfare) and percentage of children under 15 years whose parents receive social welfare (welfare_chld):

bln_welfare <- read.csv('data/berlin_welfare.csv', stringsAsFactors = FALSE)
head(bln_welfare)
##    areaid                 areaname welfare welfare_chld
## 1 1011101             Stülerstraße   10.09        15.44
## 2 1011102        Großer Tiergarten    4.76         0.00
## 3 1011103             Lützowstraße   22.21        36.80
## 4 1011104             Körnerstraße   24.81        42.14
## 5 1011105 Nördlicher Landwehrkanal    2.82         3.53
## 6 1011201            Wilhelmstraße   12.13        19.03

We can use the area ID for augmenting the planning areas with the welfare statistics. We’re joining a spatial with an ordinary dataframe, so we can use dplyr’s inner_join. Before that we can check that for each planning area we have welfare statistics information and vice versa: 1

setequal(bln_plan$areaid, bln_welfare$areaid)
## [1] TRUE
bln <- inner_join(bln_plan, bln_welfare, by = 'areaid') %>%
  select(-name)
head(bln)
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 386668.2 ymin: 5817761 xmax: 390764.3 ymax: 5820432
## Projected CRS: ETRS89 / UTM zone 33N
## # A tibble: 6 x 5
##    areaid                              geometry areaname    welfare welfare_chld
##     <int>                    <MULTIPOLYGON [m]> <chr>         <dbl>        <dbl>
## 1 1011101 (((387256.6 5818552, 387323.1 581857… Stülerstra…   10.1         15.4 
## 2 1011102 (((386767.5 5819393, 386768.3 581938… Großer Tie…    4.76         0   
## 3 1011103 (((387952.6 5818275, 387986.7 581831… Lützowstra…   22.2         36.8 
## 4 1011104 (((388847.1 5817875, 388855.5 581789… Körnerstra…   24.8         42.1 
## 5 1011105 (((388129.5 5819015, 388157.1 581901… Nördlicher…    2.82         3.53
## 6 1011201 (((389845.7 5819286, 389840.9 581931… Wilhelmstr…   12.1         19.0

A quick plot confirms that it is similar to the one from the dashboard of the Helbig/Salomo study.2

plot(bln['welfare_chld'])