R Tutorial at the WZB

Schools and population density project

I provide a ZIP file on the tutorial website with two data sets:

A data set on schools in the state of Brandenburg in the years 2005, 2010 and 2015 in file schulen_bb_051015.RDS. We need the following variables:
- jahr: year (integer of 2005, 2010 or 2015)
- plz: postal code (character string, 5-digit code)
- traeger: type of institution (factor, oeff – public, priv – private)
An excerpt of the official directory of municipalities in Germany (“Gemeindeverzeichnis”) in file gemvz_bb_051015.RDS. It contains the following data about each municipality in the state of Brandenburg in the years 2005, 2010 and 2015:
- jahr: year (integer of 2005, 2010 or 2015)
- plz: postal code (character string, 5-digit code)
- gemeindename: municipality name
- einw_gesamt: population in municipality in that year
- flaeche_km2: municipality area in square kilometers (km²)

The data are provided as “RDS” files, which you can read in and assign to an object:

mydata <- readRDS('my/path/to/some_file.RDS')
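The following minimal sketch shows that an RDS file preserves column types when it is written and read back. This matters here because plz is stored as a character string, so a leading zero is not lost. The data frame toy and its values are made up for illustration and are not part of the tutorial data.

```r
# Toy stand-in for one of the data sets (values are invented for illustration)
toy <- data.frame(
  jahr = c(2005L, 2010L),
  plz = c("01001", "01002"),
  stringsAsFactors = FALSE
)

# Write to a temporary RDS file and read it back in
tmp <- tempfile(fileext = ".RDS")
saveRDS(toy, tmp)
restored <- readRDS(tmp)

# Inspect the structure: jahr is still integer, plz is still character
str(restored)
```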

Question: Do private schools tend to be located in more densely populated areas than public schools?

Try to answer the question by combining the given data sets and calculating summary statistics and/or creating graphical representations of the data.

Some notes to guide you:

I will call the school data schools and the municipality data gemvz.

First, you need to calculate the population density from the gemvz data set. However, doing this directly gives you the population density per municipality, not per postal code. You cannot simply match the postal code in schools with the postal code in gemvz, because several municipalities may share the same postal code. Which population and area values would you use then? If the schools data set contained the name of each school’s municipality, you could match by both postal code and municipality name. However, the variable ort in schools does not match the variable gemeindename in gemvz, so you can’t use that.
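One way to check how many postal codes are shared by several municipalities is to count rows per year and postal code. The sketch below assumes dplyr is installed and uses a small made-up stand-in for gemvz; the municipality names and numbers are hypothetical.

```r
library(dplyr)

# Toy stand-in for gemvz (all values are invented for illustration)
gemvz <- tibble(
  jahr = c(2015L, 2015L, 2015L),
  plz = c("01001", "01001", "01002"),
  gemeindename = c("A-Dorf", "B-Dorf", "C-Stadt"),
  einw_gesamt = c(500, 1500, 20000),
  flaeche_km2 = c(10, 30, 50)
)

# Count municipalities per year and postal code
plz_counts <- count(gemvz, jahr, plz, name = "n_gemeinden")

# Postal codes that belong to more than one municipality are ambiguous
ambiguous <- filter(plz_counts, n_gemeinden > 1)
ambiguous
```

With the real data, the number of rows in ambiguous indicates how often a direct match by postal code would find more than one municipality.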

What you can do is match by postal code only. For this, you need to reduce gemvz to a data set that contains the total area and total population of each postal code zone for each year. You can do that with group_by() and summarize(). Then you can calculate the population density per postal code zone. I’ll call the resulting data set pop_dens_plz. It contains the population density per postal code for each year.
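The aggregation step could look roughly like the sketch below. It assumes dplyr is installed and uses a made-up stand-in for gemvz. The column names einw, flaeche and pop_dens are my own choices for illustration and are not prescribed by the task.

```r
library(dplyr)

# Toy stand-in for gemvz (all values are invented for illustration)
gemvz <- tibble(
  jahr = c(2010L, 2015L, 2015L, 2015L),
  plz = c("01001", "01001", "01001", "01002"),
  gemeindename = c("A-Dorf", "A-Dorf", "B-Dorf", "C-Stadt"),
  einw_gesamt = c(450, 500, 1500, 20000),
  flaeche_km2 = c(10, 10, 30, 50)
)

pop_dens_plz <- gemvz %>%
  group_by(jahr, plz) %>%
  # Sum population and area over all municipalities in a postal code zone
  summarize(
    einw = sum(einw_gesamt),
    flaeche = sum(flaeche_km2)
  ) %>%
  ungroup() %>%
  # Density is total population divided by total area (inhabitants per km²)
  mutate(pop_dens = einw / flaeche)

pop_dens_plz
```

Note that the sketch sums population and area first and divides afterwards. Averaging the per-municipality densities instead would give small and large municipalities the same weight, which is usually not what you want.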

You can then combine pop_dens_plz with the schools data, matching by postal code and year. However, this won’t work for all rows in schools, i.e. some schools might not be matched with any postal code in pop_dens_plz. Make sure to find out how many entries were not matched. What’s the proportion of unmatched school entries? What could be the reason that some school entries could not be matched?
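A possible way to do the matching and to count unmatched rows is sketched below, assuming dplyr is installed. Both data frames are small made-up stand-ins, and the column name pop_dens is an assumption. A left join keeps every school row and fills pop_dens with NA where no partner was found.

```r
library(dplyr)

# Toy stand-ins (all values are invented for illustration)
pop_dens_plz <- tibble(
  jahr = c(2015L, 2015L),
  plz = c("01001", "01002"),
  pop_dens = c(50, 400)
)
schools <- tibble(
  jahr = c(2015L, 2015L, 2015L, 2015L),
  plz = c("01001", "01002", "01002", "01999"),
  traeger = factor(c("oeff", "oeff", "priv", "priv"), levels = c("oeff", "priv"))
)

# Keep all school rows; unmatched rows get NA in pop_dens
schools_dens <- left_join(schools, pop_dens_plz, by = c("jahr", "plz"))

# Number and share of school entries without a match
n_unmatched <- sum(is.na(schools_dens$pop_dens))
share_unmatched <- mean(is.na(schools_dens$pop_dens))

# anti_join() returns exactly the school rows that found no partner
unmatched <- anti_join(schools, pop_dens_plz, by = c("jahr", "plz"))
```

Looking at the rows in unmatched, for example at their postal codes and years, is a good starting point for finding out why the match failed.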

You should now have a combined data set with school entries and the corresponding population density for each school’s postal code zone. With this data set you can answer this task’s question. However, you should scrutinize your results. What are the limitations of this approach? What problems could occur during matching? Is there a problem with using the postal code zone?
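As one possible starting point for the comparison, the sketch below computes the median and mean population density per year and type of institution. It assumes dplyr is installed and a combined data frame with a pop_dens column. The name schools_dens and all values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the combined data set (all values are invented)
schools_dens <- tibble(
  jahr = rep(2015L, 6),
  traeger = factor(
    c("oeff", "oeff", "oeff", "priv", "priv", "priv"),
    levels = c("oeff", "priv")
  ),
  pop_dens = c(50, 400, 100, 400, 1200, NA)
)

dens_summary <- schools_dens %>%
  # Drop school entries that could not be matched
  filter(!is.na(pop_dens)) %>%
  group_by(jahr, traeger) %>%
  summarize(
    n = n(),
    median_dens = median(pop_dens),
    mean_dens = mean(pop_dens)
  ) %>%
  ungroup()

dens_summary
```

The median is less sensitive than the mean to a few very densely populated postal code zones, so it is worth reporting both. A box plot of pop_dens by traeger, e.g. with geom_boxplot() from ggplot2, would show the same comparison graphically.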

Tasks for session 10 - Record linkage

Markus Konrad

January 10, 2019