I provide a ZIP file on the tutorial website with two data sets:

- schulen_bb_051015.RDS. We need the following variables:
  - jahr: year (integer of 2005, 2010 or 2015)
  - plz: postal code (character string, 5-digit code)
  - traeger: type of institution (factor: oeff = public, priv = private)
- gemvz_bb_051015.RDS. It contains the following data about each municipality in the state of Brandenburg in the years 2005, 2010 and 2015:
  - jahr: year (integer of 2005, 2010 or 2015)
  - plz: postal code (character string, 5-digit code)
  - gemeindename: municipality name
  - einw_gesamt: population in municipality in that year
  - flaeche_km2: municipality area in square kilometers (km²)

The data are provided as “RDS” files, which you can read in and assign to an object:
mydata <- readRDS('my/path/to/some_file.RDS')
Question: Do private schools tend to be located in more densely populated areas than public schools?
Try to answer the question by combining the given data sets and calculating summary statistics and/or creating graphical representations of the data.
Some notes to guide you:
I will call the school data schools and the municipality data gemvz.
First, you need to calculate the population density in the gemvz data set. However, doing this directly with the data set will give you the population density per municipality, not per postal code. You cannot simply match the postal code in schools with the postal code in gemvz, because several municipalities may share the same postal code, so which population and area value would you use then? If you had the name of the municipality of each school in the schools data set, you could match by both, i.e. postal code and municipality name. However, the variable ort in schools will not match the variable gemeindename in gemvz, so you can’t use that.
What you can do is match by postal code only. For this, you need to reduce gemvz to a data set that contains the total area and total population of each postal code zone for each year. You can do that with group_by() and summarize(). Then you can calculate the population density per postal code zone from these totals. I’ll call the resulting data set pop_dens_plz. It contains the population density per postal code for each year.
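The aggregation step could look roughly like the following sketch. It assumes the dplyr package is installed and uses a few made-up rows in place of the real gemvz data, so the postal codes, names and numbers are purely illustrative. The column names einw, flaeche and pop_dens are my own choices and not part of the provided files.

```r
library(dplyr)

# Made-up toy rows standing in for gemvz; the real data come from
# gemvz_bb_051015.RDS. Two municipalities share postal code "11111" in 2005.
gemvz <- data.frame(
  jahr = c(2005L, 2005L, 2005L, 2010L),
  plz = c("11111", "11111", "22222", "11111"),
  gemeindename = c("A", "B", "C", "A"),
  einw_gesamt = c(1000, 500, 2000, 1200),
  flaeche_km2 = c(10, 5, 40, 10),
  stringsAsFactors = FALSE
)

pop_dens_plz <- gemvz %>%
  group_by(jahr, plz) %>%
  # total population and total area per postal code zone and year
  summarize(
    einw = sum(einw_gesamt),
    flaeche = sum(flaeche_km2),
    .groups = "drop"
  ) %>%
  # density in inhabitants per square kilometer (hypothetical column name)
  mutate(pop_dens = einw / flaeche)
```

Note that the density is computed from the summed totals. Averaging the per-municipality densities instead would give small and large municipalities the same weight, which is usually not what you want here.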
You can then combine pop_dens_plz with the schools data, matching by postal code and year. However, this won’t work for all data in schools, i.e. some schools might not get matched with a postal code from pop_dens_plz. Make sure to find out how many entries were not matched. What is the proportion of unmatched school entries? What could be the reason that some school entries could not be matched?
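One possible way to do the matching and to count the unmatched entries is sketched below, again with made-up toy data and assuming dplyr is installed. The object names schools_dens, n_unmatched, share_unmatched and unmatched are hypothetical, and so is the density column pop_dens.

```r
library(dplyr)

# Made-up toy data; only the column names follow the real files
schools <- data.frame(
  jahr = c(2005L, 2005L, 2010L, 2010L),
  plz = c("11111", "22222", "11111", "33333"),
  traeger = factor(c("oeff", "priv", "oeff", "priv"),
                   levels = c("oeff", "priv")),
  stringsAsFactors = FALSE
)
pop_dens_plz <- data.frame(
  jahr = c(2005L, 2005L, 2010L),
  plz = c("11111", "22222", "11111"),
  pop_dens = c(100, 50, 120),
  stringsAsFactors = FALSE
)

# left_join() keeps every school row; schools without a matching
# postal code and year get NA in pop_dens
schools_dens <- left_join(schools, pop_dens_plz, by = c("jahr", "plz"))

n_unmatched <- sum(is.na(schools_dens$pop_dens))
share_unmatched <- mean(is.na(schools_dens$pop_dens))

# anti_join() returns the school rows that found no partner,
# which helps when looking for the reason
unmatched <- anti_join(schools, pop_dens_plz, by = c("jahr", "plz"))
```

Counting NA values only equals the number of unmatched rows if pop_dens itself is never missing in pop_dens_plz. The anti_join() result does not depend on that assumption.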
You should now have a combined data set with school entries and the corresponding population density for each school’s postal code zone. With this data set you can answer this task’s question. However, you should scrutinize your results. What are the limitations of this approach? What problems could have occurred during matching? Is there a problem with using the postal code zone as the unit of analysis?
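As a starting point for the comparison, summary statistics per type of institution could be computed along these lines. This is only a sketch with invented numbers that say nothing about the real result; schools_dens, pop_dens and dens_by_traeger are hypothetical names, and dplyr is assumed to be installed.

```r
library(dplyr)

# Invented combined data: one row per school entry with the density of
# its postal code zone (NA where the school could not be matched)
schools_dens <- data.frame(
  traeger = factor(c("oeff", "oeff", "oeff", "priv", "priv", "priv"),
                   levels = c("oeff", "priv")),
  pop_dens = c(40, 60, 110, 90, 150, NA)
)

dens_by_traeger <- schools_dens %>%
  filter(!is.na(pop_dens)) %>%   # drop unmatched schools
  group_by(traeger) %>%
  summarize(
    n = n(),
    median_dens = median(pop_dens),
    mean_dens = mean(pop_dens),
    .groups = "drop"
  )
```

For a graphical view, a call such as boxplot(pop_dens ~ traeger, data = schools_dens) shows both distributions side by side. Population density is typically right-skewed, so the median is often more informative than the mean, and it may be worth doing the comparison separately for each year.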