May 6, 2019

Today's schedule

  1. Introduction
  2. A short introduction to plotting with ggplot2
  3. Hello world: Plotting points on a world map
  4. Sources and file formats for geographic data

– lunch break –

  1. A short introduction to data linkage with dplyr
  2. Combining data and making a choropleth map
  3. Geocoding and reverse geocoding via Google Maps API
  4. Making your maps publication-ready

Introduction

About me

  • working at WZB as Data Scientist in IT dept. since April 2016
  • my job:
    • data extraction and preparation (e.g. data retrieval from APIs, web scraping, data extraction from PDFs)
    • quantitive text analysis (Topic modeling)
    • interactive and static visualizations
    • implementing experiments (esp. w/ oTree)
    • designing and implementing research databases
    • apparently giving workshops

contact: markus.konrad@wzb.eu | tel -555 | office D005

Aims of today's workshop

  • will show how to use R with geo-spatial data
  • main focus on visualization; no spatial data analysis
  • sources and file formats for geographic data
  • combine your data with existing geographic data

Why use R for that?

  • for simple things, you can use online tools like datawrapper
  • but: R gives you several advantages over these tools
    • it's for free
    • offers full flexibility
    • also supports geo-spatial analysis
    • no need to switch tools

A short introduction to plotting with ggplot2

Plotting with ggplot2

  • ggplot2: versatile package to create elegant plots with R
  • based on a theory for declaratively creating graphics called "Grammar of Graphics" (Wilkinson 2005)
  • documentation on https://ggplot2.tidyverse.org/


  • ggplot works by providing a dataset and mapping its variables to visual properties, e.g. variable age should be on the x-axis, income should be on the y-axis
  • with the same principle, maps can be made, e.g.: variable GDP determines the color of a country

Plotting with ggplot2: Example dataset

We'll have a look at the principles of ggplot2 in general, which we can later apply to making maps.

We'll make a plot for historical election results from four randomly selected states in the US (dataset presidentialElections from package pscl):

## # A tibble: 88 x 4
##    state      demVote  year south
##    <chr>        <dbl> <int> <lgl>
##  1 California    58.4  1932 FALSE
##  2 Florida       74.5  1932 TRUE 
##  3 Illinois      55.2  1932 FALSE
##  4 Tennessee     66.5  1932 TRUE 
##  5 California    67.0  1936 FALSE
##  6 Florida       76.1  1936 TRUE 
##  7 Illinois      57.7  1936 FALSE
##  8 Tennessee     68.8  1936 TRUE 
##  9 California    57.4  1940 FALSE
## 10 Florida       74.0  1940 TRUE 
## # … with 78 more rows

Plotting with ggplot2: Step 1

Step 1: Load ggplot2 and pass the dataset.

library(ggplot2)

ggplot(sampled_pres)

Plotting with ggplot2: Step 2

Step 2: Specify an aesthetic mapping via aes().
This describes how variables of your data are mapped to visual properties: "year" is plotted on the x-axis and the share of votes for Democrats on the y-axis.

ggplot(sampled_pres, aes(x = year, y = demVote))

Plotting with ggplot2: Step 3

Step 3: Adding a geom.
geoms are geometric objects; they describe which graphical primitives to use (e.g. points in a scatter plot or bars in a bar plot). Notice that this plot looks wrong. We'll fix that next.

ggplot(sampled_pres, aes(x = year, y = demVote)) + geom_line()

Plotting with ggplot2: Step 4

Step 4: Fixing the aesthetic mapping.
The previous plot looked wrong because several y-values appeared for the same years and they were all connected by a single line: For each year, we have a y-value (vote share) for each state! We adjust the aesthetic mapping by making the line color dependent on state, hence we have a line (of different color) per state:

ggplot(sampled_pres, aes(x = year, y = demVote, color = state)) + geom_line()

Plotting with ggplot2: Step 5

Step 5: Adding another layer.
We can add more geometric objects by adding another geom_ layer. For better readability, I put the layer functions on separate lines:

ggplot(sampled_pres, aes(x = year, y = demVote, color = state)) +
    geom_line() +
    geom_point()

Plotting with ggplot2: Datasets per layer

  • when you pass a dataset and aesthetic mapping to ggplot() it used by all layers as default
  • you can override it in each geom_... layer by setting aes(...) and data = ...
  • each layer may actually use a different dataset; when plotting maps, this is often used (e.g. one layer with data for a "background map", another for points on it)
random_dots <- data.frame(x = sample(sampled_pres$year, 10),
                          y = rnorm(10, mean = 50, sd = 10))
ggplot() +
    geom_line(aes(x = year, y = demVote, color = state), data = sampled_pres) +
    geom_point(aes(x = x, y = y), data = random_dots) # use a different dataset here

Further customizations

Additionally, you can further change the appearance of your plot by:

  • altering the scales (e.g. use a logarithmic scale, modify display of factors, etc.)
  • defining facets → create small multiples, each showing a different subset of the data
  • changing the overall appearance of the plot by adjusting its theme (e.g. change background color, rotate axis labels, etc.)

You combine all these steps with a +.

Exercise 1

1. Making a scatterplot (for those new to ggplot)

  • load ggplot2 and inform yourself about the msleep dataset
  • make a scatterplot with bodywt on x-axis, sleep_total on y-axis
  • improve the plot by:
    • transforming the x scale to log10 using scale_x_log10()
    • making the dot color dependent on vore

Hello world: Plotting points on a world map

Coordinate reference systems in brief

We usually work with two-dimensional vector data: Points located within a coordinate reference system (CRS).

  • the point -73.94°, 40.6° is represented in the WGS84 CRS: the world geodetic system (e.g. used by GPS)
  • points describe location on a sphere¹; its components are given in degrees:
    • -73.94° is the longitude: about 74° West from the meridian
    • 40.6° is the latitude: about 41° North from the equator

¹ WGS84 actually models the Earth as ellipsoid

Coordinate reference systems in brief

Let's put this point into context:

  • points can be connected to form other geometric shapes such as lines or polygons
  • points and shapes can represent numerous geographic entities: cities, buildings, countries, rivers, lakes, etc.

Packages, packages, packages

We need to extend R in order to work with geographic data (geo-data) in R by installing these packages:

  • maps: contains geo-data for national borders and administrative regions for several countries
  • rgeos and maptools: Misc. functions for operations on geometries
  • sf: "Simple Features for R" – reading, writing and transforming spatial data

Install them if you haven't yet:

install.packages(c('maps', 'maptools', 'sf', 'rgeos'))

Simple features

Most packages for working with geo-data in R rely on the Simple Features standard (→ sf package).

Simple features describe how objects in the real world can be represented in computers in terms of their spatial geometry:

Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties.
sf package vignette

Examples:

  • geometry: point at -73.94°, 40.6°
  • name: New York City
  • population: 8.6 mio.
  • geometry: polygons with points at …
  • name: Italy
  • most popular meal: pizza

A spatial dataset is a dataset that contains geo-data (in a geometry column) and attributes of that entity like population, poverty rate, etc.

Making a world map

First, we load the packages that we need:

library(ggplot2)
library(maps)
library(sf)

Making a world map

The function map can be used to load the "world" data. We need to convert it to a "simple features" object via st_as_sf():

worldmap_data <- st_as_sf(map('world', plot = FALSE, fill = TRUE))
head(worldmap_data, n = 3)  # to have a look inside
## Simple feature collection with 3 features and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -8.68335 ymin: 18.98662 xmax: 74.89131 ymax: 42.64795
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                         geometry          ID
## 1 MULTIPOLYGON (((74.89131 37... Afghanistan
## 2 MULTIPOLYGON (((20.06396 42...     Albania
## 3 MULTIPOLYGON (((8.576563 36...     Algeria
  • spatial dataset with a "header" giving some general information
  • two columns: geometry (spatial data) and ID (attribute)

What do the numbers in the geometry column mean? What's their CRS?

Making a world map

We are ready to plot the world map. Every "simple features" object can be plotted by adding a geom_sf layer:

ggplot() + geom_sf(data = worldmap_data)

Putting points on the map

We have some cities along with their longitude (lng) and latitude (lat):

##       name    lng    lat
## 1   Berlin  13.38  52.52
## 2 New York -73.94  40.60
## 3   Sydney 151.21 -33.87

We add a "point geom" layer to our map:

ggplot(some_cities) +                                # pass the cities data to plot
    geom_sf(data = worldmap_data) +                  # layer 1: world countries
    geom_point(aes(x = lng, y = lat), color = 'red') # layer 2: points at city coord.

Adding labels next to the points

You may also add text labels for the cities. ggplot2 provides geom_text and geom_label (draws box around text):

ggplot(some_cities) +                                  # pass the cities data to plot
    geom_sf(data = worldmap_data) +                    # layer 1: world countries
    geom_point(aes(x = lng, y = lat), color = 'red') + # layer 2: points
    geom_label(aes(x = lng, y = lat, label = name),    # layer 3: labels
               hjust = 0, vjust = 1, nudge_x = 3) +    # labels appear next to point
    theme(axis.title = element_blank(),
          axis.text = element_blank())                 # disable the axis labels

Exercises 2 and 3

Choose one of the following two tasks (or make both if you like):

2. (Birth-)places on a map

  • form a small group (3 to 4 people) and make a dataset with your birthplaces (or favourite travel destinations or whatever you like) and their geo-coordinates
    • see handout on how to find out the geo-coordinates for any place
    • your dataset should have three columns: longitude, latitude, label
  • plot the points on a world map and add labels to them
  • if your points are too near to each other on a global scale:
    • consider filtering the "worldmap" dataset to contain only the country you need for plotting
    • or: restrict the display window (see hints in handout)

3. Visualizing WZB partner institutions on a world map

  • load the CSV file wzb_partners_data.csv from folder 3_wzb_partners and the data for the world map
  • remove the shape (i.e. the "feature") for Antarctica from the world map data
  • make a plot, showing the partner institutions with a dot on the world map (optional: make the color dependent on the "type" column)
  • optional: group and count institutions per city; show the city locations in Europe (restrict the display window – see hints in handout) only as a dot, make the dot size dependent on number of institutions

Sources and file formats for geo-data

Administrative levels and identifiers

In order to …

  • link your data with existing information on geographic entities e.g. country GDP, city population, poverty rate in the neighboorhood
  • visualize your data geographically

… you need to:

  1. agree on (administrative) level for matching (e.g. country, state, municipality)
  2. find identifiers to match the observations w/ respective geographic entities

Country names, country codes

  • different names may refer to the same country (e.g. Germany / Federal Republic of Germany, BRD, etc.) → often not a good identifier
  • ISO 3166-1 designates every country a two- or three-letter code (e.g. DE / DEU)
  • often used in datasets (e.g. from the UN)

Finding NUTS in the EU

The Nomenclature of Territorial Units for Statistics (NUTS) divides the EU territory into regions at 3 different levels for socio-economic analyses of the regions.

Candidate countries and countries inside EFTA also have NUTS regions (Norway, Switzerland, Albania, Turkey, etc.).

NUTS levels
source: Eurostat

Finding NUTS in the EU

AGS in Germany

In Germany, regions are hierarchically structured and identified by "Amtlicher Gemeindeschlüssel" (AGS ~ "municipality identificator"). The Gemeindeverzeichnis contains all regions with their name, AGS, area, population and other features.

Berlin: LOR codes

Since 2006, Berlin uses "Lebensweltlich orientierte Räume" (LOR) to structure the city area into sub-regions at different levels, each of them identified by "LOR" code:

Level 1: Prognoseräume (60 regions); Level 2: Bezirksregionen (138 regions); Level 3: Planungsräume (447 regions)

What if I have no geo-identifier?

If your observations don't contain the identifiers you need, you can still do the following as long as you have some information suitable for geo-coding (e.g. address, city name, institution name to convert to a geo-coordinate):

  1. find out the longitude/latitude coordinate for each observation
  2. match each coordinate to its surrounding geographic entity (i.e. a municipality) by doing a "point-within-polygon" test

Sources for geo-data

Have look in the handout document, where geo-data sources at world- and EU-level (NUTS) as well as sources for Germany (AGS) and Berlin (LOR) are presented.

File formats

There are numerous file formats that store data for geographic information system (GIS) software. The most popular include:

  • ESRI shapefile ("SHP"):
    • consists of at least three files in the same directory with filename extensions .shp, .shx and .dbf
  • GeoJSON (.geojson, .json) and TopoJSON (.topojson, .json)
  • KML (.kml, .kmz): Keyhole Markup Language, used by Google Earth
  • GML (.gml): Geography Markup Language
  • AutoCAD DXF (.dxf)

You may also encounter other formats such as vector data files (.svg, .poly) or databases (.sql, .sqlite).

Good news: sf can handle all popular file formats.

Exploring geo-data files with QGIS

QGIS is a free, open-source GIS software running on all major operating systems.

  • useful for exploring downloaded geo-data files
  • I will show how to load a geo-data file and investigate properties of individual shapes
  • if you've installed QGIS, you can follow along

QGIS screenshot