May 6, 2019
Two datasets:
pm:

| city | pm_mg |
| --- | --- |
| Amman | 999 |
| Saltillo | 869 |
| Usak | 814 |
| The Bronx | 284 |
| Seoul | 129 |

city_coords:

| city | lng | lat |
| --- | --- | --- |
| Amman | 31.9100 | 35.94800 |
| Saltillo | -100.9940 | 25.43400 |
| Berlin | 13.4020 | 52.52000 |
| Usak | 29.4030 | 38.67300 |
| New York | -73.9400 | 40.60000 |
| Seoul | 126.9838 | 37.53626 |
Datasets can also be joined with base R's merge(), but I find dplyr easier to use.
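For comparison, a base-R sketch of the same left join with merge() (my addition, not from the slides; note that merge() sorts the result by the join key unless you pass sort = FALSE):

# base-R equivalent of the left_join(pm, city_coords, by = 'city') shown below
merge(pm, city_coords, by = 'city', all.x = TRUE)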
left_join(a, b, by = <criterion>): always retains the rows on the "left side" and fills non-matching rows with NAs.
library(dplyr)

left_join(pm, city_coords, by = 'city')
## city pm_mg lng lat
## 1 Amman 999 31.9100 35.94800
## 2 Saltillo 869 -100.9940 25.43400
## 3 Usak 814 29.4030 38.67300
## 4 The Bronx 284 NA NA
## 5 Seoul 129 126.9838 37.53626
right_join(a, b, by = <criterion>): always retains the rows on the "right side" and fills non-matching rows with NAs.

How many rows do you expect for a right join between pm and city_coords (see the tables above)? How many of them will contain an NA value?
right_join(pm, city_coords, by = 'city')
## city pm_mg lng lat
## 1 Amman 999 31.9100 35.94800
## 2 Saltillo 869 -100.9940 25.43400
## 3 Berlin NA 13.4020 52.52000
## 4 Usak 814 29.4030 38.67300
## 5 New York NA -73.9400 40.60000
## 6 Seoul 129 126.9838 37.53626
inner_join(a, b, by = <criterion>): only retains rows that match on both sides.

How many rows do you expect for an inner join between pm and city_coords?
inner_join(pm, city_coords, by = 'city')
## city pm_mg lng lat
## 1 Amman 999 31.9100 35.94800
## 2 Saltillo 869 -100.9940 25.43400
## 3 Usak 814 29.4030 38.67300
## 4 Seoul 129 126.9838 37.53626
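For completeness (not shown on the slides), dplyr also provides full_join(), which keeps the rows from both sides; for pm and city_coords this yields 7 rows, with NAs wherever one side has no match:

# keep all rows from both datasets
full_join(pm, city_coords, by = 'city')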
In a choropleth map, regions are shaded according to the measurement of some variable, e.g.:
The above picture is a good example of a bad color scheme.
library(sf)

bln_sozind <- read_sf('data/bln_plr_sozind.geojson')
head(bln_sozind)
## Simple feature collection with 6 features and 10 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 386668.2 ymin: 5817761 xmax: 390764.3 ymax: 5820432
## epsg (SRID): 25833
## proj4string: +proj=utm +zone=33 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 6 x 11
## PLANNAME EW2016 STATUS1 STATUS2 STATUS3 STATUS4 DYNAMO1 DYNAMO2 DYNAMO3
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Stülers… 3114 4.98 1.52 8.16 20.5 -0.19 0.14 -0.42
## 2 Großer … 192 NA NA NA NA NA NA NA
## 3 Lützows… 5518 6.83 2.08 17.7 34.3 -0.5 -0.03 -0.8
## 4 Körners… 4704 6.27 1.49 20.7 45.9 -1.33 -0.76 -1.66
## 5 Nördlic… 1135 2.06 0 1.5 3.55 -0.77 -0.74 -0.42
## 6 Wilhelm… 2190 3.64 0.89 8.45 27.3 -1.23 -0.59 0.13
## # … with 2 more variables: DYNAMO4 <dbl>, geometry <MULTIPOLYGON [m]>
## Simple feature collection with 3 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 386668.2 ymin: 5817875 xmax: 389894.1 ymax: 5820432
## epsg (SRID): 25833
## proj4string: +proj=utm +zone=33 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 3 x 3
## PLANNAME STATUS4 geometry
## <chr> <dbl> <MULTIPOLYGON [m]>
## 1 Stülerstraße 20.5 (((387256.6 5818552, 387258.6 5818545, 387261.3 58…
## 2 Großer Tierg… NA (((386767.5 5819393, 386767.3 5819393, 386767.1 58…
## 3 Lützowstraße 34.3 (((387952.6 5818275, 387974.2 5818265, 387994.8 58…
- STATUS4 – child poverty rate in percent²
- geometry column¹

¹ Note that the FIS-Broker website is kind of awkward to use: you can't link directly to a dataset, but have to search for the mentioned keywords.
² Portion of children under 15 living in a household that receives social support ("Hartz IV")
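The reduced three-row listing above (PLANNAME, STATUS4 and the geometry) is shown without the code that produced it; a guess at how it could be generated with dplyr's select() (the geometry column of an sf object is kept automatically):

# hypothetical reconstruction of the reduced listing shown above
head(select(bln_sozind, PLANNAME, STATUS4), n = 3)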
Directly plotting the map with the fill color depending on STATUS4:
ggplot() + geom_sf(aes(fill = STATUS4), data = bln_sozind)
Problem: STATUS4 is continuous and so is the color scale. For choropleth maps, discrete ranges are better because the colors are easier to tell apart.
Put the percentages into 5 discrete bins, change the color palette, adjust the appearance:
bln_sozind$child_pov_bins <- cut(bln_sozind$STATUS4, seq(0, 100, by = 20))

ggplot() +
  geom_sf(aes(fill = child_pov_bins), data = bln_sozind) +
  scale_fill_brewer(palette = 'OrRd', na.value = "grey90",
                    guide = guide_legend(title = 'Child pov.\nranges in pct.')) +
  coord_sf(datum = NA) +   # disable lines ("graticule") in the background
  theme_void()
Most of the time, you have at least two datasets: one containing the measurement(s) you want to show, the other containing the geo-data.
Example dataset: People at risk of poverty or social exclusion by NUTS 2 regions (tgs00107) from Eurostat.
pov_risk <- read.csv('data/tgs00107_pov_risk_nuts2.csv', stringsAsFactors = FALSE)
pov_risk$risk_pct_bins <- cut(pov_risk$risk_pct, seq(0, 100, by = 10))
pov_risk_2016 <- filter(pov_risk, year == 2016)   # 2016 has fewest NAs
head(pov_risk_2016)
## nuts year risk_pct risk_pct_bins
## 1 AT 2016 18.0 (10,20]
## 2 AT11 2016 15.1 (10,20]
## 3 AT12 2016 13.0 (10,20]
## 4 AT13 2016 26.0 (20,30]
## 5 AT21 2016 16.3 (10,20]
## 6 AT22 2016 18.4 (10,20]
We load the GeoJSON dataset for NUTS level-2 regions provided by Eurostat:
nutsrg2 <- read_sf('data/nutsrg_2.json')
st_crs(nutsrg2) <- 3857   # set the correct CRS
head(nutsrg2)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 1347007 ymin: 4105701 xmax: 3847132 ymax: 6631069
## epsg (SRID): 3857
## proj4string: +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs
## # A tibble: 6 x 3
## id na geometry
## <chr> <chr> <MULTIPOLYGON [m]>
## 1 CY00 Κύπρος (((3592720 4172836, 3594320 4178301, 3607520 4169324, …
## 2 CZ01 Praha (((1588619 6463231, 1592220 6468696, 1617821 6476893, …
## 3 CZ02 Střední Če… (((1686224 6537392, 1685024 6526853, 1686624 6506167, …
## 4 CZ03 Jihozápad (((1492615 6461670, 1536217 6432006, 1539017 6395706, …
## 5 CZ04 Severozápad (((1612621 6534270, 1600220 6515144, 1520216 6485089, …
## 6 CZ05 Severových… (((1729426 6582279, 1758628 6576034, 1786629 6556127, …
Both datasets contain a NUTS level-2 identifier, so we can join them:
pov_risk_2016_geo <- left_join(nutsrg2, pov_risk_2016, by = c('id' = 'nuts'))
head(pov_risk_2016_geo)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 1347007 ymin: 4105701 xmax: 3847132 ymax: 6631069
## epsg (SRID): 3857
## proj4string: +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs
## # A tibble: 6 x 6
## id na year risk_pct risk_pct_bins geometry
## <chr> <chr> <int> <dbl> <fct> <MULTIPOLYGON [m]>
## 1 CY00 Κύπρος NA NA <NA> (((3592720 4172836, 3594320 41…
## 2 CZ01 Praha 2016 10.1 (10,20] (((1588619 6463231, 1592220 64…
## 3 CZ02 Střed… 2016 10.3 (10,20] (((1686224 6537392, 1685024 65…
## 4 CZ03 Jihoz… 2016 10 (0,10] (((1492615 6461670, 1536217 64…
## 5 CZ04 Sever… 2016 19.5 (10,20] (((1612621 6534270, 1600220 65…
## 6 CZ05 Sever… 2016 11.7 (10,20] (((1729426 6582279, 1758628 65…
(eu_pov_plt <- ggplot() +
  geom_sf(aes(fill = risk_pct_bins), data = pov_risk_2016_geo, size = 0.1) +
  scale_fill_brewer(palette = 'OrRd', na.value = "grey90",
                    guide = guide_legend(title = 'Pct. of people at\nrisk of poverty')) +
  labs(title = 'Poverty in the EU 2016',
       subtitle = 'Percentage of people at risk of poverty or social exclusion',
       caption = 'source: Eurostat') +
  coord_sf(datum = NA) +   # disable lines ("graticule") in the background
  theme_void())
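Because the plot was assigned to eu_pov_plt (the outer parentheses also print it), it can be written to a file as well; a small addition not on the slides, with an arbitrary file name:

ggsave('eu_poverty_2016.png', plot = eu_pov_plt, width = 8, height = 6)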
Choose one of the following two tasks (or do both if you like) and have a look at the handout for further hints:
4. A choropleth map of social indicators in Berlin
   - bln_plr_sozind_data.csv from folder 4_berlin_sozind
   - bln_plr.geojson containing the Berlin "Planungsraum" regions

5. A choropleth map for unemployment rates in the EU
   - tgs00010_unempl_nuts2.csv from folder 5_eu_unempl, with the unemployment rates in unempl_pct
   - nutsrg_2_2016_epsg3857_20M.json containing the EU NUTS level-2 regions
   - plot the unemployment rates from tgs00010_unempl_nuts2.csv for their respective regions in nutsrg_2_2016_epsg3857_20M.json (start with the totals, i.e. sex == 'T')
   - optional: facet by sex, i.e. displaying choropleth maps for the unemployment rate of women and men next to each other

If you also have a time-dimension in your data that you want to show, there are several options:
- Small multiples: dataset with a time-related variable, plotted with facets (e.g. facet_wrap()); see the sketch below
- Animations: use gganimate for time-lapse plot animations
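A minimal small-multiples sketch (my addition, not from the slides), reusing the Eurostat objects from above and assuming pov_risk contains several years; with gganimate you would add a transition such as transition_states(year) instead of the facets:

# join all years, not only 2016, to the NUTS level-2 shapes
pov_risk_geo <- left_join(nutsrg2, pov_risk, by = c('id' = 'nuts'))
pov_risk_geo <- filter(pov_risk_geo, !is.na(year))   # drop regions without any data to avoid an empty NA panel

ggplot() +
  geom_sf(aes(fill = risk_pct_bins), data = pov_risk_geo, size = 0.1) +
  scale_fill_brewer(palette = 'OrRd', na.value = "grey90") +
  facet_wrap(~ year) +   # one small map per year
  coord_sf(datum = NA) +
  theme_void()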
Geocoding is the process of finding the geographic coordinates (longitude, latitude) for a specific query term (a full address, a postal code, a city name etc.).
Examples (longitude, latitude):
- Reichpietschufer 50, 10785 Berlin → 13.36509, 52.50640
- an address in New York City → -73.896916, 40.706934
Reverse geocoding tries to map a set of geographic coordinates to a place name / address, e.g. 13.36509, 52.50640 → Reichpietschufer 50, 10785 Berlin.

The Google Maps API is a web service that allows programmatic geocoding and reverse geocoding (among other things). Put simply, this means you send a query (an address or a pair of coordinates) to Google's servers and get the matching coordinates or address back as a response.
As of June 2018, the Maps API is part of Google's "Cloud Platform" (GCP). This requires you to have a Google account and a GCP project with billing enabled. Inside GCP, you can go to APIs & Services > Credentials to get your API key.
You need to install the ggmap package, at least version 2.7.
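The code below assumes a variable google_cloud_api_key holding your key. One way to avoid hard-coding the key into the script (a suggestion, not from the slides; the environment variable name is arbitrary):

# read the API key from an environment variable instead of writing it into the script
google_cloud_api_key <- Sys.getenv('GOOGLE_MAPS_API_KEY')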
library(ggmap)

# provide the Google Cloud API key here:
register_google(key = google_cloud_api_key)

places <- c('Berlin', 'Leipzig', '10317, Deutschland',
            'Reichpietschufer 50, 10785 Berlin')
place_coords <- geocode(places) %>% mutate(place = places)
place_coords
## # A tibble: 4 x 3
## lon lat place
## <dbl> <dbl> <chr>
## 1 13.4 52.5 Berlin
## 2 12.4 51.3 Leipzig
## 3 13.5 52.5 10317, Deutschland
## 4 13.4 52.5 Reichpietschufer 50, 10785 Berlin
Let's take the WZB's geo-coordinates and see if we can find the address for them:
# first longitude, then latitude
revgeocode(c(13.36509, 52.50640))
## [1] "Reichpietschufer 50, 10785 Berlin, Germany"
Social media posts or digital photos often contain geo-coordinates that you may want to "reverse geocode".
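As a small sketch (my addition), you could reverse-geocode several coordinate pairs at once; the values below are simply the example coordinates from above:

photo_coords <- data.frame(lng = c(13.36509, -73.896916),
                           lat = c(52.50640, 40.706934))
# revgeocode() expects one c(longitude, latitude) vector per call
mapply(function(lng, lat) revgeocode(c(lng, lat)),
       photo_coords$lng, photo_coords$lat)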
Can you unroll a cylinder or a cone into a flat sheet without tearing or stretching?
Yes, easily:
Spheres or ellipsoids cannot be unrolled or unfolded without cutting or stretching. They are "non-developable surfaces".
source: flickr.com/photos/waferboard
Map projections try to do the impossible: project points on a sphere¹ to a plane.
¹ WGS84 actually models the Earth as an ellipsoid
All map projections distort: area, shape, direction, distance and other properties can differ from those on the actual sphere.
source: flickr.com/photos/waferboard
There are numerous projections, each with different properties and trade-offs between distortions.
Mercator projection
source: Strebe / wikipedia commons
Mollweide projection
source: Strebe / wikipedia commons
Goode homolosine projection
source: Strebe / wikipedia commons
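To try such projections yourself, you can reproject an sf dataset on the fly via coord_sf(); a minimal sketch (my addition), drawing a world map with a Mollweide projection:

library(maps)
library(sf)
library(ggplot2)

world <- st_as_sf(map('world', plot = FALSE, fill = TRUE))          # world map in WGS84
ggplot() + geom_sf(data = world) + coord_sf(crs = '+proj=moll')     # display with a Mollweide projection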
The CRS of a spatial dataset is usually documented as an EPSG / SRID number. epsg.io provides information about all common CRSs.
When you load a spatial dataset you can see its CRS / projection:
library(maps)

worldmap_data <- st_as_sf(map('world', plot = FALSE, fill = TRUE))
head(worldmap_data, n = 2)
## Simple feature collection with 2 features and 1 field
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 19.28066 ymin: 29.39194 xmax: 74.89131 ymax: 42.64795
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## geometry ID
## 1 MULTIPOLYGON (((74.89131 37... Afghanistan
## 2 MULTIPOLYGON (((20.06396 42... Albania
Alternatively, use st_crs():
st_crs(worldmap_data)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +datum=WGS84 +no_defs"
Often, the CRS is not set in the spatial dataset, so you have to do it manually:
bln_plan <- read_sf('data/Planungsraum_EPSG_25833.shp')
head(bln_plan, n = 2)
## Simple feature collection with 2 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 386668.2 ymin: 5818275 xmax: 389894.1 ymax: 5820432
## epsg (SRID): NA
## proj4string: NA
## # A tibble: 2 x 3
## SCHLUESSEL PLR_NAME geometry
## <chr> <chr> <MULTIPOLYGON>
## 1 01011101 "St\xfclerstr." (((387256.6 5818552, 387323.1 5818572, 387418…
## 2 01011102 "Gro\xdfer Tie… (((386767.5 5819393, 386768.3 5819389, 386769…
The documentation (and the file name) say it's EPSG 25833, a CRS known as ETRS89:
st_crs(bln_plan) <- 25833
Note that by default, ggplot displays lines ("graticule") and axes according to WGS84 coordinates, although the actual data uses a different CRS:
ggplot() + geom_sf(data = bln_plan)
Setting the datum to the same CRS in coord_sf shows axes with the same units as in the data:
ggplot() + geom_sf(data = bln_plan) + coord_sf(datum = 25833)
Suppose we wanted to make some markers on the Berlin map. We obtained longitude/latitude coordinates for them, e.g. for the WZB:
wzb_coord <- data.frame(lng = 13.365097, lat = 52.506459)

ggplot() +
  geom_sf(data = bln_plan) +
  geom_point(aes(x = lng, y = lat), data = wzb_coord, color = 'red')
Oops! What happened here?
The spatial data for Berlin has a different CRS; it doesn't use WGS84 longitude/latitude coordinates. We have to convert the WZB coordinates to the same CRS (EPSG 25833 aka ETRS89) via st_transform():
wzb_wgs84 <- st_sfc(st_point(c(wzb_coord$lng, wzb_coord$lat)), crs = 4326)
wzb_etrs89 <- as.data.frame(st_coordinates(st_transform(wzb_wgs84, crs = 25833)))

ggplot() +
  geom_sf(data = bln_plan) +
  geom_point(aes(x = X, y = Y), data = wzb_etrs89, color = 'red')
You may only want to show a portion of your spatial data, i.e. "zoom in" to a certain area on the map.
We use spatial data for all countries from the rnaturalearth package:
library(rnaturalearth)

worldmap <- ne_countries(type = 'map_units', returnclass = 'sf')
One way to "zoom in" is to form a subset of your data, e.g.:
europe <- worldmap[worldmap$continent == 'Europe', ]
ggplot() + geom_sf(data = europe)
However, this has some disadvantages: the plot extent is determined by the selected shapes (Russia alone stretches the view far to the east), and neighbouring non-European countries are simply missing from the map.
We can also use the whole dataset, but set limits on the display window via coord_sf() using the current CRS unit (here: degrees longitude/latitude):

- xlim are the bounds for the x-axis
- ylim are the bounds for the y-axis

Using the default Mercator-like projection:
ggplot() + geom_sf(data = worldmap) + coord_sf(xlim = c(-20, 45), ylim = c(30, 73))
We can also "cut" the shapes to fit into the specified display window using st_crop:
To better see the effect, I'm showing the result in ETRS89 projection after the shapes have been cut:
europe <- st_crop(worldmap, xmin = -20, xmax = 45, ymin = 30, ymax = 73)
ggplot() + geom_sf(data = europe) + coord_sf(crs = 25833)
Sometimes you need to find the center point (centroid) of certain regions. Example: you want to annotate a map with labels for certain countries. You can't label a country directly, because it is represented as a polygon, but you need a single point to place the label at. You can find the centroid of a shape with st_centroid.
- st_centroid doesn't work correctly with longitude/latitude coordinates → we need to convert the data to a CRS that uses meters, such as ETRS89
- we then calculate the centroids of the countries' geometry and get their coordinates (they are also in ETRS89)

world_etrs89 <- st_transform(worldmap, crs = 25833)
countries <- world_etrs89[world_etrs89$iso_a2 %in% c('SK', 'FR', 'TR', 'PL', 'CZ'), ]
country_centroids <- st_coordinates(st_centroid(countries$geometry))
country_centroids
## X Y
## 1 797648.2 5784802
## 2 2264329.9 4535907
## 3 -462849.1 5225447
## 4 830884.1 5407641
## 5 524732.7 5514198
We can use those coordinates for labelling the countries. First we combine the country names and their centroid coordinates:
country_labels <- cbind(select(countries, name), country_centroids)
country_labels
## Simple feature collection with 5 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -951869 ymin: 4101650 xmax: 3171811 ymax: 6081423
## epsg (SRID): 25833
## proj4string: +proj=utm +zone=33 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## name X Y geometry
## 114 Poland 797648.2 5784802 MULTIPOLYGON (((1056668 600...
## 125 Turkey 2264329.9 4535907 MULTIPOLYGON (((3171811 455...
## 144 France -462849.1 5225447 MULTIPOLYGON (((-138250.4 5...
## 156 Slovakia 830884.1 5407641 MULTIPOLYGON (((1051642 546...
## 157 Czechia 524732.7 5514198 MULTIPOLYGON (((501189.9 56...
Now we can use geom_sf_label to display the labels. We pass the country_labels dataset and map its name variable to the label aesthetic.
Note how I define a display window in WGS84 longitudes/latitudes and convert it to ETRS89, as this is the CRS of the spatial datasets that are plotted.
disp_window <- st_sfc(st_point(c(-12, 28)), st_point(c(60, 64)), crs = 4326)
disp_window_etrs89 <- st_transform(disp_window, crs = 25833)
disp_window_etrs89_coord <- st_coordinates(disp_window_etrs89)

ggplot() +
  geom_sf(data = world_etrs89) +
  geom_sf_label(aes(label = name), data = country_labels) +
  coord_sf(xlim = disp_window_etrs89_coord[, 'X'],
           ylim = disp_window_etrs89_coord[, 'Y'],
           datum = NA) +
  theme_void()
Labels close to each other will cause occlusion. The package ggrepel contains a function geom_label_repel that tries to place labels without occlusion. We map x, y and label to the respective columns from the country_labels dataset:
library(ggrepel)

ggplot() +
  geom_sf(data = world_etrs89) +
  geom_label_repel(aes(x = X, y = Y, label = name),
                   data = country_labels, min.segment.length = 0) +
  coord_sf(xlim = disp_window_etrs89_coord[, 'X'],
           ylim = disp_window_etrs89_coord[, 'Y'],
           datum = NA) +
  theme_void()
Take one of your previous plots and improve it by applying some of the sf techniques shown above, e.g. cropping to a display window or adding labels for regions.