- Review of last week's tasks
- Tapping APIs
- Use case: Twitter API
- Use case: Geocoding with Google Maps API
- Web scraping
January 24, 2019
now online at https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
Whenever you need to collect large amounts of data from the web in an automated manner.
Whenever you need to enrich or transform your existing data with the help of a web service (automated translation, geocoding, …).
Whenever you want to run (semi-)automated experiments on the web (MTurk, Twitter bots, eBay, etc.).
It should definitely be preferred over web scraping. (We'll later see why.)
Web APIs usually employ a client-server model. The client – that is you. The server provides the API endpoints as URLs.
Communication is done with request and response messages over the Hypertext Transfer Protocol (HTTP).
Each HTTP message contains a header (message metadata) and a message body (the actual content). The three-digit HTTP status code plays an important role:
The message body contains the requested data in a specific format, often JSON or XML.
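To see these parts of an HTTP response from R, here is a minimal sketch using the httr package (the URL is only an example):

library(httr)

# send a GET request and look at the parts of the response
resp <- GET('https://httpbin.org/json')            # example URL
status_code(resp)                                  # three-digit HTTP status code, e.g. 200
http_type(resp)                                    # content type from the header, e.g. "application/json"
content(resp, as = 'text', encoding = 'UTF-8')     # the message body as text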
We can directly query the Twitter search API endpoint.
In your browser: https://api.twitter.com/1.1/search/tweets.json?q=hello
In R:
library(RCurl)

# add argument verbose = TRUE for full details
getURL('https://api.twitter.com/1.1/search/tweets.json?q=hello')
## [1] "{\"errors\":[{\"code\":215,\"message\":\"Bad Authentication data.\"}]}"
→ we get an error because we're not authenticated (we'll do that later).
APIs often return data in JSON format, a nested data format that stores key-value pairs and ordered lists of values:
{ "profiles": [ { "name": "Alice", "age": 52 }, { "name": "Bob", "age": 35 } ] }
Social media:
Google (see API explorer):
Other:
Microsoft Face API, Amazon Mechanical Turk API, Wikipedia, etc.
For more, see programmableweb.com.
Working with a web API involves:
→ much implementation effort is necessary.
For popular web services there are already "API wrapper packages" on CRAN:
Most web services require you to set up a user account on their platform.
Many web services provide a free subscription with limited access (the number of requests and/or results is limited) and a paid "premium" subscription or a usage-dependent payment model. Some services (like Google Cloud Platform) require you to register with credit card information and grant a monthly free credit (e.g. $300 for the Translation API at the moment).
In both cases you're required to authenticate with the service when you use it (→ API key or authorization token).
Always be aware that you're using a web service, i.e. you're sending (parts of) your data to some company's server.
Using a web API is a complex and often long-running task. Be aware that many things can go wrong, e.g.:
→ never blindly trust what you get
→ always do validity checks on the results (check NAs, ranges, duplicates, etc.)
→ use defensive programming (e.g. save intermediate results to disk; implement wait & retry mechanisms on failures; etc.)
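A minimal sketch of such a wait-and-retry mechanism (the function name and its parameters are made up for illustration):

# hypothetical helper: try a request up to max_tries times, waiting between attempts
fetch_with_retry <- function(url, max_tries = 3, wait_secs = 5) {
  for (i in 1:max_tries) {
    resp <- tryCatch(httr::GET(url), error = function(e) NULL)
    if (!is.null(resp) && httr::status_code(resp) == 200) {
      return(resp)   # success
    }
    Sys.sleep(wait_secs)   # wait before the next attempt
  }
  warning(sprintf('giving up on %s after %d tries', url, max_tries))
  NULL
}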
Twitter provides several APIs. They are documented at https://developer.twitter.com/en/docs
The most important APIs for us are the "Search API" (aka REST API) and "Realtime API" (aka Streaming API).
Twitter provides three subscription levels:
The rate limiting also differs per subscription level (number of requests per month).
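From R you can inspect your current rate limit status, e.g. with rtweet's rate_limit() function; a sketch, assuming you have already created an authentication token as shown below (column names may differ slightly between rtweet versions):

library(rtweet)

# requires a valid token (see the authentication example below)
rl <- rate_limit(query = 'search/tweets')
rl[c('query', 'limit', 'remaining', 'reset')]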
Several "API wrapper" packages for Twitter exist on CRAN:
I'll use rtweet on the following slides.
You need to construct an authentication token and provide the keys from the "Twitter Apps" page:
library(rtweet)

token <- create_token(
  app = "WZBAnalysis",
  consumer_key = "...",
  consumer_secret = "...",
  access_token = "...",
  access_secret = "...")
Sample of 10 tweets (excluding retweets) from the last 7 days containing "#wzb":
tw_search_wzb <- search_tweets('#wzb', n = 10, include_rts = FALSE)

# display only 3 out of 88 variables:
tw_search_wzb[c('screen_name', 'created_at', 'text')]
## # A tibble: 6 x 3
##   screen_name   created_at          text
##   <chr>         <dttm>              <chr>
## 1 Kevin_Bernal… 2019-01-16 00:46:28 Without even knowing, Wonder is teach…
## 2 WZB_Berlin    2019-01-15 14:43:44 #Brexit als Symptom und die Wut gegen…
## 3 WZB_Berlin    2019-01-14 15:06:54 #Digitalisierung spaltet: Geringquali…
## 4 WZB_Berlin    2019-01-14 08:59:08 „Unser Bild von der #Universität ist …
## 5 Daver_eiss    2019-01-15 09:28:18 Und dass sie stattfindet, beweist so …
## 6 WZB_GlobCon   2019-01-13 14:00:00 "Ende des letzten Jahres sind die neu…
Retrieve 10 latest tweets from timelines of selected users:
tw_timelines <- get_timelines(c("WZB_Berlin", "JWI_Berlin", "DIW_Berlin"), n = 10)

tw_timelines %>%
  # "favorite_count" is number of likes:
  select(screen_name, favorite_count, retweet_count, text) %>%
  group_by(screen_name) %>%
  arrange(screen_name, desc(favorite_count)) %>%
  top_n(3)
## # A tibble: 9 x 4
## # Groups:   screen_name [3]
##   screen_name favorite_count retweet_count text
##   <chr>                <int>         <int> <chr>
## 1 DIW_Berlin               6             4 „Die #Geschlechterquote für Au…
## 2 DIW_Berlin               3             2 "„Österreichs Weg zu einer kli…
## 3 DIW_Berlin               2             2 “Von den 200 umsatzstärksten U…
## 4 JWI_Berlin              13             4 "Weizenbaum Insights: Berlin i…
## 5 JWI_Berlin              10             3 We're excited that so many chi…
## 6 JWI_Berlin               5             5 We're happy to invite you and …
## 7 WZB_Berlin               5             4 "Und zur Vorbereitung auf die …
## 8 WZB_Berlin               3             3 Mehr Beschäftigte, weniger Aus…
## 9 WZB_Berlin               1             0 Und auf Deutsch hier in den WZ…
Posting a tweet to the timeline of your "app" account:
rand_nums <- round(runif(2, 0, 100))

# sprintf creates a character string by filling in placeholders
new_tweet <- sprintf('Hello world, it is %s and %d + %d is %d.',
                     Sys.time(), rand_nums[1], rand_nums[2], sum(rand_nums))
post_tweet(new_tweet)
## your tweet has been posted!
→ will be posted on twitter.com/WZBAnalysis
Live streaming of tweets is especially practical when run during events of interest (elections, demonstrations, etc.). This is because Twitter only allows limited download of historical data (see "Free vs. paid" slide before). So always try to collect the data during an event!
Realtime retrieval of tweets from a sampled live stream. By default, this collects tweets for 30 seconds, optionally filtered by search criteria:
stream_ht2019 <- stream_tweets('#2019')
# Streaming tweets for 30 seconds...
# Finished streaming tweets!
→ results in a data frame with 88 variables, as with the previous functions.
A practical way to collect tweets during events is to specify the recording length and let the tweets be written to a file:
stream_tweets(
  "oscars,academy,awards",
  timeout = 60 * 60 * 24 * 7,   # record tweets for 7 days (specified in seconds)
  file_name = "awards.json",
  parse = FALSE
)
Make sure you have enough disk space and that the internet connection is stable!
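For long recordings it can also help to stream in shorter chunks inside a loop, so that a dropped connection only affects the current chunk; a sketch under that assumption (the file naming scheme is arbitrary):

# stream in 1-hour chunks for 7 days; each chunk goes to its own file
for (i in 1:(24 * 7)) {
  fname <- sprintf('awards_%03d.json', i)   # arbitrary naming scheme
  tryCatch(
    stream_tweets("oscars,academy,awards", timeout = 60 * 60,
                  file_name = fname, parse = FALSE),
    error = function(e) message('streaming failed for chunk ', i)
  )
}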
After recording, load the data file as data frame:
awards <- parse_stream("awards.json")
For more functions, see the introductory vignette to rtweet.
Geocoding is the process of finding the geographic coordinates (longitude, latitude) for a specific query term (a full address, a postal code, a city name etc.).
Reverse geocoding tries to map a set of geographic coordinates to a place name / address.
As of June 2018, the Maps API is part of Google's "Cloud Platform" (GCP). This requires you to have:
Inside GCP, you can go to APIs & Services > Credentials to get your API key.
You need to install the package ggmap, at least version 2.7.
library(ggmap)

# provide the Google Cloud API key here:
register_google(key = google_cloud_api_key)

places <- c('Berlin', 'Leipzig', '10317, Deutschland',
            'Reichpietschufer 50, 10785 Berlin')
place_coords <- geocode(places) %>% mutate(place = places)
place_coords
##        lon      lat                             place
## 1 13.40495 52.52001                            Berlin
## 2 12.37307 51.33970                           Leipzig
## 3 13.48475 52.49854                10317, Deutschland
## 4 13.36509 52.50640 Reichpietschufer 50, 10785 Berlin
Take the WZB's geo-coordinates and see if we can find the address to it:
# first longitude, then latitude
revgeocode(c(13.36509, 52.50640))
## [1] "Reichpietschufer 50, 10785 Berlin, Germany"
Tweets also sometimes come with geo-coordinates. With reverse geocoding it is possible to find out from which city a tweet was sent.
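A sketch of how this could look, assuming the collected tweets actually carry coordinates (rtweet's lat_lng() adds lat/lng columns where available):

# add "lat" and "lng" columns from the tweets' geo information (NA if not available)
tw_geo <- lat_lng(tw_search_wzb)
tw_with_coords <- tw_geo[!is.na(tw_geo$lat) & !is.na(tw_geo$lng), ]

# reverse geocode the first tweet with coordinates (longitude first, then latitude)
if (nrow(tw_with_coords) > 0) {
  revgeocode(c(tw_with_coords$lng[1], tw_with_coords$lat[1]))
}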
Web scraping is the process of extracting data from websites for later retrieval or analysis.
→ usually done as an automated process by a web crawler, web spider or bot:
Google and other search engines do it all the time at a large scale – Google: "Our index is well over 100,000,000 gigabytes"
The problem: this huge amount of data is largely unstructured.
Web scraping or web mining tries to extract structured information from this mess.
Web scraping should be your last resort if you can't get the data otherwise (we'll see why).
Web scraping might lead to (among other things):
Depends mainly on:
User-agent: *
Crawl-delay: 10

# Directories
Disallow: /includes/
...
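From R, the robotstxt package can check whether a given path may be crawled at all; a minimal sketch (the URL is just an example):

library(robotstxt)

# TRUE if the given path may be crawled according to the site's robots.txt
paths_allowed('https://www.abgeordnetenwatch.de/profile/lars-klingbeil')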
Again, we have a client-server model:
This is very similar to the Web API concept (we use the same HTTP protocol), only that the server delivers HTML content this time.
In R:
library(httr)

response <- GET('https://wzb.eu')
response
## Response [https://wzb.eu/de]
##   Date: 2019-01-23 11:53
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 348 kB
## <!DOCTYPE html>
## <html lang="de" dir="ltr" prefix="content: http://purl.org/rss/1.0/modu...
## <head>
## <meta charset="utf-8" />
## <meta name="title" content="WZB-Startseite | WZB" />
## <link rel="shortlink" href="https://wzb.eu/de" />
## <link rel="canonical" href="https://wzb.eu/de" />
## <link rel="shortcut icon" href="/favicon.ico" />
## <link rel="mask-icon" href="/safari-pinned-tab.svg" color="#000000" />
## <link rel="icon" sizes="16x16" href="/favicon-16x16.png" />
## ...
Too many HTTP requests.
The webserver may notice when you send too many requests in a small amount of time.
It might be considered an attack (DoS – Denial of Service) and your IP address may get blocked for some time.
Solution: use delays between requests (for example with Sys.sleep()).
→ You will need patience when you crawl many web pages!
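A minimal sketch of such polite crawling with a fixed delay between requests (the URLs are placeholders):

library(httr)

urls <- c('https://wzb.eu', 'https://wzb.eu/de/forschung')   # placeholder URLs
pages <- list()
for (u in urls) {
  pages[[u]] <- GET(u)
  Sys.sleep(10)   # wait 10 seconds between requests
}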
HTML (Hypertext Markup Language) describes the structure of a website, e.g.:
Represented as nested tags with attributes:
<body>
  <nav width="100%">
    ...
  </nav>
  <article>
    <h1>Some headline</h1>
    <img src="some_image.png" alt="My image" />
  </article>
</body>
<tag> ... </tag>
<tag attrib="value"> ... </tag>
Two websites rarely have the same HTML structure.
Examples:
Both websites contain news, but their HTML structure is completely different → specific data extraction instructions are necessary for each website
Pages on a single website with the same "page type" usually share the same structure, e.g.:
It's very hard to do web scraping across a wide range of different websites.
→ before you start a project, assess the HTML code of the websites and try to find similarities
→ try to find aggregator websites, public databases or similar platforms that gather information from different websites
Websites can get very complex
Example: Pages that load more articles when you scroll down (Facebook!)
→ what you see in your browser might not be what you get when crawling the website!
Solutions: Automated web browsers, e.g. via Selenium → quite complex to implement
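A rough sketch of how such an automated browser can be controlled from R with the RSelenium package (this assumes a working local Selenium/browser driver setup, which requires extra installation):

library(RSelenium)

rd <- rsDriver(browser = 'firefox')       # starts a Selenium server and a browser
remdr <- rd$client
remdr$navigate('https://wzb.eu')
html_src <- remdr$getPageSource()[[1]]    # HTML after JavaScript has run
remdr$close()
rd$server$stop()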
Websites change – they get relaunched or disappear.
Sometimes it is possible to recover websites from the Internet Archive.
On abgeordnetenwatch.de (which translates as “member of parliament watch”) users find a blog, petitions and short profiles of their representatives in the federal parliament as well as on the state and EU level.
– source
Example: Research on Twitter networks among MPs → find Twitter name for each MP.
abgeordnetenwatch.de links to the Twitter account on MPs' profile pages, see for example:
https://www.abgeordnetenwatch.de/profile/christian-lindner *
* Personal remark: I'm no CL fanboy, I was just sure that he has a Twitter account…
First: Check if we can avoid scraping!
→ they provide an API: https://www.abgeordnetenwatch.de/api; all profiles of MPs are at: https://www.abgeordnetenwatch.de/api/parliament/bundestag/deputies.json
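A sketch of how one could fetch and inspect that list from R with the jsonlite package (the exact structure of the returned JSON is left to inspection):

library(jsonlite)

deputies <- fromJSON('https://www.abgeordnetenwatch.de/api/parliament/bundestag/deputies.json')
str(deputies, max.level = 2)   # inspect what the API actually returns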
Unfortunately, there is no link to Twitter in the data from the API!
We could ask the owners of the website if they want to provide the data, but let's use this website as illustrative example for web scraping.
For web scraping, we need to:
Both steps require some basic understanding of HTML and CSS. More advanced scraping techniques require an understanding of XPath and regular expressions.
We won't cover any of these here, but I will give a short example to show the basics.
A different profile, this time with many links (Facebook, Wikipedia, Twitter, etc.): https://www.abgeordnetenwatch.de/profile/lars-klingbeil
In a browser, right click on the element of interest and select "Inspect". This opens a new pane on the right side which helps to navigate through the HTML tags and find a CSS selector for that element. This gives us a "path" to that element.
The crucial information for the "path" to the elements of interest, i.e. the links specified by an <a>...</a> tag, is:
- a <div> container with class attribute "deputy__custom-links"
- a <ul> inside it with class attribute "link-list"
- the <li> elements in that list
- the <a> inside each list item that we want

We can now use this information in R. The package rvest is made for parsing HTML and extracting content from specific elements. First, we download the HTML source and parse it via read_html:
library(rvest)

html <- read_html('https://www.abgeordnetenwatch.de/profile/lars-klingbeil')
html
## {xml_document}
## <html lang="de" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="html not-front not-logged-in no-sidebars page-user page ...
We apply the CSS selector (the "path" to the links) in order to extract only the specific link elements of the website:
links <- html_nodes(html, 'div.deputy__custom-links ul.link-list li a')
links
## {xml_nodeset (6)}
## [1] <a href="http://www.bundestag.de/abgeordnete/biografien/K/klingbeil_ ...
## [2] <a href="https://plus.google.com/116503454070157395611" target="_bla ...
## [3] <a href="http://lars-klingbeil.de/" target="_blank">http://lars-klin ...
## [4] <a href="https://de.wikipedia.org/wiki/Lars_Klingbeil" target="_blan ...
## [5] <a href="https://www.facebook.com/klingbeil.lars" target="_blank">ht ...
## [6] <a href="https://twitter.com/larsklingbeil" target="_blank">Twitter</a>
And finally, we extract only the value of each link's href attribute in order to get the actual URLs:
urls <- html_attr(links, 'href')
urls
## [1] "http://www.bundestag.de/abgeordnete/biografien/K/klingbeil_lars/521076" ## [2] "https://plus.google.com/116503454070157395611" ## [3] "http://lars-klingbeil.de/" ## [4] "https://de.wikipedia.org/wiki/Lars_Klingbeil" ## [5] "https://www.facebook.com/klingbeil.lars" ## [6] "https://twitter.com/larsklingbeil"
In order to select only the link to Twitter and extract the Twitter name from it, we can apply a regular expression. Note that this is a quite advanced topic. The gist is that you define character string patterns and, where a pattern matches, extract specified key information from the matching strings:
# a pattern that matches:
#   http://twitter.com/user
#   https://twitter.com/user
#   http://www.twitter.com/user
#   https://www.twitter.com/user
# and extracts the "user" part
matches <- regexec('^https?://(www\\.)?twitter\\.com/([A-Za-z0-9_-]+)/?', urls)

# the "user" part is number 3
twitter_name <- sapply(regmatches(urls, matches),
                       function(s) { if (length(s) == 3) s[3] else NA })
twitter_name[!is.na(twitter_name)]
## [1] "larsklingbeil"
This whole process can be applied to all MPs (whose profile URLs we can get from the abgeordnetenwatch.de API). If we obey the crawl limit of 1 request per 10 seconds specified in their robots.txt file, it would take about 2 hours to fetch the profile pages of all 708 MPs and extract the Twitter names from them.
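A rough sketch of that loop, reusing the rvest code and the regular expression from above (profile_urls is a hypothetical character vector with the 708 profile URLs obtained from the API):

# profile_urls: hypothetical character vector with the MPs' profile URLs
twitter_names <- character(length(profile_urls))
for (i in seq_along(profile_urls)) {
  html <- read_html(profile_urls[i])
  links <- html_nodes(html, 'div.deputy__custom-links ul.link-list li a')
  urls <- html_attr(links, 'href')
  matches <- regexec('^https?://(www\\.)?twitter\\.com/([A-Za-z0-9_-]+)/?', urls)
  found <- sapply(regmatches(urls, matches),
                  function(s) { if (length(s) == 3) s[3] else NA })
  twitter_names[i] <- if (any(!is.na(found))) found[!is.na(found)][1] else NA
  Sys.sleep(10)   # obey the crawl delay from robots.txt
}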
You can see that web scraping is a really powerful tool for automated data extraction from websites, but also that it involves a lot of programming effort and that many things can go wrong (see the slides on legal and technical issues before).
See dedicated tasks sheet on the tutorial website.