- Review of last week's tasks
- Tapping APIs
- Use case: Twitter API
- Use case: Geocoding with Google Maps API
- Web scraping
January 24, 2019
now online at https://wzbsocialsciencecenter.github.io/wzb_r_tutorial/
Whenever you need to collect mass data from the web in an automated manner.
Whenever you need to enrich or transform your existing data with the help of a web service (automated translation, geocoding, …)
Whenever you want to run (semi-)automated experiments on the web (MTurk, Twitter bots, eBay, etc.).
Using a web API should definitely be preferred over web scraping. (We'll see why later.)
Web APIs usually employ a client-server model. The client – that is you. The server provides the API endpoints as URLs.
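Communication is done with request and response messages over the Hypertext Transfer Protocol (HTTP).
Each HTTP message contains a header (message metadata) and a message body (the actual content). The three-digit HTTP status code in the response plays an important role: it tells the client whether the request succeeded (2xx), failed because of the request itself (4xx), or failed on the server side (5xx).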
The message body contains the requested data in a specific format, often JSON or XML.
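To make the role of the status code concrete, here is a minimal sketch in base R that maps a code to its general class by its first digit. The function name `status_class` is made up for this illustration and is not part of any package.

```r
# Hypothetical helper: map an HTTP status code to its general class.
# The class is determined by the first digit of the three-digit code.
status_class <- function(code) {
  stopifnot(is.numeric(code), length(code) == 1, code >= 100, code <= 599)
  switch(as.character(code %/% 100),
         "1" = "informational",
         "2" = "success",
         "3" = "redirection",
         "4" = "client error",
         "5" = "server error")
}

status_class(200)  # request succeeded
status_class(404)  # problem with the request (here: resource not found)
status_class(503)  # problem on the server side
```

In practice you would check the class of every response before touching the message body.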
We can directly query the Twitter search API endpoint.
In your browser: https://api.twitter.com/1.1/search/tweets.json?q=hello
In R:
library(RCurl)
# add argument verbose = TRUE for full details
getURL('https://api.twitter.com/1.1/search/tweets.json?q=hello')
## [1] "{\"errors\":[{\"code\":215,\"message\":\"Bad Authentication data.\"}]}"
→ we get an error because we're not authenticated (we'll do that later).
APIs often return data in JSON format, a nested data format that can store key-value pairs and ordered lists of values:
{
  "profiles": [
    { "name": "Alice", "age": 52 },
    { "name": "Bob", "age": 35 }
  ]
}
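A JSON string like the one above can be turned into R data structures. The following is a minimal sketch that assumes the `jsonlite` package is installed; the package is our choice for this illustration, other JSON parsers exist.

```r
library(jsonlite)  # assumption: jsonlite is installed

json_text <- '{ "profiles": [ { "name": "Alice", "age": 52 }, { "name": "Bob", "age": 35 } ] }'

# fromJSON() parses the string; by default, an array of objects
# with the same keys is simplified to a data frame
parsed <- fromJSON(json_text)
profiles <- parsed$profiles

profiles$name       # the "name" values as a character vector
mean(profiles$age)  # the "age" values are numeric, so we can compute with them
```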
Social media:
Google (see API explorer):
Other:
Microsoft Face API, Amazon Mechanical Turk API, Wikipedia, etc.
For more, see programmableweb.com.
Working with a web API involves:
→ a lot of implementation effort is necessary.
For popular web services there are already "API wrapper packages" on CRAN:
Most web services require you to set up a user account on their platform.
Many web services provide a free subscription with limited access (the number of requests and/or results is limited) and a paid subscription as "premium access" or with a usage-dependent payment model. Some services (like Google Cloud Platform) require you to register with credit card information and grant a monthly free credit (e.g. $300 for Translation API at the moment).
In both cases you're required to authenticate with the service when you use it (→ API key or authorization token).
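Since an API key is a secret, it should not be written into your scripts. The following base R sketch reads the key from an environment variable and appends it to a request URL. Everything specific here is an assumption for illustration: the variable name `MY_SERVICE_API_KEY`, the URL `https://api.example.com/search`, the parameter name `key` and both helper functions are hypothetical. Many services expect the key in a request header instead, so check the documentation of the service you use.

```r
# Hypothetical helper: read an API key from an environment variable
get_api_key <- function(var = "MY_SERVICE_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.")
  }
  key
}

# Hypothetical helper: build a request URL with the key as query parameter;
# URLencode() escapes characters that are not allowed in a URL
build_url <- function(base_url, query, key) {
  paste0(base_url,
         "?q=", URLencode(query, reserved = TRUE),
         "&key=", URLencode(key, reserved = TRUE))
}

# For demonstration only: normally you would set the variable outside
# of the script, e.g. in your ~/.Renviron file
Sys.setenv(MY_SERVICE_API_KEY = "demo-key")

url <- build_url("https://api.example.com/search", "hello world", get_api_key())
url
```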
Always be aware that you're using a web service, i.e. you're sending (parts of) your data to some company's server.
Using a web API is a complex and often long-running task. Be aware that many things can go wrong, e.g.:
→ never blindly trust what you get
→ always do validity checks on the results (check NAs, ranges, duplicates, etc.)
→ use defensive programming (e.g. save intermediate results to disk; implement wait & retry mechanisms on failures; etc.)
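The three recommendations above can be sketched in base R as follows. This is only one possible approach. The helper `with_retry` and the function `flaky_request` are made up for this example; `flaky_request` simulates an API call that fails twice before it succeeds, so no real web service is contacted.

```r
# Hypothetical helper: call `fun`; if it fails, wait and try again
with_retry <- function(fun, max_tries = 3, wait_sec = 1) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < max_tries) Sys.sleep(wait_sec)
  }
  stop("All ", max_tries, " attempts failed.")
}

# Stand-in for a real API call: fails on the first two calls, then succeeds
n_calls <- 0
flaky_request <- function() {
  n_calls <<- n_calls + 1
  if (n_calls < 3) stop("simulated network error")
  data.frame(id = c(1, 2, 3), lat = c(52.5, 48.1, NA))
}

res <- with_retry(flaky_request, max_tries = 5, wait_sec = 0.1)

# Save the intermediate result to disk so it is not lost if a later step fails
cache_file <- file.path(tempdir(), "api_result.rds")
saveRDS(res, cache_file)

# Validity checks on the result
sum(is.na(res$lat))                                # number of missing values
any(duplicated(res$id))                            # duplicated IDs?
all(res$lat >= -90 & res$lat <= 90, na.rm = TRUE)  # latitudes in valid range?
```

For a real service you would probably also increase the waiting time after each failed attempt and stop early on errors that a retry cannot fix (e.g. failed authentication).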
Twitter provides several APIs. They are documented at https://developer.twitter.com/en/docs
The most important APIs for us are the "Search API" (aka REST API) and "Realtime API" (aka Streaming API).
Twitter provides three subscription levels:
The rate limiting also differs per subscription level (number of requests per month).