January 24, 2019

Today's schedule

  1. Review of last week's tasks
  2. Tapping APIs
  3. Use case: Twitter API
  4. Use case: Geocoding with Google Maps API
  5. Web scraping

Review of last week's tasks

Solution for tasks #10

Tapping APIs

What is an API?

  • stands for Application Programming Interface
  • defined interface for communication between software components
  • Web API: provides an interface to structured data from a web service

API as back door

When should I use an API?

Whenever you need to collect mass data on the web in an automated manner.

Whenever you need to enrich or transform your existing data with the help of a web service (automated translation, geocoding, …).

Whenever you want to run (semi-)automated experiments on the web (MTurk, Twitter bots, eBay, etc.).

Using an API should definitely be preferred over web scraping. (We'll see why later.)

How does a web API work?

Web APIs usually employ a client-server model. The client – that is you. The server provides the API endpoints as URLs.

Web API schema

How does a web API work?

Communication is done with request and response messages over the Hypertext Transfer Protocol (HTTP).

Each HTTP message contains a header (message metadata) and a message body (the actual content). The three-digit HTTP status code plays an important role:

  • 2xx: Success
  • 4xx: Client error (incl. the popular 404: Not found or 403: Forbidden)
  • 5xx: Server error

The message body contains the requested data in a specific format, often JSON or XML.
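In R, the httr package gives access to both parts of a response message. A minimal sketch using the Twitter search endpoint shown below (without authentication the server responds with a client-error status):

```r
library(httr)  # package for HTTP requests in R

resp <- GET("https://api.twitter.com/1.1/search/tweets.json?q=hello")

status_code(resp)                # three-digit HTTP status code (a 4xx code without authentication)
http_error(resp)                 # TRUE for any 4xx/5xx response
headers(resp)[["content-type"]]  # header field: the format of the message body
```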

Let's query an API

We can directly query the Twitter search API endpoint.

In your browser: https://api.twitter.com/1.1/search/tweets.json?q=hello

In R:


library(httr)
resp <- GET("https://api.twitter.com/1.1/search/tweets.json?q=hello")
content(resp, as = "text")  # add verbose() as an argument to GET for full details
## [1] "{\"errors\":[{\"code\":215,\"message\":\"Bad Authentication data.\"}]}"

→ we get an error because we're not authenticated (we'll do that later).

Reading JSON
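A common task is converting a JSON message body into R objects. A minimal sketch with the jsonlite package (the JSON string is a made-up example):

```r
library(jsonlite)

json <- '{"name": "twitter", "tweets": [{"id": 1, "text": "hello"}]}'
parsed <- fromJSON(json)

parsed$name          # a character string: "twitter"
parsed$tweets        # an array of objects becomes a data frame
parsed$tweets$text   # column access: "hello"
```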

Examples of popular APIs

What's an "API wrapper package"?

Working with a web API involves:

  • constructing request messages
  • parsing result messages
  • handling errors

→ considerable implementation effort is necessary.

For popular web services there are already "API wrapper packages" on CRAN:

  • implement communication with the server
  • provide direct access to the data via R functions
  • examples: rtweet, ggmap (geocoding via Google Maps), WikipediR, genderizeR
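With a wrapper package, a single R function call replaces the whole construct-request/parse-response/handle-errors cycle. A sketch with the rtweet package (assumes you have already authenticated with Twitter; query and number of results are examples):

```r
library(rtweet)

# One function call constructs the request, sends it, parses the
# JSON response and returns the tweets as a data frame.
tweets <- search_tweets("#rstats", n = 100)
```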

Access to web APIs

Most web services require you to set up a user account on their platform.

Many web services provide a free subscription with limited access (the number of requests and/or results is limited) and a paid subscription as "premium access" or with a usage-dependent payment model. Some services (like Google Cloud Platform) require you to register with credit card information and grant a monthly free credit (e.g. $300 for the Translation API at the moment).

In both cases you're required to authenticate with the service when you use it (→ API key or authorization token).
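How the key is passed differs per service; common patterns are a query parameter or an authorization header. A hypothetical sketch (endpoint and parameter names are made up; store real keys outside your script, e.g. in an environment variable):

```r
library(httr)

# read the key from an environment variable instead of hard-coding it
api_key <- Sys.getenv("MY_SERVICE_API_KEY")

# hypothetical endpoint; some services accept the key as a query parameter ...
resp <- GET("https://api.example.com/v1/data",
            query = list(q = "hello", api_key = api_key))

# ... others expect it in an authorization header
resp <- GET("https://api.example.com/v1/data",
            query = list(q = "hello"),
            add_headers(Authorization = paste("Bearer", api_key)))
```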

A few warning signs

Always be aware that you're using a web service, i.e. you're sending (parts of) your data to some company's server.

Using a web API is a complex and often long-running task. Be aware that many things can go wrong, e.g.:

  • the server delivers garbage
  • the server crashes
  • your internet connection is lost
  • your computer crashes
  • your script produces an endless loop

  • never blindly trust what you get
  • always do validity checks on the results (check NAs, ranges, duplicates, etc.)

→ use defensive programming (e.g. save intermediate results to disk; implement wait & retry mechanisms on failures; etc.)
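A sketch of such a wait & retry mechanism (function and parameter names are made up; readLines stands in for any request function):

```r
# Defensive fetching: retry on failure with increasing wait times
# instead of letting a long-running script die on the first hiccup.
fetch_with_retry <- function(url, max_tries = 3, wait_secs = 5) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(readLines(url, warn = FALSE),
                       error = function(e) NULL)
    if (!is.null(result)) return(result)   # success: hand back the data
    message(sprintf("attempt %d failed, retrying in %d seconds ...",
                    i, wait_secs))
    Sys.sleep(wait_secs)
    wait_secs <- wait_secs * 2             # back off a bit more each time
  }
  stop("all ", max_tries, " attempts failed")
}
```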

Use case: Twitter API

Which APIs does Twitter provide?

Twitter provides several APIs. They are documented at https://developer.twitter.com/en/docs

The most important APIs for us are the "Search API" (aka REST API) and the "Realtime API" (aka Streaming API).

Free vs. paid

Twitter provides three subscription levels:

  • Standard (free)
    • search historical data for up to 7 days
    • get sampled live tweets
  • Premium ($150 to $2500 / month)
    • search historical data for up to 30 days
    • get full historical data per user
  • Enterprise (special contract with Twitter)
    • full live tweets

The rate limiting also differs per subscription level (number of requests per month).

What else do I need?

  1. A Twitter account
  2. Authentication data for a "Twitter app"