R Tutorial at the WZB

1. Load the file “mdb_twitter.csv” included in the resources file “11collecting-resources.zip” from the tutorial website into R.

Remember that you can load CSV files with read.csv(). Make sure not to read strings as factors, and read the postal code (variable personal.location.postal_code) as a character string, not as a numeric variable. Hint: Have a look at the help for read.csv() and look for the arguments stringsAsFactors and colClasses.
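A minimal sketch of the two arguments mentioned in the hint. To keep it self-contained, it first writes a tiny made-up CSV to a temporary file; with the real data you would pass the path to mdb_twitter.csv instead. The column id and all values are invented for illustration.

```r
# Write a tiny example file (made-up values) so the sketch runs on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,personal.location.postal_code,party",
  "1,01067,SPD",
  "2,10785,CDU"
), tmp)

# stringsAsFactors = FALSE keeps text columns as character vectors.
# colClasses forces the postal code to be read as text, which preserves
# leading zeros such as in "01067".
mdb <- read.csv(tmp,
                stringsAsFactors = FALSE,
                colClasses = c(personal.location.postal_code = "character"))

str(mdb)   # column types and a preview of the values
nrow(mdb)  # number of observations
```

Read as a number, the first postal code would become 1067 and lose its leading zero, which is why the character type matters here.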

This is a subset of data on the members of the 19th German Bundestag (2017 to 2021 at the latest; German: “Mitglieder des Bundestags – MdB”) that was fetched from abgeordnetenwatch.de.

Familiarize yourself with the data. There is no codebook to the data, which is bad practice, but for the sake of this exercise it should be enough to assess the variables from the column names and their values. Also check the number of observations.

2. Create a subset of the data with the following criteria:

select only the variables personal.first_name, personal.last_name, personal.gender, personal.birthyear, party, twitter_name
do not include members without party affiliation (“fraktionslos”)

After that, have a look at where we have missing data in the data set that might later cause trouble.
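One possible way to do the subsetting and the missing-data check, sketched with dplyr on made-up rows. It assumes that members without party affiliation are coded as the literal string "fraktionslos" in the party column and that missing values are stored as NA; check both assumptions against the real data (e.g. with unique(mdb$party)). The object names mdb and mdb_sub are only illustrative.

```r
library(dplyr)

# Made-up rows standing in for the real data.
mdb <- data.frame(
  personal.first_name = c("Anna", "Bernd", "Clara"),
  personal.last_name  = c("Muster", "Beispiel", "Probe"),
  personal.gender     = c("female", NA, "female"),
  personal.birthyear  = c(1970, 1965, 1980),
  party               = c("SPD", "fraktionslos", "CDU"),
  twitter_name        = c("anna_m", NA, NA),
  personal.location.postal_code = c("10785", "01067", "80331"),
  stringsAsFactors = FALSE
)

mdb_sub <- mdb %>%
  # keep only the six requested variables
  select(personal.first_name, personal.last_name, personal.gender,
         personal.birthyear, party, twitter_name) %>%
  # drop members without party affiliation
  # (note: filter() also drops rows where party is NA)
  filter(party != "fraktionslos")

# Number of missing values per column
colSums(is.na(mdb_sub))
```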

3. There are quite a lot of NAs in the `personal.gender` variable (I introduced them for the sake of this exercise). The gender of about a quarter of MPs is missing. Let’s try to impute the gender.

3.1. First, think about how you could impute that variable and what the pros and cons of each method are.

3.2. Then, inform yourself about the genderizeR package and install it.

Please note that the package uses the genderize.io API to predict the gender of a person using a first name. There is a rate limit of 1000 requests (i.e. 1000 names) per day.

The package is a bit awkward to work with, I have to say. An alternative would be to use the web API directly from R without the package, which would also allow you to use additional features such as specifying the country the name comes from (which increases accuracy). Anyway, that would go too far here, so let’s use the package.

To predict gender from a set of names with this package, you need to use two functions: 1. findGivenNames() and 2. genderize().

findGivenNames() extracts a first name from a text (in our case, that’s not really necessary because we already have the first name in a separate data frame column), does some text preparation (like lower case transformation), queries the web service of genderize.io and returns the result in a data frame. Note that it also includes an estimate of the correct prediction probability:

library(genderizeR)

testnames <- c('Paula', 'Paul')
test_genderinfo <- findGivenNames(testnames, progress = FALSE)

test_genderinfo

##     name gender probability count
## 1: paula female        0.99  2298
## 2:  paul   male           1  5931

Now we need to pass this result to genderize(). This is basically only necessary to combine the original names in testnames with the test_genderinfo output.

genderize(testnames, test_genderinfo, progress = FALSE)

##     text givenName gender genderIndicators
## 1: Paula     paula female                1
## 2:  Paul      paul   male                1

We can later use the original name in the text column for joining.

3.3. Try out the genderizeR package as shown above with a few names.

3.4. Next, select all first names from the data set of Bundestag members for which we do not have gender information. Make sure to remove all duplicate names. Then, use the findGivenNames() function to let the web service predict the gender from the supplied names. Inspect the result data frame.
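The name selection could look roughly like this in base R. The data frame mdb_sub and its values are made up; the call to findGivenNames() is only shown as a comment because it needs the web service.

```r
# Made-up rows standing in for the subset of Bundestag members.
mdb_sub <- data.frame(
  personal.first_name = c("Anna", "Kai", "Kai", "Bernd"),
  personal.gender     = c("female", NA, NA, NA),
  stringsAsFactors = FALSE
)

# First names of members whose gender is missing, without duplicates
missing_gender <- is.na(mdb_sub$personal.gender)
names_to_predict <- unique(mdb_sub$personal.first_name[missing_gender])
names_to_predict

# With genderizeR loaded, the next step would then be:
# predicted <- findGivenNames(names_to_predict, progress = FALSE)
```

Removing duplicates first keeps the number of requests low, which matters because of the daily rate limit.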

3.5. Have a look at the distribution of the probability variable in the result. Filter out results with a probability less than 0.9 and investigate them.
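A sketch of the probability check on a made-up stand-in for the findGivenNames() result (the names and numbers are invented, not real predictions). The printed example output further up suggests that probability may be stored as text rather than as a number, so the sketch converts it with as.numeric() to be safe; that is an assumption you should verify with str() on your own result.

```r
# Made-up stand-in for a findGivenNames() result.
predicted <- data.frame(
  name        = c("kai", "bernd", "andrea"),
  gender      = c("male", "male", "female"),
  probability = c("0.85", "1", "0.97"),
  count       = c("500", "1200", "3000"),
  stringsAsFactors = FALSE
)

# Convert to numeric in case the column is stored as text
prob <- as.numeric(predicted$probability)
summary(prob)  # quick look at the distribution

# Predictions we should not trust blindly
uncertain <- predicted[prob < 0.9, ]
uncertain
```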

3.6. Then, use genderize() to get the complete data frame which also contains the original first names in the text variable. You should then have a data frame with the original first name and a predicted gender. Use this data frame now for imputation of the data set of Bundestag members. Check if there are still NAs in the personal.gender variable. If yes, you can manually set a gender if you’re confident or just ignore it.

Hint: A strategy here is to use a join (as we learned in “data linkage”) where the variables personal.first_name from the members data and text from the predicted gender data must match. You can then replace NA values in personal.gender for example by using mutate() and ifelse() (there are also other ways).
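The strategy from the hint, sketched with dplyr on made-up data. The data frame predicted stands in for the genderize() output and only contains the two columns needed here (text and gender); all object names are illustrative.

```r
library(dplyr)

# Made-up rows standing in for the members data.
mdb_sub <- data.frame(
  personal.first_name = c("Anna", "Kai", "Bernd"),
  personal.gender     = c("female", NA, NA),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the genderize() output.
predicted <- data.frame(
  text   = "Kai",
  gender = "male",
  stringsAsFactors = FALSE
)

mdb_imputed <- mdb_sub %>%
  # add the predicted gender where the first name matches
  left_join(predicted, by = c("personal.first_name" = "text")) %>%
  # fill in the prediction only where the original value is missing
  mutate(personal.gender = ifelse(is.na(personal.gender),
                                  gender, personal.gender)) %>%
  # the helper column is no longer needed
  select(-gender)

# How many NAs are left?
sum(is.na(mdb_imputed$personal.gender))
```

A left join keeps every member, including those for whom no prediction exists. If the prediction data contained the same name twice, the join would duplicate members, which is one more reason to remove duplicate names beforehand.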

3.7. Make a final data frame of the Bundestag members by converting personal.gender and party to factors (makes further analysis easier).
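The conversion itself is short; here is a base R sketch on made-up values (mdb_imputed and mdb_final are illustrative names).

```r
# Made-up rows standing in for the imputed members data.
mdb_imputed <- data.frame(
  personal.gender = c("female", "male", "female"),
  party           = c("SPD", "CDU", "SPD"),
  stringsAsFactors = FALSE
)

mdb_final <- mdb_imputed
mdb_final$personal.gender <- as.factor(mdb_final$personal.gender)
mdb_final$party <- as.factor(mdb_final$party)

levels(mdb_final$party)
# Factors work directly in cross tables
table(mdb_final$party, mdb_final$personal.gender)
```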

4. Investigate the ratio of Twitter users per party (using tabular output and/or a plot). Finally, subset your data to only include Bundestag members that are Twitter users.
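One way to get the tabular output, sketched with dplyr on made-up data. It assumes that members without a Twitter account have NA in twitter_name; if the real data uses empty strings instead, the condition needs to be adapted.

```r
library(dplyr)

# Made-up rows standing in for the final members data.
mdb_final <- data.frame(
  party        = c("SPD", "SPD", "CDU", "CDU"),
  twitter_name = c("a", NA, "b", "c"),
  stringsAsFactors = FALSE
)

twitter_ratio <- mdb_final %>%
  group_by(party) %>%
  summarise(n_members = n(),
            n_twitter = sum(!is.na(twitter_name)),
            ratio     = n_twitter / n_members)
twitter_ratio

# Keep only the members that are Twitter users
mdb_tw <- filter(mdb_final, !is.na(twitter_name))
```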

Optional: 5. Fetch user data for each Twitter user in the Bundestag members data set from the Twitter API using `rtweet`.

From now on, work only with the subset that you created previously (Bundestag members with Twitter account), because we want to fetch some metadata from Twitter. If you want to do that task, you will need to install the package rtweet and load it. You also need a Twitter account (you don’t have to tweet anything, though!) and need to set up an “App” at developer.twitter.com. You can then find the four authentication keys that you need for accessing the Twitter API in the “Keys and tokens” tab of the app that you created (as shown in the tutorial).

5.1. After that, use the keys to create an access token with create_token(). In order to check if it works, look up data for a single Twitter account, e.g. “WZB_Berlin” using the function lookup_users(). Investigate the result in terms of which data is returned. What could be interesting for us?

5.2. Now look up user data for all Twitter account names in the Bundestag members data set (note that this will take several seconds) and store the result.

5.3. Create a subset of the Twitter user data with only the variables screen_name, account_created_at, followers_count, statuses_count. If you wanted to join this data with the Bundestag members data, which variable would you use to match the observations?

5.4. Transform the data in the subset using mutate() so that 1) the screen_name is always lower case; 2) you have a new variable days_since_creation containing the “age” of the account in days calculated with round(as.numeric(as.POSIXct('2018-12-13') - account_created_at)); 3) another new variable tweets_per_day that is the average number of tweets sent per day by this account (calculate it with the help of days_since_creation). We can then use tweets_per_day as a measure of activity on Twitter.
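A sketch of the three transformations on made-up account data. As an addition to the formula given in the task, it pins the time zone to UTC and the difference units to days via difftime(), so that the result does not depend on the time zone of your R session.

```r
library(dplyr)

# Made-up rows standing in for the data fetched from Twitter.
tw_users <- data.frame(
  screen_name        = c("Anna_M", "kai_b"),
  account_created_at = as.POSIXct(c("2018-12-03", "2017-12-13"), tz = "UTC"),
  followers_count    = c(1200, 300),
  statuses_count     = c(500, 7300),
  stringsAsFactors = FALSE
)

ref_date <- as.POSIXct("2018-12-13", tz = "UTC")

tw_users <- tw_users %>%
  mutate(
    # 1) lower case screen names make matching easier later on
    screen_name = tolower(screen_name),
    # 2) account age in whole days
    days_since_creation = round(as.numeric(
      difftime(ref_date, account_created_at, units = "days"))),
    # 3) average number of tweets per day
    tweets_per_day = statuses_count / days_since_creation
  )
tw_users
```

Note that an account created on the reference date would get days_since_creation = 0 and therefore an infinite tweets_per_day, so it is worth checking for such cases.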

5.5. Join the data of the Bundestag members with Twitter account and the data fetched from the Twitter API by matching twitter_name with screen_name. Check where the matching failed (i.e. because no data could be fetched for that account from Twitter).
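The join and the check for failed matches could be sketched like this (made-up data, illustrative object names). Lower-casing twitter_name before the join is an assumption on my side: it mirrors the lower case screen_name from the previous step, since the join compares the strings exactly.

```r
library(dplyr)

# Made-up rows: Bundestag members with a Twitter account ...
mdb_tw <- data.frame(
  personal.last_name = c("Muster", "Beispiel"),
  twitter_name       = c("Anna_M", "gone_account"),
  stringsAsFactors = FALSE
)

# ... and the data that could be fetched from Twitter.
tw_users <- data.frame(
  screen_name     = "anna_m",
  followers_count = 1200,
  stringsAsFactors = FALSE
)

mdb_joined <- mdb_tw %>%
  mutate(twitter_name = tolower(twitter_name)) %>%
  left_join(tw_users, by = c("twitter_name" = "screen_name"))

# Members for whom no Twitter data could be matched
failed <- filter(mdb_joined, is.na(followers_count))
failed
```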

Optional: 6. Now you can try to answer all sorts of questions you are interested in, for example:

Who has the most followers (e.g. make a top 10 list)?

Who sends the most tweets per day (again, a top 10 list)?

What does the distribution of follower counts look like?

What’s the mean and the median follower count per party?

How do follower counts differ between men and women in each party?

What does the distribution of tweets per day look like and how does this measure vary between the different parties?
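As a starting point for these questions, here is a sketch of a top list and a per-party summary with dplyr. The data frame mdb_joined and its values are made up; with the real data you would use the joined data set from task 5.5.

```r
library(dplyr)

# Made-up rows standing in for the joined data.
mdb_joined <- data.frame(
  personal.last_name = c("A", "B", "C", "D"),
  party              = c("SPD", "SPD", "CDU", "CDU"),
  followers_count    = c(100, 300, 50, 1000),
  stringsAsFactors = FALSE
)

# Top 10 by follower count (fewer rows if the data is smaller)
top_followers <- mdb_joined %>%
  arrange(desc(followers_count)) %>%
  head(10)
top_followers

# Mean and median follower count per party
followers_by_party <- mdb_joined %>%
  group_by(party) %>%
  summarise(mean_followers   = mean(followers_count, na.rm = TRUE),
            median_followers = median(followers_count, na.rm = TRUE))
followers_by_party
```

The same pattern works for tweets_per_day; adding personal.gender to group_by() gives the comparison between men and women within each party.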

Tasks for 11 – Collecting data from the web

Markus Konrad

January 24, 2019
