Remember that you can load CSV files with read.csv(). Make sure not to read in strings as factors and to read the postal code (variable personal.location.postal_code) as a character string, not as a numeric variable. Hint: Have a look at the help for read.csv() and look for the arguments stringsAsFactors and colClasses.
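As a minimal sketch of the hint, the following writes a tiny made-up CSV file to a temporary location and reads it back. The file contents and the object name members are assumptions for illustration; only the column name personal.location.postal_code comes from the exercise.

```r
# Toy CSV standing in for the real data file (contents are made up)
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "personal.first_name,personal.location.postal_code",
  "Paula,01067",
  "Paul,10785"
), csv_file)

# stringsAsFactors = FALSE keeps text columns as character;
# the named colClasses entry forces the postal code to be read as character
members <- read.csv(
  csv_file,
  stringsAsFactors = FALSE,
  colClasses = c(personal.location.postal_code = "character")
)

str(members)
```

Reading the postal code as character keeps leading zeros such as in 01067, which a numeric column would drop.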
This is a data subset on members of the 19th (2017 to 2021 max.) German Bundestag (ger. “Mitglieder des Bundestags – MdB”) that was fetched from abgeordnetenwatch.de.
Familiarize yourself with the data. There is no codebook to the data, which is bad practice, but for the sake of this exercise it should be enough to assess the variables from the column names and their values. Also check the number of observations.
personal.first_name, personal.last_name, personal.gender, personal.birthyear, party, twitter_name
After that, have a look at where we have missing data in the data set that might later cause trouble.
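One possible way to do both steps, sketched on a small made-up data frame. The values and the object name members are assumptions; the column names follow the exercise.

```r
# Made-up stand-in for the Bundestag members data
members <- data.frame(
  personal.first_name = c("Paula", "Paul", "Kim"),
  personal.gender = c("female", NA, NA),
  twitter_name = c("paula_mdb", NA, "kim_mdb"),
  stringsAsFactors = FALSE
)

str(members)    # column types and first values
nrow(members)   # number of observations

# Number of missing values per column
na_counts <- colSums(is.na(members))
na_counts
```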
There are missing values in the personal.gender variable (I introduced them for the sake of this exercise). The gender of about a quarter of MPs is missing. Let’s try to impute the gender.
3.1. First, think about how you could impute that variable and what the pros and cons of each method are.
3.2. Then, inform yourself about the genderizeR package and install it.
Please note that the package uses the genderize.io API to predict the gender of a person using a first name. There is a rate limit of 1000 requests (i.e. 1000 names) per day.
The package is a bit awkward to work with, I have to say. An alternative would be to use the web API directly from R without the package, which would also allow you to use additional features such as specifying the country from which the name comes (that increases accuracy). Anyway, that would take us too far, so let’s use the package.
To predict gender from a set of names with this package, you need to use two functions: 1. findGivenNames() and 2. genderize().
findGivenNames() extracts a first name from a text (in our case, that’s not really necessary because we already have the first name in a separate data frame column), does some text preparation (like lower case transformation), queries the web service of genderize.io and returns the result in a data frame. Note that it also includes an estimate of the correct prediction probability:
library(genderizeR)
testnames <- c('Paula', 'Paul')
test_genderinfo <- findGivenNames(testnames, progress = FALSE)
test_genderinfo
## name gender probability count
## 1: paula female 0.99 2298
## 2: paul male 1 5931
Now we need to pass this result to genderize(). This is basically only necessary to combine the original names in testnames with the test_genderinfo output.
genderize(testnames, test_genderinfo, progress = FALSE)
## text givenName gender genderIndicators
## 1: Paula paula female 1
## 2: Paul paul male 1
We can later use the original name in the text column for joining.
3.3. Try out the genderizeR package as shown above with a few names.
3.4. Next, select all first names from the data set of Bundestag members for which we do not have gender information. Make sure to remove all duplicate names. Then, use the findGivenNames() function to let the web service predict the gender from the supplied names. Inspect the result data frame.
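The selection step in 3.4 could look roughly like this. The data frame is made up and the names members, names_to_predict and predicted_info are assumptions; the web service call is left commented out because it needs internet access and counts against the daily rate limit.

```r
# Made-up stand-in for the Bundestag members data
members <- data.frame(
  personal.first_name = c("Paula", "Paul", "Paul", "Kim"),
  personal.gender = c("female", NA, NA, NA),
  stringsAsFactors = FALSE
)

# First names of members whose gender is missing, without duplicates
names_to_predict <- unique(
  members$personal.first_name[is.na(members$personal.gender)]
)
names_to_predict

# The actual query (not run here):
# library(genderizeR)
# predicted_info <- findGivenNames(names_to_predict, progress = FALSE)
```

Removing duplicates first keeps the number of requests low, which matters given the rate limit.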
3.5. Have a look at the distribution of the probability variable in the result. Select the results with a probability below 0.9 and investigate them.
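A sketch of 3.5 on a made-up stand-in for the findGivenNames() result. The values are invented, and the as.numeric() conversion is a precaution: the printed output in the example above suggests the probability column may be stored as text, but that is an assumption you should check with str().

```r
# Made-up stand-in for the findGivenNames() result
predicted_info <- data.frame(
  name = c("paula", "paul", "kim", "andrea"),
  gender = c("female", "male", "female", "female"),
  probability = c("0.99", "1", "0.78", "0.85"),
  stringsAsFactors = FALSE
)

# Convert to numeric in case the column is stored as text
predicted_info$probability <- as.numeric(predicted_info$probability)

# Distribution of the prediction probability
# (hist(predicted_info$probability) would show the same visually)
summary(predicted_info$probability)

# Predictions that deserve a closer look
uncertain <- predicted_info[predicted_info$probability < 0.9, ]
uncertain
```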
3.6. Then, use genderize() to get the complete data frame which also contains the original first names in the text variable. You should then have a data frame with the original first name and a predicted gender. Use this data frame now for imputation of the data set of Bundestag members. Check if there are still NAs in the personal.gender variable. If yes, you can manually set a gender if you’re confident or just ignore it.
Hint: A strategy here is to use a join (as we learned in “data linkage”) where the variables personal.first_name from the members data and text from the predicted gender data must match. You can then replace NA values in personal.gender, for example by using mutate() and ifelse() (there are also other ways).
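The strategy from the hint, sketched with dplyr on made-up data. The object names members, predicted and members_imputed are assumptions; predicted stands in for the genderize() result and only mimics its text and gender columns.

```r
library(dplyr)

# Made-up stand-in for the Bundestag members data
members <- data.frame(
  personal.first_name = c("Paula", "Paul", "Kim"),
  personal.gender = c("female", NA, NA),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the genderize() result
predicted <- data.frame(
  text = c("Paul"),
  gender = c("male"),
  stringsAsFactors = FALSE
)

members_imputed <- members %>%
  # keep all members; add the predicted gender where the first name matches
  left_join(predicted, by = c("personal.first_name" = "text")) %>%
  # fill in the prediction only where the original value is missing
  mutate(personal.gender = ifelse(is.na(personal.gender), gender, personal.gender)) %>%
  select(-gender)

members_imputed
sum(is.na(members_imputed$personal.gender))   # remaining NAs
```

A left join keeps members without a prediction in the data, so the remaining NAs stay visible and can be handled manually.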
3.7. Make a final data frame of the Bundestag members by converting personal.gender and party to factors (makes further analysis easier).
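A short sketch of the conversion in base R. The data and the party labels "A" and "B" are made up, and members_final is an assumed name.

```r
# Made-up stand-in for the imputed members data
members <- data.frame(
  personal.gender = c("female", "male", "female"),
  party = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

members_final <- members
members_final$personal.gender <- as.factor(members_final$personal.gender)
members_final$party <- as.factor(members_final$party)

levels(members_final$party)
# Factors make cross-tabulations like this straightforward
table(members_final$party, members_final$personal.gender)
```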
rtweet.
From now on, work only with the subset that you created previously (Bundestag members with a Twitter account), because we want to fetch some metadata from Twitter. If you want to do this task, you will need to install and load the package rtweet. You also need a Twitter account (you don’t have to tweet anything, though!) and need to set up an “App” at developer.twitter.com. You can then find the four authentication keys that you need for accessing the Twitter API in the “Keys and tokens” tab of the app that you created (as shown in the tutorial).
5.1. After that, use the keys to create an access token with create_token(). In order to check if it works, look up data for a single Twitter account, e.g. “WZB_Berlin”, using the function lookup_users(). Investigate the result in terms of which data is returned. What could be interesting for us?
5.2. Now look up user data for all Twitter account names in the Bundestag members data set (note that this will take several seconds) and store the result.
5.3. Create a subset of the Twitter user data with only the variables screen_name, account_created_at, followers_count, statuses_count. If you wanted to join this data with the Bundestag members data, which variable would you use to match the observations?
5.4. Transform the data in the subset using mutate() so that 1) the screen_name is always lower case; 2) you have a new variable days_since_creation containing the “age” of the account in days calculated with round(as.numeric(as.POSIXct('2018-12-13') - account_created_at)); 3) you have another new variable tweets_per_day that is the average number of tweets sent per day by this account (calculate it with the help of days_since_creation). We can then use tweets_per_day as a measure of activity on Twitter.
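A sketch of the three transformations on made-up data. Two details differ from the expression given in the task and are assumptions of this sketch: difftime() with units = "days" replaces the plain subtraction so that the unit is explicit, and both timestamps use tz = "UTC". The names tw_data, tw_subset and reference_date are also assumptions.

```r
library(dplyr)

# Made-up stand-in for the subset of Twitter user data
tw_data <- data.frame(
  screen_name = c("Example_One", "example_two"),
  account_created_at = as.POSIXct(c("2018-12-03", "2017-12-13"), tz = "UTC"),
  followers_count = c(1500, 300),
  statuses_count = c(100, 730),
  stringsAsFactors = FALSE
)

reference_date <- as.POSIXct("2018-12-13", tz = "UTC")

tw_subset <- tw_data %>%
  mutate(
    # 1) lower-case screen names
    screen_name = tolower(screen_name),
    # 2) account age in whole days relative to the reference date
    days_since_creation = round(as.numeric(
      difftime(reference_date, account_created_at, units = "days")
    )),
    # 3) average number of tweets per day
    tweets_per_day = statuses_count / days_since_creation
  )

tw_subset
```

Within one mutate() call, later expressions can use variables created earlier, which is why tweets_per_day can refer to days_since_creation directly.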
5.5. Join the data of the Bundestag members with a Twitter account and the data fetched from the Twitter API by matching twitter_name with screen_name. Check where the matching failed (e.g. because no data could be fetched for that account from Twitter).
Who has the most followers (e.g. make a top 10 list)?
Who sends the most tweets per day (again, a top 10 list)?
What does the distribution of follower counts look like?
What’s the mean and the median follower count per party?
How do follower counts differ between men and women in each party?
What does the distribution of tweets per day look like, and how does this measure vary between the different parties?
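A starting point for the first questions in base R, on made-up data. The party labels, names and numbers are invented, and the object names are assumptions.

```r
# Made-up stand-in for the joined members and Twitter data
members_tw <- data.frame(
  personal.last_name = c("Alpha", "Beta", "Gamma", "Delta"),
  party = c("A", "A", "B", "B"),
  followers_count = c(1500, 300, 50, 950),
  tweets_per_day = c(10, 2, 0.5, 4),
  stringsAsFactors = FALSE
)

# Top list by follower count (head() caps it at 10 rows)
top_followers <- head(
  members_tw[order(members_tw$followers_count, decreasing = TRUE), ],
  10
)
top_followers

# Mean and median follower count per party
mean_by_party <- tapply(members_tw$followers_count, members_tw$party, mean)
median_by_party <- tapply(members_tw$followers_count, members_tw$party, median)
mean_by_party
median_by_party
```

Replacing followers_count with tweets_per_day gives the corresponding answers for the activity measure.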